Robert Edgar. Independent scientist

Similar documents
Assigning Taxonomy to Marker Genes. Susan Huse Brown University August 7, 2014

Taxonomy and Clustering of SSU rrna Tags. Susan Huse Josephine Bay Paul Center August 5, 2013

Taxonomical Classification using:

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Detailed overview of the primer-free full-length SSU rrna library preparation.

Accuracy of taxonomy prediction for 16S rrna and fungal ITS sequences

BLAST. Varieties of BLAST

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Microbiome: 16S rrna Sequencing 3/30/2018

MiGA: The Microbial Genome Atlas

Microbial Diversity and Assessment (II) Spring, 2007 Guangyi Wang, Ph.D. POST103B

Chapter 17. Table of Contents. Objectives. Taxonomy. Classifying Organisms. Section 1 Biodiversity. Section 2 Systematics

Microbes usually have few distinguishing properties that relate them, so a hierarchical taxonomy mainly has not been possible.

Microbial Taxonomy. Slowly evolving molecules (e.g., rrna) used for large-scale structure; "fast- clock" molecules for fine-structure.

Microbial Taxonomy and the Evolution of Diversity

Introduction to Microbiology. CLS 212: Medical Microbiology Miss Zeina Alkudmani

Comparison of Three Fugal ITS Reference Sets. Qiong Wang and Jim R. Cole

Biology 2.1 Taxonomy: Domain, Kingdom, Phylum. ICan2Ed.com

Microbial Taxonomy. Microbes usually have few distinguishing properties that relate them, so a hierarchical taxonomy mainly has not been possible.

Name: Class: Date: ID: A

Chapter 17. Organizing Life's Diversity

Chapter 19. Microbial Taxonomy

PHYLOGENY & THE TREE OF LIFE

Interpreting the Molecular Tree of Life: What Happened in Early Evolution? Norm Pace MCD Biology University of Colorado-Boulder

Carolus Linnaeus System for Classifying Organisms. Unit 3 Lesson 2

Bergey s Manual Classification Scheme. Vertical inheritance and evolutionary mechanisms

Amplicon Sequencing. Dr. Orla O Sullivan SIRG Research Fellow Teagasc

Biologists use a system of classification to organize information about the diversity of living things.

SUPPLEMENTARY INFORMATION

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Taxonomy. Content. How to determine & classify a species. Phylogeny and evolution

Phylogeny 9/8/2014. Evolutionary Relationships. Data Supporting Phylogeny. Chapter 26

Zoology. Classification

Centrifuge: rapid and sensitive classification of metagenomic sequences

8/23/2014. Phylogeny and the Tree of Life

Test: Classification of Living Things

SUPPLEMENTARY INFORMATION

Multiple Choice Write the letter on the line provided that best answers the question or completes the statement.

Vocabulary: Fill in the definition for each word. Use your book and/or class notes. You can put the words in your own words. Animalia: Archaea:

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster.

A Novel Ribosomal-based Method for Studying the Microbial Ecology of Environmental Engineering Systems

The Classification of Plants and Other Organisms. Chapter 18

Section 18-1 Finding Order in Diversity

Biology Study Guide. VOCABULARY WORDS TO KNOW (+5 for making flashcards)

Classification of Living Things Test Review

Impact of training sets on classification of high-throughput bacterial 16s rrna gene surveys

SECTION 17-1 REVIEW BIODIVERSITY. VOCABULARY REVIEW Distinguish between the terms in each of the following pairs of terms.

PHYLOGENY AND SYSTEMATICS

Chapter 18: Classification

9.3 Classification. Lesson Objectives. Vocabulary. Introduction. Linnaean Classification

Chapter 26 Phylogeny and the Tree of Life

Organizing Life on Earth

Microbiology / Active Lecture Questions Chapter 10 Classification of Microorganisms 1 Chapter 10 Classification of Microorganisms

A Bayesian taxonomic classification method for 16S rrna gene sequences with improved species-level accuracy

Concept Modern Taxonomy reflects evolutionary history.

The Effect of Primer Choice and Short Read Sequences on the Outcome of 16S rrna Gene Based Diversity Studies

Chapter 17A. Table of Contents. Section 1 Categories of Biological Classification. Section 2 How Biologists Classify Organisms

Week 10: Homology Modelling (II) - HHpred

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

Taxonomy. The science of naming organisms.

Automating the Quest for Novel Prokaryotic Diversity (Revisited)

DO NOW: Four Square Do Now

Model Accuracy Measures

rrdp: Interface to the RDP Classifier

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

Unit 5: Taxonomy. KEY CONCEPT Organisms can be classified based on physical similarities.

Plant Names and Classification

A Bayesian Method for Guessing the Extreme Values in a Data Set

Classification. copyright cmassengale

Living Things. Chapter 2

Sorting It All Out CLASSIFICATION OF ORGANISMS

Outline. Classification of Living Things

An Automated Phylogenetic Tree-Based Small Subunit rrna Taxonomy and Alignment Pipeline (STAP)

Assessing and Improving Methods Used in Operational Taxonomic Unit-Based Approaches for 16S rrna Gene Sequence Analysis

BIOINFORMATICS: An Introduction

Diversity, Productivity and Stability of an Industrial Microbial Ecosystem

Title ghost-tree: creating hybrid-gene phylogenetic trees for diversity analyses

Microbial Taxonomy. Classification of living organisms into groups. A group or level of classification

Historical Biogeography. Historical Biogeography. Systematics

Learning Outcome B1 13/10/2012. Student Achievement Indicators. Taxonomy: Scientific Classification. Student Achievement Indicators

Biology 211 (2) Week 1 KEY!

CLASSIFICATION UNIT GUIDE DUE WEDNESDAY 3/1

Classifying and Exploring Life

Bacillus anthracis. Last Lecture: 1. Introduction 2. History 3. Koch s Postulates. 1. Prokaryote vs. Eukaryote 2. Classifying prokaryotes

Classification Practice Test

9/19/2012. Chapter 17 Organizing Life s Diversity. Early Systems of Classification

Biology 160 Cell Lab. Name Lab Section: 1:00pm 3:00 pm. Student Learning Outcomes:

UNIT 4 TAXONOMY AND CLASSIFICATION

Honor pledge: I have neither given nor received unauthorized aid on this test. Name :

CS612 - Algorithms in Bioinformatics

Pipelining RDP Data to the Taxomatic Background Accomplishments vs objectives

Phylogeny and the Tree of Life

What are living things, and how can they be classified?

What is classification?

Microbial Taxonomy. C. Microbes usually have few distinguishing properties that relate them, so a hierarchical taxonomy mainly has not been possible.

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

DO NOW (On notecard):

Introduction to microbiota data analysis

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Ch.2 Test. Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.

Other resources. Greengenes (bacterial) Silva (bacteria, archaeal and eukarya)

Transcription:

Robert Edgar Independent scientist robert@drive5.com www.drive5.com

"Bacterial taxonomy is a hornets nest that no one, really, wants to get into." Referee #1, UTAX paper

Assume prokaryotic species meaningful Starting point for automated classification Database of sequences + taxonomy annotations Bacteria & Archaea ~10k sequenced isolated strains

Classified prokaryotes ~12k named species ~2,300 genera Tiny fraction of total RDP Classifier training set v14 (RDP14) 10k full-length 16S sequences classified to genus but not species ~2k genera Best approximation I know of for authoritative db Named isolate set with species names, no longer supported? No 16S database documents gold standard subset AFAIK

SSU sequences + taxonomy annotations Greengenes 1.3M 16S sequences Obsolete? Last updated May 2013, secondgenome.com SILVA 1.8M 16S sequences ~100k genera 98% not named Small fraction of extant species / strains (billions?)

Full-length sequences can identify species If ~100% identical to known sequence 97% "rule" not reliable Paralogs in one species can be as low as 89% Different species can be >97% Short tags (V4) cannot resolve species Different species may have identical V4 sequences Genus resolution good, but not perfect 10% of genera in RDP14 have same V4 as another genus Even if only one 100% id hit, could be novel species

Different nomenclatures RDP: GG: SILVA: Based on Bergey s Based on NCBI Based on LSPN Conflicts between sequence & taxonomy Example: Escherichia and Shigella Sequence shows that these genera not monophyletic GG: leaves genus & species blank SILVA: new genus Escherichia-Shigella RDP: new genus Escherichia/Shigella

Large majority are environmental Known only from sequence Taxonomy annotations are predictions! Manual + automated methods Error rates?

Named isolates huge multiple alignment huge tree

Perfect alignment impossible Very hard to align across many phyla May not be possible / meaningful in hypervariable regions Especially GG NAST designed to introduce mis-alignents! Perfect tree prediction impossible Must be errors Plausibly could be many

"Mathematics is the art of giving the same name to different things" Henri Poincaré "Taxonomy should not give the same name to different things" Robert Edgar

Common name Identical name found in all systems (GG, SILVA & RDP) Most names are common Pair of databases Choose a rank, e.g. genus Identical sequences with common names If disagree, one annotation is wrong

GG-QIIME vs. SILVA-mothur Combined error rate: 24% genus 9% family 2% phylum Disagreement implies error in one or both dbs. Probably just one

GG ~3 4x more disagreements with RDP Implies GG error rate at least ~3x > SILVA SILVA ~6% genus errors GG ~18% genus errors!

Most genera are unnamed (RDP14 has named genera only) V4 region unless otherwise stated

Better coverage than you might expect for 20x bigger db but still many genera not in db ~6% of named genera wrong

Many genera are novel Better coverage than RDP or SILVA-mothur but 18% named genera wrong!

Reference data is sparse Top database hit has 90% id Does it have same genus, family...? What is Lowest Common Rank (LCR) Easy to find top hit(s) All algorithms find the top hit(s), more or less Hard to predict LCR This is the real challenge for taxonomy prediction

Half of genera have only one sequence Impossible to find genus-specific features Top hit 95% identity Same genus? Hard / impossible to predict Must choose between FPs and FNs Algorithm should indicate confidence ~95% is genus "twilight zone" Similar to 20% a.a. identity for protein homology

Avg. 4 seqs / genus Most 98% id Max accuracy < 100% due to singletons Leave-10%-out (Bokulich et al. PeerJ) Most 99% id

Test with e.g. LCR = family Models OTUs with ~94% to 87% id with top hit

Split trusted db (RDP14) into X rank and Y rank Example: LCR=family, make X family and Y family For each family, genus in X or Y (never both) genus is NEVER known family is ALWAYS known

Method for making query - database pairs with known LCRs from trusted ref db.

On each rank split, e.g. family Measure sensitivity to family & above Fraction families correctly predicted (all are known) Mis-classifications (FPs): Known but wrong, e.g. predict wrong family Over-classifications (FPs): Novel but classified, e.g. predicted a genus name

Genus=20% Family=45% Order=20% Cl=8% Ph=7% Estimated fraction of OTUs with LCR at each rank "Novelty profile" of the OTUs w.r.t. reference database

If 100% have LCR=genus Leave-one-out is a good test If 100% have LCR=family Then test with query= X family and db.=y family Realistic test Mixture of all LCRs, weighted by LCR frequencies