Taxonomy and Clustering of SSU rRNA Tags. Susan Huse, Josephine Bay Paul Center, August 5, 2013



Primary Methods of Taxonomic Assignment
Bayesian k-mer matching: RDP, http://rdp.cme.msu.edu. Wang et al. (2007) Appl Environ Microbiol 73(16):5261-7.
Sequence matching: GAST, http://vamps.mbl.edu/resources/software.php. Huse et al. (2008) PLoS Genetics 4: e1000255.

k-mer indexing
TGGTCTTGACATCCACAGAACTTTCCAGAGA
TGGATTGGTGCCTTCGGGAACTGTGAGAC
Ex. 8-mer indexing: TGGTCTTG, GGTCTTGA, GTCTTGAC, TCTTGACA, CTTGACAT, TTGACATC, TGACATCC, GACATCCA, ACATCCAC, CATCCACA
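The 8-mer list on this slide can be reproduced with a sliding window. This is a minimal sketch; the function name `kmers` is hypothetical and is not taken from the RDP code.

```python
def kmers(seq, k=8):
    """Return every overlapping k-mer of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# First sequence shown on the slide.
read = "TGGTCTTGACATCCACAGAACTTTCCAGAGA"
words = kmers(read, 8)
for word in words[:10]:
    print(word)
```

A sequence of length L yields L - k + 1 overlapping words, so adjacent words share k - 1 bases.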

RDP Web Results
Reports taxonomy at each level and a bootstrap value at each level (minimum 80% by default).
Levels: Domain, Phylum, Class, Order, Family, Genus
Example: Root [100%]; Bacteria [100%]; Proteobacteria [100%]; Alphaproteobacteria [100%]; Rhodospirillales [100%]; Rhodospirillaceae [100%]; Dongia [99%]

Bootstrap Confidence Estimation
The number of times a genus was selected out of 100 bootstrap trials was used as an estimate of confidence in the assignment to that genus.
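The counting described above can be sketched as follows. This is an illustration only, not the RDP implementation: `bootstrap_confidence` and the `classify` callable are hypothetical names, and drawing one eighth of the words per trial is an assumption made for the sketch.

```python
import random

def bootstrap_confidence(words, classify, trials=100, rng=None):
    """Assign a genus from all words, then count how often resampled
    subsets of the words give the same genus."""
    rng = rng or random.Random(0)
    assigned = classify(words)
    # Assumption: each trial draws len(words) // 8 words with replacement.
    subset_size = max(1, len(words) // 8)
    agree = 0
    for _ in range(trials):
        sample = [rng.choice(words) for _ in range(subset_size)]
        if classify(sample) == assigned:
            agree += 1
    return assigned, agree / trials

read = "TGGTCTTGACATCCACAGAACTTTCCAGAGA"
words = [read[i:i + 8] for i in range(len(read) - 7)]
# Toy stand-in classifier (hypothetical): always answers "Dongia".
genus, confidence = bootstrap_confidence(words, lambda ws: "Dongia")
print(genus, confidence)
```

With a real classifier, a genus supported by only some of the words would be selected in fewer trials and so receive a lower confidence.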

Bootstrap Cutoff Values
Region   Measure                          0%       50%      80%
V3       % classified to genus            100%     92.4%    82.3%
V3       % classified to correct genus    92.0%    95.0%    98.1%
V4       % classified to genus            100%     97%      87.9%
V4       % classified to correct genus    92.8%    94.5%    95.7%
V6       % classified to genus            100%     73.5%    40.4%
V6       % classified to correct genus    79.0%    96.5%    98.7%
Based on 7,028 human gut sequences, RDP-classified full-length, then cut and reclassified.

Bootstrap Cutoff Values
Region   Measure                          0%       50%      80%
V3       % classified to genus            100%     92.4%    82.3%
V3       % classified to correct genus    92.0%    95.0%    98.1%
Assume 100 sequences.
At 80%: 82 are classified to genus; 80 seqs are identified correctly (82 x 0.98 = 80); 2 seqs are identified incorrectly.
At 50%: 92 are classified to genus (10 more); 87 seqs are correct (92 x 0.95 = 87) (7 more); 5 seqs are incorrect (3 more).
30% of the incremental genus assignments are incorrect.
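The same arithmetic can be run on the unrounded V3 values from the table. A minimal sketch; the function name `incremental_error` is hypothetical.

```python
def incremental_error(strict, relaxed):
    """Fraction of the extra assignments gained by relaxing the cutoff
    that are wrong. Each argument is a pair:
    (fraction classified to genus, fraction of those that are correct)."""
    strict_classified, strict_correct = strict
    relaxed_classified, relaxed_correct = relaxed
    extra_assigned = relaxed_classified - strict_classified
    extra_wrong = (relaxed_classified * (1 - relaxed_correct)
                   - strict_classified * (1 - strict_correct))
    return extra_wrong / extra_assigned

# V3: 80% cutoff versus 50% cutoff.
v3 = incremental_error(strict=(0.823, 0.981), relaxed=(0.924, 0.950))
print(f"{v3:.0%}")  # 30%
```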

GAST: Global Assignment of Sequence Taxonomy
Sequence matching to the nearest neighbor(s) in a reference database.
Uses global alignment (entire length of the amplicon) rather than local alignment (fragments).
Distance is defined by sequence alignment.

GAST
R1: TGGTCTTGACATCCACAGAT
Q:  TGGTCTTGACATCCACAGAT
    TGGTCTTGACATCGACAGAT
    TGGTCTTGACATCTACAGAT
R2: TGGTCTTGGCATCTACAGAT
RefID: R1,R2 Distance: 0.05
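A result of "RefID: R1,R2 Distance: 0.05" arises when a query is one mismatch (1 of 20 nt) away from both references. The sketch below assumes the query is TGGTCTTGACATCTACAGAT, which is only one reading of the slide. It also substitutes a per-position mismatch count for a true global alignment, which is valid only for equal-length sequences without indels. The function names are hypothetical.

```python
def mismatch_distance(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch handles equal-length sequences only")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def nearest_refs(query, refs):
    """Return the IDs of all references tied for the smallest distance."""
    dists = {ref_id: mismatch_distance(query, seq) for ref_id, seq in refs.items()}
    best = min(dists.values())
    return sorted(ref_id for ref_id, d in dists.items() if d == best), best

refs = {"R1": "TGGTCTTGACATCCACAGAT", "R2": "TGGTCTTGGCATCTACAGAT"}
query = "TGGTCTTGACATCTACAGAT"  # assumed query, see the note above
print(nearest_refs(query, refs))  # (['R1', 'R2'], 0.05)
```

Keeping every tied reference, rather than an arbitrary first hit, is what allows a consensus taxonomy to be computed across the nearest neighbors.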

GAST Flowchart
High-quality 16S tags + RefDB (RefV6, RefV3V5, RefSSU) -> nearest RefSeq(s) and GAST distance -> consensus taxonomy (2/3 majority)
Using tags cut to match primers is much more efficient.

Ex. Consensus Calculation
Nearest references (count: lineage, share of the 10 hits):
1: Firmicutes; Clostridia; Clostridiales; Lachnospiraceae (10%)
4: Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium (40%)
5: Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium; perfringens (50%)
1. Lowest rank = species
2. Calculate species voting: 5 / 10 = 50% (less than 66%)
3. Next rank = genus
4. Calculate genus voting: (4 + 5) / 10 = 90%
5. Assign taxonomy to genus level
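The voting steps above can be sketched as a walk from the deepest rank upward. This is a simplified illustration rather than the GAST source. The name `consensus_taxonomy` is hypothetical, and treating the 2/3 threshold as "greater than or equal" is an assumption.

```python
from collections import Counter

def consensus_taxonomy(hits, threshold=2 / 3):
    """hits: list of (lineage tuple, number of references with that lineage).
    Returns the deepest lineage prefix whose vote share meets the threshold."""
    total = sum(count for _, count in hits)
    max_depth = max(len(lineage) for lineage, _ in hits)
    for depth in range(max_depth, 0, -1):
        votes = Counter()
        for lineage, count in hits:
            if len(lineage) >= depth:  # only lineages defined at this rank vote
                votes[lineage[:depth]] += count
        best, count = votes.most_common(1)[0]
        if count / total >= threshold:
            return best, count / total
    return (), 0.0

base = ("Firmicutes", "Clostridia", "Clostridiales")
hits = [
    (base + ("Lachnospiraceae",), 1),
    (base + ("Clostridiaceae", "Clostridium"), 4),
    (base + ("Clostridiaceae", "Clostridium", "perfringens"), 5),
]
lineage, share = consensus_taxonomy(hits)
print(";".join(lineage), share)
```

For the slide's example this stops at genus (Clostridium) with a 90% share, because the species vote reaches only 50%.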

GAST Distance
GAST does not report a bootstrap value. GAST reports the distance to the nearest sequence.
If distance = 0, then good accuracy.
If distance = 0.05 (5%), then likely the same family and maybe genus, but not likely species.

Genus Level Comparison (figure): human gut microbiota with 250 nt V3, 60 nt V6, and 1,000 nt full-length (FL) sequences.
V3: N = 299,044, Other = 99
V6: N = 322,971, Other = 82
Full-length: N = 5,519, Other = 26

Reference Database Considerations
1. Size of the database: does it contain reference sequences similar to your data?
2. Taxonomy of the database: are the references classified to genus or species?
3. Quality of the database: are there chimeras or low-quality sequences?

Example Reference Databases
1. SILVA database
2. RDP training set
3. Annotated Greengenes
4. Site-specific: HOMD and CORE for oral

Specificity in RefSSU (SILVA)
RefHVR sequences mapping to                                         V3-V5         V6-V4
1 taxon (unique mapping)                                            97.1%         97.5%
2 taxa (ambiguous mapping)                                          2.3%          2.0%
2 taxa by depth only, same lineage (unique lineage mapping)         1.9% (99%)    1.8% (99.3%)
If genus, unique genus                                              99.7%         99.8%
If species, unique species                                          93.1%         93.6%
Assigning taxonomy is only as good as your reference.

BLAST to nt
Don't do it!!!
The top BLAST hit can lead to unexpected results: hits to unclassified sequences, hits to sources other than microbial SSU rRNA, and partial hits (local alignment).

Methods Comparison
RDP: is considered a standard; is available on their website or can be used locally and through mothur and QIIME; works better on longer sequences, not as well on shorter sequences.
GAST: works well for shorter sequences (V6); can go to species, depending on the reference database.
BLAST to nt: often returns incorrect taxonomy; not good for pipelines.

Taxonomic Consistency
For meaningful community comparisons, taxonomic names must be consistent.
Names: one organism should have only one name.
Levels: systematic comparisons need consistent levels: Kingdom;Phylum;Class;Order;Family;Genus;Species

Taxonomic Names
Bergey's Taxonomic Outline: manual of taxonomic names for bacteria
List of Prokaryotic names with Standing in Nomenclature (vetting process)
NCBI: similar taxonomy, but multiple subs (subclass, suborder, subfamily, tribe)
Fungi: UNITE (a work in progress)
Archaea: also a work in progress

Sources of Error in Taxonomic Analyses
Primer bias
Chimeras
Non-16S tag amplified
Discovery of novel 16S
Unrepresented in reference database
Low-quality references
Taxonomy not available
Incorrect taxonomy
Ambiguous hypervariable sequence
Reference biased toward most studied

Why OTUs?
Because we can!
We can't: do whole-genome sequencing of communities; assign complete taxonomic names; calculate phylogenetic trees from millions of short tags.
Need to develop new methods that are more biologically or evolutionarily meaningful, such as oligotyping.
Come back in 10 years.

Different clustering algorithms have very different effects on the size and number of OTUs created

Primary Methods
complete linkage: no two sequences in a cluster are farther apart than the clustering width.
single linkage: each sequence is within the clustering width of at least one other sequence in the cluster.
average linkage: the average distance from a sequence to every other sequence is less than the width. Dependent on input order.
greedy clusters: test sequentially and incorporate each sequence into the first qualifying OTU. Dependent on input order.
reference: compare tags to a pre-established set of reference OTUs (open and closed).
oligotyping: uses sequence-based information directly.

Cluster Width
Diameter: sequences are never more than D apart. (CL)
Radius: sequences are never more than R from the seed. (AL, Ref, Greedy)
Link length: every sequence is within L of another sequence. (SL)

The Problem of Inflation
Clustering algorithms return more OTUs than predicted for mock communities.
OTU inflation leads to alpha diversity inflation and beta diversity inflation.
Where does this inflation come from? How can we adjust for this inflation?

Inflation in Action: 1,042 OTUs is a few more than the expected 2.

Sequencing Noise in E. coli
Distance   Tags      Seqs    Cumulative % of tags
0          177,697   2       82.4%
0.02       33,425    633     97.9%
0.03       3,738     2,268   99.6%
0.04       2         1       99.6%
0.05       635       306     99.9%
0.06       38        25      99.9%
0.07       89        72      100.0%
0.08       22        17      100.0%
0.09       0         0       100.0%
0.10       4         4       100.0%
> 0.10     23        16      100.0%
Cluster at 3% using only tags within 3% of the correct sequence.

Inflation with small sequence error: only 129 OTUs. Much better!!

Example MSA: 18,156 sequences and 392 positions. Regardless of clustering algorithm, an MSA cannot fully align tags whose sequences are too divergent.

MSA Limitations: Distance Comparison (figure). Independent distances versus subsampled distances. X-axis: alignment of 1,000 sequences subsampled from 40,000 sequences. Y-axis: MSA of 1,000 bacterial sequences.

Clustering sequences within 3% of known templates
Method   E. coli (2)   S. epidermidis (1)   Clone 43 (43)
MS-CL    129           89                   694
PW-CL    6             5                    308
Alignment method: MS = Multiple Sequence Alignment, PW = Pairwise Alignment
Clustering method: CL = Complete Linkage clustering, AL = Average Linkage clustering

Complete Linkage propagates OTUs (figure: nodes #1 to #13).
Each V6 variant with a 2 nt distance will start a new OTU.
Each V6 variant with a 1 nt distance can cluster with the seed or a second variant.

Average Linkage collapses errors. Cluster count: 1 (#1).
Clusters tend to be heavily dominated by their most abundant sequence, which strongly weights the average and smooths the noise.

Greedy Seeds (UClust)
Sort by abundance.
The first sequence (Seq 1) is the seed of Cluster 1.
Compare the next Seq to each cluster in order:
  if Distance(Seq, Cluster) <= W, include it in the cluster;
  if Seq is not a member of any cluster, create a new cluster.
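The greedy procedure above can be sketched in a few lines. This illustrates the idea and is not the UClust implementation: the names are hypothetical, the distance to a cluster is taken as the distance to its seed, and a mismatch fraction on equal-length sequences stands in for an alignment-based distance.

```python
def mismatch_fraction(a, b):
    """Fraction of differing positions (equal-length sequences assumed)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seq_counts, width, distance=mismatch_fraction):
    """seq_counts maps sequence -> abundance. Returns clusters, seed first."""
    clusters = []
    # Most abundant first; ties broken alphabetically for reproducibility.
    for seq in sorted(seq_counts, key=lambda s: (-seq_counts[s], s)):
        for cluster in clusters:
            if distance(seq, cluster[0]) <= width:  # compare to the seed
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no qualifying cluster: start a new one
    return clusters

counts = {"AAAAAAAAAA": 100, "AAAAAAAAAT": 5, "TTTTTTTTTT": 3}
clusters = greedy_cluster(counts, width=0.1)
print(clusters)
```

Because each sequence joins the first qualifying cluster, the result depends on the processing order, which is why sorting by abundance matters.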

Greedy Seeds (UClust) Cluster Count: 1

Reference OTUs
1. Create a full-length tree of reference sequences (Greengenes)
2. Cluster those sequences
3. Map each tag to the nearest reference
4. Bin each tag to that reference's OTU#

Oligotyping
1. Align the sequences
2. Determine the alignment positions with the most information (Shannon entropy)
3. Iterate the position selection
4. Use those positions to bin all sequences
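Steps 2 and 4 above can be sketched as per-column Shannon entropy followed by binning on the selected columns. This is a minimal illustration, not the oligotyping pipeline itself: the function names are hypothetical, the iteration in step 3 is omitted, and the input is assumed to be already aligned.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def most_informative_positions(aligned, top=2):
    """Indices of the `top` highest-entropy alignment columns."""
    entropies = [column_entropy(col) for col in zip(*aligned)]
    ranked = sorted(range(len(entropies)), key=lambda i: -entropies[i])
    return ranked[:top]

def bin_by_positions(aligned, positions):
    """Count sequences by the characters found at the chosen positions."""
    return Counter("".join(seq[p] for p in positions) for seq in aligned)

aligned = ["ACGTA", "ACGTA", "ATGTC", "ATGTC"]
positions = most_informative_positions(aligned, top=2)
bins = bin_by_positions(aligned, positions)
print(positions, dict(bins))
```

Columns where every sequence agrees have zero entropy and carry no information for separating sequences, so they are never selected ahead of variable columns.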

Clustering sequences within 3% of known templates
Method   E. coli (2)   S. epidermidis (1)   Clone 43 (43)
MS-CL    129           89                   694
MS-AL    54            44                   218
PW-CL    6             5                    308
PW-AL    2             1                    43
Alignment method: MS = Multiple Sequence Alignment, PW = Pairwise Alignment
Clustering method: CL = Complete Linkage clustering, AL = Average Linkage clustering

Accurate with small error: 2 OTUs created!!

OTU inflation with more errors: with errors, back up to 277 OTUs.

Still lose outlier sequencing errors: multiple sequencing errors are still not clustered.

Single Linkage Preclustering at 2%
1. Sort sequences by abundance
2. Precluster using 1 nt change (2% in V6), single-linkage
3. The most abundant sequence represents all reads in the precluster
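The three steps can be sketched as follows. This is a simplified illustration rather than the published SLP code: the names are hypothetical, and a per-position mismatch count on equal-length sequences stands in for a pairwise alignment distance.

```python
def mismatches(a, b):
    """Number of differing positions (equal-length sequences assumed)."""
    return sum(x != y for x, y in zip(a, b))

def single_linkage_precluster(seq_counts, max_diff=1):
    """seq_counts maps sequence -> read count. A sequence joins the first
    precluster containing ANY member within max_diff (single linkage)."""
    preclusters = []
    for seq in sorted(seq_counts, key=lambda s: (-seq_counts[s], s)):
        for pc in preclusters:
            if any(mismatches(seq, member) <= max_diff for member in pc["members"]):
                pc["members"].append(seq)
                pc["reads"] += seq_counts[seq]
                break
        else:
            # The first (most abundant) sequence becomes the representative.
            preclusters.append({"rep": seq, "members": [seq], "reads": seq_counts[seq]})
    return preclusters

counts = {"ACGTACGT": 100, "ACGTACGA": 3, "ACGTACCA": 1, "TTTTTTTT": 10}
result = single_linkage_precluster(counts)
for pc in result:
    print(pc["rep"], pc["reads"])
```

In this toy input, ACGTACCA is two changes from the abundant sequence but one change from ACGTACGA, so single linkage chains it into the same precluster.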

SLP
SLP smooths out sequencing errors in the data, as well as PCR errors and fine natural variation such as SNPs.
SLP is not designed to denoise sequences.
SLP should only be used for preclustering before clustering at >= 3%.

                                 E. coli    S. epidermidis   Clone 43
Total tags                       215,618    197,876          202,340
Unique sequences                 3,309      3,177            7,461
3% clustering
  MS-CL                          1,042      1,267            2,473
  PW-AL                          277        323              458
Single-linkage preclustering
  PW-AL                          88         128              275
Expected OTUs                    2          1                43

All E. coli with SLP: back down to 88 OTUs.

OTU Consistency Will I get the same OTUs every time?

Sample Size and Rarefaction (figure): M2FN PML rarefaction, MS-CL. Y-axis: PML OTUs (0 to 7,000). X-axis: number of sequences sampled (0 to 120,000). Legend: 5K, 10K, 15K, 20K, 50K, 100K.

Sample Size and Rarefaction (figure): PML, SLP-PW-AL.

Relative Inflation
The absolute number of errant OTUs will increase with sample size.
The relative number of errant OTUs will decrease with sample complexity.

The Magical 3%
NOT! 3% SSU OTUs = Species and 6% SSU OTUs = Genera

History of the 3% OTUs
1. Wayne et al. (1987) IJSEM 37(4): 463-464. Bacterial systematics: phylogeny is best for defining species, and we want to use the entire genome; therefore, 70% similarity as defined by DNA-DNA reassociation.
2. Stackebrandt and Goebel (1994) IJSEM 44(4): 846-849. 70% DNA-DNA ~ 97% similarity of full-length 16S.
3. 97% similarity of any hypervariable region anywhere in the Bacterial kingdom = species.

What is the right cluster width?
Distances vary by hypervariable region: full-length does not equal V1-V3, which does not equal V3-V5, which does not equal V6, etc.
Distances vary by bacterial lineage.
What specificity are you trying to gain?
No single clustering width is the right answer; use several (and wave your hands a lot).

Clustering Options
mothur
QIIME
UClust (USearch)
Oligotyping
ESPRIT-Tree
CROP (Bayesian)

Clustering Questions
How meaningful are clusters functionally?
When is a rare sequence truly rare and when is it an error?
Should it be included in an existing cluster or start its own?
How to place sequences if OTUs overlap?
What is the effect of residual low-quality data or chimeras?
How sensitive are alpha and beta diversity estimates to clustering results?

OTUs vs Taxonomy
Novel organisms
Many unnamed organisms
Some clades only defined to phylum or class
Many species names based on phenotype rather than genotype
Do not lump together all 16S unknowns or diverse, partially classified sequences.

From the Sample to the Population: Diversity and Distance Metrics

Sampling Depth and Alpha Diversity, Not Richness (figure). Y-axis: diversity (0 to 5). X-axis: sampling depth (0 to 45,000). Legend: SLP - NPShannon, SLP - Simpson, CL - NPShannon, Simpson. Robust to both singletons and depth.

Ex. Distance Metrics
Jaccard, Unweighted UniFrac: presence / absence
  $|A \cap B| \,/\, |A \cup B|$
Bray-Curtis, Weighted UniFrac: relative abundance
  $\sum_i \min(A_i, B_i) \,/\, \operatorname{avg}(N_A, N_B)$, where $A_i$ = abundance of OTU $i$ in A and $N$ = sample size
Morisita-Horn (ThetaYC): relative and weighted abundance
  $\sum_i RA_i \, RB_i \,/\, \operatorname{avg}\left(\sum_i RA_i^2, \sum_i RB_i^2\right)$, where $RA_i$ = relative abundance of OTU $i$ in A
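The three expressions on this slide can be computed directly from OTU count tables. A minimal sketch with hypothetical function names. As written on the slide, each expression is a similarity; taking the distance as 1 minus that value is an assumption of this sketch. UniFrac needs a phylogenetic tree and is not covered here.

```python
def jaccard_similarity(a, b):
    """Shared OTUs / all OTUs, using presence or absence only."""
    present_a = {otu for otu, n in a.items() if n > 0}
    present_b = {otu for otu, n in b.items() if n > 0}
    return len(present_a & present_b) / len(present_a | present_b)

def bray_curtis_similarity(a, b):
    """Sum of per-OTU minimum counts / average sample size."""
    shared = sum(min(a.get(otu, 0), b.get(otu, 0)) for otu in set(a) | set(b))
    return shared / ((sum(a.values()) + sum(b.values())) / 2)

def morisita_horn_similarity(a, b):
    """Uses relative abundances, so sample size cancels out."""
    n_a, n_b = sum(a.values()), sum(b.values())
    otus = set(a) | set(b)
    rel_a = {otu: a.get(otu, 0) / n_a for otu in otus}
    rel_b = {otu: b.get(otu, 0) / n_b for otu in otus}
    cross = sum(rel_a[otu] * rel_b[otu] for otu in otus)
    squares = (sum(v * v for v in rel_a.values()) + sum(v * v for v in rel_b.values())) / 2
    return cross / squares

# Same composition, sampled ten times deeper in B.
A = {"otu1": 50, "otu2": 30, "otu3": 20}
B = {"otu1": 500, "otu2": 300, "otu3": 200}
print(jaccard_similarity(A, B))
print(bray_curtis_similarity(A, B))
print(morisita_horn_similarity(A, B))
```

Two samples with identical composition but different depths score as identical under Jaccard and Morisita-Horn, while the count-based Bray-Curtis expression scores them as mostly different.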

Do subsamples of the same sample look alike?
For a sample of 50,000 reads:
1. Randomly select 5,000 reads 10 times
2. Calculate subsample beta diversity using several distance metrics
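The experiment can be sketched on a synthetic community. Everything here is illustrative: the 50,000-read sample is made up, the function names are hypothetical, and only one metric (a Bray-Curtis distance taken as 1 minus shared counts over average sample size) is shown.

```python
import random
from collections import Counter
from itertools import combinations

def bray_curtis_distance(a, b):
    """1 - (sum of per-OTU minimum counts / average sample size)."""
    shared = sum(min(a[otu], b[otu]) for otu in a.keys() & b.keys())
    return 1 - shared / ((sum(a.values()) + sum(b.values())) / 2)

def replicate_distances(reads, depth=5000, replicates=10,
                        metric=bray_curtis_distance, seed=0):
    """Draw `replicates` subsamples without replacement and return all
    pairwise distances between them."""
    rng = random.Random(seed)
    tables = [Counter(rng.sample(reads, depth)) for _ in range(replicates)]
    return [metric(x, y) for x, y in combinations(tables, 2)]

# Synthetic sample: three abundant OTUs plus 1,000 singletons (50,000 reads).
reads = (["otu_a"] * 30000 + ["otu_b"] * 15000 + ["otu_c"] * 4000
         + [f"rare_{i}" for i in range(1000)])
dists = replicate_distances(reads)
print(len(dists), min(dists), max(dists))
```

Ten replicates give 45 pairwise comparisons. Swapping in a presence/absence metric for `metric` shows how strongly the low-abundance members drive the result.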

Community Distance of Subsamples at 5,000 Reads (figure). Y-axis: community distance (0 to 0.7). X-axis: replicates. Legend: Morisita-Horn, Bray-Curtis, Jaccard. Presence/absence (Jaccard) is highly skewed by the inconsistency of the low-abundance members.

Community Distance of Subsamples (figure). Y-axis: community distance (0 to 0.12). X-axis: replicates. Legend: Bray-Curtis (1K), Bray-Curtis (5K), Morisita-Horn (1K), Morisita-Horn (5K). Subsamples of 1,000 and 5,000 reads from a sample of 50,000 reads; pairwise distances for replicates at a single depth.

Subsampling Across Depths
For a sample of 50,000 reads:
Randomly select subsamples at depths 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000.
Calculate subsample beta diversity using several distance metrics.

Effect of Sample Depth - Bray-Curtis (figure): pairwise distances between subsamples at depths 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000. Distance axis: 0 to 1. Nearly 100% different.
Bray-Curtis uses absolute counts, so intra-community distances are high as depths diverge.

Effect of Sample Depth - Morisita-Horn (figure): pairwise distances between subsamples at depths 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000. Distance axis: 0 to 0.009. Nearly 0.5% different.
Morisita-Horn uses relative abundances, so intra-community distances are low across depths above the minimum sampling depth.
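The contrast between the two metrics can be checked on an idealized community in which composition is identical and only depth changes. This is a sketch with hypothetical names and made-up proportions. Real subsamples also carry sampling noise, which is why the Morisita-Horn distances in the figure are small rather than exactly zero.

```python
def bray_curtis_distance(a, b):
    """1 - (sum of per-OTU minimum counts / average sample size)."""
    shared = sum(min(a.get(otu, 0), b.get(otu, 0)) for otu in set(a) | set(b))
    return 1 - shared / ((sum(a.values()) + sum(b.values())) / 2)

def morisita_horn_distance(a, b):
    """1 - Morisita-Horn similarity, computed on relative abundances."""
    n_a, n_b = sum(a.values()), sum(b.values())
    otus = set(a) | set(b)
    cross = sum((a.get(o, 0) / n_a) * (b.get(o, 0) / n_b) for o in otus)
    sq_a = sum((a.get(o, 0) / n_a) ** 2 for o in otus)
    sq_b = sum((b.get(o, 0) / n_b) ** 2 for o in otus)
    return 1 - cross / ((sq_a + sq_b) / 2)

def counts_at_depth(tenths, depth):
    """Scale a composition given in tenths to an exact read depth."""
    return {otu: share * depth // 10 for otu, share in tenths.items()}

composition = {"otu1": 5, "otu2": 3, "otu3": 2}  # 50%, 30%, 20%
shallow = counts_at_depth(composition, 1000)
deep = counts_at_depth(composition, 25000)
bc = bray_curtis_distance(shallow, deep)
mh = morisita_horn_distance(shallow, deep)
print(f"Bray-Curtis distance: {bc:.3f}")  # 0.923
print(f"Morisita-Horn distance is below 1e-9: {abs(mh) < 1e-9}")
```

At depths of 1,000 and 25,000 the shared counts can be at most 1,000 against an average depth of 13,000, so the count-based Bray-Curtis distance is about 0.92 even though the community is the same.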

PCoA plots group communities based on pairwise distances. Communities that have low pairwise distances are close together on the plot. Communities with larger pairwise distances are farther apart on the plot.

Clustering with Morisita-Horn (PCoA figure). Points at depths 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000. The minimum sample depth here is 10,000, but it will be a function of the diversity of the sample.

SLP Clustering and Bray-Curtis (PCoA figure). Axes: PC 1 (-0.4 to 0.8), PC 2 (-0.3 to 0.4). Points at depths 1,000; 2,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000; 30,000; 40,000. Bray-Curtis PCoA clusters entirely on depth (each point represents 10 atop one another).