Taxonomy and Clustering of SSU rRNA Tags
Susan Huse, Josephine Bay Paul Center
August 5, 2013
Primary Methods of Taxonomic Assignment
Bayesian k-mer matching: RDP
  http://rdp.cme.msu.edu
  Wang et al. (2007) Appl Environ Microbiol 73(16):5261-7
Sequence matching: GAST
  http://vamps.mbl.edu/resources/software.php
  Huse et al. (2008) PLoS Genetics 4: e1000255
k-mer indexing
TGGTCTTGACATCCACAGAACTTTCCAGAGA TGGATTGGTGCCTTCGGGAACTGTGAGAC
Ex. 8-mer indexing:
  TGGTCTTG
  GGTCTTGA
  GTCTTGAC
  TCTTGACA
  CTTGACAT
  TTGACATC
  TGACATCC
  GACATCCA
  ACATCCAC
  CATCCACA
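The sliding-window indexing on this slide can be sketched in a few lines of Python. This is a minimal illustration only; the function names `kmers` and `shared_kmer_fraction` are hypothetical, and the overlap score is a simple stand-in, not the Bayesian scoring RDP uses.

```python
def kmers(seq, k=8):
    """Return every overlapping k-mer of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_kmer_fraction(query, reference, k=8):
    """Fraction of the query's distinct k-mers that also occur in the reference.

    Illustrative only: a plain set overlap, not a classifier.
    """
    query_words = set(kmers(query, k))
    reference_words = set(kmers(reference, k))
    return len(query_words & reference_words) / len(query_words)


# The first 17 bases of the slide's sequence give the ten 8-mers listed there.
words = kmers("TGGTCTTGACATCCACA", k=8)
```

A sequence of length L yields L - k + 1 words, so the 17-base prefix above produces 10 words.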
RDP Web Results
Reports:
  Taxonomy at each level
  Bootstrap value at each level (min 80% by default)
Levels: Domain, Phylum, Class, Order, Family, Genus
Example: Root[100%] Bacteria[100%] Proteobacteria[100%] Alphaproteobacteria[100%] Rhodospirillales[100%] Rhodospirillaceae[100%] Dongia[99%]
Bootstrap Confidence Estimation
The number of times a genus was selected out of 100 bootstrap trials was used as an estimate of confidence in the assignment to that genus.
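A toy sketch of the bootstrap idea, under loud assumptions: each trial resamples one-eighth of the query's 8-mers (an assumed subset size), and the "classifier" simply picks the reference genus sharing the most sampled words. That scoring rule is a simplification chosen for illustration, not RDP's Bayesian model; `bootstrap_confidence` and the genus names are hypothetical.

```python
import random
from collections import Counter


def kmers(seq, k=8):
    """Return every overlapping k-mer of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bootstrap_confidence(query, references, trials=100, k=8, seed=0):
    """Return (winning genus, percent of trials in which it was selected).

    references: dict mapping a genus name to one reference sequence.
    Toy scoring: count of sampled query words present in the reference.
    """
    rng = random.Random(seed)
    words = kmers(query, k)
    subset_size = max(1, len(words) // 8)  # assumed subset size per trial
    ref_words = {genus: set(kmers(seq, k)) for genus, seq in references.items()}
    votes = Counter()
    for _ in range(trials):
        # Resample words with replacement for this trial.
        sample = [rng.choice(words) for _ in range(subset_size)]
        best = max(ref_words, key=lambda g: sum(w in ref_words[g] for w in sample))
        votes[best] += 1
    genus, count = votes.most_common(1)[0]
    return genus, 100 * count // trials


references = {
    "GenusA": "TGGTCTTGACATCCACAGAACTTTCCAGAGATGGATTGG",
    "GenusB": "ACGTACGTACGTTTTTGGGGCCCCAAAATTTTGGGGCCCC",
}
result = bootstrap_confidence(references["GenusA"], references)
```

With a query identical to one reference and sharing no words with the other, every trial selects the same genus, so the confidence is 100.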
Bootstrap Cutoff Values
Bootstrap cutoff:                    0%       50%      80%
V3  % classified to genus            100%     92.4%    82.3%
    % classified to correct genus    92.0%    95.0%    98.1%
V4  % classified to genus            100%     97%      87.9%
    % classified to correct genus    92.8%    94.5%    95.7%
V6  % classified to genus            100%     73.5%    40.4%
    % classified to correct genus    79.0%    96.5%    98.7%
Based on 7,028 human gut sequences, RDP-classified at full length, then cut and reclassified.
Bootstrap Cutoff Values
Bootstrap cutoff:                    0%       50%      80%
V3  % classified to genus            100%     92.4%    82.3%
    % classified to correct genus    92.0%    95.0%    98.1%
Assume 100 sequences:
At 80%:
  82 are classified to genus
  80 are identified correctly (82 x 0.98 = 80)
  2 are identified incorrectly
At 50%:
  92 are classified to genus (10 more)
  87 are correct (92 x 0.95 = 87) (7 more)
  5 are incorrect (3 more)
30% of the incremental genus assignments (3 of 10) are incorrect.
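The arithmetic on this slide can be checked directly. The sketch below uses the slide's rounded V3 values (82 and 92 classified, 98% and 95% correct); the helper name `outcomes` is hypothetical.

```python
def outcomes(n_classified, fraction_correct):
    """Return (correct, incorrect) counts, rounded to whole sequences."""
    correct = round(n_classified * fraction_correct)
    return correct, n_classified - correct


# 80% cutoff: 82 of 100 classified to genus, 98% of those correct.
correct_80, wrong_80 = outcomes(82, 0.98)
# 50% cutoff: 92 of 100 classified to genus, 95% of those correct.
correct_50, wrong_50 = outcomes(92, 0.95)

extra_assigned = 92 - 82              # additional genus calls gained at 50%
extra_wrong = wrong_50 - wrong_80     # how many of those extra calls are wrong
incremental_error = extra_wrong / extra_assigned
```

Lowering the cutoff gains 10 assignments, of which 3 are wrong, which is where the 30% figure comes from.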
GAST: Global Assignment of Sequence Taxonomy
Sequence matching to nearest neighbor(s) in a reference database.
Uses global alignment (entire length of amplicon) rather than local alignment (fragments).
Distance is defined by sequence alignment.
GAST
R1: TGGTCTTGACATCCACAGAT
Q:  TGGTCTTGACATCCACAGAT
    TGGTCTTGACATCGACAGAT
    TGGTCTTGACATCTACAGAT
R2: TGGTCTTGGCATCTACAGAT
RefID: R1,R2
Distance: 0.05
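A minimal sketch of how a distance of 0.05 with two tied references can arise. Two assumptions are made here: the query is taken to be the variant TGGTCTTGACATCTACAGAT (the slide layout does not make the labeling certain), and distance is approximated as mismatches divided by length for pre-aligned, equal-length sequences, which stands in for the global alignment GAST actually performs. The function names are hypothetical.

```python
def mismatch_distance(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nearest_references(query, refs):
    """Return (sorted names of all references tied for nearest, that distance)."""
    distances = {name: mismatch_distance(query, seq) for name, seq in refs.items()}
    best = min(distances.values())
    return sorted(name for name, d in distances.items() if d == best), best


refs = {
    "R1": "TGGTCTTGACATCCACAGAT",
    "R2": "TGGTCTTGGCATCTACAGAT",
}
query = "TGGTCTTGACATCTACAGAT"  # assumed query; one mismatch to each reference
names, distance = nearest_references(query, refs)
```

One mismatch over 20 positions gives 0.05 to both R1 and R2, so both are reported as nearest references.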
GAST Flowchart
High-quality 16S tags -> nearest RefSeq(s) and GAST distance -> consensus taxonomy (2/3 majority)
RefDB: RefV6, RefV3V5, RefSSU
Using tags cut to match primers is much more efficient.
Ex. Consensus Calculation
  1: Firmicutes; Clostridia; Clostridiales; Lachnospiraceae (10%)
  4: Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium (40%)
  5: Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium; perfringens (50%)
1. Lowest rank = species
2. Calculate species voting: 5 / 10 = 50% (less than 66%)
3. Next rank = genus
4. Calculate genus voting: (4 + 5) / 10 = 90%
5. Assign taxonomy to genus level
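The voting steps above can be sketched as follows. This is an illustrative reading of the slide, not the GAST source: the function name `consensus_taxonomy`, the input layout, and the choice of dividing by all hits (including those with shallower lineages) are assumptions.

```python
from collections import Counter


def consensus_taxonomy(hits, threshold=2 / 3):
    """Walk from the deepest rank upward until one lineage wins a 2/3 majority.

    hits: list of (count, lineage) pairs, where lineage is a tuple of names
    ordered from the highest rank down.
    Returns (winning lineage prefix, its vote fraction).
    """
    total = sum(count for count, _ in hits)
    max_depth = max(len(lineage) for _, lineage in hits)
    for depth in range(max_depth, 0, -1):
        votes = Counter()
        for count, lineage in hits:
            if len(lineage) >= depth:  # only lineages defined at this rank vote
                votes[lineage[:depth]] += count
        if not votes:
            continue
        name, n = votes.most_common(1)[0]
        if n / total >= threshold:
            return name, n / total
    return (), 0.0


base = ("Firmicutes", "Clostridia", "Clostridiales")
hits = [
    (1, base + ("Lachnospiraceae",)),
    (4, base + ("Clostridiaceae", "Clostridium")),
    (5, base + ("Clostridiaceae", "Clostridium", "perfringens")),
]
lineage, fraction = consensus_taxonomy(hits)
```

At species level the best vote is 5 of 10, so the loop moves up one rank, where Clostridium collects 9 of 10 and the assignment stops at genus.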
GAST Distance
GAST does not report a bootstrap value.
GAST reports the distance to the nearest sequence.
If distance = 0, then good accuracy.
If distance = 0.05 (5%), then likely same family and maybe genus, but not likely species.
Genus Level Comparison
V3:          N = 299,044   Other = 99
V6:          N = 322,971   Other = 82
Full-Length: N = 5,519     Other = 26
Human gut microbiota with 250 nt V3, 60 nt V6, 1000 nt FL
Reference Database Considerations
1. Size of the database
   Does it contain reference sequences similar to your data?
2. Taxonomy of the database
   Are the references classified to genus or species?
3. Quality of the database
   Are there chimeras or low-quality sequences?
Example Reference Databases
1. SILVA database
2. RDP training set
3. Annotated Greengenes
4. Site specific: HOMD and CORE for oral
Specificity in RefSSU (SILVA)
RefHVR sequences mapping to:            V3-V5     V6-V4
1 taxon (unique mapping)                97.1%     97.5%
2 taxa (ambiguous mapping)              2.3%      2.0%
2 taxa by depth only (same lineage)     1.9%      1.8%
  (unique lineage mapping)              (99%)     (99.3%)
If genus, unique genus                  99.7%     99.8%
If species, unique species              93.1%     93.6%
Assigning taxonomy is only as good as your reference.
BLAST to nt
Don't do it!!!
Top BLAST hit can lead to unexpected results:
  hits to unclassified sequences
  hits to sources other than microbial SSU rRNA
  partial hits (local alignment)
Methods Comparison
RDP
  is considered a standard
  is available on the RDP website, or can be used locally and through mothur and QIIME
  works better on longer sequences, not as well on shorter sequences
GAST
  works well for shorter sequences (V6)
  can go to species, depending on the reference database
BLAST to nt
  often returns incorrect taxonomy; not good for pipelines
Taxonomic Consistency
For meaningful community comparisons, taxonomic names must be consistent:
  Names: one organism should have only one name
  Levels: systematic comparisons need consistent levels:
    Kingdom;Phylum;Class;Order;Family;Genus;Species
Taxonomic Names
Bergey's Taxonomic Outline: manual of taxonomic names for bacteria
List of Prokaryotic names with Standing in Nomenclature: vetting process
NCBI: similar taxonomy, but multiple subs (subclass, suborder, subfamily, tribe)
Fungi: UNITE (a work in progress)
Archaea: also a work in progress
Sources of Error in Taxonomic Analyses
  Primer bias
  Chimeras
  Non-16S tag amplified
  Discovery of novel 16S
  Unrepresented in reference database
  Low-quality references
  Taxonomy not available
  Incorrect taxonomy
  Ambiguous hypervariable sequence
  Reference biased toward most studied
Why OTUs?
Because we can!
We can't:
  do whole genome sequencing of communities
  assign complete taxonomic names
  calculate phylogenetic trees from millions of short tags
Need to develop new methods that are more biologically or evolutionarily meaningful, such as oligotyping.
Come back in 10 years.
Different clustering algorithms have very different effects on the size and number of OTUs created
Primary Methods
  complete linkage: no two sequences in a cluster are farther apart than the clustering width
  single linkage: each sequence is within the clustering width of at least one other sequence in the cluster
  average linkage: the average distance from a sequence to every other sequence is less than the width. Dependent on input order.
  greedy clusters: test sequentially and incorporate each sequence into the first qualifying OTU. Dependent on input order.
  reference: compare tags to a pre-established set of reference OTUs (open and closed)
  oligotyping: using sequence-based information directly
Cluster Width
  Diameter: sequences are never more than D apart. (CL)
  Radius: sequences are never more than R from the seed. (AL, Ref, Greedy)
  Link Length: every sequence is within L of another sequence. (SL)
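The three notions of width can be expressed as simple predicates over a candidate cluster. This is a sketch for intuition only; the function names are hypothetical, and `dist` is any caller-supplied distance function. Positions on a number line stand in for sequences in the example.

```python
from itertools import combinations


def within_diameter(members, dist, width):
    """Diameter (CL): no pair of members is farther apart than width."""
    return all(dist(a, b) <= width for a, b in combinations(members, 2))


def within_radius(seed, members, dist, width):
    """Radius (seed-based methods): every member is within width of the seed."""
    return all(dist(seed, m) <= width for m in members)


def within_link_length(members, dist, width):
    """Link length (SL): every member is within width of at least one other member."""
    if len(members) < 2:
        return True
    return all(
        any(dist(a, b) <= width for j, b in enumerate(members) if j != i)
        for i, a in enumerate(members)
    )


def line_distance(a, b):
    """Toy distance: absolute difference between positions on a line."""
    return abs(a - b)


cluster = [0, 2, 4]  # three toy "sequences" spaced 2 apart
```

With a width of 2, this toy cluster satisfies the radius test around the middle point and the link-length test, but fails the diameter test because the two ends are 4 apart. The same width therefore means clusters of different extent depending on the method.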
The Problem of Inflation
Clustering algorithms return more OTUs than predicted for mock communities.
OTU inflation leads to:
  alpha diversity inflation
  beta diversity inflation
Where does this inflation come from?
How can we adjust for this inflation?
Inflation in Action 1,042 is a few more than the expected 2
Sequencing Noise in E. coli
Distance   Tags      Seqs    Cumulative % of tags
0          177,697   2       82.4%
0.02       33,425    633     97.9%
0.03       3,738     2,268   99.6%
0.04       2         1       99.6%
0.05       635       306     99.9%
0.06       38        25      99.9%
0.07       89        72      100.0%
0.08       22        17      100.0%
0.09       0         0       100.0%
0.10       4         4       100.0%
> 0.10     23        16      100.0%
Cluster at 3% using only tags within 3% of the correct sequence.
Inflation with small sequence error Only 129 OTUs Much better!!
Example MSA
18,156 sequences and 392 positions
Regardless of clustering algorithm, an MSA cannot fully align tags whose sequences are too divergent.
MSA Limitations
[Figure: Distance Comparison, Independent Distances vs. Subsampled Distances]
X-axis: alignment of 1,000 sequences subsampled from 40,000 sequences
Y-axis: MSA of 1,000 bacterial sequences
Clustering sequences within 3% of known templates
Method   E. coli (2)   S. epidermidis (1)   Clone 43 (43)
MS-CL    129           89                   694
PW-CL    6             5                    308
Alignment method: MS = Multiple Sequence Alignment, PW = Pairwise Alignment
Clustering method: CL = Complete Linkage clustering, AL = Average Linkage clustering
Complete Linkage propagates OTUs
[Figure: running cluster count as complete-linkage OTUs #1 through #13 form around the seed sequence]
Each V6 variant at 2 nt distance will start a new OTU.
Each V6 variant at 1 nt distance can cluster with the seed or with a 2nd variant.
Average Linkage collapses errors
[Figure: Cluster Count: 1 (single OTU #1)]
Clusters tend to be heavily dominated by their most abundant sequence, which strongly weights the average and smooths the noise.
Greedy Seeds (UClust)
1. Sort by abundance.
2. Seq 1 is the seed of Cluster 1.
3. Compare the next Seq to each cluster in order:
   if Distance(Seq, Cluster) <= W, include it in the cluster;
   if Seq is not a member of any cluster, create a new cluster.
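The greedy procedure above can be sketched as follows. This is an illustration of the steps on the slide, not UClust itself. Two assumptions: Distance(Seq, Cluster) is taken to mean the distance to the cluster's seed, and distance is the mismatch fraction between equal-length sequences. Both function names are hypothetical.

```python
def mismatch_fraction(a, b):
    """Toy distance: fraction of differing positions (equal-length inputs assumed)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_cluster(abundance, width, dist=mismatch_fraction):
    """Greedy seed clustering.

    abundance: dict mapping sequence -> read count.
    Returns a list of (seed, members) pairs in order of creation.
    """
    clusters = []
    # Most abundant sequences are processed first, so they become seeds.
    for seq in sorted(abundance, key=abundance.get, reverse=True):
        for seed, members in clusters:
            if dist(seq, seed) <= width:
                members.append(seq)  # first qualifying cluster wins
                break
        else:
            clusters.append((seq, [seq]))  # no cluster qualified: new seed
    return clusters


abundance = {
    "AAAAAAAAAA": 100,
    "AAAAAAAAAT": 5,   # one mismatch from the first seed
    "TTTTTTTTTT": 50,
    "TTTTTTTTTA": 1,   # one mismatch from the second seed
}
clusters = greedy_cluster(abundance, width=0.1)
```

Because each sequence joins the first cluster that qualifies, the result depends on processing order, which is why sorting by abundance matters.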
Greedy Seeds (UClust) Cluster Count: 1
Reference OTUs
1. Create a full-length tree of reference sequences (Greengenes)
2. Cluster those sequences
3. Map each tag to the nearest reference
4. Bin each tag to that reference's OTU#
Oligotyping
1. Align the sequences
2. Determine the alignment positions with the most information (Shannon entropy)
3. Iterate the position selection
4. Use those positions to bin all sequences
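Steps 2 and 4 can be sketched with a per-column Shannon entropy. This is a minimal illustration assuming the sequences are already aligned to equal length; it omits the iterative refinement of step 3, and the function names are hypothetical rather than taken from the oligotyping software.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def most_informative_positions(aligned, n):
    """Indices of the n alignment columns with the highest entropy."""
    entropies = [column_entropy(col) for col in zip(*aligned)]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return ranked[:n]


def bin_by_positions(aligned, positions):
    """Group sequences by the bases they carry at the chosen positions."""
    bins = {}
    for seq in aligned:
        key = "".join(seq[p] for p in positions)
        bins.setdefault(key, []).append(seq)
    return bins


aligned = ["ACGT", "ACGT", "ACTT", "ACTT"]  # only column 2 varies
positions = most_informative_positions(aligned, 1)
bins = bin_by_positions(aligned, positions)
```

A column that is identical across all reads has zero entropy and carries no information for binning; a column split evenly between two bases has an entropy of 1 bit.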
Clustering sequences within 3% of known templates
Method   E. coli (2)   S. epidermidis (1)   Clone 43 (43)
MS-CL    129           89                   694
MS-AL    54            44                   218
PW-CL    6             5                    308
PW-AL    2             1                    43
Alignment method: MS = Multiple Sequence Alignment, PW = Pairwise Alignment
Clustering method: CL = Complete Linkage clustering, AL = Average Linkage clustering
Accurate with small error 2 OTUs created!!
OTU Inflation with more errors with errors back up to 277 OTUs
Still lose outlier sequencing errors Multiple sequencing errors still not clustered
Single Linkage Preclustering at 2%
1. Sort sequences by abundance
2. Precluster using 1 nt change (2% in V6), single-linkage
3. Most abundant sequence represents all reads in the precluster
SLP
Smooths out sequencing errors, PCR errors, and fine natural variation such as SNPs.
SLP is not designed to denoise sequences.
SLP should only be used for preclustering before clustering at >= 3%.
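The three preclustering steps can be sketched as below. This is a simplified illustration, not the published SLP code: it assumes equal-length, pre-aligned sequences and counts only substitutions, and the names `hamming` and `single_linkage_precluster` are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions (equal-length inputs assumed)."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage_precluster(abundance, max_diff=1):
    """Precluster sequences by single linkage at max_diff nucleotide changes.

    abundance: dict mapping sequence -> read count.
    Returns a list of dicts with the representative, members, and total reads.
    """
    preclusters = []
    # Step 1: process sequences from most to least abundant.
    for seq in sorted(abundance, key=abundance.get, reverse=True):
        for pc in preclusters:
            # Step 2: single linkage, so a link to ANY member is enough.
            if any(hamming(seq, member) <= max_diff for member in pc["members"]):
                pc["members"].append(seq)
                pc["reads"] += abundance[seq]
                break
        else:
            # Step 3: the first (most abundant) sequence represents the precluster.
            preclusters.append({"rep": seq, "members": [seq], "reads": abundance[seq]})
    return preclusters


abundance = {"AAAAAA": 100, "AAAAAT": 10, "AAAATT": 3, "GGGGGG": 50}
preclusters = single_linkage_precluster(abundance)
```

In the example, AAAATT is two changes from the representative but one change from AAAAAT, so it is chained into the same precluster. That chaining is what lets low-abundance error variants collapse into their abundant parent.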
                                     E. coli    S. epidermidis   Clone 43
Expected OTUs                        2          1                43
Total Tags                           215,618    197,876          202,340
Unique Sequences                     3,309      3,177            7,461
3% Clustering
  MS-CL                              1,042      1,267            2,473
  PW-AL                              277        323              458
Single-Linkage Preclustering
  PW-AL                              88         128              275
All E. coli with SLP
Back down to 88 OTUs
OTU Consistency Will I get the same OTUs every time?
Sample Size and Rarefaction
[Figure: M2FN PML rarefaction, MS-CL. Y-axis: OTUs (0 to 7,000). X-axis: Number of Sequences Sampled (0 to 120,000). Series: 5K, 10K, 15K, 20K, 50K, 100K.]
Sample Size and Rarefaction PML SLP-PW-AL
Relative Inflation Absolute number of errant OTUs will increase with sample size. Relative number of errant OTUs will decrease with sample complexity
The Magical 3%
NOT! 3% SSU OTUs = species and 6% SSU OTUs = genera
History of the 3% OTUs
1. Wayne et al (1987) IJSEM 37(4): 463-464
   Bacterial systematics: phylogeny is best for defining species, want to use the entire genome; therefore, 70% similarity as defined by DNA-DNA reassociation.
2. Stackebrandt and Goebel (1994) IJSEM 44(4): 846-849
   70% DNA-DNA ~ 97% similarity of full-length 16S
3. 97% similarity of any hypervariable region anywhere in the Bacterial kingdom = species
What is the right cluster width?
Distances vary by hypervariable region:
  Full-length does not equal V1-V3, which does not equal V3-V5, which does not equal V6, etc.
Distances vary by bacterial lineage.
What specificity are you trying to gain?
No single clustering width is the right answer; use several (and wave your hands a lot).
Clustering Options
  mothur
  QIIME
  UClust (USearch)
  Oligotyping
  ESPRIT-Tree
  CROP (Bayesian)
Clustering Questions
How meaningful are clusters functionally?
When is a rare sequence truly rare and when is it an error?
Should it be included in an existing cluster or start its own?
How to place sequences if OTUs overlap?
What is the effect of residual low-quality data or chimeras?
How sensitive are alpha and beta diversity estimates to clustering results?
OTUs vs Taxonomy
  Novel organisms
  Many unnamed organisms
  Some clades only defined to phylum or class
  Many species names based on phenotype rather than genotype
Do not lump together all 16S unknowns or diverse partially classified sequences.
From the Sample to the Population: Diversity and Distance Metrics
Sampling Depth and Alpha Diversity (Not Richness)
[Figure: Diversity (0 to 5) vs. Sampling Depth (0 to 45,000). Series: SLP - NPShannon, SLP - Simpson, CL - NPShannon, Simpson.]
Robust to both singletons and depth
Ex. Distance Metrics
Jaccard, Unweighted UniFrac: presence / absence
  $|A \cap B| \, / \, |A \cup B|$
Bray-Curtis, Weighted UniFrac: relative abundance
  $\sum_i \min(A_i, B_i) \, / \, \mathrm{avg}(N_A, N_B)$
  $A_i$ = abundance of OTU $i$ in A; $N$ = sample size
Morisita-Horn (ThetaYC): relative and weighted abundance
  $\sum_i RA_i \, RB_i \, / \, \mathrm{avg}\left(\sum_i RA_i^2, \; \sum_i RB_i^2\right)$
  $RA_i$ = relative abundance of OTU $i$ in A
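The three expressions on this slide can be computed directly from OTU count tables. The sketch below evaluates them exactly as written, which gives similarity values; the corresponding distance is 1 minus each value. It covers Jaccard, Bray-Curtis and Morisita-Horn only, not the UniFrac variants, which need a tree. Function names and the dict-of-counts layout are assumptions for illustration.

```python
def jaccard(a, b):
    """Shared OTUs divided by all OTUs observed in either sample."""
    present_a = {otu for otu, n in a.items() if n > 0}
    present_b = {otu for otu, n in b.items() if n > 0}
    return len(present_a & present_b) / len(present_a | present_b)


def bray_curtis(a, b):
    """Sum of per-OTU minimum counts over the average sample size."""
    otus = sorted(set(a) | set(b))
    shared = sum(min(a.get(o, 0), b.get(o, 0)) for o in otus)
    return shared / ((sum(a.values()) + sum(b.values())) / 2)


def morisita_horn(a, b):
    """Sum of products of relative abundances over the average sum of squares."""
    otus = sorted(set(a) | set(b))
    n_a, n_b = sum(a.values()), sum(b.values())
    ra = [a.get(o, 0) / n_a for o in otus]
    rb = [b.get(o, 0) / n_b for o in otus]
    numerator = sum(x * y for x, y in zip(ra, rb))
    denominator = (sum(x * x for x in ra) + sum(y * y for y in rb)) / 2
    return numerator / denominator


# Same community composition sampled at two depths (100 and 50 reads).
sample_a = {"otu1": 50, "otu2": 30, "otu3": 20}
sample_b = {"otu1": 25, "otu2": 15, "otu3": 10}
```

For these two samples, Jaccard and Morisita-Horn both report identical communities, while the Bray-Curtis expression drops to 50 / 75 because it works on absolute counts.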
Do subsamples of the same sample look alike?
For a sample of 50,000 reads:
1. Randomly select 5,000 reads 10 times
2. Calculate subsample beta diversity using several distance metrics
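Step 1 can be sketched as repeated random draws without replacement from the pool of reads. The OTU table below is made up for illustration, and the function names are hypothetical; each resulting replicate could then be passed to whichever distance metric is being evaluated.

```python
import random
from collections import Counter


def subsample(counts, depth, rng):
    """Draw `depth` reads without replacement and return OTU counts."""
    reads = [otu for otu, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(reads, depth))


def replicate_subsamples(counts, depth=5000, replicates=10, seed=0):
    """Return `replicates` independent subsamples of the same sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [subsample(counts, depth, rng) for _ in range(replicates)]


# Hypothetical sample of 50,000 reads across three OTUs.
sample = {"otu1": 30000, "otu2": 15000, "otu3": 5000}
replicates = replicate_subsamples(sample)
```

Because every replicate comes from the same parent sample, any distance between replicates reflects sampling noise rather than a real community difference.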
Community Distance of Subsamples at 5,000 Reads
[Figure: Community Distance (0 to 0.7) across Replicates. Series: Morisita-Horn, Bray-Curtis, Jaccard.]
Presence/absence (Jaccard) is highly skewed by the inconsistency of the low-abundance members.
Community Distance of Subsamples
[Figure: Community Distance (0 to 0.12) across Replicates. Series: Bray-Curtis (1K), Bray-Curtis (5K), Morisita-Horn (1K), Morisita-Horn (5K).]
Subsample 1,000 and 5,000 reads from a sample of 50,000 reads; pairwise distances for replicates at a single depth.
Subsampling Across Depths
For a sample of 50,000 reads:
1. Randomly select subsamples at depths: 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000
2. Calculate subsample beta diversity using several distance metrics
Effect of Sample Depth - Bray-Curtis
[Figure: pairwise Bray-Curtis distances (0 to 1.0) between subsamples at depths 1,000 to 25,000. Annotation: Nearly 100% Different.]
Bray-Curtis uses absolute counts; intra-community distances are high as depths diverge.
Effect of Sample Depth - Morisita-Horn
[Figure: pairwise Morisita-Horn distances (0 to 0.009) between subsamples at depths 1,000 to 25,000. Annotation: Nearly 0.5% Different.]
Morisita-Horn uses relative abundances; intra-community distances are low across depths above the minimum sampling depth.
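A small worked example of the depth effect, under stated assumptions: the same made-up community is represented at 1,000 and 25,000 reads with identical proportions (no sampling noise), and distance is taken as 1 minus the count-based Bray-Curtis and Morisita-Horn expressions. Function names are hypothetical.

```python
def bray_curtis_distance(a, b):
    """1 - sum of per-OTU minimum counts / average sample size."""
    otus = sorted(set(a) | set(b))
    shared = sum(min(a.get(o, 0), b.get(o, 0)) for o in otus)
    return 1 - shared / ((sum(a.values()) + sum(b.values())) / 2)


def morisita_horn_distance(a, b):
    """1 - sum of relative-abundance products / average sum of squares."""
    otus = sorted(set(a) | set(b))
    n_a, n_b = sum(a.values()), sum(b.values())
    ra = [a.get(o, 0) / n_a for o in otus]
    rb = [b.get(o, 0) / n_b for o in otus]
    numerator = sum(x * y for x, y in zip(ra, rb))
    denominator = (sum(x * x for x in ra) + sum(y * y for y in rb)) / 2
    return 1 - numerator / denominator


# Hypothetical community with 60% / 30% / 10% composition at two depths.
shallow = {"x": 600, "y": 300, "z": 100}        # 1,000 reads
deep = {otu: n * 25 for otu, n in shallow.items()}  # 25,000 reads

bc = bray_curtis_distance(shallow, deep)
mh = morisita_horn_distance(shallow, deep)
```

Here the count-based Bray-Curtis distance is 1 - 1,000 / 13,000, about 0.92, even though the composition is identical, while Morisita-Horn is essentially zero. Subsampling every sample to a common depth before computing count-based distances avoids this artifact.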
PCoA plots group communities based on pairwise distances. Communities that have low pairwise distances are close together on the plot. Communities with larger pairwise distances are farther apart on the plot.
[Figure: PCoA plot of subsamples at different depths]
Minimum sample depth here is 10,000, but it will be a function of the diversity of the sample.
SLP Clustering and Bray-Curtis
[Figure: PCoA plot, PC 1 vs. PC 2. Subsample depths: 1,000; 2,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000; 30,000; 40,000.]
Bray-Curtis PCoA clusters entirely on depth (each point represents 10 atop one another).