Assigning Taxonomy to Marker Genes Susan Huse Brown University August 7, 2014
In a nutshell
  Taxonomy is assigned by comparing your DNA sequences against a database of DNA sequences from known taxa.
Marker Genes
  Bacteria: 16S (SSU rRNA)
  Archaea: 16S (SSU rRNA)
  Protists: 18S (SSU rRNA)
  Fungi: ITS, LSU rRNA
Taxonomic Consistency
  For meaningful community comparisons, taxonomic names must be consistent:
  Names: one organism should have only one name
  Levels: automated comparisons need consistent levels:
    Kingdom;Phylum;Class;Order;Family;Genus;Species
Official Taxonomic Names
  Bergey's Taxonomic Outline: manual of taxonomic names for bacteria
  List of Prokaryotic names with Standing in Nomenclature (vetting process)
  NCBI: similar taxonomy, but with multiple sub-ranks (subclass, suborder, subfamily, tribe)
  Fungi: UNITE
  Archaea: also a work in progress
Primary Methods of Taxonomic Assignment
  Marker genes: RDP, GAST, BLAST
  Shotgun metagenomic: MetaPhlAn, AMPHORA
Ribosomal Database Project
  Uses Bayesian k-mer matching
  Wang, et al (2007) Appl Environ Microbiol. 73(16):5261-7
  http://rdp.cme.msu.edu
k-mer indexing
  TGCTAGCTAGTGACATCCACAGAACTTTCCA
  GAGATGGATTGGTGCCTTCGGGAACTGTGA
  Ex. 4-mer indexing (word, count so far):
    TGCT 1
    AGCT 1
    GCTA 2
    TAGT 1
    CTAG 2
    TAGC 1
    AGTG 1
    GTGA 1
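The 4-mer index above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `kmer_counts` is made up for illustration, and only the first line of the example sequence is indexed, which matches the counts shown on the slide.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 4) -> Counter:
    """Count every overlapping k-mer (word) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# First line of the example sequence from the slide.
seq = "TGCTAGCTAGTGACATCCACAGAACTTTCCA"
counts = kmer_counts(seq, k=4)

# Print the words listed on the slide with their counts.
for word in ["TGCT", "AGCT", "GCTA", "TAGT", "CTAG", "TAGC", "AGTG", "GTGA"]:
    print(word, counts[word])
```

A sequence of length L yields L - k + 1 overlapping words, so the 31-nt line above produces 28 4-mers.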
RDP Web Results
  Reports:
    Taxonomy at each level
    Bootstrap value at each level (minimum 80% by default)
  Example (Root; Domain; Phylum; Class; Order; Family; Genus):
    Root[100%] Bacteria[100%] Proteobacteria[100%] Alphaproteobacteria[100%] Rhodospirillales[100%] Rhodospirillaceae[100%] Dongia[99%]
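To show how a bootstrap cutoff acts on a result like the one above, here is a small Python sketch that parses the `Name[NN%]` text and keeps ranks only down to the first one that falls below the cutoff. The helper names `parse_rdp_lineage` and `apply_cutoff` are hypothetical and are not part of the RDP software; the parsing assumes the text format shown on the slide.

```python
import re


def parse_rdp_lineage(line):
    """Return (name, bootstrap_percent) pairs from text like 'Bacteria[100%]'."""
    pairs = re.findall(r"([A-Za-z_]+)\s*\[(\d+)%\]", line)
    return [(name, int(pct)) for name, pct in pairs]


def apply_cutoff(lineage, cutoff=80):
    """Keep ranks from the root down, stopping at the first one below the cutoff."""
    kept = []
    for name, pct in lineage:
        if pct < cutoff:
            break
        kept.append(name)
    return kept


result = ("Root[100%] Bacteria[100%] Proteobacteria [100%] "
          "Alphaproteobacteria[100%] Rhodospirillales[100%] "
          "Rhodospirillaceae[100%] Dongia[99%]")

lineage = parse_rdp_lineage(result)
default_call = apply_cutoff(lineage, cutoff=80)   # keeps the genus Dongia
strict_call = apply_cutoff(lineage, cutoff=100)   # stops at the family
print(";".join(default_call))
print(";".join(strict_call))
```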
Bootstrap Confidence Estimation
  The number of times a genus was selected out of 100 bootstrap trials was used as an estimate of confidence in the assignment to that genus.
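The idea of counting wins over 100 bootstrap trials can be illustrated with a toy Python sketch. Several things here are assumptions for illustration only: the scoring simply counts shared words rather than using the naive Bayesian probabilities of the RDP Classifier, the word size is 4, each trial samples one eighth of the query words, and the two reference sequences and genus names are invented.

```python
import random
from collections import Counter


def kmers(seq, k):
    """List all overlapping k-mers in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def best_genus(words, training):
    """Toy scoring: the genus whose reference contains the most sampled words."""
    scores = {genus: sum(w in ref_words for w in words)
              for genus, ref_words in training.items()}
    return max(scores, key=scores.get)


def bootstrap_confidence(query, references, k=4, trials=100, seed=0):
    """Return (genus, fraction of trials in which that genus was selected)."""
    rng = random.Random(seed)
    training = {genus: set(kmers(seq, k)) for genus, seq in references.items()}
    query_words = kmers(query, k)
    n_sample = max(1, len(query_words) // 8)  # assumed subsample size
    votes = Counter()
    for _ in range(trials):
        sample = rng.choices(query_words, k=n_sample)  # sample with replacement
        votes[best_genus(sample, training)] += 1
    genus, wins = votes.most_common(1)[0]
    return genus, wins / trials


# Invented references: the query is identical to GenusA and shares no 4-mers with GenusB.
references = {
    "GenusA": "ACGTACGTTAGCTAGCTAGGATCCGATCGA",
    "GenusB": "TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAA",
}
genus, confidence = bootstrap_confidence(references["GenusA"], references)
print(genus, confidence)
```

Because every sampled word occurs in the GenusA reference and none occurs in GenusB, GenusA wins every trial in this toy case. Real queries that share words with several genera split their votes, which lowers the confidence.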
Bootstrap Cutoff Values
  Region  Bootstrap cutoff                 0%       50%      80%
  V3      % classified to genus            100%     92.4%    82.3%
  V3      % classified to correct genus    92.0%    95.0%    98.1%
  V4      % classified to genus            100%     97%      87.9%
  V4      % classified to correct genus    92.8%    94.5%    95.7%
  V6      % classified to genus            100%     73.5%    40.4%
  V6      % classified to correct genus    79.0%    96.5%    98.7%
  Based on 7,028 human gut sequences, RDP-classified at full length, then cut to each region and reclassified.
Incremental Accuracy
  V3, bootstrap cutoff                 0%       50%      80%
  % classified to genus                100%     92.4%    82.3%
  % classified to correct genus        92.0%    95.0%    98.1%
  Assume 1000 reads:
  At 80%:
    823 are classified to genus
    807 are identified correctly (823 x 0.981 = 807)
    16 are identified incorrectly (823 - 807 = 16)
  At 50%:
    924 are classified to genus (101 more)
    878 are correct (924 x 0.95 = 878) (71 more)
    46 are incorrect (30 more)
  About 30% of the added reads are incorrect (30 / 101).
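The arithmetic above can be checked with a short Python sketch. The function name `incremental_accuracy` and the dictionary keys are made up for illustration; the inputs are the V3 percentages from the slide, and counts are rounded to whole reads as the slide does.

```python
def incremental_accuracy(n_reads, strict, relaxed):
    """Compare two cutoffs.

    strict and relaxed are (fraction classified, fraction of classified reads
    that are correct) for the higher and lower bootstrap cutoff.
    """
    def tally(frac_classified, frac_correct):
        classified = round(n_reads * frac_classified)
        correct = round(classified * frac_correct)
        return classified, correct, classified - correct

    s_cls, s_ok, s_bad = tally(*strict)
    r_cls, r_ok, r_bad = tally(*relaxed)
    added = r_cls - s_cls
    added_bad = r_bad - s_bad
    return {
        "strict": (s_cls, s_ok, s_bad),
        "relaxed": (r_cls, r_ok, r_bad),
        "added_reads": added,
        "added_incorrect": added_bad,
        "fraction_added_incorrect": added_bad / added,
    }


# V3 values from the slide: 80% cutoff versus 50% cutoff, 1000 reads.
summary = incremental_accuracy(1000, strict=(0.823, 0.981), relaxed=(0.924, 0.95))
print(summary)
```

Lowering the cutoff from 80% to 50% adds 101 classified reads, of which 30 are wrong, so the added reads are much less reliable than the reads that already passed the 80% cutoff.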
GAST: Global Assignment of Sequence Taxonomy GAST uses direct comparison to a reference database to assign taxonomy. http://vamps.mbl.edu/resources/software.php Huse, et al (2008) PLoS Genetics 4: e1000255
Reference Matching
  Each sequence is matched to the nearest sequence in a reference database.
  Uses global alignment (USEARCH) rather than local alignment (BLAST).
  Distance is defined by the number of mismatches along the sequence alignment.
Assigned Taxon Name
  GAST uses a lowest-common-taxon method to assign taxonomy when a read is equidistant to more than one reference sequence.
  Assignment requires a minimum of 66% concurrence (a two-thirds majority).
GAST Flowchart
  Inputs: high-quality 16S tags; RefDB (e.g., RefV3V4, RefSSU)
  Step 1: nearest RefSeq(s) and GAST distance
  Step 2: consensus taxonomy (two-thirds majority)
  Using tags cut to match primers is much more efficient.
GAST
  R1: TGGTCTTGACATCCACAGAT
  Q:  TGGTCTTGACATCCACAGAT
  Query exactly matches 1 reference
  RefID: R1
  Distance: 0.0
GAST
  R1: TGGACTTGACATCCACAGAT
  Q:  TGGTCTTGACATCCACAGAT
      TGGTCTTGACATCGACAGAT
  R2: TGGTCTTGGCATCTACAGAT
  Query inexactly matches 2 references with two mismatches each
  RefID: R1, R2
  Distance = 1 / 20 = 0.05
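The distance in these examples is the number of mismatches divided by the alignment length. The Python sketch below is a simplification, not GAST itself: it assumes the two sequences are already globally aligned, have equal length, and contain no gaps, and the function name `gast_distance` is hypothetical.

```python
def gast_distance(query: str, ref: str) -> float:
    """Mismatches per aligned position for two pre-aligned, equal-length sequences."""
    if len(query) != len(ref):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(q != r for q, r in zip(query, ref))
    return mismatches / len(query)


query = "TGGTCTTGACATCCACAGAT"
exact_ref = "TGGTCTTGACATCCACAGAT"   # R1 from the exact-match example
near_ref = "TGGACTTGACATCCACAGAT"    # R1 from the inexact-match example

d_exact = gast_distance(query, exact_ref)
d_near = gast_distance(query, near_ref)
print(d_exact, d_near)
```

The second reference differs from the query at one of 20 positions, which gives the distance of 1 / 20 = 0.05.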
Consensus Calculation
  If the query equally matches multiple references:
    Firmicutes; Clostridia; Clostridiales; Lachnospiraceae (1 hit)
    Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium (4 hits)
    Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium; perfringens (5 hits)
  1. Calculate community size: 1 + 4 + 5 = 10
  2. Lowest rank = species
  3. Calculate voting: 5 / 10 = 50% (less than 66%)
  4. Next rank = genus
  5. Calculate genus voting: (4 + 5) / 10 = 90%
  6. Assign taxonomy to genus level
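The voting steps above can be sketched in Python. This is an illustration of the two-thirds rule as described on the slide, not the GAST source code; the function name `consensus_taxonomy`, the tuple representation of lineages, and the "Unknown" fallback are assumptions.

```python
from collections import Counter


def consensus_taxonomy(hits, threshold=2 / 3):
    """Walk up from the deepest rank until one taxon holds a two-thirds majority.

    hits is a list of (lineage tuple, number of equally near references).
    """
    total = sum(count for _, count in hits)
    max_depth = max(len(lineage) for lineage, _ in hits)
    for depth in range(max_depth, 0, -1):
        votes = Counter()
        for lineage, count in hits:
            # A reference votes at this depth only if it is classified that deeply.
            if len(lineage) >= depth:
                votes[lineage[:depth]] += count
        taxon, n = votes.most_common(1)[0]
        if n / total >= threshold:
            return ";".join(taxon), n / total
    return "Unknown", 0.0


hits = [
    (("Firmicutes", "Clostridia", "Clostridiales", "Lachnospiraceae"), 1),
    (("Firmicutes", "Clostridia", "Clostridiales", "Clostridiaceae",
      "Clostridium"), 4),
    (("Firmicutes", "Clostridia", "Clostridiales", "Clostridiaceae",
      "Clostridium", "perfringens"), 5),
]
taxon, support = consensus_taxonomy(hits)
print(taxon, support)
```

At the species rank perfringens gets only 5 of 10 votes, so the function moves up one rank, where Clostridium gets 9 of 10 and is assigned.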
GAST Distance
  GAST does not report a bootstrap value or confidence interval; it reports the distance to the nearest reference sequence.
  If distance = 0, accuracy is good.
  If distance = 0.05 (5%), the read is likely from the same family and maybe the same genus, but not the same species.
BLAST
  The top BLAST hit can lead to unexpected results:
    hits to unclassified sequences
    hits to sources other than microbial SSU rRNA
    local alignments can be misleading
  Supervised BLAST is an excellent tool
  Unsupervised BLAST can be dangerous
Methods Comparison
  RDP
    Considered the standard
    Available via the RDP website, mothur, QIIME, or locally
    Works better on longer sequences, not as well on shorter sequences
  GAST
    Works well for shorter sequences
    Can go to species, depending on the reference database
  BLAST to nt
    Often returns incorrect taxonomy
    Not good for pipelines; good for checking individual results
16S regions give similar results at the genus level
  V3: N = 299,044; Other = 99
  V6: N = 322,971; Other = 82
  Full-Length: N = 5,519; Other = 26
  Human gut microbiota with 250 nt V3, 60 nt V6, 1000 nt full-length (FL)
Reference Database Considerations
  1. Size of the database: does it contain reference sequences similar to your data?
  2. Taxonomy of the database: are the references classified to genus or species?
  3. Quality of the database: are there chimeras or low-quality sequences?
Example Reference Databases
  1. SILVA database (SSU, LSU): http://www.arb-silva.de/
  2. RDP training set (SSU, ITS): http://rdp.cme.msu.edu/index.jsp
  3. Greengenes: http://greengenes.lbl.gov/cgi-bin/nphindex.cgi
Specialized Databases
  HOMD, Human Oral Microbiome Database: http://www.homd.org/
  OSU CORE for oral: http://microbiome.osu.edu/
  UNITE: http://unite.ut.ee/
Which Hypervariable Region?
  Two regions are better than one: more information
  Different regions have different specificity at the genus or species level
  Different primer sets can have different biases
  Depends on your samples
16S Specificity in SILVA
  RefHVR sequences mapping        V3-V5    V6-V4
  Unique taxon                    97.1%    97.5%
    If genus, unique genus        99.7%    99.8%
    If species, unique species    93.1%    93.6%
  2 Taxa (ambiguous)              2.3%     2.0%
  2 Lineages                      0.6%     0.5%
  Unique lineage mapping          99%      99%
  Assigning taxonomy is only as good as your reference
Sources of Error in Taxonomic Analyses
  Primer bias
  Chimeras
  Non-16S tag amplified
  Discovery of novel 16S
  Unrepresented in reference database
  Low-quality references
  Taxonomy not available
  Incorrect taxonomy
  Ambiguous hypervariable sequence
  Reference biased toward most studied