Taxonomical Classification using:

Extracting ecological signal from noise: introduction to tools for the analysis of NGS data from microbial communities
Bergen, April 19-20, 2012

INTRODUCTION
Taxonomical prediction = Who is out there, and how many?
  Composition of the microbial community
SSU rRNA (16S/18S): de facto standard in environmental genomics
  Amplicons or rRNA tags
  Shotgun rRNA / RNA-Seq (LSU + SSU)
  Classification of a subset (~0.1%) in shotgun metagenome data

INTRODUCTION
Taxonomy = system for classification
Phylogeny = evolutionary development
Bad phylogeny -> bad taxonomy
Bad taxonomy -> bad / less meaningful classification

AVAILABLE TAXONOMIES AND REF. DATABASES
NCBI Taxonomy
  Not meant to be authoritative; it is the taxonomy that sequences in GenBank are mapped to
  Commonly used for taxonomical classification (best hit, MEGAN)
  Polyphyletic and unclassified nodes, some even placed incorrectly
  Incorrect assignments and expired taxa
RDP (Ribosomal Database Project)
Greengenes
--> SILVA <--

SILVA
Includes all three domains of life (including Eukaryotes)
SSURef 106: ~500k full-length SSU sequences and 20k LSU sequences
Taxonomy assigned to clusters that include uncultured organisms (down to genus level)
Distributed for the ARB software package, plus some online resources

CLASSIFICATION METHODS
Can be roughly divided into those based on:
1. Inferred multiple alignments (e.g. NAST)
2. Nucleotide composition (e.g. RDP Classifier)
3. Pairwise alignments (e.g. BLAST)

CLASSIFICATION METHODS
1. Infer a multiple alignment (NAST, SINA WebAligner, etc.) and insert into an existing reference tree (Greengenes classifier, LCA)
  + Best accuracy for reads close to known reference sequences [Liu et al, 2008]
  - Slow and sensitive to read novelty or quality

CLASSIFICATION METHODS
2. Nucleotide composition based: RDP Classifier (8-mer)
  + Fast. Similar results to BLAST in environmental datasets [Liu et al, 2008]
  - More sensitive to sequencing noise and small differences
3. Pairwise alignment to reference database (best BLAST hit, Lowest Common Ancestor, MEGAN)
  + With LCA, relatively fast and accurate [Liu et al, 2008]
  - LCA very sensitive to assignments in the reference database
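The nucleotide-composition idea in point 2 can be illustrated with a minimal Python sketch. It is deliberately simpler than the RDP Classifier, which uses a Naïve Bayes model over 8-mers: here a query is assigned to the taxon whose reference sequences share the largest fraction of its 8-mers. The function names `kmers` and `classify_by_kmers` and the input layout are hypothetical.

```python
def kmers(seq, k=8):
    """Return the set of all overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_by_kmers(query, references, k=8):
    """Assign a query to the taxon sharing the largest fraction of its k-mers.

    references: dict mapping taxon name -> list of reference sequences
    (an assumed layout, chosen for illustration only).
    Returns (taxon, shared_fraction); taxon is None if nothing is shared.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        return None, 0.0
    best_taxon, best_score = None, 0.0
    for taxon, seqs in references.items():
        ref_kmers = set().union(*(kmers(s, k) for s in seqs))
        score = len(query_kmers & ref_kmers) / len(query_kmers)
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon, best_score
```

Because only shared words are counted, a single substitution removes up to eight 8-mers from the overlap, which gives some intuition for why composition-based methods react to sequencing noise and small differences.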

CLASSIFICATION METHODS
In addition: methods based on reconstruction of a phylogenetic tree
  + Ability to study phylogenetic novelty
  - Slow and expensive
  - High false positive rate in the Liu et al. benchmark

CREST WORKFLOW
Alignment (Megablast) to the SilvaMod reference database, then LCA using a custom Python script or MEGAN [Huson et al, 2007]
Mapping taxa to ranks using NCBI Taxonomy
Minimum similarity filters (99% for species, 97% for genus, 95% for family, 90% for order...)
Web interface (max. 1,000 sequences) including Megablast (under development using Hodman)
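The minimum similarity filters can be sketched in a few lines of Python. This is an illustration, not the CREST script: the function name `trim_lineage`, the lineage layout, and the choice to leave ranks above order unfiltered are assumptions. Only the four thresholds are taken from the slide.

```python
# Thresholds from the slide; ranks above order are left unfiltered here,
# which is an assumption made for this sketch.
MIN_IDENTITY = {"species": 99.0, "genus": 97.0, "family": 95.0, "order": 90.0}


def trim_lineage(lineage, identity):
    """Cut a lineage at the first rank whose minimum identity is not met.

    lineage:  list of (rank, name) tuples ordered from root to tip
    identity: percent identity of the alignment to the reference
    """
    kept = []
    for rank, name in lineage:
        if identity < MIN_IDENTITY.get(rank, 0.0):
            break  # this rank and everything below it is too specific
        kept.append((rank, name))
    return kept
```

With these thresholds, a read aligning at 96% identity would keep its assignment down to family but lose the genus and species labels.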

2% range from top-scoring BLAST hit, min score = 155 bits

Blast match #1, Score = 100 bits
  Query: 1   CTGCCCTGGCTTCTATTATGCGTGACGT...
  Sbjct: 350 CTGCCCGGGC-TCTATTATGCGTGACGT...

Blast match #2, Score = 95 bits
  Query: 1   CTGCCCTGGCTTCTATTATGCGTGACGT...
  Sbjct: 349 CTGCCCGGGC--CTATTAGGCGTGACGT...

Blast match #3, Score = 90 bits
  Query: 3   CCCTGGCTTCTATTA-TGCGTGACGTGTC...
  Sbjct: 353 CCCGTGC-TCTATTAGTGCGTGACCTATG...
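The LCA step with a 2% score range and a minimum bit score can be sketched as follows. This is a minimal illustration under stated assumptions, not the CREST or MEGAN implementation: the function name `lca_assign`, the hit layout, and the reading of "2% range" as "within 2% of the top bit score" are hypothetical.

```python
def lca_assign(hits, min_score=155.0, score_range=0.02):
    """Lowest common ancestor of all hits scoring close to the best hit.

    hits: list of (bitscore, lineage), where lineage is a list of taxon
          names ordered from root to tip (an assumed layout).
    Returns the shared lineage prefix, or None if the best hit scores
    below min_score or there are no hits.
    """
    if not hits:
        return None
    top = max(score for score, _ in hits)
    if top < min_score:
        return None
    # Keep every hit whose score is within score_range of the top score.
    kept = [lineage for score, lineage in hits if score >= top * (1 - score_range)]
    lca = []
    for names in zip(*kept):  # walk all kept lineages rank by rank
        if len(set(names)) != 1:
            break  # lineages disagree from this rank downwards
        lca.append(names[0])
    return lca


hits = [
    (200.0, ["Bacteria", "Proteobacteria", "Gammaproteobacteria"]),
    (198.0, ["Bacteria", "Proteobacteria", "Alphaproteobacteria"]),
    (150.0, ["Archaea", "Euryarchaeota"]),
]
print(lca_assign(hits))  # ['Bacteria', 'Proteobacteria']
```

The third hit falls outside the 2% range and is ignored, so it does not pull the assignment up to the root. This also shows why LCA is sensitive to the reference database: one misannotated sequence among the top hits would shorten the shared prefix.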

OUTPUT
Assignments at domain level

Parent taxa         Taxa                          Abundance  Share    Unique  Chao
Cellular organisms  Archaea                       79         0.00887  21      29
Cellular organisms  Bacteria                      8821       0.99068  395     796
Cellular organisms  Eukaryota                     1          0.00011  1       None
                    Total                         8901       0.99966  417
                    Unclassified at domain level  3          0.00034  1

Assignments at Phylum level

Parent taxa  Taxa            Abundance  Share    Unique  Chao
Archaea      Thaumarchaeota  4          0.00045  2       None
Archaea      Euryarchaeota   75         0.00842  19      25
Bacteria     Acidobacteria   8          0.0009   2       None
Bacteria     Actinobacteria  4          0.00045  2       None
Bacteria     Bacteroidetes   456        0.05121  71      125
Bacteria     BD1-5           5          0.00056  4       8
Bacteria     Caldithrix      15         0.00168  1       None
Bacteria     Chlorobi        5          0.00056  4       8
Bacteria     Cyanobacteria   1          0.00011  1       None

Also FASTA format with assignments for each sequence, plus a more parser-friendly format for abundance.

PERFORMANCE TESTING
Exhaustive tenfold cross validation: aligning 1/10 of the reference database to the other 9/10
  Different lengths (full-length, 450 bp and 100 bp)
  Gives recall rate and false positive rate
Removal of taxa: cross validation removing whole genera, families or phyla and aligning to the remaining sequences
Real data: assignment of 4 different SSU rRNA datasets from environmental genomics studies
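The tenfold cross validation can be sketched in Python as below. The slides do not define the two rates, so the definitions used here are assumptions: recall is the fraction of test sequences assigned to their true taxon at the rank in question, and the false positive rate is the fraction assigned to a different taxon. Unassigned sequences count towards neither. The function names are hypothetical.

```python
def tenfold_splits(ids, folds=10):
    """Yield (reference, query) splits: each fold is queried once
    against the remaining folds."""
    for i in range(folds):
        query = ids[i::folds]
        query_set = set(query)
        reference = [x for x in ids if x not in query_set]
        yield reference, query


def recall_and_fpr(results):
    """Compute (recall, false positive rate) at one rank.

    results: list of (true_taxon, predicted_taxon), where predicted_taxon
             is None when the classifier made no assignment at that rank.
    """
    n = len(results)
    if n == 0:
        return 0.0, 0.0
    correct = sum(1 for true, pred in results if pred is not None and pred == true)
    wrong = sum(1 for true, pred in results if pred is not None and pred != true)
    return correct / n, wrong / n
```

Sweeping a classifier parameter such as the LCA score range or the confidence cutoff and recording one (false positive rate, recall) pair per setting would produce the points of an ROC curve.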

COMPARISON TO OTHER METHODS
Greengenes
  Similar approach used very recently to create an alignment-informed consensus taxonomy
  Larger database, but few sequences annotated to genus rank
  Alternative files for LCA classification built
RDP Classifier
  Nucleotide composition based + Naïve Bayes classifier
  Used with default training set + Greengenes (QIIME)

RESULTS
[Figure: ROC curves (recall rate vs. false positive rate) for tenfold cross validation at family rank and at genus rank, fragment length = 450 bp. Methods compared: SilvaMod106/LCA, Greengenes/LCA, Greengenes/RDP Classifier, RDP Classifier default. Marked settings: LCA range = 0.02, confidence cutoff = 0.8.]

RESULTS
[Figure: ROC curves (recall rate vs. false positive rate) for tenfold cross validation at family rank and at genus rank, fragment length = 100 bp. Methods compared: SilvaMod106/LCA, Greengenes/LCA, Greengenes/RDP Classifier, RDP Classifier default. Marked settings: LCA range = 0.02, confidence cutoff = 0.8.]

RESULTS
[Table: Recall and false positive rate from removal of taxa cross validation. Columns: Method; Training / Reference set; Fragment length; False Positive Rate at removed rank level for Genera, Families, Phyla. Rows: LCA (a) with SilvaMod, LCA (a) with Greengenes, RDP (b) with Greengenes, and RDP (b) with its default training set, each at F.L. (c), 450 bp and 100 bp.]
a LCA classification, using Megablast alignments within a 2% range of the highest bitscore, as well as percent similarity filters
b Naïve Bayes classification using the RDP Classifier with a bootstrap confidence cutoff of 0.8
c Un-cropped full-length sequences from the reference or training dataset
[Figure: Removal of whole families cross validation (fragment length = 450 bp, SilvaMod106 + LCA). False positive rate at genus rank, family rank and phylum rank; plotted values 0.36, 0.11 and 0.021.]

RESULTS
Dataset                Lake Lanier         Forest soil                Siberian soil       Hydrothermal mat
Sequencing technology  GS FLX Ti           GS FLX Ti                  Illumina            GS FLX Ti
Library type           Shotgun metagenome  Shotgun metatranscriptome  16S rRNA amplicons  16S rRNA amplicons

[Table: Total SSU rRNA reads per dataset (reads with a BLASTN alignment above a minimum bitscore to a sequence in SilvaMod), followed by classification results per dataset. Columns: Method; Training / Reference set; Dataset; Share of reads assigned at Genus, Family and Phylum rank; Unique taxa (B/A/E) for Genera, Families and Phyla. Rows: LCA with SilvaMod, LCA with Greengenes, RDP with Greengenes, and RDP with its default training set, each for Lake Lanier, Forest soil, Siberian soil and Hydrothermal mat.]
Unique taxa (B/A/E): number of unique taxa given separately for bacteria / archaea / eukaryotes, where the highest total number of taxa ...


ACKNOWLEDGMENTS
Tim Urich, Steffen Jørgensen, Lise Øvreås, Inge Jonassen, Daniel Huson, Markus Gorfer, Svenn Helge Grindhaug