Taxonomical Classification using:


Extracting ecological signal from noise: introduction to tools for the analysis of NGS data from microbial communities
Bergen, April 19-20 2012

INTRODUCTION
Taxonomical prediction = Who is out there and how many?
Composition of the microbial community
SSU rRNA (16S/18S): de facto standard in environmental genomics
  Amplicons or rRNA tags
  Shotgun rRNA / RNA-Seq (LSU + SSU)
  Classification of subset (~0.1%) in shotgun metagenome data

INTRODUCTION
Taxonomy = system for classification
Phylogeny = evolutionary development
Bad phylogeny -> bad taxonomy
Bad taxonomy -> bad / less meaningful classification

AVAILABLE TAXONOMIES AND REF. DATABASES
NCBI Taxonomy
  Not meant to be authoritative; it is what sequences in GenBank are mapped to
  Commonly used for taxonomical classification (best hit, MEGAN)
  Contains polyphyletic groups, unclassified nodes, incorrect assignments and expired taxa
RDP (Ribosomal Database Project)
Greengenes
--> SILVA <--

SILVA
Includes all three domains of life (including Eukaryotes)
SSURef 106: ~500k full-length SSU sequences and 20k LSU sequences
Taxonomy assignments to clusters that include uncultured organisms (up to genus level)
Distributed for the ARB software package, plus some online resources

CLASSIFICATION METHODS
Can be roughly divided into those based on:
1. Inferred multiple alignments (e.g. NAST)
2. Nucleotide composition (e.g. RDP Classifier)
3. Pairwise alignments (e.g. BLAST)

CLASSIFICATION METHODS
1. Infer multiple alignment (NAST, SINA WebAligner, etc.) and insert into existing reference tree (Greengenes classifier, LCA)
+ Best accuracy for reads close to known reference sequences [Liu et al, 2008]
- Slow and sensitive to read novelty or quality

CLASSIFICATION METHODS
2. Nucleotide composition based: RDP Classifier (8-mer)
+ Fast. Similar results to BLAST in environmental datasets [Liu et al, 2008]
- More sensitive to sequencing noise and small differences
3. Pairwise alignment to reference database (best BLAST hit, Lowest Common Ancestor, MEGAN)
+ With LCA relatively fast and accurate [Liu et al, 2008]
- LCA very sensitive to assignments in ref. database
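As a rough illustration of the composition-based idea in method 2, the sketch below breaks a read into overlapping 8-letter words and measures how many of them a reference shares. It is not the RDP Classifier itself, which feeds such word counts into a naive Bayes model; the function names kmer_set and shared_kmer_fraction are hypothetical.

```python
def kmer_set(seq, k=8):
    """Return the set of overlapping k-letter words in seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(query, reference, k=8):
    """Fraction of the query's distinct k-mers that also occur in the reference."""
    query_words = kmer_set(query, k)
    if not query_words:
        # Query shorter than k: no words to compare.
        return 0.0
    return len(query_words & kmer_set(reference, k)) / len(query_words)


if __name__ == "__main__":
    read = "CTGCCCTGGCTTCTATTATGCGTGACGT"
    # Same read with one substituted base in the middle.
    noisy = read[:13] + "A" + read[14:]
    print(len(kmer_set(read)), "distinct 8-mers in the read")
    print(shared_kmer_fraction(noisy, read), "of the noisy read's 8-mers still match")
```

A single substituted base alters every word that overlaps it (up to eight words for k = 8), which is one way to see why composition-based methods are listed as sensitive to sequencing noise.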

CLASSIFICATION METHODS
In addition: methods based on reconstruction of phylogenetic tree
+ Ability to study phylogenetic novelty
- Slow and expensive
- High false positive rate in Liu et al benchmark

CREST WORKFLOW
Alignment (Megablast) to the SilvaMod reference database and LCA using custom Python script or MEGAN [Huson et al, 2007]
Mapping taxa to ranks using NCBI Taxonomy
Minimum similarity filters (99% for species, 97% for genus, 95% for family, 90% for order...)
Web interface (max. 1,000 sequences) including Megablast (under development using Hodman)
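The minimum similarity filters can be pictured as a cap on how deep an assignment may go. The sketch below is only an illustration under stated assumptions, not the CREST script: the lineage is a hypothetical list of (rank, name) pairs, the function name trim_lineage is made up, and ranks above order are left untouched because their thresholds are elided on the slide.

```python
# Minimum percent identity needed to keep an assignment at each rank,
# taken from the slide. Thresholds for ranks above order are elided in
# the source, so they are not guessed here and those ranks are never trimmed.
MIN_SIMILARITY = {
    "species": 99.0,
    "genus": 97.0,
    "family": 95.0,
    "order": 90.0,
}


def trim_lineage(lineage, percent_identity):
    """Drop every rank whose minimum similarity the alignment fails to reach.

    lineage is a list of (rank, name) pairs ordered from root to tip.
    """
    blocked = {
        rank for rank, threshold in MIN_SIMILARITY.items()
        if percent_identity < threshold
    }
    return [(rank, name) for rank, name in lineage if rank not in blocked]


if __name__ == "__main__":
    example = [
        ("phylum", "Bacteroidetes"),
        ("order", "Bacteroidales"),
        ("family", "Bacteroidaceae"),
        ("genus", "Bacteroides"),
    ]
    # 96% identity passes the family filter but not the genus filter.
    print(trim_lineage(example, 96.0))
```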

2% range from top scoring BLAST hit, min score=155 bits
Blast match #1, Score = 100 bits
  Query: 1   CTGCCCTGGCTTCTATTATGCGTGACGT...
  Sbjct: 350 CTGCCCGGGC-TCTATTATGCGTGACGT...
Blast match #2, Score = 95 bits
  Query: 1   CTGCCCTGGCTTCTATTATGCGTGACGT...
  Sbjct: 349 CTGCCCGGGC--CTATTAGGCGTGACGT...
Blast match #3, Score = 90 bits
  Query: 3   CCCTGGCTTCTATTA-TGCGTGACGTGTC...
  Sbjct: 353 CCCGTGC-TCTATTAGTGCGTGACCTATG...
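The hit selection and LCA step can be sketched as follows. This is a minimal illustration, assuming each hit is a (bitscore, lineage) pair with the lineage given as a root-to-tip list of names; the function names and the example scores are hypothetical, and the actual CREST script or MEGAN may handle ties and ranks differently.

```python
def select_hits(hits, min_score=155.0, score_range=0.02):
    """Keep hits scoring at least min_score and within score_range of the top hit."""
    passing = [(score, lineage) for score, lineage in hits if score >= min_score]
    if not passing:
        return []
    top = max(score for score, _ in passing)
    cutoff = top * (1.0 - score_range)
    return [(score, lineage) for score, lineage in passing if score >= cutoff]


def lowest_common_ancestor(lineages):
    """Return the longest root-to-tip prefix shared by all lineages."""
    common = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        common.append(names[0])
    return common


def classify(hits, min_score=155.0, score_range=0.02):
    """Assign a read to the LCA of its selected hits (empty list if none pass)."""
    selected = select_hits(hits, min_score, score_range)
    return lowest_common_ancestor([lineage for _, lineage in selected])


if __name__ == "__main__":
    # Hypothetical hits: the third falls outside the 2% range of the best score,
    # the fourth is below the minimum score.
    hits = [
        (200.0, ["Bacteria", "Bacteroidetes", "Flavobacteriia"]),
        (197.0, ["Bacteria", "Bacteroidetes", "Bacteroidia"]),
        (190.0, ["Bacteria", "Proteobacteria", "Gammaproteobacteria"]),
        (150.0, ["Archaea", "Euryarchaeota"]),
    ]
    print(classify(hits))
```

Because only the two best hits fall inside the range, the read is placed at their shared phylum rather than at either class, and the lower-scoring hits do not pull the assignment up to domain level.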

OUTPUT
Assignments at domain level
Parent taxa         Taxa                          Abundance  Share    Unique  Chao
Cellular organisms  Archaea                       79         0.00887  21      29
Cellular organisms  Bacteria                      8821       0.99068  395     796
Cellular organisms  Eukaryota                     1          0.00011  1       None
                    Total                         8901       0.99966  417
                    Unclassified at domain level  3          0.00034  1

Assignments at phylum level
Parent taxa  Taxa            Abundance  Share    Unique  Chao
Archaea      Thaumarchaeota  4          0.00045  2       None
Archaea      Euryarchaeota   75         0.00842  19      25
Bacteria     Acidobacteria   8          0.0009   2       None
Bacteria     Actinobacteria  4          0.00045  2       None
Bacteria     Bacteroidetes   456        0.05121  71      125
Bacteria     BD1-5           5          0.00056  4       8
Bacteria     Caldithrix      15         0.00168  1       None
Bacteria     Chlorobi        5          0.00056  4       8
Bacteria     Cyanobacteria   1          0.00011  1       None

Also FASTA format with assignments for each sequence, plus a more parser-friendly format for abundance.

PERFORMANCE TESTING
Exhaustive tenfold cross validation: aligning 1/10 of reference database to the other 9/10
  Different lengths (full-length, 450 bp and 100 bp)
  Gives recall rate and false positive rate
Removal of taxa: cross validation removing whole genera, families or phyla and aligning to remaining
Real data: assignment of 4 different SSU rRNA datasets from environmental genomics studies
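A tenfold split and the two rates can be sketched as below. The slides do not define the rates exactly, so this is an assumption: recall is taken as the share of queries assigned to their true taxon and false positive rate as the share assigned to a wrong taxon, with unassigned queries counting toward neither. The function names are hypothetical.

```python
import random


def ten_fold_splits(ids, folds=10, seed=0):
    """Yield (reference_ids, query_ids) pairs; each id is a query exactly once."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    for i in range(folds):
        queries = ids[i::folds]
        held_out = set(queries)
        reference = [x for x in ids if x not in held_out]
        yield reference, queries


def recall_and_fpr(truth, predicted):
    """Compute (recall, false positive rate) at one rank.

    truth maps query id -> true taxon; predicted maps query id -> assigned
    taxon, or None when the classifier made no assignment at this rank.
    """
    total = len(truth)
    correct = sum(1 for q, taxon in truth.items() if predicted.get(q) == taxon)
    wrong = sum(
        1 for q, taxon in truth.items()
        if predicted.get(q) is not None and predicted.get(q) != taxon
    )
    return correct / total, wrong / total


if __name__ == "__main__":
    truth = {"s1": "Bacteroides", "s2": "Prevotella", "s3": "Caldithrix", "s4": "Bacteroides"}
    predicted = {"s1": "Bacteroides", "s2": "Caldithrix", "s3": None, "s4": "Bacteroides"}
    print(recall_and_fpr(truth, predicted))
```

Varying the LCA range or the confidence cutoff and recomputing the two rates at each setting traces out a curve of recall against false positive rate.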

COMPARISON TO OTHER METHODS
Greengenes
  Similar approach used very recently to create alignment-informed consensus taxonomy
  Larger database, but few sequences annotated to genus rank
  Alternative files for LCA classification built
RDP Classifier
  Nucleotide composition based + Naïve Bayes Classifier
  Used with default training set + Greengenes (QIIME)

RESULTS
[Figure: ROC for 10 split cross validation, Family rank and Genus rank (Fragment length=450bp). Axes: recall rate against false positive rate. Curves: SilvaMod106/LCA, Greengenes/LCA, Greengenes/RDP Classifier, RDP Classifier default. Marked points: LCA range=0.02, Confidence cutoff=0.8.]

RESULTS
[Figure: ROC for 10 split cross validation, Family rank and Genus rank (Fragment length=100bp). Axes: recall rate against false positive rate. Curves: SilvaMod106/LCA, Greengenes/LCA, Greengenes/RDP Classifier, RDP Classifier default. Marked points: LCA range=0.02, Confidence cutoff=0.8.]

RESULTS
[Table: Recall and false positive rate from removal-of-taxa cross validation. Columns: Method; Training / Reference set; Fragment length; False Positive Rate at removed rank level for Genera, Families, Phyla. Rows: LCA (a) with SilvaMod, LCA (a) with Greengenes, RDP (b) with Greengenes, RDP (b) with RDP v6, each at F.L. (c), 450 bp and 100 bp. The numeric values are garbled in the source.]
[Figure: Removal of whole families cross validation (Fragment length=450bp, SilvaMod106+LCA). False positive rate: 0.36 at Genus rank, 0.11 at Family rank, 0.021 at Phylum rank.]
a LCA classification, using Megablast alignments within a 2% range of the highest bitscore as well as percent similarity filters
b Naïve Bayes classification using the RDP Classifier with a bootstrap confidence cutoff of 0.8
c Un-cropped full-length sequences from the reference or training dataset

RESULTS
[Table: Real data. Datasets: Lake Lanier, Forest soil, Siberian soil, Hydrothermal mat. Dataset descriptors: Sequencing technology (GS FLX Ti, Illumina), Library type (Shotgun metagenome, Shotgun metatranscriptome, 16S rRNA amplicons), Total SSU rRNA reads. Columns: Method; Training / Reference set; Dataset; Share of reads assigned at Genus, Family, Phylum; Unique taxa (B/A/E) (a) for Genera, Families, Phyla. Methods: LCA with SilvaMod, LCA with Greengenes, RDP with Greengenes, RDP with RDP v6. The numeric values are garbled in the source.]
a Number of unique taxa given separately for bacteria / archaea / eukaryotes. Where the highest total number of taxa ...

RESULTS

ACKNOWLEDGMENTS
Tim Urich
Steffen Jørgensen
Lise Øvreås
Inge Jonassen
Daniel Huson
Markus Gorfer
Svenn Helge Grindhaug