Comparison of Three Fungal ITS Reference Sets. Qiong Wang and Jim R. Cole


RDP TECHNICAL REPORT
Created 04/12/2014, Updated 08/08/2014

Comparison of Three Fungal ITS Reference Sets
Qiong Wang and Jim R. Cole
wangqion@msu.edu, colej@msu.edu

Summary

In this report, we evaluate the performance of three different fungal ITS datasets using the RDP Classifier (1). The genera covered differed significantly among the three sets. UNITE was the largest dataset, covering 85% and 73% of the genera from Warcup and DOE_SFA, respectively. DOE_SFA is the smallest, containing only 48% and 39% of the Warcup and UNITE genera, respectively. Warcup showed the highest and tightest similarity within species, with a median of 96%. UNITE_sh (grouping by UNITE species hypothesis accession code) has a median similarity within species of 90%. Warcup and UNITE_sh performed similarly in leave-one-sequence-out testing: 85% and 88% accuracy at species and 93% and 90% at genus, respectively. Warcup showed the best accuracy in leave-one-taxon-out testing, with UNITE_sh second best. It took 80 seconds to classify 1000 near-full-length ITS sequences using the UNITE_sh training set on a single CPU of a Mac with a 3.2 GHz Intel Core i5 processor. Using the Warcup training set, classification was twice as fast, roughly proportional to the relative number of species. When trained on the UNITE_name set, in which sequences were grouped by UNITE taxon names, the Classifier performed much worse than when trained on UNITE_sh. Both the Warcup and UNITE_sh ITS training sets are available on the RDP Classifier web site, the RDP SourceForge repository (http://sourceforge.net/projects/rdp-classifier/) and the GitHub repository (https://github.com/rdpstaff).

ITS Reference Sets

DOE_SFA ref set: This is a published hand-curated set. The sequences and taxonomy construction of this set were described in detail in Porras-Alfaro et al. (U.S. Department of Energy Science Focus Area; 2). Briefly, the majority of sequences were selected from published phylogenies or from NCBI searches. It contains lineage information only down to the genus level.

Warcup ref set: A version from an active curatorial effort, kindly provided by Paul Greenfield and Vinita Deshpande of the Australian Commonwealth Scientific and Industrial Research Organization (manuscript in preparation). It also incorporates some training sequences from the DOE_SFA and UNITE ref sets. It contains lineages to the species level.

UNITE ref set: A set consisting of UNITE core sequences (excluding chimeric and low-quality sequences) for each dynamic species hypothesis, provided by Kessy Abarenkov of UNITE on July 4, 2014. This file uses the UNITE dynamic species hypotheses. These were created using a two-tier clustering process, which first clusters sequences to the subgenus/genus level and then to the finer species level (3). In addition to the UNITE species hypothesis accession code number, each sequence is labeled with a lineage including a more traditional UNITE taxon name as species designation. We tested the UNITE set twice: once grouping by UNITE taxon name as terminal taxa (UNITE_name), and a second time using a concatenation of the UNITE species hypothesis and UNITE taxon name to group sequences into terminal taxa (UNITE_sh). For example, instead of having one terminal taxon Cortinarius_caesiocortinatus, the UNITE_sh set has two terminal taxa, Cortinarius_caesiocortinatus SH192002.06FU and Cortinarius_caesiocortinatus SH192062.06FU. Except for the grouping of sequences into terminal taxa, the sequences included in these two UNITE ref sets are identical.

For each of the ref sets described above, we constructed a unique set by removing any sequence identical to, or a substring of, another sequence in the same training set. Removing duplicates is important when evaluating the performance of a dataset, to avoid inflated results. The taxonomic composition and the number of sequences are listed in Table 1a. In addition to the common domain Fungi, the DOE_SFA set contains 1 sequence from each of three domains (Protozoa, Viridiplantae and Stramenopiles); UNITE contains 56 sequences from domain Protozoa. The vast majority of sequences in these three datasets cover the full ITS region, including ITS1, 5.8S and ITS2 (Table 1b).

Table 1a: Taxonomic composition at major ranks

Rank                Warcup   DOE_SFA     UNITE
domain (kingdom)         1         2         2
phylum                   8        11        10
class                   40        36        45
order                  131       118       167
family                 364       328       523
genus                1,620     1,134     2,135
species              8,967        NA   20,221*
Unique sequences    17,923     6,889   145,019

* UNITE_sh has 20,221 species-level taxa; UNITE_name has 10,346.

Table 1b: Completeness of the unique sequences

Completeness (%)    Warcup   DOE_SFA     UNITE
(Near) complete       95.2      94.6      97.5
Incomplete ITS1        2.4       2.7       1.2
Incomplete ITS2        2         2.2       1.1
Incomplete both        0.3       0.4       0.1
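The construction of the unique sets described above (dropping any sequence that is identical to, or a substring of, another sequence in the same set) can be sketched as follows. This is a minimal illustration and not the code used for the report; the function name `unique_sequences` and the dictionary input format are assumptions made for the example.

```python
def unique_sequences(seqs):
    """Drop any sequence identical to, or a substring of, another sequence.

    seqs: dict mapping a sequence ID to its sequence string (hypothetical format).
    Returns a dict with the retained IDs and upper-cased sequences.
    """
    # Visit longest sequences first, so that any sequence containing another
    # one is always examined before the sequence it contains.
    ordered = sorted(seqs.items(), key=lambda item: (-len(item[1]), item[0]))
    kept = {}
    for seq_id, seq in ordered:
        s = seq.upper()
        # Substring containment is transitive, so checking against the
        # retained sequences only is sufficient.
        if any(s in other for other in kept.values()):
            continue
        kept[seq_id] = s
    return kept


if __name__ == "__main__":
    records = {"a": "ACGTACGT", "b": "CGTA", "c": "ACGTACGT", "d": "TTTT"}
    print(sorted(unique_sequences(records)))
```

The pairwise containment check is quadratic in the number of sequences, which is fine for a sketch; a set the size of UNITE would likely need an indexed approach.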

Results

Commonality

We compared the three ref sets to measure the extent to which genera and sequences were shared between the different data sets (Tables 2a and 2b). UNITE is the largest set, containing 85% of genera from Warcup and 73% of genera from DOE_SFA. It also contains more than half of the sequences (GenBank accnos) from Warcup and DOE_SFA. Warcup is the second largest set, containing 69% of genera from DOE_SFA and 64% of genera from UNITE. The percent of sequences from the other sets found in either Warcup or DOE_SFA was 15% or less (Table 2b). The numbers of shared genera and shared sequences between each pair of ref sets are shown in a Venn diagram (Fig. 1).

Table 2a: Shared genera (percent of the row set's genera found in the column set)

            Warcup   DOE_SFA   UNITE
Warcup           -       48%     85%
DOE_SFA        69%         -     73%
UNITE          64%       39%       -

Table 2b: Shared sequences (percent of the row set's sequences found in the column set)

            Warcup   DOE_SFA   UNITE
Warcup           -        6%     66%
DOE_SFA        15%         -     56%
UNITE           8%        3%       -

Figure 1: Venn diagram of shared genera and shared sequences. Counts shown in the diagram regions:

Region                    Genera   Sequences
DOE_SFA only                 246       2,786
Warcup only                   40       5,777
UNITE only                   657     130,044
DOE_SFA and Warcup only       56         239
DOE_SFA and UNITE only       109       3,068
Warcup and UNITE only        646      11,111
All three sets               723         796

Taxa Similarity

We examined how close the sequences were within taxa and between taxa. Since no good multiple alignment methods are available for ITS, we used Sab scores as a measure of similarity between sequences.
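As a rough illustration of the Sab score (defined in Methods as the percent of shared 8-mers between two sequences), the sketch below counts distinct 8-mers. The report does not spell out the denominator; the code assumes the smaller of the two distinct k-mer counts, and the names `kmers` and `sab_score` are hypothetical.

```python
def kmers(seq, k=8):
    """Return the set of distinct k-mers in seq (case-insensitive)."""
    s = seq.upper()
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def sab_score(seq_a, seq_b, k=8):
    """Percent of shared k-mers between two sequences.

    Assumption: shared distinct k-mers divided by the smaller of the two
    distinct k-mer counts. The report only says "percent of shared 8-mers".
    """
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a or not b:
        # At least one sequence is shorter than k, so no k-mers to compare.
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))


if __name__ == "__main__":
    print(sab_score("ACGTACGTAC", "ACGTACGTTT"))
```

Because the score needs no alignment, it can be computed for every pair of sequences in a set, which is what the within-taxon and between-taxon comparisons rely on.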

DOE_SFA does not group at species rank; its median Sab score within genera is 56% and drops to 31% among families (Fig. 2). Warcup showed the highest and tightest similarity within species, with a median of 96%. UNITE_sh has a median similarity within species of 90%, with a large range from 72% (2nd percentile) to 99% (98th percentile). For both DOE_SFA and UNITE_sh, sequences at the higher ranks were slightly less similar than for Warcup. UNITE_name has the lowest median similarity within species, at 37%. The similarity between species (or higher ranks) was low for all the sets.

Figure 2: Box and whisker plots showing intra-taxa similarity (Sab score) for each major rank. The 1st quartile, median and 3rd quartile are shown as the bottom, middle and top of the box; the 2nd and 98th percentiles are indicated by whiskers. Clockwise: Warcup, DOE_SFA, UNITE_name and UNITE_sh. Note that DOE_SFA does not group at species rank.
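The pooling behind these plots (described in Methods: each pairwise score is added to the pool of the lowest common ancestor rank of the two sequences) can be sketched as below. The record format, the `RANKS` list and the function names are assumptions for illustration; any pairwise similarity function, such as a shared 8-mer score, can be passed in as `score`.

```python
from collections import defaultdict
from itertools import combinations

# Assumed rank order for a lineage tuple; shorter lineages (e.g. ending at
# genus, as in DOE_SFA) are handled because zip() stops at the shortest input.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def lowest_common_rank(lineage_a, lineage_b):
    """Return the lowest rank at which two lineages still agree, or None."""
    shared = None
    for rank, taxon_a, taxon_b in zip(RANKS, lineage_a, lineage_b):
        if taxon_a != taxon_b:
            break
        shared = rank
    return shared


def pool_scores(records, score):
    """Pool pairwise scores by the lowest common ancestor rank of each pair.

    records: list of (lineage, sequence) pairs; lineage is a tuple ordered as RANKS.
    score:   function taking two sequences and returning a similarity value.
    """
    pools = defaultdict(list)
    for (lin_a, seq_a), (lin_b, seq_b) in combinations(records, 2):
        rank = lowest_common_rank(lin_a, lin_b)
        if rank is not None:
            pools[rank].append(score(seq_a, seq_b))
    return pools
```

With this grouping, the species pool holds within-species scores, while the genus pool holds scores between different species of the same genus, matching the interpretation given in Methods.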

Leave-One-Out Testing

We performed both leave-one-sequence-out and leave-one-taxon-out testing on the three fungal ITS datasets. All Warcup and DOE_SFA sequences were used for testing. For the UNITE_sh and UNITE_name sets, one sequence from each species was chosen randomly as the query for these tests. Classification without a bootstrap cutoff was used for these accuracy measurements.

Warcup achieved 85% accuracy at species level and 93% accuracy at genus level. UNITE_sh showed 88% at species and 90% at genus level (Fig. 3). DOE_SFA showed only 79% at genus level. Our testing results differ from those in the publication describing the DOE_SFA dataset (2): duplicate sequences were not removed from the training set in those tests, while they were removed for this report.

When a taxon was removed from the training set (leave-one-taxon-out testing), the accuracy at the lower ranks (order, family, genus) decreased for all the data sets. For example, if the species was not present in the training set, the Classifier trained on the Warcup set assigned a sequence to the correct genus in 73% of the cases, but in only 58% of the cases when trained on the UNITE_sh set. If the genus was not present, the Classifier trained on the Warcup set made the correct family assignment 90% of the time, but only 77% of the time when trained on the UNITE_sh set and 60% when trained on the DOE_SFA set.

When tested on the UNITE_name ref set, constructed using the species name as the terminal taxon name, the Classifier showed only 74% accuracy at species level and 80% at genus level with leave-one-sequence-out testing. The accuracy of leave-one-taxon-out testing using the UNITE_name set was also worse than with the UNITE_sh set.

Further investigating the sequences misclassified during leave-one-out testing, we found that, for every set except UNITE_name, the majority have their closest match (highest Sab score) to a sequence from a different taxon (Table 3).

Table 3: Percent of misclassified sequences with closest match in a different taxon

Set           # misclassified seqs   % with closest match in different taxon
Warcup                       2,920                                     66.3%
DOE_SFA                      1,347                                     78.6%
UNITE_sh                     3,337                                     58.2%
UNITE_name                   5,373                                     30.1%
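The leave-one-sequence-out procedure, including the rule from Methods that a sequence is skipped at any rank where it is the only member of its taxon, can be sketched as follows. The classifier here is a deliberately simple stand-in (the training sequence sharing the most 8-mers wins) and is not the RDP Classifier; `nearest_label`, `leave_one_sequence_out` and the record format are hypothetical names chosen for the example.

```python
from collections import Counter


def kmer_set(seq, k=8):
    """Distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def nearest_label(query, training, k=8):
    """Stand-in classifier (NOT the RDP Classifier): return the lineage of
    the training sequence that shares the most k-mers with the query."""
    q = kmer_set(query, k)
    best = max(training, key=lambda rec: len(q & kmer_set(rec[1], k)))
    return best[0]


def leave_one_sequence_out(records, classify=nearest_label):
    """Per-rank accuracy from leave-one-sequence-out testing.

    records: list of (lineage, sequence); each lineage is a tuple of taxon
             names ordered from the highest rank to the lowest.
    Returns one accuracy per rank (None where nothing could be tested).
    """
    n_ranks = len(records[0][0])
    # Number of sequences under each taxon, keyed by lineage prefix.
    counts = Counter(lin[:d + 1] for lin, _ in records for d in range(n_ranks))
    tested = [0] * n_ranks
    correct = [0] * n_ranks
    for i, (lineage, seq) in enumerate(records):
        training = records[:i] + records[i + 1:]  # hold out the test sequence
        predicted = classify(seq, training)
        for d in range(n_ranks):
            if counts[lineage[:d + 1]] < 2:
                continue  # singleton at this rank: not counted here
            tested[d] += 1
            correct[d] += predicted[:d + 1] == lineage[:d + 1]
    return [c / t if t else None for c, t in zip(correct, tested)]
```

Leave-one-taxon-out testing would differ only in how `training` is built: instead of dropping the single test sequence, every sequence sharing its species (or genus) would be dropped before classification.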

Figure 3: Classification accuracy at each major taxon rank from leave-one-out testing. The RDP Classifier was trained on each of the four fungal ITS sets (DOE_SFA, Warcup, UNITE_sh, UNITE_name). No bootstrap cutoff was applied in the accuracy calculation. Top: leave-one-sequence-out testing (domain through species). Bottom: leave-one-taxon-out testing (domain through genus).

Methods

Leave-one-sequence-out testing: in each iteration, one sequence from the training set was chosen as the test sequence and removed from the training set. The assignment produced by the Classifier was compared to the original taxonomy label to measure the accuracy of the Classifier. A singleton sequence was still included in the accuracy calculation at a higher rank if its taxon at that rank contained multiple sequences. For example, if a sequence is the only sequence for a species, it is not included in the accuracy calculation for the species rank; but if this sequence belongs to a genus containing multiple species, it is included in the accuracy calculation for the genus rank.

Leave-one-taxon-out testing: very similar to leave-one-sequence-out testing, except that for each test sequence, the lowest taxon to which that sequence was assigned (either the species or the genus node) was removed from the training set. This tests how likely the Classifier is to assign the sequence to the correct genus or higher taxon when the species or genus is not present in the training set.

Sab score: the percent of shared 8-mers between two sequences. This is the same score as the one calculated by RDP SeqMatch, except that the latter uses 7-mers. We used 8-mers here because the Classifier performs best using 8-mers when trained on 16S rRNA datasets.

Taxa Similarity: for each pair of sequences from a set, we calculated the Sab score and added the score to the pool of the lowest common ancestor taxon of the two sequences. For example, if the two sequences were from the same species, the Sab score was added to the species pool to measure how close sequences are within species. If they were from the same genus but not from the same species, the Sab score was added to the genus pool to measure how close they are between species. The Sab scores for each rank were used to generate box and whisker plots.

Completeness Measurement: sequence records were retrieved from GenBank using the GenBank accnos from all three ref sets. Only sequences with both features "internal transcribed spacer 1" and "internal transcribed spacer 2" were considered complete, and the corresponding sequence region was kept. This resulted in 13,912 complete reference sequences (called the COMBO set).
For each query sequence in each of the ref sets, the pairwise alignment between the query and the sequence from the COMBO set with the best alignment score was used to determine completeness. A query was marked Incomplete ITS1 if the query alignment contains at least 50 inserts at the beginning, Incomplete ITS2 if it contains at least 50 inserts at the end of the alignment, or both.

References

1. Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Appl Environ Microbiol. 73(16):5261-7.
2. Porras-Alfaro A, Liu KL, Kuske CR, Xie G. 2014. From genus to phylum: large-subunit and internal transcribed spacer rRNA operon regions show similar classification accuracies influenced by database composition. Appl Environ Microbiol. 80(3):829-40.

3. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor A, Bahram M, Bates ST, Bruns TD, Bengtsson-Palme J, Callaghan TM, et al. 2013. Towards a unified paradigm for sequence-based identification of fungi. Molecular Ecology 22:5271-5277.
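As a supplementary illustration of the completeness rule in Methods (at least 50 inserts at the beginning or end of the query alignment), the sketch below classifies an already-aligned query. It assumes that "inserts" appear as gap characters in the query row of the pairwise alignment, and that the alignment itself comes from an external aligner; the function name `completeness` and its parameters are hypothetical.

```python
def completeness(aligned_query, min_gap=50, gap="-"):
    """Classify a query from its row in a pairwise alignment.

    aligned_query: the query row of the alignment against its best-matching
                   complete reference, with `gap` marking unaligned positions.
    min_gap:       number of leading or trailing gap positions that marks the
                   corresponding ITS region as incomplete (50 in the report).
    """
    # Count gap characters at the start and at the end of the query row.
    leading = len(aligned_query) - len(aligned_query.lstrip(gap))
    trailing = len(aligned_query) - len(aligned_query.rstrip(gap))
    missing_its1 = leading >= min_gap
    missing_its2 = trailing >= min_gap
    if missing_its1 and missing_its2:
        return "Incomplete both"
    if missing_its1:
        return "Incomplete ITS1"
    if missing_its2:
        return "Incomplete ITS2"
    return "(Near) complete"


if __name__ == "__main__":
    print(completeness("-" * 60 + "ACGTACGT"))
```

The returned labels mirror the row names of Table 1b, so tallying them over a ref set would give the kind of percentages reported there.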