Comparison of Three Fungal ITS Reference Sets. Qiong Wang and Jim R. Cole

RDP TECHNICAL REPORT
Created 04/12/2014, Updated 08/08/2014
wangqion@msu.edu, colej@msu.edu

Summary

In this report, we evaluate the performance of three different fungal ITS datasets using the RDP Classifier (1). The genera covered differed significantly between the three sets. UNITE was the largest dataset, covering 85% of the genera in Warcup and 73% of the genera in DOE_SFA. DOE_SFA was the smallest, containing only 48% and 39% of the Warcup and UNITE genera, respectively. Warcup showed the highest and tightest similarity within species, with a median of 96%. UNITE_sh (grouping by UNITE species hypothesis accession code) had a median within-species similarity of 90%. Warcup and UNITE_sh performed similarly in leave-one-sequence-out testing: 85% and 88% accuracy at the species level and 93% and 90% at the genus level, respectively. Warcup showed the best accuracy in leave-one-taxon-out testing, with UNITE_sh second best.

Classifying 1000 near-full-length ITS sequences with the UNITE_sh training set took 80 seconds on a single CPU of a Mac with a 3.2 GHz Intel Core i5 processor. With the Warcup training set, classification was about twice as fast, roughly in proportion to the relative number of species. When trained on the UNITE_name set, in which sequences were grouped by UNITE taxon names, the Classifier performed much worse than when trained on UNITE_sh. Both the Warcup and UNITE_sh ITS training sets are available on the RDP Classifier web site, the RDP SourceForge repository (http://sourceforge.net/projects/rdp-classifier/) and the RDP GitHub repository (https://github.com/rdpstaff).

ITS Reference Sets

DOE_SFA ref set: This is a published hand-curated set. The sequences and taxonomy construction of this set are described in detail in Porras-Alfaro et al. (U.S. Department of Energy Science Focus Area; 2). Briefly, the majority of sequences were selected from published phylogenies or from NCBI searches. It contains lineage information only down to the genus level.

Warcup ref set: A version from an active curatorial effort, kindly provided by Paul Greenfield and Vinita Deshpande of the Australian Commonwealth Scientific and Industrial Research Organization (manuscript in preparation). It also incorporates some training sequences from the DOE_SFA and UNITE ref sets. It contains lineages to the species level.

UNITE ref set: A set consisting of UNITE core sequences (excluding chimeric and low-quality sequences) for each dynamic species hypothesis, provided by Kessy Abarenkov of UNITE on July 4, 2014. This file uses the UNITE dynamic species hypotheses. These were created using a two-tier clustering process, which first clusters sequences to the subgenus/genus level and then to the finer species level (3). In addition to the UNITE species hypothesis accession code, each sequence is labeled with a lineage that includes a more traditional UNITE taxon name as the species designation. We tested the UNITE set twice: once grouping by UNITE taxon name as terminal taxa (UNITE_name), and a second time using a concatenation of the UNITE species hypothesis and the UNITE taxon name to group sequences into terminal taxa (UNITE_sh). For example, instead of having one terminal taxon Cortinarius_caesiocortinatus, the UNITE_sh set has two terminal taxa, Cortinarius_caesiocortinatus SH192002.06FU and Cortinarius_caesiocortinatus SH192062.06FU. Except for the grouping of sequences into terminal taxa, the sequences included in these two UNITE ref sets are identical.

For each of the ref sets described above, we constructed a unique set by removing any sequence that was identical to, or a substring of, another sequence in the same training set. Removing duplicates is important when evaluating the performance of a dataset, because duplicates inflate the accuracy results. The taxonomic composition and the number of sequences are listed in Table 1a. In addition to the common domain Fungi, the DOE_SFA set contains 1 sequence from each of three domains (Protozoa, Viridiplantae and Stramenopiles), and UNITE contains 56 sequences from the domain Protozoa. The vast majority of sequences in these three datasets cover the full ITS region, including ITS1, 5.8S and ITS2 (Table 1b).
Table 1a: Taxonomic composition at major ranks

Rank               Warcup    DOE_SFA   UNITE
domain (kingdom)   1         2         2
phylum             8         11        10
class              40        36        45
order              131       118       167
family             364       328       523
genus              1,620     1,134     2,135
species            8,967     NA        20,221*
Unique sequences   17,923    6,889     145,019

* UNITE_sh has 20,221 species-level taxa; UNITE_name has 10,346.

Table 1b: Completeness of the unique sequences

Completeness (%)   Warcup    DOE_SFA   UNITE
(Near) complete    95.2      94.6      97.5
Incomplete ITS1    2.4       2.7       1.2
Incomplete ITS2    2         2.2       1.1
Incomplete both    0.3       0.4       0.1
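The unique-set construction described above (dropping any sequence that is identical to, or a substring of, another sequence in the same set) can be sketched as follows. This is a minimal illustration, not the code used for the report. The function name `unique_set` and the ID-to-sequence dictionary input are assumptions made for the example, and the quadratic scan would need an index to be practical on a set as large as UNITE.

```python
def unique_set(seqs):
    """Drop sequences identical to, or a substring of, another sequence.

    seqs: dict mapping sequence ID -> sequence string (hypothetical input format).
    Returns a dict of the retained sequences, upper-cased.
    """
    # Visit the longest sequences first, so that any sequence containing
    # another one is already in `kept` when the shorter one is examined.
    # Ties are broken by ID to make the result deterministic.
    ordered = sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    kept = {}
    for seq_id, seq in ordered:
        s = seq.upper()
        # An identical sequence is also a substring, so one test covers both cases.
        if any(s in other for other in kept.values()):
            continue
        kept[seq_id] = s
    return kept
```

With four toy sequences where "c" duplicates "a" and "b" is contained in "a", only "a" and the unrelated "d" would be retained.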

Results

Commonality

We compared the three ref sets to measure the extent to which genera and sequences were shared between the different data sets (Tables 2a and 2b). UNITE is the largest set, containing 85% of the genera from Warcup and 73% of the genera from DOE_SFA. It also contains more than half of the sequences (GenBank accession numbers) from Warcup and DOE_SFA. Warcup is the second largest set, containing 69% of the genera from DOE_SFA and 64% of the genera from UNITE. The percent of sequences from the other sets found in either Warcup or DOE_SFA was 15% or less (Table 2b). The number of shared genera and shared sequences between each pair of ref sets is shown in a Venn diagram (Fig. 1).

Table 2a: Shared genera (percent of genera in the row set that are also present in the column set)

           Warcup    DOE_SFA   UNITE
Warcup     -         48%       85%
DOE_SFA    69%       -         73%
UNITE      64%       39%       -

Table 2b: Shared sequences (percent of sequences in the row set that are also present in the column set)

           Warcup    DOE_SFA   UNITE
Warcup     -         6%        66%
DOE_SFA    15%       -         56%
UNITE      8%        3%        -

Figure 1: Venn diagram of shared genera and shared sequences.

Taxa Similarity

We examined how close the sequences were within taxa and between taxa. Since no good multiple alignment methods are available for ITS, we used Sab scores as a measure of similarity between sequences.
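The Methods section defines the Sab score as the percent of shared 8-mers between two sequences. A minimal sketch of such a score is shown here. The report does not state the denominator, so dividing by the smaller number of unique 8-mers in either sequence is an assumption, borrowed from the way RDP SeqMatch describes its own score. The function names `kmers` and `sab_score` are illustrative.

```python
def kmers(seq, k=8):
    """Return the set of unique k-mers (overlapping words of length k) in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sab_score(a, b, k=8):
    """Percent of shared unique k-mers between sequences a and b.

    Assumption: the shared count is normalized by the smaller of the two
    unique k-mer sets; the report itself does not specify the denominator.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        # At least one sequence is shorter than k, so no words can be compared.
        return 0.0
    return 100.0 * len(ka & kb) / min(len(ka), len(kb))
```

Because the score is computed on sets of words, it needs no alignment, which is the property that makes it usable for ITS sequences.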

DOE_SFA does not group at the species rank; its median Sab score within genera is 56% and drops to 31% at the family rank (Fig. 2). Warcup showed the highest and tightest similarity within species, with a median of 96%. UNITE_sh has a median within-species similarity of 90%, with a large range from 72% (2nd percentile) to 99% (98th percentile). For both DOE_SFA and UNITE_sh, sequences grouped at the higher ranks were slightly less similar than in Warcup. UNITE_name has the lowest median within-species similarity, at 37%. The similarity between species (or higher ranks) was low for all the sets.

Figure 2: Box and whisker plots showing intra-taxa similarity (Sab score) for each major rank. The 1st quartile, median and 3rd quartile are shown as the bottom, middle and top of the box; the 2nd and 98th percentiles are indicated by whiskers. Panels, in clockwise order: Warcup, DOE_SFA, UNITE_name and UNITE_sh. Note that DOE_SFA does not group at the species rank.
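The per-rank similarity pools behind Figure 2 are built, according to the Methods, by adding each pairwise score to the lowest common ancestor taxon of the two sequences. The sketch below illustrates that pooling step only. The record format (a lineage tuple ordered from domain to species, paired with a sequence), the `RANKS` list and the function names are assumptions for the example, and the scoring function is passed in rather than fixed.

```python
from itertools import combinations

RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def lowest_common_rank(lin_a, lin_b):
    """Return the lowest rank at which two lineages still agree, or None."""
    lowest = None
    for rank, name_a, name_b in zip(RANKS, lin_a, lin_b):
        if name_a != name_b:
            break
        lowest = rank
    return lowest


def pool_scores(records, score):
    """Pool pairwise scores by the lowest common rank of each pair.

    records: list of (lineage, sequence) pairs (hypothetical format).
    score: function taking two sequences and returning a similarity value.
    """
    pools = {rank: [] for rank in RANKS}
    for (lin_a, seq_a), (lin_b, seq_b) in combinations(records, 2):
        rank = lowest_common_rank(lin_a, lin_b)
        if rank is not None:
            pools[rank].append(score(seq_a, seq_b))
    return pools
```

Under this scheme the species pool measures similarity within species, while the genus pool holds only pairs from different species of the same genus. The median and percentiles of each pool then give the box and whisker values.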

Leave-One-Out Testing

We performed both leave-one-sequence-out and leave-one-taxon-out testing on the three fungal ITS datasets. All Warcup and DOE_SFA sequences were used for testing. For the UNITE_sh and UNITE_name sets, one sequence from each species was chosen randomly as the query for these tests. Classification without a bootstrap cutoff was used for these accuracy measurements.

Warcup achieved 85% accuracy at the species level and 93% accuracy at the genus level. UNITE_sh showed 88% at the species level and 90% at the genus level (Fig. 3). DOE_SFA showed only 79% at the genus level. Our results for DOE_SFA differ from those in the publication describing that dataset (2): duplicate sequences were not removed from the training set in those tests, whereas they were removed for this report.

When a taxon was removed from the training set (leave-one-taxon-out testing), the accuracy at lower ranks (order, family, genus) decreased for all the data sets. For example, if the species was not present in the training set, the Classifier trained on the Warcup set assigned a sequence to the correct genus in 73% of the cases, but in only 58% of the cases when trained on the UNITE_sh set. If the genus was not present, the Classifier trained on the Warcup set made the correct family assignment 90% of the time, but only 77% of the time when trained on the UNITE_sh set and 60% when trained on the DOE_SFA set.

When tested on the UNITE_name ref set, constructed using the species name as the terminal taxon name, the Classifier showed only 74% accuracy at the species level and 80% at the genus level in leave-one-sequence-out testing. The accuracy of leave-one-taxon-out testing using the UNITE_name set was also worse than with the UNITE_sh set. Investigating the sequences misclassified during leave-one-out testing, we found that for every set except UNITE_name, the majority had their closest match (highest Sab score) to a sequence from a different taxon (Table 3).
Table 3: Percent of misclassified sequences with closest match in a different taxon

              # misclassified seqs   % with closest match in a different taxon
Warcup        2920                   66.3%
DOE_SFA       1347                   78.6%
UNITE_sh      3337                   58.2%
UNITE_name    5373                   30.1%
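The leave-one-sequence-out procedure, including the rule from the Methods that a singleton is skipped at any rank where no other sequence shares its taxon, can be sketched as follows. This is an illustration of the test loop only. The `nearest_neighbor` function is a simple stand-in based on shared 8-mers and is not the RDP naive Bayesian Classifier. The record format and all function names are assumptions, and a prediction is counted as correct at a rank only if the lineage matches down to that rank.

```python
from collections import Counter


def kmer_set(seq, k=8):
    """Return the set of unique k-mers in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def nearest_neighbor(query, training, k=8):
    """Stand-in classifier (NOT the RDP Classifier): return the lineage of
    the training sequence that shares the most k-mers with the query."""
    q = kmer_set(query, k)
    best = max(training, key=lambda rec: len(q & kmer_set(rec[1], k)))
    return best[0]


def leave_one_sequence_out(records, ranks, classify=nearest_neighbor):
    """records: list of (lineage tuple, sequence), lineage ordered as `ranks`.

    Returns {rank: (number correct, number tested)}.
    """
    correct, tested = Counter(), Counter()
    for i, (lineage, seq) in enumerate(records):
        # Remove the test sequence from the training set for this iteration.
        training = records[:i] + records[i + 1:]
        predicted = classify(seq, training)
        for depth, rank in enumerate(ranks):
            taxon = lineage[:depth + 1]
            # Singleton rule: skip this rank if no remaining sequence shares the taxon.
            if not any(lin[:depth + 1] == taxon for lin, _ in training):
                continue
            tested[rank] += 1
            correct[rank] += int(predicted[:depth + 1] == taxon)
    return {rank: (correct[rank], tested[rank]) for rank in ranks}
```

Leave-one-taxon-out testing would follow the same loop, except that every training record sharing the test sequence's lowest taxon is removed before classification, and accuracy is then measured only at the ranks above that taxon.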

Figure 3: Classification accuracy (y-axis, 30% to 100%) at each major taxon rank (x-axis, domain to species) from leave-one-out testing, for DOE_SFA, Warcup, UNITE_sh and UNITE_name. The RDP Classifier was trained on each of the four fungal ITS sets. No bootstrap cutoff was applied in the accuracy calculation. Top: leave-one-sequence-out testing. Bottom: leave-one-taxon-out testing.

Methods

Leave-one-sequence-out testing: in each iteration, one sequence from the training set was chosen as the test sequence and removed from the training set. The assignment produced by the Classifier for that sequence was compared to its original taxonomy label to measure the accuracy of the Classifier. A singleton sequence was still included in the accuracy calculation for a higher rank if its taxon at that rank contained multiple sequences. For example, if a sequence is the only sequence for a species, it is not included in the accuracy calculation for the species rank; but if this sequence belongs to a genus containing multiple species, it is included in the accuracy calculation for the genus rank.

Leave-one-taxon-out testing: this is very similar to leave-one-sequence-out testing, except that for each test sequence, the lowest taxon that the sequence was assigned to (either the species or the genus node) was removed from the training set. This tests how likely the Classifier is to assign a sequence to the correct genus or higher taxon when its species or genus is not present in the training set.

Sab score: the percent of shared 8-mers between two sequences. This is the same score as the one calculated by RDP SeqMatch, except that the latter uses 7-mers. We used 8-mers here because the Classifier performs best using 8-mers when trained on 16S rRNA datasets.

Taxa Similarity: for each pair of sequences from a set, we calculated the Sab score and added the score to the lowest common ancestor taxon of the two sequences. For example, if the two sequences were from the same species, the Sab score was added to the species pool to measure how close sequences are within species. If they were from the same genus but not from the same species, the Sab score was added to the genus pool to measure how close sequences are between species. The Sab scores for each rank were used to generate the box and whisker plots.

Completeness Measurement: sequence records were retrieved from GenBank using the GenBank accession numbers from all three ref sets. Only sequences with both the features "internal transcribed spacer 1" and "internal transcribed spacer 2" were considered complete, and the corresponding sequence regions were kept. This resulted in 13,912 complete reference sequences (called the COMBO set).
For each query sequence in each of the ref sets, the pairwise alignment between the query and the COMBO set sequence with the best alignment score was used to determine completeness. A query was marked Incomplete ITS1 if the query alignment contained at least 50 inserts at the beginning, Incomplete ITS2 if it contained at least 50 inserts at the end of the alignment, or Incomplete both if it met both conditions.

References

1. Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naïve Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 73(16):5261-7.
2. Porras-Alfaro A, Liu KL, Kuske CR, Xie G. 2014. From genus to phylum: large-subunit and internal transcribed spacer rRNA operon regions show similar classification accuracies influenced by database composition. Appl Environ Microbiol. 80(3):829-40.
3. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor A, Bahram M, Bates ST, Bruns TD, Bengtsson-Palme J, Callaghan TM, et al. 2013. Towards a unified paradigm for sequence-based identification of fungi. Molecular Ecology 22:5271-5277.