Briefings in Bioinformatics Advance Access published May 29, 2012


BRIEFINGS IN BIOINFORMATICS. doi: /bib/bbs023

Review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance

Kian Guan Lim, Chee Keong Kwoh, Li Yang Hsu and Adrianto Wirawan

Abstract

The prevalence of tandem repeats in eukaryotic genomes and their association with a number of genetic diseases have raised considerable interest in locating these repeats. Over the last 10-15 years, numerous tools have been developed for searching tandem repeats, but differences in the search algorithms adopted and difficulties with parameter settings have confounded many users, resulting in widely varying results. In this review, we have systematically separated the algorithmic aspect of the search tools from the influence of the parameter settings. We hope that this will give a better understanding of how the tools differ in algorithmic performance, what their inherent constraints are and how one should approach evaluating and selecting them.

Keywords: microsatellites; minisatellites; repeat finder; tandem repeat; method of comparison

INTRODUCTION

Tandem repeats are repeated sequences that occur not only sequentially but also directly adjacent to each other. They are common in eukaryotes. The human genome alone contains approximately repeats constituting close to 7% of the genome [1]. There has been considerable interest in finding these repeats as their under- or over-representation is associated with a wide range of human diseases including Fragile-X syndrome, Huntington's disease, Friedreich's ataxia, certain forms of cancer [1] and more than 40 other neurological, neurodegenerative and neuromuscular diseases [2, 3].

Definition of tandem repeats

Normally, the repeats are classified into microsatellites (also known as short tandem repeats or simple sequence repeats), minisatellites and satellite DNA [1] according to their period size.
They can be further categorized into perfect or imperfect repeats (also called exact or approximate repeats) depending on whether they are exact copies of one another or deviate by 1 bp or more due to mutational mismatch, insertion or deletion. Microsatellites are the most prevalent and tend to be perfect repeats [4]. There is no consensus on the precise definition of their period size. It is generally agreed to be <10 bp, but many studies consider only 2-6 bp as microsatellites [1, 4-9]. Repeats with period size >10 bp are normally known as minisatellites, and those >100 bp are sometimes referred to as satellite DNA [10]. Unlike microsatellites, there is limited variability in the length of minisatellites and satellite DNA because of mutations within the sequences. They are hence mostly imperfect rather than exact repeats.

Corresponding author: Kian Guan Lim. Division of Software and Information Systems, School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore. E-mail: kglim1@e.ntu.edu.sg

Kian Guan Lim is a part-time master's student in the School of Computer Engineering, Nanyang Technological University, Singapore, studying for a Master of Science (MSc) in Bioinformatics. He pursued his undergraduate studies at the same university and graduated with a Bachelor's Degree in Electrical Engineering.

Chee Keong Kwoh is an Associate Professor in the Division of Information Systems, School of Computer Engineering, Nanyang Technological University, Singapore.

Li Yang Hsu is an Assistant Professor at the Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore. He is also a consultant at the Department of Medicine at National University Hospital.

Adrianto Wirawan is a Post-Doctoral Research Fellow at the Johannes Gutenberg University Mainz in Germany. Before that he was a Research Fellow at Yong Loo Lin School of Medicine in the National University of Singapore, Singapore. He holds a PhD in Computer Engineering from Nanyang Technological University, Singapore.

© The Author. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com
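The period-size classes given under "Definition of tandem repeats" can be written as a small lookup. This is only a sketch: the function name and the two cut-off parameters are hypothetical, and because the text leaves a period of exactly 10 bp unassigned, placing it with the minisatellites is an assumption made here for illustration.

```python
def classify_period(period_bp, micro_max=9, mini_max=100):
    """Assign a tandem repeat to a class by its period size in bp.

    Cut-offs follow the text: microsatellites <10 bp, satellite DNA >100 bp.
    Treating exactly 10 bp as a minisatellite is an assumption, since the
    text only says ">10 bp" for minisatellites.
    """
    if period_bp < 1:
        raise ValueError("period size must be at least 1 bp")
    if period_bp <= micro_max:
        return "microsatellite"
    if period_bp <= mini_max:
        return "minisatellite"
    return "satellite DNA"


if __name__ == "__main__":
    for period in (3, 25, 150):
        print(period, classify_period(period))
```

Studies that restrict microsatellites to 2-6 bp could pass `micro_max=6` instead; the lack of consensus noted above is the reason the cut-offs are parameters rather than constants.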

Tandem repeat search tools

Given the considerable interest in tandem repeats, it is not surprising to find a rich abundance of bioinformatics tools developed for detecting and characterizing them. Two surveys conducted by Sharma et al. [5] and Merkel and Gemmell [6] have revealed more than 25 tools in existence, and newer ones continue to emerge.

Repeat detection

Most of the search tools adopt a two-phase approach to detecting tandem repeats: search and filtering [6, 11]. In the first phase, one of two types of algorithms is commonly implemented: (i) a combinatorial search algorithm that exhaustively sections the entire sequence into subsequences and compares every adjacent subsequence to identify potential tandem repeats for all positions and all period sizes [12-16] or (ii) a statistical- or heuristic-based algorithm that uses small segment windows to scan through the sequence to first detect the possible repetitive loci (normally small perfect repeats) before associating them to determine the longer repeats [17-20]. The time complexity for the combinatorial algorithm grows exponentially, especially when the search involves imperfect and overlapping repeats [11, 21]. This could be reduced to near-polynomial time [11] with the heuristic- or statistical-based approach, but the comprehensiveness of the search, which depends on the choice of window size [17, 18], is hard to guarantee, a major criticism of this approach [6, 7].

Filtering for significant repeats

In the second phase, various forms of filtering methods are used to identify and extract the biologically significant repeats. These can be a simple length filter [12] or a more complex implementation of heuristic-based (such as k-means clustering [22, 23]) or statistics-based [17, 18] models.
In addition, alignment algorithms such as wraparound dynamic programming [17, 18] or the star-centered algorithm [20] are also employed to determine the imperfect repeats based on the given mismatch, insertion and deletion (indel) threshold settings. For most of these tools, the users are allowed to adjust the filtering parameters to suit their specific application, but the reported results are highly sensitive to these settings [1, 5-7]. Moreover, different developers use or recommend different settings to detect what they consider as biologically significant repeats [1], thus making it difficult for users to understand their intricate effects.

Previous reviews of tandem repeat search tools

Leclercq et al. [7] were the first to point out how the parameter settings affect the distribution of the detected repeats in numbers, length and divergence. Through the in silico analysis of five widely used tools (TRF, Mreps, Sputnik, STAR and RepeatMasker), they found that the treatment of minimum length has the most effect on the number of repeats detected, with an 80-fold difference between the two extremes [7]. Alignment weights and minimum score settings are reported to have a striking influence, resulting in large discrepancies [7]. The review by Merkel and Gemmell [6] further highlighted that the tools varied widely in the repeat definition, search algorithm and filtering method, resulting in various biases. In comparing five commonly used repeat search tools (TRF, Mreps, Sputnik, Msatfinder and TROLL), they also found significant divergence of up to three orders of magnitude due to parameter settings [6]. Minimum length has the most significant effect on repeat detection, followed by the minimum scores and alignment weights used [6].
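To make the combinatorial approach described under "Repeat detection" concrete, the following is a minimal sketch of an exhaustive search for perfect tandem repeats only. It is not the algorithm of any of the reviewed tools: the function name, the `max_period` and `min_length` parameters and the 0-based half-open coordinates are all assumptions made for illustration. For each period size it scans the sequence and extends a run for as long as every base equals the base one period earlier.

```python
def find_perfect_repeats(seq, max_period=6, min_length=8):
    """Exhaustively list maximal perfect tandem repeats.

    Returns (start, end, period, copies) tuples with 0-based, end-exclusive
    coordinates. A run is kept only if it spans at least two full copies
    and at least min_length bases.
    """
    hits = []
    n = len(seq)
    for period in range(1, max_period + 1):
        i = 0
        while i + period < n:
            j = i + period
            # Extend while each base matches the base one period before it.
            while j < n and seq[j] == seq[j - period]:
                j += 1
            length = j - i
            if length >= 2 * period and length >= min_length:
                hits.append((i, j, period, length / period))
                # The next run with this period cannot start before here.
                i = j - period + 1
            else:
                i += 1
    return hits


if __name__ == "__main__":
    for hit in find_perfect_repeats("GGATTACATTACATTACGG", min_length=10):
        print(hit)
```

Because every period size is tried independently, the same locus can be returned several times with different periods. The second, filtering phase described above exists to deal with exactly that kind of redundancy.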
Separation of parameter setting biases

While the two reviews report the search performance as being highly sensitive to the parameter settings, for an objective evaluation of different tools one should ideally separate the influence of the parameter settings from the tool's intrinsic performance to assess its individual capability to find relevant repeats. The parameter settings should not confound the search results when making tool comparisons. However, this is not an easy task given the coupled nature of the settings to the search algorithm, their high sensitivity and, in some cases, the lack of resolution for fine adjustments.

Redundancy of repeats

In addition, neither review gives much attention to how the tools account for the redundancy of repeats, yet another source of discrepancy in repeat search. In many cases, overlapping repeats result from the detection of multiple similar repeats in or near the same locus. While some of the overlaps could be genuine repeats blended together [1, 4, 10], it is hard to make a clear distinction between those

due to program artifacts and the biologically significant repeats (except for some simple cases such as when two repeats are reported with the same starting position). Thus, tools like INVERTER [23] prefer to report all the overlapping repeats and leave the user to do the filtering, while others like TRF and Mreps selectively filter out and report only what the programs consider as the most credible overlapping repeats. This causes large disparities in the reporting of overlapping repeats among different tools.

Our contribution

This review attempts to complement and update the previous reviews by addressing the concerns highlighted above and incorporating the most recently developed tools. To separate the varying influence of parameter settings from the algorithmic aspect, we have used only perfect repeats and compared the results obtained from searches using the default settings of each tool. Aside from the repeat length and period size detected, we also looked at redundancy of repeats, a factor that has previously been overlooked. Finally, the review extends the comparison from a single genome to multiple genomes to study the effects of genome composition.

Tools for evaluation

Seven tandem repeat search tools have been selected for this review. Three of them are older but widely used search tools: TRF [17], Mreps [12] and Sputnik [14]. The other four are newer tools that have been developed within the last 5 years: ATRHunter [18], imex [13], T-reks [20] and INVERTER [23]. Sputnik, Mreps, imex and INVERTER use the combinatorial approach, whereas the other three adopt the statistical- or heuristic-based method. Table 1 summarizes their key features.

METHODS

A systematic method of evaluating the algorithmic performance

To assess the algorithmic performance of the tools and understand their inherent constraints, there is a need to minimize the biases caused by the parameter settings (Table 2).
Here, we use a systematic approach in which we first evaluate the tools' performance under default settings, sequentially filtering for non-redundant (non-overlapping) and perfect repeats. In a second run, we then evaluate the tools' performance using parameter settings chosen to maximize the detection of perfect repeats. Perfect repeats provide a good model for studying the intrinsic tool performance, as these repeats are precise and not subject to varying interpretations. For example, there is only one precise interpretation of the repetitive sequence (ATTACATTACATTAC) as a perfect repeat, which is (ATTAC)3. This makes it easier to control and standardize the parameter settings when making tool comparisons, as these are then normally limited to repeat length and period size. In contrast, the interpretation of imperfect repeats is highly variable and depends also on the mismatch and indel parameter settings. It is difficult to find suitable settings among so many parameters to minimize their influence on imperfect repeat search so as to have an objective comparison.

Extraction and comparison of relevant repeats

A filtering process is used to extract only the relevant repeats from the collected results for comparison. To extract the perfect repeats for tools without a perfect repeat setting, such as TRF, the results are filtered for repeats with a match percentage of 100% and an indel percentage of zero. Similarly, for ATRHunter, only repeats with a score of 1 (indicating a perfect match) are considered. Also, to maintain consistency for tool comparison, we only considered non-overlapping repeats, since it is difficult to establish a good benchmark for the appropriate amount of overlap, and not all the tools handle overlapping repeats in the same manner. The non-overlapping repeats in this study refer only to repeats with a starting position higher than the ending position of the previous repeat.
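The two extraction steps above can be sketched as a pair of filters applied in sequence. The record layout (dictionaries with `start`, `end`, `percent_matches` and `percent_indels` keys) and the function names are hypothetical, and none of the tools is claimed to emit this format. The sketch also assumes that "the previous repeat" means the most recently kept repeat after sorting by start position.

```python
def non_overlapping(repeats):
    """Keep a repeat only if it starts after the end of the last kept one."""
    kept = []
    last_end = float("-inf")
    for rep in sorted(repeats, key=lambda r: (r["start"], r["end"])):
        if rep["start"] > last_end:
            kept.append(rep)
            last_end = rep["end"]
    return kept


def perfect_only(repeats):
    """Keep repeats reported with 100% matches and no indels."""
    return [
        rep for rep in repeats
        if rep["percent_matches"] == 100 and rep["percent_indels"] == 0
    ]


def select_relevant(repeats):
    """Apply the overlap filter first, then the perfect-repeat filter."""
    return perfect_only(non_overlapping(repeats))


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    example = [
        {"start": 1, "end": 20, "percent_matches": 100, "percent_indels": 0},
        {"start": 1, "end": 24, "percent_matches": 95, "percent_indels": 0},
        {"start": 30, "end": 55, "percent_matches": 100, "percent_indels": 0},
        {"start": 50, "end": 80, "percent_matches": 100, "percent_indels": 0},
    ]
    print([rep["start"] for rep in select_relevant(example)])
```

The order of the two filters matters: filtering for perfect repeats first could retain a repeat that the overlap filter would otherwise have discarded in favour of an earlier imperfect one.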
Settings for maximized perfect repeat detection

For the second run, to maximize the perfect repeat search, we simply set the resolution parameter to zero for Mreps; for T-reks, the similarity level was set to 100% with the indel percentage set to zero; and for INVERTER, imex and Sputnik, we selected the option to perform a perfect repeat search (for Sputnik, simply include -r 0 in the command line argument). However, not all the tools have the option to select perfect repeat search directly. For those that do not, the match score, mismatch and indel (insertion and deletion) penalties and the minimum score were adjusted to maximize the search for perfect repeats

with minimum filtering of the exact repeats. In TRF, this was set by adjusting the alignment weights to the extreme values of 2, 50 and 50 for match score, mismatch penalty and indel penalty, respectively, and the minimum score was pushed down to zero to allow reporting of all repeats detected indiscriminately. Similarly, the match score, mismatch penalty, indel penalty and minimum score for ATRHunter were set to 1, 50, 50 and 50, respectively.

Table 1: Comparison of tandem repeat search tool features

Columns: Tools; Search algorithm; Period size coverage (bp); Min. repeat length (bp); Imperfect repeat detection (Mismatch, Indel); Reporting of overlapping repeats; Remarks

TRF Heuristic 1- a Yes Yes Yes (up to 3) Win-DOS Ver
Mreps Combinatorial 1- 6 10 b Yes No No DOS version
ATRHunter Heuristic 1- a Yes Yes Yes (up to 3) DOS version
INVERTER Combinatorial 1-200 2 c No No Yes (all) Executable JAR
T-reks Heuristic 1-24 9-14 d Yes Yes Yes (all) Executable JAR
imex Combinatorial 1-5 e Yes Yes No Web-based
Sputnik Combinatorial 2-5 5 a Yes No No DOS version

a Obtained by setting the minimum score to zero or the minimum possible value.
b p + 9 where p is period size (default: P = 1).
c Depends on period size setting.
d 9 bp for mono-nucleotide repeats and 14 bp for the rest.
e No restriction, depends on user settings.

Table 2: Default or recommended settings of the evaluated tandem repeat search tools

TRF
Default or recommended settings: Match score = 2, mismatch penalty = 7, indel penalty = 7, minimum alignment score = 50, maximum period size = 500.
Key implications: With the minimum alignment score of 50, only repeats ≥25 bp are reported. (Note: Inherent within the program, TRF limits the similarity level to 80% and the acceptable indel to only 10%.)

Mreps
Default or recommended settings: Res = 0.
Key implications: Only perfect repeats with a minimum length of 10 bp are reported. The minimum length of 10 bp is inherent within Mreps unless the option -allowsmall is selected.

ATRHunter
Default or recommended settings: Match score = 2, mismatch penalty = 7, indel penalty = 7, minimum alignment score = 50, maximum period size = 500.
Key implications: Fifty is the lowest setting for the minimum alignment score, but it does not appear to have any effect on the minimum repeat length for reporting.

INVERTER
Default or recommended settings: Minimum period size = 9, subset option = ON.
Key implications: Overlapping repeats will be reported.

T-reks
Default or recommended settings: Similarity level = 0.65, percent of indel = 20%, overlapping repeats are allowed to be reported.
Key implications: This is less stringent than the TRF inherent limits.

imex
Default or recommended settings: Minimum repeat number: mono = 5, di = 3, tri = 2, tetra = 2, penta = 2, hexa = 2. Imperfection limit per repeat unit: mono = 1, di = 1, tri = 1, tetra = 2, penta = 2, hexa = 3. Overall imperfection per tract = 10%.
Key implications: The program will only report repeats that meet these criteria.

Sputnik
Default or recommended settings: Match bonus = 1, mismatch penalty = 10, minimum score = 10, fail score = 1.
Key implications: Not much is known of Sputnik's default settings. The values appeared to be as indicated. The use of the fail score is not well documented, but it is suspected to be used to terminate the search once the threshold is reached. With the minimum score of 10, only repeats 10 bp or longer are reported. Also, the high mismatch penalty seems to filter off all imperfect repeats.

Repeat length constraint

We had to account for the different minimum repeat length constraints imposed by the different tools. TRF, imex and Sputnik have a minimum repeat

length of 5 bp. For Mreps, INVERTER and T-reks, it is 9 bp. For ATRHunter, it is 17 bp. To have an objective comparison, we arbitrarily selected a minimum length of 20 bp for this study. This is not far from most practices, as many biologically significant repeats are typically longer, especially those related to disease [1].

Intra- and inter-genomic comparison of tools' performance

Both runs using (i) default parameter settings and (ii) settings for maximized repeat detection were conducted using the Saccharomyces cerevisiae S288c genome as the reference dataset. The S. cerevisiae S288c genome dataset was obtained from the NCBI website (dated 16 June 2011) with the accession numbers: NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ and NC_ for Chromosome 1 to 16 and the Mitochondria, respectively.

We further extended the maximized perfect repeat search to other genomes to check for consistency of tool performance across species with different genome sizes, both eukaryotes and bacteria. The Caenorhabditis elegans (NC_ , NC_ , NC_ , NC_ , NC_ , NC_ and NC_ for Chromosome 1 to 6 and the Mitochondria, respectively), Oryza sativa Japonica (NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ and NC_ for Chromosome 1 to 12, the Mitochondria and the plasmid, respectively), and Homo sapiens (NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ and NC_ for Chromosome 1 to 22, X, Y and the Mitochondria, respectively) genomes are used as the dataset for the inter-genomic comparison among the eukaryotic species, in addition to S. cerevisiae S288c. For the bacteria, two common variants each of Acinetobacter baumannii and Escherichia coli are used: A. baumannii 0057 (NC_ ), A. baumannii SDF (NC_ ), E. coli O157:H7 (NC_ ) and E. coli K-12 (NC_ ).
Similarly, all genome data were obtained from the NCBI website (dated 16 June 2011).

RESULTS

Intra-genomic comparison of tools' performance

The results of the maximized perfect repeat runs and the default setting runs using the S. cerevisiae S288c genome as the reference dataset are summarized in Table 3, with the detailed distribution of repeats by repeat length and period size given in Tables 4 and 5, respectively. They contain (i) the set of detected repeats from the maximized perfect repeat runs, (ii) the set of detected repeats as collected based on the tool's default settings, (iii) the proportion of detected repeats that are non-overlapping and (iv) the proportion of non-overlapping repeats that are perfect repeats.

Algorithmic performance of the tools in repeat detection

From Table 3, we clearly see that the differences in repeat detection among the tools with the maximized perfect repeat settings are smaller ( repeats between the two extremities) when compared with the two orders of magnitude differences with the default settings (from 2058 to as high as repeats). This shows that there is no marked difference in the algorithmic performance of the tools when the influences of the minimum length, mismatches and indels are removed. While the total number of detected perfect repeats appears much lower than that in the default settings, it becomes clear from the distribution of the repeats (Table 4) that this is due to the restriction of the minimum length, which is arbitrarily set to 20 bp. When comparing repeat length by repeat length, the number of detected perfect repeats in the maximized perfect repeat runs is equal to, if not more than, that of the default settings. This shows that the tandem repeat search tools can be optimized to detect a higher number of perfect repeats with the appropriate settings. The results also show that comparing the total number of detected repeats alone precludes us from understanding the true performance of the search tools.
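The point that totals alone can mislead is easy to see with a length-by-length tally. The sketch below is illustrative only: the function name is hypothetical and the repeat lengths are made-up numbers, not values from Table 4. It counts repeats per length in two runs and compares only the lengths at or above the 20 bp cut-off used in this study.

```python
from collections import Counter


def compare_by_length(default_lengths, maximized_lengths, min_length=20):
    """Return {length: (default_count, maximized_count)} for lengths >= min_length."""
    default_counts = Counter(n for n in default_lengths if n >= min_length)
    maximized_counts = Counter(n for n in maximized_lengths if n >= min_length)
    lengths = sorted(set(default_counts) | set(maximized_counts))
    # Counter returns 0 for a length that never occurred in a run.
    return {n: (default_counts[n], maximized_counts[n]) for n in lengths}


if __name__ == "__main__":
    # Hypothetical repeat lengths (bp) for one tool under the two runs.
    default_run = [10, 12, 12, 20, 25]
    maximized_run = [20, 21, 25]
    print(len(default_run), len(maximized_run))
    print(compare_by_length(default_run, maximized_run))
```

In this made-up case the default run has the larger total, yet at every length of 20 bp or more the maximized run detects at least as many repeats, which is the pattern the text describes.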
It is important to note that while tools like Mreps, TRF, INVERTER and ATRHunter seem to

Table 3: Summary of intra-genomic comparison between maximized perfect repeat settings and the tools' default settings using S. cerevisiae S288c ( bp) as the dataset

Columns: Tools | Number of detected repeats with maximized perfect repeat settings (b) | Number of detected repeats using the tools' default settings: As collected | Non-overlapping repeats, as per default settings (%) | Perfect repeats, as per default settings (%) | Perfect repeats, repeat length >9 bp (a) (%)

Mreps | | | (88.4) | 6853 (88.4) | 6853 (88.4)
TRF | | | (64.9) | 333 (16.2) | 333 (16.2)
INVERTER | | | (13.7) | 1726 (13.7) | 1726 (13.7)
ATRHunter | | | (78.0) | 834 (6.3) | 834 (6.3)
T-reks | | | (15.8) | 2807 (0.3) | 1831 (0.2)
imex | | | (97.9) | (93.1) | (4.9)
Sputnik | | | (100.0) | 2625 (100.0) | 2625 (100.0)

Values in parentheses show the number of repeats in each column as a percentage of the number of repeats as collected (third column).
a Perfect repeats with repeat length >9 bp.
b The settings to maximize the detection of perfect repeats for tool comparison are as follows: for Mreps, the resolution parameter was set to zero. For Sputnik, the option -r 0 was included. For T-reks, the similarity level was set to 100% with the indel percentage set to zero. The Perfect Repeat Search option was selected for INVERTER and imex. For TRF and ATRHunter, where there is no direct setting, the least stringent criterion was applied to allow all possible lengths of exact repeats to be reported with no or minimum filtering. Specifically, in TRF, the alignment weights were set to 2, 50 and 50 for match score, mismatch penalty and indel penalty, respectively, and the minimum score was set to zero. The match score, mismatch penalty, indel penalty and minimum score for ATRHunter were set to 1, -50, -50 and 50. The minimum length was arbitrarily set to 20 bp to objectively compare across different tools, and the results were filtered for overlapping repeats.
outperform imex and Sputnik in the total number of detected perfect repeats, this is in fact due to the inherent limitation of imex and Sputnik to detect only short-period repeats: 1-6 and 2-5 bp, respectively. When one compares period size by period size, both perform as well as the other tools in detecting the respective repeats. This is further supported when we examine the consistency of their performance for the other genomes (Tables 8 and 9).

Handling of overlapping repeats

Table 3 clearly shows how the tools' performances under default settings are strongly influenced by differences in the way that the redundancy of repeats is handled. Most of the reported overlapping repeats are due to program artifacts rather than genuine overlaps occurring naturally. It has been a challenge for most algorithms to account for and extract the most representative overlapping repeats. For example, a program can identify a repetitive sequence (ATATATATAT) as (AT)5 or (ATAT)2.5. Clearly, both account for the same repeat, and which one to report depends subjectively on the algorithm design. This is especially a concern for longer repeats that admit numerous short periods. Hence, we see a large discrepancy in the number of overlapping repeats reported by different tools. For tools like T-reks and INVERTER where there is little or no filtering, we found an excessive number of these repeats detected; T-reks detected >84% of its repeats as overlaps and INVERTER had >86%. Many of the repeats occur at the same loci with different repeat lengths and period sizes. For tools like Mreps, ATRHunter and imex that filter out the non-relevant overlapping repeats, the difference between the overlaps and non-overlaps is significantly smaller, in the range of only %. TRF reports a higher number of overlapping repeats (35.1%) since it allows up to three overlapping repeats to be reported.
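The (AT)5 versus (ATAT)2.5 ambiguity mentioned above can be checked directly. The following sketch uses a hypothetical helper name and is not taken from any of the tools. It lists every period, up to half the sequence length, under which a sequence is a perfect repeat, together with the implied copy number.

```python
def valid_periods(seq):
    """List (period, copies) pairs under which seq is a perfect tandem repeat."""
    n = len(seq)
    return [
        (period, n / period)
        for period in range(1, n // 2 + 1)
        # seq has this period if every base equals the base one period later.
        if all(seq[i] == seq[i + period] for i in range(n - period))
    ]


if __name__ == "__main__":
    print(valid_periods("ATATATATAT"))
    print(valid_periods("ATTACATTACATTAC"))
```

For ATATATATAT both period 2 (five copies) and period 4 (two and a half copies) qualify, so a tool that reports every consistent period returns two overlapping hits for a single locus. For ATTACATTACATTAC only period 5 qualifies, which matches the earlier observation that (ATTAC)3 has a single interpretation.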
Sputnik, by its inherent algorithmic design, prevents the search of repeats at the same loci, and hence does not report any overlapping repeats.

Repeat length and period size constraints

It is also clear from Tables 4 and 5 that repeat detection is heavily biased by the tools' minimum length setting and their constraints on period size. This reinforces the points made by the two previous reviews that one has to be careful in the choice

Table 4: Detected repeats (distributed by repeat length) on the S. cerevisiae S288c genome using default settings and compared with maximized perfect repeat run

Columns: repeat length (bp), ending with a >30 bin and a Total column.
Rows, for each of Mreps, TRF, INVERTER, ATRHunter, T-reks, imex and Sputnik: Default settings (As collected; Non-overlap; Non-overlap, Perfect) and Maximized perfect repeat run.

INVERTER, default settings
As collected: - - - - - - - - - - - - - 2783 - 1582 - 711 - 1718 - -
Non-overlap: - - - - - - - - - - - - - 795 - 319 - 82 - 236 - -
Non-overlap, Perfect: - - - - - - - - - - - - - 795 - 319 - 82 - 236 - -

Sputnik
As collected: - - - - - 302 - 990 - - 176 - - -
Non-overlap: - - - - - 302 - 990 - - 176 - - -
Non-overlap, Perfect: - - - - - 334 - 1095 - - 134 - - -

As collected: results as collected using the tools' default settings; Non-overlap: filtered for overlapping repeats; Non-overlap, Perfect: only non-overlapping perfect repeats.

Table 5: Detected repeats (distributed by period size) on the S. cerevisiae S288c genome using the default settings and compared with maximized perfect repeat run

Columns: period size (bp), ending with a >20 bin and a Total column.
Rows, for each of Mreps, TRF, INVERTER, ATRHunter, T-reks, imex and Sputnik: Default settings (As collected; Non-overlap; Non-overlap, Perfect) and Maximized perfect repeat run.

Total column, where available:
T-reks, Non-overlap, Perfect: 2807
imex, Maximized perfect repeat run: 583
Sputnik, As collected: 2625
Sputnik, Non-overlap: 2625
Sputnik, Non-overlap, Perfect: 2601
Sputnik, Maximized perfect repeat run: 398

As collected: results as collected using the tools' default settings; Non-overlap: filtered for overlapping repeats; Non-overlap, Perfect: only non-overlapping perfect repeats.

of minimum length and period size when searching for repeats. From Table 4, we note that the default minimum score and penalty settings of TRF limit the program to only repeats of length >25 bp, hence giving the impression that the program has underperformed compared with the other tools. However, it is possible to lower the minimum length for TRF (by adjusting its match score and minimum threshold score) to detect repeats as short as 5 bp (data not shown here as we only report repeats with length ≥20 bp). On the other hand, imex reported an overwhelming number of repeats <10 bp, which are of little or no biological significance [24]. In fact, >90% of the imex-detected repeats can be eliminated if one considers only repeats >10 bp as statistically significant [24]. Tools like ATRHunter, INVERTER and Sputnik are further limited by their inability to handle non-integer repetitions. As a result, we noticed sharp drops in the number of repeats detected when the length of the repeat was a prime number (e.g. 11, 13, 17, 19 and 23). Table 5 further reveals the constraint of the search tools in period size, particularly for INVERTER, imex and Sputnik. For INVERTER, the minimum period size could be adjusted to as low as 3 bp, but the default was set to 9 bp as the tool was specifically designed for detecting inverted repeats that are usually >10 bp [1]. For imex and Sputnik, the period sizes are fixed at 1-6 and 2-5 bp, respectively, as they are designed for microsatellites.

Biases in settings for imperfect repeat detection

Finally, strong biases were present in the tools' mismatch and indel settings for the detection of imperfect repeats. For INVERTER, Mreps and Sputnik, the default settings allowed only for the detection of perfect repeats.
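One plausible reading of the drop at prime repeat lengths noted above is simple arithmetic: if a tool reports only whole-number copies and does not search period 1, a repeat of total length L must factor as period × copies with both factors at least 2, and a prime L has no such factorization. The sketch below only illustrates that arithmetic. The function name and the default minimum period of 2 are assumptions, and it does not model how any specific tool truncates or rejects such repeats.

```python
def integer_decompositions(length, min_period=2, max_period=None):
    """List (period, copies) pairs with period * copies == length and copies >= 2."""
    if max_period is None:
        max_period = length // 2
    return [
        (period, length // period)
        for period in range(min_period, max_period + 1)
        if length % period == 0 and length // period >= 2
    ]


if __name__ == "__main__":
    for length in range(20, 25):
        print(length, integer_decompositions(length))
```

Under these assumptions a length of 23 bp admits no whole-copy decomposition at all, whereas 24 bp admits six, so counts binned by exact length would be expected to dip at primes.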
At the other extreme, T-reks's settings of 65% for mismatch and 20% for indels retrieved an excessive number of repeats, resulting in a 50-fold difference between the non-overlapping repeats and those filtered for perfect repeats. For ATRHunter and TRF, these differences were only 12 and 4 times, respectively. It is difficult to compare and make sense of these results given that a one-to-one correspondence between the settings is impossible. For example, when we set the mismatch similarity of T-reks to 80% to have an equivalent setting with TRF and searched for repeats >25 bp, T-reks managed to detect only 65 repeats, much lower than the 1173 repeats detected by TRF. This was also far fewer than the reported bp or longer repeats when the similarity level was set to 65%.

Inter-genomic comparison of tools' performance

Tables 6 and 7 summarize the tools' performances for maximized perfect repeat detection when compared across different species of eukaryotes and bacteria. The distributions of the repeats by period size are shown in Figures 1 and 2, with the detailed breakdown of the repeats by period size given in Tables 8 and 9.

Consistency of algorithmic performance

From Tables 6 and 7, one can observe the consistency of performance among the tools across the different species. Mreps appeared to rank consistently on top, followed by TRF, INVERTER, ATRHunter, T-reks, imex and Sputnik. The only two exceptions are between T-reks and imex for H. sapiens and O. sativa Japonica, which are rich in 1-6 bp microsatellites. This consistency in tool performance across different species is encouraging, as it shows that the ranking of the tools is not influenced by the different genome sizes or by differences in the distribution and representation of the repeats among the different species.
However, tools like imex that are designed for short repeat search tended to perform better for genomes rich in microsatellites, whereas tools like ATRHunter that are optimized for longer repeats performed better for genomes rich in such repeats, e.g. C. elegans. TRF and Mreps, on the other hand, seemed to perform equally well in both situations as tools for more generalized usage. It is hence instructive to use more than one tool, with at least one optimized for microsatellites and another for finding longer repeats, for ab initio detection of repeats in genomes that are not well characterized.

Overall, we observed that the tools can be grouped by their performance. The four top ranked tools have performance close to each other, within a range of 25% of repeat detection. Mreps and TRF are closer to each other (no more than 12% difference) than to the rest. INVERTER and ATRHunter fall in the second best performance group, lower than Mreps and TRF by 9-26% on average, and generally within 4% of each other. The last group, which

includes T-reks, imex and Sputnik, exhibited more varied performances. The results also showed that, from an algorithmic viewpoint, there is no significant difference in the detection of tandem repeats between the combinatorial approach and the statistic-/heuristic-based approach. TRF, which adopts the statistic-/heuristic-based approach, performed equally well compared with Mreps, which uses the combinatorial search. While there seems to be a performance demarcation between the first two groups, i.e. Mreps and TRF versus INVERTER and ATRHunter, they are not as distinct as one might think. In the case of the H. sapiens genome, the results were so similar that they could have been grouped together. The trends are also not obvious when examining their period size distribution (Table 8). There are various confounding effects worth highlighting here. For example, for INVERTER, we placed a minimum period size limit of 3 bp to shorten the search time. This did not affect the overall results, as the mono- and di-nucleotide repeats are accounted for in the longer period size repeats. For instance, in the S. cerevisiae S288c genome, INVERTER reports four to six times as many 4 bp repeats as the other tools. In fact, >75% of these repeats were di-nucleotide repeats that were counted as repeats in the 4 bp period size.

Table 6: Results of maximized perfect repeat runs for selected genomes of eukaryotic species

Detected repeats (percentage of Rank 1)

Tools       Saccharomyces cerevisiae S288c   Caenorhabditis elegans   Oryza sativa japonica   Homo sapiens
Size (bp)
Mreps       1501 (100)                       (100)                    (100)                   (100)
TRF         1325 (88)                        23517 (96)               56493 (96)              (95)
INVERTER    1169 (78)                        (81)                     (83)                    (90)
ATRHunter   1115 (74)                        (79)                     (81)                    (87)
T-reks      643 (43)                         6081 (25)                (38)                    (55)
imex        583 (39)                         2627 (11)                (40)                    (60)
Sputnik     398 (27)                         1498 (6)                 (33)                    (32)

Percentage of Mreps = (number of detected repeats by the tool / number of detected repeats by Mreps) × 100%. This shows how the other tools fared in relation to the highest-ranking tool, Mreps. The different shaded colours represent the bands of performance among the tools, grouped according to their close association with each other.

Table 7: Results of maximized perfect repeat runs for selected genomes of bacteria

Detected repeats (percentage of Mreps)

Tools       Acinetobacter baumannii 0057   Acinetobacter baumannii SDF   Escherichia coli K12   Escherichia coli O157H7
Size (bp)
Mreps       45 (100)                       70 (100)                      26 (100)               72 (100)
TRF         45 (100)                       72 (103)                      26 (100)               67 (93)
INVERTER    34 (76)                        62 (86)                       21 (81)                59 (82)
ATRHunter   34 (76)                        61 (85)                       19 (73)                56 (78)
T-reks      12 (27)                        47 (65)                       13 (50)                32 (44)
imex        5 (11)                         5 (7)                         0 (0)                  8 (11)
Sputnik     1 (2)                          3 (4)                         0 (0)                  0 (0)

Percentage of Mreps = (number of detected repeats by the tool / number of detected repeats by Mreps) × 100%. This shows how the other tools fared in relation to the highest-ranking tool, Mreps. The different shaded colours represent the bands of performance among the tools, grouped according to their close association with each other.

Period size coverage

Both Leclercq et al. [7] and Merkel and Gemmell (2008) [6] limited their investigations to short


tandem repeats, such as microsatellites (period size: 1-6 bp). In this study, we surveyed repeats with period sizes beyond 6 bp and up to 20 bp. From the breakdown of the detected repeats shown in Figures 1 and 2, we notice the abundance of longer period size perfect repeats across many species; it is mostly imperfect repeats that tend to be composed of longer repeats (in both length and period size). In fact, repeats with period sizes beyond 6 bp account for at least half of the total number of repeats detected in the human genome, >60% in the three other eukaryotic species and >70% in the five bacteria being studied. Even for microsatellites, repeats with period sizes from 7 to 10 bp (which are still considered microsatellites by some biologists) are quite significantly represented in most selected species (though to a lesser extent in H. sapiens) and especially in bacteria. In addition, while assessing the tools' performance, we noticed that T-reks has limited performance in searching longer period size repeats (Tables 5, 8 and 9). Except for mono-nucleotide repeats, the program detects far fewer perfect repeats on all counts for period sizes from 2 bp onwards. This seems strange, as the tool outperforms the others when imperfect repeats are included. But, as explained earlier, this could be due to the less stringent parameters.

Fig 1: Distribution of perfect repeats by period size for (a) Saccharomyces cerevisiae, (b) Caenorhabditis elegans, (c) Oryza sativa japonica and (d) Homo sapiens.

DISCUSSIONS AND CONCLUSION

The gamut of search tools for finding tandem repeats is wide ranging and non-standardized. One has to be careful in understanding the tools' inherent constraints to select the right tool for the right purpose. It is hence important to be able to compare the repeat search tools and understand their behavior and inherent limitations.
This is not an easy task because of the multitude of parameters implemented, as already shown by two previous reviews [6, 7]. In this study, we used perfect repeats for comparing the tools. For this, we filtered the output from searches with default parameters for perfect and


non-redundant repeats and, in a second run, adjusted the tools' parameter settings to maximize the detection of perfect repeats. Also, instead of simply evaluating the detected repeats as a single total, we looked at their distribution by period size for the first 20 bp so as to give deeper insight into the tools' behavior and inherent constraints. This series of methods aims at separating the parameter biases from the algorithmic aspects and has proven useful in providing a consistent way of comparing the tools and explaining their differences. We found no measurable differences in algorithmic performance between the exhaustive combinatorial approach and the statistical-/heuristic-based approach. Still, one notices a marginal surplus in repeat detection (4-5%) for tools that use the combinatorial method of search (like Mreps and INVERTER) over those adopting the statistical-based approach (TRF and ATRHunter). One possible explanation is that algorithms using the statistical approach rely on the association of short repetitive elements of exact matches to form the longer repeats. While this may be useful in finding repeats that are partially or imperfectly matched, it is harder to associate longer perfect repeats that are not exact multiples of the smaller repetitive elements. Hence, the latter tools tend to miss the longer perfect repeats, unlike the brute-force approach of the combinatorial method, which has no such constraint. These differences, though, are not significant because long stretches of perfectly matched tandem repeats are not common in most genomes.

Fig 2: Distribution of perfect repeats by period size for two variants of Acinetobacter baumannii, 0057 and SDF, as shown in (a) and (b), and two variants of E. coli, K12 and O157H7, as shown in (c) and (d).
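The brute-force idea behind an exhaustive search for perfect repeats can be illustrated with a minimal sketch. It is not the algorithm implemented by Mreps or INVERTER; the function name `find_perfect_repeats` and its parameters are hypothetical and chosen only for illustration. For every candidate period, the scan extends a run for as long as each base equals the base one period earlier.

```python
def find_perfect_repeats(seq, max_period=20, min_copies=2):
    """Return (start, period, length) for maximal perfect tandem repeats.

    Illustrative brute-force scan: every period from 1 to max_period is
    tried independently, so the same region may be reported under a
    period and under multiples of that period.
    """
    hits = []
    n = len(seq)
    for period in range(1, max_period + 1):
        k = period
        while k < n:
            if seq[k] == seq[k - period]:
                start = k - period
                # Extend while the sequence keeps matching itself one period back.
                while k < n and seq[k] == seq[k - period]:
                    k += 1
                length = k - start
                if length >= min_copies * period:
                    hits.append((start, period, length))
            else:
                k += 1
    return hits

hits = find_perfect_repeats("ACACACAC", max_period=4)
print(hits)
```

In this example the same eight bases are reported both as a 2 bp repeat and as a 4 bp repeat, which is why outputs are usually filtered for redundancy before the tools are compared.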
We further observed that repeat finding tools designed for generalized usage (such as Mreps, TRF, INVERTER and ATRHunter) tend to outperform those (T-reks, imex and Sputnik) that are constrained by design to search repeats only of certain period sizes when comparing the total number of repeats detected. But, when one examines the number of detected repeats period size by period size, there is no significant difference between the two groups.
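Period-by-period tallies depend on which period a tool assigns to each repeat; as noted for INVERTER, di-nucleotide repeats can end up counted under the 4 bp period. A minimal sketch of how reported repeat units could be reduced to their smallest period before tallying is given below. The helper `primitive_period` and the example units are hypothetical and are not taken from any of the reviewed tools.

```python
from collections import Counter

def primitive_period(unit):
    """Return the smallest period p such that unit is unit[:p] repeated."""
    n = len(unit)
    for p in range(1, n + 1):
        if n % p == 0 and unit[:p] * (n // p) == unit:
            return p
    return n  # only reached for an empty unit

# Hypothetical repeat units as a tool might report them.
reported_units = ["ACAC", "AC", "AAT", "GTGT", "ACGT"]

by_reported = Counter(len(u) for u in reported_units)
by_primitive = Counter(primitive_period(u) for u in reported_units)

print(sorted(by_reported.items()))
print(sorted(by_primitive.items()))
```

Tallying by the reported unit length places "ACAC" and "GTGT" in the 4 bp bin, whereas tallying by the primitive period moves them to the 2 bp bin, so the totals agree while the per-period counts differ.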

Table 8: Number of detected repeats (distributed by period size) for maximized perfect repeat runs for selected genomes of eukaryotic species

Period size (bp) >20 Total

Saccharomyces cerevisiae S288c
Mreps
TRF
INVERTER - -
ATRHunter
T-reks
imex - - - - - - - - - - - - - - - 583
Sputnik - - - - - - - - - - - - - - - - - 398

Caenorhabditis elegans
Mreps
TRF
INVERTER - -
ATRHunter
T-reks
imex - - - - - - - - - - - - - - - 2627
Sputnik - - - - - - - - - - - - - - - - - 1498

Oryza sativa japonica
Mreps
TRF
INVERTER - -
ATRHunter
T-reks
imex - - - - - - - - - - - - - - -
Sputnik - - - - - - - - - - - - - - - - -

Homo sapiens
Mreps
TRF
INVERTER - -
ATRHunter
T-reks
imex - - - - - - - - - - - - - - -
Sputnik - - - - - - - - - - - - - - - - -

Note: bp stands for base pairs.

Table 9: Number of detected repeats (distributed by period size) for maximized perfect repeat runs for selected genomes of bacteria

Period size (bp) >20 Total

Acinetobacter baumannii 0057
Mreps - - - - - 5 - - 2 2 - 4 1 - 5 45
TRF - - - - - - - 1 1 - 6 45
INVERTER - - - - - - - - 1 7 - 8 - - 2 2 -
ATRHunter - - - - - 3 7 - 5 - - 2 2 - 4 1 - 4 34
T-reks - - - - - 3 2 - 1 - - - - - - - - - 12
imex - - - - 1 4 - - - - - - - - - - - - - - - 5
Sputnik - - - - 1 - - - - - - - - - - - - - - - - 1

Acinetobacter baumannii SDF
Mreps - - - 7 6 - 8 - 1 - - -
TRF - - - - - 2 - - -
INVERTER - - - - 2 6 - 8 - 1 - - -
ATRHunter - - - - - - 6 - 10 - 1 - - -
T-reks - - - - 3 - - 7 - 1 - - -
imex - - - - - - - - - - - - - - - - - - 5
Sputnik - - - - - - - - - - - - - - - - - - 3

Escherichia coli O157H7
Mreps - - - - - 11 - - 1 - -
TRF - - - - - 9 - - 2 - -
INVERTER - - - - - 2 - - 1 - -
ATRHunter - - - - - 8 - - 1 - -
T-reks - - - - - 8 - 1 2 - - 5 3 - 1 - - 8 32
imex - - - - - 8 - - - - - - - - - - - - - - - 8
Sputnik - - - - - - - - - - - - - - - - - - - - - -

Escherichia coli K12
Mreps - - - - - 1 - - 1 1 - 1 - 1 - 8 26
TRF - - - - - - - - 1 1 - - 5 26
INVERTER - - - - - - - 2 - - 1 1 - 1 - 1 - 8 21
ATRHunter - - - - - - - 2 - - 1 1 - 1 - 1 - 6 19
T-reks - - - - - - - 2 - - - 1 - 1 1 - 1 - 1 - 6 13
imex - - - - - - - - - - - - - - - - - - - - - -
Sputnik - - - - - - - - - - - - - - - - - - - - - -

Notwithstanding the comparison made in this study, the choice of tools still depends on the user's own preferences and needs. Although Mreps seems superior to the rest, it has the limitation of handling only repeats with mismatches and not indels. TRF offers much more flexibility, but one has to exercise great care with its non-intuitive parameter settings to obtain the desired results. The same is true for ATRHunter.
INVERTER and T-reks, on the other hand, offer a more direct and user-friendly interface but may not be as effective for certain searches, as shown in the case of T-reks. Similarly, INVERTER is currently limited to searching for perfect repeats only. imex and Sputnik are more suitable for short repeat searches.

We did not evaluate the tools' performance for imperfect repeats for several reasons. First, it is difficult to make equitable adjustments to the highly sensitive parameter settings to allow for objective comparisons. This should be a direction for future research. Second, there is no consensus on the degree of degeneracy allowed for repeats from the biological viewpoint. A number of studies set it at 1 bp mismatch for every 10 bp of the repeat [8, 9, 24], while others use a similarity level of 80% for substitution errors and 10% for indels as the criteria [17, 18]. We have shown earlier that the search results can vary widely depending on which criterion is adopted. One possible approach to evaluating imperfect repeat searches would be to examine the shift in behavior as the parameters are varied. Coupling this with the biological understanding of how tandem repeats evolve, one could determine whether the tools' performance is within reasonable bounds as expected in nature. The validity of this approach, however, requires further testing.
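How much the two degeneracy criteria can disagree is easy to see with a small calculation. The sketch below is only an illustration under stated assumptions: the first rule is read as "at most one mismatch per full 10 bp of repeat length", and the second as "at most 20% mismatched positions and at most 10% indel positions". Both readings, and the function names, are assumptions rather than definitions taken from the cited studies.

```python
def passes_one_per_ten(length, mismatches):
    """Assumed reading: at most 1 mismatch for every full 10 bp of the repeat."""
    return mismatches <= length // 10

def passes_similarity(length, mismatches, indels, max_mismatch_pct=20, max_indel_pct=10):
    """Assumed reading: mismatch and indel fractions stay under percentage limits.

    Integer arithmetic is used to avoid floating-point edge cases.
    """
    return (mismatches * 100 <= max_mismatch_pct * length
            and indels * 100 <= max_indel_pct * length)

# A hypothetical 50 bp repeat with 8 substitutions and no indels.
length, mismatches, indels = 50, 8, 0
print(passes_one_per_ten(length, mismatches))         # stricter rule
print(passes_similarity(length, mismatches, indels))  # more permissive rule
```

Under these assumptions the example repeat is rejected by the first rule and accepted by the second, which is one way the choice of criterion alone can change the number of reported repeats.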
