Briefings in Bioinformatics Advance Access published May 29, 2012


BRIEFINGS IN BIOINFORMATICS. doi: /bib/bbs023

Review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance

Kian Guan Lim, Chee Keong Kwoh, Li Yang Hsu and Adrianto Wirawan

Abstract
The prevalence of tandem repeats in eukaryotic genomes and their association with a number of genetic diseases has raised considerable interest in locating these repeats. Over the last 10–15 years, numerous tools have been developed for searching tandem repeats, but differences in the search algorithms adopted and difficulties with parameter settings have confounded many users, resulting in widely varying results. In this review, we have systematically separated the algorithmic aspect of the search tools from the influence of the parameter settings. We hope that this will give a better understanding of how the tools differ in algorithmic performance, what their inherent constraints are and how one should approach evaluating and selecting them.

Keywords: microsatellites; minisatellites; repeat finder; tandem repeat; method of comparison

INTRODUCTION
Tandem repeats are repeated sequences that occur not only sequentially but also directly adjacent to each other. They are common in eukaryotes. The human genome alone contains approximately repeats constituting close to 7% of the genome [1]. There has been considerable interest in finding these repeats as their under- or over-representation is associated with a wide range of human diseases including Fragile-X syndrome, Huntington's disease, Friedreich's ataxia, certain forms of cancer [1] and more than 40 other neurological, neurodegenerative and neuromuscular diseases [2, 3].

Definition of tandem repeats
Normally, the repeats are classified into microsatellites (also known as short tandem repeats or simple sequence repeats), minisatellites and satellite DNA [1] according to their period size.
They can be further categorized into perfect or imperfect repeats (also called exact or approximate repeats) depending on whether they are exact copies of one another or deviate by at least 1 bp due to mutational mismatch, insertion or deletion. Microsatellites are the most prevalent and tend to be perfect repeats [4]. There is no consensus on the precise definition of their period size. It is generally agreed to be <10 bp but many studies consider only 2–6 bp as microsatellites [1, 4–9]. Repeats with period size >10 bp are normally known as minisatellites, and those >100 bp are sometimes referred to as satellite DNA [10]. Unlike microsatellites, there is limited variability in the length of minisatellites and satellite DNA because of mutations within the sequences. They are hence mostly imperfect rather than exact repeats.

Corresponding author: Kian Guan Lim. Division of Software and Information Systems, School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore. E-mail: kglim1@e.ntu.edu.sg

Kian Guan Lim is a part-time master's student in the School of Computer Engineering, Nanyang Technological University, Singapore, studying for a Master of Science (MSc) in Bioinformatics. He pursued his undergraduate studies at the same university and graduated with a Bachelor's Degree in Electrical Engineering.
Chee Keong Kwoh is an Associate Professor in the Division of Information Systems, School of Computer Engineering, Nanyang Technological University, Singapore.
Li Yang Hsu is an Assistant Professor at the Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore. He is also a consultant at the Department of Medicine at National University Hospital.
Adrianto Wirawan is a Post-Doctoral Research Fellow at the Johannes Gutenberg University Mainz in Germany. Before that he was a Research Fellow at Yong Loo Lin School of Medicine in the National University of Singapore, Singapore. He holds a PhD in Computer Engineering from Nanyang Technological University, Singapore.

© The Author. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com

Tandem repeat search tools
Given the considerable interest in tandem repeats, it is not surprising to find a rich abundance of bioinformatics tools developed for detecting and characterizing them. Two surveys conducted by Sharma et al. [5] and Merkel and Gemmell [6] have revealed more than 25 tools in existence, and newer ones continue to emerge.

Repeat detection
Most of the search tools adopt a two-phase approach to detecting tandem repeats: search and filtering [6, 11]. In the first phase, one of two types of algorithms is commonly implemented: (i) a combinatorial search algorithm that exhaustively sections the entire sequence into subsequences and compares every adjacent subsequence to identify potential tandem repeats for all positions and all period sizes [12–16] or (ii) a statistical- or heuristic-based algorithm that uses small segment windows to scan through the sequence to first detect the possible repetitive loci (normally small perfect repeats) before associating them to determine the longer repeats [17–20]. The time complexity for the combinatorial algorithm grows exponentially, especially when the search involves imperfect and overlapping repeats [11, 21]. This could be reduced to near-polynomial time [11] with the heuristic- or statistical-based approach, but the comprehensiveness of the search, which depends on the choice of window size [17, 18], is hard to guarantee, which is a major criticism of this approach [6, 7].

Filtering for significant repeats
In the second phase, various forms of filtering methods are used to identify and extract the biologically significant repeats. The filter can be a simple length filter [12] or a more complex implementation of heuristic-based (such as k-means clustering [22, 23]) or statistics-based [17, 18] models. In addition, alignment algorithms such as wraparound dynamic programming [17, 18] or the star-centered algorithm [20] are also employed to determine the imperfect repeats based on the given mismatch, insertion and deletion (indel) threshold settings. For most of these tools, the users are allowed to adjust the filtering parameters to suit their specific application, but the reported results are highly sensitive to these settings [1, 5–7]. Moreover, different developers use or recommend different settings to detect what they consider to be biologically significant repeats [1], thus making it difficult for users to understand their intricate effects.

Previous reviews of tandem repeat search tools
Leclercq et al. [7] were the first to point out how the parameter settings affect the distribution of the detected repeats in number, length and divergence. Through an in silico analysis of five widely used tools (TRF, Mreps, Sputnik, STAR and RepeatMasker), they found that the treatment of minimum length has the greatest effect on the number of repeats detected, with an 80-fold difference between the two extremes [7]. Alignment weights and minimum score settings are also reported to have a striking influence, resulting in large discrepancies [7]. The review by Merkel and Gemmell [6] further highlighted that the tools varied widely in repeat definition, search algorithm and filtering method, resulting in various biases. In comparing five commonly used repeat search tools (TRF, Mreps, Sputnik, Msatfinder and TROLL), they also found significant divergence of up to three orders of magnitude due to parameter settings [6]. Minimum length has the most significant effect on repeat detection, followed by the minimum scores and alignment weights used [6].
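The exhaustive (combinatorial) search described under Repeat detection can be sketched for the perfect-repeat case as follows. This is a minimal illustration only, not the algorithm of any reviewed tool; the function name `find_perfect_tandem_repeats` and the parameters `max_period` and `min_length` are hypothetical names chosen for this sketch.

```python
def find_perfect_tandem_repeats(seq, max_period=6, min_length=10):
    """Scan every period size and start position for maximal perfect repeats.

    Returns tuples (start, end, period, unit, copies) with 0-based start and
    exclusive end. Hypothetical helper for illustration only.
    """
    hits = []
    n = len(seq)
    for period in range(1, max_period + 1):
        i = 0
        while i + 2 * period <= n:
            # Extend the tract while each base equals the base one period back.
            j = i + period
            while j < n and seq[j] == seq[j - period]:
                j += 1
            length = j - i
            if length >= 2 * period and length >= min_length:
                hits.append((i, j, period, seq[i:i + period], length / period))
                # A new tract of this period cannot start before the mismatch
                # at position j has left the comparison window.
                i = j - period + 1
            else:
                i += 1
    return hits
```

Because each period size is scanned independently, the same tract can be reported under several period sizes, which is the redundancy issue the review returns to later.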
Separation of parameter setting biases
While the two reviews report the search performance as being highly sensitive to the parameter settings, an objective evaluation of different tools should ideally separate the influence of the parameter settings from each tool's intrinsic performance, so as to assess its individual capability to find relevant repeats. The parameter settings should not confound the search results when making tool comparisons. However, this is not an easy task given the coupled nature of the settings to the search algorithm, their high sensitivity and, in some cases, the lack of resolution for fine adjustments.

Redundancy of repeats
In addition, neither review gives much attention to how the tools account for the redundancy of repeats, yet another source of discrepancy in repeat search. In many cases, overlapping repeats result from the detection of multiple similar repeats in or near the same locus. While some of the overlaps could be genuine repeats blended together [1, 4, 10], it is hard to make a clear distinction between those due to program artifacts and the biologically significant repeats (except for some simple cases such as when two repeats are reported with the same starting position). Thus, tools like INVERTER [23] prefer to report all the overlapping repeats and leave the user to do the filtering, while others like TRF and Mreps selectively filter out and report only what the programs consider to be the most credible overlapping repeats. This causes large disparities in the reporting of overlapping repeats among different tools.

Our contribution
This review attempts to complement and update the previous reviews by addressing the concerns highlighted above and incorporating the most recently developed tools. To separate the varying influence of parameter settings from the algorithmic aspect, we have used only perfect repeats and compared the results obtained from searches using the default settings of each tool. Aside from repeat length and period size detected, we also looked at redundancy of repeats, a factor that has previously been overlooked. Finally, the review extends the comparison from a single genome across multiple genomes to study the effects of genome composition.

Tools for evaluation
Seven tandem repeat search tools have been selected for this review. Three of them are older but widely used search tools: TRF [17], Mreps [12] and Sputnik [14]. The other four are newer tools that have been developed within the last 5 years: ATRHunter [18], imex [13], T-reks [20] and INVERTER [23]. Sputnik, Mreps, imex and INVERTER use the combinatorial approach, whereas the other three adopt the statistical- or heuristic-based method. Table 1 summarizes their key features.

METHODS
A systematic method of evaluating the algorithmic performance
To assess the algorithmic performance of the tools and understand their inherent constraints, there is a need to minimize the biases caused by the parameter settings (Table 2).
Here, we use a systematic approach in which we first evaluate the tools' performance under default settings, sequentially filtering for non-redundant (non-overlapping) and perfect repeats. In a second run, we then evaluate the tools' performance using parameter settings chosen to maximize the detection of perfect repeats. Perfect repeats provide a good model for studying the intrinsic tool performance, as these repeats are precise and not subject to varying interpretations. For example, there is only one precise interpretation of the repetitive sequence (ATTACATTACATTAC) as a perfect repeat, which is (ATTAC)3. This makes it easier to control and standardize the parameter settings, which for perfect repeats are normally limited to repeat length and period size, when making tool comparisons. In contrast, the interpretation of imperfect repeats is highly variable and depends also on the mismatch and indel parameter settings. It is difficult to find suitable settings among so many parameters to minimize their influence on imperfect repeat search so as to have an objective comparison.

Extraction and comparison of relevant repeats
A filtering process is used to extract only the relevant repeats from the collected results for comparison. To extract the perfect repeats for tools without a perfect repeat setting, such as TRF, the results are filtered for repeats with a match percentage of 100% and an indel percentage of zero. Similarly, for ATRHunter, only repeats with a score of 1 (indicating a perfect match) are considered. Also, to maintain consistency for tool comparison, we only considered non-overlapping repeats, since it is difficult to establish a good benchmark for the appropriate amount of overlap, and not all the tools handle overlapping repeats in the same manner. The non-overlapping repeats in this study refer only to repeats with a starting position higher than the ending position of the previous repeat.
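The two filters just described can be sketched as below. This is a hedged illustration rather than the authors' script: the `Repeat` record and its field names are hypothetical, "previous repeat" is assumed to mean the last repeat that was kept, and ties on the starting position are assumed to be resolved in favour of the longer repeat.

```python
from dataclasses import dataclass


@dataclass
class Repeat:
    # Hypothetical record; field names are assumptions for this sketch.
    start: int            # 1-based start position
    end: int              # inclusive end position
    period: int           # period size in bp
    percent_match: float  # 100.0 means no mismatches
    percent_indel: float  # 0.0 means no insertions or deletions


def keep_non_overlapping(repeats):
    """Keep a repeat only if it starts after the end of the last kept repeat."""
    kept = []
    last_end = None
    # Assumption: among repeats sharing a start, the longest is considered first.
    for r in sorted(repeats, key=lambda r: (r.start, -r.end)):
        if last_end is None or r.start > last_end:
            kept.append(r)
            last_end = r.end
    return kept


def keep_perfect(repeats):
    """Keep repeats reported with 100% matches and 0% indels."""
    return [r for r in repeats
            if r.percent_match == 100.0 and r.percent_indel == 0.0]
```

Applying `keep_non_overlapping` first and `keep_perfect` second mirrors the order of the sequential filtering described in the text.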
Settings for maximized perfect repeat detection
For the second run, to maximize the perfect repeat search, we simply set the resolution parameter to zero for Mreps; for T-reks, the similarity level was set to 100% with the indel percentage set to zero; and for INVERTER, imex and Sputnik, we selected the option to perform perfect repeat search (for Sputnik, simply include -r 0 in the command-line arguments). However, not all the tools have an option to select perfect repeat search directly. For those that do not, the match score, mismatch and indel (insertion and deletion) penalties and the minimum score were adjusted to maximize the search for perfect repeats with minimum filtering of the exact repeats. In TRF, this was done by adjusting the alignment weights to the extreme values of 2, 50 and 50 for match score, mismatch penalty and indel penalty, respectively, and the minimum score was pushed down to zero to allow reporting of all repeats detected indiscriminately. Similarly, the match score, mismatch penalty, indel penalty and minimum score for ATRHunter were set to 1, 50, 50 and 50, respectively.

Table 1: Comparison of tandem repeat search tool features
Columns: Tools; Search algorithm; Period size coverage (bp); Min. repeat length (bp); Imperfect repeat detection (Mismatch, Indel); Reporting of overlapping repeats; Remarks
TRF  Heuristic  1–  a  Yes  Yes  Yes (up to 3)  Win-DOS Ver
Mreps  Combinatorial  1– 6 10 b  Yes  No  No  DOS version
ATRHunter  Heuristic  1–  a  Yes  Yes  Yes (up to 3)  DOS version
INVERTER  Combinatorial  1–200  2 c  No  No  Yes (all)  Executable JAR
T-reks  Heuristic  1–24  9–14 d  Yes  Yes  Yes (all)  Executable JAR
imex  Combinatorial  1–5 e  Yes  Yes  No  Web-based
Sputnik  Combinatorial  2–5  5 a  Yes  No  No  DOS version
a Obtained by setting the minimum score to zero or the minimum possible value.
b p + 9 where p is period size (default: P = 1).
c Depends on Period Size Setting.
d 9 bp for mono-nucleotide repeats and 14 bp for the rest.
e No restriction, depends on user settings.

Table 2: Default or recommended settings of the evaluated tandem repeat search tools
TRF
Default or recommended settings: Match score = 2, mismatch penalty = 7, indel penalty = 7, minimum alignment score = 50, maximum period size = 500
Key implications: With the minimum alignment score of 50, only repeats ≥25 bp are reported. (Note: Inherent within the program, TRF limits the similarity level to 80% and the acceptable indel to only 10%.)
Mreps
Default or recommended settings: Res = 0
Key implications: Only perfect repeats with a minimum length of 10 bp are reported. The minimum length of 10 bp is inherent within Mreps unless the option -allowsmall is selected.
ATRHunter
Default or recommended settings: Match score = 2, mismatch penalty = 7, indel penalty = 7, minimum alignment score = 50, maximum period size = 500
Key implications: Fifty is the lowest setting for the minimum alignment score, but it does not appear to have any effect on the minimum repeat length for reporting.
INVERTER
Default or recommended settings: Minimum period size = 9, subset option = ON
Key implications: Overlapping repeats will be reported.
T-reks
Default or recommended settings: Similarity level = 0.65, percent of indel = 20%, overlapping repeats are allowed to be reported.
Key implications: This is less stringent than the TRF inherent limits.
imex
Default or recommended settings: Minimum repeat number: mono = 5, di = 3, tri = 2, tetra = 2, penta = 2, hexa = 2. Imperfection limit per repeat unit: mono = 1, di = 1, tri = 1, tetra = 2, penta = 2, hexa = 3. Overall imperfection per tract = 10%
Key implications: The program will only report repeats that meet these criteria.
Sputnik
Default or recommended settings: Match bonus = 1, mismatch penalty = 10, minimum score = 10, fail score = 1
Key implications: Not much is known of Sputnik's default settings. The values appeared to be as indicated. The use of the fail score is not well documented, but it is suspected to be used to terminate the search once the threshold is reached. With the minimum score of 10, only repeats 10 bp or longer are reported. Also, the high mismatch penalty seems to filter off all imperfect repeats.

Repeat length constraint
We had to account for the different minimum repeat length constraints imposed by the different tools. TRF, imex and Sputnik have a minimum repeat length of 5 bp. For Mreps, INVERTER and T-reks, it is 9 bp. For ATRHunter, it is 17 bp. To have an objective comparison, we arbitrarily selected a minimum length of 20 bp for this study. This is not far from most practices, as many biologically significant repeats are typically longer, especially those related to disease [1].

Intra- and inter-genomic comparison of tool performance
Both runs using (i) default parameter settings and (ii) settings for maximized repeat detection were conducted using the Saccharomyces cerevisiae S288c genome as the reference dataset. The S. cerevisiae S288c genome dataset was obtained from the NCBI website (dated 16 June 2011) with the accession numbers: NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ and NC_ for Chromosome 1 to 16 and the Mitochondria, respectively. We further extended the maximized perfect repeat search to other genomes to check for consistency of tool performance across species with different genome sizes, both eukaryotes and bacteria. The Caenorhabditis elegans (NC_ , NC_ , NC_ , NC_ , NC_ , NC_ and NC_ for Chromosome 1 to 6 and the Mitochondria, respectively), Oryza sativa Japonica (NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ , NC_ and NC_ for Chromosome 1 to 12, the Mitochondria and the plasmid, respectively), and Homo sapiens (NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ , NT_ and NC_ for Chromosome 1 to 22, X, Y and the Mitochondria, respectively) genomes are used as the dataset for the inter-genomic comparison among the eukaryotic species, in addition to S. cerevisiae S288c. For the bacteria, two common variants each of Acinetobacter baumannii and Escherichia coli are used: A. baumannii 0057 (NC_ ), A. baumannii SDF (NC_ ), E. coli O157:H7 (NC_ ) and E. coli K-12 (NC_ ).
Similarly, all genome data were obtained from the NCBI website (dated 16 June 2011).

RESULTS
Intra-genomic comparison of tool performance
The results of the maximized perfect repeat runs and the default setting runs using the S. cerevisiae S288c genome as the reference dataset are summarized in Table 3, with the detailed distribution of repeats by repeat length and period size given in Tables 4 and 5, respectively. They contain (i) the set of repeats detected in the maximized perfect repeat runs, (ii) the set of repeats detected as collected under each tool's default settings, (iii) the proportion of detected repeats that are non-overlapping and (iv) the proportion of non-overlapping repeats that are perfect repeats.

Algorithmic performance of the tools in repeat detection
From Table 3, we clearly see that the differences in repeat detection among the tools with the maximized perfect repeat settings are smaller ( repeats between the two extremities) when compared with the two orders of magnitude difference with the default settings (from 2058 to as high as repeats). This shows that there is no marked difference in the algorithmic performance of the tools when the influences of the minimum length, mismatches and indels are removed. While the total number of detected perfect repeats appears much lower than that under default settings, it becomes clear from the distribution of the repeats (Table 4) that this is due to the restriction of the minimum length, which is arbitrarily set to 20 bp. When comparing repeat length by repeat length, the number of detected perfect repeats in the maximized perfect repeat runs is equal to, if not more than, that of the default settings. This shows that the tandem repeat search tools can be optimized to detect a higher number of perfect repeats with the appropriate settings. The results also show that comparing the total number of detected repeats alone precludes us from understanding the true performance of the search tools.
Table 3: Summary of intra-genomic comparison between maximized perfect repeat settings and the tools' default settings using S. cerevisiae S288c ( bp) as the dataset
Columns: Tools; Number of detected repeats with maximized perfect repeat settings (b); Number of detected repeats using the tools' default settings: As collected; Non-overlapping repeats, as per default settings (%); Perfect repeats, as per default settings (%); Repeat length >9 bp (a) (%)
Mreps  (88.4)  6853 (88.4)  6853 (88.4)
TRF  (64.9)  333 (16.2)  333 (16.2)
INVERTER  (13.7)  1726 (13.7)  1726 (13.7)
ATRHunter  (78.0)  834 (6.3)  834 (6.3)
T-reks  (15.8)  2807 (0.3)  1831 (0.2)
imex  (97.9)  (93.1)  (4.9)
Sputnik  (100.0)  2625 (100.0)  2625 (100.0)
Values in parentheses show the number of repeats in each column as a percentage of the number of repeats as collected (third column).
a Perfect repeats with repeat length >9 bp.
b The settings to maximize the detection of perfect repeats for tool comparison are as follows. For Mreps, the resolution parameter was set to zero. For Sputnik, the option -r 0 was included. For T-reks, the similarity level was set to 100% with the indel percentage set to zero. The Perfect Repeat Search option was selected for INVERTER and imex. For TRF and ATRHunter, where there is no direct setting, the least stringent criterion was applied to allow all possible lengths of exact repeats to be reported with no or minimum filtering. Specifically, in TRF, the alignment weights were set to 2, 50 and 50 for match score, mismatch penalty and indel penalty, respectively, and the minimum score was set to zero. The match score, mismatch penalty, indel penalty and minimum score for ATRHunter were set to 1, –50, –50 and 50. The minimum length was arbitrarily set to 20 bp to objectively compare across different tools, and the results were filtered for overlapping repeats.

It is important to note that while tools like Mreps, TRF, INVERTER and ATRHunter seem to outperform imex and Sputnik in the total number of detected perfect repeats, this is in fact due to the inherent limitation of imex and Sputnik to detect only short period sizes: 1–6 and 2–5 bp, respectively. When one compares period size by period size, both perform as well as the other tools in detecting the respective repeats. This is further supported by the consistency of their performance for the other genomes (Tables 8 and 9).

Handling of overlapping repeats
Table 3 clearly shows how the tools' performance under default settings is strongly influenced by differences in the way that the redundancy of repeats is handled. Most of the reported overlapping repeats are due to program artifacts rather than genuine overlaps occurring naturally. It has been a challenge for most algorithms to account for and extract the most representative overlapping repeats. For example, a program can identify a repetitive sequence (ATATATATAT) as (AT)5 or (ATAT)2.5. Clearly, both account for the same repeat, and which one to report depends subjectively on the algorithm design. This is especially a concern for longer repeats with numerous short multiple periods. Hence, we see a large discrepancy in the number of overlapping repeats reported by different tools. For tools like T-reks and INVERTER, where there is little or no filtering, we found an excessive number of these repeats detected; T-reks detected >84% of its repeats as overlaps and INVERTER had >86%. Many of the repeats occur at the same loci with different repeat lengths and period sizes. For tools like Mreps, ATRHunter and imex that filter out the non-relevant overlapping repeats, the difference between the overlaps and non-overlaps is significantly smaller, in the range of only %. TRF reports a higher number of overlapping repeats (35.1%) since it allows up to three overlapping repeats to be reported.
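The (AT)5 versus (ATAT)2.5 ambiguity can be made concrete with a short sketch that lists every period size under which a tract is a perfect repeat. The helper names `perfect_periods` and `describe` are hypothetical, and requiring at least two copies is an assumption of this sketch rather than a rule taken from any of the tools.

```python
def perfect_periods(tract):
    """Period sizes p (at least two copies) for which the tract repeats perfectly."""
    n = len(tract)
    return [p for p in range(1, n // 2 + 1)
            if all(tract[i] == tract[i - p] for i in range(p, n))]


def describe(tract):
    """Render each valid interpretation as (unit)copies, e.g. (AT)5."""
    return [f"({tract[:p]}){len(tract) / p:g}" for p in perfect_periods(tract)]


# The same 10 bp locus admits two equally valid perfect-repeat descriptions.
print(describe("ATATATATAT"))
```

A tool that reports every interpretation lists the locus twice, whereas one that keeps only the shortest period reports it once, which is consistent with the discrepancies in overlap counts discussed above.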
Sputnik, by its inherent algorithmic design, prevents the search of repeats at the same loci, and hence does not report any overlapping repeats.

Repeat length and period size constraints
It is also clear from Tables 4 and 5 that repeat detection is heavily biased by the tools' minimum length settings and their constraints on period size. This reinforces the point made by the two previous reviews that one has to be careful in the choice of minimum length and period size when searching for repeats.

Table 4: Detected repeats (distributed by repeat length) on the S. cerevisiae S288c genome using default settings and compared with maximized perfect repeat run
Repeat length (bp) I >30 Total
Mreps, Default settings
As collected
Non-overlap
Non-overlap, Perfect – – – – –
Maximized perfect repeat run
TRF, Default settings
As collected – – –
Non-overlap
Non-overlap, Perfect
Maximized perfect repeat run
INVERTER, Default settings
As collected – – – – – – – – – – – – – 2783 – 1582 – 711 – 1718 – –
Non-overlap – – – – – – – – – – – – – 795 – 319 – 82 – 236 – –
Non-overlap, Perfect – – – – – – – – – – – – – 795 – 319 – 82 – 236 – –
Maximized perfect repeat run
ATRHunter, Default settings
As collected
Non-overlap – – – – – – – – – –
Non-overlap, Perfect – – – – – – – – – – – – –
Maximized perfect repeat run
T-reks, Default settings
As collected
Non-overlap
Non-overlap, Perfect
Maximized perfect repeat run
imex, Default settings
As collected
Non-overlap
Non-overlap, Perfect – –
Maximized perfect repeat run
Sputnik
As collected – – – – – 302 – 990 – – 176 – – –
Non-overlap – – – – – 302 – 990 – – 176 – – –
Non-overlap, Perfect – – – – – 334 – 1095 – – 134 – – –
Maximized perfect repeat run – –
As collected: results as collected using the tools' default settings; Non-overlap: filtered for overlapping repeats; Non-overlap, Perfect: only non-overlapping perfect repeats.

Table 5: Detected repeats (distributed by period size) on the S. cerevisiae S288c genome using the default settings and compared with maximized perfect repeat run
Period size (bp) >20 Total
Mreps, Default settings
As collected
Non-overlap
Non-overlap, Perfect
Maximized perfect repeat run
TRF, Default settings
As collected
Non-overlap
Non-overlap, Perfect
Maximized perfect repeat run
INVERTER, Default settings
As collected – – – – – – – –
Non-overlap – – – – – – – –
Non-overlap, Perfect – – – – – – – –
Maximized perfect repeat run – –
ATRHunter, Default settings
As collected
Non-overlap
Non-overlap, Perfect – –
Maximized perfect repeat run
T-reks, Default settings
As collected
Non-overlap
Non-overlap, Perfect – 1 – 8 – 4 – – – – – – – – – – – – 2807
Maximized perfect repeat run
imex, Default settings
As collected – – – – – – – – – – – – – – –
Non-overlap – – – – – – – – – – – – – – –
Non-overlap, Perfect – – – – – – – – – – – – – – –
Maximized perfect repeat run – – – – – – – – – – – – – – – 583
Sputnik, Default settings
As collected – – – – – – – – – – – – – – – – – 2625
Non-overlap – – – – – – – – – – – – – – – – – 2625
Non-overlap, Perfect – – – – – – – – – – – – – – – – – 2601
Maximized perfect repeat run – – – – – – – – – – – – – – – – – 398
As collected: results as collected using the tools' default settings; Non-overlap: filtered for overlapping repeats; Non-overlap, Perfect: only non-overlapping perfect repeats.

From Table 4, we note that the default minimum score and penalty settings of TRF limit the program to only repeats of length >25 bp, hence giving the impression that the program has underperformed compared with the other tools. However, it is possible to lower the minimum length for TRF (by adjusting its match score and minimum threshold score) to detect repeats as short as 5 bp (data not shown here as we only report repeats with length ≥20 bp). On the other hand, imex reported an overwhelming number of repeats <10 bp, which are of little or no biological significance [24]. In fact, >90% of the imex-detected repeats can be eliminated if one considers only repeats >10 bp as statistically significant [24]. Tools like ATRHunter, INVERTER and Sputnik are further limited by their inability to handle non-integer repetitions. As a result, we noticed sharp drops in the number of repeats detected when the repeat length was a prime number (i.e. 11, 13, 17, 19 and 23). Table 5 further reveals the constraint of the search tools in period size, particularly for INVERTER, imex and Sputnik. For INVERTER, the minimum period size could be adjusted to as low as 3 bp, but the default was set to 9 bp as the tool was specifically designed for detecting inverted repeats that are usually >10 bp [1]. For imex and Sputnik, the period sizes are fixed at 1–6 and 2–5 bp, respectively, as they are designed for microsatellites.

Biases in settings for imperfect repeat detection
Finally, strong biases were present in the tools' mismatch and indel settings for the detection of imperfect repeats. For INVERTER, Mreps and Sputnik, the default settings allowed only for the detection of perfect repeats.
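The sharp drops at prime repeat lengths noted above follow from simple arithmetic, sketched here under the assumption that a tool reports only tracts made of a whole number of copies (at least two), so that the period size must divide the tract length. The function name `integer_copy_periods` and its parameters are hypothetical.

```python
def integer_copy_periods(length, min_period=1, max_period=None):
    """Period sizes that tile a tract of `length` bp with a whole number of copies."""
    if max_period is None:
        max_period = length // 2  # at least two copies
    return [p for p in range(min_period, max_period + 1) if length % p == 0]


# Illustration with a 2-5 bp period range, as stated for Sputnik in the text.
for length in (10, 11, 12, 13):
    print(length, integer_copy_periods(length, min_period=2, max_period=5))
```

Under this assumption, a tract whose length is prime can only be tiled by a mono-nucleotide unit, so a tool restricted to whole copies and to periods of 2 bp or more has no way to report it at its full length.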
At the other extreme, T-reks's settings of 65% for mismatch and 20% for indels retrieved an excessive number of repeats, resulting in a 50-fold difference between the non-overlapping repeats and those filtered for perfect repeats. For ATRHunter and TRF, these differences were only 12-fold and 4-fold, respectively. It is difficult to compare and make sense of these results given that a one-to-one correspondence between the settings is impossible. For example, when we set the mismatch similarity of T-reks to 80% to have a setting equivalent to TRF, and searched for repeats >25 bp, T-reks managed to detect only 65 repeats, much lower than the 1173 repeats detected by TRF. This was also far fewer than the reported bp or longer repeats when the similarity level was set to 65%.

Inter-genomic comparison of tool performance
Tables 6 and 7 summarize the tools' performance for maximized perfect repeat detection when compared across different species of eukaryotes and bacteria. The distributions of the repeats by period size are shown in Figures 1 and 2, with the detailed breakdown of the repeats by period size given in Tables 8 and 9.

Consistency of algorithmic performance
From Tables 6 and 7, one can observe the consistency of performance among the tools across the different species. Mreps appeared to rank consistently on top, followed by TRF, INVERTER, ATRHunter, T-reks, imex and Sputnik. The only two exceptions are between T-reks and imex for H. sapiens and O. sativa Japonica, which are rich in 1–6 bp microsatellites. This consistency in tool performance across different species is encouraging, as it shows that the ranking of the tools is not influenced by the different genome sizes or by differences in the distribution and representation of the repeats among the different species.
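The ranking in Table 6 is expressed as each tool's count relative to Mreps. As a small worked sketch, the percentages for the S. cerevisiae S288c column can be recomputed from the counts reported in that table; the function name `percent_of_reference` is hypothetical, and rounding to the nearest whole percent is an assumption.

```python
# Detected perfect repeats for S. cerevisiae S288c, as listed in Table 6.
detected = {
    "Mreps": 1501,
    "TRF": 1325,
    "INVERTER": 1169,
    "ATRHunter": 1115,
    "T-reks": 643,
    "imex": 583,
    "Sputnik": 398,
}


def percent_of_reference(counts, reference="Mreps"):
    """Express each tool's count as a whole-number percentage of the reference tool."""
    ref = counts[reference]
    return {tool: round(100 * n / ref) for tool, n in counts.items()}


print(percent_of_reference(detected))
```

Normalizing to the top-ranked tool makes the bands of performance comparable across genomes whose absolute repeat counts differ by orders of magnitude.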
However, tools like imex that are designed for short repeat searches tended to perform better for genomes rich in microsatellites, whereas tools like ATRHunter that are optimized for longer repeats performed better for genomes rich in such repeats, e.g. C. elegans. TRF and Mreps, on the other hand, seemed to perform equally well in both situations, as tools for more generalized usage. For ab initio detection of repeats in genomes that are not well characterized, it is hence instructive to use more than one tool, with at least one optimized for microsatellites and another for finding longer repeats.

Overall, we observed that the tools can be grouped by their performance. The four top-ranked tools perform within about 25% of each other in repeat detection. Mreps and TRF are closer to each other (no more than 12% difference) than the rest. INVERTER and ATRHunter fall in the second-best performance group, lower than Mreps and TRF by 9-26% on average, and generally within 4% of each other. The last group, which includes T-reks, imex and Sputnik, exhibited more varied performance.

The results also showed that, from an algorithmic viewpoint, there is no significant difference in the detection of tandem repeats between the combinatorial approach and the statistic-/heuristic-based approach. TRF, which adopts the statistic-/heuristic-based approach, performed as well as Mreps, which uses the combinatorial search. While there seems to be a performance demarcation between the first two groups, i.e. Mreps and TRF versus INVERTER and ATRHunter, they are not as distinct as one might think. In the case of the H. sapiens genome, the results were so similar that they could have been grouped together.

Table 6: Results of maximized perfect repeat runs for selected genomes of eukaryotic species. Each cell gives the detected repeats (percentage of Rank 1).

Tool        Saccharomyces cerevisiae S288c   Caenorhabditis elegans   Oryza sativa japonica   Homo sapiens
Mreps       1501 (100)                       (100)                    (100)                   (100)
TRF         1325 (88)                        23517 (96)               56493 (96)              (95)
INVERTER    1169 (78)                        (81)                     (83)                    (90)
ATRHunter   1115 (74)                        (79)                     (81)                    (87)
T-reks      643 (43)                         6081 (25)                (38)                    (55)
imex        583 (39)                         2627 (11)                (40)                    (60)
Sputnik     398 (27)                         1498 (6)                 (33)                    (32)

Percentage of Mreps = (number of repeats detected by the tool / number of repeats detected by Mreps) x 100%. This shows how the other tools fared in relation to the highest-ranking tool, Mreps. The different shaded colours represent the bands of performance among the tools, grouped according to their close association with each other.

Table 7: Results of maximized perfect repeat runs for selected genomes of bacteria. Each cell gives the detected repeats (percentage of Mreps).

Tool        Acinetobacter baumannii 0057   Acinetobacter baumannii SDF   Escherichia coli K12   Escherichia coli O157H7
Mreps       45 (100)                       70 (100)                      26 (100)               72 (100)
TRF         45 (100)                       72 (103)                      26 (100)               67 (93)
INVERTER    34 (76)                        62 (86)                       21 (81)                59 (82)
ATRHunter   34 (76)                        61 (85)                       19 (73)                56 (78)
T-reks      12 (27)                        47 (65)                       13 (50)                32 (44)
imex        5 (11)                         5 (7)                         0 (0)                  8 (11)
Sputnik     1 (2)                          3 (4)                         0 (0)                  0 (0)

Percentage of Mreps = (number of repeats detected by the tool / number of repeats detected by Mreps) x 100%. This shows how the other tools fared in relation to the highest-ranking tool, Mreps. The different shaded colours represent the bands of performance among the tools, grouped according to their close association with each other.

The trends are also not obvious when examining the period size distribution (Table 8). There are various confounding effects worth highlighting here. For example, for INVERTER, we placed a minimum period size limit of 3 bp to shorten the search time. This did not affect the overall results, as the mono- and di-nucleotide repeats are accounted for in the longer period size repeats. For instance, in the S. cerevisiae S288c genome, INVERTER reports four to six times as many 4 bp repeats as the other tools. In fact, >75% of these were di-nucleotide repeats that were counted under the 4 bp period size.

Period size coverage

Both Leclercq et al. [7] and Merkel and Gemmell (2008) [6] limited their investigations to short

tandem repeats, such as microsatellites (period size: 1-6 bp). In this study, we surveyed repeats with period sizes beyond 6 bp and up to 20 bp. From the breakdown of the detected repeats shown in Figures 1 and 2, we notice an abundance of longer-period perfect repeats across many species, and imperfect repeats tend to be composed mostly of longer repeats (in both length and period size). In fact, repeats with period sizes beyond 6 bp account for at least half of the total number of repeats detected in the human genome, >60% in the three other eukaryotic species and >70% in the five bacteria studied. Even repeats with period sizes of 7-10 bp (which some biologists still consider microsatellites) are well represented in most of the selected species (though to a lesser extent in H. sapiens), and especially in bacteria.

In addition, while assessing the tools' performance, we noticed that T-reks has limited performance in searching for longer period size repeats (Tables 5, 8 and 9). Except for mono-nucleotide repeats, the program detects far fewer perfect repeats on all counts for period sizes from 2 bp onwards. This seems strange, as the tool outperforms the others when imperfect repeats are included. But, as explained earlier, this could be due to its less stringent parameters.

Fig 1: Distribution of perfect repeats by period size for (a) Saccharomyces cerevisiae, (b) Caenorhabditis elegans, (c) Oryza sativa japonica and (d) Homo sapiens.

DISCUSSIONS AND CONCLUSION

The gamut of search tools for finding tandem repeats is wide ranging and non-standardized. One has to understand the tools' inherent constraints carefully to select the right tool for the right purpose. It is hence important to be able to compare the repeat search tools and understand their behavior and inherent limitations.
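The period-size shift that follows from the 3 bp minimum period size placed on INVERTER can be illustrated with a minimal sketch. The hypothetical helper below returns the smallest period that tiles a perfect repeat exactly, subject to a minimum period size. It is only an illustration of the counting effect, not INVERTER's actual algorithm.

```python
def smallest_period(repeat, min_period=1):
    """Smallest period >= min_period whose unit tiles `repeat` exactly.

    Returns None when no such period gives at least two full copies.
    `min_period` is a hypothetical stand-in for a tool's lower period limit.
    """
    n = len(repeat)
    for period in range(min_period, n // 2 + 1):
        # The unit must divide the repeat length and reproduce it when copied.
        if n % period == 0 and repeat[:period] * (n // period) == repeat:
            return period
    return None


if __name__ == "__main__":
    locus = "ACACACAC"  # a perfect (AC)4 di-nucleotide repeat
    print(smallest_period(locus))                # 2
    print(smallest_period(locus, min_period=3))  # 4: counted as (ACAC)2
```

With the 3 bp floor, the same di-nucleotide locus is still found but is tallied under the 4 bp period size, which is the effect described for the S. cerevisiae results.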
This is not an easy task owing to the multitude of parameters implemented, as has already been shown by two previous reviews [6, 7]. In this study, we used perfect repeats for comparing the tools. For this, we filtered the output from searches with default parameters for perfect and non-redundant repeats and, in a second run, adjusted the tools' parameter settings to maximize the detection of perfect repeats. Also, instead of simply evaluating the detected repeats as a single total, we looked at their distribution by period size for the first 20 bp, so as to give deeper insight into the tools' behavior and inherent constraints. This series of methods aims at separating the parameter biases from the algorithmic aspects and has proven useful in providing a consistent way of comparing the tools and explaining their differences.

Fig 2: Distribution of perfect repeats by period size for two variants of Acinetobacter baumannii, (a) 0057 and (b) SDF, and two variants of E. coli, (c) K12 and (d) O157H7.

We found no measurable differences in algorithmic performance between the exhaustive combinatorial approach and the statistical-/heuristic-based approach. Still, one notices a marginal surplus in repeat detection (4-5%) for tools that use the combinatorial method of search (like Mreps and INVERTER) over those adopting the statistical-based approach (TRF and ATRHunter). One possible explanation is that algorithms using the statistical approach rely on the association of short repetitive elements of exact matches to form the longer repeats. While this may be useful in finding repeats that are partially or imperfectly matched, it is harder to associate the longer perfect repeats that are not entirely multiples of the smaller repetitive elements. Hence, the latter tools tend to miss the longer perfect repeats, unlike the brute-force approach of the combinatorial method, which has no such constraint. These differences, though, are not significant because long stretches of perfectly matched tandem repeats are not common in most genomes.
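To make the brute-force idea concrete, the following is a minimal sketch of an exhaustive scan for perfect tandem repeats, followed by a simple redundancy filter. It is a naive illustration in the spirit of the combinatorial approach and is not the algorithm implemented by Mreps or INVERTER. The parameters `max_period` and `min_copies` and the rule of keeping the smallest period per locus are assumptions made for the example.

```python
def find_perfect_repeats(seq, max_period=20, min_copies=2):
    """Exhaustively report maximal perfect tandem repeats.

    For every period p, a run of positions k with seq[k] == seq[k - p]
    marks a stretch that repeats with period p. Returns a list of
    (start, period, length) tuples.
    """
    hits = []
    n = len(seq)
    for p in range(1, max_period + 1):
        k = p
        while k < n:
            if seq[k] != seq[k - p]:
                k += 1
                continue
            start = k - p
            # Extend the run as far as the period-p match holds.
            while k < n and seq[k] == seq[k - p]:
                k += 1
            length = k - start
            if length >= min_copies * p:
                hits.append((start, p, length))
    return hits


def drop_redundant(hits):
    """Keep only the smallest period reported for each (start, length) locus."""
    best = {}
    for start, period, length in hits:
        key = (start, length)
        if key not in best or period < best[key]:
            best[key] = period
    return sorted((s, p, l) for (s, l), p in best.items())


if __name__ == "__main__":
    hits = find_perfect_repeats("TGACACACACGT")
    print(hits)                  # [(2, 2, 8), (2, 4, 8)]
    print(drop_redundant(hits))  # [(2, 2, 8)]
```

The unfiltered output reports the same (AC)4 locus under both the 2 bp and 4 bp periods, which is why a non-redundancy filter is needed before tools can be compared on counts.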
We further observed that repeat-finding tools designed for generalized usage (such as Mreps, TRF, INVERTER and ATRHunter) tend to outperform those (T-reks, imex and Sputnik) that are constrained by design to search only for repeats of certain period sizes, when comparing the total number of repeats detected. But when one examines the number of detected repeats period size by period size, there is no significant difference between the two groups.

Table 8: Number of detected repeats (distributed by period size) for maximized perfect repeat runs for selected genomes of eukaryotic species. Columns: period size (bp) from 1 to 20, >20 and Total. Rows: Mreps, TRF, INVERTER, ATRHunter, T-reks, imex and Sputnik for each of Saccharomyces cerevisiae S288c, Caenorhabditis elegans, Oryza sativa japonica and Homo sapiens. Totals: S. cerevisiae S288c, imex 583 and Sputnik 398; C. elegans, imex 2627 and Sputnik 1498. Note: bp stands for base pairs.

Table 9: Number of detected repeats (distributed by period size) for maximized perfect repeat runs for selected genomes of bacteria. Columns: period size (bp) from 1 to 20, >20 and Total. Rows: Mreps, TRF, INVERTER, ATRHunter, T-reks, imex and Sputnik for each of Acinetobacter baumannii 0057, Acinetobacter baumannii SDF, Escherichia coli O157H7 and Escherichia coli K12. The row totals match the counts in Table 7.

Notwithstanding the comparison made in this study, the choice of tools still depends on the user's own preferences and needs. Although Mreps seems superior to the rest, it is limited to handling repeats with mismatches and not indels. TRF offers much more flexibility, but one has to exercise great care with its non-intuitive parameter settings to obtain the desired results. The same is true for ATRHunter.
INVERTER and T-reks, on the other hand, offer a more direct and user-friendly interface but may not be as effective for certain searches, as shown in the case of T-reks. Similarly, INVERTER is currently limited to searching for perfect repeats only. imex and Sputnik are more suitable for short repeat searches.

We did not evaluate the tools' performance for imperfect repeats for several reasons. One is that it is difficult to make equitable adjustments to the highly sensitive parameter settings to allow for objective comparisons. This should be a direction for future research. Second, there is no consensus on the degree of degeneracy allowed for repeats from the biological viewpoint. A number of studies set it at 1 bp mismatch for every 10 bp of the repeat [8, 9, 24], while others use a similarity level of 80% for substitution errors and 10% for indels as the criterion [17, 18]. We have shown earlier that the search results can vary widely depending on which criterion is adopted. One possible approach to evaluating imperfect repeat searches would be to examine the shift in behavior as the parameters are varied. Coupling this with the biological understanding of how tandem repeats evolve, one could determine whether the tools' performance is within reasonable bounds as expected in nature. The validity of this approach, however, requires further testing.
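The sensitivity to the chosen degeneracy criterion can be sketched with a small comparison. The two hypothetical functions below encode one possible reading of each criterion: at most 1 mismatch per 10 bp with no indels, versus at most 20% mismatched positions and at most 10% indels. How the cited studies count mismatches and indels exactly is not specified here, so these thresholds and the function names are assumptions for illustration only.

```python
def passes_one_per_ten(length, mismatches, indels=0):
    """Criterion A (assumed reading): at most 1 mismatch per 10 bp, no indels."""
    return indels == 0 and mismatches <= length // 10


def passes_similarity(length, mismatches, indels=0,
                      max_mismatch_pct=20, max_indel_pct=10):
    """Criterion B (assumed reading): >= 80% similarity and <= 10% indels.

    Integer arithmetic is used to avoid floating-point edge cases
    at the thresholds.
    """
    if length <= 0:
        return False
    return (100 * mismatches <= max_mismatch_pct * length
            and 100 * indels <= max_indel_pct * length)


if __name__ == "__main__":
    # A hypothetical 40 bp repeat with 6 mismatches and no indels.
    print(passes_one_per_ten(40, 6))  # False: more than 4 mismatches
    print(passes_similarity(40, 6))   # True: 15% mismatched positions
```

The same hypothetical repeat is rejected under one criterion and accepted under the other, which is the kind of divergence that makes objective comparison of imperfect repeat searches difficult.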


More information

1 Computational problems

1 Computational problems 80240233: Computational Complexity Lecture 1 ITCS, Tsinghua Univesity, Fall 2007 9 October 2007 Instructor: Andrej Bogdanov Notes by: Andrej Bogdanov The aim of computational complexity theory is to study

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Protein Aggregation Optimization: An Algorithmic Approach

Protein Aggregation Optimization: An Algorithmic Approach Protein Aggregation Optimization: An Algorithmic Approach Brandon Plost December 4, 2009 Parkinson s, Alzheimer s, and Huntington s Diseases are primarily caused by a biological phenomenon called protein

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

Quality and Coverage of Data Sources

Quality and Coverage of Data Sources Quality and Coverage of Data Sources Objectives Selecting an appropriate source for each item of information to be stored in the GIS database is very important for GIS Data Capture. Selection of quality

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

Exercise 3 Exploring Fitness and Population Change under Selection

Exercise 3 Exploring Fitness and Population Change under Selection Exercise 3 Exploring Fitness and Population Change under Selection Avidians descended from ancestors with different adaptations are competing in a selective environment. Can we predict how natural selection

More information

Optimal Sequences of Trials for Balancing Practice and Repetition Effects. Han-Suk Sohn, Dennis L. Bricker, J. Richard Simon, and Yi-Chih Hsieh

Optimal Sequences of Trials for Balancing Practice and Repetition Effects. Han-Suk Sohn, Dennis L. Bricker, J. Richard Simon, and Yi-Chih Hsieh Optimal Sequences of Trials for Balancing Practice and Repetition Effects Han-Suk Sohn, Dennis L. Bricker, J. Richard Simon, and Yi-Chih Hsieh The University of Iowa December 10, 1996 Correspondence should

More information

Integrated Electricity Demand and Price Forecasting

Integrated Electricity Demand and Price Forecasting Integrated Electricity Demand and Price Forecasting Create and Evaluate Forecasting Models The many interrelated factors which influence demand for electricity cannot be directly modeled by closed-form

More information

AUTOMATED TEMPLATE MATCHING METHOD FOR NMIS AT THE Y-12 NATIONAL SECURITY COMPLEX

AUTOMATED TEMPLATE MATCHING METHOD FOR NMIS AT THE Y-12 NATIONAL SECURITY COMPLEX AUTOMATED TEMPLATE MATCHING METHOD FOR NMIS AT THE Y-1 NATIONAL SECURITY COMPLEX J. A. Mullens, J. K. Mattingly, L. G. Chiang, R. B. Oberer, J. T. Mihalczo ABSTRACT This paper describes a template matching

More information

BIOINFORMATICS LAB AP BIOLOGY

BIOINFORMATICS LAB AP BIOLOGY BIOINFORMATICS LAB AP BIOLOGY Bioinformatics is the science of collecting and analyzing complex biological data. Bioinformatics combines computer science, statistics and biology to allow scientists to

More information

Moderators Report/ Principal Moderator Feedback. Summer 2016 Pearson Edexcel GCSE in Astronomy (5AS02) Paper 01

Moderators Report/ Principal Moderator Feedback. Summer 2016 Pearson Edexcel GCSE in Astronomy (5AS02) Paper 01 Moderators Report/ Principal Moderator Feedback Summer 2016 Pearson Edexcel GCSE in Astronomy (5AS02) Paper 01 The controlled assessment forms 25% of the overall mark for this specification. Candidates

More information

Comparative genomics: Overview & Tools + MUMmer algorithm

Comparative genomics: Overview & Tools + MUMmer algorithm Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. urmila@bioinfo.ernet.in Genome sequence: Fact file 1995: The first

More information

Minimization of Boolean Expressions Using Matrix Algebra

Minimization of Boolean Expressions Using Matrix Algebra Minimization of Boolean Expressions Using Matrix Algebra Holger Schwender Collaborative Research Center SFB 475 University of Dortmund holger.schwender@udo.edu Abstract The more variables a logic expression

More information

On the Use of Forecasts when Forcing Annual Totals on Seasonally Adjusted Data

On the Use of Forecasts when Forcing Annual Totals on Seasonally Adjusted Data The 34 th International Symposium on Forecasting Rotterdam, The Netherlands, June 29 to July 2, 2014 On the Use of Forecasts when Forcing Annual Totals on Seasonally Adjusted Data Michel Ferland, Susie

More information

Haploid & diploid recombination and their evolutionary impact

Haploid & diploid recombination and their evolutionary impact Haploid & diploid recombination and their evolutionary impact W. Garrett Mitchener College of Charleston Mathematics Department MitchenerG@cofc.edu http://mitchenerg.people.cofc.edu Introduction The basis

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning 10-715 Advanced Introduction to Machine Learning Homework 3 Due Nov 12, 10.30 am Rules 1. Homework is due on the due date at 10.30 am. Please hand over your homework at the beginning of class. Please see

More information

Learning Computer-Assisted Map Analysis

Learning Computer-Assisted Map Analysis Learning Computer-Assisted Map Analysis by Joseph K. Berry* Old-fashioned math and statistics can go a long way toward helping us understand GIS Note: This paper was first published as part of a three-part

More information

Differential Modeling for Cancer Microarray Data

Differential Modeling for Cancer Microarray Data Differential Modeling for Cancer Microarray Data Omar Odibat Department of Computer Science Feb, 01, 2011 1 Outline Introduction Cancer Microarray data Problem Definition Differential analysis Existing

More information

Indirect Measurement Technique: Using Trigonometric Ratios Grade Nine

Indirect Measurement Technique: Using Trigonometric Ratios Grade Nine Ohio Standards Connections Measurement Benchmark D Use proportional reasoning and apply indirect measurement techniques, including right triangle trigonometry and properties of similar triangles, to solve

More information

Satellite Maneuver Detection Using Two-line Element (TLE) Data. Tom Kelecy Boeing LTS, Colorado Springs, CO / Kihei, HI

Satellite Maneuver Detection Using Two-line Element (TLE) Data. Tom Kelecy Boeing LTS, Colorado Springs, CO / Kihei, HI Satellite Maneuver Detection Using Two-line Element (TLE) Data Tom Kelecy Boeing LTS, Colorado Springs, CO / Kihei, HI Doyle Hall Boeing LTS, Colorado Springs, CO / Kihei, HI Kris Hamada Pacific Defense

More information

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

RNA Search and! Motif Discovery Genome 541! Intro to Computational! Molecular Biology RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology" Day 1" Many biologically interesting roles for RNA" RNA secondary structure prediction" 3 4 Approaches to Structure

More information

Nemo: Multi-Criteria Test-Suite Minimization with Integer Nonlinear Programming.

Nemo: Multi-Criteria Test-Suite Minimization with Integer Nonlinear Programming. Jun-Wei Lin University of California, Irvine junwel1@uci.edu Joshua Garcia University of California, Irvine joshug4@uci.edu ABSTRACT Multi-criteria test-suite minimization aims to remove redundant test

More information

CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON

CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON PROKARYOTE GENES: E. COLI LAC OPERON CHAPTER 13 CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON Figure 1. Electron micrograph of growing E. coli. Some show the constriction at the location where daughter

More information

Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory. Ensemble methods. Boosting. Boosting: history Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

More information

ABE Math Review Package

ABE Math Review Package P a g e ABE Math Review Package This material is intended as a review of skills you once learned and wish to review before your assessment. Before studying Algebra, you should be familiar with all of the

More information

Quantitative Measurement of Genome-wide Protein Domain Co-occurrence of Transcription Factors

Quantitative Measurement of Genome-wide Protein Domain Co-occurrence of Transcription Factors Quantitative Measurement of Genome-wide Protein Domain Co-occurrence of Transcription Factors Arli Parikesit, Peter F. Stadler, Sonja J. Prohaska Bioinformatics Group Institute of Computer Science University

More information

Mathmatics 239 solutions to Homework for Chapter 2

Mathmatics 239 solutions to Homework for Chapter 2 Mathmatics 239 solutions to Homework for Chapter 2 Old version of 8.5 My compact disc player has space for 5 CDs; there are five trays numbered 1 through 5 into which I load the CDs. I own 100 CDs. a)

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

FINISHING MY FOSMID XAAA113 Andrew Nett

FINISHING MY FOSMID XAAA113 Andrew Nett FINISHING MY FOSMID XAAA113 Andrew Nett The assembly of fosmid XAAA113 initially consisted of three contigs with two gaps (Fig 1). This assembly was based on the sequencing reactions of subclones set up

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Regression Clustering

Regression Clustering Regression Clustering In regression clustering, we assume a model of the form y = f g (x, θ g ) + ɛ g for observations y and x in the g th group. Usually, of course, we assume linear models of the form

More information

Practical Algebra. A Step-by-step Approach. Brought to you by Softmath, producers of Algebrator Software

Practical Algebra. A Step-by-step Approach. Brought to you by Softmath, producers of Algebrator Software Practical Algebra A Step-by-step Approach Brought to you by Softmath, producers of Algebrator Software 2 Algebra e-book Table of Contents Chapter 1 Algebraic expressions 5 1 Collecting... like terms 5

More information

Introduction to Molecular and Cell Biology

Introduction to Molecular and Cell Biology Introduction to Molecular and Cell Biology Molecular biology seeks to understand the physical and chemical basis of life. and helps us answer the following? What is the molecular basis of disease? What

More information