Computing repetitive structures in indeterminate strings

Pavlos Antoniou 1, Costas S. Iliopoulos 1, Inuka Jayasekera 1, Wojciech Rytter 2,3

1 Dept. of Computer Science, King's College London, London WC2R 2LS, England, UK
2 Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, Warsaw, Poland
3 Faculty of Mathematics and Informatics, Copernicus University, Toruń, Poland

Abstract. We study the problem of finding local and global covers, and repetitive structures called seeds, in indeterminate strings. An indeterminate string is a sequence $X = X[1] X[2] \ldots X[n]$, where $X[i] \subseteq \Sigma$ for each $i$, and $\Sigma$ is a given alphabet of fixed size. We present an algorithm that finds the smallest cover of the string $x$ in $O(n \log n)$ time, where $n$ is the length of $x$. We then extend this algorithm to find all the local and global covers of $x$, and we extend the latter to compute the seeds of $x$.

1 Introduction

Covers, along with repetitions and periods, are common regularities in strings; like them, they are periodically repetitive. A substring $w$ of a string $x$ is called a cover of $x$ if and only if $x$ can be constructed by concatenations and superpositions of $w$. A seed is an extended cover, in the sense that it is a cover of a superstring of $x$.

Finding the regularities present in strings is not only of interest in string algorithms; it is also useful in many applications, including molecular biology, data compression and computational music analysis.

Regularities in strings have been studied widely over the last 20 years. There are several $O(n \log n)$-time algorithms for finding repetitions ([3],[6]) in a string $x$, where $n$ is the length of $x$. Apostolico and Breslauer [1] gave an optimal $O(\log \log n)$-time parallel algorithm for finding all the repetitions. The preprocessing of the Knuth-Morris-Pratt algorithm [11] finds all periods of every prefix of $x$ in linear time. In many cases, it is desirable to relax the meaning of repetition.
For instance, if we allow overlapping and concatenation of periods in a string, we get the notion of covers. The notion of covers was introduced by Apostolico, Farach and Iliopoulos in [2], where a linear-time algorithm to test superprimitivity was given. Moore and Smyth [12] gave linear-time algorithms for finding all covers of a string $x$.

An extension of the notion of covers is that of seeds, that is, covers of a superstring of $x$. The notion of seeds was introduced by Iliopoulos, Moore and Park [9], who gave an $O(n \log n)$-time algorithm for computing all seeds of $x$. A parallel algorithm for finding all seeds, requiring $O(\log n)$ time and $O(n \log n)$ work, was presented by Berkman, Iliopoulos and Park [5].

In this work, we find these string regularities in indeterminate strings. An indeterminate string is a sequence $X = X[1] X[2] \ldots X[n]$, where $X[i] \subseteq \Sigma$ for each $i$, and $\Sigma$ is a given alphabet of potentially large size. The simplest form of indeterminate string is one in which indeterminate positions can contain only a don't care letter, that is, a letter that matches any letter in the alphabet $\Sigma$ on which $x$ is defined. An algorithm was described [7] for computing all occurrences of a pattern $p$ in a text string $x$, where both $p$ and $x$ are defined on the alphabet $\Sigma$; although efficient in theory, the algorithm was not useful in practice.

Fig. 1. A sequence logo of a biological indeterminate sequence (Lambda cI and cro binding sites). Picture taken from [13].

Indeterminate string pattern matching has mainly been handled by bit mapping techniques (the Shift-Or method) [4],[15]. These techniques have been used to find matches for an indeterminate pattern $p$ in a string $x$ [8], and the agrep utility [14] has been virtually one of the few practical algorithms available for indeterminate pattern matching.

In [8] the authors extended the notion of indeterminate strings by distinguishing two distinct forms of indeterminate match ("quantum" and "determinate"). Roughly speaking, a quantum match allows an indeterminate letter to match two or more distinct letters during a single matching process; a determinate match restricts each indeterminate letter to a single match [8].

In this paper we find string regularities present in indeterminate strings. We report the local covers and subsequently all the covers of an indeterminate string $x$. Section 2 presents the basic definitions. Sections 3, 4, 5 and 6 present the algorithms. These are followed by an analysis of the algorithms and the conclusions.

2 Basic definitions

A string is a sequence of zero or more symbols from an alphabet $\Sigma$. The set of all strings over $\Sigma$ is denoted by $\Sigma^*$. The length of a string $x$ is denoted by $|x|$. The empty string, the string of length zero, is denoted by $\varepsilon$. The $i$-th symbol of a string $x$ is denoted by $x[i]$. A string $w$ is a substring of $x$ if $x = uwv$, where $u, v \in \Sigma^*$; conversely, $x$ is called a superstring of $w$. We denote by $x[i \ldots j]$ the substring of $x$ that starts at position $i$ and ends at position $j$. A string $w$ is a prefix of $x$ if $x = wy$ for some $y \in \Sigma^*$. Similarly, $w$ is a suffix of $x$ if $x = yw$ for some $y \in \Sigma^*$.

We call a string $w$ a subsequence of $x$ (or $x$ a supersequence of $w$) if $w$ is obtained by deleting zero or more symbols at any positions from $x$. For example, ace is a subsequence of aabcdef. For a given set $S$ of strings, a string $w$ is called a common supersequence of $S$ if $w$ is a supersequence of every string in $S$. The string $xy$ is a concatenation of the strings $x$ and $y$. The concatenation of $k$ copies of $x$ is denoted by $x^k$.
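The prefix, suffix and subsequence definitions above can be checked mechanically. The following Python sketch is only an illustration for ordinary (determinate) strings; the function names are hypothetical and do not come from the paper.

```python
def is_prefix(w: str, x: str) -> bool:
    """True if x = w y for some (possibly empty) string y."""
    return x[:len(w)] == w


def is_suffix(w: str, x: str) -> bool:
    """True if x = y w for some (possibly empty) string y."""
    return len(w) <= len(x) and x[len(x) - len(w):] == w


def is_subsequence(w: str, x: str) -> bool:
    """True if w can be obtained from x by deleting zero or more symbols."""
    it = iter(x)
    # `c in it` consumes the iterator up to and including the first c,
    # so the symbols of w must appear in x in left-to-right order.
    return all(c in it for c in w)
```

For the example in the text, `is_subsequence("ace", "aabcdef")` holds, whereas reversing the order of the letters (`"eca"`) does not give a subsequence.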
For two strings $x = x[1 \ldots n]$ and $y = y[1 \ldots m]$ such that $x[n-i+1 \ldots n] = y[1 \ldots i]$ for some $i \ge 1$ (that is, such that $x$ has a suffix equal to a prefix of $y$), the string $x[1 \ldots n]\,y[i+1 \ldots m]$ is said to be a superposition of $x$ and $y$. Alternatively, we may say that $x$ overlaps with $y$.

A substring $y$ of $x$ is called a repetition in $x$ if $x = u y^k v$, where $u$, $y$, $v$ are substrings of $x$, $k \ge 2$ and $|y| \neq 0$. For example, if $x = aababab$, then a (appearing in positions 1 and 2) and ab (appearing in positions 2, 4 and 6) are repetitions in $x$; in particular $a^2 = aa$ is called a square and $(ab)^3 = ababab$ is called a cube.

A substring $w$ is called a period of a string $x$ if $x$ can be written as $x = w^k w'$, where $k \ge 1$ and $w'$ is a prefix of $w$. The shortest period of $x$ is called the period of $x$. For example, if $x = abcabcab$, then abc, abcabc and the string $x$ itself are periods of $x$, while abc is the period of $x$.
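As a small worked illustration of the definitions of period and repetition, the following Python sketch enumerates the periods of an ordinary string by brute force. It is not part of the paper's algorithms; the function names are assumptions made for this example, and the quadratic running time is deliberately naive.

```python
from typing import List


def periods(x: str) -> List[str]:
    """Return every prefix w of x such that x = w^k w' with w' a prefix of w."""
    n = len(x)
    result = []
    for p in range(1, n + 1):
        # x has a period of length p iff x[i] == x[i - p] for every i >= p.
        if all(x[i] == x[i - p] for i in range(p, n)):
            result.append(x[:p])
    return result


def is_repetition(y: str, x: str, k: int = 2) -> bool:
    """True if y is non-empty and y^k occurs in x, i.e. x = u y^k v."""
    return len(y) > 0 and (y * k) in x
```

On the example from the text, `periods("abcabcab")` yields abc, abcabc and the whole string, and the first entry is the period of the string.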

A substring $w$ of $x$ is called a cover of $x$ if $x$ can be constructed by concatenating or overlapping copies of $w$. We also say that $w$ covers $x$. For example, if $x = ababaaba$, then aba and $x$ are covers of $x$. If $x$ has a cover $w \neq x$, $x$ is said to be quasiperiodic; otherwise, $x$ is superprimitive.

An indeterminate string is a sequence $X = X[1] X[2] \ldots X[n]$, where $X[i] \subseteq \Sigma$ for each $i$, and $\Sigma$ is a given alphabet of potentially large size.

The following theorem is used in the algorithms of this paper to access sorted elements of a doubly linked list in constant time.

Theorem 1 ([10]) Let $a[1], \ldots, a[n]$ be a doubly linked list. There exists an algorithm that preprocesses the list $a$ in such a way that, after a number of deletions in the list $a$, one can find the nearest a[j] to the left of a[i] with a[j] a[i] in constant time.

2.1 Using the masking technique

In order to match character classes efficiently, we represent our strings as sequences of 4-bit masks. The alphabet for describing DNA sequences has 4 symbols, namely {A, C, G, T}. We map these single characters to the bit masks {1000, 0100, 0010, 0001}, respectively. For $k$ characters $x_1 \ldots x_k$ we can represent the character set $[x_1 \ldots x_k]$ as

$M(x_1)$ OR $M(x_2)$ OR ... OR $M(x_k)$,

where $M(x_i)$ is the 4-bit mask of $x_i$. Using this representation and the bitwise AND operation, we can determine whether there is a match between characters or character sets: a non-zero result indicates a match and a zero result a mismatch.

For example, to determine whether [AC] matches [CG], we first convert the character sets into 4-bit masks: [AC] = 1000 OR 0100 = 1100 and [CG] = 0100 OR 0010 = 0110. We then perform a bitwise AND operation on the two masks: 1100 AND 0110 = 0100. Since the result is non-zero, we conclude that [AC] matches [CG], as they have C as a common symbol (the character representation of the resulting bit mask).
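The masking technique can be sketched in a few lines of Python. The 4-bit encoding of A, C, G, T follows the text; the helper names `mask_of`, `matches` and `symbols_of` are hypothetical and chosen only for this illustration.

```python
# 4-bit masks for the DNA alphabet, as given in the text.
MASK = {"A": 0b1000, "C": 0b0100, "G": 0b0010, "T": 0b0001}


def mask_of(symbols: str) -> int:
    """OR together the masks of all symbols in a character set such as 'AC'."""
    m = 0
    for s in symbols:
        m |= MASK[s]
    return m


def matches(a: int, b: int) -> bool:
    """Two character sets match iff the bitwise AND of their masks is non-zero."""
    return (a & b) != 0


def symbols_of(mask: int) -> str:
    """Character representation of a mask, e.g. 0b0100 -> 'C'."""
    return "".join(s for s, bit in MASK.items() if mask & bit)
```

With these helpers, the masks 1100 for [AC] and 0110 for [CG] have the AND 0100, which `symbols_of` decodes to C, while [AC] and [GT] share no symbol and therefore do not match.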
In the algorithms in the remainder of the paper, we apply bit masking to our indeterminate strings.

3 Computing the smallest cover of the indeterminate string x

The following algorithm finds the smallest cover $\hat{u}$ that covers the indeterminate string $x$. Assume that we have performed $k$ iterations. So far, we have built:

Position Array. We find all the occurrences of the substring $\hat{u}$ in $x$, and we denote the occurrence of $\hat{u}$ at position $I_\mu$ by $u_\mu$. Then $u_1$ is a prefix of $x$, $|\hat{u}| = k$, and $\hat{u} = u_\mu$ for $1 \le \mu \le l$. The starting positions of the substrings $u_\mu$ are recorded in an array $S_1$:

$S_1 = \{I_1, I_2, \ldots, I_{\mu-1}, I_\mu, I_{\mu+1}, \ldots, I_l\}$

Gap Array. The distance between the starting positions of consecutive occurrences $u_\mu$, $u_{\mu+1}$ is denoted by $g_\mu$ and is entered into a second array $S_2$:

$S_2 = \{g_1, g_2, \ldots, g_{\mu-1}, g_\mu, g_{\mu+1}, \ldots, g_l\}$

Figure 2 presents the distances $g_i$ as arcs between the substrings $u$.

Fig. 2. Covering the string x with substring u.

Order Array. We create an array $S_3$, which holds the elements of $S_2$ sorted from smallest to largest. We also create a doubly linked list $L$ in which, for each element $g_i$ of $S_2$, we keep its Order($g_i$) according to $S_3$. The reason for creating this doubly linked list is to be able to use Theorem 1, which allows us to access any element of $S_3$ in constant time.

Testing. We apply a simple test to determine whether the substring $\hat{u}$ is a cover of $x$: we check whether the largest element of $S_3$, denoted $g_l$, is no larger than $k$. If $g_l \le k$, then $\hat{u}$ is a cover of $x$. If $\hat{u}$ is not a cover of $x$ for $|\hat{u}| = k$, we continue by increasing $k$ by 1 and solving the problem for $|u| = k + 1$. Accordingly, the distance allowed between consecutive occurrences, in order for them to form a cover, is also increased to $k + 1$.

Main Steps. We extend the length of each $u_\mu$, $1 \le \mu \le l$, by one character, to length $k + 1$.

So far we have a series of prefixes of length $k$, and we check whether they can be extended by one character. Let $u^i_{k+1}$ denote the $(k+1)$-th character of $u^i$.

Step 1. We check whether the next character of the current substring is equal to the next character of another substring $u^j$, i.e. whether $u^i_{k+1}$ is equal to $u^j_{k+1}$. This check can be performed via the bit masking method for index $i$, which is a good method for practical purposes, without affecting the running time of the algorithm. Let

$u^{i_1}_{k+1} = u^{i_2}_{k+1} = \cdots = u^{i_\tau}_{k+1}$

and let their starting positions be

$I = \{i_1, i_2, \ldots, i_\tau\}$.

Then $\hat{u}_k = \{a \mid a \in u^i_{k+1},\ i \in I\}$ and $\hat{u} = \hat{u}_1 \hat{u}_2 \ldots \hat{u}_k$.

Step 2. Suppose that at some position $I_\mu$, $u_\mu$ cannot be extended further without giving a mismatch. This is illustrated in Figure 3 at the point marked by a cross. There are, however, matching substrings following this unmatched substring. Therefore, we discard this $u_\mu$ and cover the string with the rest of the substrings $u_i$. We do this by first deleting $I_\mu$ from $S_1$. Then we delete and update the affected distances, between $g_{\mu-1}$ and $g_{\mu+1}$, in $S_2$:

$S_1 = \{I_1, I_2, \ldots, I_{\mu-1}, I_\mu, I_{\mu+1}, \ldots, I_l\}$
$S_2 = \{g_1, g_2, \ldots, g_{\mu-1}, g_\mu, g_{\mu+1}, \ldots, g_l\}$

By keeping the linked list $L$ and using Theorem 1, the corresponding distances $g_{\mu-1}$ and $g_{\mu+1}$ in $S_2$ can be found in constant time.

Step 3. We do a binary search and insert the new distance $g'_\mu = g_{\mu-1} + g_\mu$ in its corresponding position in the sorted array $S_3$. This requires $O(\log n)$ time. We then test whether $g_l \le k + 1$. If this inequality holds, then $\hat{u}$ with $|\hat{u}| = k + 1$ is a cover of $x$. Figure 3 shows an example of this operation: the arcs $g_{\mu-1}$ and $g_\mu$ are deleted and replaced by $g'_\mu$.
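The overall shape of the procedure (extend every surviving occurrence by one symbol, discard those that mismatch, then test the gaps) can be sketched as follows. This is a deliberately naive Python illustration, not the O(n log n) algorithm: it rescans the position array at every length instead of maintaining the sorted gap list and Theorem 1, so it is quadratic in the worst case. It also assumes that an occurrence is any position whose symbol sets intersect those of the prefix of x position by position, and that every position holds a non-empty set. All function names are hypothetical.

```python
from typing import List, Optional

# 4-bit DNA masks as in Section 2.1.
MASK = {"A": 0b1000, "C": 0b0100, "G": 0b0010, "T": 0b0001}


def encode(positions: List[str]) -> List[int]:
    """Turn e.g. ["A", "CG", "T"] into one bit mask per position."""
    out = []
    for symbols in positions:
        m = 0
        for s in symbols:
            m |= MASK[s]
        out.append(m)
    return out


def gaps(starts: List[int]) -> List[int]:
    """Gap array: distances between consecutive starting positions."""
    return [b - a for a, b in zip(starts, starts[1:])]


def smallest_cover_length(x: List[int]) -> Optional[int]:
    """Length of the shortest prefix of x that covers x (0-based positions)."""
    n = len(x)
    starts = list(range(n))  # candidate occurrences of the empty prefix
    for k in range(1, n + 1):
        # Main step: keep only occurrences that extend to length k,
        # i.e. whose k-th symbol set intersects the k-th set of the prefix.
        starts = [i for i in starts if i + k <= n and x[i + k - 1] & x[k - 1]]
        # Test: first occurrence is a prefix, last one ends at the end
        # of x, and no gap between consecutive occurrences exceeds k.
        if (starts and starts[0] == 0 and starts[-1] + k == n
                and all(g <= k for g in gaps(starts))):
            return k
    return None
```

For instance, the determinate string ACACAACA (the example ababaaba with a = A, b = C) has a smallest cover of length 3, and so does the indeterminate string A C [AG] C A, where the set [AG] matches the A of the prefix.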

Fig. 3. Extending the length of substring u from k to k + 1 to find a cover for string x. At the position marked by a cross, substring u cannot be extended further without giving a mismatch.

4 Computing the maximal local covers of x

The following algorithm finds maximal substrings of $x$ which are covered locally by some non-extendable factor $\hat{u}$ of $x$. As with the algorithm of Section 3, we assume that we have performed $k$ iterations. However, the algorithm differs in that we are no longer concerned only with $u$ that is a prefix of $x$. We therefore consider all the factors of $x$ (starting with length two) as possible local covers of $x$. After $k$ iterations, we have created the following:

Position Array. We have a set of local covers $\{\hat{u}^{(1)}, \hat{u}^{(2)}, \ldots, \hat{u}^{(\lambda)}\}$ with $|\hat{u}^{(i)}| = k$, $1 \le i \le \lambda$. The starting positions of each substring $\hat{u}^{(j)}$ are recorded in an array $S^{(j)}_1$:

$S^{(j)}_1 = \{I^{(j)}_1, I^{(j)}_2, \ldots, I^{(j)}_{\mu-1}\}$

We perform the following steps for each $j$, $1 \le j \le \lambda$.

Gap Array. The distances between the starting positions of consecutive occurrences of $\hat{u}^{(j)}$ are entered into a second array $S^{(j)}_2$. We denote the distance between the consecutive occurrences $u^{(j)}_i$ and $u^{(j)}_{i+1}$ of $\hat{u}^{(j)}$ by $g^{(j)}_i$; then the array $S^{(j)}_2$ is as follows:

$S^{(j)}_2 = \{g^{(j)}_1, g^{(j)}_2, \ldots, g^{(j)}_\mu, \ldots, g^{(j)}_l\}$

Order Array. We create an array $S^{(j)}_3$, which holds the elements of $S^{(j)}_2$ sorted from smallest to largest. We also create a doubly linked list $L$ in which, for each element $g^{(j)}_i$ of $S^{(j)}_2$, we keep its Order($g^{(j)}_i$) according to $S^{(j)}_3$. As with the algorithm of Section 3, the doubly linked list is needed to utilize Theorem 1.

Local Covers. We create an array LC of the local covers which have been detected so far in the algorithm. Each cover is stored as a pair $(\Lambda^{(j)}_l, \Lambda^{(j)}_r)$, where $\Lambda^{(j)}_l$ and $\Lambda^{(j)}_r$ are the left-most and right-most positions of the $j$-th local cover of $x$ respectively. Figure 4 shows an example of the local covers array.

Fig. 4. Example of the LC array with two local covers: LC = {(1, 18), (20, 34)}.

Main Steps. We extend the length of each $\hat{u}^{(j)}$ by one character, to length $k + 1$. We then partition the set $S^{(j)}_1$ into sets that represent all possible extensions of $\hat{u}^{(j)}$ to length $k + 1$. The following steps are repeated until $\hat{u}$ cannot be extended any further.

Step 1. We check whether the next character of the current substring is equal to the next character of another substring, i.e. whether $u^{i,(j)}_{k+1}$ is equal to $u^{v,(j)}_{k+1}$. This check can be performed via the bit masking method for index $i$, which is a good method for practical purposes, without affecting the running time of the algorithm. Let

$u^{i_1,(j)}_{k+1} = u^{i_2,(j)}_{k+1} = \cdots = u^{i_\tau,(j)}_{k+1}$

and let their starting positions be

$I = \{i_1, i_2, \ldots, i_\tau\}$.

Then $\hat{u}^{(j)}_k = \{a \mid a \in u^{i,(j)}_{k+1},\ i \in I\}$ and $\hat{u}^{(j)} = \hat{u}^{(j)}_1 \hat{u}^{(j)}_2 \ldots \hat{u}^{(j)}_k$.

Step 2. Suppose that at some position $I^{(j)}_\mu$, $u^{\mu,(j)}$ cannot be extended further without giving a mismatch. There are, however, matching substrings following this unmatched substring. Therefore, we discard this $u^{\mu}$ and cover the string with the rest of the substrings $u^i$. We do this by first deleting $I^{(j)}_\mu$ from $S^{(j)}_1$. Then we delete and update the affected distances, between $g^{(j)}_{\mu-1}$ and $g^{(j)}_{\mu+1}$, in $S^{(j)}_2$:

$S^{(j)}_1 = \{I^{(j)}_1, I^{(j)}_2, \ldots, I^{(j)}_{\mu-1}, I^{(j)}_\mu, I^{(j)}_{\mu+1}, \ldots, I^{(j)}_l\}$

$S^{(j)}_2 = \{g^{(j)}_1, g^{(j)}_2, \ldots, g^{(j)}_{\mu-1}, g^{(j)}_\mu, g^{(j)}_{\mu+1}, \ldots, g^{(j)}_l\}$

By keeping the linked list $L$ and using Theorem 1, the corresponding distances $g^{(j)}_{\mu-1}$ and $g^{(j)}_{\mu+1}$ in $S^{(j)}_2$ can be found in constant time. The non-extendable occurrences, say $u^\gamma$, get a new set of data structures $S^{(\gamma)}_1$, $S^{(\gamma)}_2$, $S^{(\gamma)}_3$ and LC.

Fig. 5. Extension of û for local covers: the occurrences at $I_1$ and $I_4$ are extended by a, those at $I_2$ and $I_3$ by b, and those at $I_5, \ldots, I_l$ by c.

Fig. 6. Partitioning the set S: $S = \{I_1, I_2, I_3, I_4, I_5, \ldots, I_l\}$, $S_{\hat{u}b} = \{I_2, I_3, \ldots\}$, $S_{\hat{u}a} = \{I_1, I_4, \ldots\}$, $S_{\hat{u}c} = \{I_5, \ldots, I_l\}$.

Step 3. We now update the LC array. Following the extension of $\hat{u}^{(j)}$, two cases are possible:

Case 1: An occurrence of $\hat{u}^{(j)}$ within a local cover cannot be extended by the same character as all other occurrences of $\hat{u}^{(j)}$ in the same local cover. In this case: (i) the local cover of length $k$ is maximal; (ii) the local cover of length $k + 1$ is split into two smaller local covers and $\Lambda$ is updated accordingly (see Figure 7).

Fig. 7. LC after removal of the third occurrence of û: LC = {(1, 7), (9, 18), (20, 34)}.

Case 2: An occurrence of $\hat{u}^{(j)}$ can be extended beyond the end position of the local cover to which it belongs. In this case, if the condition $g^{(j)}_{\hat{u}} \le k + 1$ is met, the two local covers are joined together to make a larger local cover and $\Lambda$ is updated accordingly (see Figure 8).

Fig. 8. LC after extension of u: LC = {(1, 7), (9, 34)}.

5 Computing all the covers of x

To find all the covers of the string $x$, we slightly modify the algorithm of Section 3. Instead of stopping as soon as the test $g_l \le |u|$ succeeds, we continue increasing the length of $u$ until $|u| = n - 1$, where $n = |x|$. During every iteration of the Main Steps, if $g_l \le |u|$, we output $u$, as it is a cover of $x$.

6 Computing the seeds of x

A substring $w$ of $x$ is called a seed of $x$ if $w$ covers a superstring of $x$ (this can be any superstring of $x$, including $x$ itself). For example, aba and ababa are some seeds of $x = ababaab$.

If a substring $u$ is a seed of a string $x$, then there exists a superstring $y = sxv$ of $x$, with $|s| < |u|$ and $|v| < |u|$, which can be constructed by overlapping or concatenating copies of the strings $u_1, u_2, u_3, \ldots, u_l$. By the definition of seeds, $x[i \ldots n]$ can be matched to a prefix of $u$ and $x[1 \ldots j]$ can be matched to a suffix of $u$ (because $u$ has to cover a superstring of $x$). Therefore, we extend the previous algorithm for finding all the covers of $x$ by one more test, in order to find the seeds of $x$.

Up to this point, we have found a series of covers $u_i$. We can think of the problem as if someone had cut the superstring prematurely, before the last occurrence $u_l$ finished, and had also cut off the beginning of the first occurrence $u_1$. Figure 9 illustrates this notion. We want to check whether the cover $u_i$ is also a seed of $x$. We do this by checking whether $x[i \ldots n]$ can be matched to a prefix of $u$ and $x[1 \ldots j]$ can be matched to a suffix of $u$. At position $I_1$, where we have the first occurrence $u_1$ of $u$, we test whether the part of $x$ preceding $u_1$ matches a suffix of $u$. Additionally, we test whether the part of $x$ following the last occurrence $u_l$ of $u$ matches a prefix of $u$.
If both of these tests succeed, then the sequence $u_1, u_2, \ldots, u_l$ of covers found are also seeds of $x$, as they can form a superstring of $x$.

Fig. 9. Finding seeds of string x.

7 Conclusion

We have shown $O(n \log n)$ algorithms for finding the smallest cover, the local covers and all the covers of a string. We have also presented an $O(n \log n)$ algorithm for finding the seeds of a string. All of these algorithms are easily adapted to use the bit-masking technique, which allows efficient implementations.

References

1. A. Apostolico and D. Breslauer. An optimal O(log log n)-time parallel algorithm for detecting all squares in a string. SIAM J. Comput., 25(6).
2. A. Apostolico, M. Farach, and C. S. Iliopoulos. Optimal superprimitivity testing for strings. Information Processing Letters, 39:17-20.
3. A. Apostolico and F. P. Preparata. Optimal off-line detection of repetitions in a string. Theor. Comput. Sci., 22.
4. R. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Commun. ACM, 35(10):74-82.
5. O. Berkman, C. S. Iliopoulos, and K. Park. The subtree max gap problem with application to parallel string covering. Information and Computation, 123(1).
6. M. Crochemore. An optimal algorithm for computing the repetitions in a word. Inf. Process. Lett., 12(5).
7. M. J. Fischer and M. S. Paterson. String-matching and other products. Technical report, Cambridge, MA, USA.
8. J. Holub, W. F. Smyth, and S. Wang. Fast pattern-matching on indeterminate strings. J. of Discrete Algorithms, 6(1):37-50.
9. C. S. Iliopoulos, D. W. G. Moore, and K. Park. Covering a string. In Proceedings of the 4th Symposium on Combinatorial Pattern Matching, volume 684 of Lecture Notes in Computer Science, pages 54-62, Berlin. Springer-Verlag.
10. H. Imai and T. Asano. Dynamic orthogonal segment intersection search. J. Algorithms, 8(1):1-18.
11. D. E. Knuth, J. Morris, and V. R. Pratt. Fast pattern matching in strings. SIAM Journal of Computing, 6(2).
12. D. Moore and W. F. Smyth. An optimal algorithm to compute all the covers of a string. Inf. Process. Lett., 50(5).
13. M. C. Shaner, I. M. Blair, and T. D. Schneider. Sequence logos: A powerful, yet simple, tool. In T. N. Mudge, V. Milutinovic, and L. Hunter, editors, Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, Volume 1: Architecture and Biotechnology Computing. IEEE Computer Society Press.
14. S. Wu and U. Manber. Agrep - a fast approximate pattern-matching tool. In Proceedings USENIX Winter 1992 Technical Conference, San Francisco, CA.
15. S. Wu and U. Manber. Fast text searching: allowing errors. Commun. ACM, 35(10):83-91.

Finding all covers of an indeterminate string in O(n) time on average

Finding all covers of an indeterminate string in O(n) time on average Finding all covers of an indeterminate string in O(n) time on average Md. Faizul Bari, M. Sohel Rahman, and Rifat Shahriyar Department of Computer Science and Engineering Bangladesh University of Engineering

More information

Approximate periods of strings

Approximate periods of strings Theoretical Computer Science 262 (2001) 557 568 www.elsevier.com/locate/tcs Approximate periods of strings Jeong Seop Sim a;1, Costas S. Iliopoulos b; c;2, Kunsoo Park a;1; c; d;3, W.F. Smyth a School

More information

String Regularities and Degenerate Strings

String Regularities and Degenerate Strings M.Sc. Engg. Thesis String Regularities and Degenerate Strings by Md. Faizul Bari Submitted to Department of Computer Science and Engineering in partial fulfilment of the requirments for the degree of Master

More information

OPTIMAL PARALLEL SUPERPRIMITIVITY TESTING FOR SQUARE ARRAYS

OPTIMAL PARALLEL SUPERPRIMITIVITY TESTING FOR SQUARE ARRAYS Parallel Processing Letters, c World Scientific Publishing Company OPTIMAL PARALLEL SUPERPRIMITIVITY TESTING FOR SQUARE ARRAYS COSTAS S. ILIOPOULOS Department of Computer Science, King s College London,

More information

(Preliminary Version)

(Preliminary Version) Relations Between δ-matching and Matching with Don t Care Symbols: δ-distinguishing Morphisms (Preliminary Version) Richard Cole, 1 Costas S. Iliopoulos, 2 Thierry Lecroq, 3 Wojciech Plandowski, 4 and

More information

Implementing Approximate Regularities

Implementing Approximate Regularities Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National

More information

String Regularities and Degenerate Strings

String Regularities and Degenerate Strings M. Sc. Thesis Defense Md. Faizul Bari (100705050P) Supervisor: Dr. M. Sohel Rahman String Regularities and Degenerate Strings Department of Computer Science and Engineering Bangladesh University of Engineering

More information

Optimal Superprimitivity Testing for Strings

Optimal Superprimitivity Testing for Strings Optimal Superprimitivity Testing for Strings Alberto Apostolico Martin Farach Costas S. Iliopoulos Fibonacci Report 90.7 August 1990 - Revised: March 1991 Abstract A string w covers another string z if

More information

Overlapping factors in words

Overlapping factors in words AUSTRALASIAN JOURNAL OF COMBINATORICS Volume 57 (2013), Pages 49 64 Overlapping factors in words Manolis Christodoulakis University of Cyprus P.O. Box 20537, 1678 Nicosia Cyprus christodoulakis.manolis@ucy.ac.cy

More information

Number of occurrences of powers in strings

Number of occurrences of powers in strings Author manuscript, published in "International Journal of Foundations of Computer Science 21, 4 (2010) 535--547" DOI : 10.1142/S0129054110007416 Number of occurrences of powers in strings Maxime Crochemore

More information

Efficient (δ, γ)-pattern-matching with Don t Cares

Efficient (δ, γ)-pattern-matching with Don t Cares fficient (δ, γ)-pattern-matching with Don t Cares Yoan José Pinzón Ardila Costas S. Iliopoulos Manolis Christodoulakis Manal Mohamed King s College London, Department of Computer Science, London WC2R 2LS,

More information

SIMPLE ALGORITHM FOR SORTING THE FIBONACCI STRING ROTATIONS

SIMPLE ALGORITHM FOR SORTING THE FIBONACCI STRING ROTATIONS SIMPLE ALGORITHM FOR SORTING THE FIBONACCI STRING ROTATIONS Manolis Christodoulakis 1, Costas S. Iliopoulos 1, Yoan José Pinzón Ardila 2 1 King s College London, Department of Computer Science London WC2R

More information

Text Searching. Thierry Lecroq Laboratoire d Informatique, du Traitement de l Information et des

Text Searching. Thierry Lecroq Laboratoire d Informatique, du Traitement de l Information et des Text Searching Thierry Lecroq Thierry.Lecroq@univ-rouen.fr Laboratoire d Informatique, du Traitement de l Information et des Systèmes. International PhD School in Formal Languages and Applications Tarragona,

More information

Maximal Unbordered Factors of Random Strings arxiv: v1 [cs.ds] 14 Apr 2017

Maximal Unbordered Factors of Random Strings arxiv: v1 [cs.ds] 14 Apr 2017 Maximal Unbordered Factors of Random Strings arxiv:1704.04472v1 [cs.ds] 14 Apr 2017 Patrick Hagge Cording 1 and Mathias Bæk Tejs Knudsen 2 1 DTU Compute, Technical University of Denmark, phaco@dtu.dk 2

More information

Multiple Pattern Matching

Multiple Pattern Matching Multiple Pattern Matching Stephen Fulwider and Amar Mukherjee College of Engineering and Computer Science University of Central Florida Orlando, FL USA Email: {stephen,amar}@cs.ucf.edu Abstract In this

More information

Varieties of Regularities in Weighted Sequences

Varieties of Regularities in Weighted Sequences Varieties of Regularities in Weighted Sequences Hui Zhang 1, Qing Guo 2 and Costas S. Iliopoulos 3 1 College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, Zhejiang 310023,

More information

A Lower-Variance Randomized Algorithm for Approximate String Matching

A Lower-Variance Randomized Algorithm for Approximate String Matching A Lower-Variance Randomized Algorithm for Approximate String Matching Mikhail J. Atallah Elena Grigorescu Yi Wu Department of Computer Science Purdue University West Lafayette, IN 47907 U.S.A. {mja,egrigore,wu510}@cs.purdue.edu

More information

String Range Matching

String Range Matching String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings

More information

String Matching with Variable Length Gaps

String Matching with Variable Length Gaps String Matching with Variable Length Gaps Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind Technical University of Denmark Abstract. We consider string matching with variable length

More information

The number of runs in Sturmian words

The number of runs in Sturmian words The number of runs in Sturmian words Paweł Baturo 1, Marcin Piatkowski 1, and Wojciech Rytter 2,1 1 Department of Mathematics and Computer Science, Copernicus University 2 Institute of Informatics, Warsaw

More information

Sorting suffixes of two-pattern strings

Sorting suffixes of two-pattern strings Sorting suffixes of two-pattern strings Frantisek Franek W. F. Smyth Algorithms Research Group Department of Computing & Software McMaster University Hamilton, Ontario Canada L8S 4L7 April 19, 2004 Abstract

More information

On-line String Matching in Highly Similar DNA Sequences

On-line String Matching in Highly Similar DNA Sequences On-line String Matching in Highly Similar DNA Sequences Nadia Ben Nsira 1,2,ThierryLecroq 1,,MouradElloumi 2 1 LITIS EA 4108, Normastic FR3638, University of Rouen, France 2 LaTICE, University of Tunis

More information

Three new strategies for exact string matching

Three new strategies for exact string matching Three new strategies for exact string matching Simone Faro 1 Thierry Lecroq 2 1 University of Catania, Italy 2 University of Rouen, LITIS EA 4108, France SeqBio 2012 November 26th-27th 2012 Marne-la-Vallée,

More information

The Number of Runs in a String: Improved Analysis of the Linear Upper Bound

The Number of Runs in a String: Improved Analysis of the Linear Upper Bound The Number of Runs in a String: Improved Analysis of the Linear Upper Bound Wojciech Rytter Instytut Informatyki, Uniwersytet Warszawski, Banacha 2, 02 097, Warszawa, Poland Department of Computer Science,

More information

Efficient High-Similarity String Comparison: The Waterfall Algorithm

Efficient High-Similarity String Comparison: The Waterfall Algorithm Efficient High-Similarity String Comparison: The Waterfall Algorithm Alexander Tiskin Department of Computer Science University of Warwick http://go.warwick.ac.uk/alextiskin Alexander Tiskin (Warwick)

More information

State Complexity of Neighbourhoods and Approximate Pattern Matching

State Complexity of Neighbourhoods and Approximate Pattern Matching State Complexity of Neighbourhoods and Approximate Pattern Matching Timothy Ng, David Rappaport, and Kai Salomaa School of Computing, Queen s University, Kingston, Ontario K7L 3N6, Canada {ng, daver, ksalomaa}@cs.queensu.ca

More information

SUFFIX TREE. SYNONYMS Compact suffix trie

SUFFIX TREE. SYNONYMS Compact suffix trie SUFFIX TREE Maxime Crochemore King s College London and Université Paris-Est, http://www.dcs.kcl.ac.uk/staff/mac/ Thierry Lecroq Université de Rouen, http://monge.univ-mlv.fr/~lecroq SYNONYMS Compact suffix

More information

MURDOCH RESEARCH REPOSITORY This is the author s final version of the work, as accepted for publication following peer review but without the publisher s layout or pagination. The definitive version is

More information
