Computing repetitive structures in indeterminate strings

Pavlos Antoniou (1), Costas S. Iliopoulos (1), Inuka Jayasekera (1), Wojciech Rytter (2,3)

(1) Dept. of Computer Science, King's College London, London WC2R 2LS, England, UK
(2) Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Banacha 2, Warsaw, Poland
(3) Faculty of Mathematics and Informatics, Copernicus University, Toruń, Poland

Abstract. We study the problem of finding local and global covers, and repetitive structures called seeds, in indeterminate strings. An indeterminate string is a sequence $X = X[1] X[2] \ldots X[n]$, where $X[i] \subseteq \Sigma$ for each $i$, and $\Sigma$ is a given alphabet of fixed size. We present an algorithm for finding the smallest cover of the string $x$ in $O(n \log n)$ time, where $n$ is the length of $x$. We then extend this algorithm to find all the local and global covers of $x$, and we extend the latter to compute the seeds of the string.

1 Introduction

Covers, like repetitions and periods, are common regularities in strings; they describe structure that repeats periodically. A substring $w$ of a string $x$ is called a cover of $x$ if and only if $x$ can be constructed by concatenations and superpositions of $w$. A seed is an extended cover, in the sense that it is a cover of a superstring of $x$. Finding the regularities present in strings is not only of interest in string algorithms; it is also useful in many applications, including molecular biology, data compression and computational music analysis.

Regularities in strings have been studied widely over the last 20 years. There are several $O(n \log n)$-time algorithms for finding repetitions in a string $x$ ([3],[6]), where $n$ is the length of $x$. Apostolico and Breslauer [1] gave an optimal $O(\log \log n)$-time parallel algorithm for finding all the repetitions. The preprocessing of the Knuth-Morris-Pratt algorithm [11] finds all periods of every prefix of $x$ in linear time. In many cases, it is desirable to relax the meaning of repetition.
For instance, if we allow overlapping and concatenation of periods in a string, we get the notion of covers. The notion of covers was introduced by Apostolico, Farach and Iliopoulos in [2], where a linear-time algorithm to test superprimitivity was given. Moore and Smyth [12] gave linear-time algorithms for finding all covers of a string $x$.

An extension of the notion of covers is that of seeds; that is, covers of a superstring of $x$. The notion of seeds was introduced by Iliopoulos, Moore and Park [9], who gave an $O(n \log n)$-time algorithm for computing all seeds of $x$. A parallel algorithm for finding all seeds was presented by Berkman, Iliopoulos and Park [5]; it requires $O(\log n)$ time and $O(n \log n)$ work.

In this work, we find these string regularities in indeterminate strings. An indeterminate string is a sequence $X = X[1] X[2] \ldots X[n]$, where $X[i] \subseteq \Sigma$ for each $i$, and $\Sigma$ is a given alphabet of potentially large size. The simplest form of indeterminate string is one in which indeterminate positions can contain only a don't care letter, that is, a letter that matches any letter in the alphabet $\Sigma$ on which $x$ is defined. An algorithm was described in [7] for computing all occurrences of a pattern $p$ in a text string $x$, where both $p$ and $x$ are defined on the alphabet $\Sigma$; although efficient in theory, the algorithm was not useful in practice. Indeterminate string pattern matching has mainly been handled by bit mapping techniques (the ShiftOr method) [4],[15]. These techniques have been used to find matches for an indeterminate pattern $p$ in a string $x$ [8], and the agrep utility [14] has been virtually one of the few practical algorithms available for indeterminate pattern matching.

Fig. 1. A sequence logo of a biological indeterminate sequence (Lambda cI and cro binding sites). Picture taken from [13].

In [8] the authors extended the notion of indeterminate strings by distinguishing two distinct forms of indeterminate match ("quantum" and "deterministic"). Roughly speaking, a quantum match allows an indeterminate letter to match two or more distinct letters during a single matching process; a deterministic match restricts each indeterminate letter to a single match [8].

In this paper we find string regularities present in indeterminate strings. We report the local covers and subsequently all the covers of an indeterminate string $x$. In Section 2 we present the basic definitions. Sections 3, 4, 5 and 6 present the algorithms. These are followed by an analysis of the algorithms and conclusions.

2 Basic definitions

A string is a sequence of zero or more symbols from an alphabet $\Sigma$. The set of all strings over $\Sigma$ is denoted by $\Sigma^*$. The length of a string $x$ is denoted by $|x|$. The empty string, the string of length zero, is denoted by $\varepsilon$. The $i$-th symbol of a string $x$ is denoted by $x[i]$. A string $w$ is a substring of $x$ if $x = uwv$, where $u, v \in \Sigma^*$; conversely, $x$ is called a superstring of $w$. We denote by $x[i \ldots j]$ the substring of $x$ that starts at position $i$ and ends at position $j$. A string $w$ is a prefix of $x$ if $x = wy$, for $y \in \Sigma^*$. Similarly, $w$ is a suffix of $x$ if $x = yw$, for $y \in \Sigma^*$.

We call a string $w$ a subsequence of $x$ (or $x$ a supersequence of $w$) if $w$ is obtained by deleting zero or more symbols at any positions from $x$. For example, ace is a subsequence of aabcdef. For a given set $S$ of strings, a string $w$ is called a common supersequence of $S$ if $w$ is a supersequence of every string in $S$.

The string $xy$ is a concatenation of the strings $x$ and $y$. The concatenation of $k$ copies of $x$ is denoted by $x^k$.
For two strings $x = x[1 \ldots n]$ and $y = y[1 \ldots m]$ such that $x[n-i+1 \ldots n] = y[1 \ldots i]$ for some $i \ge 1$ (that is, such that $x$ has a suffix equal to a prefix of $y$), the string $x[1 \ldots n]\,y[i+1 \ldots m]$ is said to be a superposition of $x$ and $y$. Alternatively, we may say that $x$ overlaps with $y$.

A substring $y$ of $x$ is called a repetition in $x$ if $x = u y^k v$, where $u$, $y$, $v$ are substrings of $x$, $k \ge 2$ and $|y| \ne 0$. For example, if $x$ = aababab, then a (appearing in positions 1 and 2) and ab (appearing in positions 2, 4 and 6) are repetitions in $x$; in particular, $a^2$ = aa is called a square and $(ab)^3$ = ababab is called a cube.

A substring $w$ is called a period of a string $x$ if $x$ can be written as $x = w^k w'$, where $k \ge 1$ and $w'$ is a prefix of $w$. The shortest period of $x$ is called the period of $x$. For example, if $x$ = abcabcab, then abc, abcabc and the string $x$ itself are periods of $x$, while abc is the period of $x$.
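The definition of a period can be checked directly on an ordinary (determinate) string. The following is a minimal sketch, not part of the original paper; the function names `is_period` and `periods` are chosen here for illustration only.

```python
def is_period(w, x):
    """Return True if x = w^k w' with k >= 1 and w' a prefix of w."""
    if not w or len(w) > len(x):
        return False
    # Every symbol of x must agree with w read cyclically.
    return all(x[i] == w[i % len(w)] for i in range(len(x)))


def periods(x):
    """List all periods of x (each is a prefix of x), shortest first."""
    return [x[:p] for p in range(1, len(x) + 1) if is_period(x[:p], x)]


print(periods("abcabcab"))  # ['abc', 'abcabc', 'abcabcab']
```

The first element of the returned list is the period of the string in the sense defined above.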

A substring $w$ of $x$ is called a cover of $x$ if $x$ can be constructed by concatenating or overlapping copies of $w$. We also say that $w$ covers $x$. For example, if $x$ = ababaaba, then aba and $x$ are covers of $x$. If $x$ has a cover $w \ne x$, $x$ is said to be quasiperiodic; otherwise, $x$ is superprimitive.

An indeterminate string is a sequence $X = X[1] X[2] \ldots X[n]$, where $X[i] \subseteq \Sigma$ for each $i$, and $\Sigma$ is a given alphabet of potentially large size.

The following theorem will be used in the algorithms of this paper to access sorted elements of a doubly linked list in constant time.

Theorem 1 ([10]) Let $a[1], \ldots, a[n]$ be a doubly linked list. There exists an algorithm that preprocesses the list $a$ in such a way that, after a number of deletions in the list $a$, one can find the nearest $a[j]$ to the left of $a[i]$ with a[j] a[i] in constant time.

2.1 Using the masking technique

In order to efficiently match character classes, we represent our strings as sequences of 4-bit masks. The alphabet for describing DNA sequences has 4 symbols, namely {A, C, G, T}. We map these single characters to the bit masks {1000, 0100, 0010, 0001}, respectively. For $k$ characters $x_1 \ldots x_k$, we can represent the character set $[x_1 \ldots x_k]$ as

$M(x_1)$ OR $M(x_2)$ OR ... OR $M(x_k)$,

where $M(x_i)$ is the 4-bit mask of $x_i$. Using this representation and the bitwise AND operation, we can determine whether there is a match between characters or character sets: a non-zero result indicates a match and a zero result a mismatch. For example, to determine whether [AC] matches [CG], we first convert the character sets into 4-bit masks: [AC] = 1000 OR 0100 = 1100 and [CG] = 0100 OR 0010 = 0110. We then perform a bitwise AND on the two masks: 1100 AND 0110 = 0100. Since the result is non-zero, we conclude that [AC] matches [CG], as they have C as a common symbol (the character representation of the resulting bit mask).
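The masking technique can be sketched in a few lines. This is an illustrative example only; the names `MASK`, `mask_of` and `matches` are assumptions introduced here and do not come from the paper.

```python
# 4-bit masks for the DNA alphabet, as given in the text.
MASK = {"A": 0b1000, "C": 0b0100, "G": 0b0010, "T": 0b0001}


def mask_of(symbols):
    """OR together the masks of all symbols in a character set, e.g. "AC"."""
    m = 0
    for s in symbols:
        m |= MASK[s]
    return m


def matches(a, b):
    """Two character sets match when their masks share at least one bit."""
    return (a & b) != 0


ac, cg = mask_of("AC"), mask_of("CG")
print(format(ac, "04b"), format(cg, "04b"), format(ac & cg, "04b"))
# 1100 0110 0100
```

Because each comparison is a single AND on a machine word, the test costs constant time per position for a fixed alphabet such as DNA.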
In the algorithms in the remainder of the paper, we apply bit masking to our indeterminate strings.

3 Computing the smallest cover of the indeterminate string x

The following algorithm finds the smallest cover $\hat{u}$ of the indeterminate string $x$. Assume that we have performed $k$ iterations. So far, we have built:

Position Array. We find all the occurrences of the substring $\hat{u}$ in $x$ and denote the occurrence of $\hat{u}$ at position $I_\mu$ by $u_\mu$, so that $\hat{u} = u_\mu$ for $1 \le \mu \le l$. In particular, $u_1$ is a prefix of $x$ and $|\hat{u}| = k$. The starting positions of the occurrences are recorded in an array $S_1$:

$S_1 = \{I_1, I_2, \ldots, I_{\mu-1}, I_\mu, I_{\mu+1}, \ldots, I_l\}$

Gap Array. The distance between the starting positions of consecutive occurrences $u_i$, $u_{i+1}$ is denoted by $g_i$ and is entered into a second array $S_2$:

$S_2 = \{g_1, g_2, \ldots, g_{\mu-1}, g_\mu, g_{\mu+1}, \ldots, g_l\}$

Figure 2 presents the distances $g_i$ as arcs between the occurrences of $u$.

Fig. 2. Covering the string x with substring u.

Order Array. We create an array $S_3$, which holds the elements of array $S_2$ sorted from smallest to largest. We also create a doubly linked list $L$ in which, for each element $g_i$ in $S_2$, we keep its Order($g_i$) according to $S_3$. We create this doubly linked list in order to be able to use Theorem 1, which allows us to access any element of $S_3$ in constant time.

Testing. We apply a simple test to determine whether the substring $\hat{u}$ is a cover of $x$: we check whether the largest element of $S_3$, $g_l$, is at most $k$. If $g_l \le k$, then $\hat{u}$ is a cover of $x$. If $\hat{u}$ with $|\hat{u}| = k$ is not a cover of $x$, we continue by extending $k$ by 1 and solve the problem for $|u| = k + 1$. Accordingly, the distance allowed between the substrings, in order for them to be considered as covers, is also increased to $k + 1$.

Main Steps. We extend the length of each $u_\mu$, $1 \le \mu \le l$, by one character, to length $k + 1$.

So far we have a series of prefixes of length $k$, and we check whether they can be extended by one character. Let $u^i_{k+1}$ denote the $(k+1)$-th character of $u_i$.

Step 1. We check whether the next character of the current substring is equal to the next character of another substring $u_j$, i.e. whether $u^i_{k+1}$ is equal to $u^j_{k+1}$. This check can be performed via the bit masking method for index $i$, which is a good method for practical purposes, without affecting the running time of the algorithm. Let

$u^{i_1}_{k+1} = u^{i_2}_{k+1} = \cdots = u^{i_\tau}_{k+1}$

and let their starting positions be

$I = \{i_1, i_2, \ldots, i_\tau\}$

Then $\hat{u}_k = \{a \mid a \in u^i_{k+1},\ i \in I\}$ and $\hat{u} = \hat{u}_1 \hat{u}_2 \ldots \hat{u}_k$.

Step 2. Suppose that at one position $I_\mu$, $u_\mu$ cannot be extended further without giving a mismatch. This is illustrated in Figure 3 at the point marked by a cross. However, there are matching substrings following this unmatched substring. Therefore, we want to discard this $u_\mu$ and cover the string with the rest of the substrings $u_i$. We do this by first deleting $I_\mu$ from $S_1$. Then we delete and update the distances between $g_{\mu-1}$ and $g_{\mu+1}$ in $S_2$.

$S_1 = \{I_1, I_2, \ldots, I_{\mu-1}, I_\mu, I_{\mu+1}, \ldots, I_l\}$
$S_2 = \{g_1, g_2, \ldots, g_{\mu-1}, g_\mu, g_{\mu+1}, \ldots, g_l\}$

By keeping the linked list $L$ and using Theorem 1, the corresponding distances $g_{\mu-1}$ and $g_{\mu+1}$ in $S_2$ can be found in constant time.

Step 3. We do a binary search and insert the new distance $g'_\mu$ in its corresponding position in the sorted array $S_3$. This requires $O(\log n)$ time. We then test whether $g_l \le k + 1$. If this inequality holds, then $\hat{u}$ with $|\hat{u}| = k + 1$ is a cover of $x$. Figure 3 shows an example of this operation: arcs $g_{\mu-1}$ and $g_\mu$ are deleted and replaced by the single arc $g'_\mu = g_{\mu-1} + g_\mu$.
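The position array, gap array and gap test described above can be illustrated with a brute-force reference sketch. It is not the $O(n \log n)$ algorithm of this section: it recomputes the occurrences of every prefix from scratch, compares each window against the prefix mask by mask, and uses no linked list. All names (`encode`, `occurrences`, `gaps`, `smallest_cover_length`) are hypothetical, and the extra requirement that the last occurrence ends at position $n$ is an assumption added here so that the gap test alone decides coverage.

```python
# Assumed 4-bit encoding of DNA symbols, as in Section 2.1.
MASK = {"A": 0b1000, "C": 0b0100, "G": 0b0010, "T": 0b0001}


def encode(positions):
    """Turn e.g. ["A", "AC", "G"] (or a plain string) into a list of masks."""
    out = []
    for symbols in positions:
        m = 0
        for s in symbols:
            m |= MASK[s]
        out.append(m)
    return out


def occurrences(x, k):
    """Position array S1: 1-based starts where a window matches x[1..k]."""
    n = len(x)
    return [i + 1 for i in range(n - k + 1)
            if all(x[i + j] & x[j] for j in range(k))]


def gaps(s1):
    """Gap array S2: distances between consecutive starting positions."""
    return [b - a for a, b in zip(s1, s1[1:])]


def smallest_cover_length(x):
    """Smallest k such that the prefix of length k passes the gap test."""
    n = len(x)
    for k in range(1, n + 1):
        s1 = occurrences(x, k)
        reaches_end = s1[-1] + k - 1 == n      # assumption, see lead-in
        if reaches_end and max(gaps(s1), default=0) <= k:
            return k
    return n


print(smallest_cover_length(encode("ACACAACA")))  # 3
```

The example string is ababaaba from Section 2 with a and b renamed to A and C, so the result corresponds to the cover aba.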

Fig. 3. Extending the length of substring u from k to k + 1 to find a cover for string x. At the position marked by a cross, substring u cannot be extended further without giving a mismatch.

4 Computing the maximal local covers of x

The following algorithm finds maximal substrings of $x$ which are covered locally by some non-extendable factor $\hat{u}$ of $x$. As with the algorithm of Section 3, we assume that we have performed $k$ iterations. However, the algorithm differs in that we are no longer concerned only with a $u$ that is a prefix of $x$: we consider all the factors of $x$ (starting with length two) as possible local covers of $x$. After $k$ iterations, we have created the following:

Position Array. We have a set of local covers $\{\hat{u}^{(1)}, \hat{u}^{(2)}, \ldots, \hat{u}^{(\lambda)}\}$ with $|\hat{u}^{(i)}| = k$, $1 \le i \le \lambda$. The starting positions of each substring $\hat{u}^{(j)}$ are recorded in an array $S_1^{(j)}$:

$S_1^{(j)} = \{I_1^{(j)}, I_2^{(j)}, \ldots, I_l^{(j)}\}$

We perform the following steps for each $j$, $1 \le j \le \lambda$.

Gap Array. The distances between the starting positions of consecutive occurrences of $\hat{u}^{(j)}$ are entered into a second array $S_2^{(j)}$. We denote the distance between the consecutive occurrences $u_i^{(j)}$ and $u_{i+1}^{(j)}$ of $\hat{u}^{(j)}$ by $g_i^{(j)}$; the array $S_2^{(j)}$ is then as follows:

$S_2^{(j)} = \{g_1^{(j)}, g_2^{(j)}, \ldots, g_\mu^{(j)}, \ldots, g_l^{(j)}\}$

Order Array. We create an array $S_3^{(j)}$, which holds the elements of array $S_2^{(j)}$ sorted from smallest to largest. We also create a doubly linked list $L$ in which, for each element $g_i^{(j)}$ in $S_2^{(j)}$, we keep its Order($g_i^{(j)}$) according to $S_3^{(j)}$. As with the algorithm of Section 3, the doubly linked list is needed to utilize Theorem 1.
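To make the role of the gap array concrete, the sketch below takes the sorted starting positions of one factor of length $k$ and splits them into maximal runs in which consecutive gaps are at most $k$; each run is reported as a pair of left-most and right-most positions. This is an illustration under stated assumptions rather than the paper's procedure: the function names are hypothetical, positions are 1-based, and a run consisting of a single occurrence is still reported.

```python
def gap_array(positions):
    """Distances between consecutive starting positions of a factor."""
    return [b - a for a, b in zip(positions, positions[1:])]


def covered_stretches(positions, k):
    """Maximal stretches of x covered by occurrences of a length-k factor.

    positions: sorted, non-empty list of 1-based starting positions.
    Returns (left, right) pairs, one per maximal run with gaps <= k.
    """
    stretches = []
    left = prev = positions[0]
    for p in positions[1:]:
        if p - prev > k:
            # The gap is too wide: close the current stretch here.
            stretches.append((left, prev + k - 1))
            left = p
        prev = p
    stretches.append((left, prev + k - 1))
    return stretches


print(gap_array([1, 3, 5, 10, 12]))             # [2, 2, 5, 2]
print(covered_stretches([1, 3, 5, 10, 12], 3))  # [(1, 7), (10, 14)]
```

A gap larger than $k$ leaves at least one position uncovered, which is why it separates two stretches.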

Local Covers. We create an array LC of the local covers which have been detected so far by the algorithm. Each cover is stored as a pair $(\Lambda_l^{(j)}, \Lambda_r^{(j)})$, where $\Lambda_l^{(j)}$ and $\Lambda_r^{(j)}$ are the left-most and right-most positions of the $j$-th local cover of $x$, respectively. Figure 4 shows an example of the local covers array.

Fig. 4. Example of the LC array, with LC = {(1, 18), (20, 34)}.

Main Steps. We extend the length of each $\hat{u}^{(j)}$ by one character, to length $k + 1$. We then partition the set $S_1^{(j)}$ into sets that represent all possible extensions of $\hat{u}^{(j)}$ to length $k + 1$. The following steps are repeated until $\hat{u}$ cannot be extended any further.

Step 1. We check whether the next character of the current substring is equal to the next character of another substring, i.e. whether $u^{i,(j)}_{k+1}$ is equal to $u^{v,(j)}_{k+1}$. This check can be performed via the bit masking method for index $i$, which is a good method for practical purposes, without affecting the running time of the algorithm. Let

$u^{i_1,(j)}_{k+1} = u^{i_2,(j)}_{k+1} = \cdots = u^{i_\tau,(j)}_{k+1}$

and let their starting positions be

$I = \{i_1, i_2, \ldots, i_\tau\}$

Then $\hat{u}^{(j)}_k = \{a \mid a \in u^{i,(j)}_{k+1},\ i \in I\}$ and $\hat{u}^{(j)} = \hat{u}^{(j)}_1 \hat{u}^{(j)}_2 \ldots \hat{u}^{(j)}_k$.

Step 2. Suppose that at one position $I_\mu^{(j)}$, $u^{\mu,(j)}$ cannot be extended further without giving a mismatch, but there are matching substrings following this unmatched substring. We then want to discard this $u^{\mu}$ and cover the string with the rest of the substrings $u^i$. We do this by first deleting $I_\mu^{(j)}$ from $S_1^{(j)}$. Then we delete and update the distances between $g_{\mu-1}^{(j)}$ and $g_{\mu+1}^{(j)}$ in $S_2^{(j)}$.

$S_1^{(j)} = \{I_1^{(j)}, I_2^{(j)}, \ldots, I_{\mu-1}^{(j)}, I_\mu^{(j)}, I_{\mu+1}^{(j)}, \ldots, I_l^{(j)}\}$

$S_2^{(j)} = \{g_1^{(j)}, g_2^{(j)}, \ldots, g_{\mu-1}^{(j)}, g_\mu^{(j)}, g_{\mu+1}^{(j)}, \ldots, g_l^{(j)}\}$

By keeping the linked list $L$ and using Theorem 1, the corresponding distances $g_{\mu-1}^{(j)}$ and $g_{\mu+1}^{(j)}$ in $S_2^{(j)}$ can be found in constant time. The non-extendable occurrences, say $u^\gamma$, get a new set of data structures $S_1^{(\gamma)}$, $S_2^{(\gamma)}$, $S_3^{(\gamma)}$ and LC.

Fig. 5. Extension of $\hat{u}$ for local covers: the occurrences at $I_1, I_2, I_3, I_4, I_5, \ldots, I_l$ are extended by a, b, b, a, c, ..., c, respectively.

Fig. 6. Partitioning the set S: $S = \{I_1, I_2, I_3, I_4, I_5, \ldots, I_l\}$ is split into $S_{\hat{u}b} = \{I_2, I_3, \ldots\}$, $S_{\hat{u}a} = \{I_1, I_4, \ldots\}$ and $S_{\hat{u}c} = \{I_5, \ldots, I_l\}$.

Step 3. We now have to update the LC array. Following the extension of $\hat{u}^{(j)}$, two cases are possible:

Case 1: An occurrence of $\hat{u}^{(j)}$ within a local cover cannot be extended by the same character as all other occurrences of $\hat{u}^{(j)}$ in the same local cover. In this case:
(i) the local cover of length $k$ is maximal;
(ii) the local cover of length $k + 1$ is split into two smaller local covers and $\Lambda$ is updated accordingly (see Figure 7).

Fig. 7. LC after removal of the third occurrence of $\hat{u}$: LC = {(1, 7), (9, 18), (20, 34)}.

Case 2: An occurrence of $\hat{u}^{(j)}$ can be extended beyond the end position of the local cover to which it belongs. In this case, if the condition $g^{(j)} \le k + 1$ is met, the two local covers are joined together to make a larger local cover and $\Lambda$ is updated accordingly (see Figure 8).

Fig. 8. LC after extension of u: LC = {(1, 7), (9, 34)}.

5 Computing all the covers of x

To find all the covers of the string $x$, we slightly modify the algorithm of Section 3. Instead of stopping as soon as the gap test $g_l \le k$ succeeds, we continue increasing the length of $u$ until $|u| = n - 1$, where $n = |x|$. During every iteration of the Main Steps, if $g_l \le |u|$, we output $u$, as it is a cover of $x$.

6 Computing the seeds of x

A substring $w$ of $x$ is called a seed of $x$ if $w$ covers a superstring of $x$ (this can be any superstring of $x$, including $x$ itself). For example, aba and ababa are some seeds of $x$ = ababaab. If a substring $u$ is a seed of a string $x$, then there exists a superstring $y = sxv$, with $|s| < |u|$ and $|v| < |u|$, which can be constructed by overlapping or concatenating copies of the strings $u_1, u_2, u_3, \ldots, u_l$. By the definition of seeds, a suffix $x[i \ldots n]$ of $x$ can be matched to some prefix of $u$, and a prefix $x[1 \ldots j]$ of $x$ can be matched to some suffix of $u$ (because $u$ has to cover a superstring of $x$). Therefore, we extend the previous algorithm for finding all the covers of $x$ by one more test, in order to find the seeds of $x$.

Up to this point, we have found a series of occurrences $u_i$ covering $x$. We can think of the problem as if someone had cut the superstring prematurely, before the last occurrence $u_l$ finished, and had also cut off the beginning of the first occurrence $u_1$. Figure 9 illustrates this notion. We want to check whether the cover $u_i$ is also a seed of $x$. We do this by checking whether $x[i \ldots n]$ can be matched to some prefix of $u$ and $x[1 \ldots j]$ can be matched to some suffix of $u$. At position $I_1$, where we have the first occurrence $u_1$ of $u$, we test whether the part of $x$ preceding $u_1$ matches a suffix of $u$. Additionally, we test whether the part of $x$ following the last occurrence $u_l$ of $u$ matches a prefix of $u$.
If these two conditions hold, then the occurrences $u_1, u_2, \ldots, u_l$ found also make $u$ a seed of $x$, as they can form a superstring of $x$.

Fig. 9. Finding seeds of string x.

7 Conclusion

We have shown $O(n \log n)$ algorithms for finding the smallest cover, the local covers and all the covers of a string. We have also presented an $O(n \log n)$ algorithm for finding the seeds of a string. All the algorithms presented are easily adapted to use the bit masking technique, which allows efficient implementations.

References

1. A. Apostolico and D. Breslauer. An optimal O(log log n)-time parallel algorithm for detecting all squares in a string. SIAM J. Comput., 25(6).
2. A. Apostolico, M. Farach, and C. S. Iliopoulos. Optimal superprimitivity testing for strings. Information Processing Letters, 39:17-20.
3. A. Apostolico and F. P. Preparata. Optimal off-line detection of repetitions in a string. Theor. Comput. Sci., 22.
4. R. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Commun. ACM, 35(10):74-82.
5. O. Berkman, C. S. Iliopoulos, and K. Park. The subtree max gap problem with application to parallel string covering. Information and Computation, 123(1).
6. M. Crochemore. An optimal algorithm for computing the repetitions in a word. Inf. Process. Lett., 12(5).
7. M. J. Fischer and M. S. Paterson. String-matching and other products. Technical report, Cambridge, MA, USA.
8. J. Holub, W. F. Smyth, and S. Wang. Fast pattern-matching on indeterminate strings. J. of Discrete Algorithms, 6(1):37-50.
9. C. S. Iliopoulos, D. W. G. Moore, and K. Park. Covering a string. In Proceedings of the 4th Symposium on Combinatorial Pattern Matching, volume 684 of Lecture Notes in Computer Science, pages 54-62, Berlin. Springer-Verlag.
10. H. Imai and T. Asano. Dynamic orthogonal segment intersection search. J. Algorithms, 8(1):1-18.
11. D. E. Knuth, J. Morris, and V. R. Pratt. Fast pattern matching in strings. SIAM Journal of Computing, 6(2).
12. D. Moore and W. F. Smyth. An optimal algorithm to compute all the covers of a string. Inf. Process. Lett., 50(5).
13. M. C. Shaner, I. M. Blair, and T. D. Schneider. Sequence logos: A powerful, yet simple, tool. In T. N. Mudge, V. Milutinovic, and L. Hunter, editors, Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, Volume 1: Architecture and Biotechnology Computing. IEEE Computer Society Press.
14. S. Wu and U. Manber. Agrep, a fast approximate pattern-matching tool. In Proceedings USENIX Winter 1992 Technical Conference, San Francisco, CA.
15. S. Wu and U. Manber. Fast text searching: allowing errors. Commun. ACM, 35(10):83-91.
