Approximate periods of strings

Size: px

Start display at page:

Download "Approximate periods of strings"

Gervase Boone
5 years ago
Views:

Theoretical Computer Science 262 (2001) 557 568 www.elsevier.com/locate/tcs Approximate periods of strings Jeong Seop Sim a;1, Costas S. Iliopoulos b; c;2, Kunsoo Park a;1; c; d;3, W.F.

1 Theoretical Computer Science 262 (2001) Approximate periods of strings Jeong Seop Sim a;1, Costas S. Iliopoulos b; c;2, Kunsoo Park a;1; c; d;3, W.F. Smyth a School of Computer Science and Engineering, Seoul National University, Seoul , South Korea b Department of Computer Science, King s College London, London WC2R 2LS, UK c School of Computing, Curtin University, Perth WA 6825, Australia d Department of Computing & Software, McMaster University, Hamilton, Ontario, Canada L8S 4KI Received 27 December 1999; revised 27 June 2000; accepted 7 August 2000 Communicated by M. Crochemore Abstract The study of approximately periodic strings is relevant to diverse applications such as molecular biology, data compression, and computer-assisted music analysis. Here we study dierent forms of approximate periodicity under a variety of distance functions. We consider three related problems, for two of which we derive polynomial-time algorithms; we then show that the third problem is NP-complete. c 2001 Elsevier Science B.V. All rights reserved. Keywords: Periodicity; Approximate periods; Repetitions; Distance function 1. Introduction Repetitive or periodic strings have been studied in such diverse elds as molecular biology, data compression, and computer-assisted music analysis. In response to requirements arising out of a variety of applications, interest has arisen in algorithms for nding regularities in strings; that is, periodicities of an approximate nature. Some important regularities that have been studied in the literature are the following: Periods: Astring p is called a period of a string x if x can be written as x = p k p, where k 1 and p is a prex of p. The shortest period of x is called the period of x. For example, if x = abcabcab, then abc; abcabc; and x itself are periods of x, Corresponding author. Fax: addresses: jssim@theory.snu.ac.kr (J.S. Sim), csi@dcs.kcl.ac.uk (C.S. Iliopoulos), kpark@theory.snu.ac.kr (K. Park), smyth@mcmaster.ca (W.F. Smyth). 1 Supported by KOSEF Grant and the Brain Korea 21 Project. 2 Supported in part by the CCSLAAR Royal Society Research Grant. 3 Supported by NSERC Grant No. A /01/$ - see front matter c 2001 Elsevier Science B.V. All rights reserved. PII: S (00)

2 558 J.S. Sim et al. / Theoretical Computer Science 262 (2001) while abc is the period of x. Ifx has a period p such that p 6 x =2, then x is said to be periodic. Further, if setting x = p k implies k =1; x is said to be primitive; if k 2; p k is called a repetition. Covers: Astring w is called a cover of x if x can be constructed by concatenations and superpositions of w. For example, if x = ababaaba, then aba and x are the covers of x. Ifx has a cover w x; x is said to be quasiperiodic; otherwise, x is superprimitive. Seeds: Asubstring w of x is called a seed of x if it is a cover of any superstring of x. For example, aba and ababa are some seeds of x = ababaab. Repetitions: A repetition is an immediately repeated nonempty string. For example, if x = aababab, then aa and ababab are repetitions in x; in particular, a 2 = aa is called a square or tandem repeat and (ab) 3 = ababab is called a cube. The notions cover and seed are generalizations of periods in the sense that superpositions as well as concatenations are used to dene them. Asignicant amount of research has been done on each of these four notions: Periods: The preprocessing of the Knuth Morris Pratt algorithm [21] nds all periods of x in linear time in fact, all periods of every prex of x. In parallel computation, Apostolico et al. [2] gave an optimal O(log log n) time algorithm for nding all periods, where n is the length of x. Covers: Apostolico et al. [4] introduced the notion of covers and described a lineartime algorithm to test whether x is superprimitive or not (see also [7, 8, 17]). Moore and Smyth [28] and recently Li and Smyth [24] gave linear-time algorithms for nding all covers of x. In parallel computation, Iliopoulos and Park [18] obtained an optimal O(log log n) time algorithm for nding all covers of x. Apostolico and Ehrenfeucht [3] and Iliopoulos and Mouchard [16] considered the problem of nding maximal quasiperiodic substrings of x. Atwo-dimensional variant of the covering problem was studied in [11, 14], and minimum covering by substrings of a given length in [19]. Seeds: Iliopoulos et al. [15] introduced the notion of seeds and gave an O(n log n) time algorithm for computing all seeds of x. For the same problem, Berkman et al. [6] presented a parallel algorithm that requires O(log n) time and O(n log n) work. Repetitions: There are several O(n log n) time algorithms for nding all the repetitions in a string [10, 5, 26]. In parallel computation, Apostolico and Breslauer [1] gave an optimal O(log log n) time algorithm (i.e., total work is O(n log n)) for nding all the repetitions. Anatural extension of the repetition problems is to allow errors. Approximate repetitions are common in applications such as molecular biology and computer-assisted music analysis [9, 12]. Among the four notions above, only approximate repetitions have been studied. If x = uww v where w and w are similar, ww is called an approximate square or approximate tandem repeat. When there is a nonempty string y between w and w, we say that w and w are an approximate nontandem repeat. In [23], Landau and Schmidt gave an O(kn log k log n) time algorithm for nding approximate squares whose edit distance is at most k in a text of length n. Schmidt also

3 J.S. Sim et al. / Theoretical Computer Science 262 (2001) gave an O(n 2 log n) algorithm for nding approximate tandem or nontandem repeats in [30] which uses an arbitrary score for similarity of repeated strings. In this paper, we introduce the notion of approximate periods which is an approximate version of periods. Here we study dierent forms of approximate periodicity under a variety of distance functions. We consider three related problems, for two of which we derive polynomial-time algorithms; we then show that the third problem is NP-complete. This paper is organized as follows. In Section 2, we describe some notations and denitions used in this paper. In Section 3, we dene approximate periods and three related problems studied in this paper. Then we present polynomial-time algorithms for two problems and the proof of NP-completeness of the third problem in Section 4. We conclude in Section Preliminaries A string is a sequence of zero or more characters from an alphabet. The set of all strings over the alphabet is denoted by and all strings of length m over is denoted by m. The empty string is denoted by. The length of a string x is denoted by x and the ith character of x by x[i]. When a string w is x[i] x[i +1] x[j], we denote w by x[i::j] and w is called a substring (or a factor [13]) of x. Conversely, x is called a superstring of w. For example, bcd is a substring of abcde, and abcde is a superstring of bcd. Ablank space is denoted by =, and we regard it as a character for convenience. Astring w is a prex of x if x = wu for u. Similarly, w is a sux of x if x = uw for u. Astring w is a subsequence (also called a subword [13]) of x (or x is a supersequence of w) ifw is obtained by deleting zero or more characters (at any positions) from x. For example, ace is a subsequence of aabcdef. For a given set S of strings, a string s is called a common supersequence of S if s is a supersequence of every string in S. The distance (x; y) between two strings x and y is the minimum cost to convert one string x to the other string y. There are several well-known distance functions. The edit distance between two strings x and y is the minimum number of edit operations to convert x to y. The edit operations are the insertion of a character into x, the deletion of a character from x, and the change (or substitution) of a character in x with a character in y. Note that since we dene the edit distance by the number of edit operations, the cost of each edit operation is 1. The Hamming distance between x and y is the smallest number of change operations to convert x to y. Note that the Hamming distance can be dened only when x = y because it does not allow insertions and deletions. The edit distance can be generalized by using a penalty matrix. A penalty matrix species the substitution cost for each pair of characters and the insertion=deletion cost for each character. The weighted edit distance between x and y is the minimum cost to convert x to y using a penalty matrix. Apenalty matrix is a metric when it satises

4 560 J.S. Sim et al. / Theoretical Computer Science 262 (2001) Fig. 1. An alignment. the following conditions for all a; b; c {}: (a; b) 0, (a; b)=(b; a), (a; a)=0; and (a; c)6(a; b) + (b; c) (triangle inequality). An alignment of a set S of strings can be represented by a two-dimensional matrix, where each row is a string in S and all rows have the same length. The equality of lengths of all the rows can be obtained by inserting zero or more spaces to each of the strings in S. For example, when S = {abcae; bcd; abde}; an alignment of S is shown in Fig. 1. We say that a distance function (x; y) is a relative distance function if the lengths of strings x and y are considered in the value of (x; y); otherwise, it is an absolute distance function. The Hamming distance and the edit distance are examples of absolute distance functions. There are two ways to dene a relative distance between x and y: First, we can x one of the two strings and dene a relative distance function with respect to the xed string. The error ratio with respect to x is dened to be d= x, where d is an absolute distance between x and y. Second, we can dene a relative distance function symmetrically. The symmetric error ratio is dened to be d=l, where d is an absolute distance between x and y, and l =( x + y )=2 [31]. Note that we may take l = x + y (then everything is the same except that the ratio is multiplied by 2). If d is the edit distance between x and y, the error ratio with respect to x or the symmetric error ratio is called a relative edit distance. The weighted edit distance can also be used as a relative distance function because the penalty matrix can contain arbitrary costs. 3. Problem denitions Given two strings x; p and a distance function ; we dene approximate periods as follows. If there exists a partition of x into disjoint blocks of substrings, i.e., x = p 1 p 2 p r (p i ) such that (p; p i )6t for 16i r; and (p ;p r )6t where p is some prex of p; we say that p is a t-approximate period of x (or p is an approximate period of x with distance t). Each p i ; 16i6r; will be called a partition block of x. The last partition block p r can be matched to any prex p of p as in the denition of exact periods. If p is shorter than p; the last partition block will be

5 J.S. Sim et al. / Theoretical Computer Science 262 (2001) called short. Note that there can be several versions of approximate periods according to the denition of distance function. We consider the following problems related to approximate periods. Problem 1. Given x; p; and any distance function ; nd the minimum t such that p is a t-approximate period of x. Since p is xed in this case, it makes no dierence whether is an absolute distance function or an error ratio with respect to p. If a threshold k6 p on the edit distance is given as input in Problem 1, the problem asks whether p is a k-approximate period of x or not. Note that if the edit distance is used for, it is trivially true that p is a p -approximate period of x. Problem 2. Given a string x and a relative distance function ; nd a substring p of x that is an approximate period of x with the minimum distance. Since the length of p is not (a priori) xed in this problem, we need to use a relative distance function for (i.e., an error ratio or a weighted edit distance) rather than an absolute distance function. For example, if the absolute edit distance is used, every substring of x of length 1 is a 1-approximate period of x. Moreover, we assume that there are at least two partition blocks of x except possibly the short partition block, because otherwise the longest proper prex of x (or any long prex of x) can easily become an approximate period of x with a small distance. This assumption will be applied to Problem 3, too. Problem 3. Given a string x and a relative distance function ; nd a string p that is an approximate period of x with the minimum distance. This problem is harder than Problem 2 because p can be any string, not necessarily a substring of x. 4. Algorithms and NP-completeness Basically, we will use weighted edit distances with metric penalty matrices for the distance function in each problem. Recall that a penalty matrix denes the substitution cost for each pair of characters and the insertion or deletion cost for each character Problem 1 Our algorithm for Problem 1 consists of two steps. Let n = x and m = p. (1) Let w ij be the distance between p and x[i::j]; 16i6j n and w in be the minimum value of the distances between all prexes of p and x[i::n]. Note that x[i::n] can be matched to any prex of p by the denition of approximate periods. Compute w ij for all 16i6j6n.

6 562 J.S. Sim et al. / Theoretical Computer Science 262 (2001) (2) Compute the minimum t such that p is a t-approximate period of x. We use dynamic programming to compute t as follows. Let t i be the minimum value such that p is a t i -approximate period of x[1::i]. Let t 0 = 0. For i =1 to n, we compute t i by the following formula: t i = min (max(t h;w h+1;i )): 06h i The value t n is the minimum t such that p is a t-approximate period of x. To compute the distances in step 1, we use a dynamic programming table called the D table. To compute the distance between two strings x and y, ad table of size ( x + 1) ( y + 1) is used. Each entry D[i; j]; 06i6 x and 06j6 y, stores the minimum cost of transforming x[1::i] to y[1::j]. Initially, D[0; 0] = 0; D[i; 0] = D[i 1; 0] + (x[i];), and D[0; j]=d[0; j 1] + (; y[j]). Then we can compute all the entries of the D table in O( x y ) time by the following recurrence: D[i 1; j]+(x[i];) D[i; j] = min D[i; j 1] + (; y[ j]) D[i 1; j 1] + (x[i]; y[ j]): Note that is a blank space, and thus (a; ) means the deletion cost of a and (; a) means the insertion cost of a. We now consider the time complexity of the above algorithm for Problem 1 under various distances functions. For a weighted edit distance, we make a D table of size (m +1) (n i + 2) for each position i of x. Since the distances between all prexes of p and x[i::n] appear in the last column of the D table, we can get the minimum value of the last column as w in without increasing the time complexity. Hence, step 1 takes O(mn 2 ) time. In step 2, we can compute the minimum t in O(n 2 ) time since we compare O(n) values for each t i. Therefore, the total time complexity is O(mn 2 ). When the edit distance is used for, this algorithm for Problem 1 can be improved. In this case, (a; b) is always 1 if a b; (a; b) = 0, otherwise. Now it is not necessary to compute the edit distances between p and the substrings of x whose lengths are larger than 2m because their edit distances with p will exceed m. Step 1 now takes O(m 2 n) time since we make a D table of size (m +1) (2m + 1) for each position of x. Also, step 2 can be done in O(mn) time since we compare O(m) values at each position of x. Thus the time complexity is reduced to O(m 2 n). However, we can do better. Step 1 can be solved in O(mn) time by the algorithm due to Landau et al. [22]. Given two strings x and y and a forward (resp. backward) solution for the comparison between x and y, the algorithm in [22] incrementally computes a solution for x and by (resp. yb) ino(k) time, where b is an additional character and k is a threshold on the edit distance. This can be done due to the relationship between the solution for x and y and the solution for x and by. When k = m (i.e., the threshold is not given), we can compute all the edit distances between p and every substring of x whose length is at most 2m in O(mn) time using this algorithm. Recently, Kim and Park [20] gave a simpler O(mn)-time algorithm for the same problem. Therefore,

7 J.S. Sim et al. / Theoretical Computer Science 262 (2001) we can solve Problem 1 in O(mn) time if the edit distance is used for. When the threshold k on the edit distance is given as input for Problem 1, it can be solved in O(kn) time because each step of the above algorithm takes O(kn) time. If we use the Hamming distance for, the algorithm takes trivially O(n) time since x must be partitioned into blocks of size m. (The last partition block can be shorter than m.) Therefore, we have the following theorem. Theorem 1. Problem 1 can be solved in O(mn 2 ) time when a weighted edit distance is used for. If the edit distance (resp. the Hamming distance) is used for ; it can be solved in O(mn) time (resp. in O(n) time) Problem 2 When a relative edit distance is used for, Problem 2 can be solved in O(n 4 ) time by our algorithm for Problem 1. Let p be a candidate string for the approximate period of x. If we take each substring of x as p and apply the O(mn) algorithm for Problem 1 (that uses the algorithm in [22]), it takes O( p n) time for each p. Since there are O(n 2 ) substrings of x, the overall time is O(n 4 ). Without using the somewhat complicated algorithm in [22], however, we can solve Problem 2 in O(n 4 ) time by the following simple algorithm for weighted edit distances as well as relative edit distances. We take every substring of x as a candidate string p, and compute the minimum distance t such that p is a t-approximate period of x. But, we do this for all substrings of x that start at a position i at the same time, as follows. Let T be the minimum distance so far. Initially, T =. For i =1 to n, wedothe following. For each i, we process the n i + 1 substrings that start at position i as candidate strings. Let m be the length of a chosen substring of x as p. Initially, m =1. (1) Take x[i::i + m 1] as p and compute w hj for all 16h6j6n. The denition of w hj is the same as in Problem 1. This computation can be done by making nd tables with p and each of n suxes of x. By adding just one row to each of previous D tables (i.e., ndtables when p = x[i::i +m 2]), we can compute these new D tables in O(n 2 ) time. See Fig. 2. (Note that when m = 1, we create new D tables.) (2) Compute the minimum distance t such that p is a t-approximate period of x. This step is similar to the second step of the algorithm for Problem 1. Let t j be the minimum value such that p is a t j -approximate period of x[1::j] and let t 0 =0. For j =1 to n, we compute t j by the following formula: t j = min 06h j (max(t h;w h+1;j )): The value t n is the minimum t such that p is a t-approximate period of x. Ift n is smaller than T, we update T with t n.ifm n i + 1, increase m by 1 and go to step 1.

8 564 J.S. Sim et al. / Theoretical Computer Science 262 (2001) Fig. 2. Computing new D tables. When all the steps are completed, the nal value of T is the minimum distance and the substring p that is a T -approximate period of x is an answer to Problem 2. Theorem 2. Problem 2 can be solved in O(n 4 ) time when a weighted edit distance or a relative edit distance is used for. When a relative Hamming distance is used for ; Problem 2 can be solved in O(n 3 ) time. Proof. For a weighted edit distance, we make n D tables in O(n 2 ) time in step 1 and compute the minimum distance in O(n 2 ) time in step 2. For m =1 to n i +1, we repeat the two steps. Therefore, it takes O(n 3 ) time for each i and the total time complexity of this algorithm is O(n 4 ). If a relative edit distance is used, this algorithm can be slightly simplied as in Problem 1, but it still takes time O(n 4 ). For a relative Hamming distance, it takes O(n) time for each candidate string and there are O(n 2 ) candidate strings. Thus, the total time complexity is O(n 3 ) Problem 3 Given a set of strings, the shortest common supersequence (SCS) problem is to nd a shortest common supersequence of all strings in the set. The SCS problem is NP-complete [25, 29]. We will show that Problem 3 is NP-complete by a reduction from the SCS problem. In this section we will call Problem 3 the AP problem (abbreviation of the approximate period problem). The decision versions of the SCS and AP problems are as follows: Denition 1. Given a positive integer m and a nite set S of strings from where is a nite alphabet, the SCS problem is to decide if there exists a common supersequence w of S such that w 6m. Denition 2. Given a number t; a string x from ( ) where is a nite alphabet, and a penalty matrix, the AP problem is to decide if there exists a string u such that u is a t-approximate period of x.

J.S. Sim et al. / Theoretical Computer Science 262 (2001) 557 568 565 Fig. 3. The penalty matrix M. Now we transform an instance of the SCS problem to an instance of the AP problem.

9 J.S. Sim et al. / Theoretical Computer Science 262 (2001) Fig. 3. The penalty matrix M. Now we transform an instance of the SCS problem to an instance of the AP problem. We can assume that = {0; 1} since the SCS problem is NP-complete even if = {0; 1} [27, 29]. Assume that there are n strings s 1 ;:::;s n in S. First, we set = {a; b; #; $; 1 ; 2 ;}. Let x =# 1 m $# 2 m $#s 1 $#s 2 $ #s n $. Then, set t = m and dene the penalty matrix as in Fig. 3, where a shaded entry can be any value greater than m which makes the penalty matrix a metric. For example, every shaded entry can be m + 1. It is easy to see that this transformation can be done in polynomial time. Lemma 1. Assume that x is constructed as above. If u is an m-approximate period of x; then u is of the form #$ where {a; b} m. Proof. We rst show that u must have one # and one $. (1) Suppose that u has no # (resp. $). Clearly, there exists a partition block of x which has at least one # (resp. $), and the distance between u and the partition block is greater than m. Therefore, u must have at least one # and at least one $. (2) Suppose that u has more than one # (or $). Assume that u has two # s. (The other cases are similar.) Then u must also have two $ s, because unless the number of # s equals that of $ s in u, at least one partition block of x cannot have the same numbers of # s and $ s to those of u. If the numbers of # s and $ s in u dier from those of a partition block of x, the distance between u and the partition block exceed m due to the penalty matrix M. Consider the rst partition block of x. The rst partition block is # 1 m $# 2 m $ because it must have two # s and two $ s as u. For the distance between u and the rst partition block of x to be at most m; u must have at least m characters from { 1 ; 2 }. In such cases, however, the distance between u and at least one partition block of x will exceed m. It remains to show that u =#$ such that {a; b} m. Since u has one # and one $, x must be partitioned just after every occurrence of $. Let u be of the form #$, where ; ; {0; 1; a;b; 1 ; 2 ;}. Consider the rst two partition blocks # 1 m $

10 566 J.S. Sim et al. / Theoretical Computer Science 262 (2001) Fig. 4. An alignment of S {u}. and # 2 m $ofx. If contains i 1 s for i 1; must also have i 2 s and the remaining m 2i characters in must be from {a; b} so that the distances between u and the rst two partition blocks of x do not exceed m. However, this makes the distance between u and any other partition block of x exceed m due to 1 s and 2 s in. Hence cannot have 1 or 2. Also, cannot have any character from {0; 1; } since 0, 1 and have cost 2 with 1 and 2 in the rst two partition blocks of x. For the distances between u and the rst two partition blocks of x to be at most m, and must be empty and must be of the form {a; b} m. See Fig. 4. Theorem 3. The AP problem is NP-complete. Proof. It is easy to see that the AP problem is in NP. To show that the AP problem is NP-complete, we need to show that S has a common supersequence w such that w 6m if and only if there exists a string u such that u is an m-approximate period of x. (if) By Lemma 1, u =#$ where {a; b} m. Since u is an m-approximate period of x, the distance between u and each partition block #s i $ is at most m. (The distances between u and the rst two partition blocks # 1 m $ and # 2 m $ are always m.) Consider an alignment of S {u}. Since = m and the distance between and s i is at most m, each a (resp. b) in must be aligned with 0 (resp. 1) or in s i. (See cases (a) and (b) in Fig. 4.) If we substitute 0 for a and 1 for b in, we obtain a common supersequence w of s 1 ;:::;s n such that w = m. (Note that if a or b in is aligned with for all s i, we can delete the character in and we can obtain a common supersequence which is shorter than m. See case (c) in Fig. 4.) Asimilar alignment was used by Wang and Jiang [32].

11 J.S. Sim et al. / Theoretical Computer Science 262 (2001) (only if ) Let w be a common supersequence of S such that w 6m. Let be the string constructed by substituting a for 0 and b for1inw. (When w m, we append some characters from {a; b} to so that = m:) Partition x just after every occurrence of $. The distance between each partition block of x and #$ ism since each a (resp. b) in can be aligned with 0 (resp. 1), ; 1,or 2 in each partition block. Therefore, #$ is an m-approximate period of x. 5. Conclusion In this paper, we have introduced the notion of approximate periods and three related problems. We presented polynomial-time algorithms for two of the problems and an NP-completeness proof for the third problem. There can be a variety of problems on approximate periods. For example, the following problem is an interesting one: Given a string x and a distance function, nd a substring of x that has an approximate period. This is a problem of nding an interesting area (i.e., approximately periodic area) in a given string. References [1] A. Apostolico, D. Breslauer, An optimal O(loglogN )-time parallel algorithm for detecting all squares in a string, SIAM J. Comput. 25 (6) (1996) [2] A. Apostolico, D. Breslauer, Z. Galil, Optimal parallel algorithms for periods, palindromes and squares, in: Proc. 19th Int. Colloq. Automata Languages and Programming, Lecture Notes in Computer Science, Vol. 623, Springer, Berlin, 1992, pp [3] A. Apostolico, A. Ehrenfeucht, Ecient detection of quasiperiodicities in strings, Theoret. Comput. Sci. 119 (2) (1993) [4] A. Apostolico, M. Farach, C.S. Iliopoulos, Optimal superprimitivity testing for strings, Inform. Process. Lett. 39 (1) (1991) [5] A. Apostolico, F.P. Preparata, Optimal o-line detection of repetitions in a string, Theoret. Comput. Sci. 22 (1983) [6] O. Berkman, C.S. Iliopoulos, K. Park, The subtree max gap problem with application to parallel string covering, Inform. Comput. 123 (1) (1995) [7] D. Breslauer, An on-line string superprimitivity test, Inform. Process. Lett. 44 (1992) [8] D. Breslauer, Testing string superprimitivity in parallel, Inform. Process. Lett. 49 (5) (1994) [9] T. Crawford, C.S. Iliopoulos, R. Raman, String matching techniques for musical similarity and melodic recognition, Comput. Musicol. 11 (1998) [10] M. Crochemore, An optimal algorithm for computing the repetitions in a word, Inform. Process. Lett. 12 (5) (1981) [11] M. Crochemore, C.S. Iliopoulos, M. Korda, Two-dimensional prex string matching and covering on square matrices, Algorithmica 20 (1998) [12] M. Crochemore, C.S. Iliopoulos, H. Yu, Algorithms for computing evolutionary chains in molecular and musical sequences, Proc. 9th Austral. Workshop on Combinatorial Algorithms, 1998, pp [13] M. Crochemore, W. Rytter, Text Algorithms, Oxford University Press, Oxford, [14] C.S. Iliopoulos, M. Korda, Optimal parallel superprimitivity testing on square arrays, Parallel Process. Lett. 6 (3) (1996) [15] C.S. Iliopoulos, D.W.G. Moore, K. Park, Covering a string, Algorithmica 16 (1996) [16] C.S. Iliopoulos, L. Mouchard, An O(nlogn) algorithm for computing all maximal quasiperiodicities in strings, in: Proc. Computing: Australasian Theory Symposium, Lecture Notes in Computer Science, Springer, Berlin, 1999, pp

12 568 J.S. Sim et al. / Theoretical Computer Science 262 (2001) [17] C.S. Iliopoulos, K. Park, An optimal O(log logn)-time algorithm for parallel superprimitivity testing, J. Korea Inform. Sci. Soc. 21 (1994) [18] C.S. Iliopoulos, K. Park, Awork-time optimal algorithm for computing all string covers, Theoret. Comput. Sci. 164 (1996) [19] C.S. Iliopoulos, W.F. Smyth, On-line algorithms for k-covering, Proc. 9th Austral. Workshop on Combinatorial Algorithms, 1998, pp [20] S. Kim, K. Park, Adynamic edit distance table, in: Proc. 11th Symp. Combinatorial Pattern Matching, Lecture Notes in Computer Science, Vol. 1848, Springer, Berlin, 2000, pp [21] D.E. Knuth, J.H. Morris, V.R. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1) (1977) [22] G.M. Landau, E.W. Myers, J.P. Schmidt, Incremental string comparison, SIAM J. Comput. 27 (2) (1998) [23] G.M. Landau, J.P. Schmidt, An algorithm for approximate tandem repeats, in: Proc. 4th Symp. Combinatorial Pattern Matching, Lecture Notes in Computer Science, Vol. 648, Springer, Berlin, 1993, pp [24] Y. Li, W.F. Smyth, An optimal on-line algorithm to compute all the covers of a string, preprint. [25] D. Maier, The complexity of some problems on subsequences and supersequences, J. ACM 25 (2) (1978) [26] M.G. Main R.J. Lorentz, An algorithm for nding all repetitions in a string, J. Algorithms 5 (1984) [27] M. Middendorf, More on the complexity of common superstring and supersequence problems, Theoret. Comput. Sci. 125 (2) (1994) [28] D. Moore, W.F. Smyth, Acorrection to An optimal algorithm to compute all the covers of a string, Inform. Process. Lett. 54 (2) (1995) [29] K.J. Raiha, E. Ukkonen, The shortest common supersequence problem over binary alphabet is NP-complete, Theoret. Comput. Sci. 16 (1981) [30] J.P. Schmidt, All highest scoring paths in weighted grid graphs and its application to nding all approximate repeats in strings, SIAM J. Comput. 27 (4) (1998) [31] P.H. Sellers, Pattern recognition genetic sequences by mismatch density, Bull. Math. Biol. 46 (4) (1984) [32] L. Wang, T. Jiang, On the complexity of multiple sequence alignment, J. Comput. Biol. 1 (4) (1994)

Implementing Approximate Regularities

Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National