A Survey of the Longest Common Subsequence Problem and Its Related Problems



A Survey of the Longest Common Subsequence Problem and Its Related Problems

A Thesis Submitted to the Faculty of the Department of Computer Science and Engineering, National Sun Yat-sen University, by Jian-Bein Chen, in Partial Fulfillment of the Requirements for the Degree of Master of Science, June 29, 2005


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ABSTRACT
Chapter 1. Introduction
Chapter 2. The Longest Common Subsequence Problem
  2.1 The Longest Common Subsequence Problem
  2.2 Hunt and Szymanski's Algorithm
  2.3 Hirschberg's Algorithm
  2.4 Some Algorithms of LCS
  2.5 Linear Space Algorithms
Chapter 3. Related Problems of LCS
  3.1 The Cyclic String Correction Problem
  3.2 The Longest Increasing Subsequence Problem
  3.3 The Constrained Longest Common Subsequence Problem
Chapter 4. The Edit Distance Problem
  4.1 The Edit Distance Problem
  4.2 Other Operations in the Edit Distance Problem
Chapter 5. Sequence Alignment
  5.1 Global Alignment

  5.2 Local Alignment
  5.3 Affine Gap Penalty
  5.4 Multiple Sequences Alignment
Chapter 6. Conclusion
BIBLIOGRAPHY

LIST OF FIGURES

2.1 LCS subsequences
2.2 An example of LCS of two strings. (A) Each entry L(i, j) = k means there is an LCS of length k along the path from (i, j) to (0, 0). (B) Three possible directions of the tracking back path
2.3 Steps 1 and 2 of Hunt and Szymanski's algorithm
2.4 Step 3 of Hunt and Szymanski's algorithm
2.5 Step 4 of Hunt and Szymanski's algorithm
2.6 Contours in the scoring matrix
2.7 Three possible regions in contour
2.8 Steps 1 and 2 of Hirschberg's algorithm
2.9 Steps 3-9 of Hirschberg's algorithm when k = 1 and 2
2.10 Steps 3-9 of Hirschberg's algorithm when k = 3
2.11 Step 10 of Hirschberg's algorithm
2.12 Calculation of break points
2.13 The divide and conquer of the LCS algorithm [37]
2.14 The recursive calls of the LCS algorithm [37]
3.1 An example of transforming two cyclic strings
3.2 An example of LCS crossing

3.3 The divide and conquer of Maes's algorithm
3.4 Solving LIS with the LCS algorithm
3.5 Possible break points for two input sequences with a constrained sequence
3.6 The scoring matrices obtained from two input sequences with a constrained sequence
3.7 The tracing path obtained from two input sequences with a constrained sequence
4.1 An example of finding the edit distance between two strings
5.1 The optimal score of sequence alignment obtained by dynamic programming
5.2 The optimal alignment obtained by the tracing back approach
5.3 The optimal alignment obtained by dynamic programming
5.4 An illustration of ant behavior: as pheromone accumulates on the shorter path, more ants choose it

LIST OF TABLES

2.1 Time and space complexity of algorithms computing LCS(A, B)
3.1 The algorithms for the LIS problem
4.1 Some algorithms for sorting by reversals
4.2 Some algorithms for sorting by transpositions
5.1 Some algorithms of multiple sequence alignment
5.2 The ambiguous symbols of IUPAC

ABSTRACT

The longest common subsequence (LCS) problem is a classical problem in both combinatorial optimization and computational biology. During the past few decades, a considerable number of studies have focused on LCS and its related problems. In this thesis, we shall present a complete survey of LCS and its related problems, and review some efficient algorithms for solving them. We shall also give the definition of each problem, present some theorems, and illustrate the time and space complexities of the algorithms for solving these problems.

CHAPTER 1
Introduction

Since the LCS (longest common subsequence) problem was proposed in 1974 by Wagner and Fischer [129], much ink has been spent on this subject [8,13,21,22,25,33,38,51,72,91,93,94,97,138]. Until now, LCS has remained a hot research topic in many specialized fields, including computer science, molecular biology, mathematics, speech recognition, gas chromatography, bird song analysis and so on. In different areas, the LCS problem takes slightly different definitions. The purpose of this thesis is to list the problems derived from LCS, and to enumerate some well-known algorithms for each of them.

The purpose of LCS algorithms is to find all matching points and dominant points quickly and efficiently. LCS algorithms can be divided into two categories: one processes the scoring matrix row by row, and the other proceeds contour by contour [61,65]. In different applications, there are many LCS related problems, such as the LIS (longest increasing subsequence) problem [23,47,137], the cyclic string correction problem [26,49,73,86,100,111,122], the longest arc-preserving common subsequence problem [67,68,81,82] and the constrained longest common subsequence problem [30,104,123]. With some properties of the LCS, we can find efficient ways to solve these related problems.

The edit distance is widely used in many areas, such as computational biology, speech recognition and text retrieval [7,40,54,92,96,109,111,127]. The Levenshtein edit distance has three edit operations: insertion, deletion and replacement [78]. The LCS is a special case of edit distance with only two operations, insertion and deletion. Based on this idea, some other edit operations have also been defined, such as transpose, reversal, move and so on [36,70,90].

Sequence alignment is an important problem in computer science, and it can be divided into pairwise alignment and multiple sequence alignment [25,39,46,74,75,106,131,134]. When we construct an alignment, there are two different strategies. One is global alignment [6,13,14,135], which aligns the full strings. The other is local alignment [34,87,98,120,121,126], which emphasizes substrings of the input strings. In biology, mutations usually occur in successive regions, rather than at random positions in the genome. Therefore, we also have the alignment problem with affine gap penalties. Multiple sequence alignment (MSA) algorithms can be classified into exact, progressive and iterative algorithms. To speed up, many algorithms for multiple sequence alignment are based on pairwise alignment [24,29,48,139].

The rest of this thesis is organized as follows. In Chapter 2, we introduce some well-known dynamic programming algorithms which solve the LCS problem efficiently. We also give two different approaches to the LCS problem: Hunt and Szymanski's algorithm [65] in Section 2.2, and Hirschberg's algorithm [61] in Section 2.3. We discuss some other LCS algorithms and give the upper bound for the time complexity of LCS [11,52,53,60,85,108,128,132] in Section 2.4. We give some linear space algorithms in Section 2.5. In Section 3.1, we introduce the cyclic string correction problem and reduce it to the cyclic longest common subsequence problem. We give some LIS algorithms in Section 3.2. In Section 3.3, we introduce the constrained longest common subsequence problem and two algorithms for it [30,104,123].
In Chapter 4, we give some properties of the Levenshtein edit distance problem and list the complexities of the edit distance problem with various edit operations. In Chapter 5, we give a brief introduction to sequence alignment. We introduce global alignment and local alignment in Sections 5.1 and 5.2, respectively. Pairwise alignment with affine gap penalties is introduced in Section 5.3. We present three kinds of multiple sequence alignment algorithms and some biological programs in Section 5.4. In Chapter 6, we list some open problems on LCS and its related problems, and also give a conclusion of this thesis.

CHAPTER 2
The Longest Common Subsequence Problem

The LCS problem was defined by Wagner and Fischer in 1974 [129]. Since then, many algorithms have been proposed [8,13,21,22,25,33,38,51,64,72,91,93,94,97,136,138]. These algorithms can be categorized into those that process the scoring matrix row by row and those that proceed contour by contour. We will introduce these algorithms one by one, according to their categories.

2.1 The Longest Common Subsequence Problem

Let A and B be two strings, where A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m, n >= m. A subsequence is obtained by deleting zero or more (not necessarily consecutive) characters from the original string. A common subsequence of A and B is a subsequence of both A and B. A longest common subsequence (LCS) of A and B is a common subsequence of maximum length. Generally speaking, the LCS may not be unique. For example, Figure 2.1 shows two common subsequences of two strings; the longer of the two is not beaten by any other common subsequence, so it is an LCS of the two strings. The LCS problem can be defined as follows:

Definition: Given two strings A and B, where A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m, n >= m, the LCS problem is defined as finding an LCS of A and B.

Figure 2.1 LCS subsequences.

The first kind of LCS algorithms builds the scoring matrix row by row, usually by dynamic programming. In 1970, Needleman and Wunsch [98] proposed a well-known method which uses dynamic programming to calculate an (m + 1) x (n + 1) matrix. Let A_{1,i} denote a_1 a_2 ... a_i, 1 <= i <= n, and B_{1,j} denote b_1 b_2 ... b_j, 1 <= j <= m. Let L(i, j) denote the length of the LCS of A_{1,i} and B_{1,j}. Each entry L(i, j) falls into one of three cases:

Case 1: Delete a_i. The LCS of a_1 a_2 ... a_i and b_1 b_2 ... b_j is the same as the LCS of a_1 a_2 ... a_{i-1} and b_1 b_2 ... b_j.

Case 2: Delete b_j. The LCS of a_1 a_2 ... a_i and b_1 b_2 ... b_j is the same as the LCS of a_1 a_2 ... a_i and b_1 b_2 ... b_{j-1}.

Case 3: Match a_i with b_j. When a_i = b_j, the LCS of a_1 a_2 ... a_i and b_1 b_2 ... b_j is the LCS of a_1 a_2 ... a_{i-1} and b_1 b_2 ... b_{j-1} plus the matching score 1.

Therefore, L(i, j) reduces to one of the three values L(i-1, j), L(i, j-1) and L(i-1, j-1) + 1, and each of these can in turn be reduced in the same way to subproblems of smaller size. Finally, the problem reduces to the initial states L(i, 0), L(0, j) or L(0, 0), whose scores are all 0. We can calculate L(i, j) formally as follows:

L(i, 0) = 0, for i >= 0
L(0, j) = 0, for j >= 0
L(i, j) = max{ L(i-1, j), L(i, j-1), L(i-1, j-1) + 1 if a_i = b_j }, for 1 <= i <= n, 1 <= j <= m.

The advantages of this algorithm are that it is easy to implement and easy to trace back. It is clear that the time complexity is O(mn). In Figure 2.2, we use the above scoring function to fill the matrix row by row, and use the matching pairs to trace back the LCS.

Figure 2.2 An example of LCS of two strings. (A) Each entry L(i, j) = k means there is an LCS of length k along the path from (i, j) to (0, 0). (B) Three possible directions of the tracking back path.
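The recurrence above can be implemented directly. The following is a minimal sketch (function and variable names are ours, not the thesis's), which fills the matrix row by row and then recovers one LCS by the usual traceback:

```python
def lcs(A, B):
    """One LCS of A and B via the O(mn) dynamic program."""
    n, m = len(A), len(B)
    # L[i][j] = length of an LCS of A[:i] and B[:j]
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if A[i - 1] == B[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Trace back from (n, m) to (0, 0), collecting matched symbols.
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if A[i - 1] == B[j - 1]:
            out.append(A[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

When several LCSs exist, the traceback returns one of them, depending on how ties between L(i-1, j) and L(i, j-1) are broken.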

In the figure, we can note that the score can be increased only by a match. Based on this idea, there are two different approaches to finding the LCS. First, Hunt and Szymanski's algorithm [65] uses dynamic programming to find the match positions. The second algorithm was proposed by Hirschberg [61], and it uses the divide and conquer strategy to find the match positions. In the following two sections, we introduce these two algorithms in detail.

2.2 Hunt and Szymanski's Algorithm

Hunt and Szymanski [65] perform preprocessing on the input strings to find all matches between them first. With the information obtained from the preprocessing, the matching positions can be computed more quickly. The algorithm can be divided into four steps, and its time complexity is O((r + n) log n), where n is the length of the two input strings and r is the total number of matches between them. Let LCS(A, B) denote the length of the LCS of strings A and B. The key data structure in this algorithm is the array of threshold values T_{i,k}, where T_{i,k} is the least j such that A_{1,i} and B_{1,j} have a common subsequence of length k. For example, in Figure 2.3, T[4, 3] = 6 means that LCS(A_{1,4}, B_{1,6}) = 3, but LCS(A_{1,4}, B_{1,5}) <= 2: the LCS length of A_{1,4} and B_{1,6} is 3, while the LCS length of A_{1,4} and B_{1,5} is only 2.

Hunt and Szymanski's Algorithm
Input: Two strings A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_n.
Output: The longest common subsequence of A and B.

element array A[1:n], B[1:n];
integer array THRESH[0:n];
list array MATCHLIST[1:n];
pointer array LINK[1:n];
pointer PTR;

1. for i = 1 step 1 until n do
     set MATCHLIST[i] = <j_1, j_2, ..., j_p> such that j_1 > j_2 > ... > j_p and A[i] = B[j_q] for 1 <= q <= p;

2. THRESH[0] = 0;
   for i = 1 step 1 until n do THRESH[i] = n + 1;
   LINK[0] = null;

3. for i = 1 step 1 until n do
     for j on MATCHLIST[i] do
       begin
         find k such that THRESH[k - 1] < j <= THRESH[k];
         if j < THRESH[k] then
           begin
             THRESH[k] = j;
             LINK[k] = newnode(i, j, LINK[k - 1]);
           end;
       end;

4. k = largest k such that THRESH[k] is not n + 1;
   PTR = LINK[k];
   while PTR is not null do
     begin
       print the (i, j) pair pointed to by PTR;
       advance PTR;
     end

Here we denote by (i, j) a general pair, and by <i, j> a matching pair, i.e., a pair with a_i = b_j. In Step 1, for each i we record, in decreasing order, the j-values at which a_i matches in B. To build these lists, we attach to each symbol of B its index j and sort the resulting list by symbol in lexical order, which takes O(n log n) time. Any symbol can then be located in the sorted list by binary search in O(log n) time, and in the worst case we search for n distinct symbols, which takes O(n log n) time. Hence Step 1 requires O(n log n) time and O(n) space. Step 2 initializes the THRESH array in O(n) time, filling all entries with n + 1, which represents infinity. In the example of Figure 2.3, the symbol a_2 appears in columns 5 and 8 of B; hence MATCHLIST[2] is [8, 5], recording that <2, 5> and <2, 8> are matching pairs.

In Step 3, two nested for-loops process all matching pairs <i, j>, with i increasing and, for each i, j decreasing; hence we execute r iterations in total. In each iteration, we update the THRESH structure row by row. Replacing THRESH[k] by j means that, among all common subsequences of length k found so far, j is the leftmost ending position. For example, in Figure 2.4, when j runs over MATCHLIST[2], we first place 8 in the entry for length 2, because THRESH[1] < 8 < infinity, and link it to the entry for length 1; this records a common subsequence of length 2 ending at column 8. Then the second element 5 of MATCHLIST[2] satisfies THRESH[1] < 5 < 8, so 5 replaces 8, meaning that a better (smaller) threshold has been found. Since Step 3 considers all matches in the scoring matrix, it requires O(n + r log n) time and at most O(r) list nodes. In the final step, we take the largest filled column of THRESH and trace the LCS back through the linked lists, which takes at most O(n) time. In 1987, Apostolico and Guerra [12] used the idea of dominant matches to reduce the number of matches which need to be considered.
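The threshold mechanism of Steps 1-3 can be sketched compactly as follows. This is our reconstruction, computing only the LCS length: Python's bisect stands in for the search for k in Step 3, and visiting each row's matches in decreasing j order plays the role of MATCHLIST.

```python
from bisect import bisect_left

def hunt_szymanski_length(A, B):
    """LCS length via Hunt and Szymanski's threshold method (sketch)."""
    # Step 1: positions of each symbol in B, in decreasing order.
    occ = {}
    for j, c in enumerate(B):
        occ.setdefault(c, []).append(j)
    for c in occ:
        occ[c].reverse()
    # Steps 2-3: thresh[k] = smallest j such that some common subsequence
    # of length k + 1 of the processed prefix of A ends at B[j].
    thresh = []
    for c in A:
        for j in occ.get(c, ()):  # j decreasing, so one row adds at most one to each chain
            k = bisect_left(thresh, j)
            if k == len(thresh):
                thresh.append(j)
            elif j < thresh[k]:
                thresh[k] = j
    return len(thresh)  # Step 4: largest k with a finite threshold
```

Recovering the subsequence itself requires the LINK lists of the pseudocode above; this sketch omits them for brevity.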

Figure 2.3 Steps 1 and 2 of Hunt and Szymanski's algorithm.

Figure 2.4 Step 3 of Hunt and Szymanski's algorithm.

Figure 2.5 Step 4 of Hunt and Szymanski's algorithm.

2.3 Hirschberg's Algorithm

The second kind of approach is the contour-based approach, which was proposed by Hirschberg in 1977 [61]. Before explaining the algorithm, we first give some definitions and properties of contours. Here we denote by <i, j> a position where a_i = b_j, and by L(i, j) the length of the LCS of A_{1,i} and B_{1,j}. We call <i, j> a k-candidate if a_i = b_j and L(i, j) = k. According to Theorem 2.1, among all k-candidates, we need only consider those which dominate other k-candidates and are not dominated by any other k-candidate. These are called minimal k-candidates. In Figure 2.6, all circled positions are k-candidates, and the bold ones are minimal k-candidates. For example, <1, 2>, <1, 4>, <1, 7>, <3, 1> and <7, 1> are all 1-candidates of the two strings in the figure. It is clear that <1, 2> and <3, 1> are both minimal 1-candidates, because they dominate all other 1-candidates except each other. Therefore, all 2-candidates can be obtained from <1, 2> and <3, 1>. The regions of the matrix separated by dashed lines are called contours; each contour contains the k-candidates for one value of k, and contours never cross each other. Moreover, minimal k-candidates must lie on the corners of a contour, because these are the positions that can never be dominated. Hence, one can transform the LCS problem into finding the sets of minimal k-candidates. If we have found all minimal k-candidates and there is no (k + 1)-candidate, then the length of the LCS is k, and an LCS path must exist within these sets.

Theorem 2.1 [61] Let A = <x_1, y_1> and B = <x_2, y_2> be two k-candidates. If x_1 <= x_2 and y_1 <= y_2, then we say that A dominates B (B is a superfluous k-candidate), since any (k + 1)-candidate generated by B can also be generated by A. Thus, from the set of k-candidates, we need only consider those that are minimal under the usual vector ordering. Note that if A and B are both minimal elements, then x_1 < x_2 iff y_2 > y_1.
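To make k-candidates and minimal k-candidates concrete, the following naive sketch (ours, not Hirschberg's efficient procedure) reads them off the full scoring matrix: it groups the matching positions by their L-value and then discards every candidate dominated by another one.

```python
def minimal_candidates(A, B):
    """Group matches <i, j> (1-based) by k = L(i, j); keep the minimal ones."""
    n, m = len(A), len(B)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            L[i][j] = (L[i - 1][j - 1] + 1 if A[i - 1] == B[j - 1]
                       else max(L[i - 1][j], L[i][j - 1]))
    # Collect all k-candidates: matching positions with L(i, j) = k.
    cands = {}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if A[i - 1] == B[j - 1]:
                cands.setdefault(L[i][j], []).append((i, j))
    # Keep only candidates not dominated by another candidate of the same k.
    minimal = {}
    for k, pairs in cands.items():
        minimal[k] = [(i, j) for (i, j) in pairs
                      if not any(x <= i and y <= j and (x, y) != (i, j)
                                 for (x, y) in pairs)]
    return minimal
```

The point of Hirschberg's algorithm is to enumerate exactly these minimal candidates contour by contour, without ever building the full matrix.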

Figure 2.6 Contours in the scoring matrix.

Theorem 2.2 [61] For k >= 1, <i, j> is a minimal k-candidate iff j is the minimum value such that b_j = a_i and low < j < high, where high is the minimum j-value of all k-candidates whose i-value is less than i (no upper limit if there is no such k-candidate) and low is the minimum j-value of all (k - 1)-candidates whose i-value is less than i.

In this algorithm, |Σ| = s, where s is the number of distinct symbols in B, and the symbols of Σ are encoded as 1, 2, ..., s. We denote by NB[θ] the number of times that symbol θ occurs in string B, and by PB[θ, 1], ..., PB[θ, NB[θ]] the ordered list of positions in B at which symbol θ occurs. For example, in Figure 2.8, there are 4 distinct symbols in string B, encoded as 1 to 4 (C is encoded to 3 and T to 4). For the symbol encoded to 2, which corresponds to the second row of the PB array, PB[2, 1] = 5 and PB[2, 2] = 8 mean that this symbol first occurs at the fifth position of B, and the second occurrence is at the eighth position.

The Hirschberg Algorithm
Input: Two strings A = a_1 a_2 ... a_m and B = b_1 b_2 ... b_n.
Output: The longest common subsequence of A and B.

ALGD(m, n, A, B, C, p)
1. NB[θ] = 0 for θ = 1, ..., s
   PB[θ, 0] = 0 for θ = 1, ..., s
   PB[0, 0] = 0; PB[0, 1] = 0
   for j = 1 step 1 until n do
     begin
       NB[b_j] = NB[b_j] + 1
       PB[b_j, NB[b_j]] = j
     end
2. D[0, i] = 0 for i = 0, ..., m
   lowcheck = 1
3. for k = 1 step 1 do
   begin
4.   N[θ] = NB[θ] for θ = 1, ..., s
     N[0] = 1
     FLAG = 0
     low = D[k - 1, lowcheck]
     high = n

5. for i = lowcheck + 1 step 1 until m do
   begin
6.   while PB[a_i, N[a_i] - 1] > low do N[a_i] = N[a_i] - 1
7.   if high > PB[a_i, N[a_i]] > low then
       begin
         high = PB[a_i, N[a_i]]
         D[k, i] = high
         if FLAG = 0 then {lowcheck = i, FLAG = 1}
       end
     else D[k, i] = 0
8.   if D[k - 1, i] > 0 then low = D[k - 1, i]
   end (loop of Step 5)
9. if FLAG = 0 then go to Step 10
   end (loop of Step 3)
10. p = k - 1
    k = p
    for i = m + 1 step -1 until 0 do
      if D[k, i] > 0 then
        begin
          c_k = a_i
          k = k - 1
        end

In Step 1, we record the number of occurrences of each symbol θ in B and sort the occurrences by their j-values in the PB matrix. This takes O(n log s) time, where s is the number of distinct symbols in B. With this information, we can quickly bound the region in which the next minimal k-candidate can be found.

In Steps 3-9, we use Theorem 2.2 to bound the possible region for the k-candidates using the previous (k - 1)-candidates. For example, Figure 2.7(a) shows the

initial state with k = 2, where the gray area is the possible region for the minimal 2-candidates. In Figure 2.7(b), a new bound of the gray area is set after we find the first minimal 2-candidate. Therefore, each time we find a new minimal k-candidate, the search area for the other k-candidates becomes smaller. Figures 2.9 and 2.10 show how each new minimal k-candidate is found. Steps 3-9 perform p loops to find all k-candidates. In each loop, we use PB to quickly locate the occurrence of a specific symbol with the j-value closest to the lower bound, which shrinks the region more efficiently. Since a single loop takes O(n) time, this phase can be done in O(pn) time. Each minimal k-candidate obtained is recorded in a matrix as D[k, i] = j, meaning that the position of a common subsequence of length k in row i is j. Finally, we can use the matrix D to trace the LCS path, which takes O(m) time. These steps take O(n log s + pn + m) time in total; therefore the time complexity is O(n log s + pn), where s is the number of distinct symbols and p is the length of the LCS.

2.4 Some Algorithms of LCS

In recent years, there has been a renewal of interest in the lower bound for the complexity of the LCS problem [3,47,60]. In 1980, Masek and Paterson [66] used the Four Russians technique to show that if Σ is arbitrary or finite, the upper bound of the time complexity for LCS is O(mn log log n / log n) or O(mn / log n), respectively. This is still the theoretically fastest algorithm for the LCS problem. Using dynamic programming, the LCS problem can be solved easily in O(mn) time and space, where m and n are the lengths of the input strings. Many algorithms try to reduce the complexity for special inputs, such as a small number of matches, a short LCS length, a small alphabet and so on [9-13,27,28,44,61,65,91,94,134]. For example, Hunt and Szymanski's algorithm takes O(r log n) time and Hirschberg's

algorithm takes O(pn) time, where r is the number of matching pairs and p is the length of the LCS. When r and p are small, these two algorithms are faster than the traditional algorithm. However, in the worst case, r can be n^2 and p can be n, and these two algorithms then become worse than the traditional algorithm. In recent years, many algorithms have used bit-vectors to speed up LCS computation; these are word-dependent. They assume the computer can process one word at a time, where the length of a word is k (normally 32 or 64, depending on the machine). Hence, it takes only O(1) time to process a word whose length is at most k. The advantage of bit-vector algorithms is that they avoid being input-sensitive or output-sensitive. Under this assumption, many bit-vector algorithms have been proposed, and the LCS problem can be solved in O(nm/k) time [16,93]. In 2000, Crochemore et al. gave an O(n^2/k)-time and O(n/k)-space algorithm [38]. In summary, we list these algorithms in Table 2.1.

2.5 Linear Space Algorithms

When computer scientists analyze an algorithm, they usually focus on its time complexity and space complexity. Computing the LCS with Wagner and Fischer's algorithm requires O(mn) space, where m and n are the lengths of the input strings. This space requirement becomes infeasible if m and n are very large. In this section, we present linear space algorithms for the LCS problem. Hirschberg proposed a linear space version for the LCS problem [59]. The main idea of this algorithm is to find a break point of the LCS and use the divide and conquer method to reconstruct the LCS path. Given two strings A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m, where n and m are the lengths of A and B, respectively, ALGB(m, n, L) is a linear space algorithm which returns L, the last row of the original m x n LCS scoring matrix. The algorithm is as follows [59]:

Year  Author(s)                    Time Complexity                  Space Complexity
1974  Wagner and Fischer [129]     O(mn)                            O(mn)
1975  Hirschberg [59]              O(mn)                            O(n)
1977  Hunt and Szymanski [65]      O((n + r) log n)                 O(r + n)
1977  Hirschberg [61]              O(pn + n log n)                  O(pn)
1977  Hirschberg [61]              O(p(m - p) log n)                O((m - p)^2 + n)
1980  Masek and Paterson [66]      O(n max{1, m/log n})             O(n^2 / log n)
1982  Nakatsu et al. [94]          O(n(m - p))                      O(m^2)
1984  Hsu and Du [10, 62]          O(pm log(n/p) + pm)              O(pm)
1985  Ukkonen [125]                O(Em)                            O(E min{m, E})
1986  Apostolico [9]               O(n + m log n + D log(mn/D))     O(r + m)
1986  Myers [91]                   O(D(m + n))                      O(D(m + n))
1987  Kumar and Rangan [71]        O(n(m - p))                      O(n)
1987  Apostolico and Guerra [12]   O(pm + n)                        O(D + n)
1990  Chin and Poon [27]           O(n + min{D, pm})                O(D + n)
1992  Apostolico [11]              O(pm)                            O(n)
1992  Eppstein et al. [44]         O(n + D log log min{D, mn/D})    O(D + m)
1999  Rick [108]                   O(min{pm, p(n - p)})             O(n)

Table 2.1 Time and space complexity of algorithms computing LCS(A, B), where m = |A| and n = |B|, r is the number of matching pairs, p is the length of a longest common subsequence, E = m + n - 2p is the edit distance between A and B, D is the number of dominant matches, and k is the number of bits in a word. [102, 103, 108]

ALGC(m, n, A, B, C)
1. If the problem is trivial, solve it:
   if n = 0 then C = φ (φ is the empty string)
   else if m = 1 then
     if there exists j <= n such that A(1) = B(j) then C = A(1) else C = φ
2. Otherwise, split the problem:
   else begin
     i = floor(m/2);
3. Evaluate L(i, j) and L*(i, j) [j = 0, 1, ..., n]:
     ALGB(A_{1,i}, B_{1,n}, L1);
     ALGB(A_{m,i+1}, B_{n,1}, L2);
4. Find k such that L(i, k) + L*(i, k) = L(m, n):
     M = max_j {L1(j) + L2(n - j)};
     k = min j such that L1(j) + L2(n - j) = M;
5. Solve the simpler subproblems:
     ALGC(i, k, A_{1,i}, B_{1,k}, C1);
     ALGC(m - i, n - k, A_{i+1,m}, B_{k+1,n}, C2);
6. Give the output:
     C = C1 C2;
   end

The main idea of this algorithm lies in Step 3, which uses ALGB to find the break point. Step 3 can be described in more detail as follows:

1. Use ALGB(A_{1,floor(m/2)}, B_{1,n}) to calculate the last row L1 of the LCS matrix between a_1 a_2 ... a_{floor(m/2)} and b_1 b_2 ... b_n.
2. Use ALGB(A_{m,floor(m/2)+1}, B_{n,1}) to calculate the last row L2 of the LCS matrix between a_m a_{m-1} ... a_{floor(m/2)+1} and b_n b_{n-1} ... b_1.

3. Calculate L(k) = L1(k) + L2(n - k), where L_i(k) is the kth element of L_i and k = 1, 2, ..., n. The k that maximizes L(k) is the break point.

Take Figure 2.12 for example. Figures 2.12(a) and (b) compute L1 and L2, respectively. Afterwards, we compute the maximum of L1 + L2, which is 5. This means the LCS must pass through this region and the length of the LCS is 5. After we pick the break point, we can cut the original problem into two smaller subproblems. In Figure 2.13 and Figure 2.14, we can see that the computed area is reduced by using the break point.
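The whole linear space scheme, an ALGB-style last-row routine plus the ALGC-style divide and conquer, can be sketched as follows (ours; names are not the thesis's). The break point search is exactly Step 3 above:

```python
def lcs_last_row(A, B):
    """ALGB-style routine: last row of the LCS score matrix in O(|B|) space."""
    prev = [0] * (len(B) + 1)
    for a in A:
        cur = [0]
        for j, b in enumerate(B, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev

def hirschberg(A, B):
    """ALGC-style divide and conquer: one LCS of A and B in linear space."""
    if not A or not B:
        return ""
    if len(A) == 1:
        return A if A in B else ""
    i = len(A) // 2
    L1 = lcs_last_row(A[:i], B)            # forward half against B
    L2 = lcs_last_row(A[i:][::-1], B[::-1])  # reversed half against reversed B
    # Break point: the split of B maximizing L1(k) + L2(n - k).
    k = max(range(len(B) + 1), key=lambda j: L1[j] + L2[len(B) - j])
    return hirschberg(A[:i], B[:k]) + hirschberg(A[i:], B[k:])
```

Only two rows of the matrix are ever held at once, so the space usage is linear in |B| while the total time stays O(mn).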

Figure 2.7 Three possible regions in contour.

Figure 2.8 Steps 1 and 2 of Hirschberg's algorithm. θ ranges over the distinct symbols in B; NB[θ] counts the occurrences of θ in B.

Figure 2.9 Steps 3-9 of Hirschberg's algorithm when k = 1 and 2.

Figure 2.10 Steps 3-9 of Hirschberg's algorithm when k = 3.

Figure 2.11 Step 10 of Hirschberg's algorithm.

Figure 2.12 Calculation of break points.

Figure 2.13 The divide and conquer of the LCS algorithm. [37]

Figure 2.14 The recursive calls of the LCS algorithm. [37]

CHAPTER 3
Related Problems of LCS

Adding additional conditions to the LCS problem yields several LCS related problems. The cyclic string correction problem adds a condition on the input. The longest increasing subsequence problem adds a condition on the matching. The constrained longest common subsequence problem adds a condition on the output, in the form of a constraining sequence.

3.1 The Cyclic String Correction Problem

The linear string to string correction problem was defined by Wagner and Fischer in 1974 [129], and it can be solved by a well-known dynamic programming algorithm. Using this algorithm, we can obtain the edit distance for any prefixes of A and B in O(mn) time and O(mn) space, where m and n are the lengths of the input sequences A and B, respectively. In 1990, Maes defined the cyclic string correction problem, whose inputs are cyclic strings instead of linear ones [84]. We define a cyclic shift σ as a shift operation, written σ(a_1 a_2 ... a_n) = a_2 a_3 ... a_n a_1. Given A = a_1 a_2 ... a_n, a string of length n, let σ^k(A) denote the result of applying σ k times, which produces the shifted sequence a_{k+1} a_{k+2} ... a_n a_1 a_2 ... a_k. According to the above definition, it is clear that A = σ^n(A). Here we denote by [A] a cyclic string, which is the set composed of σ^1(A), σ^2(A), ..., σ^n(A). Given two cyclic strings [A] and [B], we can compute the edit distance

δ([A], [B]) = min{ δ(σ^k(A), σ^l(B)) | 1 <= k <= n = |A|, 1 <= l <= m = |B| },

where δ(A, B) is the edit distance of two strings A and B. In Chapter 4, we will show that the longest common subsequence problem is a special case of the edit distance problem. Therefore, any algorithm that solves the edit distance problem can be easily modified to solve the LCS problem. Hence, we take the cyclic longest common subsequence problem as an example to explain the cyclic string problem and show some of its properties. The cyclic longest common subsequence problem can be formalized as follows:

Definition: Given two cyclic strings A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m, n >= m, the cyclic longest common subsequence problem is to find LCS([A], [B]) = max{ LCS(σ^k(A), σ^l(B)) | 1 <= k <= n = |A|, 1 <= l <= m = |B| }, where LCS(A, B) is the length of a longest common subsequence of strings A and B.

Theorem 3.1 LCS([A], [B]) = l if and only if LCS(σ^s(A), B) = l for some s, 1 <= s <= n [84].

According to Theorem 3.1, we can reduce the cyclic longest common subsequence problem LCS([A], [B]) to LCS([A], B), and apply the original LCS algorithm to it. If we calculate LCS(σ^1(A), B), LCS(σ^2(A), B), ..., LCS(σ^n(A), B) one by one, it takes O(n^2 m) time. We can use (2n + 1) x (m + 1) space to store the LCS([A], B) scoring matrix; each LCS(σ^k(A), B) can then be found in the window of n rows starting at row k + 1. In Maes's algorithm, we compute only the values inside certain boundaries instead of all values in the whole windows.

Theorem 3.2 Let L_s and L_t be two LCS paths calculated from LCS(σ^s(A), B) and LCS(σ^t(A), B). If s is not equal to t, then L_s and L_t never cross each other [84].
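A direct sketch of this reduction (ours, not Maes's code): rotate A over all n shifts and keep the best linear LCS length, which is exactly the O(n^2 m) approach mentioned above.

```python
def lcs_length(A, B):
    """Linear LCS length in O(|A||B|) time and O(|B|) space."""
    prev = [0] * (len(B) + 1)
    for a in A:
        cur = [0]
        for j, b in enumerate(B, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def cyclic_lcs_length(A, B):
    """LCS([A], [B]): by Theorem 3.1, it suffices to rotate A only."""
    n = len(A)
    return max(lcs_length(A[s:] + A[:s], B) for s in range(n))
```

Maes's divide and conquer over the non-crossing paths of Theorem 3.2 brings the cost down from this O(n^2 m) baseline to O(nm log n).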

Figure 3.1 An example of transforming two cyclic strings.

Take Figure 3.2 as an example, where L_s is the LCS path of LCS(σ^s(A), B) and L_t is the LCS path of LCS(σ^t(A), B). Let e1 and e2 denote the bold and dashed paths between P_d and P_c, respectively. (1) If the number of matches on e1 is not less than that on e2, we can find another LCS path passing through e1 in LCS(σ^t(A), B) which is better than L_t. (2) Otherwise, we can find another LCS path passing through e2 in LCS(σ^s(A), B) which is better than L_s. Obviously, conditions (1) and (2) each lead to a contradiction. Therefore, the LCS paths of different windows never cross each other.

Figure 3.2 An example of LCS crossing.

Maes proposed an algorithm for solving this problem in O(nm log n) time [84] in 1990. According to Theorem 3.2, we know that any two LCS paths in different windows never cross. As shown in Figure 3.3, we first compute L_0 and L_n as boundaries; in fact, L_0 is the same path as L_n. Then we compute L_{n/2} by computing only the values of the area bounded by L_0, L_n and the horizontal lines at positions n/2 and 3n/2. With the new lower boundary L_{n/2}, we can find L_{n/4} by computing only the area bounded by L_0 and L_{n/2}. In this way, the divide and conquer method separates the original problem into smaller ones. In Figure 3.3, a_k is the kth character of A and the dashed lines are the boundaries of the LCS paths.

Because the LCS paths never cross each other, we only need to calculate the region between the boundaries of each window.

3.2 The Longest Increasing Subsequence Problem

The LIS (longest increasing subsequence) problem is well studied in computer science [23, 47] and also has some applications in biology [39, 109, 138]. The longest increasing subsequence problem is to find a longest increasing subsequence of a given sequence of distinct integers S = x_1 x_2 ... x_n. For example, 3578 and 478 may both be increasing subsequences of a sequence S; if 3578 is longer than 478 and no other increasing subsequence of S is longer, then 3578 is the longest increasing subsequence of S. The LIS problem can be formally defined as follows.

Definition: Given a sequence of n distinct comparable integers S = x_1 x_2 ... x_n, the LIS problem is to find a subsequence x_{i1} x_{i2} ... x_{ik} of S such that x_{i1} < x_{i2} < ... < x_{ik} and k is maximum.

If we have the sorted version of the input, then we can find the LIS simply with dynamic programming: treat the input as sequence A and the sorted input as sequence B; then LCS(A, B) gives the longest increasing subsequence of A. Figure 3.4 shows an example. Hunt and Szymanski [65] presented an algorithm which can solve the LIS problem in O(n log n) time, and its time complexity is reduced to O(n log log n) if the van Emde Boas data structure is used. Bespamyatnikh and Segal [23] also used the van Emde Boas data structure to find all longest increasing subsequences in O(n log log n) time.
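The reduction just described can be sketched in a few lines, together with the standard O(n log n) binary-search method that the cited van Emde Boas results refine. This is a hedged illustration assuming distinct integers; the function names are mine.

```python
from bisect import bisect_left

# Two hedged sketches for LIS over distinct integers. The first follows the
# reduction in the text: LIS(S) has the same length as LCS(S, sorted(S)).
# The second is the standard O(n log n) patience-style method.

def lis_via_lcs(s):
    """LIS length via LCS of s against its sorted copy (distinct values)."""
    t = sorted(s)
    m = len(t)
    prev = [0] * (m + 1)
    for x in s:
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            cur[j] = prev[j - 1] + 1 if x == t[j - 1] else max(prev[j], cur[j - 1])
        prev = cur
    return prev[m]

def lis_fast(s):
    """O(n log n): tails[k] is the smallest tail of an increasing
    subsequence of length k + 1 seen so far."""
    tails = []
    for x in s:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)
```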

Figure 3.3 The divide and conquer of Maes's algorithm. a_k is the kth character of A and the dashed line is the boundary of the LCS. Here, A = a_1 a_2 ... a_8; that is, a_i = a_{8+i} for 1 ≤ i ≤ 8.

Figure 3.4 Solving LIS with the LCS algorithm, where A is the input sequence and B is the sorted version of A; the LCS of A and B is the longest increasing subsequence in this case.

Year  Author(s)                      Time Complexity
1961  Schensted [110]                O(n log n)
1977  Hunt and Szymanski [65]        O(n log log n)
2000  Bespamyatnikh and Segal [23]   O(n log log n)

Table 3.1 The algorithms for the LIS problem.

Albert et al. [4] gave an algorithm to solve the window LIS problems. Given a sequence S = x_1 x_2 ... x_n and a window size w ≤ n, a window of S is defined as a substring x_{i+1} x_{i+2} ... x_{i+w}, where 0 ≤ i ≤ n − w. There are three kinds of problems, as follows: (1) Local Max Value: finding the length k of the longest increasing subsequence over all windows. (2) Local Max Sequence: finding the LIS in each window. (3) Global Max Sequence: finding the windows whose LIS is the longest among all windows. The algorithm of Albert et al. [4] takes O(n log log n + OUTPUT) time to solve these three problems. Given two comparable sequences S = x_1 x_2 ... x_n and T = y_1 y_2 ... y_m, Yang et al. [137] proposed an algorithm that solves the longest common increasing subsequence problem between two comparable sequences in O(mn) time and O(mn) space.
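The O(mn) bound for the longest common increasing subsequence (LCIS) can be illustrated with a common O(mn) formulation; this is a hedged sketch of a textbook variant, not necessarily Yang et al.'s exact recurrence, and the names are mine.

```python
# A hedged O(mn) sketch of the longest common increasing subsequence.
# f[j] = length of an LCIS of the processed prefix of s and of t[0..j]
# that ends exactly at t[j].

def lcis_len(s, t):
    f = [0] * len(t)
    for x in s:
        best = 0  # best LCIS length ending with a value < x, seen so far in t
        for j, y in enumerate(t):
            if x == y and best + 1 > f[j]:
                f[j] = best + 1   # extend a shorter LCIS with the match x == t[j]
            elif y < x and f[j] > best:
                best = f[j]       # t[j] could precede x in an increasing chain
    return max(f, default=0)
```

The single `best` accumulator is what collapses the naive O(mn^2) triple loop to O(mn).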

Definition: Given two strings A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m and a constraint sequence C = c_1 c_2 ... c_p, the constrained longest common subsequence problem (CLCS) is to find a longest common subsequence of A and B containing C as its subsequence.

In the original LCS problem, the longest common subsequence is optimal by itself. The CLCS problem, however, asks for a longest common subsequence containing a specific sequence, which may be shorter than the original LCS. For example, given the two strings CCT and CTC, the longest common subsequence is C. However, if we are additionally given the constraint sequence CC, then the constrained longest common subsequence is CC, which differs greatly from C.

Given two strings A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m and a constraint sequence C = c_1 c_2 ... c_p, let A_{q,r} = a_q a_{q+1} ... a_r and B_{s,t} = b_s b_{s+1} ... b_t be substrings of A and B, respectively, where 1 ≤ q ≤ r ≤ n and 1 ≤ s ≤ t ≤ m. Let LCS(q, r, s, t) denote the length of the LCS of A_{q,r} and B_{s,t}. For 1 ≤ k ≤ p, 1 ≤ i ≤ n and 1 ≤ j ≤ m, let LCS_k(i, j) denote the length of the LCS of A_{1,i} and B_{1,j} with a_i = b_j = c_k, such that the LCS contains c_1 c_2 ... c_k. Because the constraint sequence must appear in the output, each constraint character c_k can serve as a break point. Based on this idea, Tsai [123] uses the c_k's to separate the input strings into several parts and calculates the common subsequences among these parts with dynamic programming. The scoring function can be formalized as follows:

LCS(A, B) = max_{1 ≤ i ≤ n, 1 ≤ j ≤ m} { LCS_p(i, j) + LCS(i + 1, n, j + 1, m) };
LCS_1(i, j) = LCS(1, i − 1, 1, j − 1) + 1, where a_i = b_j = c_1;
LCS_k(i, j) = max_{1 ≤ x < i, 1 ≤ y < j} { LCS_{k−1}(x, y) + LCS(x + 1, i − 1, y + 1, j − 1) + 1 }, where a_i = b_j = c_k, 2 ≤ k ≤ p, 1 ≤ i ≤ n and 1 ≤ j ≤ m.

A: C C T
B: C T C

Figure 3.5 Possible break points on A = CCT and B = CTC with the constrained sequence C = CC.

Figure 3.5 shows an example of break points, where the two input strings are A = CCT and B = CTC, and the constrained sequence is C = CC. Here,

LCS(A, B) = LCS_3(3, 7) + LCS(4, 8, 8, 8) = LCS_3(3, 7) + 1,
LCS_3(3, 7) = LCS_2(2, 6) + 1,
LCS_2(2, 6) = LCS_1(1, 3) + 1,
LCS_1(1, 3) = 1.

Therefore, LCS(A, B) = 4. In summary, the algorithm takes O(p m^2 n^2) time.

Chin et al. [30] also followed the idea of break points and reduced the conditions as follows:

Case 1: If a_i = b_j = c_k, the CLCS score between a_1 a_2 ... a_i and b_1 b_2 ... b_j is equal to the CLCS score between a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_{j−1} plus the matching score. In addition, a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_{j−1} need contain only c_1 c_2 ... c_{k−1}.

Case 2: If a_i = b_j and a_i ≠ c_k, the CLCS score between a_1 a_2 ... a_i and b_1 b_2 ... b_j is equal to the CLCS score between a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_{j−1} plus the matching score. In addition, a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_{j−1} must contain c_1 c_2 ... c_k.

Case 3: If a_i ≠ b_j, the CLCS score between a_1 a_2 ... a_i and b_1 b_2 ... b_j may be equal to the CLCS score between a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_j (deleting a_i). In addition, a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_j must contain c_1 c_2 ... c_k.

Case 4: If a_i ≠ b_j, the CLCS score between a_1 a_2 ... a_i and b_1 b_2 ... b_j may be equal to the CLCS score between a_1 a_2 ... a_i and b_1 b_2 ... b_{j−1} (deleting b_j). In addition, a_1 a_2 ... a_i and b_1 b_2 ... b_{j−1} must contain c_1 c_2 ... c_k.

We can formalize the above cases as follows:

L_0(i, 0) = 0, 0 ≤ i ≤ n;
L_0(0, j) = 0, 0 ≤ j ≤ m;
L_0(i, j) = max{ L_0(i − 1, j), L_0(i, j − 1), L_0(i − 1, j − 1) + 1 if a_i = b_j }, where 1 ≤ i ≤ n, 1 ≤ j ≤ m;
L_k(i, 0) = −∞, 0 ≤ i ≤ n, 1 ≤ k ≤ p;
L_k(0, j) = −∞, 0 ≤ j ≤ m, 1 ≤ k ≤ p;
L_k(i, j) = max{ L_k(i − 1, j), L_k(i, j − 1), L_k(i − 1, j − 1) + 1 if a_i = b_j, L_{k−1}(i − 1, j − 1) + 1 if a_i = b_j = c_k }, where 1 ≤ i ≤ n, 1 ≤ j ≤ m, 1 ≤ k ≤ p.

Take Figure 3.6 for example. We first compute the original scoring matrix L_0 and find the break point at L_0(1, 1). In L_1(3, 2), a_3 = C and b_2 form a mismatch, so L_1(3, 2) = max{L_1(2, 2), L_1(3, 1)} = 1. In L_1(3, 3), a_3 = C and b_3 = C form a match, so L_1(3, 3) = L_0(2, 2) + 1 = 2. Following the recurrence, we compute the p scoring matrices and use the tracing back technique to find

the constrained longest common subsequence. We also give an example of tracing back in Figure 3.7.
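The L_k recurrence above runs in O(pmn) time, and can be sketched directly as a dynamic program. This is a hedged illustration: the name clcs_len and the NEG sentinel (standing in for −∞) are mine, not from the thesis.

```python
# A hedged O(pmn) sketch of the CLCS dynamic program in the spirit of the
# recurrence above. Layer k forces c[0..k-1] to be embedded in the answer.

NEG = float("-inf")  # -infinity boundary: "c[0..k-1] not yet embedded"

def clcs_len(a, b, c):
    n, m, p = len(a), len(b), len(c)
    # L[k][i][j]: best common-subsequence length over a[0..i-1], b[0..j-1]
    # that contains c[0..k-1]; borders are 0 for k = 0 and NEG otherwise.
    L = [[[0 if k == 0 else NEG] * (m + 1) for _ in range(n + 1)]
         for k in range(p + 1)]
    for k in range(p + 1):
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                best = max(L[k][i - 1][j], L[k][i][j - 1])   # Cases 3 and 4
                if a[i - 1] == b[j - 1]:
                    best = max(best, L[k][i - 1][j - 1] + 1)  # Case 2
                    if k > 0 and a[i - 1] == c[k - 1]:        # Case 1
                        best = max(best, L[k - 1][i - 1][j - 1] + 1)
                L[k][i][j] = best
    return L[p][n][m]  # NEG means no common subsequence contains c
```

A result of NEG signals that no common subsequence of a and b contains c at all.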


Figure 3.6 The result obtained from the two input sequences CCT and CTC with constrained sequence CC. Here, L_k denotes the scoring matrix of an LCS which contains the constrained sequence c_1 c_2 ... c_k.


Figure 3.7 The tracing path obtained from sequences CCT and CTC with constrained sequence CC.

CHAPTER 4

The Edit Distance Problem

The edit distance is widely used in many areas, such as computational biology, speech recognition and text retrieval [7, 40, 54, 92, 96, 109, 111, 127]. In this chapter, we introduce the Levenshtein edit distance in Section 4.1. There are many different kinds of operations in the edit distance problem, and the edit distance has different complexities under different sets of operations. In Section 4.2, we introduce some of these operations and list their complexities.

4.1 The Edit Distance Problem

The Levenshtein edit distance problem was first proposed by Levenshtein in 1965 [78, 79]. A few years later, Wagner and Fischer presented some definitions and properties in their paper [129]; we follow their definitions in this section. Let A = a_1 a_2 ... a_n be a sequence over an alphabet Σ, and let A_{i,j} = a_i a_{i+1} ... a_j, where 1 ≤ i ≤ j ≤ n. Let Λ denote the null string, and let A_{i,j} = Λ if i > j. An edit operation s is an ordered pair (a, b) ≠ (φ, φ) of characters, indicating a → b, where φ denotes the null character. An edit operation (a, b) is called a deletion if b = φ, an insertion if a = φ, and a replacement if a ≠ φ and b ≠ φ. An edit sequence S = s_1 s_2 ... s_m is a sequence of edit operations. We can view S as a sequence of strings A_0, A_1, ..., A_m such that A = A_0, B = A_m, and A_{i−1} is transformed into A_i via s_i for 1 ≤ i ≤ m. In short, we can use s_1 s_2 ... s_m to change string A into string B.

Let γ(s) denote the cost of an edit operation s. For an edit sequence S, γ(S) = Σ_{i=1}^{m} γ(s_i). We define the edit distance δ(A, B) from string A to string B as the minimal cost of transforming A into B. Formally, δ(A, B) = min{ γ(S) : S is an edit sequence used for transforming A into B }.

Let A and B be strings of lengths n and m, respectively, and let D(i, j) = δ(A_{1,i}, B_{1,j}) for 1 ≤ i ≤ n, 1 ≤ j ≤ m. We have

D(i, j) = min{ D(i − 1, j − 1) + γ(a_i → b_j), D(i − 1, j) + γ(a_i → φ), D(i, j − 1) + γ(φ → b_j) },

where D(i, j) is the optimal edit distance between the prefixes A_{1,i} = a_1 a_2 ... a_i and B_{1,j} = b_1 b_2 ... b_j. The last paired characters must fall into one of the following cases:

Case 1: Delete a_i. The edit distance between a_1 a_2 ... a_i and b_1 b_2 ... b_j is equal to the edit distance between a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_j plus the deletion cost.

Case 2: Insert b_j. The edit distance between a_1 a_2 ... a_i and b_1 b_2 ... b_j is equal to the edit distance between a_1 a_2 ... a_i and b_1 b_2 ... b_{j−1} plus the insertion cost.

Case 3: Replace a_i with b_j. The edit distance between a_1 a_2 ... a_i and b_1 b_2 ... b_j is equal to the edit distance between a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_{j−1} plus the replacement cost.

Because each D(i, j) takes the minimum over these three cases, D(i, j) is the minimal cost of transforming A_{1,i} into B_{1,j}.

Let T be a trace from A to B, where T is any set of integer pairs (i, j) satisfying:

1. 1 ≤ i ≤ n and 1 ≤ j ≤ m.
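The recurrence for D(i, j) can be sketched with unit costs, i.e. γ = 1 for each insertion, deletion, and replacement of distinct characters, and 0 for a match. The function name is illustrative.

```python
# A minimal sketch of the unit-cost Levenshtein recurrence above.

def edit_distance(a: str, b: str) -> int:
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i          # delete all of a[0..i-1]
    for j in range(m + 1):
        D[0][j] = j          # insert all of b[0..j-1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            repl = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + repl,  # replace / match
                          D[i - 1][j] + 1,         # delete a_i
                          D[i][j - 1] + 1)         # insert b_j
    return D[n][m]
```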


More information

Lecture 5: September Time Complexity Analysis of Local Alignment

Lecture 5: September Time Complexity Analysis of Local Alignment CSCI1810: Computational Molecular Biology Fall 2017 Lecture 5: September 21 Lecturer: Sorin Istrail Scribe: Cyrus Cousins Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

More information

Computer Science & Engineering 423/823 Design and Analysis of Algorithms

Computer Science & Engineering 423/823 Design and Analysis of Algorithms Computer Science & Engineering 423/823 Design and Analysis of Algorithms Lecture 03 Dynamic Programming (Chapter 15) Stephen Scott and Vinodchandran N. Variyam sscott@cse.unl.edu 1/44 Introduction Dynamic

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Lecture : p he biological problem p lobal alignment p Local alignment p Multiple alignment 6 Background: comparative genomics p Basic question in biology: what properties

More information

A Universal Turing Machine

A Universal Turing Machine A Universal Turing Machine A limitation of Turing Machines: Turing Machines are hardwired they execute only one program Real Computers are re-programmable Solution: Universal Turing Machine Attributes:

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Brute-Force Pattern Matching ( 11.2.1) The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift

More information

Multiple Sequence Alignment (MAS)

Multiple Sequence Alignment (MAS) Multiple Sequence lignment (MS) Group-to-group alignments Steven driaensen & Ken Tanaka References Osamu Goto Optimal lignment between groups of sequences and its application to multiple sequence alignment

More information

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer Algorithm Theory 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore Institut für Informatik Wintersemester 2007/08 Text Search Scenarios Static texts Literature databases Library systems Gene databases

More information

2. Exact String Matching

2. Exact String Matching 2. Exact String Matching Let T = T [0..n) be the text and P = P [0..m) the pattern. We say that P occurs in T at position j if T [j..j + m) = P. Example: P = aine occurs at position 6 in T = karjalainen.

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 410 (2009) 2759 2766 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Note Computing the longest topological

More information

General Methods for Algorithm Design

General Methods for Algorithm Design General Methods for Algorithm Design 1. Dynamic Programming Multiplication of matrices Elements of the dynamic programming Optimal triangulation of polygons Longest common subsequence 2. Greedy Methods

More information

Lecture 3: String Matching

Lecture 3: String Matching COMP36111: Advanced Algorithms I Lecture 3: String Matching Ian Pratt-Hartmann Room KB2.38: email: ipratt@cs.man.ac.uk 2017 18 Outline The string matching problem The Rabin-Karp algorithm The Knuth-Morris-Pratt

More information

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas n Introduction to Bioinformatics lgorithms Multiple lignment Slides revised and adapted to Bioinformática IS 2005 na eresa Freitas n Introduction to Bioinformatics lgorithms Outline Dynamic Programming

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille IT University of Copenhagen Rolf Fagerberg University of Southern Denmark Inge Li Gørtz

More information

Chapter 6. Weighted Interval Scheduling. Dynamic Programming. Algorithmic Paradigms. Dynamic Programming Applications

Chapter 6. Weighted Interval Scheduling. Dynamic Programming. Algorithmic Paradigms. Dynamic Programming Applications lgorithmic Paradigms hapter Dynamic Programming reedy. Build up a solution incrementally, myopically optimizing some local criterion. Divide-and-conquer. Break up a problem into sub-problems, solve each

More information

Sequence Comparison. mouse human

Sequence Comparison. mouse human Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity

More information

Computational Biology Lecture 5: Time speedup, General gap penalty function Saad Mneimneh

Computational Biology Lecture 5: Time speedup, General gap penalty function Saad Mneimneh Computational Biology Lecture 5: ime speedup, General gap penalty function Saad Mneimneh We saw earlier that it is possible to compute optimal global alignments in linear space (it can also be done for

More information

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:

Pairwise alignment, Gunnar Klau, November 9, 2005, 16: Pairwise alignment, Gunnar Klau, November 9, 2005, 16:36 2012 2.1 Growth rates For biological sequence analysis, we prefer algorithms that have time and space requirements that are linear in the length

More information

Dynamic Programming: Shortest Paths and DFA to Reg Exps

Dynamic Programming: Shortest Paths and DFA to Reg Exps CS 374: Algorithms & Models of Computation, Fall 205 Dynamic Programming: Shortest Paths and DFA to Reg Exps Lecture 7 October 22, 205 Chandra & Manoj (UIUC) CS374 Fall 205 / 54 Part I Shortest Paths with

More information

Efficient Approximation of Large LCS in Strings Over Not Small Alphabet

Efficient Approximation of Large LCS in Strings Over Not Small Alphabet Efficient Approximation of Large LCS in Strings Over Not Small Alphabet Gad M. Landau 1, Avivit Levy 2,3, and Ilan Newman 1 1 Department of Computer Science, Haifa University, Haifa 31905, Israel. E-mail:

More information

Global alignments - review

Global alignments - review Global alignments - review Take two sequences: X[j] and Y[j] M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] 2 M[i-1, j] 2 The best alignment for X[1 i] and Y[1 j] is called M[i, j] X[j] Initiation: M[,]= pply

More information

String Range Matching

String Range Matching String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings

More information

Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions

Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions Copyright c 2015 Andrew Klapper. All rights reserved. 1. For the functions satisfying the following three recurrences, determine which is the

More information

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler STATC141 Spring 2005 The materials are from Pairise Sequence Alignment by Robert Giegerich and David Wheeler Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (I) Sequence similarity

More information

arxiv: v1 [cs.ds] 9 Apr 2018

arxiv: v1 [cs.ds] 9 Apr 2018 From Regular Expression Matching to Parsing Philip Bille Technical University of Denmark phbi@dtu.dk Inge Li Gørtz Technical University of Denmark inge@dtu.dk arxiv:1804.02906v1 [cs.ds] 9 Apr 2018 Abstract

More information

On the Exponent of the All Pairs Shortest Path Problem

On the Exponent of the All Pairs Shortest Path Problem On the Exponent of the All Pairs Shortest Path Problem Noga Alon Department of Mathematics Sackler Faculty of Exact Sciences Tel Aviv University Zvi Galil Department of Computer Science Sackler Faculty

More information

Divide and Conquer Algorithms. CSE 101: Design and Analysis of Algorithms Lecture 14

Divide and Conquer Algorithms. CSE 101: Design and Analysis of Algorithms Lecture 14 Divide and Conquer Algorithms CSE 101: Design and Analysis of Algorithms Lecture 14 CSE 101: Design and analysis of algorithms Divide and conquer algorithms Reading: Sections 2.3 and 2.4 Homework 6 will

More information

Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027

Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027 Sequence Alignment Evolution CT Amemiya et al. Nature 496, 311-316 (2013) doi:10.1038/nature12027 Evolutionary Rates next generation OK OK OK X X Still OK? Sequence conservation implies function Alignment

More information

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007. Proofs, Strings, and Finite Automata CS154 Chris Pollett Feb 5, 2007. Outline Proofs and Proof Strategies Strings Finding proofs Example: For every graph G, the sum of the degrees of all the nodes in G

More information

CSCI Honor seminar in algorithms Homework 2 Solution

CSCI Honor seminar in algorithms Homework 2 Solution CSCI 493.55 Honor seminar in algorithms Homework 2 Solution Saad Mneimneh Visiting Professor Hunter College of CUNY Problem 1: Rabin-Karp string matching Consider a binary string s of length n and another

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Friday Four Square! Today at 4:15PM, Outside Gates

Friday Four Square! Today at 4:15PM, Outside Gates P and NP Friday Four Square! Today at 4:15PM, Outside Gates Recap from Last Time Regular Languages DCFLs CFLs Efficiently Decidable Languages R Undecidable Languages Time Complexity A step of a Turing

More information

Dynamic Programming. Prof. S.J. Soni

Dynamic Programming. Prof. S.J. Soni Dynamic Programming Prof. S.J. Soni Idea is Very Simple.. Introduction void calculating the same thing twice, usually by keeping a table of known results that fills up as subinstances are solved. Dynamic

More information

Small-Space Dictionary Matching (Dissertation Proposal)

Small-Space Dictionary Matching (Dissertation Proposal) Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length

More information

String Search. 6th September 2018

String Search. 6th September 2018 String Search 6th September 2018 Search for a given (short) string in a long string Search problems have become more important lately The amount of stored digital information grows steadily (rapidly?)

More information

Turing s thesis: (1930) Any computation carried out by mechanical means can be performed by a Turing Machine

Turing s thesis: (1930) Any computation carried out by mechanical means can be performed by a Turing Machine Turing s thesis: (1930) Any computation carried out by mechanical means can be performed by a Turing Machine There is no known model of computation more powerful than Turing Machines Definition of Algorithm:

More information

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

More information

Finding all covers of an indeterminate string in O(n) time on average

Finding all covers of an indeterminate string in O(n) time on average Finding all covers of an indeterminate string in O(n) time on average Md. Faizul Bari, M. Sohel Rahman, and Rifat Shahriyar Department of Computer Science and Engineering Bangladesh University of Engineering

More information

Computing the Longest Common Subsequence of Multiple RLE Strings

Computing the Longest Common Subsequence of Multiple RLE Strings The 29th Workshop on Combinatorial Mathematics an Computation Theory Computing the Longest Common Subsequence of Multiple RLE Strings Ling-Chih Yao an Kuan-Yu Chen Grauate Institute of Networking an Multimeia

More information