A Survey of the Longest Common Subsequence Problem and Its Related Problems



A Survey of the Longest Common Subsequence Problem and Its Related Problems

A Thesis Submitted to the Faculty of the Department of Computer Science and Engineering, National Sun Yat-sen University, by Jian-Bein Chen, in Partial Fulfillment of the Requirements for the Degree of Master of Science, June 29, 2005


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ABSTRACT
Chapter 1. Introduction
Chapter 2. The Longest Common Subsequence Problem
  2.1 The Longest Common Subsequence Problem
  2.2 Hunt and Szymanski's Algorithm
  2.3 Hirschberg's Algorithm
  2.4 Some Algorithms of LCS
  2.5 Linear Space Algorithms
Chapter 3. Related Problems of LCS
  3.1 The Cyclic String Correction Problem
  3.2 The Longest Increasing Subsequence Problem
  3.3 The Constrained Longest Common Subsequence Problem
Chapter 4. The Edit Distance Problem
  4.1 The Edit Distance Problem
  4.2 Other Operations in the Edit Distance Problem
Chapter 5. Sequence Alignment
  5.1 Global Alignment

  5.2 Local Alignment
  5.3 Affine Gap Penalty
  5.4 Multiple Sequences Alignment
Chapter 6. Conclusion
BIBLIOGRAPHY

LIST OF FIGURES

2.1 LCS subsequences
2.2 An example of LCS of two strings. (A) Each entry L(i, j) = k means there is an LCS of length k along the path from (i, j) to (0, 0). (B) Three possible directions of the tracking back path
2.3 Steps 1 and 2 of Hunt and Szymanski's algorithm
2.4 Step 3 of Hunt and Szymanski's algorithm
2.5 Step 4 of Hunt and Szymanski's algorithm
2.6 Contours in the scoring matrix
2.7 Three possible regions in contour
2.8 Steps 1 and 2 of Hirschberg's algorithm
2.9 Steps 3-9 of Hirschberg's algorithm when k = 1 and 2
2.10 Steps 3-9 of Hirschberg's algorithm when k = 3
2.11 Step 10 of Hirschberg's algorithm
2.12 Calculation of break points
2.13 The divide and conquer of the LCS algorithm [37]
2.14 The recursive calls of the LCS algorithm [37]
3.1 An example of transforming two cyclic strings
3.2 An example of LCS crossing

3.3 The divide and conquer of Maes's algorithm
3.4 Solving LIS with the LCS algorithm
3.5 Possible break points for two input sequences with a constrained sequence
3.6 The scoring matrices obtained from two input sequences with a constrained sequence
3.7 The tracing path obtained from two input sequences with a constrained sequence
4.1 An example of finding the edit distance between two strings
5.1 The optimal score of sequence alignment obtained by dynamic programming
5.2 The optimal alignment obtained by the tracing back approach
5.3 The optimal alignment obtained by dynamic programming
5.4 An illustration of ant behavior: as pheromone accumulates on the shorter path, more ants choose it

LIST OF TABLES

2.1 Time and space complexity of algorithms computing LCS(A, B)
3.1 The algorithms for the LIS problem
4.1 Some algorithms for sorting by reversals
4.2 Some algorithms for sorting by transpositions
5.1 Some algorithms of multiple sequence alignment
5.2 The ambiguous symbols of IUPAC

ABSTRACT

The longest common subsequence (LCS) problem is a classical problem in both combinatorial optimization and computational biology. During the past few decades, a considerable number of studies have focused on LCS and its related problems. In this thesis, we shall present a complete survey of LCS and its related problems, and review some efficient algorithms for solving them. We shall also give the definition of each problem, present some theorems, and illustrate the time and space complexities of the algorithms for solving these problems.

CHAPTER 1
Introduction

Since the LCS (longest common subsequence) problem was proposed in 1974 by Wagner and Fischer [129], much ink has been spent on this subject [8,13,21,22,25,33,38,51,72,91,93,94,97,138]. Until now, LCS has remained a hot research topic in many specialized fields, including computer science, molecular biology, mathematics, speech recognition, gas chromatography, bird song analysis and so on. In different areas, the LCS problem takes slightly different definitions. The purpose of this thesis is to list the problems derived from LCS, and to enumerate some well-known algorithms for each of them.

The purpose of LCS algorithms is to find all matching points and dominant points quickly and efficiently. LCS algorithms can be divided into two categories: one processes the scoring matrix row by row, and the other proceeds contour by contour [61,65]. In different applications, there are many LCS related problems, such as the LIS (longest increasing subsequence) problem [23,47,137], the cyclic string correction problem [26,49,73,86,100,111,122], the longest arc-preserving common subsequence problem [67,68,81,82] and the constrained longest common subsequence problem [30,104,123]. With some properties of the LCS, we can find efficient ways to solve these related problems.

The edit distance is widely used in many areas, such as computational biology, speech recognition and text retrieval [7,40,54,92,96,109,111,127]. The Levenshtein edit distance has three edit operations: insertion, deletion and replacement [78]. The LCS is a special case of edit distance with only two operations, insertion and deletion. Based on this idea, some other edit operations have also been defined, such as transpose, reversal, move and so on [36,70,90].

Sequence alignment is an important problem in computer science, and it can be divided into pairwise alignment and multiple sequence alignment [25,39,46,74,75,106,131,134]. When we construct an alignment, there are two different strategies. One is global alignment [6,13,14,135], which aligns the full strings. The other is local alignment [34,87,98,120,121,126], which emphasizes substrings of the input strings. In biology, mutations usually occur in successive regions, rather than at random positions in the genome. Therefore, we also have the alignment problem with affine gap penalties. Multiple sequence alignment (MSA) algorithms can be classified into exact, progressive and iterative algorithms. To speed up, many algorithms for multiple sequence alignment are based on pairwise alignment [24,29,48,139].

The rest of this thesis is organized as follows. In Chapter 2, we introduce some well-known dynamic programming algorithms which solve the LCS problem efficiently. We also give two different approaches to the LCS problem: Hunt and Szymanski's algorithm [65] in Section 2.2, and Hirschberg's algorithm [61] in Section 2.3. We discuss some other LCS algorithms and give the upper bound for the time complexity of LCS [11,52,53,60,85,108,128,132] in Section 2.4. We give some linear space algorithms in Section 2.5. In Section 3.1, we introduce the cyclic string correction problem and reduce it to the cyclic longest common subsequence problem. We give some LIS algorithms in Section 3.2. In Section 3.3, we introduce the constrained longest common subsequence problem and two algorithms for it [30,104,123].
In Chapter 4, we give some properties of the Levenshtein edit distance problem and list the complexities of the edit distance problem with various edit operations. In Chapter 5, we give a brief introduction to sequence alignment. We introduce global alignment and local alignment in Sections 5.1 and 5.2, respectively. Pairwise alignment with affine gap penalties is introduced in Section 5.3. We present three kinds of multiple sequence alignment algorithms and some biological programs in Section 5.4. In Chapter 6, we list some open problems on LCS and its related problems, and also give a conclusion of this thesis.

CHAPTER 2
The Longest Common Subsequence Problem

The LCS problem was defined by Wagner and Fischer in 1974 [129]. Since then, many algorithms have been proposed [8,13,21,22,25,33,38,51,64,72,91,93,94,97,136,138]. These algorithms can be categorized into those that process the scoring matrix row by row and those that proceed contour by contour. We will introduce these algorithms one by one, according to their categories.

2.1 The Longest Common Subsequence Problem

Let A and B be two strings, where A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m, n >= m. A subsequence is obtained by deleting zero or more (not necessarily consecutive) characters from the original string. A common subsequence of A and B is a subsequence of both A and B. A longest common subsequence (LCS) of A and B is a common subsequence of maximum length. Generally speaking, the LCS may not be unique. For example, Figure 2.1 shows two common subsequences of two strings; the longer of the two is not beaten by any other common subsequence, so it is an LCS of the two strings. The LCS problem can be defined as follows:

Definition: Given two strings A and B, where A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m, n >= m, the LCS problem is defined as finding an LCS of A and B.

Figure 2.1 LCS subsequences.

The first kind of LCS algorithms builds the scoring matrix row by row, usually by dynamic programming. In 1970, Needleman and Wunsch [98] proposed a well-known method which uses dynamic programming to calculate an (m + 1) x (n + 1) matrix. Let A_{1,i} denote a_1 a_2 ... a_i, 1 <= i <= n, and B_{1,j} denote b_1 b_2 ... b_j, 1 <= j <= m. Let L(i, j) denote the length of the LCS of A_{1,i} and B_{1,j}. Each entry L(i, j) falls into one of three cases:

Case 1: Delete a_i. The LCS of a_1 a_2 ... a_i and b_1 b_2 ... b_j is the same as the LCS of a_1 a_2 ... a_{i-1} and b_1 b_2 ... b_j.

Case 2: Delete b_j. The LCS of a_1 a_2 ... a_i and b_1 b_2 ... b_j is the same as the LCS of a_1 a_2 ... a_i and b_1 b_2 ... b_{j-1}.

Case 3: Match a_i with b_j. When a_i = b_j, the LCS of a_1 a_2 ... a_i and b_1 b_2 ... b_j is the LCS of a_1 a_2 ... a_{i-1} and b_1 b_2 ... b_{j-1} plus the matching score 1.

Therefore, L(i, j) reduces to one of the three values L(i-1, j), L(i, j-1) and L(i-1, j-1) + 1, and each of these can in turn be reduced in the same way to subproblems of smaller size. Finally, the problem reduces to the initial states L(i, 0), L(0, j) or L(0, 0), whose scores are all 0. We can calculate L(i, j) formally as follows:

L(i, 0) = 0, for i >= 0
L(0, j) = 0, for j >= 0
L(i, j) = max{ L(i-1, j), L(i, j-1), L(i-1, j-1) + 1 if a_i = b_j }, for 1 <= i <= n, 1 <= j <= m.

The advantages of this algorithm are that it is easy to implement and easy to trace back. It is clear that the time complexity is O(mn). In Figure 2.2, we use the above scoring function to fill the matrix row by row, and use the matching pairs to trace back the LCS.

Figure 2.2 An example of LCS of two strings. (A) Each entry L(i, j) = k means there is an LCS of length k along the path from (i, j) to (0, 0). (B) Three possible directions of the tracking back path.
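The recurrence above can be implemented directly. The following is a minimal sketch (function and variable names are ours, not the thesis's), which fills the matrix row by row and then recovers one LCS by the usual traceback:

```python
def lcs(A, B):
    """One LCS of A and B via the O(mn) dynamic program."""
    n, m = len(A), len(B)
    # L[i][j] = length of an LCS of A[:i] and B[:j]
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if A[i - 1] == B[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Trace back from (n, m) to (0, 0), collecting matched symbols.
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if A[i - 1] == B[j - 1]:
            out.append(A[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

When several LCSs exist, the traceback returns one of them, depending on how ties between L(i-1, j) and L(i, j-1) are broken.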

In the figure, we can note that the score can be increased only by a match. Based on this idea, there are two different approaches to finding the LCS. First, Hunt and Szymanski's algorithm [65] uses dynamic programming to find the match positions. The second algorithm was proposed by Hirschberg [61], and it uses the divide and conquer strategy to find the match positions. In the following two sections, we introduce these two algorithms in detail.

2.2 Hunt and Szymanski's Algorithm

Hunt and Szymanski [65] perform preprocessing on the input strings to find all matches between them first. With the information obtained from the preprocessing, the matching positions can be computed more quickly. The algorithm can be divided into four steps, and its time complexity is O((r + n) log n), where n is the length of the two input strings and r is the total number of matches between them. Let LCS(A, B) denote the length of the LCS of strings A and B. The key data structure in this algorithm is the array of threshold values T_{i,k}, where T_{i,k} is the least j such that A_{1,i} and B_{1,j} have a common subsequence of length k. For example, in Figure 2.3, T[4, 3] = 6 means that LCS(A_{1,4}, B_{1,6}) = 3, but LCS(A_{1,4}, B_{1,5}) <= 2: the LCS length of A_{1,4} and B_{1,6} is 3, while the LCS length of A_{1,4} and B_{1,5} is only 2.

Hunt and Szymanski's Algorithm
Input: Two strings A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_n.
Output: The longest common subsequence of A and B.

element array A[1:n], B[1:n];
integer array THRESH[0:n];
list array MATCHLIST[1:n];
pointer array LINK[1:n];
pointer PTR;

1. for i = 1 step 1 until n do
     set MATCHLIST[i] = <j_1, j_2, ..., j_p> such that j_1 > j_2 > ... > j_p and A[i] = B[j_q] for 1 <= q <= p;

2. THRESH[0] = 0;
   for i = 1 step 1 until n do THRESH[i] = n + 1;
   LINK[0] = null;

3. for i = 1 step 1 until n do
     for j on MATCHLIST[i] do
       begin
         find k such that THRESH[k - 1] < j <= THRESH[k];
         if j < THRESH[k] then
           begin
             THRESH[k] = j;
             LINK[k] = newnode(i, j, LINK[k - 1]);
           end;
       end;

4. k = largest k such that THRESH[k] is not n + 1;
   PTR = LINK[k];
   while PTR is not null do
     begin
       print the (i, j) pair pointed to by PTR;
       advance PTR;
     end

Here we denote by (i, j) a general pair, and by <i, j> a matching pair, i.e., a pair with a_i = b_j. In Step 1, for each i we record, in decreasing order, the j-values at which a_i matches in B. To build these lists, we attach to each symbol of B its index j and sort the resulting list by symbol in lexical order, which takes O(n log n) time. Any symbol can then be located in the sorted list by binary search in O(log n) time, and in the worst case we search for n distinct symbols, which takes O(n log n) time. Hence Step 1 requires O(n log n) time and O(n) space. Step 2 initializes the THRESH array in O(n) time, filling all entries with n + 1, which represents infinity. In the example of Figure 2.3, the symbol a_2 appears in columns 5 and 8 of B; hence MATCHLIST[2] is [8, 5], recording that <2, 5> and <2, 8> are matching pairs.

In Step 3, two nested for-loops process all matching pairs <i, j>, with i increasing and, for each i, j decreasing; hence we execute r iterations in total. In each iteration, we update the THRESH structure row by row. Replacing THRESH[k] by j means that, among all common subsequences of length k found so far, j is the leftmost ending position. For example, in Figure 2.4, when j runs over MATCHLIST[2], we first place 8 in the entry for length 2, because THRESH[1] < 8 < infinity, and link it to the entry for length 1; this records a common subsequence of length 2 ending at column 8. Then the second element 5 of MATCHLIST[2] satisfies THRESH[1] < 5 < 8, so 5 replaces 8, meaning that a better (smaller) threshold has been found. Since Step 3 considers all matches in the scoring matrix, it requires O(n + r log n) time and at most O(r) list nodes. In the final step, we take the largest filled column of THRESH and trace the LCS back through the linked lists, which takes at most O(n) time. In 1987, Apostolico and Guerra [12] used the idea of dominant matches to reduce the number of matches which need to be considered.
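The threshold mechanism of Steps 1-3 can be sketched compactly as follows. This is our reconstruction, computing only the LCS length: Python's bisect stands in for the search for k in Step 3, and visiting each row's matches in decreasing j order plays the role of MATCHLIST.

```python
from bisect import bisect_left

def hunt_szymanski_length(A, B):
    """LCS length via Hunt and Szymanski's threshold method (sketch)."""
    # Step 1: positions of each symbol in B, in decreasing order.
    occ = {}
    for j, c in enumerate(B):
        occ.setdefault(c, []).append(j)
    for c in occ:
        occ[c].reverse()
    # Steps 2-3: thresh[k] = smallest j such that some common subsequence
    # of length k + 1 of the processed prefix of A ends at B[j].
    thresh = []
    for c in A:
        for j in occ.get(c, ()):  # j decreasing, so one row adds at most one to each chain
            k = bisect_left(thresh, j)
            if k == len(thresh):
                thresh.append(j)
            elif j < thresh[k]:
                thresh[k] = j
    return len(thresh)  # Step 4: largest k with a finite threshold
```

Recovering the subsequence itself requires the LINK lists of the pseudocode above; this sketch omits them for brevity.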

Figure 2.3 Steps 1 and 2 of Hunt and Szymanski's algorithm.

Figure 2.4 Step 3 of Hunt and Szymanski's algorithm.

Figure 2.5 Step 4 of Hunt and Szymanski's algorithm.

2.3 Hirschberg's Algorithm

The second kind of approach is the contour-based approach, which was proposed by Hirschberg in 1977 [61]. Before explaining the algorithm, we first give some definitions and properties of contours. Here we denote by <i, j> a position where a_i = b_j, and by L(i, j) the length of the LCS of A_{1,i} and B_{1,j}. We call <i, j> a k-candidate if a_i = b_j and L(i, j) = k. According to Theorem 2.1, among all k-candidates, we need only consider those which dominate other k-candidates and are not dominated by any other k-candidate. These are called minimal k-candidates. In Figure 2.6, all circled positions are k-candidates, and the bold ones are minimal k-candidates. For example, <1, 2>, <1, 4>, <1, 7>, <3, 1> and <7, 1> are all 1-candidates of the two strings in the figure. It is clear that <1, 2> and <3, 1> are both minimal 1-candidates, because they dominate all other 1-candidates except each other. Therefore, all 2-candidates can be obtained from <1, 2> and <3, 1>. The regions of the matrix separated by dashed lines are called contours; each contour contains the k-candidates for one value of k, and contours never cross each other. Moreover, minimal k-candidates must lie on the corners of a contour, because these are the positions that can never be dominated. Hence, one can transform the LCS problem into finding the sets of minimal k-candidates. If we have found all minimal k-candidates and there is no (k + 1)-candidate, then the length of the LCS is k, and an LCS path must exist within these sets.

Theorem 2.1 [61] Let A = <x_1, y_1> and B = <x_2, y_2> be two k-candidates. If x_1 <= x_2 and y_1 <= y_2, then we say that A dominates B (B is a superfluous k-candidate), since any (k + 1)-candidate generated by B can also be generated by A. Thus, from the set of k-candidates, we need only consider those that are minimal under the usual vector ordering. Note that if A and B are both minimal elements, then x_1 < x_2 iff y_2 > y_1.
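To make k-candidates and minimal k-candidates concrete, the following naive sketch (ours, not Hirschberg's efficient procedure) reads them off the full scoring matrix: it groups the matching positions by their L-value and then discards every candidate dominated by another one.

```python
def minimal_candidates(A, B):
    """Group matches <i, j> (1-based) by k = L(i, j); keep the minimal ones."""
    n, m = len(A), len(B)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            L[i][j] = (L[i - 1][j - 1] + 1 if A[i - 1] == B[j - 1]
                       else max(L[i - 1][j], L[i][j - 1]))
    # Collect all k-candidates: matching positions with L(i, j) = k.
    cands = {}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if A[i - 1] == B[j - 1]:
                cands.setdefault(L[i][j], []).append((i, j))
    # Keep only candidates not dominated by another candidate of the same k.
    minimal = {}
    for k, pairs in cands.items():
        minimal[k] = [(i, j) for (i, j) in pairs
                      if not any(x <= i and y <= j and (x, y) != (i, j)
                                 for (x, y) in pairs)]
    return minimal
```

The point of Hirschberg's algorithm is to enumerate exactly these minimal candidates contour by contour, without ever building the full matrix.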

Figure 2.6 Contours in the scoring matrix.

Theorem 2.2 [61] For k >= 1, <i, j> is a minimal k-candidate iff j is the minimum value such that b_j = a_i and low < j < high, where high is the minimum j-value of all k-candidates whose i-value is less than i (no upper limit if there is no such k-candidate) and low is the minimum j-value of all (k - 1)-candidates whose i-value is less than i.

In this algorithm, |Σ| = s, where s is the number of distinct symbols in B, and the symbols of Σ are encoded as 1, 2, ..., s. We denote by NB[θ] the number of times that symbol θ occurs in string B, and by PB[θ, 1], ..., PB[θ, NB[θ]] the ordered list of positions in B at which symbol θ occurs. For example, in Figure 2.8, there are 4 distinct symbols in string B, encoded as 1 to 4 (C is encoded to 3 and T to 4). For the symbol encoded to 2, which corresponds to the second row of the PB array, PB[2, 1] = 5 and PB[2, 2] = 8 mean that this symbol first occurs at the fifth position of B, and the second occurrence is at the eighth position.

The Hirschberg Algorithm
Input: Two strings A = a_1 a_2 ... a_m and B = b_1 b_2 ... b_n.
Output: The longest common subsequence of A and B.

ALGD(m, n, A, B, C, p)
1. NB[θ] = 0 for θ = 1, ..., s
   PB[θ, 0] = 0 for θ = 1, ..., s
   PB[0, 0] = 0; PB[0, 1] = 0
   for j = 1 step 1 until n do
     begin
       NB[b_j] = NB[b_j] + 1
       PB[b_j, NB[b_j]] = j
     end
2. D[0, i] = 0 for i = 0, ..., m
   lowcheck = 1
3. for k = 1 step 1 do
   begin
4.   N[θ] = NB[θ] for θ = 1, ..., s
     N[0] = 1
     FLAG = 0
     low = D[k - 1, lowcheck]
     high = n

5. for i = lowcheck + 1 step 1 until m do
   begin
6.   while PB[a_i, N[a_i] - 1] > low do N[a_i] = N[a_i] - 1
7.   if high > PB[a_i, N[a_i]] > low then
       begin
         high = PB[a_i, N[a_i]]
         D[k, i] = high
         if FLAG = 0 then {lowcheck = i, FLAG = 1}
       end
     else D[k, i] = 0
8.   if D[k - 1, i] > 0 then low = D[k - 1, i]
   end (loop of Step 5)
9. if FLAG = 0 then go to Step 10
   end (loop of Step 3)
10. p = k - 1
    k = p
    for i = m + 1 step -1 until 0 do
      if D[k, i] > 0 then
        begin
          c_k = a_i
          k = k - 1
        end

In Step 1, we record the number of occurrences of each symbol θ in B and sort the occurrences by their j-values in the PB matrix. This takes O(n log s) time, where s is the number of distinct symbols in B. With this information, we can quickly bound the region in which the next minimal k-candidate can be found.

In Steps 3-9, we use Theorem 2.2 to bound the possible region for the k-candidates using the previous (k - 1)-candidates. For example, Figure 2.7(a) shows the

initial state with k = 2, where the gray area is the possible region for the minimal 2-candidates. In Figure 2.7(b), a new bound of the gray area is set after we find the first minimal 2-candidate. Therefore, each time we find a new minimal k-candidate, the search area for the other k-candidates becomes smaller. Figures 2.9 and 2.10 show how each new minimal k-candidate is found. Steps 3-9 perform p loops to find all k-candidates. In each loop, we use PB to quickly locate the occurrence of a specific symbol with the j-value closest to the lower bound, which shrinks the region more efficiently. Since a single loop takes O(n) time, this phase can be done in O(pn) time. Each minimal k-candidate obtained is recorded in a matrix as D[k, i] = j, meaning that the position of a common subsequence of length k in row i is j. Finally, we can use the matrix D to trace the LCS path, which takes O(m) time. These steps take O(n log s + pn + m) time in total; therefore the time complexity is O(n log s + pn), where s is the number of distinct symbols and p is the length of the LCS.

2.4 Some Algorithms of LCS

In recent years, there has been a renewal of interest in the lower bound for the complexity of the LCS problem [3,47,60]. In 1980, Masek and Paterson [66] used the Four Russians technique to show that if Σ is arbitrary or finite, the upper bound of the time complexity for LCS is O(mn log log n / log n) or O(mn / log n), respectively. This is still the theoretically fastest algorithm for the LCS problem. Using dynamic programming, the LCS problem can be solved easily in O(mn) time and space, where m and n are the lengths of the input strings. Many algorithms try to reduce the complexity for special inputs, such as a small number of matches, a short LCS length, a small alphabet and so on [9-13,27,28,44,61,65,91,94,134]. For example, Hunt and Szymanski's algorithm takes O(r log n) time and Hirschberg's

algorithm takes O(pn) time, where r is the number of matching pairs and p is the length of the LCS. When r and p are small, these two algorithms are faster than the traditional algorithm. However, in the worst case, r can be n^2 and p can be n, and these two algorithms then become worse than the traditional algorithm. In recent years, many algorithms have used bit-vectors to speed up LCS computation; these are word-dependent. They assume the computer can process one word at a time, where the length of a word is k (normally 32 or 64, depending on the machine). Hence, it takes only O(1) time to process a word whose length is at most k. The advantage of bit-vector algorithms is that they avoid being input-sensitive or output-sensitive. Under this assumption, many bit-vector algorithms have been proposed, and the LCS problem can be solved in O(nm/k) time [16,93]. In 2000, Crochemore et al. gave an O(n^2/k)-time and O(n/k)-space algorithm [38]. In summary, we list these algorithms in Table 2.1.

2.5 Linear Space Algorithms

When computer scientists analyze an algorithm, they usually focus on its time complexity and space complexity. Computing the LCS with Wagner and Fischer's algorithm requires O(mn) space, where m and n are the lengths of the input strings. This space requirement becomes infeasible if m and n are very large. In this section, we present linear space algorithms for the LCS problem. Hirschberg proposed a linear space version for the LCS problem [59]. The main idea of this algorithm is to find a break point of the LCS and use the divide and conquer method to reconstruct the LCS path. Given two strings A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m, where n and m are the lengths of A and B, respectively, ALGB(m, n, L) is a linear space algorithm which returns L, the last row of the original m x n LCS scoring matrix. The algorithm is as follows [59]:

Year  Author(s)                    Time Complexity                  Space Complexity
1974  Wagner and Fischer [129]     O(mn)                            O(mn)
1975  Hirschberg [59]              O(mn)                            O(n)
1977  Hunt and Szymanski [65]      O((n + r) log n)                 O(r + n)
1977  Hirschberg [61]              O(pn + n log n)                  O(pn)
1977  Hirschberg [61]              O(p(m - p) log n)                O((m - p)^2 + n)
1980  Masek and Paterson [66]      O(n max{1, m/log n})             O(n^2 / log n)
1982  Nakatsu et al. [94]          O(n(m - p))                      O(m^2)
1984  Hsu and Du [10, 62]          O(pm log(n/p) + pm)              O(pm)
1985  Ukkonen [125]                O(Em)                            O(E min{m, E})
1986  Apostolico [9]               O(n + m log n + D log(mn/D))     O(r + m)
1986  Myers [91]                   O(D(m + n))                      O(D(m + n))
1987  Kumar and Rangan [71]        O(n(m - p))                      O(n)
1987  Apostolico and Guerra [12]   O(pm + n)                        O(D + n)
1990  Chin and Poon [27]           O(n + min{D, pm})                O(D + n)
1992  Apostolico [11]              O(pm)                            O(n)
1992  Eppstein et al. [44]         O(n + D log log min{D, mn/D})    O(D + m)
1999  Rick [108]                   O(min{pm, p(n - p)})             O(n)

Table 2.1 Time and space complexity of algorithms computing LCS(A, B), where m = |A| and n = |B|, r is the number of matching pairs, p is the length of a longest common subsequence, E = m + n - 2p is the edit distance between A and B, D is the number of dominant matches, and k is the number of bits in a word. [102, 103, 108]

ALGC(m, n, A, B, C)
1. If the problem is trivial, solve it:
   if n = 0 then C = φ (φ is the empty string)
   else if m = 1 then
     if there exists j <= n such that A(1) = B(j) then C = A(1) else C = φ
2. Otherwise, split the problem:
   else begin
     i = floor(m/2);
3. Evaluate L(i, j) and L*(i, j) [j = 0, 1, ..., n]:
     ALGB(A_{1,i}, B_{1,n}, L1);
     ALGB(A_{m,i+1}, B_{n,1}, L2);
4. Find k such that L(i, k) + L*(i, k) = L(m, n):
     M = max_j {L1(j) + L2(n - j)};
     k = min j such that L1(j) + L2(n - j) = M;
5. Solve the simpler subproblems:
     ALGC(i, k, A_{1,i}, B_{1,k}, C1);
     ALGC(m - i, n - k, A_{i+1,m}, B_{k+1,n}, C2);
6. Give the output:
     C = C1 C2;
   end

The main idea of this algorithm lies in Step 3, which uses ALGB to find the break point. Step 3 can be described in more detail as follows:

1. Use ALGB(A_{1,floor(m/2)}, B_{1,n}) to calculate the last row L1 of the LCS matrix between a_1 a_2 ... a_{floor(m/2)} and b_1 b_2 ... b_n.
2. Use ALGB(A_{m,floor(m/2)+1}, B_{n,1}) to calculate the last row L2 of the LCS matrix between a_m a_{m-1} ... a_{floor(m/2)+1} and b_n b_{n-1} ... b_1.

3. Calculate L(k) = L1(k) + L2(n - k), where L_i(k) is the kth element of L_i and k = 1, 2, ..., n. The k that maximizes L(k) is the break point.

Take Figure 2.12 for example. Figures 2.12(a) and (b) compute L1 and L2, respectively. Afterwards, we compute the maximum of L1 + L2, which is 5. This means the LCS must pass through this region and the length of the LCS is 5. After we pick the break point, we can cut the original problem into two smaller subproblems. In Figure 2.13 and Figure 2.14, we can see that the computed area is reduced by using the break point.
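The whole linear space scheme, an ALGB-style last-row routine plus the ALGC-style divide and conquer, can be sketched as follows (ours; names are not the thesis's). The break point search is exactly Step 3 above:

```python
def lcs_last_row(A, B):
    """ALGB-style routine: last row of the LCS score matrix in O(|B|) space."""
    prev = [0] * (len(B) + 1)
    for a in A:
        cur = [0]
        for j, b in enumerate(B, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev

def hirschberg(A, B):
    """ALGC-style divide and conquer: one LCS of A and B in linear space."""
    if not A or not B:
        return ""
    if len(A) == 1:
        return A if A in B else ""
    i = len(A) // 2
    L1 = lcs_last_row(A[:i], B)            # forward half against B
    L2 = lcs_last_row(A[i:][::-1], B[::-1])  # reversed half against reversed B
    # Break point: the split of B maximizing L1(k) + L2(n - k).
    k = max(range(len(B) + 1), key=lambda j: L1[j] + L2[len(B) - j])
    return hirschberg(A[:i], B[:k]) + hirschberg(A[i:], B[k:])
```

Only two rows of the matrix are ever held at once, so the space usage is linear in |B| while the total time stays O(mn).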

Figure 2.7 Three possible regions in contour.

Figure 2.8 Steps 1 and 2 of Hirschberg's algorithm. θ ranges over the distinct symbols in B; NB[θ] counts the occurrences of θ in B.

Figure 2.9 Steps 3-9 of Hirschberg's algorithm when k = 1 and 2.

Figure 2.10 Steps 3-9 of Hirschberg's algorithm when k = 3.

Figure 2.11 Step 10 of Hirschberg's algorithm.

Figure 2.12 Calculation of break points.

Figure 2.13 The divide and conquer of the LCS algorithm. [37]

Figure 2.14 The recursive calls of the LCS algorithm. [37]

CHAPTER 3
Related Problems of LCS

Adding additional conditions to the LCS problem yields several LCS related problems. The cyclic string correction problem adds a condition on the input. The longest increasing subsequence problem adds a condition on the matching. The constrained longest common subsequence problem adds a condition on the output, in the form of a constraining sequence.

3.1 The Cyclic String Correction Problem

The linear string to string correction problem was defined by Wagner and Fischer in 1974 [129], and it can be solved by a well-known dynamic programming algorithm. Using this algorithm, we can obtain the edit distance for any prefixes of A and B in O(mn) time and O(mn) space, where m and n are the lengths of the input sequences A and B, respectively. In 1990, Maes defined the cyclic string correction problem, whose inputs are cyclic strings instead of linear ones [84]. We define a cyclic shift σ as a shift operation, written σ(a_1 a_2 ... a_n) = a_2 a_3 ... a_n a_1. Given A = a_1 a_2 ... a_n, a string of length n, let σ^k(A) denote the result of applying σ k times, which produces the shifted sequence a_{k+1} a_{k+2} ... a_n a_1 a_2 ... a_k. According to the above definition, it is clear that A = σ^n(A). Here we denote by [A] a cyclic string, which is the set composed of σ^1(A), σ^2(A), ..., σ^n(A). Given two cyclic strings [A] and [B], we can compute the edit distance

δ([A], [B]) = min{ δ(σ^k(A), σ^l(B)) | 1 <= k <= n = |A|, 1 <= l <= m = |B| },

where δ(A, B) is the edit distance of two strings A and B. In Chapter 4, we will show that the longest common subsequence problem is a special case of the edit distance problem. Therefore, any algorithm that solves the edit distance problem can be easily modified to solve the LCS problem. Hence, we take the cyclic longest common subsequence problem as an example to explain the cyclic string problem and show some of its properties. The cyclic longest common subsequence problem can be formalized as follows:

Definition: Given two cyclic strings A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m, n >= m, the cyclic longest common subsequence problem is to find LCS([A], [B]) = max{ LCS(σ^k(A), σ^l(B)) | 1 <= k <= n = |A|, 1 <= l <= m = |B| }, where LCS(A, B) is the length of a longest common subsequence of strings A and B.

Theorem 3.1 LCS([A], [B]) = l if and only if LCS(σ^s(A), B) = l for some s, 1 <= s <= n [84].

According to Theorem 3.1, we can reduce the cyclic longest common subsequence problem LCS([A], [B]) to LCS([A], B), and apply the original LCS algorithm to it. If we calculate LCS(σ^1(A), B), LCS(σ^2(A), B), ..., LCS(σ^n(A), B) one by one, it takes O(n^2 m) time. We can use (2n + 1) x (m + 1) space to store the LCS([A], B) scoring matrix; each LCS(σ^k(A), B) can then be found in the window of n rows starting at row k + 1. In Maes's algorithm, we compute only the values inside certain boundaries instead of all values in the whole windows.

Theorem 3.2 Let L_s and L_t be two LCS paths calculated from LCS(σ^s(A), B) and LCS(σ^t(A), B). If s is not equal to t, then L_s and L_t never cross each other [84].
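A direct sketch of this reduction (ours, not Maes's code): rotate A over all n shifts and keep the best linear LCS length, which is exactly the O(n^2 m) approach mentioned above.

```python
def lcs_length(A, B):
    """Linear LCS length in O(|A||B|) time and O(|B|) space."""
    prev = [0] * (len(B) + 1)
    for a in A:
        cur = [0]
        for j, b in enumerate(B, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def cyclic_lcs_length(A, B):
    """LCS([A], [B]): by Theorem 3.1, it suffices to rotate A only."""
    n = len(A)
    return max(lcs_length(A[s:] + A[:s], B) for s in range(n))
```

Maes's divide and conquer over the non-crossing paths of Theorem 3.2 brings the cost down from this O(n^2 m) baseline to O(nm log n).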

Figure 3.1 An example of transforming two cyclic strings.

Take Figure 3.2 as an example, where L_s is the LCS path of LCS(σ^s(A), B) and L_t is the LCS path of LCS(σ^t(A), B). Let e1 and e2 denote the bold and dashed paths between P_d and P_c, respectively. (1) If the number of matches on e1 is not less than that on e2, we can find another LCS path passing through e1 in LCS(σ^t(A), B) which is better than L_t. (2) Otherwise, we can find another LCS path passing through e2 in LCS(σ^s(A), B) which is better than L_s. Obviously, conditions (1) and (2) each lead to a contradiction. Therefore, the LCS paths of different windows never cross each other.

Figure 3.2 An example of LCS crossing.

Maes proposed an algorithm for solving this problem in O(nm log n) time [84] in 1990. According to Theorem 3.2, we know that any two LCS paths in different windows never cross. As shown in Figure 3.3, we first compute L_0 and L_n as boundaries; in fact, L_0 is the same path as L_n. Then we compute L_{n/2} by computing only the values of the area bounded by L_0, L_n and the horizontal lines at positions n/2 and 3n/2. With the new lower boundary L_{n/2}, we can find L_{n/4} by computing only the area bounded by L_0 and L_{n/2}. In this way, the divide and conquer method separates the original problem into smaller ones. In Figure 3.3, a_k is the kth character of A and the dashed lines are the boundaries of the LCS paths.

Because the LCS paths never cross each other, we only need to calculate the region between the boundaries of each window.

3.2 The Longest Increasing Subsequence Problem

The LIS (longest increasing subsequence) problem is well studied in computer science [23, 47] and also has some applications in biology [39, 109, 138]. The longest increasing subsequence problem is to find a longest increasing subsequence of a given sequence of distinct integers S = x_1 x_2 ... x_n. For example, 3578 and 478 may both be increasing subsequences of a sequence S; if 3578 is longer than 478 and no other increasing subsequence of S is longer, then 3578 is the longest increasing subsequence of S. The LIS problem can be formally defined as follows.

Definition: Given a sequence of n distinct comparable integers S = x_1 x_2 ... x_n, the LIS problem is to find a subsequence x_{i1} x_{i2} ... x_{ik} of S such that x_{i1} < x_{i2} < ... < x_{ik} and k is maximum.

If we have the sorted version of the input, then we can find the LIS simply with dynamic programming: treat the input as sequence A and the sorted input as sequence B; then LCS(A, B) gives the longest increasing subsequence of A. Figure 3.4 shows an example. Hunt and Szymanski [65] presented an algorithm which can solve the LIS problem in O(n log n) time, and its time complexity is reduced to O(n log log n) if the van Emde Boas data structure is used. Bespamyatnikh and Segal [23] also used the van Emde Boas data structure to find all longest increasing subsequences in O(n log log n) time.
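The reduction just described can be sketched in a few lines, together with the standard O(n log n) binary-search method that the cited van Emde Boas results refine. This is a hedged illustration assuming distinct integers; the function names are mine.

```python
from bisect import bisect_left

# Two hedged sketches for LIS over distinct integers. The first follows the
# reduction in the text: LIS(S) has the same length as LCS(S, sorted(S)).
# The second is the standard O(n log n) patience-style method.

def lis_via_lcs(s):
    """LIS length via LCS of s against its sorted copy (distinct values)."""
    t = sorted(s)
    m = len(t)
    prev = [0] * (m + 1)
    for x in s:
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            cur[j] = prev[j - 1] + 1 if x == t[j - 1] else max(prev[j], cur[j - 1])
        prev = cur
    return prev[m]

def lis_fast(s):
    """O(n log n): tails[k] is the smallest tail of an increasing
    subsequence of length k + 1 seen so far."""
    tails = []
    for x in s:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)
```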

Figure 3.3 The divide and conquer of Maes's algorithm. a_k is the kth character of A and the dashed line is the boundary of the LCS. Here, A = a_1 a_2 ... a_8; that is, a_i = a_{8+i} for 1 ≤ i ≤ 8.

Figure 3.4 Solving LIS with the LCS algorithm, where A is the input sequence and B is the sorted version of A; the LCS of A and B is the longest increasing subsequence in this case.

Year  Author(s)                      Time Complexity
1961  Schensted [110]                O(n log n)
1977  Hunt and Szymanski [65]        O(n log log n)
2000  Bespamyatnikh and Segal [23]   O(n log log n)

Table 3.1 The algorithms for the LIS problem.

Albert et al. [4] gave an algorithm to solve the window LIS problems. Given a sequence S = x_1 x_2 ... x_n and a window size w ≤ n, a window of S is defined as a substring x_{i+1} x_{i+2} ... x_{i+w}, where 0 ≤ i ≤ n − w. There are three kinds of problems, as follows: (1) Local Max Value: finding the length k of the longest increasing subsequence over all windows. (2) Local Max Sequence: finding the LIS in each window. (3) Global Max Sequence: finding the windows whose LIS is the longest among all windows. The algorithm of Albert et al. [4] takes O(n log log n + OUTPUT) time to solve these three problems. Given two comparable sequences S = x_1 x_2 ... x_n and T = y_1 y_2 ... y_m, Yang et al. [137] proposed an algorithm that solves the longest common increasing subsequence problem between two comparable sequences in O(mn) time and O(mn) space.
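The O(mn) bound for the longest common increasing subsequence (LCIS) can be illustrated with a common O(mn) formulation; this is a hedged sketch of a textbook variant, not necessarily Yang et al.'s exact recurrence, and the names are mine.

```python
# A hedged O(mn) sketch of the longest common increasing subsequence.
# f[j] = length of an LCIS of the processed prefix of s and of t[0..j]
# that ends exactly at t[j].

def lcis_len(s, t):
    f = [0] * len(t)
    for x in s:
        best = 0  # best LCIS length ending with a value < x, seen so far in t
        for j, y in enumerate(t):
            if x == y and best + 1 > f[j]:
                f[j] = best + 1   # extend a shorter LCIS with the match x == t[j]
            elif y < x and f[j] > best:
                best = f[j]       # t[j] could precede x in an increasing chain
    return max(f, default=0)
```

The single `best` accumulator is what collapses the naive O(mn^2) triple loop to O(mn).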

Definition: Given two strings A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m and a constraint sequence C = c_1 c_2 ... c_p, the constrained longest common subsequence problem (CLCS) is to find a longest common subsequence of A and B containing C as its subsequence.

In the original LCS problem, the longest common subsequence is optimal by itself. The CLCS problem, however, asks for a longest common subsequence containing a specific sequence, which may be shorter than the original LCS. For example, given the two strings CCT and CTC, the longest common subsequence is C. However, if we are additionally given the constraint sequence CC, then the constrained longest common subsequence is CC, which differs greatly from C.

Given two strings A = a_1 a_2 ... a_n and B = b_1 b_2 ... b_m and a constraint sequence C = c_1 c_2 ... c_p, let A_{q,r} = a_q a_{q+1} ... a_r and B_{s,t} = b_s b_{s+1} ... b_t be substrings of A and B, respectively, where 1 ≤ q ≤ r ≤ n and 1 ≤ s ≤ t ≤ m. Let LCS(q, r, s, t) denote the length of the LCS of A_{q,r} and B_{s,t}. For 1 ≤ k ≤ p, 1 ≤ i ≤ n and 1 ≤ j ≤ m, let LCS_k(i, j) denote the length of the LCS of A_{1,i} and B_{1,j} with a_i = b_j = c_k, such that the LCS contains c_1 c_2 ... c_k. Because the constraint sequence must appear in the output, each constraint character c_k can serve as a break point. Based on this idea, Tsai [123] uses the c_k's to separate the input strings into several parts and calculates the common subsequences among these parts with dynamic programming. The scoring function can be formalized as follows:

LCS(A, B) = max_{1 ≤ i ≤ n, 1 ≤ j ≤ m} { LCS_p(i, j) + LCS(i + 1, n, j + 1, m) };
LCS_1(i, j) = LCS(1, i − 1, 1, j − 1) + 1, where a_i = b_j = c_1;
LCS_k(i, j) = max_{1 ≤ x < i, 1 ≤ y < j} { LCS_{k−1}(x, y) + LCS(x + 1, i − 1, y + 1, j − 1) + 1 }, where a_i = b_j = c_k, 2 ≤ k ≤ p, 1 ≤ i ≤ n and 1 ≤ j ≤ m.

A: C C T
B: C T C

Figure 3.5 Possible break points on A = CCT and B = CTC with the constrained sequence C = CC.

Figure 3.5 shows an example of break points, where the two input strings are A = CCT and B = CTC, and the constrained sequence is C = CC. Here,

LCS(A, B) = LCS_3(3, 7) + LCS(4, 8, 8, 8) = LCS_3(3, 7) + 1,
LCS_3(3, 7) = LCS_2(2, 6) + 1,
LCS_2(2, 6) = LCS_1(1, 3) + 1,
LCS_1(1, 3) = 1.

Therefore, LCS(A, B) = 4. In summary, the algorithm takes O(p m^2 n^2) time.

Chin et al. [30] also followed the idea of break points and reduced the conditions as follows:

Case 1: If a_i = b_j = c_k, the CLCS score between a_1 a_2 ... a_i and b_1 b_2 ... b_j is equal to the CLCS score between a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_{j−1} plus the matching score. In addition, a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_{j−1} need contain only c_1 c_2 ... c_{k−1}.

Case 2: If a_i = b_j and a_i ≠ c_k, the CLCS score between a_1 a_2 ... a_i and b_1 b_2 ... b_j is equal to the CLCS score between a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_{j−1} plus the matching score. In addition, a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_{j−1} must contain c_1 c_2 ... c_k.

Case 3: If a_i ≠ b_j, the CLCS score between a_1 a_2 ... a_i and b_1 b_2 ... b_j may be equal to the CLCS score between a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_j (deleting a_i). In addition, a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_j must contain c_1 c_2 ... c_k.

Case 4: If a_i ≠ b_j, the CLCS score between a_1 a_2 ... a_i and b_1 b_2 ... b_j may be equal to the CLCS score between a_1 a_2 ... a_i and b_1 b_2 ... b_{j−1} (deleting b_j). In addition, a_1 a_2 ... a_i and b_1 b_2 ... b_{j−1} must contain c_1 c_2 ... c_k.

We can formalize the above cases as follows:

L_0(i, 0) = 0, 0 ≤ i ≤ n;
L_0(0, j) = 0, 0 ≤ j ≤ m;
L_0(i, j) = max{ L_0(i − 1, j), L_0(i, j − 1), L_0(i − 1, j − 1) + 1 if a_i = b_j }, where 1 ≤ i ≤ n, 1 ≤ j ≤ m;
L_k(i, 0) = −∞, 0 ≤ i ≤ n, 1 ≤ k ≤ p;
L_k(0, j) = −∞, 0 ≤ j ≤ m, 1 ≤ k ≤ p;
L_k(i, j) = max{ L_k(i − 1, j), L_k(i, j − 1), L_k(i − 1, j − 1) + 1 if a_i = b_j, L_{k−1}(i − 1, j − 1) + 1 if a_i = b_j = c_k }, where 1 ≤ i ≤ n, 1 ≤ j ≤ m, 1 ≤ k ≤ p.

Take Figure 3.6 for example. We first compute the original scoring matrix L_0 and find the break point at L_0(1, 1). In L_1(3, 2), a_3 = C and b_2 form a mismatch, so L_1(3, 2) = max{L_1(2, 2), L_1(3, 1)} = 1. In L_1(3, 3), a_3 = C and b_3 = C form a match, so L_1(3, 3) = L_0(2, 2) + 1 = 2. Following the recurrence, we compute the p scoring matrices and use the tracing back technique to find

the constrained longest common subsequence. We also give an example of tracing back in Figure 3.7.
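The L_k recurrence above runs in O(pmn) time, and can be sketched directly as a dynamic program. This is a hedged illustration: the name clcs_len and the NEG sentinel (standing in for −∞) are mine, not from the thesis.

```python
# A hedged O(pmn) sketch of the CLCS dynamic program in the spirit of the
# recurrence above. Layer k forces c[0..k-1] to be embedded in the answer.

NEG = float("-inf")  # -infinity boundary: "c[0..k-1] not yet embedded"

def clcs_len(a, b, c):
    n, m, p = len(a), len(b), len(c)
    # L[k][i][j]: best common-subsequence length over a[0..i-1], b[0..j-1]
    # that contains c[0..k-1]; borders are 0 for k = 0 and NEG otherwise.
    L = [[[0 if k == 0 else NEG] * (m + 1) for _ in range(n + 1)]
         for k in range(p + 1)]
    for k in range(p + 1):
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                best = max(L[k][i - 1][j], L[k][i][j - 1])   # Cases 3 and 4
                if a[i - 1] == b[j - 1]:
                    best = max(best, L[k][i - 1][j - 1] + 1)  # Case 2
                    if k > 0 and a[i - 1] == c[k - 1]:        # Case 1
                        best = max(best, L[k - 1][i - 1][j - 1] + 1)
                L[k][i][j] = best
    return L[p][n][m]  # NEG means no common subsequence contains c
```

A result of NEG signals that no common subsequence of a and b contains c at all.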


Figure 3.6 The result obtained from the two input sequences CCT and CTC with constrained sequence CC. Here, L_k denotes the scoring matrix of an LCS which contains the constrained sequence c_1 c_2 ... c_k.


Figure 3.7 The tracing path obtained from sequences CCT and CTC with constrained sequence CC.

CHAPTER 4

The Edit Distance Problem

The edit distance is widely used in many areas, such as computational biology, speech recognition and text retrieval [7, 40, 54, 92, 96, 109, 111, 127]. In this chapter, we introduce the Levenshtein edit distance in Section 4.1. There are many different kinds of operations in the edit distance problem, and the edit distance has different complexities under different sets of operations. In Section 4.2, we introduce some of these operations and list their complexities.

4.1 The Edit Distance Problem

The Levenshtein edit distance problem was first proposed by Levenshtein in 1965 [78, 79]. A few years later, Wagner and Fischer presented some definitions and properties in their paper [129]; we follow their definitions in this section. Let A = a_1 a_2 ... a_n be a sequence over an alphabet Σ, and let A_{i,j} = a_i a_{i+1} ... a_j, where 1 ≤ i ≤ j ≤ n. Let Λ denote the null string, and let A_{i,j} = Λ if i > j. An edit operation s is an ordered pair (a, b) ≠ (φ, φ) of characters, indicating a → b, where φ denotes the null character. An edit operation (a, b) is called a deletion if b = φ, an insertion if a = φ, and a replacement if a ≠ φ and b ≠ φ. An edit sequence S = s_1 s_2 ... s_m is a sequence of edit operations. We can view S as a sequence of strings A_0, A_1, ..., A_m such that A = A_0, B = A_m, and A_{i−1} is transformed into A_i via s_i for 1 ≤ i ≤ m. In short, we can use s_1 s_2 ... s_m to change string A into string B.

Let γ(s) denote the cost of an edit operation s. For an edit sequence S, γ(S) = Σ_{i=1}^{m} γ(s_i). We define the edit distance δ(A, B) from string A to string B as the minimal cost of transforming A into B. Formally, δ(A, B) = min{ γ(S) : S is an edit sequence used for transforming A into B }.

Let A and B be strings of lengths n and m, respectively, and let D(i, j) = δ(A_{1,i}, B_{1,j}) for 1 ≤ i ≤ n, 1 ≤ j ≤ m. We have

D(i, j) = min{ D(i − 1, j − 1) + γ(a_i → b_j), D(i − 1, j) + γ(a_i → φ), D(i, j − 1) + γ(φ → b_j) },

where D(i, j) is the optimal edit distance between the prefixes A_{1,i} = a_1 a_2 ... a_i and B_{1,j} = b_1 b_2 ... b_j. The last paired characters must fall into one of the following cases:

Case 1: Delete a_i. The edit distance between a_1 a_2 ... a_i and b_1 b_2 ... b_j is equal to the edit distance between a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_j plus the deletion cost.

Case 2: Insert b_j. The edit distance between a_1 a_2 ... a_i and b_1 b_2 ... b_j is equal to the edit distance between a_1 a_2 ... a_i and b_1 b_2 ... b_{j−1} plus the insertion cost.

Case 3: Replace a_i with b_j. The edit distance between a_1 a_2 ... a_i and b_1 b_2 ... b_j is equal to the edit distance between a_1 a_2 ... a_{i−1} and b_1 b_2 ... b_{j−1} plus the replacement cost.

Because each D(i, j) takes the minimum over these three cases, D(i, j) is the minimal cost of transforming A_{1,i} into B_{1,j}.

Let T be a trace from A to B, where T is any set of integer pairs (i, j) satisfying:

1. 1 ≤ i ≤ n and 1 ≤ j ≤ m.
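The recurrence for D(i, j) can be sketched with unit costs, i.e. γ = 1 for each insertion, deletion, and replacement of distinct characters, and 0 for a match. The function name is illustrative.

```python
# A minimal sketch of the unit-cost Levenshtein recurrence above.

def edit_distance(a: str, b: str) -> int:
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i          # delete all of a[0..i-1]
    for j in range(m + 1):
        D[0][j] = j          # insert all of b[0..j-1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            repl = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + repl,  # replace / match
                          D[i - 1][j] + 1,         # delete a_i
                          D[i][j - 1] + 1)         # insert b_j
    return D[n][m]
```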


More information

Lecture 5: September Time Complexity Analysis of Local Alignment

Lecture 5: September Time Complexity Analysis of Local Alignment CSCI1810: Computational Molecular Biology Fall 2017 Lecture 5: September 21 Lecturer: Sorin Istrail Scribe: Cyrus Cousins Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

More information

Computer Science & Engineering 423/823 Design and Analysis of Algorithms

Computer Science & Engineering 423/823 Design and Analysis of Algorithms Computer Science & Engineering 423/823 Design and Analysis of Algorithms Lecture 03 Dynamic Programming (Chapter 15) Stephen Scott and Vinodchandran N. Variyam sscott@cse.unl.edu 1/44 Introduction Dynamic

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Lecture : p he biological problem p lobal alignment p Local alignment p Multiple alignment 6 Background: comparative genomics p Basic question in biology: what properties

More information

A Universal Turing Machine

A Universal Turing Machine A Universal Turing Machine A limitation of Turing Machines: Turing Machines are hardwired they execute only one program Real Computers are re-programmable Solution: Universal Turing Machine Attributes:

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Brute-Force Pattern Matching ( 11.2.1) The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift

More information

Multiple Sequence Alignment (MAS)

Multiple Sequence Alignment (MAS) Multiple Sequence lignment (MS) Group-to-group alignments Steven driaensen & Ken Tanaka References Osamu Goto Optimal lignment between groups of sequences and its application to multiple sequence alignment

More information

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer

Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer Algorithm Theory 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore Institut für Informatik Wintersemester 2007/08 Text Search Scenarios Static texts Literature databases Library systems Gene databases

More information

2. Exact String Matching

2. Exact String Matching 2. Exact String Matching Let T = T [0..n) be the text and P = P [0..m) the pattern. We say that P occurs in T at position j if T [j..j + m) = P. Example: P = aine occurs at position 6 in T = karjalainen.

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 410 (2009) 2759 2766 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Note Computing the longest topological

More information

General Methods for Algorithm Design

General Methods for Algorithm Design General Methods for Algorithm Design 1. Dynamic Programming Multiplication of matrices Elements of the dynamic programming Optimal triangulation of polygons Longest common subsequence 2. Greedy Methods

More information

Lecture 3: String Matching

Lecture 3: String Matching COMP36111: Advanced Algorithms I Lecture 3: String Matching Ian Pratt-Hartmann Room KB2.38: email: ipratt@cs.man.ac.uk 2017 18 Outline The string matching problem The Rabin-Karp algorithm The Knuth-Morris-Pratt

More information

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas n Introduction to Bioinformatics lgorithms Multiple lignment Slides revised and adapted to Bioinformática IS 2005 na eresa Freitas n Introduction to Bioinformatics lgorithms Outline Dynamic Programming

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille IT University of Copenhagen Rolf Fagerberg University of Southern Denmark Inge Li Gørtz

More information

Chapter 6. Weighted Interval Scheduling. Dynamic Programming. Algorithmic Paradigms. Dynamic Programming Applications

Chapter 6. Weighted Interval Scheduling. Dynamic Programming. Algorithmic Paradigms. Dynamic Programming Applications lgorithmic Paradigms hapter Dynamic Programming reedy. Build up a solution incrementally, myopically optimizing some local criterion. Divide-and-conquer. Break up a problem into sub-problems, solve each

More information

Sequence Comparison. mouse human

Sequence Comparison. mouse human Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity

More information

Computational Biology Lecture 5: Time speedup, General gap penalty function Saad Mneimneh

Computational Biology Lecture 5: Time speedup, General gap penalty function Saad Mneimneh Computational Biology Lecture 5: ime speedup, General gap penalty function Saad Mneimneh We saw earlier that it is possible to compute optimal global alignments in linear space (it can also be done for

More information

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:

Pairwise alignment, Gunnar Klau, November 9, 2005, 16: Pairwise alignment, Gunnar Klau, November 9, 2005, 16:36 2012 2.1 Growth rates For biological sequence analysis, we prefer algorithms that have time and space requirements that are linear in the length

More information

Dynamic Programming: Shortest Paths and DFA to Reg Exps

Dynamic Programming: Shortest Paths and DFA to Reg Exps CS 374: Algorithms & Models of Computation, Fall 205 Dynamic Programming: Shortest Paths and DFA to Reg Exps Lecture 7 October 22, 205 Chandra & Manoj (UIUC) CS374 Fall 205 / 54 Part I Shortest Paths with

More information

Efficient Approximation of Large LCS in Strings Over Not Small Alphabet

Efficient Approximation of Large LCS in Strings Over Not Small Alphabet Efficient Approximation of Large LCS in Strings Over Not Small Alphabet Gad M. Landau 1, Avivit Levy 2,3, and Ilan Newman 1 1 Department of Computer Science, Haifa University, Haifa 31905, Israel. E-mail:

More information

Global alignments - review

Global alignments - review Global alignments - review Take two sequences: X[j] and Y[j] M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] 2 M[i-1, j] 2 The best alignment for X[1 i] and Y[1 j] is called M[i, j] X[j] Initiation: M[,]= pply

More information

String Range Matching

String Range Matching String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings

More information

Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions

Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions Copyright c 2015 Andrew Klapper. All rights reserved. 1. For the functions satisfying the following three recurrences, determine which is the

More information

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler STATC141 Spring 2005 The materials are from Pairise Sequence Alignment by Robert Giegerich and David Wheeler Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (I) Sequence similarity

More information

arxiv: v1 [cs.ds] 9 Apr 2018

arxiv: v1 [cs.ds] 9 Apr 2018 From Regular Expression Matching to Parsing Philip Bille Technical University of Denmark phbi@dtu.dk Inge Li Gørtz Technical University of Denmark inge@dtu.dk arxiv:1804.02906v1 [cs.ds] 9 Apr 2018 Abstract

More information

On the Exponent of the All Pairs Shortest Path Problem

On the Exponent of the All Pairs Shortest Path Problem On the Exponent of the All Pairs Shortest Path Problem Noga Alon Department of Mathematics Sackler Faculty of Exact Sciences Tel Aviv University Zvi Galil Department of Computer Science Sackler Faculty

More information

Divide and Conquer Algorithms. CSE 101: Design and Analysis of Algorithms Lecture 14

Divide and Conquer Algorithms. CSE 101: Design and Analysis of Algorithms Lecture 14 Divide and Conquer Algorithms CSE 101: Design and Analysis of Algorithms Lecture 14 CSE 101: Design and analysis of algorithms Divide and conquer algorithms Reading: Sections 2.3 and 2.4 Homework 6 will

More information

Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027

Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027 Sequence Alignment Evolution CT Amemiya et al. Nature 496, 311-316 (2013) doi:10.1038/nature12027 Evolutionary Rates next generation OK OK OK X X Still OK? Sequence conservation implies function Alignment

More information

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007. Proofs, Strings, and Finite Automata CS154 Chris Pollett Feb 5, 2007. Outline Proofs and Proof Strategies Strings Finding proofs Example: For every graph G, the sum of the degrees of all the nodes in G

More information

CSCI Honor seminar in algorithms Homework 2 Solution

CSCI Honor seminar in algorithms Homework 2 Solution CSCI 493.55 Honor seminar in algorithms Homework 2 Solution Saad Mneimneh Visiting Professor Hunter College of CUNY Problem 1: Rabin-Karp string matching Consider a binary string s of length n and another

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Friday Four Square! Today at 4:15PM, Outside Gates

Friday Four Square! Today at 4:15PM, Outside Gates P and NP Friday Four Square! Today at 4:15PM, Outside Gates Recap from Last Time Regular Languages DCFLs CFLs Efficiently Decidable Languages R Undecidable Languages Time Complexity A step of a Turing

More information

Dynamic Programming. Prof. S.J. Soni

Dynamic Programming. Prof. S.J. Soni Dynamic Programming Prof. S.J. Soni Idea is Very Simple.. Introduction void calculating the same thing twice, usually by keeping a table of known results that fills up as subinstances are solved. Dynamic

More information

Small-Space Dictionary Matching (Dissertation Proposal)

Small-Space Dictionary Matching (Dissertation Proposal) Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length

More information

String Search. 6th September 2018

String Search. 6th September 2018 String Search 6th September 2018 Search for a given (short) string in a long string Search problems have become more important lately The amount of stored digital information grows steadily (rapidly?)

More information

Turing s thesis: (1930) Any computation carried out by mechanical means can be performed by a Turing Machine

Turing s thesis: (1930) Any computation carried out by mechanical means can be performed by a Turing Machine Turing s thesis: (1930) Any computation carried out by mechanical means can be performed by a Turing Machine There is no known model of computation more powerful than Turing Machines Definition of Algorithm:

More information

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

More information

Finding all covers of an indeterminate string in O(n) time on average

Finding all covers of an indeterminate string in O(n) time on average Finding all covers of an indeterminate string in O(n) time on average Md. Faizul Bari, M. Sohel Rahman, and Rifat Shahriyar Department of Computer Science and Engineering Bangladesh University of Engineering

More information

Computing the Longest Common Subsequence of Multiple RLE Strings

Computing the Longest Common Subsequence of Multiple RLE Strings The 29th Workshop on Combinatorial Mathematics an Computation Theory Computing the Longest Common Subsequence of Multiple RLE Strings Ling-Chih Yao an Kuan-Yu Chen Grauate Institute of Networking an Multimeia

More information