On Pattern Matching With Swaps

On Pattern Matching With Swaps

Fouad B. Chedid
Dhofar University, Salalah, Oman
Notre Dame University - Louaize, Lebanon
P.O. Box 2509, Postal Code 211, Salalah, Oman
fchedid@du.edu.om, fchedid@ndu.edu.lb

Abstract: Pattern Matching with Swaps (PMS for short) is a variation of the classical pattern matching problem where a match is allowed to include disjoint local swaps. In 2009, Cantone and Faro devised a new dynamic programming algorithm for PMS, named Cross-Sampling, that runs in O(nm) time and uses O(m) space. More important, Cross-Sampling admits a linear-time implementation based on bit-parallelism when the pattern's size is comparable to the word size of the machine. In this paper, we present improved dynamic programming formulations of the approach of Cantone and Faro for PMS which result in simpler algorithms that are much easier to comprehend and implement.

Keywords: Pattern Matching with Swaps, Approximate Pattern Matching with Swaps, Bit-Parallelism, Dynamic Programming, Efficient Algorithms.

I. INTRODUCTION

The classical Pattern Matching problem (PM for short) is a well-studied problem in computer science. This problem is defined as follows. Given a fixed alphabet Σ, a pattern P ∈ Σ* of length m, and a text T ∈ Σ* of length n ≥ m, PM asks for a way to find all occurrences of P in T. The Pattern Matching with Swaps problem (PMS for short) is a variation of PM in which a match is allowed to include disjoint local swaps. More precisely, a pattern P is said to have a swapped match with a text T at location j if adjacent characters in P can be swapped, if necessary, so as to make P identical to the substring of T ending at location j. We have included below an example of a pattern P = bbababab having a swapped match with a string T = ababbbaabba at location j = 8. Observe that two swaps are needed for this swap-match. Also, observe that both swaps are disjoint; that is, each character can be involved in at most one swap, and that identical adjacent characters are not allowed to be swapped.

  P =    b b a b a b a b
  T =  a b a b b b a a b b a
  j =  0 1 2 3 4 5 6 7 8 9 10

PMS was introduced in 1995 [8] as an open problem in non-standard stringology. A variant of PMS, named Approximate Pattern Matching with Swaps (APMS for short), asks to find, for each location of the text where there is a swapped match of the pattern, the number of swaps needed to obtain a match at that location. We now know that PMS and APMS have important applications in many fields such as computational biology, text and musical retrieval, data mining, and network security [5]. An algorithm for PMS that runs in o(nm) time first appeared in [1]. Algorithms for PMS and APMS that run in time O(n log m log σ), where σ is the size of the alphabet (= |Σ|), appear in [2] and [3]. We mention that the solutions in [1]–[3] are based on the Fast Fourier Transform (FFT) method. A non-FFT-based algorithm for PMS first appeared in [7], where an algorithm based on bit-parallelism was devised with running time O((n + m) log m) if the pattern size is comparable to the word size of the machine. In 2009, Cantone and Faro [5] devised a new dynamic programming algorithm for PMS, named Cross-Sampling, that runs in time O(nm) and uses O(m) space. More important, Cross-Sampling admits a linear-time implementation based on bit-parallelism if the size of the pattern is comparable to the word size of the machine. Moreover, Cross-Sampling can be easily adapted to solve APMS in time O(nm) (O(n) for short patterns, based on bit-parallelism).
Thus, for the first time, we have an algorithm that solves PMS and APMS for short patterns in linear time. In 2009, Campanelli et al. [4] described a variation of the Cross-Sampling algorithm that inherits much of the structure of Cross-Sampling but is based on a right-to-left scan of the text. The new algorithm, named Backward-Cross-Sampling, runs in time O(nm²); however, extensive computer runs show that Backward-Cross-Sampling outperforms Cross-Sampling in practice [4].

In this paper, we present improved dynamic programming formulations of the approaches of Cantone and Faro and of Campanelli et al. for PMS. Our work gives new algorithms for PMS and APMS that are much easier to comprehend and implement. In the sequel, a string P will be represented as a finite array P[0..m−1], which is basically the concatenation of the characters P[i], for 0 ≤ i ≤ m−1. Note that P[i] denotes the (i+1)th character of the string P. Let P_i denote the prefix of P of length i+1 (0 ≤ i ≤ m−1).

The rest of the paper is organized as follows. Section 2 gives basic definitions. Section 3 presents our simpler solutions for PMS and APMS. Section 4 presents more efficient versions of our solutions based on bit-parallelism, and Section 5 concludes the paper with some observations for future work.

II. PROBLEM DEFINITION

Let Σ be a fixed alphabet and let P and T be two strings over Σ of lengths m and n ≥ m, respectively.

Definition 1: A swap permutation of P is a permutation π : {0, …, m−1} → {0, …, m−1} such that:
1) if π(i) = j then π(j) = i (characters are swapped);
2) for all i, π(i) ∈ {i−1, i, i+1} (only adjacent characters can be swapped);
3) if π(i) ≠ i then P[π(i)] ≠ P[i] (identical characters are not allowed to be swapped).

The swapped version of P under the permutation π is denoted as π(P); that is, π(P) is the concatenation of the characters P[π(i)], for 0 ≤ i ≤ m−1. For a given text string T ∈ Σ*, we say that P has a swapped match with T at location j if there exists a swap permutation π of P such that π(P) has an exact match with T ending at location j. For example, the swap permutation that corresponds to the swap match shown as an example in the previous section is π(bbababab) = babbbaab. This swap match has two swaps: π(1) = 2, π(2) = 1 and π(4) = 5, π(5) = 4. In this case, we write P ∝ T_8. Moreover, since 2 swaps are needed for this swap-match, we also write P ∝_2 T_8.

The Pattern Matching with Swaps Problem (PMS for short) is the following:
INPUT: A text string T[0..n−1] and a pattern string P[0..m−1] over a fixed alphabet Σ.
OUTPUT: All locations j, m−1 ≤ j ≤ n−1, such that P ∝ T_j.

The Approximate Pattern Matching with Swaps Problem (APMS for short) is the following:
INPUT: A text string T[0..n−1] and a pattern string P[0..m−1] over a fixed alphabet Σ.
OUTPUT: The number of swaps k needed for each m−1 ≤ j ≤ n−1 where P ∝_k T_j.

III. SIMPLER ALGORITHMS FOR PMS AND APMS

We present improved dynamic programming formulations of the approaches of Cantone and Faro and of Campanelli et al. for PMS and APMS which result in simpler algorithms for these problems that are much easier to comprehend and implement.

A. A Simpler Algorithm for PMS

The main idea behind the Cross-Sampling algorithm is a new approach for finding all prefixes P_i of P that have swapped matches with T ending at some location j, for 0 ≤ j ≤ n−1. This will be denoted by P_i ∝ T_j. The paper [5] defines a collection of sets S_j, for 0 ≤ j ≤ n−1, as follows:

  S_j = {0 ≤ i ≤ m−1 : P_i ∝ T_j}.

Thus, the pattern P has a swapped match with the text T ending at location j if and only if m−1 ∈ S_j. To compute S_j, the authors of [5] define another collection of sets S'_j, for 0 ≤ j ≤ n−1, as follows:

  S'_j = {0 ≤ i ≤ m−1 : P_{i−1} ∝ T_{j−1} and P[i] = T[j+1]}.

Then, it is shown how to compute S_j in terms of S_{j−1} and S'_{j−1}, where S'_{j−1} is computed in terms of S_{j−2}. This formulation of the solution gives a dynamic programming based iterative solution for PMS that runs in O(mn) time and uses O(m) space.

The dynamic programming approach of Cross-Sampling is based on the following lemma:

Lemma 2: Let T and P be a text of length n and a pattern of length m, respectively. Then, for 0 ≤ i ≤ m−1 and 0 ≤ j ≤ n−1, we have that P_i ∝ T_j if and only if one of the following two facts holds:
- P[i] = T[j] and P_{i−1} ∝ T_{j−1};
- P[i] = T[j−1], P[i−1] = T[j], and P_{i−2} ∝ T_{j−2}.
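For concreteness, the swap-match relation of Definition 1 can also be checked by a direct left-to-right scan: whenever P[i] differs from the aligned text character, the only possible repair is a swap of P[i] and P[i+1], so a greedy scan suffices. The following Python sketch is purely illustrative (its names are ours, not taken from the paper) and serves only as a reference against which the dynamic programming solutions below can be tested.

def swap_matches(P, W):
    # True if P can be turned into W by disjoint swaps of adjacent,
    # distinct characters (Definition 1); assumes len(P) == len(W).
    i = 0
    while i < len(P):
        if P[i] == W[i]:
            i += 1
        elif (i + 1 < len(P) and P[i] != P[i + 1]
              and P[i] == W[i + 1] and P[i + 1] == W[i]):
            i += 2                       # swap P[i] and P[i+1]
        else:
            return False
    return True

def prefix_swap_match(P, T, i, j):
    # True if the prefix P_i (of length i+1) has a swapped match with T ending at j.
    return j - i >= 0 and swap_matches(P[:i + 1], T[j - i:j + 1])

For the example of Section I, swap_matches("bbababab", "babbbaab") returns True, and prefix_swap_match implements the relation P_i ∝ T_j used below.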
We use the above lemma to propose a simpler version of Cross-Sampling. Let us define the Boolean matrix S^j_i, for 0 ≤ i ≤ m−1 and 0 ≤ j ≤ n−1, as follows:

  S^j_i = 1 if P_i ∝ T_j, and S^j_i = 0 otherwise.

Thus, the pattern P has a swapped match with T at location j if and only if S^j_{m−1} = 1. The following recursive definition of S^j_i is inspired by Lemma 2, for 2 ≤ i ≤ m−1 and i ≤ j ≤ n−1:

  S^j_i ← (S^{j−1}_{i−1} ∧ (P[i] = T[j])) ∨ (S^{j−2}_{i−2} ∧ (P[i] = T[j−1]) ∧ (P[i−1] = T[j])).   (1)

The base cases for i = 0 and i = 1 are given by

  S^j_0 ← (P[0] = T[j]), for 0 ≤ j ≤ n−1;
  S^j_1 ← (S^{j−1}_0 ∧ (P[1] = T[j])) ∨ ((P[1] = T[j−1]) ∧ (P[0] = T[j])), for 1 ≤ j ≤ n−1.

These recursive relations compute S^j_i in terms of S^{j−1}_{i−1} and S^{j−2}_{i−2}. The recursive relations in Equation 1 give a dynamic programming algorithm for computing the elements of the m × n matrix S iteratively in O(nm) time and O(m) space. Our resultant algorithm, named Prefix-Sampling, is shown in Fig. 1.

  Algorithm Prefix-Sampling(P, m, T, n)
    S[0..m−1, 0..n−1] ← 0   { initially, all entries are set to False }
    for j ← 0 to n−1 do
      S[0, j] ← (P[0] = T[j])
    for j ← 1 to n−1 do
      S[1, j] ← (S[0, j−1] ∧ (P[1] = T[j])) ∨ ((P[1] = T[j−1]) ∧ (P[0] = T[j]))
    for i ← 2 to m−1 do
      for j ← i to n−1 do
        S[i, j] ← (S[i−1, j−1] ∧ (P[i] = T[j])) ∨ (S[i−2, j−2] ∧ (P[i] = T[j−1]) ∧ (P[i−1] = T[j]))
    for j ← m−1 to n−1 do
      if S[m−1, j] then print j   { here, P ∝ T_j }

  Fig. 1. The Prefix-Sampling Algorithm for PMS

The code in Fig. 1 runs in O(nm) time and uses O(nm) space. However, it is a simple matter to modify the code so that it uses only O(m) space: compute the matrix column-wise, keeping track of only three columns at a time, S_1, S_2, and S_3, where S_3 is computed in terms of S_2 and S_1. We mention that the ideas behind Prefix-Sampling first appear in [6] as part of our parallel algorithm for PMS, which runs in O(m²) time using n/(m−1) processors on a linear array model of computation.
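For illustration, Fig. 1 translates directly into the following Python sketch (the naming is ours; like the figure, it keeps the full m × n table and therefore uses O(nm) space):

def prefix_sampling(P, T):
    # S[i][j] = True iff the prefix P_i has a swapped match with T ending at j.
    m, n = len(P), len(T)
    S = [[False] * n for _ in range(m)]
    for j in range(n):                           # base case i = 0
        S[0][j] = (P[0] == T[j])
    if m > 1:
        for j in range(1, n):                    # base case i = 1
            S[1][j] = (S[0][j - 1] and P[1] == T[j]) or \
                      (P[1] == T[j - 1] and P[0] == T[j])
    for i in range(2, m):                        # Equation (1)
        for j in range(i, n):
            S[i][j] = (S[i - 1][j - 1] and P[i] == T[j]) or \
                      (S[i - 2][j - 2] and P[i] == T[j - 1] and P[i - 1] == T[j])
    return [j for j in range(m - 1, n) if S[m - 1][j]]

On the instance traced in the next paragraph (P = babaaab, T = abbababaabbab) this returns [9], in agreement with Table I; replacing the full table by the last three columns gives the O(m)-space version described above.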

We have traced Prefix-Sampling on the following PMS instance (taken from [5] for ease of comparison): let P = babaaab of length m = 7 and T = abbababaabbab of length n = 13. The results (see Table I) show that P has a swapped match with T at location j = 9 (a swap-match corresponds to a non-zero entry in the last row of the table).

  TABLE I. A SAMPLE RUN OF PREFIX-SAMPLING

B. A Simpler Algorithm for APMS

Cantone and Faro [5] showed how to adapt their Cross-Sampling algorithm to solve the APMS problem. Their Approximate-Cross-Sampling algorithm works with two new collections of sets S_j and S'_j, for 0 ≤ j ≤ n−1, where

  S_j  = {(i, k) : 0 ≤ i ≤ m−1 and P_i ∝_k T_j},
  S'_j = {(i, k) : 0 ≤ i ≤ m−1 and (P_{i−1} ∝_k T_{j−1} or i = 0) and P[i] = T[j+1]}.

Clearly, P ∝_k T_j if and only if (m−1, k) ∈ S_j. The dynamic programming approach of Approximate-Cross-Sampling is based on the following lemma:

Lemma 3: Let T and P be a text of length n and a pattern of length m, respectively. Then, for 0 ≤ i ≤ m−1 and 0 ≤ j ≤ n−1, we have that P_i ∝_k T_j if and only if one of the following two facts holds:
- P[i] = T[j] and either (i = 0 and k = 0) or P_{i−1} ∝_k T_{j−1};
- P[i] = T[j−1], P[i−1] = T[j], and either (i = 1 and k = 1) or P_{i−2} ∝_{k−1} T_{j−2}.

We use the above lemma to propose a simpler version of Approximate-Cross-Sampling. We redefine our matrix S^j_i from the previous section so that its definition reads as follows. For 0 ≤ i ≤ m−1 and 0 ≤ j ≤ n−1, we have

  S^j_i = k + 1 if P_i ∝_k T_j, and S^j_i = 0 otherwise.

Thus, there will be k swaps involved in the swap-match of the pattern P with the text T at location j if and only if S^j_{m−1} = k + 1. The following recursive definition of S^j_i is inspired by Lemma 3, for 2 ≤ i ≤ m−1 and i ≤ j ≤ n−1:

  S^j_i ← S^{j−1}_{i−1},     if S^{j−1}_{i−1} ≠ 0 and P[i] = T[j];
  S^j_i ← S^{j−2}_{i−2} + 1, if S^{j−2}_{i−2} ≠ 0 and P[i] = T[j−1] and P[i−1] = T[j];
  S^j_i ← 0,                 otherwise.   (2)

The base case for i = 0 is defined as follows, for 0 ≤ j ≤ n−1:

  S^j_0 ← (P[0] = T[j]).

The base case for i = 1 is defined as follows, for 1 ≤ j ≤ n−1:

  S^j_1 ← 1, if S^{j−1}_0 ≠ 0 and P[1] = T[j];
  S^j_1 ← 2, if P[1] = T[j−1] and P[0] = T[j];
  S^j_1 ← 0, otherwise.

These recursive relations compute S^j_i in terms of S^{j−1}_{i−1} and S^{j−2}_{i−2}. The recursive relations in Equation 2 give a dynamic programming algorithm for computing the elements of the m × n matrix S iteratively in O(nm) time and O(m) space. For lack of space, we do not show the code of our resultant algorithm; however, we show a trace of this algorithm on the problem instance P = babaaab and T = abbababaabbab. The results (see Table II) show that there will be 2 (= S[6, 9] − 1) swaps needed for the swap-match of P with T at location j = 9.

  TABLE II. A SAMPLE RUN OF APPROXIMATE-PREFIX-SAMPLING
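For illustration, Equation 2 admits the following direct Python transcription (a sketch with our own naming, not the paper's original code; a zero entry means no swapped match, otherwise the entry stores k + 1):

def approximate_prefix_sampling(P, T):
    # S[i][j] = k + 1 iff P_i swap-matches T ending at j using k swaps, else 0.
    m, n = len(P), len(T)
    S = [[0] * n for _ in range(m)]
    for j in range(n):                                   # base case i = 0
        S[0][j] = 1 if P[0] == T[j] else 0
    if m > 1:
        for j in range(1, n):                            # base case i = 1
            if S[0][j - 1] and P[1] == T[j]:
                S[1][j] = 1
            elif P[1] == T[j - 1] and P[0] == T[j]:
                S[1][j] = 2
    for i in range(2, m):                                # Equation (2)
        for j in range(i, n):
            if S[i - 1][j - 1] and P[i] == T[j]:
                S[i][j] = S[i - 1][j - 1]
            elif S[i - 2][j - 2] and P[i] == T[j - 1] and P[i - 1] == T[j]:
                S[i][j] = S[i - 2][j - 2] + 1
    return {j: S[m - 1][j] - 1 for j in range(m - 1, n) if S[m - 1][j]}

On P = babaaab and T = abbababaabbab it returns {9: 2}, matching the two swaps reported in Table II.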
C. A Simpler Algorithm for PMS by Scanning the Text Backward

The basic idea of Backward-Cross-Sampling of Campanelli et al. [4] is to search for all occurrences of the pattern in the text by scanning the characters of the text from right to left. In particular, Backward-Cross-Sampling processes the text in fixed-size windows of size m, which are searched for the longest prefix of the pattern that has a swapped match with the text ending at the last position j of the current window. After processing a text window, the largest matched prefix P_i ∝ T_j is computed, and then j is incremented by m − i so as to left-align the current window of the text with P_i. Let P[i−h+1..i] denote the substring of P of length h ending at location i.

The paper [4] defines two collections of sets S^h_j and W^h_j, for 0 ≤ j ≤ n−1 and 0 ≤ h ≤ m, where

  S^h_j = {h−1 ≤ i ≤ m−1 : P[i−h+1..i] ∝ T_j},
  W^h_j = {h ≤ i ≤ m−2 : P[i−h+1..i] ∝ T_j and P[i−h+1] = T[j−h]}.

Observe that P_{h−1} ∝ T_j if and only if (h−1) ∈ S^h_j. By the same token, P ∝ T_j if and only if S^m_j = {m−1}. The Backward-Cross-Sampling algorithm for computing the sets S^h_j is inspired by the following lemma:

Lemma 4: Let T and P be a text of length n and a pattern of length m, respectively. Then, for 0 ≤ j ≤ n−1, 0 ≤ h ≤ m, and h−1 ≤ i ≤ m−1, we have that P[i−h+1..i] ∝ T_j if and only if one of the following two facts holds:
- P[i−h+2..i] ∝ T_j and P[i−h+1] = T[j−h+1];
- P[i−h+3..i] ∝ T_j, P[i−h+1] = T[j−h+2], and P[i−h+2] = T[j−h+1].

We use the above lemma to propose a simpler version of Backward-Cross-Sampling. We define the Boolean matrix S^h[i, j], for 0 ≤ i ≤ m−1, 0 ≤ j ≤ n−1, and 1 ≤ h ≤ m, as follows:

  S^h[i, j] = 1 if P[i−h+1..i] ∝ T_j, and S^h[i, j] = 0 otherwise.

Thus, P_{h−1} ∝ T_j if and only if S^h[h−1, j] = 1. By the same token, the pattern P has a swapped match with T at location j if and only if S^m[m−1, j] = 1. The following recursive definition of S^h[i, j] is inspired by Lemma 4, for 0 ≤ j ≤ n−1, 3 ≤ h ≤ m, and h−1 ≤ i ≤ m−1:

  S^h[i, j] ← (S^{h−1}[i, j] ∧ (P[i−h+1] = T[j−h+1])) ∨ (S^{h−2}[i, j] ∧ (P[i−h+2] = T[j−h+1]) ∧ (P[i−h+1] = T[j−h+2])).   (3)

The base cases for h = 1 and h = 2 are given by

  S^1[i, j] ← (P[i] = T[j]), for 0 ≤ i ≤ m−1 and 0 ≤ j ≤ n−1;
  S^2[i, j] ← (S^1[i, j] ∧ (P[i−1] = T[j−1])) ∨ ((P[i] = T[j−1]) ∧ (P[i−1] = T[j])), for 1 ≤ i ≤ m−1 and 1 ≤ j ≤ n−1.

These recursive relations compute the column S^h_j in terms of S^{h−1}_j and S^{h−2}_j. The recursive relations in Equation 3 give a dynamic programming algorithm that runs in O(nm²) time and O(m) space. Our resultant algorithm, named Backward-Prefix-Sampling, is shown in Fig. 2.

  Algorithm Backward-Prefix-Sampling(P, m, T, n)
  1.  l ← 0; j ← m−1
  2.  while j ≤ n−1 do
  3.    for i ← 0 to m−1 do
  4.      S_1[i] ← (P[i] = T[j])
  5.    if S_1[0] then l ← 1
  6.    for i ← 1 to m−1 do
  7.      S_2[i] ← (S_1[i] ∧ (P[i−1] = T[j−1]))
  8.              ∨ ((P[i] = T[j−1]) ∧ (P[i−1] = T[j]))
  9.    if S_2[1] then l ← 2
  10.   for h ← 3 to m do
  11.     for i ← h−1 to m−1 do
  12.       S_3[i] ← (S_2[i] ∧ (P[i−h+1] = T[j−h+1]))
  13.               ∨ (S_1[i] ∧ (P[i−h+2] = T[j−h+1])
  14.                         ∧ (P[i−h+1] = T[j−h+2]))
  15.     if S_3[h−1] then
  16.       l ← h   { here, P_{h−1} ∝ T_j }
  17.     if (S_2 = 0) ∧ (S_3 = 0) then goto 19
  18.     S_1 ← S_2; S_2 ← S_3
  19.   if l = m then
  20.     print j   { here, P ∝ T_j }
  21.     j ← j + 1
  22.   else j ← j + m − l
  23. end of while (j ≤ n−1)

  Fig. 2. The Backward-Prefix-Sampling Algorithm for PMS

D. A Simpler Algorithm for APMS by Scanning the Text Backward

Our solution from the previous section can be easily extended to solve the APMS problem. We redefine the matrix S^h[i, j] from the previous section so that its definition reads as follows. For 0 ≤ i ≤ m−1, 0 ≤ j ≤ n−1, and 1 ≤ h ≤ m, we have

  S^h[i, j] = k + 1 if P[i−h+1..i] ∝_k T_j, and S^h[i, j] = 0 otherwise.

Thus, P ∝_k T_j if and only if S^m[m−1, j] = k + 1. The following recursive definition of S^h[i, j] is inspired by Lemmas 3 and 4, for 0 ≤ j ≤ n−1, 3 ≤ h ≤ m, and h−1 ≤ i ≤ m−1:

  S^h[i, j] ← S^{h−1}[i, j],     if S^{h−1}[i, j] ≠ 0 and P[i−h+1] = T[j−h+1];
  S^h[i, j] ← S^{h−2}[i, j] + 1, if S^{h−2}[i, j] ≠ 0 and P[i−h+2] = T[j−h+1] and P[i−h+1] = T[j−h+2].   (4)

The base case for h = 1 is defined as follows, for 0 ≤ i ≤ m−1 and 0 ≤ j ≤ n−1:

  S^1[i, j] ← 1, if P[i] = T[j].

The base case for h = 2 is defined as follows, for 1 ≤ i ≤ m−1 and 1 ≤ j ≤ n−1:

  S^2[i, j] ← 1, if S^1[i, j] ≠ 0 and P[i−1] = T[j−1];
  S^2[i, j] ← 2, if P[i] = T[j−1] and P[i−1] = T[j].

The recursive relations in Equation 4 give a dynamic programming algorithm for computing the elements of the matrix S^h[i, j] iteratively in O(nm²) time and O(m) space. For lack of space, we do not show the code of our resultant algorithm; however, we show a trace of this algorithm on the problem instance P = babaaab and T = abbababaabbab. The results (see Table III) show that there will be 2 (= S^m_9[m−1] − 1 = S^7_9[6] − 1 = 3 − 1) swaps needed for the swap-match of P with T at location j = 9.

  TABLE III. A SAMPLE RUN OF APPROXIMATE-BACKWARD-PREFIX-SAMPLING
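For illustration, the window-shifting scheme of Fig. 2 combined with the counters of Equation 4 can be sketched in Python as follows (an illustrative transcription with our own naming, not the paper's original code; it assumes m ≥ 3 and resets the matched-prefix length l at every window):

def approximate_backward_prefix_sampling(P, T):
    # S1/S2/S3 play the roles of the columns S^(h-2), S^(h-1), S^h for the current
    # window ending at j; an entry k + 1 means a swap match with k swaps, 0 means none.
    m, n = len(P), len(T)
    results = {}                                   # location j -> number of swaps
    j = m - 1
    while j <= n - 1:
        S1 = [1 if P[i] == T[j] else 0 for i in range(m)]         # h = 1
        l = 1 if S1[0] else 0
        S2 = [0] * m                                               # h = 2
        for i in range(1, m):
            if S1[i] and P[i - 1] == T[j - 1]:
                S2[i] = S1[i]
            elif P[i] == T[j - 1] and P[i - 1] == T[j]:
                S2[i] = 2
        if S2[1]:
            l = 2
        for h in range(3, m + 1):                                  # Equation (4)
            S3 = [0] * m
            for i in range(h - 1, m):
                if S2[i] and P[i - h + 1] == T[j - h + 1]:
                    S3[i] = S2[i]
                elif S1[i] and P[i - h + 2] == T[j - h + 1] and P[i - h + 1] == T[j - h + 2]:
                    S3[i] = S1[i] + 1
            if S3[h - 1]:
                l = h
                if h == m:
                    results[j] = S3[m - 1] - 1     # swaps of the full match at j
            if not any(S2) and not any(S3):
                break
            S1, S2 = S2, S3
        j = j + 1 if l == m else j + m - l
    return results

On P = babaaab and T = abbababaabbab it returns {9: 2}, in agreement with Table III.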
IV. IMPROVED ALGORITHMS USING BIT-PARALLELISM

We now consider the case of short patterns. In particular, we are interested in the case where the entire pattern can be stored in one word of computer memory. We show that under this condition, all our algorithms from the previous section admit more efficient implementations using bit-parallelism. In particular, Prefix-Sampling and Backward-Prefix-Sampling will then have a linear running time.

Let T and P be a text of length n and a pattern of length m, respectively. Following [5], for each c ∈ Σ, we define the m-bit vector M_c[0..m−1] as follows: M_c[i] = 1 if P[i] = c, and 0 otherwise.

First, we consider Prefix-Sampling (see Fig. 1). The main statement in Prefix-Sampling is the one that computes S_j from S_{j−1} and S_{j−2}. Using bit-parallelism, that statement can be coded as follows:

  S_j ← ((S_{j−1} << 1) & M_{T[j]}) | ((S_{j−2} << 2) & (M_{T[j−1]} & (M_{T[j]} << 1))).

We demonstrate the embedding of bit-parallelism in Prefix-Sampling on the input instance P = babaaab and T = abbababaabbab. In particular, we show how the algorithm would compute the vector S_5 from S_4 = (0000111) and S_3 = (0001110) (see Table I in Section III). (The bit vectors are written from highest- to lowest-ordered bits.) We have M_{T[j]} = M_{T[5]} = M_a = (0111010) and M_{T[j−1]} = M_{T[4]} = M_b = (1000101).

The 0th bit of S_5 is determined as follows:

  S_5 = M_a & (0000001) = (0111010) & (0000001) = (0000000).

Thus, the 0th bit of S_5 is zero. The 1st bit of S_5 is determined as follows. For ease of presentation, we make use of two temporary variables T_1 and T_2. Let

  T_1 = (S_4 << 1) & M_a = (0001110) & (0111010) = (0001010),
  T_2 = T_1 | (M_b & (M_a << 1)) = (0001010) | ((1000101) & (1110100)) = (0001010) | (1000100) = (1001110).

Then,

  S_5 = S_5 | (T_2 & 0^{m−2}10) = (0000000) | ((1001110) & (0000010)) = (0000010).

Thus, the 1st bit of S_5 is 1. The remaining bits of S_5 are computed as follows. Let

  T_2 = T_2 & (T_1 | (S_3 << 2)) = (1001110) & ((0001010) | (0111000)) = (1001110) & (0111010) = (0001010).

Finally,

  S_5 = S_5 | (T_2 & 1^{m−2}00) = (0000010) | ((0001010) & (1111100)) = (0001010).

For lack of space, we chose not to include the code of our Bit-Parallelism-Prefix-Sampling, but to include the code of our Bit-Parallelism-Backward-Prefix-Sampling in Fig. 3, where S_1, S_2, and S_3 are now three m-bit vectors. We demonstrate some of the steps of Bit-Parallelism-Backward-Prefix-Sampling, which runs in O(n) time, on the problem instance P = babaaab and T = abbababaabbab. First, observe that m = |P| = 7, n = |T| = 13, M_a = (0111010), and M_b = (1000101). Line 1 of the code in Fig. 3 sets j to 6. Line 3 sets S_1 to M_{T[6]} = M_b = (1000101). Since S_1[0] = 1, the prefix of P ending at location i = 0 has a swapped match with T at location j = 6. This is true because P[0] = T[6]. Line 5 sets S_2 as follows:

  S_2 = (S_1 & (M_{T[5]} << 1)) | (M_{T[5]} & (M_{T[6]} << 1))
      = (S_1 & (M_a << 1)) | (M_a & (M_b << 1))
      = ((1000101) & (1110100)) | ((0111010) & (0001010))
      = (1000100) | (0001010) = (1001110).

Since S_2[1] = 1, the prefix of P ending at location i = 1 has a swapped match with T at location j = 6. This is true because P[0..1] = ba and T[5..6] = ab.

  Bit-Parallelism-Backward-Prefix-Sampling(m, n)
  1.  S_1 ← S_2 ← S_3 ← 0; l ← 0; j ← m−1
  2.  while j ≤ n−1 do
  3.    S_1 ← M_{T[j]}
  4.    if (S_1 & 0^{m−1}1) ≠ 0 then l ← 1
  5.    S_2 ← (S_1 & (M_{T[j−1]} << 1)) | (M_{T[j−1]} & (M_{T[j]} << 1))
  6.    if (S_2 & 0^{m−2}10) ≠ 0 then l ← 2
  7.    for h ← 3 to m do
  8.      S_3 ← (S_2 & (M_{T[j−h+1]} << (h−1)))
  9.            | (S_1 & (M_{T[j−h+1]} << (h−2))
  10.                  & (M_{T[j−h+2]} << (h−1)))
  11.     if (S_3 & 0^{m−h}10^{h−1}) ≠ 0 then
  12.       l ← h   { here, P_{h−1} ∝ T_j }
  13.     if (S_2 = 0) ∧ (S_3 = 0) then goto 15
  14.     S_1 ← S_2; S_2 ← S_3
  15.   if l = m then
  16.     print j   { here, P ∝ T_j }
  17.     j ← j + 1
  18.   else j ← j + m − l
  19. end-while (j ≤ n−1)

  Fig. 3. The Bit-Parallelism-Backward-Prefix-Sampling Algorithm for PMS
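For completeness, here is an illustrative Python sketch of the bit-parallel version of Prefix-Sampling (the naming is ours, not the paper's original code; Python's unbounded integers stand in for machine words, and the base-case bits 0 and 1 are handled with explicit masks, exactly as in the worked example above):

from collections import defaultdict

def bit_parallel_prefix_sampling(P, T):
    # Bit i of S holds the value S^j_i of Equation (1): 1 iff P_i swap-matches
    # T ending at the current location j.
    m, n = len(P), len(T)
    M = defaultdict(int)                         # M[c] has bit i set iff P[i] == c
    for i, c in enumerate(P):
        M[c] |= 1 << i
    matches = []
    S_prev2 = S_prev1 = 0                        # S_{j-2} and S_{j-1}
    for j in range(n):
        cur = M[T[j]]
        prev = M[T[j - 1]] if j > 0 else 0
        S = cur & 1                                              # bit 0: P[0] = T[j]
        if j > 0:
            ext = (S_prev1 << 1) & cur                           # extend by an exact character
            swp = prev & (cur << 1)                              # adjacent characters swapped
            S |= (ext | swp) & 0b10                              # bit 1 (base case i = 1)
            S |= (ext | ((S_prev2 << 2) & swp)) & ~0b11          # bits i >= 2, Equation (1)
        if (S >> (m - 1)) & 1:
            matches.append(j)                    # P swap-matches T at location j
        S_prev2, S_prev1 = S_prev1, S
    return matches

On P = babaaab and T = abbababaabbab it reports the single location 9; a backward bit-parallel version is obtained in the same way from Fig. 3.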
V. CONCLUDING REMARKS

A drawback of both our Backward-Prefix-Sampling and Campanelli et al.'s Backward-Cross-Sampling is that neither algorithm remembers the length of the prefix matched in previous search attempts. We propose to rectify this issue as follows. Once a largest prefix P_i, for i ≤ m−1, is found to have a swapped match with the text at location j, and after the text window is shifted to the right so as to become left-aligned with the pattern (line 22 in Fig. 2), the following iteration of the j loop (line 2 in Fig. 2) can simply search for the subpattern P[i..m−1] in the text subwindow T[j−m+i..j], and then combine results to expand the largest matched prefix of the pattern in the text window. This modification can be expected to improve the performance of both algorithms in practice.

REFERENCES

[1] A. Amir, Y. Aumann, G.M. Landau, M. Lewenstein, and N. Lewenstein, Pattern Matching With Swaps, Proc. IEEE Symposium on Foundations of Computer Science (FOCS), 1997.
[2] A. Amir, M. Lewenstein, and E. Porat, Approximate Swapped Matching, Information Processing Letters, 83:1, 2002.
[3] A. Amir, R. Cole, R. Hariharan, M. Lewenstein, and E. Porat, Overlap Matching, Inf. Comput., 181:1, 2003.
[4] M. Campanelli, D. Cantone, and S. Faro, A New Algorithm for Efficient Pattern Matching With Swaps, Proc. IWOCA 2009, to appear.
[5] D. Cantone and S. Faro, Pattern Matching With Swaps for Short Patterns in Linear Time, Proc. 35th Intl. Conference on Theory and Practice of Computer Science (SofSem 2009), LNCS 5404, Springer, 2009.
[6] F.B. Chedid, Parallel Pattern Matching With Swaps on a Linear Array, Proc. 10th Intl. Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2010), LNCS 6081, Springer, 2010.
[7] C.S. Iliopoulos and M.S. Rahman, A New Model to Solve the Swap Matching Problem and Efficient Algorithms for Short Patterns, Proc. 34th Intl. Conference on Theory and Practice of Computer Science (SofSem 2008), LNCS 4910, Springer, 2008.
[8] S. Muthukrishnan, New Results and Open Problems Related to Non-Standard Stringology, Proc. 6th Annual Symp. on Combinatorial Pattern Matching, LNCS 937, Springer, 1995.
