Average Case Analysis of the Boyer-Moore Algorithm

Tsung-Hsi Tsai
Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan
November 19, 2003

Abstract. Some limit theorems (including a Berry-Esseen bound) are derived for the number of comparisons taken by the Boyer-Moore algorithm for finding the occurrences of a given pattern in a random Markovian text. Previously, only special variants of this algorithm had been analyzed. We also propose means of computing the limiting constants for the mean and the variance.

1 Introduction

String matching is a huge area with a wide literature and a large number of applications. The basic problem is to identify all occurrences of a given pattern in a large text. In this paper we study the most widely used string-matching algorithms: the Boyer-Moore (BM) algorithm and its simplified version, the Boyer-Moore-Horspool (BMH) algorithm. We are concerned with the probabilistic behavior of the algorithms, as well as with the evaluation of the constants in the limit theorems.

Applications of string matching are found in many communication media, including language processing, graphics, video and sound. Typical applications include key-word search (or replacement) in text-editing software, search engines, phone-number lookup, search in databases such as gene banks, etc. Although the use of exact string matching in computational biology may be limited, exact string matching is the prototype, both in algorithms and in analysis, of the more useful approximate string matching and hidden string search algorithms; see Flajolet et al. [9].

Throughout the paper, A is a set of q characters, T is a text of n characters from A and S is a pattern of m characters from A. We assume that q ≥ 2 and n ≥ m. Although several factors determine the running time of a program, in this paper we deal only with the number of character comparisons; other cost measures may be analyzed similarly.
We denote by C_n = C_n(A) the number of comparisons used by algorithm A. The most naive algorithm for finding all occurrences of a pattern in a text consists in comparing the characters of the pattern consecutively to each m-length substring of the text. The procedure moves to

the next m-length substring as soon as a mismatch occurs. This simple algorithm has C_n = m(n − m + 1) in the worst case (when all pairwise comparisons have been executed). The Knuth-Morris-Pratt (1977) algorithm, which is based on building an automaton from the pattern, avoids comparing the matched part of the text to the pattern more than once and improves the worst case to C_n ≤ 2n − m (see Cole [8]); the average case of C_n/n is close to 1 + 1/q (see Régnier [13]). However, it has been observed that in many applications the average value of C_n/n for several algorithms is less than one (see [12] for recent experiments). Among them, the BM algorithm is considered the most efficient (see Lecroq's Handbook of Exact String-Matching Algorithms [6]). Despite its popularity in diverse applications, probabilistic analysis of the BM algorithms has not received much attention in the literature. Most known probabilistic results are for the BMH algorithm. We first summarize some related results and then present our new ones.

Baeza-Yates, Gonnet and Régnier (1990 & 1992) applied an analytic approach to the average-case analysis of the BMH algorithm. Under the assumption that the text is independent and identically distributed (iid), they derived an exact expression for the linearity constant µ = lim_{n→∞} E(C_n/n). For earlier results, see Barth (1984) and Régnier (1989). The first distributional result for string-matching algorithms was derived by Mahmoud, Smythe and Régnier (1997). They proved the asymptotic normality of C_n via analytic methods (the text being iid) for the BMH algorithm:

    (C_n − µn) / √(Bn) →_D N(0, 1),

where →_D denotes convergence in distribution. However, the value of B is difficult to compute. Smythe (2001) obtained the same asymptotic normality result, but for Markovian input, by a completely probabilistic approach (via the theory of Markov chains).
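For concreteness, the naive matcher described above can be sketched in a few lines (Python is used for illustration; the function name and the convention of counting the failed comparison are ours, not the paper's):

```python
def naive_search(T, P):
    """Naive matcher: slide an m-length window over T and compare left to
    right; move one position right on a mismatch. Returns (occurrence
    positions, number of character comparisons C_n)."""
    m, n = len(P), len(T)
    occurrences, comparisons = [], 0
    for pos in range(n - m + 1):          # one window per text position
        i = 0
        while i < m and T[pos + i] == P[i]:
            comparisons += 1              # successful comparison
            i += 1
        if i < m:
            comparisons += 1              # the mismatching comparison
        else:
            occurrences.append(pos)       # all m characters matched
    return occurrences, comparisons
```

On T = "aaaaaaaa" and P = "aaa" every window is a full match, so the worst case C_n = m(n − m + 1) = 3 · 6 = 18 is attained.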
A simple expression for the linearity constant µ was also derived, but the state space of his Markov chain seems difficult to simplify further, and the evaluation of B remains unclear.

We propose a similar but simpler Markov-chain approach: we consider the open windows together with a distance counter as a Markov chain and derive a Berry-Esseen bound, the law of the iterated logarithm and a better estimate of the mean of C_n, with explicit determination of µ and B, under the assumption that the text is iid and the pattern is fixed. One major drawback of the usual Markovian approach is that the computational complexity grows exponentially as the pattern length increases. Our main contribution is to construct a state space whose size grows at rate O(m²). We are thus able to compute the linearity constant µ for the BM algorithm when the pattern length is of reasonable size. With this construction of the state space, we can also compute the exact value of B for both the BM and the BMH algorithms. All computations can be done efficiently by mathematical software (such as Maple); numerical results are presented in Section 5.

2 Description of the algorithms

For completeness, we describe the BM algorithm in this section; see Boyer and Moore [5] or Lecroq [6] for more details.

Denote the text by T = t_1 t_2 ⋯ t_n and the pattern by S = s_m s_{m−1} ⋯ s_1; note that in this paper we index the pattern from right to left, because BM-type algorithms scan in that direction inside each comparison window.

The BM algorithm

We begin by aligning S under the first open window t_1 ⋯ t_m. Starting with s_1, compare each pattern character to the corresponding character in the open window, from right to left, until a mismatch is detected or the left end of the window is reached (an occurrence is found if all m characters match). Then S is shifted to the right and a new window in the text is opened. The distance shifted (from the current open window to the next) is determined by the following two rules.

Bad-character rule. For all a ∈ A, let

    B(a) := min{ k ≥ 1 : s_{1+k} = a }.

Good-suffix rule. For all 1 ≤ i ≤ m, let

    G(i) := min{ k ≥ 1 : [s_{min{m, i−1+k}}, s_{1+k}] matches exactly a suffix of [s_{i−1}, s_1], and s_{i+k} ≠ s_i if i + k ≤ m },

where [s_i, s_j], i ≥ j, denotes the sequence s_i s_{i−1} ⋯ s_j, and min ∅ := m.

If there are i − 1 matches, the i-th pair is a mismatch and a is the mismatching character in the text, then the distance shifted is max{G(i), B(a) − i + 1}; we call G(i) the good-suffix shift and B(a) − i + 1 the bad-character shift. If all m characters match, then the distance shifted is G(m).

The BMH algorithm

This is a simplification of the BM algorithm using only the bad-character rule: the distance shifted is B(a), where a is the right-most character of the open window. When the first comparison yields a mismatch, i.e. i = 1, the bad-character shift dominates the good-suffix shift, and thus the distances shifted by the BM and BMH algorithms are identical. When q is large (as for Chinese text), it becomes more likely that a mismatch happens right at the first comparison in an open window; in this case the performance of the BM and BMH algorithms is comparable.

We give an example to demonstrate both algorithms. Suppose A = {a, b, c} and S = abaa.
Then G(1) = 2, G(2) = 1, G(3) = 3, G(4) = 3, and B(a) = 1, B(b) = 2, B(c) = 4.
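These tables can be checked mechanically. The sketch below transcribes the two rules literally in Python (the helper names are ours; indexing follows the paper's right-to-left convention, with min ∅ = m):

```python
def s(P, i):
    """s_i in the paper's right-to-left indexing: s_1 is the last character."""
    return P[len(P) - i]

def bad_character(P, alphabet):
    """B(a) = min{k >= 1 : s_{1+k} = a}, with min over the empty set = m."""
    m = len(P)
    return {a: next((k for k in range(1, m) if s(P, 1 + k) == a), m)
            for a in alphabet}

def good_suffix(P):
    """G(i) = min{k >= 1 : the pattern shifted by k agrees with the suffix
    s_{i-1} ... s_1 wherever both are defined, and s_{i+k} != s_i whenever
    i + k <= m}, with min over the empty set = m."""
    m = len(P)
    G = {}
    for i in range(1, m + 1):
        for k in range(1, m):
            agrees = all(s(P, j + k) == s(P, j)
                         for j in range(1, i) if j + k <= m)
            if agrees and (i + k > m or s(P, i + k) != s(P, i)):
                G[i] = k
                break
        else:
            G[i] = m       # empty set: shift by the full pattern length
    return G
```

For S = abaa this reproduces the table above: good_suffix("abaa") returns {1: 2, 2: 1, 3: 3, 4: 3} and bad_character("abaa", "abc") returns {"a": 1, "b": 2, "c": 4}. The BM shift after a mismatch at the i-th comparison on text character a is then max(G[i], B[a] - i + 1).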

Suppose T = acabaacca. The first three windows of the BM algorithm (left) and of the BMH algorithm (right) are aligned as follows.

    a c a b a a c c a            a c a b a a c c a
    a b a a                      a b a a
        a b a a                      a b a a
              a b a a                  a b a a

In the first window, i = 1 and the mismatching character in the text is b, so the distance shifted is 2 for both the BM and the BMH algorithm. In the second window, all four characters match (an occurrence is found), so the distance shifted is G(4) = 3 for the BM algorithm and B(a) = 1 for the BMH algorithm. The next windows of the two algorithms therefore differ. For the BM algorithm, i = 2 and the mismatching character in the text is c, so the distance shifted is max{G(2), B(c) − 2 + 1} = 3. For the BMH algorithm, the right-most character of the window in the text is c, so the distance shifted is B(c) = 4.

3 A Markovian approach

Assume that the text is iid and the pattern is fixed. Let W_j be the j-th open window in the processing of the algorithm. Obviously, {W_j} is a Markov chain with state space A^m. Let g(W_j) be the distance to shift at W_j and f(W_j) the number of comparisons taken inside the window W_j. Let N_n be one plus the total number of shifts,

    N_n = 1 + max{ k : Σ_{j=1}^{k} g(W_j) ≤ n − m < Σ_{j=1}^{k+1} g(W_j) }.    (1)

We now define the pair of random variables

    W̃_n := (W_{N_n}, Z_n),  where  Z_n = n − m − Σ_{j=1}^{N_n − 1} g(W_j).

The random variable Z_n measures the distance between n and the position of the current window. It is also clear that {W̃_n}_{n≥m} is a Markov chain with state space A^m × {0, 1, ..., m − 1}. Note that not every state is visited by the algorithms.

Reduce the size of the state space

We now propose a means of reducing the size of the state space by properly combining states. Let Γ = {D_1, ..., D_K} be a partition of A^m. To combine the states and at the same time preserve the values of g and f, we require that Γ satisfy the following two conditions.

C1. Every element in the same class gives the same value under f and the same value under g.
C2. The new transition probabilities are well-defined; that is, for all D_i, D_j ∈ Γ,

    p(a, D_j) = p(b, D_j)  for all a, b ∈ D_i,

where p(a, D) is the transition probability from the state a into the class D.

With conditions C1 and C2, {W_j} can be restricted naturally to the state space Γ. We now construct a partition of A^m satisfying conditions C1 and C2. We need a little more notation. Recall that S = s_m ⋯ s_1. Here we use the symbol x to denote an arbitrary character of A, and ⟨a_1, ..., a_k⟩ to denote a character of A different from each of a_1, ..., a_k. We write [s_{i+l−1} ⋯ s_i] for the class of strings of the form

    x ⋯ x s_{i+l−1} ⋯ s_{i+1} s_i        (m − l free characters followed by l pattern characters),

and [⟨a_1, ..., a_k⟩ s_{i+l−2} ⋯ s_i] for the class of strings of the form

    x ⋯ x ⟨a_1, ..., a_k⟩ s_{i+l−2} ⋯ s_{i+1} s_i        (m − l free characters followed by l constrained characters),

where 1 ≤ i ≤ m and 1 ≤ l ≤ m − i + 1. We call l the length of [s_{i+l−1} ⋯ s_i] or of [⟨a_1, ..., a_k⟩ s_{i+l−2} ⋯ s_i]. Let

    Γ_i = { [s_m ⋯ s_i], [⟨s_m⟩ s_{m−1} ⋯ s_i], [⟨s_{m−1}⟩ s_{m−2} ⋯ s_i], ..., [⟨s_{i+1}⟩ s_i] },  for 1 ≤ i ≤ m.

Let Γ_h be the finest disjoint partition of A^m generated by the classes in {Γ_i} and the class [⟨s_m, ..., s_2, s_1⟩]. Note that [s_m ⋯ s_1] ∈ Γ_h and that all classes in Γ_h are of the form [⟨s_m, ..., s_2, s_1⟩], [s_m ⋯ s_i] or [⟨a_1, ..., a_k⟩ s_{i+l−2} ⋯ s_i], where the set {a_i} consists of those s_j with s_{j−1} ⋯ s_{j−l+1} = s_{i+l−2} ⋯ s_i.

Lemma 1. The partition Γ_h satisfies conditions C1 and C2 with respect to the BMH algorithm.

Proof. The right-most character of each class determines the value of the shift function g, except for the class [⟨s_m, ..., s_2, s_1⟩], which gives the value m. Thus each class of Γ_h has only one value under g. We need to show that each class of Γ_h also has only one value under the function f. Let α ∈ Γ_h. It suffices to examine the following two cases. (i) Suppose α = [s_m ⋯ s_i]. The case i = 1 is trivial, so suppose i > 1. If s_m ⋯ s_i ≠ s_{m−i+1} ⋯ s_1, then there is at least one mismatch and thus α has only one value under f. If s_m ⋯ s_i = s_{m−i+1} ⋯ s_1, then [s_m ⋯ s_1] ⊆ α, contrary to the disjointness of Γ_h. (ii) Suppose α = [⟨a_1, ..., a_k⟩ s_{i+l−2} ⋯ s_i]. The same argument applies when s_{i+l−2} ⋯ s_i ≠ s_{l−1} ⋯ s_1. If s_{i+l−2} ⋯ s_i = s_{l−1} ⋯ s_1, then s_l ∈ {a_1, ..., a_k}, and this forces ⟨a_1, ..., a_k⟩ to be a mismatch. Thus α has only one value under f, and this completes the proof of condition C1.
Let α, β ∈ Γ_h have lengths l_α and l_β, respectively. If l_β ≤ g(α), then condition C2 holds trivially. Suppose l_β > g(α), and let β = [s_m ⋯ s_{m−l_β+1}] or β = [⟨b_1, ..., b_κ⟩ s_{j+l_β−2} ⋯ s_j]. Let β_1 be the g(α)-length suffix of β and β_2 the (l_β − g(α))-length prefix of β, namely

    β_1 = [s_{m−l_β+g(α)} ⋯ s_{m−l_β+1}]    (respectively [s_{j+g(α)−1} ⋯ s_j]),
    β_2 = [s_m ⋯ s_{m−l_β+g(α)+1}]          (respectively [⟨b_1, ..., b_κ⟩ s_{j+l_β−2} ⋯ s_{j+g(α)}]);

see the alignment below.

    current window:  ⋯ [—————— α ——————]
    next window:          [— β_2 —][— β_1 —]
    fresh text:                    [—— t ——]

    (α has length l_α; β_2 has length l_β − g(α); β_1 and t have length g(α))

where t is the g(α)-length stretch of new random text. To establish condition C2, we need to show that either

    α ∩ β_2 = ∅  or  α ⊆ β_2.    (2)

The relations (2) are easily checked when l_β − g(α) < l_α. Suppose l_β − g(α) > l_α. Clearly α ⊄ β_2; we claim that α ∩ β_2 = ∅. Let β_3 be the l_α-length suffix of β_2. If α ≠ β_3, then there is at least one mismatch and thus α ∩ β_2 = ∅. If α = β_3, then β′_2 ⊆ α for some β′_2 ∈ Γ_h, contrary to the disjointness of Γ_h.

Suppose l_β − g(α) = l_α. It is again easy to check (2) when α = [s_m ⋯ s_{m−l_α+1}]. Suppose α = [⟨a_1, ..., a_k⟩ s_{i+l_α−2} ⋯ s_i]. Let α_1 be the (l_α − 1)-length suffix of α and β_4 the (l_β − g(α) − 1)-length suffix of β_2, namely

    α_1 = [s_{i+l_α−2} ⋯ s_i],
    β_4 = [s_{m−1} ⋯ s_{m−l_β+g(α)+1}]    (respectively [s_{j+l_β−2} ⋯ s_{j+g(α)}]).

If β_4 ≠ α_1, then α ∩ β_2 = ∅. Suppose β_4 = α_1. There are two further cases. (i) If β = [s_m ⋯ s_{m−l_β+1}], then s_m ∈ {a_1, ..., a_k} by the definition of the set {a_i}; thus α ∩ β_2 = ∅. (ii) If β = [⟨b_1, ..., b_κ⟩ s_{j+l_β−2} ⋯ s_j], then the same reasoning as above gives {b_1, ..., b_κ} ⊆ {a_1, ..., a_k}; thus α ⊆ β_2. This completes the proof of condition C2.

It is easy to see that Γ_h satisfies conditions C1 and C2 with respect to a modified BM algorithm that combines the good-suffix rule with the bad-character rule of the BMH algorithm. For the BM algorithm itself, however, a finer partition is needed, since some classes of Γ_h may carry more than one value of g. We then have to decompose those classes according to the values of g, trace back which classes lead into the split ones, and decompose those further. Such a tracing-back decomposition continues until the length of the split classes is less than two. The proof of condition C2 for the resulting partition is left to the reader. In this way we generate a partition Γ_b satisfying conditions C1 and C2 with respect to the BM algorithm.

For example, suppose A = {1, 2, 3, 4} and S = 432121. Then

    Γ_1 = { [432121], [⟨4⟩32121], [⟨3⟩2121], [⟨2⟩121], [⟨1⟩21], [⟨2⟩1] },
    Γ_2 = { [43212], [⟨4⟩3212], [⟨3⟩212], [⟨2⟩12], [⟨1⟩2] },
    Γ_3 = { [4321], [⟨4⟩321], [⟨3⟩21], [⟨2⟩1] },
    Γ_4 = { [432], [⟨4⟩32], [⟨3⟩2] },
    Γ_5 = { [43], [⟨4⟩3] },
    Γ_6 = { [4] }.
Cancelling one copy of the class [⟨2⟩1] that appears in both Γ_1 and Γ_3, combining [⟨1⟩21] in Γ_1 with [⟨3⟩21] in Γ_3, and combining [⟨1⟩2] in Γ_2 with [⟨3⟩2] in Γ_4, gives the partition Γ_h. In Γ_h, the class [⟨1,3⟩21] carries two shift values: g(221) = 2 and g(421) = 3. We therefore decompose [⟨1,3⟩21] into [221] and [421]. But then p([⟨1,3⟩2], [221]) and p([⟨1,3⟩2], [421]) are not well-defined. The

solution is to decompose the class [⟨1,3⟩2] into [22] and [42]. Then we have the partitions

    Γ_h = { [432121], [⟨4⟩32121], [⟨3⟩2121], [⟨2⟩121],
            [43212], [⟨4⟩3212], [⟨3⟩212], [⟨2⟩12],
            [4321], [⟨4⟩321], [⟨1,3⟩21], [⟨2⟩1],
            [432], [⟨4⟩32], [⟨1,3⟩2],
            [43], [⟨4⟩3],
            [4] },

    Γ_b = { [432121], [⟨4⟩32121], [⟨3⟩2121], [⟨2⟩121],
            [43212], [⟨4⟩3212], [⟨3⟩212], [⟨2⟩12],
            [4321], [⟨4⟩321], [221], [421], [⟨2⟩1],
            [432], [⟨4⟩32], [22], [42],
            [43], [⟨4⟩3],
            [4] }.

Clearly Σ_i |Γ_i| = (m² + m)/2 and |Γ_h| ≤ 1 + Σ_i |Γ_i|. The number of additional classes produced by the decomposition is small; thus |Γ_b| is O(m²).

Determination of the transition probability

Let Γ be a state space constructed as above for {W_j}. Let θ denote the distribution of W_1, and let α, β ∈ Γ. With the same notation as above, the transition probability is

    p(α, β) = θ(β)      when g(α) ≥ l_β,
              θ(β_1)    when g(α) < l_β and α ⊆ β_2,
              0         when g(α) < l_β and α ∩ β_2 = ∅.

Define the space Γ̃ = {(α, i) : α ∈ Γ, 0 ≤ i < g(α)}. Observe that Γ̃ is a state space for {W̃_n}. Let p̃ be the transition probability for {W̃_n} with respect to Γ̃. It is easy to see that

    p̃((α, i), (β, j)) = 1          when 0 ≤ i ≤ g(α) − 2 and (β, j) = (α, i + 1),
                         p(α, β)    when i = g(α) − 1 and j = 0.

4 Asymptotic behavior of the number of comparisons

Recall that C_n is the number of character comparisons taken by the algorithm. Observe that C_n = Σ_{j=1}^{N_n} f(W_j). By the strong law of large numbers for the Markov chain {W_j} (see Chung [7]), there exist two constants µ_f and µ_g such that

    (1/n) Σ_{j=1}^{n} f(W_j) → µ_f  a.s.  and  (1/n) Σ_{j=1}^{n} g(W_j) → µ_g  a.s.    (3)

Clearly N_n → ∞ almost surely as n → ∞. From (3) it follows that

    C_n/n = ( Σ_{j=1}^{N_n} f(W_j) / N_n ) · ( N_n / n ) → µ_f/µ_g =: µ^(S)  a.s.,

since Σ_{j=1}^{N_n−1} g(W_j) ≤ n − m < Σ_{j=1}^{N_n} g(W_j) forces n/N_n → µ_g almost surely. Thus the limiting mean of C_n/n is µ^(S) by the bounded convergence theorem. Define the weight function

    h(W̃_n) = f(W_{N_n}) 1{Z_n = 0}.

Then C_n = Σ_{j=1}^{n} h(W̃_j). We apply the limit theorems of Chung [7] to the Markov chain {W̃_n} and obtain the following theorems.

Theorem 1.

    E(C_n) = µ^(S) n + o(√n),    (4)
    Var(C_n) = B^(S) n + o(n),    (5)

where B^(S) ≥ 0 is a constant specified in (8).

Theorem 2. If B^(S) > 0, then

    lim sup_{n→∞} (C_n − µ^(S) n) / √(2 B^(S) n ln ln n) = 1  a.s.,    (6)

and the corresponding lim inf equals −1 almost surely.

The limit theorems of Chung also yield a central limit theorem. Here we can obtain a stronger result by applying the Berry-Esseen theorem of Bolthausen [4] to the Markov chain {W̃_n}. Bolthausen's theorem covers Markov chains with a possibly infinite discrete state space; its conditions (a third-moment condition on a return time and a first-moment condition on an entrance time) hold automatically for Markov chains with finitely many states.

Theorem 3. If B^(S) > 0, then

    sup_{−∞ < x < ∞} | P( (C_n − µ^(S) n)/√(B^(S) n) < x ) − Φ(x) | = O(n^{−1/2}),    (7)

where Φ(x) is the standard normal distribution function.

Explicit determination of the quantities µ^(S) and B^(S)

Let P be the transition matrix and Ω the class of recurrent states of {W_j}. Let π be the invariant probability measure with respect to P (i.e. πP = π). Then

    µ_f = Σ_{α∈Ω} f(α) π_α  and  µ_g = Σ_{α∈Ω} g(α) π_α.

Let P̃ be the transition matrix and Ω̃ the class of recurrent states of {W̃_n}, and let π̃ be the invariant probability measure with respect to P̃. It is easy to check that π̃_(α,i) = π_α/µ_g. From Chung [7, p. 88],

    B^(S) = Σ_{α∈Ω̃} h_c(α)² π̃_α + 2 Σ_{α∈Ω̃} Σ_{β≠α_0} h_c(α) π̃_α h_c(β) π̃_β ( m_{α,α_0} + m_{α_0,β} − m_{α,β} ),    (8)

where α_0 is a fixed state (the value of B^(S) is independent of the choice of α_0), h_c(α) = h(α) − µ^(S),
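For small m and q, the identities µ_f = Σ f(α)π_α and µ_g = Σ g(α)π_α can also be evaluated without any state-space reduction, by working on the full window chain over A^m. The Python sketch below does this for the BMH algorithm with iid uniform text; the transition p(w, w′) = q^(−g(w)) whenever w′ agrees with w shifted by g(w) follows from the window overlap. The function names and the toy pattern are our own illustration, not the paper's reduced construction over Γ:

```python
from itertools import product

def bmh_window_stats(P, alphabet):
    """Return f, g for BMH: f(w) = comparisons spent in window w
    (right-to-left scan), g(w) = B(last character of w)."""
    m = len(P)
    B = {a: m for a in alphabet}       # characters absent from the shift table
    for j in range(m - 1):             # rightmost occurrence in P[0:m-1] wins
        B[P[j]] = m - 1 - j
    def f(w):
        i = 0
        while i < m and w[m - 1 - i] == P[m - 1 - i]:
            i += 1
        return m if i == m else i + 1  # a failed comparison counts too
    def g(w):
        return B[w[-1]]
    return f, g

def bmh_mu(P, alphabet, iters=500):
    """mu^(S) = mu_f / mu_g on the full (unreduced) window chain over A^m,
    assuming iid uniform text."""
    m, q = len(P), len(alphabet)
    f, g = bmh_window_stats(P, alphabet)
    windows = ["".join(w) for w in product(alphabet, repeat=m)]
    index = {w: i for i, w in enumerate(windows)}
    pi = [1.0 / len(windows)] * len(windows)   # distribution of W_1
    for _ in range(iters):                     # power iteration: pi <- pi P
        new = [0.0] * len(windows)
        for w, pw in zip(windows, pi):
            shift = g(w)
            # the next window keeps w[shift:] and appends `shift` fresh chars
            for tail in product(alphabet, repeat=shift):
                new[index[w[shift:] + "".join(tail)]] += pw / q ** shift
        pi = new
    mu_f = sum(f(w) * pw for w, pw in zip(windows, pi))
    mu_g = sum(g(w) * pw for w, pw in zip(windows, pi))
    return mu_f / mu_g

mu = bmh_mu("abaa", "abc")
```

Since 1 ≤ f, g ≤ m, any such µ^(S) necessarily lies between 1/m and m; the exponential q^m size of this unreduced chain is exactly the drawback that the O(m²) partition Γ_b avoids.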

and m_{α,β} is the expected time of first entrance into state β starting from state α. The matrix M = {m_{α,β}} can be obtained from P̃ via

    M = (I − Z + E Z_dg) D,

where I is the identity matrix, Z is the fundamental matrix for P̃, E is the matrix with all entries equal to 1, Z_dg results from Z by setting the off-diagonal entries to 0, and D is the diagonal matrix with i-th entry 1/π̃_i (see Kemeny and Snell [10]).

5 Evaluation of the constants of the limiting mean and variance

In this section we assume that each t_i is uniformly distributed on A = {1, ..., q}. Averages are taken over all patterns of the same length. First, we compute the constant µ^(S) for the BM algorithm. Figure 1 shows the maximum, average and minimum values of µ^(S) for m = 2 to 10 and q = 2, 4, 8.

Figure 1: The maximum, average and minimum values of µ^(S) for the BM algorithm (m = 2 to 10; q = 2, 4, 8).

Patterns attaining the minimum constant are of the form S = 11⋯1, while patterns attaining the maximum constant are shown in the table below.
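The Kemeny and Snell identity above is easy to check numerically. The sketch below does so for a hypothetical two-state chain (our own toy example, not one of the paper's window chains), whose first-entrance times can also be computed by hand as geometric waiting times:

```python
def mean_first_entrance(P_mat, pi):
    """M = (I - Z + E Z_dg) D, with Z = (I - P + Pi)^(-1) the fundamental
    matrix (Pi has every row equal to pi), E the all-ones matrix, Z_dg the
    diagonal part of Z, and D = diag(1/pi_i). Hardcoded for 2x2 chains so
    the sketch needs no linear-algebra library."""
    # A = I - P + Pi
    A = [[(1.0 if r == c else 0.0) - P_mat[r][c] + pi[c] for c in range(2)]
         for r in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    Z = [[A[1][1] / det, -A[0][1] / det],      # 2x2 inverse of A
         [-A[1][0] / det, A[0][0] / det]]
    # Entrywise: m_{r,c} = (1{r=c} - Z[r][c] + Z[c][c]) / pi[c]
    return [[((1.0 if r == c else 0.0) - Z[r][c] + Z[c][c]) / pi[c]
             for c in range(2)] for r in range(2)]
```

For P = [[1/2, 1/2], [1/4, 3/4]] with π = (1/3, 2/3), the formula gives M = [[3, 2], [4, 1.5]]: the diagonal entries are the mean return times 1/π_i, and m_{1,2} = 1/0.5 = 2, m_{2,1} = 1/0.25 = 4 agree with the direct geometric-waiting-time computation.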

[Table: patterns attaining the maximum value of µ^(S), for q = 2, 4, 8; the entries are not recoverable from this copy.]

Let

    σ = ( (1/q^m) Σ_S (µ^(S) − µ)² )^{1/2},

where µ is the average, be the standard deviation of µ^(S). The values of σ are shown in Figure 2.

Figure 2: The standard deviation of the linearity constant for the BM algorithm (q = 2, 4, 8).

We compare our theoretical result with a recent experimental result [12] (see Figure 3). In those experiments, the text consists of 150,000 randomly generated characters, and 50 randomly generated patterns of each length are searched for. The experimental result is very close to our theoretical result for m ≤ 10 and q = 8. The closeness may be explained by the graphs in Figures 2 and 4.

Note that our approach becomes less efficient when computing the exact constant for longer patterns, say of length > 20. However, a good estimate of the constant can be obtained by the following procedure. The idea is to use a property of the good-suffix rule: there is a K such that G(i) is constant for all i ≥ K (usually K ≈ log_q m). Note that the bad-character shift is always less than or equal to G(K). Thus the class [s_K ⋯ s_1] has only one value under g, namely G(K). Choose m_0 ≥ K. We consider the truncated pattern s_{m_0} ⋯ s_1, find the corresponding state space Γ and compute the invariant probability π.

Figure 3: The experimental C_n/n vs. the average of the theoretical constant µ^(S) for the BM algorithm in the cases q = 2 and q = 8.

Note that every class of Γ has only one value under g and only one value under f, except for the class α_0 = [s_{m_0} ⋯ s_1]. Thus

    µ_g = Σ_{α∈Γ} g(α) π_α,  and  µ_f ≤ Σ_{α∈Γ} f(α) π_α + (m − m_0 − 1) π_{α_0}.

Note that π_{α_0} = q^{−m_0} µ_g ≤ q^{−m_0} m_0. We can choose a suitable m_0 to control the order of the error. We compute the constant for each pattern individually, so the number of patterns cannot be too large. In the right-hand part of Figure 3, we use random samples of 2000 patterns to estimate the average of µ^(S).

Finally, we compute the value of B^(S) for the BM and the BMH algorithm for m = 2 to 8 in the cases q = 2, 4, 8. The results are presented in Figures 4 and 5. Note that B^(S) = 0 in the special case S = 12 and A = {1, 2}.
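An experiment in the spirit of [12] takes only a few lines. The sketch below is our own BMH implementation with comparison counting (the text length mirrors the setup described above, and the q = 8 alphabet and pattern are our choices); the observed ratio C_n/n can then be set against the theoretical constants:

```python
import random

def bmh_search(T, P):
    """Boyer-Moore-Horspool: scan each window right to left; afterwards
    shift by B(a), where a is the window's right-most text character.
    Returns (occurrence positions, number of character comparisons)."""
    m, n = len(P), len(T)
    B = {}
    for j in range(m - 1):               # rightmost occurrence wins
        B[P[j]] = m - 1 - j              # B(a) for characters in s_2..s_m
    occ, comparisons, pos = [], 0, 0
    while pos <= n - m:
        i = 0
        while i < m and T[pos + m - 1 - i] == P[m - 1 - i]:
            comparisons += 1
            i += 1
        if i < m:
            comparisons += 1             # the failed comparison also counts
        else:
            occ.append(pos)              # all m characters matched
        pos += B.get(T[pos + m - 1], m)  # characters not in the pattern: m
    return occ, comparisons

# A small Monte Carlo run on iid uniform text (q = 8):
random.seed(1)
T = "".join(random.choice("abcdefgh") for _ in range(150_000))
_, C_n = bmh_search(T, "abcd")
ratio = C_n / len(T)                     # empirical C_n / n
```

On the worked example of Section 2, bmh_search("acabaacca", "abaa") finds the single occurrence at (0-based) position 2 in 6 comparisons, matching the three windows traced there.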

Figure 4: The maximum, average and minimum values of B^(S) for the BM algorithm (m = 2 to 8; q = 2, 4, 8).

6 Conclusion

In this paper we investigated only the original BM and BMH algorithms. Our approach may be readily modified for other variants of the BM algorithm. The extension from iid to Markovian input also seems straightforward; however, the transition matrix remains complex, and a further simplification of µ^(S) and B^(S) seems difficult. We conclude by mentioning some interesting extensions for the BM algorithm.

Random pattern. Evaluate the constants µ and B under the assumption that the pattern is random with a fixed length. Characterize the asymptotic behavior as the pattern length m → ∞ (while remaining small compared with n).

Maximum and minimum. Investigate the asymptotic behavior of the maximum and minimum values of µ^(S), and the combinatorial properties of the patterns attaining these values.

Large deviations. Derive estimates for large-deviation probabilities via the constructed Markov chain.

References

[1] Baeza-Yates, R.A., Gonnet, G.H. and Régnier, M. Analysis of Boyer-Moore type string searching algorithms. Proc. 1st ACM-SIAM Symposium on Discrete Algorithms, San Francisco (1990).

Figure 5: The maximum, average and minimum values of B^(S) for the BMH algorithm (m = 2 to 8; q = 2, 4, 8).

[2] Baeza-Yates, R.A. and Régnier, M. Average running time of the Boyer-Moore-Horspool algorithm. Theoretical Computer Science 92 (1992).

[3] Barth, G. An analytical comparison of two string matching algorithms. Information Processing Letters 18 (1984).

[4] Bolthausen, E. The Berry-Esseen theorem for functionals of discrete Markov chains. Z. Wahrsch. Verw. Gebiete 54 (1980), no. 1.

[5] Boyer, R.S. and Moore, J.S. A fast string searching algorithm. Comm. ACM 20 (1977).

[6] Charras, C. and Lecroq, T. Handbook of exact string-matching algorithms. lecroq/string/.

[7] Chung, K.L. Markov chains with stationary transition probabilities. Second edition, Springer-Verlag, New York.

[8] Cole, R. Tight bounds on the complexity of the Boyer-Moore string matching algorithm. SIAM J. Comput. 23 (1994).

[9] Flajolet, P., Szpankowski, W. and Vallée, B. Hidden word statistics. Preprint (2002).

[10] Kemeny, J.G. and Snell, J.L. Finite Markov chains. Van Nostrand, 1960; Springer.

[11] Mahmoud, H.M., Smythe, R.T. and Régnier, M. Analysis of the Boyer-Moore-Horspool string-matching heuristic. Random Structures Algorithms 10 (1997).

[12] Michailidis, P.D. and Margaritis, K.G. On-line string matching algorithms: survey and experimental results. Intern. J. Computer Math. 76 (2001).

[13] Régnier, M. Knuth-Morris-Pratt algorithm: an analysis. Mathematical Foundations of Computer Science 1989 (Porabka, Poland, 1989), Lecture Notes in Computer Science, Vol. 379, Springer, Berlin, 1989.

[14] Smythe, R.T. The Boyer-Moore-Horspool heuristic with Markovian input. Random Structures Algorithms 18 (2001).


More information

Combinatorial Proof of the Hot Spot Theorem

Combinatorial Proof of the Hot Spot Theorem Combinatorial Proof of the Hot Spot Theorem Ernie Croot May 30, 2006 1 Introduction A problem which has perplexed mathematicians for a long time, is to decide whether the digits of π are random-looking,

More information

LAWS OF LARGE NUMBERS AND TAIL INEQUALITIES FOR RANDOM TRIES AND PATRICIA TREES

LAWS OF LARGE NUMBERS AND TAIL INEQUALITIES FOR RANDOM TRIES AND PATRICIA TREES LAWS OF LARGE NUMBERS AND TAIL INEQUALITIES FOR RANDOM TRIES AND PATRICIA TREES Luc Devroye School of Computer Science McGill University Montreal, Canada H3A 2K6 luc@csmcgillca June 25, 2001 Abstract We

More information

Multiple Choice Tries and Distributed Hash Tables

Multiple Choice Tries and Distributed Hash Tables Multiple Choice Tries and Distributed Hash Tables Luc Devroye and Gabor Lugosi and Gahyun Park and W. Szpankowski January 3, 2007 McGill University, Montreal, Canada U. Pompeu Fabra, Barcelona, Spain U.

More information

Random Bernstein-Markov factors

Random Bernstein-Markov factors Random Bernstein-Markov factors Igor Pritsker and Koushik Ramachandran October 20, 208 Abstract For a polynomial P n of degree n, Bernstein s inequality states that P n n P n for all L p norms on the unit

More information

arxiv: v1 [cs.ds] 9 Apr 2018

arxiv: v1 [cs.ds] 9 Apr 2018 From Regular Expression Matching to Parsing Philip Bille Technical University of Denmark phbi@dtu.dk Inge Li Gørtz Technical University of Denmark inge@dtu.dk arxiv:1804.02906v1 [cs.ds] 9 Apr 2018 Abstract

More information

Heuristics for The Whitehead Minimization Problem

Heuristics for The Whitehead Minimization Problem Heuristics for The Whitehead Minimization Problem R.M. Haralick, A.D. Miasnikov and A.G. Myasnikov November 11, 2004 Abstract In this paper we discuss several heuristic strategies which allow one to solve

More information

An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching

An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching An Adaptive Finite-State Automata Application to the problem of Reducing the Number of States in Approximate String Matching Ricardo Luis de Azevedo da Rocha 1, João José Neto 1 1 Laboratório de Linguagens

More information

Finding Consensus Strings With Small Length Difference Between Input and Solution Strings

Finding Consensus Strings With Small Length Difference Between Input and Solution Strings Finding Consensus Strings With Small Length Difference Between Input and Solution Strings Markus L. Schmid Trier University, Fachbereich IV Abteilung Informatikwissenschaften, D-54286 Trier, Germany, MSchmid@uni-trier.de

More information

The NFA Segments Scan Algorithm

The NFA Segments Scan Algorithm The NFA Segments Scan Algorithm Omer Barkol, David Lehavi HP Laboratories HPL-2014-10 Keyword(s): formal languages; regular expression; automata Abstract: We present a novel way for parsing text with non

More information

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms

A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms A Multiple Sliding Windows Approach to Speed Up String Matching Algorithms Simone Faro Thierry Lecroq University of Catania, Italy University of Rouen, LITIS EA 4108, France Symposium on Eperimental Algorithms

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

String Matching. Jayadev Misra The University of Texas at Austin December 5, 2003

String Matching. Jayadev Misra The University of Texas at Austin December 5, 2003 String Matching Jayadev Misra The University of Texas at Austin December 5, 2003 Contents 1 Introduction 1 2 Rabin-Karp Algorithm 3 3 Knuth-Morris-Pratt Algorithm 5 3.1 Informal Description.........................

More information

How Many Holes Can an Unbordered Partial Word Contain?

How Many Holes Can an Unbordered Partial Word Contain? How Many Holes Can an Unbordered Partial Word Contain? F. Blanchet-Sadri 1, Emily Allen, Cameron Byrum 3, and Robert Mercaş 4 1 University of North Carolina, Department of Computer Science, P.O. Box 6170,

More information

Lecture 4: September 19

Lecture 4: September 19 CSCI1810: Computational Molecular Biology Fall 2017 Lecture 4: September 19 Lecturer: Sorin Istrail Scribe: Cyrus Cousins Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

More information

CS 455/555: Mathematical preliminaries

CS 455/555: Mathematical preliminaries CS 455/555: Mathematical preliminaries Stefan D. Bruda Winter 2019 SETS AND RELATIONS Sets: Operations: intersection, union, difference, Cartesian product Big, powerset (2 A ) Partition (π 2 A, π, i j

More information

INF 4130 / /8-2014

INF 4130 / /8-2014 INF 4130 / 9135 26/8-2014 Mandatory assignments («Oblig-1», «-2», and «-3»): All three must be approved Deadlines around: 25. sept, 25. oct, and 15. nov Other courses on similar themes: INF-MAT 3370 INF-MAT

More information

Maximal Unbordered Factors of Random Strings arxiv: v1 [cs.ds] 14 Apr 2017

Maximal Unbordered Factors of Random Strings arxiv: v1 [cs.ds] 14 Apr 2017 Maximal Unbordered Factors of Random Strings arxiv:1704.04472v1 [cs.ds] 14 Apr 2017 Patrick Hagge Cording 1 and Mathias Bæk Tejs Knudsen 2 1 DTU Compute, Technical University of Denmark, phaco@dtu.dk 2

More information

On Boyer-Moore Preprocessing

On Boyer-Moore Preprocessing On Boyer-Moore reprocessing Heikki Hyyrö Department of Computer Sciences University of Tampere, Finland Heikki.Hyyro@cs.uta.fi Abstract robably the two best-known exact string matching algorithms are the

More information

Efficient Sequential Algorithms, Comp309

Efficient Sequential Algorithms, Comp309 Efficient Sequential Algorithms, Comp309 University of Liverpool 2010 2011 Module Organiser, Igor Potapov Part 2: Pattern Matching References: T. H. Cormen, C. E. Leiserson, R. L. Rivest Introduction to

More information

Notes 1 : Measure-theoretic foundations I

Notes 1 : Measure-theoretic foundations I Notes 1 : Measure-theoretic foundations I Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Wil91, Section 1.0-1.8, 2.1-2.3, 3.1-3.11], [Fel68, Sections 7.2, 8.1, 9.6], [Dur10,

More information

String Matching with Variable Length Gaps

String Matching with Variable Length Gaps String Matching with Variable Length Gaps Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind Technical University of Denmark Abstract. We consider string matching with variable length

More information

Avoiding Approximate Squares

Avoiding Approximate Squares Avoiding Approximate Squares Narad Rampersad School of Computer Science University of Waterloo 13 June 2007 (Joint work with Dalia Krieger, Pascal Ochem, and Jeffrey Shallit) Narad Rampersad (University

More information

By allowing randomization in the verification process, we obtain a class known as MA.

By allowing randomization in the verification process, we obtain a class known as MA. Lecture 2 Tel Aviv University, Spring 2006 Quantum Computation Witness-preserving Amplification of QMA Lecturer: Oded Regev Scribe: N. Aharon In the previous class, we have defined the class QMA, which

More information

The 123 Theorem and its extensions

The 123 Theorem and its extensions The 123 Theorem and its extensions Noga Alon and Raphael Yuster Department of Mathematics Raymond and Beverly Sackler Faculty of Exact Sciences Tel Aviv University, Tel Aviv, Israel Abstract It is shown

More information

EXPECTED WORST-CASE PARTIAL MATCH IN RANDOM QUADTRIES

EXPECTED WORST-CASE PARTIAL MATCH IN RANDOM QUADTRIES EXPECTED WORST-CASE PARTIAL MATCH IN RANDOM QUADTRIES Luc Devroye School of Computer Science McGill University Montreal, Canada H3A 2K6 luc@csmcgillca Carlos Zamora-Cura Instituto de Matemáticas Universidad

More information

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007. Proofs, Strings, and Finite Automata CS154 Chris Pollett Feb 5, 2007. Outline Proofs and Proof Strategies Strings Finding proofs Example: For every graph G, the sum of the degrees of all the nodes in G

More information

Gärtner-Ellis Theorem and applications.

Gärtner-Ellis Theorem and applications. Gärtner-Ellis Theorem and applications. Elena Kosygina July 25, 208 In this lecture we turn to the non-i.i.d. case and discuss Gärtner-Ellis theorem. As an application, we study Curie-Weiss model with

More information

CPT+: A Compact Model for Accurate Sequence Prediction

CPT+: A Compact Model for Accurate Sequence Prediction CPT+: A Compact Model for Accurate Sequence Prediction Ted Gueniche 1, Philippe Fournier-Viger 1, Rajeev Raman 2, Vincent S. Tseng 3 1 University of Moncton, Canada 2 University of Leicester, UK 3 National

More information

Multiple Pattern Matching

Multiple Pattern Matching Multiple Pattern Matching Stephen Fulwider and Amar Mukherjee College of Engineering and Computer Science University of Central Florida Orlando, FL USA Email: {stephen,amar}@cs.ucf.edu Abstract In this

More information

Asymptotic and Exact Poissonized Variance in the Analysis of Random Digital Trees (joint with Hsien-Kuei Hwang and Vytas Zacharovas)

Asymptotic and Exact Poissonized Variance in the Analysis of Random Digital Trees (joint with Hsien-Kuei Hwang and Vytas Zacharovas) Asymptotic and Exact Poissonized Variance in the Analysis of Random Digital Trees (joint with Hsien-Kuei Hwang and Vytas Zacharovas) Michael Fuchs Institute of Applied Mathematics National Chiao Tung University

More information

University of Chicago Autumn 2003 CS Markov Chain Monte Carlo Methods. Lecture 7: November 11, 2003 Estimating the permanent Eric Vigoda

University of Chicago Autumn 2003 CS Markov Chain Monte Carlo Methods. Lecture 7: November 11, 2003 Estimating the permanent Eric Vigoda University of Chicago Autumn 2003 CS37101-1 Markov Chain Monte Carlo Methods Lecture 7: November 11, 2003 Estimating the permanent Eric Vigoda We refer the reader to Jerrum s book [1] for the analysis

More information

The More for Less -Paradox in Transportation Problems with Infinite-Dimensional Supply and Demand Vectors

The More for Less -Paradox in Transportation Problems with Infinite-Dimensional Supply and Demand Vectors The More for Less -Paradox in Transportation Problems with Infinite-Dimensional Supply and Demand Vectors Ingo Althoefer, Andreas Schaefer April 22, 2004 Faculty of Mathematics and Computer Science Friedrich-Schiller-University

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching Goodrich, Tamassia Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Brute-Force Pattern Matching ( 11.2.1) The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift

More information

Markov chains and the number of occurrences of a word in a sequence ( , 11.1,2,4,6)

Markov chains and the number of occurrences of a word in a sequence ( , 11.1,2,4,6) Markov chains and the number of occurrences of a word in a sequence (4.5 4.9,.,2,4,6) Prof. Tesler Math 283 Fall 208 Prof. Tesler Markov Chains Math 283 / Fall 208 / 44 Locating overlapping occurrences

More information

A mathematical model for a copolymer in an emulsion

A mathematical model for a copolymer in an emulsion J Math Chem (2010) 48:83 94 DOI 10.1007/s10910-009-9564-y ORIGINAL PAPER A mathematical model for a copolymer in an emulsion F. den Hollander N. Pétrélis Received: 3 June 2007 / Accepted: 22 April 2009

More information

Computing repetitive structures in indeterminate strings

Computing repetitive structures in indeterminate strings Computing repetitive structures in indeterminate strings Pavlos Antoniou 1, Costas S. Iliopoulos 1, Inuka Jayasekera 1, Wojciech Rytter 2,3 1 Dept. of Computer Science, King s College London, London WC2R

More information

Quantum pattern matching fast on average

Quantum pattern matching fast on average Quantum pattern matching fast on average Ashley Montanaro Department of Computer Science, University of Bristol, UK 12 January 2015 Pattern matching In the traditional pattern matching problem, we seek

More information

Approximate Counting via the Poisson-Laplace-Mellin Method (joint with Chung-Kuei Lee and Helmut Prodinger)

Approximate Counting via the Poisson-Laplace-Mellin Method (joint with Chung-Kuei Lee and Helmut Prodinger) Approximate Counting via the Poisson-Laplace-Mellin Method (joint with Chung-Kuei Lee and Helmut Prodinger) Michael Fuchs Department of Applied Mathematics National Chiao Tung University Hsinchu, Taiwan

More information

Competitive Weighted Matching in Transversal Matroids

Competitive Weighted Matching in Transversal Matroids Competitive Weighted Matching in Transversal Matroids Nedialko B. Dimitrov and C. Greg Plaxton University of Texas at Austin 1 University Station C0500 Austin, Texas 78712 0233 {ned,plaxton}@cs.utexas.edu

More information

The Number of Runs in a String: Improved Analysis of the Linear Upper Bound

The Number of Runs in a String: Improved Analysis of the Linear Upper Bound The Number of Runs in a String: Improved Analysis of the Linear Upper Bound Wojciech Rytter Instytut Informatyki, Uniwersytet Warszawski, Banacha 2, 02 097, Warszawa, Poland Department of Computer Science,

More information

String Range Matching

String Range Matching String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings

More information

arxiv: v2 [math.pr] 4 Sep 2017

arxiv: v2 [math.pr] 4 Sep 2017 arxiv:1708.08576v2 [math.pr] 4 Sep 2017 On the Speed of an Excited Asymmetric Random Walk Mike Cinkoske, Joe Jackson, Claire Plunkett September 5, 2017 Abstract An excited random walk is a non-markovian

More information

Decomposing oriented graphs into transitive tournaments

Decomposing oriented graphs into transitive tournaments Decomposing oriented graphs into transitive tournaments Raphael Yuster Department of Mathematics University of Haifa Haifa 39105, Israel Abstract For an oriented graph G with n vertices, let f(g) denote

More information

The Kemeny constant of a Markov chain

The Kemeny constant of a Markov chain The Kemeny constant of a Markov chain Peter Doyle Version 1.0 dated 14 September 009 GNU FDL Abstract Given an ergodic finite-state Markov chain, let M iw denote the mean time from i to equilibrium, meaning

More information

6.842 Randomness and Computation March 3, Lecture 8

6.842 Randomness and Computation March 3, Lecture 8 6.84 Randomness and Computation March 3, 04 Lecture 8 Lecturer: Ronitt Rubinfeld Scribe: Daniel Grier Useful Linear Algebra Let v = (v, v,..., v n ) be a non-zero n-dimensional row vector and P an n n

More information

Pattern Matching in Constrained Sequences

Pattern Matching in Constrained Sequences Pattern Matching in Constrained Sequences Yongwook Choi and Wojciech Szpankowski Department of Computer Science Purdue University W. Lafayette, IN 47907 U.S.A. Email: ywchoi@purdue.edu, spa@cs.purdue.edu

More information

Optimal spaced seeds for faster approximate string matching

Optimal spaced seeds for faster approximate string matching Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching

More information

THE N-VALUE GAME OVER Z AND R

THE N-VALUE GAME OVER Z AND R THE N-VALUE GAME OVER Z AND R YIDA GAO, MATT REDMOND, ZACH STEWARD Abstract. The n-value game is an easily described mathematical diversion with deep underpinnings in dynamical systems analysis. We examine

More information

Compression in the Space of Permutations

Compression in the Space of Permutations Compression in the Space of Permutations Da Wang Arya Mazumdar Gregory Wornell EECS Dept. ECE Dept. Massachusetts Inst. of Technology Cambridge, MA 02139, USA {dawang,gww}@mit.edu University of Minnesota

More information

Mathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio ( )

Mathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio ( ) Mathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio (2014-2015) Etienne Tanré - Olivier Faugeras INRIA - Team Tosca October 22nd, 2014 E. Tanré (INRIA - Team Tosca) Mathematical

More information

Estimates for probabilities of independent events and infinite series

Estimates for probabilities of independent events and infinite series Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences

More information

Likelihood Analysis of Gaussian Graphical Models

Likelihood Analysis of Gaussian Graphical Models Faculty of Science Likelihood Analysis of Gaussian Graphical Models Ste en Lauritzen Department of Mathematical Sciences Minikurs TUM 2016 Lecture 2 Slide 1/43 Overview of lectures Lecture 1 Markov Properties

More information

Optimal spaced seeds for faster approximate string matching

Optimal spaced seeds for faster approximate string matching Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching

More information

The Jordan canonical form

The Jordan canonical form The Jordan canonical form Francisco Javier Sayas University of Delaware November 22, 213 The contents of these notes have been translated and slightly modified from a previous version in Spanish. Part

More information

SORTING SUFFIXES OF TWO-PATTERN STRINGS.

SORTING SUFFIXES OF TWO-PATTERN STRINGS. International Journal of Foundations of Computer Science c World Scientific Publishing Company SORTING SUFFIXES OF TWO-PATTERN STRINGS. FRANTISEK FRANEK and WILLIAM F. SMYTH Algorithms Research Group,

More information

On the static assignment to parallel servers

On the static assignment to parallel servers On the static assignment to parallel servers Ger Koole Vrije Universiteit Faculty of Mathematics and Computer Science De Boelelaan 1081a, 1081 HV Amsterdam The Netherlands Email: koole@cs.vu.nl, Url: www.cs.vu.nl/

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 3 9/10/2008 CONDITIONING AND INDEPENDENCE

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 3 9/10/2008 CONDITIONING AND INDEPENDENCE MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 3 9/10/2008 CONDITIONING AND INDEPENDENCE Most of the material in this lecture is covered in [Bertsekas & Tsitsiklis] Sections 1.3-1.5

More information

An asymptotically tight bound on the adaptable chromatic number

An asymptotically tight bound on the adaptable chromatic number An asymptotically tight bound on the adaptable chromatic number Michael Molloy and Giovanna Thron University of Toronto Department of Computer Science 0 King s College Road Toronto, ON, Canada, M5S 3G

More information

Finding all covers of an indeterminate string in O(n) time on average

Finding all covers of an indeterminate string in O(n) time on average Finding all covers of an indeterminate string in O(n) time on average Md. Faizul Bari, M. Sohel Rahman, and Rifat Shahriyar Department of Computer Science and Engineering Bangladesh University of Engineering

More information

Containment restrictions

Containment restrictions Containment restrictions Tibor Szabó Extremal Combinatorics, FU Berlin, WiSe 207 8 In this chapter we switch from studying constraints on the set operation intersection, to constraints on the set relation

More information

String Matching. Thanks to Piotr Indyk. String Matching. Simple Algorithm. for s 0 to n-m. Match 0. for j 1 to m if T[s+j] P[j] then

String Matching. Thanks to Piotr Indyk. String Matching. Simple Algorithm. for s 0 to n-m. Match 0. for j 1 to m if T[s+j] P[j] then String Matching Thanks to Piotr Indyk String Matching Input: Two strings T[1 n] and P[1 m], containing symbols from alphabet Σ Goal: find all shifts 0 s n-m such that T[s+1 s+m]=p Example: Σ={,a,b,,z}

More information

Online Interval Coloring and Variants

Online Interval Coloring and Variants Online Interval Coloring and Variants Leah Epstein 1, and Meital Levy 1 Department of Mathematics, University of Haifa, 31905 Haifa, Israel. Email: lea@math.haifa.ac.il School of Computer Science, Tel-Aviv

More information

Dyadic diaphony of digital sequences

Dyadic diaphony of digital sequences Dyadic diaphony of digital sequences Friedrich Pillichshammer Abstract The dyadic diaphony is a quantitative measure for the irregularity of distribution of a sequence in the unit cube In this paper we

More information

Negative Examples for Sequential Importance Sampling of Binary Contingency Tables

Negative Examples for Sequential Importance Sampling of Binary Contingency Tables Negative Examples for Sequential Importance Sampling of Binary Contingency Tables Ivona Bezáková Alistair Sinclair Daniel Štefankovič Eric Vigoda arxiv:math.st/0606650 v2 30 Jun 2006 April 2, 2006 Keywords:

More information

SMSTC (2007/08) Probability.

SMSTC (2007/08) Probability. SMSTC (27/8) Probability www.smstc.ac.uk Contents 12 Markov chains in continuous time 12 1 12.1 Markov property and the Kolmogorov equations.................... 12 2 12.1.1 Finite state space.................................

More information

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes Discovering Most Classificatory Patterns for Very Expressive Pattern Classes Masayuki Takeda 1,2, Shunsuke Inenaga 1,2, Hideo Bannai 3, Ayumi Shinohara 1,2, and Setsuo Arikawa 1 1 Department of Informatics,

More information

A conjecture on the alphabet size needed to produce all correlation classes of pairs of words

A conjecture on the alphabet size needed to produce all correlation classes of pairs of words A conjecture on the alphabet size needed to produce all correlation classes of pairs of words Paul Leopardi Thanks: Jörg Arndt, Michael Barnsley, Richard Brent, Sylvain Forêt, Judy-anne Osborn. Mathematical

More information

Explicit expressions for the moments of the size of an (s, s + 1)-core partition with distinct parts

Explicit expressions for the moments of the size of an (s, s + 1)-core partition with distinct parts arxiv:1608.02262v2 [math.co] 23 Nov 2016 Explicit expressions for the moments of the size of an (s, s + 1)-core partition with distinct parts Anthony Zaleski November 24, 2016 Abstract For fixed s, the

More information

Random matrices: Distribution of the least singular value (via Property Testing)

Random matrices: Distribution of the least singular value (via Property Testing) Random matrices: Distribution of the least singular value (via Property Testing) Van H. Vu Department of Mathematics Rutgers vanvu@math.rutgers.edu (joint work with T. Tao, UCLA) 1 Let ξ be a real or complex-valued

More information

Markov Chains CK eqns Classes Hitting times Rec./trans. Strong Markov Stat. distr. Reversibility * Markov Chains

Markov Chains CK eqns Classes Hitting times Rec./trans. Strong Markov Stat. distr. Reversibility * Markov Chains Markov Chains A random process X is a family {X t : t T } of random variables indexed by some set T. When T = {0, 1, 2,... } one speaks about a discrete-time process, for T = R or T = [0, ) one has a continuous-time

More information