The NFA Segments Scan Algorithm

Omer Barkol, David Lehavi
HP Laboratories
HPL

Keyword(s): formal languages; regular expression; automata

Abstract: We present a novel way of parsing text with non-deterministic finite automata. For "real life" regular expressions and text, our algorithm scans only a fraction of the characters, and performs a small number of operations for each of these characters (for synthetic worst-case scenarios, it would perform worse than classical algorithms). Although there are similar approaches, our algorithm is far simpler and less resource consuming than the alternatives we are aware of.

External Posting Date: February 21, 2014 [Fulltext]
Internal Posting Date: February 21, 2014 [Fulltext]
Approved for External Publication

Copyright 2014 Hewlett-Packard Development Company, L.P.

The NFA segments scan algorithm

Omer Barkol and David Lehavi
HP Labs Israel

Abstract. We present a novel way of parsing text with non-deterministic finite automata. For real life regular expressions and text, our algorithm scans only a fraction of the characters, and performs a small number of operations for each of these characters (for synthetic worst-case scenarios, it would perform worse than classical algorithms). Although there are similar approaches, our algorithm is far simpler and less resource consuming than the alternatives we are aware of.

1 Introduction

The pattern matching problem calls for discovering whether a given string x is in a language L. In the case where L is the language of strings containing a given word as a substring, there are several fast (and by now classical) algorithms - see [BM] and [KMP]. In the case where L is the language of strings containing one word out of a given set, there are two (again, rather classical) algorithms - see [AC], [C-W]. In the case where L is a general regular language, we are aware of two approaches: a rather complicated approach presented in [WW], and a rather simple but resource consuming one presented in [Ke] (in which one has to maintain an entire suffix tree for each state in the automaton corresponding to the regular language). Our approach is somewhat similar to the one presented in [Ke], but it avoids the big memory overhead, and is easier to analyze and generalize.

Our method is motivated by our observations on the structure of real life regular expressions on the one hand, and the moral of the Boyer-Moore algorithm on the other: many real life regular expressions are composed of contiguous words connected by either "or" operations or Kleene-* operations on a single node in the automaton. Adapting the Boyer-Moore philosophy, we match the words between the Kleene-*'s, and jump ahead in order to match the next word(s). We summarize this approach in the following principles:

- At any stage of the execution, one should hold all the possible sub-matches of the processed sub-string to the automaton.
- Regarding the non-deterministic finite automaton as a directed graph, instead of storing a sub-match as a path on the automaton, store the path's endpoints.
- Instead of advancing one character at a time over easily matched pieces of the string/automaton and matching a contiguous piece of the string to a path, one can jump and attempt to match another piece of the string to another part of the automaton, and then these paths should be glued.

2 Preliminaries

Notations: For each string x we denote the ith character of x by x_i. We denote by [i, j] the segment (or set) of integers {i, i+1, ..., j-1, j}. Given a non-deterministic automaton M = (Q, Σ, δ, q_0, {q_F}) we define a path on the automaton to be a map f : [i, j] → Q such that for each i ≤ k < j, f(k+1) ∈ δ(f(k), σ) for some σ ∈ Σ. We then say that f[i, j] is a path from q to q′ if f(i) = q and f(j) = q′. We define the distance between two states q, q′ ∈ Q:

    dist(q, q′) := min over f[i, j] a path from q to q′ of (j − i).

We assume that M does not admit sinks (i.e. nodes with self loops as the only outgoing edges).

Before adding some less standard definitions to support our algorithm, we present Thompson's algorithm as given in [Th]:

Algorithm 1: Thompson's algorithm
Input: x = (x_0, ..., x_{|x|-1}) ∈ Σ^{|x|}, M
Output: True if M matches x; False otherwise
 1  p ← 0
 2  Z ← {q_0}
 3  while p < |x| do
 4      if q_F ∈ Z then return True                         // match!
 5      Z ← {q′ | ∃q ∈ Z : q′ ∈ δ(q, x_p)}                   // crawl right
 6      p ← p + 1
 7      if Z = ∅ then return False                           // no active states
 8  end
 9  return False                                             // reached end of string

We now add a few non-standard definitions:

    φ(q) = True  ⇔  ∀σ ∈ Σ : q ∈ δ(q, σ), or q = q_F.
    ψ(q) = True  ⇔  ∃x ∈ Σ*, x ≠ ε : q ∈ δ(q, x), or q = q_0, or q = q_F.

Example 1. If our automaton is the one corresponding to the regular expression .*a.*bca, described by the following diagram:

    [Diagram: start → q_0 --a--> q_1 --b--> q_2 --c--> q_3 --a--> q_F, with a '.' self loop on each of q_0 and q_1]

then φ(q) = ψ(q) = True if and only if q ∈ {q_0, q_1, q_F}.
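For concreteness, the following is a minimal executable sketch of Algorithm 1 over the automaton of Example 1. The encoding (the names STATES, ALPHABET, delta and thompson_match) is our own illustration and not part of the paper; any per-character set-of-successors representation of δ would do.

    # Hypothetical encoding of the Example 1 NFA for the regular expression ".*a.*bca".
    STATES = ["q0", "q1", "q2", "q3", "qF"]
    ALPHABET = set("abc")          # used by the preprocessing sketches further below

    def delta(q, c):
        """delta(q, c): the set of successor states of q on character c."""
        nxt = set()
        if q == "q0":
            nxt.add("q0")                      # '.' self loop
            if c == "a":
                nxt.add("q1")
        elif q == "q1":
            nxt.add("q1")                      # '.' self loop
            if c == "b":
                nxt.add("q2")
        elif q == "q2" and c == "c":
            nxt.add("q3")
        elif q == "q3" and c == "a":
            nxt.add("qF")
        return nxt

    def thompson_match(x, delta, q0="q0", qF="qF"):
        """Algorithm 1 (Thompson): advance the whole front Z one character at a
        time; the front is checked once more after the last character."""
        Z = {q0}
        for c in x:
            if qF in Z:
                return True                                    # match!
            Z = {q2 for q in Z for q2 in delta(q, c)}          # crawl right
            if not Z:
                return False                                   # no active states
        return qF in Z

    # thompson_match("xxabca", delta) -> True,  thompson_match("abcb", delta) -> False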

Intuitively, and keeping our motivation (presented in the introduction) in mind, one should think about nodes satisfying φ as nodes from which we always want to jump. There is no point in reading one character at a time if the only possible path we are trying to extend is one ending with a node satisfying φ(q); in a sense this is exactly the Boyer-Moore approach: if the match of the pattern fails and you have to start looking from the beginning of the pattern (which is the equivalent of a node satisfying φ(q)), you should jump as far as possible before starting. The intuition behind nodes satisfying ψ(q) is more heuristic: these are nodes which typically, and in real life, have a better chance to match. From an algorithmic point of view, the difference between nodes satisfying φ(q) and nodes satisfying ψ(q) is that once we jump from a node satisfying ψ but not φ, we have to glue back segments later.

We also define the following for every q ∈ Q:

    Ψ(q) := {q′ | ∃ path f : [1, m] → Q from q to q′ such that ∀j ∈ [2, m−1] : ¬ψ(f(j))},
    l(q) := max over q′ ∈ Ψ(q) of dist(q, q′).

E.g., in Example 1 above we would have

    Ψ(q_0) = {q_0, q_1},  Ψ(q_1) = {q_1, q_2, q_3, q_F},  Ψ(q_2) = {q_2, q_3, q_F},  Ψ(q_3) = {q_3, q_F},  Ψ(q_F) = {q_F},
    l(q_0) = 1,  l(q_1) = 3,  l(q_2) = 2,  l(q_3) = 1,  l(q_F) = 0.

Note that, as M does not admit sinks, l > 0 for any non-terminal state (either accepting or not). Note also that Ψ and l are computable using reverse BFS from q_F. The set Ψ(q) is simply the set of nodes of the automaton which are reachable from q along paths whose interior nodes do not satisfy ψ; these are the nodes to which we jump (i.e. start a new match of another path) if our current path ends with a node satisfying ψ(q), whereas l(q) is the size of the jump on the string.

Given a string x ∈ Σ*, and a finite disjoint union I = ∪_{i=1}^{t} [a_i, b_i] of ordered intervals inside ℕ, we define P_I to be the set of maps f : I → Q such that

    0 ∈ I ⟹ f(0) = q_0,
    ∀i : f(a_{i+1}) ∈ Ψ(f(b_i)),
    ∀k, i with k ∈ [a_i, b_i − 1] : f(k + 1) ∈ δ(f(k), x_k).

E.g. in the automaton from Example 1, given the string bad and writing functions as sets of ordered pairs, we have

    P_{[0,2] ∪ [5,5]} = { {(0, q_0), (1, q_0), (2, q_1), (5, q_1)},
                          {(0, q_0), (1, q_0), (2, q_1), (5, q_2)},
                          {(0, q_0), (1, q_0), (2, q_1), (5, q_3)},
                          {(0, q_0), (1, q_0), (2, q_1), (5, q_F)} }.

Finally, we define S_I to be the set of pairs of endpoints of the non-contiguous segments of the functions of P_I; i.e. for the automaton in Example 1 and the same string as above we would have

    S_{[0,2] ∪ [5,5]} = { {(q_0, q_1)}, {(q_0, q_1), (q_2, q_2)}, {(q_0, q_1), (q_3, q_3)}, {(q_0, q_1), (q_F, q_F)} }.
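The preprocessed data φ, ψ, Ψ and l can be obtained by straightforward graph searches on the automaton. The sketch below is one possible such preprocessing step, written against the delta/STATES/ALPHABET encoding of the previous sketch; the function names are ours, and the computation follows the definitions above rather than any particular implementation described in the paper.

    def succ(delta, states, alphabet):
        """All-character successor map: succ(...)[q] is the union of delta(q, sigma) over sigma."""
        return {q: {q2 for c in alphabet for q2 in delta(q, c)} for q in states}

    def phi(q, delta, alphabet, qF="qF"):
        """phi(q): q has a self loop on every character, or q is the accepting state."""
        return q == qF or all(q in delta(q, c) for c in alphabet)

    def psi(q, delta, states, alphabet, q0="q0", qF="qF"):
        """psi(q): q lies on a nonempty cycle, or q is the initial or the accepting state."""
        if q in (q0, qF):
            return True
        s = succ(delta, states, alphabet)
        seen, frontier = set(), set(s[q])
        while frontier:                         # BFS over the successors of q
            if q in frontier:
                return True
            seen |= frontier
            frontier = {w for u in frontier for w in s[u]} - seen
        return False

    def big_psi_and_l(q, delta, states, alphabet, q0="q0", qF="qF"):
        """Psi(q) and l(q): states reachable from q along paths whose interior nodes
        all fail psi, and the largest shortest-path distance from q to any of them."""
        s = succ(delta, states, alphabet)
        is_psi = {u: psi(u, delta, states, alphabet, q0, qF) for u in states}
        reach, frontier = {q}, [q]
        while frontier:                         # restricted reachability for Psi(q)
            nxt = []
            for u in frontier:
                if u != q and is_psi[u]:
                    continue                    # psi-nodes may end a path but not be interior
                for v in s[u]:
                    if v not in reach:
                        reach.add(v)
                        nxt.append(v)
            frontier = nxt
        dist, layer = {q: 0}, [q]               # unrestricted BFS distances from q
        while layer:
            nxt = []
            for u in layer:
                for v in s[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        nxt.append(v)
            layer = nxt
        return reach, max(dist[u] for u in reach)

    # With the Example 1 encoding: big_psi_and_l("q1", delta, STATES, ALPHABET)
    # returns ({'q1', 'q2', 'q3', 'qF'}, 3), i.e. Psi(q_1) and l(q_1) above.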

3 The Segments Scan Algorithm

We are now ready to present our segments scan algorithm. The algorithm implicitly uses interval unions I of only two intervals: [0, p] ∪ [p + b, p + e] (intuitively, p represents a point which Thompson's algorithm has surely reached, and the second segment is eventually supposed to glue to the first; we jump ahead in this fashion for the same reason we do so in the Boyer-Moore algorithm: if we fail, we want to do so while advancing as much as we can on the string). Instead of finding a compact representation of P_I we find a compact representation of S_I: the possible values of f(p) (intuitively: the points we jump from) will be represented by a union of two sets A, B, depending on whether the nodes satisfy φ or not. The pairs corresponding to the second interval of S_I will be represented by a set of pairs denoted S (continuing with our example from above, for S_{[0,2] ∪ [5,5]} we have S = {(q_1, q_1), (q_2, q_2), (q_3, q_3), (q_F, q_F)}). We denote π_2 S = {q | ∃q′ : (q′, q) ∈ S} (i.e. the projection on the second coordinate); we use the analogous notation π_1 for the projection on the first coordinate. We let O, O′ be two boolean oracles (note that usually these oracles are simple functions depending on S, ψ, φ - see Example 2 below):

Algorithm 2: The Segments Scan Algorithm
Input: x = (x_0, ..., x_{|x|-1}) ∈ Σ^{|x|}, M, preprocessed data φ, ψ, Ψ, l
Output: True if M matches x; False otherwise
 1  p, b, e ← 0
 2  S ← {(q_0, q_0)}
 3  A ← ∅
 4  B ← ∅
 5  while p + e ≤ |x| do
 6      if b = 0 then
 7          if q_F ∈ π_2 S then return True                              // match!
 8          if O or p = e = 0 then                                       // jump
 9              B ← π_2 S
10              A ← A ∪ {q ∈ B | φ(q)}
11              p ← p + e
12              b, e ← min over q′ ∈ A ∪ B of l(q′)
13              S ← {(q, q′) | q ∈ A ∪ B, q′ ∈ Ψ(q)}
14      if q_F ∉ π_2 S and e > b and O′ then                             // string end or crawl right
15          if p + e = |x| then return False                             // reached end of string
16          S ← {(q, q″) | ∃q′ : (q, q′) ∈ S, q″ ∈ δ(q′, x_{p+e})}
17          e ← e + 1
18      else                                                             // crawl left
19          b ← b − 1
20          S ← {(q″, q′) | ∃q : (q, q′) ∈ S, q ∈ δ(q″, x_{p+b})}
21          if b = 0 then S ← {(q, q′) ∈ S | q ∈ A ∪ B}                  // glue
22      if S = ∅ then return False                                       // no active segments
23      if ∀(q, q′) ∈ S : φ(q) then b ← 0
24  end
25  return False                                                         // passed end of string
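As an illustration, the jump step (lines 9-13) can be transcribed almost verbatim once φ, Ψ and l are available as lookup tables. The sketch below assumes dictionaries phi (state → bool), big_psi (state → set of states) and l (state → int), e.g. as produced by the preprocessing sketch above; it is not the full algorithm.

    def jump(S, A, p, e, phi, big_psi, l):
        """The jump step of Algorithm 2 (lines 9-13).  phi maps states to booleans,
        big_psi maps states to sets of states, and l maps states to integers."""
        B = {q2 for (_, q2) in S}                              # line 9:  B <- pi_2 S
        A = A | {q for q in B if phi[q]}                       # line 10
        p = p + e                                              # line 11
        b = e = min(l[q] for q in A | B)                       # line 12
        S = {(q, q2) for q in A | B for q2 in big_psi[q]}      # line 13
        return S, A, B, p, b, e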

Example 2. In our experiments (see Section 5) we considered three pairs of oracles:

    O_1 : b = 0 or ∃q ∈ π_2 S : ψ(q)        O′_1 : ∃q ∈ π_2 S : ψ(q)
    O_2 : b = 0                             O′_2 : ∃q ∈ π_2 S : ψ(q)
    O_3 : False                             O′_3 : True
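Read as predicates over the algorithm's state (taking the condition on π_2 S to mean that some state of the front satisfies ψ), the three pairs can be sketched as follows; the names below are our own:

    def some_front_state_is_psi(S, is_psi):
        """Some right endpoint of S (i.e. some state of pi_2 S) satisfies psi."""
        return any(is_psi[q2] for (_, q2) in S)

    # The three oracle pairs (O, O') of Example 2, as functions of (b, S, is_psi).
    ORACLE_PAIRS = {
        1: (lambda b, S, is_psi: b == 0 or some_front_state_is_psi(S, is_psi),
            lambda b, S, is_psi: some_front_state_is_psi(S, is_psi)),
        2: (lambda b, S, is_psi: b == 0,
            lambda b, S, is_psi: some_front_state_is_psi(S, is_psi)),
        3: (lambda b, S, is_psi: False,
            lambda b, S, is_psi: True),
    }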

Theorem 1. The segments scan algorithm returns the same truth value as Thompson's algorithm.

In order to prove the theorem we first claim two lemmas:

Lemma 1. Immediately after line 6 in Thompson's algorithm, the set Z is the set of endpoints of the paths in P_{[0,p]}.

The proof of this lemma is quite straightforward; see [Th] for the details. We turn to prove the invariant our algorithm maintains:

Lemma 2. Immediately after each of the lines 13, 17, 21, 23 in the segments scan algorithm:

    ∀f ∈ P_{[0,p] ∪ [p+b,p+e]} : either f(p) ∈ A ∪ B, or (b = 0 and f(p) ∈ A ∪ B ∪ π_1 S); moreover, (f(p + b), f(p + e)) ∈ S.
    ∀(q, q′) ∈ S : ∃f ∈ P_{[0,p] ∪ [p+b,p+e]} : f(p + b) = q, f(p + e) = q′, f(p) ∈ A ∪ B.

Proof. The initialized values are I = {[0, 0]}, p, b, e = 0, A, B = ∅, and S = {(q_0, q_0)}; thus, as f(0) = q_0 for all f ∈ P_{[0,0]}, the two properties hold. We will show that if the properties hold at the beginning of the loop (i.e. at line 6) then they hold at each of the other lines. We thus induct on the number of times we encounter each of these lines. We now separate into cases:

After line 13: We get to this line either by not entering the if statement of line 6, in which case all values are the same and thus the hypothesis holds, or we did, and then b = 0. We now use the fact that

    [0, p] ∪ [p + b, p + e] = [0, p] ∪ [p + 0, p + e] = [0, p + e],   so   P_{[0,p] ∪ [p+0,p+e]} = P_{[0,p+e]}.

By the induction hypothesis, (f(p + b), f(p + e)) ∈ S, and thus for the new p′ = p + e it holds that f(p′) = f(p + e) ∈ π_2 S = B. Also, for the new values b′, e′, note that b′ = e′, and that by the assignment of line 13, (f(p′ + b′), f(p′ + e′)) = (f(p′ + b′), f(p′ + b′)) ∈ S. Moreover, the only pairs (q, q′) that were added to S in these lines are (f(p′ + b′), f(p′ + b′)), where obviously f(p′) ∈ A ∪ B.

After line 17: By the induction hypothesis, the claim holds when the program was last before line 15; it holds after line 17 by the definition of δ.

After line 21: Here we separate into two cases.

If b > 1, then by the induction hypothesis the claim holds when the program was last before line 18; it holds before line 21 by the definition of δ, and the then-part of the if in line 21 is never executed.

If b = 1 (note that we never get to this line with b = 0), then by the definition of δ the only way in which the induction hypothesis can be violated before line 21 is that the paths parametrized by S may not glue at p to the paths parametrized by A ∪ B; i.e. before line 21 there are pairs (q, q′) ∈ S such that q ∉ A ∪ B, which means that the functions (from [p, p + e] to Q) parametrized by the pair (q, q′) do not glue at p to any of the functions parametrized by A ∪ B. However, this issue is amended by line 21, where we get rid of the bad pairs (implicitly, by setting a unique value for f(p), the one which comes from the set A ∪ B). Thus the hypothesis holds after this line.

After line 23: By the induction hypothesis, the claim holds when the program was last before line 23; it holds after the line by the definition of φ: indeed, if (q, q′) ∈ S and f : [p + b, p + e] → Q, then since ∀σ ∈ Σ : q ∈ δ(q, σ), the function f can be trivially extended to [p + b − 1, p + e] by setting f(p + b − 1) to q; we conclude this argument using a descending induction on b.

Proof (of Theorem 1). By Lemmas 1 and 2, if we get to line 7 in the segments scan algorithm, then - using parameter values from the segments scan algorithm - Thompson's algorithm has reached place p_Thompson = p + e on the string, with front Z = π_2 S. The first conclusion we draw from this fact is that if the segments scan algorithm exits successfully on line 7, then so does Thompson's algorithm. As for the other direction, assume that the segments scan algorithm exits unsuccessfully. We will analyze what happened between the last time the algorithm visited line 7 and the exit point, and prove that Thompson's algorithm exits unsuccessfully as well. Let A, B, p be as they were set after the last visit to line 12, and, arguing by contradiction and assuming that Thompson's algorithm exits successfully, let p_T-final be the number of iterations of Thompson's algorithm. Then by Lemma 2 there is a map f ∈ P_{[p, p_T-final]} such that f(p) ∈ A ∪ B and f(p_T-final) = q_F. By (descending) induction on a and a′ below, this means that for all a, a′ such that p ≤ a ≤ a′ ≤ p_T-final, the following two properties hold:

    P_{[0,p] ∪ [a,a′]} ≠ ∅,
    ∃f ∈ P_{[0,p] ∪ [a,p_T-final]} : f(p_T-final) = q_F.

By the first of these properties we do not pass the if in line 22 before hitting line 7 again, and by the second of these p + e cannot exceed p_T-final (see the condition in line 14) before hitting line 7 again - contradicting the assumption that we had already reached this line for the last time.

3.1 Pruning of redundant extensions

There are four minor changes to the algorithm which may always be used to trim down the sizes of A, B and S. For the sake of simplicity of the exposition we omitted them from the initial algorithm presentation. The four changes we present are by and large independent of one another (we explicitly note when they are not).

1. Currently the set A represents all the nodes in the front which satisfy φ; instead we can make A represent all the nodes q in the front which satisfy φ and that admit a path to q_F which does not pass through other nodes of A (otherwise - why keep q? we can just keep these other nodes of A). I.e., immediately after line 10 we modify A as follows:

    A ← {q ∈ A | ∃ path f from q to q_F : range(f) ∩ A = {q}}.

Note that the predicate above may be precomputed before we execute the algorithm - namely, for each node q satisfying φ we may encode a set of nodes Φ(q) such that we erase q from A only if A ∩ Φ(q) ≠ ∅. (A sketch of this filter appears after the list.)

2. In the case where b = 0, the set π_1 S is glued to A ∪ B. Thus we may work directly with π_2 S instead of S. Representing π_2 S by B, we simply have to make the following changes:

    Line 2: substitute by S ← ∅.
    Line 4: substitute by B ← {q_0}.
    Line 7: substitute the π_2 S in the condition by B.
    Line 9: erase.
    Line 16: substitute by: if b = 0 then B ← {q″ | ∃q′ ∈ B : q″ ∈ δ(q′, x_{p+e})} else S ← {(q, q″) | ∃q′ : (q, q′) ∈ S, q″ ∈ δ(q′, x_{p+e})}.
    Line 21: substitute the then-part by B ← {q′ | (q, q′) ∈ S, q ∈ A ∪ B}.

3. Ideally, instead of extending all the possible segments to the left, we would want to prune pairs (q, q′) such that min over q″ ∈ A ∪ B of dist(q″, q) > b. While this goal is difficult to achieve in general, it is easy enough to prune some of these pairs, by modifying the update of S in the left crawl (line 20) to

    S ← {(q″, q′) | ∃q : (q, q′) ∈ S, q ∈ δ(q″, x_{p+b}) and (q″ ≠ q or q″ ∈ A ∪ B)}.

Note that a similar change should be made in the corresponding modification in the previous paragraph.

4. Finally, allowing crawling to the right from states in A ∪ B bloats the size of S. Such a crawl is redundant, since we eventually crawl left to A ∪ B (either before performing another jump, or after it). Thus we can modify the update of S in line 13 to

    S ← {(q, q′) | q ∈ A ∪ B, q′ ∈ Ψ(q), and (q′ ≠ q or q ∈ A)},

and the condition in line 22 to S = ∅ and A = ∅.
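A sketch of the filter of item 1, as applied at run time (the precomputed Φ(q) encoding is not spelled out here); succ_map is the all-character successor map from the earlier preprocessing sketch, and the function name is ours:

    def prune_A(A, succ_map, qF="qF"):
        """Keep only those q in A from which q_F is reachable without passing
        through any other node of A (succ_map as returned by succ(...) above)."""
        kept = set()
        for q in A:
            blocked = A - {q}
            seen, stack = {q}, [q]
            while stack:                         # DFS that avoids the other A-nodes
                u = stack.pop()
                if u == qF:
                    kept.add(q)
                    break
                for v in succ_map[u]:
                    if v not in seen and v not in blocked:
                        seen.add(v)
                        stack.append(v)
        return kept

    # e.g. A = prune_A(A, succ(delta, STATES, ALPHABET))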

3.2 Jumping, crawling, and oracles

In line 14 we use the oracle O′ to decide whether crawling to the left is better than crawling to the right, whereas in line 8 we use the oracle O to determine when it is better to crawl to the right and when to jump (in Section 5 we show test results for the three oracle pairs presented in Example 2). All the example oracles we considered in Example 2 are motivated by Boyer-Moore; i.e. they have a bias to crawl to the left, which is only violated in cases which do not occur in the absence of loops in the automaton. We do not know whether this design of the oracles is optimal (or even close to optimal) for typical regular expressions and for either typical or worst case strings. While we are not sure how to analyze worst case behaviour for typical regular expressions, we are working on an approach that we hope will prove useful both in the analysis and in the design of better oracles for typical strings, where "typical" here means generated by a Markov process (both for the string and for the regular expression, where for the latter the Markov process is a hierarchical one on the application of regular expression grammatical rules). This approach is motivated by the run-time analyses of the Boyer-Moore algorithm for Markovian inputs in e.g. [B-YR], [S], [Ts].

3.3 Generalization: segment unions of more than two segments

Our algorithm works on unions of segments I which are a union of only two segments, the first of which starts at 0. Hence, our algorithm either updates data about the segment which does not contain 0, or unites the two segments. However, we can modify the definition of Ψ(q) to

    Ψ_k(q) = {q′ | ∃ path f : [1, m] → Q from q to q′ such that #{j ∈ [2, m−1] : ψ(f(j))} ≤ k},

thus allowing the path to contain at most k nodes satisfying ψ (possibly, but not necessarily, adding other requirements, e.g. a distance bound, or simply requiring some special nodes on the path). The algorithmic change would be to work with interval unions I which contain more than two segments; thus, at each iteration of the main loop, we would have to decide not between extending the left or right side of the second segment of S_I - which is currently represented by S - but between extending the left or right side of any of the segments except the first one. As we can store more states of the NFA at once, and if some state scans are more likely to fail than others, this may be an advantage. As usual, this can be determined based on the NFA, the text, or both.
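A sketch of Ψ_k as a search over (state, budget) pairs, again using the successor map and ψ table of the earlier sketches; Ψ_0 coincides with the Ψ computed there:

    def big_psi_k(q, succ_map, is_psi, k):
        """Psi_k(q): states reachable from q along a path whose interior contains
        at most k nodes satisfying psi (is_psi is a precomputed dict of booleans)."""
        reach = {q}
        seen = {(q, 0)}
        stack = [(q, 0)]                 # (endpoint, psi-interior nodes used so far)
        while stack:
            u, used = stack.pop()
            # extending the path past u turns u into an interior node
            new_used = used + (1 if (u != q and is_psi[u]) else 0)
            if new_used > k:
                continue                 # u would exceed the psi budget as an interior node
            for v in succ_map[u]:
                reach.add(v)             # v is a valid endpoint of such a path
                if (v, new_used) not in seen:
                    seen.add((v, new_used))
                    stack.append((v, new_used))
        return reach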

4 Qualitative analysis of the number of character reads, character comparisons, and the front size

The number of character comparisons we perform in the segments scan algorithm is

    Σ over iterations i of the main loop of #S_i

(with some mild change if we use the second acceleration of Section 3.1, since in some iterations we have to add the size of B, which is smaller, instead of that of S), whereas the number of character comparisons of Thompson's algorithm is

    Σ over places p on the string of #Z_p.

In Section 6 we discuss various alternatives for substituting the left and right crawling operations on the entire set S by a constant time operation. In this case, when comparing the segments scan algorithm to Thompson's, one simply has to compare the number of iterations of the segments scan algorithm - which is the number of character reads it performs - with the length of the string up to acceptance/denial by the automaton - which is the number of character reads Thompson's algorithm performs.

Example 3 (Bad regex and input string for the segments scan algorithm). We note that given the regular expression

    b(.{10}a|.{9}a.|.{8}a.{2}|.{7}a.{3}|.{6}a.{4}|.{5}a.{5}|.{4}a.{6}|.{3}a.{7}|.{2}a.{8}|.a.{9}|a.{10})

the input string aaaa..., and assuming the standard assumption above, our algorithm requires 10 times more character comparisons than Thompson's.

Why are we presenting this algorithm then? Simply put, the acceleration of the segments scan algorithm comes from performing big jumps (thus reducing the number of iterations of the main loop), while not increasing the size of S by so much as to cancel this reduction. We cannot make any statements which are true for all input strings and all NFAs (in Section 3.2 we discussed future plans for a better worst case analysis, as well as probabilistic quantitative statements, and how these statements would affect the algorithm). However, we do make two empirical claims which hold for regular expressions and input strings in "real life":

- Most not-very-short input words correspond only to paths on the NFA which stay on nodes satisfying φ(q).
- Most not-very-short input words which correspond to paths on the NFA that do not stay on nodes satisfying φ(q) correspond to a small number of such paths.

The effect of the first rule of thumb is that we may perform big jumps, and that therefore the number of iterations of the algorithm is small. The effect of the second rule of thumb is that after crawling only a small number of letters, the size of S is still small.

5 Experimental results

In our experiments, we implemented the algorithm and tested how many characters out of the input string x are actually read during the run of the segments scan algorithm, given automata for different regular expressions.

We used the three oracle pairs from Example 2, which we denote in the tables below simply by 1, 2, 3, and used the first three optimizations presented in Section 3.1 (but not the fourth). We ran our searches on tests used by boost (see the boost regex benchmarks), on the Mark Twain corpus with the following results:

    regex                                                                     % for 1   % for 2   % for 3
    Twain
    Huck[[:alpha:]]
    [[:alpha:]]+ing
    Tom Sawyer
    Tom Sawyer|Huckleberry Finn
    (Tom Sawyer|Huckleberry Finn).{0,30}river|river.{0,30}(Tom Sawyer|Huckleberry Finn)

and on the html search test benchmarks with the following results:

    regex                                                                     % for 1   % for 2   % for 3
    beman|john|dave
    <p>.*</p>
    <h[1-8][^>]*>.*</h[1-8]>
    <a[^>]+href=("[^"]*"|[^[:space:]]+)[^>]*>
    <img[^>]+src=("[^"]*"|[^[:space:]]+)[^>]*>
    <font[^>]+face=("[^"]*"|[^[:space:]]+)[^>]*>.*</font>

One can observe that the percentage of characters read in this test set, even for complicated regular expressions that look nothing like word matching (as in the last of the html example searches), was as low as 34%. Moreover, note that there is a significant difference depending on the oracle, and thus an oracle that can learn the input (regular expressions and common strings) might be much more efficient.

6 Accelerating the inner loops

There are three inner loops in our algorithm: one in the computation of π_2 S (which is rather standard to accelerate), the right expansion in line 16, and the left expansion in line 20, where we compute

    S ← {(q, q″) | ∃q′ : (q, q′) ∈ S, q″ ∈ δ(q′, x_{p+e})},
    S ← {(q″, q′) | ∃q : (q, q′) ∈ S, q ∈ δ(q″, x_{p+b})},

respectively. Reducing these loops from O(|S|) time operations to O(1) time operations changes the performance measure of the algorithm from the number of character comparisons to the number of character reads (see Section 4). In this section we consider two acceleration methods for these loops: the more conservative method is constructing the DFA corresponding to the segments scan algorithm, whereas the more radical one is reliance on a hardware incarnation of the original NFA.

6.1 Full DFA and hybrid execution

Our algorithm is a complicated way to scan an automaton. Nevertheless, it is still an automaton scanning algorithm, and as such it admits an underlying DFA. I.e., we can construct the corresponding DFA (whose size, in a worst case scenario, is O(size of the NFA squared)). E.g. the segments DFA corresponding to the NFA from Example 1 is given by the following diagram (here we use the oracles O_1, O′_1, and all four optimizations presented in Section 3.1 when computing S):

    [Diagram: the segments DFA corresponding to the NFA of Example 1; each node records the values of S, b, e (and, where relevant, A and B), and each edge is labeled by its accepting character and, where one exists, the increment of p.]

Legend: on outgoing edges from dashed nodes we consider the character x_{p+e}, whereas in full nodes we consider the character x_{p+b}. The first parameter on each edge is the accepting character of the edge; the second parameter, if one exists, is the increment of p. Finally, note that we could represent the DFA partially, and run the scan in a hybrid mode (see [BC]).

6.2 Accelerating left and right expansions, given an NFA with front expansion in O(1)

Assume we have at our disposal an NFA implementation such that front expansion is done in O(1) (this assumption is not that far from reality - see e.g. [SP]). We may then store S as a (possibly sparse) bit matrix C, and utilize the given NFA implementation to accelerate the left and right expansions: i.e. the right expansion of C with the character σ is given by

    C′[i, j] = ⋁ over k ∈ δ^{-1}(j, σ) of C[i, k].
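A plain-Python rendering of this right expansion, with states numbered 0, ..., n−1 (our own sketch; a hardware or bit-parallel implementation would replace the inner loops with wide boolean operations on the rows of C):

    def right_expand(C, delta_inv_sigma, n):
        """One right expansion of the segment matrix C with a fixed character sigma:
        C2[i][j] = OR over k in delta^{-1}(j, sigma) of C[i][k], where
        delta_inv_sigma[j] is the set of states k with j in delta(k, sigma)."""
        C2 = [[False] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                C2[i][j] = any(C[i][k] for k in delta_inv_sigma[j])
        return C2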

7 Conclusion

We have presented a new algorithm for matching strings against regular expressions. This algorithm evidently does not perform well in the worst case, but rather is suited to real life regular expressions as we encounter them in the industry (e.g., in security filtering scenarios or in IT monitoring scenarios, to name a couple). The algorithm is inspired by the Boyer-Moore algorithm in jumping ahead over hopeless matches. By doing so, it actually holds a set of segments of the automaton that might later complete the computation of the automaton on the string. We have shown that the algorithm is well suited to parallelization, and that it can be generalized in many different and promising ways.

References

AC.   A. V. Aho, M. J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM 18 (6), June 1975.
B-YR. R. A. Baeza-Yates, M. Régnier. Average running time of the Boyer-Moore-Horspool algorithm. Theoretical Computer Science 92 (1), 1992.
BC.   M. Becchi, P. Crowley. A hybrid finite automaton for practical deep packet inspection. CoNEXT 2007.
BM.   R. S. Boyer, J. S. Moore. A fast string searching algorithm. Communications of the ACM 20 (10), 1977.
C-W.  B. Commentz-Walter. A string matching algorithm fast on the average. ICALP 1979 (extended abstract).
G.    Z. Galil. On improving the worst case running time of the Boyer-Moore string matching algorithm. Communications of the ACM 22 (9), September 1979.
Ke.   S. Kearns. Accelerated finite automata enable regular expression searching in sublinear time. Preprint, 2013.
KMP.  D. Knuth, J. H. Morris, V. Pratt. Fast pattern matching in strings. SIAM Journal on Computing 6 (2), 1977.
SP.   R. P. S. Sidhu, V. K. Prasanna. Fast regular expression matching using FPGAs. FCCM 2001.
S.    R. T. Smythe. The Boyer-Moore-Horspool heuristic with Markovian input. Random Structures and Algorithms, 2001.
Th.   K. Thompson. Programming techniques: regular expression search algorithm. Communications of the ACM 11 (6), 1968.
Ts.   Tsung-Hsi Tsai. Average case analysis of the Boyer-Moore algorithm. Random Structures and Algorithms 28 (4), 2006.
WW.   B. W. Watson, R. E. Watson. A Boyer-Moore-style algorithm for regular expression pattern matching. Science of Computer Programming 48, 2003.
