Faster Regular Expression Matching. Philip Bille, Mikkel Thorup.
Outline. Definition. Applications. History tour of regular expression matching: Thompson's algorithm, Myers' algorithm. New algorithm. Results and extensions.
Regular Expressions. A character α is a regular expression. If S and T are regular expressions, then so are: the union S|T; the concatenation ST (also written S·T); the Kleene star S*.
Languages. The language L(R) of a regular expression R is: L(α) = {α}; L(S|T) = L(S) ∪ L(T); L(ST) = L(S)L(T); L(S*) = {ε} ∪ L(S) ∪ L(S)^2 ∪ L(S)^3 ∪ ...
An example. R = (a*)(b|c). L(R) = {b, c, ab, ac, aab, aac, ...}
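The example language can be checked by brute force. The following is a minimal sketch, assuming the example expression is R = (a*)(b|c) over the alphabet {a, b, c}; it uses Python's built-in `re` engine purely as an oracle for membership, not as an instance of any algorithm discussed in these slides. The helper name `language_up_to` is hypothetical.

```python
import itertools
import re

# Assumed example expression from the slide: R = (a*)(b|c)
R = re.compile(r"(a*)(b|c)")

def language_up_to(max_len, alphabet="abc"):
    """List all strings of length <= max_len that lie in L(R), shortest first."""
    members = []
    for k in range(max_len + 1):
        for letters in itertools.product(alphabet, repeat=k):
            s = "".join(letters)
            # fullmatch: the whole string must be matched, i.e. s is in L(R)
            if R.fullmatch(s):
                members.append(s)
    return members

print(language_up_to(3))  # ['b', 'c', 'ab', 'ac', 'aab', 'aac']
```

Each additional length contributes exactly two strings, one ending in b and one ending in c, which matches the pattern listed on the slide.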
Regular Expression Matching. Given a regular expression R and a string Q, the regular expression matching problem is to decide if Q ∈ L(R). How fast can we solve regular expression matching for |R| = m and |Q| = n?
Applications. Primitive in large scale data processing: Internet traffic analysis, protein searching, XML queries. Standard utilities and tools: Grep and Sed, Perl.
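Thompson's Algorithm 1968. Construct a non-deterministic finite automaton (NFA) from R. [Figure: Thompson's construction rules, joined by ε-transitions: (a) a single character α; (b) concatenation of N(S) and N(T); (c) union of N(S) and N(T); (d) Kleene star of N(S).]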
Thompson's Algorithm 1968. R = (a*)(b|c). [Figure: the Thompson NFA for R, with states numbered 1 to 10.] The Thompson NFA (TNFA) N(R) has O(|R|) = O(m) states and transitions. N(R) accepts L(R): any path from the start state to the accept state corresponds to a string in L(R) and vice versa. Traverse the TNFA on Q one character at a time. O(m) per character => O(|Q|m) = O(nm) time algorithm. Top ten list of problems in stringology 1985 [Galil1985].
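The construction and the character-by-character traversal can be sketched as follows. This is an illustrative sketch, not the slides' exact automaton: the class and function names (`NFA`, `char`, `concat`, `union`, `star`, `closure`, `matches`) are hypothetical, the expression is built with combinators rather than parsed, and concatenation links the two fragments with an ε-transition instead of merging states, so the state numbering differs from the figure.

```python
class NFA:
    """Container for a Thompson-style NFA under construction."""
    def __init__(self):
        self.eps = []  # eps[s]: targets of the ε-transitions leaving state s
        self.sym = []  # sym[s]: (character, target) or None

    def new_state(self):
        self.eps.append([])
        self.sym.append(None)
        return len(self.eps) - 1

# Each combinator returns a fragment (start_state, accept_state).
def char(nfa, c):
    s, t = nfa.new_state(), nfa.new_state()
    nfa.sym[s] = (c, t)
    return s, t

def concat(nfa, f, g):
    nfa.eps[f[1]].append(g[0])  # assumption: ε-link instead of merging states
    return f[0], g[1]

def union(nfa, f, g):
    s, t = nfa.new_state(), nfa.new_state()
    nfa.eps[s] += [f[0], g[0]]
    nfa.eps[f[1]].append(t)
    nfa.eps[g[1]].append(t)
    return s, t

def star(nfa, f):
    s, t = nfa.new_state(), nfa.new_state()
    nfa.eps[s] += [f[0], t]      # enter the loop or skip it
    nfa.eps[f[1]] += [f[0], t]   # repeat the loop or leave it
    return s, t

def closure(nfa, states):
    """All states reachable from `states` via ε-transitions; O(m) time."""
    stack, seen = list(states), set(states)
    while stack:
        u = stack.pop()
        for v in nfa.eps[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def matches(nfa, frag, q):
    """State-set simulation: one O(m) state-set transition per character of q."""
    start, accept = frag
    current = closure(nfa, {start})
    for c in q:
        moved = {nfa.sym[s][1] for s in current
                 if nfa.sym[s] is not None and nfa.sym[s][0] == c}
        current = closure(nfa, moved)
    return accept in current

nfa = NFA()
r = concat(nfa, star(nfa, char(nfa, "a")),
           union(nfa, char(nfa, "b"), char(nfa, "c")))
print(matches(nfa, r, "aab"), matches(nfa, r, "aa"))  # True False
```

Every state has at most two outgoing ε-transitions or one character transition, so each state-set transition touches O(m) states and edges, which gives the O(nm) bound stated above.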
Myers' Algorithm 1992: 1-D decomposition. [Figure: the TNFA for R = (a*)(b|c), states 1 to 10, decomposed into micro TNFAs A1, A2 and A3.] Decompose N(R) into a tree of O(m/x) micro TNFAs with at most x = Θ(log n) states. State-sets and micro TNFAs are encoded in O(x) bits ([BFC2008]). Tabulate state-set transitions for micro TNFAs. Table size: $2^{O(x)} = O(n^{\varepsilon})$. => constant time for a micro TNFA state-set transition. O(m/x) = O(m/log n) state-set transition algorithm for N(R). O(nm/log n) algorithm.
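The tabulation idea can be illustrated on a single small automaton. The sketch below is an assumption-laden toy, not Myers' algorithm itself: it hand-codes one 6-state Thompson-like NFA for (a*)(b|c) (the names `EPS`, `SYM`, `TABLE` and the state numbering are hypothetical), encodes state-sets as bitmasks, and precomputes every state-set transition so that each character costs one table lookup. The full algorithm tabulates over all micro TNFAs with x states and combines O(m/x) of them per character.

```python
# Hand-built NFA for (a*)(b|c); state numbering is illustrative only.
EPS = {0: [1, 2], 2: [3, 4]}                       # ε-transitions
SYM = {1: ("a", 0), 3: ("b", 5), 4: ("c", 5)}      # character transitions
NUM_STATES, START, ACCEPT = 6, 0, 5
ALPHABET = "abc"

def closure_mask(mask):
    """ε-closure of a state-set encoded as a bitmask."""
    stack = [s for s in range(NUM_STATES) if (mask >> s) & 1]
    while stack:
        u = stack.pop()
        for v in EPS.get(u, []):
            if not (mask >> v) & 1:
                mask |= 1 << v
                stack.append(v)
    return mask

def step_mask(mask, c):
    """One state-set transition on character c, computed the slow way."""
    moved = 0
    for s, (ch, t) in SYM.items():
        if (mask >> s) & 1 and ch == c:
            moved |= 1 << t
    return closure_mask(moved)

# Tabulation: 2^NUM_STATES entries per character, computed once up front.
TABLE = {c: [step_mask(mask, c) for mask in range(1 << NUM_STATES)]
         for c in ALPHABET}

def matches(q):
    mask = closure_mask(1 << START)
    for c in q:
        # One lookup per character; characters outside the alphabet kill all states.
        mask = TABLE[c][mask] if c in TABLE else 0
    return bool((mask >> ACCEPT) & 1)

print(matches("aab"), matches("ba"))  # True False
```

With x states the table has 2^x entries per character, which is why x is capped at a small multiple of log n: the table then stays polynomially smaller than n.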
How can we improve Myers' algorithm? Fast algorithms [Myers1992, BFC2008, Bille2006] all aim to speed up the state-set transition for 1 character. To read/write a state-set we need Ω(m/log n) time (we assume log n word length). To improve we need to handle multiple characters quickly. The main challenge is dependencies from ε-transitions.
New Algorithm: 2-D decomposition. Decompose N(R) into O(m/x) micro TNFAs with at most x = Θ(log n) states, as in Myers' algorithm. Partition Q into segments of length $y = \Theta(\log^{1/2} n)$. State-set transition on segments in O(m/x) time. => algorithm using $O(nm/(xy)) = O(nm/\log^{1.5} n)$ time.
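The running time follows by counting segments, spelled out here as a short derivation using only the parameters x and y from the slide:

```latex
\[
\underbrace{\frac{n}{y}}_{\text{segments of } Q}
\cdot
\underbrace{O\!\left(\frac{m}{x}\right)}_{\text{time per segment}}
= O\!\left(\frac{nm}{xy}\right)
= O\!\left(\frac{nm}{\log n \cdot \log^{1/2} n}\right)
= O\!\left(\frac{nm}{\log^{1.5} n}\right).
\]
```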
Overview. Goal: do a state-set transition on $y = \Theta(\log^{1/2} n)$ characters in O(m/x) = O(m/log n) time. Simplifying assumption: constant size alphabet. Algorithm: 4 traversals on the tree of micro TNFAs. Traversals 1-3 iteratively build information. Traversal 4 computes the actual state-set transition. Tabulation to do each traversal in constant time per micro TNFA => O(m/x) time algorithm.
Computing Accepted Substrings. q = aab. Goal: for each micro TNFA A, compute the substrings of q that are accepted by Ā. We have A1: {ε, a, aa}, A2: {b}, A3: {ab, aab}. Bottom-up traversal using tabulation in constant time per micro TNFA. Encode the set of substrings in $O(y^2) = O(\log n)$ bits. Table input: micro TNFA, substrings of children, q. Table size: $2^{O(x + y^2 + y)} = 2^{O(x + y^2)} = O(n^{\varepsilon})$.
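The O(y^2)-bit encoding can be pictured as one bit per pair of positions in the segment. The following is a minimal sketch under assumptions: the function names and the exact bit layout are hypothetical, the segment q = "aab" and the set {ε, a, aa} are used only as sample input, and the slides do not specify the concrete encoding.

```python
def encode_substrings(q, accepted):
    """Bit (i, j) is set iff q[i:j] is in `accepted`, for 0 <= i <= j <= len(q).

    Uses (y + 1)^2 bit positions for a segment of length y, i.e. O(y^2) bits.
    """
    y = len(q)
    bits = 0
    for i in range(y + 1):
        for j in range(i, y + 1):
            if q[i:j] in accepted:
                bits |= 1 << (i * (y + 1) + j)
    return bits

def decode_substrings(q, bits):
    """Recover the set of substrings of q marked in `bits`."""
    y = len(q)
    return {q[i:j]
            for i in range(y + 1)
            for j in range(i, y + 1)
            if (bits >> (i * (y + 1) + j)) & 1}

q = "aab"  # assumed sample segment
code = encode_substrings(q, {"", "a", "aa"})
print(decode_substrings(q, code) == {"", "a", "aa"})  # True
```

Because the encoding fits in O(y^2) = O(log n) bits, it can serve directly as part of a table index, which is what keeps each traversal step constant time.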
Computing Path Prefixes to Accepting States. q = aab, S = {1,3}. Goal: for each micro TNFA A, compute the prefixes of q matching a path from S to the accepting state in Ā. We have A1: {a, aa}, A2: ∅, A3: {aab}. Bottom-up traversal using tabulation in constant time per micro TNFA. Encode prefixes in $O(y) = O(\log^{1/2} n)$ bits. Table input: micro TNFA, substrings and path prefixes of children, q, state-set for A. Table size: $2^{O(x + y^2)} = O(n^{\varepsilon})$.
Computing Path Prefixes to Start States. q = aab, S = {1,3}. Goal: for each micro TNFA A, compute the prefixes of q matching a path from S to the start state of A in N(R). We have A1: {a}, A2: {a, aa}, A3: {ε}. Top-down traversal using tabulation in constant time per micro TNFA. Tabulation: similar to the previous traversal.
Updating State-Sets. q = aab, S = {1,3}. Goal: for each micro TNFA A, compute the next state-set. We have A1: ∅, A2: {7,10}, A3: {10}. Traversal using tabulation in constant time per micro TNFA. Tabulation: similar to the previous traversal.
Algorithm Summary. Tabulation in $2^{O(x + y^2)} = O(n^{\varepsilon})$ time and space. 4 traversals, each using O(m/x) time, to process a length-y segment of Q. => algorithm using $O(nm/(xy)) = O(nm/\log^{1.5} n)$ time and $O(n^{\varepsilon})$ space. Hence, we have $\sqrt{\log n}$'ed ("slogn'ed") Myers' result, improving it by a $\sqrt{\log n}$ factor.
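The table-size bound can be made explicit by fixing the constants hidden in the Θ-notation. In the sketch below, c is a hypothetical constant chosen small enough for the target ε, and logarithms are base 2:

```latex
\[
x = c \log n, \qquad y = \sqrt{c \log n}
\quad\Longrightarrow\quad
2^{O(x + y^2)} = 2^{O(c \log n)} = n^{O(c)} \le n^{\varepsilon}
\quad \text{for sufficiently small } c > 0 .
\]
```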
Extensions. Unbounded alphabets cost an additional log log n factor in speed. Unbounded alphabets come for free if $m \le n^{1-\varepsilon}$. I/O bounds: 1-D decomposition gives O(nm/B) I/Os, we get $O(nm/(M^{1/2} B))$ I/Os. Other features: independent tabulation, streaming.