A Unified Construction of the Glushkov, Follow, and Antimirov Automata

Size: px

Start display at page:

Download "A Unified Construction of the Glushkov, Follow, and Antimirov Automata"

Derrick Hoover
5 years ago
Views:

1 A Unified Construction of the Glushkov, Follow, nd Antimirov Automt Cyril Alluzen nd Mehryr Mohri Cournt Institute of Mthemticl Sciences 251 Mercer Street, New York, NY 10012, USA Astrct. A numer of different techniques hve een introduced in the lst few decdes to crete ɛ-free utomt representing regulr expressions such s the Glushkov utomt, follow utomt, or Antimirov utomt. This pper presents simple nd unified view of ll these construction methods oth for unweighted nd weighted regulr expressions. It descries simpler lgorithms with time complexities t lest s fvorle s tht of the est previously known techniques, nd provides concise proof of their correctness. Our lgorithms re ll sed on two stndrd utomt opertions: epsilon-removl nd minimiztion. This contrsts with the multitude of complicted nd specil-purpose techniques previously descried in the literture, nd mkes it strightforwrd to generlize these lgorithms to the weighted cse. In prticulr, we extend the definition nd construction of follow utomt to the cse of weighted regulr expressions over closed semiring nd present the first lgorithm to compute weighted Antimirov utomt. 1 Introduction The construction of finite utomt representing regulr expressions hs een widely studied due to their multiple pplictions to pttern-mtching nd mny other res of text processing [1, 21]. The most clssicl construction, Thompson s construction [13, 24], cretes finite utomton with numer of sttes nd trnsitions liner in the length m of the regulr expression. The time complexity of the lgorithm is lso liner, O(m). But Thompson s utomton contins trnsitions leled with the empty string ɛ which crete dely in pttern mtching. Mny lterntive techniques hve een introduced in the lst few decdes to crete ɛ-free utomt representing regulr expressions, in prticulr, Glushkov utomt [11], follow utomt [12], nd Antimirov utomt [2]. The Glushkov utomton, or position utomton, ws independently introduced y [11] nd [16]. It hs exctly n + 1 sttes ut my hve up to n 2 trnsitions, where n is the numer of occurrences of lphet symols ppering in the expression. Thus, its size is qudrtic in tht of the Thompson utomton for resonle regulr expressions for which m = O(n). However, when using itprllelism for regulr expression serch, thnks to its smller numer of sttes, the Glushkov utomton cn e represented with hlf the numer of mchine words required y the Thompson utomton [20,21].

2 Automton Algorithm Complexity Glushkov rmeps(t) O(mn) Follow min(rmeps(t)) O(mn) Antimirov rmeps(min(rmeps( T))) O(m log m + mn) Tle 1. Simple lgorithms for the construction of Glushkov, follow, nd Antimirov utomt nd their time complexity. T is the Thompson utomton. For n utomton A, A denotes the utomton derived from A y mrking lphet symols with their position in the expression. When the symols re mrked, the sme nottion denotes the opertion tht removes the mrking. T is otined y mrking some ɛ-trnsitions in T, mking it deterministic (the ɛ-trnsitions mrked re removed y the rmeps opertion). Severl techniques hve een suggested for constructing the Glushkov utomton. In [3], the construction is sed on the recursive definition of the follow function nd its complexity is in O(n 3 ). The lgorithm descried y [4] is sed on n optimiztion of the recursive definition of the follow function nd its complexity is in O(m + n 2 ). It requires the expression to e first rewritten in str-norml form, which cn e done non-trivilly in O(m). Severl other qudrtic lgorithms hve een given: tht of [9] which is sed on n optimiztion of the follow recursion, nd tht of [22], sed on the ZPC structure, which consists of two mutully linked copies of the syntctic tree of the expression. The Antimirov or prtil derivtives utomton ws introduced y [2]. It is in generl smller thn the Glushkov utomton with up to n + 1 sttes nd up to n 2 trnsitions. It ws in fct proven y [8] (see [12] for simpler proof) to e the quotient of the Glushkov utomton for some equivlence reltion. The complexity of the originl construction lgorithm of [2] is O(m 5 ). [8] presented n lgorithm whose complexity is O(m 2 ). Finlly, the follow utomton ws introduced y [12]. It is the quotient of the Glushkov utomton y the follow equivlence: two sttes re equivlent if they hve the sme follow nd the sme finlity. The uthor gve n O(m + n 2 ) lgorithm where some ɛ-trnsitions re removed from the utomton t ech step of the Thompson construction s well s t the end. An O(m + n 2 ) lgorithm using the ZPC structure ws given in [7], which requires the regulr expression to e rewritten in str-norml form. Some of these results hve een extended to weighted regulr expressions over ritrry semirings. The generliztion of the Thompson construction trivilly follows from [23]. The Glushkov utomton cn e nturlly extended to the weighted cse [5], nd n O(m 2 ) construction lgorithm sed on the generliztion of the ZPC construct ws given y [6]. The Antimirov utomton ws generlized to the weighted cse y [15], ut no explicit construction lgorithm or complexity nlysis ws given y the uthors. This pper presents simple nd unified view of ll these ɛ-free utomt (Glushkov, follow, nd Antimirov) oth in the cse of unweighted nd weighted regulr expressions. It descries simpler lgorithms with time complexities t lest s fvorle s tht of the est previously known techniques, nd provides concise proofs. Our lgorithms re ll sed on two stndrd utomt opertions: epsilon-removl nd minimiztion, s summrized in Tle 1.

3 This contrsts with the multitude of complicted nd specil-purpose techniques nd proofs descried y others to construct these utomt: no need for fine-tuning some recursions, no requirement tht the regulr expression e in str-norml form, nd no need to mintin multiple copies of the syntctic tree. Our nlysis provides etter understnding of ɛ-free utomt representing regulr expressions: they re ll the results of the ppliction of some comintions of epsilon-removl nd minimiztion to the clssicl Thompson utomt. This mkes it strightforwrd to generlize these lgorithms to the weighted cse y using the generliztion of ɛ-removl nd minimiztion [17, 18]. It lso results in much simpler lgorithms thn existing ones. In prticulr, this leds to strightforwrd lgorithm for the construction of the Glushkov utomton of weighted regulr expression, nd, in the cse of closed semirings, helps us generlize follow utomt to the weighted cse. We lso give the first explicit construction lgorithm of the Antimirov utomton of weighted expression. When the semiring is k-closed, or, in the Glushkov cse, only null-k-closed for the regulr expression considered, the complexity of our lgorithms is the sme s in the unweighted cse. The pper is orgnized s follows. Section 2 introduces the definitions nd rief description of the elementry lgorithms used in the following sections. Section 3 presents nd nlyzes our lgorithm for the construction of the Glushkov utomton, Section 4 the sme for the follow utomton, nd Section 5 for the Antimirov utomton. 2 Preliminries Semirings. A system (K,,, 0, 1) is semiring when (K,, 0) is commuttive monoid with identity element 0; (K,, 1) is monoid with identity element 1; distriutes over ; nd 0 is n nnihiltor for : for ll K, 0 = 0 = 0. Thus, semiring is ring tht my lck negtion. Some fmilir semirings include the oolen semiring (B,,, 0, 1), the tropicl semiring (R + { }, min, +,, 0), nd the rel semiring (R +, +,, 0, 1). A semiring K is sid to e closed if for ll K, the infinite sum n=0 n is well-defined nd in K, nd if ssocitivity, commuttivity, nd distriutivity pply to countle sums [19]. K is sid to e k-closed if for ll K, k+1 n=0 n = k n=0 n. More generlly, we will sy tht K is closed (k-closed) for n utomton A, if the closedness (resp. k-closedness) xioms hold for ll cycle weights of A. In some semirings, e.g., the proility semiring (R +, +,, 0, 1), the equlity k+1 n=0 n = k n=0 n my hold for the cycle weights of A only pproximtely, modulo ɛ > 0. A is then sid to e ɛ-k-closed for tht semiring. Weighted utomt. A weighted utomton A over semiring K is 7-tuple (Σ, Q, E, I, λ, F, ρ) where: Σ is finite lphet; Q is finite set of sttes; I Q the set of initil sttes; F Q the set of finl sttes; E Q (Σ {ɛ}) K Q finite set of trnsitions; λ : I K the initil weight function; nd ρ : F K the finl weight function mpping F to K. We denote y p[π] the origin nd n[π] the destintion stte of pth π E in n utomton A. i[π] denotes the lel of π, nd w[π] its weight otined

4 y -multiplying the weights of its constituent trnsitions. We lso denote y P(p, q) the set of pths from p to q nd y P(I, x, F) the set of pths from the initil sttes I to the finl sttes F leled with x Σ. The weight ssocited y A to n input string x Σ is otined y summing the weights of these pths multiplied y their initil nd finl weights: [A](x) = π P(I,x,F) λ(p[π]) w[π] ρ(n[π]). [A](x) is defined to e 0 when P(I, x, F) =. Shortest distnce. Let A e weighted utomton over K. The shortest distnce from p to q is defined s d[p, q] = π P(p,q) w[π]. It cn e computed using the generic single-source shortest-distnce lgorithm of [19] if K is k-closed for A, or using generliztion of Floyd-Wrshll [14,19] if K is closed for A. Epsilon-removl. The generl ɛ-removl lgorithm of [18] consists of first computing the ɛ-closure of ech stte p in A, closure(p) = {(q, w): w = d ɛ [p, q] = w[π] 0}, (1) π P(p,q),i[π]=ɛ nd then, for ech stte p, of deleting ll the outgoing ɛ-trnsitions of p, nd dding out of p ll the non-ɛ trnsitions leving ech stte q closure(p) with their weight pre- -multiplied y d ɛ [p, q]. If K is k-closed for the ɛ-cycles of A, 1 then the generic single-source shortest-distnce lgorithm [19] cn e used to compute the ɛ-closures. Weight-pushing nd weighted minimiztion. Weight-pushing [17] is normliztion lgorithm tht redistriutes the weights long the pths of A such tht e E[q] w[e]+ρ(q) = 1 for every stte q Q. We denote y push(a) the resulting utomton. The lgorithm requires tht K e zero-sum free, wekly left-divisile nd closed or k-closed for A since it depends on the computtion of d[q, F] for ll q Q. It ws proved in [17] tht, if A is deterministic, i.e., if no two trnsitions leving ny stte shre the sme lel nd if it hs unique initil stte, then the weight-pushing followed y unweighted minimiztion with ech pir (lel, weight) viewed s n lphet symol, leds to miniml deterministic weighted utomton equivlent to A, denoted y min(a). Regulr expressions. A weighted regulr expression over the semiring K is defined recursively y:, ɛ nd Σ re regulr expressions, nd if α nd β re regulr expressions then kα, αk for k K, α + β, α β nd α re lso regulr expressions. We denote y null(α) the weight ssocited y α to the empty string ɛ. A weighted regulr expression α is well-defined iff for every suterm of the form β, null(β) is well-defined nd in K. We will sy tht K is null-k-closed for α if there exist k 0 such tht for every suterm β of α, null(β) = k i=0 null(β)i. We denote y α the length of α, nd y α Σ the width of α, i.e., the numer of occurrences of lphet symols in α. Let pos(α) = {1, 2,..., α Σ } e the set of (lphet symol) positions in α. An unweighted regulr expression cn e seen s weighted expression over the oolen semiring (B,,, 0, 1). Thompson utomton. We denote y A T (α) the Thompson utomton of α nd y I AT (α) nd F AT (α) its unique initil nd finl sttes. For i pos(α), we denote y p i nd q i the sttes of A T (α) such tht the trnsition from p i to q i is leled with the lphet symol t the i-th position in α. The sttes p i 1 For A to e well-defined, K needs to e closed for the ɛ-cycles of A.

5 () 1,2, 3, ,5 0 1,2 6 3,4,5 () (c) (d) Fig. 1. () The Thompson utomton, () Glushkov utomton, (c) Follow utomton, nd (d) Antimirov utomton of the regulr expression α = ( + )( + + ), the running exmple used in [12]. In (d), stte {0} corresponds to the derived term α, {1, 2} to τ = ( + + ), {6} to τ, nd {3, 4, 5} to τ. (sttes q i ) re the only sttes hving non-ɛ outgoing (resp. incoming) trnsition. Figure 1() shows the Thompson utomton in the specil cse of the regulr expression α = ( + )( + + ). 3 Glushkov Automton Let α e weighted regulr expression over the lphet Σ nd the semiring K. The Glushkov utomton of α is n ɛ-free non-deterministic weighted utomton representing α tht hs n initil stte plus one stte for ech position in α, i.e. ech occurrence of n lphet symol in α. Figure 1() shows n exmple. The forml definition of the Glushkov utomton is sed on the functions null, first, lst, nd follow. Tle 2 gives the recursive definition of these functions. null(α) K is the weight ssocited y α to the empty string ɛ nd is thus the finl weight of the initil stte of the utomton. lst(α) pos(α) K is the set of positions with the corresponding finl weights where non-empty string ccepted y α cn end. first(α) pos(α) K is the set of positions with the corresponding weights tht cn e reched y reding one lphet symol from the initil stte. Similrly, follow(α, i) pos(α) K is the set of positions with the corresponding weights tht cn e reched y reding one lphet symol from the position i. In these definitions, the union of two weighted susets X, Y pos(α) K is defined y X Y = {(i, X, i Y, i ) : X, i Y, i 0}. For ny weighted suset X pos(α) K, weight k K, nd position i, k X denotes the weighted

6 α null(α) first(α) lst(α) follow(α, i) 0 ɛ 1 j 0 {(j, 1)} {(j, 1)} kβ k null(β) k first(β) lst(β) follow(β, i) βk null(β) k first(β) lst(β) k follow(β, j i) follow(β, i) if i pos(β) β + γ null(β) null(γ) first(β) first(γ) lst(β) lst(γ) follow(γ, i) if i pos(γ) β γ null(β) null(γ) null(β) first(γ) first(β) lst(β) null(γ) lst(γ) 8 >< follow(β, i) lst(β), i first(γ) if i pos(β) >: follow(γ, i) if i pos(γ) β null(β) null(β) first(β) lst(β) null(β) follow(β, i) lst(β ), i first(γ) Tle 2. Definition of the functions null, first, lst, nd follow. For convenience, we lso define follow(α,0) = first(α) nd lst 0(α) s lst(α) {(0, null(α))} if null(α) 0, lst(α) otherwise. suset nd X, i the weight defined y: { {(i, k w) (i, w) X} if k 0, k X = otherwise, nd X, i = { w if w: (i, w) X, 0 otherwise. X k is defined similrly. Let α denote the weighted regulr expression otined y mrking ech symol of α with its position. The Glushkov or position utomton A G (α) of α is defined y A G (α) = (Σ, pos 0 (α), E, 0, 1, F, ρ) where its sttes set is pos 0 (α) = {0} pos(α) nd its trnsition set E = {(i,, w, j): (j, w) follow(α, i) nd pos(α, j) = }. (2) A stte i pos 0 (α) is finl iff there exist w K such tht (i, w) lst 0 (α) nd when it is finl ρ(i) = w. The following lemm shows tht there exists simple reltionship etween the first, lst, nd follow functions nd the ɛ-closures of the sttes in the Thompson utomton tht dmit non-ɛ incoming trnsition (sttes q i ). Lemm 1. Let α e weighted regulr expression nd let A = A T (α). Then (i) (i, w) first(α) iff (p i, w) closure(i A ); (ii) (i, w) follow(α, j) iff (p i, w) closure(q j ); nd (iii) (i, w) lst(α) iff (F A, w) closure(q i ). Proof. Note tht if α =, α = ɛ or α =, then the properties trivilly hold. The proof is y induction on the length of the regulr expression nd is given in the cse α = β γ. Other cses cn e treted similrly. Assume tht the properties hold for ll expressions shorter thn α. Let A = A T (α), B = A T (β) nd C = A T (γ). If α = β γ, then closure A (I A ) = closure B (I B ) [B][ɛ] closure C (I A ), thus, since [B][ɛ] = null(β), (i) holds

7 y induction. If j pos(γ), then closure A (q j ) = closure C (q j ). Otherwise, if j pos(β), then closure A (q j ) = closure B (q j ) closure B (q j ), F B closure C (I C ). (3) Thus, y induction, oth (ii) nd (iii) hold. The following proposition follows directly from the lemm just presented. Proposition 1. Let α e weighted regulr expression. Then A G (α) = rmeps(a T (α)). (4) Proof. The only sttes potentilly ccessile in rmeps(a T (α)) re the sttes q i, i 0, since they re the only sttes with non-ɛ incoming trnsitions. A stte q i is finl with weight w in rmeps(a T (α)) iff (F AT (α), w) closure(q i ), tht is, y Lemm 1, iff (i, w) lst 0 (α). (q j, i, w, q i ) is trnsition in rmeps(a T (α)) iff (p i, w) closure(q j ), tht is, y Lemm 1, iff (i, w) follow(α, j). Thus, A G (α) = rmeps(a T (α)). This proposition suggests nturl lgorithm to compute the Glushkov utomton. The following lemm helps determine its complexity. Lemm 2. Let A e the Thompson utomton of weighted regulr expression over k-closed semiring nd let s e stte of A. Then, the shortest-distnce lgorithm of [19] cn e used to compute the shortest distnces from the source stte s to ll sttes of A in liner time. Proof. We give sketch of the proof. The complexity of the single-source shortestdistnce lgorithm of [19] depends on the queue discipline used, tht is the order in which sttes re extrcted from the queue. One cn use queue discipline tht tkes dvntge of the specific structure of the Thompson utomton. Ech suterm of the form β +γ or β defines su-utomton with n entry stte nd n exit stte. The pproprite queue discipline enforces tht ech su-utomton e fully visited efore eing exited. The lgorithm of [19] cn lso e modified to store the shortest-distnce through the su-utomton of β suterm once it hs een computed nd void susequent revisit. With tht queue discipline, the complexity of the lgorithm is liner. Theorem 1. Let α e weighted regulr expression over semiring K nullk-closed for α nd let m = α nd n = α Σ. Then, the Glushkov utomton of α cn e constructed in time O(mn) y pplying ɛ-removl to its Thompson utomton. Proof. If K is null-k-closed for α, then K is k-closed for ll the pths considered during the computtion of the ɛ-closures nd, y Lemm 2, ech ɛ-closure cn e computed in liner time, tht is in O(m). Since n + 1 closures need to e computed, the totl complexity is in O(mn + n 2 ) = O(mn). In the unweighted cse, the unpulished mnuscript of [10] showed tht the Glushkov utomton could e otined y removing the ɛ-trnsitions from the Thompson utomton using specil-purpose ɛ-removl lgorithm.

8 4 Follow Automton The follow utomton of n unweighted regulr expression α, denoted y A F (α) ws introduced y [12]. Figure 1(c) shows n exmple. It is the quotient of A G (α) y the equivlence reltion F defined over pos 0 (α) y: i F j iff { {i, j} lst0 (α) or {i, j} lst 0 (α) =, nd follow(α, i) = follow(α, j). (5) Proposition 2. For ny regulr expression α, the following identities hold: A F (α) = min(a G (α)) nd A F (α) = min(a G (α)). Note tht it is mentioned in [12] tht minimiztion could e used to construct the follow utomt ut the uthors clim tht the complexity of minimiztion would e in O(n 2 log n) mking this pproch less efficient. The following lemm shows tht minimiztion hs in fct etter complexity in this cse. Oserve tht A G (α) is deterministic. Lemm 3. The time complexity of Hopcroft s minimiztion lgorithm pplied to A G (α) is liner in the size of A G (α): it is in O(n 2 ) where n = α Σ. Proof. We give sketch of the proof. The log Q fctor in Hopcroft s lgorithm corresponds to the numer of times the incoming trnsitions t given stte q re used to split suset (tenttive equivlence clss). In A G (α), trnsitions shring the sme lel hve ll the sme destintion stte (the utomton is 1-locl), thus ech incoming trnsition of stte q cn only e used to split suset once. The numer of trnsitions in A G (α) is t most n 2. The lemm holds in fct for ll 1-locl utomt. This leds to simple lgorithm for constructing the follow utomton of regulr expression α sed on: A F (α) = min(rmeps(a T (α))). (6) The complexity of this lgorithm is in O(mn) which is the sme s tht of the more complicted nd specil-purpose lgorithms of [12, 7]. When the semiring K is wekly divisile, zero-sum free, nd closed, we cn define the follow utomton of weighted regulr expression α s: A F (α) = min(a G (α)). Theorem 2. Let α e weighted regulr expression over K. If K is k-closed for the Thompson utomton of α, then the follow utomton of α cn e computed in O(mn) y pplying epsilon-removl followed y weighted minimiztion to the Thompson utomton of α. Proof. The shortest-distnce computtion required y weight-pushing cn e done in O(m) in the cse of A T (α) nd is preserved y ɛ-removl. The weighted utomton push(a G (α)) is 1-locl when considered s finite utomton over pirs (lel, weight), thus Lemm 3 cn e pplied.

9 5 Antimirov Automton The definition of the Antimirov utomton of regulr expression is sed on tht of the prtil derivtives of regulr expressions, which re multisets of pirs of the form (w, α) where w K is weight nd α weighted regulr expression over K. For ny weight k K nd ny regulr expression β, we define the following opertions: k (w, α) = (k w, α) (w, α) k = (w, αk) (w, α) β = (w, α β), (7) which cn e nturlly extended to multisets of pirs (w, α). By multisets, we men tht {(w, α)} {(w, α )} = {(w, α), (w, α )}. The prtil derivtive of α with respect to Σ is the multiset of pirs (w, α) defined recursively y: (ɛ) = (1) = (β + γ) = (β) (γ) () = ɛ if =, otherwise (β γ) = (β) γ null(β) (γ) (kβ) = k (β) (β ) = null(β) (β) β (βk) = (β) k. The prtil derivtive of α with respect to the string s Σ is denoted y s (α) nd recursively defined y s (α) = ( s (α)). Let D(α) = {β : (w, β) s (α) with s Σ nd w K}. Note tht for D(α) to e well-defined, we need to define when two expressions re the sme. Here, we will only llow the following identities: α = α =, + α = α + =, 0α = α0 =, ɛ α = α ɛ = α, 1α = α1 = α, k(k α) = (k k )α, (αk)k = α(k k ) nd (α + β) γ = α γ + β γ. 2 The Antimirov or prtil derivtives utomton A A (α) of α is then the utomton defined y A A (α) = (Σ, D(α), E, α, 1, F, null) where E = {(β,, w, γ): w = (w,γ) (β) w } nd F = {β D(α): null(β) 0}. Figure 1(d) shows the Antimirov utomton for specific regulr expression. Let Σ = Σ {ɛ 1 +, ɛ 2 +, ɛ 1, ɛ 2 }. We denote y ÂT(α) the weighted utomton over Σ otined y recursively mrking some of the ɛ-trnsitions of the Thompson utomton A T (α) s follows: if α = β + γ, we lel y ɛ 1 + (ɛ 2 +) the ɛ-trnsition from I AT (α) to I AT (β) (resp. I AT (γ)); if α = β, we lel y ɛ 1 (ɛ 2 ) the two ɛ-trnsitions to I AT (β) (resp. F AT (α)). Oserve tht ÂT(α) cn e viewed s n utomton recognizing the expression α over Σ recursively defined y =, ɛ = ɛ, â =, kβ = k β, βk = βk, β + γ = ɛ 1 + β + ɛ 2 + γ, β γ = β γ nd β = (ɛ 1 β) ɛ 2. For i pos 0 (α), we use the sme nottion q i (with q 0 = I) for the corresponding sttes in A T (α), Â T (α) nd rmeps(ât(α)). For stte q in rmeps(ât(α)), we define y L(q) the lnguge recognized from q considering rmeps(ât(α)) s n unweighted utomton over pirs (symol,weight). Lemm 4 follows from our mrking of the ɛ-trnsitions. 2 These identities re the trivil identities considered in [15] except for the lst two which were dded to simplify our presenttion. Any lrger set of identities cn e hndled with our method y rewriting α in the corresponding norml form.

10 Lemm 4. For i pos 0 (α), L(q i ) uniquely defines regulr expression over Σ, denoted y δ i (or δ α i in the presence of miguity). Lemm 5. For ll i pos 0 (α) nd j pos(α), we hve for p j, q i in A T (α) tht: (p j, w) closure(q i ) iff (w, δ j ) (δ i ). (8) Proof. The proof is y induction on the length of the regulr expression. If α =, α = ɛ or α =, then the properties trivilly hold. We give the proof in the cse α = β γ, other cses cn e treted similrly. Let A = A T (α), B = A T (β) nd C = A T (γ). If q i is in C, then δi α = δ γ i nd closure A (q i ) = closure C (q i ). Therefore, if (w, p j ) closure A (q i ), p j is in C nd then δj α = δγ j. Hence (8) recursively holds. If q i is in B, then δi α = δ β i γ nd we hve: (δ α i ) = (δ β i ) γ null(δβ i ) (γ) (9) closure A (q i ) = closure B (q i ) null(δ β i ) closure C(I C ). (10) By induction, we hve (p j, w) closure B (q i ) iff (w, δ β j ) (δ β i ), nd (p j, w) closure C (I C ) iff (w, δ γ j ) (δ γ 0 ) = (γ). Hence (8) follows. Note tht δ 0 = α, thus Lemm 5 implies tht the δ i re the derived terms of α, more precisely, i δ i is surjection from pos 0 (α) onto D(α). This leds us to the following result, where min B is unweighted minimiztion when ech pir (lel, weight) is treted s regulr symol nd rmeps denotes the removl of the mrked ɛ s. Proposition 3. We hve A A (α) = rmeps(min B (rmeps(ât(α)))). Proof. Note tht rmeps(ât(α)) is deterministic. During minimiztion, two sttes q i nd q j re merged iff L(q i ) = L(q j ), tht is, y Lemm 4, iff δ i = δ j. Thus, there is ijection etween D(α) nd the set of sttes of min B (rmeps(ât(α))) hving n incoming trnsition with lel in Σ, nd thus lso etween D(α) nd the set of sttes of A = rmeps(min B (rmeps(ât(α)))). Lemm 5 ensures tht the trnsitions of A re consistent with the definition of A A (α). Theorem 3. Let α e weighted regulr expression over K. If K is null-k-closed for α, then the Antimirov utomton of α cn e computed in O(m log m + mn) using ɛ-removl nd minimiztion. Theorem 3 follows from the fct tht rmeps(ât(α)) hs O(m) sttes nd trnsitions. In the unweighted cse, this complexity mtches tht of the more complicted nd est known lgorithm of [8]. In the weighted cse, the use of minimiztion over (lel,weight) pirs is suoptiml since sttes tht would e equivlent modulo -multiplictive fctor re not merged. When possile, using weighted minimiztion insted would led to smller utomton in generl. For exmple, if K is closed, we cn defined

11 the normlized Antimirov utomton of α s rmeps(min K (rmeps(ât(α)))). This utomton is lwys smller thn the Antimirov utomton nd the utomton of unitry derived terms of [15]. 3 When K is k-closed, it cn e constructed in O(m log m + mn). Remrk. When the condition out k-closedness (null-k-closedness for α) of K is relxed to the closedness of K (resp. tht α is well-defined), ll our construction lgorithms cn still e used y replcing the generic single-source shortestdistnce lgorithm with generliztion of the Floyd-Wrshll lgorithm [14, 19], leding to complexity of O(m 3 ). It is not hrd however to mintin the qudrtic complexity y modifying the generic single-source shortest-distnce lgorithm to tke dvntge of the specil topology of the Thompson utomton. In the unweighted cse, every regulr expression cn e strightforwrdly rewritten in ɛ-norml form such tht m = O(n). In tht cse, our O(mn) nd O(m log m + mn) complexities ecome O(m + n 2 ) which coincides with wht is often reported in the literture. 6 Conclusion We presented simple nd unified view of ɛ-free utomt representing unweighted nd weighted regulr expressions. We showed tht stndrd unweighted nd weighted epsilon-removl nd minimiztion lgorithms cn e used to crete the Glushkov, follow, nd Antimirov utomt nd tht the time complexity of our construction lgorithms is t lest s fvorle s tht of the est previously known lgorithm. This provides etter understnding of the ɛ-free utomt representing regulr expressions. It lso suggests using other comintions of epsilon-removl nd minimiztion for creting ɛ-free utomt. For exmple, in some contexts, it might e eneficil to use reverse-epsilon-removl rther thn epsilon-removl [18]. Note lso tht the Glushkov utomton cn e constructed on-the-fly since Thompson s construction nd epsilon-removl oth dmit n on-demnd implementtion. Acknowledgments. This project ws sponsored in prt y the Deprtment of the Army Awrd Numer W23RYX-3275-N605. The U.S. Army Medicl Reserch Acquisition Activity, 820 Chndler Street, Fort Detrick MD is the wrding nd dministering cquisition office. The content of this mteril does not necessrily reflect the position or the policy of the Government nd no officil endorsement should e inferred. This work ws lso prtilly funded y the New York Stte Office of Science Technology nd Acdemic Reserch (NYSTAR). References 1. A. V. Aho, R. Sethi, nd J. D. Ullmn. Compilers, Principles, Techniques nd Tools. Addison Wesley: Reding, MA, This utomton cn e viewed in our pproch s the result of simpler form of reweighting thn weight-pushing, the reweighting used y weighted minimiztion.

12 2. V. M. Antimirov. Prtil derivtives of regulr expressions nd finite utomton constructions. Theoreticl Computer Science, 155(2): , G. Berry nd R. Sethi. From regulr expressions to deterministic utomt. Theoreticl Computer Science, 48(3): , A. Brüggemnn-Klein. Regulr expressions into finite utomt. Theoreticl Computer Science, 120(2): , P. Cron nd M. Flouret. Glushkov construction for series: the non commuttive cse. Interntionl Journl of Computer Mthemtics, 80(4): , J.-M. Chmprnud, É. Lugerotte, F. Ourdi, nd D. Zidi. From regulr weighted expressions to finite utomt. In Proceedings of CIAA 2003, volume 2759 of Lecture Notes in Computer Science, pges Springer-Verlg, J.-M. Chmprnud, F. Nicrt, nd D. Zidi. Computing the follow utomton of n expression. In Proceedings of CIAA 2004, volume 3317 of Lecture Notes in Computer Science, pges Springer-Verlg, J.-M. Chmprnud nd D. Zidi. Computing the eqution utomton of regulr expression in O(s 2 ) spce nd time. In Proceedings of CPM 2001, volume 2089 of Lecture Notes in Computer Science, pges Springer-Verlg, C.-H. Chng nd R. Pge. From regulr expressions to DFA s using compressed NFA s. Theoreticl Computer Science, 178(1-2):1 36, D. Gimmrresi, J.-L. Ponty, nd D. Wood. Glushkov nd Thompson constructions: synthesis V. M. Glushkov. The strct theory of utomt. Russin Mthemticl Surveys, 16:1 53, L. Ilie nd S. Yu. Follow utomt. Informtion nd Computtion, 186(1): , S. C. Kleene. Representtions of events in nerve sets nd finite utomt. In C. E. Shnnon, J. McCrthy, nd W. R. Ashy, editors, Automt Studies, pges Princeton University Press, D. J. Lehmnn. Algeric structures for trnsitives closures. Theoreticl Computer Science, 4:59 76, S. Lomrdy nd J. Skrovitch. Derivtives of rtionl expressions with multiplicity. Theoreticl Computer Science, 332(1-3): , R. McNughton nd H. Ymd. Regulr expressions nd stte grphs for utomt. IEEE Trnsctions on Electronic Computers, 9(1):39 47, M. Mohri. Finite-Stte Trnsducers in Lnguge nd Speech Processing. Computtionl Linguistics, 23:2, M. Mohri. Generic e-removl nd input e-normliztion lgorithms for weighted trnsducers. Interntionl Journl of Foundtions of Computer Science, 13(1): , M. Mohri. Semiring Frmeworks nd Algorithms for Shortest-Distnce Prolems. Journl of Automt, Lnguges nd Comintorics, 7(3): , G. Nvrro nd M. Rffinot. Fst regulr expression serch. In Proceedings of WAE 99, volume 1668 of Lecture Notes in Computer Science, pges Springer-Verlg, G. Nvrro nd M. Rffinot. Flexile pttern mtching. Cmridge University Press, J.-L. Ponty, D. Zidi, nd J.-M. Chmprnud. A new qudrtic lgorithm to convert regulr expression into utomt. In Proceedings of WIA 96, volume 1260 of Lecture Notes in Computer Science, pges Springer-Verlg, M.-P. Schützenerger. On the definition of fmily of utomt. Informtion nd Control, 4: , K. Thompson. Regulr expression serch lgorithm. Communictions of the ACM, 11(6): , 1968.

A Unified Construction of the Glushkov, Follow, and Antimirov Automata, (TR )

A Unified Construction of the Glushkov, Follow, nd Antimirov Automt, (TR2006-880) Cyril Alluzen nd Mehryr Mohri Cournt Institute of Mthemticl Sciences 251 Mercer Street, New York, NY 10012, USA {lluzen,