A Unified Construction of the Glushkov, Follow, and Antimirov Automata, (TR )


Cyril Allauzen and Mehryar Mohri
Courant Institute of Mathematical Sciences
251 Mercer Street, New York, NY 10012, USA
{allauzen,mohri}@cs.nyu.edu

This work was partially funded by the New York State Office of Science Technology and Academic Research (NYSTAR). This project was sponsored in part by the Department of the Army Award Number W23RYX-3275-N605. The U.S. Army Medical Research Acquisition Activity, 820 Chandler Street, Fort Detrick MD is the awarding and administering acquisition office. The content of this material does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.

Abstract. Many techniques have been introduced in the last few decades to create ɛ-free automata representing regular expressions: Glushkov automata, the so-called follow automata, and Antimirov automata. This paper presents a simple and unified view of all these ɛ-free automata both in the case of unweighted and weighted regular expressions. It describes simple and general algorithms with running time complexities at least as good as that of the best previously known techniques, and provides concise proofs. The construction methods are all based on two standard automata algorithms: epsilon-removal and minimization. This contrasts with the multitude of complicated and special-purpose techniques and proofs put forward by others to construct these automata. Our analysis provides a better understanding of ɛ-free automata representing regular expressions: they are all the results of the application of some combinations of epsilon-removal and minimization to the classical Thompson automata. This makes it straightforward to generalize these algorithms to the weighted case, which also results in much simpler algorithms than existing ones. For weighted regular expressions over a closed semiring, we extend the notion of follow automata to the weighted case. We also present the first algorithm to compute the Antimirov automata in the weighted case.

1 Introduction

The construction of finite automata representing regular expressions has been widely studied due to its multiple applications to pattern-matching and many

other areas of text processing [1, 22]. The most classical construction, Thompson's construction [14, 25], creates a finite automaton with a number of states and transitions linear in the length $m$ of the regular expression. Figure 1(a) shows an example. The time complexity of the algorithm is also linear, $O(m)$. But Thompson's automaton contains transitions labeled with the empty string ɛ, which create a delay in pattern matching.

Many alternative techniques have been introduced in the last few decades to create ɛ-free automata representing regular expressions, in particular, Glushkov automata [11], follow automata [13], and Antimirov automata [2].

The Glushkov automaton, or position automaton, was independently introduced by [11] and [17]. Figure 1(b) shows an example for a particular regular expression. The automaton has exactly $n + 1$ states but up to $n^2$ transitions, where $n$ is the number of occurrences of alphabet symbols appearing in the expression. For a reasonable expression, $m = O(n)$, making it quadratically larger than the Thompson automaton. When using bit-parallelism for regular expression search, due to its smaller number of states, the Glushkov automaton can be represented with half the number of machine words required by the Thompson automaton [21, 22].

Several techniques have been suggested for constructing the Glushkov automaton. In [3], the construction is based on the recursive definition of the follow function and has a complexity of $O(n^3)$. The algorithm described by [4] has a complexity of $O(m + n^2)$ and is based on an optimization of the recursive definition of the follow function. It requires the expression to be first rewritten in star-normal form, which can be done non-trivially in $O(m)$. Several other quadratic algorithms have been given: that of [9], which is based on an optimization of the follow recursion, and that of [23], based on the ZPC structure, which consists of two mutually linked copies of the syntactic tree of the expression.

The Antimirov or partial derivatives automaton was introduced by [2]. Figure 1(d) shows an example. It is in general smaller than the Glushkov automaton, with up to $n + 1$ states and up to $n^2$ transitions. It was in fact proven by [8] (see [13] for a simpler proof) to be the quotient of the Glushkov automaton for some equivalence relation. The complexity of the original construction algorithm by [2] is $O(m^5)$. [8] presented an algorithm whose complexity is $O(m^2)$.

Finally, the follow automaton was introduced by [13]; it is the quotient of the Glushkov automaton by the follow equivalence: two states are equivalent if they have the same follow and the same finality. Figure 1(c) presents an example. The authors gave an $O(m + n^2)$ algorithm where some ɛ-transitions are removed from the automaton at each step of the Thompson construction as well as at the end. An $O(m + n^2)$ algorithm using the ZPC structure was given in [7], which requires the regular expression to be rewritten in star-normal form.

Some of these results have been extended to weighted regular expressions over arbitrary semirings. The generalization of the Thompson construction trivially follows from [24]. The Glushkov automaton can be naturally extended to the weighted case [5], and an $O(m^2)$ construction algorithm based on the generalization of the ZPC construct was given by [6]. The Antimirov automaton was generalized to the weighted case by [16], but no explicit construction algorithm or complexity analysis was given by the authors.

This paper presents a simple and unified view of all these ɛ-free automata (Glushkov, follow, and Antimirov) both in the case of unweighted and weighted regular expressions. It describes simple and general algorithms with running time complexities at least as good as that of the best previously known techniques, and provides concise proofs. The construction methods are all based on two standard automata algorithms: epsilon-removal (footnote 1) and minimization, as summarized by the following table:

$$
\begin{array}{lll}
\text{Automaton} & \text{Algorithm} & \text{Complexity} \\ \hline
\text{Glushkov} & \mathrm{rmeps}(T) & O(mn) \\
\text{Follow} & \overline{\min(\mathrm{rmeps}(\overline{T}))} & O(mn) \\
\text{Antimirov} & \mathrm{rmeps}(\min(\mathrm{rmeps}(\hat{T}))) & O(m \log m + mn)
\end{array}
$$

Here $T$ is the Thompson automaton and $\overline{T}$ is the automaton derived from $T$ by marking alphabet symbols with their position in the expression. When the symbols are already marked, the same overline notation denotes the operation that removes the marking. $\hat{T}$ is obtained by marking some ɛ-transitions in $T$, making it deterministic (the marked ɛ-transitions are removed by the outer rmeps operation).

Footnote 1. ɛ-removal is less well known as an algorithm because it has often been, and continues to be, presented only as part of determinization in many textbooks.

This contrasts with the multitude of complicated and special-purpose techniques and proofs put forward by others to construct these automata. There is no need to fine-tune recursions, no requirement that the regular expression be in star-normal form, and no need to maintain multiple copies of the syntactic tree. Our analysis provides a better understanding of ɛ-free automata representing regular expressions: they are all the results of the application of some combinations of epsilon-removal and minimization to the classical Thompson automata.

This makes it straightforward to generalize these algorithms to the weighted case by using the generalization of ɛ-removal and minimization [18, 19]. This also results in much simpler algorithms than existing ones. In particular, this leads to a straightforward algorithm for the construction of the Glushkov automaton of a weighted regular expression and, in the case of closed semirings, allows us to generalize the notion of follow automaton to the weighted case. We also give the first explicit construction algorithm of the Antimirov automaton of a weighted expression. When the semiring is k-closed (or only ɛ-k-closed for the regular expression in the Glushkov case), the complexities of the construction algorithms are the same as in the unweighted case.

2 Preliminaries

Semirings. A semiring $(\mathbb{K}, \oplus, \otimes, 0, 1)$ is a ring that may lack negation. $\mathbb{K}$ is closed if $a^* = \bigoplus_{n \geq 0} a^n$ is defined for all $a \in \mathbb{K}$, and k-closed if there exists

$k \geq 0$ such that $a^* = \bigoplus_{n=0}^{k} a^n$ for all $a \in \mathbb{K}$. Examples of semirings are the Boolean semiring $(\mathbb{B}, \vee, \wedge, 0, 1)$, the tropical semiring $(\mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$, and the real semiring $(\mathbb{R}_+, +, \times, 0, 1)$.

Fig. 1. (a) The Thompson automaton, (b) Glushkov automaton, (c) follow automaton, and (d) Antimirov automaton representing the regular expression $\alpha = (a + b)(a^* + ba^* + b^*)^*$. This regular expression is the running example from [13].

Weighted automata. A weighted automaton $A$ over a semiring $\mathbb{K}$ is a 7-tuple $(\Sigma, Q, E, I, F, \lambda, \rho)$ where: $\Sigma$ is a finite alphabet; $Q$ is a finite set of states; $I \subseteq Q$ the set of initial states; $F \subseteq Q$ the set of final states; $E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times \mathbb{K} \times Q$ a finite set of transitions; $\lambda : I \to \mathbb{K}$ the initial weight function; and $\rho : F \to \mathbb{K}$ the final weight function.

Given a transition $e \in E$, we denote by $i[e]$ its input label, $p[e]$ its origin or previous state, $n[e]$ its destination or next state, and $w[e]$ its weight. Given a state $q \in Q$, we denote by $E[q]$ the set of transitions leaving $q$. A path $\pi = e_1 \cdots e_k$ is an element of $E^*$ with consecutive transitions: $n[e_{i-1}] = p[e_i]$, $i = 2, \ldots, k$. We extend $n$ and $p$ to paths by setting $n[\pi] = n[e_k]$ and $p[\pi] = p[e_1]$. A cycle $\pi$ is a path whose origin and destination states coincide: $n[\pi] = p[\pi]$. We denote by $P(q, q')$ the set of paths from $q$ to $q'$ and by $P(q, x, q')$ the set of paths from $q$ to $q'$ with input label $x \in \Sigma^*$. These definitions can be extended to subsets $R, R' \subseteq Q$ by $P(R, x, R') = \bigcup_{q \in R,\, q' \in R'} P(q, x, q')$. The labeling function $i$ and the weight function $w$ can also be extended to paths: $i[\pi] = i[e_1] \cdots i[e_k]$ and $w[\pi] = w[e_1] \otimes \cdots \otimes w[e_k]$. The weight associated by $A$ to

each input string $x \in \Sigma^*$ is $[\![A]\!](x) = \bigoplus_{\pi \in P(I, x, F)} \lambda(p[\pi]) \otimes w[\pi] \otimes \rho(n[\pi])$. $[\![A]\!](x)$ is defined to be $0$ when $P(I, x, F) = \emptyset$.

General algorithms. Let $A$ be a weighted automaton over $\mathbb{K}$. The shortest distance from $p$ to $q$ is defined as $d[p, q] = \bigoplus_{\pi \in P(p, q)} w[\pi]$. It can be computed using the generic single-source shortest-distance algorithm of [20] if $\mathbb{K}$ is k-closed for $A$, or using a generalization of Floyd-Warshall [15, 20] if $\mathbb{K}$ is closed for $A$.

The general ɛ-removal algorithm of [19] consists of first computing the ɛ-closure of each state $p$ in $A$,

$$\mathrm{closure}(p) = \Big\{ (q, w) \;\Big|\; w = d_\epsilon[p, q] = \bigoplus_{\pi \in P(p, q),\, i[\pi] = \epsilon} w[\pi] \neq 0 \Big\}, \tag{1}$$

and then, for each state $p$, of deleting all the outgoing ɛ-transitions of $p$ and adding out of $p$ all the non-ɛ transitions leaving each state $q \in \mathrm{closure}(p)$, with their weight pre-$\otimes$-multiplied by $d_\epsilon[p, q]$. If $\mathbb{K}$ is k-closed for the ɛ-cycles of $A$ (footnote 2), then the generic single-source shortest-distance algorithm [20] can be used to compute the ɛ-closures.

Weight pushing [18] is a normalization algorithm that redistributes the weights along the paths of $A$ such that $\bigoplus_{e \in E[q]} w[e] \oplus \rho(q) = 1$ for every state $q \in Q$. We denote by $\mathrm{push}(A)$ the resulting automaton. The algorithm requires that $\mathbb{K}$ be zero-sum free, weakly left divisible, and closed or k-closed for $A$, since it depends on the computation of $d[q, F]$ for all $q \in Q$. It was proved in [18] that, if $A$ is deterministic (i.e. if no two transitions leaving any state share the same label and if it has a unique initial state), then the algorithm consisting of weight pushing followed by unweighted minimization (considering each pair (label, weight) as a single symbol) leads to a minimal automaton equivalent to $A$, denoted by $\min(A)$. See Figure 2 for an illustration of these algorithms; more detailed descriptions are given in the appendix.

Regular expressions. A weighted regular expression over the semiring $\mathbb{K}$ is recursively defined by: $\emptyset$, $\epsilon$ and $a \in \Sigma$ are regular expressions, and if $\alpha$ and $\beta$ are regular expressions then $k\alpha$, $\alpha k$ for $k \in \mathbb{K}$, $\alpha + \beta$, $\alpha \cdot \beta$ and $\alpha^*$ are also regular expressions. We denote by $|\alpha|$ the length of $\alpha$, and by $|\alpha|_\Sigma$ the width of $\alpha$, i.e. the number of occurrences of alphabet symbols in $\alpha$. Let $\mathrm{pos}(\alpha) = \{1, 2, \ldots, |\alpha|_\Sigma\}$ be the set of (alphabet symbol) positions in $\alpha$. An unweighted regular expression can be seen as a weighted expression over the Boolean semiring $(\mathbb{B}, \vee, \wedge, 0, 1)$.

We denote by $A_T(\alpha)$ the Thompson automaton of $\alpha$ and by $I_{A_T(\alpha)}$ and $F_{A_T(\alpha)}$ its unique initial and final states. For $i \in \mathrm{pos}(\alpha)$, we define $p_i$ and $q_i$ as the states such that the alphabet symbol at the $i$-th position in $\alpha$ corresponds to the transition from $p_i$ to $q_i$. These states are the only states having, respectively, a non-ɛ outgoing or incoming transition.

Footnote 2. For $A$ to be well defined, $\mathbb{K}$ needs to be closed for the ɛ-cycles of $A$.
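The semiring and string-weight definitions above can be made concrete with a minimal Python sketch. It is an illustration only: the `Semiring` class, the constant names, and `string_weight` are hypothetical names introduced here, and the function assumes an ɛ-free automaton given as a list of `(p, label, weight, n)` transitions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    """A semiring (K, plus, times, zero, one); names are illustrative."""
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any


BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
TROPICAL = Semiring(min, lambda a, b: a + b, float("inf"), 0.0)
REAL = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)


def string_weight(K, transitions, initial, final, x):
    """Weight [[A]](x) of an epsilon-free weighted automaton.

    transitions: list of (p, label, weight, n);
    initial, final: dicts mapping a state to its initial/final weight.
    """
    # forward[q] is the plus-sum of lambda(p[pi]) times w[pi] over all
    # paths pi from an initial state to q labeled with the prefix read so far.
    forward = dict(initial)
    for symbol in x:
        nxt = {}
        for (p, a, w, n) in transitions:
            if a == symbol and p in forward:
                contrib = K.times(forward[p], w)
                nxt[n] = K.plus(nxt.get(n, K.zero), contrib)
        forward = nxt
    total = K.zero  # stays zero when P(I, x, F) is empty
    for q, v in forward.items():
        if q in final:
            total = K.plus(total, K.times(v, final[q]))
    return total


# Two paths labeled "ab": weights (1, 3) and (2, 4).
edges = [(0, "a", 1.0, 1), (0, "a", 2.0, 2), (1, "b", 3.0, 3), (2, "b", 4.0, 3)]
real_ab = string_weight(REAL, edges, {0: REAL.one}, {3: REAL.one}, "ab")
trop_ab = string_weight(TROPICAL, edges, {0: TROPICAL.one}, {3: TROPICAL.one}, "ab")
```

In the real semiring the two paths contribute 1·3 + 2·4, while in the tropical semiring the same automaton yields min(1+3, 2+4).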

Fig. 2. (a) A weighted automaton $A_1$ over the real semiring $(\mathbb{R}_+, +, \times, 0, 1)$. (b) The result of the application of ɛ-removal to $A_1$. (c) A weighted automaton $A_2$ over the real semiring $(\mathbb{R}_+, +, \times, 0, 1)$. (d) The result of weight pushing. (e) The result of minimization. The initial weight in the last two automata is [...]

3 Glushkov Automaton

Let $\alpha$ be a weighted regular expression over the alphabet $\Sigma$ and the semiring $\mathbb{K}$. We denote by $\overline{\alpha}$ the weighted regular expression obtained by marking each symbol of $\alpha$ with its position. The Glushkov or position automaton $A_G(\alpha)$ of $\alpha$ is defined by the 7-tuple $(\Sigma, \mathrm{pos}_0(\alpha), E, 0, 1, F, \rho)$ where $\mathrm{pos}_0(\alpha) = \mathrm{pos}(\alpha) \cup \{0\}$,

$$E = \{ (i, a, w, j) : (j, w) \in \mathrm{follow}(\alpha, i) \text{ and } \mathrm{pos}(\alpha, j) = a \}, \tag{2}$$

and for $i \in \mathrm{pos}_0(\alpha)$, $i \in F$ iff there exists $w \in \mathbb{K}$ such that $(i, w) \in \mathrm{last}_0(\alpha)$, and then $\rho(i) = w$.

The functions $\mathrm{null}(\alpha) \in \mathbb{K}$, $\mathrm{first}(\alpha) \subseteq \mathrm{pos}(\alpha) \times \mathbb{K}$, $\mathrm{last}(\alpha) \subseteq \mathrm{pos}(\alpha) \times \mathbb{K}$ and $\mathrm{follow}(\alpha, i) \subseteq \mathrm{pos}(\alpha) \times \mathbb{K}$ are recursively defined over the subterms of $\overline{\alpha}$ as shown in the tables below. We also define $\mathrm{follow}(\alpha, 0) = \mathrm{first}(\alpha)$, and $\mathrm{last}_0(\alpha)$ as $\mathrm{last}(\alpha) \cup \{(0, \mathrm{null}(\alpha))\}$ if $\mathrm{null}(\alpha) \neq 0$, and $\mathrm{last}(\alpha)$ otherwise.

For $X \subseteq \mathrm{pos}(\alpha) \times \mathbb{K}$, $k \in \mathbb{K}$ and $i \in \mathrm{pos}(\alpha)$, we define $k \otimes X = \{(i, k \otimes w) \mid (i, w) \in X\}$ if $k \neq 0$ and $0 \otimes X = \emptyset$ ($X \otimes k$ is defined similarly), and $\langle X, i \rangle = w$ if there exists $w$ such that $(i, w) \in X$, and $\langle X, i \rangle = 0$ otherwise. The union of two weighted subsets $X$ and $Y$ is defined by $X \cup Y = \{ (i, \langle X, i \rangle \oplus \langle Y, i \rangle) \mid \langle X, i \rangle \oplus \langle Y, i \rangle \neq 0 \}$. For example, $\{(i, w)\} \cup \{(i, w')\} = \{(i, w \oplus w')\}$.
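The weighted-subset operations above map naturally onto dictionaries from positions to weights. The following is a minimal sketch under that assumption; the function names `left_mult`, `coeff`, and `union` are hypothetical and the semiring operations are passed in explicitly.

```python
def left_mult(k, X, times, zero):
    """k (x) X: scale each weight on the left; a zero scalar gives the empty set."""
    if k == zero:
        return {}
    return {i: times(k, w) for i, w in X.items()}


def coeff(X, i, zero):
    """<X, i>: the weight attached to position i in X, or zero if i is absent."""
    return X.get(i, zero)


def union(X, Y, plus, zero):
    """X u Y: plus-sum the weights position by position, dropping zero results."""
    out = {}
    for i in set(X) | set(Y):
        w = plus(coeff(X, i, zero), coeff(Y, i, zero))
        if w != zero:
            out[i] = w
    return out


# Real semiring over integers: plus is +, times is *, zero is 0.
add = lambda a, b: a + b
mul = lambda a, b: a * b
merged = union({1: 2}, {1: 3, 2: 5}, add, 0)
scaled = left_mult(2, {1: 3}, mul, 0)
```

Here position 1 appears in both operands, so its weights are summed, which is the behavior stated by the example $\{(i, w)\} \cup \{(i, w')\} = \{(i, w \oplus w')\}$.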

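$$
\begin{array}{l|l|l|l}
\alpha & \mathrm{null}(\alpha) & \mathrm{first}(\alpha) & \mathrm{last}(\alpha) \\ \hline
\emptyset & 0 & \emptyset & \emptyset \\
\epsilon & 1 & \emptyset & \emptyset \\
a_i & 0 & \{(i, 1)\} & \{(i, 1)\} \\
k\beta & k \otimes \mathrm{null}(\beta) & k \otimes \mathrm{first}(\beta) & \mathrm{last}(\beta) \\
\beta k & \mathrm{null}(\beta) \otimes k & \mathrm{first}(\beta) & \mathrm{last}(\beta) \otimes k \\
\beta + \gamma & \mathrm{null}(\beta) \oplus \mathrm{null}(\gamma) & \mathrm{first}(\beta) \cup \mathrm{first}(\gamma) & \mathrm{last}(\beta) \cup \mathrm{last}(\gamma) \\
\beta \cdot \gamma & \mathrm{null}(\beta) \otimes \mathrm{null}(\gamma) & \mathrm{first}(\beta) \cup \mathrm{null}(\beta) \otimes \mathrm{first}(\gamma) & \mathrm{last}(\beta) \otimes \mathrm{null}(\gamma) \cup \mathrm{last}(\gamma) \\
\beta^* & \mathrm{null}(\beta)^* & \mathrm{null}(\beta)^* \otimes \mathrm{first}(\beta) & \mathrm{last}(\beta) \otimes \mathrm{null}(\beta)^*
\end{array}
$$

$$
\begin{array}{l|l}
\alpha & \mathrm{follow}(\alpha, i) \\ \hline
\emptyset & \emptyset \\
\epsilon & \emptyset \\
a_j & \emptyset \\
k\beta & \mathrm{follow}(\beta, i) \\
\beta k & \mathrm{follow}(\beta, i) \\
\beta + \gamma & \mathrm{follow}(\beta, i) \text{ if } i \in \mathrm{pos}(\beta); \quad \mathrm{follow}(\gamma, i) \text{ if } i \in \mathrm{pos}(\gamma) \\
\beta \cdot \gamma & \mathrm{follow}(\beta, i) \cup \langle \mathrm{last}(\beta), i \rangle \otimes \mathrm{first}(\gamma) \text{ if } i \in \mathrm{pos}(\beta); \quad \mathrm{follow}(\gamma, i) \text{ if } i \in \mathrm{pos}(\gamma) \\
\beta^* & \mathrm{follow}(\beta, i) \cup \langle \mathrm{last}(\beta^*), i \rangle \otimes \mathrm{first}(\beta)
\end{array}
$$

$\mathrm{null}(\alpha) = \mathrm{null}(\overline{\alpha})$ is the value associated by $\alpha$ to ɛ. For $\alpha$ to be well defined, $\mathrm{null}(\beta)^*$ must be defined for every subterm $\beta^*$ of $\alpha$.

There is in fact a very simple relationship between the first, last and follow functions and the ɛ-closures of the states in the Thompson automaton that admit a non-ɛ incoming transition.

Lemma 1. Let $\alpha$ be a weighted regular expression and let $A = A_T(\alpha)$. Then
(i) $(i, w) \in \mathrm{first}(\alpha)$ iff $(p_i, w) \in \mathrm{closure}(I_A)$;
(ii) $(i, w) \in \mathrm{follow}(\alpha, j)$ iff $(p_i, w) \in \mathrm{closure}(q_j)$; and
(iii) $(i, w) \in \mathrm{last}(\alpha)$ iff $(F_A, w) \in \mathrm{closure}(q_i)$.

Proof. The proof is by induction on the length of the regular expression. If $\alpha = \emptyset$, $\alpha = \epsilon$ or $\alpha = a$, then the properties trivially hold. Due to lack of space, we only treat the case $\alpha = \beta \cdot \gamma$; the other cases can be treated similarly. Let $A = A_T(\alpha)$, $B = A_T(\beta)$ and $C = A_T(\gamma)$. If $\alpha = \beta \cdot \gamma$, then $\mathrm{closure}_A(I_A) = \mathrm{closure}_B(I_B) \cup [\![B]\!](\epsilon) \otimes \mathrm{closure}_C(I_C)$, thus (i) recursively holds since $[\![B]\!](\epsilon) = \mathrm{null}(\beta)$. If $j \in \mathrm{pos}(\gamma)$, then $\mathrm{closure}_A(q_j) = \mathrm{closure}_C(q_j)$. Otherwise $j \in \mathrm{pos}(\beta)$ and

$$\mathrm{closure}_A(q_j) = \mathrm{closure}_B(q_j) \cup \langle \mathrm{closure}_B(q_j), F_B \rangle \otimes \mathrm{closure}_C(I_C). \tag{3}$$

Thus, (ii) and (iii) recursively hold.

The following theorem follows directly from the lemma just presented.

Theorem 1. Let $\alpha$ be a weighted regular expression. Then:

$$A_G(\alpha) = \mathrm{rmeps}(A_T(\alpha)). \tag{4}$$

Let $\alpha$ be a weighted regular expression over $\mathbb{K}$. We say that $\mathbb{K}$ is ɛ-k-closed for $\alpha$ if there exists $k \geq 0$ such that for every subterm $\beta^*$ of $\alpha$, $\mathrm{null}(\beta)^* = \bigoplus_{n=0}^{k} \mathrm{null}(\beta)^n$.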
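Theorem 1 can be checked experimentally in the unweighted (Boolean) case. The sketch below is an illustration under several assumptions: expressions are nested tuples such as `("cat", x, y)`, the Thompson variant used introduces a fresh initial and final state for every subterm, and all function names are hypothetical. It builds the Thompson automaton, removes ɛ-transitions through ɛ-closures, and compares the result with the position automaton computed from first, last and follow.

```python
from itertools import count


def thompson(expr):
    """Return (initial, final, eps_edges, sym_edges); sym edge = (p, letter, q, position)."""
    fresh, pos, eps, sym = count(), count(1), [], []

    def build(e):
        i, f = next(fresh), next(fresh)
        kind = e[0]
        if kind == "eps":
            eps.append((i, f))
        elif kind == "sym":
            sym.append((i, e[1], f, next(pos)))
        elif kind == "alt":
            for sub in e[1:]:
                si, sf = build(sub)
                eps.extend([(i, si), (sf, f)])
        elif kind == "cat":
            li, lf = build(e[1])
            ri, rf = build(e[2])
            eps.extend([(i, li), (lf, ri), (rf, f)])
        elif kind == "star":
            si, sf = build(e[1])
            eps.extend([(i, si), (i, f), (sf, si), (sf, f)])
        return i, f

    i, f = build(expr)
    return i, f, eps, sym


def closure(state, eps):
    """Unweighted epsilon-closure: states reachable through epsilon-transitions."""
    seen, stack = {state}, [state]
    while stack:
        p = stack.pop()
        for (a, b) in eps:
            if a == p and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def rmeps(initial, final, eps, sym):
    """Epsilon-removal restricted to accessible states: the initial state and each q_i."""
    name = {initial: 0}
    name.update({q: i for (_, _, q, i) in sym})  # q_i is named by its position i
    edges, finals = set(), set()
    for state, s in name.items():
        cl = closure(state, eps)
        if final in cl:
            finals.add(s)
        edges |= {(s, a, i) for (p, a, _, i) in sym if p in cl}
    return edges, finals


def glushkov(expr):
    """Position automaton from the Boolean first/last/follow recursion."""
    follow, letter, pos = {0: set()}, {}, count(1)

    def walk(e):  # returns (nullable, first, last)
        kind = e[0]
        if kind == "eps":
            return True, set(), set()
        if kind == "sym":
            i = next(pos)
            letter[i], follow[i] = e[1], set()
            return False, {i}, {i}
        if kind == "star":
            _, f, l = walk(e[1])
            for i in l:
                follow[i] |= f
            return True, f, l
        (n1, f1, l1), (n2, f2, l2) = walk(e[1]), walk(e[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for i in l1:  # concatenation
            follow[i] |= f2
        return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())

    null, first, last = walk(expr)
    follow[0] = first
    edges = {(i, letter[j], j) for i in follow for j in follow[i]}
    return edges, last | ({0} if null else set())


a, b = ("sym", "a"), ("sym", "b")
inner = ("alt", ("alt", ("star", a), ("cat", b, ("star", a))), ("star", b))
expr = ("cat", ("alt", a, b), ("star", inner))  # (a+b)(a* + ba* + b*)*
same = rmeps(*thompson(expr)) == glushkov(expr)
```

For the running example of Figure 1, both constructions yield the same set of transitions and final states over the seven states 0 to 6.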

Lemma 2. Let $A$ be the Thompson automaton of a weighted regular expression over a k-closed semiring. There is a queue discipline for which the complexity of the single-source shortest-distance algorithm from any state in $A$ is linear.

Proof. We define the subterm depth of a state $q$ in $A$ as the number of subterms $\beta + \gamma$ and $\beta^*$ it belongs to. We then use a larger-subterm-depth-first queue discipline. The queue can be maintained in constant time since (1) there are at most two states having the same subterm depth in the queue at any time and (2) if $d$ is the maximal subterm depth of an element in the queue at a given time, the subterm depth of the state inserted next will be $d - 1$, $d$ or $d + 1$.

Theorem 2. Let $\alpha$ be a weighted regular expression over a semiring $\mathbb{K}$ that is ɛ-k-closed for $\alpha$. The Glushkov automaton of $\alpha$ can be constructed in time $O(mn)$ by applying ɛ-removal to its Thompson automaton.

Proof. If $\mathbb{K}$ is ɛ-k-closed for $\alpha$, then $\mathbb{K}$ is k-closed for all the paths considered during the computation of the ɛ-closures and, by Lemma 2, each ɛ-closure can be computed in $O(m)$. Since $n + 1$ closures need to be computed, the total complexity is in $O(mn + n^2) = O(mn)$.

In the unweighted case, the unpublished manuscript [10] showed that the Glushkov automaton could be obtained by removing the ɛ-transitions from the Thompson automaton. However, the authors used a special-purpose ɛ-removal algorithm and not the classical ɛ-removal algorithm, limiting the scope of their results.

4 Follow Automaton

The follow automaton of an unweighted regular expression $\alpha$, denoted by $A_F(\alpha)$, was introduced by [13]. It is the quotient of $A_G(\alpha)$ by the equivalence relation $\equiv_F$ defined over $\mathrm{pos}_0(\alpha)$ by:

$$i \equiv_F j \quad \text{iff} \quad \begin{cases} \{i, j\} \subseteq \mathrm{last}_0(\alpha) \text{ or } \{i, j\} \cap \mathrm{last}_0(\alpha) = \emptyset, \text{ and} \\ \mathrm{follow}(\alpha, i) = \mathrm{follow}(\alpha, j). \end{cases} \tag{5}$$

Theorem 3. For any regular expression $\alpha$, the following identities hold:

$$A_F(\overline{\alpha}) = \min(A_G(\overline{\alpha})) \qquad A_F(\alpha) = \overline{\min(A_G(\overline{\alpha}))}.$$

Note that it is mentioned in [13] that minimization could be used to construct the follow automaton, but the authors claim that the complexity of minimization would be in $O(n^2 \log n)$, making this approach less efficient. The following theorem shows that minimization has in fact a better complexity in this case. Observe that $A_G(\overline{\alpha})$ is deterministic.

Theorem 4. The time complexity of Hopcroft's minimization algorithm when applied to $A_G(\overline{\alpha})$ is linear in the size of the automaton, i.e., in $O(n^2)$ where $n = |\alpha|_\Sigma$.

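Proof. Due to space constraints, we give only a sketch of the proof. The $\log |Q|$ factor in Hopcroft's algorithm corresponds to the number of times the incoming transitions at a given state $q$ are used to split a subset (tentative equivalence class). In $A_G(\overline{\alpha})$, transitions sharing the same label all have the same destination state (the automaton is 1-local), thus each incoming transition of a state $q$ can only be used once to split a subset.

This theorem actually holds for all 1-local automata. This leads to a simple algorithm for constructing the follow automaton of a regular expression $\alpha$:

$$A_F(\alpha) = \overline{\min(\mathrm{rmeps}(A_T(\overline{\alpha})))}, \tag{6}$$

whose complexity $O(mn)$ is identical to that of the more complicated and special-purpose algorithms of [13, 7].

When the semiring $\mathbb{K}$ is weakly divisible, zero-sum free, and closed, we can then define the follow automaton of a weighted regular expression $\alpha$ as $A_F(\alpha) = \overline{\min(A_G(\overline{\alpha}))}$.

Theorem 5. If $\mathbb{K}$ is k-closed, then $A_F(\alpha)$ can be computed in $O(mn)$.

Proof. The shortest-distance computation required by weight pushing can be done in $O(m)$ in the case of $A_T(\overline{\alpha})$ and is preserved by ɛ-removal. The weighted automaton $\mathrm{push}(A_G(\overline{\alpha}))$ is 1-local when considered as a finite automaton over pairs (label, weight), thus Theorem 4 can be applied.

5 Antimirov Automaton

In the following we consider pairs $(w, \alpha)$ with $w \in \mathbb{K}$, and we define $k \otimes (w, \alpha) = (k \otimes w, \alpha)$, $(w, \alpha) \otimes k = (w, \alpha k)$ and $(w, \alpha) \cdot \beta = (w, \alpha \cdot \beta)$. These operations can naturally be extended to multisets (footnote 3) of pairs (weight, expression). The partial derivative of $\alpha$ with respect to $a \in \Sigma$ is the multiset of pairs (weight, expression) recursively defined by:

$$
\begin{array}{ll}
\partial_a(\epsilon) = \partial_a(1) = \emptyset & \partial_a(\beta + \gamma) = \partial_a(\beta) \cup \partial_a(\gamma) \\
\partial_a(b) = \{(1, \epsilon)\} \text{ if } a = b, \ \emptyset \text{ otherwise} & \partial_a(\beta \cdot \gamma) = \partial_a(\beta) \cdot \gamma \cup \mathrm{null}(\beta) \otimes \partial_a(\gamma) \\
\partial_a(k\beta) = k \otimes \partial_a(\beta) & \partial_a(\beta^*) = \mathrm{null}(\beta)^* \otimes \partial_a(\beta) \cdot \beta^* \\
\partial_a(\beta k) = \partial_a(\beta) \otimes k &
\end{array}
$$

The partial derivative of $\alpha$ with respect to the string $s \in \Sigma^*$, denoted $\partial_s(\alpha)$, is recursively defined by $\partial_{sa}(\alpha) = \partial_a(\partial_s(\alpha))$. Let $D(\alpha) = \{\beta : (w, \beta) \in \partial_s(\alpha) \text{ with } s \in \Sigma^* \text{ and } w \in \mathbb{K}\}$. Note that for $D(\alpha)$ to be well defined, we need to define when two expressions are the same. Here we only allow the following identities: $\emptyset \cdot \alpha = \alpha \cdot \emptyset = \emptyset$, $\emptyset + \alpha = \alpha + \emptyset = \alpha$, $0\alpha = \alpha 0 = \emptyset$, $\epsilon \cdot \alpha = \alpha \cdot \epsilon = \alpha$, $1\alpha = \alpha 1 = \alpha$, $k(k'\alpha) = (k \otimes k')\alpha$, $(\alpha k)k' = \alpha(k \otimes k')$ and $(\alpha + \beta) \cdot \gamma = \alpha \cdot \gamma + \beta \cdot \gamma$ (footnote 4).

Footnote 3. By multisets, we mean that $\{(w, \alpha)\} \cup \{(w, \alpha)\} = \{(w, \alpha), (w, \alpha)\}$.

Footnote 4. These identities are the trivial identities considered in [16], except for the last two, which were added to simplify our presentation. Any larger set of identities can be handled with our method by rewriting $\alpha$ in the corresponding normal form.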
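For intuition, the partial-derivative recursion can be sketched in the unweighted (Boolean) case, where multisets of pairs collapse to plain sets of expressions. This is a minimal illustration, not the construction analyzed in this paper: expressions are nested tuples, only the identities involving $\emptyset$ and ɛ in concatenations are applied, and the names `cat`, `nullable`, `pd`, and `derived_terms_automaton` are hypothetical.

```python
EMPTY, EPS = ("empty",), ("eps",)


def cat(x, y):
    """Concatenation modulo the trivial identities for the empty set and epsilon."""
    if x == EMPTY or y == EMPTY:
        return EMPTY
    if x == EPS:
        return y
    if y == EPS:
        return x
    return ("cat", x, y)


def nullable(e):
    """Boolean null(e): True iff e accepts the empty string."""
    kind = e[0]
    if kind in ("eps", "star"):
        return True
    if kind in ("empty", "sym"):
        return False
    if kind == "alt":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])  # concatenation


def pd(e, a):
    """Set of partial derivatives of e with respect to the letter a."""
    kind = e[0]
    if kind in ("empty", "eps"):
        return set()
    if kind == "sym":
        return {EPS} if e[1] == a else set()
    if kind == "alt":
        return pd(e[1], a) | pd(e[2], a)
    if kind == "star":
        return {cat(d, e) for d in pd(e[1], a)}
    left = {cat(d, e[2]) for d in pd(e[1], a)}  # concatenation
    return left | (pd(e[2], a) if nullable(e[1]) else set())


def derived_terms_automaton(e, alphabet):
    """States are the derived terms reachable from e; finals are the nullable ones."""
    states, todo, edges = {e}, [e], set()
    while todo:
        s = todo.pop()
        for a in alphabet:
            for t in pd(s, a):
                edges.add((s, a, t))
                if t not in states:
                    states.add(t)
                    todo.append(t)
    return states, edges, {s for s in states if nullable(s)}


a, b = ("sym", "a"), ("sym", "b")
tau = ("alt", ("alt", ("star", a), ("cat", b, ("star", a))), ("star", b))
alpha = ("cat", ("alt", a, b), ("star", tau))  # (a+b)(a* + ba* + b*)*
states, edges, finals = derived_terms_automaton(alpha, "ab")
```

On the running example, the reachable derived terms are $\alpha$, $\tau^*$, $a^*\tau^*$ and $b^*\tau^*$ with $\tau = a^* + ba^* + b^*$, so this sketch produces four states compared with the seven states of the Glushkov automaton.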

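The Antimirov or partial derivatives automaton $A_A(\alpha)$ of $\alpha$ is defined by the 7-tuple $(\Sigma, D(\alpha), E, \alpha, 1, F, \mathrm{null})$ where $E = \{ (\beta, a, w, \gamma) \mid w = \bigoplus_{(w', \gamma) \in \partial_a(\beta)} w' \}$ and $F = \{ \beta \in D(\alpha) \mid \mathrm{null}(\beta) \neq 0 \}$.

Let $\hat{\Sigma} = \Sigma \cup \{\epsilon_+^1, \epsilon_+^2, \epsilon_*^1, \epsilon_*^2\}$. We denote by $\hat{A}_T(\alpha)$ the weighted automaton over $\hat{\Sigma}$ obtained by recursively marking some of the ɛ-transitions of $A_T(\alpha)$ as follows: if $\alpha = \beta + \gamma$, we label by $\epsilon_+^1$ (resp. $\epsilon_+^2$) the ɛ-transition from $I_{A_T(\alpha)}$ to $I_{A_T(\beta)}$ (resp. $I_{A_T(\gamma)}$); if $\alpha = \beta^*$, we label by $\epsilon_*^1$ (resp. $\epsilon_*^2$) the two ɛ-transitions to $I_{A_T(\beta)}$ (resp. $F_{A_T(\alpha)}$). Observe that $\hat{A}_T(\alpha)$ can be viewed as an automaton recognizing the expression $\hat{\alpha}$ over $\hat{\Sigma}$ recursively defined by $\hat{\emptyset} = \emptyset$, $\hat{\epsilon} = \epsilon$, $\hat{a} = a$, $\widehat{k\beta} = k\hat{\beta}$, $\widehat{\beta k} = \hat{\beta} k$, $\widehat{\beta + \gamma} = \epsilon_+^1 \hat{\beta} + \epsilon_+^2 \hat{\gamma}$, $\widehat{\beta \cdot \gamma} = \hat{\beta} \cdot \hat{\gamma}$ and $\widehat{\beta^*} = (\epsilon_*^1 \hat{\beta})^* \epsilon_*^2$.

For $i \in \mathrm{pos}_0(\alpha)$, we use the same notation $q_i$ (with $q_0 = I$) for the corresponding states in $A_T(\alpha)$, $\hat{A}_T(\alpha)$ and $\mathrm{rmeps}(\hat{A}_T(\alpha))$. For a state $q$ in $\mathrm{rmeps}(\hat{A}_T(\alpha))$, we denote by $L(q)$ the language recognized from $q$, considering $\mathrm{rmeps}(\hat{A}_T(\alpha))$ as an unweighted automaton over pairs (symbol, weight). Lemma 3 follows from our marking of the ɛ-transitions.

Lemma 3. For $i \in \mathrm{pos}_0(\alpha)$, $L(q_i)$ uniquely defines a regular expression over $\Sigma$, denoted by $\delta_i$ (or $\delta_i^\alpha$ when there is an ambiguity).

Lemma 4. For all $i \in \mathrm{pos}_0(\alpha)$ and $j \in \mathrm{pos}(\alpha)$, with $a_j = \mathrm{pos}(\alpha, j)$, we have for $p_j$, $q_i$ in $A_T(\alpha)$ that:

$$(p_j, w) \in \mathrm{closure}(q_i) \quad \text{iff} \quad (w, \delta_j) \in \partial_{a_j}(\delta_i). \tag{7}$$

Proof. The proof is by induction on the length of the regular expression. If $\alpha = \emptyset$, $\alpha = \epsilon$ or $\alpha = a$, then the properties trivially hold. Due to lack of space, we only treat the case $\alpha = \beta \cdot \gamma$; the other cases can be treated similarly. Let $A = A_T(\alpha)$, $B = A_T(\beta)$ and $C = A_T(\gamma)$. If $q_i$ is in $C$, then $\delta_i^\alpha = \delta_i^\gamma$ and $\mathrm{closure}_A(q_i) = \mathrm{closure}_C(q_i)$. Therefore, if $(p_j, w) \in \mathrm{closure}_A(q_i)$, $p_j$ is in $C$ and then $\delta_j^\alpha = \delta_j^\gamma$. Hence (7) recursively holds. If $q_i$ is in $B$, then $\delta_i^\alpha = \delta_i^\beta \cdot \gamma$ and we have:

$$\partial_a(\delta_i^\alpha) = \partial_a(\delta_i^\beta) \cdot \gamma \cup \mathrm{null}(\delta_i^\beta) \otimes \partial_a(\gamma) \tag{8}$$

$$\mathrm{closure}_A(q_i) = \mathrm{closure}_B(q_i) \cup \mathrm{null}(\delta_i^\beta) \otimes \mathrm{closure}_C(I_C). \tag{9}$$

By induction, we have that $(p_j, w) \in \mathrm{closure}_B(q_i)$ iff $(w, \delta_j^\beta) \in \partial_{a_j}(\delta_i^\beta)$, and $(p_j, w) \in \mathrm{closure}_C(I_C)$ iff $(w, \delta_j^\gamma) \in \partial_{a_j}(\delta_0^\gamma) = \partial_{a_j}(\gamma)$. Hence (7) follows.

Observe that $\delta_0 = \alpha$; hence Lemma 4 implies that the $\delta_i$ are the derived terms of $\alpha$. More precisely, $i \mapsto \delta_i$ is a surjection from $\mathrm{pos}_0(\alpha)$ onto $D(\alpha)$. This leads us to the following result, where $\min_B$ is unweighted minimization with each pair (label, weight) treated as a regular symbol, and the outer rmeps denotes the removal of the marked ɛ's.

Theorem 6. We have $A_A(\alpha) = \mathrm{rmeps}(\min_B(\mathrm{rmeps}(\hat{A}_T(\alpha))))$.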

Proof. Note that $\mathrm{rmeps}(\hat{A}_T(\alpha))$ is deterministic. During minimization, two states $q_i$ and $q_j$ are equivalent iff $L(q_i) = L(q_j)$, i.e. $\delta_i = \delta_j$ (by Lemma 3). Hence, there is a bijection between $D(\alpha)$ and the set of states of $\min_B(\mathrm{rmeps}(\hat{A}_T(\alpha)))$ having an incoming transition with label in $\Sigma$, and hence between $D(\alpha)$ and the set of states of $A = \mathrm{rmeps}(\min_B(\mathrm{rmeps}(\hat{A}_T(\alpha))))$. Lemma 4 ensures that the transitions in $A$ are consistent with the definition of $A_A(\alpha)$.

Theorem 7. If $\mathbb{K}$ is ɛ-k-closed, then $A_A(\alpha)$ can be computed in $O(m \log m + mn)$.

Theorem 7 follows from the fact that $\mathrm{rmeps}(\hat{A}_T(\alpha))$ has $O(m)$ states and transitions. In the unweighted case, this complexity is as good as that of the more complicated and best known algorithm of [8].

In the weighted case, the use of minimization over (label, weight) pairs is sub-optimal since states that would be equivalent modulo a $\otimes$-multiplicative factor are not merged. When possible, using weighted minimization instead would lead to a smaller automaton in general. Hence, if $\mathbb{K}$ is closed, we can define the normalized Antimirov automaton of $\alpha$ as $\mathrm{rmeps}(\min_{\mathbb{K}}(\mathrm{rmeps}(\hat{A}_T(\alpha))))$. This automaton would always be smaller than the Antimirov automaton and the automaton of unitary derived terms of [16] (footnote 5). If $\mathbb{K}$ is k-closed, it can be constructed in $O(m \log m + mn)$.

Footnote 5. This automaton can be viewed in our approach as the result of a simpler form of reweighting than weight pushing, the reweighting used by weighted minimization.

Remark. When the condition about k-closedness (resp. ɛ-k-closedness for $\alpha$) of $\mathbb{K}$ is relaxed to the closedness of $\mathbb{K}$ (resp. that $\alpha$ is well defined), all our construction algorithms can still be used by replacing the generic single-source shortest-distance algorithm with a generalization of the Floyd-Warshall algorithm [15, 20], leading to a complexity in $O(m^3)$. It is not hard however to maintain the quadratic complexity by modifying the generic single-source shortest-distance algorithm to take advantage of the special topology of the Thompson automaton.

In the unweighted case, every regular expression can be straightforwardly rewritten in ɛ-normal form such that $m = O(n)$. In that case, our $O(mn)$ or $O(m \log m + mn)$ complexities become $O(m + n^2)$, which is what is often reported in the literature.

6 Conclusion

We presented a simple and unified view of ɛ-free automata representing unweighted and weighted regular expressions. We showed that standard unweighted and weighted epsilon-removal and minimization can be used to create the Glushkov, follow, and Antimirov automata and that the complexities of these algorithms match those of the best known algorithms. This provides a better understanding

of the ɛ-free automata representing regular expressions. It also suggests using other combinations of epsilon-removal and minimization for creating ɛ-free automata. For example, in some contexts, it might be beneficial to use reverse-epsilon-removal rather than epsilon-removal [19]. Note also that the Glushkov automaton can be constructed on-the-fly since Thompson's construction and epsilon-removal both admit an on-demand implementation.

References

1. A. V. Aho, R. Sethi, and J. D. Ullman. Compilers, Principles, Techniques and Tools. Addison Wesley: Reading, MA.
2. V. M. Antimirov. Partial derivatives of regular expressions and finite automaton constructions. Theoretical Computer Science, 155(2).
3. G. Berry and R. Sethi. From regular expressions to deterministic automata. Theoretical Computer Science, 48(3).
4. A. Brüggemann-Klein. Regular expressions into finite automata. Theoretical Computer Science, 120(2).
5. P. Caron and M. Flouret. Glushkov construction for series: the non commutative case. International Journal of Computer Mathematics, 80(4).
6. J.-M. Champarnaud, É. Laugerotte, F. Ouardi, and D. Ziadi. From regular weighted expressions to finite automata. In Proceedings of CIAA 2003, volume 2759 of Lecture Notes in Computer Science. Springer-Verlag.
7. J.-M. Champarnaud, F. Nicart, and D. Ziadi. Computing the follow automaton of an expression. In Proceedings of CIAA 2004, volume 3317 of Lecture Notes in Computer Science. Springer-Verlag.
8. J.-M. Champarnaud and D. Ziadi. Computing the equation automaton of a regular expression in $O(s^2)$ space and time. In Proceedings of CPM 2001, volume 2089 of Lecture Notes in Computer Science. Springer-Verlag.
9. C.-H. Chang and R. Paige. From regular expressions to DFA's using compressed NFA's. Theoretical Computer Science, 178(1-2):1-36.
10. D. Giammarresi, J.-L. Ponty, and D. Wood. Glushkov and Thompson constructions: a synthesis.
11. V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16:1-53.
12. J. Hopcroft. An n log n algorithm for minimizing states in a finite automaton. In Z. Kohavi and A. Paz, editors, Proceedings of the International Symposium on the Theory of Machines and Computations. Academic Press.
13. L. Ilie and S. Yu. Follow automata. Information and Computation, 186(1).
14. S. C. Kleene. Representation of events in nerve nets and finite automata. In C. E. Shannon, J. McCarthy, and W. R. Ashby, editors, Automata Studies. Princeton University Press.
15. D. J. Lehmann. Algebraic structures for transitive closure. Theoretical Computer Science, 4:59-76.
16. S. Lombardy and J. Sakarovitch. Derivatives of rational expressions with multiplicity. Theoretical Computer Science, 332(1-3).
17. R. McNaughton and H. Yamada. Regular expressions and state graphs for automata. IEEE Transactions on Electronic Computers, 9(1):39-47, 1960.
18. M. Mohri. Finite-State Transducers in Language and Speech Processing. Computational Linguistics, 23:2.
19. M. Mohri. Generic ɛ-removal and input ɛ-normalization algorithms for weighted transducers. International Journal of Foundations of Computer Science, 13(1).
20. M. Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3).
21. G. Navarro and M. Raffinot. Fast regular expression search. In Proceedings of WAE'99, volume 1668 of Lecture Notes in Computer Science. Springer-Verlag.
22. G. Navarro and M. Raffinot. Flexible pattern matching. Cambridge University Press.
23. J.-L. Ponty, D. Ziadi, and J.-M. Champarnaud. A new quadratic algorithm to convert a regular expression into automata. In Proceedings of WIA'96, volume 1260 of Lecture Notes in Computer Science. Springer-Verlag.
24. M.-P. Schützenberger. On the definition of a family of automata. Information and Control, 4.
25. K. Thompson. Regular expression search algorithm. Communications of the ACM, 11(6), 1968.

A General algorithms

A.1 Shortest distance

A generic single-source shortest-distance algorithm in weighted automata was presented in [20]. The algorithm is a generalization of the classical shortest-distance algorithms. It does not require the semiring to be idempotent. For a weighted automaton $A$ over $\mathbb{K}$, the condition for the algorithm to work is that $\mathbb{K}$ must be k-closed for $A$, i.e. there exists $k \in \mathbb{N}$ such that for any cycle $c$ in $A$, $w[c]^* = \bigoplus_{n=0}^{k} w[c]^n$.

    shortest-distance(A, s)
    for each p ∈ Q do
        d[p] ← r[p] ← 0
    d[s] ← r[s] ← 1
    S ← {s}
    while S ≠ ∅ do
        q ← head(S)
        dequeue(S)
        R ← r[q]
        r[q] ← 0
        for each e ∈ E[q] do
            if d[n[e]] ≠ d[n[e]] ⊕ (R ⊗ w[e]) then
                d[n[e]] ← d[n[e]] ⊕ (R ⊗ w[e])
                r[n[e]] ← r[n[e]] ⊕ (R ⊗ w[e])
                if n[e] ∉ S then
                    enqueue(S, n[e])
    d[s] ← 1

Fig. 3. Pseudocode of the generic shortest-distance algorithm.

The algorithm is also generic in the sense that it works with any queue discipline. The pseudocode of the algorithm is given in Figure 3. The complexity of the algorithm depends on the queue discipline chosen for $S$; more precisely, it is in

$$O\Big( |Q| + (T_\oplus + T_\otimes + C(A))\, |E| \max_{q \in Q} N(q) + (C(I) + C(X)) \sum_{q \in Q} N(q) \Big) \tag{10}$$

where $N(q)$ denotes the number of times state $q$ is extracted from the queue $S$, $C(X)$ the cost of extracting a state from $S$, $C(I)$ the cost of inserting a state in $S$, $C(A)$ the cost of an assignment, and $T_\oplus$ and $T_\otimes$ the costs of the $\oplus$ and $\otimes$ operations.

In the case of an acyclic automaton, using the topological order queue discipline, the complexity of the algorithm is linear, i.e., $O(|Q| + |E|)$. In the case of the tropical semiring, using Fibonacci heaps, the complexity of the algorithm is $O(|E| + |Q| \log |Q|)$.
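The pseudocode of Figure 3 translates almost line for line into Python. The sketch below is an illustration under stated assumptions: it uses a FIFO queue discipline, represents the automaton as a dictionary from each state to its outgoing `(weight, next_state)` pairs, and takes the semiring operations as arguments; the function and parameter names are hypothetical.

```python
import operator
from collections import deque


def shortest_distance(states, edges, source, plus, times, zero, one):
    """Generic single-source shortest distance with a FIFO queue.

    edges: dict mapping a state to a list of (weight, next_state) pairs.
    Returns a dict d with d[q] = plus-sum of the weights of paths from source to q.
    """
    d = {p: zero for p in states}  # current distance estimates
    r = {p: zero for p in states}  # weight added to d[p] since p was last dequeued
    d[source] = r[source] = one
    queue, queued = deque([source]), {source}
    while queue:
        q = queue.popleft()
        queued.discard(q)
        R, r[q] = r[q], zero
        for w, n in edges.get(q, []):
            relaxed = plus(d[n], times(R, w))
            if relaxed != d[n]:
                d[n] = relaxed
                r[n] = plus(r[n], times(R, w))
                if n not in queued:
                    queue.append(n)
                    queued.add(n)
    d[source] = one
    return d


# Tropical semiring (min, +): the cycle 0 -> 2 -> 0 does not change the distances.
trop_edges = {0: [(1, 1), (4, 2)], 1: [(1, 2), (5, 3)], 2: [(1, 3), (1, 0)]}
dist = shortest_distance(range(4), trop_edges, 0, min, operator.add, float("inf"), 0)

# Real semiring (+, x) on an acyclic graph: sums the weights of all paths.
real_edges = {0: [(2, 1), (3, 2)], 1: [(5, 3)], 2: [(7, 3)]}
counts = shortest_distance(range(4), real_edges, 0, operator.add, operator.mul, 0, 1)
```

The second call works only because the graph is acyclic: the real semiring is not k-closed, so a cycle would keep the queue from ever emptying.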

A.2 Epsilon removal

Let $A$ be a weighted automaton over $\mathbb{K}$ with ɛ-transitions. Let $A_\epsilon$ be the automaton obtained by deleting all the transitions not labeled by ɛ from $A$. A general ɛ-removal algorithm based on the generic shortest-distance algorithm presented above was given in [19]. This algorithm works if the semiring $\mathbb{K}$ is k-closed for $A_\epsilon$.

The algorithm is divided into two steps. The first step consists of computing the ɛ-closure of each state $p$ in $A$. Let $d_\epsilon[p, q]$ denote the ɛ-distance from $p$ to $q$, for $p, q \in Q$:

$$d_\epsilon[p, q] = \bigoplus_{\pi \in P(p, q),\, i[\pi] = \epsilon} w[\pi]. \tag{11}$$

The ɛ-closure of $p$ is then defined as

$$\mathrm{closure}(p) = \{ (q, d_\epsilon[p, q]) \mid d_\epsilon[p, q] \neq 0 \}. \tag{12}$$

The ɛ-closure of $p$ can be computed by using the generic shortest-distance algorithm on $A_\epsilon$ with source $p$.

The second step consists of, for each state $p$ having at least one incoming non-ɛ transition, deleting all the outgoing ɛ-transitions of $p$ and adding out of $p$ all the non-ɛ transitions leaving each state $q \in \mathrm{closure}(p)$, with their weight pre-$\otimes$-multiplied by $d_\epsilon[p, q]$. The pseudocode of this second step is given in Figure 4.

    ɛ-removal(A)
    for each p ∈ Q do
        E[p] ← {e ∈ E[p] : i[e] ≠ ɛ}
        for each (q, w) ∈ C[p] do        ▷ C[p] = closure(p)
            E[p] ← E[p] ∪ {(p, a, w ⊗ w', r) : (q, a, w', r) ∈ E[q] and a ≠ ɛ}
            if q ∈ F then
                F ← F ∪ {p}
                ρ[p] ← ρ[p] ⊕ (w ⊗ ρ[q])

Fig. 4. Pseudocode of the ɛ-removal algorithm.

A.3 Weight pushing

Weight pushing is an algorithm for normalizing the distribution of the weights along the paths of weighted automata [18]. Let $A$ be a weighted automaton over $\mathbb{K}$ and assume that $\mathbb{K}$ is weakly left divisible and zero-sum free. For every state $q \in Q$, assume that the shortest distance from $q$ to $F$,

$$d_F[q] = \bigoplus_{\pi \in P(q, F)} (w[\pi] \otimes \rho(n[\pi])), \tag{13}$$

is well defined in $\mathbb{K}$. The weight pushing algorithm consists of computing each $d_F[q]$ and of reweighting $A$ in the following way:

$$
\begin{array}{ll}
\forall e \in E \text{ such that } d_F[p[e]] \neq 0, & w[e] \leftarrow d_F[p[e]]^{-1} \otimes (w[e] \otimes d_F[n[e]]) \\
\forall q \in I, & \lambda[q] \leftarrow \lambda[q] \otimes d_F[q] \\
\forall q \in F \text{ such that } d_F[q] \neq 0, & \rho[q] \leftarrow d_F[q]^{-1} \otimes \rho[q]
\end{array}
\tag{14}
$$

The complexity of the reweighting step is linear in the size of $A$ under the assumption that the cost of the $\otimes$ operation is constant. The first step can be achieved by applying the shortest-distance algorithm on the reverse of $A$, hence the complexity of this step is as discussed in Section A.1. Weight pushing has two interesting properties: (1) it does not change the weight of successful paths, and (2) the resulting weighted automaton is stochastic, i.e. for any state $q$, the $\oplus$-sum of the weights of the outgoing transitions of $q$ is equal to $1$.

A.4 Weighted minimization

A weighted automaton $A$ is deterministic if no two transitions leaving any state share the same label and if it has a unique initial state. A deterministic weighted automaton is minimal if there exists no other deterministic automaton having a smaller number of states and realizing the same function. A general weighted minimization algorithm was presented in [18]. Let $A$ be a weighted automaton over $\mathbb{K}$. The algorithm consists of the execution of the following steps:

1. weight pushing,
2. (unweighted) automata minimization, considering each pair (label, weight) as a single label.

Assuming that the conditions of application of weight pushing hold, the resulting weighted automaton, denoted by $\min(A)$, is minimal and equivalent to $A$. The complexity of the second step is in $O(|E| \log |Q|)$ using Hopcroft's algorithm [12].
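The reweighting of equation (14) can be illustrated over the real semiring with exact rational arithmetic. This is a minimal sketch under simplifying assumptions: the automaton is acyclic and its states are supplied in topological order, so $d_F$ can be computed by a single backward pass instead of the generic shortest-distance algorithm. The function name `push_weights` and the small automaton are hypothetical, chosen for illustration only.

```python
from fractions import Fraction


def push_weights(edges, initial, finals, order):
    """Weight pushing over the real semiring (+, x) for an acyclic automaton.

    edges: list of (p, label, weight, n); initial: (state, weight);
    finals: dict state -> final weight; order: states in topological order.
    """
    # d[q] = sum over paths pi from q to a final state of w[pi] * rho(n[pi]).
    d = {q: Fraction(finals.get(q, 0)) for q in order}
    for q in reversed(order):  # successors are processed before q
        for (p, _, w, n) in edges:
            if p == q:
                d[q] += Fraction(w) * d[n]
    new_edges = [(p, a, Fraction(w) * d[n] / d[p], n)
                 for (p, a, w, n) in edges if d[p] != 0]
    state, lam = initial
    new_finals = {q: Fraction(rho) / d[q] for q, rho in finals.items() if d[q] != 0}
    return new_edges, (state, Fraction(lam) * d[state]), new_finals


# A hypothetical acyclic automaton: state 0 is initial, state 3 is final.
edges = [(0, "a", 2, 1), (0, "b", 2, 1), (0, "c", 4, 1), (0, "d", 3, 2), (0, "e", 1, 2),
         (1, "f", 3, 3), (1, "g", 1, 3), (2, "f", 6, 3), (2, "g", 2, 3)]
pushed, (start, lam), new_finals = push_weights(edges, (0, 1), {3: 1}, [0, 1, 2, 3])
```

After pushing, the weights leaving each state sum to one together with its final weight, and the total weight removed from the transitions is carried by the new initial weight `lam`. States 1 and 2 end up with identical outgoing (label, weight) pairs, which is what allows unweighted minimization to merge them in the second step of weighted minimization.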
