Regular Expressions and NFAs without ε-transitions


Georg Schnitger
Institut für Informatik, Johann Wolfgang Goethe-Universität, Robert-Mayer-Straße 11-15, 60054 Frankfurt am Main, Germany
georg@thi.informatik.uni-frankfurt.de

Abstract. We consider the problem of converting regular expressions into ε-free NFAs with as few transitions as possible. If the regular expression has length $n$ and is defined over an alphabet of size $k$, then the previously best construction uses $O(n \cdot \min\{k, \log_2 n\} \cdot \log_2 n)$ transitions. We show that $O(n \cdot \log_2 2k \cdot \log_2 n)$ transitions suffice. For small alphabets, for instance if $k = O(\log_2 \log_2 n)$, we further improve the upper bound to $O(k^{1+\log^* n} \cdot n)$. In particular, $O(2^{\log_2^* n} \cdot n)$ transitions and hence almost linear size suffice for the binary alphabet! Finally we show the lower bound $\Omega(n \cdot \log_2^2 2k)$ and as a consequence the upper bound $O(n \cdot \log_2^2 n)$ of [7] for general alphabets is best possible. Thus the conversion problem is solved for large alphabets ($k = n^{\Omega(1)}$) and almost solved for small alphabets ($k = O(1)$).

Keywords: Automata and formal languages, descriptional complexity, nondeterministic automata, regular expressions.

1 Introduction

One of the central tasks on the border between formal language theory and complexity theory is to describe infinite objects such as languages by finite formalisms such as automata, grammars, expressions among others and to investigate the descriptional complexity and capability of these formalisms. Formalisms like expressions and finite automata have proven to be very useful in building compilers, and techniques converting one formalism into another were used as basic tools in the design of computer systems such as UNIX ([12] and [5], p. 123). A typical application in lexicographical analysis starts with a regular expression that has to be converted into an ε-free nondeterministic finite automaton.
Here, the descriptional complexity of an expression $R$ is its length and the descriptional complexity of a nondeterministic finite automaton (NFA) is the number of its edges or transitions, where identical edges with distinct labels are differentiated. All classical conversions [1, 3, 9, 12] produce ε-free NFAs with worst-case size quadratic in the length of the given regular expression and for some time this was assumed to be optimal [10]. But then Hromkovic, Seibert and Wilke [7] constructed ε-free NFAs with surprisingly only $O(n (\log_2 n)^2)$ transitions for regular expressions of length $n$ and this transformation can even be implemented to run in time $O(n \cdot \log_2 n + m)$, where $m$ is the size of the output [4]. Subsequently Geffert [2] showed that even ε-free NFAs with $O(n \cdot k \cdot \log_2 n)$ transitions suffice for alphabets of size $k$, improving the bound of [7] for small alphabets. We considerably improve the upper bound of [2] for alphabets of small size $k$. In particular we show

(Work supported by DFG grant SCHN 503/4-1.)

Theorem 1. Every regular expression $R$ of length $n$ over an alphabet of size $k$ can be recognized by an ε-free NFA with at most
$$O\bigl(n \cdot \min\{\log_2 n \cdot \log_2 2k,\; k^{1+\log^* n}\}\bigr)$$
transitions.

As a first consequence we obtain ε-free NFAs of size $O(n \cdot \log_2 n \cdot \log_2 2k)$ for regular expressions of length $n$ over an alphabet of size $k$. For small alphabets, for instance if $k = O(\log_2 \log_2 n)$, the upper bound $O(n \cdot k^{1+\log^* n})$ is better. In particular, $O(n \cdot 2^{\log_2^* n})$ transitions and hence almost linear size suffice for the binary alphabet. A first lower bound was also given in [7], where it is shown that the regular expression $E_n = (1 + \varepsilon) \cdot (2 + \varepsilon) \cdots (n + \varepsilon)$ over the alphabet $\{1, \ldots, n\}$ requires NFAs of size at least $\Omega(n \cdot \log_2 n)$. Lifshits [8] improves this bound to $\Omega(n (\log_2 n)^2 / \log_2 \log_2 n)$. We use ideas developed in [8] to prove the following optimal asymptotic bound for $E_n$.

Theorem 2. There are regular expressions of length $n$ over an alphabet of size $k$ such that any equivalent ε-free NFA has at least $\Omega(n \cdot \log_2^2 2k)$ transitions.

Thus the construction of [7] is optimal for large alphabets, i.e., if $k = n^{\Omega(1)}$. Since Theorem 1 is almost optimal for alphabets of fixed size, only improvements for alphabets of intermediate size, i.e., $\omega(1) = k = n^{o(1)}$, are still required. In Section 2 we show how to construct small ε-free NFAs for a given regular expression $R$ using ideas from [2, 7]. We obtain in Lemma 1 the upper bound $O(n \cdot \log_2 n \cdot \log_2 2k)$ by short-cutting ε-paths within the canonical NFA. Whereas the canonical NFA (with ε-transitions) is derived from the expression tree of $R$, the shortcuts are derived from a decomposition tree which is a balanced version of the expression tree. The subsequent improvement for small alphabets is based on repeatedly applying the previous upper bound to larger and larger subexpressions. We show the lower bound for the regular expression $E_n$ in Section 3. Conclusions and open problems are stated in Section 4.
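
To make the quadratic behaviour of straightforward conversions concrete, the following is a minimal sketch of an ε-free NFA for the expression $E_n$ from the introduction, built in the style of a position automaton. It is an illustration under stated assumptions, not one of the cited constructions: the function names `position_nfa_for_E`, `accepts` and `count_transitions` are hypothetical, and an NFA is represented as a triple (initial state, set of final states, set of labelled transitions).

```python
def position_nfa_for_E(n):
    """Return (initial, finals, transitions) of an epsilon-free NFA for E_n.

    States are 0..n; state j means "the last letter read was j"
    (0 means nothing has been read yet). Every factor (i + eps) may be
    skipped, so state i has a j-transition to state j for every j > i,
    and every state is accepting.
    """
    transitions = {(i, j, j) for i in range(n + 1) for j in range(i + 1, n + 1)}
    return 0, set(range(n + 1)), transitions


def accepts(nfa, word):
    """Simulate an epsilon-free NFA on a sequence of letters."""
    initial, finals, transitions = nfa
    current = {initial}
    for letter in word:
        current = {q for (p, a, q) in transitions if p in current and a == letter}
    return bool(current & finals)


def count_transitions(n):
    """Number of transitions of the sketch automaton for E_n."""
    return len(position_nfa_for_E(n)[2])
```

The language of $E_n$ consists of the strictly increasing words over $\{1, \ldots, n\}$, and this automaton uses $n(n+1)/2$ transitions, which is quadratic in the length of $E_n$. The constructions discussed in this paper aim well below that.
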
2 Small ε-free NFAs for Regular Expressions

We first describe the canonical construction of an NFA (with ε-transitions), given the expression tree of a given regular expression $R$. We then proceed in the next section by defining a decomposition tree for $R$. After showing in Section 2.2 how to obtain a small ε-free NFA from the decomposition tree we then recursively apply our recipe in Section 2.3 to obtain close to optimal ε-free NFAs. First observe that $R = R_1 + R_2$, $R = R_1 \cdot R_2$ or $R = S^*$ for subexpressions $R_1, R_2, S$ of $R$. Fig. 1 shows this recursive expansion in NFA-notation. Thus, after completing this recursive expansion, we arrive at the canonical NFA $N_R$ with a unique initial state $q_0$ and a unique final state $q_f$. Moreover, no transition enters $q_0$ and no transition leaves $q_f$; thus $q_f$ is a trap state. Finally observe that $N_R$ has at most $O(n)$ transitions for any regular expression $R$ of length $n$.

Fig. 1. The initial step in determining the NFA $N_R$ for a regular expression $R$ (panels: concatenation, union, star). The undirected version of $N_R$ is a series-parallel graph with a unique source and a unique sink. The sink is a trap state.

2.1 Decomposition Trees

Let $T_R$ be the expression tree of $R$ with root $r$. To define decomposition trees we first introduce partial cuts, where a partial cut $C$ is a set of nodes of $T_R$ with no two nodes of $C$ being ancestors or descendants of each other. We define $T_R^v(C)$, for a node $v$ and a partial cut $C$ of $T_R$, as the subtree of $T_R$ with root $v$ and all children-links for nodes in $C$ removed. Hence nodes in $C$ which are also descendants of $v$ are (artificial) leaves in $T_R^v(C)$. Moreover we label the leaf for $x \in C$ with an artificial symbol $S$ denoting the regular subexpression determined by $x$ in $T_R$: if $R^v(C)$ denotes the subexpression specified by $T_R^v(C)$, then $R^v(C)$ contains for each artificial leaf the corresponding artificial symbol. Finally $N_R^v(C)$ is the NFA obtained by recursively expanding $R^v(C)$ except for the artificial symbols of $R^v(C)$; any artificial symbol $S$ is modeled by an artificial transition $p \xrightarrow{S} q$, where $p$ is the unique initial state and $q$ is the unique final state of the canonical NFA for $S$.

Fig. 2. Expression tree and canonical NFA with artificial transitions for $R = (R_3 (R_1 + R_2)) + R_4$.

We introduce the decomposition tree $T_R^*$ for $R$ as a balanced, small depth version of $T_R$. We begin by determining a separating node $v$ of $T_R$, namely a node of $T_R$ with a subtree of at least $\frac{n}{3}$, but less than $\frac{2n}{3}$ leaves. Then $T_1 = T_R^v(\emptyset)$ is the subtree of $T_R$ with root $v$ and $T_1$ determines the regular expression $S = R^v(\emptyset)$. We remove the edge connecting $v$ to its parent, reattach $v$ as an artificial leaf labeled with the artificial symbol $S$ and obtain the second subtree $T_2 = T_R^r(\{v\})$ specifying the regular expression $R^r(\{v\})$. We obtain the original expression $R$ after replacing the artificial symbol $S$ in $R^r(\{v\})$ by the expression $R^v(\emptyset)$. $N_R^r(\{v\})$ contains a unique transition $q_1 \xrightarrow{S} q_2$ with label $S$. We obtain $N_R$ from $N_R^r(\{v\})$ after identifying the unique initial and final states of $N_R^v(\emptyset)$ with $q_1$ and $q_2$ respectively and then replacing the transition $q_1 \xrightarrow{S} q_2$ by $N_R^v(\emptyset)$. To define the decomposition tree $T_R^*$ we create a new root $s$.
We say that $q_1 \xrightarrow{S} q_2$ is the artificial transition of $s$ and label $s$ with the quadruple $(r, \emptyset, q_1 \xrightarrow{S} q_2, v)$. In general, if we label a node $t$ of $T_R^*$ with $(u, C, q_1 \xrightarrow{S} q_2, v)$,
- then we say that $t$ represents the expression tree $T_R^u(C)$ as well as the NFA $N_R^u(C)$. When expanding node $t$, the NFA $N_R^u(C)$ is decomposed using the separating node $v$.
- The left child of $t$ represents the expression tree $T_R^v(C)$ as well as the NFA $N_R^v(C)$ which has the unique initial state $q_1$ and the unique final state $q_2$.
- The right child of $t$ represents the expression tree $T_R^u(C \cup \{v\})$ and the NFA $N_R^u(C \cup \{v\})$.
- One obtains the NFA $N_R^u(C)$ from the NFA $N_R^u(C \cup \{v\})$ of the right child after replacing the artificial transition $q_1 \xrightarrow{S} q_2$ by the NFA $N_R^v(C)$ of the left child.

Fig. 3. The first expansion step for $T_R$ and the corresponding NFAs $N_R$, $N_R^v(\emptyset)$ and $N_R^r(\{v\})$.

We recursively repeat this expansion process for the left child of $s$ representing $T_R^v(\emptyset)$ as well as the right child representing $T_R^r(\{v\})$. Observe that separating nodes have to have at least $\frac{N}{3}$, but less than $\frac{2N}{3}$ original (i.e., non-artificial) leaves in their subtrees, where $N$ is the current number of original leaves. We continue to expand until all trees contain at most one original leaf. If we have reached a node $t$ of $T_R^*$ whose expression tree $T_R^u(C)$ has exactly one leaf $l$ (representing the original transition $q_1 \xrightarrow{a} q_2$), then we label $t$ by the quadruple $(u, C, q_1 \xrightarrow{a} q_2, l)$ and stop the expansion. We summarize the important properties of $T_R^*$.

Proposition 1. Let $R$ be a regular expression of length $n$.
(a) Each ε-free transition of $N_R$ appears exactly once as an artificial transition of a leaf of $T_R^*$.
(b) Assume that node $t$ is labeled with $(u, C, q_1 \xrightarrow{S} q_2, v)$. Then any ε-path in $N_R$ from a state of the left child NFA $N_R^v(C)$ to a state outside of $N_R^v(C)$ has to traverse $q_2$. Any ε-path in $N_R$ from a state outside of $N_R^v(C)$ to a state of $N_R^v(C)$ traverses $q_1$.
(c) Let $p \xrightarrow{a} q$ and $r \xrightarrow{b} s$ be two ε-free transitions of $N_R$ corresponding to leaves $l_1$, resp. $l_2$ of $T_R^*$. Moreover let $t$ be the lowest common ancestor of $l_1$ and $l_2$ in $T_R^*$. If there is an ε-path $q \xrightarrow{*} r$ in $N_R$ and if $q_1 \to q_2$ is the artificial transition of $t$, then the path traverses $q_1$ or $q_2$.
(d) The depth of $T_R^*$ is bounded by $O(\log_2 n)$.

Proof. (a) Any ε-free transition of $N_R$ appears exactly once as a leaf of the expression tree $T_R$. The claim follows, since each expansion step for the decomposition tree $T_R^*$ decomposes $T_R$.
(b) When growing the canonical NFA while expanding the expression tree $T_R$, we always replace an artificial transition by an NFA $N$ with a unique initial and final state. No state outside of $N$ will ever be linked directly with a state inside of $N$. Moreover $N$ can only be entered through its initial state and left through its final state.
(c) The lowest common ancestor $t$ represents the expression $R^u(C)$ for some node $u$ and cut $C$. If $w$ is the separating node of $t$, then $R^u(C)$ is decomposed into the expressions $R^w(C)$, recognized by $N_R^w(C)$, and $R^u(C \cup \{w\})$, recognized by $N_R^u(C \cup \{w\})$. Assume that both endpoints of, say, $p \xrightarrow{a} q$ belong to $N_R^w(C)$, the endpoints of $r \xrightarrow{b} s$ lie outside. But according to part (b), the left child NFA $N_R^w(C)$ can only be entered through its initial or final state which coincides with an endpoint of the artificial transition of $t$.
(d) follows, since the number of original leaves is reduced each time by at least the factor $\frac{2}{3}$.

2.2 Constructing ε-free NFAs from the Decomposition Tree

Property (c) of Proposition 1 is of particular importance when building a small ε-free NFA from the regular expression $R$, resp. from the NFA $N_R$: assume for instance that there are ε-paths $q_1 \xrightarrow{*} p_2$, $q_2 \xrightarrow{*} p_3$ and $q_3 \xrightarrow{*} p_4$ as well as ε-free transitions $p_i \xrightarrow{a_i} q_i$ for $i = 1, \ldots, 4$ in $N_R$. Then there is a path $P$ from $p_1$ to $q_4$ built from the ε-transitions and the four ε-free transitions. How can we simulate $P$ without ε-transitions? Assume that within the decomposition tree $u$ is the lowest common ancestor (lca) of $p_1 \xrightarrow{a_1} q_1$ and $p_2 \xrightarrow{a_2} q_2$, $v$ the lca of $p_2 \xrightarrow{a_2} q_2$ and $p_3 \xrightarrow{a_3} q_3$ and finally that $w$ is the lca of $p_3 \xrightarrow{a_3} q_3$ and $p_4 \xrightarrow{a_4} q_4$. Moreover let $q_1^u$ and $q_2^u$, $q_1^v$ and $q_2^v$, $q_1^w$ and $q_2^w$ be the endpoints of the artificial transitions of $u$, $v$ and $w$ respectively. Path $P$ has to traverse one of the endpoints for all three lca's according to Proposition 1 (c). In particular, $P$ may have the form
$$p_1 \xrightarrow{a_1} q_1 \xrightarrow{*} q_2^u \xrightarrow{*} p_2 \xrightarrow{a_2} q_2 \xrightarrow{*} q_1^v \xrightarrow{*} p_3 \xrightarrow{a_3} q_3 \xrightarrow{*} q_1^w \xrightarrow{*} p_4 \xrightarrow{a_4} q_4.$$
We concentrate on the two path fragments
$$q_2^u \xrightarrow{*} p_2 \xrightarrow{a_2} q_2 \xrightarrow{*} q_1^v \quad\text{and}\quad q_1^v \xrightarrow{*} p_3 \xrightarrow{a_3} q_3 \xrightarrow{*} q_1^w.$$
If we introduce new ε-free transitions $q_2^u \xrightarrow{a_2} q_1^v$ and $q_1^v \xrightarrow{a_3} q_1^w$, then disregarding the very first and the very last ε-free transitions of $P$, we have utilized the lca's as shortcuts in the ε-free equivalent $q_2^u \xrightarrow{a_2} q_1^v \xrightarrow{a_3} q_1^w$ of $P$. We now analyze this procedure in general.
In particular we improve upon the conversion of [2], where $O(n \cdot \log_2 n \cdot k)$ transitions are shown to suffice. Our approach combines ideas in [2] with Proposition 1 (c).

Lemma 1. Let $R$ be a regular expression of length $n$ over an alphabet of size $k$. Then there is an ε-free NFA $N$ for $R$ with $O(n \cdot \log_2 n \cdot \log_2 2k)$ transitions. $N$ has a unique initial state. If $\varepsilon \notin L(R)$, then $N$ has one final state and otherwise $N$ has at most two final states.

Proof. Assume that $q_0$ is the unique initial state of $N_R$ and $q_f$ its unique final state. We moreover assume without loss of generality that all states of $N_R$ have an ε-loop. We choose $q_0$ and $q_f$ as well as all endpoints of artificial transitions assigned to nodes of $T_R^*$ as states for our ε-free NFA. $q_0$ is still the initial state and $q_f$ is (jointly with $q_0$, whenever $\varepsilon \in L(R)$) the only final state. Thus the ε-free NFA results from $N_R$ after removing all ε-transitions (and all states which are incident to ε-transitions only) and inserting new ε-free transitions.

Let $p \xrightarrow{a} q$ be an ε-free transition of $N_R$ and let $l$ be the corresponding leaf of $T_R^*$. We define the set $A$ to contain $q_0$, $q_f$ as well as all ancestors of $l$ including $l$ itself. We assume that $q_0$ and $q_f$ are roots of imaginary trees such that all leaves of $T_R^*$ belong to the right subtree of $q_0$ as well as to the left subtree of $q_f$. Consider any two nodes $v, w \in A$ and let $q_1^v, q_2^v$ and $q_1^w, q_2^w$ be the endpoints of the artificial transitions for $v$ and $w$ respectively. (We set $q_1^v = q_2^v = q_0$ for $v = q_0$ and $q_1^v = q_2^v = q_f$ for $v = q_f$.) We insert the transition $q_i^v \xrightarrow{a} q_j^w$ for $i, j \in \{1, 2\}$, if there are ε-paths $q_i^v \xrightarrow{*} p$ and $q \xrightarrow{*} q_j^w$ in $N_R$. Let $N$ be the NFA obtained from $N_R$ after these insertions and after removing all ε-transitions.

Fig. 4. Introducing transitions between ancestors.

Obviously any accepting path from $q_0$ to $q_f$ in $N$ can be extended via ε-transitions to an accepting path in $N_R$ and hence $L(N) \subseteq L(N_R)$. Now consider an accepting path
$$q_0 \xrightarrow{\varepsilon} p_1 \xrightarrow{a_1} q_1 \xrightarrow{\varepsilon} \cdots \xrightarrow{\varepsilon} p_r \xrightarrow{a_r} q_r \xrightarrow{\varepsilon} q_f$$
for the word $a_1 \cdots a_r$ in $N_R$, where each arrow labeled ε stands for an ε-path. Since all states of $N_R$ have ε-loops, we may assume that all ε-free transitions are separated by ε-paths. Let $l_i$ be the leaf of $T_R^*$ corresponding to the transition $p_i \xrightarrow{a_i} q_i$. To obtain an accepting path in $N$, let $v_0 = q_0, v_1, \ldots, v_{r-1}, v_r = q_f$ be a sequence of nodes, where $v_i$ ($1 \le i \le r-1$) is the lowest common ancestor of $l_i$ and $l_{i+1}$ in $T_R^*$. By Proposition 1 (c) the ε-paths from $q_{i-1}$ to $p_i$ and from $q_i$ to $p_{i+1}$ have to hit endpoints $q_{j_{i-1}}^{v_{i-1}} \in \{q_1^{v_{i-1}}, q_2^{v_{i-1}}\}$ and $q_{j_i}^{v_i} \in \{q_1^{v_i}, q_2^{v_i}\}$ of the respective artificial transition. But then $N$ contains the transition $q_{j_{i-1}}^{v_{i-1}} \xrightarrow{a_i} q_{j_i}^{v_i}$ and
$$q_0 \xrightarrow{a_1} q_{j_1}^{v_1} \xrightarrow{a_2} \cdots \xrightarrow{a_{i-1}} q_{j_{i-1}}^{v_{i-1}} \xrightarrow{a_i} q_{j_i}^{v_i} \xrightarrow{a_{i+1}} q_{j_{i+1}}^{v_{i+1}} \xrightarrow{a_{i+2}} \cdots \xrightarrow{a_r} q_f$$
is an accepting path in $N$. Hence $L(N_R) \subseteq L(N)$ and $N$ and $N_R$ are equivalent.

We still have to count the number of transitions of $N$. We introduce transitions $q_i^s \xrightarrow{a} q_j^t$ (resp. $q_i^t \xrightarrow{a} q_j^s$) only for transitions $p \xrightarrow{a} q$ which are represented by leaves belonging to the subtrees of $s$ and $t$. Hence $s$ must be an ancestor or descendant of $t$ in $T_R^*$.
Thus, for given nodes $s, t$ we introduce at most $\min\{|s|, |t|, k\}$ transitions, where $|s|$ and $|t|$ are the number of leaves in the subtrees of $s$ and $t$ respectively. We fix $s$. There are $O(|s|/k)$ descendants $t$ of $s$ with $|t| \ge k$ and at most $O(\frac{|s|}{k} \cdot k) = O(|s|)$ newly introduced transitions connect $s$ with a high node $t$. The remaining low nodes are partitioned into $O(\log_2 k)$ levels, where one level produces at most $O(|s|)$ transitions, since at most $|t|$ transitions connect $s$ with a node $t$ of the level. Thus the number of transitions between $q_i^s$ and $q_j^t$ for descendants $t$ of $s$ is bounded by $O(|s| \cdot (1 + \log_2 k))$ and hence by $O(|s| \cdot \log_2 2k)$. Finally we partition all nodes $s$ of $T_R^*$ into $O(\log_2 n)$ levels, where one level requires $O(n \cdot \log_2 2k)$ transitions, and overall $O(n \cdot \log_2 n \cdot \log_2 2k)$ transitions suffice.
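
The counting argument above relies on the decomposition tree having logarithmic depth (Proposition 1 (d)). The following is a minimal sketch of the splitting step of Section 2.1, under assumptions that are not part of the paper: expression trees are nested tuples (`("leaf", a)`, `("art",)` for an artificial leaf, and `("cat", l, r)`, `("alt", l, r)`, `("star", c)` for inner nodes), and the separating node is found by walking from the root towards the child with the most original leaves, which is only one possible way of locating such a node. All function names are hypothetical.

```python
def leaves(t):
    """Number of original (non-artificial) leaves of an expression tree."""
    if t[0] == "leaf":
        return 1
    if t[0] == "art":
        return 0
    return sum(leaves(child) for child in t[1:])


def split(t):
    """Cut t at a separating node v and return (T1, T2).

    T1 is the subtree rooted at v; T2 is t with v replaced by an
    artificial leaf. Requires at least two original leaves.
    """
    total = leaves(t)

    def descend(node):
        # Stop at the first node with fewer than 2N/3 original leaves.
        if 3 * leaves(node) < 2 * total:
            return node, ("art",)
        # Otherwise continue into the child with the most original leaves;
        # that child still has at least N/3 of them.
        children = list(node[1:])
        heavy = max(range(len(children)), key=lambda i: leaves(children[i]))
        cut, children[heavy] = descend(children[heavy])
        return cut, (node[0], *children)

    return descend(t)


def decomposition_depth(t):
    """Depth of the decomposition tree obtained by repeated splitting."""
    if leaves(t) <= 1:
        return 0
    t1, t2 = split(t)
    return 1 + max(decomposition_depth(t1), decomposition_depth(t2))


def chain(n):
    """Left-deep concatenation a1 a2 ... an: an expression tree of depth n - 1."""
    t = ("leaf", 1)
    for i in range(2, n + 1):
        t = ("cat", t, ("leaf", i))
    return t
```

For example, `chain(64)` is a maximally unbalanced expression tree, yet each split leaves at most two thirds of the original leaves in either part, so `decomposition_depth(chain(64))` stays logarithmic in the number of leaves rather than linear.
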

Fig. 5. The construction of an equivalent path.

2.3 A Recursive Construction of Small ε-free NFAs

How can we come up with even smaller ε-free NFAs? Assume that we have partitioned the regular expression $R$ into (very small) subexpressions of roughly the same size $\eta$. We apply the construction of Lemma 1 to all subexpressions and introduce at most $O(n \cdot \log_2 \eta \cdot \log_2 2k)$ transitions, a significant reduction if $\eta$ is drastically smaller than $n$. However now we have to connect different subexpressions with global transitions and Lemma 1 inserts the vast majority of transitions, leading to a total of $O(n \cdot \log_2 n \cdot \log_2 2k)$ transitions. But we can do far better, if we are willing to increase the size of ε-free NFAs for every subexpression.

Definition 1. Let $N$ be an ε-free NFA with initial state $q_0$ and let $F$ be the set of final states. We say that any transition $(q_0, r)$ is an initial transition and that $r$ is a post-initial state. Analogously any transition $(r, s)$ is a final transition, provided $s \in F$, and $r$ is a pre-final state.

Observe that it suffices to connect a global transition for a subexpression $S$ with a post-initial or pre-final state of an ε-free NFA for $S$. As a consequence, the number of global transitions for $S$ is reduced drastically, provided we have only few post-initial and pre-final states. But, given an ε-free NFA, how large are equivalent ε-free NFAs with relatively few initial or final states?

Proposition 2. Let $N$ be an ε-free NFA with $s$ transitions over an alphabet $\Sigma$ of size $k$. Then there is an equivalent ε-free NFA $N'$ with $O(k^2 + k \cdot s)$ transitions and at most $3k + k^2$ initial or final transitions. $N'$ has one initial state. If $\varepsilon \notin L(N)$, then $N'$ has one final state and otherwise at most two final states.

Proof. Assume that $q_0$ is the initial state of $N$, $F$ is the set of final states and $\Sigma = \{1, \ldots, k\}$. Let $\rho_1, \ldots, \rho_p$ be the post-initial states of $N$ and $\sigma_1, \ldots, \sigma_q$ be the pre-final states of $N$. Moreover let $R_i$ be the set of post-initial states in $N$ receiving an $i$-transition from $q_0$ and let $S_i$ be the set of pre-final states of $N$ sending an $i$-transition into a state of $F$.
We introduce a new initial state $q_0'$ and a new final state $q_f'$. ($q_0'$ is the second accepting state, if $\varepsilon \in L(N)$.) For every $a \in \Sigma \cap L(N)$ we insert the transition $q_0' \xrightarrow{a} q_f'$. We introduce new post-initial states $r_1, \ldots, r_k$, new pre-final states $s_1, \ldots, s_k$ and insert $i$-transitions from $q_0'$ to $r_i$ as well as from $s_i$ to $q_f'$. If $\rho_j \in R_i$ and if $(\rho_j, s)$ is a transition with label $b$, then we insert the transition $(r_i, s)$ with label $b$. Analogously, if $\sigma_j \in S_i$ and if $(r, \sigma_j)$ is a transition with label $b$, then we insert the transition $(r, s_i)$ with label $b$. Thus the new states $r_i$ and $s_i$ inherit their outgoing respectively incoming transitions from the states they are responsible for. Finally, to accept all words of length two in $L(N)$, we introduce at most $k^2$ further initial and final transitions incident with $q_0'$, $q_f'$ and post-initial states. Observe that the new NFA $N'$ is equivalent with $N$, since, after leaving the new states $r_i$ and before reaching the new states $s_j$, $N'$ works like $N$. The states $q_0'$ and $q_f'$ are incident with at most $k + 2k + k^2$ transitions: up to $k$ transitions link $q_0'$ and $q_f'$, $2k$ transitions connect $q_0'$ and the $r_i$'s (or the $s_i$'s and $q_f'$) and at most $k^2$ transitions accept words of length two. Finally at most $k \cdot s$ transitions leave states $r_i$, not more than $k \cdot s$ transitions enter states $s_j$ and hence the number of transitions increases from $s$ to at most $s \cdot (2k + 1) + 3k + k^2$.

We now observe that the combination of Lemma 1 and Proposition 2 provides significant savings for alphabets of small size.

Proof of Theorem 1. Let $T_R$ be the expression tree and $T_R^*$ be the decomposition tree. If $u$ is a node of $T_R^*$, then let $T_R^*(u)$ denote the subtree of $T_R^*$ with root $u$. We begin with a sketch of the construction of a small ε-free NFA for the expression $R$. We proceed iteratively. In a first phase we process a cut of low nodes of $T_R^*$, where any such node $u$ has at most $L_1$ original transitions in its subtree $T_R^*(u)$, where $L_1$ will be fixed later. We insert additional transitions between descendants of $u$ according to Lemma 1 and obtain an ε-free NFA $N_1(u)$. We repeat this procedure in phase $j$; this time the cut consists of nodes $u$ with at most $L_j$ original transitions in their respective subtree $T_R^*(u)$. The parameters $L_j$ will be fixed later; here we only assume that $L_1 < \cdots < L_{j-1} < L_j < \cdots$ holds. Thus we process cuts of nodes of increasing height until we reach the root of $T_R^*$. Let $D$ be the set of descendants $w$ of $u$ which we processed in the previous phase. At the beginning of phase $j$ we assume that all ε-free NFA $N_{j-1}(w)$ for $w \in D$ have been constructed. When constructing an ε-free NFA $N_j(u)$ we are now facing a more complicated situation than in Lemma 1.
Firstly, when building $N_j(u)$ in phase $j > 1$, we have to merge the ε-free NFA $N_{j-1}(w)$ for all $w \in D$. As opposed to the case of original transitions, any such NFA $N_{j-1}(w)$ may accept arbitrary strings instead of just single letters. To prepare for this more complicated scenario we first apply Proposition 2 to replace $N_{j-1}(w)$ by an equivalent ε-free NFA with few post-initial and pre-final states: all additional transitions can now be connected with one of these relatively few states. Assume that $u$ represents the expression tree $T_R^v(C)$. Then the ε-free NFA $N_j(u)$ will be built from $N_R^v(C \cup D)$. In particular, the artificial transitions corresponding to any $w \in D$ are replaced by the ε-free NFA $N_{j-1}(w)$. But, and this is the second difference to the situation of Lemma 1, the artificial transitions $q_1 \xrightarrow{S} q_2$ corresponding to a node in $C$ are kept: this procedure allows to plug in the ε-free NFA $N$ recognizing $S$, whenever $N$ is constructed.

In summary, $N_j(u)$ results from its base NFA $N_R^v(C \cup D)$ by removing all ε-transitions, keeping all artificial transitions for nodes in $C$, replacing artificial transitions for nodes in $D$ by the previously determined ε-free NFA and by adding new ε-free transitions. We have to deal with the last point in detail. Our construction utilizes the following invariant: if $w \in D$ represents the expression tree $T_R^x(C')$, then the NFA $N_{j-1}(w)$ is equivalent with the NFA $N_R^x(C')$. Here we assume, for $N_{j-1}(w)$ as well as for $N_R^x(C')$, that any artificial transition $q_1 \xrightarrow{S} q_2$ produces all words of the regular expression $S$. We do not differentiate between the first and subsequent phases, since subsequent phases are more general: in phase 1 only ε-free transitions are to be merged. But we may interpret any such transition as ε-free NFAs $N_0(w)$ of phase 0 consisting of a single transition only. In particular, the invariant holds initially.
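
The normalization of Proposition 2, which the sketch above applies to every $N_{j-1}(w)$, can be illustrated as follows. This is a minimal sketch under several assumptions that are not in the paper: an NFA is a triple (initial state, set of final states, set of `(p, a, q)` transitions), all names (`accepts`, `few_boundary_transitions`, the tuple-valued new states) are hypothetical, and the two inheritance steps are chained so that a new state $s_c$ also inherits transitions that now leave a new state $r_a$, which is one reading of the proof that makes words of length three work.

```python
def accepts(nfa, word):
    """Simulate an epsilon-free NFA on a sequence of letters."""
    initial, finals, transitions = nfa
    current = {initial}
    for letter in word:
        current = {q for (p, a, q) in transitions if p in current and a == letter}
    return bool(current & finals)


def few_boundary_transitions(nfa, alphabet):
    """Sketch of Proposition 2: equivalent NFA with few initial/final transitions.

    Assumes the old state names do not clash with the new tuple-valued states.
    """
    initial, finals, trans = nfa
    q0, qf = ("new", "q0"), ("new", "qf")
    r = {a: ("r", a) for a in alphabet}   # new post-initial states
    s = {a: ("s", a) for a in alphabet}   # new pre-final states
    # R_a: old post-initial states reached by an a-transition from the old start.
    post = {a: {q for (p, b, q) in trans if p == initial and b == a} for a in alphabet}
    # S_a: old pre-final states sending an a-transition into an old final state.
    pre = {a: {p for (p, b, q) in trans if q in finals and b == a} for a in alphabet}

    new = set(trans)                      # old transitions are kept
    for a in alphabet:
        new.add((q0, a, r[a]))
        new.add((s[a], a, qf))
    # r_a inherits the outgoing transitions of the states in R_a.
    for a in alphabet:
        new |= {(r[a], b, q) for (p, b, q) in trans if p in post[a]}
    # s_a inherits the incoming transitions of the states in S_a
    # (assumption: including the copies that now leave some r_b).
    for a in alphabet:
        new |= {(p, b, s[a]) for (p, b, q) in list(new) if q in pre[a]}
    # Words of length one and two are accepted directly.
    for a in alphabet:
        if accepts(nfa, (a,)):
            new.add((q0, a, qf))
        for b in alphabet:
            if accepts(nfa, (a, b)):
                new.add((r[a], b, qf))

    new_finals = {qf} | ({q0} if initial in finals else set())
    return q0, new_finals, new
```

In the result, only the transitions leaving the new initial state or entering the new final state count as initial or final transitions, and there are at most $3k + k^2$ of them, independent of the size of the input automaton. This sketch makes no attempt to match the paper's bound on the total number of transitions.
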

As in the proof of Lemma 1 we assume that all states of $N_R$ have an ε-loop. We now begin the formal description of our construction.

2.3.1 Phase j.

We consider all nodes $u$ of $T_R^*$ which have at most $L_j$ original transitions in their subtree $T_R^*(u)$, whereas its parent has more than $L_j$ original transitions. Obviously these nodes define a cut in $T_R^*$. Let $D$ be the set of descendants $w$ of $u$ which we processed in the previous phase. We build an ε-free NFA $N_j(u)$ from the ε-free NFAs $N_{j-1}(w)$ for all $w \in D$. For any such descendant $w$ we apply Proposition 2 to $N_{j-1}(w)$ and obtain an ε-free NFA $N_{j-1}(w)^*$ with at most $O(k^2)$ initial or final transitions; moreover $\mathrm{size}(N_{j-1}(w)^*) = O(k^2 + k \cdot \mathrm{size}(N_{j-1}(w)))$. We utilize the few initial or final transitions to cheaply interconnect $N_{j-1}(w)^*$ with (endpoints of artificial transitions assigned to) ancestors of $w$ within $T_R^*(u)$.

Assume again that $u$ represents the expression tree $T_R^v(C)$. Then the ε-free NFA $N_j(u)$ is obtained from $N_R^v(C \cup D)$ by removing all ε-transitions, replacing the artificial transitions assigned to a node $w \in D$ by the ε-free NFA $N_{j-1}(w)^*$, keeping all artificial transitions for nodes in $C$ and adding additional ε-free transitions. Only the last point has to be explained. As in the construction of Lemma 1, $N_j(u)$ keeps the initial and final states $q_1^u$ and $q_2^u$ of $N_R^v(C \cup D)$. The insertion of new transitions is now a more complex task than in the situation of Lemma 1, since we are working with full-fledged ε-free NFAs instead of ε-free transitions. In particular we have to differentiate three cases, namely firstly the new case $\varepsilon \in L(N_{j-1}(w)^*)$, then the original case considered in Lemma 1, namely $a \in L(N_{j-1}(w)^*)$ for some letter $a$, and finally the second new case, namely that $L(N_{j-1}(w)^*)$ contains words of length at least two. Let $q_1, q_2$ be the unique initial and final states of $N_{j-1}(w)$.
(0) Assume $\varepsilon \in L(N_{j-1}(w)^*)$. This case establishes ε-paths and is of interest only for the two remaining cases where reachability by ε-paths is crucial.
(1) Assume $a \in L(N_{j-1}(w)^*)$ for some letter $a$.
For ny ncestors t 1, t 2 of w in T R (u), for ny endpoints qt1, q t2 of their respective rtificil trnsitions nd for ny ε-pths q t1 q1 nd q 2 q t 2, introduce the trnsition q t1 q t 2. (This procedure is completely nlogous to Lemm 1.) (2) Let q 1 r be n rbitrry initil trnsition of N j 1 (w). Then, for ny ncestor t of w in T R (u), for ny endpoint qt of its rtificil trnsition nd for ny ε-pth q t q1 introduce the trnsition q t r. Anlogously, if r q2 is n rbitrry finl trnsition of N j 1 (w) nd if there is n ε-pth q 2 q t, then introduce the trnsition r q t. We tret the rtificil trnsitions of TR v (C), respectively their ε-free NFA which will replce the rtificil trnsitions t lter time, in exctly the sme wy. We hve to show tht N j (u) stisfies the invrint, i.e., tht N j (u) nd NR v(c) re equivlent whenever ny rtificil trnsition q 1 q2 produces ll words of the regulr expression. If we insert trnsition p q, then there re sttes r, s nd pth p r s q in NR v(c). Thus L(Nj (u)) L(NR v(c)). Any ccepting pth P in NR v (C) trverses ε-free NFAs corresponding to some sequence (w 1,..., w s ) for w 1,...,w s D; we my require tht t lest one letter is red for ech w i. ince ll sttes of N R hve ε-loops, we my ssume tht ll ε-free trnsitions re seprted by ε-pths. Hence, if w i represents n NFA with initil stte q wi 1 nd finl stte q wi 2, then there is ε-pth from qwi 2 to q wi+1 1 in NR v(c). Assume tht s nd t re the lest common ncestors in TR (u) of w i 1 nd w i nd of w i nd w i+1 respectively. We pply Proposition 1 (c) nd obtin ε-pths q s q wi 1 nd q wi 2 q t for pproprite endpoints q s, q t of the rtificil trnsitions of s nd t

respectively. If at least two letters are read for $w_i$ then the corresponding accepting path Q in $N_j(u)$ jumps from $q_s$ to a post-initial state of $N^*_{j-1}(w_i)$, then traverses $N^*_{j-1}(w_i)$ and finally jumps from a pre-final state of $N^*_{j-1}(w_i)$ to $q_t$; otherwise Q jumps from $q_s$ directly to $q_t$. Thus $L(N_R^v(C)) \subseteq L(N_j(u))$: $N_j(u)$ and $N_R^v(C)$ are equivalent.

2.3.2 Accounting. In phase j we are interconnecting not only the ε-free automata $N_{j-1}(w)$ of the previous phase, but also (the NFAs constructed for) the artificial transitions. We therefore begin our analysis by comparing the number of artificial transitions used for phase j with the size of the cut for phase j. Remember that a node u belongs to the cut for phase j iff u has at most $L_j$ original transitions in its subtree $T_R(u)$, but its parent p has more than $L_j$ original transitions in its subtree $T_R(p)$.

Proposition 3. (a) The ancestors of cut nodes define a binary tree $C_j$ with cut nodes as the set of leaves. (b) The number of different artificial transitions generated at a proper ancestor of a cut node is smaller than the size of the cut. (c) The total number of artificial transitions used when constructing all ε-free NFAs $N_j(u)$ of phase j is not larger than the size of the cut processed in phase $j-1$.

Proof. (a) Let v be an ancestor of a cut node. If v does not belong to the cut, then v has more than $L_j$ original transitions and hence v has two children which are ancestors of cut nodes. (Here we assume that a node is its own ancestor.) (b) Since an ancestor generates exactly one artificial transition, there are no more artificial transitions than there are inner nodes of $C_j$. According to part (a), $C_j$ is a binary tree and hence the number of inner nodes is smaller than the number of leaves, i.e., smaller than the size of the cut. (c) Observe first that an artificial transition occurs in at most one cut node: if an artificial transition is a descendant of the separating node, it appears only in the left subtree and otherwise only in the right subtree. The claim follows now from part (b).
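The counting behind parts (a) and (b) of Proposition 3 can be illustrated on a toy tree. The sketch below is only an illustration under simplifying assumptions that are not taken from the paper: the tree is modelled as nested Python pairs, every leaf stands for one original transition, and the names `leaf_count` and `cut_and_inner` are hypothetical helpers.

```python
def leaf_count(tree):
    """Number of leaves; a leaf is any non-tuple, an inner node is a pair."""
    if not isinstance(tree, tuple):
        return 1
    left, right = tree
    return leaf_count(left) + leaf_count(right)


def cut_and_inner(tree, limit):
    """Return (cut nodes, number of proper ancestors of cut nodes).

    A node joins the cut as soon as its subtree has at most `limit` leaves,
    which happens exactly when its parent (if any) still had more than `limit`.
    Assumes limit >= 1, so every leaf qualifies and the recursion terminates.
    """
    if leaf_count(tree) <= limit:
        return [tree], 0
    left_cut, left_inner = cut_and_inner(tree[0], limit)
    right_cut, right_inner = cut_and_inner(tree[1], limit)
    return left_cut + right_cut, 1 + left_inner + right_inner


# Toy tree with 8 leaves (assumption: one leaf = one original transition).
toy = ((("a", "b"), "c"), (("d", "e"), ("f", ("g", "h"))))
cut, inner = cut_and_inner(toy, 2)
print(len(cut), inner)
```

Every node above the cut has two children above or on the cut, so the ancestors form a full binary tree and the number of inner nodes stays below the number of cut nodes, as part (b) states.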
Thus we may only count the number of ε-free transitions introduced because of the ε-free NFAs $N^*_{j-1}(w)$ and may disregard artificial transitions altogether. We have partitioned the new ε-free transitions into two classes. To count transitions from class (1), observe that $T_R(u)$ has at most $O(\log_2 L_j)$ levels and the NFA $N_j(u)$ has to merge at most $O(L_j / L_{j-1})$ descendant NFAs $N^*_{j-1}(w)$. We may now apply the analysis developed for Lemma 1 to obtain the upper bound $O(\frac{L_j}{L_{j-1}} \cdot \log_2 L_j \cdot \log_2 2k)$ on the number of class (1) transitions introduced for $N_j(u)$. The transitions in class (2) connect one of the $O(k^2)$ post-initial and pre-final states of some $N^*_{j-1}(w)$ with endpoints of artificial transitions for at most $O(\log_2 L_j)$ ancestors within $T_R(u)$. Thus for each NFA $N_j(u)$ we have introduced at most $O(k^2 \cdot \frac{L_j}{L_{j-1}} \cdot \log_2 L_j)$ transitions from the second class. If each descendant NFA $N_{j-1}(w)$ has size at most $s_{j-1}$, then all descendants contribute not more than $O(k \cdot \frac{L_j}{L_{j-1}} \cdot s_{j-1})$ transitions including the blow-up due to Proposition 2. Hence $N_j(u)$ has at most $O(s_j)$ transitions, where

$$s_j = k \cdot \frac{L_j}{L_{j-1}} \cdot s_{j-1} + \frac{L_j}{L_{j-1}} \cdot \log_2 L_j \cdot \log_2 2k + k^2 \cdot \frac{L_j}{L_{j-1}} \cdot \log_2 L_j \le k \cdot \frac{L_j}{L_{j-1}} \cdot s_{j-1} + 2k^2 \cdot \frac{L_j}{L_{j-1}} \cdot \log_2 L_j. \qquad (1)$$
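As a sanity check on the shape of recurrence (1), the following sketch iterates its right-hand side as an equality and compares the result with the sum obtained by unrolling it. It assumes the initial values $L_0 = s_0 = 1$ and the tower $L_j = 2^{L_{j-1}}$ that the analysis adopts further on; the function names are hypothetical and the constants hidden in the O-notation are ignored.

```python
def tower(i):
    """Return [L_0, ..., L_i] with L_0 = 1 and L_j = 2 ** L_{j-1}."""
    levels = [1]
    for _ in range(i):
        levels.append(2 ** levels[-1])
    return levels


def iterate_recurrence(k, i):
    """Iterate s_j = k*(L_j/L_{j-1})*s_{j-1} + 2*k^2*(L_j/L_{j-1})*log2(L_j), s_0 = 1."""
    levels = tower(i)
    s = 1
    for j in range(1, i + 1):
        ratio = levels[j] // levels[j - 1]      # exact: both are powers of two
        log_lj = levels[j].bit_length() - 1     # exact log2 of a power of two
        s = k * ratio * s + 2 * k * k * ratio * log_lj
    return s


def unrolled_sum(k, i):
    """k^i * n + sum_{t<i} 2*k^(t+2) * (n / L_{i-t-1}) * log2(L_{i-t}) with n = L_i."""
    levels = tower(i)
    n = levels[i]
    total = k ** i * n
    for t in range(i):
        log_l = levels[i - t].bit_length() - 1
        total += 2 * k ** (t + 2) * (n // levels[i - t - 1]) * log_l
    return total


for k in (1, 2, 3):
    for i in range(5):
        print(k, i, iterate_recurrence(k, i), unrolled_sum(k, i))
```

With the tower choice, $\log_2 L_{i-t} = L_{i-t-1}$, so every summand collapses to $2k^{t+2} n$, which is where a bound of order $k^{i+1} n$ comes from.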

We iterate recurrence (1) and get for $1 \le r \le j$

$$s_j \le k^r \cdot \frac{L_j}{L_{j-r}} \cdot s_{j-r} + \sum_{t=0}^{r-1} 2k^{t+2} \cdot \frac{L_j}{L_{j-1-t}} \cdot \log_2 L_{j-t}.$$

Assume that the regular expression R has length n. Thus, if we assume that $n = L_i$ and set $r = j = i$, then we introduce at most $O(s_i)$ transitions, where

$$s_i \le k^i \cdot \frac{n}{L_0} \cdot s_0 + \sum_{t=0}^{i-1} 2k^{t+2} \cdot \frac{n}{L_{i-t-1}} \cdot \log_2 L_{i-t}. \qquad (2)$$

Since phase 1 starts from phase 0 NFAs, each consisting of a single transition, we may set $L_0 = s_0 = 1$ and the first term of (2) coincides with $O(k^i \cdot n)$. We set $L_j = 2^{L_{j-1}}$ and the sum in (2) is bounded by $O(k^{i+1} \cdot n)$. Thus $i = \log^* n$ and the regular expression R is recognized by an ε-free NFA with at most $s_i = O(k^{1+\log^* n} \cdot n)$ transitions.

3 The Lower Bound

We consider the regular expression $E_n = (1 + \varepsilon) \cdot (2 + \varepsilon) \cdots (n + \varepsilon)$ of strictly increasing sequences. The following lower bound is asymptotically optimal and improves upon the $\Omega(n \log_2^2 n / \log_2 \log_2 n)$ bound of [8].

Lemma 2. ε-free NFAs for $E_n$ have at least $\Omega(n \log_2^2 n)$ transitions.

Before giving a proof we show that Theorem 2 is a consequence of Lemma 2. We concatenate $E_k$ exactly n/k times with itself to obtain $R_{n,k} = (E_k)^{n/k}$. Now assume that $N_{n,k}$ is an ε-free NFA recognizing $R_{n,k}$. We say that a transition e of $N_{n,k}$ belongs to copy i iff e is traversed by an accepting path with label sequence $(1 \cdot 2 \cdots k)^{i-1} \, \sigma \, (1 \cdot 2 \cdots k)^{n/k-i}$ while reading the string $\sigma \ne \varepsilon$. Now assume that there is a transition e which belongs to two different copies i, j with $i < j$. Then we can construct an accepting path with label sequence $(1 \cdot 2 \cdots k)^{j-1} \, \tau \, (1 \cdot 2 \cdots k)^{n/k-i}$ and $N_{n,k}$ accepts a word outside of $R_{n,k}$. Thus any transition belongs to at most one copy. $N_{n,k}$ has, as a consequence of Lemma 2, at least $\Omega(k \log_2^2 k)$ transitions for each copy and hence $N_{n,k}$ has at least $\Omega(\frac{n}{k} \cdot k \log_2^2 k)$ transitions. Observe that the unary regular expression $1^n$ requires NFAs of linear size and hence we actually get the lower bound $\Omega(\frac{n}{k} \cdot k \log_2^2 2k)$ also for $k = 1$.

3.1 An outline of the argument

Let $n = 2^k$ and let $N_n$ be an arbitrary ε-free NFA for $E_n$. Our basic approach follows the argument of [8].
In particular, we may assume that $N_n$ is in normal form, i.e., $\{0, 1, \ldots, n\}$ is the set of states of $N_n$ and any transition $i \xrightarrow{l} j$ satisfies $i < l \le j$. To study the behavior of transitions we introduce the ordered complete binary tree $T_n$ with nodes $\{1, \ldots, n-1\}$ and depth $k-1$. We assign names to nodes such that an inorder traversal of $T_n$ produces the sequence $(1, \ldots, n-1)$. Finally we label the root r of $T_n$ with the set $L(r) = \{1, \ldots, n\}$. If node v is labeled with the set $L(v) = \{i+1, \ldots, i+2t\}$, then we label its left child $v_l$ with $L(v_l) = \{i+1, \ldots, i+t\}$ and its right child $v_r$ with $L(v_r) = \{i+t+1, \ldots, i+2t\}$. Finally define $|v| = |L(v)|$ as the size of v. Observe that $|v| = 2$ holds for every leaf v and we interpret its children $v_l$, resp. $v_r$ as virtual leaves.
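The labeling of $T_n$ can be made concrete with a few lines of code. This is a minimal sketch, assuming n is a power of two; the function name `build_tree` and the representation of $L(v)$ as a pair (lo, hi) of interval endpoints are illustrative choices, not notation from the paper.

```python
def build_tree(n):
    """Return {node name: (lo, hi)} where L(v) = {lo, ..., hi} for each node v of T_n.

    A node with L(v) = {i+1, ..., i+2t} receives the name i+t, so that an
    inorder traversal lists the names 1, ..., n-1 in increasing order.
    """
    labels = {}

    def visit(lo, hi):
        size = hi - lo + 1
        if size < 2:                    # virtual leaves carry no node of T_n
            return
        name = lo + size // 2 - 1       # i + t with i = lo - 1 and t = size / 2
        labels[name] = (lo, hi)
        visit(lo, name)                 # left child:  L(v_l) = {i+1, ..., i+t}
        visit(name + 1, hi)             # right child: L(v_r) = {i+t+1, ..., i+2t}

    visit(1, n)
    return labels


tree = build_tree(8)
for name in sorted(tree):
    print(name, tree[name])
```

For n = 8 the root is node 4 with label set {1, ..., 8}, and the leaves are the odd-numbered nodes, each with a label set of size 2.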

Example 1. Again set $n = 2^k$. We recursively construct a family of ε-free NFAs $A_n$ to recognize $E_n$. $\{0, 1, \ldots, n-1, n\}$ is the set of states of $A_n$; state 0 is the initial and state n is the final state of $A_n$. To obtain $A_n$ place two copies of $A_{n/2}$ in sequence: $\{0, 1, \ldots, n/2-1, n/2\}$ and $\{n/2, n/2+1, \ldots, n-1, n\}$ are the sets of states of the first and second copy respectively, where the final state n/2 of the first copy is also the initial state of the second copy. If $(a_1, \ldots, a_r, a_{r+1}, \ldots, a_s)$ is any increasing sequence with $a_r \le n/2 < a_{r+1}$, then the sequence has an accepting path which starts in 0, reaches state n/2 when reading $a_r$ and ends in state n when reading $a_s$. But increasing sequences ending in a letter $\le n/2$, resp. starting in a letter $> n/2$ have to be accepted as well. Therefore direct all transitions, ending in the final state n/2 of the first copy, also into the final state n. Analogously, direct all transitions, starting from the initial state n/2 of the second copy, also out of the initial state 0.

Now unroll the recursion and visualize $A_n$ on the tree $T_n$ (after disregarding the initial state 0 and the final state n). The root of $T_n$ plays the role of state n/2. In particular, for any node v there are $|v|$ many transitions with labels from the set $L(v)$ between v and the root. Thus the root is the target of $n \cdot k = n \log_2 n$ transitions, implying that $A_n$ has $n \log_2^2 n$ transitions if transitions incident with states 0 or n are disregarded.

Definition 2. We say that a node v of $T_n$ is crossed from the left in $N_n$ iff for all $i \in L(v_l)$ and all sequences σ with $\sigma \, i \in E_n$ there is a path in $N_n$ with label sequence $\sigma \, i$ which ends in a state $y \in L(v_r)$. If all sequences $i \, \tau \in E_n$ with arbitrary $i \in L(v_r)$ have a path which starts in some state $x \in L(v_l)$, then we say that v is crossed from the right. In particular the last transition of the path crosses v, since it ends in $L(v_r)$ and is labeled with a letter from $L(v_l)$.

Fig. 6. An i-transition crossing v from the left

Proposition 4. [8] Let v be an arbitrary node of $T_n$.
Then for any ε-free NFA in normal form, v is crossed from the left or v is crossed from the right.

Proof. Assume that v is not crossed from the left. Then there is a word $\sigma \, i \in E_n$ with $i \in L(v_l)$ such that no path in $N_n$ with label sequence $\sigma \, i$ has a final transition crossing v. If v is also not crossed from the right, then there is a word $j \, \tau \in E_n$ with $j \in L(v_r)$ such that no path in $N_n$ with label sequence $j \, \tau$ has an initial transition crossing v. But then $N_n$ rejects $\sigma \, i \, j \, \tau \in E_n$.

Let C be the set of nodes $v \in T_n$ which are crossed from the left. We assume that more nodes are crossed from the left and hence we concentrate on C. (See (6) in Section 3.3 for a formal definition of "more".)
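The recursive automaton $A_n$ of Example 1 is small enough to build and test by brute force. The sketch below is one possible reading of that construction, with two assumptions that the example leaves open: the base case $A_1$ is taken to be the single transition from state 0 to state 1 with label 1, and the empty word is ignored (accepting it would require state 0 to be final as well). All function names are hypothetical.

```python
from itertools import product


def build_A(n, offset=0):
    """Transitions (x, a, y) of A_n on states offset..offset+n, letters offset+1..offset+n."""
    if n == 1:
        return {(offset, offset + 1, offset + 1)}   # assumed base case A_1
    half = n // 2
    mid, final = offset + half, offset + n
    first = build_A(half, offset)                   # copy on states offset..mid
    second = build_A(half, mid)                     # copy on states mid..final
    # Transitions entering the final state mid of the first copy also enter `final`.
    extra = {(x, a, final) for (x, a, y) in first if y == mid}
    # Transitions leaving the initial state mid of the second copy also leave `offset`.
    extra |= {(offset, a, y) for (x, a, y) in second if x == mid}
    return first | second | extra


def accepts(transitions, n, word):
    """Subset simulation from initial state 0; accept iff final state n is reached."""
    states = {0}
    for letter in word:
        states = {y for (x, a, y) in transitions if x in states and a == letter}
    return n in states


def strictly_increasing(word):
    return all(p < q for p, q in zip(word, word[1:]))


A4 = build_A(4)
ok = all(
    accepts(A4, 4, w) == strictly_increasing(w)
    for length in range(1, 6)
    for w in product(range(1, 5), repeat=length)
)
print(len(A4), ok)
```

Every transition (x, a, y) produced this way satisfies $x < a \le y$, so the automaton is in the normal form used throughout Section 3.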

Assume that $w \in T_n$ belongs to C and that node v belongs to Left(w), the set of nodes of $T_n$ which belong to the left subtree of w. Then any sequence $\sigma \, j \in E_n$ with $j \in L(v_r)$, and hence $j \in L(w_l)$, has a path $p_{\sigma,j}$ in $N_n$ with label sequence $\sigma \, j$ which ends in a state $y \in L(w_r)$. Observe that the last transition $e = (x, y)$ of $p_{\sigma,j}$ identifies w as the unique tree node with $j \in L(w_l)$ and $y \in L(w_r)$. Moreover, if $x \in L(v_l)$, then e also identifies v as the unique tree node with $x \in L(v_l)$ and $j \in L(v_r)$.

We now observe that $N_n$ has $\Omega(n \log_2^2 n)$ transitions if a majority of pairs (v, w) with $v \in$ Left(w) is identified for sufficiently many labels $j \in L(v_r)$. In particular, define $N(h, h')$ for $h < h' \le k-1$ as the number of pairs (j, w), where $w \in T_n$ has height $h'$ and j belongs to the right subtree of a node $v \in$ Left(w) with height h. Then $N(h, h') = n/4$, since for any w exactly one fourth of all labels j is counted. But then $\sum_{h < h' \le k-1} N(h, h') = \Omega(n \log_2^2 n)$ holds and it suffices to show that each pair (v, w) with $w \in C$ and $v \in$ Left(w) has $\Omega(|v_r|)$ transitions which identify v as well as w.

Labels $j \in L(v_r)$ are problematic if all j-transitions $e = (x, y)$ with $x \in L(v)$ and $y \in L(w_r)$ are short for (v, w), i.e., any such transition e starts in $x \in L(v_r)$. If a label $j \in L(v_r)$ is short, then j-transitions into $L(w_r)$ depart close to j. But, since w is crossed from the left, preceding i-transitions, for $i \in L(u)$ with $u \in$ Left(v), have to reach one of these starting points, and if these starting points are "close to home" for many short labels j, then consequently many copies of i-transitions are required. To formalize this intuition we determine how far to the left short j-transitions extend, but not with respect to transitions starting in $L(v)$ and ending in $L(w_r)$ for some specific w, but rather with respect to a worst-case sequence $\tau = j \, \sigma \, k \in E_n$ (with $k \in L(w_l)$ for an arbitrary $w \in C$) such that any path with sequence τ starts very close to j, if we require the path to start in $L(v)$ and to end in $L(w_r)$.

Definition 3. Vertices $v \in T_n$, $w \in C$ (with $v < w$) as well as labels $j \in L(v_r)$ and $k \in L(w_l)$ are given.
Define

$$d_{v,w}(j, k) = \min_{\tau = j \, \sigma \, k \, \in E_n} \ \max_{x \in L(v), \, y \in L(w_r)} \{\, j - x \mid \text{there is a path } x \xrightarrow{*} y \text{ with sequence } \tau \,\}$$

and $d_v(j) = \min_{w \in C, \, k \in L(w_l)} d_{v,w}(j, k)$.

Fig. 7. Measuring the minimal distance of starting points x of j-transitions from j

Observe first that $d_v(j)$ is only defined iff there is a node $w \in C$ with $v < w$. But we can assume that the node $w = n-1$ is crossed from the left: $L(w_l) = \{n-1\}$ holds and any transition with label $n-1$ has to either end in node w or in the virtual

leaf $w_r$. Thus if we copy all transitions with label $n-1$ from w to $w_r$, then w is crossed from the left at the cost of at most doubling the size of the NFA.

For any $i \in L(u)$ with $u \in$ Left(v) there has to be an i-transition which approaches j within distance at most $d_v(j)$. Next we determine how close the majority of labels $j \in L(v_r)$ have to be approached.

Definition 4. Let s be maximal with the property that at least $|v_r|/2$ labels $j \in L(v_r)$ satisfy $d_v(j) \le |v_r|/s$. Set $s(v) = s$ and call a label $j \in L(v_r)$ regular for v iff $d_v(j) \le |v_r|/s(v)$ holds.

If $j \in L(v_r)$, then $d_{v,w}(j, k) \le 2|v_r|$, since the starting point x of a j-transition belongs to $L(v)$. But then $d_v(j) \le 2|v_r|$ and

$$s(v) \ge 1/2 \qquad (3)$$

follows, since $d_v(j) \le 2|v_r|$ for all labels $j \in L(v_r)$. At least one half of all labels $j \in L(v_r)$ are regular, i.e., have j-transitions which are forced by some node $w \in C$ to have starting points within distance at most $|v_r|/s(v)$ from j.

Now, if v is crossed from the left and if u belongs to Left(v), then at least $\Omega(s(v))$ i-transitions end in $v_r$ for all labels $i \in L(u)$: all regular labels $j \in L(v_r)$ have to be approached within distance at most $|v_r|/s(v)$. All in all $|u| \cdot s(v)$ transitions are required for fixed u and v. Any such transition identifies v, however the same transition may be counted for several nodes $u \in$ Left(v). In particular we show

Lemma 3. $N_n$ has at least $\Omega\big(\sum_{v \in C} \sum_{u \in \mathrm{Left}(v)} \frac{|u| \cdot s(v)}{\log_2^2(4s(u))}\big)$ transitions.

We prove Lemma 3 in the next section and show in Section 3.3 that Lemma 2 is a consequence of Lemma 3.

3.2 Short Transitions

Let u be an arbitrary node. Then less than $|u_r|/2$ labels $i \in L(u_r)$ satisfy $d_u(i) \le \frac{|u_r|}{2s(u)}$ and hence

$$\frac{|u_r|}{2s(u)} < d_u(i) \qquad (4)$$

holds for more than one half of all labels $i \in L(u_r)$.

Proof of Lemma 3. We arbitrarily pick nodes $u \in T_n$, $v \in C$ with $u \in$ Left(v) and a regular label j for v. Then there is a node $w \in C$ with $v < w$ and a label sequence $\tau = j \, \tau' \, k \in E_n$ with $k \in L(w_l)$ such that each path for τ, which begins in $x \in L(v)$ and ends in $L(w_r)$, satisfies $j - x \le |v_r|/s(v)$. Let h be the smallest label in $L(u_l)$.
If i belongs to $L(u_r)$, then any path with label sequence $h \, i \, \tau$ which ends in $L(w_r)$, has to have an i-transition e which starts in $L(u)$ (since $h \in L(u_l)$) and ends in $L(v)$ (since $u \in$ Left(v) and $j \in L(v)$). From all i-transitions which belong to a path $u \xrightarrow{*} w_r$ with label sequence $i \, \tau$, we select an i-transition $e = (x, y)$ with smallest possible left endpoint $x \in L(u)$ and call e distinguished (for (u, v)).

When counting transitions of $N_n$ we restrict ourselves to distinguished transitions. Firstly we determine the number of distinguished i-transitions for $i \in L(u_r)$ which start in $L(u)$ and end in $L(v_r)$. Secondly we bound the effect of multiple counting: all i-transitions do start in $L(u) \subseteq L(v_l)$ and hence they identify v, whenever they end in $L(v_r)$. However i-transitions may not identify u.

At most $|v_r|/s(v)$ regular labels j for v have j-transitions with a common left endpoint. Moreover at most $|v_r|/s(v)$ regular labels have transitions with left endpoint in $L(v_l)$ and therefore at least

$$\frac{|v_r|/2 - |v_r|/s(v)}{|v_r|/s(v)} = \frac{|v_r|/2}{|v_r|/s(v)} - 1 = \frac{s(v)}{2} - 1$$

different left endpoints in $L(v_r)$ are required. As a consequence, for all $i \in L(u_r)$, at least $s(v)/2 - 1$ distinguished i-transitions start in $L(u)$ and end in $L(v_r)$. This result is meaningless if $s(v) < 2$, but since $v \in C$, $N_n$ has for every label $i \in L(u_r)$ a path with label sequence $h \, i$, where the i-transition starts in $L(u)$ and ends in $L(v_r)$. Thus for all $i \in L(u_r)$ at least $\max\{s(v)/2 - 1, 1\} \ge s(v)/4$ distinguished i-transitions start in $L(u)$ and end in $L(v_r)$. All these transitions identify v by their left and right endpoint.

Let $E(u, v)$ be the set of transitions of $N_n$ which are distinguished for u and v. We have just seen that $|E(u, v)| \ge |u_r| \cdot s(v)/4$ holds. Transitions in $E(u, v)$ identify v, however they may not identify u. In particular, for $i \in L(u_r)$ let $e = (x, y)$ be a distinguished i-transition for u and v. If $v \in C$, then

$$d_u(i) \le d_{u,v}(i, j) \le i - x \qquad (5)$$

holds, since distinguished transitions maximize the difference between their label i and their left endpoint (among all i-transitions participating in a path $u \xrightarrow{*} w_r$ which ends in a j-transition). Furthermore let µ be a left descendant of v of smallest depth such that the distinguished i-transition e is also distinguished for µ and v. Observe that u has to be a descendant of µ and hence e is distinguished for at most $\log_2 \frac{|\mu|}{i - x}$ nodes u. Thus, in order to control multiple counting, we have to bound $i - x$ from below. We apply (4) and (5) and obtain

$$\frac{|\mu_r|}{2 s(\mu)} < d_\mu(i) \le i - x$$

for at least one half of all labels $i \in L(\mu_r)$. Therefore transition e belongs to at most

$$\log_2 \frac{|\mu|}{i - x} \le \log_2 \frac{|\mu|}{|\mu_r| / (2 s(\mu))} = \log_2(4s(\mu))$$

sets $E(u, v)$. To avoid multiple counting we assign the weight $1/\log_2^2(4s(u))$ to a transition $e \in E(u, v)$.
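If $\mu_1, \ldots, \mu_{r-1}, \mu_r = \mu$ are all the tree nodes in Left(v) for which e is distinguished and if $\mu_i$ is a descendant of $\mu_{i+1}$, then $i \le \log_2(4s(\mu_i))$ and hence $\sum_{i=1}^{r} 1/\log_2^2(4s(\mu_i)) \le \sum_{i=1}^{r} 1/i^2 = O(1)$. To summarize: we have $|E(u, v)| \ge |u_r| \cdot s(v)/4$ and there is no multiple counting, if we assign the weight $1/\log_2^2(4s(u))$ to transitions in $E(u, v)$. Hence $N_n$ has asymptotically at least

$$\sum_{v \in C} \sum_{u \in \mathrm{Left}(v)} \frac{|E(u, v)|}{\log_2^2(4s(u))} = \Omega\Big(\sum_{v \in C} \sum_{u \in \mathrm{Left}(v)} \frac{|u| \cdot s(v)}{\log_2^2(4s(u))}\Big)$$

transitions and the claim follows.

3.3 Accounting

We have assumed in the proof sketch that more nodes are crossed from the left. We now formalize this to mean

$$\sum_{v \in C, \ \mathrm{depth}(v) \ge (\log_2 n)/2} \frac{|v|}{2} \log_2 |v| \ \ge \sum_{v \notin C, \ \mathrm{depth}(v) \ge (\log_2 n)/2} \frac{|v|}{2} \log_2 |v|. \qquad (6)$$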

If (6) does not hold, then we work instead with the set of nodes which are crossed from the right. Thus we may assume (6). We set

$$T = \sum_{v: \ \mathrm{depth}(v) \ge (\log_2 n)/2} \frac{|v|}{2} \log_2 |v|$$

and observe that

$$T = \sum_{d = (\log_2 n)/2}^{\log_2 n} \ \sum_{\mathrm{depth}(v) = d} \frac{|v|}{2} \log_2 |v| = \sum_{d = (\log_2 n)/2}^{\log_2 n} \frac{n}{2} \cdot d = \Omega(n \log_2^2 n)$$

holds. Thus Lemma 2 is a consequence of Lemma 3, if we show

$$\sum_{v \in C} \sum_{u \in \mathrm{Left}(v)} \frac{|u| \cdot s(v)}{\log_2^2(4s(u))} = \Omega(T), \qquad (7)$$

which in turn follows, if

$$\sum_{u \in \mathrm{Left}(v)} \frac{|u| \cdot s(v)}{\log_2^2(4s(u))} = \Omega(|v| \log_2 |v|) \qquad (8)$$

holds for sufficiently many nodes v. We first formalize what "sufficiently many nodes" means.

Definition 5. (a) $w(u) = \frac{|u|}{2} \log_2 |u|$ is the weight of u and $q_v(u) = w(u) / \sum_{u' \in \mathrm{Left}(v)} w(u')$ is the probability of u with respect to v, provided u belongs to Left(v). We define the probability $\mathrm{prob}_v[E]$ of an event $E \subseteq$ Left(v) by the distribution $q_v$. (b) K is a suitably large constant with $\frac{1}{4} \cdot 2^{\sqrt{16k/\log_2 k}} \ge 2^{k^{1/3}}$ for all $k \ge K$. Finally set $p(v) = 16/\log_2 K$, if $s(v) \le K$, and $p(v) = 16/\log_2 s(v)$ otherwise.

We begin by verifying (8) for all nodes v which qualify: we disqualify v iff $\mathrm{depth}(v) < (\log_2 n)/2$ or $v \notin C$ or if

$$\mathrm{prob}_v\big[\, s(u) \le \tfrac{1}{4} \cdot 2^{\sqrt{p(v) \cdot s^*(v)}} \ \big|\ u \in \mathrm{Left}(v) \,\big] < p(v), \qquad (9)$$

where $s^*(v) = \max\{s(v), K\}$. If we disqualify v for the last reason, then we also disqualify all descendants $u \in$ Left(v) with $s(u) \le 2^{\sqrt{p(v) \cdot s^*(v)}}/4$. Thus v is disqualified either for obvious reasons (i.e., $\mathrm{depth}(v) < (\log_2 n)/2$ or $v \notin C$) or if too few left descendants u have sufficiently small s-values and hence if (8) seems to be false for v.

In a second step we have to show that a node v is disqualified with sufficiently small probability. This is not surprising, since, if v is disqualified for non-obvious reasons, then s(u) is extremely large in comparison to s(v) for an overwhelming majority of left descendants u of v. To later disqualify a left "majority-descendant" u is now even harder, since p(u) is inversely proportional to $\log_2 s(u)$. We begin by investigating nodes which qualify.

Lemma 4. Assume that v qualifies. Then

$$\sum_{u \in \mathrm{Left}(v)} \frac{|u| \cdot s(v)}{\log_2^2(4s(u))} \ge \frac{w(v)}{8K}.$$

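Proof. Since v qualifies, we know that v belongs to C and $\mathrm{depth}(v) \ge (\log_2 n)/2$ holds. Moreover $s(u) \le 2^{\sqrt{p(v) \cdot s^*(v)}}/4$ holds with probability $q \ge p(v)$ and hence $\log_2^2(4s(u)) \le p(v) \cdot s^*(v)$ holds with probability q. We set $p_d(v) = \mathrm{prob}_v[\, s(u) \le 2^{\sqrt{p(v) \cdot s^*(v)}}/4 \mid u \in \mathrm{Left}(v), \ \mathrm{depth}(u) = d \,]$ and $p_d = \mathrm{prob}_v[\, \mathrm{depth}(u) = d \mid u \in \mathrm{Left}(v) \,]$. Then $q = \sum_d p_d(v) \cdot p_d \ge p(v)$. We obtain

$$\sum_{u \in \mathrm{Left}(v)} \frac{|u| \cdot s(v)}{\log_2^2(4s(u))} = \sum_{d=0}^{\log_2 |v| - 1} \ \sum_{u \in \mathrm{Left}(v), \ \mathrm{depth}(u) = d} \frac{|u| \cdot s(v)}{\log_2^2(4s(u))} \ \ge \sum_{d=0}^{\log_2 |v| - 1} p_d(v) \sum_{u \in \mathrm{Left}(v), \ \mathrm{depth}(u) = d} \frac{|u| \cdot s(v)}{p(v) \cdot s^*(v)}, \qquad (10)$$

since $1/\log_2^2(4s(u)) \ge 1/(p(v) \cdot s^*(v))$ holds with probability at least $p_d(v)$ for nodes $u \in$ Left(v) with $\mathrm{depth}(u) = d$. Thus we can further simplify the right hand side of (10) and get

$$\sum_{u \in \mathrm{Left}(v)} \frac{|u| \cdot s(v)}{\log_2^2(4s(u))} \ge \frac{s(v)}{s^*(v)} \cdot \frac{|v|}{2} \sum_{d=0}^{\log_2 |v| - 1} \frac{p_d(v)}{p(v)}.$$

But

$$p_d = \frac{\frac{|v|}{4} \cdot d}{\sum_{i=0}^{\log_2 |v| - 1} \frac{|v|}{4} \cdot i} = \frac{d}{\sum_{i=0}^{\log_2 |v| - 1} i} \le \frac{2 \log_2 |v|}{(\log_2 |v| - 1) \log_2 |v|} \le \frac{4}{\log_2 |v|}$$

and hence $\sum_d 4 p_d(v)/\log_2 |v| \ge \sum_d p_d(v) \cdot p_d \ge p(v)$, resp. $\sum_d p_d(v)/p(v) \ge (\log_2 |v|)/4$. Thus we obtain

$$\sum_{u \in \mathrm{Left}(v)} \frac{|u| \cdot s(v)}{\log_2^2(4s(u))} \ge \frac{s(v)}{s^*(v)} \cdot \frac{|v|}{2} \sum_{d=0}^{\log_2 |v| - 1} \frac{p_d(v)}{p(v)} \ge \frac{s(v)}{s^*(v)} \cdot \frac{w(v)}{4}.$$

If $s(v) \ge K$, then $s(v) = s^*(v)$ and we gain the contribution $w(v)/4$. Otherwise $s(v) \le K$ and we obtain at least the contribution $\frac{1/2}{K} \cdot \frac{w(v)}{4} = \frac{w(v)}{8K}$, since $s(v) \ge 1/2$ according to (3). Thus we have reached our goal of a contribution of at least $\frac{w(v)}{8K}$ in both cases.

It suffices to show that sufficiently many nodes v qualify. If v is disqualified because of (9), then we lose the contribution $w(v) + p(v) \cdot \sum_{u \in \mathrm{Left}(v)} w(u)$, since we not only lose v, but possibly also a p(v)-fraction of all left descendants of v. How large is this loss?

Proposition 5. For any node v,

$$w(v) + p(v) \cdot \sum_{u \in \mathrm{Left}(v)} w(u) \le 2p(v) \cdot \sum_{u \in \mathrm{Left}(v)} w(u). \qquad (11)$$

Proof. We first observe

$$\sum_{u \in \mathrm{Left}(v)} w(u) = \sum_{d=0}^{\log_2 |v| - 1} \ \sum_{u \in \mathrm{Left}(v), \ \mathrm{depth}(u) = d} \frac{|u|}{2} \cdot d = \sum_{d=0}^{\log_2 |v| - 1} \frac{|v|}{4} \cdot d = \frac{|v| \cdot \log_2 |v| \cdot (\log_2 |v| - 1)}{8} \ge w(v) \cdot \frac{\log_2 |v|}{8}$$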

and therefore $w(v) \le 8 \cdot \sum_{u \in \mathrm{Left}(v)} w(u) / \log_2 |v|$ follows. Hence the contribution we lose in a disqualification step for node v is bounded by

$$w(v) + p(v) \cdot \sum_{u \in \mathrm{Left}(v)} w(u) \le \Big(\frac{8}{\log_2 |v|} + p(v)\Big) \cdot \sum_{u \in \mathrm{Left}(v)} w(u) \le 2p(v) \cdot \sum_{u \in \mathrm{Left}(v)} w(u),$$

since $8/\log_2 |v| \le 16/\log_2 n \le p(v)$.

We are now ready to bound the probability of disqualifying nodes.

Lemma 5. A node v with $\mathrm{depth}(v) \ge (\log_2 n)/2$ is disqualified with probability at most $\frac{1}{2} + \frac{64}{\log_2 K}$.

Proof. Due to (6), a node v (with $\mathrm{depth}(v) \ge (\log_2 n)/2$) is disqualified based on non-membership in C with probability at most 1/2. Thus it suffices to show that a disqualification because of (9) occurs with probability at most $64/\log_2 K$. We order the disqualification steps for the nodes w according to increasing depth of w. (If a node is disqualified as a consequence of an earlier disqualification, then this node is not listed.) Now assume that node v is disqualified followed by a later disqualification of a left descendant u of v. Remember that with v we also disqualify all left descendants u with $s(u) \le \frac{1}{4} \cdot 2^{\sqrt{p(v) \cdot s^*(v)}}$. But $\frac{1}{4} \cdot 2^{\sqrt{p(v) \cdot s^*(v)}} = \frac{1}{4} \cdot 2^{\sqrt{16 s^*(v)/\log_2 s^*(v)}}$ and by the choice of K, $\frac{1}{4} \cdot 2^{\sqrt{16k/\log_2 k}} \ge 2^{k^{1/3}}$ for all $k \ge K$. As a consequence, if the left descendant u of v has survived the disqualification step of v, then $s(u) > 2^{s^*(v)^{1/3}}$ and in particular $p(u) \le p(v)/2$ follows.

If node v is disqualified, then we lose a contribution of at most $2p(v) \cdot \sum_{u \in \mathrm{Left}(v)} w(u)$ due to (11). But then all later disqualification steps for left descendants of v result in a combined contribution of at most $\big(\sum_{i \ge 0} 2p(v)/2^i\big) \cdot \sum_{u \in \mathrm{Left}(v)} w(u) \le 4p(v) \cdot \sum_{u \in \mathrm{Left}(v)} w(u)$.

We are considering a sequence of disqualified nodes, where the nodes are ordered according to increasing depth. If we double the loss contribution of node v, i.e., if we assume the loss $4p(v) \cdot \sum_{u \in \mathrm{Left}(v)} w(u)$ for node v, then we may demand that no node in the sequence is a left descendant of another node in the sequence. The loss measured according to (11) is then maximized, if we disqualify all nodes of the rightmost path starting in the root r and if we assume that $s(v) \le K$ for all nodes of the path.
Obviously the overall loss is then bounded by

$$\Big(\sum_{i \ge 0} 4p(r) \cdot 2^{-i}\Big) \cdot \sum_{u \in \mathrm{Left}(r)} w(u) \le 4 \cdot \frac{16}{\log_2 K} \cdot 2 \sum_{u \in \mathrm{Left}(r)} w(u).$$

In other words, a node is disqualified with probability at most $4 \cdot \frac{16}{\log_2 K}$.

Lemma 2 is now an immediate consequence of Lemma 4 and Lemma 5, if K is chosen sufficiently large.

4 Conclusions and Open Problems

We have shown that every regular expression R of length n over an alphabet of size k can be recognized by an ε-free NFA with $O(n \cdot \min\{\log_2 n \cdot \log_2 2k, \ k^{1+\log^* n}\})$ transitions. For alphabets of fixed size (i.e., $k = O(1)$) our result implies that $O(n \cdot 2^{O(\log_2^* n)})$ transitions and hence almost linear size suffice. We have also shown the lower bound $\Omega(n \cdot \log_2^2 2k)$ and hence the construction of [7] is optimal for large alphabets, i.e., if $k = n^{\Omega(1)}$.
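To get a feel for how the two terms inside the minimum compare, the following sketch evaluates them numerically. It is only an illustration: the constants hidden in the O-notation are ignored, $\log^* n$ is taken to be the number of times $\log_2$ must be applied before the value drops to at most 1, and the function names are hypothetical.

```python
import math


def log_star(n):
    """Number of applications of log2 needed to bring n down to at most 1."""
    count = 0
    x = n
    while x > 1:
        x = math.log2(x)    # math.log2 also accepts arbitrarily large ints
        count += 1
    return count


def upper_bound_terms(n, k):
    """Return (n * log2(n) * log2(2k), n * k^(1 + log* n)), constants ignored."""
    first = n * math.log2(n) * math.log2(2 * k)
    second = n * k ** (1 + log_star(n))
    return first, second


for k in (2, 4, 16):
    first, second = upper_bound_terms(65536, k)
    print(k, first, second, "second term smaller" if second < first else "first term not larger")
```

Because $\log^* n$ grows extremely slowly, the second term is within a constant factor of n for fixed k, while it grows quickly with k; the first term grows only logarithmically in k.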

A first important open question concerns the binary alphabet. Do ε-free NFAs of linear size exist or is it possible to show a super-linear size lower bound? Moreover, although we have considerably narrowed the gap between lower and upper bounds, the gap for alphabets of intermediate size, i.e., $\omega(1) = k = n^{o(1)}$ remains to be closed and this is the second important open problem. For instance, for $k = \log_2 n$ the lower bound $\Omega(n \cdot (\log_2 \log_2 n)^2)$ and the upper bound $O(n \cdot \log_2 n \cdot \log_2 \log_2 n)$ are still by a factor of $\log_2 n / \log_2 \log_2 n$ apart. Thirdly the size blowup when converting an NFA into an equivalent ε-free NFA remains to be determined. In [6] a family $N_n$ of NFAs is constructed which has equivalent ε-free NFAs of size $\Omega(n^2 / \log_2^2 n)$ only. However the alphabet of $N_n$ has size $n / \log_2 n$ and the gap between this lower bound and the corresponding upper bound $O(n^2 \cdot |\Sigma|)$ remains considerable.

Acknowledgement: Thanks to Gregor Gramlich and Juraj Hromkovic for many helpful discussions.

References

1. R. Book, S. Even, S. Greibach, G. Ott, Ambiguity in graphs and expressions, IEEE Trans. Comput. 20, pp. 149-153, 1971.
2. V. Geffert, Translation of binary regular expressions into nondeterministic ε-free automata with O(n log n) transitions, J. Comput. Syst. Sci. 66, pp. 451-472, 2003.
3. V.M. Glushkov, The abstract theory of automata, Russian Math. Surveys 16, pp. 1-53, 1961. Translation by J. M. Jackson from Usp. Mat. Nauk 16, pp. 3-41, 1961.
4. C. Hagenah, A. Muscholl, Computing ε-free NFA from regular expressions in O(n log^2(n)) Time, ITA, 34 (4), pp. 257-278, 2000.
5. J.E. Hopcroft, R. Motwani, J.D. Ullman, Introduction to Automata Theory, Languages and Computation, Addison-Wesley, 2001.
6. J. Hromkovič, G. Schnitger, Comparing the size of NFAs with and without ε-transitions. Theor. Comput. Sci., 380 (1-2), pp. 100-114, 2007.
7. J. Hromkovič, S. Seibert, T. Wilke, Translating regular expression into small ε-free nondeterministic automata, J. Comput. Syst. Sci., 62, pp. 565-588, 2001.
8. Y.
Lifshits, A lower bound on the size of ε-free NFA corresponding to a regular expression, Inf. Process. Lett. 85(6), pp. 293-299, 2003.
9. M.O. Rabin, D. Scott, Finite automata and their decision problems, IBM J. Res. Develop. 3, pp. 114-125, 1959.
10. S. Sippu, E. Soisalon-Soininen, Parsing Theory, Vol. I: Languages and Parsing, Springer-Verlag, 1988.
11. G. Schnitger, Regular expressions and NFAs without ε transitions, Proc. of the 23rd STACS, Lecture Notes in Computer Science 3884, pp. 432-443, 2006.
12. K. Thompson, Regular expression search, Commun. ACM 11, pp. 419-422, 1968.