CONCATENATION AND KLEENE STAR ON DETERMINISTIC FINITE AUTOMATA

GUO-QIANG ZHANG, XIANGNAN ZHOU, ROBERT FRASER, LICONG CUI

Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, Ohio 44106, USA
E-mail: {gq,lxc48}@case.edu

College of Mathematics and Econometrics, Hunan University, Changsha 41001, China
Email: xnzhou8106@163.com

Department of Mathematics, Case Western Reserve University, Cleveland, Ohio 44106, USA
Email: rgf11@case.edu

This paper presents direct, explicit algebraic constructions of concatenation and Kleene star on deterministic finite automata (DFA), using the Boolean-matrix method of Zhang in Ref. 1 and ideas of Kozen in Ref. 2. The consequences are threefold: (1) it provides an alternative proof of the classical Kleene's Theorem on the equivalence of regular expressions and DFAs without using nondeterministic finite automata (NFA); (2) it demonstrates how the language constructions of concatenation and Kleene star can be captured elegantly as algebraic laws in the form of binomial theorems; (3) it provides a demonstration of the (tight) upper bounds of the state complexity of concatenation and Kleene star, and also offers a way to study the state complexity of NFA.

Keywords: Automata; Concatenation; Kleene Star; Boolean matrices.

1. Matrix-Approach to Automata Theory

We begin by providing a brief account of the matrix-approach to automata theory as introduced by Zhang (Ref. 1). A Boolean matrix is a matrix (of size $m \times n$) whose elements are either 0 or 1, where the internal operations are carried out over the Boolean algebra. We write $B^{m \times n}$ for the set of all Boolean matrices of size $m \times n$. A Boolean (row) vector of dimension $n$ is an $n$-tuple $(b_1, b_2, \ldots, b_n)$ of 0s and 1s. We write $B^n$ for the set of all Boolean vectors of dimension $n$. A column vector is the transpose $(\cdot)^t$ of a row vector. The characteristic vector of a subset $A$ of $\{1, 2, \ldots, n\}$ is the row vector $I^n_A \in B^n$ such that the $p$-th component of $I^n_A$ is 1 if and only if $p \in A$. The characteristic vector of a singleton set $\{p\}$ is written as $I^n_p$, or simply $I_p$. $O^{m \times n}$ stands for the $(m \times n)$-matrix all of whose elements are 0. When the dimension is fixed by context, we abuse notation and write $O^{n \times n}$ as 0.

A deterministic finite automaton (DFA) is a 5-tuple $M = (Q, \Sigma, \delta, q_0, F)$, where $Q$ is the finite set of states, $\Sigma$ is the alphabet, $\delta : Q \times \Sigma \to Q$ is the transition function, $q_0$ is the start state, and $F$ is the set of final states. For notational convenience, we use initial segments of natural numbers $\{1, 2, \ldots, n\}$ to denote the set of states, and fix 1 to be the start state, for base/background DFAs. When there is no confusion, we omit the indication of the start state (which is assumed to be state 1 by default).

Each $n$-state DFA determines an (associated) matrix system $\{\Delta_a \mid a \in \Sigma\}$, where $\Delta_a$ is the $(n \times n)$ adjacency matrix of the $a$-labeled subgraph associated with the DFA. In other words, the $(i, j)$ entry of $\Delta_a$ is 1 if and only if $\delta(i, a) = j$. Since $M$ is a DFA, each $\Delta_a$ is row-stochastic (i.e., every row contains precisely a single 1). The (Boolean) sum of all members $\Delta_a$ in the matrix system is the adjacency matrix of the transition graph. For a string $w = a_1 a_2 \cdots a_n$ over $\Sigma$, we write $\Delta_w$ for the matrix product $\Delta_{a_1} \Delta_{a_2} \cdots \Delta_{a_n}$. The language accepted by $M$, denoted $L(M)$, is the set
\[ L(M) = \{ w \mid I_{q_0} \Delta_w I^t_F = 1 \}. \]
See Ref. 1 for more details of the utility of this approach.

Example 1.1. The matrix system of the following DFA is
\[ \left\{ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right\}. \]
[Figure: transition diagram of a two-state DFA with start state 1 and edges labeled a and b.]

With the use of Boolean matrices, it is straightforward to describe a wide spectrum of constructions on DFA in a simple, algebraic manner, with their correctness established by induction and algebraic manipulation (Ref. 1). Here we briefly treat Brzozowski's derivatives (Ref. 3) as an example. Given a string $u$ and a language $L$, the Brzozowski derivative $u^{-1}L$ is the language $\{w \mid uw \in L\}$. Suppose $L$ is accepted by an $n$-state DFA $M = (Q, \Sigma, \delta, F)$, with $\{\Delta_a \mid a \in \Sigma\}$ its matrix system. Then a DFA accepting $u^{-1}L$ can be given as $M' = (Q', \Sigma, \delta', q'_0, F')$, where
\[ Q' = \{A \mid A \in B^{n \times n}\}, \quad q'_0 = \Delta_u, \quad \delta'(A, a) = A \Delta_a, \quad F' = \{A \mid I_1 A I^t_F = 1\}. \]
One can see that $w$ is accepted by $M'$ if and only if $\delta'(\Delta_u, w) = \Delta_{uw} \in F'$, i.e., $uw$ is accepted by $M$.

In the remainder of this paper, we present the constructions of concatenation and Kleene star on DFA, and analyze the state complexity of such constructions. It turns out that, without additional effort, these algebraic constructions are already optimal in the number of states used after projecting to the first row.

2. Concatenation

This section presents the concatenation construction.

Theorem 2.1. Suppose matrix systems $\{\Delta^1_a \mid a \in \Sigma\}$ and $\{\Delta^2_a \mid a \in \Sigma\}$ are associated with $m$- and $n$-state DFAs $M_1 = (Q_1, \Sigma, \delta_1, F_1)$ and $M_2 = (Q_2, \Sigma, \delta_2, F_2)$, respectively. The DFA $M = (Q, \Sigma, \delta, q_0, F)$ defined as
\[ Q = \{(A, B) \mid A \in B^{m \times m},\ B \in B^{m \times n}\}, \quad q_0 = (T_0, T), \]
\[ \delta((A, B), a) = (A, B)\,\Delta_a \ \bigl(= (A \Delta^1_a,\ A \Delta^1_a T + B \Delta^2_a)\bigr), \]
\[ F = \{(A, B) \mid I^m_1 B I^t_{F_2} = 1\}, \]
where
\[ \Delta_a = \begin{pmatrix} \Delta^1_a & \Delta^1_a T \\ 0 & \Delta^2_a \end{pmatrix} \text{ for } a \in \Sigma, \qquad T = I^t_{F_1} I^n_1, \]
and $T_0$ is the $(m \times m)$ identity matrix, has the property that $L(M) = L(M_1) L(M_2)$.

To understand how this construction works, suppose $\delta(q_0, w) = (A, B)$ for some $w \in \Sigma^*$. By the definition of $\delta$, we have, for $a \in \Sigma$,
\[ \delta(q_0, wa) = (\Delta^1_{wa},\ \Delta^1_{wa} T + B \Delta^2_a). \]
Therefore, $\delta(q_0, wa) \in F$ if and only if $I^m_1 (\Delta^1_{wa} T + B \Delta^2_a) I^t_{F_2} = 1$, or
\[ (I^m_1 \Delta^1_{wa} I^t_{F_1}\, I^n_1 I^t_{F_2}) + (I^m_1 B \Delta^2_a I^t_{F_2}) = 1. \]
Hence, $\delta(q_0, wa) \in F$ if and only if either $wa \in L(M_1)$ (i.e., $I^m_1 \Delta^1_{wa} I^t_{F_1} = 1$) and $1 \in F_2$ (i.e., $I^n_1 I^t_{F_2} = 1$), or else $I^m_1 B \Delta^2_a I^t_{F_2} = 1$. In general, $I^m_1 A$, the first row of $A$, keeps track of the ending state through $w$ in $M_1$, and $I^m_1 B$ keeps track of all possible states (in $M_1$ and $M_2$) resulting from a decomposition $w = w_1 w_2$, with $w_1$ going through $M_1$ and $w_2$ going through $M_2$. This analysis is captured more precisely in the next lemma.

Lemma 2.1. Suppose $\delta(q_0, w) = (A, B)$ in $M$, and suppose $w = a_1 a_2 \cdots a_l$ where $a_i \in \Sigma$ for $1 \le i \le l$. We have
\[ B = \sum_{i=0}^{l} \Delta^1_{a_1 a_2 \cdots a_i}\, T\, \Delta^2_{a_{i+1} a_{i+2} \cdots a_l}. \]

Proof. Suppose $\delta(q_0, w) = (A, B)$ in the DFA $M$ given in Theorem 2.1, and suppose $w = a_1 a_2 \cdots a_l$, where $a_i \in \Sigma$ for $1 \le i \le l$. By induction on the length of $w$, we show that
\[ A = \Delta^1_w, \qquad B = \sum_{i=0}^{l} \Delta^1_{a_1 a_2 \cdots a_i}\, T\, \Delta^2_{a_{i+1} a_{i+2} \cdots a_l}. \]
Note that for $i = 0$ and $i = l$ the summand is $T \Delta^2_{a_1 a_2 \cdots a_l}$ and $\Delta^1_{a_1 a_2 \cdots a_l} T$, respectively.

(1) Suppose that $l = 1$ and $w = a_1$. Then
\[ \delta(q_0, a_1) = (T_0, T) \begin{pmatrix} \Delta^1_{a_1} & \Delta^1_{a_1} T \\ 0 & \Delta^2_{a_1} \end{pmatrix} = (\Delta^1_{a_1},\ \Delta^1_{a_1} T + T \Delta^2_{a_1}). \]
The conclusion holds.

(2) Suppose that the conclusion holds when $l = k - 1$, with $\delta(q_0, a_1 a_2 \cdots a_{k-1}) = (A_{k-1}, B_{k-1})$, where
\[ A_{k-1} = \Delta^1_{a_1 a_2 \cdots a_{k-1}}, \qquad B_{k-1} = \sum_{i=0}^{k-1} \Delta^1_{a_1 a_2 \cdots a_i}\, T\, \Delta^2_{a_{i+1} a_{i+2} \cdots a_{k-1}}. \]
Then when $l = k$ and $w = a_1 a_2 \cdots a_k$, we have
\[ \delta(\delta(q_0, a_1 a_2 \cdots a_{k-1}), a_k) = (A_{k-1}, B_{k-1}) \begin{pmatrix} \Delta^1_{a_k} & \Delta^1_{a_k} T \\ 0 & \Delta^2_{a_k} \end{pmatrix} = (A_{k-1} \Delta^1_{a_k},\ A_{k-1} \Delta^1_{a_k} T + B_{k-1} \Delta^2_{a_k}) \]
\[ = \Bigl( \Delta^1_w,\ \sum_{i=0}^{k} \Delta^1_{a_1 a_2 \cdots a_i}\, T\, \Delta^2_{a_{i+1} a_{i+2} \cdots a_k} \Bigr). \]
By induction, the conclusion holds for any $l \in N$.

This lemma captures the key technical content of the proof of Theorem 2.1, which is given below. It is interesting to observe that the lemma has the general flavor of a binomial theorem.
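Before the proof of Theorem 2.1, the closed form of Lemma 2.1 can be checked numerically on a small instance. The following is a minimal Python sketch, not part of the original paper: the two example DFAs (odd number of a's, and words ending in b), the list-of-rows matrix representation, and all function names are assumptions made for illustration. It runs the transition rule of Theorem 2.1 and compares the result with the sum in Lemma 2.1 and with a brute-force test of membership in $L(M_1) L(M_2)$.

```python
from itertools import product

def bmul(X, Y):
    """Boolean product of matrices X (p x q) and Y (q x r), stored as lists of 0/1 rows."""
    return [[int(any(X[i][k] and Y[k][j] for k in range(len(Y))))
             for j in range(len(Y[0]))] for i in range(len(X))]

def badd(X, Y):
    """Entrywise Boolean sum (logical or) of two matrices of equal size."""
    return [[x | y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

def delta(system, w):
    """Delta_w: product of the matrices of the letters of w (identity for the empty word)."""
    M = identity(len(next(iter(system.values()))))
    for a in w:
        M = bmul(M, system[a])
    return M

# Hypothetical example DFAs over {a, b}; state 1 (index 0) is the start state.
# M1 accepts words with an odd number of a's; M2 accepts words ending in b.
D1 = {"a": [[0, 1], [1, 0]], "b": [[1, 0], [0, 1]]}
F1 = [0, 1]
D2 = {"a": [[1, 0], [1, 0]], "b": [[0, 1], [0, 1]]}
F2 = [0, 1]
m, n = len(F1), len(F2)

# T = I_{F1}^t I_1^n: rows of final states of M1 point to the start state of M2.
T = [[int(F1[i] and j == 0) for j in range(n)] for i in range(m)]

def run(w):
    """State (A, B) reached from q0 = (T0, T) via (A, B) -> (A D1_a, A D1_a T + B D2_a)."""
    A, B = identity(m), T
    for a in w:
        A = bmul(A, D1[a])                     # A is now A D1_a
        B = badd(bmul(A, T), bmul(B, D2[a]))   # uses the new A and the old B
    return A, B

def closed_form(w):
    """Right-hand side of Lemma 2.1: sum over i of D1_{a1..ai} T D2_{a(i+1)..al}."""
    total = [[0] * n for _ in range(m)]
    for i in range(len(w) + 1):
        total = badd(total, bmul(bmul(delta(D1, w[:i]), T), delta(D2, w[i:])))
    return total

def accepts(system, final, w):
    """I_1 Delta_w I_F^t = 1: the first row of Delta_w meets the final states."""
    return any(x and f for x, f in zip(delta(system, w)[0], final))

def accepts_concat(w):
    """Acceptance in M: I_1^m B I_{F2}^t = 1."""
    _, B = run(w)
    return any(x and f for x, f in zip(B[0], F2))

# Exhaustive check on all words of length at most 5.
for length in range(6):
    for letters in product("ab", repeat=length):
        w = "".join(letters)
        A, B = run(w)
        assert A == delta(D1, w)        # A = D1_w
        assert B == closed_form(w)      # Lemma 2.1
        split = any(accepts(D1, F1, w[:i]) and accepts(D2, F2, w[i:])
                    for i in range(len(w) + 1))
        assert accepts_concat(w) == split   # Theorem 2.1 on this instance
```

The check is exhaustive only up to a fixed word length, so it illustrates the lemma on one instance rather than proving it. Only the first row of $B$ is consulted for acceptance, which is the observation exploited in Section 4.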

Proof of Theorem 2.1. Suppose that $\delta(q_0, w) = q$; then $w \in L(M)$ iff $q \in F$. If $w = \epsilon$, then $q = q_0$. Thus, $\epsilon \in L(M)$ iff $q_0 \in F$, iff $I^m_1 T I^t_{F_2} = 1$, iff $\epsilon \in L(M_1) L(M_2)$. Since $q \in F$ iff $I^m_1 B I^t_{F_2} = 1$, by Lemma 2.1 we have $w = a_1 a_2 \cdots a_l \in L(M)$ iff
\[ \sum_{i=0}^{l} I^m_1 \Delta^1_{a_1 a_2 \cdots a_i}\, T\, \Delta^2_{a_{i+1} a_{i+2} \cdots a_l} I^t_{F_2} = 1, \]
which means that $w = a_1 a_2 \cdots a_l \in L(M_1)$ and $\epsilon \in L(M_2)$, or there exists $1 \le i \le l - 1$ such that $u = a_1 a_2 \cdots a_i \in L(M_1)$, $v = a_{i+1} a_{i+2} \cdots a_l \in L(M_2)$ and $w = uv$, or $\epsilon \in L(M_1)$ and $w = a_1 a_2 \cdots a_l \in L(M_2)$. Therefore, $w \in L(M)$ iff $w \in L(M_1) L(M_2)$, that is, $L(M) = L(M_1) L(M_2)$.

3. Kleene Star

This section presents the Kleene star construction.

Theorem 3.1. Suppose the matrix system $\{\Delta^1_a \mid a \in \Sigma\}$ is associated with an $n$-state DFA $M_1 = (Q_1, \Sigma, \delta_1, F_1)$. The DFA $M = (Q, \Sigma, \delta, q_0, F)$ with $H = I^t_{F_1} I_1$ and
\[ Q = \{A \mid A \in B^{n \times n}\} \cup \{s\}, \quad q_0 = s, \]
\[ \delta(q, a) = \begin{cases} \Delta^1_a (H^0 + H^1), & \text{if } q = s, \\ A \Delta^1_a (H^0 + H^1), & \text{if } q = A, \end{cases} \]
\[ F = \{A \mid I_1 A I^t_{F_1} = 1\} \cup \{s\}, \]
has the property that $L(M) = (L(M_1))^*$. Here, $H^1 = H$ and $H^0$ is the identity matrix.

The role of $H$ is to mark possible positions for string partition. Even though it has no effect by itself for the acceptance of strings (and represents a redundant term), it accounts for the restart of $M_1$ and prepares the way for the next chunk of strings to be scanned from the initial state of $M_1$. Therefore, upon reading a symbol $a$, $M$ appends $a$ to the end of the current chunk, but branches with two threads: extending the current chunk (the $\Delta^1_a$ term) for one, and starting a new chunk (the $\Delta^1_a H$ term) for the other.

Lemma 3.1. Suppose $w = a_1 a_2 \cdots a_l$ with $a_i \in \Sigma$ for $1 \le i \le l$. We have, for the DFA $M$ given in Theorem 3.1,
\[ \delta(s, w) = \sum_{\substack{w = w_1 \cdots w_k,\ 1 \le k \le l \\ w_j \ne \epsilon,\ 1 \le j \le k}} \ \sum_{i = 0, 1} \Delta^1_{w_1} H \Delta^1_{w_2} H \cdots \Delta^1_{w_k} H^i. \]

Proof. We show that the conclusion holds by induction on the length of $w$.

(1) Suppose that $l = 1$ and $w = a_1$. Then by the definition of the DFA $M$ given in Theorem 3.1, we have
\[ \delta(s, a_1) = \Delta^1_{a_1} (H^0 + H^1) = \sum_{i = 0, 1} \Delta^1_{a_1} H^i. \]
The conclusion holds.

(2) Suppose that the conclusion holds when $l = k - 1$, i.e.,
\[ \delta(s, a_1 a_2 \cdots a_{k-1}) = \sum_{\substack{a_1 \cdots a_{k-1} = w_1 \cdots w_h,\ 1 \le h \le k-1 \\ w_j \ne \epsilon,\ 1 \le j \le h}} \ \sum_{i = 0, 1} \Delta^1_{w_1} H \cdots \Delta^1_{w_h} H^i. \]
Then when $l = k$ and $w = a_1 a_2 \cdots a_k$, we have
\[ \delta(s, a_1 a_2 \cdots a_k) = \delta(\delta(s, a_1 a_2 \cdots a_{k-1}), a_k) = \delta(s, a_1 a_2 \cdots a_{k-1})\, \Delta^1_{a_k} (H^0 + H^1). \]
Next, we show that
\[ \delta(s, a_1 \cdots a_{k-1})\, \Delta^1_{a_k} (H^0 + H^1) = \sum_{\substack{w = w_1 \cdots w_h,\ 1 \le h \le k \\ w_j \ne \epsilon,\ 1 \le j \le h}} \ \sum_{i = 0, 1} \Delta^1_{w_1} H \cdots \Delta^1_{w_h} H^i. \]
Let $L$ denote the left-hand side, $\delta(s, a_1 \cdots a_{k-1})\, \Delta^1_{a_k} (H^0 + H^1)$, and let $R$ denote the sum on the right-hand side.

Let $e$ be a term in $L$. Then $e = \Delta^1_{w_1} H \cdots \Delta^1_{w_h} H^i \Delta^1_{a_k} H^j$, where $i, j \in \{0, 1\}$ and $w_1 \cdots w_h = a_1 \cdots a_{k-1}$. If $i = 0$, then $e = \Delta^1_{w_1} H \Delta^1_{w_2} H \cdots \Delta^1_{w_h a_k} H^j$; take $w'_h = w_h a_k$, so that $w_1 \cdots w_{h-1} w'_h = a_1 \cdots a_{k-1} a_k$, which means $e$ is a term in $R$. If $i = 1$, then $e = \Delta^1_{w_1} H \Delta^1_{w_2} H \cdots \Delta^1_{w_h} H \Delta^1_{a_k} H^j$; take $w_{h+1} = a_k$, so that $w_1 \cdots w_h w_{h+1} = a_1 \cdots a_{k-1} a_k$, and again $e$ is a term in $R$. Hence, every term in $L$ is a term in $R$.

Let $e$ be a term in $R$. Then $e = \Delta^1_{w_1} H \cdots \Delta^1_{w_h} H^i$, where $w_1 \cdots w_h = w$. If $w_h = a_k$, then $e = \Delta^1_{w_1} H \cdots \Delta^1_{w_{h-1}} H \Delta^1_{a_k} H^i$ and $w_1 \cdots w_{h-1} = a_1 \cdots a_{k-1}$. By the induction hypothesis, $\Delta^1_{w_1} H \cdots \Delta^1_{w_{h-1}} H$ is a term in $\delta(s, a_1 \cdots a_{k-1})$. Thus, $e$ is a term in $L$. Otherwise, $w_h = w'_h a_k$ with $w'_h \ne \epsilon$. In this case $e = \Delta^1_{w_1} H \cdots H \Delta^1_{w'_h} \Delta^1_{a_k} H^i$ and $w_1 \cdots w_{h-1} w'_h = a_1 \cdots a_{k-1}$, so $\Delta^1_{w_1} H \cdots H \Delta^1_{w'_h}$ is a term in $\delta(s, a_1 \cdots a_{k-1})$. Thus, $e$ is a term in $L$. Therefore, every term in $R$ is a term in $L$.

Thus, when $l = k$, the conclusion holds. By induction, the conclusion holds for any $l \in N$.

Proof of Theorem 3.1. First, $s \in F$ implies $\epsilon \in L(M)$. Suppose $w = a_1 a_2 \cdots a_l$. Then by Lemma 3.1, $w \in L(M)$ iff there exist $w_1, w_2, \ldots, w_k$ such that $w = w_1 w_2 \cdots w_k$ and
\[ I_1 \Delta^1_{w_1} H \Delta^1_{w_2} H \cdots \Delta^1_{w_k} (H^0 + H^1) I^t_{F_1} = 1, \]
i.e., $w_1, w_2, \ldots, w_k \in L(M_1)$. Therefore, $L(M) = (L(M_1))^*$.

Remark. The essential language operators associated with regular languages are union, concatenation, and Kleene star. Having addressed the matrix constructions for concatenation and Kleene star, we only need to note that the union (and intersection) construction is straightforward and is left as an exercise.

4. State Complexity

State complexity studies the minimal number of states needed for a given language operation as a function of the sizes of the underlying automata (Ref. 4). One general observation on the constructions given in Sections 2 and 3 is that we only need to keep track of the first rows of the respective matrices used for states, since their status of being a final state is determined by prefixing $I_1$ in a matrix multiplication.

Theorem 4.1. Projecting to the first row by replacing $(A, B)$ systematically with $(I_1 A, I_1 B)$ for concatenation and replacing $A$ systematically with $I_1 A$ for Kleene star, we have:

(1) The number of reachable states for the concatenation construction given in Section 2 is at most $m 2^n - k 2^{n-1}$, where the first underlying DFA has $m$ states, the second has $n$ states, and $k$ is the number of final states of the first DFA.

(2) The number of reachable states for the Kleene star construction given in Section 3 is at most $2^{n-1} + 2^{n-k-1}$, where $n$ is the number of states of the underlying DFA and $k$ is the number of its non-initial final states.

We remark that these numbers are the lowest possible upper bounds, since they agree with the results in Ref. 4.

Proof. By replacing $(A, B)$ systematically with $(I_1 A, I_1 B)$ for concatenation and replacing $A$ systematically with $I_1 A$ for Kleene star, the construction $M$ of concatenation in Section 2 can be reduced to $M' = (Q', \Sigma, \delta', q'_0, F')$ with
\[ Q' = \{(A, B) \mid A \in B^m,\ B \in B^n\}, \quad q'_0 = (I^m_1 T_0, I^m_1 T) = I^m_1 q_0, \]
\[ \delta'((A, B), a) = (A, B)\,\Delta_a, \quad F' = \{(A, B) \mid B I^t_{F_2} = 1\}, \]
and the construction $M$ of Kleene star in Section 3 can be reduced to $M'' = (Q'', \Sigma, \delta'', s, F'')$ with
\[ Q'' = \{A \mid A \in B^n\} \cup \{s\}, \]
\[ \delta''(q, a) = \begin{cases} I_1 \Delta^1_a (H^0 + H^1), & \text{if } q = s, \\ A \Delta^1_a (H^0 + H^1), & \text{if } q = A, \end{cases} \]
\[ F'' = \{A \mid A I^t_{F_1} = 1\} \cup \{s\}. \]
In what follows, the state complexity of concatenation and Kleene star are obtained by using the equivalent constructions $M'$ and $M''$.

Concatenation. Let $k$ be the number of final states of $M_1$. Note that $\delta'((A, B), a) = (A, B)\,\Delta_a = (A \Delta^1_a,\ A \Delta^1_a T + B \Delta^2_a)$, where $(A, B) = \delta'(q'_0, w)$, $w \in \Sigma^*$. From the proof of Lemma 2.1, we know that $A = I^m_1 \Delta^1_w$, which means $A$ has exactly one entry equal to 1 among its $m$ bits, since $\Delta^1_w$ is row-stochastic (and so is $\Delta^1_{wa}$). This means that there are at most $m 2^n$ possible bit vectors of the form $(A \Delta^1_a,\ A \Delta^1_a T + B \Delta^2_a)$, where $m$ accounts for the variability of $A \Delta^1_a$ and $2^n$ for the variability of $A \Delta^1_a T + B \Delta^2_a$. However, not all $2^n$ combinations can be realized by $A \Delta^1_a T + B \Delta^2_a$: $A \Delta^1_a T$ is equal to $I^n_1$ if and only if $wa \in L(M_1)$. Hence the first entry in $B$ will always be equal to 1 if any of the positions in $A$ corresponding to a state in $F_1$ is equal to 1. In particular, we can never reach a state for which the entry of $A$ corresponding to a final state of $M_1$ is equal to 1 and the entry of $B$ corresponding to the start state of $M_2$ is equal to zero. There are $k 2^{n-1}$ states of this form. So the total number of reachable states in $M'$ is at most $m 2^n - k 2^{n-1}$.

Kleene star. Let $k$ be the number of non-initial final states of $M_1$. For nonempty $w \in \Sigma^*$ and $a \in \Sigma$, we have $\delta''(A, a) = A \Delta^1_a (H^0 + H^1)$, where $A = \delta''(s, w)$. Note that $A \Delta^1_a H = I_1$ if and only if $A \Delta^1_a I^t_{F_1} = 1$. This, in turn, happens if and only if $A \Delta^1_a$ has a 1 in some entry corresponding to a final state of $M_1$. But $\delta''(A, a)$ is the sum of $A \Delta^1_a$ and $A \Delta^1_a H$. In particular, this means that if any entry of $A \Delta^1_a$ corresponding to a final state of $M_1$ is equal to 1, then $A \Delta^1_a H = I_1$, and so the first entry of $A \Delta^1_a (H^0 + H^1)$ must be equal to 1 as well. Finally, because $A \Delta^1_a H$ is always equal to either 0 or $I_1$, we know that if any position except for the first one in $A \Delta^1_a (H^0 + H^1)$ is nonzero, then the corresponding position in $A \Delta^1_a$ must also be nonzero. Putting these facts together, we conclude that the first entry of $\delta''(A, a)$ will always be equal to 1 if any position corresponding to a final state is equal to 1. There are $2^{n-1}$ possibly reachable states in which there is a 1 in the first position, and $2^{n-k-1}$ possibly reachable states in which the first entry is 0 and the entry in the position corresponding to every element of $F_1$ is zero. Furthermore, we need to include the start state $s$ in the total number of states of the DFA. So the maximum number of reachable states in the DFA $M''$ is
\[ 2^{n-1} + 2^{n-k-1} + 1 - 1 = 2^{n-1} + 2^{n-k-1}, \]
where the added 1 counts $s$ and the subtracted 1 removes the all-zero vector, which is counted among the $2^{n-k-1}$ vectors but is never reached because every $\Delta^1_a$ is row-stochastic.

5. Conclusion

With the constructions given, we see that operations on regular expressions can be directly translated to constructions on DFA. We obtained along the way a proof of the classical Kleene's Theorem avoiding the use of NFA (using Arden's Lemma in the other direction). Our Lemmas 2.1 and 3.1 illustrate how laws of Boolean matrices capture language operations inductively and algebraically.
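As a concrete illustration, the Kleene star construction of Theorem 3.1 in its first-row form, together with the count in Theorem 4.1(2), can be exercised on a small instance. This is a minimal Python sketch under stated assumptions, not the authors' code: the example DFA (words with an odd number of a's), the tuple representation of row vectors, the marker `S` for the extra state $s$, and all function names are hypothetical.

```python
from itertools import product

S = "s"  # the extra start (and final) state s of Theorem 3.1

def vec_times(v, M):
    """Boolean product of a row vector v with a square 0/1 matrix M."""
    return tuple(int(any(v[k] and M[k][j] for k in range(len(M))))
                 for j in range(len(M)))

def step(q, a, system, final):
    """Projected transition: I_1 D_a (H^0 + H^1) from s, and q D_a (H^0 + H^1) otherwise."""
    v = tuple(system[a][0]) if q == S else vec_times(q, system[a])
    if any(x and f for x, f in zip(v, final)):
        v = (1,) + v[1:]   # v H = I_1 exactly when v meets a final state
    return v

def accepts_star(w, system, final):
    """Acceptance in the projected star automaton: s is final, else A I_F^t = 1."""
    q = S
    for a in w:
        q = step(q, a, system, final)
    return q == S or any(x and f for x, f in zip(q, final))

def accepts(w, system, final):
    """Membership in L(M1), by running the DFA from state 1 (index 0)."""
    v = tuple(int(j == 0) for j in range(len(final)))
    for a in w:
        v = vec_times(v, system[a])
    return any(x and f for x, f in zip(v, final))

def in_star(w, system, final):
    """Reference test for w in L(M1)*: try every split into nonempty accepted chunks."""
    ok = [True] + [False] * len(w)
    for j in range(1, len(w) + 1):
        ok[j] = any(ok[i] and accepts(w[i:j], system, final) for i in range(j))
    return ok[len(w)]

def reachable(system, final):
    """All states of the projected automaton reachable from s (s included)."""
    seen, todo = {S}, [S]
    while todo:
        q = todo.pop()
        for a in system:
            r = step(q, a, system, final)
            if r not in seen:
                seen.add(r)
                todo.append(r)
    return seen

# Hypothetical example: M1 accepts words over {a, b} with an odd number of a's.
DELTA = {"a": [[0, 1], [1, 0]], "b": [[1, 0], [0, 1]]}
FINAL = [0, 1]
n = len(FINAL)
k = sum(FINAL[1:])                        # number of non-initial final states
bound = 2 ** (n - 1) + 2 ** (n - k - 1)   # the count in Theorem 4.1(2)

# Exhaustive check on all words of length at most 6.
for length in range(7):
    for letters in product("ab", repeat=length):
        w = "".join(letters)
        assert accepts_star(w, DELTA, FINAL) == in_star(w, DELTA, FINAL)
assert len(reachable(DELTA, FINAL)) <= bound
```

On this two-state instance the reachable set has three elements, which meets the bound; larger examples are needed to see the bound at work in general, and the bounded word length makes this a sanity check rather than a proof.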
The natural constructions using matrix systems are also optimal in the usage of states. Our approach does not depend on the deterministic nature of the underlying automata until the topic of state complexity. Barring the use of $\epsilon$-edges, our constructions also work for NFA, possibly informing the study of state complexity for NFA in Ref. 5.

References

1. Guo-Qiang Zhang, Inform. Comput. 152(1), 138 (1999).
2. D. Kozen, Inform. Comput. 110, 366 (1994).
3. J.A. Brzozowski, J. Assoc. Comput. Mach. 11, 481 (1964).
4. S. Yu, Q. Zhuang, K. Salomaa, Theor. Comput. Sci. 125, 315 (1994).
5. S. Yu, Fundam. Inform. 64, 471 (2005).