A Parallel Algorithm for Minimization of Finite Automata

B. Ravikumar, X. Xiong
Department of Computer Science, University of Rhode Island, Kingston, RI 02881
E-mail: {ravi,xiong}@cs.uri.edu

Abstract

In this paper, we present a parallel algorithm for the minimization of deterministic finite state automata (DFA's) and discuss its implementation on a connection machine CM-5 using data parallel and message passing models. We show that its time complexity on a p-processor EREW PRAM (p <= n) for inputs of size n is O((n log^2 n)/p + log n log p) uniformly on almost all instances. The work done by our algorithm is thus within a factor of O(log n) of the best known sequential algorithm. The space used by our algorithm is linear in the input size. The actual resource requirements of our implementations are consistent with these estimates. Although parallel algorithms have been proposed for this problem in the past, they are not practical. We discuss the implementation details and the experimental results.

1. Introduction

Finite state machines are of fundamental importance in theory as well as applications. Their applications range from classical ones such as compilers and sequential circuits to recent applications including neural networks, hidden Markov models, hypertext meta languages and protocol verification. A central problem in automata theory is to minimize a given Deterministic Finite Automaton (DFA). This problem has a long history going back to the origins of automata theory. Huffman [7] and Moore [11] presented an algorithm for DFA minimization in the 1950's. Their algorithm runs in time O(n^2) on DFA's with n states and was adequate for most of the classical applications. Many variations of this algorithm have appeared over the years; see [14] for a comprehensive summary. Hopcroft [5] developed a significantly faster algorithm of time complexity O(n log n). Although Hopcroft's algorithm is very efficient in practice, it can't handle DFA's with millions of states. But there are many applications involving such large DFA's.
A parallel algorithm can handle such instances since the combined memory size of the processors is adequate to store and process such large DFA's. This work is part of a collection of parallel programs we are developing for many computation-intensive testing problems involving DFA's.

We will briefly review the previous work on the design of parallel algorithms for the DFA minimization problem. The problem has been studied extensively; see [10, 12, 9, 2] etc. JáJá and Kosaraju [9] studied the special case in which the input alphabet size is 1 on a mesh-connected parallel computer and obtained an efficient algorithm. Cho and Huynh [2] showed that the problem is in NC^2 and that it is NLOGSPACE-complete. The algorithm of [12] applies only to the case in which the input alphabet is of size 1. In this case, the most efficient algorithm is due to JáJá and Ryu [10], a CRCW-PRAM algorithm of time complexity O(log n) and total work O(n log log n). When the input alphabet is of size greater than 1, no efficient parallel algorithm (i.e., one of time complexity O(log^k n) for some fixed k) is known with a 'reasonable' total work (say, O(n^2) or below). In fact, the only published NC algorithm for this problem is due to [2], which has a time complexity O(log^2 n) but requires Omega(n^6) processors.
This paper differs from the work mentioned above in two ways: (i) Our goal is to implement parallel algorithms for DFA minimization on actual parallel machines. (ii) Our figure of merit is the average case performance, not the worst-case performance. Here we report on the implementation of programs for minimizing DFA's (with unrestricted input alphabet size) on a CM-5 connection machine with 512 processors. We implemented two classes of algorithms - one based on data parallel programming and the other based on message passing. Our programs perform well in practice. They can minimize 'typical' DFA's with millions of states in a few minutes and their scalability is good. The primary reason for their success is that they have very good performance on almost all instances. We formally establish this using a probabilistic model of Trakhtenbrot and Barzdin. We also discuss the implementation details and the experiments we conducted with our programs. We conclude with some new questions that arise from our study. In this abridged version, we omit the proofs of all the claims. They can be found in the full version.

2. Preliminaries

We assume familiarity with basic concepts about finite automata. To fix the notation, however, we will briefly introduce some of the central notions. For details, we refer the reader to [6]. A DFA M consists of a finite set Q of states, a finite input alphabet Sigma, a start state q_0 in Q, a set F of accepting states (F a subset of Q), and the transition function delta: Q x Sigma -> Q. An input string x is a sequence of symbols over Sigma. On an input string x = x_1 x_2 ... x_m, the DFA will visit a sequence of states q_0 q_1 ... q_m starting with the start state by successively applying the transition function. Thus q_1 = delta(q_0, x_1), q_2 = delta(q_1, x_2), etc. The language L(M) accepted by a DFA is defined as the set of strings x that take the DFA to an accepting state, i.e., x will be in L(M) if and only if q_m is in F. Such languages are called regular languages, a well-understood and important class.
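As a quick illustration of these definitions, a DFA can be simulated directly from delta, q_0 and F. The sketch below (all names are illustrative, not from the paper's code) checks membership in L(M) for a two-state DFA accepting strings with an odd number of a's.

```python
def accepts(delta, accepting, start, x):
    """Return True iff the DFA accepts input string x.

    delta maps (state, symbol) -> state, accepting is the set F,
    start is q_0.  Illustrative names, not the paper's code.
    """
    q = start
    for symbol in x:           # successively apply the transition function
        q = delta[(q, symbol)]
    return q in accepting      # x is in L(M) iff the final state is in F

# A two-state DFA over {a, b} accepting strings with an odd number of a's.
delta = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 0, (1, 'b'): 1}
print(accepts(delta, {1}, 0, "ab"))    # one 'a'  -> True
print(accepts(delta, {1}, 0, "abba"))  # two 'a's -> False
```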
Two DFA's M_1 and M_2 are said to be equivalent if they accept the same set of strings. It is common to present DFA minimization as an equivalent problem called the coarsest set partition problem [1], which is stated as follows: Given a set Q, an initial partition of Q into disjoint blocks {B_0, ..., B_{m-1}}, and a collection of functions f_i: Q -> Q, find the coarsest (having fewest blocks) partition of Q, say {E_1, E_2, ..., E_q}, such that: (1) the final partition is consistent with the initial partition, i.e., each E_i is a subset of some B_j, and (2) a and b in E_i implies that for all j, f_j(a) and f_j(b) are in the same block E_k. The connection to minimization of DFA is the following. Q is the set of states, f_i is the map induced by the symbol a_i (this means delta(q, a_i) = f_i(q)), and the initial partition is the one that partitions Q into two sets: F, the set of accepting states, and Q \ F, the nonaccepting ones. It should be clear that the size of the minimal DFA is the number of equivalence classes in the coarsest partition. In the following, we will assume that the input is a DFA with n states defined over a k-letter alphabet in which every state is reachable from the start state. This assumption simplifies the presentation of our algorithm; otherwise a preprocessing step should be added to delete the unreachable states.

3. Parallel Algorithm and Its Analysis

A high level description of our parallel algorithm is given below:

Algorithm ParallelMin1
input: DFA with n states, k inputs, 2 blocks B_0, B_1.
output: minimized DFA.
begin
  do
    for i = 0 to k-1 do
      for j = 0 to n-1 do in parallel
        Get the block number B of state q_j;
        Get the block number B' of the state delta(q_j, x_i);
        label the state q_j with the pair (B, B');
      Re-arrange (e.g.
      by parallel sorting) the states into blocks so that the states with the same labels are contiguous;
      end parallel for;
      Assign to all states in the same block a new (unique) block number in parallel;
    end for;
  until no new block produced;
  Deduce the minimal DFA from the final partition;
end.

The outer loop of the above algorithm appears inherently sequential and it seems very hard to parallelize it significantly. But this is not of great concern since experiments show that the total number of iterations of the outer loop is very small (usually less than
10) for all the DFA's we tested, of sizes ranging from 4 thousand to more than 1 million. This relative invariance of the number of iterations can be explained theoretically; see the discussion below. Thus, in practice, there is no real gain in attempting to parallelize the outer loop.

In the remainder of this section, we will mainly discuss the high-level details of implementing the above parallel algorithm on the EREW-PRAM model and present an analysis that holds for almost all instances in a precise sense. We start with a brief description of the model due to Trakhtenbrot and Barzdin [13]. Our notation will be similar to the one used by [3], who adapted the Trakhtenbrot-Barzdin model in their study of passive learning algorithms. Our model involves a slight generalization of theirs. Basically, the probability space on which we base our average case is narrowed by fixing a DFA G arbitrarily except for its set of accepting states. Thus an adversary can choose G (called the automaton graph) subject to the only condition that all states are reachable from the start state. The probability distribution is defined over all DFA's M_G which can be derived from this automaton graph by randomly selecting the set F of accepting states. In our model, we will allow each state to be included in F independently with probability p (p can be a function of n). Let P_{n,eps,p} be any predicate on an n-state automaton which depends on n, a confidence parameter eps (where 0 < eps < 1) and p (defined above). We say that uniformly almost all automata have property P_{n,eps,p} if the following holds: for all eps > 0, for all n > 0 and for any n-state underlying automaton graph G, if we randomly choose the set of accepting states where each state is included independently with probability p, then with probability at least 1 - eps, the predicate P_{n,eps,p} holds. Let M = <Q, Sigma, q_0, delta, F> be a DFA.
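The random choice of F in this model is straightforward to simulate: the automaton graph is fixed by the adversary, and only the accepting set is drawn at random. The sketch below (illustrative names, not the paper's code) draws F with each state included independently with probability p.

```python
import random

def random_accepting_set(n, p, rng=None):
    """Draw the accepting set F for an n-state automaton graph:
    each state joins F independently with probability p, as in the
    Trakhtenbrot-Barzdin style model described above.
    Illustrative sketch, not the paper's code."""
    rng = rng or random.Random()
    return {q for q in range(n) if rng.random() < p}

# The adversary fixes the automaton graph; only F is random.
F = random_accepting_set(8, 0.5, random.Random(42))
```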
The output associated with a string x = x_1 ... x_m from state q is defined as the sequence y_0 ... y_m where y_i = 1 if the state reached (starting at q) after applying the prefix x_1 ... x_i is an accepting state, and 0 otherwise. (Note that y_0 is 1 or 0 according as q is accepting or not.) We denote this output by q<x>. A string x in Sigma* is a distinguishing string for q_i and q_j if q_i<x> != q_j<x>. Finally, define d(q_i, q_j) as the length of the shortest distinguishing string for the states q_i and q_j. We are now ready to present our results about the expected performance of ParallelMin1. In the following, lg will denote logarithm to base 2.

Lemma 1. For uniformly almost all automata with n states, every pair of states has a distinguishing string of length at most 2 lg(n^2/eps) / lg(1/(1 - 2p(1 - p))).

Our next result presents the average time complexity (in the above sense) of an EREW (exclusive-read, exclusive-write) PRAM implementation of the parallel algorithm presented above. We will assume that the readers are familiar with the PRAM model. See [8] for a systematic treatment of PRAM algorithms.

Theorem 2. For uniformly almost all automata with n states, the algorithm ParallelMin1 can be implemented on an EREW PRAM with q <= n processors with time complexity O(k (n lg n lg(n^2/eps)) / (q lg(1/(1 - 2p(1 - p)))) + lg q lg n). The space complexity (= amount of shared memory used) is O(n). Thus the total work done by the algorithm is O(n log^2 n) if p and k are constants.

The next result explains why Hopcroft's algorithm does not parallelize well.

Theorem 3. Let Q' be the set of states of a DFA after minimization, and let i be the iteration number of Hopcroft's algorithm; then i <= |Sigma|(|Q'| - 1).

4. Implementation as Data Parallel Program

Data-parallelism is the parallelism achievable through the simultaneous execution of the same operation across a set of data [4].
Since each operation is applied simultaneously to the data, a data-parallel program has only one program counter and is characterized by the absence of the multi-threaded execution that makes the design, implementation and testing of programs very difficult. Although data parallel programming was originally developed for SIMD (or vector) hardware, it has become a successful paradigm in all types of parallel processing systems such as
shared-memory multiprocessors, distributed memory machines, and networks of workstations. Since our parallel algorithm ParallelMin1 has a simple control structure and uses a uniform data structure (array), it is easy to implement it as a data parallel program. We present our data parallel version of the algorithm in pseudo code similar to C*, a data parallel extension of the C programming language. We assume in our implementation that the computation begins with the input DFA already stored as parallel variables.

The data structures used are as follows. delta is an array of |Sigma| parallel variables, each a parallel variable of size n. The i-th element of delta[j], written [i]delta[j], stores the transition function of state i under input symbol j (i.e., the next state of state i under input symbol j, delta(i, j)), 0 <= i <= n-1, 0 <= j <= |Sigma|-1. block is a parallel variable to store partition numbers for all states; [i]block is the block number of state i. In ParallelMin2 below, statements like [parallel_index]parallel_var1 = parallel_var2 are send operations, which send values of parallel variable (parallel_var2) elements to other parallel variable (parallel_var1) elements according to the index mapping parallel_index. Statements like parallel_var1 = [parallel_index]parallel_var2 are get operations, which get values of parallel variable (parallel_var1) elements from other parallel variable (parallel_var2) elements.
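The send and get operations above can be pictured with ordinary arrays. The sketch below (plain Python; the names block, delta_a and perm are illustrative) emulates both with index mappings, the way [delta[k]]block gathers each state's successor block number.

```python
# Emulating C*-style get/send operations with plain Python lists.
n = 5
block = [0, 0, 1, 1, 0]    # [i]block: block number of state i
delta_a = [1, 2, 3, 4, 0]  # [i]delta[a]: successor of state i on symbol a

# get:  output1 = [delta[a]]block
# each state fetches the block number of its successor
output1 = [block[delta_a[i]] for i in range(n)]

# send: [perm]moved = block
# each state deposits its value at the position given by perm
perm = [4, 3, 2, 1, 0]
moved = [None] * n
for i in range(n):
    moved[perm[i]] = block[i]
```

Here output1 ends up as [0, 1, 1, 0, 0]: state 0's successor is state 1 (block 0), state 1's successor is state 2 (block 1), and so on.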
Algorithm ParallelMin2
input: delta, array of parallel variables to store the transition function; block, initial block numbers of the DFA;
output: block, (updated) block numbers denoting the state equivalence classes of the minimized DFA;
begin
  q = 2;
  do
    check = 0;
    for k = 0 to |Sigma|-1 do
      Get block numbers:
        stateindex = coord(0);
        {stateindex is used to keep track of the movement of state indices}
        output[0] = block;
        output[1] = [delta[k]]block;
      Sort according to outputs:
        sorted = rank(output[0]);
        [sorted]stateindex = stateindex;
        [sorted]output[0] = output[0];
        [sorted]output[1] = output[1];
        sorted = rank(output[1]);
        [sorted]stateindex = stateindex;
        [sorted]output[0] = output[0];
        [sorted]output[1] = output[1];
      Set new block names:
        link = 0;
        where (output[0] != [.-1]output[0] or output[1] != [.-1]output[1])
          link = 1;
        newblock = prefixsum(link);
      Check if new blocks produced:
        if (q equal to [n-1]newblock)
          check++; continue; {consistent for input k}
        else
          [stateindex]block = newblock;
          q = [n-1]newblock;
    end of for;
  until (check equal to |Sigma|); {consistent for all inputs}
end.

In our implementation, sorting is done by finding ranks (a system library function) of elements and three get operations to re-distribute them. The parallel prefix computation is performed by a library function as well. The performance characteristics of ParallelMin2 are discussed in section 6.

5. Implementation as Message Passing Program

Now we present the message passing version of the basic parallel algorithm. It is a little more complicated than the data parallel version because we have to control communication between processors explicitly. Our approach for storing the input DFA is to divide the transition function and the final state information
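Stripped of the parallel machinery, one pass of the refinement step (label each state by its (block, successor block) pair, sort, renumber) can be written sequentially as below. This is a sketch of the idea with illustrative names, not the C* implementation; the sort and the label-change counter stand in for the rank and where/prefixsum operations.

```python
def refine_once(n, delta, alphabet, block):
    """One pass over all symbols of the refinement step used by
    ParallelMin1/ParallelMin2, written sequentially.  delta maps
    (state, symbol) -> state; block[q] is q's current block number.
    Illustrative sketch, not the paper's C* code."""
    for x in alphabet:
        # Label each state with (own block, successor's block).
        labels = [(block[q], block[delta[(q, x)]]) for q in range(n)]
        # The parallel sort; here an ordinary sort of state indices.
        order = sorted(range(n), key=lambda q: labels[q])
        # New block numbers: bump the counter whenever the label
        # changes (the role of the where/prefixsum steps).
        new_block = [0] * n
        b = 0
        for i, q in enumerate(order):
            if i > 0 and labels[q] != labels[order[i - 1]]:
                b += 1
            new_block[q] = b
        block = new_block
    return block

# Four states over {a, b}; states 1 and 2 are equivalent.
delta = {(0, 'a'): 1, (0, 'b'): 2, (1, 'a'): 3, (1, 'b'): 3,
         (2, 'a'): 3, (2, 'b'): 3, (3, 'a'): 3, (3, 'b'): 3}
block = refine_once(4, delta, 'ab', [0, 0, 0, 1])  # initial partition {Q\F, F}
```

Iterating this pass until the number of blocks stops changing yields the coarsest partition; the number of such outer iterations is the quantity observed above to be small in practice.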
equally among the different processors. Storage of the intermediate results will also be distributed among the processors. This is a standard approach used in distributed memory parallel algorithms. For simplicity, we assume that p, the number of processors, divides n, the number of states in the input DFA. Processors will be identified by integers from 0 to p-1. A detailed description of the various steps of this algorithm can be found in the full paper.

Algorithm ParallelMin3 (for the processor myadd)
input: delta[0 .. n/p - 1], array to store this processor's part of the transition function; block[0 .. n/p - 1], initial block numbers of the states stored in this processor.
output: (updated) block[0 .. n/p - 1]
begin
  done = false;
  Initialization of indexedbuffer:
    indexedbuffer[0, 0 .. n/p - 1] = myadd*(n/p) .. (myadd+1)*(n/p) - 1. So the first row of indexedbuffer contains the state numbers associated with this processor.
    indexedbuffer[1, 0 .. n/p - 1] = block[0 .. n/p - 1], i.e., put the current block numbers of the states in this processor;
  Construct Message Passing Tables according to delta. These tables contain, for each state q in the processor, the address of the processor that contains delta(q, x_j) and of the processors which contain the state(s) delta^{-1}(q, x_j), for each j.
  while (not done)
    done = true;
    for k = 0 to |Sigma|-1 do
      Exchange outputs: All-to-all communication takes place as follows: Suppose a state s_t belongs to a processor t and suppose delta(s_t, x_j) belongs to myadd. The processor will send a message to t containing the block number of delta(s_t, x_j). The received messages will be put in the third row, namely indexedbuffer[2, 0 .. n/p - 1].
      Parallel-sort: Sort indexedbuffer using the second and third rows together as the key. Note that after sorting, indexedbuffer's first row (state indices) may not be ordered and may not belong to this node. That is, the sorting will re-distribute data items.
      Rename blocks: If two states have the same 'key', they are put in the same block in the refined partition. Put new block numbers in indexedbuffer's second row. This is a global scanning operation on indexedbuffer of all processors.
      Parallel-sort: Sort indexedbuffer using the first row as the key. This will bring the new block number of each state back to the processor to which it belongs.
      if (a new block was produced) done = false;
    end of for;
  end of while;
  block[0 .. n/p - 1] = indexedbuffer[1, 0 .. n/p - 1];
end.

6. Experimental Results on CM-5

Next, we present our experimental results on CM-5. We ran the programs described above for many DFA's of different sizes on a CM-5 with 32 nodes. Each node has 4 vector units. The instances of DFA's used in
this experiment were constructed using four different methods. All of them have two input symbols.

[Figure 1. Running Time vs. Problem Size - time in seconds vs. size of DFAs. Figure 2. Scalability - time in seconds vs. number of processors. Each figure plots the message passing (x) and data parallel (o) programs.]

In Figure 1, the number of processors is fixed at 32. When the input size is less than about 100000, the time complexity is almost linear, so that it takes about twice as much time to solve an instance twice as large. But when the input size exceeds 100000, nonlinearity shows up, especially for the message passing program. To study the scalability of our programs, we chose three DFA's (with 109078, 262144 and 524288 states) and minimized them by using both programs on a CM-5 configured with 32, 64, 128 and 256 nodes. Figure 2 shows the results of average time. Both programs seem to scale well, but the message passing program is the better one.

7. Acknowledgments

Our programs were implemented and tested on a connection machine CM-5 at the National Center for Supercomputer Applications at Urbana-Champaign. We thank A. Tridgell of Australian National University for his help with the sorting routine used in the message passing program.

References

[1] A. Aho, J. Hopcroft, and J. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Co., Reading, MA, 1974.
[2] S. Cho and D. Huynh. The parallel complexity of coarsest set partition problems. Information Processing Letters, 42:89-94, 1992.
[3] Y. Freund et al. Efficient learning of typical finite automata from random walks. In 25th ACM Symposium on Theory of Computing, pages 315-324, 1993.
[4] P. Hatcher and M. Quinn. Data-Parallel Programming on MIMD Computers. MIT Press, Cambridge, Mass., 1991.
[5] J. Hopcroft. An n log n algorithm for minimizing states in a finite automaton. In Theory of Machines and Computations, pages 189-196. Academic Press, New York, 1971.
[6] J. Hopcroft and J. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley Publishing Co., Reading, Mass., 1979.
[7] D. Huffman. The synthesis of sequential switching circuits. Journal of the Franklin Institute, 257(3):161-190, 1954.
[8] J. JáJá. An Introduction to Parallel Algorithms. Addison-Wesley Publishing Co., Reading, Mass., 1992.
[9] J. JáJá and S. Kosaraju. Parallel algorithms for planar graph isomorphism and related problems. IEEE Transactions on Circuits and Systems, 35(3), March 1988.
[10] J. JáJá and K. Ryu. An efficient parallel algorithm for the single function coarsest partition problem. In ACM Symposium on Parallel Algorithms and Architectures, 1993.
[11] E. Moore. Gedanken-experiments on sequential machines. In C. Shannon and J. McCarthy, editors, Automata Studies, pages 129-153. Princeton University Press, Princeton, NJ, 1956.
[12] Y. Srikant. A parallel algorithm for the minimization of finite state automata. Intern. J. Computer Math., 32:1-11, 1990.
[13] B. Trakhtenbrot and Y. Barzdin'. Finite Automata: Behavior and Synthesis. North-Holland Publishing Company, 1973.
[14] B. Watson. A taxonomy of finite automaton minimization algorithms. Technical report, Faculty of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands, 1994.