Subsequence Automata with Default Transitions

Philip Bille, Inge Li Gørtz, and Frederik Rye Skjoldjensen
Technical University of Denmark
{phbi,inge,fskj}@dtu.dk

Abstract. Let S be a string of length n with characters from an alphabet of size σ. The subsequence automaton (often called the directed acyclic subsequence graph) is the minimal deterministic finite automaton accepting all subsequences of S. A straightforward construction shows that the size (number of states and transitions) of the subsequence automaton is $O(n\sigma)$ and that this bound is asymptotically optimal. In this paper, we consider subsequence automata with default transitions, that is, special transitions to be taken only if none of the regular transitions match the current character, and which do not consume the current character. We show that with default transitions, much smaller subsequence automata are possible, and provide a full trade-off between the size of the automaton and the delay, i.e., the maximum number of consecutive default transitions followed before consuming a character. Specifically, given any integer parameter k, $1 < k \le \sigma$, we present a subsequence automaton with default transitions of size $O(nk \log_k \sigma)$ and delay $O(\log_k \sigma)$. Hence, with k = 2 we obtain an automaton of size $O(n \log \sigma)$ and delay $O(\log \sigma)$. On the other extreme, with k = σ, we obtain an automaton of size $O(n\sigma)$ and delay $O(1)$, thus matching the bound for the standard subsequence automaton construction. The key component of our result is a novel hierarchical automata construction of independent interest.

1 Introduction

Let S be a string of length n with characters from an alphabet of size σ. A subsequence of S is any string obtained by deleting zero or more characters from S. The subsequence automaton (often called the directed acyclic subsequence graph) is the minimal deterministic finite automaton accepting all subsequences of S. Baeza-Yates [1] initiated the study of subsequence automata. He presented a simple construction using $O(n\sigma)$ size (size denotes the total number of states and transitions) and showed that this bound is optimal in the sense that there are subsequence automata of size at least $\Omega(n\sigma)$. He also considered variations with encoded input strings and multiple strings.
Subsequently, several researchers have further studied subsequence automata (and their variants) [2–9]. See also the surveys by Troníček and others [10, 11]. The general problem of subsequence indexing, not limited to automata-based solutions, is investigated by Bille et al. [12].

In this paper, we consider subsequence automata in the context of default transitions, that is, special transitions to be taken only if none of the regular transitions match the current character, and which do not consume the current character. Each state has at most one default transition and hence the automaton remains deterministic. The key point of default transitions is to reduce the size of standard automata at the cost of introducing a delay, i.e., the maximum number of consecutive default transitions followed before consuming a character. For instance, given a pattern string of length m, the classic Knuth-Morris-Pratt (KMP) [13] string matching algorithm may be viewed as an automaton with default transitions (typically referred to as failure transitions). This automaton has size $O(m)$, whereas the standard automaton with no default transitions would need $\Theta(m\sigma)$ space. The delay of the automaton in the KMP algorithm is either $O(m)$ or $O(\log m)$ depending on the version. Similarly, the Aho-Corasick string matching algorithm for multiple strings may also be viewed as an automaton with default transitions [14]. More recently, default transitions have also been used extensively to significantly reduce the sizes of deterministic automata for regular expressions [15, 16]. The main idea is to enable states with large overlapping identical sets of outgoing transitions to share outgoing transitions using default transitions.

Surprisingly, no non-trivial bounds for subsequence automata with default transitions are known. Naively, we can immediately obtain an $O(n\sigma)$ size solution with $O(1)$ delay by using the standard subsequence automaton (without default transitions). At the other extreme, we can build an automaton with n + 1 states (each corresponding to a prefix of S) with a standard and a default transition from the state corresponding to the $i$th prefix to the state corresponding to the $(i+1)$st prefix (the standard transition is labeled $S[i+1]$). It is straightforward to show that this leads to an $O(n)$ size solution with $O(n)$ delay. Our main result is a substantially improved trade-off between the size and delay of the subsequence automaton:

Theorem 1. Let S be a string of n characters from an alphabet of size σ.
For any integer parameter k, $1 < k \le \sigma$, we can construct a subsequence automaton with default transitions of size $O(nk \log_k \sigma)$ and delay $O(\log_k \sigma)$.

Hence, with k = 2 we obtain an automaton of size $O(n \log \sigma)$ and delay $O(\log \sigma)$. On the other extreme, with k = σ, we obtain an automaton of size $O(n\sigma)$ and delay $O(1)$, thus matching the bound for the standard subsequence automaton construction.

To obtain our result, we first introduce the level automaton. Intuitively, this automaton uses the same states as the standard solution, but orders them hierarchically in a tree-like structure, samples a selection of their original transitions based on their position in the tree, and adds a default transition to the next state on a higher level. We show how to do this efficiently, leading to a solution with $O(n \log n)$ size and $O(\log n)$ delay. To achieve the full trade-off from Theorem 1, we show how to augment the construction with additional tricks for small alphabets and generalize the level automaton with a parameter k,
$1 < k \le \sigma$, where a large k reduces the height of the tree but increases the number of transitions.

2 Preliminaries

A deterministic finite automaton (DFA) is a tuple $A = (Q, \Sigma, \delta, q_0, F)$ where Q is a set of nodes called states, δ is a set of labeled directed edges between states called transitions where each label is a character from the alphabet Σ, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is a set of accepting states. No two outgoing transitions from the same state have the same label. The size of A is the sum of the number of states and transitions. We can think of A as an edge-labeled directed graph. Given a string P and a path p in A, we say that p and P match if the concatenation of the labels on the transitions in p is P. We say that A accepts a string P if there is a path in A, from $q_0$ to any state in F, that matches P. Otherwise A rejects P.

A deterministic finite automaton with default transitions is a deterministic finite automaton $A_D$ where each state can have a single unlabeled default transition. Given a string P and a path p in $A_D$, we define a match between P and p as before, with the exception that for any default transition d in p the corresponding character in P cannot match any standard transition out of the start state of d. The definitions of accepted and rejected strings are as before. The delay of $A_D$ is the maximum length of any path that only uses default transitions.

A subsequence of S is a string P obtained by removing zero or more characters from S. A subsequence automaton constructed from S, denoted SA, is a deterministic finite automaton that accepts a string P iff P is a subsequence of S. The SA is often called the directed acyclic subsequence graph or DASG. The SA has n + 1 states, all accepting, that we identify with the integers {0, 1, ..., n}. For each state s, $0 \le s \le n$, we have the following transitions: for each unique character α in $S[s+1, n]$, there is a transition labeled α to the smallest state $s' > s$ such that $S[s'] = \alpha$. The SA has size $O(n\sigma)$ since every state can have at most σ transitions. An example of an SA is given in Fig. 1.

A subsequence automaton with default transitions constructed from S, denoted SAD, is a deterministic finite automaton with default transitions that accepts a string P iff P is a subsequence of S.
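
The SA definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the construction from [1]: the function names `build_sa`, `sa_accepts` and `sa_size` are hypothetical, states are the integers 0..n, and Python's 0-based `S[t - 1]` stands for the paper's $S[t]$.

```python
def build_sa(S):
    """Build the standard subsequence automaton of S (no default transitions).

    trans[s] maps each character occurring in S[s+1, n] (1-based positions)
    to the smallest state t > s with S[t] equal to that character.
    """
    n = len(S)
    trans = [dict() for _ in range(n + 1)]
    nxt = {}  # nearest occurrence to the right of the current state
    for s in range(n, -1, -1):
        trans[s] = dict(nxt)
        if s > 0:
            nxt[S[s - 1]] = s  # 1-based position s holds S[s - 1]
    return trans


def sa_accepts(trans, P):
    """Every state is accepting, so P is rejected only if a transition is missing."""
    s = 0
    for ch in P:
        if ch not in trans[s]:
            return False
        s = trans[s][ch]
    return True


def sa_size(trans):
    """Size = number of states plus number of transitions."""
    return len(trans) + sum(len(t) for t in trans)
```

For example, `sa_accepts(build_sa("abcab"), "acb")` follows the states 0, 1, 3, 5. Each state stores up to σ transitions, which is where the $O(n\sigma)$ size comes from.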
The next section explores different configurations of transitions and default transitions in SADs.

3 New Trade-Offs for Subsequence Automata

We now present a new trade-off for subsequence automata with default transitions. We will gradually refine our construction until we obtain an automaton that gives the result presented in Theorem 1. In each construction we have n + 1
states that we identify with the integers {0, 1, ..., n}. Each of these states represents a prefix of the string S, and all of them are accepting.

Fig. 1. An example of an SA constructed from a string.

We first present the level automaton, which gives the first non-trivial trade-off that exploits default transitions. The general idea is to construct a hierarchy of states such that every path that only uses default transitions is guaranteed to go through states where the outdegree increases at least exponentially. The level automaton is an SAD of size $O(n \log n)$ and delay $O(\log n)$. By arguing that any path that passes through a state with outdegree σ continues from it with a regular transition, we are able to improve both the size and delay of the level automaton. This results in the alphabet aware level automaton, which is an SAD of size $O(n \log \sigma)$ and delay $O(\log \sigma)$. Finally, we present a generalized construction that gives a trade-off between size and delay by letting a parameter k, $1 < k \le \sigma$, be the base of the exponential increase in outdegree on paths with only default transitions. This SAD has size $O(nk \log_k \sigma)$ and delay $O(\log_k \sigma)$. With k = 2 we get an automaton of size $O(n \log \sigma)$ and delay $O(\log \sigma)$. In the other extreme, for k = σ we get an automaton of size $O(n\sigma)$ and delay $O(1)$.

3.1 Level Automaton

The level automaton is an SAD with n + 1 states that we identify with the integers {0, 1, ..., n}. All states are accepting. For each state i > 0, we associate a level, $\mathrm{level}(i)$, given by:

$$\mathrm{level}(i) = \max(\{x \mid i \bmod 2^x = 0\})$$

Hence, $\mathrm{level}(i)$ is the exponent of the largest power of two that divides i. We do not associate any level with state 0. For a state s, we define $s'$ to be the smallest state such that $s' > s$ and $\mathrm{level}(s') \ge \mathrm{level}(s) + 1$.

The configuration of the transitions in the level automaton is as follows: from state 0 we have a default transition to state 1 and a regular transition to state 1 with label $S[1]$. For every other state s, $1 \le s \le n$, we have the following transitions.

- A default transition to state $s'$. If no such state exists, the state s does not have a default transition.
- For each unique character α in $S[s+1, \min(s', n)]$, there is a transition labeled α to the smallest state $s'' > s$ such that $S[s''] = \alpha$.

An example of the level automaton, constructed from a string over the alphabet {a, b, c, d}, is given in Fig. 2. The dashed arrows denote default transitions and the vertical position of the states denotes their level.

Fig. 2. The level automaton constructed from a string over the alphabet {a, b, c, d}.

We first show that the level automaton is an SAD for S, i.e., the level automaton accepts a string iff the string is a subsequence of S. To do so, suppose that P is a string of length m accepted by the level automaton and let $s_1, s_2, \ldots, s_m$ be the sequence of states visited with regular transitions on the path that accepts P. From the definition of the transitions, we know that if a transition with label α leads to state $s''$, then $S[s''] = \alpha$. This means that $S[s_1]S[s_2] \cdots S[s_m]$ spells out a subsequence of S if the sequence $s_1, s_2, \ldots, s_m$ is strictly monotonically increasing. From the definition of the transitions, a state s only has transitions (regular or default) to states $s''$ with $s'' > s$. Hence, the sequence is strictly monotonically increasing.

For the other direction, let P be a subsequence of S. Let $S[1, s_m]$ be the minimal prefix of S that contains P as a subsequence and let $s_1, s_2, \ldots, s_m$ be the leftmost strictly monotonically increasing sequence of states such that $S[s_1]S[s_2] \cdots S[s_m] = P$, that is, each $s_{i+1}$ is the smallest position after $s_i$ holding the next character of P. Assume for contradiction that P is not accepted by the level automaton. Let $P_p$ be the largest prefix of P that is accepted by the level automaton such that the sequence of states visited with regular transitions on the matching path is a prefix of $s_1, s_2, \ldots, s_m$. Let $s_i$ be the last state of that prefix and let $\alpha = S[s_{i+1}]$. Consider the path of states $s_i, d_1, \ldots, d_j$ obtained by repeatedly following default transitions from $s_i$. When the default transition of $s_i$ leads to $d_1$, we know that $s_i$ has a transition with label α if α occurs in $S[s_i + 1, d_1]$. Applying this argument to the rest of the states on the path, we know that one of the states $s_i, d_1, \ldots, d_j$ has a transition labeled α if α occurs in $S[s_i + 1, \min(d_j', n)]$, where $\min(d_j', n) = n$ because $d_j$ has no default transition. Since $S[s_{i+1}]S[s_{i+2}] \cdots S[s_m]$ is a subsequence of $S[s_i + 1, n]$, one of the states $s_i, d_1, \ldots, d_j$ has a transition to $s_{i+1}$ with label α. This contradicts how we initially selected $P_p$.
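
The transition rules of the level automaton and the way a string is matched against it can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, $s'$ is found by a direct scan straight from its definition (so the build is not optimized), and Python's 0-based `S[t - 1]` stands for the paper's $S[t]$.

```python
def level(i):
    """Exponent of the largest power of two dividing i (i > 0)."""
    return (i & -i).bit_length() - 1


def build_level_automaton(S):
    """Return (trans, default) for the level automaton of S; states are 0..n."""
    n = len(S)
    trans = [dict() for _ in range(n + 1)]
    default = [None] * (n + 1)
    if n >= 1:
        trans[0][S[0]] = 1  # regular transition 0 -> 1 labeled S[1]
        default[0] = 1      # default transition 0 -> 1
    for s in range(1, n + 1):
        # s' = smallest state above s on a strictly higher level.
        nxt = s + 1
        while nxt <= n and level(nxt) <= level(s):
            nxt += 1
        if nxt <= n:
            default[s] = nxt
        # Sample transitions for the characters in S[s+1, min(s', n)].
        for t in range(s + 1, min(nxt, n) + 1):
            trans[s].setdefault(S[t - 1], t)
    return trans, default


def accepts(trans, default, P):
    """Follow default transitions until a regular transition matches."""
    s = 0
    for ch in P:
        while ch not in trans[s]:
            if default[s] is None:
                return False
            s = default[s]
        s = trans[s][ch]
    return True  # every state is accepting


def delay(default):
    """Length of the longest path that uses only default transitions."""
    best = 0
    for s in range(len(default)):
        length, t = 0, s
        while default[t] is not None:
            t = default[t]
            length += 1
        best = max(best, length)
    return best


def is_subsequence(P, S):
    """Reference check, used only to compare against the automaton."""
    it = iter(S)
    return all(ch in it for ch in P)
```

Comparing `accepts` with `is_subsequence` on all short strings over the alphabet is a quick sanity check of the argument above. Scanning for $s'$ keeps the sketch close to the definition at the cost of construction time.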

For all s > 0, we want to show the following property of $s'$ and $\mathrm{level}(s)$:

$$s' - s = 2^{\mathrm{level}(s)} \qquad (1)$$

By definition $2^{\mathrm{level}(s)}$ divides s. Hence, the next integer larger than s that $2^{\mathrm{level}(s)}$ divides must be $s + 2^{\mathrm{level}(s)}$. But since $2^{\mathrm{level}(s)}$ was the largest power of two that divides s, we know that the decomposition of s into a sum of unique powers of two must contain $2^{\mathrm{level}(s)}$ and also that this power is the smallest. Hence, when we decompose $s + 2^{\mathrm{level}(s)}$ into unique powers of two, we know that the smallest power of two in this decomposition is at least $2^{\mathrm{level}(s)+1}$. The definition of $s'$ implies that $s' = s + 2^{\mathrm{level}(s)}$, which is exactly what we wanted to show.

Assume that the delay of the level automaton is due to a path p. We want to bound the length of p. A state s on p is the only state on p that can have transitions to the states $s+1, s+2, \ldots, s'$. Since $s' - s = 2^{\mathrm{level}(s)}$ and $\mathrm{level}(s') > \mathrm{level}(s)$, the levels strictly increase along p. The length of p is therefore bounded by $O(\log n)$, because a state s at level $O(\log n)$ can already have transitions to $s' - s = 2^{O(\log n)} = O(n)$ states.

At each level l we have $O(n/2^{l+1})$ states, because every $2^l$th state is divisible by $2^l$, but $2^l$ is the largest power of two dividing the state in only every second of these cases. Because $s' - s = 2^{\mathrm{level}(s)}$, each state at level l has at most $2^l$ outgoing transitions. Therefore, each level contributes size at most $n/2^{l+1} \cdot 2^l = O(n)$. Since the delay is $O(\log n)$, we have at most $O(\log n)$ levels, and the total size therefore becomes $O(n \log n)$. In summary, we have shown the following result.

Lemma 1. Let S be a string of n characters. We can construct a subsequence automaton with default transitions of size $O(n \log n)$ and delay $O(\log n)$.

3.2 Alphabet Aware Level Automaton

We introduce the alphabet aware level automaton. When the level automaton reaches a state s where $s' - s \ge \sigma$, then s can have up to σ outgoing transitions without violating the space analysis above. The level automaton only has a transition for each unique character in $S[s+1, \min(s', n)]$. Hence, for all states s in the alphabet aware level automaton where $s' - s \ge \sigma$, we let s have a transition for each symbol α of Σ that occurs in $S[s+1, n]$, to the smallest state $s'' > s$ such that $S[s''] = \alpha$.
No accepting path would ever take a default transition from a state with σ outgoing transitions. Hence, states with σ outgoing transitions do not need default transitions. We change the level function to reflect this. For each state $1 \le i \le n$ we have that:

$$\mathrm{level}(i) = \min(\lceil \log_2 \sigma \rceil, \max(\{x \mid i \bmod 2^x = 0\})) \qquad (2)$$

The configuration of the transitions in the alphabet aware level automaton is as follows: from state 0 we have a default transition to state 1 and a regular transition to state 1 with label $S[1]$. For every other state s, $1 \le s \le n$, we have the following transitions.

- A default transition to state $s'$. If no such state exists, the state s does not have a default transition.
- If $s' - s < \sigma$, then for each unique character α in $S[s+1, \min(s', n)]$, there is a transition labeled α to the smallest state $s'' > s$ such that $S[s''] = \alpha$.
- If $s' - s \ge \sigma$, then for each unique character α in $S[s+1, n]$, there is a transition labeled α to the smallest state $s'' > s$ such that $S[s''] = \alpha$.

An example of the alphabet aware level automaton, constructed from a string over the alphabet {a, b, c, d}, is given in Fig. 3. The level automaton in Fig. 2 is constructed from the same string and the same alphabet. For comparison, state 4 in Fig. 3 now has outdegree σ and has transitions to the first succeeding occurrence of every unique character, and state 8 has been constrained to level $\lceil \log_2 \sigma \rceil$.

Fig. 3. The alphabet aware level automaton constructed from the same string as in Fig. 2.

It is easy to show that the alphabet aware level automaton is an SAD by slightly modifying the arguments that led to Lemma 1. Now, it is not necessarily true that $\min(d_j', n) = n$ for all paths $s_i, d_1, \ldots, d_j$. So assume that $\min(d_j', n) < n$. But because $d_j$ has no default transition, we know instead that $d_j$ has a transition for every character occurring in $S[d_j + 1, n]$, and in particular one with label α if α occurs in $S[d_j + 1, n]$. From this we can conclude that one of the states $s_i, d_1, \ldots, d_j$ has a transition to $s_{i+1}$ with label α.

The delay is now bounded by $O(\log \sigma)$ because no state is assigned a level higher than $\lceil \log_2 \sigma \rceil$. The size is bounded by $O(n \log \sigma)$ because each level has size $O(n)$ and we have at most $O(\log \sigma)$ levels. In summary, we have shown the following result.

Lemma 2. Let S be a string of n characters. We can construct an SAD of S with size $O(n \log \sigma)$ and delay $O(\log \sigma)$.

3.3 Full Trade-Off

We can generalize the construction above by introducing a parameter k, $1 < k \le \sigma$, which is the base of the exponential increase in outdegree of states on every path
that only uses default transitions. Now, when we follow a default transition from s to $s'$, the number of outgoing transitions increases by a factor k instead of a factor 2. This gives a trade-off between size and delay in the SAD determined by k. Increasing k gives a shorter delay of the SAD but increases the size, and vice versa. Each state, except state 0, is still associated with a level, but we need to generalize the level function to account for the parameter k. For every k and $1 \le i \le n$ we have that:

$$\mathrm{level}(i, k) = \min(\lceil \log_k \sigma \rceil, \max(\{x \mid i \bmod k^x = 0\})) \qquad (3)$$

Now, the level function gives the exponent of the largest power of k that divides i, capped at $\lceil \log_k \sigma \rceil$. The configuration of the transitions in the generalized alphabet aware level automaton is identical to the configuration of the alphabet aware level automaton but is restated here for completeness. From state 0 we have a default transition to state 1 and a regular transition to state 1 with label $S[1]$. For every other state s, $1 \le s \le n$, we have the following transitions:

- A default transition to state $s'$. If no such state exists, the state s does not have a default transition.
- If $s' - s < \sigma$, then for each unique character α in $S[s+1, \min(s', n)]$, there is a transition labeled α to the smallest state $s'' > s$ such that $S[s''] = \alpha$.
- If $s' - s \ge \sigma$, then for each unique character α in $S[s+1, n]$, there is a transition labeled α to the smallest state $s'' > s$ such that $S[s''] = \alpha$.

We can show that the generalized alphabet aware level automaton is an SAD by the same arguments that led to Lemma 2. With the new definition of the level function we have that

$$s' - s = k^{\mathrm{level}(s,k)+1} - (s \bmod k^{\mathrm{level}(s,k)+1}) \qquad (4)$$

for all s > 0 for which $s'$ is defined. The modulo term gives the difference between s and the largest j < s such that $k^{\mathrm{level}(s,k)+1}$ divides j. We are interested in the difference between s and the smallest i > s such that $k^{\mathrm{level}(s,k)+1}$ divides i. But since the difference between j and i is $k^{\mathrm{level}(s,k)+1}$, we just subtract the modulo term from $k^{\mathrm{level}(s,k)+1}$ to get the difference between s and i. This expression bounds the number of outgoing transitions from state s.

The delay is bounded by $O(\log_k \sigma)$ because we increase the number of outgoing transitions by a factor k each time we follow a default transition and no state is assigned a level higher than $\lceil \log_k \sigma \rceil$.
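
As a small worked instance of Eqs. (3) and (4), consider the illustrative values k = 3 and σ = 9, which are chosen here only as an example and do not come from the paper's figures. The cap on the level is then $\log_3 9 = 2$.

```latex
% State s = 4: no power of 3 above 3^0 divides 4.
\mathrm{level}(4, 3) = \min(2, 0) = 0, \qquad
s' - s = 3^{1} - (4 \bmod 3^{1}) = 3 - 1 = 2, \qquad s' = 6.

% State s = 6: 3^1 divides 6 but 3^2 does not.
\mathrm{level}(6, 3) = \min(2, 1) = 1, \qquad
s' - s = 3^{2} - (6 \bmod 3^{2}) = 9 - 6 = 3, \qquad s' = 9.

% State s = 9 sits at the capped level 2, so it has no default transition.
\mathrm{level}(9, 3) = \min(2, 2) = 2.
```

The default path 4, 6, 9 thus climbs one level per step and stops at the capped level, matching the $O(\log_k \sigma)$ delay bound.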

Each state has at most $s' - s = k^{\mathrm{level}(s,k)+1} - (s \bmod k^{\mathrm{level}(s,k)+1}) \le k^{\mathrm{level}(s,k)+1}$ outgoing transitions. At level l we have $O(n(k-1)/k^{l+1})$ states, each with $O(k^{l+1})$ outgoing transitions, such that each level has size $O(nk)$. The size of the automaton becomes $O(nk \log_k \sigma)$ because we have $O(\log_k \sigma)$ levels. In summary, we have shown Theorem 1.
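
The generalized construction can be sketched in the same style to experiment with the trade-off in k. This is a hedged sketch, not the authors' code: the function names are hypothetical, σ is passed in explicitly as `sigma`, the level cap is assumed to be $\lceil \log_k \sigma \rceil$, and a state whose $s'$ falls beyond n is treated like a state without default transition (so it samples all of $S[s+1, n]$).

```python
def ceil_log(k, sigma):
    """Smallest L with k**L >= sigma (assumed cap on the level)."""
    L, p = 0, 1
    while p < sigma:
        p *= k
        L += 1
    return L


def level_k(i, k, cap):
    """min(cap, largest x such that k**x divides i), as in Eq. (3)."""
    x = 0
    while x < cap and i % k ** (x + 1) == 0:
        x += 1
    return x


def build_generalized(S, k, sigma):
    """Return (trans, default) for the generalized alphabet aware level automaton."""
    if not 1 < k <= sigma:
        raise ValueError("k must satisfy 1 < k <= sigma")
    n, cap = len(S), ceil_log(k, sigma)
    trans = [dict() for _ in range(n + 1)]
    default = [None] * (n + 1)
    if n >= 1:
        trans[0][S[0]] = 1
        default[0] = 1
    for s in range(1, n + 1):
        l = level_k(s, k, cap)
        hi = n  # states without a default transition see the whole suffix
        if l < cap:
            step = k ** (l + 1)
            nxt = s + step - s % step  # Eq. (4)
            if nxt <= n:
                default[s] = nxt
                if nxt - s < sigma:
                    hi = nxt
        for t in range(s + 1, hi + 1):
            trans[s].setdefault(S[t - 1], t)
    return trans, default


def accepts(trans, default, P):
    """Follow default transitions until a regular transition matches."""
    s = 0
    for ch in P:
        while ch not in trans[s]:
            if default[s] is None:
                return False
            s = default[s]
        s = trans[s][ch]
    return True


def delay(default):
    """Length of the longest path that uses only default transitions."""
    best = 0
    for s in range(len(default)):
        length, t = 0, s
        while default[t] is not None:
            t = default[t]
            length += 1
        best = max(best, length)
    return best


def size(trans, default):
    """States plus regular transitions plus default transitions."""
    return len(trans) + sum(len(t) for t in trans) + sum(d is not None for d in default)


def is_subsequence(P, S):
    """Reference check, used only to compare against the automaton."""
    it = iter(S)
    return all(ch in it for ch in P)
```

Building the automaton for the same string with k = 2 and with k = σ, and comparing `size` and `delay`, illustrates the two extremes discussed after Theorem 1.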

References

1. Baeza-Yates, R.A.: Searching subsequences. Theoret. Comput. Sci. 78(2) (1991) 363–376
2. Troníček, Z., Shinohara, A.: The size of subsequence automaton. Theoret. Comput. Sci. 341(1) (2005) 379–384
3. Crochemore, M., Melichar, B., Troníček, Z.: Directed acyclic subsequence graph: Overview. J. Disc. Algorithms 1(3-4) (2003) 255–280
4. Crochemore, M., Troníček, Z.: Directed acyclic subsequence graph for multiple texts. Technical Report, Institut Gaspard-Monge (1999) 99-13
5. Troníček, Z.: Episode matching. In: Proc. 12th CPM. (2001) 143–146
6. Hoshino, H., Shinohara, A., Takeda, M., Arikawa, S.: Online construction of subsequence automata for multiple texts. In: Proc. 7th SPIRE. (2000) 146–152
7. Farhana, E., Ferdous, J., Moosa, T., Rahman, M.S.: Finite automata based algorithms for the generalized constrained longest common subsequence problems. In: Proc. 17th SPIRE. (2010) 243–249
8. Bannai, H., Inenaga, S., Shinohara, A., Takeda, M.: Inferring strings from graphs and arrays. In: Proc. 28th MFCS. (2003) 208–217
9. Troníček, Z.: Operations on DASG. In: Proc. 4th WIA. (1999) 82–91
10. Troníček, Z.: Searching subsequences. Ph.D. Thesis, Department of Computer Science and Engineering, FEE CTU in Prague (2001)
11. Troníček, Z.: Common subsequence automaton. In: Proc. 8th CIAA. (2003) 270–275
12. Bille, P., Farach-Colton, M.: Fast and compact regular expression matching. Theoret. Comput. Sci. 409 (2008) 486–496
13. Knuth, D.E., Morris Jr., J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2) (1977) 323–350
14. Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6) (1975) 333–340
15. Kumar, S., Dharmapurikar, S., Yu, F., Crowley, P., Turner, J.: Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In: Proc. 12th SIGCOMM. (2006) 339–350
16. Hayes, C.L., Luo, Y.: Dpico: a high speed deep packet inspection engine using compact finite automata. In: Proc. 3rd ANCS. (2007) 195–203