The size of subsequence automaton

Theoretical Computer Science 341 (2005) 379–384
www.elsevier.com/locate/tcs

Note

Zdeněk Troníček a,*, Ayumi Shinohara b,c

a Department of Computer Science and Engineering, FEE CTU in Prague, Czech Republic
b Department of Informatics, Kyushu University, Fukuoka 812-8581, Japan
c PRESTO, Japan Science and Technology Corporation (JST), Japan

* Corresponding author. E-mail addresses: tronicek@fel.cvut.cz (Z. Troníček), ayumi@i.kyushu-u.ac.jp (A. Shinohara).

Received March 2004; received in revised form February 2005; accepted March 2005
Communicated by M. Crochemore

© 2005 Elsevier B.V. All rights reserved. doi:0.06/j.tcs.005.0.07

Abstract

Given a set of strings, the subsequence automaton accepts all subsequences of these strings. We derive a lower bound for the maximum number of states of this automaton. We prove that the size of the subsequence automaton for a set of $k$ strings of length $n$ is $\Omega(n^k/((k+1)^k\,k!))$ for any $k \ge 1$. This solves an open problem because only the case $k \le 2$ was shown before.

Keywords: Searching subsequences; Directed acyclic subsequence graph; Subsequence automaton

1. Introduction

A subsequence of a string $T$ is any string obtainable by deleting zero or more symbols from $T$. Given a set $P$ of strings, a common subsequence of $P$ is a string that is a subsequence of every string in $P$. Motivation for the study of subsequences comes from many domains, e.g. from molecular biology, signal processing, coding theory, and artificial intelligence. An example of a problem with great practical impact is the longest common subsequence (LCS) problem. The problem is defined as follows: given a set $P$ of strings, we are to find a common subsequence of $P$ that has maximal length among all common subsequences of $P$. The decision version can be, for example, to decide whether a given string is a common subsequence of $P$.

Another problem, which comes from artificial intelligence, is the problem of separating two sets of strings: given two sets, $P$ (positive) and $N$ (negative), of strings, we are to find a string that best separates them. A string $S$ separates sets $P$ and $N$ if $S$ is a common subsequence of $P$ and simultaneously is not a subsequence of any string in $N$. The decision version is defined as follows: given two sets $P$ and $N$ of strings and a string $S$, we are to decide whether $S$ separates $P$ and $N$. If the problem is supposed to be answered for several strings $S$, then it is sensible to preprocess the sets $P$ and $N$. We can build an automaton accepting all common subsequences of $P$ and an automaton that accepts any string that is a subsequence of at least one string in $N$. With these automata we can decide the problem in time linear in the length of $S$.

Both automata were studied and described. The latter one is known as the Directed Acyclic Subsequence Graph (DASG) and three building algorithms are available: right-to-left [1], left-to-right [2], and on-line [3]. The former one is called the Common Subsequence Automaton (CSA) and can be built by an off-line or an on-line algorithm, both of which are modifications of algorithms for building the DASG.

In this paper, we investigate the number of states of the CSA. The language accepted by the CSA is a subset of the language accepted by the DASG for the same strings, and the automata are very similar. If we use either the off-line or the on-line algorithm, the set of states of the CSA is a subset of the states of the DASG. The only previous result is, to our knowledge, the lower bound for the maximum number of states of the DASG for two strings proved in [2]. We will prove the lower bound for any (fixed) number of strings.

The paper is organized as follows. In Section 2 we recall the definition of the CSA and in Section 3 we examine the asymptotic behavior of the number of states of the CSA in the worst case.

Let $\Sigma$ be a finite alphabet of size $\sigma$ and $\varepsilon$ be the empty word.
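A finite automaton is, in this paper, a 5-tuple $(Q, \Sigma, \delta, q_0, F)$, where $Q$ is a finite set of states, $\Sigma$ is an input alphabet, $\delta : Q \times \Sigma \to Q$ is a transition function, $q_0$ is the initial state, and $F \subseteq Q$ is the set of final states. Let $\delta^*$ be the reflexive-transitive closure of $\delta$, i.e. $\delta^*(q, \varepsilon) = q$, $\delta^*(q, a) = \delta(q, a)$, $\delta^*(q, a_1 a_2 \ldots a_l) = \delta^*(\delta(q, a_1), a_2 \ldots a_l)$, where $q \in Q$, $a \in \Sigma$ and $a_1, \ldots, a_l \in \Sigma$. Notation $\langle i, j \rangle$ means the interval of integers from $i$ to $j$, including both $i$ and $j$. All strings in this paper are considered over alphabet $\Sigma$, if not stated otherwise.

2. Definition of CSA

Let $P$ denote the set of strings $T_1, T_2, \ldots, T_k$. Let $n_i$ be the length of $T_i$ and $T_i[j]$ be the $j$th symbol of $T_i$ for all $j \in \langle 1, n_i \rangle$ and all $i \in \langle 1, k \rangle$. Given $T = t_1 t_2 \ldots t_n$ and $i, j \in \langle 1, n \rangle$, $i \le j$, notation $T[i \ldots j]$ means the string $t_i t_{i+1} \ldots t_j$.

Definition 1. We define a position point of the set $P$ as an ordered $k$-tuple $[p_1, p_2, \ldots, p_k]$, where $p_i \in \langle 0, n_i \rangle$ is a position in string $T_i$. If $p_i \in \langle 0, n_i - 1 \rangle$ then it denotes the position in front of the $(p_i + 1)$th symbol of $T_i$, and if $p_i = n_i$ then it denotes the position behind the last symbol of $T_i$ for all $i \in \langle 1, k \rangle$.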
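Definition 1 can be illustrated with a minimal Python sketch. The function names `position_points` and `suffixes_at` are hypothetical helpers introduced only for this illustration; they are not part of the paper. The sketch assumes strings are given as Python `str` values, so the 1-based suffix $T_i[p_i+1 \ldots n_i]$ is the 0-based slice `t[p:]`.

```python
from itertools import product
from typing import Iterator, List, Tuple

Point = Tuple[int, ...]  # a position point [p_1, ..., p_k]


def position_points(strings: List[str]) -> Iterator[Point]:
    """Enumerate all k-tuples with p_i in <0, n_i> (hypothetical helper)."""
    return product(*(range(len(t) + 1) for t in strings))


def suffixes_at(strings: List[str], point: Point) -> List[str]:
    """Return T_i[p_i+1 .. n_i] for each i, i.e. what lies behind position p_i."""
    for t, p in zip(strings, point):
        if not 0 <= p <= len(t):
            raise ValueError("p_i must lie in <0, n_i>")
    return [t[p:] for t, p in zip(strings, point)]
```

Since each coordinate ranges over $n_i + 1$ values, the enumeration yields $(n_1+1)(n_2+1)\cdots(n_k+1)$ position points in total; only some of them are reachable in the automaton defined next.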

A position point $[p_1, p_2, \ldots, p_k]$ is called the initial position point if $p_i = 0$ for all $i \in \langle 1, k \rangle$. We denote by $ipp(P)$ the initial position point of $P$ and by $Pos(P)$ the set of all position points of $P$.

Definition 2. For a position point $[p_1, p_2, \ldots, p_k] \in Pos(P)$ we define the common subsequence position alphabet $\Sigma_{cp}([p_1, p_2, \ldots, p_k])$ as the set of all symbols which are contained simultaneously in $T_1[p_1 + 1 \ldots n_1], \ldots, T_k[p_k + 1 \ldots n_k]$, i.e. $\Sigma_{cp}([p_1, p_2, \ldots, p_k]) = \{a \in \Sigma : \forall i \in \langle 1, k \rangle\ \exists j \in \langle p_i + 1, n_i \rangle : T_i[j] = a\}$.

Definition 3. For a position point $[p_1, p_2, \ldots, p_k] \in Pos(P)$ and $a \in \Sigma$ we define the common subsequence transition function: $csf([p_1, p_2, \ldots, p_k], a) = [r_1, r_2, \ldots, r_k]$, where $r_i = \min\{j : j > p_i \text{ and } T_i[j] = a\}$ for all $i \in \langle 1, k \rangle$ if $a \in \Sigma_{cp}([p_1, p_2, \ldots, p_k])$, and $csf([p_1, p_2, \ldots, p_k], a) = \emptyset$ otherwise.

Let $csf^*$ be the reflexive-transitive closure of $csf$, i.e. $csf^*([p_1, \ldots, p_k], \varepsilon) = [p_1, \ldots, p_k]$, $csf^*([p_1, \ldots, p_k], a_1) = csf([p_1, \ldots, p_k], a_1)$, $csf^*([p_1, \ldots, p_k], a_1 a_2 \ldots a_l) = csf^*(csf([p_1, \ldots, p_k], a_1), a_2 \ldots a_l)$, where $a_j \in \Sigma_{cp}(csf^*([p_1, \ldots, p_k], a_1 a_2 \ldots a_{j-1}))$ for all $j \in \langle 1, l \rangle$.

Lemma 1. The automaton $(Pos(P), \Sigma, csf, ipp(P), Pos(P))$ accepts a string $S$ iff $S$ is a common subsequence of $P$.

Proof. See [].

The automaton from Lemma 1 is called the CSA for strings $T_1, T_2, \ldots, T_k$. An example of the CSA is in Fig. 1.

Fig. 1. The CSA for two example strings (figure not reproduced).

Up to now, two algorithms for building the CSA have been described. The first one is off-line and uses the position points. The second one is on-line and in each step loads one input string into the automaton. We will briefly describe the off-line algorithm. The algorithm generates step by step all reachable position points (states). At each step we process one position point. First, we find the common subsequence position alphabet for this point and then determine the common subsequence transition function for each symbol of that alphabet. When the position point has been processed, we continue with the next point until the transitions of all reachable position points are determined. The complexity of the algorithm depends on the number of states of the resulting automaton. Provided that the total number of states is $O(t)$, the algorithm requires $O(k\sigma t)$ time.

3. Number of states of CSA

We will investigate the number of states (reachable position points) of the CSA for a set of strings over a binary alphabet. First, we introduce an auxiliary structure called a generating tree. The generating tree is defined as a general rooted tree where nodes and edges are labeled by an integer value. We say that a node is of order $v$ if it is labeled by value $v$. A node of order $v$ has exactly $v$ output edges labeled by $1, 2, \ldots, v$. The ending node of each edge is labeled by the same value as this edge. If the root of the tree is of order $k$, we say that the tree is of order $k$. An example of such a tree is in Fig. 2.

We use the tree to describe a set of strings. Any path from the root corresponds to a string over the alphabet $\{a, b\}$. All nodes but the root contribute $b$. An edge labeled by $l$ adds $a^l$. For example, a path root → node(4) → node($i$) → node($j$) with $4 \ge i \ge j$ corresponds to the string $a^4 b\, a^i b\, a^j b$. A node has at most one output edge for a given value, therefore no two strings generated by the tree are identical.

In what follows, we consider the tree of order $k$ and denote, for $i \in \mathbb{Z}$, $i \ge 0$, by $p(k, i)$ the number of nodes on the $i$th level of this tree. Furthermore, we denote by $p_j(k, i)$, $1 \le j \le k$, the number of nodes on the $i$th level which are labeled by the value $j$. There is just one node of order $k$ on each level, that is $p_k(k, i) = 1$. On the 0th level, there is only one node (the root). A node of order $j$ on the $i$th level has descendants of orders $1, 2, \ldots, j$ on the $(i + 1)$th level. In other words, a node of order $j$ on the $i$th level is a descendant either of a node of order $j$ or of a node of higher order on the $(i - 1)$th level. The number of these nodes of higher order is the same as the number of nodes of order $j + 1$ on the $i$th level. That is,

$$p_j(k, i) = p_j(k, i - 1) + p_{j+1}(k, i),$$

where $i > 0$ and $1 \le j < k$. This formula is known from combinatorics and holds for binomial coefficients.
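The two ingredients introduced so far, the off-line construction of Section 2 and the generating tree, can be sketched together in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `build_csa`, `run`, and `tree_strings` are hypothetical, strings are plain `str` values over `{a, b}`, and an undefined transition ($csf = \emptyset$) is represented by a missing dictionary entry.

```python
from collections import deque
from typing import Dict, Iterator, List, Optional, Tuple

Point = Tuple[int, ...]  # position point [p_1, ..., p_k]
Csa = Dict[Point, Dict[str, Point]]


def build_csa(strings: List[str]) -> Csa:
    """Off-line construction: process every reachable position point once."""
    alphabet = sorted(set("".join(strings)))
    start: Point = (0,) * len(strings)
    csa: Csa = {start: {}}
    queue = deque([start])
    while queue:
        point = queue.popleft()
        for a in alphabet:
            # 1-based r_i = min{j : j > p_i and T_i[j] = a}; 0 means "no such j".
            target = tuple(t.find(a, p) + 1 for t, p in zip(strings, point))
            if 0 in target:
                continue  # a is not in the position alphabet of this point
            csa[point][a] = target
            if target not in csa:
                csa[target] = {}
                queue.append(target)
    return csa


def run(csa: Csa, word: str) -> Optional[Point]:
    """Follow csf* from the initial point; None if the word is rejected."""
    point: Optional[Point] = next(iter(csa))  # initial point is the first key
    for a in word:
        point = csa[point].get(a)
        if point is None:
            return None
    return point


def tree_strings(order: int, levels: int) -> Iterator[str]:
    """Strings of the generating tree of the given order, levels 1..levels."""
    def walk(label: int, prefix: str, remaining: int) -> Iterator[str]:
        if remaining == 0:
            return
        for l in range(1, label + 1):      # output edges labeled 1..label
            word = prefix + "a" * l + "b"  # the edge adds a^l, the node adds b
            yield word
            yield from walk(l, word, remaining - 1)
    return walk(order, "", levels)
```

For instance, with `strings = ["ab" * 4, "aab" * 2]` the five tree strings of order 2 on levels 1 and 2 (`ab`, `abab`, `aab`, `aabab`, `aabaab`) are all accepted, and `run` sends them to five different position points. Using `str.find` keeps the sketch short; a precomputed next-occurrence table would be the usual choice if the per-transition cost matters.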
Fig. 2. The top of the generating tree (levels $i = 0, 1, 2$; figure not reproduced).

For further derivation we use Pascal's triangle (see Fig. 3), which is a common means for expressing the relations between binomial coefficients. From Pascal's triangle we get:

$$p_k(k, i) = \binom{i-1}{0}, \quad p_{k-1}(k, i) = \binom{i}{1}, \quad \ldots, \quad p_1(k, i) = \binom{i+k-2}{k-1}.$$
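As a short consistency check (added here as an illustration, not part of the original derivation), the closed forms can be written uniformly in $j$ and substituted into the recurrence $p_j(k, i) = p_j(k, i-1) + p_{j+1}(k, i)$. The uniform index form is an assumption read off from the three listed cases.

```latex
% Uniform form of the closed formulas (assumed from the listed cases j = k, k-1, 1):
p_j(k,i) = \binom{i+k-j-1}{k-j}, \qquad 1 \le j \le k,\ i \ge 1 .

% Substituting into the recurrence, with N = i+k-j-1 and r = k-j:
\underbrace{\binom{i+k-j-1}{k-j}}_{p_j(k,i)}
  = \underbrace{\binom{i+k-j-2}{k-j}}_{p_j(k,i-1)}
  + \underbrace{\binom{i+k-j-2}{k-j-1}}_{p_{j+1}(k,i)}
\quad\Longleftrightarrow\quad
\binom{N}{r} = \binom{N-1}{r} + \binom{N-1}{r-1} .
```

The right-hand identity is Pascal's rule, so the closed forms are consistent with the recurrence. For $i = 1$ and $j < k$ the term $p_j(k, 0) = \binom{k-j-1}{k-j} = 0$ agrees with the 0th level containing only the root of order $k$.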

Fig. 3. The top of Pascal's triangle, rows $i = 1, \ldots, 6$, with the diagonals labeled $p_k(k, i), p_{k-1}(k, i), \ldots, p_{k-5}(k, i)$ (figure not reproduced).

The number of nodes on the $i$th level is hence

$$p(k, i) = \sum_{j=1}^{k} p_j(k, i) = \binom{i+k-1}{k-1}.$$

And the total number of nodes up to the $n$th level is

$$\sum_{i=0}^{n} p(k, i) = \sum_{i=0}^{n} \binom{i+k-1}{k-1} = \binom{n+k}{k}.$$

This formula determines how many strings ending with $b$ are generated by the tree of order $k$ if we consider the first $n$ levels of this tree.

Lemma 2. Let $k \in \mathbb{Z}$, $k \ge 2$, $n_i \in \mathbb{Z}$, $n_i \ge 1$ for all $i \in \langle 1, k \rangle$. Let $T_1 = (ab)^{n_1}$, $T_2 = (a^2 b)^{n_2}$, ..., $T_k = (a^k b)^{n_k}$, $L_k = \{T_1, T_2, \ldots, T_k\}$, and let $\delta$ denote the transition function of the CSA for $L_k$. Let $M_k$ be the set of strings generated by the generating tree of order $k$. Then for all $u, v \in M_k$, $u \ne v$, both accepted by the CSA for $L_k$, it holds that $\delta^*(ipp(L_k), u) \ne \delta^*(ipp(L_k), v)$.

Proof (by induction on $k$). Let $M_k^l$ denote the set of strings from the $l$th level of the generating tree of order $k$.

1. $k = 2$: $M_2^l = \{(ab)^l,\ a^2 b (ab)^{l-1},\ (a^2 b)^2 (ab)^{l-2},\ \ldots,\ (a^2 b)^l\}$. If two strings contain a different number of $b$s, they cannot result in a shift to the same position in $T_2$. Therefore we can consider only the strings with the same number of $b$s. For a given $l \in \mathbb{Z}$, $l \ge 1$, exactly the strings in $M_2^l$ have the same number of $b$s. But simultaneously no two such strings have the same number of $a$s. Thus, the transition function finishes at the same position in $T_2$ and always at a different position in $T_1$.

2. Let $n_{k+1} \in \mathbb{Z}$, $n_{k+1} \ge 1$. We add $T_{k+1} = (a^{k+1} b)^{n_{k+1}}$ into the set $L_k$, that is $L_{k+1} = L_k \cup \{T_{k+1}\}$. According to the induction hypothesis the lemma holds for $k$. We will show that it holds for $k + 1$ too by using induction on the height $h$ of the tree:

(a) $h = 1$: $M_{k+1}^1 = \{ab, a^2 b, \ldots, a^{k+1} b\}$. No two strings from this set have the same number of $a$s, thus the lemma holds.

(b) The hypothesis says that the lemma holds for $M_{k+1}^1, M_{k+1}^2, \ldots, M_{k+1}^h$. We will prove that it holds also for $M_{k+1}^{h+1}$. From the generating tree we get:

$$M_{k+1}^{h+1} = M_k^{h+1} \cup \{a^{k+1} b s : s \in M_{k+1}^h\}.$$

We need to show that the result holds for any $u, v \in M_k^{h+1} \cup \{a^{k+1} b s : s \in M_{k+1}^h\}$. There are three cases: (i) $u, v \in M_k^{h+1}$, (ii) $u, v \in \{a^{k+1} b s : s \in M_{k+1}^h\}$, (iii) $u \in M_k^{h+1}$ and $v \in \{a^{k+1} b s : s \in M_{k+1}^h\}$. Case (i) follows from the induction hypothesis. In case (ii) both $u$ and $v$ have the prefix $a^{k+1} b$ and for the suffixes we can use the induction hypothesis. Let us now consider case (iii). Any string from $M_k^{h+1}$ results in a shift to the $(h + 1)$th $b$ in $T_k = (a^k b)^{n_k}$. Furthermore, $a^{k+1} b$ makes a shift to the second $b$ in $T_k$, and because any string in $M_{k+1}^h$ contains $h$ symbols $b$, the lemma holds for case (iii) as well.

Theorem 1. Let $n, k \in \mathbb{Z}$, $k \ge 1$. There is a set $L$ of $k$ strings, each of length at most $n$, such that the number of states of the CSA for $L$ is $\Omega(n^k/((k+1)^k\,k!))$.

Proof. Let $n_i = \lfloor n/(i+1) \rfloor$, $T_i = (a^i b)^{n_i}$ for all $i \in \langle 1, k \rangle$ and let $L = \{T_1, T_2, \ldots, T_k\}$. The CSA for $L$ then accepts all the strings generated by the generating tree of order $k$ up to level $n_k$. The number of these strings is

$$\binom{n_k + k}{k} = \frac{(n_k + k)(n_k + k - 1) \cdots (n_k + 1)}{k!} = \Omega\!\left(\frac{n^k}{(k+1)^k\,k!}\right),$$

which proves the theorem.

We note again that the result of Theorem 1 is applicable also to the DASG.

4. Conclusion

We have shown that the maximum number of states of the subsequence automaton for $k$ strings of length $O(n)$ is $\Omega(n^k/((k+1)^k\,k!))$. We also dealt with the problem of a tight upper bound for the number of states. By exhaustive search we found the worst cases for several lengths of input strings and verified in [4] that the sequence of the maximum numbers of states does not form any well-known integer sequence. Hence the problem of a tight upper bound remains open.

References

[1] R.A. Baeza-Yates, Searching subsequences, Theoret. Comput. Sci. 78 (2) (1991) 363–376.
[2] M. Crochemore, B. Melichar, Z. Troníček, Directed acyclic subsequence graph - overview, J. Discrete Algorithms 1 (3–4) (2003) 255–280.
[3] H. Hoshino, A. Shinohara, M. Takeda, S. Arikawa, Online construction of subsequence automata for multiple texts, in: Proc. Symp. on String Processing and Information Retrieval 2000, La Coruña, Spain, 2000, IEEE Computer Society Press, Silver Spring, MD.
[4] N.J.A. Sloane, The on-line encyclopedia of integer sequences, http://www.research.att.com/~njas/sequences/.