The size of subsequence automaton

Theoretical Computer Science 341 (2005) 379-384
www.elsevier.com/locate/tcs

Note

Zdeněk Troníček (a,*), Ayumi Shinohara (b,c)

(a) Department of Computer Science and Engineering, FEE CTU in Prague, Czech Republic
(b) Department of Informatics, Kyushu University, Fukuoka 812-8581, Japan
(c) PRESTO, Japan Science and Technology Corporation (JST), Japan

Received March 2004; received in revised form February 2005; accepted March 2005
Communicated by M. Crochemore

(*) Corresponding author. E-mail addresses: tronicek@fel.cvut.cz (Z. Troníček), ayumi@i.kyushu-u.ac.jp (A. Shinohara).
doi:10.1016/j.tcs.2005.03.027
© 2005 Elsevier B.V. All rights reserved.

Abstract

Given a set of strings, the subsequence automaton accepts all subsequences of these strings. We derive a lower bound for the maximum number of states of this automaton. We prove that the size of the subsequence automaton for a set of $k$ strings of length $n$ is $\Omega(n^k/((k+1)^k k!))$ for any $k \ge 1$. It solves an open problem because only the case $k \le 2$ was shown before.

Keywords: Searching subsequences; Directed acyclic subsequence graph; Subsequence automaton

1. Introduction

A subsequence of a string $T$ is any string obtainable by deleting zero or more symbols from $T$. Given a set $P$ of strings, a common subsequence of $P$ is a string that is a subsequence of every string in $P$. Motivation for the study of subsequences comes from many domains, e.g. from molecular biology, signal processing, coding theory, and artificial intelligence.

An example of a problem with great practical impact is the longest common subsequence (LCS) problem. The problem is defined as follows: given a set $P$ of strings, we are to find a common subsequence of $P$ that has maximal length among all common subsequences of $P$. The decision version can be, for example, to decide whether a given string is a common subsequence of $P$.

Another problem, which comes from artificial intelligence, is the problem of separating two sets of strings: given two sets of strings, $P$ (positive) and $N$ (negative), we are to find a string that best separates them. A string $S$ separates the sets $P$ and $N$ if $S$ is a common subsequence of $P$ and simultaneously is not a subsequence of any string in $N$. The decision version is defined as follows: given two sets $P$ and $N$ of strings and a string $S$, we are to decide whether $S$ separates $P$ and $N$.

If the problem is to be answered for several strings $S$, then it is sensible to preprocess the sets $P$ and $N$. We can build an automaton accepting all common subsequences of $P$ and an automaton that accepts any string that is a subsequence of at least one string in $N$. With these automata we can decide the problem in time linear in the length of $S$.

Both automata have been studied and described. The latter one is known as the Directed Acyclic Subsequence Graph (DASG), and three building algorithms are available: right-to-left [1], left-to-right [2], and on-line [3]. The former one is called the Common Subsequence Automaton (CSA) and can be built by an off-line or an on-line algorithm, both of which are modifications of the algorithms for building the DASG.

In this paper, we investigate the number of states of the CSA. The language accepted by the CSA is a subset of the language accepted by the DASG for the same strings, and the automata are very similar. If we use either the off-line or the on-line algorithm, the set of states of the CSA is a subset of the states of the DASG. The only previous result is, to our knowledge, the lower bound for the maximum number of states of the DASG for two strings proved in [2]. We will prove the lower bound for any (fixed) number of strings.

The paper is organized as follows. In Section 2 we recall the definition of the CSA, and in Section 3 we examine the asymptotic behavior of the number of states of the CSA in the worst case.

Let $\Sigma$ be a finite alphabet of size $\sigma$ and $\varepsilon$ be the empty word.
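A finite automaton is, in this paper, a 5-tuple $(Q, \Sigma, \delta, q_0, F)$, where $Q$ is a finite set of states, $\Sigma$ is an input alphabet, $\delta : Q \times \Sigma \to Q$ is a transition function, $q_0$ is the initial state, and $F \subseteq Q$ is the set of final states. Let $\delta^*$ be the reflexive-transitive closure of $\delta$, i.e. $\delta^*(q, \varepsilon) = q$, $\delta^*(q, a) = \delta(q, a)$, $\delta^*(q, a_1 a_2 \ldots a_l) = \delta^*(\delta(q, a_1), a_2 \ldots a_l)$, where $q \in Q$, $a \in \Sigma$ and $a_1, \ldots, a_l \in \Sigma$. The notation $\langle i, j \rangle$ means the interval of integers from $i$ to $j$, including both $i$ and $j$. All strings in this paper are considered over the alphabet $\Sigma$, if not stated otherwise.

2. Definition of CSA

Let $P$ denote the set of strings $T_1, T_2, \ldots, T_k$. Let $n_i$ be the length of $T_i$ and $T_i[j]$ be the $j$th symbol of $T_i$ for all $j \in \langle 1, n_i \rangle$ and all $i \in \langle 1, k \rangle$. Given $T = t_1 t_2 \ldots t_n$ and $i, j \in \langle 1, n \rangle$, $i \le j$, the notation $T[i..j]$ means the string $t_i t_{i+1} \ldots t_j$.

Definition 1. We define a position point of the set $P$ as an ordered $k$-tuple $[p_1, p_2, \ldots, p_k]$, where $p_i \in \langle 0, n_i \rangle$ is a position in the string $T_i$. If $p_i \in \langle 0, n_i - 1 \rangle$ then it denotes the position in front of the $(p_i + 1)$th symbol of $T_i$, and if $p_i = n_i$ then it denotes the position behind the last symbol of $T_i$, for all $i \in \langle 1, k \rangle$.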
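A position point can be read as a tuple of cursors, one per string. As a baseline for the decision problem from the introduction, the following is a minimal sketch (not the authors' code) of the direct, automaton-free test of whether a string is a common subsequence of $P$. The function names `advance` and `is_common_subsequence` are hypothetical, and strings are assumed to be ordinary Python `str` values.

```python
def advance(text, pos, symbol):
    """Move a cursor in `text` past the next occurrence of `symbol`.

    `pos` follows Definition 1: a value in 0..len(text), where `pos`
    means "in front of the (pos + 1)th symbol". Returns the new
    position, or None if `symbol` does not occur after `pos`.
    """
    j = text.find(symbol, pos)          # 0-based index of the match, or -1
    return None if j < 0 else j + 1     # position just behind the match


def is_common_subsequence(s, strings):
    """Check greedily whether `s` is a subsequence of every string."""
    point = [0] * len(strings)          # the initial position point
    for symbol in s:
        for i, text in enumerate(strings):
            nxt = advance(text, point[i], symbol)
            if nxt is None:             # symbol missing in some suffix
                return False
            point[i] = nxt
    return True
```

For instance, `is_common_subsequence("ab", ["aab", "abb"])` holds, while `"ba"` is rejected for the same two strings. Each query costs time proportional to the total length of the strings in the worst case, which is what the preprocessing into an automaton is meant to avoid.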

A position point $[p_1, p_2, \ldots, p_k]$ is called the initial position point if $p_i = 0$ for all $i \in \langle 1, k \rangle$. We denote by $ipp(P)$ the initial position point of $P$ and by $Pos(P)$ the set of all position points of $P$.

Definition 2. For a position point $[p_1, p_2, \ldots, p_k] \in Pos(P)$ we define the common subsequence position alphabet $\Sigma_{cp}([p_1, p_2, \ldots, p_k])$ as the set of all symbols which are contained simultaneously in $T_1[p_1 + 1..n_1], \ldots, T_k[p_k + 1..n_k]$, i.e.

$$\Sigma_{cp}([p_1, p_2, \ldots, p_k]) = \{a \in \Sigma : \forall i \in \langle 1, k \rangle \; \exists j \in \langle p_i + 1, n_i \rangle : T_i[j] = a\}.$$

Definition 3. For a position point $[p_1, p_2, \ldots, p_k] \in Pos(P)$ and $a \in \Sigma$ we define the common subsequence transition function:

$$csf([p_1, p_2, \ldots, p_k], a) = [r_1, r_2, \ldots, r_k],$$

where $r_i = \min\{j : j > p_i \text{ and } T_i[j] = a\}$ for all $i \in \langle 1, k \rangle$ if $a \in \Sigma_{cp}([p_1, p_2, \ldots, p_k])$, and $csf([p_1, p_2, \ldots, p_k], a) = \emptyset$ otherwise.

Let $csf^*$ be the reflexive-transitive closure of $csf$, i.e. $csf^*([p_1, \ldots, p_k], \varepsilon) = [p_1, \ldots, p_k]$, $csf^*([p_1, \ldots, p_k], a_1) = csf([p_1, \ldots, p_k], a_1)$, $csf^*([p_1, \ldots, p_k], a_1 a_2 \ldots a_l) = csf^*(csf([p_1, \ldots, p_k], a_1), a_2 \ldots a_l)$, where $a_j \in \Sigma_{cp}(csf^*([p_1, \ldots, p_k], a_1 \ldots a_{j-1}))$ for all $j \in \langle 1, l \rangle$.

Lemma 1. The automaton $(Pos(P), \Sigma, csf, ipp(P), Pos(P))$ accepts a string $S$ iff $S$ is a common subsequence of $P$.

Proof. See [2].

The automaton from Lemma 1 is called the CSA for the strings $T_1, T_2, \ldots, T_k$. An example of the CSA is in Fig. 1.

Fig. 1. The CSA for two example strings.

Up to now, two algorithms for building the CSA have been described. The first one is off-line and uses the position points. The second one is on-line and in each step loads one input string into the automaton. We will briefly describe the off-line algorithm. The algorithm generates, step by step, all reachable position points (states). At each step we process one position point. First, we find the common subsequence position alphabet for this point and then determine the common subsequence transition function for each symbol of that alphabet. When the position point has been processed, we continue with the next point until the transitions of all reachable position points are determined. The complexity of the algorithm depends on the number of states of the resulting automaton. Provided that the total number of states is $O(t)$, the algorithm requires $O(k \sigma t)$ time.

3. Number of states of CSA

We will investigate the number of states (reachable position points) of the CSA for a set of strings over a binary alphabet. First, we introduce an auxiliary structure called the generating tree. The generating tree is defined as a general rooted tree where nodes and edges are labeled by an integer value. We say that a node is of order $v$ if it is labeled by the value $v$. A node of order $v$ has exactly $v$ output edges labeled by $1, 2, \ldots, v$. The ending node of each edge is labeled by the same value as this edge. If the root of the tree is of order $k$, we say that the tree is of order $k$. An example of such a tree is in Fig. 2.

We use the tree to describe a set of strings. Any path from the root corresponds to a string over the alphabet $\{a, b\}$. Every node except the root contributes the symbol $a$, and an edge labeled by $l$ adds $b^l$ in front of it. For example, a path root → node(4) → node($l_2$) → node($l_3$), where $4 \ge l_2 \ge l_3 \ge 1$, corresponds to the string $b^4 a \, b^{l_2} a \, b^{l_3} a$. A node has at most one output edge for a given value, therefore no two strings generated by the tree are identical.

In the sequel, we consider the tree of order $k$ and denote, for $i \in \mathbb{Z}$, $i \ge 0$, by $p(k, i)$ the number of nodes on the $i$th level of this tree. Furthermore, we denote by $p_j(k, i)$, $1 \le j \le k$, the number of nodes on the $i$th level which are labeled by the value $j$. There is just one node of order $k$ on each level, that is $p_k(k, i) = 1$. On the 0th level, there is only one node (the root). A node of order $j$ on the $i$th level has descendants of orders $1, 2, \ldots, j$ on the $(i + 1)$th level. In other words, a node of order $j$ on the $i$th level is a descendant either of a node of order $j$ or of a node of higher order on the $(i - 1)$th level. The number of these nodes of higher order is the same as the number of nodes of order $j + 1$ on the $i$th level. That is,

$$p_j(k, i) = p_j(k, i - 1) + p_{j+1}(k, i),$$

where $i > 0$ and $1 \le j < k$. This formula is known from combinatorics and holds for binomial coefficients.
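Since this section counts reachable position points, it can help to reproduce such counts for small inputs. The following is a minimal sketch of the off-line algorithm from Section 2, written as a breadth-first search over position points. It is an illustration under stated assumptions, not the authors' implementation: the names `position_alphabet`, `csf`, `build_csa` and `accepts` are hypothetical, position points are stored as tuples, and the set-based alphabet computation is chosen for brevity rather than to meet the $O(k \sigma t)$ bound.

```python
from collections import deque


def position_alphabet(strings, point):
    """Symbols that occur in every suffix T_i[p_i + 1 .. n_i] (Definition 2)."""
    common = set(strings[0][point[0]:])
    for text, p in zip(strings[1:], point[1:]):
        common &= set(text[p:])
    return common


def csf(strings, point, symbol):
    """Common subsequence transition function (Definition 3).

    Returns the next position point, or None where the function is undefined.
    """
    nxt = []
    for text, p in zip(strings, point):
        j = text.find(symbol, p)        # first occurrence behind position p
        if j < 0:
            return None
        nxt.append(j + 1)
    return tuple(nxt)


def build_csa(strings):
    """Generate all reachable position points and their transitions."""
    start = (0,) * len(strings)         # ipp(P)
    transitions = {start: {}}
    queue = deque([start])
    while queue:
        point = queue.popleft()
        for symbol in sorted(position_alphabet(strings, point)):
            target = csf(strings, point, symbol)
            transitions[point][symbol] = target
            if target not in transitions:   # newly discovered state
                transitions[target] = {}
                queue.append(target)
    return start, transitions


def accepts(start, transitions, s):
    """Run the automaton on s; every reachable state is final."""
    point = start
    for symbol in s:
        point = transitions[point].get(symbol)
        if point is None:
            return False
    return True
```

With this sketch, `len(build_csa(strings)[1])` gives the number of reachable states for a chosen set of strings, so small cases of the constructions discussed in this section can be examined directly.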
For further derivation we use Pascal's triangle (see Fig. 3), which is a common means for expressing the relations between binomial coefficients. From Pascal's triangle we get:

$$p_k(k, i) = \binom{i - 1}{0}, \quad p_{k-1}(k, i) = \binom{i}{1}, \quad \ldots, \quad p_1(k, i) = \binom{i + k - 2}{k - 1}.$$

Fig. 2. The top of the generating tree (levels $i = 0$, $i = 1$, $i = 2$).
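The recurrence and the binomial closed forms above can be checked numerically. The following is a minimal sketch that counts the nodes of the generating tree level by level, directly from the rule that a node of order $j$ has children of orders $1, \ldots, j$. The names `level_counts` and `closed_form` are hypothetical, and the closed form is written in the general shape $p_j(k, i) = \binom{i + k - j - 1}{k - j}$, which is an interpolation of the three cases listed above and is used here only for levels $i \ge 1$.

```python
from math import comb


def level_counts(k, levels):
    """Return counts where counts[i][j] is the number of nodes of
    order j on level i of the generating tree of order k."""
    root = {j: 0 for j in range(1, k + 1)}
    root[k] = 1                          # level 0 holds only the root
    counts = [root]
    for _ in range(levels):
        prev = counts[-1]
        cur = {j: 0 for j in range(1, k + 1)}
        for order, how_many in prev.items():
            # each node of this order has children of orders 1..order
            for child in range(1, order + 1):
                cur[child] += how_many
        counts.append(cur)
    return counts


def closed_form(k, i, j):
    """Binomial expression for p_j(k, i), valid for levels i >= 1."""
    return comb(i + k - j - 1, k - j)


if __name__ == "__main__":
    k, levels = 3, 5
    counts = level_counts(k, levels)
    for i in range(1, levels + 1):
        for j in range(1, k + 1):
            assert counts[i][j] == closed_form(k, i, j)
            if j < k:                    # the recurrence from the text
                assert counts[i][j] == counts[i - 1][j] + counts[i][j + 1]
```

The enumeration only uses the definition of the tree, so agreement with `closed_form` is an independent check of the binomial expressions rather than a restatement of them.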

Fig. 3. The top of Pascal's triangle (rows $i = 1, \ldots, 6$; the diagonals correspond to $p_k(k, i), p_{k-1}(k, i), \ldots, p_{k-5}(k, i)$).

The number of nodes on the $i$th level is hence

$$p(k, i) = \sum_{j=1}^{k} p_j(k, i) = \binom{i + k - 1}{k - 1},$$

and the total number of nodes up to the $n$th level is

$$\sum_{i=0}^{n} p(k, i) = \sum_{i=0}^{n} \binom{i + k - 1}{k - 1} = \binom{n + k}{k}.$$

This formula determines how many strings ending with $a$ are generated by the tree of order $k$ if we consider the first $n$ levels of this tree.

Lemma 2. Let $k \in \mathbb{Z}$, $k \ge 2$, $n_i \in \mathbb{Z}$, $n_i \ge 1$ for all $i \in \langle 1, k \rangle$. Let $T_1 = (ba)^{n_1}$, $T_2 = (b^2 a)^{n_2}$, $\ldots$, $T_k = (b^k a)^{n_k}$, $L_k = \{T_1, T_2, \ldots, T_k\}$, and let $\delta$ denote the transition function of the CSA for $L_k$. Let $M_k$ be the set of strings generated by the generating tree of order $k$. Then for all $u, v \in M_k$, $u \ne v$, both accepted by the CSA for $L_k$, it holds that $\delta^*(ipp(L_k), u) \ne \delta^*(ipp(L_k), v)$.

Proof (by induction on $k$). Let $M_k^l$ denote the set of strings from the $l$th level of the generating tree of order $k$.

1. $k = 2$: $M_2^l = \{(ba)^l, \; b^2 a (ba)^{l-1}, \; (b^2 a)^2 (ba)^{l-2}, \; \ldots, \; (b^2 a)^l\}$. If two strings contain a different number of $a$'s, they cannot result in a shift to the same position in $T_2$. Therefore we can consider only the strings with the same number of $a$'s. For a given $l \in \mathbb{Z}$, $l \ge 1$, exactly the strings in $M_2^l$ have the same number of $a$'s. But simultaneously no two such strings have the same number of $b$'s. Thus, the transition function finishes at the same position in $T_2$ and always at a different position in $T_1$.

2. Let $n_{k+1} \in \mathbb{Z}$, $n_{k+1} \ge 1$. We add $T_{k+1} = (b^{k+1} a)^{n_{k+1}}$ into the set $L_k$, that is $L_{k+1} = L_k \cup \{T_{k+1}\}$. According to the induction hypothesis the lemma holds for $k$. We will show that it holds for $k + 1$ too by using induction on the height $h$ of the tree:

(1) $h = 1$: $M_{k+1}^1 = \{ba, b^2 a, \ldots, b^{k+1} a\}$. No two strings from this set have the same number of $b$'s, thus the lemma holds.

(2) The hypothesis says that the lemma holds for $M_{k+1}^1, M_{k+1}^2, \ldots, M_{k+1}^h$. We will prove that it holds also for $M_{k+1}^{h+1}$. From the generating tree we get:

$$M_{k+1}^{h+1} = M_k^{h+1} \cup \{b^{k+1} a s : s \in M_{k+1}^h\}.$$

We need to show that the result holds for any $u, v \in M_k^{h+1} \cup \{b^{k+1} a s : s \in M_{k+1}^h\}$. There are three cases:

(i) $u, v \in M_k^{h+1}$,
(ii) $u, v \in \{b^{k+1} a s : s \in M_{k+1}^h\}$,
(iii) $u \in M_k^{h+1}$ and $v \in \{b^{k+1} a s : s \in M_{k+1}^h\}$.

Case (i) follows from the induction hypothesis. In case (ii) both $u$ and $v$ have the prefix $b^{k+1} a$, and for the suffixes we can use the induction hypothesis. Let us now consider case (iii). Any string from $M_k^{h+1}$ results in a shift to the $(h + 1)$th $a$ in $T_k = (b^k a)^{n_k}$. Furthermore, $b^{k+1} a$ makes a shift to the second $a$ in $T_k$, and because any string in $M_{k+1}^h$ contains $h$ symbols $a$, the lemma holds for case (iii) as well.

Theorem 1. Let $n, k \in \mathbb{Z}$, $k \ge 1$. There is a set $L$ of $k$ strings, each of length at most $n$, such that the number of states of the CSA for $L$ is $\Omega(n^k/((k+1)^k k!))$.

Proof. Let $n_i = \lfloor n/(i+1) \rfloor$, $T_i = (b^i a)^{n_i}$ for all $i \in \langle 1, k \rangle$ and let $L = \{T_1, T_2, \ldots, T_k\}$. The CSA for $L$ then accepts all the strings generated by the generating tree of order $k$ up to level $n_k$. The number of these strings is

$$\binom{n_k + k}{k} = \frac{(n_k + k)(n_k + k - 1) \cdots (n_k + 1)}{k!} = \Omega\left(\frac{n^k}{(k+1)^k k!}\right),$$

which proves the theorem.

We note again that the result of Theorem 1 is applicable also to the DASG.

4. Conclusion

We have shown that the maximum number of states of the subsequence automaton for $k$ strings of length $O(n)$ is $\Omega(n^k/((k+1)^k k!))$. We also dealt with the problem of a tight upper bound for the number of states. By exhaustive searching we found the worst cases for several lengths of input strings and verified in [4] that the sequence of the maximum numbers of states does not form any well-known integer sequence. Hence the problem of a tight upper bound remains open.

References

[1] R.A. Baeza-Yates, Searching subsequences, Theoret. Comput. Sci. 78 (2) (1991) 363-376.
[2] M. Crochemore, B. Melichar, Z. Troníček, Directed acyclic subsequence graph - overview, J. Discrete Algorithms 1 (3-4) (2003) 255-280.
[3] H. Hoshino, A. Shinohara, M. Takeda, S. Arikawa, Online construction of subsequence automata for multiple texts, in: Proc. Symp. on String Processing and Information Retrieval 2000, La Coruña, Spain, 2000, IEEE Computer Society Press, Silver Spring, MD.
[4] N.J.A. Sloane, The on-line encyclopedia of integer sequences, http://www.research.att.com/~njas/sequences/.