On-Line Construction of Compact Directed Acyclic Word Graphs


Shunsuke Inenaga (1), Hiromasa Hoshino (1), Ayumi Shinohara (1), Masayuki Takeda (1), Setsuo Arikawa (1), Giancarlo Mauri (2), and Giulio Pavesi (2)

(1) Dept. of Informatics, Kyushu University, Japan
{s-ine,hoshino,ayumi,takeda,arikawa}@i.kyushu-u.ac.jp
(2) Dept. of Computer Science, Systems and Communication, University of Milan-Bicocca, Italy
{mauri,pavesi}@disco.unimi.it

Abstract. A Compact Directed Acyclic Word Graph (CDAWG) is a space efficient text indexing structure that can be used in several different string algorithms, especially in the analysis of biological sequences. In this paper, we present a new on-line algorithm for its construction, as well as for the construction of a CDAWG for a set of strings.

1 Introduction

Several different string problems, like those deriving from the analysis of biological sequences, can be solved efficiently with a suitable text indexing structure. Perhaps the most widely used and known structure of this kind is the suffix tree, which can be built in linear time and makes it possible to efficiently find and locate all the substrings of a given string. The main drawback of suffix trees is the additional space required to implement the structure. In many applications, like sequence analysis and pattern discovery in biological sequences, keeping as much data as possible in main memory might provide significant advantages. This fact has led to the introduction of more space efficient structures, like suffix arrays [1], suffix cacti [2], and others.

In this work, we focus our attention on the Compact Directed Acyclic Word Graph (CDAWG), first described in [3]. The CDAWG for a string can be seen either as a compaction of the Directed Acyclic Word Graph (DAWG) [4], or as a minimization of the suffix tree, from which it can be derived as shown in [3,5] for DAWGs and [6] for suffix trees. In the latter case, the basic idea is to merge redundant parts of the suffix tree (see Fig. 1). Experimental results [3,5] have shown how CDAWGs provide significant reductions of the memory space required by suffix trees and DAWGs when applied to genomic sequences.
A linear time algorithm for the direct construction of the CDAWG of a string is presented in [5], so as to avoid the additional space required by the preliminary construction of the DAWG or the suffix tree. The algorithm is similar to McCreight's algorithm for suffix trees [7]. In this paper, we present a new algorithm for the construction of CDAWGs, based on Ukkonen's algorithm for suffix trees [8]. The algorithm is on-line, that is, it processes the characters of the string from left to right one by one, with no need to know the whole string beforehand. Furthermore, we show how the algorithm can be used to build a CDAWG for a set of strings, a structure first described in [3], where it was derived by compacting a DAWG for a set of strings. The main drawback of this approach was the fact that, when a new string was added to the set, the DAWG had to be built again from scratch. Instead, the algorithm we present allows a new string to be added directly to the compact structure.

Footnote: The results described in this work were reached independently by the Kyushu and Milan groups, submitted simultaneously to the conference, and merged into a joint contribution.

A. Amir and G.M. Landau (Eds.): CPM 2001, LNCS 2089, pp. 169–180, 2001. Springer-Verlag Berlin Heidelberg 2001

Fig. 1. Suffix tree and CDAWG for string cocoa. Substrings co and o occur as prefix of the same suffixes: the corresponding nodes are merged, as well as the subtrees rooted at the nodes. Leaves are merged into a single final node.

2 Definitions

Let Σ be a nonempty finite alphabet, and Σ* the set of strings over Σ. If s = αβγ, with α, β, γ ∈ Σ*, then α is a prefix of s, γ is a suffix of s, and α, β, and γ are substrings (factors) of s. If s = s_1 ... s_n is a string in Σ*, |s| denotes its length, and s[i..j] its substring s_i ... s_j. With Suf(s) we will denote the set of all suffixes of s. Let X be a subset of Σ*. For any string u ∈ Σ*, u^{-1}X = {x | ux ∈ X}. Given a string s, we define the syntactic congruence on Σ* associated with Suf(s), denoted by ≡_{Suf(s)}, as:

    u ≡_{Suf(s)} v  ⟺  u^{-1}Suf(s) = v^{-1}Suf(s)    (for any u, v ∈ Σ*)

That is, u and v occur as prefixes of the same suffixes of s. In other words, the occurrences of u and v must end at the same positions in the string. Hence, if u and v occur in the string, one must be a suffix of the other. As in [3,5], we will call classes of factors the congruence classes of the relation ≡_{Suf(s)}. The class of all strings that are not substrings of s is called the degenerate class. The longest string in a non-degenerate class of factors is the representative of the class. Given a non-degenerate class of factors C of ≡_{Suf(s)}, and its representative u, if there are at least two characters a, b ∈ Σ such that ua and ub are substrings of s, then C is a strict class of factors of ≡_{Suf(s)}. From now on, we will say that two substrings are strictly congruent if they belong to the same strict class of factors. We are now ready to give a formal definition of CDAWG.

Definition 1. The compact directed acyclic word graph (CDAWG) of a string s is a directed acyclic graph, where:
1. two distinct nodes are marked as initial and final;
2. edges are labeled with non-empty substrings of s;
3. labels of two edges leaving the same node cannot begin with the same character;
4. every suffix of s corresponds to a path on the graph starting from the initial node and ending at a node, such that the concatenation of the edge labels on the path exactly spells the suffix. From now on, we will call a node corresponding to a suffix of s a terminal node;
5. substrings spelled by paths starting from the initial node and ending at the same non-terminal node of the graph belong to the same strict class of factors.

The CDAWG of a string s has at most |s| + 1 nodes and 2|s| − 2 edges [3,5]. According to the definition of strict class of factors, non-terminal nodes must have at least two outgoing edges. We will denote with (p, α, q) the edge p → q of the graph labeled with substring α. The following definitions will be useful throughout the paper:

Definition 2. The implicit CDAWG of a string s is a CDAWG where nodes with outdegree one are removed, and each edge entering a node with outdegree one is merged with the edge leaving it.

In the implicit CDAWG of a string s, the suffixes of s are spelled out by paths in the graph starting at the initial node, but not necessarily ending at a node. An example is shown in Fig. 2.

Fig. 2. Implicit CDAWG and CDAWG for a string.

For every node p, let length_s(p) be the length of the longest substring spelled by a path from the initial node to p. Edges belonging to the spanning tree of the longest paths from the initial node are called solid edges. In other words, an edge (p, α, q) is solid iff length_s(q) = length_s(p) + |α|. Finally, we assume that the label of each edge is implemented with a pair of integers denoting the starting and ending points in the string of the substring corresponding to the label, and that every node is annotated with the length of the longest path from the initial node.

3 Construction of the CDAWG for a Single String

Given an alphabet Σ, let s = s_1 ... s_n be a string on Σ. Our algorithm is divided into n phases, building at each phase i the implicit CDAWG G_i for the prefix s[1..i] of s. In more detail, the implicit CDAWG G_{i+1} for s[1..i+1] is constructed starting from the graph G_i for s[1..i]. Each phase i+1 is divided into i+1 extensions, one for each of the i+1 suffixes of s[1..i+1]. In extension j of phase i+1, the algorithm finds the end of the path from the initial node labeled with substring s[j..i], and extends it by adding character s_{i+1} to the path, unless it is already there. Therefore, in phase i+1, substring s[1..i+1] is first put on the graph, followed by s[2..i+1], s[3..i+1], and so on. Extension i+1 of phase i+1 adds the single character s_{i+1} after the initial node. The initial graph G_1 has one initial node and one final node, connected by an edge labeled by character s_1. The algorithm can be sketched as follows:

    Construct graph G_1
    For i from 1 to n−1 do
        For j from 1 to i+1 do
            Find the end of the path from the initial node labeled s[j..i]
            Add character s_{i+1} if needed
        End for
    End for

At extension j of phase i+1, once the end of the path spelling s[j..i] has been located, the CDAWG can be updated according to three different rules:
1. In the current graph, the path spelling s[j..i] ends in the final node. To update the graph, character s_{i+1} is appended to the label of the edge entering the final node.
2. The path corresponding to s[j..i] does not continue with s_{i+1}, but continues with at least one character. If the path ends at a node p, we create a new edge from p to the final node labeled s_{i+1}. Otherwise, we create a new node q at the end of the path, splitting the edge in two at the point where the path ends. Then, we create a new edge from q to the final node labeled s_{i+1}.
3. Some path at the end of s[j..i] continues with s_{i+1}.
In this case, substring s[j..i+1] is already in the current graph: we do nothing (hence the implicit graph).

These rules, however, do not guarantee that at the end of the phase we have correctly constructed a CDAWG. In fact, the algorithm must also check whether a substring strictly congruent to another one has been encountered, or, conversely, whether a substring has to be removed from a strict class of factors, so that at the end of phase i+1 paths ending at the same node correspond to strict classes of factors of s[1..i+1], and vice versa. Here we sketch how the algorithm has to be modified. A more detailed description of the algorithm and its implementation can be found in [9].

Fig. 3. Implicit CDAWG before (left) and after the redirection of an edge, at phase 6, extension 5. A node was created at the previous extension; the path located at the current extension is found ending in the middle of a non-solid edge, which is redirected to that node.

Detecting Strictly Congruent Factors. Two substrings α and β belong to the same class C iff they are prefixes of the same suffixes, and there are at least two characters a, b ∈ Σ such that αa, αb, βa, and βb occur in s. Moreover, α must be a suffix of β, or vice versa. We suppose w.l.o.g. that α = cβ, with c ∈ Σ. We also assume that α and β have occurred just once, that substrings αa and βa have been put in the graph in some previous phase (in two consecutive extensions), and that in the current extension we have to insert αb. The path spelling α ends in the middle of an edge, and the next character on the edge is a. A new node p is created at the end of the path, as well as a new edge from p to the final node labeled b. At the following extension, we have to locate β in the graph. If β has occurred only once (together with α), it now belongs to the same strict class of factors, and we end in the middle of a non-solid edge that continues with a. In this case, we redirect the edge to p, labeling it with the part of the label that was contained in the path of β (see Fig. 3). Since there can be more than two consecutive substrings to be assigned to the same class, it is possible that we again end along non-solid edges in the following extensions. In this case, we redirect the non-solid edges to p as well, until we reach an extension where we end at a node or along a solid edge. Otherwise, if β had previously occurred also by itself, either the path corresponding to β ends at a node (β has been followed by characters different from a), or the edge we end on is solid (β had been followed only by a).
In the former case, if there is no edge labeled b leaving the node, we create a new edge labeled b to the final node. In the latter case, we create a new node and connect it to the final node with an edge labeled b. Then, there may again be non-solid edges that have to be redirected to the newly created node.

Fig. 4. CDAWG at phase 7, extension 7. The new character is found at the end of a non-solid edge, while at extension 6 the path ended at the final node. Thus, the substring has to be removed from the class associated with the node the edge enters; that node is cloned, and the edge is redirected to the clone.

Splitting a Strict Class of Factors. Conversely, a substring that has been assigned to a strict class of factors has to be removed from the class if it does not occur as a suffix of the representative when a new character s_{i+1} is added to the string. Let α and β, α = cβ, be the two substrings assigned to the same class in the previous example. Now, suppose that in phase i+1 we have to insert β in the graph. In this case, s_{i+1} is the last character of β, and we find it at the end of the edge entering node p, which is non-solid, since β is not the representative of the class. Now we have two cases: s_{i+1} was found at the end of an edge that entered node p also at the previous extension, or we ended up somewhere else. In the former case, we had also inserted α at the previous extension of the same phase, therefore β still belongs to the same class. In the latter, we have detected an occurrence of β that is not a suffix of an occurrence of α, and we have to remove β from the class. To reflect this in the graph, we clone the node p into a new node q, and redirect the non-solid edge to q keeping the same label. The redirected edge becomes solid. An example is shown in Fig. 4. If some suffixes of β had also been previously assigned to the same class as β, in the following extensions we will again find s_{i+1} at the end of a non-solid edge entering p. These edges are redirected to q. It can be proved that it suffices to check only the last edge on each path to determine whether a class has to be split. No cloning takes place if a character is found at the end of an edge entering the final node. The two observations outlined above can be implemented in the algorithm by modifying Rules 2 and 3 accordingly.
It is worth mentioning that both the redirection of edges to a newly created node and node cloning can take place during the same phase. An example is shown in Fig. 5.
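The invariant that edge redirection and node cloning are meant to maintain (paths ending at the same node spell strings of one strict class of factors) can be checked on small inputs against a brute-force computation taken directly from the definitions of Section 2. The following is only an illustrative sketch, not the on-line algorithm: the function names `factor_classes` and `strict_classes` are hypothetical, and the running time is polynomial in |s| rather than linear. It relies on the observation from Section 2 that two substrings are congruent exactly when their occurrences end at the same positions.

```python
from collections import defaultdict


def factor_classes(s):
    """Map each class representative to the members of its class of factors.

    Substrings are grouped by their set of end positions in s, which is
    equivalent to the congruence associated with Suf(s).
    """
    ends = defaultdict(set)
    n = len(s)
    for i in range(n + 1):
        for j in range(i, n + 1):
            ends[s[i:j]].add(j)          # s[i:j] has an occurrence ending at j
    groups = defaultdict(list)
    for u, positions in ends.items():
        groups[frozenset(positions)].append(u)
    # The representative is the longest member; members are sorted by length.
    return {max(g, key=len): sorted(g, key=len) for g in groups.values()}


def strict_classes(s):
    """Keep only classes whose representative is followed by >= 2 characters."""
    strict = {}
    for rep, members in factor_classes(s).items():
        followers = {c for c in set(s) if rep + c in s}
        if len(followers) >= 2:
            strict[rep] = members
    return strict


if __name__ == "__main__":
    for rep, members in strict_classes("cocoa").items():
        print(repr(rep), members)
```

For the string cocoa of Fig. 1, the only strict classes found are the class of the empty string (the initial node) and the class {o, co}, which agrees with the merged node described in the caption of Fig. 1.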

Fig. 5. From left to right, CDAWG at phase 6, extensions 5, 6, and 7. A character is put in the graph after a substring, and the path located at the next extension is found in the middle of a non-solid edge (left), which is redirected to the newly created node (center). Then, at extension 7 (that is, after the empty string), the character is found at the end of a non-solid edge. The node that edge enters is thus cloned into a new node (right).

3.1 Using Suffix Links

Naively, locating the end of s[j..i] in extension j of phase i+1 would take O(i − j) time by walking from the initial node and matching the characters of s[j..i] along the edges of the graph. This would lead to an overall O(n^3) time complexity for the construction of the whole graph. We will now reduce it, as in [8], to O(n) by introducing suffix links and with some remarks.

Definition 3. Let p be a node of the graph, different from the initial or final node. Let β be the representative of the class associated with p. The suffix link of p, denoted by L(p), is the node q whose representative γ is the longest suffix of β whose path does not end at p.

The suffix link of a node p can be implemented with a pointer from p to L(p). If γ is empty, then L(p) is the initial node. Suffix links are not defined for the initial and the final node. Although the definition does not guarantee that every node in the graph has a suffix link, we can prove the following:

Lemma 1. Any node created during phase i+1 will have a suffix link from it by the end of the phase.

Proof. In extension j of phase i+1 a new node p can be created at the end of the path spelling substring s[j..i] by application of Rule 2 or by cloning. In the former case, L(p) will be the first node to be created or encountered at the end of the path corresponding to a suffix of s[j..i] (possibly after edge redirections). Such a node always exists, since the last extension locates the empty suffix at the initial node. In the latter case, let us suppose that node q is cloned into node p with path spelling s[j..i+1]. Substring s[j..i+1] is the longest suffix of the representative of q that does not belong to the same class. Thus, L(q) is set to p. Suffix link L(p) is left undefined until one of the suffixes of s[j..i+1] ends at a node other than p (which again could be the initial node).
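Definition 3 can also be evaluated by brute force on a finished string, which gives a reference value for the pointers the on-line algorithm sets. The sketch below is an assumption-laden illustration: the helper names `end_positions` and `suffix_link` are hypothetical, nodes are identified with the representatives of their classes, and the example string abcabcaba is chosen here for illustration only. It returns the longest suffix of a representative β whose occurrences do not end at exactly the same positions as those of β.

```python
def end_positions(s, u):
    """Positions (1-based) at which an occurrence of u ends in s.

    The empty string is taken to end at every position 0..len(s).
    """
    return frozenset(
        i + len(u)
        for i in range(len(s) - len(u) + 1)
        if s.startswith(u, i)
    )


def suffix_link(s, beta):
    """Representative reached by the suffix link of the node for beta.

    beta is assumed to be the representative of a strict class of factors
    of s. Returns None for the empty string (the initial node has no link).
    """
    target = end_positions(s, beta)
    for k in range(1, len(beta) + 1):
        gamma = beta[k:]                 # proper suffixes, longest first
        if end_positions(s, gamma) != target:
            return gamma                 # first suffix outside beta's class
    return None


if __name__ == "__main__":
    text = "abcabcaba"                   # illustrative string (an assumption)
    print(suffix_link(text, "abcab"))    # class {cab, bcab, abcab}
    print(repr(suffix_link(text, "ab"))) # class {b, ab}
```

In this example the node for abcab links to the node for ab, which in turn links to the initial node (the empty representative), matching the remark that L(p) is the initial node when γ is empty.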

Fig. 6. A suffix link. Node p corresponds to class αβ, node q corresponds to β. Paths labeled with suffixes of αβ longer than β end at p. If at some extension j a character s_{i+1} is added after αβγ, then extensions from j+1 to j+|α|−1 are implicitly performed as well.

During any phase, the only node of the graph other than the initial and the final node without a suffix link from it is the last created one. Let us suppose that the algorithm has completed extension j of phase i+1. Suffix links are used to speed up the search for the remaining suffixes of s[j..i]. Starting from the end of s[j..i] in the graph, we walk backwards along the path corresponding to s[j..i] up to either the initial node or a node p that has a suffix link. This requires traversing at most one edge. Let γ be the concatenation of the edge labels of the path from p to the end of s[j..i]. If p is not the initial node, we move to node L(p) and follow from it the path spelling γ. Otherwise, we search for s[j+1..i] starting from the initial node. Finally we add s_{i+1} according to one of the extension rules, redirecting an edge or cloning a node if needed. Notice that, if node p is the end of l ≥ 2 different paths, the position reached after searching for γ from L(p) will be the end of path s[j+l..i], that is, extensions from j+1 to j+l−1 have been implicitly performed at extension j. A path spelling γ starting from L(p) always exists, since all the suffixes of s[j..i] are already in the graph. Thus, to find the path spelling γ the algorithm just matches the first characters on the edges encountered. To obtain a linear time algorithm, we need just two more tricks.

Remark 1. When during any extension Rule 3 is applied, that is, a given substring s[j..i+1] is already on the graph, then the same rule will apply to all further extensions, since all the suffixes of s[j..i+1] are already in the graph as well. Therefore, once Rule 3 is applied (and no node has to be cloned or edges redirected), we can stop and move on to the next phase, since all the strings to be inserted are already in the graph and no adjustment is needed for the classes.

Remark 2. If a new edge is created entering the final node during extension j of any phase i, then Rule 1 will always apply at extension j in any successive phase. That is, new characters will always be appended at the end of the last edge in the path associated with s[j..i], which will enter the final node. Thus, when a new edge is created entering the final node with label s[j..i+1], we label it with integers h and e (j ≤ h ≤ i+1), where e denotes the current phase, that is, the current end position in the string. If we implement e with a global variable, and set it to i+1 at the beginning of each phase i+1, we perform implicitly all the extensions that would end up at the final node.

Every phase i starts with a series of applications of Rules 1 and 2, which put s_i at the end of an edge entering the final node; when Rule 3 is applied for the first time, it will also be applied to all further extensions. Now, let j_i be the first extension where Rule 3 is applied with cloning in phase i, and j'_i the first extension where it is applied without edge redirection to the cloned node. Extensions j_i + 1 to j'_i will redirect edges to the last node created. Extensions from j'_i + 1 to i need not be performed, since in each of them we would not do anything. In phase i+1, all extensions from 1 to j_i will apply Rule 1, therefore they are implicitly performed by setting the counter e to i+1. Thus, we can start phase i+1 directly from extension j'_i, until we find an extension where Rule 3 is applied without cloning or edge redirection. This can be done by starting phase i+1 from the position in the graph of the last suffix of s[1..i] that had to be redirected to the cloned node. This took place at extension j'_i. The first extension in phase i+1 will have to look for s_{i+1} exactly at the endpoint of the last extension of phase i. This will also implicitly perform all extensions from j_i to j'_i. Of course, if in phase i Rule 3 is first applied without cloning we can move on to phase i+1 as well.

The algorithm does not need to know which extension it is currently performing. That is, it starts phase i+1 from the endpoint of phase i, adding s_{i+1}. Then it starts moving in the graph by using suffix links, adding s_{i+1} at the end of each path. If the backward walk ends at the initial node, and γ = γ_1 ... γ_k is the label of the path traversed, then it looks for the path labeled γ_2 ... γ_k. Phase i+1 ends when the algorithm applies Rule 3 for the first time without node cloning or edge redirection. Moreover, whenever we find s_{i+1} at the end of a non-solid edge, we no longer have to check what happened at the previous extension, and just clone the node.
In fact, if the representative of the class had been met during one of the previous extensions, we would have stopped the phase at that point, without reaching the current extension.

At the end of phase n, we have constructed the implicit CDAWG for string s. In order to obtain the actual CDAWG, we perform an additional extension phase n+1, extending the string with a dummy symbol $ that does not belong to the string alphabet. However, we do not increment the phase counter e to n+1, so as to avoid appending $ to edges entering the final node. Moreover, whenever a new node p has to be created, we do not add the edge from p to the final node labeled $ to the graph. Nodes created in this phase will thus have outdegree one, and will correspond to terminal nodes of the CDAWG. Notice that, whenever a path s[j..n] ends along an edge, we always create a new node and mark it as terminal, while cloning of nodes and redirection of edges work as in the previous phases. When a path s[j..n] ends at a node, we mark the node as terminal. At the end of the additional phase, the implicit CDAWG has been transformed into the actual CDAWG for string s. An example of the on-line construction of a CDAWG is shown in Fig. 7.
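The global end counter e of Remark 2, combined with the integer-pair edge labels assumed at the end of Section 2, can be sketched as follows. This is a minimal illustration under stated assumptions: the class names `End` and `Edge`, the `target` field holding a plain string, and the use of the text cocoa are all hypothetical choices, and the sketch shows only how open labels grow, not the construction algorithm itself.

```python
class End:
    """Shared, mutable end position e (the current phase)."""

    def __init__(self, value=0):
        self.value = value


class Edge:
    """Edge whose label is stored as a pair of 1-based text positions."""

    def __init__(self, start, end, target):
        self.start = start    # position h where the label starts
        self.end = end        # int for a closed label, End for an open one
        self.target = target  # destination node (any object)

    def stop(self):
        """Current end position of the label."""
        return self.end.value if isinstance(self.end, End) else self.end

    def label(self, text):
        """Substring of text spelled by this edge."""
        return text[self.start - 1:self.stop()]


if __name__ == "__main__":
    text = "cocoa"
    e = End()
    open_edge = Edge(1, e, "final")   # enters the final node: label (h, e)
    closed_edge = Edge(2, 2, "q")     # internal edge with a fixed label
    for phase in range(1, len(text) + 1):
        e.value = phase               # one assignment extends every open edge
        print(phase, open_edge.label(text), closed_edge.label(text))
```

Setting `e.value` once per phase plays the role of all the Rule 1 extensions of that phase: every edge entering the final node sees its label grow by one character without being visited.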

Fig. 7. From left to right, construction of the CDAWG for a string: at the end of phase 6 (implicit CDAWG for the prefix of length 6); at the end of phase 7 (where three substrings belong to the same strict class of factors); at the end of phase 8 (where two of them have been removed from the class); the final structure. Stars indicate the position in the graph reached at the end of the last explicit extension of each phase.

With arguments analogous to those for Ukkonen's algorithm for suffix trees, we can prove the following:

Theorem 1. Given a string s = s_1 ... s_n over a finite alphabet Σ, the algorithm implemented with suffix links and implicit extensions builds the CDAWG for s in O(n) time and O(n|Σ|) space if the graph is implemented with a transition matrix, or in O(n|Σ|) time and O(n) space with adjacency lists.

Proof (Sketch). The operations performed in any explicit extension (creation or cloning of nodes, edge redirections), that is, in extensions that are not performed implicitly by incrementing the e counter, take constant time. Let j'_i be the last explicit extension performed at phase i, and j_{i+1} the first explicit extension performed at phase i+1. In the worst case, we have j_{i+1} = j'_i. Moreover, for each i, j_i ≤ j_{i+1}. Thus, at most 3n explicit extensions are performed by the algorithm. At any extension j of phase i, to locate the endpoint of s[j..i] the algorithm walks back at most one edge from the endpoint reached at the previous extension, follows a suffix link, and then traverses some edges checking the first symbol on each edge. If the graph is implemented with a transition matrix, traversing an edge takes constant time. Otherwise, it takes O(|Σ|) time. The only thing unaccounted for is the overall number of edges traversed. For every node p of the graph, let the node depth of p be the number of nodes on the path from the root to p labeled with the representative of the class associated with p. As in [8], the sum of the node depths counted during all the explicit extensions is reduced at most by O(n), and since the maximum node depth is n, the maximum number of edges traversed is bounded by O(n).
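The two graph representations named in Theorem 1 differ only in how a node finds the outgoing edge for a given first character. The sketch below illustrates that trade-off under assumptions of our own: the class names `MatrixNode` and `ListNode` are hypothetical, characters are assumed to be pre-mapped to integer codes 0..|Σ|−1, and edges are represented by arbitrary objects.

```python
class MatrixNode:
    """Transition-matrix row: O(1) lookup, O(|Sigma|) space per node."""

    def __init__(self, alphabet_size):
        self.next = [None] * alphabet_size

    def get(self, code):
        return self.next[code]

    def put(self, code, edge):
        self.next[code] = edge


class ListNode:
    """Adjacency list: O(|Sigma|) lookup, space proportional to outdegree."""

    def __init__(self):
        self.edges = []                       # list of (code, edge) pairs

    def get(self, code):
        for c, edge in self.edges:            # linear scan over outgoing edges
            if c == code:
                return edge
        return None

    def put(self, code, edge):
        for k, (c, _) in enumerate(self.edges):
            if c == code:
                self.edges[k] = (code, edge)  # replace, e.g. on redirection
                return
        self.edges.append((code, edge))


if __name__ == "__main__":
    for node in (MatrixNode(3), ListNode()):
        node.put(0, "edge-a")
        node.put(2, "edge-c")
        print(type(node).__name__, node.get(0), node.get(1), node.get(2))
```

Both classes expose the same `get` and `put` operations, so the rest of a construction routine could be written against either one; the choice only moves the |Σ| factor between the time bound and the space bound.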

4 The CDAWG for a Set of Strings

The basic idea of the CDAWG for a set of strings S = {s_1, ..., s_k} is the same as that of the single string structure. Now, the nodes of the structure correspond to patterns that occur as prefix of the same suffixes in every string of the set. In other words, given Suf(S) (the set of the suffixes of the k strings), the nodes of the CDAWG correspond to strict classes of factors for ≡_{Suf(S)}. The only difference is that now we have k final nodes, one for each string, and we want all the suffixes of s_i to end at the corresponding i-th final node. This result can be obtained by appending a different termination symbol, not belonging to the string alphabet, to each string of the set. More formally:

Definition 4. The CDAWG for a set of strings s_1 ... s_k is a directed acyclic graph, with a node marked as initial and k distinct nodes marked as final. Edges are labeled with non-empty substrings of at least one of the strings. Labels of two edges leaving the same node cannot begin with the same character. For every string s_i in the set, all suffixes of s_i are spelled by paths starting at the initial node and ending at the i-th final node. Paths ending at non-final nodes correspond to strict classes of factors of the congruence relation ≡_{Suf(S)}.

The CDAWG for a set of strings can be constructed with the algorithm presented in the previous section. First, we build the CDAWG for string s_1 (with the termination symbol) and the first final node. Notice that, since the termination symbol does not occur anywhere else in s_1, the resulting structure is a CDAWG, with no need to perform the additional phase. Then, string s_2 is added to the graph, but in this case with the second final node. The same will apply to every other string in the set. Node cloning and edge redirection rules ensure the correctness of the resulting structure. It can be proved that the algorithm takes O(N) time to construct the structure, implemented with a transition matrix, where N = Σ_{i=1}^{k} |s_i|. This structure (with marginal differences) was first described in [3], where it was built by reducing a DAWG. Therefore, adding a new string to the set required the construction of a new DAWG from scratch. The algorithm presented here, instead, makes it possible to add strings directly to the compact structure (see Fig. 8). As in [3] we can give an upper bound on the size of the structure.

Theorem 2 (Blumer et al., [3]). The CDAWG for a set of strings s_1 ... s_k has at most N + k nodes, where N = Σ_{i=1}^{k} |s_i|.

5 Conclusions

A CDAWG is a space efficient text indexing structure that represents all the substrings of a string. We presented a new on-line algorithm for its construction, as well as for the construction of a CDAWG for a set of strings. The same structures can be computed by reduction starting from the corresponding DAWGs or suffix trees; however, the approach presented in this paper makes it possible to save time and space simultaneously, since the CDAWGs can be built directly. Moreover, once the structure has been built for a set of strings, new strings can be added directly to the compact structure.

Fig. 8. CDAWG for two strings terminated by $_1 and $_2, after the insertion of the first string (left) and of the second string (right). Characters $_1 and $_2 are used as terminations. The edges labeled $_1 and $_2 that enter the two final nodes have been omitted.

Acknowledgements

The Milan-Bicocca group has been supported by the Italian Ministry of University, under the project "Bioinformatics and Genomic Research".

References

1. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM J. Computing, 22(5):935–948, 1993.
2. J. Kärkkäinen. Suffix cactus: a cross between suffix tree and suffix array. Combinatorial Pattern Matching, 937:191–204, July 1995.
3. A. Blumer, J. Blumer, D. Haussler, R. McConnell, and A. Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM, 34(3):578–595, 1987.
4. A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. Chen, and J. Seiferas. The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40:31–55, 1985.
5. M. Crochemore and R. Vérin. On compact directed acyclic word graphs. Springer-Verlag LNCS 1261, pp. 192–211, 1997.
6. D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, 1997.
7. E. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262–272, 1976.
8. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.
9. S. Inenaga, H. Hoshino, A. Shinohara, M. Takeda, and S. Arikawa. On-line construction of compact directed acyclic word graphs. DOI Technical Report 183, Kyushu University, January 2001.