
Title: Studies on Efficient Index Construction for Multiple and Repetitive Texts
Author(s): 髙木, 拓也
Issue Date:
DOI: /doctoral.k13077
Doc URL:
Type: theses (doctoral)
File Information: Takuya_Takagi.pdf
Instructions for use
Hokkaido University Collection of Scholarly and Academic Papers

Studies on Efficient Index Construction for Multiple and Repetitive Texts

Takuya Takagi

January 2018

Division of Computer Science and Information Technology
Graduate School of Information Science and Technology
Hokkaido University


Abstract

The text indexing problem is one of the fundamental problems in computer science, and the aim is to construct an efficient data structure that answers queries such as text pattern matching. For the last decades, there has been an increasing amount of multiple texts, such as data generated from multiple sensors, and of repetitive texts, such as genome sequence collections. For example, the GeoLife Project collects trajectories from GPS loggers that have a variety of sampling rates. These trajectories were recorded every 1 to 5 seconds or every 5 to 10 meters per point. For another example, the 1000 Genomes Project collects human genomes from various groups. Since the genomes are similar to each other, the same substructures appear repeatedly in this genome database. These projects aim at data analysis, information retrieval, and data mining for text information. For pattern matching, which is the most fundamental query for texts, we can answer queries by using basic text pattern matching algorithms such as the Knuth-Morris-Pratt (KMP) algorithm and the Boyer-Moore (BM) algorithm. Since these algorithms scan the texts for each query, each query requires at least linear time in the database size. In order to quickly process these data, preprocessing and indexing are important. For example, the suffix tree, one of the basic text indexes, can support pattern matching in time linear in the pattern length. Therefore, building an efficient index structure is the key to processing these large amounts of text information. In this thesis, we show efficient index construction algorithms for text data. For multiple texts and repetitive texts, there are several problems with indexing.

Since data grow constantly for multiple sensor data such as GPS trajectories, it is necessary for the index to support online construction for multiple texts. For repetitive texts, that is, similar text collections such as genome sequences, we should be able to build an index of a more compressed size. In order to solve these problems, we propose several new index structures and construction algorithms. In particular, this thesis deals with speeding up construction and operations of indexes, online construction of indexes for multiple texts, and construction of compressed indexes for texts including long repetitions. In Chapter 3, we propose a faster version of labeled trees (compact tries) called packed compact tries, by using a bit-parallel method. By doing this, we show faster construction of text indexes such as suffix trees and faster operations such as prefix search, insertion, and deletion. Since the compact trie is a widely used data structure, we can speed up several algorithms by using packed compact tries. In particular, we show that LZ-double factorization, which is one kind of text compression algorithm, is sped up. In Chapter 4, we first define the fully-online construction problem, a setting that allows a new input symbol to be added to an arbitrary string of the set of input strings. To solve this problem, we first show a fully-online construction algorithm of a DAG index called the directed acyclic word graph (DAWG). We also propose a fully-online construction algorithm for the suffix tree using the similarity between DAWGs and suffix trees. In Chapter 5, we propose a self-indexing method by combining an index called the compact directed acyclic word graph (CDAWG) with grammar compression, which is one of the compression methods. When the input text is compressible, the index can be held in a size smaller than the original text. In Chapter 6, we give conclusions and future work. Overall, we studied efficient algorithms for text index construction in this thesis.

Contents

1 Introduction
  1.1 Background
  1.2 Research goals
  1.3 Summary of the results
  1.4 Contributions of this thesis
2 Preliminaries
  2.1 Notations on Strings
  2.2 Notations on graphical indexes
  2.3 Suffix tries
  2.4 Suffix trees
  2.5 Directed acyclic word graphs (DAWGs)
  2.6 Duality of suffix trees and DAWGs
  2.7 Compact directed acyclic word graphs (CDAWGs)
3 Packed Compact Tries
  3.1 Background
    3.1.1 Related work
  3.2 Preliminaries
    3.2.1 Compact tries
    3.2.2 Dynamic predecessor data structures
  3.3 Packed dynamic compact tries
    3.3.1 Micro dynamic compact tries for short strings
    3.3.2 Packed dynamic compact tries for long strings
    3.3.3 Micro trie decomposition
    3.3.4 Speeding-up with hashing
  3.4 Applications to online string processing
  3.5 Preliminary experiments
  3.6 Conclusions of Chapter 3
4 Fully-online Construction of Suffix Trees for Multiple Texts
  4.1 Background
    4.1.1 Related work
  4.2 Preliminaries
    4.2.1 Suffix trees and DAWGs for multiple texts
    4.2.2 Fully-online text collection
  4.3 Fully-online version of DAWG and Weiner's suffix tree algorithm
    4.3.1 Semi-online construction of Weiner's suffix trees and DAWGs
    4.3.2 Fully-online construction of Weiner's suffix trees and DAWGs
  4.4 Fully-online version of Ukkonen's suffix tree algorithm
    4.4.1 Semi-online left-to-right suffix tree construction
    4.4.2 Difficulties in fully-online left-to-right suffix tree construction
    4.4.3 Fully-online left-to-right suffix tree algorithms
  4.5 Conclusions of Chapter 4
5 Linear-size Compact Directed Acyclic Word Graphs
  5.1 Background
  5.2 Preliminaries
    5.2.1 LSTrie
    5.2.2 Straight-line programs
  5.3 The proposed data structure: L-CDAWG
    5.3.1 Outline
    5.3.2 Constructing type-2 nodes and edge suffix links
    5.3.3 Construction of the SLP for L-CDAWG
    5.3.4 The main result
  5.4 Conclusions of Chapter 5
6 Conclusions and Future Work
  6.1 Summary of the results
  6.2 Future work


Chapter 1

Introduction

1.1 Background

For the last decades, there has been an increasing amount of unstructured data such as genetic data, logging data, and Web and SNS texts, which have been coined as big data. Most of these unstructured data are available in the form of text information. Therefore, there are demands for algorithms and data structures that can efficiently handle these big unstructured data. Multiple growing texts and repetitive texts are characteristic features of text data such as logging data and genetic data. Multiple growing texts are a text set in which a new symbol can be appended at the end of any text in the set. There are many text data with this feature in the real world. For example, due to the rapid development of network and sensor technologies, various and enormous stream data are generated from multiple sources such as GPS trajectory data [72], sensor streams, and Twitter streams. These are represented as multiple texts or multiple sequences that are constantly growing. Another feature of textual big data is called repetitiveness [56]. It means a kind of text set consisting of similar texts. For example, genome sequences [21] and versioned document collections such as software repositories are among the highly repetitive texts.

These data contain many long repetitions in the text.

1.2 Research goals

In order to use big data, it is necessary to perform various queries such as data mining and information retrieval. However, because of the massive amount of data, even simple queries such as text pattern matching take too much time. One of the solutions is to preprocess those data and create an index that supports the query in order to answer quickly. Among indexes for texts, those indexes that have all substring information of each text support the most diverse queries. In this thesis, we study efficient index construction for multiple texts and repetitive texts. There are the following demands for the construction of indexes for these text data. First, in order to process a large amount of data at high speed, we want an index that supports fast queries. Second, in order to construct an index for multiple growing texts, we need an index that enables online construction for multiple texts. Finally, to store a large amount of data, the index must be small in size.

1.3 Summary of the results

In this thesis, there are three main results on text indexing, as follows. In Chapter 2, we introduce notations and definitions of some data structures. In Chapter 3, we study acceleration of compact tries using the packed string technique. The dynamic compact trie [42, 65] is a fundamental data structure for storing a set of variable-length strings. It can store a set of k strings over an alphabet Σ with total size n in O(n log n) bits of space. We propose packed compact tries that support faster prefix search queries and update operations of compact tries on the standard word RAM model, while still keeping n log σ + O(k log n) bits of space.

In Chapter 4, we study fully-online construction of DAWGs and suffix trees for multiple texts. Let T = {T_1, ..., T_K} be a collection of texts. By fully-online, we mean that a new character can be appended to any text in T at any time. This is a natural generalization of semi-online construction of indexing data structures for multiple texts, in which, after a new character is appended to the k-th text T_k, the previous texts T_1, ..., T_{k-1} remain static. We propose fully-online algorithms which construct the directed acyclic word graph (DAWG) [14] and the generalized suffix tree (GST) [42] for T in O(n log σ) time and O(n) words of space, where n and σ denote the total length of the texts in T and the alphabet size, respectively. In Chapter 5, we study a compressed index combining CDAWGs and grammar compression. Recent studies have shown that the compact directed acyclic word graph (CDAWG) [15] topology achieves a compressed size for repetitive strings. However, there is no known method for supporting high-speed search within this compressed size without keeping the original input string. The linear-size CDAWG proposed in this thesis achieves the compressed size while supporting search times similar to the original CDAWG. In Chapter 6, we give the summary of this thesis, and then discuss possible future research.

1.4 Contributions of this thesis

We studied three fundamental problems which are necessary when we construct an index that can efficiently handle a massive amount of text data. A versatile text index has three features: high-speed queries, fully-online construction, and small space complexity. Each result of this thesis shows an index which achieves one of the three features. First, as a basis of efficient text indexes allowing high-speed query processing, we proposed an improved data structure supporting high-speed construction and queries by using bit-parallel methods. Secondly, for multiple growing texts like stream data

from multiple sensors, we proposed a construction algorithm of an index in a fully-online manner. Thirdly, for texts that contain many repetitive structures, we proposed an index that can capture the repeating structure and store it in a compressed size. Overall, we studied efficient algorithms for text index construction which are a basis to achieve an index with the three features.

Chapter 2

Preliminaries

In this chapter, we introduce basic definitions and notations on strings, suffix tries, suffix trees, directed acyclic word graphs, and compact directed acyclic word graphs according to [24-26, 42].

2.1 Notations on Strings

Let Σ be an ordered alphabet. Any element of Σ* is called a string. For any string T, let |T| denote its length. Let ε be the empty string, namely, |ε| = 0. If T = XYZ, then X, Y, and Z are called a prefix, substring, and suffix of T, respectively. For any 1 ≤ i ≤ j ≤ |T|, let T[i..j] denote the substring of T that begins at position i and ends at position j in T. For any 1 ≤ i ≤ |T|, let T[i] denote the i-th character of T. For any string T, let Suffix(T) denote the set of suffixes of T, and for any set T of strings, let Suffix(T) denote the set of suffixes of all strings in T; namely, Suffix(T) = ⋃_{T ∈ T} Suffix(T). For any string T, let T̄ denote the reversed string of T, i.e., T̄ = T[|T|] ⋯ T[1]. Let T = {T_1, ..., T_K} be a collection of K texts. For any 1 ≤ k ≤ K, let lrs_T(T_k) be the longest repeating suffix of T_k that occurs at least twice in T.
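To make the notation concrete, the following short Python sketch (ours, purely illustrative and not part of the thesis) spells out Suffix(T) for a text collection, the reversed string, and the longest common prefix of two strings; the function names simply mirror the notation above.

    def suffixes(T):
        """Suffix(T): the set of non-empty suffixes of a single string T."""
        return {T[i:] for i in range(len(T))}

    def suffixes_of_collection(texts):
        """Suffix(T) for a collection T = {T_1, ..., T_K}: the union of the suffix sets."""
        result = set()
        for T in texts:
            result |= suffixes(T)
        return result

    def reverse(T):
        """The reversed string of T, i.e. T[|T|] ... T[1]."""
        return T[::-1]

    def lcp(X, Y):
        """LCP(X, Y): the longest common prefix of X and Y."""
        i = 0
        while i < min(len(X), len(Y)) and X[i] == Y[i]:
            i += 1
        return X[:i]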

For any strings X, Y, LCP(X, Y) denotes the longest common prefix of X and Y. Throughout this thesis, the base of the logarithms will be 2, unless otherwise stated. For any integers i ≤ j, [i, j] denotes the interval {i, i + 1, ..., j}. Our model of computation is the standard word RAM of word size w = log n bits. For simplicity, we assume that w is a multiple of log σ, so that α = log_σ n letters are packed in a single word. Since we can read w bits in constant time, we can read and process α consecutive letters in constant time.

2.2 Notations on graphical indexes

All index structures dealt with in this thesis, such as suffix tries, suffix trees, CDAWGs, linear-size suffix tries (LSTries), and linear-size CDAWGs (L-CDAWGs), are graphical indexes in the sense that an index is a pointer-based structure built on an underlying DAG G_L = (V(L), E(L)) with a root r ∈ V(L) and a mapping lab : E(L) → Σ+ that assigns a label lab(e) to each edge e ∈ E(L). For an edge e = (u, v) ∈ E(L), we denote its end points by e.hi := u and e.lo := v, respectively. The label string of e is lab(e) ∈ Σ+. The string length of e is slen(e) := |lab(e)| ≥ 1. An edge is called atomic if slen(e) = 1, and thus lab(e) ∈ Σ. For a path p = (e_1, ..., e_k) of length k ≥ 1, we extend its end points, label string, and string length by p.hi := e_1.hi, p.lo := e_k.lo, lab(p) := lab(e_1) ⋯ lab(e_k) ∈ Σ+, and slen(p) := slen(e_1) + ⋯ + slen(e_k) ≥ 1, respectively.

2.3 Suffix tries

The suffix trie for a text collection T = {T_1, ..., T_K}, denoted STrie(T), is a trie which represents Suffix(T). The size of STrie(T) is O(n²), where n is the total length of the texts in T. We identify each node v of STrie(T) with the string that v represents.
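As a concrete, if naive, illustration of STrie(T), the sketch below (ours; the dictionary-of-children representation is our own choice, not the thesis's) builds an uncompacted suffix trie for a small text collection by inserting every suffix character by character; it is quadratic in the total text length, matching the O(n²) size bound above.

    class TrieNode:
        def __init__(self):
            self.children = {}  # maps a single character to a child TrieNode

    def build_suffix_trie(texts):
        """Builds STrie(T) for a collection of texts by inserting all suffixes."""
        root = TrieNode()
        for T in texts:
            for i in range(len(T)):
                node = root
                for c in T[i:]:                  # walk/insert the suffix T[i..]
                    node = node.children.setdefault(c, TrieNode())
        return root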

[Figure 2.1: Illustration of STrie(T), STree(T), DAWG(T), and CDAWG(T) for a small example text T. Path compaction turns the suffix trie into the suffix tree and the DAWG into the CDAWG, while minimization turns the suffix trie into the DAWG and the suffix tree into the CDAWG. The solid arrows and broken arrows represent the edges and the suffix links of each data structure, respectively.]

A substring x of a text in T is said to be branching in T if there exist two distinct characters a, b ∈ Σ such that both xa and xb are substrings of some texts in T. Clearly, a node x of STrie(T) is branching iff x is branching in T. For each node av of STrie(T) with a ∈ Σ and v ∈ Σ*, let slink(av) = v. This auxiliary edge slink(av) = v from av to v is called a suffix link. We define the reversed suffix link W_a(v) = av iff slink(av) = v. For any node v and a ∈ Σ, if av is not a substring of the texts in T, then W_a(v) is undefined. By definition, the reversed suffix links on STrie(T) form a rooted tree which coincides with STrie(T̄), the suffix trie for the collection T̄ = {T̄_1, ..., T̄_K} of the reversed texts.
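The branching condition is easy to test directly; the sketch below (ours, for illustration only) checks whether a substring x is branching in a collection by collecting the distinct characters that follow its occurrences.

    def is_branching(texts, x):
        """x is branching in T iff at least two distinct characters extend x in T."""
        extensions = set()
        for T in texts:
            start = 0
            while True:
                pos = T.find(x, start)
                if pos == -1:
                    break
                if pos + len(x) < len(T):        # an occurrence followed by a character
                    extensions.add(T[pos + len(x)])
                start = pos + 1                  # allow overlapping occurrences
        return len(extensions) >= 2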

2.4 Suffix trees

The suffix tree [68] for a text collection T, denoted STree(T), is a compacted trie which represents Suffix(T). STree(T) is obtained by compacting every path of STrie(T) which consists of non-branching internal nodes (see Fig. 2.1). Since every internal node of STree(T) is branching, and since there are at most n leaves in STree(T), the numbers of edges and nodes are O(n). The edge labels of STree(T) are non-empty substrings of some text in T. By representing each edge label x with a triple ⟨k, i, j⟩ of integers s.t. x = T_k[i..j], STree(T) can be stored in O(n) space. We say that any branching (resp. non-branching) substring of T is an explicit node (resp. implicit node) of STree(T). An implicit node x is represented by a triple (v, a, l), called a reference to x, such that v is an explicit ancestor of x, a is the first character of the path from v to x, and l is the length of the path from v to x. A reference (v, a, l) to a node x is called canonical if v is the lowest explicit ancestor of x. For each explicit node av of STree(T) with a ∈ Σ and v ∈ Σ*, let slink(av) = v. For each explicit node v and a ∈ Σ, we also define the reversed suffix link W_a(v) = avx, where x ∈ Σ* is the shortest string such that avx is an explicit node of STree(T). W_a(v) is undefined if av is not a substring of the texts in T. These reversed suffix links are also called Weiner links (or W-links in short) in the literature [16]. A W-link W_a(v) = avx is said to be hard if x = ε, and soft if x ∈ Σ+. Let w_a be a Boolean function such that for any explicit node v and a ∈ Σ, w_a(v) = 1 iff a (soft or hard) W-link W_a(v) exists. Notice that if w_a(v) = 1 for a node v and a ∈ Σ, then w_a(u) = 1 for every ancestor u of v.
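To illustrate the O(n)-space edge representation, the sketch below (an illustrative convention of ours, not the thesis's data layout) stores a suffix tree edge label as a triple (k, i, j) into the text collection and recovers the label string on demand.

    from dataclasses import dataclass

    @dataclass
    class EdgeLabel:
        """An edge label stored as a triple (k, i, j) meaning T_k[i..j], 1-based inclusive."""
        k: int
        i: int
        j: int

        def materialize(self, texts):
            """Recover the label string from the text collection (O(length) time)."""
            return texts[self.k - 1][self.i - 1:self.j]

    # Example: with texts = ["banana"], EdgeLabel(1, 2, 4).materialize(texts) == "ana".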

2.5 Directed acyclic word graphs (DAWGs)

The directed acyclic word graph (DAWG in short) [14, 15] of a text collection T, denoted DAWG(T), is the smallest DAG which represents Suffix(T). DAWG(T) is obtained by merging identical subtrees of STrie(T) connected by the suffix links (see Fig. 2.1). Hence, the label of every edge of DAWG(T) is a single character. The numbers of nodes and edges of DAWG(T) are O(n) [15], and hence DAWG(T) can be stored in O(n) space. DAWG(T) can be defined formally as follows. For any string x, let Epos_T(x) be the set of ending positions of x in the texts in T, i.e., Epos_T(x) = {(k, j) | x = T_k[j − |x| + 1..j], 1 ≤ j ≤ |T_k|, 1 ≤ k ≤ K}. Consider an equivalence relation ≡_T on substrings x, y of texts in T such that x ≡_T y iff Epos_T(x) = Epos_T(y). For any substring x of texts of T, let [x]_T denote its equivalence class w.r.t. ≡_T. There is a one-to-one correspondence between each node v of DAWG(T) and each equivalence class [x]_T, and hence we will identify each node v of DAWG(T) with its corresponding equivalence class [x]_T. Let long([x]_T) denote the longest member of [x]_T. By the definition of the equivalence classes, long([x]_T) is unique for each [x]_T and every member of [x]_T is a suffix of long([x]_T). If x and xa are substrings of some text in T with x ∈ Σ* and a ∈ Σ, then there exists an edge labeled with the character a ∈ Σ from node [x]_T to node [xa]_T. This edge is called primary if |long([x]_T)| + 1 = |long([xa]_T)|, and is called secondary otherwise. For each node [x]_T of DAWG(T) with |x| ≥ 1, let slink([x]_T) = [y]_T, where y is the longest suffix of long([x]_T) which does not belong to [x]_T. Fig. 2.1 gives an example: there, one node's class contains two substrings; the edge that extends the longest member of its source class by one character is primary, while the parallel edge that extends only the shorter member is secondary, and the suffix link of the class points to the class of its longest suffix outside the class.
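The Epos-based definition can be checked directly on a toy text; the following sketch (ours, purely illustrative and far from the efficient constructions the thesis builds on) groups the substrings of a single text by their ending-position sets, which is exactly the node set of DAWG(T) restricted to that text.

    from collections import defaultdict

    def epos(T, x):
        """Ending positions (1-based) of substring x in a single text T."""
        return frozenset(j + len(x) for j in range(len(T) - len(x) + 1)
                         if T[j:j + len(x)] == x)

    def dawg_classes(T):
        """Group all non-empty substrings of T by Epos; each group is a DAWG node."""
        groups = defaultdict(set)
        for i in range(len(T)):
            for j in range(i + 1, len(T) + 1):
                x = T[i:j]
                groups[epos(T, x)].add(x)
        return list(groups.values())

    # Example: for T = "abcab" (our own toy text), the class {"ab", "b"} arises
    # because both substrings end exactly at positions 2 and 5.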

2.6 Duality of suffix trees and DAWGs

There exists a nice duality between suffix trees and DAWGs. To observe this, it is convenient to consider the collection of the reversed texts, each of which begins with a distinct special marker $_i. For ease of notation, let S_k = T̄_k for 1 ≤ k ≤ K and S = {$_1 S_1, ..., $_K S_K} = {$_1 T̄_1, ..., $_K T̄_K}. Then, it is known (c.f. [14, 15, 25]) that the reversed suffix links of DAWG(S) coincide with the suffix tree STree(T) for the original text collection T. This fact can also be observed from the other direction. Namely, the hard (resp. soft) W-links of STree(T) coincide with the primary (resp. secondary) edges of DAWG(S). Intuitively, this duality holds because (1) the reversed suffix links of STrie(S) form STrie(T) (and vice versa), and (2) when we construct DAWG(S) from STrie(S), we merge isomorphic subtrees that are connected by suffix links. During this merging process, the reversed suffix links get compacted and the resulting compacted links form the edges of STree(T). Using this duality, we can immediately show that the total number of hard and soft W-links is linear in the total text length n, since the number of edges of the DAWG is linear in n. This also means that we can easily maintain the Boolean indicator w_a with O(n) space, so that w_a(v) for a given node v and a ∈ Σ can be answered in O(log σ) time (e.g., at each node v we can maintain a BST storing only the characters c s.t. w_c(v) = 1).

2.7 Compact directed acyclic word graphs (CDAWGs)

The compact directed acyclic word graph [15, 26] for a text T, denoted CDAWG(T), is the minimal compact automaton which represents Suffix(T). CDAWG(T) can be obtained from STree(T$) by merging isomorphic subtrees and deleting the associated end-marker $ ∉ Σ. Since CDAWG(T) is an edge-labeled DAG, we represent a directed edge from node u to v with label string x ∈ Σ+ by a triple f = (u, x, v). For any node u, the label strings of the out-going edges from u start with mutually distinct characters. Formally, CDAWG(T) is defined as follows. For any strings x, y, we denote x ≡_L y (resp. x ≡_R y) iff the sets of beginning positions (resp. ending positions) of x and y in T are equal. Let [x]_L (resp. [x]_R) denote the equivalence class of strings w.r.t. ≡_L (resp. ≡_R). All strings that are not substrings of T form a single equivalence class, and in the sequel we will consider only the substrings of T. Let x→ (resp. ←x) denote the longest member of the equivalence class [x]_L (resp. [x]_R). Notice that each member of [x]_L (resp. [x]_R) is a prefix of x→ (resp. a suffix of ←x). Let ←x→ = ←(x→) = (←x)→. We denote x ≡ y iff ←x→ = ←y→, and let [x] denote the equivalence class w.r.t. ≡. The longest member of [x] is ←x→ and we will also denote it by value([x]). We define CDAWG(T) as an edge-labeled DAG (V, E) such that V = {[←x→]_R | x is a substring of T}, and for each node [←x→]_R and each character a ∈ Σ such that ←x→a is a substring of T, E contains the edge ([←x→]_R, α, [←xa→]_R) whose label α ∈ Σ+ is the string satisfying ←x→α = (←x→a)→. The → operator corresponds to compacting non-branching edges (like the conversion from STrie(T) to STree(T)) and the [·]_R operator corresponds to merging isomorphic subtrees of STree(T). For simplicity, we use the notation so that when we refer to a node of CDAWG(T) as [x], this implies x = ←x→ and [x] = [←x→]_R. Let [x] be any node of CDAWG(T) and consider the suffixes of value([x]) which correspond to the suffix tree nodes that are merged when transformed into the CDAWG. We define the suffix link of a node [x] by slink([x]) = [y] iff y is the longest suffix of value([x]) that does not belong to [x]. It is shown that all nodes of CDAWG(T) except the sink correspond to the maximal repeats of T; actually, value([x]) is a maximal repeat in T [58]. Following this fact, one can easily see that the numbers of edges of CDAWG(T) and CDAWG(T̄) coincide with the numbers e^r_T and e^l_T of right- and left-extensions of maximal repeats of T, respectively [9, 58]. By representing each edge label α with a pair (i, j) of integers such that T[i..j] = α, CDAWG(T) can be stored in O(e^r_T log n + n log σ) bits of space.
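As a sanity check on these definitions, the short sketch below (ours, not from the thesis) computes x→ and ←x for a substring of a single text by brute force, i.e. the longest extensions that preserve the set of beginning and ending positions, respectively.

    def occurrences(T, x):
        """1-based beginning positions of x in T (x is assumed to occur in T)."""
        return [i + 1 for i in range(len(T) - len(x) + 1) if T[i:i + len(x)] == x]

    def right_extension(T, x):
        """x->: extend x to the right while its set of beginning positions is unchanged."""
        begins = occurrences(T, x)
        y = x
        while True:
            b = begins[0]
            if b - 1 + len(y) >= len(T):
                return y
            z = T[b - 1:b - 1 + len(y) + 1]     # y extended by one character
            if occurrences(T, z) == begins:
                y = z
            else:
                return y

    def left_extension(T, x):
        """<-x: the symmetric extension to the left, computed via the reversed text."""
        return right_extension(T[::-1], x[::-1])[::-1]

    # Example: in T = "abcab" (our toy text), right_extension(T, "a") == "ab"
    # and left_extension(T, "b") == "ab", so <-a-> = <-b-> = "ab".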


Chapter 3

Packed Compact Tries

In this chapter, we present a new data structure called the packed compact trie (packed c-trie) which stores a set S of k strings of total length n in n log σ + O(k log n) bits of space and supports fast pattern matching queries and updates, where σ is the alphabet size. Assume that α = log_σ n letters are packed in a single machine word on the standard word RAM model, and let f(k, n) denote the query and update times of a dynamic predecessor/successor data structure of our choice which stores k integers from the universe [1, n] in O(k log n) bits of space. Then, given a string of length m, our packed c-tries support pattern matching queries and insert/delete operations in O((m/α) f(k, n)) worst-case time and in O(m/α + f(k, n)) expected time. Our experiments show that our packed c-tries are faster than the standard compact tries (a.k.a. Patricia trees) on real data sets. As an application of our packed c-trie, we show that sparse suffix trees for a string of length n with k sampled positions, such as evenly-spaced and word-delimited sparse suffix trees (i.e., suffix trees over prefix codes for a set of k word suffixes), can be constructed online in O((n/α + k) f(k, n)) worst-case time and O(n/α + k f(k, n)) expected time with n log σ + O(k log n) bits of space. When k = O(n/α), by using the state-of-the-art dynamic predecessor/successor data structures, we obtain sub-linear time construction algorithms using only O(n/α) bits of space in both cases. We also discuss an application of our packed c-tries to online LZD factorization.
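The word-packing assumption above is easy to mimic in software; the sketch below (illustrative only, not the thesis's implementation) packs α = ⌊log_σ n⌋ characters of a small alphabet into one machine-word-sized integer, so that α characters can be compared with a single integer operation.

    import math

    def pack_factor(n, sigma):
        """alpha = floor(log_sigma n): how many characters fit in one (log n)-bit word."""
        return max(1, int(math.log(n, sigma)))

    def pack(chars, sigma, alphabet="abcdefghijklmnopqrstuvwxyz"):
        """Pack a short string into one integer, using ceil(log2 sigma) bits per character."""
        bits = max(1, (sigma - 1).bit_length())
        word = 0
        for c in chars:
            word = (word << bits) | alphabet.index(c)
        return word

    # Comparing two packed words compares up to alpha characters at once:
    # pack("abc", 4) == pack("abc", 4) is a single integer comparison.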

3.1 Background

The trie for a set S of strings of total length n is a classical data structure which occupies O(n log n + n log σ) bits of space and allows for prefix search and insertion/deletion for a given string of length m in O(m log σ) time, where σ is the alphabet size. The compact trie for S is a path-compressed trie where the edges in every non-branching path are merged into a single edge [53]. By representing each edge label by a pair of positions in a string in S, the compact trie can be stored in n log σ + O(k log n) bits of space, where k is the number of strings in S, retaining the same time efficiency for prefix search and insertion/deletion for a given string. Thus, compact tries have widely been used in numerous applications such as dynamic dictionary matching [44], suffix trees [68], sparse suffix trees [47], external string indexes [30], and grammar-based text compression [39]. In this chapter, we show how to accelerate prefix search queries and update operations of compact tries on the standard word RAM model with machine word size w = log n, still keeping n log σ + O(k log n)-bit space usage. A basic idea is to use the packed string matching approach [12], where α = log_σ n consecutive letters are packed in a single word and can be manipulated in O(1) time. In this setting, we can read a given pattern P of length m in O(m/α) time, but, during the traversal of P over a compact trie, there can be at most m branching nodes. Thus, a naive implementation of a compact trie takes O(m/log_σ n + m log σ) = O(m log σ) time even in the packed matching setting. To overcome the above difficulty, we propose how to quickly process long non-branching paths using bit manipulations, and how to quickly process dense branching subtrees using fast predecessor/successor queries and dictionary look-ups. As a result,

we obtain a new compact trie called the packed compact trie (packed c-trie) for a dynamic set S of strings with the following efficiency:

Theorem 1 (main result) Let f(k, n) be the query/update times of an arbitrary dynamic predecessor/successor data structure using O(k log n) bits of space for a dynamic set of k integers from the universe [1, n]. Our packed c-trie stores a set S of k strings of total length n in n log σ + O(k log n) bits of space and supports prefix search and insertion/deletion for a given string of length m in O((m/α) f(k, n)) worst-case time or in O(m/α + f(k, n)) expected time.

Using Beame and Fich's data structure [6] or Willard's y-fast trie [70] as the dynamic predecessor/successor data structure, we obtain the following corollary:

Corollary 2 There exists a packed c-trie for a dynamic set S of strings which uses n log σ + O(k log n) bits of space, and supports prefix search and insert/delete operations for a given string of length m in O((m/α) (log log k)(log log n)/(log log log n)) worst-case time or in O(m/α + log log n) expected time.

Unlike most other (compact) tries, our packed c-trie does not maintain a dictionary or search structure for the children of each node. Instead, we partition our c-trie into ⌈h/α⌉ levels, where h is the length of the longest string in S. Then each subtree of height α, called a micro c-trie, maintains a predecessor/successor dictionary that processes prefix search inside the micro c-trie. A reduction from prefix search to predecessor/successor queries was already considered in an earlier work by Cole et al. [19]; however, their data structure is static. On the other hand, our micro c-tries are dynamic. A similar technique to our packed c-trie was used in the linked dynamic uncompacted trie by Jansson et al. [46]. Our experiments show that our packed c-tries are faster than Patricia trees for both construction and prefix search in almost all data sets we tested.

We show that our packed c-tries can be applied to efficient online construction of evenly sparse suffix trees [47], word suffix trees [45] and their extension [64]. Also, packed c-tries can be used for online computation of the LZ-Double factorization [39] (LZDF), a state-of-the-art online grammar-based text compressor. We show two applications of our packed c-tries in more detail. The first application is online construction of evenly sparse suffix trees [47], word suffix trees [45] and their extension [64]. The existing algorithms for these sparse suffix trees take O(n log σ) worst-case time using n log σ + O(k log n) bits of space, where k is the number of suffixes stored in the output sparse suffix tree. Using our packed c-tries, we achieve O((n/α + k) (log log k)(log log n)/(log log log n)) worst-case construction time and O(n/α + k log log n) expected construction time. The former is sublinear in n when k = O(n/α) and σ = polylog(n); the latter is sublinear in n when k = o(n/log log n) and σ = polylog(n). To achieve these results, we show that in our packed c-trie, prefix searches and insertion operations can be started not only from the root but from any node. This capability is necessary for online sparse suffix tree construction, since during the suffix link traversal we have to insert new leaves from non-root internal nodes. The second application is online computation of the LZ-Double factorization [39] (LZDF), a state-of-the-art online grammar-based text compressor. Goto et al. [39] presented a Patricia-tree based algorithm which computes the LZDF of a given string T of length n in O(k(M + min{k, M} log σ)) worst-case time using O(n log σ) bits of space, where k ≤ n is the number of factors and M ≤ n is the length of the longest factor. Using our packed c-tries, we achieve a good expected performance with O(k(M/α + f(k, n))) time for LZDF.

3.1.1 Related work

Belazzougui et al. [7] proposed a randomized compact trie called the signed dynamic z-fast trie, which stores a dynamic set S of k strings in n log σ + O(k log n) bits of

space. Given a string of length m, the signed dynamic z-fast trie supports prefix search in O(m/α + log m) worst-case time only with high probability, and supports insert/delete operations in O(m/α + log m) expected time only with high probability.¹ On the other hand, our packed c-trie always returns the correct answer for prefix search, and always inserts/deletes a given string correctly, within the bounds stated in Theorem 1 and Corollary 2. Andersson and Thorup [3] proposed the exponential search tree which uses n log σ + O(k log n) bits of space, and supports prefix search and insert/delete operations in O(m + √(log k/log log k)) worst-case time. Each node v of the exponential search tree stores a constant-time look-up dictionary for some children of v and a dynamic predecessor/successor data structure for the other children of v. This implies that, given a string of length m, at most m nodes in the search path for the string must be processed one by one, and hence packing α = log_σ n letters in a single word does not seem to speed up the exponential search tree. Fischer and Gawrychowski's wexponential search tree [33] uses n log σ + O(k log n) bits of space, and supports prefix search and insert/delete operations in O(m + (log log σ)²/log log log σ) worst-case time. When σ = polylog(n), our packed c-trie achieves O(m (log σ/log n) (log log k)(log log n)/(log log log n)) = O(m (log log n)²/(log n log log log n)) = o(1)·O(m) worst-case time, while the wexponential search tree requires O(m + (log log log n)²/log log log log n) time.²

¹ The O(log m) expected bound for insertion/deletion stated in [7] assumes that the prefix search for the string has already been performed.
² For sufficiently long patterns of length m = Θ(n), our packed c-trie achieves worst-case sublinear o(n) time while the wexponential search tree requires O(n) time.

3.2 Preliminaries

3.2.1 Compact tries

Let S = {X_1, ..., X_k} be a set of k non-empty strings of total length n. We consider dynamic data structures for S allowing for fast prefix searches of given patterns over the strings in S, and fast insertion/deletion of strings to/from S. Suppose S is prefix-free. The trie of S is a tree s.t. each edge is labeled by a single letter, the labels of the out-going edges of each node are distinct, and for each X_i ∈ S there is a unique leaf l_i s.t. the path from the root to l_i spells out X_i. The compact trie T_S of S is a path-compressed trie obtained by contracting non-branching paths into single edges. Namely, in T_S, each edge is labeled by a non-empty substring of a string in S, each internal node has at least two children, the out-going edges from each node begin with distinct letters, and each edge label x is encoded by a triple ⟨i, a, b⟩ such that x = X_i[a..b] for some 1 ≤ i ≤ k and 1 ≤ a ≤ b ≤ |X_i|. The length of an edge e, denoted |e|, is the length of its label string. Let root(T_S) denote the root of the compact trie T_S. For any node v, let parent(v) denote its parent. For convenience, let ⊥ be an auxiliary node s.t. parent(root(T_S)) = ⊥. We assume the edge from ⊥ to root(T_S) is labeled by an arbitrary letter. For any node v, let str(v) denote the string obtained by concatenating the edge labels from the root to v. Each node v stores |str(v)|. Let s be a prefix of any string in S. Let v be the shallowest node of T_S such that s is a prefix of str(v) (notice s can be equal to str(v)), and let u = parent(v). The locus of the string s in T_S is a pair ϕ = (e, h), where e is the edge from u to v and h is the offset from u, namely, h = |s| − |str(u)|.³ We extend the str function to loci, so that str(ϕ) = s. The string depth of a locus ϕ is d(ϕ) = |str(ϕ)|.

³ In the literature the locus is represented by (u, c, h), where c is the first letter of the label of e. Since our packed c-trie does not maintain a search structure for branches, we represent the locus directly on e.
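For concreteness, the following sketch (our own illustrative encoding, not the thesis's) shows one way to represent a compact-trie edge label as a triple (i, a, b) into the string set and a locus as an (edge, offset) pair.

    from dataclasses import dataclass

    @dataclass
    class Edge:
        parent: "Node"
        child: "Node"
        i: int                 # the edge label is X_i[a..b], 1-based inclusive
        a: int
        b: int

        def label(self, S):
            return S[self.i - 1][self.a - 1:self.b]

    @dataclass
    class Node:
        depth: int             # |str(v)|
        in_edge: Edge = None   # edge from parent(v) to v (None for the root)

    @dataclass
    class Locus:
        edge: Edge
        offset: int            # h = |s| - |str(parent)|, with 1 <= h <= edge length

        def string_depth(self):
            return self.edge.parent.depth + self.offset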

A string P is recognized by T_S iff there is a locus ϕ with str(ϕ) = P. We consider the following query and operations on dynamic compact tries.

LPS(ϕ, P): Given a locus ϕ in T_S and a pattern string P, it returns the locus ϕ̂ of the string str(ϕ)Q in T_S, where Q is the longest prefix of P for which str(ϕ)Q is recognized by T_S. When ϕ = ((⊥, root(T_S)), 1), the query is known as the longest prefix search for the pattern P in the compact trie.

Insert(ϕ, X): Given a locus ϕ in T_S and a string X, it inserts a new leaf which corresponds to a new string str(ϕ)X ∈ S into the compact trie, from the given locus ϕ. When there is no node at the locus ϕ̂ = LPS(ϕ, X), a new node is created at ϕ̂ as the parent of the leaf. When ϕ = ((⊥, root(T_S)), 1), this is a standard insertion of the string X into T_S.

Delete(X_i): Given a string X_i ∈ S, it deletes the leaf l_i. If the out-degree of the parent v of l_i becomes 1 after the deletion of l_i, then the in-coming and out-going edges of v are merged into a single edge, and v is also deleted.

3.2.2 Dynamic predecessor data structures

For a dynamic set I ⊆ [1, n] of k integers of w = log n bits each, dynamic predecessor data structures (e.g., [6, 7, 71]) efficiently support the predecessor query Pred(X) = max({Y ∈ I | Y ≤ X} ∪ {0}), the successor query Succ(X) = min({Y ∈ I | Y ≥ X} ∪ {n + 1}), and insert/delete operations for I.
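Any dynamic predecessor/successor structure can play the role of f(k, n); as a simple stand-in for experimentation, the sketch below (ours) implements Pred and Succ over a sorted Python list with the bisect module, giving O(log k) queries rather than the bounds of Theorems 3 and 4 below.

    import bisect

    class SortedPredSucc:
        """Pred/Succ over a dynamic set I of integers from [1, n]; O(log k) per operation."""
        def __init__(self, n):
            self.n = n
            self.items = []                       # kept sorted

        def insert(self, x):
            i = bisect.bisect_left(self.items, x)
            if i == len(self.items) or self.items[i] != x:
                self.items.insert(i, x)

        def delete(self, x):
            i = bisect.bisect_left(self.items, x)
            if i < len(self.items) and self.items[i] == x:
                self.items.pop(i)

        def pred(self, x):
            """max({Y in I : Y <= X} U {0})"""
            i = bisect.bisect_right(self.items, x)
            return self.items[i - 1] if i > 0 else 0

        def succ(self, x):
            """min({Y in I : Y >= X} U {n + 1})"""
            i = bisect.bisect_left(self.items, x)
            return self.items[i] if i < len(self.items) else self.n + 1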

Theorem 3 Let f(k, n) be the time complexity for predecessor/successor queries and insert/delete operations of an arbitrary dynamic predecessor/successor data structure which occupies O(k log n) bits of space. Beame and Fich's data structure [6] achieves f(k, n) = O((log log k)(log log n)/(log log log n)) worst-case time.

Theorem 4 Let f(k, n) be the time complexity for predecessor/successor queries and insert/delete operations of an arbitrary dynamic predecessor/successor data structure which occupies O(k log n) bits of space. Willard's y-fast trie [70] achieves f(k, n) = O(log log n) expected time.

3.3 Packed dynamic compact tries

This section presents our new dynamic compact tries called the packed dynamic compact tries (packed c-tries) for a dynamic set S = {X_1, ..., X_k} of k strings of total length n, which achieve the main result in Theorem 1. In the sequel, a string X ∈ Σ* is called short if |X| ≤ α = log_σ n, and is called long if |X| > α.

3.3.1 Micro dynamic compact tries for short strings

In this subsection, we present our data structure for storing short strings. Our input is a dynamic set S = {X_1, ..., X_k} of k strings of total length n, such that |X_i| ≤ α = log_σ n for every 1 ≤ i ≤ k. Hence it holds that k ≤ σ^α = n. For simplicity, we assume for now that |X_i| = α for every 1 ≤ i ≤ k. The general case where S contains strings shorter than α will be explained later in Remark 1. The dynamic data structure for short strings, called the micro c-trie and denoted MT_S, consists of the following:

(i) A dynamic compact trie of height exactly α storing the set S. Let N be the set of internal nodes, and let L = {l_1, ..., l_k} be the set of k leaves such that l_i corresponds to X_i for 1 ≤ i ≤ k. Since every internal node is branching, |N| ≤ k − 1. Every node v of MT_S corresponds to the string str(v) of log n bits. Overall, this compact trie requires n log σ + O(k log n) bits of space (including S).

(ii) A dynamic predecessor/successor data structure D which stores the set S = {X_1, ..., X_k} of strings in O(k log n) bits of space, where each X_i is regarded as a log n-bit integer. D supports predecessor/successor queries and insert/delete operations in f(k, n) time each.

Clearly MT_S requires n log σ + O(k log n) bits of total space. The next lemma shows how to support in O(1) time LCP queries for strings

represented by two given nodes of the dynamic micro c-trie MT_S. This is related to the labeling scheme (e.g., see [1]) which assigns a short label to each node so that later, given the labels of two nodes, the label of the LCA of the nodes can be answered in O(1) time. Although a static tree is considered in the labeling scheme, our micro c-trie is dynamic. Also, our algorithm is much simpler than applying the dynamic LCA data structure [20] to our micro c-tries.

Lemma 1 For any nodes u and v of the dynamic micro c-trie MT_S, we can compute LCP(str(u), str(v)) in O(1) time.

Proof 1 We pad str(u) and/or str(v) with an arbitrary letter c so they become α long each, namely, let P = str(u) c^{α − |str(u)|} and Q = str(v) c^{α − |str(v)|}. We compute the most significant bit (msb) of the XOR of the bit representations of P and Q. Let b be the bit position of the msb, and let z = ⌊(b − 1)/log σ⌋. W.l.o.g. assume |str(u)| ≤ |str(v)|. (1) If z < |str(u)|, then str(u)[1..z] = LCP(str(u), str(v)). In this case, there exists a branching node y such that str(y) = str(u)[1..z], and hence LCP(str(u), str(v)) = str(y). (2) If z ≥ |str(u)|, then str(u) is a prefix of str(v), and hence str(u) = LCP(str(u), str(v)). Since each of P and Q is stored in a single machine word, we can compute the XOR of P and Q in O(1) time. The msb can be computed in O(1) time using the technique of Fredman and Willard [35]. This completes the proof.

On micro c-tries, prefix searches and insertion operations can be started not only from the root but from any node. This is necessary for online sparse suffix tree construction based on Ukkonen's algorithm [65], since during the suffix link traversal we have to insert new leaves from non-root internal nodes.

Theorem 5 The micro c-trie MT_S supports LPS(ϕ, X) queries in O(f(k, n)) time.
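The XOR-and-msb trick of Lemma 1 can be reproduced with Python integers; the sketch below (ours, using bit_length in place of Fredman and Willard's constant-time msb) computes the character-level LCP length of two equal-length packed strings.

    def packed_lcp_length(P, Q, length, bits_per_char):
        """LCP length (in characters) of two strings packed into integers P and Q,
        each holding `length` characters of `bits_per_char` bits."""
        x = P ^ Q
        if x == 0:
            return length                       # the packed strings are identical
        total_bits = length * bits_per_char
        b = total_bits - x.bit_length() + 1     # 1-based msb position from the left
        return (b - 1) // bits_per_char         # z = floor((b - 1) / log sigma)

    # Example with 2-bit characters (sigma = 4): packing "abc" as 0b000110 and
    # "abd" as 0b000111, packed_lcp_length(0b000110, 0b000111, 3, 2) == 2.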

Proof 2 Let P be the prefix of str(ϕ)X of length α, i.e., P = str(ϕ) X[1..α − d(ϕ)]. The case where P is represented by a leaf is easy, and thus, in what follows we focus on the case where P is not represented by a leaf. First, we compute the string depth d = d(ϕ̂) ∈ [0, α]. Observe that d = max{|LCP(P, Pred(P))|, |LCP(P, Succ(P))|}. Given P, we compute Pred(P) and Succ(P) in O(f(k, n)) time. Then, we can compute LCP(P, Pred(P)) in O(1) time by computing the msb of the XOR of the bit representations of P and Pred(P), as in Lemma 1. LCP(P, Succ(P)) can be computed analogously, and thus d = d(ϕ̂) can be computed in O(f(k, n)) time. Second, we locate e = (u, v). See also Fig. 3.1. Let Z = P[1..d]. Let LB = Z c_1^{α − |Z|} and UB = Z c_σ^{α − |Z|} be the lexicographically least and greatest strings of length α with prefix Z, respectively. To locate u in MT_S, we find the leftmost and rightmost leaves X_L and X_R below ϕ̂ by X_L = Succ(LB) and X_R = Pred(UB). Then, the longer one of LCP(X_{L−1}, X_L) and LCP(X_R, X_{R+1}) corresponds to the origin node u of e, and LCP(X_L, X_R) corresponds to the destination node v of e. These LCPs can be computed in O(1) time by Lemma 1. What remains is how to access the nodes u and v representing these strings. In so doing, let $ be a special character that does not appear in any strings in S. For each string Y represented by an internal node of MT_S, we pad $ at the end of Y so its length becomes exactly α, namely, we obtain Y$^{α − |Y|}. We insert this padded string into a dynamic dictionary dedicated only to internal nodes (here we use a predecessor/successor data structure). Now, given a string represented by an internal node, we can access the corresponding node in O(f(k, n)) time. Finally we obtain ϕ̂ = ((u, v), d − |str(u)|) in overall O(f(k, n)) time.

It follows from the proof of Theorem 5 that a dynamic predecessor/successor data structure is enough to support pattern matching queries on our dynamic micro c-trie. This implies that we do not have to store (the triples for) the edge labels in the micro c-trie. This observation is important when we consider delete operations on the set S, as we will see in the next lemma.
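The first step of this proof is easy to emulate: given S as packed integers in a sorted structure, the matched depth of a packed query P is the longer of its LCPs with its predecessor and successor. The sketch below (ours) combines the bisect-based Pred/Succ and the packed LCP from the earlier sketches; it only reports the matched depth, not the locus.

    import bisect

    def matched_depth(sorted_packed, P, length, bits_per_char):
        """d = max(|LCP(P, Pred(P))|, |LCP(P, Succ(P))|) over a sorted list of packed strings."""
        def lcp_len(A, B):
            x = A ^ B
            if x == 0:
                return length
            b = length * bits_per_char - x.bit_length() + 1
            return (b - 1) // bits_per_char

        i = bisect.bisect_left(sorted_packed, P)
        d = 0
        if i > 0:                                # a predecessor exists
            d = max(d, lcp_len(sorted_packed[i - 1], P))
        if i < len(sorted_packed):               # a successor exists
            d = max(d, lcp_len(sorted_packed[i], P))
        return d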

[Figure 3.1: Given the initial locus ϕ (at the root in this figure) and a query pattern P, the algorithm of Theorem 5 answers the LPS(ϕ, P) query on the micro c-trie: the leaves X_L and X_R below the matched prefix X[1..d] are located, and the nodes LCA(l_{L−1}, l_L), LCA(l_R, l_{R+1}) and LCA(l_L, l_R) determine the edge on which the answer locus ϕ̂ lies.]

Lemma 2 The micro c-trie MT_S supports Insert(ϕ, X) and Delete(X) operations in O(f(k, n)) time. We assume that d(ϕ) + |X| ≤ α so that the height of the micro compact trie will always be kept within α.

Proof 3 We show how to support Insert(ϕ, X) in O(f(k, n)) time. Initially S = ∅, the micro compact trie MT_S consists only of root(MT_S), and the predecessor/successor dictionary D contains no elements. When the first string X is inserted into S, we create a leaf below the root and insert X into D. Suppose that the data structure maintains a string set S with |S| ≥ 1. To insert a string X from the given locus ϕ, we first conduct the LPS(ϕ, X) query of Theorem 5, and let ϕ̂ = (e, h) be the answer to the query. If h = |e|, then we simply insert a new leaf l from the destination node of e. Otherwise, we split e at ϕ̂ and create a new node v there as the parent of the new leaf, such that str(v) = str(ϕ̂). The rest is the same as in the former case. After the new leaf is inserted, we insert str(ϕ)X into D in O(f(k, n)) time. We now consider Delete(X). Recall that each edge of the micro c-trie does not store

the triple representing its string label. Thanks to this property, we need not consider updates of the labels of the edges on the path from the root to the deleted leaf (which usually becomes problematic in compact tries). Thus, we can support Delete(X) in a similar way to Insert(ϕ, X), in O(f(k, n)) time.

Remark 1 When d(ϕ) + |X| < α, we can support Insert(ϕ, X) and LPS(ϕ, X) as follows. When inserting X, we pad X with a special letter $ which does not appear in S; namely, we perform the Insert(ϕ, X′) operation with X′ = X$^{α − d(ϕ) − |X|}. When computing LPS(ϕ, X), we pad X with another special letter # ≠ $ which does not appear in S; namely, we perform the LPS(ϕ, X′) query with X′ = X#^{α − d(ϕ) − |X|}. This gives us the correct locus for LPS(ϕ, X).
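Remark 1's two-marker padding is easy to express directly; the sketch below (ours) pads a short argument with '$' for insertions and with '#' for queries, assuming neither symbol occurs in Σ, so that the padding of a query can never accidentally match the padding of a stored string beyond the real characters.

    def pad_for_insert(X, alpha, depth):
        """Insert(phi, X) with d(phi) + |X| < alpha: pad X with '$' up to the boundary."""
        return X + "$" * (alpha - depth - len(X))

    def pad_for_query(X, alpha, depth):
        """LPS(phi, X) with d(phi) + |X| < alpha: pad with '#', a second unused symbol."""
        return X + "#" * (alpha - depth - len(X))

    # Example with alpha = 8 and d(phi) = 3:
    # pad_for_insert("ab", 8, 3) == "ab$$$"   and   pad_for_query("ab", 8, 3) == "ab###"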

[Figure 3.2: Micro-trie decomposition: the packed c-trie is decomposed into a number of micro c-tries (gray rectangles), each of height α = log_σ n, with boundaries at string depths 0, α, 2α, 3α, 4α, .... Each micro c-trie is equipped with a dynamic predecessor/successor data structure.]

3.3.2 Packed dynamic compact tries for long strings

In this subsection, we present the packed dynamic compact trie (packed c-trie) PT_S for a set S of variable-length strings of length at most O(2^w) = O(n).

3.3.3 Micro trie decomposition

We decompose PT_S into a number of micro c-tries. See also Fig. 3.2. Let h > α be the length of the longest string in S. We categorize the nodes of PT_S into ⌈h/α⌉ + 1 levels: we say that a node v of PT_S is at level i (0 ≤ i ≤ ⌈h/α⌉) iff |str(v)| ∈ [iα, (i + 1)α − 1]. The level of a node v is denoted by level(v). A locus ϕ of PT_S is called a boundary iff d(ϕ) is a multiple of α. Consider any path from root(PT_S) to a leaf, and assume that there is no node at some boundary iα on this path. We create an auxiliary node at that boundary on this path iff there is at least one non-auxiliary (i.e., original) node at level i − 1 or i + 1 on this path. Let BN denote the set of nodes at the boundaries, called the boundary nodes. For each boundary node v ∈ BN, we create a micro compact trie MT whose root root(MT) is v, whose internal nodes are all descendants u of v with level(u) = level(v), and whose leaves are all boundary descendants l of v with level(l) = level(v) + 1. Notice that each boundary node is the root of a micro c-trie at its level and is also a leaf of a micro c-trie at the previous level. An edge is said to be a long edge iff its label is at least α long. We store the label of each long edge by a triple of integers. Recall that, on the other hand, we do not store (encodings of) the edge labels in the micro c-tries.

Lemma 3 The packed c-trie PT_S for a prefix-free set S of k strings requires n log σ + O(k log n) bits of space.

Proof 4 Firstly, we bound the number of auxiliary boundary nodes in PT_S. At most 2 auxiliary boundary nodes are created on each original edge of PT_S. Since there are at most 2k − 2 original edges, the total number of auxiliary boundary nodes is at most 4k − 4. Since there are at most 2k − 1 original nodes in PT_S, the total number of nodes in PT_S is at most 6k − 5. Clearly, the total number of short strings of length at most α maintained by the micro c-tries is no more than the number of all nodes in PT_S. The

number of long edges in PT_S is no more than the number of its nodes. Overall, the total space of PT_S is n log σ + O(k log n) bits.

For any locus ϕ on PT_S, ld(ϕ) denotes the local string depth of ϕ in the micro c-trie MT that contains ϕ. Namely, if root(MT) = v, the parent of v in PT_S is u, and e = (u, v), then ld(ϕ) = d(ϕ) − d((e, |e|)). Prefix search queries and insert/delete operations can be supported by our packed c-trie as follows.

Lemma 4 The packed c-trie PT_S supports the LPS(ϕ, P) query in O((m/α) f(k, n)) worst-case time, where m = |P| > α.

Proof 5 If m + ld(ϕ) ≤ α, the bound immediately follows from Theorem 5. Assume m + ld(ϕ) > α, and let q = α − ld(ϕ) + 1. We factorize P into h + 1 blocks as p_0 = P[1..q − 1], p_i = P[q + (i − 1)α..q + iα − 1] for 1 ≤ i ≤ h − 1, and p_h = P[q + (h − 1)α..m], where 1 ≤ |p_0| ≤ α, |p_i| = α for 1 ≤ i ≤ h − 1, and 1 ≤ |p_h| ≤ α. Each block can be computed in O(1) time by standard bit operations. If there is a mismatch in p_0, we are done. Otherwise, for each i in increasing order from 1 to h, we perform the LPS(γ, p_i) query from the root γ of the corresponding micro c-trie at each level of the corresponding path starting from ϕ. This continues until we find either the first mismatch for some i or complete matches for all i's. Each LPS query on a micro c-trie takes O(f(k, n)) time by Theorem 5. Since h = O(m/α), it takes O((m/α) f(k, n)) total time.
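The block factorization in Proof 5 is straightforward to reproduce; the sketch below (ours) cuts a pattern into p_0, ..., p_h so that p_0 fills the rest of the current micro c-trie and every following block except possibly the last has length exactly α.

    def factorize(P, alpha, ld):
        """Split P w.r.t. a locus of local depth ld: first a block of length alpha - ld,
        then full blocks of length alpha, then a possibly shorter last block."""
        q = alpha - ld              # length of p_0 (assumes len(P) + ld > alpha)
        blocks = [P[:q]]
        blocks += [P[i:i + alpha] for i in range(q, len(P), alpha)]
        return blocks

    # Example: factorize("abcdefghij", 4, 1) == ["abc", "defg", "hij"]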

Lemma 5 The packed c-trie PT_S supports Insert(ϕ, X) and Delete(X_i) operations in O((m/α) f(k, n)) worst-case time, where m = |X| > α.

Proof 6 Insert(ϕ, X): we first perform LPS(ϕ, X) in O((m/α) f(k, n)) time (Lemma 4). Let x_0, ..., x_h be the factorization of X w.r.t. ϕ, and let x_j be the block of the factorization containing the first mismatch. Then, we conduct the Insert(γ, x_j) operation on the corresponding micro c-trie, where γ is its root. It takes O(f(k, n)) time (Lemma 2). If j = h (x_j is the last block in the factorization of X), then we are done. Otherwise, we create a new edge with label x′_j x_{j+1} ⋯ x_h, where x′_j is the suffix of x_j which begins at the mismatched position, leading to the new leaf l. We create a new boundary node if necessary. These operations take O(1) time each. Hence, Insert(ϕ, X) takes O((m/α) f(k, n)) total time.

Delete(X_i): Let Q be the path from the root r of PT_S to the leaf l_i. If l_i is a child of the root of PT_S, then we simply delete the single edge in Q. Otherwise, for each sub-path of Q that belongs to a micro c-trie, we perform the Delete operation of Lemma 2 in this micro c-trie. Since the path Q spans at most ⌈m/α⌉ micro c-tries, the delete operations on these micro c-tries take O((m/α) f(k, n)) total time. For each long edge in Q whose label refers to X_i, let ⟨i, a, b⟩ be the triple representing the label. We replace the triple with ⟨i′, a′, b′⟩, where X_{i′} is the predecessor of X_i in S and X_{i′}[a′..b′] = X_i[a..b] (if X_i does not have a predecessor, then we can use the successor of X_i in S instead). We can find X_{i′} as follows. First, we compute ϕ′ = LPS(r, X_i) = LCA(l_i, l_{i′}). Then, we can find l_{i′} by traversing the right-most path from ϕ′ that is to the left of the sub-path of Q from ϕ′ to l_i. This can be done in O((m/α) f(k, n)) time. The positions a′ and b′ in X_{i′} can be computed by simple arithmetic, since we know the total length of the labels on the path from ϕ′ to l_{i′}. Since the path Q contains less than m/α long edges, the triples for all long edges in Q can be updated in O(m/α) time.

3.3.4 Speeding-up with hashing

By augmenting each micro c-trie with a hash table storing the short strings, we achieve a good expected performance, as follows:

Lemma 6 The packed c-trie PT_S augmented with hashing supports the LPS(ϕ, X) query, Insert(ϕ, X) and Delete(X) operations in O(m/α + f(k, n)) expected time.
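The hashing idea of Lemma 6 amounts to keeping, per micro c-trie, a hash table of the short strings it stores, so that a fully matching block is confirmed with one expected-constant-time lookup and the predecessor structure is consulted only for the block containing the first mismatch; the formal argument is given in Proof 7 below. A minimal sketch of that lookup loop (ours, with Python sets standing in for the hash tables):

    def lps_blocks_with_hashing(tables, blocks):
        """tables[i] is the set of short strings stored in the i-th micro c-trie along
        the search path; returns the index of the first mismatching block, or
        len(blocks) if every block matches."""
        for i, block in enumerate(blocks):
            if block not in tables[i]:       # one expected O(1) hash lookup per block
                return i                     # fall back to an LPS query inside this micro c-trie
        return len(blocks)

    # Example: with tables = [{"abc"}, {"defg"}, {"hi"}] and blocks = ["abc", "defg", "hij"],
    # the function returns 2, so only the third micro c-trie needs a predecessor-based LPS.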

Proof 7 Let MT be any micro c-trie in the packed c-trie PT_S, and let M be the set of strings maintained by MT, each being of length at most α. We store all strings of M in a hash table associated to MT, which supports look-ups, insertions and deletions in O(1) expected time. Let x_0, ..., x_h be the factorization of X w.r.t. ϕ. To perform LPS(ϕ, X), we ask if str(ϕ)x_0 is in the hash table of the corresponding micro c-trie. If the answer is no, the first mismatch occurs in x_0, and the rest is the same as in Lemma 4. If the answer is yes, then for each i from 1 to h in increasing order, we ask if x_i is in the hash table of the corresponding micro c-trie, until we receive the first no for some i or we receive yes for all i's. In the latter case, we are done. In the former case, we perform an LPS query with x_i from the root of the corresponding micro c-trie. Since we perform at most one LPS query and O(m/α) look-ups in hash tables, it takes O(m/α + f(k, n)) expected time. The O(m/α + f(k, n)) expected time bounds for Insert(ϕ, X) and Delete(X) immediately follow from the above arguments.

3.4 Applications to online string processing

Sparse suffix trees. The suffix tree [68] of a string T of length n is a compact trie which stores all n suffixes of T. A sparse suffix tree for a set K ⊆ [1, n] of sampled positions of T is a compact trie which stores only the subset S = {T[i..n] | i ∈ K} of the suffixes of T beginning at the sampled positions in K. It is known that if the set K of sampled positions satisfies some properties (e.g., every r positions for some fixed r > 1, or the positions immediately after the word delimiters), the sparse suffix tree can be constructed in an online manner in O(n log σ) time and n log σ + O(n log n) bits of space [45, 47, 64]. Packed c-tries can speed up online construction and pattern matching for these sparse suffix trees: here each input string X to Insert is given as a pair (i, j) of positions in T s.t. X = T[i..j]. As Lemma 7 states, the Insert operation in such a case can be

Minimal DFA. minimal DFA for L starting from any other

Minimal DFA. minimal DFA for L starting from any other Miniml DFA Among the mny DFAs ccepting the sme regulr lnguge L, there is exctly one (up to renming of sttes) which hs the smllest possile numer of sttes. Moreover, it is possile to otin tht miniml DFA

More information

Designing finite automata II

Designing finite automata II Designing finite utomt II Prolem: Design DFA A such tht L(A) consists of ll strings of nd which re of length 3n, for n = 0, 1, 2, (1) Determine wht to rememer out the input string Assign stte to ech of

More information

Convert the NFA into DFA

Convert the NFA into DFA Convert the NF into F For ech NF we cn find F ccepting the sme lnguge. The numer of sttes of the F could e exponentil in the numer of sttes of the NF, ut in prctice this worst cse occurs rrely. lgorithm:

More information

Dynamic Fully-Compressed Suffix Trees

Dynamic Fully-Compressed Suffix Trees Motivtion Dynmic FCST s Conclusions Dynmic Fully-Compressed Suffix Trees Luís M. S. Russo Gonzlo Nvrro Arlindo L. Oliveir INESC-ID/IST {lsr,ml}@lgos.inesc-id.pt Dept. of Computer Science, University of

More information

1 Nondeterministic Finite Automata

1 Nondeterministic Finite Automata 1 Nondeterministic Finite Automt Suppose in life, whenever you hd choice, you could try oth possiilities nd live your life. At the end, you would go ck nd choose the one tht worked out the est. Then you

More information

Coalgebra, Lecture 15: Equations for Deterministic Automata

Coalgebra, Lecture 15: Equations for Deterministic Automata Colger, Lecture 15: Equtions for Deterministic Automt Julin Slmnc (nd Jurrin Rot) Decemer 19, 2016 In this lecture, we will study the concept of equtions for deterministic utomt. The notes re self contined

More information

p-adic Egyptian Fractions

p-adic Egyptian Fractions p-adic Egyptin Frctions Contents 1 Introduction 1 2 Trditionl Egyptin Frctions nd Greedy Algorithm 2 3 Set-up 3 4 p-greedy Algorithm 5 5 p-egyptin Trditionl 10 6 Conclusion 1 Introduction An Egyptin frction

More information

Compiler Design. Fall Lexical Analysis. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Fall Lexical Analysis. Sample Exercises and Solutions. Prof. Pedro C. Diniz University of Southern Cliforni Computer Science Deprtment Compiler Design Fll Lexicl Anlysis Smple Exercises nd Solutions Prof. Pedro C. Diniz USC / Informtion Sciences Institute 4676 Admirlty Wy, Suite

More information

CMPSCI 250: Introduction to Computation. Lecture #31: What DFA s Can and Can t Do David Mix Barrington 9 April 2014

CMPSCI 250: Introduction to Computation. Lecture #31: What DFA s Can and Can t Do David Mix Barrington 9 April 2014 CMPSCI 250: Introduction to Computtion Lecture #31: Wht DFA s Cn nd Cn t Do Dvid Mix Brrington 9 April 2014 Wht DFA s Cn nd Cn t Do Deterministic Finite Automt Forml Definition of DFA s Exmples of DFA

More information

Formal Languages and Automata

Formal Languages and Automata Moile Computing nd Softwre Engineering p. 1/5 Forml Lnguges nd Automt Chpter 2 Finite Automt Chun-Ming Liu cmliu@csie.ntut.edu.tw Deprtment of Computer Science nd Informtion Engineering Ntionl Tipei University

More information

1. For each of the following theorems, give a two or three sentence sketch of how the proof goes or why it is not true.

1. For each of the following theorems, give a two or three sentence sketch of how the proof goes or why it is not true. York University CSE 2 Unit 3. DFA Clsses Converting etween DFA, NFA, Regulr Expressions, nd Extended Regulr Expressions Instructor: Jeff Edmonds Don t chet y looking t these nswers premturely.. For ech

More information

Nondeterminism and Nodeterministic Automata

Nondeterminism and Nodeterministic Automata Nondeterminism nd Nodeterministic Automt 61 Nondeterminism nd Nondeterministic Automt The computtionl mchine models tht we lerned in the clss re deterministic in the sense tht the next move is uniquely

More information

Types of Finite Automata. CMSC 330: Organization of Programming Languages. Comparing DFAs and NFAs. Comparing DFAs and NFAs (cont.) Finite Automata 2

Types of Finite Automata. CMSC 330: Organization of Programming Languages. Comparing DFAs and NFAs. Comparing DFAs and NFAs (cont.) Finite Automata 2 CMSC 330: Orgniztion of Progrmming Lnguges Finite Automt 2 Types of Finite Automt Deterministic Finite Automt () Exctly one sequence of steps for ech string All exmples so fr Nondeterministic Finite Automt

More information

12.1 Nondeterminism Nondeterministic Finite Automata. a a b ε. CS125 Lecture 12 Fall 2016

12.1 Nondeterminism Nondeterministic Finite Automata. a a b ε. CS125 Lecture 12 Fall 2016 CS125 Lecture 12 Fll 2016 12.1 Nondeterminism The ide of nondeterministic computtions is to llow our lgorithms to mke guesses, nd only require tht they ccept when the guesses re correct. For exmple, simple

More information

Parse trees, ambiguity, and Chomsky normal form

Parse trees, ambiguity, and Chomsky normal form Prse trees, miguity, nd Chomsky norml form In this lecture we will discuss few importnt notions connected with contextfree grmmrs, including prse trees, miguity, nd specil form for context-free grmmrs

More information

Harvard University Computer Science 121 Midterm October 23, 2012

Harvard University Computer Science 121 Midterm October 23, 2012 Hrvrd University Computer Science 121 Midterm Octoer 23, 2012 This is closed-ook exmintion. You my use ny result from lecture, Sipser, prolem sets, or section, s long s you quote it clerly. The lphet is

More information

CMSC 330: Organization of Programming Languages

CMSC 330: Organization of Programming Languages CMSC 330: Orgniztion of Progrmming Lnguges Finite Automt 2 CMSC 330 1 Types of Finite Automt Deterministic Finite Automt (DFA) Exctly one sequence of steps for ech string All exmples so fr Nondeterministic

More information

Types of Finite Automata. CMSC 330: Organization of Programming Languages. Comparing DFAs and NFAs. NFA for (a b)*abb.

Types of Finite Automata. CMSC 330: Organization of Programming Languages. Comparing DFAs and NFAs. NFA for (a b)*abb. CMSC 330: Orgniztion of Progrmming Lnguges Finite Automt 2 Types of Finite Automt Deterministic Finite Automt () Exctly one sequence of steps for ech string All exmples so fr Nondeterministic Finite Automt

More information

Chapter 2 Finite Automata

Chapter 2 Finite Automata Chpter 2 Finite Automt 28 2.1 Introduction Finite utomt: first model of the notion of effective procedure. (They lso hve mny other pplictions). The concept of finite utomton cn e derived y exmining wht

More information

Formal languages, automata, and theory of computation

Formal languages, automata, and theory of computation Mälrdlen University TEN1 DVA337 2015 School of Innovtion, Design nd Engineering Forml lnguges, utomt, nd theory of computtion Thursdy, Novemer 5, 14:10-18:30 Techer: Dniel Hedin, phone 021-107052 The exm

More information

CS415 Compilers. Lexical Analysis and. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

CS415 Compilers. Lexical Analysis and. These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University CS415 Compilers Lexicl Anlysis nd These slides re sed on slides copyrighted y Keith Cooper, Ken Kennedy & Lind Torczon t Rice University First Progrmming Project Instruction Scheduling Project hs een posted

More information

CS 373, Spring Solutions to Mock midterm 1 (Based on first midterm in CS 273, Fall 2008.)

CS 373, Spring Solutions to Mock midterm 1 (Based on first midterm in CS 273, Fall 2008.) CS 373, Spring 29. Solutions to Mock midterm (sed on first midterm in CS 273, Fll 28.) Prolem : Short nswer (8 points) The nswers to these prolems should e short nd not complicted. () If n NF M ccepts

More information

Intermediate Math Circles Wednesday, November 14, 2018 Finite Automata II. Nickolas Rollick a b b. a b 4

Intermediate Math Circles Wednesday, November 14, 2018 Finite Automata II. Nickolas Rollick a b b. a b 4 Intermedite Mth Circles Wednesdy, Novemer 14, 2018 Finite Automt II Nickols Rollick nrollick@uwterloo.c Regulr Lnguges Lst time, we were introduced to the ide of DFA (deterministic finite utomton), one

More information

CS 275 Automata and Formal Language Theory

CS 275 Automata and Formal Language Theory CS 275 utomt nd Forml Lnguge Theory Course Notes Prt II: The Recognition Prolem (II) Chpter II.5.: Properties of Context Free Grmmrs (14) nton Setzer (Bsed on ook drft y J. V. Tucker nd K. Stephenson)

More information

Surface maps into free groups

Surface maps into free groups Surfce mps into free groups lden Wlker Novemer 10, 2014 Free groups wedge X of two circles: Set F = π 1 (X ) =,. We write cpitl letters for inverse, so = 1. e.g. () 1 = Commuttors Let x nd y e loops. The

More information

Homework 3 Solutions

Homework 3 Solutions CS 341: Foundtions of Computer Science II Prof. Mrvin Nkym Homework 3 Solutions 1. Give NFAs with the specified numer of sttes recognizing ech of the following lnguges. In ll cses, the lphet is Σ = {,1}.

More information

I1 = I2 I1 = I2 + I3 I1 + I2 = I3 + I4 I 3

I1 = I2 I1 = I2 + I3 I1 + I2 = I3 + I4 I 3 2 The Prllel Circuit Electric Circuits: Figure 2- elow show ttery nd multiple resistors rrnged in prllel. Ech resistor receives portion of the current from the ttery sed on its resistnce. The split is

More information

Model Reduction of Finite State Machines by Contraction

Model Reduction of Finite State Machines by Contraction Model Reduction of Finite Stte Mchines y Contrction Alessndro Giu Dip. di Ingegneri Elettric ed Elettronic, Università di Cgliri, Pizz d Armi, 09123 Cgliri, Itly Phone: +39-070-675-5892 Fx: +39-070-675-5900

More information

Lecture 09: Myhill-Nerode Theorem

Lecture 09: Myhill-Nerode Theorem CS 373: Theory of Computtion Mdhusudn Prthsrthy Lecture 09: Myhill-Nerode Theorem 16 Ferury 2010 In this lecture, we will see tht every lnguge hs unique miniml DFA We will see this fct from two perspectives

More information

NFA DFA Example 3 CMSC 330: Organization of Programming Languages. Equivalence of DFAs and NFAs. Equivalence of DFAs and NFAs (cont.

NFA DFA Example 3 CMSC 330: Organization of Programming Languages. Equivalence of DFAs and NFAs. Equivalence of DFAs and NFAs (cont. NFA DFA Exmple 3 CMSC 330: Orgniztion of Progrmming Lnguges NFA {B,D,E {A,E {C,D {E Finite Automt, con't. R = { {A,E, {B,D,E, {C,D, {E 2 Equivlence of DFAs nd NFAs Any string from {A to either {D or {CD

More information

Module 9: Tries and String Matching

Module 9: Tries and String Matching Module 9: Tries nd String Mtching CS 240 - Dt Structures nd Dt Mngement Sjed Hque Veronik Irvine Tylor Smith Bsed on lecture notes by mny previous cs240 instructors Dvid R. Cheriton School of Computer

More information

Module 9: Tries and String Matching

Module 9: Tries and String Matching Module 9: Tries nd String Mtching CS 240 - Dt Structures nd Dt Mngement Sjed Hque Veronik Irvine Tylor Smith Bsed on lecture notes by mny previous cs240 instructors Dvid R. Cheriton School of Computer

More information

Lecture 08: Feb. 08, 2019

Lecture 08: Feb. 08, 2019 4CS4-6:Theory of Computtion(Closure on Reg. Lngs., regex to NDFA, DFA to regex) Prof. K.R. Chowdhry Lecture 08: Fe. 08, 2019 : Professor of CS Disclimer: These notes hve not een sujected to the usul scrutiny

More information

Where did dynamic programming come from?

Where did dynamic programming come from? Where did dynmic progrmming come from? String lgorithms Dvid Kuchk cs302 Spring 2012 Richrd ellmn On the irth of Dynmic Progrmming Sturt Dreyfus http://www.eng.tu.c.il/~mi/cd/ or50/1526-5463-2002-50-01-0048.pdf

More information

Lecture 3: Equivalence Relations

Lecture 3: Equivalence Relations Mthcmp Crsh Course Instructor: Pdric Brtlett Lecture 3: Equivlence Reltions Week 1 Mthcmp 2014 In our lst three tlks of this clss, we shift the focus of our tlks from proof techniques to proof concepts

More information

Alignment of Long Sequences. BMI/CS Spring 2016 Anthony Gitter

Alignment of Long Sequences. BMI/CS Spring 2016 Anthony Gitter Alignment of Long Sequences BMI/CS 776 www.biostt.wisc.edu/bmi776/ Spring 2016 Anthony Gitter gitter@biostt.wisc.edu Gols for Lecture Key concepts how lrge-scle lignment differs from the simple cse the

More information

CS103B Handout 18 Winter 2007 February 28, 2007 Finite Automata

CS103B Handout 18 Winter 2007 February 28, 2007 Finite Automata CS103B ndout 18 Winter 2007 Ferury 28, 2007 Finite Automt Initil text y Mggie Johnson. Introduction Severl childrens gmes fit the following description: Pieces re set up on plying ord; dice re thrown or

More information

Connected-components. Summary of lecture 9. Algorithms and Data Structures Disjoint sets. Example: connected components in graphs

Connected-components. Summary of lecture 9. Algorithms and Data Structures Disjoint sets. Example: connected components in graphs Prm University, Mth. Deprtment Summry of lecture 9 Algorithms nd Dt Structures Disjoint sets Summry of this lecture: (CLR.1-3) Dt Structures for Disjoint sets: Union opertion Find opertion Mrco Pellegrini

More information

5. (±±) Λ = fw j w is string of even lengthg [ 00 = f11,00g 7. (11 [ 00)± Λ = fw j w egins with either 11 or 00g 8. (0 [ ffl)1 Λ = 01 Λ [ 1 Λ 9.

5. (±±) Λ = fw j w is string of even lengthg [ 00 = f11,00g 7. (11 [ 00)± Λ = fw j w egins with either 11 or 00g 8. (0 [ ffl)1 Λ = 01 Λ [ 1 Λ 9. Regulr Expressions, Pumping Lemm, Right Liner Grmmrs Ling 106 Mrch 25, 2002 1 Regulr Expressions A regulr expression descries or genertes lnguge: it is kind of shorthnd for listing the memers of lnguge.

More information

1. For each of the following theorems, give a two or three sentence sketch of how the proof goes or why it is not true.

1. For each of the following theorems, give a two or three sentence sketch of how the proof goes or why it is not true. York University CSE 2 Unit 3. DFA Clsses Converting etween DFA, NFA, Regulr Expressions, nd Extended Regulr Expressions Instructor: Jeff Edmonds Don t chet y looking t these nswers premturely.. For ech

More information

Regular expressions, Finite Automata, transition graphs are all the same!!

Regular expressions, Finite Automata, transition graphs are all the same!! CSI 3104 /Winter 2011: Introduction to Forml Lnguges Chpter 7: Kleene s Theorem Chpter 7: Kleene s Theorem Regulr expressions, Finite Automt, trnsition grphs re ll the sme!! Dr. Neji Zgui CSI3104-W11 1

More information

Farey Fractions. Rickard Fernström. U.U.D.M. Project Report 2017:24. Department of Mathematics Uppsala University

Farey Fractions. Rickard Fernström. U.U.D.M. Project Report 2017:24. Department of Mathematics Uppsala University U.U.D.M. Project Report 07:4 Frey Frctions Rickrd Fernström Exmensrete i mtemtik, 5 hp Hledre: Andres Strömergsson Exmintor: Jörgen Östensson Juni 07 Deprtment of Mthemtics Uppsl University Frey Frctions

More information

Formal Languages and Automata Theory. D. Goswami and K. V. Krishna

Formal Languages and Automata Theory. D. Goswami and K. V. Krishna Forml Lnguges nd Automt Theory D. Goswmi nd K. V. Krishn Novemer 5, 2010 Contents 1 Mthemticl Preliminries 3 2 Forml Lnguges 4 2.1 Strings............................... 5 2.2 Lnguges.............................

More information

Tutorial Automata and formal Languages

Tutorial Automata and formal Languages Tutoril Automt nd forml Lnguges Notes for to the tutoril in the summer term 2017 Sestin Küpper, Christine Mik 8. August 2017 1 Introduction: Nottions nd sic Definitions At the eginning of the tutoril we

More information

First Midterm Examination

First Midterm Examination 24-25 Fll Semester First Midterm Exmintion ) Give the stte digrm of DFA tht recognizes the lnguge A over lphet Σ = {, } where A = {w w contins or } 2) The following DFA recognizes the lnguge B over lphet

More information

AUTOMATA AND LANGUAGES. Definition 1.5: Finite Automaton

AUTOMATA AND LANGUAGES. Definition 1.5: Finite Automaton 25. Finite Automt AUTOMATA AND LANGUAGES A system of computtion tht only hs finite numer of possile sttes cn e modeled using finite utomton A finite utomton is often illustrted s stte digrm d d d. d q

More information

CS 310 (sec 20) - Winter Final Exam (solutions) SOLUTIONS

CS 310 (sec 20) - Winter Final Exam (solutions) SOLUTIONS CS 310 (sec 20) - Winter 2003 - Finl Exm (solutions) SOLUTIONS 1. (Logic) Use truth tles to prove the following logicl equivlences: () p q (p p) (q q) () p q (p q) (p q) () p q p q p p q q (q q) (p p)

More information

Closure Properties of Regular Languages

Closure Properties of Regular Languages Closure Properties of Regulr Lnguges Regulr lnguges re closed under mny set opertions. Let L 1 nd L 2 e regulr lnguges. (1) L 1 L 2 (the union) is regulr. (2) L 1 L 2 (the conctention) is regulr. (3) L

More information

Context-Free Grammars and Languages

Context-Free Grammars and Languages Context-Free Grmmrs nd Lnguges (Bsed on Hopcroft, Motwni nd Ullmn (2007) & Cohen (1997)) Introduction Consider n exmple sentence: A smll ct ets the fish English grmmr hs rules for constructing sentences;

More information

Thoery of Automata CS402

Thoery of Automata CS402 Thoery of Automt C402 Theory of Automt Tle of contents: Lecture N0. 1... 4 ummry... 4 Wht does utomt men?... 4 Introduction to lnguges... 4 Alphets... 4 trings... 4 Defining Lnguges... 5 Lecture N0. 2...

More information

The Minimum Label Spanning Tree Problem: Illustrating the Utility of Genetic Algorithms

The Minimum Label Spanning Tree Problem: Illustrating the Utility of Genetic Algorithms The Minimum Lel Spnning Tree Prolem: Illustrting the Utility of Genetic Algorithms Yupei Xiong, Univ. of Mrylnd Bruce Golden, Univ. of Mrylnd Edwrd Wsil, Americn Univ. Presented t BAE Systems Distinguished

More information

Assignment 1 Automata, Languages, and Computability. 1 Finite State Automata and Regular Languages

Assignment 1 Automata, Languages, and Computability. 1 Finite State Automata and Regular Languages Deprtment of Computer Science, Austrlin Ntionl University COMP2600 Forml Methods for Softwre Engineering Semester 2, 206 Assignment Automt, Lnguges, nd Computility Smple Solutions Finite Stte Automt nd

More information

Bases for Vector Spaces

Bases for Vector Spaces Bses for Vector Spces 2-26-25 A set is independent if, roughly speking, there is no redundncy in the set: You cn t uild ny vector in the set s liner comintion of the others A set spns if you cn uild everything

More information

Preview 11/1/2017. Greedy Algorithms. Coin Change. Coin Change. Coin Change. Coin Change. Greedy algorithms. Greedy Algorithms

Preview 11/1/2017. Greedy Algorithms. Coin Change. Coin Change. Coin Change. Coin Change. Greedy algorithms. Greedy Algorithms Preview Greed Algorithms Greed Algorithms Coin Chnge Huffmn Code Greed lgorithms end to e simple nd strightforwrd. Are often used to solve optimiztion prolems. Alws mke the choice tht looks est t the moment,

More information

Table of contents: Lecture N Summary... 3 What does automata mean?... 3 Introduction to languages... 3 Alphabets... 3 Strings...

Table of contents: Lecture N Summary... 3 What does automata mean?... 3 Introduction to languages... 3 Alphabets... 3 Strings... Tle of contents: Lecture N0.... 3 ummry... 3 Wht does utomt men?... 3 Introduction to lnguges... 3 Alphets... 3 trings... 3 Defining Lnguges... 4 Lecture N0. 2... 7 ummry... 7 Kleene tr Closure... 7 Recursive

More information

Balanced binary search trees

Balanced binary search trees 02110 Inge Li Gørtz Overview Blnced binry serch trees: Red-blck trees nd 2-3-4 trees Amortized nlysis Dynmic progrmming Network flows String mtching String indexing Computtionl geometry Introduction to

More information

Revision Sheet. (a) Give a regular expression for each of the following languages:

Revision Sheet. (a) Give a regular expression for each of the following languages: Theoreticl Computer Science (Bridging Course) Dr. G. D. Tipldi F. Bonirdi Winter Semester 2014/2015 Revision Sheet University of Freiurg Deprtment of Computer Science Question 1 (Finite Automt, 8 + 6 points)

More information

Name Ima Sample ASU ID

Name Ima Sample ASU ID Nme Im Smple ASU ID 2468024680 CSE 355 Test 1, Fll 2016 30 Septemer 2016, 8:35-9:25.m., LSA 191 Regrding of Midterms If you elieve tht your grde hs not een dded up correctly, return the entire pper to

More information

Grammar. Languages. Content 5/10/16. Automata and Languages. Regular Languages. Regular Languages

Grammar. Languages. Content 5/10/16. Automata and Languages. Regular Languages. Regular Languages 5//6 Grmmr Automt nd Lnguges Regulr Grmmr Context-free Grmmr Context-sensitive Grmmr Prof. Mohmed Hmd Softwre Engineering L. The University of Aizu Jpn Regulr Lnguges Context Free Lnguges Context Sensitive

More information

CS 301. Lecture 04 Regular Expressions. Stephen Checkoway. January 29, 2018

CS 301. Lecture 04 Regular Expressions. Stephen Checkoway. January 29, 2018 CS 301 Lecture 04 Regulr Expressions Stephen Checkowy Jnury 29, 2018 1 / 35 Review from lst time NFA N = (Q, Σ, δ, q 0, F ) where δ Q Σ P (Q) mps stte nd n lphet symol (or ) to set of sttes We run n NFA

More information

The size of subsequence automaton

The size of subsequence automaton Theoreticl Computer Science 4 (005) 79 84 www.elsevier.com/locte/tcs Note The size of susequence utomton Zdeněk Troníček,, Ayumi Shinohr,c Deprtment of Computer Science nd Engineering, FEE CTU in Prgue,

More information

Chapter Five: Nondeterministic Finite Automata. Formal Language, chapter 5, slide 1

Chapter Five: Nondeterministic Finite Automata. Formal Language, chapter 5, slide 1 Chpter Five: Nondeterministic Finite Automt Forml Lnguge, chpter 5, slide 1 1 A DFA hs exctly one trnsition from every stte on every symol in the lphet. By relxing this requirement we get relted ut more

More information

Finite Automata-cont d

Finite Automata-cont d Automt Theory nd Forml Lnguges Professor Leslie Lnder Lecture # 6 Finite Automt-cont d The Pumping Lemm WEB SITE: http://ingwe.inghmton.edu/ ~lnder/cs573.html Septemer 18, 2000 Exmple 1 Consider L = {ww

More information

Section 4: Integration ECO4112F 2011

Section 4: Integration ECO4112F 2011 Reding: Ching Chpter Section : Integrtion ECOF Note: These notes do not fully cover the mteril in Ching, ut re ment to supplement your reding in Ching. Thus fr the optimistion you hve covered hs een sttic

More information

1.3 Regular Expressions

1.3 Regular Expressions 56 1.3 Regulr xpressions These hve n importnt role in describing ptterns in serching for strings in mny pplictions (e.g. wk, grep, Perl,...) All regulr expressions of lphbet re 1.Ønd re regulr expressions,

More information

State Minimization for DFAs

State Minimization for DFAs Stte Minimiztion for DFAs Red K & S 2.7 Do Homework 10. Consider: Stte Minimiztion 4 5 Is this miniml mchine? Step (1): Get rid of unrechle sttes. Stte Minimiztion 6, Stte is unrechle. Step (2): Get rid

More information

Genetic Programming. Outline. Evolutionary Strategies. Evolutionary strategies Genetic programming Summary

Genetic Programming. Outline. Evolutionary Strategies. Evolutionary strategies Genetic programming Summary Outline Genetic Progrmming Evolutionry strtegies Genetic progrmming Summry Bsed on the mteril provided y Professor Michel Negnevitsky Evolutionry Strtegies An pproch simulting nturl evolution ws proposed

More information

Faster Regular Expression Matching. Philip Bille Mikkel Thorup

Faster Regular Expression Matching. Philip Bille Mikkel Thorup Fster Regulr Expression Mtching Philip Bille Mikkel Thorup Outline Definition Applictions History tour of regulr expression mtching Thompson s lgorithm Myers lgorithm New lgorithm Results nd extensions

More information

3 Regular expressions

3 Regular expressions 3 Regulr expressions Given n lphet Σ lnguge is set of words L Σ. So fr we were le to descrie lnguges either y using set theory (i.e. enumertion or comprehension) or y n utomton. In this section we shll

More information

1 From NFA to regular expression

1 From NFA to regular expression Note 1: How to convert DFA/NFA to regulr expression Version: 1.0 S/EE 374, Fll 2017 Septemer 11, 2017 In this note, we show tht ny DFA cn e converted into regulr expression. Our construction would work

More information

Exercises Chapter 1. Exercise 1.1. Let Σ be an alphabet. Prove wv = w + v for all strings w and v.

Exercises Chapter 1. Exercise 1.1. Let Σ be an alphabet. Prove wv = w + v for all strings w and v. 1 Exercises Chpter 1 Exercise 1.1. Let Σ e n lphet. Prove wv = w + v for ll strings w nd v. Prove # (wv) = # (w)+# (v) for every symol Σ nd every string w,v Σ. Exercise 1.2. Let w 1,w 2,...,w k e k strings,

More information

GNFA GNFA GNFA GNFA GNFA

GNFA GNFA GNFA GNFA GNFA DFA RE NFA DFA -NFA REX GNFA Definition GNFA A generlize noneterministic finite utomton (GNFA) is grph whose eges re lele y regulr expressions, with unique strt stte with in-egree, n unique finl stte with

More information

On Suffix Tree Breadth

On Suffix Tree Breadth On Suffix Tree Bredth Golnz Bdkoeh 1,, Juh Kärkkäinen 2, Simon J. Puglisi 2,, nd Bell Zhukov 2, 1 Deprtment of Computer Science University of Wrwick Conventry, United Kingdom g.dkoeh@wrwick.c.uk 2 Helsinki

More information

Solving the String Statistics Problem in Time O(n log n)

Solving the String Statistics Problem in Time O(n log n) Alcom-FT Technicl Report Series ALCOMFT-TR-02-55 Solving the String Sttistics Prolem in Time O(n log n) Gerth Stølting Brodl 1,,, Rune B. Lyngsø 3, Ann Östlin1,, nd Christin N. S. Pedersen 1,2, 1 BRICS,

More information

Lecture 3. In this lecture, we will discuss algorithms for solving systems of linear equations.

Lecture 3. In this lecture, we will discuss algorithms for solving systems of linear equations. Lecture 3 3 Solving liner equtions In this lecture we will discuss lgorithms for solving systems of liner equtions Multiplictive identity Let us restrict ourselves to considering squre mtrices since one

More information

12.1 Nondeterminism Nondeterministic Finite Automata. a a b ε. CS125 Lecture 12 Fall 2014

12.1 Nondeterminism Nondeterministic Finite Automata. a a b ε. CS125 Lecture 12 Fall 2014 CS125 Lecture 12 Fll 2014 12.1 Nondeterminism The ide of nondeterministic computtions is to llow our lgorithms to mke guesses, nd only require tht they ccept when the guesses re correct. For exmple, simple

More information

The Regulated and Riemann Integrals

The Regulated and Riemann Integrals Chpter 1 The Regulted nd Riemnn Integrls 1.1 Introduction We will consider severl different pproches to defining the definite integrl f(x) dx of function f(x). These definitions will ll ssign the sme vlue

More information

Section 6.1 INTRO to LAPLACE TRANSFORMS

Section 6.1 INTRO to LAPLACE TRANSFORMS Section 6. INTRO to LAPLACE TRANSFORMS Key terms: Improper Integrl; diverge, converge A A f(t)dt lim f(t)dt Piecewise Continuous Function; jump discontinuity Function of Exponentil Order Lplce Trnsform

More information

CHAPTER 1 Regular Languages. Contents. definitions, examples, designing, regular operations. Non-deterministic Finite Automata (NFA)

CHAPTER 1 Regular Languages. Contents. definitions, examples, designing, regular operations. Non-deterministic Finite Automata (NFA) Finite Automt (FA or DFA) CHAPTER Regulr Lnguges Contents definitions, exmples, designing, regulr opertions Non-deterministic Finite Automt (NFA) definitions, equivlence of NFAs DFAs, closure under regulr

More information

CISC 4090 Theory of Computation

CISC 4090 Theory of Computation 9/6/28 Stereotypicl computer CISC 49 Theory of Computtion Finite stte mchines & Regulr lnguges Professor Dniel Leeds dleeds@fordhm.edu JMH 332 Centrl processing unit (CPU) performs ll the instructions

More information

2.4 Linear Inequalities and Interval Notation

2.4 Linear Inequalities and Interval Notation .4 Liner Inequlities nd Intervl Nottion We wnt to solve equtions tht hve n inequlity symol insted of n equl sign. There re four inequlity symols tht we will look t: Less thn , Less thn or

More information

Lecture 2: January 27

Lecture 2: January 27 CS 684: Algorithmic Gme Theory Spring 217 Lecturer: Év Trdos Lecture 2: Jnury 27 Scrie: Alert Julius Liu 2.1 Logistics Scrie notes must e sumitted within 24 hours of the corresponding lecture for full

More information

dx dt dy = G(t, x, y), dt where the functions are defined on I Ω, and are locally Lipschitz w.r.t. variable (x, y) Ω.

dx dt dy = G(t, x, y), dt where the functions are defined on I Ω, and are locally Lipschitz w.r.t. variable (x, y) Ω. Chpter 8 Stility theory We discuss properties of solutions of first order two dimensionl system, nd stility theory for specil clss of liner systems. We denote the independent vrile y t in plce of x, nd

More information

Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018

Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018 Finite Automt Theory nd Forml Lnguges TMV027/DIT321 LP4 2018 Lecture 10 An Bove April 23rd 2018 Recp: Regulr Lnguges We cn convert between FA nd RE; Hence both FA nd RE ccept/generte regulr lnguges; More

More information

Quadratic Forms. Quadratic Forms

Quadratic Forms. Quadratic Forms Qudrtic Forms Recll the Simon & Blume excerpt from n erlier lecture which sid tht the min tsk of clculus is to pproximte nonliner functions with liner functions. It s ctully more ccurte to sy tht we pproximte

More information

Anatomy of a Deterministic Finite Automaton. Deterministic Finite Automata. A machine so simple that you can understand it in less than one minute

Anatomy of a Deterministic Finite Automaton. Deterministic Finite Automata. A machine so simple that you can understand it in less than one minute Victor Admchik Dnny Sletor Gret Theoreticl Ides In Computer Science CS 5-25 Spring 2 Lecture 2 Mr 3, 2 Crnegie Mellon University Deterministic Finite Automt Finite Automt A mchine so simple tht you cn

More information

20 MATHEMATICS POLYNOMIALS

20 MATHEMATICS POLYNOMIALS 0 MATHEMATICS POLYNOMIALS.1 Introduction In Clss IX, you hve studied polynomils in one vrible nd their degrees. Recll tht if p(x) is polynomil in x, the highest power of x in p(x) is clled the degree of

More information

Fast Frequent Free Tree Mining in Graph Databases

Fast Frequent Free Tree Mining in Graph Databases The Chinese University of Hong Kong Fst Frequent Free Tree Mining in Grph Dtses Peixing Zho Jeffrey Xu Yu The Chinese University of Hong Kong Decemer 18 th, 2006 ICDM Workshop MCD06 Synopsis Introduction

More information

Myhill-Nerode Theorem

Myhill-Nerode Theorem Overview Myhill-Nerode Theorem Correspondence etween DA s nd MN reltions Cnonicl DA for L Computing cnonicl DFA Myhill-Nerode Theorem Deepk D Souz Deprtment of Computer Science nd Automtion Indin Institute

More information

Review of Gaussian Quadrature method

Review of Gaussian Quadrature method Review of Gussin Qudrture method Nsser M. Asi Spring 006 compiled on Sundy Decemer 1, 017 t 09:1 PM 1 The prolem To find numericl vlue for the integrl of rel vlued function of rel vrile over specific rnge

More information

CS 330 Formal Methods and Models Dana Richards, George Mason University, Spring 2016 Quiz Solutions

CS 330 Formal Methods and Models Dana Richards, George Mason University, Spring 2016 Quiz Solutions CS 330 Forml Methods nd Models Dn Richrds, George Mson University, Spring 2016 Quiz Solutions Quiz 1, Propositionl Logic Dte: Ferury 9 1. (4pts) ((p q) (q r)) (p r), prove tutology using truth tles. p

More information

Homework Solution - Set 5 Due: Friday 10/03/08

Homework Solution - Set 5 Due: Friday 10/03/08 CE 96 Introduction to the Theory of Computtion ll 2008 Homework olution - et 5 Due: ridy 10/0/08 1. Textook, Pge 86, Exercise 1.21. () 1 2 Add new strt stte nd finl stte. Mke originl finl stte non-finl.

More information

DFA minimisation using the Myhill-Nerode theorem

DFA minimisation using the Myhill-Nerode theorem DFA minimistion using the Myhill-Nerode theorem Johnn Högerg Lrs Lrsson Astrct The Myhill-Nerode theorem is n importnt chrcteristion of regulr lnguges, nd it lso hs mny prcticl implictions. In this chpter,

More information

arxiv: v1 [cs.ds] 19 Jul 2012

arxiv: v1 [cs.ds] 19 Jul 2012 Efficient LZ78 fctoriztion of grmmr compressed text Hideo Bnni, Shunsuke Ineng, nd Msyuki Tked rxiv:1207.4607v1 [cs.ds] 19 Jul 2012 Deprtment of Informtics, Kyushu University {nni,ineng,tked}@inf.kyushu-u.c.jp

More information

CM10196 Topic 4: Functions and Relations

CM10196 Topic 4: Functions and Relations CM096 Topic 4: Functions nd Reltions Guy McCusker W. Functions nd reltions Perhps the most widely used notion in ll of mthemtics is tht of function. Informlly, function is n opertion which tkes n input

More information

First Midterm Examination

First Midterm Examination Çnky University Deprtment of Computer Engineering 203-204 Fll Semester First Midterm Exmintion ) Design DFA for ll strings over the lphet Σ = {,, c} in which there is no, no nd no cc. 2) Wht lnguge does

More information

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying Vitli covers 1 Definition. A Vitli cover of set E R is set V of closed intervls with positive length so tht, for every δ > 0 nd every x E, there is some I V with λ(i ) < δ nd x I. 2 Lemm (Vitli covering)

More information

More on automata. Michael George. March 24 April 7, 2014

More on automata. Michael George. March 24 April 7, 2014 More on utomt Michel George Mrch 24 April 7, 2014 1 Automt constructions Now tht we hve forml model of mchine, it is useful to mke some generl constructions. 1.1 DFA Union / Product construction Suppose

More information

CSE : Exam 3-ANSWERS, Spring 2011 Time: 50 minutes

CSE : Exam 3-ANSWERS, Spring 2011 Time: 50 minutes CSE 260-002: Exm 3-ANSWERS, Spring 20 ime: 50 minutes Nme: his exm hs 4 pges nd 0 prolems totling 00 points. his exm is closed ook nd closed notes.. Wrshll s lgorithm for trnsitive closure computtion is

More information

expression simply by forming an OR of the ANDs of all input variables for which the output is

expression simply by forming an OR of the ANDs of all input variables for which the output is 2.4 Logic Minimiztion nd Krnugh Mps As we found ove, given truth tle, it is lwys possile to write down correct logic expression simply y forming n OR of the ANDs of ll input vriles for which the output

More information