A Faster Grammar-Based Self-index


Travis Gagie (Aalto University, Finland), Paweł Gawrychowski (University of Wrocław, Poland), Juha Kärkkäinen (University of Helsinki, Finland), Yakov Nekrich (University of Bonn, Germany), and Simon J. Puglisi (King's College London, UK)

Abstract. To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on straight-line programs and LZ77. In this paper we show how, given a balanced straight-line program for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words and obtain a compressed self-index for S such that, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m^2 + (m + occ) log log n) time. All previous self-indexes are either larger or slower in the worst case.

1 Introduction

With the advance of DNA-sequencing technologies comes the problem of how to store many individuals' genomes compactly but such that we can search them quickly. Any two human genomes are 99.9% the same, but compressed self-indexes based on compressed suffix arrays, the Burrows-Wheeler Transform or LZ78 (see [21] for a survey) do not take full advantage of this similarity. Researchers have recently started building compressed self-indexes based on context-free grammars (CFGs) and LZ77 [23], which better compress highly repetitive strings. A compressed self-index stores a string S[1..n] in compressed form such that, given a pattern P[1..m], we can quickly list the occ occurrences of P in S. Claude and Navarro [4] gave the first compressed self-index based on grammars, which takes O(r log r) + r log n bits, where r is the number of rules in the given grammar, and O((m^2 + h(m + occ)) log r) time, where h is the height of the parse tree. Throughout this paper, we write log to mean log_2.
Very recently, the same authors [5] described another grammar-based compressed self-index, which takes 2R log r + R log n + ɛr log r + o(r log r) bits, where R is the total length of the rules' right-hand sides and 0 < ɛ ≤ 1, and O((m^2/ɛ) log r + occ log r) time.(1) Whereas their first index requires a straight-line program (SLP), their new index can be built on top of an arbitrary CFG.

* Gawrychowski supported by MNiSW grant number N.
(1) They have now reduced the log r factor in this time bound to log(log n / log r).

A.-H. Dediu and C. Martín-Vide (Eds.): LATA 2012, LNCS 7183, © Springer-Verlag Berlin Heidelberg 2012.

Table 1. Space and time bounds for Claude and Navarro's compressed self-indexes [4,5], Kreft and Navarro's [16] and our own. Throughout, r is the number of rules in a given CFG generating a string S[1..n] and only S, but the CFG must be an SLP in Row 1 and a balanced SLP in Row 4. In Row 1, h is the height of the parse tree. In Row 2, R is the total length of the rules' right-hand sides and 0 < ɛ ≤ 1. In Row 3, d is the depth of nesting in the parse.

source  | total space (bits)                             | search time
[4]     | O(r log r) + r log n                           | O((m^2 + h(m + occ)) log r)
[5]     | 2R log r + R log n + ɛr log r + o(r log r)     | O((m^2/ɛ) log r + occ log r)
[16]    | 2z log(n/z) + z log z + 5z log σ + O(z) + o(n) | O(m^2 d + (m + occ) log z)
Thm. 4  | 2r log r + O(z(log n + log z log log z))       | O(m^2 + (m + occ) log log n)

Kreft and Navarro [16] gave the first (and, so far, only) compressed self-index based on LZ77, which takes 2z log(n/z) + z log z + 5z log σ + O(z) + o(n) bits, where z is the number of phrases in the LZ77 parse of S, and O(m^2 d + (m + occ) log z) time, where d ≤ z is the depth of nesting in the parse. Throughout this paper, we consider the non-self-referential LZ77 parse, so z ≥ log n. The o(n) term in Kreft and Navarro's space bound can be dropped at the cost of increasing the time bound by a factor of log(n/z). In this paper we show how, given a balanced SLP for S, we can add O(z(log n + log z log log z)) bits and obtain a compressed self-index that answers queries in O(m^2 + (m + occ) log log n) time. We note that Maruyama et al. [18] recently described another grammar-based index, but its search time depends on the number of occurrences of a maximal common subtree in edit-sensitive parse trees of P and S, making it difficult to compare their data structure to those mentioned above.
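To make the parse concrete, the non-self-referential LZ77 parse can be sketched as follows. This is our own illustrative, quadratic-time version (a practical parser would use a suffix tree or similar index over the prefix scanned so far), and the function name and list representation are assumptions, not the paper's.

```python
def lz77_parse(s):
    # Non-self-referential LZ77: each phrase is the longest prefix of the
    # remaining text that occurs entirely before the current position,
    # plus one new character.  Illustrative quadratic-time sketch only.
    phrases = []
    i = 0
    while i < len(s):
        j = i
        # extend the match while s[i:j+1] occurs entirely within s[:i]
        while j + 1 <= len(s) and s[:i].find(s[i:j + 1]) != -1:
            j += 1
        # the phrase is the longest previous match plus one new character
        phrases.append(s[i:min(j + 1, len(s))])
        i = min(j + 1, len(s))
    return phrases
```

On Kreft and Navarro's running example, lz77_parse("alabar a la alabarda$") returns nine phrases (a, l, ab, ar, a space, a plus a space, la plus a space, alabard, a$), so z = 9.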
In order to better compare our bounds, those by Claude and Navarro and those by Kreft and Navarro (all of which are summarized in Table 1), it is useful to review some bounds for LZ77, balanced and unbalanced SLPs, and arbitrary CFGs. The LZ77 compression algorithm works by parsing S from left to right into phrases: after parsing S[1..i-1], it finds the longest prefix S[i..j-1] of S[i..n] that has occurred before and selects S[i..j] as the next phrase. The previous occurrence of S[i..j-1] is called the phrase's source. Kreft and Navarro considered the non-self-referential LZ77 parse, in which each phrase's source must end before the phrase itself begins, so z ≥ log n. For ease of comparison, we also consider the non-self-referential parse. In an SLP of size r, each of the non-terminals X_1,...,X_r appears on the left-hand side of exactly one rule and each rule is either of the form X_i → a, where a is a terminal, or X_i → X_j X_k, where i > j,k; thus, the SLP generates exactly one string and can be encoded in 2r log r + O(r) bits. Rytter [22] and Charikar et al. [2] showed that, in every CFG generating S and only S, the total length of the rules' right-hand sides is at least z. Rytter showed how we can build an SLP for S with O(z log n) rules whose parse tree has the shape of an AVL tree, which is height-balanced, and Charikar et al. showed how to build such an SLP whose parse tree is weight-balanced. Finally, Charikar et al. showed that finding

an algorithm for building a CFG generating S and only S in which the total length of the rules' right-hand sides is o(z log n / log log n) would imply progress on the well-studied problem of building addition chains. With Rytter's and Charikar et al.'s bounds in mind, it is clear that the time bound of our index is strictly better than those of the others. Kreft and Navarro's index has the best worst-case space bound, but only when we drop the o(n) term, increasing their time bound by a factor of log(n/z). As it is not known whether it is possible, regardless of the time taken, to build a CFG that generates S and only S and in which the total length of the rules' right-hand sides is o(z log n), it is also not known whether Claude and Navarro's space bounds are ever strictly better than ours. It is even conceivable that they could be worse: suppose z = log^{O(1)} n and the minimum total length of the rules' right-hand sides in a CFG generating S and only S is z log n; then the r log n and R log n terms in Claude and Navarro's space bounds are Θ(z log^2 n), while the dominant term in our own space bound is 2r log r = Θ(z log n log log n). To build our index, we add bookmarks to the SLP that allow us to extract substrings quickly from around boundaries between phrases in the LZ77 parse. In Section 2 we show that a bookmark for a position b takes O(log n) bits and, for any length l, allows us to extract S[b-l..b+l] in O(l + log log n) time. Kreft and Navarro showed how, if we can extract substrings quickly from around phrase boundaries, then we can quickly list all occurrences of P that finish at or cross those boundaries, which are called primary occurrences. They first use two Patricia trees and access at the boundaries to find, for each position i ≥ 1 in P, the lexicographic range of the reverses of phrases ending with P[1..i], and the lexicographic range of suffixes of S starting with P[i+1..m] at phrase boundaries.
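Conceptually, these two range queries can be illustrated with explicit sorted lists and binary search standing in for the Patricia trees and boundary access; all names below are our own illustrative assumptions, not the paper's.

```python
import bisect

HI = chr(0x10FFFF)  # assumed larger than any character in the indexed strings

def lex_range(sorted_strings, prefix):
    # [lo, hi) interval of strings in the sorted list starting with prefix
    lo = bisect.bisect_left(sorted_strings, prefix)
    hi = bisect.bisect_left(sorted_strings, prefix + HI)
    return lo, hi

def split_ranges(pattern, rev_phrases, boundary_suffixes):
    # For each split P[1..i] / P[i+1..m]: the range of reversed phrases
    # starting with reverse(P[1..i]) (i.e. phrases ending with P[1..i])
    # and the range of boundary suffixes starting with P[i+1..m].
    rev_sorted = sorted(rev_phrases)
    suf_sorted = sorted(boundary_suffixes)
    return {i: (lex_range(rev_sorted, pattern[:i][::-1]),
                lex_range(suf_sorted, pattern[i:]))
            for i in range(1, len(pattern) + 1)}
```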
They then use a wavelet tree as a data structure for 2-dimensional range reporting to find all the phrase boundaries immediately preceded by P[1..i] and followed by P[i+1..m]. In Section 3 we show how, using bookmarks for faster access and a faster (albeit larger) range-reporting data structure, we can list all primary occurrences in O(m^2 + (m + occ) log log n) time. Occurrences of P that neither finish at nor cross phrase boundaries are called secondary, and they can be found recursively from the primary occurrences. We build a data structure for 2-sided range reporting on an n × n grid on which we place a point (i, j) for each phrase's source S[i..j]. Since the queries are 2-sided, this data structure takes O(z log n) bits and answers queries in O(log log n + p) time, where p is the number of points reported. If a phrase contains a secondary occurrence at a certain position in the phrase, then that phrase's source must also contain an occurrence at the corresponding position. It follows that, by querying the data structure with (a, b) for each primary or secondary occurrence S[a..b] we find, we can list all secondary occurrences of P in O(occ log log n) time.

2 Adding Bookmarks to a Balanced SLP

An SLP for S is called balanced if its parse tree is height- or weight-balanced. We claim that, for 1 ≤ i ≤ j = i + 2g ≤ n, we can choose two nodes v and

w of height O(log g) in the parse tree whose subtrees include the ith through jth leaves. To see why, let u be the lowest common ancestor of the ith and jth leaves. If u has height O(log g), then we choose its children as v and w; otherwise, suppose the rightmost leaf in u's left subtree is the kth in the tree. We choose as v the lowest common ancestor of the ith through kth leaves, which must lie on the rightmost path in u's left subtree. Since v's right subtree has fewer than k - i + 1 ≤ 2g leaves, v has height O(log g). We choose as w the lowest common ancestor of the (k+1)st through jth leaves, which must lie on the leftmost path in u's right subtree. Since w's left subtree has fewer than j - k ≤ 2g leaves, w also has height O(log g). Figure 1 illustrates our choice of v and w. Without loss of generality, assume b = i + g ≤ k; the case when b > k is symmetric. Then the bth leaf, which is midway between the ith and jth, is in v's subtree and we can store the path from v to the bth leaf in O(log g) bits. With this information, in O(l + log g) time we can perform partial depth-first traversals of v's subtree that start by descending from v to the bth leaf and then visit the l ≤ g leaves immediately to its left and right, if they are in v's subtree. Some of the l leaves immediately to the right of the bth leaf may be the leftmost leaves in w's subtree, instead; we can descend from w and visit them in O(l + log g) time.

Fig. 1. Suppose we are given an SLP for S and consider its parse tree. For 1 ≤ i ≤ j = i + 2g ≤ n, we can choose two nodes v and w of height O(log g) in the parse tree whose subtrees include the ith through jth leaves. Without loss of generality, assume the bth leaf is in v's subtree, where b = i + g. By storing the non-terminals at v and w and the path from v to b, we can extract S[b-l..b+l] in O(l + log g) time.
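The descent itself can be sketched on a toy SLP representation (a dictionary mapping each non-terminal either to a terminal character or to a pair of non-terminals; this encoding and the function names are our own assumptions). Only subtrees overlapping the requested leaf interval are visited, so the work is proportional to the output length plus the tree height, mirroring the traversal above.

```python
def slp_lengths(rules, root):
    # expansion length of every non-terminal reachable from root
    length = {}
    def fill(x):
        if x not in length:
            body = rules[x]
            length[x] = 1 if isinstance(body, str) else fill(body[0]) + fill(body[1])
        return length[x]
    fill(root)
    return length

def slp_extract(rules, length, x, lo, hi, out):
    # append expansion(x)[lo:hi] to out, visiting only overlapping subtrees
    if lo >= hi:
        return
    body = rules[x]
    if isinstance(body, str):        # terminal rule
        out.append(body)
        return
    left, right = body
    mid = length[left]
    if lo < mid:
        slp_extract(rules, length, left, lo, min(hi, mid), out)
    if hi > mid:
        slp_extract(rules, length, right, max(lo, mid) - mid, hi - mid, out)
```

For example, with rules X1 → a, X2 → b, X3 → X1 X2, X4 → X3 X3 (expanding to abab), extracting the interval [1, 3) yields ba.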
It follows that, if we store the non-terminals at v and w and O(log g) extra bits then, given l ≤ g, we can extract S[b-l..b+l] in O(l + log g) time.

Lemma 1. Given a balanced SLP for S with r rules and integers b and g, we can store 2 log r + O(log g) bits such that later, given l ≤ g, we can extract S[b-l..b+l] in O(l + log g) time.

The drawback to Lemma 1 is that we must know in advance an upper bound on the number of characters we will want to extract. To remove this restriction,

for each position we wish to bookmark, we apply Lemma 1 twice: the first time, we set g = n so that, given l ≥ log n, we can extract S[b-l..b+l] in O(l) time; the second time, we set g = log n so that, given l ≤ log n, we can extract S[b-l..b+l] in O(l + log log n) time.

Theorem 2. Given a balanced SLP for S and a position b, we can store O(log n) bits such that later, given l, we can extract S[b-l..b+l] in O(l + log log n) time.

We note that, without changing the asymptotic space bound in Theorem 2, the time bound can be improved to O(l + log^(c) n), where c is any constant and log^(c) n is the result of recursively applying the logarithm function c times to n. If we increase the space bound to O(log n log* n), where log* is the iterated logarithm, then the time bound decreases to O(l). Even as stated, however, Theorem 2 is enough for our purposes in this paper, as we use only the following corollary, which follows by applying Theorem 2 for each phrase boundary in the LZ77 parse.

Corollary 3. Given a balanced SLP for S, we can add O(z log n) bits such that later, given l, we can extract l characters to either side of any phrase boundary in O(l + log log n) time.

As an aside, we note that Rytter's construction produces an SLP in which, for each phrase in the LZ77 parse, there is a non-terminal whose expansion is that phrase. For any balanced SLP with this property, it is easy to add bookmarks such that we can extract quickly from phrase boundaries: for each phrase, we store the non-terminal generating that phrase and the leftmost and rightmost non-terminals at height log log n in its parse tree.

3 Listing Occurrences

As explained in Section 1, we use the same framework as Kreft and Navarro but with faster access at the phrase boundaries, using Corollary 3 and a faster data structure for 2-dimensional range reporting.
Given a pattern P[1..m], for 1 ≤ i ≤ m, we find all the boundaries where the previous phrase ends with P[1..i] and the next phrase starts with P[i+1..m]. To do this, we build one Patricia tree for the reversed phrases and another for the suffixes starting at phrase boundaries. A Patricia tree [20] is a compressed trie in which the length of each edge label is recorded and then all but the first character of each edge label are discarded. It follows that the two Patricia trees have O(z) nodes and take O(z log n) bits; notice that no two suffixes can be the same and, by the definition of the LZ77 parse, no two reversed phrases can be the same. We label each leaf in the first Patricia tree by the number of the phrase whose reverse leads to that leaf; we label each leaf in the second Patricia tree by the number of the phrase starting the suffix that leads to that leaf. For example, if S = alabar a la alabarda$, then its LZ77 parse is as shown in Figure 2 (from Kreft and Navarro's paper). For convenience, we treat all

strings as ending with a special character $.

Fig. 2. The LZ77 parse of alabar a la alabarda$, from Kreft and Navarro's paper [16]. Horizontal lines indicate phrases' sources, with arrows leading to the boxes containing the phrases themselves.

In this case, the reversed phrases are as shown below with the phrase numbers, in order by phrase number on the left and in lexicographic order on the right.

1) a$         9) $a$
2) l$         5)  $
3) ba$        6)  a$
4) ra$        7)  al$
5)  $         1) a$
6)  a$        3) ba$
7)  al$       8) drabala$
8) drabala$   2) l$
9) $a$        4) ra$

The suffixes starting at phrase boundaries are as shown below with the numbers of the phrases with which they begin, in order by phrase number on the left and in lexicographic order on the right.

2) labar a la alabarda$   5)  a la alabarda$
3) abar a la alabarda$    9) a$
4) ar a la alabarda$      6) a la alabarda$
5)  a la alabarda$        3) abar a la alabarda$
6) a la alabarda$         8) alabarda$
7) la alabarda$           4) ar a la alabarda$
8) alabarda$              7) la alabarda$
9) a$                     2) labar a la alabarda$

Notice we do not consider the whole string to be a suffix starting at a phrase boundary. We are interested only in phrase boundaries such that occurrences of patterns can cross the boundary with a non-empty prefix to the left of the boundary. The Patricia trees for the reversed phrases and suffixes are shown in Figure 3. Notice that, since la alabarda$ and labar a la alabarda$ share both their first and second characters, the second character a is omitted from the edge from the root to its rightmost child in the Patricia tree for the suffixes; we indicate this with a *. For each position i ≥ 1 in P, we find the lexicographic range of the reverses of phrases ending with P[1..i], and the lexicographic range of suffixes of S starting

with P[i+1..m] at phrase boundaries.

Fig. 3. The Patricia trees for the reversed phrases (on the left) in the LZ77 parse of alabar a la alabarda$ and the suffixes starting at phrase boundaries (on the right).

To do this, we first search for the reverse of P[1..i] in the Patricia tree for the reversed phrases and search for P[i+1..m] in the Patricia tree for the suffixes, then use our data structure from Corollary 3 to verify that the characters omitted from the edge labels match those in the corresponding positions in P. This takes O(m + log log n) time for each choice of i, or O(m^2 + m log log n) time in total. For example, if P = ala, then we search for a and la, al and a, and ala and the empty suffix. In the first search, we reach the root's third child in the Patricia tree for the reversed phrases, which is the 5th leaf (labelled 1), and the root's third child in the Patricia tree for the suffixes, which is the ancestor of the 7th and 8th leaves (labelled 7 and 2); the second and third searches fail because there are no paths in the first Patricia tree corresponding to al or ala. Once we have the lexicographic ranges for a choice of i, we check whether there are any pairs of consecutive phrase numbers with the first number j from the first range and the second number j+1 from the second range. In our example, (1, 2) is such a pair, meaning that the first phrase ends with a and the second phrase starts with la, so there is a primary occurrence crossing the first phrase boundary. To be able to find such pairs efficiently, we store a data structure for 2-dimensional range reporting on a z × (z-1) grid on which we have placed z-1 points, with each point (a, b) indicating that the phrase numbers of the lexicographically ath reversed phrase and the lexicographically bth suffix are consecutive.
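The whole reporting pipeline, primary occurrences at boundaries followed by the secondary recursion via sources, can be illustrated naively as follows, with direct scans standing in for the Patricia trees and both range-reporting grids; the phrase-list and source-interval representation is our own simplification, not the paper's.

```python
def list_occurrences(s, pattern, phrases, sources):
    # phrases: the LZ77 phrases of s, in order; sources[k] is the 0-based
    # inclusive interval (i, j) of phrase k's source, or None for a phrase
    # that is a single new character.
    starts, pos = [], 0
    for ph in phrases:
        starts.append(pos)
        pos += len(ph)
    ends = [st + len(ph) for st, ph in zip(starts, phrases)]
    m = len(pattern)
    # primary occurrences finish at or cross a phrase boundary
    primary = [a for a in range(len(s) - m + 1)
               if s[a:a + m] == pattern and any(a < e <= a + m for e in ends)]
    # secondary occurrences, found recursively: whenever a source contains
    # an occurrence, the phrase copied from that source contains one too
    found, queue = set(primary), list(primary)
    while queue:
        a = queue.pop()
        for k, src in enumerate(sources):
            if src is not None and src[0] <= a and a + m - 1 <= src[1]:
                copy = starts[k] + (a - src[0])
                if copy not in found:
                    found.add(copy)
                    queue.append(copy)
    return sorted(found)
```

On the running example, the pattern ala has one primary occurrence, at position 0, and one secondary occurrence, at position 12, copied from the source of the 8th phrase.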
Figure 4 shows the grid for our example, on the left, with the four-sided range query on the 5th row and the 7th and 8th columns that returns the point (5, 8), indicating a primary occurrence overlapping the phrases 1 and 2. Notice that the first row is always empty, since the lexicographically first reversed phrase starts with the end-of-string character $. Kreft and Navarro used a wavelet tree, but we use a recent data structure by Chan, Larsen and Pǎtraşcu [1] that takes O(z log log z) words and answers queries in O((1 + p) log log z) time, where p is the number of points reported. With this data structure, we can list all occ_1 primary occurrences in O(m^2 + (m + occ_1) log log n) time. Once we have found all the primary occurrences, it is not difficult to recursively find all the secondary occurrences. We build a data structure for 2-sided range reporting on an n × n grid on which we place a point (i, j) for each phrase's source S[i..j]. Since the queries are 2-sided, we can implement this data structure as a predecessor data structure and a range-maximum data structure [19], which together take O(z log n) bits and answer queries in O(log log n + p) time [7,11],

where p is the number of points reported. If a phrase contains a secondary occurrence at a certain position in the phrase, then that phrase's source must also contain an occurrence at the corresponding position. Notice that, by the definition of the parse, the first occurrence of any substring must be primary. It follows that, by querying the data structure with (a, b) for each primary or secondary occurrence S[a..b] we find, we can list all secondary occurrences of P in O(occ log log n) time. Figure 4 shows the grid for our example, on the right, with black dots indicating phrase sources and the grey dot indicating a two-sided query to find sources containing the primary occurrence S[1..3] of ala. This query returns a point (1, 6), so the phrase whose source is S[1..6], i.e., the 8th, contains a secondary occurrence of ala.

Fig. 4. On the left, the grid we use for listing primary occurrences in S = alabar a la alabarda$. The sides of the grid are labelled with the phrase numbers from the leaves in the two Patricia trees. A four-sided range query on the 5th row and the 7th and 8th columns returns the point (5, 8), indicating a primary occurrence overlapping the phrases 1 and 2. On the right, the grid we use for listing secondary occurrences in S = alabar a la alabarda$. The black dots indicate phrase sources and the grey dot indicates a two-sided query to find sources containing S[1..3].

Theorem 4. Given a balanced straight-line program for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words and obtain a compressed self-index for S such that, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m^2 + (m + occ) log log n) time.

4 Postscript

We recently designed a compressed self-index [12] based on the Relative Lempel-Ziv (RLZ) compression scheme proposed by Kuruppu, Puglisi and Zobel [17].
Given a database of genomes from individuals of the same species, Kuruppu et al. store the first genome G in an FM-index [8] and store the others compressed with a version of LZ77 [23] that allows phrases to be copied only from the first genome. They showed that RLZ compresses well and supports fast extraction, but did not show how to support search. Of course, given a pattern P[1..m], we can quickly find all occurrences of P in G by searching in its FM-index. If we

store a data structure for 2-sided range reporting like the one described at the end of Section 3, then we can also quickly find all secondary occurrences of P in the rest of the database. We now sketch one way to find primary occurrences. Let R be the rest of the database and suppose its RLZ parse relative to G consists of z phrases, d of them distinct. Consider each distinct phrase as a meta-character and consider the parse as a string R'[1..z] over the alphabet of meta-characters. Suppose a meta-character x in R' indicates a phrase copied from G[i..j]; then we associate with x the non-empty interval corresponding to G[i..j] in the Burrows-Wheeler Transform (BWT) of G, and the non-empty interval corresponding to the reverse (G[i..j])^R of G[i..j] in the BWT of G^R. We will refer to two orderings on the alphabet of meta-characters: lexicographic by their phrases (lex), and lexicographic by the reverses of their phrases (r-lex). Notice that the interval for a string A in the BWT of G contains the intervals of all the meta-characters whose phrases start with A, which are consecutive in the lex ordering; the interval for A^R in the BWT of G^R contains the intervals of all the meta-characters whose phrases end with A, which are consecutive in the r-lex ordering. More generally, if the intervals for two strings overlap in a BWT, then one must be contained in the other. We can use this fact to store small data structures such that, given an interval in the BWT of G or G^R, we can quickly find the corresponding interval in the lex or r-lex orderings of the meta-characters. We build a data structure for 4-sided range reporting on a d × z grid. We place a point (a, b) on the grid if the lexicographically bth suffix of R', considering meta-characters in lex order, is preceded by a copy of the ath distinct meta-character in the r-lex ordering. In addition to the FM-index for G we also store FM-indexes for G^R and R'.
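A greedy RLZ parse relative to the reference can be sketched as follows; this is our own simplification (it scans the reference directly instead of using the FM-index of G, and the literal fallback for characters absent from G is an assumption for self-containment).

```python
def rlz_parse(ref, text):
    # Each phrase is the longest prefix of the remaining text that occurs
    # in ref, stored as a 0-based inclusive interval into ref; a character
    # absent from ref is stored as a literal.
    phrases, p = [], 0
    while p < len(text):
        j = p
        while j < len(text) and ref.find(text[p:j + 1]) != -1:
            j += 1
        if j == p:
            phrases.append(text[p])      # literal fallback
            p += 1
        else:
            i = ref.find(text[p:j])
            phrases.append((i, i + (j - p) - 1))
            p = j
    return phrases
```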
We can find all primary occurrences of P in R if, for 1 ≤ i ≤ m, we

1. use the FM-index for G^R to find the interval for (P[1..i])^R in the BWT of G^R;
2. map that interval to the interval in the r-lex ordering containing phrases ending with P[1..i];
3. compute the RLZ parse of P[i+1..m] relative to G;
4. use the FM-index for G to find the interval for the last phrase P[j..m] of that parse in the BWT of G;
5. map that interval to the interval in the lex ordering containing phrases starting with P[j..m];
6. use the FM-index for R' to find the interval for the parse of P[i+1..m] in the BWT of R';
7. find all points (a, b) on the d × z grid with a in the interval for (P[1..i])^R in the BWT of G^R, and b in the interval for the parse of P[i+1..m] in the BWT of R'.

Notice that, if an occurrence of P[i+1..m] starts at a phrase boundary in R's parse, then the phrases in P[i+1..m]'s parse are the same as the subsequent phrases in R's parse, except that the last phrase P[j..m] in P[i+1..m]'s parse may be only a prefix of the corresponding phrase in R's parse. When we use the

FM-index for R' to find the interval for P[i+1..m]'s parse in the BWT of R', we start with the interval containing the meta-characters that immediately precede in R' meta-characters whose phrases start with P[j..m]. If we encounter another phrase in P[i+1..m]'s parse that does not occur in R's parse, then we know no occurrence of P[i+1..m] starts at a phrase boundary in R's parse. Steps 1, 3, 4 and 6 in the process above take time depending linearly on m, however, so repeating the whole process for each i between 1 and m takes time depending quadratically on m. If we use the FM-index for G^R to search for (P[1..m])^R in G^R then, as a byproduct, we find the intervals for (P[1..i])^R in the BWT of G^R for 1 ≤ i ≤ m. Similarly, if we use the FM-index for G to search for P[1..m] in G then we find the intervals for P[j..m] in the BWT of G for 1 ≤ j ≤ m. We can therefore avoid repeating Steps 1 and 4. To avoid repeating Steps 3 and 6, we use 1-dimensional dynamic programming to compute the RLZ parses of all suffixes of P and the corresponding intervals in the BWT of R'. Assume that, given the interval for a string Ac in the BWT of G and the length of A, we can quickly find the interval for A; we will explain later how we do this. We work from right to left in P. Assume we have already computed the interval in the BWT of G for the longest prefix P[i+1..l] of P[i+1..m] that occurs in G and that, for i+1 ≤ k ≤ m, we have already computed the RLZ parse of P[k..m] relative to G and the interval for that parse in the BWT of R'. We will show how to compute the interval in the BWT of G for the longest prefix of P[i..m] that occurs in G, the RLZ parse of P[i..m] and the interval for that parse in the BWT of R'. Knowing the interval for P[i+1..l] in the BWT of G, we can use the FM-index for G to find the interval for P[i..l].
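The step of going from the interval for a string to the interval for that string with one character prepended is one LF-mapping step of FM-index backward search. A naive sketch (rotation-based BWT construction and linear-time counting stand in for real rank structures; the function names are ours, not the paper's):

```python
def bwt(s):
    # BWT via sorted rotations; s must end with a unique smallest
    # sentinel such as '$'
    rotations = sorted(s[k:] + s[:k] for k in range(len(s)))
    return "".join(r[-1] for r in rotations)

def backward_step(bw, lo, hi, c):
    # given the interval [lo, hi) of a string A, return the interval of cA
    smaller = sum(1 for x in bw if x < c)        # C[c]
    return (smaller + bw[:lo].count(c),          # rank of c before lo
            smaller + bw[:hi].count(c))          # rank of c before hi

def fm_count(s, pattern):
    # count occurrences of pattern in s by repeated backward steps
    bw = bwt(s)
    lo, hi = 0, len(bw)
    for c in reversed(pattern):
        lo, hi = backward_step(bw, lo, hi, c)
        if lo >= hi:
            return 0
    return hi - lo
```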
If this interval is non-empty, then P[i..l] is the longest prefix of P[i..m] that occurs in G, and the first phrase in the RLZ parse of P[i..m]; the rest of the parse is the same as for P[l+1..m]. We map the interval for P[i..l] in the BWT of G to the corresponding interval in the lex order. If l = m, then the interval in the BWT of R' is the one containing the meta-characters that immediately precede in R' meta-characters whose phrases start with P[i..m]. If l < m, then the interval in the lex order should be associated with a single meta-character; otherwise, P[i..l] does not occur in the parse of R and, thus, the interval in the BWT of R' is empty. Knowing the interval for the parse of P[l+1..m] in the BWT of R', we use the FM-index for R' to find the interval for the parse of P[i..m]. If the interval for P[i..l] in the BWT of G is empty, then we compute the intervals for P[i+1..l'] and P[i..l'], for l' decreasing from l-1 to i, until we find a value of l' such that the interval for P[i..l'] is non-empty; we then proceed as described above. We now explain how, given the interval for a string Ac in the BWT of G and the length of A, we can quickly find the interval for A. In addition to the FM-index for G, we store a compressed suffix array (CSA) and compressed longest-common-prefix (LCP) array [9] for G and a previous/next-smaller-value data structure [10] for the LCP. Suppose [i..j] is the interval for Ac in the BWT of G. Notice that LCP[i] and LCP[j+1] are both at most |A|. If LCP[i] = |A|, then we use the P/NSV data structure to find the previous

entry LCP[i'] in the LCP array that is smaller than |A|; otherwise, we set i' = i. If LCP[j+1] = |A|, then we use the P/NSV data structure to find the next entry LCP[j'+1] in the LCP array that is smaller than |A|; otherwise, we set j' = j. The interval for A in the BWT of G is [i'..j']. We use the CSA by Grossi, Gupta and Vitter [13], which takes (1/ɛ) n H_k(G) + O(n) bits and supports access to the suffix array of G in O(log^ɛ n) time, where k ≤ α log_σ n, σ is the size of the alphabet of G, and α < 1 and ɛ ≤ 1 are constants; see [21]. We note that, using this CSA, we can also reduce the time needed to locate the occurrences of P in G. Choosing the right implementations of the various other data structures, we obtain the following result:

Theorem 5. Suppose we are asked to build a compressed self-index for a database of genomes from individuals of the same species. Let G[1..n] be the first genome in the database and suppose the RLZ parse of the rest of the database relative to G consists of z phrases. For any positive constants α < 1 and ɛ and k ≤ α log_4 n, we can store the database in O(n(H_k(G) + 1) + z(log n + log z log log z)) bits such that, given a pattern P[1..m], we can find all occ occurrences of P in the database in O((m + occ) log^ɛ n) time.

Many of the ideas used in this section have been used previously by other authors, notably Chien et al. [3], Hon et al. [14] and Huang et al. [15]. We recently became aware that Do, Jansson, Sadakane and Sung [6] independently proved a theorem similar to Theorem 5 before we did. We have described our own ideas here because they lead to an incomparable time-space tradeoff.

References

1. Chan, T.M., Larsen, K.G., Pǎtraşcu, M.: Orthogonal range searching on the RAM, revisited. In: Proceedings of the 27th Symposium on Computational Geometry (SoCG) (2011)
2. Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem.
IEEE Transactions on Information Theory 51(7) (2005)
3. Chien, Y.-F., Hon, W.-K., Shah, R., Vitter, J.S.: Geometric Burrows-Wheeler Transform: Linking range searching and text indexing. In: Proceedings of the Data Compression Conference (DCC) (2008)
4. Claude, F., Navarro, G.: Self-indexed text compression using straight-line programs. In: Královič, R., Niwiński, D. (eds.) MFCS 2009. LNCS, vol. 5734. Springer, Heidelberg (2009)
5. Claude, F., Navarro, G.: Improved grammar-based self-indexes. Tech. Rep., arxiv.org (2011)
6. Do, H.H., Jansson, J., Sadakane, K., Sung, W.K.: Indexing strings via textual substitutions from a reference. Manuscript
7. van Emde Boas, P.: Preserving order in a forest in less than logarithmic time. In: Proceedings of the 16th Symposium on Foundations of Computer Science (FOCS) (1975)
8. Ferragina, P., Manzini, G.: Indexing compressed text. Journal of the ACM 52(4) (2005)

9. Fischer, J.: Wee LCP. Information Processing Letters 110(8-9) (2010)
10. Fischer, J.: Combined data structure for previous- and next-smaller-values. Theoretical Computer Science 412(22) (2011)
11. Gabow, H.N., Bentley, J.L., Tarjan, R.E.: Scaling and related techniques for geometry problems. In: Proceedings of the 16th Symposium on Theory of Computing (STOC) (1984)
12. Gagie, T., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: A compressed self-index for genomic databases. Tech. Rep., arxiv.org (2011)
13. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the 14th Symposium on Discrete Algorithms (SODA) (2003)
14. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: On entropy-compressed text indexing in external memory. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721. Springer, Heidelberg (2009)
15. Huang, S., Lam, T.W., Sung, W.K., Tam, S.L., Yiu, S.M.: Indexing similar DNA sequences. In: Chen, B. (ed.) AAIM 2010. LNCS, vol. 6124. Springer, Heidelberg (2010)
16. Kreft, S., Navarro, G.: Self-indexing based on LZ77. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661. Springer, Heidelberg (2011)
17. Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393. Springer, Heidelberg (2010)
18. Maruyama, S., Nakahara, M., Kishiue, N., Sakamoto, H.: ESP-index: A compressed index based on edit-sensitive parsing. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024. Springer, Heidelberg (2011)
19. McCreight, E.M.: Priority search trees. SIAM Journal on Computing 14(2) (1985)
20. Morrison, D.R.: PATRICIA - practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM 15(4) (1968)
21. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1) (2007)
22.
Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science 302(1-3), (2003) 23. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), (1977)
