A Faster Grammar-Based Self-index
Travis Gagie¹, Paweł Gawrychowski², Juha Kärkkäinen³, Yakov Nekrich⁴, and Simon J. Puglisi⁵

¹ Aalto University, Finland
² University of Wrocław, Poland
³ University of Helsinki, Finland
⁴ University of Bonn, Germany
⁵ King's College London, UK

Abstract. To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on straight-line programs and LZ77. In this paper we show how, given a balanced straight-line program for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words and obtain a compressed self-index for S such that, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m² + (m + occ) log log n) time. All previous self-indexes are either larger or slower in the worst case.

1 Introduction

With the advance of DNA-sequencing technologies comes the problem of how to store many individuals' genomes compactly but such that we can search them quickly. Any two human genomes are 99.9% the same, but compressed self-indexes based on compressed suffix arrays, the Burrows-Wheeler Transform or LZ78 (see [21] for a survey) do not take full advantage of this similarity. Researchers have recently started building compressed self-indexes based on context-free grammars (CFGs) and LZ77 [23], which better compress highly repetitive strings. A compressed self-index stores a string S[1..n] in compressed form such that, given a pattern P[1..m], we can quickly list the occ occurrences of P in S. Claude and Navarro [4] gave the first compressed self-index based on grammars, which takes O(r log r) + r log n bits, where r is the number of rules in the given grammar, and answers queries in O((m² + h(m + occ)) log r) time, where h ≥ log n is the height of the parse tree. Throughout this paper, we write log to mean log₂.
Very recently, the same authors [5] described another grammar-based compressed self-index, which takes 2R log r + R log n + ɛ r log r + o(r log r) bits, where R is the total length of the rules' right-hand sides and 0 < ɛ ≤ 1, and answers queries in O((m²/ɛ) log r + occ log r) time¹. Whereas their first index requires a straight-line program (SLP), their new index can be built on top of an arbitrary CFG.

Supported by MNiSW grant number N .
¹ They have now reduced the log R factor in this time bound to log(log n / log r).

A.-H. Dediu and C. Martín-Vide (Eds.): LATA 2012, LNCS 7183. © Springer-Verlag Berlin Heidelberg 2012
Table 1. Space and time bounds for Claude and Navarro's compressed self-indexes [4,5], Kreft and Navarro's [16] and our own. Throughout, r is the number of rules in a given CFG generating a string S[1..n] and only S, but the CFG must be an SLP in Row 1 and a balanced SLP in Row 4. In Row 1, h is the height of the parse tree. In Row 2, R is the total length of the rules' right-hand sides and 0 < ɛ ≤ 1. In Row 3, d is the depth of nesting in the parse.

source  | total space (bits)                             | search time
[4]     | O(r log r) + r log n                           | O((m² + h(m + occ)) log r)
[5]     | 2R log r + R log n + ɛ r log r + o(r log r)    | O((m²/ɛ) log r + occ log r)
[16]    | 2z log(n/z) + z log z + 5z log σ + O(z) + o(n) | O(m² d + (m + occ) log z)
Thm. 4  | 2r log r + O(z(log n + log z log log z))       | O(m² + (m + occ) log log n)

Kreft and Navarro [16] gave the first (and, so far, only) compressed self-index based on LZ77, which takes 2z log(n/z) + z log z + 5z log σ + O(z) + o(n) bits, where z is the number of phrases in the LZ77 parse of S, and O(m² d + (m + occ) log z) time, where d ≤ z is the depth of nesting in the parse. Throughout this paper, we consider the non-self-referential LZ77 parse, so z ≥ log n. The o(n) term in Kreft and Navarro's space bound can be dropped at the cost of increasing the time bound by a factor of log(n/z). In this paper we show how, given a balanced SLP for S, we can add O(z(log n + log z log log z)) bits and obtain a compressed self-index that answers queries in O(m² + (m + occ) log log n) time. We note that Maruyama et al. [18] recently described another grammar-based index but its search time depends on the number of occurrences of a maximal common subtree in edit-sensitive parse trees of P and S, making it difficult to compare their data structure to those mentioned above.
In order to better compare our bounds, those by Claude and Navarro and those by Kreft and Navarro, all of which are summarized in Table 1, it is useful to review some bounds for the LZ77 parse, balanced and unbalanced SLPs, and arbitrary CFGs. The LZ77 compression algorithm works by parsing S from left to right into phrases: after parsing S[1..i−1], it finds the longest prefix S[i..j−1] of S[i..n] that has occurred before and selects S[i..j] as the next phrase. The previous occurrence of S[i..j−1] is called the phrase's source. Kreft and Navarro considered the non-self-referential LZ77 parse, in which each phrase's source must end before the phrase itself begins, so z ≥ log n. For ease of comparison, we also consider the non-self-referential parse. In an SLP of size r, each of the non-terminals X₁, ..., X_r appears on the left-hand side of exactly one rule and each rule is either of the form X_i → a, where a is a terminal, or X_i → X_j X_k, where i > j, k; thus, the SLP generates exactly one string and can be encoded in 2r log r + O(r) bits. Rytter [22] and Charikar et al. [2] showed that, in every CFG generating S and only S, the total length of the rules' right-hand sides is at least z. Rytter showed how we can build an SLP for S with O(z log n) rules whose parse tree has the shape of an AVL tree, which is height-balanced, and Charikar et al. showed how to build such an SLP whose parse tree is weight-balanced. Finally, Charikar et al. showed that finding
an algorithm for building a CFG generating S and only S in which the total length of the rules' right-hand sides is o(z log n / log log n) would imply progress on the well-studied problem of building addition chains. With Rytter's and Charikar et al.'s bounds in mind, it is clear that the time bound of our index is strictly better than those of the others. Kreft and Navarro's index has the best worst-case space bound, but only when we drop the o(n) term, increasing their time bound by a factor of log(n/z). As it is not known whether it is possible, regardless of the time taken, to build a CFG that generates S and only S and in which the total length of the rules' right-hand sides is o(z log n), it is also not known whether Claude and Navarro's space bounds are ever strictly better than ours. It is even conceivable that they could be worse: suppose z = log^O(1) n and the minimum total length of the rules' right-hand sides in a CFG generating S and only S is z log n; then the O(r log n) and r log n terms in Claude and Navarro's space bounds are Θ(z log² n), while the dominant term in our own space bound is 2r log r = Θ(z log n log log n).

To build our index, we add bookmarks to the SLP that allow us to extract substrings quickly from around boundaries between phrases in the LZ77 parse. In Section 2 we show that a bookmark for a position b takes O(log n) bits and, for any length l, allows us to extract S[b−l..b+l] in O(l + log log n) time. Kreft and Navarro showed how, if we can extract substrings quickly from around phrase boundaries, then we can quickly list all occurrences of P that finish at or cross those boundaries, which are called primary occurrences. They first use two Patricia trees and access at the boundaries to find, for each position i ≥ 1 in P, the lexicographic range of the reverses of phrases ending with P[1..i], and the lexicographic range of suffixes of S starting with P[i+1..m] at phrase boundaries.
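To make the parsing rule concrete, here is a brute-force sketch of the non-self-referential LZ77 parse described above (the function name lz77_parse is ours; a real implementation would run in linear time using a suffix array or similar index, not this quadratic scan):

```python
def lz77_parse(s):
    # Non-self-referential LZ77: each phrase is the longest previously
    # occurring prefix of the remaining suffix, extended by one character.
    # The source occurrence must end before the phrase begins (p + l <= i).
    phrases, i, n = [], 0, len(s)
    while i < n:
        l = 0
        while i + l < n and any(s[p:p + l + 1] == s[i:i + l + 1]
                                for p in range(i - l)):
            l += 1
        phrases.append(s[i:i + l + 1])  # previous prefix plus one new character
        i += l + 1
    return phrases

print(lz77_parse("alabar a la alabarda$"))
# ['a', 'l', 'ab', 'ar', ' ', 'a ', 'la ', 'alabard', 'a$']
```

On the running example used later in the paper, this yields the nine phrases of Figure 2; since each phrase extends a previously seen prefix by one character, the number of phrases grows at least logarithmically in n, matching z ≥ log n above.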
They then use a wavelet tree as a data structure for 2-dimensional range reporting to find all the phrase boundaries immediately preceded by P[1..i] and followed by P[i+1..m]. In Section 3 we show how, using bookmarks for faster access and a faster (albeit larger) range-reporting data structure, we can list all primary occurrences in O(m² + (m + occ) log log n) time. Occurrences of P that neither finish at nor cross phrase boundaries are called secondary and they can be found recursively from the primary occurrences. We build a data structure for 2-sided range reporting on an n × n grid on which we place a point (i, j) for each phrase's source S[i..j]. Since the queries are 2-sided, this data structure takes O(z log n) bits and answers queries in O(log log n + p) time, where p is the number of points reported. If a phrase contains a secondary occurrence at a certain position in the phrase, then that phrase's source must also contain an occurrence at the corresponding position. It follows that, by querying the data structure with (a, b) for each primary or secondary occurrence S[a..b] we find, we can list all secondary occurrences of P in O(occ log log n) time.

2 Adding Bookmarks to a Balanced SLP

An SLP for S is called balanced if its parse tree is height- or weight-balanced. We claim that, for 1 ≤ i ≤ j = i + 2g ≤ n, we can choose two nodes v and w of height O(log g) in the parse tree whose subtrees include the ith through jth leaves.

Fig. 1. Suppose we are given an SLP for S and consider its parse tree. For 1 ≤ i ≤ j = i + 2g ≤ n, we can choose two nodes v and w of height O(log g) in the parse tree whose subtrees include the ith through jth leaves. Without loss of generality, assume the bth leaf is in v's subtree, where b = i + g. By storing the non-terminals at v and w and the path from v to b, we can extract S[b−l..b+l] in O(l + log g) time.

To see why, let u be the lowest common ancestor of the ith and jth leaves. If u has height O(log g), then we choose its children as v and w; otherwise, suppose the rightmost leaf in u's left subtree is the kth in the tree. We choose as v the lowest common ancestor of the ith through kth leaves, which must lie on the rightmost path in u's left subtree. Since v's right subtree has fewer than k − i + 1 ≤ 2g leaves, v has height O(log g). We choose as w the lowest common ancestor of the (k+1)st through jth leaves, which must lie on the leftmost path in u's right subtree. Since w's left subtree has fewer than j − k ≤ 2g leaves, w also has height O(log g). Figure 1 illustrates our choice of v and w. Without loss of generality, assume b = i + g ≤ k; the case when b > k is symmetric. Then the bth leaf, which is midway between the ith and jth, is in v's subtree and we can store the path from v to the bth leaf in O(log g) bits. With this information, in O(l + log g) time we can perform partial depth-first traversals of v's subtree that start by descending from v to the bth leaf and then visit the l ≤ g leaves immediately to its left and right, if they are in v's subtree. Some of the l leaves immediately to the right of the bth leaf may be the leftmost leaves in w's subtree, instead; we can descend from w and visit them in O(l + log g) time.
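The descent used above can be sketched on a toy grammar (our own example, not from the paper): storing each non-terminal's expansion length lets us walk down to a window around a position while visiting only the subtrees that overlap it, so the work is proportional to the window length plus the tree height.

```python
from functools import lru_cache

# A toy SLP (hypothetical example): X4 expands to "abab".
rules = {'X1': 'a', 'X2': 'b', 'X3': ('X1', 'X2'), 'X4': ('X3', 'X3')}

@lru_cache(maxsize=None)
def length(sym):
    # Expansion length of a non-terminal; precomputed and stored in practice.
    r = rules[sym]
    return 1 if isinstance(r, str) else length(r[0]) + length(r[1])

def extract(sym, i, j):
    # Return expansion(sym)[i:j] (0-indexed, clipped to the expansion),
    # descending only into subtrees that overlap the requested window.
    i, j = max(i, 0), min(j, length(sym))
    if i >= j:
        return ''
    r = rules[sym]
    if isinstance(r, str):
        return r
    mid = length(r[0])
    return extract(r[0], i, j) + extract(r[1], i - mid, j - mid)

b, l = 2, 1                              # window S[b-l..b+l] around position b
print(extract('X4', b - l, b + l + 1))   # 'bab'
```

A bookmark in the paper's sense additionally pins down the path from v to the bth leaf, so that the descent costs O(log g) rather than the full tree height.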
It follows that, if we store the non-terminals at v and w and O(log g) extra bits then, given l ≤ g, we can extract S[b−l..b+l] in O(l + log g) time.

Lemma 1. Given a balanced SLP for S with r rules and integers b and g, we can store 2 log r + O(log g) bits such that later, given l ≤ g, we can extract S[b−l..b+l] in O(l + log g) time.

The drawback to Lemma 1 is that we must know in advance an upper bound on the number of characters we will want to extract. To remove this restriction,
for each position we wish to bookmark, we apply Lemma 1 twice: the first time, we set g = n so that, given l ≥ log n, we can extract S[b−l..b+l] in O(l) time; the second time, we set g = log n so that, given l ≤ log n, we can extract S[b−l..b+l] in O(l + log log n) time.

Theorem 2. Given a balanced SLP for S and a position b, we can store O(log n) bits such that later, given l, we can extract S[b−l..b+l] in O(l + log log n) time.

We note that, without changing the asymptotic space bound in Theorem 2, the time bound can be improved to O(l + log^[c] n), where c is any constant and log^[c] n is the result of recursively applying the logarithm function c times to n. If we increase the space bound to O(log n log* n), where log* is the iterated logarithm, then the time bound decreases to O(l). Even as stated, however, Theorem 2 is enough for our purposes in this paper as we use only the following corollary, which follows by applying Theorem 2 for each phrase boundary in the LZ77 parse.

Corollary 3. Given a balanced SLP for S, we can add O(z log n) bits such that later, given l, we can extract l characters to either side of any phrase boundary in O(l + log log n) time.

As an aside, we note that Rytter's construction produces an SLP in which, for each phrase in the LZ77 parse, there is a non-terminal whose expansion is that phrase. For any balanced SLP with this property, it is easy to add bookmarks such that we can extract quickly from phrase boundaries: for each phrase, we store the non-terminal generating that phrase and the leftmost and rightmost non-terminals at height log log n in its parse tree.

3 Listing Occurrences

As explained in Section 1, we use the same framework as Kreft and Navarro but with faster access at the phrase boundaries, using Corollary 3, and a faster data structure for 2-dimensional range reporting.
Given a pattern P[1..m], for 1 ≤ i ≤ m, we find all the boundaries where the previous phrase ends with P[1..i] and the next phrase starts with P[i+1..m]. To do this, we build one Patricia tree for the reversed phrases and another for the suffixes starting at phrase boundaries. A Patricia tree [20] is a compressed trie in which the length of each edge label is recorded and then all but the first character of each edge label are discarded. It follows that the two Patricia trees have O(z) nodes and take O(z log n) bits; notice that no two suffixes can be the same and, by the definition of the LZ77 parse, no two reversed phrases can be the same. We label each leaf in the first Patricia tree by the number of the phrase whose reverse leads to that leaf; we label each leaf in the second Patricia tree by the number of the phrase starting the suffix that leads to that leaf. For example, if S = alabar a la alabarda$, then its LZ77 parse is as shown in Figure 2 (from Kreft and Navarro's paper). For convenience, we treat all
strings as ending with a special character $.

Fig. 2. The LZ77 parse of alabar a la alabarda$, from Kreft and Navarro's paper [16]. Horizontal lines indicate phrases' sources, with arrows leading to the boxes containing the phrases themselves.

In this case, the reversed phrases are as shown below with the phrase numbers, in order by phrase number on the left and in lexicographic order on the right (spaces are shown as ␣):

1) a$          9) $a$
2) l$          5) ␣$
3) ba$         6) ␣a$
4) ra$         7) ␣al$
5) ␣$          1) a$
6) ␣a$         3) ba$
7) ␣al$        8) drabala$
8) drabala$    2) l$
9) $a$         4) ra$

The suffixes starting at phrase boundaries are as shown below with the numbers of the phrases with which they begin, in order by phrase number on the left and in lexicographic order on the right:

2) labar␣a␣la␣alabarda$    5) ␣a␣la␣alabarda$
3) abar␣a␣la␣alabarda$     9) a$
4) ar␣a␣la␣alabarda$       6) a␣la␣alabarda$
5) ␣a␣la␣alabarda$         3) abar␣a␣la␣alabarda$
6) a␣la␣alabarda$          8) alabarda$
7) la␣alabarda$            4) ar␣a␣la␣alabarda$
8) alabarda$               7) la␣alabarda$
9) a$                      2) labar␣a␣la␣alabarda$

Notice we do not consider the whole string to be a suffix starting at a phrase boundary. We are interested only in phrase boundaries such that occurrences of patterns can cross the boundary with a non-empty prefix to the left of the boundary. The Patricia trees for the reversed phrases and suffixes are shown in Figure 3. Notice that, since la alabarda$ and labar a la alabarda$ share both their first and second characters, the second character a is omitted from the edge from the root to its rightmost child in the Patricia tree for the suffixes; we indicate this with a *. For each position i ≥ 1 in P, we find the lexicographic range of the reverses of phrases ending with P[1..i], and the lexicographic range of suffixes of S starting
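The two orderings above can be checked mechanically. The following sketch (our own; it treats $ as the smallest character, as in the paper's figures) reproduces exactly the phrase-number sequences listed in the right-hand columns:

```python
S = "alabar a la alabarda$"
phrases = ['a', 'l', 'ab', 'ar', ' ', 'a ', 'la ', 'alabard', 'a$']

# Sort with $ as the smallest character, as in the paper's figures.
key = lambda s: s.replace('$', '\0')

# Reversed phrases, $-terminated (reversing phrase 9 moves its $ to the front).
rev = {k: phrases[k - 1][::-1] + '$' for k in range(1, 10)}

# Suffixes of S starting at phrase boundaries (phrases 2 through 9).
start, pos = {}, 0
for k, p in enumerate(phrases, 1):
    start[k] = pos
    pos += len(p)
suf = {k: S[start[k]:] for k in range(2, 10)}

print(sorted(rev, key=lambda k: key(rev[k])))  # [9, 5, 6, 7, 1, 3, 8, 2, 4]
print(sorted(suf, key=lambda k: key(suf[k])))  # [5, 9, 6, 3, 8, 4, 7, 2]
```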
Fig. 3. The Patricia trees for the reversed phrases (on the left) in the LZ77 parse of alabar a la alabarda$ and the suffixes starting at phrase boundaries (on the right).

with P[i+1..m] at phrase boundaries. To do this, we first search for the reverse of P[1..i] in the Patricia tree for the reversed phrases and search for P[i+1..m] in the Patricia tree for the suffixes, then use our data structure from Corollary 3 to verify that the characters omitted from the edge labels match those in the corresponding positions in P. This takes O(m + log log n) time for each choice of i, or O(m² + m log log n) time in total. For example, if P = ala, then we search for a and la, al and a, and ala and the empty suffix. In the first search, we reach the root's third child in the Patricia tree for the reversed phrases, which is the 5th leaf (labelled 1), and the root's third child in the Patricia tree for the suffixes, which is the ancestor of the 7th and 8th leaves (labelled 7 and 2); the second and third searches fail because there are no paths in the first Patricia tree corresponding to al or ala. Once we have the lexicographic ranges for a choice of i, we check whether there are any pairs of consecutive phrase numbers with the first number j from the first range and the second number j+1 from the second range. In our example, (1, 2) is such a pair, meaning that the first phrase ends with a and the second phrase starts with la, so there is a primary occurrence crossing the first phrase boundary. To be able to find such pairs efficiently, we store a data structure for 2-dimensional range reporting on a z × (z−1) grid on which we have placed z−1 points, with each point (a, b) indicating that the phrase numbers of the lexicographically ath reversed phrase and the lexicographically bth suffix are consecutive.
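The whole query process on the running example can be simulated with brute-force stand-ins for the range-reporting structures (a sketch under our own naming; the real structures answer these queries within the stated time bounds). It builds the grid of consecutive-phrase points, lists primary occurrences by splitting P at every position, and then finds secondary occurrences by the recursive 2-sided containment queries described later in this section:

```python
S = "alabar a la alabarda$"
phrases = ['a', 'l', 'ab', 'ar', ' ', 'a ', 'la ', 'alabard', 'a$']
z = len(phrases)
key = lambda s: s.replace('$', '\0')       # $ sorts smallest, as in the figures

start, pos = {}, 1                          # 1-indexed start of each phrase
for k, p in enumerate(phrases, 1):
    start[k] = pos
    pos += len(p)

rev = {k: phrases[k - 1][::-1] + '$' for k in range(1, z + 1)}
suf = {k: S[start[k] - 1:] for k in range(2, z + 1)}
rev_rank = {k: r for r, k in enumerate(sorted(rev, key=lambda k: key(rev[k])), 1)}
suf_rank = {k: r for r, k in enumerate(sorted(suf, key=lambda k: key(suf[k])), 1)}

# One grid point per boundary: row = rank of the reversed phrase j,
# column = rank of the suffix starting at phrase j + 1.
points = {(rev_rank[j], suf_rank[j + 1]): j for j in range(1, z)}

def primary(P):
    occ = set()
    for i in range(1, len(P) + 1):          # split P into P[:i] | P[i:]
        rows = [rev_rank[k] for k in rev if rev[k].startswith(P[:i][::-1])]
        cols = [suf_rank[k] for k in suf if suf[k].startswith(P[i:])]
        if not rows or not cols:
            continue
        for (a, b), j in points.items():    # brute-force 4-sided range reporting
            if min(rows) <= a <= max(rows) and min(cols) <= b <= max(cols):
                s = start[j + 1] - i        # occurrence crosses boundary after phrase j
                occ.add((s, s + len(P) - 1))
    return occ

# Each phrase's copy part and its source, for the 2-sided structure.
sources = []
for k, p in enumerate(phrases, 1):
    if p[:-1]:                              # phrase minus its explicit character
        i = S.index(p[:-1]) + 1             # an earlier occurrence as the source
        sources.append((i, i + len(p) - 2, start[k]))

def occurrences(P):
    occ = primary(P)
    queue = list(occ)
    while queue:                            # recursive 2-sided containment queries
        a, b = queue.pop()
        for i, j, c in sources:
            if i <= a and b <= j:           # source S[i..j] contains S[a..b]
                na, nb = c + (a - i), c + (b - i)
                if (na, nb) not in occ:
                    occ.add((na, nb))
                    queue.append((na, nb))
    return occ

print((5, 8) in points and points[(5, 8)] == 1)  # the pair from Figure 4
print(sorted(occurrences("ala")))                # [(1, 3), (13, 15)]
```

For P = ala the only useful split is a | la, whose row range {5} and column range {7, 8} capture the point (5, 8), giving the primary occurrence S[1..3]; the containment query against the source S[1..6] of phrase 8 then yields the secondary occurrence S[13..15].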
Figure 4 shows the grid for our example, on the left, with the four-sided range query on the 5th row and the 7th and 8th columns that returns the point (5, 8), indicating a primary occurrence overlapping phrases 1 and 2. Notice that the first row is always empty, since the lexicographically first reversed phrase starts with the end-of-string character $. Kreft and Navarro used a wavelet tree, but we use a recent data structure by Chan, Larsen and Pǎtraşcu [1] that takes O(z log log z) words and answers queries in O((1 + p) log log z) time, where p is the number of points reported. With this data structure, we can list all occ₁ primary occurrences in O(m² + (m + occ₁) log log n) time.

Once we have found all the primary occurrences, it is not difficult to recursively find all the secondary occurrences. We build a data structure for 2-sided range reporting on an n × n grid on which we place a point (i, j) for each phrase's source S[i..j]. Since the queries are 2-sided, we can implement this data structure as a predecessor data structure and a range-maximum data structure [19], which together take O(z log n) bits and answer queries in O(log log n + p) time [7,11],
Fig. 4. On the left, the grid we use for listing primary occurrences in S = alabar a la alabarda$. The sides of the grid are labelled with the phrase numbers from the leaves in the two Patricia trees. A four-sided range query on the 5th row and 7th and 8th columns returns the point (5, 8), indicating a primary occurrence overlapping phrases 1 and 2. On the right, the grid we use for listing secondary occurrences in S = alabar a la alabarda$. The black dots indicate phrase sources and the grey dot indicates a two-sided query to find sources containing S[1..3].

where p is the number of points reported. If a phrase contains a secondary occurrence at a certain position in the phrase, then that phrase's source must also contain an occurrence at the corresponding position. Notice that, by the definition of the parse, the first occurrence of any substring must be primary. It follows that, by querying the data structure with (a, b) for each primary or secondary occurrence S[a..b] we find, we can list all secondary occurrences of P in O(occ log log n) time. Figure 4 shows the grid for our example, on the right, with black dots indicating phrase sources and the grey dot indicating a two-sided query to find sources containing the primary occurrence S[1..3] of ala. This query returns the point (1, 6), so the phrase whose source is S[1..6], i.e., the 8th, contains a secondary occurrence of ala.

Theorem 4. Given a balanced straight-line program for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words and obtain a compressed self-index for S such that, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m² + (m + occ) log log n) time.

4 Postscript

We recently designed a compressed self-index [12] based on the Relative Lempel-Ziv (RLZ) compression scheme proposed by Kuruppu, Puglisi and Zobel [17].
Given a database of genomes from individuals of the same species, Kuruppu et al. store the first genome G in an FM-index [8] and store the others compressed with a version of LZ77 [23] that allows phrases to be copied only from the first genome. They showed that RLZ compresses well and supports fast extraction but did not show how to support search. Of course, given a pattern P[1..m], we can quickly find all occurrences of P in G by searching in its FM-index. If we
store a data structure for 2-sided range reporting like the one described at the end of Section 3, then we can also quickly find all secondary occurrences of P in the rest of the database. We now sketch one way to find primary occurrences.

Let R be the rest of the database and suppose its RLZ parse relative to G consists of z phrases, d of them distinct. Consider each distinct phrase as a meta-character and consider the parse as a string R′[1..z] over the alphabet of meta-characters. Suppose a meta-character x in R′ indicates a phrase copied from G[i..j]; then we associate with x the non-empty interval corresponding to G[i..j] in the Burrows-Wheeler Transform (BWT) of G, and the non-empty interval corresponding to the reverse (G[i..j])^R of G[i..j] in the BWT of G^R. We will refer to two orderings on the alphabet of meta-characters: lexicographic by their phrases (lex), and lexicographic by the reverses of their phrases (r-lex). Notice that the interval for a string A in the BWT of G contains the intervals of all the meta-characters whose phrases start with A, which are consecutive in the lex ordering; the interval for A^R in the BWT of G^R contains the intervals of all the meta-characters whose phrases end with A, which are consecutive in the r-lex ordering. More generally, if the intervals for two strings overlap in a BWT, then one must be contained in the other. We can use this fact to store small data structures such that, given an interval in the BWT of G or G^R, we can quickly find the corresponding interval in the lex or r-lex orderings of the meta-characters. We build a data structure for 4-sided range reporting on a d × z grid. We place a point (a, b) on the grid if the lexicographically bth suffix of R′, considering meta-characters in lex order, is preceded by a copy of the ath distinct meta-character in the r-lex ordering. In addition to the FM-index for G we also store FM-indexes for G^R and R′.
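The RLZ parse of R relative to G can be sketched with a greedy quadratic-time parser (our own illustration on a made-up reference string; practical parsers use an index of G, and we add a literal-character fallback for characters absent from G):

```python
def rlz_parse(R, G):
    # Greedy RLZ: each phrase is the longest prefix of the remaining suffix
    # of R that occurs somewhere in the reference G; a character absent
    # from G becomes a phrase by itself.
    phrases, i = [], 0
    while i < len(R):
        l = 0
        while i + l < len(R) and R[i:i + l + 1] in G:
            l += 1
        phrases.append(R[i:i + max(l, 1)])
        i += max(l, 1)
    return phrases

G = "ACGTTGCA"                      # hypothetical reference genome
print(rlz_parse("GTTGACGT", G))     # ['GTTG', 'ACGT']
```

Each phrase returned here corresponds to one meta-character of R′ above; two positions of R that copy the same substring of G share a meta-character, which is why only the d distinct phrases need rows on the grid.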
We can find all primary occurrences of P in R if, for 1 ≤ i ≤ m, we

1. use the FM-index for G^R to find the interval for (P[1..i])^R in the BWT of G^R;
2. map that interval to the interval in the r-lex ordering containing phrases ending with P[1..i];
3. compute the RLZ parse of P[i+1..m] relative to G;
4. use the FM-index for G to find the interval for the last phrase P[j..m] of that parse in the BWT of G;
5. map that interval to the interval in the lex ordering containing phrases starting with P[j..m];
6. use the FM-index for R′ to find the interval for the parse of P[i+1..m] in the BWT of R′;
7. find all points (a, b) on the d × z grid with a in the interval for (P[1..i])^R in the BWT of G^R, and b in the interval for the parse of P[i+1..m] in the BWT of R′.

Notice that, if an occurrence of P[i+1..m] starts at a phrase boundary in R's parse, then the phrases in P[i+1..m]'s parse are the same as the subsequent phrases in R's parse, except that the last phrase P[j..m] in P[i+1..m]'s parse may be only a prefix of the corresponding phrase in R's parse. When we use the
FM-index for R′ to find the interval for P[i+1..m]'s parse in the BWT of R′, we start with the interval containing the meta-characters that immediately precede in R′ meta-characters whose phrases start with P[j..m]. If we encounter another phrase in P[i+1..m]'s parse that does not occur in R's parse, then we know no occurrence of P[i+1..m] starts at a phrase boundary in R's parse.

Steps 1, 3, 4 and 6 in the process above take time depending linearly on m, however, so repeating the whole process for each i between 1 and m takes time depending quadratically on m. If we use the FM-index for G^R to search for (P[1..m])^R in G^R then, as a byproduct, we find the intervals for (P[1..i])^R in the BWT of G^R for 1 ≤ i ≤ m. Similarly, if we use the FM-index for G to search for P[1..m] in G then we find the intervals for P[j..m] in the BWT of G for 1 ≤ j ≤ m. We can therefore avoid repeating Steps 1 and 4. To avoid repeating Steps 3 and 6, we use 1-dimensional dynamic programming to compute the RLZ parses of all suffixes of P and the corresponding intervals in the BWT of R′. Assume that, given the interval for a string Ac in the BWT of G and the length of A, we can quickly find the interval for A; we will explain later how we do this. We work from right to left in P. Assume we have already computed the interval in the BWT of G for the longest prefix P[i+1..l] of P[i+1..m] that occurs in G and that, for i+1 ≤ k ≤ m, we have already computed the RLZ parse of P[k..m] relative to G and the interval for that parse in the BWT of R′. We will show how to compute the interval in the BWT of G for the longest prefix of P[i..m] that occurs in G, the RLZ parse of P[i..m] and the interval for that parse in the BWT of R′. Knowing the interval for P[i+1..l] in the BWT of G, we can use the FM-index for G to find the interval for P[i..l].
If this interval is non-empty, then P[i..l] is the longest prefix of P[i..m] that occurs in G, and the first phrase in the RLZ parse of P[i..m]; the rest of the parse is the same as for P[l+1..m]. We map the interval for P[i..l] in the BWT of G to the corresponding interval in the lex order. If l = m, then the interval in the BWT of R′ is the one containing the meta-characters that immediately precede in R′ meta-characters whose phrases start with P[i..m]. If l < m, then the interval in the lex order should be associated with a single meta-character; otherwise, P[i..l] does not occur in the parse of R and, thus, the interval in the BWT of R′ is empty. Knowing the interval for the parse of P[l+1..m] in the BWT of R′, we use the FM-index for R′ to find the interval for the parse of P[i..m].

If the interval for P[i..l] in the BWT of G is empty, then we compute the intervals for P[i+1..l′] and P[i..l′], for l′ decreasing from l−1 to i, until we find a value of l′ such that the interval for P[i..l′] is non-empty; we then proceed as described above.

We now explain how, given the interval for a string Ac in the BWT of G and the length of A, we can quickly find the interval for A. In addition to the FM-index for G, we store a compressed suffix array (CSA) and compressed longest-common-prefix (LCP) array [9] for G and a previous/next-smaller-value (P/NSV) data structure [10] for the LCP array. Suppose [i..j] is the interval for Ac in the BWT of G. Notice that LCP[i] and LCP[j+1] are both at most |A|. If LCP[i] = |A|, then we use the P/NSV data structure to find the previous
entry LCP[i′] in the LCP array that is smaller than |A|; otherwise, we set i′ = i. If LCP[j+1] = |A|, then we use the P/NSV data structure to find the next entry LCP[j′+1] in the LCP array that is smaller than |A|; otherwise, we set j′ = j. The interval for A in the BWT of G is [i′..j′].

We use the CSA by Grossi, Gupta and Vitter [13], which takes (1/ɛ) n H_k(G) + O(n) bits and supports access to the suffix array of G in O(log^ɛ n) time, where k ≤ α log_σ n, σ is the size of the alphabet of G and α < 1 and ɛ ≤ 1 are constants; see [21]. We note that, using this CSA, we can also reduce the time needed to locate the occurrences of P in G. Choosing the right implementations of the various other data structures, we obtain the following result:

Theorem 5. Suppose we are asked to build a compressed self-index for a database of genomes from individuals of the same species. Let G[1..n] be the first genome in the database and suppose the RLZ parse of the rest of the database relative to G consists of z phrases. For any positive constants α < 1 and ɛ and k ≤ α log₄ n, we can store the database in O(n(H_k(G) + 1) + z(log n + log z log log z)) bits such that, given a pattern P[1..m], we can find all occ occurrences of P in the database in O((m + occ) log^ɛ n) time.

Many of the ideas used in this section have been used previously by other authors, notably Chien et al. [3], Hon et al. [14] and Huang et al. [15]. We recently became aware that Do, Jansson, Sadakane and Sung [6] independently proved a theorem similar to Theorem 5 before we did. We have described our own ideas here because they lead to an incomparable time-space tradeoff.

References

1. Chan, T.M., Larsen, K.G., Pǎtraşcu, M.: Orthogonal range searching on the RAM, revisited. In: Proceedings of the 27th Symposium on Computational Geometry (SoCG) (2011)
2. Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem.
IEEE Transactions on Information Theory 51(7) (2005)
3. Chien, Y.-F., Hon, W.-K., Shah, R., Vitter, J.S.: Geometric Burrows-Wheeler Transform: Linking range searching and text indexing. In: Proceedings of the Data Compression Conference (DCC) (2008)
4. Claude, F., Navarro, G.: Self-indexed text compression using straight-line programs. In: Královič, R., Niwiński, D. (eds.) MFCS 2009. LNCS, vol. 5734. Springer, Heidelberg (2009)
5. Claude, F., Navarro, G.: Improved grammar-based self-indexes. Tech. Rep., arxiv.org (2011)
6. Do, H.H., Jansson, J., Sadakane, K., Sung, W.K.: Indexing strings via textual substitutions from a reference. Manuscript
7. van Emde Boas, P.: Preserving order in a forest in less than logarithmic time. In: Proceedings of the 16th Symposium on Foundations of Computer Science (FOCS) (1975)
8. Ferragina, P., Manzini, G.: Indexing compressed text. Journal of the ACM 52(4) (2005)
9. Fischer, J.: Wee LCP. Information Processing Letters 110(8-9) (2010)
10. Fischer, J.: Combined data structure for previous- and next-smaller-values. Theoretical Computer Science 412(22) (2011)
11. Gabow, H.N., Bentley, J.L., Tarjan, R.E.: Scaling and related techniques for geometry problems. In: Proceedings of the 16th Symposium on Theory of Computing (STOC) (1984)
12. Gagie, T., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: A compressed self-index for genomic databases. Tech. Rep., arxiv.org (2011)
13. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the 14th Symposium on Discrete Algorithms (SODA) (2003)
14. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: On entropy-compressed text indexing in external memory. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721. Springer, Heidelberg (2009)
15. Huang, S., Lam, T.W., Sung, W.K., Tam, S.L., Yiu, S.M.: Indexing similar DNA sequences. In: Chen, B. (ed.) AAIM 2010. LNCS, vol. 6124. Springer, Heidelberg (2010)
16. Kreft, S., Navarro, G.: Self-indexing based on LZ77. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661. Springer, Heidelberg (2011)
17. Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393. Springer, Heidelberg (2010)
18. Maruyama, S., Nakahara, M., Kishiue, N., Sakamoto, H.: ESP-index: A compressed index based on edit-sensitive parsing. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024. Springer, Heidelberg (2011)
19. McCreight, E.M.: Priority search trees. SIAM Journal on Computing 14(2) (1985)
20. Morrison, D.R.: PATRICIA - Practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM 15(4) (1968)
21. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1) (2007)
22.
Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science 302(1-3), (2003) 23. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), (1977)