A Faster Grammar-Based Self-index


Travis Gagie (Aalto University, Finland), Paweł Gawrychowski (University of Wrocław, Poland), Juha Kärkkäinen (University of Helsinki, Finland), Yakov Nekrich (University of Bonn, Germany), and Simon J. Puglisi (King's College London, UK)

Abstract. To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on straight-line programs and LZ77. In this paper we show how, given a balanced straight-line program for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words and obtain a compressed self-index for S such that, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m^2 + (m + occ) log log n) time. All previous self-indexes are either larger or slower in the worst case.

1 Introduction

With the advance of DNA-sequencing technologies comes the problem of how to store many individuals' genomes compactly but such that we can search them quickly. Any two human genomes are 99.9% the same, but compressed self-indexes based on compressed suffix arrays, the Burrows-Wheeler Transform or LZ78 (see [21] for a survey) do not take full advantage of this similarity. Researchers have recently started building compressed self-indexes based on context-free grammars (CFGs) and LZ77 [23], which better compress highly repetitive strings. A compressed self-index stores a string S[1..n] in compressed form such that, given a pattern P[1..m], we can quickly list the occ occurrences of P in S. Claude and Navarro [4] gave the first compressed self-index based on grammars, which takes O(r log r) + r log n bits, where r is the number of rules in the given grammar, and O((m^2 + h(m + occ)) log r) time, where h is the height of the parse tree. Throughout this paper, we write log to mean log_2.
Very recently, the same authors [5] described another grammar-based compressed self-index, which takes 2R log r + R log n + ɛr log r + o(r log r) bits, where R is the total length of the rules' right-hand sides and 0 < ɛ ≤ 1, and O((m^2/ɛ) log r + occ log r) time.(1) Whereas their first index requires a straight-line program (SLP), their new index can be built on top of an arbitrary CFG.

* Gawrychowski supported by MNiSW grant number N.
(1) They have now reduced the log r factor in this time bound to log(log n / log r).

A.-H. Dediu and C. Martín-Vide (Eds.): LATA 2012, LNCS 7183, © Springer-Verlag Berlin Heidelberg 2012.

Table 1. Space and time bounds for Claude and Navarro's compressed self-indexes [4,5], Kreft and Navarro's [16] and our own. Throughout, r is the number of rules in a given CFG generating a string S[1..n] and only S, but the CFG must be an SLP in Row 1 and a balanced SLP in Row 4. In Row 1, h is the height of the parse tree. In Row 2, R is the total length of the rules' right-hand sides and 0 < ɛ ≤ 1. In Row 3, d is the depth of nesting in the parse.

source  | total space (bits)                             | search time
[4]     | O(r log r) + r log n                           | O((m^2 + h(m + occ)) log r)
[5]     | 2R log r + R log n + ɛr log r + o(r log r)     | O((m^2/ɛ) log r + occ log r)
[16]    | 2z log(n/z) + z log z + 5z log σ + O(z) + o(n) | O(m^2 d + (m + occ) log z)
Thm. 4  | 2r log r + O(z(log n + log z log log z))       | O(m^2 + (m + occ) log log n)

Kreft and Navarro [16] gave the first (and, so far, only) compressed self-index based on LZ77, which takes 2z log(n/z) + z log z + 5z log σ + O(z) + o(n) bits, where z is the number of phrases in the LZ77 parse of S, and O(m^2 d + (m + occ) log z) time, where d ≤ z is the depth of nesting in the parse. Throughout this paper, we consider the non-self-referential LZ77 parse, so z ≥ log n. The o(n) term in Kreft and Navarro's space bound can be dropped at the cost of increasing the time bound by a factor of log(n/z). In this paper we show how, given a balanced SLP for S, we can add O(z(log n + log z log log z)) bits and obtain a compressed self-index that answers queries in O(m^2 + (m + occ) log log n) time. We note that Maruyama et al. [18] recently described another grammar-based index, but its search time depends on the number of occurrences of a maximal common subtree in edit-sensitive parse trees of P and S, making it difficult to compare their data structure to those mentioned above.
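To make the parse concrete, the non-self-referential LZ77 parse can be sketched as follows. This is our own illustrative, quadratic-time version (a practical parser would use a suffix tree or similar index over the prefix scanned so far), and the function name and list representation are assumptions, not the paper's.

```python
def lz77_parse(s):
    # Non-self-referential LZ77: each phrase is the longest prefix of the
    # remaining text that occurs entirely before the current position,
    # plus one new character.  Illustrative quadratic-time sketch only.
    phrases = []
    i = 0
    while i < len(s):
        j = i
        # extend the match while s[i:j+1] occurs entirely within s[:i]
        while j + 1 <= len(s) and s[:i].find(s[i:j + 1]) != -1:
            j += 1
        # the phrase is the longest previous match plus one new character
        phrases.append(s[i:min(j + 1, len(s))])
        i = min(j + 1, len(s))
    return phrases
```

On Kreft and Navarro's running example, lz77_parse("alabar a la alabarda$") returns nine phrases (a, l, ab, ar, a space, a plus a space, la plus a space, alabard, a$), so z = 9.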
In order to better compare our bounds, those by Claude and Navarro and those by Kreft and Navarro (all of which are summarized in Table 1), it is useful to review some bounds for LZ77, balanced and unbalanced SLPs, and arbitrary CFGs. The LZ77 compression algorithm works by parsing S from left to right into phrases: after parsing S[1..i-1], it finds the longest prefix S[i..j-1] of S[i..n] that has occurred before and selects S[i..j] as the next phrase. The previous occurrence of S[i..j-1] is called the phrase's source. Kreft and Navarro considered the non-self-referential LZ77 parse, in which each phrase's source must end before the phrase itself begins, so z ≥ log n. For ease of comparison, we also consider the non-self-referential parse. In an SLP of size r, each of the non-terminals X_1,...,X_r appears on the left-hand side of exactly one rule and each rule is either of the form X_i → a, where a is a terminal, or X_i → X_j X_k, where i > j,k; thus, the SLP generates exactly one string and can be encoded in 2r log r + O(r) bits. Rytter [22] and Charikar et al. [2] showed that, in every CFG generating S and only S, the total length of the rules' right-hand sides is at least z. Rytter showed how we can build an SLP for S with O(z log n) rules whose parse tree has the shape of an AVL tree, which is height-balanced, and Charikar et al. showed how to build such an SLP whose parse tree is weight-balanced. Finally, Charikar et al. showed that finding

an algorithm for building a CFG generating S and only S in which the total length of the rules' right-hand sides is o(z log n / log log n) would imply progress on the well-studied problem of building addition chains. With Rytter's and Charikar et al.'s bounds in mind, it is clear that the time bound of our index is strictly better than those of the others. Kreft and Navarro's index has the best worst-case space bound, but only when we drop the o(n) term, increasing their time bound by a factor of log(n/z). As it is not known whether it is possible, regardless of the time taken, to build a CFG that generates S and only S and in which the total length of the rules' right-hand sides is o(z log n), it is also not known whether Claude and Navarro's space bounds are ever strictly better than ours. It is even conceivable that they could be worse: suppose z = log^{O(1)} n and the minimum total length of the rules' right-hand sides in a CFG generating S and only S is z log n; then the r log n and R log n terms in Claude and Navarro's space bounds are Θ(z log^2 n), while the dominant term in our own space bound is 2r log r = Θ(z log n log log n). To build our index, we add bookmarks to the SLP that allow us to extract substrings quickly from around boundaries between phrases in the LZ77 parse. In Section 2 we show that a bookmark for a position b takes O(log n) bits and, for any length l, allows us to extract S[b-l..b+l] in O(l + log log n) time. Kreft and Navarro showed how, if we can extract substrings quickly from around phrase boundaries, then we can quickly list all occurrences of P that finish at or cross those boundaries, which are called primary occurrences. They first use two Patricia trees and access at the boundaries to find, for each position i ≥ 1 in P, the lexicographic range of the reverses of phrases ending with P[1..i], and the lexicographic range of suffixes of S starting with P[i+1..m] at phrase boundaries.
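Conceptually, these two range queries can be illustrated with explicit sorted lists and binary search standing in for the Patricia trees and boundary access; all names below are our own illustrative assumptions, not the paper's.

```python
import bisect

HI = chr(0x10FFFF)  # assumed larger than any character in the indexed strings

def lex_range(sorted_strings, prefix):
    # [lo, hi) interval of strings in the sorted list starting with prefix
    lo = bisect.bisect_left(sorted_strings, prefix)
    hi = bisect.bisect_left(sorted_strings, prefix + HI)
    return lo, hi

def split_ranges(pattern, rev_phrases, boundary_suffixes):
    # For each split P[1..i] / P[i+1..m]: the range of reversed phrases
    # starting with reverse(P[1..i]) (i.e. phrases ending with P[1..i])
    # and the range of boundary suffixes starting with P[i+1..m].
    rev_sorted = sorted(rev_phrases)
    suf_sorted = sorted(boundary_suffixes)
    return {i: (lex_range(rev_sorted, pattern[:i][::-1]),
                lex_range(suf_sorted, pattern[i:]))
            for i in range(1, len(pattern) + 1)}
```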
They then use a wavelet tree as a data structure for 2-dimensional range reporting to find all the phrase boundaries immediately preceded by P[1..i] and followed by P[i+1..m]. In Section 3 we show how, using bookmarks for faster access and a faster (albeit larger) range-reporting data structure, we can list all primary occurrences in O(m^2 + (m + occ) log log n) time. Occurrences of P that neither finish at nor cross phrase boundaries are called secondary, and they can be found recursively from the primary occurrences. We build a data structure for 2-sided range reporting on an n × n grid on which we place a point (i, j) for each phrase's source S[i..j]. Since the queries are 2-sided, this data structure takes O(z log n) bits and answers queries in O(log log n + p) time, where p is the number of points reported. If a phrase contains a secondary occurrence at a certain position in the phrase, then that phrase's source must also contain an occurrence at the corresponding position. It follows that, by querying the data structure with (a, b) for each primary or secondary occurrence S[a..b] we find, we can list all secondary occurrences of P in O(occ log log n) time.

2 Adding Bookmarks to a Balanced SLP

An SLP for S is called balanced if its parse tree is height- or weight-balanced. We claim that, for 1 ≤ i ≤ j = i + 2g ≤ n, we can choose two nodes v and

w of height O(log g) in the parse tree whose subtrees include the ith through jth leaves. To see why, let u be the lowest common ancestor of the ith and jth leaves. If u has height O(log g), then we choose its children as v and w; otherwise, suppose the rightmost leaf in u's left subtree is the kth in the tree. We choose as v the lowest common ancestor of the ith through kth leaves, which must lie on the rightmost path in u's left subtree. Since v's right subtree has fewer than k - i + 1 ≤ 2g leaves, v has height O(log g). We choose as w the lowest common ancestor of the (k+1)st through jth leaves, which must lie on the leftmost path in u's right subtree. Since w's left subtree has fewer than j - k ≤ 2g leaves, w also has height O(log g). Figure 1 illustrates our choice of v and w. Without loss of generality, assume b = i + g ≤ k; the case when b > k is symmetric. Then the bth leaf, which is midway between the ith and jth, is in v's subtree and we can store the path from v to the bth leaf in O(log g) bits. With this information, in O(l + log g) time we can perform partial depth-first traversals of v's subtree that start by descending from v to the bth leaf and then visit the l ≤ g leaves immediately to its left and right, if they are in v's subtree. Some of the l leaves immediately to the right of the bth leaf may be the leftmost leaves in w's subtree, instead; we can descend from w and visit them in O(l + log g) time.

Fig. 1. Suppose we are given an SLP for S and consider its parse tree. For 1 ≤ i ≤ j = i + 2g ≤ n, we can choose two nodes v and w of height O(log g) in the parse tree whose subtrees include the ith through jth leaves. Without loss of generality, assume the bth leaf is in v's subtree, where b = i + g. By storing the non-terminals at v and w and the path from v to b, we can extract S[b-l..b+l] in O(l + log g) time.
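The descent itself can be sketched on a toy SLP representation (a dictionary mapping each non-terminal either to a terminal character or to a pair of non-terminals; this encoding and the function names are our own assumptions). Only subtrees overlapping the requested leaf interval are visited, so the work is proportional to the output length plus the tree height, mirroring the traversal above.

```python
def slp_lengths(rules, root):
    # expansion length of every non-terminal reachable from root
    length = {}
    def fill(x):
        if x not in length:
            body = rules[x]
            length[x] = 1 if isinstance(body, str) else fill(body[0]) + fill(body[1])
        return length[x]
    fill(root)
    return length

def slp_extract(rules, length, x, lo, hi, out):
    # append expansion(x)[lo:hi] to out, visiting only overlapping subtrees
    if lo >= hi:
        return
    body = rules[x]
    if isinstance(body, str):        # terminal rule
        out.append(body)
        return
    left, right = body
    mid = length[left]
    if lo < mid:
        slp_extract(rules, length, left, lo, min(hi, mid), out)
    if hi > mid:
        slp_extract(rules, length, right, max(lo, mid) - mid, hi - mid, out)
```

For example, with rules X1 → a, X2 → b, X3 → X1 X2, X4 → X3 X3 (expanding to abab), extracting the interval [1, 3) yields ba.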
It follows that, if we store the non-terminals at v and w and O(log g) extra bits then, given l ≤ g, we can extract S[b-l..b+l] in O(l + log g) time.

Lemma 1. Given a balanced SLP for S with r rules and integers b and g, we can store 2 log r + O(log g) bits such that later, given l ≤ g, we can extract S[b-l..b+l] in O(l + log g) time.

The drawback to Lemma 1 is that we must know in advance an upper bound on the number of characters we will want to extract. To remove this restriction,

for each position we wish to bookmark, we apply Lemma 1 twice: the first time, we set g = n so that, given l ≥ log n, we can extract S[b-l..b+l] in O(l) time; the second time, we set g = log n so that, given l ≤ log n, we can extract S[b-l..b+l] in O(l + log log n) time.

Theorem 2. Given a balanced SLP for S and a position b, we can store O(log n) bits such that later, given l, we can extract S[b-l..b+l] in O(l + log log n) time.

We note that, without changing the asymptotic space bound in Theorem 2, the time bound can be improved to O(l + log^(c) n), where c is any constant and log^(c) n is the result of recursively applying the logarithm function c times to n. If we increase the space bound to O(log n log* n), where log* is the iterated logarithm, then the time bound decreases to O(l). Even as stated, however, Theorem 2 is enough for our purposes in this paper, as we use only the following corollary, which follows by applying Theorem 2 for each phrase boundary in the LZ77 parse.

Corollary 3. Given a balanced SLP for S, we can add O(z log n) bits such that later, given l, we can extract l characters to either side of any phrase boundary in O(l + log log n) time.

As an aside, we note that Rytter's construction produces an SLP in which, for each phrase in the LZ77 parse, there is a non-terminal whose expansion is that phrase. For any balanced SLP with this property, it is easy to add bookmarks such that we can extract quickly from phrase boundaries: for each phrase, we store the non-terminal generating that phrase and the leftmost and rightmost non-terminals at height log log n in its parse tree.

3 Listing Occurrences

As explained in Section 1, we use the same framework as Kreft and Navarro but with faster access at the phrase boundaries, using Corollary 3 and a faster data structure for 2-dimensional range reporting.
Given a pattern P[1..m], for 1 ≤ i ≤ m, we find all the boundaries where the previous phrase ends with P[1..i] and the next phrase starts with P[i+1..m]. To do this, we build one Patricia tree for the reversed phrases and another for the suffixes starting at phrase boundaries. A Patricia tree [20] is a compressed trie in which the length of each edge label is recorded and then all but the first character of each edge label are discarded. It follows that the two Patricia trees have O(z) nodes and take O(z log n) bits; notice that no two suffixes can be the same and, by the definition of the LZ77 parse, no two reversed phrases can be the same. We label each leaf in the first Patricia tree by the number of the phrase whose reverse leads to that leaf; we label each leaf in the second Patricia tree by the number of the phrase starting the suffix that leads to that leaf. For example, if S = alabar a la alabarda$, then its LZ77 parse is as shown in Figure 2 (from Kreft and Navarro's paper). For convenience, we treat all

strings as ending with a special character $.

Fig. 2. The LZ77 parse of alabar a la alabarda$, from Kreft and Navarro's paper [16]. Horizontal lines indicate phrases' sources, with arrows leading to the boxes containing the phrases themselves.

In this case, the reversed phrases are as shown below with the phrase numbers, in order by phrase number on the left and in lexicographic order on the right.

1) a$         9) $a$
2) l$         5)  $
3) ba$        6)  a$
4) ra$        7)  al$
5)  $         1) a$
6)  a$        3) ba$
7)  al$       8) drabala$
8) drabala$   2) l$
9) $a$        4) ra$

The suffixes starting at phrase boundaries are as shown below with the numbers of the phrases with which they begin, in order by phrase number on the left and in lexicographic order on the right.

2) labar a la alabarda$   5)  a la alabarda$
3) abar a la alabarda$    9) a$
4) ar a la alabarda$      6) a la alabarda$
5)  a la alabarda$        3) abar a la alabarda$
6) a la alabarda$         8) alabarda$
7) la alabarda$           4) ar a la alabarda$
8) alabarda$              7) la alabarda$
9) a$                     2) labar a la alabarda$

Notice we do not consider the whole string to be a suffix starting at a phrase boundary. We are interested only in phrase boundaries such that occurrences of patterns can cross the boundary with a non-empty prefix to the left of the boundary. The Patricia trees for the reversed phrases and suffixes are shown in Figure 3. Notice that, since la alabarda$ and labar a la alabarda$ share both their first and second characters, the second character a is omitted from the edge from the root to its rightmost child in the Patricia tree for the suffixes; we indicate this with a *. For each position i ≥ 1 in P, we find the lexicographic range of the reverses of phrases ending with P[1..i], and the lexicographic range of suffixes of S starting

with P[i+1..m] at phrase boundaries.

Fig. 3. The Patricia trees for the reversed phrases (on the left) in the LZ77 parse of alabar a la alabarda$ and the suffixes starting at phrase boundaries (on the right).

To do this, we first search for the reverse of P[1..i] in the Patricia tree for the reversed phrases and search for P[i+1..m] in the Patricia tree for the suffixes, then use our data structure from Corollary 3 to verify that the characters omitted from the edge labels match those in the corresponding positions in P. This takes O(m + log log n) time for each choice of i, or O(m^2 + m log log n) time in total. For example, if P = ala, then we search for a and la, al and a, and ala and the empty suffix. In the first search, we reach the root's third child in the Patricia tree for the reversed phrases, which is the 5th leaf (labelled 1), and the root's third child in the Patricia tree for the suffixes, which is the ancestor of the 7th and 8th leaves (labelled 7 and 2); the second and third searches fail because there are no paths in the first Patricia tree corresponding to al or ala. Once we have the lexicographic ranges for a choice of i, we check whether there are any pairs of consecutive phrase numbers with the first number j from the first range and the second number j+1 from the second range. In our example, (1, 2) is such a pair, meaning that the first phrase ends with a and the second phrase starts with la, so there is a primary occurrence crossing the first phrase boundary. To be able to find such pairs efficiently, we store a data structure for 2-dimensional range reporting on a z × (z-1) grid on which we have placed z-1 points, with each point (a, b) indicating that the phrase numbers of the lexicographically ath reversed phrase and the lexicographically bth suffix are consecutive.
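The whole reporting pipeline, primary occurrences at boundaries followed by the secondary recursion via sources, can be illustrated naively as follows, with direct scans standing in for the Patricia trees and both range-reporting grids; the phrase-list and source-interval representation is our own simplification, not the paper's.

```python
def list_occurrences(s, pattern, phrases, sources):
    # phrases: the LZ77 phrases of s, in order; sources[k] is the 0-based
    # inclusive interval (i, j) of phrase k's source, or None for a phrase
    # that is a single new character.
    starts, pos = [], 0
    for ph in phrases:
        starts.append(pos)
        pos += len(ph)
    ends = [st + len(ph) for st, ph in zip(starts, phrases)]
    m = len(pattern)
    # primary occurrences finish at or cross a phrase boundary
    primary = [a for a in range(len(s) - m + 1)
               if s[a:a + m] == pattern and any(a < e <= a + m for e in ends)]
    # secondary occurrences, found recursively: whenever a source contains
    # an occurrence, the phrase copied from that source contains one too
    found, queue = set(primary), list(primary)
    while queue:
        a = queue.pop()
        for k, src in enumerate(sources):
            if src is not None and src[0] <= a and a + m - 1 <= src[1]:
                copy = starts[k] + (a - src[0])
                if copy not in found:
                    found.add(copy)
                    queue.append(copy)
    return sorted(found)
```

On the running example, the pattern ala has one primary occurrence, at position 0, and one secondary occurrence, at position 12, copied from the source of the 8th phrase.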
Figure 4 shows the grid for our example, on the left, with the four-sided range query on the 5th row and the 7th and 8th columns that returns the point (5, 8), indicating a primary occurrence overlapping the phrases 1 and 2. Notice that the first row is always empty, since the lexicographically first reversed phrase starts with the end-of-string character $. Kreft and Navarro used a wavelet tree, but we use a recent data structure by Chan, Larsen and Pǎtraşcu [1] that takes O(z log log z) words and answers queries in O((1 + p) log log z) time, where p is the number of points reported. With this data structure, we can list all occ_1 primary occurrences in O(m^2 + (m + occ_1) log log n) time. Once we have found all the primary occurrences, it is not difficult to recursively find all the secondary occurrences. We build a data structure for 2-sided range reporting on an n × n grid on which we place a point (i, j) for each phrase's source S[i..j]. Since the queries are 2-sided, we can implement this data structure as a predecessor data structure and a range-maximum data structure [19], which together take O(z log n) bits and answer queries in O(log log n + p) time [7,11],

where p is the number of points reported. If a phrase contains a secondary occurrence at a certain position in the phrase, then that phrase's source must also contain an occurrence at the corresponding position. Notice that, by the definition of the parse, the first occurrence of any substring must be primary. It follows that, by querying the data structure with (a, b) for each primary or secondary occurrence S[a..b] we find, we can list all secondary occurrences of P in O(occ log log n) time. Figure 4 shows the grid for our example, on the right, with black dots indicating phrase sources and the grey dot indicating a two-sided query to find sources containing the primary occurrence S[1..3] of ala. This query returns a point (1, 6), so the phrase whose source is S[1..6], i.e., the 8th, contains a secondary occurrence of ala.

Fig. 4. On the left, the grid we use for listing primary occurrences in S = alabar a la alabarda$. The sides of the grid are labelled with the phrase numbers from the leaves in the two Patricia trees. A four-sided range query on the 5th row and the 7th and 8th columns returns the point (5, 8), indicating a primary occurrence overlapping the phrases 1 and 2. On the right, the grid we use for listing secondary occurrences in S = alabar a la alabarda$. The black dots indicate phrase sources and the grey dot indicates a two-sided query to find sources containing S[1..3].

Theorem 4. Given a balanced straight-line program for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words and obtain a compressed self-index for S such that, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m^2 + (m + occ) log log n) time.

4 Postscript

We recently designed a compressed self-index [12] based on the Relative Lempel-Ziv (RLZ) compression scheme proposed by Kuruppu, Puglisi and Zobel [17].
Given a database of genomes from individuals of the same species, Kuruppu et al. store the first genome G in an FM-index [8] and store the others compressed with a version of LZ77 [23] that allows phrases to be copied only from the first genome. They showed that RLZ compresses well and supports fast extraction, but did not show how to support search. Of course, given a pattern P[1..m], we can quickly find all occurrences of P in G by searching in its FM-index. If we

store a data structure for 2-sided range reporting like the one described at the end of Section 3, then we can also quickly find all secondary occurrences of P in the rest of the database. We now sketch one way to find primary occurrences. Let R be the rest of the database and suppose its RLZ parse relative to G consists of z phrases, d of them distinct. Consider each distinct phrase as a meta-character and consider the parse as a string R'[1..z] over the alphabet of meta-characters. Suppose a meta-character x in R' indicates a phrase copied from G[i..j]; then we associate with x the non-empty interval corresponding to G[i..j] in the Burrows-Wheeler Transform (BWT) of G, and the non-empty interval corresponding to the reverse (G[i..j])^R of G[i..j] in the BWT of G^R. We will refer to two orderings on the alphabet of meta-characters: lexicographic by their phrases (lex), and lexicographic by the reverses of their phrases (r-lex). Notice that the interval for a string A in the BWT of G contains the intervals of all the meta-characters whose phrases start with A, which are consecutive in the lex ordering; the interval for A^R in the BWT of G^R contains the intervals of all the meta-characters whose phrases end with A, which are consecutive in the r-lex ordering. More generally, if the intervals for two strings overlap in a BWT, then one must be contained in the other. We can use this fact to store small data structures such that, given an interval in the BWT of G or G^R, we can quickly find the corresponding interval in the lex or r-lex orderings of the meta-characters. We build a data structure for 4-sided range reporting on a d × z grid. We place a point (a, b) on the grid if the lexicographically bth suffix of R', considering meta-characters in lex order, is preceded by a copy of the ath distinct meta-character in the r-lex ordering. In addition to the FM-index for G we also store FM-indexes for G^R and R'.
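A greedy RLZ parse relative to the reference can be sketched as follows; this is our own simplification (it scans the reference directly instead of using the FM-index of G, and the literal fallback for characters absent from G is an assumption for self-containment).

```python
def rlz_parse(ref, text):
    # Each phrase is the longest prefix of the remaining text that occurs
    # in ref, stored as a 0-based inclusive interval into ref; a character
    # absent from ref is stored as a literal.
    phrases, p = [], 0
    while p < len(text):
        j = p
        while j < len(text) and ref.find(text[p:j + 1]) != -1:
            j += 1
        if j == p:
            phrases.append(text[p])      # literal fallback
            p += 1
        else:
            i = ref.find(text[p:j])
            phrases.append((i, i + (j - p) - 1))
            p = j
    return phrases
```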
We can find all primary occurrences of P in R if, for 1 ≤ i ≤ m, we

1. use the FM-index for G^R to find the interval for (P[1..i])^R in the BWT of G^R;
2. map that interval to the interval in the r-lex ordering containing phrases ending with P[1..i];
3. compute the RLZ parse of P[i+1..m] relative to G;
4. use the FM-index for G to find the interval for the last phrase P[j..m] of that parse in the BWT of G;
5. map that interval to the interval in the lex ordering containing phrases starting with P[j..m];
6. use the FM-index for R' to find the interval for the parse of P[i+1..m] in the BWT of R';
7. find all points (a, b) on the d × z grid with a in the interval for (P[1..i])^R in the BWT of G^R, and b in the interval for the parse of P[i+1..m] in the BWT of R'.

Notice that, if an occurrence of P[i+1..m] starts at a phrase boundary in R's parse, then the phrases in P[i+1..m]'s parse are the same as the subsequent phrases in R's parse, except that the last phrase P[j..m] in P[i+1..m]'s parse may be only a prefix of the corresponding phrase in R's parse. When we use the

FM-index for R' to find the interval for P[i+1..m]'s parse in the BWT of R', we start with the interval containing the meta-characters that immediately precede in R' meta-characters whose phrases start with P[j..m]. If we encounter another phrase in P[i+1..m]'s parse that does not occur in R's parse, then we know no occurrence of P[i+1..m] starts at a phrase boundary in R's parse. Steps 1, 3, 4 and 6 in the process above take time depending linearly on m, however, so repeating the whole process for each i between 1 and m takes time depending quadratically on m. If we use the FM-index for G^R to search for (P[1..m])^R in G^R then, as a byproduct, we find the intervals for (P[1..i])^R in the BWT of G^R for 1 ≤ i ≤ m. Similarly, if we use the FM-index for G to search for P[1..m] in G then we find the intervals for P[j..m] in the BWT of G for 1 ≤ j ≤ m. We can therefore avoid repeating Steps 1 and 4. To avoid repeating Steps 3 and 6, we use 1-dimensional dynamic programming to compute the RLZ parses of all suffixes of P and the corresponding intervals in the BWT of R'. Assume that, given the interval for a string Ac in the BWT of G and the length of A, we can quickly find the interval for A; we will explain later how we do this. We work from right to left in P. Assume we have already computed the interval in the BWT of G for the longest prefix P[i+1..l] of P[i+1..m] that occurs in G and that, for i+1 ≤ k ≤ m, we have already computed the RLZ parse of P[k..m] relative to G and the interval for that parse in the BWT of R'. We will show how to compute the interval in the BWT of G for the longest prefix of P[i..m] that occurs in G, the RLZ parse of P[i..m] and the interval for that parse in the BWT of R'. Knowing the interval for P[i+1..l] in the BWT of G, we can use the FM-index for G to find the interval for P[i..l].
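The step of going from the interval for a string to the interval for that string with one character prepended is one LF-mapping step of FM-index backward search. A naive sketch (rotation-based BWT construction and linear-time counting stand in for real rank structures; the function names are ours, not the paper's):

```python
def bwt(s):
    # BWT via sorted rotations; s must end with a unique smallest
    # sentinel such as '$'
    rotations = sorted(s[k:] + s[:k] for k in range(len(s)))
    return "".join(r[-1] for r in rotations)

def backward_step(bw, lo, hi, c):
    # given the interval [lo, hi) of a string A, return the interval of cA
    smaller = sum(1 for x in bw if x < c)        # C[c]
    return (smaller + bw[:lo].count(c),          # rank of c before lo
            smaller + bw[:hi].count(c))          # rank of c before hi

def fm_count(s, pattern):
    # count occurrences of pattern in s by repeated backward steps
    bw = bwt(s)
    lo, hi = 0, len(bw)
    for c in reversed(pattern):
        lo, hi = backward_step(bw, lo, hi, c)
        if lo >= hi:
            return 0
    return hi - lo
```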
If this interval is non-empty, then P[i..l] is the longest prefix of P[i..m] that occurs in G, and the first phrase in the RLZ parse of P[i..m]; the rest of the parse is the same as for P[l+1..m]. We map the interval for P[i..l] in the BWT of G to the corresponding interval in the lex order. If l = m, then the interval in the BWT of R' is the one containing the meta-characters that immediately precede in R' meta-characters whose phrases start with P[i..m]. If l < m, then the interval in the lex order should be associated with a single meta-character; otherwise, P[i..l] does not occur in the parse of R and, thus, the interval in the BWT of R' is empty. Knowing the interval for the parse of P[l+1..m] in the BWT of R', we use the FM-index for R' to find the interval for the parse of P[i..m]. If the interval for P[i..l] in the BWT of G is empty, then we compute the intervals for P[i+1..l'] and P[i..l'], for l' decreasing from l-1 to i, until we find a value of l' such that the interval for P[i..l'] is non-empty; we then proceed as described above. We now explain how, given the interval for a string Ac in the BWT of G and the length of A, we can quickly find the interval for A. In addition to the FM-index for G, we store a compressed suffix array (CSA) and compressed longest-common-prefix (LCP) array [9] for G and a previous/next-smaller-value data structure [10] for the LCP. Suppose [i..j] is the interval for Ac in the BWT of G. Notice that LCP[i] and LCP[j+1] are both at most |A|. If LCP[i] = |A|, then we use the P/NSV data structure to find the previous

entry LCP[i'] in the LCP array that is smaller than |A|; otherwise, we set i' = i. If LCP[j+1] = |A|, then we use the P/NSV data structure to find the next entry LCP[j'+1] in the LCP array that is smaller than |A|; otherwise, we set j' = j. The interval for A in the BWT of G is [i'..j']. We use the CSA by Grossi, Gupta and Vitter [13], which takes (1/ɛ) n H_k(G) + O(n) bits and supports access to the suffix array of G in O(log^ɛ n) time, where k ≤ α log_σ n, σ is the size of the alphabet of G, and α < 1 and ɛ ≤ 1 are constants; see [21]. We note that, using this CSA, we can also reduce the time needed to locate the occurrences of P in G. Choosing the right implementations of the various other data structures, we obtain the following result:

Theorem 5. Suppose we are asked to build a compressed self-index for a database of genomes from individuals of the same species. Let G[1..n] be the first genome in the database and suppose the RLZ parse of the rest of the database relative to G consists of z phrases. For any positive constants α < 1 and ɛ and k ≤ α log_4 n, we can store the database in O(n(H_k(G) + 1) + z(log n + log z log log z)) bits such that, given a pattern P[1..m], we can find all occ occurrences of P in the database in O((m + occ) log^ɛ n) time.

Many of the ideas used in this section have been used previously by other authors, notably Chien et al. [3], Hon et al. [14] and Huang et al. [15]. We recently became aware that Do, Jansson, Sadakane and Sung [6] independently proved a theorem similar to Theorem 5 before we did. We have described our own ideas here because they lead to an incomparable time-space tradeoff.

References

1. Chan, T.M., Larsen, K.G., Pǎtraşcu, M.: Orthogonal range searching on the RAM, revisited. In: Proceedings of the 27th Symposium on Computational Geometry (SoCG) (2011)
2. Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem.
IEEE Transactions on Information Theory 51(7) (2005)
3. Chien, Y.-F., Hon, W.-K., Shah, R., Vitter, J.S.: Geometric Burrows-Wheeler Transform: Linking range searching and text indexing. In: Proceedings of the Data Compression Conference (DCC) (2008)
4. Claude, F., Navarro, G.: Self-indexed text compression using straight-line programs. In: Královič, R., Niwiński, D. (eds.) MFCS 2009. LNCS, vol. 5734. Springer, Heidelberg (2009)
5. Claude, F., Navarro, G.: Improved grammar-based self-indexes. Tech. Rep., arxiv.org (2011)
6. Do, H.H., Jansson, J., Sadakane, K., Sung, W.K.: Indexing strings via textual substitutions from a reference. Manuscript
7. van Emde Boas, P.: Preserving order in a forest in less than logarithmic time. In: Proceedings of the 16th Symposium on Foundations of Computer Science (FOCS) (1975)
8. Ferragina, P., Manzini, G.: Indexing compressed text. Journal of the ACM 52(4) (2005)

9. Fischer, J.: Wee LCP. Information Processing Letters 110(8-9) (2010)
10. Fischer, J.: Combined data structure for previous- and next-smaller-values. Theoretical Computer Science 412(22) (2011)
11. Gabow, H.N., Bentley, J.L., Tarjan, R.E.: Scaling and related techniques for geometry problems. In: Proceedings of the 16th Symposium on Theory of Computing (STOC) (1984)
12. Gagie, T., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: A compressed self-index for genomic databases. Tech. Rep., arxiv.org (2011)
13. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the 14th Symposium on Discrete Algorithms (SODA) (2003)
14. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: On entropy-compressed text indexing in external memory. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721. Springer, Heidelberg (2009)
15. Huang, S., Lam, T.W., Sung, W.K., Tam, S.L., Yiu, S.M.: Indexing similar DNA sequences. In: Chen, B. (ed.) AAIM 2010. LNCS, vol. 6124. Springer, Heidelberg (2010)
16. Kreft, S., Navarro, G.: Self-indexing based on LZ77. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661. Springer, Heidelberg (2011)
17. Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393. Springer, Heidelberg (2010)
18. Maruyama, S., Nakahara, M., Kishiue, N., Sakamoto, H.: ESP-index: A compressed index based on edit-sensitive parsing. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024. Springer, Heidelberg (2011)
19. McCreight, E.M.: Priority search trees. SIAM Journal on Computing 14(2) (1985)
20. Morrison, D.R.: PATRICIA - practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM 15(4) (1968)
21. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1) (2007)
22.
Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science 302(1-3), (2003) 23. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), (1977)
