A Faster Grammar-Based Self-Index


1 A Faster Grammar-Based Self-Index
Travis Gagie (Aalto University), Paweł Gawrychowski (Max-Planck-Institut für Informatik), Juha Kärkkäinen (University of Helsinki), Yakov Nekrich (University of Bonn), Simon Puglisi (King's College, London)
LATA '12

2 Motivation
We want to store repetitive texts (say, genomic databases) in compressed form, but in such a way that we can search them quickly. In other words, given a text, build a small structure that allows fast pattern matching.
Pattern matching? In this talk, pattern matching = exact pattern matching, i.e., given P[1..m] we want to find where it occurs exactly in the text S[1..n]. We might want the first occurrence, or all of them, or just a few...
Such a structure is called an index. If it also allows retrieving the original text, it is called a self-index.

3 Problem, more precisely
We are asked to build a self-index for a string S[1..n] whose LZ77 parse consists of z phrases.
What is LZ77? Given a text, we split it into z disjoint fragments called phrases. Each phrase is a single letter or a substring of the already defined prefix. The number of phrases is believed to be the right measure of how repetitive the text is.
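The definition above can be turned into a small greedy parser. This is only a sketch under stated assumptions: the function name `lz77_parse` and the `(start, length, source)` triples are illustrative, the parser always takes the longest possible phrase, and it runs in quadratic time or worse, unlike the constructions the talk relies on.

```python
def lz77_parse(s):
    """Greedy LZ77 parse of s.

    Each phrase is either a single new letter or the longest prefix of the
    remaining text that occurs entirely inside the already parsed prefix.
    Returns (start, length, source) triples; source is None for a letter.
    """
    phrases = []
    i = 0
    while i < len(s):
        length, source = 0, None
        # Grow the phrase while it still occurs completely inside s[:i].
        while i + length < len(s):
            pos = s.find(s[i:i + length + 1], 0, i)
            if pos == -1:
                break
            length, source = length + 1, pos
        if length == 0:
            length = 1  # a letter that has not been seen before
        phrases.append((i, length, source))
        i += length
    return phrases


if __name__ == "__main__":
    text = "abababab"
    for start, length, source in lz77_parse(text):
        print(text[start:start + length], source)
```

On "abababab" this produces the phrases a, b, ab, abab, so z = 4 even though n = 8; doubling the text again would add only one more phrase.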

4 Solution?
Straight-line program (SLP), or grammar representation: simply a context-free grammar with exactly one production per nonterminal.
It is known that, given an LZ77 parse consisting of z phrases, we can construct such a program of size just O(z log n) words. The program can be assumed to be balanced, meaning that for each production A → BC we have |B| ≈ |C|. Extracting an arbitrary substring of length l from a balanced SLP takes O(log n + l) time.
But how do we search?!
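To make substring extraction concrete, here is a minimal sketch of descending an SLP to extract a substring. The dictionary encoding of the rules and the names `make_slp`, `length` and `extract` are assumptions made for illustration, not the representation used in the talk. The walk follows the two boundary paths and fully expands only the subtrees inside the requested range, which is where the O(log n + l) bound for a balanced SLP comes from.

```python
from functools import lru_cache


def make_slp(rules):
    """rules maps a nonterminal to a single letter or to a pair (B, C)."""

    @lru_cache(maxsize=None)
    def length(x):
        rhs = rules[x]
        return 1 if isinstance(rhs, str) else length(rhs[0]) + length(rhs[1])

    def extract(x, i, l, out):
        """Append the l letters starting at offset i of x's expansion to out."""
        if l <= 0:
            return
        rhs = rules[x]
        if isinstance(rhs, str):
            out.append(rhs)
            return
        left, right = rhs
        nl = length(left)
        if i < nl:  # part of the range lies in the left child
            extract(left, i, min(l, nl - i), out)
        if i + l > nl:  # part of the range lies in the right child
            start = max(i, nl)
            extract(right, start - nl, i + l - start, out)

    return length, extract


if __name__ == "__main__":
    # A balanced SLP for "abababab": E -> DD, D -> CC, C -> AB.
    rules = {"A": "a", "B": "b", "C": ("A", "B"), "D": ("C", "C"), "E": ("D", "D")}
    length, extract = make_slp(rules)
    out = []
    extract("E", 3, 4, out)
    print(length("E"), "".join(out))
```

The example grammar has five rules for a text of length 8; extracting four letters from offset 3 yields "baba".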

5 Framework (of Navarro)
[Figure: slide "A LZ77 Self-Index" from G. Navarro, "Indexing LZ77", showing the LZ77 parse of an example text and the index built on it.]

10 Old Idea
Secondary occurrence: an occurrence is secondary iff it is completely contained in some phrase.
Primary occurrence: an occurrence is primary iff it crosses some phrase boundary.
Assuming we have all primary occurrences, all secondary occurrences can be found via 2-sided 2D range reporting.
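The step from primary to secondary occurrences can be sketched as follows. This is an illustration under assumptions, not the talk's data structure: phrases are hypothetical `(start, length, source)` triples (source is None for a single letter), the pattern has length m ≥ 2, and a linear scan over the phrases stands in for the 2-sided 2D range reporting query. Whenever a phrase's source interval contains a known occurrence, the phrase contains a copy of it.

```python
from collections import deque


def is_primary(phrases, x, m):
    """True iff the occurrence S[x:x+m] is not contained in a single phrase."""
    return not any(start <= x and x + m <= start + length
                   for start, length, _ in phrases)


def all_occurrences(phrases, primary, m):
    """Derive every occurrence start from the primary ones."""
    found = set(primary)
    queue = deque(primary)
    while queue:
        x = queue.popleft()
        # Naive stand-in for range reporting: list the phrases whose source
        # interval [src, src + length) contains the occurrence [x, x + m).
        for start, length, src in phrases:
            if src is not None and src <= x and x + m <= src + length:
                y = start + (x - src)  # position of the copy in the phrase
                if y not in found:
                    found.add(y)
                    queue.append(y)
    return sorted(found)


if __name__ == "__main__":
    text, pattern = "abababab", "ab"
    phrases = [(0, 1, None), (1, 1, None), (2, 2, 0), (4, 4, 0)]  # a|b|ab|abab
    m = len(pattern)
    starts = [x for x in range(len(text) - m + 1) if text.startswith(pattern, x)]
    primary = [x for x in starts if is_primary(phrases, x, m)]
    print(primary, all_occurrences(phrases, primary, m))
```

In this example only the occurrence at position 0 crosses a phrase boundary; the ones at 2, 4 and 6 are recovered by following copies.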

13 Old Idea, continued
P = P[1..i] P[i+1..m]
To find all primary occurrences of P[1..m], for each 1 ≤ i ≤ m, we
1. search for P[i+1..m] in the Patricia tree of the suffixes starting at phrase boundaries,
2. search for (P[1..i])^R in the Patricia tree of the reversed phrases,
3. check the results via random access,
4. use range reporting to find all boundaries preceded by P[1..i] and followed by P[i+1..m].
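The four steps above can be sketched with much simpler stand-ins. Everything below is an assumption made for illustration: sorted lists with binary search replace the two Patricia trees, whole reversed prefixes of the text replace the reversed phrases, a set intersection replaces the range reporting structure, and only splits with 1 ≤ i < m are tried (the split with an empty suffix is left out). Because full strings are compared, no separate verification step is needed here.

```python
from bisect import bisect_left


def prefix_range(sorted_keys, q):
    """Half-open index range of the keys in sorted_keys that start with q."""
    lo = bisect_left(sorted_keys, q)
    hi = lo
    while hi < len(sorted_keys) and sorted_keys[hi].startswith(q):
        hi += 1
    return lo, hi


def primary_occurrences(s, boundaries, p):
    """boundaries: start positions of all phrases except the first one."""
    # Suffixes starting at boundaries, and reversed prefixes ending there.
    suf = sorted((s[b:], b) for b in boundaries)
    pre = sorted((s[:b][::-1], b) for b in boundaries)
    suf_keys = [k for k, _ in suf]
    pre_keys = [k for k, _ in pre]
    found = set()
    for i in range(1, len(p)):
        lo1, hi1 = prefix_range(pre_keys, p[:i][::-1])  # preceded by P[1..i]
        lo2, hi2 = prefix_range(suf_keys, p[i:])        # followed by P[i+1..m]
        left = {b for _, b in pre[lo1:hi1]}
        for _, b in suf[lo2:hi2]:
            if b in left:
                found.add(b - i)  # the occurrence starts i letters before b
    return sorted(found)


if __name__ == "__main__":
    # Phrases a|b|ab|abab start at 0, 1, 2, 4.
    print(primary_occurrences("abababab", [1, 2, 4], "ab"))
    print(primary_occurrences("abababab", [1, 2, 4], "ba"))
```

Trying every split point is what gives the quadratic dependence on m in the search time: each of the m splits costs up to O(m) work in the tree searches.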

14 New Ideas
We don't use random access during search. We need to extract only from the phrase boundaries, so we store bookmarks to them. We also store a data structure for 1D range reporting (or a bitvector) at each node with depth at most log log z.

15 Bounds
Calculation shows that if we choose our data structures carefully and can extract from bookmarks in O(1) time per character, then with O(z log log z) extra words we can find all occ occurrences of P in O(m^2 + occ log log n) time.

16 Comparison
source     total space (bits)                                    search time
C & N 09   O(|SLP|) + r log n                                    O((m^2 + h(m + occ)) log r)
C & N 11   (2 + o(1)) |CFG| + R log n + ε r log r                O((m^2/ε) log R + occ log r)
K & N 11   2z log(n/z) + z log z + 5z log σ + O(z) + o(n)        O(m^2 d + (m + occ) log z)
Us         |BSLP| + O(z(log n + log z log log z))                O(m^2 + occ log log n)
r: number of rules in the grammar
h: height of the parse tree
R: total length of the right-hand sides
d: depth of nesting in the parse

17 Bookmarking
Lemma: Given a balanced SLP for S with r rules and integers b and g, we can store 2 log r + O(log g) bits such that later, given l ≤ g, we can extract S[b − l..b + l] in O(l + log g) time.
Corollary: We can store O(log z) words such that, given any l, we can extract S[b − l..b + l] in O(l) time.

18 Parse Tree
[Figure: parse tree with nodes u, v, w; labels: O(log g), i, k, k + 1, b = i + g, j = i + 2g.]

19 Space Bounds (in words)
Patricia trees                 O(z)
bookmarks                      O(z log z)
1D range reporting             O(z log log z)
4-sided 2D range reporting     O(z log log z)
2-sided 2D range reporting     O(z)
total                          O(z log log z)

20 Time Bounds
searching in Patricia trees (with perfect hashing if necessary)   O(m^2)
extracting from bookmarks                                         O(m^2)
1D or 4-sided 2D range reporting                                  O(m^2)
2-sided 2D range reporting                                        O(occ log log n)
total                                                             O(m^2 + occ log log n)

21 Final Result
Theorem: Given a balanced SLP for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words such that, given a pattern P[1..m], we can find all occ occurrences of P in O(m^2 + occ log log n) time.

22 News Flash!!! (not in the paper)
Theorem: Given a balanced SLP with r rules for a string S[1..n] whose LZ77 parse consists of z phrases, we can build an O(r + z log log z)-word data structure such that, given a pattern P[1..m], we can find all occ occurrences of P in S in O(m log z + occ log log n) time.

23 PAUSE FOR BREATH

24 Relative Lempel-Ziv
Kuruppu, Puglisi and Zobel proposed that, to store a genomic database, we
1. build an FM-index for the first genome G (or an artificial reference genome),
2. compress the rest with a version of LZ77 that allows phrases to be copied only from G.
Do, Jansson, Sadakane and Sung designed an RLZ-index before we did, but didn't publish it.
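The second step can be sketched as a parse in which every phrase is copied from the reference. The function names `rlz_parse` and `rlz_decode` and the phrase triples are hypothetical, and a plain `str.find` on the reference stands in for the FM-index of G, so this shows only the shape of the output, not the efficiency of a real implementation.

```python
def rlz_parse(reference, text):
    """Parse text into phrases copied from reference.

    Returns (source, length, letter) triples: a copied phrase has a source
    position in the reference and letter None; a letter that does not occur
    in the reference is stored literally with source None.
    """
    phrases = []
    i = 0
    while i < len(text):
        length, src = 0, None
        # Extend the phrase while it still occurs somewhere in the reference.
        while i + length < len(text):
            pos = reference.find(text[i:i + length + 1])
            if pos == -1:
                break
            length, src = length + 1, pos
        if length == 0:
            phrases.append((None, 1, text[i]))
            i += 1
        else:
            phrases.append((src, length, None))
            i += length
    return phrases


def rlz_decode(reference, phrases):
    """Rebuild the text from the reference and the phrase list."""
    return "".join(ch if src is None else reference[src:src + length]
                   for src, length, ch in phrases)


if __name__ == "__main__":
    G = "ACGTACGT"
    phrases = rlz_parse(G, "ACGTTACGA")
    print(phrases, rlz_decode(G, phrases))
```

Because phrases never refer to other compressed sequences, each genome can be decoded using the reference alone.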

25 So what happens in RLZ?
Theorem: We can store the database in O(n(H_k(G) + 1) + z(log n + log z log log z)) bits such that we can find all occ occurrences of P in O((m + occ) log^ε (n + z)) time.
The log^ε (n + z) factor in the query time comes from accessing the suffix array using Grossi, Gupta and Vitter's CSA. In real life, it should be log n.

26 Future work
1. Is it possible to construct from an LZ77 parse an SLP of size smaller than O(z log n)?
2. Can we achieve O(m + occ) query time? (Maybe with a slightly bigger self-index?)

27 QUESTIONS?
