Preview: Text Indexing
Transcription
1 Simon Gog, KIT: University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
2 Text Indexing: Motivation
Problem: Given a text T of length n over an alphabet Σ of size σ and a pattern P of length m, typical questions we want to answer efficiently are:
- Does P occur in T? (Existence query)
- How often does P occur in T? (Count query)
- Where does P occur in T? (Locate query)
Two scenarios:
- Scan the whole text T for each query (time complexity O(n + m)).
- Build an index for T once, then use the index to answer each query (time complexity not linearly dependent on n, or even independent of n).
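The scan scenario can be sketched as follows. This is a minimal illustration, not part of the slides: the function name `locate_by_scan` is hypothetical, and Python's built-in `str.find` is used for brevity (a KMP-style matcher would be needed to guarantee the O(n + m) bound). Positions are 0-based.

```python
def locate_by_scan(text, pattern):
    """Return all 0-based start positions of pattern in text by scanning text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Continue one position later so overlapping occurrences are found too.
        start = text.find(pattern, start + 1)
    return positions


T = "abracadabrabarbara$"
occurrences = locate_by_scan(T, "bar")
print(len(occurrences) > 0)  # existence query: True
print(len(occurrences))      # count query: 2
print(occurrences)           # locate query: [11, 14]
```

Every query touches the whole text again, which is exactly the cost an index is meant to avoid.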
3 Text Indexing: Motivation
Index solution: the Suffix Array (SA) was already explained in this lecture.
- Existence/count time: O(m log n)
- Locate time: O(m log n + occ), where occ is the number of occurrences.
Drawbacks:
- SA is larger than T: n log n bits compared to n log σ bits. For σ = 256 and n = 2^32 we get a factor of 4.
- SA requires T to answer queries.
Other classical indexes: Suffix Trees (ST), String B-Trees (SBT), and uncompressed Positional Inverted Indexes (PII).
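The factor of 4 follows directly from the two size formulas. The short check below assumes each SA entry takes ceil(log2 n) bits and each text character ceil(log2 σ) bits; the helper name `index_bits` is made up for this illustration.

```python
import math


def index_bits(n, sigma):
    """Return (suffix array bits, text bits) for text length n and alphabet size sigma."""
    sa_bits = n * math.ceil(math.log2(n))        # n entries of log n bits each
    text_bits = n * math.ceil(math.log2(sigma))  # n characters of log sigma bits each
    return sa_bits, text_bits


sa_bits, text_bits = index_bits(2**32, 256)
print(sa_bits // text_bits)  # 4
```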
4 Text Indexing: Suffix Array
Example: T = abracadabrabarbara$ (n = 19)

  i  SA[i]  T[SA[i]..n-1]
  0   18    $
  1   17    a$
  2   10    abarbara$
  3    7    abrabarbara$
  4    0    abracadabrabarbara$
  5    3    acadabrabarbara$
  6    5    adabrabarbara$
  7   15    ara$
  8   12    arbara$
  9   14    bara$
 10   11    barbara$
 11    8    brabarbara$
 12    1    bracadabrabarbara$
 13    4    cadabrabarbara$
 14    6    dabrabarbara$
 15   16    ra$
 16    9    rabarbara$
 17    2    racadabrabarbara$
 18   13    rbara$

T[SA[i]..n-1] < T[SA[i+1]..n-1]: SA[i] contains the starting position of the i-th lexicographically smallest suffix of T.
Matching algorithm: binary search (forward, left-to-right).
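The suffix array of the example text and the binary-search matching can be reproduced with a short sketch. It is deliberately naive: the suffixes are sorted directly (far slower than dedicated construction algorithms), and the function names are illustrative only. Positions are 0-based and `$` is assumed to sort before all letters, as it does in ASCII.

```python
def suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, pattern):
    """Return the half-open SA range [lo, hi) of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


T = "abracadabrabarbara$"
SA = suffix_array(T)
lo, hi = sa_interval(T, SA, "bar")
print(hi - lo)             # count query: 2
print(sorted(SA[lo:hi]))   # locate query: [11, 14]
```

Each of the O(log n) comparisons inspects up to m characters, which gives the O(m log n) bound; reporting the occ positions adds the O(occ) term.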
5 Compressed Text Indexes: The FM-Index
FM-Index (Ferragina and Manzini [2000]):
- Index based on the Burrows-Wheeler Transform (BWT)
- Matching algorithm works backwards (right-to-left)
- Existence and count queries in time O(m log σ)
BWT: BWT[i] = T[(SA[i] − 1) mod n]
- uncompressed size: n log σ bits
- compressed size: nH_k(T) bits (+ information for contexts of length k)
6 BWT of T = abracadabrabarbara$

  i  SA[i]  BWT[i]  T[SA[i]..n-1]
  0   18    a       $
  1   17    r       a$
  2   10    r       abarbara$
  3    7    d       abrabarbara$
  4    0    $       abracadabrabarbara$
  5    3    r       acadabrabarbara$
  6    5    c       adabrabarbara$
  7   15    b       ara$
  8   12    b       arbara$
  9   14    r       bara$
 10   11    a       barbara$
 11    8    a       brabarbara$
 12    1    a       bracadabrabarbara$
 13    4    a       cadabrabarbara$
 14    6    a       dabrabarbara$
 15   16    a       ra$
 16    9    b       rabarbara$
 17    2    b       racadabrabarbara$
 18   13    a       rbara$

BWT[i] = T[SA[i] − 1] for SA[i] > 0
BWT[i] = T[n − 1] for SA[i] = 0
I.e., BWT[i] is the character preceding suffix SA[i].
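The definition translates almost literally into code. The sketch below assumes 0-based positions and a naive suffix sort; it relies on Python's negative indexing, so `text[-1]` covers the wrap-around case SA[i] = 0.

```python
def suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """BWT[i] is the character preceding suffix SA[i], wrapping around at position 0."""
    # For sa[i] == 0 the index is -1, i.e. the last character of the text.
    return "".join(text[pos - 1] for pos in sa)


T = "abracadabrabarbara$"
BWT = bwt_from_sa(T, suffix_array(T))
print(BWT)  # arrd$rcbbraaaaaabba
```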
7 BWT = arrd$rcbbraaaaaabba
Array C contains, for each c ∈ Σ, the position of the first suffix in SA that starts with c:

 c     $  a  b  c   d   r
 C[c]  0  1  9  13  14  15

Operation rank(i, X, BWT) returns how often character X ∈ Σ occurs in the prefix BWT[0..i−1].
Example: search for P = bar.
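Both building blocks are easy to state in code. In this sketch, `build_c` and `rank` are illustrative names, and `rank` simply counts in a string slice, which takes linear time; it only shows the semantics, not an efficient structure.

```python
from collections import Counter


def build_c(bwt):
    """C[c] = number of characters in bwt that are smaller than c."""
    counts = Counter(bwt)
    c_array, total = {}, 0
    for ch in sorted(counts):
        c_array[ch] = total
        total += counts[ch]
    return c_array


def rank(i, x, bwt):
    """Number of occurrences of x in the prefix bwt[0..i-1] (the first i characters)."""
    return bwt[:i].count(x)


BWT = "arrd$rcbbraaaaaabba"
print(build_c(BWT))        # {'$': 0, 'a': 1, 'b': 9, 'c': 13, 'd': 14, 'r': 15}
print(rank(19, "r", BWT))  # 4
```

Because the suffixes are sorted, counting the smaller characters is the same as finding the first SA position whose suffix starts with c.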
8 Search backwards for bar.
Initial interval: [sp_0, ep_0] = [0..n−1] = [0..18]
Determine interval for r:
sp_1 = C[r] + rank(sp_0, r, BWT) = 15 + rank(0, r, BWT) = 15 + 0 = 15
ep_1 = C[r] + rank(ep_0 + 1, r, BWT) − 1 = 15 + rank(19, r, BWT) − 1 = 15 + 4 − 1 = 18
12 Search backwards for bar.
Interval: [sp_1, ep_1] = [15..18]
Determine interval for ar:
sp_2 = C[a] + rank(sp_1, a, BWT) = 1 + rank(15, a, BWT) = 1 + 6 = 7
ep_2 = C[a] + rank(ep_1 + 1, a, BWT) − 1 = 1 + rank(19, a, BWT) − 1 = 1 + 8 − 1 = 8
17 Search backwards for bar.
Interval: [sp_2, ep_2] = [7..8]
Determine interval for bar:
sp_3 = C[b] + rank(sp_2, b, BWT) = 9 + rank(7, b, BWT) = 9 + 0 = 9
ep_3 = C[b] + rank(ep_2 + 1, b, BWT) − 1 = 9 + rank(9, b, BWT) − 1 = 9 + 2 − 1 = 10
The final interval [9..10] covers the suffixes bara$ and barbara$, so bar occurs ep_3 − sp_3 + 1 = 2 times in T.
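The three steps of the example follow one loop, sketched below. The code assumes 0-based inclusive intervals [sp, ep] and a naive linear-time `rank`; the helper names are illustrative and not taken from the slides.

```python
from collections import Counter


def build_c(bwt):
    """C[c] = number of characters in bwt that are smaller than c."""
    counts = Counter(bwt)
    c_array, total = {}, 0
    for ch in sorted(counts):
        c_array[ch] = total
        total += counts[ch]
    return c_array


def rank(i, x, bwt):
    """Occurrences of x in the first i characters of bwt (naive, linear time)."""
    return bwt[:i].count(x)


def backward_search(pattern, bwt):
    """Return the inclusive SA interval (sp, ep) of pattern, or None if it is absent."""
    c_array = build_c(bwt)
    sp, ep = 0, len(bwt) - 1
    for ch in reversed(pattern):  # process the pattern right-to-left
        if ch not in c_array:
            return None
        sp = c_array[ch] + rank(sp, ch, bwt)
        ep = c_array[ch] + rank(ep + 1, ch, bwt) - 1
        if sp > ep:  # empty interval: pattern does not occur
            return None
    return sp, ep


BWT = "arrd$rcbbraaaaaabba"
print(backward_search("r", BWT))    # (15, 18)
print(backward_search("ar", BWT))   # (7, 8)
print(backward_search("bar", BWT))  # (9, 10)
```

Note that the search never looks at T or SA: the count is ep − sp + 1, and only locate queries need (sampled) SA values.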
22 Summary
Only C and a data structure R supporting the rank operation on BWT are required for existence and count queries.
- Space: σ log n bits for C + space for R
- Time: O(m · t_rank), where t_rank is the time for one rank operation. Independent of n? If t_rank is independent of n.
Next: How to implement rank?
Rank operation:
- Constant time and o(n) extra space solution on bitvectors (Jacobson [1989])
- Solution on general sequences: Wavelet Tree (Grossi et al. [2003])
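The idea behind rank on bitvectors can be illustrated with a simplified, single-level variant: store the number of ones before each block and count only inside one block at query time. This is a sketch under that simplification, not Jacobson's two-level structure with lookup tables; the class name and the block size of 8 are arbitrary choices.

```python
class RankBitVector:
    """Bitvector with precomputed per-block counts (simplified, single level)."""

    def __init__(self, bits, block=8):
        self.bits = [int(b) for b in bits]
        self.block = block
        # block_ranks[j] = number of ones in bits[0 .. j*block - 1]
        self.block_ranks = [0]
        for start in range(0, len(self.bits), block):
            ones = sum(self.bits[start:start + block])
            self.block_ranks.append(self.block_ranks[-1] + ones)

    def rank1(self, i):
        """Number of ones in bits[0..i-1]."""
        j = i // self.block
        # Stored count up to the block boundary plus a scan of at most one block.
        return self.block_ranks[j] + sum(self.bits[j * self.block:i])

    def rank0(self, i):
        """Number of zeros in bits[0..i-1]."""
        return i - self.rank1(i)


bv = RankBitVector("0111011001000000000")
print(bv.rank0(11))  # 5
print(bv.rank1(19))  # 6
```

With a fixed block size the in-block scan is bounded by a constant number of bits; real implementations replace it with a popcount instruction.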
24 Wavelet Tree Example: Calculate Rank
Nodes of the wavelet tree over BWT (0 = symbol goes to the left child, 1 = to the right child):

 node   sequence              bitvector
 b_ε    arrd$rcbbraaaaaabba   0111011001000000000
 b_0    a$bbaaaaaabba         0011000000110
 b_1    rrdrcr                110101
 b_00   a$aaaaaaa             101111111
 b_01   bbbb                  (leaf b)
 b_10   dc                    10
 b_11   rrrr                  (leaf r)

Leaves below b_00: $ and aaaaaaaa; leaves below b_10: c and d.
Code of a: a = 001
rank(11, a, WT) = rank(rank(rank(11, 0, b_ε) = 5, 0, b_0) = 3, 1, b_00) = 2
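The rank computation of the example can be traced with a small balanced wavelet tree. This is a sketch with assumptions: the alphabet is split so that the left half gets the extra symbol (which reproduces the tree shape of the example), positions are exclusive prefix lengths, and bit counting is done naively with `sum` instead of a constant-time rank structure.

```python
class WaveletTree:
    """Balanced wavelet tree; each inner node stores one bit per symbol."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.bits = []
        self.left = self.right = None
        if len(self.alphabet) > 1:
            mid = (len(self.alphabet) + 1) // 2
            left_symbols = set(self.alphabet[:mid])
            self.bits = [0 if ch in left_symbols else 1 for ch in seq]
            self.left = WaveletTree(
                [ch for ch in seq if ch in left_symbols], self.alphabet[:mid])
            self.right = WaveletTree(
                [ch for ch in seq if ch not in left_symbols], self.alphabet[mid:])

    def rank(self, i, x):
        """Number of occurrences of x among the first i symbols."""
        node = self
        while node.left is not None:
            mid = (len(node.alphabet) + 1) // 2
            ones = sum(node.bits[:i])  # naive binary rank on this node
            if x in node.alphabet[:mid]:
                i, node = i - ones, node.left   # map position via rank of 0
            else:
                i, node = ones, node.right      # map position via rank of 1
        return i if node.alphabet == [x] else 0


BWT = "arrd$rcbbraaaaaabba"
wt = WaveletTree(BWT)
print("".join(map(str, wt.bits)))  # 0111011001000000000
print(wt.rank(11, "a"))            # 2
```

One binary rank is performed per level, so with constant-time bitvector rank a query takes O(log σ) time on a balanced tree.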
28 Compressed Text Indexing meets Algorithm Engineering
State of the art: Recent FM-Indexes are as small as the output of state-of-the-art compressors (like gzip, xz), while matching takes microseconds per character. This is the result of theoretical and practical improvements:
- Shape of the WT (balanced, Huffman, Hu-Tucker, ...)
- Bitvector representation (uncompressed/compressed)
- Hardware (popcount instruction, page size)
- Different sampling strategies for SA values (for locate queries)
- ...
29 Compressed Text Indexing
Combining an H_0-compressed bitvector with a Huffman-shaped wavelet tree results in nH_k(T) bits of space.
[Table: time (µs) and space (%) for WT-HUFF and WT-HUFFcompr on the 200 MB test instances DBLP.XML, DNA, ENGLISH, PROTEINS, and SOURCES; the measured values did not survive in this transcript.]
30 Compressed Text Indexing: Our Toolbox for Compact/Succinct Data Structures
Succinct Data Structure Library (SDSL): a C++ template library for compact/succinct structures.
Parametrizable structures:
- Bitvectors
- Compressed Integer Vectors
- Rank/Select Structures
- Wavelet Trees/Wavelet Matrices
- Compressed Suffix Arrays/Trees
- Search Engines
Available at
31 Lecture Text Indexing: Content
Theory:
- Classical indexes (Suffix Arrays/Suffix Trees/Inverted Indexes)
- Building blocks for compact/succinct structures: Compressed Bitvector, Rank Structures, Select Structures, Range-Min-Max-Tree
- Compressed indexes: FM-Indexes/Compressed Suffix Arrays, versions for highly repetitive text, Compressed Suffix Trees
- Search Engines
Practice:
- Use SDSL to implement and analyze structures.
- Design a code search engine.
32 Bibliography
Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS 2000), 2000.
Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2003), pages 841-850, 2003.
Guy Jacobson. Space-efficient static trees and graphs. In FOCS, 1989.
33 H_k of selected Pizza&Chili 200 MB test cases
[Table: H_k(T) and contexts/|T| in percent for each context length k, for DBLP.XML, DNA, ENGLISH, and PROTEINS; the values did not survive in this transcript.]
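For readers who want to compute such numbers themselves, the following sketch evaluates the k-th order empirical entropy. It assumes the common definition H_k(T) = (1/n) Σ_w |T_w| · H_0(T_w), where T_w collects the characters that follow each occurrence of the length-k context w; the function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def h0(seq):
    """Zeroth-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum(c / n * math.log2(n / c) for c in Counter(seq).values())


def hk(text, k):
    """k-th order empirical entropy of text, in bits per symbol."""
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        # Group each character by the k characters that precede it.
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)


T = "abracadabrabarbara$"
print(h0("abab"))         # 1.0
print(hk("abababab", 1))  # 0.0
print(hk(T, 1) <= h0(T))  # True
```

The number of distinct contexts, `len(followers)`, grows with k, which is the overhead the "contexts/|T|" column of the slide refers to.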
Theoretical Computer Science. Dynamic rank/select structures with applications to run-length encoded texts
Theoretical Computer Science 410 (2009) 4402 4413 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Dynamic rank/select structures with
More informationText Indexing: Lecture 6
Simon Gog gog@kit.edu - 0 Simon Gog: KIT The Research University in the Helmholtz Association www.kit.edu Reviewing the last two lectures We have seen two top-k document retrieval frameworks. Question
More informationSuccincter text indexing with wildcards
University of British Columbia CPM 2011 June 27, 2011 Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview
More informationarxiv: v1 [cs.ds] 19 Apr 2011
Fixed Block Compression Boosting in FM-Indexes Juha Kärkkäinen 1 and Simon J. Puglisi 2 1 Department of Computer Science, University of Helsinki, Finland juha.karkkainen@cs.helsinki.fi 2 Department of
More informationAlphabet Friendly FM Index
Alphabet Friendly FM Index Author: Rodrigo González Santiago, November 8 th, 2005 Departamento de Ciencias de la Computación Universidad de Chile Outline Motivations Basics Burrows Wheeler Transform FM
More informationA Simple Alphabet-Independent FM-Index
A Simple Alphabet-Independent -Index Szymon Grabowski 1, Veli Mäkinen 2, Gonzalo Navarro 3, Alejandro Salinger 3 1 Computer Engineering Dept., Tech. Univ. of Lódź, Poland. e-mail: sgrabow@zly.kis.p.lodz.pl
More informationCompact Indexes for Flexible Top-k Retrieval
Compact Indexes for Flexible Top-k Retrieval Simon Gog Matthias Petri Institute of Theoretical Informatics, Karlsruhe Institute of Technology Computing and Information Systems, The University of Melbourne
More informationRank and Select Operations on Binary Strings (1974; Elias)
Rank and Select Operations on Binary Strings (1974; Elias) Naila Rahman, University of Leicester, www.cs.le.ac.uk/ nyr1 Rajeev Raman, University of Leicester, www.cs.le.ac.uk/ rraman entry editor: Paolo
More informationarxiv: v1 [cs.ds] 15 Feb 2012
Linear-Space Substring Range Counting over Polylogarithmic Alphabets Travis Gagie 1 and Pawe l Gawrychowski 2 1 Aalto University, Finland travis.gagie@aalto.fi 2 Max Planck Institute, Germany gawry@cs.uni.wroc.pl
More informationLecture 18 April 26, 2012
6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture 18 April 26, 2012 1 Overview In the last lecture we introduced the concept of implicit, succinct, and compact data structures, and
More informationarxiv: v1 [cs.ds] 25 Nov 2009
Alphabet Partitioning for Compressed Rank/Select with Applications Jérémy Barbay 1, Travis Gagie 1, Gonzalo Navarro 1 and Yakov Nekrich 2 1 Department of Computer Science University of Chile {jbarbay,
More informationCompressed Index for Dynamic Text
Compressed Index for Dynamic Text Wing-Kai Hon Tak-Wah Lam Kunihiko Sadakane Wing-Kin Sung Siu-Ming Yiu Abstract This paper investigates how to index a text which is subject to updates. The best solution
More informationSmaller and Faster Lempel-Ziv Indices
Smaller and Faster Lempel-Ziv Indices Diego Arroyuelo and Gonzalo Navarro Dept. of Computer Science, Universidad de Chile, Chile. {darroyue,gnavarro}@dcc.uchile.cl Abstract. Given a text T[1..u] over an
More informationarxiv: v1 [cs.ds] 22 Nov 2012
Faster Compact Top-k Document Retrieval Roberto Konow and Gonzalo Navarro Department of Computer Science, University of Chile {rkonow,gnavarro}@dcc.uchile.cl arxiv:1211.5353v1 [cs.ds] 22 Nov 2012 Abstract:
More informationIndexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile
Indexing LZ77: The Next Step in Self-Indexing Gonzalo Navarro Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl Part I: Why Jumping off the Cliff The Past Century Self-Indexing:
More informationSuccinct Data Structures for Text and Information Retrieval
Succinct Data Structures for Text and Information Retrieval Simon Gog 1 Matthias Petri 2 1 Institute of Theoretical Informatics Karslruhe Insitute of Technology 2 Computing and Information Systems The
More informationNew Lower and Upper Bounds for Representing Sequences
New Lower and Upper Bounds for Representing Sequences Djamal Belazzougui 1 and Gonzalo Navarro 2 1 LIAFA, Univ. Paris Diderot - Paris 7, France. dbelaz@liafa.jussieu.fr 2 Department of Computer Science,
More informationA Faster Grammar-Based Self-Index
A Faster Grammar-Based Self-Index Travis Gagie 1 Pawe l Gawrychowski 2 Juha Kärkkäinen 3 Yakov Nekrich 4 Simon Puglisi 5 Aalto University Max-Planck-Institute für Informatik University of Helsinki University
More informationCompact Data Strutures
(To compress is to Conquer) Compact Data Strutures Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 3rd KEYSTONE Training School Keyword search in Big Linked Data 23 TH AUGUST 2017 Agenda
More informationCompressed Representations of Sequences and Full-Text Indexes
Compressed Representations of Sequences and Full-Text Indexes PAOLO FERRAGINA Università di Pisa GIOVANNI MANZINI Università del Piemonte Orientale VELI MÄKINEN University of Helsinki AND GONZALO NAVARRO
More informationDynamic Entropy-Compressed Sequences and Full-Text Indexes
Dynamic Entropy-Compressed Sequences and Full-Text Indexes VELI MÄKINEN University of Helsinki and GONZALO NAVARRO University of Chile First author funded by the Academy of Finland under grant 108219.
More informationPractical Indexing of Repetitive Collections using Relative Lempel-Ziv
Practical Indexing of Repetitive Collections using Relative Lempel-Ziv Gonzalo Navarro and Víctor Sepúlveda CeBiB Center for Biotechnology and Bioengineering, Chile Department of Computer Science, University
More informationE D I C T The internal extent formula for compacted tries
E D C T The internal extent formula for compacted tries Paolo Boldi Sebastiano Vigna Università degli Studi di Milano, taly Abstract t is well known [Knu97, pages 399 4] that in a binary tree the external
More informationBreaking a Time-and-Space Barrier in Constructing Full-Text Indices
Breaking a Time-and-Space Barrier in Constructing Full-Text Indices Wing-Kai Hon Kunihiko Sadakane Wing-Kin Sung Abstract Suffix trees and suffix arrays are the most prominent full-text indices, and their
More informationAlphabet-Independent Compressed Text Indexing
Alphabet-Independent Compressed Text Indexing DJAMAL BELAZZOUGUI Université Paris Diderot GONZALO NAVARRO University of Chile Self-indexes are able to represent a text within asymptotically the information-theoretic
More informationOptimal Dynamic Sequence Representations
Optimal Dynamic Sequence Representations Gonzalo Navarro Yakov Nekrich Abstract We describe a data structure that supports access, rank and select queries, as well as symbol insertions and deletions, on
More informationEfficient Accessing and Searching in a Sequence of Numbers
Regular Paper Journal of Computing Science and Engineering, Vol. 9, No. 1, March 2015, pp. 1-8 Efficient Accessing and Searching in a Sequence of Numbers Jungjoo Seo and Myoungji Han Department of Computer
More informationSuccinct Data Structures for NLP-at-Scale
Succinct Data Structures for NLP-at-Scale Matthias Petri Trevor Cohn Computing and Information Systems The University of Melbourne, Australia first.last@unimelb.edu.au November 20, 2016 Who are we? Trevor
More informationCompressed Representations of Sequences and Full-Text Indexes
Compressed Representations of Sequences and Full-Text Indexes PAOLO FERRAGINA Dipartimento di Informatica, Università di Pisa, Italy GIOVANNI MANZINI Dipartimento di Informatica, Università del Piemonte
More informationComplementary Contextual Models with FM-index for DNA Compression
2017 Data Compression Conference Complementary Contextual Models with FM-index for DNA Compression Wenjing Fan,WenruiDai,YongLi, and Hongkai Xiong Department of Electronic Engineering Department of Biomedical
More informationCOMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES
COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES DO HUY HOANG (B.C.S. (Hons), NUS) A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE SCHOOL OF COMPUTING NATIONAL
More informationSuffix Sorting Algorithms
Suffix Sorting Algorithms Timo Bingmann Text-Indexierung Vorlesung 2016-12-01 INSTITUTE OF THEORETICAL INFORMATICS ALGORITHMICS KIT University of the State of Baden-Wuerttemberg and National Research Center
More informationAdvanced Data Structures
Simon Gog gog@kit.edu - Simon Gog: KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu Predecessor data structures We want to support
More informationOpportunistic Data Structures with Applications
Opportunistic Data Structures with Applications Paolo Ferragina Giovanni Manzini Abstract There is an upsurging interest in designing succinct data structures for basic searching problems (see [23] and
More informationData Compression Techniques
Data Compression Techniques Part 2: Text Compression Lecture 5: Context-Based Compression Juha Kärkkäinen 14.11.2017 1 / 19 Text Compression We will now look at techniques for text compression. These techniques
More informationSuccinct Suffix Arrays based on Run-Length Encoding
Succinct Suffix Arrays based on Run-Length Encoding Veli Mäkinen Gonzalo Navarro Abstract A succinct full-text self-index is a data structure built on a text T = t 1 t 2...t n, which takes little space
More informationCompressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching Roberto Grossi Dipartimento di Informatica Università di Pisa 56125 Pisa, Italy grossi@di.unipi.it Jeffrey
More informationString Range Matching
String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings
More informationarxiv: v1 [cs.ds] 8 Sep 2018
Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space Travis Gagie 1,2, Gonzalo Navarro 2,3, and Nicola Prezza 4 1 EIT, Diego Portales University, Chile 2 Center for Biotechnology
More informationarxiv:cs/ v1 [cs.it] 21 Nov 2006
On the space complexity of one-pass compression Travis Gagie Department of Computer Science University of Toronto travis@cs.toronto.edu arxiv:cs/0611099v1 [cs.it] 21 Nov 2006 STUDENT PAPER Abstract. We
More informationA Space-Efficient Frameworks for Top-k String Retrieval
A Space-Efficient Frameworks for Top-k String Retrieval Wing-Kai Hon, National Tsing Hua University Rahul Shah, Louisiana State University Sharma V. Thankachan, Louisiana State University Jeffrey Scott
More informationRead Mapping. Burrows Wheeler Transform and Reference Based Assembly. Genomics: Lecture #5 WS 2014/2015
Mapping Burrows Wheeler and Reference Based Assembly Institut für Medizinische Genetik und Humangenetik Charité Universitätsmedizin Berlin Genomics: Lecture #5 WS 2014/2015 Today Burrows Wheeler FM index
More informationAdvanced Text Indexing Techniques. Johannes Fischer
Advanced ext Indexing echniques Johannes Fischer SS 2009 1 Suffix rees, -Arrays and -rays 1.1 Recommended Reading Dan Gusfield: Algorithms on Strings, rees, and Sequences. 1997. ambridge University Press,
More informationOptimal-Time Text Indexing in BWT-runs Bounded Space
Optimal-Time Text Indexing in BWT-runs Bounded Space Travis Gagie Gonzalo Navarro Nicola Prezza Abstract Indexing highly repetitive texts such as genomic databases, software repositories and versioned
More informationSimple Compression Code Supporting Random Access and Fast String Matching
Simple Compression Code Supporting Random Access and Fast String Matching Kimmo Fredriksson and Fedor Nikitin Department of Computer Science and Statistics, University of Joensuu PO Box 111, FIN 80101
More informationBandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet)
Compression Motivation Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet) Storage: Store large & complex 3D models (e.g. 3D scanner
More informationData Compression Techniques
Data Compression Techniques Part 2: Text Compression Lecture 7: Burrows Wheeler Compression Juha Kärkkäinen 21.11.2017 1 / 16 Burrows Wheeler Transform The Burrows Wheeler transform (BWT) is a transformation
More informationSUCCINCT DATA STRUCTURES
SUCCINCT DATA STRUCTURES by Ankur Gupta Department of Computer Science Duke University Date: Approved: Jeffrey Scott Vitter, Supervisor Pankaj Agarwal Roberto Grossi Xiaobai Sun Dissertation submitted
More informationStronger Lempel-Ziv Based Compressed Text Indexing
Stronger Lempel-Ziv Based Compressed Text Indexing Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile.
More informationApproximate String Matching with Lempel-Ziv Compressed Indexes
Approximate String Matching with Lempel-Ziv Compressed Indexes Luís M. S. Russo 1, Gonzalo Navarro 2 and Arlindo L. Oliveira 1 1 INESC-ID, R. Alves Redol 9, 1000 LISBOA, PORTUGAL lsr@algos.inesc-id.pt,
More informationOptimal lower bounds for rank and select indexes
Optimal lower bounds for rank and select indexes Alexander Golynski David R. Cheriton School of Computer Science, University of Waterloo agolynski@cs.uwaterloo.ca Technical report CS-2006-03, Version:
More informationarxiv: v1 [cs.ds] 21 Nov 2012
The Rightmost Equal-Cost Position Problem arxiv:1211.5108v1 [cs.ds] 21 Nov 2012 Maxime Crochemore 1,3, Alessio Langiu 1 and Filippo Mignosi 2 1 King s College London, London, UK {Maxime.Crochemore,Alessio.Langiu}@kcl.ac.uk
More informationCOMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT
COMP9319 Web Data Compression and Search Lecture 2: daptive Huffman, BWT 1 Original readings Login to your cse account:! cd ~cs9319/papers! Original readings of each lecture will be placed there. 2 Course
More informationSmall-Space Dictionary Matching (Dissertation Proposal)
Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length
More informationApproximate String Matching with Ziv-Lempel Compressed Indexes
Approximate String Matching with Ziv-Lempel Compressed Indexes Luís M. S. Russo 1, Gonzalo Navarro 2, and Arlindo L. Oliveira 1 1 INESC-ID, R. Alves Redol 9, 1000 LISBOA, PORTUGAL lsr@algos.inesc-id.pt,
More informationAdvanced Data Structures
Simon Gog gog@kit.edu - 0 Simon Gog: KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu Dynamic Perfect Hashing What we want: O(1) lookup
More informationAdvanced Data Structures
Simon Gog gog@kit.edu - Simon Gog: KIT The Research University in the Helmholtz Association www.kit.edu Predecessor data structures We want to support the following operations on a set of integers from
More informationTheoretical aspects of ERa, the fastest practical suffix tree construction algorithm
Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm Matevž Jekovec University of Ljubljana Faculty of Computer and Information Science Oct 10, 2013 Text indexing problem
More informationCOMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT
COMP9319 Web Data Compression and Search Lecture 2: daptive Huffman, BWT 1 Original readings Login to your cse account: cd ~cs9319/papers Original readings of each lecture will be placed there. 2 Course
More informationLZ77-like Compression with Fast Random Access
-like Compression with Fast Random Access Sebastian Kreft and Gonzalo Navarro Dept. of Computer Science, University of Chile, Santiago, Chile {skreft,gnavarro}@dcc.uchile.cl Abstract We introduce an alternative
More informationForbidden Patterns. {vmakinen leena.salmela
Forbidden Patterns Johannes Fischer 1,, Travis Gagie 2,, Tsvi Kopelowitz 3, Moshe Lewenstein 4, Veli Mäkinen 5,, Leena Salmela 5,, and Niko Välimäki 5, 1 KIT, Karlsruhe, Germany, johannes.fischer@kit.edu
More informationThe Burrows-Wheeler Transform: Theory and Practice
The Burrows-Wheeler Transform: Theory and Practice Giovanni Manzini 1,2 1 Dipartimento di Scienze e Tecnologie Avanzate, Università del Piemonte Orientale Amedeo Avogadro, I-15100 Alessandria, Italy. 2
More information2. Exact String Matching
2. Exact String Matching Let T = T [0..n) be the text and P = P [0..m) the pattern. We say that P occurs in T at position j if T [j..j + m) = P. Example: P = aine occurs at position 6 in T = karjalainen.
More informationString Searching with Ranking Constraints and Uncertainty
Louisiana State University LSU Digital Commons LSU Doctoral Dissertations Graduate School 2015 String Searching with Ranking Constraints and Uncertainty Sudip Biswas Louisiana State University and Agricultural
More informationRun-length & Entropy Coding. Redundancy Removal. Sampling. Quantization. Perform inverse operations at the receiver EEE
General e Image Coder Structure Motion Video x(s 1,s 2,t) or x(s 1,s 2 ) Natural Image Sampling A form of data compression; usually lossless, but can be lossy Redundancy Removal Lossless compression: predictive
More informationFast Fully-Compressed Suffix Trees
Fast Fully-Compressed Suffix Trees Gonzalo Navarro Department of Computer Science University of Chile, Chile gnavarro@dcc.uchile.cl Luís M. S. Russo INESC-ID / Instituto Superior Técnico Technical University
More informationA Four-Stage Algorithm for Updating a Burrows-Wheeler Transform
A Four-Stage Algorithm for Updating a Burrows-Wheeler ransform M. Salson a,1,. Lecroq a, M. Léonard a, L. Mouchard a,b, a Université de Rouen, LIIS EA 4108, 76821 Mont Saint Aignan, France b Algorithm
More informationThis article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and
This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution
More informationJumbled String Matching: Motivations, Variants, Algorithms
Jumbled String Matching: Motivations, Variants, Algorithms Zsuzsanna Lipták University of Verona (Italy) Workshop Combinatorial structures for sequence analysis in bioinformatics Milano-Bicocca, 27 Nov
Efficient Fully-Compressed Sequence Representations
Algorithmica (2014) 69:232–268 DOI 10.1007/s00453-012-9726-3 Efficient Fully-Compressed Sequence Representations Jérémy Barbay Francisco Claude Travis Gagie Gonzalo Navarro Yakov Nekrich Received: 4 February
An Algorithmic Framework for Compression and Text Indexing
An Algorithmic Framework for Compression and Text Indexing Roberto Grossi Ankur Gupta Jeffrey Scott Vitter Abstract We present a unified algorithmic framework to obtain nearly optimal space bounds for
Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1
Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Outline and Reading Strings ( 9.1.1) Pattern matching algorithms Brute-force algorithm ( 9.1.2) Boyer-Moore algorithm ( 9.1.3) Knuth-Morris-Pratt
A Simpler Analysis of Burrows-Wheeler Based Compression
A Simpler Analysis of Burrows-Wheeler Based Compression Haim Kaplan School of Computer Science, Tel Aviv University, Tel Aviv, Israel; email: haimk@post.tau.ac.il Shir Landau School of Computer Science,
Text Indexing, Suffix Sorting & Data Compression: Common Problems and Techniques
Text Indexing, Suffix Sorting & Data Compression: Common Problems and Techniques Roberto Grossi Dipartimento di Informatica Università di Pisa Roadmap Space efficiency issues Short stories: suffix array permutations
Reducing the Space Requirement of LZ-Index
Reducing the Space Requirement of LZ-Index Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile {darroyue, gnavarro}@dcc.uchile.cl 2 Dept. of
A Faster Grammar-Based Self-index
A Faster Grammar-Based Self-index Travis Gagie 1, Paweł Gawrychowski 2, Juha Kärkkäinen 3, Yakov Nekrich 4, and Simon J. Puglisi 5 1 Aalto University, Finland 2 University of Wrocław, Poland 3 University
Self-Indexed Grammar-Based Compression
Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca
Data Compression Techniques
Data Compression Techniques Part 1: Entropy Coding Lecture 4: Asymmetric Numeral Systems Juha Kärkkäinen 08.11.2017 1 / 19 Asymmetric Numeral Systems Asymmetric numeral systems (ANS) is a recent entropy
EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have
EECS 229A Spring 2007 * * Solutions to Homework 3 1. Problem 4.11 on pg. 93 of the text. Stationary processes (a) By stationarity and the chain rule for entropy, we have H(X 0 ) + H(X n X 0 ) = H(X 0,
On Compressing and Indexing Repetitive Sequences
On Compressing and Indexing Repetitive Sequences Sebastian Kreft a,1,2, Gonzalo Navarro a,2 a Department of Computer Science, University of Chile Abstract We introduce LZ-End, a new member of the Lempel-Ziv
Journal of Discrete Algorithms
Journal of Discrete Algorithms 18 (2013) 100–112 Contents lists available at SciVerse ScienceDirect Journal of Discrete Algorithms www.elsevier.com/locate/jda ESP-index: A compressed index based on edit-sensitive
arxiv: v2 [cs.ds] 6 Jul 2015
Online Self-Indexed Grammar Compression Yoshimasa Takabatake 1, Yasuo Tabei 2, and Hiroshi Sakamoto 1 1 Kyushu Institute of Technology {takabatake,hiroshi}@donald.ai.kyutech.ac.jp 2 PRESTO, Japan Science
Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria
Source Coding Master Universitario en Ingeniería de Telecomunicación I. Santamaría Universidad de Cantabria Contents Introduction Asymptotic Equipartition Property Optimal Codes (Huffman Coding) Universal
Lecture 1: Data Compression and Entropy
CPS290: Algorithmic Foundations of Data Science January 8, 2017 Lecture 1: Data Compression and Entropy Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will study a simple model for
Multiple Pattern Matching
Multiple Pattern Matching Stephen Fulwider and Amar Mukherjee College of Engineering and Computer Science University of Central Florida Orlando, FL USA Email: {stephen,amar}@cs.ucf.edu Abstract In this
CSEP 590 Data Compression Autumn Arithmetic Coding
CSEP 590 Data Compression Autumn 2007 Arithmetic Coding Reals in Binary Any real number x in the interval [0,1) can be represented in binary as.b 1 b 2... where b i is a bit. x 0 0 1 0 1... binary representation
Section Summary. Sequences. Recurrence Relations. Summations. Examples: Geometric Progression, Arithmetic Progression. Example: Fibonacci Sequence
Section 2.4 Section Summary Sequences. Examples: Geometric Progression, Arithmetic Progression Recurrence Relations Example: Fibonacci Sequence Summations Introduction Sequences are ordered lists of elements.
Burrows-Wheeler Transforms in Linear Time and Linear Bits
Burrows-Wheeler Transforms in Linear Time and Linear Bits Russ Cox (following Hon, Sadakane, Sung, Farach, and others) 18.417 Final Project BWT in Linear Time and Linear Bits Three main parts to the result.
A Repetitive Corpus Testbed
Chapter 3 A Repetitive Corpus Testbed In this chapter we present a corpus of repetitive texts. These texts are categorized according to the source they come from into the following: Artificial Texts, Pseudo-
Università degli studi di Udine
Università degli studi di Udine Computing LZ77 in Run-Compressed Space This is a pre print version of the following article: Original Computing LZ77 in Run-Compressed Space / Policriti, Alberto; Prezza,
Lecture 4: Adaptive source coding algorithms
Lecture 4 : Adaptive source coding algorithms February 2, 28 Information Theory Outline 1. Motivation ; 2. adaptive Huffman encoding ; 3. Gallager and Knuth s method ; 4. Dictionary methods : Lempel-Ziv
Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes
Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes Enno Ohlebusch, Simon Gog, and Adrian Kügel Institute of Theoretical Computer Science, University of Ulm, D-89069
Chapter 2 Date Compression: Source Coding. 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code
Chapter 2 Date Compression: Source Coding 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code 2.1 An Introduction to Source Coding Source coding can be seen as an efficient way
More Haste, Less Waste: Lowering the Redundancy in Fully Indexable Dictionaries
More Haste, Less Waste: Lowering the Redundancy in Fully Indexable Dictionaries Roberto Grossi, Alessio Orlandi, Rajeev Raman, S. Srinivasa Rao To cite this version: Roberto Grossi, Alessio Orlandi, Rajeev
Converting SLP to LZ78 in almost Linear Time
CPM 2013 Converting SLP to LZ78 in almost Linear Time Hideo Bannai 1, Paweł Gawrychowski 2, Shunsuke Inenaga 1, Masayuki Takeda 1 1. Kyushu University 2. Max-Planck-Institut für Informatik Recompress SLP
Succinct 2D Dictionary Matching with No Slowdown
Succinct 2D Dictionary Matching with No Slowdown Shoshana Neuburger and Dina Sokol City University of New York Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d
Inverting the Burrows-Wheeler Transform
FUNCTIONAL PEARL Inverting the Burrows-Wheeler Transform Richard Bird and Shin-Cheng Mu 1 Programming Research Group, Oxford University Wolfson Building, Parks Road, Oxford, OX1 3QD, UK Abstract The objective
Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts
Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Domenico Cantone Simone Faro Emanuele Giaquinta Department of Mathematics and Computer Science, University of Catania, Italy 1 /
A simpler analysis of Burrows Wheeler-based compression
Theoretical Computer Science 387 (2007) 220–235 www.elsevier.com/locate/tcs A simpler analysis of Burrows Wheeler-based compression Haim Kaplan, Shir Landau, Elad Verbin School of Computer Science, Tel