Preview: Text Indexing


Simon Gog, KIT (University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association)

Text Indexing: Motivation

Problem: Given a text T of length n over an alphabet Σ of size σ and a pattern P of length m, we want to answer the following questions efficiently:
- Does P occur in T? (existence query)
- How often does P occur in T? (count query)
- Where does P occur in T? (locate query)

Two scenarios:
- Scan the whole text T for each query (time complexity O(n + m)).
- Build an index for T once, then use the index to answer each query (time complexity not linearly dependent on n, or even independent of n).

Text Indexing: Motivation (index solution)

The suffix array (SA) was already explained in this lecture.
- Existence/count time: O(m log n)
- Locate time: O(m log n + occ), where occ is the number of occurrences.

Drawbacks:
- SA is larger than T: n log n bits compared to n log σ bits. For σ = 256 and n = 2^32 this is a factor of 4.
- SA requires T to answer queries.

Other classical indexes: suffix trees (ST), String B-trees (SBT), and uncompressed positional inverted indexes (PII).

Text Indexing: Suffix Array

Example text: T = abracadabrabarbara$ (n = 19)

  i   SA[i]  T[SA[i]..n-1]
  0    18    $
  1    17    a$
  2    10    abarbara$
  3     7    abrabarbara$
  4     0    abracadabrabarbara$
  5     3    acadabrabarbara$
  6     5    adabrabarbara$
  7    15    ara$
  8    12    arbara$
  9    14    bara$
 10    11    barbara$
 11     8    brabarbara$
 12     1    bracadabrabarbara$
 13     4    cadabrabarbara$
 14     6    dabrabarbara$
 15    16    ra$
 16     9    rabarbara$
 17     2    racadabrabarbara$
 18    13    rbara$

T[SA[i]..n-1] < T[SA[i+1]..n-1], i.e. SA[i] contains the starting position of the i-th lexicographically smallest suffix of T.

Matching algorithm: binary search (forward, left-to-right).
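The suffix array table and the binary-search matching can be reproduced with a short sketch. This is only an illustration: the helper names build_suffix_array and sa_interval are made up here, and the construction simply sorts all suffixes, which is far slower than the construction algorithms a real index would use.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they start."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, pattern):
    """Return the half-open range [lo, hi) of SA entries whose suffix starts with pattern."""
    m = len(pattern)

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


T = "abracadabrabarbara$"
SA = build_suffix_array(T)
lo, hi = sa_interval(T, SA, "bar")
print("count:", hi - lo, "positions:", sorted(SA[lo:hi]))
```

Each of the O(log n) comparisons inspects up to m characters, which is where the O(m log n) bound comes from; the text T itself is needed for every comparison.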

Compressed Text Indexes: The FM-Index

Ferragina and Manzini [2000]
- Index based on the Burrows-Wheeler Transform (BWT)
- Matching algorithm works backwards (right-to-left)
- Existence and count queries in time O(m log σ)

BWT
- BWT[i] = T[(SA[i] - 1) mod n]
- Uncompressed size: n log σ bits
- Compressed size: nH_k(T) bits (+ information for contexts of length k)

  i   SA[i]  BWT[i]  T[SA[i]..n-1]
  0    18    a       $
  1    17    r       a$
  2    10    r       abarbara$
  3     7    d       abrabarbara$
  4     0    $       abracadabrabarbara$
  5     3    r       acadabrabarbara$
  6     5    c       adabrabarbara$
  7    15    b       ara$
  8    12    b       arbara$
  9    14    r       bara$
 10    11    a       barbara$
 11     8    a       brabarbara$
 12     1    a       bracadabrabarbara$
 13     4    a       cadabrabarbara$
 14     6    a       dabrabarbara$
 15    16    a       ra$
 16     9    b       rabarbara$
 17     2    b       racadabrabarbara$
 18    13    a       rbara$

BWT[i] = T[SA[i] - 1], for SA[i] > 0
BWT[i] = T[n - 1], for SA[i] = 0

I.e. BWT[i] is the character preceding suffix SA[i].
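As a small check of the definition, the BWT column can be derived directly from the suffix array. This is a sketch with a hypothetical helper name (bwt_from_sa) and a naive sorting-based suffix array, not how a production library builds the transform.

```python
def bwt_from_sa(text, sa):
    """BWT[i] is the character preceding suffix SA[i]; position 0 wraps to T[n-1]."""
    n = len(text)
    return "".join(text[(pos - 1) % n] for pos in sa)


T = "abracadabrabarbara$"
SA = sorted(range(len(T)), key=lambda i: T[i:])  # naive suffix array
BWT = bwt_from_sa(T, SA)
print(BWT)  # arrd$rcbbraaaaaabba
```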

The columns BWT[i] and T[SA[i]..n-1] are the same as in the previous table.

Array C contains, for each c ∈ Σ, the position of the first suffix in SA which starts with c:

 c      $   a   b   c   d   r
 C[c]   0   1   9  13  14  15

Operation rank(i, X, BWT) returns how often character X ∈ Σ occurs in the prefix BWT[0..i-1].

Example: search for P = bar.
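The two ingredients just defined can be sketched in a few lines. The function names build_c and rank are illustrative, and rank here is a plain linear scan used only to make the definition concrete; it is not an efficient rank structure.

```python
from collections import Counter


def build_c(bwt):
    """C[c] = number of characters in the text that are strictly smaller than c."""
    counts = Counter(bwt)
    c_array, total = {}, 0
    for ch in sorted(counts):
        c_array[ch] = total
        total += counts[ch]
    return c_array


def rank(i, x, bwt):
    """Number of occurrences of x in the prefix bwt[0..i-1] (linear scan)."""
    return bwt[:i].count(x)


BWT = "arrd$rcbbraaaaaabba"
C = build_c(BWT)
print(C)
print(rank(19, "r", BWT))
```

Because the BWT is a permutation of T, counting characters in the BWT gives the same C array as counting them in T.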

Search backwards for bar (BWT table as before; C[$] = 0, C[a] = 1, C[b] = 9, C[c] = 13, C[d] = 14, C[r] = 15).

Initial interval: [sp_0, ep_0] = [0..n-1] = [0..18]

Determine interval for r:
 sp_1 = C[r] + rank(sp_0, r, BWT)         = 15 + rank(0, r, BWT)      = 15 + 0     = 15
 ep_1 = C[r] + rank(ep_0 + 1, r, BWT) - 1 = 15 + rank(19, r, BWT) - 1 = 15 + 4 - 1 = 18

Search backwards for bar (continued).

Interval: [sp_1, ep_1] = [15..18]

Determine interval for ar:
 sp_2 = C[a] + rank(sp_1, a, BWT)         = 1 + rank(15, a, BWT)     = 1 + 6     = 7
 ep_2 = C[a] + rank(ep_1 + 1, a, BWT) - 1 = 1 + rank(19, a, BWT) - 1 = 1 + 8 - 1 = 8

Search backwards for bar (continued).

Interval: [sp_2, ep_2] = [7..8]

Determine interval for bar:
 sp_3 = C[b] + rank(sp_2, b, BWT)         = 9 + rank(7, b, BWT)     = 9 + 0     = 9
 ep_3 = C[b] + rank(ep_2 + 1, b, BWT) - 1 = 9 + rank(9, b, BWT) - 1 = 9 + 2 - 1 = 10

The final interval [9..10] covers the suffixes bara$ and barbara$, so bar occurs twice in T.
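The three steps above can be condensed into one loop. The following is a minimal sketch, assuming a linear-scan rank and a dictionary for C; the function name backward_search is chosen here for illustration only.

```python
from collections import Counter


def build_c(bwt):
    """C[c] = number of characters in the text strictly smaller than c."""
    counts = Counter(bwt)
    c_array, total = {}, 0
    for ch in sorted(counts):
        c_array[ch] = total
        total += counts[ch]
    return c_array


def backward_search(pattern, bwt, c_array):
    """Return the inclusive SA interval (sp, ep) of pattern, or None if it does not occur."""
    sp, ep = 0, len(bwt) - 1
    for ch in reversed(pattern):               # process the pattern right-to-left
        if ch not in c_array:
            return None
        sp = c_array[ch] + bwt[:sp].count(ch)          # C[c] + rank(sp, c, BWT)
        ep = c_array[ch] + bwt[:ep + 1].count(ch) - 1  # C[c] + rank(ep + 1, c, BWT) - 1
        if sp > ep:
            return None
    return sp, ep


BWT = "arrd$rcbbraaaaaabba"
C = build_c(BWT)
print(backward_search("bar", BWT, C))  # (9, 10)
```

The number of occurrences is ep - sp + 1. Note that neither T nor SA is touched during the search; only C and rank on the BWT are used.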

Summary

Only C and a data structure R supporting the rank operation on BWT are required for existence and count queries.
- Space: σ log n bits for C + space for R
- Time: O(m · t_rank), where t_rank is the time for one rank operation. Independent of n? Yes, if t_rank is independent of n.

Next: How to implement rank?

Rank operation
- Constant time and o(n) extra space solution on bitvectors (Jacobson [1989])
- Solution on general sequences: wavelet tree (Grossi et al. [2003])
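To give an idea of how rank on a bitvector can avoid scanning from the start, here is a simplified sketch with one level of precomputed block counts. It is not Jacobson's structure: that scheme uses two levels of counters plus lookup tables (or a popcount instruction) to reach constant time with o(n) extra bits. The class name RankBitvector and the block size are assumptions made for this illustration.

```python
class RankBitvector:
    """Bitvector with sampled prefix counts: rank scans at most one block."""

    def __init__(self, bits, block=8):
        self.bits = list(bits)
        self.block = block
        # samples[q] = number of 1-bits in bits[0 .. q*block - 1]
        self.samples = [0]
        for start in range(0, len(self.bits), block):
            self.samples.append(self.samples[-1] + sum(self.bits[start:start + block]))

    def rank1(self, i):
        """Number of 1-bits in bits[0..i-1]."""
        q, r = divmod(i, self.block)
        begin = q * self.block
        return self.samples[q] + sum(self.bits[begin:begin + r])

    def rank0(self, i):
        """Number of 0-bits in bits[0..i-1]."""
        return i - self.rank1(i)


bv = RankBitvector([int(b) for b in "0111011001000000000"])
print(bv.rank1(11), bv.rank0(11))
```

The query cost depends on the block size rather than on n, which is the property the summary asks for.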

Wavelet Tree

Example: Calculate Rank

Each node is named by its bit path from the root; b_v is the bitvector stored at node v.

 Root (ε):  arrd$rcbbraaaaaabba
 Level 1:   0: a$bbaaaaaabba        1: rrdrcr
 Level 2:   00: a$aaaaaaa           01: bbbb        10: dc        11: rrrr
 Level 3:   000: $    001: aaaaaaaa                 100: c    101: d

a = 001

rank(11, a, WT) = rank(rank(rank(11, 0, b_ε) = 5, 0, b_0) = 3, 1, b_00) = 2
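The rank computation in this example can be traced with a small pointer-based wavelet tree. This is a sketch under simplifying assumptions: the class name WaveletTree is hypothetical, the alphabet is split in halves (which happens to give the shape shown above), and the bitvector ranks are plain sums instead of constant-time rank structures.

```python
class WaveletTree:
    """Balanced wavelet tree; each inner node stores one bit per character."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        self.bits = []
        if len(self.alphabet) > 1:
            mid = (len(self.alphabet) + 1) // 2
            self.left_syms = set(self.alphabet[:mid])
            # 0 = character goes to the left child, 1 = to the right child
            self.bits = [0 if ch in self.left_syms else 1 for ch in seq]
            self.left = WaveletTree(
                [ch for ch in seq if ch in self.left_syms], self.alphabet[:mid])
            self.right = WaveletTree(
                [ch for ch in seq if ch not in self.left_syms], self.alphabet[mid:])

    def rank(self, i, x):
        """Occurrences of x among the first i characters of the sequence."""
        if x not in self.alphabet:
            return 0
        if self.left is None:            # leaf: every character here equals x
            return i
        ones = sum(self.bits[:i])        # rank(i, 1, b_v); a real index does this in O(1)
        if x in self.left_syms:
            return self.left.rank(i - ones, x)   # continue with rank(i, 0, b_v)
        return self.right.rank(ones, x)


wt = WaveletTree("arrd$rcbbraaaaaabba")
print(wt.rank(11, "a"))  # 2
```

One rank query visits one node per level, so with a balanced shape it needs O(log σ) bitvector ranks.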

Compressed Text Indexing meets Algorithm Engineering

State of the art: Recent FM-indexes are as small as the output of state-of-the-art compressors (like gzip, xz), while matching takes microseconds per character.

This is the result of theoretical and practical improvements:
- Shape of the WT (balanced, Huffman, Hu-Tucker, ...)
- Bitvector representation (uncompressed/compressed)
- Hardware (popcount instruction, page size)
- Different sampling strategies for SA values (for locate queries)
- ...

Compressed Text Indexing

Combining an H_0-compressed bitvector with a Huffman-shaped wavelet tree results in H_k(T) bits of space.

[Table: time (µs) and space (%) of WT-HUFF and WT-HUFFcompr on the 200 MB test instances DBLP.XML, DNA, ENGLISH, PROTEINS, and SOURCES; the measured values are not preserved in this transcript.]

Compressed Text Indexing: Our Toolbox for Compact/Succinct Data Structures

Succinct Data Structure Library (SDSL)
- A C++ template library for compact/succinct structures
- Parametrizable structures
- Bitvectors
- Compressed integer vectors
- Rank/select structures
- Wavelet trees/wavelet matrices
- Compressed suffix arrays/trees
- Search engines

Available at: [link not preserved in this transcript]

Lecture Text Indexing: Content

Theory:
- Classical indexes (suffix arrays/suffix trees/inverted indexes)
- Building blocks for compact/succinct structures: compressed bitvectors, rank structures, select structures, range-min-max tree
- Compressed indexes: FM-indexes/compressed suffix arrays, versions for highly repetitive text, compressed suffix trees
- Search engines

Practice:
- Use SDSL to implement and analyze structures.
- Design a code search engine.

Bibliography

Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS 2000), 2000.

Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2003), pages 841-850, 2003.

Guy Jacobson. Space-efficient static trees and graphs. In FOCS, 1989.

H_k of selected Pizza&Chili 200 MB test cases

[Table: H_k(T) and contexts/|T| in percent for increasing k, for DBLP.XML, DNA, ENGLISH, and PROTEINS; the values are not preserved in this transcript.]


More information

Run-length & Entropy Coding. Redundancy Removal. Sampling. Quantization. Perform inverse operations at the receiver EEE

Run-length & Entropy Coding. Redundancy Removal. Sampling. Quantization. Perform inverse operations at the receiver EEE General e Image Coder Structure Motion Video x(s 1,s 2,t) or x(s 1,s 2 ) Natural Image Sampling A form of data compression; usually lossless, but can be lossy Redundancy Removal Lossless compression: predictive

More information

Fast Fully-Compressed Suffix Trees

Fast Fully-Compressed Suffix Trees Fast Fully-Compressed Suffix Trees Gonzalo Navarro Department of Computer Science University of Chile, Chile gnavarro@dcc.uchile.cl Luís M. S. Russo INESC-ID / Instituto Superior Técnico Technical University

More information

A Four-Stage Algorithm for Updating a Burrows-Wheeler Transform

A Four-Stage Algorithm for Updating a Burrows-Wheeler Transform A Four-Stage Algorithm for Updating a Burrows-Wheeler ransform M. Salson a,1,. Lecroq a, M. Léonard a, L. Mouchard a,b, a Université de Rouen, LIIS EA 4108, 76821 Mont Saint Aignan, France b Algorithm

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution

More information

Jumbled String Matching: Motivations, Variants, Algorithms

Jumbled String Matching: Motivations, Variants, Algorithms Jumbled String Matching: Motivations, Variants, Algorithms Zsuzsanna Lipták University of Verona (Italy) Workshop Combinatorial structures for sequence analysis in bioinformatics Milano-Bicocca, 27 Nov

More information

Efficient Fully-Compressed Sequence Representations

Efficient Fully-Compressed Sequence Representations Algorithmica (2014) 69:232 268 DOI 10.1007/s00453-012-9726-3 Efficient Fully-Compressed Sequence Representations Jérémy Barbay Francisco Claude Travis Gagie Gonzalo Navarro Yakov Nekrich Received: 4 February

More information

An Algorithmic Framework for Compression and Text Indexing

An Algorithmic Framework for Compression and Text Indexing An Algorithmic Framework for Compression and Text Indexing Roberto Grossi Ankur Gupta Jeffrey Scott Vitter Abstract We present a unified algorithmic framework to obtain nearly optimal space bounds for

More information

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1

Pattern Matching. a b a c a a b. a b a c a b. a b a c a b. Pattern Matching 1 Pattern Matching a b a c a a b 1 4 3 2 Pattern Matching 1 Outline and Reading Strings ( 9.1.1) Pattern matching algorithms Brute-force algorithm ( 9.1.2) Boyer-Moore algorithm ( 9.1.3) Knuth-Morris-Pratt

More information

A Simpler Analysis of Burrows-Wheeler Based Compression

A Simpler Analysis of Burrows-Wheeler Based Compression A Simpler Analysis of Burrows-Wheeler Based Compression Haim Kaplan School of Computer Science, Tel Aviv University, Tel Aviv, Israel; email: haimk@post.tau.ac.il Shir Landau School of Computer Science,

More information

Text Indexing, Suffix Sorting & Data Compression: Common Problems and Techniques

Text Indexing, Suffix Sorting & Data Compression: Common Problems and Techniques Text Indexing, uffix orting & Data Compression: Common Problems and Techniques Roberto Grossi Dipartimento di Informatica Università di Pisa Roadmap pace efficiency issues hort stories: suffix array permutations

More information

Reducing the Space Requirement of LZ-Index

Reducing the Space Requirement of LZ-Index Reducing the Space Requirement of LZ-Index Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile {darroyue, gnavarro}@dcc.uchile.cl 2 Dept. of

More information

A Faster Grammar-Based Self-index

A Faster Grammar-Based Self-index A Faster Grammar-Based Self-index Travis Gagie 1,Pawe l Gawrychowski 2,, Juha Kärkkäinen 3, Yakov Nekrich 4, and Simon J. Puglisi 5 1 Aalto University, Finland 2 University of Wroc law, Poland 3 University

More information

Self-Indexed Grammar-Based Compression

Self-Indexed Grammar-Based Compression Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca

More information

Data Compression Techniques

Data Compression Techniques Data Compression Techniques Part 1: Entropy Coding Lecture 4: Asymmetric Numeral Systems Juha Kärkkäinen 08.11.2017 1 / 19 Asymmetric Numeral Systems Asymmetric numeral systems (ANS) is a recent entropy

More information

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have EECS 229A Spring 2007 * * Solutions to Homework 3 1. Problem 4.11 on pg. 93 of the text. Stationary processes (a) By stationarity and the chain rule for entropy, we have H(X 0 ) + H(X n X 0 ) = H(X 0,

More information

On Compressing and Indexing Repetitive Sequences

On Compressing and Indexing Repetitive Sequences On Compressing and Indexing Repetitive Sequences Sebastian Kreft a,1,2, Gonzalo Navarro a,2 a Department of Computer Science, University of Chile Abstract We introduce LZ-End, a new member of the Lempel-Ziv

More information

Journal of Discrete Algorithms

Journal of Discrete Algorithms Journal of Discrete Algorithms 18 (2013) 100 112 Contents lists available at SciVerse ScienceDirect Journal of Discrete Algorithms www.elsevier.com/locate/jda ESP-index: A compressed index based on edit-sensitive

More information

arxiv: v2 [cs.ds] 6 Jul 2015

arxiv: v2 [cs.ds] 6 Jul 2015 Online Self-Indexed Grammar Compression Yoshimasa Takabatake 1, Yasuo Tabei 2, and Hiroshi Sakamoto 1 1 Kyushu Institute of Technology {takabatake,hiroshi}@donald.ai.kyutech.ac.jp 2 PRESTO, Japan Science

More information

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria Source Coding Master Universitario en Ingeniería de Telecomunicación I. Santamaría Universidad de Cantabria Contents Introduction Asymptotic Equipartition Property Optimal Codes (Huffman Coding) Universal

More information

Lecture 1 : Data Compression and Entropy

Lecture 1 : Data Compression and Entropy CPS290: Algorithmic Foundations of Data Science January 8, 207 Lecture : Data Compression and Entropy Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will study a simple model for

More information

Multiple Pattern Matching

Multiple Pattern Matching Multiple Pattern Matching Stephen Fulwider and Amar Mukherjee College of Engineering and Computer Science University of Central Florida Orlando, FL USA Email: {stephen,amar}@cs.ucf.edu Abstract In this

More information

CSEP 590 Data Compression Autumn Arithmetic Coding

CSEP 590 Data Compression Autumn Arithmetic Coding CSEP 590 Data Compression Autumn 2007 Arithmetic Coding Reals in Binary Any real number x in the interval [0,1) can be represented in binary as.b 1 b 2... where b i is a bit. x 0 0 1 0 1... binary representation

More information

Section Summary. Sequences. Recurrence Relations. Summations. Examples: Geometric Progression, Arithmetic Progression. Example: Fibonacci Sequence

Section Summary. Sequences. Recurrence Relations. Summations. Examples: Geometric Progression, Arithmetic Progression. Example: Fibonacci Sequence Section 2.4 Section Summary Sequences. Examples: Geometric Progression, Arithmetic Progression Recurrence Relations Example: Fibonacci Sequence Summations Introduction Sequences are ordered lists of elements.

More information

Burrows-Wheeler Transforms in Linear Time and Linear Bits

Burrows-Wheeler Transforms in Linear Time and Linear Bits Burrows-Wheeler Transforms in Linear Time and Linear Bits Russ Cox (following Hon, Sadakane, Sung, Farach, and others) 18.417 Final Project BWT in Linear Time and Linear Bits Three main parts to the result.

More information

A Repetitive Corpus Testbed

A Repetitive Corpus Testbed Chapter 3 A Repetitive Corpus Testbed In this chapter we present a corpus of repetitive texts. These texts are categorized according to the source they come from into the following: Artificial Texts, Pseudo-

More information

Università degli studi di Udine

Università degli studi di Udine Università degli studi di Udine Computing LZ77 in Run-Compressed Space This is a pre print version of the following article: Original Computing LZ77 in Run-Compressed Space / Policriti, Alberto; Prezza,

More information

Lecture 4 : Adaptive source coding algorithms

Lecture 4 : Adaptive source coding algorithms Lecture 4 : Adaptive source coding algorithms February 2, 28 Information Theory Outline 1. Motivation ; 2. adaptive Huffman encoding ; 3. Gallager and Knuth s method ; 4. Dictionary methods : Lempel-Ziv

More information

Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes

Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes Computing Matching Statistics and Maximal Exact Matches on Compressed Full-Text Indexes Enno Ohlebusch, Simon Gog, and Adrian Kügel Institute of Theoretical Computer Science, University of Ulm, D-89069

More information

Self-Indexed Grammar-Based Compression

Self-Indexed Grammar-Based Compression Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca

More information

Chapter 2 Date Compression: Source Coding. 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code

Chapter 2 Date Compression: Source Coding. 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code Chapter 2 Date Compression: Source Coding 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code 2.1 An Introduction to Source Coding Source coding can be seen as an efficient way

More information

More Haste, Less Waste: Lowering the Redundancy in Fully Indexable Dictionaries

More Haste, Less Waste: Lowering the Redundancy in Fully Indexable Dictionaries More Haste, Less Waste: Lowering the Redundancy in Fully Indexable Dictionaries Roberto Grossi, Alessio Orlandi, Rajeev Raman, S. Srinivasa Rao To cite this version: Roberto Grossi, Alessio Orlandi, Rajeev

More information

Converting SLP to LZ78 in almost Linear Time

Converting SLP to LZ78 in almost Linear Time CPM 2013 Converting SLP to LZ78 in almost Linear Time Hideo Bannai 1, Paweł Gawrychowski 2, Shunsuke Inenaga 1, Masayuki Takeda 1 1. Kyushu University 2. Max-Planck-Institut für Informatik Recompress SLP

More information

Succinct 2D Dictionary Matching with No Slowdown

Succinct 2D Dictionary Matching with No Slowdown Succinct 2D Dictionary Matching with No Slowdown Shoshana Neuburger and Dina Sokol City University of New York Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d

More information

Inverting the Burrows-Wheeler Transform

Inverting the Burrows-Wheeler Transform FUNCTIONAL PEARL Inverting the Burrows-Wheeler Transform Richard Bird and Shin-Cheng Mu 1 Programming Research Group, Oxford University Wolfson Building, Parks Road, Oxford, OX1 3QD, UK Abstract The objective

More information

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Domenico Cantone Simone Faro Emanuele Giaquinta Department of Mathematics and Computer Science, University of Catania, Italy 1 /

More information

A simpler analysis of Burrows Wheeler-based compression

A simpler analysis of Burrows Wheeler-based compression Theoretical Computer Science 387 (2007) 220 235 www.elsevier.com/locate/tcs A simpler analysis of Burrows Wheeler-based compression Haim Kaplan, Shir Landau, Elad Verbin School of Computer Science, Tel

More information