Alphabet Friendly FM Index

Alphabet Friendly FM Index. Author: Rodrigo González. Santiago, November 8th, 2005. Departamento de Ciencias de la Computación, Universidad de Chile.

Outline: Motivations, Basics, Burrows Wheeler Transform, FM Index, Compression boosting, Wavelet tree, Alphabet Friendly FM Index, Experimental results, Conclusions.

Motivations In pattern matching, the main problem is to determine the number of occurrences of a pattern P in a text T, where text and pattern are sequences of characters over an alphabet Σ of size σ. Many indexing data structures have been proposed for this purpose, such as inverted lists, suffix trees, suffix arrays and succinct indexes. The amount of textual information keeps growing, hence the importance of succinct indexing data structures for pattern matching. This has motivated the search for less space-demanding indexes, in particular indexes using the same space as the original text or less. One of the most recent succinct indexes is the Alphabet Friendly FM Index, from now on AF FMI. http://pizzachili.dcc.uchile.cl/

Basics The suffix array of a text T is an array A that contains all the starting positions of the suffixes of T, ordered so that T[A[1]..n] < T[A[2]..n] < ... < T[A[n]..n]; that is, A gives the lexicographic order of all suffixes of T. The empirical entropy H_k resembles the entropy defined in the probabilistic setting. However, the empirical entropy is defined for any string and can be used to measure the performance of compression algorithms without any assumption on the input. So, when we create a succinct index structure, it is natural to ask how close its size is to nH_k and which kind of searches it supports.
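
A minimal sketch (not part of the slides) of what the definition says, building the suffix array by sorting the suffixes directly; the function name and the quadratic construction are illustrative only, real indexes use much faster builders:

```python
def suffix_array(text):
    """Starting positions of the suffixes of `text` in lexicographic order
    (0-based here; the slides use 1-based positions)."""
    # Direct sorting of suffixes: O(n^2 log n) worst case, illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


print(suffix_array("mississippi#"))   # the "#" suffix comes first, then "i#", "ippi#", ...
```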

Burrows Wheeler Transform (1) Burrows and Wheeler introduced a new compression algorithm based on a reversible transformation, now called the Burrows Wheeler Transform (BWT from now on). The BWT consists of three basic steps:
1. append at the end of T a special character # smaller than any other text character;
2. form a conceptual matrix M whose rows are the cyclic shifts of the string T#, sorted in lexicographic order;
3. construct the transformed text BWT by taking the last column of matrix M.
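
The three steps can be written down directly; the sketch below is illustrative (it materializes the whole matrix M, which real constructions avoid) and assumes '#' compares smaller than every text character, as in the slides:

```python
def bwt(text, terminator="#"):
    """BWT by the three steps on the slide: append '#', sort the cyclic
    shifts of T# (the conceptual matrix M), take the last column."""
    assert terminator not in text and terminator < min(text)   # '#' must be the smallest symbol
    t = text + terminator                                      # step 1
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))   # step 2: rows of M
    return "".join(row[-1] for row in rotations)               # step 3: last column of M


print(bwt("mississippi"))   # -> "ipssm#pissii"
```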

Burrows Wheeler Transform (2) [Figure: the conceptual matrix M of sorted cyclic shifts for the running example, with the BWT as its last column.]

Burrows Wheeler Transform (3) The inverse of the BWT is computed with LF(i) = C[BWT[i]] + Occ(BWT[i], i), where, for c ∈ Σ, C[c] is the total number of text characters which are alphabetically smaller than c, and Occ(c,q) is the number of occurrences of character c in the prefix BWT[1,q]. LF(·) stands for Last-to-First column mapping, because character BWT[i] is located in the first column of M at position LF(i). In this way we can navigate the text T backwards: if T[k] = BWT[i], then T[k-1] = BWT[LF(i)].
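
A sketch of this backward navigation, recovering T from its BWT with repeated LF() steps; it uses 0-based Python indexing instead of the slides' 1-based arrays:

```python
from collections import Counter

def inverse_bwt(bwt):
    """Recover T (without the terminator '#') by walking it backwards with LF();
    0-based indexing, so LF(i) = C[bwt[i]] + (earlier occurrences of bwt[i])."""
    counts = Counter(bwt)
    C, total = {}, 0                      # C[c]: characters smaller than c
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    seen = Counter()
    occ = []                              # occ[i]: occurrences of bwt[i] in bwt[:i]
    for c in bwt:
        occ.append(seen[c])
        seen[c] += 1
    i = 0                                 # row 0 of M is "#" + T, so bwt[0] is the last char of T
    out = []
    for _ in range(len(bwt) - 1):         # collect T back to front, skipping '#'
        out.append(bwt[i])
        i = C[bwt[i]] + occ[i]            # LF step: move to the previous text position
    return "".join(reversed(out))


print(inverse_bwt("ipssm#pissii"))        # -> "mississippi"
```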

Burrows Wheeler Transform (4) For example, if i = 9 then LF(9) = C[BWT[9]] + Occ(BWT[9], 9) = C[s] + Occ(s, 9) = 8 + 3 = 11.

FM Index (1) The FM index is a full-text self-index that allows us to efficiently search for the occurrences of an arbitrary pattern P as a substring of the text T. The FM index is composed of a compressed representation of the BWT plus auxiliary structures that permit computing the value of Occ(c,q) in O(1) time, for any c ∈ Σ and any q ∈ [1,n].

FM Index (2) The pseudo code of the counting operation, which works in p phases, is as follows. To obtain the position of each row where P[1,p] is a prefix, it is necessary to mark one text position out of every O(log² n / log log n) positions and then apply LF(·) until a marked position is found.
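
A sketch of that p-phase backward search, with a plain precomputed Occ table standing in for the index's compressed O(1)-time Occ structure; function and variable names are illustrative:

```python
from collections import Counter

def count_occurrences(bwt, pattern):
    """Backward search: p phases, one per pattern character (last to first),
    maintaining the range [first, last] of rows of M prefixed by P[i..p]."""
    n = len(bwt)
    counts = Counter(bwt)
    C, total = {}, 0                          # C[c]: characters smaller than c in the text
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    occ = {c: [0] * (n + 1) for c in counts}  # occ[c][q] = Occ(c, q), built by prefix sums
    for q, ch in enumerate(bwt, start=1):
        for c in occ:
            occ[c][q] = occ[c][q - 1]
        occ[ch][q] += 1
    first, last = 1, n                        # initially every row is a candidate
    for c in reversed(pattern):
        if c not in C:
            return 0
        first = C[c] + occ[c][first - 1] + 1
        last = C[c] + occ[c][last]
        if first > last:                      # empty range: P does not occur in T
            return 0
    return last - first + 1


print(count_occurrences("ipssm#pissii", "si"))   # -> 2 ("si" occurs twice in "mississippi")
```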

FM Index (3) For example, if P = "si", the first phase processes c = P[2] = i:
First = C[c] + Occ(c, First-1) + 1 = C[i] + 0 + 1 = 2
Last = C[c] + Occ(c, Last) = C[i] + 4 = 1 + 4 = 5

FM Index (4) The second phase processes c = P[1] = s:
First = C[c] + Occ(c, First-1) + 1 = C[s] + Occ(s, 1) + 1 = 8 + 0 + 1 = 9
Last = C[c] + Occ(c, Last) = C[s] + Occ(s, 5) = 8 + 2 = 10
So P = "si" prefixes rows 9 and 10 of M, that is, it occurs Last - First + 1 = 2 times in T.

Compression boosting (1) The key idea is that one can take an algorithm whose performance can be bounded in terms of the zero-th order entropy and obtain, via the booster, a new compressor whose performance can be bounded in terms of the k-th order entropy, simultaneously for all k. The boosting technique computes an optimal partition S = {s_1, s_2, ..., s_z} of the BWT such that, if we compress each s_i with a zero-th order compressor, we obtain a k-th order entropy bound for the compressed BWT.

Rank First we show a binary rank: Rank_1(Bits, i) = number of 1s in Bits[1, i]. The query is answered from a superblock counter, a block counter and a popcount over the few remaining bits; with superblocks of 32 bits and blocks of 8 bits, for example:
Rank_1(Bits, 203) = Superblock(203/32) + Block(203/8) + count(01100000, 203 mod 8)
= Superblock(6) + Block(25) + popcount(011)
= 107 + 4 + 2 = 113
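
A sketch of this two-level counting scheme, with 32-bit superblocks and 8-bit blocks as in the example; the class and field names are illustrative, and a real implementation would pack the counters and use machine popcount:

```python
SUPER = 32   # superblock size in bits, as in the slide's example
BLOCK = 8    # block size in bits

class BinaryRank:
    """Rank_1(Bits, i): superblock counter + in-superblock block counter
    + popcount over the at most BLOCK-1 remaining bits."""

    def __init__(self, bits):
        self.bits = bits
        self.super_cnt = [0] * (len(bits) // SUPER + 1)   # ones in Bits[1 .. j*SUPER]
        self.block_cnt = [0] * (len(bits) // BLOCK + 1)   # ones since the enclosing superblock
        ones_total = ones_in_super = 0
        for pos, b in enumerate(bits, start=1):
            ones_total += b == "1"
            ones_in_super += b == "1"
            if pos % SUPER == 0:                          # close the superblock first,
                self.super_cnt[pos // SUPER] = ones_total
                ones_in_super = 0
            if pos % BLOCK == 0:                          # then record the block counter
                self.block_cnt[pos // BLOCK] = ones_in_super

    def rank1(self, i):
        """Number of 1s in Bits[1..i] (1-based)."""
        tail = self.bits[(i // BLOCK) * BLOCK : i]        # last i % BLOCK bits
        return self.super_cnt[i // SUPER] + self.block_cnt[i // BLOCK] + tail.count("1")


bits = "0" * 100 + "1" * 103 + "0" * 50                   # a made-up bit string
assert BinaryRank(bits).rank1(203) == 103                 # ones at positions 101..203
```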

Wavelet tree (1) Theorem. Let W[1..w] denote a string over an arbitrary alphabet of size σ. The wavelet tree built on W uses wH_0(W) + O(log σ · (w log log w) / log w) bits of storage and supports in O(log σ) time the following operations: given q, 1 ≤ q ≤ w, the retrieval of the character W[q]; given c and q, 1 ≤ q ≤ w, the computation of the number of occurrences Occ_W(c, q) of c in W[1..q].
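
A minimal binary wavelet tree sketch supporting both operations of the theorem; it stores plain Python lists and counts bits by scanning, where the real structure uses compressed bitmaps with O(1) rank, so only the tree logic (one bitmap query per level, O(log σ) levels) is faithful:

```python
class WaveletTree:
    """Binary wavelet tree: each node splits its alphabet in half and stores
    one bit per character of its string (left half = 0, right half = 1)."""

    def __init__(self, w, alphabet=None):
        self.alphabet = sorted(set(w)) if alphabet is None else alphabet
        if len(self.alphabet) <= 1:
            self.bits = None                              # leaf: a single distinct symbol
            return
        mid = len(self.alphabet) // 2
        left_syms = set(self.alphabet[:mid])
        self.bits = [0 if c in left_syms else 1 for c in w]
        self.left = WaveletTree("".join(c for c in w if c in left_syms),
                                self.alphabet[:mid])
        self.right = WaveletTree("".join(c for c in w if c not in left_syms),
                                 self.alphabet[mid:])

    def access(self, q):
        """Return W[q] (1-based), descending one level per bit."""
        node = self
        while node.bits is not None:
            b = node.bits[q - 1]
            q = sum(1 for x in node.bits[:q] if x == b)   # rank_b(q) -> position in the child
            node = node.right if b else node.left
        return node.alphabet[0]

    def occ(self, c, q):
        """Occ_W(c, q): occurrences of c in W[1..q] (1-based)."""
        node = self
        while node.bits is not None:
            mid = len(node.alphabet) // 2
            b = 0 if c in node.alphabet[:mid] else 1
            q = sum(1 for x in node.bits[:q] if x == b)   # rank_b(q)
            node = node.right if b else node.left
        return q if node.alphabet and node.alphabet[0] == c else 0


wt = WaveletTree("ipssm#pissii")                          # BWT of "mississippi#"
assert wt.access(9) == "s" and wt.occ("s", 9) == 3
```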

Wavelet tree (2) [Figure: wavelet tree example with its node bit vectors (C_1 = 0000, C_3 = 0010).]

Alphabet Friendly FM Index (1) To build the FM index we need to solve two problems: a) to compress the BWT up to H_k(T); b) to compute Occ(c, q) and BWT[q] in time independent of n. We use the boosting technique to transform problem a) into the problem of compressing the strings s_1, s_2, ..., s_z up to their zero-th order entropy, and we use the wavelet tree to create a compressed (up to H_0) and indexable representation of each s_i, thus solving problems a) and b) simultaneously.

Alphabet Friendly FM Index (2) Construction. First we determine the optimal partition s_1, s_2, ..., s_z using the booster. Then we build a binary string B that keeps track of the starting positions in the BWT of the s_i's. For each string s_i, i = 1, ..., z, we build: the array C_i[1, σ] such that C_i[c] stores the number of occurrences of character c within s_1, s_2, ..., s_{i-1}; and the wavelet tree T_i.

Alphabet Friendly FM Index (2) To compute BWT[q], we first determine the substring s_y containing the q-th character of the BWT by computing y = rank_1(B, q). Then we use the wavelet tree T_y to retrieve BWT[q]. To compute Occ(c, q), we likewise determine the substring s_y where position q falls, y = rank_1(B, q). Then we use the wavelet tree T_y and the array C_y to compute Occ(c, q) = Occ_{s_y}(c, q') + C_y[c], where q' = q - Σ_{j=1..y-1} |s_j|.
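
A sketch of these two query rules over a given partition; plain strings and a linear scan stand in for the wavelet trees T_y and for rank_1 on B, and the two-piece partition in the usage lines is arbitrary, chosen only for illustration:

```python
class PartitionedBWT:
    """The BWT cut into pieces s_1..s_z (the booster's optimal partition).
    In the real index, B is a rank-capable bit vector marking piece starts and
    each piece is stored as a wavelet tree T_y; here plain strings stand in."""

    def __init__(self, pieces):
        self.pieces = pieces
        self.start = []                         # 1-based starting position of each s_y in the BWT
        self.C = []                             # C_y[c] = occurrences of c in s_1 .. s_{y-1}
        pos, running = 1, {}
        for s in pieces:
            self.start.append(pos)
            self.C.append(dict(running))
            for c in s:
                running[c] = running.get(c, 0) + 1
            pos += len(s)

    def piece_of(self, q):
        """y such that BWT position q falls in s_y (this is rank_1(B, q))."""
        y = 0
        while y + 1 < len(self.start) and self.start[y + 1] <= q:
            y += 1
        return y

    def access(self, q):
        """BWT[q]: find the piece, then query it (the wavelet tree T_y in the real index)."""
        y = self.piece_of(q)
        return self.pieces[y][q - self.start[y]]

    def occ(self, c, q):
        """Occ(c, q) = Occ_{s_y}(c, q') + C_y[c], with q' = q - |s_1 .. s_{y-1}|."""
        y = self.piece_of(q)
        q_prime = q - self.start[y] + 1
        return self.pieces[y][:q_prime].count(c) + self.C[y].get(c, 0)


bwt = "ipssm#pissii"                             # BWT of "mississippi#"
index = PartitionedBWT([bwt[:5], bwt[5:]])       # a two-piece partition, for illustration only
assert index.access(9) == "s" and index.occ("s", 9) == 3
```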

Alphabet Friendly FM Index (3) The size of the new index built on a string T[1,n] is bounded by nH_k(T) + O(log σ · (n log log n) / log n) bits, where H_k(T) is the k-th order empirical entropy of T. The above bound holds simultaneously for all k ≤ α log_σ n and any constant 0 < α < 1. Moreover, the index design does not depend on the parameter k, which plays a role only in the analysis of the space occupancy. Using the AF index, counting the occurrences of an arbitrary pattern P[1,p] as a substring of T takes O(p log σ) time. Locating each pattern occurrence takes O(log σ · log² n / log log n) time. Reporting a text substring of length l takes O((l + log² n / log log n) log σ) time.

Alphabet Friendly FM Index (4) Using O(n) additional bits of storage, we can design a variant of the AF FMI that counts the pattern occurrences in O(p/ε) time (independent of σ), and takes O((log² n / log log n) σ^ε / ε) time to locate each occurrence. To achieve this we show first an approach with flat bit arrays. Let W[1,w] be a substring of the BWT computed for the indexed text T. Instead of the wavelet tree, we process W by building bit arrays S_c such that S_c[i] = 1 if and only if W[i] = c. We then construct a FID data structure over each S_c to support Rank_1() and Select_1() queries in constant time. A more general trade-off is obtained with r-ary trees. Consider an r-ary tree built over a string W[1,w]. Much like a wavelet tree, each node u of the r-ary tree is responsible for a subset of the alphabet, say Σ_u, so that u represents the subsequence W_u of W formed by the characters in Σ_u. The root node stands for the whole alphabet Σ, and thus stores the whole string W. Each node u with children u_1, ..., u_r splits its alphabet Σ_u into r equally sized subsets Σ_{u_1}, ..., Σ_{u_r}.

Alphabet Friendly FM Index (5) In each node u_j, child of u, we store a bit array S_{u_j} of length |W_u| such that S_{u_j}[q] = 1 whenever W_u[q] ∈ Σ_{u_j}. S_{u_j} is indeed the characteristic vector of the characters of W_u copied into W_{u_j}. A FID data structure is then built over S_{u_j} to answer Rank_1() and Select_1() queries in constant time. Finally, if we also have σ = O(polylog n), counting the occurrences of an arbitrary pattern P[1,p] as a substring of T takes O(p) time, locating each pattern occurrence takes O(log^{1+ε} n) time, and reporting a text substring of length l takes O(l + log^{1+ε} n) time.
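
A sketch of the flat bit-array idea from the previous slide for a single string W: one indicator array S_c per character, with plain prefix sums standing in for the constant-time FID, so that Occ(c, q) becomes a single Rank_1 query:

```python
class FlatOcc:
    """Per-character indicator bit arrays: S_c[i] = 1 iff W[i] = c.  With a FID
    over each S_c, Occ(c, q) = Rank_1(S_c, q) in O(1); prefix sums emulate that here."""

    def __init__(self, w):
        self.rank = {c: [0] * (len(w) + 1) for c in set(w)}   # rank[c][q] = Rank_1(S_c, q)
        for q, ch in enumerate(w, start=1):
            for c, pref in self.rank.items():
                pref[q] = pref[q - 1] + (1 if c == ch else 0)

    def occ(self, c, q):
        """Occ(c, q): occurrences of c in W[1..q] (1-based), a single rank query."""
        return self.rank[c][q] if c in self.rank else 0


flat = FlatOcc("ipssm#pissii")
assert flat.occ("s", 9) == 3      # same value as in the wavelet tree examples
```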

Experimental Results (1) The following site hosts the test bed for all our experimental results: http://pizzachili.dcc.uchile.cl/ We compare several indexes over a collection of files: code/sources.100mb, code/sources.50mb, dna/dna.100mb, dna/dna.50mb, music/pitches.50mb, nlang/english.100mb, nlang/english.50mb, protein/proteins.50mb, xml/dblp.xml.100mb, xml/dblp.xml.50mb.

Experimental Results (2) [Bar chart: index size in MB (count functionality) for the CCSA, SSA, RLFM and AF-FMI indexes over each text in the collection.]

Experimental Results (3) [Bar chart: index size in MB (sample size 64) for the CCSA, SSA, RLFM and AF-FMI indexes over each text in the collection.]

Experimental Results (4)-(9) [Figures only.]

Conclusions We have shown that the combination of the FM index, compression boosting, wavelet trees and fully indexable dictionaries yields a simple data structure that uses asymptotically optimal space. The AF Index is very competitive with the existing full-text self-indexes.

Questions?