Compact Data Structures


(To compress is to Conquer) Compact Data Structures Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 3rd KEYSTONE Training School Keyword search in Big Linked Data 23rd August 2017

Agenda Introduction Basic compression Sequences Bit sequences Integer sequences A brief Review about Indexing PAGE 2 images: zurb.com

Introduction to Compact Data Structures Compact data structures lie at the intersection of Data Structures (indexing) and Information Theory (compression): One looks at data representations that not only permit space close to the minimum possible (as in compression) but also require that those representations allow one to efficiently carry out some operations on the data. PAGE 3 COMPACT DATA STRUCTURES: TO COMPRESS IS TO CONQUER

Introduction Why compression? 4 Disks are cheap!! But they are also slow! Compression can help more data to fit in main memory (access to memory is around 10^6 times faster than to HDD). CPU speed is increasing faster than disk speed. We can trade processing time (needed to uncompress data) for space.

Introduction Why compression? 5 Compression does not only reduce space! It also reduces I/O accesses on disks and networks, and processing time* (less data has to be processed) *if appropriate methods are used, for example, methods that allow handling the data in compressed form all the time. Example: a text collection (100%) can be compressed to 30%, or to 20% with p7zip and others; with the right method we can search for "Keystone" directly in the compressed collection.

Introduction Why indexing? 6 Indexing permits sublinear search time. An index (an extra 5-30% of space, or more) over the terms (term 1 ... "Keystone" ... term n) points into the text collection (100%), or into the compressed text collection (30%), so we can search for "Keystone" without scanning the documents.

Introduction Why compact data structures? 7 Self-indexes: sublinear search time with the text implicitly kept. A self-index (WT, WCSA, ...) replaces both the index (extra 5-30% or more) and the text collection itself: it stores a compact bit-level representation from which we can still search for "Keystone" and recover the documents.

Agenda Introduction Basic compression Sequences Bit sequences Integer sequences A brief Review about Indexing PAGE 8 images: zurb.com

Compression Compressing aims at representing data within less space. How does it work? Which are the most traditional compression techniques? PAGE 9 COMPACT DATA STRUCTURES: TO COMPRESS IS TO CONQUER

Basic Compression Modeling & Coding 10 A compressor could use as a source alphabet: a fixed number of symbols (statistical compressors), e.g. 1 char or 1 word; or a variable number of symbols (dictionary-based compressors), e.g. the 1st occurrence of 'a' is encoded alone, while the 2nd occurrence is encoded together with the next symbol ('ax'). Codes are built using symbols of a target alphabet: fixed-length codes (10 bits, 1 byte, 2 bytes, ...) or variable-length codes (1, 2, 3, 4 bits/bytes, ...). This yields a classification: fixed-to-variable (statistical compressors), variable-to-fixed (dictionary-based compressors), and var2var; fixed-to-fixed gives no compression.

Basic Compression Main families of compressors 11 Taxonomy: dictionary-based (gzip, compress, p7zip, ...), grammar-based (BPE, Re-Pair), and statistical compressors (Huffman, arithmetic, Dense, PPM, ...). Statistical compressors gather the frequencies of the source symbols and assign shorter codewords to the most frequent symbols; that is how they obtain compression.

Basic Compression Dictionary-based compressors 12 How do they achieve compression? They assign fixed-length codewords to variable-length symbols (text substrings). The longer the replaced substring, the better the compression. Well-known representatives: the Lempel-Ziv family: LZ77 (1977): gzip, PKZIP, ARJ, p7zip; LZ78 (1978); LZW (1984): compress, GIF images.

EXAMPLE Basic Compression LZW 13 Starts with an initial dictionary D (containing the symbols in Σ). For a given position of the text, while D contains w, read a growing prefix w = w0 w1 w2 ... If w0...wk wk+1 is not in D (but w0...wk is!): output i = entrypos(w0...wk) (note: codeword length = log2(|D|) bits), add w0...wk wk+1 to D, and continue from wk+1 on (included). Dictionary has limited length? Policies: LRU, truncate & go, ...

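A minimal Python sketch of this pseudocode (the function name is mine; output is the list of dictionary entry positions, and codeword bit-packing and limited-dictionary policies are omitted):

```python
def lzw_compress(text):
    # Initial dictionary D: one entry per distinct input symbol.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    output = []
    w = ""
    for ch in text:
        if w + ch in dictionary:
            w += ch                        # keep extending the prefix w
        else:
            output.append(dictionary[w])   # emit entrypos(w)
            dictionary[w + ch] = len(dictionary)  # add w·ch to D
            w = ch                         # continue from ch (included)
    if w:
        output.append(dictionary[w])       # flush the last prefix
    return output, dictionary
```

For instance, on "abababab" the emitted entry positions are [0, 1, 2, 4, 1]: the dictionary grows with "ab", "ba", "aba", "abab" as the prefixes repeat.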

Basic Compression Grammar-based: BPE / Re-Pair 15 Replaces pairs of symbols by a new one, until no pair appears twice; each replacement adds a rule to a dictionary.
Source sequence: A B C D E A B D E F D E D E F A B E C D
Rule DE → G: A B C G A B G F G G F A B E C D
Rule AB → H: H C G H G F G G F H E C D
Rule GF → I: H C G H I G I H E C D
Dictionary of rules: DE → G, AB → H, GF → I. Final Re-Pair sequence: H C G H I G I H E C D
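The greedy loop above can be sketched in Python (names are illustrative; replacement is left-to-right and non-overlapping, the pair count over overlapping runs is approximate, and the fresh rule names are supplied by the caller):

```python
from collections import Counter

def replace_pair(seq, pair, new_sym):
    """Replace non-overlapping occurrences of pair, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_sym)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def repair(seq, fresh_symbols):
    """Re-Pair sketch: replace the most frequent adjacent pair by a new
    symbol (adding a rule) until no pair appears twice."""
    rules = {}
    while len(seq) > 1:
        pair, freq = Counter(zip(seq, seq[1:])).most_common(1)[0]
        if freq < 2:
            break                      # no pair repeats: done
        new_sym = next(fresh_symbols)
        rules[new_sym] = pair
        seq = replace_pair(seq, pair, new_sym)
    return seq, rules
```

On the slide's sequence this derives exactly the rules DE → G, AB → H, GF → I and the final sequence H C G H I G I H E C D.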

Basic Compression Statistical compressors 16 Assign shorter codewords to the most frequent symbols. Must gather the frequency n_c of each symbol c, where n is the total number of symbols in S. Compression is lower bounded by the (zero-order) empirical entropy of the sequence: H0(S) = Σ_c (n_c/n) log2(n/n_c) ≤ log2(σ) bits per symbol, where σ is the alphabet size. Thus n·H0(S) is a lower bound on the size of S compressed with a zero-order compressor. Most representative method: Huffman coding
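The entropy formula above is direct to compute; a small sketch (the function name `h0` is mine):

```python
from collections import Counter
from math import log2

def h0(s):
    """Zero-order empirical entropy of s, in bits per symbol:
    H0(S) = sum over symbols c of (n_c / n) * log2(n / n_c)."""
    n = len(s)
    return sum(nc / n * log2(n / nc) for nc in Counter(s).values())
```

On the 16-symbol example used later in the slides (4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4) it returns about 1.59 bits per symbol.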

Basic Compression Statistical compressors: Huffman coding 17 Optimal prefix-free coding: no codeword is a prefix of another one, so decoding requires no look-ahead! Asymptotically optimal: |Huffman(S)| ≤ n(H0(S)+1). Typically uses bit-wise codewords, yet D-ary Huffman variants exist (D=256: byte-wise). Builds a Huffman tree to generate the codewords.

Basic Compression Statistical compressors: Huffman coding Sort symbols by frequency: S=ADBAAAABBBBCCCCDDEEE 18

Basic Compression Statistical compressors: Huffman coding Bottom Up tree construction 19


Basic Compression Statistical compressors: Huffman coding Branch labeling 24

Basic Compression Statistical compressors: Huffman coding Code assignment 25

Basic Compression Statistical compressors: Huffman coding Compression of the sequence ADB 26 ADB → 01 000 10
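The bottom-up tree construction of the preceding slides can be sketched as follows (a Python sketch with an illustrative name; the concrete 0/1 assignment depends on tie-breaking, but the code lengths are optimal):

```python
import heapq
from collections import Counter

def huffman_codes(s):
    """Bottom-up Huffman construction: repeatedly merge the two least
    frequent nodes; each heap node carries the codes of its symbols."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(s).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)   # unique ints keep tuple comparison off the dicts
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # branch labeling: prefix one side with 0, the other with 1
        merged = {sym: "0" + code for sym, code in c1.items()}
        merged.update({sym: "1" + code for sym, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

On S = ADBAAAABBBBCCCCDDEEE any valid Huffman code uses 46 bits in total (2.3 bits/symbol), whatever the tie-breaking.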

Basic Compression Burrows-Wheeler Transform (BWT) 27 Given S = mississippi$, BWT(S) is obtained by: (1) creating a matrix M with all circular rotations of S, (2) sorting the rows of M, and (3) taking the last column. mississippi$ $mississippi i$mississipp pi$mississip ppi$mississi ippi$mississ sippi$missis ssippi$missi issippi$miss sissippi$mis ssissippi$mi ississippi$m sort $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi F L = BWT(S)
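The three steps can be written almost literally (a quadratic-space sketch for illustration; real implementations use suffix sorting instead of materializing the rotations):

```python
def bwt(s):
    """BWT via sorting all circular rotations of s; s must end with a
    unique sentinel ('$' here) that sorts before every other character."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)   # last column L
```

As on the slide, bwt("mississippi$") yields L = "ipssm$pissii".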

Basic Compression Burrows-Wheeler Transform: reversible (BWT -1 ) 28 Given L=BWT(S), we can recover S=BWT -1 (L) 1 2 3 4 5 $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m 2 7 9 10 6 Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]= c, and k= the number of times c occurs in L[1..i], and j=position in F of the kth occurrence of c Then set LF[i]=j 6 7 8 mississippi$ pi$mississip ppi$mississi 1 8 3 Example: L[7] = p, it is the 2nd p in L LF[7] = 8 which is the 2nd occ of p in F 9 10 11 sippi$missis sissippi$mis ssippi$missi 11 12 4 12 ssissippi$mi 5 F L LF

Basic Compression Burrows-Wheeler Transform: reversible (BWT -1 ) 29 Given L=BWT(S), we can recover S=BWT -1 (L) 1 2 3 4 5 $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m 2 7 9 10 6 - - - - - Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]= c, and k= the number of times c occurs in L[1..i], and j=position in F of the kth occurrence of c Then set LF[i]=j 6 7 8 9 10 11 12 mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi 1 8 3 11 12 4 5 - - - - - - $ Example: L[7] = p, it is the 2nd p in L LF[7] = 8 which is the 2nd occ of p in F 3. Recover the source sequence S in n steps: Initially p=l=6 (position of $ in L); i=0; n=12; In each step: S[n-i] = L[p]; p = LF[p]; i = i+1; F L LF S

Basic Compression Burrows-Wheeler Transform: reversible (BWT -1 ) 30 Given L=BWT(S), we can recover S=BWT -1 (L) 1 2 3 4 5 $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m 2 7 9 10 6 - - - - - Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]= c, and k= the number of times c occurs in L[1..i], and j=position in F of the kth occurrence of c Then set LF[i]=j 6 7 8 9 10 11 12 mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi 1 8 3 11 12 4 5 - - - - - - $ Example: L[7] = p, it is the 2nd p in L LF[7] = 8 which is the 2nd occ of p in F 3. Recover the source sequence S in n steps: Initially p=l=6 (position of $ in L); i=0; n=12; Step i=0: S[n-i] = L[p]; S[12]= $ p = LF[p]; p = 1 i = i+1; i=1 F L LF S

Basic Compression Burrows-Wheeler Transform: reversible (BWT -1 ) 31 Given L=BWT(S), we can recover S=BWT -1 (L) 1 2 3 4 5 $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m 2 7 9 10 6 - - - - - Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]= c, and k= the number of times c occurs in L[1..i], and j=position in F of the kth occurrence of c Then set LF[i]=j 6 7 8 9 10 11 12 mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi 1 8 3 11 12 4 5 - - - - - i $ Example: L[7] = p, it is the 2nd p in L LF[7] = 8 which is the 2nd occ of p in F 3. Recover the source sequence S in n steps: Initially p=l=6 (position of $ in L); i=0; n=12; Step i=1: S[n-i] = L[p]; S[11]= i p = LF[p]; p = 2 i = i+1; i=2 F L LF S

Basic Compression Burrows-Wheeler Transform: reversible (BWT -1 ) 32 Given L=BWT(S), we can recover S=BWT -1 (L) 1 2 3 4 5 $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m 2 7 9 10 6 m i s s i Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]= c, and k= the number of times c occurs in L[1..i], and j=position in F of the kth occurrence of c Then set LF[i]=j 6 7 8 9 10 11 12 mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi 1 8 3 11 12 4 5 s s i p p i $ Example: L[7] = p, it is the 2nd p in L LF[7] = 8 which is the 2nd occ of p in F 3. Recover the source sequence S in n steps: Initially p=l=6 (position of $ in L); i=0; n=12; Step i=1: S[n-i] = L[p]; S[11]= i p = LF[p]; p = 2 i = i+1; i=2 F L LF S
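The inversion walked through in the last slides can be sketched as follows (the LF array is built with a stable sort, which realizes "the k-th c in L maps to the k-th c in F"):

```python
def inverse_bwt(L):
    """Recover S from L = BWT(S) using the LF mapping of the slides."""
    # Stable sort of the positions of L by character gives column F:
    # F[j] is the position in L whose character lands at row j of F.
    F = sorted(range(len(L)), key=lambda i: (L[i], i))
    LF = [0] * len(L)
    for f_pos, l_pos in enumerate(F):
        LF[l_pos] = f_pos
    out = []
    p = L.index("$")               # row whose last character is '$'
    for _ in range(len(L)):        # emit S backwards: S[n-i] = L[p]
        out.append(L[p])
        p = LF[p]
    return "".join(reversed(out))
```

Running it on L = "ipssm$pissii" retraces exactly the steps of the example and returns "mississippi$".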

Basic Compression Bzip2: Burrows-Wheeler Transform (BWT) 33 BWT: many similar symbols appear adjacent. MTF: output the position of the current symbol within the alphabet Σ = {a,b,c,d,e,...}, keeping Σ ordered so that the last used symbol is moved to the beginning of Σ. RLE: if a value appears several times (e.g. 0 appears 6 times: 000000), replace the run by a pair <value,times>: <0,6>. Huffman stage. Why does it work? In a text it is likely that "he" is preceded by "t", "ssisii" by "i", ...
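The MTF stage is tiny; a sketch (illustrative name, list-based alphabet):

```python
def mtf_encode(s, alphabet):
    """Move-to-front: output each symbol's current position in the
    alphabet list, then move that symbol to the front."""
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))   # last used symbol goes first
    return out
```

After the BWT, runs of equal symbols become runs of 0s (e.g. "aaabbb" over {a,b} encodes as 0 0 0 1 0 0), which is what the RLE and Huffman stages then exploit.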

Agenda Introduction Basic compression Sequences Bit sequences Integer sequences A brief Review about Indexing PAGE 34 images: zurb.com

Sequences We want to represent (compactly) a sequence of elements and to efficiently handle them. (Who is in the 2 nd position?? How many Barts up to position 5?? Where is the 3 rd Bart??) 1 2 3 4 5 6 7 8 9 PAGE 35 COMPACT DATA STRUCTURES: TO COMPRESS IS TO CONQUER

Sequences Plain Representation of Data 36 Given a sequence of n integers with maximum value m: 4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4 (positions 1..16). We can represent it with n ⌈log2(m+1)⌉ bits: 100 010 100 100 100 100 001 100 010 100 001 001 010 011 100 100. 16 symbols × 3 bits per symbol = 48 bits: an array of two 32-bit ints. Direct access (access to an integer + bit operations).

Sequences Compressed Representation of Data (H0) 37 4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4 (positions 1..16). Is it compressible? Occurrences n_c: symbol 4 → 9, symbol 1 → 4, symbol 2 → 2, symbol 3 → 1. H0(S) = 1.59 bits per symbol. Huffman: 1.62 bits per symbol, e.g. with codes 4 → 1, 1 → 01, 2 → 000, 3 → 001: 1 01 1 1 1 1 01 1 000 1 01 01 000 001 1 1 = 26 bits. No direct access! (but we could add sampling)

Sequences Summary: Plain/Compressed access/rank/select 38 4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4 (positions 1..16). Plain: 100 010 100 100 100 100 001 100 010 100 001 001 010 011 100 100 (codeword i starts at bit 1 + 3(i-1): 1, 4, 7, 10, ..., 46). Compressed: 1 01 1 1 1 1 01 1 000 1 01 01 000 001 1 1. Operations of interest: access(i): value of the i-th symbol; rank_s(i): number of occs of symbol s up to position i (count); select_s(i): position of the i-th occ of symbol s (locate).

Agenda Introduction Basic compression Sequences Bit sequences Integer sequences A brief Review about Indexing PAGE 39 images: zurb.com

Bit Sequences access/rank/select on bitmaps 40 access (19) = 0 B = 0 1 0 0 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 Rank 1 (6) = 3 Rank 0 (10) = 5 select 0 (10) =15 see [Navarro 2016]

Bit Sequences Applications 41 Bitmaps are a basic part of most compact data structures. Example (we will see it later in the CSA; HDT bitmaps from Javi's talk!!!): S: AAABBCCCCCCCCDDDEEEEEEEEEEFG (n log σ bits) is replaced by B: 1001010000000100100000000011 (n bits, a 1 marking the start of each run) plus D: ABCDEFG (σ log σ bits). Saves space: fast access/rank/select is of interest!! Where is the 2nd C? How many Cs up to position k?
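With this representation, accessing S[i] reduces to a rank on B; a tiny sketch with a naive rank (the function name is mine, and a real structure would use the O(1) rank of the next slides):

```python
def access(B, D, i):
    """Return S[i] (1-based) from the run-head bitmap B and run symbols D:
    rank1(B, i) tells which run position i falls in."""
    ones = sum(B[:i])          # naive rank1(B, i)
    return D[ones - 1]
```

Every position of the original S is recovered this way, e.g. access(B, D, 7) = 'C' on the slide's data.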

Bit Sequences Reaching O(1) rank & o(n) bits of extra space 42 Jacobson, Clark, Munro; variant by Fariña et al., assuming a 32-bit machine word. Step 1: split the bitmap into superblocks of 256 bits, and store in Ds the number of 1s up to positions 1+256k (k = 0, 1, 2, ...): Ds = 0 35 62 97 ... (the first superblock has 35 bits set to 1, the second 27, ...). O(1) time to reach the superblock counter. Space: n/256 superblocks, 1 int each.

Bit Sequences Reaching O(1) rank & o(n) bits of extra space Step 2: For each superblock of 256 bits Divide it into 8 blocks of 32 bits each (machine word size) Store the number of ones from the beginning of the superblock 43 D s = 0 35 62 1 2 3 35 bits set to 1 27 bits set to 1 45 bits set to 1 97 3... 1 1 0... 1 1... 0 0... 1 1 2 3 256 257 300 512 513 768... 4 bits set to 1 6 bits set to 1 8 bits set to 1 1 1 0... 1 1 2 3 32 0... 1 33 44 64... 1... 0 224 256 D b = 0 4... 25 1 2 7 O(1) time to the blocks, 8 blocks per superblock, 1 byte each

Bit Sequences Reaching O(1) rank & o(n) bits of extra space 44 Step 3: rank within a 32-bit block blk = 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1 (positions 1..32). Finally solving: rank1(D, p) = Ds[p / 256] + Db[p / 32] + rank1(blk, i), where i = p mod 32. Ex: rank1(D, 300) = 35 + 4 + 4 = 43. Yet, how to compute rank1(blk, i) in constant time?

Bit Sequences Reaching O(1) rank & o(n) bits of extra space 45 How to compute rank1(blk, i) in constant time? Option 1: popcount within a machine word. Option 2: a universal table onesInByte (the answer for each possible byte), with only 256 entries storing values in [0..8]:
val 0 = 00000000 → 0; val 1 = 00000001 → 1; val 2 = 00000010 → 1; val 3 = 00000011 → 2; ...; val 252 = 11111100 → 6; val 253 = 11111101 → 7; val 254 = 11111110 → 7; val 255 = 11111111 → 8.
Example, rank1(blk, 12) with blk = 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1: shift right 32 - 12 = 20 positions to obtain blk' = 0 ... 0 1 0 0 1 0 0 0 0 1 1 0 0, then sum the onesInByte values of the 4 bytes of blk'. Overall space: 1.375 n bits.

Bit Sequences Select 1 in O(log n) with the same structures 46 select 1 (p) In practice, binary search using rank
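The two-level rank structure plus binary-search select of these slides can be modeled as follows (a Python sketch: plain lists instead of packed words and byte tables, illustrative class and method names):

```python
class RankSelect:
    """Rank/select over a list of bits: superblocks of 256 bits store
    absolute 1-counts, blocks of 32 bits store counts relative to their
    superblock; the remainder is counted inside one block."""
    def __init__(self, bits):
        self.bits = bits
        self.super = []        # ones before each superblock
        self.block = []        # ones from superblock start to each block
        ones = since_super = 0
        for i in range(0, len(bits), 32):
            if i % 256 == 0:
                self.super.append(ones)
                since_super = 0
            self.block.append(since_super)
            word_ones = sum(bits[i:i + 32])
            since_super += word_ones
            ones += word_ones

    def rank1(self, p):
        """Number of 1s in bits[0:p], for 0 <= p <= len(bits)."""
        w = min(p // 32, len(self.block) - 1)
        s = min(p // 256, len(self.super) - 1)
        return self.super[s] + self.block[w] + sum(self.bits[w * 32 : p])

    def select1(self, k):
        """Position (0-based) of the k-th 1: binary search on rank1."""
        lo, hi = 0, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid + 1) < k:
                lo = mid + 1
            else:
                hi = mid
        return lo
```

On the 21-bit bitmap of slide 40 this reproduces rank1(6) = 3, and the counters only start to matter on bitmaps longer than one superblock.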

Bit Sequences Compressed representations 48 Compressed Bit-Sequence representations exist!! Compressed [Raman et al, 2002] For very sparse bitmaps [Okanohara and Sadakane, 2007]... see [Navarro 2016]

Agenda Introduction Basic compression Sequences Bit sequences Integer sequences A brief Review about Indexing PAGE 49 images: zurb.com

Integer Sequences access/rank/select on general sequences 50 access (13) = 3 S= 4 4 3 2 6 2 4 2 4 1 1 2 3 5 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Rank 2 (9) = 3 select 4 (3) =7 see [Navarro 2016]

Integer Sequences Wavelet tree (construction) [Grossi et al 2003] 51 Given a sequence of symbols and an encoding, the bits of the code of each symbol are distributed along the different levels of the tree. DATA: A B A C D A C, encoded as 00 01 00 10 11 00 10 with codes A = 00, B = 01, C = 10, D = 11. WAVELET TREE: root bitmap over A B A C D A C (first bit of each code): 0 0 0 1 1 0 1; left child (0) holds A B A A with bitmap 0 1 0 0; right child (1) holds C D C with bitmap 0 1 0.

Integer Sequences Wavelet tree (select) 52 Searching for the 1st occurrence of D? DATA: A B A C D A C, codes A = 00, B = 01, C = 10, D = 11. In B_1 (bitmap 0 1 0 over C D C), the 1st D is the 1st 1: where is the 1st 1? At pos 2. In the root bitmap B_root (0 0 0 1 1 0 1) it is then the 2nd 1: where is the 2nd 1? At pos 5. So the 1st D occurs at position 5.

Integer Sequences Wavelet tree (access) Recovering data: which symbol appears in the 6th position? 53 DATA: A B A C D A C, codes A = 00, B = 01, C = 10, D = 11. The root bit at position 6 is 0: how many 0s are there up to pos 6? It is the 4th 0, so descend to B_0 (bitmap 0 1 0 0 over A B A A). Which bit occurs at position 4 in B_0? It is set to 0. The codeword read is 00 → A.

Integer Sequences Wavelet tree (access) Recovering data: which symbol appears in the 7th position? 54 DATA: A B A C D A C, codes A = 00, B = 01, C = 10, D = 11. The root bit at position 7 is 1: how many 1s are there up to pos 7? It is the 3rd 1, so descend to B_1 (bitmap 0 1 0 over C D C). Which bit occurs at position 3 in B_1? It is set to 0. The codeword read is 10 → C.

Integer Sequences Wavelet tree (rank) How many Cs are there up to position 7? 55 DATA: A B A C D A C, codes A = 00, B = 01, C = 10, D = 11. In the root: how many 1s are there up to pos 7? 3, so descend to B_1 with position 3. How many 0s up to position 3 in B_1? 2!! Rank thus descends with access and rank on the bitmaps; select (locating a symbol) goes upward, as seen before.

Integer Sequences Wavelet tree (space and times) 56 Each level contains n + o(n) bits. DATA: A B A C D A C, codes A = 00, B = 01, C = 10, D = 11; root bitmap 0 0 0 1 1 0 1 (n + o(n) bits), second level 0 1 0 0 and 0 1 0 (n + o(n) bits). Total: n log σ (1 + o(1)) bits. Rank/select/access in O(log σ) time.

Integer Sequences Huffman-shaped (or others) Wavelet tree 57 Using Huffman coding (or others), the tree becomes unbalanced. DATA: A B A C D A C, encoded as 1 000 1 01 001 1 01 with codes A = 1, B = 000, C = 01, D = 001. WAVELET TREE: root bitmap over A B A C D A C: 1 0 1 0 0 1 0 (the 1s lead to the leaf of A: A A A); node for B C D C: 0 1 0 1 (the 1s lead to the leaf of C: C C); node for B D: 0 1. Space: nH0(S) + o(n) bits. Rank/select/access in O(H0(S)) average time.
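The balanced wavelet tree of slides 51-56 can be sketched compactly (a Python model with plain lists and naive bitmap rank/select instead of the o(n)-bit structures; class and method names are mine):

```python
class WaveletTree:
    """Balanced wavelet tree: each node halves its alphabet and stores
    one bit per symbol of its subsequence."""
    def __init__(self, data, alphabet=None):
        self.alphabet = alphabet if alphabet is not None else sorted(set(data))
        if len(self.alphabet) == 1:
            self.leaf, self.n = True, len(data)
            return
        self.leaf = False
        mid = len(self.alphabet) // 2
        left = set(self.alphabet[:mid])
        self.bits = [0 if c in left else 1 for c in data]
        self.left = WaveletTree([c for c in data if c in left],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in data if c not in left],
                                 self.alphabet[mid:])

    def access(self, i):                     # symbol at position i (0-based)
        if self.leaf:
            return self.alphabet[0]
        b = self.bits[i]
        child = self.right if b else self.left
        return child.access(self.bits[:i].count(b))   # rank_b(i) descends

    def rank(self, c, i):                    # occurrences of c in data[0:i]
        if self.leaf:
            return min(i, self.n)
        if c in self.left.alphabet:
            return self.left.rank(c, self.bits[:i].count(0))
        return self.right.rank(c, self.bits[:i].count(1))

    def select(self, c, k):                  # position of the k-th c (k >= 1)
        if self.leaf:
            return k - 1
        child, b = (self.left, 0) if c in self.left.alphabet else (self.right, 1)
        pos = child.select(c, k)             # position inside the child
        seen = 0                             # map it upward: find the
        for j, bit in enumerate(self.bits):  # (pos+1)-th bit equal to b
            if bit == b:
                seen += 1
                if seen == pos + 1:
                    return j
```

On DATA = A B A C D A C this reproduces the three worked examples: access(position 5, 1-based) = D, rank of C up to position 7 = 2, and select of the 1st D = position 5.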

Agenda Introduction Basic compression Sequences Bit sequences Integer sequences A brief Review about Indexing PAGE 58 images: zurb.com

A brief review about indexing Inverted indexes are the most well-known index for text [...]. Suffix arrays are powerful but huge full-text indexes. Self-indexes trade some performance for a more compact space. PAGE 59 COMPACT DATA STRUCTURES: TO COMPRESS IS TO CONQUER

A brief Review about Indexing Text indexing: well-known structures from the Web 60 Traditional indexes (with or without compression): an auxiliary structure plus the explicit text: inverted indexes, suffix arrays, ... Compressed self-indexes: the text is implicit: wavelet trees, compressed suffix arrays, FM-index, LZ-index, ...

A brief Review about Indexing Inverted indexes 61 Vocabulary: DCC, communications, compression, image, data, information, Cliff Lodge. A full-positional inverted index stores, for each vocabulary term, the list of its positions in the text (posting lists with the offsets 0 142 368 506 104 165 341 219 445 99 207 336 128 395 19 25); a doc-addressing inverted index stores only the documents where each term occurs (Doc 1, Doc 2, ...): a space-time trade-off. Indexed text: "DCC is held at the Cliff Lodge convention center. It is an international forum for current work on data compression and related applications. DCC addresses not only compression methods for specific types of data (text, image, video, audio, space, graphics, web content, [...]) ... also the use of techniques from information theory and data compression in networking, communications, and storage applications involving large datasets (including image and information mining, retrieval, archiving, backup, communications, and HCI)." Searches: a word → the posting list of that word; a phrase → intersection of postings. Compression: of the indexed text (Huffman, ...) and of the posting lists (Rice, ...).

A brief Review about Indexing Inverted indexes 62 Lists contain increasing integers, and gaps between integers are smaller in the longest lists.
Original posting list: 4 10 15 25 29 40 46 54 57 70 79 82 (positions 1..12).
Differences: 4 6 5 10 4 11 6 8 3 13 9 3.
Var-length coding: c4 c6 c5 c10 c4 c11 c6 c8 c3 c13 c9 c3 → requires complete decompression.
Absolute sampling + var-length coding: 4 c6 c5 c10 | 29 c11 c6 c8 | 57 c13 c9 c3 → direct access and partial decompression.
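Gap computation plus one concrete var-length code can be sketched as follows (a byte-oriented vbyte code is used for illustration; the slide's "c4, c6, ..." codes are left abstract, and Rice codes would be an alternative):

```python
def gaps_of(postings):
    """Differences between consecutive (increasing) postings."""
    return [b - a for a, b in zip([0] + postings, postings)]

def vbyte_encode(gaps):
    """Variable-byte code: 7 data bits per byte; the high bit marks
    the last byte of each number."""
    out = bytearray()
    for g in gaps:
        while g >= 128:
            out.append(g & 0x7F)
            g >>= 7
        out.append(g | 0x80)
    return bytes(out)

def vbyte_decode(data):
    out, cur, shift = [], 0, 0
    for b in data:
        if b & 0x80:                       # final byte of this number
            out.append(cur | ((b & 0x7F) << shift))
            cur, shift = 0, 0
        else:
            cur |= (b & 0x7F) << shift
            shift += 7
    return out
```

Decoding the gaps and summing them back gives the original posting list; with absolute sampling, decoding can instead restart at the nearest sample.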

A brief Review about Indexing Suffix Arrays Sorting all the suffixes of T lexicographically 63 T = a b r a c a d a b r a $ (positions 1..12). A = 12 11 8 1 4 6 9 2 5 7 10 3, i.e. the suffixes in lexicographic order: $, a$, abra$, abracadabra$, acadabra$, adabra$, bra$, bracadabra$, cadabra$, dabra$, ra$, racadabra$.

A brief Review about Indexing Suffix Arrays Binary search for any pattern: ab 64 P = a b T = 1 2 3 4 5 6 7 8 9 10 11 12 a b r a c a d a b r a $ 1 2 3 4 5 6 7 8 9 10 11 12 A = 12 11 8 1 4 6 9 2 5 7 10 3

A brief Review about Indexing Suffix Arrays Binary search for any pattern: ab 70 P = a b, T = a b r a c a d a b r a $ (positions 1..12), A = 12 11 8 1 4 6 9 2 5 7 10 3. The binary search ends with the interval A[3..4]: noccs = (4 - 3) + 1 = 2 occurrences, at locations A[3]..A[4] = {8, 1}. Fast: counting in O(m lg n), locating in O(m lg n + noccs). Space: 4n bytes (the array A) plus T.
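A sketch of the construction and the double binary search (0-based positions; the naive sort-based build is only for illustration, real builds use linear-time suffix sorting):

```python
def suffix_array(t):
    """Suffix array by plain sorting of the suffixes (illustrative)."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def search(t, sa, p):
    """Occurrences of pattern p: two binary searches delimiting the
    interval of suffixes whose prefix is p, O(m lg n) comparisons each."""
    lo, hi = 0, len(sa)
    while lo < hi:                          # leftmost suffix >= p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                          # past the suffixes prefixed by p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sa[start:lo]                     # text positions, in SA order
```

On T = abracadabra$ the pattern "ab" yields positions {7, 0}, i.e. the slide's {8, 1} in 1-based numbering.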

A brief Review about Indexing BWT FM-index 71 BWT(S) + other structures: it is an index. C[c]: for each char c in Σ, stores the number of occs in S of the chars that are lexicographically smaller than c: C[$]=0 C[i]=1 C[m]=5 C[p]=6 C[s]=8. Occ(c, k): number of occs of char c in the prefix L[1..k]. For k in [1..12]: Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1; Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4; Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1; Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2; Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4. Char L[i] occurs in F at position LF(i): LF(i) = C[L[i]] + Occ(L[i], i).

A brief Review about Indexing BWT FM-index 74 Count(S[1,u], P[1,p]): backward search, processing the pattern from its last character to its first one (e.g. Count(S, "issi") processes i, s, s, i). It uses only C[$]=0 C[i]=1 C[m]=5 C[p]=6 C[s]=8 and Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1; Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4; Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1; Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2; Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4.
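Backward counting can be sketched directly from C and Occ (a Python sketch with an illustrative name; C and Occ are recomputed on the fly here, whereas a real FM-index stores them in compressed form):

```python
def fm_count(L, p):
    """Count occurrences of pattern p in S, given only L = BWT(S)."""
    # C[c] = number of characters in L smaller than c.
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    occ = lambda c, k: L[:k].count(c)   # Occ(c, k): occs of c in L[1..k]
    sp, ep = 0, len(L)                  # current range of rows [sp, ep)
    for c in reversed(p):               # backward: last char first
        if c not in C:
            return 0
        sp = C[c] + occ(c, sp)
        ep = C[c] + occ(c, ep)
        if sp >= ep:
            return 0
    return ep - sp
```

With L = "ipssm$pissii" (the BWT of mississippi$), counting "issi" narrows the range through [1,5), [8,10), [10,12), [3,5) and reports 2 occurrences.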

A brief Review about Indexing BWT FM-index 75 Representing L with a wavelet tree, Occ(c, k) is solved as rank_c(L, k), so the Occ information is kept in compressed form.

Bibliography 76 1. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Systems Research Center, 1994. http://gatekeeper.dec.com/pub/dec/src/researchreports/. 2. F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proc. 15th SPIRE, LNCS 5280, pages 176–187, 2008. 3. Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index. In Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington (USA), 2001. 4. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, 2005. 5. Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23–38, February 1994. 6. A. Golynski, I. Munro, and S. Rao. Rank/select operations on large alphabets: a tool for text indexing. In Proc. 17th SODA, pages 368–373, 2006. 7. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th SODA, pages 841–850, 2003.

Bibliography 77 8. David A. Huffman. A method for the construction of minimum-redundancy codes. Proc. of the Institute of Radio Engineers, 40(9):1098–1101, 1952. 9. N. J. Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722–1732, 2000. 10. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM J. Comp., 22(5):935–948, 1993. 11. Alistair Moffat and Andrew Turpin. Compression and Coding Algorithms. Kluwer, 2002. ISBN 0-7923-7668-4. 12. I. Munro. Tables. In Proc. 16th FSTTCS, LNCS 1180, pages 37–42, 1996. 13. Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys (CSUR), 39(1), article 2, 2007. 14. Gonzalo Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, 570 pages, 2016. 15. D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th ALENEX, 2007.

Bibliography 78 16. R. Raman, V. Raman, and S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th SODA, pages 233–242, 2002. 17. Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2):113–139, 2000. 18. Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999. 19. J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977. 20. J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.

(To compress is to Conquer) Compact Data Structures Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 3rd KEYSTONE Training School Keyword search in Big Linked Data 23rd August 2017 (Thanks: slides partially by Susana Ladra, E. Rodríguez, & José R. Paramá)