(To compress is to Conquer) Compact Data Structures Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 3rd KEYSTONE Training School: Keyword search in Big Linked Data 23rd August 2017
Agenda
- Introduction
- Basic compression
- Sequences: bit sequences, integer sequences
- A brief review about indexing
Introduction to Compact Data Structures Compact data structures lie at the intersection of data structures (indexing) and information theory (compression): one looks at data representations that not only take space close to the minimum possible (as in compression) but also allow one to efficiently carry out operations on the data. COMPACT DATA STRUCTURES: TO COMPRESS IS TO CONQUER
Introduction Why compression? Disks are cheap!! But they are also slow! Compression can help more data fit in main memory (access to memory is around 10^6 times faster than HDD). CPU speed is increasing faster than disk speed, so we can trade processing time (needed to uncompress data) for space.
Introduction Why compression? Compression does not only reduce space! It also reduces I/O on disks and networks, and even processing time (less data has to be processed) if appropriate methods are used, for example methods that allow handling the data in compressed form all the time. [Figure: a text collection (100%) of documents Doc 1 .. Doc n vs. the same collection compressed to 30% (still searchable: "Let's search for 'Keystone'") or to 20% (p7zip and others).]
Introduction Why indexing? Indexing permits sublinear search time. [Figure: an index (an extra 5-30% of space) with terms term_1 .. "Keystone" .. term_n pointing into the text collection (100%), or into the collection compressed to 30%: "Let's search for 'Keystone'".]
Introduction Why compact data structures? Self-indexes: sublinear search time, with the text implicitly kept. [Figure: a self-index (wavelet tree, WCSA, ...) over terms term_1 .. "Keystone" .. term_n replaces both the index and the text collection: "Let's search for 'Keystone'".]
Compression Compression aims at representing data within less space. How does it work? What are the most traditional compression techniques?
Basic Compression Modeling & Coding A compressor can use as source alphabet:
- A fixed number of symbols (statistical compressors): 1 char, 1 word.
- A variable number of symbols (dictionary-based compressors): e.g. the 1st occurrence of 'a' is encoded alone, the 2nd occurrence is encoded together with the next symbol ('ax').
Codes are built using symbols of a target alphabet:
- Fixed-length codes (10 bits, 1 byte, 2 bytes, ...)
- Variable-length codes (1, 2, 3, 4 bits/bytes, ...)
Classification (fixed-to-variable, variable-to-fixed, ...):
                    target fixed    target variable
  input fixed       --              statistical
  input variable    dictionary      var2var
Basic Compression Main families of compressors Taxonomy: dictionary-based (gzip, compress, p7zip, ...), grammar-based (BPE, Re-Pair), statistical compressors (Huffman, arithmetic, Dense codes, PPM, ...). Statistical compressors gather the frequencies of the source symbols and assign shorter codewords to the most frequent symbols, thus obtaining compression.
Basic Compression Dictionary-based compressors How do they achieve compression? They assign fixed-length codewords to variable-length symbols (text substrings); the longer the replaced substring, the better the compression. Well-known representatives: the Lempel-Ziv family. LZ77 (1977): GZIP, PKZIP, ARJ, P7zip. LZ78 (1978). LZW (1984): Compress, GIF images.
EXAMPLE Basic Compression LZW Starts with an initial dictionary D (containing the symbols in Σ). From a given position of the text, while D contains w, keep reading a longer prefix w = w0 w1 w2 ... When w0...wk wk+1 is not in D (but w0...wk is):
- output i = entrypos(w0...wk) (note: codeword length = log2(|D|) bits)
- add w0...wk wk+1 to D
- continue from wk+1 on (included)
If the dictionary has a limited length, apply a replacement policy: LRU, truncate & go, ...
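The loop above can be sketched in a few lines of Python (a minimal illustration, not the slides' exact implementation; codeword widths and dictionary-limit policies are omitted):

```python
def lzw_compress(text):
    """Minimal LZW: emit the dictionary index of the longest known prefix w,
    then learn w + next symbol as a new dictionary entry."""
    # Initial dictionary D: every distinct single symbol of the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    output, w = [], ""
    for ch in text:
        if w + ch in dictionary:
            w += ch                    # keep extending the current prefix
        else:
            output.append(dictionary[w])           # output entrypos(w)
            dictionary[w + ch] = len(dictionary)   # add w + ch to D
            w = ch                     # continue from the mismatching symbol
    if w:
        output.append(dictionary[w])   # flush the last prefix
    return output, dictionary
```

On "abababab" (alphabet {a: 0, b: 1}) it emits [0, 1, 2, 4, 1], reusing the learned phrases "ab" (entry 2) and "aba" (entry 4).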
Basic Compression Grammar-based: BPE / Re-Pair Replaces pairs of symbols by a new one, until no pair appears twice; each replacement adds a rule to a dictionary.
Source sequence:  A B C D E A B D E F D E D E F A B E C D
Rule DE -> G:     A B C G A B G F G G F A B E C D
Rule AB -> H:     H C G H G F G G F H E C D
Rule GF -> I:     H C G H I G I H E C D
Dictionary of rules: G -> DE, H -> AB, I -> GF. Final Re-Pair sequence: H C G H I G I H E C D
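The example can be reproduced with a naive Re-Pair sketch (quadratic, with simplistic tie-breaking; the generated rule names R0, R1, R2 stand for the slides' G, H, I):

```python
from collections import Counter

def repair(seq):
    """Naive Re-Pair: repeatedly replace the most frequent adjacent pair
    with a fresh nonterminal until no pair occurs twice."""
    rules, next_id = {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            return seq, rules
        pair, freq = max(pairs.items(), key=lambda kv: kv[1])
        if freq < 2:                       # no pair repeats: done
            return seq, rules
        new = f"R{next_id}"; next_id += 1
        rules[new] = pair                  # record the rule new -> pair
        out, i = [], 0
        while i < len(seq):                # left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
```

On the slide's sequence it derives R0 -> DE, R1 -> AB, R2 -> R0 F (i.e. G, H, I) and ends with the 11-symbol sequence above.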
Basic Compression Statistical compressors Assign shorter codewords to the most frequent symbols; this requires gathering the frequency of each symbol c in Σ. Compression is lower bounded by the zero-order empirical entropy of the sequence S (n = number of symbols, n_c = occurrences of symbol c):

H0(S) = Σ_c (n_c / n) log2(n / n_c) <= log2 |Σ|

n·H0(S) is a lower bound on the size of S compressed with a zero-order compressor. The most representative method is Huffman coding.
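The bound is easy to evaluate; a small Python sketch of H0, checked against the integer sequence used later in these slides (whose entropy is 1.59 bits per symbol):

```python
from collections import Counter
from math import log2

def empirical_entropy(S):
    """Zero-order empirical entropy H0(S) = sum_c (n_c/n) * log2(n/n_c),
    in bits per symbol; n*H0(S) lower-bounds any zero-order compressor."""
    n = len(S)
    return sum((nc / n) * log2(n / nc) for nc in Counter(S).values())
```

For the sequence 4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4 it returns roughly 1.59.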
Basic Compression Statistical compressors: Huffman coding Optimal prefix-free coding: no codeword is a prefix of another, so decoding requires no look-ahead! Asymptotically optimal: |Huffman(S)| <= n(H0(S) + 1). Typically uses bit-wise codewords, yet D-ary Huffman variants exist (D = 256: byte-wise). Builds a Huffman tree to generate the codewords.
Basic Compression Statistical compressors: Huffman coding Sort symbols by frequency: S=ADBAAAABBBBCCCCDDEEE 18
Basic Compression Statistical compressors: Huffman coding Bottom Up tree construction 19
Basic Compression Statistical compressors: Huffman coding Branch labeling 24
Basic Compression Statistical compressors: Huffman coding Code assignment 25
Basic Compression Statistical compressors: Huffman coding Compression of the sequence ADB with the codewords obtained: A -> 01, D -> 000, B -> 10, so ADB -> 01 000 10.
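The whole construction of the previous slides (gather frequencies, merge the two lightest subtrees bottom-up, label branches 0/1) fits in a short sketch; tie-breaking differs from the slides, so the exact codewords may differ while the code lengths stay optimal:

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(S):
    """Bottom-up Huffman tree: repeatedly merge the two lowest-frequency
    subtrees; then walk the tree labeling branches 0 (left) and 1 (right)."""
    tiebreak = count()      # keeps the heap from comparing subtree objects
    heap = [(f, next(tiebreak), sym) for sym, f in Counter(S).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (a, b)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):        # internal node: recurse
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                              # leaf: a source symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

For S = ADBAAAABBBBCCCCDDEEE the resulting code is prefix-free and encodes S in 46 bits (2.3 bits per symbol).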
Basic Compression Burrows-Wheeler Transform (BWT) Given S = mississippi$, BWT(S) is obtained by: (1) creating a matrix M with all circular rotations of S, (2) sorting the rows of M, and (3) taking the last column. mississippi$ $mississippi i$mississipp pi$mississip ppi$mississi ippi$mississ sippi$missis ssippi$missi issippi$miss sissippi$mis ssissippi$mi ississippi$m -- sort --> $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi. F = first column; L = last column = BWT(S) = ipssm$pissii
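The three steps translate directly to code (fine for examples; real builders avoid materializing the n rotations):

```python
def bwt(S):
    """BWT by definition: sort all rotations of S (which must end with the
    unique, smallest sentinel '$') and take the last column."""
    rotations = sorted(S[i:] + S[:i] for i in range(len(S)))
    return "".join(row[-1] for row in rotations)
```

For S = mississippi$ it yields the last column ipssm$pissii shown above.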
Basic Compression Burrows-Wheeler Transform: reversible (BWT^-1) Given L = BWT(S), we can recover S = BWT^-1(L). Steps:
1. Sort L to obtain F.
2. Build the LF mapping so that if L[i] = c, k = the number of times c occurs in L[1..i], and j = the position in F of the kth occurrence of c, then set LF[i] = j.
Example: L[7] = p is the 2nd p in L; LF[7] = 8, the position of the 2nd occurrence of p in F.
For S = mississippi$ (rows 1..12):
F  = $ i i i i m p p s  s  s  s
L  = i p s s m $ p i s  s  i  i
LF = 2 7 9 10 6 1 8 3 11 12 4  5
Basic Compression Burrows-Wheeler Transform: reversible (BWT^-1) Step 3: recover the source sequence S in n steps. Initially p = 6 (the position of $ in L), i = 0, n = 12. In each step: S[n-i] = L[p]; p = LF[p]; i = i+1.
Step i=0: S[12] = L[6] = $; p = LF[6] = 1.
Step i=1: S[11] = L[1] = i; p = LF[1] = 2.
... and after n = 12 steps S = mississippi$ is fully recovered.
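Steps 1-3 in Python (a sketch: LF is built by stable-sorting the positions of L, which yields exactly C[c] + Occ(c, i)):

```python
def inverse_bwt(L):
    """Recover S from L = BWT(S) via the LF mapping and the backward walk
    S[n-i] = L[p]; p = LF[p], starting at the row whose last char is '$'."""
    n = len(L)
    # Stable sort of positions by character: entry j of `order` is the
    # L-position that lands at position j of F; inverting it yields LF.
    order = sorted(range(n), key=lambda i: L[i])
    LF = [0] * n
    for j, i in enumerate(order):
        LF[i] = j
    S = [""] * n
    p = L.index("$")                  # 0-based position of '$' in L
    for k in range(n - 1, -1, -1):    # fill S from the last char backwards
        S[k] = L[p]
        p = LF[p]
    return "".join(S)
```

Running it on ipssm$pissii recovers mississippi$.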
Basic Compression Bzip2: Burrows-Wheeler Transform (BWT) BWT: many similar symbols appear adjacent. MTF: output the position of the current symbol within the alphabet Σ, keeping Σ ordered so that the last used symbol is moved to the beginning. RLE: if a value appears several times (e.g. 0 appears 6 times: 000000), replace the run by a pair <value, times> = <0,6>. Huffman stage. Why does it work? In a text it is likely that "he" is preceded by "t", "ssisii" by "i", ...
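The two middle stages of the pipeline are tiny; a sketch of MTF and the run-length pairing (the Huffman stage then compresses the skewed output):

```python
def mtf(seq, alphabet):
    """Move-to-front: output each symbol's current position in the alphabet
    list, then move that symbol to the front, so runs become runs of 0s."""
    alph, out = list(alphabet), []
    for c in seq:
        i = alph.index(c)
        out.append(i)
        alph.insert(0, alph.pop(i))   # last used symbol goes to the front
    return out

def rle(nums):
    """Collapse repeats into <value, times> pairs."""
    out = []
    for x in nums:
        if out and out[-1][0] == x:
            out[-1] = (x, out[-1][1] + 1)
        else:
            out.append((x, 1))
    return out
```

For instance mtf("aaabbb", "ab") gives [0, 0, 0, 1, 0, 0], which rle turns into [(0, 3), (1, 1), (0, 2)].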
Sequences We want to represent (compactly) a sequence of elements and to efficiently handle it. (Who is in the 2nd position? How many Barts up to position 5? Where is the 3rd Bart?) [Figure: a sequence of characters at positions 1..9]
Sequences Plain representation of data Given a sequence of n integers with maximum value m: 4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4 (positions 1..16). We can represent it with n ceil(log2(m+1)) bits: 100 010 100 100 100 100 001 100 010 100 001 001 010 011 100 100. 16 symbols x 3 bits per symbol = 48 bits (an array of two 32-bit ints). Direct access (access to an integer + bit operations).
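A sketch of that fixed-width packing, holding the bit string in a single Python int for brevity (a C implementation would use an int32 array plus the same shifts and masks):

```python
def pack(values, width):
    """Concatenate n integers of `width` bits each into one bit string."""
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * width)      # symbol i occupies bits [i*width, (i+1)*width)
    return word

def access(word, i, width):
    """Direct access: shift and mask, no decompression of neighbors."""
    return (word >> (i * width)) & ((1 << width) - 1)
```

With the 16-symbol sequence above and width 3, access(packed, i, 3) returns each original value.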
Sequences Compressed representation of data (H0) Sequence: 4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4 (positions 1..16). Is it compressible?
Symbol:            4  1  2  3
Occurrences (n_c): 9  4  2  1
H0(S) = 1.59 bits per symbol. Huffman: 1.62 bits per symbol, with codes 4 -> 1, 1 -> 01, 2 -> 000, 3 -> 001:
1 01 1 1 1 1 01 1 000 1 01 01 000 001 1 1 = 26 bits. No direct access! (but we could add sampling)
Sequences Summary: plain/compressed access/rank/select Sequence: 4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4. Plain (3 bits each): 100 010 100 100 100 100 001 100 010 100 001 001 010 011 100 100. Compressed (Huffman): 1 01 1 1 1 1 01 1 000 1 01 01 000 001 1 1. Operations of interest:
- access(i): value of the ith symbol
- rank_s(i): number of occurrences of symbol s up to position i (count)
- select_s(i): position of the ith occurrence of symbol s (locate)
Bit Sequences access/rank/select on bitmaps 40 access (19) = 0 B = 0 1 0 0 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 Rank 1 (6) = 3 Rank 0 (10) = 5 select 0 (10) =15 see [Navarro 2016]
Bit Sequences Applications Bitmaps are a basic part of most compact data structures. Example (we will see it later in the CSA; HDT bitmaps from Javi's talk!!!):
S: AAABBCCCCCCCCDDDEEEEEEEEEEFG  (n log σ bits)
B: 1001010000000100100000000011  (n bits, marking where each run of S starts)
D: ABCDEFG                       (σ log σ bits)
This saves space, so fast access/rank/select is of interest!! Where is the 2nd C? How many Cs up to position k?
Bit Sequences Reaching O(1) rank & o(n) bits of extra space (Jacobson, Clark, Munro; variant by Fariña et al., assuming a 32-bit machine word) Step 1: split the bitmap into superblocks of 256 bits, and store in Ds the number of 1s up to each position 1+256k (k = 0, 1, 2, ...), e.g. Ds = 0, 35, 62, 97, ... O(1) time to reach the superblock. Space: n/256 superblocks, one int each.
Bit Sequences Reaching O(1) rank & o(n) bits of extra space Step 2: divide each superblock of 256 bits into 8 blocks of 32 bits (the machine word size), and store in Db the number of 1s from the beginning of the superblock up to each block, e.g. Db = 0, 4, ..., 25. O(1) time to reach the block; 8 blocks per superblock, one byte each.
Bit Sequences Reaching O(1) rank & o(n) bits of extra space Step 3: rank within a 32-bit block blk. Finally: rank1(D, p) = Ds[p / 256] + Db[p / 32] + rank1(blk, i), where i = p mod 32. Example: rank1(D, 300) = 35 + 4 + 4 = 43. Yet, how do we compute rank1(blk, i) in constant time?
Bit Sequences Reaching O(1) rank & o(n) bits of extra space How to compute rank1(blk, i) in constant time?
- Option 1: popcount within a machine word.
- Option 2: a universal table onesInByte (the answer for each possible byte value): only 256 entries, storing values in [0..8] (onesInByte[0] = 0, onesInByte[1] = 1, onesInByte[2] = 1, onesInByte[3] = 2, ..., onesInByte[255] = 8).
Example, rank1(blk, 12): shift the 32-bit block right by 32 - 12 = 20 positions, then sum onesInByte over the 4 bytes of the shifted word. Overall space: 1.375 n bits.
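The three steps can be sketched over a Python list of bits (the in-block count stands in for popcount or the onesInByte table; sizes follow the slides):

```python
SUPER, BLOCK = 256, 32   # superblock and block sizes, in bits

def build_rank(bits):
    """Two-level directory: Ds[k] = 1s before superblock k (one int each),
    Db[b] = 1s from the superblock start up to block b (one byte each)."""
    Ds, Db, total, in_super = [], [], 0, 0
    for i in range(0, len(bits), BLOCK):
        if i % SUPER == 0:               # a new superblock starts here
            Ds.append(total)
            in_super = 0
        Db.append(in_super)
        ones = sum(bits[i:i + BLOCK])
        in_super += ones
        total += ones
    return Ds, Db

def rank1(bits, Ds, Db, p):
    """Number of 1s in bits[0..p] (0-based, inclusive): superblock counter +
    block counter + count inside the block (popcount on a real machine word)."""
    b = p // BLOCK
    return Ds[p // SUPER] + Db[b] + sum(bits[b * BLOCK:p + 1])
```

The result matches a naive prefix count at every position while storing only one int per 256 bits and one byte per 32 bits.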
Bit Sequences Select1 in O(log n) time with the same structures select1(p): in practice, a binary search using rank.
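The binary search reduces select to rank; a self-contained sketch where a plain prefix-sum array stands in for the rank structure of the previous slides:

```python
from itertools import accumulate

def select1(bits, j):
    """Position (0-based) of the j-th 1 bit (j >= 1): binary search for the
    smallest p with rank1(p) >= j; here rank1 is a precomputed prefix sum."""
    ranks = list(accumulate(bits))        # ranks[p] = number of 1s in [0..p]
    lo, hi = 0, len(bits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if ranks[mid] < j:
            lo = mid + 1                  # the j-th 1 lies to the right
        else:
            hi = mid
    return lo
```

Each query costs O(log n) rank evaluations, which is why select is slightly slower than rank in this scheme.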
Bit Sequences Compressed representations Compressed bit-sequence representations exist!! Compressed [Raman et al. 2002]; for very sparse bitmaps [Okanohara and Sadakane 2007]; ... see [Navarro 2016].
Integer Sequences access/rank/select on general sequences 50 access (13) = 3 S= 4 4 3 2 6 2 4 2 4 1 1 2 3 5 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Rank 2 (9) = 3 select 4 (3) =7 see [Navarro 2016]
Integer Sequences Wavelet tree (construction) [Grossi et al. 2003] Given a sequence of symbols and an encoding, the bits of the code of each symbol are distributed along the levels of the tree. DATA: A B A C D A C. Codes: A = 00, B = 01, C = 10, D = 11. Wavelet tree: root bitmap 0 0 0 1 1 0 1 (the first code bit of each symbol); left child (0) holds A B A A with bitmap 0 1 0 0; right child (1) holds C D C with bitmap 0 1 0 (the second code bits).
Integer Sequences Wavelet tree (select) Searching for the 1st occurrence of D (code 11)? In B1 (bitmap 0 1 0), the 1st D is the 1st 1: it is the 2nd bit of B1. In the root bitmap B (0 0 0 1 1 0 1), where is the 2nd 1? At position 5. So the 1st D occurs at position 5.
Integer Sequences Wavelet tree (access) Recovering data: which symbol appears at the 6th position? The 6th bit of the root bitmap is 0; how many 0s are there up to position 6? It is the 4th 0. In B0, which bit occurs at position 4? It is set to 0. The codeword read is 00 -> A.
Integer Sequences Wavelet tree (access) Recovering data: which symbol appears at the 7th position? The 7th bit of the root bitmap is 1; how many 1s are there up to position 7? It is the 3rd 1. In B1, which bit occurs at position 3? It is set to 0. The codeword read is 10 -> C.
Integer Sequences Wavelet tree (rank) How many Cs are there up to position 7? In the root bitmap, how many 1s up to position 7? 3 (it is the 3rd 1). In B1, how many 0s up to position 3? 2!! So rank_C(7) = 2. Select (locate) works bottom-up, while access and rank work top-down.
Integer Sequences Wavelet tree (space and times) Each level contains n + o(n) bits, so the tree takes n log σ (1 + o(1)) bits in total. rank/select/access take O(log σ) time.
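A compact balanced wavelet tree sketch over the slides' example (bitmap rank is computed naively here; a real implementation plugs in the o(n)-extra-bits rank directory from the bit-sequence section):

```python
class WaveletTree:
    """Balanced wavelet tree: each node stores a bitmap (0 = symbol belongs
    to the left half of the alphabet, 1 = right half)."""
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            left_syms = set(self.alphabet[:mid])
            self.bits = [0 if c in left_syms else 1 for c in seq]
            self.left = WaveletTree([c for c in seq if c in left_syms],
                                    self.alphabet[:mid])
            self.right = WaveletTree([c for c in seq if c not in left_syms],
                                     self.alphabet[mid:])

    def access(self, i):
        """Symbol at 0-based position i: descend using rank on each bitmap."""
        node = self
        while len(node.alphabet) > 1:
            b = node.bits[i]
            i = node.bits[:i].count(b)     # rank_b(i) gives the child position
            node = node.left if b == 0 else node.right
        return node.alphabet[0]

    def rank(self, c, i):
        """Occurrences of c in seq[0..i] (0-based, inclusive)."""
        node, i = self, i + 1              # work with exclusive prefix length
        while len(node.alphabet) > 1:
            mid = len(node.alphabet) // 2
            b = 0 if c in node.alphabet[:mid] else 1
            i = node.bits[:i].count(b)
            if i == 0:
                return 0
            node = node.left if b == 0 else node.right
        return i
```

On DATA = ABACDAC this reproduces the slides' answers (with 0-based positions): position 4 holds D, position 6 holds C, and rank_C up to position 6 is 2.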
Integer Sequences Huffman-shaped (or other) wavelet trees Using Huffman coding (or others), the tree becomes unbalanced. DATA: A B A C D A C -> 1 000 1 01 001 1 01, with codes A = 1, B = 000, C = 01, D = 001. Root bitmap: 1 0 1 0 0 1 0 (the 1-branch is the leaf A A A); the node for B C D C has bitmap 0 1 0 1; below it, the node for B D has bitmap 0 1, and C C is a leaf. Space: nH0(S) + o(n) bits; rank/select/access in O(H0(S)) average time.
A brief review about indexing Inverted indexes are the most well-known index for text. Suffix arrays are powerful but huge full-text indexes. Self-indexes trade performance for a more compact space.
A brief Review about Indexing Text indexing: well-known structures from the Web Traditional indexes (with or without compression): an auxiliary structure plus the explicit text (inverted indexes, suffix arrays, ...). Compressed self-indexes: the text is implicit (wavelet trees, compressed suffix arrays, FM-index, LZ-index, ...).
A brief Review about Indexing Inverted indexes A vocabulary (DCC, communications, compression, image, data, information, Cliff Lodge, ...) plus posting lists. Full-positional indexes store every word offset; doc-addressing inverted indexes store only document identifiers (a space/time trade-off). Example indexed text: "DCC is held at the Cliff Lodge convention center. It is an international forum for current work on data compression and related applications. DCC addresses not only compression methods for specific types of data (text, image, video, audio, space, graphics, web content, [...]) ... also the use of techniques from information theory and data compression in networking, communications, and storage applications involving large datasets (including image and information mining, retrieval, archiving, backup, communications, and HCI)." Searches: a word -> fetch its posting list; a phrase -> intersect the postings. Compression applies both to the indexed text (Huffman, ...) and to the posting lists (Rice, ...).
A brief Review about Indexing Inverted indexes Posting lists contain increasing integers, and the gaps between consecutive integers are smaller in the longest lists.
Original posting list: 4 10 15 25 29 40 46 54 57 70 79 82
Differences (gaps):    4  6  5 10  4 11  6  8  3 13  9  3
Var-length coding: c4 c6 c5 c10 c4 c11 c6 c8 c3 c13 c9 c3 (requires complete decompression).
Absolute sampling + var-length coding: 4 c6 c5 c10 | 29 c11 c6 c8 | 57 c13 c9 c3 (direct access, partial decompression).
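Gaps plus byte-wise variable-length coding can be sketched as follows (a common vbyte layout: 7 data bits per byte, with the high bit set on the last byte of each gap; absolute sampling is omitted):

```python
def vbyte_encode(postings):
    """Encode an increasing posting list as variable-byte-coded gaps."""
    out, prev = bytearray(), 0
    for p in postings:
        gap, prev = p - prev, p
        chunks = []
        while True:
            chunks.append(gap & 0x7F)     # take 7 bits at a time
            gap >>= 7
            if gap == 0:
                break
        chunks.reverse()                  # most significant group first
        out.extend(chunks[:-1])           # continuation bytes: high bit 0
        out.append(chunks[-1] | 0x80)     # last byte of the gap: high bit 1
    return bytes(out)

def vbyte_decode(data):
    postings, value, prev = [], 0, 0
    for b in data:
        value = (value << 7) | (b & 0x7F)
        if b & 0x80:                      # stop bit: the gap is complete
            prev += value                 # undo the differencing
            postings.append(prev)
            value = 0
    return postings
```

The slide's list 4, 10, ..., 82 has all gaps below 128, so it compresses to one byte per entry.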
A brief Review about Indexing Suffix Arrays Sorting all the suffixes of T lexicographically 63 T = 1 2 3 4 5 6 7 8 9 10 11 12 a b r a c a d a b r a $ 1 2 3 4 5 6 7 8 9 10 11 12 A = 12 11 8 1 4 6 9 2 5 7 10 3 racadabra$ ra$ dabra$ cadabra$ bracadabra$ bra$ adabra$ acadabra$ abracadabra$ abra$ a$ $
A brief Review about Indexing Suffix Arrays Binary search for any pattern: ab 64 P = a b T = 1 2 3 4 5 6 7 8 9 10 11 12 a b r a c a d a b r a $ 1 2 3 4 5 6 7 8 9 10 11 12 A = 12 11 8 1 4 6 9 2 5 7 10 3
A brief Review about Indexing Suffix Arrays Binary search for the pattern P = ab: the matches form the interval A[3..4]. noccs = (4 - 3) + 1 = 2 occurrences, at positions A[3] and A[4] = {8, 1}. Fast: O(m log n) time to count, O(m log n + noccs) to locate. Space: O(4n) bytes for A, plus T.
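Construction and the double binary search in a naive sketch (1-based suffix positions, as in the slides; real construction is linear-time, not this O(n^2 log n) sort):

```python
def suffix_array(T):
    """Sort the starting positions (1-based) of all suffixes of T."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def sa_range(T, A, P):
    """Two binary searches delimiting the A-interval of suffixes prefixed
    by P; the occurrences are A[start:end]."""
    m = len(P)
    lo, hi = 0, len(A)
    while lo < hi:                        # first suffix >= P
        mid = (lo + hi) // 2
        if T[A[mid] - 1:] < P:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(A)
    while lo < hi:                        # first suffix whose m-prefix > P
        mid = (lo + hi) // 2
        if T[A[mid] - 1:][:m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return start, lo
```

For T = abracadabra$ it rebuilds the array of the slides and finds the two occurrences of ab at positions 8 and 1.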
A brief Review about Indexing BWT FM-index BWT(S) + a few other structures: it is an index. C[c]: for each char c in S, the number of occurrences in S of the chars that are lexicographically smaller than c: C[$]=0, C[i]=1, C[m]=5, C[p]=6, C[s]=8. Occ(c, k): number of occurrences of char c in the prefix L[1..k]. For k in [1..12]:
Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1
Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4
Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1
Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2
Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4
Char L[i] occurs in F at position LF(i) = C[L[i]] + Occ(L[i], i).
A brief Review about Indexing BWT FM-index Count(S[1,u], P[1,p]) processes the pattern backwards (for "issi": i, s, s, i), maintaining at each step the interval of rows of the sorted matrix that start with the current suffix of P, using only C[] and Occ(). Example: Count(S, "issi").
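Backward search in Python, with C and Occ computed naively (a real FM-index answers Occ with rank on a representation of L, as the next slide notes):

```python
def backward_search(L, P):
    """Count occurrences of P in S given L = BWT(S): maintain the row range
    [sp, ep) of the sorted rotation matrix prefixed by the current suffix of P."""
    C, total = {}, 0
    for c in sorted(set(L)):              # C[c] = chars smaller than c in L
        C[c] = total
        total += L.count(c)
    def occ(c, k):                        # occurrences of c in L[0:k]
        return L[:k].count(c)
    sp, ep = 0, len(L)
    for c in reversed(P):                 # extend the match one char to the left
        if c not in C:
            return 0
        sp = C[c] + occ(c, sp)
        ep = C[c] + occ(c, ep)
        if sp >= ep:                      # empty range: no occurrences
            return 0
    return ep - sp
```

For L = "ipssm$pissii" (the BWT of "mississippi$"), Count("issi") = 2.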
A brief Review about Indexing BWT FM-index Representing L with a wavelet tree, Occ is answered with rank and the structure is compressed.
Bibliography
1. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Systems Research Center, 1994. http://gatekeeper.dec.com/pub/dec/src/researchreports/.
2. F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proc. 15th SPIRE, LNCS 5280, pages 176-187, 2008.
3. Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index. In Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington (USA), 2001.
4. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005.
5. Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23-38, February 1994.
6. A. Golynski, I. Munro, and S. Rao. Rank/select operations on large alphabets: a tool for text indexing. In Proc. 17th SODA, pages 368-373, 2006.
7. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th SODA, pages 841-850, 2003.
8. David A. Huffman. A method for the construction of minimum-redundancy codes. Proc. of the Institute of Radio Engineers, 40(9):1098-1101, 1952.
9. N. J. Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722-1732, 2000.
10. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993.
11. Alistair Moffat and Andrew Turpin. Compression and Coding Algorithms. Kluwer, 2002. ISBN 0-7923-7668-4.
12. I. Munro. Tables. In Proc. 16th FSTTCS, LNCS 1180, pages 37-42, 1996.
13. Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1), article 2, 2007.
14. Gonzalo Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, 570 pages, 2016.
15. D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th ALENEX, 2007.
16. R. Raman, V. Raman, and S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th SODA, pages 233-242, 2002.
17. Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2):113-139, 2000.
18. Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999.
19. J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, 1977.
20. J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530-536, 1978.
(To compress is to Conquer) Compact Data Structures Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 3rd KEYSTONE Training School: Keyword search in Big Linked Data, 23rd August 2017 (Thanks: slides partially by Susana Ladra, E. Rodríguez, and José R. Paramá)