Compact Data Structures


1 (To compress is to Conquer) Compact Data Structures. Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto. 3rd KEYSTONE Training School: Keyword search in Big Linked Data. 23rd August 2017

2 Agenda: Introduction. Basic compression. Sequences: bit sequences, integer sequences. A brief review about indexing.

3 Introduction to Compact Data Structures. Compact data structures lie at the intersection of Data Structures (indexing) and Information Theory (compression): one looks for data representations that not only use space close to the minimum possible (as in compression) but also allow some operations to be carried out efficiently on the data.

4 Introduction. Why compression? Disks are cheap!! But they are also slow! Compression can help more data fit in main memory (access to memory is around 10^6 times faster than to HDD). CPU speed keeps increasing faster, so we can trade processing time (needed to decompress data) for space.

5 Introduction. Why compression? Compression does not only reduce space! It also reduces I/O on disks and networks, and even processing time (less data has to be processed), if appropriate methods are used: for example, methods that allow handling the data in compressed form all the time. (Figure: a text collection Doc 1..Doc n at 100% of its size vs. the same collection compressed to about 30% while still searchable ("Let's search for Keystone"), or to about 20% with p7zip and similar tools.)

6 Introduction. Why indexing? Indexing permits sublinear search time. (Figure: an index over terms term_1..term_n, adding 5-30% of extra space, points into the text collection, kept either in plain form (100%) or compressed (about 30%), to answer queries such as "Let's search for Keystone".)

7 Introduction. Why compact data structures? Self-indexes: sublinear search time, with the text implicitly kept. (Figure: instead of an index taking 5-30% of extra space on top of the text collection, a self-index (WT, WCSA, ...) replaces the collection and still answers "Let's search for Keystone".)

8 Agenda: Introduction. Basic compression. Sequences: bit sequences, integer sequences. A brief review about indexing.

9 Compression. Compressing aims at representing data within less space. How does it work? What are the most traditional compression techniques?

10 Basic Compression. Modeling & Coding. A compressor may use as its source alphabet: a fixed number of symbols (statistical compressors), e.g. 1 char or 1 word; or a variable number of symbols (dictionary-based compressors), e.g. the 1st occurrence of 'a' is encoded alone and the 2nd occurrence is encoded together with the next symbol ('ax'). Codes are built using symbols of a target alphabet, either fixed-length codes (10 bits, 1 byte, 2 bytes, ...) or variable-length codes (1, 2, 3, 4 bits/bytes, ...). This yields the classification: fixed-to-fixed (no compression), fixed-to-variable (statistical compressors), variable-to-fixed (dictionary-based compressors), variable-to-variable (var2var).

11 Basic Compression. Main families of compressors. Taxonomy: dictionary-based (gzip, compress, p7zip, ...), grammar-based (BPE, Re-Pair), statistical compressors (Huffman, arithmetic, Dense, PPM, ...). Statistical compressors gather the frequencies of the source symbols and assign shorter codewords to the most frequent symbols, thereby obtaining compression.

12 Basic Compression. Dictionary-based compressors. How do they achieve compression? They assign fixed-length codewords to variable-length symbols (text substrings); the longer the replaced substring, the better the compression. Well-known representatives: the Lempel-Ziv family: LZ77 (1977): gzip, PKZIP, ARJ, p7zip; LZ78 (1978); LZW (1984): compress, GIF images.

13 Basic Compression. LZW (example). Starts with an initial dictionary D (containing the symbols of the source alphabet). From a given position of the text, while D contains w, keep reading a prefix w = w_0 w_1 w_2 ... If w_0...w_k w_{k+1} is not in D (but w_0...w_k is!): output i = entrypos(w_0...w_k) (note: the codeword takes log2(|D|) bits), add w_0...w_k w_{k+1} to D, and continue from w_{k+1} on (included). If the dictionary has a limited length, policies such as LRU or truncate & go are applied. A small sketch follows.
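As a concrete illustration of the steps above, here is a minimal LZW compression sketch in Python (not the slides' code; the helper name lzw_compress and the byte-sized initial dictionary are assumptions for the example):

```python
# Minimal LZW compression sketch: emits a list of dictionary entry indices;
# each codeword would take ceil(log2(|D|)) bits at the moment it is emitted.

def lzw_compress(text):
    D = {chr(i): i for i in range(256)}     # initial dictionary: one entry per symbol
    output, w = [], ""
    for c in text:
        if w + c in D:
            w += c                          # keep extending the longest known prefix
        else:
            output.append(D[w])             # output entrypos(w0...wk)
            D[w + c] = len(D)               # add w0...wk w(k+1) to D
            w = c                           # continue from w(k+1), included
    if w:
        output.append(D[w])
    return output

print(lzw_compress("abababab"))             # [97, 98, 256, 258, 98]
```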

15 Basic Compression. Grammar-based: BPE / Re-Pair. Replaces pairs of symbols by a new one, adding a rule to a dictionary, until no pair repeats twice. Example:
Source sequence: A B C D E A B D E F D E D E F A B E C D
Rule G -> DE gives: A B C G A B G F G G F A B E C D
Rule H -> AB gives: H C G H G F G G F H E C D
Rule I -> GF gives the final Re-Pair sequence: H C G H I G I H E C D
Dictionary of rules: G -> DE, H -> AB, I -> GF.
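A toy sketch of the pair-replacement loop described above, run on the slide's example sequence (illustrative only; the generated nonterminals R0, R1, R2 play the role of the slide's G, H, I):

```python
# Repeatedly replace the most frequent adjacent pair with a fresh nonterminal
# until no pair occurs twice, collecting the dictionary of rules.
from collections import Counter

def repair(seq):
    rules, next_id = {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        (pair, freq) = pairs.most_common(1)[0] if pairs else ((None, None), 0)
        if freq < 2:
            return seq, rules
        new = f"R{next_id}"; next_id += 1
        rules[new] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new); i += 2       # replace the pair
            else:
                out.append(seq[i]); i += 1
        seq = out

seq, rules = repair(list("ABCDEABDEFDEDEFABECD"))
print(seq)    # ['R1', 'C', 'R0', 'R1', 'R2', 'R0', 'R2', 'R1', 'E', 'C', 'D']
print(rules)  # {'R0': ('D', 'E'), 'R1': ('A', 'B'), 'R2': ('R0', 'F')}
```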

16 Basic Compression. Statistical compressors. Assign shorter codewords to the most frequent symbols; they must gather the frequency of each symbol c in S. Compression is lower bounded by the zero-order empirical entropy of the sequence: H0(S) = sum over symbols c of (n_c/n) * log2(n/n_c), where n is the number of symbols in S and n_c the number of occurrences of c. It holds that H0(S) <= log2(|Sigma|), and n*H0(S) is a lower bound on the size of S compressed with any zero-order compressor. Most representative method: Huffman coding.
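A small sketch of how the zero-order entropy just defined can be computed for a sequence (the function name h0 is just illustrative):

```python
# Zero-order empirical entropy: H0(S) = sum_c (n_c/n) * log2(n/n_c).
from collections import Counter
from math import log2

def h0(S):
    n = len(S)
    return sum((nc / n) * log2(n / nc) for nc in Counter(S).values())

S = "ADBAAAABBBBCCCCDDEEE"
print(h0(S))            # about 2.29 bits per symbol for this 5-symbol example
print(h0(S) * len(S))   # lower bound, in bits, for any zero-order compressor
```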

17 Basic Compression. Statistical compressors: Huffman coding. Optimal prefix-free coding: no codeword is a prefix of another, so decoding requires no look-ahead! Asymptotically optimal: Huffman(S) <= n(H0(S)+1) bits. Typically uses bit-wise codewords, yet D-ary Huffman variants exist (D=256 gives byte-wise codewords). Builds a Huffman tree to generate the codewords.

18 Basic Compression. Statistical compressors: Huffman coding. Sort symbols by frequency: S = ADBAAAABBBBCCCCDDEEE.

19 Basic Compression. Statistical compressors: Huffman coding. Bottom-up tree construction: the two subtrees with the smallest frequencies are repeatedly merged until a single tree remains.

24 Basic Compression Statistical compressors: Huffman coding Branch labeling 24

25 Basic Compression Statistical compressors: Huffman coding Code assignment 25

26 Basic Compression. Statistical compressors: Huffman coding. Compression of the sequence S = ADB: concatenate the codewords assigned to A, D and B.
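A compact Huffman-coding sketch, assuming a heap-based bottom-up construction as in the slides' figures (the helper huffman_codes and the code-table representation are assumptions; tie-breaking may differ from the slides, so the exact codewords can vary while remaining optimal):

```python
# Build Huffman codewords bottom-up: repeatedly merge the two least frequent
# subtrees and prepend 0/1 to the codewords of their symbols.
import heapq
from collections import Counter

def huffman_codes(S):
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(S).items())]
    heapq.heapify(heap)
    counter = len(heap)                       # tie-breaker so dicts are never compared
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)       # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("ADBAAAABBBBCCCCDDEEE")
print(codes)                                  # prefix-free codewords per symbol
print("".join(codes[c] for c in "ADB"))       # compressed bits for S = "ADB"
```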

27 Basic Compression. Burrows-Wheeler Transform (BWT). Given S = mississippi$, BWT(S) is obtained by: (1) creating a matrix M with all circular permutations of S, (2) sorting the rows of M, and (3) taking the last column. Unsorted rotations: mississippi$ $mississippi i$mississipp pi$mississip ppi$mississi ippi$mississ sippi$missis ssippi$missi issippi$miss sissippi$mis ssissippi$mi ississippi$m. After sorting: $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi. The first column is F; the last column is L = BWT(S) = ipssm$pissii.
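A direct sketch of the three steps above (building and sorting all rotations explicitly is quadratic, fine for an example but not for large texts):

```python
# BWT: build all rotations of S (already terminated by $), sort them,
# and take the last column.

def bwt(s):
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))   # matrix M, sorted
    return "".join(row[-1] for row in rotations)               # last column L

print(bwt("mississippi$"))   # ipssm$pissii
```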

28 Basic Compression. Burrows-Wheeler Transform: reversible (BWT^-1). Given L = BWT(S), we can recover S = BWT^-1(L). Steps: 1. Sort L to obtain F. 2. Build the LF mapping so that if L[i] = c, k = the number of times c occurs in L[1..i], and j = the position in F of the kth occurrence of c, then LF[i] = j. Example: L[7] = p is the 2nd p in L, so LF[7] = 8, the position of the 2nd occurrence of p in F.

29 Basic Compression. Burrows-Wheeler Transform: reversible (BWT^-1). 3. Recover the source sequence S in n steps: initially p = 6 (the position of $ in L), i = 0, n = 12; in each step: S[n-i] = L[p]; p = LF[p]; i = i+1.

30 Basic Compression. Burrows-Wheeler Transform: reversible (BWT^-1). Step i=0: S[n-i] = L[p], so S[12] = $; p = LF[p] = 1; i = i+1 = 1.

31 Basic Compression. Burrows-Wheeler Transform: reversible (BWT^-1). Step i=1: S[n-i] = L[p], so S[11] = i; p = LF[p] = 2; i = i+1 = 2.

32 Basic Compression. Burrows-Wheeler Transform: reversible (BWT^-1). Repeating the step for i = 2..11 recovers the remaining symbols, yielding S = mississippi$.
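The recovery procedure above, as a small sketch (the stable-sort trick used to build LF is an implementation choice, not necessarily how the slides' figures compute it):

```python
# Invert the BWT using the LF mapping: LF[i] sends the i-th symbol of L to
# its position in F, and S is rebuilt from right to left.

def inverse_bwt(L):
    n = len(L)
    # Stable sort of positions by character: the k-th occurrence of c in L
    # is matched with the k-th occurrence of c in F.
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    S = [""] * n
    p = L.index("$")                # start at the row whose last symbol is $
    for i in range(n):
        S[n - 1 - i] = L[p]
        p = LF[p]
    return "".join(S)

print(inverse_bwt("ipssm$pissii"))   # mississippi$
```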

33 Basic Compression. Bzip2: Burrows-Wheeler Transform (BWT). BWT: many similar symbols end up adjacent. MTF: output the position of the current symbol within the alphabet S = {a,b,c,d,e,...}, kept so that the last used symbol is moved to the beginning of S. RLE: if a value (e.g. 0) appears several times in a row, replace the run by a pair <value,times>, e.g. <0,6>. Finally, a Huffman stage. Why does it work? In a text it is likely that "he" is preceded by "t", "ssi" by "i", ...
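A sketch of the MTF stage described above, applied to the BWT of the running example (the function name mtf_encode is illustrative):

```python
# Move-to-front: output the current position of each symbol and move it to
# the front of the table, so runs of equal symbols become runs of zeros.

def mtf_encode(S, alphabet):
    table = list(alphabet)
    out = []
    for c in S:
        i = table.index(c)                 # position of c in the current table
        out.append(i)
        table.insert(0, table.pop(i))      # move c to the front
    return out

print(mtf_encode("ipssm$pissii", sorted(set("ipssm$pissii"))))
# [1, 3, 4, 0, 4, 4, 3, 4, 4, 0, 1, 0] -- repeated symbols turn into 0s / small values
```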

34 Agenda: Introduction. Basic compression. Sequences: bit sequences, integer sequences. A brief review about indexing.

35 Sequences. We want to represent a sequence of elements compactly and handle it efficiently. (Who is in the 2nd position? How many Barts up to position 5? Where is the 3rd Bart?)

36 Sequences. Plain representation of data. Given a sequence of n integers with maximum value m, we can represent it with n * ceil(log2(m+1)) bits. Example: 16 symbols x 3 bits per symbol = 48 bits, fitting in an array of two 32-bit ints. Direct access is preserved (access to an integer + bit operations).
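A sketch of such a fixed-width packed representation with direct access (here the values are packed into a single Python integer rather than an array of 32-bit words, which is only an implementation convenience):

```python
# Pack each value into b = ceil(log2(m+1)) bits and read position i back
# with shifts and masks: direct access, no decompression of the rest.
from math import ceil, log2

def pack(values, b):
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * b)               # place value i at bit offset i*b
    return word

def access(word, i, b):
    return (word >> (i * b)) & ((1 << b) - 1)

values = [3, 1, 4, 1, 5, 2, 6, 5]          # max value 6 -> b = 3 bits each
b = ceil(log2(max(values) + 1))
w = pack(values, b)
print([access(w, i, b) for i in range(len(values))])   # recovers the sequence
```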

37 Sequences. Compressed representation of data (H0). Is the sequence compressible? Counting the occurrences n_c of each symbol gives H0(S) = 1.59 bits per symbol; Huffman coding achieves 1.62 bits per symbol. But then there is no direct access! (We could add sampling.)

38 Sequences. Summary: plain/compressed access/rank/select. Operations of interest: access(i): value of the i-th symbol. rank_s(i): number of occurrences of symbol s up to position i (count). select_s(i): position of the i-th occurrence of symbol s (locate).
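Naive reference implementations of the three operations, only to pin down their semantics (compact data structures answer them without scanning the sequence):

```python
# Reference definitions of access/rank/select on a sequence S (1-based, as in the slides).

def access(S, i):                   # value of the i-th symbol
    return S[i - 1]

def rank(S, s, i):                  # occurrences of s in S[1..i]
    return S[:i].count(s)

def select(S, s, j):                # position of the j-th occurrence of s
    count = 0
    for pos, c in enumerate(S, start=1):
        count += (c == s)
        if count == j:
            return pos
    return None

S = "ABACDAC"
print(access(S, 5), rank(S, "A", 6), select(S, "C", 2))   # D 3 7
```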

39 Agenda: Introduction. Basic compression. Sequences: bit sequences, integer sequences. A brief review about indexing.

40 Bit Sequences. access/rank/select on bitmaps. Over an example bitmap B: access(19) = 0, rank_1(6) = 3, rank_0(10) = 5, select_0(10) = 15. See [Navarro 2016].

41 Bit Sequences. Applications. Bitmaps are a basic part of most compact data structures (we will see it later in the CSA; also the HDT bitmaps from Javi's talk!). Example: S = AAABBCCCCCCCCDDDEEEEEEEEEEFG takes n log sigma bits; it can be represented with a bitmap B of n bits (marking where each run starts) plus the list of distinct symbols D = ABCDEFG, taking sigma log sigma bits. This saves space, so fast access/rank/select on B is of interest: Where is the 2nd C? How many Cs up to position k?

42 Bit Sequences. Reaching O(1) rank & o(n) bits of extra space (Jacobson, Clark, Munro; variant by Fariña et al., assuming a 32-bit machine word). Step 1: split the bitmap into superblocks of 256 bits and store in Ds the number of 1s up to positions 1+256k (k = 0, 1, 2, ...), e.g. Ds = 0, 27, 45, ... when the first superblock contains 27 ones and the first two contain 45. O(1) time to reach the superblock counter. Space: n/256 superblocks, 1 int each.

43 Bit Sequences. Reaching O(1) rank & o(n) bits of extra space. Step 2: divide each superblock of 256 bits into 8 blocks of 32 bits (machine-word size) and store in Db, for each block, the number of 1s from the beginning of its superblock. O(1) time to reach the block counter; 8 blocks per superblock, 1 byte each.

44 Bit Sequences. Reaching O(1) rank & o(n) bits of extra space. Step 3: rank within the final 32-bit block blk. Finally: rank_1(D, p) = Ds[p / 256] + Db[p / 32] + rank_1(blk, i), where i = p mod 32 (integer divisions). E.g., rank_1(D, 300) = 43 in the example. Yet, how do we compute rank_1(blk, i) in constant time?

45 Bit Sequences. Reaching O(1) rank & o(n) bits of extra space. How to compute rank_1(blk, i) in constant time? Option 1: popcount within a machine word. Option 2: a universal table onesInByte (the answer for each possible byte): only 256 entries storing values in [0..8]. For rank_1(blk, 12), shift blk right by 20 positions so that only the first 12 bits remain, and sum the onesInByte values of the 4 bytes of the shifted word. Overall space: the n bits of B plus the counters (about 1.375 n bits in total with this layout).
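A sketch of the two-level rank structure just described (Python lists stand in for the packed integer and byte arrays, and a slice sum stands in for popcount / the onesInByte table):

```python
# Two-level rank: superblocks of 256 bits store absolute 1-counts (Ds),
# blocks of 32 bits store counts within their superblock (Db), and the
# final word is resolved with popcount.

def build_rank(bits):                        # bits: list of 0/1
    Ds, Db, ones, sb_ones = [], [], 0, 0
    for i, b in enumerate(bits):
        if i % 256 == 0:
            Ds.append(ones)                  # 1s before this superblock
            sb_ones = 0
        if i % 32 == 0:
            Db.append(sb_ones)               # 1s before this block, inside the superblock
        ones += b
        sb_ones += b
    return Ds, Db

def rank1(bits, Ds, Db, p):                  # number of 1s in bits[0..p-1]
    blk_start = (p // 32) * 32
    in_word = sum(bits[blk_start:p])         # stands in for popcount / onesInByte
    return Ds[p // 256] + Db[p // 32] + in_word

bits = [1 if i % 3 == 0 else 0 for i in range(1000)]
Ds, Db = build_rank(bits)
print(rank1(bits, Ds, Db, 300), sum(bits[:300]))   # both 100
```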

46 Bit Sequences. select_1 in O(log n) time with the same structures: select_1(p) is solved, in practice, by binary search using rank.

47 Bit Sequences. Compressed representations. Compressed bit-sequence representations also exist: compressed bitmaps [Raman et al., 2002], representations for very sparse bitmaps [Okanohara and Sadakane, 2007], ... See [Navarro 2016].

48 Agenda: Introduction. Basic compression. Sequences: bit sequences, integer sequences. A brief review about indexing.

49 Integer Sequences. access/rank/select on general sequences. Over an example sequence S: access(13) = 3, rank_2(9) = 3, select_4(3) = 7. See [Navarro 2016].

50 Integer Sequences. Wavelet tree (construction) [Grossi et al 2003]. Given a sequence of symbols and an encoding, the bits of the code of each symbol are distributed along the different levels of the tree. Example: DATA = A B A C D A C, with codes A = 00, B = 01, C = 10, D = 11. The root bitmap stores the first bit of each symbol's code (0001101); symbols with first bit 0 (A, B) go to the left child and those with first bit 1 (C, D) go to the right child, where their second bit is stored.

51 Integer Sequences. Wavelet tree (select). Searching for the 1st occurrence of D (code 11) in DATA = A B A C D A C: in B_1 (the right child bitmap), where is the 1st 1? At position 2, so D is the 2nd element of B_1. In B_root, where is the 2nd 1? At position 5: the 1st D occurs at position 5 of DATA.

52 Integer Sequences. Wavelet tree (access). Recovering data: which symbol appears at the 6th position of DATA = A B A C D A C? In B_root the 6th bit is 0, and there are 4 zeros up to position 6, so we move to position 4 of B_0. The 4th bit of B_0 is 0. The codeword read is 00, i.e. an A.

53 Integer Sequences. Wavelet tree (access). Which symbol appears at the 7th position? In B_root the 7th bit is 1, and there are 3 ones up to position 7, so we move to position 3 of B_1. The 3rd bit of B_1 is 0. The codeword read is 10, i.e. a C.

54 Integer Sequences. Wavelet tree (rank). How many Cs are there up to position 7? In B_root there are 3 ones up to position 7, so we move to position 3 of B_1. How many 0s up to position 3 in B_1? 2, so there are 2 Cs. (Select locates a symbol walking up from its leaf, while access and rank walk down from the root.)

55 Integer Sequences. Wavelet tree (space and times). Each level contains n + o(n) bits, so the whole tree takes n log sigma (1 + o(1)) bits, and rank/select/access take O(log sigma) time.
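A small pointer-based balanced wavelet tree sketch supporting access and rank, following the scheme of the previous slides (class and method names are illustrative; bitmaps are plain Python lists, so rank inside a node is done by scanning instead of the o(n)-extra-space rank structures seen before):

```python
# Balanced wavelet tree: each node stores a bitmap saying whether a symbol
# goes to the left (0) or right (1) half of the alphabet; access and rank
# traverse one bitmap per level.

class WaveletTree:
    def __init__(self, seq):
        self.alphabet = sorted(set(seq))
        self.root = self._build([self.alphabet.index(c) for c in seq],
                                0, len(self.alphabet))

    def _build(self, seq, lo, hi):
        if hi - lo <= 1 or not seq:
            return None
        mid = (lo + hi) // 2
        bits = [0 if c < mid else 1 for c in seq]                   # node bitmap
        left = self._build([c for c in seq if c < mid], lo, mid)
        right = self._build([c for c in seq if c >= mid], mid, hi)
        return (bits, left, right)

    def access(self, i):                       # symbol at position i (1-based)
        node, lo, hi = self.root, 0, len(self.alphabet)
        while hi - lo > 1:
            bits, left, right = node
            mid = (lo + hi) // 2
            b = bits[i - 1]
            i = sum(1 for x in bits[:i] if x == b)   # rank of bit b up to i
            node, lo, hi = (left, lo, mid) if b == 0 else (right, mid, hi)
        return self.alphabet[lo]

    def rank(self, sym, i):                    # occurrences of sym in positions 1..i
        c = self.alphabet.index(sym)
        node, lo, hi = self.root, 0, len(self.alphabet)
        while hi - lo > 1 and i > 0:
            bits, left, right = node
            mid = (lo + hi) // 2
            b = 0 if c < mid else 1
            i = sum(1 for x in bits[:i] if x == b)
            node, lo, hi = (left, lo, mid) if b == 0 else (right, mid, hi)
        return i

wt = WaveletTree("ABACDAC")
print(wt.access(6), wt.rank("C", 7))           # A 2
```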

56 Integer Sequences. Huffman-shaped (or other) wavelet trees. Using Huffman coding (or other variable-length codes) yields an unbalanced tree. Example: DATA = A B A C D A C with codes A = 1, B = 000, C = 01, D = 001. Space becomes n H0(S) + o(n) bits, and rank/select/access take O(H0(S)) time on average.

57 Agenda: Introduction. Basic compression. Sequences: bit sequences, integer sequences. A brief review about indexing.

58 A brief review about indexing. Inverted indexes are the most well-known indexes for text. Suffix arrays are powerful but huge full-text indexes. Self-indexes trade some performance for a much more compact space.

59 A brief Review about Indexing. Text indexing: well-known structures from the Web. Traditional indexes (with or without compression) are auxiliary structures over an explicit text: inverted indexes, suffix arrays, ... Compressed self-indexes keep the text implicitly: wavelet trees, Compressed Suffix Arrays, FM-index, LZ-index, ...

60 A brief Review about Indexing. Inverted indexes. A doc-addressing inverted index keeps a vocabulary (DCC, communications, compression, image, data, information, Cliff Lodge, ...) and, for each term, a posting list of the documents where it occurs; a full-positional index stores exact positions, giving a space/time trade-off. Example indexed text (Doc 1, Doc 2, ...): "DCC is held at the Cliff Lodge convention center. It is an international forum for current work on data compression and related applications. DCC addresses not only compression methods for specific types of data (text, image, video, audio, space, graphics, web content, [...])... also the use of techniques from information theory and data compression in networking, communications, and storage applications involving large datasets (including image and information mining, retrieval, archiving, backup, communications, and HCI)." Searches: a word is solved with the posting list of that word; a phrase with the intersection of postings. Compression can be applied to the indexed text (Huffman, ...) and to the posting lists (Rice, ...).

61 A brief Review about Indexing. Inverted indexes. Posting lists contain increasing integers, and the gaps between consecutive integers are smaller in the longest lists. So: store the differences (gaps) and encode them with variable-length codes; this, however, requires decompressing a list completely before using it. Adding absolute sampling on top of the variable-length coding gives direct access and partial decompression.
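A sketch of gap encoding plus variable-byte coding for a posting list (vbyte_encode/vbyte_decode are illustrative helpers; a real system would also keep the absolute samples mentioned above for direct access):

```python
# Posting-list compression: store gaps between increasing doc ids and encode
# each gap with variable-byte coding (7 data bits per byte; the high bit
# marks the last byte of a number).

def vbyte_encode(nums):
    out = bytearray()
    for n in nums:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)             # last byte of this number
    return bytes(out)

def vbyte_decode(data):
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            nums.append(n | ((b & 0x7F) << shift)); n, shift = 0, 0
        else:
            n |= (b & 0x7F) << shift; shift += 7
    return nums

postings = [4, 10, 15, 44, 73, 79, 87, 144, 157, 166, 169]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
enc = vbyte_encode(gaps)
ids, total = [], 0
for g in vbyte_decode(enc):              # rebuild absolute ids by prefix sums
    total += g; ids.append(total)
print(len(enc), ids == postings)         # compressed size in bytes, True
```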

62 A brief Review about Indexing. Suffix Arrays. Sort all the suffixes of T lexicographically: for T = a b r a c a d a b r a $ the sorted suffixes are $, a$, abra$, abracadabra$, acadabra$, adabra$, bra$, bracadabra$, cadabra$, dabra$, ra$, racadabra$, so A = [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3].

69 A brief Review about Indexing. Suffix Arrays. Binary search for any pattern, e.g. P = ab over T = a b r a c a d a b r a $: the matches form a contiguous range of A, here A[3..4], so noccs = (4-3)+1 = 2 and the occurrences are at positions A[3..4] = {8, 1}. Fast: counting in O(m log n) time, locating in O(m log n + noccs). Space: about 4n bytes for A (one integer per text position) plus the text T.
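A naive sketch of suffix-array construction and binary search for the example above (building explicit suffix strings is quadratic in space; it only illustrates the O(m log n) search idea):

```python
# Suffix array: sort all suffix start positions of T by their suffixes, then
# binary-search the pattern over that sorted order.
from bisect import bisect_left, bisect_right

def suffix_array(T):
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])   # 1-based positions

def search(T, A, P):
    suffixes = [T[i - 1:] for i in A]                  # sorted suffixes
    lo = bisect_left(suffixes, P)
    hi = bisect_right(suffixes, P + chr(0x10FFFF))     # end of the range prefixed by P
    return [A[k] for k in range(lo, hi)]

T = "abracadabra$"
A = suffix_array(T)
print(A)                   # [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
print(search(T, A, "ab"))  # occurrences at positions 8 and 1
```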

70 A brief Review about Indexing. BWT FM-index. BWT(S) plus a few extra structures is an index. C[c]: for each char c in S, the number of occurrences in S of the chars that are lexicographically smaller than c: C[$]=0, C[i]=1, C[m]=5, C[p]=6, C[s]=8. Occ(c, k): number of occurrences of char c in the prefix L[1..k]. For k in [1..12]: Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1; Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4; Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1; Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2; Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4. Char L[i] occurs in F at position LF(i) = C[L[i]] + Occ(L[i], i).

71 A brief Review about Indexing. BWT FM-index. Count(S[1,u], P[1,p]): backward search processes P from its last character to the first, maintaining the range [sp, ep] of sorted rows prefixed by the part of P processed so far; for each character c the range is updated using C and Occ (sp = C[c] + Occ(c, sp-1) + 1; ep = C[c] + Occ(c, ep)). Example: Count(S, "issi") processes i, s, s, i and returns the number of occurrences of "issi".
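A sketch of that backward search over L = BWT(mississippi$), with C and Occ computed on the fly by scanning L (a real FM-index would answer Occ with rank over a wavelet tree, as the next slide notes):

```python
# Backward-search count over L = BWT(S), using C and Occ as defined above.

def count(L, P):
    C, total = {}, 0
    for c in sorted(set(L)):              # C[c]: chars in S lexicographically smaller than c
        C[c] = total
        total += L.count(c)
    occ = lambda c, k: L[:k].count(c)     # occurrences of c in L[1..k] (1-based k)
    sp, ep = 1, len(L)                    # rows prefixed by the processed suffix of P
    for c in reversed(P):                 # process the pattern right to left
        sp = C[c] + occ(c, sp - 1) + 1
        ep = C[c] + occ(c, ep)
        if sp > ep:
            return 0
    return ep - sp + 1

L = "ipssm$pissii"                        # BWT of mississippi$
print(count(L, "issi"))                   # 2: "issi" occurs twice in mississippi
```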

72 A brief Review about Indexing. BWT FM-index. Representing L with a wavelet tree, Occ is answered with rank and the representation is compressed.

73 Bibliography
M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Systems Research Center, 1994.
F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proc. 15th SPIRE, LNCS 5280, 2008.
Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index. In Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington (USA), 2001.
Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4), 2005.
Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23-38, February 1994.
A. Golynski, I. Munro, and S. Rao. Rank/select operations on large alphabets: a tool for text indexing. In Proc. 17th SODA, 2006.
R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th SODA, 2003.

74 Bibliography
David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers, 40(9), 1952.
N. J. Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11), 2000.
U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5), 1993.
Alistair Moffat and Andrew Turpin. Compression and Coding Algorithms. Kluwer, 2002.
I. Munro. Tables. In Proc. 16th FSTTCS, LNCS 1180, pages 37-42, 1996.
Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1), article 2, 2007.
Gonzalo Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, 570 pages, 2016.
D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th ALENEX, 2007.

75 Bibliography
R. Raman, V. Raman, and S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th SODA, 2002.
Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2), 2000.
Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 2nd edition, 1999.
J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3), 1977.
J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5), 1978.

76 (To compress is to Conquer) Compact Data Structures. Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto. 3rd KEYSTONE Training School: Keyword search in Big Linked Data. 23rd August 2017. (Thanks: slides partially by Susana Ladra, E. Rodríguez, and José R. Paramá.)
