Compact Data Structures


1 (To compress is to Conquer) Compact Data Structures. Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto. 3rd KEYSTONE Training School: Keyword search in Big Linked Data. 23rd August 2017

2 Agenda: Introduction. Basic compression. Sequences: bit sequences, integer sequences. A brief review about indexing.

3 Introduction to Compact Data Structures. Compact data structures lie at the intersection of Data Structures (indexing) and Information Theory (compression): one looks for data representations that not only use space close to the minimum possible (as in compression) but also allow some operations to be carried out efficiently on the data.

4 Introduction. Why compression? Disks are cheap!! But they are also slow! Compression can help more data fit in main memory (access to memory is around 10^6 times faster than to HDD). CPU speed keeps increasing faster, so we can trade processing time (needed to decompress data) for space.

5 Introduction. Why compression? Compression does not only reduce space! It also reduces I/O on disks and networks, and even processing time (less data has to be processed), if appropriate methods are used: for example, methods that allow handling the data in compressed form all the time. (Figure: a text collection Doc 1..Doc n at 100% of its size vs. the same collection compressed to about 30% while still searchable ("Let's search for Keystone"), or to about 20% with p7zip and similar tools.)

6 Introduction. Why indexing? Indexing permits sublinear search time. (Figure: an index over terms term_1..term_n, adding 5-30% of extra space, points into the text collection, kept either in plain form (100%) or compressed (about 30%), to answer queries such as "Let's search for Keystone".)

7 Introduction. Why compact data structures? Self-indexes: sublinear search time, with the text implicitly kept. (Figure: instead of an index taking 5-30% of extra space on top of the text collection, a self-index (WT, WCSA, ...) replaces the collection and still answers "Let's search for Keystone".)

8 Agenda: Introduction. Basic compression. Sequences: bit sequences, integer sequences. A brief review about indexing.

9 Compression. Compressing aims at representing data within less space. How does it work? What are the most traditional compression techniques?

10 Basic Compression. Modeling & Coding. A compressor may use as its source alphabet: a fixed number of symbols (statistical compressors), e.g. 1 char or 1 word; or a variable number of symbols (dictionary-based compressors), e.g. the 1st occurrence of 'a' is encoded alone and the 2nd occurrence is encoded together with the next symbol ('ax'). Codes are built using symbols of a target alphabet, either fixed-length codes (10 bits, 1 byte, 2 bytes, ...) or variable-length codes (1, 2, 3, 4 bits/bytes, ...). This yields the classification: fixed-to-fixed (no compression), fixed-to-variable (statistical compressors), variable-to-fixed (dictionary-based compressors), variable-to-variable (var2var).

11 Basic Compression. Main families of compressors. Taxonomy: dictionary-based (gzip, compress, p7zip, ...), grammar-based (BPE, Re-Pair), statistical compressors (Huffman, arithmetic, Dense, PPM, ...). Statistical compressors gather the frequencies of the source symbols and assign shorter codewords to the most frequent symbols, thereby obtaining compression.

12 Basic Compression. Dictionary-based compressors. How do they achieve compression? They assign fixed-length codewords to variable-length symbols (text substrings); the longer the replaced substring, the better the compression. Well-known representatives: the Lempel-Ziv family: LZ77 (1977): gzip, PKZIP, ARJ, p7zip; LZ78 (1978); LZW (1984): compress, GIF images.

13 Basic Compression. LZW (example). Starts with an initial dictionary D (containing the symbols of the source alphabet). From a given position of the text, while D contains w, keep reading a prefix w = w_0 w_1 w_2 ... If w_0...w_k w_{k+1} is not in D (but w_0...w_k is!): output i = entrypos(w_0...w_k) (note: the codeword takes log2(|D|) bits), add w_0...w_k w_{k+1} to D, and continue from w_{k+1} on (included). If the dictionary has a limited length, policies such as LRU or truncate & go are applied. A small sketch follows.
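As a concrete illustration of the steps above, here is a minimal LZW compression sketch in Python (not the slides' code; the helper name lzw_compress and the byte-sized initial dictionary are assumptions for the example):

```python
# Minimal LZW compression sketch: emits a list of dictionary entry indices;
# each codeword would take ceil(log2(|D|)) bits at the moment it is emitted.

def lzw_compress(text):
    D = {chr(i): i for i in range(256)}     # initial dictionary: one entry per symbol
    output, w = [], ""
    for c in text:
        if w + c in D:
            w += c                          # keep extending the longest known prefix
        else:
            output.append(D[w])             # output entrypos(w0...wk)
            D[w + c] = len(D)               # add w0...wk w(k+1) to D
            w = c                           # continue from w(k+1), included
    if w:
        output.append(D[w])
    return output

print(lzw_compress("abababab"))             # [97, 98, 256, 258, 98]
```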

15 Basic Compression. Grammar-based: BPE / Re-Pair. Replaces pairs of symbols by a new one, adding a rule to a dictionary, until no pair repeats twice. Example:
Source sequence: A B C D E A B D E F D E D E F A B E C D
Rule G -> DE gives: A B C G A B G F G G F A B E C D
Rule H -> AB gives: H C G H G F G G F H E C D
Rule I -> GF gives the final Re-Pair sequence: H C G H I G I H E C D
Dictionary of rules: G -> DE, H -> AB, I -> GF.
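A toy sketch of the pair-replacement loop described above, run on the slide's example sequence (illustrative only; the generated nonterminals R0, R1, R2 play the role of the slide's G, H, I):

```python
# Repeatedly replace the most frequent adjacent pair with a fresh nonterminal
# until no pair occurs twice, collecting the dictionary of rules.
from collections import Counter

def repair(seq):
    rules, next_id = {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        (pair, freq) = pairs.most_common(1)[0] if pairs else ((None, None), 0)
        if freq < 2:
            return seq, rules
        new = f"R{next_id}"; next_id += 1
        rules[new] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new); i += 2       # replace the pair
            else:
                out.append(seq[i]); i += 1
        seq = out

seq, rules = repair(list("ABCDEABDEFDEDEFABECD"))
print(seq)    # ['R1', 'C', 'R0', 'R1', 'R2', 'R0', 'R2', 'R1', 'E', 'C', 'D']
print(rules)  # {'R0': ('D', 'E'), 'R1': ('A', 'B'), 'R2': ('R0', 'F')}
```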

16 Basic Compression. Statistical compressors. Assign shorter codewords to the most frequent symbols; they must gather the frequency of each symbol c in S. Compression is lower bounded by the zero-order empirical entropy of the sequence: H0(S) = sum over symbols c of (n_c/n) * log2(n/n_c), where n is the number of symbols in S and n_c the number of occurrences of c. It holds that H0(S) <= log2(|Sigma|), and n*H0(S) is a lower bound on the size of S compressed with any zero-order compressor. Most representative method: Huffman coding.
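A small sketch of how the zero-order entropy just defined can be computed for a sequence (the function name h0 is just illustrative):

```python
# Zero-order empirical entropy: H0(S) = sum_c (n_c/n) * log2(n/n_c).
from collections import Counter
from math import log2

def h0(S):
    n = len(S)
    return sum((nc / n) * log2(n / nc) for nc in Counter(S).values())

S = "ADBAAAABBBBCCCCDDEEE"
print(h0(S))            # about 2.29 bits per symbol for this 5-symbol example
print(h0(S) * len(S))   # lower bound, in bits, for any zero-order compressor
```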

17 Basic Compression. Statistical compressors: Huffman coding. Optimal prefix-free coding: no codeword is a prefix of another, so decoding requires no look-ahead! Asymptotically optimal: Huffman(S) <= n(H0(S)+1) bits. Typically uses bit-wise codewords, yet D-ary Huffman variants exist (D=256 gives byte-wise codewords). Builds a Huffman tree to generate the codewords.

18 Basic Compression. Statistical compressors: Huffman coding. Sort symbols by frequency: S = ADBAAAABBBBCCCCDDEEE.

19 Basic Compression. Statistical compressors: Huffman coding. Bottom-up tree construction: the two subtrees with the smallest frequencies are repeatedly merged until a single tree remains.

24 Basic Compression Statistical compressors: Huffman coding Branch labeling 24

25 Basic Compression Statistical compressors: Huffman coding Code assignment 25

26 Basic Compression. Statistical compressors: Huffman coding. Compression of the sequence S = ADB: concatenate the codewords assigned to A, D and B.
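A compact Huffman-coding sketch, assuming a heap-based bottom-up construction as in the slides' figures (the helper huffman_codes and the code-table representation are assumptions; tie-breaking may differ from the slides, so the exact codewords can vary while remaining optimal):

```python
# Build Huffman codewords bottom-up: repeatedly merge the two least frequent
# subtrees and prepend 0/1 to the codewords of their symbols.
import heapq
from collections import Counter

def huffman_codes(S):
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(S).items())]
    heapq.heapify(heap)
    counter = len(heap)                       # tie-breaker so dicts are never compared
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)       # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("ADBAAAABBBBCCCCDDEEE")
print(codes)                                  # prefix-free codewords per symbol
print("".join(codes[c] for c in "ADB"))       # compressed bits for S = "ADB"
```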

27 Basic Compression. Burrows-Wheeler Transform (BWT). Given S = mississippi$, BWT(S) is obtained by: (1) creating a matrix M with all circular permutations of S, (2) sorting the rows of M, and (3) taking the last column. Unsorted rotations: mississippi$ $mississippi i$mississipp pi$mississip ppi$mississi ippi$mississ sippi$missis ssippi$missi issippi$miss sissippi$mis ssissippi$mi ississippi$m. After sorting: $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi. The first column is F; the last column is L = BWT(S) = ipssm$pissii.
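A direct sketch of the three steps above (building and sorting all rotations explicitly is quadratic, fine for an example but not for large texts):

```python
# BWT: build all rotations of S (already terminated by $), sort them,
# and take the last column.

def bwt(s):
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))   # matrix M, sorted
    return "".join(row[-1] for row in rotations)               # last column L

print(bwt("mississippi$"))   # ipssm$pissii
```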

28 Basic Compression. Burrows-Wheeler Transform: reversible (BWT^-1). Given L = BWT(S), we can recover S = BWT^-1(L). Steps: 1. Sort L to obtain F. 2. Build the LF mapping so that if L[i] = c, k = the number of times c occurs in L[1..i], and j = the position in F of the kth occurrence of c, then LF[i] = j. Example: L[7] = p is the 2nd p in L, so LF[7] = 8, the position of the 2nd occurrence of p in F.

29 Basic Compression. Burrows-Wheeler Transform: reversible (BWT^-1). 3. Recover the source sequence S in n steps: initially p = 6 (the position of $ in L), i = 0, n = 12; in each step: S[n-i] = L[p]; p = LF[p]; i = i+1.

30 Basic Compression. Burrows-Wheeler Transform: reversible (BWT^-1). Step i=0: S[n-i] = L[p], so S[12] = $; p = LF[p] = 1; i = i+1 = 1.

31 Basic Compression. Burrows-Wheeler Transform: reversible (BWT^-1). Step i=1: S[n-i] = L[p], so S[11] = i; p = LF[p] = 2; i = i+1 = 2.

32 Basic Compression. Burrows-Wheeler Transform: reversible (BWT^-1). Repeating the step for i = 2..11 recovers the remaining symbols, yielding S = mississippi$.
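The recovery procedure above, as a small sketch (the stable-sort trick used to build LF is an implementation choice, not necessarily how the slides' figures compute it):

```python
# Invert the BWT using the LF mapping: LF[i] sends the i-th symbol of L to
# its position in F, and S is rebuilt from right to left.

def inverse_bwt(L):
    n = len(L)
    # Stable sort of positions by character: the k-th occurrence of c in L
    # is matched with the k-th occurrence of c in F.
    order = sorted(range(n), key=lambda i: (L[i], i))
    LF = [0] * n
    for f_pos, l_pos in enumerate(order):
        LF[l_pos] = f_pos
    S = [""] * n
    p = L.index("$")                # start at the row whose last symbol is $
    for i in range(n):
        S[n - 1 - i] = L[p]
        p = LF[p]
    return "".join(S)

print(inverse_bwt("ipssm$pissii"))   # mississippi$
```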

33 Basic Compression. Bzip2: Burrows-Wheeler Transform (BWT). BWT: many similar symbols end up adjacent. MTF: output the position of the current symbol within the alphabet S = {a,b,c,d,e,...}, kept so that the last used symbol is moved to the beginning of S. RLE: if a value (e.g. 0) appears several times in a row, replace the run by a pair <value,times>, e.g. <0,6>. Finally, a Huffman stage. Why does it work? In a text it is likely that "he" is preceded by "t", "ssi" by "i", ...
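A sketch of the MTF stage described above, applied to the BWT of the running example (the function name mtf_encode is illustrative):

```python
# Move-to-front: output the current position of each symbol and move it to
# the front of the table, so runs of equal symbols become runs of zeros.

def mtf_encode(S, alphabet):
    table = list(alphabet)
    out = []
    for c in S:
        i = table.index(c)                 # position of c in the current table
        out.append(i)
        table.insert(0, table.pop(i))      # move c to the front
    return out

print(mtf_encode("ipssm$pissii", sorted(set("ipssm$pissii"))))
# [1, 3, 4, 0, 4, 4, 3, 4, 4, 0, 1, 0] -- repeated symbols turn into 0s / small values
```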

34 Agenda: Introduction. Basic compression. Sequences: bit sequences, integer sequences. A brief review about indexing.

35 Sequences. We want to represent a sequence of elements compactly and handle it efficiently. (Who is in the 2nd position? How many Barts up to position 5? Where is the 3rd Bart?)

36 Sequences. Plain representation of data. Given a sequence of n integers with maximum value m, we can represent it with n * ceil(log2(m+1)) bits. Example: 16 symbols x 3 bits per symbol = 48 bits, fitting in an array of two 32-bit ints. Direct access is preserved (access to an integer + bit operations).
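A sketch of such a fixed-width packed representation with direct access (here the values are packed into a single Python integer rather than an array of 32-bit words, which is only an implementation convenience):

```python
# Pack each value into b = ceil(log2(m+1)) bits and read position i back
# with shifts and masks: direct access, no decompression of the rest.
from math import ceil, log2

def pack(values, b):
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * b)               # place value i at bit offset i*b
    return word

def access(word, i, b):
    return (word >> (i * b)) & ((1 << b) - 1)

values = [3, 1, 4, 1, 5, 2, 6, 5]          # max value 6 -> b = 3 bits each
b = ceil(log2(max(values) + 1))
w = pack(values, b)
print([access(w, i, b) for i in range(len(values))])   # recovers the sequence
```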

37 Sequences. Compressed representation of data (H0). Is the sequence compressible? Counting the occurrences n_c of each symbol gives H0(S) = 1.59 bits per symbol; Huffman coding achieves 1.62 bits per symbol. But then there is no direct access! (We could add sampling.)

38 Sequences. Summary: plain/compressed access/rank/select. Operations of interest: access(i): value of the i-th symbol. rank_s(i): number of occurrences of symbol s up to position i (count). select_s(i): position of the i-th occurrence of symbol s (locate).
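Naive reference implementations of the three operations, only to pin down their semantics (compact data structures answer them without scanning the sequence):

```python
# Reference definitions of access/rank/select on a sequence S (1-based, as in the slides).

def access(S, i):                   # value of the i-th symbol
    return S[i - 1]

def rank(S, s, i):                  # occurrences of s in S[1..i]
    return S[:i].count(s)

def select(S, s, j):                # position of the j-th occurrence of s
    count = 0
    for pos, c in enumerate(S, start=1):
        count += (c == s)
        if count == j:
            return pos
    return None

S = "ABACDAC"
print(access(S, 5), rank(S, "A", 6), select(S, "C", 2))   # D 3 7
```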

39 Agenda: Introduction. Basic compression. Sequences: bit sequences, integer sequences. A brief review about indexing.

40 Bit Sequences. access/rank/select on bitmaps. Over an example bitmap B: access(19) = 0, rank_1(6) = 3, rank_0(10) = 5, select_0(10) = 15. See [Navarro 2016].

41 Bit Sequences. Applications. Bitmaps are a basic part of most compact data structures (we will see it later in the CSA; also the HDT bitmaps from Javi's talk!). Example: S = AAABBCCCCCCCCDDDEEEEEEEEEEFG takes n log sigma bits; it can be represented with a bitmap B of n bits (marking where each run starts) plus the list of distinct symbols D = ABCDEFG, taking sigma log sigma bits. This saves space, so fast access/rank/select on B is of interest: Where is the 2nd C? How many Cs up to position k?

42 Bit Sequences. Reaching O(1) rank & o(n) bits of extra space (Jacobson, Clark, Munro; variant by Fariña et al., assuming a 32-bit machine word). Step 1: split the bitmap into superblocks of 256 bits and store in Ds the number of 1s up to positions 1+256k (k = 0, 1, 2, ...), e.g. Ds = 0, 27, 45, ... when the first superblock contains 27 ones and the first two contain 45. O(1) time to reach the superblock counter. Space: n/256 superblocks, 1 int each.

43 Bit Sequences. Reaching O(1) rank & o(n) bits of extra space. Step 2: divide each superblock of 256 bits into 8 blocks of 32 bits (machine-word size) and store in Db, for each block, the number of 1s from the beginning of its superblock. O(1) time to reach the block counter; 8 blocks per superblock, 1 byte each.

44 Bit Sequences. Reaching O(1) rank & o(n) bits of extra space. Step 3: rank within the final 32-bit block blk. Finally: rank_1(D, p) = Ds[p / 256] + Db[p / 32] + rank_1(blk, i), where i = p mod 32 (integer divisions). E.g., rank_1(D, 300) = 43 in the example. Yet, how do we compute rank_1(blk, i) in constant time?

45 Bit Sequences. Reaching O(1) rank & o(n) bits of extra space. How to compute rank_1(blk, i) in constant time? Option 1: popcount within a machine word. Option 2: a universal table onesInByte (the answer for each possible byte): only 256 entries storing values in [0..8]. For rank_1(blk, 12), shift blk right by 20 positions so that only the first 12 bits remain, and sum the onesInByte values of the 4 bytes of the shifted word. Overall space: the n bits of B plus the counters (about 1.375 n bits in total with this layout).
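A sketch of the two-level rank structure just described (Python lists stand in for the packed integer and byte arrays, and a slice sum stands in for popcount / the onesInByte table):

```python
# Two-level rank: superblocks of 256 bits store absolute 1-counts (Ds),
# blocks of 32 bits store counts within their superblock (Db), and the
# final word is resolved with popcount.

def build_rank(bits):                        # bits: list of 0/1
    Ds, Db, ones, sb_ones = [], [], 0, 0
    for i, b in enumerate(bits):
        if i % 256 == 0:
            Ds.append(ones)                  # 1s before this superblock
            sb_ones = 0
        if i % 32 == 0:
            Db.append(sb_ones)               # 1s before this block, inside the superblock
        ones += b
        sb_ones += b
    return Ds, Db

def rank1(bits, Ds, Db, p):                  # number of 1s in bits[0..p-1]
    blk_start = (p // 32) * 32
    in_word = sum(bits[blk_start:p])         # stands in for popcount / onesInByte
    return Ds[p // 256] + Db[p // 32] + in_word

bits = [1 if i % 3 == 0 else 0 for i in range(1000)]
Ds, Db = build_rank(bits)
print(rank1(bits, Ds, Db, 300), sum(bits[:300]))   # both 100
```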

46 Bit Sequences. select_1 in O(log n) time with the same structures: select_1(p) is solved, in practice, by binary search using rank.

47 Bit Sequences. Compressed representations. Compressed bit-sequence representations also exist: compressed bitmaps [Raman et al., 2002], representations for very sparse bitmaps [Okanohara and Sadakane, 2007], ... See [Navarro 2016].

48 Agenda: Introduction. Basic compression. Sequences: bit sequences, integer sequences. A brief review about indexing.

49 Integer Sequences. access/rank/select on general sequences. Over an example sequence S: access(13) = 3, rank_2(9) = 3, select_4(3) = 7. See [Navarro 2016].

50 Integer Sequences. Wavelet tree (construction) [Grossi et al 2003]. Given a sequence of symbols and an encoding, the bits of the code of each symbol are distributed along the different levels of the tree. Example: DATA = A B A C D A C, with codes A = 00, B = 01, C = 10, D = 11. The root bitmap stores the first bit of each symbol's code (0001101); symbols with first bit 0 (A, B) go to the left child and those with first bit 1 (C, D) go to the right child, where their second bit is stored.

51 Integer Sequences. Wavelet tree (select). Searching for the 1st occurrence of D (code 11) in DATA = A B A C D A C: in B_1 (the right child bitmap), where is the 1st 1? At position 2, so D is the 2nd element of B_1. In B_root, where is the 2nd 1? At position 5: the 1st D occurs at position 5 of DATA.

52 Integer Sequences. Wavelet tree (access). Recovering data: which symbol appears at the 6th position of DATA = A B A C D A C? In B_root the 6th bit is 0, and there are 4 zeros up to position 6, so we move to position 4 of B_0. The 4th bit of B_0 is 0. The codeword read is 00, i.e. an A.

53 Integer Sequences. Wavelet tree (access). Which symbol appears at the 7th position? In B_root the 7th bit is 1, and there are 3 ones up to position 7, so we move to position 3 of B_1. The 3rd bit of B_1 is 0. The codeword read is 10, i.e. a C.

54 Integer Sequences. Wavelet tree (rank). How many Cs are there up to position 7? In B_root there are 3 ones up to position 7, so we move to position 3 of B_1. How many 0s up to position 3 in B_1? 2, so there are 2 Cs. (Select locates a symbol walking up from its leaf, while access and rank walk down from the root.)

55 Integer Sequences. Wavelet tree (space and times). Each level contains n + o(n) bits, so the whole tree takes n log sigma (1 + o(1)) bits, and rank/select/access take O(log sigma) time.
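A small pointer-based balanced wavelet tree sketch supporting access and rank, following the scheme of the previous slides (class and method names are illustrative; bitmaps are plain Python lists, so rank inside a node is done by scanning instead of the o(n)-extra-space rank structures seen before):

```python
# Balanced wavelet tree: each node stores a bitmap saying whether a symbol
# goes to the left (0) or right (1) half of the alphabet; access and rank
# traverse one bitmap per level.

class WaveletTree:
    def __init__(self, seq):
        self.alphabet = sorted(set(seq))
        self.root = self._build([self.alphabet.index(c) for c in seq],
                                0, len(self.alphabet))

    def _build(self, seq, lo, hi):
        if hi - lo <= 1 or not seq:
            return None
        mid = (lo + hi) // 2
        bits = [0 if c < mid else 1 for c in seq]                   # node bitmap
        left = self._build([c for c in seq if c < mid], lo, mid)
        right = self._build([c for c in seq if c >= mid], mid, hi)
        return (bits, left, right)

    def access(self, i):                       # symbol at position i (1-based)
        node, lo, hi = self.root, 0, len(self.alphabet)
        while hi - lo > 1:
            bits, left, right = node
            mid = (lo + hi) // 2
            b = bits[i - 1]
            i = sum(1 for x in bits[:i] if x == b)   # rank of bit b up to i
            node, lo, hi = (left, lo, mid) if b == 0 else (right, mid, hi)
        return self.alphabet[lo]

    def rank(self, sym, i):                    # occurrences of sym in positions 1..i
        c = self.alphabet.index(sym)
        node, lo, hi = self.root, 0, len(self.alphabet)
        while hi - lo > 1 and i > 0:
            bits, left, right = node
            mid = (lo + hi) // 2
            b = 0 if c < mid else 1
            i = sum(1 for x in bits[:i] if x == b)
            node, lo, hi = (left, lo, mid) if b == 0 else (right, mid, hi)
        return i

wt = WaveletTree("ABACDAC")
print(wt.access(6), wt.rank("C", 7))           # A 2
```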

56 Integer Sequences. Huffman-shaped (or other) wavelet trees. Using Huffman coding (or other variable-length codes) yields an unbalanced tree. Example: DATA = A B A C D A C with codes A = 1, B = 000, C = 01, D = 001. Space becomes n H0(S) + o(n) bits, and rank/select/access take O(H0(S)) time on average.

57 Agenda: Introduction. Basic compression. Sequences: bit sequences, integer sequences. A brief review about indexing.

58 A brief review about indexing. Inverted indexes are the most well-known indexes for text. Suffix arrays are powerful but huge full-text indexes. Self-indexes trade some performance for a much more compact space.

59 A brief Review about Indexing. Text indexing: well-known structures from the Web. Traditional indexes (with or without compression) are auxiliary structures over an explicit text: inverted indexes, suffix arrays, ... Compressed self-indexes keep the text implicitly: wavelet trees, Compressed Suffix Arrays, FM-index, LZ-index, ...

60 A brief Review about Indexing. Inverted indexes. A doc-addressing inverted index keeps a vocabulary (DCC, communications, compression, image, data, information, Cliff Lodge, ...) and, for each term, a posting list of the documents where it occurs; a full-positional index stores exact positions, giving a space/time trade-off. Example indexed text (Doc 1, Doc 2, ...): "DCC is held at the Cliff Lodge convention center. It is an international forum for current work on data compression and related applications. DCC addresses not only compression methods for specific types of data (text, image, video, audio, space, graphics, web content, [...])... also the use of techniques from information theory and data compression in networking, communications, and storage applications involving large datasets (including image and information mining, retrieval, archiving, backup, communications, and HCI)." Searches: a word is solved with the posting list of that word; a phrase with the intersection of postings. Compression can be applied to the indexed text (Huffman, ...) and to the posting lists (Rice, ...).

61 A brief Review about Indexing. Inverted indexes. Posting lists contain increasing integers, and the gaps between consecutive integers are smaller in the longest lists. So: store the differences (gaps) and encode them with variable-length codes; this, however, requires decompressing a list completely before using it. Adding absolute sampling on top of the variable-length coding gives direct access and partial decompression.
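A sketch of gap encoding plus variable-byte coding for a posting list (vbyte_encode/vbyte_decode are illustrative helpers; a real system would also keep the absolute samples mentioned above for direct access):

```python
# Posting-list compression: store gaps between increasing doc ids and encode
# each gap with variable-byte coding (7 data bits per byte; the high bit
# marks the last byte of a number).

def vbyte_encode(nums):
    out = bytearray()
    for n in nums:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)             # last byte of this number
    return bytes(out)

def vbyte_decode(data):
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            nums.append(n | ((b & 0x7F) << shift)); n, shift = 0, 0
        else:
            n |= (b & 0x7F) << shift; shift += 7
    return nums

postings = [4, 10, 15, 44, 73, 79, 87, 144, 157, 166, 169]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
enc = vbyte_encode(gaps)
ids, total = [], 0
for g in vbyte_decode(enc):              # rebuild absolute ids by prefix sums
    total += g; ids.append(total)
print(len(enc), ids == postings)         # compressed size in bytes, True
```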

62 A brief Review about Indexing. Suffix Arrays. Sort all the suffixes of T lexicographically: for T = a b r a c a d a b r a $ the sorted suffixes are $, a$, abra$, abracadabra$, acadabra$, adabra$, bra$, bracadabra$, cadabra$, dabra$, ra$, racadabra$, so A = [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3].

69 A brief Review about Indexing. Suffix Arrays. Binary search for any pattern, e.g. P = ab over T = a b r a c a d a b r a $: the matches form a contiguous range of A, here A[3..4], so noccs = (4-3)+1 = 2 and the occurrences are at positions A[3..4] = {8, 1}. Fast: counting in O(m log n) time, locating in O(m log n + noccs). Space: about 4n bytes for A (one integer per text position) plus the text T.
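A naive sketch of suffix-array construction and binary search for the example above (building explicit suffix strings is quadratic in space; it only illustrates the O(m log n) search idea):

```python
# Suffix array: sort all suffix start positions of T by their suffixes, then
# binary-search the pattern over that sorted order.
from bisect import bisect_left, bisect_right

def suffix_array(T):
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])   # 1-based positions

def search(T, A, P):
    suffixes = [T[i - 1:] for i in A]                  # sorted suffixes
    lo = bisect_left(suffixes, P)
    hi = bisect_right(suffixes, P + chr(0x10FFFF))     # end of the range prefixed by P
    return [A[k] for k in range(lo, hi)]

T = "abracadabra$"
A = suffix_array(T)
print(A)                   # [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
print(search(T, A, "ab"))  # occurrences at positions 8 and 1
```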

70 A brief Review about Indexing. BWT FM-index. BWT(S) plus a few extra structures is an index. C[c]: for each char c in S, the number of occurrences in S of the chars that are lexicographically smaller than c: C[$]=0, C[i]=1, C[m]=5, C[p]=6, C[s]=8. Occ(c, k): number of occurrences of char c in the prefix L[1..k]. For k in [1..12]: Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1; Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4; Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1; Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2; Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4. Char L[i] occurs in F at position LF(i) = C[L[i]] + Occ(L[i], i).

71 A brief Review about Indexing. BWT FM-index. Count(S[1,u], P[1,p]): backward search processes P from its last character to the first, maintaining the range [sp, ep] of sorted rows prefixed by the part of P processed so far; for each character c the range is updated using C and Occ (sp = C[c] + Occ(c, sp-1) + 1; ep = C[c] + Occ(c, ep)). Example: Count(S, "issi") processes i, s, s, i and returns the number of occurrences of "issi".
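A sketch of that backward search over L = BWT(mississippi$), with C and Occ computed on the fly by scanning L (a real FM-index would answer Occ with rank over a wavelet tree, as the next slide notes):

```python
# Backward-search count over L = BWT(S), using C and Occ as defined above.

def count(L, P):
    C, total = {}, 0
    for c in sorted(set(L)):              # C[c]: chars in S lexicographically smaller than c
        C[c] = total
        total += L.count(c)
    occ = lambda c, k: L[:k].count(c)     # occurrences of c in L[1..k] (1-based k)
    sp, ep = 1, len(L)                    # rows prefixed by the processed suffix of P
    for c in reversed(P):                 # process the pattern right to left
        sp = C[c] + occ(c, sp - 1) + 1
        ep = C[c] + occ(c, ep)
        if sp > ep:
            return 0
    return ep - sp + 1

L = "ipssm$pissii"                        # BWT of mississippi$
print(count(L, "issi"))                   # 2: "issi" occurs twice in mississippi
```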

72 A brief Review about Indexing. BWT FM-index. Representing L with a wavelet tree, Occ is answered with rank and the representation is compressed.

73 Bibliography
M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Systems Research Center, 1994.
F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proc. 15th SPIRE, LNCS 5280, 2008.
Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index. In Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington (USA), 2001.
Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4), 2005.
Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23-38, February 1994.
A. Golynski, I. Munro, and S. Rao. Rank/select operations on large alphabets: a tool for text indexing. In Proc. 17th SODA, 2006.
R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th SODA, 2003.

74 Bibliography
David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers, 40(9), 1952.
N. J. Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11), 2000.
U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5), 1993.
Alistair Moffat and Andrew Turpin. Compression and Coding Algorithms. Kluwer, 2002.
I. Munro. Tables. In Proc. 16th FSTTCS, LNCS 1180, pages 37-42, 1996.
Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1), article 2, 2007.
Gonzalo Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, 570 pages, 2016.
D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th ALENEX, 2007.

75 Bibliography
R. Raman, V. Raman, and S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th SODA, 2002.
Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2), 2000.
Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 2nd edition, 1999.
J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3), 1977.
J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5), 1978.

76 (To compress is to Conquer) Compact Data Structures. Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto. 3rd KEYSTONE Training School: Keyword search in Big Linked Data. 23rd August 2017. (Thanks: slides partially by Susana Ladra, E. Rodríguez, and José R. Paramá.)
