Compact Data Structures
1 (To compress is to Conquer) Compact Data Structures Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 3rd KEYSTONE Training School Keyword search in Big Linked Data 23rd August 2017
2 Agenda Introduction Basic compression Sequences Bit sequences Integer sequences A brief Review about Indexing PAGE 2 images: zurb.com
3 Introduction to Compact Data Structures Compact data structures lie at the intersection of Data Structures (indexing) and Information Theory (compression): One looks at data representations that not only permit space close to the minimum possible (as in compression) but also require that those representations allow one to efficiently carry out some operations on the data. PAGE 3 COMPACT DATA STRUCTURES: TO COMPRESS IS TO CONQUER
4 Introduction Why compression? 4 Disks are cheap!! But they are also slow! Compression can help more data fit in main memory (access to memory is around 10^6 times faster than to HDD). CPU speed is increasing faster than disk speed, so we can trade processing time (needed to uncompress data) for space.
5 Introduction Why compression? 5 Compression does not only reduce space! It also reduces I/O on disks and networks, and processing time (less data has to be processed), if appropriate methods are used: for example, methods that allow handling the data in compressed form all the time. [Figure: a text collection (100%) vs. a compressed text collection (30%) that still supports "Let's search for Keystone", vs. a p7zip-style compressed collection (20%) that does not.]
6 Introduction Why indexing? Indexing permits sublinear search time. 6 [Figure: an index (>5-30% extra space, one entry per term, e.g. "Keystone") built over the text collection (100%) or over a compressed text collection (30%), supporting "Let's search for Keystone".]
7 Introduction Why compact data structures? 7 Self-indexes: sublinear search time, with the text implicitly kept inside the index. [Figure: a self-index (WT, WCSA, ...) replaces both the index (>5-30%) and the text collection, and still supports "Let's search for Keystone".]
8 Agenda Introduction Basic compression Sequences Bit sequences Integer sequences A brief Review about Indexing PAGE 8 images: zurb.com
9 Compression Compressing aims at representing data within less space. How does it work? Which are the most traditional compression techniques? PAGE 9 COMPACT DATA STRUCTURES: TO COMPRESS IS TO CONQUER
10 Basic Compression Modeling & Coding 10 A compressor could use as a source alphabet: a fixed number of symbols (statistical compressors), e.g. 1 char or 1 word; or a variable number of symbols (dictionary-based compressors), e.g. the 1st occurrence of 'a' is encoded alone, the 2nd occurrence is encoded together with the next symbol ('ax'). Codes are built using symbols of a target alphabet: fixed-length codes (10 bits, 1 byte, 2 bytes, ...) or variable-length codes (1, 2, 3, 4 bits/bytes, ...). Classification (fixed-to-variable, variable-to-fixed, ...):

  Source \ Target | fixed      | variable
  fixed           | --         | statistical
  variable        | dictionary | var2var
11 Basic Compression Main families of compressors 11 Taxonomy: dictionary-based (gzip, compress, p7zip, ...), grammar-based (BPE, Re-Pair), statistical compressors (Huffman, arithmetic, Dense, PPM, ...). Statistical compressors gather the frequencies of the source symbols and assign shorter codewords to the most frequent symbols to obtain compression.
12 Basic Compression Dictionary-based compressors 12 How do they achieve compression? Assign fixed-length codewords to variable-length symbols (text substrings) The longer the replaced substring the better compression Well-known representatives: Lempel-Ziv family LZ77 (1977): GZIP, PKZIP, ARJ, P7zip LZ78 (1978) LZW (1984): Compress, GIF images
13 EXAMPLE Basic Compression LZW 13 Starts with an initial dictionary D (contains the symbols in Σ). For a given position of the text, while D contains w, read a prefix w = w0 w1 w2 ... If w0...wk wk+1 is not in D (but w0...wk is!): output i = entrypos(w0...wk) (note: codeword length = ceil(log2 |D|)); add w0...wk wk+1 to D; continue from wk+1 on (included). Dictionary has limited length? Policies: LRU, truncate & go, ...
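The loop above can be sketched in Python as follows (a minimal illustration: real coders emit ceil(log2 |D|)-bit codewords and bound the dictionary size; here we just return the entry indices):

```python
def lzw_compress(text):
    """LZW sketch: emit the dictionary index of the longest known
    prefix, then extend the dictionary with that prefix plus the
    next symbol."""
    # Initial dictionary D: one entry per distinct input symbol.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    output = []
    w = ""
    for ch in text:
        if w + ch in dictionary:
            w += ch                               # keep extending the prefix
        else:
            output.append(dictionary[w])          # output entrypos(w)
            dictionary[w + ch] = len(dictionary)  # add w.ch to D
            w = ch                                # continue from ch (included)
    if w:
        output.append(dictionary[w])
    return output, dictionary
```

For example, on "abababab" with initial dictionary {a:0, b:1}, the emitted indices are [0, 1, 2, 4, 1].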
15 Basic Compression Grammar-based: BPE / Re-Pair 15 Replaces pairs of symbols by a new one, until no pair repeats twice; each replacement adds a rule to a dictionary.
  Source sequence:  A B C D E A B D E F D E D E F A B E C D
  Rule DE -> G:     A B C G A B G F G G F A B E C D
  Rule AB -> H:     H C G H G F G G F H E C D
  Rule GF -> I:     H C G H I G I H E C D
Dictionary of rules: G -> DE, H -> AB, I -> GF. Final Re-Pair sequence: H C G H I G I H E C D
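The pair-replacement loop can be sketched as follows (a simplified Re-Pair: the real algorithm uses priority queues to run in linear time, and here fresh nonterminals are just the letters G, H, I, ... as in the slide's example):

```python
from collections import Counter

def repair(seq):
    """Re-Pair sketch: repeatedly replace the most frequent adjacent
    pair by a new symbol until no pair occurs twice; record each
    replacement as a grammar rule."""
    rules = {}
    fresh = iter("GHIJKLMNOPQRSTUVWXYZ")  # generated nonterminal names
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            return seq, rules
        pair, count = max(pairs.items(), key=lambda kv: kv[1])
        if count < 2:
            return seq, rules
        new = next(fresh)
        rules[new] = pair
        out, i = [], 0
        while i < len(seq):                # left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
```

Running it on the slide's sequence reproduces the rules DE->G, AB->H, GF->I and the final sequence H C G H I G I H E C D.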
16 Basic Compression Statistical compressors 16 Assign shorter codewords to the most frequent symbols; must gather the frequency of each symbol c in Σ. Compression is lower bounded by the (zero-order) empirical entropy of the sequence S. With n = number of symbols in S and nc = occurrences of symbol c: H0(S) = sum over c of (nc/n) log2(n/nc) <= log2 |Σ|. n H0(S) is a lower bound on the size of S compressed with a zero-order compressor. Most representative method: Huffman coding.
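The entropy formula above is a one-liner in Python (a small helper for experimenting with the bound; the symbol names are generic):

```python
from collections import Counter
from math import log2

def h0(seq):
    """Zero-order empirical entropy, in bits per symbol:
    H0(S) = sum_c (n_c/n) * log2(n/n_c).
    n * h0(seq) lower-bounds any zero-order compressor on seq."""
    n = len(seq)
    return sum(nc / n * log2(n / nc) for nc in Counter(seq).values())
```

For instance, h0("aaaa") is 0.0 (nothing to encode) and h0("abab") is 1.0 bit per symbol.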
17 Basic Compression Statistical compressors: Huffman coding Optimal prefix-free coding: no codeword is a prefix of another, so decoding requires no look-ahead! Asymptotically optimal: |Huffman(S)| <= n (H0(S) + 1). 17 Typically uses bit-wise codewords, yet D-ary Huffman variants exist (D = 256: byte-wise). Builds a Huffman tree to generate the codewords.
18 Basic Compression Statistical compressors: Huffman coding Sort symbols by frequency: S=ADBAAAABBBBCCCCDDEEE 18
19 Basic Compression Statistical compressors: Huffman coding Bottom-up tree construction (slides 19-23 animate the steps: repeatedly merge the two subtrees with the smallest total frequency under a new parent node)
24 Basic Compression Statistical compressors: Huffman coding Branch labeling 24
25 Basic Compression Statistical compressors: Huffman coding Code assignment 25
26 Basic Compression Statistical compressors: Huffman coding Compression of sequence S = ADB: concatenate the codewords of A, D, and B. 26
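The bottom-up construction of the previous slides can be sketched with a min-heap (an illustrative version: it returns a codeword table rather than an explicit tree, and the 0/1 branch labeling is one arbitrary but valid choice):

```python
import heapq

def huffman_codes(freqs):
    """Bottom-up Huffman: repeatedly merge the two least frequent
    subtrees; prepend a branch bit (0/1) to the codewords of each
    merged subtree."""
    # Heap entries: (frequency, tiebreak, {symbol: partial codeword}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

On the slide's sequence S = ADBAAAABBBBCCCCDDEEE (frequencies A:5, B:5, C:4, D:3, E:3, n = 20) the resulting code uses 46 bits in total, i.e. 2.3 bits per symbol, and is prefix-free.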
27 Basic Compression Burrows-Wheeler Transform (BWT) 27 Given S = mississippi$, BWT(S) is obtained by: (1) creating a matrix M with all circular rotations of S, (2) sorting the rows of M, and (3) taking the last column L.
Rotations: mississippi$ $mississippi i$mississipp pi$mississip ppi$mississi ippi$mississ sippi$missis ssippi$missi issippi$miss sissippi$mis ssissippi$mi ississippi$m
Sorted (F = first column, L = last column = BWT(S)): $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi
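The three steps translate directly to Python (a didactic O(n^2 log n) sketch; practical implementations build the BWT via suffix sorting without materializing the matrix):

```python
def bwt(s):
    """BWT via the slide's recipe: sort all cyclic rotations of s
    (which already ends with the unique sentinel '$') and take the
    last column."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)
```

For the running example, bwt("mississippi$") yields "ipssm$pissii", matching the L column above.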
28 Basic Compression Burrows-Wheeler Transform: reversible (BWT -1 ) 28 Given L=BWT(S), we can recover S=BWT -1 (L) $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]= c, and k= the number of times c occurs in L[1..i], and j=position in F of the kth occurrence of c Then set LF[i]=j mississippi$ pi$mississip ppi$mississi Example: L[7] = p, it is the 2nd p in L LF[7] = 8 which is the 2nd occ of p in F sippi$missis sissippi$mis ssippi$missi ssissippi$mi 5 F L LF
29 Basic Compression Burrows-Wheeler Transform: reversible (BWT -1 ) 29 Given L=BWT(S), we can recover S=BWT -1 (L) $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]= c, and k= the number of times c occurs in L[1..i], and j=position in F of the kth occurrence of c Then set LF[i]=j mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi $ Example: L[7] = p, it is the 2nd p in L LF[7] = 8 which is the 2nd occ of p in F 3. Recover the source sequence S in n steps: Initially p=l=6 (position of $ in L); i=0; n=12; In each step: S[n-i] = L[p]; p = LF[p]; i = i+1; F L LF S
30 Basic Compression Burrows-Wheeler Transform: reversible (BWT -1 ) 30 Given L=BWT(S), we can recover S=BWT -1 (L) $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]= c, and k= the number of times c occurs in L[1..i], and j=position in F of the kth occurrence of c Then set LF[i]=j mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi $ Example: L[7] = p, it is the 2nd p in L LF[7] = 8 which is the 2nd occ of p in F 3. Recover the source sequence S in n steps: Initially p=l=6 (position of $ in L); i=0; n=12; Step i=0: S[n-i] = L[p]; S[12]= $ p = LF[p]; p = 1 i = i+1; i=1 F L LF S
31 Basic Compression Burrows-Wheeler Transform: reversible (BWT -1 ) 31 Given L=BWT(S), we can recover S=BWT -1 (L) $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]= c, and k= the number of times c occurs in L[1..i], and j=position in F of the kth occurrence of c Then set LF[i]=j mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi i $ Example: L[7] = p, it is the 2nd p in L LF[7] = 8 which is the 2nd occ of p in F 3. Recover the source sequence S in n steps: Initially p=l=6 (position of $ in L); i=0; n=12; Step i=1: S[n-i] = L[p]; S[11]= i p = LF[p]; p = 2 i = i+1; i=2 F L LF S
32 Basic Compression Burrows-Wheeler Transform: reversible (BWT -1 ) 32 Given L=BWT(S), we can recover S=BWT -1 (L) $mississippi i$mississipp ippi$mississ issippi$miss ississippi$m m i s s i Steps: 1. Sort L to obtain F 2. Build LF mapping so that If L[i]= c, and k= the number of times c occurs in L[1..i], and j=position in F of the kth occurrence of c Then set LF[i]=j mississippi$ pi$mississip ppi$mississi sippi$missis sissippi$mis ssippi$missi ssissippi$mi s s i p p i $ Example: L[7] = p, it is the 2nd p in L LF[7] = 8 which is the 2nd occ of p in F 3. Recover the source sequence S in n steps: Initially p=l=6 (position of $ in L); i=0; n=12; Step i=1: S[n-i] = L[p]; S[11]= i p = LF[p]; p = 2 i = i+1; i=2 F L LF S
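The inversion procedure of the last slides can be sketched as follows (the LF mapping is built exactly as defined on the slide: the k-th occurrence of c in L maps to the position of the k-th occurrence of c in F; the helper name `ranked` is ours):

```python
def inverse_bwt(L):
    """Invert the BWT: build F = sorted(L), build the LF mapping,
    then walk backwards from '$', filling S right to left."""
    n = len(L)
    F = sorted(L)

    def ranked(seq):
        # Pair each char with its occurrence rank: (c, k-th occ of c).
        seen, out = {}, []
        for ch in seq:
            seen[ch] = seen.get(ch, 0) + 1
            out.append((ch, seen[ch]))
        return out

    pos_in_F = {pair: i for i, pair in enumerate(ranked(F))}
    LF = [pos_in_F[pair] for pair in ranked(L)]   # LF[i] = j as on the slide
    S = [""] * n
    p = L.index("$")                              # start at '$' in L
    for i in range(n):
        S[n - 1 - i] = L[p]                       # S[n-i] = L[p]
        p = LF[p]                                 # p = LF[p]
    return "".join(S)
```

On the running example, inverse_bwt("ipssm$pissii") recovers "mississippi$".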
33 Basic Compression Bzip2: Burrows-Wheeler Transform (BWT) 33 BWT: many similar symbols appear adjacent. MTF: output the position of the current symbol within the alphabet Σ = {a,b,c,d,e, ...}, kept sorted so that the last used symbol is moved to the beginning of Σ. RLE: if a value (0) appears several times, replace the run by a pair <value,times>, e.g. <0,6>. Huffman stage. Why does it work? In a text it is likely that 'he' is preceded by 't', 'ssisii' by 'i', ...
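The MTF stage of the pipeline can be sketched as follows (a direct, list-based illustration; real implementations keep the table update O(1) amortized):

```python
def mtf_encode(s, alphabet):
    """Move-to-front: output each symbol's current position in the
    alphabet list, then move that symbol to the front. Runs of equal
    symbols (frequent after a BWT) become runs of zeros, which the
    following RLE stage exploits."""
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))   # move ch to the front
    return out
```

For example, mtf_encode("aaabbb", "ab") returns [0, 0, 0, 1, 0, 0]: each run after its first symbol turns into zeros.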
34 Agenda Introduction Basic compression Sequences Bit sequences Integer sequences A brief Review about Indexing PAGE 34 images: zurb.com
35 Sequences We want to represent (compactly) a sequence of elements and to efficiently handle them. (Who is in the 2nd position?? How many Barts up to position 5?? Where is the 3rd Bart??) PAGE 35 COMPACT DATA STRUCTURES: TO COMPRESS IS TO CONQUER
36 Sequences Plain Representation of Data 36 Given a sequence of n integers with maximum value m, we can represent it with n ceil(log2(m+1)) bits. E.g., 16 symbols x 3 bits per symbol = 48 bits, stored in an array of two 32-bit ints. Direct access (access to an integer + bit operations).
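Direct access over a bit-packed array is just shifts and masks (a sketch using one big Python int as the bit array; in C this would be an array of machine words):

```python
def pack(values, width):
    """Pack n integers of `width` bits each into one bit array:
    n * width bits instead of n machine words."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    return packed

def access(packed, width, i):
    """Direct access to the i-th value: one shift plus one mask."""
    return (packed >> (i * width)) & ((1 << width) - 1)
```

For example, five 3-bit values fit in 15 bits and each one is still retrievable in O(1).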
37 Sequences Compressed Representation of Data (H0) Is it compressible? Gather the occurrences (nc) of each symbol: in the example, H0(S) = 1.59 bits per symbol and Huffman achieves 1.62 bits per symbol. But then there is no direct access to the i-th symbol! (We could add sampling.)
38 Sequences Summary: Plain/Compressed access/rank/select Operations of interest: access(i): value of the i-th symbol. rank_s(i): number of occurrences of symbol s up to position i (count). select_s(i): where is the i-th occurrence of symbol s? (locate)
39 Agenda Introduction Basic compression Sequences Bit sequences Integer sequences A brief Review about Indexing PAGE 39 images: zurb.com
40 Bit Sequences access/rank/select on bitmaps 40 On the example bitmap B: access(19) = 0; rank1(6) = 3; rank0(10) = 5; select0(10) = 15. See [Navarro 2016].
41 Bit Sequences Applications 41 Bitmaps are a basic part of most compact data structures. Example (we will see it later in the CSA): HDT bitmaps from Javi's talk!!! S: AAABBCCCCCCCCDDDEEEEEEEEEEFG takes n log σ bits; instead, a bitmap B of n bits marking where each run starts, plus D: ABCDEFG with the distinct symbols (σ log σ bits), saves space. Fast access/rank/select is of interest!! Where is the 2nd C? How many Cs up to position k?
42 Bit Sequences Reaching O(1) rank & o(n) bits of extra space (Jacobson, Clark, Munro; variant by Fariña et al., assuming a 32-bit machine word) 42 Step 1: split the bitmap into superblocks of 256 bits, and store in Ds the number of 1s up to positions 1+256k (k = 0, 1, 2, ...): e.g. the first superblock contains 27 ones and the first two contain 45, so Ds = 0, 27, 45, ... O(1) time to reach the superblock counter. Space: n/256 superblocks, 1 int each.
43 Bit Sequences Reaching O(1) rank & o(n) bits of extra space Step 2: divide each superblock of 256 bits into 8 blocks of 32 bits each (machine-word size), and store in Db the number of ones from the beginning of the superblock: e.g. Db = 0, 4, 6, 8, ... within one superblock. O(1) time to reach the block counter; 8 blocks per superblock, 1 byte each.
44 Bit Sequences Reaching O(1) rank & o(n) bits of extra space 44 Step 3: rank within a 32-bit block blk. Finally solving: rank1(D, p) = Ds[p / 256] + Db[p / 32] + rank1(blk, i), where i = p mod 32. Ex: rank1(D, 300) = 43. Yet, how to compute rank1(blk, i) in constant time?
45 Bit Sequences Reaching O(1) rank & o(n) bits of extra space 45 How to compute rank1(blk, i) in constant time? Option 1: popcount within a machine word. Option 2: a universal table onesInByte (the solution for each byte): only 256 entries storing values in [0..8]. E.g. for rank1(blk, 12): shift blk by 32 - 12 = 20 positions so that only the relevant bits remain, then sum the onesInByte values of the (up to) 4 bytes of blk. Overall extra space: o(n) bits on top of the n-bit bitmap.
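The three-level directory (superblocks + blocks + byte table), together with the rank-based binary search for select of the next slide, can be sketched as follows (class and field names are ours; a real implementation packs Ds and Db into ints and bytes as the slides describe):

```python
class RankBitmap:
    """Rank directory sketch: superblocks of 256 bits (Ds, cumulative),
    blocks of 32 bits (Db, cumulative within the superblock), plus a
    256-entry popcount table for the final partial block."""
    ONES_IN_BYTE = [bin(b).count("1") for b in range(256)]

    def __init__(self, bits):            # bits: string like "0110..."
        self.bits = bits
        self.Ds, self.Db = [], []
        total = within = 0
        for i, b in enumerate(bits):
            if i % 256 == 0:
                self.Ds.append(total); within = 0
            if i % 32 == 0:
                self.Db.append(within)
            total += b == "1"
            within += b == "1"

    def rank1(self, p):                  # ones in bits[0..p], 0-based
        start = (p // 32) * 32
        word = self.bits[start:p + 1]    # remainder inside the block
        rest = sum(self.ONES_IN_BYTE[int(word[j:j + 8].ljust(8, "0"), 2)]
                   for j in range(0, len(word), 8))
        return self.Ds[p // 256] + self.Db[p // 32] + rest

    def select1(self, k):                # position of the k-th one
        lo, hi = 0, len(self.bits) - 1   # binary search on rank1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid) < k:
                lo = mid + 1
            else:
                hi = mid
        return lo
```

rank1 is three table lookups plus a bounded popcount; select1 runs in O(log n) rank calls, as on the next slide.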
46 Bit Sequences select1 in O(log n) with the same structures 46 select1(p): in practice, solved with a binary search on rank1.
47 Bit Sequences Compressed representations 47 Compressed bit-sequence representations exist!! Compressed [Raman et al., 2002]; for very sparse bitmaps [Okanohara and Sadakane, 2007]; ... see [Navarro 2016]
48 Agenda Introduction Basic compression Sequences Bit sequences Integer sequences A brief Review about Indexing PAGE 49 images: zurb.com
49 Integer Sequences access/rank/select on general sequences 49 On the example sequence S: access(13) = 3; rank2(9) = 3; select4(3) = 7. See [Navarro 2016].
50 Integer Sequences Wavelet tree (construction) [Grossi et al. 2003] Given a sequence of symbols and an encoding, the bits of the code of each symbol are distributed along the different levels of the tree. DATA: A B A C D A C. CODES: A = 00, B = 01, C = 10, D = 11. Root bitmap Broot = 0 0 0 1 1 0 1 (first bit of each code); the left child holds A B A A (second bits, B0 = 0 1 0 0) and the right child holds C D C (second bits, B1 = 0 1 0).
51 Integer Sequences Wavelet tree (select) 51 Searching for the 1st occurrence of D? DATA: A B A C D A C; codes A = 00, B = 01, C = 10, D = 11; Broot = 0 0 0 1 1 0 1, B1 = 0 1 0. The 1st D is the 1st 1 in B1: it is at position 2 of B1. Then, in Broot: where is the 2nd 1? At position 5. So the 1st D is at position 5.
52 Integer Sequences Wavelet tree (access) Recovering data: which symbol appears in the 6th position? DATA: A B A C D A C; codes A = 00, B = 01, C = 10, D = 11. Broot[6] = 0. How many 0s are there up to position 6 in Broot? It is the 4th 0, so descend to B0. Which bit occurs at position 4 in B0? It is set to 0. The codeword read is 00 -> A.
53 Integer Sequences Wavelet tree (access) Recovering data: which symbol appears in the 7th position? DATA: A B A C D A C; codes A = 00, B = 01, C = 10, D = 11. Broot[7] = 1. How many 1s are there up to position 7 in Broot? It is the 3rd 1, so descend to B1. Which bit occurs at position 3 in B1? It is set to 0. The codeword read is 10 -> C.
54 Integer Sequences Wavelet tree (rank) How many Cs are there up to position 7? Codes: A = 00, B = 01, C = 10, D = 11. How many 1s up to position 7 in Broot? 3, so move to position 3 of B1. How many 0s up to position 3 in B1? 2!! So rankC(7) = 2. (Select goes bottom-up to locate a symbol; access and rank go top-down.)
55 Integer Sequences Wavelet tree (space and times) Each level contains n + o(n) bits; with fixed-length codes there are log σ levels, so the tree takes n log σ (1 + o(1)) bits. rank/select/access take O(log σ) time.
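The access/rank/select walks of the last slides can be sketched in one small class (an illustrative version over fixed-length binary codes; the per-node ranks are computed naively here, whereas a real wavelet tree attaches an o(n)-bit rank/select directory to every bitmap):

```python
class WaveletTree:
    """Balanced wavelet tree sketch: one bitmap per node, symbols
    routed left/right by the successive bits of their codes."""
    def __init__(self, data, codes):      # codes: {symbol: "01..."}
        self.codes = codes
        self.depth = len(next(iter(codes.values())))
        self.nodes = {}                   # code prefix -> node bitmap
        self._build(data, "")

    def _build(self, data, path):
        if len(path) == self.depth or not data:
            return
        bits = [self.codes[s][len(path)] for s in data]
        self.nodes[path] = bits
        self._build([s for s, b in zip(data, bits) if b == "0"], path + "0")
        self._build([s for s, b in zip(data, bits) if b == "1"], path + "1")

    def access(self, i):                  # 1-based position
        path, pos = "", i
        while len(path) < self.depth:
            bits = self.nodes[path]
            b = bits[pos - 1]
            pos = bits[:pos].count(b)     # rank_b maps pos to the child
            path += b
        return next(s for s, c in self.codes.items() if c == path)

    def rank(self, sym, i):               # occurrences of sym in [1..i]
        path, pos = "", i
        for b in self.codes[sym]:
            bits = self.nodes.get(path)
            if bits is None or pos == 0:
                return 0
            pos = bits[:pos].count(b)
            path += b
        return pos

    def select(self, sym, j):             # position of the j-th sym
        path, pos = self.codes[sym], j    # walk bottom-up with select_b
        while path:
            b, path = path[-1], path[:-1]
            bits, cnt = self.nodes[path], 0
            for k, bit in enumerate(bits, 1):
                cnt += bit == b
                if cnt == pos:
                    pos = k
                    break
        return pos
```

On the running example (DATA = ABACDAC, codes A=00, B=01, C=10, D=11) it reproduces the slides: access(6) = A, access(7) = C, rankC(7) = 2, selectD(1) = 5.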
56 Integer Sequences Huffman-shaped (or others) wavelet tree Using Huffman coding (or others) yields an unbalanced tree. DATA: A B A C D A C; codes A = 1, B = 000, C = 01, D = 001. Space: n H0(S) + o(n) bits; rank/select/access take O(H0(S)) average time.
57 Agenda Introduction Basic compression Sequences Bit sequences Integer sequences A brief Review about Indexing PAGE 58 images: zurb.com
58 A brief review about indexing Inverted indexes are the most well-known index for text [ ] Suffix arrays are powerful but huge full-text indexes. Self-indexes trade performance for more compact space. PAGE 58 COMPACT DATA STRUCTURES: TO COMPRESS IS TO CONQUER
59 A brief Review about Indexing Text indexing: well-known structures from the Web 59 Traditional indexes (with or without compression): an auxiliary structure plus the explicit text (inverted indexes, suffix arrays, ...). Compressed self-indexes: the text is implicit (wavelet trees, Compressed Suffix Arrays, FM-index, LZ-index, ...).
60 A brief Review about Indexing Inverted indexes An inverted index keeps a vocabulary (DCC, communications, compression, image, data, information, Cliff Lodge, ...) and, for each word, a posting list. A full-positional index stores exact positions; a doc-addressing inverted index stores only document ids: a space-time trade-off. Indexed text: "DCC is held at the Cliff Lodge convention center. It is an international forum for current work on data compression and related applications. DCC addresses not only compression methods for specific types of data (text, image, video, audio, space, graphics, web content, [...])... also the use of techniques from information theory and data compression in networking, communications, and storage applications involving large datasets (including image and information mining, retrieval, archiving, backup, communications, and HCI)." Searches: word -> fetch the posting of that word; phrase -> intersection of postings. Compression: of the indexed text (Huffman, ...) and of the posting lists (Rice, ...).
61 A brief Review about Indexing Inverted indexes 61 Posting lists contain increasing integers, and the gaps between consecutive integers are smaller in the longest lists. Original posting list: 4 10 15 25 29 40 46 54 57 70 79 82. Differences + var-length coding: c4 c6 c5 c10 c4 c11 c6 c8 c3 c13 c9 c3 (requires complete decompression). Absolute sampling + var-length coding: 4 c6 c5 c10 29 c11 c6 c8 57 c13 c9 c3 (direct access to the samples, partial decompression).
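Gap encoding with a byte-oriented variable-length code can be sketched as follows (one common variable-byte convention: 7 payload bits per byte, high bit marking the last byte of each codeword; the slide's Rice codes are a different, bit-oriented choice):

```python
def vbyte_encode(nums):
    """Encode an increasing posting list as gaps, each gap in
    variable-byte form (7 bits per byte, high bit = terminator)."""
    out, prev = bytearray(), 0
    for v in nums:
        gap, prev = v - prev, v
        while gap >= 128:
            out.append(gap & 0x7F)       # continuation byte
            gap >>= 7
        out.append(gap | 0x80)           # last byte of this gap
    return bytes(out)

def vbyte_decode(data):
    nums, cur, shift, prev = [], 0, 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:                     # codeword complete
            prev += cur                  # undo the gap encoding
            nums.append(prev)
            cur = shift = 0
        else:
            shift += 7
    return nums
```

On the slide's list all twelve gaps are below 128, so the whole list fits in 12 bytes, one per posting.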
62 A brief Review about Indexing Suffix Arrays Sorting all the suffixes of T lexicographically 62 T = a b r a c a d a b r a $. A = 12 11 8 1 4 6 9 2 5 7 10 3, pointing to the sorted suffixes: $, a$, abra$, abracadabra$, acadabra$, adabra$, bra$, bracadabra$, cadabra$, dabra$, ra$, racadabra$.
63 A brief Review about Indexing Suffix Arrays Binary search for any pattern: ab 63 P = a b, T = a b r a c a d a b r a $, A = 12 11 8 1 4 6 9 2 5 7 10 3 (slides 63-68 animate the binary search steps over A)
69 A brief Review about Indexing Suffix Arrays Binary search for any pattern: ab 69 P = a b, T = a b r a c a d a b r a $. The binary search delimits the interval A[3..4]. Locations: noccs = (4-3)+1 = 2 occurrences, at Occs = A[3]..A[4] = {8, 1}. Fast: counting in O(m lg n), locating in O(m lg n + noccs). Space: O(4n) bytes + T.
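The construction and the two binary searches can be sketched as follows (a didactic version: construction sorts full suffixes in O(n^2 log n), and positions are 1-based as on the slides; practical suffix-array construction is O(n)):

```python
def suffix_array(T):
    """1-based starting positions of T's suffixes, in lexicographic
    order (T is assumed to end with the sentinel '$')."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def sa_search(T, A, P):
    """Two binary searches delimit the SA interval of suffixes that
    start with P: O(m log n) character comparisons in total."""
    m = len(P)
    lo, hi = 0, len(A)
    while lo < hi:                       # leftmost suffix with prefix >= P
        mid = (lo + hi) // 2
        if T[A[mid] - 1:A[mid] - 1 + m] < P:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(A)
    while lo < hi:                       # past the last P-prefixed suffix
        mid = (lo + hi) // 2
        if T[A[mid] - 1:A[mid] - 1 + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return A[start:lo]                   # occurrence positions in T
```

On T = "abracadabra$" and P = "ab" this returns the interval A[3..4] = {8, 1}, as on the slide.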
70 A brief Review about Indexing BWT FM-index 70 BWT(S) + other structures = an index. C[c]: for each char c in S, stores the number of occs in S of the chars that are lexicographically smaller than c: C[$]=0, C[i]=1, C[m]=5, C[p]=6, C[s]=8. Occ(c,k): number of occs of char c in the prefix L[1..k]. For k in [1..12]: Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1; Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4; Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1; Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2; Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4. Char L[i] occurs in F at position LF(i): LF(i) = C[L[i]] + Occ(L[i], i).
71 A brief Review about Indexing BWT FM-index 71 Count(S[1,u], P[1,p]): backward search processes P right to left, e.g. Count(S, "issi") proceeds through the characters i, s, s, i, maintaining an interval of the sorted rotations with C[] and Occ(): C[$]=0, C[i]=1, C[m]=5, C[p]=6, C[s]=8; Occ as on the previous slide.
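The backward-search counting loop can be sketched as follows (an illustrative version where C and Occ are computed on the fly by scanning L; a real FM-index stores them, e.g. answering Occ with rank over a wavelet tree as on the next slide):

```python
def fm_count(L, P):
    """FM-index counting sketch over L = BWT(S): extend P backwards,
    maintaining the interval [sp, ep] of sorted rotations prefixed
    by the processed suffix of P."""
    # C[c]: number of characters in L lexicographically smaller than c.
    C = {c: sum(x < c for x in L) for c in set(L)}

    def occ(c, k):                       # occurrences of c in L[1..k]
        return L[:k].count(c)

    sp, ep = 1, len(L)                   # 1-based interval, whole range
    for c in reversed(P):
        if c not in C:
            return 0
        sp = C[c] + occ(c, sp - 1) + 1
        ep = C[c] + occ(c, ep)
        if sp > ep:                      # interval became empty
            return 0
    return ep - sp + 1
```

With L = "ipssm$pissii" (the BWT of "mississippi$"), fm_count(L, "issi") returns 2, matching the two occurrences of "issi" in the text.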
72 A brief Review about Indexing BWT FM-index Representing L with a wavelet tree: Occ is answered with rank, and L is compressed. 72
73 Bibliography M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Systems Research Center, F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proc. 15th SPIRE, LNCS 5280, pages , Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index. In Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington (USA), Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4): , Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23 38, February A. Golynski, I. Munro, and S. Rao. Rank/select operations on large alphabets: a tool for text indexing. In Proc. 17th SODA, pages , R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th SODA, pages , 2003.
74 Bibliography David A. Huffman. A method for the construction of minimum-redundancy codes. Proc. of the Institute of Radio Engineers, 40(9): , N. J. Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11): , U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM J. Comp., 22(5): , Alistair Moffat, Andrew Turpin: Compression and Coding Algorithms.Kluwer 2002, ISBN I. Munro. Tables. In Proc. 16th FSTTCS, LNCS 1180, pages 37 42, Gonzalo Navarro, Veli Mäkinen, Compressed full-text indexes, ACM Computing Surveys (CSUR), v.39 n.1, p.2-es, Gonzalo Navarro. Compact Data Structures -A practical approach. Cambridge University Press, 570 pages, D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th ALENEX, 2007.
75 Bibliography R. Raman, V. Raman, and S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th SODA, pages , Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2): , Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, Ziv, J. and Lempel, A A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23, 3, Ziv, J. and Lempel, A Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24, 5,
76 (To compress is to Conquer) Compact Data Structures Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 3rd KEYSTONE Training School Keyword search in Big Linked Data 23rd August 2017 (Thanks: slides partially by Susana Ladra, E. Rodríguez, & José R. Paramá)
Compact Indexes for Flexible Top-k Retrieval Simon Gog Matthias Petri Institute of Theoretical Informatics, Karlsruhe Institute of Technology Computing and Information Systems, The University of Melbourne
More informationA Faster Grammar-Based Self-Index
A Faster Grammar-Based Self-Index Travis Gagie 1 Pawe l Gawrychowski 2 Juha Kärkkäinen 3 Yakov Nekrich 4 Simon Puglisi 5 Aalto University Max-Planck-Institute für Informatik University of Helsinki University
More informationText Compression. Jayadev Misra The University of Texas at Austin December 5, A Very Incomplete Introduction to Information Theory 2
Text Compression Jayadev Misra The University of Texas at Austin December 5, 2003 Contents 1 Introduction 1 2 A Very Incomplete Introduction to Information Theory 2 3 Huffman Coding 5 3.1 Uniquely Decodable
More informationCOMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT
COMP9319 Web Data Compression and Search Lecture 2: daptive Huffman, BWT 1 Original readings Login to your cse account: cd ~cs9319/papers Original readings of each lecture will be placed there. 2 Course
More informationLecture 4 : Adaptive source coding algorithms
Lecture 4 : Adaptive source coding algorithms February 2, 28 Information Theory Outline 1. Motivation ; 2. adaptive Huffman encoding ; 3. Gallager and Knuth s method ; 4. Dictionary methods : Lempel-Ziv
More informationCOMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES
COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES DO HUY HOANG (B.C.S. (Hons), NUS) A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE SCHOOL OF COMPUTING NATIONAL
More informationSuccinct Suffix Arrays based on Run-Length Encoding
Succinct Suffix Arrays based on Run-Length Encoding Veli Mäkinen Gonzalo Navarro Abstract A succinct full-text self-index is a data structure built on a text T = t 1 t 2...t n, which takes little space
More informationMultimedia. Multimedia Data Compression (Lossless Compression Algorithms)
Course Code 005636 (Fall 2017) Multimedia Multimedia Data Compression (Lossless Compression Algorithms) Prof. S. M. Riazul Islam, Dept. of Computer Engineering, Sejong University, Korea E-mail: riaz@sejong.ac.kr
More informationChapter 2: Source coding
Chapter 2: meghdadi@ensil.unilim.fr University of Limoges Chapter 2: Entropy of Markov Source Chapter 2: Entropy of Markov Source Markov model for information sources Given the present, the future is independent
More informationarxiv:cs/ v1 [cs.it] 21 Nov 2006
On the space complexity of one-pass compression Travis Gagie Department of Computer Science University of Toronto travis@cs.toronto.edu arxiv:cs/0611099v1 [cs.it] 21 Nov 2006 STUDENT PAPER Abstract. We
More informationEfficient Accessing and Searching in a Sequence of Numbers
Regular Paper Journal of Computing Science and Engineering, Vol. 9, No. 1, March 2015, pp. 1-8 Efficient Accessing and Searching in a Sequence of Numbers Jungjoo Seo and Myoungji Han Department of Computer
More informationOptimal lower bounds for rank and select indexes
Optimal lower bounds for rank and select indexes Alexander Golynski David R. Cheriton School of Computer Science, University of Waterloo agolynski@cs.uwaterloo.ca Technical report CS-2006-03, Version:
More informationFast Fully-Compressed Suffix Trees
Fast Fully-Compressed Suffix Trees Gonzalo Navarro Department of Computer Science University of Chile, Chile gnavarro@dcc.uchile.cl Luís M. S. Russo INESC-ID / Instituto Superior Técnico Technical University
More informationRun-length & Entropy Coding. Redundancy Removal. Sampling. Quantization. Perform inverse operations at the receiver EEE
General e Image Coder Structure Motion Video x(s 1,s 2,t) or x(s 1,s 2 ) Natural Image Sampling A form of data compression; usually lossless, but can be lossy Redundancy Removal Lossless compression: predictive
More informationSuccinct Data Structures for Text and Information Retrieval
Succinct Data Structures for Text and Information Retrieval Simon Gog 1 Matthias Petri 2 1 Institute of Theoretical Informatics Karslruhe Insitute of Technology 2 Computing and Information Systems The
More informationStronger Lempel-Ziv Based Compressed Text Indexing
Stronger Lempel-Ziv Based Compressed Text Indexing Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile.
More informationCMPT 365 Multimedia Systems. Lossless Compression
CMPT 365 Multimedia Systems Lossless Compression Spring 2017 Edited from slides by Dr. Jiangchuan Liu CMPT365 Multimedia Systems 1 Outline Why compression? Entropy Variable Length Coding Shannon-Fano Coding
More informationData Compression Techniques
Data Compression Techniques Part 2: Text Compression Lecture 7: Burrows Wheeler Compression Juha Kärkkäinen 21.11.2017 1 / 16 Burrows Wheeler Transform The Burrows Wheeler transform (BWT) is a transformation
More informationTheoretical aspects of ERa, the fastest practical suffix tree construction algorithm
Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm Matevž Jekovec University of Ljubljana Faculty of Computer and Information Science Oct 10, 2013 Text indexing problem
More informationGrammar Compressed Sequences with Rank/Select Support
Grammar Compressed Sequences with Rank/Select Support Gonzalo Navarro and Alberto Ordóñez 2 Dept. of Computer Science, Univ. of Chile, Chile. gnavarro@dcc.uchile.cl 2 Lab. de Bases de Datos, Univ. da Coruña,
More informationData Compression Techniques
Data Compression Techniques Part 1: Entropy Coding Lecture 4: Asymmetric Numeral Systems Juha Kärkkäinen 08.11.2017 1 / 19 Asymmetric Numeral Systems Asymmetric numeral systems (ANS) is a recent entropy
More informationSource Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria
Source Coding Master Universitario en Ingeniería de Telecomunicación I. Santamaría Universidad de Cantabria Contents Introduction Asymptotic Equipartition Property Optimal Codes (Huffman Coding) Universal
More informationForbidden Patterns. {vmakinen leena.salmela
Forbidden Patterns Johannes Fischer 1,, Travis Gagie 2,, Tsvi Kopelowitz 3, Moshe Lewenstein 4, Veli Mäkinen 5,, Leena Salmela 5,, and Niko Välimäki 5, 1 KIT, Karlsruhe, Germany, johannes.fischer@kit.edu
More informationReducing the Space Requirement of LZ-Index
Reducing the Space Requirement of LZ-Index Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile {darroyue, gnavarro}@dcc.uchile.cl 2 Dept. of
More informationData Compression Using a Sort-Based Context Similarity Measure
Data Compression Using a Sort-Based Context Similarity easure HIDETOSHI YOKOO Department of Computer Science, Gunma University, Kiryu, Gunma 76, Japan Email: yokoo@cs.gunma-u.ac.jp Every symbol in the
More informationImage and Multidimensional Signal Processing
Image and Multidimensional Signal Processing Professor William Hoff Dept of Electrical Engineering &Computer Science http://inside.mines.edu/~whoff/ Image Compression 2 Image Compression Goal: Reduce amount
More information情報処理学会研究報告 IPSJ SIG Technical Report Vol.2012-DBS-156 No /12/12 1,a) 1,b) 1,2,c) 1,d) 1999 Larsson Moffat Re-Pair Re-Pair Re-Pair Variable-to-Fi
1,a) 1,b) 1,2,c) 1,d) 1999 Larsson Moffat Re-Pair Re-Pair Re-Pair Variable-to-Fixed-Length Encoding for Large Texts Using a Re-Pair Algorithm with Shared Dictionaries Kei Sekine 1,a) Hirohito Sasakawa
More informationComputing Techniques for Parallel and Distributed Systems with an Application to Data Compression. Sergio De Agostino Sapienza University di Rome
Computing Techniques for Parallel and Distributed Systems with an Application to Data Compression Sergio De Agostino Sapienza University di Rome Parallel Systems A parallel random access machine (PRAM)
More informationCSEP 590 Data Compression Autumn Dictionary Coding LZW, LZ77
CSEP 590 Data Compression Autumn 2007 Dictionary Coding LZW, LZ77 Dictionary Coding Does not use statistical knowledge of data. Encoder: As the input is processed develop a dictionary and transmit the
More informationMultimedia Information Systems
Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 3 & 4: Color, Video, and Fundamentals of Data Compression 1 Color Science Light is an electromagnetic wave. Its color is characterized
More informationLRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations
LRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations Jérémy Barbay 1, Johannes Fischer 2, and Gonzalo Navarro 1 1 Department of Computer Science, University of Chile, {jbarbay gnavarro}@dcc.uchile.cl
More informationCS4800: Algorithms & Data Jonathan Ullman
CS4800: Algorithms & Data Jonathan Ullman Lecture 22: Greedy Algorithms: Huffman Codes Data Compression and Entropy Apr 5, 2018 Data Compression How do we store strings of text compactly? A (binary) code
More informationOn Compressing and Indexing Repetitive Sequences
On Compressing and Indexing Repetitive Sequences Sebastian Kreft a,1,2, Gonzalo Navarro a,2 a Department of Computer Science, University of Chile Abstract We introduce LZ-End, a new member of the Lempel-Ziv
More informationSuffix Array of Alignment: A Practical Index for Similar Data
Suffix Array of Alignment: A Practical Index for Similar Data Joong Chae Na 1, Heejin Park 2, Sunho Lee 3, Minsung Hong 3, Thierry Lecroq 4, Laurent Mouchard 4, and Kunsoo Park 3, 1 Department of Computer
More informationString Range Matching
String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings
More informationA Unifying Framework for Compressed Pattern Matching
A Unifying Framework for Compressed Pattern Matching Takuya Kida Yusuke Shibata Masayuki Takeda Ayumi Shinohara Setsuo Arikawa Department of Informatics, Kyushu University 33 Fukuoka 812-8581, Japan {
More informationBasic Principles of Lossless Coding. Universal Lossless coding. Lempel-Ziv Coding. 2. Exploit dependences between successive symbols.
Universal Lossless coding Lempel-Ziv Coding Basic principles of lossless compression Historical review Variable-length-to-block coding Lempel-Ziv coding 1 Basic Principles of Lossless Coding 1. Exploit
More informationEfficient Fully-Compressed Sequence Representations
Algorithmica (2014) 69:232 268 DOI 10.1007/s00453-012-9726-3 Efficient Fully-Compressed Sequence Representations Jérémy Barbay Francisco Claude Travis Gagie Gonzalo Navarro Yakov Nekrich Received: 4 February
More informationApproximate String Matching with Lempel-Ziv Compressed Indexes
Approximate String Matching with Lempel-Ziv Compressed Indexes Luís M. S. Russo 1, Gonzalo Navarro 2 and Arlindo L. Oliveira 1 1 INESC-ID, R. Alves Redol 9, 1000 LISBOA, PORTUGAL lsr@algos.inesc-id.pt,
More informationAdvanced Text Indexing Techniques. Johannes Fischer
Advanced ext Indexing echniques Johannes Fischer SS 2009 1 Suffix rees, -Arrays and -rays 1.1 Recommended Reading Dan Gusfield: Algorithms on Strings, rees, and Sequences. 1997. ambridge University Press,
More informationarxiv: v1 [cs.ds] 21 Nov 2012
The Rightmost Equal-Cost Position Problem arxiv:1211.5108v1 [cs.ds] 21 Nov 2012 Maxime Crochemore 1,3, Alessio Langiu 1 and Filippo Mignosi 2 1 King s College London, London, UK {Maxime.Crochemore,Alessio.Langiu}@kcl.ac.uk
More informationCRAM: Compressed Random Access Memory
CRAM: Compressed Random Access Memory Jesper Jansson 1, Kunihiko Sadakane 2, and Wing-Kin Sung 3 1 Laboratory of Mathematical Bioinformatics, Institute for Chemical Research, Kyoto University, Gokasho,
More informationLecture 10 : Basic Compression Algorithms
Lecture 10 : Basic Compression Algorithms Modeling and Compression We are interested in modeling multimedia data. To model means to replace something complex with a simpler (= shorter) analog. Some models
More informationText matching of strings in terms of straight line program by compressed aleshin type automata
Text matching of strings in terms of straight line program by compressed aleshin type automata 1 A.Jeyanthi, 2 B.Stalin 1 Faculty, 2 Assistant Professor 1 Department of Mathematics, 2 Department of Mechanical
More informationUNIT I INFORMATION THEORY. I k log 2
UNIT I INFORMATION THEORY Claude Shannon 1916-2001 Creator of Information Theory, lays the foundation for implementing logic in digital circuits as part of his Masters Thesis! (1939) and published a paper
More informationCHAPTER 8 COMPRESSION ENTROPY ESTIMATION OF HEART RATE VARIABILITY AND COMPUTATION OF ITS RENORMALIZED ENTROPY
108 CHAPTER 8 COMPRESSION ENTROPY ESTIMATION OF HEART RATE VARIABILITY AND COMPUTATION OF ITS RENORMALIZED ENTROPY 8.1 INTRODUCTION Klimontovich s S-theorem offers an approach to compare two different
More informationApproximate String Matching with Ziv-Lempel Compressed Indexes
Approximate String Matching with Ziv-Lempel Compressed Indexes Luís M. S. Russo 1, Gonzalo Navarro 2, and Arlindo L. Oliveira 1 1 INESC-ID, R. Alves Redol 9, 1000 LISBOA, PORTUGAL lsr@algos.inesc-id.pt,
More informationBandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet)
Compression Motivation Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet) Storage: Store large & complex 3D models (e.g. 3D scanner
More informationSIGNAL COMPRESSION Lecture 7. Variable to Fix Encoding
SIGNAL COMPRESSION Lecture 7 Variable to Fix Encoding 1. Tunstall codes 2. Petry codes 3. Generalized Tunstall codes for Markov sources (a presentation of the paper by I. Tabus, G. Korodi, J. Rissanen.
More informationarxiv: v2 [cs.ds] 6 Jul 2015
Online Self-Indexed Grammar Compression Yoshimasa Takabatake 1, Yasuo Tabei 2, and Hiroshi Sakamoto 1 1 Kyushu Institute of Technology {takabatake,hiroshi}@donald.ai.kyutech.ac.jp 2 PRESTO, Japan Science
More informationarxiv: v3 [cs.ds] 6 Sep 2018
Universal Compressed Text Indexing 1 Gonzalo Navarro 2 arxiv:1803.09520v3 [cs.ds] 6 Sep 2018 Abstract Center for Biotechnology and Bioengineering (CeBiB), Department of Computer Science, University of
More informationFibonacci Coding for Lossless Data Compression A Review
RESEARCH ARTICLE OPEN ACCESS Fibonacci Coding for Lossless Data Compression A Review Ezhilarasu P Associate Professor Department of Computer Science and Engineering Hindusthan College of Engineering and
More informationCMPT 365 Multimedia Systems. Final Review - 1
CMPT 365 Multimedia Systems Final Review - 1 Spring 2017 CMPT365 Multimedia Systems 1 Outline Entropy Lossless Compression Shannon-Fano Coding Huffman Coding LZW Coding Arithmetic Coding Lossy Compression
More informationOn Universal Types. Gadiel Seroussi Hewlett-Packard Laboratories Palo Alto, California, USA. University of Minnesota, September 14, 2004
On Universal Types Gadiel Seroussi Hewlett-Packard Laboratories Palo Alto, California, USA University of Minnesota, September 14, 2004 Types for Parametric Probability Distributions A = finite alphabet,
More informationA Four-Stage Algorithm for Updating a Burrows-Wheeler Transform
A Four-Stage Algorithm for Updating a Burrows-Wheeler ransform M. Salson a,1,. Lecroq a, M. Léonard a, L. Mouchard a,b, a Université de Rouen, LIIS EA 4108, 76821 Mont Saint Aignan, France b Algorithm
More informationA Space-Efficient Frameworks for Top-k String Retrieval
A Space-Efficient Frameworks for Top-k String Retrieval Wing-Kai Hon, National Tsing Hua University Rahul Shah, Louisiana State University Sharma V. Thankachan, Louisiana State University Jeffrey Scott
More informationarxiv: v1 [cs.ds] 8 Sep 2018
Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space Travis Gagie 1,2, Gonzalo Navarro 2,3, and Nicola Prezza 4 1 EIT, Diego Portales University, Chile 2 Center for Biotechnology
More informationCompressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching Roberto Grossi Dipartimento di Informatica Università di Pisa 56125 Pisa, Italy grossi@di.unipi.it Jeffrey
More informationMotivation for Arithmetic Coding
Motivation for Arithmetic Coding Motivations for arithmetic coding: 1) Huffman coding algorithm can generate prefix codes with a minimum average codeword length. But this length is usually strictly greater
More informationSource Coding Techniques
Source Coding Techniques. Huffman Code. 2. Two-pass Huffman Code. 3. Lemple-Ziv Code. 4. Fano code. 5. Shannon Code. 6. Arithmetic Code. Source Coding Techniques. Huffman Code. 2. Two-path Huffman Code.
More informationBreaking a Time-and-Space Barrier in Constructing Full-Text Indices
Breaking a Time-and-Space Barrier in Constructing Full-Text Indices Wing-Kai Hon Kunihiko Sadakane Wing-Kin Sung Abstract Suffix trees and suffix arrays are the most prominent full-text indices, and their
More informationIntroduction to Information Theory. Part 3
Introduction to Information Theory Part 3 Assignment#1 Results List text(s) used, total # letters, computed entropy of text. Compare results. What is the computed average word length of 3 letter codes
More information