COMP9319 Web Data Compression and Search. Lecture 2: Adaptive Huffman, BWT


COMP9319 Web Data Compression and Search Lecture 2: Adaptive Huffman, BWT

Original readings: Login to your cse account: cd ~cs9319/papers. Original readings of each lecture will be placed there.

Course schedule: Data compression; Search; Data compression + Search; Web data compression + Search; Optional topics

Huffman coding

Symbol  Freq    Huffman code
a       100000  0
b       6       10
c       2       110
d       1       1110
e       1       1111
Total   100010

[Figure: the corresponding Huffman tree, with internal node weights 100010, 10, 4 and 2 above the leaves a, b, c, d, e]

Huffman not optimal: H = 0.9999 log(1.0001) + 0.00006 log(16668.333) + ... + (1/100010) log(100010) ≈ 0.00, whereas L = (100000*1 + ...)/100010 ≈ 1
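
The numbers above are easy to recheck; a small Python sketch (not from the slides) that recomputes the entropy H and the average Huffman code length L from the frequency table on the previous slide:

import math

# Frequencies and Huffman code lengths from the example (codes 0, 10, 110, 1110, 1111).
freqs = {'a': 100000, 'b': 6, 'c': 2, 'd': 1, 'e': 1}
code_len = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 4}

total = sum(freqs.values())                                  # 100010
H = sum((f / total) * math.log2(total / f) for f in freqs.values())
L = sum(f * code_len[s] for s, f in freqs.items()) / total

print(H)  # ~0.0016 bits/symbol
print(L)  # ~1.0002 bits/symbol -- roughly 600x the entropy on this input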

Problems of Huffman coding: Huffman codes have an integral # of bits. E.g., log2(3) = 1.585, while Huffman may need 2 bits. Noticeable non-optimality when the probability of a symbol is high. => Arithmetic coding

Problems of static coding: Needs statistics, and is static: e.g., a single pass over the data just to collect the statistics, and the statistics stay unchanged during encoding. To decode, the statistics table needs to be transmitted; the table size can be significant for small messages. => Adaptive compression, e.g., adaptive Huffman

Adaptive compression. Encoder: Initialize the model; Repeat for each input char ( Encode char; Update the model ). Decoder: Initialize the model; Repeat for each input char ( Decode char; Update the model ). Make sure both sides have the same Initialize & Update model algorithms.

Adaptive Huffman coding (dummy version). Encoder: Reset the stat; Repeat for each input char ( Encode char; Update the stat; Rebuild Huffman tree ). Decoder: Reset the stat; Repeat for each input char ( Decode char; Update the stat; Rebuild Huffman tree ). This works but is too slow!
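
As a concrete illustration of this dummy scheme, here is a Python sketch; it assumes the alphabet is known in advance and every count starts at 1 (sidestepping the NYT/escape handling for never-seen symbols that the real algorithm uses), and it really does rebuild the whole tree after every symbol:

import heapq

def huffman_table(counts):
    # Rebuild a Huffman code table (symbol -> bitstring) from scratch.
    heap = [(w, i, {s: ''}) for i, (s, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, i1, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in t1.items()}
        merged.update({s: '1' + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, i1, merged))
    return heap[0][2]

def encode(text, alphabet):
    counts = {s: 1 for s in alphabet}          # initialize the model
    out = []
    for ch in text:                            # repeat for each input char:
        out.append(huffman_table(counts)[ch])  #   encode char (tree rebuilt each time)
        counts[ch] += 1                        #   update the stat
    return ''.join(out)

def decode(bits, n, alphabet):
    counts = {s: 1 for s in alphabet}          # identical initialization on both sides
    out, pos = [], 0
    for _ in range(n):
        table = {code: s for s, code in huffman_table(counts).items()}
        code = ''
        while code not in table:               # read bits until a codeword matches
            code += bits[pos]
            pos += 1
        out.append(table[code])
        counts[table[code]] += 1               # identical model update
    return ''.join(out)

msg = "abbbbba"
assert decode(encode(msg, "ab"), len(msg), "ab") == msg  # round trip, full rebuild per symbol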

Adaptive Huffman (algorithm outline): 1. If the current symbol is NYT, add two child nodes to the NYT node: one will be a new NYT node, the other a leaf node for our symbol. Increase the weight of the new leaf node and the old NYT node, and go to step 4. If not, go to the symbol's leaf node. 2. If this node does not have the highest number in its block, swap it with the node having the highest number. 3. Increase the weight of the current node. 4. If this is not the root node, go to the parent node, then go to step 2. If this is the root, end.

The update procedure is from Introduction to Data Compression by Khalid Sayood. Also, Wikipedia provides a good summary, example and explanation (i.e., http://en.wikipedia.org/wiki/Adaptive_Huffman_coding)

Adaptive Huffman
abbbbba: 01100001011000100110001001100010011000100110001001100001
abbbbba: 011000010011000100111101
a: 01100001 b: 01100010
Modified from Wikipedia

More example: [Tree figures: an FGK tree over a-e as more a's arrive. Initially the root (node 256) has weight 17, e (node 255) has weight 10, and the leaves a, b, c, d (nodes 248-251) have weights 1, 2, 2, 2. Each incoming a increments the weights on a's path to the root (root weight 18, 19, 20, 21), and a is swapped with the highest-numbered node of its block when its weight grows: first into node 251 (swapping with d, weight 3 then 4), later into node 252 (weight 5), with e's subtree moving under node 254.]

Adaptive Huffman (FGK) [figure]

Adaptive Huffman (FGK): when f is inserted [figure]

Adaptive Huffman (FGK vs Vitter): 1. FGK: (explicit) node numbering; Vitter: implicit numbering. 2. Vitter's invariant: in the implicit numbering, node weights are nondecreasing and, among nodes of equal weight, leaves precede internal nodes.

aa bbb c (Huffman). Frequencies: a 2, b 3, c 1, sp 2. [Figure: two alternative Huffman trees for these frequencies.] The two resulting codes: a 01, b 1, c 000, sp 001 (total 16 bits); or a 10, b 11, c 00, sp 01 (also 16 bits).

Adaptive Huffman (Vitter's Invariant) [figure]

Adaptive Huffman (Vitter 1987)
abbbbba: 01100001011000100110001001100010011000100110001001100001
abbbbba: 011000010011000100111101
a: 01100001 b: 01100010
Modified from Wikipedia

Adaptive Huffman (Vitter 1987)
abbbbba: 01100001011000100110001001100010011000100110001001100001
abbbbba: 011000010011000101111101
[Figure: the Vitter tree, with nodes 255 and 254 and leaf a]
a: 01100001 b: 01100010
You can correct the Wikipedia article. Modified from Wikipedia

Adaptive Huffman (Vitter 87) [figure]

Adaptive Huffman. Question: Adaptive Huffman vs Static Huffman?

Compared with static Huffman: Dynamic, and can offer better compression (cf. Vitter's experiments next), i.e., the tree can be smaller (hence shorter codes) before the whole bitstream is received. Works when prior statistics are unavailable. Saves the symbol table overhead (cf. Vitter's experiments next).

Vitter's experiments: one set of results includes overheads such as symbol tables / leaf node codes etc. (95 ASCII chars + <end-of-line>); the other excludes them. [Tables] From Vitter's paper. You know where it is.

More experiments [figure]

Next: BWT. BWT = Burrows-Wheeler Transform. It is a transform, not a compression scheme; but it usually helps compression (esp. text compression). Excerpted from Wikipedia

Recall Lecture 1's RLE and BWT example:
rabcabcababaabacabcabcabcababaa$
aabbbbccacccrcbaaaaaaaaaabbbbba$ (after BWT)
aab4ccac3rcba10b5a$ (after RLE)
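
For reference, a quick Python reproduction of that RLE step (the counting convention is inferred from the output shown: runs of three or more symbols become the symbol followed by its count, shorter runs stay literal):

from itertools import groupby

def rle(s, min_run=3):
    # Run-length encode: a run of >= min_run copies becomes <symbol><count>.
    out = []
    for ch, group in groupby(s):
        n = len(list(group))
        out.append(f"{ch}{n}" if n >= min_run else ch * n)
    return ''.join(out)

print(rle("aabbbbccacccrcbaaaaaaaaaabbbbba$"))  # -> aab4ccac3rcba10b5a$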

A simple example. Input: #BANANAS (Excerpted from Wikipedia)

All rotations:
#BANANAS
S#BANANA
AS#BANAN
NAS#BANA
ANAS#BAN
NANAS#BA
ANANAS#B
BANANAS#

Sort the rows:
#BANANAS
ANANAS#B
ANAS#BAN
AS#BANAN
BANANAS#
NANAS#BA
NAS#BANA
S#BANANA

Output: the last column of the sorted rows, i.e., SBNN#AAA

Exercise: you can try the example: rabcabcababaabacabcabcabcababaa$ -> aabbbbccacccrcbaaaaaaaaaabbbbba$

Now the inverse. Input: S B N N # A A A

First add: S, B, N, N, #, A, A, A

Then sort: #, A, A, A, B, N, N, S

Add again: S#, BA, NA, NA, #B, AN, AN, AS

Then sort: #B, AN, AN, AS, BA, NA, NA, S#

Then add: S#B, BAN, NAN, NAS, #BA, ANA, ANA, AS#

Then sort: #BA, ANA, ANA, AS#, BAN, NAN, NAS, S#B

Then add: S#BA, BANA, NANA, NAS#, #BAN, ANAN, ANAS, AS#B

Then sort: #BAN, ANAN, ANAS, AS#B, BANA, NANA, NAS#, S#BA

Then add: S#BAN, BANAN, NANAS, NAS#B, #BANA, ANANA, ANAS#, AS#BA

Then sort: #BANA, ANANA, ANAS#, AS#BA, BANAN, NANAS, NAS#B, S#BAN

Then add: S#BANA, BANANA, NANAS#, NAS#BA, #BANAN, ANANAS, ANAS#B, AS#BAN

Then sort: #BANAN, ANANAS, ANAS#B, AS#BAN, BANANA, NANAS#, NAS#BA, S#BANA

Then add: S#BANAN, BANANAS, NANAS#B, NAS#BAN, #BANANA, ANANAS#, ANAS#BA, AS#BANA

Then sort: #BANANA, ANANAS#, ANAS#BA, AS#BANA, BANANAS, NANAS#B, NAS#BAN, S#BANAN

Then add: S#BANANA, BANANAS#, NANAS#BA, NAS#BANA, #BANANAS, ANANAS#B, ANAS#BAN, AS#BANAN

Then sort (?): #BANANAS, ANANAS#B, ANAS#BAN, AS#BANAN, BANANAS#, NANAS#BA, NAS#BANA, S#BANANA. The table of all rotations is back; the row ending with the marker #, i.e., BANANAS#, recovers the input string.

Implementation: Do we need to represent the table in the encoder? No, a single pointer for each row is enough.
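
That pointer-per-row idea in a short Python sketch: sort rotation start positions rather than materializing the rotation table:

def bwt_pointers(s):
    # BWT via one pointer (rotation start index) per row; no table is stored.
    n = len(s)
    rows = sorted(range(n), key=lambda i: s[i:] + s[:i])  # sort rotations by start index
    return ''.join(s[(i - 1) % n] for i in rows)          # last column = char before each start

print(bwt_pointers("#BANANAS"))  # -> SBNN#AAA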

BWT(S)
function BWT(string s):
    create a table; rows are all possible rotations of s
    sort rows alphabetically
    return (last column of the table)

InverseBWT(S)
function inverseBWT(string s):
    create an empty table
    repeat length(s) times:
        insert s as a column of the table before the first column // the first insert creates the first column
        sort rows of the table alphabetically
    return (row that ends with the 'EOF' character)
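
Both pseudocode functions translate directly into Python; a sketch using '#' as the marker character from the running example:

def bwt(s):
    # Build all rotations, sort them, take the last column.
    table = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(row[-1] for row in table)

def inverse_bwt(s, eof='#'):
    # Repeatedly prepend s as a column and re-sort the rows.
    table = [''] * len(s)
    for _ in range(len(s)):
        table = sorted(s[i] + table[i] for i in range(len(s)))
    return next(row for row in table if row.endswith(eof))

assert bwt("#BANANAS") == "SBNN#AAA"
assert inverse_bwt("SBNN#AAA") == "BANANAS#"  # the input, rotated so the marker is at the end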

Move to Front (MTF): reduces entropy based on local frequency correlation. Usually applied to the BWT output before an entropy-encoding step. Author and details: original paper at ~cs9319/papers; http://www.arturocampos.com/ac_mtf.html

Example: abaabacad

Symbol  Code  List
a       0     abcde..
b       1     bacde..
a       1     abcde..
a       0     abcde..
b       1     bacde..
a       1     abcde..
c       2     cabde..
a       1     acbde..
d       3     dacbe..

To transform a general file, the list holds all 256 ASCII symbols.
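
A Python sketch of MTF that reproduces the table above (a general-purpose version would start the list with all 256 byte values):

def mtf_encode(text, alphabet="abcde"):
    # Emit each symbol's current list position, then move it to the front.
    lst, out = list(alphabet), []
    for ch in text:
        i = lst.index(ch)
        out.append(i)
        lst.insert(0, lst.pop(i))   # move-to-front
    return out

def mtf_decode(codes, alphabet="abcde"):
    lst, out = list(alphabet), []
    for i in codes:
        out.append(lst[i])
        lst.insert(0, lst.pop(i))   # mirror the encoder's list update
    return ''.join(out)

codes = mtf_encode("abaabacad")
print(codes)                        # [0, 1, 1, 0, 1, 1, 2, 1, 3]
assert mtf_decode(codes) == "abaabacad"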

BWT compressor vs ZIP: ZIP (i.e., LZW based) vs BWT+RLE+MTF+AC. [Figure: comparison results] From http://marknelson.us/1996/09/01/bwt/

Other ways to reverse BWT. Consider L = BWT(S), composed of the symbols V0 ... VN-1. The transformed string may be parsed to obtain: (1) the number of symbols in the substring V0 ... Vi-1 that are identical to Vi; (2) for each unique symbol Vi in L, the number of symbols that are lexicographically less than that symbol.

Example (L = BNN[AA]A):

Position  Symbol  # Matching
0         B       0
1         N       0
2         N       1
3         [       0
4         A       0
5         A       1
6         ]       0
7         A       2

Symbol  # LessThan
A       0
B       3
N       4
[       6
]       7

Decoding right to left, using the same two tables at every step:

???????]
??????A]
?????NA]
????ANA]
???NANA]
??ANANA]
?BANANA]
[BANANA]

The # Matching values are also known as Occ / Rank; the # LessThan values form the C array.
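
These two tables are exactly what is needed to invert the BWT in a single right-to-left pass; a Python sketch using the example's '[' and ']' markers (rank is the # Matching table, C the # LessThan table):

def inverse_bwt_lf(L, eof=']'):
    # Invert the BWT with the counting method: C ('# LessThan') + rank ('# Matching').
    n = len(L)
    seen, rank = {}, []
    for c in L:                       # rank[i] = earlier occurrences of L[i] in L
        rank.append(seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    C = {c: sum(v for k, v in seen.items() if k < c) for c in seen}
    out = [''] * n
    i = L.index(eof)                  # the row ending with the end marker is the original string
    for k in range(n - 1, -1, -1):
        out[k] = L[i]                 # emit right to left: ], A], NA], ...
        i = C[L[i]] + rank[i]         # LF-mapping: jump to the row of the preceding symbol
    return ''.join(out)

print(inverse_bwt_lf("BNN[AA]A"))     # -> [BANANA]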

An illustration. First column: A A A B N N [ ]   Last column: B N N [ A A ] A

[Figures: the same First and Last columns repeated at each step, tracing the recovered suffix]

A]
NA]
ANA]
NANA]
ANANA]
BANANA]
[BANANA]

Dynamic BWT? Instead of reconstructing the BWT from scratch, apply a local reordering to the original BWT. Details: Salson M, Lecroq T, Léonard M and Mouchard L (2009). "A Four-Stage Algorithm for Updating a Burrows-Wheeler Transform". Theoretical Computer Science 410 (43): 4350-4359.