CS4800: Algorithms & Data Jonathan Ullman


CS4800: Algorithms & Data, Jonathan Ullman
Lecture 22: Greedy Algorithms: Huffman Codes, Data Compression and Entropy
Apr 5, 2018

Data Compression
How do we store strings of text compactly?
A (binary) code is a mapping enc: Σ → {0,1}*.
Simplest code: assign the numbers 1, 2, ..., |Σ| to the symbols and map each symbol to its number in binary, using ⌈log₂|Σ|⌉ bits per symbol.
Example: Morse Code.

Data Compression
Letters have uneven frequencies! Want to use short encodings for frequent letters and long encodings for infrequent letters.

            a     b     c     d
Frequency   1/2   1/4   1/8   1/8
Encoding    0     10    110   111

Data Compression
What properties would a good code have?
- Easy to encode a string: Encode(KTS) = ...
- The encoding is short on average: 4 bits per letter (30 symbols max!)
- Easy to decode a string: Decode(...) = ...

Prefix-Free Codes
Cannot decode if there are ambiguities, e.g. if enc(E) is a prefix of enc(S).
Prefix-Free Code: a binary code enc: Σ → {0,1}* such that for every x ≠ y ∈ Σ, enc(x) is not a prefix of enc(y).
Any fixed-length code is prefix-free.
Can make any code prefix-free by adding some string meaning STOP.
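
To make the ambiguity concrete, here is a small Python sketch (not from the slides; is_prefix_free is a name introduced here) that checks whether a code table is prefix-free:

```python
def is_prefix_free(code):
    """Return True if no codeword is a prefix of another codeword."""
    words = list(code.values())
    for i, u in enumerate(words):
        for j, v in enumerate(words):
            if i != j and v.startswith(u):
                return False
    return True

# Morse-style ambiguity: enc(E) is a prefix of enc(S), so "000" is ambiguous.
morse_like = {"E": "0", "S": "000"}
print(is_prefix_free(morse_like))   # False

# The code from the table above is prefix-free.
abcd = {"a": "0", "b": "10", "c": "110", "d": "111"}
print(is_prefix_free(abcd))         # True
```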

Prefix-Free Codes
Can represent a prefix-free code as a binary tree with the letters at the leaves.
Encode by going up the tree (or using a table).
Decode by going down the tree, one bit at a time, restarting at the root after each leaf.
[Figure: a code tree with letters at the leaves and an example bit string decoded against it.]

Huffman Codes
(An algorithm to find) an optimal prefix-free code.
Optimal = minimizes the expected length len(c) = Σ_{i ∈ Σ} f_i · len(c(i)).
Note, optimality depends on what you're compressing: H is the 8th most frequent letter in English (6.094%) but the 20th most frequent in Italian (0.636%).

            a     b     c     d
Frequency   1/2   1/4   1/8   1/8
Encoding    0     10    110   111
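
As a quick check of the expected-length formula on the table above (an illustrative sketch, not from the slides):

```python
# Expected codeword length: len(c) = sum over letters i of f_i * len(c(i)).
freq = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
enc  = {"a": "0", "b": "10", "c": "110", "d": "111"}

print(sum(freq[x] * len(enc[x]) for x in freq))
# 1.75 bits per letter on average, vs. 2 bits for a fixed-length code
```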

Huffman Codes
First Try: split the letters into two sets of roughly equal total frequency and recurse.
Balanced binary trees should have low depth.

            a     b     c     d     e
Frequency   .32   .25   .20   .18   .05

Huffman Codes
First Try: split the letters into two sets of roughly equal total frequency and recurse.

            a     b     c     d     e
Frequency   .32   .25   .20   .18   .05

[Figure: the "first try" tree compared with the optimal tree for these frequencies.]

Huffman Codes
Huffman's Algorithm: pair up the two letters with the lowest frequency and recurse.

            a     b     c     d     e
Frequency   .32   .25   .20   .18   .05

Huffman Codes
Huffman's Algorithm: pair up the two letters with the lowest frequency and recurse.
Theorem: Huffman's Algorithm produces a prefix-free code of optimal length.
We'll prove the theorem using an exchange argument.
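
The algorithm can be implemented with a binary heap keyed by frequency. Below is a minimal Python sketch (an illustrative implementation, not the course's reference code; huffman_code and its tie-breaking counter are names introduced here):

```python
import heapq
from itertools import count

def huffman_code(freq):
    """Build a Huffman code for a dict mapping letters to frequencies.

    Repeatedly merge the two lowest-frequency subtrees; the resulting
    binary tree gives the codewords (left edge = '0', right edge = '1').
    """
    tiebreak = count()  # avoids comparing tree nodes when frequencies tie
    heap = [(f, next(tiebreak), letter) for letter, f in freq.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    code = {}
    def assign(node, prefix):
        if isinstance(node, tuple):           # internal node: recurse on children
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                                 # leaf: record the codeword
            code[node] = prefix or "0"
    assign(heap[0][2], "")
    return code

print(huffman_code({"a": 0.32, "b": 0.25, "c": 0.20, "d": 0.18, "e": 0.05}))
# codeword lengths: a=2, b=2, c=2, d=3, e=3  (expected length 2.23 bits)
```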

Huffman Codes
Theorem: Huffman's Alg produces an optimal prefix-free code.
(1) In an optimal prefix-free code (an optimal tree), every internal node has exactly 2 children.

Huffman Codes
Theorem: Huffman's Alg produces an optimal prefix-free code.
(2) If x, y have the lowest frequency, then there is an optimal code where x, y are siblings and are at the bottom of the tree.

Huffman Codes
Theorem: Huffman's Alg produces an optimal prefix-free code.
Proof by induction on the number of letters in Σ.
Base case (|Σ| = 2): rather obvious (each letter gets a single bit).

Huffman Codes
Theorem: Huffman's Alg produces an optimal prefix-free code.
Proof by induction on the number of letters in Σ.
Inductive Hypothesis: ...

Huffman Codes
Theorem: Huffman's Alg produces an optimal prefix-free code.
Proof by induction on the number of letters in Σ.
Inductive Hypothesis: ...
Without loss of generality, the frequencies are f_1, ..., f_k and the two lowest are f_1, f_2.
Merge letters 1, 2 into a new letter k+1 with f_{k+1} = f_1 + f_2.

Huffman Codes
Theorem: Huffman's Alg produces an optimal prefix-free code.
Proof by induction on the number of letters in Σ.
Inductive Hypothesis: ...
Without loss of generality, the frequencies are f_1, ..., f_k and the two lowest are f_1, f_2.
Merge letters 1, 2 into a new letter k+1 with f_{k+1} = f_1 + f_2.
By induction, if c' is the Huffman code for f_3, ..., f_{k+1}, then c' is optimal.
Need to prove that c is optimal for f_1, ..., f_k.
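
The step the slide leaves to be filled in can be summarized as follows (a standard identity, not verbatim from the lecture): if c is obtained from c' by expanding the leaf for the merged letter k+1 into two children for letters 1 and 2, then each of those two codewords is one bit longer than c'(k+1), so

len(c) = Σ_{i=1}^{k} f_i · len(c(i)) = len(c') + f_1 + f_2.

Since f_1 + f_2 does not depend on the code, c minimizes len(c) exactly when c' minimizes len(c'), which together with facts (1) and (2) completes the induction.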

Huffman Codes
Theorem: Huffman's Alg produces an optimal prefix-free code.
If c' is optimal for f_3, ..., f_{k+1}, then c is optimal for f_1, ..., f_k.

An Experiment
Take the Dickens novel A Tale of Two Cities. File size is 799,940 bytes.
Build a Huffman code and compress. File size is now 439,688 bytes.

        Raw       Huffman
Size    799,940   439,688

Huffman Codes
Huffman's Algorithm: pair up the two letters with the lowest frequency and recurse.
Theorem: Huffman's Algorithm produces a prefix-free code of optimal length.
In what sense is this code really optimal?

Length of Huffman Codes
What can we say about Huffman code length?
Suppose f_i = 2^{-ℓ_i} for every i ∈ Σ. Then len(c(i)) = ℓ_i for the optimal Huffman code.
Proof: ...

Length of Huffman Codes
What can we say about Huffman code length?
Suppose f_i = 2^{-ℓ_i} for every i ∈ Σ. Then len(c(i)) = ℓ_i for the optimal Huffman code, so
len(c) = Σ_{i ∈ Σ} f_i · ℓ_i = Σ_{i ∈ Σ} f_i · log₂(1/f_i).
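
A worked instance with the running example (not on the slide): for f = (1/2, 1/4, 1/8, 1/8) the lengths are ℓ = (1, 2, 3, 3), and

len(c) = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3 = 1.75 = Σ_i f_i · log₂(1/f_i).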

Entropy
Given a set of frequencies (aka a probability distribution), the entropy is
H(f) = Σ_i f_i · log₂(1/f_i).
Entropy is a measure of randomness.

Entropy
Given a set of frequencies (aka a probability distribution), the entropy is
H(f) = Σ_i f_i · log₂(1/f_i).
Entropy is a measure of randomness.
Entropy was introduced by Shannon in 1948 and is the foundational concept in:
- Data compression
- Communicating over a noisy connection
- Cryptography and security
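
A small Python helper matching this definition (an illustrative sketch; entropy is a name introduced here):

```python
import math

def entropy(freqs):
    """Shannon entropy H(f) = sum of f_i * log2(1/f_i), in bits."""
    return sum(f * math.log2(1 / f) for f in freqs if f > 0)

print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75
print(entropy([1/4] * 4))              # 2.0 (uniform over 4 letters)
```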

Entropy of Passwords
Your password is a specific string, so f_pwd = 1.0.
To talk about the security of passwords, we have to model them as random.
Random 16-letter string: H = 16 · log₂(26) ≈ 75.2
Random IMDb movie: H = log₂(1,764,727) ≈ 20.7
Your favorite IMDb movie: H ≪ 20.7
Entropy measures how difficult passwords are to guess on average.
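
A quick check of this arithmetic (using Python's math.log2; 1,764,727 is the movie count quoted on the slide):

```python
import math

# Random 16-letter lowercase string: 16 independent choices from 26 letters.
print(16 * math.log2(26))        # ≈ 75.21 bits

# Uniformly random title from a catalog of 1,764,727 movies.
print(math.log2(1_764_727))      # ≈ 20.75 bits
```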

Entropy of Passwords

Entropy and Compression
Given a set of frequencies (probability distribution), the entropy is
H(f) = Σ_i f_i · log₂(1/f_i).
Suppose that we generate a string x by choosing n random letters independently with frequencies f.
Any compression scheme requires at least n · H(f) bits to store x (as n → ∞).
Huffman codes are truly optimal!

But Wait!
Take the Dickens novel A Tale of Two Cities. File size is 799,940 bytes.
Build a Huffman code and compress. File size is now 439,688 bytes.
But we can do better!

        Raw       Huffman   gzip      bzip2
Size    799,940   439,688   301,295   220,156

What do the frequencies represent?
Real data (e.g. natural language, music, images) have patterns between letters.
The frequency of U is only 2.76% in English, but what if the previous letter is Q?
Possible approach: model pairs of letters. Record the frequency of pairs of letters and build a Huffman code for the pairs (see the sketch below).
Pros: can improve the compression ratio.
Cons: the code tree is now much bigger, and we cannot identify patterns that span more than two letters.
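
A minimal sketch of the pair-counting step (illustrative; pair_frequencies is a name introduced here, and the resulting table could be fed to the Huffman routine sketched earlier):

```python
from collections import Counter

def pair_frequencies(text):
    """Empirical frequencies of overlapping letter pairs in text."""
    pairs = Counter(zip(text, text[1:]))
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}

print(pair_frequencies("ABABBA"))
# {('A', 'B'): 0.4, ('B', 'A'): 0.4, ('B', 'B'): 0.2}
```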

Lempel-Ziv-Welch Compression
Try to learn patterns as you find them!
Compress: _``_`_``_a_``_

Lempel-Ziv-Welch Compression
Try to learn patterns as you find them!
1. Start with a dictionary D containing every single letter
2. Until the input is empty:
   1. Find the longest prefix p of the input that matches an entry of D
   2. Output D(p) and remove p from the input
   3. Add (p + next letter) to D
zip uses (some version of) LZW compression. (A runnable sketch follows below.)
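
Here is a compact Python sketch of this loop (illustrative only; lzw_compress is a name introduced here, and real zip/LZW implementations also manage code widths and dictionary resets):

```python
def lzw_compress(text):
    """LZW: emit dictionary indices, learning longer patterns as we go.

    The dictionary starts with one code per distinct letter and grows by
    one entry (matched prefix + next letter) per emitted code.
    """
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    output = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch                                  # keep extending the match
        else:
            output.append(dictionary[current])             # emit longest match
            dictionary[current + ch] = len(dictionary)     # learn a new pattern
            current = ch
    if current:
        output.append(dictionary[current])
    return output, dictionary

codes, d = lzw_compress("ABABABABA")
print(codes)   # [0, 1, 2, 4, 3]  i.e. A, B, AB, ABA, BA with A=0, B=1
```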

Entropy and Compression
Given a set of frequencies (probability distribution), the entropy is
H(f) = Σ_i f_i · log₂(1/f_i).
Suppose that we generate a string x by choosing n random letters independently with frequencies f.
Any compression scheme requires at least n · H(f) bits to store x (as n → ∞).
Huffman codes are truly optimal if (and only if) there is no relationship between different letters!