Entropy as a measure of surprise

Lecture 5: Sam Roweis, September 26, 2005

What does information do? It removes uncertainty. Information Conveyed = Uncertainty Removed = Surprise Yielded. How should we quantify information/uncertainty/surprise? Here are some properties any function h(p(event)) should possess:

1. h(1) = 0 (no surprise for certain events)
2. h(0) = ∞ (infinite surprise for impossible events)
3. p_i > p_j implies h(p_i) < h(p_j) (higher probability means less surprise)
4. x, y independent implies h(p(x_i & y_j)) = h(p(x_i)) + h(p(y_j)) (surprise is additive for independent events)
5. h(·) is continuous (a small change in probability gives a small change in surprise)

What functions h(·) satisfy all the requirements above? The only consistent solution is h(p) = −a log_b(p). By convention, we choose a = 1, b = 2, which sets the units to bits.

Reminder: Entropy and Optimal Codes

Last class we introduced the entropy H = Σ_i p_i log2(1/p_i) of a source, which provides a hard lower bound on the average per-symbol encoding length for any decodable code. We also learned about Huffman's algorithm, which constructs an optimal symbol code for any source. But Huffman codes may be as much as one bit worse than the entropy on average. To get closer to the lower limit of the entropy, we are going to consider block coding, in which we encode several adjacent source symbols together (called the extension of a source). This introduces some decoding delay, since we can't decode any member of the block until the code for the entire block is received, but it gets us closer to the entropy in performance. First, let's talk a bit more about entropy, then see why Huffman codes are optimal, then go back to block coding.

Proving that Binary Huffman Codes are Optimal

We can prove that the binary Huffman code procedure produces optimal codes by induction on the number of source symbols, I. For I = 2, the code produced is obviously optimal: you can't do better than using one bit to code each symbol. For I > 2, we assume that the procedure produces optimal codes for any alphabet of size I − 1 (with any symbol probabilities), and then prove that it does so for alphabets of size I as well. So, to start with, suppose the Huffman procedure produces optimal codes for alphabets of size I − 1. Let L be the expected codeword length of the code produced by the procedure when it is used to encode the symbols a_1, ..., a_I, having probabilities p_1, ..., p_I. Without loss of generality, let's assume that p_i ≥ p_{I−1} ≥ p_I for all i in {1, ..., I − 2}.
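As a brief concrete aside before the optimality proof, here is a small Python sketch (my own illustration, not part of the original notes) of the quantities defined above: the surprise h(p) = log2(1/p) and the entropy H as expected surprise, with a check that surprise adds for independent events.

```python
import math

def surprise(p):
    """Surprise (self-information) of an event with probability p, in bits."""
    if p == 0:
        return math.inf          # infinite surprise for impossible events
    return -math.log2(p)         # h(p) = -log2(p), so h(1) = 0

def entropy(probs):
    """Entropy H = sum_i p_i log2(1/p_i): the expected surprise, in bits."""
    return sum(p * surprise(p) for p in probs if p > 0)

# Surprise is additive for independent events: h(p*q) = h(p) + h(q).
p, q = 0.25, 0.5
assert abs(surprise(p * q) - (surprise(p) + surprise(q))) < 1e-12

# Entropy of the five-symbol source used in the examples below.
print(f"{entropy([0.35, 0.17, 0.17, 0.16, 0.15]):.2f}")   # about 2.23 bits/symbol
```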

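For reference during the induction argument, the following is a minimal sketch (my own, not code from the lecture) of the Huffman procedure being analyzed: repeatedly merge the two least probable symbols into one, then read codewords off the resulting binary tree.

```python
import heapq
import itertools

def huffman_code(probs):
    """Return {symbol_index: codeword} for a binary Huffman code of the given probabilities."""
    counter = itertools.count()              # tie-breaker so heapq never compares trees
    heap = [(p, next(counter), i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)      # merge the two least probable subtrees;
        p2, _, t2 = heapq.heappop(heap)      # they become siblings in the code tree
        heapq.heappush(heap, (p1 + p2, next(counter), (t1, t2)))
    code = {}
    def walk(tree, prefix):                  # leaves are symbol indices, internal nodes are pairs
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix or "0"       # degenerate one-symbol alphabet
    walk(heap[0][2], "")
    return code

probs = [0.35, 0.17, 0.17, 0.16, 0.15]
code = huffman_code(probs)
print(code)                                              # codeword lengths (1, 3, 3, 3, 3)
avg = sum(p * len(code[i]) for i, p in enumerate(probs))
print(f"{avg:.2f}")                                      # 2.30 bits/symbol
```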
The Induction Step

The recursive call in the procedure will have produced a code for symbols a_1, ..., a_{I−2}, a′, having probabilities p_1, ..., p_{I−2}, p′, with p′ = p_{I−1} + p_I. By the induction hypothesis, this code is optimal. Let its average length be L′. Suppose some other instantaneous code for a_1, ..., a_I had expected length less than L. We can modify this code so that the codewords for a_{I−1} and a_I are siblings (i.e., they have the forms z0 and z1) while keeping its average length the same, or less. Let the average length of this modified code be L′′, which must also be less than L. From this modified code, we can produce another code for a_1, ..., a_{I−2}, a′. We keep the codewords for a_1, ..., a_{I−2} the same, and encode a′ as z. Let the average length of this code be L′′′.

Won't top-down splitting work just as well?

Shannon & Fano had another technique for constructing a prefix code, other than rounding up log2(1/p_i) to get l_i. They arranged the source symbols in order from most probable to least probable, and then divided them into two sets whose total probabilities are as close as possible to being equal. All symbols then have the first digits of their codes assigned: symbols in the first set receive 0 and symbols in the second set receive 1. As long as any sets with more than one member remain, the same process is repeated on those sets, to determine successive digits of their codes. When a set has been reduced to one symbol, the symbol's code is complete, and furthermore it is guaranteed not to form the prefix of any other symbol's code.

The Induction Step (Conclusion)

We now have two codes for a_1, ..., a_I and two for a_1, ..., a_{I−2}, a′. The average lengths of these codes satisfy the following equations:

L = L′ + p_{I−1} + p_I
L′′ = L′′′ + p_{I−1} + p_I

Why? The codes for a_1, ..., a_I are like the codes for a_1, ..., a_{I−2}, a′, except that one symbol is replaced by two whose codewords are one bit longer. This one additional bit is added with probability p′ = p_{I−1} + p_I. Since L′ is the optimal average length for the smaller alphabet, L′′′ ≥ L′. From these equations, we then see that L′′ ≥ L, which contradicts the supposition that L′′ < L. The Huffman procedure therefore produces optimal codes for alphabets of size I. By induction, this is true for all I.

Shannon-Fano top-down splitting

Example: symbol probabilities (p_1 = .35, p_2 = .17, p_3 = .17, p_4 = .16, p_5 = .15). First split: (.35, .17) ; (.17, .16, .15). Second split: .35 ; .17. Third split: .17 ; (.16, .15). Final split: .16 ; .15.
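As a sanity check on the splitting example above, here is a short sketch (my own illustration) of the top-down procedure, under the convention that the first set receives 0; it reproduces the splits listed above.

```python
def shannon_fano(symbols):
    """Top-down Shannon-Fano splitting.
    symbols: list of (name, probability), sorted most to least probable.
    Returns a dict {name: codeword}.
    """
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(p for _, p in symbols)
    # Find the split whose two parts are as close to equal probability as possible.
    best_k, best_diff, running = 1, float("inf"), 0.0
    for k in range(1, len(symbols)):
        running += symbols[k - 1][1]
        diff = abs(running - (total - running))
        if diff < best_diff:
            best_k, best_diff = k, diff
    code = {}
    for name, word in shannon_fano(symbols[:best_k]).items():
        code[name] = "0" + word          # first set gets 0
    for name, word in shannon_fano(symbols[best_k:]).items():
        code[name] = "1" + word          # second set gets 1
    return code

probs = [("a1", 0.35), ("a2", 0.17), ("a3", 0.17), ("a4", 0.16), ("a5", 0.15)]
code = shannon_fano(probs)
print(code)   # {'a1': '00', 'a2': '01', 'a3': '10', 'a4': '110', 'a5': '111'}
print(f"{sum(p * len(code[s]) for s, p in probs):.2f}")   # 2.31 bits/symbol
```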

Top-down splitting is sub-optimal

The above algorithm works, and it produces fairly efficient variable-length encodings. When the two smaller sets produced by a partitioning happen to be exactly equal in probability, the one bit of information used to distinguish them is used most efficiently. Unfortunately, Shannon-Fano top-down splitting does not always produce optimal prefix codes: the code we obtained above was {00, 01, 10, 110, 111}, which has average length L = 2.31 bits/symbol. Huffman's algorithm gives codeword lengths (1, 3, 3, 3, 3), e.g. the code {0, 100, 101, 110, 111}, which has average length L = 2.30 bits/symbol. Why, intuitively, does Huffman's algorithm work?

Another Way to Compress Down to the Entropy

We get a similar result by supposing that we will always encode N symbols into a block of exactly NR bits. Can we do this in a way that is very likely to be decodable? Yes, for large values of N. The Law of Large Numbers (LLN) tells us that the sequence of symbols to encode, a_{i_1}, ..., a_{i_N}, is very likely to be a typical one, for which

(1/N) log2(1/(p_{i_1} ··· p_{i_N})) = (1/N) Σ_{j=1}^{N} log2(1/p_{i_j})

is very close to the expectation of log2(1/p_i), which is the entropy, H(X) = Σ_i p_i log2(1/p_i). (See Section 4.3 of MacKay's book.) So if we encode all the sequences in this typical set in a way that can be decoded, the code will almost always be uniquely decodable.

Shannon's Noiseless Coding Theorem

By using extensions of the source, we can compress arbitrarily close to the entropy! Formally: for any desired average length per symbol, R, that is greater than the binary entropy, H(X), there is a value of N for which a uniquely decodable binary code for X^N exists that has expected length less than NR. Consider coding the N-th extension of a source whose symbols have probabilities p_1, ..., p_I, using a binary Shannon-Fano code. The Shannon-Fano code for blocks of N symbols will have expected codeword length, L_N, no greater than 1 + H(X^N) = 1 + N H(X). The expected codeword length per original source symbol will therefore be no greater than

L_N / N ≤ (1 + N H(X)) / N = H(X) + 1/N.

By choosing N to be large enough, we can make this as close to the entropy, H(X), as we wish.

How Big is the Typical Set?

Let's define typical sequences as ones where

(1/N) log2(1/(p_{i_1} ··· p_{i_N})) ≤ H(X) + η/√N.

The probability of any such typical sequence will satisfy

p_{i_1} ··· p_{i_N} ≥ 2^(−N H(X) − η√N).

We scale the margin allowed above H(X) as 1/√N since that's how the standard deviation of an average scales. The LLN (Chebyshev's inequality) then tells us that most sequences will satisfy this condition, for some large enough value of η. The total probability of all such sequences can't be greater than one, so the number of typical sequences can't be greater than 2^(N H(X) + η√N). We will be able to encode these sequences in NR bits if NR ≥ N H(X) + η√N. If R > H(X), this will be true if N is sufficiently large.
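A quick simulation (my own sketch, not part of the notes) shows the concentration that the LLN provides: the per-symbol information −(1/N) log2 of a sampled block's probability clusters around H(X), with deviations shrinking roughly like 1/√N.

```python
import math
import random

probs = [0.35, 0.17, 0.17, 0.16, 0.15]
H = sum(p * math.log2(1 / p) for p in probs)        # entropy of the source, about 2.23 bits

def per_symbol_info(N, rng):
    """-(1/N) log2(probability of one random block of N i.i.d. symbols)."""
    drawn = rng.choices(probs, weights=probs, k=N)  # each entry is the drawn symbol's probability
    return sum(math.log2(1 / p) for p in drawn) / N

rng = random.Random(0)
for N in (10, 100, 1000, 10000):
    samples = [per_symbol_info(N, rng) for _ in range(200)]
    worst = max(abs(x - H) for x in samples)
    print(f"N={N:6d}  worst |deviation from H| over 200 blocks = {worst:.3f}")
# The deviations shrink roughly like 1/sqrt(N), matching the eta/sqrt(N) margin used above.
```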

Some Observations about Instantaneous Codes

Suppose we have an instantaneous code for symbols a_1, ..., a_I, with probabilities p_1, ..., p_I and codeword lengths l_1, ..., l_I. Under each of the following conditions, we can find a better instantaneous code, i.e. one with smaller expected codeword length:

1. If p_1 < p_2 and l_1 < l_2: swap the codewords for a_1 and a_2.
2. If there is a codeword of the form xby, where x and y are strings of zero or more bits and b is a single bit, but there are no codewords of the form xb′z, where z is a string of zero or more bits and b′ ≠ b: change all the codewords of the form xby to xy. (This improves things if none of the p_i are zero, and never makes things worse.)

Continuing to Improve the Example

The result is the code shown below:

[Code tree figure, codewords not reproduced: a_1, p_1 = .1; a_2, p_2 = .12; a_3, p_3 = .4; a_4, p_4 = .12; a_5, p_5 = .13; a_6, p_6 = .13]

Now we note that a_6, with probability .13, has a longer codeword than a_1, which has probability .1. We can improve the code by swapping the codewords for these symbols.

The Improvements in Terms of Trees

We can view these improvements in terms of the trees for the codes. Here's an example:

[Code tree figure, codewords not reproduced: a_1, p_1 = .1; a_2, p_2 = .12; a_3, p_3 = .4; a_4, p_4 = .12; a_5, p_5 = .13; a_6, p_6 = .13]

Two codewords have the form xby for some prefix xb, but none have the form xb′z (i.e., there is only one branch out of that node). We can therefore improve the code by deleting the surplus node.

The State After These Improvements

Here's the code after this improvement:

[Code tree figure, codewords not reproduced: a_6, p_6 = .13; a_2, p_2 = .12; a_3, p_3 = .4; a_4, p_4 = .12; a_5, p_5 = .13; a_1, p_1 = .1]

In general, after such improvements: the most improbable symbol will have the longest codeword, and there will be at least one other codeword of this length, namely its sibling in the tree. The second-most improbable symbol will also have a codeword of the longest length.
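The first improvement rule is easy to mechanize. In the sketch below (my own illustration), the probabilities are those of the running example, but the starting codewords are hypothetical, since the lecture's tree diagrams are not reproduced here; they are chosen so that a_6 has a longer codeword than a_1, as in the slide above.

```python
def expected_length(code, probs):
    """Expected codeword length: sum_i p_i * l_i."""
    return sum(probs[s] * len(w) for s, w in code.items())

def apply_rule_1(code, probs):
    """Rule 1: while some more probable symbol has a strictly longer codeword, swap the two codewords."""
    improved = True
    while improved:
        improved = False
        for a in list(code):
            for b in list(code):
                if probs[a] > probs[b] and len(code[a]) > len(code[b]):
                    code[a], code[b] = code[b], code[a]
                    improved = True
    return code

# Probabilities from the running example; the starting codewords are hypothetical
# (the tree diagrams are not reproduced), chosen so that a6 (p = .13) has a longer
# codeword than a1 (p = .1). The code is prefix-free, and swapping keeps it so.
probs = {"a1": .10, "a2": .12, "a3": .40, "a4": .12, "a5": .13, "a6": .13}
code  = {"a1": "00", "a2": "01", "a3": "10", "a4": "110", "a5": "1110", "a6": "1111"}

print(f"{expected_length(code, probs):.2f}")   # 2.64 bits/symbol before improving
code = apply_rule_1(code, probs)
print(f"{expected_length(code, probs):.2f}")   # 2.56 bits/symbol once no such pair remains
```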

A Final Rearrangement

The codewords for the most improbable and second-most improbable symbols must have the same length. The most improbable symbol's codeword also has a sibling of the same length. We can swap codewords to make this sibling be the codeword for the second-most improbable symbol. For the example, the result is:

[Code tree figure, codewords not reproduced: a_6, p_6 = .13; a_2, p_2 = .12; a_3, p_3 = .4; a_5, p_5 = .13; a_4, p_4 = .12; a_1, p_1 = .1]

An End and a Beginning

Shannon's Noiseless Coding Theorem is mathematically satisfying. From a practical point of view, though, we still have two problems:

How can we compress data to nearly the entropy in practice? The number of possible blocks of size N is I^N, which is huge when N is large, and N sometimes must be large to get close to the entropy by encoding blocks of size N. One solution: a technique known as arithmetic coding.

Where do the symbol probabilities p_1, ..., p_I come from? And are symbols really independent, with known, constant probabilities? This is the problem of source modeling.
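To see the practical tension concretely, here is a small calculation (my own illustration) of the per-symbol overhead bound 1/N against the number of distinct blocks I^N for the five-symbol example source; the overhead shrinks only linearly in N while the block count grows exponentially, which is the problem arithmetic coding sidesteps.

```python
I = 5                        # alphabet size of the example source
for N in (1, 2, 4, 8, 16, 32):
    overhead = 1 / N         # block Shannon-Fano coding is within 1/N bit/symbol of H(X)
    print(f"N={N:2d}  overhead <= {overhead:.4f} bits/symbol   blocks I^N = {I**N:,}")
# Halving the per-symbol overhead squares the number of blocks to enumerate;
# arithmetic coding reaches the same compression without listing them explicitly.
```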