Sample solutions to Homework 4, Information-Theoretic Modeling (Fall 2014)

Jussi Määttä
October 2, 2014

Question 1

[First, note that we use the symbol ! as an end-of-message symbol. When we see it, we know that the message has ended. This is not mentioned in the lecture slides.]

(a) The cumulative function $F(x) = \sum_{y \le x} p(y)$ takes the values $F(a) = 0.05$, $F(b) = 0.55$, $F(c) = 0.9$ and $F(!) = 1$. For simplicity, denote $F_l(\cdot) = (0, 0.05, 0.55, 0.9)$ and $F_r(\cdot) = (0.05, 0.55, 0.9, 1)$. For instance, $F_l(c) = 0.55$ and $F_r(c) = 0.9$.

In general, if we have the interval $[a, b) \subseteq [0, 1)$ and we see a symbol $x$, the new interval will be
$$[\,a + (b - a)\,F_l(x),\; a + (b - a)\,F_r(x)\,),$$
which is a subinterval of $[a, b)$. This is nothing different from what is shown e.g. in the lecture slides; it is just the idea of recursive partitioning of intervals put into a formula.

Initially, for the empty message, we have the interval $[0, 1)$. The interval after the first symbol c is $[F_l(c), F_r(c)) = [0.55, 0.9)$.
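As an aside, the update rule just given is easy to run mechanically. Here is a small illustrative Python sketch (not a required part of the solution) that applies it with exact fractions; the dictionaries F_l and F_r simply hold the values listed above, and the printed intervals can be compared with the ones derived next.

    from fractions import Fraction as Fr

    # Model of part (a): p(a) = 0.05, p(b) = 0.5, p(c) = 0.35, p(!) = 0.1.
    F_l = {'a': Fr(0), 'b': Fr('0.05'), 'c': Fr('0.55'), '!': Fr('0.9')}
    F_r = {'a': Fr('0.05'), 'b': Fr('0.55'), 'c': Fr('0.9'), '!': Fr(1)}

    lo, hi = Fr(0), Fr(1)                      # empty message: [0, 1)
    for s in 'cab!':
        # [lo, hi) -> [lo + (hi - lo) F_l(s), lo + (hi - lo) F_r(s))
        lo, hi = lo + (hi - lo) * F_l[s], lo + (hi - lo) * F_r[s]
        print(s, float(lo), float(hi))
    # prints the endpoints of [0.55, 0.9), [0.55, 0.5675),
    # [0.550875, 0.559625) and finally [0.55875, 0.559625).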

After the second symbol a, the interval becomes
$$I(ca) = [\,0.55 + (0.9 - 0.55)\,F_l(a),\; 0.55 + (0.9 - 0.55)\,F_r(a)\,) = [0.55, 0.5675).$$
After the third symbol b, the interval becomes
$$I(cab) = [\,0.55 + (0.5675 - 0.55)\,F_l(b),\; 0.55 + (0.5675 - 0.55)\,F_r(b)\,) = [0.550875, 0.559625).$$
Finally, we get $I(cab!) = [0.55875, 0.559625)$.

(b) (i) The shortest codeword within the interval is 0.559. It is the only codeword within $I(cab!)$ that has fewer than four decimals.

(ii) We cannot use $C = 0.559$, because $0.5597 \notin I(cab!)$. But $C = 0.5595$ suffices: for any number $D$ that is a continuation of $C$, we have
$$0.5595 = C \le D \le 0.5595999\ldots = 0.5596.$$
In fact, any $C \in \{0.5590, 0.5591, \ldots, 0.5595\}$ will do.
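To double-check part (b), one can search over decimal codewords directly. This is only an illustrative sketch, again with exact fractions: a codeword $C$ with $k$ decimals is acceptable if every continuation, i.e. every number in $[C, C + 10^{-k}]$, stays inside $I(cab!)$.

    from fractions import Fraction as Fr

    lo, hi = Fr('0.55875'), Fr('0.559625')     # I(cab!)

    def acceptable(digits):
        """True if all continuations of the codeword 0.<digits> lie in [lo, hi)."""
        k = len(digits)
        c = Fr(int(digits), 10 ** k)
        return lo <= c and c + Fr(1, 10 ** k) <= hi

    print(acceptable('559'))                   # False: the continuation 0.5597 escapes
    print([d for d in range(5590, 5600) if acceptable(str(d))])
    # -> [5590, 5591, 5592, 5593, 5594, 5595], i.e. 0.5590, ..., 0.5595 all work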

(c) The probabilities have changed, so let us first find the new interval. We now have $F_l(\cdot) = (0, 2/32, 18/32, 29/32)$ and $F_r(\cdot) = (2/32, 18/32, 29/32, 1)$, i.e. $p(a) = 2/32$, $p(b) = 16/32$, $p(c) = 11/32$ and $p(!) = 3/32$. Hence,
$$I(c) = [18/32,\, 29/32),$$
$$I(ca) = [18/32,\, 299/512),$$
$$I(cab) = [4619/8192,\, 4707/8192),$$
$$I(cab!) = [18795/32768,\, 4707/8192).$$
In binary, we have
$$I(cab!) = [0.100100101101011,\; 0.1001001100011).$$
Let us take a closer look at these numbers:

    (0.) 1 0 0 1 0 0 1 0 1 1 0 1 0 1 1
    (0.) 1 0 0 1 0 0 1 1 0 0 0 1 1

The two endpoints agree in their first seven bits and differ at the eighth. The shortest codeword within $I(cab!)$ is 0.10010011. The (unique) shortest codeword all of whose continuations are in $I(cab!)$ is 0.10010010111. We got a prefix codeword of 11 bits. This agrees with the result mentioned in the lecture slides that
$$\left\lceil \log_2 \frac{1}{p(c)\,p(a)\,p(b)\,p(!)} \right\rceil + 1$$
bits is sufficient (here $\lceil 9.96 \rceil + 1 = 11$).

What would happen if we instead considered a Shannon symbol code? The code-length for cab! would be
$$\Bigl\lceil \underbrace{\log_2 \tfrac{1}{p(c)}}_{\approx 1.54} \Bigr\rceil + \Bigl\lceil \underbrace{\log_2 \tfrac{1}{p(a)}}_{=4} \Bigr\rceil + \Bigl\lceil \underbrace{\log_2 \tfrac{1}{p(b)}}_{=1} \Bigr\rceil + \Bigl\lceil \underbrace{\log_2 \tfrac{1}{p(!)}}_{\approx 3.42} \Bigr\rceil = 2 + 4 + 1 + 4 = 11.$$
The Huffman code for single-letter symbols would give a code-length of 9 bits. So arithmetic coding does not always beat other codes (you should not be surprised by this).

Note. Consider the message ccc!. It seems reasonable to expect that arithmetic coding will fare better with this message: $\log_2(1/p(c))$ and $\log_2(1/p(!))$ are not integers and hence arithmetic coding should benefit. And indeed, the Shannon code gives code-length 10, Huffman gives 9, and arithmetic coding can encode the message into (0.)110111000, i.e. 9 bits. To better see the advantage of arithmetic coding, we would have to consider longer messages.

Note 2. If you got different answers and used a computer, you may have had problems with the accuracy of floating-point numbers. It is safest to do this with pen and paper, or use e.g. wxMaxima (https://andrejv.github.io/wxmaxima/). Of course, proper algorithmic implementations of arithmetic coding work around these issues; see for instance the paper by Witten, Neal & Cleary (link at the course webpage) for a C implementation.
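For completeness, here is a small illustrative sketch in the spirit of Note 2: it recomputes the part (c) intervals with exact fractions and searches for the shortest prefix-free codeword, confirming 11 bits for cab! and 9 bits for ccc!.

    import math
    from fractions import Fraction as Fr

    # Part (c) model: p(a) = 2/32, p(b) = 16/32, p(c) = 11/32, p(!) = 3/32.
    F_l = {'a': Fr(0), 'b': Fr(2, 32), 'c': Fr(18, 32), '!': Fr(29, 32)}
    F_r = {'a': Fr(2, 32), 'b': Fr(18, 32), 'c': Fr(29, 32), '!': Fr(1)}

    def interval(message):
        """Arithmetic-coding interval [lo, hi) of the message under the model above."""
        lo, hi = Fr(0), Fr(1)
        for s in message:
            lo, hi = lo + (hi - lo) * F_l[s], lo + (hi - lo) * F_r[s]
        return lo, hi

    def shortest_prefix_codeword(lo, hi):
        """Shortest bit string whose every continuation stays inside [lo, hi)."""
        length = 1
        while True:
            k = math.ceil(lo * 2 ** length)       # smallest k with k / 2^length >= lo
            if Fr(k + 1, 2 ** length) <= hi:      # continuations reach at most (k+1) / 2^length
                return format(k, '0{}b'.format(length))
            length += 1

    for msg in ('cab!', 'ccc!'):
        lo, hi = interval(msg)
        code = shortest_prefix_codeword(lo, hi)
        print(msg, code, len(code))
    # prints: cab! 10010010111 11, then ccc! 110111000 9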

Question 2

For $\theta \in \{0.25, 0.75\}$, the probability of the data is $0.25^2 \cdot 0.75^2 = 9/256 \approx 0.035$. For $\theta = 0.5$, the probability is $0.5^4 = 1/16 \approx 0.063$. Hence, with this quantization, using the maximum likelihood parameter gives the total code-length
$$l(0.5) + \log_2 \frac{1}{1/16} = 1 + 4 = 5.$$
(Had we used $\theta = 0.25$ or $\theta = 0.75$, the code-length would be $2 + \log_2(256/9) \approx 6.83$.)

Question 3

Consider first the more general setting where the parameter $\theta$ is distributed according to the beta distribution, $\theta \sim \mathrm{Beta}(\alpha, \beta)$, $\alpha > 0$, $\beta > 0$. The so-called beta-binomial distribution is a generalization of the binomial distribution in which the success probability $p$ is first drawn from the beta distribution and then used for all trials. The probability mass function of the beta-binomial distribution can be written in the form (see http://mathworld.wolfram.com/betabinomialdistribution.html)
$$f(k) = \binom{n}{k} \frac{B(k + \alpha,\, n - k + \beta)}{B(\alpha, \beta)},$$
where $n$ is the number of trials, $k$ is the number of successes and $B(x, y)$ is the beta function. (If $x$ and $y$ are positive integers, then we may simplify $B(x, y) = \frac{(x-1)!\,(y-1)!}{(x+y-1)!}$.)

In our case, the parameter $\theta$ has the uniform distribution, or equivalently $\theta \sim \mathrm{Beta}(1, 1)$, so we have
$$f(k) = \binom{n}{k} \frac{B(k + 1,\, n - k + 1)}{B(1, 1)} = \binom{n}{k} \frac{k!\,(n - k)!}{(n + 1)!} = \frac{n!}{k!\,(n - k)!} \cdot \frac{k!\,(n - k)!}{(n + 1)!} = \frac{n!}{(n + 1)!} = \frac{1}{n + 1},$$
which we have to divide by $\binom{n}{k}$ because we are using an $n$-fold Bernoulli distribution instead of a binomial distribution. So for the sequence 0011, we have $n = 4$ and $k = 2$, and hence its probability is
$$p_w(0011) = \frac{1}{(n + 1)\binom{n}{k}} = \frac{1}{5 \cdot 6} = \frac{1}{30}.$$
Hence, the mixture code-length is $\log_2(1/p_w(0011)) = \log_2 30 \approx 4.91$.

Note. This particular case can also be solved directly by integration,
$$\int_0^1 \theta^2 (1 - \theta)^2 \, d\theta = \int_0^1 \left( \theta^4 - 2\theta^3 + \theta^2 \right) d\theta = \left[ \frac{\theta^5}{5} - \frac{\theta^4}{2} + \frac{\theta^3}{3} \right]_0^1 = \frac{1}{5} - \frac{1}{2} + \frac{1}{3} = \frac{1}{30},$$
and then taking the logarithm.
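The numbers above are easy to verify with a few lines of Python. This is only an illustrative check, using the integer-argument form of the beta function mentioned above (math.comb needs Python 3.8 or newer).

    from math import comb, factorial, log2

    def beta(x, y):
        """B(x, y) for positive integers: (x-1)! (y-1)! / (x+y-1)!."""
        return factorial(x - 1) * factorial(y - 1) / factorial(x + y - 1)

    n, k = 4, 2                                   # the sequence 0011
    # P(0011) = integral of theta^k (1 - theta)^(n-k) dtheta = B(k+1, n-k+1) / B(1, 1),
    # which equals 1 / ((n + 1) * C(n, k)).
    p_seq = beta(k + 1, n - k + 1) / beta(1, 1)
    print(p_seq, 1 / ((n + 1) * comb(n, k)))      # both equal 1/30 = 0.0333...
    print(log2(1 / p_seq))                        # mixture code-length, about 4.907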

Note 2. (Optional, if you're interested:) It was pointed out during the exercise session that there is yet another way to solve this, by doing updates on the beta distribution. The distribution $\mathrm{Beta}(1, 1)$ can be interpreted so that before seeing any data, we assume that we have already seen one 0 and one 1. Then we proceed as follows:

    Bits seen   # of zeros   # of ones   Pr[0]   Pr[1]
    (none)          1            1        1/2     1/2
    0               2            1        2/3     1/3
    00              3            1        3/4     1/4
    001             3            2        3/5     2/5

So the probability of the sequence 0011 is
$$\frac{1}{2} \cdot \frac{2}{3} \cdot \frac{1}{4} \cdot \frac{2}{5} = \frac{1}{30}.$$
However, this approach may not be very illustrative of the point of mixture codes, so I will not go into details here.
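A brief sketch of this sequential view (again only for illustration): with a Beta(1, 1) prior, after observing $z$ zeros and $o$ ones the predictive probabilities are $\Pr[0] = (z+1)/(z+o+2)$ and $\Pr[1] = (o+1)/(z+o+2)$, which reproduces the Pr[0] and Pr[1] columns of the table above.

    from fractions import Fraction as Fr

    def sequence_probability(bits):
        """Probability of a bit string under the Beta(1, 1) mixture, computed sequentially."""
        zeros, ones, prob = 0, 0, Fr(1)
        for b in bits:
            pr_one = Fr(ones + 1, zeros + ones + 2)
            if b == '1':
                prob *= pr_one
                ones += 1
            else:
                prob *= 1 - pr_one
                zeros += 1
        return prob

    print(sequence_probability('0011'))   # 1/30, as in the table above
    print(sequence_probability('0000'))   # 1/5, i.e. log2(5) = 2.32 bits (cf. Note 3 below)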

Question 4

(a) The number of 1s in a binary sequence of four bits is some $k$ with $0 \le k \le 4$. What is the maximum likelihood parameter $\hat{\theta}$? For $k = 0$, the probability is $\theta^0 (1 - \theta)^4 = (1 - \theta)^4$, which is maximized when $\theta = 0$. Similarly, for $k = 4$ the maximum likelihood parameter is $\theta = 1$. For $k = 1, 2, 3$, the maximum likelihood parameter can be found by taking the logarithm and differentiating:
$$\log p_\theta(x_1, \ldots, x_4) = \log \theta^k (1 - \theta)^{4 - k} = k \log \theta + (4 - k) \log(1 - \theta),$$
$$\frac{d}{d\theta} \log p_\theta(x_1, \ldots, x_4) = \frac{k}{\theta} - \frac{4 - k}{1 - \theta}.$$
The derivative equals zero when $\theta = k/4$. (I leave it to the readers to convince themselves that this is the global maximum.) Combining the above, we see that the maximum likelihood parameter for sequences with $k$ 1s is $\hat{\theta}(k) = k/4$.

Now, for a given $0 \le k \le 4$, there are $\binom{4}{k}$ four-bit sequences that have $k$ 1s. Hence, we may compute the NML normalizing constant as follows:
$$C = \sum_{D \in \{0,1\}^4} p_{\hat{\theta}}(D) = \sum_{k=0}^{4} \binom{4}{k}\, \hat{\theta}(k)^k \left(1 - \hat{\theta}(k)\right)^{4 - k} = \sum_{k=0}^{4} \binom{4}{k} \left(\frac{k}{4}\right)^{k} \left(\frac{4 - k}{4}\right)^{4 - k} \approx 3.2188.$$

(b) By the above, the NML code-length for $D = 0011$ is
$$-\log_2 p_{\mathrm{NML}}(0011) = -\log_2 \frac{p_{\hat{\theta}}(0011)}{C} = -\log_2 \frac{0.5^4}{C} = 4 + \log_2 C \approx 5.6865.$$

Note. As mentioned in the problem statement, another option for calculating $C$ would be to simply go through all possible sequences (0000, 0001, 0010, ..., 1111), compute their maximum likelihood probabilities and sum them up. Then you have a sum of $2^n$ terms instead of a sum of $n + 1$ terms.

Note 2. As was pointed out by someone at the exercise session, the sum from $k = 0$ to $4$ can be made computationally easier by realizing that the terms for $k = 0$ and $k = 4$ agree, as do the terms for $k = 1$ and $k = 3$.

Note 3. For the sequence 0011, we got the following code-lengths: 5.00 (two-part code), 4.91 (mixture) and 5.69 (NML). What about the sequence 0000? The code-lengths will be approximately 3.66 (two-part code), 2.32 (mixture) and 1.69 (NML).
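The comparison in Note 3 can be reproduced with a short script. This is an illustrative sketch only, with the quantized parameter code of Question 2 (0.5 costing 1 bit, 0.25 and 0.75 costing 2 bits each) hard-coded.

    from math import comb, log2

    n = 4

    # NML normalizing constant: sum over k of C(n, k) (k/n)^k ((n-k)/n)^(n-k);
    # Python's ** already gives 0**0 == 1, which is the convention used here.
    C = sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k) for k in range(n + 1))

    def two_part(k):
        """Two-part code-length: best quantized parameter plus the data given it."""
        candidates = [(2, 0.25), (1, 0.5), (2, 0.75)]          # (parameter bits, theta)
        return min(l - log2(t ** k * (1 - t) ** (n - k)) for l, t in candidates)

    def mixture(k):
        """Mixture (Bayes) code-length with a uniform prior, from Question 3."""
        return log2((n + 1) * comb(n, k))

    def nml(k):
        """NML code-length, from Question 4."""
        theta = k / n
        return -log2(theta ** k * (1 - theta) ** (n - k) / C)

    print(C)                                                   # 3.21875, i.e. about 3.2188
    for seq in ('0011', '0000'):
        k = seq.count('1')
        print(seq, [round(f(k), 2) for f in (two_part, mixture, nml)])
    # prints: 0011 [5.0, 4.91, 5.69], then 0000 [3.66, 2.32, 1.69]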

Note 4. This question was asked at the exercise session. Consider the code-lengths for 0011 for the two-part code and the mixture code. They both round up to 5 bits. So is the mixture code really any better than the two-part code? The answer is that indeed it is. Consider, for instance, the problem of model selection: given some data, we want to pick a model (class) that seems to describe the data well. Our problem is then not compression but simply comparing the model classes, and hence there is no need to round up, and we end up preferring the mixture code over the two-part code.

Note 5. (Optional, if you're interested:) Computing the value of $C$ has been an interesting research problem. There is a linear-time algorithm for computing its value for multinomials (of which the Bernoulli is a special case) that was developed in our department. Take a look at the paper by Kontkanen and Myllymäki at http://dx.doi.org/10.1016/j.ipl.2007.04.003 (should be accessible from the university network).