Shannon's noiseless coding theorem


18.310 lecture notes                                            May 4, 2015

Shannon's noiseless coding theorem
Lecturer: Michel Goemans

In these notes we discuss Shannon's noiseless coding theorem, which is one of the founding results of the field of information theory. Roughly speaking, we want to answer such questions as: how much information is contained in some piece of data? One way to approach this question is to say that the data contains $n$ bits of information on average if it can be coded by a binary sequence of length $n$ on average. So information theory is closely related to data compression.

1 Some History

History of data compression. One of the earliest instances of widespread use of data compression came with telegraph code books, which were in widespread use at the beginning of the 20th century. At this time, telegrams were quite expensive; the cost of a transatlantic telegram was around $1 per word, which would be equivalent to something like $30 today. This led to the development of telegraph code books, some of which can be found in Google Books. These books gave a long list of words which encoded phrases. Some of the codewords in these books convey quite a bit of information; in the fourth edition of the ABC Code, for example, "Mirmidon" means "Lord High Chancellor has resigned" and "saturation" means "Are recovering salvage, but should bad weather set in the hull will not hold out long."[1]

Information theory. In 1948, Claude Shannon published a seminal paper which founded the field of information theory.[2] In this paper, among other things, he set data compression on a firm mathematical ground. How did he do this? Well, he set up a model of random data, and managed to determine how much it could be compressed. This is what we will discuss now.

2 Random data and compression

First of all we need a model of data. We take our data to be a sequence of letters from a given alphabet $A$. Then we need some probabilistic setting: we will have a random source of letters, so we could say that we have random letters $X_1, X_2, X_3, \ldots$ from $A$. In these notes we will assume that we have a first-order source, that is, we assume that the random variables $X_1, X_2, X_3, \ldots$ are independent and identically distributed. So for any letter $a$ in the alphabet $A$ the probability $P(X_n = a)$ is some constant $p_a$ which does not depend on $n$ or on the letters $X_1, X_2, \ldots, X_{n-1}$. Shannon's theory actually carries over to more complicated models of sources (Markov chains of any order), which would be more realistic models of reality. For simplicity, however, we shall only consider first-order sources in these notes.[3]

[1] The ABC Universal Commercial Telegraph Code, Specially Adapted for the Use of Financiers, Merchants, Shipowners, Brokers, Agents, Etc., by W. Clauson-Thue, American Code Publishing Co., Fourth Edition, 1899.
[2] Available on-line at http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
[3] Suppose for instance you are trying to compress English text. We might consider that we have some sample corpus of English text on hand (say, everything in the Library of Congress). Shannon considered a series of sources, each of which is a better approximation to English. The first-order source emits a letter $a$ with probability $p_a$ proportional to its frequency in the text; the probability distribution of a sequence of letters from this source is just independent random variables where the letter $a_j$ appears with some probability $p_j$. The second-order source is that where a letter is emitted with probability that depends only on the previous letter, and these probabilities are just the conditional probabilities that appear in the corpus (that is, the conditional probability of getting a "u", given that the previous letter was a "q", is derived by looking at the frequency of all letters that follow a "q" in the corpus). In the third-order source, the probability of a letter depends only on the two previous letters, and so on. High-order sources would seem to give a pretty good approximation of English, and so it would seem that a compression method that works well on this class of sources would also work well on English text.
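Returning to the first-order source model: for concreteness, here is a small Python sketch that draws i.i.d. letters $X_1, \ldots, X_n$. The alphabet and letter probabilities are illustrative choices, not taken from the notes.

    import random

    alphabet = ['a', 'b', 'c', 'd']           # the alphabet A (illustrative)
    probs    = [0.5, 0.25, 0.125, 0.125]      # p_a for each letter, summing to 1

    def draw_letters(n, seed=0):
        """Draw X_1, ..., X_n independently, each with P(X = a) = p_a."""
        rng = random.Random(seed)
        return rng.choices(alphabet, weights=probs, k=n)

    print(''.join(draw_letters(20)))

Any distribution over any finite alphabet would do; the point is only that successive letters are drawn independently from the same distribution.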

Now we need to say what data compression means. We shall encode data by binary sequences (sequences of 0's and 1's). A coding function $\phi$ for a set $S$ of messages is simply a function which associates to each element $s \in S$ a distinct binary sequence $\phi(s)$. Now if the messages $s \in S$ have a certain probability distribution, then the length $L$ of the binary sequence $\phi(s)$ is a random variable. We are looking for codes such that the average length $E(L)$ is as small as possible.

In our context, the random messages will be the sequences $s = (X_1, X_2, \ldots, X_n)$ consisting of the first $n$ letters coming out of the source. One way to encode these messages is to attribute distinct binary sequences of length $\lceil \log_2 |A| \rceil$ to the letters in the alphabet $A$. Then the binary sequence $\phi(s)$ would be the concatenation of the codes of the letters, so that the length $L$ of $\phi(s)$ would be $n \lceil \log_2 |A| \rceil$. That's a perfectly valid coding function, leading to average length $E(L) = n \lceil \log_2 |A| \rceil$. Now the main question is: can we do better? How much better? This is what we discuss next.

3 Shannon's entropy Theorem

Consider an alphabet $A = \{a_1, \ldots, a_k\}$ and a first-order source $X$ as above: the $n$th random letter is denoted $X_n$. For all $i \in \{1, \ldots, k\}$ we denote by $p_i$ the probability of the letter $a_i$, that is, $P(X_n = a_i) = p_i$. We define the entropy of the source $X$ as

$$H(p) = -\sum_{i=1}^{k} p_i \log_2 p_i.$$

We often denote the entropy just by $H$, without emphasizing the dependence on $p$. The entropy $H(p)$ is a nonnegative number. It can also be shown that $H(p) \le \log_2 k$ by concavity of the logarithm; this upper bound is achieved when $p_1 = p_2 = \cdots = p_k = 1/k$.

We will now discuss and prove (leaving a few details out) the following result of Shannon.

Theorem 1 (Shannon's entropy Theorem). Let $X$ be a first-order source with entropy $H$. Let $\phi$ be any coding function in binary words for the sequences $s = (X_1, X_2, \ldots, X_n)$ consisting of the first $n$ letters coming out of the source. Then the length $L$ of the code $\phi(s)$ is at least $nH$ on average, that is,

$$E(L) \ge nH + o(n),$$

where the little-$o$ notation means that the expression divided by $n$ goes to zero as $n$ goes to infinity. Moreover, there exists a coding function $\phi$ such that $E(L) \le nH + o(n)$.
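As a quick numeric illustration (using the same made-up distribution as in the sketch above), the following computes $H(p)$ and compares it with the fixed-length baseline of $\lceil \log_2 k \rceil$ bits per letter:

    import math

    def entropy(probs):
        """H(p) = -sum_i p_i log2 p_i, in bits per letter."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    probs = [0.5, 0.25, 0.125, 0.125]
    k = len(probs)
    print(entropy(probs))             # 1.75 bits per letter for this p
    print(math.ceil(math.log2(k)))    # 2 bits per letter with a fixed-length code

So for this (illustrative) source, the theorem promises that roughly 1.75 bits per letter suffice, compared to the 2 bits per letter of the naive fixed-length code.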

So the entropy of the source tells you how much the messages coming out of it can be compressed. Another way of interpreting this theorem is to say that the amount of information coming out of the source is $H$ bits of information per letter.

We will now give a sketch of the proof of Shannon's entropy Theorem. First, let's try to show that one cannot compress the source too much. We look at a sequence of $n$ letters from the first-order source $X$, with the probability of letter $a_i$ being $p_i$ for all $i \in [k]$. First observe that the number of sequences of length $n$ with exactly $n_i$ letters $a_i$ is

$$\binom{n}{n_1, n_2, \ldots, n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!},$$

and all these words have the same probability $p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$. Now, if we have some number $M$ of equally likely messages that must be sent, then in order to send them we need to use $\log_2 M$ bits on average (see the counting notes). So we need to send at least

$$\log_2 \binom{n}{n_1, n_2, \ldots, n_k}$$

bits. To approximate this, we can use Stirling's formula

$$n! \approx \sqrt{2\pi n} \left(\frac{n}{e}\right)^n.$$

It gives

$$\log_2 n! = n \log_2 n - n \log_2 e + o(n).$$

Using this formula one obtains

$$\log_2 \binom{n}{n_1, \ldots, n_k} = \log_2 n! - \sum_i \log_2 n_i! = n \log_2 n - \sum_i n_i \log_2 n_i + o(n) = -\sum_i n_i \log_2 \frac{n_i}{n} + o(n).$$

In terms of the entropy function, this can be rewritten as:

$$\frac{1}{n} \log_2 \binom{n}{n_1, \ldots, n_k} = H\!\left(\frac{n_1}{n}, \ldots, \frac{n_k}{n}\right) + o(1).$$

So we have a conditional lower bound on the length of the coding sequences:

$$E(L \mid n_1, \ldots, n_k) \ge -\sum_i n_i \log_2 \frac{n_i}{n} + o(n). \qquad (1)$$

In particular, if $n_i = n p_i$ for all $i$ one gets

$$E(L \mid n_1, \ldots, n_k) \ge -n \sum_{i=1}^{k} p_i \log_2 p_i + o(n) = nH + o(n).$$
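As a sanity check on this Stirling estimate, here is a tiny Python computation (the counts are an illustrative choice) comparing $\frac{1}{n}\log_2\binom{n}{n_1,\ldots,n_k}$ with $H(n_1/n, \ldots, n_k/n)$:

    import math

    def log2_multinomial(counts):
        """log2 of n! / (n_1! ... n_k!) for a list of nonnegative integer counts."""
        n = sum(counts)
        return (math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)) / math.log(2)

    def entropy(fracs):
        return -sum(f * math.log2(f) for f in fracs if f > 0)

    counts = [500, 250, 125, 125]                 # n_i for n = 1000 (illustrative)
    n = sum(counts)
    print(log2_multinomial(counts) / n)           # about 1.736
    print(entropy([c / n for c in counts]))       # 1.75

The two numbers differ by an $o(1)$ term (of order $(\log_2 n)/n$), as the estimate predicts.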

Now we want to find what $n_i$ is in general. The expectation of $n_i$ is $n p_i$. Moreover, it can be shown that $n_i$ is very concentrated around this expectation. Actually, using Chebyshev's inequality one can prove (do it!) that for any constant $\epsilon > 0$

$$P(|n_i - n p_i| \ge \epsilon n) \le \frac{p_i (1 - p_i)}{\epsilon^2 n}.$$

Now we fix a constant $\epsilon > 0$ and we consider two cases. We define a sequence of $n$ letters to be $\epsilon$-typical if $|n_i - n p_i| < \epsilon n$ for all $i$, and $\epsilon$-atypical otherwise. By the union bound, the above gives a bound on the probability to be $\epsilon$-atypical:

$$P(\epsilon\text{-atypical}) \le \sum_i \frac{p_i (1 - p_i)}{\epsilon^2 n} \le \frac{1}{\epsilon^2 n}.$$

Moreover, Equation (1) gives

$$E(L \mid \epsilon\text{-typical}) \ge -n \sum_i (p_i - \epsilon) \log_2 (p_i + \epsilon) + o(n).$$

We now use the law of total expectation to bound the length of the coded message (dropping the nonnegative atypical term and using $P(\epsilon\text{-typical}) \ge 1 - 1/(\epsilon^2 n)$):

$$E(L) = E(L \mid \epsilon\text{-typical}) P(\epsilon\text{-typical}) + E(L \mid \epsilon\text{-atypical}) P(\epsilon\text{-atypical}) \ge \left(-n \sum_i (p_i - \epsilon) \log_2 (p_i + \epsilon) + o(n)\right) \left(1 - \frac{1}{\epsilon^2 n}\right).$$

Since one can take $\epsilon$ as small as one wants, this shows that $E(L) \ge nH + o(n)$. So we have proved the first part of the Shannon entropy theorem.

We will now show that one can do compression and get coded messages of length no more than $nH + o(n)$ on average. We use again the law of total expectation to bound the length of the coded messages:

$$E(L) = E(L \mid \epsilon\text{-typical}) P(\epsilon\text{-typical}) + E(L \mid \epsilon\text{-atypical}) P(\epsilon\text{-atypical}).$$

Now, we need to analyze this expression. We have $P(\epsilon\text{-atypical}) \le c/n$ for a constant $c$, so as long as we don't make the output in the atypical case more than length $Cn$, we can ignore the second term, as it will be bounded by a constant while the first term will be linear in $n$. What we could do is use one bit to tell the receiver whether the output was typical or atypical. If it is atypical, we can send it without compression, thus sending $\log_2 k^n = n \log_2 k$ bits, and if it is typical, we can then compress it. The dominant contribution to the expected length then occurs from typical outputs, because of the rarity of atypical ones.

How do we compress the source output if it's typical? One of the simplest ways theoretically (but this is not practical) is to calculate the number of typical outputs, and then assign a number in binary representation to each output. This compresses the source to about $\log_2(\text{number of typical outputs})$ bits. We will do this and get an upper bound of $nH + o(n)$ bits, where the little-$o$ notation means that the expression divided by $n$ goes to zero as $n$ goes to infinity.
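The concentration step above (Chebyshev plus the union bound) can be checked empirically. The sketch below, with purely illustrative parameters, estimates how often a length-$n$ sequence is $\epsilon$-atypical and compares the frequency with the bound $\sum_i p_i(1-p_i)/(\epsilon^2 n)$:

    import random
    from collections import Counter

    probs    = [0.5, 0.25, 0.125, 0.125]
    alphabet = list(range(len(probs)))
    n, eps, trials = 1000, 0.05, 2000
    rng = random.Random(1)

    def is_atypical(seq):
        """True if some letter count deviates from n*p_i by at least eps*n."""
        counts = Counter(seq)
        return any(abs(counts[a] - n * p) >= eps * n for a, p in zip(alphabet, probs))

    atypical_freq = sum(is_atypical(rng.choices(alphabet, weights=probs, k=n))
                        for _ in range(trials)) / trials
    chebyshev_bound = sum(p * (1 - p) for p in probs) / (eps ** 2 * n)
    print(atypical_freq, chebyshev_bound)  # empirical frequency vs. the union bound

In practice the empirical frequency is far below the Chebyshev bound, which is quite loose, but the loose bound is all the proof needs.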

How do we calculate the number of typical outputs? For each typical vector of numbers $(n_1, \ldots, n_k)$, the number of outputs is $\binom{n}{n_1, n_2, \ldots, n_k}$. So an upper bound on the total number of typical outputs is

$$\sum_{n_i \,:\, n p_i - \epsilon n \le n_i \le n p_i + \epsilon n} \binom{n}{n_1, n_2, \ldots, n_k}$$

(this is an upper bound as we haven't taken into account that $\sum_i n_i = n$). But we can use an even cruder upper bound by upper bounding the number of terms in the summation by $n^k$ (it is not necessary to use the improved $(2\epsilon n)^k$). Thus, we get that the number of typical outputs is at most

$$n^k \binom{n}{n(p_1 \pm \epsilon), n(p_2 \pm \epsilon), \ldots, n(p_k \pm \epsilon)},$$

where we can choose the $p_i \pm \epsilon$ to maximize the expression. Taking logs, we get that the number of bits required to send a typical output is at most

$$k \log_2 n + nH + c \epsilon n,$$

for some constant $c$. The first term is negligible for large $n$, and we can let $\epsilon$ go to zero as $n$ goes to infinity so as to get compression to $nH + o(n)$ bits.

In summary, Shannon's noiseless theorem says that we need to transmit $nH$ bits and this can essentially be achieved. We'll see next a much more practical way (Huffman codes) to do the compression; although it often does not quite achieve the Shannon bound, it gets fairly close to it.
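As a final numeric illustration of the typical-set bound just derived, the sketch below evaluates the crude per-letter cost (one flag bit, plus $k \log_2 n$ from bounding the number of count vectors by $n^k$, plus $\log_2$ of the multinomial at the maximizing choice $p_i + \epsilon$). The distribution, the values of $n$, and the choices of $\epsilon$ are illustrative; the point is only that the cost per letter drifts down toward $H$.

    import math

    probs = [0.5, 0.25, 0.125, 0.125]             # illustrative distribution, H = 1.75
    k = len(probs)
    H = -sum(p * math.log2(p) for p in probs)

    def log2_multinomial(counts):
        n = sum(counts)
        return (math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)) / math.log(2)

    for n, eps in [(10**3, 0.01), (10**5, 0.003), (10**7, 0.001)]:
        # 1 flag bit + k*log2(n) for the count vector
        # + log2 of the multinomial at the maximizing choice p_i + eps
        counts = [round(n * (p + eps)) for p in probs]
        bits = 1 + k * math.log2(n) + log2_multinomial(counts)
        print(n, bits / n)                        # decreases toward H = 1.75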