CS225/ECE205A, Spring 2005: Information Theory Wim van Dam Engineering 1, Room 5109

CS225/ECE205A, Spring 2005: Information Theory Wim van Dam Engineering 1, Room 5109 vandam@cs http://www.cs.ucsb.edu/~vandam/teaching/s06_cs225/

Formalities
There was a mistake in the earlier announcement: the midterm is scheduled for Week 5 (May 2/4). Before the weekend I will post a number of exercises that we will discuss in the coming week. Try to answer them before next Tuesday. Also, start thinking about a project that you would like to work on. See the book and other places for ideas. Questions?

G. Perec's La Disparition
A Void's plot follows a group of individuals hunting a missing companion, Anton Vowl. It is in part a parody of noir and horror fiction, with many stylistic tricks and gags, plot convolutions, and a grim conclusion. In many parts it implicitly talks about its own lipogrammatic limitation: illustrations of this including missing subdivision 5 of 26, and calling its main protagonist "Vowl". Individuals within A Void do work out what is missing, but find difficulty discussing or naming it as it has no word or form that conforms to its author's constraint. Philip Howard, writing a lipogrammatic appraisal of A Void in a British daily journal, said, "This is a story chock-full of plots and sub-plots, of loops within loops, of trails in pursuit of trails, all of which allow its author an opportunity to display his customary virtuosity as an avant-gardist magician, acrobat and clown."

Letter Frequencies in English
[Figure from Information Theory, Inference, and Learning Algorithms, by David J.C. MacKay]
The 27-letter alphabet gives an upper bound on the entropy of $\log_2 27 = 4.75$ bits per letter.
A random sentence: XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD
The different single-letter frequencies give a lower entropy of 4.1 bits/letter.
A typical sentence: OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA.
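As a quick check of the upper bound, the following sketch computes the entropy of a uniform distribution over 27 symbols. The helper name `entropy` is ours, not from the slides, and the letter frequencies behind the 4.1 bits/letter figure are not reproduced here.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution (sequence of probabilities)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over the 27-symbol alphabet (26 letters plus space).
uniform27 = [1 / 27] * 27
print(round(entropy(uniform27), 2))  # 4.75
```

Any non-uniform distribution over the same alphabet gives a smaller value, which is why using the measured letter frequencies lowers the estimate.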

Combined Letter Frequencies
[Figure from Information Theory, Inference, and Learning Algorithms, by David J.C. MacKay]
For letter pairs, the entropy is 4.03 bits per letter.
Typical sentence using triplets: IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID
For letter quadruplets, the entropy is down to 2.8 bits/letter.
The entropy of English is estimated to be 1.5 bits per letter.
See http://math.ucsd.edu/~crypto/java/entropy/

Redundancy, Redundancy
The fact that a natural language is very redundant is very useful: it allows us to understand English texts that are written or spoken unclearly. Redundancy of images is also helpful when we are reproducing them imperfectly (copying, faxing, etc.). But when we are using reliable (digital) technology, the natural redundancy incurs an unnecessary cost when transmitting information.
Shannon's idea while working for Bell Labs: Data Compression

Chapter 5: Data Compression

(Binary) Source Codes
We want to efficiently encode (in bits) sequences produced by a random source variable $X \sim p(x)$. The expected length of a code $C$ is given by $L(C) = \sum_x p(x)\, l(x)$, with $l(x)$ the length of $C(x)$.
Example: For X={a,b,c} we can use the binary source code C(a)=00, C(b)=01, C(c)=10, with L(C)=2 bits/letter.
A better idea is to use C(a)=0, C(b)=10, C(c)=11, such that if p(a) = p(b) = p(c) = 1/3, the code has expected length 5/3 ≈ 1.67 bits per letter.

Relying on Bias
Continued example for X={a,b,c}: The code C(a)=0, C(b)=10, C(c)=11 works even better if p(a) = ½ and p(b) = p(c) = ¼, in which case the expected length is 1.5 bits per letter. It is worse if we have p(a) = p(b) = ¼ and p(c) = ½, such that L(C) = ¼ + ½ + 1 = 1.75 bits per letter.
Central idea of variable length coding: Assign short code words to likely letters (and longer code words to the less likely letters).
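The three expected lengths in this example can be reproduced with a few lines. This is a minimal sketch; the function name `expected_length` and the dictionary layout are our own choices, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def expected_length(code, probs):
    """L(C) = sum_x p(x) * l(x) for a code given as {symbol: code word}."""
    return sum(probs[x] * len(word) for x, word in code.items())

code = {"a": "0", "b": "10", "c": "11"}
uniform = {"a": Fraction(1, 3), "b": Fraction(1, 3), "c": Fraction(1, 3)}
favours_a = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
favours_c = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}

print(expected_length(code, uniform))    # 5/3
print(expected_length(code, favours_a))  # 3/2
print(expected_length(code, favours_c))  # 7/4
```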

Morse Code Figure from Information Theory, Inference, and Learning Algorithms, by David J.C. MacKay

Distributing Lengths
The question has become: Given a probability distribution p over the alphabet X, how can we assign the lengths l(x) of the code words C(x)?
Is it possible to have lengths $l_1 = 1$, $l_2 = 2$, $l_3 = 3$, $l_4 = 3$? Answer: Yes; use the words 0, 10, 110 and 111.
Is it possible to have lengths $l_1 = 1$, $l_2 = 2$, $l_3 = 2$, $l_4 = 3$? Answer: No. Try 0, 01, 10 and 111; what does 010 mean?
We want our codes to be uniquely decodable.

A Uniquely Decodable (but Annoying) Code
Take the following code for X={a,b,c,d}: C(a)=10, C(b)=00, C(c)=11, C(d)=110.
Try decoding the string 100011101100110. It gives $C^{-1}(100011101100110)$ = abcacbd, but there was some suspense with the second c, as the 11 could also have been the start of C(d)=110.
This code is uniquely decodable, but the decoding is not instantaneous.
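To see the difference between an ambiguous string and one with a single reading, one can enumerate all ways of splitting a bit string into code words. The sketch below is a brute-force illustration with names of our own choosing (`parses`, `ambiguous`, `code`). It only checks the given strings; it does not prove unique decodability of a code in general.

```python
def parses(bits, code):
    """Return every way to split `bits` into code words, as lists of symbols."""
    if not bits:
        return [[]]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            for rest in parses(bits[len(word):], code):
                results.append([symbol] + rest)
    return results

# The words 0, 01, 10, 111 tried on the previous slide (symbols labelled 1..4).
ambiguous = {"1": "0", "2": "01", "3": "10", "4": "111"}
print(parses("010", ambiguous))  # [['1', '3'], ['2', '1']]

# The code of this slide: exactly one parse survives the backtracking.
code = {"a": "10", "b": "00", "c": "11", "d": "110"}
print(["".join(p) for p in parses("100011101100110", code)])  # ['abcacbd']
```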

Instantaneous Codes
Definition [CT, Section 5.1]: A code is called a prefix code or an instantaneous code if no code word is the prefix of any other code word. We want to do our coding with such prefix codes.
The non-instantaneous code had code lengths 2,2,2,3. There is an instantaneous code with such lengths: take the words 00, 01, 10, 111. But there is no instantaneous code with lengths 1,2,2,3.
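The prefix condition in the definition is easy to test mechanically. A minimal sketch (the function name `is_prefix_free` is ours) that compares every ordered pair of code words:

```python
from itertools import permutations

def is_prefix_free(words):
    """True if no code word is a prefix of another code word in the list."""
    return not any(b.startswith(a) for a, b in permutations(words, 2))

print(is_prefix_free(["10", "00", "11", "110"]))  # False: 11 is a prefix of 110
print(is_prefix_free(["00", "01", "10", "111"]))  # True
```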

Kraft's Inequality
Crucial question: For which code word lengths $l_1,\dots,l_m$ does there exist a prefix code?
Kraft's Inequality: For any binary prefix code with code word lengths $l_1,\dots,l_m$, it must hold that:
$\sum_{j=1}^{m} 2^{-l_j} \le 1$
Conversely, if the lengths $l_j$ obey Kraft's inequality, then there is a prefix code with such lengths.
If instead of bits we use the more general Dits $\{1,\dots,D\}$, then we have
$\sum_{j=1}^{m} D^{-l_j} \le 1$
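Both directions of the statement can be tried out numerically. The sketch below computes the Kraft sum exactly and, when the inequality holds, builds a binary prefix code by assigning code words in order of increasing length. This canonical construction is one standard way to realise the converse, not necessarily the one used in the lecture, and the names `kraft_sum` and `prefix_code_from_lengths` are ours.

```python
from fractions import Fraction

def kraft_sum(lengths, D=2):
    """Exact value of sum_j D^(-l_j)."""
    return sum(Fraction(1, D ** l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Binary prefix code with the given lengths, or None if Kraft's inequality fails."""
    if kraft_sum(lengths) > 1:
        return None
    words, value, prev = [], 0, 0
    for l in sorted(lengths):
        value <<= l - prev                      # extend the counter to l bits
        words.append(format(value, "b").zfill(l))
        value += 1                              # next unused word of this length
        prev = l
    return words

print(kraft_sum([1, 2, 3, 3]), prefix_code_from_lengths([1, 2, 3, 3]))
# 1 ['0', '10', '110', '111']
print(kraft_sum([1, 2, 2, 3]), prefix_code_from_lengths([1, 2, 2, 3]))
# 9/8 None
```

For the lengths 2,2,2,3 the construction returns a valid prefix code as well, although not necessarily the same words as listed on the slides.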

Proving Kraft's Inequality
L.G. Kraft, "A device for quantizing, grouping, and coding amplitude modulated pulses", M.Sc. thesis, MIT, 1949.
[Cover & Thomas]: "The Kraft inequality was first proved by McMillan (1956)"
How to prove it? It might help to think about prefix codes as trees.

A Binary Code Tree
[Figure: binary code tree with leaves 00, 10, 010, 011, 1100, 1101, 1110, 1111]
Code words are the leaves, with lengths: 2,2,3,3,4,4,4,4.

Last Week
The amount of randomness of a discrete random variable X (with distribution p) is measured by the entropy function H (with bits as units):
$H(X) = -\sum_x p(x) \log_2 p(x)$
If X and Y are independent, then $H(X,Y) = H(X) + H(Y)$.
In general we have $H(X,Y) = H(X) + \sum_x p(x)\, H(Y \mid X = x)$.

Conditional Entropy
The expected entropy of Y after we have observed a value $x \in X$ is called the conditional entropy $H(Y \mid X)$:
$H(Y \mid X) = \sum_x p(x)\, H(Y \mid X = x)$
$= -\sum_x p(x) \sum_y p(y \mid x) \log p(y \mid x)$
$= -\sum_{x,y} p(x,y) \log p(y \mid x)$
$= -E_{p(x,y)}[\log p(Y \mid X)]$
Chain rule: $H(X,Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$.

Noisy Channel Example
The symmetric binary channel with noise rate p again: Assume that the X bit is completely random with H(X) = H((½,½)) = 1.
[Figure: binary symmetric channel from X to Y; each bit is passed on unchanged with probability 1−p and flipped with probability p]
After receiving/observing Y, the entropy has changed from H(X)=1 to $H(X \mid Y) = H((1-p,p)) = -p \log p - (1-p)\log(1-p)$.
We expect an entropy reduction of $1 - H((p,1-p))$. Think of this reduction as gaining information about X.
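The claim that H(X|Y) equals the binary entropy of the noise rate can be checked by computing the conditional entropy directly from the joint distribution. This is a sketch under our own naming (`binary_entropy`, `conditional_entropy_x_given_y`), and the noise rate 0.1 is an illustrative value that does not come from the slides.

```python
import math

def binary_entropy(p):
    """H((p, 1-p)) in bits."""
    return -sum(t * math.log2(t) for t in (p, 1 - p) if t > 0)

def conditional_entropy_x_given_y(joint):
    """H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y) for a dict {(x, y): probability}."""
    p_y = {}
    for (x, y), pr in joint.items():
        p_y[y] = p_y.get(y, 0.0) + pr
    return -sum(pr * math.log2(pr / p_y[y])
                for (x, y), pr in joint.items() if pr > 0)

noise = 0.1  # illustrative noise rate (assumption)
# Uniform input bit X; Y differs from X with probability `noise`.
joint = {(x, y): 0.5 * (noise if x != y else 1 - noise)
         for x in (0, 1) for y in (0, 1)}

h_x_given_y = conditional_entropy_x_given_y(joint)
print(h_x_given_y, binary_entropy(noise))  # the two values agree
print(1 - h_x_given_y)                     # entropy reduction 1 - H((p, 1-p))
```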

Another Example of H(X|Y)
Take p(x) over $\{0,\dots,500\}$ with $p = (\tfrac12, \tfrac{1}{1000}, \dots, \tfrac{1}{1000})$, with entropy $H(X) = \tfrac12 + \tfrac12 \log 1000 = 5.483$ bits.
If we learn that x is not 0, then we increase the entropy: $p(x \mid x \neq 0) = (0, \tfrac{1}{500}, \dots, \tfrac{1}{500})$ with $H(X \mid x \neq 0) = 8.966$ bits.
We learned information, yet the entropy/uncertainty increased? Think: Not finding your wallet in the likely place.
The expected uncertainty (= conditional entropy) goes down: $H(X \mid \text{"}x=0\text{?"}) = \tfrac12 H(X \mid x = 0) + \tfrac12 H(X \mid x \neq 0) = 4.483$ bits.
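The three numbers in this example follow directly from the definitions. A small sketch that recomputes them (the helper `entropy` and the variable names are ours):

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.5] + [1 / 1000] * 500         # distribution over {0, ..., 500}
h_x = entropy(p)                     # before learning anything
h_not_zero = entropy([1 / 500] * 500)  # after learning x != 0
h_zero = entropy([1.0])              # after learning x = 0: no uncertainty left
h_conditional = 0.5 * h_zero + 0.5 * h_not_zero

print(round(h_x, 3))            # 5.483
print(round(h_not_zero, 3))     # 8.966
print(round(h_conditional, 3))  # 4.483
```

The gap between the first and last value is exactly 1 bit, the entropy of the yes/no answer to "x=0?".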

Asymmetry of H(X|Y)
If the relation between X and Y is asymmetric, then we will have in general: $H(X \mid Y) \neq H(Y \mid X)$.
Example of an asymmetric channel.
[Figure: channel from X to Y; input 0 always gives output 0, input 1 gives output 0 or 1 with probability ½ each]
Assume again H(X) = 1 and note that H(Y) = H((3/4,1/4)) = 0.811.
We have $H(X \mid Y) = 0.689$ bits. On the other hand $H(Y \mid X) = 0.5$ bits.
Check with H(X,Y) = 3/2 and the chain rule: $H(X,Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$.
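The entropies for this channel can be recomputed from the joint distribution, using the chain rule to obtain the conditional entropies. A minimal sketch, with the joint table written out by hand from the channel description and helper names of our own:

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(x, y): X uniform, 0 -> 0 always, 1 -> 0 or 1 with prob 1/2.
joint = {(0, 0): 0.5, (1, 0): 0.25, (1, 1): 0.25}

p_x, p_y = {}, {}
for (x, y), pr in joint.items():
    p_x[x] = p_x.get(x, 0.0) + pr
    p_y[y] = p_y.get(y, 0.0) + pr

h_xy = entropy(joint.values())
h_x, h_y = entropy(p_x.values()), entropy(p_y.values())
h_x_given_y = h_xy - h_y  # chain rule: H(X,Y) = H(Y) + H(X|Y)
h_y_given_x = h_xy - h_x  # chain rule: H(X,Y) = H(X) + H(Y|X)

print(h_x, round(h_y, 3), h_xy)                # 1.0 0.811 1.5
print(round(h_x_given_y, 3), h_y_given_x)      # 0.689 0.5
```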

Chain Rule for Entropy
For random variables $X_1,\dots,X_n$ we have the chain rule:
$H(X_1,\dots,X_n) = H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_n \mid X_{n-1},\dots,X_1)$
$= \sum_{i=1}^{n} H(X_i \mid X_{i-1},\dots,X_1)$
Think: the amount of information that you obtain by observing $X_1,\dots,X_n$ equals the $X_1$ information $H(X_1)$, plus the additional $X_2$ information $H(X_2 \mid X_1)$, et cetera.
Notice the similarity with the multiplicative rules for joint probabilities: $p(x,y) = p(x)\, p(y \mid x)$.

Mutual Information
For two variables X,Y the mutual information I(X;Y) is the amount of certainty regarding X that we gain by observing Y. Hence $I(X;Y) = H(X) - H(X \mid Y)$.
Note that X and Y can be interchanged using the chain rule:
$I(X;Y) = H(X) - H(X \mid Y)$
$= H(X,Y) - H(Y \mid X) - H(X \mid Y)$
$= H(Y) - H(Y \mid X)$
$= I(Y;X)$
Think of I(X;Y) as the overlap between X and Y.

All Together Now
[Figure: Venn-style diagram relating H(X,Y), H(X), H(Y), H(X|Y), H(Y|X) and I(X;Y)]

Asymmetric Channel
[Figure: the same diagram for the asymmetric channel, labelled H(X)=1, H(Y)=0.811, H(X,Y)=3/2, H(X|Y)=0.689, H(Y|X)=0.5, I(X;Y)=0.311]
For the previous asymmetric channel we had H(X)=1, H(Y)=0.811 and H(X,Y)=3/2, etc. Hence I(X;Y)=0.311.

About Mutual Information
Mutual information is the central notion in information theory. It quantifies how much we learn about X by observing Y.
When X and Y are the same we get I(X;X) = H(X), hence entropy is also called self-information.
It can be generalized to more than two variables.
[Figure: diagram for three variables X, Y, Z]

Expectation of What?
Mutual information can be viewed as an expectation:
$I(X;Y) = H(X) - H(X \mid Y)$
$= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$
$= E_{p}\left[\log \frac{p(X,Y)}{p(X)\,p(Y)}\right]$
This expectation is also called the relative entropy between the probabilities p(x,y) and p(x) p(y).
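The expectation form can be evaluated directly for the asymmetric channel discussed earlier and compared with the value I(X;Y)=0.311 obtained from the entropies. A minimal sketch; the function name `mutual_information` and the joint table layout are our own.

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ) for {(x, y): prob}."""
    p_x, p_y = {}, {}
    for (x, y), pr in joint.items():
        p_x[x] = p_x.get(x, 0.0) + pr
        p_y[y] = p_y.get(y, 0.0) + pr
    return sum(pr * math.log2(pr / (p_x[x] * p_y[y]))
               for (x, y), pr in joint.items() if pr > 0)

# Asymmetric channel: X uniform, 0 -> 0 always, 1 -> 0 or 1 with probability 1/2.
joint = {(0, 0): 0.5, (1, 0): 0.25, (1, 1): 0.25}
print(round(mutual_information(joint), 3))  # 0.311
```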

Relative Entropy
The relative entropy or Kullback-Leibler distance between two distributions p and q is defined by
$D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = E_p\left[\log \frac{p(X)}{q(X)}\right]$
This distance expresses our expected disbelief in q when we are observing probability distribution p.
Not a true distance, as in general $D(p \| q) \neq D(q \| p)$. Also, $D(p \| q)$ can be infinite.

Example KL Distance
Compare a fair coin and a biased coin. Probabilities: p = (½,½) versus q = (q,1−q).
[Figure: plots of $D(p \| q)$ and $D(q \| p)$ as functions of q]
$D(p \| q)$ expresses how much disbelief we have in the q-model when we observe a fair coin (per coin flip).
$D(q \| p)$ expresses the disbelief in the fair-coin theory when observing q.

Another D(q‖p) Example
Consider a 10% difference: q = [q, 1−q] while p = [q+1/10, 1−q−1/10].
[Figure: plot of $D(q \| p)$ as a function of q]
The difference between [90%,10%] and [80%,20%] is much bigger than between [60%,40%] and [50%,50%].
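The coin examples can be explored numerically with a small relative-entropy routine. This sketch assumes base-2 logarithms and reads the "10% difference" as shifting the first probability by 0.1; the names `kl_divergence` and `gap` and the biased coin (0.9, 0.1) are our own illustrative choices.

```python
import math

def kl_divergence(p, q):
    """D(p||q) in bits; infinite if q(x) = 0 somewhere p(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total += px * math.log2(px / qx)
    return total

fair, biased = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(fair, biased), kl_divergence(biased, fair))  # not symmetric
print(kl_divergence(fair, [1.0, 0.0]))                           # inf

def gap(q):
    """D(q||p) for q = [q, 1-q] and p = [q + 0.1, 1 - q - 0.1]."""
    return kl_divergence([q, 1 - q], [q + 0.1, 1 - q - 0.1])

print(gap(0.8), gap(0.5))  # the same 10% shift costs more near the edge
```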

Help Wanted: Note Taker
You must be sensitive to the needs of students with disabilities. $100. Talk to me after class.

Formalities
We are going to have projects. The emphasis will be on a small paper in which you show that you understand and can apply the tools of information theory. You will also give an ultra-short presentation (5 min).
Grade determination: Midterm+Final+Project: 1+2+3=6.

Topics for Projects
Information theory and
- data compression (lossless, lossy, mpeg)
- error correcting codes (802.11, DVDs, satellite)
- gambling and the stock market [CT, 6, 15]
- Kolmogorov complexity [CT, 7]
- Quantum information theory [Nielsen & Chuang]
- analog signals, rate distortion theory [CT, 13]
- information theory and natural languages
- statistical distances and phylogenetics
- anything that uses some nontrivial information theory

Entropic Inequalities
Entropic definitions and equalities: $H(X) = -\sum_x p(x) \log p(x)$, $H(X \mid Y) = \sum_y p(y)\, H(X \mid Y = y)$, $I(X;Y) = H(X) - H(X \mid Y)$, $D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$, ...
Entropic inequalities that follow from $0 \le p(x) \le 1$: $H(X) \ge 0$, $H(X \mid Y) \ge 0$, $H(X,Y) \ge H(X)$, ...
Less obvious inequalities: $I(X;Y) \ge 0$, $H(X) \le \log |X|$, $D(p \| q) \ge 0$, ...
How do those inequalities relate? How to prove the not-so-obvious inequalities?

Information Inequality
Theorem 2.6.3: For two probability distributions p and q we have $D(p \| q) \ge 0$, and $D(p \| q) = 0$ if and only if p=q.
$-D(p \| q) = -\sum_x p(x) \log \frac{p(x)}{q(x)}$
$= \sum_x p(x) \log \frac{q(x)}{p(x)}$
$\le \log \sum_x p(x) \frac{q(x)}{p(x)}$
$= \log(1) = 0$
Q: What happened here? A: Application of Jensen's inequality to the concave function $\log: \mathbb{R}^+ \to \mathbb{R}$.

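Convex and Concave
A function f is convex if for all points $x_1$ and $x_2$ and all $0 \le \lambda \le 1$ it holds that $f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$.
Equivalently (for twice-differentiable f): $f'' \ge 0$. Strictly convex: $f'' > 0$.
A function f is concave if for all points $x_1$ and $x_2$ and all $0 \le \lambda \le 1$ it holds that $f(\lambda x_1 + (1-\lambda) x_2) \ge \lambda f(x_1) + (1-\lambda) f(x_2)$.
Equivalently (for twice-differentiable f): $f'' \le 0$. Strictly concave: $f'' < 0$.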

Jensen's Inequality
Theorem 2.6.2: If f is a convex function and Z a random variable, then $E[f(Z)] \ge f(E[Z])$.
If f is a concave function and Z a random variable, then $E[f(Z)] \le f(E[Z])$.
If f is strictly convex or strictly concave, then $E[f(Z)] = f(E[Z])$ implies that Z is a deterministic variable.
Proof: See Cover and Thomas.
Note: f is convex if and only if −f is concave.

Jensen for the Log Function
The function log(z) is concave for $z \in \mathbb{R}^+$, hence $E[\log Z] \le \log(E[Z])$ and also $E[\log 1/Z] \ge \log(1/E[Z])$.
For our purposes we typically have variables X,Y and some additional function $g: X,Y \to \mathbb{R}^+$ such that (with Z=g(X,Y)) Jensen's inequality tells us: $E[\log g(X,Y)] \le \log(E[g(X,Y)])$.
Important example:
$\sum_x p(x) \log \frac{q(x)}{p(x)} = E\left[\log \frac{q(X)}{p(X)}\right] \le \log\left(E\left[\frac{q(X)}{p(X)}\right]\right) = \log\left(\sum_x p(x) \frac{q(x)}{p(x)}\right) = \log\left(\sum_x q(x)\right) = 0$
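The important example can be sanity-checked on randomly drawn distributions. This is only a numerical illustration, not a proof; the seed, the alphabet size and the helper `random_distribution` are arbitrary choices of ours.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is reproducible

def random_distribution(n):
    """A random strictly positive probability vector of length n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

violations = 0
for _ in range(1000):
    p, q = random_distribution(5), random_distribution(5)
    lhs = sum(px * math.log2(qx / px) for px, qx in zip(p, q))  # E_p[log q/p]
    rhs = math.log2(sum(px * (qx / px) for px, qx in zip(p, q)))  # log E_p[q/p]
    if lhs > rhs + 1e-12:
        violations += 1

print(violations)  # 0: E_p[log q/p] never exceeded log E_p[q/p]
```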

Information Inequality, Again
Theorem 2.6.3: For two probability distributions p and q we have $D(p \| q) \ge 0$, and $D(p \| q) = 0$ if and only if p=q.
$-D(p \| q) = -\sum_x p(x) \log \frac{p(x)}{q(x)}$
$= \sum_x p(x) \log \frac{q(x)}{p(x)}$
$\le \log \sum_x p(x) \frac{q(x)}{p(x)}$
$= \log(1) = 0$
Q: What happened here? A: Application of Jensen's inequality to the concave function $\log: \mathbb{R}^+ \to \mathbb{R}$.

For Next Tuesday
Complete the proof that for $d(X,Y) = H(X \mid Y) + H(Y \mid X)$ it holds that $d(X,Y) + d(Y,Z) \ge d(X,Z)$.
Go through as many entropic inequalities as you want and see which ones are the easy ones and which ones require Jensen's inequality.
Think about a project you want to work on.

Formalities
Answers to Homework Exercise 1 will be posted. Next week's Midterm will be on Thursday, May 4. Topics: Entropies and source coding. The Midterm and Final are both open book, etc. Also next week, I want to receive the tentative titles and abstracts for your projects.

Using Jensen's Inequality
Using Jensen's inequality on the log function, we get
$-\log(E[1/g(X)]) \le E[\log g(X)] \le \log(E[g(X)])$
for all functions $g: X \to \mathbb{R}^+$ and distributions p(x).
Using this we can prove $H(X) \le \log |X|$, $I(X;Y) \ge 0$, $D(p \| q) \ge 0$ for all random variables X, Y and distributions p,q.

Proving H(X) ≤ log |X|, Twice
Upper bound on the entropy H(X) with the probability distribution p over the finite alphabet X of size |X|.
Directly using Jensen's Inequality on $E[\log(1/p(X))]$:
$H(X) = E_p[-\log p(X)]$
$= E_p[\log 1/p(X)]$
$\le \log(E_p[1/p(X)])$
$= \log\left(\sum_{x \in X} p(x)/p(x)\right)$
$= \log |X|$
Using nonnegativity of the relative entropy $D(p \| q)$ between p(x) and the uniform distribution $q = 1/|X|$:
$0 \le D(p \| q)$
$= E_p[\log(p(X)/q(X))]$
$= E_p[\log(p(X)\,|X|)]$
$= E_p[\log p(X) + \log |X|]$
$= -H(X) + \log |X|$

What We Know
Kraft's Inequality: For any binary prefix code with code word lengths $l_1,\dots,l_m$, it must hold that:
$\sum_{j=1}^{m} 2^{-l_j} \le 1$
Conversely, if the lengths $l_j$ obey Kraft's inequality, then there is a prefix code with such lengths.
Entropic inequalities like $D(p \| q) \ge 0$, such that $E_p[\log 1/p(X)] \le E_p[\log 1/q(X)]$ for all p,q.
In other words: $E_p[\log 1/q(X)]$ is minimized for q=p.

Source Coding Question
Given a probability distribution $p_1,\dots,p_m$, what lengths $l_1,\dots,l_m$ do we take such that we minimize the expected length $L(C) = \sum_j p_j l_j$, while obeying Kraft's inequality?
Rough answer: Take $l_j \approx -\log p_j$. Note that if exact this gives: $L(C) = \sum_j p_j l_j = -\sum_j p_j \log p_j = H(p)$.
Using inequalities like $D(p \| q) \ge 0$ we will show that:
- For all prefix codes we must have: $L(C) \ge H(p)$
- There exists a code such that: $L(C) < H(p)+1$.

Lossless Coding Theorem
Shannon's 1948 Lossless Source Coding Theorem: Let X be a random variable with entropy H(X).
- For all uniquely decodable codes C it holds that $L(C) \ge H(X)$
- There exists a prefix code C such that $L(C) < H(X)+1$.
Idea: Let $X = \{1,\dots,n\}$. Any code C with lengths $l_1,\dots,l_n$ that obeys Kraft's inequality defines a probability distribution q over $\{0,1,\dots,n\}$ with $q(j) = 2^{-l_j}$ (and q(0) such that $\sum_j q(j) = 1$).
The expected length of this code: $L(C) = E_p[-\log q(X)]$.

Lower Bound Proof
Let p be the probability distribution of variable X. Let code C have lengths $l_1,\dots,l_n$, defining the distribution q over $\{0,1,\dots,n\}$ with $q(j) = 2^{-l_j}$, such that $L(C) = E_p[-\log q(X)]$. This gives:
$L(C) - H(X) = E_p[-\log q(X)] - E_p[-\log p(X)]$
$= E_p[\log p(X)/q(X)]$
$= D(p \| q) \ge 0$
The equality $L(C) - H(X) = D(p \| q)$ also shows that we should try to have q=p to get the smallest possible L(C).
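The identity L(C) − H(X) = D(p‖q) can be checked on a small example. The distribution and the lengths below are illustrative values of ours (not from the slides), chosen so that Kraft's sum is 7/8 and the leftover mass 1/8 sits on the dummy symbol 0, which has p(0)=0 and therefore does not contribute to the sums.

```python
import math

p = [0.5, 0.3, 0.2]    # illustrative source distribution (assumption)
lengths = [1, 2, 3]    # code word lengths; Kraft sum = 1/2 + 1/4 + 1/8 = 7/8

q = [2.0 ** -l for l in lengths]  # q(j) = 2^(-l_j)
q0 = 1 - sum(q)                   # mass on the dummy symbol 0

L = sum(pj * lj for pj, lj in zip(p, lengths))             # expected length
H = -sum(pj * math.log2(pj) for pj in p)                   # entropy
D = sum(pj * math.log2(pj / qj) for pj, qj in zip(p, q))   # D(p||q)

print(q0)        # 0.125
print(L - H, D)  # both about 0.2145, and nonnegative
```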

The Upper Bound
Let p be the probability distribution of variable X. Try defining a prefix code C with $l_j = \lceil -\log p_j \rceil$. Algebra tells us:
$\sum_j 2^{-l_j} = \sum_j 2^{-\lceil -\log p_j \rceil} \le \sum_j 2^{\log p_j} = 1$
Kraft's inequality tells us that there does exist a prefix code C with these lengths $l_j$.
The expected length of this code is:
$E_p[l_j] = E_p[\lceil -\log p(X) \rceil] < E_p[-\log p(X)] + 1 = H(X) + 1$
How to get rid of the annoying +1 term?
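The two claims on this slide (the lengths obey Kraft's inequality, and the expected length is below H(X)+1) can be verified for a concrete distribution. A minimal sketch; the function name `shannon_lengths` and the example distribution are ours.

```python
import math

def shannon_lengths(probs):
    """Code word lengths l_j = ceil(-log2 p_j)."""
    return [math.ceil(-math.log2(p)) for p in probs]

p = [0.5, 0.3, 0.2]  # illustrative distribution (assumption)
lengths = shannon_lengths(p)
kraft = sum(2.0 ** -l for l in lengths)
L = sum(pj * lj for pj, lj in zip(p, lengths))
H = -sum(pj * math.log2(pj) for pj in p)

print(lengths)           # [1, 2, 3]
print(kraft)             # 0.875, so a prefix code with these lengths exists
print(H <= L < H + 1)    # True
```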

Lowering the Upper Bound
To reduce the H(X)+1 upper bound to H(X)+ε, we apply the same approach to strings in $X^m$ instead of X.
The new probability distribution $p^m$ is defined by $p^m(x_1,\dots,x_m) = p(x_1) \cdots p(x_m)$, such that $H(X^m) = m\,H(X)$.
We know that there exists a prefix code $C_m$ for $X^m$ with $L(C_m) < H(X^m) + 1$, hence the average per letter is now bounded by:
$\frac{L(C_m)}{m} < \frac{H(X^m) + 1}{m} = H(X) + \frac{1}{m}$
By letting m grow, we get $L(C_m)/m < H(X) + \varepsilon$, for arbitrarily small ε>0.
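The effect of coding blocks of m letters can be seen numerically for a binary source. The sketch below uses the lengths ⌈−log p⌉ on blocks and groups blocks by their number of ones to avoid enumerating all 2^m strings. The source probability 0.1 and the function name are our own illustrative choices.

```python
import math

def per_letter_shannon_length(p1, m):
    """Expected length per source letter when lengths ceil(-log2 p) are used on blocks of m i.i.d. bits."""
    p0 = 1 - p1
    total = 0.0
    for k in range(m + 1):  # k = number of ones in the block
        block_prob = p1 ** k * p0 ** (m - k)
        bits = math.ceil(-(k * math.log2(p1) + (m - k) * math.log2(p0)))
        total += math.comb(m, k) * block_prob * bits
    return total / m

p1 = 0.1  # illustrative probability of a one (assumption)
entropy = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))
for m in (1, 2, 4, 8, 16):
    print(m, round(per_letter_shannon_length(p1, m), 3),
          "<", round(entropy + 1 / m, 3))
```

The per-letter length always stays between H(X) and H(X)+1/m, so the bound tightens as m grows, although the length itself need not decrease monotonically in m.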

Source Coding Algorithms
How to implement Shannon's idea for lossless source coding for a known distribution p(x)? Two important cases:
- Huffman codes
- Shannon, Shannon-Fano, Shannon-Fano-Elias, and/or Arithmetic codes
An important but different setting occurs when we do not know a priori the distribution p.
- Lempel-Ziv compression

Optimal Coding?
Let X={0,1} with p(x=0) = 0.99 and p(x=1) = 0.01. Shannon's original idea suggests a code with $l_0 = \lceil -\log p_0 \rceil = 1$ bit and $l_1 = \lceil -\log p_1 \rceil = 7$ bits, which is clearly not optimal.
Looking at the code tree it is clear that for an optimal code the two longest words must have equal length and differ only in the last bit.
Another optimal property: if $p_j > p_k$ then $l_j \le l_k$.
These observations lead to the Huffman code (which is indeed optimal).

Huffman Code
Assume probabilities $p_1 \le p_2 \le \cdots \le p_m$ for $X = \{1,\dots,m\}$.
Because $p_1$ and $p_2$ are the two smallest probabilities, the lengths $l_1$ and $l_2$ are the longest and equal.
The node at depth $l_1 - 1$ connecting the code words for 1 and 2 can be viewed as a code word for "1 or 2" (with probability $p_1 + p_2$).
If "1 or 2" is less likely than 3 ($p_1 + p_2 < p_3$), then $l_1 - 1 \ge l_3$. Otherwise ($p_1 + p_2 \ge p_3$), we have $l_1 - 1 \le l_3$.
Applying this line of thought recursively gives a code tree for the Huffman code.
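The recursive merging of the two least likely nodes is commonly implemented with a priority queue. The following is a minimal sketch of that standard approach using Python's `heapq`; it is one possible implementation rather than the lecture's own, and the function name `huffman_code`, the tie-breaking counter and the 0/1 labelling of branches are our choices.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Binary Huffman code for {symbol: probability}; returns {symbol: code word}."""
    if len(probs) == 1:
        return {next(iter(probs)): "0"}
    tiebreak = count()  # unique counter so the heap never has to compare dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)  # least likely node
        p2, _, group2 = heapq.heappop(heap)  # second least likely node
        merged = {s: "0" + w for s, w in group1.items()}
        merged.update({s: "1" + w for s, w in group2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.25}))
# {'a': '0', 'b': '10', 'c': '11'}
print(huffman_code({0: 0.99, 1: 0.01}))
# both code words have length 1, instead of the lengths 1 and 7 from ceil(-log p)
```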