
Kraft's inequality

An instantaneous code (prefix code, tree code) with the codeword lengths $l_1, \ldots, l_N$ exists if and only if
$$\sum_{i=1}^{N} 2^{-l_i} \leq 1$$
Proof: Suppose that we have a tree code. Let $l_{\max} = \max\{l_1, \ldots, l_N\}$. Expand the tree so that all branches have depth $l_{\max}$. A codeword at depth $l_i$ has $2^{l_{\max}-l_i}$ leaves underneath it at depth $l_{\max}$. The sets of leaves under different codewords are disjoint, and the total number of leaves under codewords is at most $2^{l_{\max}}$. Thus we have
$$\sum_{i=1}^{N} 2^{l_{\max}-l_i} \leq 2^{l_{\max}} \quad\Longrightarrow\quad \sum_{i=1}^{N} 2^{-l_i} \leq 1$$
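
To make the inequality concrete, here is a minimal Python sketch (not part of the original slides; the length sets are assumed examples) that evaluates the Kraft sum for a list of codeword lengths:

```python
# Sketch: evaluate the Kraft sum for a list of codeword lengths.
def kraft_sum(lengths):
    """Return sum_i 2^(-l_i)."""
    return sum(2.0 ** -l for l in lengths)

# Lengths of the prefix code {0, 10, 110, 111}:
print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> a prefix code with these lengths exists
# No prefix code can have these lengths:
print(kraft_sum([1, 2, 2, 2]))   # 1.25 -> Kraft's inequality is violated
```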

Kraft's inequality, cont.

Conversely, given codeword lengths $l_1, \ldots, l_N$ that fulfill Kraft's inequality, we can always construct a tree code. Start with a complete binary tree where all the leaves are at depth $l_{\max}$. Assume, without loss of generality, that the codeword lengths are sorted in increasing order. Choose a free node at depth $l_1$ for the first codeword and remove all its descendants. Do the same for $l_2$ and codeword 2, and so on, until all codewords have been placed.
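
The same construction can be sketched in code. The assignment below is the canonical (lexicographic) variant of the tree procedure: codewords are handed out in order of nondecreasing length, each time taking the leftmost free node. This is a sketch, not the slides' own implementation:

```python
# Sketch: build binary codewords with the given lengths (canonical assignment).
def prefix_code_from_lengths(lengths):
    if sum(2.0 ** -l for l in lengths) > 1:
        raise ValueError("lengths violate Kraft's inequality")
    codewords = []
    value = 0          # index of the next free node at the current depth
    prev_len = 0
    for l in sorted(lengths):
        value <<= (l - prev_len)               # descend to depth l in the tree
        codewords.append(format(value, f"0{l}b"))
        value += 1                             # move to the next free node at depth l
        prev_len = l
    return codewords

print(prefix_code_from_lengths([2, 1, 3, 3]))  # ['0', '10', '110', '111']
```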

Kraft's inequality, cont.

Obviously we can place a codeword at depth $l_1$. For the algorithm to work, there must at every step $i$ be free leaves at the maximum depth $l_{\max}$. The number of remaining leaves is
$$2^{l_{\max}} - \sum_{j=1}^{i-1} 2^{l_{\max}-l_j} = 2^{l_{\max}}\Big(1 - \sum_{j=1}^{i-1} 2^{-l_j}\Big) > 2^{l_{\max}}\Big(1 - \sum_{j=1}^{N} 2^{-l_j}\Big) \geq 0$$
where we used the fact that Kraft's inequality is fulfilled. This shows that there are free leaves at every step. Thus, we can construct a tree code with the given codeword lengths.

Kraft-McMillan's inequality

Kraft's inequality can be shown to hold for all uniquely decodable codes, not just prefix codes. It is then called Kraft-McMillan's inequality: a uniquely decodable code with the codeword lengths $l_1, \ldots, l_N$ exists if and only if
$$\sum_{i=1}^{N} 2^{-l_i} \leq 1$$
Consider $\big(\sum_{i=1}^{N} 2^{-l_i}\big)^n$, where $n$ is an arbitrary positive integer:
$$\Big(\sum_{i=1}^{N} 2^{-l_i}\Big)^n = \sum_{i_1=1}^{N} \cdots \sum_{i_n=1}^{N} 2^{-(l_{i_1} + \cdots + l_{i_n})}$$

Kraft-McMillan's inequality, cont.

$l_{i_1} + \cdots + l_{i_n}$ is the total length of $n$ codewords from the code. The smallest value this exponent can take is $n$, which happens if all codewords have length 1. The largest value it can take is $nl$, where $l$ is the maximal codeword length. The summation can then be written as
$$\Big(\sum_{i=1}^{N} 2^{-l_i}\Big)^n = \sum_{k=n}^{nl} A_k\, 2^{-k}$$
where $A_k$ is the number of combinations of $n$ codewords that have combined length $k$. The number of possible binary sequences of length $k$ is $2^k$. Since the code is uniquely decodable, we must have $A_k \leq 2^k$ in order to be able to decode.

Kraft-McMillan's inequality, cont.

We have
$$\Big(\sum_{i=1}^{N} 2^{-l_i}\Big)^n \leq \sum_{k=n}^{nl} 2^{k}\, 2^{-k} = nl - n + 1$$
which gives us
$$\sum_{i=1}^{N} 2^{-l_i} \leq \big(n(l-1) + 1\big)^{1/n}$$
This is true for all $n$, including when we let $n$ tend to infinity, which finally gives us
$$\sum_{i=1}^{N} 2^{-l_i} \leq 1$$
The converse of the inequality has already been proven, since we know that we can construct a prefix code with the given codeword lengths if they fulfill Kraft's inequality, and all prefix codes are uniquely decodable.
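
The limiting step can be illustrated numerically; in the sketch below the maximal codeword length l = 4 is an arbitrary assumption:

```python
# Sketch: the bound (n(l-1)+1)^(1/n) from the proof tends to 1 as n grows,
# which forces the Kraft sum of a uniquely decodable code to be at most 1.
l = 4   # assumed maximal codeword length
for n in (1, 10, 100, 1000, 10**6):
    print(n, (n * (l - 1) + 1) ** (1.0 / n))
# The printed bound decreases toward 1.
```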

Instantaneous codes

One consequence of Kraft-McMillan's inequality is that there is nothing to gain by using codes that are uniquely decodable but not instantaneous. Given a uniquely decodable but non-instantaneous code with codeword lengths $l_1, \ldots, l_N$, Kraft's inequality guarantees that we can always construct an instantaneous code with exactly the same codeword lengths. The new code will have the same rate as the old one, but it will be easier to decode.

Performance measure

We measure how good a code is with the mean data rate $R$ (often just called data rate or rate):
$$R = \frac{\text{average number of bits per codeword}}{\text{average number of symbols per codeword}} \quad [\text{bits/symbol}]$$
Since we are doing data compression, we want the rate to be as low as possible. If we initially assume that we have a memoryless source $X_j$ and code one symbol at a time using a tree code, then $R$ is given by
$$R = \bar{l} = \sum_{i=1}^{L} p_i l_i \quad [\text{bits/symbol}]$$
where $L$ is the alphabet size and $p_i$ is the probability of symbol $i$. $\bar{l}$ is the mean codeword length [bits/codeword].
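
As a small worked example (the probabilities and lengths below are assumptions chosen for illustration), the rate of a symbol-by-symbol tree code is just the probability-weighted codeword length:

```python
# Sketch: mean codeword length / rate R = sum_i p_i * l_i.
probs   = [0.5, 0.25, 0.125, 0.125]   # assumed symbol probabilities p_i
lengths = [1, 2, 3, 3]                # codeword lengths of {0, 10, 110, 111}

R = sum(p * l for p, l in zip(probs, lengths))
print(R, "bits/symbol")               # 1.75 bits/symbol
```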

Theoretical lower bound

Suppose that we have a memoryless source $X_j$ and that we code one symbol at a time with a prefix code. Then the mean codeword length $\bar{l}$ (which is equal to the rate) is bounded by
$$\bar{l} \geq -\sum_{i=1}^{L} p_i \log_2 p_i = H(X_j)$$
$H(X_j)$ is called the entropy of the source.

Theoretical lower bound, cont.

Consider the difference between the entropy and the mean codeword length:
$$H(X_j) - \bar{l} = -\sum_{i=1}^{L} p_i \log_2 p_i - \sum_{i=1}^{L} p_i l_i = \sum_{i=1}^{L} p_i \Big(\log_2 \frac{1}{p_i} - l_i\Big) = \sum_{i=1}^{L} p_i \log_2 \frac{2^{-l_i}}{p_i}$$
$$\leq \frac{1}{\ln 2} \sum_{i=1}^{L} p_i \Big(\frac{2^{-l_i}}{p_i} - 1\Big) = \frac{1}{\ln 2} \Big(\sum_{i=1}^{L} 2^{-l_i} - \sum_{i=1}^{L} p_i\Big) \leq \frac{1}{\ln 2}(1 - 1) = 0$$
where we used the fact that $\ln x \leq x - 1$ and Kraft's inequality.
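
Continuing the assumed example from above, the bound can be checked numerically; equality holds here because every probability is a negative power of two:

```python
# Sketch: verify l_bar >= H(X) for the assumed example code.
from math import log2

probs   = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]

H     = -sum(p * log2(p) for p in probs if p > 0)
l_bar = sum(p * l for p, l in zip(probs, lengths))
print(H, l_bar)   # 1.75 1.75 -> l_bar >= H, with equality since l_i = -log2(p_i)
```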

Shannon's information measure

We want a measure $I$ of information that is connected to the probabilities of events. Some desired properties:

The information $I(A)$ of an event $A$ should depend only on the probability $P(A)$ of the event.
The lower the probability of the event, the larger the information should be.
If the probability of an event is 1, the information should be 0.
Information should be a continuous function of the probability.
If the independent events $A$ and $B$ both happen, the information should be the sum $I(A) + I(B)$ of the individual informations.

These properties imply that information should be a logarithmic measure.

Information Theory

The information $I(A; B)$ that is given about an event $A$ when event $B$ happens is defined as
$$I(A; B) = \log_b \frac{P(A \mid B)}{P(A)}$$
where we assume that $P(A) \neq 0$ and $P(B) \neq 0$. In the following we assume, unless otherwise specified, that $b = 2$; the unit of information is then called bit. (If $b = e$ the unit is called nat.)

$I(A; B)$ is symmetric in $A$ and $B$:
$$I(A; B) = \log \frac{P(A \mid B)}{P(A)} = \log \frac{P(B \mid A)}{P(B)} = \log \frac{P(AB)}{P(A)P(B)} = I(B; A)$$
Therefore the information is also called mutual information.

Information Theory, cont.

We further have that
$$-\infty \leq I(A; B) \leq -\log P(A)$$
with equality to the left if $P(A \mid B) = 0$ and equality to the right if $P(A \mid B) = 1$. $I(A; B) = 0$ means that the events $A$ and $B$ are independent.

$-\log P(A)$ is the amount of information that needs to be given in order for us to determine that event $A$ has happened:
$$I(A; A) = \log \frac{P(A \mid A)}{P(A)} = -\log P(A)$$
We define the self-information of the event $A$ as
$$I(A) = -\log P(A)$$
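
A small numerical sketch (the probabilities P(A), P(B) and P(AB) below are assumed values, not taken from the slides) of the mutual information between two events and the self-information of one of them:

```python
# Sketch: I(A;B) = log2 P(A|B)/P(A) and self-information I(A) = -log2 P(A).
from math import log2

P_A  = 0.25   # assumed P(A)
P_B  = 0.5    # assumed P(B)
P_AB = 0.20   # assumed P(A and B)

I_AB = log2((P_AB / P_B) / P_A)   # ~0.678 bits: observing B makes A more likely
I_A  = -log2(P_A)                 # 2.0 bits
print(I_AB, I_A)                  # note I(A;B) <= I(A), equality iff P(A|B) = 1
```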

Information Theory, cont.

If we apply the definitions to the events $\{X = x\}$ and $\{Y = y\}$ we get
$$I(X = x) = -\log p_X(x)$$
and
$$I(X = x; Y = y) = \log \frac{p_{X|Y}(x \mid y)}{p_X(x)}$$
These are real functions of the random variable $X$ and the random variable $(X, Y)$, so the mean values are well defined.
$$H(X) = E\{I(X = x)\} = -\sum_{i=1}^{L} p_X(x_i) \log p_X(x_i)$$
is called the entropy (or uncertainty) of the random variable $X$.

Information Theory, cont.

$$I(X; Y) = E\{I(X = x; Y = y)\} = \sum_{i=1}^{L} \sum_{j=1}^{M} p_{XY}(x_i, y_j) \log \frac{p_{X|Y}(x_i \mid y_j)}{p_X(x_i)}$$
is called the mutual information between the random variables $X$ and $Y$.

Information Theory, cont.

If $(X, Y)$ is viewed as one random variable we get
$$H(X, Y) = -\sum_{i=1}^{L} \sum_{j=1}^{M} p_{XY}(x_i, y_j) \log p_{XY}(x_i, y_j)$$
This is called the joint entropy of $X$ and $Y$. It then follows that the mutual information can be written as
$$I(X; Y) = E\Big\{\log \frac{p_{X|Y}}{p_X}\Big\} = E\Big\{\log \frac{p_{XY}}{p_X\, p_Y}\Big\} = E\{\log p_{XY} - \log p_X - \log p_Y\}$$
$$= E\{\log p_{XY}\} - E\{\log p_X\} - E\{\log p_Y\} = H(X) + H(Y) - H(X, Y)$$
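
These definitions translate directly into code. The joint pmf below is an assumed two-by-two example; the entropies and the mutual information are computed straight from the formulas above:

```python
# Sketch: H(X), H(Y), H(X,Y) and I(X;Y) from an assumed joint pmf.
from math import log2

p_XY = [[0.25, 0.25],     # p_XY[i][j] = P(X = x_i, Y = y_j)
        [0.00, 0.50]]

p_X = [sum(row) for row in p_XY]              # marginal of X
p_Y = [sum(col) for col in zip(*p_XY)]        # marginal of Y

def H(probs):
    """Entropy -sum p log2 p, with the convention 0 log 0 = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

H_X, H_Y = H(p_X), H(p_Y)
H_XY = H([p for row in p_XY for p in row])
I_XY = H_X + H_Y - H_XY                       # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(H_X, H_Y, H_XY, I_XY)                   # 1.0  0.811...  1.5  0.311...
```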

Information Theory, cont.

A useful inequality:
$$\log r \leq (r - 1) \log e$$
with equality if and only if $r = 1$. This can also be written $\ln r \leq r - 1$.

If $X$ takes values in $\{x_1, x_2, \ldots, x_L\}$ we have that
$$0 \leq H(X) \leq \log L$$
with equality to the left if and only if there is an $i$ such that $p_X(x_i) = 1$, and with equality to the right if and only if $p_X(x_i) = 1/L$ for all $i = 1, \ldots, L$.

Information Theory, cont.

Proof of the left inequality:
$$-p_X(x_i) \log p_X(x_i) \begin{cases} = 0, & p_X(x_i) = 0 \\ > 0, & 0 < p_X(x_i) < 1 \\ = 0, & p_X(x_i) = 1 \end{cases}$$
Thus we have that $H(X) \geq 0$, with equality if and only if $p_X(x_i)$ is either 0 or 1 for each $i$; but this means that $p_X(x_i) = 1$ for exactly one $i$.

Information Theory, cont.

Proof of the right inequality:
$$H(X) - \log L = -\sum_{i=1}^{L} p_X(x_i) \log p_X(x_i) - \log L = \sum_{i=1}^{L} p_X(x_i) \log \frac{1}{L\, p_X(x_i)}$$
$$\leq \sum_{i=1}^{L} p_X(x_i) \Big(\frac{1}{L\, p_X(x_i)} - 1\Big) \log e = \Big(\sum_{i=1}^{L} \frac{1}{L} - \sum_{i=1}^{L} p_X(x_i)\Big) \log e = (1 - 1) \log e = 0$$
with equality if and only if $p_X(x_i) = \frac{1}{L}$ for all $i = 1, \ldots, L$.
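
The bounds are easy to confirm on small assumed distributions; the deterministic case gives H = 0 and the uniform case attains log2 L:

```python
# Sketch: 0 <= H(X) <= log2 L, with the maximum attained by the uniform pmf.
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

L = 4
print(H([1.0, 0.0, 0.0, 0.0]))       # 0.0     (one outcome has probability 1)
print(H([0.5, 0.25, 0.125, 0.125]))  # 1.75    (strictly between the bounds)
print(H([0.25] * L), log2(L))        # 2.0 2.0 (uniform attains log2 L)
```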

Information Theory, cont.

The conditional entropy of $X$ given the event $Y = y_j$ is
$$H(X \mid Y = y_j) = -\sum_{i=1}^{L} p_{X|Y}(x_i \mid y_j) \log p_{X|Y}(x_i \mid y_j)$$
We have that $0 \leq H(X \mid Y = y_j) \leq \log L$.

The conditional entropy of $X$ given $Y$ is defined as
$$H(X \mid Y) = E\{-\log p_{X|Y}\} = -\sum_{i=1}^{L} \sum_{j=1}^{M} p_{XY}(x_i, y_j) \log p_{X|Y}(x_i \mid y_j)$$
We have that $0 \leq H(X \mid Y) \leq \log L$.

Information Theory, cont.

We also have that
$$H(X \mid Y) = -\sum_{i=1}^{L} \sum_{j=1}^{M} p_{XY}(x_i, y_j) \log p_{X|Y}(x_i \mid y_j)$$
$$= -\sum_{i=1}^{L} \sum_{j=1}^{M} p_Y(y_j)\, p_{X|Y}(x_i \mid y_j) \log p_{X|Y}(x_i \mid y_j)$$
$$= \sum_{j=1}^{M} p_Y(y_j) \Big(-\sum_{i=1}^{L} p_{X|Y}(x_i \mid y_j) \log p_{X|Y}(x_i \mid y_j)\Big)$$
$$= \sum_{j=1}^{M} p_Y(y_j)\, H(X \mid Y = y_j)$$
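
Using the same assumed joint pmf as before, H(X|Y) can be computed both from the definition and as the weighted average derived above; the two agree:

```python
# Sketch: H(X|Y) from the definition and as sum_j p_Y(y_j) H(X | Y = y_j).
from math import log2

p_XY = [[0.25, 0.25],
        [0.00, 0.50]]
p_Y = [sum(col) for col in zip(*p_XY)]

# Direct definition: -sum_ij p(x_i, y_j) log2 p(x_i | y_j)
H_def = -sum(p_XY[i][j] * log2(p_XY[i][j] / p_Y[j])
             for i in range(2) for j in range(2) if p_XY[i][j] > 0)

# Weighted average of the per-outcome entropies H(X | Y = y_j)
H_avg = 0.0
for j in range(2):
    cond = [p_XY[i][j] / p_Y[j] for i in range(2)]
    H_avg += p_Y[j] * -sum(p * log2(p) for p in cond if p > 0)

print(H_def, H_avg)   # both ~0.689 bits
```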

Information Theory, cont.

We have
$$p_{X_1 X_2 \ldots X_N} = p_{X_1}\, p_{X_2 \mid X_1} \cdots p_{X_N \mid X_1 \ldots X_{N-1}}$$
which leads to the chain rule
$$H(X_1 X_2 \ldots X_N) = H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_N \mid X_1 \ldots X_{N-1})$$
We also have that
$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$
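
For two variables the chain rule reads H(X,Y) = H(Y) + H(X|Y), and together with the previous results it gives I(X;Y) = H(X) - H(X|Y); both identities can be checked on the same assumed joint pmf:

```python
# Sketch: check H(X,Y) = H(Y) + H(X|Y) and I(X;Y) = H(X) - H(X|Y).
from math import log2

p_XY = [[0.25, 0.25],
        [0.00, 0.50]]
p_X = [sum(row) for row in p_XY]
p_Y = [sum(col) for col in zip(*p_XY)]

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

H_XY = H([p for row in p_XY for p in row])
H_X_given_Y = -sum(p_XY[i][j] * log2(p_XY[i][j] / p_Y[j])
                   for i in range(2) for j in range(2) if p_XY[i][j] > 0)

print(H_XY, H(p_Y) + H_X_given_Y)                    # chain rule: both 1.5
print(H(p_X) - H_X_given_Y, H(p_X) + H(p_Y) - H_XY)  # I(X;Y) two ways, ~0.311
```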

Information Theory, cont.

Other interesting inequalities:
$$H(X \mid Y) \leq H(X)$$
with equality if and only if $X$ and $Y$ are independent.
$$I(X; Y) \geq 0$$
with equality if and only if $X$ and $Y$ are independent.

If $f(X)$ is a function of $X$, we have
$$H(f(X)) \leq H(X), \qquad H(f(X) \mid X) = 0, \qquad H(X, f(X)) = H(X)$$

Entropy for sources

The entropy, or entropy rate, of a stationary random source $X_n$ is defined as
$$\lim_{n \to \infty} \frac{1}{n} H(X_1 \ldots X_n) = \lim_{n \to \infty} \frac{1}{n}\big(H(X_1) + H(X_2 \mid X_1) + \ldots + H(X_n \mid X_1 \ldots X_{n-1})\big) = \lim_{n \to \infty} H(X_n \mid X_1 \ldots X_{n-1})$$
For a memoryless source, the entropy rate is equal to $H(X_n)$.

Entropy for Markov sources

The entropy rate of a stationary Markov source $X_n$ of order $k$ is given by
$$H(X_n \mid X_{n-1} \ldots X_{n-k})$$
The entropy rate of the state sequence $S_n$ is the same as the entropy rate of the source:
$$H(S_n \mid S_{n-1} S_{n-2} \ldots) = H(S_n \mid S_{n-1}) = H(X_n \ldots X_{n-k+1} \mid X_{n-1} \ldots X_{n-k}) = H(X_n \mid X_{n-1} \ldots X_{n-k})$$
Thus the entropy rate can also be calculated as
$$H(S_n \mid S_{n-1}) = \sum_{j=1}^{L^k} w_j\, H(S_n \mid S_{n-1} = s_j)$$
i.e. a weighted average of the entropies of the outgoing transition probabilities of each state, where $w_j$ is the stationary probability of state $s_j$.
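
As a final sketch (the two-state transition matrix is an assumed example), the entropy rate of a first-order Markov source is the stationary-weighted average of the row entropies of its transition matrix:

```python
# Sketch: entropy rate of a stationary first-order Markov source.
import numpy as np

P = np.array([[0.9, 0.1],      # P[i][j] = probability of moving from state i to j
              [0.4, 0.6]])

# Stationary distribution w: left eigenvector of P for eigenvalue 1, normalized.
eigvals, eigvecs = np.linalg.eig(P.T)
w = np.real(eigvecs[:, np.isclose(eigvals, 1.0)][:, 0])
w = w / w.sum()                # here w = [0.8, 0.2]

def H(row):
    row = row[row > 0]
    return -np.sum(row * np.log2(row))

entropy_rate = sum(w[j] * H(P[j]) for j in range(len(w)))
print(w, entropy_rate)         # H(S_n | S_{n-1}) ~ 0.57 bits/symbol
```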