Mathematical methods in communication                                Lecture 3
Lecturer: Haim Permuter                Scribe: Yuval Carmel, Dima Khaykin, Ziv Goldfeld

I. REMINDER

A. Convex Set

A set $R$ is a convex set iff for all $x_1, x_2 \in R$ and all $\theta$, $0 \le \theta \le 1$,
$\theta x_1 + \bar{\theta} x_2 \in R$,   (1)
where $\bar{\theta} = 1 - \theta$.

B. Convex Function

A function $f : \mathbb{R} \to \mathbb{R}$ is a convex function iff for all $x_1, x_2 \in \mathbb{R}$ and all $\theta$, $0 \le \theta \le 1$,
$f(\theta x_1 + \bar{\theta} x_2) \le \theta f(x_1) + \bar{\theta} f(x_2)$.   (2)

Fig. 1. Epigraph of a function $f(x)$.

The epigraph of a function is the set of points lying on or above its graph. epi-, ep- (Greek): above, over, on, upon.
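As a quick numerical illustration of definition (2), the following sketch (ours, not part of the original notes; the test functions, sampling range, and tolerance are arbitrary choices) draws random points and checks the defining inequality.

```python
import random

def is_convex_on_samples(f, trials=10000, lo=-10.0, hi=10.0, tol=1e-9):
    """Numerical sanity check of definition (2), not a proof:
    f(theta*x1 + (1-theta)*x2) <= theta*f(x1) + (1-theta)*f(x2)."""
    for _ in range(trials):
        x1, x2 = random.uniform(lo, hi), random.uniform(lo, hi)
        theta = random.uniform(0.0, 1.0)
        lhs = f(theta * x1 + (1 - theta) * x2)
        rhs = theta * f(x1) + (1 - theta) * f(x2)
        if lhs > rhs + tol:
            return False
    return True

print(is_convex_on_samples(lambda x: x * x))    # True: x^2 is convex
print(is_convex_on_samples(lambda x: abs(x)))   # True: |x| is convex (no second derivative at 0)
print(is_convex_on_samples(lambda x: -x * x))   # False: -x^2 is concave
```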

The mathematical definition, for $f : \mathbb{R}^n \to \mathbb{R}$, is
$\mathrm{epi}\, f = \{(x,\mu) : x \in \mathbb{R}^n, \mu \in \mathbb{R}, \mu \ge f(x)\} \subseteq \mathbb{R}^{n+1}$.

- A function is convex if and only if its epigraph is a convex set.
- If the second derivative of a function $f : \mathbb{R} \to \mathbb{R}$ exists in the interval $(a,b)$ and satisfies $\frac{d^2 f(x)}{dx^2} \ge 0$ for all $x \in (a,b)$, then the function is convex in this domain.
- A function may be convex even if the second derivative does not exist. The function $|x|$ is convex, but its second derivative at $x = 0$ does not exist.

C. Convexity and Concavity of Information-Theoretic Functions

- $D(p\|q)$ is a convex function in the pair $(p,q)$, where $D(p\|q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$, and $D(p\|q) = 0$ iff $p(x) = q(x)$ for all $x \in \mathcal{X}$.
- $H(X)$ is concave in $p(x)$.
- $I(X;Y)$ is concave in $p(x)$ when $p(y|x)$ is fixed.
- $I(X;Y)$ is convex in $p(y|x)$ when $p(x)$ is fixed.

II. LOSSLESS SOURCE CODING / DATA COMPRESSION

In this section we look at a problem called lossless source coding. In this problem there is a source sequence $X^n = (X_1, X_2, \ldots, X_n)$, whose elements are distributed i.i.d. according to $P_X$, and each symbol $x \in \mathcal{X}$ is encoded into a codeword $C(x)$ of length $l(x)$. The receiver obtains the sequence of codewords $\{C(X_i)\}_{i=1}^n$ and needs to reconstruct the source sequence $\{\hat{X}_i\}_{i=1}^n$ losslessly, namely, with probability 1.

Fig. 2. Lossless source coding problem: the source sequence $X_1, X_2, \ldots, X_n$ enters the encoder, which outputs codewords $f(X_i) \in \{0,1\}^{l(X_i)}$, and the decoder produces $\hat{X}_1, \hat{X}_2, \ldots, \hat{X}_n$. The encoder converts a value $x \in \mathcal{X}$ into a sequence of bits of variable length, denoted $\{0,1\}^{l(x)}$, where $X_i \sim$ i.i.d. $p(x)$. The term $l(x)$ is the length of the codeword associated with the value $x \in \mathcal{X}$.
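The convexity and concavity facts listed above can be checked numerically on small examples. The following sketch (ours; the pmfs p1, q1, p2, q2 and the mixing weight t are arbitrary illustrative choices) computes $H$ and $D(p\|q)$ in bits and verifies the Jensen-type inequalities for one mixture.

```python
import math

def H(p):
    """Entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def D(p, q):
    """Relative entropy D(p||q) in bits (assumes q(x) > 0 wherever p(x) > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two pairs of pmfs on a 3-letter alphabet, and a mixture of the pairs.
p1, q1 = [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]
p2, q2 = [0.1, 0.1, 0.8], [0.4, 0.4, 0.2]
t = 0.4
p_mix = [t * a + (1 - t) * b for a, b in zip(p1, p2)]
q_mix = [t * a + (1 - t) * b for a, b in zip(q1, q2)]

# Convexity of D in the pair (p, q): D(mixture) <= mixture of D's.
print(D(p_mix, q_mix) <= t * D(p1, q1) + (1 - t) * D(p2, q2))   # True

# D(p||q) = 0 iff p = q, and D(p||q) > 0 otherwise.
print(D(p1, p1), D(p1, q1) > 0)                                  # 0.0 True

# Concavity of H in p: H(mixture) >= mixture of H's.
print(H(p_mix) >= t * H(p1) + (1 - t) * H(p2))                   # True
```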

Definition 1 (Operational definition of instantaneous source coding) An instantaneous source code $C$ consists of an encoding function
$f : \mathcal{X} \to \{0,1\}^{l(x)}$,   (3)
where $l(x)$ is the length of the codeword associated with the symbol $x$, and a decoding function that maps a sequence of bits into a sequence of estimated symbols $\hat{X}$.

Definition 2 (Lossless source code) We say that a code is lossless if $\Pr\{X^n = \hat{X}^n\} = 1$.

Definition 3 (Expected length of a code) The expected length of a code is $E[l(X)]$.

Our goal as engineers is to design a code that minimizes the expected length. By the law of large numbers, since the source is i.i.d., the average length per symbol of the encoded sequence will be close to $E[l(X)]$ with high probability. The following example illustrates a code and its expected length.

Example 1 (Mean length) Let $X$ be a source with alphabet $\mathcal{X} = \{A, B, C, D\}$ and
$X = A$ with $p = 1/2$, $B$ with $p = 1/4$, $C$ with $p = 1/8$, $D$ with $p = 1/8$.   (4)
Consider the following encoding function $f$, decoding function $g$, and associated code lengths:
$f(A) = 0$,    $g(0) = A$,    $l(A) = 1$ [bit]
$f(B) = 10$,   $g(10) = B$,   $l(B) = 2$ [bits]
$f(C) = 110$,  $g(110) = C$,  $l(C) = 3$ [bits]
$f(D) = 111$,  $g(111) = D$,  $l(D) = 3$ [bits]   (5)
The expected code length obtained in this example is
$E[l(X)] = \tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 2 + \tfrac{1}{8} \cdot 3 + \tfrac{1}{8} \cdot 3 = 1.75$ [bits].   (6)

Definition 4 (Codebook) A codebook is the list of all codewords in the code.

For instance, the codebook in Example 1 is $\{0, 10, 110, 111\}$. We denote a codeword by $c_i$ (e.g., $c_1 = 0$, $c_2 = 10$) and the length of codeword $c_i$ by $l_i$.

Definition 5 (Non-singular code) A code is non-singular if for any $x_1 \neq x_2$, $f(x_1) \neq f(x_2)$.

Definition 6 (Uniquely decodable code) We say that a code is uniquely decodable if any extension of codewords is non-singular. An extension of codewords is a concatenation $f(x_1)f(x_2)f(x_3)f(x_4)\ldots$ without any spaces or commas.

Definition 7 (Prefix code) A code is a prefix code, a.k.a. an instantaneous code, if no codeword is a prefix of any other codeword.

Fig. 3. The relations between non-singular, uniquely decodable and prefix codes: the prefix codes are contained in the uniquely decodable codes, which are contained in the non-singular codes.

Figure 3 illustrates the relations between the three different classes of codes defined above.
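These definitions are easy to check mechanically. The sketch below (ours, assuming the codebook $\{0, 10, 110, 111\}$ reconstructed above) computes the expected length of Example 1 and tests the non-singular and prefix conditions.

```python
# Example 1 codebook with its probabilities.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}
prob = {"A": 1 / 2, "B": 1 / 4, "C": 1 / 8, "D": 1 / 8}

def expected_length(code, prob):
    """E[l(X)] = sum over x of p(x) * l(x)."""
    return sum(prob[x] * len(cw) for x, cw in code.items())

def is_non_singular(code):
    """Definition 5: distinct symbols get distinct codewords."""
    return len(set(code.values())) == len(code)

def is_prefix_free(code):
    """Definition 7: no codeword is a prefix of any other codeword."""
    words = list(code.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words)) for j in range(len(words))
    )

print(expected_length(code, prob))  # 1.75 bits, as in (6)
print(is_non_singular(code))        # True
print(is_prefix_free(code))         # True
```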

III. KRAFT INEQUALITY

Theorem 1 (Kraft inequality) For any prefix code $c_1, c_2, \ldots, c_m$ with lengths $l_1, l_2, \ldots, l_m$, we have
$\sum_{i=1}^{m} 2^{-l_i} \le 1$.   (7)
Conversely, if there are lengths $l_1, l_2, \ldots, l_m$ that satisfy (7), then there exists a prefix code with lengths $l_1, l_2, \ldots, l_m$.

Proof:

Fig. 4. Binary tree of depth $l_{\max}$ with $2^{l_{\max}}$ leaves.

We define $l_{\max} = \max\{l_1, l_2, \ldots, l_m\}$ and construct a binary tree of depth $l_{\max}$, as depicted in Fig. 4.

Fig. 5. Codewords ($c_1, c_2, c_3, c_4$) placed on the tree and the number of leaves each one covers.

The left branch is associated with a 0 in the code and the right branch with a 1. Now we place the codewords $c_1, c_2, \ldots, c_m$ on the tree, as depicted in Fig. 5. The number of leaves (nodes at the bottom of the tree) associated with codeword $c_i$ is $2^{l_{\max} - l_i}$. Since the code is a prefix code, the sets of leaves associated with the codewords are disjoint. Therefore,
$\sum_{i=1}^{m} 2^{l_{\max} - l_i} \le 2^{l_{\max}}$,   (8)
and this implies that the Kraft inequality given in (7) holds.

Now we prove the second direction, namely, that if a list of lengths $l_1, l_2, \ldots, l_m$ satisfies the Kraft inequality, then there exists a codebook $c_1, c_2, \ldots, c_m$ with lengths $l_1, l_2, \ldots, l_m$ that is a prefix codebook. The idea is to place the codewords on the tree in such a way that the sets of leaves associated with the codewords do not overlap. Let $l_1 \le l_2 \le \cdots \le l_m$. For length $l_i$, associate $2^{l_{\max}-l_i}$ leaves and place the codeword so that it covers only those leaves. For instance, if $l_1$ has length 2 bits, the first codeword removes $2^{l_{\max}-2}$ leaves, i.e., a fraction $1/4$ of the leaves; in general, codeword $i$ removes $2^{l_{\max}-l_i}$ leaves out of $2^{l_{\max}}$, i.e., a fraction $2^{-l_i}$ of the leaves. The next codeword therefore has another branch to start from, meaning a different prefix. Since $\sum_i 2^{-l_i} \le 1$, there are enough leaves to associate with each codeword, and because the leaves do not overlap, the codebook is a prefix codebook.
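The constructive direction of the proof can be mirrored in a few lines of code. The following sketch (ours, not from the notes) checks the Kraft sum and assigns codewords in order of increasing length, each new codeword starting just after the block of leaves consumed by the previous ones.

```python
def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality (7)."""
    return sum(2 ** (-l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Build a prefix code for lengths satisfying (7).

    Mirrors the tree argument: codewords are assigned in order of increasing
    length; each one starts right after the subtree used by its predecessors.
    """
    assert kraft_sum(lengths) <= 1, "lengths violate the Kraft inequality"
    codewords = []
    node = 0          # index of the next free node at the current depth
    prev_len = 0
    for l in sorted(lengths):
        node <<= (l - prev_len)                  # descend to depth l
        codewords.append(format(node, "0{}b".format(l)))
        node += 1                                # skip past the subtree just used
        prev_len = l
    return codewords

lengths = [1, 2, 3, 3]
print(prefix_code_from_lengths(lengths))         # ['0', '10', '110', '111']
print(kraft_sum(lengths))                        # 1.0
```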

3-7 Let us normalize the term in order to to get a pmf. L H(X) = i p i log p i log p i logp i. (2) j 2 l j + i j 2 l j + i Now, by letting r i = 2 l i, we obtain j 2 l j L H(X) = i p i logr i + i p i logp i +log j 2 l j. (3) Note that Using the definition of relative entropy we can write r i = (4) L H(X) = D(p r)+log i j 2 l j. (5) Since D(p r) and log j 2 l j obtained that L H(X). Note that equality is achieved if and only if due to Kraft inequality, given in Theorem, we p i =, (6) or equivalently, l i = log p i (7) This leads to the following definition and corollary. Definition 8 (D-adic distribution) A distribution,p X is called D-adic distribution if for any x X, there exists an integer n x, s.t. P(x) = D nx. Corollary (Optimality of 2-adic code) There exists a binary prefix code for a source X with an average length H(X) if and only if X has a 2-adic.

V. SHANNON-FANO CODE

The lengths $l_i$ of the Shannon-Fano code are defined as
$l_i = \left\lceil \log \frac{1}{p_i} \right\rceil$.   (18)
These lengths satisfy the Kraft inequality:
$\sum_i 2^{-l_i} = \sum_i 2^{-\lceil \log \frac{1}{p_i} \rceil}$   (19)
$\le \sum_i 2^{-\log \frac{1}{p_i}}$   (20)
$= \sum_i 2^{\log p_i}$   (21)
$= \sum_i p_i$   (22)
$= 1$;   (23)
hence, according to Theorem 1, there exists a prefix code with the lengths given in (18). The expected length is bounded by
$L = \sum_i p_i l_i = \sum_i p_i \left\lceil \log \frac{1}{p_i} \right\rceil$   (24)
$\le \sum_i p_i \left( \log \frac{1}{p_i} + 1 \right)$   (25)
$= H(X) + 1$.   (26)

Example 2 (Wrong coding) Let $X \sim p(x)$, and suppose we wrongly assume $X \sim q(x)$. What would be the expected length $L$ when $l_i = \lceil \log \frac{1}{q_i} \rceil$?

Solution:
$L_q = \sum_i p_i l_i = \sum_i p_i \left\lceil \log \frac{1}{q_i} \right\rceil$   (27)
$\le \sum_i p_i \left( \log \frac{1}{q_i} + 1 \right)$   (28)
$= \sum_i p_i \log \frac{p_i}{q_i} - \sum_i p_i \log p_i + 1$   (29)
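A short sketch of the Shannon-Fano construction in (18) (ours; the pmfs p and q are arbitrary examples): it computes the lengths, checks the Kraft inequality, verifies $H(X) \le L < H(X)+1$, and illustrates the mismatch penalty of Example 2.

```python
import math

def shannon_fano_lengths(p):
    """Shannon-Fano lengths l_i = ceil(log2(1/p_i)) from (18)."""
    return [math.ceil(math.log2(1 / pi)) for pi in p]

def expected_length(p, lengths):
    return sum(pi * li for pi, li in zip(p, lengths))

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.15, 0.1]
lengths = shannon_fano_lengths(p)
print(lengths)                                                       # [1, 2, 3, 4]
print(sum(2 ** -l for l in lengths) <= 1)                            # Kraft inequality holds
print(entropy(p) <= expected_length(p, lengths) < entropy(p) + 1)    # H <= L < H + 1

# "Wrong coding" as in Example 2: design lengths for q, pay about D(p||q) extra bits.
q = [0.25, 0.25, 0.25, 0.25]
L_wrong = expected_length(p, shannon_fano_lengths(q))
print(L_wrong <= entropy(p) + kl(p, q) + 1)                          # bound (30) holds
```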

$= D(p\|q) + H(X) + 1$   (30)
$= D(p\|q) + L_p$,   (31)
where $L_p$ denotes the corresponding bound $H(X)+1$ obtained when coding with the true distribution $p$ (cf. (26)). The expected length is increased by $D(p\|q)$.

VI. HUFFMAN CODE

An optimal (shortest expected length) prefix code for a given distribution can be constructed by a simple algorithm discovered by Huffman. Let us introduce Huffman codes with some examples. Consider a random variable $X$ taking values in the set $\mathcal{X} = \{1,2,3,4,5\}$ with probabilities $\{0.3, 0.25, 0.25, 0.1, 0.1\}$, respectively. We expect the optimal binary code for $X$ to have the longest codewords assigned to the symbols 4 and 5. These two lengths must be equal, since otherwise we could delete a bit from the longer codeword and still have a prefix code, but with a shorter expected length. In general, we can construct a code in which the two longest codewords differ only in the last bit. For this code, we can combine the symbols 4 and 5 into a single source symbol with probability assignment 0.2. Proceeding this way, combining the two least likely symbols into one symbol until we are finally left with only one symbol, and then assigning codewords to the symbols, we obtain the resulting code. This code has an average length of 2.2 bits. If the probability vector is dyadic, meaning $p_i = 2^{-n_i}$, the Huffman code achieves $H(X)$.
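The merging procedure described above is easy to implement. Below is a minimal sketch (ours, not the notes' construction) that tracks only codeword lengths: each time the two least likely groups are merged, every symbol inside them gains one bit.

```python
import heapq
import itertools

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code (sketch: repeatedly merge
    the two least likely symbols, as described above)."""
    counter = itertools.count()            # tie-breaker so heapq never compares lists
    heap = [(p, next(counter), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:            # every merge adds one bit to all merged symbols
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next(counter), syms1 + syms2))
    return lengths

probs = [0.3, 0.25, 0.25, 0.1, 0.1]
lengths = huffman_lengths(probs)
print(lengths)                                        # [2, 2, 2, 3, 3]
print(sum(p * l for p, l in zip(probs, lengths)))     # 2.2 bits, as in the text
```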

Example: For a dyadic probability vector, such as the source of Example 1, the expected length $L$ is equal to the entropy $H(X)$:
$L = E[l(X)] = \tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 2 + \tfrac{1}{8} \cdot 3 + \tfrac{1}{8} \cdot 3 = 1.75 = H(X)$.   (32)

Optimality of the Huffman code

Lemma 1 For any distribution, there exists an optimal instantaneous code (with minimum expected length) that satisfies the following properties:
1) The lengths are ordered inversely with the probabilities (i.e., if $p_i \ge p_j$, then $l_i \le l_j$).
2) The two longest codewords have the same length.
3) Two of the longest codewords differ only in the last bit and correspond to the two least likely symbols.

APPENDIX I
ALTERNATIVE PROOF OF KRAFT'S INEQUALITY

We now present the infinite version of the Kraft inequality.

Theorem 3 (Kraft's inequality) (i) For any prefix code $\{c_i\}_{i \ge 1}$ with lengths $\{l_i\}_{i \ge 1}$, we have
$\sum_i 2^{-l_i} \le 1$.   (33)
(ii) Conversely, if $\{l_i\}$ satisfies (34), then there exists a prefix code with these lengths.

Remark 1 The following proof of Kraft's inequality is preferable to the previous proof because it does not demand a finite set of codewords or lengths. However, part (i) of Theorem 3 is equivalent to saying that for any subset of $m$ lengths $l_1, l_2, l_3, \ldots, l_m$ from the code we have
$\sum_{i=1}^{m} 2^{-l_i} \le 1$,   (34)
and this follows directly from the finite version of the Kraft inequality given in Theorem 1, since any subset of the code such as $\{c_1, c_2, \ldots, c_m\}$ is also a prefix code. However, part (ii) of the theorem requires the construction of a prefix code for infinitely many symbols, and the finite version cannot be used.

The main idea of this proof is exactly the same as in the proof with the tree, except that we use intervals on $[0,1]$ rather than leaves; instead of disjoint leaves we have disjoint intervals.

Proof: We start by proving part (i). Let $\{c_i\}$ be a prefix code, where $c_i$ is a codeword of length $l_i = |c_i|$. We define a function $f : \{c_i\} \to [0,1]$ that computes the value of $c_i$ read as a binary fraction:
$f(c_i) = \sum_{j=1}^{l_i} c_{i,j} 2^{-j}$.
For future reference, we inspect the interval $[f(c_i), f(c_i) + 2^{-l_i})$. Note that:
1. $0 \le f(c_i) \le 1$.
2. $f(c_i 000\ldots) = f(c_i)$, i.e., adding zeros at the end of the codeword does not change the value of $f(c_i)$.
3. $f(c_i 111\ldots) = f(c_i) + \sum_{j=1}^{\infty} 2^{-(l_i+j)} = f(c_i) + 2^{-l_i}$. This follows from the fact that
$\sum_{i=1}^{\infty} 2^{-i} = 1$.   (35)
We assume that the $\{c_i\}$ are arranged in increasing lexicographic order, which means that $f(c_i) \le f(c_k)$ for all $i \le k$. (It is always possible to arrange a code in lexicographic order; however, there might be infinitely many codewords with a lower lexicographic order than a specific codeword, hence assuming a finite index $i$ is not always possible. This can be corrected by avoiding the indexing by $i$.)

Since $\{c_i\}$ is a prefix code, we have
$f(c_{i+1}) \ge f(c_i) + 2^{-l_i}$.   (36)
Thus, we get that the intervals $[f(c_i), f(c_i) + 2^{-l_i})$ are pairwise disjoint. By repeated use of inequality (36) we obtain
$f(c_m) \ge \sum_{i=1}^{m-1} 2^{-l_i}$.
Since, by definition, $f(c_m) + 2^{-l_m} \le 1$, this proves the first part of the theorem, i.e.,
$\sum_{i=1}^{m} 2^{-l_i} \le 1$ for every $m$.

We have seen that a necessary condition for a code $\{c_i\}$ to be a prefix code is that the intervals $[f(c_i), f(c_i) + 2^{-l_i})$ are pairwise disjoint. The proof of the second part of the theorem is based upon the claim that this condition is also sufficient:

Lemma 2 Given a code $\{c_i\}$ such that the intervals $[f(c_i), f(c_i) + 2^{-l_i})$ are disjoint, the code is a prefix code.

Remark 2 In the following proof we use the fact that in order to prove $A \Rightarrow B$ one can show that $B^c \Rightarrow A^c$ (i.e., not $B$ implies not $A$).

Proof: Assume, to the contrary, that the code $\{c_i\}$ is not a prefix code. If so, we can find two codewords $c_m$ and $c_n$ (without loss of generality we assume $m > n$ and thus $l_m > l_n$) for which the first $l_n$ bits of $c_m$ are identical to the bits of $c_n$. In this case,
$f(c_m) = \sum_{j=1}^{l_m} c_{m,j} 2^{-j} = \sum_{j=1}^{l_n} c_{n,j} 2^{-j} + \sum_{j=l_n+1}^{l_m} c_{m,j} 2^{-j} < f(c_n) + 2^{-l_n}$.
So we get that $f(c_n) \le f(c_m) < f(c_n) + 2^{-l_n}$, i.e., $f(c_m)$ lies in $[f(c_n), f(c_n) + 2^{-l_n})$, in contradiction to the fact that the intervals $[f(c_n), f(c_n) + 2^{-l_n})$ and $[f(c_m), f(c_m) + 2^{-l_m})$ are pairwise disjoint. Thus the code is a prefix code.

Proof of (ii):

Assume that the lengths $\{l_i\}$ are given and satisfy Kraft's inequality (34). We prove that we can find a prefix code with the given lengths. Without loss of generality, assume that $l_1 \le l_2 \le \ldots$. We define the word $c_i$ to be the inverse image under the mapping $f$ of the number $\sum_{j=1}^{i-1} 2^{-l_j}$, i.e., $c_i$ is the only word (up to the addition of zeros from the right) such that the equality
$f(c_i) = \sum_{j=1}^{i-1} 2^{-l_j}$
holds. To calculate $c_i$ we use the inverse function $f^{-1} : [0,1] \to \{c_i\}$. In order to justify that use, we first show that $0 \le f(c_i) \le 1$. From the structure of $f(c_i)$ it is easy to see that $f(c_i) \ge 0$ for every $i$. Moreover, using the assumption of the theorem (i.e., inequality (34)) we get that
$f(c_i) = \sum_{j=1}^{i-1} 2^{-l_j} \le 1$
for every $i$. Thus we get that $0 \le f(c_i) \le 1$.

Next we show that the length of every codeword $c_i$ built this way is indeed no longer than $l_i$. Again from the structure of $f(c_i)$, it is simple to see that the maximal number of bits needed for the codeword $c_i$ is $l_i$ bits. Because we assume that the lengths are arranged in increasing order (i.e., $l_{i-1} \le l_i$ for every $i$), the length of each codeword $c_i$ cannot be longer than $|c_i| = l_i$. If it is shorter, we add zeros from the right up to the wanted length.

To complete the proof, it is enough to show that the intervals
$I_i = \left[ f(c_i), f(c_i) + 2^{-l_i} \right) = \left[ \sum_{j=1}^{i-1} 2^{-l_j}, \; \sum_{j=1}^{i} 2^{-l_j} \right)$
are pairwise disjoint and finally use Lemma 2. Since, by definition, $f(c_i)$ increases as $i$ increases and the right border of the interval $I_i$ is the left border of the interval $I_{i+1}$, the intervals $\{I_i\}$ are pairwise disjoint, which concludes the proof.
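The construction in this proof is essentially arithmetic on binary fractions and is easy to test for a finite list of lengths. The sketch below (ours, not from the notes) builds each codeword as the $l_i$-bit binary expansion of $f(c_i) = \sum_{j<i} 2^{-l_j}$ and checks that the resulting code is prefix-free, as guaranteed by Lemma 2.

```python
def codewords_from_lengths(lengths):
    """Interval construction from the appendix (a sketch): codeword c_i is the
    binary expansion, to l_i bits, of f(c_i) = sum over j < i of 2^{-l_j}."""
    assert all(lengths[i] <= lengths[i + 1] for i in range(len(lengths) - 1))
    codewords, f_ci = [], 0.0
    for l in lengths:
        # write f(c_i) as a binary fraction with exactly l_i bits
        bits, x = "", f_ci
        for _ in range(l):
            x *= 2
            bit, x = int(x), x - int(x)
            bits += str(bit)
        codewords.append(bits)
        f_ci += 2.0 ** (-l)      # left end of the next interval I_{i+1}
    return codewords

lengths = [1, 2, 3, 3]
code = codewords_from_lengths(lengths)
print(code)                      # ['0', '10', '110', '111']
# The intervals [f(c_i), f(c_i) + 2^{-l_i}) tile [0, 1) from left to right,
# so no codeword is a prefix of another (Lemma 2).
print(all(not b.startswith(a) for a in code for b in code if a != b))   # True
```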