Lecture Notes on Digital Transmission: Source and Channel Coding

José Manuel Bioucas Dias

February 2015

CHAPTER 1
Source and Channel Coding

Contents

1 Source and Channel Coding
  1.1 Introduction
  1.2 Source Coding
    1.2.1 Discrete stationary sources
    1.2.2 Discrete memoryless sources (DMS)
    1.2.3 Entropy
    1.2.4 Source extension
    1.2.5 Source encoder
    1.2.6 Classes of codes
    1.2.7 Kraft inequality
    1.2.8 Bounds on optimal codelengths
    1.2.9 Huffman code
  1.3 Joint and conditional entropy. Mutual information
  1.4 Discrete sources with memory
    1.4.1 Markov chains
    1.4.2 Entropy rate and conditional entropy rate
    1.4.3 Coding stationary sources
    1.4.4 Optimal coding of stationary Markovian sources
  1.5 Channel coding
    1.5.1 Model of a discrete memoryless channel (DMC)
    1.5.2 Channel coding theorem
  1.6 Additive Gaussian channel
    1.6.1 Capacity of an amplitude-continuous time-continuous channel
    1.6.2 Sphere packing for the Gaussian Channel
    1.6.3 Capacity of the bandlimited Gaussian channel

1.1 Introduction

This chapter addresses the following two fundamental questions about a digital communication system, and their answers.

Fundamental questions:
1. What is the minimum number of bits per symbol with which a discrete source may be represented?
2. What is the maximum rate at which information can be transmitted reliably over an unreliable channel?

Answers to the fundamental questions:
1. The answer to question 1 is provided by Shannon's source coding theorem [1], which states that the minimum average number of bits per symbol needed to represent a discrete source is its entropy, a measure of the source complexity.
2. The answer to question 2 is provided by Shannon's channel coding theorems [1] and is linked with the channel capacity, an upper bound on the rate at which information may flow through the channel with an arbitrarily small probability of error.

This chapter focuses on source coding, channel coding, entropy, and channel capacity. These subjects are studied in Information Theory (IT), which, in general terms, is concerned with the transmission, processing, extraction, and utilization of information [2]. The mathematics used herein consists mainly of discrete probability, with a few incursions into continuous random vectors. In spite of the lightness of the mathematical tools, the key ideas in source and channel coding, put forth by Shannon in 1948, are profound and have underlain a scientific revolution in digital communication systems (and in other areas) that is still alive today. The material presented here is mostly based on the books [3, 2, 4]; for detailed proofs, the reader is referred to those books. Although the emphasis is on insight and intuition, a number of proofs and showcases of core concepts and algorithms are provided.

The chapter is organized as follows. Source coding is addressed in Section 1.2, and channel coding is addressed in Sections 1.5 (discrete channel) and 1.6 (Gaussian channel).

1.2 Source Coding

Figure 1.1: Block diagram of a digital communication system. The source encoder and decoder are highlighted.

Figure 1.1 shows the block diagram of a digital communication system. The source produces a sequence of symbols $x_t \in \mathcal{X}$ at a rate $1/T$ symbols per second from an alphabet of cardinality $|\mathcal{X}| = M$. The source encoder reads source symbols and produces variable-length codewords $c_t \in \mathcal{D}^*$, where $\mathcal{D}^*$ is the set of finite sequences of symbols from the alphabet $\mathcal{D} = \{0, 1, \dots, D-1\}$. For example, if $\mathcal{D} = \{0,1\}$ or $\mathcal{D} = \{0,1,2\}$, we say that the respective codewords are binary or ternary. In order to generate efficient codes, in terms of the average number of code symbols, the source encoder assigns short codewords to the symbols $x_t \in \mathcal{X}$ with high probability and long codewords to those with low probability. In the receiver, assuming that the transmission link between the channel encoder and the channel decoder is reliable, i.e., $\hat{c}_t = c_t$, the encoding process is reversed by the source decoder.

This section addresses fundamental aspects of source encoding and decoding, which are intimately linked with the first fundamental question and answer stated in the Introduction, where the notion of entropy, as a measure of the complexity of the source, takes center stage. From a practical and algorithmic point of view, the research efforts to design efficient source encoders led to, among many others, the Huffman codes, the Shannon-Fano code, the Elias code, and the Lempel-Ziv codes.

1.2.1 Discrete stationary sources

Figure 1.2: Discrete source transmitting symbols from the alphabet $\mathcal{X}$ at the rate $r = 1/T$ symbols/second.

Figure 1.2 shows a discrete source, which transmits a symbol from the alphabet $\mathcal{X}$, with $M$ symbols (i.e., $|\mathcal{X}| = M$), every $T$ seconds. Very often we assume, without loss of generality, that $\mathcal{X} = \{1, 2, \dots, M\}$. The sequence of symbols at the output of the source, denoted by $(x_t)_{t=-\infty}^{\infty}$, $x_t \in \mathcal{X}$, is assumed to be a sample of the discrete stochastic process $(X_t)_{t=-\infty}^{\infty}$ with alphabet $\mathcal{X}$. The sources considered in this chapter are stationary; that is, the joint statistics of any finite subset of RVs from $(X_t)_{t=-\infty}^{\infty}$ depend only on the relative time differences between the RVs:

Definition 1.1. (Stationary stochastic process) The stochastic process $(X_t)_{t=-\infty}^{\infty}$ is stationary if, given any set of $K$ time instants $\{t_1, \dots, t_K\}$ and any integer $k$, we have
$$P(X_{t_1} = x_1, \dots, X_{t_K} = x_K) = P(X_{t_1+k} = x_1, \dots, X_{t_K+k} = x_K)$$
for any sequence $(x_1, \dots, x_K) \in \mathcal{X}^K$.

We conclude, therefore, that in a stationary source the joint probabilities do not depend on a time shift. For example, $P(X_t = x_t) = P(X_1 = x_t)$ for any $t \in \mathbb{Z}$ and $x_t \in \mathcal{X}$, and $P(X_{t_1} = x_{t_1}, X_{t_2} = x_{t_2}) = P(X_{t_1 - t_2} = x_{t_1}, X_0 = x_{t_2})$ for any $t_1, t_2 \in \mathbb{Z}$ and $(x_{t_1}, x_{t_2}) \in \mathcal{X}^2$.

The assumption of stationarity, or quasi-stationarity (defined as stationarity only for time shifts $|\tau| \leq \tau_{\max}$), holds with good approximation for many real-world sources.

1.2.2 Discrete memoryless sources (DMS)

The discrete memoryless sources (DMS) are a subset of the stationary sources in which the sequence $(X_t)_{t=1}^{\infty}$ is statistically independent. Therefore, we have
$$P(X_{t_1} = x_1, X_{t_2} = x_2, \dots, X_{t_K} = x_K) = P(X_{t_1} = x_1) P(X_{t_2} = x_2) \cdots P(X_{t_K} = x_K),$$
for any finite set of naturals $t_1, t_2, \dots, t_K$ and $(x_1, \dots, x_K) \in \mathcal{X}^K$. Given that $(X_t)_{t=1}^{\infty}$ is stationary, the RVs $X_t$ for $t \in \mathbb{N}$ are identically distributed and, hence, we often use $X$ to denote $X_t$ for a given $t \in \mathbb{N}$. The notations
$$p(x) := P(X = x), \quad x \in \mathcal{X} \quad (1.1)$$
$$p_i := P(X = i), \quad i \in \mathcal{X} \quad (1.2)$$
$$\mathbf{p} := (p_1, \dots, p_M) \quad (1.3)$$
are often used.

Information content of a source

The information content of a source is a central concept in Information Theory. Intuitively, the notion of information should be close to that of uncertainty or of surprise:

1. Information ~ uncertainty ~ surprise
2. An event with probability 1 carries no surprise
3. An event with very low probability carries a lot of surprise

The notion of the information associated with a symbol $x \in \mathcal{X}$ is formalized via the function $I : \mathcal{X} \to \mathbb{R}$ defined as
$$I(x) := \log\left(\frac{1}{p(x)}\right).$$
We highlight the following properties of the function $I$, compatible with the intuitive notion of information associated with the source symbols $x, y \in \mathcal{X}$ at symbol intervals $t_1, t_2 \in \mathbb{Z}$ with $t_1 \neq t_2$:

Properties of the information $I$:

1. $I(x) = 0$ if and only if (iff) $p(x) = 1$
2. $I(x) \geq 0$ for $0 \leq p(x) \leq 1$
3. $I(x) > I(y)$ if $p(x) < p(y)$
4. $I(x) + I(y) = \log\left(\dfrac{1}{p(x)p(y)}\right) = \log\left(\dfrac{1}{P(X_{t_1} = x, X_{t_2} = y)}\right)$

Remarks:

1. Property 4 results from the fact that the RVs $X_{t_1}$ and $X_{t_2}$ are statistically independent.
2. The unit of information is arbitrary. If $\log_2$ is used, the unit of information is called the bit (binary digit). 1 bit is the information associated with each symbol of a binary source with equally likely symbols: $M = 2$, $p_1 = p_2 = \tfrac{1}{2}$, $I(1) = I(2) = \log_2 2 = 1$ bit.
3. By default, $\log$ denotes the logarithm of base 2.
4. When referring to the entropy of a source, the entropy unit used is bits/symbol.

1.2.3 Entropy

Based on the information $I$, we now introduce the most important entity of source coding: the entropy.

Definition 1.2. Entropy of the RV $X$:
$$H(X) := E[I(X)] = \sum_{x \in \mathcal{X}} p(x)\, I(x) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \quad \text{bits}$$

The entropy is therefore the mean value of the information. As we will see in Section 1.2.8, the entropy of the RV $X$ is also a measure of the complexity of $X$, interpretable as the minimum number of bits needed to represent the samples of $X$.

Remarks on the definition of the entropy:

1. $H(X)$ represents the mean information gain per symbol.
2. $0 \log 0 := 0$ (by continuous extension of $p \log p$).
3. In addition to $H(X)$, the notations $H(\mathcal{X})$, $H(p_1, p_2, \dots, p_M)$, and $H(\mathbf{p})$ are also used to denote entropy.

Example 1.1. Consider a DMS with alphabet $\mathcal{X} = \{1, 2, 3, 4\}$ and $p_1 = 1/2$, $p_2 = 1/4$, $p_3 = 1/8$, $p_4 = 1/8$. The entropy of this source is
$$H(X) = \tfrac{1}{2}\log 2 + \tfrac{1}{4}\log 4 + \tfrac{1}{8}\log 8 + \tfrac{1}{8}\log 8 = 1.75 \text{ bit/symbol}$$

Example 1.2. Consider a DMS with alphabet $\mathcal{X} = \{1, \dots, M\}$ with equally likely symbols, i.e., $p_i = 1/M$ for $i = 1, \dots, M$. The entropy of this source is
$$H(X) = M \cdot \tfrac{1}{M} \log M = \log M \text{ bit/symbol}$$

Example 1.3. Consider a binary source with alphabet $\mathcal{X} = \{1, 2\}$ and probabilities $p_X(1) = p$ and $p_X(2) = 1 - p$. Then $H(X) = -p \log p - (1-p)\log_2(1-p)$. Because the entropy of a binary source is widely used in IT, there is a special function $H : [0,1] \to [0,1]$ to represent it:
$$H(p) := -p \log p - (1-p)\log_2(1-p), \quad (1.4)$$
where $p_X(1) = p$. Fig. 1.3 plots the graph of $H$.

Exercise 1.1. Prove that the function $H$ defined in (1.4) is symmetric about $1/2$, concave, and takes the maximum value $H(1/2) = 1$ bit. Interpret the fact that $H$ is maximum for equally likely symbols.
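As a quick numerical check of Examples 1.1-1.3, the following short Python sketch computes the entropy of a pmf and the binary entropy function of (1.4); the function names are illustrative, not part of the notes.

import numpy as np

def entropy(p, base=2):
    """Entropy of a pmf (base 2 by default); 0*log(0) is taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

def binary_entropy(p):
    """Binary entropy function H(p) of Eq. (1.4)."""
    return entropy([p, 1.0 - p])

print(entropy([1/2, 1/4, 1/8, 1/8]))   # Example 1.1: 1.75 bits/symbol
print(entropy([1/4] * 4))              # Example 1.2: log2(4) = 2 bits/symbol
print(binary_entropy(0.5))             # maximum of H: 1 bit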

Figure 1.3: The binary entropy function $H$ of a binary source with $p_X(1) = p$.

Properties of entropy:

1. $H(X) = 0$ iff $p(x) = 1$ for some $x \in \mathcal{X}$
2. $0 \leq H(X) \leq \log M$
3. $H(X) = \log M$ iff $p(x) = 1/M$ for all $x \in \mathcal{X}$ (equally likely symbols, maximum uncertainty)
4. Change of base: let
$$H_b(X) := -\sum_{x \in \mathcal{X}} p(x) \log_b p(x);$$
then $H_b(X) = \log_b(a)\, H_a(X)$
5. $H(p_1, \dots, p_M)$ is concave
6. Let $M \geq 3$ and $\mathcal{X} = \mathcal{X}_A \cup \mathcal{X}_B$ with $\mathcal{X}_A = \{x_1, \dots, x_a\}$ and $\mathcal{X}_B = \{x_{a+1}, \dots, x_M\}$. Define $p_A = p_1 + \dots + p_a$ and $p_B = p_{a+1} + \dots + p_M$. Then
$$H(p_1, \dots, p_M) = H(p_A, p_B) + p_A H\!\left(\frac{p_1}{p_A}, \dots, \frac{p_a}{p_A}\right) + p_B H\!\left(\frac{p_{a+1}}{p_B}, \dots, \frac{p_M}{p_B}\right) \quad (1.5)$$

Property 1: The sufficient condition is immediate and the necessary condition results from the fact that
$$(p \log p = 0) \Rightarrow p \in \{0, 1\}.$$

Properties 2 and 3: $H(X) \geq 0$ is implied by $-p \log p \geq 0$ for $p \in [0,1]$. The proof of the inequality $H(X) \leq \log M$, with $H(X) = \log M$ iff $p_X(x) = 1/M$ for all $x \in \mathcal{X}$, uses the information inequality below (Theorem 1.1).

Theorem 1.1. (Information inequality) Given the probability distributions $p_1, \dots, p_M$ and $q_1, \dots, q_M$,
$$\sum_{i=1}^{M} p_i \log \frac{p_i}{q_i} \geq 0,$$
with equality iff $p_i = q_i$ for $i = 1, \dots, M$.

Remark: we are using the conventions, justified by continuity extension, that $0 \log 0 = 0$ and $p \log \frac{p}{0} = \infty$.

Proof: Taking into account the above remark, we only need to consider the case $p_i, q_i > 0$. Noting that $-\log$ is a strictly convex function, Jensen's inequality (1) implies that
$$\sum_{i=1}^{M} p_i \log \frac{p_i}{q_i} = \sum_{i=1}^{M} p_i \left(-\log \frac{q_i}{p_i}\right) \geq -\log\left(\sum_{i=1}^{M} p_i \frac{q_i}{p_i}\right) = -\log(1) = 0.$$
Since $-\log$ is strictly convex, $\sum_{i=1}^{M} p_i \log \frac{p_i}{q_i} = 0$ iff $\frac{q_i}{p_i} = c$, which implies that $p_i = q_i$, as both are distributions.

Getting back to the proof of Properties 2 and 3, by setting $q_i = 1/M$, for $i = 1, \dots, M$, we have
$$\sum_{i=1}^{M} p_i \log \frac{p_i}{1/M} \geq 0 \quad (1.6)$$
$$-\sum_{i=1}^{M} p_i \log p_i \leq \sum_{i=1}^{M} p_i \log M \quad (1.7)$$
$$H(X) \leq \log M. \quad (1.8)$$
The equality in (1.6) holds iff $p_i = 1/M$ for $i = 1, \dots, M$.

Property 4: Results from $\log_b(x) = \log_b(a) \log_a(x)$.

(1) Given a convex function $\varphi$ and a RV $X$, Jensen's inequality [5] states that $E[\varphi(X)] \geq \varphi(E[X])$. If $\varphi$ is strictly convex, the equality implies that $X = E[X]$.

Property 5: We start by remarking that the term $-p\log p$, for $p \geq 0$, is concave (why?). Since $H(p_1, \dots, p_M)$ is a sum of concave terms defined on the probability simplex (i.e., $\{(p_1, \dots, p_M) : p_i \geq 0, \sum_{i=1}^M p_i = 1\}$), which is a convex set, $H$ is concave.

Property 6: We first provide an interpretation for this property. The usual way to think about the symbol transmitted by the source in a given symbol interval is to consider that it is a sample from a random variable with probability mass function (pmf) $p_X$. We may imagine different, but equivalent, sampling strategies to generate the symbols:

Direct: sample $X$ from $\mathcal{X}$.

Hierarchic: Let $Z$ denote a RV with values in $\{1, 2\}$, with $P(Z = 1) = p_A$ and $P(Z = 2) = p_B$, i.e., the probability masses of the symbols in $\mathcal{X}_A$ and $\mathcal{X}_B$, respectively. To generate a symbol, the source first samples from $Z$ to select the alphabet: $\mathcal{X}_A$ if $Z = 1$ or $\mathcal{X}_B$ if $Z = 2$. Then, in a second step, the source samples a symbol from the selected alphabet with conditional probability $P(X = x \mid Z = 1)$ if $Z = 1$ and $P(X = x \mid Z = 2)$ if $Z = 2$.

The expression on the right hand side of (1.5) is exactly the entropy for the hierarchic sampling strategy: the term $H(p_A, p_B)$ is the entropy of the selector, and the remaining two terms are the mean entropy associated with the conditional pmfs $P(X = x \mid Z = 1)$ and $P(X = x \mid Z = 2)$, termed the conditional entropy of $X$ given $Z$. This and other extensions of the entropy are addressed in Section 1.3.

To prove Property 6, we arrange the terms of $H(X)$ in two groups corresponding to the alphabets $\mathcal{X}_A$ and $\mathcal{X}_B$:
$$\begin{aligned}
H(X) &= -\sum_{i=1}^{a} p_i \log p_i - \sum_{i=a+1}^{M} p_i \log p_i \\
&= -\sum_{i=1}^{a} p_i \log\!\left(\frac{p_i}{p_A}\, p_A\right) - \sum_{i=a+1}^{M} p_i \log\!\left(\frac{p_i}{p_B}\, p_B\right) \\
&= -p_A \log p_A - p_B \log p_B - \sum_{i=1}^{a} p_i \log \frac{p_i}{p_A} - \sum_{i=a+1}^{M} p_i \log \frac{p_i}{p_B} \\
&= H(p_A, p_B) + p_A H\!\left(\frac{p_1}{p_A}, \dots, \frac{p_a}{p_A}\right) + p_B H\!\left(\frac{p_{a+1}}{p_B}, \dots, \frac{p_M}{p_B}\right)
\end{aligned}$$

1.2.4 Source extension

Figure 1.4: Source extension of order $n$. Sequences of $n$ symbols from the alphabet $\mathcal{X}$ are grouped into an extended symbol of the alphabet $\mathcal{X}^n$.

Figure 1.4 illustrates the concept of source extension of order $n$: it consists in grouping $n$ consecutive symbols of the original source to form extended symbols defined in the alphabet $\mathcal{X}^n$. The notation $X^n := (X_1, X_2, \dots, X_n)$ and $x^n \in \mathcal{X}^n$ is used to refer to, respectively, a sequence of $n$ independent random variables distributed as $X$ and a symbol of the extended source.

Example 1.4. Let $\mathcal{X} = \{0, 1\}$; then
$$\mathcal{X}^2 = \{00, 01, 10, 11\}$$
$$\mathcal{X}^3 = \{000, 001, 010, 011, 100, 101, 110, 111\}.$$

Probability of the extended symbols: Let $x^n := (x_1, x_2, \dots, x_n) \in \mathcal{X}^n$ be a symbol of the extended source. Since the source is a DMS, it follows that
$$P(X^n = x^n) = P(X_1 = x_1) P(X_2 = x_2) \cdots P(X_n = x_n) = p(x_1)\, p(x_2) \cdots p(x_n).$$

Entropy of an extended DMS: Let $H(X)$ be the entropy of the original source. The entropy of the extended source is
$$H(X^n) = E[I(X_1, \dots, X_n)] = \sum_{i=1}^{n} E[I(X_i)] = n H(X).$$

This result has a straightforward interpretation: the information of the extended source is $n$ times the information of the original source. We remark that this result is valid only for memoryless sources. If the source has memory, it holds that $H(X^n) \leq n H(X)$.

1.2.5 Source encoder

Figure 1.5: Block diagram of the source and the source encoder.

Figure 1.5 shows the block diagram of the source and of the source encoder. The source produces a sequence of symbols from the alphabet $\mathcal{X}$ at a rate $r = 1/T$ symbols per second. The source encoder produces another sequence by converting each source symbol $x_t \in \mathcal{X}$ into the codeword $C(x_t)$, where $C : \mathcal{X} \to \mathcal{D}^*$ is a mapping from the alphabet $\mathcal{X}$ into $\mathcal{D}^*$, the set of finite-length sequences of symbols from $\mathcal{D} = \{0, 1, \dots, D-1\}$. The length, in symbols from $\mathcal{D}$, of the codeword $C(x)$ is denoted by $l(x)$ or $l_i$. The set of $M$ codewords, denoted by $\mathcal{C}$, is termed the codebook.

Definition 1.3. The average codeword length (codelength for short) is defined as
$$L(C) := E[l(X)] = \sum_{x \in \mathcal{X}} p(x)\, l(x) \quad \text{symbols from the alphabet } \mathcal{D}.$$

When $D = 2$, the codewords are sequences of binary symbols and the unit of $L$ is the bit. We remark, however, that although the units of $H$ and of $L$ are the same, they represent different objects: the former denotes information and the latter binary digits. Some authors use binits to denote binary digits. As seen later in this section, the entropy is the lower bound of the codelength; from this point of view there is some ground to use the same unit for both entities.

Central idea of source coding: minimize the codelength $L$ by assigning the codewords with smaller length to the more likely symbols. This rationale is illustrated in Example 1.5.

Example 1.5. The table below shows the alphabet $\mathcal{X} = \{1, 2, 3, 4\}$, the respective probabilities, and two sets of binary codewords (i.e., $\mathcal{D} = \{0, 1\}$), jointly with the respective lengths in bits.

x    p_X(x)   C_A(x)   l_A(x)   C_B(x)   l_B(x)
1    1/2      00       2        0        1
2    1/4      01       2        10       2
3    1/8      10       2        110      3
4    1/8      11       2        111      3

H(X) = 7/4 bit, L(C_A) = 2 bit, L(C_B) = 7/4 bit

The codewords of code A have codelength $L(C_A) = 2$ bits. By assigning shorter codewords to symbols with higher probability, code B achieves $L(C_B) = 7/4$ bits and is thus more efficient than code A. We remark that $H(X) = L(C_B)$. As shown later in this section, the entropy is in fact the lower bound of the codelength of any useful code.

Example 1.6. Let $\mathcal{X} = \{1, 2, 3, 4, 5\}$ and $\mathcal{D} = \{0, 1, 2\}$ (ternary codes). A table, analogous to the one above, compares two sources $X_C$ and $X_D$ with their respective ternary codes $C_C$ and $C_D$ and codeword lengths $l_C$ and $l_D$; the entropies and codelengths satisfy $H_3(X_C) < L(C_C) = 1.6$ trits and $H_3(X_D) = L(C_D)$. As we will see in Section 1.2.9, both codes $C_C$ and $C_D$ are Huffman codes and thus optimal. Notice that $L(C_C) > H_3(X_C)$ while $L(C_D) = H_3(X_D)$. Do you see any special pattern in the pmf of $X_D$?

Example 1.7. If $l_i = n$ for all $i$ (fixed-length code), then $L = n$.

Example 1.8. If $p_i = D^{-l_i}$ with $l_i \in \{1, 2, \dots\}$, then
$$L = \sum_{i=1}^{M} p_i l_i = -\sum_{i=1}^{M} p_i \log_D p_i = H_D(X).$$
We conclude, therefore, that if the probabilities $p_i$, for $i \in \mathcal{X}$, are D-adic, i.e., powers of $D^{-1}$, then taking $l_i = -\log_D p_i$ yields $L = H_D(X)$.
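To make the comparison in Example 1.5 concrete, here is a small Python sketch (the code-B codewords follow the table above) that computes the average codelength of both codes and compares them with the entropy.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def avg_codelength(p, codewords):
    """Average codeword length L(C) = sum_x p(x) * len(C(x))."""
    return sum(px * len(cw) for px, cw in zip(p, codewords))

p = [1/2, 1/4, 1/8, 1/8]
code_A = ["00", "01", "10", "11"]     # fixed-length code
code_B = ["0", "10", "110", "111"]    # shorter words for likelier symbols

print(entropy(p))                 # 1.75 bits/symbol
print(avg_codelength(p, code_A))  # 2.0 bits
print(avg_codelength(p, code_B))  # 1.75 bits = H(X)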

1.2.6 Classes of codes

Any useful code is decodable in the sense that the sequence produced by the source, $(x_t)_{t=1}^{\infty}$, is recoverable from the sequence of codewords $(C(x_t))_{t=1}^{\infty}$. Below, we present four classes of codes with increasing requirements regarding decodability.

Definition 1.4. (Non-singular code) Given two symbols $x_i, x_j \in \mathcal{X}$,
$$x_i \neq x_j \Rightarrow C(x_i) \neq C(x_j). \quad (1.9)$$

Definition 1.5. (Extension of a code) Let $\mathcal{X}^*$ be the set of finite sequences of symbols from $\mathcal{X}$. The extension $C^*$ of a code $C$ is the mapping $C^* : \mathcal{X}^* \to \mathcal{D}^*$ defined by
$$C^*(x_{i_1} x_{i_2} \cdots x_{i_n}) = C(x_{i_1}) C(x_{i_2}) \cdots C(x_{i_n}),$$
for any $n$ and any $x_{i_1} x_{i_2} \cdots x_{i_n} \in \mathcal{X}^*$.

Definition 1.6. (Uniquely decodable code) A code is uniquely decodable if its extension is non-singular; that is, given two sequences of symbols $x_{i_1} x_{i_2} \cdots x_{i_n}$ and $y_{i_1} y_{i_2} \cdots y_{i_m}$ with $x_{i_1} x_{i_2} \cdots x_{i_n} \neq y_{i_1} y_{i_2} \cdots y_{i_m}$, we have
$$C(x_{i_1}) C(x_{i_2}) \cdots C(x_{i_n}) \neq C(y_{i_1}) C(y_{i_2}) \cdots C(y_{i_m}). \quad (1.10)$$

Definition 1.7. (Prefix code or instantaneous code) A code is called a prefix code, or an instantaneous code, if no codeword is a prefix of any other codeword.

Therefore, prefix codes may be represented by trees: each codeword corresponds to a unique path from the root of the tree to a leaf. Figure 1.6 shows, on the left hand side, the binary tree corresponding to code $C_A$ of Example 1.5 and, on the right hand side, the ternary tree corresponding to code $C_D$ of Example 1.6.

Table 1.1 shows examples of four codes: the singular code does not satisfy (1.9); the non-singular but non-uniquely decodable code satisfies (1.9) but does not satisfy (1.10) — for example, the sequences 114 and 12 both yield the codeword sequence 0010; in the uniquely decodable but non-instantaneous code, in order to decode a source symbol, the decoder has to read the first symbol of the next codeword; finally, the last code is instantaneous, since no codeword is a prefix of another codeword. Table 1.2 shows the Venn diagram of the classes of codes: the non-singular codes are a subset of all codes; the uniquely decodable codes are a subset of the non-singular codes; and the instantaneous codes are a subset of the uniquely decodable codes.

Table 1.1: Classes of codes.

Table 1.2: Venn diagram of the classes of codes.

Figure 1.6: Left: binary tree corresponding to code $C_A$ of Example 1.5. Right: ternary tree corresponding to code $C_D$ of Example 1.6.

1.2.7 Kraft inequality

To minimize the codelength, the codewords should be as short as possible. The fact that, in instantaneous codes, no codeword may be a prefix of any other codeword imposes a constraint on the codeword lengths, expressed by the Kraft inequality.

Theorem 1.2. (Kraft inequality) For any instantaneous code (i.e., a prefix code) over an alphabet of size $D$, the respective codeword lengths $l_1, \dots, l_n$ (measured in symbols from the D-ary alphabet) must satisfy the inequality
$$\sum_{i=1}^{n} D^{-l_i} \leq 1.$$
Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.

Proof (Kraft inequality): Assume, without loss of generality, that $l_1 \leq l_2 \leq \dots \leq l_n = l_{\max}$. The number of leaves of a D-ary tree of depth $l_{\max}$ is $D^{l_{\max}}$. The number of leaves at depth $l_{\max}$ that are descendants of a node at depth $l_i$ is $D^{l_{\max} - l_i}$, and, by the prefix condition, these sets of descendants are disjoint. Therefore, we must have
$$\sum_{i=1}^{n} D^{l_{\max} - l_i} \leq D^{l_{\max}} \iff \sum_{i=1}^{n} D^{-l_i} \leq 1.$$

Figure 1.7: Illustration of the Kraft inequality for a binary tree.

Figure 1.7 illustrates the Kraft inequality for a binary tree and word lengths $l_1 = 1$, $l_2 = 2$, $l_3 = 3$, $l_4 = 3$. Notice that
$$8 = 2^{l_{\max}} \geq 2^{l_{\max}-l_1} + 2^{l_{\max}-l_2} + 2^{l_{\max}-l_3} + 2^{l_{\max}-l_4} = 4 + 2 + 1 + 1 = 8.$$

Proof (converse of the Kraft inequality): We prove it by construction. Let $n_1, n_2, n_3, \dots, n_{l_{\max}}$ denote the number of codewords with lengths, respectively, $1, 2, 3, \dots, l_{\max}$. Consider a D-ary tree with $n_i$ leaves at level $i$. In order to build this tree, we must have:
$$\text{level } 1:\quad n_1 \leq D \iff n_1 D^{-1} \leq 1$$
$$\text{level } 2:\quad n_2 \leq (D - n_1) D \iff n_2 D^{-2} + n_1 D^{-1} \leq 1$$
$$\vdots$$
$$\text{level } l_{\max}:\quad n_{l_{\max}} \leq \left(D^{l_{\max}-1} - n_1 D^{l_{\max}-2} - \dots - n_{l_{\max}-1}\right) D \iff \sum_{j=1}^{l_{\max}} n_j D^{-j} \leq 1 \quad (1.11)$$
Since the inequalities on the right hand side of (1.11) are all implied by the Kraft inequality, the proof is complete.

Exercise 1.2. Prove that if the Kraft inequality is satisfied with equality, then all tree nodes, apart from the leaves, have $D$ children.

Remark 1.1. The Kraft inequality stated in Theorem 1.2 considers finite instantaneous codebooks. The inequality and the converse result are, however, valid for larger classes of codes:

R1) Extended Kraft inequality: applies to infinite prefix codes (see [2]).
R2) McMillan inequality: applies to uniquely decodable codes (see [2]).

The Kraft and McMillan inequalities are often collectively termed the Kraft-McMillan inequality.
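A minimal Python sketch of the Kraft test, assuming only the statement of Theorem 1.2 above (the helper names are illustrative):

def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft inequality, sum_i D^(-l_i)."""
    return sum(D ** (-l) for l in lengths)

def satisfies_kraft(lengths, D=2):
    """True iff an instantaneous D-ary code with these lengths can exist."""
    return kraft_sum(lengths, D) <= 1.0

print(kraft_sum([1, 2, 3, 3]))          # 1.0 -> the complete binary tree of Figure 1.7
print(satisfies_kraft([1, 1, 2]))       # False -> no binary prefix code with these lengths
print(satisfies_kraft([1, 1, 1], D=3))  # True -> the ternary code {0, 1, 2}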

1.2.8 Bounds on optimal codelengths

The Kraft-McMillan inequality imposes constraints on the codeword lengths which imply that the codelength of a code for a DMS cannot be smaller than the entropy.

Theorem 1.3. Any instantaneous code $C$ satisfies $L(C) \geq H_D(X)$. The equality holds iff the probabilities $p_i$, for $i = 1, \dots, M$, are D-adic, i.e., powers of $D^{-1}$.

Proof: Let $q_i := D^{-l_i}/\xi$, for $i = 1, \dots, M$, where $\xi := \sum_{i=1}^{M} D^{-l_i}$. We may then write the information inequality as
$$\sum_{i=1}^{M} p_i \log_D \frac{p_i}{q_i} = \sum_{i=1}^{M} p_i \log_D \frac{p_i}{D^{-l_i}/\xi} = -H_D(X) + L(C) + \log_D \xi \geq 0,$$
and therefore
$$H_D(X) \leq L(C) + \log_D \xi \leq L(C), \quad (1.12)$$
where (1.12) results from $\xi \leq 1$. The equality is satisfied iff $\xi = 1$ and $p_i = D^{-l_i}$.

Theorem 1.3 provides a lower bound for the codelength of any instantaneous code, which is achieved iff the source pmf is D-adic. When the probabilities are not D-adic, we face the integer optimization problem
$$\min_{l_1, \dots, l_M} \sum_{i=1}^{M} p_i l_i \quad \text{subject to:} \quad \sum_{i=1}^{M} D^{-l_i} \leq 1. \quad (1.13)$$
The Huffman algorithm, presented in Section 1.2.9, yields a solution to (1.13). The Shannon code, presented below, although yielding a suboptimal solution, is instrumental in the main result of this section: the Source Coding Theorem.

Definition 1.8. (Shannon code) Consider an instantaneous code with codeword lengths $l_i := \lceil \log_D(1/p_i) \rceil$, that is,
$$\log_D \frac{1}{p_i} \leq l_i < \log_D \frac{1}{p_i} + 1, \quad i = 1, \dots, M.$$
The inequality on the left hand side implies that $p_i \geq D^{-l_i}$ and thus $\sum_{i=1}^{M} D^{-l_i} \leq 1$. Therefore (see the Kraft inequality), such an instantaneous code exists.

Codelength of the Shannon code: multiplying the above inequality by $p_i$ and summing over $i = 1, \dots, M$, we obtain
$$H_D(X) \leq L < H_D(X) + 1.$$
The above pair of inequalities is valid for any discrete memoryless source, namely for its extension of order $n$. Therefore, we may write
$$H_D(X_1, \dots, X_n) \leq L_n < H_D(X_1, \dots, X_n) + 1$$
and
$$H_D(X) \leq \frac{L_n}{n} < H_D(X) + \frac{1}{n},$$
where $L_n$ represents the codelength of the extended source and we used the result $H_D(X_1, \dots, X_n) = n H_D(X)$, valid for DMS sources. The above inequality shows that it is possible to code the source with a codelength per symbol arbitrarily close to the source entropy. This result, jointly with the fact that $H(X)$ is the lower bound for the codelength of any instantaneous code, is essentially the first Shannon Theorem:

Theorem 1.4. (Shannon Source Coding Theorem) A discrete memoryless source with entropy $H$ bits/symbol may be encoded and decoded using instantaneous codes with average codeword length $L = H + \varepsilon$ binits per symbol, where $\varepsilon > 0$ is arbitrarily small. It is impossible to encode this source with instantaneous codes for $L < H$.

Definition 1.9. The code efficiency is defined as
$$\eta = \frac{H}{L}.$$

Example 1.9. (Shannon code for successive source extensions) Consider a binary source $X$ with alphabet $\mathcal{X} = \{1, 2\}$ and probabilities $\mathbf{p} = (0.2, 0.8)$. The corresponding entropy is $H(X) = 0.7219$ bits/symbol. Figure 1.8, left, shows a table with the probabilities and the Shannon codeword lengths of $X^3$ (the 3rd order extension of the original source). The efficiency of this code is $\eta = 3H(X)/L_3 \approx 0.98$. Figure 1.8, right, shows the evolution of $L_n/n$ and of $L_n - nH(X)$. As expected, $L_n/n$ approaches the entropy from above. We remark, however, that the evolution of $L_n/n$ is not monotonic.
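A hedged Python sketch of Example 1.9, assuming the pmf $(0.2, 0.8)$ stated above: it builds the order-$n$ extension, assigns Shannon codeword lengths, and prints $L_n/n$ and the efficiency (the non-monotonic behaviour mentioned above is visible already for small $n$).

import math
from itertools import product

def shannon_lengths(pmf):
    """Shannon codeword lengths l_i = ceil(log2(1/p_i)) for a binary code alphabet."""
    return [math.ceil(math.log2(1.0 / p)) for p in pmf]

def extension_pmf(pmf, n):
    """pmf of the order-n extension of a DMS (products of symbol probabilities)."""
    return [math.prod(seq) for seq in product(pmf, repeat=n)]

p = [0.2, 0.8]
H = -sum(pi * math.log2(pi) for pi in p)       # ~0.7219 bits/symbol
for n in (1, 2, 3, 4):
    pn = extension_pmf(p, n)
    Ln = sum(pi * li for pi, li in zip(pn, shannon_lengths(pn)))
    print(n, Ln / n, n * H / Ln)               # L_n/n approaches H from above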

23 Contents 22 Figure 1.8: Left: probabilities and Shannon code word lengths of X 3 (3rd order extension of the original source). Right: evolution of L n /n and of L n nh(x). Remark 1.2. (Instantaneous versus uniquely decodable non-instantaneous codes) Source coding uses almost exclusively instantaneous codes. A natural question is whether there is any advantage in using uniquely decodable non-instantaneous codes instead of instantaneous ones, for example in terms of codelength. The answer in negative because any uniquely decodable code satisfies the Kraft inequality (see comment R2 in Remark 1.1) and, therefore, we may construct instantaneous codes with the same code word lengths of the, respective, non-instantaneous ones (Theorem 1.2). Therefore, the uniquely decodable non-instantaneous codes may be replaced with instantaneous ones yielding the same codelengths. Given that instantaneous codes are much simpler from the computational, algorithmic, and hardware implementation points of view, they are by large the preferred choice Huffman code For a given DMS, the Huffman code is optimal in the sense that there is no other instantaneous code with smaller codelength. The Huffman iteratively grows a D-tree from the leaves to the root as follows:

Huffman algorithm:

1. Order the alphabet symbols by nonincreasing probability.
2. Merge the $D$ symbols with the smallest probabilities, thus obtaining a new tree node and a new alphabet with $D-1$ fewer symbols. The probability of the new symbol is the sum of the probabilities of the merged symbols.
3. If the alphabet contains more than one symbol, go to step 1.
4. For each node, apart from the leaves, assign the symbols $0, 1, \dots, D-1$ to the edges departing from that node. The order of the assignment is irrelevant.
5. The codeword of a given symbol is obtained by reading the code symbols along the path that goes from the root to the leaf corresponding to that symbol.

Example 1.10. (Huffman algorithm) Figure 1.9 shows the tree generated by the Huffman algorithm for a DMS with $M = 8$ symbols. The algorithm runs in 7 iterations.

Figure 1.9: Tree generated by the Huffman algorithm.

Table 1.3 shows the retrieved codewords and the respective codeword lengths, both for the Huffman and the Shannon codes. The codelength of the Shannon code is 2.2 bits/symbol, and that of the Huffman code is smaller. We then have (why?)
$$H < L(\text{Huffman}) < L(\text{Shannon}) < H + 1$$
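A minimal Python sketch of the binary ($D = 2$) case of the algorithm above, using a heap of partial codebooks (for $D > 2$ the alphabet padding discussed in Remark 1.3 below would also be needed); function and variable names are illustrative.

import heapq
from itertools import count

def huffman_code(pmf):
    """Binary Huffman code: returns {symbol: codeword} for pmf {symbol: prob}."""
    tiebreak = count()  # avoids comparing dicts when probabilities are equal
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)  # merge the two least likely "symbols"
        p1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

pmf = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}     # pmf of Example 1.5
code = huffman_code(pmf)
L = sum(pmf[s] * len(cw) for s, cw in code.items())
print(code, L)                                   # L = 1.75 bits/symbol = H(X)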

Table 1.3: Retrieved Huffman code and codeword lengths of the Huffman and Shannon codes.

The efficiency of the Huffman code is $\eta = H / L(\text{Huffman})$.

Exercise 1.3. Prove that the codelength of a Huffman code is given by
$$L = \sum_{i=1}^{n_{\text{iters}}} P(\text{merged symbols}(i)),$$
where $n_{\text{iters}}$ is the number of iterations of the Huffman algorithm and $P(\text{merged symbols}(i))$ is the probability of the symbols merged at the $i$-th iteration. For example, with reference to Fig. 1.9, $L$ is obtained by summing the probabilities of the nodes created at each of the 7 iterations.

Remark 1.3. (Huffman codes with $D > 2$) The Huffman algorithm is optimal provided that $D$ symbols are merged in the last iteration. This condition translates to $M = D + (D-1)(k-1)$, where $k \in \{1, 2, \dots\}$ denotes the number of iterations. It is always satisfied for $D = 2$, but not necessarily for $D > 2$. When the condition is not satisfied, and in order to have optimality, the alphabet shall be augmented with zero-probability symbols until the condition is satisfied.

Example 1.11. (Huffman algorithm for $D = 3$) Figure 1.10, left, shows an incorrect application of the Huffman algorithm to an alphabet with $M = 4$ symbols: the number of iterations is $k = 2$ and, thus, the condition of Remark 1.3 would require $3 + (3-1)(2-1) = 5$ symbols. The correction, shown in Figure 1.10, right, consists in including a fifth symbol, $x = 5$, with probability zero. Now we have $5 = 3 + (3-1)(2-1) = 5$.

Figure 1.10: Incorrect and correct application of the Huffman algorithm.

We finish this section with a formal statement of the optimality of Huffman codes.

Theorem 1.5. Huffman coding is optimal; that is, if $C^\ast$ is a Huffman code and $C'$ is any other uniquely decodable code, then $L(C^\ast) \leq L(C')$.

See [2] for a proof. Assuming that $p_1 \geq p_2 \geq \dots \geq p_M$, it starts by showing that an optimal code may be found in a set of codes with the following properties:

1. $p_i > p_j \Rightarrow l_i \leq l_j$.
2. The two longest codewords have the same length.
3. The two longest codewords differ only in the last symbol and correspond to the two least likely symbols.

The remaining part of the proof uses induction to show that finding an optimal code for the probabilities $p_1, p_2, \dots, p_{M-2}, p_{M-1}, p_M$ is equivalent to finding an optimal code for the probabilities $p_1, p_2, \dots, p_{M-2}, p_{M-1} + p_M$.

1.3 Joint and conditional entropy. Mutual information

The entropy of a random variable took center stage in the source coding of DMSs. The coding of sources with memory, addressed in Section 1.4, and channel coding, addressed in Section 1.5, call for new IT definitions, concepts, and entities able to capture the statistical links among the random variables of the underlying stochastic processes. In this section, we introduce three such entities: the joint entropy, the conditional entropy, and the mutual information. We adopt a concise style in the presentation, focusing on the fundamental properties and insights. For more details, see [2, Ch. 2]. In the following, the RVs $X$ and $X_1, \dots, X_n$ take values in $\mathcal{X}$, and $Y$ takes values in $\mathcal{Y}$.

Definition 1.10. Joint entropy (2):
$$H(X, Y) := -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y) = -E[\log p(X, Y)]$$

Remark 1.4. The joint entropy of $(X, Y)$ can be looked at as the entropy of a RV taking values in the Cartesian product $\mathcal{X} \times \mathcal{Y}$.

Definition 1.11. Conditional entropy:
$$\begin{aligned}
H(Y \mid X) &:= \sum_{x \in \mathcal{X}} p(x)\, H(Y \mid X = x) \\
&= -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y \mid x) \log p(y \mid x) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x) \\
&= -E[\log p(Y \mid X)]
\end{aligned}$$

(2) We use the same symbol $p$ to denote different pmfs; the meaning of each pmf is inferred from the dummy variables used in the respective arguments. Although this notation is ambiguous, it leads to much simpler and neater formulas, whose benefits for the reader are worth the ambiguity.

28 Contents 27 Remark 1.5. The conditional entropy of H(Y X) can interpreted as the mean uncertainty of Y after observing X. Properties of the conditional entropy: P1) If Y = f(x), f : X Y, then H(Y X) = 0 (why?). However, H(Y X) = 0 does not imply H(X Y ) = 0. Give an example. P2) If X and Y are independent, then H(X Y ) = H(X) (prove). Chain rule: H(X, Y ) = E[log p(x, Y )] = E[log p(x)p(y X)] = E[log p(x)] E[log p(y X)] = H(X) + H(Y X) H(X, Y ) = E[log p(x, Y )] = E[log p(y )p(x Y )] = E[log p(y )] E[log p(x Y )] = H(Y ) + H(X Y ) Independent RVs: If X e Y are independent, then H(X, Y ) = H(X) + H(Y ) (prove and interpret). Exercise 1.4. The RVs X and Y take values in the alphabet {1, 2, 3, 4} and have the following joint distribution: Determine: 1. H(X), H(Y ), H(X Y ). 2. Check that H(X Y ) H(Y X).

29 Contents 28 Figure 1.11: Joint and marginal distributions of RVs X and Y. 3. Check that H(X) H(X Y ) = H(Y ) H(Y X). Justify qualitatively this equality. Definition Joint entropy of n RVs. The chain rule:: Consider n RVs X 1, X 2,..., X n X n. Their joint entropy is defined as H(X 1, X 2,..., X n ) := x 1,...,x n X n p(x 1,..., x n ) log p(x 1,..., x n ) = E[log p(x 1, X 2,..., X n )] Using the chain rule successively yields p(x 1, x 2..., x n ) = p(x 1 x 2..., x n )p(x 2 x 3..., x n )... p(x n 1 x n )p(x n ) and thus H(X 1, X 2,..., X n ) = H(X 1 X 2,..., X n ) + H(X 2 X 3,..., X n ) +... H(X n 1 X n ) + H(X n ) Independent RVs: If X 1, X 2,..., X n X n are independente, then (prove) H(X 1, X 2,..., X n ) = n H(X i ). Definition (Mutual information between X and Y :) I(X; Y ) = H(X) H(X Y )

Remark 1.6. The mutual information $I(X; Y)$ quantifies the amount of information about $X$ contained in $Y$. Equivalent interpretation: $I(X; Y)$ quantifies the reduction in the uncertainty about $X$ that is obtained by observing $Y$.

Properties of the mutual information:

P1) $I(X; Y)$ as a function of $p(x, y)$, $p(x)$, $p(y)$ for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$:
$$\begin{aligned}
I(X; Y) &:= H(X) - H(X \mid Y) \\
&= -E[\log p(X)] + E[\log p(X \mid Y)] \\
&= E\!\left[\log\frac{p(X \mid Y)}{p(X)}\right] = E\!\left[\log\frac{p(X, Y)}{p(X)\, p(Y)}\right] \\
&= \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \quad (1.14)
\end{aligned}$$

P2) $I(X; Y) = I(Y; X)$. Results from the symmetry of the expression (1.14).

P3) $I(X; Y) \geq 0$. Results from the fact that $p(x, y)$ and $p(x)p(y)$ are probability distributions and from applying the information inequality to
$$I(X; Y) = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.$$

P4) $I(X; Y) \geq 0$ implies $H(X) \geq H(X \mid Y)$. That is, conditioning can never increase the entropy. Interpret this result.

P5) $I(X; Y) = 0$ iff $p(x, y) = p(x)\, p(y)$, that is, iff $X$ and $Y$ are independent (prove).

P6) $I(X; Y) = H(X) + H(Y) - H(X, Y)$ (prove).

P7) Based on the equality $I(X; Y) = H(X) + H(Y) - H(X, Y)$, interpret $I(X; Y)$ as the gain, in terms of codelength, obtained by jointly coding $X$ and $Y$ relative to the independent coding of $X$ and $Y$.

Figure 1.12 highlights graphically the relations $H(X) = H(X \mid Y) + I(X; Y)$ and $H(Y) = H(Y \mid X) + I(X; Y)$.
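A short Python sketch computing these quantities from a joint pmf; the matrix below is an arbitrary illustration (it is not the distribution of Figure 1.11), and it checks that the definition (1.14) and property P6 give the same value of $I(X;Y)$.

import numpy as np

def H(p):
    """Entropy in bits of a pmf given as an array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# Illustrative joint pmf p(x, y); rows index x, columns index y.
pxy = np.array([[1/8, 1/16, 1/32, 1/32],
                [1/16, 1/8, 1/32, 1/32],
                [1/16, 1/16, 1/16, 1/16],
                [1/4, 0, 0, 0]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

I_p6 = H(px) + H(py) - H(pxy)                                        # property P6
nz = pxy > 0
I_def = np.sum(pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz]))    # Eq. (1.14)
print(H(px), H(py), H(pxy), I_p6, I_def)   # the two I(X;Y) values coincide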

31 Contents 30 Figure 1.12: Graphical representation of the mutual information. 1.4 Discrete sources with memory Until now, we have considered only DMSs modeled by independent stochastic processes. However, most sources of interest, if not all, have memory in the sense that the variables underlying the respective stochastic processes have probabilistic dependences. In fact, it is these statistic dependencies that encode most of the information in most types or source, such as a text written in a given language, a computer program, or a Morse code. In this section, we address very briefly the coding or discrete sources with memory. The first ingredient to attack this problem is the statistic model for the source. The complete characterization of the stochastic process implies the knowledge of the joint probability of P (X t1 = x 1, X t2 = x 2,..., X tk = x K ) for any set of integers t 1, t 2,..., t K and any set of symbols (x 1,..., x K ) X K. This is a formidable amount of information not accessible in most applications, unless the source probabilities are well approximated by treatable models, in the complexity sense. This the case of stationary processes, defined in Section 1.1 and Markovian chains Markov chains Markov chains are Markov processes with a countable state space. The Markovian property means that the probability of X t given X j for j < t (i.e., the past) depends only on X j for t n j < t (i.e., a local past) for some positive integer n. Formally: Definition (Markov chain of order n) The stochastic process (X t ) t= is termed a Markov chain of of order n if it satisfies the following property: P (X t = x t X t 1 = x t 1, X t 2 = x t 2,... ) = P (X t = x t X t 1 = x t 1, X t 2 = x t 2,..., X t n = x t n )

32 Contents 31 for any t and sequence x t, x t 1 X. A stochastic process is invariant if its conditional probabilities do not depend on a time shift. It happens that many sources or the real word are well approximated by invariant Markov chains. Next, we present a formal definition of invariant Markov chain of order n. Definition (Invariant Markov chain of order n) The process (X t ) t=1 is an invariant Markov chain of order n if it has a memory of size n and its conditional probabilities do not depend explicitly of time, that is P (X t+1 = x n+1 X t = x n,..., X t n+1 = x 1, X t n = x 0,... ) (1.15) =P (X t+1 = x n+1 X t = x n,..., X t n+1 = x 1 ) (1.16) =P (X n+1 = x n+1 X n = x n,..., X 1 = x 1 ) (1.17) for any t Z and any..., x 0, x 1,..., x n+1 X. The joint probability in (1.16) is a consequence of the Markovianity of order n and probability in (1.17) is a consequence of the invariance. Invariant Markov chain of order 1: From Definition 1.15, we conclude that an invariant Markov chain of order 1 satisfies P (X n = j X n 1 = i) = P (X 2 = j X 1 = i), i, j X, and therefore it is characterized by a set of M M probabilities termed the transition matrix: P = [P i,j ], P i,j = P (X 2 = j X 1 = i), i, j X. The transition matrix is a stochastic matrix: 0 P i,j 1 and M P i,j = j=1 M P (X 2 = j X 1 = i) = 1. j=1 Figure 1.13, left, shows is a transition matrix of an invariant Markov chain of order 1 with M = 3. The rows of P are indexed by X 1 (the current state) and the columns are indexd by X 2 (the future). In the right hand, the same transition matrix is represented by a graph. The

33 Contents 32 Figure 1.13: Left: transition matrix of an invariant Markov chain or order 1. Right: Graph representation of the transition matrix. states 1,2,3 represent the the current state and the arrows with the respective weights on top represent the transition probabilities. Joint Probability: Consider an invariant Markov chain of order 1 with transition matrix P, the set of consecutive time instants {1, 2,..., t}, and define p 1 (n) P (X n = 1) p 2 (n) P (X n = 2) p(n) := :=... p M (n) P (X n = M) t P (X t = x t,..., X 1 = x 1 ) = p x1 (1) P xu 1,xu. u=2 Exercise 1.5. Given the transition matrix P shown in Fig and the vector of probabilities p T (1) = [0.2, 0.3, 0.6], compute P (X 4 = 2, X 2 = 3, X 1 = 1). Propagation of the probability: Consider an invariant Markov chain of order 1 with transition matrix P. We have that (prove) We have that (prove) M p j (t + 1) = P i,j p i (t) = [P T ] (j,:) p(t).,

where $[P^T]_{(j,:)}$ denotes the $j$-th row of $P^T$. Therefore, it holds that
$$\mathbf{p}(t+1) = P^T \mathbf{p}(t) = (P^T)^t\, \mathbf{p}(1).$$

Definition 1.16. (Stationary distribution) Consider an invariant Markov chain of order 1 with transition matrix $P$. Let $\mathbf{p}(t)$ be a distribution such that
$$\mathbf{p}(t) = P^T \mathbf{p}(t). \quad (1.18)$$
If such a $\mathbf{p}(t)$ exists, the process is said to be in a stationary state, and $\mathbf{p}(t)$ is termed a stationary distribution, represented by $\mathbf{p}(\infty)$. We remark that equation (1.18) is an eigenvalue-eigenvector problem corresponding to the unit eigenvalue.

Exercise 1.6. Prove that any transition matrix has a unit eigenvalue.

Example 1.12. Consider an invariant Markov chain of order 1 with two states, $\mathcal{X} = \{1, 2\}$, and transition matrix
$$P = \begin{bmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{bmatrix}$$
with $0 < \alpha, \beta < 1$. The eigenvalues of $P$ are $\{1,\, 1 - \alpha - \beta\}$ (show). The eigenvector $\mathbf{p} := [p_1\ p_2]^T$ corresponding to the unit eigenvalue satisfies the equation $P^T \mathbf{p} = \mathbf{p}$, that is,
$$p_1(1 - \alpha) + \beta p_2 = p_1 \iff p_2 = \frac{\alpha}{\beta}\, p_1. \quad (1.19)$$
Using the fact that $\mathbf{p}$ is a probability distribution, that is, $p_1 + p_2 = 1$, we obtain
$$\mathbf{p}(\infty) = \begin{bmatrix} \dfrac{\beta}{\alpha + \beta} \\[2mm] \dfrac{\alpha}{\alpha + \beta} \end{bmatrix}.$$

Question: In (1.19), only the first row of the equation $P^T \mathbf{p} = \mathbf{p}$ was used. Why? Could we use the second?
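A small Python sketch, assuming the two-state chain of Example 1.12 with illustrative values of $\alpha$ and $\beta$: it finds the stationary distribution as the unit-eigenvalue eigenvector of $P^T$ and checks it against the probability propagation $(P^T)^t\,\mathbf{p}(1)$.

import numpy as np

def stationary_distribution(P):
    """Stationary distribution p = P^T p of an order-1 chain (unit-eigenvalue eigenvector)."""
    w, V = np.linalg.eig(P.T)
    p = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return p / p.sum()

alpha, beta = 0.3, 0.1
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
print(stationary_distribution(P))                  # [beta, alpha]/(alpha+beta) = [0.25, 0.75]
print(np.linalg.matrix_power(P.T, 200) @ [1, 0])   # p(t) converges to the same limit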

35 Contents 34 Definition (Irreducible Markov chain ) A Markov chain is irreducible if it possible to reach, with positive probability, any state form any other state in a finite number of steps. Formally, the state j is accessible from state i if there exists an integer t ij 0 such that P (X tij = j X 0 = i) > 0 (i.e., [P t ij ] i,j > 0). Definition (Aperiodic Markov chain ) A Markov chain is aperiodic if it does not contain periodic states: a state i is periodic with period k if any return to state i must occur in multiples of k time steps; that is k = gcd{n : [P n ] i,i > 0}, where gcd denotes the greatest common divisor. Theorem 1.6. Let (X t ) t= be an irreducible and aperiodic invariant Markov chain of order 1. Then, 1. (X t ) t= has a unique stationary distribution. 2. lim t p(t) = p( ) independently of the initial distribution. 3. The stochastic process is stationary Entropy rate and conditional entropy rate The entropy rate and the conditional entropy rate are extensions, for sources with memory, of the entropy for RVs presented in Section The entropy rate is the average entropy per source symbol. The conditional entropy rate is the entropy of a given RV given all RVs in the past. The formal definition of the two forms o entropy is provided below. Definition (Entropy rate) Given a stochastic process (X t ) t=1, its entropy rate is defined as when the limit exits. 1 H(X) := lim t t H(X 1, X 2,..., X t ), (1.20) Exercise 1.7. Show that H(X) = H(X) for a DMS source.

36 Contents 35 Definition (Conditional entropy rate) Given a stochastic process (X t ) t=1, its conditional entropy rate is defined as when the limit exits. H (X) = lim t H(X t X t 1,..., X 1 ). (1.21) Exercise 1.8. Show that H (X) = H(X) for a DMS source. The next theorem states that, for stationary processes, the two forms of entropy are equal. Theorem 1.7. For stationary stochastic processes, the limits (1.20) and (1.21) exist and are equal, that is, H(X) = H (X). Moreover, 1. H(X t X t 1,..., X 1 ) is non-increasing in t. 2. H(X t X t 1,..., X 1 ) 1H(X t t, X t 1,..., X 1 ) for all t t H(X t, X t 1,..., X 1 ) is non-increasing in t. Proof of Theorem 1.7.1: H(X t+1 X t,..., X 2, X 1 ) H(X t+1 X t,..., X 2 ) (1.22) = H(X t X t 1,..., X 2, X 1 ), (1.23) where (1.22) follows from the fact that conditioning cannot increase the entropy and (1.23) follows from stationarity. Proof of Theorem 1.7.2: 1 t H(X t, X t 1,..., X 1 ) = 1 t 1 t t H(X k X k 1,..., X 1 ) (1.24) k=1 t H(X t X t 1,..., X 1 ) (1.25) k=1 = H(X t X t 1,..., X 1 ), (1.26)

37 Contents 36 where (1.24) follows from the chain rule for entropy and (1.25) follows from the fact that conditioning canot increase the entropy and from stationarity. Proof of Theorem 1.7.3: H(X t+1, X t,..., X 1 ) = H(X t+1 X t,..., X 1 ) + H(X t,..., X 1 ) (1.27) H(X t X t 1,..., X 1 ) + H(X t,..., X 1 ) (1.28) 1 t H(X t,..., X 1 ) + H(X t, X t,..., X 1 ) (1.29) = 1 + t H(X t,..., X 1 ), (1.30) t where (1.27) follows from the chain rule for entropy, (1.28) follows from the fact that conditioning cannot increase the entropy, and (1.29) follows form from Theorem Finally, from (1.30) if follows that t H(X t+1, X t,..., X 1 ) 1 t H(X t,..., X 1 ). Proof of Theorem 1.7: The sequences H(X t,..., X 1 )/t, defined in (1.20), and H(X t X t1..., X 1 ), defined in (1.21), are non-increasing and non-negative, so they converge. We now show that the limit is the same: 1 H(X) = lim t t H(X 1, X 2,..., X t ) (1.31) 1 = lim t t 1 = lim t t t H(X k X k 1,..., X 1 ) (1.32) }{{} a k t a k (1.33) k=1 k=1 = lim t a t (1.34) = H(X), (1.35) where (1.32) follows from the chain rule for entropya and (1.32) follows from the Cesáro mean 3. 3 Cesáro Mean: if lim t a t = a and b t = (1/t) t k=1 a k, then lim t b t = a.

Entropy rate of a stationary Markov chain of order 1: Since the Markov chain is stationary, it is invariant and thus characterized by its transition matrix $P$. Assuming that the stationary distribution is unique, its entropy rate is
$$\begin{aligned}
H(\mathcal{X}) = H'(\mathcal{X}) &= \lim_{t \to \infty} H(X_t \mid X_{t-1}, \dots, X_1) \\
&= \lim_{t \to \infty} H(X_t \mid X_{t-1}) = H(X_2 \mid X_1) \\
&= \sum_{i=1}^{M} H(X_2 \mid X_1 = i)\, P(X_1 = i) \\
&= -\sum_{i=1}^{M} p_i(\infty) \sum_{j=1}^{M} P_{ij} \log P_{ij}
\end{aligned}$$

Example 1.13. Consider an invariant Markov chain of order 1 with alphabet $\mathcal{X} = \{1, 2\}$ and transition matrix
$$P = \begin{bmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{bmatrix}.$$
Using the stationary distribution computed in Example 1.12, the conditional entropy rate is given by
$$H'(\mathcal{X}) = \frac{\beta}{\alpha + \beta} H(\alpha) + \frac{\alpha}{\alpha + \beta} H(\beta).$$
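A hedged numerical check of Example 1.13 in Python, with illustrative $\alpha$ and $\beta$: the general formula $-\sum_i p_i(\infty) \sum_j P_{ij}\log P_{ij}$ is compared with the closed form above.

import numpy as np

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

def entropy_rate(P):
    """H(X2|X1) = -sum_i p_inf(i) sum_j P_ij log2 P_ij for an order-1 chain."""
    w, V = np.linalg.eig(P.T)
    p_inf = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    p_inf = p_inf / p_inf.sum()
    row_H = [-sum(q * np.log2(q) for q in row if q > 0) for row in P]
    return float(np.dot(p_inf, row_H))

alpha, beta = 0.3, 0.1
P = np.array([[1 - alpha, alpha], [beta, 1 - beta]])
closed_form = beta/(alpha+beta)*binary_entropy(alpha) + alpha/(alpha+beta)*binary_entropy(beta)
print(entropy_rate(P), closed_form)   # both ~0.57 bits/symbol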

1.4.3 Coding stationary sources

Consider a stationary source with memory and the associated stochastic process $(X_t)_{t=-\infty}^{\infty}$. Let $X^n := (X_1, \dots, X_n)$ be a vector of RVs corresponding to an extended symbol of order $n$, taking values in $\mathcal{X}^n$, as illustrated in Fig. 1.4. The symbols of the extended source may be encoded using a Shannon code with codelength $L_n(C_s)$, thus satisfying
$$H(X^n) \leq L_n(C_s) < H(X^n) + 1 \quad (1.36)$$
$$\frac{H(X^n)}{n} \leq \frac{L_n(C_s)}{n} < \frac{H(X^n)}{n} + \frac{1}{n}. \quad (1.37)$$

Since the source is stationary, both the left hand side and the right hand side of (1.37) converge to the entropy rate $H(\mathcal{X})$. Furthermore, since $H(X^n)/n$ is non-increasing (Theorem 1.7), we may write
$$\frac{H(X^n)}{n} = H(\mathcal{X}) + \delta(n),$$
where $\delta(n) \geq 0$ and $\lim_{n \to \infty} \delta(n) = 0$. Therefore, we may write
$$H(\mathcal{X}) \leq \frac{L_n(C_s)}{n} < H(\mathcal{X}) + \delta(n) + \frac{1}{n}, \quad (1.38)$$
which enables us to state the Shannon Source Coding Theorem for sources with memory:

Theorem 1.8. (Shannon Source Coding Theorem for stationary sources) A stationary discrete source with entropy rate $H$ bits/symbol may be encoded and decoded using instantaneous codes with average codeword length $L = H + \varepsilon$ binits per symbol, where $\varepsilon > 0$ is arbitrarily small. It is impossible to encode this source with instantaneous codes for $L < H$.

40 Contents 39 Now instead of coding just X 2 given X 1, let us code X n+1, X n,..., X 2 given X 1, denoted as X n X 1, We may then write H(X n X 1 ) L(X n X 1 ) < H(X n X 1 ) + 1. But since H(X n X 1 ) = nh(x 2 X 1 ) (prove), we obtain H(X 2 X 1 ) L(X n X 1 ) < H(X 2 X 1 ) + 1 n. (1.40) Having in mind that for stationary sources H(X) = H(X ) and for invariant Markov chains of 1 order, H(X ) = H(X 2 X 1 ), then by comparing (1.38) with (1.40), we conclude that the coding with memory removes the slack δ(n) in the former bounds. Example The table below shows the transition matrix and optimal Huffman codes for each state of the Markov chain. Figure 1.14: Transition matrix and optimal Huffman codes for each state of thr Markov chain. The stationary distribution and the conditional entropy for eaxh state are H(X 2 X 1 = 1) = H(0.2, 0.4, 0.4) = bits/sym p( ) = H(X 2 X 1 = 2) = H(0.3, 0.5, 0.2) = bits/sym H(X 2 X 1 = 3) = H(0.6, 0.1, 0.3) = bits/sym The codelengths for the sate-dependent codes are L(X 2 X 1 = 1) = 1.6 bits/sym L(X 2 X 1 = 2) = 1.5 bits/sym L(X 2 X 1 = 3) = 1.4 bits/sym. The conditional entropy rate, the codelength, and the efficiency are H (X) = bits/sym L = bits/sym η = =

41 Contents Channel coding Figure 1.15: Block diagram of a digital communication system. Fundamental question: What is maximum reliable transmission rate over an unreliable channel? Answer to the fundamental question: The channel capacity (Shannon s second Theorem). The possibility of transmitting reliably over a noisy channel without reducing dramatically the transmission rate of information is counterintuitive. Common sense says that if the channel introduces an error in a symbol with a probability p > 0, then in n independent transmissions, a number of pn symbols should be corrupted on average. Shannon proved that this belief

42 Contents 41 was wrong as long as the communication rate was below the channel capacity. This is one of the most beautiful and deep results in communication theory, which have underlaid huge research efforts to build channel encoding and decoding procedures (for example, block codes, cyclic codes, convolucional codes, turbo codes, and low parity check codes) ensuring a reliable transmission over an unreliable channel at rates close to the channel capacity The channel encoding and decoding procedures are implemented by the channel encoder and the channel decoder, respectively, both shown inside the dashed box in Fig The channel encoder receives sequences of symbols d t D from the source encoder and generates new sequences of symbols x t X, which are sent to the channel. The new sequences encodes the input sequences jointly with redundant information to be used by the decoder to detect sequences d t D. The objective of the encoding and decoding procedure is that d t = d t with high probability. Organization: This section is organized as follows. Section introduces the model of a discrete memoryless channel and the respective the channel transition matrix, defines the channel capacity, which resumes the ability of the channel to transmit reliable information. Section states the channel-coding Theorem [1], which gives necessary and sufficient conditions for the existence of reliable transmission over an unreliable channel. The material herein presented is partially inspired by the books [3, 2, 4] Model of a discrete memoryless channel (DMC) Figure 1.16: Model of a discrete memoryless channel (DMC). With reference to Fig. 1.15, the channel is a mathematical model for the communication between the input of the modulator and the output of the demodulator. This model is schematized in Fig The symbols at the input and output of the channel are, respectively, modeled by random variables X, taking values in X := {1,..., M} X, and Y, taking values in Y := {1,..., N}. The channel is characterized by the probabilities at output of the channel

43 Contents 42 conditioned to the input, termed transition probabilities: p(y j x i ) := P (Y = x j X = x i ), i = 1,..., M, j = 1,..., N. Definition (Discrete memoryless channel (DMC)) The channel is said to be a discrete because the alphabets X and Y are finite and memoryless because the output sequence is conditionally independent of the input sequence: P (Y t = y t,..., Y 1 = y 1 X t = x t,..., X 1 = x 1 ) = t P (Y i = y i X i = x i ), for any x 1, x 2,..., x t X and y 1, y 2,..., y t Y. Definition The transition matrix, of size M N, holds the conditional probabilities as follows: P = p(y 1 x 1 ) p(y 2 x 1 )... p(y N x 1 ).. p(y 1 x M ) p(y 2 x M )... p(y N x M )... The rows and columns of P are associated with, respectively, with the RV X at the input of the channel and the output RV Y at the output of the channel. Matrix P = [P i,j ] is stochastic; that is, P i,j 0 and N j=1 P i,j = 1. The channel output probabilities p(y) := P (Y = y), for y Y, are computed from the intput probabilities p(x) := P (X = x), for x X, as p(y) = x X p(x, y) = x X p(y x)p(x) or, in terms of transition matrix P, [p(y 1 ) p(y 2 )... p(y N )] = [p(x 1 ) p(x 2 )... p(x N )]P.

Definition. (Channel capacity) The channel capacity is the maximum of the mutual information $I(X; Y)$ between $X$ and $Y$ over all input distributions $\{p(x), x \in \mathcal{X}\}$:
$$C = \max_{\{p(x),\, x \in \mathcal{X}\}} I(X; Y) \quad \text{bits/transmission}$$

The channel capacity is the highest rate, in bits per channel use, at which information can be transmitted with arbitrarily low probability of error. This fundamental result is formalized by Shannon's second theorem, stated in Section 1.5.2.

Properties of the channel capacity: The following properties result from the properties of the mutual information:

P1) $C \geq 0$, since $I(X; Y) \geq 0$.
P2) $C \leq \log |\mathcal{X}|$, since $I(X; Y) \leq H(X)$.
P3) $C \leq \log |\mathcal{Y}|$, since $I(X; Y) \leq H(Y)$.

Example 1.15. The noiseless binary channel has transition matrix
$$P = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$
In this channel the output reproduces exactly the input. Therefore, we anticipate that its capacity is 1 bit/transmission. In fact, we have $H(X \mid Y) = 0$ and thus
$$C = \max_{p(x_1)} I(X; Y) = \max_{p(x_1)} H(X) = 1 \quad \text{bit/transmission}.$$
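To make the definition concrete, the sketch below computes $I(X;Y)$ for a given input pmf and transition matrix, and approximates the capacity of a binary-input channel by a simple grid search over $p(x_1)$; the function names and the grid-search approach are illustrative, not part of the notes.

import numpy as np

def mutual_information(px, P):
    """I(X;Y) in bits for input pmf px and transition matrix P (rows: p(y|x))."""
    pxy = px[:, None] * P                 # joint p(x, y)
    py = pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz])))

def capacity_binary_input(P, grid=10001):
    """Approximate C = max_{p(x)} I(X;Y) for a 2-input channel by a grid search."""
    qs = np.linspace(0.0, 1.0, grid)
    return max(mutual_information(np.array([q, 1 - q]), P) for q in qs)

P_noiseless = np.eye(2)
print(capacity_binary_input(P_noiseless))   # ~1 bit/transmission, as in Example 1.15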

Example 1.16. The binary symmetric channel (BSC) is characterized by the following transition matrix (the respective graph is also shown):
$$P = \begin{bmatrix} 1-p & p \\ p & 1-p \end{bmatrix}$$
To compute the capacity of this channel, we note that:
1) $I(X; Y) = I(Y; X) = H(Y) - H(Y \mid X)$
2) $H(Y \mid X) = H(Y \mid X = x_1)\, p(x_1) + H(Y \mid X = x_2)\, p(x_2) = H(p)\, p(x_1) + H(p)\, p(x_2) = H(p)$
3) $H(Y) = H(q)$, where $q = (1-p)\, p(x_1) + p\,(1 - p(x_1))$.

The capacity of the BSC is then given by
$$C = \max_{\{p(x)\}} \left[ H(q) - H(p) \right] = 1 - H(p),$$
where the fact that $\max_{\{p(x)\}} H(q) = 1$, attained for $p(x_1) = p(x_2) = 1/2$, has been used.

Figure 1.17: Capacity of the BSC channel.

Figure 1.17 plots the capacity of the BSC as a function of $p = p(y_1 \mid x_2)$. The capacity is maximum for $p \in \{0, 1\}$ (why?) and minimum for $p = 1/2$ (why?); in the latter case the input and output of the channel are independent RVs.
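As a numeric check of the BSC formula, a short sketch comparing a maximization of $I(X;Y)$ over the input pmf (by grid search, as above) with $1 - H(p)$:

import numpy as np

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

def I_bsc(q, p):
    """I(X;Y) = H(Y) - H(p) for a BSC with crossover p and input pmf (q, 1-q)."""
    y1 = (1 - p) * q + p * (1 - q)        # P(Y = y1)
    return binary_entropy(y1) - binary_entropy(p)

for p in (0.0, 0.1, 0.5):
    C_numeric = max(I_bsc(q, p) for q in np.linspace(0, 1, 10001))
    print(p, C_numeric, 1 - binary_entropy(p))   # the two values agree: 1, ~0.531, 0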

Example 1.17. The binary erasure channel (BEC) is characterized by the following transition matrix (the respective graph is also shown):
$$P = \begin{bmatrix} 1-p & p & 0 \\ 0 & p & 1-p \end{bmatrix}$$
To compute the capacity of this channel, we note that:
1) $I(X; Y) = I(Y; X) = H(Y) - H(Y \mid X)$
2) $H(Y \mid X) = H(Y \mid X = x_1)\, p(x_1) + H(Y \mid X = x_2)\, p(x_2) = H(p)\, p(x_1) + H(p)\, p(x_2) = H(p)$
3) $H(Y) = H(p) + (1-p) H(q)$, where $q = p(x_1)$ (use Property 6 of the entropy, grouping the two non-erasure outputs against the erasure).

The capacity of the BEC is then given by
$$C = \max_{\{p(x)\}} (1-p) H(q) = 1 - p,$$
where the fact that $\max_{\{p(x)\}} H(q) = 1$, attained for $p(x_1) = p(x_2) = 1/2$, has been used. Interpret the obtained capacity for $p = 0$ and for $p = 1$.

Example 1.18. Noisy channel with nonoverlapping outputs. This channel has two possible outputs for each of the two inputs, as shown in Fig. 1.18. At first glance, one might think that the channel is noisy, but really it is not.

Figure 1.18: Noisy channel with nonoverlapping outputs.

To understand why it is not noisy, suppose that we group the outputs into two sets, $\{y_1, y_2\}$ and $\{y_3, y_4\}$, as illustrated in Fig. 1.18. Then, when the output symbol belongs to the first set, we know without any uncertainty that the symbol at the input was $x_1$, and when the output symbol belongs to the second set, we know without any uncertainty that the symbol at the input was $x_2$. Therefore, the channel is equivalent to a BSC channel with crossover probability $p = 0$ and capacity $C = 1$ bit/symbol.

47 Contents 46 We now compute the channel capacity to confirm the above rationale. We note that 1) I(X; Y ) = H(X) H(X Y ) 2) H(X Y ) = 0 (why?) The capacity of the noisy channel is then given by C = max H(X) {p(x)} = 1 bit/symbol. Exercise 1.9. Noisy typewriter. Consider a channel with transition matrix P i,i = 1/2, i = 1,..., N P = [P i,j ] with P i,i+1 = 1/2, i = 1,..., N 1 P N,1 = 1/2, where N 4 is an even. The noisy typewriter channel is shown in Fig. 1.19, left, for N = 8. Show that the channel capacity is C = log N 1 bits/symbol and interpret the result. For N = 8, design an encoder and a decoder ensuring zero transmission errors. Encode and decode the sequence x 1 x 2 x 3 x 4 x 5 x 6 x 7 x 8. Hint: notice that, as shown in Fig. 1.19, right, the subset of inputs {x 1, x 3, x 5, x 7 } produces disjoint sequences at the output. Definition A channel is said weakly symmetric if every row of the transition matrix is a permutation of every other and all the columns sums are equal. Exercise Prove that the capacity of a weakly symmetric channel is C = log Y H(P i1, P i2,..., P in ), for any i {1,..., M}. Note that H(P i1, P i2,..., P in ) is the entropy of the ith row of the transition matrix.

48 Contents 47 Figure 1.19: Left: Noisy typewriter. Right: noiseless subchannel. Figure 1.20: Block diagram of the channel encoding and decoding process Channel coding theorem The channel encoding and decoding process schematized in Fig. components: 1.20 entails the following 1. stationary source (M t ) t= with alphabet M and entropy H(M) bits/symbol. 2. extended messages m k M k, resulting from sequences of k consecutive source symbols, are input to the encoder. 3. a channel with input and output alphabets X and Y, respectively, and transition probabilities p(y x) for x X and y Y. The channel is compactly characterized by the probabilities {p(y x), x X, y Y}. 4. an encoding function x n : M k X n, yielding the codewords x n (m k ) for m k M k. Each

49 Contents 48 codeword is a string of length n from X. The set of codewords is called the codebook. 5. a decoding function g : Y n M k, which assigns a message m k M k to each string y n Y n of length n. The random variables M k and M k models the input of the encoder and the output of the decoder, respectively; the random vectors X n = (X 1,..., X n ) and Y n = (Y 1,..., Y n ) model the input and the output of the channel, respectively. Definition An (n, k) code for the channel {p(y x), (x, y) X Y} consists of the triplet (M k, x n, g) defined above. Definition The rate of an (n, k) code is R := kh(m) n bits/transmission Definition The maximum probability of error of a (n, k) code for the for the channel {p(y x), (x, y) X Y} is defined as λ (n) := max m k M k P ( M k m k M k = m k ) Theorem 1.9. (Channel coding theorem) Assume that the source is stationary and has entropy H(M). Then for every ε > 0 and rate R < C, there exists a sequence of (k, n) codes with maximum probability of error λ (n) 0. Conversely, it not possibly to have λ (n) 0 with R > C. The fundamental requisite to transmit with a probability of error arbitrarily low is then R < C. If H(M) C, we must have k/n < 1 such that R < C. For example, consider a binary source with entropy H(M) = 1 bit/symbol and a BSC with p = 0.01 and thus C = 1 H(p) = bits/transmission. In order to have a reliable transmission, we must have k n < C =

Therefore the codewords must have an overhead of at least (n − k)/n = 1 − 0.9192 ≈ 0.0808 (8.08%) redundant bits introduced by the encoder. In order not to accumulate information at the input of the encoder, the channel must transmit information at a higher rate than the transmission rate of the source. If T and T_c are the source and channel symbol periods, respectively, then we must have kT = nT_c, and thus

H(M)/T < C/T_c  bits/second.

The rate C/T_c is termed the critical rate.

Informal justification of the Channel coding theorem. The proof of the Channel coding theorem is beyond the scope of this course; see [2, Ch. 8] for a formal proof. Notwithstanding, we provide an informal justification hinging on the notion of typical sequence, which is stated below.

Asymptotic Equipartition Property (AEP). For stationary sources and n large, the set of n-sequences can be divided into two subsets: the typical set, which contains about 2^{nH(X)} approximately equally likely sequences, each with probability close to 2^{−nH(X)}, and whose total probability is close to one; and the nontypical set, whose probability is close to zero.

We now provide intuition on the channel coding theorem, supported on the following three ideas/results:

1. For large values of n, every channel behaves like the noisy typewriter, in that there is a subset of inputs x^n ∈ X^n that produces disjoint sets of sequences at the output.
2. For each typical input n-sequence x^n there are about 2^{nH(Y|X)} possible y^n sequences at the output of the channel.
3. The number of typical n-sequences y^n is about 2^{nH(Y)}.

Therefore, if we divide the set of 2^{nH(Y)} output sequences into sets of size 2^{nH(Y|X)}, corresponding to the different input sequences, the total number of disjoint sets is less than or equal to 2^{n(H(Y) − H(Y|X))} = 2^{nI(Y;X)}. Hence, we may send at most 2^{nI(Y;X)} distinguishable sequences of length n.
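The numbers in the BSC example above follow directly from C = 1 − H(p). A minimal check (not part of the notes), assuming NumPy:

```python
import numpy as np

def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = 0.01                          # BSC crossover probability from the example
C = 1 - binary_entropy(p)         # capacity in bits/transmission
print(C)                          # ~ 0.9192
print(1 - C)                      # minimum fraction of redundant bits ~ 0.0808 (8.08%)
```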

Additive Gaussian channel

Fundamental question: What is the maximum rate at which information can be reliably transmitted over a bandlimited channel perturbed by additive Gaussian noise?

Answer to the fundamental question: the Shannon information capacity theorem.

Organization: This section states the Shannon-Hartley information capacity theorem [1, 6], which provides the channel capacity of a continuous channel of fixed bandwidth perturbed by bandlimited Gaussian noise. The material herein presented is partially inspired by the books [3, 2, 4].

Capacity of an amplitude-continuous time-continuous channel

Figure 1.21: Left: model of an amplitude-continuous channel with additive Gaussian noise. Right: discretized version with discretization interval Δ.

Consider the channel shown in Fig. 1.21, left, where X is a RV with pdf f_X such that E[X²] ≤ P_X, and N ~ N(0, σ_N²) is a Gaussian RV, independent of X, with zero mean and variance σ_N². Let I_i = [iΔ, (i+1)Δ[, for i ∈ Z, and let X_Δ be a quantized version of X such that X_Δ = iΔ if X ∈ I_i. Likewise, define N_Δ = iΔ if N ∈ I_i and Y_Δ = X_Δ + N_Δ. The discretized channel is shown in Fig. 1.21, right. Therefore, we have

$$
p_i := P(X_\Delta \in I_i) = \int_{i\Delta}^{(i+1)\Delta} f_X(x)\, dx \approx \Delta f_X(x_i),
$$

where x_i ∈ I_i. The entropy of X_Δ is given by

$$
\begin{aligned}
H(X_\Delta) &= -\sum_{i=-\infty}^{\infty} p_i \log p_i
 \approx -\sum_{i=-\infty}^{\infty} \Delta f_X(x_i) \log\big(\Delta f_X(x_i)\big) \\
 &= -\sum_{i=-\infty}^{\infty} \Delta f_X(x_i) \log f_X(x_i) - \log \Delta \sum_{i=-\infty}^{\infty} \Delta f_X(x_i) \\
 &= h(X) + \varepsilon_X(\Delta) - \log \Delta,
\end{aligned}
$$

where h(X) := −∫ f_X(x) log f_X(x) dx is the so-called differential entropy of X [2] and ε_X(Δ) accounts for the difference between the Riemann integral h(X) and its discretized sum approximation with discretization interval Δ. Under suitable conditions on f_X, we have lim_{Δ→0} ε_X(Δ) = 0.

Proceeding as before, the mutual information between X_Δ and Y_Δ is

$$
\begin{aligned}
I(X_\Delta; Y_\Delta) &= H(Y_\Delta) - H(Y_\Delta \mid X_\Delta) \\
 &= h(Y) - h(Y \mid X) + \varepsilon_Y(\Delta) - \varepsilon_{Y|X}(\Delta) - \log \Delta + \log \Delta,
\end{aligned}
$$

where h(Y|X) = −∫∫ f_{XY}(x, y) log f_{Y|X}(y|x) dx dy is the conditional differential entropy [2] of Y given X, and ε_{Y|X}(Δ) accounts for the difference between the Riemann integral h(Y|X) and its discretized sum approximation with discretization interval Δ. Under suitable conditions on f_{XY}, we have lim_{Δ→0} ε_{Y|X}(Δ) = 0.

Taking into consideration that f_{Y|X}(·|x) is the pdf of N(x, σ_N²), we have

$$
\begin{aligned}
h(Y \mid X) &= -\int f_X(x) \int \underbrace{f_{Y|X}(y|x)}_{\mathcal{N}(x,\,\sigma_N^2)} \log f_{Y|X}(y|x)\, dy\, dx \\
 &= \int f_X(x) \int \frac{e^{-(y-x)^2/(2\sigma_N^2)}}{\sqrt{2\pi\sigma_N^2}}\,
     \frac{\tfrac{1}{2}\ln(2\pi\sigma_N^2) + (y-x)^2/(2\sigma_N^2)}{\ln 2}\, dy\, dx \\
 &= \frac{1}{2} \log\big(2\pi e\, \sigma_N^2\big).
\end{aligned}
$$
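The relation H(X_Δ) ≈ h(X) − log Δ can be verified numerically for a Gaussian X, whose differential entropy is h(X) = (1/2) log₂(2πeσ²). The sketch below is my own check (not from the notes); it uses the approximation p_i ≈ Δ f_X(x_i) and assumes NumPy.

```python
import numpy as np

sigma2 = 1.0
delta = 0.01                                   # discretization interval
centers = np.arange(-10, 10, delta) + delta / 2
p = delta * np.exp(-centers**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
p = p[p > 0]                                   # p_i ~ delta * f_X(x_i)

H_delta = -np.sum(p * np.log2(p))              # entropy of the quantized variable
h = 0.5 * np.log2(2 * np.pi * np.e * sigma2)   # differential entropy of N(0, sigma2)
print(H_delta, h - np.log2(delta))             # both ~ 8.69 bits
```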

Let the capacity of the discretized channel, C_Δ, be given by

$$
C_\Delta = \max_{f_X :\, E[X^2] \le P_X} I(X_\Delta; Y_\Delta).
$$

Defining the capacity of the continuous channel as C = lim_{Δ→0} C_Δ, and assuming that the lim and max operators may be swapped, we have

$$
C = \max_{f_X :\, E[X^2] \le P_X} h(Y) - h(Y \mid X).
$$

Since h(Y|X) does not depend on f_X, we have

$$
C = \max_{f_X :\, E[X^2] \le P_X} h(Y) - \frac{1}{2} \log\big(2\pi e\, \sigma_N^2\big).
$$

The solution of the above optimization is achieved for a Gaussian pdf [2] and given by h(Y) = (1/2) log(2πe(P_X + σ_N²)). We then have

$$
C = \frac{1}{2} \log\big(2\pi e (P_X + \sigma_N^2)\big) - \frac{1}{2} \log\big(2\pi e\, \sigma_N^2\big)
  = \frac{1}{2} \log\left(1 + \frac{P_X}{\sigma_N^2}\right).
$$

Theorem 1.10. The capacity of a Gaussian channel with power constraint P_X and noise variance σ_N² is

$$
C = \frac{1}{2} \log\left(1 + \frac{P_X}{\sigma_N^2}\right) \quad \text{bits per transmission.} \tag{1.41}
$$

Definition. An (n, k) code for a Gaussian channel Y | X = x ~ N(x, σ_N²) consists of the triplet (R^k, x^n, g) defined in Definition 1.25 with the constraint

$$
\sum_{i=1}^{n} x_i^2(w) \le n P_X, \quad \forall\, w \in \mathcal{W}.
$$

We recall that the respective code rate, defined in (1.26), is R = k H(M)/n bits per transmission.

Theorem 1.10 has the following implications:

1. (Achievability) For every ε > 0 and rate R < C, there exists a sequence of (n, k) codes with maximum probability of error λ^(n) → 0.
2. (Unachievability) For any sequence of (n, k) codes with R > C, the maximum probability of error λ^(n) does not converge to zero.
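A one-line numerical illustration of (1.41), as a sketch of my own (not from the notes); the values of P_X and σ_N² are arbitrary examples, and NumPy is assumed:

```python
import numpy as np

def gaussian_capacity(P_X, sigma2_N):
    """Capacity in bits/transmission of the discrete-time Gaussian channel, eq. (1.41)."""
    return 0.5 * np.log2(1 + P_X / sigma2_N)

print(gaussian_capacity(P_X=1.0, sigma2_N=1.0))   # SNR = 1  -> 0.5 bits/transmission
print(gaussian_capacity(P_X=10.0, sigma2_N=1.0))  # SNR = 10 -> ~1.73 bits/transmission
```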

Sphere packing for the Gaussian channel

We now provide a plausible argument as to why it is possible to construct (n, k) codes with low probability of error. Consider a message w ∈ R^k and the codeword vector x^n(w) such that

$$
\|x^n(w)\|_2^2 = \sum_{i=1}^{n} x_i^2(w) \le n P_X.
$$

The random vector Y^n at the output of the Gaussian channel is given by Y^n = X^n + N^n, where N^n is an i.i.d. random vector of size n with Gaussian components of zero mean and variance σ_N². For a given X^n, ‖Y^n − X^n‖² has mean nσ_N² and variance 2nσ_N⁴. This means that, as n increases, the vectors Y^n − X^n lie increasingly close to an n-dimensional sphere of radius √(nσ_N²). The decoder is able to correctly infer the vectors X^n(w), for w ∈ W, if the minimum distance between any two codewords is at least 2√(nσ_N²), i.e., the spheres are non-intersecting. Therefore, the maximum number of correctly decoded codewords is equal to the maximum number of n-dimensional non-intersecting spheres of radius √(nσ_N²) that can be packed into a larger sphere of radius √(n(P_X + σ_N²)). This number is

$$
\frac{\big(n(P_X + \sigma_N^2)\big)^{n/2}}{\big(n\sigma_N^2\big)^{n/2}}
 = 2^{\frac{n}{2} \log\left(1 + \frac{P_X}{\sigma_N^2}\right)},
$$

corresponding to the rate (1/2) log(1 + P_X/σ_N²).

Figure 1.22: Sphere packing interpretation for the Gaussian channel.

Capacity of the bandlimited Gaussian channel

Figure 1.23: Gaussian channel with bandwidth B; X(t) denotes a stochastic process with zero mean and variance σ_X², and N(t) denotes a stationary Gaussian process with PSD G_N(f) = N_0/2.

We now return to the time-continuous channel, in the sense that its input and output are time-continuous signals and the input signals are bandlimited to the frequency interval [−B, B] Hz. Therefore, we may discard the frequency components of the noise outside the frequency interval [−B, B]. This scenario is modeled by the block diagram shown in Fig. 1.23, where a lowpass filter with bandwidth B is included at the output. The noise power at the output is σ_N² = (N_0/2)(2B) = N_0 B. By the sampling theorem, the signal component at the output may be recovered using 2B samples per second. Accordingly, we may express the information capacity as follows.

Theorem (Information capacity theorem). The information capacity of a continuous channel of bandwidth B Hz, perturbed by additive Gaussian noise of power spectral density N_0/2 W/Hz and bandlimited to B, is given by

$$
C = B \log\left(1 + \frac{P_X}{N_0 B}\right) \quad \text{bits per second},
$$

where P_X is the average transmitted power.

This theorem is one of the most important and beautiful results of information theory; it links in a simple formula the three most important parameters of a communication system: the channel bandwidth, the average transmitted power, and the noise power spectral density.
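A minimal sketch of the formula in action (not from the notes; the bandwidth and SNR values are hypothetical, and NumPy is assumed):

```python
import numpy as np

def bandlimited_capacity(B, P_X, N0):
    """C = B * log2(1 + P_X / (N0 * B)) in bits per second."""
    return B * np.log2(1 + P_X / (N0 * B))

# Hypothetical example: B = 3.4 kHz and an SNR of P_X / (N0 * B) = 1000 (30 dB).
B = 3.4e3
N0 = 1.0
P_X = 1000 * N0 * B
print(bandlimited_capacity(B, P_X, N0))   # ~ 3.4e4 bits per second
```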
