Lecture Notes on Digital Transmission: Source and Channel Coding

José Manuel Bioucas Dias

February 2015

CHAPTER 1
Source and Channel Coding

Contents

1 Source and Channel Coding
  1.1 Introduction
  1.2 Source Coding
    1.2.1 Discrete stationary sources
    1.2.2 Discrete memoryless sources (DMS)
    1.2.3 Entropy
    1.2.4 Source extension
    1.2.5 Source encoder
    1.2.6 Classes of codes
    1.2.7 Kraft inequality
    1.2.8 Bounds on optimal codelengths
    1.2.9 Huffman code
  1.3 Joint and conditional entropy. Mutual information
  1.4 Discrete sources with memory
    1.4.1 Markov chains
    1.4.2 Entropy rate and conditional entropy rate
    1.4.3 Coding stationary sources
    1.4.4 Optimal coding of stationary Markovian sources
  1.5 Channel coding
    1.5.1 Model of a discrete memoryless channel (DMC)
    1.5.2 Channel coding theorem
  1.6 Additive Gaussian channel
    1.6.1 Capacity of an amplitude-continuous time-continuous channel
    1.6.2 Sphere packing for the Gaussian Channel
    1.6.3 Capacity of the bandlimited Gaussian channel

1.1 Introduction

This chapter addresses the following two fundamental questions about a digital communication system, and their answers.

Fundamental questions:
1. What is the minimum number of bits per symbol with which a discrete source may be represented?
2. What is the maximum rate at which information can be transmitted reliably over an unreliable channel?

Answers to the fundamental questions:
1. The answer to question 1 is provided by Shannon's source coding theorem [1], which states that the minimum average number of bits per symbol needed to represent a discrete source is its entropy, a measure of the source complexity.
2. The answer to question 2 is provided by Shannon's channel coding theorems [1] and is linked with the channel capacity, an upper bound on the rate at which information may flow through the channel with an arbitrarily small probability of error.

This chapter focuses on source coding, channel coding, entropy, and channel capacity. These subjects are studied in Information Theory (IT), which, in general terms, is concerned with the transmission, processing, extraction, and utilization of information [2]. The mathematics used herein consists mainly of discrete probability, with a few incursions into continuous random vectors. In spite of the lightness of the mathematical tools, the key ideas in source and channel coding, put forth by Shannon in 1948, are profound and have underlain a scientific revolution in digital communication systems (and in other areas) that is still alive today. The material presented here is mostly based on the books [3, 2, 4]; for detailed proofs, the reader is referred to those books. Although the emphasis is on insight and intuition, a number of proofs and showcases of core concepts and algorithms are provided.

The chapter is organized as follows. Source coding is addressed in Section 1.2, and channel coding is addressed in Sections 1.5 (discrete channel) and 1.6 (Gaussian channel).

1.2 Source Coding

Figure 1.1: Block diagram of a digital communication system. The source encoder and decoder are highlighted.

Figure 1.1 shows the block diagram of a digital communication system. The source produces a sequence of symbols $x_t \in \mathcal{X}$ at a rate $1/T$ symbols per second from an alphabet of cardinality $|\mathcal{X}| = M$. The source encoder reads source symbols and produces variable-length codewords $c_t \in \mathcal{D}^*$, where $\mathcal{D}^*$ is the set of finite sequences of symbols from the alphabet $\mathcal{D} = \{0, 1, \dots, D-1\}$. For example, if $\mathcal{D} = \{0,1\}$ or $\mathcal{D} = \{0,1,2\}$, we say that the respective codewords are binary or ternary. In order to generate efficient codes, in terms of the average number of code symbols, the source encoder assigns short codewords to the symbols $x_t \in \mathcal{X}$ with high probability and long codewords to those with low probability. In the receiver, assuming that the transmission link between the channel encoder and the channel decoder is reliable, i.e., $\hat{c}_t = c_t$, the encoding process is reversed by the source decoder.

This section addresses fundamental aspects of source encoding and decoding, which are intimately linked with the first fundamental question and answer stated in the Introduction, where the notion of entropy, as a measure of the complexity of the source, takes center stage. From a practical and algorithmic point of view, the research efforts to design efficient source encoders led to, among many others, the Huffman codes, the Shannon-Fano code, the Elias code, and the Lempel-Ziv codes.

1.2.1 Discrete stationary sources

Figure 1.2: Discrete source transmitting symbols from the alphabet $\mathcal{X}$ at the rate $r = 1/T$ symbols/second.

Figure 1.2 shows a discrete source, which transmits a symbol from the alphabet $\mathcal{X}$, with $M$ symbols (i.e., $|\mathcal{X}| = M$), every $T$ seconds. Very often we assume, without loss of generality, that $\mathcal{X} = \{1, 2, \dots, M\}$. The sequence of symbols at the output of the source, denoted by $(x_t)_{t=-\infty}^{\infty}$, $x_t \in \mathcal{X}$, is assumed to be a sample of the discrete stochastic process $(X_t)_{t=-\infty}^{\infty}$ with alphabet $\mathcal{X}$. The sources considered in this chapter are stationary; that is, the joint statistics of any finite subset of RVs from $(X_t)_{t=-\infty}^{\infty}$ depend only on the relative time differences between the RVs:

Definition 1.1. (Stationary stochastic process) The stochastic process $(X_t)_{t=-\infty}^{\infty}$ is stationary if, given any set of $K$ time instants $\{t_1, \dots, t_K\}$ and any integer $k$, we have
$$P(X_{t_1} = x_1, \dots, X_{t_K} = x_K) = P(X_{t_1+k} = x_1, \dots, X_{t_K+k} = x_K)$$
for any sequence $(x_1, \dots, x_K) \in \mathcal{X}^K$.

We conclude, therefore, that in a stationary source the joint probabilities do not depend on a time shift. For example, $P(X_t = x_t) = P(X_1 = x_t)$ for any $t \in \mathbb{Z}$ and $x_t \in \mathcal{X}$, and $P(X_{t_1} = x_{t_1}, X_{t_2} = x_{t_2}) = P(X_{t_1 - t_2} = x_{t_1}, X_0 = x_{t_2})$ for any $t_1, t_2 \in \mathbb{Z}$ and $(x_{t_1}, x_{t_2}) \in \mathcal{X}^2$.

The assumption of stationarity, or quasi-stationarity (defined as stationarity only for time shifts $|\tau| \leq \tau_{\max}$), holds with good approximation for many real-world sources.

1.2.2 Discrete memoryless sources (DMS)

The discrete memoryless sources (DMS) are a subset of the stationary sources in which the sequence $(X_t)_{t=1}^{\infty}$ is statistically independent. Therefore, we have
$$P(X_{t_1} = x_1, X_{t_2} = x_2, \dots, X_{t_K} = x_K) = P(X_{t_1} = x_1) P(X_{t_2} = x_2) \cdots P(X_{t_K} = x_K),$$
for any finite set of naturals $t_1, t_2, \dots, t_K$ and $(x_1, \dots, x_K) \in \mathcal{X}^K$. Given that $(X_t)_{t=1}^{\infty}$ is stationary, the RVs $X_t$ for $t \in \mathbb{N}$ are identically distributed and, hence, we often use $X$ to denote $X_t$ for a given $t \in \mathbb{N}$. The notations
$$p(x) := P(X = x), \quad x \in \mathcal{X} \quad (1.1)$$
$$p_i := P(X = i), \quad i \in \mathcal{X} \quad (1.2)$$
$$\mathbf{p} := (p_1, \dots, p_M) \quad (1.3)$$
are often used.

Information content of a source

The information content of a source is a central concept in Information Theory. Intuitively, the notion of information should be close to that of uncertainty or of surprise:

1. Information ~ uncertainty ~ surprise
2. An event with probability 1 carries no surprise
3. An event with very low probability carries a lot of surprise

The notion of the information associated with a symbol $x \in \mathcal{X}$ is formalized via the function $I : \mathcal{X} \to \mathbb{R}$ defined as
$$I(x) := \log\left(\frac{1}{p(x)}\right).$$
We highlight the following properties of the function $I$, compatible with the intuitive notion of information associated with the source symbols $x, y \in \mathcal{X}$ at symbol intervals $t_1, t_2 \in \mathbb{Z}$ with $t_1 \neq t_2$:

Properties of the information $I$:

1. $I(x) = 0$ if and only if (iff) $p(x) = 1$
2. $I(x) \geq 0$ for $0 \leq p(x) \leq 1$
3. $I(x) > I(y)$ if $p(x) < p(y)$
4. $I(x) + I(y) = \log\left(\dfrac{1}{p(x)p(y)}\right) = \log\left(\dfrac{1}{P(X_{t_1} = x, X_{t_2} = y)}\right)$

Remarks:

1. Property 4 results from the fact that the RVs $X_{t_1}$ and $X_{t_2}$ are statistically independent.
2. The unit of information is arbitrary. If $\log_2$ is used, the unit of information is called the bit (binary digit). 1 bit is the information associated with each symbol of a binary source with equally likely symbols: $M = 2$, $p_1 = p_2 = \tfrac{1}{2}$, $I(1) = I(2) = \log_2 2 = 1$ bit.
3. By default, $\log$ denotes the logarithm of base 2.
4. When referring to the entropy of a source, the entropy unit used is bits/symbol.

1.2.3 Entropy

Based on the information $I$, we now introduce the most important entity of source coding: the entropy.

Definition 1.2. Entropy of the RV $X$:
$$H(X) := E[I(X)] = \sum_{x \in \mathcal{X}} p(x)\, I(x) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \quad \text{bits}$$

The entropy is therefore the mean value of the information. As we will see in Section 1.2.8, the entropy of the RV $X$ is also a measure of the complexity of $X$, interpretable as the minimum number of bits needed to represent the samples of $X$.

Remarks on the definition of the entropy:

1. $H(X)$ represents the mean information gain per symbol.
2. $0 \log 0 := 0$ (by continuous extension of $p \log p$).
3. In addition to $H(X)$, the notations $H(\mathcal{X})$, $H(p_1, p_2, \dots, p_M)$, and $H(\mathbf{p})$ are also used to denote entropy.

Example 1.1. Consider a DMS with alphabet $\mathcal{X} = \{1, 2, 3, 4\}$ and $p_1 = 1/2$, $p_2 = 1/4$, $p_3 = 1/8$, $p_4 = 1/8$. The entropy of this source is
$$H(X) = \tfrac{1}{2}\log 2 + \tfrac{1}{4}\log 4 + \tfrac{1}{8}\log 8 + \tfrac{1}{8}\log 8 = 1.75 \text{ bit/symbol}$$

Example 1.2. Consider a DMS with alphabet $\mathcal{X} = \{1, \dots, M\}$ with equally likely symbols, i.e., $p_i = 1/M$ for $i = 1, \dots, M$. The entropy of this source is
$$H(X) = M \cdot \tfrac{1}{M} \log M = \log M \text{ bit/symbol}$$

Example 1.3. Consider a binary source with alphabet $\mathcal{X} = \{1, 2\}$ and probabilities $p_X(1) = p$ and $p_X(2) = 1 - p$. Then $H(X) = -p \log p - (1-p)\log_2(1-p)$. Because the entropy of a binary source is widely used in IT, there is a special function $H : [0,1] \to [0,1]$ to represent it:
$$H(p) := -p \log p - (1-p)\log_2(1-p), \quad (1.4)$$
where $p_X(1) = p$. Fig. 1.3 plots the graph of $H$.

Exercise 1.1. Prove that the function $H$ defined in (1.4) is symmetric about $1/2$, concave, and takes the maximum value $H(1/2) = 1$ bit. Interpret the fact that $H$ is maximum for equally likely symbols.
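As a quick numerical check of Examples 1.1-1.3, the following short Python sketch computes the entropy of a pmf and the binary entropy function of (1.4); the function names are illustrative, not part of the notes.

import numpy as np

def entropy(p, base=2):
    """Entropy of a pmf (base 2 by default); 0*log(0) is taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

def binary_entropy(p):
    """Binary entropy function H(p) of Eq. (1.4)."""
    return entropy([p, 1.0 - p])

print(entropy([1/2, 1/4, 1/8, 1/8]))   # Example 1.1: 1.75 bits/symbol
print(entropy([1/4] * 4))              # Example 1.2: log2(4) = 2 bits/symbol
print(binary_entropy(0.5))             # maximum of H: 1 bit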

Figure 1.3: The binary entropy function $H$ of a binary source with $p_X(1) = p$.

Properties of entropy:

1. $H(X) = 0$ iff $p(x) = 1$ for some $x \in \mathcal{X}$
2. $0 \leq H(X) \leq \log M$
3. $H(X) = \log M$ iff $p(x) = 1/M$ for all $x \in \mathcal{X}$ (equally likely symbols, maximum uncertainty)
4. Change of base: let
$$H_b(X) := -\sum_{x \in \mathcal{X}} p(x) \log_b p(x);$$
then $H_b(X) = \log_b(a)\, H_a(X)$
5. $H(p_1, \dots, p_M)$ is concave
6. Let $M \geq 3$ and $\mathcal{X} = \mathcal{X}_A \cup \mathcal{X}_B$ with $\mathcal{X}_A = \{x_1, \dots, x_a\}$ and $\mathcal{X}_B = \{x_{a+1}, \dots, x_M\}$. Define $p_A = p_1 + \dots + p_a$ and $p_B = p_{a+1} + \dots + p_M$. Then
$$H(p_1, \dots, p_M) = H(p_A, p_B) + p_A H\!\left(\frac{p_1}{p_A}, \dots, \frac{p_a}{p_A}\right) + p_B H\!\left(\frac{p_{a+1}}{p_B}, \dots, \frac{p_M}{p_B}\right) \quad (1.5)$$

Property 1: The sufficient condition is immediate and the necessary condition results from the fact that
$$(p \log p = 0) \Rightarrow p \in \{0, 1\}.$$

Properties 2 and 3: $H(X) \geq 0$ is implied by $-p \log p \geq 0$ for $p \in [0,1]$. The proof of the inequality $H(X) \leq \log M$, with $H(X) = \log M$ iff $p_X(x) = 1/M$ for all $x \in \mathcal{X}$, uses the information inequality below (Theorem 1.1).

Theorem 1.1. (Information inequality) Given the probability distributions $p_1, \dots, p_M$ and $q_1, \dots, q_M$,
$$\sum_{i=1}^{M} p_i \log \frac{p_i}{q_i} \geq 0,$$
with equality iff $p_i = q_i$ for $i = 1, \dots, M$.

Remark: we are using the conventions, justified by continuity extension, that $0 \log 0 = 0$ and $p \log \frac{p}{0} = \infty$.

Proof: Taking into account the above remark, we only need to consider the case $p_i, q_i > 0$. Noting that $-\log$ is a strictly convex function, Jensen's inequality (1) implies that
$$\sum_{i=1}^{M} p_i \log \frac{p_i}{q_i} = \sum_{i=1}^{M} p_i \left(-\log \frac{q_i}{p_i}\right) \geq -\log\left(\sum_{i=1}^{M} p_i \frac{q_i}{p_i}\right) = -\log(1) = 0.$$
Since $-\log$ is strictly convex, $\sum_{i=1}^{M} p_i \log \frac{p_i}{q_i} = 0$ iff $\frac{q_i}{p_i} = c$, which implies that $p_i = q_i$, as both are distributions.

Getting back to the proof of Properties 2 and 3, by setting $q_i = 1/M$, for $i = 1, \dots, M$, we have
$$\sum_{i=1}^{M} p_i \log \frac{p_i}{1/M} \geq 0 \quad (1.6)$$
$$-\sum_{i=1}^{M} p_i \log p_i \leq \sum_{i=1}^{M} p_i \log M \quad (1.7)$$
$$H(X) \leq \log M. \quad (1.8)$$
The equality in (1.6) holds iff $p_i = 1/M$ for $i = 1, \dots, M$.

Property 4: Results from $\log_b(x) = \log_b(a) \log_a(x)$.

(1) Given a convex function $\varphi$ and a RV $X$, Jensen's inequality [5] states that $E[\varphi(X)] \geq \varphi(E[X])$. If $\varphi$ is strictly convex, the equality implies that $X = E[X]$.

Property 5: We start by remarking that the term $-p\log p$, for $p \geq 0$, is concave (why?). Since $H(p_1, \dots, p_M)$ is a sum of concave terms defined on the probability simplex (i.e., $\{(p_1, \dots, p_M) : p_i \geq 0, \sum_{i=1}^M p_i = 1\}$), which is a convex set, $H$ is concave.

Property 6: We first provide an interpretation for this property. The usual way to think about the symbol transmitted by the source in a given symbol interval is to consider that it is a sample from a random variable with probability mass function (pmf) $p_X$. We may imagine different, but equivalent, sampling strategies to generate the symbols:

Direct: sample $X$ from $\mathcal{X}$.

Hierarchic: Let $Z$ denote a RV with values in $\{1, 2\}$, with $P(Z = 1) = p_A$ and $P(Z = 2) = p_B$, i.e., the probability masses of the symbols in $\mathcal{X}_A$ and $\mathcal{X}_B$, respectively. To generate a symbol, the source first samples from $Z$ to select the alphabet: $\mathcal{X}_A$ if $Z = 1$ or $\mathcal{X}_B$ if $Z = 2$. Then, in a second step, the source samples a symbol from the selected alphabet with conditional probability $P(X = x \mid Z = 1)$ if $Z = 1$ and $P(X = x \mid Z = 2)$ if $Z = 2$.

The expression on the right hand side of (1.5) is exactly the entropy for the hierarchic sampling strategy: the term $H(p_A, p_B)$ is the entropy of the selector, and the remaining two terms are the mean entropy associated with the conditional pmfs $P(X = x \mid Z = 1)$ and $P(X = x \mid Z = 2)$, termed the conditional entropy of $X$ given $Z$. This and other extensions of the entropy are addressed in Section 1.3.

To prove Property 6, we arrange the terms of $H(X)$ in two groups corresponding to the alphabets $\mathcal{X}_A$ and $\mathcal{X}_B$:
$$\begin{aligned}
H(X) &= -\sum_{i=1}^{a} p_i \log p_i - \sum_{i=a+1}^{M} p_i \log p_i \\
&= -\sum_{i=1}^{a} p_i \log\!\left(\frac{p_i}{p_A}\, p_A\right) - \sum_{i=a+1}^{M} p_i \log\!\left(\frac{p_i}{p_B}\, p_B\right) \\
&= -p_A \log p_A - p_B \log p_B - \sum_{i=1}^{a} p_i \log \frac{p_i}{p_A} - \sum_{i=a+1}^{M} p_i \log \frac{p_i}{p_B} \\
&= H(p_A, p_B) + p_A H\!\left(\frac{p_1}{p_A}, \dots, \frac{p_a}{p_A}\right) + p_B H\!\left(\frac{p_{a+1}}{p_B}, \dots, \frac{p_M}{p_B}\right)
\end{aligned}$$

1.2.4 Source extension

Figure 1.4: Source extension of order $n$. Sequences of $n$ symbols from the alphabet $\mathcal{X}$ are grouped into an extended symbol of the alphabet $\mathcal{X}^n$.

Figure 1.4 illustrates the concept of source extension of order $n$: it consists in grouping $n$ consecutive symbols of the original source to form extended symbols defined in the alphabet $\mathcal{X}^n$. The notation $X^n := (X_1, X_2, \dots, X_n)$ and $x^n \in \mathcal{X}^n$ is used to refer to, respectively, a sequence of $n$ independent random variables distributed as $X$ and a symbol of the extended source.

Example 1.4. Let $\mathcal{X} = \{0, 1\}$; then
$$\mathcal{X}^2 = \{00, 01, 10, 11\}$$
$$\mathcal{X}^3 = \{000, 001, 010, 011, 100, 101, 110, 111\}.$$

Probability of the extended symbols: Let $x^n := (x_1, x_2, \dots, x_n) \in \mathcal{X}^n$ be a symbol of the extended source. Since the source is a DMS, it follows that
$$P(X^n = x^n) = P(X_1 = x_1) P(X_2 = x_2) \cdots P(X_n = x_n) = p(x_1)\, p(x_2) \cdots p(x_n).$$

Entropy of an extended DMS: Let $H(X)$ be the entropy of the original source. The entropy of the extended source is
$$H(X^n) = E[I(X_1, \dots, X_n)] = \sum_{i=1}^{n} E[I(X_i)] = n H(X).$$

This result has a straightforward interpretation: the information of the extended source is $n$ times the information of the original source. We remark that this result is valid only for memoryless sources. If the source has memory, it holds that $H(X^n) \leq n H(X)$.

1.2.5 Source encoder

Figure 1.5: Block diagram of the source and the source encoder.

Figure 1.5 shows the block diagram of the source and of the source encoder. The source produces a sequence of symbols from the alphabet $\mathcal{X}$ at a rate $r = 1/T$ symbols per second. The source encoder produces another sequence by converting each source symbol $x_t \in \mathcal{X}$ into the codeword $C(x_t)$, where $C : \mathcal{X} \to \mathcal{D}^*$ is a mapping from the alphabet $\mathcal{X}$ into $\mathcal{D}^*$, the set of finite-length sequences of symbols from $\mathcal{D} = \{0, 1, \dots, D-1\}$. The length, in symbols from $\mathcal{D}$, of the codeword $C(x)$ is denoted by $l(x)$ or $l_i$. The set of $M$ codewords, denoted by $\mathcal{C}$, is termed the codebook.

Definition 1.3. The average codeword length (codelength for short) is defined as
$$L(C) := E[l(X)] = \sum_{x \in \mathcal{X}} p(x)\, l(x) \quad \text{symbols from the alphabet } \mathcal{D}.$$

When $D = 2$, the codewords are sequences of binary symbols and the unit of $L$ is the bit. We remark, however, that although the units of $H$ and of $L$ are the same, they represent different objects: the former denotes information and the latter binary digits. Some authors use binits to denote binary digits. As seen later in this section, the entropy is the lower bound of the codelength; from this point of view there is some ground to use the same unit for both entities.

Central idea of source coding: minimize the codelength $L$ by assigning the codewords with smaller length to the more likely symbols. This rationale is illustrated in Example 1.5.

Example 1.5. The table below shows the alphabet $\mathcal{X} = \{1, 2, 3, 4\}$, the respective probabilities, and two sets of binary codewords (i.e., $\mathcal{D} = \{0, 1\}$), jointly with the respective lengths in bits.

x    p_X(x)   C_A(x)   l_A(x)   C_B(x)   l_B(x)
1    1/2      00       2        0        1
2    1/4      01       2        10       2
3    1/8      10       2        110      3
4    1/8      11       2        111      3

H(X) = 7/4 bit, L(C_A) = 2 bit, L(C_B) = 7/4 bit

The codewords of code A have codelength $L(C_A) = 2$ bits. By assigning shorter codewords to symbols with higher probability, code B achieves $L(C_B) = 7/4$ bits and is thus more efficient than code A. We remark that $H(X) = L(C_B)$. As shown later in this section, the entropy is in fact the lower bound of the codelength of any useful code.

Example 1.6. Let $\mathcal{X} = \{1, 2, 3, 4, 5\}$ and $\mathcal{D} = \{0, 1, 2\}$ (ternary codes). A table, analogous to the one above, compares two sources $X_C$ and $X_D$ with their respective ternary codes $C_C$ and $C_D$ and codeword lengths $l_C$ and $l_D$; the entropies and codelengths satisfy $H_3(X_C) < L(C_C) = 1.6$ trits and $H_3(X_D) = L(C_D)$. As we will see in Section 1.2.9, both codes $C_C$ and $C_D$ are Huffman codes and thus optimal. Notice that $L(C_C) > H_3(X_C)$ while $L(C_D) = H_3(X_D)$. Do you see any special pattern in the pmf of $X_D$?

Example 1.7. If $l_i = n$ for all $i$ (fixed-length code), then $L = n$.

Example 1.8. If $p_i = D^{-l_i}$ with $l_i \in \{1, 2, \dots\}$, then
$$L = \sum_{i=1}^{M} p_i l_i = -\sum_{i=1}^{M} p_i \log_D p_i = H_D(X).$$
We conclude, therefore, that if the probabilities $p_i$, for $i \in \mathcal{X}$, are D-adic, i.e., powers of $D^{-1}$, then taking $l_i = -\log_D p_i$ yields $L = H_D(X)$.
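To make the comparison in Example 1.5 concrete, here is a small Python sketch (the code-B codewords follow the table above) that computes the average codelength of both codes and compares them with the entropy.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def avg_codelength(p, codewords):
    """Average codeword length L(C) = sum_x p(x) * len(C(x))."""
    return sum(px * len(cw) for px, cw in zip(p, codewords))

p = [1/2, 1/4, 1/8, 1/8]
code_A = ["00", "01", "10", "11"]     # fixed-length code
code_B = ["0", "10", "110", "111"]    # shorter words for likelier symbols

print(entropy(p))                 # 1.75 bits/symbol
print(avg_codelength(p, code_A))  # 2.0 bits
print(avg_codelength(p, code_B))  # 1.75 bits = H(X)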

1.2.6 Classes of codes

Any useful code is decodable in the sense that the sequence produced by the source, $(x_t)_{t=1}^{\infty}$, is recoverable from the sequence of codewords $(C(x_t))_{t=1}^{\infty}$. Below, we present four classes of codes with increasing requirements regarding decodability.

Definition 1.4. (Non-singular code) Given two symbols $x_i, x_j \in \mathcal{X}$,
$$x_i \neq x_j \Rightarrow C(x_i) \neq C(x_j). \quad (1.9)$$

Definition 1.5. (Extension of a code) Let $\mathcal{X}^*$ be the set of finite sequences of symbols from $\mathcal{X}$. The extension $C^*$ of a code $C$ is the mapping $C^* : \mathcal{X}^* \to \mathcal{D}^*$ defined by
$$C^*(x_{i_1} x_{i_2} \cdots x_{i_n}) = C(x_{i_1}) C(x_{i_2}) \cdots C(x_{i_n}),$$
for any $n$ and any $x_{i_1} x_{i_2} \cdots x_{i_n} \in \mathcal{X}^*$.

Definition 1.6. (Uniquely decodable code) A code is uniquely decodable if its extension is non-singular; that is, given two sequences of symbols $x_{i_1} x_{i_2} \cdots x_{i_n}$ and $y_{i_1} y_{i_2} \cdots y_{i_m}$ with $x_{i_1} x_{i_2} \cdots x_{i_n} \neq y_{i_1} y_{i_2} \cdots y_{i_m}$, we have
$$C(x_{i_1}) C(x_{i_2}) \cdots C(x_{i_n}) \neq C(y_{i_1}) C(y_{i_2}) \cdots C(y_{i_m}). \quad (1.10)$$

Definition 1.7. (Prefix code or instantaneous code) A code is called a prefix code, or an instantaneous code, if no codeword is a prefix of any other codeword.

Therefore, prefix codes may be represented by trees: each codeword corresponds to a unique path from the root of the tree to a leaf. Figure 1.6 shows, on the left hand side, the binary tree corresponding to code $C_A$ of Example 1.5 and, on the right hand side, the ternary tree corresponding to code $C_D$ of Example 1.6.

Table 1.1 shows examples of four codes: the singular code does not satisfy (1.9); the non-singular but non-uniquely decodable code satisfies (1.9) but does not satisfy (1.10) — for example, the sequences 114 and 12 both yield the codeword sequence 0010; in the uniquely decodable but non-instantaneous code, in order to decode a source symbol, the decoder has to read the first symbol of the next codeword; finally, the last code is instantaneous, since no codeword is a prefix of another codeword. Table 1.2 shows the Venn diagram of the classes of codes: the non-singular codes are a subset of all codes; the uniquely decodable codes are a subset of the non-singular codes; and the instantaneous codes are a subset of the uniquely decodable codes.

Table 1.1: Classes of codes.

Table 1.2: Venn diagram of the classes of codes.

Figure 1.6: Left: binary tree corresponding to code $C_A$ of Example 1.5. Right: ternary tree corresponding to code $C_D$ of Example 1.6.

1.2.7 Kraft inequality

To minimize the codelength, the codewords should be as short as possible. The fact that, in instantaneous codes, no codeword may be a prefix of any other codeword imposes a constraint on the codeword lengths, expressed by the Kraft inequality.

Theorem 1.2. (Kraft inequality) For any instantaneous code (i.e., a prefix code) over an alphabet of size $D$, the respective codeword lengths $l_1, \dots, l_n$ (measured in symbols from the D-ary alphabet) must satisfy the inequality
$$\sum_{i=1}^{n} D^{-l_i} \leq 1.$$
Conversely, given a set of codeword lengths that satisfy this inequality, there exists an instantaneous code with these word lengths.

Proof (Kraft inequality): Assume, without loss of generality, that $l_1 \leq l_2 \leq \dots \leq l_n = l_{\max}$. The number of leaves of a D-ary tree of depth $l_{\max}$ is $D^{l_{\max}}$. The number of leaves at depth $l_{\max}$ that are descendants of a node at depth $l_i$ is $D^{l_{\max} - l_i}$, and, by the prefix condition, these sets of descendants are disjoint. Therefore, we must have
$$\sum_{i=1}^{n} D^{l_{\max} - l_i} \leq D^{l_{\max}} \iff \sum_{i=1}^{n} D^{-l_i} \leq 1.$$

Figure 1.7: Illustration of the Kraft inequality for a binary tree.

Figure 1.7 illustrates the Kraft inequality for a binary tree and word lengths $l_1 = 1$, $l_2 = 2$, $l_3 = 3$, $l_4 = 3$. Notice that
$$8 = 2^{l_{\max}} \geq 2^{l_{\max}-l_1} + 2^{l_{\max}-l_2} + 2^{l_{\max}-l_3} + 2^{l_{\max}-l_4} = 4 + 2 + 1 + 1 = 8.$$

Proof (converse of the Kraft inequality): We prove it by construction. Let $n_1, n_2, n_3, \dots, n_{l_{\max}}$ denote the number of codewords with lengths, respectively, $1, 2, 3, \dots, l_{\max}$. Consider a D-ary tree with $n_i$ leaves at level $i$. In order to build this tree, we must have:
$$\text{level } 1:\quad n_1 \leq D \iff n_1 D^{-1} \leq 1$$
$$\text{level } 2:\quad n_2 \leq (D - n_1) D \iff n_2 D^{-2} + n_1 D^{-1} \leq 1$$
$$\vdots$$
$$\text{level } l_{\max}:\quad n_{l_{\max}} \leq \left(D^{l_{\max}-1} - n_1 D^{l_{\max}-2} - \dots - n_{l_{\max}-1}\right) D \iff \sum_{j=1}^{l_{\max}} n_j D^{-j} \leq 1 \quad (1.11)$$
Since the inequalities on the right hand side of (1.11) are all implied by the Kraft inequality, the proof is complete.

Exercise 1.2. Prove that if the Kraft inequality is satisfied with equality, then all tree nodes, apart from the leaves, have $D$ children.

Remark 1.1. The Kraft inequality stated in Theorem 1.2 considers finite instantaneous codebooks. The inequality and the converse result are, however, valid for larger classes of codes:

R1) Extended Kraft inequality: applies to infinite prefix codes (see [2]).
R2) McMillan inequality: applies to uniquely decodable codes (see [2]).

The Kraft and McMillan inequalities are often collectively termed the Kraft-McMillan inequality.
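A minimal Python sketch of the Kraft test, assuming only the statement of Theorem 1.2 above (the helper names are illustrative):

def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft inequality, sum_i D^(-l_i)."""
    return sum(D ** (-l) for l in lengths)

def satisfies_kraft(lengths, D=2):
    """True iff an instantaneous D-ary code with these lengths can exist."""
    return kraft_sum(lengths, D) <= 1.0

print(kraft_sum([1, 2, 3, 3]))          # 1.0 -> the complete binary tree of Figure 1.7
print(satisfies_kraft([1, 1, 2]))       # False -> no binary prefix code with these lengths
print(satisfies_kraft([1, 1, 1], D=3))  # True -> the ternary code {0, 1, 2}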

1.2.8 Bounds on optimal codelengths

The Kraft-McMillan inequality imposes constraints on the codeword lengths which imply that the codelength of a code for a DMS cannot be smaller than the entropy.

Theorem 1.3. Any instantaneous code $C$ satisfies $L(C) \geq H_D(X)$. The equality holds iff the probabilities $p_i$, for $i = 1, \dots, M$, are D-adic, i.e., powers of $D^{-1}$.

Proof: Let $q_i := D^{-l_i}/\xi$, for $i = 1, \dots, M$, where $\xi := \sum_{i=1}^{M} D^{-l_i}$. We may then write the information inequality as
$$\sum_{i=1}^{M} p_i \log_D \frac{p_i}{q_i} = \sum_{i=1}^{M} p_i \log_D \frac{p_i}{D^{-l_i}/\xi} = -H_D(X) + L(C) + \log_D \xi \geq 0,$$
and therefore
$$H_D(X) \leq L(C) + \log_D \xi \leq L(C), \quad (1.12)$$
where (1.12) results from $\xi \leq 1$. The equality is satisfied iff $\xi = 1$ and $p_i = D^{-l_i}$.

Theorem 1.3 provides a lower bound for the codelength of any instantaneous code, which is achieved iff the source pmf is D-adic. When the probabilities are not D-adic, we face the integer optimization problem
$$\min_{l_1, \dots, l_M} \sum_{i=1}^{M} p_i l_i \quad \text{subject to:} \quad \sum_{i=1}^{M} D^{-l_i} \leq 1. \quad (1.13)$$
The Huffman algorithm, presented in Section 1.2.9, yields a solution to (1.13). The Shannon code, presented below, although yielding a suboptimal solution, is instrumental in the main result of this section: the Source Coding Theorem.

Definition 1.8. (Shannon code) Consider an instantaneous code with codeword lengths $l_i := \lceil \log_D(1/p_i) \rceil$, that is,
$$\log_D \frac{1}{p_i} \leq l_i < \log_D \frac{1}{p_i} + 1, \quad i = 1, \dots, M.$$
The inequality on the left hand side implies that $p_i \geq D^{-l_i}$ and thus $\sum_{i=1}^{M} D^{-l_i} \leq 1$. Therefore (see the Kraft inequality), such an instantaneous code exists.

Codelength of the Shannon code: multiplying the above inequality by $p_i$ and summing over $i = 1, \dots, M$, we obtain
$$H_D(X) \leq L < H_D(X) + 1.$$
The above pair of inequalities is valid for any discrete memoryless source, namely for its extension of order $n$. Therefore, we may write
$$H_D(X_1, \dots, X_n) \leq L_n < H_D(X_1, \dots, X_n) + 1$$
and
$$H_D(X) \leq \frac{L_n}{n} < H_D(X) + \frac{1}{n},$$
where $L_n$ represents the codelength of the extended source and we used the result $H_D(X_1, \dots, X_n) = n H_D(X)$, valid for DMS sources. The above inequality shows that it is possible to code the source with a codelength per symbol arbitrarily close to the source entropy. This result, jointly with the fact that $H(X)$ is the lower bound for the codelength of any instantaneous code, is essentially the first Shannon Theorem:

Theorem 1.4. (Shannon Source Coding Theorem) A discrete memoryless source with entropy $H$ bits/symbol may be encoded and decoded using instantaneous codes with average codeword length $L = H + \varepsilon$ binits per symbol, where $\varepsilon > 0$ is arbitrarily small. It is impossible to encode this source with instantaneous codes for $L < H$.

Definition 1.9. The code efficiency is defined as
$$\eta = \frac{H}{L}.$$

Example 1.9. (Shannon code for successive source extensions) Consider a binary source $X$ with alphabet $\mathcal{X} = \{1, 2\}$ and probabilities $\mathbf{p} = (0.2, 0.8)$. The corresponding entropy is $H(X) = 0.7219$ bits/symbol. Figure 1.8, left, shows a table with the probabilities and the Shannon codeword lengths of $X^3$ (the 3rd order extension of the original source). The efficiency of this code is $\eta = 3H(X)/L_3 \approx 0.98$. Figure 1.8, right, shows the evolution of $L_n/n$ and of $L_n - nH(X)$. As expected, $L_n/n$ approaches the entropy from above. We remark, however, that the evolution of $L_n/n$ is not monotonic.
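A hedged Python sketch of Example 1.9, assuming the pmf $(0.2, 0.8)$ stated above: it builds the order-$n$ extension, assigns Shannon codeword lengths, and prints $L_n/n$ and the efficiency (the non-monotonic behaviour mentioned above is visible already for small $n$).

import math
from itertools import product

def shannon_lengths(pmf):
    """Shannon codeword lengths l_i = ceil(log2(1/p_i)) for a binary code alphabet."""
    return [math.ceil(math.log2(1.0 / p)) for p in pmf]

def extension_pmf(pmf, n):
    """pmf of the order-n extension of a DMS (products of symbol probabilities)."""
    return [math.prod(seq) for seq in product(pmf, repeat=n)]

p = [0.2, 0.8]
H = -sum(pi * math.log2(pi) for pi in p)       # ~0.7219 bits/symbol
for n in (1, 2, 3, 4):
    pn = extension_pmf(p, n)
    Ln = sum(pi * li for pi, li in zip(pn, shannon_lengths(pn)))
    print(n, Ln / n, n * H / Ln)               # L_n/n approaches H from above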

23 Contents 22 Figure 1.8: Left: probabilities and Shannon code word lengths of X 3 (3rd order extension of the original source). Right: evolution of L n /n and of L n nh(x). Remark 1.2. (Instantaneous versus uniquely decodable non-instantaneous codes) Source coding uses almost exclusively instantaneous codes. A natural question is whether there is any advantage in using uniquely decodable non-instantaneous codes instead of instantaneous ones, for example in terms of codelength. The answer in negative because any uniquely decodable code satisfies the Kraft inequality (see comment R2 in Remark 1.1) and, therefore, we may construct instantaneous codes with the same code word lengths of the, respective, non-instantaneous ones (Theorem 1.2). Therefore, the uniquely decodable non-instantaneous codes may be replaced with instantaneous ones yielding the same codelengths. Given that instantaneous codes are much simpler from the computational, algorithmic, and hardware implementation points of view, they are by large the preferred choice Huffman code For a given DMS, the Huffman code is optimal in the sense that there is no other instantaneous code with smaller codelength. The Huffman iteratively grows a D-tree from the leaves to the root as follows:

Huffman algorithm:

1. Order the alphabet symbols by nonincreasing probability.
2. Merge the $D$ symbols with the smallest probabilities, thus obtaining a new tree node and a new alphabet with $D-1$ fewer symbols. The probability of the new symbol is the sum of the probabilities of the merged symbols.
3. If the alphabet contains more than one symbol, go to step 1.
4. For each node, apart from the leaves, assign the symbols $0, 1, \dots, D-1$ to the edges departing from that node. The order of the assignment is irrelevant.
5. The codeword of a given symbol is obtained by reading the code symbols along the path that goes from the root to the leaf corresponding to that symbol.

Example 1.10. (Huffman algorithm) Figure 1.9 shows the tree generated by the Huffman algorithm for a DMS with $M = 8$ symbols. The algorithm runs in 7 iterations.

Figure 1.9: Tree generated by the Huffman algorithm.

Table 1.3 shows the retrieved codewords and the respective codeword lengths, both for the Huffman and the Shannon codes. The codelength of the Shannon code is 2.2 bits/symbol, and that of the Huffman code is smaller. We then have (why?)
$$H < L(\text{Huffman}) < L(\text{Shannon}) < H + 1$$
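A minimal Python sketch of the binary ($D = 2$) case of the algorithm above, using a heap of partial codebooks (for $D > 2$ the alphabet padding discussed in Remark 1.3 below would also be needed); function and variable names are illustrative.

import heapq
from itertools import count

def huffman_code(pmf):
    """Binary Huffman code: returns {symbol: codeword} for pmf {symbol: prob}."""
    tiebreak = count()  # avoids comparing dicts when probabilities are equal
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)  # merge the two least likely "symbols"
        p1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

pmf = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}     # pmf of Example 1.5
code = huffman_code(pmf)
L = sum(pmf[s] * len(cw) for s, cw in code.items())
print(code, L)                                   # L = 1.75 bits/symbol = H(X)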

Table 1.3: Retrieved Huffman code and codeword lengths of the Huffman and Shannon codes.

The efficiency of the Huffman code is $\eta = H / L(\text{Huffman})$.

Exercise 1.3. Prove that the codelength of a Huffman code is given by
$$L = \sum_{i=1}^{n_{\text{iters}}} P(\text{merged symbols}(i)),$$
where $n_{\text{iters}}$ is the number of iterations of the Huffman algorithm and $P(\text{merged symbols}(i))$ is the probability of the symbols merged at the $i$-th iteration. For example, with reference to Fig. 1.9, $L$ is obtained by summing the probabilities of the nodes created at each of the 7 iterations.

Remark 1.3. (Huffman codes with $D > 2$) The Huffman algorithm is optimal provided that $D$ symbols are merged in the last iteration. This condition translates to $M = D + (D-1)(k-1)$, where $k \in \{1, 2, \dots\}$ denotes the number of iterations. It is always satisfied for $D = 2$, but not necessarily for $D > 2$. When the condition is not satisfied, and in order to have optimality, the alphabet shall be augmented with zero-probability symbols until the condition is satisfied.

Example 1.11. (Huffman algorithm for $D = 3$) Figure 1.10, left, shows an incorrect application of the Huffman algorithm to an alphabet with $M = 4$ symbols: the number of iterations is $k = 2$ and, thus, the condition of Remark 1.3 would require $3 + (3-1)(2-1) = 5$ symbols. The correction, shown in Figure 1.10, right, consists in including a fifth symbol, $x = 5$, with probability zero. Now we have $5 = 3 + (3-1)(2-1) = 5$.

Figure 1.10: Incorrect and correct application of the Huffman algorithm.

We finish this section with a formal statement of the optimality of Huffman codes.

Theorem 1.5. Huffman coding is optimal; that is, if $C^\ast$ is a Huffman code and $C'$ is any other uniquely decodable code, then $L(C^\ast) \leq L(C')$.

See [2] for a proof. Assuming that $p_1 \geq p_2 \geq \dots \geq p_M$, it starts by showing that an optimal code may be found in a set of codes with the following properties:

1. $p_i > p_j \Rightarrow l_i \leq l_j$.
2. The two longest codewords have the same length.
3. The two longest codewords differ only in the last symbol and correspond to the two least likely symbols.

The remaining part of the proof uses induction to show that finding an optimal code for the probabilities $p_1, p_2, \dots, p_{M-2}, p_{M-1}, p_M$ is equivalent to finding an optimal code for the probabilities $p_1, p_2, \dots, p_{M-2}, p_{M-1} + p_M$.

1.3 Joint and conditional entropy. Mutual information

The entropy of a random variable took center stage in the source coding of DMSs. The coding of sources with memory, addressed in Section 1.4, and channel coding, addressed in Section 1.5, call for new IT definitions, concepts, and entities able to capture the statistical links among the random variables of the underlying stochastic processes. In this section, we introduce three such entities: the joint entropy, the conditional entropy, and the mutual information. We adopt a concise style in the presentation, focusing on the fundamental properties and insights. For more details, see [2, Ch. 2]. In the following, the RVs $X$ and $X_1, \dots, X_n$ take values in $\mathcal{X}$, and $Y$ takes values in $\mathcal{Y}$.

Definition 1.10. Joint entropy (2):
$$H(X, Y) := -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y) = -E[\log p(X, Y)]$$

Remark 1.4. The joint entropy of $(X, Y)$ can be looked at as the entropy of a RV taking values in the Cartesian product $\mathcal{X} \times \mathcal{Y}$.

Definition 1.11. Conditional entropy:
$$\begin{aligned}
H(Y \mid X) &:= \sum_{x \in \mathcal{X}} p(x)\, H(Y \mid X = x) \\
&= -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y \mid x) \log p(y \mid x) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x) \\
&= -E[\log p(Y \mid X)]
\end{aligned}$$

(2) We use the same symbol $p$ to denote different pmfs; the meaning of each pmf is inferred from the dummy variables used in the respective arguments. Although this notation is ambiguous, it leads to much simpler and neater formulas, whose benefits for the reader are worth the ambiguity.

28 Contents 27 Remark 1.5. The conditional entropy of H(Y X) can interpreted as the mean uncertainty of Y after observing X. Properties of the conditional entropy: P1) If Y = f(x), f : X Y, then H(Y X) = 0 (why?). However, H(Y X) = 0 does not imply H(X Y ) = 0. Give an example. P2) If X and Y are independent, then H(X Y ) = H(X) (prove). Chain rule: H(X, Y ) = E[log p(x, Y )] = E[log p(x)p(y X)] = E[log p(x)] E[log p(y X)] = H(X) + H(Y X) H(X, Y ) = E[log p(x, Y )] = E[log p(y )p(x Y )] = E[log p(y )] E[log p(x Y )] = H(Y ) + H(X Y ) Independent RVs: If X e Y are independent, then H(X, Y ) = H(X) + H(Y ) (prove and interpret). Exercise 1.4. The RVs X and Y take values in the alphabet {1, 2, 3, 4} and have the following joint distribution: Determine: 1. H(X), H(Y ), H(X Y ). 2. Check that H(X Y ) H(Y X).

29 Contents 28 Figure 1.11: Joint and marginal distributions of RVs X and Y. 3. Check that H(X) H(X Y ) = H(Y ) H(Y X). Justify qualitatively this equality. Definition Joint entropy of n RVs. The chain rule:: Consider n RVs X 1, X 2,..., X n X n. Their joint entropy is defined as H(X 1, X 2,..., X n ) := x 1,...,x n X n p(x 1,..., x n ) log p(x 1,..., x n ) = E[log p(x 1, X 2,..., X n )] Using the chain rule successively yields p(x 1, x 2..., x n ) = p(x 1 x 2..., x n )p(x 2 x 3..., x n )... p(x n 1 x n )p(x n ) and thus H(X 1, X 2,..., X n ) = H(X 1 X 2,..., X n ) + H(X 2 X 3,..., X n ) +... H(X n 1 X n ) + H(X n ) Independent RVs: If X 1, X 2,..., X n X n are independente, then (prove) H(X 1, X 2,..., X n ) = n H(X i ). Definition (Mutual information between X and Y :) I(X; Y ) = H(X) H(X Y )

Remark 1.6. The mutual information $I(X; Y)$ quantifies the amount of information about $X$ contained in $Y$. Equivalent interpretation: $I(X; Y)$ quantifies the reduction in the uncertainty about $X$ that is obtained by observing $Y$.

Properties of the mutual information:

P1) $I(X; Y)$ as a function of $p(x, y)$, $p(x)$, $p(y)$ for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$:
$$\begin{aligned}
I(X; Y) &:= H(X) - H(X \mid Y) \\
&= -E[\log p(X)] + E[\log p(X \mid Y)] \\
&= E\!\left[\log\frac{p(X \mid Y)}{p(X)}\right] = E\!\left[\log\frac{p(X, Y)}{p(X)\, p(Y)}\right] \\
&= \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \quad (1.14)
\end{aligned}$$

P2) $I(X; Y) = I(Y; X)$. Results from the symmetry of the expression (1.14).

P3) $I(X; Y) \geq 0$. Results from the fact that $p(x, y)$ and $p(x)p(y)$ are probability distributions and from applying the information inequality to
$$I(X; Y) = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.$$

P4) $I(X; Y) \geq 0$ implies $H(X) \geq H(X \mid Y)$. That is, conditioning can never increase the entropy. Interpret this result.

P5) $I(X; Y) = 0$ iff $p(x, y) = p(x)\, p(y)$, that is, iff $X$ and $Y$ are independent (prove).

P6) $I(X; Y) = H(X) + H(Y) - H(X, Y)$ (prove).

P7) Based on the equality $I(X; Y) = H(X) + H(Y) - H(X, Y)$, interpret $I(X; Y)$ as the gain, in terms of codelength, obtained by jointly coding $X$ and $Y$ relative to the independent coding of $X$ and $Y$.

Figure 1.12 highlights graphically the relations $H(X) = H(X \mid Y) + I(X; Y)$ and $H(Y) = H(Y \mid X) + I(X; Y)$.
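A short Python sketch computing these quantities from a joint pmf; the matrix below is an arbitrary illustration (it is not the distribution of Figure 1.11), and it checks that the definition (1.14) and property P6 give the same value of $I(X;Y)$.

import numpy as np

def H(p):
    """Entropy in bits of a pmf given as an array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# Illustrative joint pmf p(x, y); rows index x, columns index y.
pxy = np.array([[1/8, 1/16, 1/32, 1/32],
                [1/16, 1/8, 1/32, 1/32],
                [1/16, 1/16, 1/16, 1/16],
                [1/4, 0, 0, 0]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

I_p6 = H(px) + H(py) - H(pxy)                                        # property P6
nz = pxy > 0
I_def = np.sum(pxy[nz] * np.log2(pxy[nz] / np.outer(px, py)[nz]))    # Eq. (1.14)
print(H(px), H(py), H(pxy), I_p6, I_def)   # the two I(X;Y) values coincide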

31 Contents 30 Figure 1.12: Graphical representation of the mutual information. 1.4 Discrete sources with memory Until now, we have considered only DMSs modeled by independent stochastic processes. However, most sources of interest, if not all, have memory in the sense that the variables underlying the respective stochastic processes have probabilistic dependences. In fact, it is these statistic dependencies that encode most of the information in most types or source, such as a text written in a given language, a computer program, or a Morse code. In this section, we address very briefly the coding or discrete sources with memory. The first ingredient to attack this problem is the statistic model for the source. The complete characterization of the stochastic process implies the knowledge of the joint probability of P (X t1 = x 1, X t2 = x 2,..., X tk = x K ) for any set of integers t 1, t 2,..., t K and any set of symbols (x 1,..., x K ) X K. This is a formidable amount of information not accessible in most applications, unless the source probabilities are well approximated by treatable models, in the complexity sense. This the case of stationary processes, defined in Section 1.1 and Markovian chains Markov chains Markov chains are Markov processes with a countable state space. The Markovian property means that the probability of X t given X j for j < t (i.e., the past) depends only on X j for t n j < t (i.e., a local past) for some positive integer n. Formally: Definition (Markov chain of order n) The stochastic process (X t ) t= is termed a Markov chain of of order n if it satisfies the following property: P (X t = x t X t 1 = x t 1, X t 2 = x t 2,... ) = P (X t = x t X t 1 = x t 1, X t 2 = x t 2,..., X t n = x t n )

32 Contents 31 for any t and sequence x t, x t 1 X. A stochastic process is invariant if its conditional probabilities do not depend on a time shift. It happens that many sources or the real word are well approximated by invariant Markov chains. Next, we present a formal definition of invariant Markov chain of order n. Definition (Invariant Markov chain of order n) The process (X t ) t=1 is an invariant Markov chain of order n if it has a memory of size n and its conditional probabilities do not depend explicitly of time, that is P (X t+1 = x n+1 X t = x n,..., X t n+1 = x 1, X t n = x 0,... ) (1.15) =P (X t+1 = x n+1 X t = x n,..., X t n+1 = x 1 ) (1.16) =P (X n+1 = x n+1 X n = x n,..., X 1 = x 1 ) (1.17) for any t Z and any..., x 0, x 1,..., x n+1 X. The joint probability in (1.16) is a consequence of the Markovianity of order n and probability in (1.17) is a consequence of the invariance. Invariant Markov chain of order 1: From Definition 1.15, we conclude that an invariant Markov chain of order 1 satisfies P (X n = j X n 1 = i) = P (X 2 = j X 1 = i), i, j X, and therefore it is characterized by a set of M M probabilities termed the transition matrix: P = [P i,j ], P i,j = P (X 2 = j X 1 = i), i, j X. The transition matrix is a stochastic matrix: 0 P i,j 1 and M P i,j = j=1 M P (X 2 = j X 1 = i) = 1. j=1 Figure 1.13, left, shows is a transition matrix of an invariant Markov chain of order 1 with M = 3. The rows of P are indexed by X 1 (the current state) and the columns are indexd by X 2 (the future). In the right hand, the same transition matrix is represented by a graph. The

33 Contents 32 Figure 1.13: Left: transition matrix of an invariant Markov chain or order 1. Right: Graph representation of the transition matrix. states 1,2,3 represent the the current state and the arrows with the respective weights on top represent the transition probabilities. Joint Probability: Consider an invariant Markov chain of order 1 with transition matrix P, the set of consecutive time instants {1, 2,..., t}, and define p 1 (n) P (X n = 1) p 2 (n) P (X n = 2) p(n) := :=... p M (n) P (X n = M) t P (X t = x t,..., X 1 = x 1 ) = p x1 (1) P xu 1,xu. u=2 Exercise 1.5. Given the transition matrix P shown in Fig and the vector of probabilities p T (1) = [0.2, 0.3, 0.6], compute P (X 4 = 2, X 2 = 3, X 1 = 1). Propagation of the probability: Consider an invariant Markov chain of order 1 with transition matrix P. We have that (prove) We have that (prove) M p j (t + 1) = P i,j p i (t) = [P T ] (j,:) p(t).,

where $[P^T]_{(j,:)}$ denotes the $j$-th row of $P^T$. Therefore, it holds that
$$\mathbf{p}(t+1) = P^T \mathbf{p}(t) = (P^T)^t\, \mathbf{p}(1).$$

Definition 1.16. (Stationary distribution) Consider an invariant Markov chain of order 1 with transition matrix $P$. Let $\mathbf{p}(t)$ be a distribution such that
$$\mathbf{p}(t) = P^T \mathbf{p}(t). \quad (1.18)$$
If such a $\mathbf{p}(t)$ exists, the process is said to be in a stationary state, and $\mathbf{p}(t)$ is termed a stationary distribution, represented by $\mathbf{p}(\infty)$. We remark that equation (1.18) is an eigenvalue-eigenvector problem corresponding to the unit eigenvalue.

Exercise 1.6. Prove that any transition matrix has a unit eigenvalue.

Example 1.12. Consider an invariant Markov chain of order 1 with two states, $\mathcal{X} = \{1, 2\}$, and transition matrix
$$P = \begin{bmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{bmatrix}$$
with $0 < \alpha, \beta < 1$. The eigenvalues of $P$ are $\{1,\, 1 - \alpha - \beta\}$ (show). The eigenvector $\mathbf{p} := [p_1\ p_2]^T$ corresponding to the unit eigenvalue satisfies the equation $P^T \mathbf{p} = \mathbf{p}$, that is,
$$p_1(1 - \alpha) + \beta p_2 = p_1 \iff p_2 = \frac{\alpha}{\beta}\, p_1. \quad (1.19)$$
Using the fact that $\mathbf{p}$ is a probability distribution, that is, $p_1 + p_2 = 1$, we obtain
$$\mathbf{p}(\infty) = \begin{bmatrix} \dfrac{\beta}{\alpha + \beta} \\[2mm] \dfrac{\alpha}{\alpha + \beta} \end{bmatrix}.$$

Question: In (1.19), only the first row of the equation $P^T \mathbf{p} = \mathbf{p}$ was used. Why? Could we use the second?
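A small Python sketch, assuming the two-state chain of Example 1.12 with illustrative values of $\alpha$ and $\beta$: it finds the stationary distribution as the unit-eigenvalue eigenvector of $P^T$ and checks it against the probability propagation $(P^T)^t\,\mathbf{p}(1)$.

import numpy as np

def stationary_distribution(P):
    """Stationary distribution p = P^T p of an order-1 chain (unit-eigenvalue eigenvector)."""
    w, V = np.linalg.eig(P.T)
    p = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return p / p.sum()

alpha, beta = 0.3, 0.1
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
print(stationary_distribution(P))                  # [beta, alpha]/(alpha+beta) = [0.25, 0.75]
print(np.linalg.matrix_power(P.T, 200) @ [1, 0])   # p(t) converges to the same limit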

35 Contents 34 Definition (Irreducible Markov chain ) A Markov chain is irreducible if it possible to reach, with positive probability, any state form any other state in a finite number of steps. Formally, the state j is accessible from state i if there exists an integer t ij 0 such that P (X tij = j X 0 = i) > 0 (i.e., [P t ij ] i,j > 0). Definition (Aperiodic Markov chain ) A Markov chain is aperiodic if it does not contain periodic states: a state i is periodic with period k if any return to state i must occur in multiples of k time steps; that is k = gcd{n : [P n ] i,i > 0}, where gcd denotes the greatest common divisor. Theorem 1.6. Let (X t ) t= be an irreducible and aperiodic invariant Markov chain of order 1. Then, 1. (X t ) t= has a unique stationary distribution. 2. lim t p(t) = p( ) independently of the initial distribution. 3. The stochastic process is stationary Entropy rate and conditional entropy rate The entropy rate and the conditional entropy rate are extensions, for sources with memory, of the entropy for RVs presented in Section The entropy rate is the average entropy per source symbol. The conditional entropy rate is the entropy of a given RV given all RVs in the past. The formal definition of the two forms o entropy is provided below. Definition (Entropy rate) Given a stochastic process (X t ) t=1, its entropy rate is defined as when the limit exits. 1 H(X) := lim t t H(X 1, X 2,..., X t ), (1.20) Exercise 1.7. Show that H(X) = H(X) for a DMS source.

36 Contents 35 Definition (Conditional entropy rate) Given a stochastic process (X t ) t=1, its conditional entropy rate is defined as when the limit exits. H (X) = lim t H(X t X t 1,..., X 1 ). (1.21) Exercise 1.8. Show that H (X) = H(X) for a DMS source. The next theorem states that, for stationary processes, the two forms of entropy are equal. Theorem 1.7. For stationary stochastic processes, the limits (1.20) and (1.21) exist and are equal, that is, H(X) = H (X). Moreover, 1. H(X t X t 1,..., X 1 ) is non-increasing in t. 2. H(X t X t 1,..., X 1 ) 1H(X t t, X t 1,..., X 1 ) for all t t H(X t, X t 1,..., X 1 ) is non-increasing in t. Proof of Theorem 1.7.1: H(X t+1 X t,..., X 2, X 1 ) H(X t+1 X t,..., X 2 ) (1.22) = H(X t X t 1,..., X 2, X 1 ), (1.23) where (1.22) follows from the fact that conditioning cannot increase the entropy and (1.23) follows from stationarity. Proof of Theorem 1.7.2: 1 t H(X t, X t 1,..., X 1 ) = 1 t 1 t t H(X k X k 1,..., X 1 ) (1.24) k=1 t H(X t X t 1,..., X 1 ) (1.25) k=1 = H(X t X t 1,..., X 1 ), (1.26)

37 Contents 36 where (1.24) follows from the chain rule for entropy and (1.25) follows from the fact that conditioning canot increase the entropy and from stationarity. Proof of Theorem 1.7.3: H(X t+1, X t,..., X 1 ) = H(X t+1 X t,..., X 1 ) + H(X t,..., X 1 ) (1.27) H(X t X t 1,..., X 1 ) + H(X t,..., X 1 ) (1.28) 1 t H(X t,..., X 1 ) + H(X t, X t,..., X 1 ) (1.29) = 1 + t H(X t,..., X 1 ), (1.30) t where (1.27) follows from the chain rule for entropy, (1.28) follows from the fact that conditioning cannot increase the entropy, and (1.29) follows form from Theorem Finally, from (1.30) if follows that t H(X t+1, X t,..., X 1 ) 1 t H(X t,..., X 1 ). Proof of Theorem 1.7: The sequences H(X t,..., X 1 )/t, defined in (1.20), and H(X t X t1..., X 1 ), defined in (1.21), are non-increasing and non-negative, so they converge. We now show that the limit is the same: 1 H(X) = lim t t H(X 1, X 2,..., X t ) (1.31) 1 = lim t t 1 = lim t t t H(X k X k 1,..., X 1 ) (1.32) }{{} a k t a k (1.33) k=1 k=1 = lim t a t (1.34) = H(X), (1.35) where (1.32) follows from the chain rule for entropya and (1.32) follows from the Cesáro mean 3. 3 Cesáro Mean: if lim t a t = a and b t = (1/t) t k=1 a k, then lim t b t = a.

Entropy rate of a stationary Markov chain of order 1: Since the Markov chain is stationary, it is invariant and thus characterized by its transition matrix $P$. Assuming that the stationary distribution is unique, its entropy rate is
$$\begin{aligned}
H(\mathcal{X}) = H'(\mathcal{X}) &= \lim_{t \to \infty} H(X_t \mid X_{t-1}, \dots, X_1) \\
&= \lim_{t \to \infty} H(X_t \mid X_{t-1}) = H(X_2 \mid X_1) \\
&= \sum_{i=1}^{M} H(X_2 \mid X_1 = i)\, P(X_1 = i) \\
&= -\sum_{i=1}^{M} p_i(\infty) \sum_{j=1}^{M} P_{ij} \log P_{ij}
\end{aligned}$$

Example 1.13. Consider an invariant Markov chain of order 1 with alphabet $\mathcal{X} = \{1, 2\}$ and transition matrix
$$P = \begin{bmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{bmatrix}.$$
Using the stationary distribution computed in Example 1.12, the conditional entropy rate is given by
$$H'(\mathcal{X}) = \frac{\beta}{\alpha + \beta} H(\alpha) + \frac{\alpha}{\alpha + \beta} H(\beta).$$
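A hedged numerical check of Example 1.13 in Python, with illustrative $\alpha$ and $\beta$: the general formula $-\sum_i p_i(\infty) \sum_j P_{ij}\log P_{ij}$ is compared with the closed form above.

import numpy as np

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

def entropy_rate(P):
    """H(X2|X1) = -sum_i p_inf(i) sum_j P_ij log2 P_ij for an order-1 chain."""
    w, V = np.linalg.eig(P.T)
    p_inf = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    p_inf = p_inf / p_inf.sum()
    row_H = [-sum(q * np.log2(q) for q in row if q > 0) for row in P]
    return float(np.dot(p_inf, row_H))

alpha, beta = 0.3, 0.1
P = np.array([[1 - alpha, alpha], [beta, 1 - beta]])
closed_form = beta/(alpha+beta)*binary_entropy(alpha) + alpha/(alpha+beta)*binary_entropy(beta)
print(entropy_rate(P), closed_form)   # both ~0.57 bits/symbol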

1.4.3 Coding stationary sources

Consider a stationary source with memory and the associated stochastic process $(X_t)_{t=-\infty}^{\infty}$. Let $X^n := (X_1, \dots, X_n)$ be a vector of RVs corresponding to an extended symbol of order $n$, taking values in $\mathcal{X}^n$, as illustrated in Fig. 1.4. The symbols of the extended source may be encoded using a Shannon code with codelength $L_n(C_s)$, thus satisfying
$$H(X^n) \leq L_n(C_s) < H(X^n) + 1 \quad (1.36)$$
$$\frac{H(X^n)}{n} \leq \frac{L_n(C_s)}{n} < \frac{H(X^n)}{n} + \frac{1}{n}. \quad (1.37)$$

Since the source is stationary, both the left hand side and the right hand side of (1.37) converge to the entropy rate $H(\mathcal{X})$. Furthermore, since $H(X^n)/n$ is non-increasing (Theorem 1.7), we may write
$$\frac{H(X^n)}{n} = H(\mathcal{X}) + \delta(n),$$
where $\delta(n) \geq 0$ and $\lim_{n \to \infty} \delta(n) = 0$. Therefore, we may write
$$H(\mathcal{X}) \leq \frac{L_n(C_s)}{n} < H(\mathcal{X}) + \delta(n) + \frac{1}{n}, \quad (1.38)$$
which enables us to state the Shannon Source Coding Theorem for sources with memory:

Theorem 1.8. (Shannon Source Coding Theorem for stationary sources) A stationary discrete source with entropy rate $H$ bits/symbol may be encoded and decoded using instantaneous codes with average codeword length $L = H + \varepsilon$ binits per symbol, where $\varepsilon > 0$ is arbitrarily small. It is impossible to encode this source with instantaneous codes for $L < H$.

40 Contents 39 Now instead of coding just X 2 given X 1, let us code X n+1, X n,..., X 2 given X 1, denoted as X n X 1, We may then write H(X n X 1 ) L(X n X 1 ) < H(X n X 1 ) + 1. But since H(X n X 1 ) = nh(x 2 X 1 ) (prove), we obtain H(X 2 X 1 ) L(X n X 1 ) < H(X 2 X 1 ) + 1 n. (1.40) Having in mind that for stationary sources H(X) = H(X ) and for invariant Markov chains of 1 order, H(X ) = H(X 2 X 1 ), then by comparing (1.38) with (1.40), we conclude that the coding with memory removes the slack δ(n) in the former bounds. Example The table below shows the transition matrix and optimal Huffman codes for each state of the Markov chain. Figure 1.14: Transition matrix and optimal Huffman codes for each state of thr Markov chain. The stationary distribution and the conditional entropy for eaxh state are H(X 2 X 1 = 1) = H(0.2, 0.4, 0.4) = bits/sym p( ) = H(X 2 X 1 = 2) = H(0.3, 0.5, 0.2) = bits/sym H(X 2 X 1 = 3) = H(0.6, 0.1, 0.3) = bits/sym The codelengths for the sate-dependent codes are L(X 2 X 1 = 1) = 1.6 bits/sym L(X 2 X 1 = 2) = 1.5 bits/sym L(X 2 X 1 = 3) = 1.4 bits/sym. The conditional entropy rate, the codelength, and the efficiency are H (X) = bits/sym L = bits/sym η = =

41 Contents Channel coding Figure 1.15: Block diagram of a digital communication system. Fundamental question: What is maximum reliable transmission rate over an unreliable channel? Answer to the fundamental question: The channel capacity (Shannon s second Theorem). The possibility of transmitting reliably over a noisy channel without reducing dramatically the transmission rate of information is counterintuitive. Common sense says that if the channel introduces an error in a symbol with a probability p > 0, then in n independent transmissions, a number of pn symbols should be corrupted on average. Shannon proved that this belief

42 Contents 41 was wrong as long as the communication rate was below the channel capacity. This is one of the most beautiful and deep results in communication theory, which have underlaid huge research efforts to build channel encoding and decoding procedures (for example, block codes, cyclic codes, convolucional codes, turbo codes, and low parity check codes) ensuring a reliable transmission over an unreliable channel at rates close to the channel capacity The channel encoding and decoding procedures are implemented by the channel encoder and the channel decoder, respectively, both shown inside the dashed box in Fig The channel encoder receives sequences of symbols d t D from the source encoder and generates new sequences of symbols x t X, which are sent to the channel. The new sequences encodes the input sequences jointly with redundant information to be used by the decoder to detect sequences d t D. The objective of the encoding and decoding procedure is that d t = d t with high probability. Organization: This section is organized as follows. Section introduces the model of a discrete memoryless channel and the respective the channel transition matrix, defines the channel capacity, which resumes the ability of the channel to transmit reliable information. Section states the channel-coding Theorem [1], which gives necessary and sufficient conditions for the existence of reliable transmission over an unreliable channel. The material herein presented is partially inspired by the books [3, 2, 4] Model of a discrete memoryless channel (DMC) Figure 1.16: Model of a discrete memoryless channel (DMC). With reference to Fig. 1.15, the channel is a mathematical model for the communication between the input of the modulator and the output of the demodulator. This model is schematized in Fig The symbols at the input and output of the channel are, respectively, modeled by random variables X, taking values in X := {1,..., M} X, and Y, taking values in Y := {1,..., N}. The channel is characterized by the probabilities at output of the channel

43 Contents 42 conditioned to the input, termed transition probabilities: p(y j x i ) := P (Y = x j X = x i ), i = 1,..., M, j = 1,..., N. Definition (Discrete memoryless channel (DMC)) The channel is said to be a discrete because the alphabets X and Y are finite and memoryless because the output sequence is conditionally independent of the input sequence: P (Y t = y t,..., Y 1 = y 1 X t = x t,..., X 1 = x 1 ) = t P (Y i = y i X i = x i ), for any x 1, x 2,..., x t X and y 1, y 2,..., y t Y. Definition The transition matrix, of size M N, holds the conditional probabilities as follows: P = p(y 1 x 1 ) p(y 2 x 1 )... p(y N x 1 ).. p(y 1 x M ) p(y 2 x M )... p(y N x M )... The rows and columns of P are associated with, respectively, with the RV X at the input of the channel and the output RV Y at the output of the channel. Matrix P = [P i,j ] is stochastic; that is, P i,j 0 and N j=1 P i,j = 1. The channel output probabilities p(y) := P (Y = y), for y Y, are computed from the intput probabilities p(x) := P (X = x), for x X, as p(y) = x X p(x, y) = x X p(y x)p(x) or, in terms of transition matrix P, [p(y 1 ) p(y 2 )... p(y N )] = [p(x 1 ) p(x 2 )... p(x N )]P.

Definition. (Channel capacity) The channel capacity is the maximum of the mutual information $I(X; Y)$ between $X$ and $Y$ over all input distributions $\{p(x), x \in \mathcal{X}\}$:
$$C = \max_{\{p(x),\, x \in \mathcal{X}\}} I(X; Y) \quad \text{bits/transmission}$$

The channel capacity is the highest rate, in bits per channel use, at which information can be transmitted with arbitrarily low probability of error. This fundamental result is formalized by Shannon's second theorem, stated in Section 1.5.2.

Properties of the channel capacity: The following properties result from the properties of the mutual information:

P1) $C \geq 0$, since $I(X; Y) \geq 0$.
P2) $C \leq \log |\mathcal{X}|$, since $I(X; Y) \leq H(X)$.
P3) $C \leq \log |\mathcal{Y}|$, since $I(X; Y) \leq H(Y)$.

Example 1.15. The noiseless binary channel has transition matrix
$$P = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$
In this channel the output reproduces exactly the input. Therefore, we anticipate that its capacity is 1 bit/transmission. In fact, we have $H(X \mid Y) = 0$ and thus
$$C = \max_{p(x_1)} I(X; Y) = \max_{p(x_1)} H(X) = 1 \quad \text{bit/transmission}.$$
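To make the definition concrete, the sketch below computes $I(X;Y)$ for a given input pmf and transition matrix, and approximates the capacity of a binary-input channel by a simple grid search over $p(x_1)$; the function names and the grid-search approach are illustrative, not part of the notes.

import numpy as np

def mutual_information(px, P):
    """I(X;Y) in bits for input pmf px and transition matrix P (rows: p(y|x))."""
    pxy = px[:, None] * P                 # joint p(x, y)
    py = pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz])))

def capacity_binary_input(P, grid=10001):
    """Approximate C = max_{p(x)} I(X;Y) for a 2-input channel by a grid search."""
    qs = np.linspace(0.0, 1.0, grid)
    return max(mutual_information(np.array([q, 1 - q]), P) for q in qs)

P_noiseless = np.eye(2)
print(capacity_binary_input(P_noiseless))   # ~1 bit/transmission, as in Example 1.15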

Example 1.16. The binary symmetric channel (BSC) is characterized by the following transition matrix (the respective graph is also shown):
$$P = \begin{bmatrix} 1-p & p \\ p & 1-p \end{bmatrix}$$
To compute the capacity of this channel, we note that:
1) $I(X; Y) = I(Y; X) = H(Y) - H(Y \mid X)$
2) $H(Y \mid X) = H(Y \mid X = x_1)\, p(x_1) + H(Y \mid X = x_2)\, p(x_2) = H(p)\, p(x_1) + H(p)\, p(x_2) = H(p)$
3) $H(Y) = H(q)$, where $q = (1-p)\, p(x_1) + p\,(1 - p(x_1))$.

The capacity of the BSC is then given by
$$C = \max_{\{p(x)\}} \left[ H(q) - H(p) \right] = 1 - H(p),$$
where the fact that $\max_{\{p(x)\}} H(q) = 1$, attained for $p(x_1) = p(x_2) = 1/2$, has been used.

Figure 1.17: Capacity of the BSC channel.

Figure 1.17 plots the capacity of the BSC as a function of $p = p(y_1 \mid x_2)$. The capacity is maximum for $p \in \{0, 1\}$ (why?) and minimum for $p = 1/2$ (why?); in the latter case the input and output of the channel are independent RVs.
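As a numeric check of the BSC formula, a short sketch comparing a maximization of $I(X;Y)$ over the input pmf (by grid search, as above) with $1 - H(p)$:

import numpy as np

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

def I_bsc(q, p):
    """I(X;Y) = H(Y) - H(p) for a BSC with crossover p and input pmf (q, 1-q)."""
    y1 = (1 - p) * q + p * (1 - q)        # P(Y = y1)
    return binary_entropy(y1) - binary_entropy(p)

for p in (0.0, 0.1, 0.5):
    C_numeric = max(I_bsc(q, p) for q in np.linspace(0, 1, 10001))
    print(p, C_numeric, 1 - binary_entropy(p))   # the two values agree: 1, ~0.531, 0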

Example 1.17. The binary erasure channel (BEC) is characterized by the following transition matrix (the respective graph is also shown):
$$P = \begin{bmatrix} 1-p & p & 0 \\ 0 & p & 1-p \end{bmatrix}$$
To compute the capacity of this channel, we note that:
1) $I(X; Y) = I(Y; X) = H(Y) - H(Y \mid X)$
2) $H(Y \mid X) = H(Y \mid X = x_1)\, p(x_1) + H(Y \mid X = x_2)\, p(x_2) = H(p)\, p(x_1) + H(p)\, p(x_2) = H(p)$
3) $H(Y) = H(p) + (1-p) H(q)$, where $q = p(x_1)$ (use Property 6 of the entropy, grouping the two non-erasure outputs against the erasure).

The capacity of the BEC is then given by
$$C = \max_{\{p(x)\}} (1-p) H(q) = 1 - p,$$
where the fact that $\max_{\{p(x)\}} H(q) = 1$, attained for $p(x_1) = p(x_2) = 1/2$, has been used. Interpret the obtained capacity for $p = 0$ and for $p = 1$.

Example 1.18. Noisy channel with nonoverlapping outputs. This channel has two possible outputs for each of the two inputs, as shown in Fig. 1.18. At first glance, one might think that the channel is noisy, but really it is not.

Figure 1.18: Noisy channel with nonoverlapping outputs.

To understand why it is not noisy, suppose that we group the outputs into two sets, $\{y_1, y_2\}$ and $\{y_3, y_4\}$, as illustrated in Fig. 1.18. Then, when the output symbol belongs to the first set, we know without any uncertainty that the symbol at the input was $x_1$, and when the output symbol belongs to the second set, we know without any uncertainty that the symbol at the input was $x_2$. Therefore, the channel is equivalent to a BSC channel with crossover probability $p = 0$ and capacity $C = 1$ bit/symbol.

47 Contents 46 We now compute the channel capacity to confirm the above rationale. We note that 1) I(X; Y ) = H(X) H(X Y ) 2) H(X Y ) = 0 (why?) The capacity of the noisy channel is then given by C = max H(X) {p(x)} = 1 bit/symbol. Exercise 1.9. Noisy typewriter. Consider a channel with transition matrix P i,i = 1/2, i = 1,..., N P = [P i,j ] with P i,i+1 = 1/2, i = 1,..., N 1 P N,1 = 1/2, where N 4 is an even. The noisy typewriter channel is shown in Fig. 1.19, left, for N = 8. Show that the channel capacity is C = log N 1 bits/symbol and interpret the result. For N = 8, design an encoder and a decoder ensuring zero transmission errors. Encode and decode the sequence x 1 x 2 x 3 x 4 x 5 x 6 x 7 x 8. Hint: notice that, as shown in Fig. 1.19, right, the subset of inputs {x 1, x 3, x 5, x 7 } produces disjoint sequences at the output. Definition A channel is said weakly symmetric if every row of the transition matrix is a permutation of every other and all the columns sums are equal. Exercise Prove that the capacity of a weakly symmetric channel is C = log Y H(P i1, P i2,..., P in ), for any i {1,..., M}. Note that H(P i1, P i2,..., P in ) is the entropy of the ith row of the transition matrix.

48 Contents 47 Figure 1.19: Left: Noisy typewriter. Right: noiseless subchannel. Figure 1.20: Block diagram of the channel encoding and decoding process Channel coding theorem The channel encoding and decoding process schematized in Fig. components: 1.20 entails the following 1. stationary source (M t ) t= with alphabet M and entropy H(M) bits/symbol. 2. extended messages m k M k, resulting from sequences of k consecutive source symbols, are input to the encoder. 3. a channel with input and output alphabets X and Y, respectively, and transition probabilities p(y x) for x X and y Y. The channel is compactly characterized by the probabilities {p(y x), x X, y Y}. 4. an encoding function x n : M k X n, yielding the codewords x n (m k ) for m k M k. Each

49 Contents 48 codeword is a string of length n from X. The set of codewords is called the codebook. 5. a decoding function g : Y n M k, which assigns a message m k M k to each string y n Y n of length n. The random variables M k and M k models the input of the encoder and the output of the decoder, respectively; the random vectors X n = (X 1,..., X n ) and Y n = (Y 1,..., Y n ) model the input and the output of the channel, respectively. Definition An (n, k) code for the channel {p(y x), (x, y) X Y} consists of the triplet (M k, x n, g) defined above. Definition The rate of an (n, k) code is R := kh(m) n bits/transmission Definition The maximum probability of error of a (n, k) code for the for the channel {p(y x), (x, y) X Y} is defined as λ (n) := max m k M k P ( M k m k M k = m k ) Theorem 1.9. (Channel coding theorem) Assume that the source is stationary and has entropy H(M). Then for every ε > 0 and rate R < C, there exists a sequence of (k, n) codes with maximum probability of error λ (n) 0. Conversely, it not possibly to have λ (n) 0 with R > C. The fundamental requisite to transmit with a probability of error arbitrarily low is then R < C. If H(M) C, we must have k/n < 1 such that R < C. For example, consider a binary source with entropy H(M) = 1 bit/symbol and a BSC with p = 0.01 and thus C = 1 H(p) = bits/transmission. In order to have a reliable transmission, we must have k n < C =

Therefore the codewords must have an overhead of at least (n − k)/n = 1 − 0.9192 ≈ 0.0808 (8.08%) redundant bits introduced by the encoder. In order not to accumulate information at the input of the encoder, the channel must transmit information at a higher rate than the transmission rate of the source. If T and T_c are the source and channel symbol periods, respectively, then we must have kT = nT_c, and thus

H(M)/T < C/T_c  bits/second.

The rate C/T_c is termed the critical rate.

Informal justification of the Channel coding theorem. The proof of the Channel coding theorem is beyond the scope of this course; see [2, Ch. 8] for a formal proof. Notwithstanding, we provide an informal justification hinging on the notion of typical sequence, which is stated below.

Asymptotic Equipartition Property (AEP). For stationary sources and n large, the set of n-sequences can be divided into two subsets: the typical set, which contains about 2^{nH(X)} approximately equally likely sequences, each with probability close to 2^{−nH(X)}, and whose total probability is close to one; and the nontypical set, whose probability is close to zero.

We now provide intuition on the channel coding theorem, supported on the following three ideas/results:

1. For large values of n, every channel behaves like the noisy typewriter, in that there is a subset of inputs x^n ∈ X^n that produces disjoint sets of sequences at the output.
2. For each typical input n-sequence x^n there are about 2^{nH(Y|X)} possible y^n sequences at the output of the channel.
3. The number of typical n-sequences y^n is about 2^{nH(Y)}.

Therefore, if we divide the set of 2^{nH(Y)} output sequences into sets of size 2^{nH(Y|X)}, corresponding to the different input sequences, the total number of disjoint sets is less than or equal to 2^{n(H(Y) − H(Y|X))} = 2^{nI(Y;X)}. Hence, we may send at most 2^{nI(Y;X)} distinguishable sequences of length n.
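The numbers in the BSC example above follow directly from C = 1 − H(p). A minimal check (not part of the notes), assuming NumPy:

```python
import numpy as np

def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = 0.01                          # BSC crossover probability from the example
C = 1 - binary_entropy(p)         # capacity in bits/transmission
print(C)                          # ~ 0.9192
print(1 - C)                      # minimum fraction of redundant bits ~ 0.0808 (8.08%)
```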

Additive Gaussian channel

Fundamental question: What is the maximum rate at which information can be reliably transmitted over a bandlimited channel perturbed by additive Gaussian noise?

Answer to the fundamental question: the Shannon information capacity theorem.

Organization: This section states the Shannon-Hartley information capacity theorem [1, 6], which provides the channel capacity of a continuous channel of fixed bandwidth perturbed by bandlimited Gaussian noise. The material herein presented is partially inspired by the books [3, 2, 4].

Capacity of an amplitude-continuous time-continuous channel

Figure 1.21: Left: model of an amplitude-continuous channel with additive Gaussian noise. Right: discretized version with discretization interval Δ.

Consider the channel shown in Fig. 1.21, left, where X is a RV with pdf f_X such that E[X²] ≤ P_X, and N ~ N(0, σ_N²) is a Gaussian RV, independent of X, with zero mean and variance σ_N². Let I_i = [iΔ, (i+1)Δ[, for i ∈ Z, and let X_Δ be a quantized version of X such that X_Δ = iΔ if X ∈ I_i. Likewise, define N_Δ = iΔ if N ∈ I_i and Y_Δ = X_Δ + N_Δ. The discretized channel is shown in Fig. 1.21, right. Therefore, we have

$$
p_i := P(X_\Delta \in I_i) = \int_{i\Delta}^{(i+1)\Delta} f_X(x)\, dx \approx \Delta f_X(x_i),
$$

where x_i ∈ I_i. The entropy of X_Δ is given by

$$
\begin{aligned}
H(X_\Delta) &= -\sum_{i=-\infty}^{\infty} p_i \log p_i
 \approx -\sum_{i=-\infty}^{\infty} \Delta f_X(x_i) \log\big(\Delta f_X(x_i)\big) \\
 &= -\sum_{i=-\infty}^{\infty} \Delta f_X(x_i) \log f_X(x_i) - \log \Delta \sum_{i=-\infty}^{\infty} \Delta f_X(x_i) \\
 &= h(X) + \varepsilon_X(\Delta) - \log \Delta,
\end{aligned}
$$

where h(X) := −∫ f_X(x) log f_X(x) dx is the so-called differential entropy of X [2] and ε_X(Δ) accounts for the difference between the Riemann integral h(X) and its discretized sum approximation with discretization interval Δ. Under suitable conditions on f_X, we have lim_{Δ→0} ε_X(Δ) = 0.

Proceeding as before, the mutual information between X_Δ and Y_Δ is

$$
\begin{aligned}
I(X_\Delta; Y_\Delta) &= H(Y_\Delta) - H(Y_\Delta \mid X_\Delta) \\
 &= h(Y) - h(Y \mid X) + \varepsilon_Y(\Delta) - \varepsilon_{Y|X}(\Delta) - \log \Delta + \log \Delta,
\end{aligned}
$$

where h(Y|X) = −∫∫ f_{XY}(x, y) log f_{Y|X}(y|x) dx dy is the conditional differential entropy [2] of Y given X, and ε_{Y|X}(Δ) accounts for the difference between the Riemann integral h(Y|X) and its discretized sum approximation with discretization interval Δ. Under suitable conditions on f_{XY}, we have lim_{Δ→0} ε_{Y|X}(Δ) = 0.

Taking into consideration that f_{Y|X}(·|x) is the pdf of N(x, σ_N²), we have

$$
\begin{aligned}
h(Y \mid X) &= -\int f_X(x) \int \underbrace{f_{Y|X}(y|x)}_{\mathcal{N}(x,\,\sigma_N^2)} \log f_{Y|X}(y|x)\, dy\, dx \\
 &= \int f_X(x) \int \frac{e^{-(y-x)^2/(2\sigma_N^2)}}{\sqrt{2\pi\sigma_N^2}}\,
     \frac{\tfrac{1}{2}\ln(2\pi\sigma_N^2) + (y-x)^2/(2\sigma_N^2)}{\ln 2}\, dy\, dx \\
 &= \frac{1}{2} \log\big(2\pi e\, \sigma_N^2\big).
\end{aligned}
$$
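The relation H(X_Δ) ≈ h(X) − log Δ can be verified numerically for a Gaussian X, whose differential entropy is h(X) = (1/2) log₂(2πeσ²). The sketch below is my own check (not from the notes); it uses the approximation p_i ≈ Δ f_X(x_i) and assumes NumPy.

```python
import numpy as np

sigma2 = 1.0
delta = 0.01                                   # discretization interval
centers = np.arange(-10, 10, delta) + delta / 2
p = delta * np.exp(-centers**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
p = p[p > 0]                                   # p_i ~ delta * f_X(x_i)

H_delta = -np.sum(p * np.log2(p))              # entropy of the quantized variable
h = 0.5 * np.log2(2 * np.pi * np.e * sigma2)   # differential entropy of N(0, sigma2)
print(H_delta, h - np.log2(delta))             # both ~ 8.69 bits
```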

Let the capacity of the discretized channel, C_Δ, be given by

$$
C_\Delta = \max_{f_X :\, E[X^2] \le P_X} I(X_\Delta; Y_\Delta).
$$

Defining the capacity of the continuous channel as C = lim_{Δ→0} C_Δ, and assuming that the lim and max operators may be swapped, we have

$$
C = \max_{f_X :\, E[X^2] \le P_X} h(Y) - h(Y \mid X).
$$

Since h(Y|X) does not depend on f_X, we have

$$
C = \max_{f_X :\, E[X^2] \le P_X} h(Y) - \frac{1}{2} \log\big(2\pi e\, \sigma_N^2\big).
$$

The solution of the above optimization is achieved for a Gaussian pdf [2] and given by h(Y) = (1/2) log(2πe(P_X + σ_N²)). We then have

$$
C = \frac{1}{2} \log\big(2\pi e (P_X + \sigma_N^2)\big) - \frac{1}{2} \log\big(2\pi e\, \sigma_N^2\big)
  = \frac{1}{2} \log\left(1 + \frac{P_X}{\sigma_N^2}\right).
$$

Theorem 1.10. The capacity of a Gaussian channel with power constraint P_X and noise variance σ_N² is

$$
C = \frac{1}{2} \log\left(1 + \frac{P_X}{\sigma_N^2}\right) \quad \text{bits per transmission.} \tag{1.41}
$$

Definition. An (n, k) code for a Gaussian channel Y | X = x ~ N(x, σ_N²) consists of the triplet (R^k, x^n, g) defined in Definition 1.25 with the constraint

$$
\sum_{i=1}^{n} x_i^2(w) \le n P_X, \quad \forall\, w \in \mathcal{W}.
$$

We recall that the respective code rate, defined in (1.26), is R = k H(M)/n bits per transmission.

Theorem 1.10 has the following implications:

1. (Achievability) For every ε > 0 and rate R < C, there exists a sequence of (n, k) codes with maximum probability of error λ^(n) → 0.
2. (Unachievability) For any sequence of (n, k) codes with R > C, the maximum probability of error λ^(n) does not converge to zero.
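A one-line numerical illustration of (1.41), as a sketch of my own (not from the notes); the values of P_X and σ_N² are arbitrary examples, and NumPy is assumed:

```python
import numpy as np

def gaussian_capacity(P_X, sigma2_N):
    """Capacity in bits/transmission of the discrete-time Gaussian channel, eq. (1.41)."""
    return 0.5 * np.log2(1 + P_X / sigma2_N)

print(gaussian_capacity(P_X=1.0, sigma2_N=1.0))   # SNR = 1  -> 0.5 bits/transmission
print(gaussian_capacity(P_X=10.0, sigma2_N=1.0))  # SNR = 10 -> ~1.73 bits/transmission
```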

Sphere packing for the Gaussian channel

We now provide a plausible argument as to why it is possible to construct (n, k) codes with low probability of error. Consider a message w ∈ R^k and the codeword vector x^n(w) such that

$$
\|x^n(w)\|_2^2 = \sum_{i=1}^{n} x_i^2(w) \le n P_X.
$$

The random vector Y^n at the output of the Gaussian channel is given by Y^n = X^n + N^n, where N^n is an i.i.d. random vector of size n with Gaussian components of zero mean and variance σ_N². For a given X^n, ‖Y^n − X^n‖² has mean nσ_N² and variance 2nσ_N⁴. This means that, as n increases, the vectors Y^n − X^n lie increasingly close to an n-dimensional sphere of radius √(nσ_N²). The decoder is able to correctly infer the vectors X^n(w), for w ∈ W, if the minimum distance between any two codewords is at least 2√(nσ_N²), i.e., the spheres are non-intersecting. Therefore, the maximum number of correctly decoded codewords is equal to the maximum number of n-dimensional non-intersecting spheres of radius √(nσ_N²) that can be packed into a larger sphere of radius √(n(P_X + σ_N²)). This number is

$$
\frac{\big(n(P_X + \sigma_N^2)\big)^{n/2}}{\big(n\sigma_N^2\big)^{n/2}}
 = 2^{\frac{n}{2} \log\left(1 + \frac{P_X}{\sigma_N^2}\right)},
$$

corresponding to the rate (1/2) log(1 + P_X/σ_N²).

Figure 1.22: Sphere packing interpretation for the Gaussian channel.

Capacity of the bandlimited Gaussian channel

Figure 1.23: Gaussian channel with bandwidth B; X(t) denotes a stochastic process with zero mean and variance σ_X², and N(t) denotes a stationary Gaussian process with PSD G_N(f) = N_0/2.

We now return to the time-continuous channel, in the sense that its input and output are time-continuous signals and the input signals are bandlimited to the frequency interval [−B, B] Hz. Therefore, we may discard the frequency components of the noise outside the frequency interval [−B, B]. This scenario is modeled by the block diagram shown in Fig. 1.23, where a lowpass filter with bandwidth B is included at the output. The noise power at the output is σ_N² = (N_0/2)(2B) = N_0 B. By the sampling theorem, the signal component at the output may be recovered using 2B samples per second. Accordingly, we may express the information capacity as follows.

Theorem (Information capacity theorem). The information capacity of a continuous channel of bandwidth B Hz, perturbed by additive Gaussian noise of power spectral density N_0/2 W/Hz and bandlimited to B, is given by

$$
C = B \log\left(1 + \frac{P_X}{N_0 B}\right) \quad \text{bits per second},
$$

where P_X is the average transmitted power.

This theorem is one of the most important and beautiful results of information theory; it links in a simple formula the three most important parameters of a communication system: the channel bandwidth, the average transmitted power, and the noise power spectral density.
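A minimal sketch of the formula in action (not from the notes; the bandwidth and SNR values are hypothetical, and NumPy is assumed):

```python
import numpy as np

def bandlimited_capacity(B, P_X, N0):
    """C = B * log2(1 + P_X / (N0 * B)) in bits per second."""
    return B * np.log2(1 + P_X / (N0 * B))

# Hypothetical example: B = 3.4 kHz and an SNR of P_X / (N0 * B) = 1000 (30 dB).
B = 3.4e3
N0 = 1.0
P_X = 1000 * N0 * B
print(bandlimited_capacity(B, P_X, N0))   # ~ 3.4e4 bits per second
```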
