Lecture 4 Noisy Channel Coding

Lecture 4 Noisy Channel Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 9, 2015 1 / 56 I-Hsiang Wang IT Lecture 4

The Channel Coding Problem

[Block diagram: $w$ → Channel Encoder → $x^N$ → Noisy Channel → $y^N$ → Channel Decoder → $\hat{w}$]

Meta Description
1. Message: Random message $W \sim \mathrm{Unif}[1 : 2^K]$.
2. Channel: Consists of an input alphabet $\mathcal{X}$, an output alphabet $\mathcal{Y}$, and a family of conditional distributions $\{ p(y_k \mid x^k, y^{k-1}) \mid k \in \mathbb{N} \}$ determining the stochastic relationship between the output symbol $y_k$ and the input symbol $x_k$ along with all past signals $(x^{k-1}, y^{k-1})$.
3. Encoder: Encode the message $w$ by a length-$N$ codeword $x^N \in \mathcal{X}^N$.
4. Decoder: Reconstruct message $\hat{w}$ from the channel output $y^N$.
5. Efficiency: Maximize the code rate $R \triangleq K/N$ bits/channel use, given a certain decoding criterion.

Decoding Criterion: Vanishing Error Probability

A key performance measure: error probability $P_e^{(N)} \triangleq P\{ W \neq \hat{W} \}$.

Question: Is it possible to get zero error probability?
Ans: Probably not, unless the channel noise has some special structure.

Following the development of lossless source coding, Shannon turned his attention to the following question: Is it possible to have a sequence of encoder/decoder pairs such that $P_e^{(N)} \to 0$ as $N \to \infty$? If so, what is the largest possible code rate $R$ at which vanishing error probability is possible?

Recall: In lossless source coding, the infimum of compression rates at which vanishing error probability is possible is the entropy rate $\mathcal{H}(\{S_i\})$.

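Performance is described by three quantities: rate $R$, block length $N$, and probability of error $P_e^{(N)}$.

Capacity: Take $N \to \infty$, require $P_e^{(N)} \to 0$ $\implies$ $\sup R = C$.
Error Exponent: Take $N \to \infty$, fix rate $R$ $\implies$ $\min P_e^{(N)} \approx 2^{-N E(R)}$.
Finite Block Length: Fix $N$, require $P_e^{(N)} \le \varepsilon$ $\implies$ $\sup R = C - \sqrt{\dfrac{V}{N}}\, Q^{-1}(\varepsilon) + O\!\left(\dfrac{\log N}{N}\right)$.

Remark: For source coding, one can establish a similar framework.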

In this lecture we focus only on capacity. In other words, we ignore the issue of finite block length (FBL); FBL performance can be obtained via techniques that extend the central limit theorem (CLT). Nor do we pursue a finer analysis of the error probability via large deviation techniques.

Discrete Memoryless Channel (DMC)

To demonstrate the key ideas in channel coding, in this lecture we focus on discrete memoryless channels (DMC), defined below.

Definition 1 (Discrete Memoryless Channel)
A discrete channel $(\mathcal{X}, \{ p(y_k \mid x^k, y^{k-1}) \mid k \in \mathbb{N} \}, \mathcal{Y})$ is memoryless if $\forall\, k \in \mathbb{N}$, $p(y_k \mid x^k, y^{k-1}) = p_{Y|X}(y_k \mid x_k)$. In other words, $Y_k - X_k - (X^{k-1}, Y^{k-1})$ forms a Markov chain. Here the conditional p.m.f. $p_{Y|X}$ is called the channel law or channel transition function.

Question: Is our definition of a channel sufficient to specify $p(y^N \mid x^N)$, the stochastic relationship between the channel input (codeword) $x^N$ and the channel output $y^N$?

By definition, $p(y^N \mid x^N) = \dfrac{p(x^N, y^N)}{p(x^N)}$, and

$$p(x^N, y^N) = \prod_{k=1}^{N} p(x_k, y_k \mid x^{k-1}, y^{k-1}) = \prod_{k=1}^{N} p(y_k \mid x^k, y^{k-1})\, p(x_k \mid x^{k-1}, y^{k-1}).$$

Hence, we need to further specify $\{ p(x_k \mid x^{k-1}, y^{k-1}) \mid k \in \mathbb{N} \}$, which cannot be obtained from $p(x^N)$.

Interpretation: $\{ p(x_k \mid x^{k-1}, y^{k-1}) \mid k \in \mathbb{N} \}$ is induced by the encoding function, which implies that the encoder can potentially make use of the past channel outputs, i.e., feedback.

DMC without Feedback

[Block diagrams: "No Feedback": $w$ → Channel Encoder → $x_k$ → Noisy Channel → $y_k$. "With Feedback": the same chain, with $y_{k-1}$ fed back to the encoder through a unit delay D.]

Suppose that the encoder has no knowledge of the realization of the channel output. Then $p(x_k \mid x^{k-1}, y^{k-1}) = p(x_k \mid x^{k-1})$ for all $k \in \mathbb{N}$, and the channel is said to have no feedback. In this case, specifying $\{ p(y_k \mid x^k, y^{k-1}) \mid k \in \mathbb{N} \}$ suffices to specify $p(y^N \mid x^N)$.

Proposition 1 (DMC without Feedback)
For a DMC $(\mathcal{X}, p_{Y|X}, \mathcal{Y})$ without feedback, $p(y^N \mid x^N) = \prod_{k=1}^{N} p_{Y|X}(y_k \mid x_k)$.
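The product form in Proposition 1 can be illustrated numerically. The following is a minimal sketch, not part of the original slides; the function name `sequence_likelihood`, the nested-dictionary representation of the channel law, and the crossover probability 0.1 are all assumptions chosen for illustration.

```python
def sequence_likelihood(channel, x_seq, y_seq):
    """Return p(y^N | x^N) = prod_k p_{Y|X}(y_k | x_k) for a DMC without feedback.

    `channel[x][y]` holds the (assumed) transition probability p_{Y|X}(y | x).
    """
    if len(x_seq) != len(y_seq):
        raise ValueError("input and output sequences must have equal length")
    likelihood = 1.0
    for x, y in zip(x_seq, y_seq):
        # Memoryless and no feedback: each use contributes an independent factor.
        likelihood *= channel[x][y]
    return likelihood


# Illustrative binary symmetric channel with crossover probability 0.1.
bsc = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}

# Three symbols received correctly and one flipped: 0.9 * 0.9 * 0.1 * 0.9.
p_seq = sequence_likelihood(bsc, [0, 1, 1, 0], [0, 1, 0, 0])
```

The likelihood depends only on the symbol pairs $(x_k, y_k)$, not on their order, which is what later makes typicality arguments work.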

Overview

In this lecture, we would like to establish the following (informally described) noisy channel coding theorem due to Shannon:

For a DMC $(\mathcal{X}, p_{Y|X}, \mathcal{Y})$, the maximum code rate with vanishing error probability is the channel capacity $C \triangleq \max_{p_X(\cdot)} I(X; Y)$.

The above holds regardless of the availability of feedback. To demonstrate this result, we organize the lecture as follows:
1. Give the problem formulation, state the main theorem, and visit a couple of examples to show how to compute channel capacity.
2. Prove the converse part: an achievable rate cannot exceed $C$.
3. Prove the achievability part with a random coding argument.

Outline
1. Channel Capacity and the Weak Converse
   Channel Capacity
   Proof of the Weak Converse
   Feedback Capacity


Channel Coding without Feedback: Problem Setup

1. A $(2^{NR}, N)$ channel code consists of
   - an encoding function (encoder) $\mathrm{enc}_N : [1 : 2^K] \to \mathcal{X}^N$ that maps each message $w$ to a length-$N$ codeword $x^N$, where $K \triangleq \lceil NR \rceil$;
   - a decoding function (decoder) $\mathrm{dec}_N : \mathcal{Y}^N \to [1 : 2^K] \cup \{*\}$ that maps a channel output sequence $y^N$ to a reconstructed message $\hat{w}$ or an error message $*$.
2. The error probability is defined as $P_e^{(N)} \triangleq P\{ W \neq \hat{W} \}$.
3. A rate $R$ is said to be achievable if there exists a sequence of $(2^{NR}, N)$ codes such that $P_e^{(N)} \to 0$ as $N \to \infty$.

The channel capacity is defined as $C \triangleq \sup \{ R \mid R \text{ is achievable} \}$.

Channel Coding Theorem for Discrete Memoryless Channel

Theorem 1 (Channel Coding Theorem for DMC without Feedback)
The capacity $C$ of the DMC $p(y \mid x)$ without feedback is given by

$$C = \max_{p(x)} I(X; Y). \qquad (1)$$

The capacity formula (1) is intuitive, since $I(X; Y)$ represents the amount of information about the channel input $X$ that one can infer from the channel output $Y$. The maximization over $p(x)$ stands for choosing the best possible input distribution so that the amount of information transfer is maximized.

Rest of the lecture:
1. First, we give some examples of noisy channels to show how to compute capacity.
2. Then, we prove that for any rate $R > C$, it is impossible to have vanishing error probability (converse).
3. Finally, we prove that for any $R < C$, there exists a sequence of encoding/decoding schemes such that the error probability vanishes as the block length tends to infinity (achievability), based on a probabilistic argument called random coding.

Binary Symmetric Channel

A binary symmetric channel (BSC) consists of
- binary input/output $\mathcal{X} = \mathcal{Y} = \{0, 1\}$;
- channel law $p(y \mid x) = \begin{bmatrix} 1-p & p \\ p & 1-p \end{bmatrix}$.

The capacity of the BSC is $C_{\mathrm{BSC}} = 1 - H_b(p)$.

[Figure: BSC transition diagram. Each input is received correctly with probability $1-p$ and flipped with probability $p$.]

To compute the BSC capacity, observe that $I(X; Y) = H(Y) - H(Y \mid X)$, and
- $H(Y \mid X = 0) = H(Y \mid X = 1) = H_b(p) \implies H(Y \mid X) = H_b(p)$;
- $H(Y) \le \log 2 = 1$, with equality iff $Y$ is uniform.

Question: Is it possible to choose a $p(x)$ such that $Y$ is uniform?
Ans: Yes, choose $X$ to be uniform $\implies C = \max_{p(x)} I(X; Y) = 1 - H_b(p)$.
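As a numerical sanity check of the BSC argument, the sketch below evaluates $I(X;Y) = H(Y) - H_b(p)$ on a grid of input distributions and compares the maximum with $1 - H_b(p)$. The grid resolution, the crossover probability 0.11, and all function names are assumptions made for this illustration.

```python
import math


def binary_entropy(p):
    """Binary entropy H_b(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_mutual_information(a, p):
    """I(X;Y) for a BSC with crossover p when P{X = 1} = a."""
    q = a * (1 - p) + (1 - a) * p  # P{Y = 1}
    return binary_entropy(q) - binary_entropy(p)


def bsc_capacity(p):
    """Closed form from the slide: C_BSC = 1 - H_b(p)."""
    return 1.0 - binary_entropy(p)


p = 0.11  # illustrative crossover probability
grid = [i / 1000 for i in range(1001)]
# Brute-force search over P{X = 1}; the maximizer should be the uniform input.
best_a = max(grid, key=lambda a: bsc_mutual_information(a, p))
best_value = bsc_mutual_information(best_a, p)
```

The grid search is only a check; the closed form is what the slide derives.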

Binary Erasure Channel

A binary erasure channel (BEC) consists of
- binary input $\mathcal{X} = \{0, 1\}$ and output with erasure $\mathcal{Y} = \{0, 1, *\}$;
- channel law $p(y \mid x) = \begin{bmatrix} 1-p & p & 0 \\ 0 & p & 1-p \end{bmatrix}$ (columns ordered as $y = 0, *, 1$).

The capacity of the BEC is $C_{\mathrm{BEC}} = 1 - p$.

[Figure: BEC transition diagram. Each input is received correctly with probability $1-p$ and erased with probability $p$.]

Suppose we begin with $I(X; Y) = H(Y) - H(Y \mid X)$. Then,
- $H(Y \mid X = 0) = H(Y \mid X = 1) = H_b(p) \implies H(Y \mid X) = H_b(p)$;
- $H(Y) \le \log 3$, with equality iff $Y$ is uniform.

Question: Is it possible to choose a $p(x)$ such that $Y$ is uniform?
Ans: No. So, we cannot say that $\max_{p(x)} H(Y) = \log 3$.

[Figure: the forward BEC from $X$ to $Y$ and the reverse channel from $Y$ to $X$.]

Instead, we can start with $I(X; Y) = H(X) - H(X \mid Y)$. Then, we have the reverse channel law

$$p(x \mid y) = \begin{bmatrix} 1 & 0 \\ \alpha & 1-\alpha \\ 0 & 1 \end{bmatrix} \quad \text{(rows ordered as } y = 0, *, 1\text{)}, \quad \text{where } \alpha \triangleq P\{X = 0\}.$$

- $H(X \mid Y = 0) = H(X \mid Y = 1) = 0$ and $H(X \mid Y = *) = H_b(\alpha) = H(X)$ $\implies H(X \mid Y) = P\{Y = *\}\, H(X) = p\, H(X)$.
- $H(X) \le 1$, with equality iff $X$ is uniform.

Hence, $C_{\mathrm{BEC}} = \max_{p(x)} (1 - p) H(X) = 1 - p$.

Erasure Channel

We can generalize the BEC to the following erasure channel:
- Input $\mathcal{X}$, output $\mathcal{Y} = \mathcal{X} \cup \{*\}$.
- Channel law $p(y \mid x) = \begin{cases} 1 - p, & y = x \\ p, & y = * \\ 0, & \text{otherwise.} \end{cases}$

A motivation for this model comes from networking, where the erasure models a packet drop.

Exercise 1
Show that the capacity of the erasure channel is $C_{\mathrm{EC}} = (1 - p) \log |\mathcal{X}|$.
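The claim of Exercise 1 can be checked numerically (this is a check, not a proof). The sketch below computes $I(X;Y)$ directly from the definition for an erasure channel with $|\mathcal{X}| = 4$ and $p = 0.25$; the list-of-rows matrix representation, the placement of the erasure symbol in the last column, and the chosen numbers are assumptions for illustration.

```python
import math


def mutual_information(p_x, channel):
    """I(X;Y) in bits, where channel[x][y] = p(y | x) and p_x[x] is the input pmf."""
    n_out = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_out)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y, pyx in enumerate(channel[x]):
            if px > 0 and pyx > 0:
                total += px * pyx * math.log2(pyx / p_y[y])
    return total


def erasure_channel(m, p):
    """Erasure channel with m inputs; output index m plays the role of the erasure."""
    rows = []
    for x in range(m):
        row = [0.0] * (m + 1)
        row[x] = 1.0 - p  # symbol delivered intact
        row[m] = p        # symbol erased
        rows.append(row)
    return rows


m, p = 4, 0.25
channel = erasure_channel(m, p)
i_uniform = mutual_information([1.0 / m] * m, channel)     # uniform input
i_skewed = mutual_information([0.7, 0.1, 0.1, 0.1], channel)  # a non-uniform input
```

With the uniform input the result matches $(1-p)\log_2 4 = 1.5$ bits, and the skewed input gives strictly less, consistent with $I(X;Y) = (1-p)H(X)$.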

Symmetric Channel

In computing the capacity of the BSC, we observe that
1. $H(Y \mid X) = H_b(p)$ regardless of $p(x)$. Why? Because all rows of $p(y \mid x)$ are permutations of the same probability vector $[\, p \;\; 1-p \,]$.
2. $H(Y) = \log |\mathcal{Y}|$ can be attained, that is, $Y$ can be made uniform by choosing $X$ to be uniform. Why? Because all columns of $p(y \mid x)$ have the same sum $\sum_x p(y \mid x)$.

Definition 2 (Symmetric Channel)
A symmetric channel is a channel with channel law $p(y \mid x)$ satisfying (1) all rows of $p(y \mid x)$ are permutations of the same probability vector $\mathbf{p}$, and (2) all columns of $p(y \mid x)$ have the same sum $\sum_x p(y \mid x)$.

Exercise 2
Show that the capacity of a symmetric channel is $\log |\mathcal{Y}| - H(\mathbf{p})$.

Computing Capacity of DMC via Convex Optimization

For a DMC, we can find its capacity efficiently by invoking efficient algorithms for solving convex programs, since $I(X; Y)$ is a concave function of $p(x)$ for fixed $p(y \mid x)$.

Proposition 2
$I(X; Y)$ is a concave function of $p(x)$ for fixed $p(y \mid x)$.

pf: By definition, $I(X; Y) = H(Y) - H(Y \mid X)$.
- $H(Y \mid X) = \sum_x p(x) H(Y \mid X = x)$ is a linear function of $p(x)$, because $H(Y \mid X = x) = -\sum_y p(y \mid x) \log p(y \mid x)$ is constant for fixed $p(y \mid x)$.
- $H(Y)$ is a concave function of $p(y)$, and $p(y) = \sum_x p(x) p(y \mid x)$ is a linear function of $p(x)$ for fixed $p(y \mid x)$. Hence, $H(Y)$ is a concave function of $p(x)$ for fixed $p(y \mid x)$.

Putting the above together, a concave function minus a linear function is concave, which completes the proof.
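The slide does not name a specific algorithm. One classical choice that exploits this concavity is the Blahut-Arimoto iteration, sketched below as an illustration rather than as the method intended by the lecture. The function name, tolerance, iteration cap, and the two test channels are assumptions.

```python
import math


def blahut_arimoto(channel, tol=1e-9, max_iter=10_000):
    """Approximate C = max_{p(x)} I(X;Y) in bits for channel[x][y] = p(y | x).

    Returns (capacity_estimate, input_pmf). Uses the standard bounds
    I(X;Y) <= C <= max_x D(p(.|x) || p_Y) as a stopping rule.
    """
    n_in, n_out = len(channel), len(channel[0])
    r = [1.0 / n_in] * n_in  # start from the uniform input distribution
    lower = 0.0
    for _ in range(max_iter):
        p_y = [sum(r[x] * channel[x][y] for x in range(n_in)) for y in range(n_out)]
        # d[x] = D(p(.|x) || p_Y) in bits
        d = [
            sum(
                channel[x][y] * math.log2(channel[x][y] / p_y[y])
                for y in range(n_out)
                if channel[x][y] > 0
            )
            for x in range(n_in)
        ]
        lower = sum(r[x] * d[x] for x in range(n_in))  # I(X;Y) at the current input pmf
        if max(d) - lower < tol:
            break
        # Multiplicative update: reweight inputs by 2^{d[x]} and renormalize.
        weights = [r[x] * 2.0 ** d[x] for x in range(n_in)]
        total = sum(weights)
        r = [w / total for w in weights]
    return lower, r


cap_bsc, pmf_bsc = blahut_arimoto([[0.9, 0.1], [0.1, 0.9]])  # BSC(0.1)
cap_z, pmf_z = blahut_arimoto([[1.0, 0.0], [0.5, 0.5]])      # Z-channel, not symmetric
```

For the BSC the uniform starting point is already optimal, so the loop stops immediately; the Z-channel shows a case where the optimal input is not uniform.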


Proof of the (Weak) Converse (1)

We would like to show that for every sequence of $(2^{NR}, N)$ codes such that $P_e^{(N)} \to 0$ as $N \to \infty$, the rate $R \le \max_{p(x)} I(X; Y)$.

pf: Note that $W \sim \mathrm{Unif}[1 : 2^K]$ and hence $K = H(W)$.

$$\begin{aligned}
NR \le H(W) &= I(W; \hat{W}) + H(W \mid \hat{W}) && (2) \\
&\le I(W; Y^N) + \left( 1 + P_e^{(N)} \log\left( 2^K + 1 \right) \right) && (3) \\
&\le \sum_{k=1}^{N} I(W; Y_k \mid Y^{k-1}) + \left( 1 + P_e^{(N)} (NR + 2) \right) && (4)
\end{aligned}$$

(2) is due to $K = \lceil NR \rceil \ge NR$ and the chain rule.
(3) is due to the Markov chain $W - Y^N - \hat{W}$ and Fano's inequality.
(4) is due to the chain rule and $2^K + 1 \le 2^{NR+1} + 1 \le 2 \cdot 2^{NR+1} = 2^{NR+2}$.

Proof of the (Weak) Converse (2)

Set $\varepsilon_N \triangleq \frac{1}{N} \left( 1 + P_e^{(N)} (NR + 2) \right)$. We see that $\varepsilon_N \to 0$ as $N \to \infty$ because $\lim_{N \to \infty} P_e^{(N)} = 0$.

The next step is to relate $\sum_{k=1}^{N} I(W; Y_k \mid Y^{k-1})$ to $I(X; Y)$, by the following manipulation:

$$\begin{aligned}
I(W; Y_k \mid Y^{k-1}) &\le I(W, Y^{k-1}; Y_k) \le I(W, Y^{k-1}, X_k; Y_k) && (5) \\
&= I(X_k; Y_k) \le \max_{p(x)} I(X; Y) && (6)
\end{aligned}$$

(5) is due to the fact that conditioning reduces entropy, i.e., $H(Y_k \mid Y^{k-1}) \le H(Y_k)$.
(6) is due to the DMC assumption: $p(y_k \mid x^k, y^{k-1}, w) = p(y_k \mid x^k, y^{k-1}) = p(y_k \mid x_k)$ $\implies$ $Y_k - X_k - (W, X^{k-1}, Y^{k-1})$ $\implies$ $Y_k - X_k - (W, Y^{k-1})$.

Proof of the (Weak) Converse (3)

Hence, we have

$$NR \le \sum_{k=1}^{N} I(W; Y_k \mid Y^{k-1}) + N \varepsilon_N \le N \max_{p(x)} I(X; Y) + N \varepsilon_N \implies R \le \max_{p(x)} I(X; Y) + \varepsilon_N, \quad \forall\, N.$$

Taking $N \to \infty$, we have: $R \le \max_{p(x)} I(X; Y)$ if $R$ is achievable.

Remark: Similar to the source coding problem, a stronger version of the converse holds in the channel coding problem as well: if $R > C$, then $P_e^{(N)} \to 1$ as $N \to \infty$ for any encoding/decoding functions.
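The chain of inequalities in the converse can be rearranged into an explicit lower bound on the error probability. Combining $NR \le N C + 1 + P_e^{(N)}(NR + 2)$ and solving for $P_e^{(N)}$ gives $P_e^{(N)} \ge \frac{N(R - C) - 1}{NR + 2}$, which tends to $1 - C/R$ as $N \to \infty$. The sketch below simply evaluates this bound; the function name and the numbers are illustrative assumptions.

```python
def error_lower_bound(rate, capacity, n):
    """Weak-converse lower bound on P_e^{(N)}, clipped at 0.

    Obtained by rearranging N*R <= N*C + 1 + P_e * (N*R + 2).
    """
    bound = (n * (rate - capacity) - 1.0) / (n * rate + 2.0)
    return max(0.0, bound)


# Rate above capacity: the bound stays bounded away from zero.
bound_above = error_lower_bound(rate=1.0, capacity=0.5, n=1000)
# Rate below capacity: the bound is vacuous (zero), as expected.
bound_below = error_lower_bound(rate=0.4, capacity=0.5, n=1000)
```

Note that this bound only shows the error probability cannot vanish for $R > C$; it does not give the stronger statement in the remark that the error probability tends to 1.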


Channel Coding with Feedback: Problem Setup

[Block diagram: $w$ → Channel Encoder → $x^N$ → Noisy Channel → $y^N$ → Channel Decoder → $\hat{w}$, with the channel output fed back to the encoder through a unit delay D.]

1. A $(2^{NR}, N)$ channel code consists of
   - an encoding function (encoder) $\mathrm{enc}_N : [1 : 2^K] \times \mathcal{Y}^{N-1} \to \mathcal{X}^N$ that maps each message $w$ to a length-$N$ codeword $x^N$, where $K \triangleq \lceil NR \rceil$, and the $k$-th symbol $x_k$ is a function of $(w, y^{k-1})$ for all $k \in [1 : N]$;
   - a decoding function (decoder) $\mathrm{dec}_N : \mathcal{Y}^N \to [1 : 2^K] \cup \{*\}$ that maps a channel output sequence $y^N$ to a reconstructed message $\hat{w}$ or an error message $*$.
2. The error probability is defined as $P_e^{(N)} \triangleq P\{ W \neq \hat{W} \}$.
3. A rate $R$ is said to be achievable if there exists a sequence of $(2^{NR}, N)$ codes such that $P_e^{(N)} \to 0$ as $N \to \infty$.

The channel capacity is defined as $C \triangleq \sup \{ R \mid R \text{ is achievable} \}$.

Dependency Graph: Without vs. With Feedback

[Figure, "No Feedback" case: dependency graph among $W$, $\mathrm{enc}_N$, $X_1, X_2, \ldots, X_k, \ldots, X_N$, the channel $p_{Y|X}$, $Y_1, Y_2, \ldots, Y_k, \ldots, Y_N$, $\mathrm{dec}_N$, and $\hat{W}$.]

[Figure, "With Feedback" case: the same dependency graph, with the past outputs additionally available to the encoder.]

Feedback Capacity

Theorem 2 (Channel Coding Theorem for DMC with Feedback)
The capacity of the DMC $p(y \mid x)$ with feedback is given by (1), the same as that without feedback. In other words, feedback does not increase the channel capacity of a DMC.

The proof is immediate: the converse proof of the channel coding theorem without feedback never uses the assumption that there is no feedback, so it carries over verbatim to the case with feedback.

Remark: Although feedback does not increase capacity, it does greatly improve the reliability (error exponent) and the finite-blocklength performance. Furthermore, feedback may greatly simplify the design of the coding scheme and reduce its complexity. The details are beyond the scope of this lecture.



Overview

To prove the achievability part of Theorem 1, we need to show the following mathematical statement:

$\forall\, R < C$, $R \ge 0$, $\exists$ a sequence of $(2^{NR}, N)$ codes such that $\lim_{N \to \infty} P_e^{(N)} = 0$.

In general, to prove the existence of certain objects satisfying some desirable properties, there are two possible ways:
1. Explicitly construct an object and prove that the properties hold.
2. Assume that no object can satisfy the properties, and derive a contradiction.

The achievability proof presented in this lecture is more of the second flavor, and in fact belongs to the so-called probabilistic method.

The Probabilistic Method

What is the probabilistic method? Roughly speaking, in order to show the existence of certain objects satisfying some desirable properties,
- one first imposes a particular probability distribution over the space of possible objects;
- then, by showing that the properties hold on average, or that they hold with non-zero probability, one concludes the existence of such objects.

Example 1
Given a set of $n$-dimensional unit vectors $\{v_1, v_2, \ldots, v_k\}$, show that $\exists\, x_i \in \{\pm 1\}$, $i = 1, \ldots, k$, such that $\left\| \sum_{i=1}^{k} x_i v_i \right\| \ge \sqrt{k}$.

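pf: Let $\{X_i\}_{i=1}^{k}$ be i.i.d. r.v.'s with $P\{X_i = 1\} = P\{X_i = -1\} = \frac{1}{2}$. Define $V \triangleq \sum_{i=1}^{k} X_i v_i$. Compute $E\left[ \|V\|^2 \right]$ as follows:

$$E\left[ \|V\|^2 \right] = E\left[ V^T V \right] = E\left[ \left( \sum_{i=1}^{k} X_i v_i^T \right) \left( \sum_{j=1}^{k} X_j v_j \right) \right] = E\left[ \sum_{i=1}^{k} \sum_{j=1}^{k} X_i X_j v_i^T v_j \right] = \sum_{i=1}^{k} \sum_{j=1}^{k} E[X_i X_j]\, v_i^T v_j.$$

Since $\{X_i\}$ are mutually independent, $E[X_i X_j] = E[X_i] E[X_j] = 0$ for $i \neq j$. Therefore

$$E\left[ \|V\|^2 \right] = \sum_{i=1}^{k} E\left[ X_i^2 \right] \|v_i\|^2 = k.$$

Hence, $\exists\, x_i \in \{\pm 1\}$, $i = 1, \ldots, k$, such that $\left\| \sum_{i=1}^{k} x_i v_i \right\| \ge \sqrt{k}$. Otherwise, $E\left[ \|V\|^2 \right]$ would be less than $k$, leading to a contradiction.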
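The averaging argument can be checked by brute force on a small instance. The sketch below enumerates all $2^k$ sign patterns for four unit vectors in $\mathbb{R}^3$ and confirms that the average of $\|V\|^2$ equals $k$, so some sign pattern attains at least $k$ (and, by the same argument, some pattern attains at most $k$). The particular vectors and helper names are arbitrary assumptions.

```python
import itertools
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


def squared_norm_of_signed_sum(vectors, signs):
    """Return || sum_i signs[i] * vectors[i] ||^2."""
    dim = len(vectors[0])
    total = [sum(s * v[d] for s, v in zip(signs, vectors)) for d in range(dim)]
    return sum(t * t for t in total)


# Arbitrary illustrative vectors, normalized to unit length.
vectors = [normalize(v) for v in ([1, 0, 0], [1, 1, 0], [1, 2, 2], [0, -1, 3])]
k = len(vectors)

# Exhaustive enumeration of all 2^k sign patterns (uniform distribution over signs).
values = [
    squared_norm_of_signed_sum(vectors, signs)
    for signs in itertools.product((-1, 1), repeat=k)
]
average = sum(values) / len(values)
```

Because the average is exactly $k$, the maximum over sign patterns cannot be below $k$, which is the existence claim, obtained without exhibiting the pattern explicitly.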

[Photo: Paul Erdős (1913–1996)]

Coding over Noisy Channel

Before we prove the main theorem, let us set up some notation related to coding over a noisy channel.
1. Codebook: $c \triangleq \{ x^N(1), x^N(2), \ldots, x^N(2^K) \}$ consists of the $2^K$ codewords and is the range of the encoding function.
2. ML decoder (maximum likelihood): the optimal decoder that minimizes the probability of error $P_e^{(N)}$ when the messages are uniformly chosen (uniform prior): $\hat{w}_{\mathrm{ML}} = \arg\max_{w \in [1 : 2^K]} p\left( y^N \mid x^N(w) \right)$.
3. Probability of error of message $m$: $\lambda_m \triangleq P\{ \hat{W} \neq W \mid W = m \}$.

In principle, one can derive the ML decoding rule and compute $P_e^{(N)}$ for a given codebook. But there are some challenges in proving the channel coding theorem.

Challenges and Work-Arounds

First, the expression for the error probability of ML decoding is usually intractable, and it is hard to obtain any insight into its asymptotic behavior. Second, it is unclear how to construct the codebook and the corresponding decoding scheme.

In summary, to prove the achievability part of the channel coding theorem, there are two main challenges to overcome:
1. How to show the existence of good codebooks? We circumvent the issue of explicit construction by using a random coding argument (a kind of probabilistic method).
2. How to analyze the error probability? We circumvent the issue of ML decoding error analysis by using a suboptimal decoder and deriving upper bounds on the probability of error of the chosen decoder.

Proof Program

1. Random Codebook Generation: Generate an ensemble of codebooks according to a certain probability distribution. Hence, the codebook $\mathcal{C}$ becomes a random object.
2. Error Probability Analysis:
   Goal: Show that as $N \to \infty$, $E_{\mathcal{C}}\left[ P_{e,\mathrm{ML}}^{(N)}(\mathcal{C}) \right] \to 0$, and conclude that there must exist a codebook $c$ such that the decoding error probability $P_{e,\mathrm{ML}}^{(N)} \to 0$.
   To simplify the analysis, we shall introduce suboptimal decoders and give a tractable upper bound on the error probability using the union of events bound.

Random Codebook Generation

A simple way is to generate the $2^K$ codewords i.i.d., each codeword distributed as $p(x^N) = \prod_{i=1}^{N} p_X(x_i)$. In other words, if we stack all $2^K$ codewords together into a $2^K \times N$ matrix $\mathcal{C}$, the elements of the matrix $\mathcal{C}$ will be i.i.d. according to $p_X$ (each row is a codeword):

$$\mathcal{C} = \begin{bmatrix} X_1(1) & X_2(1) & \cdots & X_N(1) \\ X_1(2) & X_2(2) & \cdots & X_N(2) \\ \vdots & \vdots & \ddots & \vdots \\ X_1(2^K) & X_2(2^K) & \cdots & X_N(2^K) \end{bmatrix}$$

and $p(c) \triangleq P\{\mathcal{C} = c\} = \prod_{w=1}^{2^K} \prod_{i=1}^{N} p_X(x_i(w))$.

It turns out that the symmetry of this codebook ensemble distribution helps simplify the analysis.
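A minimal sketch of this codebook generation step follows, with each matrix entry drawn i.i.d. from $p_X$. The function names, the integer symbol labels, the fixed seed, and the small values of $K$ and $N$ are assumptions chosen only to keep the example readable.

```python
import random


def generate_codebook(num_messages, block_length, p_x, rng):
    """Draw a num_messages x block_length codebook with i.i.d. entries from p_x.

    Symbols are labeled 0, 1, ..., len(p_x) - 1; row w - 1 is the codeword x^N(w).
    """
    symbols = list(range(len(p_x)))
    return [
        rng.choices(symbols, weights=p_x, k=block_length)
        for _ in range(num_messages)
    ]


def encode(codebook, message):
    """Encoder: message m in [1 : 2^K] is mapped to the m-th row of the codebook."""
    return codebook[message - 1]


rng = random.Random(0)  # fixed seed so the illustration is reproducible
K, N = 3, 8             # 2^K = 8 messages, block length 8
codebook = generate_codebook(2 ** K, N, [0.5, 0.5], rng)
codeword_for_message_1 = encode(codebook, 1)
```

Every row has the same distribution, which is the symmetry that the error analysis exploits when it conditions on one particular message.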

Encoding and Decoding

For a realization $c$ of the random codebook ensemble $\mathcal{C}$, we describe the encoding and decoding methods below.

Encoding: for a message $m \in [1 : 2^K]$, choose the $m$-th row of the codebook $c$ and send it out.

Decoding: ideally one would like to use the following ML decoding rule: $\hat{w}_{\mathrm{ML}} = \arg\max_{w \in [1 : 2^K]} p\left( y^N \mid x^N(w) \right)$. However, the performance of the ML decoder is usually not tractable, as mentioned before. Instead, we introduce a suboptimal decoder based on typical sequences as follows:

$\hat{w}_{\mathrm{T}}$ = a unique $w$ such that $\left( x^N(w), y^N \right) \in \mathcal{T}_\varepsilon^{(N)}(X, Y)$.

Note: there are other suboptimal decoders that can be used, such as threshold decoders.

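Error Probability Analysis (1)

Since the ML decoder is optimal, we can analyze the performance of the typicality decoder and use it as an upper bound. Hence, our goal becomes proving $\lim_{N \to \infty} E_{\mathcal{C}}\left[ P_{e,\mathrm{T}}^{(N)}(\mathcal{C}) \right] = 0$.

1. The first step is to use the symmetry of the codebook ensemble to simplify $E_{\mathcal{C}}\left[ P_{e,\mathrm{T}}^{(N)}(\mathcal{C}) \right]$ and argue that we can focus on analyzing the error probability of the first codeword $X^N(1)$ averaged over $\mathcal{C}$:

$$\begin{aligned}
E_{\mathcal{C}}\left[ P_{e,\mathrm{T}}^{(N)}(\mathcal{C}) \right] &= E_{\mathcal{C}}\left[ 2^{-K} \sum_{m=1}^{2^K} \lambda_m(\mathcal{C}) \right] = 2^{-K} \sum_{m} E_{\mathcal{C}}[\lambda_m(\mathcal{C})] \\
&= 2^{-K} \sum_{m} E_{\mathcal{C}}[\lambda_1(\mathcal{C})] = E_{\mathcal{C}}[\lambda_1(\mathcal{C})] \\
&= P\{ \text{Error, averaged over } \mathcal{C} \mid W = 1 \}.
\end{aligned}$$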

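Error Probability Analysis (2)

2. For notational simplicity, use $\mathcal{E}$ to denote the "Error" event and drop "averaged over $\mathcal{C}$". Our next focus is to upper bound $P\{\mathcal{E} \mid W = 1\} \triangleq P_1(\mathcal{E})$. The trick here is to distinguish two different kinds of errors, $\mathcal{E} = \mathcal{E}_a \cup \mathcal{E}_t$:

$$\mathcal{E}_a \triangleq \left\{ \left( X^N(1), Y^N \right) \notin \mathcal{T}_\varepsilon^{(N)} \right\}, \qquad \mathcal{E}_t \triangleq \left\{ \left( X^N(w), Y^N \right) \in \mathcal{T}_\varepsilon^{(N)} \text{ for some } w \neq 1 \right\}.$$

The core is whether or not the joint sequence $\left( X^N(w), Y^N \right)$ is $\varepsilon$-typical. Let us define $\mathcal{A}_w \triangleq \left\{ \left( X^N(w), Y^N \right) \in \mathcal{T}_\varepsilon^{(N)} \right\}$. We can then rewrite $\mathcal{E}_a = \mathcal{A}_1^c$, $\mathcal{E}_t = \bigcup_{w \neq 1} \mathcal{A}_w$, and hence $\mathcal{E} = \mathcal{E}_a \cup \mathcal{E}_t = \mathcal{A}_1^c \cup \left( \bigcup_{w \neq 1} \mathcal{A}_w \right)$.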

Error Probability Analysis (3)

3. We are now ready to apply the union of events bound:

$$P_1\{\mathcal{E}\} = P_1\left\{ \mathcal{A}_1^c \cup \left( \bigcup_{w \neq 1} \mathcal{A}_w \right) \right\} \le P_1\{\mathcal{A}_1^c\} + \sum_{w=2}^{2^K} P_1\{\mathcal{A}_w\}.$$

Next, we shall develop upper bounds on
- the probability that the actually transmitted codeword $X^N(1)$ and the actually received signal $Y^N$ are not (jointly) typical;
- the probability that some other (random) codeword $X^N(w)$, $w \neq 1$, and the actually received signal $Y^N$ are (jointly) typical.

Lemma 1 (A Key Lemma)
$P_1\{\mathcal{A}_1\} \ge 1 - \varepsilon$ for $N$ large enough, and $P_1\{\mathcal{A}_w\} \le 2^{-N(I(X;Y) - \delta(\varepsilon))}$ for all $w \neq 1$, where $\delta(\varepsilon) \to 0$ as $\varepsilon \to 0$.

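Error Probability Analysis (4)

4. Finally, let us put all the above together and apply Lemma 1:

$$\begin{aligned}
E_{\mathcal{C}}\left[ P_{e,\mathrm{T}}^{(N)}(\mathcal{C}) \right] = P\{\mathcal{E} \mid W = 1\} \triangleq P_1\{\mathcal{E}\} &\le P_1\{\mathcal{A}_1^c\} + \sum_{w=2}^{2^K} P_1\{\mathcal{A}_w\} \\
&\le \varepsilon + \sum_{w=2}^{2^K} 2^{-N(I(X;Y) - \delta(\varepsilon))} \\
&\le \varepsilon + 2^{-N(I(X;Y) - \delta(\varepsilon) - R)}.
\end{aligned}$$

As long as $R < I(X; Y) - \delta(\varepsilon)$, we are able to make $P\{\mathcal{E}\} \le 2\varepsilon$ for $N$ large enough. Since $\varepsilon$ is arbitrary, this is equivalent to $\lim_{N \to \infty} E_{\mathcal{C}}\left[ P_{e,\mathrm{T}}^{(N)}(\mathcal{C}) \right] = 0$.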

Completion of the Proof

We have shown that as long as $R < I(X; Y) - \delta(\varepsilon)$, $\lim_{N \to \infty} E_{\mathcal{C}}\left[ P_{e,\mathrm{T}}^{(N)}(\mathcal{C}) \right] = 0$, and hence there must exist a realization of the codebook $c$ such that $P_{e,\mathrm{T}}^{(N)}(c) \to 0$ as $N \to \infty$.

Finally, taking the codebook generating distribution $p_X = \arg\max_{p(x)} I(X; Y)$, we conclude that $\forall\, R < C = \max_{p(x)} I(X; Y)$, $R$ is achievable.
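The random coding argument can be illustrated with a small Monte Carlo experiment on a BSC. The sketch below is not the typicality decoder of the proof: it swaps in minimum Hamming distance decoding, which coincides with ML decoding on a BSC with crossover probability below 1/2. A fresh random codebook is drawn in every trial, so the estimate approximates the ensemble-average error probability. The rate 0.25, crossover 0.05, block lengths, trial counts, and seed are illustrative assumptions.

```python
import math
import random


def simulate_random_code(n, rate, p, trials, rng):
    """Estimate the ensemble-average error probability of random codes on a BSC(p).

    Codewords are n-bit integers with i.i.d. uniform bits; message 1 (index 0) is
    always sent, which is justified by the symmetry of the ensemble.
    """
    k = math.ceil(n * rate - 1e-9)  # K = ceil(NR), guarding against float round-off
    num_messages = 2 ** k
    errors = 0
    for _ in range(trials):
        codebook = [rng.getrandbits(n) for _ in range(num_messages)]
        sent = codebook[0]
        noise = 0
        for i in range(n):
            if rng.random() < p:  # flip bit i with probability p
                noise |= 1 << i
        received = sent ^ noise
        distances = [bin(cw ^ received).count("1") for cw in codebook]
        best = min(distances)
        # Error unless the sent codeword is the unique closest one.
        if distances[0] != best or distances.count(best) > 1:
            errors += 1
    return errors / trials


rng = random.Random(1)
# Capacity of BSC(0.05) is about 0.71 bits, so rate 0.25 is well below capacity.
short_block_error = simulate_random_code(n=8, rate=0.25, p=0.05, trials=200, rng=rng)
long_block_error = simulate_random_code(n=40, rate=0.25, p=0.05, trials=100, rng=rng)
```

At a fixed rate below capacity, the longer block length should show a much smaller average error rate, in line with the claim that good codebooks exist for large $N$.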

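Proof of Lemma 1 (1): Recap of Typicality

Recall: by definition, an $\varepsilon$-typical (vector) sequence $(x^n, y^n)$ shall satisfy

$$\left| \pi(a, b \mid x^n, y^n) - p_{X,Y}(a, b) \right| \le \varepsilon\, p_{X,Y}(a, b), \quad \forall\, (a, b) \in \mathcal{X} \times \mathcal{Y}.$$

(Note: we can think of $(X, Y)$ as a single r.v. and apply the same definition of typicality!)

Hence, if $(X^n, Y^n) \sim \prod_{i=1}^{n} p_{X,Y}(x_i, y_i)$, then we have
0. $(x^n, y^n) \in \mathcal{T}_\varepsilon^{(n)}(X, Y) \implies x^n \in \mathcal{T}_\varepsilon^{(n)}(X)$, $y^n \in \mathcal{T}_\varepsilon^{(n)}(Y)$.
1. $\forall\, (x^n, y^n) \in \mathcal{T}_\varepsilon^{(n)}(X, Y)$, $\left| -\frac{1}{n} \log p(x^n, y^n) - H(X, Y) \right| \le \delta(\varepsilon)$, where $\delta(\varepsilon) = \varepsilon H(X, Y)$.
2. $P\left\{ (X^n, Y^n) \in \mathcal{T}_\varepsilon^{(n)}(X, Y) \right\} \ge 1 - \varepsilon$ for $n$ large enough.
3. $\left| \mathcal{T}_\varepsilon^{(n)}(X, Y) \right| \le 2^{n(H(X,Y) + \delta(\varepsilon))}$.
4. $\left| \mathcal{T}_\varepsilon^{(n)}(X, Y) \right| \ge (1 - \varepsilon)\, 2^{n(H(X,Y) - \delta(\varepsilon))}$ for $n$ large enough.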
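The typicality condition compares the empirical frequency $\pi(a, b \mid x^n, y^n)$ of each symbol pair with its probability. A minimal sketch of this test is given below; the dictionary representation of $p_{X,Y}$, the function name, and the example distribution (a uniform input through a BSC with crossover 0.25) are assumptions for illustration.

```python
from collections import Counter


def is_jointly_typical(x_seq, y_seq, p_xy, eps):
    """Check |pi(a,b | x^n, y^n) - p(a,b)| <= eps * p(a,b) for all pairs (a, b).

    `p_xy` maps each pair (a, b) to its joint probability p_{X,Y}(a, b).
    """
    n = len(x_seq)
    counts = Counter(zip(x_seq, y_seq))
    # A pair with zero probability must not occur at all in a typical sequence.
    if any(p_xy.get(pair, 0.0) == 0.0 for pair in counts):
        return False
    return all(
        abs(counts.get(pair, 0) / n - prob) <= eps * prob
        for pair, prob in p_xy.items()
    )


# Joint pmf of a uniform binary input passed through a BSC with crossover 0.25.
p_xy = {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}

x_good = [0, 0, 0, 0, 1, 1, 1, 1]
y_good = [0, 0, 0, 1, 0, 1, 1, 1]  # empirical frequencies match p_xy exactly
typical = is_jointly_typical(x_good, y_good, p_xy, eps=0.1)
atypical = is_jointly_typical([0] * 8, [0] * 8, p_xy, eps=0.1)
```

A typicality decoder applies this test to the received sequence paired with each codeword in turn and declares the message only if exactly one codeword passes.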

Proof of Lemma 1 (2): Typical with Actual Codeword

Let us first consider $P_1\{\mathcal{A}_1\} = P\left\{ \left( X^N(1), Y^N \right) \in \mathcal{T}_\varepsilon^{(N)} \mid W = 1 \right\}$.

We are averaging over a random codebook ensemble $\mathcal{C}$, and the random codebook is generated element-by-element i.i.d. based on $p_X$. DMC without feedback implies $p(y^N \mid x^N) = \prod_{i=1}^{N} p_{Y|X}(y_i \mid x_i)$.

Hence, given $W = 1$, $\left( X^N(1), Y^N \right)$ has the following joint distribution:

$$p(x^N, y^N) = p(x^N)\, p(y^N \mid x^N) = \prod_{i=1}^{N} p_X(x_i) \prod_{i=1}^{N} p_{Y|X}(y_i \mid x_i) = \prod_{i=1}^{N} p_{X,Y}(x_i, y_i).$$

By Property 2 (LLN), we see that for $N$ large enough,

$$P_1\{\mathcal{A}_1\} = P\left\{ \left( X^N(1), Y^N \right) \in \mathcal{T}_\varepsilon^{(N)} \mid W = 1 \right\} \ge 1 - \varepsilon.$$

Proof of Lemma 1 (3): Typical with a Wrong Codeword

Consider $P_1\{\mathcal{A}_w\} = P\left\{ \left( X^N(w), Y^N \right) \in \mathcal{T}_\varepsilon^{(N)} \mid W = 1 \right\}$ for $w \neq 1$.

Note that we are averaging over a random codebook ensemble $\mathcal{C}$, and the random codebook is generated element-by-element i.i.d. based on $p_X$. Hence, although $X^N(1)$ and $X^N(w)$ have the same marginal distribution, they are actually independent. Due to the DMC, $\left( X^N(1), Y^N \right) \perp X^N(w)$. Hence, $Y^N \perp X^N(w)$, and

$$\begin{aligned}
P_1\{\mathcal{A}_w\} &= \sum_{(x^N, y^N) \in \mathcal{T}_\varepsilon^{(N)}} p(x^N)\, p(y^N) \\
&\le \underbrace{2^{N(1+\varepsilon) H(X,Y)}}_{\text{cardinality upper bound on typical set}} \cdot \underbrace{2^{-N(1-\varepsilon) H(X)}}_{\text{upper bound on prob. of a typical sequence}} \cdot \underbrace{2^{-N(1-\varepsilon) H(Y)}}_{\text{upper bound on prob. of a typical sequence}} \\
&= 2^{-N(I(X;Y) - \delta(\varepsilon))},
\end{aligned}$$

where $\delta(\varepsilon) = \varepsilon \left( H(X, Y) + H(X) + H(Y) \right) \to 0$ as $\varepsilon \to 0$.

Some Reflections

Reflection 1: Mutual independence of codewords.
In the random coding argument of the proof, the $2^K \cdot N$ elements of the codebook matrix $\mathcal{C}$ are generated i.i.d., and hence the $2^K$ rows $\{ X^N(1), \ldots, X^N(2^K) \}$ are mutually independent. However, in the proof we only require pairwise independence: $X^N(1) \perp X^N(w)$, $\forall\, w \neq 1$.

Reflection 2: Typicality decoder.
We use the typicality decoder rather than the optimal ML decoder in order to find tractable upper bounds on the error probability. There are other suboptimal decoders that can be used. For example, the following threshold decoder also works:

$\hat{w}_{\mathrm{th}} \triangleq$ a unique $w$ such that $i\left( x^N(w); y^N \right) > \beta$,

where $i(x^N; y^N) \triangleq \log \dfrac{p(x^N, y^N)}{p(x^N)\, p(y^N)} = \sum_{k=1}^{N} \log \dfrac{p_{Y|X}(y_k \mid x_k)}{p_Y(y_k)}$, and $\beta \triangleq N \left( I(X; Y) - \varepsilon \right)$.
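A threshold decoder based on the information density $i(x^N; y^N)$ can be sketched as follows. The sketch assumes the per-symbol form $\sum_k \log_2 \frac{p_{Y|X}(y_k \mid x_k)}{p_Y(y_k)}$ and a threshold that scales with the block length, $\beta = N(I(X;Y) - \varepsilon)$; the toy codebook, the BSC with crossover 0.1, and $\varepsilon = 0.3$ are illustrative assumptions.

```python
import math


def information_density(x_seq, y_seq, channel, p_x):
    """i(x^N; y^N) = sum_k log2( p(y_k | x_k) / p_Y(y_k) ) for i.i.d. inputs ~ p_x."""
    n_out = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_out)]
    total = 0.0
    for x, y in zip(x_seq, y_seq):
        if channel[x][y] == 0:
            return -math.inf  # impossible pair: density is minus infinity
        total += math.log2(channel[x][y] / p_y[y])
    return total


def threshold_decode(codebook, y_seq, channel, p_x, beta):
    """Return the unique message (1-based) whose density exceeds beta, else None."""
    hits = [
        w
        for w, codeword in enumerate(codebook, start=1)
        if information_density(codeword, y_seq, channel, p_x) > beta
    ]
    return hits[0] if len(hits) == 1 else None


bsc = [[0.9, 0.1], [0.1, 0.9]]
p_x = [0.5, 0.5]
codebook = [[0] * 6, [1] * 6]  # toy codebook with two codewords
n = 6
h_b = -(0.1 * math.log2(0.1) + 0.9 * math.log2(0.9))
beta = n * ((1.0 - h_b) - 0.3)  # assumed threshold N * (I(X;Y) - eps), eps = 0.3

decoded = threshold_decode(codebook, [0, 0, 0, 0, 1, 0], bsc, p_x, beta)
undecided = threshold_decode(codebook, [0, 0, 0, 1, 1, 1], bsc, p_x, beta)
```

With one bit flipped the first codeword is the only one above the threshold; with three flips neither codeword clears it and the decoder declares an error.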


Joint Source-Channel Coding: Problem Setup

[Block diagram: Source $\{S_i\}$ → $s^{N_s}$ → Encoder → $x^{N_c}$ → Channel $p_{Y|X}$ → $y^{N_c}$ → Decoder → $\hat{s}^{N_s}$ → Destination]

Source model: discrete stationary ergodic with entropy rate $\mathcal{H}(\{S_i\})$.
Channel model: DMC $p_{Y|X}$ with channel capacity $C(p_{Y|X})$.

1. An $\left( |\mathcal{S}|^{N_c R}, N_c \right)$ joint source-channel code consists of
   - an encoding function (encoder) $\mathrm{enc}_{N_c} : \mathcal{S}^{N_s} \to \mathcal{X}^{N_c}$ that maps each source sequence $s^{N_s}$ to a length-$N_c$ codeword $x^{N_c}$, where $N_s \triangleq \lceil N_c R \rceil$;
   - a decoding function (decoder) $\mathrm{dec}_{N_c} : \mathcal{Y}^{N_c} \to \mathcal{S}^{N_s}$ that maps a channel output sequence $y^{N_c}$ to a reconstructed sequence $\hat{s}^{N_s}$.
2. The error probability is defined as $P_e^{(N_c)} \triangleq P\left\{ S^{N_s} \neq \hat{S}^{N_s} \right\}$.
3. A rate $R$ is said to be achievable if there exists a sequence of $\left( |\mathcal{S}|^{N_c R}, N_c \right)$ codes such that $P_e^{(N_c)} \to 0$ as $N_c \to \infty$.

Theorem 3
1. If $R < \dfrac{C}{\mathcal{H}(\{S_i\})}$, then $R$ is achievable, i.e., lossless reconstruction of the source $\{S_i\}$ is possible via the noisy channel $p_{Y|X}$.
2. Conversely, if $R > \dfrac{C}{\mathcal{H}(\{S_i\})}$, then $R$ is not achievable, i.e., lossless reconstruction is impossible.

[Block diagram: Source → $s^{N_s}$ → Source Encoder → $b^K$ → Channel Encoder → $x^{N_c}$ → Noisy Channel → $y^{N_c}$ → Channel Decoder → $\hat{b}^K$ → Source Decoder → $\hat{s}^{N_s}$ → Destination. The bit sequence $b^K$ forms a binary interface between the source coder and the channel coder.]

Proof of Achievability

pf: (Achievability Part):
- Choose a $\left( 2^{N_s R_s}, N_s \right)$ lossless source code with $R_s = \mathcal{H}(\{S_i\}) + \varepsilon_s$.
- Choose a $\left( 2^{N_c R_c}, N_c \right)$ channel code with $R_c = C - \varepsilon_c$.

Due to the channel coding theorem, the binary sequence $b^K$ living in the digital interface between the source and the channel coders can be decoded with vanishing error probability. Due to the lossless source coding theorem, the source sequences can be reconstructed with vanishing error probability as long as the bit sequence $b^K$ is successfully decoded by the channel decoder.

Concatenating the above two codes, we see that as long as

$$N_s R_s < N_c R_c \iff \frac{N_s}{N_c} < \frac{R_c}{R_s} = \frac{C - \varepsilon_c}{\mathcal{H}(\{S_i\}) + \varepsilon_s},$$

the separation scheme is able to reconstruct the source sequence with vanishing error probability. Since $\varepsilon_s, \varepsilon_c$ can be made arbitrarily small, any $R < \dfrac{C}{\mathcal{H}(\{S_i\})}$ is achievable.
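The separation condition is plain arithmetic once the entropy rate and the capacity are known. As a small worked sketch, consider an i.i.d. Bernoulli(0.11) source (entropy rate $H_b(0.11)$) sent over a BSC with crossover 0.11 (capacity $1 - H_b(0.11)$). The source, the channel, the slack values $\varepsilon_s = \varepsilon_c = 0.01$, and the function names are all assumptions chosen for illustration.

```python
import math


def binary_entropy(p):
    """Binary entropy H_b(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def separation_threshold(entropy_rate, capacity):
    """Supremum of achievable source symbols per channel use: C / H."""
    return capacity / entropy_rate


def separation_feasible(n_s, n_c, entropy_rate, capacity, eps_s, eps_c):
    """Check N_s * R_s < N_c * R_c with R_s = H + eps_s and R_c = C - eps_c."""
    return n_s * (entropy_rate + eps_s) < n_c * (capacity - eps_c)


entropy_rate = binary_entropy(0.11)    # i.i.d. Bernoulli(0.11) source
capacity = 1.0 - binary_entropy(0.11)  # BSC with crossover probability 0.11
threshold = separation_threshold(entropy_rate, capacity)

# 900 source symbols in 1000 channel uses fits; 1100 source symbols does not.
fits = separation_feasible(900, 1000, entropy_rate, capacity, 0.01, 0.01)
too_many = separation_feasible(1100, 1000, entropy_rate, capacity, 0.01, 0.01)
```

In this example the entropy rate and the capacity are both close to 0.5 bit, so the threshold is close to one source symbol per channel use.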

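Proof of Converse

pf: (Converse Part): We shall prove that $\forall$ achievable $R$, $R \le \dfrac{C}{\mathcal{H}(\{S_i\})}$.

$$\begin{aligned}
N_s \mathcal{H}(\{S_i\}) \le H\left( S^{N_s} \right) &= I\left( S^{N_s}; \hat{S}^{N_s} \right) + H\left( S^{N_s} \mid \hat{S}^{N_s} \right) && (7) \\
&\le I\left( S^{N_s}; Y^{N_c} \right) + \left( 1 + P_e^{(N_c)} N_s \log |\mathcal{S}| \right) && (8) \\
&= \sum_{k=1}^{N_c} I\left( S^{N_s}; Y_k \mid Y^{k-1} \right) + \left( 1 + P_e^{(N_c)} N_s \log |\mathcal{S}| \right) \\
&\le N_c \left( C + \varepsilon_{N_c} \right), \quad \text{where } \varepsilon_{N_c} \to 0 \text{ as } N_c \to \infty. && (9)
\end{aligned}$$

(7) is due to the property of the entropy rate and the chain rule.
(8) is due to the Markov chain $S^{N_s} - Y^{N_c} - \hat{S}^{N_s}$ and Fano's inequality.
(9) is due to steps similar to those in the channel coding converse proof.

Hence, $R \le \dfrac{N_s}{N_c} \le \dfrac{C + \varepsilon_{N_c}}{\mathcal{H}(\{S_i\})}$, and taking $N_c \to \infty$ gives $R \le \dfrac{C}{\mathcal{H}(\{S_i\})}$ if $R$ is achievable.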

Summary

- Channel coding theorem: $C = \max_{p(x)} I(X; Y)$, for a DMC $p_{Y|X}$ with or without feedback.
- Weak converse: Fano's inequality, data processing inequality, and the DMC assumption.
- Achievability: random coding argument, typicality decoder.
- Feedback does not increase the capacity of a DMC.
- Symmetric channel capacity $= \log |\mathcal{Y}| - H(\mathbf{p})$, where all rows of $p_{Y|X}$ are permutations of $\mathbf{p}$.
- Erasure channel capacity $= (1 - p) \log |\mathcal{X}|$.
- Joint source-channel coding theorem: $R < \dfrac{C}{\mathcal{H}(\{S_i\})} \implies R$ is achievable; $R > \dfrac{C}{\mathcal{H}(\{S_i\})} \implies R$ is not achievable.
- Source-channel separation is optimal.