Lecture 5 Channel Coding over Continuous Channels


1 Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, / 34 I-Hsiang Wang NIT Lecture 5

2 From Discrete-Valued to Continuous-Valued (1)
So far we have focused on discrete-valued (and finite-alphabet) r.v.'s:
- Entropy and mutual information for discrete-valued r.v.'s.
- Lossless source coding for discrete stationary sources.
- Channel coding over discrete memoryless channels.
In this lecture we extend the basic principles and fundamental theorems to continuous-valued sources and channels. In particular:
- Mutual information for continuous-valued r.v.'s.
- Lossy source coding for continuous stationary sources.
- Channel coding with input cost over continuous memoryless channels (main example: Gaussian channel capacity).
We skip lossy source coding (rate-distortion theory) in this course.

3 From Discrete-Valued to Continuous-Valued (2)
Main technique for extending coding theorems from the discrete-valued, finite-alphabet world to the continuous-valued world: discretization.
Advantages:
- No need for new tools (e.g., typicality) for continuous-valued r.v.'s.
- Extends naturally to multi-terminal settings, so we can focus on discrete memoryless networks.
Outline:
1. Differential entropy
2. Channel coding with input cost over DMC
3. Gaussian channel capacity

4 Disclaimer: due to time constraints, we will not be 100% rigorous in deriving the results in this lecture. Rigorous treatments can be found in the references.
Reading:
1. Mutual information and differential entropy: Chapter 8, Cover & Thomas [2]; Chapter 15, Moser [5]; Chapter 2.2, El Gamal & Kim [1].
2. Gaussian channel capacity: Chapter 9, Cover & Thomas [2]; Chapter 16, Moser [5]; Chapters 3.3 and 3.4, El Gamal & Kim [1].
Remark: Using discretization to derive the achievability of Gaussian channel capacity follows [1]. [2] uses weak typicality for continuous random variables; [5] uses a threshold decoder, similar in spirit to weak typicality.

5 1 Mutual Information and Differential Entropy

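6 Entropy of a Continuous Random Variable
Question: What is the entropy of a continuous real-valued random variable $X$? Suppose $X$ has the probability density function (p.d.f.) $f(x)$. Let us discretize $X$ to answer this question, as follows:
- Partition $\mathbb{R}$ into length-$\Delta$ intervals: $\mathbb{R} = \bigcup_{k=-\infty}^{\infty} [k\Delta, (k+1)\Delta)$.
- Suppose that $f(x)$ is continuous. Then by the mean-value theorem, $\forall\, k \in \mathbb{Z}$, $\exists\, x_k \in [k\Delta, (k+1)\Delta)$ such that $f(x_k) = \frac{1}{\Delta} \int_{k\Delta}^{(k+1)\Delta} f(x)\, dx$.
- Set $[X]_\Delta := x_k$ if $X \in [k\Delta, (k+1)\Delta)$, with p.m.f. $p(x_k) = f(x_k)\Delta$.
Observation: $\lim_{\Delta \to 0} H([X]_\Delta) = H(X)$ (intuitively), while
\[
H([X]_\Delta) = -\sum_{k=-\infty}^{\infty} \big(f(x_k)\Delta\big) \log\big(f(x_k)\Delta\big) = -\Delta \sum_{k=-\infty}^{\infty} f(x_k) \log f(x_k) - \log \Delta \;\to\; -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx + \infty = \infty \quad \text{as } \Delta \to 0.
\]
The second equality uses $\sum_k f(x_k)\Delta = 1$. We conclude that $H(X) = \infty$ if $-\int_{-\infty}^{\infty} f(x) \log f(x)\, dx = \mathrm{E}\left[\log \frac{1}{f(X)}\right]$ exists.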
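The divergence of $H([X]_\Delta)$ can be checked numerically. The sketch below is an illustration only, not part of the lecture: it assumes a standard Gaussian as the test density, and the helper names (`gaussian_cdf`, `quantized_entropy_bits`) are hypothetical. It computes the entropy of the quantized variable from exact bin probabilities and shows that $H([X]_\Delta) + \log_2 \Delta$ settles near $\frac{1}{2}\log_2(2\pi e)$ while $H([X]_\Delta)$ itself keeps growing.

```python
import math

def gaussian_cdf(x):
    """CDF of N(0, 1); the Gaussian is only an assumed test case."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantized_entropy_bits(delta, span=12.0):
    """H([X]_delta) in bits for X ~ N(0, 1) with bins [k*delta, (k+1)*delta)."""
    n_bins = int(math.ceil(span / delta))
    entropy = 0.0
    for k in range(-n_bins, n_bins):
        # Exact probability mass of the k-th bin (equals f(x_k) * delta by the MVT).
        p = gaussian_cdf((k + 1) * delta) - gaussian_cdf(k * delta)
        if p > 0.0:
            entropy -= p * math.log2(p)
    return entropy

# Differential entropy of N(0, 1) in bits.
h_gauss = 0.5 * math.log2(2.0 * math.pi * math.e)

for delta in (1.0, 0.1, 0.01):
    H = quantized_entropy_bits(delta)
    # H grows like -log2(delta); the shifted value approaches h_gauss.
    print(delta, H, H + math.log2(delta), h_gauss)
```

Each tenfold reduction of $\Delta$ adds roughly $\log_2 10 \approx 3.32$ bits to the quantized entropy, which is the $-\log \Delta$ term in the derivation.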

7 Differential Entropy
It is quite intuitive that the entropy of a continuous random variable can be arbitrarily large, because it can take infinitely many possible values. Hence, in general it is impossible to losslessly compress a continuous source with finite rate. Instead, lossy source coding is done.
Yet, for continuous r.v.'s, it turns out to be useful to define the counterparts of entropy and conditional entropy, as follows:
Definition 1 (Differential entropy and conditional differential entropy)
The differential entropy of a continuous r.v. $X$ with p.d.f. $f(x)$ is defined as $h(X) := \mathrm{E}\left[\log \frac{1}{f(X)}\right]$ if the (improper) integral exists.
The conditional differential entropy of a continuous r.v. $X$ given $Y$, where $(X, Y)$ has joint p.d.f. $f(x, y)$ and conditional p.d.f. $f(x|y)$, is defined as $h(X|Y) := \mathrm{E}\left[\log \frac{1}{f(X|Y)}\right]$ if the (improper) integral exists.

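8 Mutual Information between Continuous Random Variables
How about the mutual information between two continuous real-valued random variables $X$ and $Y$, with joint p.d.f. $f_{X,Y}(x, y)$ and marginal p.d.f.'s $f_X(x)$ and $f_Y(y)$? Again, we use discretization:
- Partition the plane $\mathbb{R}^2$ into $\Delta \times \Delta$ squares: $\mathbb{R}^2 = \bigcup_{k,j=-\infty}^{\infty} I_k^\Delta \times I_j^\Delta$, where $I_k^\Delta := [k\Delta, (k+1)\Delta)$.
- Suppose that $f_{X,Y}(x, y)$ is continuous. Then by the mean-value theorem (MVT), $\forall\, k, j \in \mathbb{Z}$, $\exists\, (x_k, y_j) \in I_k^\Delta \times I_j^\Delta$ such that $f_{X,Y}(x_k, y_j) = \frac{1}{\Delta^2} \iint_{I_k^\Delta \times I_j^\Delta} f_{X,Y}(x, y)\, dx\, dy$.
- Set $([X]_\Delta, [Y]_\Delta) := \sum_{k,j} (x_k, y_j)\, \mathbb{I}\{(X, Y) \in I_k^\Delta \times I_j^\Delta\}$, with p.m.f. $p(x_k, y_j) = f_{X,Y}(x_k, y_j)\Delta^2$.
- By MVT, $\forall\, k, j \in \mathbb{Z}$, $\exists\, \tilde{x}_k \in I_k^\Delta$ and $\tilde{y}_j \in I_j^\Delta$ such that
\[
p(x_k) := \int_{I_k^\Delta} f_X(x)\, dx = \Delta f_X(\tilde{x}_k), \qquad p(y_j) := \int_{I_j^\Delta} f_Y(y)\, dy = \Delta f_Y(\tilde{y}_j).
\]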

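9 Mutual Information between Continuous Random Variables
Observation: $\lim_{\Delta \to 0} I([X]_\Delta; [Y]_\Delta) = I(X; Y)$ (intuitively), while
\[
\begin{aligned}
I([X]_\Delta; [Y]_\Delta) &= \sum_{k,j=-\infty}^{\infty} p(x_k, y_j) \log \frac{p(x_k, y_j)}{p(x_k)\, p(y_j)} \\
&= \sum_{k,j=-\infty}^{\infty} \big(f_{X,Y}(x_k, y_j)\Delta^2\big) \log \frac{f_{X,Y}(x_k, y_j)\Delta^2}{f_X(\tilde{x}_k)\, f_Y(\tilde{y}_j)\, \Delta^2} \\
&= \Delta^2 \sum_{k,j=-\infty}^{\infty} f_{X,Y}(x_k, y_j) \log \frac{f_{X,Y}(x_k, y_j)}{f_X(\tilde{x}_k)\, f_Y(\tilde{y}_j)} \\
&\to \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f_{X,Y}(x, y) \log \frac{f_{X,Y}(x, y)}{f_X(x)\, f_Y(y)}\, dx\, dy \quad \text{as } \Delta \to 0.
\end{aligned}
\]
Unlike the entropy case, the factors of $\Delta^2$ inside the logarithm cancel, so the limit is finite. Hence, $I(X; Y) = \mathrm{E}\left[\log \frac{f(X, Y)}{f(X) f(Y)}\right]$ if the improper integral exists.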
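A small numerical experiment can make the limit concrete. The sketch below is an illustration under assumptions that are not in the slides: it uses a zero-mean, unit-variance bivariate Gaussian with correlation `rho` as the test density, approximates each cell probability by the density at the cell midpoint times $\Delta^2$ (then renormalizes), and compares the discrete mutual information with the known closed form $-\frac{1}{2}\log_2(1-\rho^2)$ for jointly Gaussian variables. All function names are hypothetical.

```python
import math

def bivariate_gaussian_pdf(x, y, rho):
    """Joint density of zero-mean, unit-variance Gaussians with correlation rho."""
    det = 1.0 - rho * rho
    quad = (x * x - 2.0 * rho * x * y + y * y) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def discretized_mi_bits(rho, delta=0.1, span=6.0):
    """Mutual information (bits) of the grid approximation on [-span, span)^2."""
    n = int(round(2.0 * span / delta))
    mids = [-span + (k + 0.5) * delta for k in range(n)]
    # Cell probabilities p(x_k, y_j) ~ f(x_k, y_j) * delta^2, using midpoints.
    p = [[bivariate_gaussian_pdf(x, y, rho) * delta * delta for y in mids] for x in mids]
    total = sum(sum(row) for row in p)
    p = [[v / total for v in row] for row in p]  # renormalize the truncated grid
    px = [sum(row) for row in p]
    py = [sum(p[k][j] for k in range(n)) for j in range(n)]
    mi = 0.0
    for k in range(n):
        for j in range(n):
            if p[k][j] > 0.0:
                mi += p[k][j] * math.log2(p[k][j] / (px[k] * py[j]))
    return mi

def gaussian_mi_bits(rho):
    """Closed-form I(X;Y) in bits for jointly Gaussian X, Y with correlation rho."""
    return -0.5 * math.log2(1.0 - rho * rho)

print(discretized_mi_bits(0.5), gaussian_mi_bits(0.5))
```

In contrast with the entropy experiment, refining the grid does not make the value blow up, because the cell-size factors cancel inside the logarithm.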

10 Mutual Information
Unlike entropy, which is only well-defined for discrete random variables, mutual information can be defined in general between two real-valued random variables (not necessarily continuous or discrete), as follows.
Definition 2 (Mutual information)
The mutual information between two random variables $X$ and $Y$ is defined as
\[
I(X; Y) = \sup_{\mathcal{P}, \mathcal{Q}} I\big([X]_{\mathcal{P}}; [Y]_{\mathcal{Q}}\big),
\]
where the supremum is taken over all pairs of partitions $\mathcal{P}$ and $\mathcal{Q}$ of $\mathbb{R}$.
We have the following theorem immediately from the previous discussion:
Theorem 1 (Mutual information between two continuous r.v.'s)
\[
I(X; Y) = \mathrm{E}\left[\log \frac{f(X, Y)}{f(X) f(Y)}\right] = h(X) - h(X|Y).
\]

11 Properties that Extend to Continuous R.V.'s
Proposition 1 (Chain rule)
$h(X, Y) = h(X) + h(Y|X)$, $\quad h(X^n) = \sum_{i=1}^{n} h(X_i \,|\, X^{i-1})$.
Proposition 2 (Conditioning reduces differential entropy)
$h(X|Y) \le h(X)$, $\quad h(X|Y, Z) \le h(X|Z)$.
Proposition 3 (Non-negativity of mutual information)
$I(X; Y) \ge 0$, $\quad I(X; Y|Z) \ge 0$.

12 New Properties of Differential Entropy
Example 1 (Differential entropy of a uniform r.v.)
For a r.v. $X \sim \mathrm{Unif}[a, b]$, that is, its p.d.f. is $f_X(x) = \frac{1}{b-a}\, \mathbb{I}\{a \le x \le b\}$, its differential entropy is $h(X) = \log(b - a)$.
- Differential entropy can be negative. Since $b - a$ can be made arbitrarily small, $h(X) = \log(b - a)$ can be negative. Hence, the non-negativity of entropy does not extend to differential entropy.
- Scaling changes the differential entropy. Consider $X \sim \mathrm{Unif}[0, 1]$, so that $2X \sim \mathrm{Unif}[0, 2]$. With logarithms to base 2, $h(X) = \log 1 = 0$ and $h(2X) = \log 2 = 1$, hence $h(X) \ne h(2X)$. This is in sharp contrast to entropy: $H(X) = H(g(X))$ as long as $g(\cdot)$ is an invertible function.
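The two observations in Example 1 can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical function names; entropies are in bits (base-2 logarithm), matching the $\log 2 = 1$ convention in the example.

```python
import math

def uniform_diff_entropy_bits(a, b):
    """h(X) = log2(b - a) for X ~ Unif[a, b], in bits."""
    if not b > a:
        raise ValueError("need a < b")
    return math.log2(b - a)

def scaled_uniform_diff_entropy_bits(a, b, c):
    """h(cX) for X ~ Unif[a, b], c != 0: cX is uniform on an interval of length |c|(b - a)."""
    if c == 0:
        raise ValueError("scaling factor must be nonzero")
    lo, hi = sorted((c * a, c * b))
    return uniform_diff_entropy_bits(lo, hi)

print(uniform_diff_entropy_bits(0.0, 1.0))              # 0.0
print(scaled_uniform_diff_entropy_bits(0.0, 1.0, 2.0))  # 1.0
print(uniform_diff_entropy_bits(0.0, 0.25))             # -2.0, negative
```

Shrinking the support below length 1 pushes the value below zero, and doubling the variable adds exactly one bit.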

13 Scaling and Translation
Proposition 4
Let $X$ be a continuous random variable with differential entropy $h(X)$.
- Translation does not change the differential entropy: for a constant $c$, $h(X + c) = h(X)$.
- Scaling shifts the differential entropy: for a constant $a \ne 0$, $h(aX) = h(X) + \log |a|$.
Proposition 5
Let $\mathbf{X}$ be a continuous random vector with differential entropy $h(\mathbf{X})$.
- For a constant vector $\mathbf{c}$, $h(\mathbf{X} + \mathbf{c}) = h(\mathbf{X})$.
- For an invertible matrix $A \in \mathbb{R}^{n \times n}$, $h(A\mathbf{X}) = h(\mathbf{X}) + \log |\det A|$.
The proofs of these propositions are left as exercises.

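14 Differential Entropy of Gaussian Random Vectors
Example 2 (Differential entropy of $\mathcal{N}(0, 1)$)
For a r.v. $X \sim \mathcal{N}(0, 1)$, that is, its p.d.f. is $f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, its differential entropy is $h(X) = \frac{1}{2} \log(2\pi e)$.
For an $n$-dimensional random vector $\mathbf{X} \sim \mathcal{N}(\mathbf{m}, K)$, we can rewrite $\mathbf{X}$ as $\mathbf{X} = A\mathbf{W} + \mathbf{m}$, where $A A^T = K$ and $\mathbf{W}$ consists of i.i.d. $W_i \sim \mathcal{N}(0, 1)$, $i = 1, \ldots, n$. Hence, by the translation and scaling properties of differential entropy:
\[
h(\mathbf{X}) = h(\mathbf{W}) + \log |\det A| = \sum_{i=1}^{n} h(W_i) + \frac{1}{2} \log \det K = \frac{n}{2} \log(2\pi e) + \frac{1}{2} \log \det K = \frac{1}{2} \log\big((2\pi e)^n \det K\big).
\]
The step $\log |\det A| = \frac{1}{2} \log \det K$ follows from $\det K = (\det A)^2$.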
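The formula for Gaussian vectors can be evaluated directly by following the factorization $K = A A^T$. The sketch below is one possible illustration, not the lecture's own code: it takes $A$ to be the Cholesky factor of $K$ (an assumption, since any $A$ with $A A^T = K$ works), implemented in plain Python, and returns $h(\mathbf{X})$ in bits. Function names are hypothetical.

```python
import math

def cholesky(K):
    """Lower-triangular A with A A^T = K, for a symmetric positive definite K."""
    n = len(K)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(A[i][k] * A[j][k] for k in range(j))
            if i == j:
                d = K[i][i] - s
                if d <= 0.0:
                    raise ValueError("K must be positive definite")
                A[i][j] = math.sqrt(d)
            else:
                A[i][j] = (K[i][j] - s) / A[j][j]
    return A

def gaussian_entropy_bits(K):
    """h(X) = (n/2) log2(2 pi e) + log2|det A| for X ~ N(m, K), with K = A A^T."""
    n = len(K)
    A = cholesky(K)
    # det A is the product of the diagonal of a triangular matrix.
    log_det_A = sum(math.log2(A[i][i]) for i in range(n))
    return 0.5 * n * math.log2(2.0 * math.pi * math.e) + log_det_A

K = [[2.0, 1.0], [1.0, 2.0]]  # example covariance with det K = 3
print(gaussian_entropy_bits(K))
print(0.5 * math.log2((2.0 * math.pi * math.e) ** 2 * 3.0))  # same value via det K
```

The mean vector does not appear anywhere in the computation, which reflects the translation invariance of differential entropy.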

15 Kullback-Leibler Divergence
Definition 3 (KL divergence between densities)
The Kullback-Leibler divergence between two probability density functions $f(x)$ and $g(x)$ is defined as $D(f \| g) := \mathrm{E}\left[\log \frac{f(X)}{g(X)}\right]$ if the (improper) integral exists. The expectation is taken over the r.v. $X \sim f(x)$.
By Jensen's inequality, it is straightforward to see that the non-negativity of KL divergence remains.
Proposition 6 (Non-negativity of KL divergence)
$D(f \| g) \ge 0$, with equality iff $f = g$ almost everywhere (i.e., except for some points with zero probability).
Note: $D(f \| g)$ is finite only if the support of $f(x)$ is contained in the support of $g(x)$.
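As a concrete instance of Definition 3, the sketch below evaluates $D(f \| g)$ for two univariate Gaussian densities in two ways: with the standard closed form for Gaussians (stated here as a known fact, not taken from the slides) and by a midpoint-rule approximation of the defining integral. The Gaussian choice, the integration range, and all function names are assumptions made for illustration; values are in nats.

```python
import math

def log_gaussian_pdf(x, mean, var):
    """Natural log of the N(mean, var) density at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def kl_gaussians_nats(m1, v1, m2, v2):
    """Closed form of D(N(m1, v1) || N(m2, v2)) in nats."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def kl_numeric_nats(m1, v1, m2, v2, steps=20000, width=12.0):
    """Midpoint-rule estimate of the integral of f log(f/g), with f = N(m1, v1)."""
    s1 = math.sqrt(v1)
    lo = m1 - width * s1
    dx = 2.0 * width * s1 / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        lf = log_gaussian_pdf(x, m1, v1)
        lg = log_gaussian_pdf(x, m2, v2)
        total += math.exp(lf) * (lf - lg) * dx
    return total

print(kl_gaussians_nats(0.0, 1.0, 1.0, 2.0), kl_numeric_nats(0.0, 1.0, 1.0, 2.0))
print(kl_gaussians_nats(1.0, 2.0, 0.0, 1.0))  # not symmetric in (f, g)
```

The divergence is zero only when the two densities coincide, and swapping the arguments generally changes the value.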

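16 Maximum Differential Entropy
Theorem 2 (Maximum Differential Entropy under Covariance Constraint)
Let $\mathbf{X}$ be a random vector with mean $\mathbf{m}$ and covariance matrix $\mathrm{E}\left[(\mathbf{X} - \mathbf{m})(\mathbf{X} - \mathbf{m})^T\right] = K$, and let $\mathbf{X}^G$ be Gaussian with the same covariance $K$. Then,
\[
h(\mathbf{X}) \le h(\mathbf{X}^G) = \frac{1}{2} \log\big((2\pi e)^n \det K\big).
\]
pf: First, we can assume WLOG that both $\mathbf{X}$ and $\mathbf{X}^G$ are zero-mean. Let the p.d.f. of $\mathbf{X}$ be $f(\mathbf{x})$ and the p.d.f. of $\mathbf{X}^G$ be $f_G(\mathbf{x})$. Hence,
\[
0 \le D(f \| f_G) = \mathrm{E}_f[\log f(\mathbf{X})] - \mathrm{E}_f[\log f_G(\mathbf{X})] = -h(\mathbf{X}) - \mathrm{E}_f[\log f_G(\mathbf{X})].
\]
Note: $\log f_G(\mathbf{x})$ is a quadratic function of $\mathbf{x}$, and $\mathbf{X}$, $\mathbf{X}^G$ have the same second moment. Hence, $\mathrm{E}_f[\log f_G(\mathbf{X})] = \mathrm{E}_{f_G}[\log f_G(\mathbf{X}^G)] = -h(\mathbf{X}^G)$, and
\[
0 \le D(f \| f_G) = -h(\mathbf{X}) + h(\mathbf{X}^G) \implies h(\mathbf{X}) \le h(\mathbf{X}^G).
\]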
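A quick scalar sanity check of Theorem 2: the sketch below compares the differential entropies (in bits) of three zero-mean distributions with the same variance. The uniform and Laplace families are chosen here only as illustrative competitors and are not discussed in the slides; their closed-form entropies ($\log w$ for width $w$, and $\log(2eb)$ for Laplace scale $b$ with variance $2b^2$) are standard facts. Function names are hypothetical.

```python
import math

def h_gaussian_bits(var):
    """Differential entropy of N(0, var): 0.5 * log2(2 pi e var)."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * var)

def h_uniform_bits(var):
    """Uniform on an interval of width w has variance w^2 / 12 and entropy log2(w)."""
    width = math.sqrt(12.0 * var)
    return math.log2(width)

def h_laplace_bits(var):
    """Laplace with scale b has variance 2 b^2 and entropy log2(2 e b)."""
    b = math.sqrt(var / 2.0)
    return math.log2(2.0 * math.e * b)

for var in (0.1, 1.0, 10.0):
    print(var, h_uniform_bits(var), h_laplace_bits(var), h_gaussian_bits(var))
```

For every variance the Gaussian value is the largest of the three, and the gaps do not depend on the variance because scaling shifts all three entropies by the same amount.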


18 Input Cost
Our goal is to study channel coding over continuous memoryless channels. However, without any constraint on the channel input, it is quite easy to see that an infinite amount of information can be delivered over the channel.
To make the problem meaningful, we impose a constraint on the input cost of the communication. Let us begin by defining the cost function.
Definition 4 (Input cost function)
A non-negative input cost function $b : \mathcal{X} \to [0, \infty)$ is defined over the input alphabet $\mathcal{X}$ of a DMC $(\mathcal{X}, p_{Y|X}, \mathcal{Y})$.
We can shift $b$ such that there exists a symbol $x_o \in \mathcal{X}$ with $b(x_o) = 0$. WLOG assume the existence of such a zero-cost symbol $x_o \in \mathcal{X}$.
Average input cost constraint for channel coding: for $N$ channel uses, the average input cost satisfies $\frac{1}{N} \sum_{i=1}^{N} b(x_i) \le B$.

19 Channel Coding with Input Cost over DMC: Problem Setup
Block diagram: $w$ -> Channel Encoder -> $x^N$ -> Noisy Channel -> $y^N$ -> Channel Decoder -> $\hat{w}$
1. A $(2^{NR}, N, B)$ channel code consists of
- an encoding function (encoder) $\mathrm{enc}_N : [1 : 2^K] \to \mathcal{X}^N$ that maps each message $w$ to a length-$N$ codeword $x^N$, where $K := \lceil NR \rceil$. Each codeword satisfies the input cost constraint $\frac{1}{N} \sum_{i=1}^{N} b(x_i) \le B$.
- a decoding function (decoder) $\mathrm{dec}_N : \mathcal{Y}^N \to [1 : 2^K] \cup \{\ast\}$ that maps a channel output $y^N$ to a reconstructed message $\hat{w}$ or an error symbol $\ast$.
2. The error probability is defined as $P_e^{(N)} := \Pr\{W \ne \hat{W}\}$.
3. A rate $R$ is said to be achievable with input cost $B$ if there exists a sequence of $(2^{NR}, N, B)$ codes such that $P_e^{(N)} \to 0$ as $N \to \infty$.
The channel capacity is defined as $C(B) := \sup\{R \mid R \text{ is achievable}\}$.

20 Channel Coding Theorem with Average Input Cost
Theorem 3 (Channel Coding Theorem for DMC with Average Input Cost)
The capacity of the DMC $p(y|x)$ with input cost $B$ is given by
\[
C(B) = \max_{p(x):\ \mathrm{E}[b(X)] \le B} I(X; Y). \tag{1}
\]
Compared with the case without input cost constraint, the additional constraint in the extremal problem (1) is on the expected cost $\mathrm{E}[b(X)]$.
The capacity-cost function $C(B)$ is non-decreasing, concave, and continuous in $B$ (exercise).
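To see what the constrained maximization in (1) looks like on a small example, the sketch below evaluates $C(B)$ by grid search for a binary symmetric channel with crossover probability `eps` and cost function $b(0) = 0$, $b(1) = 1$, so that $\mathrm{E}[b(X)] = \Pr\{X = 1\}$. This channel, this cost function, and the grid-search approach are assumptions picked for illustration; none of them comes from the slides, and the function names are hypothetical.

```python
import math

def binary_entropy_bits(q):
    """Binary entropy function H_b(q) in bits, with H_b(0) = H_b(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

def capacity_cost_bsc(B, eps, steps=2000):
    """Grid-search C(B) = max over p = P(X=1) <= B of I(X;Y) for a BSC(eps)."""
    if B < 0.0:
        raise ValueError("cost budget must be non-negative")
    p_max = min(B, 1.0)
    best = 0.0
    for i in range(steps + 1):
        p = p_max * i / steps
        q = p * (1.0 - eps) + (1.0 - p) * eps  # P(Y = 1)
        # I(X;Y) = H(Y) - H(Y|X) = H_b(q) - H_b(eps)
        best = max(best, binary_entropy_bits(q) - binary_entropy_bits(eps))
    return best

for B in (0.0, 0.1, 0.25, 0.5, 1.0):
    print(B, capacity_cost_bsc(B, 0.1))
```

In this toy case the printed values increase with $B$ and then flatten at the unconstrained capacity $1 - H_b(\text{eps})$ once $B \ge 1/2$, which is consistent with $C(B)$ being non-decreasing and concave.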

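21 Converse Proof
pf: Following the converse proof for the DMC without input cost, we arrive at
\[
R - \epsilon_N \le \frac{1}{N} \sum_{k=1}^{N} I(X_k; Y_k), \quad \text{where } \epsilon_N \to 0 \text{ as } N \to \infty.
\]
By the definition of $C(B)$, $\forall\, k \in [1 : N]$, $I(X_k; Y_k) \le C(\mathrm{E}[b(X_k)])$. Let $B_k := \mathrm{E}[b(X_k)]$. Hence,
\[
R - \epsilon_N \le \frac{1}{N} \sum_{k=1}^{N} C(B_k) \overset{(a)}{\le} C\left(\frac{1}{N} \sum_{k=1}^{N} B_k\right) \overset{(b)}{\le} C(B).
\]
(a) is due to the concavity of $C(B)$ in $B$ (Jensen's inequality).
(b) is due to $\frac{1}{N} \sum_{k=1}^{N} B_k \le B$, which holds because every codeword satisfies the average cost constraint, and the fact that $C(B)$ is non-decreasing in $B$.
Hence, every achievable rate $R$ satisfies $R \le C(B)$.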

22 Achievability Proof (1)
The achievability proof mostly follows that of the DMC without input cost. Twist: in random codebook generation, how do we ensure that all codewords satisfy the input cost constraint?
Random Codebook Generation and Encoding:
Generate the random codebook $\mathcal{C}$ i.i.d. according to
\[
p_X(x) = \arg\max_{p(x):\ \mathrm{E}[b(X)] \le \frac{B}{1+\epsilon}} I(X; Y).
\]
In other words, we step back and generate random codewords with a slightly smaller average cost $\frac{B}{1+\epsilon}$.
- If the generated $x^N \in \mathcal{T}_\epsilon^{(N)}(X)$, then by Problem 6(a) of Quiz #1, we have $\frac{1}{N} \sum_{i=1}^{N} b(x_i) \le (1 + \epsilon)\, \mathrm{E}[b(X)] \le (1 + \epsilon) \frac{B}{1+\epsilon} = B$.
- If the generated $x^N \notin \mathcal{T}_\epsilon^{(N)}(X)$, it may violate the input cost constraint, and hence we send the zero-cost codeword $[x_o \cdots x_o]$ instead.

23 Achievability Proof (2)
Error Probability Analysis:
Following the same line as for the DMC without input cost constraint, we arrive at upper bounding $\Pr\{\mathcal{E} \mid W = 1\} := P_1(\mathcal{E})$, where
\[
\mathcal{E} = \mathcal{A}_1^c \cup \Big(\bigcup_{w \ne 1} \mathcal{A}_w\Big), \qquad \mathcal{A}_w := \left\{\big(X^N(w), Y^N\big) \in \mathcal{T}_\epsilon^{(N)}(X, Y)\right\}.
\]
Upper bounding $P_1(\mathcal{A}_w)$ for $w \ne 1$ remains the same: since the actual message is $W = 1$, whether or not the codeword $X^N(1)$ violates the cost constraint is not relevant.
Upper bounding $P_1(\mathcal{A}_1^c)$ requires a slight modification, regarding whether or not $X^N(1) \in \mathcal{T}_\epsilon^{(N)}(X)$:
\[
P_1(\mathcal{A}_1^c) \le P_1\left\{\big(X^N(1), Y^N\big) \notin \mathcal{T}_\epsilon^{(N)}(X, Y),\ X^N(1) \in \mathcal{T}_\epsilon^{(N)}(X)\right\} + P_1\left\{X^N(1) \notin \mathcal{T}_\epsilon^{(N)}(X)\right\} \le 2\epsilon,
\]
for $N$ sufficiently large.

24 Achievability Proof (3)
Hence, we conclude that for any $R < C\left(\frac{B}{1+\epsilon}\right)$, $R$ is achievable.
Finally, since $C(B)$ is continuous in $B$, we can make $C\left(\frac{B}{1+\epsilon}\right)$ arbitrarily close to $C(B)$, and hence for any $R < C(B)$, $R$ is achievable.
Exercise 1
Show that
\[
C(B) = \max_{p(x):\ \mathrm{E}[b(X)] \le B} I(X; Y)
\]
is non-decreasing, concave, and continuous in $B$.


26 Additive White Gaussian Noise (AWGN) Channel
Block diagram: $w$ -> Channel Encoder -> $x^N$ -> (add noise $z^N$) -> $y^N$ -> Channel Decoder -> $\hat{w}$
1. Input/output alphabet $\mathcal{X} = \mathcal{Y} = \mathbb{R}$.
2. AWGN channel: the conditional p.d.f. $f_{Y|X}$ is given by $Y = X + Z$, $Z \sim \mathcal{N}(0, \sigma^2) \perp X$.
- $\{Z_k\}$ form an i.i.d. (white) Gaussian r.p. with $Z_k \sim \mathcal{N}(0, \sigma^2)$, $\forall\, k$.
- Memoryless: $Z_k \perp (W, X^{k-1}, Z^{k-1})$.
- Without feedback: $Z^N \perp X^N$.
3. Average input power constraint $P$: $\frac{1}{N} \sum_{k=1}^{N} |x_k|^2 \le P$.

27 Channel Coding Theorem for Gaussian Channel
Theorem 4
The capacity of the AWGN channel with input power constraint $P$ and noise variance $\sigma^2$ is given by
\[
C = \sup_{f(x):\ \mathrm{E}[|X|^2] \le P} I(X; Y) = \frac{1}{2} \log\left(1 + \frac{P}{\sigma^2}\right). \tag{2}
\]
Compared with (1) for the DMC with input cost constraint, in (2) the max operation is replaced by the sup operation.
For the AWGN channel, the supremum is actually attained by the Gaussian input distribution $f(x) = \frac{1}{\sqrt{2\pi P}} e^{-\frac{x^2}{2P}}$, i.e., $X \sim \mathcal{N}(0, P)$.
The capacity-power function $C(P)$ is non-decreasing, concave, and left-continuous in $P$ (exercise).
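Formula (2) is easy to evaluate numerically. The minimal sketch below computes the AWGN capacity in bits per channel use (base-2 logarithm); the function name and the sample power values are chosen here for illustration only.

```python
import math

def awgn_capacity_bits(power, noise_var):
    """C = 0.5 * log2(1 + P / sigma^2), in bits per channel use."""
    if power < 0.0 or noise_var <= 0.0:
        raise ValueError("need power >= 0 and noise_var > 0")
    return 0.5 * math.log2(1.0 + power / noise_var)

print(awgn_capacity_bits(1.0, 1.0))   # 0.5
print(awgn_capacity_bits(3.0, 1.0))   # 1.0
print(awgn_capacity_bits(15.0, 1.0))  # 2.0
```

Going from 1 to 2 bits per channel use requires raising $P/\sigma^2$ from 3 to 15, which illustrates the concavity of $C(P)$: each additional bit costs roughly four times more signal-to-noise ratio at high power.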

28 Converse Proof and Evaluation of Capacity
Converse proof: Following the same line of proof, we arrive at
\[
R \text{ is achievable} \implies R \le \sup_{f(x):\ \mathrm{E}[|X|^2] \le P} I(X; Y).
\]
Capacity evaluation: It is straightforward to see that the optimizing input $X$ should have zero mean. Note that since $Z \perp X$,
\[
\begin{aligned}
I(X; Y) &= h(Y) - h(Y|X) = h(Y) - h(X + Z \,|\, X) = h(Y) - h(Z|X) = h(Y) - h(Z) \\
&= h(Y) - \frac{1}{2} \log\big(2\pi e\, \sigma^2\big) \\
&\overset{(a)}{\le} \frac{1}{2} \log\big(2\pi e (P + \sigma^2)\big) - \frac{1}{2} \log\big(2\pi e\, \sigma^2\big) = \frac{1}{2} \log\left(1 + \frac{P}{\sigma^2}\right).
\end{aligned}
\]
(a): $h(Y) \le \frac{1}{2} \log\big(2\pi e\, \mathrm{E}[|Y|^2]\big)$ and $\mathrm{E}[|Y|^2] = \mathrm{E}[|X|^2] + \mathrm{E}[|Z|^2] \le P + \sigma^2$.

29 Achievability Proof (1)
The proof of achievability for this continuous-valued channel makes use of discretization, so that we can apply the result for the DMC with input cost.
Discretization: here we use a different way to discretize the channel.
- For $m \in \mathbb{N}$, let $\mathcal{Q}_m := \left\{\frac{l}{\sqrt{m}} : l = 0, \pm 1, \ldots, \pm m\right\}$ be the set of quantization points.
- For any $r \in \mathbb{R}$, quantize $r$ to the closest point $[r]_m \in \mathcal{Q}_m$ such that $|[r]_m| \le |r|$.
Let us use the above quantizer to discretize the channel:
- Quantize the channel input $X$ to $[X]_m$.
- Let $Y^{(m)} := [X]_m + Z$ denote the channel output corresponding to the quantized channel input $[X]_m$. Note: $Y^{(m)}$ is continuous-valued.
- Quantize the channel output $Y^{(m)}$ to $\left[Y^{(m)}\right]_n$.
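The quantizer described above can be sketched in a few lines. The code below assumes the quantization points are $l/\sqrt{m}$ for $l = 0, \pm 1, \ldots, \pm m$ (the construction used in El Gamal & Kim [1]; the spacing is an assumption if the slide intended something else) and rounds toward zero so that the magnitude never increases. The function name is hypothetical.

```python
import math

def quantize_toward_zero(r, m):
    """Closest point l / sqrt(m), |l| <= m, whose magnitude does not exceed |r|."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    step = 1.0 / math.sqrt(m)
    # Largest admissible level index, clipped to the outermost point m * step.
    level = min(int(abs(r) / step), m)
    # Guard against floating-point rounding pushing the point past |r|.
    while level > 0 and level * step > abs(r):
        level -= 1
    return math.copysign(level * step, r)

for r in (0.2, 0.7, -1.3, 5.0):
    print(r, quantize_toward_zero(r, 4))  # m = 4: step 0.5, points within [-2, 2]
```

Because the quantized value never has larger magnitude than the input, the second moment can only decrease, which is why the quantized input still meets the power constraint. As $m$ grows, the step $1/\sqrt{m}$ shrinks while the covered range $[-\sqrt{m}, \sqrt{m}]$ expands.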

30 Achievability Proof (2)
Now we have a DMC with input $[X]_m$ and output $\left[Y^{(m)}\right]_n$.
Note that for any p.d.f. $f_X(x)$ with $\mathrm{E}[|X|^2] \le P$, the quantized $[X]_m$ also satisfies the power constraint: $\mathrm{E}\left[|[X]_m|^2\right] \le \mathrm{E}[|X|^2] \le P$.
Hence, by the achievability result for the DMC with input cost constraint, any $R < I\left([X]_m; \left[Y^{(m)}\right]_n\right)$ (evaluated under $f_X(x) = \frac{1}{\sqrt{2\pi P}} e^{-\frac{x^2}{2P}}$) is indeed achievable for the discretized channel under power constraint $P$.
The only thing left to be shown is that $I\left([X]_m; \left[Y^{(m)}\right]_n\right)$ can be made arbitrarily close to $I(X; Y) = \frac{1}{2} \log\left(1 + \frac{P}{\sigma^2}\right)$ as $m, n \to \infty$.

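31 Achievability Proof (3)
First, due to the data processing inequality, since $[X]_m - Y^{(m)} - \left[Y^{(m)}\right]_n$ form a Markov chain, we have
\[
I\left([X]_m; \left[Y^{(m)}\right]_n\right) \le I\left([X]_m; Y^{(m)}\right) = h\left(Y^{(m)}\right) - h(Z).
\]
Since $\mathrm{E}\left[|Y^{(m)}|^2\right] \le P + \sigma^2$, we have $h\left(Y^{(m)}\right) \le \frac{1}{2} \log\big(2\pi e (P + \sigma^2)\big)$, and hence
\[
I\left([X]_m; \left[Y^{(m)}\right]_n\right) \le \frac{1}{2} \log\left(1 + \frac{P}{\sigma^2}\right).
\]
Second, for the lower bound, Lemma 3.2 in El Gamal & Kim [1] gives the following:
\[
\liminf_{m \to \infty}\, \lim_{n \to \infty} I\left([X]_m; \left[Y^{(m)}\right]_n\right) \ge \frac{1}{2} \log\left(1 + \frac{P}{\sigma^2}\right).
\]
Combining the above, the proof is complete.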

32 Lesson Learned
Discretization may not be the simplest way to obtain achievability results. Instead, such discretization shows how to extend the coding theorem for a DMC to a Gaussian or any other well-behaved continuous-valued channel.
Similar procedures can be used to extend coding theorems for finite-alphabet multiuser channels to their Gaussian counterparts. Hence, later we may skip formal proofs of such extensions.


34 Summary
- Mutual information between two continuous r.v.'s $X$ and $Y$ with joint density $f_{X,Y}$: $I(X; Y) = \mathrm{E}\left[\log \frac{f_{X,Y}(X, Y)}{f_X(X) f_Y(Y)}\right]$.
- Differential entropy and conditional differential entropy: $h(X) := \mathrm{E}\left[\log \frac{1}{f_X(X)}\right]$, $h(X|Y) := \mathrm{E}\left[\log \frac{1}{f_{X|Y}(X|Y)}\right]$.
- $I(X; Y) = h(X) - h(X|Y) = h(Y) - h(Y|X)$.
- KL divergence between densities $f$ and $g$: $D(f \| g) := \mathrm{E}_f\left[\log \frac{f(X)}{g(X)}\right]$.
- Chain rule, conditioning reduces differential entropy, non-negativity of mutual information and KL divergence: these continue to hold.
- Differential entropy can be negative; $h(X) \le h(X, Y)$ does not hold in general.
- Channel capacity with input cost constraint:
  DMC: $C(B) = \max_{p(x):\ \mathrm{E}[b(X)] \le B} I(X; Y)$.
  Continuous-valued: $C(B) = \sup_{f(x):\ \mathrm{E}[b(X)] \le B} I(X; Y)$.
- Gaussian channel capacity: $C(P) = \frac{1}{2} \log\left(1 + \frac{P}{\sigma^2}\right)$.
