Chapter 6. Continuous Sources and Channels. Po-Ning Chen, Professor. Department of Communications Engineering. National Chiao Tung University


1 Chapter 6 Continuous Sources and Channels Po-Ning Chen, Professor Department of Communications Engineering National Chiao Tung University Hsin Chu, Taiwan 300, R.O.C.

2 Continuous Sources and Channels I: 6-1  Model: $\{X_t \in \mathcal{X},\ t \in \mathcal{I}\}$. Discrete sources: both $\mathcal{X}$ and $\mathcal{I}$ are discrete. Continuous sources: discrete-time continuous sources, where $\mathcal{X}$ is continuous and $\mathcal{I}$ is discrete; waveform sources, where both $\mathcal{X}$ and $\mathcal{I}$ are continuous.

3 Information Content of Continuous Sources I: 6-2  A straightforward extension of entropy from discrete sources to continuous sources. Entropy: $H(X) = -\sum_{x\in\mathcal{X}} P_X(x)\log P_X(x)$ nats.
Example 6.1 (extension of entropy to continuous sources) Given a source $X$ with source alphabet $[0,1)$ and uniform generic distribution, we can make it discrete by quantizing it into $m$ levels as $q_m(X) = i/m$ if $(i-1)/m \le X < i/m$, for $1 \le i \le m$. Then the resultant entropy of the quantized source is $H(q_m(X)) = -\sum_{i=1}^{m}\frac{1}{m}\log\frac{1}{m} = \log m$ nats. Since the entropy $H(q_m(X))$ of the quantized source is a lower bound to the entropy $H(X)$ of the uniformly distributed continuous source, $H(X) \ge \lim_{m\to\infty} H(q_m(X)) = \infty$.

4 Information Content of Continuous Sources I: 6-3  A quantization extension of entropy seems ineffective for continuous sources, because their entropies are all infinite.
Proof: For any continuous source $X$, there must exist a non-empty open interval on which the cumulative distribution function $F_X(\cdot)$ is strictly increasing. Now quantize the source into $m+1$ levels as follows: assign one level to the complement of this open interval, and assign $m$ levels to this open interval such that the probability mass on this interval, denoted by $a$, is equally distributed over these $m$ levels. (This is similar to equally partitioning the concerned domain of $F_X^{-1}(\cdot)$.) Then $H(X) \ge H(X') = -(1-a)\log(1-a) - a\log\frac{a}{m}$, where $X'$ represents the quantized version of $X$. The lower bound goes to infinity as $m$ tends to infinity.

5 Differential Entropy I: 6-4  Alternative extension of entropy to continuous sources (from its formula).
Definition 6.2 (differential entropy) The differential entropy (in nats) of a continuous source with generic probability density function (pdf) $p_X$ is defined as $h(X) = -\int_{\mathcal{X}} p_X(x)\log p_X(x)\,dx$.
The next example demonstrates the difference (in quantity) between the entropy and the differential entropy.
Example 6.3 A continuous source $X$ with source alphabet $[0,1)$ and pdf $f(x)=2x$ has differential entropy equal to $-\int_0^1 2x\log(2x)\,dx = -\big[x^2(\log(2x)-\tfrac12)\big]_0^1 = \tfrac12 - \log 2$ nats.
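
A minimal numerical check of Example 6.3 (my addition, not part of the slides), assuming NumPy and SciPy are available; it integrates $-f\log f$ directly and compares with the closed form $\tfrac12-\log 2$.

```python
import numpy as np
from scipy.integrate import quad

# f(x) = 2x on (0, 1); h(X) = -∫ f(x) log f(x) dx should equal 1/2 - log(2) nats.
h_numeric, _ = quad(lambda x: -2.0 * x * np.log(2.0 * x), 0.0, 1.0)
h_closed = 0.5 - np.log(2.0)
print(h_numeric, h_closed)   # both ≈ -0.19315 nats (differential entropy can be negative)
```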

6 Examples of Differential Entropy I: 6-5  Note that the differential entropy, unlike the entropy, can be negative.
Example 6.4 (differential entropy of continuous sources with uniform generic distribution) A continuous source $X$ with uniform generic distribution over $(a,b)$ has differential entropy $h(X) = \log(b-a)$ nats.
Example 6.5 (differential entropy of Gaussian sources) A continuous source $X$ with Gaussian generic distribution of mean $\mu$ and variance $\sigma^2$ has differential entropy
$h(X) = \int_{\mathbb{R}}\phi(x)\Big[\tfrac12\log(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\Big]dx = \tfrac12\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}E[(X-\mu)^2] = \tfrac12\log(2\pi\sigma^2) + \tfrac12 = \tfrac12\log(2\pi\sigma^2 e)$ nats,
where $\phi(x)$ is the pdf of the Gaussian distribution with mean $\mu$ and variance $\sigma^2$.

7 Properties of Differential Entropy I: 6-6  The extension of the AEP theorem from discrete cases to continuous cases is based not on number counting (which is always infinite for continuous sources), but on volume measuring.
Theorem 6.6 (AEP for continuous sources) Let $X_1,\ldots,X_n$ be a sequence of sources drawn i.i.d. according to the density $p_X(\cdot)$. Then $-\frac1n\log p_X(X_1,\ldots,X_n) \to E[-\log p_X(X)] = h(X)$ in probability.
Proof: The proof is an immediate result of the law of large numbers.
Definition 6.7 (typical set) For $\delta>0$ and any given $n$, define the typical set as $\mathcal{F}_n(\delta) = \big\{x^n\in\mathcal{X}^n : \big|-\frac1n\log p_X(x_1,\ldots,x_n) - h(X)\big| < \delta\big\}$.
Definition 6.8 (volume) The volume of a set $A$ is defined as $\mathrm{Vol}(A) = \int_A dx_1\cdots dx_n$.

8 Properties of Differential Entropy I: 6-7  Theorem 6.9 (Shannon-McMillan theorem for continuous sources)
1. For $n$ sufficiently large, $P_{X^n}\{\mathcal{F}_n^c(\delta)\} < \delta$.
2. $\mathrm{Vol}(\mathcal{F}_n(\delta)) \le e^{n(h(X)+\delta)}$ for all $n$.
3. $\mathrm{Vol}(\mathcal{F}_n(\delta)) \ge (1-\delta)\,e^{n(h(X)-\delta)}$ for $n$ sufficiently large.
Proof: The proof is an extension of the Shannon-McMillan theorem for discrete sources, and hence we omit it.
E.g., we can also derive a source coding theorem for continuous sources and obtain that when a continuous source is compressed by quantization, it is beneficial to put most of the quantization effort on its typical set, instead of on the entire source space: assign $m-1$ levels to the elements in $\mathcal{F}_n(\delta)$, and assign one level to the elements outside $\mathcal{F}_n(\delta)$.
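
A quick Monte Carlo illustration of Theorem 6.6 (my addition, under the assumption of an i.i.d. Gaussian source, using NumPy): the normalized log-density of a long sample path concentrates around $h(X)$.

```python
import numpy as np

# For i.i.d. N(0, sigma^2) samples, -(1/n) log p(X_1,...,X_n) should approach
# h(X) = 0.5 * log(2*pi*e*sigma^2) as n grows (AEP for continuous sources).
rng = np.random.default_rng(0)
sigma = 2.0
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

for n in (10, 1_000, 100_000):
    x = rng.normal(0.0, sigma, size=n)
    log_pdf = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
    print(n, -log_pdf.mean(), h_closed)   # empirical value converges to h(X)
```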

9 Properties of Differential Entropy I: 6-8  Remarks: If the differential entropy of a continuous source is larger, it is expected that a larger number of quantization levels is required in order to minimize the distortion introduced by quantization. So we may conclude that continuous sources with higher differential entropy contain more information in volume.
Question: Which continuous source is the richest in information (i.e., has the highest differential entropy and costs the most in quantization)? Answer: Gaussian.

10 Properties of Differential Entropy I: 6-9  Theorem 6.10 (maximal differential entropy of Gaussian source) The Gaussian source has the largest differential entropy among all continuous sources with identical mean and variance.
Proof: Let $p(\cdot)$ be the pdf of a continuous source $X$, and let $\phi(\cdot)$ be the pdf of a Gaussian source $Y$. Assume that these two sources have the same mean $\mu$ and variance $\sigma^2$. Observe that
$-\int_{\mathbb{R}}\phi(y)\log\phi(y)\,dy = \int_{\mathbb{R}}\phi(y)\Big[\tfrac12\log(2\pi\sigma^2) + \frac{(y-\mu)^2}{2\sigma^2}\Big]dy = \tfrac12\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}E[(Y-\mu)^2] = \tfrac12\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}E[(X-\mu)^2] = \int_{\mathbb{R}} p(x)\Big[\tfrac12\log(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\Big]dx = -\int_{\mathbb{R}} p(x)\log\phi(x)\,dx.$

11 Properties of Differential Entropy I: 6-10  Hence,
$h(Y) - h(X) = -\int_{\mathbb{R}}\phi(y)\log\phi(y)\,dy + \int_{\mathbb{R}} p(x)\log p(x)\,dx = -\int_{\mathbb{R}} p(x)\log\phi(x)\,dx + \int_{\mathbb{R}} p(x)\log p(x)\,dx = \int_{\mathbb{R}} p(x)\log\frac{p(x)}{\phi(x)}\,dx \ge \int_{\mathbb{R}} p(x)\Big(1 - \frac{\phi(x)}{p(x)}\Big)dx$ (fundamental inequality) $= \int_{\mathbb{R}}\big(p(x)-\phi(x)\big)\,dx = 0,$
with equality if, and only if, $p(x) = \phi(x)$ for all $x\in\mathbb{R}$.

12 Properties of Differential Entropy I: 6-11  Properties of differential entropy for continuous sources that are the same as in discrete cases.
Lemma 6.11
1. $h(X|Y) \le h(X)$ with equality if, and only if, $X$ and $Y$ are independent.
2. (chain rule for differential entropy) $h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i\,|\,X_1, X_2, \ldots, X_{i-1})$.
3. $h(X^n) \le \sum_{i=1}^{n} h(X_i)$ with equality if, and only if, $\{X_i\}_{i=1}^{n}$ are independent.

13 Properties of Differential Entropy I: 6-12  There are some properties that are conceptually different in the continuous case from the original ones in the discrete case.
Lemma 6.12 (In discrete cases) For any one-to-one correspondence mapping $f$, $H(f(X)) = H(X)$. (In continuous cases) For the mapping $f(x) = ax$ with some non-zero constant $a$, $h(f(X)) = h(X) + \log|a|$.
Proof: (For continuous cases only) Let $p_X(\cdot)$ and $p_{f(X)}(\cdot)$ be respectively the pdfs of the original source and the mapped source. Then $p_{f(X)}(u) = \frac{1}{|a|}\,p_X\!\big(\frac{u}{a}\big)$. Taking the new pdf into the formula of differential entropy gives the desired result.
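
A tiny sanity check of Lemma 6.12 (my addition), done on the Gaussian closed form rather than by direct integration: if $X\sim\mathcal{N}(0,\sigma^2)$ then $aX\sim\mathcal{N}(0,a^2\sigma^2)$, so the entropy gap should be exactly $\log|a|$.

```python
import numpy as np

def h_gauss(var):
    # differential entropy (nats) of a Gaussian with the given variance
    return 0.5 * np.log(2 * np.pi * np.e * var)

sigma2, a = 1.5, -3.0
print(h_gauss(a**2 * sigma2) - h_gauss(sigma2), np.log(abs(a)))   # both ≈ 1.0986
```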

14 Important Notes on Differential Entropy I: 6-13  The above lemma says that the differential entropy can be increased by a one-to-one correspondence mapping. Therefore, when viewing this quantity as a measure of the information content of continuous sources, we would conclude that information content can be increased by a function mapping, which is somewhat contrary to intuition. For this reason, it may not be appropriate to interpret the differential entropy as an index of the information content of continuous sources. Indeed, it may be better to view this quantity as a measure of quantization efficiency.

15 Important Notes on Differential Entropy I: 6-14  Some researchers interpret the differential entropy as a convenient intermediate formula for calculating the mutual information and divergence of systems with continuous alphabets, which is in general true. Before introducing the formulas of relative entropy and mutual information for continuous settings, we point out beforehand that the interpretation, as well as the operational characteristics, of the mutual information and divergence for systems with continuous alphabets is exactly the same as for systems with discrete alphabets.

16 More Properties on Differential Entropy I: 6-15  Corollary 6.13 For a sequence of continuous sources $X^n = (X_1,\ldots,X_n)$ and a non-singular $n\times n$ matrix $A$, $h(AX^n) = h(X^n) + \log|\det(A)|$, where $\det(A)$ represents the determinant of matrix $A$.
Corollary 6.14 $h(X+c) = h(X)$ and $h(X|Y) = h(X+Y\,|\,Y)$.

17 Operational Meaning of Differential Entropy I: 6-16  Lemma 6.15 Given the pdf $f(x)$ of a continuous source $X$, suppose that $f(x)\log_2 f(x)$ is Riemann-integrable. Then uniformly quantizing the random source to within $n$-bit accuracy, i.e., with quantization width no greater than $2^{-n}$, requires approximately $h(X)+n$ bits (for $n$ large enough).
Proof: Step 1: Mean-value theorem. Let $\Delta = 2^{-n}$ be the width between two adjacent quantization levels. Let $t_i = i\Delta$ for integer $i\in(-\infty,\infty)$. From the mean-value theorem, we can choose $x_i\in[t_{i-1}, t_i]$ such that $\int_{t_{i-1}}^{t_i} f(x)\,dx = f(x_i)(t_i - t_{i-1}) = f(x_i)\Delta$.

18 Operational Meaning of Differential Entropy I: 6-17  Step 2: Definition of $h_\Delta(X)$. Let $h_\Delta(X) = -\sum_{i=-\infty}^{\infty}\big[f(x_i)\log_2 f(x_i)\big]\Delta$. Since $f(x)\log_2 f(x)$ is Riemann-integrable, $h_\Delta(X)\to h(X)$ as $\Delta = 2^{-n}\to 0$. Therefore, given any $\varepsilon>0$, there exists $N$ such that for all $n>N$, $|h(X)-h_\Delta(X)| < \varepsilon$.
Step 3: Computation of $H(X_\Delta)$. The entropy of the quantized source $X_\Delta$ is $H(X_\Delta) = -\sum_{i=-\infty}^{\infty} p_i\log_2 p_i = -\sum_{i=-\infty}^{\infty}\big(f(x_i)\Delta\big)\log_2\big(f(x_i)\Delta\big)$ bits.

19 Operational Meaning of Differential Entropy I: 6-18  Step 4: $H(X_\Delta)$ versus $h_\Delta(X)$. From Steps 2 and 3,
$H(X_\Delta) - h_\Delta(X) = -\sum_{i=-\infty}^{\infty}\big[f(x_i)\Delta\big]\log_2\Delta = (-\log_2\Delta)\sum_{i=-\infty}^{\infty}\int_{t_{i-1}}^{t_i} f(x)\,dx = (-\log_2\Delta)\int_{-\infty}^{\infty} f(x)\,dx = -\log_2\Delta = n.$
Hence, for $n>N$, $[h(X)+n]-\varepsilon < H(X_\Delta) = h_\Delta(X)+n < [h(X)+n]+\varepsilon$.
Remark 1: Since $H(X_\Delta)$ is the minimum average codeword length for lossless data compression, uniformly quantizing a continuous source up to $n$-bit accuracy requires approximately $h(X)+n$ bits.
Remark 2: We may conclude that the larger the differential entropy, the larger the average number of bits required to uniformly quantize the source subject to a fixed accuracy.
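
An empirical check of Lemma 6.15 (my addition, for a Gaussian source with NumPy; the Monte Carlo entropy estimate is approximate): the entropy of the uniformly quantized source should track $h(X)+n$ bits as the step shrinks.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
samples = rng.normal(0.0, sigma, size=2_000_000)
h_bits = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # h(X) in bits

for n in (2, 4, 6):
    delta = 2.0 ** (-n)                                # quantization width 2^-n
    cells = np.floor(samples / delta).astype(np.int64)
    _, counts = np.unique(cells, return_counts=True)
    p = counts / counts.sum()
    H = -(p * np.log2(p)).sum()                        # entropy of quantized source
    print(n, H, h_bits + n)                            # H ≈ h(X) + n bits
```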

20 Operational Meaning of Differential Entropy I: 6-19  This operational meaning of differential entropy can be used to interpret the properties introduced in the previous subsection. For example, $h(X+c) = h(X)$ can be interpreted as: a shift in value does not change the quantization efficiency of the original source.

21 Riemann Integral Versus Lebesgue Integral I: 6-20  Riemann integral: Let $s(x)$ represent a step function on $[a,b)$, defined by the existence of a partition $a = x_0 < x_1 < \cdots < x_n = b$ such that $s(x)$ is constant on each $(x_i, x_{i+1})$ for $0\le i<n$. If a function $f(x)$ is Riemann integrable, then
$\int_a^b f(x)\,dx = \sup_{\{s(x)\,:\,s(x)\le f(x)\}}\int_a^b s(x)\,dx = \inf_{\{s(x)\,:\,s(x)\ge f(x)\}}\int_a^b s(x)\,dx.$
Example of a non-Riemann-integrable function: $f(x)=0$ if $x$ is irrational, $f(x)=1$ if $x$ is rational. Then $\sup_{\{s(x)\,:\,s(x)\le f(x)\}}\int_a^b s(x)\,dx = 0$, but $\inf_{\{s(x)\,:\,s(x)\ge f(x)\}}\int_a^b s(x)\,dx = b-a$.

22 Riemann Integral Versus Lebesgue Integral I: 6-21  Lebesgue integral: Let $t(x)$ represent a simple function, defined as a linear combination of indicator functions of mutually disjoint partitions. For example, let $U_1,\ldots,U_m$ be mutually disjoint partitions of the domain $\mathcal{X}$ with $\bigcup_{i=1}^{m} U_i = \mathcal{X}$. The indicator function of $U_i$ is $\mathbf{1}(x;U_i)=1$ if $x\in U_i$, and $0$ otherwise. Then $t(x) = \sum_{i=1}^{m} a_i\mathbf{1}(x;U_i)$ is a simple function. If a function $f(x)$ is Lebesgue integrable, then
$\int_a^b f(x)\,dx = \sup_{\{t(x)\,:\,t(x)\le f(x)\}}\int_a^b t(x)\,dx = \inf_{\{t(x)\,:\,t(x)\ge f(x)\}}\int_a^b t(x)\,dx.$
The previous example is actually Lebesgue integrable, and its Lebesgue integral is equal to zero.

23 Example of Quantization Efficiency I: 6-22  Example 6.16 Find the minimum average number of bits required to uniformly quantize the decay time (in years) of a radium atom up to 3-digit accuracy, if the half-life of the radium is 80 years.
Note that the half-life of a radium atom is the median of its decay-time distribution $f(x) = \lambda e^{-\lambda x}$ ($x>0$). Since the median is 80, we obtain $\int_0^{80}\lambda e^{-\lambda x}\,dx = 0.5$, which implies $\lambda = \ln(2)/80 \approx 0.00866$. Also, 3-digit accuracy is approximately equivalent to $\log_2 10^3 \approx 10$-bit accuracy. Therefore, the number of bits required to quantize the source is approximately $h(X) + 10 = \log_2\frac{e}{\lambda} + 10 \approx 8.29 + 10 \approx 18.3$ bits.
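
The arithmetic of Example 6.16 worked out in code (my computation from the example's given data; the exact decimal values are not taken from the slide):

```python
import numpy as np

lam = np.log(2) / 80.0            # median of Exp(lambda) is ln(2)/lambda = 80 years
h_bits = np.log2(np.e / lam)      # differential entropy of an exponential source, in bits
n_bits = np.log2(10**3)           # 3-decimal-digit accuracy ≈ 9.97 bits (the slide rounds to 10)
print(lam, h_bits, n_bits, h_bits + n_bits)   # ≈ 0.00866, 8.29, 9.97, 18.3 bits
```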

24 Relative Entropy and Mutual Information I: 6-23  Definition 6.17 (relative entropy) Define the relative entropy between two densities $p_X$ and $p_{\hat X}$ by $D(X\|\hat X) = \int_{\mathcal{X}} p_X(x)\log\frac{p_X(x)}{p_{\hat X}(x)}\,dx$.
Definition 6.18 (mutual information) The mutual information with input-output joint density $p_{X,Y}(x,y)$ is defined as $I(X;Y) = \int_{\mathcal{X}}\int_{\mathcal{Y}} p_{X,Y}(x,y)\log\frac{p_{X,Y}(x,y)}{p_X(x)p_Y(y)}\,dx\,dy$.

25 Relative Entropy and Mutual Information I: 6-24  Unlike the case of entropy, the properties of relative entropy and mutual information in continuous cases are the same as those in discrete cases. In particular, the mutual information of the quantized version of a continuous channel converges to the mutual information of the same continuous channel (as the quantization step size goes to zero). Hence, some researchers prefer to define the mutual information of a continuous channel directly as the limit over quantized channels. Here, we quote some of the properties of relative entropy and mutual information from discrete settings.
Lemma 6.19
1. $D(X\|\hat X) \ge 0$ with equality if, and only if, $p_X = p_{\hat X}$.
2. $I(X;Y) \ge 0$ with equality if, and only if, $X$ and $Y$ are independent.
3. $I(X;Y) = h(Y) - h(Y|X)$.

26 Rate Distortion Theorem Revisited I: 6-25  Since the entropy of continuous sources is infinite, compressing a continuous source without distortion is impossible according to Shannon's source coding theorem. Thus, one way to characterize data compression for continuous sources is to encode the original source subject to a constraint on the distortion, which yields the rate-distortion function for data compression. In concept, the rate-distortion function is the minimum data compression rate (nats required per source letter, or nats required per source sample) for which the distortion constraint is satisfied.

27 Specific Formula of Rate Distortion Theorem I: 6-26  Theorem 6.20 Under the squared-error distortion measure, namely $\rho(z,\hat z) = (z-\hat z)^2$, the rate-distortion function for a continuous source $Z$ with zero mean and variance $\sigma^2$ satisfies
$R(D) \le \begin{cases}\frac12\log\frac{\sigma^2}{D}, & 0\le D\le\sigma^2;\\ 0, & D>\sigma^2,\end{cases}$
with equality when $Z$ is Gaussian.
Proof: By Theorem 5.16 (extended to the squared-error distortion measure), $R(D) = \min_{\{p_{\hat Z|Z}\,:\,E[(Z-\hat Z)^2]\le D\}} I(Z;\hat Z)$. So for any $p_{\hat Z|Z}$ satisfying the distortion constraint, $R(D) \le I(p_Z, p_{\hat Z|Z})$.
For $0\le D\le\sigma^2$, choose a dummy Gaussian random variable $W$ with zero mean and variance $aD$, where $a = 1 - D/\sigma^2$, independent of $Z$. Let $\hat Z = aZ + W$. Then $E[(Z-\hat Z)^2] = E[(1-a)^2 Z^2] + E[W^2] = (1-a)^2\sigma^2 + aD = D,$

28 Specific Formula of Rate Distortion Theorem I: 6-27  which satisfies the distortion constraint. Note that the variance of $\hat Z$ is equal to $E[a^2Z^2] + E[W^2] = a^2\sigma^2 + aD = \sigma^2 - D$. Consequently,
$R(D) \le I(Z;\hat Z) = h(\hat Z) - h(\hat Z|Z) = h(\hat Z) - h(W + aZ\,|\,Z) = h(\hat Z) - h(W|Z)$ (by Corollary 6.14) $= h(\hat Z) - h(W)$ (by Lemma 6.11) $= h(\hat Z) - \tfrac12\log(2\pi e(aD)) \le \tfrac12\log(2\pi e(\sigma^2-D)) - \tfrac12\log(2\pi e(aD)) = \tfrac12\log\frac{\sigma^2}{D}.$
For $D>\sigma^2$, let $\hat Z$ satisfy $\Pr\{\hat Z = 0\} = 1$ and be independent of $Z$. Then $E[(Z-\hat Z)^2] = E[Z^2] + E[\hat Z^2] - 2E[Z]E[\hat Z] = \sigma^2 < D$, and $I(Z;\hat Z) = 0$. Hence, $R(D) = 0$ for $D>\sigma^2$.
The achievability of the upper bound by the Gaussian source will be proved in Theorem 6.22.
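
A small sketch (my addition) evaluating the Gaussian rate-distortion formula, i.e. the bound of Theorem 6.20 that is met with equality by a Gaussian source (Theorem 6.22):

```python
import numpy as np

def rate_distortion_gaussian(D, sigma2):
    # R(D) in nats per source sample: 0.5*log(sigma^2/D) for D < sigma^2, else 0
    return 0.5 * np.log(sigma2 / D) if D < sigma2 else 0.0

sigma2 = 1.0
for D in (0.1, 0.5, 1.0, 2.0):
    print(D, rate_distortion_gaussian(D, sigma2))
```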

29 Rate Distortion Function for Binary Source I: 6-28  Theorem 6.21 Fix a memoryless binary source $Z = \{Z^n = (Z_1, Z_2, \ldots, Z_n)\}_{n=1}^{\infty}$ with marginal distribution $P_Z(0) = 1 - P_Z(1) = p$. Assume that the Hamming additive distortion measure is employed. Then the rate-distortion function is
$R(D) = \begin{cases} H_b(p) - H_b(D), & 0\le D\le\min\{p, 1-p\};\\ 0, & D>\min\{p, 1-p\},\end{cases}$
where $H_b(p) = -p\log p - (1-p)\log(1-p)$ is the binary entropy function.
Proof: Assume without loss of generality that $p\le 1/2$. We first prove the theorem for $0\le D<\min\{p,1-p\} = p$. Observe that for any binary random variable $\hat Z$, $H(Z|\hat Z) = H(Z\oplus\hat Z\,|\,\hat Z)$. Also observe that $E[\rho(Z,\hat Z)]\le D$ implies $\Pr\{Z\oplus\hat Z = 1\}\le D$.

30 Rate Distortion Function for Binary Source I: 6-29  Then
$I(Z;\hat Z) = H(Z) - H(Z|\hat Z) = H_b(p) - H(Z\oplus\hat Z\,|\,\hat Z) \ge H_b(p) - H(Z\oplus\hat Z)$ (conditioning never increases entropy) $\ge H_b(p) - H_b(D),$
where the last inequality follows since $H_b(x)$ is increasing for $x\le 1/2$ and $\Pr\{Z\oplus\hat Z = 1\}\le D$. Since the above derivation is true for any $P_{\hat Z|Z}$, we have $R(D) \ge H_b(p) - H_b(D)$.
It remains to show that the lower bound is achievable by some $P_{\hat Z|Z}$, or equivalently, $H(Z|\hat Z) = H_b(D)$ for some $P_{\hat Z|Z}$. By defining $P_{Z|\hat Z}(0|0) = P_{Z|\hat Z}(1|1) = 1-D$, we immediately obtain $H(Z|\hat Z) = H_b(D)$. The desired $P_{\hat Z|Z}$ can be obtained by solving
$1 = P_{\hat Z}(0) + P_{\hat Z}(1) = \frac{P_Z(0)}{P_{Z|\hat Z}(0|0)}P_{\hat Z|Z}(0|0) + \frac{P_Z(0)}{P_{Z|\hat Z}(0|1)}P_{\hat Z|Z}(1|0) = \frac{p}{1-D}\,P_{\hat Z|Z}(0|0) + \frac{p}{D}\big(1 - P_{\hat Z|Z}(0|0)\big)$

31 Rate Distortion Function for Binary Source I: 6-30  and
$1 = P_{\hat Z}(0) + P_{\hat Z}(1) = \frac{P_Z(1)}{P_{Z|\hat Z}(1|0)}P_{\hat Z|Z}(0|1) + \frac{P_Z(1)}{P_{Z|\hat Z}(1|1)}P_{\hat Z|Z}(1|1) = \frac{1-p}{D}\big(1 - P_{\hat Z|Z}(1|1)\big) + \frac{1-p}{1-D}\,P_{\hat Z|Z}(1|1),$
which yield
$P_{\hat Z|Z}(0|0) = \frac{1-D}{1-2D}\Big(1 - \frac{D}{p}\Big) \quad\text{and}\quad P_{\hat Z|Z}(1|1) = \frac{1-D}{1-2D}\Big(1 - \frac{D}{1-p}\Big).$
Now in the case $p\le D<1-p$, we can let $P_{\hat Z|Z}(1|0) = P_{\hat Z|Z}(1|1) = 1$ to obtain $I(Z;\hat Z) = 0$ and $E[\rho(Z,\hat Z)] = \sum_{z=0}^{1}\sum_{\hat z=0}^{1} P_Z(z)P_{\hat Z|Z}(\hat z|z)\rho(z,\hat z) = p \le D$.
Similarly, in the case $D\ge 1-p$, we let $P_{\hat Z|Z}(0|0) = P_{\hat Z|Z}(0|1) = 1$ to obtain $I(Z;\hat Z) = 0$ and $E[\rho(Z,\hat Z)] = \sum_{z=0}^{1}\sum_{\hat z=0}^{1} P_Z(z)P_{\hat Z|Z}(\hat z|z)\rho(z,\hat z) = 1-p \le D$.

32 Rate Distortion Function for Binary Source I: 6-31  Remark: The Hamming additive distortion measure is defined as $\rho_n(z^n, \hat z^n) = \sum_{i=1}^{n} z_i\oplus\hat z_i$, where $\oplus$ denotes modulo-two addition. In such a case, $\rho_n(z^n,\hat z^n)$ is exactly the number of bit changes (bit errors) after compression.
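
A short sketch (my addition) of the binary rate-distortion function of Theorem 6.21, in nats; the helper names are mine.

```python
import numpy as np

def Hb(x):
    # binary entropy function in nats, with a small clip to avoid log(0)
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return -x * np.log(x) - (1 - x) * np.log(1 - x)

def rate_distortion_binary(D, p):
    # R(D) = Hb(p) - Hb(D) for D <= min{p, 1-p}, and 0 otherwise
    return max(0.0, Hb(p) - Hb(D)) if D <= min(p, 1 - p) else 0.0

p = 0.3
for D in (0.0, 0.05, 0.1, 0.3, 0.4):
    print(D, rate_distortion_binary(D, p))   # nats per source letter
```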

33 Rate Distortion Function for Gaussian Sources I: 6-32  Theorem 6.22 Fix a memoryless source $Z = \{Z^n = (Z_1, Z_2, \ldots, Z_n)\}_{n=1}^{\infty}$ with zero-mean Gaussian marginal distribution of variance $\sigma^2$. Assume that the squared-error distortion measure is employed. Then the rate-distortion function is given by
$R(D) = \begin{cases}\frac12\log\frac{\sigma^2}{D}, & 0\le D\le\sigma^2;\\ 0, & D>\sigma^2.\end{cases}$
Proof: From Theorem 6.20, it suffices to show that for the Gaussian source, $(1/2)\log(\sigma^2/D)$ is a lower bound to $R(D)$ for $0\le D\le\sigma^2$.

34 Rate Distortion Function for Gaussian Sources I: 6-33  This can be proved as follows. For a Gaussian source $Z$ with $E[(Z-\hat Z)^2]\le D$,
$I(Z;\hat Z) = h(Z) - h(Z|\hat Z) = \tfrac12\log(2\pi e\sigma^2) - h(Z-\hat Z\,|\,\hat Z)$ (Corollary 6.14) $\ge \tfrac12\log(2\pi e\sigma^2) - h(Z-\hat Z)$ (Lemma 6.11) $\ge \tfrac12\log(2\pi e\sigma^2) - \tfrac12\log\big(2\pi e\,\mathrm{Var}[Z-\hat Z]\big)$ (Theorem 6.10) $\ge \tfrac12\log(2\pi e\sigma^2) - \tfrac12\log\big(2\pi e\,E[(Z-\hat Z)^2]\big) \ge \tfrac12\log(2\pi e\sigma^2) - \tfrac12\log(2\pi e D) = \tfrac12\log\frac{\sigma^2}{D}.$

35 Channel Coding Theorem for Continuous Sources I: 6-34  Power constraint on channel input: deriving the channel capacity of a memoryless continuous channel without any constraint on the inputs is somewhat impractical, especially when the input can be any number on the infinite real line. Such a constraint is usually of the form $E[t(X)]\le S$, or $\frac1n\sum_{i=1}^{n} E[t(X_i)]\le S$ for a sequence of random inputs, where $t(\cdot)$ is a non-negative cost function.
Example 6.23 (average power constraint) $t(x) = x^2$, i.e., the constraint is that the average input power is bounded above by $S$.

36 Channel Coding Theorem for Continuous Sources I: 6-35  As extended from discrete cases, the channel capacity of a discrete-time continuous channel with an input cost constraint is of the form
$C(S) = \max_{\{p_X\,:\,E[t(X)]\le S\}} I(X;Y). \qquad (6.3.1)$
Claim: $C(S)$ is a concave function of $S$.

37 Channel Coding Theorem for Continuous Sources I: 6-36  Lemma 6.24 (concavity of the capacity-cost function) $C(S)$ is concave, continuous, and strictly increasing in $S$.
Proof: Let $P_{X_1}$ and $P_{X_2}$ be two distributions that respectively achieve $C(P_1)$ and $C(P_2)$. Denote $P_{X_\lambda} = \lambda P_{X_1} + (1-\lambda)P_{X_2}$. Then
$C(\lambda P_1 + (1-\lambda)P_2) = \max_{\{P_X\,:\,E[t(X)]\le\lambda P_1+(1-\lambda)P_2\}} I(P_X, P_{Y|X}) \ge I(P_{X_\lambda}, P_{Y|X}) \ge \lambda I(P_{X_1}, P_{Y|X}) + (1-\lambda)I(P_{X_2}, P_{Y|X}) = \lambda C(P_1) + (1-\lambda)C(P_2),$
where the first inequality holds since
$E_{X_\lambda}[t(X)] = \int_{\mathbb{R}} t(x)\,dP_{X_\lambda}(x) = \lambda\int_{\mathbb{R}} t(x)\,dP_{X_1}(x) + (1-\lambda)\int_{\mathbb{R}} t(x)\,dP_{X_2}(x) = \lambda E_{X_1}[t(X)] + (1-\lambda)E_{X_2}[t(X)] \le \lambda P_1 + (1-\lambda)P_2,$
and the second inequality follows from the concavity of mutual information with respect to the first argument. Accordingly, $C(S)$ is concave in $S$.

38 Channel Coding Theorem for Continuous Sources I: 6-37  Furthermore, it can easily be seen from the definition that $C(S)$ is non-decreasing, which, together with its concavity, implies that it is continuous and strictly increasing.

39 Channel Coding Theorem for Continuous Sources I: 6-38 Although the capacity-cost function formula in (6.3.1) is valid for general cost function t( ), we only substantiate it under the average power constraint in the next forward channel coding theorem. Its validity for a more general case can be similarly proved based on the same concept.

40 Channel Coding Theorem for Continuous Sources I: 6-39  Theorem 6.25 (forward channel coding theorem for continuous channels under average power constraint) For any $\varepsilon\in(0,1)$, there exist $0<\gamma<2\varepsilon$ and a data transmission code sequence $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ satisfying $\frac1n\log M_n > C(S) - \gamma$ and, for each codeword $c = (c_1, c_2, \ldots, c_n)$,
$\frac1n\sum_{i=1}^{n} c_i^2 \le S, \qquad (6.3.2)$
such that the probability of decoding error $P_e(\mathcal{C}_n)$ is less than $\varepsilon$ for sufficiently large $n$, where $C(S) = \max_{\{p_X\,:\,E[X^2]\le S\}} I(X;Y)$.

41 Channel Coding Theorem for Continuous Sources I: 6-40  Proof: The theorem holds trivially when $C(S)=0$, because we can choose $M_n = 1$ for every $n$, which yields $P_e(\mathcal{C}_n)=0$. Hence, assume without loss of generality that $C(S)>0$.
Step 0: Take a positive $\gamma$ satisfying $\gamma < \min\{2\varepsilon, C(S)\}$. Pick $\xi>0$ small enough such that $2[C(S) - C(S-\xi)] < \gamma$, where the existence of such $\xi$ is assured by the strict increase of $C(S)$. Hence, we have $C(S-\xi) - \gamma/2 > C(S) - \gamma > 0$. Choose $M_n$ to satisfy $C(S-\xi) - \frac{\gamma}{2} > \frac1n\log M_n > C(S) - \gamma$, for which the choice should exist for all sufficiently large $n$. Take $\delta = \gamma/8$. Let $P_X$ be the distribution that achieves $C(S-\xi)$; hence, $E[X^2]\le S-\xi$ and $I(X;Y) = C(S-\xi)$.

42 Channel Coding Theorem for Continuous Sources I: 6-41  Step 1: Random coding with average power constraint. Randomly draw $M_n - 1$ codewords according to the distribution $P_{X^n}$ with $P_{X^n}(x^n) = \prod_{i=1}^{n} P_X(x_i)$. By the law of large numbers, each randomly selected codeword $c_m = (c_{m1},\ldots,c_{mn})$ satisfies $\lim_{n\to\infty}\frac1n\sum_{i=1}^{n} c_{mi}^2 \le S-\xi$ almost surely, for $m = 1, 2, \ldots, M_n - 1$.

43 Channel Coding Theorem for Continuous Sources I: 6-42  Step 2: Coder. For the $M_n$ selected codewords $\{c_1,\ldots,c_{M_n}\}$, replace the codewords that violate the power constraint (i.e., (6.3.2)) by the all-zero codeword $\mathbf{0}$. Define the encoder as $f_n(m) = c_m$ for $1\le m\le M_n$. When receiving an output sequence $y^n$, the decoder $g_n(\cdot)$ is given by
$g_n(y^n) = \begin{cases} m, & \text{if } (c_m, y^n)\in\mathcal{F}_n(\delta) \text{ and } (\forall\,m'\ne m)\ (c_{m'}, y^n)\notin\mathcal{F}_n(\delta);\\ \text{arbitrary}, & \text{otherwise},\end{cases}$
where $\mathcal{F}_n(\delta) = \big\{(x^n,y^n)\in\mathcal{X}^n\times\mathcal{Y}^n : \big|-\tfrac1n\log p_{X^n,Y^n}(x^n,y^n) - h(X,Y)\big| < \delta,\ \big|-\tfrac1n\log p_{X^n}(x^n) - h(X)\big| < \delta,\ \text{and}\ \big|-\tfrac1n\log p_{Y^n}(y^n) - h(Y)\big| < \delta\big\}$.

44 Channel Coding Theorem for Continuous Sources I: 6-43  Step 3: Probability of error. Let $\lambda_m$ denote the error probability given that codeword $m$ is transmitted. Define $\mathcal{E}_0 = \big\{x^n\in\mathcal{X}^n : \frac1n\sum_{i=1}^{n} x_i^2 > S\big\}$. Then, by following a similar argument as (4.3.2) (discrete cases), we get
$E[\lambda_m] \le P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}(\mathcal{F}_n^c(\delta)) + \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\int_{\mathcal{X}^n}\int_{\mathcal{F}_n(\delta|c_{m'})} p_{X^n}(c_{m'})\,p_{Y^n}(y^n)\,dy^n\,dc_{m'},$
where $\mathcal{F}_n(\delta|x^n) = \{y^n\in\mathcal{Y}^n : (x^n,y^n)\in\mathcal{F}_n(\delta)\}$.
Note that the additional term $P_{X^n}(\mathcal{E}_0)$ (relative to the discrete case) copes with the errors due to the all-zero codeword replacement; it will be less than $\delta$ for all sufficiently large $n$ by the law of large numbers.

45 Channel Coding Theorem for Continuous Sources I: 6-44  Finally, by carrying out a procedure similar to that in the proof of the capacity of discrete channels, we obtain
$E[P_e(\mathcal{C}_n)] \le P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}(\mathcal{F}_n^c(\delta)) + M_n\,e^{n(h(X,Y)+\delta)}\,e^{-n(h(X)-\delta)}\,e^{-n(h(Y)-\delta)} \le P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}(\mathcal{F}_n^c(\delta)) + e^{n(C(S-\xi)-4\delta)}\,e^{-n(I(X;Y)-3\delta)} = P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}(\mathcal{F}_n^c(\delta)) + e^{-n\delta}.$
Accordingly, we can make the average probability of error, namely $E[P_e(\mathcal{C}_n)]$, less than $3\delta = 3\gamma/8 < 3\varepsilon/4 < \varepsilon$ for all sufficiently large $n$.

46 Channel Coding Theorem for Continuous Sources I: 6-45  Theorem 6.26 (converse channel coding theorem for continuous channels) For any data transmission code sequence $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ satisfying the power constraint, if the ultimate data transmission rate satisfies $\liminf_{n\to\infty}\frac1n\log M_n > C(S)$, then its probability of decoding error is bounded away from zero for all $n$ sufficiently large.
Proof: For an $(n, M_n)$ block data transmission code, an encoding function is chosen as $f_n : \{1,2,\ldots,M_n\}\to\mathcal{X}^n$, and each index $i$ is equally likely under the average probability of block decoding error criterion. Hence, we can assume that the information message is generated by a random variable $W$ uniformly distributed over $\{1,2,\ldots,M_n\}$. As a result, $H(W) = \log M_n$. Since $W\to X^n\to Y^n$ forms a Markov chain ($Y^n$ depends only on $X^n$), we obtain by the data processing lemma that $I(W;Y^n)\le I(X^n;Y^n)$.

47 Channel Coding Theorem for Continuous Sources I: 6-46  We can also bound $I(X^n;Y^n)$ by $C(S)$ as follows:
$I(X^n;Y^n) \le \max_{\{P_{X^n}\,:\,(1/n)\sum_{i=1}^{n}E[X_i^2]\le S\}} I(X^n;Y^n) \le \max_{\{P_{X^n}\,:\,(1/n)\sum_{i=1}^{n}E[X_i^2]\le S\}}\ \sum_{j=1}^{n} I(X_j;Y_j) = \max_{\{(P_1,\ldots,P_n)\,:\,(1/n)\sum_{i=1}^{n}P_i=S\}}\ \max_{\{P_{X^n}\,:\,(\forall i)\,E[X_i^2]\le P_i\}}\ \sum_{j=1}^{n} I(X_j;Y_j) \le \max_{\{(P_1,\ldots,P_n)\,:\,(1/n)\sum_{i=1}^{n}P_i=S\}}\ \sum_{j=1}^{n}\ \max_{\{P_{X_j}\,:\,E[X_j^2]\le P_j\}} I(X_j;Y_j) = \max_{\{(P_1,\ldots,P_n)\,:\,(1/n)\sum_{i=1}^{n}P_i=S\}}\ \sum_{j=1}^{n} C(P_j) = \max_{\{(P_1,\ldots,P_n)\,:\,(1/n)\sum_{i=1}^{n}P_i=S\}}\ n\sum_{j=1}^{n}\frac1n C(P_j)$

48 Channel Coding Theorem for Continuous Sources I: 6-47  $\le \max_{\{(P_1,\ldots,P_n)\,:\,(1/n)\sum_{i=1}^{n}P_i=S\}} n\,C\!\Big(\frac1n\sum_{j=1}^{n} P_j\Big)$ (by concavity of $C(S)$) $= nC(S)$.
Consequently, by defining $P_e(\mathcal{C}_n)$ as the error of guessing $W$ from $Y^n$ via a decoding function $g_n : \mathcal{Y}^n\to\{1,2,\ldots,M_n\}$, which is exactly the average block decoding failure, we get
$\log M_n = H(W) = H(W|Y^n) + I(W;Y^n) \le H(W|Y^n) + I(X^n;Y^n) \le H_b(P_e(\mathcal{C}_n)) + P_e(\mathcal{C}_n)\log(|\mathcal{W}|-1) + nC(S)$ (by Fano's inequality) $\le \log(2) + P_e(\mathcal{C}_n)\log(M_n-1) + nC(S)$ (by the fact that $(\forall\,t\in[0,1])\ H_b(t)\le\log(2)$) $\le \log(2) + P_e(\mathcal{C}_n)\log M_n + nC(S),$

49 Channel Coding Theorem for Continuous Sources I: 6-48  which implies that
$P_e(\mathcal{C}_n) \ge 1 - \frac{C(S)}{(1/n)\log M_n} - \frac{\log(2)}{\log M_n}.$
So if $\liminf_{n\to\infty}(1/n)\log M_n > C(S)$, then there exist $\delta$ with $0<\delta<4\varepsilon$ and an integer $N$ such that for $n\ge N$, $\frac1n\log M_n > C(S)+\delta$. Hence, for $n\ge N_0 = \max\{N, 2\log(2)/\delta\}$,
$P_e(\mathcal{C}_n) \ge 1 - \frac{C(S)}{C(S)+\delta} - \frac{\log(2)}{n(C(S)+\delta)} \ge \frac{\delta}{2(C(S)+\delta)}.$
Remarks: This is a weak converse statement. Since the capacity is now a function of the cost constraint, it is named the capacity-cost function. Next, we will derive the capacity-cost function of some frequently used channel models.

50 Memoryless Additive Gaussian Channels I: 6-49  Definition 6.27 (memoryless additive channel) Let $X_1,\ldots,X_n$ and $Y_1,\ldots,Y_n$ be the input and output sequences of the channel, and let $N_1,\ldots,N_n$ be the noise. Then a memoryless additive channel is defined by $Y_i = X_i + N_i$ for each $i$, where $\{(X_i, Y_i, N_i)\}_{i=1}^{n}$ are i.i.d. and $X_i$ is independent of $N_i$.
Definition 6.28 (memoryless additive Gaussian channel) A memoryless additive channel is called a memoryless additive Gaussian channel if the noise is a Gaussian random variable.
Theorem 6.29 (capacity of the memoryless additive Gaussian channel under average power constraint) The capacity of a memoryless additive Gaussian channel with noise model $\mathcal{N}(0,\sigma^2)$ and average power constraint $S$ is equal to $C(S) = \frac12\log\big(1 + \frac{S}{\sigma^2}\big)$ nats/channel symbol.

51 Memoryless Additive Gaussian Channels I: 6-50  Proof: By definition,
$C(S) = \max_{\{p_X\,:\,E[X^2]\le S\}} I(X;Y) = \max_{\{p_X\,:\,E[X^2]\le S\}}\big(h(Y)-h(Y|X)\big) = \max_{\{p_X\,:\,E[X^2]\le S\}}\big(h(Y)-h(N+X\,|\,X)\big) = \max_{\{p_X\,:\,E[X^2]\le S\}}\big(h(Y)-h(N|X)\big) = \max_{\{p_X\,:\,E[X^2]\le S\}}\big(h(Y)-h(N)\big) = \Big(\max_{\{p_X\,:\,E[X^2]\le S\}} h(Y)\Big) - h(N),$
where $N$ represents the additive Gaussian noise. We thus need to find an input distribution satisfying $E[X^2]\le S$ that maximizes the differential entropy of $Y$. Recall that the differential entropy subject to a mean and variance constraint is maximized by a Gaussian random variable; also, the differential entropy of a Gaussian random variable with variance $\sigma^2$ is $(1/2)\log(2\pi\sigma^2 e)$ nats and has nothing to do with its mean.

52 Memoryless Additive Gaussian Channels I: 6-51  Therefore, by taking $X$ to be a Gaussian random variable with distribution $\mathcal{N}(0,S)$, $Y$ achieves its largest variance $S+\sigma^2$ under the constraint $E[X^2]\le S$. Consequently, $C(S) = \frac12\log(2\pi e(S+\sigma^2)) - \frac12\log(2\pi e\sigma^2) = \frac12\log\big(1+\frac{S}{\sigma^2}\big)$ nats/channel symbol.
Theorem 6.30 (worseness in capacity of Gaussian noise) For all memoryless additive discrete-time continuous channels whose noise has zero mean and variance $\sigma^2$, the capacity subject to the average power constraint is lower bounded by $\frac12\log\big(1+\frac{S}{\sigma^2}\big)$, which is the capacity of the memoryless additive discrete-time Gaussian channel. (This means that Gaussian noise is the worst kind of noise in the sense of channel capacity.)

53 Memoryless Additive Gaussian Channels I: 6-52  Proof: Let $p_{Y_g|X_g}$ and $p_{Y|X}$ denote the transition probabilities of the Gaussian channel and of some other channel satisfying the cost constraint, respectively. Let $N_g$ and $N$ respectively denote their noises. Then for any Gaussian input $p_{X_g}$,
$I(p_{X_g}, p_{Y|X}) - I(p_{X_g}, p_{Y_g|X_g}) = \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_g}(x)p_N(y-x)\log\frac{p_N(y-x)}{p_Y(y)}\,dy\,dx - \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_g}(x)p_{N_g}(y-x)\log\frac{p_{N_g}(y-x)}{p_{Y_g}(y)}\,dy\,dx$
$= \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_g}(x)p_N(y-x)\log\frac{p_N(y-x)}{p_Y(y)}\,dy\,dx - \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_g}(x)p_N(y-x)\log\frac{p_{N_g}(y-x)}{p_{Y_g}(y)}\,dy\,dx$
(the replacement of $p_{N_g}$ by $p_N$ inside the second expectation is justified because $\log p_{N_g}(\cdot)$ and $\log p_{Y_g}(\cdot)$ are quadratic, and $N$, $N_g$ — hence $Y$, $Y_g$ — share the same first and second moments)
$= \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_g}(x)p_N(y-x)\log\frac{p_N(y-x)\,p_{Y_g}(y)}{p_{N_g}(y-x)\,p_Y(y)}\,dy\,dx \ge \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_g}(x)p_N(y-x)\Big(1 - \frac{p_{N_g}(y-x)\,p_Y(y)}{p_N(y-x)\,p_{Y_g}(y)}\Big)dy\,dx = 1 - \int_{\mathbb{R}}\frac{p_Y(y)}{p_{Y_g}(y)}\Big(\int_{\mathbb{R}} p_{X_g}(x)p_{N_g}(y-x)\,dx\Big)dy = 0,$

54 Memoryless Additive Gaussian Channels I: 6-53  with equality if, and only if, $\frac{p_Y(y)}{p_{Y_g}(y)} = \frac{p_N(y-x)}{p_{N_g}(y-x)}$ for all $x$. Therefore,
$\max_{\{p_X\,:\,E[X^2]\le S\}} I(p_X, p_{Y_g|X_g}) = I(p_{X_g}, p_{Y_g|X_g}) \le I(p_{X_g}, p_{Y|X}) \le \max_{\{p_X\,:\,E[X^2]\le S\}} I(p_X, p_{Y|X}).$

55 Uncorrelated Parallel Gaussian Channels I: 6-54  Water-pouring scheme: in concept, more channel input power should be placed on those channels with smaller noise power. The water-pouring scheme not only substantiates this intuition but also gives it a quantitative meaning.
[Figure: water-pouring over four channels with noise powers $\sigma_1^2, \sigma_2^2, \sigma_3^2, \sigma_4^2$; the total power $S = S_1 + S_2 + S_4$ is poured up to a common water level, so the noisiest channel ($\sigma_3^2$) receives no power.]

56 Uncorrelated Parallel Gaussian Channels I: 6-55  Theorem 6.31 (capacity of parallel additive Gaussian channels) The capacity of $k$ parallel additive Gaussian channels under an overall input power constraint $S$ is
$C(S) = \sum_{i=1}^{k}\frac12\log\Big(1 + \frac{S_i}{\sigma_i^2}\Big),$
where $\sigma_i^2$ is the noise variance of channel $i$, $S_i = \max\{0, \theta - \sigma_i^2\}$, and $\theta$ is chosen to satisfy $\sum_{i=1}^{k} S_i = S$. This capacity is achieved by a set of independent Gaussian inputs with zero mean and variances $S_i$.

57 Uncorrelated Parallel Gaussian Channels I: 6-56  Proof: By definition, $C(S) = \max_{\{p_{X^k}\,:\,\sum_{i=1}^{k}E[X_i^2]\le S\}} I(X^k;Y^k)$. Since the noises $N_1,\ldots,N_k$ are independent,
$I(X^k;Y^k) = h(Y^k) - h(Y^k|X^k) = h(Y^k) - h(N^k+X^k\,|\,X^k) = h(Y^k) - h(N^k|X^k) = h(Y^k) - h(N^k) = h(Y^k) - \sum_{i=1}^{k} h(N_i) \le \sum_{i=1}^{k} h(Y_i) - \sum_{i=1}^{k} h(N_i) = \sum_{i=1}^{k} I(X_i;Y_i) \le \sum_{i=1}^{k}\frac12\log\Big(1 + \frac{S_i}{\sigma_i^2}\Big),$
with equality if each input is a Gaussian random variable with zero mean

58 Uncorrelated Parallel Gaussian Channels I: 6-57  and variance $S_i$, and the inputs are independent, where $S_i$ is the individual power constraint applied to channel $i$ with $\sum_{i=1}^{k} S_i = S$. So the problem is reduced to finding the power allotment that maximizes the capacity subject to the constraint $\sum_{i=1}^{k} S_i = S$. By using the Lagrange multiplier technique, the maximizer of
$\max\Big\{\sum_{i=1}^{k}\frac12\log\Big(1+\frac{S_i}{\sigma_i^2}\Big) + \lambda\Big(\sum_{i=1}^{k} S_i - S\Big)\Big\}$
can be found by taking the derivative (w.r.t. $S_i$) of the above expression and setting it to zero, which yields $\frac12\cdot\frac{1}{S_i+\sigma_i^2} + \lambda = 0$ if $S_i>0$, and $\frac12\cdot\frac{1}{S_i+\sigma_i^2} + \lambda \le 0$ if $S_i=0$. Hence, $S_i = \theta - \sigma_i^2$ if $S_i>0$, and $0 = S_i \ge \theta - \sigma_i^2$ if $S_i=0$, where $\theta = -1/(2\lambda)$.
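
A sketch of the water-filling allocation of Theorem 6.31 (my addition); it finds the water level $\theta$ by bisection rather than through the Lagrangian form used in the proof, and the function names are mine.

```python
import numpy as np

def water_filling(noise_vars, S, tol=1e-12):
    # Find theta such that sum_i max(0, theta - sigma_i^2) = S, then return S_i.
    lo, hi = 0.0, max(noise_vars) + S
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if sum(max(0.0, theta - v) for v in noise_vars) > S:
            hi = theta
        else:
            lo = theta
    theta = 0.5 * (lo + hi)
    return [max(0.0, theta - v) for v in noise_vars]

noise_vars = [1.0, 2.0, 4.0, 8.0]
S = 5.0
powers = water_filling(noise_vars, S)
capacity = sum(0.5 * np.log(1 + p / v) for p, v in zip(powers, noise_vars))
print(powers, sum(powers), capacity)   # powers sum to S; capacity in nats per channel use
```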

59 Rate Distortion for Parallel Gaussian Sources I: 6-58  A theorem on the rate-distortion function parallel to that on the capacity-cost function can also be established.
Theorem 6.32 (rate distortion for parallel Gaussian sources) Given $k$ mutually independent Gaussian sources with variances $\sigma_1^2,\ldots,\sigma_k^2$, the overall rate-distortion function for the additive squared-error distortion, namely $\sum_{i=1}^{k} E[(Z_i-\hat Z_i)^2] \le D$, is given by
$R(D) = \sum_{i=1}^{k}\frac12\log\frac{\sigma_i^2}{D_i},$
where $D_i = \min\{\theta, \sigma_i^2\}$ and $\theta$ is chosen to satisfy $\sum_{i=1}^{k} D_i = D$.

60 Rate Distortion for Parallel Gaussian Sources I: 6-59  [Figure: the water-pouring scheme for lossy data compression of parallel Gaussian sources. Water of total amount $D$ is poured over bars of heights $\sigma_1^2, \sigma_2^2, \sigma_3^2, \sigma_4^2$ up to a common level $\theta$; here $D = D_1 + D_2 + D_3 + D_4 = \sigma_1^2 + \sigma_2^2 + \theta + \sigma_4^2$.]
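
A companion sketch (my addition) of the reverse water-filling of Theorem 6.32, again using bisection on the level $\theta$; the function names are mine.

```python
import numpy as np

def reverse_water_filling(source_vars, D, tol=1e-12):
    # Find theta such that sum_i min(theta, sigma_i^2) = D, then D_i = min(theta, sigma_i^2).
    lo, hi = 0.0, max(source_vars)
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if sum(min(theta, v) for v in source_vars) > D:
            hi = theta
        else:
            lo = theta
    theta = 0.5 * (lo + hi)
    return [min(theta, v) for v in source_vars]

source_vars = [1.0, 2.0, 4.0]
D = 2.0
Ds = reverse_water_filling(source_vars, D)
R = sum(0.5 * np.log(v / d) for v, d in zip(source_vars, Ds))
print(Ds, sum(Ds), R)   # distortions sum to D; overall rate in nats
```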

61 Correlated Parallel Additive Gaussian Channels I: 6-60  Theorem 6.33 (capacity of correlated parallel additive Gaussian channels) The capacity of $k$ parallel additive Gaussian channels under an overall input power constraint $S$ is
$C(S) = \sum_{i=1}^{k}\frac12\log\Big(1 + \frac{S_i}{\lambda_i}\Big),$
where $\lambda_i$ is the $i$-th eigenvalue of the positive-definite noise covariance matrix $K_N$, $S_i = \max\{0, \theta - \lambda_i\}$, and $\theta$ is chosen to satisfy $\sum_{i=1}^{k} S_i = S$. This capacity is achieved by a set of independent Gaussian inputs with zero mean and variances $S_i$.
A matrix $K_{k\times k}$ is positive definite if for every $(x_1,\ldots,x_k)$, $[x_1,\ldots,x_k]\,K\,[x_1,\ldots,x_k]^t \ge 0$, with equality only when $x_i = 0$ for all $1\le i\le k$. The matrix $\Lambda$ in the decomposition of a positive-definite matrix $K$, i.e., $K = Q\Lambda Q^t$, is a diagonal matrix with non-zero diagonal components.

62 Correlated Parallel Additive Gaussian Channels I: 6-61  Proof: Let $K_X$ be the covariance matrix of the input $X_1,\ldots,X_k$. The input power constraint then becomes $\sum_{i=1}^{k} E[X_i^2] = \mathrm{tr}(K_X) \le S$, where $\mathrm{tr}(\cdot)$ represents the trace of the $k\times k$ matrix $K_X$. Assume that the input is independent of the noise. Then
$I(X^k;Y^k) = h(Y^k) - h(Y^k|X^k) = h(Y^k) - h(N^k+X^k\,|\,X^k) = h(Y^k) - h(N^k|X^k) = h(Y^k) - h(N^k).$
Since $h(N^k)$ is not determined by the input, the capacity-finding problem reduces to maximizing $h(Y^k)$ over all possible inputs satisfying the power constraint. Now observe that the covariance matrix of $Y^k$ is $K_Y = K_X + K_N$, which implies that the differential entropy of $Y^k$ is upper bounded by $h(Y^k) \le \frac12\log\big((2\pi e)^k\,|K_X+K_N|\big)$. It remains to find the $K_X$ (if possible) under which the above upper bound is achieved, and this achievable upper bound is maximized.

63 Correlated Parallel Additive Gaussian Channels I: 6-62  Decompose $K_N$ into its diagonal form $K_N = Q\Lambda Q^t$, where the superscript $t$ represents matrix transposition, $QQ^t = I_{k\times k}$, and $I_{k\times k}$ represents the identity matrix of order $k$. Note that since $K_N$ is positive definite, $\Lambda$ is a diagonal matrix with positive diagonal components equal to the eigenvalues of $K_N$. Then
$|K_X + K_N| = |K_X + Q\Lambda Q^t| = |Q|\,|Q^t K_X Q + \Lambda|\,|Q^t| = |Q^t K_X Q + \Lambda| = |A + \Lambda|,$
where $A = Q^t K_X Q$. Since $\mathrm{tr}(A) = \mathrm{tr}(K_X)$, the problem is further transformed to maximizing $|A+\Lambda|$ subject to $\mathrm{tr}(A) \le S$.

64 Correlated Parallel Additive Gaussian Channels I: 6-63  Lemma (Hadamard's inequality) Any positive definite $k\times k$ matrix $K$ satisfies $|K| \le \prod_{i=1}^{k} K_{ii}$, where $K_{ii}$ is the component of matrix $K$ located at the $i$-th row and $i$-th column. Equality holds if, and only if, the matrix is diagonal.
By observing that $A+\Lambda$ is positive definite (because $\Lambda$ is positive definite), together with the above lemma, we have
$|A+\Lambda| \le \prod_{i=1}^{k}(A_{ii} + \lambda_i),$
where $\lambda_i$ is the component of matrix $\Lambda$ located at the $i$-th row and $i$-th column, which is exactly the $i$-th eigenvalue of $K_N$. Thus, the maximum value of $|A+\Lambda|$ under $\mathrm{tr}(A)\le S$ is achieved by a diagonal $A$ with $\sum_{i=1}^{k} A_{ii} = S$.

65 Correlated Parallel Additive Gaussian Channels I: 6-64  Finally, we can adopt the Lagrange multiplier technique as used previously to obtain $A_{ii} = \max\{0, \theta - \lambda_i\}$, where $\theta$ is chosen to satisfy $\sum_{i=1}^{k} A_{ii} = S$.
Waveform channels will be discussed next!

66 Bandlimited Waveform Channels with AWGN I: 6-65  A common model for communication over a radio network or a telephone line is a band-limited channel with white noise. This is a continuous-time channel, modeled as $Y_t = (X_t + N_t) * h(t)$, where $*$ represents the convolution operation, $X_t$ is the waveform source, $Y_t$ is the waveform output, $N_t$ is the white noise, and $h(t)$ is the band-limited filter.
[Figure: the channel adds the noise $N_t$ to $X_t$ and passes the sum through the filter $H(f)$ to produce $Y_t$.]

67 Bandlimited Waveform Channels with AWGN I: 6-66  Coding over waveform channels: for a fixed interval $[0,T)$, select $M$ different functions (a waveform codebook) $c_1(t), c_2(t), \ldots, c_M(t)$, one for each information message. Based on the received function $y(t)$ for $t\in[0,T)$ at the channel output, a decision on the input message is made.
Sampling theorem: sampling a band-limited signal at intervals of $1/(2W)$ (i.e., at rate $2W$) is sufficient to reconstruct the signal from its samples, if $W$ is the bandwidth of the signal.
Conventional remark: based on the sampling theorem, one can sample the filtered waveform signal $c_m(t)$ and the filtered waveform noise $N_t$, and reconstruct them distortionlessly with sampling frequency $2W$ (from the receiver's $Y_t$ point of view).

68 Bandlimited Waveform Channels with AWGN I: 6-67  My remarks: Assume for the moment that the channel is noiseless. The input waveform codeword $c_j(t)$ is time-limited to $[0,T)$; so it cannot be band-limited. However, the receiver can only observe a band-limited version $\tilde c_j(t)$ of $c_j(t)$ due to the ideal band-limited filter $h(t)$. Notably, $\tilde c_j(t)$ is no longer time-limited. By the sampling theorem, the band-limited but time-unlimited $\tilde c_j(t)$ can be distortionlessly reconstructed from its (possibly infinitely many) samples taken $1/(2W)$ apart.
Implicit system constraint: yet, the receiver can only use those samples within time $[0,T)$ to guess what the transmitter originally sent out. Notably, these $2WT$ samples may not reconstruct $\tilde c_j(t)$ without distortion. As a result, the waveform codewords $\{c_j(t)\}_{j=1}^{M}$ are chosen such that their residual signals $\{\tilde c_j(t)\}_{j=1}^{M}$, after experiencing the ideal lowpass filter $h(t)$ and the implicit $2WT$-sample constraint, are more resistant to noise.

69 Bandlimited Waveform Channels with AWGN I: 6-68  What we learn from these remarks: The time-limited waveform $c_j(t)$ passes through an ideal lowpass filter and is sampled, and only $2WT$ samples survive at the receiver end without noise. $\tilde X_t$ (not $X_t$) is the true residual signal that survives at the receiver end without noise. The power constraint in the capacity-cost function is actually applied to $\tilde X_t$ (the true signal that can be reconstructed from the $2WT$ samples seen at the receiver end), rather than to the transmitted signal $X_t$. Indeed, the signal-to-noise ratio of concern in most communication problems is the ratio of the signal power surviving at the receiver end to the noise power experienced by this received signal. Do not misunderstand the signal power in this ratio as the transmitted power at the transmitter end.

70 Bandlimited Waveform Channels with AWGN I: 6-69  The noise process: how about the band-limited AWGN $\tilde N_t = N_t * h(t)$? $\tilde N_t$ is no longer white. However, with the right sampling rate, the samples are still uncorrelated. Consequently, the noises experienced by the $2WT$ signal samples are still independent Gaussian distributed (cf. the next slide).
$Y_t = (X_t + N_t) * h(t) = \tilde X_t + \tilde N_t.$
$Y_{k/(2W)} = \tilde X_{k/(2W)} + \tilde N_{k/(2W)}$ for $0\le k < 2WT$.

71 Bandlimited Waveform Channels with AWGN I: 6-70
$E[\tilde N_{i/(2W)}\tilde N_{k/(2W)}] = E\Big[\Big(\int_{\mathbb{R}} h(\tau)N_{i/(2W)-\tau}\,d\tau\Big)\Big(\int_{\mathbb{R}} h(\tau')N_{k/(2W)-\tau'}\,d\tau'\Big)\Big] = \int_{\mathbb{R}}\int_{\mathbb{R}} h(\tau)h(\tau')\,E\big[N_{i/(2W)-\tau}N_{k/(2W)-\tau'}\big]\,d\tau\,d\tau' = \int_{\mathbb{R}}\int_{\mathbb{R}} h(\tau)h(\tau')\frac{N_0}{2}\delta\!\Big(\frac{i}{2W}-\frac{k}{2W}-\tau+\tau'\Big)d\tau\,d\tau' = \frac{N_0}{2}\int_{\mathbb{R}} h(\tau)\,h\big(\tau-(i-k)/(2W)\big)\,d\tau$
$\Big(\text{with }\int|H(f)|^2\,df = 1,\ \text{i.e., } H(f) = \tfrac{1}{\sqrt{2W}}\text{ on }[-W,W]\Big)$
$= \frac{N_0}{2}\int_{\mathbb{R}}\Big(\int_{-W}^{W}\frac{1}{\sqrt{2W}}e^{j2\pi f\tau}\,df\Big)\Big(\int_{-W}^{W}\frac{1}{\sqrt{2W}}e^{j2\pi f'(\tau-(i-k)/(2W))}\,df'\Big)d\tau = \frac{N_0}{4W}\int_{-W}^{W}\int_{-W}^{W}\Big(\int_{\mathbb{R}} e^{j2\pi(f+f')\tau}\,d\tau\Big)e^{-j2\pi f'(i-k)/(2W)}\,df\,df' = \frac{N_0}{4W}\int_{-W}^{W}\int_{-W}^{W}\delta(f+f')\,e^{-j2\pi f'(i-k)/(2W)}\,df\,df' = \frac{N_0}{4W}\int_{-W}^{W} e^{j2\pi f(i-k)/(2W)}\,df = \frac{N_0}{2}\cdot\frac{\sin\big(2\pi W(i-k)/(2W)\big)}{2\pi W(i-k)/(2W)} = \frac{N_0}{2}\cdot\frac{\sin(\pi(i-k))}{\pi(i-k)}$ (with the right sampling rate, $W$ is cancelled out) $= \begin{cases}N_0/2, & i=k;\\ 0, & i\ne k.\end{cases}$

72 Bandlimited Waveform Channels with AWGN I: 6-71  Hence, the capacity-cost function of this channel subject to input waveform duration $T$ (and the implicit system constraint) is equal to
$C_T(S) = \max_{\{p_{\tilde X^{2WT}}\,:\,\sum_{i=0}^{2WT-1} E[\tilde X_{i/(2W)}^2]\le S\}} I(\tilde X^{2WT}; Y^{2WT}) = \sum_{i=0}^{2WT-1}\frac12\log\Big(1+\frac{S_i}{\sigma_i^2}\Big) = \sum_{i=0}^{2WT-1}\frac12\log\Big(1+\frac{S/(2WT)}{N_0/2}\Big) = WT\log\Big(1+\frac{S}{WTN_0}\Big),$
where the input samples $\tilde X^{2WT}$ that achieve the capacity are also i.i.d. Gaussian distributed. It can be proved, similarly to the argument that white $N_t$ yields i.i.d. $\tilde N^{2WT}$, that a white Gaussian process $X_t$ renders i.i.d. Gaussian distributed filtered samples $\tilde X^{2WT}$, if the right sampling rate is employed.

73 Bandlimited Waveform Channels with AWGN I: 6-72  Remark on notation: $C_T(S)$ denotes the capacity-cost function subject to input waveform duration $T$. This notation is specifically used in waveform channels. $C_T(S)$ should not be confused with the notation $C(S)$, which represents the capacity-cost function of a discrete-time channel, where the channel input is transmitted only at each sampled time instance, and hence no duration $T$ is involved. One can, however, measure $C(S)$ in units of bits per sample period to relate the quantity to the usual unit of data transmission speed, such as bits per second. To obtain the maximum reliable data transmission speed over a waveform channel in units of bits per second, we need to calculate $C_T(S)/T$.

74 Bandlimited Waveform Channels with AWGN I: 6-73  Example 6.34 (telephone line channel) Suppose telephone signals are bandlimited to 4 kHz. Given a signal-to-noise ratio (SNR) of 20 dB (namely, $S/(WTN_0) = 100$) and $T = 1$ millisecond, the capacity of the bandlimited Gaussian waveform channel is equal to
$C_T(S) = WT\log\Big(1+\frac{S}{WTN_0}\Big) = 4000\times 10^{-3}\times\log(1+100) \approx 18.46 \text{ nats}.$
Therefore, the maximum reliable transmission speed is $\frac{C_T(S)}{T} \approx 1.85\times 10^4$ nats per second $\approx 2.66\times 10^4$ bits per second.
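
The numbers of Example 6.34 worked out in code (my computation; the ratio inside the logarithm is taken to be 100, i.e. 20 dB, and the specific decimal values are not from the slide):

```python
import numpy as np

W, T, snr = 4000.0, 1e-3, 100.0          # bandwidth (Hz), block duration (s), SNR = 20 dB
C_T = W * T * np.log(1 + snr)            # nats per block of duration T
print(C_T)                               # ≈ 18.46 nats
print(C_T / T)                           # ≈ 1.85e4 nats per second
print(C_T / T / np.log(2))               # ≈ 2.66e4 bits per second
```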

75 Bandlimited Waveform Channels with AWGN I: 6-74  Remarks: Special attention should be paid to the fact that the capacity-cost formula used above is based on a considerably simplified channel model, i.e., the noise is additive white Gaussian. The capacity formula from such a simplified model only provides a lower bound to the true channel capacity, since AWGN is the worst noise in the sense of capacity. So the true channel capacity is possibly higher than the quantity obtained in the above example!

76 Band-Unlimited Waveform Channels with AWGN I: 6-75  Taking $W\to\infty$ in the above formula, we obtain that the channel capacity of the Gaussian waveform channel with infinite bandwidth is $C_T(S) \to \frac{S}{N_0}$ (nats per $T$ unit time). The capacity grows linearly with the input power.

77 Filtered Waveform Stationary Gaussian Channels I: 6-76  [Figure: in the filtered waveform stationary Gaussian channel, $X_t$ passes through the filter $H(f)$ and the noise $N_t$ is added at the filter output to give $Y_t$.]
The filtered waveform stationary Gaussian channel can be transformed into an equivalent channel in which a noise $\bar N_t$ is added to $X_t$ before the filter $H(f)$, where the power spectral density $\mathrm{PSD}_{\bar N}(f)$ of $\bar N_t$ satisfies $\mathrm{PSD}_{\bar N}(f) = \frac{\mathrm{PSD}_N(f)}{|H(f)|^2}$, and $\mathrm{PSD}_N(f)$ is the power spectral density of $N_t$.

78 Filtered Waveform Stationary Gaussian Channels I: 6-77  Notes: We cannot use the sampling theorem in this case (as we did previously) because $\bar N_t$ is no longer assumed white; so the noise samples are not i.i.d. Also, $h(t)$ is not necessarily band-limited.
Lemma 6.35 Any real-valued function $v(t)$ defined over $[0,T)$ can be decomposed into $v(t) = \sum_{i=1}^{\infty} v_i\Psi_i(t)$, where the real-valued functions $\{\Psi_i(t)\}$ are any orthonormal set of functions on $[0,T)$ (which can span the space with respect to $v(t)$), namely $\int_0^T\Psi_i(t)\Psi_j(t)\,dt = \begin{cases}1, & i=j;\\ 0, & i\ne j,\end{cases}$ and $v_i = \int_0^T\Psi_i(t)v(t)\,dt$.

79 Filtered Waveform Stationary Gaussian Channels I: 6-78 Examples Orthonormal set is not unique. Sampling theorem employs sinc function as the orthonormal set to span a bandlimited signal. Fourier expansion is another example of orthonormal expansion. The above two orthonormal sets do not suit here, because their resultant coefficients are not necessarily i.i.d. We therefore have to introduce the Karhunen-Loeve expansion.

80 Filtered Waveform Stationary Gaussian Channels I: 6-79  Lemma 6.36 (Karhunen-Loeve expansion) Given a stationary random process $V_t$ and its autocorrelation function $\phi_V(t) = E[V_\tau V_{\tau+t}]$, let $\{\Psi_i(t)\}_{i=1}^{\infty}$ and $\{\lambda_i\}_{i=1}^{\infty}$ be the eigenfunctions and eigenvalues of $\phi_V(t)$, namely $\int_0^T\phi_V(t-s)\Psi_i(s)\,ds = \lambda_i\Psi_i(t)$ for $0\le t\le T$. Then the expansion coefficients $\{\Lambda_i\}_{i=1}^{\infty}$ of $V_t$ with respect to the orthonormal functions $\{\Psi_i(t)\}_{i=1}^{\infty}$ are uncorrelated. In addition, if $V_t$ is Gaussian, then $\{\Lambda_i\}_{i=1}^{\infty}$ are independent Gaussian random variables.

81 Filtered Waveform Stationary Gaussian Channels I: 6-80  Proof:
$E[\Lambda_i\Lambda_j] = E\Big[\int_0^T\Psi_i(t)V_t\,dt\int_0^T\Psi_j(s)V_s\,ds\Big] = \int_0^T\int_0^T\Psi_i(t)\Psi_j(s)E[V_tV_s]\,dt\,ds = \int_0^T\int_0^T\Psi_i(t)\Psi_j(s)\phi_V(t-s)\,dt\,ds = \int_0^T\Psi_i(t)\Big(\int_0^T\Psi_j(s)\phi_V(t-s)\,ds\Big)dt = \int_0^T\Psi_i(t)\big(\lambda_j\Psi_j(t)\big)\,dt = \begin{cases}\lambda_i, & i=j;\\ 0, & i\ne j.\end{cases}$

82 Filtered Waveform Stationary Gaussian Channels I: 6-81  Theorem 6.37 Given a filtered Gaussian waveform channel with noise spectral density $\mathrm{PSD}_N(f)$ and filter $H(f)$,
$C(S) = \lim_{T\to\infty}\frac{C_T(S)}{T} = \frac12\int_{-\infty}^{\infty}\max\Big[0,\ \log\frac{\theta}{\mathrm{PSD}_N(f)/|H(f)|^2}\Big]df,$
where $\theta$ is the solution of $S = \int_{-\infty}^{\infty}\max\Big[0,\ \theta-\frac{\mathrm{PSD}_N(f)}{|H(f)|^2}\Big]df$.
Remark: This also follows the water-pouring scheme.

83 Filtered Waveform Stationary Gaussian Channels I: 6-82  [Figure: (a) the water-pouring scheme, where the curve is $\mathrm{PSD}_N(f)/|H(f)|^2$; (b) the input spectral density that achieves capacity.]

84 Filtered Waveform Stationary Gaussian Channels I: 6-83  Proof: Let the eigenfunctions in the Karhunen-Loeve expansion of the autocorrelation function of $\bar N_t$ be denoted by $\{\Psi_i(t)\}_{i=1}^{\infty}$. Abusing notation without ambiguity, we let $\bar N_i$ and $X_i$ be the Karhunen-Loeve coefficients of $\bar N_t$ and $X_t$ with respect to $\Psi_i(t)$. Since $\{\bar N_i\}_{i=1}^{\infty}$ are independent Gaussian distributed, we obtain that the channel capacity subject to input waveform duration $T$ is
$C_T(S) = \sum_{i=1}^{\infty}\frac12\log\Big(1+\frac{\max(0,\theta-\lambda_i)}{\lambda_i}\Big) = \sum_{i=1}^{\infty}\frac12\log\Big[\max\Big(1,\frac{\theta}{\lambda_i}\Big)\Big] = \sum_{i=1}^{\infty}\frac12\max\Big[0,\ \log\frac{\theta}{\lambda_i}\Big],$
where $\theta$ is the solution of $S = \sum_{i=1}^{\infty}\max[0,\ \theta-\lambda_i]$, and $E[\bar N_i^2] = \lambda_i$ is the $i$-th eigenvalue of the autocorrelation function of $\bar N_t$ (corresponding to eigenfunction $\Psi_i(t)$).

85 Filtered Waveform Stationary Gaussian Channels I: 6-84  The proof is then completed by applying the Toeplitz distribution theorem.
[Toeplitz distribution theorem] Consider a zero-mean stationary random process $V_t$ with power spectral density $\mathrm{PSD}_V(f)$ and $\int_{-\infty}^{\infty}\mathrm{PSD}_V(f)\,df < \infty$. Denote by $\lambda_1(T), \lambda_2(T), \lambda_3(T), \ldots$ the eigenvalues of the Karhunen-Loeve expansion corresponding to the autocorrelation function of $V_t$ over a time interval of width $T$. Then for any real-valued continuous function $a(\cdot)$ satisfying $|a(t)| \le Kt$ for $0\le t\le\max_{f\in\mathbb{R}}\mathrm{PSD}_V(f)$ and some finite constant $K$,
$\lim_{T\to\infty}\frac1T\sum_{i=1}^{\infty} a(\lambda_i(T)) = \int_{-\infty}^{\infty} a\big(\mathrm{PSD}_V(f)\big)\,df.$

86 Information Transmission Theorem I: 6-85  Theorem 6.38 (joint source-channel coding theorem) Fix a distortion measure. A DMS can be reproduced at the output of a channel with distortion less than $D$ (by taking sufficiently large blocklength) if
$\frac{R(D)\ \text{nats/source letter}}{T_s\ \text{seconds/source letter}} < \frac{C(S)\ \text{nats/channel usage}}{T_c\ \text{seconds/channel usage}},$
where $T_s$ and $T_c$ represent the durations per source letter and per channel input, respectively. Note that the units of $R(D)$ and $C(S)$ should be the same, i.e., they should both be measured in nats (by taking the natural logarithm) or both in bits (by taking the base-2 logarithm).
Theorem 6.39 (joint source-channel coding converse) All data transmission codes will have average distortion larger than $D$ for sufficiently large blocklength if $\frac{R(D)}{T_s} > \frac{C(S)}{T_c}$.

87 Information Transmission Theorem I: 6-86  Example 6.40 (additive white Gaussian noise (AWGN) channel with binary channel input) The source is discrete-time binary memoryless with uniform marginal. The discrete-time continuous channel has binary input alphabet and real-line output alphabet with Gaussian transition probability. Denote by $P_b$ the probability of bit error (i.e., the Hamming distortion measure is adopted).

88 Information Transmission Theorem I: 6-87  The rate-distortion function for the binary uniform source and the Hamming additive distortion measure is
$R(D) = \begin{cases}\log(2) - H_b(D), & 0\le D\le\frac12;\\ 0, & D>\frac12.\end{cases}$
Due to Butman and McEliece, the channel capacity-cost function for the binary-input AWGN channel is
$C(S) = \frac{S}{\sigma^2} - \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-y^2/2}\log\Big[\cosh\Big(\frac{S}{\sigma^2} + y\sqrt{\frac{S}{\sigma^2}}\Big)\Big]dy = \frac{E_bT_c/T_s}{N_0/2} - \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-y^2/2}\log\Big[\cosh\Big(\frac{E_bT_c/T_s}{N_0/2} + y\sqrt{\frac{E_bT_c/T_s}{N_0/2}}\Big)\Big]dy = 2R\gamma_b - \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-y^2/2}\log\big[\cosh\big(2R\gamma_b + y\sqrt{2R\gamma_b}\big)\big]dy,$
where $R = T_c/T_s$ is the code rate for data transmission, measured in source letters/channel usage (or information bits/channel bit), and $\gamma_b$ (often denoted by $E_b/N_0$) is the signal-to-noise ratio per information bit.

89 Information Transmission Theorem I: 6-88  Then from the joint source-channel coding theorem, good codes exist when $R(D) < \frac{T_s}{T_c}C(S)$, or equivalently,
$\log(2) - H_b(P_b) < 2\gamma_b - \frac{1}{R\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-y^2/2}\log\big[\cosh\big(2R\gamma_b + y\sqrt{2R\gamma_b}\big)\big]dy.$

90 Information Transmission Theorem I: 6-89  By re-formulating the above inequality as
$H_b(P_b) > \log(2) - 2\gamma_b + \frac{1}{R\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-y^2/2}\log\big[\cosh\big(2R\gamma_b + y\sqrt{2R\gamma_b}\big)\big]dy,$
a lower bound on the bit error probability as a function of $\gamma_b$ is established. This is the Shannon limit for any code over the binary-input Gaussian channel.
[Figure: the Shannon limits for $(2,1)$ ($R=1/2$) and $(3,1)$ ($R=1/3$) codes under the binary-input AWGN channel, plotting $P_b$ (from 1 down to $10^{-5}$) against $\gamma_b$ in dB.]
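
A numerical sketch of the Shannon-limit bound above (my addition, using SciPy's `quad` and `brentq`; the function names are mine). It evaluates the capacity integral of the slides and inverts $H_b$ to get the smallest admissible $P_b$ for a given rate $R$ and $\gamma_b$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def capacity_biawgn(R, gamma_b):
    # C(S) = 2*R*gamma_b - (1/sqrt(2*pi)) * Int e^{-y^2/2} log cosh(2*R*gamma_b + y*sqrt(2*R*gamma_b)) dy
    a = 2 * R * gamma_b
    integrand = lambda y: np.exp(-y**2 / 2) * np.log(np.cosh(a + y * np.sqrt(a)))
    integral, _ = quad(integrand, -30, 30)
    return a - integral / np.sqrt(2 * np.pi)          # nats per channel use

def Hb(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)    # binary entropy in nats

def pb_lower_bound(R, gamma_b):
    # Smallest Pb satisfying Hb(Pb) >= log(2) - C(S)/R; zero if the RHS is non-positive.
    rhs = np.log(2) - capacity_biawgn(R, gamma_b) / R
    if rhs <= 0:
        return 0.0
    return brentq(lambda p: Hb(p) - rhs, 1e-12, 0.5)

R = 0.5
for gamma_b_db in (-1.0, -0.5, 0.0, 0.5):
    gamma_b = 10 ** (gamma_b_db / 10)
    print(gamma_b_db, pb_lower_bound(R, gamma_b))      # the bound vanishes above ~0.19 dB
```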

91 Information Transmission Theorem I: 6-90  The result in the above example became important with the invention of Turbo coding, for which a near-Shannon-limit performance was first obtained. This implies that a near-optimal channel code has been constructed, since in principle no code can perform better than the Shannon limit.

92 Information Transmission Theorem I: 6-91  Example 6.41 (AWGN channel with real-number input) The source is discrete-time binary memoryless with uniform marginal. The discrete-time continuous channel has real-line input alphabet and real-line output alphabet with Gaussian transition probability. Denote by $P_b$ the probability of bit error (i.e., the Hamming distortion measure is adopted).


More information

Capacity of a channel Shannon s second theorem. Information Theory 1/33

Capacity of a channel Shannon s second theorem. Information Theory 1/33 Capacity of a channel Shannon s second theorem Information Theory 1/33 Outline 1. Memoryless channels, examples ; 2. Capacity ; 3. Symmetric channels ; 4. Channel Coding ; 5. Shannon s second theorem,

More information

Lecture 2. Capacity of the Gaussian channel

Lecture 2. Capacity of the Gaussian channel Spring, 207 5237S, Wireless Communications II 2. Lecture 2 Capacity of the Gaussian channel Review on basic concepts in inf. theory ( Cover&Thomas: Elements of Inf. Theory, Tse&Viswanath: Appendix B) AWGN

More information

LECTURE 13. Last time: Lecture outline

LECTURE 13. Last time: Lecture outline LECTURE 13 Last time: Strong coding theorem Revisiting channel and codes Bound on probability of error Error exponent Lecture outline Fano s Lemma revisited Fano s inequality for codewords Converse to

More information

Lecture 8: Channel and source-channel coding theorems; BEC & linear codes. 1 Intuitive justification for upper bound on channel capacity

Lecture 8: Channel and source-channel coding theorems; BEC & linear codes. 1 Intuitive justification for upper bound on channel capacity 5-859: Information Theory and Applications in TCS CMU: Spring 23 Lecture 8: Channel and source-channel coding theorems; BEC & linear codes February 7, 23 Lecturer: Venkatesan Guruswami Scribe: Dan Stahlke

More information

Shannon s A Mathematical Theory of Communication

Shannon s A Mathematical Theory of Communication Shannon s A Mathematical Theory of Communication Emre Telatar EPFL Kanpur October 19, 2016 First published in two parts in the July and October 1948 issues of BSTJ. First published in two parts in the

More information

ECE Information theory Final (Fall 2008)

ECE Information theory Final (Fall 2008) ECE 776 - Information theory Final (Fall 2008) Q.1. (1 point) Consider the following bursty transmission scheme for a Gaussian channel with noise power N and average power constraint P (i.e., 1/n X n i=1

More information

Digital Communications

Digital Communications Digital Communications Chapter 9 Digital Communications Through Band-Limited Channels Po-Ning Chen, Professor Institute of Communications Engineering National Chiao-Tung University, Taiwan Digital Communications:

More information

Shannon s noisy-channel theorem

Shannon s noisy-channel theorem Shannon s noisy-channel theorem Information theory Amon Elders Korteweg de Vries Institute for Mathematics University of Amsterdam. Tuesday, 26th of Januari Amon Elders (Korteweg de Vries Institute for

More information

7 The Waveform Channel

7 The Waveform Channel 7 The Waveform Channel The waveform transmitted by the digital demodulator will be corrupted by the channel before it reaches the digital demodulator in the receiver. One important part of the channel

More information

Lecture 4 Noisy Channel Coding

Lecture 4 Noisy Channel Coding Lecture 4 Noisy Channel Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 9, 2015 1 / 56 I-Hsiang Wang IT Lecture 4 The Channel Coding Problem

More information

Shannon meets Wiener II: On MMSE estimation in successive decoding schemes

Shannon meets Wiener II: On MMSE estimation in successive decoding schemes Shannon meets Wiener II: On MMSE estimation in successive decoding schemes G. David Forney, Jr. MIT Cambridge, MA 0239 USA forneyd@comcast.net Abstract We continue to discuss why MMSE estimation arises

More information

Lecture 11: Continuous-valued signals and differential entropy

Lecture 11: Continuous-valued signals and differential entropy Lecture 11: Continuous-valued signals and differential entropy Biology 429 Carl Bergstrom September 20, 2008 Sources: Parts of today s lecture follow Chapter 8 from Cover and Thomas (2007). Some components

More information

EE5139R: Problem Set 7 Assigned: 30/09/15, Due: 07/10/15

EE5139R: Problem Set 7 Assigned: 30/09/15, Due: 07/10/15 EE5139R: Problem Set 7 Assigned: 30/09/15, Due: 07/10/15 1. Cascade of Binary Symmetric Channels The conditional probability distribution py x for each of the BSCs may be expressed by the transition probability

More information

Communication Theory II

Communication Theory II Communication Theory II Lecture 15: Information Theory (cont d) Ahmed Elnakib, PhD Assistant Professor, Mansoura University, Egypt March 29 th, 2015 1 Example: Channel Capacity of BSC o Let then: o For

More information

Lecture 22: Final Review

Lecture 22: Final Review Lecture 22: Final Review Nuts and bolts Fundamental questions and limits Tools Practical algorithms Future topics Dr Yao Xie, ECE587, Information Theory, Duke University Basics Dr Yao Xie, ECE587, Information

More information

Electrical and Information Technology. Information Theory. Problems and Solutions. Contents. Problems... 1 Solutions...7

Electrical and Information Technology. Information Theory. Problems and Solutions. Contents. Problems... 1 Solutions...7 Electrical and Information Technology Information Theory Problems and Solutions Contents Problems.......... Solutions...........7 Problems 3. In Problem?? the binomial coefficent was estimated with Stirling

More information

Lecture 8: Channel Capacity, Continuous Random Variables

Lecture 8: Channel Capacity, Continuous Random Variables EE376A/STATS376A Information Theory Lecture 8-02/0/208 Lecture 8: Channel Capacity, Continuous Random Variables Lecturer: Tsachy Weissman Scribe: Augustine Chemparathy, Adithya Ganesh, Philip Hwang Channel

More information

EC2252 COMMUNICATION THEORY UNIT 5 INFORMATION THEORY

EC2252 COMMUNICATION THEORY UNIT 5 INFORMATION THEORY EC2252 COMMUNICATION THEORY UNIT 5 INFORMATION THEORY Discrete Messages and Information Content, Concept of Amount of Information, Average information, Entropy, Information rate, Source coding to increase

More information

EE 4TM4: Digital Communications II. Channel Capacity

EE 4TM4: Digital Communications II. Channel Capacity EE 4TM4: Digital Communications II 1 Channel Capacity I. CHANNEL CODING THEOREM Definition 1: A rater is said to be achievable if there exists a sequence of(2 nr,n) codes such thatlim n P (n) e (C) = 0.

More information

Solutions to Homework Set #4 Differential Entropy and Gaussian Channel

Solutions to Homework Set #4 Differential Entropy and Gaussian Channel Solutions to Homework Set #4 Differential Entropy and Gaussian Channel 1. Differential entropy. Evaluate the differential entropy h(x = f lnf for the following: (a Find the entropy of the exponential density

More information

Lecture 18: Gaussian Channel

Lecture 18: Gaussian Channel Lecture 18: Gaussian Channel Gaussian channel Gaussian channel capacity Dr. Yao Xie, ECE587, Information Theory, Duke University Mona Lisa in AWGN Mona Lisa Noisy Mona Lisa 100 100 200 200 300 300 400

More information

Solutions to Homework Set #3 Channel and Source coding

Solutions to Homework Set #3 Channel and Source coding Solutions to Homework Set #3 Channel and Source coding. Rates (a) Channels coding Rate: Assuming you are sending 4 different messages using usages of a channel. What is the rate (in bits per channel use)

More information

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1 Kraft s inequality An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if N 2 l i 1 Proof: Suppose that we have a tree code. Let l max = max{l 1,...,

More information

ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016

ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture : Mutual Information Method Lecturer: Yihong Wu Scribe: Jaeho Lee, Mar, 06 Ed. Mar 9 Quick review: Assouad s lemma

More information

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm 1. Feedback does not increase the capacity. Consider a channel with feedback. We assume that all the recieved outputs are sent back immediately

More information

Block 2: Introduction to Information Theory

Block 2: Introduction to Information Theory Block 2: Introduction to Information Theory Francisco J. Escribano April 26, 2015 Francisco J. Escribano Block 2: Introduction to Information Theory April 26, 2015 1 / 51 Table of contents 1 Motivation

More information

CHAPTER 3. P (B j A i ) P (B j ) =log 2. j=1

CHAPTER 3. P (B j A i ) P (B j ) =log 2. j=1 CHAPTER 3 Problem 3. : Also : Hence : I(B j ; A i ) = log P (B j A i ) P (B j ) 4 P (B j )= P (B j,a i )= i= 3 P (A i )= P (B j,a i )= j= =log P (B j,a i ) P (B j )P (A i ).3, j=.7, j=.4, j=3.3, i=.7,

More information

Second-Order Asymptotics in Information Theory

Second-Order Asymptotics in Information Theory Second-Order Asymptotics in Information Theory Vincent Y. F. Tan (vtan@nus.edu.sg) Dept. of ECE and Dept. of Mathematics National University of Singapore (NUS) National Taiwan University November 2015

More information

Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information

Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information 204 IEEE International Symposium on Information Theory Capacity of the Discrete Memoryless Energy Harvesting Channel with Side Information Omur Ozel, Kaya Tutuncuoglu 2, Sennur Ulukus, and Aylin Yener

More information

ECE Information theory Final

ECE Information theory Final ECE 776 - Information theory Final Q1 (1 point) We would like to compress a Gaussian source with zero mean and variance 1 We consider two strategies In the first, we quantize with a step size so that the

More information

Solutions to Set #2 Data Compression, Huffman code and AEP

Solutions to Set #2 Data Compression, Huffman code and AEP Solutions to Set #2 Data Compression, Huffman code and AEP. Huffman coding. Consider the random variable ( ) x x X = 2 x 3 x 4 x 5 x 6 x 7 0.50 0.26 0. 0.04 0.04 0.03 0.02 (a) Find a binary Huffman code

More information

Upper Bounds on the Capacity of Binary Intermittent Communication

Upper Bounds on the Capacity of Binary Intermittent Communication Upper Bounds on the Capacity of Binary Intermittent Communication Mostafa Khoshnevisan and J. Nicholas Laneman Department of Electrical Engineering University of Notre Dame Notre Dame, Indiana 46556 Email:{mhoshne,

More information

UCSD ECE250 Handout #27 Prof. Young-Han Kim Friday, June 8, Practice Final Examination (Winter 2017)

UCSD ECE250 Handout #27 Prof. Young-Han Kim Friday, June 8, Practice Final Examination (Winter 2017) UCSD ECE250 Handout #27 Prof. Young-Han Kim Friday, June 8, 208 Practice Final Examination (Winter 207) There are 6 problems, each problem with multiple parts. Your answer should be as clear and readable

More information

On the Duality between Multiple-Access Codes and Computation Codes

On the Duality between Multiple-Access Codes and Computation Codes On the Duality between Multiple-Access Codes and Computation Codes Jingge Zhu University of California, Berkeley jingge.zhu@berkeley.edu Sung Hoon Lim KIOST shlim@kiost.ac.kr Michael Gastpar EPFL michael.gastpar@epfl.ch

More information

Lecture 15: Conditional and Joint Typicaility

Lecture 15: Conditional and Joint Typicaility EE376A Information Theory Lecture 1-02/26/2015 Lecture 15: Conditional and Joint Typicaility Lecturer: Kartik Venkat Scribe: Max Zimet, Brian Wai, Sepehr Nezami 1 Notation We always write a sequence of

More information

Binary Transmissions over Additive Gaussian Noise: A Closed-Form Expression for the Channel Capacity 1

Binary Transmissions over Additive Gaussian Noise: A Closed-Form Expression for the Channel Capacity 1 5 Conference on Information Sciences and Systems, The Johns Hopkins University, March 6 8, 5 inary Transmissions over Additive Gaussian Noise: A Closed-Form Expression for the Channel Capacity Ahmed O.

More information

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 Please submit the solutions on Gradescope. EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 1. Optimal codeword lengths. Although the codeword lengths of an optimal variable length code

More information

An introduction to basic information theory. Hampus Wessman

An introduction to basic information theory. Hampus Wessman An introduction to basic information theory Hampus Wessman Abstract We give a short and simple introduction to basic information theory, by stripping away all the non-essentials. Theoretical bounds on

More information

Energy State Amplification in an Energy Harvesting Communication System

Energy State Amplification in an Energy Harvesting Communication System Energy State Amplification in an Energy Harvesting Communication System Omur Ozel Sennur Ulukus Department of Electrical and Computer Engineering University of Maryland College Park, MD 20742 omur@umd.edu

More information

Coding for Discrete Source

Coding for Discrete Source EGR 544 Communication Theory 3. Coding for Discrete Sources Z. Aliyazicioglu Electrical and Computer Engineering Department Cal Poly Pomona Coding for Discrete Source Coding Represent source data effectively

More information

Lecture 6 Channel Coding over Continuous Channels

Lecture 6 Channel Coding over Continuous Channels Lecture 6 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 9, 015 1 / 59 I-Hsiang Wang IT Lecture 6 We have

More information

Lecture 11: Quantum Information III - Source Coding

Lecture 11: Quantum Information III - Source Coding CSCI5370 Quantum Computing November 25, 203 Lecture : Quantum Information III - Source Coding Lecturer: Shengyu Zhang Scribe: Hing Yin Tsang. Holevo s bound Suppose Alice has an information source X that

More information

Information Theory. Coding and Information Theory. Information Theory Textbooks. Entropy

Information Theory. Coding and Information Theory. Information Theory Textbooks. Entropy Coding and Information Theory Chris Williams, School of Informatics, University of Edinburgh Overview What is information theory? Entropy Coding Information Theory Shannon (1948): Information theory is

More information

P 1.5 X 4.5 / X 2 and (iii) The smallest value of n for

P 1.5 X 4.5 / X 2 and (iii) The smallest value of n for DHANALAKSHMI COLLEGE OF ENEINEERING, CHENNAI DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING MA645 PROBABILITY AND RANDOM PROCESS UNIT I : RANDOM VARIABLES PART B (6 MARKS). A random variable X

More information

MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK UNIT V PART-A. 1. What is binary symmetric channel (AUC DEC 2006)

MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK UNIT V PART-A. 1. What is binary symmetric channel (AUC DEC 2006) MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK SATELLITE COMMUNICATION DEPT./SEM.:ECE/VIII UNIT V PART-A 1. What is binary symmetric channel (AUC DEC 2006) 2. Define information rate? (AUC DEC 2007)

More information

x log x, which is strictly convex, and use Jensen s Inequality:

x log x, which is strictly convex, and use Jensen s Inequality: 2. Information measures: mutual information 2.1 Divergence: main inequality Theorem 2.1 (Information Inequality). D(P Q) 0 ; D(P Q) = 0 iff P = Q Proof. Let ϕ(x) x log x, which is strictly convex, and

More information

Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function

Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function Dinesh Krithivasan and S. Sandeep Pradhan Department of Electrical Engineering and Computer Science,

More information

Multimedia Communications. Mathematical Preliminaries for Lossless Compression

Multimedia Communications. Mathematical Preliminaries for Lossless Compression Multimedia Communications Mathematical Preliminaries for Lossless Compression What we will see in this chapter Definition of information and entropy Modeling a data source Definition of coding and when

More information

Feedback Capacity of a Class of Symmetric Finite-State Markov Channels

Feedback Capacity of a Class of Symmetric Finite-State Markov Channels Feedback Capacity of a Class of Symmetric Finite-State Markov Channels Nevroz Şen, Fady Alajaji and Serdar Yüksel Department of Mathematics and Statistics Queen s University Kingston, ON K7L 3N6, Canada

More information

( 1 k "information" I(X;Y) given by Y about X)

( 1 k information I(X;Y) given by Y about X) SUMMARY OF SHANNON DISTORTION-RATE THEORY Consider a stationary source X with f (x) as its th-order pdf. Recall the following OPTA function definitions: δ(,r) = least dist'n of -dim'l fixed-rate VQ's w.

More information

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R In probabilistic models, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. As a function or a map, it maps from an element (or an outcome) of a sample

More information

MAHALAKSHMI ENGINEERING COLLEGE QUESTION BANK. SUBJECT CODE / Name: EC2252 COMMUNICATION THEORY UNIT-V INFORMATION THEORY PART-A

MAHALAKSHMI ENGINEERING COLLEGE QUESTION BANK. SUBJECT CODE / Name: EC2252 COMMUNICATION THEORY UNIT-V INFORMATION THEORY PART-A MAHALAKSHMI ENGINEERING COLLEGE QUESTION BANK DEPARTMENT: ECE SEMESTER: IV SUBJECT CODE / Name: EC2252 COMMUNICATION THEORY UNIT-V INFORMATION THEORY PART-A 1. What is binary symmetric channel (AUC DEC

More information

Statistical signal processing

Statistical signal processing Statistical signal processing Short overview of the fundamentals Outline Random variables Random processes Stationarity Ergodicity Spectral analysis Random variable and processes Intuition: A random variable

More information

National University of Singapore Department of Electrical & Computer Engineering. Examination for

National University of Singapore Department of Electrical & Computer Engineering. Examination for National University of Singapore Department of Electrical & Computer Engineering Examination for EE5139R Information Theory for Communication Systems (Semester I, 2014/15) November/December 2014 Time Allowed:

More information

10-704: Information Processing and Learning Fall Lecture 9: Sept 28

10-704: Information Processing and Learning Fall Lecture 9: Sept 28 10-704: Information Processing and Learning Fall 2016 Lecturer: Siheng Chen Lecture 9: Sept 28 Note: These notes are based on scribed notes from Spring15 offering of this course. LaTeX template courtesy

More information

Chapter 3, 4 Random Variables ENCS Probability and Stochastic Processes. Concordia University

Chapter 3, 4 Random Variables ENCS Probability and Stochastic Processes. Concordia University Chapter 3, 4 Random Variables ENCS6161 - Probability and Stochastic Processes Concordia University ENCS6161 p.1/47 The Notion of a Random Variable A random variable X is a function that assigns a real

More information

Information measures in simple coding problems

Information measures in simple coding problems Part I Information measures in simple coding problems in this web service in this web service Source coding and hypothesis testing; information measures A(discrete)source is a sequence {X i } i= of random

More information

Linear Codes, Target Function Classes, and Network Computing Capacity

Linear Codes, Target Function Classes, and Network Computing Capacity Linear Codes, Target Function Classes, and Network Computing Capacity Rathinakumar Appuswamy, Massimo Franceschetti, Nikhil Karamchandani, and Kenneth Zeger IEEE Transactions on Information Theory Submitted:

More information

EE 4TM4: Digital Communications II Scalar Gaussian Channel

EE 4TM4: Digital Communications II Scalar Gaussian Channel EE 4TM4: Digital Communications II Scalar Gaussian Channel I. DIFFERENTIAL ENTROPY Let X be a continuous random variable with probability density function (pdf) f(x) (in short X f(x)). The differential

More information

Lecture 15: Thu Feb 28, 2019

Lecture 15: Thu Feb 28, 2019 Lecture 15: Thu Feb 28, 2019 Announce: HW5 posted Lecture: The AWGN waveform channel Projecting temporally AWGN leads to spatially AWGN sufficiency of projection: irrelevancy theorem in waveform AWGN:

More information

ECE534, Spring 2018: Solutions for Problem Set #5

ECE534, Spring 2018: Solutions for Problem Set #5 ECE534, Spring 08: s for Problem Set #5 Mean Value and Autocorrelation Functions Consider a random process X(t) such that (i) X(t) ± (ii) The number of zero crossings, N(t), in the interval (0, t) is described

More information

EE5319R: Problem Set 3 Assigned: 24/08/16, Due: 31/08/16

EE5319R: Problem Set 3 Assigned: 24/08/16, Due: 31/08/16 EE539R: Problem Set 3 Assigned: 24/08/6, Due: 3/08/6. Cover and Thomas: Problem 2.30 (Maimum Entropy): Solution: We are required to maimize H(P X ) over all distributions P X on the non-negative integers

More information

MARKOV CHAINS A finite state Markov chain is a sequence of discrete cv s from a finite alphabet where is a pmf on and for

MARKOV CHAINS A finite state Markov chain is a sequence of discrete cv s from a finite alphabet where is a pmf on and for MARKOV CHAINS A finite state Markov chain is a sequence S 0,S 1,... of discrete cv s from a finite alphabet S where q 0 (s) is a pmf on S 0 and for n 1, Q(s s ) = Pr(S n =s S n 1 =s ) = Pr(S n =s S n 1

More information

The Central Limit Theorem: More of the Story

The Central Limit Theorem: More of the Story The Central Limit Theorem: More of the Story Steven Janke November 2015 Steven Janke (Seminar) The Central Limit Theorem:More of the Story November 2015 1 / 33 Central Limit Theorem Theorem (Central Limit

More information

C.M. Liu Perceptual Signal Processing Lab College of Computer Science National Chiao-Tung University

C.M. Liu Perceptual Signal Processing Lab College of Computer Science National Chiao-Tung University Quantization C.M. Liu Perceptual Signal Processing Lab College of Computer Science National Chiao-Tung University http://www.csie.nctu.edu.tw/~cmliu/courses/compression/ Office: EC538 (03)5731877 cmliu@cs.nctu.edu.tw

More information

Shannon s Noisy-Channel Coding Theorem

Shannon s Noisy-Channel Coding Theorem Shannon s Noisy-Channel Coding Theorem Lucas Slot Sebastian Zur February 2015 Abstract In information theory, Shannon s Noisy-Channel Coding Theorem states that it is possible to communicate over a noisy

More information

Exercise 1. = P(y a 1)P(a 1 )

Exercise 1. = P(y a 1)P(a 1 ) Chapter 7 Channel Capacity Exercise 1 A source produces independent, equally probable symbols from an alphabet {a 1, a 2 } at a rate of one symbol every 3 seconds. These symbols are transmitted over a

More information

Quiz 2 Date: Monday, November 21, 2016

Quiz 2 Date: Monday, November 21, 2016 10-704 Information Processing and Learning Fall 2016 Quiz 2 Date: Monday, November 21, 2016 Name: Andrew ID: Department: Guidelines: 1. PLEASE DO NOT TURN THIS PAGE UNTIL INSTRUCTED. 2. Write your name,

More information

SOURCE CODING WITH SIDE INFORMATION AT THE DECODER (WYNER-ZIV CODING) FEB 13, 2003

SOURCE CODING WITH SIDE INFORMATION AT THE DECODER (WYNER-ZIV CODING) FEB 13, 2003 SOURCE CODING WITH SIDE INFORMATION AT THE DECODER (WYNER-ZIV CODING) FEB 13, 2003 SLEPIAN-WOLF RESULT { X i} RATE R x ENCODER 1 DECODER X i V i {, } { V i} ENCODER 0 RATE R v Problem: Determine R, the

More information

19. Channel coding: energy-per-bit, continuous-time channels

19. Channel coding: energy-per-bit, continuous-time channels 9. Channel coding: energy-per-bit, continuous-time channels 9. Energy per bit Consider the additive Gaussian noise channel: Y i = X i + Z i, Z i N ( 0, ). (9.) In the last lecture, we analyzed the maximum

More information

Homework Set #2 Data Compression, Huffman code and AEP

Homework Set #2 Data Compression, Huffman code and AEP Homework Set #2 Data Compression, Huffman code and AEP 1. Huffman coding. Consider the random variable ( x1 x X = 2 x 3 x 4 x 5 x 6 x 7 0.50 0.26 0.11 0.04 0.04 0.03 0.02 (a Find a binary Huffman code

More information