Lecture 6: Gaussian Channels
Differential entropy (1)

Definition 18. The (joint) differential entropy of a continuous random vector $X^n \sim p_{X^n}(x)$ over $\mathbb{R}^n$ is

$$h(X^n) = -\int p_{X^n}(x) \log p_{X^n}(x)\, dx = -\mathbb{E}\left[\log p_{X^n}(X^n)\right]$$

$h(X^n)$ is a concave function of the pdf $p_{X^n}$, but it is not necessarily nonnegative. Sometimes we write $h(p_X)$ when we wish to stress that $h(\cdot)$ is a function of the pdf $p_X$.

Translation: $h(X^n + a) = h(X^n)$.

Scaling: $h(\mathbf{A}X^n) = h(X^n) + \log|\det \mathbf{A}|$.
Differential entropy (2)

Example: uniform distribution: $p_X(x) = \frac{1}{a}\,\mathbb{1}\{x \in [0,a]\}$. Then we have

$$h(X) = \int_0^a \frac{1}{a}\log a \, dx = \log a$$

More generally, for a uniform distribution over a compact domain $D \subset \mathbb{R}^n$ with volume $\mathrm{Vol}(D)$ we have

$$h(X^n) = \log \mathrm{Vol}(D)$$
Gaussian distribution

The standard normal distribution $X \sim \mathcal{N}(0,1)$ has density

$$p_X(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

The real multivariate Gaussian distribution $X^n \sim \mathcal{N}(\mu, \Sigma)$ has density

$$p_{X^n}(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)$$
Real Gaussian random vectors

We notice that, for Gaussian densities, $\log p_{X^n}(x)$ is a quadratic function of $x$. E.g., for the real case we have

$$\log p_{X^n}(x) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\,\log e$$

Since $h(X^n)$ is invariant under translations, it must be independent of $\mu$. Hence, there is no loss of generality in assuming $\mu = 0$. Taking expectation and changing sign, and noticing that $x^T\Sigma^{-1}x = \mathrm{tr}(xx^T\Sigma^{-1})$, we have

$$h(X^n) = \frac{n}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| + \frac{1}{2}\mathrm{tr}\left(\mathbb{E}[XX^T]\Sigma^{-1}\right)\log e = \frac{1}{2}\log\left((2\pi e)^n |\Sigma|\right)$$

where the last equality uses $\mathrm{tr}\left(\mathbb{E}[XX^T]\Sigma^{-1}\right) = \mathrm{tr}(\mathbf{I}_n) = n$.
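The closed-form expression can be checked numerically. The following sketch (not part of the original slides) compares the formula $\frac{1}{2}\log\left((2\pi e)^n|\Sigma|\right)$, in nats, against SciPy's built-in entropy of a multivariate normal; the covariance matrix used is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Check h(X^n) = 0.5 * log((2*pi*e)^n * det(Sigma)) in nats against SciPy.
# Sigma is an arbitrary symmetric, positive-definite covariance matrix.
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
n = Sigma.shape[0]

h_formula = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(Sigma))
h_scipy = multivariate_normal(mean=np.zeros(n), cov=Sigma).entropy()
print(h_formula, h_scipy)  # the two values agree
```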
Complex Gaussian random vectors

The complex circularly symmetric Gaussian distribution $X^n \sim \mathcal{CN}(\mu, \Sigma)$, where $X^n = X_R^n + jX_I^n$ with $X_R^n \sim \mathcal{N}(\mathrm{Re}\{\mu\}, \frac{1}{2}\Sigma)$, $X_I^n \sim \mathcal{N}(\mathrm{Im}\{\mu\}, \frac{1}{2}\Sigma)$, and $X_R^n, X_I^n$ statistically independent, has density

$$p_{X^n}(x) = \frac{1}{\pi^n |\Sigma|} \exp\left(-(x-\mu)^H \Sigma^{-1}(x-\mu)\right)$$

Repeating the arguments of before, for the complex circularly symmetric case we have

$$h(X^n) = \log\left((\pi e)^n |\Sigma|\right)$$
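As a sanity check (an illustrative sketch, not from the slides), the complex formula can be recovered from the real one by writing $X^n = X_R^n + jX_I^n$ with independent parts of covariance $\Sigma/2$ each, exactly as in the definition above; the covariance matrix below is an arbitrary real-valued example.

```python
import numpy as np

# Check h(X^n) = log((pi*e)^n * det(Sigma)) for X^n ~ CN(0, Sigma), in nats,
# using the construction X = X_R + j*X_I with X_R, X_I independent,
# each ~ N(0, Sigma/2).  Sigma is an arbitrary real covariance matrix.
Sigma = np.array([[1.5, 0.4],
                  [0.4, 0.8]])
n = Sigma.shape[0]

def h_real_gaussian(cov):
    """Differential entropy (nats) of a real Gaussian vector with covariance 'cov'."""
    k = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** k * np.linalg.det(cov))

h_complex = 2 * h_real_gaussian(Sigma / 2)          # h(X_R) + h(X_I) by independence
h_formula = np.log((np.pi * np.e) ** n * np.linalg.det(Sigma))
print(h_complex, h_formula)  # the two values agree
```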
Conditional differential entropy

Definition 19. For two jointly distributed continuous random vectors $X^n, Y^m$ over $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively, with joint pdf $p_{X^n,Y^m}(x,y)$, the conditional differential entropy of $X^n$ given $Y^m$ is

$$h(X^n \mid Y^m) = -\int\!\!\int p_{X^n,Y^m}(x,y) \log p_{X^n \mid Y^m}(x \mid y)\, dx\, dy = -\mathbb{E}\left[\log p_{X^n \mid Y^m}(X^n \mid Y^m)\right]$$
Divergence and mutual information

Definition 20. Let $p_X$ and $q_X$ be pdfs defined on $\mathbb{R}$. The divergence (Kullback-Leibler distance) between $p_X$ and $q_X$ is defined as

$$D(p_X \| q_X) = \int p_X(x) \log\frac{p_X(x)}{q_X(x)}\, dx$$

Notice that $D(p_X \| q_X) < \infty$ only if $\mathrm{supp}(p_X) \subseteq \mathrm{supp}(q_X)$.

Definition 21. The mutual information of $X, Y \sim p_{X,Y}$ is defined by

$$I(X;Y) = \int p_{X,Y}(x,y) \log\frac{p_{X,Y}(x,y)}{p_X(x)p_Y(y)}\, dx\, dy$$
Relations (similar to the discrete case)

We have the following expressions:

$$\begin{aligned}
I(X;Y) &= \mathbb{E}\left[\log\frac{p_{X,Y}(X,Y)}{p_X(X)p_Y(Y)}\right] \\
&= \mathbb{E}\left[\log\frac{p_{X\mid Y}(X \mid Y)}{p_X(X)}\right] = h(X) - h(X \mid Y) \\
&= \mathbb{E}\left[\log\frac{p_{Y\mid X}(Y \mid X)}{p_Y(Y)}\right] = h(Y) - h(Y \mid X) \\
&= D\left(p_{Y\mid X} \,\|\, p_Y \mid p_X\right) \\
&= D\left(p_{Y\mid X} \,\|\, q_Y \mid p_X\right) - D(p_Y \| q_Y)
\end{aligned}$$

where the last line holds for any pdf $q_Y$ for which the divergences are finite.
Properties (1)

Non-negativity and convexity of $D(p_X \| q_X)$ hold verbatim as in the discrete case.

$I(X;Y) \geq 0$, therefore $h(Y \mid X) \leq h(Y)$ (conditioning reduces differential entropy).

The chain rule for entropy holds for differential entropy as well:

$$h(X^n) = \sum_{i=1}^n h(X_i \mid X^{i-1})$$

Independence upper bound:

$$h(X^n) \leq \sum_{i=1}^n h(X_i)$$
Properties (2)

As an application of the independence bound to Gaussian differential entropies, we get the Hadamard inequality:

$$\frac{1}{2}\log\left((2\pi e)^n |\Sigma|\right) \leq \sum_{i=1}^n \frac{1}{2}\log\left(2\pi e\, \Sigma_{i,i}\right) \;\;\Rightarrow\;\; |\Sigma| \leq \prod_{i=1}^n \Sigma_{i,i}$$

The chain rule for mutual information holds as well:

$$I(X;Y^n) = \sum_{i=1}^n I(X;Y_i \mid Y^{i-1})$$

where the conditional mutual information is $I(X;Y \mid Z) = h(Y \mid Z) - h(Y \mid X, Z)$.
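A quick numerical illustration of the Hadamard inequality (an added sketch, not part of the slides): for randomly generated positive semidefinite matrices, the determinant never exceeds the product of the diagonal entries.

```python
import numpy as np

# Numerical check of Hadamard's inequality det(Sigma) <= prod_i Sigma_ii
# for randomly generated covariance (positive semidefinite) matrices.
rng = np.random.default_rng(1)
for _ in range(5):
    A = rng.standard_normal((4, 4))
    Sigma = A @ A.T                                          # PSD by construction
    print(np.linalg.det(Sigma) <= np.prod(np.diag(Sigma)))   # always True
```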
Extremal property of differential entropy

Theorem 15. Differential entropy maximizer: Let $X^n$ be a real random vector with mean zero and covariance matrix $\mathbb{E}[X^n(X^n)^T] = \mathrm{Cov}(X^n) = \Sigma$. Then

$$h(X^n) \leq \frac{1}{2}\log\left((2\pi e)^n |\Sigma|\right)$$

Let $X^n$ be a complex random vector with mean zero and covariance matrix $\mathbb{E}[X^n(X^n)^H] = \mathrm{Cov}(X^n) = \Sigma$. Then

$$h(X^n) \leq \log\left((\pi e)^n |\Sigma|\right)$$
Proof: Let $g(x)$ denote the Gaussian pdf with mean zero and covariance matrix $\Sigma$. Then

$$\begin{aligned}
0 \leq D(p_{X^n} \| g) &= \int p_{X^n}(x) \log\frac{p_{X^n}(x)}{g(x)}\, dx \\
&= -h(X^n) - \int p_{X^n}(x) \log g(x)\, dx \\
&= -h(X^n) - \int g(x) \log g(x)\, dx \\
&= -h(X^n) + h(g) \\
&= -h(X^n) + \frac{1}{2}\log\left((2\pi e)^n |\Sigma|\right)
\end{aligned}$$

where the third equality holds because $\log g(x)$ is a quadratic form in $x$, so its expectation depends only on the second-order moments, which are the same under $p_{X^n}$ and $g$.
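A scalar illustration of the theorem (an added sketch, not from the slides): a Laplace density and a Gaussian density with the same variance are compared, and the Gaussian always has the larger differential entropy. The scale parameter is an arbitrary choice.

```python
import numpy as np
from scipy.stats import laplace, norm

# Scalar illustration of Theorem 15: among densities with a given variance,
# the Gaussian has the largest differential entropy (in nats).
b = 1.3
var = 2 * b**2                                  # variance of Laplace(scale=b)
h_laplace = laplace(scale=b).entropy()          # 1 + log(2b)
h_gauss = norm(scale=np.sqrt(var)).entropy()    # 0.5*log(2*pi*e*var)
print(h_laplace, h_gauss, h_laplace <= h_gauss)  # True
```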
Some consequences

Theorem 16. Differential entropy and MMSE estimation: Let $X$ be a continuous random variable and $\hat{X}$ an estimator of $X$. Then

$$\mathbb{E}\left[|X - \hat{X}|^2\right] \geq \frac{1}{2\pi e}\, e^{2h(X)}$$

with equality if and only if $X$ is Gaussian and $\hat{X} = \mathbb{E}[X]$.

Proof: We have

$$\mathbb{E}\left[|X - \hat{X}|^2\right] \geq \min_{\hat{X}} \mathbb{E}\left[|X - \hat{X}|^2\right] = \mathbb{E}\left[|X - \mathbb{E}[X]|^2\right] = \mathrm{Var}(X) \geq \frac{1}{2\pi e}\, e^{2h(X)}$$
where we have used the fact that the Minimum Mean-Square Error (MMSE) estimator of $X$ (in the absence of observations) is its mean, and that $\mathrm{Var}(X) \geq \frac{1}{2\pi e}e^{2h(X)}$ by the Gaussian maximum-entropy property (Theorem 15).

Corollary 8. Let $X$ and $Y$ be jointly distributed continuous random variables. For any estimator function $\hat{x}(Y)$ of $X$ from $Y$, we have

$$\mathbb{E}\left[|X - \hat{x}(Y)|^2\right] \geq \frac{1}{2\pi e}\, e^{2h(X \mid Y)}$$

(Proof: HOMEWORK)
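A scalar numerical check of Theorem 16 (an added sketch, not in the original slides) for a non-Gaussian variable: with $X$ uniform on $[0,a]$ we have $\mathrm{Var}(X) = a^2/12$ and $h(X) = \ln a$ nats, so the bound $\frac{1}{2\pi e}e^{2h(X)} = a^2/(2\pi e)$ holds with strict inequality. The value of $a$ is arbitrary.

```python
import numpy as np

# X ~ Uniform[0, a]: Var(X) = a^2/12, h(X) = log(a) nats.
# The MMSE lower bound is exp(2*h(X)) / (2*pi*e) = a^2 / (2*pi*e).
a = 2.0
var_x = a**2 / 12
h_x = np.log(a)
bound = np.exp(2 * h_x) / (2 * np.pi * np.e)
print(var_x, bound, var_x >= bound)  # True, with strict inequality (X is not Gaussian)
```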
Corollary 9. Let $X^n$ and $Y^m$ be jointly distributed continuous random vectors, and let $\hat{x}(Y^m) = \mathbb{E}[X^n \mid Y^m]$ and $E^n = X^n - \hat{x}(Y^m)$ denote the MMSE estimator of $X^n$ given $Y^m$ and the corresponding estimation error sequence. Let $\Sigma_e = \mathrm{Cov}(X^n - \hat{x}(Y^m))$ denote the covariance matrix of the MMSE error. Then

$$h(X^n \mid Y^m) \leq \frac{1}{2}\log\left((2\pi e)^n |\Sigma_e|\right)$$

with equality iff $X^n$ and $Y^m$ are jointly Gaussian. (Proof: HOMEWORK)
MMSE estimation for jointly Gaussian RVs

When $X^n, Y^m$ are jointly zero-mean Gaussian (sequences are considered as column vectors here), the MMSE estimator is given by

$$\hat{x}(Y^m) = \Sigma_{xy}\Sigma_y^{-1} Y^m, \qquad \Sigma_e = \Sigma_x - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{yx}$$

where $\Sigma_x = \mathrm{Cov}(X^n)$, $\Sigma_y = \mathrm{Cov}(Y^m)$, $\Sigma_{xy} = \mathbb{E}[X^n(Y^m)^T]$ and $\Sigma_{yx} = \Sigma_{xy}^T$.

For two scalar jointly Gaussian zero-mean random variables $X$ and $Y$, we have

$$h(X \mid Y) = \frac{1}{2}\log\left(2\pi e\, \frac{\sigma_x^2\sigma_y^2 - \sigma_{xy}^2}{\sigma_y^2}\right)$$

where $\sigma_x^2, \sigma_y^2$ are the variances and $\sigma_{xy} = \mathbb{E}[XY]$.
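The scalar formulas can be tied together numerically (an illustrative sketch, not from the slides): with an arbitrary valid choice of $\sigma_x^2, \sigma_y^2, \sigma_{xy}$, the MMSE $\sigma_x^2 - \sigma_{xy}^2/\sigma_y^2$ coincides with $\frac{1}{2\pi e}e^{2h(X|Y)}$, i.e. Corollary 9 holds with equality in the jointly Gaussian case.

```python
import numpy as np

# Scalar jointly Gaussian example: MMSE error variance and h(X|Y) in nats.
# The numbers below are an arbitrary valid covariance choice
# (they must satisfy sigma_xy^2 <= sigma_x2 * sigma_y2).
sigma_x2 = 2.0
sigma_y2 = 1.5
sigma_xy = 0.9

mmse = sigma_x2 - sigma_xy**2 / sigma_y2
h_cond = 0.5 * np.log(2 * np.pi * np.e * (sigma_x2 * sigma_y2 - sigma_xy**2) / sigma_y2)
print(mmse, np.exp(2 * h_cond) / (2 * np.pi * np.e))  # both equal the MMSE
```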
From discrete to continuous

Let $\mathcal{X} \subseteq \mathbb{R}$ be the range of $X$ (you can generalize this to random vectors in $\mathbb{R}^n$, of course). A partition $\mathcal{P}$ of $\mathcal{X}$ is a finite collection of disjoint sets $P_i$ such that $\bigcup_i P_i = \mathcal{X}$. We denote by $[X]_{\mathcal{P}}$ the quantization of $X$ by the partition $\mathcal{P}$, such that

$$\mathbb{P}([X]_{\mathcal{P}} = i) = \mathbb{P}(X \in P_i) = \int_{P_i} p_X(x)\, dx$$

For two random variables $X, Y$ with joint pdf $p_{X,Y}$, let $[X]_{\mathcal{P}}$ and $[Y]_{\mathcal{Q}}$ denote their discretized versions and define the DMC transition probability

$$P_{[Y]_{\mathcal{Q}} \mid [X]_{\mathcal{P}}}(s \mid r) = \mathbb{P}(Y \in Q_s \mid X \in P_r) = \frac{\int_{P_r}\int_{Q_s} p_{X,Y}(x,y)\, dy\, dx}{\int_{P_r} p_X(x)\, dx}$$
Continuous memoryless channels

As an application of the data processing inequality we have

$$I(X;Y) = \sup_{\mathcal{P},\mathcal{Q}} I\left([X]_{\mathcal{P}}; [Y]_{\mathcal{Q}}\right)$$

The proof of the channel coding theorem for continuous-input, continuous-output, discrete-time channels with transition density $p_{Y \mid X}(y \mid x)$ follows by defining a suitable sequence of discretized inputs (input partitions) and discretized outputs (output partitions), and taking the sup over all possible finite partitions.

This step is standard and applies to any channel with a well-behaved transition probability density. Hence, from now on, we will just use the channel coding theorem without further questions.
Gaussian channels

(Block diagram: input $X$, additive noise $Z$, output $Y$.)

Discrete-time, continuous-alphabet channel: $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, $Z \sim \mathcal{N}(0, N_0/2)$.

Input-output relation: $Y = x + Z$.

Average transmit power constraint:

$$\frac{1}{n}\sum_{i=1}^n |x_i(m)|^2 \leq E_s, \qquad \forall\, m = 1, \dots, 2^{nR}$$

Codewords are points in $\mathbb{R}^n$ contained in a sphere of radius $\sqrt{nE_s}$.
Capacity of the Gaussian channel

Theorem 17. The capacity of the AWGN channel with received signal-to-noise ratio $\mathsf{SNR} = 2E_s/N_0$ is given by

$$C(\mathsf{SNR}) = \frac{1}{2}\log(1 + \mathsf{SNR})$$

Proof: From the channel capacity formula with input cost, we have

$$C(E_s) = \sup_{p_X : \mathbb{E}[|X|^2] \leq E_s} I(X;Y)$$
Then:

$$\begin{aligned}
C(E_s) &= \sup_{p_X : \mathbb{E}[|X|^2] \leq E_s} h(Y) - h(Y \mid X) \\
&= \sup_{p_X : \mathbb{E}[|X|^2] \leq E_s} h(Y) - h(X + Z \mid X) \\
&= \sup_{p_X : \mathbb{E}[|X|^2] \leq E_s} h(Y) - h(Z) \\
&= \sup_{p_X : \mathbb{E}[|X|^2] \leq E_s} h(X + Z) - \frac{1}{2}\log(\pi e N_0) \\
&= \frac{1}{2}\log\left(2\pi e (N_0/2 + E_s)\right) - \frac{1}{2}\log(\pi e N_0) \\
&= \frac{1}{2}\log(1 + 2E_s/N_0)
\end{aligned}$$

where the maximum is achieved by letting $X \sim \mathcal{N}(0, E_s)$.
For the converse, we consider a sequence of codes $\{\mathcal{C}_n\}$ of rate $R$, with $P_e^{(n)} \to 0$ as $n \to \infty$, and the relaxed (average) input cost constraint

$$2^{-nR}\sum_{m=1}^{2^{nR}} \frac{1}{n}\sum_{i=1}^n |x_i(m)|^2 \leq E_s$$

Using the Fano and data processing inequalities, we have

$$\begin{aligned}
nR = H(M) &\leq I(X^n; Y^n) + n\epsilon_n \\
&= h(Y^n) - h(Z^n) + n\epsilon_n \\
&= h(Y^n) - \frac{n}{2}\log(\pi e N_0) + n\epsilon_n \\
&\leq \sum_{i=1}^n \frac{1}{2}\log\left(2\pi e (N_0/2 + \mathrm{Var}(X_i))\right) - \frac{n}{2}\log(\pi e N_0) + n\epsilon_n \\
&\leq \frac{n}{2}\log\left(2\pi e \left(N_0/2 + \frac{1}{n}\sum_{i=1}^n \mathrm{Var}(X_i)\right)\right) - \frac{n}{2}\log(\pi e N_0) + n\epsilon_n \\
&\leq \frac{n}{2}\log\left(2\pi e (N_0/2 + E_s)\right) - \frac{n}{2}\log(\pi e N_0) + n\epsilon_n
\end{aligned}$$

where the second-to-last step follows from the concavity of the logarithm (Jensen's inequality) and the last step from the average power constraint.
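For reference, a few numerical values of the capacity formula (an added sketch, not part of the slides); the SNR values are arbitrary.

```python
import numpy as np

# AWGN capacity C(SNR) = 0.5*log2(1 + SNR), in bits per (real) channel use.
def awgn_capacity_bits(snr_linear):
    return 0.5 * np.log2(1.0 + snr_linear)

for snr_db in [0, 10, 20, 30]:
    snr = 10 ** (snr_db / 10)
    print(snr_db, "dB ->", awgn_capacity_bits(snr), "bit/channel use")
```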
Geometric interpretation: sphere packing

The volume of a sphere in $\mathbb{R}^n$ of radius $r$ is given by $V_n r^n$, where $V_n$ is the volume of the unit sphere (known formula, irrelevant in this context).

The received random vector $Y^n = X^n + Z^n$, with high probability (law of large numbers), lies inside a sphere of radius $\sqrt{n(N_0/2 + E_s)}$.

The noise vector $Z^n$, with high probability, lies inside a sphere of radius $\sqrt{nN_0/2}$.

For large $n$, it is possible to pack noise spheres into the typical output sphere with negligible empty space in between:

$$\text{number of spheres} = \frac{V_n\, n^{n/2}(N_0/2 + E_s)^{n/2}}{V_n\, n^{n/2}(N_0/2)^{n/2}} = \left(1 + 2E_s/N_0\right)^{n/2}$$
The resulting rate is

$$R = \frac{1}{n}\log(\text{number of spheres}) = \frac{1}{2}\log(1 + 2E_s/N_0)$$
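The high-probability radii used in the sphere-packing argument can be illustrated by a short Monte Carlo experiment (an added sketch, not from the slides); $n$, $N_0$ and $E_s$ are arbitrary values, and the input is drawn from the Gaussian capacity-achieving distribution.

```python
import numpy as np

# For large n, ||Z^n|| concentrates around sqrt(n*N0/2) and
# ||Y^n|| = ||X^n + Z^n|| concentrates around sqrt(n*(N0/2 + Es)).
rng = np.random.default_rng(2)
n, N0, Es = 10_000, 2.0, 4.0
Z = rng.normal(0.0, np.sqrt(N0 / 2), size=n)
X = rng.normal(0.0, np.sqrt(Es), size=n)              # Gaussian codebook entry
Y = X + Z
print(np.linalg.norm(Z), np.sqrt(n * N0 / 2))         # close
print(np.linalg.norm(Y), np.sqrt(n * (N0 / 2 + Es)))  # close
```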
Continuous-time baseband channels

(Figure: flat noise power spectral density $N_0/2$ over the band $[-W, W]$.)

An informal argument: consider a baseband waveform channel with single-sided bandwidth $W$ Hz.

Shannon's sampling theorem: any (infinite duration) baseband signal can be represented uniquely from its samples, taken at the so-called Nyquist rate of $2W$ samples per second.
Noise power: $N = \frac{N_0}{2}\cdot 2W = N_0 W$

Signal power: $P_x = E_s \cdot 2W$

Signal-to-noise ratio (SNR):

$$\mathsf{SNR} = \frac{P_x}{N} = \frac{2E_s}{N_0}$$

Capacity in bit/s: multiply the capacity in bit/symbol by the number of symbols/s:

$$C(\mathsf{SNR}) = 2W \cdot \frac{1}{2}\log\left(1 + \frac{2E_s}{N_0}\right) = W\log(1 + \mathsf{SNR})$$
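A small numerical example of the bit/s formula (an added sketch, not part of the slides); the bandwidth and SNR values are arbitrary illustrative choices.

```python
import numpy as np

# Capacity in bit/s of an ideal band-limited AWGN channel: C = W * log2(1 + SNR).
W = 1e6                                  # 1 MHz single-sided bandwidth (arbitrary)
for snr_db in [0, 10, 20]:
    snr = 10 ** (snr_db / 10)
    print(snr_db, "dB ->", W * np.log2(1 + snr) / 1e6, "Mbit/s")
```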
Continuous-time passband channels

Consider a passband waveform channel with carrier frequency $f_0 \gg W$ and single-sided bandwidth $W$ Hz centered around $f_0$. The complex baseband equivalent system has single-sided bandwidth $W/2$, complex signal $X$ and complex circularly symmetric noise $Z \sim \mathcal{CN}(0, N_0)$.

Noise power: $N = N_0 W$

Signal power (optimal input $X \sim \mathcal{CN}(0, E_s)$): $P_x = E_s W$
Signal-to-noise ratio (SNR):

$$\mathsf{SNR} = \frac{P_x}{N} = \frac{E_s}{N_0}$$

Repeating the capacity calculation for complex circularly symmetric input and output, we find

$$C(\mathsf{SNR}) = \log(1 + \mathsf{SNR})$$

Capacity in bit/s: multiply the capacity in bit/symbol by the number of symbols/s:

$$C(\mathsf{SNR}) = W\log\left(1 + \frac{E_s}{N_0}\right) = W\log(1 + \mathsf{SNR})$$
Parallel Gaussian channels: power allocation

Consider the parallel complex circularly symmetric Gaussian channel defined by

$$\mathbf{Y} = \mathbf{G}\mathbf{x} + \mathbf{Z}$$

where $\mathbf{x}, \mathbf{Y}, \mathbf{Z} \in \mathbb{C}^K$, $\mathbf{G} = \mathrm{diag}(G_1, \dots, G_K)$ is a diagonal matrix of fixed coefficients, and $\mathbf{Z} \sim \mathcal{CN}(0, N_0\mathbf{I}_K)$.

Notice that the input and output alphabets are the $K$-dimensional vector space $\mathbb{C}^K$, and codewords are sequences of vectors $\mathbf{x}(m) = (\mathbf{x}_1(m), \dots, \mathbf{x}_n(m))$.

Input constraint: the codewords $\mathbf{x}(m)$ must satisfy

$$\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K |x_{k,i}(m)|^2 \leq E_s$$
The capacity is given by

$$\begin{aligned}
C(E_s) &= \max_{p_{\mathbf{X}} : \mathbb{E}[\|\mathbf{X}\|^2] \leq E_s} I(\mathbf{X}; \mathbf{Y}) \\
&= \max_{p_{\mathbf{X}} : \mathbb{E}[\|\mathbf{X}\|^2] \leq E_s} h(\mathbf{Y}) - h(\mathbf{Z}) \\
&= \max_{p_{\mathbf{X}} : \mathbb{E}[\|\mathbf{X}\|^2] \leq E_s} h(\mathbf{Y}) - K\log(\pi e N_0) \\
&= \max_{\sum_{k=1}^K E_{s,k} \leq E_s} \sum_{k=1}^K \log\left(\pi e (N_0 + |G_k|^2 E_{s,k})\right) - K\log(\pi e N_0) \\
&= \max_{\sum_{k=1}^K E_{s,k} \leq E_s} \sum_{k=1}^K \log\left(1 + \frac{|G_k|^2 E_{s,k}}{N_0}\right)
\end{aligned}$$
Power allocation: convex optimization

We must maximize the mutual information with respect to the power allocation over the parallel channels:

$$\begin{aligned}
\text{maximize} \quad & \sum_{k=1}^K \log\left(1 + \frac{|G_k|^2 E_{s,k}}{N_0}\right) \\
\text{subject to} \quad & \sum_{k=1}^K E_{s,k} \leq E_s \\
& E_{s,k} \geq 0, \quad \forall\, k
\end{aligned}$$
The Lagrangian function is

$$\mathcal{L}(\{E_{s,k}\}, \lambda) = \sum_{k=1}^K \log\left(1 + \frac{|G_k|^2 E_{s,k}}{N_0}\right) - \lambda\left(\sum_{k=1}^K E_{s,k} - E_s\right)$$

We impose the KKT conditions

$$\frac{\partial \mathcal{L}}{\partial E_{s,k}} = \frac{|G_k|^2/N_0}{1 + |G_k|^2 E_{s,k}/N_0} - \lambda \leq 0$$

with equality for all $k$ such that $E_{s,k}^\star > 0$. Solving for $E_{s,k}$, we find

$$E_{s,k}^\star = \frac{1}{\lambda} - \frac{N_0}{|G_k|^2}$$
We let $1/\lambda = \nu$, and argue that the KKT conditions are satisfied if we let

$$E_{s,k}^\star = \left[\nu - \frac{N_0}{|G_k|^2}\right]_+$$

In fact, if $\nu > \frac{N_0}{|G_k|^2}$ then $E_{s,k}^\star > 0$ and $\frac{\partial \mathcal{L}}{\partial E_{s,k}} = 0$. Otherwise, if $\nu \leq \frac{N_0}{|G_k|^2}$, then $E_{s,k}^\star = 0$ and $\frac{\partial \mathcal{L}}{\partial E_{s,k}} \leq 0$.

This optimal power allocation is called waterfilling. Intuitively, we spend more power on the better channels (the ones with higher gain); the water level $\nu$ is chosen so that the total power constraint is met with equality. Replacing the waterfilling solution into the mutual information expression, we obtain

$$C(E_s) = \sum_{k=1}^K \left[\log\left(\frac{|G_k|^2\,\nu}{N_0}\right)\right]_+$$
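The waterfilling allocation is easy to compute numerically by searching for the water level $\nu$ that spends the whole budget $E_s$. The sketch below (not part of the slides, and only one of several possible implementations) uses simple bisection; the gains, $N_0$ and $E_s$ are arbitrary illustrative values.

```python
import numpy as np

def waterfilling(gains, N0, Es, iters=100):
    """Return E_k = [nu - N0/|G_k|^2]_+ with nu chosen so that sum_k E_k = Es."""
    inv = N0 / np.abs(gains) ** 2            # "floor" level of each channel
    lo, hi = 0.0, np.max(inv) + Es           # the water level nu lies in [lo, hi]
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        power = np.maximum(nu - inv, 0.0)
        if power.sum() > Es:                 # too much water: lower the level
            hi = nu
        else:                                # budget not exhausted: raise the level
            lo = nu
    return np.maximum(0.5 * (lo + hi) - inv, 0.0)

G = np.array([1.0, 0.8, 0.3, 0.1])           # arbitrary channel gains
N0, Es = 1.0, 4.0
E = waterfilling(G, N0, Es)
C = np.sum(np.log2(1 + np.abs(G) ** 2 * E / N0))   # capacity in bit/channel use
print(E, E.sum(), C)
```

As expected, the allocation pours most of the power into the strongest subchannels and may leave the weakest ones unused when $E_s$ is small.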
End of Lecture 6