Lecture 6: Gaussian Channels
Differential entropy (1)

Definition 18. The (joint) differential entropy of a continuous random vector $X^n \sim p_{X^n}(x)$ over $\mathbb{R}^n$ is

$$h(X^n) = -\int p_{X^n}(x) \log p_{X^n}(x)\, dx = -\mathbb{E}\left[\log p_{X^n}(X^n)\right]$$

$h(X^n)$ is a concave function of the pdf $p_{X^n}$, but it is not necessarily nonnegative. Sometimes we write $h(p_X)$ when we wish to stress that $h(\cdot)$ is a function of the pdf $p_X$.

Translation: $h(X^n + a) = h(X^n)$.

Scaling: $h(\mathbf{A}X^n) = h(X^n) + \log|\det \mathbf{A}|$.
Differential entropy (2)

Example: uniform distribution: $p_X(x) = \frac{1}{a}\,\mathbb{1}\{x \in [0,a]\}$. Then we have

$$h(X) = \int_0^a \frac{1}{a}\log a \, dx = \log a$$

More generally, for a uniform distribution over a compact domain $D \subset \mathbb{R}^n$ with volume $\mathrm{Vol}(D)$ we have

$$h(X^n) = \log \mathrm{Vol}(D)$$
Gaussian distribution

The standard normal distribution $X \sim \mathcal{N}(0,1)$ has density

$$p_X(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

The real multivariate Gaussian distribution $X^n \sim \mathcal{N}(\mu, \Sigma)$ has density

$$p_{X^n}(x) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)$$
Real Gaussian random vectors

We notice that, for Gaussian densities, $\log p_{X^n}(x)$ is a quadratic function of $x$. E.g., for the real case we have

$$\log p_{X^n}(x) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\,\log e$$

Since $h(X^n)$ is invariant under translations, it must be independent of $\mu$. Hence, there is no loss of generality in assuming $\mu = 0$. Taking expectation and changing sign, and noticing that $x^T\Sigma^{-1}x = \mathrm{tr}(xx^T\Sigma^{-1})$, we have

$$h(X^n) = \frac{n}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| + \frac{1}{2}\mathrm{tr}\left(\mathbb{E}[XX^T]\Sigma^{-1}\right)\log e = \frac{1}{2}\log\left((2\pi e)^n |\Sigma|\right)$$

where the last equality uses $\mathrm{tr}\left(\mathbb{E}[XX^T]\Sigma^{-1}\right) = \mathrm{tr}(\mathbf{I}_n) = n$.
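The closed-form expression can be checked numerically. The following sketch (not part of the original slides) compares the formula $\frac{1}{2}\log\left((2\pi e)^n|\Sigma|\right)$, in nats, against SciPy's built-in entropy of a multivariate normal; the covariance matrix used is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Check h(X^n) = 0.5 * log((2*pi*e)^n * det(Sigma)) in nats against SciPy.
# Sigma is an arbitrary symmetric, positive-definite covariance matrix.
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
n = Sigma.shape[0]

h_formula = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(Sigma))
h_scipy = multivariate_normal(mean=np.zeros(n), cov=Sigma).entropy()
print(h_formula, h_scipy)  # the two values agree
```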
Complex Gaussian random vectors

The complex circularly symmetric Gaussian distribution $X^n \sim \mathcal{CN}(\mu, \Sigma)$, where $X^n = X_R^n + jX_I^n$ with $X_R^n \sim \mathcal{N}(\mathrm{Re}\{\mu\}, \frac{1}{2}\Sigma)$, $X_I^n \sim \mathcal{N}(\mathrm{Im}\{\mu\}, \frac{1}{2}\Sigma)$, and $X_R^n, X_I^n$ statistically independent, has density

$$p_{X^n}(x) = \frac{1}{\pi^n |\Sigma|} \exp\left(-(x-\mu)^H \Sigma^{-1}(x-\mu)\right)$$

Repeating the arguments of before, for the complex circularly symmetric case we have

$$h(X^n) = \log\left((\pi e)^n |\Sigma|\right)$$
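As a sanity check (an illustrative sketch, not from the slides), the complex formula can be recovered from the real one by writing $X^n = X_R^n + jX_I^n$ with independent parts of covariance $\Sigma/2$ each, exactly as in the definition above; the covariance matrix below is an arbitrary real-valued example.

```python
import numpy as np

# Check h(X^n) = log((pi*e)^n * det(Sigma)) for X^n ~ CN(0, Sigma), in nats,
# using the construction X = X_R + j*X_I with X_R, X_I independent,
# each ~ N(0, Sigma/2).  Sigma is an arbitrary real covariance matrix.
Sigma = np.array([[1.5, 0.4],
                  [0.4, 0.8]])
n = Sigma.shape[0]

def h_real_gaussian(cov):
    """Differential entropy (nats) of a real Gaussian vector with covariance 'cov'."""
    k = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** k * np.linalg.det(cov))

h_complex = 2 * h_real_gaussian(Sigma / 2)          # h(X_R) + h(X_I) by independence
h_formula = np.log((np.pi * np.e) ** n * np.linalg.det(Sigma))
print(h_complex, h_formula)  # the two values agree
```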
Conditional differential entropy

Definition 19. For two jointly distributed continuous random vectors $X^n, Y^m$ over $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively, with joint pdf $p_{X^n,Y^m}(x,y)$, the conditional differential entropy of $X^n$ given $Y^m$ is

$$h(X^n \mid Y^m) = -\int\!\!\int p_{X^n,Y^m}(x,y) \log p_{X^n \mid Y^m}(x \mid y)\, dx\, dy = -\mathbb{E}\left[\log p_{X^n \mid Y^m}(X^n \mid Y^m)\right]$$
Divergence and mutual information

Definition 20. Let $p_X$ and $q_X$ be pdfs defined on $\mathbb{R}$. The divergence (Kullback-Leibler distance) between $p_X$ and $q_X$ is defined as

$$D(p_X \| q_X) = \int p_X(x) \log\frac{p_X(x)}{q_X(x)}\, dx$$

Notice that $D(p_X \| q_X) < \infty$ only if $\mathrm{supp}(p_X) \subseteq \mathrm{supp}(q_X)$.

Definition 21. The mutual information of $X, Y \sim p_{X,Y}$ is defined by

$$I(X;Y) = \int p_{X,Y}(x,y) \log\frac{p_{X,Y}(x,y)}{p_X(x)p_Y(y)}\, dx\, dy$$
Relations (similar to the discrete case)

We have the following expressions:

$$\begin{aligned}
I(X;Y) &= \mathbb{E}\left[\log\frac{p_{X,Y}(X,Y)}{p_X(X)p_Y(Y)}\right] \\
&= \mathbb{E}\left[\log\frac{p_{X\mid Y}(X \mid Y)}{p_X(X)}\right] = h(X) - h(X \mid Y) \\
&= \mathbb{E}\left[\log\frac{p_{Y\mid X}(Y \mid X)}{p_Y(Y)}\right] = h(Y) - h(Y \mid X) \\
&= D\left(p_{Y\mid X} \,\|\, p_Y \mid p_X\right) \\
&= D\left(p_{Y\mid X} \,\|\, q_Y \mid p_X\right) - D(p_Y \| q_Y)
\end{aligned}$$

where the last line holds for any pdf $q_Y$ for which the divergences are finite.
Properties (1)

Non-negativity and convexity of $D(p_X \| q_X)$ hold verbatim as in the discrete case.

$I(X;Y) \geq 0$, therefore $h(Y \mid X) \leq h(Y)$ (conditioning reduces differential entropy).

The chain rule for entropy holds for differential entropy as well:

$$h(X^n) = \sum_{i=1}^n h(X_i \mid X^{i-1})$$

Independence upper bound:

$$h(X^n) \leq \sum_{i=1}^n h(X_i)$$
Properties (2)

As an application of the independence bound to Gaussian differential entropies, we get the Hadamard inequality:

$$\frac{1}{2}\log\left((2\pi e)^n |\Sigma|\right) \leq \sum_{i=1}^n \frac{1}{2}\log\left(2\pi e\, \Sigma_{i,i}\right) \;\;\Rightarrow\;\; |\Sigma| \leq \prod_{i=1}^n \Sigma_{i,i}$$

The chain rule for mutual information holds as well:

$$I(X;Y^n) = \sum_{i=1}^n I(X;Y_i \mid Y^{i-1})$$

where the conditional mutual information is $I(X;Y \mid Z) = h(Y \mid Z) - h(Y \mid X, Z)$.
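A quick numerical illustration of the Hadamard inequality (an added sketch, not part of the slides): for randomly generated positive semidefinite matrices, the determinant never exceeds the product of the diagonal entries.

```python
import numpy as np

# Numerical check of Hadamard's inequality det(Sigma) <= prod_i Sigma_ii
# for randomly generated covariance (positive semidefinite) matrices.
rng = np.random.default_rng(1)
for _ in range(5):
    A = rng.standard_normal((4, 4))
    Sigma = A @ A.T                                          # PSD by construction
    print(np.linalg.det(Sigma) <= np.prod(np.diag(Sigma)))   # always True
```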
Extremal property of differential entropy

Theorem 15. Differential entropy maximizer: Let $X^n$ be a real random vector with mean zero and covariance matrix $\mathbb{E}[X^n(X^n)^T] = \mathrm{Cov}(X^n) = \Sigma$. Then

$$h(X^n) \leq \frac{1}{2}\log\left((2\pi e)^n |\Sigma|\right)$$

Let $X^n$ be a complex random vector with mean zero and covariance matrix $\mathbb{E}[X^n(X^n)^H] = \mathrm{Cov}(X^n) = \Sigma$. Then

$$h(X^n) \leq \log\left((\pi e)^n |\Sigma|\right)$$
Proof: Let $g(x)$ denote the Gaussian pdf with mean zero and covariance matrix $\Sigma$. Then

$$\begin{aligned}
0 \leq D(p_{X^n} \| g) &= \int p_{X^n}(x) \log\frac{p_{X^n}(x)}{g(x)}\, dx \\
&= -h(X^n) - \int p_{X^n}(x) \log g(x)\, dx \\
&= -h(X^n) - \int g(x) \log g(x)\, dx \\
&= -h(X^n) + h(g) \\
&= -h(X^n) + \frac{1}{2}\log\left((2\pi e)^n |\Sigma|\right)
\end{aligned}$$

where the third equality holds because $\log g(x)$ is a quadratic form in $x$, so its expectation depends only on the second-order moments, which are the same under $p_{X^n}$ and $g$.
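A scalar illustration of the theorem (an added sketch, not from the slides): a Laplace density and a Gaussian density with the same variance are compared, and the Gaussian always has the larger differential entropy. The scale parameter is an arbitrary choice.

```python
import numpy as np
from scipy.stats import laplace, norm

# Scalar illustration of Theorem 15: among densities with a given variance,
# the Gaussian has the largest differential entropy (in nats).
b = 1.3
var = 2 * b**2                                  # variance of Laplace(scale=b)
h_laplace = laplace(scale=b).entropy()          # 1 + log(2b)
h_gauss = norm(scale=np.sqrt(var)).entropy()    # 0.5*log(2*pi*e*var)
print(h_laplace, h_gauss, h_laplace <= h_gauss)  # True
```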
Some consequences

Theorem 16. Differential entropy and MMSE estimation: Let $X$ be a continuous random variable and $\hat{X}$ an estimator of $X$. Then

$$\mathbb{E}\left[|X - \hat{X}|^2\right] \geq \frac{1}{2\pi e}\, e^{2h(X)}$$

with equality if and only if $X$ is Gaussian and $\hat{X} = \mathbb{E}[X]$.

Proof: We have

$$\mathbb{E}\left[|X - \hat{X}|^2\right] \geq \min_{\hat{X}} \mathbb{E}\left[|X - \hat{X}|^2\right] = \mathbb{E}\left[|X - \mathbb{E}[X]|^2\right] = \mathrm{Var}(X) \geq \frac{1}{2\pi e}\, e^{2h(X)}$$
where we have used the fact that the Minimum Mean-Square Error (MMSE) estimator of $X$ (in the absence of observations) is its mean, and that $\mathrm{Var}(X) \geq \frac{1}{2\pi e}e^{2h(X)}$ by the Gaussian maximum-entropy property (Theorem 15).

Corollary 8. Let $X$ and $Y$ be jointly distributed continuous random variables. For any estimator function $\hat{x}(Y)$ of $X$ from $Y$, we have

$$\mathbb{E}\left[|X - \hat{x}(Y)|^2\right] \geq \frac{1}{2\pi e}\, e^{2h(X \mid Y)}$$

(Proof: HOMEWORK)
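A scalar numerical check of Theorem 16 (an added sketch, not in the original slides) for a non-Gaussian variable: with $X$ uniform on $[0,a]$ we have $\mathrm{Var}(X) = a^2/12$ and $h(X) = \ln a$ nats, so the bound $\frac{1}{2\pi e}e^{2h(X)} = a^2/(2\pi e)$ holds with strict inequality. The value of $a$ is arbitrary.

```python
import numpy as np

# X ~ Uniform[0, a]: Var(X) = a^2/12, h(X) = log(a) nats.
# The MMSE lower bound is exp(2*h(X)) / (2*pi*e) = a^2 / (2*pi*e).
a = 2.0
var_x = a**2 / 12
h_x = np.log(a)
bound = np.exp(2 * h_x) / (2 * np.pi * np.e)
print(var_x, bound, var_x >= bound)  # True, with strict inequality (X is not Gaussian)
```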
Corollary 9. Let $X^n$ and $Y^m$ be jointly distributed continuous random vectors, and let $\hat{x}(Y^m) = \mathbb{E}[X^n \mid Y^m]$ and $E^n = X^n - \hat{x}(Y^m)$ denote the MMSE estimator of $X^n$ given $Y^m$ and the corresponding estimation error sequence. Let $\Sigma_e = \mathrm{Cov}(X^n - \hat{x}(Y^m))$ denote the covariance matrix of the MMSE error. Then

$$h(X^n \mid Y^m) \leq \frac{1}{2}\log\left((2\pi e)^n |\Sigma_e|\right)$$

with equality iff $X^n$ and $Y^m$ are jointly Gaussian. (Proof: HOMEWORK)
MMSE estimation for jointly Gaussian RVs

When $X^n, Y^m$ are jointly zero-mean Gaussian (sequences are considered as column vectors here), the MMSE estimator is given by

$$\hat{x}(Y^m) = \Sigma_{xy}\Sigma_y^{-1} Y^m, \qquad \Sigma_e = \Sigma_x - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{yx}$$

where $\Sigma_x = \mathrm{Cov}(X^n)$, $\Sigma_y = \mathrm{Cov}(Y^m)$, $\Sigma_{xy} = \mathbb{E}[X^n(Y^m)^T]$ and $\Sigma_{yx} = \Sigma_{xy}^T$.

For two scalar jointly Gaussian zero-mean random variables $X$ and $Y$, we have

$$h(X \mid Y) = \frac{1}{2}\log\left(2\pi e\, \frac{\sigma_x^2\sigma_y^2 - \sigma_{xy}^2}{\sigma_y^2}\right)$$

where $\sigma_x^2, \sigma_y^2$ are the variances and $\sigma_{xy} = \mathbb{E}[XY]$.
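The scalar formulas can be tied together numerically (an illustrative sketch, not from the slides): with an arbitrary valid choice of $\sigma_x^2, \sigma_y^2, \sigma_{xy}$, the MMSE $\sigma_x^2 - \sigma_{xy}^2/\sigma_y^2$ coincides with $\frac{1}{2\pi e}e^{2h(X|Y)}$, i.e. Corollary 9 holds with equality in the jointly Gaussian case.

```python
import numpy as np

# Scalar jointly Gaussian example: MMSE error variance and h(X|Y) in nats.
# The numbers below are an arbitrary valid covariance choice
# (they must satisfy sigma_xy^2 <= sigma_x2 * sigma_y2).
sigma_x2 = 2.0
sigma_y2 = 1.5
sigma_xy = 0.9

mmse = sigma_x2 - sigma_xy**2 / sigma_y2
h_cond = 0.5 * np.log(2 * np.pi * np.e * (sigma_x2 * sigma_y2 - sigma_xy**2) / sigma_y2)
print(mmse, np.exp(2 * h_cond) / (2 * np.pi * np.e))  # both equal the MMSE
```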
From discrete to continuous

Let $\mathcal{X} \subseteq \mathbb{R}$ be the range of $X$ (you can generalize this to random vectors in $\mathbb{R}^n$, of course). A partition $\mathcal{P}$ of $\mathcal{X}$ is a finite collection of disjoint sets $P_i$ such that $\bigcup_i P_i = \mathcal{X}$. We denote by $[X]_{\mathcal{P}}$ the quantization of $X$ by the partition $\mathcal{P}$, such that

$$\mathbb{P}([X]_{\mathcal{P}} = i) = \mathbb{P}(X \in P_i) = \int_{P_i} p_X(x)\, dx$$

For two random variables $X, Y$ with joint pdf $p_{X,Y}$, let $[X]_{\mathcal{P}}$ and $[Y]_{\mathcal{Q}}$ denote their discretized versions and define the DMC transition probability

$$P_{[Y]_{\mathcal{Q}} \mid [X]_{\mathcal{P}}}(s \mid r) = \mathbb{P}(Y \in Q_s \mid X \in P_r) = \frac{\int_{P_r}\int_{Q_s} p_{X,Y}(x,y)\, dy\, dx}{\int_{P_r} p_X(x)\, dx}$$
Continuous memoryless channels

As an application of the data processing inequality we have

$$I(X;Y) = \sup_{\mathcal{P},\mathcal{Q}} I\left([X]_{\mathcal{P}}; [Y]_{\mathcal{Q}}\right)$$

The proof of the channel coding theorem for continuous-input, continuous-output, discrete-time channels with transition density $p_{Y \mid X}(y \mid x)$ follows by defining a suitable sequence of discretized inputs (input partitions) and discretized outputs (output partitions), and taking the sup over all possible finite partitions.

This step is standard and applies to any channel with a well-behaved transition probability density. Hence, from now on, we will just use the channel coding theorem without further questions.
Gaussian channels

(Block diagram: input $X$, additive noise $Z$, output $Y$.)

Discrete-time, continuous-alphabet channel: $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, $Z \sim \mathcal{N}(0, N_0/2)$.

Input-output relation: $Y = x + Z$.

Average transmit power constraint:

$$\frac{1}{n}\sum_{i=1}^n |x_i(m)|^2 \leq E_s, \qquad \forall\, m = 1, \dots, 2^{nR}$$

Codewords are points in $\mathbb{R}^n$ contained in a sphere of radius $\sqrt{nE_s}$.
Capacity of the Gaussian channel

Theorem 17. The capacity of the AWGN channel with received signal-to-noise ratio $\mathsf{SNR} = 2E_s/N_0$ is given by

$$C(\mathsf{SNR}) = \frac{1}{2}\log(1 + \mathsf{SNR})$$

Proof: From the channel capacity formula with input cost, we have

$$C(E_s) = \sup_{p_X : \mathbb{E}[|X|^2] \leq E_s} I(X;Y)$$
Then:

$$\begin{aligned}
C(E_s) &= \sup_{p_X : \mathbb{E}[|X|^2] \leq E_s} h(Y) - h(Y \mid X) \\
&= \sup_{p_X : \mathbb{E}[|X|^2] \leq E_s} h(Y) - h(X + Z \mid X) \\
&= \sup_{p_X : \mathbb{E}[|X|^2] \leq E_s} h(Y) - h(Z) \\
&= \sup_{p_X : \mathbb{E}[|X|^2] \leq E_s} h(X + Z) - \frac{1}{2}\log(\pi e N_0) \\
&= \frac{1}{2}\log\left(2\pi e (N_0/2 + E_s)\right) - \frac{1}{2}\log(\pi e N_0) \\
&= \frac{1}{2}\log(1 + 2E_s/N_0)
\end{aligned}$$

where the maximum is achieved by letting $X \sim \mathcal{N}(0, E_s)$.
For the converse, we consider a sequence of codes $\{\mathcal{C}_n\}$ of rate $R$, with $P_e^{(n)} \to 0$ as $n \to \infty$, and the relaxed (average) input cost constraint

$$2^{-nR}\sum_{m=1}^{2^{nR}} \frac{1}{n}\sum_{i=1}^n |x_i(m)|^2 \leq E_s$$

Using the Fano and data processing inequalities, we have

$$\begin{aligned}
nR = H(M) &\leq I(X^n; Y^n) + n\epsilon_n \\
&= h(Y^n) - h(Z^n) + n\epsilon_n \\
&= h(Y^n) - \frac{n}{2}\log(\pi e N_0) + n\epsilon_n \\
&\leq \sum_{i=1}^n \frac{1}{2}\log\left(2\pi e (N_0/2 + \mathrm{Var}(X_i))\right) - \frac{n}{2}\log(\pi e N_0) + n\epsilon_n \\
&\leq \frac{n}{2}\log\left(2\pi e \left(N_0/2 + \frac{1}{n}\sum_{i=1}^n \mathrm{Var}(X_i)\right)\right) - \frac{n}{2}\log(\pi e N_0) + n\epsilon_n \\
&\leq \frac{n}{2}\log\left(2\pi e (N_0/2 + E_s)\right) - \frac{n}{2}\log(\pi e N_0) + n\epsilon_n
\end{aligned}$$

where the second-to-last step follows from the concavity of the logarithm (Jensen's inequality) and the last step from the average power constraint.
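For reference, a few numerical values of the capacity formula (an added sketch, not part of the slides); the SNR values are arbitrary.

```python
import numpy as np

# AWGN capacity C(SNR) = 0.5*log2(1 + SNR), in bits per (real) channel use.
def awgn_capacity_bits(snr_linear):
    return 0.5 * np.log2(1.0 + snr_linear)

for snr_db in [0, 10, 20, 30]:
    snr = 10 ** (snr_db / 10)
    print(snr_db, "dB ->", awgn_capacity_bits(snr), "bit/channel use")
```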
Geometric interpretation: sphere packing

The volume of a sphere in $\mathbb{R}^n$ of radius $r$ is given by $V_n r^n$, where $V_n$ is the volume of the unit sphere (known formula, irrelevant in this context).

The received random vector $Y^n = X^n + Z^n$, with high probability (law of large numbers), lies inside a sphere of radius $\sqrt{n(N_0/2 + E_s)}$.

The noise vector $Z^n$, with high probability, lies inside a sphere of radius $\sqrt{nN_0/2}$.

For large $n$, it is possible to pack noise spheres into the typical output sphere with negligible empty space in between:

$$\text{number of spheres} = \frac{V_n\, n^{n/2}(N_0/2 + E_s)^{n/2}}{V_n\, n^{n/2}(N_0/2)^{n/2}} = \left(1 + 2E_s/N_0\right)^{n/2}$$
The resulting rate is

$$R = \frac{1}{n}\log(\text{number of spheres}) = \frac{1}{2}\log(1 + 2E_s/N_0)$$
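The high-probability radii used in the sphere-packing argument can be illustrated by a short Monte Carlo experiment (an added sketch, not from the slides); $n$, $N_0$ and $E_s$ are arbitrary values, and the input is drawn from the Gaussian capacity-achieving distribution.

```python
import numpy as np

# For large n, ||Z^n|| concentrates around sqrt(n*N0/2) and
# ||Y^n|| = ||X^n + Z^n|| concentrates around sqrt(n*(N0/2 + Es)).
rng = np.random.default_rng(2)
n, N0, Es = 10_000, 2.0, 4.0
Z = rng.normal(0.0, np.sqrt(N0 / 2), size=n)
X = rng.normal(0.0, np.sqrt(Es), size=n)              # Gaussian codebook entry
Y = X + Z
print(np.linalg.norm(Z), np.sqrt(n * N0 / 2))         # close
print(np.linalg.norm(Y), np.sqrt(n * (N0 / 2 + Es)))  # close
```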
Continuous-time baseband channels

(Figure: flat noise power spectral density $N_0/2$ over the band $[-W, W]$.)

An informal argument: consider a baseband waveform channel with single-sided bandwidth $W$ Hz.

Shannon's sampling theorem: any (infinite duration) baseband signal can be represented uniquely from its samples, taken at the so-called Nyquist rate of $2W$ samples per second.
Noise power: $N = \frac{N_0}{2}\cdot 2W = N_0 W$

Signal power: $P_x = E_s \cdot 2W$

Signal-to-noise ratio (SNR):

$$\mathsf{SNR} = \frac{P_x}{N} = \frac{2E_s}{N_0}$$

Capacity in bit/s: multiply the capacity in bit/symbol by the number of symbols/s:

$$C(\mathsf{SNR}) = 2W \cdot \frac{1}{2}\log\left(1 + \frac{2E_s}{N_0}\right) = W\log(1 + \mathsf{SNR})$$
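A small numerical example of the bit/s formula (an added sketch, not part of the slides); the bandwidth and SNR values are arbitrary illustrative choices.

```python
import numpy as np

# Capacity in bit/s of an ideal band-limited AWGN channel: C = W * log2(1 + SNR).
W = 1e6                                  # 1 MHz single-sided bandwidth (arbitrary)
for snr_db in [0, 10, 20]:
    snr = 10 ** (snr_db / 10)
    print(snr_db, "dB ->", W * np.log2(1 + snr) / 1e6, "Mbit/s")
```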
Continuous-time passband channels

Consider a passband waveform channel with carrier frequency $f_0 \gg W$ and single-sided bandwidth $W$ Hz centered around $f_0$. The complex baseband equivalent system has single-sided bandwidth $W/2$, complex signal $X$ and complex circularly symmetric noise $Z \sim \mathcal{CN}(0, N_0)$.

Noise power: $N = N_0 W$

Signal power (optimal input $X \sim \mathcal{CN}(0, E_s)$): $P_x = E_s W$
Signal-to-noise ratio (SNR):

$$\mathsf{SNR} = \frac{P_x}{N} = \frac{E_s}{N_0}$$

Repeating the capacity calculation for complex circularly symmetric input and output, we find

$$C(\mathsf{SNR}) = \log(1 + \mathsf{SNR})$$

Capacity in bit/s: multiply the capacity in bit/symbol by the number of symbols/s:

$$C(\mathsf{SNR}) = W\log\left(1 + \frac{E_s}{N_0}\right) = W\log(1 + \mathsf{SNR})$$
Parallel Gaussian channels: power allocation

Consider the parallel complex circularly symmetric Gaussian channel defined by

$$\mathbf{Y} = \mathbf{G}\mathbf{x} + \mathbf{Z}$$

where $\mathbf{x}, \mathbf{Y}, \mathbf{Z} \in \mathbb{C}^K$, $\mathbf{G} = \mathrm{diag}(G_1, \dots, G_K)$ is a diagonal matrix of fixed coefficients, and $\mathbf{Z} \sim \mathcal{CN}(0, N_0\mathbf{I}_K)$.

Notice that the input and output alphabets are the $K$-dimensional vector space $\mathbb{C}^K$, and codewords are sequences of vectors $\mathbf{x}(m) = (\mathbf{x}_1(m), \dots, \mathbf{x}_n(m))$.

Input constraint: the codewords $\mathbf{x}(m)$ must satisfy

$$\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K |x_{k,i}(m)|^2 \leq E_s$$
The capacity is given by

$$\begin{aligned}
C(E_s) &= \max_{p_{\mathbf{X}} : \mathbb{E}[\|\mathbf{X}\|^2] \leq E_s} I(\mathbf{X}; \mathbf{Y}) \\
&= \max_{p_{\mathbf{X}} : \mathbb{E}[\|\mathbf{X}\|^2] \leq E_s} h(\mathbf{Y}) - h(\mathbf{Z}) \\
&= \max_{p_{\mathbf{X}} : \mathbb{E}[\|\mathbf{X}\|^2] \leq E_s} h(\mathbf{Y}) - K\log(\pi e N_0) \\
&= \max_{\sum_{k=1}^K E_{s,k} \leq E_s} \sum_{k=1}^K \log\left(\pi e (N_0 + |G_k|^2 E_{s,k})\right) - K\log(\pi e N_0) \\
&= \max_{\sum_{k=1}^K E_{s,k} \leq E_s} \sum_{k=1}^K \log\left(1 + \frac{|G_k|^2 E_{s,k}}{N_0}\right)
\end{aligned}$$
Power allocation: convex optimization

We must maximize the mutual information with respect to the power allocation over the parallel channels:

$$\begin{aligned}
\text{maximize} \quad & \sum_{k=1}^K \log\left(1 + \frac{|G_k|^2 E_{s,k}}{N_0}\right) \\
\text{subject to} \quad & \sum_{k=1}^K E_{s,k} \leq E_s \\
& E_{s,k} \geq 0, \quad \forall\, k
\end{aligned}$$
The Lagrangian function is

$$\mathcal{L}(\{E_{s,k}\}, \lambda) = \sum_{k=1}^K \log\left(1 + \frac{|G_k|^2 E_{s,k}}{N_0}\right) - \lambda\left(\sum_{k=1}^K E_{s,k} - E_s\right)$$

We impose the KKT conditions

$$\frac{\partial \mathcal{L}}{\partial E_{s,k}} = \frac{|G_k|^2/N_0}{1 + |G_k|^2 E_{s,k}/N_0} - \lambda \leq 0$$

with equality for all $k$ such that $E_{s,k}^\star > 0$. Solving for $E_{s,k}$, we find

$$E_{s,k}^\star = \frac{1}{\lambda} - \frac{N_0}{|G_k|^2}$$
We let $1/\lambda = \nu$, and argue that the KKT conditions are satisfied if we let

$$E_{s,k}^\star = \left[\nu - \frac{N_0}{|G_k|^2}\right]_+$$

In fact, if $\nu > \frac{N_0}{|G_k|^2}$ then $E_{s,k}^\star > 0$ and $\frac{\partial \mathcal{L}}{\partial E_{s,k}} = 0$. Otherwise, if $\nu \leq \frac{N_0}{|G_k|^2}$, then $E_{s,k}^\star = 0$ and $\frac{\partial \mathcal{L}}{\partial E_{s,k}} \leq 0$.

This optimal power allocation is called waterfilling. Intuitively, we spend more power on the better channels (the ones with higher gain); the water level $\nu$ is chosen so that the total power constraint is met with equality. Replacing the waterfilling solution into the mutual information expression, we obtain

$$C(E_s) = \sum_{k=1}^K \left[\log\left(\frac{|G_k|^2\,\nu}{N_0}\right)\right]_+$$
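The waterfilling allocation is easy to compute numerically by searching for the water level $\nu$ that spends the whole budget $E_s$. The sketch below (not part of the slides, and only one of several possible implementations) uses simple bisection; the gains, $N_0$ and $E_s$ are arbitrary illustrative values.

```python
import numpy as np

def waterfilling(gains, N0, Es, iters=100):
    """Return E_k = [nu - N0/|G_k|^2]_+ with nu chosen so that sum_k E_k = Es."""
    inv = N0 / np.abs(gains) ** 2            # "floor" level of each channel
    lo, hi = 0.0, np.max(inv) + Es           # the water level nu lies in [lo, hi]
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        power = np.maximum(nu - inv, 0.0)
        if power.sum() > Es:                 # too much water: lower the level
            hi = nu
        else:                                # budget not exhausted: raise the level
            lo = nu
    return np.maximum(0.5 * (lo + hi) - inv, 0.0)

G = np.array([1.0, 0.8, 0.3, 0.1])           # arbitrary channel gains
N0, Es = 1.0, 4.0
E = waterfilling(G, N0, Es)
C = np.sum(np.log2(1 + np.abs(G) ** 2 * E / N0))   # capacity in bit/channel use
print(E, E.sum(), C)
```

As expected, the allocation pours most of the power into the strongest subchannels and may leave the weakest ones unused when $E_s$ is small.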
End of Lecture 6