Hessian and concavity of mutual information, differential entropy, and entropy power in linear vector Gaussian channels

Miquel Payaró and Daniel P. Palomar

Abstract: Within the framework of linear vector Gaussian channels with arbitrary signaling, closed-form expressions for the Jacobian of the minimum mean square error and Fisher information matrices with respect to arbitrary parameters of the system are calculated in this paper. Capitalizing on prior research where the minimum mean square error and Fisher information matrices were linked to information-theoretic quantities through differentiation, closed-form expressions for the Hessian of the mutual information and the differential entropy are derived. These expressions are then used to assess the concavity properties of mutual information and differential entropy under different channel conditions and also to derive a multivariate version of the entropy power inequality due to Costa.

I. INTRODUCTION AND MOTIVATION

Closed-form expressions for the Hessian matrix of the mutual information with respect to arbitrary parameters of the system are useful from a theoretical perspective, but also from a practical standpoint. In system design, if the mutual information is to be optimized through a gradient algorithm as in [1], the Hessian matrix may be used alongside the gradient in Newton's method to speed up the convergence of the algorithm. Additionally, from a system analysis perspective, the Hessian matrix can also complement the gradient in studying the sensitivity of the mutual information to variations of the system parameters and, more importantly, in the cases where the mutual information is concave with respect to the system design parameters, it can also be used to guarantee the global optimality of a given design.

In this sense, and within the framework of linear vector Gaussian channels with arbitrary signaling, the purpose of this work is twofold. First, we find closed-form expressions for the Hessian matrix of the mutual information, differential entropy, and entropy power with respect to arbitrary parameters of the system and, second, we study the concavity properties of these quantities. Both goals are intimately related, since concavity can be assessed through the negative definiteness of the Hessian matrix. As intermediate results of our study, we derive closed-form expressions for the Jacobian of the minimum mean-square error (MMSE) and Fisher information matrices, which are interesting results in their own right and contribute to the exploration of the fundamental links between information theory and estimation theory.

Initial connections between information- and estimation-theoretic quantities for linear channels with additive Gaussian noise date back to the late fifties: in the proof of Shannon's entropy power inequality [2], Stam used the fact that the derivative of the output differential entropy with respect to the added noise power is equal to the Fisher information of the channel output, and attributed this identity to De Bruijn. More than a decade later, the links between both worlds strengthened when Duncan [3] and Kadota, Zakai, and Ziv [4] independently represented mutual information as a function of the error in causal filtering.
Much more recently, in [5], Guo, Shamai, and Verdú fruitfully explored these connections further and, as their main result, proved that the derivative of the mutual information and the differential entropy with respect to the signal-to-noise ratio (SNR) is equal to half the MMSE, regardless of the input statistics.

A shorter version of this paper is to appear in the IEEE Transactions on Information Theory. This work was supported by the RGC research grant. M. Payaró conducted his part of this research while he was with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong. He is now with the Centre Tecnològic de Telecomunicacions de Catalunya (CTTC), Barcelona, Spain (miquel.payaro@cttc.es). D. P. Palomar is with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (palomar@ust.hk).

The main result in [5] was generalized to the abstract Wiener space by Zakai in [6] and by Palomar and Verdú in two different directions: in [1] they calculated the partial derivatives of the mutual information and the differential entropy with respect to arbitrary parameters of the system, rather than with respect to the SNR alone, and in [7] they represented the derivative of the mutual information as a function of the conditional marginal input given the output for channels where the noise is not constrained to be Gaussian.

In this paper we build upon the setting of [1], where, loosely speaking, it was proved that, for the linear vector Gaussian channel

Y = GS + CN, (1)

(i) the gradients of the differential entropy h(Y) and the mutual information I(S;Y) with respect to functions of the linear transformation undergone by the input, G, are linear functions of the MMSE matrix E_S, and (ii) the gradient of the differential entropy h(Y) with respect to the linear transformation undergone by the noise, C, is a linear function of the Fisher information matrix, J_Y.

In this work, we show that the previous two key quantities E_S and J_Y, which completely characterize the first-order derivatives, are not enough to describe the second-order derivatives. For that purpose, we introduce the more refined conditional MMSE matrix Φ_S(y) and conditional Fisher information matrix Γ_Y(y); note that when these quantities are averaged with respect to the distribution of the output Y, we recover E_S = E{Φ_S(Y)} and J_Y = E{Γ_Y(Y)}. In particular, the second-order derivatives depend on Φ_S(y) and Γ_Y(y) through the terms E{Φ_S(Y) ⊗ Φ_S(Y)} and E{Γ_Y(Y) ⊗ Γ_Y(Y)}. See Fig. 1 for a schematic representation of these relations.

Fig. 1. Simplified representation of the relations between the quantities dealt with in this work. The Jacobian, D, and Hessian, H, operators represent first- and second-order differentiation, respectively.

Analogous results to some of the expressions presented in this paper, particularized to the scalar Gaussian channel, were simultaneously derived in [8], [9], where the second and third derivatives of the mutual information with respect to the SNR were calculated.

As an application of the obtained expressions, we show concavity properties of the mutual information and the differential entropy, and we derive a multivariate generalization of the entropy power inequality (EPI) due to Costa [10]. Our multivariate EPI has already found an application in [11] to derive outer bounds on the capacity region of multiuser channels with feedback.

This paper is organized as follows. In Section II, the model for the linear vector Gaussian channel is given, and the differential entropy, mutual information, minimum mean-square error, and Fisher information quantities, as well as the relationships among them, are introduced. The main results of the paper are given in Section III, where we present the expressions for the Jacobian matrices of the MMSE and Fisher information and also for the Hessian matrices of the mutual information and the differential entropy. In Section IV the concavity properties of the mutual information are studied, and in Section V a multivariate generalization of Costa's EPI [10] is given. Finally, an extension of some of the obtained results to the complex-valued case is considered in Section VI.

Notation: Straight boldface denotes multivariate quantities such as vectors (lowercase) and matrices (uppercase). Uppercase italics denote random variables, and their realizations are represented by lowercase italics. The sets of

q-dimensional symmetric, positive semidefinite, and positive definite matrices are denoted by S^q, S^q_+, and S^q_++, respectively. The elements of a matrix A are represented by A_ij or [A]_ij interchangeably, whereas the elements of a vector a are represented by a_i. The operator diag(A) represents a column vector with the diagonal entries of matrix A, Diag(A) and Diag(a) represent a diagonal matrix whose non-zero elements are given by the diagonal elements of matrix A and by the elements of vector a, respectively, and vec(A) represents the vector obtained by stacking the columns of A. For symmetric matrices, vech(A) is obtained from vec(A) by eliminating the repeated elements located above the main diagonal of A. The Kronecker matrix product is represented by A ⊗ B and the Schur (or Hadamard) element-wise matrix product is denoted by A ∘ B. The superscripts ^T, ^H, and ^+ denote transpose, Hermitian, and Moore-Penrose pseudo-inverse operations, respectively. With a slight abuse of notation, we consider that when the square root or the multiplicative inverse is applied to a vector, they act upon the entries of the vector; we thus have [√a]_i = √(a_i) and [1/a]_i = 1/a_i.

II. SIGNAL MODEL

We consider a general discrete-time linear vector Gaussian channel, whose output Y ∈ R^n is represented by the following signal model

Y = GS + Z, (2)

where S ∈ R^m is the zero-mean channel input vector with covariance matrix R_S, the matrix G ∈ R^{n×m} specifies the linear transformation undergone by the input vector, and Z ∈ R^n represents a zero-mean Gaussian noise with non-singular covariance matrix R_Z. The channel transition probability density function corresponding to the channel model in (2) is

P_{Y|S}(y|s) = P_Z(y − Gs) = (1/√((2π)^n det R_Z)) exp(−(1/2)(y − Gs)^T R_Z^{-1} (y − Gs)) (3)

and the marginal probability density function of the output is given by

P_Y(y) = E{P_{Y|S}(y|S)}, (4)

which is an infinitely differentiable continuous function of y, regardless of the distribution of the input vector S, thanks to the smoothing properties of the added noise [10, Section II].

At some points, it may be convenient to define the random vector X = GS, with covariance matrix given by R_X = G R_S G^T, and also to express the noise vector as Z = CN, where C ∈ R^{n×n'}, with n' ≥ n, and the noise covariance matrix R_Z = C R_N C^T has an inverse so that (3) is meaningful. With this notation, P_{Y|X}(y|x) can be obtained by replacing Gs by x in (3), and the channel model (2) can alternatively be rewritten as

Y = GS + CN = X + CN = GS + Z = X + Z. (5)

In the following subsections we describe the information- and estimation-theoretic quantities whose relations we are interested in.

A. Differential entropy and mutual information

The differential entropy of the continuous random vector Y is defined as [12, Chapter 9]

h(Y) = −E{log P_Y(Y)}. (6)

For the case where the distribution of Y assigns positive mass to one or more singletons in R^n, the above definition is usually extended with h(Y) = −∞. For the linear vector Gaussian channel in (5), the input-output mutual information is [12, Chapter 10]

I(S;Y) = h(Y) − h(Z) = h(Y) − (1/2) log det(2πe R_Z) = h(Y) − (1/2) log det(2πe C R_N C^T). (7)

Footnote 1: We highlight that in every expression involving integrals, expectation operators, or even a density, the statement "if it exists" should be understood.
Footnote 2: Throughout this paper we work with natural logarithms, and thus nats are used as information units.
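For concreteness, the following short numerical sketch (an illustration only: the 2×2 matrices, the equiprobable four-point constellation, and all variable names are assumptions, not taken from the paper) instantiates the model (2) and estimates the mutual information through the decomposition I(S;Y) = h(Y) − h(Z) in (7), with h(Y) = −E{log P_Y(Y)} evaluated by Monte Carlo.

```python
# Monte Carlo sketch of Y = G S + Z and of I(S;Y) = h(Y) - h(Z), cf. (2)-(7).
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 2
G = np.array([[1.0, 0.3], [0.2, 0.8]])            # linear transformation of the input
R_Z = np.array([[1.0, 0.2], [0.2, 0.5]])          # non-singular noise covariance
L = np.linalg.cholesky(R_Z)
R_Z_inv = np.linalg.inv(R_Z)

# Equiprobable four-point input constellation (one point per entry of S).
S_alphabet = np.array([[a, b] for a in (-1.0, 1.0) for b in (-1.0, 1.0)])
prior = np.full(len(S_alphabet), 1.0 / len(S_alphabet))

def log_pdf_Z(z):
    """log of the Gaussian noise density P_Z(z), cf. (3)."""
    q = np.einsum('...i,ij,...j->...', z, R_Z_inv, z)
    return -0.5 * (q + n * np.log(2 * np.pi) + np.log(np.linalg.det(R_Z)))

def log_pdf_Y(y):
    """log P_Y(y) = log E_S{ P_Z(y - G S) }, cf. (4)."""
    logs = np.array([np.log(p) + log_pdf_Z(y - G @ s)
                     for p, s in zip(prior, S_alphabet)])
    mx = logs.max(axis=0)
    return mx + np.log(np.exp(logs - mx).sum(axis=0))

# Monte Carlo estimate of h(Y) = -E{log P_Y(Y)}, then I(S;Y) = h(Y) - h(Z).
N_mc = 200_000
idx = rng.integers(len(S_alphabet), size=N_mc)
Z = rng.standard_normal((N_mc, n)) @ L.T
Y = S_alphabet[idx] @ G.T + Z
h_Y = -np.mean(log_pdf_Y(Y))
h_Z = 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * R_Z))
print("I(S;Y) approx", h_Y - h_Z, "nats; cannot exceed log 4 =", np.log(4.0))
```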

B. MMSE matrix

We consider the estimation of the input signal S based on the observation of a realization of the output Y = y. The mean square error (MSE) matrix of an estimate Ŝ(y) of the input S given the realization of the output Y = y is defined as E{(S − Ŝ(Y))(S − Ŝ(Y))^T}, and it gives us a description of the performance of the estimator. The estimator that simultaneously achieves the minimum MSE for all the components of the estimation error vector is given by the conditional mean estimator Ŝ(y) = E{S|y}, and the corresponding MSE matrix, referred to as the MMSE matrix, is

E_S = E{(S − E{S|Y})(S − E{S|Y})^T}. (8)

An alternative and useful expression for the MMSE matrix can be obtained by considering first the MMSE matrix conditioned on a specific realization of the output Y = y, which is denoted by Φ_S(y) and defined as

Φ_S(y) = E{(S − E{S|y})(S − E{S|y})^T | y}. (9)

Observe from (9) that Φ_S(y) is a positive semidefinite matrix. Finally, the MMSE matrix in (8) can be obtained by taking the expectation in (9) with respect to the distribution of the output:

E_S = E{Φ_S(Y)}. (10)

C. Fisher information matrix

Besides the MMSE matrix, another quantity that is closely related to the differential entropy is the Fisher information matrix with respect to a translation parameter, which is a special case of the Fisher information matrix [13]. The Fisher information is a measure of the minimum error in estimating a parameter of a distribution and is closely related to the Cramér-Rao lower bound [14]. For an arbitrary random vector Y, the Fisher information matrix with respect to a translation parameter is defined as

J_Y = E{D_y^T log P_Y(Y) D_y log P_Y(Y)}, (11)

where D is the Jacobian operator. This operator, together with the Hessian operator, H, and other definitions and conventions used for differentiation with respect to multidimensional parameters, is described in Appendices A and B. The expression of the Fisher information in (11) in terms of the Jacobian of log P_Y(y) can be transformed into an expression in terms of its Hessian matrix, thanks to the logarithmic identity

H_y log P_Y(y) = H_y P_Y(y)/P_Y(y) − D_y^T log P_Y(y) D_y log P_Y(y), (12)

together with the fact that E{H_y P_Y(Y)/P_Y(Y)} = ∫ H_y P_Y(y) dy = 0, which follows directly from the expression for H_y P_Y(y) in (154) in Appendix C. The alternative expression for the Fisher information matrix in terms of the Hessian is then

J_Y = −E{H_y log P_Y(Y)}. (13)

Similarly to the previous section with the MMSE matrix, it will be useful to define a conditional form of the Fisher information matrix, Γ_Y(y), in such a way that J_Y = E{Γ_Y(Y)}. At this point, it may not be clear which of the two forms, (11) or (13), will be more useful for the rest of the paper; we advance that defining Γ_Y(y) based on (13) will prove more convenient:

Γ_Y(y) = −H_y log P_Y(y) = R_Z^{-1} − R_Z^{-1} Φ_X(y) R_Z^{-1}, (14)

where the second equality is proved in Lemma C.4 in Appendix C and where Φ_X(y) = G Φ_S(y) G^T.

Footnote 3: Note that the lemmas placed in the appendices have a prefix indicating the appendix they belong to, to ease their localization. From this point on we omit the explicit reference to the appendix.
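For a discrete input, the conditional quantities Φ_S(y) and Γ_Y(y) can be computed explicitly from the posterior, which also allows a direct numerical check of the identity (14). The sketch below (same assumed toy setup as in the previous snippet; all names are illustrative) compares the closed form R_Z^{-1} − R_Z^{-1} Φ_X(y) R_Z^{-1} with a finite-difference Hessian of log P_Y at one point.

```python
# Sketch of the conditional MMSE matrix Phi_S(y) in (9) and a finite-difference
# check of Gamma_Y(y) = -H_y log P_Y(y) = R_Z^{-1} - R_Z^{-1} Phi_X(y) R_Z^{-1}, cf. (14).
import numpy as np

n = m = 2
G = np.array([[1.0, 0.3], [0.2, 0.8]])
R_Z = np.array([[1.0, 0.2], [0.2, 0.5]])
R_Z_inv = np.linalg.inv(R_Z)
S_alphabet = np.array([[a, b] for a in (-1.0, 1.0) for b in (-1.0, 1.0)])
prior = np.full(len(S_alphabet), 0.25)

def log_pdf_Y(y):
    """log P_Y(y) up to a common additive constant (enough for the Hessian)."""
    d = y - S_alphabet @ G.T
    q = np.einsum('ki,ij,kj->k', d, R_Z_inv, d)
    logs = np.log(prior) - 0.5 * q
    mx = logs.max()
    return mx + np.log(np.exp(logs - mx).sum())

def posterior(y):
    d = y - S_alphabet @ G.T
    q = np.einsum('ki,ij,kj->k', d, R_Z_inv, d)
    w = prior * np.exp(-0.5 * (q - q.min()))
    return w / w.sum()

def Phi_S(y):
    """Conditional MMSE matrix (9) for the discrete prior."""
    w = posterior(y)
    s_hat = w @ S_alphabet                      # E{S | y}
    d = S_alphabet - s_hat
    return (w[:, None] * d).T @ d

y = np.array([0.4, -0.2])
Gamma_closed = R_Z_inv - R_Z_inv @ G @ Phi_S(y) @ G.T @ R_Z_inv

# Central-difference Hessian of log P_Y at y, to compare with -Gamma_Y(y).
eps = 1e-4
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
        H[i, j] = (log_pdf_Y(y + e_i + e_j) - log_pdf_Y(y + e_i - e_j)
                   - log_pdf_Y(y - e_i + e_j) + log_pdf_Y(y - e_i - e_j)) / (4 * eps**2)
print("max |Gamma_closed - (-H_y log P_Y)| =", np.abs(Gamma_closed + H).max())
```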

D. Prior known relations among information- and estimation-theoretic quantities

The first known relation between the quantities described above is the De Bruijn identity [2] (see also the alternative derivation in [5]), which couples the Fisher information with the differential entropy according to

d/dt h(X + √t N) = (1/2) Tr(J_Y), (15)

where, in this case, Y = X + √t N and N is white Gaussian noise. A multivariate extension of the De Bruijn identity was found in [1] as

∇_C h(X + CN) = J_Y C R_N. (16)

In [5], the more canonical operational measures of mutual information and MMSE were coupled through the identity

d/dsnr I(S; √snr S + N) = (1/2) Tr(E_S). (17)

This result was generalized in [1] to the multivariate case, yielding

∇_G I(S; GS + Z) = R_Z^{-1} G E_S. (18)

Note that the simple dependence of the mutual information on the differential entropy established in (7) implies that ∇_G I(S; GS + Z) = ∇_G h(GS + Z).

From these previously existing results, we realize that the output differential entropy function h(GS + CN) is related to the MMSE matrix E_S through differentiation with respect to the transformation G undergone by the signal S (see (18)), and is related to the Fisher information matrix J_Y through differentiation with respect to the transformation C undergone by the Gaussian noise N (see (16)). This is illustrated in Fig. 1. A comprehensive account of other relations can be found in [5].

Since we are interested in calculating the Hessian matrices of the differential entropy and mutual information, in light of the results in (16) and (18) it is instrumental to first calculate the Jacobian matrices of the MMSE and Fisher information matrices, as considered in the next section.

III. JACOBIAN AND HESSIAN RESULTS

In order to derive the Hessian of the differential entropy and the mutual information, we start by obtaining the Jacobians of the Fisher information matrix and the MMSE matrix.

A. Jacobian of the Fisher information matrix

As a warm-up, consider first the signal model in (5) with Gaussian signaling, Y_G = X_G + CN. In this case, the conditional Fisher information matrix defined in (14) does not depend on the realization of the received vector y and is given by (e.g., [14, Appendix 3C])

Γ_{Y_G} = (R_{X_G} + R_Z)^{-1} = (R_{X_G} + C R_N C^T)^{-1}. (19)

Consequently, we have that J_{Y_G} = E{Γ_{Y_G}} = Γ_{Y_G}. The Jacobian matrix of the Fisher information matrix with respect to the noise transformation C can be readily obtained as

D_C J_{Y_G} = D_{R_Z} J_{Y_G} D_C R_Z = D_{R_Z} (R_{X_G} + R_Z)^{-1} D_C (C R_N C^T) (20)
= −D_n^+ (J_{Y_G} ⊗ J_{Y_G}) D_n 2 D_n^+ (C R_N ⊗ I_n) (21)
= −2 D_n^+ (J_{Y_G} ⊗ J_{Y_G}) (C R_N ⊗ I_n) (22)
= −2 D_n^+ E{Γ_{Y_G} ⊗ Γ_{Y_G}} (C R_N ⊗ I_n), (23)

where (20) follows from the Jacobian chain rule in Lemma B.5; in (21) we have applied Lemmas B.7.6 and B.7.7, with D_n^+ being the Moore-Penrose inverse of the duplication matrix D_n defined in Appendix A; and finally (22) follows from the facts that D_n D_n^+ = N_n, D_n^+ N_n = D_n^+, and (A ⊗ A) N_n = N_n (A ⊗ A), which are given in (132) and (129) in Appendix A, respectively.

Footnote 4: The matrix D_n appears in (23) and in many successive expressions because we are explicitly taking into account the fact that J_Y is a symmetric matrix. The reader is referred to Appendices A and B for more details on the conventions used in this paper.
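Before moving to arbitrary signaling, the gradient relation (18) recalled above can be verified numerically in the Gaussian-signaling case, where both I(S; GS + Z) and E_S are available in closed form. The sketch below (randomly drawn matrices, purely illustrative and not part of the paper) compares R_Z^{-1} G E_S with central differences of the mutual information.

```python
# Finite-difference check of (18): grad_G I(S; GS + Z) = R_Z^{-1} G E_S
# for Gaussian signaling, where I(S;Y) and E_S have closed forms.
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 2
G = rng.standard_normal((n, m))
A = rng.standard_normal((m, m)); R_S = A @ A.T + np.eye(m)
B = rng.standard_normal((n, n)); R_Z = B @ B.T + np.eye(n)
R_Z_inv = np.linalg.inv(R_Z)

def mutual_info(Gmat):
    """I(S_G; G S_G + Z) = 1/2 logdet(I_n + R_Z^{-1} G R_S G^T), in nats."""
    return 0.5 * np.linalg.slogdet(np.eye(n) + R_Z_inv @ Gmat @ R_S @ Gmat.T)[1]

# Closed-form right-hand side of (18): MMSE matrix of a Gaussian input.
E_S = np.linalg.inv(np.linalg.inv(R_S) + G.T @ R_Z_inv @ G)
grad_closed = R_Z_inv @ G @ E_S

# Numerical gradient of I with respect to each entry of G.
eps = 1e-6
grad_num = np.zeros_like(G)
for i in range(n):
    for j in range(m):
        E_ij = np.zeros_like(G); E_ij[i, j] = eps
        grad_num[i, j] = (mutual_info(G + E_ij) - mutual_info(G - E_ij)) / (2 * eps)

print("max deviation:", np.abs(grad_num - grad_closed).max())   # close to zero
```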

In the following theorem we generalize (23) to the case of arbitrary signaling.

Theorem 1 (Jacobian of the Fisher information matrix): Consider the signal model Y = X + CN, where C is an arbitrary deterministic matrix, the signaling X is arbitrarily distributed, and the noise vector N is Gaussian and independent of the input X. Then, the Jacobian of the Fisher information matrix of the n-dimensional output vector Y is

D_C J_Y = −2 D_n^+ E{Γ_Y(Y) ⊗ Γ_Y(Y)} (C R_N ⊗ I_n), (24)

where Γ_Y(y) is defined in (14).

Proof: Since J_Y is a symmetric matrix, its Jacobian can be written as

D_C J_Y = D_C vech(J_Y) (25)
= D_C (D_n^+ vec(J_Y)) (26)
= D_n^+ D_C vec(J_Y) (27)
= D_n^+ (−2 N_n E{Γ_Y(Y) C R_N ⊗ Γ_Y(Y)}) (28)
= −2 D_n^+ E{Γ_Y(Y) ⊗ Γ_Y(Y)} (C R_N ⊗ I_n), (29)

where (26) follows from (131) in Appendix A and (27) follows from Lemma B.7.2. The expression for D_C vec(J_Y) is derived in Appendix D, which yields (28), and (29) follows from Lemma A.3 and D_n^+ N_n = D_n^+, as detailed in Appendix A.

Remark 1: Due to the fact that, in general, the conditional Fisher information matrix Γ_Y(y) does depend on the particular value of the observation y, it is not possible to express the expectation of the Kronecker product as the Kronecker product of the expectations, as was done in (22) for the Gaussian signaling case, where Γ_{Y_G} does not depend on the particular value of the observation y.

B. Jacobian of the MMSE matrix

Again as a warm-up, before dealing with the arbitrary signaling case, we consider first the signal model in (5) with Gaussian signaling, Y_G = G S_G + Z, and study the properties of the conditional MMSE matrix, Φ_S(y), which in this case does not depend on the particular realization of the observed vector y. Precisely, we have [14, Chapter 11]

Φ_{S_G} = (R_S^{-1} + G^T R_Z^{-1} G)^{-1}, (30)

and thus E_{S_G} = E{Φ_{S_G}} = Φ_{S_G}. Following steps similar to those for the Fisher information matrix, the Jacobian matrix of the MMSE matrix with respect to the signal transformation G can be readily obtained as

D_G E_{S_G} = −2 D_m^+ E{Φ_{S_G} ⊗ Φ_{S_G}} (I_m ⊗ G^T R_Z^{-1}). (31)

Note that the expression in (31) for the Jacobian of the MMSE matrix has a structure very similar to that of the Jacobian of the Fisher information matrix in (23). The following theorem formalizes the fact that the Gaussian assumption is unnecessary for (31) to hold.

Theorem 2 (Jacobian of the MMSE matrix): Consider the signal model Y = GS + Z, where G is an arbitrary deterministic matrix, the m-dimensional signaling S is arbitrarily distributed, and the noise vector Z is Gaussian and independent of the input S. Then, the Jacobian of the MMSE matrix of the input vector S is

D_G E_S = −2 D_m^+ E{Φ_S(Y) ⊗ Φ_S(Y)} (I_m ⊗ G^T R_Z^{-1}), (32)

where Φ_S(y) is defined in (9).

Proof: The proof is analogous to that of Theorem 1, with the appropriate notation adaptation. The calculation of D_G vec(E_S) can be found in Appendix E.

Remark 2: In light of the two results in Theorems 1 and 2, it is now apparent that Γ_Y(y) plays a role in the differentiation of the Fisher information matrix analogous to the one played by the conditional MMSE matrix Φ_S(y) when differentiating the MMSE matrix, which justifies the choice made in Section II-C of identifying Γ_Y(y) with the expression in (13) and not with the expression in (11).
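The duplication matrix D_n, its Moore-Penrose inverse D_n^+, and the symmetrizer N_n = D_n D_n^+ carry much of the bookkeeping in the derivations above. The short sketch below (assuming the usual column-major vec(·) and lower-triangular, column-wise vech(·) ordering; illustrative code, not the paper's) builds them explicitly and checks the identities D_n^+ N_n = D_n^+ and (A ⊗ A) N_n = N_n (A ⊗ A) invoked after (23) and in the proof of Theorem 1.

```python
# Duplication matrix D_n, pseudo-inverse D_n^+, and symmetrizer N_n = D_n D_n^+.
import numpy as np

def duplication(n):
    """D_n with vec(A) = D_n vech(A) for symmetric n x n A (column-major vech)."""
    cols = [(i, j) for j in range(n) for i in range(j, n)]   # lower-triangular order
    D = np.zeros((n * n, len(cols)))
    for k, (i, j) in enumerate(cols):
        D[j * n + i, k] = 1.0
        D[i * n + j, k] = 1.0
    return D

n = 3
D = duplication(n)
D_plus = np.linalg.pinv(D)
N = D @ D_plus                                   # symmetrizer, equals (I + K_n)/2

A = np.random.default_rng(3).standard_normal((n, n))
Sym = A + A.T                                    # a symmetric test matrix
vech_Sym = np.concatenate([Sym[j:, j] for j in range(n)])

assert np.allclose(D @ vech_Sym, Sym.reshape(-1, order='F'))  # vec = D_n vech
assert np.allclose(D_plus @ N, D_plus)                        # D_n^+ N_n = D_n^+
AA = np.kron(A, A)
assert np.allclose(AA @ N, N @ AA)                            # (A x A) N_n = N_n (A x A)
print("duplication-matrix identities verified for n =", n)
```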

C. Jacobians with respect to arbitrary parameters

With the basic results for the Jacobians of the MMSE and Fisher information matrices in Theorems 1 and 2, one can easily find the Jacobian with respect to arbitrary parameters of the system through the chain rule for differentiation (see Lemma B.5). Precisely, we are interested in the case where the linear transformation undergone by the signal is decomposed as the product of two linear transformations, G = HP, where H represents the channel, which is externally determined by the propagation environment conditions, and P represents the linear precoder, which is specified by the system designer.

Theorem 3 (Jacobians with respect to arbitrary parameters): Consider the signal model Y = HPS + CN, where H ∈ R^{n×p}, P ∈ R^{p×m}, and C ∈ R^{n×n'}, with n' ≥ n, are arbitrary deterministic matrices, the signaling S ∈ R^m is arbitrarily distributed, the noise N ∈ R^{n'} is Gaussian, independent of the input S, and has covariance matrix R_N, and the total noise, defined as Z = CN ∈ R^n, has a positive definite covariance matrix given by R_Z = C R_N C^T. Then, the MMSE and Fisher information matrices satisfy

D_P E_S = −2 D_m^+ E{Φ_S(Y) ⊗ Φ_S(Y)} (I_m ⊗ P^T H^T R_Z^{-1} H) (33)
D_H E_S = −2 D_m^+ E{Φ_S(Y) ⊗ Φ_S(Y)} (P^T ⊗ P^T H^T R_Z^{-1}) (34)
D_{R_Z} J_Y = −D_n^+ E{Γ_Y(Y) ⊗ Γ_Y(Y)} D_n (35)
D_{R_N} J_Y = −D_n^+ E{Γ_Y(Y) ⊗ Γ_Y(Y)} (C ⊗ C) D_{n'}. (36)

Proof: The Jacobians D_P E_S and D_H E_S follow from the Jacobian D_G E_S calculated in Theorem 2 by applying the following chain rules from Lemma B.5:

D_P E_S = D_G E_S D_P G (37)
D_H E_S = D_G E_S D_H G, (38)

where G = HP and where D_P G = I_m ⊗ H and D_H G = P^T ⊗ I_n can be found in Lemma B.7.1. Similarly, the Jacobian D_{R_Z} J_Y can be calculated by applying

D_C J_Y = D_{R_Z} J_Y D_C R_Z, (39)

where D_C R_Z = 2 D_n^+ (C R_N ⊗ I_n), as in Lemma B.7.7. Recalling that, in this case, the matrix C is a dummy variable that is used only to obtain D_{R_Z} J_Y through the chain rule, the factor (C R_N ⊗ I_n) can be eliminated from both sides of the equation. Using D_n^+ D_n = I_{n(n+1)/2}, the result follows. Finally, the Jacobian D_{R_N} J_Y follows from the chain rule

D_{R_N} J_Y = D_{R_Z} J_Y D_{R_N} R_Z = D_{R_Z} J_Y D_n^+ (C ⊗ C) D_{n'}, (40)

where the expression for D_{R_N} R_Z is obtained from Lemma B.7.3 and where we have used that D_n^+ (A ⊗ A) D_n D_n^+ = D_n^+ (A ⊗ A) N_n = D_n^+ N_n (A ⊗ A) = D_n^+ (A ⊗ A).

D. Hessian of differential entropy and mutual information

Now that we have obtained the Jacobians of the MMSE and Fisher information matrices, we capitalize on the results in [1] to obtain the Hessians of the mutual information I(S;Y) and the differential entropy h(Y). We start by recalling the results that will be used.

Lemma 1 (Differential entropy Jacobians [1]): Consider the setting of Theorem 3. Then, the differential entropy of the output vector Y, h(Y), satisfies

D_P h(Y) = vec^T(H^T R_Z^{-1} H P E_S) (41)
D_H h(Y) = vec^T(R_Z^{-1} H P E_S P^T) (42)
D_C h(Y) = vec^T(J_Y C R_N) (43)
D_{R_Z} h(Y) = (1/2) vec^T(J_Y) D_n (44)
D_{R_N} h(Y) = (1/2) vec^T(C^T J_Y C) D_{n'}. (45)

Remark 3: Note that in [1] the authors gave the expressions (41) and (42) for the mutual information. Recalling the simple relation (7) between the mutual information and the differential entropy for the linear vector Gaussian channel, it is easy to see that (41) and (42) remain valid when the differential entropy is replaced by the mutual information, because the differential entropy of the noise vector Z is independent of P and H.

Remark 4: Alternatively, the expressions (43), (44), and (45) do not hold verbatim for the mutual information because, in that case, the differential entropy of the noise vector Z does depend on C, R_Z, and R_N and has to be taken into account. Then, from (7) and applying basic Jacobian results from [15, Chapter 9], we have

D_C I(S;Y) = D_C h(Y) − vec^T((C R_N C^T)^{-1} C R_N) (46)
D_{R_Z} I(S;Y) = D_{R_Z} h(Y) − (1/2) vec^T(R_Z^{-1}) D_n (47)
D_{R_N} I(S;Y) = D_{R_N} h(Y) − (1/2) vec^T(C^T (C R_N C^T)^{-1} C) D_{n'}. (48)

With Lemma 1 at hand, and with the expressions obtained in the previous section for the Jacobian matrices of the Fisher information and MMSE matrices, we are ready to calculate the Hessian matrix with respect to all the parameters of interest.

Theorem 4 (Differential entropy Hessians): Consider the setting of Theorem 3. Then, the differential entropy of the output vector Y, h(Y), satisfies

H_P h(Y) = E_S ⊗ H^T R_Z^{-1} H − 2 (I_m ⊗ H^T R_Z^{-1} H P) N_m E{Φ_S(Y) ⊗ Φ_S(Y)} (I_m ⊗ P^T H^T R_Z^{-1} H),
H_H h(Y) = P E_S P^T ⊗ R_Z^{-1} − 2 (P ⊗ R_Z^{-1} H P) N_m E{Φ_S(Y) ⊗ Φ_S(Y)} (P^T ⊗ P^T H^T R_Z^{-1})
         = P E_S P^T ⊗ R_Z^{-1} − 2 (I_p ⊗ R_Z^{-1} H) N_p E{Φ_{PS}(Y) ⊗ Φ_{PS}(Y)} (I_p ⊗ H^T R_Z^{-1}), (49)
H_C h(Y) = R_N ⊗ J_Y − 2 (R_N C^T ⊗ I_n) N_n E{Γ_Y(Y) ⊗ Γ_Y(Y)} (C R_N ⊗ I_n) (50)
H_{R_Z} h(Y) = −(1/2) D_n^T E{Γ_Y(Y) ⊗ Γ_Y(Y)} D_n (51)
H_{R_N} h(Y) = −(1/2) D_{n'}^T (C^T ⊗ C^T) E{Γ_Y(Y) ⊗ Γ_Y(Y)} (C ⊗ C) D_{n'}, (52)

where Φ_{PS}(y) = P Φ_S(y) P^T denotes the conditional MMSE matrix of the precoded signal PS.

Proof: See Appendix F.

Remark 5: The Hessian results in Theorem 4 are given for the differential entropy. The Hessian matrices for the mutual information can be found straightforwardly from (7) and Remarks 3 and 4 as H_P I(S;Y) = H_P h(Y), H_H I(S;Y) = H_H h(Y), and

H_C I(S;Y) = H_C h(Y) + 2 (R_N C^T ⊗ I_n) N_n ((C R_N C^T)^{-1} ⊗ (C R_N C^T)^{-1}) (C R_N ⊗ I_n) − R_N ⊗ (C R_N C^T)^{-1} (53)
H_{R_Z} I(S;Y) = H_{R_Z} h(Y) + (1/2) D_n^T (R_Z^{-1} ⊗ R_Z^{-1}) D_n (54)
H_{R_N} I(S;Y) = H_{R_N} h(Y) + (1/2) D_{n'}^T (C^T (C R_N C^T)^{-1} C ⊗ C^T (C R_N C^T)^{-1} C) D_{n'}. (55)

E. Hessian of mutual information with respect to the transmitted signal covariance

While in the previous sections we have obtained expressions for the Jacobian of the MMSE and the Hessians of the mutual information and the differential entropy with respect to the noise covariances R_Z and R_N, among other parameters, we have purposely avoided calculating these Jacobian and Hessian matrices with respect to covariance matrices of the signal, such as the squared precoder Q_P = P P^T, the transmitted signal covariance Q = P R_S P^T, or the input signal covariance R_S. The reason is that, in general, the mutual information, the differential entropy, and the MMSE are not functions of Q_P, Q, or R_S alone. This can be seen, for example, by noting that, given Q_P, the corresponding precoder matrix is specified only up to an arbitrary orthonormal transformation, as both P and PV, with V orthonormal, yield the same squared precoder Q_P. Now, it is easy to see that the two precoders P and PV need not yield the same mutual information and, thus, the mutual information is not well defined as a function of Q_P alone because the

mutual information cannot be uniquely determined from Q_P. The same reasoning applies to the differential entropy and to the MMSE matrix.

There are, however, some particular cases where the mutual information and the differential entropy are indeed functions of Q_P, Q, or R_S. We have, for example, the particular case where the signaling is Gaussian, S = S_G. In this case, the mutual information is given by

I(S_G; Y_G) = (1/2) log det(I_n + R_Z^{-1} H P R_S P^T H^T), (56)

which is, of course, a function of the transmitted signal covariance Q = P R_S P^T, a function of the input signal covariance R_S, and also a function of the squared precoder Q_P = P P^T when R_S = I_m. Upon direct differentiation with respect to, e.g., Q we obtain [15, Chapter 9]

D_Q I(S_G; Y_G) = (1/2) vec^T(H^T R_Z^{-1} H (I_p + Q H^T R_Z^{-1} H)^{-1}) D_p, (57)

which, after some algebra, agrees with the result in [1, Theorem 2, Eq. (23)] adapted to our notation,

D_Q I(S_G; Y_G) = (1/2) vec^T(H^T R_Z^{-1} H P E_{S_G} P^T (P R_S P^T)^{-1}) D_p, (58)

where, for the sake of simplicity, we have assumed that the inverses of P and R_S exist, and where the MMSE matrix is given by E_{S_G} = (R_S^{-1} + P^T H^T R_Z^{-1} H P)^{-1}. Note now that the MMSE matrix is not a function of Q and, consequently, it cannot be used to derive the Hessian of the mutual information with respect to Q as we did in Section III-D for other variables such as P or C. Therefore, the Hessian of the mutual information for the Gaussian signaling case has to be obtained by direct differentiation of the expression in (57) with respect to Q, yielding [15, Chapter 10]

H_Q I(S_G; Y_G) = −(1/2) D_p^T ((I_p + H^T R_Z^{-1} H Q)^{-1} H^T R_Z^{-1} H ⊗ H^T R_Z^{-1} H (I_p + Q H^T R_Z^{-1} H)^{-1}) D_p. (59)

Another particular case where the mutual information is a function of the transmit covariance matrices is the low-SNR regime [16]. Assuming that R_Z = N_0 I, Prelov and Verdú showed that [16, Theorem 3]

I(S;Y) = (1/(2N_0)) Tr(H P R_S P^T H^T) − (1/(4N_0^2)) Tr((H P R_S P^T H^T)^2) + o(N_0^{-2}), (60)

where the dependence of the mutual information on Q = P R_S P^T is explicit up to terms of order o(N_0^{-2}). The Jacobian and Hessian of the mutual information for this particular case become [15, Chapters 9 and 10]

D_Q I(S;Y) = (1/(2N_0)) vec^T(H^T H) D_p − (1/(2N_0^2)) vec^T(H^T H Q H^T H) D_p + o(N_0^{-2}) (61)
H_Q I(S;Y) = −(1/(2N_0^2)) D_p^T (H^T H ⊗ H^T H) D_p + o(N_0^{-2}). (62)

Even though we have shown two particular cases where the mutual information is a function of the transmitted signal covariance matrix Q = P R_S P^T, it is important to highlight that care must be taken when calculating the Jacobian matrix of the MMSE and the Hessian matrix of the mutual information or the differential entropy because, in general, these quantities are not functions of Q_P, Q, or R_S. In this sense, the results in [1, Theorem 2, Eqs. (23), (24), (25); Corollary 2, Eq. (49); Theorem 4, Eq. (56)] only make sense when the mutual information is well defined as a function of the signal covariance matrix, such as in the cases seen above where the signaling is Gaussian or the SNR is low.
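The central point of this section, namely that two precoders P and PV with V orthonormal share the same Q_P = P P^T yet need not yield the same mutual information for non-Gaussian inputs, is easy to observe numerically. The sketch below (an assumed 2×2 channel, a four-point constellation, white noise, and Monte Carlo estimation; all choices are illustrative, not from the paper) does exactly that.

```python
# Two precoders with identical Q_P = P P^T but different mutual informations
# when the input is a discrete constellation, illustrating Section III-E.
import numpy as np

rng = np.random.default_rng(4)
n = 2
H = np.array([[1.0, 0.6], [0.6, 1.0]])
P = np.diag([1.5, 0.5])
theta = np.pi / 4
V = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
S_alphabet = np.array([[a, b] for a in (-1.0, 1.0) for b in (-1.0, 1.0)])

def mutual_info_mc(G, N_mc=200_000):
    """I(S;Y) for Y = G S + Z with equiprobable discrete S and white Gaussian Z."""
    X = S_alphabet @ G.T                                   # noiseless points, (4, n)
    idx = rng.integers(len(S_alphabet), size=N_mc)
    Y = X[idx] + rng.standard_normal((N_mc, n))
    d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # squared distances to points
    logs = -0.5 * d2 + np.log(0.25)
    mx = logs.max(axis=1, keepdims=True)
    log_pY = mx[:, 0] + np.log(np.exp(logs - mx).sum(axis=1))   # up to a constant
    log_pYS = -0.5 * d2[np.arange(N_mc), idx]                   # same constant
    # I = E{log p(Y|S) / p(Y)}; the common Gaussian normalization cancels.
    return np.mean(log_pYS - log_pY)

print("I with precoder P   :", mutual_info_mc(H @ P))
print("I with precoder P V :", mutual_info_mc(H @ P @ V))
print("same Q_P = P P^T?   :", np.allclose(P @ P.T, (P @ V) @ (P @ V).T))
```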

IV. MUTUAL INFORMATION CONCAVITY RESULTS

As mentioned in the introduction, studying the concavity of the mutual information with respect to design parameters of the system is important from both the analysis and the design perspectives. The first candidate that naturally arises as a system parameter of interest is the precoder matrix P in the signal model Y = HPS + Z. However, one realizes from the expression for H_P I(S;Y) in Remark 5 of Theorem 4 that, for a sufficiently small P, the Hessian is approximately H_P I(S;Y) ≈ E_S ⊗ H^T R_Z^{-1} H, which, from Lemma G.3, is positive definite and, consequently, the mutual information is not concave in P (actually, it is convex). Numerical computations show that the non-concavity of the mutual information with respect to P also holds for non-small P.

The next candidate is the transmitted signal covariance matrix Q, which, at first sight, is better suited than the precoder, as it is well known that, for the Gaussian signaling case, the mutual information in (56) is a concave function of the transmitted signal covariance Q. Similarly, in the low-SNR regime we have, from (62), that the mutual information is also a concave function of Q. Since in this work we are interested in the properties of the mutual information for the whole SNR range and for arbitrary signaling, we wish to study whether the above results can be generalized. Unfortunately, as discussed in the previous section, the first difference of the general case with respect to the particular cases of Gaussian signaling and low SNR is that the mutual information is not well defined as a function of the transmitted signal covariance Q alone.

Having discarded the concavity of the mutual information with respect to P and Q, in the following subsections we study the concavity of the mutual information with respect to other parameters of the system. For the sake of notation, we define the channel covariance matrix R_H = H^T R_Z^{-1} H, which will be used in the remainder of the paper.

A. The scalar case: concavity in the SNR

The concavity of the mutual information with respect to the SNR for arbitrary input distributions can be derived as a corollary of Costa's results in [10], where he proved the concavity of the entropy power of a random variable consisting of the sum of a signal and Gaussian noise with respect to the power of the signal. As a direct consequence, the concavity of the entropy power implies the concavity of the mutual information in the signal power or, equivalently, in the SNR. In this section, we give an explicit expression for the Hessian of the mutual information with respect to the SNR, which was previously unavailable for vector Gaussian channels.

Corollary 1 (Mutual information Hessian with respect to the SNR): Consider the signal model Y = √snr H S + Z, with snr > 0 and where all the terms are defined as in Theorem 3. Then,

H_snr I(S;Y) = d^2 I(S;Y)/dsnr^2 = −(1/2) Tr E{(R_H Φ_S(Y))^2}. (63)

Moreover, H_snr I(S;Y) ≤ 0 for all snr, which implies that the mutual information is a concave function of snr.

Proof: First, we consider the result in [1, Corollary 1],

D_snr I(S;Y) = (1/2) Tr(R_H E_S). (64)

Now, we only need to choose P = √snr I_p, which implies m = p, and apply the results in Theorem 4 and the chain rule in Lemma B.5 to obtain

H_snr I(S;Y) = (1/2) D_snr Tr(R_H E_S) (65)
= (1/2) D_{E_S}[Tr(R_H E_S)] D_P E_S D_snr P (66)
= (1/2) vec^T(R_H) D_p (−2 D_p^+ E{Φ_S(Y) ⊗ Φ_S(Y)} (I_p ⊗ √snr R_H)) (1/(2√snr)) vec(I_p) (67)
= −(1/2) vec^T(R_H) E{Φ_S(Y) ⊗ Φ_S(Y)} vec(R_H), (68)

where in the last equality we have used Lemma A.4, the equality D_p D_p^+ = N_p, and the fact that, for symmetric matrices, vec^T(R_H) N_p = vec^T(R_H), as in (128) in Appendix A. From the expression in (68), it readily follows that the mutual information is a concave function of the snr parameter because, from Lemma G.3, we have that Φ_S(y) ⊗ Φ_S(y) ⪰ 0 for all y and, consequently, H_snr I(S;Y) ≤ 0. Finally, applying again Lemma A.4 and vec^T(A) vec(B) = Tr(A^T B), the expression for the Hessian in the corollary follows.

Remark 6: Observe that (63) agrees with [9, Prop. 5] for scalar Gaussian channels.
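In the Gaussian-signaling case, Φ_S(y) = E_S does not depend on y and I(snr) is available in closed form, so the right-hand side of (63) can be compared directly with a second-order finite difference of the mutual information. The following sketch (randomly drawn channel, illustrative only) performs this check.

```python
# Check of Corollary 1 for Gaussian signaling: d^2 I/dsnr^2 = -1/2 Tr((R_H E_S)^2).
import numpy as np

rng = np.random.default_rng(5)
n, p = 3, 3
H = rng.standard_normal((n, p))
R_Z = np.eye(n)
R_S = np.eye(p)
R_H = H.T @ np.linalg.inv(R_Z) @ H              # channel covariance matrix

def I_gauss(snr):
    """I(S_G; sqrt(snr) H S_G + Z) = 1/2 logdet(I + snr R_Z^{-1} H R_S H^T)."""
    M = np.eye(n) + snr * np.linalg.inv(R_Z) @ H @ R_S @ H.T
    return 0.5 * np.linalg.slogdet(M)[1]

snr = 1.3
E_S = np.linalg.inv(np.linalg.inv(R_S) + snr * R_H)       # = Phi_S(y) for Gaussian S
hess_closed = -0.5 * np.trace((R_H @ E_S) @ (R_H @ E_S))  # right-hand side of (63)

eps = 1e-3
hess_num = (I_gauss(snr + eps) - 2 * I_gauss(snr) + I_gauss(snr - eps)) / eps**2

print("closed form :", hess_closed)
print("central diff:", hess_num)
print("both values are non-positive, as concavity in snr requires")
```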

We now wonder whether the concavity result in Corollary 1 can be extended to quantities more general than the scalar SNR. In the following section we study the concavity of the mutual information with respect to the squared singular values of the precoder for the simple case where the left singular vectors of the precoder coincide with the eigenvectors of the channel covariance matrix R_H, which is commonly referred to as the case where the precoder diagonalizes the channel.

B. Concavity in the squared singular values of the precoder when the precoder diagonalizes the channel

Consider the eigendecomposition of the p × p channel covariance matrix R_H = U_H Diag(σ) U_H^T, where U_H ∈ R^{p×p} is an orthonormal matrix and the vector σ ∈ R^p contains non-negative entries in decreasing order. Note that in the case where rank(R_H) = p' < p, the last p − p' elements of the vector σ are zero. Let us now consider the singular value decomposition (SVD) of the p × m precoder matrix, P = U_P Diag(√λ) V_P^T. For the case where m ≥ p, we have that U_P ∈ R^{p×p} is an orthonormal matrix, the vector λ is p-dimensional, and the matrix V_P ∈ R^{m×p} contains orthonormal columns such that V_P^T V_P = I_p. For the case m < p, the matrix U_P ∈ R^{p×m} contains orthonormal columns such that U_P^T U_P = I_m, the vector λ is m-dimensional, and V_P ∈ R^{m×m} is an orthonormal matrix.

In the following theorem we assume m ≥ p for the sake of simplicity, and we characterize the concavity properties of the mutual information with respect to the entries of the squared singular values vector λ for the particular case where the left singular vectors of the precoder coincide with the eigenvectors of the channel covariance matrix, U_P = U_H. The result for the case m < p is stated after the theorem and is left without proof because it follows similar steps.

Theorem 5 (Mutual information Hessian with respect to the squared singular values of the precoder): Consider Y = HPS + Z, where all the terms are defined as in Theorem 3, for the particular case where the eigenvectors of the channel covariance matrix R_H and the left singular vectors of the precoder P ∈ R^{p×m} coincide, i.e., U_P = U_H, and where m ≥ p. Then, the Hessian of the mutual information with respect to the squared singular values of the precoder λ is

H_λ I(S;Y) = −(1/2) Diag(σ) E{Φ_{V^T S}(Y) ∘ Φ_{V^T S}(Y)} Diag(σ), (69)

where Φ_{V^T S}(y) = V_P^T Φ_S(y) V_P and where we recall that A ∘ B denotes the Schur or Hadamard product. Moreover, the Hessian matrix H_λ I(S;Y) is negative semidefinite, which implies that the mutual information is a concave function of the squared singular values of the precoder.

Proof: The Hessian of the mutual information H_λ I(S;Y) can be obtained from the Hessian chain rule in Lemma B.5 as

H_λ I(S;Y) = D_λ^T P H_P I(S;Y) D_λ P + (D_P I(S;Y) ⊗ I_p) H_λ P. (70)

Now we need to calculate D_λ P and H_λ P. The expression for D_λ P follows as

D_λ P = D_λ vec(U_H Diag(√λ) V_P^T) (71)
= (V_P ⊗ U_H) D_λ vec(Diag(√λ)) (72)
= (1/2) (V_P ⊗ U_H) S_p Diag(1/√λ), (73)

where in (72) we have used Lemmas A.4 and B.7.2, and where the last step follows from

[D_λ vec(Diag(√λ))]_{i+(j−1)p, k} = (∂√λ_i/∂λ_k) δ_ij = (1/(2√λ_k)) δ_ij δ_ik, {i,j,k} ∈ [1,p], (74)

and from the definition of the reduction matrix S_p in (133), [S_p]_{i+(j−1)p, k} = δ_ij δ_ik. Following steps similar to the derivation of D_λ P, the Hessian matrix H_λ P is obtained according to

H_λ P = D_λ(D_λ^T P) = (1/2) D_λ(Diag(1/√λ) S_p^T (V_P ⊗ U_H)^T) (75)
= (1/2) ((V_P ⊗ U_H) S_p ⊗ I_p) D_λ vec(Diag(1/√λ)) (76)
= −(1/4) ((V_P ⊗ U_H) S_p ⊗ I_p) S_p Diag(λ^{-3/2}). (77)

Plugging (73) and (77) into (70), and operating together with the expressions for the Jacobian matrix D_P I(S;Y) and the Hessian matrix H_P I(S;Y) given in Remark 3 of Lemma 1 and in Remark 5 of Theorem 4, respectively, we obtain

H_λ I(S;Y) = (1/4) ( Diag(1/√λ) S_p^T [ E_{V^T S} ⊗ Diag(σ) − 2 (I_p ⊗ Diag(σ ∘ √λ)) N_p E{Φ_{V^T S}(Y) ⊗ Φ_{V^T S}(Y)} (I_p ⊗ Diag(σ ∘ √λ)) ] S_p Diag(1/√λ) − (vec^T(Diag(σ ∘ √λ) E_{V^T S}) S_p ⊗ I_p) S_p Diag(λ^{-3/2}) ), (78)

where E_{V^T S} = V_P^T E_S V_P and where it can be noted that the dependence of H_λ I(S;Y) on U_H has disappeared. Now, applying Lemma A.2, the first term inside the parentheses in (78) becomes

Diag(1/√λ) S_p^T (E_{V^T S} ⊗ Diag(σ)) S_p Diag(1/√λ) = Diag(1/√λ) (E_{V^T S} ∘ Diag(σ)) Diag(1/√λ) (79)
= E_{V^T S} ∘ Diag(σ ∘ (1/λ)), (80)

whereas the third term in (78) can be expressed as

(vec^T(Diag(σ ∘ √λ) E_{V^T S}) S_p ⊗ I_p) S_p Diag(λ^{-3/2}) = Diag(diag(Diag(σ ∘ √λ) E_{V^T S})) Diag(λ^{-3/2}) (81)
= E_{V^T S} ∘ Diag(σ ∘ (1/λ)), (82)

where in (81) we have used that, for any square matrix A ∈ R^{p×p},

[vec^T(A) S_p]_k = Σ_{i,j=1}^p A_ij δ_ij δ_ik = A_kk (83)
[(diag^T(A) ⊗ I_p) S_p]_kl = Σ_{i,j=1}^p A_jj δ_ki δ_ij δ_il = A_ll δ_kl. (84)

Now, from (80) and (82) we see that the first and third terms in (78) cancel out and, recalling that V_P^T Φ_S(Y) V_P = Φ_{V^T S}(Y), the expression for the Hessian matrix H_λ I(S;Y) simplifies to

H_λ I(S;Y) = −(1/4) Diag(1/√λ) E{ Φ_{V^T S}(Y) ∘ (Diag(σ ∘ √λ) Φ_{V^T S}(Y) Diag(σ ∘ √λ)) + (Diag(σ ∘ √λ) Φ_{V^T S}(Y)) ∘ (Φ_{V^T S}(Y) Diag(σ ∘ √λ)) } Diag(1/√λ), (85)

where we have applied Lemma A.2 and have taken into account that

2 (I_p ⊗ Diag(σ ∘ √λ)) N_p = I_p ⊗ Diag(σ ∘ √λ) + K_p (Diag(σ ∘ √λ) ⊗ I_p), (86)

together with S_p^T K_p = S_p^T. Now, from simple inspection of the expression in (85) and recalling the properties of the Schur product, the desired result follows.

Remark 7: Observe from the expression for the Hessian in (69) that, for the case where the channel covariance matrix R_H is rank deficient, rank(R_H) = p' < p, the last p − p' entries of the vector σ are zero, which implies that the last p − p' rows and columns of the Hessian matrix are also zero.

Remark 8: For the case where m < p, note that the matrix U_P ∈ R^{p×m} with the left singular vectors of the precoder is not square. We thus consider that it contains the m eigenvectors in U_H associated with the m largest eigenvalues of R_H. In this case, the Hessian matrix of the mutual information with respect to the squared singular values λ ∈ R^m is also negative semidefinite and its expression becomes

H_λ I(S;Y) = −(1/2) Diag(σ̄) E{Φ_{V^T S}(Y) ∘ Φ_{V^T S}(Y)} Diag(σ̄), (87)

where we have defined σ̄ = [σ_1, σ_2, ..., σ_m]^T and where we recall that, in this case, V_P ∈ R^{m×m}.

We now recover a result obtained in [17], where it was proved that the mutual information is concave in the power allocation for the case of parallel channels. Note, however, that [17] assumed independence of the elements of the signaling vector S, whereas the following result shows that this assumption is not necessary.

Corollary 2 (Mutual information concavity with respect to the power allocation in parallel channels): Particularizing Theorem 5 to the case where the channel H, the precoder P, and the noise covariance R_Z are diagonal matrices, which implies that U_P = U_H = I_p, it follows that the mutual information is a concave function of the power allocation for parallel non-interacting channels, for an arbitrary distribution of the signaling vector S.

C. General negative results

In the previous section we proved that the mutual information is a concave function of the squared singular values of the precoder matrix for the case where the left singular vectors of the precoder coincide with the eigenvectors of the channel correlation matrix R_H. For the general case where these vectors do not coincide, the mutual information is not a concave function of the squared singular values of the precoder. This fact is formally established in the following theorem through a counterexample.

Theorem 6 (General non-concavity of the mutual information): Consider Y = HPS + Z, where all the terms are defined as in Theorem 3. It then follows that, in general, the mutual information is not a concave function of the squared singular values of the precoder λ.

Proof: We present a simple two-dimensional counterexample. Assume that the noise is white, R_Z = I_2, and consider the following channel matrix and precoder structure:

H_ce = [[1, β], [β, 1]], P_ce = Diag(√λ) = [[√λ_1, 0], [0, √λ_2]], (88)

where β ∈ (0,1], and assume that the distribution of the signal vector S has two equally likely mass points at the positions

s_1 = [2, 0]^T, s_2 = [0, 2]^T. (89)

Accordingly, we define the noiseless received vectors as r_k = H_ce P_ce s_k, for k ∈ {1,2}, which yields

r_1 = 2√λ_1 [1, β]^T, r_2 = 2√λ_2 [β, 1]^T. (90)

We now define the mutual information for this counterexample as

I_ce(λ_1, λ_2, β) = I(S; H_ce P_ce S + Z). (91)

Since there are only two possible transmitted signals, s_1 and s_2, it is clear that 0 ≤ I_ce ≤ log 2. Moreover, we will use the fact that, since R_Z = I_2, the mutual information is an increasing function of the squared distance between the two possible received vectors, d^2(λ_1, λ_2, β) = ||r_1 − r_2||^2, which we denote by I_ce(λ_1, λ_2, β) = f(d^2(λ_1, λ_2, β)), where f is an increasing function.

For a fixed β, we want to study the concavity of I_ce(λ_1, λ_2, β) with respect to (λ_1, λ_2). In order to do so, we restrict ourselves to the study of concavity along straight lines of the type λ_1 + λ_2 = ρ, with ρ > 0, which is sufficient to disprove concavity. Given three aligned points such that the point in between is at the same distance from the other two, if a function is concave in (λ_1, λ_2), then the average of the function evaluated at the two extreme points must be smaller than or equal to the function evaluated at the midpoint. Consequently, concavity can be disproved by finding three aligned points such that this property is violated. Our three aligned points will be (ρ, 0), (0, ρ), and (ρ/2, ρ/2), and instead of working with the mutual information we will work with the squared distance between the received points, because closed-form expressions are available.

Operating with the received vectors and recalling that β ∈ (0,1], we can easily obtain

d^2(ρ, 0, β) = d^2(0, ρ, β) = 4ρ(1 + β^2) > 4ρ (92)
d^2(ρ/2, ρ/2, β) = 4ρ(1 − β)^2 < 4ρ. (93)
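The distances in (92) and (93), and the resulting violation of midpoint concavity, can be confirmed with a few lines of arithmetic; the sketch below uses the arbitrary choices ρ = 10 and β = 0.8.

```python
# Numerical check of the counterexample distances (92)-(93).
import numpy as np

def d2(l1, l2, beta):
    """Squared distance between the received points r_1 and r_2 of (90)."""
    r1 = 2 * np.sqrt(l1) * np.array([1.0, beta])
    r2 = 2 * np.sqrt(l2) * np.array([beta, 1.0])
    return np.sum((r1 - r2) ** 2)

rho, beta = 10.0, 0.8
print("d^2(rho, 0)       =", d2(rho, 0.0, beta),
      " expected 4*rho*(1+beta^2) =", 4 * rho * (1 + beta**2))
print("d^2(rho/2, rho/2) =", d2(rho / 2, rho / 2, beta),
      " expected 4*rho*(1-beta)^2 =", 4 * rho * (1 - beta)**2)
print("midpoint below endpoint average:",
      d2(rho / 2, rho / 2, beta) < 0.5 * (d2(rho, 0, beta) + d2(0, rho, beta)))
```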

From (92), the mutual information evaluated at the two extreme points takes the same value and is always above a certain threshold, f(4ρ), independently of the value of β; consequently, the mean of the mutual information evaluated at the two extreme points is equal to its value at either of them. From (93), the function evaluated at the midpoint is always below this same threshold. Now it is clear that, given any ρ > 0, we can always find β such that 0 < β ≤ 1 and

I_ce(ρ, 0, β) = I_ce(0, ρ, β) > I_ce(ρ/2, ρ/2, β), (94)

which contradicts the concavity hypothesis.

For illustrative purposes, in Fig. 2 we have depicted the mutual information for different values of the channel parameter β for the counterexample in the proof of Theorem 6. Note that the function is concave (and, in fact, linear) only for the case where the channel is diagonal, β = 0, which agrees with the results in Theorem 5.

Fig. 2. Graphical representation of the mutual information I_ce(λ_1, λ_2, β) in the counterexample for different values of the channel parameter β along the line λ_1 + λ_2 = ρ = 10. It can be readily seen that, except for the case β = 0, the function is not concave.

D. Concavity results summary

At the beginning of Section IV we argued that the mutual information is concave with respect to the full transmitted signal covariance matrix Q for the case where the signaling is Gaussian and also in the low-SNR regime. We then discussed that this result cannot be generalized to arbitrary signaling distributions because, in the general case, the mutual information is not well defined as a function of Q alone. In Sections IV-A and IV-B, we encountered two particular cases where the mutual information is a concave function. In the first case, we saw that the mutual information is concave with respect to the SNR and, in the second, that the mutual information is a concave function of the squared singular values of the precoder, provided that the eigenvectors of the channel covariance R_H and the left singular vectors of the precoder coincide. For the general case where these vectors do not coincide, we showed in Section IV-C that the mutual information is not concave in the squared singular values. A summary of the different concavity results for the mutual information, as a function of the configuration of the linear vector Gaussian channel, can be found in Table I.

TABLE I
SUMMARY OF THE CONCAVITY PROPERTIES OF THE MUTUAL INFORMATION WITH RESPECT TO THE SCALAR, VECTOR, AND MATRIX TRANSMISSION PARAMETERS (CONCAVE, NOT CONCAVE, OR NOT APPLICABLE BECAUSE THE MUTUAL INFORMATION IS NOT WELL DEFINED AS A FUNCTION OF THE PARAMETER).

Parameters: scalar = power snr, with P = √snr I; vector = squared singular values λ, with P = U_P Diag(√λ) V_P^T; matrix = transmit covariance Q = P R_S P^T.

- General case, Y = HPS + Z: concave in snr ([5], [10], Section IV-A); not concave in λ (Section IV-C); not well defined as a function of Q (Section III-E).
- Channel covariance R_H and precoder sharing singular/eigenvectors, U_P = U_H: concave in λ (Section IV-B); not well defined as a function of Q (Section III-E).
- Independent parallel communication, R_H = U_P = V_P = I and P_S = ∏_i P_{S_i}: concave in snr and in λ [17]; note that Q is diagonal in this case.
- Low-SNR regime, R_Z = N_0 I_n with N_0 ≫ 1: concave in Q [16].
- Gaussian signaling, S = S_G: concave in Q, cf. (56).

V. MULTIVARIATE EXTENSION OF COSTA'S ENTROPY POWER INEQUALITY

Having proved that the mutual information and, hence, the differential entropy are concave functions of the squared singular values λ of the precoder for the case where the left singular vectors of the precoder coincide with the eigenvectors of the channel covariance R_H, H_λ I(S;Y) = H_λ h(Y) ⪯ 0, we want to study whether this last result can be strengthened by proving the concavity in λ of the entropy power. The entropy power of the random vector Y ∈ R^n was first introduced by Shannon in his seminal work [18] and is, since then, defined as

N(Y) = (1/(2πe)) exp((2/n) h(Y)), (95)

where h(Y) represents the differential entropy defined in (6). The entropy power of a random vector Y represents the variance (or power) of a standard Gaussian random vector Y_G ~ N(0, σ^2 I_n) such that both Y and Y_G have identical differential entropy, h(Y_G) = h(Y). Costa proved in [10] that, provided that the random vector W is white Gaussian distributed, then

N(X + √t W) ≥ (1 − t) N(X) + t N(X + W), (96)

where t ∈ [0,1]. As Costa noted, the above entropy power inequality (EPI) is equivalent to the concavity of the entropy power function N(X + √t W) with respect to the parameter t or, formally,

d^2/dt^2 N(X + √t W) = H_t N(X + √t W) ≤ 0. (97)

Due to its inherent interest and to the fact that the proof by Costa was rather involved, simplified proofs of his result have subsequently been given in [19]-[22]. Additionally, in his paper Costa presented two extensions of his main result in (97). Precisely, he showed that the EPI is also valid when the Gaussian vector W is not white, and also for the case where the parameter t multiplies the arbitrarily distributed random vector X instead:

H_t N(√t X + W) ≤ 0. (98)

Observe that √t in (98) plays the role of a scalar precoder. We next consider an extension of (98) to the case where the scalar precoder √t is replaced by a multivariate precoder P ∈ R^{p×m} and a channel H ∈ R^{n×p}, for the particular case where the precoder left singular vectors coincide with the channel covariance eigenvectors. Similarly as in Section IV-B, we assume that m ≥ p. The case m < p is discussed after the proof of the following theorem.

Theorem 7 (Costa's multivariate EPI): Consider Y = HPS + Z, where all the terms are defined as in Theorem 3, for the particular case where the eigenvectors of the channel covariance matrix R_H coincide with the left singular

Footnote 5: The equivalence between (97) and (96) is due to the fact that the function N(X + √t W) is twice differentiable almost everywhere, thanks to the smoothing properties of the added Gaussian noise.

vectors of the precoder P ∈ R^{p×m}, and where we assume that m ≥ p. It then follows that the entropy power N(Y) is a concave function of λ, i.e., H_λ N(Y) ⪯ 0. Moreover, the Hessian matrix of the entropy power function N(Y) with respect to λ is given by

H_λ N(Y) = (N(Y)/n) Diag(σ) [ (1/n) diag(E_{V^T S}) diag^T(E_{V^T S}) − E{Φ_{V^T S}(Y) ∘ Φ_{V^T S}(Y)} ] Diag(σ), (99)

where we recall that diag(E_{V^T S}) is a column vector with the diagonal entries of the matrix E_{V^T S}.

Proof: First, let us prove (99). From the definition of the entropy power in (95) and applying the chain rule for Hessians in (143), we obtain

H_λ N(Y) = D_λ^T h(Y) H_{h(Y)} N(Y) D_λ h(Y) + D_{h(Y)} N(Y) H_λ h(Y) = (2 N(Y)/n) ((2/n) D_λ^T h(Y) D_λ h(Y) + H_λ h(Y)). (100)

Now, recalling from [5, Eq. (61)] that D_λ h(Y) = (1/2) diag^T(E_{V^T S}) Diag(σ), and incorporating the expression for H_λ h(Y) calculated in Theorem 5, the result in (99) follows.

Now that an explicit expression for the Hessian matrix has been obtained, we wish to prove that it is negative semidefinite. Note from (100) that, except for the positive factor 2N(Y)/n, the Hessian matrix H_λ N(Y) is the sum of a rank-one positive semidefinite matrix and the Hessian matrix of the differential entropy, which is negative semidefinite according to Theorem 5. Consequently, the definiteness of H_λ N(Y) is, a priori, undetermined, and some further developments are needed to determine it, which is what we do next.

First, consider the positive semidefinite matrix A(y) ∈ S^{p'}_+, which is obtained by selecting the first p' = rank(R_H) rows and columns of the positive semidefinite matrix Diag(√σ) Φ_{V^T S}(y) Diag(√σ),

[A(y)]_ij = [Diag(√σ) Φ_{V^T S}(y) Diag(√σ)]_ij, {i,j} = 1, ..., p'. (101)

With this definition, it is now easy to see that the expression

(1/n) E{diag(A(Y))} E{diag^T(A(Y))} − E{A(Y) ∘ A(Y)} (102)

coincides, up to the positive factor N(Y)/n, with the first p' rows and columns of the Hessian matrix H_λ N(Y) in (99). Recalling that the remaining elements of the Hessian matrix H_λ N(Y) are zero, due to the presence of the matrix Diag(σ), it is sufficient to show that the expression in (102) is negative semidefinite in order to prove the negative semidefiniteness of H_λ N(Y). Now, we apply Proposition G.9 to A(y), yielding

A(y) ∘ A(y) ⪰ diag(A(y)) diag^T(A(y)) / p'. (103)

Taking the expectation on both sides of (103), we have

E{A(Y) ∘ A(Y)} ⪰ E{diag(A(Y)) diag^T(A(Y))} / p'. (104)

From Lemma G.10 we know that

E{diag(A(Y)) diag^T(A(Y))} ⪰ E{diag(A(Y))} E{diag^T(A(Y))}, (105)

from which it follows that

E{A(Y) ∘ A(Y)} ⪰ E{diag(A(Y))} E{diag^T(A(Y))} / p'. (106)

Since the operators diag(·) and expectation commute, we finally obtain the desired result as

E{A(Y) ∘ A(Y)} ⪰ diag(E{A(Y)}) diag^T(E{A(Y)}) / p' ⪰ diag(E{A(Y)}) diag^T(E{A(Y)}) / n,

which shows that the expression in (102) is negative semidefinite and thus completes the proof.
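The matrix inequality applied in (103), A ∘ A ⪰ diag(A) diag^T(A)/p for a p × p positive semidefinite A (stated as Proposition G.9 in an appendix that is not reproduced here), can be spot-checked numerically; the sketch below verifies on a randomly generated positive semidefinite matrix that the difference has no negative eigenvalues. This is only an illustration of the inequality, not a proof.

```python
# Numerical spot check of A o A >= diag(A) diag(A)^T / p for PSD A, cf. (103).
import numpy as np

rng = np.random.default_rng(6)
p = 5
B = rng.standard_normal((p, p))
A = B @ B.T                                   # a random positive semidefinite matrix
M = A * A - np.outer(np.diag(A), np.diag(A)) / p
print("smallest eigenvalue of (A o A - diag(A) diag(A)^T / p):",
      np.linalg.eigvalsh(M).min())            # non-negative up to round-off
```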


Sum-Power Iterative Watefilling Algorithm Sum-Power Iterative Watefilling Algorithm Daniel P. Palomar Hong Kong University of Science and Technolgy (HKUST) ELEC547 - Convex Optimization Fall 2009-10, HKUST, Hong Kong November 11, 2009 Outline

More information

Quantum Computing Lecture 2. Review of Linear Algebra

Quantum Computing Lecture 2. Review of Linear Algebra Quantum Computing Lecture 2 Review of Linear Algebra Maris Ozols Linear algebra States of a quantum system form a vector space and their transformations are described by linear operators Vector spaces

More information

Linear Algebra in Actuarial Science: Slides to the lecture

Linear Algebra in Actuarial Science: Slides to the lecture Linear Algebra in Actuarial Science: Slides to the lecture Fall Semester 2010/2011 Linear Algebra is a Tool-Box Linear Equation Systems Discretization of differential equations: solving linear equations

More information

Network Reconstruction from Intrinsic Noise: Non-Minimum-Phase Systems

Network Reconstruction from Intrinsic Noise: Non-Minimum-Phase Systems Preprints of the 19th World Congress he International Federation of Automatic Control Network Reconstruction from Intrinsic Noise: Non-Minimum-Phase Systems David Hayden, Ye Yuan Jorge Goncalves Department

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Optimal Sequences and Sum Capacity of Synchronous CDMA Systems

Optimal Sequences and Sum Capacity of Synchronous CDMA Systems Optimal Sequences and Sum Capacity of Synchronous CDMA Systems Pramod Viswanath and Venkat Anantharam {pvi, ananth}@eecs.berkeley.edu EECS Department, U C Berkeley CA 9470 Abstract The sum capacity of

More information

B553 Lecture 5: Matrix Algebra Review

B553 Lecture 5: Matrix Algebra Review B553 Lecture 5: Matrix Algebra Review Kris Hauser January 19, 2012 We have seen in prior lectures how vectors represent points in R n and gradients of functions. Matrices represent linear transformations

More information

Design of MMSE Multiuser Detectors using Random Matrix Techniques

Design of MMSE Multiuser Detectors using Random Matrix Techniques Design of MMSE Multiuser Detectors using Random Matrix Techniques Linbo Li and Antonia M Tulino and Sergio Verdú Department of Electrical Engineering Princeton University Princeton, New Jersey 08544 Email:

More information

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 4

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 4 EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 4 Andre Tkacenko Signal Processing Research Group Jet Propulsion Laboratory April 12, 2012 Andre Tkacenko

More information

Transmit Directions and Optimality of Beamforming in MIMO-MAC with Partial CSI at the Transmitters 1

Transmit Directions and Optimality of Beamforming in MIMO-MAC with Partial CSI at the Transmitters 1 2005 Conference on Information Sciences and Systems, The Johns Hopkins University, March 6 8, 2005 Transmit Directions and Optimality of Beamforming in MIMO-MAC with Partial CSI at the Transmitters Alkan

More information

Feasibility Conditions for Interference Alignment

Feasibility Conditions for Interference Alignment Feasibility Conditions for Interference Alignment Cenk M. Yetis Istanbul Technical University Informatics Inst. Maslak, Istanbul, TURKEY Email: cenkmyetis@yahoo.com Tiangao Gou, Syed A. Jafar University

More information

LECTURE 18. Lecture outline Gaussian channels: parallel colored noise inter-symbol interference general case: multiple inputs and outputs

LECTURE 18. Lecture outline Gaussian channels: parallel colored noise inter-symbol interference general case: multiple inputs and outputs LECTURE 18 Last time: White Gaussian noise Bandlimited WGN Additive White Gaussian Noise (AWGN) channel Capacity of AWGN channel Application: DS-CDMA systems Spreading Coding theorem Lecture outline Gaussian

More information

Stat 159/259: Linear Algebra Notes

Stat 159/259: Linear Algebra Notes Stat 159/259: Linear Algebra Notes Jarrod Millman November 16, 2015 Abstract These notes assume you ve taken a semester of undergraduate linear algebra. In particular, I assume you are familiar with the

More information

Large Scale Data Analysis Using Deep Learning

Large Scale Data Analysis Using Deep Learning Large Scale Data Analysis Using Deep Learning Linear Algebra U Kang Seoul National University U Kang 1 In This Lecture Overview of linear algebra (but, not a comprehensive survey) Focused on the subset

More information

ECE 275A Homework #3 Solutions

ECE 275A Homework #3 Solutions ECE 75A Homework #3 Solutions. Proof of (a). Obviously Ax = 0 y, Ax = 0 for all y. To show sufficiency, note that if y, Ax = 0 for all y, then it must certainly be true for the particular value of y =

More information

MATH 315 Linear Algebra Homework #1 Assigned: August 20, 2018

MATH 315 Linear Algebra Homework #1 Assigned: August 20, 2018 Homework #1 Assigned: August 20, 2018 Review the following subjects involving systems of equations and matrices from Calculus II. Linear systems of equations Converting systems to matrix form Pivot entry

More information

Approximately achieving the feedback interference channel capacity with point-to-point codes

Approximately achieving the feedback interference channel capacity with point-to-point codes Approximately achieving the feedback interference channel capacity with point-to-point codes Joyson Sebastian*, Can Karakus*, Suhas Diggavi* Abstract Superposition codes with rate-splitting have been used

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Capacity optimization for Rician correlated MIMO wireless channels

Capacity optimization for Rician correlated MIMO wireless channels Capacity optimization for Rician correlated MIMO wireless channels Mai Vu, and Arogyaswami Paulraj Information Systems Laboratory, Department of Electrical Engineering Stanford University, Stanford, CA

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental

More information

a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b 2.

a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b 2. Chapter 1 LINEAR EQUATIONS 11 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,, a n, b are given real

More information

MATH 423 Linear Algebra II Lecture 33: Diagonalization of normal operators.

MATH 423 Linear Algebra II Lecture 33: Diagonalization of normal operators. MATH 423 Linear Algebra II Lecture 33: Diagonalization of normal operators. Adjoint operator and adjoint matrix Given a linear operator L on an inner product space V, the adjoint of L is a transformation

More information

MATH 320, WEEK 7: Matrices, Matrix Operations

MATH 320, WEEK 7: Matrices, Matrix Operations MATH 320, WEEK 7: Matrices, Matrix Operations 1 Matrices We have introduced ourselves to the notion of the grid-like coefficient matrix as a short-hand coefficient place-keeper for performing Gaussian

More information

CHANNEL FEEDBACK QUANTIZATION METHODS FOR MISO AND MIMO SYSTEMS

CHANNEL FEEDBACK QUANTIZATION METHODS FOR MISO AND MIMO SYSTEMS CHANNEL FEEDBACK QUANTIZATION METHODS FOR MISO AND MIMO SYSTEMS June Chul Roh and Bhaskar D Rao Department of Electrical and Computer Engineering University of California, San Diego La Jolla, CA 9293 47,

More information

672 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 57, NO. 2, FEBRUARY We only include here some relevant references that focus on the complex

672 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 57, NO. 2, FEBRUARY We only include here some relevant references that focus on the complex 672 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 57, NO. 2, FEBRUARY 2009 Ordered Eigenvalues of a General Class of Hermitian Rom Matrices With Application to the Performance Analysis of MIMO Systems Luis

More information

Interactions of Information Theory and Estimation in Single- and Multi-user Communications

Interactions of Information Theory and Estimation in Single- and Multi-user Communications Interactions of Information Theory and Estimation in Single- and Multi-user Communications Dongning Guo Department of Electrical Engineering Princeton University March 8, 2004 p 1 Dongning Guo Communications

More information

Multivariate Statistical Analysis

Multivariate Statistical Analysis Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 4 for Applied Multivariate Analysis Outline 1 Eigen values and eigen vectors Characteristic equation Some properties of eigendecompositions

More information

Review of some mathematical tools

Review of some mathematical tools MATHEMATICAL FOUNDATIONS OF SIGNAL PROCESSING Fall 2016 Benjamín Béjar Haro, Mihailo Kolundžija, Reza Parhizkar, Adam Scholefield Teaching assistants: Golnoosh Elhami, Hanjie Pan Review of some mathematical

More information

Deep Learning Book Notes Chapter 2: Linear Algebra

Deep Learning Book Notes Chapter 2: Linear Algebra Deep Learning Book Notes Chapter 2: Linear Algebra Compiled By: Abhinaba Bala, Dakshit Agrawal, Mohit Jain Section 2.1: Scalars, Vectors, Matrices and Tensors Scalar Single Number Lowercase names in italic

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Image Registration Lecture 2: Vectors and Matrices

Image Registration Lecture 2: Vectors and Matrices Image Registration Lecture 2: Vectors and Matrices Prof. Charlene Tsai Lecture Overview Vectors Matrices Basics Orthogonal matrices Singular Value Decomposition (SVD) 2 1 Preliminary Comments Some of this

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

Characterization of Convex and Concave Resource Allocation Problems in Interference Coupled Wireless Systems

Characterization of Convex and Concave Resource Allocation Problems in Interference Coupled Wireless Systems 2382 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 59, NO 5, MAY 2011 Characterization of Convex and Concave Resource Allocation Problems in Interference Coupled Wireless Systems Holger Boche, Fellow, IEEE,

More information

Canonical lossless state-space systems: staircase forms and the Schur algorithm

Canonical lossless state-space systems: staircase forms and the Schur algorithm Canonical lossless state-space systems: staircase forms and the Schur algorithm Ralf L.M. Peeters Bernard Hanzon Martine Olivi Dept. Mathematics School of Mathematical Sciences Projet APICS Universiteit

More information

Duality, Achievable Rates, and Sum-Rate Capacity of Gaussian MIMO Broadcast Channels

Duality, Achievable Rates, and Sum-Rate Capacity of Gaussian MIMO Broadcast Channels 2658 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 49, NO 10, OCTOBER 2003 Duality, Achievable Rates, and Sum-Rate Capacity of Gaussian MIMO Broadcast Channels Sriram Vishwanath, Student Member, IEEE, Nihar

More information

Linear Algebra (Review) Volker Tresp 2018

Linear Algebra (Review) Volker Tresp 2018 Linear Algebra (Review) Volker Tresp 2018 1 Vectors k, M, N are scalars A one-dimensional array c is a column vector. Thus in two dimensions, ( ) c1 c = c 2 c i is the i-th component of c c T = (c 1, c

More information

Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes

Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Ellida M. Khazen * 13395 Coppermine Rd. Apartment 410 Herndon VA 20171 USA Abstract

More information

Homework 2 Foundations of Computational Math 2 Spring 2019

Homework 2 Foundations of Computational Math 2 Spring 2019 Homework 2 Foundations of Computational Math 2 Spring 2019 Problem 2.1 (2.1.a) Suppose (v 1,λ 1 )and(v 2,λ 2 ) are eigenpairs for a matrix A C n n. Show that if λ 1 λ 2 then v 1 and v 2 are linearly independent.

More information

Networked Sensing, Estimation and Control Systems

Networked Sensing, Estimation and Control Systems Networked Sensing, Estimation and Control Systems Vijay Gupta University of Notre Dame Richard M. Murray California Institute of echnology Ling Shi Hong Kong University of Science and echnology Bruno Sinopoli

More information

An Interpretation of the Moore-Penrose. Generalized Inverse of a Singular Fisher Information Matrix

An Interpretation of the Moore-Penrose. Generalized Inverse of a Singular Fisher Information Matrix An Interpretation of the Moore-Penrose 1 Generalized Inverse of a Singular Fisher Information Matrix Yen-Huan Li, Member, IEEE, Ping-Cheng Yeh, Member, IEEE, arxiv:1107.1944v4 [cs.i] 6 Aug 2012 Abstract

More information

SINGULAR VALUE DECOMPOSITION

SINGULAR VALUE DECOMPOSITION DEPARMEN OF MAHEMAICS ECHNICAL REPOR SINGULAR VALUE DECOMPOSIION Andrew Lounsbury September 8 No 8- ENNESSEE ECHNOLOGICAL UNIVERSIY Cookeville, N 38 Singular Value Decomposition Andrew Lounsbury Department

More information

ARTICLE IN PRESS European Journal of Combinatorics ( )

ARTICLE IN PRESS European Journal of Combinatorics ( ) European Journal of Combinatorics ( ) Contents lists available at ScienceDirect European Journal of Combinatorics journal homepage: www.elsevier.com/locate/ejc Proof of a conjecture concerning the direct

More information

Lecture notes: Applied linear algebra Part 1. Version 2

Lecture notes: Applied linear algebra Part 1. Version 2 Lecture notes: Applied linear algebra Part 1. Version 2 Michael Karow Berlin University of Technology karow@math.tu-berlin.de October 2, 2008 1 Notation, basic notions and facts 1.1 Subspaces, range and

More information

Cayley s Hyperdeterminant, the Principal Minors of a Symmetric Matrix and the Entropy Region of 4 Gaussian Random Variables

Cayley s Hyperdeterminant, the Principal Minors of a Symmetric Matrix and the Entropy Region of 4 Gaussian Random Variables Forty-Sixth Annual Allerton Conference Allerton House, UIUC, Illinois, USA September 23-26, 2008 Cayley s Hyperdeterminant, the Principal Minors of a Symmetric Matrix and the Entropy Region of 4 Gaussian

More information

Math 123, Week 2: Matrix Operations, Inverses

Math 123, Week 2: Matrix Operations, Inverses Math 23, Week 2: Matrix Operations, Inverses Section : Matrices We have introduced ourselves to the grid-like coefficient matrix when performing Gaussian elimination We now formally define general matrices

More information

MMSE Dimension. snr. 1 We use the following asymptotic notation: f(x) = O (g(x)) if and only

MMSE Dimension. snr. 1 We use the following asymptotic notation: f(x) = O (g(x)) if and only MMSE Dimension Yihong Wu Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Email: yihongwu@princeton.edu Sergio Verdú Department of Electrical Engineering Princeton University

More information

Optimal Sequences, Power Control and User Capacity of Synchronous CDMA Systems with Linear MMSE Multiuser Receivers

Optimal Sequences, Power Control and User Capacity of Synchronous CDMA Systems with Linear MMSE Multiuser Receivers Optimal Sequences, Power Control and User Capacity of Synchronous CDMA Systems with Linear MMSE Multiuser Receivers Pramod Viswanath, Venkat Anantharam and David.C. Tse {pvi, ananth, dtse}@eecs.berkeley.edu

More information

A matrix over a field F is a rectangular array of elements from F. The symbol

A matrix over a field F is a rectangular array of elements from F. The symbol Chapter MATRICES Matrix arithmetic A matrix over a field F is a rectangular array of elements from F The symbol M m n (F ) denotes the collection of all m n matrices over F Matrices will usually be denoted

More information

Fisher Information in Gaussian Graphical Models

Fisher Information in Gaussian Graphical Models Fisher Information in Gaussian Graphical Models Jason K. Johnson September 21, 2006 Abstract This note summarizes various derivations, formulas and computational algorithms relevant to the Fisher information

More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten

More information

Interactive Interference Alignment

Interactive Interference Alignment Interactive Interference Alignment Quan Geng, Sreeram annan, and Pramod Viswanath Coordinated Science Laboratory and Dept. of ECE University of Illinois, Urbana-Champaign, IL 61801 Email: {geng5, kannan1,

More information

COMPLEX CONSTRAINED CRB AND ITS APPLICATION TO SEMI-BLIND MIMO AND OFDM CHANNEL ESTIMATION. Aditya K. Jagannatham and Bhaskar D.

COMPLEX CONSTRAINED CRB AND ITS APPLICATION TO SEMI-BLIND MIMO AND OFDM CHANNEL ESTIMATION. Aditya K. Jagannatham and Bhaskar D. COMPLEX CONSTRAINED CRB AND ITS APPLICATION TO SEMI-BLIND MIMO AND OFDM CHANNEL ESTIMATION Aditya K Jagannatham and Bhaskar D Rao University of California, SanDiego 9500 Gilman Drive, La Jolla, CA 92093-0407

More information

A Proof of the Converse for the Capacity of Gaussian MIMO Broadcast Channels

A Proof of the Converse for the Capacity of Gaussian MIMO Broadcast Channels A Proof of the Converse for the Capacity of Gaussian MIMO Broadcast Channels Mehdi Mohseni Department of Electrical Engineering Stanford University Stanford, CA 94305, USA Email: mmohseni@stanford.edu

More information

ELEC546 MIMO Channel Capacity

ELEC546 MIMO Channel Capacity ELEC546 MIMO Channel Capacity Vincent Lau Simplified Version.0 //2004 MIMO System Model Transmitter with t antennas & receiver with r antennas. X Transmitted Symbol, received symbol Channel Matrix (Flat

More information

CS 143 Linear Algebra Review

CS 143 Linear Algebra Review CS 143 Linear Algebra Review Stefan Roth September 29, 2003 Introductory Remarks This review does not aim at mathematical rigor very much, but instead at ease of understanding and conciseness. Please see

More information

HEURISTIC METHODS FOR DESIGNING UNIMODULAR CODE SEQUENCES WITH PERFORMANCE GUARANTEES

HEURISTIC METHODS FOR DESIGNING UNIMODULAR CODE SEQUENCES WITH PERFORMANCE GUARANTEES HEURISTIC METHODS FOR DESIGNING UNIMODULAR CODE SEQUENCES WITH PERFORMANCE GUARANTEES Shankarachary Ragi Edwin K. P. Chong Hans D. Mittelmann School of Mathematical and Statistical Sciences Arizona State

More information

This appendix provides a very basic introduction to linear algebra concepts.

This appendix provides a very basic introduction to linear algebra concepts. APPENDIX Basic Linear Algebra Concepts This appendix provides a very basic introduction to linear algebra concepts. Some of these concepts are intentionally presented here in a somewhat simplified (not

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

ELEMENTARY LINEAR ALGEBRA

ELEMENTARY LINEAR ALGEBRA ELEMENTARY LINEAR ALGEBRA K. R. MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND Second Online Version, December 1998 Comments to the author at krm@maths.uq.edu.au Contents 1 LINEAR EQUATIONS

More information

14.2 QR Factorization with Column Pivoting

14.2 QR Factorization with Column Pivoting page 531 Chapter 14 Special Topics Background Material Needed Vector and Matrix Norms (Section 25) Rounding Errors in Basic Floating Point Operations (Section 33 37) Forward Elimination and Back Substitution

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

12.4 Known Channel (Water-Filling Solution)

12.4 Known Channel (Water-Filling Solution) ECEn 665: Antennas and Propagation for Wireless Communications 54 2.4 Known Channel (Water-Filling Solution) The channel scenarios we have looed at above represent special cases for which the capacity

More information

The study of linear algebra involves several types of mathematical objects:

The study of linear algebra involves several types of mathematical objects: Chapter 2 Linear Algebra Linear algebra is a branch of mathematics that is widely used throughout science and engineering. However, because linear algebra is a form of continuous rather than discrete mathematics,

More information

ASIGNIFICANT research effort has been devoted to the. Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach

ASIGNIFICANT research effort has been devoted to the. Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL 42, NO 6, JUNE 1997 771 Optimal State Estimation for Stochastic Systems: An Information Theoretic Approach Xiangbo Feng, Kenneth A Loparo, Senior Member, IEEE,

More information

DOWNLINK transmit beamforming has recently received

DOWNLINK transmit beamforming has recently received 4254 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 8, AUGUST 2010 A Dual Perspective on Separable Semidefinite Programming With Applications to Optimal Downlink Beamforming Yongwei Huang, Member,

More information

Background Mathematics (2/2) 1. David Barber

Background Mathematics (2/2) 1. David Barber Background Mathematics (2/2) 1 David Barber University College London Modified by Samson Cheung (sccheung@ieee.org) 1 These slides accompany the book Bayesian Reasoning and Machine Learning. The book and

More information

Linear Algebra - Part II

Linear Algebra - Part II Linear Algebra - Part II Projection, Eigendecomposition, SVD (Adapted from Sargur Srihari s slides) Brief Review from Part 1 Symmetric Matrix: A = A T Orthogonal Matrix: A T A = AA T = I and A 1 = A T

More information

Optimum Power Allocation in Fading MIMO Multiple Access Channels with Partial CSI at the Transmitters

Optimum Power Allocation in Fading MIMO Multiple Access Channels with Partial CSI at the Transmitters Optimum Power Allocation in Fading MIMO Multiple Access Channels with Partial CSI at the Transmitters Alkan Soysal Sennur Ulukus Department of Electrical and Computer Engineering University of Maryland,

More information

Multiple Antennas in Wireless Communications

Multiple Antennas in Wireless Communications Multiple Antennas in Wireless Communications Luca Sanguinetti Department of Information Engineering Pisa University luca.sanguinetti@iet.unipi.it April, 2009 Luca Sanguinetti (IET) MIMO April, 2009 1 /

More information

Uniform Power Allocation in MIMO Channels: A Game-Theoretic Approach

Uniform Power Allocation in MIMO Channels: A Game-Theoretic Approach IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO. 7, JULY 2003 1707 Uniform Power Allocation in MIMO Channels: A Game-Theoretic Approach Daniel Pérez Palomar, Student Member, IEEE, John M. Cioffi,

More information

Sum Capacity of Gaussian Vector Broadcast Channels

Sum Capacity of Gaussian Vector Broadcast Channels Sum Capacity of Gaussian Vector Broadcast Channels Wei Yu, Member IEEE and John M. Cioffi, Fellow IEEE Abstract This paper characterizes the sum capacity of a class of potentially non-degraded Gaussian

More information

Optimal Data and Training Symbol Ratio for Communication over Uncertain Channels

Optimal Data and Training Symbol Ratio for Communication over Uncertain Channels Optimal Data and Training Symbol Ratio for Communication over Uncertain Channels Ather Gattami Ericsson Research Stockholm, Sweden Email: athergattami@ericssoncom arxiv:50502997v [csit] 2 May 205 Abstract

More information

Least squares multidimensional scaling with transformed distances

Least squares multidimensional scaling with transformed distances Least squares multidimensional scaling with transformed distances Patrick J.F. Groenen 1, Jan de Leeuw 2 and Rudolf Mathar 3 1 Department of Data Theory, University of Leiden P.O. Box 9555, 2300 RB Leiden,

More information