Estimation-Theoretic Representation of Mutual Information

Daniel P. Palomar and Sergio Verdú
Department of Electrical Engineering, Princeton University
Engineering Quadrangle, Princeton, NJ 08544, USA
{danielp,verdu}@princeton.edu

Abstract

A fundamental relationship between information theory and estimation theory was recently unveiled for the Gaussian channel, relating the derivative of the mutual information to the minimum mean-square error. This paper generalizes this fundamental link between information theory and estimation theory to arbitrary channels and, in particular, encompasses the discrete memoryless channel (DMC). In addition to the intrinsic theoretical interest of such a result, it naturally leads to an efficient numerical computation of the mutual information for cases in which it was previously infeasible, such as with LDPC codes.

1 Introduction and Motivation

A fundamental relationship between estimation theory and information theory was recently unveiled in [1] for Gaussian channels; in particular, it was shown that, for the scalar Gaussian channel

$$Y = \sqrt{\mathsf{snr}}\, X + N \tag{1}$$

and regardless of the input distribution, the mutual information and the minimum mean-square error (MMSE) are related (assuming complex-valued inputs/outputs) by

$$\frac{d}{d\,\mathsf{snr}}\, I\big(X;\, \sqrt{\mathsf{snr}}\, X + N\big) = E\left[\, \big| X - E\big[X \mid \sqrt{\mathsf{snr}}\, X + N\big] \big|^2 \,\right], \tag{2}$$

where the right-hand side is the MMSE corresponding to the best estimation of $X$ upon the observation $Y$ for a given signal-to-noise ratio (SNR) $\mathsf{snr}$. The same relation was shown to hold for the linear vector Gaussian channel

$$Y = \sqrt{\mathsf{snr}}\, H X + N. \tag{3}$$

This work was supported in part by the Fulbright Program and the Ministry of Education and Science of Spain; the U.S. National Science Foundation under Grant NCR-0074277; and through collaborative participation in the Communications and Networks Consortium sponsored by the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement DAAD19-01-2-0011. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon.
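As a quick numerical illustration of (2) in its simplest instance, consider a complex Gaussian input, for which the conditional-mean estimator is linear; the following Python sketch (NumPy assumed, function names and sample sizes are arbitrary choices) compares a Monte Carlo estimate of the MMSE with the derivative of $I(\mathsf{snr}) = \log(1+\mathsf{snr})$:

import numpy as np

rng = np.random.default_rng(0)

def mmse_mc(snr, n=200_000):
    """Monte Carlo estimate of E|X - E[X|Y]|^2 for X, N ~ CN(0, 1), Y = sqrt(snr) X + N."""
    X = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
    N = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
    Y = np.sqrt(snr) * X + N
    Xhat = (np.sqrt(snr) / (1 + snr)) * Y      # conditional mean (linear for Gaussian input)
    return np.mean(np.abs(X - Xhat) ** 2)

snr, h = 2.0, 1e-3
dI = (np.log(1 + snr + h) - np.log(1 + snr - h)) / (2 * h)   # derivative of I = ln(1 + snr)
print(dI, mmse_mc(snr), 1 / (1 + snr))                       # all three should be close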

Similar results hold in a continuous-time setting, i.e., the derivative of the mutual information is equal to the noncausal MMSE. Other generalizations were also obtained in [1], such as when the input undergoes an arbitrary random transformation before contamination by additive Gaussian noise.

The previous results on the derivative of the mutual information with respect to the SNR for Gaussian channels were later generalized in [2] to embrace derivatives with respect to arbitrary parameters; in particular, the relation was compactly expressed for the linear vector Gaussian channel in terms of the gradient of the mutual information with respect to the channel matrix $H$ as

$$\nabla_H I(X;\, HX + N) = H E, \tag{4}$$

where

$$E \triangleq E\left[ \big(X - E[X \mid Y]\big)\big(X - E[X \mid Y]\big)^{\dagger} \right] \tag{5}$$

is the covariance matrix of the estimation error vector, also known as the MMSE matrix. The derivative with respect to an arbitrary parameter can be readily obtained from this gradient via a chain rule for differentiation. In addition to their intrinsic theoretical interest, these fundamental relations between mutual information and MMSE have already found several applications.

Counterparts of the fundamental relation that express the derivative of the mutual information have been explored for other types of channels; namely, for Poisson channels [3], for additive non-Gaussian channels [4], and for the discrete memoryless channel (DMC) [5]. As should be expected, the MMSE does not play a role in the representation of mutual information for these channels.

The goal of this paper is to generalize the link between information theory and estimation theory to arbitrary channels, thus completing the connection between information theory and estimation theory found in [1]. Generalizing the abovementioned approaches, our main result gives the derivative of the mutual information with respect to a channel parameter $\theta$ in terms of the input estimate given by the posterior distribution $P^{\theta}_{X|Y}$ as¹

$$\frac{\partial}{\partial\theta} I(X;Y) = E\left[ \frac{\partial \log_e P^{\theta}_{Y|X}(Y|X)}{\partial\theta}\, \log P^{\theta}_{X|Y}(X|Y) \right]. \tag{6}$$

For the particular case of a memoryless channel, the derivative is expressed in terms of the individual input estimates given by the posterior marginals $P^{\theta}_{X_i|Y^n}$ as

$$\frac{\partial}{\partial\theta} I(X^n;Y^n) = \sum_{i=1}^{n} E\left[ \frac{\partial \log_e P^{\theta}_{Y_i|X_i}(Y_i|X_i)}{\partial\theta}\, \log P^{\theta}_{X_i|Y^n}(X_i|Y^n) \right]. \tag{7}$$

Observe that in this more general setup that embraces any arbitrary channel, the role of the conditional estimator $E[X_i \mid Y^n]$ (which arises in the Gaussian channel) has been generalized to the corresponding conditional distribution $P^{\theta}_{X_i|Y^n}$.

¹ The base of the second logarithm in (6) agrees with the units of the mutual information.
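As a sanity check of (6) in its simplest setting, the expectation can be evaluated exhaustively for a single use of a BSC with crossover probability $\delta$ (playing the role of $\theta$) and compared against a finite-difference derivative of $I(X;Y)$; the short Python sketch below (NumPy assumed, helper names arbitrary) does exactly that:

import numpy as np

def check_eq6(delta=0.1, h=1e-6):
    """Exhaustive check of (6) for one use of a BSC with uniform input (theta = delta)."""
    PX = np.array([0.5, 0.5])                            # P_X(x), x in {0, 1}
    PYX = lambda d: np.array([[1 - d, d], [d, 1 - d]])   # P_{Y|X}(y|x): rows y, columns x
    def I(d):                                            # mutual information in bits
        joint = PYX(d) * PX[None, :]
        PY = joint.sum(axis=1, keepdims=True)
        return np.sum(joint * np.log2(joint / (PY * PX[None, :])))
    joint = PYX(delta) * PX[None, :]
    PXgY = joint / joint.sum(axis=1, keepdims=True)      # posterior P_{X|Y}(x|y)
    dlnPYX = (np.log(PYX(delta + h)) - np.log(PYX(delta - h))) / (2 * h)
    rhs = np.sum(joint * dlnPYX * np.log2(PXgY))         # right-hand side of (6), in bits
    lhs = (I(delta + h) - I(delta - h)) / (2 * h)        # finite-difference left-hand side
    return lhs, rhs

print(check_eq6())   # the two values should agree up to O(h^2)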

In addition to the theoretical interest of this characterization, one practical application is the computation of the mutual information $I(X^n;Y^n)$ achieved by a given code (as opposed to an ensemble of codes) when input to a memoryless channel. In such a case, $P_{X^n}$ is a distribution that puts equal mass on the codewords and zero mass elsewhere. Indeed, the mutual information achieved by a given code over a channel is a useful quantity that finds several applications, for example, in studying the concatenation of two or more coding schemes, in bounding the size of a code required to achieve a desired block error rate or bit error rate, and in the analysis and design of systems via EXIT charts.

In [6, 7, 8], the computation of the information rate for finite-state Markov sources over channels with finite memory was efficiently obtained with a Monte Carlo algorithm, based on the fact that $P_{Y^n}$ can be computed very efficiently in practice with the forward recursion of the BCJR algorithm. However, in a more general setting where the source is not a Markov process, the existing approaches cannot be used. Indeed, for an arbitrary source, a direct computation of the mutual information is a notoriously difficult task and is infeasible in most realistic cases, since it requires an enumeration of the whole codebook for the computation of the output probability $P_{Y^n}$ or of the posterior probability of the input conditioned on the output, $P_{X^n|Y^n}$ [9].

Based on (7), it is possible to obtain a numerical method to compute the mutual information via its derivative, which requires the posterior marginals $P_{X_i|Y^n}$ (instead of the joint posterior $P_{X^n|Y^n}$ or the joint output $P_{Y^n}$) or, equivalently, the symbolwise a posteriori probabilities (APPs) obtained by an optimum soft decoder. As is well known, in some notable cases of interest, the APPs can be computed or approximated very efficiently in practice by message-passing algorithms. For example, for Markov sources (e.g., convolutional codes or trellis codes) the forward-backward algorithm (also known as the BCJR algorithm) computes the exact posterior marginals. In other cases, the posterior marginals can only be approximated, such as in turbo decoding for concatenated codes, the soft decoding of Reed-Solomon codes, the sum-product algorithm for factor graphs (e.g., sparse codes such as low-density parity-check (LDPC) codes) [10, 11], and Pearl's belief propagation algorithm for Bayesian networks [9].

In addition to its computational applications, the representation of the derivative of the mutual information has some interesting analytical applications. Indeed, it has been recently shown in [5] that the derivative of the conditional entropy of the input given the output (or, equivalently, of the mutual information) with respect to a channel parameter can be used as a generalization of the EXIT chart (called the GEXIT chart), which has very appealing properties as a tool for analyzing the behavior of ensembles of codes under iterative decoding.
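To make the integrate-the-derivative idea concrete in a toy setting, the following Python sketch (NumPy assumed) recovers the mutual information of a single BSC use with uniform input by trapezoidal integration of its derivative over a grid of crossover probabilities; the exact derivative $-\log_2\frac{1-\delta}{\delta}$ stands in for the estimate that, in the coded case, would be produced from the decoder's APPs, and the integration is anchored at $\delta = 1/2$, where the mutual information vanishes:

import numpy as np

# exact derivative dI/d(delta) in bits for one BSC use with uniform input; in the coded
# case this value would instead be estimated from the decoder's APPs
dI = lambda d: -np.log2((1 - d) / d)

deltas = np.linspace(0.05, 0.5, 200)
f = dI(deltas)
# cumulative trapezoidal integral from the left edge of the grid
C = np.concatenate(([0.0], np.cumsum((f[1:] + f[:-1]) / 2 * np.diff(deltas))))
I_est = C - C[-1]                        # anchor at delta = 1/2, where I(X;Y) = 0
I_true = 1 + deltas * np.log2(deltas) + (1 - deltas) * np.log2(1 - deltas)
print(np.max(np.abs(I_est - I_true)))    # small discretization error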

2 Derivative of Mutual Information

We first give a general representation for arbitrary random transformations and memoryless channels. Then, we particularize the results for specific types of channels such as the BSC, BEC, DMC, and Gaussian channel.

2.1 General Representation

The following result characterizes the derivative of the mutual information for an arbitrary random transformation with arbitrary input and output alphabets.²

Theorem 1 Consider a random transformation $P^{\theta}_{Y|X}$, which is differentiable as a function of the parameter $\theta$, and a random input with distribution $P_X$ (independent of $\theta$). Then, the derivative of the mutual information $I(X;Y)$ with respect to the parameter $\theta$ can be written in terms of the posterior distribution $P^{\theta}_{X|Y}$ as³

$$\frac{\partial}{\partial\theta} I(X;Y) = E\left[ \frac{\partial \log_e P^{\theta}_{Y|X}(Y|X)}{\partial\theta}\, \log P^{\theta}_{X|Y}(X|Y) \right], \tag{8}$$

where the expectation is with respect to the joint distribution $P_X P^{\theta}_{Y|X}$.

Observe that when $P^{\theta}_{Y|X}$ and $P^{\theta}_{X|Y}$ are not pdf's or pmf's, Theorem 1 similarly holds using Radon-Nikodym derivatives instead. The result in Theorem 1 particularizes to the formulas found for the cases previously solved, such as the Gaussian channel [1], the additive-noise (not necessarily Gaussian) channel [4], and the Poisson channel [3].

Theorem 1 can be readily particularized to the case of an arbitrary channel with transition probability $P^{\theta}_{Y^n|X^n}$, where $n$ denotes the number of uses of the channel and the input and output alphabets are $n$-dimensional Cartesian products, and input distribution $P_{X^n}$. For the case of a memoryless channel (with possibly dependent inputs), Theorem 1 simplifies as follows.

Theorem 2 Consider a memoryless channel with transition probability $P^{\theta}_{Y^n|X^n} = \prod_{i=1}^{n} P^{\theta_i}_{Y_i|X_i}$, where $P^{\theta_i}_{Y_i|X_i}$ is differentiable as a function of the parameter $\theta_i$ (and independent of $\theta_j$ for $j \neq i$), and a random input with distribution $P_{X^n}$ (independent of $\theta_i$ for all $i$). Then, the derivative of the mutual information $I(X^n;Y^n)$ with respect to the parameter $\theta_i$ can be written in terms of the posterior marginal distribution $P^{\theta}_{X_i|Y^n}$ as

$$\frac{\partial}{\partial\theta_i} I(X^n;Y^n) = E\left[ \frac{\partial \log_e P^{\theta_i}_{Y_i|X_i}(Y_i|X_i)}{\partial\theta_i}\, \log P^{\theta}_{X_i|Y^n}(X_i|Y^n) \right], \tag{9}$$

where the expectation is with respect to the joint distribution $P_{X_i} P^{\theta}_{Y^n|X_i}$.

Observe that if the channel is time-invariant (i.e., if $P^{\theta_i}_{Y_i|X_i} = P^{\theta}_{Y|X}$ for all $i$), then, by simply applying the chain rule for differentiation with $\theta_i = \theta$ for all $i$, we get (7). The result in Theorem 2 for a memoryless channel can be easily extended to a finite-state Markov channel.

² The results in this paper require some mild regularity conditions about the interchange of the order of differentiation and integration (expectation), which are satisfied in most cases of interest and are implicitly assumed hereinafter.
³ Unless the logarithm base is indicated, it can be chosen arbitrarily as long as both sides of the equation have the same units.
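Theorem 2 can be checked exhaustively on a small example with dependent inputs; the Python sketch below (NumPy assumed, helper names arbitrary) takes a length-2 repetition code over a BSC with crossover probability $\delta$, evaluates the right-hand side of (9) summed over the two positions by enumerating all input/output pairs, and compares it with a finite-difference derivative of $I(X^2;Y^2)$:

import numpy as np
from itertools import product

def bsc(d):
    return np.array([[1 - d, d], [d, 1 - d]])      # P_{Y|X}(y|x): rows y, columns x

codewords = [(0, 0), (1, 1)]                       # length-2 repetition code, equiprobable

def I(d):
    """Brute-force I(X^2;Y^2) in bits."""
    P, total = bsc(d), 0.0
    for x in codewords:
        for y in product((0, 1), repeat=2):
            pygx = P[y[0], x[0]] * P[y[1], x[1]]
            py = 0.5 * sum(P[y[0], c[0]] * P[y[1], c[1]] for c in codewords)
            total += 0.5 * pygx * np.log2(pygx / py)
    return total

def rhs_eq9(d):
    """Right-hand side of (9), summed over i = 1, 2, by exhaustive enumeration."""
    P = bsc(d)
    dlnP = np.array([[-1 / (1 - d), 1 / d], [1 / d, -1 / (1 - d)]])  # d ln P(y|x) / d delta
    total = 0.0
    for x in codewords:
        for y in product((0, 1), repeat=2):
            pxy = 0.5 * P[y[0], x[0]] * P[y[1], x[1]]
            pygc = {c: P[y[0], c[0]] * P[y[1], c[1]] for c in codewords}
            post = pygc[x] / sum(pygc.values())    # marginal posterior of x_i; both bits agree
            for i in range(2):
                total += pxy * dlnP[y[i], x[i]] * np.log2(post)
    return total

d, h = 0.1, 1e-6
print(rhs_eq9(d), (I(d + h) - I(d - h)) / (2 * h))   # should agree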

An interesting application of Theorem 2 is the computation of the derivative of the mutual information of a given and fixed $(2^{nR}, n)$ code used over a memoryless channel, where $n$ and $R$ are the blocklength and the rate of the code, respectively. This is easily done by defining the input distribution $P_{X^n}$ as the one induced by the code (typically under an equiprobable choice of codewords). Indeed, the practical relevance of Theorem 2 for numerical computation is remarkable since, as already mentioned, the symbolwise APPs $P_{X_i|Y^n}$ obtained by an optimum soft decoder can be efficiently computed in practice with a message-passing algorithm such as the BCJR, sum-product, or belief-propagation algorithms [9, 10, 11]. The expectation over $X_i$ and $Y^n$ can be numerically approximated with a Monte Carlo approach by averaging over many realizations of $X_i$ and $Y^n$. Alternatively, one can consider the numerical approximation of the expectation only over $Y^n$ and then obtain the inner expectation over $X_i$ conditioned on $Y^n$ through $P_{X_i|Y^n}$; then, for a finite input alphabet, (7) becomes

$$\frac{\partial}{\partial\theta} I(X^n;Y^n) = \sum_{i=1}^{n} E\left[ \sum_{x_i} P^{\theta}_{X_i|Y^n}(x_i|Y^n)\, \frac{\partial \log_e P^{\theta}_{Y|X}(Y_i|x_i)}{\partial\theta}\, \log P^{\theta}_{X_i|Y^n}(x_i|Y^n) \right]. \tag{10}$$

2.2 Binary Symmetric Channel (BSC)

Theorem 2 can be readily particularized for the BSC with

$$\frac{d \log_e P^{\delta}_{Y|X}(y_i|x_i)}{d\delta} = \frac{x_i \oplus y_i}{\delta} - \frac{1 - (x_i \oplus y_i)}{1-\delta}, \tag{11}$$

where $\oplus$ denotes the xor operation (sum modulo 2). The following result carries out the expectation over $X_i$ and $Y_i$ analytically and provides an alternative formula in terms of the extrinsic information summarized by the distribution $P^{\delta}_{X_i|Y_{\setminus i}}$ (where $y_{\setminus i}$ denotes the sequence $y^n$ except the $i$th element $y_i$), which is frequently more useful.

Theorem 3 Consider a BSC with crossover probability $\delta \in (0,1)$ and input distribution $P_{X^n}$. Then,

$$\frac{d}{d\delta} I(X^n;Y^n) = -\sum_{i=1}^{n} E\left[ \tanh\!\left(\frac{\lambda_i(Y_{\setminus i})}{2}\right) \log\frac{\exp\!\big(\lambda_i(Y_{\setminus i})\big) + \exp(\gamma)}{\exp\!\big(\lambda_i(Y_{\setminus i})\big) + \exp(-\gamma)} + 2\gamma\, P^{\delta}_{X_i|Y_{\setminus i}}(1|Y_{\setminus i}) \log e \right] \tag{12}$$

where

$$\lambda_i(y_{\setminus i}) \triangleq \log_e \frac{P^{\delta}_{X_i|Y_{\setminus i}}(0|y_{\setminus i})}{P^{\delta}_{X_i|Y_{\setminus i}}(1|y_{\setminus i})} \tag{13}$$

and

$$\gamma = \log_e \frac{1-\delta}{\delta}. \tag{14}$$
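Since all quantities in Theorem 3 are available in closed form for very short codes, the expression can be checked on the same length-2 repetition code; in the Python sketch below (NumPy assumed; natural logarithms throughout, so the $\log e$ factor equals one) the extrinsic quantities $\lambda_i(y_{\setminus i})$ and $P^{\delta}_{X_i|Y_{\setminus i}}(1|y_{\setminus i})$ are enumerated exactly and the result is compared with a finite-difference derivative:

import numpy as np
from itertools import product

def bsc(d):
    return np.array([[1 - d, d], [d, 1 - d]])          # P_{Y|X}(y|x): rows y, columns x

def I_rep2(d):
    """Brute-force I(X^2;Y^2) in nats for the length-2 repetition code."""
    P, total = bsc(d), 0.0
    for x in [(0, 0), (1, 1)]:
        for y in product((0, 1), repeat=2):
            pygx = P[y[0], x[0]] * P[y[1], x[1]]
            py = 0.5 * (P[y[0], 0] * P[y[1], 0] + P[y[0], 1] * P[y[1], 1])
            total += 0.5 * pygx * np.log(pygx / py)
    return total

def dI_thm3(d):
    """Evaluate (12) in nats for the same code (so the log e factor equals one)."""
    gamma = np.log((1 - d) / d)                         # eq. (14)
    P, total = bsc(d), 0.0
    for i in range(2):                                  # code positions
        for y_ext in (0, 1):                            # the single extrinsic observation
            p_yext = 0.5                                # its marginal probability
            p1 = P[y_ext, 1] * 0.5 / p_yext             # P(X_i = 1 | other output); X_1 = X_2
            lam = np.log((1 - p1) / p1)                 # extrinsic LLR, eq. (13)
            total += p_yext * (np.tanh(lam / 2)
                               * np.log((np.exp(lam) + np.exp(gamma))
                                        / (np.exp(lam) + np.exp(-gamma)))
                               + 2 * gamma * p1)
    return -total

d, h = 0.1, 1e-6
print(dI_thm3(d), (I_rep2(d + h) - I_rep2(d - h)) / (2 * h))   # should agree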

2.3 Binary Erasure Channel (BEC)

The following result refines Theorem 2 for the BEC and provides an alternative formula in terms of extrinsic information.

Theorem 4 Consider a BEC with erasure probability $\epsilon \in (0,1)$ and input distribution $P_{X^n}$. Then,

$$\frac{d}{d\epsilon} I(X^n;Y^n) = -\sum_{i=1}^{n} E\left[ \frac{\log\!\big(1+\exp\!\big(\lambda_i(Y_{\setminus i})\big)\big)}{1+\exp\!\big(\lambda_i(Y_{\setminus i})\big)} + \frac{\log\!\big(1+\exp\!\big(-\lambda_i(Y_{\setminus i})\big)\big)}{1+\exp\!\big(-\lambda_i(Y_{\setminus i})\big)} \right] \tag{15}$$

where

$$\lambda_i(y_{\setminus i}) \triangleq \log_e \frac{P^{\epsilon}_{X_i|Y_{\setminus i}}(0|y_{\setminus i})}{P^{\epsilon}_{X_i|Y_{\setminus i}}(1|y_{\setminus i})}.$$

2.4 Discrete Memoryless Channel (DMC)

Consider a DMC with arbitrary finite input alphabet $\mathcal{X} = \{a_1, \ldots, a_{|\mathcal{X}|}\}$, arbitrary finite output alphabet $\mathcal{Y} = \{b_1, \ldots, b_{|\mathcal{Y}|}\}$, and arbitrary time-invariant memoryless channel transition probability $P_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} P_{Y|X}(y_i|x_i)$. The channel transition probability can be compactly described with the channel transition matrix $\Pi$, with $(k,l)$th element defined as $[\Pi]_{kl} = \pi_{kl} = P_{Y|X}(b_k|a_l)$.

Theorem 2 can be readily particularized for the DMC in terms of extrinsic information as

$$\frac{\partial}{\partial\theta} I(X^n;Y^n) = \sum_{i=1}^{n} \sum_{x_i,\, y_i,\, y_{\setminus i}} P_{X_i}(x_i)\, P^{\theta}_{Y_{\setminus i}|X_i}(y_{\setminus i}|x_i)\, \frac{\partial P^{\theta}_{Y|X}(y_i|x_i)}{\partial\theta}\, \log \frac{P^{\theta}_{X_i|Y_{\setminus i}}(x_i|y_{\setminus i})\, P^{\theta}_{Y|X}(y_i|x_i)}{\sum_{x'_i} P^{\theta}_{X_i|Y_{\setminus i}}(x'_i|y_{\setminus i})\, P^{\theta}_{Y|X}(y_i|x'_i)}. \tag{16}$$

An equivalent form of (16) was independently obtained in [5, Thm. 1], where the conditioning is with respect to an extrinsic information random variable $Z_i$ instead of $Y_{\setminus i}$. The convergence analysis of the decoding of LDPC code ensembles is carried out in [5] by the GEXIT of the code ensemble (a generalization of the EXIT, which is defined as the negative of the derivative of the mutual information averaged over the code ensemble).

The following result refines Theorem 2 for the DMC by carrying out the expectation over $X_i$ and $Y_i$ analytically and provides an alternative formula in terms of extrinsic information.

Theorem 5 Consider a DMC with channel transition matrix $\Pi$ and input distribution $P_{X^n}$. Then, provided that $\pi_{kl} > 0$,⁴

$$\big[\nabla_{\Pi} I(X^n;Y^n)\big]_{kl} = -\sum_{i=1}^{n} E\left[ \frac{\log\!\left(1 + \sum_{m\neq l} \frac{\pi_{km}}{\pi_{kl}}\, \exp\!\big(\lambda^{(m,l)}_i(Y_{\setminus i})\big)\right)}{1 + \sum_{m\neq l} \exp\!\big(\lambda^{(m,l)}_i(Y_{\setminus i})\big)} \right] \tag{17}$$

where

$$\lambda^{(m,l)}_i(y_{\setminus i}) \triangleq \log_e \frac{P_{X_i|Y_{\setminus i}}(a_m|y_{\setminus i})}{P_{X_i|Y_{\setminus i}}(a_l|y_{\setminus i})}. \tag{18}$$

The usefulness of the gradient in Theorem 5 is as an intermediate step in the computation of the derivative with respect to an arbitrary parameter $\theta$ via the chain rule for differentiation:

$$\frac{\partial}{\partial\theta} I(X^n;Y^n) = \mathrm{Tr}\!\left( \nabla^T_{\Pi} I(X^n;Y^n)\, \frac{\partial \Pi}{\partial\theta} \right), \tag{19}$$

where only the elements of the gradient $\nabla_{\Pi} I(X^n;Y^n)$ that are multiplied by nonzero elements of $\partial\Pi/\partial\theta$ need to be computed.

⁴ The gradient with respect to a matrix, $\nabla_X f$, is defined as $[\nabla_X f]_{ij} \triangleq \partial f / \partial [X]_{ij}$.
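The interplay of (17)-(19) can be exercised on a toy single-use DMC with an i.i.d. input, for which the extrinsic posterior $P_{X_i|Y_{\setminus i}}$ reduces to the prior; the Python sketch below (NumPy assumed; the two-input parameterization $\Pi(\theta)$ is invented purely for illustration) compares the chain-rule derivative with a finite difference of the mutual information:

import numpy as np

def Pi(t):
    # an invented 2x2 DMC parameterization; each column (input) sums to one
    return np.array([[1 - t, 2 * t],
                     [t, 1 - 2 * t]])

px = np.array([0.7, 0.3])                      # input distribution (single channel use)

def I(t):
    """Brute-force mutual information in nats."""
    joint = Pi(t) * px[None, :]                # rows: outputs, columns: inputs
    py = joint.sum(axis=1, keepdims=True)
    return np.sum(joint * np.log(joint / (py * px[None, :])))

def grad_Pi(t):
    """Gradient of Theorem 5; with a single use and i.i.d. input the extrinsic
    posterior reduces to the prior, so lambda^{(m,l)} = ln(px[m] / px[l])."""
    P = Pi(t)
    G = np.zeros_like(P)
    for k in range(2):
        for l in range(2):
            m = 1 - l                          # the only other input symbol
            lam = np.log(px[m] / px[l])
            num = np.log(1 + (P[k, m] / P[k, l]) * np.exp(lam))
            den = 1 + np.exp(lam)
            G[k, l] = -num / den               # eq. (17), one channel use
    return G

t, h = 0.2, 1e-6
dPi = np.array([[-1.0, 2.0], [1.0, -2.0]])     # derivative of Pi with respect to t
chain = np.trace(grad_Pi(t).T @ dPi)           # equation (19)
print(chain, (I(t + h) - I(t - h)) / (2 * h))  # should agree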

2.5 Gaussian Channel

For the Gaussian channel, Theorem 2 can be particularized to obtain (2) and (4) (for the case of i.i.d. inputs), in agreement with [1] and [2], respectively. The following result further refines Theorem 2 for the binary-input AWGN (BIAWGN) channel by carrying out analytically the expectation over $X_i$ and $Y_i$.

Theorem 6 Consider a real-valued Gaussian channel with channel transition probability $P_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi}}\, e^{-(y-\sqrt{\mathsf{snr}}\,x)^2/2}$ and a binary $\{\pm 1\}$ input distribution $P_{X^n}$. Then,

$$\frac{d}{d\,\mathsf{snr}} I(X^n;Y^n) = -\frac{1}{2\sqrt{\mathsf{snr}}} \sum_{i=1}^{n} E\left[ \frac{\Psi\big(\lambda_i(Y_{\setminus i});\, \mathsf{snr}\big)}{1+\exp\!\big(\lambda_i(Y_{\setminus i})\big)} + \frac{\Psi\big(-\lambda_i(Y_{\setminus i});\, \mathsf{snr}\big)}{1+\exp\!\big(-\lambda_i(Y_{\setminus i})\big)} \right] \tag{20}$$

where

$$\lambda_i(y_{\setminus i}) \triangleq \log_e \frac{P_{X_i|Y_{\setminus i}}(-1|y_{\setminus i})}{P_{X_i|Y_{\setminus i}}(+1|y_{\setminus i})} \tag{21}$$

and

$$\Psi(x;\, \mathsf{snr}) \triangleq E\left[ N \log\!\Big(1+\exp\!\big(x - 2\sqrt{\mathsf{snr}}\,(\sqrt{\mathsf{snr}} + N)\big)\Big) \right] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \nu \log\!\Big(1+\exp\!\big(x - 2\sqrt{\mathsf{snr}}\,(\sqrt{\mathsf{snr}} + \nu)\big)\Big)\, e^{-\nu^2/2}\, d\nu \tag{22}$$

with $N$ a standard Gaussian random variable.
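For i.i.d. equiprobable $\pm 1$ inputs the extrinsic LLRs in (20) vanish and the bracket collapses to $\Psi(0;\mathsf{snr})$, which makes Theorem 6 easy to check numerically; the Python sketch below (NumPy assumed, natural logarithms) evaluates $\Psi$ by Gauss-Hermite quadrature and compares $-\Psi(0;\mathsf{snr})/(2\sqrt{\mathsf{snr}})$ with a finite-difference derivative of the BPSK mutual information obtained by direct numerical integration:

import numpy as np

phi = lambda v: np.exp(-v ** 2 / 2) / np.sqrt(2 * np.pi)

def Psi(x, snr, deg=120):
    """Psi(x; snr) of (22) in nats, via Gauss-Hermite quadrature (weight exp(-v^2/2))."""
    t, w = np.polynomial.hermite_e.hermegauss(deg)
    g = np.log1p(np.exp(x - 2 * np.sqrt(snr) * (np.sqrt(snr) + t)))
    return np.sum(w * t * g) / np.sqrt(2 * np.pi)

def I_bpsk(snr):
    """I(X;Y) in nats for equiprobable +/-1 input on the real AWGN channel."""
    y = np.linspace(-12 - np.sqrt(snr), 12 + np.sqrt(snr), 20001)
    py = 0.5 * (phi(y - np.sqrt(snr)) + phi(y + np.sqrt(snr)))
    hY = -np.sum(py * np.log(py)) * (y[1] - y[0])      # differential entropy of Y
    return hY - 0.5 * np.log(2 * np.pi * np.e)         # minus h(Y|X) = h(N)

snr, h = 1.0, 1e-4
deriv_thm6 = -Psi(0.0, snr) / (2 * np.sqrt(snr))       # (20) with all extrinsic LLRs zero
deriv_fd = (I_bpsk(snr + h) - I_bpsk(snr - h)) / (2 * h)
print(deriv_thm6, deriv_fd)                            # should agree to several digits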

3 Applications

3.1 Computation of Mutual Information of LDPC Codes

Consider the computation of the mutual information of an LDPC code over a channel via the derivative. Figure 1(a) shows the mutual information of different codes of rate 1/2 over a BSC versus the channel crossover probability $\delta$. The curves were computed via Theorem 3, and the posterior marginals for the LDPC codes were estimated using the sum-product algorithm⁵ [10, 11]. In particular, two LDPC codes with blocklengths $n = 96$ and $n = 4000$ are considered, as well as a simple repetition code. Naturally, for $\delta = 0$ all codes achieve a mutual information equal to the code rate. The difference among the LDPC codes is not significant (this observation is in terms of mutual information and does not imply that the decoding of these different LDPC codes is expected to give the same error performance whatsoever); the implication is that using a long LDPC code is essentially equivalent to using a short one combined with an outer code.

Figure 1: Estimation of the mutual information of different codes of rate 1/2 over (a) a BSC, as a function of the crossover probability $\delta$, and (b) an antipodal Gaussian channel, as a function of $\mathsf{snr}$. (Each plot shows, in bits/symbol, the channel capacity together with the curves for an LDPC code with $n = 4000$, an LDPC code with $n = 96$, and a repetition code.)

Figure 1(b) shows the mutual information of the same codes but over a Gaussian channel instead of a BSC. The curves were computed via Theorem 6, and the posterior marginals for the LDPC codes were estimated using the sum-product algorithm. As happened in the BSC, the LDPC codes achieve a mutual information essentially equal to the code rate whenever the rate is below a certain fraction of the channel capacity; the repetition code, however, shows a much worse behavior.

⁵ In general, the belief propagation algorithm will only give an estimate of the posterior marginals; hence, the computation of the mutual information based on these estimates will inevitably be subject to the accuracy of these estimates. A number of algorithms that provide more accurate estimates than the basic belief-propagation algorithm have been recently proposed.
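In practice, the only ingredients needed for the BSC curves are the per-position extrinsic LLRs delivered by the decoder for each simulated block; a minimal Python sketch (NumPy assumed) of the resulting Monte Carlo estimator of the per-symbol derivative is given below, where the sum-product decoder producing the LLRs is assumed to be available externally and is not shown:

import numpy as np

def dI_ddelta_per_symbol(ext_llrs, delta):
    """Monte Carlo estimate of (1/n) dI(X^n;Y^n)/d(delta) in nats, via (12).
    ext_llrs: extrinsic LLRs lambda_i (natural log), one entry per code position
    and per simulated block, as produced by an APP / sum-product decoder
    (assumed external; for LDPC codes its outputs are only approximate)."""
    lam = np.clip(np.asarray(ext_llrs, dtype=float), -30.0, 30.0)   # numerical safety
    gamma = np.log((1 - delta) / delta)
    p1 = 1.0 / (1.0 + np.exp(lam))                 # P(X_i = 1 | all other outputs)
    term = (np.tanh(lam / 2)
            * np.log((np.exp(lam) + np.exp(gamma)) / (np.exp(lam) + np.exp(-gamma)))
            + 2 * gamma * p1)
    return -np.mean(term)

# toy call with synthetic LLRs (real ones would come from the decoder):
print(dI_ddelta_per_symbol(np.array([[0.0, 2.0, -1.5], [4.0, -0.3, 0.7]]), 0.1))

The mutual information curve is then obtained by numerically integrating these estimates over a grid of crossover probabilities, starting from the code rate at $\delta = 0$.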

3.2 Universal Estimation of the Derivative of Mutual Information

Another application of our results is the estimation of the derivative of the mutual information achieved by inputs which are neither accessible nor statistically known (hence the term universal), as is frequently the case when dealing with text, images, etc. Assuming that the channel is discrete memoryless and known (with a full-rank transition probability matrix), it is possible to estimate the derivative of the mutual information by simply observing the output. To that end, we use one of the universal algorithms recently developed to estimate the posterior marginals $P_{X_i|Y_{\setminus i}}$ [12, 13], and then we apply Theorem 5 for the DMC.

To compute the mutual information by integrating the derivative, we must have access to the outputs corresponding to a grid of channels with a range of qualities starting from a perfect channel. The input (assumed to be stationary and ergodic) is neither accessible nor statistically known and is only observed after passing through the channel. Theorem 5 (combined with (19)) can be more conveniently rewritten (due to the stationarity and ergodicity) as

$$\frac{1}{n}\frac{\partial}{\partial\theta} I(X^n;Y^n) \approx \frac{1}{n}\sum_{i=1}^{n} \mathrm{Tr}\!\left( R_i^T(y_{\setminus i})\, \frac{\partial\Pi}{\partial\theta} \right) \tag{23}$$

where

$$\big[R_i(y_{\setminus i})\big]_{kl} = -\frac{\log\!\left(1 + \sum_{m\neq l} \frac{\pi_{km}}{\pi_{kl}}\, \exp\!\big(\lambda^{(m,l)}_i(y_{\setminus i})\big)\right)}{1 + \sum_{m\neq l} \exp\!\big(\lambda^{(m,l)}_i(y_{\setminus i})\big)} \tag{24}$$

and the sequence $y^n$ is obtained by passing an (unknown) sequence $x^n$ through the channel.

As an illustration of the previous approach, we compute the amount of information about the source received by a reader of the novel Don Quixote de La Mancha (in English translation), given that the printed novel contains errors introduced by the typist. We model this channel by assuming that each letter is independently flipped, with some symbol error rate (SER) of $\mathsf{ser}$, equiprobably into one of its nearest neighbors on the QWERTY keyboard.

Figure 2: Input-output mutual information (in bits/symbol) of Don Quixote de La Mancha over the typewriter channel as a function of the symbol error probability $\mathsf{ser}$.

Figure 2 shows the mutual information obtained by integrating the derivative from the point of reference $\mathsf{ser} = 0$. For $\mathsf{ser} = 0$, the mutual information equals the entropy of Don Quixote de La Mancha, which is 2.17 bits/symbol. Interestingly, the mutual information (as well as the capacity of this channel) is not monotonic; in particular, the mutual information decreases for symbol error rates up to $\mathsf{ser} \approx 0.82$ and then increases. The reason for this behavior is that, for a sufficiently high $\mathsf{ser}$, each letter and its neighbors occur with roughly the same probability, whereas as $\mathsf{ser} \to 1$, the probability that the intended letter is indeed observed at the output of the channel becomes zero, and this reduces the uncertainty about the transmitted letter given the observed one.
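The per-position matrices $R_i$ are inexpensive to form once the marginal posteriors have been estimated; the Python sketch below (NumPy assumed; the three-symbol channel is invented purely for illustration) builds $R_i$ as in (24) and applies (23), and, since for an i.i.d. source the extrinsic posterior reduces to the prior, the result can be checked against a direct finite-difference computation:

import numpy as np

def R_i(q, Pi_mat):
    """R_i of (24); q[m] is the (estimated) extrinsic posterior probability of input
    symbol a_m given all outputs except position i, so exp(lambda^{(m,l)}) = q[m]/q[l]."""
    ny, nx = Pi_mat.shape
    R = np.zeros((ny, nx))
    for k in range(ny):
        for l in range(nx):
            num = np.log(1 + sum(Pi_mat[k, m] / Pi_mat[k, l] * q[m] / q[l]
                                 for m in range(nx) if m != l))
            den = 1 + sum(q[m] / q[l] for m in range(nx) if m != l)
            R[k, l] = -num / den
    return R

# invented 3-symbol channel: a symbol is kept with probability 1 - ser and is flipped
# equiprobably to one of its two "neighbours" otherwise
def Pi(ser):
    return np.array([[1 - ser, ser / 2, ser / 2],
                     [ser / 2, 1 - ser, ser / 2],
                     [ser / 2, ser / 2, 1 - ser]])

ser, h = 0.1, 1e-6
dPi = (Pi(ser + h) - Pi(ser - h)) / (2 * h)
px = np.array([0.5, 0.3, 0.2])                   # i.i.d. source: extrinsic posterior = prior
est = np.trace(R_i(px, Pi(ser)).T @ dPi)         # right-hand side of (23), same for every i

def I(s):                                        # single-letter mutual information in nats
    joint = Pi(s) * px[None, :]
    py = joint.sum(axis=1, keepdims=True)
    return np.sum(joint * np.log(joint / (py * px[None, :])))

print(est, (I(ser + h) - I(ser - h)) / (2 * h))  # should agree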

References

[1] D. Guo, S. Shamai, and S. Verdú, "Mutual information and minimum mean-square error in Gaussian channels," IEEE Trans. Inform. Theory, vol. 51, no. 4, pp. 1261-1282, April 2005.

[2] D. P. Palomar and S. Verdú, "Gradient of mutual information in linear vector Gaussian channels," in Proc. 2005 IEEE International Symposium on Information Theory (ISIT 2005), Adelaide, Australia, Sept. 4-9, 2005.

[3] D. Guo, S. Shamai, and S. Verdú, "Mutual information and conditional mean estimation in Poisson channels," in Proc. 2004 IEEE Information Theory Workshop, pp. 265-270, San Antonio, TX, USA, Oct. 24-29, 2004.

[4] D. Guo, S. Shamai, and S. Verdú, "Additive non-Gaussian noise channels: Mutual information and conditional mean estimation," in Proc. 2005 IEEE International Symposium on Information Theory (ISIT 2005), Adelaide, Australia, Sept. 4-9, 2005.

[5] C. Méasson, R. Urbanke, A. Montanari, and T. Richardson, "Life above threshold: From list decoding to area theorem and MSE," in Proc. 2004 IEEE Information Theory Workshop, San Antonio, TX, USA, 2004.

[6] D. Arnold and H.-A. Loeliger, "On the information rate of binary-input channels with memory," in Proc. 2001 IEEE International Conference on Communications (ICC 2001), pp. 2692-2695, Helsinki, Finland, June 11-14, 2001.

[7] V. Sharma and S. K. Singh, "Entropy and channel capacity in the regenerative setup with applications to Markov channels," in Proc. 2001 IEEE International Symposium on Information Theory (ISIT 2001), p. 283, Washington, DC, USA, June 24-29, 2001.

[8] H. D. Pfister, J. B. Soriaga, and P. H. Siegel, "On the achievable information rates of finite-state ISI channels," in Proc. 2001 IEEE Global Communications Conference (Globecom 2001), San Antonio, TX, USA, Nov. 25-29, 2001.

[9] J. Pearl, Probabilistic Reasoning in Intelligent Systems, 2nd ed. San Francisco, CA: Morgan Kaufmann, 1988.

[10] N. Wiberg, "Codes and decoding on general graphs," Ph.D. dissertation, Linköping University, Linköping, Sweden, 1996.

[11] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 498-519, Feb. 2001.

[12] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, and M. Weinberger, "Universal discrete denoising: Known channel," IEEE Trans. Inform. Theory, vol. 51, no. 1, pp. 5-28, Jan. 2005.

[13] J. Yu and S. Verdú, "Schemes for bi-directional modeling of discrete stationary sources," in Proc. 39th Conference on Information Sciences and Systems (CISS 2005), Johns Hopkins University, Baltimore, MD, March 16-18, 2005.