Relative Entropy and Score Function: New Information Estimation Relationships through Arbitrary Additive Perturbation

Dongning Guo
Department of Electrical Engineering & Computer Science, Northwestern University, Evanston, IL 60208, USA

Abstract: This paper establishes new information-estimation relationships pertaining to models with additive noise of arbitrary distribution. In particular, we study the change in the relative entropy between two probability measures when both of them are perturbed by a small amount of the same additive noise. It is shown that the rate of the change with respect to the energy of the perturbation can be expressed in terms of the mean square difference of the score functions of the two distributions and, rather surprisingly, is unrelated to the distribution of the perturbation otherwise. The result holds true for the classical relative entropy (or Kullback-Leibler distance), as well as two of its generalizations: Rényi's relative entropy and the f-divergence. The result generalizes a recent relationship between the relative entropy and mean square errors pertaining to Gaussian noise models, which in turn supersedes many previous information-estimation relationships. A generalization of the de Bruijn identity to non-Gaussian models can also be regarded as a consequence of this new result.

I. INTRODUCTION

To date, a number of connections between basic information measures and estimation measures have been discovered. By information measures we mean notions which describe the amount of information, such as entropy and mutual information, as well as several closely related quantities, such as differential entropy and relative entropy (also known as information divergence or Kullback-Leibler distance). By estimation measures we mean key notions in estimation theory, which include in particular the mean square error (MSE) and Fisher information, among others.
An early such connection is attributed to de Bruijn [1], which relates the differential entropy of an arbitrary random variable corrupted by Gaussian noise and its Fisher information:

$$\frac{\mathrm{d}}{\mathrm{d}\delta}\, h\big(X + \sqrt{\delta}\,N\big) = \frac{1}{2}\, J\big(X + \sqrt{\delta}\,N\big) \tag{1}$$

for every $\delta \ge 0$, where $X$ denotes an arbitrary random variable and $N \sim \mathcal{N}(0,1)$ denotes a standard Gaussian random variable independent of $X$ throughout this paper. Here $J(Y)$ denotes the Fisher information of the distribution of $Y$ with respect to (w.r.t.) the location family. The de Bruijn identity is equivalent to a recent connection between the input-output mutual information and the minimum mean-square error (MMSE) of a Gaussian model [2]:

$$\frac{\mathrm{d}}{\mathrm{d}\gamma}\, I\big(X; \sqrt{\gamma}\,X + N\big) = \frac{1}{2}\,\mathrm{mmse}(P_X, \gamma) \tag{2}$$

where $X \sim P_X$ and $\mathrm{mmse}(P_X, \gamma)$ denotes the MMSE of estimating $X$ given $\sqrt{\gamma}\,X + N$. The parameter $\gamma \ge 0$ is understood as the signal-to-noise ratio (SNR) of the Gaussian model. By-products of formula (2) include the representation of the entropy, differential entropy and the non-Gaussianness (measured in relative entropy) in terms of the MMSE [2]-[4]. Several generalizations and extensions of the previous results are found in [5]-[7]. Moreover, the derivatives of the mutual information and entropy w.r.t. channel gains have also been studied for non-additive-noise channels [7], [8].

Among the aforementioned information measures, relative entropy is the most general and versatile in the sense that it is defined for distributions which are discrete, continuous, or neither, and all the other information measures can be easily expressed in terms of relative entropy. The following relationship between the relative entropy and Fisher information is known [9]: Let $\{p_\theta\}$ be a family of probability density functions (pdfs) parameterized by $\theta \in \mathbb{R}$. Then

$$D(p_{\theta+\delta} \,\|\, p_\theta) = \frac{\delta^2}{2}\, J(p_\theta) + o(\delta^2) \tag{3}$$

where $J(p_\theta)$ is the Fisher information of $p_\theta$ w.r.t. $\theta$.

This work is supported by the NSF under grant CCF and DARPA under grant W911NF.

In a recent work [10], Verdú established an interesting relationship between the relative entropy and mismatched estimation.
Let $\mathrm{mse}_Q(P, \gamma)$ represent the MSE for estimating the input $X$ of distribution $P$ to a Gaussian channel of SNR equal to $\gamma$ based on the channel output, with the estimator assuming the prior distribution of $X$ to be $Q$. Then

$$2\,\frac{\mathrm{d}}{\mathrm{d}\gamma}\, D\big(P * \mathcal{N}(0, 1/\gamma) \,\big\|\, Q * \mathcal{N}(0, 1/\gamma)\big) = \mathrm{mse}_Q(P, \gamma) - \mathrm{mmse}(P, \gamma) \tag{4}$$

where the convolution $P * \mathcal{N}(0, 1/\gamma)$ represents the distribution of $X + N/\sqrt{\gamma}$ with $X \sim P$. Obviously $\mathrm{mse}_Q(P, \gamma) = \mathrm{mmse}(P, \gamma)$ if $Q$ is identical to $P$. The formula is particularly satisfying because the left-hand side (l.h.s.) is an information-theoretic measure of the mismatch between two distributions, whereas the right-hand side (r.h.s.) measures the mismatch using an estimation-theoretic metric, i.e., the increase in the estimation error due to the mismatch.

In another recent work, Narayanan and Srinivasa [11] consider an additive non-Gaussian noise channel model and provide the following generalization of the de Bruijn identity:

$$\frac{\mathrm{d}}{\mathrm{d}\delta}\, h\big(X + \sqrt{\delta}\,V\big)\Big|_{\delta=0} = \frac{1}{2}\, J(X) \tag{5}$$

where the pdf of $V$ is symmetric about 0, twice differentiable, and of unit variance but otherwise arbitrary. The significance of (5) is that the derivative does not depend on the detailed statistics of the noise. Thus, if we view the differential entropy as a manifold of the distribution of the perturbation $\sqrt{\delta}\,V$, the geometry of the manifold appears to be locally a bowl which is uniform in every direction of the perturbation.

In this work, we consider the change in the relative entropy between two distributions when both of them are perturbed by an infinitesimal amount of the same arbitrary additive noise. We show that the rate of this change can be expressed as the mean-square difference of the score functions of the distributions. Note that the score function is an important notion in estimation theory, whose mean square is the Fisher information. Like formula (5), the new general relationship turns out to be independent of the noise distribution. The general relationship is found to hold for both the classical relative entropy (or Kullback-Leibler distance) and the more general Rényi's relative entropy, as well as the general f-divergence due to Csiszár and independently Ali and Silvey [12]. In the special case of Gaussian perturbations, it is shown that (1), (2), (4), (5) can all be obtained as consequences of the new result.

II. MAIN RESULTS

Theorem 1: Let $\Psi$ denote an arbitrary distribution with zero mean and variance $\delta$. Let $P$ and $Q$ be two distributions whose respective pdfs $p$ and $q$ are twice differentiable. If $P \ll Q$ and

$$\lim_{|z| \to \infty} \frac{\mathrm{d}}{\mathrm{d}z}\left[ p(z) \log\frac{p(z)}{q(z)} \right] = 0 \tag{6}$$

then

$$\frac{\mathrm{d}}{\mathrm{d}\delta}\, D(P * \Psi \,\|\, Q * \Psi)\Big|_{\delta=0^+} = -\frac{1}{2} \int_{-\infty}^{\infty} p(z) \left[ \frac{\mathrm{d}}{\mathrm{d}z} \log\frac{p(z)}{q(z)} \right]^2 \mathrm{d}z \tag{7}$$

$$= -\frac{1}{2}\, \mathsf{E}_P\left\{ \big[ \nabla \log p(Z) - \nabla \log q(Z) \big]^2 \right\} \tag{8}$$

where the expectation in (8) is taken with respect to $Z \sim P$.

The classical relative entropy (Kullback-Leibler distance) is defined for two probability measures $P \ll Q$ as

$$D(P \,\|\, Q) = \int \log\frac{\mathrm{d}P}{\mathrm{d}Q}\, \mathrm{d}P. \tag{9}$$

When the corresponding densities exist, it is also customary to denote the relative entropy by $D(p \,\|\, q)$. The notation $\mathrm{d}/\mathrm{d}\delta$ in Theorem 1 can be understood as taking the derivative w.r.t. the variance of the distribution $\Psi$ with its shape fixed, i.e., $\Psi$ is the distribution of $\sqrt{\delta}\,V$ with the random variable $V$ fixed. We note that the r.h.s. of (8) does not depend on the distribution of $V$, i.e., the change in the relative entropy due to a small perturbation is proportional to the variance of the perturbation but independent of its shape. Thus the notation $\mathrm{d}/\mathrm{d}\delta$ is not ambiguous.

For every function $f$, let $\nabla f$ denote its derivative $f'$ for notational convenience. For every differentiable pdf $p$, the function $\nabla \log p(x) = p'(x)/p(x)$ is known as its score function, hence the r.h.s. of (8) is the mean square difference of two score functions. As with the previous result (4), this is satisfying because both sides of (8) represent some error due to the mismatch between the prior distribution $q$ supplied to the estimator and the actual distribution $p$. Obviously, if $p$ and $q$ are identical, both sides of the formula are equal to zero; otherwise, the derivative is negative (i.e., perturbation reduces relative entropy).

Consider now the Rényi relative entropy, which is defined for two probability measures $P \ll Q$ and every $\alpha > 0$ as

$$D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \int \left( \frac{\mathrm{d}P}{\mathrm{d}Q} \right)^{\alpha - 1} \mathrm{d}P \tag{10}$$

where $D_1(P \,\|\, Q)$ is defined as the classical relative entropy $D(P \,\|\, Q)$ because $\lim_{\alpha \to 1} D_\alpha(P \,\|\, Q) = D(P \,\|\, Q)$.

Theorem 2: Let the distributions $P$, $Q$ and $\Psi$ be defined the same way as in Theorem 1. Let $\delta$ denote the variance of $\Psi$. If $P \ll Q$ and

$$\lim_{|z| \to \infty} \frac{\mathrm{d}}{\mathrm{d}z}\left[ p^{\alpha}(z)\, q^{1-\alpha}(z) \right] = 0 \tag{11}$$

then

$$\frac{\mathrm{d}}{\mathrm{d}\delta}\, D_\alpha(P * \Psi \,\|\, Q * \Psi)\Big|_{\delta=0^+} = -\frac{\alpha}{2} \int_{-\infty}^{\infty} \left[ \frac{\mathrm{d}}{\mathrm{d}z} \log\frac{p(z)}{q(z)} \right]^2 \frac{p^{\alpha}(z)\, q^{1-\alpha}(z)}{\int_{-\infty}^{\infty} p^{\alpha}(u)\, q^{1-\alpha}(u)\, \mathrm{d}u}\, \mathrm{d}z. \tag{12}$$

Note that as $\alpha \to 1$, the r.h.s. of (12) becomes the r.h.s. of (7). We also point out that, similar to that in (7), the outer integral in (12) can be viewed as the mean square difference of two scores, $\nabla \log p(Z) - \nabla \log q(Z)$, with the pdf of $Z$ being proportional to $p^{\alpha}(z)\, q^{1-\alpha}(z)$.

Theorems 1 and 2 are quite general because conditions (6) and (11) are satisfied by most distributions of interest.
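As a numerical sanity check of Theorem 1 (not part of the original paper), the following Python sketch compares a one-sided finite difference of $D(P * \Psi \,\|\, Q * \Psi)$ at $\delta = 0^+$ with the score-function expression in (8). The specific choices here are assumptions made only for illustration: $P = \mathcal{N}(0,1)$, $Q = \mathcal{N}(1,4)$, a two-point (Rademacher) perturbation $V = \pm 1$, the integration grid, and the step size `delta`. All function names are hypothetical.

```python
import numpy as np


def gauss_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return np.exp(-(y - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def perturbed_pdf(y, mean, var, delta):
    """Density of Z + sqrt(delta) * V, with Z ~ N(mean, var) and V = +1 or -1
    with probability 1/2 each (zero mean, unit variance, not Gaussian)."""
    s = np.sqrt(delta)
    return 0.5 * (gauss_pdf(y - s, mean, var) + gauss_pdf(y + s, mean, var))


def relative_entropy(p, q, dy):
    """Riemann-sum approximation of D(p || q) = integral of p * log(p / q)."""
    return float(np.sum(p * np.log(p / q)) * dy)


# Illustrative (assumed) distributions: P = N(0, 1) and Q = N(1, 4).
MEAN_P, VAR_P = 0.0, 1.0
MEAN_Q, VAR_Q = 1.0, 4.0

# Uniform grid wide enough that both densities are negligible at the ends.
y = np.linspace(-15.0, 15.0, 30001)
dy = y[1] - y[0]


def divergence_after_perturbation(delta):
    """D(P * Psi || Q * Psi) when Psi is the law of sqrt(delta) * V."""
    p_delta = perturbed_pdf(y, MEAN_P, VAR_P, delta)
    q_delta = perturbed_pdf(y, MEAN_Q, VAR_Q, delta)
    return relative_entropy(p_delta, q_delta, dy)


# Left-hand side of (7): one-sided finite difference at delta = 0+.
delta = 1e-4
fd_slope = (divergence_after_perturbation(delta)
            - divergence_after_perturbation(0.0)) / delta

# Right-hand side of (8): -1/2 * E_P[(score_p - score_q)^2], using the
# closed-form scores of the two Gaussian densities.
p = gauss_pdf(y, MEAN_P, VAR_P)
score_p = -(y - MEAN_P) / VAR_P
score_q = -(y - MEAN_Q) / VAR_Q
predicted = -0.5 * float(np.sum(p * (score_p - score_q) ** 2) * dy)

print(f"finite-difference slope: {fd_slope:.5f}")
print(f"score-function formula : {predicted:.5f}")
```

For these assumed Gaussians the score difference is $-(3Z+1)/4$, so the r.h.s. of (8) equals $-\tfrac{1}{2}\cdot\tfrac{10}{16} = -0.3125$ in closed form. The finite-difference slope should agree with it up to a discretization error of order $\delta$, even though the perturbation is discrete rather than Gaussian.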
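Conditions (6) and (11) hold, for example, if both $p$ and $q$ belong to the exponential family, in which case the derivatives in (6) and (11) vanish exponentially fast. Not all distributions satisfy those conditions. This is because although the functions $p(z)$ and $p(z) \log(p(z)/q(z))$ integrate to 1 and $D(P \,\|\, Q)$ respectively, they need not vanish. For example, $p(z)$ may consist of narrower and narrower Gaussian pulses of the same height as $z \to \infty$, so that not only does $p(z)$ not vanish, but $p'(z)$ is unbounded.

Another family of generalized relative entropy, called the f-divergence, was introduced by Csiszár and independently by Ali and Silvey (see e.g., [12]). It is defined for $P \ll Q$ as

$$I_f(P \,\|\, Q) = \int f\left( \frac{\mathrm{d}P}{\mathrm{d}Q} \right) \mathrm{d}Q. \tag{13}$$

Theorem 3: Let the distributions $P$, $Q$ and $\Psi$ be defined the same way as in Theorem 1. Let $\delta$ denote the variance of $\Psi$. Suppose the second derivative of $f$ exists and is denoted by $f''$. If $P \ll Q$ and

$$\lim_{|z| \to \infty} \frac{\mathrm{d}}{\mathrm{d}z}\left[ q(z)\, f\!\left( \frac{p(z)}{q(z)} \right) \right] = 0 \tag{14}$$

then

$$\frac{\mathrm{d}}{\mathrm{d}\delta}\, I_f(P * \Psi \,\|\, Q * \Psi)\Big|_{\delta=0^+} = -\frac{1}{2} \int_{-\infty}^{\infty} q(y)\, f''\!\left( \frac{p(y)}{q(y)} \right) \left[ \nabla \frac{p(y)}{q(y)} \right]^2 \mathrm{d}y. \tag{15}$$

The integral in (15) can still be expressed in terms of the difference of the score functions because

$$\nabla \frac{p(y)}{q(y)} = \frac{p(y)}{q(y)} \big[ \nabla \log p(y) - \nabla \log q(y) \big]. \tag{16}$$

Note that the special case of $f(t) = t \log t$ corresponds to the Kullback-Leibler distance, whereas the case of $f(t) = (t^{q} - t)/(q - 1)$ corresponds to the Tsallis relative entropy [13] (in this expression $q$ denotes the Tsallis index, not the density of $Q$). Indeed, Theorem 1 is a special case of Theorem 3.

III. PROOF

The key property that underlies Theorems 1-3 is the following observation of the local geometry of an additive-noise-perturbed distribution made in [11]:

Lemma 1: Let the pdf $p$ of a random variable $Z$ be twice differentiable. Let $p_\delta$ denote the pdf of $Z + \sqrt{\delta}\,V$ where $V$ is of zero mean and unit variance, and is independent of $Z$. Then for every $y \in \mathbb{R}$, as $\delta \to 0^+$,

$$\frac{\partial}{\partial \delta}\, p_\delta(y) \to \frac{1}{2}\, \frac{\mathrm{d}^2}{\mathrm{d}y^2}\, p(y). \tag{17}$$

Formula (17) allows the derivative w.r.t. the energy of the perturbation $\delta$ to be transformed to the second derivative of the original pdf. In the Appendix we provide a brief proof for Lemma 1 which is slightly different from that in [11]. Note that Lemma 1 does not require the distribution of the perturbation to be symmetric as is required in [11]. In the following we first prove Theorem 3, which implies Theorem 1 as a special case, and then prove Theorem 2.

A. Proof for Theorem 3

Let $V$ be a random variable with fixed distribution $P_V$. For convenience, we use the shorthand $p_\delta$ to denote the pdf of the random variable $Y = Z + \sqrt{\delta}\,V$ with $(Z, V) \sim P \times P_V$. Similarly, let $q_\delta$ denote the pdf of $Y = Z + \sqrt{\delta}\,V$ with $(Z, V) \sim Q \times P_V$. Clearly,

$$\frac{\mathrm{d}}{\mathrm{d}\delta}\, I_f(P * \Psi \,\|\, Q * \Psi) = \frac{\mathrm{d}}{\mathrm{d}\delta}\, I_f(p_\delta \,\|\, q_\delta). \tag{18}$$

For any single-variable function $g$, let $g'$ and $g''$ denote its first and second derivative respectively. Consider now

$$\frac{\mathrm{d}}{\mathrm{d}\delta}\, I_f(p_\delta \,\|\, q_\delta) = \frac{\partial}{\partial \delta} \int q_\delta(y)\, f\!\left( \frac{p_\delta(y)}{q_\delta(y)} \right) \mathrm{d}y \tag{19}$$

$$= \int \left[ \frac{\partial q_\delta(y)}{\partial \delta}\, f\!\left( \frac{p_\delta(y)}{q_\delta(y)} \right) + q_\delta(y)\, \frac{\partial}{\partial \delta} f\!\left( \frac{p_\delta(y)}{q_\delta(y)} \right) \right] \mathrm{d}y \tag{20}$$

$$= \int \left\{ \frac{\partial q_\delta(y)}{\partial \delta} \left[ f\!\left( \frac{p_\delta(y)}{q_\delta(y)} \right) - \frac{p_\delta(y)}{q_\delta(y)}\, f'\!\left( \frac{p_\delta(y)}{q_\delta(y)} \right) \right] + \frac{\partial p_\delta(y)}{\partial \delta}\, f'\!\left( \frac{p_\delta(y)}{q_\delta(y)} \right) \right\} \mathrm{d}y. \tag{21}$$

Invoking Lemma 1 on (21) yields

$$\frac{\mathrm{d}}{\mathrm{d}\delta}\, I_f(p_\delta \,\|\, q_\delta)\Big|_{\delta=0^+} = \frac{1}{2} \int \left\{ p''(y)\, f'\!\left( \frac{p(y)}{q(y)} \right) + q''(y) \left[ f\!\left( \frac{p(y)}{q(y)} \right) - \frac{p(y)}{q(y)}\, f'\!\left( \frac{p(y)}{q(y)} \right) \right] \right\} \mathrm{d}y. \tag{22}$$

To proceed, we reorganize the integrand in (22) to the desired form. The key technique is integration by parts, which we carry out implicitly with the help of a modest amount of foresight. For convenience, we use $p$ and $q$ as shorthand for $p(y)$ and $q(y)$ respectively, write $f$, $f'$ and $f''$ for these functions evaluated at $p/q$, and let a prime outside a bracket denote the derivative w.r.t. $y$. We use the fact $g'' h = (g' h)' - g' h'$ to rewrite the integrand in (22) as

$$p'' f' + q'' \left( f - \frac{p}{q} f' \right) = \big[ p' f' \big]' + \left[ q' \left( f - \frac{p}{q} f' \right) \right]' - p' \big[ f' \big]' - q' \left[ f - \frac{p}{q} f' \right]'. \tag{23}$$

Combining the first two terms and simplifying the last term on the r.h.s. of (23) yield

$$\big[ q f \big]'' - p' \big[ f' \big]' + \frac{p}{q}\, q' \big[ f' \big]'. \tag{24}$$

The first term in (24) integrates to zero by assumption (14). The last two terms in (24) can be combined to obtain

$$-\left[ p' - \frac{p}{q}\, q' \right] \big[ f' \big]' = -q \left[ \left( \frac{p}{q} \right)' \right]^2 f''. \tag{25}$$

Collecting the results from (22) to (25), we have

$$\frac{\mathrm{d}}{\mathrm{d}\delta}\, I_f(p_\delta \,\|\, q_\delta)\Big|_{\delta=0^+} = -\frac{1}{2} \int q(y) \left[ \frac{\mathrm{d}}{\mathrm{d}y}\, \frac{p(y)}{q(y)} \right]^2 f''\!\left( \frac{p(y)}{q(y)} \right) \mathrm{d}y \tag{26}$$

which is equivalent to (15). Hence the proof of Theorem 3.

We note that the preceding calculation is tantamount to two uses of integration by parts. The treatment here, however, requires the minimum regularity conditions on the densities $p$ and $q$.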
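The statement just proved can also be checked numerically. The sketch below (an illustration added here, not the author's code) evaluates both sides of (15) for the chi-square choice $f(t) = (t-1)^2$, for which $f'' = 2$, and deliberately uses a skewed two-point perturbation to illustrate that Lemma 1 does not need symmetry. The distributions $P = \mathcal{N}(0,1)$, $Q = \mathcal{N}(1,4)$, the values of $V$, the grid and the step size are all assumptions, and every function name is hypothetical.

```python
import numpy as np


def gauss_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return np.exp(-(y - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


# Assumed skewed perturbation: V = 2 w.p. 1/5 and V = -1/2 w.p. 4/5.
# It has zero mean and unit variance but a nonzero third moment.
V_VALUES = np.array([2.0, -0.5])
V_PROBS = np.array([0.2, 0.8])


def perturbed_pdf(y, mean, var, delta):
    """Density of Z + sqrt(delta) * V with Z ~ N(mean, var), as a mixture."""
    s = np.sqrt(delta)
    return sum(w * gauss_pdf(y - s * v, mean, var)
               for v, w in zip(V_VALUES, V_PROBS))


def f(t):
    """Convex function generating the chi-square divergence."""
    return (t - 1.0) ** 2


def f_second_derivative(t):
    """Second derivative of f, constant and equal to 2."""
    return 2.0 * np.ones_like(t)


def f_divergence(p, q, dy):
    """Riemann-sum approximation of I_f(p || q) = integral of q * f(p / q)."""
    return float(np.sum(q * f(p / q)) * dy)


# Illustrative (assumed) distributions: P = N(0, 1) and Q = N(1, 4).
MEAN_P, VAR_P = 0.0, 1.0
MEAN_Q, VAR_Q = 1.0, 4.0

y = np.linspace(-15.0, 15.0, 30001)
dy = y[1] - y[0]


def divergence_after_perturbation(delta):
    """I_f(P * Psi || Q * Psi) when Psi is the law of sqrt(delta) * V."""
    return f_divergence(perturbed_pdf(y, MEAN_P, VAR_P, delta),
                        perturbed_pdf(y, MEAN_Q, VAR_Q, delta), dy)


# Left-hand side of (15): one-sided finite difference at delta = 0+.
# A skewed V leaves an error of order sqrt(delta), hence the small step.
delta = 1e-6
fd_slope = (divergence_after_perturbation(delta)
            - divergence_after_perturbation(0.0)) / delta

# Right-hand side of (15), with the derivative of p/q expressed through
# the score difference as in (16).
p = gauss_pdf(y, MEAN_P, VAR_P)
q = gauss_pdf(y, MEAN_Q, VAR_Q)
ratio = p / q
score_diff = -(y - MEAN_P) / VAR_P + (y - MEAN_Q) / VAR_Q
ratio_derivative = ratio * score_diff
predicted = -0.5 * float(
    np.sum(q * f_second_derivative(ratio) * ratio_derivative ** 2) * dy)

print(f"finite-difference slope: {fd_slope:.5f}")
print(f"formula (15)           : {predicted:.5f}")
```

Swapping in another twice-differentiable `f` (together with its second derivative) exercises other members of the f-divergence family, provided condition (14) holds for the chosen densities.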

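B. Proof for Theorem 2

Consider now

$$\frac{\mathrm{d}}{\mathrm{d}\delta}\, D_\alpha(p_\delta \,\|\, q_\delta) = \frac{1}{\alpha - 1}\, \frac{\partial}{\partial \delta} \log \int p_\delta^{\alpha}(y)\, q_\delta^{1-\alpha}(y)\, \mathrm{d}y \tag{27}$$

$$= \frac{1}{\alpha - 1} \int \frac{\partial}{\partial \delta} \left[ p_\delta^{\alpha}(y)\, q_\delta^{1-\alpha}(y) \right] \mathrm{d}y \Big/ \int p_\delta^{\alpha}(y)\, q_\delta^{1-\alpha}(y)\, \mathrm{d}y. \tag{28}$$

By Lemma 1, the integral in the numerator in (28) can be written as

$$\int \left[ \alpha \left( \frac{p_\delta(y)}{q_\delta(y)} \right)^{\alpha - 1} \frac{\partial p_\delta(y)}{\partial \delta} + (1 - \alpha) \left( \frac{p_\delta(y)}{q_\delta(y)} \right)^{\alpha} \frac{\partial q_\delta(y)}{\partial \delta} \right] \mathrm{d}y = \int \left[ \frac{\alpha}{2} \left( \frac{p(y)}{q(y)} \right)^{\alpha - 1} p''(y) + \frac{1 - \alpha}{2} \left( \frac{p(y)}{q(y)} \right)^{\alpha} q''(y) \right] \mathrm{d}y \tag{29}$$

at $\delta = 0^+$. Note that (11) implies that

$$\alpha \left( \frac{p(y)}{q(y)} \right)^{\alpha - 1} p'(y) + (1 - \alpha) \left( \frac{p(y)}{q(y)} \right)^{\alpha} q'(y) \tag{30}$$

vanishes as $|y| \to \infty$. Using integration by parts, we further equate the integral on the r.h.s. of (29) to

$$-\frac{1}{2} \int \left\{ \alpha \left[ \left( \frac{p}{q} \right)^{\alpha - 1} \right]' p' + (1 - \alpha) \left[ \left( \frac{p}{q} \right)^{\alpha} \right]' q' \right\} \mathrm{d}y. \tag{31}$$

The integrand in (31) can be written as

$$\frac{\alpha (1 - \alpha)}{2} \left[ \left( \frac{p}{q} \right)^{\alpha - 2} \left( \frac{p}{q} \right)' p' - \left( \frac{p}{q} \right)^{\alpha - 1} \left( \frac{p}{q} \right)' q' \right]. \tag{32}$$

In the following we omit the coefficient $\alpha(1 - \alpha)/2$ to write the remaining terms in (32) as

$$\left( \frac{p}{q} \right)^{\alpha - 2} \frac{p' q - p q'}{q^2} \left( p' - \frac{p}{q}\, q' \right) = p^{\alpha - 2} q^{1 - \alpha} (p')^2 - 2 p^{\alpha - 1} q^{-\alpha} p' q' + p^{\alpha} q^{-1 - \alpha} (q')^2 \tag{33}$$

$$= \big( \nabla \log p - \nabla \log q \big)^2\, p^{\alpha} q^{1 - \alpha}. \tag{34}$$

Collecting the preceding results from (28) to (34), we have established (12) in Theorem 2.

IV. RECOVERING EXISTING INFORMATION-ESTIMATION RELATIONSHIPS USING THEOREM 1

A. Mutual Information and MMSE

We first use Theorem 1 to recover formula (2) established in [2]. For convenience, consider the following alternative Gaussian model:

$$Z = X + \sigma W \tag{35}$$

where $X$ and $W \sim \mathcal{N}(0,1)$ are independent so that $\sigma^2$ represents the noise variance. It suffices to show the following result, which is equivalent to (2):

$$\frac{\mathrm{d}}{\mathrm{d}\sigma^2}\, I(X; X + \sigma W) = -\frac{1}{2\sigma^4}\, \mathrm{mmse}\!\left( P_X, \frac{1}{\sigma^2} \right). \tag{36}$$

For any $x \in \mathbb{R}$, let $P_{Z|X=x}$ denote the distribution of $Z$ as the output of the model (35) conditioned on $X = x$. The mutual information can be expressed as

$$I(X; X + \sigma W) = I(X; Z) = D\big( P_{Z|X} \,\big\|\, P_Z \,\big|\, P_X \big) \tag{37}$$

which is the average of $D(P_{Z|X=x} \,\|\, P_Z)$ over $x$ according to the distribution $P_X$, which does not depend on $\sigma^2$. Let $N \sim \mathcal{N}(0,1)$ be independent of $Z$. Consider the derivative of $D(P_{Z|X=x} \,\|\, P_Z)$ w.r.t. $\sigma^2$, or equivalently, by introducing a small perturbation,

$$\frac{\mathrm{d}}{\mathrm{d}\delta}\, D\big( P_{Z + \sqrt{\delta} N | X = x} \,\big\|\, P_{Z + \sqrt{\delta} N} \big)\Big|_{\delta=0^+} = -\frac{1}{2}\, \mathsf{E}\left\{ \big[ \nabla \log p_{Z|X=x}(Z) - \nabla \log p_Z(Z) \big]^2 \right\} \tag{38}$$

due to Theorem 1, where the expectation is over $P_{Z|X=x}$, which is a Gaussian distribution centered at $x$ with variance $\sigma^2$. The first score is easy to evaluate: $\nabla \log p_{Z|X=x}(Z) = (x - Z)/\sigma^2$. The second score is determined by the following simple variation of a result due to Esposito [14] (see also Lemma 2 in [2]):

Lemma 2: $\nabla \log p_Z(z) = \big( \mathsf{E}\{X \mid Z = z\} - z \big)/\sigma^2$.

Clearly, the r.h.s. of (38) becomes

$$-\frac{1}{2\sigma^4}\, \mathsf{E}_{P_{Z|X=x}}\left\{ \big( x - \mathsf{E}\{X \mid Z\} \big)^2 \right\}, \tag{39}$$

the average of which over $x$ is equal to the r.h.s. of (36). Thus (36) is established, and so is (2).

B. Differential Entropy and MMSE

Consider again the model (35). It is not difficult to see

$$D\big( P_{Z + \sqrt{\delta} N} \,\big\|\, \mathcal{N}(0, \sigma^2 + \delta) \big) = \frac{1}{2} \log\big( 2\pi e (\sigma^2 + \delta) \big) + \frac{\mathsf{E} X^2}{2(\sigma^2 + \delta)} - h\big( Z + \sqrt{\delta}\, N \big). \tag{40}$$

By Theorem 1 and Lemma 2, we have

$$\frac{\mathrm{d}}{\mathrm{d}\delta}\, D\big( P_{Z + \sqrt{\delta} N} \,\big\|\, \mathcal{N}(0, \sigma^2 + \delta) \big)\Big|_{\delta=0^+} = -\frac{1}{2}\, \mathsf{E}\left\{ \left( \frac{\mathsf{E}\{X \mid Z\} - Z}{\sigma^2} + \frac{Z}{\sigma^2} \right)^2 \right\} \tag{41}$$

$$= -\frac{\mathsf{E}\left\{ (\mathsf{E}\{X \mid Z\})^2 \right\}}{2\sigma^4}. \tag{42}$$

Plugging into (40), we have

$$\frac{\mathrm{d}}{\mathrm{d}\sigma^2} \left[ h(Z) - \frac{1}{2} \log\big( 2\pi e (1 + \sigma^2) \big) \right] = \frac{1}{2 (1 + \sigma^2) \sigma^2} - \frac{\mathsf{E} X^2 - \mathsf{E}\left\{ (\mathsf{E}\{X \mid Z\})^2 \right\}}{2\sigma^4}. \tag{43}$$

Note that $\mathsf{E} X^2 - \mathsf{E}\{(\mathsf{E}\{X \mid Z\})^2\} = \mathrm{mmse}(P_X, 1/\sigma^2)$. Moreover, $h(X) = h(Z)|_{\sigma = 0}$, and $h(Z) - \frac{1}{2} \log(2\pi e (1 + \sigma^2))$ vanishes as $\sigma^2 \to \infty$. Therefore, by integrating w.r.t. $\sigma^2$ from 0 to $\infty$, we obtain

$$h(X) = \frac{1}{2} \log(2\pi e) + \frac{1}{2} \int_0^{\infty} \left[ \frac{1}{s^2}\, \mathrm{mmse}\!\left( P_X, \frac{1}{s} \right) - \frac{1}{s(s + 1)} \right] \mathrm{d}s \tag{44}$$

which is equivalent to the integral expression in [2], in which the SNR is used as the integration variable.

C. Relative Entropy and MMSE

The connection (4) between relative entropy and MMSE can also be regarded as a special case of Theorem 1. Consider again the model (35), now with noise variance $\delta$ so that $Z = X + \sqrt{\delta}\, N$, and apply Theorem 1. We have

$$2\, \frac{\mathrm{d}}{\mathrm{d}\delta}\, D\big( P_{X + \sqrt{\delta} N} \,\big\|\, Q_{X + \sqrt{\delta} N} \big) = -\mathsf{E}_P\left\{ \big[ \nabla \log p_Z(X + \sqrt{\delta}\, N) - \nabla \log q_Z(X + \sqrt{\delta}\, N) \big]^2 \right\}. \tag{45}$$

By Lemma 2, each score equals $(\mathsf{E}\{X \mid Z\} - Z)/\delta$ under the corresponding prior, so the r.h.s. of (45) can be rewritten as $-\delta^{-2}$ times

$$\mathsf{E}_P\left\{ \big( \mathsf{E}_P\{X \mid Z\} - \mathsf{E}_Q\{X \mid Z\} \big)^2 \right\} = \mathsf{E}_P\left\{ \big[ (X - \mathsf{E}_Q\{X \mid Z\}) - (X - \mathsf{E}_P\{X \mid Z\}) \big]^2 \right\} \tag{46}$$

$$= \mathsf{E}_P\left\{ (X - \mathsf{E}_Q\{X \mid Z\})^2 \right\} + \mathsf{E}_P\left\{ (X - \mathsf{E}_P\{X \mid Z\})^2 \right\} - 2\, \mathsf{E}_P\left\{ (X - \mathsf{E}_Q\{X \mid Z\})(X - \mathsf{E}_P\{X \mid Z\}) \right\}. \tag{47}$$

Using the orthogonality of $X - \mathsf{E}_P\{X \mid Z\}$ and every function of $Z$ under probability measure $P$, we can replace $\mathsf{E}_Q\{X \mid Z\}$ in the last term by $\mathsf{E}_P\{X \mid Z\}$ (both are functions of $Z$), and continue the equality as

$$\mathsf{E}_P\left\{ (X - \mathsf{E}_Q\{X \mid Z\})^2 \right\} + \mathsf{E}_P\left\{ (X - \mathsf{E}_P\{X \mid Z\})^2 \right\} - 2\, \mathsf{E}_P\left\{ (X - \mathsf{E}_P\{X \mid Z\})(X - \mathsf{E}_P\{X \mid Z\}) \right\} = \mathsf{E}_P\left\{ (X - \mathsf{E}_Q\{X \mid Z\})^2 \right\} - \mathsf{E}_P\left\{ (X - \mathsf{E}_P\{X \mid Z\})^2 \right\} \tag{48}$$

$$= \mathrm{mse}_Q(P_X, \gamma) - \mathrm{mmse}(P_X, \gamma) \tag{49}$$

where $\gamma = 1/\delta$. Since $\mathrm{d}/\mathrm{d}\gamma = -\delta^2\, \mathrm{d}/\mathrm{d}\delta$, this yields the desired formula (4).

D. Differential Entropy and Fisher Information

The generalized de Bruijn identity (5) can be recovered basically by inspection of (8). Consider a distribution $Q_Z$ which is uniform on $[-m, m]$ with $m$ being a large number and vanishes smoothly outside the interval (e.g., a raised-cosine function with roll-off). Then $Q_{Z + \sigma N}$ remains essentially uniform, so that $\nabla \log q_Z(z) \approx 0$ over almost all the probability mass of $P_Z$. As $m \to \infty$, (8) reduces to (5).

V. CONCLUDING REMARKS

The relationships connecting the score function and various forms of relative entropy shown in this paper are the most general for additive-noise models to date. It is by now clear that such derivative relationships between basic information- and estimation-theoretic measures rely on neither the normality of the additive perturbation, nor the logarithm functional in classical information measures. The results, however, do not directly translate into integral relationships unless the noise is Gaussian, which has the infinite divisibility property.

VI. ACKNOWLEDGEMENTS

The author would like to thank Sergio Verdú for sharing an earlier conjecture of his on the relationship between the Kullback-Leibler distance and MSE, which inspired this work.

APPENDIX: PROOF OF LEMMA 1

Proof: Recall that $p_\delta$ denotes the pdf of $Y = Z + \sqrt{\delta}\, V$. Denote the characteristic function of $p_\delta$ as

$$\varphi(u, \delta) = \mathsf{E}\left\{ e^{iuY} \right\}. \tag{50}$$

Due to independence of $Z$ and $V$,

$$\varphi(u, \delta) = \mathsf{E}\left\{ e^{iuZ} \right\} \mathsf{E}\left\{ e^{iu\sqrt{\delta}\, V} \right\} \tag{51}$$

$$= \varphi(u, 0)\, \mathsf{E}\left\{ \sum_{k=0}^{\infty} \frac{(iu\sqrt{\delta}\, V)^k}{k!} \right\} \tag{52}$$

$$= \varphi(u, 0) \left[ 1 + \frac{\delta (iu)^2}{2} + \sum_{k=3}^{\infty} \frac{(iu\sqrt{\delta})^k}{k!}\, \mathsf{E} V^k \right] \tag{53}$$

where we have used the assumptions that $V$ has zero mean and unit variance. Note also that the series sum in (53) vanishes as $o(\delta)$. Taking the inverse Fourier transform on both sides of (53) yields

$$p_\delta(y) = p_0(y) + \frac{\delta}{2}\, \frac{\mathrm{d}^2}{\mathrm{d}y^2}\, p_0(y) + o(\delta). \tag{54}$$

Hence Lemma 1 is proved as $p_0(y) = p(y)$.

REFERENCES

[1] A. J. Stam, "Some inequalities satisfied by the quantities of information of Fisher and Shannon," Information and Control, vol. 2, pp. 101-112, 1959.
[2] D. Guo, S. Shamai, and S. Verdú, "Mutual information and minimum mean-square error in Gaussian channels," IEEE Trans. Inform. Theory, vol. 51, Apr. 2005.
[3] D. Guo, S. Shamai, and S. Verdú, "Proof of entropy power inequalities via MMSE," in Proc. IEEE Int. Symp. Inform. Theory, pp. 1011-1015, Seattle, WA, USA, July 2006.
[4] S. Verdú and D. Guo, "A simple proof of the entropy power inequality," IEEE Trans. Inform. Theory, May 2006.
[5] M. Zakai, "On mutual information, likelihood ratios, and estimation error for the additive Gaussian channel," IEEE Trans. Inform. Theory, vol. 51, Sept. 2005.
[6] D. Guo, S. Shamai, and S. Verdú, "Additive non-Gaussian noise channels: Mutual information and conditional mean estimation," in Proc. IEEE Int. Symp. Inform. Theory, Adelaide, Australia, Sept. 2005.
[7] D. Palomar and S. Verdú, "Representation of mutual information via input estimates," IEEE Trans. Inform. Theory, vol. 53, Feb. 2007.
[8] C. Méasson, A. Montanari, and R. Urbanke, "Maxwell construction: The hidden bridge between iterative and maximum a posteriori decoding," IEEE Trans. Inform. Theory, vol. 54, 2008.
[9] S. Kullback, Information Theory and Statistics. New York: Dover.
[10] S. Verdú, "Mismatched estimation and relative entropy," in Proc. IEEE Int. Symp. Inform. Theory, Seoul, Korea, 2009.
[11] K. R. Narayanan and A. R. Srinivasa, "On the thermodynamic temperature of a general distribution."
[12] I. Csiszár, "Axiomatic characterizations of information measures," Entropy, vol. 10, 2008.
[13] C. Tsallis, "Possible generalization of Boltzmann-Gibbs statistics," Journal of Statistical Physics, vol. 52, 1988.
[14] R. Esposito, "On a relation between detection and estimation in decision theory," Information and Control, vol. 12, pp. 116-120, 1968.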
