Upper Bounds on the Relative Entropy and Rényi Divergence as a Function of Total Variation Distance for Finite Alphabets

Igal Sason
Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel
E-mail: sason@ee.technion.ac.il

Sergio Verdú
Department of Electrical Engineering, Princeton University, Princeton, New Jersey 08544, USA
E-mail: verdu@princeton.edu

arXiv:1503.03417v4 [cs.IT] 7 Oct 2015

Abstract

A new upper bound on the relative entropy is derived as a function of the total variation distance for probability measures defined on a common finite alphabet. The bound improves a previously reported bound by Csiszár and Talata. It is further extended to an upper bound on the Rényi divergence of an arbitrary non-negative order (including $\infty$) as a function of the total variation distance.

Keywords: Pinsker's inequality, relative entropy, relative information, Rényi divergence, total variation distance.

1. INTRODUCTION

Consider two probability distributions $P$ and $Q$ defined on a common measurable space $(A, \mathcal{F})$. The Csiszár-Kemperman-Kullback-Pinsker inequality (a.k.a. Pinsker's inequality) states that

$$\tfrac{1}{2}\,|P-Q|^2 \log e \le D(P\|Q) \tag{1}$$

where

$$D(P\|Q) = \mathbb{E}_P\!\left[\log \frac{dP}{dQ}\right] = \int_A dP \,\log \frac{dP}{dQ} \tag{2}$$

designates the relative entropy (a.k.a. the Kullback-Leibler divergence) from $P$ to $Q$, and

$$|P-Q| = 2 \sup_{F \in \mathcal{F}} |P(F) - Q(F)| \tag{3}$$

is the total variation distance between $P$ and $Q$.

In the case where the probability measures $P$ and $Q$ are defined on a common discrete (i.e., finite or countable) set $A$,

$$D(P\|Q) = \sum_{a \in A} P(a) \log \frac{P(a)}{Q(a)}, \tag{4}$$

$$|P-Q| = \sum_{a \in A} |P(a) - Q(a)|. \tag{5}$$

One of the implications of (1) is that convergence in relative entropy implies convergence in total variation distance.
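The discrete formulas (4) and (5), together with Pinsker's inequality (1), can be checked numerically. The following is a minimal sketch, assuming natural logarithms (so that $\log e = 1$) and two illustrative distributions that do not come from the paper; the function names are hypothetical.

```python
import math

def relative_entropy(p, q):
    """D(P||Q) in nats, as in (4); assumes q[a] > 0 wherever p[a] > 0."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def total_variation(p, q):
    """|P - Q| as in (5): the l1 distance, which lies in [0, 2]."""
    return sum(abs(pa - qa) for pa, qa in zip(p, q))

# Illustrative distributions on a three-letter alphabet (an assumption, not from the paper).
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

d = relative_entropy(P, Q)
tv = total_variation(P, Q)
pinsker_lhs = 0.5 * tv ** 2  # (1/2) |P - Q|^2 log e, with log e = 1 in nats

print(f"D(P||Q) = {d:.5f} nats, |P - Q| = {tv:.3f}, Pinsker lower bound = {pinsker_lhs:.5f}")
```

With these inputs the left side of (1) stays below the relative entropy, as the inequality requires.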
The total variation distance is bounded, $|P-Q| \le 2$, whereas the relative entropy is an unbounded information measure. Improved versions of Pinsker's inequality were studied, e.g., in [9], [10], [14], [17], [22].

A reverse Pinsker inequality providing an upper bound on the relative entropy in terms of the total variation distance does not exist in general since we can find distributions that are arbitrarily close in total variation but with arbitrarily high relative entropy. Nevertheless, it is possible to introduce constraints under which such reverse Pinsker inequalities can be obtained. In the case of a finite alphabet $A$, Csiszár and Talata [6, p. 1012] show that

$$D(P\|Q) \le \frac{\log e}{Q_{\min}}\,|P-Q|^2, \tag{6}$$

where

$$Q_{\min} \triangleq \min_{a \in A} Q(a). \tag{7}$$

Recent applications of (6) can be found in [12, Appendix D] and [21, Lemma 7] for the analysis of the third-order asymptotics of the discrete memoryless channel with or without cost constraints.

In addition to $Q_{\min}$ in (7), the bounds in this paper involve

$$\beta_1 = \min_{a \in A} \frac{Q(a)}{P(a)}, \tag{8}$$

$$\beta_2 = \min_{a \in A} \frac{P(a)}{Q(a)} \tag{9}$$

so, $\beta_1, \beta_2 \in [0, 1]$.

In this paper, Section 2 derives a reverse Pinsker inequality for probability measures defined on a common finite set, improving the bound in (6). The utility of this inequality is studied in Section 3, and it is extended in Section 4 to Rényi divergences of an arbitrary non-negative order.

2. A NEW REVERSE PINSKER INEQUALITY FOR DISTRIBUTIONS ON A FINITE SET

The present section introduces a strengthened version of (6), followed by some remarks and an example.
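The quantities $Q_{\min}$, $\beta_1$ and $\beta_2$ in (7)-(9) and the Csiszár-Talata bound (6) are simple to evaluate for a concrete pair of distributions. The sketch below assumes natural logarithms and a strictly positive $Q$; the helper names and the example distributions are illustrative choices, not taken from the paper.

```python
import math

def relative_entropy(p, q):
    """D(P||Q) in nats; assumes q[a] > 0 wherever p[a] > 0."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def total_variation(p, q):
    """|P - Q| as the l1 distance, in [0, 2]."""
    return sum(abs(pa - qa) for pa, qa in zip(p, q))

def q_min(q):
    """Q_min in (7): the smallest mass of Q."""
    return min(q)

def beta_1(p, q):
    """beta_1 in (8): min over a of Q(a)/P(a); letters with P(a) = 0 cannot attain the minimum."""
    return min(qa / pa for pa, qa in zip(p, q) if pa > 0)

def beta_2(p, q):
    """beta_2 in (9): min over a of P(a)/Q(a); assumes Q is strictly positive."""
    return min(pa / qa for pa, qa in zip(p, q))

def csiszar_talata_bound(p, q):
    """Right side of (6) in nats: |P - Q|^2 / Q_min."""
    return total_variation(p, q) ** 2 / q_min(q)

# Illustrative distributions (an assumption, not from the paper).
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

print("Q_min  =", q_min(Q))
print("beta_1 =", beta_1(P, Q))
print("beta_2 =", beta_2(P, Q))
print("D(P||Q) =", relative_entropy(P, Q), "<=", csiszar_talata_bound(P, Q))
```

Both $\beta_1$ and $\beta_2$ fall in $[0, 1]$ because $P$ and $Q$ each sum to one, so the likelihood ratio cannot exceed 1 on every letter.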
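A. Main Result and Proof

Theorem 1. Let $P$ and $Q$ be probability measures defined on a common finite set $A$, and assume that $Q$ is strictly positive on $A$. Then, the following inequality holds:

$$D(P\|Q) \le \log\left(1 + \frac{|P-Q|^2}{2Q_{\min}}\right) - \frac{\beta_2 \log e}{2}\,|P-Q|^2 \tag{10}$$

$$\le \log\left(1 + \frac{|P-Q|^2}{2Q_{\min}}\right) \tag{11}$$

where $Q_{\min}$ and $\beta_2$ are given in (7) and (9), respectively.

Proof: Theorem 1 is proved by obtaining upper and lower bounds on the $\chi^2$-divergence from $P$ to $Q$

$$\chi^2(P\|Q) \triangleq \sum_{a \in A} \frac{(P(a)-Q(a))^2}{Q(a)}. \tag{12}$$

A lower bound follows by invoking Jensen's inequality

$$\chi^2(P\|Q) = \sum_{a \in A} \frac{P^2(a)}{Q(a)} - 1 \tag{13}$$

$$= \sum_{a \in A} P(a) \exp\left(\log \frac{P(a)}{Q(a)}\right) - 1 \tag{14}$$

$$\ge \exp\left(\sum_{a \in A} P(a) \log \frac{P(a)}{Q(a)}\right) - 1 \tag{15}$$

$$= \exp\bigl(D(P\|Q)\bigr) - 1. \tag{16}$$

Alternatively, (16) can be obtained by combining the equality

$$\chi^2(P\|Q) = \exp\bigl(D_2(P\|Q)\bigr) - 1 \tag{17}$$

with the monotonicity of the Rényi divergence $D_\alpha(P\|Q)$ in $\alpha$, which implies that $D_2(P\|Q) \ge D(P\|Q)$.

A refined version of (16) is derived in the following. The starting point is a refined version of Jensen's inequality in [20], generalizing a result from [7], which leads to (see [20, Theorem 7])

$$\min_{a \in A} \frac{P(a)}{Q(a)} \, D(Q\|P) \le \log\bigl(1 + \chi^2(P\|Q)\bigr) - D(P\|Q) \tag{18}$$

$$\le \max_{a \in A} \frac{P(a)}{Q(a)} \, D(Q\|P). \tag{19}$$

From (18) and the definition of $\beta_2$ in (9), we have

$$\chi^2(P\|Q) \ge \exp\bigl(D(P\|Q) + \beta_2 \, D(Q\|P)\bigr) - 1 \tag{20}$$

$$\ge \exp\left(D(P\|Q) + \frac{\beta_2 \log e}{2}\,|P-Q|^2\right) - 1 \tag{21}$$

where (20) follows from (18) and the definition of $\beta_2$ in (9), and (21) follows from Pinsker's inequality (1) applied to $D(Q\|P)$. Note that the lower bound in (21) refines the lower bound in (16) since $\beta_2 \in [0, 1]$.

An upper bound on $\chi^2(P\|Q)$ is derived as follows:

$$\chi^2(P\|Q) = \sum_{a \in A} \frac{(P(a)-Q(a))^2}{Q(a)} \le \frac{\|P-Q\|_2^2}{Q_{\min}} \tag{22}$$

$$\le \frac{|P-Q| \, \max_{a \in A} |P(a)-Q(a)|}{Q_{\min}} \tag{23}$$

and, from (3),

$$|P-Q| \ge 2 \max_{a \in A} |P(a)-Q(a)|. \tag{24}$$

Combining (23) and (24) yields

$$\chi^2(P\|Q) \le \frac{|P-Q|^2}{2Q_{\min}}. \tag{25}$$

Finally, (10) follows by combining the upper and lower bounds on the $\chi^2$-divergence in (21) and (25), and (11) follows since $\beta_2 \ge 0$.

Remark 1. It is easy to check that Theorem 1 strengthens the bound by Csiszár and Talata in (6) by at least a factor of 2 since upper bounding the logarithm in (10) gives

$$D(P\|Q) \le \frac{(1 - \beta_2 Q_{\min}) \log e}{2Q_{\min}}\,|P-Q|^2. \tag{26}$$

In the finite-alphabet case, we can obtain another upper bound on $D(P\|Q)$ as a function of the $\ell_2$ norm $\|P-Q\|_2$:

$$D(P\|Q) \le \log\left(1 + \frac{\|P-Q\|_2^2}{Q_{\min}}\right) - \frac{\beta_2 \log e}{2}\,\|P-Q\|_2^2 \tag{27}$$

which follows by combining (21), (22), and $\|P-Q\|_2 \le |P-Q|$. Using the inequality $\log(1+x) \le x \log e$ for $x \ge 0$ in the right side of (27), and also loosening this bound by ignoring the term $\frac{\beta_2 \log e}{2}\,\|P-Q\|_2^2$, we recover the bound

$$D(P\|Q) \le \frac{\|P-Q\|_2^2 \, \log e}{Q_{\min}} \tag{28}$$

which appears in the proof of Property 4 of [21, Lemma 7], and is also used in [12].

Remark 2.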
The lower bounds on the $\chi^2$-divergence in (16) and (21) improve the one in [26, Lemma 6.3] which states that $D(P\|Q) \le \chi^2(P\|Q) \log e$.

Remark 3. Reverse Pinsker inequalities have also been derived in quantum information theory [1], [2], providing upper bounds on the relative entropy of two quantum states as a function of the trace norm distance when the minimal eigenvalues of the states are positive (c.f. [1, Theorem 6] and [2]). These types of bounds are akin to the weakened form in (11). When the variational distance is much smaller than the minimal eigenvalue (see [1, Eq. (57)]), the latter bounds have a quadratic scaling in this distance, similarly to (11); they are also inversely proportional to the minimal eigenvalue, similarly to the dependence of (11) on $Q_{\min}$.
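The chain of bounds discussed in this section can be compared on a small example. The sketch below assumes natural logarithms and a strictly positive $Q$, and it evaluates the relative entropy against $\log\bigl(1 + |P-Q|^2/(2Q_{\min})\bigr) - \tfrac{1}{2}\beta_2 |P-Q|^2$ (the bound of Theorem 1), its weakened form without the $\beta_2$ term, and the Csiszár-Talata bound $|P-Q|^2/Q_{\min}$. The function names and distributions are illustrative assumptions.

```python
import math

def relative_entropy(p, q):
    """D(P||Q) in nats; assumes q[a] > 0 wherever p[a] > 0."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def total_variation(p, q):
    """|P - Q| as the l1 distance, in [0, 2]."""
    return sum(abs(pa - qa) for pa, qa in zip(p, q))

def chi_square(p, q):
    """Chi-square divergence: sum of (P(a) - Q(a))^2 / Q(a)."""
    return sum((pa - qa) ** 2 / qa for pa, qa in zip(p, q))

def theorem1_bounds(p, q):
    """Return (refined bound, weakened bound) of Theorem 1 in nats."""
    tv = total_variation(p, q)
    q_min = min(q)
    beta_2 = min(pa / qa for pa, qa in zip(p, q))
    weakened = math.log(1 + tv ** 2 / (2 * q_min))
    refined = weakened - 0.5 * beta_2 * tv ** 2
    return refined, weakened

# Illustrative distributions (an assumption, not from the paper).
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

d = relative_entropy(P, Q)
tv = total_variation(P, Q)
refined, weakened = theorem1_bounds(P, Q)
csiszar_talata = tv ** 2 / min(Q)

# Sandwich on the chi-square divergence used in the proof.
beta_2 = min(pa / qa for pa, qa in zip(P, Q))
chi2 = chi_square(P, Q)
chi2_lower = math.exp(d + 0.5 * beta_2 * tv ** 2) - 1
chi2_upper = tv ** 2 / (2 * min(Q))

print(f"D = {d:.5f} <= refined = {refined:.5f} <= weakened = {weakened:.5f} <= CT/2 = {csiszar_talata / 2:.5f}")
print(f"{chi2_lower:.5f} <= chi^2 = {chi2:.5f} <= {chi2_upper:.5f}")
```

The last comparison in the first printed line reflects Remark 1: since $\log(1+x) \le x$ in nats, the Theorem 1 bound never exceeds half of the Csiszár-Talata bound.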
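3. APPLICATIONS OF THEOREM 1

A. The Exponential Decay of the Probability for a Non-Typical Sequence

To exemplify the utility of Theorem 1, we bound the function

$$L_\delta(Q) = \min_{P \notin T_\delta(Q)} D(P\|Q) \tag{29}$$

where we have denoted the subset of probability measures on $(A, \mathcal{F})$ which are $\delta$-close to $Q$ as

$$T_\delta(Q) = \bigl\{P : \forall a \in A, \; |P(a) - Q(a)| \le \delta \, Q(a)\bigr\}. \tag{30}$$

Note that $(a_1, \ldots, a_n)$ is strongly $\delta$-typical according to $Q$ if its empirical distribution belongs to $T_\delta(Q)$. According to Sanov's theorem (e.g. [5, Theorem 11.4.1]), if the random variables are independent and distributed according to $Q$, then the probability that $(Y_1, \ldots, Y_n)$ is not $\delta$-typical vanishes exponentially with exponent $L_\delta(Q)$.

To state the next result, we invoke the following notions from [14]. Given a probability measure $Q$, its balance coefficient is given by

$$\beta_Q = \inf_{A \in \mathcal{F} : \, Q(A) \ge \frac{1}{2}} Q(A). \tag{31}$$

The function $\varphi : (0, \tfrac{1}{2}] \to [\tfrac{1}{2} \log e, \infty)$ is given by

$$\varphi(p) = \begin{cases} \dfrac{1}{4(1-2p)} \log \dfrac{1-p}{p}, & p \in (0, \tfrac{1}{2}), \\[2mm] \tfrac{1}{2} \log e, & p = \tfrac{1}{2}. \end{cases} \tag{32}$$

Theorem 2. If $Q_{\min} > 0$, then

$$\varphi(1 - \beta_Q) \, Q_{\min}^2 \, \delta^2 \le L_\delta(Q) \tag{33}$$

$$\le \log\bigl(1 + 2 Q_{\min} \delta^2\bigr) \tag{34}$$

where (34) holds if $\delta \le Q_{\min}^{-1} - 1$.

Proof: Ordentlich and Weinberger [14, Section 4] show the refinement of Pinsker's inequality:

$$\varphi(1 - \beta_Q) \, |P-Q|^2 \le D(P\|Q). \tag{35}$$

Note that if $Q_{\min} > 0$ then $\beta_Q < 1$, and therefore $\varphi(1 - \beta_Q)$ is well defined and finite. If $P \notin T_\delta(Q)$, the simple bound

$$|P-Q| > \delta \, Q_{\min} \tag{36}$$

together with (35) yields (33). The upper bound (34) follows from (11) and the fact that if $\delta \le Q_{\min}^{-1} - 1$, then

$$\min_{P \notin T_\delta(Q)} |P-Q| = 2 \delta \, Q_{\min}. \tag{37}$$

If $\delta \le Q_{\min}^{-1} - 1$, the ratio between the upper and lower bounds in (33) and (34) satisfies

$$\frac{1}{Q_{\min}} \cdot \frac{\log e}{2\,\varphi(1 - \beta_Q)} \cdot \frac{2 \log\bigl(1 + 2 Q_{\min} \delta^2\bigr)}{Q_{\min} \, \delta^2 \log e} \le \frac{4}{Q_{\min}} \tag{38}$$

where (38) follows from the fact that its second and third factors are less than or equal to 1 and 4, respectively. Note that the bounds in (33) and (34) scale like $\delta^2$ for $\delta \approx 0$.

B. Distance from Equiprobable

If $P$ is a distribution on a finite set $A$, $H(P)$ gauges the distance from $U$, the equiprobable distribution, since

$$H(P) = \log |A| - D(P\|U). \tag{39}$$

Thus, it is of interest to explore the relationship between $H(P)$ and $|P-U|$. Particularizing (1), [4, (2.2)] (see also [24, pp. 30-31]), and (11) we obtain

$$|P-U| \le \sqrt{\frac{2}{\log e} \bigl(\log |A| - H(P)\bigr)}, \tag{40}$$

$$|P-U| \le 2 \sqrt{1 - \frac{1}{|A|} \exp\bigl(H(P)\bigr)}, \tag{41}$$

$$|P-U| \ge \sqrt{2 \left(\exp\bigl(-H(P)\bigr) - \frac{1}{|A|}\right)}, \tag{42}$$

respectively.

[Figure 1: two panels, $|A| = 4$ and $|A| = 16$; horizontal axis $H(P)$ in bits, vertical axis $|P-U|$; curves labeled (a), (b), (c).]

Fig. 1. Bounds on $|P-U|$ as a function of $H(P)$ for $|A| = 4$ and $|A| = 16$. The point $(H(P), |P-U|) = (0, 2(1 - |A|^{-1}))$ is depicted on the y-axis. In the curves of the two plots, the bounds (a), (b) and (c) refer, respectively, to (40), (41) and (42).

The bounds in (40)-(42) are illustrated for $|A| = 4, 16$ in Figure 1. For $H(P) = 0$, $|P-U| = 2(1 - |A|^{-1})$ is shown for reference in Figure 1; as the cardinality of the alphabet increases, the gap between $|P-U|$ and its upper bound is reduced (and this gap decays asymptotically to zero).

Results on the more general problem of finding bounds on $|H(P) - H(Q)|$ based on $|P-Q|$ can be found in [5, Theorem 17.3.3], [11], [16], [18], [26] and [27].

4. EXTENSION OF THEOREM 1 TO RÉNYI DIVERGENCES

Definition 1. The Rényi divergence of order $\alpha \in [0, \infty]$ from $P$ to $Q$ is defined for $\alpha \in (0, 1) \cup (1, \infty)$ as

$$D_\alpha(P\|Q) \triangleq \frac{1}{\alpha - 1} \log \sum_{a \in A} P^\alpha(a) \, Q^{1-\alpha}(a). \tag{43}$$

Recall that $D_1(P\|Q) \triangleq D(P\|Q)$ is defined to be the analytic extension of $D_\alpha(P\|Q)$ at $\alpha = 1$ (if $D(P\|Q) < \infty$, L'Hôpital's rule gives that $D(P\|Q) = \lim_{\alpha \uparrow 1} D_\alpha(P\|Q)$). The extreme cases of $\alpha = 0, \infty$ are defined as follows:

If $\alpha = 0$ then $D_0(P\|Q) = -\log Q(\mathrm{Support}(P))$,

If $\alpha = +\infty$ then $D_\infty(P\|Q) = \log \sup_{a \in A} \dfrac{P(a)}{Q(a)}$.

Pinsker's inequality was extended by Gilardoni [10] for a Rényi divergence of order $\alpha \in (0, 1]$ (see also [8, Theorem 31]), and it gets the form

$$\frac{\alpha}{2} \, |P-Q|^2 \log e \le D_\alpha(P\|Q).$$

A tight lower bound on the Rényi divergence of order $\alpha > 0$ as a function of the total variation distance is given in [19], which is consistent with Vajda's tight lower bound for f-divergences in [23, Theorem 3]. Motivated by these findings, we extend the upper bound on the relative entropy in Theorem 1 to Rényi divergences of an arbitrary order.

Theorem 3. Assume that $P$, $Q$ are strictly positive with minimum masses denoted by $P_{\min}$ and $Q_{\min}$, respectively. Let $\beta_1$ and $\beta_2$ be given in (8) and (9), respectively, and abbreviate $\delta \triangleq \tfrac{1}{2} |P-Q| \in [0, 1]$. Then, the Rényi divergence of order $\alpha \in [0, \infty]$ satisfies

$$D_\alpha(P\|Q) \le \begin{cases} f_1, & \alpha \in (2, \infty] \\ f_2, & \alpha \in [1, 2] \\ \min\{f_2, f_3, f_4\}, & \alpha \in (\tfrac{1}{2}, 1) \\ \min\bigl\{2 \log \tfrac{1}{1-\delta}, f_2, f_3, f_4\bigr\}, & \alpha \in [0, \tfrac{1}{2}] \end{cases} \tag{44}$$

where, for $\alpha \in [0, \infty]$,

$$f_1(\alpha, \beta_1, \delta) \triangleq \begin{cases} \dfrac{1}{\alpha - 1} \log\left(1 + \dfrac{\delta \bigl(\beta_1^{1-\alpha} - 1\bigr)}{1 - \beta_1}\right), & \alpha \in [0, 1) \cup (1, \infty) \\[3mm] \dfrac{\delta}{1 - \beta_1} \log \dfrac{1}{\beta_1}, & \alpha = 1 \\[3mm] \log \dfrac{1}{\beta_1}, & \alpha = \infty \end{cases} \tag{45}$$

for $\alpha \in [0, 2]$,

$$f_2(\alpha, \beta_1, Q_{\min}, \delta) \triangleq \min\left\{ f_1(\alpha, \beta_1, \delta), \; \log\left(1 + \frac{2\delta^2}{Q_{\min}}\right) \right\} \tag{46}$$

and, for $\alpha \in [0, 1)$,

$$f_3(\alpha, P_{\min}, \beta_1, \delta) \triangleq \frac{\alpha}{1 - \alpha} \left[ \log\left(1 + \frac{2\delta^2}{P_{\min}}\right) - 2 \beta_1 \delta^2 \log e \right], \tag{47}$$

while $f_4(\beta_2, Q_{\min}, \delta)$ in (48) is a minimum over upper bounds on $D(P\|Q)$ that includes the bound of Theorem 1 written in terms of $\delta$, namely $\log\bigl(1 + \tfrac{2\delta^2}{Q_{\min}}\bigr) - 2 \beta_2 \delta^2 \log e$ (see [20, Section 7.C] for the complete expression).

Proof: See [20, Section 7.C].

Remark 4. A simple bound, albeit looser than the one in Theorem 3, is

$$D_\alpha(P\|Q) \le \log\left(1 + \frac{|P-Q|}{2 Q_{\min}}\right) \tag{49}$$

which is asymptotically tight as $\alpha \to \infty$ in the case of a binary alphabet with equiprobable $Q$.

Example 1.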
Figure 2 illustrates the bound in (45), which is valid for all $\alpha \in [0, \infty]$ (see [20]), and the upper bounds of Theorem 3 in the case of binary alphabets.

5. SUMMARY

We derive in this paper some reverse Pinsker inequalities for probability measures $P \ll Q$ defined on a common finite set, which provide lower bounds on the total variation distance $|P-Q|$ as a function of the relative entropy $D(P\|Q)$ under the assumption of a bounded relative information or $Q_{\min} > 0$. More general results for an arbitrary alphabet are available in [20, Section 5].

In [20], we study bounds among various f-divergences, dealing with arbitrary alphabets and deriving bounds on the ratios of various distance measures. New expressions of the Rényi divergence in terms of the relative information spectrum are derived, leading to upper and lower bounds on the Rényi divergence in terms of the variational distance.
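Upper and lower bounds on the Rényi divergence in terms of the variational distance can be checked on a small example. The sketch below assumes natural logarithms and strictly positive distributions, and it compares $D_\alpha(P\|Q)$ with the simple upper bound $\log\bigl(1 + |P-Q|/(2Q_{\min})\bigr)$ of Remark 4 and, for $\alpha \in (0, 1]$, with Gilardoni's lower bound $\tfrac{\alpha}{2}|P-Q|^2$. The function names and the distributions are illustrative assumptions.

```python
import math

def total_variation(p, q):
    """|P - Q| as the l1 distance, in [0, 2]."""
    return sum(abs(pa - qa) for pa, qa in zip(p, q))

def renyi_divergence(p, q, alpha):
    """Renyi divergence of order alpha in [0, inf], in nats; assumes q > 0 everywhere."""
    support = [(pa, qa) for pa, qa in zip(p, q) if pa > 0]
    if alpha == 0:
        return -math.log(sum(qa for _, qa in support))
    if alpha == 1:
        return sum(pa * math.log(pa / qa) for pa, qa in support)
    if alpha == math.inf:
        return math.log(max(pa / qa for pa, qa in support))
    s = sum(pa ** alpha * qa ** (1 - alpha) for pa, qa in support)
    return math.log(s) / (alpha - 1)

# Illustrative distributions (an assumption, not from the paper).
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

tv = total_variation(P, Q)
simple_upper = math.log(1 + tv / (2 * min(Q)))  # bound of Remark 4

orders = [0.0, 0.5, 1.0, 2.0, 10.0, math.inf]
values = [renyi_divergence(P, Q, a) for a in orders]
for a, v in zip(orders, values):
    print(f"alpha = {a}: D_alpha = {v:.5f} <= {simple_upper:.5f}")

# Binary alphabet with equiprobable Q: the simple bound meets D_inf.
P_bin, Q_bin = [0.8, 0.2], [0.5, 0.5]
d_inf = renyi_divergence(P_bin, Q_bin, math.inf)
bound_bin = math.log(1 + total_variation(P_bin, Q_bin) / (2 * min(Q_bin)))
print(f"binary case: D_inf = {d_inf:.5f}, simple bound = {bound_bin:.5f}")
```

The binary case illustrates the tightness claim of Remark 4: the largest likelihood ratio $P(a)/Q(a)$ equals $1 + |P-Q|/(2Q_{\min})$ when $Q$ is equiprobable on two letters.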
[Figure 2: vertical axis $D_\alpha(P\|Q)$ in nats; curves labeled (a) and (b).]

Fig. 2. The Rényi divergence $D_\alpha(P\|Q)$ for $P$ and $Q$ which are defined on a binary alphabet, compared to (a) its upper bound in (44), and (b) its upper bound in (45) (see [20]). The two bounds coincide here over part of the range of $\alpha$.

ACKNOWLEDGMENT

The work of I. Sason has been supported by the Israeli Science Foundation (ISF) under Grant 12/12, and the work of S. Verdú has been supported by the US National Science Foundation under Grant CCF-1016625, and in part by the Center for Science of Information, an NSF Science and Technology Center under Grant CCF-0939370.

REFERENCES

[1] K. M. R. Audenaert and J. Eisert, "Continuity bounds on the quantum relative entropy," Journal of Mathematical Physics, vol. 46, paper 102104, October 2005.
[2] K. M. R. Audenaert and J. Eisert, "Continuity bounds on the quantum relative entropy - II," Journal of Mathematical Physics, vol. 52, paper 112201, November 2011.
[3] G. Böcherer and B. C. Geiger, "Optimal quantization for distribution synthesis," March 2015. Available at http://arxiv.org/abs/1307.6843.
[4] J. Bretagnolle and C. Huber, "Estimation des densités: risque minimax," Probability Theory and Related Fields, vol. 47, no. 2, pp. 119-137, 1979.
[5] T. M. Cover and J. A. Thomas, Elements of Information Theory, second edition, John Wiley & Sons, 2006.
[6] I. Csiszár and Z. Talata, "Context tree estimation for not necessarily finite memory processes, via BIC and MDL," IEEE Trans. on Information Theory, vol. 52, no. 3, pp. 1007-1016, March 2006.
[7] S. S. Dragomir, "Bounds for the normalized Jensen functional," Bulletin of the Australian Mathematical Society, vol. 74, no. 3, pp. 471-478, 2006.
[8] T. van Erven and P. Harremoës, "Rényi divergence and Kullback-Leibler divergence," IEEE Trans. on Information Theory, vol. 60, no. 7, pp. 3797-3820, July 2014.
[9] A. A. Fedotov, P. Harremoës and F. Topsøe, "Refinements of Pinsker's inequality," IEEE Trans. on Information Theory, vol. 49, no. 6, pp. 1491-1498, June 2003.
[10] G. L. Gilardoni, "On Pinsker's and Vajda's type inequalities for Csiszár's f-divergences," IEEE Trans. on Information Theory, vol. 56, no. 11, pp. 5377-5386, November 2010.
[11] S. W. Ho and R. W. Yeung, "The interplay between entropy and variational distance," IEEE Trans. on Information Theory, vol. 56, no. 12, pp. 5906-5929, December 2010.
[12] V. Kostina and S. Verdú, "Channels with cost constraints: strong converse and dispersion," to appear in the IEEE Trans. on Information Theory, vol. 61, no. 5, May 2015.
[13] M. Krajči, C. F. Liu, L. Mikeš and S. M. Moser, "Performance analysis of Fano coding," Proceedings of the IEEE 2015 International Symposium on Information Theory, Hong Kong, June 14-19, 2015.
[14] E. Ordentlich and M. J. Weinberger, "A distribution dependent refinement of Pinsker's inequality," IEEE Trans. on Information Theory, vol. 51, no. 5, pp. 1836-1840, May 2005.
[15] M. S. Pinsker, Information and Information Stability of Random Variables and Random Processes, San Francisco: Holden-Day, 1964, originally published in Russian in 1960.
[16] V. V. Prelov and E. C. van der Meulen, "Mutual information, variation, and Fano's inequality," Problems of Information Transmission, vol. 44, no. 3, pp. 185-197, September 2008.
[17] M. D. Reid and R. C. Williamson, "Information, divergence and risk for binary experiments," Journal of Machine Learning Research, vol. 12, no. 3, pp. 731-817, March 2011.
[18] I. Sason, "Entropy bounds for discrete random variables via maximal coupling," IEEE Trans. on Information Theory, vol. 59, no. 11, pp. 7118-7131, November 2013.
[19] I. Sason, "On the Rényi divergence and the joint range of relative entropies," Proceedings of the 2015 IEEE International Symposium on Information Theory, pp. 1610-1614, Hong Kong, June 14-19, 2015.
[20] I. Sason and S. Verdú, "Bounds among f-divergences," submitted to the IEEE Trans. on Information Theory, July 2015. [Online]. Available at http://arxiv.org/abs/1508.00335.
[21] M. Tomamichel and V. Y. F. Tan, "A tight upper bound for the third-order asymptotics for most discrete memoryless channels," IEEE Trans. on Information Theory, vol. 59, no. 11, pp. 7041-7051, November 2013.
[22] I. Vajda, "Note on discrimination information and variation," IEEE Trans. on Information Theory, vol. 16, no. 6, pp. 771-773, November 1970.
[23] I. Vajda, "On f-divergence and singularity of probability measures," Periodica Mathematica Hungarica, vol. 2, no. 1-4, pp. 223-234, 1972.
[24] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
[25] S. Verdú, "Total variation distance and the distribution of the relative information," Proceedings of the Information Theory and Applications Workshop, pp. 499-501, San Diego, California, USA, February 2014.
[26] S. Verdú, Information Theory, in preparation.
[27] Z. Zhang, "Estimating mutual information via Kolmogorov distance," IEEE Trans. on Information Theory, vol. 53, no. 9, pp. 3280-3282, September 2007.