f-divergence Inequalities


f-divergence Inequalities

Igal Sason    Sergio Verdú

Abstract — This paper develops systematic approaches to obtain f-divergence inequalities, dealing with pairs of probability measures defined on arbitrary alphabets. Functional domination is one such approach, where special emphasis is placed on finding the best possible constant upper bounding a ratio of f-divergences. Another approach used for the derivation of bounds among f-divergences relies on moment inequalities and the logarithmic-convexity property, which results in tight bounds on the relative entropy and Bhattacharyya distance in terms of χ² divergences. A rich variety of bounds are shown to hold under boundedness assumptions on the relative information. Special attention is devoted to the total variation distance and its relation to the relative information and relative entropy, including reverse Pinsker inequalities, as well as to the E_γ divergence, which generalizes the total variation distance. Pinsker's inequality is extended for this type of f-divergence, a result which leads to an inequality linking the relative entropy and relative information spectrum. Integral expressions of the Rényi divergence in terms of the relative information spectrum are derived, leading to bounds on the Rényi divergence in terms of either the variational distance or relative entropy.

Keywords: relative entropy, total variation distance, f-divergence, Rényi divergence, Pinsker's inequality, relative information.

I. Sason is with the Department of Electrical Engineering, Technion–Israel Institute of Technology, Haifa 3000, Israel (e-mail: sason@ee.technion.ac.il). S. Verdú is with the Department of Electrical Engineering, Princeton University, Princeton, New Jersey 08544, USA (e-mail: verdu@princeton.edu).

This manuscript is identical to the published journal paper [94], apart from some additional material which includes Sections III-C and IV-F, and three technical proofs. Parts of this work have been presented at the 2014 Workshop on Information Theory and Applications, San Diego, Feb. 2014, at the 2015 IEEE Information Theory Workshop, Jeju Island, Korea, Oct. 2015, and at the 2016 IEEE International Conference on the Science of Electrical Engineering, Eilat, Israel, Nov. 2016. This work has been supported by the Israeli Science Foundation (ISF) under Grant /, by NSF Grant CCF-0665, by the Center for Science of Information, an NSF Science and Technology Center under Grant CCF, and by ARO under MURI Grant W9NF.

2 I. INTRODUCTION Throughout their development, information theory, and more generally, probability theory, have benefitted from non-negative measures of dissimilarity, or loosely speaking, distances, between pairs of probability measures defined on the same measurable space see, e.g., [4], [65], [06]). Notable among those measures are see Section II for definitions): total variation distance P Q ; relative entropy DP Q); χ -divergence χ P Q); Hellinger divergence H α P Q); Rényi divergence D α P Q). It is useful, particularly in proving convergence results, to give bounds of one measure of dissimilarity in terms of another. The most celebrated among those bounds is Pinsker s inequality: P Q log e DP Q) ) proved by Csiszár [3] and Kullback [60], with Kemperman [56] independently a bit later. Improved and generalized versions of Pinsker s inequality have been studied, among others, in [39], [44], [46], [76], [86], [9], [04]. Relationships among measures of distances between probability measures have long been a focus of interest in probability theory and statistics e.g., for studying the rate of convergence of measures). The reader is referred to surveys in [4, Section 3], [65, Chapter ], [86] and [87, Appendix 3], which provide several relationships among useful f-divergences and other measures of dissimilarity between probability measures. Some notable existing bounds among f-divergences include, in addition to ): [63, Lemma ], [64, p. 5] [5,.)] H P Q) P Q ) H P Q) 4 H P Q) ) ; 3) 4 P Q exp DP Q) ) ; 4) The folklore in information theory is that ) is due to Pinsker [78], albeit with a suboptimal constant. As explained in [0], although no such inequality appears in [78], it is possible to put together two of Pinsker s bounds to conclude that Csiszár derived ) in [3, Theorem 4.] after publishing a weaker version in [, Corollary ] a year earlier. 408 P Q log e DP Q). March 4, 08
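As a quick numerical illustration (added here, not part of the original paper), the following minimal Python sketch evaluates both sides of Pinsker's inequality for an arbitrary pair of distributions on a small finite alphabet. It assumes the paper's normalization of the total variation distance, |P − Q| = Σ_a |P(a) − Q(a)| (twice the usual supremum difference), works in nats (log e = 1), and the function names and example distributions are mine.

```python
import numpy as np

def total_variation(p, q):
    # |P - Q| in the paper's normalization: sum of |P(a) - Q(a)| over the alphabet
    return float(np.abs(p - q).sum())

def relative_entropy(p, q):
    # D(P||Q) in nats; requires Q(a) > 0 wherever P(a) > 0 (i.e., P << Q)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

lhs = 0.5 * total_variation(p, q) ** 2   # (1/2)|P-Q|^2 log e, with log e = 1 in nats
rhs = relative_entropy(p, q)
print(f"(1/2)|P-Q|^2 = {lhs:.4f}  <=  D(P||Q) = {rhs:.4f}")
```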

3 [4, Theorem 5], [35, Theorem 4], [00] 3 [48, Corollary 5.6] For all α χ P Q) DP Q) log + χ P Q) ) ; 5) ) α + α ) H α P Q) ; 6) the inequality in 6) is reversed if α 0, ), ], and it holds with equality if α =. [44], [45], [86, 58)] [97] P Q, P Q [ 0, ] χ P Q) P Q P Q, P Q, ). D P Q) DQ P ) χ P Q) log e; 8) 7) 4 H P Q) log e DP Q) DQ P ) 9) 4 χ P Q) χ Q P ) log e; 0) 4 H P Q) log e DP Q) + DQ P ) ) χ P Q) + χ Q P ) ) log e; ) [3,.8)] [46], [86, Corollary 3], [9] DP Q) DP Q) + DQ P ) P Q log χ P Q) + χ Q P ) [53, p. 7] cf. a generalized form in [87, Lemma A.3.5]) generalized in [65, Proposition.5]: P Q + χ P Q) ) log e ; 3) ) + P Q, 4) P Q 8 P Q 4 P Q ; 5) H P Q) log e DP Q), 6) H α P Q) log e D α P Q) DP Q), 7) March 4, 08

4 for α 0, ), and 4 for α, ). [65, Proposition.35] If α 0, ), β max{α, α}, then [37, Theorems 3 and 6] H α P Q) log e D α P Q) DP Q) 8) + P Q ) β P Q ) β α) H α P Q) 9) D α P Q) is monotonically increasing in α > 0; α ) D α P Q) is monotonically decreasing in α 0, ]; P Q. 0) [65, Proposition.7] the same monotonicity properties hold for H α P Q). [46] If α 0, ], then α P Q log e D α P Q); ) An inequality for H α P Q) [6, Lemma ] becomes at α = ), the parallelogram identity [6,.)] DP 0 Q) + DP Q) = DP 0 P ) + DP P ) + DP Q). ) with P = P 0 + P ), and extends a result for the relative entropy in [6, Theorem.]. A reverse Pinsker inequality, providing an upper bound on the relative entropy in terms of the total variation distance, does not exist in general since we can find distributions which are arbitrarily close in total variation but with arbitrarily high relative entropy. Nevertheless, it is possible to introduce constraints under which such reverse Pinsker inequalities hold. In the special case of a finite alphabet A, Csiszár and Talata [9, p. 0] showed that when Q min min a A Qa) is positive. 3 log e DP Q) Q min ) P Q 3) [5, Theorem 3.] if f : 0, ) R is a strictly convex function, then there exists a real-valued function ψ f such that lim x 0 ψ f x) = 0 and 4 P Q ψ f Df P Q) ). 4) 3 Recent applications of 3) can be found in [57, Appendix D] and [0, Lemma 7] for the analysis of the third-order asymptotics of the discrete memoryless channel with or without cost constraints. 4 Eq. 4) follows as a special case of [5, Theorem 3.] with m = and w m =. March 4, 08

which implies
lim_{n→∞} D_f(P_n‖Q_n) = 0  ⟹  lim_{n→∞} |P_n − Q_n| = 0.   (25)

The numerical optimization of an f-divergence subject to simultaneous constraints on f_i-divergences (i = 1, ..., L) was recently studied in [48], which showed that for that purpose it is enough to restrict attention to alphabets of cardinality L + 2. Earlier, [50] showed that if L = 1, then either the solution is obtained by a pair (P, Q) on a binary alphabet, or it is a deflated version of such a point. Therefore, from a purely numerical standpoint, the minimization of D_f(P‖Q) such that D_g(P‖Q) ≤ d can be accomplished by a grid search on [0, 1]². Occasionally, as in the case where D_f(P‖Q) = D(P‖Q) and D_g(P‖Q) = |P − Q|, it is actually possible to determine analytically the locus of (D_f(P‖Q), D_g(P‖Q)) (see [39]). In fact, as shown in [05], a binary alphabet suffices if the single constraint is on the total variation distance. The same conclusion holds when minimizing the Rényi divergence [9].

In this work, we find relationships among the various divergence measures outlined above as well as a number of other measures of dissimilarity between probability measures. The framework of f-divergences, which encompasses the foregoing measures (Rényi divergence is a one-to-one transformation of the Hellinger divergence), serves as a convenient playground.

The rest of the paper is structured as follows: Section II introduces the basic definitions needed and, in particular, the various measures of dissimilarity between probability measures used throughout. Based on functional domination, Section III provides a basic tool for the derivation of bounds among f-divergences. Under mild regularity conditions, this approach also makes it possible to prove the optimality of constants in those bounds. In addition, we show instances where such optimality can be shown in the absence of regularity conditions. The basic tool used in Section III is exemplified in obtaining relationships among important f-divergences such as relative entropy, Hellinger divergence and total variation distance. This approach is also useful in strengthening and providing an alternative proof of Samson's inequality [89] (a counterpart to Pinsker's inequality using Marton's divergence, useful in proving certain concentration of measure results [3]), whose constant we show cannot be improved. In addition, we show several new results in Section III-E on the maximal ratios of various f-divergences to total variation distance. Section IV provides an approach for bounding ratios of f-divergences, assuming that the relative information (see Definition 1) is lower and/or upper bounded with probability one. The approach is exemplified in bounding ratios of relative entropy to various f-divergences, and in analyzing the local behavior of f-divergence ratios when the reference measure is fixed. We also show that bounded relative information leads to a strengthened version of Jensen's inequality, which, in turn, results in upper and lower bounds on the ratio of the non-negative difference

log(1 + χ²(P‖Q)) − D(P‖Q) to D(Q‖P). A new reverse version of Samson's inequality is another byproduct of the main tool in this section.

The rich structure of the total variation distance, as well as its importance in both fundamentals and applications, merits placing special attention on bounding the rest of the distance measures in terms of |P − Q|. Section V gives several useful identities linking the total variation distance with the relative information spectrum, which result in a number of upper and lower bounds on |P − Q|, some of which are tighter than Pinsker's inequality in various ranges of the parameters. It also provides refined bounds on D(P‖Q) as a function of χ²-divergences and the total variation distance.

Section VI is devoted to proving reverse Pinsker inequalities, namely, lower bounds on |P − Q| as a function of D(P‖Q), involving either a) bounds on the relative information, b) Lipschitz constants, or c) the minimum mass of the reference measure (in the finite alphabet case). In the latter case, we also examine the relationship between entropy and the total variation distance from the equiprobable distribution, as well as the exponential decay of the probability that an independent identically distributed sequence is not strongly typical.

Section VII focuses on the E_γ divergence. This f-divergence generalizes the total variation distance, and its utility in information theory has been exemplified in [8], [68], [69], [70], [79], [80], [8]. Based on the operational interpretation of the DeGroot statistical information [3] and the integral representation of f-divergences as a function of DeGroot's measure, Section VII provides an integral representation of f-divergences as a function of the E_γ divergence; this representation shows that {(E_γ(P‖Q), E_γ(Q‖P)), γ ≥ 1} uniquely determines D(P‖Q) and H_α(P‖Q), as well as any other f-divergence with twice differentiable f. Accordingly, bounds on the E_γ divergence directly translate into bounds on other important f-divergences. In addition, we show an extension of Pinsker's inequality (1) to the E_γ divergence, which leads to a relationship between the relative information spectrum and relative entropy.

The Rényi divergence, which has found a plethora of information-theoretic applications, is the focus of Section VIII. Expressions of the Rényi divergence are derived in Section VIII-A as a function of the relative information spectrum. These expressions lead in Section VIII-B to the derivation of bounds on the Rényi divergence as a function of the variational distance under the boundedness assumption of the relative information. Bounds on the Rényi divergence of an arbitrary order are derived in Section VIII-C as a function of the relative entropy when the relative information is bounded.

II. BASIC DEFINITIONS

A. Relative Information and Relative Entropy

We assume throughout that the probability measures P and Q are defined on a common measurable space (A, F), and P ≪ Q denotes that P is absolutely continuous with respect to Q, namely there is no event F ∈ F

such that P(F) > 0 = Q(F).

Definition 1: If P ≪ Q, the relative information provided by a ∈ A according to (P, Q) is given by⁵
ı_{P‖Q}(a) = log (dP/dQ)(a).   (26)
When the argument of the relative information is distributed according to P, the resulting real-valued random variable is of particular interest. Its cumulative distribution function and expected value are known as follows.

Definition 2: If P ≪ Q, the relative information spectrum is the cumulative distribution function
F_{P‖Q}(x) = P[ı_{P‖Q}(X) ≤ x]   (27)
with⁶ X ∼ P. The relative entropy of P with respect to Q is
D(P‖Q) = E[ı_{P‖Q}(X)]   (28)
        = E[ı_{P‖Q}(Y) exp(ı_{P‖Q}(Y))]   (29)
where Y ∼ Q.

B. f-divergences

Introduced by Ali-Silvey [] and Csiszár [], [3], a useful generalization of the relative entropy, which retains some of its major properties (and, in particular, the data processing inequality [3]), is the class of f-divergences. A general definition of an f-divergence is given in [66, p. 4398], specialized next to the case where P ≪ Q.

Definition 3: Let f : (0, ∞) → ℝ be a convex function, and suppose that P ≪ Q. The f-divergence from P to Q is given by
D_f(P‖Q) = ∫ f(dP/dQ) dQ   (30)
          = E[f(Z)]   (31)
where
Z = exp(ı_{P‖Q}(Y)),  Y ∼ Q,   (32)
and in (30) we took the continuous extension⁷
f(0) = lim_{t↓0} f(t) ∈ (−∞, +∞].   (33)

⁵ dP/dQ denotes the Radon-Nikodym derivative (or density) of P with respect to Q. Logarithms have an arbitrary common base, and exp(·) indicates the inverse function of the logarithm with that base.
⁶ X ∼ P means that P[X ∈ F] = P(F) for any event F ∈ F.
⁷ The convexity of f : (0, ∞) → ℝ implies its continuity on (0, ∞).
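Definition 3 translates directly into a few lines of code on a finite alphabet. The sketch below is an illustration added here (not from the paper): it computes D_f(P‖Q) = E_Q[f(dP/dQ)] for an arbitrary convex f with f(1) = 0, using the continuous extension f(0) as in (33), and it assumes P ≪ Q, i.e., Q(a) > 0 wherever P(a) > 0. The function names are mine.

```python
import numpy as np

def f_divergence(p, q, f, f_at_0=None):
    """D_f(P||Q) = sum_a Q(a) f(P(a)/Q(a)) on a finite alphabet, assuming P << Q.

    f      : convex function on (0, inf) with f(1) = 0
    f_at_0 : continuous extension f(0) = lim_{t->0} f(t); needed only if
             some letter has P(a) = 0 < Q(a).
    """
    total = 0.0
    for pa, qa in zip(p, q):
        if qa == 0:
            if pa > 0:
                raise ValueError("P is not absolutely continuous w.r.t. Q")
            continue                      # letter outside the support of Q
        t = pa / qa
        total += qa * (f(t) if t > 0 else f_at_0)
    return total

# Relative entropy via f(t) = t log t (nats, f(0) = 0) and chi^2 via f(t) = (t-1)^2
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.25, 0.25, 0.5])
kl = f_divergence(p, q, lambda t: t * np.log(t), f_at_0=0.0)
chi2 = f_divergence(p, q, lambda t: (t - 1.0) ** 2, f_at_0=1.0)
print(kl, chi2)
```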

We can also define D_f(P‖Q) without requiring P ≪ Q. Let f : (0, ∞) → ℝ be a convex function with f(1) = 0, and let f* : (0, ∞) → ℝ be given by
f*(t) = t f(1/t)   (34)
for all t > 0. Note that f* is also convex (see, e.g., [4, Section 3.2.6]), f*(1) = 0, and D_{f*}(P‖Q) = D_f(Q‖P) if P ≪≫ Q. By definition, we take
f*(0) = lim_{t↓0} f*(t) = lim_{u→∞} f(u)/u.   (35)
If p and q denote, respectively, the densities of P and Q with respect to a σ-finite measure μ (i.e., p = dP/dμ, q = dQ/dμ), then we can write (30) as
D_f(P‖Q) = ∫_{pq>0} q f(p/q) dμ + f(0) Q(p = 0) + f*(0) P(q = 0).   (36)–(37)

Remark 1: Different functions may lead to the same f-divergence for all (P, Q): if, for an arbitrary b ∈ ℝ, we have
f_b(t) = f_0(t) + b (t − 1),  t > 0,   (38)
then
D_{f_0}(P‖Q) = D_{f_b}(P‖Q).   (39)

The following key property of f-divergences follows from Jensen's inequality.

Proposition 1: If f : (0, ∞) → ℝ is convex with f(1) = 0, and P ≪ Q, then
D_f(P‖Q) ≥ 0.   (40)
If, furthermore, f is strictly convex at t = 1, then equality in (40) holds if and only if P = Q.

Surveys on general properties of f-divergences can be found in [65], [06], [07]. The assumptions of Proposition 1 are satisfied by many interesting measures of dissimilarity between probability measures. In particular, the following examples receive particular attention in this paper. As per Definition 3, in each case the function f is defined on (0, ∞).

1) Relative entropy [59]: f(t) = t log t,
D(P‖Q) = D_f(P‖Q)   (41)
        = D_r(P‖Q)   (42)

9 with r : 0, ) [0, ) defined as 9 ) Relative entropy: P Q) ft) = log t, 3) Jeffrey s divergence [54]: P Q) ft) = t ) log t, 4) χ -divergence [77]: ft) = t ) or ft) = t, rt) t log t + t) log e. 43) DQ P ) = D f P Q); 44) DP Q) + DQ P ) = D f P Q); 45) χ P Q) = D f P Q) 46) ) dp χ P Q) = dq dq 47) ) dp = dq 48) dq = E [ exp ı P Q Y ) )] 49) = E [ exp ı P Q X) )] 50) with X P and Y Q. Note that if P Q, then from the right side of 48), we obtain with gt) = t t = f t), and ft) = t. χ Q P ) = D g P Q) 5) 5) Hellinger divergence of order α 0, ), ) [54], [65, Definition.0]: with H α P Q) = D fα P Q) 5) f α t) = tα α. 53) The χ -divergence is the Hellinger divergence of order, while H P Q) is usually referred to as the squared Hellinger distance. The analytic extension of H α P Q) at α = yields 6) Total variation distance: Setting H P Q) log e = DP Q). 54) ft) = t 55) March 4, 08

10 results in 0 7) Triangular Discrimination [6], [09] a.k.a. Vincze-Le Cam distance): with P Q = D f P Q) 56) dp = dq dq 57) ) = sup P F) QF). 58) F F P Q) = D f P Q) 59) Note that ft) = t ) t +. 60) 8) Jensen-Shannon divergence [67] a.k.a. capacitory discrimination): with 9) E γ divergence see, e.g., [79, p. 34]): For γ, with P Q) = χ P P + Q) 6) = χ Q P + Q). 6) JSP Q) = D P P + Q) + D Q P + Q) 63) = D f P Q) 64) ) + t ft) = t log t + t) log. 65) E γ P Q) = D fγ P Q) 66) f γ t) = t γ) + 67) where x) + max{x, 0}. E γ is sometimes called hockey-stick divergence because of the shape of f γ. If γ =, then 0) DeGroot statistical information [3]: For p 0, ), E P Q) = P Q. 68) I p P Q) = D φp P Q) 69) March 4, 08

11 with Invoking 66) 70), we get cf. [66, 77)]) φ p t) = min{p, p} min{p, pt}. 70) I P Q) = E P Q) = 4 P Q. 7) This measure was first proposed by DeGroot [3] due to its operational meaning in Bayesian statistical hypothesis testing see Section VII-B), and it was later identified as an f-divergence see [66, Theorem 0]). ) Marton s divergence [73, pp ]: d P, Q) = min E [ P [X Y Y ] ] 7) = D s P Q) 73) where the minimum is over all probability measures P XY with respective marginals P X = P and P Y = Q, and st) = t ) {t < }. 74) Note that Marton s divergence satisfies the triangle inequality [73, Lemma 3.], and d P, Q) = 0 implies P = Q; however, due to its asymmetry, it is not a distance measure. C. Rényi Divergence Another generalization of relative entropy was introduced by Rényi [88] in the special case of finite alphabets. The general definition assuming 8 P Q) is the following. Definition 4: Let P Q. The Rényi divergence of order α 0 from P to Q is given as follows: If α 0, ), ), then with X P and Y Q. If α = 0, then 9 D α P Q) = α log E [ exp α ı P Q Y ) )]) 75) = α log E [ exp α ) ı P Q X) )]) 76) D 0 P Q) = ) max log. 77) F F : P F)= QF) 8 Rényi divergence can also be defined without requiring absolute continuity, e.g., [37, Definition ]. 9 The function in 75) is, in general, right-discontinuous at α = 0. Rényi [88] defined D 0P Q) = 0, while we have followed [37] defining it instead as lim α 0 D αp Q). March 4, 08
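Before moving on, the E_γ divergence and the DeGroot statistical information defined above can be made concrete on a finite alphabet. The closed forms used below — E_γ(P‖Q) = Σ_a (P(a) − γQ(a))⁺ and I_p(P‖Q) = min{p, 1−p} − Σ_a min{p P(a), (1−p) Q(a)} — are my own reading of (66)–(71); the added sketch checks the identities E_1(P‖Q) = ½|P−Q| and I_{1/2}(P‖Q) = ¼|P−Q| numerically (function names and example distributions are mine).

```python
import numpy as np

def tv(p, q):
    return float(np.abs(p - q).sum())            # |P - Q|

def E_gamma(p, q, gamma):
    # E_gamma(P||Q) = sum_a (P(a) - gamma*Q(a))^+ , for gamma >= 1
    return float(np.maximum(p - gamma * q, 0.0).sum())

def degroot_information(p, q, prior):
    # I_p(P||Q) = min{p,1-p} - sum_a min{p*P(a), (1-p)*Q(a)}, with prior p in (0,1)
    return min(prior, 1 - prior) - float(np.minimum(prior * p, (1 - prior) * q).sum())

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

print(E_gamma(p, q, 1.0), 0.5 * tv(p, q))                 # E_1 = |P-Q|/2
print(degroot_information(p, q, 0.5), 0.25 * tv(p, q))    # I_{1/2} = |P-Q|/4
print(E_gamma(p, q, 1.5))                                  # a value for gamma > 1
```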

12 If α =, then D P Q) = DP Q) 78) which is the analytic extension of D α P Q) at α =. If DP Q) <, it can be verified by L Hôpital s rule that DP Q) = lim α D α P Q). If α = + then with Y Q. If P Q, we take D P Q) =. D P Q) = log ess sup dp ) dq Y ) 79) Rényi divergence is a one-to-one transformation of Hellinger divergence of the same order α 0, ), ): which, when particularized to order becomes D α P Q) = α log + α ) H αp Q)) 80) D P Q) = log + χ P Q) ). 8) Note that 6), 7), 8) follow from 80) and the monotonicity of the Rényi divergence in its order, which in turn yields 6). Introduced in [8], the Bhattacharyya distance was popularized in the engineering literature in [55]. Definition 5: The Bhattacharyya distance between P and Q, denoted by BP Q), is given by BP Q) = D P Q) 8) ) = log H. 83) P Q) Note that, if P Q, then BP Q) = BQ P ) and BP Q) = 0 if and only if P = Q, though BP Q) does not satisfy the triangle inequality. III. FUNCTIONAL DOMINATION Let f and g be convex functions on 0, ) with f) = g) = 0, and let P and Q be probability measures defined on a measurable space A, F ). If there exists α > 0 such that ft) αgt) for all t 0, ), then it follows from Definition 3 that D f P Q) α D g P Q). 84) This simple observation leads to a proof of, for example, 6) and the left inequality in ) with the aid of Remark. March 4, 08
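The observation in (84) can be checked mechanically. The sketch below (illustrative, not from the paper) takes f(t) = t ln t − t + 1 (which yields D(P‖Q) in nats, cf. r in (43)) and g(t) = (t − 1)² (which yields χ²(P‖Q)), estimates sup_t f(t)/g(t) on a grid — the ratio approaches 1 nat as t → 0 — and verifies the resulting domination D(P‖Q) ≤ χ²(P‖Q) (nats) on random distributions. The grid range and test distributions are arbitrary choices of mine.

```python
import numpy as np

# f, g as in Theorem 1: f(1) = g(1) = 0, both convex, g > 0 away from t = 1.
f = lambda t: t * np.log(t) - t + 1.0     # gives D(P||Q) in nats (cf. r(t) in (43))
g = lambda t: (t - 1.0) ** 2              # gives chi^2(P||Q)

t = np.linspace(1e-4, 50.0, 200000)
t = t[np.abs(t - 1.0) > 1e-6]             # exclude t = 1, where both functions vanish
kappa_grid = np.max(f(t) / g(t))          # approaches 1 as the grid gets closer to t = 0
print("sup f/g on grid ~", kappa_grid)

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.dirichlet(np.ones(6)); q = rng.dirichlet(np.ones(6))
    D    = float(np.sum(p * np.log(p / q)))
    chi2 = float(np.sum((p - q) ** 2 / q))
    assert D <= 1.0 * chi2 + 1e-12        # D_f <= kappa* D_g with kappa* = 1 nat
    print(f"D = {D:.4f}  <=  chi2 = {chi2:.4f}")
```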

13 A. Basic Tool 3 Theorem : Let P Q, and assume f is convex on 0, ) with f) = 0; g is convex on 0, ) with g) = 0; gt) > 0 for all t 0, ), ). Denote the function κ: 0, ), ) R and κt) = ft), t 0, ), ) 85) gt) Then, κ = sup κt). 86) t 0,), ) a) D f P Q) κ D g P Q). 87) b) If, in addition, f ) = g ) = 0, then D f P Q) sup = κ. 88) P Q D g P Q) Proof: a) The bound in 87) follows from 84) and ft) κ gt) for all t > 0. b) Since g is positive except at t =, D g P Q) > 0 if P Q. The convexity of f, g on 0, ) implies their continuity; and since gt) > 0 for all t 0, ), ), κ ) is continuous on both 0, ) and, ). To show 88), we fix an arbitrary ν 0, ), ) and construct a sequence of pairs of probability measures whose ratio of f-divergence to g-divergence converges to κν). To that end, for sufficiently small ε > 0, let P ε and Q ε be parametric probability measures defined on the set A = {0, } with P ε 0) = ν ε and Q ε 0) = ε. Then, lim ε 0 D f P ε Q ε ) D g P ε Q ε ) = lim ε 0 ) ε fν) + ε) f νε ε ) 89) ε gν) + ε) g νε ε fν) + ν α f α) = lim α 0 gν) + ν α g α) 90) = κν) 9) where 90) holds by change of variable ε = α/ν + α), and 9) holds by the assumption on the derivatives of f and g at, the assumption that f) = g) = 0, and the continuity of κ ) at ν. If κ = κν) March 4, 08

14 4 we are done. If the supremum in 86) is not attained on 0, ), ), then the right side of 9) can be made arbitrarily close to κ by an appropriate choice of ν. Remark : Beyond the restrictions in Theorem a), the only operative restriction imposed by Theorem b) is the differentiability of the functions f and g at t =. Indeed, we can invoke Remark and add f ) t) to ft), without changing D f and likewise with g) and thereby satisfying the condition in Theorem b); the stationary point at must be a minimum of both f and g because of the assumed convexity, which implies their non-negativity on 0, ). Remark 3: It is useful to generalize Theorem b) by dropping the assumption on the existence of the derivatives at. To that end, note that the inverse transformation used for the transition to 90) is given by ν = + α ε ) where ε > 0 is sufficiently small, so if ν > resp. ν < ), then α > 0 resp. α < 0). Consequently, it is easy to see from 90) that if κ = sup t> κt), the construction in the proof can restrict to ν >, in which case it is enough to require that the left derivatives of f and g at be equal to 0. Analogously, if κ = sup 0<t< κt), it is enough to require that the right derivatives of f and g at be equal to 0. When neither left nor right derivatives at are 0, then 88) need not hold as the following example shows. Example : Let ft) = t and Then, κ =, while in view of 39) and 56) for all P, Q), gt) = ft) + t. 9) D g P Q) = D f P Q) = P Q. 93) B. Relationships Among DP Q), χ P Q) and P Q Since the Rényi divergence of order α > 0 is monotonically increasing in α, 8) yields DP Q) log + χ P Q) ) 94) χ P Q) log e. 95) Inequality 94), which can be found in [00] and [4, Theorem 5], is sharpened in Theorem 4 under the assumption of bounded relative information. In view of 80), an alternative way to sharpen 94) is for α, ), which is tight as α. DP Q) α log + α )H αp Q)) 96) Relationships between the relative entropy, total variation distance and χ divergence are derived next. March 4, 08

15 Theorem : a) If P Q and c, c 0, then 5 DP Q) c P Q + c χ P Q) ) log e 97) b) holds if c, c ) = 0, ) and c, c ) = 4, ). Furthermore, if c = 0 then c = is optimal, and if c = then c = 4 is optimal. sup where the supremum is over P Q and P Q. Proof: DP Q) + DQ P ) χ P Q) + χ Q P ) = log e 98) a) The satisfiability of 0) with c, c ) = 0, ) is equivalent to 95). Let P Q, and Y Q. Then, DP Q) = E [ )] dp r dq Y ) E [ dp dq Y ) ) + dp + dq Y ) ) ] 99) log e 00) = 4 P Q + χ P Q) ) log e 0) where 99) follows from the definition of relative entropy with the function r : 0, ) R defined in 43); 00) holds since for t 0, ) rt) and 0) follows from 47), 57), and the identity This proves 97) with c, c ) = 4, ). t) + = [ t) + + t ) ] log e 0) [ t + t) ]. Next, we show that if c = 0 then c = is the best possible constant in 97). To that end, let f t) = t ), and let κt) be the continuous extension of decreasing on 0, ), so rt) f t). It can be verified that the function κ is monotonically κ = lim t 0 κt) = log e. 03) Since D r P Q) = DP Q) and D f P Q) = χ P Q), and r ) = f ) = 0, the desired result follows from Theorem b). To show that c = 4 is the best possible constant in 97) if c =, we let g t) = [ t) + + t ) ]. Theorem b) does not apply here since g is not differentiable at t =. However, we can still construct March 4, 08

16 probability measures for proving the optimality of the point c, c ) = 4, ). To that end, let ε 0, ), and define probability measures P ε and Q ε on the set A = {0, } with P ε ) = ε and Q ε ) = ε. Since D r P Q) = DP Q) and D g P Q) = 4 P Q + χ P Q), DP ε Q ε ) lim ε 0 4 P ε Q ε + χ P ε Q ε ) = lim ε 0 ε) r + ε) + ε rε) ε) g + ε) + ε g ε) 6 04) = log e 05) where 05) holds since we can write numerator and denominator in the right side as b) We have with ε) r + ε) + ε rε) = ε ) log + ε) + ε log ε 06) = ε log e + oε), 07) ε) g + ε) + ε g ε) = ε ε. 08) D f P Q) = DP Q) + DQ P ), 09) D g P Q) = χ P Q) + χ Q P ) 0) ft) t ) log t ) gt) t t + t ) κt) = ft) gt) = t log t t, t 0, ), ) 3) lim κt) = κ = t log e 4) where 4) is easy to verify since κ is monotonically increasing on 0, ), and monotonically decreasing on, ). The desired result follows since the conditions of Theorem b) apply. Remark 4: Inequality 97) strengthens the bound claimed in [3,.8)], DP Q) P Q + χ P Q) ) log e, 5) although the short outline of the suggested proof in [3, p. 70] leads to the weaker upper bound P Q + χ P Q) nats. March 4, 08
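As a sanity check on Theorem 2 (not part of the paper), the following lines verify on random distributions the reading of (97) with (c1, c2) = (1/4, 1/2), i.e., D(P‖Q) ≤ (¼|P−Q| + ½χ²(P‖Q)) log e, and the bound implied by (98), namely D(P‖Q) + D(Q‖P) ≤ ½(χ²(P‖Q) + χ²(Q‖P)) in nats. The random test setup is mine.

```python
import numpy as np

def kl(p, q):   return float(np.sum(p * np.log(p / q)))       # nats
def chi2(p, q): return float(np.sum((p - q) ** 2 / q))
def tv(p, q):   return float(np.abs(p - q).sum())              # |P - Q|

rng = np.random.default_rng(1)
for _ in range(10000):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    # (97) with (c1, c2) = (1/4, 1/2), in nats:
    assert kl(p, q) <= 0.25 * tv(p, q) + 0.5 * chi2(p, q) + 1e-12
    # consequence of (98): D(P||Q) + D(Q||P) <= (chi2(P||Q) + chi2(Q||P)) / 2, in nats
    assert kl(p, q) + kl(q, p) <= 0.5 * (chi2(p, q) + chi2(q, p)) + 1e-12
print("both bounds held on 10000 random pairs")
```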

17 7 Remark 5: Note that 95) implies a looser result where the constant in the right side of 98) is doubled. Furthermore, ) and 98) result in the bound χ P Q) + χ Q P ) P Q which, although weaker than 5), has the same behavior for small values of P Q. C. Relationships Among DP Q), P Q) and P Q The next result shows three upper bounds on the Vincze-Le Cam distance in terms of the relative entropy and total variation distance). Theorem 3: Let r : 0, ) R be the function defined in 43), and let f denote the function f : 0, ) R defined in 60). a) If P Q, then where c =.59 = κ, computed with Furthermore, the constant c in 6) is the best possible. b) If P Q, then where c = is defined by P Q) log e c DP Q) 6) κt) = f t) log e. 7) rt) P Q) log e DP Q) + c P Q log e 8) c = min { c 0: κ c t), t 0, ) } 9) κ c t) = Furthermore, the constant c in 8) is the best possible. c) If P Q, then and the bound in ) is tight. Proof: f t) log e rt) + c t) + log e. 0) P Q) log e DP Q) + DQ P ) ) a) Let f = f log e and g = r. These functions satisfy the conditions in Theorem b), and result in the function κ defined in 7). The function κ is maximal at t = with c κt ) =.59. Since D f P Q) = P Q) log e and D g P Q) = DP Q), Theorem b) results in 6) with the optimality of its constant c. March 4, 08

18 b) By the definition of c in 9), it follows that for all t > 0 8 f t) log e rt) + c t) + log e; ) note that ) holds for all t 0, ) i.e., not only in 0, ) as in 9)) since, for t, ) is equivalent to f t) log e rt) which indeed holds for all t [, ). Hence, the bound in 8) follows from 4), 59), 66) and 68). It can be verified from 9) that c = , the maximal value of κ c on 0, ) is attained at t = and κ c t ) =. The optimality of the constant c in 8) is shown next note that Theorem b) cannot be used here, since the right side of ) is not differentiable at t = ). Let P ε and Q ε be probability measures defined on A = {0, } with P ε 0) = t ε and Q ε 0) = ε for ε 0, ). Then, it follows that lim ε 0 P ε Q ε ) log e DP ε Q ε ) + c Pε Q ε log e f t ) ε log e + oε) = lim [ ε 0 rt ) + c t ) log e ] ε + oε) 3) = κ c t ) = 4) where 3) follows by the construction of P ε and Q ε, and the use of Taylor series expansions; 4) follows from 0) note that t < ), which yields the required optimality result in 8). c) Let f = f log e and g : 0, ) R be defined by gt) = t ) log t for all t > 0. These functions, which satisfy the conditions in Theorem b), yield D f P Q) = P Q) log e and D g P Q) = DP Q)+ DQ P ). Theorem b) yields the desired result with κt) = t, t 0, ), ) 5) t + ) log e t κ) = = κ. 6) Remark 6: We can generalize 6) and ) to P Q) log e c DP Q) + c DQ P ) 7) where c, c 0 and P Q. In view of Theorem, the locus of the allowable constant pairs c, c ) in 7) can be evaluated. To that end, Theorem b) can be used with ft) = f t) log e, and gt) = c rt) + c t r ) t for all t > 0. This results in a convex region, which is symmetric with respect to the straight line c = c since P Q) = Q P )). Note that 6), ) and the symmetry property of this region identify, respectively, the points.59, 0),, ) and 0,.59) on its boundary; since the sum of their coordinates is nearly constant, the boundary of this subset of the positive quadrant is nearly a straight line of slope. March 4, 08
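The following added sketch double-checks two facts used in this subsection: the identity from Section II-B relating the Vincze-Le Cam distance to a χ²-divergence, which on a finite alphabet reads Δ(P‖Q) = Σ_a (P(a) − Q(a))²/(P(a) + Q(a)) = 2χ²(P‖½(P+Q)), and my reading of Theorem 3 c), i.e., Δ(P‖Q) log e ≤ ½(D(P‖Q) + D(Q‖P)). Everything is checked numerically in nats on random pairs; the setup is mine.

```python
import numpy as np

def kl(p, q):   return float(np.sum(p * np.log(p / q)))        # nats
def chi2(p, q): return float(np.sum((p - q) ** 2 / q))
def triangular(p, q):
    # Delta(P||Q) = sum_a (P(a)-Q(a))^2 / (P(a)+Q(a)), cf. f(t) = (t-1)^2/(t+1) in (60)
    return float(np.sum((p - q) ** 2 / (p + q)))

rng = np.random.default_rng(2)
for _ in range(5000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    m = 0.5 * (p + q)
    # identity: Delta(P||Q) = 2 * chi^2(P || (P+Q)/2)
    assert abs(triangular(p, q) - 2.0 * chi2(p, m)) < 1e-10
    # Theorem 3 c), as read here: Delta(P||Q) <= (D(P||Q) + D(Q||P)) / 2, in nats
    assert triangular(p, q) <= 0.5 * (kl(p, q) + kl(q, p)) + 1e-12
print("identity and bound held on 5000 random pairs")
```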

19 D. An Alternative Proof of Samson s Inequality 9 An analog of Pinsker s inequality, which comes in handy for the proof of Marton s conditional transportation inequality [3, Lemma 8.4], is the following bound due to Samson [89, Lemma ]: Theorem 4: If P Q, then where d P, Q) is the distance measure defined in 7). d P, Q) + d Q, P ) log e DP Q) 8) We provide next an alternative proof of Theorem 4, in view of Theorem b), with the following advantages: a) This proof yields the optimality of the constant in 8), i.e., we prove that sup d P, Q) + d Q, P ) DP Q) = log e where the supremum is over all probability measures P, Q such that P Q and P Q. b) A simple adaptation of this proof enables to derive a reverse inequality to 8), which holds under the boundedness assumption of the relative information see Section IV-D). Proof: 9) d P, Q) + d Q, P ) = D s P Q) + D s P Q) 30) = D f P Q) 3) where, from 74), s : 0, ) [0, ) is the convex function s t) = t s ) t ) {t > } t = t and, from 74) and 3), the non-negative and convex function f : 0, ) R in 3) is given by 3) ft) = st) + s t) = t ) max{, t} for all t > 0. Let r be the non-negative and convex function r : 0, ) R defined in 43), which yields D r P Q) = DP Q). Note that f) = r) = 0, and the functions f and r are both differentiable at t = with f ) = r ) = 0. The desired result follows from Theorem b) since in this case κt) = 33) t ), t 0, ), ) 34) rt) max{, t} lim κt) = t log e = κ, 35) as can be verified from the monotonicity of κ on 0, ) increasing) and, ) decreasing). Remark 7: As mentioned in [89, p. 438], Samson s inequality 8) strengthens the Pinsker-type inequality in [73, Lemma 3.]: d P, Q) log e min{ DP Q), DQ P ) } 36) March 4, 08
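For a finite alphabet, Marton's divergence of Section II-B has the closed form d²(P, Q) = Σ_a Q(a) (1 − P(a)/Q(a))⁺², which the sketch below (added here for illustration) uses to spot-check Samson's inequality, Theorem 4, in the form d²(P, Q) + d²(Q, P) ≤ (2/log e) D(P‖Q), i.e., at most 2 D(P‖Q) in nats. The finite-alphabet formula and test setup are my own.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))                   # nats

def marton_d2(p, q):
    # d^2(P,Q) = D_s(P||Q) with s(t) = (1-t)^2 1{t < 1}
    t = p / q
    return float(np.sum(q * np.maximum(1.0 - t, 0.0) ** 2))

rng = np.random.default_rng(3)
worst = 0.0
for _ in range(5000):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    ratio = (marton_d2(p, q) + marton_d2(q, p)) / kl(p, q)
    worst = max(worst, ratio)
    assert ratio <= 2.0 + 1e-9                                 # Samson's inequality, nats
print("largest observed ratio:", worst, "(the constant cannot be improved)")
```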

20 0 Nevertheless, similarly to our alternative proof of Theorem 4, one can verify that Theorem b) yields the optimality of the constant in 36). E. Ratio of f-divergence to Total Variation Distance Vajda [05, Theorem ] showed that the range of an f-divergence is given by see 35)) 0 D f P Q) f0) + f 0) 37) where every value in this range is attainable by a suitable pair of probability measures P Q. Recalling Remark, note that f b 0) + fb 0) = f0) + f 0) with f b ) defined in 38). Basu et al. [9, Lemma.] strengthened 37), showing that D f P Q) f0) + f 0)) P Q. 38) Note that, provided f0) and f 0) are finite, 38) yields a counterpart to 4). Next, we show that the constant in 38) cannot be improved. Theorem 5: If f : 0, ) R is convex with f) = 0, then sup D f P Q) P Q = f0) + f 0)) 39) where the supremum is over all probability measures P, Q such that P Q and P Q. Proof: As the first step, we give a simplified proof of 38) cf. [9, pp ]). In view of Remark, it is sufficient to show that for all t > 0, ft) + f0) f 0)) t ) f0) + f 0)) t. 40) If t 0, ), 40) reduces to ft) t)f0), which holds in view of the convexity of f and f) = 0. If t, we can readily check, with the aid of 34), that 40) reduces to f t ) t )f 0), which, in turn, holds because f is convex and f ) = 0. For the second part of the proof of 39), we construct a pair of probability measures P ε and Q ε such that, for a sufficiently small ε > 0, Df Pε Qε) P ε Q ε can be made arbitrarily close to the right side of 39). To that end, let ε 0, 5 )], and let P ε and Q ε be defined on the set A = {0,, } with P ε 0) = Q ε ) = ε and P ε ) = Q ε 0) = ε. Then, D f P ε Q ε ) ε fε) + ε f lim = lim ε) ε 0 P ε Q ε ε 0 ε ε) where 4) holds since P ε ) = Q ε ), and 4) follows from 33) and 35). 4) = f0) + f 0) 4) March 4, 08
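Theorem 5 gives, for any f with finite f(0) and f*(0), the bound D_f(P‖Q) ≤ ½(f(0) + f*(0)) |P − Q| with the best possible constant. The added sketch below checks two instances, which also appear in Corollary 1 right below: JS(P‖Q) ≤ ½ log 2 · |P − Q| and Δ(P‖Q) ≤ |P − Q| (the Jensen-Shannon closed form used here is the usual finite-alphabet one with the ½ weights, in nats; the test setup is mine).

```python
import numpy as np

def tv(p, q): return float(np.abs(p - q).sum())

def js(p, q):
    # Jensen-Shannon divergence in nats (P and Q assumed strictly positive here)
    m = 0.5 * (p + q)
    return 0.5 * float(np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))

def triangular(p, q):
    return float(np.sum((p - q) ** 2 / (p + q)))

rng = np.random.default_rng(4)
for _ in range(5000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert js(p, q) <= 0.5 * np.log(2.0) * tv(p, q) + 1e-12    # JS vs |P-Q|
    assert triangular(p, q) <= tv(p, q) + 1e-12                # Delta vs |P-Q|
print("both Theorem 5 instances held on 5000 random pairs")
```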

Remark 8: Csiszár [4, Theorem ] showed that if f(0) and f*(0) are finite and P ≪ Q, then there exists a constant C_f > 0, which depends only on f, such that
D_f(P‖Q) ≤ C_f √(|P − Q|).   (143)
When |P − Q| is sufficiently small, (143) is superseded by (138), where the constant is not only explicit but is the best possible according to Theorem 5.

A direct application of Theorem 5 yields

Corollary 1:
sup H_α(P‖Q) / |P − Q| = 1/(2(1 − α)),  α ∈ (0, 1),   (144)
sup Δ(P‖Q) / |P − Q| = 1,   (145)
sup JS(P‖Q) / |P − Q| = ½ log 2,   (146)
sup d²(P, Q) / |P − Q| = ½,   (147)
sup [d²(P, Q) + d²(Q, P)] / |P − Q| = 1,   (148)
where the suprema in (144)–(147) are over all P ≪ Q with P ≠ Q, and the supremum in (148) is over all P ≪≫ Q with P ≠ Q.

Remark 9: The results in (144), (145) and (146) strengthen, respectively, the inequalities in [65, Proposition .35], [0, )] and [0, Theorem ]. The results in (147) and (148) form counterparts of (129).

IV. BOUNDED RELATIVE INFORMATION

In this section we show that it is possible to find bounds among f-divergences without requiring a strong condition of functional domination (see Section III), as long as the relative information is upper and/or lower bounded almost surely.

A. Definition of β₁ and β₂

The following notation is used throughout the rest of the paper. Given a pair of probability measures (P, Q) on the same measurable space, denote β₁, β₂ ∈ [0, 1] by
β₁ = exp(−D_∞(P‖Q)),   (149)
β₂ = exp(−D_∞(Q‖P)),   (150)

with the convention that if D_∞(P‖Q) = ∞, then β₁ = 0, and if D_∞(Q‖P) = ∞, then β₂ = 0. Note that if β₁ > 0, then P ≪ Q, while β₂ > 0 implies Q ≪ P. Furthermore, if P ≪≫ Q, then, with Y ∼ Q,
β₁ = ess inf (dQ/dP)(Y) = [ess sup (dP/dQ)(Y)]⁻¹,   (151)
β₂ = ess inf (dP/dQ)(Y) = [ess sup (dQ/dP)(Y)]⁻¹.   (152)
The following examples illustrate important cases in which β₁ and β₂ are positive.

Example 2: (Gaussian distributions.) Let P and Q be Gaussian probability measures with equal means, and variances σ₀² and σ₁², respectively. Then,
β₁ = (σ₀/σ₁) 1{σ₀ ≤ σ₁},   (153)
β₂ = (σ₁/σ₀) 1{σ₁ ≤ σ₀}.   (154)

Example 3: (Shifted Laplace distributions.) Let P and Q be the probability measures whose probability density functions are, respectively, given by f_λ(· − a₁) and f_λ(· − a₀) with
f_λ(x) = (λ/2) exp(−λ|x|),  x ∈ ℝ,   (155)
where λ > 0. In this case, (155) gives
(dP/dQ)(x) = exp(−λ(|x − a₁| − |x − a₀|)),  x ∈ ℝ,   (156)
which yields
β₁ = β₂ = exp(−λ|a₁ − a₀|) ∈ (0, 1].   (157)

Example 4: (Cramér distributions.) Suppose that P and Q have Cramér probability density functions f_{θ₁,m₁} and f_{θ₀,m₀}, respectively, with
f_{θ,m}(x) = (θ/2)(1 + θ|x − m|)⁻²,  x ∈ ℝ,   (158)
where θ > 0 and m ∈ ℝ. In this case, we have β₁, β₂ ∈ (0, 1) since the ratio of the probability density functions, f_{θ₁,m₁}/f_{θ₀,m₀}, tends to θ₀/θ₁ < ∞ in the limit x → ±∞. In the special case where m₀ = m₁ = m ∈ ℝ, the ratio of these probability density functions is θ₁/θ₀ at x = m; due also to the symmetry of the probability density functions around m, it can be verified that in this special case
β₁ = β₂ = min{θ₀/θ₁, θ₁/θ₀}.   (159)

Example 5: (Cauchy distributions.) Suppose that P and Q have Cauchy probability density functions g_{γ₁,m₁} and g_{γ₀,m₀}, respectively, with
g_{γ,m}(x) = 1 / (πγ [1 + ((x − m)/γ)²]),  x ∈ ℝ,   (160)

23 3 where γ > 0. In this case, we also have β, β 0, ) since the ratio of the probability density functions tends to γ0 γ < in the limit x ±. In the special case where m 0 = m, { γ β = β = min, γ } 0. 6) γ 0 γ B. Basic Tool Since β = β = P = Q, it is advisable to avoid trivialities by excluding that case. Theorem 6: Let f and g satisfy the assumptions in Theorem, and assume that β, β ) [0, ). Then, where and κ ) is defined in 85). D f P Q) κ D g P Q) 6) κ = Proof: Defining g0) and g 0) as in 33)-35), respectively, note that sup κβ) 63) β β,),β ) f0) Qp = 0) g0) κ Qp = 0) 64) because if β = 0 then f0) κ g0), and if β > 0 then Qp = 0) = 0. Similarly, f 0) P q = 0) g 0) κ P q = 0) 65) because if β = 0 then f 0) κ g 0) and if β > 0 then P q = 0) = 0. Moreover, since f) = g) = 0, we can substitute pq > 0 by {pq > 0} {p q} in the right side of 37) for D f P Q) and likewise for D g P Q). In view of the definition of κ, 64) and 65), ) p D f P Q) κ q g dµ q pq>0,p q + κ g0) Qp = 0) + κ g 0) P q = 0) 66) = κ D g P Q). 67) Note that if β = β = 0, then Theorem 6 does not improve upon Theorem a). Remark 0: In the application of Theorem 6, it is often convenient to make use of the freedom afforded by Remark and choose the corresponding offsets such that: the positivity property of g required by Theorem 6 is satisfied; the lowest κ is obtained. March 4, 08
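On a finite alphabet, β₁ and β₂ in (149)–(150) are just extreme likelihood ratios: β₁ = [max_a P(a)/Q(a)]⁻¹ and β₂ = [max_a Q(a)/P(a)]⁻¹, with the maxima over the supports of Q and P, respectively. The sketch below (added here; the displayed bound is my reading of Example 6) computes them and checks χ²(P‖Q) ≤ max{1/β₁ − 1, 1 − β₂} · |P − Q| on random pairs.

```python
import numpy as np

def betas(p, q):
    # beta1 = exp(-D_inf(P||Q)), beta2 = exp(-D_inf(Q||P)) on a finite alphabet
    b1 = 1.0 / np.max(p[q > 0] / q[q > 0])
    b2 = 1.0 / np.max(q[p > 0] / p[p > 0])
    return float(b1), float(b2)

def chi2(p, q): return float(np.sum((p - q) ** 2 / q))
def tv(p, q):   return float(np.abs(p - q).sum())

rng = np.random.default_rng(5)
for _ in range(5000):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    b1, b2 = betas(p, q)
    bound = max(1.0 / b1 - 1.0, 1.0 - b2) * tv(p, q)   # Example 6, as read here
    assert chi2(p, q) <= bound + 1e-12
print("Example 6 bound held on 5000 random pairs")
```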

24 4 Remark : Similarly to the proof of Theorem b), under the conditions therein, one can verify that the constants in Theorem 6 are the best possible among all probability measures P, Q with given β, β ) [0, ). Remark : Note that if we swap the assumptions on f and g in Theorem 6, the same result translates into inf κβ) D g P Q) D f P Q). 68) β β,),β ) Furthermore, provided both f and g are positive except at t = ) and κ is monotonically increasing, Theorem 6 and 68) result in κβ ) D g P Q) D f P Q) 69) κβ ) D gp Q). 70) In this case, if β > 0, sometimes it is convenient to replace β > 0 with β 0, β ) at the expense of loosening the bound. A similar observation applies to β. Example 6: If ft) = t ) and gt) = t, we get χ P Q) max{β, β } P Q. 7) C. Bounds on DP Q) DQ P ) The remaining part of this section is devoted to various applications of Theorem 6. From this point, we make use of the definition of r : 0, ) [0, ) in 43). An illustrative application of Theorem 6 gives upper and lower bounds on the ratio of relative entropies. Theorem 7: Let P Q, P Q, and β, β ) 0, ). Let κ: 0, ), ) 0, ) be defined as Then, Proof: For t > 0, let κt) = t log t + t) log e t ) log e log t. 7) DP Q) κβ ) DQ P ) κβ ). 73) gt) = log t + t ) log e 74) then D g P Q) = DQ P ), D r P Q) = DP Q) and the conditions of Theorem 6 are satisfied. The desired result follows from Theorem 6 and the monotonicity of κ ) shown in Appendix A. March 4, 08
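Theorem 7 can be exercised numerically. The sketch below is illustrative only; the direction of the bounds reflects my reading of the theorem, with κ of this subsection understood in nats as κ(t) = (t ln t + 1 − t)/(t − 1 − ln t), which is monotonically increasing. It checks κ(β₂) ≤ D(P‖Q)/D(Q‖P) ≤ κ(1/β₁) on random finite-alphabet pairs.

```python
import numpy as np

def kl(p, q): return float(np.sum(p * np.log(p / q)))          # nats

def kappa(t):
    # kappa(t) = (t ln t + 1 - t) / (t - 1 - ln t), in nats
    return (t * np.log(t) + 1.0 - t) / (t - 1.0 - np.log(t))

def betas(p, q):
    return 1.0 / float(np.max(p / q)), 1.0 / float(np.max(q / p))

rng = np.random.default_rng(6)
for _ in range(5000):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    b1, b2 = betas(p, q)
    ratio = kl(p, q) / kl(q, p)
    assert kappa(b2) - 1e-9 <= ratio <= kappa(1.0 / b1) + 1e-9   # Theorem 7, as read here
print("Theorem 7 bounds held on 5000 random pairs")
```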

25 D. Reverse Samson s Inequality 5 The next result gives a counterpart to Samson s inequality 8). Theorem 8: Let β, β ) 0, ). Then, inf d P, Q) + d Q, P ) DP Q) = min { κβ ), κβ ) } 75) where the infimum is over all P Q with given β, β ), and where κ: 0, ), ) 0, log e) is given in 34). Proof: Applying Remark to the convex and positive except at t = ) function ft) given in 33), and gt) = rt), the lower bound on d P,Q)+d Q,P ) DP Q) in the right side of 75) follows from the fact that 34) is monotonically increasing on 0, ), and monotonically decreasing on, ). To verify that this is the best possible lower bound, we recall Remark since in this case f ) = g ) = 0. E. Bounds on DP Q) H αp Q) The following result bounds the ratio of relative entropy to Hellinger divergence of an arbitrary positive order α. Theorem 9 extends and strengthens a result for α 0, ) by Haussler and Opper [5, Lemma 4 and 6)] see also [6]), which in turn generalizes the special case for α = by Birgé and Massart [, 7.6)]. obtained simultaneously and independently Theorem 9: Let P Q, P Q, α 0, ), ) and β, β ) [0, ). Define the continuous function on [0, ]: log e t = 0; α) rt) t α t 0, ), ); + αt α κ α t) = α log e t = ; 76) t = and α 0, ); 0 t = and α, ). Then, for α 0, ), and, for α, ), Proof: κ α β ) DP Q) H α P Q) κ αβ ) 77) κ α β DP Q) ) H α P Q) κ αβ ). 78) March 4, 08

26 D r P Q) = DP Q), and D gα P Q) = H α P Q) with in view of Remark, 4) and 5). Since g αt) = α tα ) α g α t) = tα + αt α, t 0, ) 79) α g α is monotonically decreasing on 0, ] and monotonically increasing on [, ); hence, g α ) = 0 implies that g α is positive except at t =. The convexity conditions required by Theorem 6 are also easy to check for both r ) and g α ). The function in 76) is the continuous extension of r g α, which as shown in Appendix B, is monotonically increasing on [0, ] if α 0, ), and it is monotonically decreasing on [0, ] if α, ). Therefore, Theorem 6 results in 77) and 78) for α 0, ) and α, ), respectively. Remark 3: Theorem 9 is of particular interest for α =. In this case since, from 76), κ : [0, ] [0, ] is monotonically increasing with κ 0) = log e, the left inequality in 78) yields 6). For large arguments, κ ) grows logarithmically. For example, if β > , it follows from 77) that DP Q) 4 + log ) e e ) H P Q) nats 80) which is strictly smaller than the upper bound on the relative entropy in [, Theorem 5], given not in terms of β but in terms of another more cumbersome quantity that controls the mass that dp dq 6 may have at large values. As mentioned in Section II-B, χ P Q) is equal to the Hellinger divergence of order. Specializing Theorem 9 to the case α = results in which improves the upper and lower bounds in [34, Proposition ]: κ β DP Q) ) χ P Q) κ β ), 8) β log e DP Q) χ P Q) β log e. 8) For example, if β = β = 00, 8) gives a possible range [0.037, 0.963] nats for the ratio of relative entropy to χ -divergence, while 8) gives a range of [0.005, 50] nats. Note that if β = 0, then the upper bound in 8) is κ 0) = log e whereas it is according to [34, Proposition ]. In view of Remark, the bounds in 8) are the best possible among all probability measures P, Q with given β, β ) [0, ). F. Bounds on JSP Q) P Q) HαP Q) P Q), JSP Q), P Q Let P Q and P Q. [0, Theorem ] and [0, )] show, respectively, JSP Q) log e log, 83) P Q) P Q P Q) P Q. 84) The following result suggests a refinement in the upper bounds of 83) and 84). March 4, 08

27 Theorem 0: Let P Q, P Q, and let β, β [0, ). Then, JSP Q) log e P Q) max{ κ β ), κ β ) }, 85) P Q) P Q max{ κ β ), κ β ) } 86) with κ : [0, ] [0, log ] and κ : [0, ] [0, ] defined by log t = 0; [ )] t + t + κ t) = t ) t log t t + ) log t 0, ), ); and log e t = ; log t = ; t t [0, ); t + κ t) = t =. Proof: Let f TV, f, f JS denote the functions f : 0, ) R in 55), 60) and 65), respectively; these functions yield P Q, P Q) and JSP Q) as f-divergences. The functions κ and κ, as introduced in 87) and 88), respectively, are the continuous extensions to [0, ] of 7 87) 88) κ t) = f JSt) + t ) log, 89) f t) κ t) = f t) f TV t). 90) It can be verified by 87) and 88) that κ i t) = κ i t ) for t [0, ] and i {, }; furthermore, κ and κ are both monotonically decreasing on [0, ], and monotonically increasing on [, ]. Consequently, if β [β, β ] and i {, }, κ i ) κ i β) max { κ i β ), κ iβ ) }. 9) In view of Theorem 6 and Remark, 85) follows from 89) and 9); 86) follows from 90) and 9). If β = 0, referring to an unbounded relative information ı P Q, the right inequalities in 83), 84) and 85), 86) respectively coincide due to 87), 88), and since κ 0) = ); otherwise, the right inequalities in 85), 86) provide, respectively, sharper upper bounds than those in 83), 84). March 4, 08

28 Example 7: Suppose that P Q, P Q and Q = 4, 3 4) ); substituting β = β = improving the upper bounds on JSP Q) P Q) to the right inequalities in 83), 84). For finite alphabets, [7, Theorem 7] shows in 85) 88) gives dp dq a) for all a A e.g., P =, ) and JSP Q) log e 0.50 log e 9) P Q) P Q) P Q 3 93) and P Q) P Q which are log log e and, respectively, according log JSP Q) log e. 94) P Q) H The following theorem extends the validity of 94) for Hellinger divergences of an arbitrary order α 0, ) and for a general alphabet, while also providing a refinement of both upper and lower bounds in 94) when the relative information ı P Q is bounded. Theorem : Let P Q, P Q, α 0, ), ), and let β, β [0, ] be given in 49)-50). Let κ α : [0, ] [0, ) be given by log t = 0; [ α ) t log t t + ) log ) ] t+ κ α t) = t α t 0, ), ); + α αt log e t = ; α ) + α log t =. Then, if α 0, ), and, if α, ), min { κ α β ), κ αβ ) } JSP Q) H α P Q) 8 95) max κ α β) 96) β [β,β ] κ α β JSP Q) ) H α P Q) κ αβ ). 97) Proof: Let f α and f JS denote, respectively, the functions f : 0, ) R in 53) and 65) which yield the Hellinger divergence, H α, and the Jensen-Shannon divergence, JSP Q), as f-divergences. From 95), κ α is the continuous extension to [0, ] of κ α t) = f JSt) + t ) log f α t) + ) α. 98) α t ) March 4, 08

29 As shown in the proof of Theorem 9, for every α 0, ), ) and t 0, ), ), we have g α t) ) f α t) + α α t ) > 0. It can be verified that κ α : [0, ] R has the following monotonicity properties: if α 0, ), there exists t α > 0 such that κ α is monotonically increasing on [0, t α ], and it is monotonically decreasing on [t α, ]; if α, ), κ α is monotonically decreasing on [0, ]. Based on these properties of κ α, for β [β, β ]: if α 0, ) 9 if α, ), min { κ α β ), κ αβ ) } κ α β) max t [β,β ] κ α t); 99) κ α β ) κ αβ) κ α β ). 00) In view of Theorem 6, Remark and 98), the bounds in 96) and 97) follow respectively from 99) and 00). Example 8: Specializing Theorem to α =, since κ t) = κ ) t for t 0, ) and κ global maximum at t = with κ ) = log e, 96) implies that log κ ) min{β, β } Under the assumptions in Example 7, it follows from 0) that log e which improves the lower bound log log e in the left side of 94). achieves its JSP Q) log e. 0) P Q) H JSP Q) log e 0) P Q) H Specializing Theorem to α = implies that cf. 95) and 97)) Under the assumptions in Example 7, it follows from 03) that κ β JSP Q) ) χ P Q) κ β ). 03) 0.70 log e JSP Q) χ log e 04) P Q) while without any boundedness assumption, 97) yields the weaker upper and lower bounds 0 JSP Q) χ log log e. 05) P Q) March 4, 08

30 G. Local Behavior of f-divergences 30 Another application of Theorem 6 shows that the local behavior of f-divergences differs by only a constant, provided that the first distribution approaches the reference measure in a certain strong sense. Theorem : Suppose that {P n }, a sequence of probability measures defined on a measurable space A, F ), converges to Q another probability measure on the same space) in the sense that, for Y Q, lim ess sup dp n Y ) = 06) n dq where it is assumed that P n Q for all sufficiently large n. If f and g are convex on 0, ) and they are positive except at t = where they are 0), then and lim D f P n Q) = lim D gp n Q) = 0, 07) n n min{κ ), κ + D f P n Q) )} lim n D g P n Q) max{κ ), κ + )} 08) where we have indicated the left and right limits of the function κ ), defined in 85), at by κ ) and κ + ), respectively. Proof: Since f) = 0, 0 D f P n Q) = f ) dpn dq 09) dq where we have abbreviated The condition in 06) yields sup fβ) 0) β [β,n, β,n] β,n ess sup dp n Y ), ) dq β,n ess inf dp n dq Y ). ) lim β,n =, 3) n lim β,n = 4) n where 3) is a restatement of 06) see the notation in )), and Appendix C justifies 4). Hence, 07) follows from 0), 3), 4), the continuity of f at due to its convexity). Abbreviating I n = [β,n, ), β,n ], 6) and 68) result in inf κβ)d g P n Q) D f P n Q) sup κβ)d g P n Q). 5) β I n β I n March 4, 08

31 The right and left continuity of κ ) at together with 3) and 4) imply that 3 by letting n. inf κβ) min{κ ), κ + )}, 6) β I n sup κβ) max{κ ), κ + )} 7) β I n Corollary : Let {P n Q} converge to Q in the sense of 06). Then, DP n Q) and DQ P n ) vanish as n with DP n Q) lim =. 8) n DQ P n ) Corollary 3: Let {P n Q} converge to Q in the sense of 06). Then, χ P n Q) and DP n Q) vanish as n with Note that 9) is known in the finite alphabet case [8, Theorem 4.]). DP n Q) lim n χ P n Q) = log e. 9) In Example, the ratio in 08) is equal to, while the lower and upper bounds are 3 and, respectively. Continuing with Examples 3, 4 and 5, it is easy to check that 06) is satisfied in the following cases. Example 9: A sequence of Laplacian probability density functions with common variance and converging means: p n x) = λ exp λ x a n ) 0) lim a n = a. ) n Example 0: A sequence of converging Cramér probability density functions: p n x) = θ n + θ n x m n ), x R ) lim m n = m R 3) n lim θ n = θ > 0. 4) n Example : A sequence of converging Cauchy probability density functions: [ p n x) = ) ] x mn +, x R 5) πγ n γ n lim m n = m R 6) n lim γ n = γ > 0. 7) n March 4, 08
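Corollaries 2 and 3 are easy to visualize numerically. In the added sketch below (my own construction), P_n = (1 − 1/n)Q + (1/n)R for a fixed pair Q, R on a finite alphabet, so that ess sup dP_n/dQ → 1 as required by the convergence condition of this subsection; the ratios D(P_n‖Q)/D(Q‖P_n) and D(P_n‖Q)/χ²(P_n‖Q) are then seen to approach 1 and ½ log e (i.e., ½ in nats), respectively.

```python
import numpy as np

def kl(p, q):   return float(np.sum(p * np.log(p / q)))    # nats
def chi2(p, q): return float(np.sum((p - q) ** 2 / q))

q = np.array([0.2, 0.3, 0.5])
r = np.array([0.5, 0.4, 0.1])       # any other distribution on the same alphabet

for n in [10, 100, 1000, 10000]:
    p_n = (1.0 - 1.0 / n) * q + (1.0 / n) * r
    print(n,
          round(kl(p_n, q) / kl(q, p_n), 6),      # -> 1    (Corollary 2)
          round(kl(p_n, q) / chi2(p_n, q), 6))    # -> 0.5  (Corollary 3, in nats)
```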

32 H. Strengthened Jensen s inequality 3 Bounding away from zero a certain density between two probability measures enables the following strengthened version of Jensen s inequality, which generalizes a result in [33, Theorem ]. Lemma : Let f : R R be a convex function, P P 0 be probability measures defined on a measurable space A, F ), and fix an arbitrary random transformation P Z X : A R. Denote 0 P 0 P Z X P Z0, and P P Z X P Z. Then, where X 0 P 0, X P, and β E [fe[z 0 X 0 ])] fe[z 0 ]) ) E[fE[Z X ])] fe[z ]) 8) β ess inf dp dp 0 X 0 ). 9) Proof: If β = 0, the claimed result is Jensen s inequality, while if β =, P 0 = P and the result is trivial. Hence, we assume β 0, ). Note that P = βp 0 + β)p where P is the probability measure whose density with respect to P 0 is given by Letting P P Z X P Z, Jensen s inequality implies Furthermore, we can apply Jensen s inequality again to obtain dp = ) dp β 0. 30) dp 0 β dp 0 fe[z ]) β fe[z 0 ]) + β) fe[z ]). 3) fe[z ]) = fe [E[Z X ]]) 3) Substituting this bound on fe[z ]) in 3) we obtain the desired result. E [fe[z X ])] 33) = E [fe[z X ])] β E [fe[z 0 X 0 ])]. 34) β Remark 4: Letting Z = X, and choosing P 0 so that β = 0 e.g., P is a restriction of P 0 to an event of P 0 -probability less than ), 8) becomes Jensen s inequality fe[x ]) E[fX )]. Lemma finds the following application to the derivation of f-divergence inequalities. Theorem 3: Let f : 0, ) R be a convex function with f) = 0. Fix P Q on the same space with β, β ) [0, ) and let X P. Then, β D f P Q) E [ f expı P Q X)) )] f + χ P Q) ) 35) β D f P Q). 36) 0 We follow the notation in [] where P 0 P Z X P Z0 means that the marginal probability measures of the joint distribution P 0P Z X are P 0 and P Z0. March 4, 08

33 33 Proof: We invoke Lemma with P Z X that is given by the deterministic transformation exp ı P Q ) ) : A R. Then, E[Z 0 X 0 ] = exp ı P Q X 0 ) ). If, moreover, we let X 0 Q = P 0, we obtain and if we let X P = P, we have see 50)) E[Z 0 ] =, 37) E [fe[z 0 X 0 ])] = D f P Q) 38) E[Z ] = + χ P Q), 39) E [fe[z X ])] = E [ f expı P Q X)) )]. 40) Therefore, 35) follows from Lemma. Recalling 5), inequality 36) follows from Lemma as well switching the roles P 0 and P, namely, now we take P = P 0 and Q = P. Specializing Theorem 3 to the convex function on 0, ) where ft) = log t sharpens inequality 94) under the assumption of bounded relative information. Theorem 4: Fix P Q such that β, β ) 0, ). Then, β DQ P ) log + χ P Q) ) DP Q) 4) β DQ P ). 4) V. TOTAL VARIATION DISTANCE, RELATIVE INFORMATION SPECTRUM AND RELATIVE ENTROPY A. Exact Expressions The following result provides several useful expressions of the total variation distance in terms of the relative information. Theorem 5: Let P Q, and let X P and Y Q be defined on a measurable space A, F ). Then, P Q = E [ expıp Q Y )) ] 43) = E [ expı P Q Y )) ) +] 44) = E [ expı P Q Y )) ) ] 45) = E [ exp ı P Q X)) ) +] 46) = P [ ı P Q X) > 0 ] P [ ı P Q Y ) > 0 ]) 47) = P [ ı P Q Y ) 0 ] P [ ı P Q X) 0 ]) 48) z) + z {z > 0} = max{z, 0}, and z) z {z < 0} = max{ z, 0}. March 4, 08

34 Furthermore, if P Q, then = = = 0 0 β P [ ı P Q Y ) < log β ] dβ 49) [ P ı P Q X) > log ] dβ 50) β β [ F P Q log β) ] dβ. 5) P Q = E [ exp ı P Q X)) ) ] 34 5) = E [ exp ıp Q X)) ]. 53) Proof: See Appendix D. Remark 5: In view of 47), if P Q, the supremum in 58) is a maximum which is achieved by the event F = { a A: ı P Q a) > 0 } F. 54) Similarly to Theorem 5, the following theorem provides several expressions of the relative entropy in terms of the relative information spectrum. Theorem 6: If DP Q) <, then DP Q) = = = 0 0 FP Q α) ) dα 0 F P Q log β) dβ β P [ ı P Q Y ) > α ] αe α dα F P Q α) dα 55) F P Q log β) dβ 56) β 0 0 P [ ı P Q Y ) < α ] αe α dα 57) where Y Q, and for convenience 56) and 57) assume that the relative information and the resulting relative entropy are in nats. Proof: The expectation of a real-valued random variable V is equal to E[V ] = 0 P[V > t] dt 0 P[V < t] dt 58) where we are free to substitute > by, and < by. If we let V = ı P Q X) with X P, then 58) yields 55) provided that DP Q) = E[V ] <. Eq. 56) follows from 55) by the substitution α = log β when the relative entropy is expressed in nats. To prove 57), let Z = ı P Q Y ) with Y Q, and let V = rz) where r : 0, ) [0, ) is given in 43) with natural logarithm. The function r is strictly monotonically increasing on [, ), on which interval we define its inverse by s : [0, ) [, ); it is also strictly monotonically decreasing on 0, ], on which March 4, 08

35 35 interval we define its inverse by s : [0, ] 0, ]. Then, only the first integral on the right side of 58) can be non-zero, and we decompose it as DP Q) = = = 0 0 P [ Z, rz) > t ] dt + P [ Z > s t) ] dt + P[Z > v] log e v dv 0 0 P [ Z <, rz) > t ] dt P [ Z < s t) ] dt 59) 0 P[Z < v] log e v dv 60) where 60) follows from the change of variable of integration t = rv). Upon taking log e on both sides of the inequalities inside the probabilities in 60), and a further change of the variable of integration v = e α, 60) is seen to be equal to 57). B. Upper Bounds on P Q In this section, we provide three upper bounds on P Q which complement ). Theorem 7: If P Q and X P, then P Q log e DP Q) + E [ ı P Q X) ]. 6) Proof: For every z [, ], ) ) z exp z) {z > 0} {z > 0}. 6) log e Substituting z = ı P Q X), taking expectation of both sides of 6), and using 46) give P Q log e E [ ı P Q X) {ı P Q X) > 0} ] 63) = E [ ı P Q X) + ı P Q X) ] 64) = DP Q) + E [ ı P Q X) ]. 65) Remark 6: Theorem 7 is tighter than Pinsker s bound in [78,.3.4)]: P Q log e E [ ı P Q X) ]. 66) The second upper bound on P Q is a consequence of Theorem 5. Theorem 8: Let P Q with β, β ) [0, ). Then, for every β 0 [β, ], P Q β 0) P[ı P Q X) > 0] ] + β 0 β ) P [ı P Q X) > log β0 67) March 4, 08

36 where X P, and, for every β 0 [β, ], 36 P Q β 0) P[ı P Q Y ) < 0] + β 0 β ) P[ı P Q Y ) < log β 0 ] 68) where Y Q. Furthermore, both upper bounds on P Q in 67) and 68) are tight in the sense that they are achievable by suitable pairs of probability measures defined on a binary alphabet. Proof: Since the integrand in the right side of 50) is monotonically increasing in β, we may upper bound ] it by P [ı P Q X) > log when β [β β0, β 0 ], and by P[ı P Q X) > 0] when β β 0, ]. The same reasoning applied to 49) yields 68). To check the tightness of 67) for any β 0, ], choose an arbitrary η 0, ) and a pair of probability measures P and Q defined on the binary alphabet A = {0, } with P 0) = η ηβ, 69) Q0) = β P 0). 70) Then, we have ı P Q 0) = log β, ı P Q ) = log η < 0, and both sides of 67) are readily seen to be equal when β 0 = β. The tightness of 68) can be shown in a similar way. The third upper bound on P Q is a classical inequality [55, 99)], usually given in the context of bounding the error probability of Bayesian binary hypothesis testing in terms of the Bhattacharyya distance. Theorem 9: 4 P Q exp D P Q) ). 7) Remark 7: The bound in 7) is tight if P, Q are defined on {0, } with P 0) = Q0) or P 0) = Q). In view of the monotonicity of D α P Q) in α, Theorem 9 yields 4), which is equivalent to the Bretagnole- Huber inequality [5,.)] see also [08, pp. 30 3]). Note that 4) is tighter than ) only when P Q > C. Lower Bounds on P Q In this section, we give several lower bounds on P Q in terms of the relative information spectrum. Furthermore, in Section VI, we give lower bounds on P Q in terms of the relative entropy as well as other features of P and Q). If, for at least one value of β 0, ), either P [ ] ı P Q X) > log β or P [ ı P Q Y ) < log β ] are known then we get the following lower bounds on P Q as a consequence of Theorem 5: March 4, 08

37 Theorem 0: If P Q then, for every β 0 0, ), 37 P Q β 0 ) P [ ] ı P Q Y ) < log β 0, 7) ] P Q β 0 ) P [ı P Q X) > log β0 73) with X P and Y Q. Proof: The lower bounds in 7) and 73) follow from 49) and 50) respectively. For example, from 49), it follows that for an arbitrary β 0 0, ) P Q = P [ ı P Q Y ) < log β ] dβ 74) 0 β 0 P [ ı P Q Y ) < log β ] dβ 75) β 0 ) P [ ı P Q Y ) < log β 0 ] where 76) holds since the integrand in 75) is monotonically increasing in β 0, ]. Next we exemplify the utility of Theorem 0 by giving an alternative proof to the tight lower bound on the relative information spectrum, given in [68, Proposition ] as a function of the total variation distance. Proposition : Let P Q, then for every β > 0 0, β F P Q log β) β P Q β ), β 0, ], P Q ) P Q, Furthermore, for every β > 0 and δ [0, ), the lower bound in 77) is attainable by a pair P, Q) with P Q = δ. Proof: Since P Q, they cannot be mutually singular and therefore P Q <. From 7) and 73) see Theorem 0), it follows that for every β 0 0, ) P Q β 0) [ F P Q log )]. 78) β 0 Consequently, the substitution β = β 0 > yields F P Q log β) β P Q β ) which provides a non-negative lower bound on the relative information spectrum provided that β P Q. Having shown 77), we proceed to argue that it is tight. Fix δ [0, ) and let P Q = δ, which yields P Q = δ in the right side of 77).. 76) 77) 79) March 4, 08

38 If β < δ, let the pair P, Q) be defined on the binary alphabet {0, } with P ) = and Q) = δ thereby ensuring δ = P Q ). Then, from 7), F P Q log β) = P 0) = 0. 80) If β δ, let τ > β and consider the probability measures P = P τ and Q = Q τ defined on the binary alphabet {0, } with P τ ) = τδ τ and Q τ ) = then which tends to βδ β δ τ note that indeed δ = P τ Q τ ). Since < β < τ F P Q log β) = P τ 0) = τδ τ 8) in the right side of 77) by letting τ β. 38 Attained under certain conditions, the following counterpart to Theorem 8 gives a lower bound on the total variation distance based on the distribution of the relative information. It strengthens the bound in [0, Theorem 8], which in turn tightens the lower bounds in [78,.3.8)] and [99, Lemma 7]. Theorem : If P Q then, for any η, η > 0, P Q exp η ) ) P [ ı P Q X) η ] + expη ) ) P [ ı P Q X) η ] 8) with X P. Equality holds in 8) if P and Q are probability measures defined on {0, } and, for an arbitrary η, η > 0, P 0) = exp η ) exp η η ), 83) Q0) = exp η ) P 0). 84) Proof: From 53), it follows that for arbitrary η, η > 0, P Q E [ exp ıp Q X) ) { ıp Q X) η }] + E [ exp ıp Q X) ) { ıp Q X) η }] 85) which is readily loosened to obtain 8). Equality holds in 8) for P and Q in the theorem statement since ı P Q X) only takes the values log P 0) Q0) = η and log P ) Q) = η. The following lower bound on the total variation distance is the counterpart to Theorem 7. Theorem : If P Q, and X P then P Q log e E [ ı P Q X) ] DP Q). 86) Proof: We reason in parallel to the proof of Theorem 7. For all z [, ], [ exp z) ] z) log e. 87) March 4, 08

39 Substituting z = ı P Q X), taking expectation of both sides of 87), and using 5) we obtain P Q log e E [ ı P Q X) ) ] 39 88) = E [ ı P Q X) ı P Q X) ] 89) = E [ ı P Q X) ] DP Q). 90) Remark 8: The combination of Pinsker s inequality ) and 86) yields the following inequality due to Barron see [6, p. 339]) which is useful in establishing convergence results for relative entropy e.g. [7]) E [ ı P Q X) ] DP Q) + DP Q) log e 9) with X P. D. Relative Entropy and Bhattacharyya Distance The following result refines 5) by using an approach which relies on moment inequalities [96] [98]. The coverage in this section is self-contained. Theorem 3: If P Q, then DP Q) log + χ P Q) ) 3 χ P Q) ) log e + χ Q P ) ) + χ P Q) ). 9) Furthermore, if {P n } converges to Q in the sense of 06), then the ratio of DP n Q) and its upper bound in 9) tends to as n. Proof: The derivation of 9) relies on [96, Theorem.] which states that if W is a non-negative random variable, then E[W α ] E α [W ] ) log e, α 0, αα ) λ α log E[W ] ) E[log W ], α = 0 93) is log-convex in α R. E[W log W ] E[W ] log E[W ] ), α = To prove 9), let W = dp dq X) with X P, then 93) yields λ 0 = log + χ P Q) ) DP Q), 94) λ α = [ + α )H α Q P ) + χ P Q) ) ] α log e 95) αα + ) March 4, 08

40 for all α > 0, and specializing 95) yields In view of the log-convexity of λ α in α R, then which, by assembling 94) 98), yields 9). that λ = χ P Q) log e ), + χ P Q) [ ] 96) λ = + χ Q P ) 6 + χ P Q) ) log e. 97) λ 0 λ λ 98) Suppose that {P n } converges to Q in the sense of 06). Then, it follows from Theorem and Corollary 3 lim DP n Q) = 0, 99) n lim n χ P n Q) = 0, 300) DP n Q) lim n χ P n Q) = log e, 30) χ Q P n ) lim n χ =. 30) P n Q) Let U n denote the upper bound on DP n Q) in 9). Assembling 99) 30), it can be verified that lim n U n =. 303) DP n Q) 40 Remark 9: In view of 99) 30), while the ratio of the right side of 9) with P = P n and DP n Q) tends to, the ratio of the looser bound in 5) and DP n Q) tends to. Remark 0: If {P n } and Q are defined on a finite set A, then the condition in 06) is equivalent to P n Q 0 with Qa) > 0 for all a A. Remark : An alternative refinement of 5) has been recently obtained in [98] as a function of χ P Q) and the Bhattacharyya distance BP Q) see Definition 5): [ DP Q) log + χ P Q) ) 3 exp BP Q) ) + χ P Q) ] log e 9 χ. 304) P Q) Eq. 304) can be generalized by relying on the log-convexity of λ α in α R, which yields λ α 0 λ α λ α 305) March 4, 08

41 for all α 0, ); consequently, assembling 94), 95), 96) and 305) yields DP Q) log + χ P Q) ) α αα + ) ) α χ P Q) ) α α [ α)hα Q P ) ) + χ P Q) ) α ] α for all α 0, ). Note that in the special case α =, 306) becomes 304), as can be readily verified in view of 83) and the symmetry property H P Q) = H Q P ). Remark : The following lower bound on the relative entropy has been derived in [98], based on the approach of moment inequalities: Note that from 8) DP Q) BP Q) + BP Q) log 4 306) log e 6 [ exp BP Q) )] log e exp 4BP Q) ) + χ Q P ). 307) 4 P Q and since the right side of 307) is monotonically increasing in BP Q), the replacement of BP Q) in the right side of 307) with its lower bound in 308) yields DP Q) log 4 P Q ) + ) 308) 3 4 P Q log e 8 P Q + χ Q P ) P Q. 309) Although 309) improves the bound in 4), it is weaker than 307), and it satisfies the tightness property in Theorem 3 only in special cases such as when P and Q are defined on A = {0, } with P 0) = Q) = ε and we let ε 0. Define the binary relative entropy function as the continuous extension to [0, ] of ) ) x x dx y) = x log + x) log. 30) y y The following result improves the upper bound in 8). a) Theorem 4: Let P Q with β, β ) 0, ). Then, which is attainable for binary alphabets. For the derivation of 307) for a general alphabet, similarly to [98], set W = λ 0 λ 4 λ which follows from the log-convexity of λ α in α. χ P Q) β ) β ), 3) dq X) in 93) with X P, and use the inequality dp March 4, 08

42 b) { 3 DP Q) min log + c) + + β ), + c) 3) c + 8β c log e + c) c ) log e, 4β 33) β )} d β β β ) β β 34) where we have abbreviated c = χ P Q) for typographical convenience. Proof: To prove 3), we first consider the case where P, Q are defined on A = {0, } and P 0) Q0) = β, P ) Q) = β and. Straightforward calculation yields P 0) = β β β, Q0) = β P 0) 35) χ P Q) = β ) β ). 36) In the case of a general alphabet, consider the elementary bound with a < 0 < b: E[Z ] ab which holds for any Z [a, b], E[Z] = 0, and follows simply by taking expectations of Z = ab + Za + b) Z a)b Z) 37) ab + Za + b). 38) Since χ P Q) = E[Z ], 3) follows by letting a = β, b = β and Z = dp Y ), Y Q. 39) dq To prove 3), note that it follows by combining 9) with the left side of the inequality β χ Q P ) where 30) follows from Theorem 3 with ft) = t for t > 0. To prove 33), note that assembling 8) and 4) yields χ P Q) + χ P Q) β χ Q P ) 30) DP Q) log + χ P Q) ) β D P Q) χ P Q) log e and solving this quadratic inequality in DP Q), for fixed χ P Q), yields the bound in 33). Bound 34) holds since the maximal DP Q), for fixed χ P Q), is monotonically increasing in χ P Q). In view of 3), DP Q) cannot be larger than its maximal value when χ P Q) = β ) β ). In the latter case, the condition of equality in 38) recall that E[Z] = 0) is P[Z = a] = 4 3) b = P[Z = b] 3) b a March 4, 08

43 43 which implies that the maximal relative entropy DP Q) over all P Q with given β, β ) 0, ) is equal to E [ + Z) log + Z) ] 33) b + a) log + a) a + b) log + b) = b a β ) = d β β β ) β β β where 33) 35) follow from 30) and 3) with a = β and b = β. Remark 3: The proof of 3) relies on the left side of 30); this strengthens the bound which follows from Theorem 6, given by χ Q P ) β χ P Q). The bound 33) is typically of similar tightness as the bound in 3), although none of them outperforms the other for all β, β ) 0, ) and χ P Q) [ 0, β ) β ) ] see 3)). Remark 4: The left inequality in 8) and Theorem 4 provide an analytical outer bound on the locus of the points χ P Q), DP Q)) where P Q and β dp dq β for given β, β ) 0, ). Example : In continuation to Remark 4, for given β, β ) 0, ), Figure compares the locus of the points χ P Q), DP Q)) when P, Q are restricted to binary alphabets, and P Q is bounded between β and β, with an outer bound constructed with the left inequality in 8) and Theorem 4 recall that the outer bound is valid for an arbitrary alphabet). 34) 35) DP kq) nats b) DP kq) nats b) a) a) P kq) P kq) Fig.. Comparison of the locus of the points χ P Q), DP Q)) when P, Q are defined on A = {0, } with the bounds in the left side of 8) a) and Theorem 4 b); β, β ) =, ) and β, β ) =, ) 0 0 in the upper and lower plots, respectively. The dashed curve in each plot corresponds to the looser bound in 5). March 4, 08

44 44 The following result relies on the earlier analysis to provide bounds on the Bhattacharyya distance, expressed in terms of χ divergences and relative entropy. Theorem 5: If P Q, then the following bounds on the Bhattacharyya distance hold: log + χ P Q) ) [ log + 3 χ P Q) 4 log e log + χ P Q) ) ] ) DP Q) BP Q) 36) log + χ P Q) ) 3 4 log + P Q) ) 3 + χ P Q) ) + χ Q P ) ). 37) Furthermore, if {P n } converges to Q in the sense of 06), then the ratio of the bounds on BP n Q) in 36) and 37) tends to as n. Proof: In view of the log-convexity of λ α in α R, λ 0 λ λ, λ λ λ 3 38) for any choice of the random variable W in 93). Consequently, assembling 83), 94), 95) and 38) yield the bounds on BP Q) in 36) and 37). Suppose that {P n } converges to Q in the sense of 06). Let L n and U n denote, respectively, the lower and upper bounds on BP n Q) in 36) and 37). Assembling 99) 30), it can be easily verified that lim n which yields that lim n U n L n =. L n χ P n Q) = lim n Remark 5: Note that 37) refines the bound U n χ P n Q) = 8 log e 39) BP Q) log + χ P Q) ) 330) which is equivalent to λ 0 in view of Jensen s inequality, 83) and 95)). Remark 6: Let {P n } converge to Q in the sense of 06). In view of 39), it follows that BP n Q) lim n χ P n Q) = 8 log e, 33) from which we can surmise that both upper bounds in 9) and 304) are tight under the condition in 06) see Theorem 3), although 9) only depends on χ -divergences. In view of 33), the lower bound in 307) is also tight under the condition in 06), in the sense that the ratio of DP n Q) and its lower bound in 307) tends to as n ; this sufficient condition for the tightness of 307) strengthens the result in [98, Section 4]. March 4, 08

45 VI. REVERSE PINSKER INEQUALITIES 45 It is not possible to lower bound P Q solely in terms of DP Q) since for any arbitrarily small ɛ > 0 and arbitrarily large λ > 0, we can construct examples with P Q < ɛ and λ < DP Q) <. Therefore, each of the bounds in this section involves not only DP Q) but another feature of the pair P, Q). A. Bounded Relative Information As in Section IV, the following result involves the bounds on the relative information. Theorem 6: If β 0, ) and β [0, ), then, DP Q) ϕβ ) ϕβ ) ) P Q 33) where ϕ: [0, ) [0, ) is given by 0 t = 0 t log t ϕt) = t 0, ), ) t log e t =. 333) Proof: Let X P, Y Q, and Z be defined in 3). The function ϕ: [0, ) [0, ) is continuous, monotonically increasing and non-negative; the monotonicity property holds since t ) ϕ t) = t ) log e log t 0 for all t > 0, and its non-negativity follows from the fact that ϕ is monotonically increasing on [0, ) and ϕ0) = 0. Accordingly, ϕβ ) ϕz) ϕβ ) 334) since 3) and 5) 5) imply that Z [β, β ] with probability one. The relative entropy satisfies DP Q) = E [ Z log Z ] 335) = E [ ϕz) Z ) ] 336) = E [ ϕz) Z ) {Z > } ] + E [ ϕz) Z ) {Z < } ]. 337) We bound each of the summands in the right side of 337) separately. Invoking 334), we have E [ ϕz) Z ) {Z > } ] ϕβ ) E[ Z ) {Z > } ] 338) = ϕβ ) E[ Z) ] 339) = ϕβ ) P Q 340) March 4, 08

46 46 where 339) holds since x) x {x < 0}, and 340) follows from 45) with Z in 3). Similarly, 334) yields E [ ϕz) Z ) {Z < } ] ϕβ ) E [ Z ) {Z < } ] 34) = ϕβ ) E [ Z) +] 34) = ϕβ ) P Q 343) where 343) follows from 44). Assembling 337), 340) and 343), we obtain 33). Remark 7: By dropping the negative term in 33), we can get the weaker version in [0, Theorem 7]: ) log β DP Q) P Q. 344) β ) The coefficient of P Q in the right side of 344) is monotonically decreasing in β and it tends to log e by letting β. The improvement over 344) afforded by 33) is exemplified in Appendix E. The bound in 344) has been recently used in the context of the optimal quantization of probability measures [, Proposition 4]. Remark 8: The proof of Theorem 6 hinges on the fact that the function ϕ is monotonically increasing. It can be verified that ϕ is also concave and differentiable. Taking into account these additional properties of ϕ, the bound in Theorem 6 can be tightened as see Appendix F): DP Q) ϕβ ) ϕβ ) ϕ β ) ) β P Q + ϕ β ) E[ ZZ ) {Z > } ] 345) which is expressed in terms of the distribution of the relative information. The second summand in the right side of 345) satisfies χ P Q) + β P Q E[ ZZ ) {Z > } ] 346) χ P Q) + P Q. 347) From 3), 45) and Z [β, β ], the gap between the upper and lower bounds in 347) satisfies β ) P Q = β ) E [ Z) +] 348) β ) 349) which is upper bounded by, and it is close to zero if β. The combination of 345) and 347) leads to DP Q) ϕβ ) ϕβ ) + ϕ β ) β ) ) P Q + ϕ β ) χ P Q). 350) Remark 9: A special case of 344), where Q is the uniform distribution over a set of a finite size, was recently rediscovered in [58, Corollary 3] based on results from [5]. March 4, 08

47 Remark 30: For ε [0, ] and a fixed probability measure Q, define 47 D ε, Q) = inf DP Q). 35) P : P Q ε From Sanov s theorem see [0, Theorem.4.]), D ε, Q) is equal to the asymptotic exponential decay of the probability that the total variation distance between the empirical distribution of a sequence of i.i.d. random variables and the true distribution Q is more than a specified value ε. Bounds on D ε, Q) have been shown in [0, Theorem ], which, locally, behave quadratically in ε. Although this result was classified in [0] as a reverse Pinsker inequality, note that it differs from the scope of this section which provides, under suitable conditions, lower bounds on the total variation distance as a function of the relative entropy. B. Lipschitz Constraints Definition 6: A function f : B R, where B R, is L-Lipschitz if for all x, y B fx) fy) L x y. 35) The following bound generalizes [35, Theorem 6] to the non-discrete setting. Theorem 7: Let P Q with β 0, ) and β [0, ), and f : [0, ) R be continuous and convex with f) = 0, and L-Lipschitz on [β, β ]. Then, Proof: If Y Q, and Z is given by 3) then f) = 0 yields where 357) holds due to 43). D f P Q) L P Q. 353) D f P Q) = E [ fz) ] 354) E [ fz) f) ] L E [ Z ] Note that if f has a bounded derivative on [β, β ], we can choose L = Remark 3: In the case ft) = t log t, f0) = 0, 358) particularizes to 355) 356) = L P Q 357) sup f t) <. 358) t [β,β ] L = max { logeβ ), logeβ )} 359) resulting in a reverse Pinsker inequality which is weaker than that in 33) by at least a factor of. March 4, 08

48 C. Finite Alphabet 48 Throughout this subsection, we assume that P and Q are probability measures defined on a common finite set A, and Q is strictly positive on A, which has more than one element. The bound in 344) strengthens the finite-alphabet bound in [38, Lemma 3.0]: ) DP Q) log P Q 360) with Q min Q min = min a A Qa). 36) To verify this, notice that β Q min. Let v : 0, ) 0, ) be defined by vt) = t log t ; since v is a monotonically decreasing and non-negative function, we can weaken 344) to write ) log Q DP Q) min P Q 36) Q min ) ) log P Q 363) where 363) follows from 36). Q min The main result in this subsection is the following bound. Theorem 8: DP Q) log + ) P Q. 364) Q min Furthermore, if Q P and β is defined as in 50), then the following tightened bound holds: ) P Q DP Q) log + β log e P Q. 365) Q min Proof: Combining 5) and the following finite-alphabet upper bound on χ P Q) yields 364): Q min χ P Q) = a A P a) Qa)) Qa)/Q min 366) a A P a) Qa) ) 367) max P x) Qx) P a) Qa) x A a A = P Q max P a) Qa) 368) a A P Q. 369) March 4, 08

49 If P Q, then 365) follows by combining 369) and ) χ P Q) exp DP Q) + β DQ P ) 370) 49 where 370) follows by rearranging 4), and 37) follows from ). exp DP Q) + P Q β log e ) 37) Remark 3: It is easy to check that Theorem 8 strengthens the bound by Csiszár and Talata 3) by at least a factor of since upper bounding the logarithm in 364) gives DP Q) log e Q min P Q. 37) Remark 33: In the finite-alphabet case, similarly to 364), one can obtain another upper bound on DP Q) as a function of the l norm P Q : DP Q) Q min P Q log e 373) which appears in the proof of Property 4 of [0, Lemma 7], and also used in [57, 74)]. Furthermore, similarly to 365), the following tightened bound holds if P Q: DP Q) log + P ) Q β log e P Q 374) Q min which follows by combining 367), 37), and the inequality P Q P Q. Remark 34: Combining ) and 369) yields that if P Q are defined on a common finite set, then DP Q) χ P Q) Q min log e 375) which at least doubles the lower bound in [7, Lemma 6]. This, in turn, improves the tightened upper bound on the strong data processing inequality constant in [7, Theorem 0] by a factor of. Remark 35: Reverse Pinsker inequalities have been also derived in quantum information theory [3], [4]), providing upper bounds on the relative entropy of two quantum states as a function of the trace norm distance when the minimal eigenvalues of the states are positive c.f. [3, Theorem 6] and [4, Theorem ]). When the variational distance is much smaller than the minimal eigenvalue see [3, Eq. 57)]), the latter bounds have a quadratic scaling in this distance, similarly to 364); they are also inversely proportional to the minimal eigenvalue, similarly to the dependence of 364) in Q min. Remark 36: Let P and Q be probability distributions defined on an arbitrary alphabet A. Combining Theorems 7 and 6 leads to a derivation of an upper bound on the difference DP Q) DQ P ) as a function of P Q as long as P Q and the relative information ı P Q is bounded away from and +. Furthermore, another upper bound on the difference of the relative entropies can be readily obtained by combining Theorems 7 and 8 when P and Q are probability measures defined on a finite alphabet. In the latter case, combining March 4, 08

50 50 Theorem 7 and 374) also yields another upper bound on DP Q) DQ P ) which scales quadratically with P Q. All these bounds form a counterpart to [5, Theorem ] and Theorem 7, providing measures of the asymmetry of the relative entropy when the relative information is bounded. D. Distance From the Equiprobable Distribution If P is a distribution on a finite set A, HP ) gauges the distance from U, the equiprobable distribution defined on A, since HP ) = log A DP U). Thus, it is of interest to explore the relationship between HP ) and P U. Next, we determine the exact locus of the points HP ), P U ) among all probability measures P defined on A, and this region is compared to upper and lower bounds on P U as a function of HP ). As usual, hx) denotes the continuous extension of x log x x) log x) to x [0, ] and dx y) denotes the binary relative entropy in 30). Theorem 9: Let U be the equiprobable distribution on a {,..., A }, with < A <. a) For 0, A )], 3 max HP ) = log A min d m P : P U = m A + ) m A where the minimum in the right side of 376) is over 376) m {,..., A A }. 377) Denoting such an integer by m, the maximum in the left side of 376) is attained by A + l {,..., m }, m P l) = A A m ), l {m +,..., A }. b) Let 0, k = 0 h k = h A k ) + A k log k, k {,..., A } 378) 379) If H [h k, h k ) for k {,..., A }, then log A, k = A. min P U = k + θ) A ) 380) P : HP )=H 3 There is no P with P U > A ). March 4, 08

51 which is achieved by k + θ) A, l = 5 P k) θ l) = A, l {,..., k}, θ A, l = k + 0, l {k +,..., A } 38) where θ [0, ) is chosen so that HP k) θ ) = H. Proof: See Appendix G. Remark 37: For probability measures defined on a -element set A, the maximal and minimal values of P U in Theorem 9 coincide. This can be verified since, if P ) = p for p [0, ], then P U = p and HP ) = hp). Hence, if A = and HP ) = H [0, log ], then P U = h H) 38) where h : [0, log ] [0, ] denotes the inverse of the binary entropy function. Results on the more general problem of finding bounds on HP ) HQ) based on P Q can be found in [0, Theorem 7.3.3], [5], [90] and [4]. Most well-known among them is ) A HP ) HQ) P Q log P Q which holds if P, Q are probability measures defined on a finite set A with P a) Qa) 383) for all a A see [], and [7, Lemma.7] with a stronger sufficient condition). Particularizing 383) to the case where Q = U, and P a) A for all a A yields ) A HP ) log A P U log, 384) P U a bound which finds use in information-theoretic security [30]. Particularizing ), 4), and 364) we obtain HP ) log A P U log e, 385) HP ) log A + log 4 P U ), 386) HP ) log A log + A P U ). 387) If either A = or 8 A 0, it can be checked that the lower bound on HP ) in 384) is worse than 387), irrespectively of P U note that 0 P U A )). The exact locus of HP ), P U ) among all the probability measures P defined on a finite set A see Theorem 9), and the bounds in 385) 387) are illustrated in Figure for A = 4 and A = 56. For A = 4, March 4, 08

52 5 the lower bound in 387) is tighter than 384). For A = 56, we only show 387) in Figure as in this case 384) offers a very minor improvement in a small range. As the cardinality of the set A increases, the gap between the exact locus shaded region) and the upper bound obtained from 385) and 386) Curves a) and b), respectively) decreases, whereas the gap between the exact locus and the lower bound in 387) Curve c)) increases. b) b) P U c) a) P U a) c) HP )bits) HP )bits) Fig.. The exact locus of HP ), P U ) among all the probability measures P defined on a finite set A, and bounds on P U as a function of HP ) for A = 4 left plot), and A = 56 right plot). The point HP ), P U ) = 0, A )) is depicted on the y-axis. In the two plots, Curves a), b) and c) refer, respectively, to 385), 386) and 387); the exact locus shaded region) refers to Theorem 9. E. The Exponential Decay of the Probability of Non-Strongly Typical Sequences The objective is to bound the function L δ Q) = min DP Q) 388) P T δq) where the subset of probability measures on A, F ) which are δ-close to Q is given by { } T δ Q) = P : a A, P a) Qa) δ Qa). 389) Note that a,..., a n ) is strongly δ-typical according to Q if its empirical distribution belongs to T δ Q). According to Sanov s theorem e.g. [0, Theorem.4.]), if the random variables are independent and distributed according to Q, then the probability that Y,..., Y n ), is not δ-typical vanishes exponentially with exponent L δ Q). March 4, 08

f-divergence Inequalities

f-divergence Inequalities SASON AND VERDÚ: f-divergence INEQUALITIES f-divergence Inequalities Igal Sason, and Sergio Verdú Abstract arxiv:508.00335v7 [cs.it] 4 Dec 06 This paper develops systematic approaches to obtain f-divergence

More information

arxiv: v4 [cs.it] 17 Oct 2015

arxiv: v4 [cs.it] 17 Oct 2015 Upper Bounds on the Relative Entropy and Rényi Divergence as a Function of Total Variation Distance for Finite Alphabets Igal Sason Department of Electrical Engineering Technion Israel Institute of Technology

More information

On Improved Bounds for Probability Metrics and f- Divergences

On Improved Bounds for Probability Metrics and f- Divergences IRWIN AND JOAN JACOBS CENTER FOR COMMUNICATION AND INFORMATION TECHNOLOGIES On Improved Bounds for Probability Metrics and f- Divergences Igal Sason CCIT Report #855 March 014 Electronics Computers Communications

More information

Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding

Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding APPEARS IN THE IEEE TRANSACTIONS ON INFORMATION THEORY, FEBRUARY 015 1 Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding Igal Sason Abstract Tight bounds for

More information

arxiv: v4 [cs.it] 8 Apr 2014

arxiv: v4 [cs.it] 8 Apr 2014 1 On Improved Bounds for Probability Metrics and f-divergences Igal Sason Abstract Derivation of tight bounds for probability metrics and f-divergences is of interest in information theory and statistics.

More information

Tight Bounds for Symmetric Divergence Measures and a New Inequality Relating f-divergences

Tight Bounds for Symmetric Divergence Measures and a New Inequality Relating f-divergences Tight Bounds for Symmetric Divergence Measures and a New Inequality Relating f-divergences Igal Sason Department of Electrical Engineering Technion, Haifa 3000, Israel E-mail: sason@ee.technion.ac.il Abstract

More information

Arimoto-Rényi Conditional Entropy. and Bayesian M-ary Hypothesis Testing. Abstract

Arimoto-Rényi Conditional Entropy. and Bayesian M-ary Hypothesis Testing. Abstract Arimoto-Rényi Conditional Entropy and Bayesian M-ary Hypothesis Testing Igal Sason Sergio Verdú Abstract This paper gives upper and lower bounds on the minimum error probability of Bayesian M-ary hypothesis

More information

Arimoto Channel Coding Converse and Rényi Divergence

Arimoto Channel Coding Converse and Rényi Divergence Arimoto Channel Coding Converse and Rényi Divergence Yury Polyanskiy and Sergio Verdú Abstract Arimoto proved a non-asymptotic upper bound on the probability of successful decoding achievable by any code

More information

Soft Covering with High Probability

Soft Covering with High Probability Soft Covering with High Probability Paul Cuff Princeton University arxiv:605.06396v [cs.it] 20 May 206 Abstract Wyner s soft-covering lemma is the central analysis step for achievability proofs of information

More information

Convergence of generalized entropy minimizers in sequences of convex problems

Convergence of generalized entropy minimizers in sequences of convex problems Proceedings IEEE ISIT 206, Barcelona, Spain, 2609 263 Convergence of generalized entropy minimizers in sequences of convex problems Imre Csiszár A Rényi Institute of Mathematics Hungarian Academy of Sciences

More information

Convexity/Concavity of Renyi Entropy and α-mutual Information

Convexity/Concavity of Renyi Entropy and α-mutual Information Convexity/Concavity of Renyi Entropy and -Mutual Information Siu-Wai Ho Institute for Telecommunications Research University of South Australia Adelaide, SA 5095, Australia Email: siuwai.ho@unisa.edu.au

More information

A new converse in rate-distortion theory

A new converse in rate-distortion theory A new converse in rate-distortion theory Victoria Kostina, Sergio Verdú Dept. of Electrical Engineering, Princeton University, NJ, 08544, USA Abstract This paper shows new finite-blocklength converse bounds

More information

Bounds for entropy and divergence for distributions over a two-element set

Bounds for entropy and divergence for distributions over a two-element set Bounds for entropy and divergence for distributions over a two-element set Flemming Topsøe Department of Mathematics University of Copenhagen, Denmark 18.1.00 Abstract Three results dealing with probability

More information

On the Entropy of Sums of Bernoulli Random Variables via the Chen-Stein Method

On the Entropy of Sums of Bernoulli Random Variables via the Chen-Stein Method On the Entropy of Sums of Bernoulli Random Variables via the Chen-Stein Method Igal Sason Department of Electrical Engineering Technion - Israel Institute of Technology Haifa 32000, Israel ETH, Zurich,

More information

MMSE Dimension. snr. 1 We use the following asymptotic notation: f(x) = O (g(x)) if and only

MMSE Dimension. snr. 1 We use the following asymptotic notation: f(x) = O (g(x)) if and only MMSE Dimension Yihong Wu Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Email: yihongwu@princeton.edu Sergio Verdú Department of Electrical Engineering Princeton University

More information

5.1 Inequalities via joint range

5.1 Inequalities via joint range ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 5: Inequalities between f-divergences via their joint range Lecturer: Yihong Wu Scribe: Pengkun Yang, Feb 9, 2016

More information

Distance-Divergence Inequalities

Distance-Divergence Inequalities Distance-Divergence Inequalities Katalin Marton Alfréd Rényi Institute of Mathematics of the Hungarian Academy of Sciences Motivation To find a simple proof of the Blowing-up Lemma, proved by Ahlswede,

More information

Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities

Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities Maxim Raginsky and Igal Sason ISIT 2013, Istanbul, Turkey Capacity-Achieving Channel Codes The set-up DMC

More information

Some Expectations of a Non-Central Chi-Square Distribution With an Even Number of Degrees of Freedom

Some Expectations of a Non-Central Chi-Square Distribution With an Even Number of Degrees of Freedom Some Expectations of a Non-Central Chi-Square Distribution With an Even Number of Degrees of Freedom Stefan M. Moser April 7, 007 Abstract The non-central chi-square distribution plays an important role

More information

Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012

Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012 Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 202 BOUNDS AND ASYMPTOTICS FOR FISHER INFORMATION IN THE CENTRAL LIMIT THEOREM

More information

Dispersion of the Gilbert-Elliott Channel

Dispersion of the Gilbert-Elliott Channel Dispersion of the Gilbert-Elliott Channel Yury Polyanskiy Email: ypolyans@princeton.edu H. Vincent Poor Email: poor@princeton.edu Sergio Verdú Email: verdu@princeton.edu Abstract Channel dispersion plays

More information

On the Chi square and higher-order Chi distances for approximating f-divergences

On the Chi square and higher-order Chi distances for approximating f-divergences On the Chi square and higher-order Chi distances for approximating f-divergences Frank Nielsen, Senior Member, IEEE and Richard Nock, Nonmember Abstract We report closed-form formula for calculating the

More information

Lower Bounds on the Graphical Complexity of Finite-Length LDPC Codes

Lower Bounds on the Graphical Complexity of Finite-Length LDPC Codes Lower Bounds on the Graphical Complexity of Finite-Length LDPC Codes Igal Sason Department of Electrical Engineering Technion - Israel Institute of Technology Haifa 32000, Israel 2009 IEEE International

More information

topics about f-divergence

topics about f-divergence topics about f-divergence Presented by Liqun Chen Mar 16th, 2018 1 Outline 1 f-gan: Training Generative Neural Samplers using Variational Experiments 2 f-gans in an Information Geometric Nutshell Experiments

More information

Journal of Inequalities in Pure and Applied Mathematics

Journal of Inequalities in Pure and Applied Mathematics Journal of Inequalities in Pure and Applied Mathematics http://jipam.vu.edu.au/ Volume, Issue, Article 5, 00 BOUNDS FOR ENTROPY AND DIVERGENCE FOR DISTRIBUTIONS OVER A TWO-ELEMENT SET FLEMMING TOPSØE DEPARTMENT

More information

Channels with cost constraints: strong converse and dispersion

Channels with cost constraints: strong converse and dispersion Channels with cost constraints: strong converse and dispersion Victoria Kostina, Sergio Verdú Dept. of Electrical Engineering, Princeton University, NJ 08544, USA Abstract This paper shows the strong converse

More information

On the Concentration of the Crest Factor for OFDM Signals

On the Concentration of the Crest Factor for OFDM Signals On the Concentration of the Crest Factor for OFDM Signals Igal Sason Department of Electrical Engineering Technion - Israel Institute of Technology, Haifa 3, Israel E-mail: sason@eetechnionacil Abstract

More information

Solution of the 8 th Homework

Solution of the 8 th Homework Solution of the 8 th Homework Sangchul Lee December 8, 2014 1 Preinary 1.1 A simple remark on continuity The following is a very simple and trivial observation. But still this saves a lot of words in actual

More information

Entropy measures of physics via complexity

Entropy measures of physics via complexity Entropy measures of physics via complexity Giorgio Kaniadakis and Flemming Topsøe Politecnico of Torino, Department of Physics and University of Copenhagen, Department of Mathematics 1 Introduction, Background

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

CLASSICAL PROBABILITY MODES OF CONVERGENCE AND INEQUALITIES

CLASSICAL PROBABILITY MODES OF CONVERGENCE AND INEQUALITIES CLASSICAL PROBABILITY 2008 2. MODES OF CONVERGENCE AND INEQUALITIES JOHN MORIARTY In many interesting and important situations, the object of interest is influenced by many random factors. If we can construct

More information

Hartogs Theorem: separate analyticity implies joint Paul Garrett garrett/

Hartogs Theorem: separate analyticity implies joint Paul Garrett  garrett/ (February 9, 25) Hartogs Theorem: separate analyticity implies joint Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ (The present proof of this old result roughly follows the proof

More information

An Achievable Error Exponent for the Mismatched Multiple-Access Channel

An Achievable Error Exponent for the Mismatched Multiple-Access Channel An Achievable Error Exponent for the Mismatched Multiple-Access Channel Jonathan Scarlett University of Cambridge jms265@camacuk Albert Guillén i Fàbregas ICREA & Universitat Pompeu Fabra University of

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Examples of Dual Spaces from Measure Theory

Examples of Dual Spaces from Measure Theory Chapter 9 Examples of Dual Spaces from Measure Theory We have seen that L (, A, µ) is a Banach space for any measure space (, A, µ). We will extend that concept in the following section to identify an

More information

2 Sequences, Continuity, and Limits

2 Sequences, Continuity, and Limits 2 Sequences, Continuity, and Limits In this chapter, we introduce the fundamental notions of continuity and limit of a real-valued function of two variables. As in ACICARA, the definitions as well as proofs

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

Analysis II - few selective results

Analysis II - few selective results Analysis II - few selective results Michael Ruzhansky December 15, 2008 1 Analysis on the real line 1.1 Chapter: Functions continuous on a closed interval 1.1.1 Intermediate Value Theorem (IVT) Theorem

More information

Lecture 5 - Information theory

Lecture 5 - Information theory Lecture 5 - Information theory Jan Bouda FI MU May 18, 2012 Jan Bouda (FI MU) Lecture 5 - Information theory May 18, 2012 1 / 42 Part I Uncertainty and entropy Jan Bouda (FI MU) Lecture 5 - Information

More information

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply

More information

Functional Properties of MMSE

Functional Properties of MMSE Functional Properties of MMSE Yihong Wu epartment of Electrical Engineering Princeton University Princeton, NJ 08544, USA Email: yihongwu@princeton.edu Sergio Verdú epartment of Electrical Engineering

More information

The Fading Number of a Multiple-Access Rician Fading Channel

The Fading Number of a Multiple-Access Rician Fading Channel The Fading Number of a Multiple-Access Rician Fading Channel Intermediate Report of NSC Project Capacity Analysis of Various Multiple-Antenna Multiple-Users Communication Channels with Joint Estimation

More information

Solution. 1 Solution of Homework 7. Sangchul Lee. March 22, Problem 1.1

Solution. 1 Solution of Homework 7. Sangchul Lee. March 22, Problem 1.1 Solution Sangchul Lee March, 018 1 Solution of Homework 7 Problem 1.1 For a given k N, Consider two sequences (a n ) and (b n,k ) in R. Suppose that a n b n,k for all n,k N Show that limsup a n B k :=

More information

Copyright 2010 Pearson Education, Inc. Publishing as Prentice Hall.

Copyright 2010 Pearson Education, Inc. Publishing as Prentice Hall. .1 Limits of Sequences. CHAPTER.1.0. a) True. If converges, then there is an M > 0 such that M. Choose by Archimedes an N N such that N > M/ε. Then n N implies /n M/n M/N < ε. b) False. = n does not converge,

More information

An elementary proof of the weak convergence of empirical processes

An elementary proof of the weak convergence of empirical processes An elementary proof of the weak convergence of empirical processes Dragan Radulović Department of Mathematics, Florida Atlantic University Marten Wegkamp Department of Mathematics & Department of Statistical

More information

Scalar multiplication and addition of sequences 9

Scalar multiplication and addition of sequences 9 8 Sequences 1.2.7. Proposition. Every subsequence of a convergent sequence (a n ) n N converges to lim n a n. Proof. If (a nk ) k N is a subsequence of (a n ) n N, then n k k for every k. Hence if ε >

More information

An Improved Sphere-Packing Bound for Finite-Length Codes over Symmetric Memoryless Channels

An Improved Sphere-Packing Bound for Finite-Length Codes over Symmetric Memoryless Channels POST-PRIT OF THE IEEE TRAS. O IFORMATIO THEORY, VOL. 54, O. 5, PP. 96 99, MAY 8 An Improved Sphere-Packing Bound for Finite-Length Codes over Symmetric Memoryless Channels Gil Wiechman Igal Sason Department

More information

Proof. We indicate by α, β (finite or not) the end-points of I and call

Proof. We indicate by α, β (finite or not) the end-points of I and call C.6 Continuous functions Pag. 111 Proof of Corollary 4.25 Corollary 4.25 Let f be continuous on the interval I and suppose it admits non-zero its (finite or infinite) that are different in sign for x tending

More information

Lebesgue-Radon-Nikodym Theorem

Lebesgue-Radon-Nikodym Theorem Lebesgue-Radon-Nikodym Theorem Matt Rosenzweig 1 Lebesgue-Radon-Nikodym Theorem In what follows, (, A) will denote a measurable space. We begin with a review of signed measures. 1.1 Signed Measures Definition

More information

MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016

MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 Lecture 14: Information Theoretic Methods Lecturer: Jiaming Xu Scribe: Hilda Ibriga, Adarsh Barik, December 02, 2016 Outline f-divergence

More information

Inequalities for the L 1 Deviation of the Empirical Distribution

Inequalities for the L 1 Deviation of the Empirical Distribution Inequalities for the L 1 Deviation of the Emirical Distribution Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, Marcelo J. Weinberger June 13, 2003 Abstract We derive bounds on the robability

More information

4. Conditional risk measures and their robust representation

4. Conditional risk measures and their robust representation 4. Conditional risk measures and their robust representation We consider a discrete-time information structure given by a filtration (F t ) t=0,...,t on our probability space (Ω, F, P ). The time horizon

More information

The Hilbert Transform and Fine Continuity

The Hilbert Transform and Fine Continuity Irish Math. Soc. Bulletin 58 (2006), 8 9 8 The Hilbert Transform and Fine Continuity J. B. TWOMEY Abstract. It is shown that the Hilbert transform of a function having bounded variation in a finite interval

More information

Entropy power inequality for a family of discrete random variables

Entropy power inequality for a family of discrete random variables 20 IEEE International Symposium on Information Theory Proceedings Entropy power inequality for a family of discrete random variables Naresh Sharma, Smarajit Das and Siddharth Muthurishnan School of Technology

More information

WE start with a general discussion. Suppose we have

WE start with a general discussion. Suppose we have 646 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 43, NO. 2, MARCH 1997 Minimax Redundancy for the Class of Memoryless Sources Qun Xie and Andrew R. Barron, Member, IEEE Abstract Let X n = (X 1 ; 111;Xn)be

More information

A SYNOPSIS OF HILBERT SPACE THEORY

A SYNOPSIS OF HILBERT SPACE THEORY A SYNOPSIS OF HILBERT SPACE THEORY Below is a summary of Hilbert space theory that you find in more detail in any book on functional analysis, like the one Akhiezer and Glazman, the one by Kreiszig or

More information

On Bayes Risk Lower Bounds

On Bayes Risk Lower Bounds Journal of Machine Learning Research 17 (2016) 1-58 Submitted 4/16; Revised 10/16; Published 12/16 On Bayes Risk Lower Bounds Xi Chen Stern School of Business New York University New York, NY 10012, USA

More information

RÉNYI DIVERGENCE AND THE CENTRAL LIMIT THEOREM. 1. Introduction

RÉNYI DIVERGENCE AND THE CENTRAL LIMIT THEOREM. 1. Introduction RÉNYI DIVERGENCE AND THE CENTRAL LIMIT THEOREM S. G. BOBKOV,4, G. P. CHISTYAKOV 2,4, AND F. GÖTZE3,4 Abstract. We explore properties of the χ 2 and more general Rényi Tsallis) distances to the normal law

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

The Poisson Channel with Side Information

The Poisson Channel with Side Information The Poisson Channel with Side Information Shraga Bross School of Enginerring Bar-Ilan University, Israel brosss@macs.biu.ac.il Amos Lapidoth Ligong Wang Signal and Information Processing Laboratory ETH

More information

Correlation Detection and an Operational Interpretation of the Rényi Mutual Information

Correlation Detection and an Operational Interpretation of the Rényi Mutual Information Correlation Detection and an Operational Interpretation of the Rényi Mutual Information Masahito Hayashi 1, Marco Tomamichel 2 1 Graduate School of Mathematics, Nagoya University, and Centre for Quantum

More information

Lecture 4 Lebesgue spaces and inequalities

Lecture 4 Lebesgue spaces and inequalities Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how

More information

x log x, which is strictly convex, and use Jensen s Inequality:

x log x, which is strictly convex, and use Jensen s Inequality: 2. Information measures: mutual information 2.1 Divergence: main inequality Theorem 2.1 (Information Inequality). D(P Q) 0 ; D(P Q) = 0 iff P = Q Proof. Let ϕ(x) x log x, which is strictly convex, and

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 3, MARCH

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 3, MARCH IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 57, NO 3, MARCH 2011 1587 Universal and Composite Hypothesis Testing via Mismatched Divergence Jayakrishnan Unnikrishnan, Member, IEEE, Dayu Huang, Student

More information

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011 LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS S. G. Bobkov and F. L. Nazarov September 25, 20 Abstract We study large deviations of linear functionals on an isotropic

More information

5 Measure theory II. (or. lim. Prove the proposition. 5. For fixed F A and φ M define the restriction of φ on F by writing.

5 Measure theory II. (or. lim. Prove the proposition. 5. For fixed F A and φ M define the restriction of φ on F by writing. 5 Measure theory II 1. Charges (signed measures). Let (Ω, A) be a σ -algebra. A map φ: A R is called a charge, (or signed measure or σ -additive set function) if φ = φ(a j ) (5.1) A j for any disjoint

More information

Guesswork Subject to a Total Entropy Budget

Guesswork Subject to a Total Entropy Budget Guesswork Subject to a Total Entropy Budget Arman Rezaee, Ahmad Beirami, Ali Makhdoumi, Muriel Médard, and Ken Duffy arxiv:72.09082v [cs.it] 25 Dec 207 Abstract We consider an abstraction of computational

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

Continuity. Matt Rosenzweig

Continuity. Matt Rosenzweig Continuity Matt Rosenzweig Contents 1 Continuity 1 1.1 Rudin Chapter 4 Exercises........................................ 1 1.1.1 Exercise 1............................................. 1 1.1.2 Exercise

More information

Continued fractions for complex numbers and values of binary quadratic forms

Continued fractions for complex numbers and values of binary quadratic forms arxiv:110.3754v1 [math.nt] 18 Feb 011 Continued fractions for complex numbers and values of binary quadratic forms S.G. Dani and Arnaldo Nogueira February 1, 011 Abstract We describe various properties

More information

Channel Dispersion and Moderate Deviations Limits for Memoryless Channels

Channel Dispersion and Moderate Deviations Limits for Memoryless Channels Channel Dispersion and Moderate Deviations Limits for Memoryless Channels Yury Polyanskiy and Sergio Verdú Abstract Recently, Altug and Wagner ] posed a question regarding the optimal behavior of the probability

More information

ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016

ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture : Mutual Information Method Lecturer: Yihong Wu Scribe: Jaeho Lee, Mar, 06 Ed. Mar 9 Quick review: Assouad s lemma

More information

Partial cubes: structures, characterizations, and constructions

Partial cubes: structures, characterizations, and constructions Partial cubes: structures, characterizations, and constructions Sergei Ovchinnikov San Francisco State University, Mathematics Department, 1600 Holloway Ave., San Francisco, CA 94132 Abstract Partial cubes

More information

Divergences, surrogate loss functions and experimental design

Divergences, surrogate loss functions and experimental design Divergences, surrogate loss functions and experimental design XuanLong Nguyen University of California Berkeley, CA 94720 xuanlong@cs.berkeley.edu Martin J. Wainwright University of California Berkeley,

More information

Universal Estimation of Divergence for Continuous Distributions via Data-Dependent Partitions

Universal Estimation of Divergence for Continuous Distributions via Data-Dependent Partitions Universal Estimation of for Continuous Distributions via Data-Dependent Partitions Qing Wang, Sanjeev R. Kulkarni, Sergio Verdú Department of Electrical Engineering Princeton University Princeton, NJ 8544

More information

Context tree models for source coding

Context tree models for source coding Context tree models for source coding Toward Non-parametric Information Theory Licence de droits d usage Outline Lossless Source Coding = density estimation with log-loss Source Coding and Universal Coding

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Concentration Inequalities for Density Functionals. Shashank Singh

Concentration Inequalities for Density Functionals. Shashank Singh Concentration Inequalities for Density Functionals by Shashank Singh Submitted to the Department of Mathematical Sciences in partial fulfillment of the requirements for the degree of Master of Science

More information

2019 Spring MATH2060A Mathematical Analysis II 1

2019 Spring MATH2060A Mathematical Analysis II 1 2019 Spring MATH2060A Mathematical Analysis II 1 Notes 1. CONVEX FUNCTIONS First we define what a convex function is. Let f be a function on an interval I. For x < y in I, the straight line connecting

More information

6.1 Variational representation of f-divergences

6.1 Variational representation of f-divergences ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 6: Variational representation, HCR and CR lower bounds Lecturer: Yihong Wu Scribe: Georgios Rovatsos, Feb 11, 2016

More information

Chapter 1 Preliminaries

Chapter 1 Preliminaries Chapter 1 Preliminaries 1.1 Conventions and Notations Throughout the book we use the following notations for standard sets of numbers: N the set {1, 2,...} of natural numbers Z the set of integers Q the

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

Set, functions and Euclidean space. Seungjin Han

Set, functions and Euclidean space. Seungjin Han Set, functions and Euclidean space Seungjin Han September, 2018 1 Some Basics LOGIC A is necessary for B : If B holds, then A holds. B A A B is the contraposition of B A. A is sufficient for B: If A holds,

More information

Chapter 8. General Countably Additive Set Functions. 8.1 Hahn Decomposition Theorem

Chapter 8. General Countably Additive Set Functions. 8.1 Hahn Decomposition Theorem Chapter 8 General Countably dditive Set Functions In Theorem 5.2.2 the reader saw that if f : X R is integrable on the measure space (X,, µ) then we can define a countably additive set function ν on by

More information

EVALUATIONS OF EXPECTED GENERALIZED ORDER STATISTICS IN VARIOUS SCALE UNITS

EVALUATIONS OF EXPECTED GENERALIZED ORDER STATISTICS IN VARIOUS SCALE UNITS APPLICATIONES MATHEMATICAE 9,3 (), pp. 85 95 Erhard Cramer (Oldenurg) Udo Kamps (Oldenurg) Tomasz Rychlik (Toruń) EVALUATIONS OF EXPECTED GENERALIZED ORDER STATISTICS IN VARIOUS SCALE UNITS Astract. We

More information

δ xj β n = 1 n Theorem 1.1. The sequence {P n } satisfies a large deviation principle on M(X) with the rate function I(β) given by

δ xj β n = 1 n Theorem 1.1. The sequence {P n } satisfies a large deviation principle on M(X) with the rate function I(β) given by . Sanov s Theorem Here we consider a sequence of i.i.d. random variables with values in some complete separable metric space X with a common distribution α. Then the sample distribution β n = n maps X

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

Math 118B Solutions. Charles Martin. March 6, d i (x i, y i ) + d i (y i, z i ) = d(x, y) + d(y, z). i=1

Math 118B Solutions. Charles Martin. March 6, d i (x i, y i ) + d i (y i, z i ) = d(x, y) + d(y, z). i=1 Math 8B Solutions Charles Martin March 6, Homework Problems. Let (X i, d i ), i n, be finitely many metric spaces. Construct a metric on the product space X = X X n. Proof. Denote points in X as x = (x,

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Pierpaolo De Blasi University of Turin 10th August 2007, BNR Workshop, Isaac Newton Intitute, Cambridge, UK Joint work with Giovanni Peccati (Université Paris VI) and

More information

Research Article On Some Improvements of the Jensen Inequality with Some Applications

Research Article On Some Improvements of the Jensen Inequality with Some Applications Hindawi Publishing Corporation Journal of Inequalities and Applications Volume 2009, Article ID 323615, 15 pages doi:10.1155/2009/323615 Research Article On Some Improvements of the Jensen Inequality with

More information

Rényi Information Dimension: Fundamental Limits of Almost Lossless Analog Compression

Rényi Information Dimension: Fundamental Limits of Almost Lossless Analog Compression IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 8, AUGUST 2010 3721 Rényi Information Dimension: Fundamental Limits of Almost Lossless Analog Compression Yihong Wu, Student Member, IEEE, Sergio Verdú,

More information

Simple Channel Coding Bounds

Simple Channel Coding Bounds ISIT 2009, Seoul, Korea, June 28 - July 3,2009 Simple Channel Coding Bounds Ligong Wang Signal and Information Processing Laboratory wang@isi.ee.ethz.ch Roger Colbeck Institute for Theoretical Physics,

More information

Journal of Inequalities in Pure and Applied Mathematics

Journal of Inequalities in Pure and Applied Mathematics Journal of Inequalities in Pure and Applied Mathematics http://jipamvueduau/ Volume 7, Issue 2, Article 59, 2006 ENTROPY LOWER BOUNDS RELATED TO A PROBLEM OF UNIVERSAL CODING AND PREDICTION FLEMMING TOPSØE

More information

A GENERAL CLASS OF LOWER BOUNDS ON THE PROBABILITY OF ERROR IN MULTIPLE HYPOTHESIS TESTING. Tirza Routtenberg and Joseph Tabrikian

A GENERAL CLASS OF LOWER BOUNDS ON THE PROBABILITY OF ERROR IN MULTIPLE HYPOTHESIS TESTING. Tirza Routtenberg and Joseph Tabrikian A GENERAL CLASS OF LOWER BOUNDS ON THE PROBABILITY OF ERROR IN MULTIPLE HYPOTHESIS TESTING Tirza Routtenberg and Joseph Tabrikian Department of Electrical and Computer Engineering Ben-Gurion University

More information

Part 2 Continuous functions and their properties

Part 2 Continuous functions and their properties Part 2 Continuous functions and their properties 2.1 Definition Definition A function f is continuous at a R if, and only if, that is lim f (x) = f (a), x a ε > 0, δ > 0, x, x a < δ f (x) f (a) < ε. Notice

More information

Nishant Gurnani. GAN Reading Group. April 14th, / 107

Nishant Gurnani. GAN Reading Group. April 14th, / 107 Nishant Gurnani GAN Reading Group April 14th, 2017 1 / 107 Why are these Papers Important? 2 / 107 Why are these Papers Important? Recently a large number of GAN frameworks have been proposed - BGAN, LSGAN,

More information

ECE 4400:693 - Information Theory

ECE 4400:693 - Information Theory ECE 4400:693 - Information Theory Dr. Nghi Tran Lecture 8: Differential Entropy Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43 Outline 1 Review: Entropy of discrete RVs 2 Differential

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

44 CHAPTER 2. BAYESIAN DECISION THEORY

44 CHAPTER 2. BAYESIAN DECISION THEORY 44 CHAPTER 2. BAYESIAN DECISION THEORY Problems Section 2.1 1. In the two-category case, under the Bayes decision rule the conditional error is given by Eq. 7. Even if the posterior densities are continuous,

More information

Estimating divergence functionals and the likelihood ratio by convex risk minimization

Estimating divergence functionals and the likelihood ratio by convex risk minimization Estimating divergence functionals and the likelihood ratio by convex risk minimization XuanLong Nguyen Dept. of Statistical Science Duke University xuanlong.nguyen@stat.duke.edu Martin J. Wainwright Dept.

More information