f-divergence Inequalities


f-divergence Inequalities

Igal Sason    Sergio Verdú

Abstract — This paper develops systematic approaches to obtain f-divergence inequalities, dealing with pairs of probability measures defined on arbitrary alphabets. Functional domination is one such approach, where special emphasis is placed on finding the best possible constant upper bounding a ratio of f-divergences. Another approach used for the derivation of bounds among f-divergences relies on moment inequalities and the logarithmic-convexity property, which results in tight bounds on the relative entropy and Bhattacharyya distance in terms of χ² divergences. A rich variety of bounds are shown to hold under boundedness assumptions on the relative information. Special attention is devoted to the total variation distance and its relation to the relative information and relative entropy, including reverse Pinsker inequalities, as well as to the E_γ divergence, which generalizes the total variation distance. Pinsker's inequality is extended for this type of f-divergence, a result which leads to an inequality linking the relative entropy and relative information spectrum. Integral expressions of the Rényi divergence in terms of the relative information spectrum are derived, leading to bounds on the Rényi divergence in terms of either the variational distance or relative entropy.

Keywords: relative entropy, total variation distance, f-divergence, Rényi divergence, Pinsker's inequality, relative information.

I. Sason is with the Department of Electrical Engineering, Technion–Israel Institute of Technology, Haifa 3000, Israel (e-mail: sason@ee.technion.ac.il). S. Verdú is with the Department of Electrical Engineering, Princeton University, Princeton, New Jersey 08544, USA (e-mail: verdu@princeton.edu).

This manuscript is identical to the published journal paper [94], apart from some additional material which includes Sections III-C and IV-F, and three technical proofs. Parts of this work have been presented at the 2014 Workshop on Information Theory and Applications, San Diego, Feb. 2014, at the 2015 IEEE Information Theory Workshop, Jeju Island, Korea, Oct. 2015, and at the 2016 IEEE International Conference on the Science of Electrical Engineering, Eilat, Israel, Nov. 2016. This work has been supported by the Israeli Science Foundation (ISF) under Grant /, by NSF Grant CCF-0665, by the Center for Science of Information, an NSF Science and Technology Center under Grant CCF, and by ARO under MURI Grant W9NF.

2 I. INTRODUCTION Throughout their development, information theory, and more generally, probability theory, have benefitted from non-negative measures of dissimilarity, or loosely speaking, distances, between pairs of probability measures defined on the same measurable space see, e.g., [4], [65], [06]). Notable among those measures are see Section II for definitions): total variation distance P Q ; relative entropy DP Q); χ -divergence χ P Q); Hellinger divergence H α P Q); Rényi divergence D α P Q). It is useful, particularly in proving convergence results, to give bounds of one measure of dissimilarity in terms of another. The most celebrated among those bounds is Pinsker s inequality: P Q log e DP Q) ) proved by Csiszár [3] and Kullback [60], with Kemperman [56] independently a bit later. Improved and generalized versions of Pinsker s inequality have been studied, among others, in [39], [44], [46], [76], [86], [9], [04]. Relationships among measures of distances between probability measures have long been a focus of interest in probability theory and statistics e.g., for studying the rate of convergence of measures). The reader is referred to surveys in [4, Section 3], [65, Chapter ], [86] and [87, Appendix 3], which provide several relationships among useful f-divergences and other measures of dissimilarity between probability measures. Some notable existing bounds among f-divergences include, in addition to ): [63, Lemma ], [64, p. 5] [5,.)] H P Q) P Q ) H P Q) 4 H P Q) ) ; 3) 4 P Q exp DP Q) ) ; 4) The folklore in information theory is that ) is due to Pinsker [78], albeit with a suboptimal constant. As explained in [0], although no such inequality appears in [78], it is possible to put together two of Pinsker s bounds to conclude that Csiszár derived ) in [3, Theorem 4.] after publishing a weaker version in [, Corollary ] a year earlier. 408 P Q log e DP Q). March 4, 08
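As a quick numerical illustration (added here, not part of the original paper), the following minimal Python sketch evaluates both sides of Pinsker's inequality for an arbitrary pair of distributions on a small finite alphabet. It assumes the paper's normalization of the total variation distance, |P − Q| = Σ_a |P(a) − Q(a)| (twice the usual supremum difference), works in nats (log e = 1), and the function names and example distributions are mine.

```python
import numpy as np

def total_variation(p, q):
    # |P - Q| in the paper's normalization: sum of |P(a) - Q(a)| over the alphabet
    return float(np.abs(p - q).sum())

def relative_entropy(p, q):
    # D(P||Q) in nats; requires Q(a) > 0 wherever P(a) > 0 (i.e., P << Q)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

lhs = 0.5 * total_variation(p, q) ** 2   # (1/2)|P-Q|^2 log e, with log e = 1 in nats
rhs = relative_entropy(p, q)
print(f"(1/2)|P-Q|^2 = {lhs:.4f}  <=  D(P||Q) = {rhs:.4f}")
```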

3 [4, Theorem 5], [35, Theorem 4], [00] 3 [48, Corollary 5.6] For all α χ P Q) DP Q) log + χ P Q) ) ; 5) ) α + α ) H α P Q) ; 6) the inequality in 6) is reversed if α 0, ), ], and it holds with equality if α =. [44], [45], [86, 58)] [97] P Q, P Q [ 0, ] χ P Q) P Q P Q, P Q, ). D P Q) DQ P ) χ P Q) log e; 8) 7) 4 H P Q) log e DP Q) DQ P ) 9) 4 χ P Q) χ Q P ) log e; 0) 4 H P Q) log e DP Q) + DQ P ) ) χ P Q) + χ Q P ) ) log e; ) [3,.8)] [46], [86, Corollary 3], [9] DP Q) DP Q) + DQ P ) P Q log χ P Q) + χ Q P ) [53, p. 7] cf. a generalized form in [87, Lemma A.3.5]) generalized in [65, Proposition.5]: P Q + χ P Q) ) log e ; 3) ) + P Q, 4) P Q 8 P Q 4 P Q ; 5) H P Q) log e DP Q), 6) H α P Q) log e D α P Q) DP Q), 7) March 4, 08

4 for α 0, ), and 4 for α, ). [65, Proposition.35] If α 0, ), β max{α, α}, then [37, Theorems 3 and 6] H α P Q) log e D α P Q) DP Q) 8) + P Q ) β P Q ) β α) H α P Q) 9) D α P Q) is monotonically increasing in α > 0; α ) D α P Q) is monotonically decreasing in α 0, ]; P Q. 0) [65, Proposition.7] the same monotonicity properties hold for H α P Q). [46] If α 0, ], then α P Q log e D α P Q); ) An inequality for H α P Q) [6, Lemma ] becomes at α = ), the parallelogram identity [6,.)] DP 0 Q) + DP Q) = DP 0 P ) + DP P ) + DP Q). ) with P = P 0 + P ), and extends a result for the relative entropy in [6, Theorem.]. A reverse Pinsker inequality, providing an upper bound on the relative entropy in terms of the total variation distance, does not exist in general since we can find distributions which are arbitrarily close in total variation but with arbitrarily high relative entropy. Nevertheless, it is possible to introduce constraints under which such reverse Pinsker inequalities hold. In the special case of a finite alphabet A, Csiszár and Talata [9, p. 0] showed that when Q min min a A Qa) is positive. 3 log e DP Q) Q min ) P Q 3) [5, Theorem 3.] if f : 0, ) R is a strictly convex function, then there exists a real-valued function ψ f such that lim x 0 ψ f x) = 0 and 4 P Q ψ f Df P Q) ). 4) 3 Recent applications of 3) can be found in [57, Appendix D] and [0, Lemma 7] for the analysis of the third-order asymptotics of the discrete memoryless channel with or without cost constraints. 4 Eq. 4) follows as a special case of [5, Theorem 3.] with m = and w m =. March 4, 08

which implies
lim_{n→∞} D_f(P_n‖Q_n) = 0  ⟹  lim_{n→∞} |P_n − Q_n| = 0.   (25)

The numerical optimization of an f-divergence subject to simultaneous constraints on f_i-divergences (i = 1, ..., L) was recently studied in [48], which showed that for that purpose it is enough to restrict attention to alphabets of cardinality L + 2. Earlier, [50] showed that if L = 1, then either the solution is obtained by a pair (P, Q) on a binary alphabet, or it is a deflated version of such a point. Therefore, from a purely numerical standpoint, the minimization of D_f(P‖Q) such that D_g(P‖Q) ≤ d can be accomplished by a grid search on [0, 1]². Occasionally, as in the case where D_f(P‖Q) = D(P‖Q) and D_g(P‖Q) = |P − Q|, it is actually possible to determine analytically the locus of (D_f(P‖Q), D_g(P‖Q)) (see [39]). In fact, as shown in [05], a binary alphabet suffices if the single constraint is on the total variation distance. The same conclusion holds when minimizing the Rényi divergence [9].

In this work, we find relationships among the various divergence measures outlined above as well as a number of other measures of dissimilarity between probability measures. The framework of f-divergences, which encompasses the foregoing measures (Rényi divergence is a one-to-one transformation of the Hellinger divergence), serves as a convenient playground.

The rest of the paper is structured as follows: Section II introduces the basic definitions needed and, in particular, the various measures of dissimilarity between probability measures used throughout. Based on functional domination, Section III provides a basic tool for the derivation of bounds among f-divergences. Under mild regularity conditions, this approach also makes it possible to prove the optimality of constants in those bounds. In addition, we show instances where such optimality can be shown in the absence of regularity conditions. The basic tool used in Section III is exemplified in obtaining relationships among important f-divergences such as relative entropy, Hellinger divergence and total variation distance. This approach is also useful in strengthening and providing an alternative proof of Samson's inequality [89] (a counterpart to Pinsker's inequality using Marton's divergence, useful in proving certain concentration of measure results [3]), whose constant we show cannot be improved. In addition, we show several new results in Section III-E on the maximal ratios of various f-divergences to total variation distance. Section IV provides an approach for bounding ratios of f-divergences, assuming that the relative information (see Definition 1) is lower and/or upper bounded with probability one. The approach is exemplified in bounding ratios of relative entropy to various f-divergences, and in analyzing the local behavior of f-divergence ratios when the reference measure is fixed. We also show that bounded relative information leads to a strengthened version of Jensen's inequality, which, in turn, results in upper and lower bounds on the ratio of the non-negative difference

log(1 + χ²(P‖Q)) − D(P‖Q) to D(Q‖P). A new reverse version of Samson's inequality is another byproduct of the main tool in this section.

The rich structure of the total variation distance, as well as its importance in both fundamentals and applications, merits placing special attention on bounding the rest of the distance measures in terms of |P − Q|. Section V gives several useful identities linking the total variation distance with the relative information spectrum, which result in a number of upper and lower bounds on |P − Q|, some of which are tighter than Pinsker's inequality in various ranges of the parameters. It also provides refined bounds on D(P‖Q) as a function of χ²-divergences and the total variation distance.

Section VI is devoted to proving reverse Pinsker inequalities, namely, lower bounds on |P − Q| as a function of D(P‖Q), involving either a) bounds on the relative information, b) Lipschitz constants, or c) the minimum mass of the reference measure (in the finite alphabet case). In the latter case, we also examine the relationship between entropy and the total variation distance from the equiprobable distribution, as well as the exponential decay of the probability that an independent identically distributed sequence is not strongly typical.

Section VII focuses on the E_γ divergence. This f-divergence generalizes the total variation distance, and its utility in information theory has been exemplified in [8], [68], [69], [70], [79], [80], [8]. Based on the operational interpretation of the DeGroot statistical information [3] and the integral representation of f-divergences as a function of DeGroot's measure, Section VII provides an integral representation of f-divergences as a function of the E_γ divergence; this representation shows that {(E_γ(P‖Q), E_γ(Q‖P)), γ ≥ 1} uniquely determines D(P‖Q) and H_α(P‖Q), as well as any other f-divergence with twice differentiable f. Accordingly, bounds on the E_γ divergence directly translate into bounds on other important f-divergences. In addition, we show an extension of Pinsker's inequality (1) to the E_γ divergence, which leads to a relationship between the relative information spectrum and relative entropy.

The Rényi divergence, which has found a plethora of information-theoretic applications, is the focus of Section VIII. Expressions of the Rényi divergence are derived in Section VIII-A as a function of the relative information spectrum. These expressions lead in Section VIII-B to the derivation of bounds on the Rényi divergence as a function of the variational distance under the boundedness assumption of the relative information. Bounds on the Rényi divergence of an arbitrary order are derived in Section VIII-C as a function of the relative entropy when the relative information is bounded.

II. BASIC DEFINITIONS

A. Relative Information and Relative Entropy

We assume throughout that the probability measures P and Q are defined on a common measurable space (A, F), and P ≪ Q denotes that P is absolutely continuous with respect to Q, namely there is no event F ∈ F

such that P(F) > 0 = Q(F).

Definition 1: If P ≪ Q, the relative information provided by a ∈ A according to (P, Q) is given by⁵
ı_{P‖Q}(a) = log (dP/dQ)(a).   (26)
When the argument of the relative information is distributed according to P, the resulting real-valued random variable is of particular interest. Its cumulative distribution function and expected value are known as follows.

Definition 2: If P ≪ Q, the relative information spectrum is the cumulative distribution function
F_{P‖Q}(x) = P[ı_{P‖Q}(X) ≤ x]   (27)
with⁶ X ∼ P. The relative entropy of P with respect to Q is
D(P‖Q) = E[ı_{P‖Q}(X)]   (28)
        = E[ı_{P‖Q}(Y) exp(ı_{P‖Q}(Y))]   (29)
where Y ∼ Q.

B. f-divergences

Introduced by Ali-Silvey [] and Csiszár [], [3], a useful generalization of the relative entropy, which retains some of its major properties (and, in particular, the data processing inequality [3]), is the class of f-divergences. A general definition of an f-divergence is given in [66, p. 4398], specialized next to the case where P ≪ Q.

Definition 3: Let f : (0, ∞) → ℝ be a convex function, and suppose that P ≪ Q. The f-divergence from P to Q is given by
D_f(P‖Q) = ∫ f(dP/dQ) dQ   (30)
          = E[f(Z)]   (31)
where
Z = exp(ı_{P‖Q}(Y)),  Y ∼ Q,   (32)
and in (30) we took the continuous extension⁷
f(0) = lim_{t↓0} f(t) ∈ (−∞, +∞].   (33)

⁵ dP/dQ denotes the Radon-Nikodym derivative (or density) of P with respect to Q. Logarithms have an arbitrary common base, and exp(·) indicates the inverse function of the logarithm with that base.
⁶ X ∼ P means that P[X ∈ F] = P(F) for any event F ∈ F.
⁷ The convexity of f : (0, ∞) → ℝ implies its continuity on (0, ∞).
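Definition 3 translates directly into a few lines of code on a finite alphabet. The sketch below is an illustration added here (not from the paper): it computes D_f(P‖Q) = E_Q[f(dP/dQ)] for an arbitrary convex f with f(1) = 0, using the continuous extension f(0) as in (33), and it assumes P ≪ Q, i.e., Q(a) > 0 wherever P(a) > 0. The function names are mine.

```python
import numpy as np

def f_divergence(p, q, f, f_at_0=None):
    """D_f(P||Q) = sum_a Q(a) f(P(a)/Q(a)) on a finite alphabet, assuming P << Q.

    f      : convex function on (0, inf) with f(1) = 0
    f_at_0 : continuous extension f(0) = lim_{t->0} f(t); needed only if
             some letter has P(a) = 0 < Q(a).
    """
    total = 0.0
    for pa, qa in zip(p, q):
        if qa == 0:
            if pa > 0:
                raise ValueError("P is not absolutely continuous w.r.t. Q")
            continue                      # letter outside the support of Q
        t = pa / qa
        total += qa * (f(t) if t > 0 else f_at_0)
    return total

# Relative entropy via f(t) = t log t (nats, f(0) = 0) and chi^2 via f(t) = (t-1)^2
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.25, 0.25, 0.5])
kl = f_divergence(p, q, lambda t: t * np.log(t), f_at_0=0.0)
chi2 = f_divergence(p, q, lambda t: (t - 1.0) ** 2, f_at_0=1.0)
print(kl, chi2)
```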

We can also define D_f(P‖Q) without requiring P ≪ Q. Let f : (0, ∞) → ℝ be a convex function with f(1) = 0, and let f* : (0, ∞) → ℝ be given by
f*(t) = t f(1/t)   (34)
for all t > 0. Note that f* is also convex (see, e.g., [4, Section 3.2.6]), f*(1) = 0, and D_{f*}(P‖Q) = D_f(Q‖P) if P ≪≫ Q. By definition, we take
f*(0) = lim_{t↓0} f*(t) = lim_{u→∞} f(u)/u.   (35)
If p and q denote, respectively, the densities of P and Q with respect to a σ-finite measure μ (i.e., p = dP/dμ, q = dQ/dμ), then we can write (30) as
D_f(P‖Q) = ∫_{pq>0} q f(p/q) dμ + f(0) Q(p = 0) + f*(0) P(q = 0).   (36)–(37)

Remark 1: Different functions may lead to the same f-divergence for all (P, Q): if, for an arbitrary b ∈ ℝ, we have
f_b(t) = f_0(t) + b (t − 1),  t > 0,   (38)
then
D_{f_0}(P‖Q) = D_{f_b}(P‖Q).   (39)

The following key property of f-divergences follows from Jensen's inequality.

Proposition 1: If f : (0, ∞) → ℝ is convex with f(1) = 0, and P ≪ Q, then
D_f(P‖Q) ≥ 0.   (40)
If, furthermore, f is strictly convex at t = 1, then equality in (40) holds if and only if P = Q.

Surveys on general properties of f-divergences can be found in [65], [06], [07]. The assumptions of Proposition 1 are satisfied by many interesting measures of dissimilarity between probability measures. In particular, the following examples receive particular attention in this paper. As per Definition 3, in each case the function f is defined on (0, ∞).

1) Relative entropy [59]: f(t) = t log t,
D(P‖Q) = D_f(P‖Q)   (41)
        = D_r(P‖Q)   (42)

9 with r : 0, ) [0, ) defined as 9 ) Relative entropy: P Q) ft) = log t, 3) Jeffrey s divergence [54]: P Q) ft) = t ) log t, 4) χ -divergence [77]: ft) = t ) or ft) = t, rt) t log t + t) log e. 43) DQ P ) = D f P Q); 44) DP Q) + DQ P ) = D f P Q); 45) χ P Q) = D f P Q) 46) ) dp χ P Q) = dq dq 47) ) dp = dq 48) dq = E [ exp ı P Q Y ) )] 49) = E [ exp ı P Q X) )] 50) with X P and Y Q. Note that if P Q, then from the right side of 48), we obtain with gt) = t t = f t), and ft) = t. χ Q P ) = D g P Q) 5) 5) Hellinger divergence of order α 0, ), ) [54], [65, Definition.0]: with H α P Q) = D fα P Q) 5) f α t) = tα α. 53) The χ -divergence is the Hellinger divergence of order, while H P Q) is usually referred to as the squared Hellinger distance. The analytic extension of H α P Q) at α = yields 6) Total variation distance: Setting H P Q) log e = DP Q). 54) ft) = t 55) March 4, 08

10 results in 0 7) Triangular Discrimination [6], [09] a.k.a. Vincze-Le Cam distance): with P Q = D f P Q) 56) dp = dq dq 57) ) = sup P F) QF). 58) F F P Q) = D f P Q) 59) Note that ft) = t ) t +. 60) 8) Jensen-Shannon divergence [67] a.k.a. capacitory discrimination): with 9) E γ divergence see, e.g., [79, p. 34]): For γ, with P Q) = χ P P + Q) 6) = χ Q P + Q). 6) JSP Q) = D P P + Q) + D Q P + Q) 63) = D f P Q) 64) ) + t ft) = t log t + t) log. 65) E γ P Q) = D fγ P Q) 66) f γ t) = t γ) + 67) where x) + max{x, 0}. E γ is sometimes called hockey-stick divergence because of the shape of f γ. If γ =, then 0) DeGroot statistical information [3]: For p 0, ), E P Q) = P Q. 68) I p P Q) = D φp P Q) 69) March 4, 08

11 with Invoking 66) 70), we get cf. [66, 77)]) φ p t) = min{p, p} min{p, pt}. 70) I P Q) = E P Q) = 4 P Q. 7) This measure was first proposed by DeGroot [3] due to its operational meaning in Bayesian statistical hypothesis testing see Section VII-B), and it was later identified as an f-divergence see [66, Theorem 0]). ) Marton s divergence [73, pp ]: d P, Q) = min E [ P [X Y Y ] ] 7) = D s P Q) 73) where the minimum is over all probability measures P XY with respective marginals P X = P and P Y = Q, and st) = t ) {t < }. 74) Note that Marton s divergence satisfies the triangle inequality [73, Lemma 3.], and d P, Q) = 0 implies P = Q; however, due to its asymmetry, it is not a distance measure. C. Rényi Divergence Another generalization of relative entropy was introduced by Rényi [88] in the special case of finite alphabets. The general definition assuming 8 P Q) is the following. Definition 4: Let P Q. The Rényi divergence of order α 0 from P to Q is given as follows: If α 0, ), ), then with X P and Y Q. If α = 0, then 9 D α P Q) = α log E [ exp α ı P Q Y ) )]) 75) = α log E [ exp α ) ı P Q X) )]) 76) D 0 P Q) = ) max log. 77) F F : P F)= QF) 8 Rényi divergence can also be defined without requiring absolute continuity, e.g., [37, Definition ]. 9 The function in 75) is, in general, right-discontinuous at α = 0. Rényi [88] defined D 0P Q) = 0, while we have followed [37] defining it instead as lim α 0 D αp Q). March 4, 08
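Before moving on, the E_γ divergence and the DeGroot statistical information defined above can be made concrete on a finite alphabet. The closed forms used below — E_γ(P‖Q) = Σ_a (P(a) − γQ(a))⁺ and I_p(P‖Q) = min{p, 1−p} − Σ_a min{p P(a), (1−p) Q(a)} — are my own reading of (66)–(71); the added sketch checks the identities E_1(P‖Q) = ½|P−Q| and I_{1/2}(P‖Q) = ¼|P−Q| numerically (function names and example distributions are mine).

```python
import numpy as np

def tv(p, q):
    return float(np.abs(p - q).sum())            # |P - Q|

def E_gamma(p, q, gamma):
    # E_gamma(P||Q) = sum_a (P(a) - gamma*Q(a))^+ , for gamma >= 1
    return float(np.maximum(p - gamma * q, 0.0).sum())

def degroot_information(p, q, prior):
    # I_p(P||Q) = min{p,1-p} - sum_a min{p*P(a), (1-p)*Q(a)}, with prior p in (0,1)
    return min(prior, 1 - prior) - float(np.minimum(prior * p, (1 - prior) * q).sum())

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

print(E_gamma(p, q, 1.0), 0.5 * tv(p, q))                 # E_1 = |P-Q|/2
print(degroot_information(p, q, 0.5), 0.25 * tv(p, q))    # I_{1/2} = |P-Q|/4
print(E_gamma(p, q, 1.5))                                  # a value for gamma > 1
```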

12 If α =, then D P Q) = DP Q) 78) which is the analytic extension of D α P Q) at α =. If DP Q) <, it can be verified by L Hôpital s rule that DP Q) = lim α D α P Q). If α = + then with Y Q. If P Q, we take D P Q) =. D P Q) = log ess sup dp ) dq Y ) 79) Rényi divergence is a one-to-one transformation of Hellinger divergence of the same order α 0, ), ): which, when particularized to order becomes D α P Q) = α log + α ) H αp Q)) 80) D P Q) = log + χ P Q) ). 8) Note that 6), 7), 8) follow from 80) and the monotonicity of the Rényi divergence in its order, which in turn yields 6). Introduced in [8], the Bhattacharyya distance was popularized in the engineering literature in [55]. Definition 5: The Bhattacharyya distance between P and Q, denoted by BP Q), is given by BP Q) = D P Q) 8) ) = log H. 83) P Q) Note that, if P Q, then BP Q) = BQ P ) and BP Q) = 0 if and only if P = Q, though BP Q) does not satisfy the triangle inequality. III. FUNCTIONAL DOMINATION Let f and g be convex functions on 0, ) with f) = g) = 0, and let P and Q be probability measures defined on a measurable space A, F ). If there exists α > 0 such that ft) αgt) for all t 0, ), then it follows from Definition 3 that D f P Q) α D g P Q). 84) This simple observation leads to a proof of, for example, 6) and the left inequality in ) with the aid of Remark. March 4, 08
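The observation in (84) can be checked mechanically. The sketch below (illustrative, not from the paper) takes f(t) = t ln t − t + 1 (which yields D(P‖Q) in nats, cf. r in (43)) and g(t) = (t − 1)² (which yields χ²(P‖Q)), estimates sup_t f(t)/g(t) on a grid — the ratio approaches 1 nat as t → 0 — and verifies the resulting domination D(P‖Q) ≤ χ²(P‖Q) (nats) on random distributions. The grid range and test distributions are arbitrary choices of mine.

```python
import numpy as np

# f, g as in Theorem 1: f(1) = g(1) = 0, both convex, g > 0 away from t = 1.
f = lambda t: t * np.log(t) - t + 1.0     # gives D(P||Q) in nats (cf. r(t) in (43))
g = lambda t: (t - 1.0) ** 2              # gives chi^2(P||Q)

t = np.linspace(1e-4, 50.0, 200000)
t = t[np.abs(t - 1.0) > 1e-6]             # exclude t = 1, where both functions vanish
kappa_grid = np.max(f(t) / g(t))          # approaches 1 as the grid gets closer to t = 0
print("sup f/g on grid ~", kappa_grid)

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.dirichlet(np.ones(6)); q = rng.dirichlet(np.ones(6))
    D    = float(np.sum(p * np.log(p / q)))
    chi2 = float(np.sum((p - q) ** 2 / q))
    assert D <= 1.0 * chi2 + 1e-12        # D_f <= kappa* D_g with kappa* = 1 nat
    print(f"D = {D:.4f}  <=  chi2 = {chi2:.4f}")
```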

13 A. Basic Tool 3 Theorem : Let P Q, and assume f is convex on 0, ) with f) = 0; g is convex on 0, ) with g) = 0; gt) > 0 for all t 0, ), ). Denote the function κ: 0, ), ) R and κt) = ft), t 0, ), ) 85) gt) Then, κ = sup κt). 86) t 0,), ) a) D f P Q) κ D g P Q). 87) b) If, in addition, f ) = g ) = 0, then D f P Q) sup = κ. 88) P Q D g P Q) Proof: a) The bound in 87) follows from 84) and ft) κ gt) for all t > 0. b) Since g is positive except at t =, D g P Q) > 0 if P Q. The convexity of f, g on 0, ) implies their continuity; and since gt) > 0 for all t 0, ), ), κ ) is continuous on both 0, ) and, ). To show 88), we fix an arbitrary ν 0, ), ) and construct a sequence of pairs of probability measures whose ratio of f-divergence to g-divergence converges to κν). To that end, for sufficiently small ε > 0, let P ε and Q ε be parametric probability measures defined on the set A = {0, } with P ε 0) = ν ε and Q ε 0) = ε. Then, lim ε 0 D f P ε Q ε ) D g P ε Q ε ) = lim ε 0 ) ε fν) + ε) f νε ε ) 89) ε gν) + ε) g νε ε fν) + ν α f α) = lim α 0 gν) + ν α g α) 90) = κν) 9) where 90) holds by change of variable ε = α/ν + α), and 9) holds by the assumption on the derivatives of f and g at, the assumption that f) = g) = 0, and the continuity of κ ) at ν. If κ = κν) March 4, 08

14 4 we are done. If the supremum in 86) is not attained on 0, ), ), then the right side of 9) can be made arbitrarily close to κ by an appropriate choice of ν. Remark : Beyond the restrictions in Theorem a), the only operative restriction imposed by Theorem b) is the differentiability of the functions f and g at t =. Indeed, we can invoke Remark and add f ) t) to ft), without changing D f and likewise with g) and thereby satisfying the condition in Theorem b); the stationary point at must be a minimum of both f and g because of the assumed convexity, which implies their non-negativity on 0, ). Remark 3: It is useful to generalize Theorem b) by dropping the assumption on the existence of the derivatives at. To that end, note that the inverse transformation used for the transition to 90) is given by ν = + α ε ) where ε > 0 is sufficiently small, so if ν > resp. ν < ), then α > 0 resp. α < 0). Consequently, it is easy to see from 90) that if κ = sup t> κt), the construction in the proof can restrict to ν >, in which case it is enough to require that the left derivatives of f and g at be equal to 0. Analogously, if κ = sup 0<t< κt), it is enough to require that the right derivatives of f and g at be equal to 0. When neither left nor right derivatives at are 0, then 88) need not hold as the following example shows. Example : Let ft) = t and Then, κ =, while in view of 39) and 56) for all P, Q), gt) = ft) + t. 9) D g P Q) = D f P Q) = P Q. 93) B. Relationships Among DP Q), χ P Q) and P Q Since the Rényi divergence of order α > 0 is monotonically increasing in α, 8) yields DP Q) log + χ P Q) ) 94) χ P Q) log e. 95) Inequality 94), which can be found in [00] and [4, Theorem 5], is sharpened in Theorem 4 under the assumption of bounded relative information. In view of 80), an alternative way to sharpen 94) is for α, ), which is tight as α. DP Q) α log + α )H αp Q)) 96) Relationships between the relative entropy, total variation distance and χ divergence are derived next. March 4, 08

15 Theorem : a) If P Q and c, c 0, then 5 DP Q) c P Q + c χ P Q) ) log e 97) b) holds if c, c ) = 0, ) and c, c ) = 4, ). Furthermore, if c = 0 then c = is optimal, and if c = then c = 4 is optimal. sup where the supremum is over P Q and P Q. Proof: DP Q) + DQ P ) χ P Q) + χ Q P ) = log e 98) a) The satisfiability of 0) with c, c ) = 0, ) is equivalent to 95). Let P Q, and Y Q. Then, DP Q) = E [ )] dp r dq Y ) E [ dp dq Y ) ) + dp + dq Y ) ) ] 99) log e 00) = 4 P Q + χ P Q) ) log e 0) where 99) follows from the definition of relative entropy with the function r : 0, ) R defined in 43); 00) holds since for t 0, ) rt) and 0) follows from 47), 57), and the identity This proves 97) with c, c ) = 4, ). t) + = [ t) + + t ) ] log e 0) [ t + t) ]. Next, we show that if c = 0 then c = is the best possible constant in 97). To that end, let f t) = t ), and let κt) be the continuous extension of decreasing on 0, ), so rt) f t). It can be verified that the function κ is monotonically κ = lim t 0 κt) = log e. 03) Since D r P Q) = DP Q) and D f P Q) = χ P Q), and r ) = f ) = 0, the desired result follows from Theorem b). To show that c = 4 is the best possible constant in 97) if c =, we let g t) = [ t) + + t ) ]. Theorem b) does not apply here since g is not differentiable at t =. However, we can still construct March 4, 08

16 probability measures for proving the optimality of the point c, c ) = 4, ). To that end, let ε 0, ), and define probability measures P ε and Q ε on the set A = {0, } with P ε ) = ε and Q ε ) = ε. Since D r P Q) = DP Q) and D g P Q) = 4 P Q + χ P Q), DP ε Q ε ) lim ε 0 4 P ε Q ε + χ P ε Q ε ) = lim ε 0 ε) r + ε) + ε rε) ε) g + ε) + ε g ε) 6 04) = log e 05) where 05) holds since we can write numerator and denominator in the right side as b) We have with ε) r + ε) + ε rε) = ε ) log + ε) + ε log ε 06) = ε log e + oε), 07) ε) g + ε) + ε g ε) = ε ε. 08) D f P Q) = DP Q) + DQ P ), 09) D g P Q) = χ P Q) + χ Q P ) 0) ft) t ) log t ) gt) t t + t ) κt) = ft) gt) = t log t t, t 0, ), ) 3) lim κt) = κ = t log e 4) where 4) is easy to verify since κ is monotonically increasing on 0, ), and monotonically decreasing on, ). The desired result follows since the conditions of Theorem b) apply. Remark 4: Inequality 97) strengthens the bound claimed in [3,.8)], DP Q) P Q + χ P Q) ) log e, 5) although the short outline of the suggested proof in [3, p. 70] leads to the weaker upper bound P Q + χ P Q) nats. March 4, 08
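As a sanity check on Theorem 2 (not part of the paper), the following lines verify on random distributions the reading of (97) with (c1, c2) = (1/4, 1/2), i.e., D(P‖Q) ≤ (¼|P−Q| + ½χ²(P‖Q)) log e, and the bound implied by (98), namely D(P‖Q) + D(Q‖P) ≤ ½(χ²(P‖Q) + χ²(Q‖P)) in nats. The random test setup is mine.

```python
import numpy as np

def kl(p, q):   return float(np.sum(p * np.log(p / q)))       # nats
def chi2(p, q): return float(np.sum((p - q) ** 2 / q))
def tv(p, q):   return float(np.abs(p - q).sum())              # |P - Q|

rng = np.random.default_rng(1)
for _ in range(10000):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    # (97) with (c1, c2) = (1/4, 1/2), in nats:
    assert kl(p, q) <= 0.25 * tv(p, q) + 0.5 * chi2(p, q) + 1e-12
    # consequence of (98): D(P||Q) + D(Q||P) <= (chi2(P||Q) + chi2(Q||P)) / 2, in nats
    assert kl(p, q) + kl(q, p) <= 0.5 * (chi2(p, q) + chi2(q, p)) + 1e-12
print("both bounds held on 10000 random pairs")
```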

17 7 Remark 5: Note that 95) implies a looser result where the constant in the right side of 98) is doubled. Furthermore, ) and 98) result in the bound χ P Q) + χ Q P ) P Q which, although weaker than 5), has the same behavior for small values of P Q. C. Relationships Among DP Q), P Q) and P Q The next result shows three upper bounds on the Vincze-Le Cam distance in terms of the relative entropy and total variation distance). Theorem 3: Let r : 0, ) R be the function defined in 43), and let f denote the function f : 0, ) R defined in 60). a) If P Q, then where c =.59 = κ, computed with Furthermore, the constant c in 6) is the best possible. b) If P Q, then where c = is defined by P Q) log e c DP Q) 6) κt) = f t) log e. 7) rt) P Q) log e DP Q) + c P Q log e 8) c = min { c 0: κ c t), t 0, ) } 9) κ c t) = Furthermore, the constant c in 8) is the best possible. c) If P Q, then and the bound in ) is tight. Proof: f t) log e rt) + c t) + log e. 0) P Q) log e DP Q) + DQ P ) ) a) Let f = f log e and g = r. These functions satisfy the conditions in Theorem b), and result in the function κ defined in 7). The function κ is maximal at t = with c κt ) =.59. Since D f P Q) = P Q) log e and D g P Q) = DP Q), Theorem b) results in 6) with the optimality of its constant c. March 4, 08

18 b) By the definition of c in 9), it follows that for all t > 0 8 f t) log e rt) + c t) + log e; ) note that ) holds for all t 0, ) i.e., not only in 0, ) as in 9)) since, for t, ) is equivalent to f t) log e rt) which indeed holds for all t [, ). Hence, the bound in 8) follows from 4), 59), 66) and 68). It can be verified from 9) that c = , the maximal value of κ c on 0, ) is attained at t = and κ c t ) =. The optimality of the constant c in 8) is shown next note that Theorem b) cannot be used here, since the right side of ) is not differentiable at t = ). Let P ε and Q ε be probability measures defined on A = {0, } with P ε 0) = t ε and Q ε 0) = ε for ε 0, ). Then, it follows that lim ε 0 P ε Q ε ) log e DP ε Q ε ) + c Pε Q ε log e f t ) ε log e + oε) = lim [ ε 0 rt ) + c t ) log e ] ε + oε) 3) = κ c t ) = 4) where 3) follows by the construction of P ε and Q ε, and the use of Taylor series expansions; 4) follows from 0) note that t < ), which yields the required optimality result in 8). c) Let f = f log e and g : 0, ) R be defined by gt) = t ) log t for all t > 0. These functions, which satisfy the conditions in Theorem b), yield D f P Q) = P Q) log e and D g P Q) = DP Q)+ DQ P ). Theorem b) yields the desired result with κt) = t, t 0, ), ) 5) t + ) log e t κ) = = κ. 6) Remark 6: We can generalize 6) and ) to P Q) log e c DP Q) + c DQ P ) 7) where c, c 0 and P Q. In view of Theorem, the locus of the allowable constant pairs c, c ) in 7) can be evaluated. To that end, Theorem b) can be used with ft) = f t) log e, and gt) = c rt) + c t r ) t for all t > 0. This results in a convex region, which is symmetric with respect to the straight line c = c since P Q) = Q P )). Note that 6), ) and the symmetry property of this region identify, respectively, the points.59, 0),, ) and 0,.59) on its boundary; since the sum of their coordinates is nearly constant, the boundary of this subset of the positive quadrant is nearly a straight line of slope. March 4, 08
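The following added sketch double-checks two facts used in this subsection: the identity from Section II-B relating the Vincze-Le Cam distance to a χ²-divergence, which on a finite alphabet reads Δ(P‖Q) = Σ_a (P(a) − Q(a))²/(P(a) + Q(a)) = 2χ²(P‖½(P+Q)), and my reading of Theorem 3 c), i.e., Δ(P‖Q) log e ≤ ½(D(P‖Q) + D(Q‖P)). Everything is checked numerically in nats on random pairs; the setup is mine.

```python
import numpy as np

def kl(p, q):   return float(np.sum(p * np.log(p / q)))        # nats
def chi2(p, q): return float(np.sum((p - q) ** 2 / q))
def triangular(p, q):
    # Delta(P||Q) = sum_a (P(a)-Q(a))^2 / (P(a)+Q(a)), cf. f(t) = (t-1)^2/(t+1) in (60)
    return float(np.sum((p - q) ** 2 / (p + q)))

rng = np.random.default_rng(2)
for _ in range(5000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    m = 0.5 * (p + q)
    # identity: Delta(P||Q) = 2 * chi^2(P || (P+Q)/2)
    assert abs(triangular(p, q) - 2.0 * chi2(p, m)) < 1e-10
    # Theorem 3 c), as read here: Delta(P||Q) <= (D(P||Q) + D(Q||P)) / 2, in nats
    assert triangular(p, q) <= 0.5 * (kl(p, q) + kl(q, p)) + 1e-12
print("identity and bound held on 5000 random pairs")
```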

19 D. An Alternative Proof of Samson s Inequality 9 An analog of Pinsker s inequality, which comes in handy for the proof of Marton s conditional transportation inequality [3, Lemma 8.4], is the following bound due to Samson [89, Lemma ]: Theorem 4: If P Q, then where d P, Q) is the distance measure defined in 7). d P, Q) + d Q, P ) log e DP Q) 8) We provide next an alternative proof of Theorem 4, in view of Theorem b), with the following advantages: a) This proof yields the optimality of the constant in 8), i.e., we prove that sup d P, Q) + d Q, P ) DP Q) = log e where the supremum is over all probability measures P, Q such that P Q and P Q. b) A simple adaptation of this proof enables to derive a reverse inequality to 8), which holds under the boundedness assumption of the relative information see Section IV-D). Proof: 9) d P, Q) + d Q, P ) = D s P Q) + D s P Q) 30) = D f P Q) 3) where, from 74), s : 0, ) [0, ) is the convex function s t) = t s ) t ) {t > } t = t and, from 74) and 3), the non-negative and convex function f : 0, ) R in 3) is given by 3) ft) = st) + s t) = t ) max{, t} for all t > 0. Let r be the non-negative and convex function r : 0, ) R defined in 43), which yields D r P Q) = DP Q). Note that f) = r) = 0, and the functions f and r are both differentiable at t = with f ) = r ) = 0. The desired result follows from Theorem b) since in this case κt) = 33) t ), t 0, ), ) 34) rt) max{, t} lim κt) = t log e = κ, 35) as can be verified from the monotonicity of κ on 0, ) increasing) and, ) decreasing). Remark 7: As mentioned in [89, p. 438], Samson s inequality 8) strengthens the Pinsker-type inequality in [73, Lemma 3.]: d P, Q) log e min{ DP Q), DQ P ) } 36) March 4, 08
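For a finite alphabet, Marton's divergence of Section II-B has the closed form d²(P, Q) = Σ_a Q(a) (1 − P(a)/Q(a))⁺², which the sketch below (added here for illustration) uses to spot-check Samson's inequality, Theorem 4, in the form d²(P, Q) + d²(Q, P) ≤ (2/log e) D(P‖Q), i.e., at most 2 D(P‖Q) in nats. The finite-alphabet formula and test setup are my own.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))                   # nats

def marton_d2(p, q):
    # d^2(P,Q) = D_s(P||Q) with s(t) = (1-t)^2 1{t < 1}
    t = p / q
    return float(np.sum(q * np.maximum(1.0 - t, 0.0) ** 2))

rng = np.random.default_rng(3)
worst = 0.0
for _ in range(5000):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    ratio = (marton_d2(p, q) + marton_d2(q, p)) / kl(p, q)
    worst = max(worst, ratio)
    assert ratio <= 2.0 + 1e-9                                 # Samson's inequality, nats
print("largest observed ratio:", worst, "(the constant cannot be improved)")
```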

20 0 Nevertheless, similarly to our alternative proof of Theorem 4, one can verify that Theorem b) yields the optimality of the constant in 36). E. Ratio of f-divergence to Total Variation Distance Vajda [05, Theorem ] showed that the range of an f-divergence is given by see 35)) 0 D f P Q) f0) + f 0) 37) where every value in this range is attainable by a suitable pair of probability measures P Q. Recalling Remark, note that f b 0) + fb 0) = f0) + f 0) with f b ) defined in 38). Basu et al. [9, Lemma.] strengthened 37), showing that D f P Q) f0) + f 0)) P Q. 38) Note that, provided f0) and f 0) are finite, 38) yields a counterpart to 4). Next, we show that the constant in 38) cannot be improved. Theorem 5: If f : 0, ) R is convex with f) = 0, then sup D f P Q) P Q = f0) + f 0)) 39) where the supremum is over all probability measures P, Q such that P Q and P Q. Proof: As the first step, we give a simplified proof of 38) cf. [9, pp ]). In view of Remark, it is sufficient to show that for all t > 0, ft) + f0) f 0)) t ) f0) + f 0)) t. 40) If t 0, ), 40) reduces to ft) t)f0), which holds in view of the convexity of f and f) = 0. If t, we can readily check, with the aid of 34), that 40) reduces to f t ) t )f 0), which, in turn, holds because f is convex and f ) = 0. For the second part of the proof of 39), we construct a pair of probability measures P ε and Q ε such that, for a sufficiently small ε > 0, Df Pε Qε) P ε Q ε can be made arbitrarily close to the right side of 39). To that end, let ε 0, 5 )], and let P ε and Q ε be defined on the set A = {0,, } with P ε 0) = Q ε ) = ε and P ε ) = Q ε 0) = ε. Then, D f P ε Q ε ) ε fε) + ε f lim = lim ε) ε 0 P ε Q ε ε 0 ε ε) where 4) holds since P ε ) = Q ε ), and 4) follows from 33) and 35). 4) = f0) + f 0) 4) March 4, 08
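Theorem 5 gives, for any f with finite f(0) and f*(0), the bound D_f(P‖Q) ≤ ½(f(0) + f*(0)) |P − Q| with the best possible constant. The added sketch below checks two instances, which also appear in Corollary 1 right below: JS(P‖Q) ≤ ½ log 2 · |P − Q| and Δ(P‖Q) ≤ |P − Q| (the Jensen-Shannon closed form used here is the usual finite-alphabet one with the ½ weights, in nats; the test setup is mine).

```python
import numpy as np

def tv(p, q): return float(np.abs(p - q).sum())

def js(p, q):
    # Jensen-Shannon divergence in nats (P and Q assumed strictly positive here)
    m = 0.5 * (p + q)
    return 0.5 * float(np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))

def triangular(p, q):
    return float(np.sum((p - q) ** 2 / (p + q)))

rng = np.random.default_rng(4)
for _ in range(5000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert js(p, q) <= 0.5 * np.log(2.0) * tv(p, q) + 1e-12    # JS vs |P-Q|
    assert triangular(p, q) <= tv(p, q) + 1e-12                # Delta vs |P-Q|
print("both Theorem 5 instances held on 5000 random pairs")
```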

Remark 8: Csiszár [4, Theorem ] showed that if f(0) and f*(0) are finite and P ≪ Q, then there exists a constant C_f > 0, which depends only on f, such that
D_f(P‖Q) ≤ C_f √(|P − Q|).   (143)
When |P − Q| is sufficiently small, (143) is superseded by (138), where the constant is not only explicit but is the best possible according to Theorem 5.

A direct application of Theorem 5 yields

Corollary 1:
sup H_α(P‖Q) / |P − Q| = 1/(2(1 − α)),  α ∈ (0, 1),   (144)
sup Δ(P‖Q) / |P − Q| = 1,   (145)
sup JS(P‖Q) / |P − Q| = ½ log 2,   (146)
sup d²(P, Q) / |P − Q| = ½,   (147)
sup [d²(P, Q) + d²(Q, P)] / |P − Q| = 1,   (148)
where the suprema in (144)–(147) are over all P ≪ Q with P ≠ Q, and the supremum in (148) is over all P ≪≫ Q with P ≠ Q.

Remark 9: The results in (144), (145) and (146) strengthen, respectively, the inequalities in [65, Proposition .35], [0, )] and [0, Theorem ]. The results in (147) and (148) form counterparts of (129).

IV. BOUNDED RELATIVE INFORMATION

In this section we show that it is possible to find bounds among f-divergences without requiring a strong condition of functional domination (see Section III), as long as the relative information is upper and/or lower bounded almost surely.

A. Definition of β₁ and β₂

The following notation is used throughout the rest of the paper. Given a pair of probability measures (P, Q) on the same measurable space, denote β₁, β₂ ∈ [0, 1] by
β₁ = exp(−D_∞(P‖Q)),   (149)
β₂ = exp(−D_∞(Q‖P)),   (150)

with the convention that if D_∞(P‖Q) = ∞, then β₁ = 0, and if D_∞(Q‖P) = ∞, then β₂ = 0. Note that if β₁ > 0, then P ≪ Q, while β₂ > 0 implies Q ≪ P. Furthermore, if P ≪≫ Q, then, with Y ∼ Q,
β₁ = ess inf (dQ/dP)(Y) = [ess sup (dP/dQ)(Y)]⁻¹,   (151)
β₂ = ess inf (dP/dQ)(Y) = [ess sup (dQ/dP)(Y)]⁻¹.   (152)
The following examples illustrate important cases in which β₁ and β₂ are positive.

Example 2: (Gaussian distributions.) Let P and Q be Gaussian probability measures with equal means, and variances σ₀² and σ₁², respectively. Then,
β₁ = (σ₀/σ₁) 1{σ₀ ≤ σ₁},   (153)
β₂ = (σ₁/σ₀) 1{σ₁ ≤ σ₀}.   (154)

Example 3: (Shifted Laplace distributions.) Let P and Q be the probability measures whose probability density functions are, respectively, given by f_λ(· − a₁) and f_λ(· − a₀) with
f_λ(x) = (λ/2) exp(−λ|x|),  x ∈ ℝ,   (155)
where λ > 0. In this case, (155) gives
(dP/dQ)(x) = exp(−λ(|x − a₁| − |x − a₀|)),  x ∈ ℝ,   (156)
which yields
β₁ = β₂ = exp(−λ|a₁ − a₀|) ∈ (0, 1].   (157)

Example 4: (Cramér distributions.) Suppose that P and Q have Cramér probability density functions f_{θ₁,m₁} and f_{θ₀,m₀}, respectively, with
f_{θ,m}(x) = (θ/2)(1 + θ|x − m|)⁻²,  x ∈ ℝ,   (158)
where θ > 0 and m ∈ ℝ. In this case, we have β₁, β₂ ∈ (0, 1) since the ratio of the probability density functions, f_{θ₁,m₁}/f_{θ₀,m₀}, tends to θ₀/θ₁ < ∞ in the limit x → ±∞. In the special case where m₀ = m₁ = m ∈ ℝ, the ratio of these probability density functions is θ₁/θ₀ at x = m; due also to the symmetry of the probability density functions around m, it can be verified that in this special case
β₁ = β₂ = min{θ₀/θ₁, θ₁/θ₀}.   (159)

Example 5: (Cauchy distributions.) Suppose that P and Q have Cauchy probability density functions g_{γ₁,m₁} and g_{γ₀,m₀}, respectively, with
g_{γ,m}(x) = 1 / (πγ [1 + ((x − m)/γ)²]),  x ∈ ℝ,   (160)

23 3 where γ > 0. In this case, we also have β, β 0, ) since the ratio of the probability density functions tends to γ0 γ < in the limit x ±. In the special case where m 0 = m, { γ β = β = min, γ } 0. 6) γ 0 γ B. Basic Tool Since β = β = P = Q, it is advisable to avoid trivialities by excluding that case. Theorem 6: Let f and g satisfy the assumptions in Theorem, and assume that β, β ) [0, ). Then, where and κ ) is defined in 85). D f P Q) κ D g P Q) 6) κ = Proof: Defining g0) and g 0) as in 33)-35), respectively, note that sup κβ) 63) β β,),β ) f0) Qp = 0) g0) κ Qp = 0) 64) because if β = 0 then f0) κ g0), and if β > 0 then Qp = 0) = 0. Similarly, f 0) P q = 0) g 0) κ P q = 0) 65) because if β = 0 then f 0) κ g 0) and if β > 0 then P q = 0) = 0. Moreover, since f) = g) = 0, we can substitute pq > 0 by {pq > 0} {p q} in the right side of 37) for D f P Q) and likewise for D g P Q). In view of the definition of κ, 64) and 65), ) p D f P Q) κ q g dµ q pq>0,p q + κ g0) Qp = 0) + κ g 0) P q = 0) 66) = κ D g P Q). 67) Note that if β = β = 0, then Theorem 6 does not improve upon Theorem a). Remark 0: In the application of Theorem 6, it is often convenient to make use of the freedom afforded by Remark and choose the corresponding offsets such that: the positivity property of g required by Theorem 6 is satisfied; the lowest κ is obtained. March 4, 08
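On a finite alphabet, β₁ and β₂ in (149)–(150) are just extreme likelihood ratios: β₁ = [max_a P(a)/Q(a)]⁻¹ and β₂ = [max_a Q(a)/P(a)]⁻¹, with the maxima over the supports of Q and P, respectively. The sketch below (added here; the displayed bound is my reading of Example 6) computes them and checks χ²(P‖Q) ≤ max{1/β₁ − 1, 1 − β₂} · |P − Q| on random pairs.

```python
import numpy as np

def betas(p, q):
    # beta1 = exp(-D_inf(P||Q)), beta2 = exp(-D_inf(Q||P)) on a finite alphabet
    b1 = 1.0 / np.max(p[q > 0] / q[q > 0])
    b2 = 1.0 / np.max(q[p > 0] / p[p > 0])
    return float(b1), float(b2)

def chi2(p, q): return float(np.sum((p - q) ** 2 / q))
def tv(p, q):   return float(np.abs(p - q).sum())

rng = np.random.default_rng(5)
for _ in range(5000):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    b1, b2 = betas(p, q)
    bound = max(1.0 / b1 - 1.0, 1.0 - b2) * tv(p, q)   # Example 6, as read here
    assert chi2(p, q) <= bound + 1e-12
print("Example 6 bound held on 5000 random pairs")
```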

24 4 Remark : Similarly to the proof of Theorem b), under the conditions therein, one can verify that the constants in Theorem 6 are the best possible among all probability measures P, Q with given β, β ) [0, ). Remark : Note that if we swap the assumptions on f and g in Theorem 6, the same result translates into inf κβ) D g P Q) D f P Q). 68) β β,),β ) Furthermore, provided both f and g are positive except at t = ) and κ is monotonically increasing, Theorem 6 and 68) result in κβ ) D g P Q) D f P Q) 69) κβ ) D gp Q). 70) In this case, if β > 0, sometimes it is convenient to replace β > 0 with β 0, β ) at the expense of loosening the bound. A similar observation applies to β. Example 6: If ft) = t ) and gt) = t, we get χ P Q) max{β, β } P Q. 7) C. Bounds on DP Q) DQ P ) The remaining part of this section is devoted to various applications of Theorem 6. From this point, we make use of the definition of r : 0, ) [0, ) in 43). An illustrative application of Theorem 6 gives upper and lower bounds on the ratio of relative entropies. Theorem 7: Let P Q, P Q, and β, β ) 0, ). Let κ: 0, ), ) 0, ) be defined as Then, Proof: For t > 0, let κt) = t log t + t) log e t ) log e log t. 7) DP Q) κβ ) DQ P ) κβ ). 73) gt) = log t + t ) log e 74) then D g P Q) = DQ P ), D r P Q) = DP Q) and the conditions of Theorem 6 are satisfied. The desired result follows from Theorem 6 and the monotonicity of κ ) shown in Appendix A. March 4, 08
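Theorem 7 can be exercised numerically. The sketch below is illustrative only; the direction of the bounds reflects my reading of the theorem, with κ of this subsection understood in nats as κ(t) = (t ln t + 1 − t)/(t − 1 − ln t), which is monotonically increasing. It checks κ(β₂) ≤ D(P‖Q)/D(Q‖P) ≤ κ(1/β₁) on random finite-alphabet pairs.

```python
import numpy as np

def kl(p, q): return float(np.sum(p * np.log(p / q)))          # nats

def kappa(t):
    # kappa(t) = (t ln t + 1 - t) / (t - 1 - ln t), in nats
    return (t * np.log(t) + 1.0 - t) / (t - 1.0 - np.log(t))

def betas(p, q):
    return 1.0 / float(np.max(p / q)), 1.0 / float(np.max(q / p))

rng = np.random.default_rng(6)
for _ in range(5000):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    b1, b2 = betas(p, q)
    ratio = kl(p, q) / kl(q, p)
    assert kappa(b2) - 1e-9 <= ratio <= kappa(1.0 / b1) + 1e-9   # Theorem 7, as read here
print("Theorem 7 bounds held on 5000 random pairs")
```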

25 D. Reverse Samson s Inequality 5 The next result gives a counterpart to Samson s inequality 8). Theorem 8: Let β, β ) 0, ). Then, inf d P, Q) + d Q, P ) DP Q) = min { κβ ), κβ ) } 75) where the infimum is over all P Q with given β, β ), and where κ: 0, ), ) 0, log e) is given in 34). Proof: Applying Remark to the convex and positive except at t = ) function ft) given in 33), and gt) = rt), the lower bound on d P,Q)+d Q,P ) DP Q) in the right side of 75) follows from the fact that 34) is monotonically increasing on 0, ), and monotonically decreasing on, ). To verify that this is the best possible lower bound, we recall Remark since in this case f ) = g ) = 0. E. Bounds on DP Q) H αp Q) The following result bounds the ratio of relative entropy to Hellinger divergence of an arbitrary positive order α. Theorem 9 extends and strengthens a result for α 0, ) by Haussler and Opper [5, Lemma 4 and 6)] see also [6]), which in turn generalizes the special case for α = by Birgé and Massart [, 7.6)]. obtained simultaneously and independently Theorem 9: Let P Q, P Q, α 0, ), ) and β, β ) [0, ). Define the continuous function on [0, ]: log e t = 0; α) rt) t α t 0, ), ); + αt α κ α t) = α log e t = ; 76) t = and α 0, ); 0 t = and α, ). Then, for α 0, ), and, for α, ), Proof: κ α β ) DP Q) H α P Q) κ αβ ) 77) κ α β DP Q) ) H α P Q) κ αβ ). 78) March 4, 08

26 D r P Q) = DP Q), and D gα P Q) = H α P Q) with in view of Remark, 4) and 5). Since g αt) = α tα ) α g α t) = tα + αt α, t 0, ) 79) α g α is monotonically decreasing on 0, ] and monotonically increasing on [, ); hence, g α ) = 0 implies that g α is positive except at t =. The convexity conditions required by Theorem 6 are also easy to check for both r ) and g α ). The function in 76) is the continuous extension of r g α, which as shown in Appendix B, is monotonically increasing on [0, ] if α 0, ), and it is monotonically decreasing on [0, ] if α, ). Therefore, Theorem 6 results in 77) and 78) for α 0, ) and α, ), respectively. Remark 3: Theorem 9 is of particular interest for α =. In this case since, from 76), κ : [0, ] [0, ] is monotonically increasing with κ 0) = log e, the left inequality in 78) yields 6). For large arguments, κ ) grows logarithmically. For example, if β > , it follows from 77) that DP Q) 4 + log ) e e ) H P Q) nats 80) which is strictly smaller than the upper bound on the relative entropy in [, Theorem 5], given not in terms of β but in terms of another more cumbersome quantity that controls the mass that dp dq 6 may have at large values. As mentioned in Section II-B, χ P Q) is equal to the Hellinger divergence of order. Specializing Theorem 9 to the case α = results in which improves the upper and lower bounds in [34, Proposition ]: κ β DP Q) ) χ P Q) κ β ), 8) β log e DP Q) χ P Q) β log e. 8) For example, if β = β = 00, 8) gives a possible range [0.037, 0.963] nats for the ratio of relative entropy to χ -divergence, while 8) gives a range of [0.005, 50] nats. Note that if β = 0, then the upper bound in 8) is κ 0) = log e whereas it is according to [34, Proposition ]. In view of Remark, the bounds in 8) are the best possible among all probability measures P, Q with given β, β ) [0, ). F. Bounds on JSP Q) P Q) HαP Q) P Q), JSP Q), P Q Let P Q and P Q. [0, Theorem ] and [0, )] show, respectively, JSP Q) log e log, 83) P Q) P Q P Q) P Q. 84) The following result suggests a refinement in the upper bounds of 83) and 84). March 4, 08

27 Theorem 0: Let P Q, P Q, and let β, β [0, ). Then, JSP Q) log e P Q) max{ κ β ), κ β ) }, 85) P Q) P Q max{ κ β ), κ β ) } 86) with κ : [0, ] [0, log ] and κ : [0, ] [0, ] defined by log t = 0; [ )] t + t + κ t) = t ) t log t t + ) log t 0, ), ); and log e t = ; log t = ; t t [0, ); t + κ t) = t =. Proof: Let f TV, f, f JS denote the functions f : 0, ) R in 55), 60) and 65), respectively; these functions yield P Q, P Q) and JSP Q) as f-divergences. The functions κ and κ, as introduced in 87) and 88), respectively, are the continuous extensions to [0, ] of 7 87) 88) κ t) = f JSt) + t ) log, 89) f t) κ t) = f t) f TV t). 90) It can be verified by 87) and 88) that κ i t) = κ i t ) for t [0, ] and i {, }; furthermore, κ and κ are both monotonically decreasing on [0, ], and monotonically increasing on [, ]. Consequently, if β [β, β ] and i {, }, κ i ) κ i β) max { κ i β ), κ iβ ) }. 9) In view of Theorem 6 and Remark, 85) follows from 89) and 9); 86) follows from 90) and 9). If β = 0, referring to an unbounded relative information ı P Q, the right inequalities in 83), 84) and 85), 86) respectively coincide due to 87), 88), and since κ 0) = ); otherwise, the right inequalities in 85), 86) provide, respectively, sharper upper bounds than those in 83), 84). March 4, 08

28 Example 7: Suppose that P Q, P Q and Q = 4, 3 4) ); substituting β = β = improving the upper bounds on JSP Q) P Q) to the right inequalities in 83), 84). For finite alphabets, [7, Theorem 7] shows in 85) 88) gives dp dq a) for all a A e.g., P =, ) and JSP Q) log e 0.50 log e 9) P Q) P Q) P Q 3 93) and P Q) P Q which are log log e and, respectively, according log JSP Q) log e. 94) P Q) H The following theorem extends the validity of 94) for Hellinger divergences of an arbitrary order α 0, ) and for a general alphabet, while also providing a refinement of both upper and lower bounds in 94) when the relative information ı P Q is bounded. Theorem : Let P Q, P Q, α 0, ), ), and let β, β [0, ] be given in 49)-50). Let κ α : [0, ] [0, ) be given by log t = 0; [ α ) t log t t + ) log ) ] t+ κ α t) = t α t 0, ), ); + α αt log e t = ; α ) + α log t =. Then, if α 0, ), and, if α, ), min { κ α β ), κ αβ ) } JSP Q) H α P Q) 8 95) max κ α β) 96) β [β,β ] κ α β JSP Q) ) H α P Q) κ αβ ). 97) Proof: Let f α and f JS denote, respectively, the functions f : 0, ) R in 53) and 65) which yield the Hellinger divergence, H α, and the Jensen-Shannon divergence, JSP Q), as f-divergences. From 95), κ α is the continuous extension to [0, ] of κ α t) = f JSt) + t ) log f α t) + ) α. 98) α t ) March 4, 08

29 As shown in the proof of Theorem 9, for every α 0, ), ) and t 0, ), ), we have g α t) ) f α t) + α α t ) > 0. It can be verified that κ α : [0, ] R has the following monotonicity properties: if α 0, ), there exists t α > 0 such that κ α is monotonically increasing on [0, t α ], and it is monotonically decreasing on [t α, ]; if α, ), κ α is monotonically decreasing on [0, ]. Based on these properties of κ α, for β [β, β ]: if α 0, ) 9 if α, ), min { κ α β ), κ αβ ) } κ α β) max t [β,β ] κ α t); 99) κ α β ) κ αβ) κ α β ). 00) In view of Theorem 6, Remark and 98), the bounds in 96) and 97) follow respectively from 99) and 00). Example 8: Specializing Theorem to α =, since κ t) = κ ) t for t 0, ) and κ global maximum at t = with κ ) = log e, 96) implies that log κ ) min{β, β } Under the assumptions in Example 7, it follows from 0) that log e which improves the lower bound log log e in the left side of 94). achieves its JSP Q) log e. 0) P Q) H JSP Q) log e 0) P Q) H Specializing Theorem to α = implies that cf. 95) and 97)) Under the assumptions in Example 7, it follows from 03) that κ β JSP Q) ) χ P Q) κ β ). 03) 0.70 log e JSP Q) χ log e 04) P Q) while without any boundedness assumption, 97) yields the weaker upper and lower bounds 0 JSP Q) χ log log e. 05) P Q) March 4, 08

30 G. Local Behavior of f-divergences 30 Another application of Theorem 6 shows that the local behavior of f-divergences differs by only a constant, provided that the first distribution approaches the reference measure in a certain strong sense. Theorem : Suppose that {P n }, a sequence of probability measures defined on a measurable space A, F ), converges to Q another probability measure on the same space) in the sense that, for Y Q, lim ess sup dp n Y ) = 06) n dq where it is assumed that P n Q for all sufficiently large n. If f and g are convex on 0, ) and they are positive except at t = where they are 0), then and lim D f P n Q) = lim D gp n Q) = 0, 07) n n min{κ ), κ + D f P n Q) )} lim n D g P n Q) max{κ ), κ + )} 08) where we have indicated the left and right limits of the function κ ), defined in 85), at by κ ) and κ + ), respectively. Proof: Since f) = 0, 0 D f P n Q) = f ) dpn dq 09) dq where we have abbreviated The condition in 06) yields sup fβ) 0) β [β,n, β,n] β,n ess sup dp n Y ), ) dq β,n ess inf dp n dq Y ). ) lim β,n =, 3) n lim β,n = 4) n where 3) is a restatement of 06) see the notation in )), and Appendix C justifies 4). Hence, 07) follows from 0), 3), 4), the continuity of f at due to its convexity). Abbreviating I n = [β,n, ), β,n ], 6) and 68) result in inf κβ)d g P n Q) D f P n Q) sup κβ)d g P n Q). 5) β I n β I n March 4, 08

31 The right and left continuity of κ ) at together with 3) and 4) imply that 3 by letting n. inf κβ) min{κ ), κ + )}, 6) β I n sup κβ) max{κ ), κ + )} 7) β I n Corollary : Let {P n Q} converge to Q in the sense of 06). Then, DP n Q) and DQ P n ) vanish as n with DP n Q) lim =. 8) n DQ P n ) Corollary 3: Let {P n Q} converge to Q in the sense of 06). Then, χ P n Q) and DP n Q) vanish as n with Note that 9) is known in the finite alphabet case [8, Theorem 4.]). DP n Q) lim n χ P n Q) = log e. 9) In Example, the ratio in 08) is equal to, while the lower and upper bounds are 3 and, respectively. Continuing with Examples 3, 4 and 5, it is easy to check that 06) is satisfied in the following cases. Example 9: A sequence of Laplacian probability density functions with common variance and converging means: p n x) = λ exp λ x a n ) 0) lim a n = a. ) n Example 0: A sequence of converging Cramér probability density functions: p n x) = θ n + θ n x m n ), x R ) lim m n = m R 3) n lim θ n = θ > 0. 4) n Example : A sequence of converging Cauchy probability density functions: [ p n x) = ) ] x mn +, x R 5) πγ n γ n lim m n = m R 6) n lim γ n = γ > 0. 7) n March 4, 08
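Corollaries 2 and 3 are easy to visualize numerically. In the added sketch below (my own construction), P_n = (1 − 1/n)Q + (1/n)R for a fixed pair Q, R on a finite alphabet, so that ess sup dP_n/dQ → 1 as required by the convergence condition of this subsection; the ratios D(P_n‖Q)/D(Q‖P_n) and D(P_n‖Q)/χ²(P_n‖Q) are then seen to approach 1 and ½ log e (i.e., ½ in nats), respectively.

```python
import numpy as np

def kl(p, q):   return float(np.sum(p * np.log(p / q)))    # nats
def chi2(p, q): return float(np.sum((p - q) ** 2 / q))

q = np.array([0.2, 0.3, 0.5])
r = np.array([0.5, 0.4, 0.1])       # any other distribution on the same alphabet

for n in [10, 100, 1000, 10000]:
    p_n = (1.0 - 1.0 / n) * q + (1.0 / n) * r
    print(n,
          round(kl(p_n, q) / kl(q, p_n), 6),      # -> 1    (Corollary 2)
          round(kl(p_n, q) / chi2(p_n, q), 6))    # -> 0.5  (Corollary 3, in nats)
```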

32 H. Strengthened Jensen s inequality 3 Bounding away from zero a certain density between two probability measures enables the following strengthened version of Jensen s inequality, which generalizes a result in [33, Theorem ]. Lemma : Let f : R R be a convex function, P P 0 be probability measures defined on a measurable space A, F ), and fix an arbitrary random transformation P Z X : A R. Denote 0 P 0 P Z X P Z0, and P P Z X P Z. Then, where X 0 P 0, X P, and β E [fe[z 0 X 0 ])] fe[z 0 ]) ) E[fE[Z X ])] fe[z ]) 8) β ess inf dp dp 0 X 0 ). 9) Proof: If β = 0, the claimed result is Jensen s inequality, while if β =, P 0 = P and the result is trivial. Hence, we assume β 0, ). Note that P = βp 0 + β)p where P is the probability measure whose density with respect to P 0 is given by Letting P P Z X P Z, Jensen s inequality implies Furthermore, we can apply Jensen s inequality again to obtain dp = ) dp β 0. 30) dp 0 β dp 0 fe[z ]) β fe[z 0 ]) + β) fe[z ]). 3) fe[z ]) = fe [E[Z X ]]) 3) Substituting this bound on fe[z ]) in 3) we obtain the desired result. E [fe[z X ])] 33) = E [fe[z X ])] β E [fe[z 0 X 0 ])]. 34) β Remark 4: Letting Z = X, and choosing P 0 so that β = 0 e.g., P is a restriction of P 0 to an event of P 0 -probability less than ), 8) becomes Jensen s inequality fe[x ]) E[fX )]. Lemma finds the following application to the derivation of f-divergence inequalities. Theorem 3: Let f : 0, ) R be a convex function with f) = 0. Fix P Q on the same space with β, β ) [0, ) and let X P. Then, β D f P Q) E [ f expı P Q X)) )] f + χ P Q) ) 35) β D f P Q). 36) 0 We follow the notation in [] where P 0 P Z X P Z0 means that the marginal probability measures of the joint distribution P 0P Z X are P 0 and P Z0. March 4, 08

33 33 Proof: We invoke Lemma with P Z X that is given by the deterministic transformation exp ı P Q ) ) : A R. Then, E[Z 0 X 0 ] = exp ı P Q X 0 ) ). If, moreover, we let X 0 Q = P 0, we obtain and if we let X P = P, we have see 50)) E[Z 0 ] =, 37) E [fe[z 0 X 0 ])] = D f P Q) 38) E[Z ] = + χ P Q), 39) E [fe[z X ])] = E [ f expı P Q X)) )]. 40) Therefore, 35) follows from Lemma. Recalling 5), inequality 36) follows from Lemma as well switching the roles P 0 and P, namely, now we take P = P 0 and Q = P. Specializing Theorem 3 to the convex function on 0, ) where ft) = log t sharpens inequality 94) under the assumption of bounded relative information. Theorem 4: Fix P Q such that β, β ) 0, ). Then, β DQ P ) log + χ P Q) ) DP Q) 4) β DQ P ). 4) V. TOTAL VARIATION DISTANCE, RELATIVE INFORMATION SPECTRUM AND RELATIVE ENTROPY A. Exact Expressions The following result provides several useful expressions of the total variation distance in terms of the relative information. Theorem 5: Let P Q, and let X P and Y Q be defined on a measurable space A, F ). Then, P Q = E [ expıp Q Y )) ] 43) = E [ expı P Q Y )) ) +] 44) = E [ expı P Q Y )) ) ] 45) = E [ exp ı P Q X)) ) +] 46) = P [ ı P Q X) > 0 ] P [ ı P Q Y ) > 0 ]) 47) = P [ ı P Q Y ) 0 ] P [ ı P Q X) 0 ]) 48) z) + z {z > 0} = max{z, 0}, and z) z {z < 0} = max{ z, 0}. March 4, 08

34 Furthermore, if P Q, then = = = 0 0 β P [ ı P Q Y ) < log β ] dβ 49) [ P ı P Q X) > log ] dβ 50) β β [ F P Q log β) ] dβ. 5) P Q = E [ exp ı P Q X)) ) ] 34 5) = E [ exp ıp Q X)) ]. 53) Proof: See Appendix D. Remark 5: In view of 47), if P Q, the supremum in 58) is a maximum which is achieved by the event F = { a A: ı P Q a) > 0 } F. 54) Similarly to Theorem 5, the following theorem provides several expressions of the relative entropy in terms of the relative information spectrum. Theorem 6: If DP Q) <, then DP Q) = = = 0 0 FP Q α) ) dα 0 F P Q log β) dβ β P [ ı P Q Y ) > α ] αe α dα F P Q α) dα 55) F P Q log β) dβ 56) β 0 0 P [ ı P Q Y ) < α ] αe α dα 57) where Y Q, and for convenience 56) and 57) assume that the relative information and the resulting relative entropy are in nats. Proof: The expectation of a real-valued random variable V is equal to E[V ] = 0 P[V > t] dt 0 P[V < t] dt 58) where we are free to substitute > by, and < by. If we let V = ı P Q X) with X P, then 58) yields 55) provided that DP Q) = E[V ] <. Eq. 56) follows from 55) by the substitution α = log β when the relative entropy is expressed in nats. To prove 57), let Z = ı P Q Y ) with Y Q, and let V = rz) where r : 0, ) [0, ) is given in 43) with natural logarithm. The function r is strictly monotonically increasing on [, ), on which interval we define its inverse by s : [0, ) [, ); it is also strictly monotonically decreasing on 0, ], on which March 4, 08

35 35 interval we define its inverse by s : [0, ] 0, ]. Then, only the first integral on the right side of 58) can be non-zero, and we decompose it as DP Q) = = = 0 0 P [ Z, rz) > t ] dt + P [ Z > s t) ] dt + P[Z > v] log e v dv 0 0 P [ Z <, rz) > t ] dt P [ Z < s t) ] dt 59) 0 P[Z < v] log e v dv 60) where 60) follows from the change of variable of integration t = rv). Upon taking log e on both sides of the inequalities inside the probabilities in 60), and a further change of the variable of integration v = e α, 60) is seen to be equal to 57). B. Upper Bounds on P Q In this section, we provide three upper bounds on P Q which complement ). Theorem 7: If P Q and X P, then P Q log e DP Q) + E [ ı P Q X) ]. 6) Proof: For every z [, ], ) ) z exp z) {z > 0} {z > 0}. 6) log e Substituting z = ı P Q X), taking expectation of both sides of 6), and using 46) give P Q log e E [ ı P Q X) {ı P Q X) > 0} ] 63) = E [ ı P Q X) + ı P Q X) ] 64) = DP Q) + E [ ı P Q X) ]. 65) Remark 6: Theorem 7 is tighter than Pinsker s bound in [78,.3.4)]: P Q log e E [ ı P Q X) ]. 66) The second upper bound on P Q is a consequence of Theorem 5. Theorem 8: Let P Q with β, β ) [0, ). Then, for every β 0 [β, ], P Q β 0) P[ı P Q X) > 0] ] + β 0 β ) P [ı P Q X) > log β0 67) March 4, 08

36 where X P, and, for every β 0 [β, ], 36 P Q β 0) P[ı P Q Y ) < 0] + β 0 β ) P[ı P Q Y ) < log β 0 ] 68) where Y Q. Furthermore, both upper bounds on P Q in 67) and 68) are tight in the sense that they are achievable by suitable pairs of probability measures defined on a binary alphabet. Proof: Since the integrand in the right side of 50) is monotonically increasing in β, we may upper bound ] it by P [ı P Q X) > log when β [β β0, β 0 ], and by P[ı P Q X) > 0] when β β 0, ]. The same reasoning applied to 49) yields 68). To check the tightness of 67) for any β 0, ], choose an arbitrary η 0, ) and a pair of probability measures P and Q defined on the binary alphabet A = {0, } with P 0) = η ηβ, 69) Q0) = β P 0). 70) Then, we have ı P Q 0) = log β, ı P Q ) = log η < 0, and both sides of 67) are readily seen to be equal when β 0 = β. The tightness of 68) can be shown in a similar way. The third upper bound on P Q is a classical inequality [55, 99)], usually given in the context of bounding the error probability of Bayesian binary hypothesis testing in terms of the Bhattacharyya distance. Theorem 9: 4 P Q exp D P Q) ). 7) Remark 7: The bound in 7) is tight if P, Q are defined on {0, } with P 0) = Q0) or P 0) = Q). In view of the monotonicity of D α P Q) in α, Theorem 9 yields 4), which is equivalent to the Bretagnole- Huber inequality [5,.)] see also [08, pp. 30 3]). Note that 4) is tighter than ) only when P Q > C. Lower Bounds on P Q In this section, we give several lower bounds on P Q in terms of the relative information spectrum. Furthermore, in Section VI, we give lower bounds on P Q in terms of the relative entropy as well as other features of P and Q). If, for at least one value of β 0, ), either P [ ] ı P Q X) > log β or P [ ı P Q Y ) < log β ] are known then we get the following lower bounds on P Q as a consequence of Theorem 5: March 4, 08

37 Theorem 0: If P Q then, for every β 0 0, ), 37 P Q β 0 ) P [ ] ı P Q Y ) < log β 0, 7) ] P Q β 0 ) P [ı P Q X) > log β0 73) with X P and Y Q. Proof: The lower bounds in 7) and 73) follow from 49) and 50) respectively. For example, from 49), it follows that for an arbitrary β 0 0, ) P Q = P [ ı P Q Y ) < log β ] dβ 74) 0 β 0 P [ ı P Q Y ) < log β ] dβ 75) β 0 ) P [ ı P Q Y ) < log β 0 ] where 76) holds since the integrand in 75) is monotonically increasing in β 0, ]. Next we exemplify the utility of Theorem 0 by giving an alternative proof to the tight lower bound on the relative information spectrum, given in [68, Proposition ] as a function of the total variation distance. Proposition : Let P Q, then for every β > 0 0, β F P Q log β) β P Q β ), β 0, ], P Q ) P Q, Furthermore, for every β > 0 and δ [0, ), the lower bound in 77) is attainable by a pair P, Q) with P Q = δ. Proof: Since P Q, they cannot be mutually singular and therefore P Q <. From 7) and 73) see Theorem 0), it follows that for every β 0 0, ) P Q β 0) [ F P Q log )]. 78) β 0 Consequently, the substitution β = β 0 > yields F P Q log β) β P Q β ) which provides a non-negative lower bound on the relative information spectrum provided that β P Q. Having shown 77), we proceed to argue that it is tight. Fix δ [0, ) and let P Q = δ, which yields P Q = δ in the right side of 77).. 76) 77) 79) March 4, 08

38 If β < δ, let the pair P, Q) be defined on the binary alphabet {0, } with P ) = and Q) = δ thereby ensuring δ = P Q ). Then, from 7), F P Q log β) = P 0) = 0. 80) If β δ, let τ > β and consider the probability measures P = P τ and Q = Q τ defined on the binary alphabet {0, } with P τ ) = τδ τ and Q τ ) = then which tends to βδ β δ τ note that indeed δ = P τ Q τ ). Since < β < τ F P Q log β) = P τ 0) = τδ τ 8) in the right side of 77) by letting τ β. 38 Attained under certain conditions, the following counterpart to Theorem 8 gives a lower bound on the total variation distance based on the distribution of the relative information. It strengthens the bound in [0, Theorem 8], which in turn tightens the lower bounds in [78,.3.8)] and [99, Lemma 7]. Theorem : If P Q then, for any η, η > 0, P Q exp η ) ) P [ ı P Q X) η ] + expη ) ) P [ ı P Q X) η ] 8) with X P. Equality holds in 8) if P and Q are probability measures defined on {0, } and, for an arbitrary η, η > 0, P 0) = exp η ) exp η η ), 83) Q0) = exp η ) P 0). 84) Proof: From 53), it follows that for arbitrary η, η > 0, P Q E [ exp ıp Q X) ) { ıp Q X) η }] + E [ exp ıp Q X) ) { ıp Q X) η }] 85) which is readily loosened to obtain 8). Equality holds in 8) for P and Q in the theorem statement since ı P Q X) only takes the values log P 0) Q0) = η and log P ) Q) = η. The following lower bound on the total variation distance is the counterpart to Theorem 7. Theorem : If P Q, and X P then P Q log e E [ ı P Q X) ] DP Q). 86) Proof: We reason in parallel to the proof of Theorem 7. For all z [, ], [ exp z) ] z) log e. 87) March 4, 08

39 Substituting z = ı P Q X), taking expectation of both sides of 87), and using 5) we obtain P Q log e E [ ı P Q X) ) ] 39 88) = E [ ı P Q X) ı P Q X) ] 89) = E [ ı P Q X) ] DP Q). 90) Remark 8: The combination of Pinsker s inequality ) and 86) yields the following inequality due to Barron see [6, p. 339]) which is useful in establishing convergence results for relative entropy e.g. [7]) E [ ı P Q X) ] DP Q) + DP Q) log e 9) with X P. D. Relative Entropy and Bhattacharyya Distance The following result refines 5) by using an approach which relies on moment inequalities [96] [98]. The coverage in this section is self-contained. Theorem 3: If P Q, then DP Q) log + χ P Q) ) 3 χ P Q) ) log e + χ Q P ) ) + χ P Q) ). 9) Furthermore, if {P n } converges to Q in the sense of 06), then the ratio of DP n Q) and its upper bound in 9) tends to as n. Proof: The derivation of 9) relies on [96, Theorem.] which states that if W is a non-negative random variable, then E[W α ] E α [W ] ) log e, α 0, αα ) λ α log E[W ] ) E[log W ], α = 0 93) is log-convex in α R. E[W log W ] E[W ] log E[W ] ), α = To prove 9), let W = dp dq X) with X P, then 93) yields λ 0 = log + χ P Q) ) DP Q), 94) λ α = [ + α )H α Q P ) + χ P Q) ) ] α log e 95) αα + ) March 4, 08

40 for all α > 0, and specializing 95) yields In view of the log-convexity of λ α in α R, then which, by assembling 94) 98), yields 9). that λ = χ P Q) log e ), + χ P Q) [ ] 96) λ = + χ Q P ) 6 + χ P Q) ) log e. 97) λ 0 λ λ 98) Suppose that {P n } converges to Q in the sense of 06). Then, it follows from Theorem and Corollary 3 lim DP n Q) = 0, 99) n lim n χ P n Q) = 0, 300) DP n Q) lim n χ P n Q) = log e, 30) χ Q P n ) lim n χ =. 30) P n Q) Let U n denote the upper bound on DP n Q) in 9). Assembling 99) 30), it can be verified that lim n U n =. 303) DP n Q) 40 Remark 9: In view of 99) 30), while the ratio of the right side of 9) with P = P n and DP n Q) tends to, the ratio of the looser bound in 5) and DP n Q) tends to. Remark 0: If {P n } and Q are defined on a finite set A, then the condition in 06) is equivalent to P n Q 0 with Qa) > 0 for all a A. Remark : An alternative refinement of 5) has been recently obtained in [98] as a function of χ P Q) and the Bhattacharyya distance BP Q) see Definition 5): [ DP Q) log + χ P Q) ) 3 exp BP Q) ) + χ P Q) ] log e 9 χ. 304) P Q) Eq. 304) can be generalized by relying on the log-convexity of λ α in α R, which yields λ α 0 λ α λ α 305) March 4, 08

41 for all α 0, ); consequently, assembling 94), 95), 96) and 305) yields DP Q) log + χ P Q) ) α αα + ) ) α χ P Q) ) α α [ α)hα Q P ) ) + χ P Q) ) α ] α for all α 0, ). Note that in the special case α =, 306) becomes 304), as can be readily verified in view of 83) and the symmetry property H P Q) = H Q P ). Remark : The following lower bound on the relative entropy has been derived in [98], based on the approach of moment inequalities: Note that from 8) DP Q) BP Q) + BP Q) log 4 306) log e 6 [ exp BP Q) )] log e exp 4BP Q) ) + χ Q P ). 307) 4 P Q and since the right side of 307) is monotonically increasing in BP Q), the replacement of BP Q) in the right side of 307) with its lower bound in 308) yields DP Q) log 4 P Q ) + ) 308) 3 4 P Q log e 8 P Q + χ Q P ) P Q. 309) Although 309) improves the bound in 4), it is weaker than 307), and it satisfies the tightness property in Theorem 3 only in special cases such as when P and Q are defined on A = {0, } with P 0) = Q) = ε and we let ε 0. Define the binary relative entropy function as the continuous extension to [0, ] of ) ) x x dx y) = x log + x) log. 30) y y The following result improves the upper bound in 8). a) Theorem 4: Let P Q with β, β ) 0, ). Then, which is attainable for binary alphabets. For the derivation of 307) for a general alphabet, similarly to [98], set W = λ 0 λ 4 λ which follows from the log-convexity of λ α in α. χ P Q) β ) β ), 3) dq X) in 93) with X P, and use the inequality dp March 4, 08

42 b) { 3 DP Q) min log + c) + + β ), + c) 3) c + 8β c log e + c) c ) log e, 4β 33) β )} d β β β ) β β 34) where we have abbreviated c = χ P Q) for typographical convenience. Proof: To prove 3), we first consider the case where P, Q are defined on A = {0, } and P 0) Q0) = β, P ) Q) = β and. Straightforward calculation yields P 0) = β β β, Q0) = β P 0) 35) χ P Q) = β ) β ). 36) In the case of a general alphabet, consider the elementary bound with a < 0 < b: E[Z ] ab which holds for any Z [a, b], E[Z] = 0, and follows simply by taking expectations of Z = ab + Za + b) Z a)b Z) 37) ab + Za + b). 38) Since χ P Q) = E[Z ], 3) follows by letting a = β, b = β and Z = dp Y ), Y Q. 39) dq To prove 3), note that it follows by combining 9) with the left side of the inequality β χ Q P ) where 30) follows from Theorem 3 with ft) = t for t > 0. To prove 33), note that assembling 8) and 4) yields χ P Q) + χ P Q) β χ Q P ) 30) DP Q) log + χ P Q) ) β D P Q) χ P Q) log e and solving this quadratic inequality in DP Q), for fixed χ P Q), yields the bound in 33). Bound 34) holds since the maximal DP Q), for fixed χ P Q), is monotonically increasing in χ P Q). In view of 3), DP Q) cannot be larger than its maximal value when χ P Q) = β ) β ). In the latter case, the condition of equality in 38) recall that E[Z] = 0) is P[Z = a] = 4 3) b = P[Z = b] 3) b a March 4, 08

43 43 which implies that the maximal relative entropy DP Q) over all P Q with given β, β ) 0, ) is equal to E [ + Z) log + Z) ] 33) b + a) log + a) a + b) log + b) = b a β ) = d β β β ) β β β where 33) 35) follow from 30) and 3) with a = β and b = β. Remark 3: The proof of 3) relies on the left side of 30); this strengthens the bound which follows from Theorem 6, given by χ Q P ) β χ P Q). The bound 33) is typically of similar tightness as the bound in 3), although none of them outperforms the other for all β, β ) 0, ) and χ P Q) [ 0, β ) β ) ] see 3)). Remark 4: The left inequality in 8) and Theorem 4 provide an analytical outer bound on the locus of the points χ P Q), DP Q)) where P Q and β dp dq β for given β, β ) 0, ). Example : In continuation to Remark 4, for given β, β ) 0, ), Figure compares the locus of the points χ P Q), DP Q)) when P, Q are restricted to binary alphabets, and P Q is bounded between β and β, with an outer bound constructed with the left inequality in 8) and Theorem 4 recall that the outer bound is valid for an arbitrary alphabet). 34) 35) DP kq) nats b) DP kq) nats b) a) a) P kq) P kq) Fig.. Comparison of the locus of the points χ P Q), DP Q)) when P, Q are defined on A = {0, } with the bounds in the left side of 8) a) and Theorem 4 b); β, β ) =, ) and β, β ) =, ) 0 0 in the upper and lower plots, respectively. The dashed curve in each plot corresponds to the looser bound in 5). March 4, 08

44 44 The following result relies on the earlier analysis to provide bounds on the Bhattacharyya distance, expressed in terms of χ divergences and relative entropy. Theorem 5: If P Q, then the following bounds on the Bhattacharyya distance hold: log + χ P Q) ) [ log + 3 χ P Q) 4 log e log + χ P Q) ) ] ) DP Q) BP Q) 36) log + χ P Q) ) 3 4 log + P Q) ) 3 + χ P Q) ) + χ Q P ) ). 37) Furthermore, if {P n } converges to Q in the sense of 06), then the ratio of the bounds on BP n Q) in 36) and 37) tends to as n. Proof: In view of the log-convexity of λ α in α R, λ 0 λ λ, λ λ λ 3 38) for any choice of the random variable W in 93). Consequently, assembling 83), 94), 95) and 38) yield the bounds on BP Q) in 36) and 37). Suppose that {P n } converges to Q in the sense of 06). Let L n and U n denote, respectively, the lower and upper bounds on BP n Q) in 36) and 37). Assembling 99) 30), it can be easily verified that lim n which yields that lim n U n L n =. L n χ P n Q) = lim n Remark 5: Note that 37) refines the bound U n χ P n Q) = 8 log e 39) BP Q) log + χ P Q) ) 330) which is equivalent to λ 0 in view of Jensen s inequality, 83) and 95)). Remark 6: Let {P n } converge to Q in the sense of 06). In view of 39), it follows that BP n Q) lim n χ P n Q) = 8 log e, 33) from which we can surmise that both upper bounds in 9) and 304) are tight under the condition in 06) see Theorem 3), although 9) only depends on χ -divergences. In view of 33), the lower bound in 307) is also tight under the condition in 06), in the sense that the ratio of DP n Q) and its lower bound in 307) tends to as n ; this sufficient condition for the tightness of 307) strengthens the result in [98, Section 4]. March 4, 08

45 VI. REVERSE PINSKER INEQUALITIES 45 It is not possible to lower bound P Q solely in terms of DP Q) since for any arbitrarily small ɛ > 0 and arbitrarily large λ > 0, we can construct examples with P Q < ɛ and λ < DP Q) <. Therefore, each of the bounds in this section involves not only DP Q) but another feature of the pair P, Q). A. Bounded Relative Information As in Section IV, the following result involves the bounds on the relative information. Theorem 6: If β 0, ) and β [0, ), then, DP Q) ϕβ ) ϕβ ) ) P Q 33) where ϕ: [0, ) [0, ) is given by 0 t = 0 t log t ϕt) = t 0, ), ) t log e t =. 333) Proof: Let X P, Y Q, and Z be defined in 3). The function ϕ: [0, ) [0, ) is continuous, monotonically increasing and non-negative; the monotonicity property holds since t ) ϕ t) = t ) log e log t 0 for all t > 0, and its non-negativity follows from the fact that ϕ is monotonically increasing on [0, ) and ϕ0) = 0. Accordingly, ϕβ ) ϕz) ϕβ ) 334) since 3) and 5) 5) imply that Z [β, β ] with probability one. The relative entropy satisfies DP Q) = E [ Z log Z ] 335) = E [ ϕz) Z ) ] 336) = E [ ϕz) Z ) {Z > } ] + E [ ϕz) Z ) {Z < } ]. 337) We bound each of the summands in the right side of 337) separately. Invoking 334), we have E [ ϕz) Z ) {Z > } ] ϕβ ) E[ Z ) {Z > } ] 338) = ϕβ ) E[ Z) ] 339) = ϕβ ) P Q 340) March 4, 08

46 46 where 339) holds since x) x {x < 0}, and 340) follows from 45) with Z in 3). Similarly, 334) yields E [ ϕz) Z ) {Z < } ] ϕβ ) E [ Z ) {Z < } ] 34) = ϕβ ) E [ Z) +] 34) = ϕβ ) P Q 343) where 343) follows from 44). Assembling 337), 340) and 343), we obtain 33). Remark 7: By dropping the negative term in 33), we can get the weaker version in [0, Theorem 7]: ) log β DP Q) P Q. 344) β ) The coefficient of P Q in the right side of 344) is monotonically decreasing in β and it tends to log e by letting β. The improvement over 344) afforded by 33) is exemplified in Appendix E. The bound in 344) has been recently used in the context of the optimal quantization of probability measures [, Proposition 4]. Remark 8: The proof of Theorem 6 hinges on the fact that the function ϕ is monotonically increasing. It can be verified that ϕ is also concave and differentiable. Taking into account these additional properties of ϕ, the bound in Theorem 6 can be tightened as see Appendix F): DP Q) ϕβ ) ϕβ ) ϕ β ) ) β P Q + ϕ β ) E[ ZZ ) {Z > } ] 345) which is expressed in terms of the distribution of the relative information. The second summand in the right side of 345) satisfies χ P Q) + β P Q E[ ZZ ) {Z > } ] 346) χ P Q) + P Q. 347) From 3), 45) and Z [β, β ], the gap between the upper and lower bounds in 347) satisfies β ) P Q = β ) E [ Z) +] 348) β ) 349) which is upper bounded by, and it is close to zero if β. The combination of 345) and 347) leads to DP Q) ϕβ ) ϕβ ) + ϕ β ) β ) ) P Q + ϕ β ) χ P Q). 350) Remark 9: A special case of 344), where Q is the uniform distribution over a set of a finite size, was recently rediscovered in [58, Corollary 3] based on results from [5]. March 4, 08

47 Remark 30: For ε [0, ] and a fixed probability measure Q, define 47 D ε, Q) = inf DP Q). 35) P : P Q ε From Sanov s theorem see [0, Theorem.4.]), D ε, Q) is equal to the asymptotic exponential decay of the probability that the total variation distance between the empirical distribution of a sequence of i.i.d. random variables and the true distribution Q is more than a specified value ε. Bounds on D ε, Q) have been shown in [0, Theorem ], which, locally, behave quadratically in ε. Although this result was classified in [0] as a reverse Pinsker inequality, note that it differs from the scope of this section which provides, under suitable conditions, lower bounds on the total variation distance as a function of the relative entropy. B. Lipschitz Constraints Definition 6: A function f : B R, where B R, is L-Lipschitz if for all x, y B fx) fy) L x y. 35) The following bound generalizes [35, Theorem 6] to the non-discrete setting. Theorem 7: Let P Q with β 0, ) and β [0, ), and f : [0, ) R be continuous and convex with f) = 0, and L-Lipschitz on [β, β ]. Then, Proof: If Y Q, and Z is given by 3) then f) = 0 yields where 357) holds due to 43). D f P Q) L P Q. 353) D f P Q) = E [ fz) ] 354) E [ fz) f) ] L E [ Z ] Note that if f has a bounded derivative on [β, β ], we can choose L = Remark 3: In the case ft) = t log t, f0) = 0, 358) particularizes to 355) 356) = L P Q 357) sup f t) <. 358) t [β,β ] L = max { logeβ ), logeβ )} 359) resulting in a reverse Pinsker inequality which is weaker than that in 33) by at least a factor of. March 4, 08

48 C. Finite Alphabet 48 Throughout this subsection, we assume that P and Q are probability measures defined on a common finite set A, and Q is strictly positive on A, which has more than one element. The bound in 344) strengthens the finite-alphabet bound in [38, Lemma 3.0]: ) DP Q) log P Q 360) with Q min Q min = min a A Qa). 36) To verify this, notice that β Q min. Let v : 0, ) 0, ) be defined by vt) = t log t ; since v is a monotonically decreasing and non-negative function, we can weaken 344) to write ) log Q DP Q) min P Q 36) Q min ) ) log P Q 363) where 363) follows from 36). Q min The main result in this subsection is the following bound. Theorem 8: DP Q) log + ) P Q. 364) Q min Furthermore, if Q P and β is defined as in 50), then the following tightened bound holds: ) P Q DP Q) log + β log e P Q. 365) Q min Proof: Combining 5) and the following finite-alphabet upper bound on χ P Q) yields 364): Q min χ P Q) = a A P a) Qa)) Qa)/Q min 366) a A P a) Qa) ) 367) max P x) Qx) P a) Qa) x A a A = P Q max P a) Qa) 368) a A P Q. 369) March 4, 08

49 If P Q, then 365) follows by combining 369) and ) χ P Q) exp DP Q) + β DQ P ) 370) 49 where 370) follows by rearranging 4), and 37) follows from ). exp DP Q) + P Q β log e ) 37) Remark 3: It is easy to check that Theorem 8 strengthens the bound by Csiszár and Talata 3) by at least a factor of since upper bounding the logarithm in 364) gives DP Q) log e Q min P Q. 37) Remark 33: In the finite-alphabet case, similarly to 364), one can obtain another upper bound on DP Q) as a function of the l norm P Q : DP Q) Q min P Q log e 373) which appears in the proof of Property 4 of [0, Lemma 7], and also used in [57, 74)]. Furthermore, similarly to 365), the following tightened bound holds if P Q: DP Q) log + P ) Q β log e P Q 374) Q min which follows by combining 367), 37), and the inequality P Q P Q. Remark 34: Combining ) and 369) yields that if P Q are defined on a common finite set, then DP Q) χ P Q) Q min log e 375) which at least doubles the lower bound in [7, Lemma 6]. This, in turn, improves the tightened upper bound on the strong data processing inequality constant in [7, Theorem 0] by a factor of. Remark 35: Reverse Pinsker inequalities have been also derived in quantum information theory [3], [4]), providing upper bounds on the relative entropy of two quantum states as a function of the trace norm distance when the minimal eigenvalues of the states are positive c.f. [3, Theorem 6] and [4, Theorem ]). When the variational distance is much smaller than the minimal eigenvalue see [3, Eq. 57)]), the latter bounds have a quadratic scaling in this distance, similarly to 364); they are also inversely proportional to the minimal eigenvalue, similarly to the dependence of 364) in Q min. Remark 36: Let P and Q be probability distributions defined on an arbitrary alphabet A. Combining Theorems 7 and 6 leads to a derivation of an upper bound on the difference DP Q) DQ P ) as a function of P Q as long as P Q and the relative information ı P Q is bounded away from and +. Furthermore, another upper bound on the difference of the relative entropies can be readily obtained by combining Theorems 7 and 8 when P and Q are probability measures defined on a finite alphabet. In the latter case, combining March 4, 08

50 50 Theorem 7 and 374) also yields another upper bound on DP Q) DQ P ) which scales quadratically with P Q. All these bounds form a counterpart to [5, Theorem ] and Theorem 7, providing measures of the asymmetry of the relative entropy when the relative information is bounded. D. Distance From the Equiprobable Distribution If P is a distribution on a finite set A, HP ) gauges the distance from U, the equiprobable distribution defined on A, since HP ) = log A DP U). Thus, it is of interest to explore the relationship between HP ) and P U. Next, we determine the exact locus of the points HP ), P U ) among all probability measures P defined on A, and this region is compared to upper and lower bounds on P U as a function of HP ). As usual, hx) denotes the continuous extension of x log x x) log x) to x [0, ] and dx y) denotes the binary relative entropy in 30). Theorem 9: Let U be the equiprobable distribution on a {,..., A }, with < A <. a) For 0, A )], 3 max HP ) = log A min d m P : P U = m A + ) m A where the minimum in the right side of 376) is over 376) m {,..., A A }. 377) Denoting such an integer by m, the maximum in the left side of 376) is attained by A + l {,..., m }, m P l) = A A m ), l {m +,..., A }. b) Let 0, k = 0 h k = h A k ) + A k log k, k {,..., A } 378) 379) If H [h k, h k ) for k {,..., A }, then log A, k = A. min P U = k + θ) A ) 380) P : HP )=H 3 There is no P with P U > A ). March 4, 08

51 which is achieved by k + θ) A, l = 5 P k) θ l) = A, l {,..., k}, θ A, l = k + 0, l {k +,..., A } 38) where θ [0, ) is chosen so that HP k) θ ) = H. Proof: See Appendix G. Remark 37: For probability measures defined on a -element set A, the maximal and minimal values of P U in Theorem 9 coincide. This can be verified since, if P ) = p for p [0, ], then P U = p and HP ) = hp). Hence, if A = and HP ) = H [0, log ], then P U = h H) 38) where h : [0, log ] [0, ] denotes the inverse of the binary entropy function. Results on the more general problem of finding bounds on HP ) HQ) based on P Q can be found in [0, Theorem 7.3.3], [5], [90] and [4]. Most well-known among them is ) A HP ) HQ) P Q log P Q which holds if P, Q are probability measures defined on a finite set A with P a) Qa) 383) for all a A see [], and [7, Lemma.7] with a stronger sufficient condition). Particularizing 383) to the case where Q = U, and P a) A for all a A yields ) A HP ) log A P U log, 384) P U a bound which finds use in information-theoretic security [30]. Particularizing ), 4), and 364) we obtain HP ) log A P U log e, 385) HP ) log A + log 4 P U ), 386) HP ) log A log + A P U ). 387) If either A = or 8 A 0, it can be checked that the lower bound on HP ) in 384) is worse than 387), irrespectively of P U note that 0 P U A )). The exact locus of HP ), P U ) among all the probability measures P defined on a finite set A see Theorem 9), and the bounds in 385) 387) are illustrated in Figure for A = 4 and A = 56. For A = 4, March 4, 08

52 5 the lower bound in 387) is tighter than 384). For A = 56, we only show 387) in Figure as in this case 384) offers a very minor improvement in a small range. As the cardinality of the set A increases, the gap between the exact locus shaded region) and the upper bound obtained from 385) and 386) Curves a) and b), respectively) decreases, whereas the gap between the exact locus and the lower bound in 387) Curve c)) increases. b) b) P U c) a) P U a) c) HP )bits) HP )bits) Fig.. The exact locus of HP ), P U ) among all the probability measures P defined on a finite set A, and bounds on P U as a function of HP ) for A = 4 left plot), and A = 56 right plot). The point HP ), P U ) = 0, A )) is depicted on the y-axis. In the two plots, Curves a), b) and c) refer, respectively, to 385), 386) and 387); the exact locus shaded region) refers to Theorem 9. E. The Exponential Decay of the Probability of Non-Strongly Typical Sequences The objective is to bound the function L δ Q) = min DP Q) 388) P T δq) where the subset of probability measures on A, F ) which are δ-close to Q is given by { } T δ Q) = P : a A, P a) Qa) δ Qa). 389) Note that a,..., a n ) is strongly δ-typical according to Q if its empirical distribution belongs to T δ Q). According to Sanov s theorem e.g. [0, Theorem.4.]), if the random variables are independent and distributed according to Q, then the probability that Y,..., Y n ), is not δ-typical vanishes exponentially with exponent L δ Q). March 4, 08

f-divergence Inequalities

f-divergence Inequalities SASON AND VERDÚ: f-divergence INEQUALITIES f-divergence Inequalities Igal Sason, and Sergio Verdú Abstract arxiv:508.00335v7 [cs.it] 4 Dec 06 This paper develops systematic approaches to obtain f-divergence

More information

arxiv: v4 [cs.it] 17 Oct 2015

arxiv: v4 [cs.it] 17 Oct 2015 Upper Bounds on the Relative Entropy and Rényi Divergence as a Function of Total Variation Distance for Finite Alphabets Igal Sason Department of Electrical Engineering Technion Israel Institute of Technology

More information

On Improved Bounds for Probability Metrics and f- Divergences

On Improved Bounds for Probability Metrics and f- Divergences IRWIN AND JOAN JACOBS CENTER FOR COMMUNICATION AND INFORMATION TECHNOLOGIES On Improved Bounds for Probability Metrics and f- Divergences Igal Sason CCIT Report #855 March 014 Electronics Computers Communications

More information

Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding

Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding APPEARS IN THE IEEE TRANSACTIONS ON INFORMATION THEORY, FEBRUARY 015 1 Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding Igal Sason Abstract Tight bounds for

More information

arxiv: v4 [cs.it] 8 Apr 2014

arxiv: v4 [cs.it] 8 Apr 2014 1 On Improved Bounds for Probability Metrics and f-divergences Igal Sason Abstract Derivation of tight bounds for probability metrics and f-divergences is of interest in information theory and statistics.

More information

Tight Bounds for Symmetric Divergence Measures and a New Inequality Relating f-divergences

Tight Bounds for Symmetric Divergence Measures and a New Inequality Relating f-divergences Tight Bounds for Symmetric Divergence Measures and a New Inequality Relating f-divergences Igal Sason Department of Electrical Engineering Technion, Haifa 3000, Israel E-mail: sason@ee.technion.ac.il Abstract

More information

Arimoto-Rényi Conditional Entropy. and Bayesian M-ary Hypothesis Testing. Abstract

Arimoto-Rényi Conditional Entropy. and Bayesian M-ary Hypothesis Testing. Abstract Arimoto-Rényi Conditional Entropy and Bayesian M-ary Hypothesis Testing Igal Sason Sergio Verdú Abstract This paper gives upper and lower bounds on the minimum error probability of Bayesian M-ary hypothesis

More information

Arimoto Channel Coding Converse and Rényi Divergence

Arimoto Channel Coding Converse and Rényi Divergence Arimoto Channel Coding Converse and Rényi Divergence Yury Polyanskiy and Sergio Verdú Abstract Arimoto proved a non-asymptotic upper bound on the probability of successful decoding achievable by any code

More information

Soft Covering with High Probability

Soft Covering with High Probability Soft Covering with High Probability Paul Cuff Princeton University arxiv:605.06396v [cs.it] 20 May 206 Abstract Wyner s soft-covering lemma is the central analysis step for achievability proofs of information

More information

Convergence of generalized entropy minimizers in sequences of convex problems

Convergence of generalized entropy minimizers in sequences of convex problems Proceedings IEEE ISIT 206, Barcelona, Spain, 2609 263 Convergence of generalized entropy minimizers in sequences of convex problems Imre Csiszár A Rényi Institute of Mathematics Hungarian Academy of Sciences

More information

Convexity/Concavity of Renyi Entropy and α-mutual Information

Convexity/Concavity of Renyi Entropy and α-mutual Information Convexity/Concavity of Renyi Entropy and -Mutual Information Siu-Wai Ho Institute for Telecommunications Research University of South Australia Adelaide, SA 5095, Australia Email: siuwai.ho@unisa.edu.au

More information

A new converse in rate-distortion theory

A new converse in rate-distortion theory A new converse in rate-distortion theory Victoria Kostina, Sergio Verdú Dept. of Electrical Engineering, Princeton University, NJ, 08544, USA Abstract This paper shows new finite-blocklength converse bounds

More information

Bounds for entropy and divergence for distributions over a two-element set

Bounds for entropy and divergence for distributions over a two-element set Bounds for entropy and divergence for distributions over a two-element set Flemming Topsøe Department of Mathematics University of Copenhagen, Denmark 18.1.00 Abstract Three results dealing with probability

More information

On the Entropy of Sums of Bernoulli Random Variables via the Chen-Stein Method

On the Entropy of Sums of Bernoulli Random Variables via the Chen-Stein Method On the Entropy of Sums of Bernoulli Random Variables via the Chen-Stein Method Igal Sason Department of Electrical Engineering Technion - Israel Institute of Technology Haifa 32000, Israel ETH, Zurich,

More information

MMSE Dimension. snr. 1 We use the following asymptotic notation: f(x) = O (g(x)) if and only

MMSE Dimension. snr. 1 We use the following asymptotic notation: f(x) = O (g(x)) if and only MMSE Dimension Yihong Wu Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Email: yihongwu@princeton.edu Sergio Verdú Department of Electrical Engineering Princeton University

More information

5.1 Inequalities via joint range

5.1 Inequalities via joint range ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 5: Inequalities between f-divergences via their joint range Lecturer: Yihong Wu Scribe: Pengkun Yang, Feb 9, 2016

More information

Distance-Divergence Inequalities

Distance-Divergence Inequalities Distance-Divergence Inequalities Katalin Marton Alfréd Rényi Institute of Mathematics of the Hungarian Academy of Sciences Motivation To find a simple proof of the Blowing-up Lemma, proved by Ahlswede,

More information

Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities

Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities Maxim Raginsky and Igal Sason ISIT 2013, Istanbul, Turkey Capacity-Achieving Channel Codes The set-up DMC

More information

Some Expectations of a Non-Central Chi-Square Distribution With an Even Number of Degrees of Freedom

Some Expectations of a Non-Central Chi-Square Distribution With an Even Number of Degrees of Freedom Some Expectations of a Non-Central Chi-Square Distribution With an Even Number of Degrees of Freedom Stefan M. Moser April 7, 007 Abstract The non-central chi-square distribution plays an important role

More information

Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012

Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012 Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 202 BOUNDS AND ASYMPTOTICS FOR FISHER INFORMATION IN THE CENTRAL LIMIT THEOREM

More information

Dispersion of the Gilbert-Elliott Channel

Dispersion of the Gilbert-Elliott Channel Dispersion of the Gilbert-Elliott Channel Yury Polyanskiy Email: ypolyans@princeton.edu H. Vincent Poor Email: poor@princeton.edu Sergio Verdú Email: verdu@princeton.edu Abstract Channel dispersion plays

More information

On the Chi square and higher-order Chi distances for approximating f-divergences

On the Chi square and higher-order Chi distances for approximating f-divergences On the Chi square and higher-order Chi distances for approximating f-divergences Frank Nielsen, Senior Member, IEEE and Richard Nock, Nonmember Abstract We report closed-form formula for calculating the

More information

Lower Bounds on the Graphical Complexity of Finite-Length LDPC Codes

Lower Bounds on the Graphical Complexity of Finite-Length LDPC Codes Lower Bounds on the Graphical Complexity of Finite-Length LDPC Codes Igal Sason Department of Electrical Engineering Technion - Israel Institute of Technology Haifa 32000, Israel 2009 IEEE International

More information

topics about f-divergence

topics about f-divergence topics about f-divergence Presented by Liqun Chen Mar 16th, 2018 1 Outline 1 f-gan: Training Generative Neural Samplers using Variational Experiments 2 f-gans in an Information Geometric Nutshell Experiments

More information

Journal of Inequalities in Pure and Applied Mathematics

Journal of Inequalities in Pure and Applied Mathematics Journal of Inequalities in Pure and Applied Mathematics http://jipam.vu.edu.au/ Volume, Issue, Article 5, 00 BOUNDS FOR ENTROPY AND DIVERGENCE FOR DISTRIBUTIONS OVER A TWO-ELEMENT SET FLEMMING TOPSØE DEPARTMENT

More information

Channels with cost constraints: strong converse and dispersion

Channels with cost constraints: strong converse and dispersion Channels with cost constraints: strong converse and dispersion Victoria Kostina, Sergio Verdú Dept. of Electrical Engineering, Princeton University, NJ 08544, USA Abstract This paper shows the strong converse

More information

On the Concentration of the Crest Factor for OFDM Signals

On the Concentration of the Crest Factor for OFDM Signals On the Concentration of the Crest Factor for OFDM Signals Igal Sason Department of Electrical Engineering Technion - Israel Institute of Technology, Haifa 3, Israel E-mail: sason@eetechnionacil Abstract

More information

Solution of the 8 th Homework

Solution of the 8 th Homework Solution of the 8 th Homework Sangchul Lee December 8, 2014 1 Preinary 1.1 A simple remark on continuity The following is a very simple and trivial observation. But still this saves a lot of words in actual

More information

Entropy measures of physics via complexity

Entropy measures of physics via complexity Entropy measures of physics via complexity Giorgio Kaniadakis and Flemming Topsøe Politecnico of Torino, Department of Physics and University of Copenhagen, Department of Mathematics 1 Introduction, Background

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

CLASSICAL PROBABILITY MODES OF CONVERGENCE AND INEQUALITIES

CLASSICAL PROBABILITY MODES OF CONVERGENCE AND INEQUALITIES CLASSICAL PROBABILITY 2008 2. MODES OF CONVERGENCE AND INEQUALITIES JOHN MORIARTY In many interesting and important situations, the object of interest is influenced by many random factors. If we can construct

More information

Hartogs Theorem: separate analyticity implies joint Paul Garrett garrett/

Hartogs Theorem: separate analyticity implies joint Paul Garrett  garrett/ (February 9, 25) Hartogs Theorem: separate analyticity implies joint Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ (The present proof of this old result roughly follows the proof

More information

An Achievable Error Exponent for the Mismatched Multiple-Access Channel

An Achievable Error Exponent for the Mismatched Multiple-Access Channel An Achievable Error Exponent for the Mismatched Multiple-Access Channel Jonathan Scarlett University of Cambridge jms265@camacuk Albert Guillén i Fàbregas ICREA & Universitat Pompeu Fabra University of

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Examples of Dual Spaces from Measure Theory

Examples of Dual Spaces from Measure Theory Chapter 9 Examples of Dual Spaces from Measure Theory We have seen that L (, A, µ) is a Banach space for any measure space (, A, µ). We will extend that concept in the following section to identify an

More information

2 Sequences, Continuity, and Limits

2 Sequences, Continuity, and Limits 2 Sequences, Continuity, and Limits In this chapter, we introduce the fundamental notions of continuity and limit of a real-valued function of two variables. As in ACICARA, the definitions as well as proofs

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

Analysis II - few selective results

Analysis II - few selective results Analysis II - few selective results Michael Ruzhansky December 15, 2008 1 Analysis on the real line 1.1 Chapter: Functions continuous on a closed interval 1.1.1 Intermediate Value Theorem (IVT) Theorem

More information

Lecture 5 - Information theory

Lecture 5 - Information theory Lecture 5 - Information theory Jan Bouda FI MU May 18, 2012 Jan Bouda (FI MU) Lecture 5 - Information theory May 18, 2012 1 / 42 Part I Uncertainty and entropy Jan Bouda (FI MU) Lecture 5 - Information

More information

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply

More information

Functional Properties of MMSE

Functional Properties of MMSE Functional Properties of MMSE Yihong Wu epartment of Electrical Engineering Princeton University Princeton, NJ 08544, USA Email: yihongwu@princeton.edu Sergio Verdú epartment of Electrical Engineering

More information

The Fading Number of a Multiple-Access Rician Fading Channel

The Fading Number of a Multiple-Access Rician Fading Channel The Fading Number of a Multiple-Access Rician Fading Channel Intermediate Report of NSC Project Capacity Analysis of Various Multiple-Antenna Multiple-Users Communication Channels with Joint Estimation

More information

Solution. 1 Solution of Homework 7. Sangchul Lee. March 22, Problem 1.1

Solution. 1 Solution of Homework 7. Sangchul Lee. March 22, Problem 1.1 Solution Sangchul Lee March, 018 1 Solution of Homework 7 Problem 1.1 For a given k N, Consider two sequences (a n ) and (b n,k ) in R. Suppose that a n b n,k for all n,k N Show that limsup a n B k :=

More information

Copyright 2010 Pearson Education, Inc. Publishing as Prentice Hall.

Copyright 2010 Pearson Education, Inc. Publishing as Prentice Hall. .1 Limits of Sequences. CHAPTER.1.0. a) True. If converges, then there is an M > 0 such that M. Choose by Archimedes an N N such that N > M/ε. Then n N implies /n M/n M/N < ε. b) False. = n does not converge,

More information

An elementary proof of the weak convergence of empirical processes

An elementary proof of the weak convergence of empirical processes An elementary proof of the weak convergence of empirical processes Dragan Radulović Department of Mathematics, Florida Atlantic University Marten Wegkamp Department of Mathematics & Department of Statistical

More information

Scalar multiplication and addition of sequences 9

Scalar multiplication and addition of sequences 9 8 Sequences 1.2.7. Proposition. Every subsequence of a convergent sequence (a n ) n N converges to lim n a n. Proof. If (a nk ) k N is a subsequence of (a n ) n N, then n k k for every k. Hence if ε >

More information

An Improved Sphere-Packing Bound for Finite-Length Codes over Symmetric Memoryless Channels

An Improved Sphere-Packing Bound for Finite-Length Codes over Symmetric Memoryless Channels POST-PRIT OF THE IEEE TRAS. O IFORMATIO THEORY, VOL. 54, O. 5, PP. 96 99, MAY 8 An Improved Sphere-Packing Bound for Finite-Length Codes over Symmetric Memoryless Channels Gil Wiechman Igal Sason Department

More information

Proof. We indicate by α, β (finite or not) the end-points of I and call

Proof. We indicate by α, β (finite or not) the end-points of I and call C.6 Continuous functions Pag. 111 Proof of Corollary 4.25 Corollary 4.25 Let f be continuous on the interval I and suppose it admits non-zero its (finite or infinite) that are different in sign for x tending

More information

Lebesgue-Radon-Nikodym Theorem

Lebesgue-Radon-Nikodym Theorem Lebesgue-Radon-Nikodym Theorem Matt Rosenzweig 1 Lebesgue-Radon-Nikodym Theorem In what follows, (, A) will denote a measurable space. We begin with a review of signed measures. 1.1 Signed Measures Definition

More information

MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016

MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 Lecture 14: Information Theoretic Methods Lecturer: Jiaming Xu Scribe: Hilda Ibriga, Adarsh Barik, December 02, 2016 Outline f-divergence

More information

Inequalities for the L 1 Deviation of the Empirical Distribution

Inequalities for the L 1 Deviation of the Empirical Distribution Inequalities for the L 1 Deviation of the Emirical Distribution Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, Marcelo J. Weinberger June 13, 2003 Abstract We derive bounds on the robability

More information

4. Conditional risk measures and their robust representation

4. Conditional risk measures and their robust representation 4. Conditional risk measures and their robust representation We consider a discrete-time information structure given by a filtration (F t ) t=0,...,t on our probability space (Ω, F, P ). The time horizon

More information

The Hilbert Transform and Fine Continuity

The Hilbert Transform and Fine Continuity Irish Math. Soc. Bulletin 58 (2006), 8 9 8 The Hilbert Transform and Fine Continuity J. B. TWOMEY Abstract. It is shown that the Hilbert transform of a function having bounded variation in a finite interval

More information

Entropy power inequality for a family of discrete random variables

Entropy power inequality for a family of discrete random variables 20 IEEE International Symposium on Information Theory Proceedings Entropy power inequality for a family of discrete random variables Naresh Sharma, Smarajit Das and Siddharth Muthurishnan School of Technology

More information

WE start with a general discussion. Suppose we have

WE start with a general discussion. Suppose we have 646 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 43, NO. 2, MARCH 1997 Minimax Redundancy for the Class of Memoryless Sources Qun Xie and Andrew R. Barron, Member, IEEE Abstract Let X n = (X 1 ; 111;Xn)be

More information

A SYNOPSIS OF HILBERT SPACE THEORY

A SYNOPSIS OF HILBERT SPACE THEORY A SYNOPSIS OF HILBERT SPACE THEORY Below is a summary of Hilbert space theory that you find in more detail in any book on functional analysis, like the one Akhiezer and Glazman, the one by Kreiszig or

More information

On Bayes Risk Lower Bounds

On Bayes Risk Lower Bounds Journal of Machine Learning Research 17 (2016) 1-58 Submitted 4/16; Revised 10/16; Published 12/16 On Bayes Risk Lower Bounds Xi Chen Stern School of Business New York University New York, NY 10012, USA

More information

RÉNYI DIVERGENCE AND THE CENTRAL LIMIT THEOREM. 1. Introduction

RÉNYI DIVERGENCE AND THE CENTRAL LIMIT THEOREM. 1. Introduction RÉNYI DIVERGENCE AND THE CENTRAL LIMIT THEOREM S. G. BOBKOV,4, G. P. CHISTYAKOV 2,4, AND F. GÖTZE3,4 Abstract. We explore properties of the χ 2 and more general Rényi Tsallis) distances to the normal law

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

The Poisson Channel with Side Information

The Poisson Channel with Side Information The Poisson Channel with Side Information Shraga Bross School of Enginerring Bar-Ilan University, Israel brosss@macs.biu.ac.il Amos Lapidoth Ligong Wang Signal and Information Processing Laboratory ETH

More information

Correlation Detection and an Operational Interpretation of the Rényi Mutual Information

Correlation Detection and an Operational Interpretation of the Rényi Mutual Information Correlation Detection and an Operational Interpretation of the Rényi Mutual Information Masahito Hayashi 1, Marco Tomamichel 2 1 Graduate School of Mathematics, Nagoya University, and Centre for Quantum

More information

Lecture 4 Lebesgue spaces and inequalities

Lecture 4 Lebesgue spaces and inequalities Lecture 4: Lebesgue spaces and inequalities 1 of 10 Course: Theory of Probability I Term: Fall 2013 Instructor: Gordan Zitkovic Lecture 4 Lebesgue spaces and inequalities Lebesgue spaces We have seen how

More information

x log x, which is strictly convex, and use Jensen s Inequality:

x log x, which is strictly convex, and use Jensen s Inequality: 2. Information measures: mutual information 2.1 Divergence: main inequality Theorem 2.1 (Information Inequality). D(P Q) 0 ; D(P Q) = 0 iff P = Q Proof. Let ϕ(x) x log x, which is strictly convex, and

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 3, MARCH

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 3, MARCH IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 57, NO 3, MARCH 2011 1587 Universal and Composite Hypothesis Testing via Mismatched Divergence Jayakrishnan Unnikrishnan, Member, IEEE, Dayu Huang, Student

More information

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011 LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS S. G. Bobkov and F. L. Nazarov September 25, 20 Abstract We study large deviations of linear functionals on an isotropic

More information

5 Measure theory II. (or. lim. Prove the proposition. 5. For fixed F A and φ M define the restriction of φ on F by writing.

5 Measure theory II. (or. lim. Prove the proposition. 5. For fixed F A and φ M define the restriction of φ on F by writing. 5 Measure theory II 1. Charges (signed measures). Let (Ω, A) be a σ -algebra. A map φ: A R is called a charge, (or signed measure or σ -additive set function) if φ = φ(a j ) (5.1) A j for any disjoint

More information

Guesswork Subject to a Total Entropy Budget

Guesswork Subject to a Total Entropy Budget Guesswork Subject to a Total Entropy Budget Arman Rezaee, Ahmad Beirami, Ali Makhdoumi, Muriel Médard, and Ken Duffy arxiv:72.09082v [cs.it] 25 Dec 207 Abstract We consider an abstraction of computational

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

Continuity. Matt Rosenzweig

Continuity. Matt Rosenzweig Continuity Matt Rosenzweig Contents 1 Continuity 1 1.1 Rudin Chapter 4 Exercises........................................ 1 1.1.1 Exercise 1............................................. 1 1.1.2 Exercise

More information

Continued fractions for complex numbers and values of binary quadratic forms

Continued fractions for complex numbers and values of binary quadratic forms arxiv:110.3754v1 [math.nt] 18 Feb 011 Continued fractions for complex numbers and values of binary quadratic forms S.G. Dani and Arnaldo Nogueira February 1, 011 Abstract We describe various properties

More information

Channel Dispersion and Moderate Deviations Limits for Memoryless Channels

Channel Dispersion and Moderate Deviations Limits for Memoryless Channels Channel Dispersion and Moderate Deviations Limits for Memoryless Channels Yury Polyanskiy and Sergio Verdú Abstract Recently, Altug and Wagner ] posed a question regarding the optimal behavior of the probability

More information

ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016

ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture : Mutual Information Method Lecturer: Yihong Wu Scribe: Jaeho Lee, Mar, 06 Ed. Mar 9 Quick review: Assouad s lemma

More information

Partial cubes: structures, characterizations, and constructions

Partial cubes: structures, characterizations, and constructions Partial cubes: structures, characterizations, and constructions Sergei Ovchinnikov San Francisco State University, Mathematics Department, 1600 Holloway Ave., San Francisco, CA 94132 Abstract Partial cubes

More information

Divergences, surrogate loss functions and experimental design

Divergences, surrogate loss functions and experimental design Divergences, surrogate loss functions and experimental design XuanLong Nguyen University of California Berkeley, CA 94720 xuanlong@cs.berkeley.edu Martin J. Wainwright University of California Berkeley,

More information

Universal Estimation of Divergence for Continuous Distributions via Data-Dependent Partitions

Universal Estimation of Divergence for Continuous Distributions via Data-Dependent Partitions Universal Estimation of for Continuous Distributions via Data-Dependent Partitions Qing Wang, Sanjeev R. Kulkarni, Sergio Verdú Department of Electrical Engineering Princeton University Princeton, NJ 8544

More information

Context tree models for source coding

Context tree models for source coding Context tree models for source coding Toward Non-parametric Information Theory Licence de droits d usage Outline Lossless Source Coding = density estimation with log-loss Source Coding and Universal Coding

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Concentration Inequalities for Density Functionals. Shashank Singh

Concentration Inequalities for Density Functionals. Shashank Singh Concentration Inequalities for Density Functionals by Shashank Singh Submitted to the Department of Mathematical Sciences in partial fulfillment of the requirements for the degree of Master of Science

More information

2019 Spring MATH2060A Mathematical Analysis II 1

2019 Spring MATH2060A Mathematical Analysis II 1 2019 Spring MATH2060A Mathematical Analysis II 1 Notes 1. CONVEX FUNCTIONS First we define what a convex function is. Let f be a function on an interval I. For x < y in I, the straight line connecting

More information

6.1 Variational representation of f-divergences

6.1 Variational representation of f-divergences ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 6: Variational representation, HCR and CR lower bounds Lecturer: Yihong Wu Scribe: Georgios Rovatsos, Feb 11, 2016

More information

Chapter 1 Preliminaries

Chapter 1 Preliminaries Chapter 1 Preliminaries 1.1 Conventions and Notations Throughout the book we use the following notations for standard sets of numbers: N the set {1, 2,...} of natural numbers Z the set of integers Q the

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

Set, functions and Euclidean space. Seungjin Han

Set, functions and Euclidean space. Seungjin Han Set, functions and Euclidean space Seungjin Han September, 2018 1 Some Basics LOGIC A is necessary for B : If B holds, then A holds. B A A B is the contraposition of B A. A is sufficient for B: If A holds,

More information

Chapter 8. General Countably Additive Set Functions. 8.1 Hahn Decomposition Theorem

Chapter 8. General Countably Additive Set Functions. 8.1 Hahn Decomposition Theorem Chapter 8 General Countably dditive Set Functions In Theorem 5.2.2 the reader saw that if f : X R is integrable on the measure space (X,, µ) then we can define a countably additive set function ν on by

More information

EVALUATIONS OF EXPECTED GENERALIZED ORDER STATISTICS IN VARIOUS SCALE UNITS

EVALUATIONS OF EXPECTED GENERALIZED ORDER STATISTICS IN VARIOUS SCALE UNITS APPLICATIONES MATHEMATICAE 9,3 (), pp. 85 95 Erhard Cramer (Oldenurg) Udo Kamps (Oldenurg) Tomasz Rychlik (Toruń) EVALUATIONS OF EXPECTED GENERALIZED ORDER STATISTICS IN VARIOUS SCALE UNITS Astract. We

More information

δ xj β n = 1 n Theorem 1.1. The sequence {P n } satisfies a large deviation principle on M(X) with the rate function I(β) given by

δ xj β n = 1 n Theorem 1.1. The sequence {P n } satisfies a large deviation principle on M(X) with the rate function I(β) given by . Sanov s Theorem Here we consider a sequence of i.i.d. random variables with values in some complete separable metric space X with a common distribution α. Then the sample distribution β n = n maps X

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

Math 118B Solutions. Charles Martin. March 6, d i (x i, y i ) + d i (y i, z i ) = d(x, y) + d(y, z). i=1

Math 118B Solutions. Charles Martin. March 6, d i (x i, y i ) + d i (y i, z i ) = d(x, y) + d(y, z). i=1 Math 8B Solutions Charles Martin March 6, Homework Problems. Let (X i, d i ), i n, be finitely many metric spaces. Construct a metric on the product space X = X X n. Proof. Denote points in X as x = (x,

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Pierpaolo De Blasi University of Turin 10th August 2007, BNR Workshop, Isaac Newton Intitute, Cambridge, UK Joint work with Giovanni Peccati (Université Paris VI) and

More information

Research Article On Some Improvements of the Jensen Inequality with Some Applications

Research Article On Some Improvements of the Jensen Inequality with Some Applications Hindawi Publishing Corporation Journal of Inequalities and Applications Volume 2009, Article ID 323615, 15 pages doi:10.1155/2009/323615 Research Article On Some Improvements of the Jensen Inequality with

More information

Rényi Information Dimension: Fundamental Limits of Almost Lossless Analog Compression

Rényi Information Dimension: Fundamental Limits of Almost Lossless Analog Compression IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 8, AUGUST 2010 3721 Rényi Information Dimension: Fundamental Limits of Almost Lossless Analog Compression Yihong Wu, Student Member, IEEE, Sergio Verdú,

More information

Simple Channel Coding Bounds

Simple Channel Coding Bounds ISIT 2009, Seoul, Korea, June 28 - July 3,2009 Simple Channel Coding Bounds Ligong Wang Signal and Information Processing Laboratory wang@isi.ee.ethz.ch Roger Colbeck Institute for Theoretical Physics,

More information

Journal of Inequalities in Pure and Applied Mathematics

Journal of Inequalities in Pure and Applied Mathematics Journal of Inequalities in Pure and Applied Mathematics http://jipamvueduau/ Volume 7, Issue 2, Article 59, 2006 ENTROPY LOWER BOUNDS RELATED TO A PROBLEM OF UNIVERSAL CODING AND PREDICTION FLEMMING TOPSØE

More information

A GENERAL CLASS OF LOWER BOUNDS ON THE PROBABILITY OF ERROR IN MULTIPLE HYPOTHESIS TESTING. Tirza Routtenberg and Joseph Tabrikian

A GENERAL CLASS OF LOWER BOUNDS ON THE PROBABILITY OF ERROR IN MULTIPLE HYPOTHESIS TESTING. Tirza Routtenberg and Joseph Tabrikian A GENERAL CLASS OF LOWER BOUNDS ON THE PROBABILITY OF ERROR IN MULTIPLE HYPOTHESIS TESTING Tirza Routtenberg and Joseph Tabrikian Department of Electrical and Computer Engineering Ben-Gurion University

More information

Part 2 Continuous functions and their properties

Part 2 Continuous functions and their properties Part 2 Continuous functions and their properties 2.1 Definition Definition A function f is continuous at a R if, and only if, that is lim f (x) = f (a), x a ε > 0, δ > 0, x, x a < δ f (x) f (a) < ε. Notice

More information

Nishant Gurnani. GAN Reading Group. April 14th, / 107

Nishant Gurnani. GAN Reading Group. April 14th, / 107 Nishant Gurnani GAN Reading Group April 14th, 2017 1 / 107 Why are these Papers Important? 2 / 107 Why are these Papers Important? Recently a large number of GAN frameworks have been proposed - BGAN, LSGAN,

More information

ECE 4400:693 - Information Theory

ECE 4400:693 - Information Theory ECE 4400:693 - Information Theory Dr. Nghi Tran Lecture 8: Differential Entropy Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43 Outline 1 Review: Entropy of discrete RVs 2 Differential

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

44 CHAPTER 2. BAYESIAN DECISION THEORY

44 CHAPTER 2. BAYESIAN DECISION THEORY 44 CHAPTER 2. BAYESIAN DECISION THEORY Problems Section 2.1 1. In the two-category case, under the Bayes decision rule the conditional error is given by Eq. 7. Even if the posterior densities are continuous,

More information

Estimating divergence functionals and the likelihood ratio by convex risk minimization

Estimating divergence functionals and the likelihood ratio by convex risk minimization Estimating divergence functionals and the likelihood ratio by convex risk minimization XuanLong Nguyen Dept. of Statistical Science Duke University xuanlong.nguyen@stat.duke.edu Martin J. Wainwright Dept.

More information