Generalised Pinsker Inequalities


Mark D. Reid, Australian National University, Canberra ACT, Australia (Mark.Reid@anu.edu.au)
Robert C. Williamson, Australian National University and NICTA, Canberra ACT, Australia (Bob.Williamson@anu.edu.au)

Abstract

We generalise the classical Pinsker inequality, which relates variational divergence to Kullback-Leibler divergence, in two ways: we consider arbitrary f-divergences in place of KL divergence, and we assume knowledge of a sequence of values of generalised variational divergences. We then develop a best possible inequality for this doubly generalised situation. Specialising our result to the classical case provides a new and tight explicit bound relating KL to variational divergence, solving a problem posed by Vajda some years ago. The solution relies on exploiting a connection between divergences and the Bayes risk of a learning problem via an integral representation.

1 Introduction

Divergences such as the Kullback-Leibler and variational divergence arise pervasively. They are a means of defining a notion of distance between two probability distributions. The question often arises: given knowledge of one, what can be said of the other? For all distributions $P$ and $Q$ on an arbitrary set, the classical Pinsker inequality relates the Kullback-Leibler divergence $\mathrm{KL}(P,Q)$ and the variational divergence $V(P,Q)$ by
$$\mathrm{KL}(P,Q) \ge \tfrac{1}{2}[V(P,Q)]^2. \qquad (1)$$
This simple classical bound is known not to be tight. Over the past several decades a number of refinements have been given; see Appendix A for a summary of past work. Vajda [31] posed the question of determining a tight lower bound on KL divergence in terms of variational divergence. This best possible Pinsker inequality takes the form
$$L(V) := \inf_{P,Q\colon V(P,Q)=V} \mathrm{KL}(P,Q), \qquad V \in [0,2). \qquad (2)$$
Recently Fedotov et al. [7] presented an implicit (parametric) solution in the form of the graph of the bound, $\{(V(t), L(t)) : t \in \mathbb{R}_+\}$, where
$$V(t) = t\Big(1 - \big(\coth t - \tfrac{1}{t}\big)^2\Big), \qquad L(t) = \ln\frac{t}{\sinh t} + t\coth t - \frac{t^2}{\sinh^2 t}.$$

One can generalise the notion of a Pinsker inequality in at least two ways: replace the KL divergence by a general f-divergence; and bound the f-divergence in terms of the known values of a sequence of generalised variational divergences (defined later in this paper) $V_{\pi_i}(P,Q)$, $i = 1,\ldots,n$, $\pi_i \in (0,1)$. In this paper we study this doubly generalised problem and provide a complete solution in terms of explicit, best possible bounds. The main result is given below as Theorem 6. Applying it to specific f-divergences gives the following corollary.

Corollary 1. Let $V = V(P,Q)$ denote the variational divergence between the distributions $P$ and $Q$, and similarly for the other divergences in Table 1 below. Then the following bounds for the divergences hold and are tight:
$$h^2(P,Q) \ge 2 - \sqrt{4 - V^2}; \qquad I(P,Q) \ge \tfrac{2-V}{4}\ln(2-V) + \tfrac{2+V}{4}\ln(2+V) - \ln 2;$$
$$T(P,Q) \ge \ln 2 - \tfrac{1}{2}\ln(4 - V^2); \qquad J(P,Q) \ge V\ln\frac{2+V}{2-V}; \qquad \chi^2(P,Q) \ge \llbracket V < 1\rrbracket\, V^2 + \llbracket V \ge 1\rrbracket\,\frac{V}{2-V};$$
$$\mathrm{KL}(P,Q) \ge \min_{\beta \in [V-2,\,2-V]}\left[\frac{2+V-\beta}{4}\ln\frac{2+V-\beta}{2-V-\beta} + \frac{2-V+\beta}{4}\ln\frac{2-V+\beta}{2+V+\beta}\right]; \qquad \Psi(P,Q) \ge \frac{8V^2}{4-V^2}.$$

The proof of the main result depends in an essential way on a learning theory perspective. We make use of an integral representation of f-divergences in terms of DeGroot's statistical information, the difference between a prior and posterior Bayes risk [4]. By using the relationship between the generalised variational divergence and the 0-1 misclassification loss we are able to use an elementary but somewhat intricate geometrical argument to obtain the result.

The rest of the paper is organised as follows. Section 2 collects background results upon which we rely. The main result of the paper is stated in Section 3 and its proof is presented in Section 4. Appendix A summarises previous work. The terms $\llbracket V < 1\rrbracket$ and $\llbracket V \ge 1\rrbracket$ are indicator functions and are defined below.
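Since the KL bound in Corollary 1 is a one-dimensional minimisation over $\beta$, it is easy to evaluate numerically. The following Python sketch is our own illustration (not from the paper) and assumes the reconstructed form of the bound given above; it evaluates the minimisation on a grid and compares the result with the classical Pinsker bound $V^2/2$.

```python
import numpy as np

def kl_lower_bound(V, grid=10001):
    """Best-possible lower bound on KL(P,Q) given V = V(P,Q) in [0,2),
    evaluated by brute-force minimisation over beta in [V-2, 2-V]."""
    betas = np.linspace(V - 2, 2 - V, grid)
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = (2 + V - betas) / 4 * np.log((2 + V - betas) / (2 - V - betas))
        t2 = (2 - V + betas) / 4 * np.log((2 - V + betas) / (2 + V + betas))
    return float(np.nanmin(t1 + t2))       # nanmin skips the degenerate endpoints

for V in [0.2, 0.5, 1.0, 1.5, 1.9]:
    print(V, kl_lower_bound(V), V**2 / 2)  # new explicit bound vs classical Pinsker
```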

2 Background Results and Notation

In this section we collect the notation and the background concepts and results we need for the main result.

2.1 Notational Conventions

The substantive objects are defined within the body of the paper. Here we collect elementary notation and the conventions we adopt throughout. We write $x \wedge y := \min(x,y)$, $x \vee y := \max(x,y)$, and $\llbracket p\rrbracket = 1$ if $p$ is true and $\llbracket p\rrbracket = 0$ otherwise. The generalised function $\delta$ is defined by $\int_a^b \delta(x) f(x)\,dx = f(0)$ when $f$ is continuous at $0$ and $a < 0 < b$. For convenience we define $\delta_c(x) := \delta(x - c)$. The real numbers are denoted $\mathbb{R}$, the non-negative reals $\mathbb{R}_+$. Sets are in calligraphic font: $\mathcal{X}$. Vectors are written in bold font: $\mathbf{a}, \boldsymbol{\alpha}, \mathbf{x} \in \mathbb{R}^m$. We will often have cause to take expectations $\mathbb{E}$ over random variables; we write such quantities in blackboard bold: $\mathbb{I}$, $\mathbb{L}$, etc. Lower bounds on quantities with an intrinsic lower bound (e.g. the Bayes optimal loss) are written with an underbar: $\underline{\mathbb{L}}$. Quantities related by double integration recur in this paper and we notate the starting point in lower case, the first integral in upper case, and the second integral in upper case with an overbar: $\gamma$, $\Gamma$, $\bar\Gamma$.

2.2 Csiszár f-divergences

The class of f-divergences [1, 3] provides a rich set of relations that can be used to measure the separation of distributions. An f-divergence is a function that measures the distance between a pair of distributions $P$ and $Q$ defined over a space $\mathcal{X}$ of observations. Traditionally, the f-divergence of $P$ from $Q$ is defined for any convex $f\colon (0,\infty) \to \mathbb{R}$ such that $f(1) = 0$. In this case, the f-divergence is
$$\mathbb{I}_f(P,Q) = \mathbb{E}_Q\!\left[f\!\left(\frac{dP}{dQ}\right)\right] = \int_{\mathcal{X}} f\!\left(\frac{dP}{dQ}\right) dQ \qquad (5)$$
when $P$ is absolutely continuous with respect to $Q$, and equal to $+\infty$ otherwise. (Liese and Miescke [18] give a definition that does not require absolute continuity.) All f-divergences are non-negative and zero when $P = Q$; that is, $\mathbb{I}_f(P,Q) \ge 0$ and $\mathbb{I}_f(P,P) = 0$ for all distributions $P$ and $Q$. In general, however, they are not metrics, since they are not necessarily symmetric (it need not hold that $\mathbb{I}_f(P,Q) = \mathbb{I}_f(Q,P)$ for all $P$ and $Q$) and they do not necessarily satisfy the triangle inequality.

Several well-known divergences correspond to specific choices of the function $f$ [1, 15]. One divergence central to this paper is the variational divergence $V(P,Q)$, obtained by setting $f(t) = |t-1|$ in Equation (5). It is the only f-divergence that is a true metric on the space of distributions over $\mathcal{X}$ [13], and it gets its name from its equivalent definition in the variational form
$$V(P,Q) = \|P - Q\|_1 := 2\sup_{A \subseteq \mathcal{X}} |P(A) - Q(A)|. \qquad (6)$$
(Some authors define $V$ without the factor of 2 above.) Furthermore, the variational divergence is one of a family of "primitive" or "simple" f-divergences discussed in Section 2.3. These are primitive in the sense that all other f-divergences can be expressed as a weighted sum of members of this family. Another well-known f-divergence is the Kullback-Leibler (KL) divergence $\mathrm{KL}(P,Q)$, obtained by setting $f(t) = t\ln t$ in Equation (5). Others are given in Table 1. As already mentioned in the introduction, the KL and variational divergences satisfy the classical Pinsker inequality, which states that for all distributions $P$ and $Q$ over some common space $\mathcal{X}$,
$$\mathrm{KL}(P,Q) \ge \tfrac{1}{2}[V(P,Q)]^2. \qquad (7)$$
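For a finite observation space the definition (5) reduces to a single sum. The following sketch is our own illustration (not part of the paper); the function name f_divergence is ours.

```python
import numpy as np

def f_divergence(p, q, f):
    """I_f(P,Q) = sum_x q(x) f(p(x)/q(x)) for discrete distributions
    given as arrays p, q (assumed strictly positive here for simplicity)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

kl = f_divergence(p, q, lambda t: t * np.log(t))     # Kullback-Leibler, f(t) = t ln t
var = f_divergence(p, q, lambda t: np.abs(t - 1))    # variational, f(t) = |t - 1|
print(kl, var, np.sum(np.abs(p - q)))                # var equals the L1 distance
```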
2.3 Integral Representations of f-divergences

The main tool in our proof of Theorem 6 is the integral representation of f-divergences, first articulated by Österreicher and Vajda [20] and Gutenbrunner [12]. They show that an f-divergence can be represented as a weighted integral of the simple divergence measures
$$V_\pi(P,Q) := \mathbb{I}_{f_\pi}(P,Q), \qquad (8)$$
where $f_\pi(t) := \min\{\pi, 1-\pi\} - \min\{\pi t, 1-\pi\}$ for $\pi \in (0,1)$.

Theorem 2. For any convex $f$ such that $f(1) = 0$, the f-divergence $\mathbb{I}_f$ can be expressed, for all distributions $P$ and $Q$, as
$$\mathbb{I}_f(P,Q) = \int_0^1 V_\pi(P,Q)\,\gamma_f(\pi)\,d\pi, \qquad (9)$$
where the generalised function $\gamma_f(\pi) := \frac{1}{\pi^3}\,f''\!\left(\frac{1-\pi}{\pi}\right)$.

Recently, this theorem has been shown to be a direct consequence of a generalised Taylor expansion for convex functions [17, 22]. Even when $f$ is not twice differentiable, the convexity of $f$ implies its continuity and so its right-hand derivative $f'_+$ exists. In this case, $\gamma_f$ is interpreted distributionally in terms of $df'_+$. For example, when $f(t) = |t-1|$ then $f''(t) = 2\delta(t-1)$ and so $\gamma_f(\pi) = \frac{2}{\pi^3}\delta\!\left(\frac{1-\pi}{\pi}-1\right) = 4\,\delta_{1/2}(\pi)$.

The divergences $V_\pi$ for $\pi \in (0,1)$ can be seen as a family of generalised variational divergences since, for any member of this family, $df'_{\pi,+}(t)$ is $\pi\,\delta\!\left(t - \frac{1-\pi}{\pi}\right)dt$ and so $\gamma_{f_\pi} = \delta_\pi$. Thus, for $\pi = \tfrac12$ we have $\gamma_{f_{1/2}} = \delta_{1/2}$; the $\gamma$ function for the variational divergence is four times this, and so by (9) we see that $V(P,Q) = 4\,V_{1/2}(P,Q)$.

Theorem 2 shows that knowledge of the values of $V_\pi(P,Q)$ for all $\pi \in (0,1)$ is sufficient to compute the value of $\mathbb{I}_f(P,Q)$ for any f-divergence, since the weight function $\gamma_f$ depends only on $f$, not on $P$ and $Q$. All of the generalised Pinsker bounds we derive are found by asking how knowledge of the values of a finite number of $V_{\pi_i}(P,Q)$ constrains the overall value of $\mathbb{I}_f(P,Q)$.
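Theorem 2 can be checked numerically for discrete distributions. The sketch below is our own illustration and assumes the reconstructed weight $\gamma_f(\pi) = \pi^{-3} f''((1-\pi)/\pi)$; it integrates (9) for the KL divergence with a midpoint rule and compares against the direct computation.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

def V_pi(prior):
    """Generalised variational divergence V_pi(P,Q) = I_{f_pi}(P,Q), where
    f_pi(t) = min(pi, 1-pi) - min(pi*t, 1-pi)."""
    f = lambda t: min(prior, 1 - prior) - min(prior * t, 1 - prior)
    return float(sum(qi * f(pi / qi) for pi, qi in zip(p, q)))

# Weight for KL: f(t) = t*log(t), f''(t) = 1/t, hence gamma(pi) = 1/(pi^2 (1-pi)).
gamma = lambda prior: 1.0 / (prior**2 * (1 - prior))

N = 20000
centres = (np.arange(N) + 0.5) / N                       # midpoints of a grid on (0,1)
integral = sum(V_pi(c) * gamma(c) for c in centres) / N  # midpoint rule for (9)
kl_direct = float(np.sum(p * np.log(p / q)))
print(integral, kl_direct)                               # should agree up to quadrature error
```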

Table 1: Divergences and their corresponding functions $f$ and weights $\gamma$; cf. [17, 27].

  Symbol              Divergence name              $f(t)$                                                 $\gamma(\pi)$
  $V(P,Q)$            Variational                  $|t-1|$                                                $4\,\delta_{1/2}(\pi)$
  $\mathrm{KL}(P,Q)$  Kullback-Leibler             $t\ln t$                                               $\frac{1}{\pi^2(1-\pi)}$
  $\Delta(P,Q)$       Triangular Discrimination    $\frac{(t-1)^2}{t+1}$                                  $8$
  $I(P,Q)$            Jensen-Shannon               $\frac{t}{2}\ln t - \frac{t+1}{2}\ln\frac{t+1}{2}$     $\frac{1}{2\pi(1-\pi)}$
  $T(P,Q)$            Arithmetic-Geometric Mean    $\frac{t+1}{2}\ln\frac{t+1}{2\sqrt{t}}$                $\frac{\pi^2+(1-\pi)^2}{4\pi^2(1-\pi)^2}$
  $J(P,Q)$            Jeffreys                     $(t-1)\ln t$                                           $\frac{1}{\pi^2(1-\pi)^2}$
  $h^2(P,Q)$          Hellinger                    $(\sqrt{t}-1)^2$                                       $\frac{1}{2}[\pi(1-\pi)]^{-3/2}$
  $\chi^2(P,Q)$       Pearson $\chi^2$             $(t-1)^2$                                              $\frac{2}{\pi^3}$
  $\Psi(P,Q)$         Symmetric $\chi^2$           $\frac{(t-1)^2(t+1)}{t}$                               $\frac{2}{\pi^3}+\frac{2}{(1-\pi)^3}$

Topsøe [27] calls $C(P,Q) = 2I(P,Q)$ the capacitory discrimination and $2T(P,Q)$ the dual capacitory discrimination. Several of the above divergences are symmetrised versions of others. For example,
$$T(P,Q) = \tfrac12\big[\mathrm{KL}\big(\tfrac{P+Q}{2},P\big) + \mathrm{KL}\big(\tfrac{P+Q}{2},Q\big)\big], \quad I(P,Q) = \tfrac12\big[\mathrm{KL}\big(P,\tfrac{P+Q}{2}\big) + \mathrm{KL}\big(Q,\tfrac{P+Q}{2}\big)\big],$$
$$J(P,Q) = \mathrm{KL}(P,Q) + \mathrm{KL}(Q,P), \quad \text{and} \quad \Psi(P,Q) = \chi^2(P,Q) + \chi^2(Q,P).$$

Table 1 summarises the weight functions $\gamma_f$ for a number of f-divergences that appear in the literature. These are used in the proof of the specific bounds in Corollary 1. Before we can prove the main result, we need to establish some properties of the generalised variational divergences. In particular, we will make use of their relationship to Bayes risks for 0-1 loss.

2.4 Divergence and Risk

Let $\mathbb{L}(\pi, P, Q)$ denote the 0-1 Bayes risk for a classification problem in which observations are drawn from $\mathcal{X}$ using the mixture distribution $M = \pi P + (1-\pi)Q$ and each observation $x \in \mathcal{X}$ is assigned a positive label with probability $\eta(x) := \pi\,\frac{dP}{dM}(x)$. If $r = r(x) \in \{0,1\}$ is a label prediction for a particular $x \in \mathcal{X}$, the expected 0-1 loss for that observation is
$$L(r, \pi, p, q) = (1-\pi)\,q\,\llbracket r = 1\rrbracket + \pi\,p\,\llbracket r = 0\rrbracket, \qquad (10)$$
where $q = \frac{dQ}{dM}(x)$ and $p = \frac{dP}{dM}(x)$ are densities. Thus, the full expected 0-1 loss of a predictor $r\colon \mathcal{X} \to \{0,1\}$ is given by
$$\mathbb{L}(r, \pi, P, Q) := \mathbb{E}_M\big[L(r(x), \pi, p(x), q(x))\big] \qquad (11)$$
and it is well known (e.g., [5]) that its Bayes risk is obtained by the Bayes optimal predictor $r^*(x) := \llbracket \eta(x) \ge \tfrac12\rrbracket$. That is,
$$\mathbb{L}(\pi, P, Q) := \inf_r \mathbb{L}(r, \pi, P, Q) = \mathbb{L}(r^*, \pi, P, Q), \qquad (12)$$
where the infimum is taken over all $M$-measurable predictors $r\colon \mathcal{X} \to \{0,1\}$. So, by the definition of $\eta(x)$, and noting that $\eta \ge \tfrac12$ iff $\pi p \ge \tfrac{\pi p + (1-\pi)q}{2}$, which holds iff $\pi p \ge (1-\pi)q$, we see that the 0-1 Bayes risk can be expressed as
$$\mathbb{L}(\pi, P, Q) = \mathbb{E}_M\big[(1-\pi)q\,\llbracket \eta \ge \tfrac12\rrbracket + \pi p\,\llbracket \eta < \tfrac12\rrbracket\big] = (1-\pi)\,\mathbb{E}_Q\big[\llbracket \pi p \ge (1-\pi)q\rrbracket\big] + \pi\,\mathbb{E}_P\big[\llbracket \pi p < (1-\pi)q\rrbracket\big]. \qquad (13)$$
We now observe that
$$q\,f_\pi(p/q) = \big(\pi\wedge(1-\pi)\big)\,q - \min\{\pi p,\ (1-\pi)q\},$$
and so, by noting that $\mathbb{E}_Q[f_\pi(dP/dQ)] = \mathbb{E}_M[q\,f_\pi(p/q)]$, we have established the following lemma.

Lemma 3. For all $\pi \in (0,1)$ and all distributions $P$ and $Q$, the generalised variational divergence satisfies
$$V_\pi(P,Q) = \pi\wedge(1-\pi) - \mathbb{L}(\pi, P, Q). \qquad (14)$$

Thus, the value of $V_\pi(P,Q)$ can be understood via the 0-1 Bayes risk for a classification problem with label-conditional distributions $P$ and $Q$ and prior probability $\pi$ for the positive class. This relationship between f-divergence and Bayes risk is not new. It was established in a more general setting by Österreicher and Vajda [20], who note that the right-hand side of (14) is the statistical information for 0-1 loss, and later by Nguyen et al. [19].
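Lemma 3 is easy to verify numerically. The following sketch is our own check of the reconstructed identity $V_\pi(P,Q) = \pi\wedge(1-\pi) - \mathbb{L}(\pi,P,Q)$ for a pair of discrete distributions.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # P
q = np.array([0.4, 0.4, 0.2])   # Q

def bayes_risk(prior):
    """0-1 Bayes risk L(pi,P,Q) = E_M[min(pi*p, (1-pi)*q)]; for discrete P, Q
    with densities taken relative to M this is just the sum of pointwise minima."""
    return float(np.sum(np.minimum(prior * p, (1 - prior) * q)))

def V_pi_direct(prior):
    """V_pi(P,Q) = I_{f_pi}(P,Q) computed straight from definition (5)."""
    f = lambda t: min(prior, 1 - prior) - min(prior * t, 1 - prior)
    return float(np.sum(q * np.array([f(t) for t in p / q])))

for prior in [0.1, 0.25, 0.5, 0.8]:
    lhs = V_pi_direct(prior)
    rhs = min(prior, 1 - prior) - bayes_risk(prior)
    print(prior, lhs, rhs)      # the two columns should match
```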

2.5 Concavity of 0-1 Bayes Risk Curves

For a given pair of distributions $P$ and $Q$, the set of values of $\mathbb{L}(\pi, P, Q)$ as $\pi$ varies over $[0,1]$ can be visualised as a curve, as in Figure 2.

Lemma 4. For all distributions $P$ and $Q$, the function $\pi \mapsto \mathbb{L}(\pi, P, Q)$ is concave.

Proof: By (11) we have $\mathbb{L}(\pi,P,Q) = \mathbb{E}_M[L(r^*, \pi, p, q)]$. Observe that
$$L(r^*, \pi, p, q) = (1-\pi)q\,\llbracket \eta \ge \tfrac12\rrbracket + \pi p\,\llbracket \eta < \tfrac12\rrbracket = \min\{(1-\pi)q,\ \pi p\}$$
and so, for any $p, q$, it is the minimum of two linear functions of $\pi$ and thus concave in $\pi$. The full Bayes risk is the expectation of these functions, hence a non-negatively weighted combination of concave functions, and thus concave.

The tightness of the bounds in the main result of the next section depends on the following corollary of a result due to Torgersen [28]. It asserts that any appropriate concave function can be viewed as the 0-1 risk curve for some pair of distributions $P$ and $Q$. A proof can be found in [22, Section 6.3].

Corollary 5. Suppose $\mathcal{X}$ has a connected component. Let $\psi\colon [0,1] \to [0,1]$ be an arbitrary concave function such that $0 \le \psi(\pi) \le \pi\wedge(1-\pi)$ for all $\pi \in [0,1]$. Then there exist $P$ and $Q$ such that $\mathbb{L}(\pi, P, Q) = \psi(\pi)$ for all $\pi \in [0,1]$.

3 Main Result

We will now show how viewing f-divergences in terms of their weighted integral representation simplifies the problem of understanding the relationship between different divergences and leads, amongst other things, to an explicit formula for (2).

Fix a positive integer $n$ and consider a sequence $0 < \pi_1 < \pi_2 < \cdots < \pi_n < 1$. Suppose we sampled the value of $V_\pi(P,Q)$ at these discrete values of $\pi$. Since $\pi \mapsto \mathbb{L}(\pi,P,Q)$ is concave, a piecewise linear concave function passing through the points $\{(\pi_i, \mathbb{L}(\pi_i, P,Q))\}_{i=1}^n$ with suitably chosen slopes is guaranteed to be an upper bound on the risk curve; subtracting it from $\pi\wedge(1-\pi)$ therefore gives a lower bound on the curve $\pi \mapsto V_\pi(P,Q)$, and hence a lower bound on the f-divergence defined by a weight function $\gamma_f$. This observation forms the basis of the theorem stated below.

Theorem 6. For a positive integer $n$ consider a sequence $0 < \pi_1 < \pi_2 < \cdots < \pi_n < 1$. Let $\pi_0 := 0$ and $\pi_{n+1} := 1$, and for $i = 1,\ldots,n$ let $\tilde\psi_i := \pi_i\wedge(1-\pi_i) - V_{\pi_i}(P,Q)$; observe that consequently $\tilde\psi_0 = \tilde\psi_{n+1} = 0$. Let
$$\mathcal{A}_n := \left\{ \mathbf{a} = (a_1,\ldots,a_n) \in \mathbb{R}^n :\ \frac{\tilde\psi_{i+1} - \tilde\psi_i}{\pi_{i+1} - \pi_i} \le a_i \le \frac{\tilde\psi_i - \tilde\psi_{i-1}}{\pi_i - \pi_{i-1}},\ i = 1,\ldots,n \right\}. \qquad (15)$$
The set $\mathcal{A}_n$ defines the allowable slopes of a piecewise linear concave function majorising the risk curve $\pi \mapsto \mathbb{L}(\pi,P,Q)$ at each of $\pi_1,\ldots,\pi_n$. For $\mathbf{a} \in \mathcal{A}_n$, write $p_i(\pi) := a_i\pi + \tilde\psi_i - a_i\pi_i$ for the line through $(\pi_i,\tilde\psi_i)$ with slope $a_i$, and let
$$\tau_i := \frac{\tilde\psi_i - \tilde\psi_{i+1} + a_{i+1}\pi_{i+1} - a_i\pi_i}{a_{i+1} - a_i}, \qquad i = 1,\ldots,n-1, \qquad (16)$$
$$j := \max\{k \in \{1,\ldots,n\} : \tau_{k-1} < \tfrac12\} \quad (\text{with } \tau_0 := 0), \qquad (17)$$
$$\tilde\pi_i := \llbracket i < j\rrbracket\,\tau_i + \llbracket i = j\rrbracket\,\tfrac12 + \llbracket i > j\rrbracket\,\tau_{i-1}, \qquad i = 1,\ldots,n, \qquad (18)$$
$$\alpha_{\mathbf{a},i} := \llbracket i < j\rrbracket\,(1 - a_{i+1}) + \llbracket i \ge j\rrbracket\,(-1 - a_i), \qquad (19)$$
$$\beta_{\mathbf{a},i} := \llbracket i < j\rrbracket\,(a_{i+1}\pi_{i+1} - \tilde\psi_{i+1}) + \llbracket i \ge j\rrbracket\,(1 + a_i\pi_i - \tilde\psi_i) \qquad (20)$$
for $i = 0,\ldots,n$, where $\tilde\pi_0 := \frac{\tilde\psi_1 - a_1\pi_1}{1 - a_1}$ and $\tilde\pi_{n+1} := \frac{1 - \tilde\psi_n + a_n\pi_n}{1 + a_n}$ are the points at which $p_1$ and $p_n$ meet the graph of $\pi \mapsto \pi\wedge(1-\pi)$, and let $\gamma_f$ be the weight corresponding to $f$ given by (9). For arbitrary $\mathbb{I}_f$ and for all distributions $P$ and $Q$ the following bound holds; if in addition $\mathcal{X}$ contains a connected component, it is tight:
$$\mathbb{I}_f(P,Q) \ \ge\ \min_{\mathbf{a}\in\mathcal{A}_n} \sum_{i=0}^{n} \int_{\tilde\pi_i}^{\tilde\pi_{i+1}} \big(\alpha_{\mathbf{a},i}\,\pi + \beta_{\mathbf{a},i}\big)\,\gamma_f(\pi)\,d\pi \qquad (22)$$
$$=\ \min_{\mathbf{a}\in\mathcal{A}_n} \sum_{i=0}^{n} \Big[ \big(\alpha_{\mathbf{a},i}\tilde\pi_{i+1} + \beta_{\mathbf{a},i}\big)\Gamma_f(\tilde\pi_{i+1}) - \alpha_{\mathbf{a},i}\bar\Gamma_f(\tilde\pi_{i+1}) - \big(\alpha_{\mathbf{a},i}\tilde\pi_i + \beta_{\mathbf{a},i}\big)\Gamma_f(\tilde\pi_i) + \alpha_{\mathbf{a},i}\bar\Gamma_f(\tilde\pi_i) \Big], \qquad (23)$$
where $\Gamma_f(\pi) := \int_0^\pi \gamma_f(t)\,dt$ and $\bar\Gamma_f(\pi) := \int_0^\pi \Gamma_f(t)\,dt$.

Equation (23) follows from (22) by integration by parts. The remainder of the proof is in Section 4. Although (23) looks daunting, we observe that the constraints on $\mathbf{a}$ are convex (in fact they are a box constraint), and the objective is a relatively benign function of $\mathbf{a}$.

When $n = 1$ the result simplifies considerably. If in addition $\pi_1 = \tfrac12$ then by (14) we have $V_{1/2}(P,Q) = \tfrac14 V(P,Q)$. It is then a straightforward exercise to evaluate (23) explicitly, especially when $\gamma_f$ is symmetric about $\tfrac12$. The following theorem expresses the result in terms of $V(P,Q)$ for comparability with previous results. The result for $\mathrm{KL}(P,Q)$ is a best possible improvement on the classical Pinsker inequality.

Theorem 7. For any distributions $P, Q$ on $\mathcal{X}$, let $V := V(P,Q)$. Then the following bounds hold and, if in addition $\mathcal{X}$ has a connected component, are tight. When $\gamma_f$ is symmetric about $\tfrac12$ and convex,
$$\mathbb{I}_f(P,Q) \ \ge\ 2\left[\tfrac{V}{4}\,\Gamma_f\!\big(\tfrac12\big) + \bar\Gamma_f\!\big(\tfrac{2-V}{4}\big) - \bar\Gamma_f\!\big(\tfrac12\big)\right],$$
where $\Gamma_f$ and $\bar\Gamma_f$ are as in Theorem 6.

This theorem gives the first explicit representation of the optimal Pinsker bound (2). By plotting both, one can confirm that the two bounds (implicit and explicit) coincide; see Figure 1. A summary of existing results and their relationship to those presented here is given in Appendix A.
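The symmetric-case bound of Theorem 7 can be evaluated by quadrature without computing $\Gamma_f$ and $\bar\Gamma_f$ explicitly. The sketch below is our own illustration; it uses the Jeffreys weight from Table 1 and compares the numerical value with the closed form $V\ln\frac{2+V}{2-V}$ reconstructed in Corollary 1.

```python
import numpy as np

def symmetric_pinsker_bound(gamma, V, N=200000):
    """Theorem 7 style bound for a weight gamma symmetric about 1/2:
    I_f(P,Q) >= 2 * int_{(2-V)/4}^{1/2} (pi - (2-V)/4) * gamma(pi) dpi."""
    c = (2.0 - V) / 4.0
    pis = c + (np.arange(N) + 0.5) * (0.5 - c) / N     # midpoint rule on [c, 1/2]
    return 2.0 * float(np.sum((pis - c) * gamma(pis))) * (0.5 - c) / N

gamma_jeffreys = lambda pi: 1.0 / (pi**2 * (1 - pi)**2)

for V in [0.5, 1.0, 1.5]:
    numeric = symmetric_pinsker_bound(gamma_jeffreys, V)
    closed_form = V * np.log((2 + V) / (2 - V))        # Jeffreys bound from Corollary 1
    print(V, numeric, closed_form)
```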
4 Proof of Main Result

Proof (Theorem 6): This proof is driven by the duality between the family of variational divergences $V_\pi(P,Q)$ and the 0-1 Bayes risk $\mathbb{L}(\pi,P,Q)$ given in Lemma 3. Given distributions $P$ and $Q$, let $\phi(\pi) := V_\pi(P,Q) = \pi\wedge(1-\pi) - \psi(\pi)$, where $\psi(\pi) := \mathbb{L}(\pi,P,Q)$. We know that $\psi$ is non-negative and concave, satisfies $\psi(\pi) \le \pi\wedge(1-\pi)$, and thus $\psi(0) = \psi(1) = 0$, and that
$$\mathbb{I}_f(P,Q) = \int_0^1 \phi(\pi)\,\gamma_f(\pi)\,d\pi. \qquad (25)$$
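The objects $\phi$ and $\psi$ in this proof are concrete and easy to examine for discrete $P$ and $Q$. The following sketch (our own illustration) computes the risk curve $\psi(\pi) = \mathbb{L}(\pi,P,Q)$ on a grid and checks numerically that it is concave and that $\phi = \pi\wedge(1-\pi) - \psi$ is non-negative.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

def psi(prior):
    """Risk curve psi(pi) = L(pi, P, Q) = sum_x min(pi P(x), (1-pi) Q(x))."""
    return float(np.sum(np.minimum(prior * p, (1 - prior) * q)))

grid = np.linspace(0.0, 1.0, 2001)
values = np.array([psi(c) for c in grid])
tent = np.minimum(grid, 1 - grid)
phi = tent - values                      # phi(pi) = V_pi(P,Q)

# Concavity check: second differences of a concave function are <= 0.
second_diff = np.diff(values, 2)
print(phi.min() >= -1e-12, second_diff.max() <= 1e-12)
```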

[Figure 1: Lower bound on KL(P,Q) as a function of the variational divergence V(P,Q). Both the explicit bound and Fedotov et al.'s implicit bound are plotted.]

By (25), $\mathbb{I}_f(P,Q)$ is minimised by minimising $\phi$ over all $P, Q$ such that $\phi(\pi_i) = \pi_i\wedge(1-\pi_i) - \tilde\psi_i$. Since $\tilde\psi_i := \pi_i\wedge(1-\pi_i) - V_{\pi_i}(P,Q) = \psi(\pi_i)$, the minimisation problem for $\phi$ can be expressed in terms of $\psi$ as follows:
$$\text{Given } (\pi_i, \tilde\psi_i)_{i=1}^n, \text{ find the maximal } \psi\colon [0,1] \to [0,1] \qquad (26)$$
such that
$$\psi(\pi_i) = \tilde\psi_i, \quad i = 1,\ldots,n+1, \qquad (27)$$
$$\psi(\pi) \le \pi\wedge(1-\pi), \quad \pi \in [0,1], \qquad (28)$$
$$\psi \text{ is concave.} \qquad (29)$$
This will tell us the optimal $\phi$ to use, since optimising over $\psi$ is equivalent to optimising over $\mathbb{L}(\cdot, P, Q)$. Under the additional assumption on $\mathcal{X}$, Corollary 5 implies that for any $\psi$ satisfying (27), (28) and (29) there exist $P, Q$ such that $\mathbb{L}(\pi,P,Q) = \psi(\pi)$. This establishes the tightness of our bounds.

Let $\Psi$ be the set of piecewise linear concave functions on $[0,1]$ having at most $n+2$ segments such that every $\psi \in \Psi$ satisfies (27) and (28). We now show that in order to solve (26) it suffices to consider $\psi \in \Psi$.

If $g$ is a concave function on $\mathbb{R}$, let $\eth g(x) := \{s \in \mathbb{R} : g(y) \le g(x) + s(y-x),\ \forall y \in \mathbb{R}\}$ denote the sup-differential of $g$ at $x$, the obvious analogue of the sub-differential of convex functions [23]. Suppose $\psi$ is a general concave function satisfying (27) and (28). For $i = 1,\ldots,n$, let
$$G_i^\psi := \big\{ g_i^\psi\colon [0,1] \to \mathbb{R} : g_i^\psi \text{ is affine},\ g_i^\psi(\pi_i) = \tilde\psi_i, \text{ and the slope of } g_i^\psi \text{ lies in } \eth\psi(\pi_i) \big\}.$$
Observe that, by concavity, for all concave $\psi$ satisfying (27) and (28) and all $g \in \bigcup_{i=1}^n G_i^\psi$, we have $g(\pi) \ge \psi(\pi)$ for $\pi \in [0,1]$. Thus, given any such $\psi$, one can always construct
$$\bar\psi := \min(g_1^\psi, \ldots, g_n^\psi) \qquad (30)$$
such that $\bar\psi$ is concave, satisfies (27), and $\bar\psi(\pi) \ge \psi(\pi)$ for all $\pi \in [0,1]$. It remains to take account of (28). That is trivially done by setting
$$\psi^* := \min\big(\bar\psi,\ \pi\wedge(1-\pi)\big), \qquad (31)$$
which remains concave and piecewise linear, although with potentially additional linear segments.

Finally, the pointwise smallest concave $\psi$ satisfying (27) and (28) is the piecewise linear function connecting the points $(0,0), (\pi_1,\tilde\psi_1), (\pi_2,\tilde\psi_2), \ldots, (\pi_n,\tilde\psi_n), (1,0)$. Let $g\colon [0,1] \to [0,1]$ be this function, which can be written explicitly as
$$g(\pi) = \tilde\psi_i + \frac{\tilde\psi_{i+1} - \tilde\psi_i}{\pi_{i+1} - \pi_i}(\pi - \pi_i), \quad \pi \in [\pi_i, \pi_{i+1}],$$
where we have defined $\pi_0 := 0$, $\tilde\psi_0 := 0$, $\pi_{n+1} := 1$ and $\tilde\psi_{n+1} := 0$.

We now explicitly parametrise this family of functions. Let $p_i\colon [0,1] \to \mathbb{R}$ denote the affine segment whose graph passes through $(\pi_i, \tilde\psi_i)$, $i = 1,\ldots,n$. Write $p_i(\pi) = a_i\pi + b_i$. We know that $p_i(\pi_i) = \tilde\psi_i$ and thus
$$b_i = \tilde\psi_i - a_i\pi_i, \quad i = 1,\ldots,n. \qquad (32)$$
In order to determine the constraints on $a_i$, since $g$ is concave and minorises $\psi$, it suffices to consider only $(\pi_{i-1}, g(\pi_{i-1}))$ and $(\pi_{i+1}, g(\pi_{i+1}))$ for $i = 1,\ldots,n$. We have, for $i = 1,\ldots,n$,
$$p_i(\pi_{i-1}) \ge g(\pi_{i-1}) \iff a_i\pi_{i-1} + b_i \ge \tilde\psi_{i-1} \iff \tilde\psi_i - a_i(\pi_i - \pi_{i-1}) \ge \tilde\psi_{i-1} \iff a_i \le \frac{\tilde\psi_i - \tilde\psi_{i-1}}{\pi_i - \pi_{i-1}}. \qquad (33)$$
Similarly, for $i = 1,\ldots,n$,
$$p_i(\pi_{i+1}) \ge g(\pi_{i+1}) \iff a_i\pi_{i+1} + b_i \ge \tilde\psi_{i+1} \iff a_i \ge \frac{\tilde\psi_{i+1} - \tilde\psi_i}{\pi_{i+1} - \pi_i}. \qquad (34)$$

[Figure 2: Illustration of the construction of the optimal ψ = L(·, P, Q) in the proof of Theorem 6: the admissible region for lines with slopes in A_n, and the resulting piecewise linear ψ_a for a particular a. The optimal ψ is piecewise linear and satisfies ψ(π_i) = ψ̃_i, i = 1,…,n+1.]

We now determine the points at which $\psi^*$ defined by (30) and (31) changes slope. That occurs at the points $\pi$ at which $p_i(\pi) = p_{i+1}(\pi)$:
$$a_i\pi + \tilde\psi_i - a_i\pi_i = a_{i+1}\pi + \tilde\psi_{i+1} - a_{i+1}\pi_{i+1} \iff \pi = \frac{\tilde\psi_i - \tilde\psi_{i+1} + a_{i+1}\pi_{i+1} - a_i\pi_i}{a_{i+1} - a_i} =: \tau_i$$
for $i = 1,\ldots,n-1$. Thus $\psi^*(\pi) = p_i(\pi)$ for $\pi \in [\tau_{i-1}, \tau_i]$, $i = 1,\ldots,n$, where $\tau_0$ and $\tau_n$ denote the points at which $p_1$ and $p_n$ meet the constraint $\pi\wedge(1-\pi)$.

Let $\mathbf{a} = (a_1,\ldots,a_n)$. We explicitly denote the dependence of $\psi^*$ on $\mathbf{a}$ by writing $\psi_{\mathbf{a}}$. Let
$$\phi_{\mathbf{a}}(\pi) := \pi\wedge(1-\pi) - \psi_{\mathbf{a}}(\pi) = \alpha_{\mathbf{a},i}\,\pi + \beta_{\mathbf{a},i}, \quad \pi \in [\tilde\pi_i, \tilde\pi_{i+1}],\ i = 0,\ldots,n,$$
where $\mathbf{a} \in \mathcal{A}_n$ (see (15)) and $\tilde\pi_i$, $\alpha_{\mathbf{a},i}$ and $\beta_{\mathbf{a},i}$ are defined by (18), (19) and (20) respectively. The extra segment induced at index $j$ (see (17)) is needed since $\pi\wedge(1-\pi)$ has a slope change at $\pi = \tfrac12$. Thus in general $\phi_{\mathbf{a}}$ is piecewise linear with at most $n+1$ non-trivial segments (it vanishes outside $[\tilde\pi_0, \tilde\pi_{n+1}]$); if $\tau_k = \tfrac12$ for some $k \in \{1,\ldots,n-1\}$, then there is one fewer non-trivial segment. Thus
$$\big\{ \phi_{\mathbf{a}}\big|_{[\tilde\pi_0,\,\tilde\pi_{n+1}]} : \mathbf{a} \in \mathcal{A}_n \big\}$$
is the set of $\phi$ consistent with the constraints, where $\mathcal{A}_n$ is defined in (15). Substituting into (25), interchanging the order of summation and integration, and optimising, we have shown (22). The tightness has already been argued: under the additional assumption on $\mathcal{X}$ there is no slop in the argument above, since every $\psi$ satisfying the constraints in (26)-(29) is the Bayes risk function for some $P, Q$.
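The construction just completed can be mimicked numerically: build $\psi_{\mathbf{a}}$ as the minimum of the admissible lines and the tent $\pi\wedge(1-\pi)$, integrate $\phi_{\mathbf{a}}\gamma_f$, and search over the box $\mathcal{A}_n$. The sketch below is our own illustration of this procedure (a brute-force grid search, not the paper's closed form); for $n = 1$, $\pi_1 = \tfrac12$ and the KL weight it should land close to the explicit bound of Corollary 1.

```python
import numpy as np
from itertools import product

def theorem6_bound(pis, Vs, gamma, slope_grid=41, n_pts=20000):
    """Numerical sketch of the Theorem 6 construction: minimise
    int (min(pi,1-pi) - psi_a(pi)) * gamma(pi) dpi over admissible slopes a,
    where psi_a is the min of the tent and the lines through (pi_i, psi_i)."""
    pis = np.asarray(pis, float)
    psis = np.minimum(pis, 1 - pis) - np.asarray(Vs, float)
    ext_p = np.concatenate(([0.0], pis, [1.0]))
    ext_s = np.concatenate(([0.0], psis, [0.0]))
    lo = (ext_s[2:] - ext_s[1:-1]) / (ext_p[2:] - ext_p[1:-1])   # right-secant slopes
    hi = (ext_s[1:-1] - ext_s[:-2]) / (ext_p[1:-1] - ext_p[:-2]) # left-secant slopes
    grid = (np.arange(n_pts) + 0.5) / n_pts                      # midpoints on (0,1)
    tent = np.minimum(grid, 1 - grid)
    w = gamma(grid) / n_pts
    best = np.inf
    for a in product(*[np.linspace(l, h, slope_grid) for l, h in zip(lo, hi)]):
        lines = [ai * grid + si - ai * pi for ai, si, pi in zip(a, psis, pis)]
        psi_a = np.minimum(tent, np.min(lines, axis=0))
        best = min(best, float(np.sum((tent - psi_a) * w)))
    return best

# Example: n = 1, pi_1 = 1/2, V_{1/2} = V/4, KL weight gamma(pi) = 1/(pi^2 (1-pi)).
V = 1.0
bound = theorem6_bound([0.5], [V / 4], lambda c: 1 / (c**2 * (1 - c)))
print(bound, V**2 / 2)   # should exceed the classical Pinsker value V^2/2
```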

[Figure 3: The optimisation problem when n = 1. Given ψ(1/2), there are many risk curves consistent with it. The optimisation problem involves finding the maximal piecewise linear concave risk curve ψ ∈ Ψ, equivalently the φ = π∧(1−π) − ψ that minimises I_f. L and U are defined in the text.]

Proof (Theorem 7): In this case $n = 1$ and the optimal $\psi$ function will be piecewise linear, concave, and its graph will pass through $(\tfrac12, \psi(\tfrac12))$. Thus the optimal $\phi$ will be of the form
$$\phi(\pi) = \begin{cases} 0, & \pi \in [0,L] \cup [U,1] \\ (1-a)\pi - b, & \pi \in [L, \tfrac12] \\ -(1+a)\pi + 1 - b, & \pi \in [\tfrac12, U], \end{cases}$$
where $\tfrac a2 + b = \psi(\tfrac12)$, i.e. $b = \psi(\tfrac12) - \tfrac a2$, and $a \in [-2\psi(\tfrac12), 2\psi(\tfrac12)]$; see Figure 3. For the variational divergence, $\pi_1 = \tfrac12$ and thus by (14)
$$\psi(\tfrac12) = \tfrac12 - V_{1/2}(P,Q) = \tfrac{2-V}{4}, \qquad (35)$$
and so $\phi(\tfrac12) = \tfrac V4$. We can thus determine $L$ and $U$: $aL + b = L$ gives $L = \frac{\psi(1/2) - a/2}{1-a}$, and similarly $aU + b = 1 - U$ gives $U = \frac{1 - \psi(1/2) + a/2}{1+a}$. Thus
$$\mathbb{I}_f(P,Q) \ \ge\ \min_{a\in[-2\psi(\frac12),\,2\psi(\frac12)]} \left\{ \int_{\frac{\psi(1/2)-a/2}{1-a}}^{1/2}\!\Big[(1-a)\pi - \psi(\tfrac12) + \tfrac a2\Big]\gamma_f(\pi)\,d\pi + \int_{1/2}^{\frac{1-\psi(1/2)+a/2}{1+a}}\!\Big[1 - \psi(\tfrac12) + \tfrac a2 - (1+a)\pi\Big]\gamma_f(\pi)\,d\pi \right\}. \qquad (36)$$
If $\gamma_f$ is symmetric about $\tfrac12$ and convex, then the optimal $a = 0$. Thus in that case
$$\mathbb{I}_f(P,Q) \ \ge\ 2\int_{\psi(1/2)}^{1/2} \big(\pi - \psi(\tfrac12)\big)\gamma_f(\pi)\,d\pi \qquad (37)$$
$$= 2\Big[\big(\tfrac12 - \psi(\tfrac12)\big)\Gamma_f(\tfrac12) + \bar\Gamma_f\big(\psi(\tfrac12)\big) - \bar\Gamma_f(\tfrac12)\Big] = 2\Big[\tfrac V4\,\Gamma_f(\tfrac12) + \bar\Gamma_f\big(\tfrac{2-V}{4}\big) - \bar\Gamma_f(\tfrac12)\Big]. \qquad (38)$$
Combining the above with (35) leads to a range of Pinsker-style bounds for symmetric $\mathbb{I}_f$.

Jeffreys divergence. Since $J(P,Q) = \mathrm{KL}(P,Q) + \mathrm{KL}(Q,P)$ we have $\gamma_J(\pi) = \frac{1}{\pi^2(1-\pi)} + \frac{1}{\pi(1-\pi)^2} = \frac{1}{\pi^2(1-\pi)^2}$. As a check, $f(t) = (t-1)\ln t$, $f''(t) = \frac{t+1}{t^2}$, and so $\gamma_f(\pi) = \pi^{-3}f''\!\big(\tfrac{1-\pi}{\pi}\big) = \frac{1}{\pi^2(1-\pi)^2}$. Thus
$$J(P,Q) \ge 2\int_{\psi(1/2)}^{1/2} \frac{\pi - \psi(\tfrac12)}{\pi^2(1-\pi)^2}\,d\pi = 2\big(1 - 2\psi(\tfrac12)\big)\ln\frac{1-\psi(\tfrac12)}{\psi(\tfrac12)}.$$
Substituting $\psi(\tfrac12) = \tfrac{2-V}{4}$ gives $J(P,Q) \ge V\ln\frac{2+V}{2-V}$. Observe that this bound behaves like $V^2$ for small $V$, which is also what the traditional Pinsker inequality $\mathrm{KL}(P,Q) \ge V^2/2$ yields via $J(P,Q) = \mathrm{KL}(P,Q) + \mathrm{KL}(Q,P) \ge V^2/2 + V^2/2 = V^2$, but unlike that quadratic bound it diverges as $V \to 2$.

Jensen-Shannon divergence. Here $f(t) = \tfrac t2\ln t - \tfrac{t+1}{2}\ln\tfrac{t+1}{2}$ and thus $\gamma_f(\pi) = \pi^{-3}f''\!\big(\tfrac{1-\pi}{\pi}\big) = \frac{1}{2\pi(1-\pi)}$. Thus
$$I(P,Q) \ge 2\int_{\psi(1/2)}^{1/2} \frac{\pi - \psi(\tfrac12)}{2\pi(1-\pi)}\,d\pi = \ln 2 + \psi(\tfrac12)\ln\psi(\tfrac12) + \big(1-\psi(\tfrac12)\big)\ln\big(1-\psi(\tfrac12)\big).$$
Substituting $\psi(\tfrac12) = \tfrac{2-V}{4}$ leads to $I(P,Q) \ge \tfrac{2-V}{4}\ln(2-V) + \tfrac{2+V}{4}\ln(2+V) - \ln 2$.

Hellinger divergence. Here $f(t) = (\sqrt t - 1)^2$. Consequently $\gamma_f(\pi) = \pi^{-3}\cdot\tfrac12\big(\tfrac{1-\pi}{\pi}\big)^{-3/2} = \tfrac12[\pi(1-\pi)]^{-3/2}$ and thus
$$h^2(P,Q) \ge \int_{\psi(1/2)}^{1/2} \big(\pi - \psi(\tfrac12)\big)\,[\pi(1-\pi)]^{-3/2}\,d\pi = 2 - 4\sqrt{\psi(\tfrac12)\big(1-\psi(\tfrac12)\big)}.$$
Substituting $\psi(\tfrac12) = \tfrac{2-V}{4}$ gives $h^2(P,Q) \ge 2 - \sqrt{(2+V)(2-V)} = 2 - \sqrt{4 - V^2}$. For small $V$ this behaves like $V^2/4$.

Arithmetic-Geometric Mean divergence. In this case $f(t) = \tfrac{t+1}{2}\ln\tfrac{t+1}{2\sqrt t}$. Thus $f''(t) = \frac{t^2+1}{4t^2(t+1)}$ and $\gamma_f(\pi) = \tfrac14\big[\tfrac{1}{\pi^2} + \tfrac{1}{(1-\pi)^2}\big]$, and
$$T(P,Q) \ge 2\int_{\psi(1/2)}^{1/2} \big(\pi - \psi(\tfrac12)\big)\,\tfrac14\Big[\tfrac{1}{\pi^2} + \tfrac{1}{(1-\pi)^2}\Big]\,d\pi = -\tfrac12\ln\Big(4\psi(\tfrac12)\big(1-\psi(\tfrac12)\big)\Big).$$
Substituting $\psi(\tfrac12) = \tfrac{2-V}{4}$ gives $T(P,Q) \ge \ln 2 - \tfrac12\ln(4 - V^2)$.
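All of the symmetric bounds just derived are attained by a simple two-point pair with the prescribed variational divergence, namely $P = \big(\tfrac{2+V}{4}, \tfrac{2-V}{4}\big)$ and $Q$ its reversal, which realises the extremal risk curve $\min(\pi, 1-\pi, \tfrac{2-V}{4})$. The following sketch is our own check of the reconstructed closed forms against this pair.

```python
import numpy as np

def divergences(P, Q):
    """Direct computation of several symmetric f-divergences for discrete P, Q."""
    M = (P + Q) / 2
    KL = lambda A, B: float(np.sum(A * np.log(A / B)))
    return {
        "J":   KL(P, Q) + KL(Q, P),                      # Jeffreys
        "I":   0.5 * (KL(P, M) + KL(Q, M)),              # Jensen-Shannon
        "T":   0.5 * (KL(M, P) + KL(M, Q)),              # arithmetic-geometric mean
        "h2":  float(np.sum((np.sqrt(P) - np.sqrt(Q))**2)),
        "Psi": float(np.sum((P - Q)**2 / Q) + np.sum((Q - P)**2 / P)),
    }

V = 1.2
P = np.array([(2 + V) / 4, (2 - V) / 4])   # two-point pair with V(P,Q) = V
Q = P[::-1]

bounds = {
    "J":   V * np.log((2 + V) / (2 - V)),
    "I":   (2 - V) / 4 * np.log(2 - V) + (2 + V) / 4 * np.log(2 + V) - np.log(2),
    "T":   np.log(2) - 0.5 * np.log(4 - V**2),
    "h2":  2 - np.sqrt(4 - V**2),
    "Psi": 8 * V**2 / (4 - V**2),
}
for k, v in divergences(P, Q).items():
    print(k, v, bounds[k])   # each pair should agree, showing the bounds are attained
```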

Symmetric $\chi^2$-divergence. In this case $\Psi(P,Q) = \chi^2(P,Q) + \chi^2(Q,P)$ and thus $\gamma_f(\pi) = \frac{2}{\pi^3} + \frac{2}{(1-\pi)^3}$. As a check, from $f(t) = \frac{(t-1)^2(t+1)}{t}$ we have $f''(t) = 2 + \frac{2}{t^3}$, and $\gamma_f(\pi) = \pi^{-3}f''\!\big(\tfrac{1-\pi}{\pi}\big)$ gives the same result. Thus
$$\Psi(P,Q) \ge 2\int_{\psi(1/2)}^{1/2} \big(\pi - \psi(\tfrac12)\big)\Big[\tfrac{2}{\pi^3} + \tfrac{2}{(1-\pi)^3}\Big]\,d\pi = \frac{2}{\psi(\tfrac12)\big(1-\psi(\tfrac12)\big)} - 8.$$
Substituting $\psi(\tfrac12) = \tfrac{2-V}{4}$ gives $\Psi(P,Q) \ge \frac{8V^2}{4-V^2}$.

When $\gamma_f$ is not symmetric, one needs to use (36) instead of the simpler (38). We consider two cases.

$\chi^2$-divergence. Here $f(t) = (t-1)^2$ and so $f''(t) = 2$, hence $\gamma_f(\pi) = \frac{2}{\pi^3}$, which is not symmetric. Upon substituting $2/\pi^3$ for $\gamma_f$ in (36) and evaluating the integrals we obtain
$$\chi^2(P,Q) \ \ge\ \min_{a\in[-2\psi(\frac12),\,2\psi(\frac12)]} \underbrace{\left[\frac{(1-a)^2}{\psi(\tfrac12) - \tfrac a2} + \frac{(1+a)^2}{1 - \psi(\tfrac12) + \tfrac a2} - 4\right]}_{=:J(a,\psi(\frac12))}.$$
One can then solve $\partial_a J(a,\psi) = 0$ for $a$ and one obtains $a^* = 2\psi - 1$. Now $a^* \ge -2\psi$ only if $\psi \ge \tfrac14$. One can check that when $\psi \le \tfrac14$, $a \mapsto J(a,\psi)$ is monotonically increasing on $[-2\psi, 2\psi]$ and hence the minimum occurs at $a = -2\psi$. Thus the value of $a$ minimising $J(a,\psi)$ is $a^* = \llbracket \psi > \tfrac14\rrbracket(2\psi - 1) + \llbracket \psi \le \tfrac14\rrbracket(-2\psi)$. Substituting the optimal value of $a$ into $J(a,\psi)$ we obtain
$$J(a^*,\psi) = \llbracket \psi > \tfrac14\rrbracket\,(2 - 4\psi)^2 + \llbracket \psi \le \tfrac14\rrbracket\left[\frac{(1+2\psi)^2}{2\psi} - 3 - 2\psi\right].$$
Substituting $\psi = \psi(\tfrac12) = \tfrac{2-V}{4}$ and observing that $\llbracket V < 1\rrbracket = \llbracket \psi > \tfrac14\rrbracket$, we obtain
$$\chi^2(P,Q) \ge \llbracket V < 1\rrbracket\,V^2 + \llbracket V \ge 1\rrbracket\,\frac{V}{2-V}.$$
Observe that the bound diverges to $+\infty$ as $V \to 2$.

Kullback-Leibler divergence. In this case we have $f(t) = t\ln t$, thus $f''(t) = 1/t$, and consequently $\gamma_f(\pi) = \frac{1}{\pi^2(1-\pi)}$, which is clearly not symmetric. From (36) we obtain
$$\mathrm{KL}(P,Q) \ \ge\ \min_{a\in[-2\psi(\frac12),\,2\psi(\frac12)]} \left[\Big(1 - \psi(\tfrac12) - \tfrac a2\Big)\ln\frac{1 - \psi(\tfrac12) - \tfrac a2}{\psi(\tfrac12) - \tfrac a2} + \Big(\psi(\tfrac12) + \tfrac a2\Big)\ln\frac{\psi(\tfrac12) + \tfrac a2}{1 - \psi(\tfrac12) + \tfrac a2}\right].$$
Substituting $\psi(\tfrac12) = \tfrac{2-V}{4}$ gives $\mathrm{KL}(P,Q) \ge \min_{a\in[\frac{V-2}{2},\,\frac{2-V}{2}]} \delta_a(V)$, where
$$\delta_a(V) = \frac{2+V-2a}{4}\ln\frac{2+V-2a}{2-V-2a} + \frac{2-V+2a}{4}\ln\frac{2-V+2a}{2+V+2a}.$$
Setting $\beta := 2a$ yields the bound stated in Corollary 1.

5 Conclusion

We have generalised the classical Pinsker inequality and developed best possible bounds for the general situation. A special case of the result gives an explicit bound relating Kullback-Leibler divergence and variational divergence. The proof relied on an integral representation of f-divergences in terms of statistical information. Such representations are a powerful device as they identify the primitives underpinning general learning problems. These representations are further studied in [22].

A History of Pinsker Inequalities

Pinsker [21] presented the first bound relating $\mathrm{KL}(P,Q)$ to $V(P,Q)$: $\mathrm{KL} \ge V^2/2$, and it is now known by his name (or sometimes as the Pinsker-Csiszár-Kullback inequality, since Csiszár [3] presented another version and Kullback [14] showed $\mathrm{KL} \ge V^2/2 + V^4/36$). Much later Topsøe [26] showed $\mathrm{KL} \ge V^2/2 + V^4/36 + V^6/270$. Non-polynomial bounds are due to Vajda [31],
$$\mathrm{KL} \ge L_{\mathrm{Vajda}}(V) := \ln\frac{2+V}{2-V} - \frac{2V}{2+V},$$
and Toussaint [29], who showed $\mathrm{KL} \ge \max\big(L_{\mathrm{Vajda}}(V),\ V^2/2 + V^4/36 + V^8/288\big)$.

Care needs to be taken when comparing results from the literature as different definitions of the divergences exist. For example, Gibbs and Su [8] use a definition of $V$ that differs by a factor of 2 from ours. There are some isolated bounds relating $V$ to some other divergences, analogous to the classical Pinsker bound; Kumar and Chhina [15] present a summary as well as new bounds for a wide range of symmetric f-divergences by making assumptions on the likelihood ratio, $0 < r \le p(x)/q(x) \le R < \infty$ for all $x \in \mathcal{X}$. This line of reasoning has also been developed by Dragomir et al. [6] and Taneja [24, 25]. Topsøe [27] presents some infinite series representations for the capacitory discrimination in terms of the triangular discrimination which lead to inequalities between those two divergences. Liese and Miescke [18] give the inequality $V \le h\sqrt{4 - h^2}$ (which seems to be originally due to Le Cam [16]); when rearranged it corresponds exactly to the bound for $h^2$ in Theorem 7. Withers [32] also presents some inequalities between other particular pairs of divergences; his reasoning is also in terms of infinite series expansions.
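The historical bounds collected above can be compared numerically with the explicit bound of Theorem 7. The sketch below is our own illustration and uses the reconstructed forms of the historical inequalities quoted in this appendix; the best possible bound should dominate all of them.

```python
import numpy as np

def best_bound(V, grid=20001):
    """Explicit best-possible KL bound of Theorem 7 (reconstructed form)."""
    b = np.linspace(V - 2, 2 - V, grid)
    with np.errstate(divide="ignore", invalid="ignore"):
        vals = ((2 + V - b) / 4 * np.log((2 + V - b) / (2 - V - b))
                + (2 - V + b) / 4 * np.log((2 - V + b) / (2 + V + b)))
    return float(np.nanmin(vals))

def pinsker(V):    return V**2 / 2
def kullback(V):   return V**2 / 2 + V**4 / 36
def topsoe(V):     return V**2 / 2 + V**4 / 36 + V**6 / 270
def vajda(V):      return np.log((2 + V) / (2 - V)) - 2 * V / (2 + V)
def toussaint(V):  return max(vajda(V), V**2 / 2 + V**4 / 36 + V**8 / 288)

for V in [0.5, 1.0, 1.5, 1.9]:
    row = [f(V) for f in (pinsker, kullback, topsoe, vajda, toussaint, best_bound)]
    print(V, ["%.4f" % x for x in row])   # best_bound should dominate the others
```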
Arnold et al. [30] considered the case of $n = 1$ but arbitrary $\mathbb{I}_f$; that is, they bound an arbitrary f-divergence in terms of the variational divergence. Their argument is similar to the geometric proof of Theorem 6. They do not compute any of the explicit bounds in Theorem 7, except that they state a bound for $\chi^2(P,Q)$ in terms of $V$ which is looser than the corresponding bound in Corollary 1. Gilardoni [9] showed, via an intricate argument, that if $f''(1)$ exists then $\mathbb{I}_f \ge \tfrac{f''(1)}{2}V^2$. He also showed some fourth-order inequalities of the form $\mathbb{I}_f \ge c_{2,f}V^2 + c_{4,f}V^4$, where the constants depend on the behaviour of $f$ at $1$ in a complex way. Gilardoni [10, 11] presented a completely different approach which obtains many of the results of Theorem 7. (We were unaware of these two papers until completing the results presented in the main paper.) Gilardoni [11] improved Vajda's

bound slightly. Gilardoni [10, 11] also presented a general tight lower bound for $\mathbb{I}_f = \mathbb{I}_f(P,Q)$ in terms of $V = V(P,Q)$ which is difficult to evaluate explicitly in general; it is expressed in terms of the function $g(u) = u f'(u) - f(u)$ and its one-sided inverses $g_L$ and $g_R$ (where $g_R[g(u)] = u$ for $u \ge 1$ and $g_L[g(u)] = u$ for $u \le 1$). He presented a new parametric form for $\mathbb{I}_f = \mathrm{KL}$ in terms of Lambert's W function. In general, the result is analogous to that of Fedotov et al. [7] in that it is in a parametric form which, if one wishes to evaluate it for a particular $V$, requires a one-dimensional numerical search as complex as (2). However, when $f$ is such that $\mathbb{I}_f$ is symmetric, this simplifies to the elegant form
$$\mathbb{I}_f \ge \frac{2-V}{4}\,f\!\left(\frac{2+V}{2-V}\right) + \frac{2+V}{4}\,f\!\left(\frac{2-V}{2+V}\right).$$
He presented explicit special cases for $h^2$, $J$ and $I$ identical to the results in Theorem 7. It is not apparent how the approach of Gilardoni [10, 11] could be extended to more general situations such as that in Theorem 6 (i.e. $n > 1$). Bolley and Villani [2] considered weighted versions of the Pinsker inequalities for a weighted generalisation of the variational divergence in terms of the KL divergence that are related to transportation inequalities.

Acknowledgements

This work was supported by the Australian Research Council and by NICTA, an initiative of the Commonwealth Government under Backing Australia's Ability.

References

[1] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B (Methodological), 28:131-142, 1966.
[2] F. Bolley and C. Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des Sciences de Toulouse, 14(3):331-352, 2005.
[3] I. Csiszár. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2:299-318, 1967.
[4] M. H. DeGroot. Uncertainty, information, and sequential experiments. The Annals of Mathematical Statistics, 33(2):404-419, 1962.
[5] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
[6] S. S. Dragomir, V. Gluščević, and C. E. M. Pearce. Csiszár f-divergence, Ostrowski's inequality and mutual information. Nonlinear Analysis, 47:2375-2386, 2001.
[7] A. A. Fedotov, P. Harremoës, and F. Topsøe. Refinements of Pinsker's inequality. IEEE Transactions on Information Theory, 49(6):1491-1498, June 2003.
[8] Alison L. Gibbs and Francis Edward Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419-435, 2002.
[9] G. L. Gilardoni. On Pinsker's type inequalities and Csiszár's f-divergences. Part I: Second and fourth-order inequalities. arXiv:cs/0603097, April 2006.
[10] Gustavo L. Gilardoni. On the minimum f-divergence for given total variation. Comptes Rendus de l'Académie des Sciences, Paris, Série I, 343, 2006.
[11] Gustavo L. Gilardoni. On the relationship between symmetric f-divergence and total variation and an improved Vajda's inequality. Preprint, Departamento de Estatística, Universidade de Brasília, April 2006.
[12] Cornelius Gutenbrunner. On applications of the representation of f-divergences as averaged minimal Bayesian risk. In Transactions of the 11th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, pages 449-456, Dordrecht/Boston, 1992. Kluwer Academic Publishers.
[13] Mohammadali Khosravifard, Dariush Fooladivanda, and T. Aaron Gulliver. Confliction of the convexity and metric properties in f-divergences. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E90-A(9):1848-1853, 2007.
[14] S. Kullback. Lower bound for discrimination information in terms of variation. IEEE Transactions on Information Theory, 13:126-127, 1967. Correction, volume 16, p. 652, September 1970.
[15] P. Kumar and S. Chhina. A symmetric information divergence measure of the Csiszár's f-divergence class and its bounds. Computers and Mathematics with Applications, 49:575-588, 2005.
[16] Lucien Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer, 1986.
[17] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394-4412, 2006.
[18] Friedrich Liese and Klaus-J. Miescke. Statistical Decision Theory. Springer, New York, 2008.
[19] X. Nguyen, M. J. Wainwright, and M. I. Jordan. On distance measures, surrogate loss functions, and distributed detection. Technical Report 695, Department of Statistics, University of California, Berkeley, October 2005.
[20] F. Österreicher and I. Vajda. Statistical information and discrimination. IEEE Transactions on Information Theory, 39(3):1036-1039, 1993.

[21] M. S. Pinsker. Information and Information Stability of Random Variables and Processes. Holden-Day, 1964.
[22] Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments. arXiv:0901.0356, 89 pages, January 2009.
[23] R. T. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics and Physics. Princeton University Press, 1970.
[24] I. J. Taneja. Bounds on non-symmetric divergence measures in terms of symmetric divergence measures. arXiv preprint, 2005.
[25] I. J. Taneja. Refinement inequalities among symmetric divergence measures. arXiv preprint, April 2005.
[26] F. Topsøe. Bounds for entropy and divergence for distributions over a two-element set. Journal of Inequalities in Pure and Applied Mathematics, 2(2), 2001.
[27] Flemming Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46(4):1602-1609, 2000.
[28] E. N. Torgersen. Comparison of Statistical Experiments. Cambridge University Press, 1991.
[29] G. T. Toussaint. Probability of error, expected divergence and the affinity of several distributions. IEEE Transactions on Systems, Man and Cybernetics, 8:482-485, 1978.
[30] Andreas Unterreiter, Anton Arnold, Peter Markowich, and Giuseppe Toscani. On generalized Csiszár-Kullback inequalities. Monatshefte für Mathematik, 131:235-253, 2000.
[31] I. Vajda. Note on discrimination information and variation. IEEE Transactions on Information Theory, 16:771-773, 1970.
[32] Lang Withers. Some inequalities relating different measures of divergence between two probability distributions. IEEE Transactions on Information Theory, 45(5):1728-1735, 1999.


ON A CLASS OF PERIMETER-TYPE DISTANCES OF PROBABILITY DISTRIBUTIONS KYBERNETIKA VOLUME 32 (1996), NUMBER 4, PAGES 389-393 ON A CLASS OF PERIMETER-TYPE DISTANCES OF PROBABILITY DISTRIBUTIONS FERDINAND OSTERREICHER The class If, p G (l,oo], of /-divergences investigated

More information

Bayesian Inference Course, WTCN, UCL, March 2013

Bayesian Inference Course, WTCN, UCL, March 2013 Bayesian Course, WTCN, UCL, March 2013 Shannon (1948) asked how much information is received when we observe a specific value of the variable x? If an unlikely event occurs then one would expect the information

More information

Law of Cosines and Shannon-Pythagorean Theorem for Quantum Information

Law of Cosines and Shannon-Pythagorean Theorem for Quantum Information In Geometric Science of Information, 2013, Paris. Law of Cosines and Shannon-Pythagorean Theorem for Quantum Information Roman V. Belavkin 1 School of Engineering and Information Sciences Middlesex University,

More information

Quantifying information and uncertainty

Quantifying information and uncertainty Quantifying information and uncertainty Alexander Frankel and Emir Kamenica University of Chicago April 2018 Abstract We examine ways to measure the amount of information generated by a piece of news and

More information

INTRODUCTION TO REAL ANALYSIS II MATH 4332 BLECHER NOTES

INTRODUCTION TO REAL ANALYSIS II MATH 4332 BLECHER NOTES INTRODUCTION TO REAL ANALYSIS II MATH 4332 BLECHER NOTES You will be expected to reread and digest these typed notes after class, line by line, trying to follow why the line is true, for example how it

More information

The Distribution of Mixing Times in Markov Chains

The Distribution of Mixing Times in Markov Chains The Distribution of Mixing Times in Markov Chains Jeffrey J. Hunter School of Computing & Mathematical Sciences, Auckland University of Technology, Auckland, New Zealand December 2010 Abstract The distribution

More information

- Well-characterized problems, min-max relations, approximate certificates. - LP problems in the standard form, primal and dual linear programs

- Well-characterized problems, min-max relations, approximate certificates. - LP problems in the standard form, primal and dual linear programs LP-Duality ( Approximation Algorithms by V. Vazirani, Chapter 12) - Well-characterized problems, min-max relations, approximate certificates - LP problems in the standard form, primal and dual linear programs

More information

WE start with a general discussion. Suppose we have

WE start with a general discussion. Suppose we have 646 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 43, NO. 2, MARCH 1997 Minimax Redundancy for the Class of Memoryless Sources Qun Xie and Andrew R. Barron, Member, IEEE Abstract Let X n = (X 1 ; 111;Xn)be

More information

Non-parametric Estimation of Integral Probability Metrics

Non-parametric Estimation of Integral Probability Metrics Non-parametric Estimation of Integral Probability Metrics Bharath K. Sriperumbudur, Kenji Fukumizu 2, Arthur Gretton 3,4, Bernhard Schölkopf 4 and Gert R. G. Lanckriet Department of ECE, UC San Diego,

More information

AN AFFINE EMBEDDING OF THE GAMMA MANIFOLD

AN AFFINE EMBEDDING OF THE GAMMA MANIFOLD AN AFFINE EMBEDDING OF THE GAMMA MANIFOLD C.T.J. DODSON AND HIROSHI MATSUZOE Abstract. For the space of gamma distributions with Fisher metric and exponential connections, natural coordinate systems, potential

More information

Introduction to Statistical Learning Theory

Introduction to Statistical Learning Theory Introduction to Statistical Learning Theory In the last unit we looked at regularization - adding a w 2 penalty. We add a bias - we prefer classifiers with low norm. How to incorporate more complicated

More information

Generative MaxEnt Learning for Multiclass Classification

Generative MaxEnt Learning for Multiclass Classification Generative Maximum Entropy Learning for Multiclass Classification A. Dukkipati, G. Pandey, D. Ghoshdastidar, P. Koley, D. M. V. S. Sriram Dept. of Computer Science and Automation Indian Institute of Science,

More information

Sharpening Jensen s Inequality

Sharpening Jensen s Inequality Sharpening Jensen s Inequality arxiv:1707.08644v [math.st] 4 Oct 017 J. G. Liao and Arthur Berg Division of Biostatistics and Bioinformatics Penn State University College of Medicine October 6, 017 Abstract

More information

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989),

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989), Real Analysis 2, Math 651, Spring 2005 April 26, 2005 1 Real Analysis 2, Math 651, Spring 2005 Krzysztof Chris Ciesielski 1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer

More information

Some Properties of the Augmented Lagrangian in Cone Constrained Optimization

Some Properties of the Augmented Lagrangian in Cone Constrained Optimization MATHEMATICS OF OPERATIONS RESEARCH Vol. 29, No. 3, August 2004, pp. 479 491 issn 0364-765X eissn 1526-5471 04 2903 0479 informs doi 10.1287/moor.1040.0103 2004 INFORMS Some Properties of the Augmented

More information

SEMI-INNER PRODUCTS AND THE NUMERICAL RADIUS OF BOUNDED LINEAR OPERATORS IN HILBERT SPACES

SEMI-INNER PRODUCTS AND THE NUMERICAL RADIUS OF BOUNDED LINEAR OPERATORS IN HILBERT SPACES SEMI-INNER PRODUCTS AND THE NUMERICAL RADIUS OF BOUNDED LINEAR OPERATORS IN HILBERT SPACES S.S. DRAGOMIR Abstract. The main aim of this paper is to establish some connections that exist between the numerical

More information