Generalised Pinsker Inequalities


Mark D. Reid, Australian National University, Canberra ACT, Australia (Mark.Reid@anu.edu.au)
Robert C. Williamson, Australian National University and NICTA, Canberra ACT, Australia (Bob.Williamson@anu.edu.au)

Abstract

We generalise the classical Pinsker inequality, which relates variational divergence to Kullback-Leibler divergence, in two ways: we consider arbitrary f-divergences in place of KL divergence, and we assume knowledge of a sequence of values of generalised variational divergences. We then develop a best possible inequality for this doubly generalised situation. Specialising our result to the classical case provides a new and tight explicit bound relating KL to variational divergence, solving a problem posed by Vajda some years ago. The solution relies on exploiting a connection between divergences and the Bayes risk of a learning problem via an integral representation.

1 Introduction

Divergences such as the Kullback-Leibler and variational divergence arise pervasively. They are a means of defining a notion of distance between two probability distributions. The question often arises: given knowledge of one, what can be said of the other? For all distributions $P$ and $Q$ on an arbitrary set, the classical Pinsker inequality relates the Kullback-Leibler divergence $\mathrm{KL}(P,Q)$ and the variational divergence $V(P,Q)$ by
$$\mathrm{KL}(P,Q) \ge \tfrac{1}{2}[V(P,Q)]^2. \qquad (1)$$
This simple classical bound is known not to be tight. Over the past several decades a number of refinements have been given; see Appendix A for a summary of past work. Vajda [31] posed the question of determining a tight lower bound on KL divergence in terms of variational divergence. This best possible Pinsker inequality takes the form
$$L(V) := \inf_{P,Q\colon V(P,Q)=V} \mathrm{KL}(P,Q), \qquad V \in [0,2). \qquad (2)$$
Recently Fedotov et al. [7] presented an implicit (parametric) solution in the form of the graph of the bound, $\{(V(t), L(t)) : t \in \mathbb{R}_+\}$, where
$$V(t) = t\Big(1 - \big(\coth t - \tfrac{1}{t}\big)^2\Big), \qquad L(t) = \ln\frac{t}{\sinh t} + t\coth t - \frac{t^2}{\sinh^2 t}.$$

One can generalise the notion of a Pinsker inequality in at least two ways: replace the KL divergence by a general f-divergence; and bound the f-divergence in terms of the known values of a sequence of generalised variational divergences (defined later in this paper) $V_{\pi_i}(P,Q)$, $i = 1,\ldots,n$, $\pi_i \in (0,1)$. In this paper we study this doubly generalised problem and provide a complete solution in terms of explicit, best possible bounds. The main result is given below as Theorem 6. Applying it to specific f-divergences gives the following corollary.

Corollary 1. Let $V = V(P,Q)$ denote the variational divergence between the distributions $P$ and $Q$, and similarly for the other divergences in Table 1 below. Then the following bounds for the divergences hold and are tight:
$$h^2(P,Q) \ge 2 - \sqrt{4 - V^2}; \qquad I(P,Q) \ge \tfrac{2-V}{4}\ln(2-V) + \tfrac{2+V}{4}\ln(2+V) - \ln 2;$$
$$T(P,Q) \ge \ln 2 - \tfrac{1}{2}\ln(4 - V^2); \qquad J(P,Q) \ge V\ln\frac{2+V}{2-V}; \qquad \chi^2(P,Q) \ge \llbracket V < 1\rrbracket\, V^2 + \llbracket V \ge 1\rrbracket\,\frac{V}{2-V};$$
$$\mathrm{KL}(P,Q) \ge \min_{\beta \in [V-2,\,2-V]}\left[\frac{2+V-\beta}{4}\ln\frac{2+V-\beta}{2-V-\beta} + \frac{2-V+\beta}{4}\ln\frac{2-V+\beta}{2+V+\beta}\right]; \qquad \Psi(P,Q) \ge \frac{8V^2}{4-V^2}.$$

The proof of the main result depends in an essential way on a learning theory perspective. We make use of an integral representation of f-divergences in terms of DeGroot's statistical information, the difference between a prior and posterior Bayes risk [4]. By using the relationship between the generalised variational divergence and the 0-1 misclassification loss we are able to use an elementary but somewhat intricate geometrical argument to obtain the result.

The rest of the paper is organised as follows. Section 2 collects background results upon which we rely. The main result of the paper is stated in Section 3 and its proof is presented in Section 4. Appendix A summarises previous work. The terms $\llbracket V < 1\rrbracket$ and $\llbracket V \ge 1\rrbracket$ are indicator functions and are defined below.
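Since the KL bound in Corollary 1 is a one-dimensional minimisation over $\beta$, it is easy to evaluate numerically. The following Python sketch is our own illustration (not from the paper) and assumes the reconstructed form of the bound given above; it evaluates the minimisation on a grid and compares the result with the classical Pinsker bound $V^2/2$.

```python
import numpy as np

def kl_lower_bound(V, grid=10001):
    """Best-possible lower bound on KL(P,Q) given V = V(P,Q) in [0,2),
    evaluated by brute-force minimisation over beta in [V-2, 2-V]."""
    betas = np.linspace(V - 2, 2 - V, grid)
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = (2 + V - betas) / 4 * np.log((2 + V - betas) / (2 - V - betas))
        t2 = (2 - V + betas) / 4 * np.log((2 - V + betas) / (2 + V + betas))
    return float(np.nanmin(t1 + t2))       # nanmin skips the degenerate endpoints

for V in [0.2, 0.5, 1.0, 1.5, 1.9]:
    print(V, kl_lower_bound(V), V**2 / 2)  # new explicit bound vs classical Pinsker
```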

2 Background Results and Notation

In this section we collect the notation and the background concepts and results we need for the main result.

2.1 Notational Conventions

The substantive objects are defined within the body of the paper. Here we collect elementary notation and the conventions we adopt throughout. We write $x \wedge y := \min(x,y)$, $x \vee y := \max(x,y)$, and $\llbracket p\rrbracket = 1$ if $p$ is true and $\llbracket p\rrbracket = 0$ otherwise. The generalised function $\delta$ is defined by $\int_a^b \delta(x) f(x)\,dx = f(0)$ when $f$ is continuous at $0$ and $a < 0 < b$. For convenience we define $\delta_c(x) := \delta(x - c)$. The real numbers are denoted $\mathbb{R}$, the non-negative reals $\mathbb{R}_+$. Sets are in calligraphic font: $\mathcal{X}$. Vectors are written in bold font: $\mathbf{a}, \boldsymbol{\alpha}, \mathbf{x} \in \mathbb{R}^m$. We will often have cause to take expectations $\mathbb{E}$ over random variables; we write such quantities in blackboard bold: $\mathbb{I}$, $\mathbb{L}$, etc. Lower bounds on quantities with an intrinsic lower bound (e.g. the Bayes optimal loss) are written with an underbar: $\underline{\mathbb{L}}$. Quantities related by double integration recur in this paper and we notate the starting point in lower case, the first integral in upper case, and the second integral in upper case with an overbar: $\gamma$, $\Gamma$, $\bar\Gamma$.

2.2 Csiszár f-divergences

The class of f-divergences [1, 3] provides a rich set of relations that can be used to measure the separation of distributions. An f-divergence is a function that measures the distance between a pair of distributions $P$ and $Q$ defined over a space $\mathcal{X}$ of observations. Traditionally, the f-divergence of $P$ from $Q$ is defined for any convex $f\colon (0,\infty) \to \mathbb{R}$ such that $f(1) = 0$. In this case, the f-divergence is
$$\mathbb{I}_f(P,Q) = \mathbb{E}_Q\!\left[f\!\left(\frac{dP}{dQ}\right)\right] = \int_{\mathcal{X}} f\!\left(\frac{dP}{dQ}\right) dQ \qquad (5)$$
when $P$ is absolutely continuous with respect to $Q$, and equal to $+\infty$ otherwise. (Liese and Miescke [18] give a definition that does not require absolute continuity.) All f-divergences are non-negative and zero when $P = Q$; that is, $\mathbb{I}_f(P,Q) \ge 0$ and $\mathbb{I}_f(P,P) = 0$ for all distributions $P$ and $Q$. In general, however, they are not metrics, since they are not necessarily symmetric (it need not hold that $\mathbb{I}_f(P,Q) = \mathbb{I}_f(Q,P)$ for all $P$ and $Q$) and they do not necessarily satisfy the triangle inequality.

Several well-known divergences correspond to specific choices of the function $f$ [1, 15]. One divergence central to this paper is the variational divergence $V(P,Q)$, obtained by setting $f(t) = |t-1|$ in Equation (5). It is the only f-divergence that is a true metric on the space of distributions over $\mathcal{X}$ [13], and it gets its name from its equivalent definition in the variational form
$$V(P,Q) = \|P - Q\|_1 := 2\sup_{A \subseteq \mathcal{X}} |P(A) - Q(A)|. \qquad (6)$$
(Some authors define $V$ without the factor of 2 above.) Furthermore, the variational divergence is one of a family of "primitive" or "simple" f-divergences discussed in Section 2.3. These are primitive in the sense that all other f-divergences can be expressed as a weighted sum of members of this family. Another well-known f-divergence is the Kullback-Leibler (KL) divergence $\mathrm{KL}(P,Q)$, obtained by setting $f(t) = t\ln t$ in Equation (5). Others are given in Table 1. As already mentioned in the introduction, the KL and variational divergences satisfy the classical Pinsker inequality, which states that for all distributions $P$ and $Q$ over some common space $\mathcal{X}$,
$$\mathrm{KL}(P,Q) \ge \tfrac{1}{2}[V(P,Q)]^2. \qquad (7)$$
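For a finite observation space the definition (5) reduces to a single sum. The following sketch is our own illustration (not part of the paper); the function name f_divergence is ours.

```python
import numpy as np

def f_divergence(p, q, f):
    """I_f(P,Q) = sum_x q(x) f(p(x)/q(x)) for discrete distributions
    given as arrays p, q (assumed strictly positive here for simplicity)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

kl = f_divergence(p, q, lambda t: t * np.log(t))     # Kullback-Leibler, f(t) = t ln t
var = f_divergence(p, q, lambda t: np.abs(t - 1))    # variational, f(t) = |t - 1|
print(kl, var, np.sum(np.abs(p - q)))                # var equals the L1 distance
```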
2.3 Integral Representations of f-divergences

The main tool in our proof of Theorem 6 is the integral representation of f-divergences, first articulated by Österreicher and Vajda [20] and Gutenbrunner [12]. They show that an f-divergence can be represented as a weighted integral of the simple divergence measures
$$V_\pi(P,Q) := \mathbb{I}_{f_\pi}(P,Q), \qquad (8)$$
where $f_\pi(t) := \min\{\pi, 1-\pi\} - \min\{\pi t, 1-\pi\}$ for $\pi \in (0,1)$.

Theorem 2. For any convex $f$ such that $f(1) = 0$, the f-divergence $\mathbb{I}_f$ can be expressed, for all distributions $P$ and $Q$, as
$$\mathbb{I}_f(P,Q) = \int_0^1 V_\pi(P,Q)\,\gamma_f(\pi)\,d\pi, \qquad (9)$$
where the generalised function $\gamma_f(\pi) := \frac{1}{\pi^3}\,f''\!\left(\frac{1-\pi}{\pi}\right)$.

Recently, this theorem has been shown to be a direct consequence of a generalised Taylor expansion for convex functions [17, 22]. Even when $f$ is not twice differentiable, the convexity of $f$ implies its continuity and so its right-hand derivative $f'_+$ exists. In this case, $\gamma_f$ is interpreted distributionally in terms of $df'_+$. For example, when $f(t) = |t-1|$ then $f''(t) = 2\delta(t-1)$ and so $\gamma_f(\pi) = \frac{2}{\pi^3}\delta\!\left(\frac{1-\pi}{\pi}-1\right) = 4\,\delta_{1/2}(\pi)$.

The divergences $V_\pi$ for $\pi \in (0,1)$ can be seen as a family of generalised variational divergences since, for any member of this family, $df'_{\pi,+}(t)$ is $\pi\,\delta\!\left(t - \frac{1-\pi}{\pi}\right)dt$ and so $\gamma_{f_\pi} = \delta_\pi$. Thus, for $\pi = \tfrac12$ we have $\gamma_{f_{1/2}} = \delta_{1/2}$; the $\gamma$ function for the variational divergence is four times this, and so by (9) we see that $V(P,Q) = 4\,V_{1/2}(P,Q)$.

Theorem 2 shows that knowledge of the values of $V_\pi(P,Q)$ for all $\pi \in (0,1)$ is sufficient to compute the value of $\mathbb{I}_f(P,Q)$ for any f-divergence, since the weight function $\gamma_f$ depends only on $f$, not on $P$ and $Q$. All of the generalised Pinsker bounds we derive are found by asking how knowledge of the values of a finite number of $V_{\pi_i}(P,Q)$ constrains the overall value of $\mathbb{I}_f(P,Q)$.
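Theorem 2 can be checked numerically for discrete distributions. The sketch below is our own illustration and assumes the reconstructed weight $\gamma_f(\pi) = \pi^{-3} f''((1-\pi)/\pi)$; it integrates (9) for the KL divergence with a midpoint rule and compares against the direct computation.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

def V_pi(prior):
    """Generalised variational divergence V_pi(P,Q) = I_{f_pi}(P,Q), where
    f_pi(t) = min(pi, 1-pi) - min(pi*t, 1-pi)."""
    f = lambda t: min(prior, 1 - prior) - min(prior * t, 1 - prior)
    return float(sum(qi * f(pi / qi) for pi, qi in zip(p, q)))

# Weight for KL: f(t) = t*log(t), f''(t) = 1/t, hence gamma(pi) = 1/(pi^2 (1-pi)).
gamma = lambda prior: 1.0 / (prior**2 * (1 - prior))

N = 20000
centres = (np.arange(N) + 0.5) / N                       # midpoints of a grid on (0,1)
integral = sum(V_pi(c) * gamma(c) for c in centres) / N  # midpoint rule for (9)
kl_direct = float(np.sum(p * np.log(p / q)))
print(integral, kl_direct)                               # should agree up to quadrature error
```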

Table 1: Divergences and their corresponding functions $f$ and weights $\gamma$; cf. [17, 27].

  Symbol              Divergence name              $f(t)$                                                 $\gamma(\pi)$
  $V(P,Q)$            Variational                  $|t-1|$                                                $4\,\delta_{1/2}(\pi)$
  $\mathrm{KL}(P,Q)$  Kullback-Leibler             $t\ln t$                                               $\frac{1}{\pi^2(1-\pi)}$
  $\Delta(P,Q)$       Triangular Discrimination    $\frac{(t-1)^2}{t+1}$                                  $8$
  $I(P,Q)$            Jensen-Shannon               $\frac{t}{2}\ln t - \frac{t+1}{2}\ln\frac{t+1}{2}$     $\frac{1}{2\pi(1-\pi)}$
  $T(P,Q)$            Arithmetic-Geometric Mean    $\frac{t+1}{2}\ln\frac{t+1}{2\sqrt{t}}$                $\frac{\pi^2+(1-\pi)^2}{4\pi^2(1-\pi)^2}$
  $J(P,Q)$            Jeffreys                     $(t-1)\ln t$                                           $\frac{1}{\pi^2(1-\pi)^2}$
  $h^2(P,Q)$          Hellinger                    $(\sqrt{t}-1)^2$                                       $\frac{1}{2}[\pi(1-\pi)]^{-3/2}$
  $\chi^2(P,Q)$       Pearson $\chi^2$             $(t-1)^2$                                              $\frac{2}{\pi^3}$
  $\Psi(P,Q)$         Symmetric $\chi^2$           $\frac{(t-1)^2(t+1)}{t}$                               $\frac{2}{\pi^3}+\frac{2}{(1-\pi)^3}$

Topsøe [27] calls $C(P,Q) = 2I(P,Q)$ the capacitory discrimination and $2T(P,Q)$ the dual capacitory discrimination. Several of the above divergences are symmetrised versions of others. For example,
$$T(P,Q) = \tfrac12\big[\mathrm{KL}\big(\tfrac{P+Q}{2},P\big) + \mathrm{KL}\big(\tfrac{P+Q}{2},Q\big)\big], \quad I(P,Q) = \tfrac12\big[\mathrm{KL}\big(P,\tfrac{P+Q}{2}\big) + \mathrm{KL}\big(Q,\tfrac{P+Q}{2}\big)\big],$$
$$J(P,Q) = \mathrm{KL}(P,Q) + \mathrm{KL}(Q,P), \quad \text{and} \quad \Psi(P,Q) = \chi^2(P,Q) + \chi^2(Q,P).$$

Table 1 summarises the weight functions $\gamma_f$ for a number of f-divergences that appear in the literature. These are used in the proof of the specific bounds in Corollary 1. Before we can prove the main result, we need to establish some properties of the generalised variational divergences. In particular, we will make use of their relationship to Bayes risks for 0-1 loss.

2.4 Divergence and Risk

Let $\mathbb{L}(\pi, P, Q)$ denote the 0-1 Bayes risk for a classification problem in which observations are drawn from $\mathcal{X}$ using the mixture distribution $M = \pi P + (1-\pi)Q$ and each observation $x \in \mathcal{X}$ is assigned a positive label with probability $\eta(x) := \pi\,\frac{dP}{dM}(x)$. If $r = r(x) \in \{0,1\}$ is a label prediction for a particular $x \in \mathcal{X}$, the expected 0-1 loss for that observation is
$$L(r, \pi, p, q) = (1-\pi)\,q\,\llbracket r = 1\rrbracket + \pi\,p\,\llbracket r = 0\rrbracket, \qquad (10)$$
where $q = \frac{dQ}{dM}(x)$ and $p = \frac{dP}{dM}(x)$ are densities. Thus, the full expected 0-1 loss of a predictor $r\colon \mathcal{X} \to \{0,1\}$ is given by
$$\mathbb{L}(r, \pi, P, Q) := \mathbb{E}_M\big[L(r(x), \pi, p(x), q(x))\big] \qquad (11)$$
and it is well known (e.g., [5]) that its Bayes risk is obtained by the Bayes optimal predictor $r^*(x) := \llbracket \eta(x) \ge \tfrac12\rrbracket$. That is,
$$\mathbb{L}(\pi, P, Q) := \inf_r \mathbb{L}(r, \pi, P, Q) = \mathbb{L}(r^*, \pi, P, Q), \qquad (12)$$
where the infimum is taken over all $M$-measurable predictors $r\colon \mathcal{X} \to \{0,1\}$. So, by the definition of $\eta(x)$, and noting that $\eta \ge \tfrac12$ iff $\pi p \ge \tfrac{\pi p + (1-\pi)q}{2}$, which holds iff $\pi p \ge (1-\pi)q$, we see that the 0-1 Bayes risk can be expressed as
$$\mathbb{L}(\pi, P, Q) = \mathbb{E}_M\big[(1-\pi)q\,\llbracket \eta \ge \tfrac12\rrbracket + \pi p\,\llbracket \eta < \tfrac12\rrbracket\big] = (1-\pi)\,\mathbb{E}_Q\big[\llbracket \pi p \ge (1-\pi)q\rrbracket\big] + \pi\,\mathbb{E}_P\big[\llbracket \pi p < (1-\pi)q\rrbracket\big]. \qquad (13)$$
We now observe that
$$q\,f_\pi(p/q) = \big(\pi\wedge(1-\pi)\big)\,q - \min\{\pi p,\ (1-\pi)q\},$$
and so, by noting that $\mathbb{E}_Q[f_\pi(dP/dQ)] = \mathbb{E}_M[q\,f_\pi(p/q)]$, we have established the following lemma.

Lemma 3. For all $\pi \in (0,1)$ and all distributions $P$ and $Q$, the generalised variational divergence satisfies
$$V_\pi(P,Q) = \pi\wedge(1-\pi) - \mathbb{L}(\pi, P, Q). \qquad (14)$$

Thus, the value of $V_\pi(P,Q)$ can be understood via the 0-1 Bayes risk for a classification problem with label-conditional distributions $P$ and $Q$ and prior probability $\pi$ for the positive class. This relationship between f-divergence and Bayes risk is not new. It was established in a more general setting by Österreicher and Vajda [20], who note that the right-hand side of (14) is the statistical information for 0-1 loss, and later by Nguyen et al. [19].
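Lemma 3 is easy to verify numerically. The following sketch is our own check of the reconstructed identity $V_\pi(P,Q) = \pi\wedge(1-\pi) - \mathbb{L}(\pi,P,Q)$ for a pair of discrete distributions.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # P
q = np.array([0.4, 0.4, 0.2])   # Q

def bayes_risk(prior):
    """0-1 Bayes risk L(pi,P,Q) = E_M[min(pi*p, (1-pi)*q)]; for discrete P, Q
    with densities taken relative to M this is just the sum of pointwise minima."""
    return float(np.sum(np.minimum(prior * p, (1 - prior) * q)))

def V_pi_direct(prior):
    """V_pi(P,Q) = I_{f_pi}(P,Q) computed straight from definition (5)."""
    f = lambda t: min(prior, 1 - prior) - min(prior * t, 1 - prior)
    return float(np.sum(q * np.array([f(t) for t in p / q])))

for prior in [0.1, 0.25, 0.5, 0.8]:
    lhs = V_pi_direct(prior)
    rhs = min(prior, 1 - prior) - bayes_risk(prior)
    print(prior, lhs, rhs)      # the two columns should match
```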

2.5 Concavity of 0-1 Bayes Risk Curves

For a given pair of distributions $P$ and $Q$, the set of values of $\mathbb{L}(\pi, P, Q)$ as $\pi$ varies over $[0,1]$ can be visualised as a curve, as in Figure 2.

Lemma 4. For all distributions $P$ and $Q$, the function $\pi \mapsto \mathbb{L}(\pi, P, Q)$ is concave.

Proof: By (11) we have $\mathbb{L}(\pi,P,Q) = \mathbb{E}_M[L(r^*, \pi, p, q)]$. Observe that
$$L(r^*, \pi, p, q) = (1-\pi)q\,\llbracket \eta \ge \tfrac12\rrbracket + \pi p\,\llbracket \eta < \tfrac12\rrbracket = \min\{(1-\pi)q,\ \pi p\}$$
and so, for any $p, q$, it is the minimum of two linear functions of $\pi$ and thus concave in $\pi$. The full Bayes risk is the expectation of these functions, hence a non-negatively weighted combination of concave functions, and thus concave.

The tightness of the bounds in the main result of the next section depends on the following corollary of a result due to Torgersen [28]. It asserts that any appropriate concave function can be viewed as the 0-1 risk curve for some pair of distributions $P$ and $Q$. A proof can be found in [22, Section 6.3].

Corollary 5. Suppose $\mathcal{X}$ has a connected component. Let $\psi\colon [0,1] \to [0,1]$ be an arbitrary concave function such that $0 \le \psi(\pi) \le \pi\wedge(1-\pi)$ for all $\pi \in [0,1]$. Then there exist $P$ and $Q$ such that $\mathbb{L}(\pi, P, Q) = \psi(\pi)$ for all $\pi \in [0,1]$.

3 Main Result

We will now show how viewing f-divergences in terms of their weighted integral representation simplifies the problem of understanding the relationship between different divergences and leads, amongst other things, to an explicit formula for (2).

Fix a positive integer $n$ and consider a sequence $0 < \pi_1 < \pi_2 < \cdots < \pi_n < 1$. Suppose we sampled the value of $V_\pi(P,Q)$ at these discrete values of $\pi$. Since $\pi \mapsto \mathbb{L}(\pi,P,Q)$ is concave, a piecewise linear concave function passing through the points $\{(\pi_i, \mathbb{L}(\pi_i, P,Q))\}_{i=1}^n$ with suitably chosen slopes is guaranteed to be an upper bound on the risk curve; subtracting it from $\pi\wedge(1-\pi)$ therefore gives a lower bound on the curve $\pi \mapsto V_\pi(P,Q)$, and hence a lower bound on the f-divergence defined by a weight function $\gamma_f$. This observation forms the basis of the theorem stated below.

Theorem 6. For a positive integer $n$ consider a sequence $0 < \pi_1 < \pi_2 < \cdots < \pi_n < 1$. Let $\pi_0 := 0$ and $\pi_{n+1} := 1$, and for $i = 1,\ldots,n$ let $\tilde\psi_i := \pi_i\wedge(1-\pi_i) - V_{\pi_i}(P,Q)$; observe that consequently $\tilde\psi_0 = \tilde\psi_{n+1} = 0$. Let
$$\mathcal{A}_n := \left\{ \mathbf{a} = (a_1,\ldots,a_n) \in \mathbb{R}^n :\ \frac{\tilde\psi_{i+1} - \tilde\psi_i}{\pi_{i+1} - \pi_i} \le a_i \le \frac{\tilde\psi_i - \tilde\psi_{i-1}}{\pi_i - \pi_{i-1}},\ i = 1,\ldots,n \right\}. \qquad (15)$$
The set $\mathcal{A}_n$ defines the allowable slopes of a piecewise linear concave function majorising the risk curve $\pi \mapsto \mathbb{L}(\pi,P,Q)$ at each of $\pi_1,\ldots,\pi_n$. For $\mathbf{a} \in \mathcal{A}_n$, write $p_i(\pi) := a_i\pi + \tilde\psi_i - a_i\pi_i$ for the line through $(\pi_i,\tilde\psi_i)$ with slope $a_i$, and let
$$\tau_i := \frac{\tilde\psi_i - \tilde\psi_{i+1} + a_{i+1}\pi_{i+1} - a_i\pi_i}{a_{i+1} - a_i}, \qquad i = 1,\ldots,n-1, \qquad (16)$$
$$j := \max\{k \in \{1,\ldots,n\} : \tau_{k-1} < \tfrac12\} \quad (\text{with } \tau_0 := 0), \qquad (17)$$
$$\tilde\pi_i := \llbracket i < j\rrbracket\,\tau_i + \llbracket i = j\rrbracket\,\tfrac12 + \llbracket i > j\rrbracket\,\tau_{i-1}, \qquad i = 1,\ldots,n, \qquad (18)$$
$$\alpha_{\mathbf{a},i} := \llbracket i < j\rrbracket\,(1 - a_{i+1}) + \llbracket i \ge j\rrbracket\,(-1 - a_i), \qquad (19)$$
$$\beta_{\mathbf{a},i} := \llbracket i < j\rrbracket\,(a_{i+1}\pi_{i+1} - \tilde\psi_{i+1}) + \llbracket i \ge j\rrbracket\,(1 + a_i\pi_i - \tilde\psi_i) \qquad (20)$$
for $i = 0,\ldots,n$, where $\tilde\pi_0 := \frac{\tilde\psi_1 - a_1\pi_1}{1 - a_1}$ and $\tilde\pi_{n+1} := \frac{1 - \tilde\psi_n + a_n\pi_n}{1 + a_n}$ are the points at which $p_1$ and $p_n$ meet the graph of $\pi \mapsto \pi\wedge(1-\pi)$, and let $\gamma_f$ be the weight corresponding to $f$ given by (9). For arbitrary $\mathbb{I}_f$ and for all distributions $P$ and $Q$ the following bound holds; if in addition $\mathcal{X}$ contains a connected component, it is tight:
$$\mathbb{I}_f(P,Q) \ \ge\ \min_{\mathbf{a}\in\mathcal{A}_n} \sum_{i=0}^{n} \int_{\tilde\pi_i}^{\tilde\pi_{i+1}} \big(\alpha_{\mathbf{a},i}\,\pi + \beta_{\mathbf{a},i}\big)\,\gamma_f(\pi)\,d\pi \qquad (22)$$
$$=\ \min_{\mathbf{a}\in\mathcal{A}_n} \sum_{i=0}^{n} \Big[ \big(\alpha_{\mathbf{a},i}\tilde\pi_{i+1} + \beta_{\mathbf{a},i}\big)\Gamma_f(\tilde\pi_{i+1}) - \alpha_{\mathbf{a},i}\bar\Gamma_f(\tilde\pi_{i+1}) - \big(\alpha_{\mathbf{a},i}\tilde\pi_i + \beta_{\mathbf{a},i}\big)\Gamma_f(\tilde\pi_i) + \alpha_{\mathbf{a},i}\bar\Gamma_f(\tilde\pi_i) \Big], \qquad (23)$$
where $\Gamma_f(\pi) := \int_0^\pi \gamma_f(t)\,dt$ and $\bar\Gamma_f(\pi) := \int_0^\pi \Gamma_f(t)\,dt$.

Equation (23) follows from (22) by integration by parts. The remainder of the proof is in Section 4. Although (23) looks daunting, we observe that the constraints on $\mathbf{a}$ are convex (in fact they are a box constraint), and the objective is a relatively benign function of $\mathbf{a}$.

When $n = 1$ the result simplifies considerably. If in addition $\pi_1 = \tfrac12$ then by (14) we have $V_{1/2}(P,Q) = \tfrac14 V(P,Q)$. It is then a straightforward exercise to evaluate (23) explicitly, especially when $\gamma_f$ is symmetric about $\tfrac12$. The following theorem expresses the result in terms of $V(P,Q)$ for comparability with previous results. The result for $\mathrm{KL}(P,Q)$ is a best possible improvement on the classical Pinsker inequality.

Theorem 7. For any distributions $P, Q$ on $\mathcal{X}$, let $V := V(P,Q)$. Then the following bounds hold and, if in addition $\mathcal{X}$ has a connected component, are tight. When $\gamma_f$ is symmetric about $\tfrac12$ and convex,
$$\mathbb{I}_f(P,Q) \ \ge\ 2\left[\tfrac{V}{4}\,\Gamma_f\!\big(\tfrac12\big) + \bar\Gamma_f\!\big(\tfrac{2-V}{4}\big) - \bar\Gamma_f\!\big(\tfrac12\big)\right],$$
where $\Gamma_f$ and $\bar\Gamma_f$ are as in Theorem 6.

This theorem gives the first explicit representation of the optimal Pinsker bound (2). By plotting both, one can confirm that the two bounds (implicit and explicit) coincide; see Figure 1. A summary of existing results and their relationship to those presented here is given in Appendix A.
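The symmetric-case bound of Theorem 7 can be evaluated by quadrature without computing $\Gamma_f$ and $\bar\Gamma_f$ explicitly. The sketch below is our own illustration; it uses the Jeffreys weight from Table 1 and compares the numerical value with the closed form $V\ln\frac{2+V}{2-V}$ reconstructed in Corollary 1.

```python
import numpy as np

def symmetric_pinsker_bound(gamma, V, N=200000):
    """Theorem 7 style bound for a weight gamma symmetric about 1/2:
    I_f(P,Q) >= 2 * int_{(2-V)/4}^{1/2} (pi - (2-V)/4) * gamma(pi) dpi."""
    c = (2.0 - V) / 4.0
    pis = c + (np.arange(N) + 0.5) * (0.5 - c) / N     # midpoint rule on [c, 1/2]
    return 2.0 * float(np.sum((pis - c) * gamma(pis))) * (0.5 - c) / N

gamma_jeffreys = lambda pi: 1.0 / (pi**2 * (1 - pi)**2)

for V in [0.5, 1.0, 1.5]:
    numeric = symmetric_pinsker_bound(gamma_jeffreys, V)
    closed_form = V * np.log((2 + V) / (2 - V))        # Jeffreys bound from Corollary 1
    print(V, numeric, closed_form)
```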
4 Proof of Main Result

Proof (Theorem 6): This proof is driven by the duality between the family of variational divergences $V_\pi(P,Q)$ and the 0-1 Bayes risk $\mathbb{L}(\pi,P,Q)$ given in Lemma 3. Given distributions $P$ and $Q$, let $\phi(\pi) := V_\pi(P,Q) = \pi\wedge(1-\pi) - \psi(\pi)$, where $\psi(\pi) := \mathbb{L}(\pi,P,Q)$. We know that $\psi$ is non-negative and concave, satisfies $\psi(\pi) \le \pi\wedge(1-\pi)$, and thus $\psi(0) = \psi(1) = 0$, and that
$$\mathbb{I}_f(P,Q) = \int_0^1 \phi(\pi)\,\gamma_f(\pi)\,d\pi. \qquad (25)$$
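The objects $\phi$ and $\psi$ in this proof are concrete and easy to examine for discrete $P$ and $Q$. The following sketch (our own illustration) computes the risk curve $\psi(\pi) = \mathbb{L}(\pi,P,Q)$ on a grid and checks numerically that it is concave and that $\phi = \pi\wedge(1-\pi) - \psi$ is non-negative.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

def psi(prior):
    """Risk curve psi(pi) = L(pi, P, Q) = sum_x min(pi P(x), (1-pi) Q(x))."""
    return float(np.sum(np.minimum(prior * p, (1 - prior) * q)))

grid = np.linspace(0.0, 1.0, 2001)
values = np.array([psi(c) for c in grid])
tent = np.minimum(grid, 1 - grid)
phi = tent - values                      # phi(pi) = V_pi(P,Q)

# Concavity check: second differences of a concave function are <= 0.
second_diff = np.diff(values, 2)
print(phi.min() >= -1e-12, second_diff.max() <= 1e-12)
```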

[Figure 1: Lower bound on KL(P,Q) as a function of the variational divergence V(P,Q). Both the explicit bound and Fedotov et al.'s implicit bound are plotted.]

By (25), $\mathbb{I}_f(P,Q)$ is minimised by minimising $\phi$ over all $P, Q$ such that $\phi(\pi_i) = \pi_i\wedge(1-\pi_i) - \tilde\psi_i$. Since $\tilde\psi_i := \pi_i\wedge(1-\pi_i) - V_{\pi_i}(P,Q) = \psi(\pi_i)$, the minimisation problem for $\phi$ can be expressed in terms of $\psi$ as follows:
$$\text{Given } (\pi_i, \tilde\psi_i)_{i=1}^n, \text{ find the maximal } \psi\colon [0,1] \to [0,1] \qquad (26)$$
such that
$$\psi(\pi_i) = \tilde\psi_i, \quad i = 1,\ldots,n+1, \qquad (27)$$
$$\psi(\pi) \le \pi\wedge(1-\pi), \quad \pi \in [0,1], \qquad (28)$$
$$\psi \text{ is concave.} \qquad (29)$$
This will tell us the optimal $\phi$ to use, since optimising over $\psi$ is equivalent to optimising over $\mathbb{L}(\cdot, P, Q)$. Under the additional assumption on $\mathcal{X}$, Corollary 5 implies that for any $\psi$ satisfying (27), (28) and (29) there exist $P, Q$ such that $\mathbb{L}(\pi,P,Q) = \psi(\pi)$. This establishes the tightness of our bounds.

Let $\Psi$ be the set of piecewise linear concave functions on $[0,1]$ having at most $n+2$ segments such that every $\psi \in \Psi$ satisfies (27) and (28). We now show that in order to solve (26) it suffices to consider $\psi \in \Psi$.

If $g$ is a concave function on $\mathbb{R}$, let $\eth g(x) := \{s \in \mathbb{R} : g(y) \le g(x) + s(y-x),\ \forall y \in \mathbb{R}\}$ denote the sup-differential of $g$ at $x$, the obvious analogue of the sub-differential of convex functions [23]. Suppose $\psi$ is a general concave function satisfying (27) and (28). For $i = 1,\ldots,n$, let
$$G_i^\psi := \big\{ g_i^\psi\colon [0,1] \to \mathbb{R} : g_i^\psi \text{ is affine},\ g_i^\psi(\pi_i) = \tilde\psi_i, \text{ and the slope of } g_i^\psi \text{ lies in } \eth\psi(\pi_i) \big\}.$$
Observe that, by concavity, for all concave $\psi$ satisfying (27) and (28) and all $g \in \bigcup_{i=1}^n G_i^\psi$, we have $g(\pi) \ge \psi(\pi)$ for $\pi \in [0,1]$. Thus, given any such $\psi$, one can always construct
$$\bar\psi := \min(g_1^\psi, \ldots, g_n^\psi) \qquad (30)$$
such that $\bar\psi$ is concave, satisfies (27), and $\bar\psi(\pi) \ge \psi(\pi)$ for all $\pi \in [0,1]$. It remains to take account of (28). That is trivially done by setting
$$\psi^* := \min\big(\bar\psi,\ \pi\wedge(1-\pi)\big), \qquad (31)$$
which remains concave and piecewise linear, although with potentially additional linear segments.

Finally, the pointwise smallest concave $\psi$ satisfying (27) and (28) is the piecewise linear function connecting the points $(0,0), (\pi_1,\tilde\psi_1), (\pi_2,\tilde\psi_2), \ldots, (\pi_n,\tilde\psi_n), (1,0)$. Let $g\colon [0,1] \to [0,1]$ be this function, which can be written explicitly as
$$g(\pi) = \tilde\psi_i + \frac{\tilde\psi_{i+1} - \tilde\psi_i}{\pi_{i+1} - \pi_i}(\pi - \pi_i), \quad \pi \in [\pi_i, \pi_{i+1}],$$
where we have defined $\pi_0 := 0$, $\tilde\psi_0 := 0$, $\pi_{n+1} := 1$ and $\tilde\psi_{n+1} := 0$.

We now explicitly parametrise this family of functions. Let $p_i\colon [0,1] \to \mathbb{R}$ denote the affine segment whose graph passes through $(\pi_i, \tilde\psi_i)$, $i = 1,\ldots,n$. Write $p_i(\pi) = a_i\pi + b_i$. We know that $p_i(\pi_i) = \tilde\psi_i$ and thus
$$b_i = \tilde\psi_i - a_i\pi_i, \quad i = 1,\ldots,n. \qquad (32)$$
In order to determine the constraints on $a_i$, since $g$ is concave and minorises $\psi$, it suffices to consider only $(\pi_{i-1}, g(\pi_{i-1}))$ and $(\pi_{i+1}, g(\pi_{i+1}))$ for $i = 1,\ldots,n$. We have, for $i = 1,\ldots,n$,
$$p_i(\pi_{i-1}) \ge g(\pi_{i-1}) \iff a_i\pi_{i-1} + b_i \ge \tilde\psi_{i-1} \iff \tilde\psi_i - a_i(\pi_i - \pi_{i-1}) \ge \tilde\psi_{i-1} \iff a_i \le \frac{\tilde\psi_i - \tilde\psi_{i-1}}{\pi_i - \pi_{i-1}}. \qquad (33)$$
Similarly, for $i = 1,\ldots,n$,
$$p_i(\pi_{i+1}) \ge g(\pi_{i+1}) \iff a_i\pi_{i+1} + b_i \ge \tilde\psi_{i+1} \iff a_i \ge \frac{\tilde\psi_{i+1} - \tilde\psi_i}{\pi_{i+1} - \pi_i}. \qquad (34)$$

[Figure 2: Illustration of the construction of the optimal ψ = L(·, P, Q) in the proof of Theorem 6: the admissible region for lines with slopes in A_n, and the resulting piecewise linear ψ_a for a particular a. The optimal ψ is piecewise linear and satisfies ψ(π_i) = ψ̃_i, i = 1,…,n+1.]

We now determine the points at which $\psi^*$ defined by (30) and (31) changes slope. That occurs at the points $\pi$ at which $p_i(\pi) = p_{i+1}(\pi)$:
$$a_i\pi + \tilde\psi_i - a_i\pi_i = a_{i+1}\pi + \tilde\psi_{i+1} - a_{i+1}\pi_{i+1} \iff \pi = \frac{\tilde\psi_i - \tilde\psi_{i+1} + a_{i+1}\pi_{i+1} - a_i\pi_i}{a_{i+1} - a_i} =: \tau_i$$
for $i = 1,\ldots,n-1$. Thus $\psi^*(\pi) = p_i(\pi)$ for $\pi \in [\tau_{i-1}, \tau_i]$, $i = 1,\ldots,n$, where $\tau_0$ and $\tau_n$ denote the points at which $p_1$ and $p_n$ meet the constraint $\pi\wedge(1-\pi)$.

Let $\mathbf{a} = (a_1,\ldots,a_n)$. We explicitly denote the dependence of $\psi^*$ on $\mathbf{a}$ by writing $\psi_{\mathbf{a}}$. Let
$$\phi_{\mathbf{a}}(\pi) := \pi\wedge(1-\pi) - \psi_{\mathbf{a}}(\pi) = \alpha_{\mathbf{a},i}\,\pi + \beta_{\mathbf{a},i}, \quad \pi \in [\tilde\pi_i, \tilde\pi_{i+1}],\ i = 0,\ldots,n,$$
where $\mathbf{a} \in \mathcal{A}_n$ (see (15)) and $\tilde\pi_i$, $\alpha_{\mathbf{a},i}$ and $\beta_{\mathbf{a},i}$ are defined by (18), (19) and (20) respectively. The extra segment induced at index $j$ (see (17)) is needed since $\pi\wedge(1-\pi)$ has a slope change at $\pi = \tfrac12$. Thus in general $\phi_{\mathbf{a}}$ is piecewise linear with at most $n+1$ non-trivial segments (it vanishes outside $[\tilde\pi_0, \tilde\pi_{n+1}]$); if $\tau_k = \tfrac12$ for some $k \in \{1,\ldots,n-1\}$, then there is one fewer non-trivial segment. Thus
$$\big\{ \phi_{\mathbf{a}}\big|_{[\tilde\pi_0,\,\tilde\pi_{n+1}]} : \mathbf{a} \in \mathcal{A}_n \big\}$$
is the set of $\phi$ consistent with the constraints, where $\mathcal{A}_n$ is defined in (15). Substituting into (25), interchanging the order of summation and integration, and optimising, we have shown (22). The tightness has already been argued: under the additional assumption on $\mathcal{X}$ there is no slop in the argument above, since every $\psi$ satisfying the constraints in (26)-(29) is the Bayes risk function for some $P, Q$.
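The construction just completed can be mimicked numerically: build $\psi_{\mathbf{a}}$ as the minimum of the admissible lines and the tent $\pi\wedge(1-\pi)$, integrate $\phi_{\mathbf{a}}\gamma_f$, and search over the box $\mathcal{A}_n$. The sketch below is our own illustration of this procedure (a brute-force grid search, not the paper's closed form); for $n = 1$, $\pi_1 = \tfrac12$ and the KL weight it should land close to the explicit bound of Corollary 1.

```python
import numpy as np
from itertools import product

def theorem6_bound(pis, Vs, gamma, slope_grid=41, n_pts=20000):
    """Numerical sketch of the Theorem 6 construction: minimise
    int (min(pi,1-pi) - psi_a(pi)) * gamma(pi) dpi over admissible slopes a,
    where psi_a is the min of the tent and the lines through (pi_i, psi_i)."""
    pis = np.asarray(pis, float)
    psis = np.minimum(pis, 1 - pis) - np.asarray(Vs, float)
    ext_p = np.concatenate(([0.0], pis, [1.0]))
    ext_s = np.concatenate(([0.0], psis, [0.0]))
    lo = (ext_s[2:] - ext_s[1:-1]) / (ext_p[2:] - ext_p[1:-1])   # right-secant slopes
    hi = (ext_s[1:-1] - ext_s[:-2]) / (ext_p[1:-1] - ext_p[:-2]) # left-secant slopes
    grid = (np.arange(n_pts) + 0.5) / n_pts                      # midpoints on (0,1)
    tent = np.minimum(grid, 1 - grid)
    w = gamma(grid) / n_pts
    best = np.inf
    for a in product(*[np.linspace(l, h, slope_grid) for l, h in zip(lo, hi)]):
        lines = [ai * grid + si - ai * pi for ai, si, pi in zip(a, psis, pis)]
        psi_a = np.minimum(tent, np.min(lines, axis=0))
        best = min(best, float(np.sum((tent - psi_a) * w)))
    return best

# Example: n = 1, pi_1 = 1/2, V_{1/2} = V/4, KL weight gamma(pi) = 1/(pi^2 (1-pi)).
V = 1.0
bound = theorem6_bound([0.5], [V / 4], lambda c: 1 / (c**2 * (1 - c)))
print(bound, V**2 / 2)   # should exceed the classical Pinsker value V^2/2
```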

[Figure 3: The optimisation problem when n = 1. Given ψ(1/2), there are many risk curves consistent with it. The optimisation problem involves finding the maximal piecewise linear concave risk curve ψ ∈ Ψ, equivalently the φ = π∧(1−π) − ψ that minimises I_f. L and U are defined in the text.]

Proof (Theorem 7): In this case $n = 1$ and the optimal $\psi$ function will be piecewise linear, concave, and its graph will pass through $(\tfrac12, \psi(\tfrac12))$. Thus the optimal $\phi$ will be of the form
$$\phi(\pi) = \begin{cases} 0, & \pi \in [0,L] \cup [U,1] \\ (1-a)\pi - b, & \pi \in [L, \tfrac12] \\ -(1+a)\pi + 1 - b, & \pi \in [\tfrac12, U], \end{cases}$$
where $\tfrac a2 + b = \psi(\tfrac12)$, i.e. $b = \psi(\tfrac12) - \tfrac a2$, and $a \in [-2\psi(\tfrac12), 2\psi(\tfrac12)]$; see Figure 3. For the variational divergence, $\pi_1 = \tfrac12$ and thus by (14)
$$\psi(\tfrac12) = \tfrac12 - V_{1/2}(P,Q) = \tfrac{2-V}{4}, \qquad (35)$$
and so $\phi(\tfrac12) = \tfrac V4$. We can thus determine $L$ and $U$: $aL + b = L$ gives $L = \frac{\psi(1/2) - a/2}{1-a}$, and similarly $aU + b = 1 - U$ gives $U = \frac{1 - \psi(1/2) + a/2}{1+a}$. Thus
$$\mathbb{I}_f(P,Q) \ \ge\ \min_{a\in[-2\psi(\frac12),\,2\psi(\frac12)]} \left\{ \int_{\frac{\psi(1/2)-a/2}{1-a}}^{1/2}\!\Big[(1-a)\pi - \psi(\tfrac12) + \tfrac a2\Big]\gamma_f(\pi)\,d\pi + \int_{1/2}^{\frac{1-\psi(1/2)+a/2}{1+a}}\!\Big[1 - \psi(\tfrac12) + \tfrac a2 - (1+a)\pi\Big]\gamma_f(\pi)\,d\pi \right\}. \qquad (36)$$
If $\gamma_f$ is symmetric about $\tfrac12$ and convex, then the optimal $a = 0$. Thus in that case
$$\mathbb{I}_f(P,Q) \ \ge\ 2\int_{\psi(1/2)}^{1/2} \big(\pi - \psi(\tfrac12)\big)\gamma_f(\pi)\,d\pi \qquad (37)$$
$$= 2\Big[\big(\tfrac12 - \psi(\tfrac12)\big)\Gamma_f(\tfrac12) + \bar\Gamma_f\big(\psi(\tfrac12)\big) - \bar\Gamma_f(\tfrac12)\Big] = 2\Big[\tfrac V4\,\Gamma_f(\tfrac12) + \bar\Gamma_f\big(\tfrac{2-V}{4}\big) - \bar\Gamma_f(\tfrac12)\Big]. \qquad (38)$$
Combining the above with (35) leads to a range of Pinsker-style bounds for symmetric $\mathbb{I}_f$.

Jeffreys divergence. Since $J(P,Q) = \mathrm{KL}(P,Q) + \mathrm{KL}(Q,P)$ we have $\gamma_J(\pi) = \frac{1}{\pi^2(1-\pi)} + \frac{1}{\pi(1-\pi)^2} = \frac{1}{\pi^2(1-\pi)^2}$. As a check, $f(t) = (t-1)\ln t$, $f''(t) = \frac{t+1}{t^2}$, and so $\gamma_f(\pi) = \pi^{-3}f''\!\big(\tfrac{1-\pi}{\pi}\big) = \frac{1}{\pi^2(1-\pi)^2}$. Thus
$$J(P,Q) \ge 2\int_{\psi(1/2)}^{1/2} \frac{\pi - \psi(\tfrac12)}{\pi^2(1-\pi)^2}\,d\pi = 2\big(1 - 2\psi(\tfrac12)\big)\ln\frac{1-\psi(\tfrac12)}{\psi(\tfrac12)}.$$
Substituting $\psi(\tfrac12) = \tfrac{2-V}{4}$ gives $J(P,Q) \ge V\ln\frac{2+V}{2-V}$. Observe that this bound behaves like $V^2$ for small $V$, which is also what the traditional Pinsker inequality $\mathrm{KL}(P,Q) \ge V^2/2$ yields via $J(P,Q) = \mathrm{KL}(P,Q) + \mathrm{KL}(Q,P) \ge V^2/2 + V^2/2 = V^2$, but unlike that quadratic bound it diverges as $V \to 2$.

Jensen-Shannon divergence. Here $f(t) = \tfrac t2\ln t - \tfrac{t+1}{2}\ln\tfrac{t+1}{2}$ and thus $\gamma_f(\pi) = \pi^{-3}f''\!\big(\tfrac{1-\pi}{\pi}\big) = \frac{1}{2\pi(1-\pi)}$. Thus
$$I(P,Q) \ge 2\int_{\psi(1/2)}^{1/2} \frac{\pi - \psi(\tfrac12)}{2\pi(1-\pi)}\,d\pi = \ln 2 + \psi(\tfrac12)\ln\psi(\tfrac12) + \big(1-\psi(\tfrac12)\big)\ln\big(1-\psi(\tfrac12)\big).$$
Substituting $\psi(\tfrac12) = \tfrac{2-V}{4}$ leads to $I(P,Q) \ge \tfrac{2-V}{4}\ln(2-V) + \tfrac{2+V}{4}\ln(2+V) - \ln 2$.

Hellinger divergence. Here $f(t) = (\sqrt t - 1)^2$. Consequently $\gamma_f(\pi) = \pi^{-3}\cdot\tfrac12\big(\tfrac{1-\pi}{\pi}\big)^{-3/2} = \tfrac12[\pi(1-\pi)]^{-3/2}$ and thus
$$h^2(P,Q) \ge \int_{\psi(1/2)}^{1/2} \big(\pi - \psi(\tfrac12)\big)\,[\pi(1-\pi)]^{-3/2}\,d\pi = 2 - 4\sqrt{\psi(\tfrac12)\big(1-\psi(\tfrac12)\big)}.$$
Substituting $\psi(\tfrac12) = \tfrac{2-V}{4}$ gives $h^2(P,Q) \ge 2 - \sqrt{(2+V)(2-V)} = 2 - \sqrt{4 - V^2}$. For small $V$ this behaves like $V^2/4$.

Arithmetic-Geometric Mean divergence. In this case $f(t) = \tfrac{t+1}{2}\ln\tfrac{t+1}{2\sqrt t}$. Thus $f''(t) = \frac{t^2+1}{4t^2(t+1)}$ and $\gamma_f(\pi) = \tfrac14\big[\tfrac{1}{\pi^2} + \tfrac{1}{(1-\pi)^2}\big]$, and
$$T(P,Q) \ge 2\int_{\psi(1/2)}^{1/2} \big(\pi - \psi(\tfrac12)\big)\,\tfrac14\Big[\tfrac{1}{\pi^2} + \tfrac{1}{(1-\pi)^2}\Big]\,d\pi = -\tfrac12\ln\Big(4\psi(\tfrac12)\big(1-\psi(\tfrac12)\big)\Big).$$
Substituting $\psi(\tfrac12) = \tfrac{2-V}{4}$ gives $T(P,Q) \ge \ln 2 - \tfrac12\ln(4 - V^2)$.
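All of the symmetric bounds just derived are attained by a simple two-point pair with the prescribed variational divergence, namely $P = \big(\tfrac{2+V}{4}, \tfrac{2-V}{4}\big)$ and $Q$ its reversal, which realises the extremal risk curve $\min(\pi, 1-\pi, \tfrac{2-V}{4})$. The following sketch is our own check of the reconstructed closed forms against this pair.

```python
import numpy as np

def divergences(P, Q):
    """Direct computation of several symmetric f-divergences for discrete P, Q."""
    M = (P + Q) / 2
    KL = lambda A, B: float(np.sum(A * np.log(A / B)))
    return {
        "J":   KL(P, Q) + KL(Q, P),                      # Jeffreys
        "I":   0.5 * (KL(P, M) + KL(Q, M)),              # Jensen-Shannon
        "T":   0.5 * (KL(M, P) + KL(M, Q)),              # arithmetic-geometric mean
        "h2":  float(np.sum((np.sqrt(P) - np.sqrt(Q))**2)),
        "Psi": float(np.sum((P - Q)**2 / Q) + np.sum((Q - P)**2 / P)),
    }

V = 1.2
P = np.array([(2 + V) / 4, (2 - V) / 4])   # two-point pair with V(P,Q) = V
Q = P[::-1]

bounds = {
    "J":   V * np.log((2 + V) / (2 - V)),
    "I":   (2 - V) / 4 * np.log(2 - V) + (2 + V) / 4 * np.log(2 + V) - np.log(2),
    "T":   np.log(2) - 0.5 * np.log(4 - V**2),
    "h2":  2 - np.sqrt(4 - V**2),
    "Psi": 8 * V**2 / (4 - V**2),
}
for k, v in divergences(P, Q).items():
    print(k, v, bounds[k])   # each pair should agree, showing the bounds are attained
```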

Symmetric $\chi^2$-divergence. In this case $\Psi(P,Q) = \chi^2(P,Q) + \chi^2(Q,P)$ and thus $\gamma_f(\pi) = \frac{2}{\pi^3} + \frac{2}{(1-\pi)^3}$. As a check, from $f(t) = \frac{(t-1)^2(t+1)}{t}$ we have $f''(t) = 2 + \frac{2}{t^3}$, and $\gamma_f(\pi) = \pi^{-3}f''\!\big(\tfrac{1-\pi}{\pi}\big)$ gives the same result. Thus
$$\Psi(P,Q) \ge 2\int_{\psi(1/2)}^{1/2} \big(\pi - \psi(\tfrac12)\big)\Big[\tfrac{2}{\pi^3} + \tfrac{2}{(1-\pi)^3}\Big]\,d\pi = \frac{2}{\psi(\tfrac12)\big(1-\psi(\tfrac12)\big)} - 8.$$
Substituting $\psi(\tfrac12) = \tfrac{2-V}{4}$ gives $\Psi(P,Q) \ge \frac{8V^2}{4-V^2}$.

When $\gamma_f$ is not symmetric, one needs to use (36) instead of the simpler (38). We consider two cases.

$\chi^2$-divergence. Here $f(t) = (t-1)^2$ and so $f''(t) = 2$, hence $\gamma_f(\pi) = \frac{2}{\pi^3}$, which is not symmetric. Upon substituting $2/\pi^3$ for $\gamma_f$ in (36) and evaluating the integrals we obtain
$$\chi^2(P,Q) \ \ge\ \min_{a\in[-2\psi(\frac12),\,2\psi(\frac12)]} \underbrace{\left[\frac{(1-a)^2}{\psi(\tfrac12) - \tfrac a2} + \frac{(1+a)^2}{1 - \psi(\tfrac12) + \tfrac a2} - 4\right]}_{=:J(a,\psi(\frac12))}.$$
One can then solve $\partial_a J(a,\psi) = 0$ for $a$ and one obtains $a^* = 2\psi - 1$. Now $a^* \ge -2\psi$ only if $\psi \ge \tfrac14$. One can check that when $\psi \le \tfrac14$, $a \mapsto J(a,\psi)$ is monotonically increasing on $[-2\psi, 2\psi]$ and hence the minimum occurs at $a = -2\psi$. Thus the value of $a$ minimising $J(a,\psi)$ is $a^* = \llbracket \psi > \tfrac14\rrbracket(2\psi - 1) + \llbracket \psi \le \tfrac14\rrbracket(-2\psi)$. Substituting the optimal value of $a$ into $J(a,\psi)$ we obtain
$$J(a^*,\psi) = \llbracket \psi > \tfrac14\rrbracket\,(2 - 4\psi)^2 + \llbracket \psi \le \tfrac14\rrbracket\left[\frac{(1+2\psi)^2}{2\psi} - 3 - 2\psi\right].$$
Substituting $\psi = \psi(\tfrac12) = \tfrac{2-V}{4}$ and observing that $\llbracket V < 1\rrbracket = \llbracket \psi > \tfrac14\rrbracket$, we obtain
$$\chi^2(P,Q) \ge \llbracket V < 1\rrbracket\,V^2 + \llbracket V \ge 1\rrbracket\,\frac{V}{2-V}.$$
Observe that the bound diverges to $+\infty$ as $V \to 2$.

Kullback-Leibler divergence. In this case we have $f(t) = t\ln t$, thus $f''(t) = 1/t$, and consequently $\gamma_f(\pi) = \frac{1}{\pi^2(1-\pi)}$, which is clearly not symmetric. From (36) we obtain
$$\mathrm{KL}(P,Q) \ \ge\ \min_{a\in[-2\psi(\frac12),\,2\psi(\frac12)]} \left[\Big(1 - \psi(\tfrac12) - \tfrac a2\Big)\ln\frac{1 - \psi(\tfrac12) - \tfrac a2}{\psi(\tfrac12) - \tfrac a2} + \Big(\psi(\tfrac12) + \tfrac a2\Big)\ln\frac{\psi(\tfrac12) + \tfrac a2}{1 - \psi(\tfrac12) + \tfrac a2}\right].$$
Substituting $\psi(\tfrac12) = \tfrac{2-V}{4}$ gives $\mathrm{KL}(P,Q) \ge \min_{a\in[\frac{V-2}{2},\,\frac{2-V}{2}]} \delta_a(V)$, where
$$\delta_a(V) = \frac{2+V-2a}{4}\ln\frac{2+V-2a}{2-V-2a} + \frac{2-V+2a}{4}\ln\frac{2-V+2a}{2+V+2a}.$$
Setting $\beta := 2a$ yields the bound stated in Corollary 1.

5 Conclusion

We have generalised the classical Pinsker inequality and developed best possible bounds for the general situation. A special case of the result gives an explicit bound relating Kullback-Leibler divergence and variational divergence. The proof relied on an integral representation of f-divergences in terms of statistical information. Such representations are a powerful device as they identify the primitives underpinning general learning problems. These representations are further studied in [22].

A History of Pinsker Inequalities

Pinsker [21] presented the first bound relating $\mathrm{KL}(P,Q)$ to $V(P,Q)$: $\mathrm{KL} \ge V^2/2$, and it is now known by his name (or sometimes as the Pinsker-Csiszár-Kullback inequality, since Csiszár [3] presented another version and Kullback [14] showed $\mathrm{KL} \ge V^2/2 + V^4/36$). Much later Topsøe [26] showed $\mathrm{KL} \ge V^2/2 + V^4/36 + V^6/270$. Non-polynomial bounds are due to Vajda [31],
$$\mathrm{KL} \ge L_{\mathrm{Vajda}}(V) := \ln\frac{2+V}{2-V} - \frac{2V}{2+V},$$
and Toussaint [29], who showed $\mathrm{KL} \ge \max\big(L_{\mathrm{Vajda}}(V),\ V^2/2 + V^4/36 + V^8/288\big)$.

Care needs to be taken when comparing results from the literature as different definitions of the divergences exist. For example, Gibbs and Su [8] use a definition of $V$ that differs by a factor of 2 from ours. There are some isolated bounds relating $V$ to some other divergences, analogous to the classical Pinsker bound; Kumar and Chhina [15] present a summary as well as new bounds for a wide range of symmetric f-divergences by making assumptions on the likelihood ratio, $0 < r \le p(x)/q(x) \le R < \infty$ for all $x \in \mathcal{X}$. This line of reasoning has also been developed by Dragomir et al. [6] and Taneja [24, 25]. Topsøe [27] presents some infinite series representations for the capacitory discrimination in terms of the triangular discrimination which lead to inequalities between those two divergences. Liese and Miescke [18] give the inequality $V \le h\sqrt{4 - h^2}$ (which seems to be originally due to Le Cam [16]); when rearranged it corresponds exactly to the bound for $h^2$ in Theorem 7. Withers [32] also presents some inequalities between other particular pairs of divergences; his reasoning is also in terms of infinite series expansions.
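The historical bounds collected above can be compared numerically with the explicit bound of Theorem 7. The sketch below is our own illustration and uses the reconstructed forms of the historical inequalities quoted in this appendix; the best possible bound should dominate all of them.

```python
import numpy as np

def best_bound(V, grid=20001):
    """Explicit best-possible KL bound of Theorem 7 (reconstructed form)."""
    b = np.linspace(V - 2, 2 - V, grid)
    with np.errstate(divide="ignore", invalid="ignore"):
        vals = ((2 + V - b) / 4 * np.log((2 + V - b) / (2 - V - b))
                + (2 - V + b) / 4 * np.log((2 - V + b) / (2 + V + b)))
    return float(np.nanmin(vals))

def pinsker(V):    return V**2 / 2
def kullback(V):   return V**2 / 2 + V**4 / 36
def topsoe(V):     return V**2 / 2 + V**4 / 36 + V**6 / 270
def vajda(V):      return np.log((2 + V) / (2 - V)) - 2 * V / (2 + V)
def toussaint(V):  return max(vajda(V), V**2 / 2 + V**4 / 36 + V**8 / 288)

for V in [0.5, 1.0, 1.5, 1.9]:
    row = [f(V) for f in (pinsker, kullback, topsoe, vajda, toussaint, best_bound)]
    print(V, ["%.4f" % x for x in row])   # best_bound should dominate the others
```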
Arnold et al. [30] considered the case of $n = 1$ but arbitrary $\mathbb{I}_f$; that is, they bound an arbitrary f-divergence in terms of the variational divergence. Their argument is similar to the geometric proof of Theorem 6. They do not compute any of the explicit bounds in Theorem 7, except that they state a bound for $\chi^2(P,Q)$ in terms of $V$ which is looser than the corresponding bound in Corollary 1. Gilardoni [9] showed, via an intricate argument, that if $f''(1)$ exists then $\mathbb{I}_f \ge \tfrac{f''(1)}{2}V^2$. He also showed some fourth-order inequalities of the form $\mathbb{I}_f \ge c_{2,f}V^2 + c_{4,f}V^4$, where the constants depend on the behaviour of $f$ at $1$ in a complex way. Gilardoni [10, 11] presented a completely different approach which obtains many of the results of Theorem 7. (We were unaware of these two papers until completing the results presented in the main paper.) Gilardoni [11] improved Vajda's

bound slightly. Gilardoni [10, 11] also presented a general tight lower bound for $\mathbb{I}_f = \mathbb{I}_f(P,Q)$ in terms of $V = V(P,Q)$ which is difficult to evaluate explicitly in general; it is expressed in terms of the function $g(u) = u f'(u) - f(u)$ and its one-sided inverses $g_L$ and $g_R$ (where $g_R[g(u)] = u$ for $u \ge 1$ and $g_L[g(u)] = u$ for $u \le 1$). He presented a new parametric form for $\mathbb{I}_f = \mathrm{KL}$ in terms of Lambert's W function. In general, the result is analogous to that of Fedotov et al. [7] in that it is in a parametric form which, if one wishes to evaluate it for a particular $V$, requires a one-dimensional numerical search as complex as (2). However, when $f$ is such that $\mathbb{I}_f$ is symmetric, this simplifies to the elegant form
$$\mathbb{I}_f \ge \frac{2-V}{4}\,f\!\left(\frac{2+V}{2-V}\right) + \frac{2+V}{4}\,f\!\left(\frac{2-V}{2+V}\right).$$
He presented explicit special cases for $h^2$, $J$ and $I$ identical to the results in Theorem 7. It is not apparent how the approach of Gilardoni [10, 11] could be extended to more general situations such as that in Theorem 6 (i.e. $n > 1$). Bolley and Villani [2] considered weighted versions of the Pinsker inequalities for a weighted generalisation of the variational divergence in terms of the KL divergence that are related to transportation inequalities.

Acknowledgements

This work was supported by the Australian Research Council and by NICTA, an initiative of the Commonwealth Government under Backing Australia's Ability.

References

[1] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B (Methodological), 28:131-142, 1966.
[2] F. Bolley and C. Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des Sciences de Toulouse, 14(3):331-352, 2005.
[3] I. Csiszár. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2:299-318, 1967.
[4] M. H. DeGroot. Uncertainty, information, and sequential experiments. The Annals of Mathematical Statistics, 33(2):404-419, 1962.
[5] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
[6] S. S. Dragomir, V. Gluščević, and C. E. M. Pearce. Csiszár f-divergence, Ostrowski's inequality and mutual information. Nonlinear Analysis, 47:2375-2386, 2001.
[7] A. A. Fedotov, P. Harremoës, and F. Topsøe. Refinements of Pinsker's inequality. IEEE Transactions on Information Theory, 49(6):1491-1498, June 2003.
[8] Alison L. Gibbs and Francis Edward Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419-435, 2002.
[9] G. L. Gilardoni. On Pinsker's type inequalities and Csiszár's f-divergences. Part I: Second and fourth-order inequalities. arXiv:cs/0603097, April 2006.
[10] Gustavo L. Gilardoni. On the minimum f-divergence for given total variation. Comptes Rendus de l'Académie des Sciences, Paris, Série I, 343, 2006.
[11] Gustavo L. Gilardoni. On the relationship between symmetric f-divergence and total variation and an improved Vajda's inequality. Preprint, Departamento de Estatística, Universidade de Brasília, April 2006.
[12] Cornelius Gutenbrunner. On applications of the representation of f-divergences as averaged minimal Bayesian risk. In Transactions of the 11th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, pages 449-456, Dordrecht/Boston, 1992. Kluwer Academic Publishers.
[13] Mohammadali Khosravifard, Dariush Fooladivanda, and T. Aaron Gulliver. Confliction of the convexity and metric properties in f-divergences. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E90-A(9):1848-1853, 2007.
[14] S. Kullback. Lower bound for discrimination information in terms of variation. IEEE Transactions on Information Theory, 13:126-127, 1967. Correction, volume 16, p. 652, September 1970.
[15] P. Kumar and S. Chhina. A symmetric information divergence measure of the Csiszár's f-divergence class and its bounds. Computers and Mathematics with Applications, 49:575-588, 2005.
[16] Lucien Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer, 1986.
[17] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394-4412, 2006.
[18] Friedrich Liese and Klaus-J. Miescke. Statistical Decision Theory. Springer, New York, 2008.
[19] X. Nguyen, M. J. Wainwright, and M. I. Jordan. On distance measures, surrogate loss functions, and distributed detection. Technical Report 695, Department of Statistics, University of California, Berkeley, October 2005.
[20] F. Österreicher and I. Vajda. Statistical information and discrimination. IEEE Transactions on Information Theory, 39(3):1036-1039, 1993.

[21] M. S. Pinsker. Information and Information Stability of Random Variables and Processes. Holden-Day, 1964.
[22] Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments. arXiv:0901.0356, 89 pages, January 2009.
[23] R. T. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics and Physics. Princeton University Press, 1970.
[24] I. J. Taneja. Bounds on non-symmetric divergence measures in terms of symmetric divergence measures. arXiv preprint, 2005.
[25] I. J. Taneja. Refinement inequalities among symmetric divergence measures. arXiv preprint, April 2005.
[26] F. Topsøe. Bounds for entropy and divergence for distributions over a two-element set. Journal of Inequalities in Pure and Applied Mathematics, 2(2), 2001.
[27] Flemming Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46(4):1602-1609, 2000.
[28] E. N. Torgersen. Comparison of Statistical Experiments. Cambridge University Press, 1991.
[29] G. T. Toussaint. Probability of error, expected divergence and the affinity of several distributions. IEEE Transactions on Systems, Man and Cybernetics, 8:482-485, 1978.
[30] Andreas Unterreiter, Anton Arnold, Peter Markowich, and Giuseppe Toscani. On generalized Csiszár-Kullback inequalities. Monatshefte für Mathematik, 131:235-253, 2000.
[31] I. Vajda. Note on discrimination information and variation. IEEE Transactions on Information Theory, 16:771-773, 1970.
[32] Lang Withers. Some inequalities relating different measures of divergence between two probability distributions. IEEE Transactions on Information Theory, 45(5):1728-1735, 1999.


ON A CLASS OF PERIMETER-TYPE DISTANCES OF PROBABILITY DISTRIBUTIONS KYBERNETIKA VOLUME 32 (1996), NUMBER 4, PAGES 389-393 ON A CLASS OF PERIMETER-TYPE DISTANCES OF PROBABILITY DISTRIBUTIONS FERDINAND OSTERREICHER The class If, p G (l,oo], of /-divergences investigated

More information

Bayesian Inference Course, WTCN, UCL, March 2013

Bayesian Inference Course, WTCN, UCL, March 2013 Bayesian Course, WTCN, UCL, March 2013 Shannon (1948) asked how much information is received when we observe a specific value of the variable x? If an unlikely event occurs then one would expect the information

More information

Law of Cosines and Shannon-Pythagorean Theorem for Quantum Information

Law of Cosines and Shannon-Pythagorean Theorem for Quantum Information In Geometric Science of Information, 2013, Paris. Law of Cosines and Shannon-Pythagorean Theorem for Quantum Information Roman V. Belavkin 1 School of Engineering and Information Sciences Middlesex University,

More information

Quantifying information and uncertainty

Quantifying information and uncertainty Quantifying information and uncertainty Alexander Frankel and Emir Kamenica University of Chicago April 2018 Abstract We examine ways to measure the amount of information generated by a piece of news and

More information

INTRODUCTION TO REAL ANALYSIS II MATH 4332 BLECHER NOTES

INTRODUCTION TO REAL ANALYSIS II MATH 4332 BLECHER NOTES INTRODUCTION TO REAL ANALYSIS II MATH 4332 BLECHER NOTES You will be expected to reread and digest these typed notes after class, line by line, trying to follow why the line is true, for example how it

More information

The Distribution of Mixing Times in Markov Chains

The Distribution of Mixing Times in Markov Chains The Distribution of Mixing Times in Markov Chains Jeffrey J. Hunter School of Computing & Mathematical Sciences, Auckland University of Technology, Auckland, New Zealand December 2010 Abstract The distribution

More information

- Well-characterized problems, min-max relations, approximate certificates. - LP problems in the standard form, primal and dual linear programs

- Well-characterized problems, min-max relations, approximate certificates. - LP problems in the standard form, primal and dual linear programs LP-Duality ( Approximation Algorithms by V. Vazirani, Chapter 12) - Well-characterized problems, min-max relations, approximate certificates - LP problems in the standard form, primal and dual linear programs

More information

WE start with a general discussion. Suppose we have

WE start with a general discussion. Suppose we have 646 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 43, NO. 2, MARCH 1997 Minimax Redundancy for the Class of Memoryless Sources Qun Xie and Andrew R. Barron, Member, IEEE Abstract Let X n = (X 1 ; 111;Xn)be

More information

Non-parametric Estimation of Integral Probability Metrics

Non-parametric Estimation of Integral Probability Metrics Non-parametric Estimation of Integral Probability Metrics Bharath K. Sriperumbudur, Kenji Fukumizu 2, Arthur Gretton 3,4, Bernhard Schölkopf 4 and Gert R. G. Lanckriet Department of ECE, UC San Diego,

More information

AN AFFINE EMBEDDING OF THE GAMMA MANIFOLD

AN AFFINE EMBEDDING OF THE GAMMA MANIFOLD AN AFFINE EMBEDDING OF THE GAMMA MANIFOLD C.T.J. DODSON AND HIROSHI MATSUZOE Abstract. For the space of gamma distributions with Fisher metric and exponential connections, natural coordinate systems, potential

More information

Introduction to Statistical Learning Theory

Introduction to Statistical Learning Theory Introduction to Statistical Learning Theory In the last unit we looked at regularization - adding a w 2 penalty. We add a bias - we prefer classifiers with low norm. How to incorporate more complicated

More information

Generative MaxEnt Learning for Multiclass Classification

Generative MaxEnt Learning for Multiclass Classification Generative Maximum Entropy Learning for Multiclass Classification A. Dukkipati, G. Pandey, D. Ghoshdastidar, P. Koley, D. M. V. S. Sriram Dept. of Computer Science and Automation Indian Institute of Science,

More information

Sharpening Jensen s Inequality

Sharpening Jensen s Inequality Sharpening Jensen s Inequality arxiv:1707.08644v [math.st] 4 Oct 017 J. G. Liao and Arthur Berg Division of Biostatistics and Bioinformatics Penn State University College of Medicine October 6, 017 Abstract

More information

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989),

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989), Real Analysis 2, Math 651, Spring 2005 April 26, 2005 1 Real Analysis 2, Math 651, Spring 2005 Krzysztof Chris Ciesielski 1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer

More information

Some Properties of the Augmented Lagrangian in Cone Constrained Optimization

Some Properties of the Augmented Lagrangian in Cone Constrained Optimization MATHEMATICS OF OPERATIONS RESEARCH Vol. 29, No. 3, August 2004, pp. 479 491 issn 0364-765X eissn 1526-5471 04 2903 0479 informs doi 10.1287/moor.1040.0103 2004 INFORMS Some Properties of the Augmented

More information

SEMI-INNER PRODUCTS AND THE NUMERICAL RADIUS OF BOUNDED LINEAR OPERATORS IN HILBERT SPACES

SEMI-INNER PRODUCTS AND THE NUMERICAL RADIUS OF BOUNDED LINEAR OPERATORS IN HILBERT SPACES SEMI-INNER PRODUCTS AND THE NUMERICAL RADIUS OF BOUNDED LINEAR OPERATORS IN HILBERT SPACES S.S. DRAGOMIR Abstract. The main aim of this paper is to establish some connections that exist between the numerical

More information