arxiv: v2 [math.st] 22 Aug 2018

Size: px

Start display at page:

Download "arxiv: v2 [math.st] 22 Aug 2018"

Suzan Simpson
5 years ago
Views:

1 GENERALIZED BREGMAN AND JENSEN DIVERGENCES WHICH INCLUDE SOME F-DIVERGENCES TOMOHIRO NISHIYAMA arxiv: v2 [math.st] 22 Aug 208 Abstract. In this paper, we introduce new classes of divergences by extending the definitions of the Bregman divergence and the skew Jensen divergence. These new divergence classes (g-bregman divergence and skew g-jensen divergence satisfy some properties similar to the Bregman or skew Jensen divergence. We show these g-divergences include divergences which belong to a class of f-divergence (the Hellinger distance, the chi-square divergence and the alpha-divergence in addition to the Kullback-Leibler divergence. Moreover, we derive an inequality between the skew g-jensen divergence and the g-bregman divergence and show this inequality is a generalization of Lin s inequality. Keywords: Bregman divergence, f-divergence, Jensen divergence, Kullback- Leibler divergence, Jeffreys divergence, Jensen-Shannon divergence, Hellinger distance, chi-square divergence, alpha divergence.. Introduction Divergences are functions measure the discrepancy between two points and play a key role in the field of machine learning, signal processing and so on. Given a set S and p,q S, a divergence is defined as a function D : S S R which satisfies the following properties.. D(p,q 0 for all p,q S 2. D(p,q = 0 p = q The Bregman divergence [6] B F (p,q F(p F(q p q, F(q and f- divergence [,9] D f (p q i f( pi are representative divergences defined by using a strictly convex function, where F,f : S R are strictly convex functions and f( = 0. Besides that Nielsen have introduced the skew Jensen divergence J F,α (p,q ( αf(p + αf(q F(( αp + αq for a paramter α (0, and have shown the relation between the Bregman divergence and the skew Jensen divergence [7, 4]. The Bregaman divergence and the f-divergece include well-known divergences. For example, the Kullback-Leibler divergence(kl-divergence [] is a type of the Bregman divergence or the f-divergence. On the other hand, the Hellinger distance, the χ-square divergence, the α-divergence [4,8] are a type of the f-divergence and the Jensen-Shannon divergence(js-divergence [2] is a type of the skew Jensen divergence. For probability distributions p and q, each divergence is defined as follows. KL-divergence KL(p q i p i ln( p i (

2 2 TOMOHIRO NISHIYAMA Hellinger distance H 2 (p,q i ( p i 2 (2 Pearson χ-square divergence Neyman χ-square divergence D χ 2,P(p q i (p i 2 (3 α-divergence D χ2,n(p q i D α (p q α(α i (p i 2 p i (4 (p α i q α i (5 The main goal of this paper is to introduce new classes of divergences ( gdivergence and clarify the relations among g-divergences and f-divergence. First, we introduce the g-bregman divergence and the skew g-jensen divergence by extending the definitions of the Bregman divergence and the skew Jensen divergence respectively. For these new classes of divergences, we show they satisfy some properties similar to the Bregman or the skew Jensen divergence and derive an inequality between them. Then, we show that the g-bregman divergence includes the Hellinger distance, the Pearson and Neyman χ-square divergence and the α-divergence. Furthermore, we show that the skew g-jensen divergence include the Hellinger distance and the α-divergence. These divergences have not only the properties of f-divergence but also some properties similar to the Bregman or the skew Jensen divergence. Finally we derive many inequalities by using an inequality between the g-bregman divergence and skew g-jensen divergence. 2. g-bregman divergence and skew g-jensen divergence By using a function g which has a inverse function, we define the g-bregman divergence and the skew g-jensen divergence. Definition. Let Ω R d be a convex set. For p Ω, let F(x be a strictly convex function F : Ω R. The Bregman divergence and the symmetric Bregman divergence are defined as B F (p,q F(p F(q p q, F(q (6 B g F,sym (p,q Bg F (p,q+bg F (q,p = p q, F(p F(q. (7 The symbol, indicates an inner product. For the parameter α R \{0,}, the scaled skew Jensen divergence [3,4] is defined as sj F,α (p,q ( ( αf(p+αf(q F(( αp+αq. (8 α( α Nielsen has shown the relation between the Bregman divergence and the Jensen divergence as follows [4]. B F (q,p = lim sj F,α (p,q α 0 (9 B F (p,q = lim sj F,α (p,q α (0

3 GENERALIZED BREGMAN AND JENSEN DIVERGENCES WHICH INCLUDE SOME F-DIVERGENCES3 J F,α (p,q = ( ( αb F (p,( αp+αq+αb F (q,( αp+αq α( α Definition 2. (Definition of the g-convex setlet S be a vector space over the real numbers and g be a function g : S S. If S satisfies the following condition, we call the set S g-convex set. For all p and n S and all α in the interval (0,, the point ( αg(p+αg(q also belongs to S. Definition 3. (Definition of the g-divergences Let Ω be a g-convex set and let g : Ω Ω be a function which satisfies g(p = g(q p = q. Let α be a parameter in (0,. We define the (scaled skew g-jensen divergence, the g-bregman divergence and the g-symmetric Bregman divergence as follows. ( sj g F,α (p,q sj F,α(g(p,g(q (2 B g F (p,q B F(g(p,g(q (3 B g F,sym (p,q B F,sym(g(p,g(q (4 From the definition of function g, these functions satisfy divergence properties D g (p,q = 0 g(p = g(q p = q and D g (p,q 0, where D g indicates the g-divergences. When g(p is equal to p for all p Ω, the g-divergences are consistent with original divergences. In the following, we assume that g is a function having an inverse function. The g-bregman divergence and the symmetric g-bregman divergence satisfy the linearity, the generalized law of cosines and the generalized parallelogram law as well as the Bregman and the symmetric Bregman divergence. Linearity For positive constant c and c 2, holds. B g c F +c 2F 2 (p,q = c B g F (p,q+c 2 B g F 2 (p,q (5 Proposition. (Generalized law of cosines Let Ω be a g-convex set. For points p,q,r Ω, the following equations hold. B g F (p,q = Bg F (p,r+bg F (r,q g(p g(r, F(g(q F(g(r (6 B g F,sym (p,q = Bg F,sym (p,r+bg F,sym (r,q (7 g(p g(r, F(g(q F(g(r + g(q g(r, F(g(p F(g(r It is easily proved in the same way as the Bregman divergence by putting p = g(p, q = g(q and r = g(r. Theorem. (Generalized parallelogram law Let Ω be a g-convexset. When points p,q,r,s Ω satisfy g(p+g(r = g(q+g(s, the following equations holds. B g F,sym (p,q+bg F,sym (q,r+bg F,sym (r,s+bg F,sym (s,q = Bg F,sym (p,r+bg F,sym (q,s (8 The left hand side is the sum of four sides of rectangle pqrs and the right hand side is the sum of diagonal lines of rectangle pqrs. Proof. It has been shown that the Bregman divergence satisfies the four point identity(see equation (2.3 in [6]. F(r F(s,p q = B F (q,r+b F (p,s B F (p,r B F (q,s (9

4 4 TOMOHIRO NISHIYAMA By putting p = g(p, q = g(q, r = g(r and s = g(s, we can prove the g- Bregman divergence satisfies the similar four point identity. F(g(r F(g(s,g(p g(q = B g F (q,r+bg F (p,s Bg F (p,r Bg F (q,s (20 By combining the assumption and the definition of the symmetric g-bregman divergence ((7 and (4, we have B g F,sym (r,s = Bg F (q,r+bg F (p,s Bg F (p,r Bg F (q,s. (2 Exchanging p r and q s in (2 and taking the sum with (2, the result follows. This theorem can be also proved in the same way as mentioned in the paper [5]. The same relation between the Bregman and the skew-jensen divergence also hold for the g-bregman and the skew g-jensen divergence. B g F (q,p = lim α 0 sjg F,α (p,q (22 B g F (p,q = lim α sjg F,α (p,q (23 Lemma. Let Ω be a g-convexset and let p,q be points p,q Ω. For a parameter α (0, and a point r Ω which satisfies g(r = ( αg(p + αg(q, the g-bregman divergence can be expressed as follows. sj g F,α (p,q α( α (( αb gf (p,r+αbgf (q,r It is easily proved in the same way as equation ( by putting p = g(p and q = g(q. Next, we derive an inequality between the skew g-jensen and the symmetric g-bregman divergence. Lemma 2. Let Ω be a g-convexset and let p,q be points p,q Ω. For a parameter α (0, and a point r Ω which satisfies g(r = ( αg(p+αg(q, the following equations hold. B g F (p,q = Bg F (p,r+bg F (r,q+ α α Bg F,sym (r,q (25 (24 B g F (q,p = α Bg F (r,p+bg F (q,r+ α Bg F,sym (r,p (26 Proof. We first prove (25. By the generalized law of cosines (6, we have B g F (p,q = Bg F (p,r+bg F (r,q+ g(r g(p, F(g(q F(g(r. (27 From the assumption, the equation g(r g(p = α (g(q g(r (28 α holds. By substituting (28 to (27 and using (4, the result follows. By exchanging p and n (27, we can prove (26 in the same way. Lemma 3. Let Ω be a g-convexset and let p,q be points p,q Ω. For a parameter α (0, and a point r Ω which satisfies g(r = ( αg(p+αg(q, the following equation holds. B g F,sym (p,q = α Bg F,sym (p,r+ α Bg F,sym (r,q (29 Proof. Taking the sum of (25 and (26, the result follows.

5 GENERALIZED BREGMAN AND JENSEN DIVERGENCES WHICH INCLUDE SOME F-DIVERGENCES5 Theorem 2. (Jensen-Bregaman inequality Let Ω be a g-convex set and let p,q be points in Ω. For a parameter α (0,, the skew g-jensen divergence and the symmetric g-bregman divergence, the following inequality holds. B g F,sym (p,q sjg F,α (p,q (30 Proof. Let r Ω be a point which satisfies g(r = ( αg(p + αg(q. From (24 and using B g F (p,q Bg F,sym (p,q, we have sj g F,α (p,q = α Bg F (p,r+ α Bg F (q,r α Bg F,sym (p,r+ α Bg F,sym (q,r. Using Lemma 3., we have (3 sj g F,α (p,q α Bg F,sym (p,r+ α Bg F,sym (r,q = Bg F,sym (p,q (32 3. Examples We show some examples of the g-divergences. We denote the g-bregman divergence as GBD, the symmetric g-bregman divergence as SGBD, the skew g-jensen divergence as SGJD and the Jensen-Bregman inequality (30 as JBinequality. Aparameterαisin(0,exceptforexample3andweuseaparameterβ (0, instead of α in example 3. From example to 6, we show the case that GBD is belong to a class of f-divergence and example 7 is the case of information geometry [2 4] and statistical mechanics. KL-divergence, Jeffreys divergence, JS-divergence Generating functions: F(p = i p ilnp i, g(p = p GBD: generalized KL-divergence KL(p q i ( p i + +p i ln( pi SGBD: Jeffreys divergence(j-divergence [0] J(p,q i (p i ln( pi SGJD: scaled skew JS-divergence ( α( α JS α(p q α( α ( αkl(p ( αp+αq+αkl(q ( αp+αq JB-ineuqlity: generalized Lin s inequality α( αj(p,q JS α (p,q. When α is equall to 2, this inequality is consistent with Lin s inequality [2]. 2 KL-divergence, Jeffreys divergence, α-divergence Generating functions: F(p = i exp(p i, g(p = lnp GBD: generalized reverse KL-divergence KL(q p i (p i + ln( qi p i SGBD: J-divergence J(p,q i (p i ln( pi SGJD: generalized α-divergenced α (p q, where D α (p q i αp i ( α JB-ineuqlity: J(p,q D α (p q 3 α-divergence Generating functions: F(p = α i p α i, g(p = p α for α R\{0,} GBD: generalized α-divergence D α (p q SGBD: symmetric α-divergence ( D α,sym (p,q D α (p q+d α (q p ( βp i +β ( ( βp α i +βqα α i SGJD: β( β( α i α(α (pα i q α i

6 6 TOMOHIRO NISHIYAMA ( JB-ineuqlity: D α,sym (p,q β( β( α i ( βp i +β ( ( βp α i +βqα i α By using the equations 2 D ( 2 (p q = H2 (p,q, 2D (2 (p q = D χ 2,P(p q and 2D ( (p q = D χ2,n(p q [8], we obtain example 4 to 6. 4 Hellinger distance Generating functions: F(p = i p2 i, g(p = p GBD: Hellinger distance H 2 (p,q i ( p i 2 SGBD: Hellinger distance 2H 2 (p,q SGJD: Hellinger distance H 2 (p,q JB-ineuqlity: trivial 5 Pearson χ-square divergence Generating functions: F(p = 2 i pi, g(p = p 2 (p i 2 GBD: Pearson χ-square divergence D χ2,p(q p i SGBD: symmetric χ-square divergence D χ 2,sym(p,q D χ 2(p q+d χ 2(q p 2 SGJD: α( α i ( ( αp i α + ( αp 2 i +αq2 i JB-ineuqlity: D χ2,sym(p,q 2 α( α i ( ( αp i α + ( αp 2 i +αq2 i 6 Neyman χ-square divergence Generating functions: F(p = i p i, g(p = p (p i 2 p i GBD: Neyman χ-square divergence D χ 2,N(q p i SGBD: symmetric χ-square divergence D χ 2,sym(p,q SGJD: α( α i (( αp p i +α i ( α+αp i JB-ineuqlity: D χ 2,sym(p,q α( α i (( αp i +α p i ( α+αp i 7 Information geometry and statistical mechanics In the dually flat space M, there exist two affine coordinates θ i, η i and two convex functions called potential ψ(θ, φ(η. For the points P,Q M, we put p i = θ i (P, = θ i (Q or p i = η i (P, = η i (Q. Generating functions: F(p = ψ(θ or F(p = φ(η, g(p = p GBD: canonical divergence D(P Q φ(η(q+ψ(θ(p i η i(qθ i (P SGBD: D A (P,Q i (η i(q η i (P(θ i (Q θ i (P SGJD: sj ψ,α (θ(p,θ(q or sj φ,α (η(p,η(q JB-ineuqlity: D A (P,Q sj ψ,α (θ(p,θ(q or D A (P,Q sj φ,α (η(p,η(q For the exponential or mixture families, GBD is equal to the KL-divergence and SGBD is equal to the J-divergence. For the exponential families, SGJD sj ψ,α (θ(p,θ(q is equal to the skew Bhattacharyya distance [5,4], and for the mixture families, SGJD sj φ,α (η(p,η(q is equal to the skew-js divergence(see [5] in detail. In the case of canonical ensemble in statistical mechanics, the probability distribution function belongs to exponential families. Because the manifold of the exponential family is a dually flat space, the same equations and inequality hold

7 GENERALIZED BREGMAN AND JENSEN DIVERGENCES WHICH INCLUDE SOME F-DIVERGENCES7 for θ = β k B T (33 η = U (34 ψ = βf (35 φ = S, (36 k B where k B is the Boltzmann constant, T is temperature, U is the internal energy, S is the entropy and F is the Helmholtz free energy. Defining aparameterαas aindex ofthe statewhich satisfy T α = ( α T 0 +α T or U α = ( αu 0 +αu, JB-inequality can be written as follows. α( α (T T 0 (U U 0 T 0 T F α T α ( α F 0 T 0 α F T (37 α( α (T T 0 (U U 0 T 0 T S α ( αs 0 αs (38 4. Conclusion We have introduced the g-bregman divergence and the skew g-jensen divergence by extending the definitions of the Bregman and the skew Jensen divergence respectively. We have shown they satisfy some properties similar to the Bregman and the skew-jensen divergence. Furthermore, we have derived an inequality between the symmetric g-bregman divergence and the skew g-jensen divergence and we have shown they include divergences which belong to a class of f-divergence. If the relationship between these three classes of divergence are studied in more detail, it is expected to proceed applications to various fields including machine learning. References [] Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological, pages 3 42, 966. [2] Shun-ichi Amari. Information geometry and its applications. Springer, 206. [3] Shun-ichi Amari and Andrzej Cichocki. Information geometry of divergence functions. Bulletin of the Polish Academy of Sciences: Technical Sciences, 58(:83 95, 200. [4] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 9. American Mathematical Soc., [5] Anil Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc., 35:99 09, 943. [6] Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR computational mathematics and mathematical physics, 7(3:200 27, 967. [7] Jacob Burbea and C Radhakrishna Rao. On the convexity of some divergence measures based on entropy functions. Technical report, PITTSBURGH UNIV PA INST FOR STATISTICS AND APPLICATIONS, 980. [8] Andrzej Cichocki and Shun-ichi Amari. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy, 2(6: , 200. [9] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. studia scientiarum Mathematicarum Hungarica, 2:229 38, 967. [0] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. A, 86(007:453 46, 946. [] Solomon Kullback. Information theory and statistics. Courier Corporation, 997. [2] Jianhua Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory, 37(:45 5, 99. [3] Frank Nielsen. A family of statistical symmetric divergences based on jensen s inequality. arxiv preprint arxiv: , 200.

8 8 TOMOHIRO NISHIYAMA [4] Frank Nielsen and Sylvain Boltz. The burbea-rao and bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8: , 20. [5] Tomohiro Nishiyama. Divergence functions in dually flat spaces and their properties. arxiv preprint arxiv: , 208. [6] Simeon Reich and Shoham Sabach. Two strong convergence theorems for bregman strongly nonexpansive operators in reflexive banach spaces. Nonlinear Analysis: Theory, Methods & Applications, 73(:22 35, 200.

On the Chi square and higher-order Chi distances for approximating f-divergences

On the Chi square and higher-order Chi distances for approximating f-divergences Frank Nielsen, Senior Member, IEEE and Richard Nock, Nonmember Abstract We report closed-form formula for calculating the