On the Chi square and higher-order Chi distances for approximating f-divergences
Frank Nielsen, Senior Member, IEEE, and Richard Nock, Nonmember

Abstract—We report closed-form formulas for calculating the Chi square and higher-order Chi distances between statistical distributions belonging to the same exponential family with affine natural space, and instantiate those formulas for the Poisson and isotropic Gaussian families. We then describe an analytic formula for the f-divergences based on Taylor expansions and relying on an extended class of Chi-type distances.

Index Terms—statistical divergences, chi square distance, Kullback-Leibler divergence, Taylor series, exponential families.

I. INTRODUCTION

A. Statistical divergences: f-divergences

Measuring the similarity or dissimilarity between two probability measures is met ubiquitously in signal processing. Some usual distances are the Pearson χ²_P and Neyman χ²_N chi square distances, and the Kullback-Leibler divergence [1], defined respectively by

χ²_P(X₁ : X₂) = ∫ (x₂(x) − x₁(x))² / x₁(x) dν(x),   (1)

χ²_N(X₁ : X₂) = ∫ (x₁(x) − x₂(x))² / x₂(x) dν(x),   (2)

KL(X₁ : X₂) = ∫ x₁(x) log (x₁(x) / x₂(x)) dν(x),   (3)

Frank Nielsen is with Sony Computer Science Laboratories, Inc., 3-14-13 Higashi Gotanda, 141-0022 Shinagawa-ku, Tokyo, Japan, nielsen@csl.sony.co.jp. Richard Nock is with UAG CEREGMIA, Martinique, France, rnock@martinique.univ-ag.fr.
where X₁ and X₂ are probability measures absolutely continuous with respect to a reference measure ν, and x₁ and x₂ denote their Radon-Nikodym densities, respectively. Those dissimilarity measures M are termed divergences, in contrast with metric distances, since they are oriented distances (i.e., M(X₁ : X₂) ≠ M(X₂ : X₁)) that do not satisfy the triangle inequality. In the 1960s, many of those divergences were unified using the generic framework of f-divergences [2], I_f, defined for an arbitrary functional f:

I_f(X₁ : X₂) = ∫ x₁(x) f(x₂(x) / x₁(x)) dν(x) ≥ 0,   (4)

where f is a convex function f : (0, ∞) ⊆ dom(f) → [0, ∞] (often normalized so that f(1) = 0). Those f-divergences can always be symmetrized by taking:

S_f(X₁ : X₂) = I_f(X₁ : X₂) + I_{f*}(X₁ : X₂),   (5)

with f*(u) = u f(1/u), and I_{f*}(X₁ : X₂) = I_f(X₂ : X₁). See Table I for a list of common f-divergences with their corresponding generators f. In information theory, f-divergences are characterized as the unique family of convex divergences that satisfies the information monotonicity property [3]: given any transition probability τ that transforms measures X₁ and X₂ to X₁τ and X₂τ (by a Markov morphism τ), we have I_f(X₁ : X₂) ≥ I_f(X₁τ : X₂τ), with equality if and only if the transition τ is induced from a sufficient statistic [3].

Note that f-divergences may evaluate to infinity (that is, unbounded I_f) when the integral diverges, even if x₁, x₂ > 0 on the support X. For example, let X = (0, 1) be the unit interval, and consider the two densities (with respect to the Lebesgue measure ν_L) x₁(x) = 1 and x₂(x) = e^{−1/x}/c, with c = ∫₀¹ e^{−1/x} dx ≈ 0.148 the normalizing constant. Consider the Kullback-Leibler divergence (the f-divergence for f(u) = −log u):

KL(X₁ : X₂) = ∫₀¹ x₁(x) log (x₁(x) / x₂(x)) dν_L(x)   (6)

            = log c + ∫₀¹ (1/x) dν_L(x) = ∞.   (7)

Beware that sometimes the χ²_N and χ²_P definitions are inverted in the literature. This may stem from an alternative definition of f-divergences, I'_f(X₁ : X₂) = ∫ x₂(x) f(x₁(x)/x₂(x)) dν(x) = I_{f*}(X₁ : X₂).
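Before turning to closed-form formulas, the definitions above are easy to check numerically in the finite discrete case. The following minimal Python sketch (the two distributions and all helper names are our own, not from the paper) evaluates Eq. (4) for several generators and verifies the conjugate-generator identity I_{f*}(X₁ : X₂) = I_f(X₂ : X₁):

```python
import numpy as np

def f_divergence(p, q, f):
    # I_f(X1 : X2) = sum_x x1(x) f(x2(x) / x1(x))   (Eq. (4), discrete case)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * f(q / p)))

kl_gen = lambda u: -np.log(u)        # Kullback-Leibler generator f(u) = -log u
conj   = lambda u: u * np.log(u)     # conjugate f*(u) = u f(1/u) = u log u

p = [0.1, 0.2, 0.3, 0.4]             # density x1 (hypothetical example)
q = [0.25, 0.25, 0.25, 0.25]         # density x2

kl      = f_divergence(p, q, kl_gen)
pearson = f_divergence(p, q, lambda u: (u - 1) ** 2)      # chi^2_P
neyman  = f_divergence(p, q, lambda u: (1 - u) ** 2 / u)  # chi^2_N
print(kl, pearson, neyman)
```

As expected, the conjugate generator swaps the arguments: `f_divergence(p, q, conj)` coincides with `f_divergence(q, p, kl_gen)`, both evaluating the reverse Kullback-Leibler divergence.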
f-divergence I_f(P : Q) | Formula | Generator f(u)
Total variation (metric) | ½ ∫ |p(x) − q(x)| dν(x) | ½ |u − 1|
Squared Hellinger | ∫ (√p(x) − √q(x))² dν(x) | (√u − 1)²
Pearson χ²_P | ∫ (q(x) − p(x))² / p(x) dν(x) | (u − 1)²
Neyman χ²_N | ∫ (p(x) − q(x))² / q(x) dν(x) | (1 − u)²/u
Pearson-Vajda χ^k_{λ,P} | ∫ (q(x) − λp(x))^k / p^{k−1}(x) dν(x) | (u − λ)^k
Pearson-Vajda |χ|^k_{λ,P} | ∫ |q(x) − λp(x)|^k / p^{k−1}(x) dν(x) | |u − λ|^k
Kullback-Leibler | ∫ p(x) log (p(x)/q(x)) dν(x) | −log u
reverse Kullback-Leibler | ∫ q(x) log (q(x)/p(x)) dν(x) | u log u
α-divergence | (4/(1−α²)) (1 − ∫ p^{(1−α)/2}(x) q^{(1+α)/2}(x) dν(x)) | (4/(1−α²)) (1 − u^{(1+α)/2})
Jensen-Shannon | ∫ (p(x) log (2p(x)/(p(x)+q(x))) + q(x) log (2q(x)/(p(x)+q(x)))) dν(x) | (u + 1) log (2/(u + 1)) + u log u

TABLE I: Some common f-divergences I_f with their corresponding generators. Except the total variation, f-divergences are not metrics [4].

B. Stochastic approximations of f-divergences

To bypass the integral evaluation of I_f of Eq. 4 (often mathematically intractable), we can carry out a stochastic integration:

Î_f(X₁ : X₂) = (1/(2n)) Σ_{i=1}^{n} ( f(x₂(s_i)/x₁(s_i)) + (x₁(t_i)/x₂(t_i)) f(x₂(t_i)/x₁(t_i)) ),   (8)

with s₁, ..., s_n and t₁, ..., t_n IID sampled from X₁ and X₂, respectively. Those approximations, although converging to the true values when n → ∞, are time consuming and yield poor results in practice, especially when the dimension of the observation space X is large. We therefore concentrate on obtaining exact or arbitrarily fine approximation formulas for f-divergences by considering a restricted class of exponential families.

C. Exponential families

Let ⟨x, y⟩ = x^⊤ y denote the inner product for x, y ∈ X. An exponential family [6] is a set of probability measures E_F = {P_θ}_θ dominated by a measure ν, having their Radon-Nikodym densities p_θ expressed canonically as:
p_θ(x) = exp(⟨t(x), θ⟩ − F(θ) + k(x)),   (9)

for θ belonging to the natural parameter space:

Θ = { θ ∈ R^D : ∫_X p_θ(x) dν(x) = 1 }.

Since log ∫_X p_θ(x) dν(x) = log 1 = 0, it follows that F(θ) = log ∫_X exp(⟨t(x), θ⟩ + k(x)) dν(x). For full regular families [6], it can be proved that the function F is strictly convex and differentiable over the open convex set Θ. Function F characterizes the family and bears different names in the literature (partition function, log-normalizer, or cumulant function), and the parameter θ (the natural parameter) defines the member P_θ of the family E_F. Let D = dim(Θ) denote the dimension of Θ, the order of the family. The map k(x) : X → R is an auxiliary function defining a carrier measure ξ with dξ(x) = e^{k(x)} dν(x). In practice, we often consider the Lebesgue measure ν_L defined over the Borel σ-algebra E = B(R^d) of R^d for continuous distributions (e.g., Gaussian), or the counting measure ν_c defined on the power set σ-algebra E = 2^X for discrete distributions (e.g., Poisson or multinomial families). The term t(x) is a measure mapping called the sufficient statistic [6]. Table II shows the canonical decomposition for the Poisson and isotropic Gaussian families.

Notice that the Kullback-Leibler divergence between members X₁ ∼ E_F(θ₁) and X₂ ∼ E_F(θ₂) of the same exponential family amounts to computing a Bregman divergence on swapped natural parameters [8]: KL(X₁ : X₂) = B_F(θ₂ : θ₁), where B_F(θ : θ') = F(θ) − F(θ') − (θ − θ')^⊤ ∇F(θ'), and ∇F denotes the gradient of F.

II. χ² AND HIGHER-ORDER χ^k DISTANCES

A. A closed-form formula

When X₁ and X₂ belong to the same restricted exponential family E_F, we obtain the following result:

Lemma 1: The Pearson/Neyman Chi square distance between X₁ ∼ E_F(θ₁) and X₂ ∼ E_F(θ₂) is given by:

χ²_P(X₁ : X₂) = e^{F(2θ₂ − θ₁) − (2F(θ₂) − F(θ₁))} − 1,   (10)

χ²_N(X₁ : X₂) = e^{F(2θ₁ − θ₂) − (2F(θ₁) − F(θ₂))} − 1,   (11)

provided that 2θ₂ − θ₁ and 2θ₁ − θ₂ belong to the natural parameter space Θ. This implies that the chi square distances are then always bounded.
The proof relies on the following lemma:
Lemma 2: The integral I_{p,q} = ∫ x₁(x)^p x₂(x)^q dν(x), for X₁ ∼ E_F(θ₁) and X₂ ∼ E_F(θ₂) with p ∈ R and p + q = 1, converges and equals:

I_{p,q} = e^{F(pθ₁ + qθ₂) − (pF(θ₁) + qF(θ₂))},   (12)

provided the natural parameter space Θ is affine.

Proof: Let us calculate the integral I_{p,q}:

I_{p,q} = ∫ exp(p(⟨t(x), θ₁⟩ − F(θ₁) + k(x))) exp(q(⟨t(x), θ₂⟩ − F(θ₂) + k(x))) dν(x)
        = ∫ e^{⟨t(x), pθ₁ + qθ₂⟩ − (pF(θ₁) + qF(θ₂)) + k(x)} dν(x)
        = e^{F(pθ₁ + qθ₂) − (pF(θ₁) + qF(θ₂))} ∫ p_{pθ₁ + qθ₂}(x) dν(x),

where the carrier term e^{k(x)} appears with unit power since p + q = 1. When pθ₁ + qθ₂ ∈ Θ, we have ∫ p_{pθ₁ + qθ₂}(x) dν(x) = 1, hence the result. □

To prove Lemma 1, we rewrite:

χ²_P(X₁ : X₂) = ∫ ( x₂(x)²/x₁(x) − 2x₂(x) + x₁(x) ) dν(x)   (13)
             = ∫ ( x₂(x)²/x₁(x) ) dν(x) − 1,   (14)

and apply Lemma 2 for p = −1 and q = 2 (checking that p + q = 1). The closed-form formula for the Neyman chi square follows from the fact that χ²_N(X₁ : X₂) = χ²_P(X₂ : X₁).

Thus when the natural parameter space Θ is affine, the Pearson/Neyman Chi square distances and their symmetrization χ²_P + χ²_N between members of the same exponential family are available in closed form. Examples of such families are the Poisson, binomial, multinomial, or isotropic Gaussian families, to name a few. Let us call those families affine exponential families for short. The canonical decompositions of usual affine exponential families are reported in Table II. Note that a formula for the α-divergences between members of the same exponential family was reported in [8] for α ∈ [0, 1]: in that case, αθ₁ + (1 − α)θ₂ always belongs to the open convex natural space Θ (here, by contrast, p ranges over R).

B. The Poisson and isotropic Gaussian cases

As reported in Table II, the Poisson and isotropic Gaussian exponential families have affine natural parameter spaces Θ.
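As a numerical sanity check (our own sketch, not part of the paper), Lemma 2 can be instantiated for the Poisson family, where θ = log λ and F(θ) = e^θ, and the resulting closed-form χ²_P = I_{−1,2} − 1 of Lemma 1 compared against a direct summation over the support:

```python
import math

# Sketch of Lemma 2 for the Poisson family: F(theta) = exp(theta),
# theta = log(lambda). All names below are ours, not the paper's.
def I_pq(p, q, theta1, theta2, F):
    # I_{p,q} = exp(F(p*theta1 + q*theta2) - (p*F(theta1) + q*F(theta2)))
    return math.exp(F(p * theta1 + q * theta2) - (p * F(theta1) + q * F(theta2)))

lam1, lam2 = 1.0, 2.0
chi2_P = I_pq(-1, 2, math.log(lam1), math.log(lam2), math.exp) - 1.0  # Lemma 1

# cross-check by directly summing (x2 - x1)^2 / x1 over the support
def pmf(x, lam):
    # Poisson probability mass, computed in log-space for stability
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

direct = sum((pmf(x, lam2) - pmf(x, lam1)) ** 2 / pmf(x, lam1) for x in range(100))
print(chi2_P, direct)  # both equal e - 1 for lam1 = 1, lam2 = 2
```

Truncating the support at 100 is harmless here since the summand decays super-exponentially.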
Poi(λ): p(x|λ) = λ^x e^{−λ} / x!, λ > 0, x ∈ {0, 1, ...}
Nor_I(µ): p(x|µ) = (2π)^{−d/2} e^{−½(x−µ)^⊤(x−µ)}, µ ∈ R^d, x ∈ R^d

Family | θ | Θ | F(θ) | k(x) | t(x) | ν
Poisson | log λ | R | e^θ | −log x! | x | ν_c
Iso. Gaussian | µ | R^d | ½ θ^⊤θ + (d/2) log 2π | −½ x^⊤x | x | ν_L

TABLE II: Examples of exponential families with affine natural space Θ. ν_c denotes the counting measure and ν_L the Lebesgue measure.

The Poisson family. For P₁ ∼ Poi(λ₁) and P₂ ∼ Poi(λ₂), we have:

χ²_P(λ₁ : λ₂) = exp( λ₂²/λ₁ − 2λ₂ + λ₁ ) − 1.   (15)

To illustrate this formula with a numerical example, consider X₁ ∼ Poi(1) and X₂ ∼ Poi(2). Then it comes that χ²_P(P₁ : P₂) = e − 1 ≈ 1.718.

The isotropic Normal family. For N₁ ∼ Nor_I(µ₁) and N₂ ∼ Nor_I(µ₂), we have according to Table II:

χ²_P(µ₁ : µ₂) = e^{(µ₂ − µ₁)^⊤(µ₂ − µ₁)} − 1.

In that case the χ² distance is symmetric:

χ²_P(µ₁ : µ₂) = e^{(µ₂ − µ₁)^⊤(µ₂ − µ₁)} − 1 = χ²_N(µ₁ : µ₂).   (16)

C. Extensions to higher-order Vajda χ^k divergences

The higher-order Pearson-Vajda χ^k_P and |χ|^k_P distances [5], defined by:

χ^k_P(X₁ : X₂) = ∫ (x₂(x) − x₁(x))^k / x₁(x)^{k−1} dν(x),   (17)

|χ|^k_P(X₁ : X₂) = ∫ |x₂(x) − x₁(x)|^k / x₁(x)^{k−1} dν(x),   (18)

are f-divergences for the generators (u − 1)^k and |u − 1|^k, respectively (with |χ|^k_P(X₁ : X₂) ≥ |χ^k_P(X₁ : X₂)|). When k = 1, we have χ¹_P(X₁ : X₂) = ∫ (x₂(x) − x₁(x)) dν(x) = 0 (i.e., the divergence is never discriminative), and |χ|¹_P(X₁, X₂) is twice the total variation distance (the only metric
f-divergence [4]). χ⁰_P is the unit constant. Observe that the χ^k_P distance may be negative for odd k (a signed distance), but not |χ|^k_P. We can compute the χ^k_P term explicitly by performing a binomial expansion:

Lemma 3: The (signed) χ^k_P distance between members X₁ ∼ E_F(θ₁) and X₂ ∼ E_F(θ₂) of the same affine exponential family is (for k ∈ N) always bounded and equal to:

χ^k_P(X₁ : X₂) = Σ_{j=0}^{k} C(k,j) (−1)^{k−j} e^{F((1−j)θ₁ + jθ₂)} / e^{(1−j)F(θ₁) + jF(θ₂)}.   (19)

Proof:

χ^k_P(X₁ : X₂) = ∫ (x₂(x) − x₁(x))^k / x₁(x)^{k−1} dν(x)   (20)
  = ∫ Σ_{j=0}^{k} C(k,j) (−1)^{k−j} x₂(x)^j x₁(x)^{k−j} / x₁(x)^{k−1} dν(x)   (21)
  = Σ_{j=0}^{k} C(k,j) (−1)^{k−j} ∫ x₁(x)^{1−j} x₂(x)^j dν(x).   (22)

Then the proof follows from Lemma 2, which shows that I_{1−j,j}(X₁ : X₂) = ∫ x₁(x)^{1−j} x₂(x)^j dν(x) = e^{F((1−j)θ₁ + jθ₂)} / e^{(1−j)F(θ₁) + jF(θ₂)}. □

For Poisson/Normal distributions, we get:

χ^k_P(λ₁ : λ₂) = Σ_{j=0}^{k} C(k,j) (−1)^{k−j} e^{λ₁^{1−j} λ₂^j − ((1−j)λ₁ + jλ₂)},   (23)

χ^k_P(µ₁ : µ₂) = Σ_{j=0}^{k} C(k,j) (−1)^{k−j} e^{½ j(j−1) (µ₂ − µ₁)^⊤(µ₂ − µ₁)}.   (24)

Observe that for λ₁ = λ₂ = λ, we have χ^k_P(λ : λ) = Σ_{j=0}^{k} (−1)^{k−j} C(k,j) e^{λ−λ} = (1 − 1)^k = 0 for k ∈ N, as expected. The χ^k_P value is always bounded. As a sanity check, consider the binomial expansion for k = 2:

χ²_P(λ₁ : λ₂) = e^{λ₁ − λ₁} − 2 e^{λ₂ − λ₂} + e^{λ₂²/λ₁ − 2λ₂ + λ₁} = e^{λ₂²/λ₁ − 2λ₂ + λ₁} − 1,

in accordance with Eq. 15. Consider a numerical example: let λ₁ = 0.6 and λ₂ = 0.3. Then χ²_P ≈ 0.162, χ³_P ≈ −0.031, χ⁴_P ≈ 0.043, χ⁵_P ≈ −0.021, and so on. This numerical example illustrates the alternating sign of those χ^k-type signed distances.
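The binomial-sum formula of Lemma 3 is straightforward to implement. The sketch below (our own code, using the Poisson instantiation) reproduces the alternating-sign behavior observed above for λ₁ = 0.6 and λ₂ = 0.3:

```python
import math
from math import comb

def chi_k_poisson(k, lam1, lam2):
    # Lemma 3 instantiated for the Poisson family:
    # sum_j C(k,j) (-1)^{k-j} exp(lam1^{1-j} lam2^j - ((1-j)*lam1 + j*lam2))
    return sum(comb(k, j) * (-1) ** (k - j)
               * math.exp(lam1 ** (1 - j) * lam2 ** j
                          - ((1 - j) * lam1 + j * lam2))
               for j in range(k + 1))

lam1, lam2 = 0.6, 0.3
vals = [chi_k_poisson(k, lam1, lam2) for k in range(2, 6)]
print(vals)  # signs alternate: +, -, +, -
```

For k = 2 the sum collapses to e^{λ₂²/λ₁ − 2λ₂ + λ₁} − 1 = e^{0.15} − 1, matching the closed-form Pearson chi square.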
III. f-DIVERGENCES FROM TAYLOR SERIES

Recall that the f-divergence defined for a generator f is I_f(X₁ : X₂) = ∫ x₁(x) f(x₂(x)/x₁(x)) dν(x). Assuming f analytic, we use the Taylor expansion about a point λ:

f(x) = f(λ) + f'(λ)(x − λ) + ½ f''(λ)(x − λ)² + ... = Σ_{i=0}^{∞} (1/i!) f^{(i)}(λ) (x − λ)^i,

the power series expansion of f, for λ ∈ int(dom(f^{(i)})) for all i.

Lemma 4 (extends Theorem 1 of [5]): When bounded, the f-divergence I_f can be expressed as a power series of higher-order Chi distances:

I_f(X₁ : X₂) = ∫ x₁(x) Σ_{i=0}^{∞} (1/i!) f^{(i)}(λ) (x₂(x)/x₁(x) − λ)^i dν(x)
            = Σ_{i=0}^{∞} (1/i!) f^{(i)}(λ) χ^i_{λ,P}(X₁ : X₂),   (25)

where in the second equality we have swapped the integral and the sum according to Fubini's theorem, since we assumed that I_f < ∞, and χ^i_{λ,P}(X₁ : X₂) is a generalization of the χ^i_P distance (see Table I), defined by:

χ^i_{λ,P}(X₁ : X₂) = ∫ (x₂(x) − λ x₁(x))^i / x₁(x)^{i−1} dν(x),   (26)

with χ⁰_{λ,P}(X₁ : X₂) = 1 by convention.

Choosing λ = 1 ∈ int(dom(f^{(i)})), we approximate the f-divergence as follows (Theorem 1 of [5]):

| I_f(X₁ : X₂) − Σ_{k=0}^{s} (1/k!) f^{(k)}(1) χ^k_P(X₁ : X₂) | ≤ (1/(s+1)!) ‖f^{(s+1)}‖_∞ (M − m)^{s+1},   (27)

where ‖f^{(s+1)}‖_∞ = sup_{t ∈ [m,M]} |f^{(s+1)}(t)| and m ≤ p(x)/q(x) ≤ M. Notice that by assuming the boundedness of the density ratio p/q, we ensure that I_f < ∞.

Choosing λ = 0 (whenever 0 ∈ int(dom(f^{(i)}))) and affine exponential families, we get the f-divergence in a much simpler analytic expression:

I_f(X₁ : X₂) = Σ_{i=0}^{∞} (1/i!) f^{(i)}(0) I_{1−i,i}(θ₁ : θ₂),   (28)

I_{1−i,i}(θ₁ : θ₂) = e^{F(iθ₂ + (1−i)θ₁)} / e^{iF(θ₂) + (1−i)F(θ₁)}.   (29)
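To make Lemma 4 concrete, here is a small discrete-case sketch (our own; the two distributions are hypothetical) for the Pearson generator f(u) = (u − 1)², whose Taylor expansion about any point λ has only three nonzero terms, so the power series is finite and can be summed exactly:

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])      # density x1 (hypothetical)
q = np.array([0.25, 0.25, 0.25, 0.25])  # density x2

def chi(i, lam):
    # generalized Pearson-Vajda distance chi^i_{lam,P} of Lemma 4
    if i == 0:
        return 1.0  # by convention
    return float(np.sum((q - lam * p) ** i / p ** (i - 1)))

# f(u) = (u - 1)^2 expanded about lam:
# f(lam) = (lam-1)^2, f'(lam) = 2(lam-1), f''(lam)/2! = 1; higher terms vanish.
lam = 0.5
series = (lam - 1) ** 2 * chi(0, lam) + 2 * (lam - 1) * chi(1, lam) + chi(2, lam)

direct = float(np.sum((q - p) ** 2 / p))   # chi^2_P computed directly
print(series, direct)
```

The two printed values agree for any choice of λ, since the expansion of a quadratic generator is exact.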
Lemma 5: The bounded f-divergences between members of the same affine exponential family can be computed as an equivalent power series whenever f is analytic.

Corollary 1: A second-order Taylor expansion yields I_f(X₁ : X₂) ≈ f(1) + f'(1) χ¹_N(X₁ : X₂) + ½ f''(1) χ²_N(X₁ : X₂). Since f(1) = 0 (f can always be renormalized) and χ¹_N(X₁ : X₂) = 0, it follows that

I_f(X₁ : X₂) ≈ (f''(1)/2) χ²_N(X₁ : X₂),   (30)

and reciprocally χ²_N(X₁ : X₂) ≈ (2/f''(1)) I_f(X₁ : X₂) (f''(1) > 0 follows from the strict convexity of the generator). When f(u) = u log u, this yields the well-known approximation [1]:

½ χ²_P(X₁ : X₂) ≈ KL(X₁ : X₂).   (31)

For affine exponential families, we can then plug in the closed-form formula of Lemma 1 to get a simple approximation formula for I_f. For example, consider the Jensen-Shannon divergence (Table I), with f''(u) = 1/(u(u+1)) and f''(1) = ½. It follows that I_JS(X₁ : X₂) ≈ ¼ χ²_N(X₁ : X₂). (For Poisson distributions with close rates, e.g., λ₁ = 5 and λ₂ = 5.1, the approximation is accurate to within a few percent of relative error.)

A. Example 1: χ² revisited

Let us start with a sanity check for the χ² distance between Poisson distributions. The Pearson chi square distance is an f-divergence for f(u) = (u − 1)², with f'(u) = 2(u − 1), f''(u) = 2, and f^{(i)}(u) = 0 for i > 2. Thus f^{(0)}(0) = 1, f^{(1)}(0) = −2, f^{(2)}(0) = 2, and f^{(i)}(0) = 0 for i > 2. Recall that I_{1−i,i}(θ₁ : θ₂) = e^{F(iθ₂ + (1−i)θ₁) − (iF(θ₂) + (1−i)F(θ₁))} = exp(λ₁^{1−i} λ₂^i − iλ₂ − (1−i)λ₁). Note that I_{1−i,i}(λ, λ) = e⁰ = 1 for all i. Thus, from Eq. (28), we get: I_f(X₁ : X₂) = I_{1,0} − 2 I_{0,1} + I_{−1,2}, with I_{1,0} = I_{0,1} = 1 and I_{−1,2} = e^{λ₂²/λ₁ − 2λ₂ + λ₁}. Thus we obtain I_f(X₁ : X₂) = −1 + e^{λ₂²/λ₁ − 2λ₂ + λ₁}, in accordance with Eq. 15.

B. Example 2: Kullback-Leibler divergence

By choosing f(u) = −log u, we obtain the Kullback-Leibler divergence (see Table I). We have f^{(i)}(u) = (−1)^i (i−1)! u^{−i} for i ≥ 1, and hence f^{(i)}(1)/i! = (−1)^i/i for i ≥ 1 (with f(1) = 0). Since χ¹_{1,P}(X₁ : X₂) = 0, it follows that:

KL(X₁ : X₂) = Σ_{i=2}^{∞} ((−1)^i / i) χ^i_{1,P}(X₁ : X₂).   (32)
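The power series for the Kullback-Leibler divergence derived above can be checked numerically. The sketch below (our own code, with hypothetical rates) computes the χ^i terms from the Poisson instantiation of Lemma 3 and compares truncations of the series against the exact divergence, which for Poisson rates equals λ₁ log(λ₁/λ₂) + λ₂ − λ₁ (the Bregman divergence on swapped natural parameters mentioned in Section I):

```python
import math
from math import comb

lam1, lam2 = 0.6, 0.3  # hypothetical Poisson rates

def chi_k(k):
    # higher-order chi distance between Poissons (Lemma 3)
    return sum(comb(k, j) * (-1) ** (k - j)
               * math.exp(lam1 ** (1 - j) * lam2 ** j
                          - ((1 - j) * lam1 + j * lam2))
               for j in range(k + 1))

def kl_series(s):
    # truncation of the series KL = sum_{i >= 2} (-1)^i / i * chi^i
    return sum((-1) ** i / i * chi_k(i) for i in range(2, s + 1))

exact = lam1 * math.log(lam1 / lam2) + lam2 - lam1   # closed-form KL
print([round(kl_series(s), 4) for s in (2, 3, 4, 30)], round(exact, 4))
```

The partial sums increase monotonically towards the exact value, every term of the series being nonnegative here since the sign of χ^i alternates with i.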
Note that in the case of the KL divergence between members of the same exponential family, the divergence can be expressed in a simpler closed form using a Bregman divergence [8] on the swapped natural parameters. For example, consider Poisson distributions with λ₁ = 0.6 and λ₂ = 0.3. The Kullback-Leibler divergence computed from the equivalent Bregman divergence yields KL ≈ 0.1158, the stochastic evaluation of Eq. 8 with n = 10⁶ yields KL ≈ 0.1156, and the truncation of Eq. 32 to the first s terms yields the increasing sequence ≈ 0.0809 (s = 2), 0.0911 (s = 3), 0.1018 (s = 4), ..., slowly converging towards the closed-form value.

IV. CONCLUDING REMARKS

We investigated the calculation of statistical f-divergences between members of the same exponential family with affine natural space. We first reported a generic closed-form formula for the Pearson/Neyman χ² and Vajda χ^k-type distances, and instantiated that formula for the Poisson and the isotropic Gaussian affine exponential families. We then considered the Taylor expansion of the generator f at any given point λ to deduce an analytic expression of f-divergences using Pearson-Vajda χ^k_{λ,P} distances. A second-order Taylor approximation yielded a fast estimation of f-divergences. This framework shall find potential applications in signal processing and when designing inequality bounds between divergences. A Java™ package that illustrates the lemmas numerically is available online.

REFERENCES

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY, USA: Wiley-Interscience, 1991.
[2] F. Österreicher, "Distances based on the perimeter of the risk set of a testing problem," Austrian Journal of Statistics, vol. 42, no. 1, pp. 3-19, 2013.
[3] S. Amari and H. Nagaoka, Methods of Information Geometry. Oxford University Press / AMS, 2000.
[4] M. Khosravifard, D. Fooladivanda, and T. A. Gulliver, "Confliction of the convexity and metric properties in f-divergences," IEICE Trans. Fundam. Electron. Commun. Comput. Sci.,
vol. E90-A, no. 9, 2007.
[5] N. Barnett, P. Cerone, S. Dragomir, and A. Sofo, "Approximating Csiszár f-divergence by the use of Taylor's formula with integral remainder," Mathematical Inequalities & Applications, vol. 5, no. 3, 2002.
[6] L. D. Brown, Fundamentals of Statistical Exponential Families: with Applications in Statistical Decision Theory. Hayward, CA, USA: Institute of Mathematical Statistics, 1986.
[7] A. Cichocki and S.-i. Amari, "Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities," Entropy, vol. 12, no. 6, 2010.
[8] F. Nielsen and S. Boltz, "The Burbea-Rao and Bhattacharyya centroids," IEEE Transactions on Information Theory, vol. 57, no. 8, 2011.
More informationPart II Probability and Measure
Part II Probability and Measure Theorems Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly)
More informationLatent Variable Models and EM algorithm
Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic
More informationChapter 2: Fundamentals of Statistics Lecture 15: Models and statistics
Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:
More informationContents 1. Introduction 1 2. Main results 3 3. Proof of the main inequalities 7 4. Application to random dynamical systems 11 References 16
WEIGHTED CSISZÁR-KULLBACK-PINSKER INEQUALITIES AND APPLICATIONS TO TRANSPORTATION INEQUALITIES FRANÇOIS BOLLEY AND CÉDRIC VILLANI Abstract. We strengthen the usual Csiszár-Kullback-Pinsker inequality by
More informationHypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3
Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest
More information6.1 Variational representation of f-divergences
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 6: Variational representation, HCR and CR lower bounds Lecturer: Yihong Wu Scribe: Georgios Rovatsos, Feb 11, 2016
More informationMTH 404: Measure and Integration
MTH 404: Measure and Integration Semester 2, 2012-2013 Dr. Prahlad Vaidyanathan Contents I. Introduction....................................... 3 1. Motivation................................... 3 2. The
More informationMATH MEASURE THEORY AND FOURIER ANALYSIS. Contents
MATH 3969 - MEASURE THEORY AND FOURIER ANALYSIS ANDREW TULLOCH Contents 1. Measure Theory 2 1.1. Properties of Measures 3 1.2. Constructing σ-algebras and measures 3 1.3. Properties of the Lebesgue measure
More informationDivergence measure of intuitionistic fuzzy sets
Divergence measure of intuitionistic fuzzy sets Fuyuan Xiao a, a School of Computer and Information Science, Southwest University, Chongqing, 400715, China Abstract As a generation of fuzzy sets, the intuitionistic
More informationConcentration Inequalities for Density Functionals. Shashank Singh
Concentration Inequalities for Density Functionals by Shashank Singh Submitted to the Department of Mathematical Sciences in partial fulfillment of the requirements for the degree of Master of Science
More informationResearch Article On Some Improvements of the Jensen Inequality with Some Applications
Hindawi Publishing Corporation Journal of Inequalities and Applications Volume 2009, Article ID 323615, 15 pages doi:10.1155/2009/323615 Research Article On Some Improvements of the Jensen Inequality with
More informationGoodness of Fit Test and Test of Independence by Entropy
Journal of Mathematical Extension Vol. 3, No. 2 (2009), 43-59 Goodness of Fit Test and Test of Independence by Entropy M. Sharifdoost Islamic Azad University Science & Research Branch, Tehran N. Nematollahi
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationChapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1)
Chapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1) Detection problems can usually be casted as binary or M-ary hypothesis testing problems. Applications: This chapter: Simple hypothesis
More information2882 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 6, JUNE A formal definition considers probability measures P and Q defined
2882 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 6, JUNE 2009 Sided and Symmetrized Bregman Centroids Frank Nielsen, Member, IEEE, and Richard Nock Abstract In this paper, we generalize the notions
More informationA Very Brief Summary of Statistical Inference, and Examples
A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)
More informationST5215: Advanced Statistical Theory
Department of Statistics & Applied Probability Monday, September 26, 2011 Lecture 10: Exponential families and Sufficient statistics Exponential Families Exponential families are important parametric families
More informationRobustness and duality of maximum entropy and exponential family distributions
Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat
More informationStochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions
International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.
More informationf-divergence Inequalities
f-divergence Inequalities Igal Sason Sergio Verdú Abstract This paper develops systematic approaches to obtain f-divergence inequalities, dealing with pairs of probability measures defined on arbitrary
More informationOn the Entropy of Sums of Bernoulli Random Variables via the Chen-Stein Method
On the Entropy of Sums of Bernoulli Random Variables via the Chen-Stein Method Igal Sason Department of Electrical Engineering Technion - Israel Institute of Technology Haifa 32000, Israel ETH, Zurich,
More informationf-divergence Inequalities
SASON AND VERDÚ: f-divergence INEQUALITIES f-divergence Inequalities Igal Sason, and Sergio Verdú Abstract arxiv:508.00335v7 [cs.it] 4 Dec 06 This paper develops systematic approaches to obtain f-divergence
More informationarxiv: v2 [cs.it] 26 Sep 2011
Sequences o Inequalities Among New Divergence Measures arxiv:1010.041v [cs.it] 6 Sep 011 Inder Jeet Taneja Departamento de Matemática Universidade Federal de Santa Catarina 88.040-900 Florianópolis SC
More information1: PROBABILITY REVIEW
1: PROBABILITY REVIEW Marek Rutkowski School of Mathematics and Statistics University of Sydney Semester 2, 2016 M. Rutkowski (USydney) Slides 1: Probability Review 1 / 56 Outline We will review the following
More informationBayesian statistics: Inference and decision theory
Bayesian statistics: Inference and decision theory Patric Müller und Francesco Antognini Seminar über Statistik FS 28 3.3.28 Contents 1 Introduction and basic definitions 2 2 Bayes Method 4 3 Two optimalities:
More informationCONVEX FUNCTIONS AND MATRICES. Silvestru Sever Dragomir
Korean J. Math. 6 (018), No. 3, pp. 349 371 https://doi.org/10.11568/kjm.018.6.3.349 INEQUALITIES FOR QUANTUM f-divergence OF CONVEX FUNCTIONS AND MATRICES Silvestru Sever Dragomir Abstract. Some inequalities
More informationf-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models
IEEE Transactions on Information Theory, vol.58, no.2, pp.708 720, 2012. 1 f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models Takafumi Kanamori Nagoya University,
More informationZOBECNĚNÉ PHI-DIVERGENCE A EM-ALGORITMUS V AKUSTICKÉ EMISI
ZOBECNĚNÉ PHI-DIVERGENCE A EM-ALGORITMUS V AKUSTICKÉ EMISI Jan Tláskal a Václav Kůs Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague 11.11.2010 Concepts What the
More informationSTATISTICAL METHODS FOR SIGNAL PROCESSING c Alfred Hero
STATISTICAL METHODS FOR SIGNAL PROCESSING c Alfred Hero 1999 32 Statistic used Meaning in plain english Reduction ratio T (X) [X 1,..., X n ] T, entire data sample RR 1 T (X) [X (1),..., X (n) ] T, rank
More informationFormulas for probability theory and linear models SF2941
Formulas for probability theory and linear models SF2941 These pages + Appendix 2 of Gut) are permitted as assistance at the exam. 11 maj 2008 Selected formulae of probability Bivariate probability Transforms
More informationAlgebraic Information Geometry for Learning Machines with Singularities
Algebraic Information Geometry for Learning Machines with Singularities Sumio Watanabe Precision and Intelligence Laboratory Tokyo Institute of Technology 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503
More informationStatistical Inference
Statistical Inference Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Week 12. Testing and Kullback-Leibler Divergence 1. Likelihood Ratios Let 1, 2, 2,...
More informationLecture 1: Introduction
Principles of Statistics Part II - Michaelmas 208 Lecturer: Quentin Berthet Lecture : Introduction This course is concerned with presenting some of the mathematical principles of statistical theory. One
More informationINFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS. Michael A. Lexa and Don H. Johnson
INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS Michael A. Lexa and Don H. Johnson Rice University Department of Electrical and Computer Engineering Houston, TX 775-892 amlexa@rice.edu,
More information4. Conditional risk measures and their robust representation
4. Conditional risk measures and their robust representation We consider a discrete-time information structure given by a filtration (F t ) t=0,...,t on our probability space (Ω, F, P ). The time horizon
More information4 Expectation & the Lebesgue Theorems
STA 205: Probability & Measure Theory Robert L. Wolpert 4 Expectation & the Lebesgue Theorems Let X and {X n : n N} be random variables on a probability space (Ω,F,P). If X n (ω) X(ω) for each ω Ω, does
More informationLecture 13 : Variational Inference: Mean Field Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1
More informationExpectation Maximization
Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /
More informationConvex Functions. Wing-Kin (Ken) Ma The Chinese University of Hong Kong (CUHK)
Convex Functions Wing-Kin (Ken) Ma The Chinese University of Hong Kong (CUHK) Course on Convex Optimization for Wireless Comm. and Signal Proc. Jointly taught by Daniel P. Palomar and Wing-Kin (Ken) Ma
More informationInformation Geometric view of Belief Propagation
Information Geometric view of Belief Propagation Yunshu Liu 2013-10-17 References: [1]. Shiro Ikeda, Toshiyuki Tanaka and Shun-ichi Amari, Stochastic reasoning, Free energy and Information Geometry, Neural
More information