On the Chi square and higher-order Chi distances for approximating f-divergences


Slide 1/17. On the Chi square and higher-order Chi distances for approximating f-divergences

Frank Nielsen (Sony Computer Science Laboratories, Inc.) and Richard Nock (UAG-CEREGMIA)
September 2013
© 2013 Frank Nielsen, Sony Computer Science Laboratories, Inc.

Slide 2/17. Statistical divergences

A statistical divergence measures the separability between two distributions. Examples: the Pearson/Neyman $\chi^2$ and the Kullback-Leibler divergence:

$$\chi^2_P(X_1 : X_2) = \int \frac{(x_2(x) - x_1(x))^2}{x_1(x)} \, d\nu(x),$$
$$\chi^2_N(X_1 : X_2) = \int \frac{(x_1(x) - x_2(x))^2}{x_2(x)} \, d\nu(x),$$
$$\mathrm{KL}(X_1 : X_2) = \int x_1(x) \log \frac{x_1(x)}{x_2(x)} \, d\nu(x).$$

Slide 3/17. f-divergences: A generic definition

$$I_f(X_1 : X_2) = \int x_1(x) \, f\!\left(\frac{x_2(x)}{x_1(x)}\right) d\nu(x) \geq 0,$$

where $f : (0,\infty) \subseteq \mathrm{dom}(f) \to [0,\infty]$ is a convex function such that $f(1) = 0$.

Jensen's inequality: $I_f(X_1 : X_2) \geq f\left(\int x_2(x) \, d\nu(x)\right) = f(1) = 0$.

One may further require $f'(1) = 0$ and fix the scale of the divergence by setting $f''(1) = 1$.

An f-divergence can always be symmetrized: $S_f(X_1 : X_2) = I_f(X_1 : X_2) + I_{f^*}(X_1 : X_2)$, with the conjugate generator $f^*(u) = u f(1/u)$ for which $I_{f^*}(X_1 : X_2) = I_f(X_2 : X_1)$.

Slide 4/17. f-divergences: Some examples

Name of the f-divergence; formula $I_f(P : Q)$; generator $f(u)$ with $f(1) = 0$:

- Total variation (metric): $\frac{1}{2}\int |p(x) - q(x)|\, d\nu(x)$; $f(u) = \frac{1}{2}|u - 1|$
- Squared Hellinger: $\int (\sqrt{p(x)} - \sqrt{q(x)})^2\, d\nu(x)$; $f(u) = (\sqrt{u} - 1)^2$
- Pearson $\chi_P^2$: $\int \frac{(q(x) - p(x))^2}{p(x)}\, d\nu(x)$; $f(u) = (u - 1)^2$
- Neyman $\chi_N^2$: $\int \frac{(p(x) - q(x))^2}{q(x)}\, d\nu(x)$; $f(u) = \frac{(1 - u)^2}{u}$
- Pearson-Vajda $\chi_P^k$: $\int \frac{(q(x) - p(x))^k}{p^{k-1}(x)}\, d\nu(x)$; $f(u) = (u - 1)^k$
- Pearson-Vajda $|\chi|_P^k$: $\int \frac{|q(x) - p(x)|^k}{p^{k-1}(x)}\, d\nu(x)$; $f(u) = |u - 1|^k$
- Kullback-Leibler: $\int p(x)\log\frac{p(x)}{q(x)}\, d\nu(x)$; $f(u) = -\log u$
- Reverse Kullback-Leibler: $\int q(x)\log\frac{q(x)}{p(x)}\, d\nu(x)$; $f(u) = u\log u$
- $\alpha$-divergence: $\frac{4}{1-\alpha^2}\left(1 - \int p^{\frac{1-\alpha}{2}}(x)\, q^{\frac{1+\alpha}{2}}(x)\, d\nu(x)\right)$; $f(u) = \frac{4}{1-\alpha^2}\left(1 - u^{\frac{1+\alpha}{2}}\right)$
- Jensen-Shannon: $\frac{1}{2}\int \left(p(x)\log\frac{2p(x)}{p(x)+q(x)} + q(x)\log\frac{2q(x)}{p(x)+q(x)}\right) d\nu(x)$; $f(u) = -(u+1)\log\frac{1+u}{2} + u\log u$

Slide 5/17. Stochastic approximations of f-divergences

$$\hat{I}^{(n)}_f(X_1 : X_2) = \frac{1}{2n} \sum_{i=1}^{n} \left( f\!\left(\frac{x_2(s_i)}{x_1(s_i)}\right) + \frac{x_1(t_i)}{x_2(t_i)} \, f\!\left(\frac{x_2(t_i)}{x_1(t_i)}\right) \right),$$

with $s_1, \ldots, s_n$ and $t_1, \ldots, t_n$ IID sampled from $X_1$ and $X_2$, respectively.

$\hat{I}^{(n)}_f(X_1 : X_2) \to I_f(X_1 : X_2)$ as $n \to \infty$. This works for any generator $f$, but in practice it is limited to supports of small dimension.
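As an illustration (a minimal sketch of ours, not the authors' implementation; all names are illustrative), this Monte Carlo estimator can be coded for the KL generator $f(u) = -\log u$ between two Poisson distributions:

```python
import numpy as np
from scipy.stats import poisson

def f_kl(u):
    # KL generator f(u) = -log(u), with f(1) = 0.
    return -np.log(u)

def stochastic_f_divergence(f, pdf1, pdf2, sample1, sample2, n=1_000_000, seed=0):
    """Monte Carlo estimate of I_f(X1 : X2), averaging the two IID estimators."""
    rng = np.random.default_rng(seed)
    s = sample1(rng, n)  # s_1, ..., s_n ~ X1
    t = sample2(rng, n)  # t_1, ..., t_n ~ X2
    term1 = f(pdf2(s) / pdf1(s))                        # ~ E_{X1}[f(x2/x1)]
    term2 = (pdf1(t) / pdf2(t)) * f(pdf2(t) / pdf1(t))  # ~ E_{X2}[(x1/x2) f(x2/x1)]
    return 0.5 * (term1.mean() + term2.mean())

lam1, lam2 = 0.6, 0.3
est = stochastic_f_divergence(
    f_kl,
    pdf1=lambda x: poisson.pmf(x, lam1),
    pdf2=lambda x: poisson.pmf(x, lam2),
    sample1=lambda rng, n: rng.poisson(lam1, n),
    sample2=lambda rng, n: rng.poisson(lam2, n))
print(est)  # approaches KL(Poi(0.6) : Poi(0.3)) as n grows
```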

Slide 6/17. Exponential families

Canonical decomposition of the probability measure:

$$p_\theta(x) = \exp(\langle t(x), \theta \rangle - F(\theta) + k(x)).$$

Here, we consider families whose natural parameter space $\Theta$ is affine.

$$\mathrm{Poi}(\lambda): \quad p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad \lambda > 0, \; x \in \{0, 1, \ldots\}$$
$$\mathrm{Nor}_I(\mu): \quad p(x \mid \mu) = (2\pi)^{-\frac{d}{2}} e^{-\frac{1}{2}(x - \mu)^\top (x - \mu)}, \quad \mu \in \mathbb{R}^d, \; x \in \mathbb{R}^d$$

Family | $\theta$ | $\Theta$ | $F(\theta)$ | $k(x)$ | $t(x)$ | $\nu$
Poisson | $\log\lambda$ | $\mathbb{R}$ | $e^\theta$ | $-\log x!$ | $x$ | $\nu_c$ (counting)
Iso. Gaussian | $\mu$ | $\mathbb{R}^d$ | $\frac{1}{2}\theta^\top\theta + \frac{d}{2}\log 2\pi$ | $-\frac{1}{2}x^\top x$ | $x$ | $\nu_L$ (Lebesgue)
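As a quick sanity check (a sketch of ours, not from the talk), the Poisson row of this table can be verified numerically: $\exp(\langle t(x), \theta\rangle - F(\theta) + k(x))$ reproduces the Poisson probability mass function.

```python
import math

def poisson_pmf_via_exp_family(x, lam):
    # Poisson row of the table: theta = log(lam), F(theta) = exp(theta),
    # t(x) = x, k(x) = -log(x!), with log(x!) = lgamma(x + 1).
    theta = math.log(lam)
    return math.exp(x * theta - math.exp(theta) - math.lgamma(x + 1))

lam = 0.6
for x in range(6):
    direct = lam**x * math.exp(-lam) / math.factorial(x)
    assert abs(poisson_pmf_via_exp_family(x, lam) - direct) < 1e-12
```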

Slide 7/17. χ² for affine exponential families

Bypass the integral computation with closed-form formulas:

$$\chi^2_P(X_1 : X_2) = e^{F(2\theta_2 - \theta_1) - (2F(\theta_2) - F(\theta_1))} - 1,$$
$$\chi^2_N(X_1 : X_2) = e^{F(2\theta_1 - \theta_2) - (2F(\theta_1) - F(\theta_2))} - 1.$$

The Kullback-Leibler divergence amounts to a Bregman divergence [3]:

$$\mathrm{KL}(X_1 : X_2) = B_F(\theta_2 : \theta_1), \quad B_F(\theta : \theta') = F(\theta) - F(\theta') - (\theta - \theta')^\top \nabla F(\theta').$$
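A sketch of the Pearson formula for two Poisson distributions (our illustration, not the authors' package; with $F(\theta) = e^\theta$ the term $F(2\theta_2 - \theta_1)$ becomes $\lambda_2^2/\lambda_1$), cross-checked against direct summation:

```python
import math

def chi2_pearson_poisson(lam1, lam2):
    # exp(F(2*theta2 - theta1) - (2*F(theta2) - F(theta1))) - 1
    # with theta_i = log(lam_i) and F(theta) = exp(theta).
    return math.exp(lam2**2 / lam1 - (2 * lam2 - lam1)) - 1.0

def chi2_pearson_numeric(lam1, lam2, x_max=50):
    # Direct summation of (p2(x) - p1(x))^2 / p1(x) over the support.
    pmf = lambda x, lam: lam**x * math.exp(-lam) / math.factorial(x)
    return sum((pmf(x, lam2) - pmf(x, lam1))**2 / pmf(x, lam1)
               for x in range(x_max))

print(chi2_pearson_poisson(0.6, 0.3))  # closed form
print(chi2_pearson_numeric(0.6, 0.3))  # agrees up to truncation of the tail
```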

Slide 8/17. Higher-order Vajda χ^k divergences

$$\chi^k_P(X_1 : X_2) = \int \frac{(x_2(x) - x_1(x))^k}{x_1(x)^{k-1}} \, d\nu(x),$$
$$|\chi|^k_P(X_1 : X_2) = \int \frac{|x_2(x) - x_1(x)|^k}{x_1(x)^{k-1}} \, d\nu(x)$$

are f-divergences for the generators $(u - 1)^k$ and $|u - 1|^k$, respectively.

When $k = 1$, $\chi^1_P(X_1 : X_2) = \int (x_2(x) - x_1(x)) \, d\nu(x) = 0$ (never discriminative), while $|\chi|^1_P(X_1, X_2)$ is twice the total variation distance. $\chi^0_P$ is the unit constant, and $\chi^k_P$ is a signed distance.

Slide 9/17. Higher-order Vajda χ^k divergences

Lemma. The (signed) $\chi^k_P$ distance between members $X_1 \sim E_F(\theta_1)$ and $X_2 \sim E_F(\theta_2)$ of the same affine exponential family is ($k \in \mathbb{N}$) always bounded and equal to:

$$\chi^k_P(X_1 : X_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} \, e^{F((1-j)\theta_1 + j\theta_2) - ((1-j)F(\theta_1) + jF(\theta_2))}.$$
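The lemma is straightforward to implement once $F$ is known (a generic sketch with names of our choosing); for unit-variance Gaussians, for example, it recovers $\chi^2_P = e^{(\mu_1 - \mu_2)^2} - 1$:

```python
import math

def chi_k_affine(F, theta1, theta2, k):
    # Binomial sum of the lemma: signed chi^k_P between E_F(theta1) and E_F(theta2).
    return sum((-1)**(k - j) * math.comb(k, j)
               * math.exp(F((1 - j) * theta1 + j * theta2)
                          - ((1 - j) * F(theta1) + j * F(theta2)))
               for j in range(k + 1))

# 1D isotropic Gaussian: theta = mu, F(theta) = theta^2/2 + log(2*pi)/2.
F_gauss = lambda th: th**2 / 2 + math.log(2 * math.pi) / 2
print(chi_k_affine(F_gauss, 0.0, 1.0, 1))  # k = 1: identically 0
print(chi_k_affine(F_gauss, 0.0, 1.0, 2))  # k = 2: exp((mu1 - mu2)^2) - 1
```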

Slide 10/17. Higher-order Vajda χ^k divergences

For Poisson and isotropic Normal distributions, we get closed-form formulas:

$$\chi^k_P(\lambda_1 : \lambda_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} \, e^{\lambda_1^{1-j}\lambda_2^{j} - ((1-j)\lambda_1 + j\lambda_2)},$$
$$\chi^k_P(\mu_1 : \mu_2) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} \, e^{\frac{1}{2}j(j-1)(\mu_1 - \mu_2)^\top(\mu_1 - \mu_2)}.$$

These are signed distances.
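A minimal sketch of the Poisson formula (illustrative code of ours, not the authors' Java package), again cross-checked against a direct summation of the defining integral:

```python
import math

def chi_k_poisson(lam1, lam2, k):
    # sum_j (-1)^(k-j) C(k,j) exp(lam1^(1-j) * lam2^j - ((1-j)*lam1 + j*lam2))
    return sum((-1)**(k - j) * math.comb(k, j)
               * math.exp(lam1**(1 - j) * lam2**j - ((1 - j) * lam1 + j * lam2))
               for j in range(k + 1))

def chi_k_numeric(lam1, lam2, k, x_max=50):
    # Direct summation of (p2(x) - p1(x))^k / p1(x)^(k-1) over the support.
    pmf = lambda x, lam: lam**x * math.exp(-lam) / math.factorial(x)
    return sum((pmf(x, lam2) - pmf(x, lam1))**k / pmf(x, lam1)**(k - 1)
               for x in range(x_max))

for k in range(1, 5):
    print(k, chi_k_poisson(0.6, 0.3, k), chi_k_numeric(0.6, 0.3, k))
```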

Slide 11/17. f-divergences from Taylor series

Lemma (extends Theorem 1 of [1]). When bounded, the f-divergence $I_f$ can be expressed as a power series of higher-order chi-type distances:

$$I_f(X_1 : X_2) = \sum_{i=0}^{\infty} \int x_1(x) \, \frac{1}{i!} f^{(i)}(\lambda) \left(\frac{x_2(x)}{x_1(x)} - \lambda\right)^{i} d\nu(x) = \sum_{i=0}^{\infty} \frac{1}{i!} f^{(i)}(\lambda) \, \chi^i_{\lambda,P}(X_1 : X_2),$$

provided $I_f < \infty$, where $\chi^i_{\lambda,P}$ generalizes $\chi^i_P$:

$$\chi^i_{\lambda,P}(X_1 : X_2) = \int \frac{(x_2(x) - \lambda x_1(x))^i}{x_1(x)^{i-1}} \, d\nu(x),$$

with $\chi^0_{\lambda,P}(X_1 : X_2) = 1$ by convention. Note that $\chi^k_{\lambda,P} - (1 - \lambda)^k$ is an f-divergence for the generator $f(u) = (u - \lambda)^k - (1 - \lambda)^k$, which satisfies $f(1) = 0$.

Slide 12/17. f-divergences: Analytic formula

For $\lambda = 1 \in \mathrm{int}(\mathrm{dom}(f^{(i)}))$, truncating the series bounds the f-divergence (Theorem 1 of [1]):

$$\left| I_f(X_1 : X_2) - \sum_{k=0}^{s} \frac{f^{(k)}(1)}{k!} \, \chi^k_P(X_1 : X_2) \right| \leq \frac{1}{(s+1)!} \, \|f^{(s+1)}\|_\infty \, (M - m)^{s+1},$$

where $\|f^{(s+1)}\|_\infty = \sup_{t \in [m,M]} |f^{(s+1)}(t)|$ and $m \leq \frac{q(x)}{p(x)} \leq M$.

For $\lambda = 0$ (whenever $0 \in \mathrm{int}(\mathrm{dom}(f^{(i)}))$) and affine exponential families, we get the simpler expression:

$$I_f(X_1 : X_2) = \sum_{i=0}^{\infty} \frac{f^{(i)}(0)}{i!} \, I_{1-i,i}(\theta_1 : \theta_2), \quad I_{1-i,i}(\theta_1 : \theta_2) = e^{F(i\theta_2 + (1-i)\theta_1) - (iF(\theta_2) + (1-i)F(\theta_1))}.$$

Slide 13/17. Corollary: Approximating f-divergences by χ² divergences

Corollary. A second-order Taylor expansion yields

$$I_f(X_1 : X_2) \approx f(1) + f'(1)\,\chi^1_N(X_1 : X_2) + \frac{1}{2} f''(1)\,\chi^2_N(X_1 : X_2).$$

Since $f(1) = 0$ and $\chi^1_N(X_1 : X_2) = 0$, it follows that

$$I_f(X_1 : X_2) \approx \frac{f''(1)}{2} \, \chi^2_N(X_1 : X_2)$$

($f''(1) > 0$ follows from the strict convexity of the generator). When $f(u) = u\log u$, this yields the well-known approximation [2]:

$$\chi^2_P(X_1 : X_2) \approx 2\,\mathrm{KL}(X_1 : X_2).$$

Slide 14/17. Kullback-Leibler divergence: Analytic expression

For the Kullback-Leibler divergence, $f(u) = -\log u$, so $f^{(i)}(u) = (-1)^i (i-1)! \, u^{-i}$ and hence $\frac{f^{(i)}(1)}{i!} = \frac{(-1)^i}{i}$ for $i \geq 1$ (with $f(1) = 0$). Since $\chi^1_{1,P} = 0$, it follows that

$$\mathrm{KL}(X_1 : X_2) = \sum_{j=2}^{\infty} \frac{(-1)^j}{j} \, \chi^j_P(X_1 : X_2),$$

an alternating-sign series.

Example with Poisson distributions, $\lambda_1 = 0.6$ and $\lambda_2 = 0.3$: KL evaluated exactly via the Bregman divergence, via stochastic estimation with $n = 10^6$ samples, and via the Taylor truncations at $s = 2, 3, 4, 10, 15$, etc.
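A sketch reproducing this experiment (our code, reusing the Poisson $\chi^k$ closed form of slide 10); the truncated sums approach the exact Bregman value as $s$ grows:

```python
import math

def chi_k_poisson(lam1, lam2, k):
    # Closed-form signed chi^k_P for Poisson distributions (slide 10).
    return sum((-1)**(k - j) * math.comb(k, j)
               * math.exp(lam1**(1 - j) * lam2**j - ((1 - j) * lam1 + j * lam2))
               for j in range(k + 1))

def kl_poisson_exact(lam1, lam2):
    # Bregman divergence B_F(theta2 : theta1) with F(theta) = exp(theta):
    # KL(Poi(lam1) : Poi(lam2)) = lam2 - lam1 + lam1 * log(lam1/lam2).
    return lam2 - lam1 + lam1 * math.log(lam1 / lam2)

def kl_poisson_truncated(lam1, lam2, s):
    # Truncated alternating series: sum_{j=2}^{s} (-1)^j / j * chi^j_P.
    return sum((-1)**j / j * chi_k_poisson(lam1, lam2, j) for j in range(2, s + 1))

lam1, lam2 = 0.6, 0.3
print("exact:", kl_poisson_exact(lam1, lam2))
for s in (2, 3, 4, 10, 15):
    print("s =", s, kl_poisson_truncated(lam1, lam2, s))
```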

Slide 15/17. Contributions

- Statistical f-divergences between members of the same exponential family with affine natural parameter space.
- A generic closed-form formula for the Pearson/Neyman $\chi^2$ and the Vajda $\chi^k$-type distances.
- An analytic expression of f-divergences using Pearson-Vajda-type distances.
- A second-order Taylor approximation for fast estimation of f-divergences.
- Java™ package:

Slide 16/17. Thank you!

@article{,
  author = "Frank Nielsen and Richard Nock",
  title = "On the {C}hi square and higher-order {C}hi distances for approximating $f$-divergences",
  year = "2013",
  eprint = "arxiv/ "
}

Slide 17/17. Bibliographic references

[1] N. S. Barnett, P. Cerone, S. S. Dragomir, and A. Sofo. Approximating Csiszár f-divergence by the use of Taylor's formula with integral remainder. Mathematical Inequalities & Applications, 5(3).
[2] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA.
[3] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8), August 2011.
