EFFICIENT MULTIVARIATE ENTROPY ESTIMATION WITH HINTS OF APPLICATIONS TO TESTING SHAPE CONSTRAINTS

1 EFFICIENT MULTIVARIATE ENTROPY ESTIMATION WITH HINTS OF APPLICATIONS TO TESTING SHAPE CONSTRAINTS
Richard Samworth, University of Cambridge
Joint work with Thomas B. Berrett and Ming Yuan

2 Collaborators October 1,

3 Differential entropy
The (differential) entropy of a random vector $X$ with density function $f$ is defined as
$$H = H(X) = H(f) := -\mathbb{E}\{\log f(X)\} = -\int_{\mathcal{X}} f \log f,$$
where $\mathcal{X} := \{x : f(x) > 0\}$. Estimation of entropy plays a key role in goodness-of-fit tests (Vasicek, 1976; Cressie, 1976), tests of independence (Goria et al., 2005), independent component analysis (Miller and Fisher, 2003), feature selection in classification (Kwak and Choi, 2002), ...

4 Kozachenko–Leonenko estimators
Let $X_1, \ldots, X_n \overset{\mathrm{iid}}{\sim} f$ on $\mathbb{R}^d$. Let $X_{(k),i}$ denote the $k$th nearest neighbour of $X_i$, and let $\rho_{(k),i} := \|X_{(k),i} - X_i\|$. The Kozachenko–Leonenko estimator of the entropy $H$ is
$$\hat{H}_n = \hat{H}_n(X_1, \ldots, X_n) := \frac{1}{n} \sum_{i=1}^n \log\bigl( \rho_{(k),i}^d \, V_d (n-1) e^{-\Psi(k)} \bigr),$$
where $V_d := \pi^{d/2}/\Gamma(1+d/2)$ denotes the volume of the unit $d$-dimensional Euclidean ball and $\Psi$ denotes the digamma function.
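As a concrete illustration (not part of the slides), the estimator above takes only a few lines of Python using SciPy's k-d tree; the function name `kl_entropy` and its interface are our own:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gamma

def kl_entropy(X, k=1):
    """Kozachenko-Leonenko entropy estimate from an (n, d) sample X."""
    n, d = X.shape
    # Distance to the k-th nearest neighbour of each point
    # (column 0 of the query result is the point itself, at distance 0).
    rho = cKDTree(X).query(X, k=k + 1)[0][:, k]
    V_d = np.pi ** (d / 2) / gamma(1 + d / 2)  # volume of the unit d-ball
    # Average of log( rho^d * V_d * (n-1) * e^{-Psi(k)} )
    return np.mean(np.log(rho ** d * V_d * (n - 1))) - digamma(k)
```

For a quick sanity check, a standard normal sample should give an estimate near the closed-form value $\tfrac{1}{2}\log(2\pi e) \approx 1.419$.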

5 Intuition
KL estimators attempt to mimic the oracle estimator $H_n^* := -n^{-1} \sum_{i=1}^n \log f(X_i)$, based on the $k$-nearest neighbour density estimate approximation
$$\frac{k}{n-1} \frac{1}{V_d \, \rho_{(k),1}^d} \approx f(X_1).$$
Previous work focuses on $k = 1$ or (recently) $k$ fixed, and often assumes $f$ is compactly supported (Kozachenko and Leonenko, 1987; Tsybakov and Van der Meulen, 1996; Singh et al., 2003; Mnatsakanov et al., 2008; Biau and Devroye, 2015; Delattre and Fournier, 2016; Singh and Póczos, 2016; Gao et al., 2016).

6 The trouble with full support
A Taylor expansion of $H(f)$ around a density estimator $\hat{f}$ yields
$$H(f) \approx -\int_{\mathbb{R}^d} f(x) \log \hat{f}(x)\,dx - \frac{1}{2}\Bigl( \int_{\mathbb{R}^d} \frac{f^2(x)}{\hat{f}(x)}\,dx - 1 \Bigr).$$
When $f$ is bounded away from zero on its support, one can estimate the (smaller-order) second term to obtain efficient estimators in higher dimensions (Laurent, 1996). However, when $f$ is not bounded away from zero on its support, such procedures are no longer effective.

7 Intuition regarding bias
Let $\xi_i := \rho_{(k),i}^d \, V_d (n-1) e^{-\Psi(k)}$, and for $u \in [0, \infty)$, define
$$F_{n,x}(u) := \mathbb{P}(\xi_i \le u \mid X_i = x) = \sum_{j=k}^{n-1} \binom{n-1}{j} p_{n,x,u}^j (1 - p_{n,x,u})^{n-1-j},$$
where $p_{n,x,u} := \int_{B_x(r_{n,u})} f(y)\,dy$ and $r_{n,u} := \bigl\{ e^{\Psi(k)} u / (V_d (n-1)) \bigr\}^{1/d}$. Also define a limiting distribution function
$$F_x(u) := e^{-\lambda_{x,u}} \sum_{j=k}^{\infty} \frac{\lambda_{x,u}^j}{j!},$$
where $\lambda_{x,u} := u f(x) e^{\Psi(k)}$.

8 More intuition regarding bias
We expect that
$$\mathbb{E}(\hat{H}_n) = \int_{\mathcal{X}} f(x) \int_0^\infty \log u \, dF_{n,x}(u)\,dx \approx \int_{\mathcal{X}} f(x) \int_0^\infty \log u \, dF_x(u)\,dx = \int_{\mathcal{X}} f(x) \int_0^\infty \log\Bigl( \frac{t e^{-\Psi(k)}}{f(x)} \Bigr) \frac{e^{-t} t^{k-1}}{(k-1)!}\,dt\,dx = H,$$
where we have substituted $t = \lambda_{x,u}$. The final equality uses $\mathbb{E} \log T = \Psi(k)$ for $T \sim \Gamma(k, 1)$, so the inner integral equals $-\log f(x)$; this identity is precisely why the debiasing factor $e^{-\Psi(k)}$ appears in the estimator.
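The Gamma identity driving the debiasing factor, $\mathbb{E} \log T = \Psi(k)$ for $T \sim \Gamma(k, 1)$, is easy to check by Monte Carlo (our own illustration, not from the slides):

```python
import numpy as np
from scipy.special import digamma

# Verify E[log T] = Psi(k) for T ~ Gamma(k, 1) by simple Monte Carlo.
rng = np.random.default_rng(0)
k = 3
T = rng.gamma(shape=k, scale=1.0, size=10**6)
print(np.log(T).mean(), digamma(k))  # the two values should nearly agree
```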

9 Formal conditions
(A1)($\alpha$): $\int_{\mathbb{R}^d} \|x\|^\alpha f(x)\,dx < \infty$.
(A2)($\beta$): $f$ is bounded and, with $m := \lceil \beta \rceil - 1$ and $\eta := \beta - m$, $f$ is $C^m(\mathbb{R}^d)$. There exist $r > 0$ and a measurable $g : \mathbb{R}^d \to [0, \infty)$ such that for $t = 1, \ldots, m$ and $\|y - x\| \le r$,
$$\|f^{(t)}(x)\| \le g(x) f(x), \qquad \|f^{(m)}(y) - f^{(m)}(x)\| \le g(x) f(x) \|y - x\|^\eta,$$
and $\sup_{x : f(x) \ge \delta} g(x) = o(\delta^{-\epsilon})$ as $\delta \searrow 0$ for every $\epsilon > 0$.

10 Main result on the bias
Suppose (A1)($\alpha$) and (A2)($\beta$) hold for some $\alpha, \beta > 0$. Let $k^* = k_n^*$ be a deterministic sequence of positive integers with $k^* = O(n^{1-\epsilon})$ as $n \to \infty$ for some $\epsilon > 0$.
(i) If $d \ge 3$, $\alpha > \frac{2d}{d-2}$ and $\beta > 2$, then, uniformly for $k \in \{1, \ldots, k^*\}$,
$$\mathbb{E}(\hat{H}_n) - H = \frac{\Gamma(k + 2/d)}{2 V_d^{2/d} (d+2) \Gamma(k) \, n^{2/d}} \int_{\mathcal{X}} \frac{\Delta f(x)}{f(x)^{2/d}}\,dx + o\Bigl( \frac{k^{2/d}}{n^{2/d}} \Bigr).$$
(ii) If $d \le 2$, or $d \ge 3$ with $\alpha \in (0, 2d/(d-2)]$ or $\beta \le 2$, then
$$\mathbb{E}(\hat{H}_n) - H = O\Bigl( \max\Bigl\{ \Bigl( \frac{k^*}{n} \Bigr)^{\alpha/(\alpha+d) - \epsilon}, \Bigl( \frac{k^*}{n} \Bigr)^{\beta/d} \Bigr\} \Bigr)$$
uniformly for $k \in \{1, \ldots, k^*\}$, for every $\epsilon > 0$.

11 Corollary
Assume the conditions in (i), and let $k_0^* = k_{0,n}^*$ be a deterministic sequence of positive integers with $k_{0,n}^* \le k_n^*$ and $k_0^* \to \infty$ as $n \to \infty$. Then
$$\mathbb{E}(\hat{H}_n) - H = \frac{1}{2 V_d^{2/d} (d+2)} \Bigl\{ \int_{\mathcal{X}} \frac{\Delta f(x)}{f(x)^{2/d}}\,dx \Bigr\} \frac{k^{2/d}}{n^{2/d}} + o\Bigl( \frac{k^{2/d}}{n^{2/d}} \Bigr)$$
uniformly for $k \in \{k_0^*, \ldots, k^*\}$.

12 Prewhitening
For any $A \in \mathbb{R}^{d \times d}$ with $|\det A| > 0$, we have $H(AX_1) = H(X_1) + \log |\det A|$. We can therefore set $Y_i := A X_i$, and let
$$\hat{H}_n^A = \hat{H}_n^A(X_1, \ldots, X_n) := \hat{H}_n(Y_1, \ldots, Y_n) - \log |\det A|.$$
Under the conditions in (i), the dominant bias contribution is
$$\frac{\Gamma(k + 2/d)}{2 V_d^{2/d} (d+2) \Gamma(k) \, n^{2/d}} \int_{\mathcal{X}} \frac{\mathrm{tr}(A^T \nabla^2 f(x) A)}{|\det A|^{2/d} f(x)^{2/d}}\,dx.$$
In general, it is hard to compare this with the original bias, but...
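A minimal sketch of prewhitening with the choice $A = \hat{\Sigma}^{-1/2}$ discussed on the next slide (our own code; a basic KL estimator is defined inline so the snippet is self-contained):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gamma

def kl_entropy(X, k=1):
    """Basic Kozachenko-Leonenko estimate from an (n, d) sample."""
    n, d = X.shape
    rho = cKDTree(X).query(X, k=k + 1)[0][:, k]   # k-th NN distance
    V_d = np.pi ** (d / 2) / gamma(1 + d / 2)
    return np.mean(np.log(rho ** d * V_d * (n - 1))) - digamma(k)

def kl_entropy_prewhitened(X, k=1):
    """Estimate H(X) from Y = A X with A = Sigma_hat^{-1/2}, then undo
    the shift H(AX) = H(X) + log|det A|."""
    w_eig, U = np.linalg.eigh(np.cov(X, rowvar=False))
    A = U @ np.diag(w_eig ** -0.5) @ U.T          # Sigma_hat^{-1/2}
    log_det_A = -0.5 * np.sum(np.log(w_eig))      # log|det A|
    return kl_entropy(X @ A.T, k) - log_det_A
```

On an anisotropic Gaussian sample the prewhitened estimate should recover the closed-form entropy $\tfrac{d}{2}\log(2\pi e) + \tfrac{1}{2}\log \det \Sigma$.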

13 More on prewhitening
Suppose $f(x) = |\Sigma|^{-1/2} f_0(\Sigma^{-1/2} x)$, where either $f_0(z) = \prod_{j=1}^d g(z_j)$ for some density $g$, or $f_0(z) = g(\|z\|_p)$ for some $p \in (0, \infty)$ and $g : [0, \infty) \to [0, \infty)$. Then
$$\int_{\mathcal{X}} \frac{\Delta f(x)}{f(x)^{2/d}}\,dx = |\Sigma|^{1/d} \, \frac{\mathrm{tr}(\Sigma^{-1})}{d} \int_{\mathbb{R}^d} \frac{\Delta f_0(x)}{f_0(x)^{2/d}}\,dx,$$
$$\int_{\mathcal{X}} \frac{\mathrm{tr}(A^T \nabla^2 f(x) A)}{|\det A|^{2/d} f(x)^{2/d}}\,dx = |A \Sigma A^T|^{1/d} \, \frac{\mathrm{tr}\{(A \Sigma A^T)^{-1}\}}{d} \int_{\mathbb{R}^d} \frac{\Delta f_0(x)}{f_0(x)^{2/d}}\,dx.$$
By the AM–GM inequality, $A \mapsto |A \Sigma A^T|^{1/d} \, \mathrm{tr}\{(A \Sigma A^T)^{-1}\}$ is minimised when the eigenvalues of $A \Sigma A^T$ are equal. This motivates the choice $A = \hat{\Sigma}^{-1/2}$.

14 Examples
[Figure: MSE as a function of $k$ for samples of size $n$ from $N_3(0, \Sigma)$, for two choices of $\Sigma$, comparing the standard and prewhitened estimators.]

15 Asymptotic variance
Assume (A1)($\alpha$) for some $\alpha > d$ and that (A2)(2) holds. Let $k_0^*$ and $k_1^*$ denote deterministic sequences with $k_0^* \le k_1^*$, $k_0^* / \log^5 n \to \infty$ and $k_1^* = O(n^\tau)$, where
$$\tau < \min\Bigl\{ \frac{2\alpha}{5\alpha + 3d}, \frac{\alpha - d}{2\alpha} \Bigr\}.$$
Then $\mathrm{Var}\,\hat{H}_n = n^{-1} \mathrm{Var} \log f(X_1) + o(n^{-1})$ as $n \to \infty$, uniformly for $k \in \{k_0^*, \ldots, k_1^*\}$.

16 Efficiency in low dimensions
Assume $d \le 3$ and that the previous conditions hold. If $d = 3$, assume also that $k_1^* = o(n^{1/4})$. Then
$$n^{1/2}(\hat{H}_n - H) \overset{d}{\to} N\bigl( 0, \mathrm{Var} \log f(X_1) \bigr),$$
and $n \, \mathbb{E}\{ (\hat{H}_n - H)^2 \} \to \mathrm{Var} \log f(X_1)$ as $n \to \infty$.

17 Weighted KL estimators
For weights $w_1, \ldots, w_k$ with $\sum_{j=1}^k w_j = 1$, define
$$\hat{H}_n^w := \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^k w_j \log \xi_{(j),i},$$
where $\xi_{(j),i} := e^{-\Psi(j)} V_d (n-1) \rho_{(j),i}^d$. If
$$\sum_{j=1}^k w_j \frac{\Gamma(j + 2/d)}{\Gamma(j)} = 0,$$
then under (A1)($\alpha$) with $\alpha > d$ and (A2)($\beta$) with $\beta > 2$, we have $\mathbb{E}(\hat{H}_n^w) - H = o(n^{-1/2})$ when $d = 4$; and if $\beta > 5/2$, the same conclusion holds when $d = 5$.
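A sketch of the weighted estimator (our own code and naming): with all weight on $j = k$ it reduces to the standard KL estimator, which gives an easy sanity check.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gamma

def weighted_kl_entropy(X, w):
    """Weighted KL estimate: (1/n) sum_i sum_j w_j log xi_{(j),i},
    where xi_{(j),i} = e^{-Psi(j)} V_d (n-1) rho_{(j),i}^d."""
    n, d = X.shape
    k = len(w)
    V_d = np.pi ** (d / 2) / gamma(1 + d / 2)
    # Distances to the first k nearest neighbours (drop the self-distance).
    rho = cKDTree(X).query(X, k=k + 1)[0][:, 1:]
    j = np.arange(1, k + 1)
    log_xi = d * np.log(rho) + np.log(V_d * (n - 1)) - digamma(j)
    return float(np.mean(log_xi @ np.asarray(w)))
```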

18 Choosing weights in the general case
Let
$$W^{(k)} := \Bigl\{ w \in \mathbb{R}^k : \sum_{j=1}^k w_j \frac{\Gamma(j + 2l/d)}{\Gamma(j)} = 0 \text{ for } l = 1, \ldots, \lfloor d/4 \rfloor, \; \sum_{j=1}^k w_j = 1, \; w_j = 0 \text{ for } j \notin \{ \lfloor k/d \rfloor, \lfloor 2k/d \rfloor, \ldots, k \} \Bigr\}.$$
Then there exists $k_d$ such that for $k \ge k_d$, we can find $w^{(k)} \in W^{(k)}$ with $\sup_{k \ge k_d} \|w^{(k)}\| < \infty$.
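One way to compute an element of $W^{(k)}$ numerically (our own sketch; the minimum-norm least-squares route and the name `kl_weights` are assumptions, not the authors' construction): set up the sum-to-one and Gamma-ratio constraints as a small linear system over the allowed support and take the minimum-norm solution.

```python
import numpy as np
from scipy.special import gammaln

def kl_weights(k, d):
    """Minimum-norm weight vector supported on {floor(j*k/d) : j=1..d},
    summing to 1 and killing the Gamma(j+2l/d)/Gamma(j) moments for
    l = 1..floor(d/4). Requires k >= d."""
    support = sorted({(j * k) // d for j in range(1, d + 1)})  # includes k
    j = np.array(support, dtype=float)
    rows, targets = [np.ones_like(j)], [1.0]                   # sum w_j = 1
    for l in range(1, d // 4 + 1):
        # Gamma(j + 2l/d) / Gamma(j), computed stably via gammaln
        rows.append(np.exp(gammaln(j + 2 * l / d) - gammaln(j)))
        targets.append(0.0)
    A = np.vstack(rows)
    # Underdetermined consistent system: lstsq returns the min-norm solution.
    w_support, *_ = np.linalg.lstsq(A, np.array(targets), rcond=None)
    w = np.zeros(k)
    w[np.array(support) - 1] = w_support
    return w
```

For example, with $k = 8$, $d = 4$ the support is $\{2, 4, 6, 8\}$ and there is a single Gamma-ratio constraint ($l = 1$).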

19 Efficiency in arbitrary dimensions
Assume (A1)($\alpha$) and (A2)($\beta$), that $k^* = k_n^*$ is as before, and that $w = w^{(k)} \in W^{(k)}$ for large $n$ with $\limsup_n \|w^{(k)}\| < \infty$.
(i) If $\alpha > d$, $\beta > d/2$ and $k_1^* = o\bigl( \min\bigl\{ n^{\lfloor d/4 \rfloor / (1 + \lfloor d/4 \rfloor)}, n^{1 - d/(2\beta)} \bigr\} \bigr)$, then
$$n^{1/2}(\hat{H}_n^w - H) \overset{d}{\to} N\bigl( 0, \mathrm{Var} \log f(X_1) \bigr), \qquad n \, \mathbb{E}\{ (\hat{H}_n^w - H)^2 \} \to \mathrm{Var} \log f(X_1),$$
uniformly for $k \in \{k_0^*, \ldots, k_1^*\}$.
(ii) If $d \ge 4$, $\alpha > d$ and $\beta \in [2, d/2]$, then $\hat{H}_n^w - H = O_p(k^{\beta/d} / n^{\beta/d})$ uniformly for $k \in \{k_0^*, \ldots, k_1^*\}$.

20 Log-concave projections
Let $\mathcal{P}_d$ be the set of probability measures $P$ on $\mathbb{R}^d$ with $\int_{\mathbb{R}^d} \|x\|\,dP(x) < \infty$ and $P(H) < 1$ for all hyperplanes $H$. Let $\mathcal{F}_d$ be the set of all upper semi-continuous log-concave densities on $\mathbb{R}^d$. Then there is a well-defined projection $\psi^* : \mathcal{P}_d \to \mathcal{F}_d$ given by
$$\psi^*(P) := \operatorname*{argmax}_{f \in \mathcal{F}_d} \int_{\mathbb{R}^d} \log f \, dP$$
(Dümbgen, S. and Schuhmacher, 2011).

21 Risk bounds
Let $X_1, \ldots, X_n \overset{\mathrm{iid}}{\sim} f_0 \in \mathcal{F}_d$ with empirical distribution $\mathbb{P}_n$, and let $\hat{f}_n := \psi^*(\mathbb{P}_n)$. Let $d_{\mathrm{KL}}^2(f, g) := \int_{\mathbb{R}^d} f \log(f/g)$. Then
$$d_{\mathrm{KL}}^2(\hat{f}_n, f_0) \le \frac{1}{n} \sum_{i=1}^n \log \frac{\hat{f}_n(X_i)}{f_0(X_i)} =: d_X^2(\hat{f}_n, f_0)$$
(Dümbgen, S. and Schuhmacher, 2011; Kim, Guntuboyina and S., 2016). Moreover,
$$\sup_{f_0 \in \mathcal{F}_d} \mathbb{E}_{f_0} d_X^2(\hat{f}_n, f_0) = \begin{cases} O(n^{-4/5}) & \text{if } d = 1, \\ O(n^{-2/3} \log n) & \text{if } d = 2, \\ O(n^{-1/2} \log n) & \text{if } d = 3 \end{cases}$$
(Kim and S., 2016; Kim, Guntuboyina and S., 2016).

22 Estimating entropy under log-concavity
Let
$$\hat{H}_n^{\mathrm{LC}} := -\frac{1}{n} \sum_{i=1}^n \log \hat{f}_n(X_i).$$
Then, for $d = 1, 2$,
$$n^{1/2}(\hat{H}_n^{\mathrm{LC}} - H) = -n^{1/2} \int_{\mathbb{R}^d} \log f_0 \, d(\mathbb{P}_n - P_0) - n^{1/2} d_X^2(\hat{f}_n, f_0) \overset{d}{\to} N\bigl( 0, \mathrm{Var} \log f_0(X_1) \bigr).$$

23 Testing log-concavity
Let $X \sim P \in \mathcal{P}_d$ and let $X^* \sim \psi^*(P)$. Then $H(X^*) \ge H(X)$, with equality if and only if $X$ has a log-concave density. To test $H_0 : f_0 \in \mathcal{F}_d$, set $T_n := \hat{H}_n - \hat{H}_n^{\mathrm{LC}}$. Under $H_0$, for $d = 1, 2$, $T_n = o_p(n^{-1/2})$, provided $f_0$ satisfies the conditions for $\hat{H}_n$ to be efficient. Under $H_1$, $T_n \overset{p}{\to} H(f_0) - H(f_0^*) < 0$, where $f_0^* := \psi^*(P)$.

24 References
Berrett, T. B., Samworth, R. J. and Yuan, M. (2016) Efficient multivariate entropy estimation via k-nearest neighbour distances.
Biau, G. and Devroye, L. (2015) Lectures on the Nearest Neighbor Method. Springer, New York.
Cressie, N. (1976) On the logarithms of high-order spacings. Biometrika, 63.
Delattre, S. and Fournier, N. (2016) On the Kozachenko–Leonenko entropy estimator.
Dümbgen, L., Samworth, R. and Schuhmacher, D. (2011) Approximation by log-concave distributions with applications to regression. Ann. Statist., 39.
Gao, W., Oh, S. and Viswanath, P. (2016) Demystifying fixed k-nearest neighbor information estimators.

25 Goria, M. N., Leonenko, N. N., Mergel, V. V. and Novi Inverardi, P. L. (2005) A new class of random vector entropy estimators and its applications in testing statistical hypotheses. J. Nonparam. Statist., 17.
Kim, A. K. H. and Samworth, R. J. (2016) Global rates of convergence in log-concave density estimation. Ann. Statist., to appear.
Kozachenko, L. F. and Leonenko, N. N. (1987) Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23.
Kwak, N. and Choi, C. (2002) Input feature selection by mutual information based on Parzen window. IEEE Trans. Patt. Anal. and Mach. Intell., 24.
Laurent, B. (1996) Efficient estimation of integral functionals of a density. Ann. Statist., 24.
Miller, E. G. and Fisher, J. W. (2003) ICA using spacings estimates of entropy. J. Mach. Learn. Res., 4.

26 Mnatsakanov, R. M., Misra, N., Li, S. and Harner, E. J. (2008) K_n-nearest neighbor estimators of entropy. Math. Meth. Statist., 17.
Singh, H., Misra, N., Hnizdo, V., Fedorowicz, A. and Demchuk, E. (2003) Nearest neighbor estimates of entropy. Amer. J. Math. Man. Sci., 23.
Singh, S. and Póczos, B. (2016) Analysis of k nearest neighbor distances with application to entropy estimation.
Tsybakov, A. B. and Van der Meulen, E. C. (1996) Root-n consistent estimators of entropy for densities with unbounded support. Scand. J. Statist., 23.
Vasicek, O. (1976) A test for normality based on sample entropy. J. Roy. Statist. Soc. Ser. B, 38.


More information

Information Theory and Communication

Information Theory and Communication Information Theory and Communication Ritwik Banerjee rbanerjee@cs.stonybrook.edu c Ritwik Banerjee Information Theory and Communication 1/8 General Chain Rules Definition Conditional mutual information

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered,

regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, L penalized LAD estimator for high dimensional linear regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, where the overall number of variables

More information

Shape constrained estimation: a brief introduction

Shape constrained estimation: a brief introduction Shape constrained estimation: a brief introduction Jon A. Wellner University of Washington, Seattle Cowles Foundation, Yale November 7, 2016 Econometrics lunch seminar, Yale Based on joint work with: Fadoua

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

A Probabilistic Upper Bound on Differential Entropy

A Probabilistic Upper Bound on Differential Entropy A Probabilistic Upper Bound on Differential Entropy Joseph DeStefano, Qifeng Lu and Erik Learned-Miller Abstract The differential entrops a quantity employed ubiquitousln communications, statistical learning,

More information

1 Exercises for lecture 1

1 Exercises for lecture 1 1 Exercises for lecture 1 Exercise 1 a) Show that if F is symmetric with respect to µ, and E( X )

More information

ABCME: Summary statistics selection for ABC inference in R

ABCME: Summary statistics selection for ABC inference in R ABCME: Summary statistics selection for ABC inference in R Matt Nunes and David Balding Lancaster University UCL Genetics Institute Outline Motivation: why the ABCME package? Description of the package

More information

Bahadur representations for bootstrap quantiles 1

Bahadur representations for bootstrap quantiles 1 Bahadur representations for bootstrap quantiles 1 Yijun Zuo Department of Statistics and Probability, Michigan State University East Lansing, MI 48824, USA zuo@msu.edu 1 Research partially supported by

More information

6.1 Variational representation of f-divergences

6.1 Variational representation of f-divergences ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 6: Variational representation, HCR and CR lower bounds Lecturer: Yihong Wu Scribe: Georgios Rovatsos, Feb 11, 2016

More information

MATH 6337 Second Midterm April 1, 2014

MATH 6337 Second Midterm April 1, 2014 You can use your book and notes. No laptop or wireless devices allowed. Write clearly and try to make your arguments as linear and simple as possible. The complete solution of one exercise will be considered

More information

Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs

Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs Dávid Pál Department of Computing Science University of Alberta Edmonton, AB, Canada dpal@cs.ualberta.ca

More information

Kernel Density Estimation

Kernel Density Estimation Kernel Density Estimation Univariate Density Estimation Suppose tat we ave a random sample of data X 1,..., X n from an unknown continuous distribution wit probability density function (pdf) f(x) and cumulative

More information

Nonparametric regression for topology. applied to brain imaging data

Nonparametric regression for topology. applied to brain imaging data , applied to brain imaging data Cleveland State University October 15, 2010 Motivation from Brain Imaging MRI Data Topology Statistics Application MRI Data Topology Statistics Application Cortical surface

More information

The Hilbert Transform and Fine Continuity

The Hilbert Transform and Fine Continuity Irish Math. Soc. Bulletin 58 (2006), 8 9 8 The Hilbert Transform and Fine Continuity J. B. TWOMEY Abstract. It is shown that the Hilbert transform of a function having bounded variation in a finite interval

More information

EVANESCENT SOLUTIONS FOR LINEAR ORDINARY DIFFERENTIAL EQUATIONS

EVANESCENT SOLUTIONS FOR LINEAR ORDINARY DIFFERENTIAL EQUATIONS EVANESCENT SOLUTIONS FOR LINEAR ORDINARY DIFFERENTIAL EQUATIONS Cezar Avramescu Abstract The problem of existence of the solutions for ordinary differential equations vanishing at ± is considered. AMS

More information

Variance Function Estimation in Multivariate Nonparametric Regression

Variance Function Estimation in Multivariate Nonparametric Regression Variance Function Estimation in Multivariate Nonparametric Regression T. Tony Cai 1, Michael Levine Lie Wang 1 Abstract Variance function estimation in multivariate nonparametric regression is considered

More information

Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function

Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function Dinesh Krithivasan and S. Sandeep Pradhan Department of Electrical Engineering and Computer Science,

More information

Random projection ensemble classification

Random projection ensemble classification Random projection ensemble classification Timothy I. Cannings Statistics for Big Data Workshop, Brunel Joint work with Richard Samworth Introduction to classification Observe data from two classes, pairs

More information

Inverse Statistical Learning

Inverse Statistical Learning Inverse Statistical Learning Minimax theory, adaptation and algorithm avec (par ordre d apparition) C. Marteau, M. Chichignoud, C. Brunet and S. Souchet Dijon, le 15 janvier 2014 Inverse Statistical Learning

More information

ALMOST PERIODIC SOLUTIONS OF NONLINEAR DISCRETE VOLTERRA EQUATIONS WITH UNBOUNDED DELAY. 1. Almost periodic sequences and difference equations

ALMOST PERIODIC SOLUTIONS OF NONLINEAR DISCRETE VOLTERRA EQUATIONS WITH UNBOUNDED DELAY. 1. Almost periodic sequences and difference equations Trends in Mathematics - New Series Information Center for Mathematical Sciences Volume 10, Number 2, 2008, pages 27 32 2008 International Workshop on Dynamical Systems and Related Topics c 2008 ICMS in

More information

Optimal Estimation of a Nonsmooth Functional

Optimal Estimation of a Nonsmooth Functional Optimal Estimation of a Nonsmooth Functional T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania http://stat.wharton.upenn.edu/ tcai Joint work with Mark Low 1 Question Suppose

More information

Estimation of the functional Weibull-tail coefficient

Estimation of the functional Weibull-tail coefficient 1/ 29 Estimation of the functional Weibull-tail coefficient Stéphane Girard Inria Grenoble Rhône-Alpes & LJK, France http://mistis.inrialpes.fr/people/girard/ June 2016 joint work with Laurent Gardes,

More information

Stat 710: Mathematical Statistics Lecture 31

Stat 710: Mathematical Statistics Lecture 31 Stat 710: Mathematical Statistics Lecture 31 Jun Shao Department of Statistics University of Wisconsin Madison, WI 53706, USA Jun Shao (UW-Madison) Stat 710, Lecture 31 April 13, 2009 1 / 13 Lecture 31:

More information

Metric Embedding for Kernel Classification Rules

Metric Embedding for Kernel Classification Rules Metric Embedding for Kernel Classification Rules Bharath K. Sriperumbudur University of California, San Diego (Joint work with Omer Lang & Gert Lanckriet) Bharath K. Sriperumbudur (UCSD) Metric Embedding

More information

Maximum likelihood estimation of a log-concave density based on censored data

Maximum likelihood estimation of a log-concave density based on censored data Maximum likelihood estimation of a log-concave density based on censored data Dominic Schuhmacher Institute of Mathematical Statistics and Actuarial Science University of Bern Joint work with Lutz Dümbgen

More information

Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n

Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n Jiantao Jiao (Stanford EE) Joint work with: Kartik Venkat Yanjun Han Tsachy Weissman Stanford EE Tsinghua

More information

Quiz 2 Date: Monday, November 21, 2016

Quiz 2 Date: Monday, November 21, 2016 10-704 Information Processing and Learning Fall 2016 Quiz 2 Date: Monday, November 21, 2016 Name: Andrew ID: Department: Guidelines: 1. PLEASE DO NOT TURN THIS PAGE UNTIL INSTRUCTED. 2. Write your name,

More information

SMOOTHNESS PROPERTIES OF SOLUTIONS OF CAPUTO- TYPE FRACTIONAL DIFFERENTIAL EQUATIONS. Kai Diethelm. Abstract

SMOOTHNESS PROPERTIES OF SOLUTIONS OF CAPUTO- TYPE FRACTIONAL DIFFERENTIAL EQUATIONS. Kai Diethelm. Abstract SMOOTHNESS PROPERTIES OF SOLUTIONS OF CAPUTO- TYPE FRACTIONAL DIFFERENTIAL EQUATIONS Kai Diethelm Abstract Dedicated to Prof. Michele Caputo on the occasion of his 8th birthday We consider ordinary fractional

More information

ON A WEIGHTED INTERPOLATION OF FUNCTIONS WITH CIRCULAR MAJORANT

ON A WEIGHTED INTERPOLATION OF FUNCTIONS WITH CIRCULAR MAJORANT ON A WEIGHTED INTERPOLATION OF FUNCTIONS WITH CIRCULAR MAJORANT Received: 31 July, 2008 Accepted: 06 February, 2009 Communicated by: SIMON J SMITH Department of Mathematics and Statistics La Trobe University,

More information

Size and Shape of Confidence Regions from Extended Empirical Likelihood Tests

Size and Shape of Confidence Regions from Extended Empirical Likelihood Tests Biometrika (2014),,, pp. 1 13 C 2014 Biometrika Trust Printed in Great Britain Size and Shape of Confidence Regions from Extended Empirical Likelihood Tests BY M. ZHOU Department of Statistics, University

More information

Logarithmic Harnack inequalities

Logarithmic Harnack inequalities Logarithmic Harnack inequalities F. R. K. Chung University of Pennsylvania Philadelphia, Pennsylvania 19104 S.-T. Yau Harvard University Cambridge, assachusetts 02138 1 Introduction We consider the relationship

More information

Adaptivity to Local Smoothness and Dimension in Kernel Regression

Adaptivity to Local Smoothness and Dimension in Kernel Regression Adaptivity to Local Smoothness and Dimension in Kernel Regression Samory Kpotufe Toyota Technological Institute-Chicago samory@tticedu Vikas K Garg Toyota Technological Institute-Chicago vkg@tticedu Abstract

More information