Asymptotic Normality of an Entropy Estimator with Exponentially Decaying Bias

Zhiyi Zhang
Department of Mathematics and Statistics
University of North Carolina at Charlotte
Charlotte, NC 28223

AMS 2000 Subject Classifications. Primary 62F10, 62F12, 62G05, 62G20; secondary 62F15. Keywords and phrases. Turing's formula, nonparametric entropy estimation, asymptotic normality. Research partially supported by NSF Grant DMS-1004769.

Abstract

This paper establishes the asymptotic normality of an entropy estimator with an exponentially decaying bias on any finite alphabet. Furthermore it is shown that the nonparametric estimator is asymptotically efficient.

1 Introduction

Let $\{p_k\}$ be a probability distribution on a finite alphabet $\mathscr{X} = \{\ell_k;\, k = 1, \dots, K\}$, where $K \ge 2$ is a finite integer. Let $p_X$ be a random variable such that $P(p_X = p_k) = p_k$. Entropy in the form of
\[
H = E[-\ln(p_X)] = -\sum_{k=1}^{K} p_k \ln(p_k), \tag{1}
\]
was introduced by Shannon (1948), and is often referred to as Shannon's entropy. Nonparametric estimation of $H$ has been a subject of much research for many decades. Miller (1955) and Basharin (1959) were perhaps among the first who studied the intuitive general nonparametric estimator, $\hat{H} = -\sum_{k=1}^{K} \hat{p}_k \ln(\hat{p}_k)$, where $\hat{p}_k$ is the sample relative frequency of the $k$th letter $\ell_k$, also known as the plug-in estimator. Others have investigated the topic in various forms and directions over the
years. Many important references can be found in Antos and Kontoyiannis (2001) and Paninski (2003).

Among many difficult issues of nonparametric entropy estimation, much research effort in the literature seems to be placed on reducing the bias of the estimators. The main reference point of such discussion is the $O(n^{-1})$ decaying bias of the plug-in $\hat{H}$, whose form may be found in Harris (1975). Many bias-adjusted nonparametric estimators have been proposed. All of them have been shown to reduce bias in certain numerical studies. However the rates of bias decay for most of the bias-adjusted estimators are largely unknown, and there is no clear theoretical evidence why any of these proposed estimators should improve the bias decay to a rate faster than $O(n^{-1})$.

Zhang (2012) proposed an estimator $\hat{H}_z$, as given in (2) below, and showed that the associated bias decays at a rate no slower than $O(n(1-p_0)^n)$, where $p_0 = \min\{p_k > 0;\, k = 1, \dots, K\}$. In addition, Zhang (2012) established a uniform variance upper bound, over the entire class of distributions with finite entropy, that decays at a rate of $O(\ln(n)/n)$ compared to $O([\ln(n)]^2/n)$ for the plug-in; showed that in a wide range of subclasses the variance of the proposed estimator converges at a rate of $O(1/n)$; and showed that the aforementioned rate of convergence carries over to the convergence rates in mean squared errors in many subclasses. The computational performance of $\hat{H}_z$, and of its variants, was compared favorably with several other commonly known estimators, such as the jackknife estimator by Zahl (1977) and Strong, Koberle, de Ruyter van Steveninck and Bialek (1998), and the NSB estimator by Nemenman, Shafee and Bialek (2002).

Let $\{y_k\}$ be the sequence of observed counts of letters in the alphabet in an independently and identically distributed (iid) sample of size $n$, and let $\{\hat{p}_k = y_k/n\}$. The general nonparametric estimator of entropy proposed by Zhang (2012) is
\[
\hat{H}_z = \sum_{v=1}^{n-1} \frac{1}{v} \left\{ \frac{n^{v+1}[(n-v-1)!]}{n!} \sum_{k=1}^{K} \left[ \hat{p}_k \prod_{j=0}^{v-1} \left( 1 - \hat{p}_k - \frac{j}{n} \right) \right] \right\}. \tag{2}
\]

This paper establishes two normal laws of $\hat{H}_z$, as stated in Theorem 1 and Corollary 1 below; the asymptotic efficiency of $\hat{H}_z$ is given in Theorem 2. Let
\[
H^{(2)} = E[\ln(p_X)]^2 = \sum_{k=1}^{K} p_k \ln^2(p_k).
\]
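To make the estimator concrete, the following is a minimal Python sketch of (2) computed from a vector of observed counts. The function name and the multiplicative per-letter update are mine, not from the paper; they rest on the algebraic identity noted in the docstring, which telescopes the coefficient of (2) into the running product.

```python
import numpy as np

def entropy_zhang(counts):
    """Sketch of the estimator H_z in (2), computed from observed counts y_k.

    Uses the identity
        n^{v+1} [(n-v-1)!]/n! * prod_{j=0}^{v-1} (1 - p_hat_k - j/n)
            = prod_{j=0}^{v-1} (n - y_k - j) / (n - 1 - j),
    so each letter's contribution can be updated by one factor per step in v.
    """
    y = np.array([c for c in counts if c > 0], dtype=float)
    n = y.sum()
    t = y / n                      # running product, initialized at p_hat_k
    h = 0.0
    for v in range(1, int(n)):     # v = 1, ..., n-1
        t = t * (n - y - (v - 1)) / (n - v)
        h += t.sum() / v
    return h
```

For a letter observed $y_k$ times the product vanishes once $v > n - y_k$, so the inner terms are eventually zero. For example, for counts $(2, 1)$ the sum evaluates to $2/3 + (1/2)(1/3) = 5/6$.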
Theorem 1. Let $\{p_k;\, k = 1, \dots, K\}$ be a non-uniform probability distribution on a finite alphabet $\mathscr{X}$ and let $\hat{H}_z$ be as in (2). Then $\sqrt{n}\,(\hat{H}_z - H) \xrightarrow{L} N(0, \sigma^2)$, where $\sigma^2 = \mathrm{Var}[-\ln(p_X)] = H^{(2)} - H^2$.

Let
\[
\hat{H}_z^{(2)} = \sum_{v=1}^{n-1} \left\{ \left( \sum_{i=1}^{v-1} \frac{1}{i(v-i)} \right) \left\{ \frac{n^{v+1}[(n-v-1)!]}{n!} \left[ \sum_{k=1}^{K} \hat{p}_k \prod_{m=0}^{v-1} \left( 1 - \hat{p}_k - \frac{m}{n} \right) \right] \right\} \right\}. \tag{3}
\]

Corollary 1. Let $\{p_k;\, k = 1, \dots, K\}$ be a non-uniform probability distribution on a finite alphabet, let $\hat{H}_z$ be as in (2), and let $\hat{H}_z^{(2)}$ be as in (3). Then
\[
\frac{\sqrt{n}\,(\hat{H}_z - H)}{\sqrt{\hat{H}_z^{(2)} - \hat{H}_z^2}} \xrightarrow{L} N(0, 1).
\]

Theorem 2. Let $\{p_k;\, k = 1, \dots, K\}$ be a non-uniform probability distribution on a finite alphabet $\mathscr{X}$. Then $\hat{H}_z$ is asymptotically efficient.

2 Proofs

$\hat{H}_z$ in (2) may be re-expressed as
\[
\hat{H}_z = \sum_{k=1}^{K} \hat{p}_k \left[ \sum_{v=1}^{n-1} \frac{1}{v} \left\{ \frac{n^{v+1}[(n-v-1)!]}{n!} \right\} \prod_{j=0}^{v-1} \left( 1 - \hat{p}_k - \frac{j}{n} \right) \right] \overset{def}{=} \sum_{k=1}^{K} \hat{p}_k\, \hat{g}_{k,n}. \tag{4}
\]
Of first interest is an asymptotic normal law of $\hat{p}_k\, \hat{g}_{k,n}$. For simplicity, consider first a binomial distribution with parameters $n$ and $p \in (0, 1)$, and the functions
\[
g_n(p) = \sum_{v=1}^{n(1-p)+1} \frac{1}{v} \left\{ \frac{n^{v+1}[(n-v-1)!]}{n!} \right\} \prod_{j=0}^{v-1} \left( 1 - p - \frac{j}{n} \right),
\]
$h_n(p) = p\, g_n(p)$, and $h(p) = -p \ln(p)$. Lemma 1 below is easily proved by induction.
Lemma 1. Let $a_j$ and $b_j$, $j = 1, \dots, n$, be complex numbers satisfying $|a_j| \le 1$ and $|b_j| \le 1$ for every $j$. Then
\[
\Bigl| \prod_{j=1}^{n} a_j - \prod_{j=1}^{n} b_j \Bigr| \le \sum_{j=1}^{n} |a_j - b_j|.
\]

Lemma 2. Let $\hat{p} = X/n$ where $X$ is a binomial random variable with parameters $n$ and $p$.

1. $\sqrt{n}\,[h_n(p) - h(p)] \to 0$ uniformly in $p \in [c, 1)$ for any $c$, $0 < c < 1$.

2. $\sqrt{n}\,|h_n(p) - h(p)| \le A(n) = O(n^{3/2})$ uniformly in $p \in [1/n, c]$ for any $c$, $0 < c < 1$.

3. $P(\hat{p} \le c) \le B(n) = O(n^{-1/2} \exp\{-nC\})$, where $C = (p-c)^2/[2p(1-p)]$, for any $c \in (0, p)$.

Proof of Part 1. As the notation in $g_n(p)$ suggests, the range for $v$ is from $1$ to $\min\{n-1,\, n(1-p)+1\}$. For any $v$ in that range, let
\[
W_{n,v+1} = \frac{n^{v+1}[(n-v-1)!]}{n!} = \prod_{i=1}^{v} \left( 1 - \frac{i}{n} \right)^{-1},
\quad \text{so that} \quad
W_{n,v+1} \prod_{j=0}^{v-1} \left( 1 - p - \frac{j}{n} \right) = \prod_{j=0}^{v-1} \frac{1 - p - j/n}{1 - (j+1)/n}.
\]
For $p \ge 1/n$ each factor on the right lies in $[0, 1]$, and hence, by Lemma 1 with $a_j = (1 - p - j/n)/(1 - (j+1)/n)$ and $b_j = 1 - p$, noting $a_j - b_j = (1 - p - jp)/(n - j - 1)$ and $|1 - p - jp| \le j + 1$,
\[
\Bigl| W_{n,v+1} \prod_{j=0}^{v-1} \Bigl( 1 - p - \frac{j}{n} \Bigr) - (1-p)^v \Bigr|
\le \sum_{j=0}^{v-1} \frac{|1 - p - jp|}{n - j - 1} \le \frac{v(v+1)}{2(n-v)}.
\]
For a sufficiently large $n$, let $V_n = n^{1/8}$. Since $h(p) = p \sum_{v=1}^{\infty} (1-p)^v / v$,
\[
\begin{aligned}
\sqrt{n}\,|h_n(p) - h(p)| \le{}& \sqrt{n}\, p \sum_{v=1}^{V_n} \frac{1}{v} \Bigl| W_{n,v+1} \prod_{j=0}^{v-1} \Bigl( 1 - p - \frac{j}{n} \Bigr) - (1-p)^v \Bigr| \\
&+ \sqrt{n}\, p \sum_{v=V_n+1}^{n(1-p)+1} \frac{1}{v} \Bigl| W_{n,v+1} \prod_{j=0}^{v-1} \Bigl( 1 - p - \frac{j}{n} \Bigr) - (1-p)^v \Bigr|
+ \sqrt{n}\, p \sum_{v=n(1-p)+2}^{\infty} \frac{(1-p)^v}{v}
\overset{def}{=} \Delta_1 + \Delta_2 + \Delta_3.
\end{aligned}
\]
For $p \in [c, 1)$,
\[
\Delta_1 \le \sqrt{n} \sum_{v=1}^{V_n} \frac{v+1}{2(n-v)} \le \frac{\sqrt{n}\, V_n (V_n + 3)}{4(n - V_n)} = O(n^{-1/4}) \to 0.
\]
For $\Delta_2$, since every factor $a_j$ lies in $[0,1]$ and $a_j \le (1-p)n/(n - V_n - 1)$ for $j \le V_n$, each summand satisfies
\[
W_{n,v+1} \prod_{j=0}^{v-1} \Bigl( 1 - p - \frac{j}{n} \Bigr) \le \left( \frac{(1-c)n}{n - V_n - 1} \right)^{V_n+1},
\qquad (1-p)^v \le (1-c)^{V_n+1},
\]
and, the number of summands being at most $n$,
\[
\Delta_2 \le n^{3/2} \left[ \left( \frac{(1-c)n}{n - V_n - 1} \right)^{V_n+1} + (1-c)^{V_n+1} \right] \to 0,
\]
because $(1-c)n/(n - V_n - 1) < 1 - c/2$ for all sufficiently large $n$ and $n^{3/2}(1 - c/2)^{n^{1/8}} \to 0$. Finally,
\[
\Delta_3 \le \frac{\sqrt{n}\, p}{n(1-p)+2} \sum_{v=n(1-p)+2}^{\infty} (1-p)^v = \frac{\sqrt{n}\,(1-p)^{n(1-p)+2}}{n(1-p)+2} \le n^{-1/2} (1-p)^{n(1-p)+1} \to 0,
\]
uniformly in $p \in [c, 1)$. Hence $\sup_{p \in [c,1)} \sqrt{n}\,|h_n(p) - h(p)| \to 0$.

Proof of Part 2. The proof is identical to that of Part 1 up to the decomposition $\Delta_1 + \Delta_2 + \Delta_3$, with each term now evaluated on the interval $[1/n, c]$. (Since $n(1-p)+1$ evaluated at $p = 1/n$ is $n > n - 1$, the range of $v$ is $1$ to $\min\{n-1,\, n(1-p)+1\}$ throughout.) The bound on $\Delta_1$ does not depend on the location of $p$, so $\Delta_1 = O(n^{-1/4})$. For $\Delta_2$, each summand satisfies, for $p \ge 1/n$,
\[
W_{n,v+1} \prod_{j=0}^{v-1} \Bigl( 1 - p - \frac{j}{n} \Bigr) \le a_0 = \frac{(1-p)n}{n-1} \le \frac{n}{n-1},
\]
since the remaining factors are at most $1$; hence
\[
\Delta_2 \le \sqrt{n}\, p \sum_{v=V_n+1}^{\min\{n-1,\, n(1-p)+1\}} \frac{1}{v} \left[ \frac{n}{n-1} + (1-p)^v \right] < \sqrt{n} \cdot n \cdot \frac{2n}{n-1} = O(n^{3/2}).
\]
Similarly,
\[
\Delta_3 \le \sqrt{n}\, p \sum_{v=n(1-p)+2}^{\infty} (1-p)^v = \sqrt{n}\,(1-p)^{n(1-p)+2} \le \sqrt{n} = O(n^{1/2}).
\]
Therefore $\Delta_1 + \Delta_2 + \Delta_3 = O(n^{3/2})$.
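Since $h_n$ is a finite, explicitly computable sum, the convergence $h_n(p) \to h(p)$ in Part 1 can also be observed numerically. Below is a small Python sketch; the function name is mine, and the update uses the identity $W_{n,v+1} \prod_{j<v}(1 - p - j/n) = \prod_{j<v}(n - np - j)/(n - 1 - j)$ (my algebra, one factor per step).

```python
from math import log

def h_n(n, p):
    """Sketch of h_n(p) = p * g_n(p), the finite-sample analogue of -p*ln(p)."""
    t, s = 1.0, 0.0
    for v in range(1, n):                  # v = 1, ..., n-1
        t *= (n - n * p - (v - 1)) / (n - v)
        if t <= 0.0:                       # product vanishes beyond v = n(1-p)+1
            break
        s += t / v
    return p * s

# h_n(p) approaches h(p) = -p*ln(p) as n grows; at p = 0.5:
for n in (10, 100, 1000):
    print(n, abs(h_n(n, 0.5) - 0.5 * log(2)))
```

At $p = 0.5$ and $n = 10$, for instance, $h_n = 1879/5040 \approx 0.3728$ against $h = 0.5 \ln 2 \approx 0.3466$, and the gap shrinks as $n$ grows.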
Proof of Part 3. Let $Z$ and $\varphi(z)$ be a standard normal random variable and its density function respectively, and let $\simeq$ denote asymptotic equality. Since $\sqrt{n}\,(\hat{p} - p) \xrightarrow{L} N(0, p(1-p))$,
\[
P(\hat{p} \le c) \simeq \int_{-\infty}^{-\sqrt{n}(p-c)/\sqrt{p(1-p)}} \varphi(z)\, dz
= \int_{\sqrt{n}(p-c)/\sqrt{p(1-p)}}^{\infty} \varphi(z)\, dz
\le \frac{\sqrt{p(1-p)}}{\sqrt{2\pi n}\,(p-c)} \exp\left\{ -\frac{n(p-c)^2}{2p(1-p)} \right\},
\]
where the last step uses $\int_x^{\infty} \varphi(z)\, dz \le \varphi(x)/x$ for $x > 0$. Hence $P(\hat{p} \le c) \le B(n) = O(n^{-1/2} \exp\{-nC\})$ with $C = (p-c)^2/[2p(1-p)]$.

Proof of Theorem 1. Without loss of generality, consider the sample proportions of the first two letters of the alphabet, $\hat{p}_1$ and $\hat{p}_2$, in an iid sample of size $n$:
\[
\sqrt{n}\,(\hat{p}_1 - p_1, \hat{p}_2 - p_2)^{\tau} \xrightarrow{L} N(0, \Sigma),
\]
where $\Sigma = (\sigma_{ij})$, $i, j = 1, 2$, $\sigma_{ii} = p_i(1 - p_i)$ and $\sigma_{ij} = -p_i p_j$ when $i \ne j$. Write
\[
\begin{aligned}
&\sqrt{n} \left\{ [h_n(\hat{p}_1) + h_n(\hat{p}_2)] - [-p_1 \ln(p_1) - p_2 \ln(p_2)] \right\} \\
&\quad = \sqrt{n}\,[h_n(\hat{p}_1) - h(\hat{p}_1)]\, 1[\hat{p}_1 \le p_1/2] + \sqrt{n}\,[h_n(\hat{p}_2) - h(\hat{p}_2)]\, 1[\hat{p}_2 \le p_2/2] \\
&\qquad + \sqrt{n}\,[h_n(\hat{p}_1) - h(\hat{p}_1)]\, 1[\hat{p}_1 > p_1/2] + \sqrt{n}\,[h_n(\hat{p}_2) - h(\hat{p}_2)]\, 1[\hat{p}_2 > p_2/2] \\
&\qquad + \sqrt{n} \left\{ [h(\hat{p}_1) + h(\hat{p}_2)] - [-p_1 \ln(p_1) - p_2 \ln(p_2)] \right\}.
\end{aligned}
\]
The third and fourth terms above converge to zero almost surely by Part 1 of Lemma 2. The last term, by the delta method, converges in law to $N(0, \tau^2)$ where, after a few algebraic steps,
\[
\begin{aligned}
\tau^2 &= [\ln(p_1) + 1]^2 p_1 (1 - p_1) + [\ln(p_2) + 1]^2 p_2 (1 - p_2) - 2[\ln(p_1) + 1][\ln(p_2) + 1] p_1 p_2 \\
&= [\ln(p_1) + 1]^2 p_1 + [\ln(p_2) + 1]^2 p_2 - \left\{ [\ln(p_1) + 1] p_1 + [\ln(p_2) + 1] p_2 \right\}^2.
\end{aligned}
\]
It remains to show that the first term (the second term will admit the same argument) converges to zero in probability. However this fact can be established by the following argument. By Part 2 and then Part 3 of Lemma 2,
\[
E\left\{ \sqrt{n}\, |h_n(\hat{p}_1) - h(\hat{p}_1)|\, 1[\hat{p}_1 \le p_1/2] \right\}
\le A(n)\, P(\hat{p}_1 \le p_1/2) \le A(n) B(n) = O(n^{3/2})\, O(n^{-1/2} \exp\{-nC\}) \to 0
\]
for some positive constant $C$. This fact, noting that $\sqrt{n}\,|h_n(\hat{p}_1) - h(\hat{p}_1)| \ge 0$, immediately gives the desired convergence in probability, that is, $\sqrt{n}\,|h_n(\hat{p}_1) - h(\hat{p}_1)|\, 1[\hat{p}_1 \le p_1/2] \xrightarrow{p} 0$. In turn, it gives the desired weak convergence for $\sqrt{n} \{ [h_n(\hat{p}_1) + h_n(\hat{p}_2)] - [-p_1 \ln(p_1) - p_2 \ln(p_2)] \}$. By generalization to $K$ terms, $\sqrt{n}\,(\hat{H}_z - H) \xrightarrow{L} N(0, \sigma^2)$ where, letting $p_X$ denote the random variable that assumes the value $p_k$ when $X$ assumes $\ell_k$,
\[
\sigma^2 = \sum_{k=1}^{K} \left\{ -[\ln(p_k) + 1] \right\}^2 p_k - \left\{ \sum_{k=1}^{K} \left\{ -[\ln(p_k) + 1] \right\} p_k \right\}^2
= \mathrm{Var}\left[ -\ln(p_X) - 1 \right] = \mathrm{Var}[-\ln(p_X)].
\]

Remark 1. It may be interesting to note that the asymptotic variance of $\sqrt{n}\,(\hat{H}_z - H)$ is identical to that of $\sqrt{n}\,(\hat{H} - H)$, where $\hat{H}$ is the plug-in.

Remark 2. When $\{p_k\}$ is a uniform distribution, $\ln(p_X)$ is constant, $\mathrm{Var}[-\ln(p_X)] = 0$, and therefore $\sqrt{n}\,(\hat{H}_z - H)$ asymptotically degenerates.

Let $\zeta_{1,v} = \sum_{k=1}^{K} p_k (1 - p_k)^v$, let $C_v = \sum_{i=1}^{v-1} [i(v-i)]^{-1}$ for $v \ge 2$ and define $C_1 = 0$, and let
\[
Z_{1,v} = \frac{n^{v+1}[(n-v-1)!]}{n!} \sum_{k=1}^{K} \left[ \hat{p}_k \prod_{j=0}^{v-1} \left( 1 - \hat{p}_k - \frac{j}{n} \right) \right],
\]
so that $\hat{H}_z^{(2)} = \sum_{v=1}^{n-1} C_v Z_{1,v}$.
For clarity in proving Corollary 1, a few notations and two well-known lemmas on U-statistics are first given. For each $i$, $1 \le i \le n$, let $X_i$ be a random variable such that $X_i = \ell_k$ indicates the event that the $k$th letter of the alphabet is observed, and $P(X_i = \ell_k) = p_k$. Let $X_1, \dots, X_n$ be an iid sample, and denote by $x_1, \dots, x_n$ the corresponding sample realization. A U-statistic is an $n$-variable function obtained by averaging the values of an $m$-variable function (a kernel of degree $m$, often denoted by $\psi$) over all $n!/[m!(n-m)!]$ possible subsets of $m$ variables from the set of $n$ variables. Interested readers may refer to Lee (1990) for an introduction.

Turing's formula, also known as the Good-Turing estimator, is a nonparametric estimator introduced by Good (1953), but largely credited to Alan Turing, as a means of estimating the total probability associated with letters in the alphabet that are not represented in a random sample. In Zhang & Zhou (2010), it is shown that $Z_{1,v}$ is a U-statistic whose kernel $\psi$ is Turing's formula with degree $m = v + 1$. Let $\psi_c(x_1, \dots, x_c) = E[\psi(x_1, \dots, x_c, X_{c+1}, \dots, X_m)]$ and $\sigma_c^2 = \mathrm{Var}[\psi_c(X_1, \dots, X_c)]$. Lemmas 3 and 4 below are due to Hoeffding (1948).

Lemma 3. Let $U_n$ be a U-statistic with kernel $\psi$ of degree $m$. Then
\[
\mathrm{Var}(U_n) = \binom{n}{m}^{-1} \sum_{c=1}^{m} \binom{m}{c} \binom{n-m}{m-c} \sigma_c^2.
\]

Lemma 4. Let $U_n$ be a U-statistic with kernel $\psi$ of degree $m$. For $0 \le c \le d \le m$, $\sigma_c^2 / c \le \sigma_d^2 / d$.

Lemma 5. $\mathrm{Var}(Z_{1,v}) \le n^{-1} \zeta_{1,v} + (v+1) n^{-1} \zeta_{1,v-1}^2$.

Proof. Let $m = v + 1$. By Lemma 3, Lemma 4 (which gives $\sigma_c^2 \le c\, \sigma_m^2 / m$), and the identity $\binom{n}{m}^{-1} \sum_{c=1}^{m} c \binom{m}{c} \binom{n-m}{m-c} = m^2/n$,
\[
\mathrm{Var}(Z_{1,v}) = \binom{n}{m}^{-1} \sum_{c=1}^{m} \binom{m}{c} \binom{n-m}{m-c} \sigma_c^2
\le \frac{\sigma_m^2}{m} \binom{n}{m}^{-1} \sum_{c=1}^{m} c \binom{m}{c} \binom{n-m}{m-c}
= \frac{m}{n} \sigma_m^2. \tag{5}
\]
Consider $\sigma_m^2 = \mathrm{Var}[\psi(X_1, \dots, X_m)] = E[\psi(X_1, \dots, X_m)]^2 - \bigl[ \sum_{k=1}^{K} p_k (1-p_k)^{m-1} \bigr]^2 \le E[\psi(X_1, \dots, X_m)]^2$. Let $y_k^{(m)}$ denote the frequency of the $k$th letter in a sample of size $m$, so that $\psi = m^{-1} \sum_{k=1}^{K} 1[y_k^{(m)} = 1]$. Then
\[
\begin{aligned}
E[\psi(X_1, \dots, X_m)]^2
&= \frac{1}{m^2} E\Bigl[ \sum_{k=1}^{K} 1[y_k^{(m)} = 1] + 2 \sum_{1 \le k < k' \le K} 1[y_k^{(m)} = 1]\, 1[y_{k'}^{(m)} = 1] \Bigr] \\
&= \frac{1}{m} \sum_{k=1}^{K} p_k (1 - p_k)^{m-1} + \frac{2(m-1)}{m} \sum_{1 \le k < k' \le K} p_k p_{k'} (1 - p_k - p_{k'})^{m-2} \\
&\le \frac{1}{m} \sum_{k=1}^{K} p_k (1 - p_k)^{m-1} + 2 \sum_{1 \le k < k' \le K} p_k p_{k'} (1 - p_k - p_{k'} + p_k p_{k'})^{m-2} \\
&= \frac{1}{m} \sum_{k=1}^{K} p_k (1 - p_k)^{m-1} + 2 \sum_{1 \le k < k' \le K} p_k p_{k'} \left[ (1 - p_k)(1 - p_{k'}) \right]^{m-2} \\
&\le \frac{1}{m} \sum_{k=1}^{K} p_k (1 - p_k)^{m-1} + \Bigl[ \sum_{k=1}^{K} p_k (1 - p_k)^{m-2} \Bigr]^2
= \frac{1}{m} \zeta_{1,m-1} + \zeta_{1,m-2}^2.
\end{aligned}
\]
By (5), $\mathrm{Var}(Z_{1,v}) \le n^{-1} \zeta_{1,v} + (v+1) n^{-1} \zeta_{1,v-1}^2$.

Proof of Corollary 1. By Zhang & Zhou (2010), $E(Z_{1,v}) = \sum_{k=1}^{K} p_k (1 - p_k)^v = \zeta_{1,v}$, and therefore, since $\ln^2(p) = \sum_{v=2}^{\infty} C_v (1-p)^v$ for $p \in (0, 1]$,
\[
E(\hat{H}_z^{(2)}) = \sum_{v=1}^{n-1} C_v \sum_{k=1}^{K} p_k (1 - p_k)^v
= \sum_{k=1}^{K} p_k \sum_{v=1}^{n-1} C_v (1 - p_k)^v
\to \sum_{k=1}^{K} p_k [\ln(p_k)]^2 = \sum_{k=1}^{K} p_k \ln^2(p_k) = H^{(2)}.
\]
It only remains to show that $\mathrm{Var}(\hat{H}_z^{(2)}) \to 0$. Note
\[
\mathrm{Var}(\hat{H}_z^{(2)}) = \sum_{v=1}^{n-1} \sum_{w=1}^{n-1} C_v C_w\, \mathrm{cov}(Z_{1,v}, Z_{1,w})
\le \sum_{v=1}^{n-1} \sum_{w=1}^{n-1} C_v C_w \sqrt{\mathrm{Var}(Z_{1,v})} \sqrt{\mathrm{Var}(Z_{1,w})}
= \Bigl[ \sum_{v=1}^{n-1} C_v \sqrt{\mathrm{Var}(Z_{1,v})} \Bigr]^2.
\]
Also
\[
\zeta_{1,v} = \sum_{k=1}^{K} p_k (1 - p_k)^v \le \sum_{k=1}^{K} p_k (1 - p_0)^v = (1 - p_0)^v,
\]
where $p_0 = \min\{p_k > 0;\, k = 1, \dots, K\}$, and therefore, from Lemma 5, for $v \ge 2$,
\[
\mathrm{Var}(Z_{1,v}) \le \frac{v+2}{n} (1 - p_0)^{v-1}.
\]
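Corollary 1 turns the pair $(\hat{H}_z, \hat{H}_z^{(2)})$ into a normal confidence interval for $H$. Below is a hedged Python sketch under my reconstruction of (2) and (3), with $C_1 = 0$ and $C_v = \sum_{i=1}^{v-1} [i(v-i)]^{-1}$; all function names are mine. Since $\hat{H}_z^{(2)} - \hat{H}_z^2$ can be negative in very small samples, the variance estimate is floored at zero here, which is my own guard and not part of the paper.

```python
import numpy as np

def z_stats(counts):
    """Z_{1,v} for v = 1, ..., n-1, via a multiplicative update of the product."""
    y = np.array([c for c in counts if c > 0], dtype=float)
    n = int(y.sum())
    t = y / n                       # running product, initialized at p_hat_k
    z = np.empty(n - 1)
    for v in range(1, n):
        t = t * (n - y - (v - 1)) / (n - v)
        z[v - 1] = t.sum()
    return n, z

def entropy_ci(counts, zq=1.96):
    """Point estimate H_z, second-moment estimate H_z^(2), and a normal CI."""
    n, z = z_stats(counts)
    v = np.arange(1, n)
    hz = np.sum(z / v)                                         # H_z, eq. (2)
    c = np.array([sum(1.0 / (i * (w - i)) for i in range(1, w)) for w in v])
    h2z = np.sum(c * z)                                        # H_z^(2), eq. (3)
    se = np.sqrt(max(h2z - hz ** 2, 0.0) / n)                  # floored variance
    return hz, h2z, (hz - zq * se, hz + zq * se)
```

For the tiny sample with counts $(2, 1)$ this gives $\hat{H}_z = 5/6$ and $\hat{H}_z^{(2)} = 1/3$; the variance estimate is negative there and is floored, reflecting that the corollary is an asymptotic statement.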
As $n \to \infty$, since $C_1 = 0$, $C_v = (2/v) \sum_{i=1}^{v-1} i^{-1} \le 2$ for $v \ge 2$, and $\sum_{v=1}^{\infty} \sqrt{v+2}\,(1 - p_0)^{(v-1)/2} < \infty$,
\[
\sum_{v=1}^{n-1} C_v \sqrt{\mathrm{Var}(Z_{1,v})}
\le \frac{2}{\sqrt{n}} \sum_{v=2}^{n-1} \sqrt{v+2}\,(1 - p_0)^{(v-1)/2} = O(n^{-1/2}) \to 0,
\]
and $\mathrm{Var}(\hat{H}_z^{(2)}) \to 0$ follows. Hence $\hat{H}_z^{(2)} \xrightarrow{p} H^{(2)}$. The fact that $\hat{H}_z \xrightarrow{p} H$ is implied by Theorem 1. Finally the corollary follows from Slutsky's theorem.

Proof of Theorem 2. First consider the plug-in estimator $\hat{H}$. It can be verified that $\sqrt{n}\,(\hat{H} - H) \xrightarrow{L} N(0, \sigma^2)$, where $\sigma^2 = \sigma^2(\{p_k\})$ is as in Theorem 1. We want to show first that $\hat{H}$ is asymptotically efficient in two separate cases: (1) when $K$ is known and (2) when $K$ is unknown. If $K$ is known, then the underlying model $\{p_k;\, k = 1, \dots, K\}$ is a $(K-1)$-parameter multinomial distribution, and therefore $\hat{H}$ is the maximum likelihood estimator of $H$, which implies that it is asymptotically efficient. Since the estimator $\hat{H}$ takes the same value, given a sample, regardless of whether $K$ is known or not, its asymptotic variance is the same whether $K$ is known or not. Therefore $\hat{H}$ must be asymptotically efficient when $K$ is finite but unknown, or else it would contradict the fact that $\hat{H}$ is asymptotically efficient when $K$ is known. The asymptotic efficiency of $\hat{H}_z$ then follows from the fact that $\sqrt{n}\,(\hat{H}_z - H)$ and $\sqrt{n}\,(\hat{H} - H)$ have identical limiting distributions.

References

[1] Antos, A. and Kontoyiannis, I. (2001). Convergence properties of functional estimates for discrete distributions, Random Structures & Algorithms, Vol. 19, pp. 163-193.

[2] Basharin, G. (1959). On a statistical estimate for the entropy of a sequence of independent random variables, Theory of Probability and Its Applications, 4, pp. 333-336.

[3] Good, I.J. (1953). The population frequencies of species and the estimation of population parameters, Biometrika, 40, pp. 237-264.

[4] Harris, B. (1975). The statistical estimation of entropy in the non-parametric case, Topics in Information Theory, edited by I. Csiszar, Amsterdam: North-Holland, pp. 323-355.

[5] Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution, Annals of Mathematical Statistics, Vol. 19, No. 3, pp. 293-325.

[6] Lee, A.J. (1990). U-Statistics: Theory and Practice, Marcel Dekker, Inc., New York.

[7] Miller, G. (1955). Note on the bias of information estimates, Information Theory in Psychology II-B, ed. H. Quastler, Glencoe, IL: Free Press, pp. 95-100.

[8] Nemenman, I., Shafee, F. and Bialek, W. (2002). Entropy and inference, revisited, Advances in Neural Information Processing Systems 14, Cambridge, MA: MIT Press.

[9] Paninski, L. (2003). Estimation of entropy and mutual information, Neural Computation, 15, pp. 1191-1253.

[10] Shannon, C.E. (1948). A mathematical theory of communication, Bell System Technical Journal, 27, pp. 379-423 and pp. 623-656.

[11] Strong, S.P., Koberle, R., de Ruyter van Steveninck, R.R. and Bialek, W. (1998). Entropy and information in neural spike trains, Physical Review Letters, 80(1), pp. 197-200.

[12] Zahl, S. (1977). Jackknifing an index of diversity, Ecology, 58, pp. 907-913.

[13] Zhang, Z. (2012). Entropy estimation in Turing's perspective, Neural Computation, Vol. 24, No. 5, pp. 1368-1389.

[14] Zhang, Z. and Zhou, J. (2010). Re-parameterization of multinomial distribution and diversity indices, Journal of Statistical Planning and Inference, Vol. 140, No. 7, pp. 1731-1738.