Asymptotic Normality of an Entropy Estimator with Exponentially Decaying Bias


Zhiyi Zhang
Department of Mathematics and Statistics
University of North Carolina at Charlotte
Charlotte, NC 28223

Abstract. This paper establishes the asymptotic normality of an entropy estimator with an exponentially decaying bias on any finite alphabet. Furthermore, it is shown that the nonparametric estimator is asymptotically efficient.

AMS 2000 Subject Classifications. Primary 62F10, 62F12, 62G05, 62G20; secondary 62F15.
Keywords and phrases. Turing's formula, nonparametric entropy estimation, asymptotic normality.
Research partially supported by NSF Grant DMS 1004769.

1 Introduction

Let $\{p_k\}$ be a probability distribution on a finite alphabet $\mathcal{X} = \{\ell_k;\ k = 1, \dots, K\}$, where $K \ge 2$ is a finite integer. Let $p_X$ be a random variable such that $P(p_X = p_k) = p_k$. Entropy in the form of
$$ H = E[-\ln(p_X)] = -\sum_{k=1}^{K} p_k \ln(p_k), \qquad (1) $$
was introduced by Shannon (1948) and is often referred to as Shannon's entropy. Nonparametric estimation of $H$ has been a subject of much research for many decades. Miller (1955) and Basharin (1959) were perhaps among the first to study the intuitive general nonparametric estimator $\hat{H} = -\sum_{k=1}^{K} \hat{p}_k \ln(\hat{p}_k)$, where $\hat{p}_k$ is the sample relative frequency of the $k$th letter $\ell_k$; this is also known as the plug-in estimator. Others have investigated the topic in various forms and directions over the years. Many important references can be found in Antos and Kontoyiannis (2001) and Paninski (2003).
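As a concrete point of reference for what follows, the plug-in estimator can be computed directly from observed letter counts. A minimal Python sketch (an illustration; the counts below are a hypothetical example):

```python
import numpy as np

def plug_in_entropy(counts):
    # H_hat = -sum_k p_hat_k * ln(p_hat_k), with the convention 0 * ln(0) = 0
    counts = np.asarray(counts, dtype=float)
    p_hat = counts / counts.sum()
    p_hat = p_hat[p_hat > 0]
    return -np.sum(p_hat * np.log(p_hat))

# Hypothetical observed counts on a K = 4 letter alphabet
print(plug_in_entropy([40, 30, 20, 10]))
```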

Among the many difficult issues of nonparametric entropy estimation, much research effort in the literature seems to be placed on reducing the bias of the estimators. The main reference point of such discussion is the $O(n^{-1})$ decaying bias of the plug-in $\hat{H}$, whose form may be found in Harris (1975). Many bias-adjusted nonparametric estimators have been proposed, and all of them have been shown to reduce bias in certain numerical studies. However, the rates of bias decay for most of the bias-adjusted estimators are largely unknown, and there is no clear theoretical evidence why any of these proposed estimators should improve the bias decay to a rate faster than $O(n^{-1})$. Zhang (2012) proposed an estimator $\hat{H}_z$, as given in (2) below, and showed that the associated bias decays at a rate no slower than $O(n(1-p_0)^n)$, where $p_0 = \min\{p_k > 0;\ k = 1, \dots, K\}$. In addition, Zhang (2012) established a uniform variance upper bound for the entire class of distributions with finite entropy that decays at a rate of $O(\ln(n)/n)$, compared to $O([\ln(n)]^2/n)$ for the plug-in; that in a wide range of subclasses the variance of the proposed estimator converges at a rate of $O(1/n)$; and that the aforementioned rate of convergence carries over to the convergence rates in mean squared errors in many subclasses. The computational performances of $\hat{H}_z$, and of its variants, were compared favorably with several other commonly known estimators, such as the jackknife estimator by Zahl (1977) and Strong, Koberle, de Ruyter van Steveninck and Bialek (1998), and the NSB estimator by Nemenman, Shafee and Bialek (2002).

Let $\{y_k\}$ be the sequence of observed counts of letters in the alphabet in an independently and identically distributed (iid) sample of size $n$, and let $\hat{p}_k = y_k/n$. The general nonparametric estimator of entropy proposed by Zhang (2012) is
$$ \hat{H}_z = \sum_{v=1}^{n-1} \frac{1}{v}\, \frac{n^{v+1}\,[n-(v+1)]!}{n!} \sum_{k=1}^{K} \hat{p}_k \prod_{j=0}^{v-1}\Big(1 - \hat{p}_k - \frac{j}{n}\Big). \qquad (2) $$
This paper establishes two normal laws of $\hat{H}_z$, as stated in Theorem 1 and Corollary 1 below, and the asymptotic efficiency of $\hat{H}_z$ is given in Theorem 2.
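A minimal Python sketch of (2) as reconstructed above (an illustration only; the counts are hypothetical). The factor $n^{v+1}[n-(v+1)]!/n!$ and the inner product over $j$ are accumulated as running products so that no factorials are formed explicitly:

```python
import numpy as np

def entropy_zhang(counts):
    """Estimator H_z of equation (2), evaluated with running products."""
    y = np.asarray(counts, dtype=float)
    n = int(y.sum())
    p_hat = y / n
    h_z = 0.0
    # After the update for a given v, term[k] equals
    # p_hat_k * (n^{v+1}[n-(v+1)]!/n!) * prod_{j=0}^{v-1}(1 - p_hat_k - j/n).
    term = p_hat.copy()
    for v in range(1, n):
        term = term * (1.0 - p_hat - (v - 1) / n) * (n / (n - v))
        h_z += term.sum() / v
    return h_z

counts = [40, 30, 20, 10]      # hypothetical observed counts
print(entropy_zhang(counts))
```

For counts with $y_k \ge 1$, the factor $(1 - \hat{p}_k - (v-1)/n)$ reaches zero at $v = n - y_k + 1$, so the $k$th product vanishes beyond that point and the effective truncation of the sum is handled automatically.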

Let $H^{(2)} = E\{[\ln(p_X)]^2\} = \sum_{k=1}^{K} p_k \ln^2(p_k)$.

Theorem 1. Let $\{p_k;\ k = 1, \dots, K\}$ be a non-uniform probability distribution on a finite alphabet $\mathcal{X}$ and let $\hat{H}_z$ be as in (2). Then $\sqrt{n}\,(\hat{H}_z - H) \stackrel{L}{\longrightarrow} N(0, \sigma^2)$, where $\sigma^2 = \mathrm{Var}[-\ln(p_X)] = H^{(2)} - H^2$.

Let
$$ \hat{H}^{(2)}_z = \sum_{v=1}^{n-1} \left\{ \left( \sum_{i=1}^{v-1} \frac{1}{i(v-i)} \right) \left\{ \frac{n^{v+1}\,[n-(v+1)]!}{n!} \left[ \sum_{k=1}^{K} \hat{p}_k \prod_{m=0}^{v-1}\Big(1 - \hat{p}_k - \frac{m}{n}\Big) \right] \right\} \right\}. \qquad (3) $$

Corollary 1. Let $\{p_k;\ k = 1, \dots, K\}$ be a non-uniform probability distribution on a finite alphabet, let $\hat{H}_z$ be as in (2), and let $\hat{H}^{(2)}_z$ be as in (3). Then
$$ \frac{\sqrt{n}\,(\hat{H}_z - H)}{\sqrt{\hat{H}^{(2)}_z - \hat{H}^2_z}} \stackrel{L}{\longrightarrow} N(0, 1). $$

Theorem 2. Let $\{p_k;\ k = 1, \dots, K\}$ be a non-uniform probability distribution on a finite alphabet $\mathcal{X}$. Then $\hat{H}_z$ is asymptotically efficient.
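Corollary 1 yields a normal-approximation confidence interval for $H$ that is computable entirely from the sample. The following Python sketch (an illustration, assuming the reconstructions of (2) and (3) above; the counts are hypothetical) computes $\hat{H}_z$, $\hat{H}^{(2)}_z$, and a nominal 95% interval:

```python
import numpy as np

def zhang_estimates(counts):
    """Return (H_z, H2_z) from equations (2) and (3), via running products."""
    y = np.asarray(counts, dtype=float)
    n = int(y.sum())
    p_hat = y / n
    h_z, h2_z = 0.0, 0.0
    term = p_hat.copy()           # accumulates p_hat_k * W_{n,v+1} * prod_j(1 - p_hat_k - j/n)
    for v in range(1, n):
        term = term * (1.0 - p_hat - (v - 1) / n) * (n / (n - v))
        z_v = term.sum()          # Z_{1,v}, the inner quantity shared by (2) and (3)
        h_z += z_v / v
        c_v = sum(1.0 / (i * (v - i)) for i in range(1, v))   # empty sum = 0 when v = 1
        h2_z += c_v * z_v
    return h_z, h2_z

counts = np.array([120, 60, 40, 20, 10])        # hypothetical counts, K = 5
n = counts.sum()
h_z, h2_z = zhang_estimates(counts)
se = np.sqrt(max(h2_z - h_z ** 2, 0.0) / n)     # estimated sigma / sqrt(n), per Corollary 1
print(h_z, (h_z - 1.96 * se, h_z + 1.96 * se))
```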

2 Proofs

$\hat{H}_z$ in (2) may be re-expressed as
$$ \hat{H}_z = \sum_{k=1}^{K} \hat{p}_k \left\{ \sum_{v=1}^{n-1} \frac{1}{v}\, \frac{n^{v+1}\,[n-(v+1)]!}{n!} \prod_{j=0}^{v-1}\Big(1 - \hat{p}_k - \frac{j}{n}\Big) \right\} \stackrel{def}{=} \sum_{k=1}^{K} \hat{p}_k\, \hat{g}_{k,n}. \qquad (4) $$
Of first interest is an asymptotic normal law of $\hat{p}_k \hat{g}_{k,n}$. For simplicity, consider first a binomial distribution with parameters $n$ and $p \in (0, 1)$, and the functions
$$ g_n(p) = \sum_{v=1}^{n-1} \frac{1}{v}\, \frac{n^{v+1}\,[n-(v+1)]!}{n!} \prod_{j=0}^{v-1}\Big(1 - p - \frac{j}{n}\Big)\, 1[v \le n(1-p)+1], \qquad h_n(p) = p\, g_n(p). $$
Let $h(p) = -p\ln(p)$. Lemma 1 below is easily proved by induction.

Lemma 1. Let $a_v$, $v = 1, \dots, n$, be complex numbers satisfying $|a_v| \le 1$ for every $v$. Then
$$ \Big| \prod_{v=1}^{n} a_v - 1 \Big| \le \sum_{v=1}^{n} |a_v - 1|. $$

Lemma 2. Let $\hat{p} = X/n$ where $X$ is a binomial random variable with parameters $n$ and $p$.
1. $\sqrt{n}\,[h_n(p) - h(p)] \to 0$ uniformly in $p \in [c, 1)$ for any $c$, $0 < c < 1$.
2. $\sqrt{n}\,|h_n(p) - h(p)| < A(n) = O(n^{3/2})$ uniformly in $p \in [1/n, c]$ for any $c$, $0 < c < p$.
3. $P(\hat{p} \le c) < B(n) = O(n^{-1/2}\exp\{-nC\})$, where $C = (p-c)^2/[2p(1-p)]$, for any $c \in (0, p)$.
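A quick numerical illustration of Part 1 may help fix ideas: for a fixed $p$, $\sqrt{n}\,|h_n(p) - h(p)|$ should shrink as $n$ grows. The Python sketch below (an illustration only, assuming the reconstruction of $g_n$ and $h_n$ above) evaluates the two functions directly:

```python
import numpy as np

def h_n(p, n):
    """h_n(p) = p * g_n(p), with the sum stopped once 1 - p - (v-1)/n <= 0,
    which plays the role of the indicator v <= n(1-p)+1 in g_n."""
    total, term = 0.0, 1.0      # term = W_{n,v+1} * prod_{j=0}^{v-1}(1 - p - j/n)
    for v in range(1, n):
        factor = 1.0 - p - (v - 1) / n
        if factor <= 0.0:
            break
        term *= factor * n / (n - v)
        total += term / v
    return p * total

p = 0.3
for n in (50, 200, 800, 3200):
    print(n, np.sqrt(n) * abs(h_n(p, n) + p * np.log(p)))   # h(p) = -p ln(p)
```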

Proof of Part 1. As the notation in $g_n(p)$ suggests, the range for $v$ is from $1$ to $\min\{n-1,\ n(1-p)+1\}$. For any $v$ in that range, let
$$ W_{n,v+1} = \frac{n^{v+1}\,[n-(v+1)]!}{n!} = \prod_{j=1}^{v}\Big(1 - \frac{j}{n}\Big)^{-1}, $$
and, for $j = 1, \dots, v-1$, let $a_j = (1 - p - j/n)/[(1-p)(1 - j/n)]$, so that
$$ W_{n,v+1} \prod_{j=0}^{v-1}\Big(1 - p - \frac{j}{n}\Big) = (1-p)^v \Big(1 - \frac{v}{n}\Big)^{-1} \prod_{j=1}^{v-1} a_j. $$
Noting $0 \le j/[n(1-p)] \le 1$, and hence $0 \le a_j \le 1$, subject to $j \le n(1-p)$, by Lemma 1,
$$ \Big| \prod_{j=1}^{v-1} a_j - 1 \Big| \le \sum_{j=1}^{v-1} |a_j - 1| = \sum_{j=1}^{v-1} \frac{p\,j}{(1-p)(n-j)}, $$
and therefore
$$ \Big| W_{n,v+1} \prod_{j=0}^{v-1}\Big(1 - p - \frac{j}{n}\Big) - (1-p)^v \Big| \le (1-p)^v\,\frac{n}{n-v} \left[ \sum_{j=1}^{v-1} \frac{p\,j}{(1-p)(n-j)} + \frac{v}{n} \right] \le (1-p)^v\,\frac{n}{n-v} \left[ \frac{p\,v^2}{2(1-p)(n-v)} + \frac{v}{n} \right]. $$

For a sufficiently large $n$, let $V_n = n^{1/8}$. Then
$$ \sqrt{n}\,|h_n(p) - h(p)| \le \sqrt{n}\, p \sum_{v=1}^{V_n} \frac{1}{v}\Big| W_{n,v+1} \prod_{j=0}^{v-1}\Big(1 - p - \frac{j}{n}\Big) - (1-p)^v \Big| + \sqrt{n}\, p \sum_{v=V_n+1}^{n(1-p)+1} \frac{1}{v}\Big| W_{n,v+1} \prod_{j=0}^{v-1}\Big(1 - p - \frac{j}{n}\Big) - (1-p)^v \Big| + \sqrt{n}\, p \sum_{v=n(1-p)+2}^{\infty} \frac{1}{v}(1-p)^v \stackrel{def}{=} \Delta_1 + \Delta_2 + \Delta_3. $$

For $v \le V_n$ and all sufficiently large $n$, $n - v \ge n/2$, and since $(1-p)^v \le 1-p$, the bound above gives $|W_{n,v+1}\prod_{j=0}^{v-1}(1 - p - j/n) - (1-p)^v| \le 4v^2/n$, so that
$$ \Delta_1 \le n^{5/8} \max_{1 \le v \le V_n} \Big| W_{n,v+1} \prod_{j=0}^{v-1}\Big(1 - p - \frac{j}{n}\Big) - (1-p)^v \Big| \le \frac{4\,n^{5/8}\,V_n^2}{n} = 4\,n^{-1/8} \to 0. $$
For $V_n < v \le n(1-p)+1$, both $W_{n,v+1}\prod_{j=0}^{v-1}(1 - p - j/n)$ and $(1-p)^v$ are nonnegative and at most $n(1-p)^{V_n+1}$ (since $\prod_{j=1}^{v-1} a_j \le 1$ and $(1-v/n)^{-1} \le n$), and there are at most $n$ such terms, so for $p \in [c, 1)$,
$$ \Delta_2 \le \sqrt{n}\, p \cdot n \cdot 2n(1-p)^{V_n+1} \le 2\,n^{5/2}(1-c)^{n^{1/8}} \to 0. $$
Finally,
$$ \Delta_3 \le \sqrt{n}\, p \sum_{v=n(1-p)+2}^{\infty} (1-p)^v = \sqrt{n}\,(1-p)^{n(1-p)+2} \to 0 $$
uniformly in $p \in [c, 1)$ (consider separately $1-p \le n^{-1/3}$ and $1-p > n^{-1/3}$). Hence $\sup_{p \in [c,1)} \sqrt{n}\,|h_n(p) - h(p)| \to 0$.

Proof of Part 2. The proof is identical to that of Part 1 above until the expression $\Delta_1 + \Delta_2 + \Delta_3$, where each term is now to be evaluated on the interval $[1/n, c]$. It is clear that $\Delta_1 = O(n^{3/8})$. For $\Delta_2$, since $n(1-p)+1$ at $p = 1/n$ is $n - 1 + 1 = n > n-1$, and since $W_{n,v+1}\prod_{j=0}^{v-1}(1 - p - j/n) \le n(1-p)^v$, we have
$$ \Delta_2 \le \sqrt{n}\, p \sum_{v=V_n+1}^{\min\{n-1,\ n(1-p)+1\}} \frac{1}{v}\Big[ W_{n,v+1}\prod_{j=0}^{v-1}\Big(1 - p - \frac{j}{n}\Big) + (1-p)^v \Big] \le \sqrt{n}\, p\,(n+1) \sum_{v=1}^{\infty}(1-p)^v = \sqrt{n}\,(n+1)(1-p) = O(n^{3/2}). $$
Also,
$$ \Delta_3 \le \sqrt{n}\, p \sum_{v=\min\{n-1,\ n(1-p)+1\}+1}^{\infty} (1-p)^v \le \sqrt{n}\, p\, \frac{1-p}{p} < \sqrt{n} = O(n^{1/2}). $$
Therefore $\Delta_1 + \Delta_2 + \Delta_3 = O(n^{3/2})$.

Proof of Part 3. Let $Z$ and $\varphi(z)$ be a standard normal random variable and its density function respectively, and let $\approx$ denote asymptotic equality. Since $\sqrt{n}(\hat{p} - p) \stackrel{L}{\longrightarrow} N(0, p(1-p))$,
$$ P(\hat{p} \le c) \approx \int_{-\infty}^{\sqrt{n}(c-p)/\sqrt{p(1-p)}} \varphi(z)\,dz = \int_{\sqrt{n}(p-c)/\sqrt{p(1-p)}}^{\infty} \varphi(z)\,dz < \int_{\sqrt{n}(p-c)/\sqrt{p(1-p)}}^{\infty} \frac{z}{\sqrt{n}(p-c)/\sqrt{p(1-p)}}\, \varphi(z)\,dz $$
$$ = \frac{\sqrt{p(1-p)}}{\sqrt{n}\,(p-c)}\, \varphi\!\left(\frac{\sqrt{n}(p-c)}{\sqrt{p(1-p)}}\right) = \frac{\sqrt{p(1-p)}}{\sqrt{2\pi n}\,(p-c)} \exp\left\{ -\frac{n(p-c)^2}{2p(1-p)} \right\} = O\!\left( n^{-1/2} \exp\{-nC\} \right), $$
where $C = (p-c)^2/[2p(1-p)]$.

Proof of Theorem 1. Without loss of generality, consider the sample proportions of the first two letters of the alphabet, $\hat{p}_1$ and $\hat{p}_2$, in an iid sample of size $n$. Then $\sqrt{n}\,(\hat{p}_1 - p_1,\ \hat{p}_2 - p_2) \stackrel{L}{\longrightarrow} N(0, \Sigma)$, where $\Sigma = (\sigma_{ij})$, $i, j = 1, 2$, with $\sigma_{ii} = p_i(1-p_i)$ and $\sigma_{ij} = -p_i p_j$ when $i \ne j$. Write
$$ \sqrt{n}\,\{[h_n(\hat{p}_1) + h_n(\hat{p}_2)] - [-p_1\ln(p_1) - p_2\ln(p_2)]\} $$
$$ = \sqrt{n}\,\{[h_n(\hat{p}_1) + h_n(\hat{p}_2)] - [h(\hat{p}_1) + h(\hat{p}_2)]\} + \sqrt{n}\,\{[h(\hat{p}_1) + h(\hat{p}_2)] - [-p_1\ln(p_1) - p_2\ln(p_2)]\} $$
$$ = \sqrt{n}\,[h_n(\hat{p}_1) - h(\hat{p}_1)] + \sqrt{n}\,[h_n(\hat{p}_2) - h(\hat{p}_2)] + \sqrt{n}\,\{[h(\hat{p}_1) + h(\hat{p}_2)] - [-p_1\ln(p_1) - p_2\ln(p_2)]\} $$
$$ = \sqrt{n}\,[h_n(\hat{p}_1) - h(\hat{p}_1)]\,1[\hat{p}_1 \le p_1/2] + \sqrt{n}\,[h_n(\hat{p}_2) - h(\hat{p}_2)]\,1[\hat{p}_2 \le p_2/2] $$
$$ \quad + \sqrt{n}\,[h_n(\hat{p}_1) - h(\hat{p}_1)]\,1[\hat{p}_1 > p_1/2] + \sqrt{n}\,[h_n(\hat{p}_2) - h(\hat{p}_2)]\,1[\hat{p}_2 > p_2/2] $$
$$ \quad + \sqrt{n}\,\{[h(\hat{p}_1) + h(\hat{p}_2)] - [-p_1\ln(p_1) - p_2\ln(p_2)]\}. $$
The third and fourth terms above converge to zero almost surely by Part 1 of Lemma 2.

The last term, by the delta method, converges in law to $N(0, \tau^2)$ where, after a few algebraic steps,
$$ \tau^2 = [\ln(p_1)+1]^2 p_1(1-p_1) + [\ln(p_2)+1]^2 p_2(1-p_2) - 2[\ln(p_1)+1][\ln(p_2)+1]\, p_1 p_2 $$
$$ = [\ln(p_1)+1]^2 p_1 + [\ln(p_2)+1]^2 p_2 - \{[\ln(p_1)+1]p_1 + [\ln(p_2)+1]p_2\}^2. $$
It remains to show that the first term (the second term will admit the same argument) converges to zero in probability. However, this fact can be established by the following argument. By Part 2 and then Part 3 of Lemma 2,
$$ E\{\sqrt{n}\,|h_n(\hat{p}_1) - h(\hat{p}_1)|\,1[\hat{p}_1 \le p_1/2]\} \le A(n)\,P(\hat{p}_1 \le p_1/2) \le A(n)B(n) = O(n^{3/2})\,O(n^{-1/2}\exp\{-nC\}) \to 0 $$
for some positive constant $C$. This fact, noting that $\sqrt{n}\,|h_n(\hat{p}_1) - h(\hat{p}_1)| \ge 0$, immediately gives the desired convergence in probability, that is, $\sqrt{n}\,|h_n(\hat{p}_1) - h(\hat{p}_1)|\,1[\hat{p}_1 \le p_1/2] \stackrel{p}{\longrightarrow} 0$. In turn, it gives the desired weak convergence for $\sqrt{n}\,\{[h_n(\hat{p}_1) + h_n(\hat{p}_2)] - [-p_1\ln(p_1) - p_2\ln(p_2)]\}$. By generalization to $K$ terms, $\sqrt{n}\,(\hat{H}_z - H) \stackrel{L}{\longrightarrow} N(0, \sigma^2)$ where, letting $p_X$ denote the random variable that assumes the value $p_k$ when $X$ assumes $\ell_k$,
$$ \sigma^2 = \sum_{k=1}^{K} \{-[\ln(p_k)+1]\}^2 p_k - \Big\{ \sum_{k=1}^{K} \{-[\ln(p_k)+1]\}\, p_k \Big\}^2 = \mathrm{Var}\big[-\ln(p_X) - 1\big] = \mathrm{Var}[-\ln(p_X)]. $$

Remark 1. It may be interesting to note that the asymptotic variance of $\sqrt{n}\,(\hat{H}_z - H)$ is identical to that of $\sqrt{n}\,(\hat{H} - H)$, where $\hat{H}$ is the plug-in.

Remark 2. When $\{p_k\}$ is a uniform distribution, $\ln(p_X)$ is constant, $\mathrm{Var}[-\ln(p_X)] = 0$, and therefore $\sqrt{n}\,(\hat{H}_z - H)$ asymptotically degenerates.
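Theorem 1 can be checked by simulation. The Python sketch below (an illustration, using a hypothetical non-uniform distribution) standardizes $\hat{H}_z$ over repeated multinomial samples; the reported mean and standard deviation should be close to 0 and 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_zhang(counts):
    # same running-product evaluation of (2) as in the earlier sketch
    y = np.asarray(counts, dtype=float)
    n = int(y.sum())
    p_hat = y / n
    h_z, term = 0.0, p_hat.copy()
    for v in range(1, n):
        term = term * (1.0 - p_hat - (v - 1) / n) * (n / (n - v))
        h_z += term.sum() / v
    return h_z

p = np.array([0.5, 0.25, 0.15, 0.1])           # hypothetical non-uniform p, K = 4
H = -np.sum(p * np.log(p))
sigma2 = np.sum(p * np.log(p) ** 2) - H ** 2   # Var[-ln p_X] = H^(2) - H^2
n, B = 500, 1000
stats = [np.sqrt(n) * (entropy_zhang(rng.multinomial(n, p)) - H) / np.sqrt(sigma2)
         for _ in range(B)]
print(np.mean(stats), np.std(stats))
```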

Let $\zeta_{1,v} = \sum_{k=1}^{K} p_k(1-p_k)^v$, let $C_1 = 0$ and $C_v = \sum_{i=1}^{v-1} \frac{1}{i(v-i)}$ for $v \ge 2$, and define
$$ Z_{1,v} = \frac{n^{v+1}\,[n-(v+1)]!}{n!} \sum_{k=1}^{K} \Big[ \hat{p}_k \prod_{j=0}^{v-1}\Big(1 - \hat{p}_k - \frac{j}{n}\Big) \Big], $$
and therefore $\hat{H}^{(2)}_z = \sum_{v=1}^{n-1} C_v Z_{1,v}$.

For clarity in proving Corollary 1, a few notations and two well-known lemmas in U-statistics are first given. For each $i$, $1 \le i \le n$, let $X_i$ be a random variable such that $X_i = \ell_k$ indicates the event that the $k$th letter of the alphabet is observed, and $P(X_i = \ell_k) = p_k$. Let $X_1, \dots, X_n$ be an iid sample, and denote by $x_1, \dots, x_n$ the corresponding sample realization. A U-statistic is an $n$-variable function obtained by averaging the values of an $m$-variable function (a kernel of degree $m$, often denoted by $\psi$) over all $n!/[m!(n-m)!]$ possible subsets of $m$ variables from the set of $n$ variables. Interested readers may refer to Lee (1990) for an introduction.

Turing's formula, also known as the Good-Turing estimator, is a nonparametric estimator introduced by Good (1953), but largely credited to Alan Turing, as a means of estimating the total probability associated with letters in the alphabet that are not represented in a random sample. In Zhang & Zhou (2010), it is shown that $Z_{1,v}$ is a U-statistic with kernel $\psi$ being Turing's formula with degree $m = v + 1$.
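For a sample of size $m$, Turing's formula is the proportion of the sample occupied by letters that appear exactly once, and its expectation is $\sum_k p_k(1-p_k)^{m-1} = \zeta_{1,m-1}$; with $m = v+1$ this is what makes the U-statistic $Z_{1,v}$ unbiased for $\zeta_{1,v}$. A small Python sketch of this fact (an illustration only; the distribution is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.4, 0.3, 0.2, 0.1])    # hypothetical distribution
m = 6                                  # kernel degree, i.e., m = v + 1

def turing_formula(sample, m):
    # proportion of the sample made up of letters observed exactly once
    _, counts = np.unique(sample, return_counts=True)
    return np.sum(counts == 1) / m

draws = [turing_formula(rng.choice(len(p), size=m, p=p), m) for _ in range(20000)]
zeta = np.sum(p * (1 - p) ** (m - 1))          # zeta_{1, m-1}, i.e., zeta_{1, v}
print(np.mean(draws), zeta)                    # the two should be close
```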

Let $\psi_c(x_1, \dots, x_c) = E[\psi(x_1, \dots, x_c, X_{c+1}, \dots, X_m)]$ and $\sigma_c^2 = \mathrm{Var}[\psi_c(X_1, \dots, X_c)]$. Lemmas 3 and 4 below are due to Hoeffding (1948).

Lemma 3. Let $U_n$ be a U-statistic with kernel $\psi$ of degree $m$. Then
$$ \mathrm{Var}(U_n) = \binom{n}{m}^{-1} \sum_{c=1}^{m} \binom{m}{c}\binom{n-m}{m-c}\, \sigma_c^2. $$

Lemma 4. Let $U_n$ be a U-statistic with kernel $\psi$ of degree $m$. For $0 \le c \le d \le m$, $\sigma_c^2/c \le \sigma_d^2/d$.

Lemma 5. $\mathrm{Var}(Z_{1,v}) \le \frac{1}{n}\zeta_{1,v} + \frac{v+1}{n}\zeta_{1,v-1}^2$.

Proof. Let $m = v + 1$. By Lemmas 3, 4, and the identity $\sum_{c=1}^{m} c\binom{m}{c}\binom{n-m}{m-c} = m\binom{n-1}{m-1}$,
$$ \mathrm{Var}(Z_{1,v}) = \binom{n}{m}^{-1} \sum_{c=1}^{m} \binom{m}{c}\binom{n-m}{m-c}\,\sigma_c^2 \le \frac{\sigma_m^2}{m}\,\binom{n}{m}^{-1} \sum_{c=1}^{m} c\binom{m}{c}\binom{n-m}{m-c} = \frac{m^2}{n}\,\frac{\sigma_m^2}{m} = \frac{m}{n}\,\sigma_m^2. \qquad (5) $$
Consider $\sigma_m^2 = \mathrm{Var}[\psi(X_1, \dots, X_m)] = E[\psi(X_1, \dots, X_m)^2] - \big[\sum_{k=1}^{K} p_k(1-p_k)^{m-1}\big]^2 \le E[\psi(X_1, \dots, X_m)^2]$. Let $y_k^{(m)}$ denote the frequency of the $k$th letter in the sample of size $m$. Then
$$ \sigma_m^2 \le E[\psi(X_1, \dots, X_m)^2] = E\Big[ \frac{1}{m^2}\Big( \sum_{k=1}^{K} 1[y_k^{(m)} = 1] \Big)^2 \Big] = E\Big[ \frac{1}{m^2}\Big( \sum_{k=1}^{K} 1[y_k^{(m)} = 1] + 2\sum_{1 \le k < k' \le K} 1[y_k^{(m)} = 1]\,1[y_{k'}^{(m)} = 1] \Big) \Big] $$
$$ = \frac{1}{m}\sum_{k=1}^{K} p_k(1-p_k)^{m-1} + \frac{2(m-1)}{m}\sum_{1 \le k < k' \le K} p_k p_{k'}(1 - p_k - p_{k'})^{m-2} \le \frac{1}{m}\sum_{k=1}^{K} p_k(1-p_k)^{m-1} + 2\sum_{1 \le k < k' \le K} p_k p_{k'}(1 - p_k - p_{k'} + p_k p_{k'})^{m-2} $$
$$ = \frac{1}{m}\sum_{k=1}^{K} p_k(1-p_k)^{m-1} + 2\sum_{1 \le k < k' \le K} \big[ p_k(1-p_k)^{m-2} \big]\big[ p_{k'}(1-p_{k'})^{m-2} \big] \le \frac{1}{m}\sum_{k=1}^{K} p_k(1-p_k)^{m-1} + \Big[ \sum_{k=1}^{K} p_k(1-p_k)^{m-2} \Big]^2 = \frac{1}{m}\zeta_{1,m-1} + \zeta_{1,m-2}^2. $$
By (5), $\mathrm{Var}(Z_{1,v}) \le \frac{1}{n}\zeta_{1,v} + \frac{v+1}{n}\zeta_{1,v-1}^2$.

Proof of Corollary 1. By Zhang & Zhou (2010), $E[Z_{1,v}] = \sum_{k=1}^{K} p_k(1-p_k)^v = \zeta_{1,v}$, and therefore
$$ E\big[\hat{H}^{(2)}_z\big] = \sum_{v=1}^{n-1} C_v \sum_{k=1}^{K} p_k(1-p_k)^v \to \sum_{k=1}^{K} p_k \sum_{v=1}^{\infty} C_v(1-p_k)^v = \sum_{k=1}^{K} p_k[-\ln(p_k)]^2 = \sum_{k=1}^{K} p_k\ln^2(p_k) = H^{(2)}. $$
It only remains to show $\mathrm{Var}\big(\hat{H}^{(2)}_z\big) \to 0$. Note that $C_v = \sum_{i=1}^{v-1} \frac{1}{i(v-i)} = \frac{2}{v}\sum_{i=1}^{v-1}\frac{1}{i} \le 2$, and
$$ \mathrm{Var}\big(\hat{H}^{(2)}_z\big) = \sum_{v=1}^{n-1}\sum_{w=1}^{n-1} C_v C_w\, \mathrm{Cov}(Z_{1,v}, Z_{1,w}) \le \sum_{v=1}^{n-1}\sum_{w=1}^{n-1} C_v C_w \sqrt{\mathrm{Var}(Z_{1,v})\,\mathrm{Var}(Z_{1,w})} = \Big[ \sum_{v=1}^{n-1} C_v \sqrt{\mathrm{Var}(Z_{1,v})} \Big]^2. $$
Also $\zeta_{1,v} = \sum_{k=1}^{K} p_k(1-p_k)^v \le \sum_{k=1}^{K} p_k(1-p_0)^v = (1-p_0)^v$, where $p_0 = \min\{p_k > 0;\ k = 1, \dots, K\}$, and therefore, from Lemma 5, for $v \ge 2$, since $\zeta_{1,v} \le \zeta_{1,v-1} \le 1$,
$$ \sqrt{\mathrm{Var}(Z_{1,v})} \le \big[ n^{-1}(v+2)\,\zeta_{1,v-1} \big]^{1/2} \le \big[ n^{-1}(v+2) \big]^{1/2} (1-p_0)^{(v-1)/2}. $$

As $n \to \infty$,
$$ \sum_{v=1}^{n-1} C_v \sqrt{\mathrm{Var}(Z_{1,v})} \le \frac{2}{\sqrt{n}} \sum_{v=1}^{n-1} (v+2)^{1/2}(1-p_0)^{(v-1)/2} = \frac{2}{\sqrt{n}} \sum_{v=1}^{\lfloor n^{1/4} \rfloor} (v+2)^{1/2}(1-p_0)^{(v-1)/2} + \frac{2}{\sqrt{n}} \sum_{v=\lfloor n^{1/4} \rfloor + 1}^{n-1} (v+2)^{1/2}(1-p_0)^{(v-1)/2} $$
$$ \le \frac{2}{\sqrt{n}}\, n^{1/4}\big( n^{1/4} + 2 \big)^{1/2} + \frac{2}{\sqrt{n}}\, n\,(n+1)^{1/2}(1-p_0)^{(n^{1/4}-1)/2} \le 2\,n^{-1/8}\big( 1 + 2n^{-1/4} \big)^{1/2} + 2(n+1)(1-p_0)^{(n^{1/4}-1)/2} \to 0, $$
and $\mathrm{Var}\big(\hat{H}^{(2)}_z\big) \to 0$ follows. Hence $\hat{H}^{(2)}_z \stackrel{p}{\longrightarrow} H^{(2)}$. The fact that $\hat{H}_z \stackrel{p}{\longrightarrow} H$ is implied by Theorem 1. Finally, the corollary follows from Slutsky's Theorem.

Proof of Theorem 2. First consider the plug-in estimator $\hat{H}$. It can be verified that $\sqrt{n}\,(\hat{H} - H) \stackrel{L}{\longrightarrow} N(0, \sigma^2)$, where $\sigma^2 = \sigma^2(\{p_k\})$ is as in Theorem 1. We want to show first that $\hat{H}$ is asymptotically efficient in two separate cases: 1) when $K$ is known and 2) when $K$ is unknown. If $K$ is known, then the underlying model $\{p_k;\ k = 1, \dots, K\}$ is a $(K-1)$-parameter multinomial distribution and therefore $\hat{H}$ is the maximum likelihood estimator of $H$, which implies that it is asymptotically efficient. Since the estimator $\hat{H}$ takes the same value, given a sample, regardless of whether $K$ is known or not, its asymptotic variance is the same whether $K$ is known or not. Therefore $\hat{H}$ must be asymptotically efficient when $K$ is finite but unknown, for otherwise it would contradict the fact that $\hat{H}$ is asymptotically efficient when $K$ is known. The asymptotic efficiency of $\hat{H}_z$ follows from the fact that $\sqrt{n}\,(\hat{H}_z - H)$ and $\sqrt{n}\,(\hat{H} - H)$ have identical limiting distributions.

References

[1] Antos, A. and Kontoyiannis, I. (2001). Convergence properties of functional estimates for discrete distributions, Random Structures & Algorithms, Vol. 19, pp. 163-193.

[2] Basharin, G. (1959). On a statistical estimate for the entropy of a sequence of independent random variables, Theory of Probability and Its Applications, Vol. 4, pp. 333-336.

[3] Good, I.J. (1953). The population frequencies of species and the estimation of population parameters, Biometrika, Vol. 40, pp. 237-264.

[4] Harris, B. (1975). The statistical estimation of entropy in the non-parametric case, Topics in Information Theory, edited by I. Csiszar, Amsterdam: North-Holland, pp. 323-355.

[5] Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution, Annals of Mathematical Statistics, Vol. 19, No. 3, pp. 293-325.

[6] Lee, A.J. (1990). U-Statistics: Theory and Practice, Marcel Dekker, Inc., New York.

[7] Miller, G. (1955). Note on the bias of information estimates, Information Theory in Psychology II-B, ed. H. Quastler, Glencoe, IL: Free Press, pp. 95-100.

[8] Nemenman, I., Shafee, F. and Bialek, W. (2002). Entropy and inference, revisited, Advances in Neural Information Processing Systems 14, Cambridge, MA: MIT Press.

[9] Paninski, L. (2003). Estimation of entropy and mutual information, Neural Computation, Vol. 15, pp. 1191-1253.

[10] Shannon, C.E. (1948). A mathematical theory of communication, Bell System Technical Journal, Vol. 27, pp. 379-423 and pp. 623-656.

[11] Strong, S.P., Koberle, R., de Ruyter van Steveninck, R.R. and Bialek, W. (1998). Entropy and information in neural spike trains, Physical Review Letters, Vol. 80, No. 1, pp. 197-200.

[12] Zahl, S. (1977). Jackknifing an index of diversity, Ecology, Vol. 58, pp. 907-913.

[13] Zhang, Z. (2012). Entropy estimation in Turing's perspective, Neural Computation, Vol. 24, No. 5, pp. 1368-1389.

[14] Zhang, Z. and Zhou, J. (2010). Re-parameterization of multinomial distributions and diversity indices, Journal of Statistical Planning and Inference, Vol. 140, No. 7, pp. 1731-1738.