New Entropy Estimator with an Application to Test of Normality

arXiv preprint, v1 [math.ST], 15 Oct 2011

Salim BOUZEBDA (1), Issam ELHATTAB (1), Amor KEZIOU (1) and Tewfik LOUNIS (2)

(1) L.S.T.A., Université Pierre et Marie Curie, 4 place Jussieu, Paris Cedex 05
(2) Laboratoire de Mathématiques Nicolas Oresme, CNRS UMR 6139, Université de Caen

salim.bouzebda@upmc.fr (corresponding author), issam.elhattab@upmc.fr, amor.keziou@upmc.fr, tewfik.lounis@gmail.com

Abstract

In the present paper we propose a new estimator of entropy based on smooth estimators of the quantile density. The consistency and asymptotic distribution of the proposed estimators are obtained. As a consequence, a new test of normality is proposed and a small power comparison is provided. A simulation study comparing, in terms of mean squared error, all estimators under study is also performed.

AMS Subject Classification: 62F12; 62F03; 62G30; 60F17; 62E20.

Keywords: Entropy; Quantile density; Kernel estimation; Vasicek's estimator; Spacing-based estimators; Test of normality; Entropy test.

1 Introduction and estimation

Let X be a random variable [r.v.] with cumulative distribution function [cdf] F(x) := P(X ≤ x) for x ∈ R and a density function f(·) with respect to Lebesgue measure on R. Its differential (or Shannon) entropy is defined by

    H(X) := - ∫_R f(x) log f(x) dx,   (1.1)

where dx denotes Lebesgue measure on R.
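For illustration purposes, definition (1.1) can be checked numerically; the following minimal R sketch (ours, not part of the original development) recovers the entropy of the standard normal, whose closed form is log(√(2πe)) ≈ 1.4189.

```r
## Minimal sketch: differential entropy (1.1) of N(0,1) by numerical integration.
f <- dnorm
integrand <- function(x) ifelse(f(x) > 0, -f(x) * log(f(x)), 0)
H_num <- integrate(integrand, lower = -Inf, upper = Inf)$value
H_num                      # approximately 1.4189
log(sqrt(2 * pi * exp(1))) # closed form, same value
```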

We assume that H(X) is well defined by the integral (1.1), in the sense that

    |H(X)| < ∞.   (1.2)

The concept of differential entropy was introduced in Shannon's original paper [Shannon (1948)]. Since this early epoch, the notion of entropy has been the subject of great theoretical and applied interest. Entropy concepts and principles play a fundamental role in many applications, such as statistical communication theory [Gallager (1968)], quantization theory [Rényi (1959)], statistical decision theory [Kullback (1959)], and contingency table analysis [Gokhale and Kullback (1978)]. Csiszár (1962) introduced the concept of convergence in entropy and showed that this convergence implies convergence in L_1. This property indicates that entropy is a useful concept to measure closeness in distribution, and it also justifies heuristically the use of sample entropy as a test statistic when designing entropy-based goodness-of-fit tests. This line of research has been pursued by Vasicek (1976), Prescott (1976), Dudewicz and van der Meulen (1981), Gokhale (1983), Ebrahimi et al. (1992) and Esteban et al. (2001) [including the references therein]. The idea here is that many families of distributions are characterized by maximization of entropy subject to constraints (see, e.g., Jaynes (1957) and Lazo and Rathie (1978)). There is a huge literature on Shannon's entropy and its applications, and it is not the purpose of this paper to survey it. We refer to Cover and Thomas (2006) (see their Chapter 8) for a comprehensive overview of differential entropy and its mathematical properties.

Several proposals have been made in the literature to estimate entropy. Dmitriev and Tarasenko (1973) and Ahmad and Lin (1976) proposed estimators of the entropy using kernel-type estimators of the density f(·). Vasicek (1976) proposed an entropy estimator based on spacings. Inspired by the work of Vasicek (1976), several authors [van Es (1992), Correa (1995) and Wieczorkowski and Grzegorzewski (1999)] proposed modified entropy estimators, improving in some respects the properties of Vasicek's estimator (see Section 4 below). The reader will find in Beirlant et al. (1997) detailed accounts of the theory as well as surveys of entropy estimators.

This paper aims to introduce a new entropy estimator and to obtain its asymptotic properties. Simulations indicate that our estimator produces smaller mean squared error than the other well-known competitors considered in this work. As a consequence, we propose a new test of normality.

First, we introduce some definitions and notation. For each distribution function F(·), we define the quantile function by

    Q(t) := inf{ x : F(x) ≥ t },   0 < t < 1.

Let

    x_F := sup{ x : F(x) = 0 }   and   x^F := inf{ x : F(x) = 1 },   x_F < x^F.

We assume that the distribution function F(·) has a density f(·) (with respect to Lebesgue measure on R), and that f(x) > 0 for all x ∈ (x_F, x^F). Let

    q(x) := dQ(x)/dx = 1/f(Q(x)),   0 < x < 1,

be the quantile density function [qdf]. The entropy H(X), defined by (1.1), can be expressed as a quantile-density functional:

    H(X) = ∫_[0,1] log( (d/dx) Q(x) ) dx = ∫_[0,1] log( q(x) ) dx.   (1.3)

Vasicek's estimator was constructed by replacing the quantile function Q(·) in (1.3) by the empirical quantile function and using a difference operator instead of the differential one; the derivative (d/dx)Q(x) is then estimated by a function of spacings. In this work, we construct our estimator of entropy by replacing q(·) in (1.3) by an appropriate estimator q_n(·) of q(·). We shall consider the kernel-type estimator of q(·) introduced by Falk (1986) and studied by Cheng and Parzen (1997). Our choice is motivated by the good asymptotic behavior of this estimator. Cheng and Parzen (1997) established the asymptotic properties of q_n(·) on any compact U ⊂ ]0,1[, which avoids boundary problems. Since the entropy is defined as an integral over ]0,1[ of a functional of q(·), it is not suitable to substitute q_n(·) directly into (1.3) to estimate H(X). To circumvent the boundary effects, we proceed as follows. We set, for small ε ∈ ]0,1/2[,

    H_ε(X) := ε log( q(ε) ) + ε log( q(1-ε) ) + ∫_ε^{1-ε} log( q(x) ) dx.

In view of (1.3), we can see that

    H(X) - H_ε(X) = o( η(ε) ),   (1.4)

where η(ε) → 0 as ε → 0. The choice of ε close to zero guarantees the closeness of H_ε(X) to H(X); the problem of estimating H(X) is therefore reduced to that of estimating H_ε(X). Given an independent and identically distributed [i.i.d.] sample X_1,...,X_n and ε ∈ ]0,1/2[, an estimator of H_ε(X) can be defined as

    Ĥ_{ε;n}(X) = ε log( q_n(ε) ) + ε log( q_n(1-ε) ) + ∫_ε^{1-ε} log( q_n(x) ) dx,   (1.5)

where the estimator q_n(·) of the qdf q(·) is defined as follows. Let X_{1;n} ≤ ... ≤ X_{n;n} denote the order statistics of X_1,...,X_n. The empirical quantile function Q_n(·) based upon these random variables is given by

    Q_n(t) := X_{k;n},   (k-1)/n < t ≤ k/n,   k = 1,...,n.   (1.6)
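The trimmed functional (1.5) is straightforward to compute once an estimator of q(·) is available; the following R sketch (ours; the function name and the default trimming value are illustrative only, and the supplied qn must accept vector arguments) implements it by numerical integration.

```r
## Sketch of (1.5): entropy estimate from a generic quantile-density estimator qn().
## 'qn' is any (vectorised) function mapping (0,1) to positive values.
entropy_eps <- function(qn, eps = 0.01) {
  boundary <- eps * log(qn(eps)) + eps * log(qn(1 - eps))
  interior <- integrate(function(x) log(qn(x)), lower = eps, upper = 1 - eps)$value
  boundary + interior
}
## Example with the true qdf of the Exp(1) law, q(t) = 1/(1-t):
## entropy_eps(function(t) 1 / (1 - t))   # close to H(X) = 1
```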

Let {K_n(t,x), (t,x) ∈ [0,1] × ]0,1[}_{n≥1} be a sequence of kernels and {µ_n(·)}_{n≥1} a sequence of σ-finite measures on [0,1]. A smoothed version of Q_n(·) (see, e.g., Cheng and Parzen (1997)) can be defined as

    Q̃_n(t) := ∫_0^1 Q_n(x) K_n(t,x) dµ_n(x),   t ∈ ]0,1[.

Finally, we estimate q(·) by

    q_n(t) := (d/dt) Q̃_n(t) = (d/dt) ∫_0^1 Q_n(x) K_n(t,x) dµ_n(x),   t ∈ ]0,1[.   (1.7)

Clearly, in order to obtain a meaningful qdf estimator in this way, the sequence of kernels K_n(·,·) must satisfy certain differentiability conditions and, together with the sequence of measures µ_n(·), certain variational conditions. These conditions are detailed in the next section. A familiar example is the convolution-kernel estimator

    q̃_n(t) := (d/dt) ∫_0^1 h_n^{-1} Q_n(x) K( (t-x)/h_n ) dx,   t ∈ ]0,1[,

where K(·) denotes a kernel function, namely a measurable function integrating to 1 on R with bounded derivative, and {h_n}_{n≥1} is a sequence of positive reals fulfilling h_n → 0 and n h_n → ∞ as n → ∞. In this case, Csörgő and Révész (1984) and Csörgő et al. (1991) define

    q̂_n(t) := h_n^{-1} ∫_{1/(n+1)}^{n/(n+1)} K( (t-x)/h_n ) dQ_n(x) = h_n^{-1} Σ_{i=1}^{n-1} K( (t - i/n)/h_n ) ( X_{i+1;n} - X_{i;n} ),   t ∈ [0,1].

Calculations using summation by parts show that, for all t ∈ ]0,1[,

    q̃_n(t) = q̂_n(t) + h_n^{-1} [ K( (t-1)/h_n ) X_{n;n} - K( t/h_n ) X_{1;n} ].

In the sequel, we shall consider the general family of qdf estimators defined in (1.7).

The remainder of the present article is organized as follows. The consistency and asymptotic normality of our estimator are discussed in the next section. Our arguments used to establish the asymptotic normality make use of an original application of the invariance principle for the empirical quantile process. In Section 3, we briefly discuss the smoothed version of Parzen's entropy estimator. In Section 4, we investigate the finite-sample performance of the newly proposed estimator and compare it with the performances of existing estimators. In Section 5, a new test of normality is proposed and compared with its competitors. Some concluding remarks and possible future developments are mentioned in Section 6. To avoid interrupting the flow of the presentation, all mathematical developments are relegated to Section 7.
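For concreteness, the spacings form of the convolution-kernel estimator displayed above can be sketched in R as follows (our own illustration, with a Gaussian kernel; the function name and bandwidth are illustrative only).

```r
## Sketch: convolution-kernel quantile-density estimator (spacings form above),
## with Gaussian kernel K = dnorm and bandwidth h.
qdf_kernel <- function(x, h) {
  xs <- sort(x); n <- length(xs)
  spacings <- diff(xs)          # X_{i+1;n} - X_{i;n}, i = 1, ..., n-1
  i <- seq_len(n - 1)
  function(t) sapply(t, function(tt) sum(dnorm((tt - i / n) / h) * spacings) / h)
}
## Example: qn <- qdf_kernel(rnorm(50), h = 0.3); qn(0.5)
```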

2 Main results

Throughout, U(ε) := [ε, 1-ε], where ε ∈ ]0,1/2[ is arbitrarily fixed (free of the sample size n). The following conditions are used to establish the main results; we recall the notation of Section 1.

(Q.1) The quantile density function q(·) is twice differentiable in ]0,1[. Moreover, 0 < min{q(0), q(1)} ≤ ∞;

(Q.2) There exists a constant ς > 0 such that

    sup_{t∈]0,1[} t(1-t) |J(t)| ≤ ς,

where J(t) := d log{q(t)}/dt is the score function;

(Q.3) Either q(0) < ∞ or q(·) is nonincreasing in some interval (0, t_*), and either q(1) < ∞ or q(·) is nondecreasing in some interval (t^*, 1), where 0 < t_* < t^* < 1.

We make the following assumptions on the sequence of kernels K_n(·,·).

(K.1) For each n ≥ 1 and each (t,x) ∈ U(ε) × ]0,1[, K_n(t,x) ≥ 0, and for each t ∈ U(ε), ∫_0^1 K_n(t,x) dµ_n(x) = 1;

(K.2) There is a sequence δ_n → 0 such that

    R_n := sup_{t∈U(ε)} [ 1 - ∫_{t-δ_n}^{t+δ_n} K_n(t,x) dµ_n(x) ] → 0,   as n → ∞;

(K.3) For any function g(·) that is at least three times differentiable in ]0,1[,

    ĝ_n(t) := ∫_0^1 g(x) K_n(t,x) dµ_n(x)

is differentiable in t on U(ε), and

    sup_{t∈U(ε)} | g(t) - ∫_0^1 g(x) K_n(t,x) dµ_n(x) | = O( n^{-α} ),   α > 0;
    sup_{t∈U(ε)} | g'(t) - (d/dt) ∫_0^1 g(x) K_n(t,x) dµ_n(x) | = O( n^{-β} ),   β > 0.

Note that the conditions (Q.1)-(Q.3) and (K.1)-(K.3), which we use to establish the convergence in probability of Ĥ_{ε;n}(X), match those used by Cheng and Parzen (1997) to establish the asymptotic properties of q_n(·). Our result concerning the consistency of Ĥ_{ε;n}(X) defined in (1.5), where the qdf estimator is given in (1.7), is as follows.

Theorem 2.1 Let X_1,...,X_n be i.i.d. r.v.'s with a quantile density function q(·) fulfilling (Q.1)-(Q.3). Let K_n(·,·) fulfill (K.1)-(K.3). Then we have

    Ĥ_{ε;n}(X) - H(X) = O_P( n^{-1/2} M(q;K_n) + n^{-β} + η(ε) ),   (2.1)

where

    M(q;K_n) := M_q M_n(1) δ_n log δ_n^{-1} + M_q M_n(q²) R_n ( 1 + n^{-1/2} A_γ(n) M_q M_n(1) ),

    M_g := sup_{u∈U(ε)} |g(u)|,
    M_n(g) := sup_{u∈U(ε)} ∫_0^1 |g(x)| K_n(u,x) dµ_n(x),
    R_n(g) := sup_{u∈U(ε)} ∫_{[0,1]\U(ε+δ_n)} |g(x)| K_n(u,x) dµ_n(x),

and δ_n is given in condition (K.2). The proof of Theorem 2.1 is postponed to Section 7.

To establish the asymptotic normality of Ĥ_{ε;n}(X), we will use the following additional condition.

(Q.4) E{ log² q( F(X) ) } < ∞.

Let ϑ(·) be such that, for all ε ∈ ]0,1/2[,

    E{ log² q( F(X) ) } - { ε log²( q(ε) ) + ε log²( q(1-ε) ) + ∫_ε^{1-ε} log²( q(x) ) dx } = o( ϑ(ε) ),

where ϑ(ε) → 0 as ε → 0. The main result, concerning the asymptotic normality of Ĥ_{ε;n}(X), may now be stated precisely as follows.

Theorem 2.2 Assume that the conditions (Q.1)-(Q.4) and (K.1)-(K.3) hold with α > 1/2 and β > 1/2 in (K.3). Then we have

    √n( Ĥ_{ε;n}(X) - H_ε(X) ) - ψ_n(ε) = O_P( { 2 ε log ε^{-1} }^{1/2} ) + o_P(1),   (2.2)

where ψ_n(ε) := ∫_{U(ε)} ( q'(x)/q(x) ) B_n(x) dx (with B_n(·) the Brownian bridges appearing in the strong approximation (7.3) below) is a centered Gaussian random variable with variance

    Var( ψ_n(ε) ) = Var( log q( F(X) ) ) + o( ϑ(ε) + η(ε) ).

The proof of Theorem 2.2 is postponed to Section 7.

Remark 2.3 Condition (Q.4) is extremely weak and is satisfied by all commonly encountered distributions, including many important heavy-tailed distributions for which the moments do not exist; the interested reader may refer to Song (2000) and the references therein.

Remark 2.4 We mention that condition (K.3) is satisfied, with α, β > 1/2, for the difference kernels K_n(t,x) dµ_n(x) = h_n^{-1} K( (t-x)/h_n ) dx with h_n = O(n^{-ν}), where 1/4 < ν < 1. A typical example of a difference kernel satisfying these conditions is the Gaussian kernel.

3 The smoothed Parzen estimator of entropy

The notation of this section is similar to that used in Parzen (1979); changes have been made in order to adapt it to our setting. Consider a random sample X_1,...,X_n of a continuous random variable X with distribution function F(·). In this section we work under the following hypothesis: for all x ∈ R,

    H_0 : F(x) = F_0( (x - µ)/σ ),

where µ and σ are parameters to be estimated. It is convenient to transform the X_i to

    Û_i = F_0( (X_i - µ_n)/σ_n ),

where µ_n and σ_n are efficient estimators of µ and σ. It is then convenient to base tests of the hypothesis on the deviations from the identity function t of the empirical distribution function of Û_1,...,Û_n. Denote by D_{n,0}(t), 0 ≤ t ≤ 1, the empirical quantile function of Û_1,...,Û_n; it can be expressed in terms of the (piecewise linear) sample quantile function Q̄_n(·) of the original data X_1,...,X_n by

    D_{n,0}(t) := F_0( ( Q̄_n(t) - µ_n )/σ_n ),

where

    Q̄_n(t) = n( i/n - t ) X_{i-1;n} + n( t - (i-1)/n ) X_{i;n},   for (i-1)/n ≤ t ≤ i/n,  i = 1,...,n.

In our case, we may consider the smoothed version of the quantile density given in Parzen (1979) and defined by

    d_{n,0}(t) := (d/dt) D_{n,0}(t) = f_0( ( Q̄_n(t) - µ_n )/σ_n ) (d/dt) Q̄_n(t) (1/σ_n),

where f_0(·) denotes the density associated with F_0(·).

Consequently, define

    d_{n0}(t) := f_0( Q_0(t) ) (d/dt) Q̄_n(t) (1/σ_{0,n}),

where Q_0(·) denotes the quantile function of F_0(·) and

    σ_{0,n} := ∫_0^1 f_0( Q_0(t) ) (d/dt) Q̄_n(t) dt.

This is an alternative approach to estimating q(·) when one desires a goodness-of-fit test of a location-scale parametric model; we refer to Parzen (1991) for more details. We mention that in (Parzen, 1979, Section 7) the smoothed version of d_{n0}(t) is defined, for the convolution kernel, by

    d̃_{n0}(t) := ∫_0^1 d_{n0}(u) (1/h_n) K( (t-u)/h_n ) du.

This suggests the following estimator of entropy:

    H̃_n(X) := ∫_0^1 log( d̃_{n0}(t) ) dt.   (3.1)

Under conditions similar to those of Theorem 2.1, one could derive, under H_0, that, as n → ∞,

    H̃_n(X) - H(X) = o_P(1).   (3.2)

For more details on goodness-of-fit tests in connection with Shannon's entropy, the interested reader may refer to Parzen (1991). Note that normality is an example of a location-scale parametric model.

Remark 3.1 Recall that there exist several ways to obtain smoothed versions of d_{n0}(·). Indeed, we can choose the following smoothing method. Keep in mind the definition

    q_n(t) = (d/dt) Q̃_n(t) = (d/dt) ∫_0^1 Q_n(x) K_n(t,x) dµ_n(x),   t ∈ ]0,1[.

Then we can use the estimator

    d̃*_{n0}(t) := f_0( Q_0(t) ) q_n(t) (1/σ_{0,n}).

Finally, we estimate the entropy under H_0 by

    H̃*_n(X) := ∫_0^1 log( d̃*_{n0}(t) ) dt.   (3.3)

Under the conditions of Theorem 2.1, one could derive, under H_0, that, as n → ∞,

    H̃*_n(X) - H(X) = o_P(1).   (3.4)

It would be interesting to provide a complete investigation of H̃_n(X) and H̃*_n(X) and to study normality tests based on them; this would go well beyond the scope of the present paper.

Remark 3.2 The nonparametric approach that we use to estimate Shannon's entropy treats the density as parameter-free and thus offers the greatest generality. This is in contrast with the parametric approach, where one assumes that the underlying distribution follows a certain parametric model, so that the problem reduces to estimating a finite number of parameters describing the model.

4 A comparison study by simulations

Given the sample X_1,...,X_n, the estimator proposed by Vasicek (1976) is given by

    H^{(V)}_{m,n} := (1/n) Σ_{i=1}^{n} log( (n/(2m)) { X_{i+m;n} - X_{i-m;n} } ),

where m is a positive integer fulfilling m ≤ n/2. Vasicek proved that his estimator is consistent, i.e.,

    H^{(V)}_{m,n} →_P H(X),   as n → ∞, m → ∞ and m/n → 0.

Vasicek's estimator is also asymptotically normal under appropriate conditions on m and n. van Es (1992) proposed a new estimator of entropy given by

    H^{(Van)}_{m,n} := (1/(n-m)) Σ_{i=1}^{n-m} log( ((n+1)/m) { X_{i+m;n} - X_{i;n} } ) + Σ_{k=m}^{n} 1/k + log(m) - log(n+1).

van Es (1992) established, under suitable conditions, the consistency and asymptotic normality of this estimator.
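For reference, the two estimators just displayed can be coded directly (a minimal R sketch of ours; the spacings use the boundary convention X_{i;n} = X_{1;n} for i < 1 and X_{i;n} = X_{n;n} for i > n, stated below).

```r
## Sketch: Vasicek (1976) and van Es (1992) spacing-based entropy estimators.
H_vasicek <- function(x, m) {
  xs <- sort(x); n <- length(xs)
  idx_hi <- pmin(seq_len(n) + m, n)   # X_{i+m;n}, clamped at the sample maximum
  idx_lo <- pmax(seq_len(n) - m, 1)   # X_{i-m;n}, clamped at the sample minimum
  mean(log(n / (2 * m) * (xs[idx_hi] - xs[idx_lo])))
}
H_vanes <- function(x, m) {
  xs <- sort(x); n <- length(xs)
  i <- seq_len(n - m)
  sum(log((n + 1) / m * (xs[i + m] - xs[i]))) / (n - m) +
    sum(1 / (m:n)) + log(m) - log(n + 1)
}
## Example: H_vasicek(rnorm(50), m = 4)   # target log(sqrt(2*pi*e)) ~ 1.42
```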

Correa (1995) suggested a modification of Vasicek's estimator. In order to estimate the density f(·) of F(·) in the interval (X_{i-m;n}, X_{i+m;n}), he used a local linear model based on 2m+1 points:

    F(X_{j;n}) = α + β X_{j;n},   j = i-m,...,i+m.

This produces an estimator with smaller mean squared error. Correa's estimator is defined by

    H^{(Cor)}_{m,n} := -(1/n) Σ_{i=1}^{n} log( Σ_{j=i-m}^{i+m} ( X_{j;n} - X̄^{(i)} )( j - i ) / ( n Σ_{j=i-m}^{i+m} ( X_{j;n} - X̄^{(i)} )² ) ),

where

    X̄^{(i)} := (1/(2m+1)) Σ_{j=i-m}^{i+m} X_{j;n}.

Here, m ≤ n/2 is a positive integer, X_{i;n} = X_{1;n} for i < 1 and X_{i;n} = X_{n;n} for i > n. By simulations, Correa showed that, for some selected distributions, his estimator produces smaller mean squared error than Vasicek's estimator and van Es's estimator.

Wieczorkowski and Grzegorzewski (1999) modified Vasicek's estimator by adding a bias correction. Their estimator is given by

    H^{(WG)}_{m,n} := H^{(V)}_{m,n} - log n + log(2m) - ( 1 - 2m/n ) Ψ(2m) + Ψ(n+1) - (2/n) Σ_{i=1}^{m} Ψ(i+m-1),

where Ψ(x) is the digamma function defined by (see, e.g., Gradshteyn and Ryzhik (2007))

    Ψ(x) := d log Γ(x) / dx = Γ'(x) / Γ(x).

For integer arguments, we have Ψ(k) = Σ_{i=1}^{k-1} 1/i - γ, where γ = 0.57721566... is Euler's constant.

A series of experiments was conducted in order to compare the performance of our estimator, in terms of efficiency and robustness, with the following entropy estimators: Vasicek's estimator, van Es's estimator, Correa's estimator and the Wieczorkowski-Grzegorzewski estimator. We provide numerical illustrations regarding the mean squared error of each of the above estimators of the entropy H(X). The computing program codes were implemented in R.
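Continuing the sketch above (again our own illustration, relying on the H_vasicek function defined earlier), Correa's and the Wieczorkowski-Grzegorzewski bias-corrected estimators can be coded directly from the displays:

```r
## Sketch: Correa (1995) and Wieczorkowski-Grzegorzewski (1999) estimators.
H_correa <- function(x, m) {
  xs <- sort(x); n <- length(xs)
  b <- sapply(seq_len(n), function(i) {
    j  <- pmin(pmax((i - m):(i + m), 1), n)   # boundary convention on X_{j;n}
    xj <- xs[j]; xbar <- mean(xj)
    sum((xj - xbar) * ((i - m):(i + m) - i)) / (n * sum((xj - xbar)^2))
  })
  -mean(log(b))
}
H_wg <- function(x, m) {
  n <- length(x)
  H_vasicek(x, m) - log(n) + log(2 * m) - (1 - 2 * m / n) * digamma(2 * m) +
    digamma(n + 1) - (2 / n) * sum(digamma((1:m) + m - 1))
}
```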

We have considered the following distributions; for each distribution, we give the value of H(X).

Standard normal distribution N(0,1): H(X) = log( σ √(2πe) ) = log( √(2πe) ) ≈ 1.4189.

Uniform distribution: H(X) = 0.

Weibull distribution with shape parameter equal to 2 and scale parameter equal to 0.5: H(X) ≈ -0.0977. Recall that the entropy of the Weibull distribution, with shape parameter k > 0 and scale parameter λ > 0, is given by

    H(X) = γ( 1 - 1/k ) + log( λ/k ) + 1.

Exponential distribution with parameter λ = 1: H(X) = 1 - log(λ) = 1.

Student's t-distribution with 1, 3 and 5 degrees of freedom: H(X) ≈ 2.5310 (1 degree of freedom, i.e., the Cauchy distribution), H(X) ≈ 1.7734 (3 degrees of freedom), H(X) ≈ 1.6276 (5 degrees of freedom). Keep in mind that the entropy of Student's t-distribution with ν degrees of freedom is given by

    H(X) = ((ν+1)/2) [ Ψ( (1+ν)/2 ) - Ψ( ν/2 ) ] + log( √ν Beta( ν/2, 1/2 ) ),

where Ψ(·) is the digamma function.

For each of the sample sizes n = 10, n = 20 and n = 50, a large number of samples was generated. All the considered spacing-based estimators depend on the parameter m ≤ n/2; the optimal choice of m for a given sample size n is still an open problem. Here, we have chosen the parameter m according to Correa (1995), namely m = 3 for n = 10 and n = 20, and m = 4 for n = 50. For our estimator Ĥ_{ε;n}(X), we have used the standard Gaussian kernel, and the bandwidth was chosen by numerical optimization of the MSE of Ĥ_{ε;n}(X) with respect to h_n.

In Tables 1, 2 and 3 we consider the normal distribution N(0,1) with sample sizes n = 10, n = 20 and n = 50. In Tables 4-9, we consider the uniform distribution, the Weibull distribution, the exponential distribution with parameter 1, the Student t-distribution with 3 and 5 degrees of freedom and the Cauchy distribution, respectively, with sample size n = 50. In all cases, and for each considered estimator, we compute the bias, variance and MSE by Monte Carlo over the replications. From Tables 1-9, we can see that our estimator works better than all the others, in the sense that the MSE is smaller. We think that the simulation results could be substantially improved by choosing, for example, the Beta or Gamma kernel, which have the advantage of taking the boundary effects into account. It appears that our estimator, based on the smoothed quantile density estimate, behaves better than the traditional ones in terms of efficiency.
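As an illustration of the simulation protocol just described, the comparison can be sketched as follows (a schematic R sketch only; the replication count, bandwidth and trimming value below are placeholders, not the settings used in our study, and the sketch relies on the functions defined earlier).

```r
## Schematic Monte-Carlo comparison of bias/variance/MSE for N(0,1), n = 50.
set.seed(1)
H_true <- log(sqrt(2 * pi * exp(1)))
est <- replicate(500, {
  x  <- rnorm(50)
  qn <- qdf_kernel(x, h = 0.3)            # kernel qdf estimator sketched earlier
  c(vasicek = H_vasicek(x, m = 4),
    new     = entropy_eps(qn, eps = 0.01))
})
bias <- rowMeans(est) - H_true
mse  <- rowMeans((est - H_true)^2)
rbind(bias = bias, variance = mse - bias^2, MSE = mse)
```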

Table 1: Results for n = 10, m = 3, h_n = 0.157, Normal distribution N(0,1) (bias, Variance and MSE of H^{(V)}_{m,n}, H^{(Van)}_{m,n}, H^{(Cor)}_{m,n}, H^{(WG)}_{m,n} and Ĥ_{ε;n}(X)).

Table 2: Results for n = 20, m = 3, h_n = 0.81, Normal distribution N(0,1) (same layout as Table 1).

Table 3: Results for n = 50, m = 4, h_n = 0.333, Normal distribution N(0,1) (same layout as Table 1).

Table 4: Results for n = 50, m = 4, h_n = 0.522, Uniform distribution (same layout as Table 1).

We turn now to the robustness properties of the above estimators of the entropy H(X). Robustness here is to be understood with respect to contamination, and not in terms of influence functions or breakdown points, which make sense only under parametric or semiparametric settings.

Table 5: Results for n = 50, m = 4, h_n = 0.614, Weibull distribution with shape parameter 2 and scale parameter 0.5 (same layout as Table 1).

Table 6: Results for n = 50, m = 4, h_n = 0.712, Exponential distribution with parameter 1 (same layout as Table 1).

Table 7: Results for n = 50, m = 4, h_n = 0.336, Student's t-distribution with 3 degrees of freedom (same layout as Table 1).

Table 8: Results for n = 50, m = 4, h_n = 0.344, Student's t-distribution with 5 degrees of freedom (same layout as Table 1).

Table 9: Results for n = 50, m = 4, h_n = 0.235, Cauchy distribution (same layout as Table 1).

According to Huber and Ronchetti (2009): "Most approaches to robustness are based on the following intuitive requirement: A discordant small minority should never be able to override the evidence of the majority of the observations." In the same reference, it is mentioned that a resistant statistical procedure, i.e., one whose value is insensitive to small changes in the underlying sample, is equivalent to a robust one for practical purposes, in view of Hampel's theorem; see Huber and Ronchetti (2009, Section 2.6). Typical examples of robustness in the nonparametric setting are the sample mean and the sample median, which are the nonparametric estimates of the population mean and median, respectively. Although nonparametric, the sample mean is highly sensitive to outliers, and therefore, for a symmetric distribution and contaminated data, the sample median is more appropriate for estimating the population mean or median; refer to Huber and Ronchetti (2009) for more details.

In our simulation, we consider data generated from the N(0,1) distribution in which a small proportion ǫ of observations is replaced by atypical ones generated from a contaminating distribution F_*(·). We consider two cases, ǫ = 4% and ǫ = 10%, and we choose the contaminating distribution F_*(·) to be the uniform distribution on [0,1]; the results are presented in Table 10 for ǫ = 4% and in Table 11 for ǫ = 10%. Let f(·) denote the density function of N(0,1) and f_*(·) the density of the contaminating distribution F_*(·). The contaminated sample X_1,...,X_n can be seen as generated from the density

    f_ǫ(·) := (1-ǫ) f(·) + ǫ f_*(·).

Since the sample is contaminated, all the above estimators tend to H(f_ǫ) and not to H(f). The objective here is to obtain the best estimate of the entropy H(f) from the contaminated data X_1,...,X_n, in the sense that the corresponding MSE is the smallest. We consider the five estimators as above and compute their bias, variance and MSE by Monte Carlo. From Tables 10-11, we can see that our estimator is the best one.
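A schematic version of the contamination experiment just described is sketched below (our own illustration; the contamination proportion follows the text, while the bandwidth and other settings are placeholders, and the functions come from the earlier sketches).

```r
## Schematic contamination experiment: N(0,1) data with a fraction 'prop'
## replaced by draws from the contaminating uniform distribution on [0,1].
rcontam <- function(n, prop) {
  k <- round(prop * n)
  sample(c(rnorm(n - k), runif(k)))   # shuffled contaminated sample
}
set.seed(2)
x <- rcontam(50, prop = 0.10)
c(vasicek = H_vasicek(x, m = 4),
  new     = entropy_eps(qdf_kernel(x, h = 0.3), eps = 0.01))
## Repeating this over many replications, as before, gives the MSE about
## H(f) = log(sqrt(2*pi*e)) reported in Tables 10-11.
```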

Table 10: Results for n = 50, m = 4, h_n = 0.333, ǫ = 4%, Normal distribution N(0,1) (same layout as Table 1).

Table 11: Results for n = 50, m = 4, h_n = 0.333, ǫ = 10%, Normal distribution N(0,1) (same layout as Table 1).

Remark 4.1 To understand well the behavior of the proposed estimator, it would be interesting to consider several families of distributions:

1. Distributions with support (-∞,∞) that are symmetric.
2. Distributions with support (-∞,∞) that are asymmetric.
3. Distributions with support (0,∞).
4. Distributions with support ]0,1[.

Extensive simulations for these families will be undertaken elsewhere. In the following remark, we indicate a way to choose the smoothing parameter in practice.

Remark 4.2 The choice of the smoothing parameter plays an instrumental role in practical estimation. We recall that the smoothing parameter h_n in our simulations was chosen to minimize the MSE of the estimator, assuming that the underlying distribution is known. In the more general case, without assuming any knowledge of the underlying distribution function, one can use, among others, the selection procedure proposed in Jones (1992). Jones (1992) derived the asymptotic MSE of q_n(·), in the case of the convolution-kernel estimator, which is given by

    AMSE( q_n(t) ) = (h_n⁴ / 4) [ q''(t) ]² { ∫_R x² K(x) dx }² + (1/(n h_n)) q²(t) ∫_R K²(x) dx.
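The AMSE expression above can also be minimized numerically in h_n once q and q'' are available (or pre-estimated); a minimal R sketch of ours, for the Gaussian kernel, where ∫ x² K(x) dx = 1 and ∫ K²(x) dx = 1/(2√π), and with the exponential qdf used purely as a worked example:

```r
## Sketch: numerical minimisation of the AMSE above in h, Gaussian kernel.
## 'q' and 'q2' (the second derivative q'') are assumed known or pre-estimated.
amse <- function(h, t, n, q, q2) {
  RK  <- 1 / (2 * sqrt(pi))   # integral of K^2 for the Gaussian kernel
  mu2 <- 1                    # integral of x^2 K(x) dx for the Gaussian kernel
  h^4 / 4 * q2(t)^2 * mu2^2 + q(t)^2 * RK / (n * h)
}
## Worked example with the Exp(1) qdf: q(t) = 1/(1-t), q''(t) = 2/(1-t)^3.
h_opt <- optimize(amse, interval = c(1e-3, 1), t = 0.5, n = 50,
                  q = function(t) 1 / (1 - t), q2 = function(t) 2 / (1 - t)^3)$minimum
```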

Minimizing the AMSE above with respect to h_n, we find the asymptotically optimal bandwidth for q_n(·) as

    h_n^{opt} = { q(t)² ∫_R K²(x) dx / ( n [ q''(t) ]² { ∫_R x² K(x) dx }² ) }^{1/5}.

Note that h_n^{opt} depends on the unknown functions

    q(t) = 1/f(Q(t)),   q''(t) = { 3 [ f'(Q(t)) ]² - f''(Q(t)) f(Q(t)) } / f⁵(Q(t)).

These functions may be estimated in the classical way; refer to Jones (1992), Silverman (1986) and Cheng and Sun (2006) for more details on the subject and the references therein. Another way to estimate the optimal value of h_n is to use a cross-validation type method.

Remark 4.3 The main problem in using entropy estimates such as (1.5) is to choose the smoothing parameter h_n properly. With more effort, one could derive analogous results for Ĥ_{ε;n}(X) using the methods in Bouzebda and Elhattab (2009, 2010, 2011), as well as the modern empirical-process tools developed by Einmahl and Mason (2005) in their work on uniform in bandwidth consistency of kernel-type estimators.

5 Test for normality

A well-known theorem of information theory [see, e.g., p. 55, Shannon (1948)] states that among all distributions that possess a density function f(·) and have a given variance σ², the entropy H(X) is maximized by the normal distribution; the entropy of the normal distribution with variance σ² is log( σ √(2πe) ). As pointed out by Vasicek (1976), this property can be used for tests of normality. Towards this aim, one can use the estimator Ĥ_{ε;n}(X) of H(X) as follows. Let X_1,...,X_n be independent replicæ of a random variable X with quantile density function q(·), and let Ĥ_{ε;n}(X) be the estimator of H(X) as in (1.5). Let T_n denote the statistic

    T_n := log( √( 2π σ²_n ) ) + 0.5 - Ĥ_{ε;n}(X),   (5.1)

where σ²_n is the sample variance based on X_1,...,X_n, defined by

    σ²_n := (1/n) Σ_{i=1}^{n} ( X_i - X̄_n )²,   with X̄_n := n^{-1} Σ_{i=1}^{n} X_i.
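In code, the statistic (5.1) is a one-liner on top of the entropy estimator sketched earlier (again an illustrative sketch of ours, with placeholder bandwidth and trimming values).

```r
## Sketch: the normality test statistic T_n of (5.1).
T_stat <- function(x, h = 0.3, eps = 0.01) {
  s2 <- mean((x - mean(x))^2)                      # sample variance (divisor n)
  log(sqrt(2 * pi * s2)) + 0.5 - entropy_eps(qdf_kernel(x, h), eps)
}
```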

The normality hypothesis will be rejected whenever the observed value of T_n is significantly less than 0, in the sense made precise below. The exact critical values T_α of T_n at the significance level α ∈ ]0,1[ are defined by the equation P( T_n ≤ T_α ) = α. The distribution of T_n under the null hypothesis cannot readily be obtained analytically. To evaluate the critical values T_α, we have used a Monte Carlo method, for sample sizes 10 ≤ n ≤ 50 and the significance level α = 0.05. Namely, for each n, we simulate a large number N of samples of size n from the standard normal distribution and, since α = 0.05, we take as critical value T_{n,0.05} the (αN)-th order statistic of the simulated values of T_n. Our results are presented in Table 12.

Table 12: Critical values of T_n (by sample size n, at the 5% level).

Park and Park (2003) established entropy-based goodness-of-fit test statistics built on the nonparametric distribution functions of the sample entropy and of the modified sample entropy of Ebrahimi et al. (1994), and compared their performances for the exponential and normal distributions. The authors consider

    H^{Ebr}_{m,n} := (1/n) Σ_{i=1}^{n} log( ( n/(c_i m) ) { X_{i+m;n} - X_{i-m;n} } ),

where

    c_i := 1 + (i-1)/m   if 1 ≤ i ≤ m,
           2             if m+1 ≤ i ≤ n-m,
           1 + (n-i)/m   if n-m+1 ≤ i ≤ n.
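Returning to the Monte Carlo calibration of T_n described above, it can be sketched as follows (our own sketch; the replication count is a placeholder and the quantile used simply mirrors the rejection region described in the text).

```r
## Sketch: Monte-Carlo critical value of T_n under H_0 for a given n.
critical_value <- function(n, alpha = 0.05, nrep = 2000, h = 0.3, eps = 0.01) {
  t0 <- replicate(nrep, T_stat(rnorm(n), h = h, eps = eps))
  quantile(t0, probs = alpha)   # empirical alpha-quantile of the null distribution
}
## Example: critical_value(50)
```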

Yousefzadeh and Arghami (2008) use a new cdf estimator to obtain a nonparametric entropy estimate and use it for testing exponentiality and normality. They introduce the following estimator:

    H^{You}_{m,n} := Σ_{i=1}^{n} [ ( F̃_n(X_{i+m;n}) - F̃_n(X_{i-m;n}) ) / Σ_{i=1}^{n} ( F̃_n(X_{i+m;n}) - F̃_n(X_{i-m;n}) ) ] log( ( X_{i+m;n} - X_{i-m;n} ) / ( F̃_n(X_{i+m;n}) - F̃_n(X_{i-m;n}) ) ),

where

    F̃_n(x) := (n-1)/(n(n+1)) [ n/(n-1) + (x - X_{0;n})/(X_{2;n} - X_{0;n}) + (x - X_{1;n})/(X_{3;n} - X_{1;n}) ],   for X_{1;n} ≤ x ≤ X_{2;n},
    F̃_n(x) := (n-1)/(n(n+1)) [ i + 1/(n-1) + (x - X_{i-1;n})/(X_{i+1;n} - X_{i-1;n}) + (x - X_{i;n})/(X_{i+2;n} - X_{i;n}) ],   for X_{i;n} ≤ x ≤ X_{i+1;n}, i = 2,...,n-2,
    F̃_n(x) := (n-1)/(n(n+1)) [ n - n/(n-1) + (x - X_{n-2;n})/(X_{n;n} - X_{n-2;n}) + (x - X_{n-1;n})/(X_{n+1;n} - X_{n-1;n}) ],   for X_{n-1;n} ≤ x ≤ X_{n;n},

and

    X_{0;n} := X_{1;n} - (n/(n-1)) ( X_{2;n} - X_{1;n} ),   X_{n+1;n} := X_{n;n} + (n/(n-1)) ( X_{n;n} - X_{n-1;n} ).

To compare the power of the proposed test, a simulation was conducted. We used tests based on the following entropy estimators: H^{(WG)}_{m,n}, H^{Ebr}_{m,n}, H^{You}_{m,n}, and our test (5.1). In the power comparison, we consider the following alternatives.

The Weibull distribution with density function

    f(x; λ, k) = (k/λ) (x/λ)^{k-1} exp( -(x/λ)^k ) 1{x > 0},

where k > 0 is the shape parameter and λ > 0 is the scale parameter of the distribution, and where 1{·} stands for the indicator function of the set {·}.

The uniform distribution with density function f(x) = 1, 0 < x < 1.

The Student t-distribution with density function

    f(x; ν) = Γ( (ν+1)/2 ) / ( Γ( ν/2 ) √(νπ) ) ( 1 + x²/ν )^{-(ν+1)/2},   ν > 2,  -∞ < x < ∞.

Remark 5.1 We mention that the values reported in Table 13 for the statistics based on H^{Ebr}_{m,n} and H^{You}_{m,n} are the same as those calculated in Table 6 of Yousefzadeh and Arghami (2008), for the same alternatives that we consider in our comparison.
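A schematic power computation for a single alternative, mirroring the comparison reported in Table 13, can be sketched as follows (our own sketch with placeholder settings; the rejection region follows the description given above, and the functions come from the earlier sketches).

```r
## Schematic power estimate of the T_n test against a fixed alternative.
power_Tn <- function(ralt, n = 50, alpha = 0.05, nrep = 1000, h = 0.3, eps = 0.01) {
  crit <- critical_value(n, alpha = alpha, h = h, eps = eps)
  mean(replicate(nrep, T_stat(ralt(n), h = h, eps = eps) <= crit))
}
## Example: power against the uniform alternative
## power_Tn(function(n) runif(n))
```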

Table 13: Power estimates of 0.05-level tests against alternatives to the normal distribution, for sample size n = 50; statistics based on Ĥ_{ε;n}(X), H^{(WG)}_{4,n}, H^{(WG)}_{8,n}, H^{Ebr}_{m,n} and H^{You}_{m,n}; alternatives: Uniform, Weibull(2), Student t_5, Student t_3.

Table 14: Choice of smoothing parameter for Ĥ_{ε;n}(X): Uniform, h_n = 0.5297; Weibull(2), h_n = 0.6555; Student t_5, h_n = 0.31; Student t_3, h_n = 0.189.

In Table 13 we report the results of the power comparison for sample size n = 50. Monte Carlo simulations were carried out to compare the powers of the proposed test against the four alternatives. From Table 13, we can see that the proposed test T_n shows better power than all the other statistics for the alternatives that we consider. It is natural that our test performs well for distributions with unbounded support, since kernel-type estimators behave well in this situation.

6 Concluding remarks and future works

We have proposed a new estimator of entropy based on the kernel quantile-density estimator. Simulations show that this estimator behaves better than the other competitors, both with and without contamination; more precisely, its MSE is consistently smaller than the MSE of the spacing-based estimators considered in our study. It would be interesting to compare theoretically the power and the robustness of the test of normality based on the proposed estimator with those considered in Esteban et al. (2001). The study of entropy in the presence of censored data is possible using a methodology similar to the one presented here. It would also be interesting to provide a complete investigation of the choice of the parameter h_n for the kernel difference-type estimator; this requires nontrivial mathematics and would go well beyond the scope of the present paper. The problems and methods described here are all inherently univariate. A natural and useful multivariate extension is through the copula function. We propose to extend the results of this paper to the multivariate case in the following way.

Consider an R^d-valued random vector X = (X_1,...,X_d) with joint cdf

    F(x) := F(x_1,...,x_d) := P( X_1 ≤ x_1,...,X_d ≤ x_d )

and marginal cdf's

    F_j(x_j) := P( X_j ≤ x_j ),   j = 1,...,d.

If the marginal distribution functions F_1(·),...,F_d(·) are continuous, then, according to Sklar's theorem [Sklar (1959)], the copula function C(·) pertaining to F(·) is unique and

    C(u) := C(u_1,...,u_d) := F( Q_1(u_1),...,Q_d(u_d) ),   for u ∈ [0,1]^d,

where, for j = 1,...,d, Q_j(u) := inf{ x : F_j(x) ≥ u }, u ∈ (0,1], with Q_j(0) := lim_{t↓0} Q_j(t) =: Q_j(0+), is the quantile function of F_j(·). The differential entropy may be represented via the copula density c(·) and the quantile densities q_i(·) as follows:

    H(X) = Σ_{i=1}^{d} H(X_i) + H(c),   (6.1)

where

    H(X_i) = ∫_{[0,1]} log( q_i(u) ) du,   (6.2)

with q_i(u) = dQ_i(u)/du for i = 1,...,d, and H(c) is the copula entropy defined by

    H(c) = - ∫_{R^d} c( F_1(x_1),...,F_d(x_d) ) log( c( F_1(x_1),...,F_d(x_d) ) ) dF_1(x_1)···dF_d(x_d)
         = - ∫_{[0,1]^d} c(u_1,...,u_d) log( c(u_1,...,u_d) ) du_1···du_d
         = - ∫_{[0,1]^d} c(u) log( c(u) ) du,   (6.3)

where

    c(u_1,...,u_d) = ∂^d C(u_1,...,u_d) / ( ∂u_1 ··· ∂u_d )

is the copula density.
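Identity (6.1) can be checked in closed form for a bivariate normal vector, for which the Gaussian copula with correlation ρ has entropy ½ log(1-ρ²); a minimal numerical sketch (ours, using only these standard closed forms):

```r
## Sketch: checking (6.1) for a bivariate normal with correlation rho.
rho <- 0.6; s1 <- 1; s2 <- 2
H_joint     <- 0.5 * log((2 * pi * exp(1))^2 * s1^2 * s2^2 * (1 - rho^2))
H_marginals <- 0.5 * log(2 * pi * exp(1) * s1^2) + 0.5 * log(2 * pi * exp(1) * s2^2)
H_copula    <- 0.5 * log(1 - rho^2)          # Gaussian copula entropy (negative)
all.equal(H_joint, H_marginals + H_copula)   # TRUE
```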

7 Proofs

This section is devoted to the proofs of our results. The previously defined notation continues to be used below.

Proof of Theorem 2.1. Recall that we set, for all ε ∈ ]0,1/2[, U(ε) = [ε, 1-ε] and

    H_ε(X) = ε log( q(ε) ) + ∫_{U(ε)} log( q(x) ) dx + ε log( q(1-ε) ).

Therefore, by (1.4), we have

    | Ĥ_{ε;n}(X) - H(X) | ≤ | Ĥ_{ε;n}(X) - H_ε(X) | + | H_ε(X) - H(X) | = | Ĥ_{ε;n}(X) - H_ε(X) | + o( η(ε) ).   (7.1)

We evaluate the first term on the right-hand side of (7.1). By the triangle inequality, we have

    | Ĥ_{ε;n}(X) - H_ε(X) | ≤ ε | log( q_n(ε) ) - log( q(ε) ) | + ∫_{U(ε)} | log( q_n(x) ) - log( q(x) ) | dx + ε | log( q_n(1-ε) ) - log( q(1-ε) ) |.

We note that, for z > 0,

    | log z | ≤ | z - 1 | + | 1/z - 1 |.

Therefore, we have

    | log( q_n(x) ) - log( q(x) ) | = | log( q_n(x)/q(x) ) | ≤ | q_n(x) - q(x) | / q(x) + | q_n(x) - q(x) | / q_n(x).

Under conditions (Q.1)-(Q.3) and (K.1)-(K.3), we may apply Theorem 2.2 in Cheng and Parzen (1997): for every fixed ε ∈ ]0,1/2[, and recalling M(q;K_n) defined in Theorem 2.1, we have, as n → ∞,

    sup_{x∈U(ε)} | q_n(x) - q(x) | = O_P( n^{-1/2} M(q;K_n) + n^{-β} ),   (7.2)

and we infer that, uniformly over x ∈ U(ε), q_n(x) ≥ (1/2) q(x) for all n large enough. This fact implies

    | Ĥ_{ε;n}(X) - H_ε(X) | ≤ 4 sup_{x∈U(ε)} | q_n(x) - q(x) |,   in probability,

which, in view of (7.1), implies that

    Ĥ_{ε;n}(X) - H(X) = O_P( n^{-1/2} M(q;K_n) + n^{-β} + η(ε) ).

Thus the proof is complete.

Proof of Theorem 2.2. Throughout, we will make use of the fact (see, e.g., Csörgő and Révész (1978, 1981)) that we can define X_1,...,X_n on a probability space which carries a sequence of Brownian bridges {B_n(t) : t ∈ [0,1]}, n ≥ 1, such that

    √n( Q_n(t) - Q(t) ) - q(t) B_n(t) = q(t) e_n(t)   (7.3)

and

    lim sup_{n→∞} [ n^{1/2} / A_γ(n) ] sup_t | q(t) e_n(t) | ≤ C,   (7.4)

with probability one, where C is a universal constant and

    A_γ(n) := log n,                                   if max{ q(0), q(1) } < ∞ or γ ≤ 2,
    A_γ(n) := (log log n)^γ (log n)^{(1+ν)(γ-1)},      if γ > 2,

with γ as in (Q.2) and ν > 0 arbitrary. Using a Taylor expansion we get, for some λ ∈ ]0,1[,

    √n( Ĥ_{ε;n}(X) - H_ε(X) ) = √n ∫_{U(ε)} ( log( q_n(x) ) - log( q(x) ) ) dx + √n ε( log( q_n(ε) ) - log( q(ε) ) ) + √n ε( log( q_n(1-ε) ) - log( q(1-ε) ) )
    = √n ∫_{U(ε)} ( q_n(x) - q(x) ) / ( λ q_n(x) + (1-λ) q(x) ) dx + L_n(ε)
    =: T_n(ε) + L_n(ε).   (7.5)

We have, under our assumptions, q_n(x) → q(x) in probability, which implies that λ q_n(x) + (1-λ) q(x) → q(x) in probability. Thus, by Slutsky's theorem, as n tends to infinity, T_n(ε) has the same limiting law as

    T̃_n(ε) := ∫_{U(ε)} (1/q(x)) √n( q_n(x) - q(x) ) dx.   (7.6)

We have the following decomposition:

    T̃_n(ε) = ∫_{U(ε)} (1/q(x)) √n (d/dx) ( ∫_0^1 Q_n(v) K_n(x,v) dµ_n(v) - Q(x) ) dx
            = ∫_{U(ε)} (1/q(x)) (d/dx) ( ∫_0^1 √n( Q_n(v) - Q(v) ) K_n(x,v) dµ_n(v) ) dx
              + ∫_{U(ε)} (1/q(x)) √n (d/dx) ( ∫_0^1 Q(v) K_n(x,v) dµ_n(v) - Q(x) ) dx.

Using condition (K.3) and the fact that ∫_{U(ε)} (1/q(x)) dx ≤ 1, we have

    | ∫_{U(ε)} (1/q(x)) √n (d/dx) ( ∫_0^1 Q(v) K_n(x,v) dµ_n(v) - Q(x) ) dx | = O( n^{1/2-β} ).

This, in turn, implies, in connection with (7.3), that

    T̃_n(ε) = ∫_{U(ε)} (1/q(x)) (d/dx) ( ∫_0^1 ( q(v) B_n(v) + q(v) e_n(v) ) K_n(x,v) dµ_n(v) ) dx + O( n^{1/2-β} ).

Recalling the arguments of the proof of Lemma 3.2 of Cheng (1995), with I_n(u) = [u-δ_n, u+δ_n] and I_n^c(u) = [0,1] \ I_n(u), we can show that

    | ∫_{U(ε)} (1/q(x)) (d/dx) ( ∫_0^1 q(v) e_n(v) K_n(x,v) dµ_n(v) ) dx |
    ≤ ( ∫_{U(ε)} (1/q(x)) dx ) sup_{x∈U(ε)} | (d/dx) ∫_0^1 q(v) e_n(v) K_n(x,v) dµ_n(v) |.

It then follows that

    | ∫_{U(ε)} (1/q(x)) (d/dx) ( ∫_0^1 q(v) e_n(v) K_n(x,v) dµ_n(v) ) dx |
    ≤ sup_{x∈U(ε)} ∫_0^1 q(v) | e_n(v) | | (∂/∂x) K_n(x,v) | dµ_n(v)
    ≤ C n^{-1/2} A_γ(n) sup_{x∈U(ε)} ∫_0^1 q(v) | (∂/∂x) K_n(x,v) | dµ_n(v)
    ≤ C n^{-1/2} A_γ(n) sup_{x∈U(ε-δ_n)} q(x) sup_{x∈U(ε)} ∫_{I_n(x)} | (∂/∂x) K_n(x,v) | dµ_n(v)
      + C n^{-1/2} A_γ(n) sup_{x∈U(ε)} ∫_{I_n^c(x)} q(v) | (∂/∂x) K_n(x,v) | dµ_n(v)
    =: R_n^{(1)}(ε) + R_n^{(2)}(ε).

Since q(·) is continuous on ]0,1[, it is bounded on the compact interval U(ε-δ_n) ⊂ ]0,1[, which gives, by condition (K.2),

    R_n^{(1)}(ε) ≤ C n^{-1/2} A_γ(n) sup_{x∈U(ε-δ_n)} q(x) = C n^{-1/2} A_γ(n) O(1) = o(1) = o_P(1).

Let s = 1 + δ with δ > 0 arbitrary, and let r be defined by 1/r = 1 - 1/s. Then Hölder's inequality implies

    R_n^{(2)}(ε) ≤ C n^{-1/2} A_γ(n) sup_{x∈U(ε)} { [ ∫_{[0,1]} [q(v)]^r | (∂/∂x) K_n(x,v) | dµ_n(v) ]^{1/r} [ ∫_{[0,1]} 1_{I_n^c(x)}(v) | (∂/∂x) K_n(x,v) | dµ_n(v) ]^{1/s} }.

Under condition (K.3), we can see that

    sup_{x∈U(ε)} [ ∫_{[0,1]} [q(v)]^r | (∂/∂x) K_n(x,v) | dµ_n(v) ]^{1/r} = O(1),

which, in connection with condition (K.2), implies

    R_n^{(2)}(ε) = C n^{-1/2} A_γ(n) O(1) O( sup_{x∈U(ε)} R_n^{1/(1+δ)} ) = o_P(1).

Hence

    ∫_{U(ε)} (1/q(x)) (d/dx) ( ∫_0^1 q(v) e_n(v) K_n(x,v) dµ_n(v) ) dx = o_P(1).

We then conclude that

    T̃_n(ε) = ∫_{U(ε)} (1/q(x)) (d/dx) ( ∫_0^1 q(v) B_n(v) K_n(x,v) dµ_n(v) ) dx + o_P(1).

We have, by Cheng and Parzen (1997, p. 297) (we may refer also to Stadtmüller (1986, 1988) and Xiang (1994) for related results on smoothed Wiener processes and Brownian bridges):

    sup_{x∈U(ε)} | ∫_0^1 q(v) B_n(v) K_n(x,v) dµ_n(v) - q(x) B_n(x) |
    ≤ sup_{x∈U(ε)} ∫_0^1 q(v) | B_n(v) - B_n(x) | K_n(x,v) dµ_n(v) + sup_{x∈U(ε)} | B_n(x) | ∫_0^1 | q(v) - q(x) | K_n(x,v) dµ_n(v) + O_P( n^{-β} )
    = O_P( ( 2 δ_n log δ_n^{-1} )^{1/2} ) + O_P( R_n^{1/(1+δ)} ) = o_P(1).

This gives, under condition (Q.1),

    ∫_{U(ε)} (1/q(x)) (d/dx) ( ∫_0^1 q(v) B_n(v) K_n(x,v) dµ_n(v) ) dx
    = [ (1/q(x)) ∫_0^1 q(v) B_n(v) K_n(x,v) dµ_n(v) ]_{x=ε}^{x=1-ε} - ∫_{U(ε)} ( (d/dx)(1/q(x)) ) ( ∫_0^1 q(v) B_n(v) K_n(x,v) dµ_n(v) ) dx
    = [ (1/q(x)) q(x) B_n(x) ]_{x=ε}^{x=1-ε} - ∫_{U(ε)} ( (d/dx)(1/q(x)) ) q(x) B_n(x) dx + o_P(1)
    = ∫_{U(ε)} ( q'(x)/q(x) ) B_n(x) dx + B_n(1-ε) - B_n(ε) + o_P(1).

Making use of (Csörgő and Révész, 1981, Theorem 1.4.1), for ε sufficiently small,

    B_n(1-ε) - B_n(1) = o_P(1)   and   B_n(ε) - B_n(0) = o_P(1),

which implies that

    T̃_n(ε) = ∫_{U(ε)} ( q'(x)/q(x) ) B_n(x) dx + o_P(1).

By (Cheng and Parzen, 1997, Theorem 2.2), we have, for β > 1/2 (β as in condition (K.3)),

    sup_{u∈U(ε)} √n | log( q_n(u) ) - log( q(u) ) | = o_P(1),   (7.7)

which implies, by using a Taylor expansion, that

    L_n(ε) = o_P(1),   (7.8)

where L_n(ε) is given in (7.5). This gives

    T_n(ε) + L_n(ε) = ∫_{U(ε)} ( q'(x)/q(x) ) B_n(x) dx + o_P(1).

It follows that

    √n( Ĥ_{ε;n}(X) - H_ε(X) ) =_d ∫_{U(ε)} ( q'(x)/q(x) ) B_n(x) dx + o_P(1).

Here and elsewhere we denote by =_d equality in distribution. Note that ∫_{U(ε)} ( q'(x)/q(x) ) B_n(x) dx is a Gaussian random variable with mean

    E( ∫_{U(ε)} ( q'(u)/q(u) ) B_n(u) du ) = ∫_{U(ε)} ( q'(u)/q(u) ) E( B_n(u) ) du = 0,

and variance

    E( ∫_{U(ε)} ( q'(u)/q(u) ) B_n(u) du  ∫_{U(ε)} ( q'(v)/q(v) ) B_n(v) dv )
    = ∫_{U(ε)} ∫_{U(ε)} ( q'(u)/q(u) ) ( q'(v)/q(v) ) E( B_n(u) B_n(v) ) du dv
    = ∫_{U(ε)} ∫_{U(ε)} ( q'(u)/q(u) ) ( q'(v)/q(v) ) ( min(u,v) - uv ) du dv
    = ∫_{U(ε)} ( q'(v)/q(v) ) ( (1-v) ∫_{ε}^{v} ( q'(u)/q(u) ) u du + v ∫_{v}^{1-ε} ( q'(u)/q(u) ) (1-u) du ) dv
    = ∫_{U(ε)} log²( q(x) ) dx + ε log²( q(ε) ) + ε log²( q(1-ε) ) - ( H_ε(X) )² + o( ϑ(ε) )
    = Var( log q( F(X) ) ) + o( ϑ(ε) + η(ε) ).

Thus the proof is complete.

References

Ahmad, I. A. and Lin, P. E. (1976). A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Trans. Information Theory, IT-22(3).

Beirlant, J., Dudewicz, E. J., Györfi, L., and van der Meulen, E. C. (1997). Nonparametric entropy estimation: an overview. Int. J. Math. Stat. Sci., 6(1).

Bouzebda, S. and Elhattab, I. (2009). A strong consistency of a nonparametric estimate of entropy under random censorship. C. R. Math. Acad. Sci. Paris, 347(13-14).

Bouzebda, S. and Elhattab, I. (2010). Uniform in bandwidth consistency of the kernel-type estimator of the Shannon's entropy. C. R. Math. Acad. Sci. Paris, 348(5-6).

Bouzebda, S. and Elhattab, I. (2011). Uniform-in-bandwidth consistency for kernel-type estimators of Shannon's entropy. Electron. J. Stat., 5.

Cheng, C. (1995). Uniform consistency of generalized kernel estimators of quantile density. Ann. Statist., 23(6).

Cheng, C. and Parzen, E. (1997). Unified estimators of smooth quantile and quantile density functions. J. Statist. Plann. Inference, 59(2).

Cheng, M.-Y. and Sun, S. (2006). Bandwidth selection for kernel quantile estimation. J. Chin. Statist. Assoc., 44.

Correa, J. C. (1995). A new estimator of entropy. Comm. Statist. Theory Methods, 24(10).

Cover, T. M. and Thomas, J. A. (2006). Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition.

Csörgő, M. and Révész, P. (1978). Strong approximations of the quantile process. Ann. Statist., 6(4).

Csörgő, M. and Révész, P. (1981). Strong approximations in probability and statistics. Probability and Mathematical Statistics. Academic Press Inc. [Harcourt Brace Jovanovich Publishers], New York.

Csörgő, M. and Révész, P. (1984). Two approaches to constructing simultaneous confidence bounds for quantiles. Probab. Math. Statist., 4(2).

Csörgő, M., Horváth, L., and Deheuvels, P. (1991). Estimating the quantile-density function. In Nonparametric functional estimation and related topics (Spetses, 1990), volume 335 of NATO Adv. Sci. Inst. Ser. C Math. Phys. Sci. Kluwer Acad. Publ., Dordrecht.

Csiszár, I. (1962). Informationstheoretische Konvergenzbegriffe im Raum der Wahrscheinlichkeitsverteilungen. Magyar Tud. Akad. Mat. Kutató Int. Közl., 7.

Dmitriev, J. G. and Tarasenko, F. P. (1973). The estimation of functionals of a probability density and its derivatives. Teor. Verojatnost. i Primenen., 18.

Dudewicz, E. J. and van der Meulen, E. C. (1981). Entropy-based tests of uniformity. J. Amer. Statist. Assoc., 76(376).

Ebrahimi, N., Habibullah, M., and Soofi, E. (1992). Testing exponentiality based on Kullback-Leibler information. J. Roy. Statist. Soc. Ser. B, 54(3).

Ebrahimi, N., Pflughoeft, K., and Soofi, E. S. (1994). Two measures of sample entropy. Statist. Probab. Lett., 20(3).

Einmahl, U. and Mason, D. M. (2005). Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist., 33(3).

Esteban, M. D., Castellanos, M. E., Morales, D., and Vajda, I. (2001). Monte Carlo comparison of four normality tests using different entropy estimates. Comm. Statist. Simulation Comput., 30(4).

Gallager, R. (1968). Information theory and reliable communication. John Wiley & Sons, New York-London-Sydney-Toronto.

Gokhale, D. (1983). On entropy-based goodness-of-fit tests. Comput. Stat. Data Anal., 1.

Gokhale, D. V. and Kullback, S. (1978). The information in contingency tables, volume 23 of Statistics: Textbooks and Monographs. Marcel Dekker Inc., New York.

Gradshteyn, I. S. and Ryzhik, I. M. (2007). Table of integrals, series, and products. Elsevier/Academic Press, Amsterdam, seventh edition.

Falk, M. (1986). On the estimation of the quantile density function. Statist. Probab. Lett., 4(2).

Huber, P. J. and Ronchetti, E. M. (2009). Robust statistics. Wiley Series in Probability and Statistics. John Wiley & Sons Inc., Hoboken, NJ, second edition.

Jaynes, E. T. (1957). Information theory and statistical mechanics. Phys. Rev. (2), 106.

Jones, M. (1992). Estimating densities, quantiles, quantile densities and density quantiles. Ann. Inst. Stat. Math., 44(4).

Kullback, S. (1959). Information theory and statistics. John Wiley and Sons, Inc., New York.

Lazo, A. C. and Rathie, P. N. (1978). On the entropy of continuous probability distributions. IEEE Trans. Inf. Theory, 24.

Park, S. and Park, D. (2003). Correcting moments for goodness of fit tests based on two entropy estimates. J. Stat. Comput. Simul., 73(9).

Parzen, E. (1979). Nonparametric statistical data modeling. J. Amer. Statist. Assoc., 74(365). With comments by John W. Tukey, Roy E. Welsch, William F. Eddy, D. V. Lindley, Michael E. Tarter and Edwin L. Crow, and a rejoinder by the author.

Parzen, E. (1991). Goodness of fit tests and entropy. J. Combin. Inform. System Sci., 16(2-3).

Prescott, P. (1976). On a test for normality based on sample entropy. J. R. Stat. Soc., Ser. B, 38.

Rényi, A. (1959). On the dimension and entropy of probability distributions. Acta Math. Acad. Sci. Hungar., 10 (unbound insert).

Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Tech. J., 27.

Silverman, B. W. (1986). Density estimation for statistics and data analysis. Monographs on Statistics and Applied Probability. Chapman & Hall, London.

Song, K.-S. (2000). Limit theorems for nonparametric sample entropy estimators. Statist. Probab. Lett., 49(1).

Stadtmüller, U. (1986). Asymptotic properties of nonparametric curve estimates. Period. Math. Hungar., 17(2).

Stadtmüller, U. (1988). Kernel approximations of a Wiener process. Period. Math. Hungar., 19(1).

van Es, B. (1992). Estimating functionals related to a density by a class of statistics based on spacings. Scand. J. Statist., 19(1).

Vasicek, O. (1976). A test for normality based on sample entropy. J. Roy. Statist. Soc. Ser. B, 38(1).

Wieczorkowski, R. and Grzegorzewski, P. (1999). Entropy estimators improvements and comparisons. Comm. Statist. Simulation Comput., 28(2).

Xiang, X. (1994). A law of the logarithm for kernel quantile density estimators. Ann. Probab., 22(2).

Yousefzadeh, F. and Arghami, N. R. (2008). Testing exponentiality based on type II censored data and a new cdf estimator. Comm. Statist. Simulation Comput., 37(8-10).


More information

Does k-th Moment Exist?

Does k-th Moment Exist? Does k-th Moment Exist? Hitomi, K. 1 and Y. Nishiyama 2 1 Kyoto Institute of Technology, Japan 2 Institute of Economic Research, Kyoto University, Japan Email: hitomi@kit.ac.jp Keywords: Existence of moments,

More information

A Goodness-of-fit Test for Copulas

A Goodness-of-fit Test for Copulas A Goodness-of-fit Test for Copulas Artem Prokhorov August 2008 Abstract A new goodness-of-fit test for copulas is proposed. It is based on restrictions on certain elements of the information matrix and

More information

Estimation of Rényi Information Divergence via Pruned Minimal Spanning Trees 1

Estimation of Rényi Information Divergence via Pruned Minimal Spanning Trees 1 Estimation of Rényi Information Divergence via Pruned Minimal Spanning Trees Alfred Hero Dept. of EECS, The university of Michigan, Ann Arbor, MI 489-, USA Email: hero@eecs.umich.edu Olivier J.J. Michel

More information

On a simple construction of bivariate probability functions with fixed marginals 1

On a simple construction of bivariate probability functions with fixed marginals 1 On a simple construction of bivariate probability functions with fixed marginals 1 Djilali AIT AOUDIA a, Éric MARCHANDb,2 a Université du Québec à Montréal, Département de mathématiques, 201, Ave Président-Kennedy

More information

Non Parametric Estimation of Mutual Information through the Entropy of the Linkage

Non Parametric Estimation of Mutual Information through the Entropy of the Linkage Entropy 203, 5, 554-577; doi:0.3390/e52554 Article OPEN ACCESS entropy ISSN 099-4300 www.mdpi.com/ournal/entropy Non Parametric Estimation of Mutual Information through the Entropy of the Linkage Maria

More information

41903: Introduction to Nonparametrics

41903: Introduction to Nonparametrics 41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific

More information

The Central Limit Theorem Under Random Truncation

The Central Limit Theorem Under Random Truncation The Central Limit Theorem Under Random Truncation WINFRIED STUTE and JANE-LING WANG Mathematical Institute, University of Giessen, Arndtstr., D-3539 Giessen, Germany. winfried.stute@math.uni-giessen.de

More information

Non-parametric Inference and Resampling

Non-parametric Inference and Resampling Non-parametric Inference and Resampling Exercises by David Wozabal (Last update. Juni 010) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend surfing

More information

Exact Linear Likelihood Inference for Laplace

Exact Linear Likelihood Inference for Laplace Exact Linear Likelihood Inference for Laplace Prof. N. Balakrishnan McMaster University, Hamilton, Canada bala@mcmaster.ca p. 1/52 Pierre-Simon Laplace 1749 1827 p. 2/52 Laplace s Biography Born: On March

More information

Estimation of the quantile function using Bernstein-Durrmeyer polynomials

Estimation of the quantile function using Bernstein-Durrmeyer polynomials Estimation of the quantile function using Bernstein-Durrmeyer polynomials Andrey Pepelyshev Ansgar Steland Ewaryst Rafaj lowic The work is supported by the German Federal Ministry of the Environment, Nature

More information

Nonparametric Methods

Nonparametric Methods Nonparametric Methods Michael R. Roberts Department of Finance The Wharton School University of Pennsylvania July 28, 2009 Michael R. Roberts Nonparametric Methods 1/42 Overview Great for data analysis

More information

High Breakdown Analogs of the Trimmed Mean

High Breakdown Analogs of the Trimmed Mean High Breakdown Analogs of the Trimmed Mean David J. Olive Southern Illinois University April 11, 2004 Abstract Two high breakdown estimators that are asymptotically equivalent to a sequence of trimmed

More information

Some New Results on Information Properties of Mixture Distributions

Some New Results on Information Properties of Mixture Distributions Filomat 31:13 (2017), 4225 4230 https://doi.org/10.2298/fil1713225t Published by Faculty of Sciences and Mathematics, University of Niš, Serbia Available at: http://www.pmf.ni.ac.rs/filomat Some New Results

More information

Nonparametric regression with martingale increment errors

Nonparametric regression with martingale increment errors S. Gaïffas (LSTA - Paris 6) joint work with S. Delattre (LPMA - Paris 7) work in progress Motivations Some facts: Theoretical study of statistical algorithms requires stationary and ergodicity. Concentration

More information

On a weighted total variation minimization problem

On a weighted total variation minimization problem On a weighted total variation minimization problem Guillaume Carlier CEREMADE Université Paris Dauphine carlier@ceremade.dauphine.fr Myriam Comte Laboratoire Jacques-Louis Lions, Université Pierre et Marie

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Pierpaolo De Blasi University of Turin 10th August 2007, BNR Workshop, Isaac Newton Intitute, Cambridge, UK Joint work with Giovanni Peccati (Université Paris VI) and

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

HANDBOOK OF APPLICABLE MATHEMATICS

HANDBOOK OF APPLICABLE MATHEMATICS HANDBOOK OF APPLICABLE MATHEMATICS Chief Editor: Walter Ledermann Volume VI: Statistics PART A Edited by Emlyn Lloyd University of Lancaster A Wiley-Interscience Publication JOHN WILEY & SONS Chichester

More information

EFFICIENT MULTIVARIATE ENTROPY ESTIMATION WITH

EFFICIENT MULTIVARIATE ENTROPY ESTIMATION WITH EFFICIENT MULTIVARIATE ENTROPY ESTIMATION WITH HINTS OF APPLICATIONS TO TESTING SHAPE CONSTRAINTS Richard Samworth, University of Cambridge Joint work with Thomas B. Berrett and Ming Yuan Collaborators

More information

ROBUST ESTIMATION OF A CORRELATION COEFFICIENT: AN ATTEMPT OF SURVEY

ROBUST ESTIMATION OF A CORRELATION COEFFICIENT: AN ATTEMPT OF SURVEY ROBUST ESTIMATION OF A CORRELATION COEFFICIENT: AN ATTEMPT OF SURVEY G.L. Shevlyakov, P.O. Smirnov St. Petersburg State Polytechnic University St.Petersburg, RUSSIA E-mail: Georgy.Shevlyakov@gmail.com

More information

A MODIFICATION OF HILL S TAIL INDEX ESTIMATOR

A MODIFICATION OF HILL S TAIL INDEX ESTIMATOR L. GLAVAŠ 1 J. JOCKOVIĆ 2 A MODIFICATION OF HILL S TAIL INDEX ESTIMATOR P. MLADENOVIĆ 3 1, 2, 3 University of Belgrade, Faculty of Mathematics, Belgrade, Serbia Abstract: In this paper, we study a class

More information

Refining the Central Limit Theorem Approximation via Extreme Value Theory

Refining the Central Limit Theorem Approximation via Extreme Value Theory Refining the Central Limit Theorem Approximation via Extreme Value Theory Ulrich K. Müller Economics Department Princeton University February 2018 Abstract We suggest approximating the distribution of

More information

ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE. By Michael Nussbaum Weierstrass Institute, Berlin

ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE. By Michael Nussbaum Weierstrass Institute, Berlin The Annals of Statistics 1996, Vol. 4, No. 6, 399 430 ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE By Michael Nussbaum Weierstrass Institute, Berlin Signal recovery in Gaussian

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

Accounting for extreme-value dependence in multivariate data

Accounting for extreme-value dependence in multivariate data Accounting for extreme-value dependence in multivariate data 38th ASTIN Colloquium Manchester, July 15, 2008 Outline 1. Dependence modeling through copulas 2. Rank-based inference 3. Extreme-value dependence

More information

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model Minimum Hellinger Distance Estimation in a Semiparametric Mixture Model Sijia Xiang 1, Weixin Yao 1, and Jingjing Wu 2 1 Department of Statistics, Kansas State University, Manhattan, Kansas, USA 66506-0802.

More information

Highly Robust Variogram Estimation 1. Marc G. Genton 2

Highly Robust Variogram Estimation 1. Marc G. Genton 2 Mathematical Geology, Vol. 30, No. 2, 1998 Highly Robust Variogram Estimation 1 Marc G. Genton 2 The classical variogram estimator proposed by Matheron is not robust against outliers in the data, nor is

More information

Random Bernstein-Markov factors

Random Bernstein-Markov factors Random Bernstein-Markov factors Igor Pritsker and Koushik Ramachandran October 20, 208 Abstract For a polynomial P n of degree n, Bernstein s inequality states that P n n P n for all L p norms on the unit

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

Exercises with solutions (Set D)

Exercises with solutions (Set D) Exercises with solutions Set D. A fair die is rolled at the same time as a fair coin is tossed. Let A be the number on the upper surface of the die and let B describe the outcome of the coin toss, where

More information

Asymptotic statistics using the Functional Delta Method

Asymptotic statistics using the Functional Delta Method Quantiles, Order Statistics and L-Statsitics TU Kaiserslautern 15. Februar 2015 Motivation Functional The delta method introduced in chapter 3 is an useful technique to turn the weak convergence of random

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

Estimation of a quadratic regression functional using the sinc kernel

Estimation of a quadratic regression functional using the sinc kernel Estimation of a quadratic regression functional using the sinc kernel Nicolai Bissantz Hajo Holzmann Institute for Mathematical Stochastics, Georg-August-University Göttingen, Maschmühlenweg 8 10, D-37073

More information

A Bootstrap Test for Conditional Symmetry

A Bootstrap Test for Conditional Symmetry ANNALS OF ECONOMICS AND FINANCE 6, 51 61 005) A Bootstrap Test for Conditional Symmetry Liangjun Su Guanghua School of Management, Peking University E-mail: lsu@gsm.pku.edu.cn and Sainan Jin Guanghua School

More information

On the asymptotic normality of frequency polygons for strongly mixing spatial processes. Mohamed EL MACHKOURI

On the asymptotic normality of frequency polygons for strongly mixing spatial processes. Mohamed EL MACHKOURI On the asymptotic normality of frequency polygons for strongly mixing spatial processes Mohamed EL MACHKOURI Laboratoire de Mathématiques Raphaël Salem UMR CNRS 6085, Université de Rouen France mohamed.elmachkouri@univ-rouen.fr

More information

Exact goodness-of-fit tests for censored data

Exact goodness-of-fit tests for censored data Exact goodness-of-fit tests for censored data Aurea Grané Statistics Department. Universidad Carlos III de Madrid. Abstract The statistic introduced in Fortiana and Grané (23, Journal of the Royal Statistical

More information

A New Estimator for a Tail Index

A New Estimator for a Tail Index Acta Applicandae Mathematicae 00: 3, 2003. 2003 Kluwer Academic Publishers. Printed in the Netherlands. A New Estimator for a Tail Index V. PAULAUSKAS Department of Mathematics and Informatics, Vilnius

More information

Asymptotic distribution of the sample average value-at-risk

Asymptotic distribution of the sample average value-at-risk Asymptotic distribution of the sample average value-at-risk Stoyan V. Stoyanov Svetlozar T. Rachev September 3, 7 Abstract In this paper, we prove a result for the asymptotic distribution of the sample

More information

Estimation of density level sets with a given probability content

Estimation of density level sets with a given probability content Estimation of density level sets with a given probability content Benoît CADRE a, Bruno PELLETIER b, and Pierre PUDLO c a IRMAR, ENS Cachan Bretagne, CNRS, UEB Campus de Ker Lann Avenue Robert Schuman,

More information

SUPPLEMENTARY MATERIAL FOR PUBLICATION ONLINE 1

SUPPLEMENTARY MATERIAL FOR PUBLICATION ONLINE 1 SUPPLEMENTARY MATERIAL FOR PUBLICATION ONLINE 1 B Technical details B.1 Variance of ˆf dec in the ersmooth case We derive the variance in the case where ϕ K (t) = ( 1 t 2) κ I( 1 t 1), i.e. we take K as

More information

Quantile Approximation of the Chi square Distribution using the Quantile Mechanics

Quantile Approximation of the Chi square Distribution using the Quantile Mechanics Proceedings of the World Congress on Engineering and Computer Science 017 Vol I WCECS 017, October 57, 017, San Francisco, USA Quantile Approximation of the Chi square Distribution using the Quantile Mechanics

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

PLEASE SCROLL DOWN FOR ARTICLE. Full terms and conditions of use:

PLEASE SCROLL DOWN FOR ARTICLE. Full terms and conditions of use: This article was downloaded by: [Ferdowsi University of Mashhad] On: 7 June 2010 Access details: Access Details: [subscription number 912974449] Publisher Taylor & Francis Informa Ltd Registered in England

More information

Inference on distributions and quantiles using a finite-sample Dirichlet process

Inference on distributions and quantiles using a finite-sample Dirichlet process Dirichlet IDEAL Theory/methods Simulations Inference on distributions and quantiles using a finite-sample Dirichlet process David M. Kaplan University of Missouri Matt Goldman UC San Diego Midwest Econometrics

More information

ORDER STATISTICS, QUANTILES, AND SAMPLE QUANTILES

ORDER STATISTICS, QUANTILES, AND SAMPLE QUANTILES ORDER STATISTICS, QUANTILES, AND SAMPLE QUANTILES 1. Order statistics Let X 1,...,X n be n real-valued observations. One can always arrangetheminordertogettheorder statisticsx (1) X (2) X (n). SinceX (k)

More information

Bayesian estimation of the discrepancy with misspecified parametric models

Bayesian estimation of the discrepancy with misspecified parametric models Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

More information

Exact Asymptotics in Complete Moment Convergence for Record Times and the Associated Counting Process

Exact Asymptotics in Complete Moment Convergence for Record Times and the Associated Counting Process A^VÇÚO 33 ò 3 Ï 207 c 6 Chinese Journal of Applied Probability Statistics Jun., 207, Vol. 33, No. 3, pp. 257-266 doi: 0.3969/j.issn.00-4268.207.03.004 Exact Asymptotics in Complete Moment Convergence for

More information

ECE 4400:693 - Information Theory

ECE 4400:693 - Information Theory ECE 4400:693 - Information Theory Dr. Nghi Tran Lecture 8: Differential Entropy Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43 Outline 1 Review: Entropy of discrete RVs 2 Differential

More information

Adaptive Nonparametric Density Estimators

Adaptive Nonparametric Density Estimators Adaptive Nonparametric Density Estimators by Alan J. Izenman Introduction Theoretical results and practical application of histograms as density estimators usually assume a fixed-partition approach, where

More information

Efficient Estimation for the Partially Linear Models with Random Effects

Efficient Estimation for the Partially Linear Models with Random Effects A^VÇÚO 1 33 ò 1 5 Ï 2017 c 10 Chinese Journal of Applied Probability and Statistics Oct., 2017, Vol. 33, No. 5, pp. 529-537 doi: 10.3969/j.issn.1001-4268.2017.05.009 Efficient Estimation for the Partially

More information

UNIVERSALITY OF THE RIEMANN ZETA FUNCTION: TWO REMARKS

UNIVERSALITY OF THE RIEMANN ZETA FUNCTION: TWO REMARKS Annales Univ. Sci. Budapest., Sect. Comp. 39 (203) 3 39 UNIVERSALITY OF THE RIEMANN ZETA FUNCTION: TWO REMARKS Jean-Loup Mauclaire (Paris, France) Dedicated to Professor Karl-Heinz Indlekofer on his seventieth

More information

RENORMALIZATION OF DYSON S VECTOR-VALUED HIERARCHICAL MODEL AT LOW TEMPERATURES

RENORMALIZATION OF DYSON S VECTOR-VALUED HIERARCHICAL MODEL AT LOW TEMPERATURES RENORMALIZATION OF DYSON S VECTOR-VALUED HIERARCHICAL MODEL AT LOW TEMPERATURES P. M. Bleher (1) and P. Major (2) (1) Keldysh Institute of Applied Mathematics of the Soviet Academy of Sciences Moscow (2)

More information

Nonlinear Analysis 72 (2010) Contents lists available at ScienceDirect. Nonlinear Analysis. journal homepage:

Nonlinear Analysis 72 (2010) Contents lists available at ScienceDirect. Nonlinear Analysis. journal homepage: Nonlinear Analysis 72 (2010) 4298 4303 Contents lists available at ScienceDirect Nonlinear Analysis journal homepage: www.elsevier.com/locate/na Local C 1 ()-minimizers versus local W 1,p ()-minimizers

More information

Oberwolfach Preprints

Oberwolfach Preprints Oberwolfach Preprints OWP 202-07 LÁSZLÓ GYÖFI, HAO WALK Strongly Consistent Density Estimation of egression esidual Mathematisches Forschungsinstitut Oberwolfach ggmbh Oberwolfach Preprints (OWP) ISSN

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

A NOTE ON TESTING THE SHOULDER CONDITION IN LINE TRANSECT SAMPLING

A NOTE ON TESTING THE SHOULDER CONDITION IN LINE TRANSECT SAMPLING A NOTE ON TESTING THE SHOULDER CONDITION IN LINE TRANSECT SAMPLING Shunpu Zhang Department of Mathematical Sciences University of Alaska, Fairbanks, Alaska, U.S.A. 99775 ABSTRACT We propose a new method

More information

NORM DEPENDENCE OF THE COEFFICIENT MAP ON THE WINDOW SIZE 1

NORM DEPENDENCE OF THE COEFFICIENT MAP ON THE WINDOW SIZE 1 NORM DEPENDENCE OF THE COEFFICIENT MAP ON THE WINDOW SIZE Thomas I. Seidman and M. Seetharama Gowda Department of Mathematics and Statistics University of Maryland Baltimore County Baltimore, MD 8, U.S.A

More information