Nonparametric estimation of the entropy using a ranked set sample


arXiv: v3 [stat.CO] 14 Jun 2015

Morteza Amini and Mahdi Mahdizadeh

Department of Statistics, School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, P.O. Box, Tehran, Iran
School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), P.O. Box, Tehran, Iran
Department of Statistics, Hakim Sabzevari University, P.O. Box 397, Sabzevar, Iran

June 16, 2015

Abstract

This paper is concerned with nonparametric estimation of the entropy in ranked set sampling. Theoretical properties of the proposed estimator are studied, and the proposed estimator is compared with the rival estimator in simple random sampling. Applications of the proposed estimator to the estimation of the mutual information as well as the Kullback-Leibler divergence are provided. Several Monte Carlo simulation studies are conducted to examine the performance of the estimator. The results are applied to the longleaf pine (Pinus palustris) trees and the body fat percentage data sets to illustrate the applicability of the theoretical results.

Keywords: Kernel density estimation; Multi-stage ranked set sampling; Nonlinear dependency; Optimal bandwidth selection.

AMS Subject Classifications: primary: 94A17, 62D05; secondary: 62G07, 62P10.

Corresponding author. Email addresses: morteza.amini@ut.ac.ir (Morteza Amini), mahdizadeh.m@live.com (Mahdi Mahdizadeh)

1 Introduction

In many fields of science, including environmental, biological and ecological studies, measurement of the variable of interest is usually time-consuming and/or costly. Under simple random sampling (SRS), this leads to samples of small sizes with which statistical inference is less reliable. Thus, if the situation permits, a more efficient sampling technique may be desirable. A proper choice in such situations is ranked set sampling (RSS), which was originally introduced by McIntyre (1952) in the context of an agricultural experiment. The application of RSS in many other areas of science has been studied by many researchers, including Kvam (2003), who applied RSS to environmental monitoring, Halls and Dell (1966), who discussed the application of RSS in forestry, and Chen et al. (2005), who applied RSS in medicine.

RSS is highly applicable when a few units in a set are easily ranked without full measurement. One of the judgment ranking methods is to use an auxiliary variable for ranking. The auxiliary variable is often highly correlated with the variable of interest, and its measurement is usually inexpensive and fast. For example, in a clinical trial, measurement of the body fat percentage is time-consuming and expensive, whereas measurement of the body (waist) circumference, which is highly correlated with the body fat percentage, is inexpensive and fast. This auxiliary variable may be used to extract a ranked set sample of the values of the variable of interest. In general, the judgment ranking may be based on any means not involving formal measurement, e.g. visual ranking or a concomitant variable.

Many authors have introduced extensions of RSS to construct improved estimators of different population attributes. Ozturk (2011) proposed some variations on RSS in which rankers are permitted to declare ties. Multistage ranked set sampling (MRSS), proposed by Al-Saleh and Al-Omari (2002), is another example of such a design. The sampling methodology of MRSS is described in the following algorithm; balanced RSS is simply the case $r = 1$. A code sketch of the procedure is given after the list.

1. Randomly identify $k^{r+1}$ ($r \geq 1$) units from the target population and allocate them randomly into $k^{r-1}$ sets, each of size $k^2$.

2. For each set in step 1, partition it into $k$ samples of size $k$; apply judgment ordering, by any cheap method not involving formal measurement, to the elements of the $i$th ($i = 1,\dots,k$) sample, and identify the $i$th (judgment) smallest unit, to get a (judgment) ranked set of size $k$. This step gives $k^{r-1}$ (judgment) ranked sets, each of size $k$.

3. Repeat step 2, $r-1$ times, applying each ranking stage to the ranked sets of its previous stage. A single ranked set of size $k$ is acquired in the $(r-1)$th stage.

4. Actually measure the $k$ units identified in step 3.

5. Repeat steps 1-4, $m-1$ further times, to obtain a sample of size $n = mk$.
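The following sketch (ours, not the authors'; perfect ranking stands in for the cheap judgment ordering, and the names `ranked_set`, `mrss_cycle`, `mrss_sample` and `pop_draw` are hypothetical) implements the algorithm above for any routine `pop_draw` that returns a set of candidate units:

    import numpy as np

    def ranked_set(sets):
        # one ranking step: from k sets of size k, keep the i-th smallest
        # unit of the i-th set, giving one (judgment) ranked set of size k
        k = len(sets)
        return np.array([np.sort(sets[i])[i] for i in range(k)])

    def mrss_cycle(pop_draw, k, r):
        # one cycle of r-stage MRSS: k**(r+1) units are reduced, through r
        # ranking stages, to the k units that are actually measured
        # (r = 1 is balanced RSS)
        sets = [pop_draw(k) for _ in range(k ** r)]   # k**r raw sets of size k
        for _ in range(r):
            sets = [ranked_set(sets[j:j + k]) for j in range(0, len(sets), k)]
        return sets[0]

    def mrss_sample(pop_draw, k, m, r):
        # m independent cycles give the sample {X_[i]j}, arranged as (k, m)
        return np.column_stack([mrss_cycle(pop_draw, k, r) for _ in range(m)])

    rng = np.random.default_rng(0)
    sample = mrss_sample(rng.standard_normal, k=3, m=10, r=2)  # a DRSS sample
    print(sample.shape)                                        # (3, 10)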

We denote the sample collected using MRSS by $\{X_{[i]j} : i = 1,\dots,k;\ j = 1,\dots,m\}$, where $X_{[i]j}$ is the $i$th judgment order statistic from the $j$th cycle. The special case $r = 2$ of MRSS (Al-Saleh and Al-Kadiri, 2000) is known as double ranked set sampling (DRSS). The ranked set sample contains not only measurements on the quantified units but also additional information on their ranks. Since RSS provides more structured samples than SRS, improved inference may result from the use of the RSS design.

Investigation of RSS in nonparametric settings has attracted the attention of many researchers. Dell and Clutter (1972) formally showed that the sample mean under RSS is an unbiased estimator of the population mean regardless of ranking errors, and that it has a smaller variance than the sample mean under SRS when the numbers of measured units are the same. Stokes (1980) considered estimation of the variance and showed that improved estimates of the variance can be produced under RSS. Stokes and Sager (1988) characterized a ranked set sample as a sample from a conditional distribution, conditioning on a multinomial random vector, and applied RSS to the estimation of the cumulative distribution function. Chen (1999) studied the kernel method of density estimation in RSS. Deshpande et al. (2006) obtained nonparametric confidence intervals for quantiles of the population based on a ranked set sample.

Since the concept of differential entropy was introduced in Shannon's (1948) paper, it has been of great interest, both because of its wide applications, especially in signal analysis and the analysis of neural data, and because of its interesting theoretical properties. The concept of mutual information, defined in terms of the Shannon entropy, is also well known, especially as a well-defined dependence measure (see Cover and Thomas, 2012 for more details). The mutual information has several advantages over the classical dependence measures, such as Pearson's, Spearman's and Kendall's correlation coefficients. The Pearson correlation coefficient measures only linear dependence, while the Spearman and Kendall correlation coefficients use only the ranks of the observations. The mutual information enables us to measure both linear and nonlinear dependence between two or more variables.

Joe (1989) studied the large sample properties of the kernel-based estimator of the entropy based on SRS; he also obtained the optimal bandwidth and a proper kernel function. In this paper, we are especially interested in estimating the entropy based on RSS and MRSS by adapting the work of Joe (1989). The proposed estimator is applied to estimate the mutual information of variables as well as the Kullback-Leibler divergence measure. We compare the proposed kernel-based estimator of the entropy based on RSS and MRSS samples with that of Joe (1989) by approximating the relative efficiency of the proposed estimator with respect to its competitor in SRS. We observe that the RSS and MRSS methods substantially improve the kernel-based estimation of the entropy.

The outline of this paper is as follows. Section 2 introduces a nonparametric estimator of the entropy in RSS. The approximate bias and mean square error (MSE) of the proposed estimator are derived, the choice of kernel and the optimal bandwidth are discussed, and estimation of the MSE of the entropy estimator and approximation of its efficiency relative to the corresponding estimator in SRS are studied. Applications of the proposed estimator to estimation of the mutual information and the Kullback-Leibler divergence criteria are discussed in Section 3. A simulation study and two real data examples are presented in Section 4 to examine the performance of the proposed estimators and to illustrate the applicability of the theoretical results.

2 The proposed estimator

Let $f_X$ be a $p$-variate ($p \geq 1$) probability density function (pdf) with respect to the Lebesgue measure $\mu$, with distribution function $F_X$. We consider estimation of

$$H(X) = -\int_S f_X \log f_X \, d\mu,$$

using a ranked set sample, where $S$ is a bounded set such that $f_X$ is bounded below on $S$ by a positive constant and $\int_S f_X \log f_X \, d\mu \approx \int_{\mathbb{R}^p} f_X \log f_X \, d\mu$. We assume that $f_X$ has continuous first and second derivatives, which are dominated by integrable functions.

For each $i$, $X_{[i]1},\dots,X_{[i]m}$ are independent and identically distributed as $X_{[i]}$. For $p = 1$ and ordinary RSS ($r = 1$), when the ranking is perfect, $X_{[i]}$ is distributed as the $i$th order statistic of a sample of size $k$ from the parent distribution $F_X$, while $X_{[1]},\dots,X_{[k]}$ are independent. For $p = 2$, if the values of $X = (X^{(1)}, X^{(2)})$ are ranked by $X^{(1)}$, then the distribution of $X_{[i]}$ is identical to the joint distribution of the $i$th order statistic and its concomitant in an iid sample of size $k$ from $f_X$. For perfect ranking in the DRSS case ($r = 2$), the distribution of $X_{[i]}$ is identical to that of the $i$th order statistic and its concomitants of the independent but not identically distributed sample $X_{[1]},\dots,X_{[k]}$.

We assume throughout that the ranked set sample $X_{[i]j}$, $i = 1,\dots,k$, $j = 1,\dots,m$, is consistent, that is,

(C1) for each $x$, $\frac{1}{k}\sum_{i=1}^{k} F_{X_{[i]}}(x) = F_X(x)$.

It is well known that if the ranking of $X$ is performed using one of its components, then both RSS and MRSS samples are consistent (see, e.g., Al-Saleh and Al-Omari (2002) for a proof of consistency in the DRSS case).
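Under perfect ranking with $p = 1$, $F_{X_{[i]}}(x)$ is the Beta$(i, k-i+1)$ cdf evaluated at $F_X(x)$, so condition (C1) can be checked numerically; the short sketch below (our own illustration, not from the paper) does this for a standard normal parent:

    import numpy as np
    from scipy.stats import beta, norm

    k = 5
    x = np.linspace(-3.0, 3.0, 7)
    u = norm.cdf(x)                      # F_X(x) for the standard normal parent

    # F_{X_[i]}(x) = P(i-th of k uniform order statistics <= F_X(x))
    F_ranked = np.array([beta.cdf(u, i, k - i + 1) for i in range(1, k + 1)])

    # condition (C1): averaging the k ranked cdfs recovers F_X
    print(np.max(np.abs(F_ranked.mean(axis=0) - u)))   # ~ 1e-16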

For any $t$ in the support of $X$, the kernel density estimate of $f_X(t)$ based on the ranked set sample $X_{[i]j}$, $i = 1,\dots,k$, $j = 1,\dots,m$, is (Chen, 1999)

$$\hat f_n^{RSS}(t) = \frac{1}{n\gamma^p} \sum_{i=1}^{k}\sum_{j=1}^{m} K_p\left(\frac{t - X_{[i]j}}{\gamma}\right),$$

where $K_p$ is a kernel function and $\gamma$ is a positive bandwidth. We assume throughout that $K_p(u_1,\dots,u_p) = \prod_{j=1}^{p} k_0(u_j)$, where $k_0$ is a univariate kernel function satisfying the following conditions:

(C2) $k_0(u) = k_0(-u)$;
(C3) $\int k_0(u)\, du = 1$;
(C4) $\int u^2 k_0(u)\, du = 1$.

Let $F_n^{RSS}(t) = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{m} I(X_{[i]j} \leq t)$ be the estimator of $F_X$ in RSS; then

$$\hat f_n^{RSS}(t) = \frac{1}{\gamma^p}\int K_p\left(\frac{t-u}{\gamma}\right) dF_n^{RSS}(u).$$

We propose the estimator of the entropy of $X$, $H(X)$, in RSS based on $X_{[i]j}$, $i = 1,\dots,k$, $j = 1,\dots,m$, as

$$H_n^{RSS}(X) = -\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{m} \log \hat f_n^{RSS}(X_{[i]j})\, I_S(X_{[i]j}) = -\int_S \log \hat f_n^{RSS}(x)\, dF_n^{RSS}(x).$$

The asymptotic properties of the analogous estimator $H_n^{SRS}(X)$ were studied by Joe (1989). In the next section, we study the asymptotic properties of $H_n^{RSS}(X)$.

2.1 Characteristics of the estimator

In this section, we approximate the MSE and the bias of $H_n^{RSS}(X)$. With a suitable choice of the bandwidth $\gamma$ (see Section 4), we deduce that $\mathrm{MSE}(H_n^{RSS}(X))$ is $O(n^{-1})$, that is, $H_n^{RSS}(X)$ is a consistent estimator of $H(X)$. To prove the results, we need the following lemma, which is an analog of Lemma 2.1 of Joe (1989).

Lemma 1 Let $U_n(x) = \sqrt{n}\,(F_n^{RSS}(x) - F_X(x))$. Then, under condition (C1),

(i) for any integrable function $a(x)$ with $E(a(X)) < \infty$, we have $E\int a(x)\, dU_n(x) = 0$;
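A minimal sketch of $\hat f_n^{RSS}$ and $H_n^{RSS}(X)$ (our own; a standard Gaussian $k_0$ is used purely as a placeholder, since the kernels recommended by the paper appear in Section 2.2, and $I_S$ is taken as identically one):

    import numpy as np

    def kde_rss(t, data, gamma):
        # product-kernel density estimate at the points t (q, p), built from
        # the pooled ranked set sample `data` (n, p) with bandwidth gamma
        n, p = data.shape
        u = (t[:, None, :] - data[None, :, :]) / gamma          # (q, n, p)
        k0 = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)       # placeholder k0
        return k0.prod(axis=2).sum(axis=1) / (n * gamma ** p)

    def entropy_rss(data, gamma):
        # H_n^RSS = -(1/n) sum_{i,j} log f_n(X_[i]j), with I_S == 1
        return -np.mean(np.log(kde_rss(data, data, gamma)))

    rng = np.random.default_rng(1)
    x = rng.standard_normal((30, 1))     # stand-in for a pooled RSS sample
    print(entropy_rss(x, gamma=0.5))     # target: 0.5*log(2*pi*e) ~ 1.4189

Note that the estimator of Chen (1999) pools all $n = mk$ measured points, regardless of rank, into a single kernel density estimate, which is why the rank structure does not appear explicitly in this sketch.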

(ii) for any integrable function $a(x,y)$, we have

$$E\int\int a(x,y)\, dU_n(x)\, dU_n(y) = \int a(x,x)\, dF_X(x) - \frac{1}{k}\sum_{i=1}^{k}\int\int a(x,y)\, f_{X_{[i]}}(x)\, f_{X_{[i]}}(y)\, dx\, dy.$$

Proof. See the Appendix.

By using Lemma 1, we get the following result.

Theorem 1 (i) The bias of $H_n^{RSS}(X)$ is

$$E(H_n^{RSS}(X)) - H(X) = (H_\gamma - H(X)) + n^{-1}\alpha_1 + O(n^{-3/2}),$$

where $H_\gamma = -E\big(I_S(X)\log(\bar f_\gamma(X))\big)$, $\bar f_\gamma(x) = E\left(\frac{1}{\gamma^p} K_p\left(\frac{x - X}{\gamma}\right)\right)$,

$$\alpha_1 = \int_S B(x,x)\, dF_X(x) - \frac{1}{k}\sum_{i=1}^{k}\int\int B(x,y)\, f_{X_{[i]}}(x)\, f_{X_{[i]}}(y)\, dx\, dy,$$

and

$$B(x,y) = -\frac{1}{2}\int_S \frac{\frac{1}{\gamma^{2p}} K_p\left(\frac{z-x}{\gamma}\right) K_p\left(\frac{z-y}{\gamma}\right)}{(\bar f_\gamma(z))^2}\, dF_X(z) + \frac{\frac{1}{\gamma^{p}} K_p\left(\frac{y-x}{\gamma}\right)}{\bar f_\gamma(x)}\, I_S(x). \qquad (1)$$

(ii) The MSE of $H_n^{RSS}(X)$ is

$$E(H_n^{RSS}(X) - H(X))^2 = (H_\gamma - H(X))^2 + n^{-1}\big(\alpha_2 - 2\alpha_1 (H_\gamma - H(X))\big) + O(n^{-3/2}), \qquad (2)$$

where

$$\alpha_2 = \int_S A^2(x)\, dF_X(x) - \frac{1}{k}\sum_{i=1}^{k}\left(\int A(x)\, dF_{X_{[i]}}(x)\right)^2,$$

and

$$A(x) = \int \frac{\frac{1}{\gamma^{p}} K_p\left(\frac{y-x}{\gamma}\right)}{\bar f_\gamma(y)}\, dF_X(y) + \log(\bar f_\gamma(x))\, I_S(x). \qquad (3)$$

Proof. See the Appendix.

2.2 Choosing the optimal bandwidth and kernel

To choose an optimal value for the bandwidth and a suitable kernel function, note that

$$\alpha_1 = \theta + \gamma^{-p} K(0) \int_S \bar f_\gamma(x)^{-1}\, dF_X(x) - \frac{\gamma^{-p}\kappa_2}{2} \int_S \frac{\tilde f_\gamma(z)}{\bar f_\gamma(z)^2}\, dF_X(z)$$
$$\phantom{\alpha_1} = \theta - \gamma^{-p}\left(\frac{\kappa_2}{2} - K(0)\right) \int_S \bar f_\gamma(x)^{-1}\, dF_X(x) - \frac{\gamma^{-p}\kappa_2}{2} \int_S \frac{\tilde f_\gamma(z) - \bar f_\gamma(z)}{\bar f_\gamma(z)^2}\, dF_X(z),$$

where

$$\theta = -\frac{1}{k}\sum_{i=1}^{k}\int\int B(x,y)\, f_{X_{[i]}}(x)\, f_{X_{[i]}}(y)\, dx\, dy,$$

$\tilde f_\gamma(z) = E\big(I_S(X)\, \gamma^{-p} K^2((z - X)/\gamma)\big)/\kappa_2$ and $\kappa_2 = \int K^2(u)\, du$. The term $\theta$ is $O(1)$. In addition,

$$\bar f_\gamma(z) - \tilde f_\gamma(z) = \frac{\gamma^{2}}{2}\, \mathrm{tr}\, f''_X(z) \left[\int \frac{v^2 k_0^2(v)}{\kappa_{02}}\, dv - 1\right] + O(\gamma^{2}),$$

where $\kappa_{02} = \int k_0^2(u)\, du$. Thus, if

$$k_0(0) = \kappa_{02}\, 2^{-1/p} \qquad (4)$$

and

$$\int \frac{v^2 k_0^2(v)}{\kappa_{02}}\, dv = 1, \qquad (5)$$

then we have

$$\alpha_1 = O(1) + O(\gamma^{2-p}).$$

Joe (1989) showed that a kernel function of the form

$$k_0(u) = \begin{cases} \eta_1 + \eta_2 |u|, & |u| < \xi_1, \\ \eta_3 - \eta_4 |u|, & \xi_1 < |u| < \xi_2, \end{cases} \qquad (6)$$

satisfies conditions (4) and (5), and computed $\eta_i$, $\xi_j$, $i = 1,\dots,4$, $j = 1, 2$, for $p = 1,\dots,4$. For $p \leq 2$, a competitor of the kernel function in (6) would be

$$k_0(u) = (4\pi)^{-0.5} \exp\{-u^2/4\}, \qquad (7)$$

which satisfies condition (5).
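The claim that the kernel in (7) satisfies (C3) and condition (5) is easy to confirm numerically; a check of our own:

    import numpy as np
    from scipy.integrate import quad

    k0 = lambda u: (4.0 * np.pi) ** -0.5 * np.exp(-u ** 2 / 4.0)  # kernel (7)

    kappa02 = quad(lambda u: k0(u) ** 2, -np.inf, np.inf)[0]
    print(quad(k0, -np.inf, np.inf)[0])                           # (C3): 1.0
    print(quad(lambda u: u ** 2 * k0(u) ** 2 / kappa02,
               -np.inf, np.inf)[0])                               # (5): 1.0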

Using (7), we have, for $p \leq 2$,

$$\alpha_1 = O(1) + O(\gamma^{-p}).$$

On the other hand, since $H_\gamma - H(X) = O(\gamma^2)$, for the kernel function in (6) we get

$$E(H_n^{RSS}(X) - H(X))^2 = O(n^{-1}) + O(n^{-1}\gamma^{2}) + O(n^{-1}\gamma^{4-p}) + O(\gamma^{4})$$

and

$$E(H_n^{RSS}(X) - H(X)) = O(n^{-1}) + O(n^{-1}\gamma^{2-p}) + O(\gamma^{2}),$$

while for the kernel function in (7) we get

$$E(H_n^{RSS}(X) - H(X))^2 = O(n^{-1}) + O(n^{-1}\gamma^{2}) + O(n^{-1}\gamma^{2-p}) + O(\gamma^{4})$$

and

$$E(H_n^{RSS}(X) - H(X)) = O(n^{-1}) + O(n^{-1}\gamma^{-p}) + O(\gamma^{2}).$$

Using the kernel function in (6), for $p \leq 2$, setting $\gamma = O(n^{-1/(2+0.5p)})$, we deduce that

$$E(H_n^{RSS}(X) - H(X))^2 = O(n^{-1}) + O(n^{-4/(2+0.5p)})$$

and

$$E(H_n^{RSS}(X) - H(X)) = O(n^{-2/(2+0.5p)}).$$

Using the kernel function in (7), for $p \leq 2$, setting $\gamma = O(n^{-1/(2+0.5p)})$, we have

$$E(H_n^{RSS}(X) - H(X))^2 = O(n^{-1}) + O(n^{-(4-0.5p)/(2+0.5p)})$$

and

$$E(H_n^{RSS}(X) - H(X)) = O(n^{-(2-0.5p)/(2+0.5p)}).$$

Thus, using either of the kernels in (6) and (7), for $p \leq 2$,

$$E(H_n^{RSS}(X) - H(X))^2 = O(n^{-1}).$$

Although the rate of convergence of the bias is somewhat faster with the kernel in (6), simulations show that, for $p \leq 2$, the kernel function in (7) yields better performance of the entropy estimator than the kernel function in (6), especially for smaller sample sizes.

From the above discussion, for both kernel functions in (6) and (7), and for $p \leq 2$, the optimal bandwidth is of the form $\gamma = c\, n^{-1/(2+0.5p)}$, where $c$ depends on the parameters of $f_X$. To discuss a suitable value of $c$, we first study the behavior of the optimal value of $\gamma$ through simulation, obtained by minimizing the cross-validation criterion

$$CV_\gamma = \frac{1}{m}\sum_{j=1}^{m}\left(H_n^{RSS}(X) - H_n^{RSS}(X^{(-j)})\right)^2, \qquad (8)$$

where $X^{(-j)} = \{X_{[i]l} : i = 1,\dots,k;\ l\,(\neq j) = 1,\dots,m\}$.
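A sketch of the leave-one-cycle-out computation of (8) (ours; it reuses `entropy_rss` from the earlier sketch and assumes the sample is stored with shape (k, m, p)); the signed average of the same differences, which is used as $D_\gamma$ in Section 2.3, is returned alongside:

    import numpy as np

    def cv_gamma(data_km, gamma):
        # data_km: ranked set sample of shape (k, m, p)
        k, m, p = data_km.shape
        full = entropy_rss(data_km.reshape(k * m, p), gamma)
        diffs = np.array([
            full - entropy_rss(
                np.delete(data_km, j, axis=1).reshape(k * (m - 1), p), gamma)
            for j in range(m)])
        return np.mean(diffs ** 2), np.mean(diffs)   # CV_gamma and D_gamma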

Simulations show that the bandwidth should decrease as the dependence in a bivariate density becomes stronger; accordingly, if $\Sigma$ is the covariance matrix of the population, the constant $c$ depends on $\Sigma$. For the multivariate normal distribution with covariance matrix $\Sigma$ (with $|\Sigma|$ not too close to zero), a rough approximation of the proportion $\hat\alpha$ defined below is $\int_R \varphi(x, \Sigma)\, dx$, where $R = [-0.674, 0.674]^p$ (see Joe, 1989). Thus, we take the optimal bandwidth to be

$$\gamma = d_1\, n^{-\hat\alpha/(2+0.5p)}\, \overline{\mathrm{IQR}}, \qquad (9)$$

where $d_1$ is a suitable normalizing constant, which is decreasing in $p$, $\hat\alpha$ is the proportion of the data falling in the rectangle $\prod_{j=1}^{p} [q_{j1}, q_{j3}]$, $q_{j1}$ and $q_{j3}$ are the lower and upper quartiles of the $j$th coordinate, and $\overline{\mathrm{IQR}}$ is the average of the interquartile ranges. Simulations suggest that, for $p \leq 2$, the rule (9) works quite well with the kernel in (7), for $\rho = 0.5(0.1)0.9$, with the values of $d_1$ given in Table 1, for $k = 3, 5$, $p = 1, 2$ and $r = 1, 2$. The value of $d_1$ corresponding to the minimum of the simulated MSE of the estimator $H_n^{RSS}(X)$ over the grid of values $d_1 = 0.5(0.05)2.00$ is chosen as the optimal value of $d_1$.

Remark 1. Simulations suggest that, for $p = 3, 4$, the bandwidth in (9) works quite well with the kernel in (6) under the multivariate normal distribution. We have not provided the suitable values of $d_1$ for this case, for the sake of brevity; however, an illustrative example is provided for the case $p = 3$ in Section 4.3.
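The rule (9) translates into a few lines of code; note that the exact placement of $\hat\alpha$ in the exponent follows our reading of the display above, so this sketch (with the hypothetical name `bandwidth_rule`) should be treated as an interpretation rather than the authors' definitive rule:

    import numpy as np

    def bandwidth_rule(data, d1):
        # gamma = d1 * n^(-alpha_hat/(2+0.5p)) * mean interquartile range
        n, p = data.shape
        q1 = np.quantile(data, 0.25, axis=0)
        q3 = np.quantile(data, 0.75, axis=0)
        alpha_hat = np.mean(np.all((data >= q1) & (data <= q3), axis=1))
        return d1 * n ** (-alpha_hat / (2.0 + 0.5 * p)) * np.mean(q3 - q1)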

Table 1: The suitable choices of $d_1$ in (9) for the estimator $H_n(X)$ under the bivariate normal distribution, for $n = 30$, $k = 3, 5$, $p = 1, 2$, $r = 1, 2$ and $\rho = 0.5(0.1)0.9$.

2.3 Estimation of the MSE of the estimator

Computations show that $CV_\gamma$ in (8) has a large negative bias as an estimator of the MSE of $H_n^{RSS}(X)$. To propose a less biased estimator of the MSE of $H_n^{RSS}(X)$, we use the approximate MSE in Theorem 1, in which $CV_\gamma$ is employed as an estimator of the term $(H_\gamma - H(X))^2$. Since $H_\gamma - H(X)$ is negative for suitable values of $\gamma$, we use $-D_\gamma$ as an estimator of $H_\gamma - H(X)$, where

$$D_\gamma = \frac{1}{m}\sum_{j=1}^{m}\left(H_n^{RSS}(X) - H_n^{RSS}(X^{(-j)})\right).$$

Thus, our proposed estimator of the MSE of $H_n^{RSS}(X)$ is

$$\hat M_\gamma = \left(CV_\gamma + n^{-1}(\hat\alpha_2 + 2\hat\alpha_1 D_\gamma)\right)\delta\left(CV_\gamma + n^{-1}(\hat\alpha_2 + 2\hat\alpha_1 D_\gamma)\right), \qquad (10)$$

where

$$\delta(t) = \begin{cases} 1, & t \geq 0, \\ 0, & t < 0, \end{cases}$$

$$\hat\alpha_1 = \frac{1}{mk}\sum_{i=1}^{k}\sum_{j=1}^{m} \hat B(X_{[i]j}, X_{[i]j})\, I_S(X_{[i]j}) - \frac{1}{m(m-1)k}\sum_{i=1}^{k}\sum_{j\neq j'} \hat B(X_{[i]j}, X_{[i]j'})\, I_S(X_{[i]j})\, I_S(X_{[i]j'}),$$

$$\hat\alpha_2 = \frac{1}{mk}\sum_{i=1}^{k}\sum_{j=1}^{m} \hat A^2(X_{[i]j})\, I_S(X_{[i]j}) - \frac{1}{k}\sum_{i=1}^{k}\left[\frac{1}{m}\sum_{j=1}^{m} \hat A(X_{[i]j})\, I_S(X_{[i]j})\right]^2,$$

$$\hat B(x,y) = -\frac{1}{2mk}\sum_{i=1}^{k}\sum_{j=1}^{m} \frac{\frac{1}{\gamma^{2p}} K_p\left(\frac{X_{[i]j}-x}{\gamma}\right) K_p\left(\frac{X_{[i]j}-y}{\gamma}\right)}{(\hat f_n^{RSS}(X_{[i]j}))^2}\, I_S(X_{[i]j}) + \frac{\frac{1}{\gamma^{p}} K_p\left(\frac{y-x}{\gamma}\right)}{\hat f_n^{RSS}(x)}\, I_S(x),$$

and

$$\hat A(x) = \frac{1}{mk}\sum_{i=1}^{k}\sum_{j=1}^{m} \frac{\frac{1}{\gamma^{p}} K_p\left(\frac{X_{[i]j}-x}{\gamma}\right)}{\hat f_n^{RSS}(X_{[i]j})}\, I_S(X_{[i]j}) + \log(\hat f_n^{RSS}(x))\, I_S(x).$$

The performance of the estimators in (8) and (10) is examined through the simulation studies in Section 4.1; a code sketch assembling (10) is given below.
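Assembling the pieces, a sketch of the corrected MSE estimator (10) (ours; it relies on `entropy_rss` and `cv_gamma` above, uses the Gaussian placeholder kernel, and takes $I_S \equiv 1$):

    import numpy as np

    def mse_estimate(data_km, gamma):
        k, m, p = data_km.shape
        X = data_km.reshape(k * m, p)
        n = k * m
        # Kxx[s, t] = gamma^{-p} K_p((X_s - X_t) / gamma), Gaussian k0
        u = (X[:, None, :] - X[None, :, :]) / gamma
        Kxx = np.exp(-0.5 * u ** 2).prod(axis=2) / \
            ((2.0 * np.pi) ** (p / 2.0) * gamma ** p)
        f = Kxx.mean(axis=1)                       # f_n at the sample points
        # plug-in A_hat at the sample points, cf. (3)
        A = (Kxx / f[:, None]).mean(axis=0) + np.log(f)
        A_km = A.reshape(k, m)
        alpha2 = np.mean(A ** 2) - np.mean(A_km.mean(axis=1) ** 2)
        # plug-in B_hat at all pairs of sample points, cf. (1)
        B = -0.5 * (Kxx / f[:, None] ** 2).T @ Kxx / n + Kxx / f[:, None]
        diag = np.mean(np.diag(B))
        cross = 0.0
        for i in range(k):                         # same rank, distinct cycles
            blk = B[i * m:(i + 1) * m, i * m:(i + 1) * m]
            cross += (blk.sum() - np.trace(blk)) / (m * (m - 1))
        alpha1 = diag - cross / k
        cv, d = cv_gamma(data_km, gamma)
        raw = cv + (alpha2 + 2.0 * alpha1 * d) / n
        return max(raw, 0.0)                       # the delta truncation in (10)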

2.4 The relative efficiency

To approximate the RE of $H_n^{RSS}(X)$ with respect to $H_n^{SRS}(X)$, we use the approximate MSE of $H_n^{SRS}(X)$ obtained by Joe (1989),

$$E(H_n^{SRS}(X) - H(X))^2 = (H_\gamma - H(X))^2 + n^{-1}\big(\beta_2 - 2\beta_1 (H_\gamma - H(X))\big) + O(n^{-3/2}), \qquad (11)$$

where

$$\beta_1 = \int_S B(x,x)\, dF_X(x) - \int\int B(x,y)\, dF_X(x)\, dF_X(y) = \int_S B(x,x)\, dF_X(x) - \frac{1}{2},$$

and

$$\beta_2 = \int_S A^2(x)\, dF_X(x) - \left(\int A(x)\, dF_X(x)\right)^2 = \int_S A^2(x)\, dF_X(x) - (1 - H_\gamma)^2,$$

in which $A(x)$ and $B(x,y)$ are given in (3) and (1), respectively. Joe (1989) also showed that the bias of $H_n^{SRS}(X)$ is

$$E(H_n^{SRS}(X)) - H(X) = (H_\gamma - H(X)) + n^{-1}\beta_1 + O(n^{-3/2}).$$

As a corollary of Theorem 1, and using (11), the RE of $H_n^{RSS}(X)$ relative to $H_n^{SRS}(X)$ is approximated as

$$RE(H_n^{RSS}(X), H_n^{SRS}(X)) \approx 1 + \frac{n^{-1}\big(\beta_2 - \alpha_2 - 2(\beta_1 - \alpha_1)(H_\gamma - H(X))\big)}{(H_\gamma - H(X))^2 + n^{-1}\big(\alpha_2 - 2\alpha_1 (H_\gamma - H(X))\big)}. \qquad (12)$$

Using the Cauchy-Schwarz inequality for sums and the consistency of the RSS sample, we have

$$\frac{1}{k}\sum_{i=1}^{k}\left(\int A(x)\, dF_{X_{[i]}}(x)\right)^2 \geq \left(\frac{1}{k}\sum_{i=1}^{k}\int A(x)\, dF_{X_{[i]}}(x)\right)^2 = \left(\int A(x)\, dF_X(x)\right)^2,$$

and consequently $\beta_2 \geq \alpha_2$.

Numerical computations under the bivariate normal distribution, for different values of $k$, $m$ and $\rho$, show that, for a suitable choice of $\gamma$, the approximated relative efficiency in (12) is greater than 1. The computed values of the approximated relative efficiencies of the estimators of $H(X^{(1)})$ are given in Table 2, for different values of $\rho$, $n$ and $k$, for the RSS and DRSS schemes, under the bivariate normal distribution. Since the values given in Table 2 are population parameters, different optimal values of $\gamma$ are needed for this computation. We used $\gamma = c\, n^{-1/(2+0.5p)}$, with suitable values of the normalizing constant $c$. It is found that, for the SRS scheme, suitable values of $c$ are $c = 1.35$ for $\rho = 0.9$ and $c = 1.30$ for $\rho = 0.8$. For RSS, we used $c = 1.40$ for $k = 3$ and $\rho = 0.9$, $c = 1.35$ for $k = 3$ and $\rho = 0.8$, $c = 1.65$ for $k = 5$ and $\rho = 0.9$, and $c = 1.60$ for $k = 5$ and $\rho = 0.8$.

Table 2: Approximated relative efficiencies, under the bivariate normal distribution, for different values of $\rho$, $n$ and $k$, for the RSS and DRSS schemes.

For DRSS, our suggested values are $c = 1.45$ for $k = 3$ and $\rho = 0.9$, $c = 1.40$ for $k = 3$ and $\rho = 0.8$, $c = 1.70$ for $k = 5$ and $\rho = 0.9$, and $c = 1.65$ for $k = 5$ and $\rho = 0.8$.

As one can see from Table 2, all approximated values of $RE(H_n^{RSS}(X^{(1)}), H_n^{SRS}(X^{(1)}))$ are greater than 1. Also, the relative efficiencies decrease as $\rho$ and/or $n$ increase, and also as $k$ decreases. For DRSS, the relative efficiencies are greater than for RSS, as expected.

Finally, using the results of Al-Saleh and Al-Omari (2002), we have, for $i = 1,\dots,k$ and $x$ in the support of $X$,

$$\lim_{r\to\infty} f_{X_{[i]}}(x) = \begin{cases} k f_X(x), & F_X^{-1}\left(\frac{i-1}{k}\right) \leq x < F_X^{-1}\left(\frac{i}{k}\right), \\ 0, & \text{otherwise}, \end{cases} \qquad (13)$$

and consequently, for each $x, y$ in the support of $X$,

$$\lim_{r\to\infty} \frac{1}{k}\sum_{i=1}^{k} f_{X_{[i]}}(x)\, f_{X_{[i]}}(y) = k f_X(x)\, f_X(y).$$

Thus

$$\lim_{r\to\infty} \alpha_1 = \beta_1 - \frac{k-1}{2}, \qquad \lim_{r\to\infty} \alpha_2 = \beta_2 - (k-1)(1 - H_\gamma)^2,$$

and thereby

$$\lim_{r\to\infty} RE(H_n^{RSS}(X), H_n^{SRS}(X)) \approx 1 + \frac{n^{-1}(k-1)\big((1 - H_\gamma)^2 - H_\gamma + H(X)\big)}{(H_\gamma - H(X))^2 + n^{-1}\left(\beta_2 - (k-1)(1 - H_\gamma)^2 - 2\left(\beta_1 - \frac{k-1}{2}\right)(H_\gamma - H(X))\right)}.$$

3 Applications

In addition to estimation of the entropy, the developed results may be applied to estimation of the Kullback-Leibler distance measure as well as the mutual information criterion. The estimator of the Kullback-Leibler distance measure can be used to construct homogeneity test statistics, while the mutual information is well known as a measure of dependence. These applications are described in the following sections.

3.1 Application to estimation of the mutual information

For $p \geq 2$ and $X = (X^{(1)}, X^{(2)})$, the mutual information of $X^{(1)}$ and $X^{(2)}$, defined as

$$I(X^{(1)}, X^{(2)}) = H(X^{(1)}) + H(X^{(2)}) - H(X) = \int f_X \log\left(\frac{f_X}{f_{X^{(1)}}\, f_{X^{(2)}}}\right) d\mu,$$

which is the Kullback-Leibler divergence between the joint distribution of $X = (X^{(1)}, X^{(2)})$ and the product of its marginal pdfs, is used as a popular and well-defined nonlinear correlation criterion. If, for example, $X$ follows the bivariate normal distribution with correlation parameter $\rho$, the mutual information between $X^{(1)}$ and $X^{(2)}$ is $I(X^{(1)}, X^{(2)}) = -0.5\log(1 - \rho^2)$. Motivated by this relationship, and using the fact that $I(X^{(1)}, X^{(2)}) \geq 0$, a standardized nonlinear correlation measure based on the mutual information may be defined as

$$I^*(X^{(1)}, X^{(2)}) = 1 - e^{-2 I(X^{(1)}, X^{(2)})} \in (0, 1). \qquad (14)$$

Using the proposed estimator of the entropy, the mutual information $I(X^{(1)}, X^{(2)})$ is estimated by

$$\hat I(X^{(1)}, X^{(2)}) = H_n^{RSS}(X^{(1)}) + H_n^{RSS}(X^{(2)}) - H_n^{RSS}(X) = \int \log\left(\frac{\hat f_n^{X;RSS}(x_1, x_2)}{\hat f_n^{X^{(1)};RSS}(x_1)\, \hat f_n^{X^{(2)};RSS}(x_2)}\right) dF_n^{X;RSS}(x_1, x_2), \qquad (15)$$

where $H_n^{RSS}(X^{(l)})$ is the estimator of $H(X^{(l)})$ in RSS based on $X^{(l)}_{[i]j}$, $i = 1,\dots,k$, $j = 1,\dots,m$, $l = 1, 2$, and $H_n^{RSS}(X)$ is the estimator of $H(X)$ based on $(X^{(1)}_{(i)j}, X^{(2)}_{[i]j})$, $i = 1,\dots,k$, $j = 1,\dots,m$, that is,

$$H_n^{RSS}(X) = -\frac{1}{mk}\sum_{i=1}^{k}\sum_{j=1}^{m} \log \hat f_n^{RSS}(X^{(1)}_{(i)j}, X^{(2)}_{[i]j})\, I_S(X^{(1)}_{(i)j}, X^{(2)}_{[i]j}).$$

Using the kernel function $K_p(u_1,\dots,u_p) = \prod_{j=1}^{p} k_0(u_j)$ and the same bandwidth parameter for the estimation of $\hat f_n^{X;RSS}(x_1, x_2)$, $\hat f_n^{X^{(1)};RSS}(x_1)$ and $\hat f_n^{X^{(2)};RSS}(x_2)$, it is straightforward to show that $\hat I(X^{(1)}, X^{(2)}) \geq 0$. The corresponding estimator of $I^*(X^{(1)}, X^{(2)})$ is obtained by plugging $\hat I(X^{(1)}, X^{(2)})$ into (14).
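A sketch of (15) and (14) for a pooled bivariate RSS sample (ours, reusing `kde_rss` above with its placeholder kernel and one common bandwidth, and taking $I_S \equiv 1$):

    import numpy as np

    def mutual_info_rss(data2, gamma):
        # data2: pooled RSS sample of shape (n, 2)
        f12 = kde_rss(data2, data2, gamma)                # joint density
        f1 = kde_rss(data2[:, :1], data2[:, :1], gamma)   # marginal of X^(1)
        f2 = kde_rss(data2[:, 1:], data2[:, 1:], gamma)   # marginal of X^(2)
        I_hat = np.mean(np.log(f12 / (f1 * f2)))          # estimator (15)
        return I_hat, 1.0 - np.exp(-2.0 * I_hat)          # I* from (14)

For a bivariate normal sample with $\rho = 0.8$, say, the targets would be $I = -0.5\log(0.36) \approx 0.511$ and $I^* = 0.64$.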

Table 3: The suitable choices of $d_1$ in (9) for the estimator in (15) under the bivariate normal distribution, for $n = 30$, $p = 2$ and different values of $k$, $r$ and $\rho$.

Simulations show that the bandwidth parameter in (9) works well for the estimator in (15), with $p$ equal to the dimension of $X = (X^{(1)}, X^{(2)})$, for suitable values of the constant $d_1$. Table 3 presents the suitable choices of $d_1$ in (9) for the estimator in (15), under the bivariate normal distribution, for $n = 30$, $p = 2$ and different values of $k$, $r$ and $\rho = 0.5(0.1)0.9$. The value of $d_1$ corresponding to the minimum of the simulated MSE of the estimator in (15), over the grid of values $d_1 = 0.5(0.05)2.00$, is chosen as the optimal value of $d_1$.

3.2 Application to estimation of the Kullback-Leibler divergence

Using the proposed estimator of the entropy, the Kullback-Leibler divergence between $f_{X^{(1)}}$ and $f_{X^{(2)}}$, defined as

$$KL(f_{X^{(1)}}, f_{X^{(2)}}) = \int f_{X^{(1)}} \log\left(\frac{f_{X^{(1)}}}{f_{X^{(2)}}}\right) d\mu, \qquad (16)$$

can be estimated as

$$\widehat{KL}(f_{X^{(1)}}, f_{X^{(2)}}) = \int \log\left(\frac{\hat f_n^{X^{(1)};RSS}(x)}{\hat f_n^{X^{(2)};RSS}(x)}\right) dF_n^{X^{(1)};RSS}(x) = \frac{1}{mk}\sum_{i=1}^{k}\sum_{j=1}^{m} \log\frac{\hat f_n^{X^{(1)};RSS}(X^{(1)}_{[i]j})}{\hat f_n^{X^{(2)};RSS}(X^{(1)}_{[i]j})}. \qquad (17)$$

Statistical properties of the estimator in (17), as well as its application to homogeneity testing, remain open problems.

4 Numerical studies

In this section, several numerical studies are conducted to examine the performance of the proposed estimators and to illustrate the theoretical results.
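The estimator (17) is a one-liner on top of the same machinery (again ours, with the placeholder kernel):

    import numpy as np

    def kl_rss(sample1, sample2, gamma):
        # (17): average, over the first sample, of the log ratio of the two
        # univariate kernel density estimates evaluated at those points
        f1 = kde_rss(sample1, sample1, gamma)  # estimate of f_{X^(1)} at sample1
        f2 = kde_rss(sample1, sample2, gamma)  # estimate of f_{X^(2)} at sample1
        return np.mean(np.log(f1 / f2))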

A simulation study under the bivariate normal distribution is conducted to examine the performance of the estimators in (8) and (10), and a data analysis is performed to examine the performance of the estimators on a real data example. Finally, the proposed theoretical results are applied to the problem of variable selection based on the estimator of the mutual information as a correlation measure.

4.1 Simulation study

In order to compare the performance of $CV_\gamma$ in (8) and $\hat M_\gamma$ in (10), two separate simulation studies were conducted, each with 10,000 iterations. Since computation of the multiple integrals in $\alpha_i$ and $\beta_i$, $i = 1, 2$, is very time-consuming, we considered the simple case $p = 2$, $p_1 = p_2 = 1$, and estimation of $H(X^{(1)})$, under the bivariate normal distribution with correlation parameter $\rho$, using the kernel in (7) and the bandwidth in (9). The variable $X^{(2)}$ is taken as the ranker variable, and the computations are performed for the RSS and DRSS schemes.

In the first simulation study, the values of $\mathrm{MSE}(H_n^{RSS}(X^{(1)}))$ were computed; they are given in Table 4, for different values of $n$, $k$ and $\rho$, for the RSS and DRSS schemes. In the second simulation study, the variance and expectation of $CV_\gamma$ and $\hat M_\gamma$ were computed; they are reported alongside each simulated $\mathrm{MSE}(H_n^{RSS}(X^{(1)}))$ in Table 4. As one can see from Table 4, the estimator $\hat M_\gamma$ is less biased than $CV_\gamma$, except for the case $k = 5$ and $m = 3$. As expected, the variance of $CV_\gamma$ is less than that of $\hat M_\gamma$. In the simulation study, it is found that the suitable values of $d_1$ for $H_n^{RSS}(X^{(1)})$ are almost the same as those for $CV_\gamma$ and $\hat M_\gamma$.

4.2 Data analysis

We apply the results to a data set first analyzed by Platt et al. (1988). The original data set consists of measurements made on 399 longleaf pine (Pinus palustris) trees. We use a truncated version of this data set, given in Chen et al. (2004), which contains the diameter at breast height, in centimeters ($X^{(1)}$), and the height, in feet ($X^{(2)}$), of 396 trees. To examine the distributional properties of the proposed estimators for this real data set, we extract samples of size $n = mk = 30$, with $k = 3$ and $m = 10$, using RSS, as well as samples of the same size using DRSS and SRS, from this data set, with the diameter of the trees as the ranking variable. Since the population size is finite, the sampling is with replacement. Also, since the population does not have a density, so that the differential entropy is not defined, the true entropy of the population is estimated by

$$H_N(X) = -\frac{1}{N}\sum_{i=1}^{N} \log f_N(X_i)\, I_S(X_i),$$

where $f_N(x) = \frac{1}{N\gamma}\sum_{i=1}^{N} k_0\left(\frac{x - X_i}{\gamma}\right)$, and similarly for $I(X^{(1)}, X^{(2)})$.
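The finite-population benchmark $H_N$ is the same plug-in computed over the whole data set; a sketch (ours, with a Gaussian placeholder $k_0$ and $I_S \equiv 1$):

    import numpy as np

    def entropy_population(pop, gamma):
        # pop: all N population values, shape (N,); returns H_N of the text
        u = (pop[:, None] - pop[None, :]) / gamma
        f_N = (np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)).mean(axis=1) / gamma
        return -np.mean(np.log(f_N))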

Table 4: Simulated MSEs, together with the simulated variance and expectation of the estimated MSEs, under the bivariate normal distribution, for different values of $\rho$, $n$ and $k$. For each of the RSS and DRSS schemes, the columns report the simulated MSE, $\mathrm{Var}(\hat M_\gamma)$, $E(\hat M_\gamma)$, $\mathrm{Var}(CV_\gamma)$ and $E(CV_\gamma)$.

Table 5: The simulated biases and MSEs of the entropy and mutual information estimators for the tree data: $H_n^{RSS}(X^{(1)})$, $H_n^{DRSS}(X^{(1)})$ and $H_n^{SRS}(X^{(1)})$; $\hat I^{RSS}(X^{(1)}, X^{(2)})$, $\hat I^{DRSS}(X^{(1)}, X^{(2)})$ and $\hat I^{SRS}(X^{(1)}, X^{(2)})$; $\hat I^{*RSS}(X^{(1)}, X^{(2)})$, $\hat I^{*DRSS}(X^{(1)}, X^{(2)})$ and $\hat I^{*SRS}(X^{(1)}, X^{(2)})$.

Using the optimal bandwidth and kernel function for a simple random sample of size $N = 396$, the full-population estimates of the true parameters are $H_N(X^{(1)}) = 4.921$, $I_N(X^{(1)}, X^{(2)}) = \ldots$ and $I_N^*(X^{(1)}, X^{(2)}) = \ldots$. The simulated bias and MSE of the estimators were computed; they are given in Table 5. As can be observed from Table 5, the estimators of $I_N(X^{(1)}, X^{(2)})$ and $I_N^*(X^{(1)}, X^{(2)})$ based on DRSS have the smallest MSEs and biases. Also, the estimator of $H_N(X^{(1)})$ has the smallest MSE, but is more biased, under the RSS and DRSS schemes.

4.3 Application to a real data set

In this section, we apply the proposed theoretical results to the problem of variable selection in regression estimation based on ranked set sampling (Yu and Lam, 1997). We consider a real data set consisting of measurements of different body fat percentage evaluations and various body circumference measurements for $N = 252$ men; this data set is taken from Penrose et al. (1985). Consider the following variables:

Y: percent of body fat using Brozek's equation;
X^(1): abdomen circumference;
X^(2): weight;
X^(3): chest circumference.

Suppose that the regression estimator of the mean percent of body fat using Brozek's equation is of interest. Three auxiliary variables $X^{(1)}$, $X^{(2)}$ and $X^{(3)}$ are considered,

and suppose that one wishes to choose the two auxiliary variables most correlated with the variable of interest, $Y$. The classical correlation criteria fail to measure the correlation between three variables; thus, the mutual information criterion is used as the variable selection tool. A double ranked set sample with $k = 3$ and $m = 10$ is extracted from this data set, with $X^{(1)}$ as the ranking variable. This sample is given in Table 6. Based on this sample, the determinant of the correlation matrix of the sample is obtained as $|R| = \ldots$. The simulation results under the multivariate normal distribution with variance-covariance matrix $\Sigma$ satisfying $|\Sigma| = 0.01$ suggest using the bandwidth in (9) with $d_1 = \ldots$. The resulting estimators are

$$H_n(Y) = 3.599,$$
$$H_n(X^{(1)}, X^{(2)}, Y) = \ldots, \quad H_n(X^{(1)}, X^{(2)}) = 7.807,$$
$$H_n(X^{(1)}, X^{(3)}, Y) = \ldots, \quad H_n(X^{(1)}, X^{(3)}) = 6.894,$$
$$H_n(X^{(2)}, X^{(3)}, Y) = \ldots, \quad H_n(X^{(2)}, X^{(3)}) = 7.679,$$

and

$$\hat I((X^{(1)}, X^{(2)}), Y) = 0.180, \quad \hat I^*((X^{(1)}, X^{(2)}), Y) = 0.302,$$
$$\hat I((X^{(1)}, X^{(3)}), Y) = 0.029, \quad \hat I^*((X^{(1)}, X^{(3)}), Y) = 0.057,$$
$$\hat I((X^{(2)}, X^{(3)}), Y) = 0.160, \quad \hat I^*((X^{(2)}, X^{(3)}), Y) = 0.274.$$

Since

$$\hat I((X^{(1)}, X^{(2)}), Y) > \hat I((X^{(2)}, X^{(3)}), Y) > \hat I((X^{(1)}, X^{(3)}), Y),$$

the variables $X^{(1)}$ and $X^{(2)}$ might be chosen out of the three auxiliary variables, based on the sample given in Table 6.
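The selection step amounts to scoring every pair of auxiliary variables by its estimated mutual information with $Y$; a sketch (ours, built on `entropy_rss` above, with the hypothetical names `mi_with_y` and `select_pair`):

    import numpy as np
    from itertools import combinations

    def mi_with_y(block, y, gamma):
        # I((X^(a), X^(b)), Y) = H(X^(a), X^(b)) + H(Y) - H(X^(a), X^(b), Y)
        joint = np.column_stack([block, y])
        return (entropy_rss(block, gamma) + entropy_rss(y[:, None], gamma)
                - entropy_rss(joint, gamma))

    def select_pair(X, y, gamma):
        # return the pair of column indices of X with the largest estimated
        # mutual information with y, together with all the scores
        pairs = list(combinations(range(X.shape[1]), 2))
        scores = [mi_with_y(X[:, list(pr)], y, gamma) for pr in pairs]
        return pairs[int(np.argmax(scores))], scores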

Acknowledgements

The first author's research is supported in part by a grant from the Institute for Research in Fundamental Sciences (IPM), Tehran, Iran.

Table 6: Double ranked set sample extracted from the body fat data set: the values of $Y_{[i]j}$, $X^{(1)}_{[i]j}$, $X^{(2)}_{[i]j}$ and $X^{(3)}_{[i]j}$, for $i = 1,\dots,3$ and $j = 1,\dots,10$.

Figure 1: Scatterplots for pairs of the variables Y, X^(1), X^(2) and X^(3) for the body fat data set.

Appendix

Proof of Lemma 1. The proof of part (i) is trivial. For the consistent ranked set sample, we have

$$E\int\int a(x,y)\, dU_n(x)\, dU_n(y) = nE\int\int a(x,y)\, dF_n^{RSS}(x)\, dF_n^{RSS}(y) - n\int\int a(x,y)\, dF_X(x)\, dF_X(y)$$
$$= n^{-1} E\left\{\sum_{i=1}^{k}\sum_{j=1}^{m} a(X_{[i]j}, X_{[i]j}) + \sum_{(i_1,j_1)\neq(i_2,j_2)} a(X_{[i_1]j_1}, X_{[i_2]j_2})\right\} - n\int\int a(x,y)\, dF_X(x)\, dF_X(y)$$
$$= \int a(x,x)\, dF_X(x) + n\int\int a(x,y)\, dF_X(x)\, dF_X(y) - \frac{1}{k}\sum_{i=1}^{k}\int\int a(x,y)\, f_{X_{[i]}}(x)\, f_{X_{[i]}}(y)\, dx\, dy - n\int\int a(x,y)\, dF_X(x)\, dF_X(y)$$
$$= \int a(x,x)\, dF_X(x) - \frac{1}{k}\sum_{i=1}^{k}\int\int a(x,y)\, f_{X_{[i]}}(x)\, f_{X_{[i]}}(y)\, dx\, dy,$$

where the first equality uses part (i), and the third follows because units indexed by distinct pairs $(i,j)$ are mutually independent and, by condition (C1), $\frac{1}{k}\sum_{i=1}^{k} f_{X_{[i]}} = f_X$.

Proof of Theorem 1. By a Taylor expansion of the integrands, and using the fact that

$$\hat f_n^{RSS}(x) - \bar f_\gamma(x) = n^{-1/2}\int \frac{1}{\gamma^{p}} K_p\left(\frac{y-x}{\gamma}\right) dU_n(y),$$

we can write

$$H_n^{RSS}(X) = -\int_S \log(\hat f_n^{RSS}(x))\, dF_n^{RSS}(x)$$
$$= -\int_S \log(\bar f_\gamma(x))\, dF_X(x) - n^{-1/2}\int_S \log(\bar f_\gamma(x))\, dU_n(x) - \int_S \big(\log(\hat f_n^{RSS}(x)) - \log(\bar f_\gamma(x))\big)\, dF_X(x) - n^{-1/2}\int_S \big(\log(\hat f_n^{RSS}(x)) - \log(\bar f_\gamma(x))\big)\, dU_n(x)$$
$$= H_\gamma - n^{-1/2}\int_S \log(\bar f_\gamma(x))\, dU_n(x) - \int_S \frac{\hat f_n^{RSS}(x) - \bar f_\gamma(x)}{\bar f_\gamma(x)}\, dF_X(x) + \int_S \frac{(\hat f_n^{RSS}(x) - \bar f_\gamma(x))^2}{2\,\bar f_\gamma(x)^2}\, dF_X(x) - n^{-1/2}\int_S \frac{\hat f_n^{RSS}(x) - \bar f_\gamma(x)}{\bar f_\gamma(x)}\, dU_n(x) + O(n^{-3/2}).$$

Thus

$$H_n^{RSS}(X) = H_\gamma - n^{-1/2}\left[\int_S \int \frac{\frac{1}{\gamma^{p}} K_p\left(\frac{y-x}{\gamma}\right)}{\bar f_\gamma(x)}\, dU_n(y)\, dF_X(x) + \int_S \log(\bar f_\gamma(x))\, dU_n(x)\right]$$
$$\phantom{H_n^{RSS}(X) =} - n^{-1}\left[\int\int_S \frac{\frac{1}{\gamma^{p}} K_p\left(\frac{y-x}{\gamma}\right)}{\bar f_\gamma(y)}\, dU_n(y)\, dU_n(x) - \int\int\left(\int_S \frac{\frac{1}{\gamma^{2p}} K_p\left(\frac{y-x}{\gamma}\right) K_p\left(\frac{z-x}{\gamma}\right)}{2\,\bar f_\gamma(x)^2}\, dF_X(x)\right) dU_n(y)\, dU_n(z)\right] + O(n^{-3/2}).$$

By applying Lemma 1, the required result follows.

References

1. Al-Saleh, M.F. and Al-Kadiri, M. (2000). Double ranked set sampling. Statistics & Probability Letters, 48.

2. Al-Saleh, M.F. and Al-Omari, A.I. (2002). Multistage ranked set sampling. Journal of Statistical Planning and Inference, 102.

3. Chen, Z. (1999). Density estimation using ranked set sampling data. Environmental and Ecological Statistics, 6.

4. Chen, Z., Bai, Z. and Sinha, B. (2004). Ranked Set Sampling: Theory and Applications. Lecture Notes in Statistics. Springer, New York.

5. Chen, H., Stasny, E.A. and Wolfe, D.A. (2005). Ranked set sampling for efficient estimation of a population proportion. Statistics in Medicine, 24.

6. Cover, T.M. and Thomas, J.A. (2012). Elements of Information Theory. 2nd ed., Wiley, New York.

7. Dell, T.R. and Clutter, J.L. (1972). Ranked set sampling theory with order statistics background. Biometrics, 28.

8. Kvam, P.H. (2003). Ranked set sampling based on binary water quality data with covariates. Journal of Agricultural, Biological, and Environmental Statistics, 8.

9. Halls, L.K. and Dell, T.R. (1966). Trial of ranked-set sampling for forage yields. Forest Science, 12.

10. Deshpande, J.V., Frey, J. and Ozturk, O. (2006). Nonparametric ranked-set sampling confidence intervals for quantiles of a finite population. Environmental and Ecological Statistics, 13.

11. Joe, H. (1989). Estimation of entropy and other functionals of a multivariate density. Annals of the Institute of Statistical Mathematics, 41.

12. Ozturk, O. (2011). Sampling from partially rank-ordered sets. Environmental and Ecological Statistics, 18.

13. Penrose, K.W., Nelson, A.G. and Fisher, A.G. (1985). Generalized body composition prediction equation for men using simple measurement techniques. Medicine and Science in Sports and Exercise, 17.

14. Platt, W.J., Evans, G.W. and Rathbun, S.L. (1988). The population dynamics of a long-lived conifer (Pinus palustris). American Naturalist, 131.

15. McIntyre, G.A. (1952). A method of unbiased selective sampling using ranked sets. Australian Journal of Agricultural Research, 3.

16. Shannon, C.E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27.

17. Stokes, S.L. (1980). Estimation of variance using judgement ordered ranked set samples. Biometrics, 36.

18. Stokes, S.L. and Sager, T.W. (1988). Characterization of a ranked-set sample with application to estimating distribution functions. Journal of the American Statistical Association, 83.

19. Yu, P.L.H. and Lam, K. (1997). Regression estimator in ranked set sampling. Biometrics, 53(3).


MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Review of Basic Probability The fundamentals, random variables, probability distributions Probability mass/density functions

More information

Recall that if X 1,...,X n are random variables with finite expectations, then. The X i can be continuous or discrete or of any other type.

Recall that if X 1,...,X n are random variables with finite expectations, then. The X i can be continuous or discrete or of any other type. Expectations of Sums of Random Variables STAT/MTHE 353: 4 - More on Expectations and Variances T. Linder Queen s University Winter 017 Recall that if X 1,...,X n are random variables with finite expectations,

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random

More information

Model Assisted Survey Sampling

Model Assisted Survey Sampling Carl-Erik Sarndal Jan Wretman Bengt Swensson Model Assisted Survey Sampling Springer Preface v PARTI Principles of Estimation for Finite Populations and Important Sampling Designs CHAPTER 1 Survey Sampling

More information

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015 Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

ON SMALL SAMPLE PROPERTIES OF PERMUTATION TESTS: INDEPENDENCE BETWEEN TWO SAMPLES

ON SMALL SAMPLE PROPERTIES OF PERMUTATION TESTS: INDEPENDENCE BETWEEN TWO SAMPLES ON SMALL SAMPLE PROPERTIES OF PERMUTATION TESTS: INDEPENDENCE BETWEEN TWO SAMPLES Hisashi Tanizaki Graduate School of Economics, Kobe University, Kobe 657-8501, Japan e-mail: tanizaki@kobe-u.ac.jp Abstract:

More information

A Bootstrap Test for Conditional Symmetry

A Bootstrap Test for Conditional Symmetry ANNALS OF ECONOMICS AND FINANCE 6, 51 61 005) A Bootstrap Test for Conditional Symmetry Liangjun Su Guanghua School of Management, Peking University E-mail: lsu@gsm.pku.edu.cn and Sainan Jin Guanghua School

More information

Mutual Information and Optimal Data Coding

Mutual Information and Optimal Data Coding Mutual Information and Optimal Data Coding May 9 th 2012 Jules de Tibeiro Université de Moncton à Shippagan Bernard Colin François Dubeau Hussein Khreibani Université de Sherbooe Abstract Introduction

More information

ECE534, Spring 2018: Solutions for Problem Set #3

ECE534, Spring 2018: Solutions for Problem Set #3 ECE534, Spring 08: Solutions for Problem Set #3 Jointly Gaussian Random Variables and MMSE Estimation Suppose that X, Y are jointly Gaussian random variables with µ X = µ Y = 0 and σ X = σ Y = Let their

More information

The Delta Method and Applications

The Delta Method and Applications Chapter 5 The Delta Method and Applications 5.1 Local linear approximations Suppose that a particular random sequence converges in distribution to a particular constant. The idea of using a first-order

More information

HANDBOOK OF APPLICABLE MATHEMATICS

HANDBOOK OF APPLICABLE MATHEMATICS HANDBOOK OF APPLICABLE MATHEMATICS Chief Editor: Walter Ledermann Volume VI: Statistics PART A Edited by Emlyn Lloyd University of Lancaster A Wiley-Interscience Publication JOHN WILEY & SONS Chichester

More information

Algorithms for Uncertainty Quantification

Algorithms for Uncertainty Quantification Algorithms for Uncertainty Quantification Tobias Neckel, Ionuț-Gabriel Farcaș Lehrstuhl Informatik V Summer Semester 2017 Lecture 2: Repetition of probability theory and statistics Example: coin flip Example

More information

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R In probabilistic models, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. As a function or a map, it maps from an element (or an outcome) of a sample

More information

Multivariate random variables

Multivariate random variables DS-GA 002 Lecture notes 3 Fall 206 Introduction Multivariate random variables Probabilistic models usually include multiple uncertain numerical quantities. In this section we develop tools to characterize

More information

Random Eigenvalue Problems Revisited

Random Eigenvalue Problems Revisited Random Eigenvalue Problems Revisited S Adhikari Department of Aerospace Engineering, University of Bristol, Bristol, U.K. Email: S.Adhikari@bristol.ac.uk URL: http://www.aer.bris.ac.uk/contact/academic/adhikari/home.html

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

Probability Distributions and Estimation of Ali-Mikhail-Haq Copula

Probability Distributions and Estimation of Ali-Mikhail-Haq Copula Applied Mathematical Sciences, Vol. 4, 2010, no. 14, 657-666 Probability Distributions and Estimation of Ali-Mikhail-Haq Copula Pranesh Kumar Mathematics Department University of Northern British Columbia

More information

conditional cdf, conditional pdf, total probability theorem?

conditional cdf, conditional pdf, total probability theorem? 6 Multiple Random Variables 6.0 INTRODUCTION scalar vs. random variable cdf, pdf transformation of a random variable conditional cdf, conditional pdf, total probability theorem expectation of a random

More information

02 Background Minimum background on probability. Random process

02 Background Minimum background on probability. Random process 0 Background 0.03 Minimum background on probability Random processes Probability Conditional probability Bayes theorem Random variables Sampling and estimation Variance, covariance and correlation Probability

More information

Multivariate random variables

Multivariate random variables Multivariate random variables DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Joint distributions Tool to characterize several

More information

Estimation of information-theoretic quantities

Estimation of information-theoretic quantities Estimation of information-theoretic quantities Liam Paninski Gatsby Computational Neuroscience Unit University College London http://www.gatsby.ucl.ac.uk/ liam liam@gatsby.ucl.ac.uk November 16, 2004 Some

More information

GENERALIZED NONLINEARITY OF S-BOXES. Sugata Gangopadhyay

GENERALIZED NONLINEARITY OF S-BOXES. Sugata Gangopadhyay Volume X, No. 0X, 0xx, X XX doi:0.3934/amc.xx.xx.xx GENERALIZED NONLINEARITY OF -BOXE ugata Gangopadhyay Department of Computer cience and Engineering, Indian Institute of Technology Roorkee, Roorkee 47667,

More information

CSCI-6971 Lecture Notes: Monte Carlo integration

CSCI-6971 Lecture Notes: Monte Carlo integration CSCI-6971 Lecture otes: Monte Carlo integration Kristopher R. Beevers Department of Computer Science Rensselaer Polytechnic Institute beevek@cs.rpi.edu February 21, 2006 1 Overview Consider the following

More information