Chapter 15
Confidence Intervals for Mean Difference Between Two Delta-Distributions

Karen V. Rosales and Joshua D. Naranjo

K.V. Rosales: MMS Holdings, Inc., Canton, MI, USA
J.D. Naranjo: Department of Statistics, Western Michigan University, Kalamazoo, MI 49008, USA

In: R.Y. Liu, J.W. McKean (eds.), Robust Rank-Based and Nonparametric Methods, Springer Proceedings in Mathematics & Statistics 168. Springer International Publishing Switzerland 2016.

Abstract Traditional two-sample estimation procedures like pooled-t, Welch's t, and the Wilcoxon-Hodges-Lehmann procedure are often used for skewed data and data inflated with zero values. We investigate how well these work compared to dedicated procedures that consider the specialized nature of the data.

Keywords Two-sample estimation · Confidence intervals · Skewed distribution · Zero-inflated data · Delta distribution

15.1 Introduction

Some data are inherently nonnegative and contain a large number of zeros. Aitchison (1955) first described a distribution that contains both zero and positive values in an application to household expenditures. Some households spend nothing on, say, children's clothing, while others allocate high amounts, so that the positive values are skewed and approximately follow the lognormal curve. In marine surveys, data are frequently inflated with zeros. Pennington (1983) examined a series of ichthyoplankton surveys aimed at estimating the total egg production of Atlantic mackerel in the study region. When zeros are mixed with lognormal positive values, the resulting distribution is referred to as a delta distribution (Aitchison 1955).

One-sample confidence intervals for the mean of a delta distribution were investigated by Owen and DeRouen (1980), Pennington (1983), Zhou and Tu (2000a), Fletcher (2008), and Rosales (2009). Zhou and Tu (2000a) explored different methods of constructing confidence intervals for the mean of a delta distribution, including a bootstrap and two likelihood-based intervals. Fletcher (2008) investigated a profile-likelihood approach. Zhou and Tu (2000b) proposed a maximum likelihood-based method and a bootstrap method for constructing confidence intervals for the ratio in means of medical costs data that contained both lognormal and zero observations.

It remains unclear how well various two-sample confidence intervals work. For example, can we simply ignore the delta distribution structure of the data and use traditional least-squares (LS) methods for estimating the difference between means? Will more robust versions work better? In this paper, we focus on commonly used two-sample confidence intervals and compare them to confidence intervals specifically derived under delta-distribution theory. We investigate how relative performance depends on sample size, proportion of zeros, the population means, and the population variances.

In Sect. 15.2, we set up notation and terminology. In Sect. 15.3, we describe the confidence intervals included in the simulation study. In Sect. 15.4, we discuss results of a simulation study.

15.2 Notation and Terminology

Consider a population in which a proportion $\delta$ of the observations are zeros, and the nonzero values follow a lognormal distribution with parameters $\mu$ and $\sigma^2$. The population is said to have a delta distribution, denoted $\Delta(\delta, \mu, \sigma^2)$. We index the populations of interest by $j = 1, 2$. Thus the $j$th population has distribution $\Delta(\delta_j, \mu_j, \sigma_j^2)$, with mean $\kappa_j$ and variance $\rho_j^2$. The population mean and variance of the $j$th population are

$$\kappa_j = E[Y_j] = (1 - \delta_j)\, e^{\mu_j + \sigma_j^2/2} \tag{15.1}$$

$$\rho_j^2 = \mathrm{Var}[Y_j] = (1 - \delta_j)\, e^{2\mu_j + \sigma_j^2} \left( e^{\sigma_j^2} - (1 - \delta_j) \right) \tag{15.2}$$

Let $y_{1j}, \ldots, y_{n_j j}$ be a random sample from the $j$th population. Assume, without loss of generality, that the $n_{j1}$ nonzero observations are listed first and the $n_{j0} = n_j - n_{j1}$ zero observations are listed last.
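As a quick numerical check of Eqs. (15.1) and (15.2), the following is a minimal Python sketch, not taken from the chapter. The function names `delta_mean`, `delta_variance`, and `sample_delta` are hypothetical, and the third argument is assumed to be the lognormal variance parameter $\sigma^2$ (not the standard deviation).

```python
import math
import random


def delta_mean(delta, mu, sigma2):
    """Mean of Delta(delta, mu, sigma2), following Eq. (15.1)."""
    return (1.0 - delta) * math.exp(mu + sigma2 / 2.0)


def delta_variance(delta, mu, sigma2):
    """Variance of Delta(delta, mu, sigma2), following Eq. (15.2)."""
    return ((1.0 - delta) * math.exp(2.0 * mu + sigma2)
            * (math.exp(sigma2) - (1.0 - delta)))


def sample_delta(n, delta, mu, sigma2, rng):
    """Draw n values: zero with probability delta, otherwise lognormal.

    rng is a random.Random instance (an assumption of this sketch).
    """
    sd = math.sqrt(sigma2)
    return [0.0 if rng.random() < delta else rng.lognormvariate(mu, sd)
            for _ in range(n)]


# Difference in means for Delta(0.1, 0.5, 1) and Delta(0.5, 0.5, 1)
diff = delta_mean(0.1, 0.5, 1.0) - delta_mean(0.5, 0.5, 1.0)
```

For the pair $\Delta(0.1, 0.5, 1)$ and $\Delta(0.5, 0.5, 1)$ used later in the simulation study, `diff` evaluates to $0.4e \approx 1.0873$.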
For the nonzero observations, let $x_{ij} = \log y_{ij}$ and

$$\hat{\delta}_j = \frac{n_{j0}}{n_j} \tag{15.3}$$

$$\hat{\mu}_j = \frac{\sum_{i=1}^{n_{j1}} \log y_{ij}}{n_{j1}} = \frac{\sum_{i=1}^{n_{j1}} x_{ij}}{n_{j1}} = \bar{x}_j \tag{15.4}$$

$$s_j^2 = \frac{\sum_{i=1}^{n_{j1}} (\log y_{ij} - \hat{\mu}_j)^2}{n_{j1} - 1} = \frac{\sum_{i=1}^{n_{j1}} (x_{ij} - \bar{x}_j)^2}{n_{j1} - 1} \tag{15.5}$$

Note that $\hat{\mu}_j$ and $s_j^2$ are simply the sample mean and variance of the log-transformed nonzero observations from the $j$th sample. The proportion of nonzero observations in the $j$th sample is $1 - \hat{\delta}_j$.

Finney (1941) derived minimum-variance unbiased estimators for the lognormal mean and variance. Extending his results, Aitchison (1955) showed that the following is a minimum variance unbiased estimator of the mean of the $\Delta$-distribution:

$$\hat{\kappa}_j = \begin{cases} \dfrac{n_{j1}}{n_j}\, e^{\hat{\mu}_j}\, G_{n_{j1}}\!\left(\dfrac{s_j^2}{2}\right) & \text{if } n_{j1} > 1 \\[2ex] \dfrac{y_{1j}}{n_j} & \text{if } n_{j1} = 1 \\[2ex] 0 & \text{if } n_{j1} = 0 \end{cases} \tag{15.6}$$

where $y_{1j}$ is the single nonzero observation when $n_{j1} = 1$, and $G_{n_{j1}}(t)$ is a Bessel function defined as

$$G_{n_{j1}}(t) = 1 + \frac{n_{j1} - 1}{n_{j1}}\, t + \sum_{i=2}^{\infty} \frac{(n_{j1} - 1)^{2i-1}\, t^i}{n_{j1}^{\,i}\, (n_{j1} + 1)(n_{j1} + 3) \cdots (n_{j1} + 2i - 3)\, i!}$$

An estimate of the asymptotic variance is given by Aitchison and Brown (1969):

$$\widehat{\mathrm{Var}}_1(\hat{\kappa}_j) = \frac{e^{2\hat{\mu}_j + s_j^2}}{n_j} \left[ \hat{\delta}_j (1 - \hat{\delta}_j) + \frac{(1 - \hat{\delta}_j)(2 s_j^2 + s_j^4)}{2} \right] \tag{15.7}$$

Owen and DeRouen (1980) suggested confidence interval estimates based on these estimates of mean and variance. Pennington (1983) proposed an interval estimate using an alternative estimate of the variance, as follows:

$$\widehat{\mathrm{Var}}_{\mathrm{pen}}(\hat{\kappa}_j) = \begin{cases} \dfrac{n_{j1}}{n_j}\, e^{2\hat{\mu}_j} \left\{ \dfrac{n_{j1}}{n_j}\, G_{n_{j1}}^2\!\left(\dfrac{s_j^2}{2}\right) - \dfrac{n_{j1} - 1}{n_j - 1}\, G_{n_{j1}}\!\left(\dfrac{n_{j1} - 2}{n_{j1} - 1}\, s_j^2\right) \right\} & \text{if } n_{j1} > 1 \\[2ex] \left(\dfrac{y_{1j}}{n_j}\right)^2 & \text{if } n_{j1} = 1 \\[2ex] 0 & \text{if } n_{j1} = 0 \end{cases} \tag{15.8}$$

15.3 Two-Sample Confidence Intervals

We are interested in confidence interval estimates for the difference between means $\kappa_1 - \kappa_2$ of two delta distributions. We first consider traditional least-squares confidence intervals based on Student's t-distribution, using either the pooled-sd version or the unpooled-sd Welch-Satterthwaite version. The pooled-t $100(1-\alpha)\%$ confidence interval is given by

$$\left[ (\bar{y}_1 - \bar{y}_2) - t_{\alpha/2,\,df}\, S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}},\;\; (\bar{y}_1 - \bar{y}_2) + t_{\alpha/2,\,df}\, S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \right] \tag{15.9}$$
where $\bar{y}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ij}$ is the sample mean for the $j$th sample, $t_{\alpha/2,\,df}$ is the upper $\alpha/2$ percentile of the t-distribution with $df = n_1 + n_2 - 2$ degrees of freedom, $n_j$ is the sample size, and $S_p$ is the pooled standard deviation. We refer to this method as Pooled-t in the simulation study.

A $100(1-\alpha)\%$ confidence interval based on Welch's statistic is

$$\left[ (\bar{y}_1 - \bar{y}_2) - t_{\alpha/2,\,\nu} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\;\; (\bar{y}_1 - \bar{y}_2) + t_{\alpha/2,\,\nu} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \right] \tag{15.10}$$

where, in this interval only, $s_j^2$ denotes the sample variance of the untransformed observations in the $j$th sample. The degrees of freedom associated with this variance estimate are approximated using the Welch-Satterthwaite equation

$$\nu = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}{\dfrac{s_1^4}{n_1^2 (n_1 - 1)} + \dfrac{s_2^4}{n_2^2 (n_2 - 1)}}$$

This method will be denoted as Welch-t in the simulation study.

Since the lognormal is right skewed, more robust alternatives might work better than the t-based methods. A rank-based alternative is the confidence interval based on the Wilcoxon rank sum test. See, for example, Hollander et al. (2014). The Wilcoxon interval may be computed as follows. Form all $n_1 n_2$ pairwise differences $y_{h1} - y_{i2}$ between the first group and the second group. Let $\hat{\theta}_{(1)}, \hat{\theta}_{(2)}, \ldots, \hat{\theta}_{(n_1 n_2)}$ denote these ordered differences. The Hodges-Lehmann point estimator of $\kappa_1 - \kappa_2$ is the median of these differences. A $100(1-\alpha)\%$ confidence interval is given by

$$\left( \hat{\theta}_{(C_\alpha)},\; \hat{\theta}_{(n_1 n_2 + 1 - C_\alpha)} \right) \tag{15.11}$$

where $C_\alpha = \dfrac{n_1 (2 n_2 + n_1 + 1)}{2} + 1 - w_{\alpha/2}$, and $w_{\alpha/2}$ is an appropriate percentile of the rank sum distribution. For large samples, a normal approximation of $C_\alpha$ is given by

$$C_\alpha = \frac{n_1 n_2}{2} - z_{\alpha/2} \left[ \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} \right]^{1/2}$$

This method is denoted as Wilcoxon in the simulation study.

Both versions of the t-interval and the Wilcoxon interval ignore the zero-inflated nature of the data. One may construct a confidence interval based on Aitchison's minimum variance unbiased estimator $\hat{\kappa}$ and Pennington's estimator of the variance of $\hat{\kappa}$. A $100(1-\alpha)\%$ confidence interval for $(\kappa_1 - \kappa_2)$ is

$$(\hat{\kappa}_1 - \hat{\kappa}_2) \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}_{\mathrm{pen}}(\hat{\kappa}_1) + \widehat{\mathrm{Var}}_{\mathrm{pen}}(\hat{\kappa}_2)} \tag{15.12}$$

where $\hat{\kappa}$ and $\widehat{\mathrm{Var}}_{\mathrm{pen}}$ are given in Eqs. (15.6) and (15.8), respectively. This method will be referred to as MVUE1 in the simulation study.
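The MVUE1 construction can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`G`, `mvue_mean`, `pennington_var`, `mvue1_interval`) are hypothetical, the infinite series for $G_n(t)$ is truncated once terms become negligible, and the variance sum is clipped at zero before taking the square root as a defensive guard that the chapter does not discuss.

```python
import math
from statistics import NormalDist, mean, variance


def G(n, t, tol=1e-12, max_terms=500):
    """Series G_n(t) given below Eq. (15.6); the truncation rule is an assumption."""
    term = (n - 1) / n * t                      # the i = 1 term
    total = 1.0 + term
    for i in range(2, max_terms):
        # ratio of consecutive terms: (n-1)^2 t / (n (n + 2i - 3) i)
        term *= (n - 1) ** 2 * t / (n * (n + 2 * i - 3) * i)
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total


def _log_stats(y):
    """Return n, the nonzero values, and (mu_hat, s2) of their logs, Eqs. (15.4)-(15.5)."""
    pos = [v for v in y if v > 0]
    if len(pos) > 1:
        logs = [math.log(v) for v in pos]
        return len(y), pos, mean(logs), variance(logs)
    return len(y), pos, None, None


def mvue_mean(y):
    """Aitchison's estimator of the delta-distribution mean, Eq. (15.6)."""
    n, pos, mu, s2 = _log_stats(y)
    n1 = len(pos)
    if n1 == 0:
        return 0.0
    if n1 == 1:
        return pos[0] / n
    return n1 / n * math.exp(mu) * G(n1, s2 / 2)


def pennington_var(y):
    """Pennington's variance estimate for the estimator above, Eq. (15.8)."""
    n, pos, mu, s2 = _log_stats(y)
    n1 = len(pos)
    if n1 == 0:
        return 0.0
    if n1 == 1:
        return (pos[0] / n) ** 2
    inner = (n1 / n * G(n1, s2 / 2) ** 2
             - (n1 - 1) / (n - 1) * G(n1, (n1 - 2) / (n1 - 1) * s2))
    return n1 / n * math.exp(2 * mu) * inner


def mvue1_interval(y1, y2, level=0.95):
    """Normal-theory interval of Eq. (15.12) for the difference in means."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = mvue_mean(y1) - mvue_mean(y2)
    # Clipping at zero is an assumption of this sketch, not part of Eq. (15.12).
    var_sum = max(pennington_var(y1) + pennington_var(y2), 0.0)
    half = z * math.sqrt(var_sum)
    return diff - half, diff + half
```

Swapping the Aitchison and Brown estimate of Eq. (15.7) in for `pennington_var` would give the corresponding sketch of an interval based on that variance estimate.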
An alternative confidence interval can be constructed based on the variance estimate from Aitchison and Brown (1969). This $100(1-\alpha)\%$ confidence interval for $(\kappa_1 - \kappa_2)$ is

$$(\hat{\kappa}_1 - \hat{\kappa}_2) \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}_1(\hat{\kappa}_1) + \widehat{\mathrm{Var}}_1(\hat{\kappa}_2)} \tag{15.13}$$

where $\hat{\kappa}$ and $\widehat{\mathrm{Var}}_1$ are given in Eqs. (15.6) and (15.7), respectively. We refer to this method as MVUE2 for the rest of this chapter.

In addition to the above confidence intervals, we propose two additional robust confidence intervals. Since the sample mean and the sample variance lack robustness, Al-Khouli (1999) proposed to directly replace $\hat{\mu}$ and $s^2$ in (15.4) and (15.5) with robust M-estimators to obtain robust estimators of $\mu$ and $\sigma^2$. In his simulation, using $(T_H, S_b^2)$ in place of $(\hat{\mu}, s^2)$ seemed to work best, where $T_H$ is the one-step Huber M-estimator of location and $S_b^2$ is a bi-weight A-estimator of scale. Directly substituting $T_H$ and $S_b^2$ in place of $\hat{\mu}$ and $s^2$ in (15.6) and (15.8), we get a robust version of the MVUE1 interval (15.12). The confidence interval is

$$(\hat{\kappa}_{M1} - \hat{\kappa}_{M2}) \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}_M(\hat{\kappa}_{M1}) + \widehat{\mathrm{Var}}_M(\hat{\kappa}_{M2})} \tag{15.14}$$

where

$$\hat{\kappa}_{Mj} = \begin{cases} \dfrac{n_{j1}}{n_j}\, e^{T_{Hj}}\, G_{n_{j1}}\!\left(\dfrac{S_{bj}^2}{2}\right) & \text{if } n_{j1} > 1 \\[2ex] \dfrac{y_{1j}}{n_j} & \text{if } n_{j1} = 1 \\[2ex] 0 & \text{if } n_{j1} = 0 \end{cases}$$

and

$$\widehat{\mathrm{Var}}_M(\hat{\kappa}_{Mj}) = \begin{cases} \dfrac{n_{j1}}{n_j}\, e^{2 T_{Hj}} \left\{ \dfrac{n_{j1}}{n_j}\, G_{n_{j1}}^2\!\left(\dfrac{S_{bj}^2}{2}\right) - \dfrac{n_{j1} - 1}{n_j - 1}\, G_{n_{j1}}\!\left(\dfrac{n_{j1} - 2}{n_{j1} - 1}\, S_{bj}^2\right) \right\} & \text{if } n_{j1} > 1 \\[2ex] \left(\dfrac{y_{1j}}{n_j}\right)^2 & \text{if } n_{j1} = 1 \\[2ex] 0 & \text{if } n_{j1} = 0 \end{cases}$$

This method is referred to as RMVUE1 in the simulation study.

Similarly, a robust version of the MVUE2 confidence interval (15.13) replaces $\hat{\mu}$ and $s^2$ in Eqs. (15.6) and (15.7) with their robust versions. The confidence interval is

$$(\hat{\kappa}_{M1} - \hat{\kappa}_{M2}) \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}_1(\hat{\kappa}_{M1}) + \widehat{\mathrm{Var}}_1(\hat{\kappa}_{M2})} \tag{15.15}$$

where

$$\hat{\kappa}_{Mj} = \frac{n_{j1}}{n_j}\, e^{T_{Hj}}\, G_{n_{j1}}\!\left(\frac{S_{bj}^2}{2}\right)$$

and

$$\widehat{\mathrm{Var}}_1(\hat{\kappa}_{Mj}) = \frac{e^{2 T_{Hj} + S_{bj}^2}}{n_j} \left[ \hat{\delta}_j (1 - \hat{\delta}_j) + \frac{(1 - \hat{\delta}_j)(2 S_{bj}^2 + S_{bj}^4)}{2} \right]$$

We denote this method as RMVUE2 in the simulation study.

15.4 Simulation

To assess the general performance and robustness of the interval estimators (15.9)-(15.15), we conducted a simulation study under various parameter combinations of the $\Delta$-distribution. Performance of the different estimates is assessed using the following criteria:

Coverage Probability (CP): proportion of times that the 95% confidence interval contains the true value of $\kappa_1 - \kappa_2$.
Coverage Error (CE): absolute difference between the coverage probability and 95%.
Lower Error Rate (LER): proportion of times that the true value $\kappa_1 - \kappa_2$ falls below the interval.
Upper Error Rate (UER): proportion of times that the true value $\kappa_1 - \kappa_2$ falls above the interval.
Average Width (Width): average width of the 95% confidence interval.

Note that all confidence intervals have confidence level set at 95%. Ideally an estimation procedure will have CP = 0.95, CE = 0.0, LER = 0.025, and UER = 0.025. We also report the average width of each method. We evaluate performance at balanced sample sizes of 15 and 50. Ten thousand simulations are done for each combination of parameters and sample size.

Table 15.1 shows simulation results when the two delta distributions are the same. MVUE1 and RMVUE1 seem to do best, achieving narrower intervals without sacrificing coverage probability. Coverage probabilities all exceed 0.95, possibly because skewness inflates the standard error estimates. The naive t-based intervals seem competitive, with reasonable width and coverage probability. The Wilcoxon interval has the shortest width.

Table 15.2 shows simulation results when $\delta_1 \neq \delta_2$. Again, MVUE1 and RMVUE1 seem to do best, with narrower intervals without sacrificing coverage probability. The naive t-based intervals remain competitive, with reasonable width and coverage probability.
The Wilcoxon interval still has by far the shortest width but achieves this at the price of unacceptably low coverage probability, especially for larger differences in $\delta$.

Table 15.3 shows simulation results when $\mu_1 \neq \mu_2$. MVUE1 and RMVUE1 still seem to do best, with RMVUE1 edging out MVUE1 in coverage probability and width. MVUE2 and RMVUE2 attain better coverage probabilities at the cost of significantly wider intervals. The naive procedures pooled-t and Welch-t are surprisingly competitive, with reasonable width and coverage probability. The Wilcoxon interval has unacceptably low coverage probability, especially for larger differences in $\mu$.

Table 15.1 95% CI under equal distributions $\Delta_1(0.2, 0.5, 1)$ and $\Delta_2(0.2, 0.5, 1)$: $\kappa_1 - \kappa_2 = 0$
Columns: Method, Sample size, CP, CE, LER, UER, Width
Rows (repeated for each sample size): Pooled-t, Welch-t, Wilcoxon, MVUE1, MVUE2, RMVUE1, RMVUE2
[numeric entries not recoverable from the source]

Table 15.4 shows simulation results when $\sigma_1^2 \neq \sigma_2^2$. All intervals have problems maintaining close to 95% coverage probability, especially for larger differences in $\sigma^2$.

The simulations show two notable features of Wilcoxon confidence intervals: they tend to be shorter and have low coverage probability. Wilcoxon intervals are a function of the ordered pairwise differences between the two samples [see e.g. Hollander et al. (2014)]. If $(\delta_1, \delta_2)$ are both large, then many of the pairwise differences are 0 regardless of the values of the positive observations. This seems to reduce the length of the Wilcoxon interval more than the others.

Low coverage probability may be a result of the Wilcoxon interval estimating the wrong parameter. The Wilcoxon point estimator is the median of pairwise differences, which is naturally a better estimate of the true median of differences (i.e. the median of $F_{Y_1 - Y_2}$) than of the difference in means $\kappa_1 - \kappa_2$. For example, given two distributions $\Delta(0.1, 0.5, 1)$ and $\Delta(0.5, 0.5, 1)$, the difference in means is $\kappa_1 - \kappa_2 = 1.0873$ while the median of the difference is $m = 0.7988$. In Table 15.5, we reassess the performance of Wilcoxon by looking at the percentage of time it contains the median of differences $m$ instead of $\kappa_1 - \kappa_2$.
The Wilcoxon 95% interval coverage probabilities for $\kappa_1 - \kappa_2 = 1.0873$ are quite low at [value missing] and [value missing] for the two sample sizes, respectively, but the coverage probabilities for $m = 0.7988$ are [value missing] and [value missing], respectively, as found in the entry labeled Wilcoxon (for m). In fact, in all cases (see the rest of Table 15.5), as long as we measure the percentage of times that the Wilcoxon interval contains the appropriate parameter $m$ instead of $\kappa_1 - \kappa_2$, the Wilcoxon interval has the best coverage probability and narrowest width. Since the performance of MVUE2 and RMVUE2 trails that of MVUE1 and RMVUE1 in Tables 15.2, 15.3, and 15.4, they have been removed from Table 15.5 for space considerations.

Table 15.2 95% CI under varying proportion of zeros $\delta$
Columns: Method, Sample size, CP, CE, LER, UER, Width
Panel 1: $\Delta_1(0.2, 0.5, 1)$ and $\Delta_2(0.4, 0.5, 1)$: $\kappa_1 - \kappa_2 = 0.5437$
Panel 2: $\Delta_1(0.1, 0.5, 1)$ and $\Delta_2(0.5, 0.5, 1)$: $\kappa_1 - \kappa_2 = 1.0873$
Rows in each panel (repeated for each sample size): Pooled-t, Welch-t, Wilcoxon, MVUE1, MVUE2, RMVUE1, RMVUE2
[numeric entries not recoverable from the source]
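The effect of zeros on the Wilcoxon interval discussed above can be seen in a small sketch of Eq. (15.11) with the large-sample approximation of $C_\alpha$. This is an illustration under assumptions, not the authors' code: the function name `wilcoxon_interval` is hypothetical, and rounding $C_\alpha$ down to an integer (with a floor of 1) is a choice made here, since the chapter does not state a rounding rule.

```python
import math
from statistics import NormalDist, median


def wilcoxon_interval(y1, y2, level=0.95):
    """Hodges-Lehmann estimate and interval from ordered pairwise differences."""
    n1, n2 = len(y1), len(y2)
    diffs = sorted(a - b for a in y1 for b in y2)   # all n1 * n2 differences
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Normal approximation of C_alpha; rounding down is an assumption.
    c = n1 * n2 / 2 - z * math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    c = max(1, math.floor(c))
    lower = diffs[c - 1]            # order statistic number C_alpha
    upper = diffs[n1 * n2 - c]      # order statistic number n1*n2 + 1 - C_alpha
    return median(diffs), lower, upper


# Two hypothetical samples of size 15, each with 12 zeros
y1 = [0.0] * 12 + [1.0, 2.0, 3.0]
y2 = [0.0] * 12 + [0.5, 1.5, 2.5]
estimate, lower, upper = wilcoxon_interval(y1, y2)
```

In this made-up example, 144 of the 225 pairwise differences are exactly zero, so both endpoints and the point estimate land on 0 no matter how large the positive observations are, which is the mechanism the text describes.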
Table 15.3 95% CI under varying lognormal parameter $\mu$
Columns: Method, Sample size, CP, CE, LER, UER, Width
Panel 1: $\Delta_1(0.2, 0, 1)$ and $\Delta_2(0.2, 0.5, 1)$: $\kappa_1 - \kappa_2 = -0.8556$
Panel 2: $\Delta_1(0.2, 0, 1)$ and $\Delta_2(0.2, 0.9, 1)$: $\kappa_1 - \kappa_2 = -1.9252$
Rows in each panel (repeated for each sample size): Pooled-t, Welch-t, Wilcoxon, MVUE1, MVUE2, RMVUE1, RMVUE2
[numeric entries not recoverable from the source]

Table 15.4 95% CI under varying lognormal parameter $\sigma^2$
Columns: Method, Sample size, CP, CE, LER, UER, Width
Panel 1: $\Delta_1(0.2, 0.5, 0.15)$ and $\Delta_2(0.2, 0.5, 1.0)$: $\kappa_1 - \kappa_2 = -0.7529$
Panel 2: $\Delta_1(0.2, 0.5, 0.15)$ and $\Delta_2(0.2, 0.5, 2.0)$: $\kappa_1 - \kappa_2 = -2.1636$
Rows in each panel (repeated for each sample size): Pooled-t, Welch-t, Wilcoxon, MVUE1, MVUE2, RMVUE1, RMVUE2
[numeric entries not recoverable from the source]

15.5 Conclusion

Traditional two-sample estimation procedures like pooled-t and Welch's t, which assume normality, are often used for skewed data and data inflated with zero values. Our simulations show that these naive nonrobust approaches perform reasonably well compared to dedicated delta distribution procedures, in terms of coverage probabilities and interval width.

Among the dedicated approaches, we would recommend MVUE1 and its robust version RMVUE1. The MVUE1 procedure is based on the mean estimator $\hat{\kappa}$ by Aitchison (1955) and the variance estimator by Pennington (1983). RMVUE1 is similar to MVUE1 but uses M-estimates for the lognormal parameters $\mu$ and $\sigma^2$.

The Wilcoxon two-sample interval performed consistently badly, but only when it was asked to estimate the difference in means $\kappa_1 - \kappa_2$. When used to estimate the median of differences $m$, it performed very well in terms of coverage probability, and generally had the shortest interval width. Of course, the usefulness of the Wilcoxon interval will depend on whether the user wants to estimate the median of differences rather than the difference in means.
Table 15.5 95% CI under varying parameters and sample size
Columns: Method, Sample size, CP, CE, LER, UER, Width
Panel 1, varying $\delta$: $\Delta_1(0.1, 0.5, 1.0)$ and $\Delta_2(0.5, 0.5, 1.0)$; $\kappa_1 - \kappa_2 = 1.0873$, $m = 0.7988$
Panel 2, varying $\mu$: $\Delta_1(0.2, 0, 1)$ and $\Delta_2(0.2, 0.9, 1)$; $\kappa_1 - \kappa_2 = -1.9252$, $m$ = [value missing]
Panel 3, varying $\sigma^2$: $\Delta_1(0.2, 0.5, 0.15)$ and $\Delta_2(0.2, 0.5, 2.0)$; $\kappa_1 - \kappa_2 = -2.1636$, $m = 0.0$
Rows in each panel (repeated for each sample size): Pooled-t, Welch-t, Wilcoxon (for $\kappa_1 - \kappa_2$), Wilcoxon (for m), MVUE1, RMVUE1
[numeric entries not recoverable from the source]
The Wilcoxon interval is assessed for containing both $\kappa_1 - \kappa_2$ and the median of differences $m$.
References

Aitchison, J. (1955). On the distribution of a positive random variable having a discrete probability mass at the origin. Journal of the American Statistical Association, 50(271).
Aitchison, J., & Brown, J. (1969). The lognormal distribution. Cambridge: Cambridge University Press.
Al-Khouli, A. (1999). Robust estimation and bootstrap testing for the delta distribution with applications in marine sciences. Ph.D. dissertation, Texas A&M University.
Finney, D. J. (1941). On the distribution of a variate whose logarithm is normally distributed. Journal of the Royal Statistical Society, Series B, 7.
Fletcher, D. (2008). Confidence intervals for the mean of the delta-lognormal distribution. Environmental and Ecological Statistics, 15(2).
Hollander, M., Wolfe, D., & Chicken, E. (2014). Nonparametric statistical methods. Hoboken: Wiley.
Owen, W., & DeRouen, T. (1980). Estimation of the mean for lognormal data containing zeroes and left-censored values, with applications to the measurement of worker exposure to air contaminants. Biometrics, 36(4).
Pennington, M. (1983). Efficient estimators of abundance, for fish and plankton surveys. Biometrics, 39(1).
Rosales, M. (2009). The robustness of confidence intervals for the mean of delta distribution. Ph.D. dissertation, Western Michigan University.
Zhou, X. H., & Tu, W. (2000a). Confidence intervals for the mean of diagnostic test charge data containing zeros. Biometrics, 56(4).
Zhou, X. H., & Tu, W. (2000b). Interval estimation for the ratio in means of log-normally distributed medical costs with zero values. Computational Statistics and Data Analysis, 35(2).