Inferences for the Ratio: Fieller's Interval, Log Ratio, and Large Sample Based Confidence Intervals

Michael Sherman
Department of Statistics, 3143 TAMU, Texas A&M University, College Station, Texas 77843, USA
sherman@stat.tamu.edu

Arnab Maity
Department of Statistics, North Carolina State University, 2311 Stinson Drive, Raleigh, North Carolina 27695, USA
amaity@ncsu.edu

Suojin Wang
Department of Statistics, 3143 TAMU, Texas A&M University, College Station, Texas 77843, USA
sjwang@stat.tamu.edu

Abstract

In sample surveys and many other areas of application, the ratio of variables is often of great importance. This often occurs when one variable is available at the population level while another variable of interest is available for sample data only. In this case, using the sample ratio, we can often gather valuable information on the variable of interest for the unsampled observations. In many other studies, the ratio itself is of interest, for example when estimating proportions from a random number of observations. In this note we compare three confidence intervals for the population ratio: a large sample interval, a log based version of the large sample interval, and Fieller's interval. This is done through data analysis and through a small simulation experiment. The Fieller method has often been proposed as a superior interval for small sample sizes. We show through a data example and simulation experiments that Fieller's method often gives nonsensical and uninformative intervals when the observations are noisy relative to the mean of the data. The large sample interval does not similarly suffer and thus can be a more reliable method for small and large samples.

Some key words: Fieller's interval, ratio estimation, variance estimation, sample surveys, small sample inference.

1 Introduction

In sample surveys and many other areas of application, the ratio of variables is often of great importance. This often occurs when one variable is available at the population level while another variable of interest is available for sample data only. In this case, if the ratio of the two variables is estimable we gather valuable information on the variable of interest for the unsampled observations. In other studies, the ratio itself is of interest. One such example is the estimation of proportions where the number of observations is random, as is often the case in cluster sampling, for example. Other examples include applications to cost effectiveness in economics (Jiang, Wu, and Williams, 2000), studying the ratio of regression coefficients (Hirschberg and Lye, 2007) and comparing health outcomes across spatial domains (Beyene and Moineddin, 2005). For these reasons estimation of the true ratio between variables is of great interest.

2 Methodological Development

There are two basic frameworks in which we draw inferences: the infinite population and the finite population settings. For the former, we consider our observations as coming from a bivariate distribution function, $F(x, y)$, with correlation coefficient $\rho$. In the latter, our data observations come from a finite population $(X, Y)$, $X = (x_1, \ldots, x_N)$, $Y = (y_1, \ldots, y_N)$, of size $N$. In the former case the population ratio is defined to be $R = E(Y)/E(X) = \mu_y/\mu_x$, where the expectations, $\mu_y$ and $\mu_x$, are the population means of the variables. In the latter case the population ratio is defined to be $R = \sum_{i=1}^{N} y_i / \sum_{i=1}^{N} x_i = \bar{Y}/\bar{X}$. Our results are not specific to the finite or infinite population case, and in either case we denote a random sample by the sample elements chosen as $S = (i_1, \ldots, i_n)$. Then the sample data is $(x, y)$, where $x = x(S)$, $y = y(S)$.

For either the infinite population or finite population setting, we consider the sample ratio to be the ratio of sample means:

$$r = \sum_{i \in S} y_i \Big/ \sum_{i \in S} x_i = \bar{y}/\bar{x},$$

where $\bar{x}$, $\bar{y}$ are the sample means of the observations $x = x(S)$ and $y = y(S)$.

We desire inferences on the population ratio, $R$, based on the sample ratio $r$. Define the sample standard deviations of $x = x(S)$ and $y = y(S)$ as $s_x$ and $s_y$, and let $s_{xy}$ denote the sample covariance. Further, define $c_{\bar{x}\bar{x}} = n^{-1} s_x^2/\bar{x}^2$, $c_{\bar{y}\bar{y}} = n^{-1} s_y^2/\bar{y}^2$ and $c_{\bar{y}\bar{x}} = n^{-1} s_{xy}/(\bar{x}\bar{y})$. Large sample approximations, e.g., Cochran (1977) or Lohr (2009), show that

$$\hat{\sigma}_C^2 = \sum_{i \in S} (y_i - r x_i)^2 / \{n(n-1)\bar{x}^2\} = r^2 [c_{\bar{y}\bar{y}} + c_{\bar{x}\bar{x}} - 2c_{\bar{y}\bar{x}}]$$

is a consistent estimator of the variance of $r$, so that a large sample confidence interval is given by

$$I_C = \left(r \pm t_{\alpha/2,n-1}\, r \sqrt{c_{\bar{y}\bar{y}} + c_{\bar{x}\bar{x}} - 2c_{\bar{y}\bar{x}}}\right),$$

where $t_{\alpha,n-1}$ denotes the $(1-\alpha)100\%$ quantile of a $t_{n-1}$ distribution. Note that we have assumed that, in the finite population situation, the sampling fraction $n/N$ is small enough that we can consider the finite population correction, fpc $= 1 - n/N$, to be equal to one. We assume this throughout the present study. If this is not the case then all variance estimates can be adjusted accordingly.

As an alternative to the previous large sample interval, one can use the log ratio as a pivotal quantity, construct a confidence interval around the estimated log ratio and exponentiate both end points to obtain an interval for the actual ratio. Specifically, using a Taylor expansion we derive

$$\mathrm{var}\{\log(r)\} = \mathrm{var}\{\log(\bar{y}) - \log(\bar{x})\} \approx \mathrm{var}\{\log(\bar{Y}) - \log(\bar{X}) + (\bar{y} - \bar{Y})/\bar{Y} - (\bar{x} - \bar{X})/\bar{X}\}$$
$$= \mathrm{var}(\bar{y} - \bar{Y})/\bar{Y}^2 + \mathrm{var}(\bar{x} - \bar{X})/\bar{X}^2 - 2\,\mathrm{cov}\{(\bar{y} - \bar{Y}), (\bar{x} - \bar{X})\}/(\bar{Y}\bar{X}).$$

Hence we have the variance estimator $\widehat{\mathrm{var}}\{\log(r)\} \approx c_{\bar{y}\bar{y}} + c_{\bar{x}\bar{x}} - 2c_{\bar{y}\bar{x}}$. Thus the log ratio based interval is

$$I_{LR} = \left(r \exp\{-t_{\alpha/2,n-1} \sqrt{c_{\bar{y}\bar{y}} + c_{\bar{x}\bar{x}} - 2c_{\bar{y}\bar{x}}}\},\; r \exp\{t_{\alpha/2,n-1} \sqrt{c_{\bar{y}\bar{y}} + c_{\bar{x}\bar{x}} - 2c_{\bar{y}\bar{x}}}\}\right).$$

One motivation for this interval is that positive variables are often skewed to the right and the distribution of the sample ratio, $r$, is also often skewed to the right. The interval $I_C$ is by definition symmetric around $r$ while the interval $I_{LR}$ allows for asymmetric behavior.
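As a rough illustration of the two intervals above, the following Python sketch computes $I_C$ and $I_{LR}$ from paired observations. The function name `ratio_intervals`, the argument `t_crit` (the critical value $t_{\alpha/2,n-1}$, supplied by the caller rather than computed) and the five data pairs are assumptions made for illustration and are not taken from the paper. As in the text, fpc = 1 is assumed, and the log ratio interval is only meaningful when both sample means are positive.

```python
import math


def ratio_intervals(x, y, t_crit):
    """Return (r, I_C, I_LR) for the ratio of means mean(y) / mean(x).

    t_crit is the t critical value t_{alpha/2, n-1}, supplied by the caller.
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Sample variances and covariance with divisor n - 1.
    sxx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    # The scaled quantities c_xx, c_yy and c_yx defined in the text.
    cxx = sxx / (n * xbar ** 2)
    cyy = syy / (n * ybar ** 2)
    cyx = sxy / (n * xbar * ybar)
    r = ybar / xbar
    # Estimated standard error of log(r); guard against rounding below zero.
    se_log = math.sqrt(max(cyy + cxx - 2 * cyx, 0.0))
    half_width = t_crit * abs(r) * se_log
    i_c = (r - half_width, r + half_width)
    i_lr = (r * math.exp(-t_crit * se_log), r * math.exp(t_crit * se_log))
    return r, i_c, i_lr


# Hypothetical data; 2.776 is the 97.5% quantile of a t distribution with 4 df.
x = [10.0, 12.0, 9.0, 15.0, 11.0]
y = [2.1, 2.5, 1.7, 3.2, 2.0]
r, i_c, i_lr = ratio_intervals(x, y, t_crit=2.776)
print(r, i_c, i_lr)
```

By construction $I_C$ is centred at $r$, whereas the endpoints of $I_{LR}$ have geometric mean $r$, which is how the log based interval accommodates right skew.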

For small samples, another, better known, suggested improvement on $I_C$ is the so-called Fieller interval, suggested initially by Fieller (1932). This interval has often been used in applications, for example in Heitjan (2000), Hirschberg and Lye (2007), Jiang, Wu, and Williams (2000), and Beyene and Moineddin (2005). To explore the behavior of this interval, it is natural to assume that the framework for sampling is simple random sampling from an infinite population. This is mainly for simplicity and clarity in our analytical discussions. This Fieller approach, however, is often used in the finite population setting where data are typically not assumed to come from a specific distribution.

The Fieller interval assumes that the joint distribution of the infinite population, $F(x, y)$, is bivariate normal. In this case, the pivotal statistic

$$T = n^{1/2} (\bar{y} - R\bar{x}) / (s_y^2 - 2R s_{xy} + R^2 s_x^2)^{1/2}$$

has an exact $t$-distribution. This is easily seen by defining the variables $z_i = y_i - R x_i$, $i \in S$. Then the mean of the $z$ variables is 0, and $T = n^{1/2} \bar{z}/s_z$ has a $t$-distribution with $n - 1$ degrees of freedom, where $s_z$ is the sample standard deviation of the $z_i$. A confidence interval for the ratio is given by the set of all $R$ such that

$$-t_{\alpha/2,n-1} < T < t_{\alpha/2,n-1}. \qquad (1)$$

Using this we can form the inequality $t^2_{\alpha/2,n-1} > T^2$. This expression is easily written as a quadratic inequality in $R$:

$$aR^2 + bR + c < 0 \qquad (2)$$

with $a = 1 - t^2_{\alpha/2,n-1} c_{\bar{x}\bar{x}}$, $b = -2r(1 - t^2_{\alpha/2,n-1} c_{\bar{y}\bar{x}})$, $c = r^2(1 - t^2_{\alpha/2,n-1} c_{\bar{y}\bar{y}})$. Suppose there are two real valued solutions, $d_1 \le d_2$, to the equation obtained from (2) by changing the inequality into an equality. If the coefficient $a$ is such that $a > 0$ then the solution to the inequality (1) for $R$ is $(d_1, d_2)$. On the other hand, if $a < 0$, the solution in $R$ is $(-\infty, d_1) \cup (d_2, \infty)$. Using the quadratic formula, and solving for $R$, we find the two roots, $d_1$ and $d_2$, as functions of our sample data. This gives the endpoints of the Fieller interval. The endpoints of the interval are given by

$$I_F = \frac{r\left[(1 - t^2_{\alpha/2,n-1} c_{\bar{y}\bar{x}}) \pm t_{\alpha/2,n-1} \{(c_{\bar{x}\bar{x}} + c_{\bar{y}\bar{y}} - 2c_{\bar{y}\bar{x}}) - t^2_{\alpha/2,n-1} (c_{\bar{x}\bar{x}} c_{\bar{y}\bar{y}} - c^2_{\bar{y}\bar{x}})\}^{1/2}\right]}{1 - t^2_{\alpha/2,n-1} c_{\bar{x}\bar{x}}}.$$

In a small sample ($n = 8$) of observations, Efron and Tibshirani (1993) seek to estimate the population ratio in a study on bioequivalence. The observations are roughly compatible with normality and the Fieller interval is called exact and the gold standard in this case. This seems appropriate, and several nonparametric intervals based on resampling are shown to be close to the Fieller interval, with the better intervals closer to the Fieller interval. In the following section we show that qualitatively different behavior of the Fieller interval, $I_F$, can occur.

3 Data Example

In contrast to the Efron and Tibshirani example from Section 2, we show there can be quite different behavior of the Fieller interval. Consider the $n = 8$ observations from Lehtonen and Pahkinen (2004), p.103 of two variables (ue91, hou85), the number of unemployed people and the number of households in several provinces of central Finland. The data are given in Table 1. The goal is to estimate the true ratio of the ue91 to the hou85 variables. Although the sampling fraction is $n/N = 8/32$ in this case, for simplicity we assume in our illustration that fpc = 1.

Table 1: A simple random sample without replacement of size n = 8 from the Province 91 population data as presented in Lehtonen and Pahkinen (2004), p.103. Presented are the two study variables: the number of unemployed persons (ue91) and the number of households (hou85).

Using the formula for the Fieller interval we find the two solutions to the quadratic equation to be [...] and [...]. The implied interval seems to be reasonably precise. Note, however, that the sample ratio is r = [...], and that this value is not included in the interval. Closer inspection shows that the right endpoint from the formula is given by [...] and the left endpoint from the formula is given by [...]. This is a curious situation. If we seek to find the interval from (1), we find that the interval is given by (−∞, [...]) ∪ (0.1510, ∞), the complement of the interval (0.1379, [...]). This suggests that we cannot simply switch the endpoints to obtain an appropriate interval. Further, we see that the correct Fieller interval gives a very uninformative and completely nonsensical interval (which is actually the union of two intervals).

Inspection of the formula shows that we have completely nonsensical intervals when either $(c_{\bar{x}\bar{x}} + c_{\bar{y}\bar{y}} - 2c_{\bar{y}\bar{x}}) - t^2_{\alpha/2,n-1} (c_{\bar{x}\bar{x}} c_{\bar{y}\bar{y}} - c^2_{\bar{y}\bar{x}}) < 0$ and/or $1 - t^2_{\alpha/2,n-1} c_{\bar{x}\bar{x}} < 0$. The second situation occurs when the sample variance of the $x$'s is large relative to the square of the sample mean. Note, also, that this is the case where the usual confidence interval for $\mu_x$ contains zero. In this data set we find that $c_{\bar{x}\bar{x}} = 0.37$ and this in turn leads to a value of $1 - t^2_{\alpha/2,n-1} c_{\bar{x}\bar{x}}$ = [...]. Note that in stark contrast to the strange behavior of the Fieller interval, the usual large sample interval is given by $I_C$ = (0.1478, [...]) and the log ratio interval is given by $I_{LR}$ = (0.1483, [...]). Both of these intervals give informative inferences on the population ratio.
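The sign check on the coefficient $a$ can be made explicit in code. The sketch below solves the quadratic (2) and reports which kind of set the roots delimit instead of silently returning $(d_1, d_2)$. The function name `fieller_set`, the labels it returns and both data sets are hypothetical choices for illustration; the Table 1 data are not used. The critical value `t_crit` is again supplied by the caller.

```python
import math


def fieller_set(x, y, t_crit):
    """Solve a*R^2 + b*R + c < 0 and describe the Fieller confidence set.

    Returns (kind, d1, d2, r) where kind is
      "interval"      -> the set is (d1, d2)
      "complement"    -> the set is (-inf, d1) U (d2, inf)
      "no_real_roots" -> the quadratic has no real roots (d1 = d2 = None)
      "degenerate"    -> a == 0, the inequality is not quadratic
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    syy = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    cxx = sxx / (n * xbar ** 2)
    cyy = syy / (n * ybar ** 2)
    cyx = sxy / (n * xbar * ybar)
    r = ybar / xbar
    t2 = t_crit ** 2
    # Coefficients of the quadratic inequality (2).
    a = 1 - t2 * cxx
    b = -2 * r * (1 - t2 * cyx)
    c = r ** 2 * (1 - t2 * cyy)
    if a == 0:
        return "degenerate", None, None, r
    disc = b * b - 4 * a * c
    if disc < 0:
        return "no_real_roots", None, None, r
    root = math.sqrt(disc)
    d1, d2 = sorted(((-b - root) / (2 * a), (-b + root) / (2 * a)))
    kind = "interval" if a > 0 else "complement"
    return kind, d1, d2, r


# Hypothetical data: x is stable in the first call and noisy in the second.
print(fieller_set([10.0, 12.0, 9.0, 15.0, 11.0], [2.1, 2.5, 1.7, 3.2, 2.0], 2.776))
print(fieller_set([0.5, 4.0, 1.0, 6.0, 0.2], [2.0, 2.2, 1.9, 2.1, 2.0], 2.776))
```

In the second call the sample variance of $x$ is large relative to $\bar{x}^2$, so $a < 0$ and the sample ratio falls outside $(d_1, d_2)$, which is the situation described for the data example.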

4 Simulation

We now investigate how the three intervals, Fieller, log ratio and large sample, perform in a simulation experiment. We are particularly interested in the small sample behavior of Fieller's interval, where it has been assumed to be particularly suited. The model is

$$X \sim F_X, \qquad Y = h(X)\beta + \epsilon, \qquad \epsilon \sim F_\epsilon.$$

We consider two choices for $h(\cdot)$, namely, $h(x) = x$ and $h(x) = x^2$. We set $F_X$ and $F_\epsilon$ to be one of the three distributions (1) Normal(2,1), (2) Gamma(shape=3) and (3) Gamma(shape=1). Also we set $\beta$ = 0, 0.5 and 1. For each of these cases, we consider sample sizes n = 5, n = 10 and n = 30. We calculate all three intervals and evaluate the coverage and mean interval length based on 10,000 data sets for each setting.

To give some idea of what might be expected, we give the details of one particular simulation experiment. In this case, we set n = 5, and set $F_X$ and $F_\epsilon$ to be the Normal(2,1) distribution with $\beta = 0$, so that in this case the $t$-distribution for $T$ is exact. In one simulation we find that the right endpoint is less than the left in 921 cases. Further, in 17 cases we find that $(c_{\bar{x}\bar{x}} + c_{\bar{y}\bar{y}} - 2c_{\bar{y}\bar{x}}) - t^2_{\alpha/2,n-1} (c_{\bar{x}\bar{x}} c_{\bar{y}\bar{y}} - c^2_{\bar{y}\bar{x}}) < 0$, leading to a negative square root. In these cases the interval is undefined. Using the interval $(d_1, d_2)$ in all cases we find the coverage to be .857. However, counting as covered the 921 cases in which $(d_1, d_2)$ does not include the true ratio but $(-\infty, d_1) \cup (d_2, \infty)$ does contain $R$, we find a coverage of .949, which is within simulation error of the nominal .950 (as it must be, since the $t$-distribution is exact). It is not clear what to do in the 17 cases with a negative square root. In these cases, the left hand side of (2) is either > 0 or < 0 for all real values of $R$, which means that either there is no solution for $R$ in (2) or the solution for $R$ is the whole real line. By convention, we say there is no solution and in this case the interval fails to capture the true ratio.

However, the practical impact of this situation is small in this case. The difficulty is that in a significant minority of cases the correct intervals necessary to get the nominal coverage are the absurd disjoint intervals of the form $(-\infty, d_1) \cup (d_2, \infty)$. These intervals are nonsensical and would likely be interpreted inadvertently to actually be $(d_1, d_2)$, which is not the correct interval. Note, in particular, that in the data example in Section 3 the interval obtained by switching the endpoints is not comparable to either the large sample interval or the log interval.

A natural question is how likely it is to obtain the nonsensical interval. Recall that this occurs when the coefficient $a = 1 - t^2_{\alpha/2,n-1} c_{\bar{x}\bar{x}}$ is negative. In the case of bivariate normality where both X and Y follow Normal(2,1) distributions, we can explicitly give the probability of $a < 0$. Note that

$$P[a < 0] = P[t^2_{\alpha/2,n-1} c_{\bar{x}\bar{x}} > 1] = P[n\bar{x}^2/s_x^2 < t^2_{\alpha/2,n-1}].$$

Further, $n\bar{x}^2/\sigma_x^2$ has a noncentral $\chi^2$ distribution with 1 degree of freedom and noncentrality parameter $n\mu_x^2/\sigma_x^2$. Now, $(n-1)s_x^2/\sigma_x^2$ has a (central) $\chi^2$ distribution with $n - 1$ degrees of freedom. Thus, $n\bar{x}^2/s_x^2$ has a noncentral $F$-distribution with numerator degrees of freedom 1, denominator degrees of freedom $n - 1$, and noncentrality parameter $n\mu_x^2/\sigma_x^2$. To evaluate this probability in our simulation experiment, we take n = 5, $\mu_x = 2$ and $\sigma_x^2 = 1$. Thus we have $n\mu_x^2/\sigma_x^2 = 20$ and we find for $\alpha = 0.05$ that $P[F(1, 4, 20) < 7.71]$ = [...], where $F(a, b, c)$ denotes a random variable with the $F$-distribution with $a$ and $b$ numerator and denominator degrees of freedom and noncentrality parameter $c$. This value is close to our empirical result in the simulation experiment, where we observed 921/10000 = .0921 negative denominators in the Fieller interval. For other sample sizes, Figure 1 displays the chance of a negative denominator in the Fieller interval when sampling from a bivariate normal population where both X and Y follow Normal(2,1) distributions. We see from the left plot in Figure 1 that once the sample size is larger than n = 10 the chance is quite small.
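The probability of a negative denominator can also be checked by brute force. The sketch below estimates $P[n\bar{x}^2/s_x^2 < t^2_{\alpha/2,n-1}]$ by Monte Carlo using only the Python standard library. It is a stand-in for evaluating the noncentral $F$ distribution function directly, not the calculation used in the paper; the function name `prob_negative_a`, the seed and the number of replications are arbitrary illustrative choices.

```python
import random


def prob_negative_a(n, mu, sigma, t_crit, reps=20000, seed=1):
    """Monte Carlo estimate of P[a < 0] when x_1..x_n are iid Normal(mu, sigma^2).

    a < 0 is equivalent to n * xbar^2 / s_x^2 < t_crit^2.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(x) / n
        s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
        if n * xbar ** 2 / s2 < t_crit ** 2:
            hits += 1
    return hits / reps


# n = 5, Normal(2, 1): t_{0.025, 4} = 2.776, so t_crit^2 is about 7.71.
print(prob_negative_a(5, 2.0, 1.0, 2.776))
# n = 30 with the same population: t_{0.025, 29} = 2.045.
print(prob_negative_a(30, 2.0, 1.0, 2.045))
```

For n = 5 the estimate should land in the neighbourhood of the empirical rate 921/10000 reported above, and for n = 30 it should be far smaller, in line with the left panel of Figure 1.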
The sample size needed to be safe from the sign switch is in fact dependent on the inverse of the coefficient of variation, $\mu_x/\sigma_x$, as $n\mu_x^2/\sigma_x^2$ is the noncentrality of the $F$ distribution. In the left plot in Figure 1 cv = 0.5, while in the right plot both X and Y follow Normal(1,1) distributions so cv = 1.0. The results are much more severe in the right plot. For larger values of cv the Fieller interval performs badly even for larger sample sizes. We note that a potential problem with bigger cv is that the log ratio method can also become non-applicable for large cv's, as $\bar{x}$ itself may become negative.

Figure 1: Results from simulation study. Displayed is the chance of a negative denominator in the Fieller interval when sampling from a bivariate normal population with coefficient of variation 0.5 (left) and 1.0 (right), respectively. Both panels plot the percentage of times a < 0 against the sample size (n).

Table 2 summarizes our simulation results. In all cases when two distinct roots $d_1$ and $d_2$ were found in the construction of the Fieller interval the interval was taken to be $(d_1, d_2)$. We see that the Fieller interval, often motivated for good small sample behavior, behaves very poorly for n = 5 and moderately badly for n = 10. The reason for the undercoverage of Fieller's interval under bivariate normality is as discussed above. Fieller intervals tend to be the widest but in many cases still do not have coverage closest to nominal of the three intervals. For the larger sample size of n = 30 all three intervals perform well across all situations considered. We see that under gamma distributions the comparisons of the three intervals are qualitatively similar.
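The effect of always reporting $(d_1, d_2)$ can be reproduced in a few lines. The following sketch re-creates only the single setting described earlier (n = 5, X and $\epsilon$ both Normal(2,1), $\beta = 0$, so the true ratio is 1) and compares two coverage figures: the naive one, which always takes $(d_1, d_2)$, and the one obtained by checking the pivot $T$ directly at the true ratio. It is not the authors' simulation code. The name `fieller_coverage`, the seed and the 5,000 replications are assumptions, and cases without real roots are simply counted as misses for the naive interval.

```python
import math
import random


def fieller_coverage(n, t_crit, true_ratio, reps=5000, seed=1):
    """Return (naive, exact) coverage for X, Y independent Normal(2, 1).

    naive: the true ratio lies in (d1, d2), whatever the sign of a.
    exact: |T| < t_crit at the true ratio, i.e. inequality (1) holds.
    """
    rng = random.Random(seed)
    t2 = t_crit ** 2
    naive = exact = 0
    for _ in range(reps):
        x = [rng.gauss(2.0, 1.0) for _ in range(n)]
        y = [rng.gauss(2.0, 1.0) for _ in range(n)]  # beta = 0, so Y = eps
        xbar = sum(x) / n
        ybar = sum(y) / n
        sxx = sum((v - xbar) ** 2 for v in x) / (n - 1)
        syy = sum((v - ybar) ** 2 for v in y) / (n - 1)
        sxy = sum((u - xbar) * (v - ybar) for u, v in zip(x, y)) / (n - 1)
        # Pivot evaluated at the true ratio.
        z_var = syy - 2 * true_ratio * sxy + true_ratio ** 2 * sxx
        t_stat = math.sqrt(n) * (ybar - true_ratio * xbar) / math.sqrt(z_var)
        if abs(t_stat) < t_crit:
            exact += 1
        # Quadratic (2) multiplied through by n * xbar^2; the roots are unchanged.
        a = n * xbar ** 2 - t2 * sxx
        b = -2 * (n * xbar * ybar - t2 * sxy)
        c = n * ybar ** 2 - t2 * syy
        disc = b * b - 4 * a * c
        if a != 0 and disc >= 0:
            root = math.sqrt(disc)
            d1, d2 = sorted(((-b - root) / (2 * a), (-b + root) / (2 * a)))
            if d1 < true_ratio < d2:
                naive += 1
    return naive / reps, exact / reps


naive, exact = fieller_coverage(n=5, t_crit=2.776, true_ratio=1.0)
print(naive, exact)
```

Under these assumptions the pivot-based figure should sit near the nominal 95% and the naive figure should fall visibly below it, which is the pattern the text reports (.949 against .857).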

5 Conclusion

We have studied the Fieller interval for a population ratio, pointing out that care must be used when using the Fieller interval. We can obtain nonsensical answers from the formula. We see that this is not a rare occurrence. For bivariate normal observations Fieller's formula gives nonsensical results in approximately 10 percent of data sets when n = 5 and cv = 0.5. This sample size is small, but such sample sizes are common in biological applications. The larger the coefficient of variation, the larger the proportion of nonsensical and uninformative intervals we obtain using Fieller's interval. The large sample approximations leading to the Cochran and log ratio intervals perform more stably in general for small samples. Although the log ratio method requires positive means to be usable, it appears to perform generally better when applicable than Cochran's large sample method, especially when the sample size is small.

The bootstrap is a common alternative to large sample methods for small to moderate sample sizes. The analysis of a ratio, however, is a particularly difficult problem for the bootstrap and great care is necessary to draw reasonable inferences. See, e.g., Chapter 25 of Efron and Tibshirani (1993), where bias correction and calibration are necessary to make the intervals perform adequately.

References

Beyene, J. and Moineddin, R. (2005), Methods for Confidence Interval Estimation of a Ratio Parameter with Application to Location Quotients, BMC Medical Research Methodology, 5, 32.

Cochran, W. G. (1977), Sampling Techniques, New York: Wiley.

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall.

Fieller, E.C. (1932), The Distribution of the Index in a Bivariate Normal Distribution, Biometrika, 24.

Heitjan, D.F. (2000), Fieller's Method and Net Health Benefits, Health Economics, 9.

Hirschberg, J.G. and Lye, J.N. (2007), Providing Intuition to the Fieller Method with Two Geometric Representations Using STATA and EVIEWS, preprint.

Jiang, G., Wu, J., and Williams, G.R. (2000), Fieller's Interval and the Bootstrap-Fieller Interval for the Incremental Cost-Effectiveness Ratio, Health Services and Outcomes Research Methodology, 1.

Lehtonen, R. and Pahkinen, E. (2004), Practical Methods for Design and Analysis of Complex Surveys, 2nd Edition, New York: Wiley.

Lohr, S.L. (2009), Sampling: Design and Analysis, 2nd Edition, Pacific Grove: Brooks/Cole.

Table 2: Results from the simulation study. For each setting (β = 0, 0.5 and 1) the coverage probability (first column) and interval length (second column) are reported. The nominal coverage is 95%.

Column groups: n = 5, n = 10 and n = 30, each with β = 0, β = 0.5 and β = 1.
Rows within each case: Cochran, Fieller, log ratio.

Cases with X ~ Normal(2, 1):
Y = Xβ + Normal(2, 1)
Y = X²β + Normal(2, 1)
Y = Xβ + Gamma(shape = 3)
Y = X²β + Gamma(shape = 3)
Y = Xβ + Gamma(shape = 1)
Y = X²β + Gamma(shape = 1)

Cases with X ~ Gamma(shape = 3) (Table 2, continued):
Y = Xβ + Normal(2, 1)
Y = X²β + Normal(2, 1)
Y = Xβ + Gamma(shape = 3)
Y = X²β + Gamma(shape = 3)
Y = Xβ + Gamma(shape = 1)
Y = X²β + Gamma(shape = 1)

Cases with X ~ Gamma(shape = 1) (Table 2, continued):
Y = Xβ + Normal(2, 1)
Y = X²β + Normal(2, 1)
Y = Xβ + Gamma(shape = 3)
Y = X²β + Gamma(shape = 3)
Y = Xβ + Gamma(shape = 1)
Y = X²β + Gamma(shape = 1)
