A LARGER SAMPLE SIZE IS NOT ALWAYS BETTER!!!

Nagaraj K. Neerchal
Department of Mathematics and Statistics
University of Maryland Baltimore County, Baltimore, MD 21250

Herbert Lacayo and Barry D. Nussbaum
United States Environmental Protection Agency
Washington, DC 20460

ABSTRACT

In a previous paper, Neerchal, Lacayo and Nussbaum (2007) explored the well-known problem of finding the optimal sample size for obtaining a confidence interval of a pre-assigned precision (or length) for the proportion parameter of a finite or infinite binary population. We illustrated some special problems that arise because the population distribution is discrete and because precision is measured by the length of the interval rather than by the variance. Specifically, the confidence level of an interval of fixed length does not necessarily increase as the sample size increases. However, when such confidence levels are computed using normal approximations, we see a monotonic behavior. In this paper, we consider the corresponding problem under the Poisson approximation and show that for this distribution monotonicity does not hold, and that one should beware of this seeming peculiarity when recommending sample sizes for studies involving estimation of means or proportions.

Keywords and Phrases: Confidence intervals, Poisson distribution, Binomial distribution, Hypergeometric distribution, Optimal sample size.

1. INTRODUCTION

In many sampling situations, the parent population size is relatively small (say N < 100). For example, the United States Environmental Protection Agency (USEPA) routinely audits certain small databases. Further, if sampling is expensive, a customer may request the smallest sample size that attains or exceeds a specified confidence level for a confidence interval (CI) of a specified precision, denoted by τ or d. [A more formal statement of this problem à la Cochran (1977) will follow later.] This question, as stated by one customer, is as follows: How large a sample do I need to estimate the error rate of a specific database?
This is a fairly straightforward simple random sampling without replacement (SRSWOR, i.e. the Hypergeometric distribution) problem. We considered it in a previous paper, namely Neerchal, Lacayo and Nussbaum (2007). In that paper we found, contrary to our expectations, that increasing the sample size did not always increase the confidence level. For example, as shown in Table 1 of the Appendix, for a population of N=90 and a desired precision of .02 (i.e. CI=.04), the confidence level is NOT monotone, but rather goes up and down as the sample size increases. This unexpected non-monotonic up-down-up behavior is observed for the Hypergeometric and Binomial distributions, both discrete. On the other hand, the confidence level for the normal distribution, which is often used to approximate binomial probabilities, is monotonically increasing. In this paper we investigate the monotonicity (or the lack thereof) of the confidence level under the Poisson distribution.
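The contrast between the exact and the approximate confidence levels can be illustrated with a minimal sketch. It assumes the binomial case and takes the confidence level to be the probability that the sample proportion falls strictly within τ of p. The values p = 0.1 and τ = 0.05 are hypothetical, chosen only for illustration, and the function names are ours.

```python
import math
from fractions import Fraction


def binomial_confidence(p, tau, n):
    """Exact P(|X/n - p| < tau) for X ~ Binomial(n, p), strict inequalities."""
    lo = max(math.floor(n * (p - tau)) + 1, 0)  # smallest count strictly above n(p - tau)
    hi = min(math.ceil(n * (p + tau)) - 1, n)   # largest count strictly below n(p + tau)
    q = float(p)
    return sum(math.comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(lo, hi + 1))


def normal_confidence(p, tau, n):
    """Normal approximation P(|Z| < tau * sqrt(n) / sqrt(p(1 - p)))."""
    q = float(p)
    z = float(tau) * math.sqrt(n) / math.sqrt(q * (1 - q))
    return math.erf(z / math.sqrt(2))  # equals 2 * Phi(z) - 1


# Hypothetical inputs (not taken from the paper); exact rationals keep the
# floor and ceiling computations free of floating-point boundary errors.
p, tau = Fraction(1, 10), Fraction(1, 20)
for n in (9, 10, 11, 12):
    print(n, round(binomial_confidence(p, tau, n), 4), round(normal_confidence(p, tau, n), 4))
```

Under these assumed inputs the exact level drops when n goes from 10 to 11, while the normal approximation keeps rising.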
2. A FORMAL STATEMENT OF THE PROBLEM

We consider the statistical inference problem of obtaining an interval estimate of a pre-assigned length (also referred to as precision) for the mean of a population. The objective is to provide the optimal sample size that will achieve a desired confidence level. A preliminary estimate of µ, the population mean, is available a priori. Suppose $\bar{X}$ denotes the sample mean from a sample of size $n$; then we are looking for the smallest $n$ such that

$$P(|\bar{X} - \mu| < \tau) \ge 1 - \alpha. \qquad (1)$$

Thus, the confidence interval is of fixed length 2τ around the sample mean and has at least 100(1 − α)% confidence level. If the population is finite (size N) and the sample is obtained by simple random sampling without replacement (SRSWOR), the confidence level in (1) is given by summing the appropriate terms of a hypergeometric distribution. That is, for a binary population with proportion p (so that µ = p),

$$P(|\bar{X} - \mu| < \tau) = \sum_{j=\lfloor np - n\tau \rfloor + 1}^{\lceil np + n\tau \rceil - 1} \frac{\binom{[Np]}{j}\binom{N - [Np]}{n - j}}{\binom{N}{n}} \qquad (2)$$

where $n$ denotes the sample size, the ceiling function $\lceil \cdot \rceil$ indicates the smallest integer greater than or equal to the quantity inside the brackets, and the floor function $\lfloor \cdot \rfloor$ indicates the largest integer less than or equal to the quantity inside the brackets.

Suppose we assume that either the population is infinite or the sampling is done with replacement; then we can use the binomial distribution to compute the confidence level given in (1). That is,

$$P(|\bar{X} - \mu| < \tau) = \sum_{j=\lfloor np - n\tau \rfloor + 1}^{\lceil np + n\tau \rceil - 1} \binom{n}{j} p^{j} (1 - p)^{n - j} \qquad (3)$$

where $\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$ in the upper and lower limits of the summation denote the ceiling and floor functions, respectively. Of course, for large $n$, it is also common to use the Normal approximation. Even elementary textbooks meant for a first course in Statistics contain elaborate descriptions of the Normal approximation to the binomial distribution, with or without continuity correction. That is,

$$P(|\bar{X} - \mu| < \tau) \approx P\left( -\frac{\tau\sqrt{n}}{\sqrt{p(1-p)}} < Z < \frac{\tau\sqrt{n}}{\sqrt{p(1-p)}} \right) \qquad (4)$$
where Z denotes a standard normal random variable. The advantage of the normal approximation is that we can obtain an explicit formula for the optimal sample size to achieve the desired confidence level. As shown in Cochran (1977),

$$n_{opt} = \frac{z_{\alpha/2}^{2}\, p(1-p)}{\tau^{2}}.$$

In Neerchal, Lacayo and Nussbaum (2007), we show that the confidence level given by the normal approximation formula (4) increases monotonically as the sample size increases, while the exact formulas (2) and (3) correspond to an up-and-down (saw tooth shaped) growth pattern. Consequently, one needs to use caution when rounding up sample size formulas.

In this paper, we consider the same inference problem under the Poisson distribution, another popular distribution used widely in practical applications. The Poisson distribution is also used to approximate the binomial when the sample size is large and the probability of success is small. We let $X_1, X_2, \ldots, X_n$ denote a simple random sample from a Poisson distribution with parameter λ, and consider $(\bar{X} - \tau, \bar{X} + \tau)$ as the fixed length confidence interval for λ. The confidence level of this interval is given by

$$P(\bar{X} - \tau < \lambda < \bar{X} + \tau) = P\left( n\lambda - n\tau < \sum_{i=1}^{n} X_i < n\lambda + n\tau \right) = \sum_{i=\lfloor n\lambda - n\tau \rfloor + 1}^{\lceil n\lambda + n\tau \rceil - 1} \frac{e^{-n\lambda} (n\lambda)^{i}}{i!} \qquad (5)$$

where, once again, $\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$ denote the ceiling and floor functions as in (3). We have observed a behavior for the Poisson distribution similar to the one we observed for the binomial distribution. In other words, the confidence level given in equation (5) is non-monotonic and its graph has a saw tooth pattern. That is, if we let

$$\Delta_n = P(\bar{X}_{n+1} - \tau < \lambda < \bar{X}_{n+1} + \tau) - P(\bar{X}_{n} - \tau < \lambda < \bar{X}_{n} + \tau),$$

then $\Delta_n$ can actually be positive or negative as the sample size increases. This can be seen in Figures 1 through 3 of the Appendix.

3. RESULTS AND DISCUSSION

It is straightforward to compute the expression given in (5) using any software package that computes Poisson probabilities, and to plot the relationship between confidence level and sample size for different combinations of λ and τ [see Figures 1 to 3]. The saw tooth pattern is obvious. This has
major consequences in determining the recommended sample size. The usual practice of rounding up the optimal sample size formula to a higher integer may lead to a lower confidence level than desired. The main thrust of the authors' work is from the vantage point of applications, which focuses on the determination of optimal sample size. When the samples are expensive to obtain, as was the case in the motivating example of USEPA's auditing case study mentioned in the introduction, it would be quite costly if the additional samples taken actually drive down the confidence. This would be like paying more and getting less!!

This preliminary investigation of the Poisson distribution and our previous work lead to interesting research questions. We list some of them below.

1. Is this peculiar behavior of the confidence level of fixed length confidence intervals true for all the common discrete distributions and false for all continuous distributions?

2. In our work so far, we focused on fixed length confidence intervals and the corresponding optimal "sample size determination problem" à la Cochran (1977). Another commonly used approach is based on looking at the coverage probabilities by specifying the Type I and Type II errors. In fact, for some of the commonly used discrete distributions, so-called exact confidence intervals are also available. See, for example, page 247 (for Binomial) and page 25 (for Poisson) of Millard and Neerchal (2001). An interesting research question would be to ask: would we see the saw tooth pattern in the confidence levels as a function of the sample size for such confidence intervals as well?

REFERENCES

Abramowitz, M. and Stegun, I. A. (1965). Handbook of Mathematical Functions. Dover Publications, Inc., New York.

Johnson, N. L. and Kotz, S. (1969). Distributions in Statistics: Discrete Distributions. Houghton Mifflin Company, New York.

Cochran, W. G. (1977). Sampling Techniques, 3rd ed., Wiley, New York.

Brown, L. D., Cai, T. and DasGupta, A. (2001). Interval Estimation for a Binomial Proportion. Statistical Science, Vol. 16, No. 2, 101-133.

Millard, S. P. and Neerchal, N. K. (2001).
Environmental Statistics with S-PLUS. CRC/Chapman Hall, Boca Raton, FL.

Neerchal, N. K., Lacayo, H. and Nussbaum, B. D. (2007). Is a Large Sample Size Always Better? American Journal of Mathematical and Management Sciences. (In process).
APPENDIX: TABLES AND GRAPHS

TABLE 1. Exact Confidence Levels (probability that the mean will be in the confidence interval specified by the precision tau) for shortest intervals around the sample mean, and the sample sizes proposed by various commercial programs for precision 10%

Exact confidence level for indicated sample size and precision [i.e. tau]

Confidence Interval
[i.e. 2*tau = 2*precision]    n = 47    n = 60    n = 63    n = 67
0.02                          0.7233    0.8799    0.834     0.7457
0.04                          0.9047    0.8799    0.9698    0.9454
0.06                          0.9047    0.98      0.9698    1.0000

Figure 1. Plot of confidence level [$P(|\bar{X} - \mu| < \tau)$] vs Sample Size. Length of CI: 2*tau, tau = .1, and Lambda = 1. (Vertical axis: confidence level; horizontal axis: Sample Size, 0 to 35.)
Table 2. Some raw data that may help explain jaggedness in Fig. 1

tau   lambda   up   lo   nlambda   prob1      prob2      n    conf
0.1   1         1    0    1        0.735759   0.367879    1   0.367879
0.1   1         2    1    2        0.676676   0.406006    2   0.270671
0.1   1         3    2    3        0.647232   0.42319     3   0.224042
0.1   1         4    3    4        0.628837   0.43347     4   0.195367
0.1   1         5    4    5        0.615961   0.440493    5   0.175467
0.1   1         6    5    6        0.606303   0.44568     6   0.160623
0.1   1         7    6    7        0.598714   0.449711    7   0.149003
0.1   1         8    7    8        0.592547   0.452961    8   0.139587
0.1   1         9    8    9        0.587408   0.455653    9   0.131756
0.1   1        10    9   10        0.58304    0.45793    10   0.12511
0.1   1        12    9   11        0.688697   0.340511   11   0.348186
0.1   1        13   10   12        0.681536   0.347229   12   0.334306
0.1   1        14   11   13        0.675132   0.353165   13   0.321967
0.1   1        15   12   14        0.66936    0.358458   14   0.310902
0.1   1        16   13   15        0.664123   0.363218   15   0.300905
0.1   1        17   14   16        0.659344   0.367527   16   0.291816
0.1   1        18   15   17        0.654958   0.371454   17   0.283505
0.1   1        19   16   18        0.650916   0.37505    18   0.275866
0.1   1        20   17   19        0.647174   0.378361   19   0.268813
0.1   1        21   18   20        0.643698   0.381422   20   0.262276
0.1   1        23   18   21        0.716029   0.30168    21   0.414349
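The rows of Table 2 can be recomputed with a short sketch. It assumes that prob1 and prob2 are the Poisson cumulative probabilities P(S ≤ up) and P(S ≤ lo), where S is the sample total with mean nλ, and that conf = prob1 − prob2. The function names `poisson_cdf` and `table_row` are ours, not taken from the original computations.

```python
import math
from fractions import Fraction


def poisson_cdf(k, mu):
    """P(S <= k) for S ~ Poisson(mu); an empty sum (k < 0) gives 0."""
    return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))


def table_row(lam, tau, n):
    """Return (up, lo, prob1, prob2, conf) for one sample size n."""
    up = math.ceil(n * (lam + tau)) - 1  # largest total strictly below n(lam + tau)
    lo = math.floor(n * (lam - tau))     # largest total not exceeding n(lam - tau)
    mu = float(n * lam)
    prob1 = poisson_cdf(up, mu)
    prob2 = poisson_cdf(lo, mu)
    return up, lo, prob1, prob2, prob1 - prob2


# Exact rationals avoid floating-point errors at boundaries such as n(lam + tau) = 11.
lam, tau = Fraction(1), Fraction(1, 10)
for n in range(1, 22):
    up, lo, prob1, prob2, conf = table_row(lam, tau, n)
    print(f"{n:2d} {up:2d} {lo:2d} {prob1:.6f} {prob2:.6f} {conf:.6f}")
```

Passing λ and τ as `Fraction` values is a design choice of this sketch: with binary floating point, a product such as 10 × 1.1 can land slightly above 11 and shift the upper limit by one.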
Figure 2. Plot of confidence level [$P(|\bar{X} - \mu| < \tau)$] vs Sample Size. Length of CI: 2*tau, tau = .1, and Lambda = .1. (Vertical axis: confidence level; horizontal axis: Sample Size, 0 to 35.)

Table 3. Some raw data that may help explain jaggedness in Fig. 2

tau   lambda   up   lo   nlambda   prob1       prob2       n    conf
0.1   0.1       0    0   0.1       0.9048374   0.9048374    1   0
0.1   0.1       0    0   0.2       0.8187308   0.8187308    2   0
0.1   0.1       0    0   0.3       0.7408182   0.7408182    3   0
0.1   0.1       0    0   0.4       0.67032     0.67032      4   0
0.1   0.1       0    0   0.5       0.6065307   0.6065307    5   0
0.1   0.1       1    0   0.6       0.8780986   0.5488116    6   0.329287
0.1   0.1       1    0   0.7       0.844195    0.4965853    7   0.3476097
0.1   0.1       1    0   0.8       0.8087921   0.449329     8   0.3594632
0.1   0.1       1    0   0.9       0.7724824   0.4065697    9   0.3659127
0.1   0.1       1    0   1         0.7357589   0.3678794   10   0.3678794
0.1   0.1       2    0   1.1       0.9004163   0.3328711   11   0.5675452
0.1   0.1       2    0   1.2       0.8794871   0.3011942   12   0.5782929
0.1   0.1       2    0   1.3       0.8571125   0.2725318   13   0.5845807
0.1   0.1       2    0   1.4       0.8334977   0.246597    14   0.5869008
0.1   0.1       2    0   1.5       0.8088468   0.2231302   15   0.5857167
0.1   0.1       3    0   1.6       0.9211865   0.2018965   16   0.71929
0.1   0.1       3    0   1.7       0.9068106   0.1826835   17   0.724127
Figure 3. Plot of confidence level [$P(|\bar{X} - \mu| < \tau)$] vs Sample Size. Length of CI: 2*tau, tau = .1, and Lambda = .2. (Vertical axis: confidence level; horizontal axis: Sample Size, 0 to 50.)

Discussion of Figures 1 and 2

In Figure 1, we note a peculiar pattern as the sample size increases. For instance, starting with a sample size of 11, we see [from Table 2] that the confidence level is .348. As the sample size increases to 12, 13, ..., 20, the confidence level actually decreases. However, when the sample size goes from 20 to 21, there is a marked increase in the confidence level [from .262 to .414]. This pattern appears to repeat in cycles of 10.

In Figure 2, we note a similarly peculiar pattern as the sample size increases. For instance, starting with a sample size of 6, we see from Table 3 that the confidence level is .329. As the sample size increases to 7, 8, 9, 10, the confidence level increases slowly. When the sample size goes from 10 to 11, however, there is a marked jump in the confidence level [from .368 to .567]. This pattern appears to repeat in cycles of 5.
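The cycles described above can be checked numerically by computing the change in confidence level from sample size n to n + 1. The following is a minimal sketch under the same assumptions as the Poisson confidence level with strict inequalities; the function names `poisson_confidence` and `differences` are ours.

```python
import math
from fractions import Fraction


def poisson_confidence(lam, tau, n):
    """P(n(lam - tau) < S < n(lam + tau)) for S ~ Poisson(n * lam)."""
    lo = max(math.floor(n * (lam - tau)) + 1, 0)  # smallest total strictly above n(lam - tau)
    hi = math.ceil(n * (lam + tau)) - 1           # largest total strictly below n(lam + tau)
    mu = float(n * lam)
    return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(lo, hi + 1))


def differences(lam, tau, n_max):
    """Return {n: conf(n + 1) - conf(n)} for n = 1, ..., n_max."""
    levels = [poisson_confidence(lam, tau, n) for n in range(1, n_max + 2)]
    return {n: levels[n] - levels[n - 1] for n in range(1, n_max + 1)}


tau = Fraction(1, 10)
fig1 = differences(Fraction(1), tau, 20)      # setting of Figure 1
fig2 = differences(Fraction(1, 10), tau, 16)  # setting of Figure 2
print("lambda = 1:   increases at n =", [n for n, d in fig1.items() if d > 0])
print("lambda = 0.1: decreases at n =", [n for n, d in fig2.items() if d < 0])
```

For λ = 1 the only increases up to n = 20 occur at the steps from 10 to 11 and from 20 to 21, which matches the cycle of 10 noted for Figure 1.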