
Figure 3: Poisson vs. normal approximations for large sample sizes. [Plot omitted: vertical axis $\hat{p}$ from 0.06 to 0.17; logarithmic horizontal axis $n$ with ticks at 90, 200, 400, 800, 2000, 5000 and 10,000; regions labeled "Poisson approximation" and "Normal approximation".]
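The boundary curve in Figure 3 is summarized in Section 3 by the linear rule of thumb $\hat{p} > (5.2 - \log_{10} n)/18.8$ for $n$ between 100 and 10,000. A minimal sketch of that rule follows; the function name `preferNormalQ` is hypothetical and is not part of the paper's code.

```mathematica
(* Hypothetical helper, not from the paper: returns True when the Section 3
   rule of thumb favors the normal approximation over the Poisson one.
   The definition only applies for 100 <= n <= 10000, the range of Figure 3. *)
preferNormalQ[n_?NumericQ, phat_?NumericQ] /; 100 <= n <= 10000 :=
  phat > (5.2 - Log10[n])/18.8

preferNormalQ[1000, 0.15]  (* threshold is (5.2 - 3)/18.8, about 0.117 *)
preferNormalQ[1000, 0.05]
```

Outside the stated range the call is left unevaluated rather than extrapolated, since the rule is only given for $n$ between 100 and 10,000.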

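Figure 2: Approximation methods for confidence limits (maximum error: 0.04). [Plot omitted: vertical axis $\hat{p}$ from 0.0 to 0.5; horizontal axis $n$ from 0 to 100; regions labeled "Normal approximation", "Do not approximate" and "Poisson approximation"; rule-of-thumb curves R1, R2, R3, R4 and R5.]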
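As a worked illustration of the rule-of-thumb curves in Figure 2, the following evaluates R1 through R4 (as defined in Section 3) at the right edge of the plot, $n = 100$. The choice of $n = 100$ is only for illustration.

```latex
\begin{align*}
\text{R1:}\quad \hat{p} &= \frac{5}{n} = \frac{5}{100} = 0.05 \\
\text{R2:}\quad \hat{p} &= \frac{4}{4+n} = \frac{4}{104} \approx 0.0385 \\
\text{R3:}\quad \hat{p} &= \frac{1}{2} - \frac{\sqrt{n(n-40)}}{2n}
                         = \frac{1}{2} - \frac{\sqrt{6000}}{200} \approx 0.1127 \\
\text{R4:}\quad \hat{p} &= \frac{1}{2} - \frac{\sqrt{n(n-36)}}{2n}
                         = \frac{1}{2} - \frac{80}{200} = 0.1
\end{align*}
```

At $n = 100$, rules R3 and R4 admit the normal approximation only for $\hat{p}$ above roughly 0.10 to 0.11, while R1 and R2 admit it above roughly 0.04 to 0.05, which is consistent with the text's remark that R3 and R4 are more conservative.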

Figure 1: Approximation methods for confidence limits (maximum error: 0.01). [Plot omitted: vertical axis $\hat{p}$ from 0.0 to 0.5; horizontal axis $n$ from 0 to 100; regions labeled "Normal approximation", "Do not approximate" and "Poisson approximation".]
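The regions in Figure 1 come from comparing each approximate interval with the exact one. The sketch below shows how a single $(n, y)$ pair might be classified, using the F-quantile form of the exact limits, the normal limits, and the chi-square form of the Poisson limits from Section 2, with the maximum-error definition from Section 3. The function name `classifyPair`, its default arguments, and the tie-breaking choice are assumptions made for illustration; this is not the authors' code.

```mathematica
(* Hypothetical helper, not from the paper: classify one (n, y) pair with 0 < y < n. *)
classifyPair[n_Integer, y_Integer, alpha_: 0.05, maxErr_: 0.01] :=
 Module[{phat, z, pl, pu, nl, nu, ql, qu, errNormal, errPoisson},
  phat = y/n;
  (* "Exact" limits via F quantiles, as in the Section 2 code *)
  pl = 1/(1 + (n - y + 1)/
       (y Quantile[FRatioDistribution[2 y, 2 (n - y + 1)], alpha/2]));
  pu = 1/(1 + (n - y)/
       ((y + 1) Quantile[FRatioDistribution[2 (y + 1), 2 (n - y)], 1 - alpha/2]));
  (* Normal-approximation limits *)
  z = Quantile[NormalDistribution[0, 1], 1 - alpha/2];
  nl = phat - z Sqrt[phat (1 - phat)/n];
  nu = phat + z Sqrt[phat (1 - phat)/n];
  (* Poisson-approximation limits from chi-square quantiles *)
  ql = Quantile[ChiSquareDistribution[2 y], alpha/2]/(2 n);
  qu = Quantile[ChiSquareDistribution[2 (y + 1)], 1 - alpha/2]/(2 n);
  (* Error of each approximation: the larger of the two limit discrepancies *)
  errNormal = Max[Abs[pl - nl], Abs[pu - nu]];
  errPoisson = Max[Abs[pl - ql], Abs[pu - qu]];
  Which[
   Min[errNormal, errPoisson] > maxErr, "Do not approximate",
   errNormal <= errPoisson, "Normal approximation",
   True, "Poisson approximation"]]

classifyPair[100, 2]
```

Mapping `classifyPair` over even $n$ and all $y$ with $0 < y/n \le 1/2$ would give a grid comparable to the one plotted, under the assumptions above.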

[17] R.E. Walpole and R.H. Myers, Probability and Statistics for Engineers and Scientists, Fifth Edition, Macmillan, 1993.
[18] S. Wolfram, Mathematica: A System for Doing Mathematics by Computer, Second Edition, Addison-Wesley, 1991.

[5] J.E. Freund, Mathematical Statistics, Fifth Edition, Prentice-Hall, 1992.
[6] B.K. Ghosh, "A Comparison of Some Approximate Confidence Intervals for the Binomial Parameter", Journal of the American Statistical Association, Volume 74, Number 368, 1979, pp. 894-900.
[7] R.N. Goldman and J.S. Weinberg, Statistics: An Introduction, Prentice-Hall, 1985.
[8] R.V. Hogg and E.A. Tanis, Probability and Statistical Inference, Fourth Edition, 1993.
[9] L.L. Lapin, Probability and Statistics for Modern Engineering, Brooks/Cole, 1983.
[10] R.J. Larsen and M.L. Marx, An Introduction to Mathematical Statistics and Its Applications, Second Edition, Prentice-Hall, 1986.
[11] R.F. Ling, "Just Say No to Binomial (and other Discrete Distributions) Tables", American Statistician, Volume 46, Number 1, 1992, pp. 53-54.
[12] W. Mendenhall and T. Sincich, Statistics for Engineering and the Sciences, Third Edition, Macmillan, 1992.
[13] S. Ross, A First Course in Probability, Third Edition, Macmillan, 1988.
[14] M. Schader and F. Schmid, "Two Rules of Thumb for the Approximation of the Binomial Distribution by the Normal Distribution", American Statistician, Volume 43, Number 1, 1989, pp. 23-24.
[15] R.L. Scheaffer and J.T. McClave, Probability and Statistics for Engineers, Third Edition, PWS-Kent, 1990.
[16] K.S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, Prentice-Hall, 1982.

- For sample sizes larger than 150, the absolute error of either the upper or the lower confidence limit is less than 0.01 if the appropriate approximation technique is used. Figure 3 should be consulted for specific guidance as to whether the normal or Poisson approximation is appropriate.
- Introductory probability and statistics textbooks targeting statistics and mathematics majors would benefit from including the use of the F distribution to find $p_L$ and $p_U$. Also, more of these texts should include the use of the Poisson approximation to the binomial distribution for determining interval estimates for $p$. These confidence limits only require a table look-up associated with the chi-square distribution and are very accurate for large $n$ and small $p$.

Acknowledgment

This research was supported by the Institute for Computer Applications in Science and Engineering (ICASE). Their support is gratefully acknowledged. Helpful comments from Pam Burch, Herb Multhaup and Bruce Schmeiser are gratefully acknowledged.

References

[1] A.D. Aczel, Complete Business Statistics, Second Edition, Irwin, 1993.
[2] C. Blyth, "Approximate Binomial Confidence Limits", Journal of the American Statistical Association, Volume 81, Number 395, 1986, pp. 843-855.
[3] G. Casella and R. Berger, Statistical Inference, Wadsworth and Brooks/Cole, 1990.
[4] H. Chen, "The Accuracy of Approximate Intervals for a Binomial Parameter", Journal of the American Statistical Association, Volume 85, Number 410, 1990, pp. 514-518.

corresponding to the normal approximation rule $n\hat{p}(1-\hat{p}) \ge 10$
- the rule labeled "R4" is a plot of $\hat{p} = \frac{1}{2} - \frac{\sqrt{n(n-36)}}{2n}$ on the range [36, 100] corresponding to the normal approximation rule $n\hat{p}(1-\hat{p}) > 9$
- the rule labeled "R5" is a plot of $n \ge 20$ and $\hat{p} \le 0.05$ or $n \ge 100$ and $n\hat{p} \le 10$ corresponding to the guideline for using the Poisson approximation.

The $(n, \hat{p})$ combinations falling above the dotted curves for rules R1, R2, R3, and R4 correspond to those that would be used if the rules of thumb were followed. Clearly, rules R3 and R4 are significantly more conservative than R1 and R2.

Figure 3 is a continuation of Figure 2 for sample sizes larger than $n = 100$. Note that the vertical axis has been modified and the horizontal axis is logarithmic. The curve in the figure represents the largest value of $\hat{p}$ where the Poisson approximation to the binomial is superior to the normal approximation to the binomial. Since this relationship is linear, a rather unwieldy rule of thumb for $n$ between 100 and 10,000 is: use the normal approximation over the Poisson approximation if

$$\hat{p} > \frac{5.2 - \log_{10} n}{18.8}.$$

4 Conclusions

Although there are a number of different variations of the calculations that have been conducted here (e.g., one-sided confidence intervals, different significance levels, different definitions of error), there are three general conclusions:

- The traditional advice from most textbooks of using the normal and Poisson approximations to the binomial for the purpose of computing confidence intervals for $p$ should be tempered with a statement such as: "the Poisson approximation should be used when $n \ge 20$ and $p \le 0.05$ if the analyst can tolerate an error that may be as large as 0.04" (see Figure 2).

on each Bernoulli trial is arbitrary, we only consider the range $0 < \hat{p} \le \frac{1}{2}$. Figures 1, 2 and 3 have mirror images for the range $\frac{1}{2} \le \hat{p} < 1$.

Figure 1 contains a plot of $n$ versus $\hat{p}$ for $n = 2, 4, \ldots, 100$ and considers the range $0 < \hat{p} \le \frac{1}{2}$ for a maximum error of 0.01. Thus if the actual error for a particular $(n, \hat{p})$ pair is greater than 0.01, the point lands in the "Do not approximate" region. If one of the two approximations yields an error of less than 0.01, then the pair belongs to either the "Normal approximation" or "Poisson approximation" regions, depending on which yields a smaller error. Not surprisingly, the normal approximation performs better when the point estimate is closer to $\frac{1}{2}$ and the Poisson approximation performs better when the point estimate is closer to 0. Both approximations perform better as $n$ increases. In order to avoid any spurious discontinuities in the regions, the calculations were made for even values of $n$. The edges of the regions are not smooth because of the discrete natures of $n$ and $\hat{p}$. The boundaries of the approximation regions are those $(n, \hat{p})$ pairs where the error is less than 0.01. If the horizontal axis were extended, the normal and Poisson regions would meet at approximately $n = 150$. Mathematica [18] was used for the comparisons because of its ability to hold variables to arbitrary precision.

If the maximum error is relaxed to 0.04, then there are more cases where the approximations perform adequately. Figure 2 is analogous to Figure 1 but considers an error of 0.04. This figure also contains the rules of thumb associated with the normal and Poisson approximations to the binomial distribution. In particular,
- the rule labeled "R1" is a plot of $\hat{p} = 5/n$ on the range [10, 100] corresponding to the normal approximation rule $n\hat{p} \ge 5$ and $n(1-\hat{p}) \ge 5$
- the rule labeled "R2" is a plot of $\hat{p} = \frac{4}{4+n}$ on the range [4, 100] corresponding to the normal approximation rule $\hat{p} \pm 2\sqrt{\hat{p}(1-\hat{p})/n}$ falling in the interval (0, 1)
- the rule labeled "R3" is a plot of $\hat{p} = \frac{1}{2} - \frac{\sqrt{n(n-40)}}{2n}$ on the range [40, 100]

or

$$1 - \sum_{k=0}^{y-1} \frac{(n p_{PL})^k}{k!} e^{-n p_{PL}} = \alpha/2.$$

The left-hand side of this equation is the cumulative distribution function for an Erlang random variable with parameters $n p_{PL}$ and $y$ (denoted by $E_{n p_{PL},\,y}$) evaluated at one. Consequently,

$$P[E_{n p_{PL},\,y} \le 1] = \alpha/2.$$

Since $2 n p_{PL} E_{n p_{PL},\,y}$ is equivalent to a $\chi^2$ random variable with $2y$ degrees of freedom, this reduces to

$$P[\chi^2_{2y} < 2 n p_{PL}] = \alpha/2$$

or

$$p_{PL} = \frac{1}{2n} \chi^2_{2y,\,1-\alpha/2}.$$

By a similar line of reasoning, the upper limit based on the Poisson approximation to the binomial distribution is

$$p_{PU} = \frac{1}{2n} \chi^2_{2(y+1),\,\alpha/2}.$$

This approximation works best when $p$ is small (e.g., reliability applications where the probability of failure $p$ is small).

3 Comparison of the Approximate Methods

There are a multitude of different ways to compare the approximate confidence intervals with the exact values. We have decided to compute the error of an approximate two-sided confidence interval as the maximum error

$$\max\{|p_L - \tilde{p}_L|,\ |p_U - \tilde{p}_U|\}$$

where $\tilde{p}_L$ and $\tilde{p}_U$ are the approximate lower and upper bounds, respectively. This error is computed for all combinations of $n$ and $\hat{p}$. Since the definition of "success"

    fcrit = Quantile[FRatioDistribution[2 * y, 2 * (n - y + 1)], alpha/2]
    pl = 1 / ( 1 + (n - y + 1) / ( y * fcrit) )
    fcrit = Quantile[FRatioDistribution[2 * (y + 1), 2 * (n - y)], 1 - alpha/2]
    pu = 1 / ( 1 + (n - y) / ( (y + 1) * fcrit) )

This method is significantly faster than the approach using the binomial distribution, but encounters difficulty with determining the F ratio quantiles for some combinations of $n$ and $y$.

The first approximate confidence interval is based on the normal approximation to the binomial. The random variable

$$\frac{Y - np}{\sqrt{np(1-p)}}$$

is asymptotically standard normal. Thus an approximate confidence interval for $p$ is

$$\frac{Y}{n} - z_{\alpha/2}\sqrt{\frac{\frac{Y}{n}\left(1 - \frac{Y}{n}\right)}{n}} < p < \frac{Y}{n} + z_{\alpha/2}\sqrt{\frac{\frac{Y}{n}\left(1 - \frac{Y}{n}\right)}{n}}$$

where $z_{\alpha/2}$ is the $1 - \alpha/2$ fractile of the standard normal distribution. This approximation works best when $p = \frac{1}{2}$ (e.g., political polls). It allows confidence limits that fall outside of the interval [0, 1]. One should also be careful when $Y = 0$ or $Y = n$ since the confidence interval will have a width of 0.

The second approximate confidence interval is based on the Poisson approximation to the binomial (see, for example, Trivedi [16], page 498). This confidence interval does not appear as often in textbooks as the first approximate confidence interval. The random variable $Y$ is asymptotically Poisson with parameter $np$. Therefore, the exact lower bound $p_L$ satisfying

$$\sum_{k=y}^{n} \binom{n}{k} p_L^k (1-p_L)^{n-k} = \alpha/2$$

can be approximated with a Poisson lower limit $p_{PL}$ which satisfies

$$\sum_{k=y}^{\infty} \frac{(n p_{PL})^k}{k!} e^{-n p_{PL}} = \alpha/2$$

Since this probability is equal to $\alpha/2$ for a two-sided confidence interval,

$$F_{2y,\,2(n-y+1),\,1-\alpha/2} = \frac{(n-y+1)\,p_L}{y\,(1-p_L)}$$

or

$$p_L = \left[1 + \frac{n-y+1}{y\,F_{2y,\,2(n-y+1),\,1-\alpha/2}}\right]^{-1}.$$

In a similar fashion,

$$p_U = \left[1 + \frac{n-y}{(y+1)\,F_{2(y+1),\,2(n-y),\,\alpha/2}}\right]^{-1}.$$

Here the third subscript of $F$ is the right-hand tail area, so $F_{2y,\,2(n-y+1),\,1-\alpha/2}$ is the $\alpha/2$ quantile of the distribution. The next paragraph discusses numerical issues associated with determining these bounds.

The Mathematica (see [18]) code for solving the binomial equations numerically is

    pl = FindRoot[ Sum[Binomial[n, k] * p ^ k * (1 - p) ^ (n - k), {k, y, n}] == alpha/2, {p, y / n} ]
    pu = FindRoot[ Sum[Binomial[n, k] * p ^ k * (1 - p) ^ (n - k), {k, 0, y}] == alpha/2, {p, y / n} ]

for a given $n$, $y$ and $\alpha$. This code works well for small and moderate sized values of $n$. Some numerical instability occurred for larger values of $n$, so the well known relationship (Larsen and Marx [10], page 101) between the successive values of the probability mass function $f(x)$ of the binomial distribution

$$f(x) = \frac{(n-x+1)\,p}{x\,(1-p)}\, f(x-1), \qquad x = 1, 2, \ldots, n$$

was used to calculate the binomial cumulative distribution function. The Mathematica code for determining $p_L$ and $p_U$ using the F distribution is

the lower limit $p_L$ satisfies

$$\sum_{k=y}^{n} \binom{n}{k} p_L^k (1-p_L)^{n-k} = \alpha/2$$

where $y$ is the observed value of the random variable $Y$ and $1-\alpha$ is the nominal coverage of the confidence interval (see, for example, [10], page 279). For $y = 1, 2, \ldots, n-1$, the upper limit $p_U$ satisfies

$$\sum_{k=0}^{y} \binom{n}{k} p_U^k (1-p_U)^{n-k} = \alpha/2.$$

This confidence interval requires numerical methods to determine $p_L$ and $p_U$ and takes longer to calculate as $n$ increases. This interval will be used as a basis to check the approximate bounds reviewed later in this section. A figure showing the coverage probabilities for bounds of this type is shown in Blyth [2]. Following a derivation similar to his, a faster way to determine the lower and upper limits can be found.

Let $W_1, W_2, \ldots, W_n$ be iid U(0, 1) random variables. Let $Y$ be the number of the $W_i$'s that are less than $p$. Hence $Y$ is binomial with parameters $n$ and $p$. Using a result from page 233 of Casella and Berger [3], the order statistic $W \equiv W_{(y)}$ has the beta distribution with parameters $y$ and $n-y+1$. Since the events $Y \ge y$ and $W < p$ are equivalent, $P[Y \ge y]$ (which is necessary for determining $p_L$) can be calculated by

$$P(Y \ge y) = P(W < p) = \frac{\Gamma(n+1)}{\Gamma(y)\,\Gamma(n-y+1)} \int_0^p w^{y-1}(1-w)^{n-y}\,dw.$$

Using the substitution $t = \frac{(n-y+1)\,w}{y\,(1-w)}$ and simplifying yields

$$P(Y \ge y) = \frac{\Gamma(n+1)}{\Gamma(y)\,\Gamma(n-y+1)} \left(\frac{n-y+1}{y}\right)^{n-y+1} \int_0^{\frac{(n-y+1)p}{y(1-p)}} \frac{t^{y-1}}{\left(\frac{n-y+1}{y} + t\right)^{n+1}}\,dt = P\left[F_{2y,\,2(n-y+1)} < \frac{(n-y+1)\,p}{y\,(1-p)}\right].$$
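The identity between the binomial upper tail and the F cumulative distribution function can be checked numerically. The sketch below evaluates both sides for one arbitrary choice of $n$, $y$ and $p$; the function name `checkIdentity` and the test values are hypothetical and chosen only for illustration.

```mathematica
(* Hypothetical check, not from the paper: compare P(Y >= y) for a binomial(n, p)
   variable with the F-distribution probability derived above. *)
checkIdentity[n_Integer, y_Integer, p_?NumericQ] :=
 Module[{binomTail, fProb},
  (* Left side: P(Y >= y) = 1 - P(Y <= y - 1) *)
  binomTail = 1 - CDF[BinomialDistribution[n, p], y - 1];
  (* Right side: P[F < (n - y + 1) p / (y (1 - p))] with 2y and 2(n - y + 1) d.f. *)
  fProb = CDF[FRatioDistribution[2 y, 2 (n - y + 1)],
    (n - y + 1) p/(y (1 - p))];
  {binomTail, fProb}]

checkIdentity[20, 4, 0.3]
```

The two returned values should agree to within floating-point error for any valid inputs, since both are the same probability written in two ways.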

Poisson approximations to the binomial distribution. Determining a confidence interval for $p$ when the sample size is large using approximate methods is often needed in simulations with a large number of replications and in polling.

Computing probabilities using the normal and Poisson approximations is not considered here since work has been done on this problem. Ling [11] suggests using a relationship between the cumulative distribution functions of the binomial and F distributions to compute binomial probabilities. Ghosh [6] compares two confidence intervals for the Bernoulli parameter based on the normal approximation to the binomial distribution. Schader and Schmid [14] compare the maximum absolute error in computing the cumulative distribution function for the binomial distribution using the normal approximation with a continuity correction. They consider the two rules for determining whether the approximation should be used: $np$ and $n(1-p)$ are both greater than 5, and $np(1-p) > 9$. Their conclusion is that the relationship between the maximum absolute error and $p$ is approximately linear when considering the smallest possible sample sizes to satisfy the rules.

Concerning work done on confidence intervals for $p$, Blyth [2] has compared five approximate one-sided confidence intervals for $p$ based on the normal distribution. In addition, he uses the F distribution to reduce the amount of time necessary to compute an exact confidence interval. Using an arcsin transformation to improve the confidence limits is considered by Chen [4].

2 Confidence Interval Estimators for p

Two-sided confidence interval estimators for $p$ can be determined with the aid of numerical methods. One-sided confidence interval estimators are analogous. Let $p_L < p < p_U$ be an "exact" (see [2]) confidence interval for $p$. For $y = 1, 2, \ldots, n-1$,

1 Introduction

There is conflicting advice concerning the sample size necessary to use the normal approximation to the binomial distribution. For example, a sampling of textbooks recommend that the normal distribution be used to approximate the binomial distribution when:
- $np$ and $n(1-p)$ are both greater than 5 (see [1], page 211, [5], page 245, [7], page 304, [9], page 148, [16], page 497, [17], page 161)
- $p \pm 2\sqrt{p(1-p)/n}$ lies in the interval (0, 1) (see [15], page 242, [12], page 299)
- $np(1-p) \ge 10$ (see [13], page 171)
- $np(1-p) > 9$ (see [1], page 158).

Many other textbook authors give no specific advice concerning when the normal approximation should be used. To complicate matters further, most of this advice concerns using these approximations to compute probabilities. Whether these same rules of thumb apply to confidence intervals is seldom addressed.

The Poisson approximation, while less popular than the normal approximation to the binomial, is useful for large values of $n$ and small values of $p$. The same sampling of textbooks recommend that the Poisson distribution be used to approximate the binomial distribution when $n \ge 20$ and $p \le 0.05$ or $n \ge 100$ and $np \le 10$ (see [8], page 177, [5], page 204).

Let $X_1, X_2, \ldots, X_n$ be iid Bernoulli random variables with unknown parameter $p$ and let $Y = \sum_{i=1}^{n} X_i$ be a binomial random variable with parameters $n$ and $p$. The maximum likelihood estimator for $p$ is $\hat{p} = Y/n$, which is unbiased and consistent. The interest here is in confidence interval estimators for $p$. In particular, we want to compare the approximate confidence interval estimators based on the normal and

A Comparison of Approximate Interval Estimators for the Bernoulli Parameter

Lawrence Leemis
Department of Mathematics
College of William and Mary
Williamsburg, VA 23187-8795

Kishor S. Trivedi
Department of Electrical Engineering
Duke University, Box 90291
Durham, NC 27708-0291

The goal of this paper is to compare the accuracy of two approximate confidence interval estimators for the Bernoulli parameter $p$. The approximate confidence intervals are based on the normal and Poisson approximations to the binomial distribution. Charts are given to indicate which approximation is appropriate for certain sample sizes and point estimators.

KEY WORDS: Confidence interval, Binomial distribution, Bernoulli distribution, Poisson distribution.