University of Groningen. Statistical Auditing and the AOQL-method Talens, Erik

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version: Publisher's PDF, also known as Version of record
Publication date: 2005
Link to publication in University of Groningen/UMCG research database

Citation for published version (APA): Talens, E. (2005). Statistical Auditing and the AOQL-method. s.n.

Copyright: Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy: If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.

Download date: 24-03-2018

Chapter 4 Hypergeometric Distribution

The hypergeometric distribution plays a key role in statistical auditing. This chapter describes some important properties of the hypergeometric distribution that we use in subsequent chapters. Section 4.1 gives some elementary properties of the hypergeometric probability. It also gives some properties of the hypergeometric distribution function and of quotients of hypergeometric distribution functions. These properties will be very helpful in Chapter 5. Section 4.2 gives exact and approximate confidence intervals for the probability that a certain characteristic is present in a population. Finally, Section 4.3 shows how we can calculate hypergeometric probabilities in an efficient and accurate way. That section is essential to Chapter 6.

4.1 Properties of the hypergeometric distribution

Consider a population of $N$ elements. A number of these $N$ elements may have a certain characteristic that we are interested in, e.g. the number of travel declarations in a yearly population that were processed incorrectly. We denote this number by $M$. In auditing applications this characteristic is often unwanted, and therefore the value of $M$ is relatively small. This number is not known to us in advance. To get more information about $M$, a random sample of size $n$ is taken. The sample contains $K$ elements that have the characteristic of interest. The number $K$ in the sample follows a hypergeometric distribution with parameters $n$, $M$, and $N$. We write $K \sim H(n, M, N)$. We use a well-known extended definition of the binomial coefficients, that

will be very convenient in our algebraic manipulations with the hypergeometric distribution. Recall that
\[
\binom{p}{q} = \frac{p!}{q!\,(p-q)!} \quad \text{for } q = 0, 1, \ldots, p;\; p = 0, 1, 2, \ldots,
\]
where $0! = 1$ by definition. For other values of $p, q \in \mathbb{Z}$ it is defined as $\binom{p}{q} = 0$. Using this notation we do not have to incorporate the usual domain for $K$, namely $K = 0, \ldots, n$ with the restriction that $n - K \le N - M$ and $K \le M$. Thus for non-negative integers $k$ we have
\[
P\{K = k \mid n, M, N\} = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}. \tag{4.1.1}
\]
We know that
\[
E(K) = \frac{nM}{N} \quad \text{and} \quad \operatorname{Var}(K) = \frac{nM(N-M)(N-n)}{N^2(N-1)}.
\]
The following properties of the hypergeometric distribution hold. We refer to Lieberman and Owen (1961).

Property 4.1.1. The hypergeometric distribution has the following elementary properties:
\[
P\{K = k+1 \mid n, M, N\} = \frac{(M-k)(n-k)}{(k+1)(N-M-n+k+1)}\, P\{K = k \mid n, M, N\}
\]
\[
P\{K = k \mid n+1, M, N\} = \frac{(n+1)(N-M-n+k)}{(n+1-k)(N-n)}\, P\{K = k \mid n, M, N\}
\]
\[
P\{K = k \mid n, M+1, N\} = \frac{(M+1)(N-M-n+k)}{(M+1-k)(N-M)}\, P\{K = k \mid n, M, N\}
\]
\[
P\{K = k \mid n, M, N+1\} = \frac{(N+1-n)(N-M+1)}{(N-M-n+k+1)(N+1)}\, P\{K = k \mid n, M, N\}
\]

\[
P\{K = k \mid n, M, N\} = P\{K = n-k \mid n, N-M, N\} = P\{K = M-k \mid N-n, M, N\} = P\{K = N-M-n+k \mid N-n, N-M, N\}
\]
\[
P\{K \le k \mid n, M, N\} = 1 - P\{K \le n-k-1 \mid n, N-M, N\} = 1 - P\{K \le M-k-1 \mid N-n, M, N\} = P\{K \le N-M-n+k \mid N-n, N-M, N\}
\]

The following property is a very helpful tool. It shows that we are allowed to interchange $M$ and $n$ without affecting the hypergeometric probabilities. This property will frequently be used in this thesis.

Property 4.1.2. If the roles of $M$ and $n$ are interchanged, this does not affect the hypergeometric probabilities; i.e.
\[
P\{K = k \mid n, M, N\} = P\{K = k \mid M, n, N\}.
\]

The proof of this property is simple. A probabilistic explanation for Property 4.1.2 is given in Davidson and Johnson (1993). Notice that $P\{K = k \mid n, M, N\}$ is a unimodal function of $k$, see Johnson, Kotz and Kemp (1992). It takes on its maximum for the largest integer that does not exceed $\frac{(M+1)(n+1)}{N+2}$. If $\frac{(M+1)(n+1)}{N+2}$ is an integer, say $c$, then it takes on its maximum for this integer $c$, but also for $c - 1$.

4.1.1 Properties of Λ(n, M, N)

We introduce the following notation,
\[
\Lambda(n, M, N) = P\{K \le k_0 \mid n, M, N\} = \sum_{k=0}^{k_0} \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}. \tag{4.1.2}
\]
This notation suppresses the dependence on $k_0$, because in most of our applications we will not allow the value of $k_0$ to vary. Unless stated otherwise, $k_0$ will be considered fixed in the sequel. In fact we will be more interested in the behaviour of $\Lambda$ as a function of $n$, $M$, and $N$. We will discuss some properties of $\Lambda(n, M, N)$ that are especially useful in Chapter 6.
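The definitions (4.1.1) and (4.1.2) translate directly into code. The following is a minimal Python sketch, not the algorithm developed in Section 4.3: it evaluates the probabilities naively with exact rational arithmetic. The function names `binom`, `hyper_pmf` and `hyper_cdf` are illustrative choices and do not come from the thesis.

```python
from fractions import Fraction
from math import comb


def binom(p: int, q: int) -> int:
    """Extended binomial coefficient: zero unless 0 <= q <= p."""
    if p < 0 or q < 0 or q > p:
        return 0
    return comb(p, q)


def hyper_pmf(k: int, n: int, M: int, N: int) -> Fraction:
    """P{K = k | n, M, N} as in (4.1.1), computed exactly."""
    return Fraction(binom(M, k) * binom(N - M, n - k), binom(N, n))


def hyper_cdf(k0: int, n: int, M: int, N: int) -> Fraction:
    """Lambda(n, M, N) = P{K <= k0 | n, M, N} as in (4.1.2)."""
    return sum((hyper_pmf(k, n, M, N) for k in range(k0 + 1)), Fraction(0))
```

Because the extended binomial coefficient vanishes outside its usual domain, no explicit bounds on $k$ are needed. Interchanging the arguments `n` and `M` leaves `hyper_pmf` unchanged, in line with Property 4.1.2.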

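Theorem 4.1.1. The following properties hold for $\Lambda(n, M, N)$:

1. $\Lambda(n, M, N) = \Lambda(M, n, N)$.
2. $\Lambda(n, M, N) = 1$ if and only if $M \le k_0$ or $n \le k_0$.
3. $\Lambda(n, M, N) = 0$ if and only if $M > N - n + k_0$.
4. Let $M \in \{0, \ldots, N-1\}$, then
\[
\Lambda(n, M+1, N) = \Lambda(n, M, N) - \frac{\binom{M}{k_0}\binom{N-M-1}{n-k_0-1}}{\binom{N}{n}},
\]
or, equivalently,
\[
\Lambda(n, M+1, N) = \Lambda(n, M, N) - \frac{n}{N}\, P\{K = k_0 \mid n-1, M, N-1\} \quad \text{for } n \in \{1, \ldots, N\}.
\]
5. Let $M \in \{0, \ldots, N-1\}$, then $\Lambda(n, M+1, N) \le \Lambda(n, M, N)$. The inequality is strict if and only if $k_0 \le M \le N - n + k_0$ and $n > k_0$.
6. Let $n \in \{0, \ldots, N-1\}$, then $\Lambda(n+1, M, N) \le \Lambda(n, M, N)$. The inequality is strict if and only if $k_0 \le n \le N - M + k_0$ and $M > k_0$.

Proof. Parts 1, 2, and 3 immediately follow from Property 4.1.2, (4.1.2), and the definition of the hypergeometric distribution, respectively. Using Pascal's triangle we obtain
\[
\binom{N}{n} \Lambda(n, M+1, N) = \sum_{k=0}^{k_0} \binom{M+1}{k}\binom{N-M-1}{n-k}
= \sum_{k=0}^{k_0} \left( \binom{M}{k} + \binom{M}{k-1} \right) \binom{N-M-1}{n-k}
\]
\[
= \sum_{k=0}^{k_0} \binom{M}{k}\binom{N-M-1}{n-k} + \sum_{k=0}^{k_0-1} \binom{M}{k}\binom{N-M-1}{n-k-1}
\]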

and hence
\[
\binom{N}{n} \Lambda(n, M, N) = \sum_{k=0}^{k_0} \binom{M}{k}\binom{N-M}{n-k}
= \sum_{k=0}^{k_0} \binom{M}{k} \left( \binom{N-M-1}{n-k} + \binom{N-M-1}{n-k-1} \right)
\]
\[
= \sum_{k=0}^{k_0} \binom{M}{k}\binom{N-M-1}{n-k} + \sum_{k=0}^{k_0} \binom{M}{k}\binom{N-M-1}{n-k-1}
= \binom{N}{n} \Lambda(n, M+1, N) + \binom{M}{k_0}\binom{N-M-1}{n-k_0-1}.
\]
Summations with empty index sets are equal to zero by definition. This proves the first result of part 4. Its second result is obvious. Part 5 follows immediately from part 4. Part 6 follows from part 5 by applying the result from part 1.

Theorem 4.1.1, part 5 shows that the probability of accepting the population is decreasing in $M$. Part 6 shows that this probability also decreases if a larger sample is taken. These facts are in accordance with intuition.

4.1.2 Properties of λ(n, M, N)

This subsection focuses on the quotient of $\Lambda(n, M+1, N)$ and $\Lambda(n, M, N)$, which plays a key role in proving some of the properties in Chapter 5. This quotient is defined by
\[
\lambda(n, M, N) =
\begin{cases}
\dfrac{\Lambda(n, M+1, N)}{\Lambda(n, M, N)} > 0 & \text{if } M < N - n + k_0, \\[2mm]
0 & \text{if } N - n + k_0 \le M \le N - 1.
\end{cases}
\tag{4.1.3}
\]
According to Theorem 4.1.1, part 3 this ratio is well-defined for $M \le N - n + k_0$, and in the special case $M = N - n + k_0$ it is equal to zero. Obviously $0 \le \lambda \le 1$, according to Theorem 4.1.1, part 4. A number of properties of $\lambda$ are collected in the following theorem.

Theorem 4.1.2. The following properties hold for $\lambda(n, M, N) \in [0, 1]$.

1. $\lambda(n, M, N) = 1$ if and only if $M < k_0$ or $n \le k_0$.
2. $\lambda(n, M, N) = 0$ if and only if $M \ge N - n + k_0$.

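3. Let $n > k_0$. If $k_0 \le M \le N - n + k_0$, then it can be written as
\[
\lambda(n, M, N) = 1 - \frac{1}{g(n, M, N)}, \quad \text{where} \quad
g(n, M, N) = \frac{\binom{N}{n} \Lambda(n, M, N)}{\binom{M}{k_0}\binom{N-M-1}{n-k_0-1}} > 0.
\]
4. Let $M \in \{0, \ldots, N-1\}$, then $\lambda(n, M, N) \ge \lambda(n, M+1, N)$. The inequality is strict if and only if $\max(0, k_0 - 1) \le M \le N - n + k_0 - 1$ and $n > k_0$.
5. Let $n, M \in \{0, \ldots, N-1\}$, then $\lambda(n, M, N) \ge \lambda(n+1, M, N)$. The inequality is strict if and only if $k_0 \le n \le N - M + k_0 - 1$ and $M > k_0$.
6. If $M \ge k_0$, $n > k_0$ and $N \ge n + M - k_0$, then $\lambda(n, M, N) < \lambda(n, M, N+1)$.

Proof. Part 1 follows from (4.1.2) and Theorem 4.1.1, parts 2 and 5. Part 2 is obvious. Part 3 follows from Theorem 4.1.1, parts 3, 4, and 5. Now we prove part 4. For $n \le k_0$, part 4 follows trivially from part 1. Therefore, we assume $n > k_0$. Using part 3 we derive for $k_0 \le M \le N - n + k_0$ that
\[
g(n, M, N) = \frac{\binom{N}{n} \Lambda(n, M, N)}{\binom{M}{k_0}\binom{N-M-1}{n-k_0-1}}
= \sum_{k=\max(0,\, n+M-N)}^{k_0} \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{M}{k_0}\binom{N-M-1}{n-k_0-1}}
\]
\[
= (N - M) \sum_{k=0}^{k_0} \frac{k_0!}{k!} \prod_{h=k+1}^{k_0} \frac{N - M - n + h}{M - h + 1} \prod_{j=k}^{k_0} \frac{1}{n - j}. \tag{4.1.4}
\]
Notice that from Theorem 4.1.1, part 5 and the parts 1 and 2 just established it follows that
\[
0 < \lambda(n, M, N) < 1 \quad \text{for } k_0 \le M \le N - n + k_0 - 1,
\]
\[
1 = \lambda(n, 0, N) = \ldots = \lambda(n, k_0 - 1, N) > \lambda(n, k_0, N) \quad \text{for } k_0 \ge 1,
\]
and
\[
\lambda(n, N - n + k_0 - 1, N) > \lambda(n, N - n + k_0, N) = \ldots = \lambda(n, N - 1, N) = 0.
\]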

For $k_0 = 0$ there is no $M$ such that $\lambda(n, M, N) = 1$. Now it remains to prove that $\lambda(n, M, N) > \lambda(n, M+1, N)$ or, equivalently, $g(n, M, N) > g(n, M+1, N)$, for $k_0 \le M \le N - n + k_0 - 2$. This follows from (4.1.4), because $g(n, M, N)$ is a decreasing function of $M$ on this interval. This concludes the proof of part 4.

Notice that for $M < k_0$, part 5 follows trivially from part 1. Therefore, we assume $M \ge k_0$. From Theorem 4.1.1, part 6 and the parts 1 and 2 just proved, it follows that
\[
0 < \lambda(n, M, N) < 1 \quad \text{for } k_0 + 1 \le n \le N - M + k_0 - 1,
\]
\[
1 = \lambda(0, M, N) = \ldots = \lambda(k_0, M, N) > \lambda(k_0 + 1, M, N),
\]
and
\[
\lambda(N - M + k_0 - 1, M, N) > \lambda(N - M + k_0, M, N) = \ldots = \lambda(N - 1, M, N) = 0.
\]
To complete the proof of part 5 we have to prove that $g(n, M, N) > g(n+1, M, N)$ for $k_0 + 1 \le n \le N - M + k_0 - 2$. This follows from (4.1.4), because $g(n, M, N)$ is a decreasing function of $n$ on this interval. This concludes the proof of part 5.

From (4.1.4) we can see that $g(n, M, N) < g(n, M, N+1)$ and hence $\lambda(n, M, N) < \lambda(n, M, N+1)$ for $M \ge k_0$, $n > k_0$ and $N \ge n + M - k_0$. This concludes the proof of part 6.

Remark 4.1.1. Theorem 4.1.2, parts 4 and 5 imply logconcavity of the cumulative hypergeometric distribution function in the arguments $n$ and $M$ in all possible points, and even strict logconcavity on a relevant subset. Here, logconcavity of a function $f$ on the non-negative integers is defined as
\[
f(x+2)\, f(x) \le [f(x+1)]^2, \quad x = 0, 1, \ldots
\]
Strictness occurs if the inequality is strict.

4.2 Confidence sets

The value of $M$ is not known to us, but after taking a random sample of size $n$ we can give a point estimate and construct a confidence interval for $M$. Suppose we observe $k$ items with the characteristic of interest in the sample. The maximum likelihood estimator for $M$ is then given by the largest integer not exceeding $K\frac{N+1}{n}$, i.e. $\left\lfloor K\frac{N+1}{n} \right\rfloor$. If $K\frac{N+1}{n}$ is an integer, then $K\frac{N+1}{n} - 1$ and $K\frac{N+1}{n}$ both maximize the likelihood. This is not an unbiased estimator. The unbiased

estimator is given by $\frac{N}{n} K$ and an unbiased estimator for its variance is
\[
\frac{N(N-n)}{n-1} \cdot \frac{K}{n} \left( 1 - \frac{K}{n} \right).
\]
Only providing point estimators for $M$ will not suffice. To quantify the uncertainty, we would also like to give confidence interval estimators. We prefer to give exact confidence intervals. Here, by exact we mean that we use the underlying hypergeometric distribution and not some approximation of this distribution. Due to the discrete character of the hypergeometric distribution it is possible to construct confidence sets instead of confidence intervals. Although from a practical point of view we prefer confidence intervals, we cannot exclude the possibility of confidence sets that are not confidence intervals.

If we observe $K = k$, where $k \in \{0, \ldots, n\}$, we would like to associate with this value of $k$, for $\alpha \in (0, 1)$, a subset of possible values of $M \in \{0, \ldots, N\}$. We call this subset $M_C(k)$, and we want to state that $M_C(K)$ contains the true value of $M$ with probability of at least $1 - \alpha$, or, in symbols,
\[
P\{M_C(K) \ni M \mid M\} \ge 1 - \alpha, \quad \text{for every } M \in \{0, \ldots, N\}. \tag{4.2.1}
\]
The quantity $1 - \alpha$ is called the confidence level. The probability in equation (4.2.1) is called the coverage probability for $M$. Due to the discrete character of $M$ it is not possible to exactly attain the nominal confidence level $1 - \alpha$ without using randomized methods (Wright, 1997). Randomized methods always attain the exact nominal confidence level. We will not consider these methods. The methods discussed here are conservative, meaning that the confidence level will be at least $1 - \alpha$.

We have to construct $M_C(K)$, i.e. $M_C(0), \ldots, M_C(n)$, in such a way that (4.2.1) is satisfied for every $M \in \{0, \ldots, N\}$. We first notice that, given the true value of $M$, the probability that $M_C(K)$ contains $M$ is the same as the total probability of observing those values of $k$ for which $M_C(k)$ contains $M$. Let $R(M)$ be the set containing all these values of $k$, i.e. $R(M) = \{k \mid M_C(k) \ni M\}$, then we can rewrite the left-hand side of (4.2.1) in the following way
\[
P\{M_C(K) \ni M \mid M\} = \sum_{k \in R(M)} P\{K = k \mid n, M, N\}. \tag{4.2.2}
\]

Remember that $K \sim H(n, M, N)$. Now, suppose we construct for every $M \in \{0, \ldots, N\}$ sets $R^*(M)$ with values of $k$ such that
\[
\sum_{k \in R^*(M)} P\{K = k \mid n, M, N\} \ge 1 - \alpha
\]
and let $M_C^*(k)$ be the set of all values of $M$ for which $k \in R^*(M)$. It is obvious from (4.2.2) that by using $M_C(K) = M_C^*(K)$, and also $R(M) = R^*(M)$, we have found a way to define $M_C(K)$ such that equation (4.2.1) is satisfied for every $M \in \{0, \ldots, N\}$. There are various methods to define $R^*(M)$ to construct confidence sets. We will discuss two of these methods here.

4.2.1 Test-method

We call $M_C(k)$ a confidence set. In those cases where the confidence sets $M_C(K)$ actually turn out to be confidence intervals $[M_L(K), M_U(K)]$, we speak of a $100(1-\alpha)\%$ two-sided confidence interval with lower confidence bound $M_L(K)$ and upper confidence bound $M_U(K)$. Since we know that the hypergeometric distribution function is a unimodal function of $k$, we can construct $R(M)$ in the following way. For every $M \in \{0, \ldots, N\}$ the set $R(M)$ contains all values of $k$ for which
\[
P\{K \ge k \mid M\} > \beta \quad \text{and} \quad P\{K \le k \mid M\} > \gamma, \quad \beta + \gamma = \alpha.
\]
First, we will consider the case $\beta = \gamma = \frac{\alpha}{2}$. Note that $\min(R(M))$ and $\max(R(M))$ are non-decreasing functions of $M$. This ensures that the confidence set $M_C(K) = \{M \mid K \in R(M)\}$ will always be a confidence interval. If we observe $K = k$, then the lower and upper confidence interval limits are given by
\[
M_L(k) = \text{smallest integer } M \text{ s.t. } P\{K \ge k\} > \frac{\alpha}{2}
\]
and
\[
M_U(k) = \text{largest integer } M \text{ s.t. } P\{K \le k\} > \frac{\alpha}{2}.
\]
This method coincides with generating a confidence interval by inverting a family of hypothesis tests for $M$. That is why this method is called the test-method. It also appears to be the same method as described by Katz (1953), Konijn (1973) and Wright (1991).

Buonaccorsi (1987) showed that this method is always superior to the one described by Cochran (1977), in the sense that it always delivers confidence intervals that are shorter than the confidence intervals suggested by Cochran. Cochran's intervals were the finite population analog of the method by Clopper and Pearson (1934) for the construction of confidence intervals for a binomial fraction.

Other values of $\beta$ and $\gamma$ could also be considered. A very interesting case is $\beta = 0$ and $\gamma = \alpha$. This is the case of only giving an upper confidence bound. Bickel and Doksum (1977) showed that this bound will be uniformly most accurate, because if the inverse test method is used, then the corresponding tests are uniformly most powerful.

4.2.2 Likelihood-method

We could also construct $R(M)$ in the following way. For every $M \in \{0, \ldots, N\}$ we sort the values of $k$ according to the size of the accompanying probabilities. Therefore, $k_{(1)}$ has the largest probability, $k_{(2)}$ has the next largest and so forth. If ties occur between $k_{(i)}$ and $k_{(i+1)}$, then the ordering is not strict. We deal with this issue later. This means that
\[
P\{K = k_{(1)} \mid M\} \ge P\{K = k_{(2)} \mid M\} \ge \ldots \ge P\{K = k_{(n+1)} \mid M\}.
\]
Now, for every $M \in \{0, \ldots, N\}$ we construct $R(M)$ in such a way that it consists of the smallest possible number of elements, say $k^*(M)$, such that
\[
\sum_{i=1}^{k^*(M)} P\{K = k_{(i)} \mid M\} \ge 1 - \alpha.
\]
Because the elements are selected based on their likelihood, we call the confidence set $M_C(K) = \{M \mid K \in R(M)\}$ obtained in this way a likelihood confidence set. This method was first described by Wendell and Schmee (2001). We will call $\min M_C(K)$ the lower confidence bound $M_L(K)$ and $\max M_C(K)$ the upper confidence bound $M_U(K)$. Using this method it is possible that the confidence sets produced are not confidence intervals: gaps can occur. A practical solution is to take the interval $[M_L(K), M_U(K)]$. Some theoretical solutions are suggested by Wendell and Schmee. They also show that the occurrence of

these gaps is seldom. Using this method ties can occur. Ties occur when
\[
P\{K = k_{(k^*(M))} \mid M\} = P\{K = k_{(k^*(M)+1)} \mid M\}.
\]
These ties often occur when the hypergeometric distribution is symmetric for lower and upper tail probabilities. Suppose $k_{(k^*(M))} < k_{(k^*(M)+1)}$, then if we choose $k_{(k^*(M))}$ to add to $R(M)$ this means that $M_U(k_{(k^*(M))})$ is less tight and that $M_L(k_{(k^*(M)+1)})$ is tighter compared to the choice of $k_{(k^*(M)+1)}$. Of course this choice has to be made before we start sampling.

[Figure 4.1: plot of $k$ (vertical axis) against $M$ (horizontal axis, 0 to 20) for the test method and the likelihood method.]

Figure 4.1. Comparison of the 90%-confidence intervals of the test-method and the likelihood-method for $n = 5$ and $N = 20$.

Wendell and Schmee also showed by simulation studies that this method performs well in comparison to the test-method. Figure 4.1 gives a comparison of the two methods for a 90%-confidence interval with $n = 5$ and $N = 20$. Notice

that in this case for $k = 1$ and $k = 4$ the confidence intervals are equally long. In all other cases the likelihood-method produces shorter intervals. It is also possible that the test-method produces shorter intervals, but the study of Wendell and Schmee shows that this will not occur very often.

4.2.3 Approximate confidence sets

Instead of using the exact hypergeometric distribution to obtain confidence sets for $M$, in certain cases approximations of this distribution can also be used. We use these approximations to find confidence intervals for $p = \frac{M}{N}$. Of course confidence intervals for $M$ can be obtained by multiplying with the population size $N$. We will describe three approximations: the approximation by the binomial distribution, the approximation by the Poisson distribution, and the approximation by the normal distribution.

The question arises when we are allowed to use a certain approximation. Text books give so-called rules of thumb. However, these rules differ among text books, and are almost always given without any quantitative assessment of the quality of such approximations. Therefore, we should not pay too much attention to rules of thumb. Schader and Schmid (1992) showed that two rules of thumb for approximating the binomial distribution by the normal distribution are of dubious quality in numerical accuracy. Leemis and Kishor (1996) investigated rules of thumb for normal and Poisson approximations of the binomial distribution. From their article we can see, especially when we look at it from an auditing point of view (in which the proportions are usually very small), that using rules of thumb without any quantitative assessment of the quality of the approximations should be avoided. Therefore, if possible we should use an exact approach.

We will apply these approximations to the test-method with $\beta = \gamma = \frac{\alpha}{2}$. Therefore, in terms of $p$ our problem focuses on solving the following equations to find the smallest integer value of $N p_L$ such that
\[
P\{K \ge k \mid p = p_L\} = \sum_{i=k}^{n} \frac{\binom{N p_L}{i}\binom{N(1-p_L)}{n-i}}{\binom{N}{n}} > \frac{\alpha}{2},
\]

and the largest integer value of $N p_U$ such that
\[
P\{K \le k \mid p = p_U\} = \sum_{i=0}^{k} \frac{\binom{N p_U}{i}\binom{N(1-p_U)}{n-i}}{\binom{N}{n}} > \frac{\alpha}{2}.
\]
Note that $p_L$ and $p_U$ are elements of $\{0, 1/N, 2/N, \ldots, 1\}$. Our $(1-\alpha)$-confidence interval for $p$ becomes $[p_L, p_U]$. In certain cases we can approximate the hypergeometric distribution by another discrete or even continuous distribution.

Binomial approximation

For relatively small values of $p$ and large values of $N$ we can approximate the hypergeometric distribution by the binomial distribution. As a rule of thumb $p < 0.1$ and $N \ge 60$ is sometimes used. Now, $p_L$ and $p_U$ are elements of $[0, 1]$, and we have to solve the following problem. Find $p_L$ and $p_U$ such that
\[
P\{K \ge k \mid p = p_L\} = \sum_{i=k}^{n} \binom{n}{i} p_L^i (1 - p_L)^{n-i} = \frac{\alpha}{2},
\]
and
\[
P\{K \le k \mid p = p_U\} = \sum_{i=0}^{k} \binom{n}{i} p_U^i (1 - p_U)^{n-i} = \frac{\alpha}{2}.
\]
This confidence interval is known as the Clopper-Pearson confidence interval for $p$ (Clopper and Pearson, 1934). The following relationship relates the tail of a binomial distribution with the tail of an F-distribution:
\[
\sum_{i=0}^{k} \binom{n}{i} p^i (1-p)^{n-i} = P\left\{ Y \le \frac{(1-p)(k+1)}{p(n-k)} \right\}, \quad Y \sim F(2(n-k), 2(k+1)).
\]
A proof can be found in Leemis and Kishor (1996). Now, it follows immediately that
\[
p_L = \frac{1}{1 + \frac{n-k+1}{k}\, F_{1-\frac{\alpha}{2}}(2(n-k+1), 2k)}
\]
and
\[
p_U = \frac{1}{1 + \frac{n-k}{k+1}\, F_{\frac{\alpha}{2}}(2(n-k), 2(k+1))},
\]

where $F_{1-\frac{\alpha}{2}}(\cdot, \cdot)$ and $F_{\frac{\alpha}{2}}(\cdot, \cdot)$ denote the $100(1-\alpha/2)$th and the $100(\alpha/2)$th percentile of the F-distribution. Many statistical software packages provide the percentiles of the F-distribution. For large degrees of freedom numerical problems can occur; in that case approximate methods could be used. Vollset (1993) compared thirteen methods that produce two-sided confidence intervals for the binomial proportion. Newcombe (1998) further examined seven of these methods. The Clopper-Pearson method is known to be rather conservative, meaning that the coverage probabilities usually exceed $1 - \alpha$. Very often approximate methods such as adjusted Wald intervals or continuity corrected score intervals are suggested to tackle this problem (e.g. Vollset, 1993; Leemis and Kishor, 1996). Blyth and Still (1983) remark that the Clopper-Pearson method is only an approximation of the exact interval and consider procedures with correct confidence coefficient. These methods give numerical results that are very similar to the approach with the acceptability function of Blaker and Spjøtvoll (2000).

Poisson approximation

For small values of $p$ and extremely large values of $n$ the Poisson approximation can be used. As a rule of thumb $p < 0.01$ and $n \ge 1000$ is sometimes used. Now, $p_L$ and $p_U$ are elements of $[0, 1]$ again, and we have to solve the following problem. Find $p_L$ and $p_U$ such that
\[
P\{K \ge k \mid p = p_L\} = \sum_{i=k}^{\infty} \frac{e^{-n p_L} (n p_L)^i}{i!} = \frac{\alpha}{2},
\]
and
\[
P\{K \le k \mid p = p_U\} = \sum_{i=0}^{k} \frac{e^{-n p_U} (n p_U)^i}{i!} = \frac{\alpha}{2}.
\]
The following relationship relates the tail of a Poisson distribution with the tail of a $\chi^2$-distribution:
\[
\sum_{i=0}^{k-1} \frac{e^{-np} (np)^i}{i!} = P\{Y > 2np\}, \quad Y \sim \chi^2(2k).
\]
A proof can be found in Johnson et al. (1992). Now, it follows immediately that
\[
p_L = \frac{1}{2n}\, \chi^2_{\frac{\alpha}{2}}(2k)
\]

and
\[
p_U = \frac{1}{2n}\, \chi^2_{1-\frac{\alpha}{2}}(2(k+1)),
\]
where $\chi^2_{\frac{\alpha}{2}}(\cdot)$ and $\chi^2_{1-\frac{\alpha}{2}}(\cdot)$ denote the $100(\alpha/2)$th and the $100(1-\alpha/2)$th percentile of the $\chi^2$-distribution. This confidence interval is also conservative. It is possible to increase some of the lower endpoints and decrease some of the higher endpoints and still satisfy the coverage requirement. Examples can be found in Crow and Gardner (1959), Casella and Robert (1989), and Kabaila and Byrne (2001).

Normal approximation

We can also use the normal distribution to approximate the hypergeometric distribution. To do so the rule of thumb $np \ge 4$ is sometimes used. We can approximate the hypergeometric distribution by a normal distribution with mean and variance equal to the mean and variance of $K$. Therefore, $p_L$ and $p_U$ are elements of $[0, 1]$ again, and using continuity corrections we have to solve the following problem. Find $p_L$ and $p_U$ such that
\[
P\{K \ge k \mid p = p_L\} = 1 - \Phi\left( \frac{k - 0.5 - n p_L}{\sqrt{n p_L (1 - p_L) \frac{N-n}{N-1}}} \right) = \frac{\alpha}{2},
\]
and
\[
P\{K \le k \mid p = p_U\} = \Phi\left( \frac{k + 0.5 - n p_U}{\sqrt{n p_U (1 - p_U) \frac{N-n}{N-1}}} \right) = \frac{\alpha}{2},
\]
where $\Phi$ denotes the standard normal distribution function. Solving these equations gives the following confidence interval
\[
[p_L, p_U] = \frac{1}{2u} \left[ u - n + (2k \pm 1) \pm \sqrt{ u^2 - \frac{2u}{n} \left( (k \pm 0.5)^2 + (n - k \mp 0.5)^2 \right) + (2k \pm 1 - n)^2 } \right],
\]
where
\[
u = n + \frac{N-n}{N-1}\, Z^2_{1-\frac{\alpha}{2}},
\]
with $Z_{1-\frac{\alpha}{2}}$ the $100(1-\alpha/2)$th percentile of the standard normal distribution. The lower signs give $p_L$ and the upper signs give $p_U$. More simplified versions of this approximation are also used.
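The closed-form solution of the continuity-corrected normal equations can be sketched in a few lines of Python. This is a minimal sketch: the function name `normal_approx_interval` and the handling of the boundary cases $k = 0$ and $k = n$ (where the corresponding bound is set to 0 or 1) are assumptions made here, not taken from the text.

```python
from math import sqrt
from statistics import NormalDist


def normal_approx_interval(k, n, N, alpha):
    """Continuity-corrected normal approximation [p_L, p_U] for p = M / N."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{1 - alpha/2}
    c = (N - n) / (N - 1)                     # finite population correction
    u = n + c * z * z

    def root(a, sign):
        # Roots of (a - n p)^2 = z^2 c n p (1 - p), with a = k -/+ 0.5.
        disc = u * u - (2 * u / n) * (a * a + (n - a) ** 2) + (2 * a - n) ** 2
        return (u - n + 2 * a + sign * sqrt(disc)) / (2 * u)

    # Assumption: at the boundaries the trivial bounds 0 and 1 are used.
    p_low = 0.0 if k == 0 else root(k - 0.5, -1)
    p_up = 1.0 if k == n else root(k + 0.5, +1)
    return p_low, p_up
```

A quick sanity check is to substitute the returned bounds back into the two defining equations: the standardized value at $p_L$ should equal $Z_{1-\alpha/2}$ and the one at $p_U$ should equal $-Z_{1-\alpha/2}$.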

Ling and Pratt (1984) compared several normal approximations for the hypergeometric distribution. They show that especially the so-called Peizer approximations turn out to be very accurate. These complicated approximations originate from an unpublished paper by Peizer. However, these approximations are not invertible in closed form. Molenaar (1973) gave two relatively simpler normal approximations that are invertible in closed form, but still give very complicated solutions. These approximations will probably give more accurate bounds than the method described above.

A crude approximation can be obtained by using the approximate normality of the estimator of $p$, with mean equal to the unbiased estimator for $p$, i.e. $\frac{K}{n}$, and variance equal to the unbiased estimator for the variance of this estimator, i.e.
\[
\frac{N-n}{N(n-1)} \cdot \frac{K}{n} \left( 1 - \frac{K}{n} \right).
\]
If we also correct for continuity, then we find the following confidence interval
\[
[p_L, p_U] = \left[ \frac{k}{n} \pm \left( Z_{1-\frac{\alpha}{2}} \sqrt{ \frac{N-n}{N(n-1)} \cdot \frac{k}{n} \left( 1 - \frac{k}{n} \right) } + \frac{1}{2n} \right) \right].
\]

4.3 Computing the hypergeometric distribution

Theorem 4.1.1, part 4 can be used to find some recursive properties that we will use in calculating the hypergeometric distribution. It shows that we can compute $\Lambda(n, M, N)$ from $\Lambda(n, M+1, N)$ by using the hypergeometric probability $P\{K = k_0 \mid n-1, M, N-1\}$. But suppose that we already calculated $\Lambda(n, M+1, N)$ from $\Lambda(n, M+2, N)$, then we can use this step to facilitate the computation of $P\{K = k_0 \mid n-1, M, N-1\}$. Property 4.3.1 gives a few examples of this.

Property 4.3.1. The following recursive properties facilitate the computation of the hypergeometric distribution.

1. If $k_0 \le M \le N - n + k_0 - 1$ and $k_0 + 1 \le n \le N - 1$, then
\[
\Lambda(n, M, N) = \Lambda(n, M+1, N) + C_1(n, M, N)
\]
with

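\[
C_1(n, M, N) = \frac{M - k_0 + 1}{M + 1} \cdot \frac{N - M - 1}{N - M - n + k_0}\, C_1(n, M+1, N).
\]
If $k_0 + 1 \le n \le N - 1$, then
\[
C_1(n, N - n + k_0, N) = \frac{n!\,(N - n + k_0)!}{k_0!\, N!}.
\]

2. If $k_0 + 1 \le M \le N - n + k_0 + 1$ and $k_0 + 2 \le n \le N$, then
\[
\Lambda(n, M, N) = \Lambda(n-1, M, N) - C_2(n, M, N)
\]
with
\[
C_2(n, M, N) = \frac{N - M}{N - n + 1} \cdot \frac{M - k_0}{n - k_0 - 1}\, C_1(n-1, M, N).
\]

3. If $k_0 \le M \le N - n + k_0$ and $k_0 + 1 \le n \le N$, then
\[
\Lambda(n, M, N) = \Lambda(n, M+1, N) + C_1(n, M, N)
\]
with
\[
C_1(n, M, N) = \frac{n}{M + 1}\, C_2(n, M+1, N).
\]

4. If $k_0 + 1 \le M \le N - n + k_0$ and $k_0 \le n \le N - 1$, then
\[
\Lambda(n, M, N) = \Lambda(n+1, M-1, N) + C_3(n, M, N)
\]
with
\[
C_3(n, M, N) = \frac{M - n - 1}{n + 1}\, C_1(n+1, M-1, N).
\]

Proof. First we prove part 1. Using Theorem 4.1.1, part 4 we find
\[
\Lambda(n, M, N) = \Lambda(n, M+1, N) + C_1(n, M, N)
\quad \text{with} \quad
C_1(n, M, N) = \frac{\binom{M}{k_0}\binom{N-M-1}{n-k_0-1}}{\binom{N}{n}},
\]
and
\[
\Lambda(n, M+1, N) = \Lambda(n, M+2, N) + C_1(n, M+1, N)
\]
with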

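\[
C_1(n, M+1, N) = \frac{\binom{M+1}{k_0}\binom{N-M-2}{n-k_0-1}}{\binom{N}{n}}.
\]
From Theorem 4.1.1, part 4 we notice that $C_1(n, M, N) > 0$ and $C_1(n, M+1, N) > 0$ if $k_0 \le M \le N - n + k_0 - 1$ and $k_0 + 1 \le n \le N - 1$. Combining the expressions for $C_1(n, M, N)$ and $C_1(n, M+1, N)$ gives
\[
C_1(n, M, N) = \frac{M - k_0 + 1}{M + 1} \cdot \frac{N - M - 1}{N - M - n + k_0}\, C_1(n, M+1, N).
\]
For $M = N - n + k_0$ and $k_0 + 1 \le n \le N - 1$ we find
\[
C_1(n, N - n + k_0, N) = \Lambda(n, N - n + k_0, N) - \Lambda(n, N - n + k_0 + 1, N) = \Lambda(n, N - n + k_0, N) = \frac{n!\,(N - n + k_0)!}{k_0!\, N!}.
\]
In proving part 2 we again use Theorem 4.1.1, part 4 in combination with part 1. Using this theorem we find
\[
\Lambda(n, M, N) = \Lambda(n-1, M, N) - C_2(n, M, N)
\quad \text{with} \quad
C_2(n, M, N) = \frac{\binom{n-1}{k_0}\binom{N-n}{M-k_0-1}}{\binom{N}{M}},
\]
and
\[
\Lambda(n-1, M, N) = \Lambda(n-1, M+1, N) + C_1(n-1, M, N)
\quad \text{with} \quad
C_1(n-1, M, N) = \frac{\binom{M}{k_0}\binom{N-M-1}{n-k_0-2}}{\binom{N}{n-1}}.
\]
Observe that $C_2(n, M, N) > 0$ and $C_1(n-1, M, N) > 0$ if $k_0 + 1 \le M \le N - n + k_0 + 1$ and $k_0 + 2 \le n \le N$. Combining the expressions for $C_2(n, M, N)$ and $C_1(n-1, M, N)$ gives
\[
C_2(n, M, N) = \frac{N - M}{N - n + 1} \cdot \frac{M - k_0}{n - k_0 - 1}\, C_1(n-1, M, N).
\]
To prove part 3 we use the previous results
\[
\Lambda(n, M, N) = \Lambda(n, M+1, N) + C_1(n, M, N)
\]
with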

\[
C_1(n, M, N) = \frac{\binom{M}{k_0}\binom{N-M-1}{n-k_0-1}}{\binom{N}{n}},
\]
and
\[
\Lambda(n, M+1, N) = \Lambda(n-1, M+1, N) - C_2(n, M+1, N)
\quad \text{with} \quad
C_2(n, M+1, N) = \frac{\binom{n-1}{k_0}\binom{N-n}{M-k_0}}{\binom{N}{M+1}}.
\]
Note that $C_1(n, M, N) > 0$ and $C_2(n, M+1, N) > 0$ if $k_0 \le M \le N - n + k_0$ and $k_0 + 1 \le n \le N$. Combining the expressions of $C_1(n, M, N)$ and $C_2(n, M+1, N)$ gives
\[
C_1(n, M, N) = \frac{n}{M + 1}\, C_2(n, M+1, N).
\]
In proving part 4 we use Theorem 4.1.1, part 4 in combination with part 1 and find
\[
\Lambda(n, M, N) = \Lambda(n+1, M, N) + C_2(n+1, M, N)
\quad \text{with} \quad
C_2(n+1, M, N) = \frac{\binom{n}{k_0}\binom{N-n-1}{M-k_0-1}}{\binom{N}{M}}.
\]
We again use Theorem 4.1.1, part 4 to find
\[
\Lambda(n+1, M, N) = \Lambda(n+1, M-1, N) - C_1(n+1, M-1, N)
\quad \text{with} \quad
C_1(n+1, M-1, N) = \frac{\binom{M-1}{k_0}\binom{N-M}{n-k_0}}{\binom{N}{n+1}}.
\]
Note that $C_1(n+1, M-1, N) > 0$ and $C_2(n+1, M, N) > 0$ if $k_0 + 1 \le M \le N - n + k_0$ and $k_0 \le n \le N - 1$. Combining the expressions of $C_1(n+1, M-1, N)$ and $C_2(n+1, M, N)$ gives
\[
C_2(n+1, M, N) = \frac{M}{n + 1}\, C_1(n+1, M-1, N).
\]

Using the previous results we find
\[
\Lambda(n, M, N) = \Lambda(n+1, M, N) + C_2(n+1, M, N)
= \Lambda(n+1, M, N) + \frac{M}{n+1}\, C_1(n+1, M-1, N)
\]
\[
= \Lambda(n+1, M-1, N) - C_1(n+1, M-1, N) + \frac{M}{n+1}\, C_1(n+1, M-1, N)
= \Lambda(n+1, M-1, N) + \frac{M - n - 1}{n + 1}\, C_1(n+1, M-1, N).
\]

Table 4.1. The values of Λ(n, M, 8) for k_0 = 2.

M \ n   0   1   2   3        4        5        6        7       8
0       1   1   1   1        1        1        1        1       1
1       1   1   1   1        1        1        1        1       1
2       1   1   1   1        1        1        1        1       1
3       1   1   1   0.9821   0.9286   0.8214   0.6429   0.375   0
4       1   1   1   0.9286   0.7571   0.5      0.2143   0       0
5       1   1   1   0.8214   0.5      0.1786   0        0       0
6       1   1   1   0.6429   0.2143   0        0        0       0
7       1   1   1   0.375    0        0        0        0       0
8       1   1   1   0        0        0        0        0       0

Table 4.1 gives the values of $\Lambda(n, M, 8)$ for all possible combinations of $n$ and $M$ with $k_0 = 2$. This table gives an illustration of Property 4.3.1. First we can use Theorem 4.1.1, parts 2 and 3. This gives us $\Lambda(n, M, 8) = 1$ if $n \le 2$ or $M \le 2$, and $\Lambda(n, M, 8) = 0$ for all combinations of $n$ and $M$ for which $n + M > 10$. We start computing this table with $\Lambda(3, 7, 8)$. Using Property 4.3.1, part 1 it immediately follows that
\[
\Lambda(3, 7, 8) = C_1(3, 7, 8) = \frac{3!\,(8 - 3 + 2)!}{2!\, 8!} = 3/8 = 0.375;
\]
note that $\Lambda(3, 8, 8) = 0$. Again by using Property 4.3.1, part 1 we can calculate

$\Lambda(3, 6, 8)$:
\[
\Lambda(3, 6, 8) = \Lambda(3, 7, 8) + C_1(3, 6, 8)
= 3/8 + \frac{6 - 2 + 1}{6 + 1} \cdot \frac{8 - 6 - 1}{8 - 6 - 3 + 2} \cdot 3/8
= 3/8 + 5/7 \cdot 3/8 = 3/8 + 15/56 = 9/14 \approx 0.6429.
\]
We can repeat this procedure until we have found $\Lambda(3, 3, 8)$, and by then we have found $\Lambda(3, M, 8)$ for all possible values of $M$. Since $\Lambda(4, M, 8) = 0$ for $M > 6$, the first non-zero entry of the next column is $\Lambda(4, 6, 8)$, which we can calculate by using Property 4.3.1, part 2:
\[
\Lambda(4, 6, 8) = \Lambda(3, 6, 8) - \frac{8 - 6}{8 - 4 + 1} \cdot \frac{6 - 2}{4 - 2 - 1}\, C_1(3, 6, 8)
= 9/14 - 2/5 \cdot 4 \cdot 15/56 = 9/14 - 3/7 = 3/14 \approx 0.2143.
\]
Since $\Lambda(4, 7, 8) = 0$, it follows that $C_1(4, 6, 8) = 3/14$. Now we can apply Property 4.3.1, part 1 to find the remaining values of $\Lambda(4, M, 8)$. By repeating the procedure above the table can be completed.

Sometimes we have to use the terms of $\Lambda$ to find a recursive expression, for instance if we would like to calculate $\Lambda(n, M, N)$ from $\Lambda(n, M, N-1)$ or from $\Lambda(n, M-1, N-1)$. We introduce the following notation. We write $P(n, M, N)$ as a $(k_0 + 1)$-vector with elements
\[
P_j(n, M, N) = P\{K = j\} = \frac{\binom{M}{j}\binom{N-M}{n-j}}{\binom{N}{n}}, \quad j = 0, \ldots, k_0,
\]
and $\iota = (1, \ldots, 1)'$ a $(k_0 + 1)$-vector. Now, it follows that
\[
\Lambda(n, M, N) = \iota' P(n, M, N). \tag{4.3.1}
\]
How we compute the probabilities $P_j(n, M, N)$ from $P_j(n, M, N-1)$ is shown in the following property.

Property 4.3.2. If $M \ge k_0$, $n \ge k_0$ and $N > n$, then for $j = 0, \ldots, k_0$
\[
P_j(n, M, N) =
\begin{cases}
0 & \text{if } j < n + M - N, \\[1mm]
\dfrac{j+1}{N}\, P_{j+1}(n, M, N-1) & \text{if } j = n + M - N < k_0, \\[2mm]
\dbinom{M}{j} \Big/ \dbinom{N}{n} & \text{if } j = n + M - N = k_0, \\[2mm]
\dfrac{(N-n)(N-M)}{N(N-M-n+j)}\, P_j(n, M, N-1) & \text{if } j > n + M - N.
\end{cases}
\]
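Property 4.3.2 can be checked numerically by comparing one update step against direct evaluation of the definition. The sketch below is only an illustration: the helper names `pmf_vector` and `step_population` are hypothetical, and it additionally assumes $M \le N - 1$ so that the vector for population size $N - 1$ exists.

```python
from fractions import Fraction
from math import comb


def pmf_vector(n, M, N, k0):
    """Direct evaluation of P_j(n, M, N), j = 0..k0, from the definition."""
    out = []
    for j in range(k0 + 1):
        if j > M or n - j < 0 or n - j > N - M:
            out.append(Fraction(0))
        else:
            out.append(Fraction(comb(M, j) * comb(N - M, n - j), comb(N, n)))
    return out


def step_population(prev, n, M, N, k0):
    """Property 4.3.2: obtain P(n, M, N) from prev = P(n, M, N - 1)."""
    low = n + M - N  # smallest j with positive probability when low >= 0
    out = []
    for j in range(k0 + 1):
        if j < low:
            out.append(Fraction(0))
        elif j == low and j < k0:
            out.append(Fraction(j + 1, N) * prev[j + 1])
        elif j == low:  # here j == low == k0
            out.append(Fraction(comb(M, j), comb(N, n)))
        else:
            factor = Fraction((N - n) * (N - M), N * (N - M - n + j))
            out.append(factor * prev[j])
    return out
```

Summing the entries of the returned vector gives $\Lambda(n, M, N)$, as in (4.3.1).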

Proof. The cases $j < n + M - N$, $j > n + M - N$ and $j = k_0 = n + M - N$ follow immediately from the definition of the hypergeometric probability. Note that if $M \ge k_0$, $n \ge k_0$ and $N > n$, then $P_j(n, M, N-1) > 0$ implies that $P_j(n, M, N) > 0$. For $j = n + M - N < k_0$ the probability $P_j(n, M, N-1)$ equals zero, but the probability $P_{j+1}(n, M, N-1)$ does have a positive value. It is not difficult to see that for $j = n + M - N < k_0$
\[
P_j(n, M, N) = \frac{\binom{M}{j}}{\binom{N}{n}} = \frac{j+1}{N} \cdot \frac{\binom{M}{j+1}}{\binom{N-1}{n}} = \frac{j+1}{N}\, P_{j+1}(n, M, N-1).
\]
Notice that once $n + M - N \le 0$, all elements of $P(n, M, N)$ are positive.

We can find a similar property if we would like to compute the probability $P_j(n, M, N)$ from the probability $P_j(n, M-1, N-1)$.

Property 4.3.3. If $M > k_0$, $n \ge k_0$ and $N > n$, then for $j = 0, \ldots, k_0$
\[
P_j(n, M, N) =
\begin{cases}
0 & \text{if } j < n + M - N, \\[1mm]
\dfrac{M(N-n)}{N(M-j)}\, P_j(n, M-1, N-1) & \text{if } j \ge n + M - N.
\end{cases}
\]

Proof. This follows immediately from the definition of the hypergeometric probability. If $M > k_0$, $n \ge k_0$ and $N > n$, then $P_j(n, M-1, N-1) > 0$ implies that $P_j(n, M, N) > 0$.

The properties we derived here will be essential in developing the algorithms that we describe in Chapters 5 and 6. These properties enable the algorithms to be efficient and accurate.
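To illustrate how such recursions avoid recomputing binomial coefficients, the following Python sketch fills one column of Table 4.1 using only Property 4.3.1, part 1, together with Theorem 4.1.1, parts 2 and 3. It is a sketch under stated assumptions rather than the algorithm of Chapters 5 and 6: the function name `lambda_column` is hypothetical, and exact fractions are used instead of floating point so that the results can be compared with the worked example.

```python
from fractions import Fraction
from math import factorial


def lambda_column(n, N, k0):
    """Lambda(n, M, N) for M = 0..N via Property 4.3.1, part 1."""
    if n <= k0:
        return [Fraction(1)] * (N + 1)   # Theorem 4.1.1, part 2
    lam = [Fraction(0)] * (N + 1)        # zero for M > N - n + k0 (part 3)
    top = N - n + k0
    # Starting value C_1(n, N - n + k0, N) = n! (N - n + k0)! / (k0! N!).
    c1 = Fraction(factorial(n) * factorial(top), factorial(k0) * factorial(N))
    lam[top] = c1
    for M in range(top - 1, k0 - 1, -1):
        # C_1(n, M, N) obtained from C_1(n, M + 1, N).
        c1 *= Fraction((M - k0 + 1) * (N - M - 1), (M + 1) * (N - M - n + k0))
        lam[M] = lam[M + 1] + c1
    for M in range(k0):
        lam[M] = Fraction(1)             # Theorem 4.1.1, part 2
    return lam
```

Calling `lambda_column(3, 8, 2)` gives the column $n = 3$ of Table 4.1, including $\Lambda(3, 7, 8) = 3/8$ and $\Lambda(3, 6, 8) = 9/14$. The recursion ends at $\Lambda(n, k_0, N) = 1$, which provides a simple consistency check on the accumulated terms.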