University of Groningen. Statistical Auditing and the AOQL-method Talens, Erik

University of Groningen

Statistical Auditing and the AOQL-method
Talens, Erik

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version: Publisher's PDF, also known as Version of record

Publication date: 2005

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA): Talens, E. (2005). Statistical Auditing and the AOQL-method. s.n.

Copyright: Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy: If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.

Download date: 24-03-2018

Chapter 4

Hypergeometric Distribution

The hypergeometric distribution plays a key role in statistical auditing. This chapter describes some important properties of the hypergeometric distribution that we use in subsequent chapters. Section 4.1 gives some elementary properties of the hypergeometric probability. It also gives some properties of the hypergeometric distribution function and of quotients of hypergeometric distribution functions. These properties will be very helpful in Chapter 5. Section 4.2 gives exact and approximate confidence intervals for the probability that a certain characteristic is present in a population. Finally, Section 4.3 shows how we can calculate hypergeometric probabilities in an efficient and accurate way. That section is essential to Chapter 6.

4.1 Properties of the hypergeometric distribution

Consider a population of $N$ elements. A number of these $N$ elements may have a certain characteristic that we are interested in, e.g. the number of travel declarations in a yearly population which were processed incorrectly. We denote this number by $M$. In auditing applications this characteristic is often unwanted, and therefore the value of $M$ is relatively small. This number is not known to us in advance. To get more information about $M$, a random sample of size $n$ is taken. The sample contains $K$ elements that have the characteristic of interest. The number $K$ in the sample follows a hypergeometric distribution with parameters $n$, $M$, and $N$. We write $K \sim H(n, M, N)$. We use a well-known extended definition of the binomial coefficients that will be very convenient in our algebraic manipulations with the hypergeometric distribution. Recall that

$$\binom{p}{q} = \frac{p!}{q!\,(p-q)!} \quad \text{for } q = 0, 1, \ldots, p;\ p = 0, 1, 2, \ldots,$$

where $0! = 1$ by definition. For other values of $p, q \in \mathbb{Z}$ it is defined as $\binom{p}{q} = 0$. Using this notation we do not have to incorporate the usual domain for $K$, namely $K = 0, \ldots, n$ with the restriction that $n - K \leq N - M$ and $K \leq M$. Thus for non-negative integers $k$ we have

$$P\{K = k \mid n, M, N\} = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}. \tag{4.1.1}$$

We know that

$$E(K) = \frac{nM}{N} \quad \text{and} \quad \operatorname{Var}(K) = \frac{nM(N-M)(N-n)}{N^2(N-1)}.$$

The following properties of the hypergeometric distribution hold. We refer to Lieberman and Owen (1961).

Property 4.1.1. The hypergeometric distribution has the following elementary properties:

$$P\{K = k+1 \mid n, M, N\} = \frac{(M-k)(n-k)}{(k+1)(N-M-n+k+1)}\, P\{K = k \mid n, M, N\}$$

$$P\{K = k \mid n+1, M, N\} = \frac{(n+1)(N-M-n+k)}{(n+1-k)(N-n)}\, P\{K = k \mid n, M, N\}$$

$$P\{K = k \mid n, M+1, N\} = \frac{(M+1)(N-M-n+k)}{(M+1-k)(N-M)}\, P\{K = k \mid n, M, N\}$$

$$P\{K = k \mid n, M, N+1\} = \frac{(N+1-n)(N-M+1)}{(N-M-n+k+1)(N+1)}\, P\{K = k \mid n, M, N\}$$

$$P\{K = k \mid n, M, N\} = P\{K = n-k \mid n, N-M, N\} = P\{K = M-k \mid N-n, M, N\} = P\{K = N-M-n+k \mid N-n, N-M, N\}$$

$$P\{K \leq k \mid n, M, N\} = 1 - P\{K \leq n-k-1 \mid n, N-M, N\} = 1 - P\{K \leq M-k-1 \mid N-n, M, N\} = P\{K \leq N-M-n+k \mid N-n, N-M, N\}$$

The following property is a very helpful tool: it shows that we are allowed to interchange $M$ and $n$ without affecting the hypergeometric probabilities. This property will frequently be used in this thesis.

Property 4.1.2. If the roles of $M$ and $n$ are interchanged, this does not affect the hypergeometric probabilities; i.e.

$$P\{K = k \mid n, M, N\} = P\{K = k \mid M, n, N\}.$$

The proof of this property is simple. A probabilistic explanation for Property 4.1.2 is given in Davidson and Johnson (1993). Notice that $P\{K = k \mid n, M, N\}$ is a unimodal function of $k$, see Johnson, Kotz and Kemp (1992). It takes on its maximum for the largest integer that does not exceed $\frac{(M+1)(n+1)}{N+2}$. If $\frac{(M+1)(n+1)}{N+2}$ is an integer, say $c$, then it takes on its maximum for this integer $c$, but also for $c - 1$.

4.1.1 Properties of $\Lambda(n, M, N)$

We introduce the following notation,

$$\Lambda(n, M, N) = P\{K \leq k_0 \mid n, M, N\} = \sum_{k=0}^{k_0} \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}. \tag{4.1.2}$$

This notation suppresses the dependence on $k_0$, because in most of our applications we will not allow the value of $k_0$ to vary. Unless stated otherwise $k_0$ will be considered fixed in the sequel. In fact we will be more interested in the behaviour of $\Lambda$ as a function of $n$, $M$, and $N$. We will discuss some properties of $\Lambda(n, M, N)$ that are especially useful in Chapter 6.
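The definitions (4.1.1) and (4.1.2), the interchange rule of Property 4.1.2, and the mode rule can be checked numerically. The following is a minimal Python sketch using exact rational arithmetic; the function names (`ext_comb`, `hyper_pmf`, `big_lambda`, `mode_by_rule`) are illustrative choices and do not come from the thesis.

```python
from fractions import Fraction
from math import comb


def ext_comb(p: int, q: int) -> int:
    """Extended binomial coefficient: zero whenever p < 0, q < 0 or q > p."""
    if p < 0 or q < 0 or q > p:
        return 0
    return comb(p, q)


def hyper_pmf(k: int, n: int, M: int, N: int) -> Fraction:
    """P{K = k | n, M, N} from (4.1.1), as an exact fraction."""
    return Fraction(ext_comb(M, k) * ext_comb(N - M, n - k), comb(N, n))


def big_lambda(n: int, M: int, N: int, k0: int) -> Fraction:
    """Lambda(n, M, N) = P{K <= k0 | n, M, N} from (4.1.2)."""
    return sum((hyper_pmf(k, n, M, N) for k in range(k0 + 1)), Fraction(0))


def mode_by_rule(n: int, M: int, N: int) -> int:
    """Largest integer not exceeding (M + 1)(n + 1) / (N + 2)."""
    return (M + 1) * (n + 1) // (N + 2)


if __name__ == "__main__":
    n, M, N = 5, 7, 20
    probs = [hyper_pmf(k, n, M, N) for k in range(n + 1)]
    print("total probability:", sum(probs))
    print("argmax of pmf:", probs.index(max(probs)), "rule:", mode_by_rule(n, M, N))
    print("symmetric in n and M:",
          all(hyper_pmf(k, n, M, N) == hyper_pmf(k, M, n, N) for k in range(n + 1)))
```

Because the extended binomial coefficient returns zero outside its usual domain, no explicit support check on $k$ is needed, which mirrors the convenience the text attributes to that definition.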

Theorem 4.1.1. The following properties hold for $\Lambda(n, M, N)$:

1. $\Lambda(n, M, N) = \Lambda(M, n, N)$.

2. $\Lambda(n, M, N) = 1$ if and only if $M \leq k_0$ or $n \leq k_0$.

3. $\Lambda(n, M, N) = 0$ if and only if $M > N - n + k_0$.

4. Let $M \in \{0, \ldots, N-1\}$, then

$$\Lambda(n, M+1, N) = \Lambda(n, M, N) - \frac{\binom{M}{k_0}\binom{N-M-1}{n-k_0-1}}{\binom{N}{n}},$$

or, equivalently,

$$\Lambda(n, M+1, N) = \Lambda(n, M, N) - \frac{n}{N}\, P\{K = k_0 \mid n-1, M, N-1\}$$

for $n \in \{1, \ldots, N\}$.

5. Let $M \in \{0, \ldots, N-1\}$, then $\Lambda(n, M+1, N) \leq \Lambda(n, M, N)$. The inequality is strict if and only if $k_0 \leq M \leq N - n + k_0$ and $n > k_0$.

6. Let $n \in \{0, \ldots, N-1\}$, then $\Lambda(n+1, M, N) \leq \Lambda(n, M, N)$. The inequality is strict if and only if $k_0 \leq n \leq N - M + k_0$ and $M > k_0$.

Proof. Parts 1, 2, and 3 immediately follow from Property 4.1.2, (4.1.2), and the definition of the hypergeometric distribution, respectively. Using Pascal's triangle we obtain

$$\begin{aligned}
\binom{N}{n}\Lambda(n, M+1, N) &= \sum_{k=0}^{k_0} \binom{M+1}{k}\binom{N-M-1}{n-k} \\
&= \sum_{k=0}^{k_0} \left(\binom{M}{k} + \binom{M}{k-1}\right)\binom{N-M-1}{n-k} \\
&= \sum_{k=0}^{k_0} \binom{M}{k}\binom{N-M-1}{n-k} + \sum_{k=0}^{k_0-1} \binom{M}{k}\binom{N-M-1}{n-k-1}
\end{aligned}$$

and hence

$$\begin{aligned}
\binom{N}{n}\Lambda(n, M, N) &= \sum_{k=0}^{k_0} \binom{M}{k}\binom{N-M}{n-k} \\
&= \sum_{k=0}^{k_0} \binom{M}{k}\left(\binom{N-M-1}{n-k} + \binom{N-M-1}{n-k-1}\right) \\
&= \sum_{k=0}^{k_0} \binom{M}{k}\binom{N-M-1}{n-k} + \sum_{k=0}^{k_0} \binom{M}{k}\binom{N-M-1}{n-k-1} \\
&= \binom{N}{n}\Lambda(n, M+1, N) + \binom{M}{k_0}\binom{N-M-1}{n-k_0-1}.
\end{aligned}$$

Summations with empty index sets are equal to zero by definition. This proves the first result of part 4. Its second result is obvious. Part 5 follows immediately from part 4. Part 6 follows from part 5 by applying the result from part 1.

Theorem 4.1.1, part 5 shows that the probability of accepting the population is decreasing in $M$. Part 6 shows that this probability also decreases if a larger sample is taken. These facts are in accordance with intuition.

4.1.2 Properties of $\lambda(n, M, N)$

This subsection focuses on the quotient of $\Lambda(n, M+1, N)$ and $\Lambda(n, M, N)$, which plays a key role in proving some of the properties in Chapter 5. This quotient is defined by

$$\lambda(n, M, N) = \begin{cases} \dfrac{\Lambda(n, M+1, N)}{\Lambda(n, M, N)} > 0 & \text{if } M < N - n + k_0, \\[2ex] 0 & \text{if } N - n + k_0 \leq M \leq N - 1. \end{cases} \tag{4.1.3}$$

According to Theorem 4.1.1, part 3 this ratio is well-defined for $M \leq N - n + k_0$, and in the special case $M = N - n + k_0$ it is equal to zero. Obviously $0 \leq \lambda \leq 1$, according to Theorem 4.1.1, part 4. A number of properties of $\lambda$ are collected in the following theorem.

Theorem 4.1.2. The following properties hold for $\lambda(n, M, N) \in [0, 1]$.

1. $\lambda(n, M, N) = 1$ if and only if $M < k_0$ or $n \leq k_0$.

2. $\lambda(n, M, N) = 0$ if and only if $M \geq N - n + k_0$.

3. Let $n > k_0$. If $k_0 \leq M \leq N - n + k_0$, then it can be written as

$$\lambda(n, M, N) = 1 - \frac{1}{g(n, M, N)}, \quad \text{where} \quad g(n, M, N) = \frac{\binom{N}{n}\Lambda(n, M, N)}{\binom{M}{k_0}\binom{N-M-1}{n-k_0-1}} > 0.$$

4. Let $M \in \{0, \ldots, N-1\}$, then $\lambda(n, M, N) \geq \lambda(n, M+1, N)$. The inequality is strict if and only if $\max(0, k_0 - 1) \leq M \leq N - n + k_0 - 1$ and $n > k_0$.

5. Let $n, M \in \{0, \ldots, N-1\}$, then $\lambda(n, M, N) \geq \lambda(n+1, M, N)$. The inequality is strict if and only if $k_0 \leq n \leq N - M + k_0 - 1$ and $M > k_0$.

6. If $M \geq k_0$, $n > k_0$ and $N \geq n + M - k_0$, then $\lambda(n, M, N) < \lambda(n, M, N+1)$.

Proof. Part 1 follows from (4.1.2) and Theorem 4.1.1, parts 2 and 5. Part 2 is obvious. Part 3 follows from Theorem 4.1.1, parts 3, 4, and 5. Now we prove part 4. For $n \leq k_0$, part 4 follows trivially from part 1. Therefore, we assume $n > k_0$. Using part 3 we derive for $k_0 \leq M \leq N - n + k_0$ that

$$\begin{aligned}
g(n, M, N) &= \frac{\binom{N}{n}\Lambda(n, M, N)}{\binom{M}{k_0}\binom{N-M-1}{n-k_0-1}} = \sum_{k=\max(0,\, n+M-N)}^{k_0} \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{M}{k_0}\binom{N-M-1}{n-k_0-1}} \\
&= \sum_{k=0}^{k_0} \frac{k_0!}{k!}\,(N-M) \prod_{h=k+1}^{k_0} \frac{N-M-n+h}{M-h+1} \prod_{j=k}^{k_0} \frac{1}{n-j}.
\end{aligned} \tag{4.1.4}$$

Notice that from Theorem 4.1.1, part 5 and the parts 1 and 2 just established it follows that $0 < \lambda(n, M, N) < 1$ for $k_0 \leq M \leq N - n + k_0 - 1$, and

$$1 = \lambda(n, 0, N) = \ldots = \lambda(n, k_0 - 1, N) > \lambda(n, k_0, N) \quad \text{for } k_0 \geq 1,$$

and

$$\lambda(n, N-n+k_0-1, N) > \lambda(n, N-n+k_0, N) = \ldots = \lambda(n, N-1, N) = 0.$$

For $k_0 = 0$ there is no $M$ such that $\lambda(n, M, N) = 1$. Now it remains to prove that $\lambda(n, M, N) > \lambda(n, M+1, N)$ or, equivalently, $g(n, M, N) > g(n, M+1, N)$, for $k_0 \leq M \leq N - n + k_0 - 2$. This follows from (4.1.4), because $g(n, M, N)$ is a decreasing function of $M$ on this interval. This concludes the proof of part 4.

Notice that for $M < k_0$, part 5 follows trivially from part 1. Therefore, we assume $M \geq k_0$. From Theorem 4.1.1, part 6 and the parts 1 and 2 just proved, it follows that $0 < \lambda(n, M, N) < 1$ for $k_0 + 1 \leq n \leq N - M + k_0 - 1$, and

$$1 = \lambda(0, M, N) = \ldots = \lambda(k_0, M, N) > \lambda(k_0 + 1, M, N),$$

and

$$\lambda(N-M+k_0-1, M, N) > \lambda(N-M+k_0, M, N) = \ldots = \lambda(N-1, M, N) = 0.$$

To complete the proof of part 5 we have to prove that $g(n, M, N) > g(n+1, M, N)$ for $k_0 + 1 \leq n \leq N - M + k_0 - 2$. This follows from (4.1.4), because $g(n, M, N)$ is a decreasing function of $n$ on this interval. This concludes the proof of part 5.

From (4.1.4) we can see that $g(n, M, N) < g(n, M, N+1)$ and hence $\lambda(n, M, N) < \lambda(n, M, N+1)$ for $M \geq k_0$, $n > k_0$ and $N \geq n + M - k_0$. This concludes the proof of part 6.

Remark 4.1.1. Theorem 4.1.2, parts 4 and 5 imply logconcavity of the cumulative hypergeometric distribution function in the arguments $n$ and $M$ in all possible points, and even strict logconcavity on a relevant subset. Here, logconcavity of a function $f$ on the non-negative integers is defined as

$$f(x+2)\, f(x) \leq [f(x+1)]^2, \quad x = 0, 1, \ldots$$

Strictness occurs if the inequality is strict.

4.2 Confidence sets

The value of $M$ is not known to us, but after taking a random sample of size $n$ we can give a point estimate and construct a confidence interval for $M$. Suppose we observe $k$ items with the characteristic of interest in the sample. The maximum likelihood estimator for $M$ is then given by the largest integer not exceeding $K\frac{N+1}{n}$, i.e. $\left\lfloor K\frac{N+1}{n} \right\rfloor$. If $K\frac{N+1}{n}$ is an integer, then $K\frac{N+1}{n} - 1$ and $K\frac{N+1}{n}$ both maximize the likelihood. This is not an unbiased estimator. The unbiased estimator is given by $K\frac{N}{n}$, and an unbiased estimator for its variance is

$$\frac{N(N-n)}{n-1}\,\frac{K}{n}\left(1 - \frac{K}{n}\right).$$

Only providing point estimators for $M$ will not suffice. To quantify the uncertainty, we would also like to give confidence interval estimators. We prefer to give exact confidence intervals. Here, by exact we mean that we use the underlying hypergeometric distribution and not some approximation of this distribution. Due to the discrete character of the hypergeometric distribution it is possible to construct confidence sets instead of confidence intervals. Although from a practical view we prefer confidence intervals, we cannot exclude the possibility of confidence sets that are not confidence intervals.

If we observe $K = k$, where $k \in \{0, \ldots, n\}$, we would like to find a way to associate with this value of $k$, for $\alpha \in (0, 1)$, a subset of possible values of $M \in \{0, \ldots, N\}$. We call this subset $M_C(k)$, and we want to state that $M_C(K)$ contains the true value of $M$ with probability of at least $1 - \alpha$, or, in symbols,

$$P\{M_C(K) \ni M \mid M\} \geq 1 - \alpha, \quad \text{for every } M \in \{0, \ldots, N\}. \tag{4.2.1}$$

The quantity $1 - \alpha$ is called the confidence level. The probability in equation (4.2.1) is called the coverage probability for $M$. Due to the discrete character of $M$ it is not possible to exactly attain the nominal confidence level $1 - \alpha$ without using randomized methods (Wright, 1997). These methods will always attain the exact nominal confidence level. We will not consider these methods. The methods discussed here are conservative, meaning that the confidence level will be at least $1 - \alpha$.

We have to construct $M_C(K)$, i.e. $M_C(0), \ldots, M_C(n)$, in such a way that (4.2.1) is satisfied for every $M \in \{0, \ldots, N\}$. We first notice that, given the true value of $M$, the probability that $M_C(K)$ contains $M$ is the same as the total probability of observing those values of $k$ for which $M_C(k)$ contains $M$. Let $R(M)$ be the set containing all these values of $k$, i.e. $R(M) = \{k \mid M_C(k) \ni M\}$, then we can rewrite the left-hand side of (4.2.1) in the following way:

$$P\{M_C(K) \ni M \mid M\} = \sum_{k \in R(M)} P\{K = k \mid n, M, N\}. \tag{4.2.2}$$
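The rewriting in (4.2.2) suggests a direct way to evaluate the coverage probability of any candidate procedure: collect the values of $k$ whose confidence set contains $M$ and add up their hypergeometric probabilities. The Python sketch below illustrates this under stated assumptions. The procedure `band_around_estimate` is a purely hypothetical confidence set (integers within a fixed distance of $kN/n$) used only to have something to evaluate; it is not a method from the thesis, and all function names are illustrative.

```python
from fractions import Fraction
from math import comb


def hyper_pmf(k, n, M, N):
    """P{K = k | n, M, N}; zero outside the support."""
    if k < 0 or k > M or n - k < 0 or n - k > N - M:
        return Fraction(0)
    return Fraction(comb(M, k) * comb(N - M, n - k), comb(N, n))


def acceptance_region(conf_set, n, M):
    """R(M) = {k : M_C(k) contains M} for a procedure k -> M_C(k)."""
    return [k for k in range(n + 1) if M in conf_set(k)]


def coverage(conf_set, n, M, N):
    """Right-hand side of (4.2.2): total probability of R(M) when M is true."""
    region = acceptance_region(conf_set, n, M)
    return sum((hyper_pmf(k, n, M, N) for k in region), Fraction(0))


def band_around_estimate(k, n=5, N=20, half_width=6):
    """Hypothetical confidence set: integers within half_width of k * N / n."""
    centre = round(k * N / n)
    return set(range(max(0, centre - half_width), min(N, centre + half_width) + 1))


if __name__ == "__main__":
    worst = min(coverage(band_around_estimate, 5, M, 20) for M in range(21))
    print("minimum coverage over M:", float(worst))
```

A procedure satisfies (4.2.1) at level $1 - \alpha$ exactly when the minimum coverage over all $M$ printed by such a check is at least $1 - \alpha$.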

Remember that $K \sim H(n, M, N)$. Now, suppose we construct for every $M \in \{0, \ldots, N\}$ sets $R^*(M)$ with values of $k$ such that

$$\sum_{k \in R^*(M)} P\{K = k \mid n, M, N\} \geq 1 - \alpha,$$

and let $M_C^*(k)$ be the set of all values of $M$ for which $k \in R^*(M)$. It is obvious from (4.2.2) that by using $M_C(K) = M_C^*(K)$, and also $R(M) = R^*(M)$, we have found a way to define $M_C(K)$ such that equation (4.2.1) is satisfied for every $M \in \{0, \ldots, N\}$. There are various methods to define $R^*(M)$ to construct confidence sets. We will discuss two of these methods here.

4.2.1 Test-method

We call $M_C(k)$ a confidence set. In those cases where the confidence sets $M_C(K)$ actually turn out to be confidence intervals $[M_L(K), M_U(K)]$, we speak of a $100(1-\alpha)\%$ two-sided confidence interval with lower confidence bound $M_L(K)$ and upper confidence bound $M_U(K)$. Since we know that the hypergeometric distribution function is a unimodal function of $k$, we can construct $R(M)$ in the following way. For every $M \in \{0, \ldots, N\}$ the set $R(M)$ contains all values of $k$ for which

$$P\{K \geq k \mid M\} > \beta \quad \text{and} \quad P\{K \leq k \mid M\} > \gamma, \quad \text{with } \beta + \gamma = \alpha.$$

First, we will consider the case $\beta = \gamma = \frac{\alpha}{2}$. Note that $\min(R(M))$ and $\max(R(M))$ are non-decreasing functions of $M$. This ensures that the confidence set $M_C(K) = \{M \mid K \in R(M)\}$ will always be a confidence interval. If we observe $K = k$, then the lower and upper confidence interval limits are given by

$$M_L(k) = \text{smallest integer } M \text{ s.t. } P\{K \geq k\} > \frac{\alpha}{2}$$

and

$$M_U(k) = \text{largest integer } M \text{ s.t. } P\{K \leq k\} > \frac{\alpha}{2}.$$

This method coincides with generating a confidence interval by inverting a family of hypothesis tests for $M$. That is why this method is called the test-method. It also appears to be the same method as described by Katz (1953), Konijn (1973) and Wright (1991).
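The two defining conditions for $M_L(k)$ and $M_U(k)$ can be evaluated by a plain search over $M = 0, \ldots, N$. The following Python sketch does this for $\beta = \gamma = \alpha/2$ with exact fractions. It is a brute-force illustration rather than an efficient implementation, and the function names are illustrative, not taken from the thesis.

```python
from fractions import Fraction
from math import comb


def hyper_pmf(k, n, M, N):
    """P{K = k | n, M, N}; zero outside the support."""
    if k < 0 or k > M or n - k < 0 or n - k > N - M:
        return Fraction(0)
    return Fraction(comb(M, k) * comb(N - M, n - k), comb(N, n))


def upper_tail(k, n, M, N):
    """P{K >= k | n, M, N}."""
    return sum((hyper_pmf(i, n, M, N) for i in range(k, n + 1)), Fraction(0))


def lower_tail(k, n, M, N):
    """P{K <= k | n, M, N}."""
    return sum((hyper_pmf(i, n, M, N) for i in range(0, k + 1)), Fraction(0))


def interval_by_inversion(k, n, N, alpha):
    """Return (M_L(k), M_U(k)) for beta = gamma = alpha / 2."""
    half = Fraction(alpha) / 2
    # Smallest M whose upper tail at k still exceeds alpha / 2.
    m_low = min(M for M in range(N + 1) if upper_tail(k, n, M, N) > half)
    # Largest M whose lower tail at k still exceeds alpha / 2.
    m_up = max(M for M in range(N + 1) if lower_tail(k, n, M, N) > half)
    return m_low, m_up


if __name__ == "__main__":
    n, N, alpha = 5, 20, Fraction(1, 10)
    for k in range(n + 1):
        print(k, interval_by_inversion(k, n, N, alpha))
```

Both searches always succeed: $M = N$ gives $P\{K \geq k\} = 1$ and $M = 0$ gives $P\{K \leq k\} = 1$, so neither candidate set is empty.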

Buonaccorsi (1987) showed that this method is always superior to the one described by Cochran (1977), in the sense that it always delivers confidence intervals that are shorter than the confidence intervals suggested by Cochran. Cochran's intervals were the finite population analog of the method by Clopper and Pearson (1934) for the construction of confidence intervals for a binomial fraction. Other values of $\beta$ and $\gamma$ could also be considered. A very interesting case is $\beta = 0$ and $\gamma = \alpha$. This is the case of only giving an upper confidence bound. Bickel and Doksum (1977) showed that this bound will be uniformly most accurate, because if the inverse test method is used, then the corresponding tests are uniformly most powerful.

4.2.2 Likelihood-method

We could also construct $R(M)$ in the following way. For every $M \in \{0, \ldots, N\}$ we sort the values of $k$ according to the size of the accompanying probabilities. Therefore, $k_{(1)}$ has the largest probability, $k_{(2)}$ has the next largest and so forth. If ties occur between $k_{(i)}$ and $k_{(i+1)}$, then the ordering is not strict. We deal with this issue later. This means that

$$P\{K = k_{(1)} \mid M\} \geq P\{K = k_{(2)} \mid M\} \geq \ldots \geq P\{K = k_{(n)} \mid M\}.$$

Now, for every $M \in \{0, \ldots, N\}$ we construct $R(M)$ in such a way that it consists of the smallest possible number of elements, say $k^*(M)$, such that

$$\sum_{i=1}^{k^*(M)} P\{K = k_{(i)} \mid M\} \geq 1 - \alpha.$$

Because the elements are selected based on their likelihood, we call the confidence set $M_C(K) = \{M \mid K \in R(M)\}$ obtained in this way a likelihood confidence set. This method was first described by Wendell and Schmee (2001). We will call $\min M_C(K)$ the lower confidence bound $M_L(K)$ and $\max M_C(K)$ the upper confidence bound $M_U(K)$. Using this method it is possible that the confidence sets produced are not confidence intervals: gaps can occur. A practical solution is to take the interval $[M_L(K), M_U(K)]$. Some theoretical solutions are suggested by Wendell and Schmee. They also show that these gaps seldom occur.

Using this method ties can occur. Ties occur when

$$P\{K = k_{(k^*(M))} \mid M\} = P\{K = k_{(k^*(M)+1)} \mid M\}.$$

These ties often occur when the hypergeometric distribution is symmetric for lower and upper tail probabilities. Suppose $k_{(k^*(M))} < k_{(k^*(M)+1)}$, then if we choose $k_{(k^*(M))}$ to add to $R(M)$ this means that $M_U(k_{(k^*(M))})$ is less tight and that $M_L(k_{(k^*(M)+1)})$ is tighter compared to the choice of $k_{(k^*(M)+1)}$. Of course this choice has to be made before we start sampling.

[Figure 4.1: plot with $M$ (0 to 20) on the horizontal axis and $k$ on the vertical axis, showing two series: test method and likelihood method.]

Figure 4.1. Comparison of the 90%-confidence intervals of the test-method and the likelihood-method for $n = 5$ and $N = 20$.

Wendell and Schmee also showed by simulation studies that this method performs well in comparison to the test-method. Figure 4.1 gives a comparison of the two methods for a 90%-confidence interval with $n = 5$ and $N = 20$. Notice that in this case for $k = 1$ and $k = 4$ the confidence intervals are equally long. In all other cases the likelihood-method produces shorter intervals. It is also possible that the test-method produces shorter intervals, but the study of Wendell and Schmee shows that this will not occur very often.

4.2.3 Approximate confidence sets

Instead of using the exact hypergeometric distribution to obtain confidence sets for $M$, in certain cases approximations of this distribution can also be used. We use these approximations to find confidence intervals for $p = \frac{M}{N}$. Of course confidence intervals for $M$ can be obtained by multiplying with the population size $N$. We will describe three approximations: the approximation by the binomial distribution, the approximation by the Poisson distribution, and the approximation by the normal distribution.

The question arises when we are allowed to use a certain approximation. Text books give so-called rules of thumb. However, these rules differ among text books, and are almost always given without any quantitative assessment of the quality of such approximations. Therefore, we should not pay too much attention to rules of thumb. Schader and Schmid (1992) showed that two rules of thumb for approximating the binomial distribution by the normal distribution are of dubious quality in numerical accuracy. Leemis and Kishor (1996) investigated rules of thumb for normal and Poisson approximations of the binomial distribution. From their article we can see, especially when we look at it from an auditing point of view (in which the proportions are usually very small), that using rules of thumb without any quantitative assessment of the quality of the approximations should be avoided. Therefore, if possible we should use an exact approach.

We will apply these approximations to the test-method with $\beta = \gamma = \frac{\alpha}{2}$. Therefore, in terms of $p$ our problem focuses on solving the following equations to find the smallest integer value of $N p_L$ such that

$$P\{K \geq k \mid p = p_L\} = \sum_{i=k}^{n} \frac{\binom{N p_L}{i}\binom{N(1-p_L)}{n-i}}{\binom{N}{n}} > \frac{\alpha}{2},$$

and the largest integer value of $N p_U$ such that

$$P\{K \leq k \mid p = p_U\} = \sum_{i=0}^{k} \frac{\binom{N p_U}{i}\binom{N(1-p_U)}{n-i}}{\binom{N}{n}} > \frac{\alpha}{2}.$$

Note that $p_L$ and $p_U$ are elements of $\{0, 1/N, 2/N, \ldots, 1\}$. Our $(1-\alpha)$-confidence interval for $p$ becomes $[p_L, p_U]$. In certain cases we can approximate the hypergeometric distribution by another discrete or even continuous distribution.

Binomial approximation

For relatively small values of $p$ and large values of $N$ we can approximate the hypergeometric distribution by the binomial distribution. As a rule of thumb $p < 0.1$ and $N \geq 60$ is sometimes used. Now, $p_L$ and $p_U$ are elements of $[0, 1]$, and we have to solve the following problem. Find $p_L$ and $p_U$ such that

$$P\{K \geq k \mid p = p_L\} = \sum_{i=k}^{n} \binom{n}{i} p_L^i (1-p_L)^{n-i} = \frac{\alpha}{2},$$

and

$$P\{K \leq k \mid p = p_U\} = \sum_{i=0}^{k} \binom{n}{i} p_U^i (1-p_U)^{n-i} = \frac{\alpha}{2}.$$

This confidence interval is known as the Clopper-Pearson confidence interval for $p$ (Clopper and Pearson, 1934). The following relationship relates the tail of a binomial distribution with the tail of an $F$-distribution:

$$\sum_{i=0}^{k} \binom{n}{i} p^i (1-p)^{n-i} = P\left\{Y \leq \frac{(1-p)(k+1)}{p(n-k)}\right\}, \quad Y \sim F(2(n-k), 2(k+1)).$$

A proof can be found in Leemis and Kishor (1996). Now, it follows immediately that

$$p_L = \frac{1}{1 + \frac{n-k+1}{k}\, F_{1-\frac{\alpha}{2}}(2(n-k+1), 2k)}$$

and

$$p_U = \frac{1}{1 + \frac{n-k}{k+1}\, F_{\frac{\alpha}{2}}(2(n-k), 2(k+1))},$$

where $F_{1-\frac{\alpha}{2}}(\cdot, \cdot)$ and $F_{\frac{\alpha}{2}}(\cdot, \cdot)$ denote the $100(1-\alpha/2)$th and the $100(\alpha/2)$th percentile of the $F$-distribution. Many statistical software packages provide the percentiles of the $F$-distribution. For large degrees of freedom numerical problems can occur; then approximate methods could be used. Vollset (1993) compared thirteen methods that produce two-sided confidence intervals for the binomial proportion. Newcombe (1998) further examined seven of these methods. The Clopper-Pearson method is known to be rather conservative, meaning that the coverage probabilities usually exceed $1 - \alpha$. Very often approximate methods such as adjusted Wald intervals or continuity corrected score intervals are suggested to tackle this problem (e.g. Vollset, 1993; Leemis and Kishor, 1996). Blyth and Still (1983) remark that the Clopper-Pearson method is only an approximation of the exact interval and consider procedures with correct confidence coefficient. These methods give numerical results that are very similar to the approach with the acceptability function of Blaker and Spjøtvoll (2000).

Poisson approximation

For small values of $p$ and extremely large values of $n$ the Poisson approximation can be used. As a rule of thumb ($p < 0.01$) and ($n \geq 1000$) is sometimes used. Now, $p_L$ and $p_U$ are elements of $[0, 1]$ again, and we have to solve the following problem. Find $p_L$ and $p_U$ such that

$$P\{K \geq k \mid p = p_L\} = \sum_{i=k}^{\infty} \frac{e^{-n p_L}(n p_L)^i}{i!} = \frac{\alpha}{2},$$

and

$$P\{K \leq k \mid p = p_U\} = \sum_{i=0}^{k} \frac{e^{-n p_U}(n p_U)^i}{i!} = \frac{\alpha}{2}.$$

The following relationship relates the tail of a Poisson distribution with the tail of a $\chi^2$-distribution:

$$\sum_{i=0}^{k-1} \frac{e^{-np}(np)^i}{i!} = P\{Y > 2np\}, \quad Y \sim \chi^2(2k).$$

A proof can be found in Johnson et al. (1992). Now, it follows immediately that

$$p_L = \frac{1}{2n}\, \chi^2_{\frac{\alpha}{2}}(2k)$$

and

$$p_U = \frac{1}{2n}\, \chi^2_{1-\frac{\alpha}{2}}(2(k+1)),$$

where $\chi^2_{\frac{\alpha}{2}}(\cdot)$ and $\chi^2_{1-\frac{\alpha}{2}}(\cdot)$ denote the $100(\alpha/2)$th and the $100(1-\alpha/2)$th percentile of the $\chi^2$-distribution. This confidence interval is also conservative. It is possible to increase some of the lower endpoints and decrease some of the higher endpoints and still satisfy the coverage requirement. Examples can be found in Crow and Gardner (1959), Casella and Robert (1989), and Kabaila and Byrne (2001).

Normal approximation

We can also use the normal distribution to approximate the hypergeometric distribution. To do so the rule of thumb $np \geq 4$ is sometimes used. We can approximate the hypergeometric distribution by a normal distribution with mean and variance equal to the mean and variance of $K$. Therefore, $p_L$ and $p_U$ are elements of $[0, 1]$ again, and using continuity corrections we have to solve the following problem. Find $p_L$ and $p_U$ such that

$$P\{K \geq k \mid p = p_L\} = 1 - \Phi\left(\frac{k - 0.5 - n p_L}{\sqrt{n p_L (1-p_L)\frac{N-n}{N-1}}}\right) = \frac{\alpha}{2},$$

and

$$P\{K \leq k \mid p = p_U\} = \Phi\left(\frac{k + 0.5 - n p_U}{\sqrt{n p_U (1-p_U)\frac{N-n}{N-1}}}\right) = \frac{\alpha}{2},$$

where $\Phi$ denotes the standard normal distribution function. Solving these equations gives the following confidence interval:

$$[p_L, p_U] = \frac{1}{2u}\left[u - n + (2k \pm 1) \pm \sqrt{u^2 - \frac{2u}{n}\left((k \pm 0.5)^2 + (n - k \mp 0.5)^2\right) + (2k \pm 1 - n)^2}\,\right],$$

where the lower signs give $p_L$ and the upper signs give $p_U$, and where

$$u = n + \frac{N-n}{N-1}\, Z^2_{1-\frac{\alpha}{2}},$$

with $Z_{1-\frac{\alpha}{2}}$ the $100(1-\alpha/2)$th percentile of the standard normal distribution. More simplified versions of this approximation are also used.
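The normal-approximation limits can be cross-checked by substituting them back into the two continuity-corrected equations. The Python sketch below assumes $0 < k < n < N$ and solves the quadratic that results from squaring each equation, taking the smaller root for $p_L$ and the larger root for $p_U$. The function names are illustrative; the edge cases $k = 0$ and $k = n$ are deliberately left out because the text does not specify how to treat them.

```python
from math import sqrt
from statistics import NormalDist


def continuity_z(a, p, n, N):
    """(a - n p) divided by the hypergeometric standard deviation of K at p."""
    return (a - n * p) / sqrt(n * p * (1 - p) * (N - n) / (N - 1))


def normal_interval(k, n, N, alpha):
    """Closed-form [p_L, p_U] under the normal approximation, for 0 < k < n < N."""
    if not (0 < k < n < N):
        raise ValueError("this sketch only covers 0 < k < n < N")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    u = n + (N - n) / (N - 1) * z * z

    def root(a, sign):
        # a is the continuity-corrected count: k - 0.5 (lower) or k + 0.5 (upper).
        disc = u * u - 2 * u * (a * a + (n - a) ** 2) / n + (2 * a - n) ** 2
        return (u - n + 2 * a + sign * sqrt(disc)) / (2 * u)

    return root(k - 0.5, -1), root(k + 0.5, +1)


if __name__ == "__main__":
    k, n, N, alpha = 3, 20, 100, 0.10
    p_low, p_up = normal_interval(k, n, N, alpha)
    print("p_L, p_U:", p_low, p_up)
    # Both standardised distances should equal the normal percentile in size.
    print(continuity_z(k - 0.5, p_low, n, N), continuity_z(k + 0.5, p_up, n, N))
```

Multiplying both limits by $N$ converts them into approximate limits for $M$, as noted at the start of Section 4.2.3.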

70 Chapter 4. Hypergeometric Distributio Lig ad Pratt (1984 compared several ormal approximatios for the hypergeometric distributio. They show that especially the so-called Peizer approximatios tur out to be very accurate. These complicated approximatios origiate from a upublished paper by Peizer. However, these approximatios are ot ivertible i closed form. Moleaar (1973 gave two relatively simpler ormal approximatios that are ivertible i closed form, but still give very complicated solutios. These approximatios will probably give more accurate bouds tha the method described above. A crude approximatio ca be obtaied by usig the approximate ormality of p mea equal to the ubiased estimator for p, i.e. K, ad variace equal to the ubiased estimator for the variace of this estimator, i.e. ( ( N K 1 K. N( 1 If we also correct for cotiuity, the we fid the followig cofidece iterval [p L, p U ] = [ ( k ± Z 2 1 α 2 N N( 1 ( ( k 1 k ] + 1. 2 4.3 Computig the hypergeometric distributio Theorem 4.1.1, part 4 ca be used to fid some recursive properties that we will use i calculatig the hypergeometric distributio. It shows that we ca compute Λ(, M, N from Λ(, M + 1, N, by usig the hypergeometric probability P{K = k 0 1, M, N 1}. But suppose that we already calculated Λ(, M + 1, N from Λ(, M + 2, N, the we ca use this step to facilitate the computatio of P{K = k 0 1, M, N 1}. Property 4.3.1 gives a few examples of this. Property 4.3.1. The followig recursive properties facilitate the computatio of the hypergeometric distributio. 1. If k 0 M N + k 0 1 ad k 0 + 1 N 1, the Λ(, M, N = Λ(, M + 1, N + C 1 (, M, N

4.3. Computig the hypergeometric distributio 71 C 1 (, M, N = M k 0 + 1 M + 1 If k 0 + 1 N 1, the N M 1 N M + k 0 C 1 (, M + 1, N, C 1 (, N + k 0, N =! (N + k 0! k 0! N! 2. If k 0 + 1 M N + k 0 + 1 ad k 0 + 2 N, the Λ(, M, N = Λ( 1, M, N C 2 (, M, N. C 2 (, M, N = N M N + 1 M k 0 k 0 1 C 1( 1, M, N. 3. If k 0 M N + k 0 ad k 0 + 1 N, the Λ(, M, N = Λ(, M + 1, N + C 1 (, M, N C 1 (, M, N = M + 1 C 2(, M + 1, N. 4. If k 0 + 1 M N + k 0 ad k 0 N 1, the Λ(, M, N = Λ( + 1, M 1, N + C 3 (, M, N C 3 (, M, N = M 1 + 1 C 1 ( + 1, M 1, N. Proof. First we prove part 1. Usig Theorem 4.1.1, part 4 we fid Λ(, M, N = Λ(, M + 1, N + C 1 (, M, N C 1 (, M, N = M 1 k 0 k 0 1, ad Λ(, M + 1, N = Λ(, M + 2, N + C 1 (, M + 1, N

72 Chapter 4. Hypergeometric Distributio C 1 (, M + 1, N = +1 k 0 M 2 k 0 1. From Theorem 4.1.1, part 4 we otice that C 1 (, M, N > 0 ad C 1 (, M + 1, N > 0 if k 0 M N + k 0 1 ad k 0 + 1 N 1. Combiig the expressios for C 1 (, M, N ad C 1 (, M + 1, N gives C 1 (, M, N = M k 0 + 1 M + 1 For M = N + k 0 ad k 0 + 1 N 1 we fid N M 1 N M + k 0 C 1 (, M + 1, N. C 1 (, N + k 0, N = Λ(, N + k 0, N Λ(, N + k 0 + 1, N = Λ(, N + k 0, N =! (N + k 0!. k 0! N! I provig part 2 we agai use Theorem 4.1.1, part 4 i combiatio part 1. Usig this theorem we fid Λ(, M, N = Λ( 1, M, N C 2 (, M, N C 2 (, M, N = ( 1 k 0 M k 0 1 M, ad Λ( 1, M, N = Λ( 1, M + 1, N + C 1 ( 1, M, N M 1 k C 1 ( 1, M, N = 0 k 0 2. 1 Observe that C 2 (, M, N > 0 ad C 1 ( 1, M, N > 0 if k 0 + 1 M N +k 0 +1 ad k 0 +2 N. Combiig the expressios for C 2 (, M, N ad C 1 ( 1, M, N gives C 2 (, M, N = To prove part 3 we use the previous results N M N + 1 M k 0 k 0 1 C 1( 1, M, N. Λ(, M, N = Λ(, M + 1, N + C 1 (, M, N

with

$$C_1(n, M, N) = \frac{\binom{M}{k_0}\binom{N-M-1}{n-k_0-1}}{\binom{N}{n}},$$

and

$$\Lambda(n, M+1, N) = \Lambda(n-1, M+1, N) - C_2(n, M+1, N)$$

with

$$C_2(n, M+1, N) = \frac{\binom{n-1}{k_0}\binom{N-n}{M-k_0}}{\binom{N}{M+1}}.$$

Note that $C_1(n, M, N) > 0$ and $C_2(n, M+1, N) > 0$ if $k_0 \le M \le N - n + k_0$ and $k_0 + 1 \le n \le N$. Combining the expressions of $C_1(n, M, N)$ and $C_2(n, M+1, N)$ gives

$$C_1(n, M, N) = \frac{n}{M + 1}\, C_2(n, M+1, N).$$

In proving part 4 we use Theorem 4.1.1, part 4 in combination with part 1 and find

$$\Lambda(n, M, N) = \Lambda(n+1, M, N) + C_2(n+1, M, N)$$

with

$$C_2(n+1, M, N) = \frac{\binom{n}{k_0}\binom{N-n-1}{M-k_0-1}}{\binom{N}{M}}.$$

We again use Theorem 4.1.1, part 4 to find

$$\Lambda(n+1, M, N) = \Lambda(n+1, M-1, N) - C_1(n+1, M-1, N)$$

with

$$C_1(n+1, M-1, N) = \frac{\binom{M-1}{k_0}\binom{N-M}{n-k_0}}{\binom{N}{n+1}}.$$

Note that $C_1(n+1, M-1, N) > 0$ and $C_2(n+1, M, N) > 0$ if $k_0 + 1 \le M \le N - n + k_0$ and $k_0 \le n \le N - 1$. Combining the expressions of $C_1(n+1, M-1, N)$ and $C_2(n+1, M, N)$ gives

$$C_2(n+1, M, N) = \frac{M}{n + 1}\, C_1(n+1, M-1, N).$$

Using the previous results we find

$$\begin{aligned}
\Lambda(n, M, N) &= \Lambda(n+1, M, N) + C_2(n+1, M, N) \\
&= \Lambda(n+1, M, N) + \frac{M}{n+1}\, C_1(n+1, M-1, N) \\
&= \Lambda(n+1, M-1, N) - C_1(n+1, M-1, N) + \frac{M}{n+1}\, C_1(n+1, M-1, N) \\
&= \Lambda(n+1, M-1, N) + \frac{M - n - 1}{n + 1}\, C_1(n+1, M-1, N).
\end{aligned}$$

Table 4.1. The values of $\Lambda(n, M, 8)$ for $k_0 = 2$.

    M \ n    0    1    2    3        4        5        6        7       8
    0        1    1    1    1        1        1        1        1       1
    1        1    1    1    1        1        1        1        1       1
    2        1    1    1    1        1        1        1        1       1
    3        1    1    1    0.9821   0.9286   0.8214   0.6429   0.375   0
    4        1    1    1    0.9286   0.7571   0.5      0.2143   0       0
    5        1    1    1    0.8214   0.5      0.1786   0        0       0
    6        1    1    1    0.6429   0.2143   0        0        0       0
    7        1    1    1    0.375    0        0        0        0       0
    8        1    1    1    0        0        0        0        0       0

Table 4.1 gives the values of $\Lambda(n, M, 8)$ for all possible combinations of $n$ and $M$ with $k_0 = 2$. This table gives an illustration of Property 4.3.1. First we can use Theorem 4.1.1, parts 2 and 3. This gives us $\Lambda(n, M, 8) = 1$ if $n \le 2$ or $M \le 2$, and $\Lambda(n, M, 8) = 0$ for all combinations of $n$ and $M$ for which $n + M > 10$. We start computing this table with $\Lambda(3, 7, 8)$. Using Property 4.3.1, part 1 it immediately follows that

$$\Lambda(3, 7, 8) = C_1(3, 7, 8) = \frac{3!\,(8 - 3 + 2)!}{2!\,8!} = 3/8 = 0.375;$$

note that $\Lambda(3, 8, 8) = 0$. Again by using Property 4.3.1, part 1 we can calculate

$\Lambda(3, 6, 8)$:

$$\begin{aligned}
\Lambda(3, 6, 8) &= \Lambda(3, 7, 8) + C_1(3, 6, 8) \\
&= 3/8 + \frac{6 - 2 + 1}{6 + 1}\cdot\frac{8 - 6 - 1}{8 - 6 - 3 + 2}\cdot 3/8 \\
&= 3/8 + 5/7 \cdot 3/8 = 3/8 + 15/56 = 9/14 \approx 0.6429.
\end{aligned}$$

We can repeat this procedure until we have found $\Lambda(3, 3, 8)$, and by then we have found $\Lambda(3, M, 8)$ for all possible values of $M$. Since $\Lambda(4, M, 8) = 0$ for $M > 6$, we can calculate $\Lambda(4, 6, 8)$ by using Property 4.3.1, part 2:

$$\begin{aligned}
\Lambda(4, 6, 8) &= \Lambda(3, 6, 8) - \frac{8 - 6}{8 - 4 + 1}\cdot\frac{6 - 2}{4 - 2 - 1}\, C_1(3, 6, 8) \\
&= 9/14 - 2/5 \cdot 4 \cdot 15/56 = 9/14 - 3/7 = 3/14 \approx 0.2143.
\end{aligned}$$

Since $\Lambda(4, 7, 8) = 0$, it follows that $C_1(4, 6, 8) = 3/14$. Now we can apply Property 4.3.1, part 1 to find the remaining values of $\Lambda(4, M, 8)$. By repeating the procedure above the table can be completed.

Sometimes we have to use the terms of $\Lambda$ to find a recursive expression, for instance if we would like to calculate $\Lambda(n, M, N)$ from $\Lambda(n, M, N-1)$ or from $\Lambda(n, M-1, N-1)$. We introduce the following notation. We write $P(n, M, N)$ as a $(k_0 + 1)$-vector with elements

$$P_j(n, M, N) = P\{K = j\} = \frac{\binom{M}{j}\binom{N-M}{n-j}}{\binom{N}{n}}, \qquad j = 0, \ldots, k_0,$$

and $\iota = (1, \ldots, 1)'$ a $(k_0 + 1)$-vector. Now, it follows that

$$\Lambda(n, M, N) = \iota' P(n, M, N). \qquad (4.3.1)$$

How we compute the probabilities $P_j(n, M, N)$ from $P_j(n, M, N-1)$ will be shown in the following property.

Property 4.3.2. If $M \ge k_0$, $n \ge k_0$ and $N > n$, then for $j = 0, \ldots, k_0$

$$P_j(n, M, N) = \begin{cases}
0 & \text{if } j < n + M - N, \\[4pt]
\dfrac{j+1}{N}\, P_{j+1}(n, M, N-1) & \text{if } j = n + M - N < k_0, \\[8pt]
\dbinom{M}{j} \Big/ \dbinom{N}{n} & \text{if } j = n + M - N = k_0, \\[8pt]
\dfrac{(N-n)(N-M)}{N\,(N - M - n + j)}\, P_j(n, M, N-1) & \text{if } j > n + M - N.
\end{cases}$$
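The four cases of Property 4.3.2 can be sketched in a few lines of Python. This is only an illustration, not the algorithm of the later chapters: the function names `hyper_pmf` and `grow_population` are hypothetical, exact `Fraction` arithmetic is used so that the update can be compared with the definition without rounding, and the sketch assumes $M \le N - 1$ so that the vector for population size $N - 1$ exists.

```python
from fractions import Fraction
from math import comb


def hyper_pmf(j, n, M, N):
    """P{K = j} for sample size n, M marked items and population size N."""
    if j < max(0, n + M - N) or j > min(n, M):
        return Fraction(0)
    return Fraction(comb(M, j) * comb(N - M, n - j), comb(N, n))


def grow_population(P_prev, n, M, N, k0):
    """Return [P_0, ..., P_k0](n, M, N) from P_prev = [P_0, ..., P_k0](n, M, N - 1).

    Follows the four cases of Property 4.3.2; assumes M >= k0, n >= k0,
    N > n and (an assumption of this sketch) M <= N - 1.
    """
    low = n + M - N  # smallest attainable value of K whenever it is positive
    P = []
    for j in range(k0 + 1):
        if j < low:
            P.append(Fraction(0))
        elif j == low and j < k0:
            # P_j(n, M, N - 1) is zero here, so the neighbour j + 1 is used
            P.append(Fraction(j + 1, N) * P_prev[j + 1])
        elif j == low:
            # j = k0: no neighbour inside the vector, use the closed form
            P.append(Fraction(comb(M, j), comb(N, n)))
        else:
            factor = Fraction((N - n) * (N - M), N * (N - M - n + j))
            P.append(factor * P_prev[j])
    return P


# Usage: Lambda(3, 6, 8) with k0 = 2, obtained from the vector for N = 7.
P_prev = [hyper_pmf(j, 3, 6, 7) for j in range(3)]
print(sum(grow_population(P_prev, 3, 6, 8, 2)))  # prints 9/14
```

Summing the returned vector is the product $\iota' P(n, M, N)$ of (4.3.1), so the printed value agrees with the entry 0.6429 of Table 4.1.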

Proof. The cases $j < n + M - N$, $j > n + M - N$ and $j = k_0 = n + M - N$ follow immediately from the definition of the hypergeometric probability. Note that if $M \ge k_0$, $n \ge k_0$ and $N > n$, then $P_j(n, M, N-1) > 0$ implies that $P_j(n, M, N) > 0$. For $j = n + M - N < k_0$ the probability $P_j(n, M, N-1)$ equals zero, but the probability $P_{j+1}(n, M, N-1)$ does have a positive value. It is not difficult to see that in this case

$$P_j(n, M, N) = \frac{\binom{M}{j}}{\binom{N}{n}} = \frac{j+1}{N}\cdot\frac{\binom{M}{j+1}}{\binom{N-1}{n}} = \frac{j+1}{N}\, P_{j+1}(n, M, N-1).$$

Notice that once $n + M - N \le 0$, all elements of $P(n, M, N)$ are positive. We can find a similar property if we would like to compute the probability $P_j(n, M, N)$ from the probability $P_j(n, M-1, N-1)$.

Property 4.3.3. If $M > k_0$, $n \ge k_0$ and $N > n$, then for $j = 0, \ldots, k_0$

$$P_j(n, M, N) = \begin{cases}
0 & \text{if } j < n + M - N, \\[4pt]
\dfrac{M\,(N - n)}{N\,(M - j)}\, P_j(n, M-1, N-1) & \text{if } j \ge n + M - N.
\end{cases}$$

Proof. This follows immediately from the definition of the hypergeometric probability. If $M > k_0$, $n \ge k_0$ and $N > n$, then $P_j(n, M-1, N-1) > 0$ implies that $P_j(n, M, N) > 0$.

The properties we derived here will be essential in developing the algorithms that we will describe in Chapters 5 and 6. These properties enable the algorithms to be efficient and accurate.
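As an illustration of how Property 4.3.1 is used to fill Table 4.1, the following Python sketch rebuilds the whole table from part 1 alone: each column starts from the closed form at $M = N - n + k_0$ and then steps down in $M$. It is a minimal sketch under stated assumptions, not the algorithm of Chapters 5 and 6: the function name `lambda_table` is hypothetical, $\Lambda(n, M, N)$ is taken to be $P\{K \le k_0\}$ (which is consistent with the table entries), and exact fractions are used instead of floating point.

```python
from fractions import Fraction
from math import factorial


def lambda_table(N, k0):
    """Return table[M][n] = Lambda(n, M, N), assumed to equal P{K <= k0}."""
    # Lambda = 1 whenever n <= k0 or M <= k0.
    table = [[Fraction(1)] * (N + 1) for _ in range(N + 1)]
    for n in range(k0 + 1, N + 1):
        # Lambda = 0 whenever n + M > N + k0.
        for M in range(k0 + 1, N + 1):
            if n + M > N + k0:
                table[M][n] = Fraction(0)
        if n == N:
            continue  # the closed form of part 1 requires n <= N - 1
        # Start of the column: C_1(n, N - n + k0, N) in closed form.
        M = N - n + k0
        c1 = Fraction(factorial(n) * factorial(M), factorial(k0) * factorial(N))
        table[M][n] = c1  # because Lambda(n, M + 1, N) = 0
        # Part 1: step down in M, updating C_1 by its ratio.  The last
        # step (M = k0) recomputes an entry that is known to equal 1.
        for M in range(N - n + k0 - 1, k0 - 1, -1):
            c1 *= Fraction((M - k0 + 1) * (N - M - 1),
                           (M + 1) * (N - M - n + k0))
            table[M][n] = table[M + 1][n] + c1
    return table


table = lambda_table(8, 2)
print(table[6][3])  # prints 9/14
print(table[6][4])  # prints 3/14
```

Letting the recursion run down to $M = k_0$ is a deliberate choice: the recomputed entry must come out as exactly 1, which gives a cheap internal check on each column. Part 2 of the property could replace the closed-form start of a column by a step from the previous column, as in the worked example for $\Lambda(4, 6, 8)$.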