Inverse-Type Sampling Procedure for Estimating the Number of Multinomial Classes

Size: px

Start display at page:

Download "Inverse-Type Sampling Procedure for Estimating the Number of Multinomial Classes"

Geraldine Lambert
6 years ago
Views:

1 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. SEQUENTIAL ANALYSIS Vol. 22, No. 4, pp , 23 Inverse-Type Sampling Procedure for Estimating the Number of Multinomial Classes Hokwon A. Cho* Department of Mathematical Sciences, University of Nevada, Las Vegas, Nevada, USA ABSTRACT A procedure with inverse-type sampling is proposed for the estimation of the number of classes (or cells) in a given multinomial distribution with a devised stopping rule that satisfies a preassigned P*-condition and controls the probability of a correct decision P{CD}. Dirichlet Integrals of Type II are adapted for developing the procedure, which is based on a decision-theoretic approach. We assume that we know neither the number of classes (or cells) nor the cell probabilities (parameters) of the given multinomial but we do assume the minimum cell probability "(>). Finally, the Monte Carlo experimentation is carried out in order to illustrate the theoretical results and the behavior of proposed procedure. Key Words: Inverse-type sampling; Number of classes in multinomial distribution; Probability of correct decision; Dirichlet Integrals Type II; Minimum cell probability. *Correspondence: Hokwon A. Cho, Department of Mathematical Sciences, University of Nevada, 455 Maryland Parkway, Box 4542, Las Vegas, NV 89154, USA; Fax: (77) ; cho@unlv.nevada.edu. 37 DOI: 1.181/SQA Copyright & 23 by Marcel Dekker, Inc (Print); (Online)

2 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. 38 Cho 1. INTRODUCTION Suppose we have a multinomial population which is partitioned into an unknown number of classes, say k. Then a natural question will be to ask how many classes (or categories, cells) there are. The following examples are rather common in practical situations and have given in the literature. 1. Biologists, ecologists and botanists may be interested in estimating the number of different species in a population of fish, animals or plants. Such estimates are essential to obtain indices of diversity or the rates of extinction which play an important role in the study of ecosystems or in studying environmental conditions of the ecological diversity of species. [1] 2. Numismatists may be concerned with estimating the total number of dies used to produce coins to study ancient economies. Suppose the researcher assumes that each die has been used to manufacture approximately the same number of coins. [2,3] 3. Linguists may be interested in estimating the size of an author s vocabulary. [4,5] In addition to these applications, estimating the number of distinct records in a filing system such as the social security card file can be an important issue, where many records are suspected to be duplicated. An early article in the above area was written by Goodman. [6] Since then, many approaches have been proposed and developed. Bunge and Fitzpatrick [7] provided a comprehensive review of the existing literature according to methods and approaches with an extended bibliography. Even though there has been more work done on these problems, both frequentist and Bayesian methods [8] use rather restrictive assumptions; namely, equal cell probabilities or uniform prior. Here, we consider a more flexible inverse-type sampling procedure, which has not been considered so far, using multiple decision-theoretic methods. 2. FORMULATION Let X 1, X 2, be a sequence of independent observations drawn from a multinomial population with an unknown integer number of classes (or cells) k, with corresponding cell probabilities p i, where p i >, i ¼ 1, 2,, k and k i¼1p i ¼ 1. Then, there is a simple relation between the unknown k

3 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. Inverse-Type Sampling Procedure 39 and the smallest and largest values, namely, 1=p ½kŠ k 1=p ½1Š, ð2:1þ where p [i] represents the i-th ordered cell probability. For moderate and large numbers of observations, the number of cells with positive frequency in the sequential sample at stopping time will be a important part of our statistic for estimating the true number of cells k. However we need an appropriate stopping rule that will provide a confidence level on our estimate of k. From a practical point of view, we fix a sufficiently small positive value " and say that cells with probability less than this either do not exist or are not of any practical interest. We refer this condition as to the "-condition as ð"þ (the extended parameter space). From (2.1) it follows that k 1/"; thus if " ¼.1, then k 1 and if " ¼.5, then k Stopping Rule of the Procedure We propose the following stopping and decision rule for our procedure R; If the smallest positive frequency among the observed cells reaches a tabled positive integer c cð"þ, then we stop sampling. That is, the stopping rule for the total (random) number N c of observations required is N c ¼ inf n 3 Min f i ¼ c, ð2:2þ 1iK n where n ¼ K n i¼1 f i, n 1 and K n is the number of distinct cells seen in n observations, and the f i, i ¼ 1, 2,, K n are corresponding frequencies of each observed cell. The value of c, which is called the stopping value, will be determined by a P*-condition and this in turn will determine the random variable N c. We then use the statistic K n as an estimate of the true number of cells k. This value K n is clearly a lower bound on k and K n! k as c!1. Since we do not know at stopping time whether we have seen all the cells or not, we define a correct decision (CD) as observing all of the cells with p i " by stopping time. We show below that for the procedure R the P{CD R} P* where P* is a prespecified level of confidence such that 1/k P*<1; the tabled value of the stopping integer c depends on the prespecified P*, as well as on ". Equivalently the probability of a wrong decision P{WD R}, i.e., of stopping without seeing all the cells with p i ", will be less than 1 P*.

4 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. 31 Cho Under our "-condition the proposed sequential procedure R terminates with probability one with a finite number of observations since the value of c will always be finite regardless of how small a positive value is prespecified for " The Configurations of the Cell Probabilities and Least Favorable Configuration The procedure R we use is constructed so that it satisfies the usual P*-requirement, namely P{CD R} P* for any configurations of cell probabilities, vector p, in the parameter space (") in which every component is at least ". This leads us to find the so-called least favorable configuration (LFC). [9] That is, we must look for the configuration which infimizes (or minimizes) the PfCDjRg over all vectors p 2 ð"þ. Thus if such a limit configuration p LFC exists and is found, then we will be able to confine our attention solely to the LFC rather than the entire space of parameters. However, we do focus on finding the LFC in this article. We want to satisfy the basic requirement that PfCDjRg P for all p 2 ð"þ. A probability configuration of parameters is said to be a "-least favorable configuration ("-LFC) if any given N, k and " the probability of a correct decision P{CD} becomes a minimum (or infimum) over all vectors p 2 ð"þ under "-condition. We note that the slippage ratio of cell probabilities, defined as p [ j] /p [ j 1], j ¼ 2, 3,, k, is related to find the LFC Basic Scheme of the Procedure In the first stage of our solution we take " ¼.1 and P* to be the traditional value.95 to give a specific example of the use of the procedure R. Then the goal is to find the smallest integers c under the stopping rule R which satisfies the P*-requirement. Firstly let us consider the case where only one cell has been observed, i.e., t ¼ 1. Using our assumption where only one cell has been observed, there may be a second cell; in the worst scenario suppose its cell probability values are p [1] ¼.1 and p [2] ¼.9. Also suppose P* ¼.95. Since PfCDjRg þ PfWDjRg ¼ 1, we want to find integer c such that P{WD} 1 P*. In order to find c(.1) for this situation, we set up the equation in n: :9 n þ :1 n :5 ð2:3þ

5 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. Inverse-Type Sampling Procedure 311 By solving this equation we find n to be 28.4 and the smallest integer above this is 29. In this case c is the value of n that satisfies Eq. (2.3) under the condition " ¼.1, namely 29. Therefore the number of observation required is 29 observations in the one cell observed in order to have PfCDjRg :95; the decision in this case would be k ¼ 1. That is, if we observe one cell at least 29 times in a row, then we conclude with probability.95 that there is only one class (or kind) in the population. In the next section we apply the same analogy to extend to the case where we observe more than one cell, i.e., t PROBABILITY OF CORRECT DECISION AND DIRICHLET INTEGRALS 3.1. Multinomial Events and C and D Functions Let two events E 1 and E 2 represent the minimum frequency and maximum frequency respectively among the cells at stopping time (a.s.t.) and the stopping rule is of type used in inverse sampling. Then applying the results shown in Olkin and Sobel, [1] we have PfE 1 g¼pf f i r, i ¼ 1, 2,, b a:s:t :when f bþ1 ¼ m for the first timeg ¼ Z ab Z ab 1 Z a1 f ðbþ ðx; r, mþ Yb Ca ðbþ ðr, mþ ð3:1þ and PfE 2 g¼pf f i <r, i ¼1, 2,, b a:s:t: when f bþ1 ¼ m for the first timeg where Z 1 Z 1 ¼ a b a b 1 D ðbþ a ðr, mþ f ðbþ ðx; r, mþ ¼ Z 1 a 1 ðm þ RÞ ðmþ Q b i¼1 ðr iþ f ðbþ ðx; r, mþ Yb i¼1 i¼1 dx i dx i Q b i¼1 xr i 1 i ð3:2þ dx i 1 þ P mþr ð3:3þ b i¼1 x i which is the (b-variate) Dirichlet-type II distribution, f i denotes the frequency in the i-th cell, and a i ¼ p i /p bþ1, i ¼ 1, 2,, b. Consider a multinomial model with 1 þ b þ j cells and the corresponding cell probabilities are p and p ¼ ( p 1, p 2,, p b : p bþ1,, p bþj ),

6 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. 312 Cho where p þ bþj i¼1 p i ¼ 1. Let c be an integer such that 1 c b. Suppose that we stop sampling when the frequency f reaches the value of m for the first time. Let E denote the compound multinomial even that the frequencies for i ¼ 1, 2,, c are all at least r, the frequencies f i for i ¼ c þ 1, c þ 2,, b are all less than r and the frequencies for i ¼ b þ 1, b þ 2,, b þ j are all exactly equal to r at stopping time. Then the probability of the compound multinomial event E is given by the CDintegrals in Sobel et al. [11] :! ðc, b c;jþ ðm þ RÞ Y bþj Z a r a 1 Z a c i CDa ðr; mþ¼ ðmþ bþj ðrþ r i¼bþ1 Z 1 Z 1 Q b i¼1 x i r 1 dx i a cþ1 1 þ P bþj i¼bþ1 a i þ P mþr, ð3:4þ b i¼1 x i a b where a ¼ p/p ¼ ( p 1, p 2,, p b : p bþ1,, p bþj ) and R ¼ (b þ j )r. We note that c denotes the number of integrals of type C in the expression (3.4). Furthermore if j ¼, we are able to eliminate the j and simply (3.4) as follows: CD ðc, b cþ a ðr; mþ ¼ ðm þ RÞ ðmþ b ðrþ Z a 1 Z a c a cþ1 Z 1 Z 1 a b Q b i¼1 xr 1 i 1 þ P b i¼1 x i dx i mþr ð3:5þ By letting c ¼ b, this CD-integral reduces to the usual C-integral in Eq. (3.1) and letting c ¼, it reduces to the usual D-integral in Eq. (3.2). In addition, if we have a common value for the a, then we can use a without specifying the vector notations on the left-hand side of both Eqs. (3.4) and (3.5). It should be noted that since there is a common value of r in each partition, it is convenient to use one C and one D in the left hand side of the expressions (3.2) and (3.5) Expressions of P{WD} based on C and D Functions Let k be the true number of cells in the state of nature. Let T denote the observed number of cells at the stopping time. Also we assume in the present stage that every cell probability p i 1/1, i ¼ 1, 2,, k. Consider a configuration such as {9/1t, 9/1t,, 9/1t, 1/1}, where 9/1t 1/1, t ¼ 1, 2,, 9. In this configuration the slippage ratio of cell probabilities is a constant and 9/t. Note that if t ¼ 1, then we immediately stop

7 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. Inverse-Type Sampling Procedure 313 Figure 3.1. Cell structure and probabilities for omitting one cell. sampling since under our "-condition all p i s are equally probable spontaneously, and we call this equal parameter configuration (EPC). We now are going to focus on the case of omitting exactly one cell P{WD} for the case of Omitting One Cell: T ¼ k 1 In order to take the case of omitting one cell into consideration, first we need to investigate the structure of the cells in multinomial as shown in Fig Since the total number of cells is the sum of the observed cells and the missing cell, the total number of cells from the above diagram must be k ¼ t þ 1 where t ¼ 1, 2,,9. According to the cell structure, there can be three possibilities in omitting one cell. Hence the P{WD} can be completely considered on the basis of the three cases. Denoting by L the cell of size 9/1t and by S the cell of size 1/1, the probability of a wrong decision P{WD} for omitting one cell is composed of three parts by the cell structure, that is, PfWDg ¼tL; h Siþ tðt 1ÞhL; Liþ ts; h Li, ð3:6þ where tl; h Si, for instance, represents the fact that there are t possible cases in which one of the L is the missing cell and S is the stopping cell. Then by Eq. (3.4) we can express P{WD} in terms of multiple of C and D-integrals as follows: PfWDg¼tDC ð1,t 1Þ ð1,c;cþþtðt 1ÞDCCð1,1,t 2Þ ¼ t þ t ðt=9,1þ þ tdcð1,t 1Þ ð9=t,9=tþ Z 1 Z 1 Z 1 ð1 þ tcþ t ðcþ ð1 þ tcþ t ðcþ þ tðt 1Þ t=9 Z 1 Z 9=t 9=t ð1 þ tcþ t ðcþ Z 9=t dx Q t 1 1 þ x þ P t 1 i¼1 y i Z 1 Z t=9 Z 1 1 dy i i¼1 yc 1 i 1þtc dx Q t 1 1 þ x þ P t 1 i¼1 y i Z 1 dy i i¼1 yc 1 i 1þtc ð1,t=9,1þ ð1,c,c;cþ dxðy c 1 dyþ Q t 2 i¼1 zc 1 i dz i 1 þ x þ y þ P 1þtc : t 2 i¼1 z i ð3:7þ

8 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. 314 Cho After some algebra and integrations of the D-type integrals we write Eq. (3.7) in the form 9 c ðtcþ PfWDg ¼t 9 þ t t ðcþ t c ðtcþ þ t 9 þ t t ðcþ Z 9=ð9þtÞ Z 9=ð9þtÞ Z 9=ð9þtÞ Q t 1 i¼1 wc 1 i Z 9=ð9þtÞ Z 1 ðtc cþ 1=2 Z 1=2 þ t ðt 1Þ 2 c t 1 ðcþ t c 1 9 2c 1 ðtc þ 1Þ 9 18 þ t 2 ðcþ t 2 ðcþ Z 9=ð18þtÞ Z 9=ð18þtÞ 1 þ P t 2 i¼1 w i dw i 1 þ P t 1 i¼1 w i Q t 1 i¼1 wc 1 i Q t 2 i¼1 wc 1 i dw i tc 1 t c 2 9 2c 2 ðtc þ 2Þ 9 18 þ t ðcþ ðc 1Þ t 2 ðcþ Z 9=ð18þtÞ Z 9=ð18þtÞ Q t 2 i¼1 wc 1 i dw i tc 2 1 þ P t 2 i¼1 w i tc dw i 1 þ P t 1 i¼1 w i 1 þ P t 2 i¼1 w i tc Q t 2 i¼1 wc 1 i dw i tc c 9 c Z ðtc þ cþ 9=ð18þtÞ Z 9=ð18þtÞ 18 þ t ðcþ ð1þ t 2 ðcþ 9 Q t 2 i¼1 wc 1 i dw >= i 1 þ P tc c t 2 i¼1 w >; : ð3:8þ i After combining similar terms and reshuffing the remaining terms in terms of the C-function, this can be written as 9 c PfWDg ¼t þ t c C ðt 1Þ 1Þ n 9=ð9þtÞðc; cþþtðt 9 þ t 9 þ t 2 c C ðt 2Þ 1=2 ðc; cþ Xc t c 1 18 c ) 2c 1 i C ðt 2Þ 9=ð18þtÞðc; 2c iþ : 18 þ t 18 þ t c 1 i¼1 ð3:9þ

9 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. Inverse-Type Sampling Procedure 315 Table 3.1. Total P{WD} for " ¼.1, P ¼.95. k t c(") P{WD, t ¼ k 1} (1) P{WD, t ¼ k 2} (2) Total P{WD} (1) þ (2) From Eq. (3.9), we are able to obtain values of c corresponding to different values of t, (t ¼ 1, 2,, 9). The values are given in Table 3.1 when P is specified as.95 under " ¼.1. It is most important to note that P{WD}.5 uniformly. It should also be noted that P{WD, T ¼ k 1} in Table 3.1 is not monotonic in either t or c. A reasonable explanation for this is that both the least favorable configuration and the c values are changing. After c settles (to 4 in the table) and the least favorable configuration becomes the one with exactly one cell probability slipped to the left and thereafter the results in Table 3.1 are monotonic. As we see in table, the maximum value t can take is 9 since the number of missing cells is 1 and total number of cells k ¼ t þ 1, and the probabilities of omitting one cell are uniformly smaller than 1 P* which is.5. Therefore the suggested procedure is well justified by guaranteeing P*-condition under the assumption P{WD} for the Case of Omitting Two Cells: T ¼ k 2 In order to apply the same analogy of the case of omitting one cell, let us consider the cell structure for the case of missing two cells. Since the total number of cells is the sum of the observed cells and the missing cells, the noticeable difference from the case of omitting one cell is that the total number of cells as shown in Fig. 3.2 must be k ¼ (t þ 1) þ 1 ¼ t þ 2 where t ¼ 1, 2,, 8. By the cell structure, there can be three possibilities in omitting two cells. Hence the P{WD} can be completely considered on the basis of the following three cases. Denoting L by the cell of size 9/1t and S by the cell of size 1/1,

10 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. 316 Cho Figure 3.2. Cell structure and probabilities for omitting two cells. the probability of wrong decision P{WD} when we have two unobserved cells is given by PfWDg ¼ t þ 1 hl, L; Siþ t ðt þ 1ÞhL, S; Li 2 t þðtþ1þ hl, L; Li, ð3:1þ 2 where tðt þ 1ÞhL, S; Li, for instance, represents that there are t (t þ 1) possible cases of which one of L and S are missing cell and L is the stopping cell. Then we can express P{WD} in terms of multiple of C and D-integrals by Eq. (3.6). PfWDg ¼ t þ 1 2 D ð1þ 9=ðtþ1Þ Dð1Þ 9=ðtþ1Þ C ð1þ 9=ðtþ1Þð1, 1, c; cþ þðtþ1þtd ð1þ 1 Dð1Þ ðtþ1þ=9 C ðt 1Þ 1 ð1, 1, c; cþ t þðtþ1þ D ð1þ 1 2 Dð1Þ 1 C ðt 2Þ 1 C ð1þ ðtþ1þ=9ð1, 1, c, c; cþ: ð3:11þ Then we can express P{WD} in terms of multiple of C and D-integrals by Eq. (3.6) as follows. t ðt þ 1Þ ðtc þ 2Þ PfWDg ¼ 2 t ðcþ þ Q t 1 i¼1 xc 1 Z 1 9=ðtþ1Þ Z 1 9=ðtþ1Þ Z 9=ðtþ1Þ i dx i dy 1 dy 2 tcþ2 þ t ðt þ 1Þ 1þy 1 þy 2 þ P t 1 i¼1 x i Z 1 Z 1 1 ðtþ19þ=9 Z 1 ðt 1Þt ðt þ 1Þ 2 Z 1 ðtc þ 2Þ t ðcþ Q t 1 i¼1 xc 1 Z 9=ðtþ1Þ ð2 þ tcþ t ðcþ i dx i dy 1 dy 2 1 þ y 1 þ y 2 þ P t 1 i¼1 x i Z 1 Z 1 Z tcþ2 Z 1 Z ðtþ1þ=9 Q yc 1 3 dy t 2 3 i¼1 xc 1 i dx i dy 1 dy 2 1 þ y 1 þ y 2 þ P tcþ2 : ð3:12þ t 2 i¼1 x i þ y 3

11 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. Inverse-Type Sampling Procedure 317 After considerable algebra and integrations, we obtain PfWDg¼ tðt þ 1Þ 2 t þ 1 tc ðtcþ t þ 19 ðcþ t 1 ðcþ 9 c ðtcþ þ tðt þ 1Þ t þ 19 ðcþ t 1 ðcþ ( tðt 1Þðt þ 1Þ 1 ðtc cþ þ 2 3 c ðcþ t 2 ðcþ Xc t þ 1 c i 9 9 i¼1 Z 9=ð28þtÞ Z 9=ð28þtÞ Z 9=ðtþ19Þ Z t=ðtþ19þ Z 1=3 Z 9=ðtþ19Þ Z t=ðtþ19þ Q t 2 Q t 1 i¼1 yc 1 i dy i ð1 þ P t 1 i¼1 y iþ tc Q t 1 i¼1 yc 1 i dy i ð1 þ P t 1 i¼1 y iþ tc Z 1=3 i¼1 : yc 1 i dy i ð1 þ P t 2 i¼1 y iþ tc c 2c i ð2c iþ ðtc iþ 28 þ t ðcþ ðc i þ 1Þ t 2 ðcþ ð2c iþ Q ) t 2 i¼1 yc 1 i dy i ð1 þ P t 2 i¼1 y : ð3:13þ iþ tc i Using Eq. (3.1) this can be written in terms of C-functions as PfWDg¼ t ðtþ1þ tþ1 c C ðt 1Þ 9=ðtþ19Þ 2 tþ19 ðc; cþþtðtþ1þ 9 t þ 19 c C ðt 1Þ 9=ðtþ19Þðc; cþ ( tðt 1Þðt þ 1Þ þ 2 3 c C ðt 2Þ t þ 1 c i 27 c 1=3 ðc; cþ Xc 28 þ t 28 þ t i¼1! ) 2c 1 i C ðt 2Þ 9=ð28þtÞðc; 2c iþ : ð3:14þ c 1 Since the number of missing cells is 2 and total number of cells k ¼ t þ 2, the maximum value t can take is up to 8. The values of the stopping time and P{WD} are calculated and given in Table 3.1. We need to note importantly that three components in above expression have the identical C-functions when t ¼ 8 except coefficients of t. We know that if t ¼ 8, the configuration of cell structure becomes EPC Total Probability of Wrong Decision Since the events that exactly 1 cell is missing, exactly 2 cells are missing,, are mutually exclusive and P{WD} for more than two

12 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. 318 Cho cells missing are extremely small by using the same analysis as in Subsections and 3.3.2, the total P{WD} is PfWDg ¼ X k 1 PfWD, t ¼ k jg j¼1 ð3:15þ PfWD, k ¼ t 1gþPfWD, k ¼ t 2g: Therefore we can approximate the total P{WD} as being close to the sum of column (1) and column (2) in the Table 3.1 and indeed we note that the results are uniformly below.5, which satisfies P*-condition in the procedure, for each value of k. Therefore the proposed procedure is well justified by guaranteeing the P*-condition under the assumption. As we see in Table 3.1, and the probabilities of omitting two cells are uniformly smaller than 1 P* which is.5. We note that for two cells missing the probabilities are relatively very smaller than the probability of one missing cell and have a negligible effect on raising the probability of wrong decision. Also additionally, we find the stopping values and calculated the lower bound (LB) for the expected number of observations E(N) given t, i.e., minimum total number of observations for the case of P* ¼.99. Then we define the c-value of Table 3.1 to be the stopping values of the procedure R and obtain the following table: We note that we do not use any c-value for t ¼ 1 since we simply assert there must be exactly 1 cells under the "-condition. Illustration 3.1. Suppose we observe 4 distinct cells (t ¼ 4). Then from Table 3.2 we obtain c ¼ 6 for P* ¼.95. If we have observed at least Table 3.2. Values of the stopping constant c " as a function of t under procedure R, " ¼ :1. P* ¼.95 P* ¼.99 t c LB of E(N t) c LB of E(N t)

13 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. Inverse-Type Sampling Procedure 319 c ¼ 6 from each of the t ¼ 4 cells, we stop sampling and assert that there are exactly 4 cells. Under our "-condition, all p i s are.1, the probability that our decision is correct is at least.95. However, if we wish to have P{CD}.99, we get c ¼ 9 according to the table. This implies that there should be at least c ¼ 9 observations from each of the t ¼ 4 cells under the assumption " ¼ 1/1 to keep the level of confidence at least.99, which is guaranteed by the proposed procedure R. 4. EXPECTATIONS OF THE NUMBER OF OBSERVATIONS IN THE PROCEDURE 4.1. Expected Total Number of Observations Let N c denote the total number of observations at stopping time needed by the procedure R, which is an inverse sampling procedure. Now we denote the total number of true cells by b (in fact, b stands for blue cell). The stopping rule says that sampling is terminated when the minimum positive frequency reaches a tabled value c which depends upon the observed number of different cells t and prespecified P*. Clearly the total number of observations is random but its moments are functions of the true number of cells and P*; they do not depend on c or t. Before we derive an explicit expression for the expectation of N c,we introduce the general representation for g-th ascending factorial moment of the waiting time until every one of the b cells reaches its quota in a simpler setting than our problem; the quota for cell i is r i (i ¼ 1, 2,, b). Sobel et al. [11] (p. 62, Eq. (5.8)) have shown that the -th ascending factorial moment of the waiting time until the every one of the b cells reaches its quota is given by ½Š ¼ Xb ¼1 ðr þ Þ ðr Þp C ðb 1Þ p< >=p ðr < > ; r þ Þ; ð4:1þ here refers to the cell which is the last one to reach its quota, where p < > =p ¼ðp 1 =p, p 2 =p,, p 1 =p, p þ1 =p,, p b =p Þ; r < > denotes the vector of parameters (r-values) with the component r absent, i.e., (r 1, r 2,, r 1,r þ1,, r b ). Now we consider a certain useful compound event and in order to apply the above factorial moment to our situation. In the notations we have been using, we have N c ¼ f 1 þ f 2 þþf b, where f i (below we

14 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. 32 Cho also use x i for the value of f i ) is observed cell frequency in the i-th cell at stopping time, i ¼ 1, 2,, b. Lemma 4.1. Let X ¼ (X 1, X 2,, X b ) be the b-variate multinomial vector of frequencies with corresponding cell probabilities p ¼ ( p 1, p 2,, p b ) and b i¼1p i ¼ 1. Define an event E ¼ {at stopping time (a.s.t.) x ¼ c for the first time: x i c (i ¼ 1, 2,, 1) and x i ¼ (i ¼ þ 1, þ 2,, b)} which can occur under our inverse sampling procedure R. Let N c be the contribution to the total number of observations at stopping time, c þ x 1 þ x 2 þþx 1 ¼ c þ x 1 þ x 2 þþx 1 þ x þ1 þþx b, contributed by the event E. Then, writing E(N c ) for this contribution, the total expected sample size (a.s.t.) is a sum of terms of the form EðN c Þ¼ Xb c C ðb 1Þ p< >=p ðr < > ; c þ 1Þ ð4:2þ ¼1 p where p< >/p ¼ ( p 1 /p, p 2 /p,, p 1 /p, p þ 1/ p,, p b /p ) and we use the scalar c when all the r s are equal c to which is the minimum positive observed frequency at stopping time (our stopping constant). Proof. Let denote x i be the observed frequency at stopping time for i-th cell, i ¼ 1, 2,, b. There are now (b 1) blue cells and one stopping cell. Since N c ¼ c þ P b i¼1, i6¼ x i at stopping time, the expectation of N c becomes as follows: EðN c Þ¼ Xb ¼1 ðx 1 þ x 2 þþx b ÞPfa:s:t:X ¼ c, X t c, X j ¼ g ð4:3þ where i ¼ 1, 2,, 1, and j ¼ þ 1, þ 2,, b. Then we obtain 2 8 EðN c Þ¼ Xb X 1 X 1 X 1 c þ P 9 b j¼1, j6¼ x 3 < j! Y ¼1 x 1 ¼c x 2 ¼c x 1 ¼c c! Q b j¼1, j6¼ x p c b = 4 p x l 5 : l j! ; l¼1, l6¼ " ¼ Xb c X 1 X 1 X 1 p ¼1 x 1 ¼c x 2 ¼c x 1 ¼c 8 h ðc þ 1Þþ P i 9 b j¼1, j6¼ x 3 < j Y b = ðc þ 1Þ Q b j¼1, j6¼ ðx j þ 1Þ pcþ1 p x l : l 5: ð4:4þ ; l¼1, l6¼ Using the exact correspondence between the cumulative multinomial distribution and Dirichlet-Type II (i.e., the C-integral) in Olkin and Sobel [1] and Sobel et al., [11] we obtain Eq. (4.2). g

15 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. Inverse-Type Sampling Procedure 321 We note that we use the scalar c when all the r s are equal to c which is the minimum positive observed frequency at stopping time (our stopping constant). That is, from the multinomial interpretation of C-integrals (in fact, D-integrals have vanished due to some cells which have not appeared before the stopping time) and the expression in Eq. (3.6), the above result exactly coincides with the right hand side of Eq. (4.1). Example 4.1. Consider the case of k ¼ 2. The configuration of "-LFC for k ¼ 2 is {.9,.1}. Since the maximum number of missing cells is at most 1, the probability of the wrong decision will be exact in this case. From the Table 3.1, P{CD} ¼ 1 P{WD} ¼ ¼ From Eq. (4.2) we obtain 13ð1Þ EðN c Þ¼:9529 :9 Cð1Þ 1=9ð13, 14Þþ13ð1Þ :1 Cð1Þ 9 ð13, 14Þ 29ð1Þ :9 Dð1Þ 1=9ð1, 3Þþ29ð1Þ :1 Dð1Þ 9 ð1, 3Þ þ :471 29ð1Þ :9 Dð1Þ 1=9ð1, 3Þþ29ð1Þ :1 Dð1Þ 9 ð1, 3Þ ¼ 122:6396: This E (N c k ¼ 2) is quite agreeable with the result we obtain from Monte Carlo simulation which is in Table 5.1. Example 4.2. Consider the case of k ¼ 4. The configuration of "-LFC for k ¼ 4 is {.3,.3,.3,.1}. The number of missing cells can be more than 1, so we approximate the wrong decision based on one or two missing. From the Table 3.1, P{CD} ¼ ¼ Similarly, from Eq. (4.2) we obtain 6ð1Þ EðN c Þ¼:95818 :3 Cð1Þ 1=3 Cð1Þ 1 ð6, 6; 7Þþ6ð1Þ :1 Cð3Þ 3 ð6; 7Þ 8ð3Þ :3 Dð1Þ 1=3 Cð2Þ 1 ð1, 8; 9Þþ8ð6Þ :3 Dð1Þ 1 Cð1Þ 1 Cð1Þ 1=3ð1, 8, 8; 9Þ þ 8ð3Þ :1 Dð1Þ 3 Cð2Þ 3 ð1, 8; 9Þ þ :471 8ð3Þ :3 Dð1Þ 1=3 Cð2Þ 1 ð1, 8; 9Þ þ 8ð6Þ :3 Dð1Þ 1 Cð1Þ ¼ 56:9194: 1 Cð1Þ 1=3 ð1, 8, 8; 9Þþ8ð3Þ :1 Dð1Þ 3 Cð2Þ 3 ð1, 8; 9Þ

16 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. 322 Cho This E(N c k ¼ 4) is comparable with the result we obtain from Monte Carlo simulation which is in Table Variance of the Total Number of Observations The first ascending factorial moment ( ¼ 1) is clearly the mean ¼ E(N c ). Since [2] ¼ E{N c (N c þ 1)} is the second ascending factorial moment of N c, the variance 2 is obtained from the relationship VARðN c Þ¼ ½2Š ð1 þ Þ: ð4:5þ 5. SIMULATION STUDIES In the Monte Carlo experimentation, we used several configurations for each value of k, (k ¼ 2, 4, 8) depending on the number t of cells that have been observed. We calculated the P{CD} for each configuration and several other configuration as well Monte Carlo Experimentation The results of the Monte Carlo simulation are summarized in the following tables, which show the probability of correct decision, P{CD}, Table 5.1. For k ¼ 2; P{CD} P, P ¼.95, " ¼.1. Configuration P{CD} E(N) s.e. #WD ð 1 2, 1 2Þ ð 9 1, 1 1Þ ð 8 Þ , 2 1 Table 5.2. For k ¼ 4; P{CD} P, P ¼.95, " ¼.1. Configuration P{CD} E(N) s.e. #WD ð 1 4, 1 4, 1 4, 1 4Þ ð 3 1, 3 1, 3 1, 1 1Þ ð 4 1, 4 1, 1 1, 1 1Þ ð 7 1, 1 1, 1 1, 1 1Þ

17 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. Inverse-Type Sampling Procedure 323 Table 5.3. For k ¼ 8; P{CD} P, P ¼.95, " ¼.1. Configuration P{CD} E(N) s.e. #WD ð 1 8, 1 8, 1 8, 1 8, 1 8, 1 8, 1 8, 1 8Þ ð 9 7, 9 7, 9 7, 9 7, 9 7, 9 7, 9 7, 1 1Þ ð 2 15, 2 15, 2 15, 2 15, 2 15, 2 15, 1 1, 1 1Þ ð 7 5, 7 5, 7 5, 7 5, 7 5, 1 1, 1 1, 1 1Þ ð 3 2, 3 2, 3 2, 3 2, 1 1, 1 1, 1 1, 1 1Þ ð 5 3, 5 3, 5 3, 1 1, 1 1, 1 1, 1 1, 1 1Þ ð 2 1, 2 1, 1 1, 1 1, 1 1, 1 1, 1 1, 1 1Þ ð 3 1, 1 1, 1 1, 1 1, 1 1, 1 1, 1 1, 1 1Þ the average number of observations, E(N), its standard error, s.e., and the number of wrong decisions in the experiments. Every row in each of Table corresponds to 1, independent experiments. REFERENCES 1. Mann, C. Extinction: Are ecologists crying wolf? Science 1991, 253, Marchand, J.; Schroeck, F., Jr. On estimation of the number of equally likely classes in a population. Commun. Statist. 1982, 11, Arnold, B.; Beaver, R. Estimation of the number of classes in a population. Biometrical Journal 1988, 3, McNeil, D. Estimating an author s vocabulary. J. Ameri. Statist. Assoc. 1973, 68, Efron, B.; Thisted, R. Estimating the number of unseen species: how many words did Shakespeare know? Biometrika 1976, 63, Goodman, L. On the estimation of the number of classes in a population. Ann. Math. Statist. 1949, 2, Bunge, J.; Fitzpatrick, M. Estimating the number of species: a review. J. Ameri. Statist. Assoc. 1993, 88, Boender, C.; Rinnooy Kan, A. A Bayesian analysis of the number of cells of a multinomial distribution. The Statistician 1983, 32, Gibbons, J.; Olkin, I.; Sobel, M. Selecting and Ordering Population: A New Statistical Methodology; John Wiley and Sons: New York, 1977.

18 MARCEL DEKKER, INC. 27 MADISON AVENUE NEW YORK, NY Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc. 324 Cho 1. Olkin, I.; Sobel, M. Integral expression for tail probabilities of the multinomial and negative multinomial distributions. Biometrika 1965, 52, Sobel, M.; Uppuluri, V.; Frankowski, K. Selected Tables in Mathematical Statistics, Vol. IX-Dirichlet Integrals of Type II and Their Applications; Edited by IMS. American Mathematical Society, Providence: Rhode Island, Received August 22 Recommended by B. Eisenberg

Information Matrix for Pareto(IV), Burr, and Related Distributions

Information Matrix for Pareto(IV), Burr, and Related Distributions MARCEL DEKKER INC. 7 MADISON AVENUE NEW YORK NY 6 3 Marcel Dekker Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker