RJ 10025 (90521) May 29, 1996 (Revised 3/20/98)
Computer Science Research Report

ESTIMATING THE NUMBER OF CLASSES IN A FINITE POPULATION

Peter J. Haas, IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099

Lynne Stokes, Department of Management Science and Information Systems, University of Texas, Austin, TX 78712

LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

IBM Research Division: Yorktown Heights, New York; San Jose, California; Zurich, Switzerland

ESTIMATING THE NUMBER OF CLASSES IN A FINITE POPULATION

Peter J. Haas, IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099; e-mail: peterh@almaden.ibm.com

Lynne Stokes, Department of Management Science and Information Systems, University of Texas, Austin, TX 78712; e-mail: lstokes@mail.utexas.edu

ABSTRACT: We use an extension of the generalized jackknife approach of Gray and Schucany to obtain new nonparametric estimators for the number of classes in a finite population of known size. We also show that generalized jackknife estimators are closely related to certain Horvitz-Thompson estimators, to an estimator of Shlosser, and to estimators based on sample coverage. In particular, the generalized jackknife approach leads to a modification of Shlosser's estimator that does not suffer from the erratic behavior of the original estimator. The performance of both new and previous estimators is investigated by means of an asymptotic variance analysis and a Monte Carlo simulation study.

Keywords: jackknife, sample coverage, number of species, number of classes, database, census

1. Introduction

The problem of estimating the number of classes in a population has been studied for many years. A recent review article (Bunge and Fitzpatrick 1993) lists more than 125 references. In this article, we consider an important special case of the general problem: estimating the number of classes in a finite population of known size. Only a handful of papers have addressed this problem and none has reached an entirely satisfactory solution, despite the fact that the first attempt at a solution appeared in the statistical literature nearly 50 years ago (Mosteller 1949). The problem we consider has arisen in the literature in a variety of applications, including the following.

(i) In a company-sponsored contest, many entries (say several hundred thousand) have been received. It is known that some people have entered more than once. The goal is to estimate the number of different people who have entered from a sample of entries (Mosteller 1949; Sudman 1976).

(ii) A sampling frame is constructed by combining a number of lists that may contain overlapping entries. It is desired to estimate, using a sample from all lists, the number of units on the combined list (Deming and Glasser 1959; Goodman 1952; Kish 1965, Sec. 11.2; Sudman 1976, Sec. 3.6). An important example of such a problem is an "administrative records census," currently under study by the U.S. Bureau of the Census. In such a census, several administrative files (such as AFDC or IRS records) are combined, and the total number of distinct individuals included in the combined file is determined. Exact computation of the number of distinct individuals in the combined file is extremely expensive because of the high cost of determining the number of duplicated entries. A similar problem and proposed solution was discussed in the London Financial Times (March 2, 1949) by C. F. Carter, who was interested in estimating the number of different investors in British industrial stocks based on samples from share registers of companies (Mosteller 1949).

(iii) In a relational database system, data are organized in tables called relations (see, e.g., Korth and Silberschatz 1991, Chap. 3). In a typical relation, each row might represent a record for an individual employee in a company, and each column might correspond to a different attribute of the employee, such as salary, years of experience, department number, and so forth. A relational query specifies an output relation that is to be computed from the set of base relations stored by the system. Knowledge of the number of distinct values for each attribute in the base relations is central to determining the most efficient method for computing a specified output relation (Hellerstein and Stonebraker 1994; Selinger, Astrahan, Chamberlin, Lorie, and Price 1979).

The size of the base relations in modern database systems often is so large that exact computation of the distinct-value parameters is prohibitively expensive, and thus estimation of these parameters is desired (Astrahan, Schkolnick, and Whang 1987; Flajolet and Martin 1985; Gelenbe and Gardy 1982; Hou, Ozsoyoglu, and Taneja 1988, 1989; Naughton and Seshadri 1990; Ozsoyoglu, Du, Tjahjana, Hou, and Rowland 1991; Whang, Vander-Zanden, and Taylor 1990).

In each of these applications, the size of the population (number of contest entries, total number of units over all lists, and number of rows in the base relation) is known, and this size is too large for easy computation of the number of classes.

The problem studied in this article can be described formally as follows. A population of size N consists of D mutually disjoint classes of items, labelled C_1, C_2, ..., C_D. Define N_j to be the size of class C_j, so that N = \sum_{j=1}^{D} N_j. A simple random sample of n items is selected (without replacement) from the population. This sample includes n_j items from class C_j. The problem we consider is that of estimating D using information from the sample along with knowledge of the value of N. We denote by F_i the number of classes of size i in the population, so that D = \sum_{i=1}^{N} F_i. Similarly, we denote by f_i the number of classes represented exactly i times in the sample and by d the total number of classes represented in the sample. Thus d = \sum_{i=1}^{n} f_i and \sum_{i=1}^{n} i f_i = n. Define vectors N = (N_1, N_2, ..., N_D), n = (n_1, n_2, ..., n_D), and f = (f_1, f_2, ..., f_n). Note that n is not observable, but f is. Because we sample without replacement, the random vector n has a multivariate hypergeometric distribution with probability mass function

$$P(n \mid D, N) = \frac{\binom{N_1}{n_1}\binom{N_2}{n_2}\cdots\binom{N_D}{n_D}}{\binom{N}{n}}. \qquad (1)$$

The probability mass function of the observable random vector f is simply P(n | D, N) summed over all points n that correspond to f:

$$P(f \mid D, N) = \sum_{n \in S} P(n \mid D, N), \qquad \text{where } S = \{\, n : \#\{j : n_j = i\} = f_i \text{ for each } i \,\}.$$

The probability mass function P(f | D, N) does not have a closed-form expression in general.

In Section 2 we review the estimators that have been proposed for estimating D from data generated under model (1). In Section 3 we provide several new estimators of D based on an extension of the generalized jackknife approach of Gray and Schucany (1972). We then show that generalized jackknife estimators of the number of classes in a population are closely related to certain "Horvitz-Thompson" estimators, to an estimator due to Shlosser (1981), and to estimators based on the notion of "sample coverage" (Chao and Lee 1992). In Section 4 we provide and compare approximate expressions for the asymptotic variance of several of the estimators, and in Section 5 apply our formulas to a well-known example from the literature. We provide a simulation-based empirical comparison of the various estimators in Section 6, and summarize our results and give recommendations in Section 7.
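All of the estimators discussed below depend on the sample only through n, d, and the "frequency of frequencies" f_1, f_2, .... As a point of reference, here is a minimal sketch (in Python; the function name and interface are ours, not part of the paper) of how these quantities can be computed from a list of sampled class labels:

```python
from collections import Counter

def frequency_profile(sample):
    """Given a list of class labels drawn from the population, return
    (n, d, f) where n is the sample size, d is the number of distinct
    classes observed, and f is a dict mapping i -> f_i, the number of
    classes seen exactly i times."""
    counts = Counter(sample)          # n_j for each observed class
    f = Counter(counts.values())      # f_i = #{j : n_j = i}
    return len(sample), len(counts), dict(f)

# Example: 7 observations from 4 distinct classes
n, d, f = frequency_profile(["a", "b", "a", "c", "d", "d", "d"])
# n = 7, d = 4, f = {1: 2, 2: 1, 3: 1}  (f_1 = 2, f_2 = 1, f_3 = 1)
```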

2. Previous Estimators

Bunge and Fitzpatrick (1993) mention only two non-Bayesian estimators that have been developed as estimators of D under model (1). These are the estimators of Goodman (1949) and Shlosser (1981). Goodman proved that

$$\hat{D}_{Good1} = d + \sum_{i=1}^{n} (-1)^{i+1}\, \frac{(N-n+i-1)!\,(n-i)!}{(N-n-1)!\,n!}\, f_i$$

is the unique unbiased estimator of D when n > M := max(N_1, N_2, ..., N_D). He further proved that no unbiased estimator of D exists when n ≤ M. Unfortunately, unless the sampling fraction is quite large, the variance of D̂_Good1 is so great and the numerical difficulties encountered when computing D̂_Good1 are so severe that the estimator is unusable. Goodman, who made note of the high variance of D̂_Good1 himself, suggested the alternative estimator

$$\hat{D}_{Good2} = N - \frac{N(N-1)}{n(n-1)}\, f_2$$

for overcoming the variance problem. Although D̂_Good2 has lower variance than D̂_Good1, it can take on negative values and can have a large bias for any n if D is small. For example, consider the case in which D = 1 and n > 2, and observe that f_2 = 0 and D̂_Good2 = N.

Under the assumption that the population size N is large and the sampling fraction q = n/N is nonnegligible, Shlosser (1981) derived the estimator

$$\hat{D}_{Sh} = d + f_1\, \frac{\sum_{i=1}^{n} (1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i}.$$

For the two examples considered in his paper, Shlosser found that use of D̂_Sh with a 10% sampling fraction resulted in an error rate below 20%. In our experiments, however, we observed root mean squared errors (rmse's) exceeding 200%, even for well-behaved populations with relatively little variation among the class sizes (see Sec. 6). Considering the relationship between D̂_Sh and generalized jackknife estimators (see Sec. 3.4) provides insight into the source of this erratic behavior and suggests some possible modifications of D̂_Sh to improve performance.
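For illustration, D̂_Sh can be transcribed almost directly into code. The sketch below assumes the frequency-of-frequencies dictionary produced by the helper sketched in Section 1; the function name is hypothetical:

```python
def shlosser(f, n, N):
    """Shlosser's estimator:
    D_Sh = d + f_1 * sum((1-q)^i f_i) / sum(i q (1-q)^(i-1) f_i),
    where q = n/N is the sampling fraction and f maps i -> f_i."""
    q = n / N
    d = sum(f.values())
    num = sum((1 - q) ** i * fi for i, fi in f.items())
    den = sum(i * q * (1 - q) ** (i - 1) * fi for i, fi in f.items())
    return d + f.get(1, 0) * num / den
```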

In related work, Burnham and Overton (1978, 1979) proposed a family of (traditional) generalized jackknife estimators for estimating the size of a closed population when capture probabilities vary among animals. The D individuals in the population play the role of our D classes; a given individual can appear up to n times in the overall sample if captured on one or more of n possible trapping occasions. The capture probability for an individual is assumed to be constant over time, and the capture probabilities for the D individuals are modeled as D iid random samples from a fixed probability distribution. Burnham and Overton's sample design is clearly different from model (1). Under the Burnham and Overton model, for example, the quantities f_1, f_2, ..., f_n have a joint multinomial distribution. Closely related to the work of Burnham and Overton are the ordinary jackknife estimators of the number of species in a closed region developed by Heltshe and Forrester (1983) and Smith and van Belle (1984). The sample data consist of a list of the species that appear in each of n quadrats. (The number of times that a species is represented in a quadrat is not recorded.) This setup is essentially identical to that of Burnham and Overton, with the D species playing the role of the D individuals and the n quadrats playing the role of the n trapping occasions.

3. Generalized Jackknife Estimators

In this section we outline an extension of the generalized jackknife approach to bias reduction and then use this approach to derive new estimators for the number of classes in a finite population. We also point out connections between our generalized jackknife approach and several other estimation approaches in the literature.

3.1. The Generalized Jackknife Approach

Let θ be an unknown real-valued parameter. A generalized jackknife estimator of θ is an estimator of the form

$$G(\hat\theta_1, \hat\theta_2) = \frac{\hat\theta_1 - R\,\hat\theta_2}{1 - R}, \qquad (2)$$

where θ̂_1 and θ̂_2 are biased estimators of θ and R (≠ 1) is a real number (Gray and Schucany 1972). The idea underlying the generalized jackknife approach is to try and choose R such that G(θ̂_1, θ̂_2) has lower bias than either θ̂_1 or θ̂_2. To motivate the choice of R, observe that for

$$R = \frac{E[\hat\theta_1] - \theta}{E[\hat\theta_2] - \theta}, \qquad (3)$$

the estimator G(θ̂_1, θ̂_2) is unbiased for θ. This optimal value of R is typically unknown, however, and can only be approximated, resulting in bias reduction but not complete bias elimination. In the following, we extend the original definition of the generalized jackknife given by Gray and Schucany (1972) by allowing R to depend on the data; that is, we allow R to be random.

Recall that d is the number of classes represented in the sample. Write d_n for d to emphasize the dependence of d on the sample size n, and denote by d_{n-1}(k) the number of classes represented in the sample after the kth observation has been removed. Set

$$d_{n(-1)} = \frac{1}{n}\sum_{k=1}^{n} d_{n-1}(k).$$

We focus on generalized jackknife estimators that are obtained by taking θ̂_1 = d_n and θ̂_2 = d_{n(-1)} in (2); these are the usual choices for θ̂_1 and θ̂_2 in the classical first-order jackknife estimator (Miller 1974). Observe that d_{n-1}(k) = d_n − 1 if the class for the kth observation is represented only once in the sample; otherwise, d_{n-1}(k) = d_n. Thus d_{n(-1)} = d_n − (f_1/n) and, by (2), G(θ̂_1, θ̂_2) = D̂, where

$$\hat{D} = d_n + K\,\frac{f_1}{n} \qquad (4)$$

and K = R/(1 − R). It follows from (3) that the optimal choice of K is

$$K = \frac{E[d_n] - D}{E[d_{n(-1)}] - E[d_n]} = \frac{D - E[d_n]}{E[f_1]/n}. \qquad (5)$$

To derive a more explicit formula for K, denote by I[A] the indicator of event A and observe that

$$E[d_n] = E\Bigl[\sum_{j=1}^{D} I[n_j > 0]\Bigr] = \sum_{j=1}^{D} P\{n_j > 0\} = D - \sum_{j=1}^{D} P\{n_j = 0\}.$$

Similar reasoning shows that

$$E[f_1] = \sum_{j=1}^{D} P\{n_j = 1\}, \qquad (6)$$

so that

$$K = \frac{n\sum_{j=1}^{D} P\{n_j = 0\}}{\sum_{j=1}^{D} P\{n_j = 1\}}. \qquad (7)$$

Following Shlosser (1981), we focus on the case in which the population size N is large and the sampling fraction q = n/N is nonnegligible, and we make the approximation

$$P\{n_j = k\} \approx \binom{N_j}{k} q^k (1-q)^{N_j - k} \qquad (8)$$

for 0 ≤ k ≤ n and 1 ≤ j ≤ D. That is, the probability distribution of each n_j is approximated by the probability distribution of n_j under a Bernoulli sample design in which each item is included in the sample with probability q, independently of all other items in the population. Use of this approximation leads to estimators that behave almost identically to estimators derived using the exact distribution of n but are simpler to compute and derive (see App. A for further discussion). Substituting (8) into (7), we obtain

$$K \approx \frac{n\sum_{j=1}^{D} (1-q)^{N_j}}{\sum_{j=1}^{D} N_j q (1-q)^{N_j - 1}}. \qquad (9)$$

The quantity K defined in (9) depends on unknown parameters N_1, N_2, ..., N_D that are difficult to estimate. Our approach is to approximate K by a function of D and of other parameters that are easier to estimate, thereby obtaining an approximate version of (4). The estimates for these parameters, including D̂ for D, are then substituted into the approximate version of (4) and the resulting equation is solved for D̂.

We also consider "smoothed" jackknife estimators. The idea is to replace the quantity f_1/n in (4) by its expected value E[f_1]/n in the hope that the resulting estimator of D will be more stable than the original "unsmoothed" estimator. As with the parameter K, the quantity E[f_1]/n depends on the unknown parameters N_1, N_2, ..., N_D; see (6) and (8). Thus our approach to estimating E[f_1]/n is the same as our approach to estimating K.

Estimators also can be based on high-order jackknifing schemes that consider the number of distinct values in the sample when two elements are removed, when three elements are removed, and so forth. Typically, using a high-order jackknifing scheme requires estimating high-order moments (skewness, kurtosis, and so forth) of the set of numbers {N_1, N_2, ..., N_D}. Initial experiments indicated that the reduction in estimation error due to using the high-order jackknife is outweighed by the increase in error due to uncertainty in the moment estimates. Thus we do not pursue high-order jackknife schemes further.

3.2. The Estimators

Different approximations for K and E[f_1]/n lead to different estimators for D. Here we develop a number of the possible estimators.

3.2.1. First-Order Estimators

The simplest estimators of D can be derived using a first-order approximation to K. Specifically, approximate each N_j in (9) by the average value

$$\bar{N} = \frac{1}{D}\sum_{j=1}^{D} N_j = \frac{N}{D}$$

and substitute the resulting expression for K into (4) to obtain

$$\hat{D} = d + \frac{(1-q) f_1 D}{n}. \qquad (10)$$

Now substitute D̂ for D on the right side of (10) and solve for D̂. The resulting solution, denoted by D̂_uj1, is given by

$$\hat{D}_{uj1} = \Bigl(1 - \frac{(1-q) f_1}{n}\Bigr)^{-1} d. \qquad (11)$$

We refer to this estimator as the "unsmoothed first-order jackknife estimator."
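Equation (11) is a one-line computation. A minimal sketch, with f again a dict mapping i to f_i (the function name is ours):

```python
def d_uj1(f, n, N):
    """Unsmoothed first-order jackknife estimator (11):
    D_uj1 = d / (1 - (1-q) f_1 / n), with q = n/N."""
    q = n / N
    d = sum(f.values())
    return d / (1.0 - (1.0 - q) * f.get(1, 0) / n)
```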

To derive a "smoothed first-order jackknife estimator," observe that by (6) and (8),

$$E[f_1] \approx \sum_{j=1}^{D} N_j q (1-q)^{N_j - 1}. \qquad (12)$$

Approximating each N_j in (12) by N̄, we have

$$\frac{E[f_1]}{n} \approx (1-q)^{\bar{N} - 1}. \qquad (13)$$

On the right side of (10), replace f_1/n with the approximate expression for E[f_1]/n given in (13), yielding

$$\hat{D} = d + D(1-q)^{\bar{N}}.$$

Replacing D with D̂ and N̄ with N/D̂ in the foregoing expression leads to the relation

$$\hat{D}\bigl(1 - (1-q)^{N/\hat{D}}\bigr) = d.$$

We define the smoothed first-order jackknife estimator D̂_sj1 as the value of D̂ that solves this equation. Given d, n, and N, D̂_sj1 can be computed numerically using standard root-finding procedures. Observe that if in fact N_1 = N_2 = ... = N_D = N/D, then

$$E[d_n] \approx D\bigl(1 - (1-q)^{N/D}\bigr).$$

In this case D̂_sj1 can be viewed as a simple method-of-moments estimator obtained by replacing E[d_n] with the estimate d and solving for D. If, moreover, the sampling fraction q is small enough so that the distribution of (n_1, n_2, ..., n_D) is approximately multinomial (see Sec. 3.3), then D̂_sj1 is approximately equal to the maximum likelihood estimator for D (see Good 1950). Observe that both D̂_uj1 and D̂_sj1 are consistent for D: D̂_uj1 → D and D̂_sj1 → D as q → 1.
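Because the defining equation for D̂_sj1 has no closed form, it must be solved numerically. The left side minus d is negative at D = d and nonnegative at D = N, so a simple bisection over [d, N] suffices; the sketch below is our own illustration of that idea, not the authors' program:

```python
def d_sj1(f, n, N, tol=1e-9):
    """Smoothed first-order jackknife estimator: the value D solving
    D * (1 - (1-q)^(N/D)) = d, found by bisection on [d, N]."""
    q = n / N
    d = sum(f.values())
    g = lambda D: D * (1.0 - (1.0 - q) ** (N / D)) - d
    lo, hi = float(d), float(N)
    if g(hi) <= 0.0:            # all-singleton sample (d = n): estimate is N
        return hi
    while hi - lo > tol * N:
        mid = 0.5 * (lo + hi)
        (lo, hi) = (mid, hi) if g(mid) < 0.0 else (lo, mid)
    return 0.5 * (lo + hi)
```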

3.2.2. Second-Order Estimators

A second-order approximation to K can be derived as follows. Denote by γ² the squared coefficient of variation of the class sizes N_1, N_2, ..., N_D:

$$\gamma^2 = \frac{(1/D)\sum_{j=1}^{D}(N_j - \bar{N})^2}{\bar{N}^2}. \qquad (14)$$

Suppose that γ² is relatively small, so that each N_j is close to the average value N̄. Substitute the Taylor approximations

$$(1-q)^{N_j} \approx (1-q)^{\bar{N}} + (1-q)^{\bar{N}}\ln(1-q)\,(N_j - \bar{N})$$

and

$$N_j q (1-q)^{N_j - 1} \approx N_j q\bigl[(1-q)^{\bar{N}-1} + (1-q)^{\bar{N}-1}\ln(1-q)\,(N_j - \bar{N})\bigr]$$

for 1 ≤ j ≤ D into (9) to obtain

$$K \approx D(1-q)\,\frac{1}{1 + \ln(1-q)\,\bar{N}\gamma^2} \approx D(1-q)\bigl(1 - \ln(1-q)\,\bar{N}\gamma^2\bigr). \qquad (15)$$

The unknown parameter γ² can be estimated using the following approach (cf. Chao and Lee 1992). With the usual convention that \binom{n}{m} = 0 for n < m, we find that

$$\sum_{i=1}^{N} i(i-1)E[f_i] \approx \sum_{i=1}^{N} i(i-1)\sum_{j=1}^{D}\binom{N_j}{i} q^i (1-q)^{N_j - i}
= q^2 \sum_{j=1}^{D} N_j(N_j-1)\sum_{i=2}^{N_j}\binom{N_j-2}{i-2} q^{i-2}(1-q)^{N_j-i}
= q^2 \sum_{j=1}^{D} N_j(N_j-1),$$

so that

$$\gamma^2 \approx \frac{D}{n^2}\sum_{i=1}^{N} i(i-1)E[f_i] + \frac{D}{N} - 1.$$

Thus if D were known, then a natural method-of-moments estimator γ̂²(D) of γ² would be

$$\hat\gamma^2(D) = \max\Bigl\{0,\ \frac{D}{n^2}\sum_{i=1}^{n} i(i-1)f_i + \frac{D}{N} - 1\Bigr\}. \qquad (16)$$

To develop a second-order estimate of D, substitute (15) into (4) to obtain

$$\hat{D} = d + \frac{D f_1(1-q)}{n}\bigl(1 - \ln(1-q)\,\bar{N}\gamma^2\bigr), \qquad (17)$$

from which it follows that

$$\hat{D} = d + \frac{D f_1(1-q)}{n} - \frac{f_1(1-q)\ln(1-q)}{q}\,\gamma^2.$$

Replacing D with D̂ on the right side of this equation and solving for D̂ yields the relation

$$\hat{D}\Bigl(1 - \frac{f_1(1-q)}{n}\Bigr) = d - \frac{f_1(1-q)\ln(1-q)}{q}\,\gamma^2. \qquad (18)$$
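Estimator (16) is again straightforward to compute from the frequency profile; a small sketch under the same conventions as the earlier helpers (function name ours):

```python
def gamma_sq_hat(D, f, n, N):
    """Method-of-moments estimator (16) of the squared coefficient of
    variation of the class sizes:
    max(0, (D/n^2) * sum_i i(i-1) f_i + D/N - 1)."""
    s = sum(i * (i - 1) * fi for i, fi in f.items())
    return max(0.0, D * s / (n * n) + D / N - 1.0)
```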

An estimator of D can be obtained by substituting γ̂²(D̂) for γ² in (18) and solving for D̂ numerically. Alternatively, we can start with a simple initial estimator of D and then correct this estimator using (18). Following this latter approach, we use D̂_uj1 as our initial estimator and define

$$\hat{D}_{uj2} = \Bigl(1 - \frac{f_1(1-q)}{n}\Bigr)^{-1}\Bigl(d - \frac{f_1(1-q)\ln(1-q)}{q}\,\hat\gamma^2(\hat{D}_{uj1})\Bigr).$$

A smoothed second-order jackknife estimator can be obtained by replacing the expression f_1/n in (17) with the approximation to E[f_1]/n given in (13), leading to

$$\hat{D} = d + D(1-q)^{\bar{N}}\bigl(1 - \ln(1-q)\,\bar{N}\gamma^2\bigr).$$

Replacing D with D̂ and proceeding as before, we obtain the estimator

$$\hat{D}_{sj2} = \bigl(1 - (1-q)^{\tilde{N}}\bigr)^{-1}\bigl(d - (1-q)^{\tilde{N}}\ln(1-q)\,N\,\hat\gamma^2(\hat{D}_{uj1})\bigr),$$

where Ñ = N/D̂_uj1. As with the first-order estimators D̂_uj1 and D̂_sj1, the second-order estimators D̂_uj2 and D̂_sj2 are consistent for D.
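Putting (11), (16), and (18) together gives the unsmoothed second-order estimator. The following sketch folds the three steps into one function; it assumes q < 1 and is our own transcription, not the authors' code:

```python
import math

def d_uj2(f, n, N):
    """Unsmoothed second-order jackknife estimator: start from D_uj1,
    estimate gamma^2 at D_uj1, then apply the bias correction in (18)."""
    q = n / N
    d = sum(f.values())
    f1 = f.get(1, 0)
    D1 = d / (1.0 - (1.0 - q) * f1 / n)                      # D_uj1
    g2 = max(0.0, D1 * sum(i * (i - 1) * fi for i, fi in f.items()) / n**2
             + D1 / N - 1.0)                                  # gamma^2 hat
    correction = f1 * (1.0 - q) * math.log(1.0 - q) * g2 / q  # negative quantity
    return (d - correction) / (1.0 - (1.0 - q) * f1 / n)
```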

3.2.3. Horvitz-Thompson Jackknife Estimators

In this section we discuss an alternative approach to estimation of K based on a technique of Horvitz and Thompson. (See Särndal, Swensson, and Wretman 1992 for a general discussion of Horvitz-Thompson estimators.)

First, consider the general problem of estimating a parameter of the form θ(g) = \sum_{j=1}^{D} g(N_j), where g is a specified function. Observe that because P{n_j > 0} > 0 for 1 ≤ j ≤ D, we have θ(g) = E[X(g)], where

$$X(g) = \sum_{j=1}^{D} \frac{g(N_j)\, I(n_j > 0)}{P\{n_j > 0\}} = \sum_{j:\, n_j > 0} \frac{g(N_j)}{P\{n_j > 0\}}.$$

It follows from (8) that P{n_j > 0} ≈ 1 − (1−q)^{N_j}, and the foregoing discussion suggests that we estimate θ(g) by

$$\hat\theta(g) = \sum_{j:\, n_j > 0} \frac{g(\hat{N}_j)}{1 - (1-q)^{\hat{N}_j}}, \qquad (19)$$

where N̂_j is an estimator for N_j. The key point is that we need to estimate N_j only when n_j > 0. To do this, observe that

$$E[n_j \mid n_j > 0] = \frac{E[n_j]}{P\{n_j > 0\}} \approx \frac{q N_j}{1 - (1-q)^{N_j}}.$$

Replacing E[n_j | n_j > 0] with n_j leads to the estimating equation

$$n_j = \frac{q N_j}{1 - (1-q)^{N_j}}, \qquad (20)$$

and a method-of-moments estimator N̂_j can be defined as the value of N_j that solves (20). Now consider the problem of estimating K, and hence D. By (9), K ≈ θ(f)/θ(g), where f(x) = (1−q)^x and g(x) = xq(1−q)^{x−1}/n. Thus a natural estimator of K is given by θ̂(f)/θ̂(g), leading to the final estimator

$$\hat{D}_{HTj} = d + \frac{\hat\theta(f)}{\hat\theta(g)}\,\frac{f_1}{n}.$$

A smoothed variant of D̂_HTj can be obtained by replacing f_1/n with the Horvitz-Thompson estimator of E[f_1]/n, namely θ̂(g). The resulting estimator, denoted by D̂_HTsj, is given by

$$\hat{D}_{HTsj} = d + \hat\theta(f).$$

Finally, a hybrid estimator can be obtained using a first-order approximation for the numerator of K and a Horvitz-Thompson estimator for the denominator. This leads to the estimator D̂_hj, defined as the solution D̂ of the equation

$$\hat{D}\Bigl(1 - \frac{f_1(1-q)^{N/\hat{D}}}{n\,\hat\theta(g)}\Bigr) = d.$$

If we replace f_1/n with the Horvitz-Thompson estimator for E[f_1]/n in the foregoing equation in order to obtain a smoothed variant of D̂_hj, then the resulting estimator coincides with D̂_sj1. Because D = θ(u), where u(x) ≡ 1, it may appear that a "non-jackknife" Horvitz-Thompson estimator D̂_HT can be defined by setting D̂_HT = θ̂(u). It is straightforward to show, however, that D̂_HT = D̂_HTsj, so that D̂_HT can in fact be viewed as a smoothed jackknife estimator.

Simulation experiments indicate that the behavior of the Horvitz-Thompson jackknife estimators D̂_HTj and D̂_HTsj is erratic (see App. D for detailed results). Overall, the poor performance of D̂_HTj and D̂_HTsj is caused by inaccurate estimation of θ̂(f). The problem seems to be that when N_j is small, the estimator N̂_j is unstable and yet typically has a large effect on the value of θ̂(f) through the term (1−q)^{N̂_j}/(1 − (1−q)^{N̂_j}). The estimator D̂_hj uses a Taylor approximation in place of θ̂(f) and hence has lower bias and rmse than the other two Horvitz-Thompson jackknife estimators. However, other estimators perform better than D̂_hj, and we do not consider the estimators D̂_HTj, D̂_HTsj, and D̂_hj further.
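Equation (20) also has no closed-form solution, but its right-hand side is monotone in N_j and the root always lies in the bracket [n_j, n_j/q], so bisection again works. A sketch, assuming q < 1 and n_j ≥ 1 (function name ours):

```python
def class_size_mme(nj, q, tol=1e-9):
    """Method-of-moments estimate of N_j: the value solving
    n_j = q N_j / (1 - (1-q)^N_j), found by bisection on [n_j, n_j/q]."""
    g = lambda Nj: q * Nj / (1.0 - (1.0 - q) ** Nj) - nj
    lo, hi = float(nj), nj / q
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        (lo, hi) = (mid, hi) if g(mid) < 0.0 else (lo, mid)
    return 0.5 * (lo + hi)
```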

3.3. Relation to Estimators Based on Sample Coverage

The generalized jackknife approach for deriving an estimator of D works for sample designs other than hypergeometric sampling. For example, the most thoroughly studied version of the number-of-classes problem is that in which the population is assumed to be infinite and n is assumed to have a multinomial distribution with parameter vector π = (π_1, π_2, ..., π_D); that is,

$$P(n \mid D, \pi) = \frac{n!}{n_1!\,n_2!\cdots n_D!}\,\pi_1^{n_1}\pi_2^{n_2}\cdots\pi_D^{n_D}. \qquad (21)$$

When we proceed as in Section 3.1 to derive a generalized jackknife estimator under the model in (21), the estimator turns out to be nearly identical to the "coverage-based" estimator proposed by Chao and Lee (1992). To see this, start again with (4) and select K as in (5). Because E[d_n] − D = −\sum_{j=1}^{D}(1 − π_j)^n under the model in (21), it follows that

$$K = \frac{\sum_{j=1}^{D} v_n(\pi_j)}{\sum_{j=1}^{D} \pi_j\, v_{n-1}(\pi_j)},$$

where v_n(x) = (1 − x)^n. Set π̄ = 1/D and use the Taylor approximations

$$v_n(\pi_j) \approx v_n(\bar\pi) + (\pi_j - \bar\pi)\,v_n'(\bar\pi)
\quad\text{and}\quad
\pi_j v_{n-1}(\pi_j) \approx \pi_j\bigl[v_{n-1}(\bar\pi) + (\pi_j - \bar\pi)\,v_{n-1}'(\bar\pi)\bigr]$$

in a manner analogous to the derivation in Section 3.2.2 to obtain

$$K \approx (D - 1) + (n - 1)\gamma^2, \qquad (22)$$

where γ² = −1 + D\sum_{j=1}^{D}\pi_j^2 is the squared coefficient of variation of the numbers π_1, π_2, ..., π_D. Denote by D̂_mult the estimator of D under the multinomial model. Then, by (4),

$$\hat{D}_{mult} = d + \bigl((D - 1) + (n - 1)\gamma^2\bigr)\,\frac{f_1}{n}. \qquad (23)$$

Replace D with D̂_mult and γ² with an estimator γ̃² in (23) and solve for D̂_mult to obtain

$$\hat{D}_{mult} = \frac{d}{\hat{C}} + \frac{1 - \hat{C}}{\hat{C}}\bigl((n-1)\tilde\gamma^2 - 1\bigr),$$

where Ĉ = 1 − (f_1/n). When the sample size n is large, the estimator D̂_mult is essentially the same as the estimator

$$\hat{D}_{CL} = \frac{d}{\hat{C}} + \frac{n(1 - \hat{C})}{\hat{C}}\,\tilde\gamma^2$$

proposed by Chao and Lee (1992). The estimator D̂_CL was developed from a different point of view, using the concept of sample coverage. The sample coverage for an infinite population is defined as \sum_{j=1}^{D} \pi_j I[n_j > 0], and the quantity Ĉ = 1 − (f_1/n) is a standard estimator of the sample coverage. Conversely, when Chao and Lee's derivation is modified to account for hypergeometric sampling, the resulting estimator is equal to D̂_uj2 (see App. B). Thus at least some estimators based on sample coverage can be viewed as generalized jackknife estimators.

3.4. Relation to Shlosser's Estimator

Observe that the estimator D̂_Sh, though not developed from a jackknife perspective, can be viewed as an estimator of the form (4) with K estimated by

$$\hat{K}_{Sh} = \frac{n\sum_{i=1}^{n}(1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i}.$$

To analyze the behavior of D̂_Sh, we first rewrite the jackknife quantity K defined in (9) as follows:

$$K = \frac{n\sum_{i=1}^{N}(1-q)^i F_i}{\sum_{i=1}^{N} i q (1-q)^{i-1} F_i}. \qquad (24)$$

Shlosser's justification of D̂_Sh assumes that

$$\frac{E[f_i]}{E[f_1]} \approx \frac{F_i}{F_1} \qquad (25)$$

for 1 ≤ i ≤ N. When the assumption in (25) holds and the sample size is large enough so that

$$f_i \approx E[f_i] \qquad (26)$$

for 1 ≤ i ≤ N, then

$$\hat{K}_{Sh} \approx \frac{n\sum_{i=1}^{N}(1-q)^i E[f_i]}{\sum_{i=1}^{N} i q (1-q)^{i-1}E[f_i]}
= \frac{n\sum_{i=1}^{N}(1-q)^i E[f_i]/E[f_1]}{\sum_{i=1}^{N} i q (1-q)^{i-1}E[f_i]/E[f_1]}
\approx \frac{n F_1^{-1}\sum_{i=1}^{N}(1-q)^i F_i}{F_1^{-1}\sum_{i=1}^{N} i q (1-q)^{i-1} F_i} = K,$$

so that D̂_Sh behaves as a generalized jackknife estimator. Although the relations in (25) and (26) hold exactly for n = N (implying that D̂_Sh is consistent for D), these relations can fail drastically for smaller sample sizes. For example, when F_1 = 0 and F_i > 0 for some i > 1, the right side of (25) is infinite, whereas the left side is finite for sufficiently small n. This observation leads one to expect that D̂_Sh will not perform well when the sample size is relatively small and N_1, N_2, ..., N_D have similar values (with N_j > 1 for each j). Both the variance analysis in Section 4 and the simulation experiments described in Section 6 bear out this conjecture.

The foregoing discussion suggests that replacing K̂_Sh with

$$\tilde{K}_{Sh} = \frac{K}{E[\hat{K}_{Sh}]}\,\hat{K}_{Sh} \qquad (27)$$

in the formula for D̂_Sh might result in an improved estimator, because K̃_Sh is unbiased for K. Of course we cannot perform this replacement exactly, since K and E[K̂_Sh] are unknown, but we can approximate K̃_Sh as follows. Using the fact that

$$E[f_r] = \sum_{j=1}^{D} P\{n_j = r\} \approx \sum_{j=1}^{D}\binom{N_j}{r} q^r (1-q)^{N_j - r} = \sum_{i=r}^{N}\binom{i}{r} q^r (1-q)^{i-r} F_i \qquad (28)$$

for 1 ≤ r ≤ n, we have, to first order,

$$E[\hat{K}_{Sh}] \approx \frac{n\sum_{i=1}^{N}(1-q)^i E[f_i]}{\sum_{i=1}^{N} i q (1-q)^{i-1}E[f_i]}
= \frac{n\sum_{i=1}^{N}(1-q)^i\bigl((1+q)^i - 1\bigr)F_i}{\sum_{i=1}^{N} i q^2 (1-q^2)^{i-1} F_i}. \qquad (29)$$

Using the first-order approximation N_1 = N_2 = ... = N_D = N̄ together with (24), (27), and (29), we find that

$$\tilde{K}_{Sh} \approx \frac{q(1+q)^{\bar{N}-1}}{(1+q)^{\bar{N}} - 1}\,\hat{K}_{Sh}.$$

We thus obtain a modified Shlosser estimator given by

$$\hat{D}_{Sh2} = d + f_1\,\frac{q(1+q)^{\tilde{N}-1}}{(1+q)^{\tilde{N}} - 1}\,\frac{\sum_{i=1}^{n}(1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i},$$

where Ñ is an initial estimate of N̄ based on an initial estimate of D. We set Ñ equal to N/D̂_uj1 throughout. As with D̂_Sh, the estimator D̂_Sh2 is consistent for D.
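The modified estimator D̂_Sh2 is easy to compute once D̂_uj1 is available as the initial estimate; a sketch under the same conventions as the earlier hypothetical helpers:

```python
def shlosser2(f, n, N):
    """Modified Shlosser estimator D_Sh2: Shlosser's ratio rescaled by
    q(1+q)^(Ntilde-1) / ((1+q)^Ntilde - 1), with Ntilde = N / D_uj1."""
    q = n / N
    d = sum(f.values())
    f1 = f.get(1, 0)
    D1 = d / (1.0 - (1.0 - q) * f1 / n)          # initial estimate D_uj1
    Nt = N / D1                                  # initial estimate of mean class size
    num = sum((1 - q) ** i * fi for i, fi in f.items())
    den = sum(i * q * (1 - q) ** (i - 1) * fi for i, fi in f.items())
    scale = q * (1 + q) ** (Nt - 1) / ((1 + q) ** Nt - 1)
    return d + f1 * scale * num / den
```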

An alternative consistent estimator of D can be obtained by directly using the expressions in (24), (27), and (29) with F_i estimated by

$$\hat{F}_i = \frac{f_1 f_i}{\sum_{j=1}^{n} j q (1-q)^{j-1} f_j} \qquad (30)$$

for 1 ≤ i ≤ N; these estimators of F_1, F_2, ..., F_N were proposed by Shlosser (1981) in conjunction with the estimator D̂_Sh. Substituting the resulting estimators of K and E[K̂_Sh] into (27) leads to the final estimator

$$\hat{D}_{Sh3} = d + f_1\,
\frac{\Bigl(\sum_{i=1}^{n} i q^2 (1-q^2)^{i-1} f_i\Bigr)\Bigl(\sum_{i=1}^{n}(1-q)^i f_i\Bigr)^2}
{\Bigl(\sum_{i=1}^{n}(1-q)^i\bigl((1+q)^i - 1\bigr) f_i\Bigr)\Bigl(\sum_{i=1}^{n} i q (1-q)^{i-1} f_i\Bigr)^2}.$$

As with the estimator D̂_Sh, Shlosser's justification of the estimators in (30) rests on the assumption in (25). Thus one might expect that, like D̂_Sh, the estimator D̂_Sh3 will be unstable when the sample size is relatively small and N_1, N_2, ..., N_D have similar values. On the other hand, the reduction in the bias of K̃_Sh relative to K̂_Sh leads one to expect that D̂_Sh3 will perform better than D̂_Sh when γ² is sufficiently large. (One might be tempted to avoid the assumption in (25) when estimating F_1, F_2, ..., F_N by taking a method-of-moments approach: replace E[f_r] with f_r in (28) for 1 ≤ r ≤ n and solve the resulting set of linear equations either exactly or approximately. As pointed out by Shlosser (1981), however, this system of equations is nearly singular, and hence extremely unstable.)

4. Variance and Variance Estimates

Consider an estimator D̂ that is a function of the sample only through f = (f_1, f_2, ..., f_M), where M = max(N_1, N_2, ..., N_D). All of the estimators introduced in Section 3 are of this type. In general, we also allow D̂ to depend explicitly on the population size N and write D̂ = D̂(f, N). Suppose that, for any N > 0 and nonnegative M-dimensional vector f ≠ 0, the function D̂ is continuously differentiable at the point (f, N) and

$$\hat{D}(cf, cN) = c\,\hat{D}(f, N) \qquad (31)$$

for c > 0. Approximating the hypergeometric sample design by a Bernoulli sample design as in (8), we can obtain the following approximate expression for the asymptotic variance of D̂(f, N) as D becomes large:

$$\mathrm{AVar}[\hat{D}(f, N)] = \sum_{i=1}^{M} A_i^2\,\mathrm{Var}[f_i] + \sum_{\substack{1 \le i, i' \le M \\ i \ne i'}} A_i A_{i'}\,\mathrm{Cov}[f_i, f_{i'}], \qquad (32)$$

where A_i is the partial derivative of D̂ with respect to f_i, evaluated at the point (f, N).

(When computing each A_i, we replace each occurrence of n and d in the formula for D̂ by \sum_{i=1}^{M} i f_i and \sum_{i=1}^{M} f_i before taking derivatives.) The approximation in (32) is valid when there is not too much variability in the class sizes (see App. C for a precise formulation and proof of this result). It follows from the proof that, to a good approximation, the variance of an estimator D̂ satisfying (31) increases linearly as D increases.

Straightforward calculations show that each of the specific estimators D̂_uj1, D̂_uj2, D̂_Sh, D̂_Sh2, and D̂_Sh3 is continuously differentiable as stated previously and also satisfies (31). Thus we can use (32) to study the asymptotic variance of these estimators. We focus on D̂_uj1, D̂_uj2, D̂_Sh2, and D̂_Sh3 because each of these estimators performs best for at least one population studied in the simulation experiments described in Section 6; we also consider D̂_Sh, because D̂_Sh is the most useful of the estimators previously proposed in the literature.

Computation of the A_i coefficients for each estimator is tedious, but straightforward. When D̂ = D̂_uj2, for example, the coefficients A_i^{(uj2)} can be written in terms of the corresponding coefficients A_i^{(uj1)} for D̂_uj1, the estimate γ̂² = γ̂²(D̂_uj1), and D̂_uj1 itself, with additional terms arising from the dependence of γ̂²(D̂_uj1) on f. For D̂_uj1, differentiating (11) (with n and d rewritten as above) gives

$$A_1^{(uj1)} = \hat{D}_{uj1}\Bigl(\frac{1}{d} + \frac{(1-q)(1 - f_1/n)}{n - (1-q)f_1}\Bigr)
\quad\text{and}\quad
A_i^{(uj1)} = \hat{D}_{uj1}\Bigl(\frac{1}{d} - \frac{i(1-q)(f_1/n)}{n - (1-q)f_1}\Bigr) \quad\text{for } 1 < i.$$

Figures 1 and 2 compare the variances of the estimators D̂_uj1, D̂_uj2, D̂_Sh, D̂_Sh2, and D̂_Sh3 for a number of populations with equal class sizes. For these special populations, D̂_uj1 and D̂_uj2 are approximately unbiased, so that the relative variances of these estimators are appropriate measures of relative performance. It is particularly instructive to compare the variance of D̂_uj1 and D̂_uj2, since D̂_uj2 is obtained from D̂_uj1 by adjusting the latter estimator to compensate for bias induced by the assumption of equal class sizes. This adjustment is unnecessary for our special populations, and a comparison allows evaluation of the penalty (i.e., the increase in variance) that is being paid for the adjustment.

Figure 1: Standard deviation of D̂_uj1, D̂_uj2, D̂_Sh, D̂_Sh2, and D̂_Sh3 as a function of the sampling fraction q (N = 15,000 and N̄ = 10).

Figure 2: Standard deviation of D̂_uj1, D̂_uj2, and D̂_Sh2 as a function of the class size N̄ (D = 1500 and q = 0.10).

Figure 1 displays the standard deviations of D̂_uj1, D̂_uj2, D̂_Sh, D̂_Sh2, and D̂_Sh3 for an equal-class-size population with N = 15,000 and D = 1500 (so that N̄ = 10) as the sampling fraction q varies. Observe that D̂_uj2 is only slightly less efficient than D̂_uj1, so that the penalty for bias adjustment is small in this case. Performance of the estimators D̂_uj1 and D̂_Sh2 is nearly indistinguishable. The most striking observation is that for this population, D̂_Sh and D̂_Sh3 are not competitive with the other three estimators. The relative performance of D̂_Sh and D̂_Sh3 is especially poor for small sampling fractions. On the other hand, the variance analysis indicates that modification of D̂_Sh as in (27) and (29) indeed reduces the instability of the original Shlosser estimator in this case. Thus we focus on the estimators D̂_uj1, D̂_uj2, and D̂_Sh2 in the remainder of this section and in the next section. (We return to the estimator D̂_Sh3 in Section 6, where our simulation experiments indicate that D̂_Sh3 can exhibit smaller rmse than the other estimators, but only at large sample sizes and for certain "ill-conditioned" populations in which γ² is extremely large.)

Figure 2 compares the three estimators D̂_uj1, D̂_uj2, and D̂_Sh2 for equal-class-size populations with a range of class sizes; for these calculations the number of classes and the sampling fraction are held constant at D = 1500 and q = 0.10. This figure illustrates the difficulty of precisely estimating D when the class size is small (but greater than 1). Again, we see that these three estimators perform similarly, with nearly equal variability when N̄ exceeds about 40.

We checked the accuracy of the variance approximation in some example populations by comparing the values computed from (32) with results of a simulation experiment. (This experiment is discussed more completely in Section 6 below.) Simulated sampling with q = 0.05, 0.10, and 0.20 from the population examined in Figure 1 (N = 15,000, D = 1500) yields variance estimates within 10% (on average) of those calculated from (32). Similar results were found in sampling from an equal-class-size population with N = 15,000 and D = 150.

The only difficulties we encountered occurred for equal-class-size populations with class sizes of N̄ = 1 and N̄ = 2. For these small class sizes the variance approximation, which is based on the approximation of the hypergeometric sample design by a Bernoulli sample design, is not sufficiently accurate. In particular, the approximate variance strongly reflects random fluctuations in the sample size due to the Bernoulli sample design; such fluctuations are not present in the actual hypergeometric sample design. Simulation experiments indicate that for N̄ ≥ 3 the differences caused by Bernoulli versus hypergeometric sampling become negligible. (Of course, if the sample design is in fact Bernoulli, then this problem does not occur.)

In practice, we estimate the asymptotic variance of an estimator D̂ by substituting estimates for {Var[f_i] : 1 ≤ i ≤ M} and {Cov[f_i, f_{i'}] : 1 ≤ i ≠ i' ≤ M} into (32). To obtain such estimates, we approximate the true population by a population with D classes, each of size N/D. Under this approximation and the assumption in (8) of a Bernoulli sample design, the random vector f has a multinomial distribution with parameters D and p = (p_1, p_2, ..., p_n), where

$$p_i = \binom{N/D}{i} q^i (1-q)^{(N/D) - i}$$

for 1 ≤ i ≤ n. It follows that Var[f_i] = D p_i(1 − p_i) and Cov[f_i, f_{i'}] = −D p_i p_{i'}. Each p_i can be estimated either by

$$\hat{p}_i = \binom{N/\hat{D}}{i} q^i (1-q)^{(N/\hat{D}) - i}$$

or simply by f_i/D̂. It turns out that the latter formula yields better variance estimates, and so we take

$$\widehat{\mathrm{Var}}[f_i] = f_i\Bigl(1 - \frac{f_i}{\hat{D}}\Bigr)
\quad\text{and}\quad
\widehat{\mathrm{Cov}}[f_i, f_{i'}] = -\frac{f_i f_{i'}}{\hat{D}}$$

for 1 ≤ i ≠ i' ≤ n. These formulas coincide with the estimators obtained using the "unconditional approach" of Chao and Lee (1992). A computer program that calculates D̂_uj1, D̂_uj2, D̂_Sh2 and their estimated standard errors from sample data can be obtained from the second author.
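One way to mechanize this recipe, without deriving the A_i by hand, is to approximate each A_i by a numerical (central-difference) derivative of the chosen estimator and then plug in the V̂ar and Ĉov formulas above. The sketch below is a rough stand-in of that kind; it is not the program mentioned in the text, and it assumes estimators with the (f, n, N) interface used in the earlier sketches, differentiating only with respect to the observed f_i:

```python
import math

def estimated_stderr(estimator, f, n, N, eps=1e-4):
    """Approximate standard error of estimator(f, n, N) via (32), using
    central-difference estimates of A_i = dD/df_i and the
    multinomial-style Var/Cov estimates of Section 4."""
    D_hat = estimator(f, n, N)
    idx = sorted(f)
    A = {}
    for i in idx:
        up, dn = dict(f), dict(f)
        up[i] = f[i] + eps
        dn[i] = f[i] - eps
        # n and d change with f_i, so recompute n for each perturbed profile
        A[i] = (estimator(up, sum(k * v for k, v in up.items()), N)
                - estimator(dn, sum(k * v for k, v in dn.items()), N)) / (2 * eps)
    var = 0.0
    for i in idx:
        for j in idx:
            if i == j:
                var += A[i] ** 2 * f[i] * (1 - f[i] / D_hat)
            else:
                var += A[i] * A[j] * (-f[i] * f[j] / D_hat)
    return math.sqrt(max(var, 0.0))
```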

5. An Example

The following example illustrates how knowledge of the population size N can affect estimates of the number of classes. When the population size N is unknown, Chao and Lee (1992, Sec. 3) have proposed that the estimator D̂_CL defined in Section 3.3 be used to estimate the number of classes, because the formula for D̂_CL does not involve the unknown parameter N. When N is known, a slight modification of the derivation of D̂_CL leads to the unsmoothed second-order jackknife estimator D̂_uj2 (see App. B).

Our example is based on one discussed by Chao and Lee (1992), who borrowed data first described and analyzed by Holst (1981). These data arose from an application in numismatics in which 204 ancient coins were classified according to die type in order to estimate the number of different dies used in the minting process. Among the die types on the reverse sides of the 204 coins were 156 singletons, 19 pairs, 2 triplets, and 1 quadruplet (f_1 = 156, f_2 = 19, f_3 = 2, f_4 = 1, d = 178). Because the total number of coins minted is unknown in this case, model (1) is inappropriate for analyzing these data. But suppose that the same data had arisen from an application in which N was known. For example, suppose that the data were obtained by selecting a simple random sample of 204 names from a sampling frame that had been constructed by combining 5 lists of 200 names each (N = 1000), 50 lists of 200 names each (N = 10,000), or 500 lists of 200 names each (N = 100,000). In each case our object is to estimate the number of unique individuals on the combined list, based on the sample results. We focus on the three estimators D̂_uj1, D̂_uj2, and D̂_Sh2. The estimates for the three cases are given in Table 1; the standard errors displayed in Table 1 are estimated using the procedure outlined in Section 4.

    N          D̂_uj1       D̂_uj2       D̂_Sh2
    1,000      455 (47)    502 (60)    455 (51)
    10,000     709 (125)   788 (161)   707 (128)
    100,000    752 (141)   835 (183)   749 (144)

Table 1: Values of D̂_uj1, D̂_uj2, and D̂_Sh2 for three hypothetical combined lists. (Standard errors are in parentheses.)

We would expect similar inferences to be made from the same data under the multinomial model and the finite population model when N is very large. Indeed, the value D̂_uj2 = 835 agrees closely with Chao and Lee's estimate D̂_CL = 844 (se 187) when N = 100,000. Moreover, when N = 100,000 we find that γ̂²(D̂_uj1) ≈ 0.13, which is the same estimate of γ² given by Chao and Lee. As the population size decreases, however, both our assessment of the magnitude of D and our uncertainty about that magnitude decrease, because we are observing a larger and larger fraction of both the population and the classes.

The most extreme divergence between the estimate obtained using D̂_CL and estimates obtained using D̂_uj1, D̂_uj2, or D̂_Sh2 occurs when the sample consists of all singletons (f_1 = n). In that case, D̂_CL = ∞, whereas D̂_uj1 = D̂_uj2 = D̂_Sh2 = N.
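For concreteness, the N = 10,000 row of Table 1 can be recomputed from the frequency counts alone, reusing the hypothetical helper functions sketched in Sections 2-3 (small numerical differences from the published table are possible):

```python
# Holst's coin data: d = 178, n = 204
f = {1: 156, 2: 19, 3: 2, 4: 1}
n = sum(i * fi for i, fi in f.items())   # 204
N = 10_000                               # 50 lists of 200 names each

# assumes d_uj1, d_uj2, and shlosser2 from the earlier sketches
print(d_uj1(f, n, N))       # roughly 709
print(d_uj2(f, n, N))       # roughly 788
print(shlosser2(f, n, N))   # roughly 707
```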

This result indicates that when the population size N is known, it is better to use an estimator that exploits knowledge of N than to sample with replacement and use the estimator D̂_CL. In some applications, sampling with replacement is not even an option. For example, the only available sampling mechanism in at least one current database system is a one-pass reservoir algorithm (as in Vitter 1985).

The empirical results in Section 6 indicate that, of the three estimators displayed in Table 1, D̂_uj2 is the superior estimator when γ² is small (< 1). Thus for our example, D̂_uj2 would be the preferred estimator, since γ̂²(D̂_uj1) ≈ 0.13 in all three cases. Note that D̂_uj2 consistently has the highest variance of the three estimators in Table 1. The bias of D̂_uj2 is typically lower than that of D̂_uj1 or D̂_Sh2 when γ² is small, however, so that the overall rmse is lower.

6. Simulation Results

This section describes the results of a simulation study done to compare the performance of the various estimators described in Section 3. Our comparison is based on the performance of the estimators for sampling fractions of 5%, 10%, and 20% in 52 populations. (Initial experiments indicated that the performance of the various estimators is best viewed as a function of sampling fraction, rather than absolute sample size. This is in contrast to estimators of, for example, population averages.)

We consider several sets of populations. The first set comprises synthetic populations of the type considered in the literature. Populations EQ10 and EQ100 have equal class sizes of 10 and 100. In populations NGB/1, NGB/2, and NGB/4, the class sizes follow a negative binomial distribution. Specifically, the fraction f(m) of classes in population NGB/k with class size equal to m is given by

$$f(m) \propto \binom{m-1}{k-1} r^k (1-r)^{m-k}$$

for m ≥ k, where r = 0.04. Chao and Lee (1992) considered populations of this type.

The populations in the second set are meant to be representative of data that could be encountered when a sampling frame for a population census is constructed by combining a number of lists which may contain overlapping entries. Populations GOOD and SUDM were studied by Goodman (1949) and Sudman (1976). Population FRAME2 mimics a sampling frame that might arise in an administrative records census of the type described in Section 1. One approach to such a census is to augment the usual census address list with a small number of relatively large administrative records files, such as AFDC or Food Stamps, and then estimate the number of distinct individuals on the combined list from a sample. We have constructed FRAME2 so that a given individual can appear at most five times, but most individuals appear exactly once, mimicking the case in which four administrative lists are used to supplement the census address list.

Population FRAME3 is similar to FRAME2, but for the FRAME3 population it is assumed that the combined list is made up of a number of small lists (perhaps obtained from neighborhood-level organizations) rather than a few large lists.

The populations in the third set, denoted by Z20A, Z20B, and Z15, are used to study the behavior of the estimators when the data are extremely ill-conditioned. The class sizes in each of these populations follow a generalized Zipf distribution (see Knuth 1973, p. 398). Specifically, N_j/N is proportional to a negative power of j, with exponent equal to 1.5 or 2.0. These populations have extremely high values of γ².

Descriptive statistics for these three sets of populations are given in Tables 2, 3, and 4. The column entitled "Skew" displays the dimensionless coefficient of skewness, which is defined by

$$\frac{\sum_{j=1}^{D}(N_j - \bar{N})^3 / D}{\bigl(\sum_{j=1}^{D}(N_j - \bar{N})^2 / D\bigr)^{3/2}}.$$

    Name    N      D     γ²    Skew
    EQ10    15000  1500  0.00  0.00
    EQ100   15000  150   0.00  0.00
    NGB/4   82135  874   0.18  0.50
    NGB/2   41197  906   0.37  0.81
    NGB/1   20213  930   0.75  1.25

Table 2: Characteristics of synthetic populations.

    Name     N       D       γ²    Skew
    GOOD     10000   9595    0.04  5.64
    FRAME2   33750   19000   0.31  1.18
    FRAME3   111500  36000   0.52  1.92
    SUDM     330000  100000  1.87  2.71

Table 3: Characteristics of "merged list" populations.

    Name   N      D      γ²      Skew
    Z20A   50000  247    114.38  14.60
    Z15    50000  772    166.18  23.44
    Z20B   50000  10384  234.81  73.54

Table 4: Characteristics of "ill-conditioned" populations.

The final set comprises 40 real populations that demonstrate the type of distributions encountered when estimating the number of distinct values of an attribute in a relational database. Specifically, the populations studied correspond to various relational attributes from a database of enrollment records for students at the University of Wisconsin and a database of billing records from a large insurance company. The population size N ranges from 15,469 to 1,654,700, with D ranging from 3 to 1,547,606 and γ² ranging from 0 to 81.63 (see App. D for further details). It is notable that values of γ² encountered in the literature (Chao and Lee 1992; Goodman 1949; Shlosser 1981; Sudman 1976) tend not to exceed the value 2, and are typically less than 1, whereas the value of γ² exceeds 2 for more than 50% of the real populations.
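The experimental protocol described next is simple to reproduce for the synthetic populations. A compact sketch of such a harness follows (our own; estimator functions are assumed to follow the (f, n, N) interface used in the earlier sketches):

```python
import random

def simulate_rmse(class_sizes, q, estimator, reps=100, seed=1):
    """Estimate bias and rmse (as % of the true D) for one estimator, one
    synthetic population, and one sampling fraction: repeated simple random
    sampling without replacement, with estimates truncated to [d, N]."""
    population = [j for j, size in enumerate(class_sizes) for _ in range(size)]
    N, D = len(population), len(class_sizes)
    n = round(q * N)
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        sample = rng.sample(population, n)        # simple random sample w/o replacement
        counts = {}
        for label in sample:
            counts[label] = counts.get(label, 0) + 1
        f = {}
        for c in counts.values():
            f[c] = f.get(c, 0) + 1
        d = len(counts)
        est = min(max(estimator(f, n, N), d), N)  # truncate below at d, above at N
        errors.append(est - D)
    bias = sum(errors) / reps
    rmse = (sum(e * e for e in errors) / reps) ** 0.5
    return 100 * bias / D, 100 * rmse / D

# e.g. population EQ10 (1500 classes of size 10):
# simulate_rmse([10] * 1500, q=0.10, estimator=d_uj1)
```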

For each estimator, population, and sampling fraction, we estimated the bias and rmse by repeatedly drawing a simple random sample from the population, evaluating the estimator, and then computing the error of each estimate. (When evaluating the estimator, we truncated each estimate below at d and above at N.) The final estimate of bias was obtained by averaging the error over all of the experimental replications, and rmse was estimated as the square root of the averaged square error. We used 100 replications, which was sufficient to estimate the rmse with a standard error below 5% in nearly all cases; typically the standard error was much less.

Summary results from the simulations are displayed in Tables 5 and 6. Table 5 gives the average and maximum rmse's for each estimator of D over all populations with 0 ≤ γ² < 1, with 1 ≤ γ² < 50, and with γ² ≥ 50, as well as the average and maximum rmse's for each estimator over all populations combined. Similarly, Table 6 gives the average and maximum bias for each estimator. In these tables, the rmse and bias are each expressed as a percentage of the true number of classes. Tables 5 and 6 also display the rmse and bias of the estimator γ̂²(D̂_uj1) used in the second-order jackknife estimators; the rmse and bias are expressed as a percentage of the true value γ² and are displayed in the column labelled γ̂².

    Sampling  γ² range      Statistic   D̂_uj1   D̂_sj1    D̂_uj2    D̂_sj2    D̂_Sh   D̂_Sh2    D̂_Sh3     γ̂²
    fraction
    5%        0 ≤ γ² < 1    Average     13.48   14.20    11.84    12.27    79.17   13.23   202.16   56.65
                            Maximum     43.81   45.14    39.56    39.67   428.25   46.59  3299.10   96.72
              1 ≤ γ² < 50   Average     38.14   39.17    65.34    45.25    54.30   36.67    93.92   46.51
                            Maximum     70.47   70.48   186.15   186.15   218.02   66.82  1042.73   91.70
              γ² ≥ 50       Average     74.11   75.92   388.77    77.78    28.13   71.23    21.45   74.72
                            Maximum     85.09   88.49   564.57   112.13    47.63   83.71    38.58   85.55
              all           Average     30.95   31.91    68.61    34.44    62.33   29.86   132.06   52.78
                            Maximum     85.09   88.49   564.57   186.15   428.25   83.71  3299.10   96.72
    10%       0 ≤ γ² < 1    Average     11.30   12.14     9.05     9.71    33.09   11.19    22.68   49.68
                            Maximum     39.80   42.32    31.73    31.90   200.79   44.83   131.15   90.68
              1 ≤ γ² < 50   Average     31.41   32.59    90.96    38.74    34.96   29.16    50.17   38.34
                            Maximum     61.27   61.28   267.08   186.15   107.16   54.03   357.43   83.12
              γ² ≥ 50       Average     63.92   65.88   682.55   115.77    15.50   58.82    11.51   64.43
                            Maximum     76.47   81.21  1133.61   281.98    28.97   73.14    21.81   76.89
              all           Average     25.79   26.89   103.38    32.94    32.71   24.18    36.10   44.93
                            Maximum     76.47   81.21  1133.61   281.98   200.79   73.14   357.43   90.68
    20%       0 ≤ γ² < 1    Average      8.89    9.86     5.77     6.53    12.91    8.30     9.05   40.65
                            Maximum     33.01   37.28    29.82    27.49    79.16   30.14    79.16   81.03
              1 ≤ γ² < 50   Average     23.44   24.81   123.00    32.79    18.14   20.88    17.91   28.65
                            Maximum     46.77   49.73   369.77   186.15    49.20   43.38    74.99   67.42
              γ² ≥ 50       Average     50.10   52.19  1093.07   130.30     7.73   42.58     6.32   50.51
                            Maximum     62.96   69.06  2010.61   381.51    15.12   56.72    10.62   63.37
              all           Average     19.62   20.88   150.28    29.69    15.23   17.47    13.44   35.18
                            Maximum     62.96   69.06  2010.61   381.51    79.16   56.72    79.16   81.03

Table 5: Average and maximum rmse (%) for various estimators.

Comparing Tables 5 and 6 indicates that for each estimator the major component of the rmse is almost always bias, not variance. Thus, even though the standard error can be estimated as in Section 4, this estimated standard error usually does not give an accurate picture of the error in estimation of D.

Another consequence of the predominance of bias is that when γ² is large, the rmse for the second-order estimator D̂_uj2 does not decrease monotonically as the sampling fraction increases. (In all other cases the rmse decreases monotonically.)

    Sampling  γ² range      Statistic    D̂_uj1    D̂_sj1     D̂_uj2    D̂_sj2    D̂_Sh    D̂_Sh2   D̂_Sh3      γ̂²
    fraction
    5%        0 ≤ γ² < 1    Average     -12.71   -13.43    -10.76   -11.38    71.11   -10.98    90.57   -55.75
                            Maximum     -43.77   -45.10    -39.51   -39.62   427.53   -46.59   958.74   -94.97
              1 ≤ γ² < 50   Average     -37.95   -38.99     42.77   -16.83    39.13   -36.35    61.98   -46.19
                            Maximum     -70.32   -70.32    186.15   186.15   218.01   -66.49   663.26   -91.70
              γ² ≥ 50       Average     -74.10   -75.91    382.88   -22.16    22.92   -71.22     3.17   -74.71
                            Maximum     -85.09   -88.49    556.68   110.28    44.65   -83.71    33.44   -85.54
              all           Average     -30.54   -31.51     47.31   -15.04    50.80   -28.79    69.00   -52.24
                            Maximum     -85.09   -88.49    556.68   186.15   427.53   -83.71   958.74   -94.97
    10%       0 ≤ γ² < 1    Average     -10.93   -11.78     -8.38    -9.31    28.12    -9.47    17.66   -48.87
                            Maximum     -39.79   -42.31    -31.47   -31.88   200.49   -44.83   130.80   -90.59
              1 ≤ γ² < 50   Average     -31.16   -32.34     74.62   -10.44    25.41   -28.61    35.20   -37.98
                            Maximum     -61.00   -61.00    261.47   186.15   107.16   -53.88   264.38   -83.12
              γ² ≥ 50       Average     -63.91   -65.87    677.18    24.90    11.57   -58.78     3.10   -64.41
                            Maximum     -76.47   -81.21   1125.89   280.63    27.09   -73.13    18.47   -76.88
              all           Average     -25.51   -26.62     87.45    -7.26    25.44   -23.20    25.65   -44.41
                            Maximum     -76.47   -81.21   1125.89   280.63   200.49   -73.13   264.38   -90.59
    20%       0 ≤ γ² < 1    Average      -8.57    -9.55     -4.99    -6.20     9.99    -6.75     5.73   -39.71
                            Maximum     -33.01   -37.27    -17.83   -22.67    45.86   -28.38    28.17   -81.00
              1 ≤ γ² < 50   Average     -23.12   -24.49    112.39    -3.41    12.09   -20.13    10.02   -28.23
                            Maximum     -46.54   -49.73    362.12   186.15    49.20   -43.38    49.34   -67.36
              γ² ≥ 50       Average     -50.09   -52.17   1087.89    60.47     5.03   -42.53     1.72   -50.49
                            Maximum     -62.96   -69.06   2003.12   381.51    13.90   -56.71     8.23   -63.36
              all           Average     -19.32   -20.59    140.02     0.38    10.70   -16.45     7.65   -34.58
                            Maximum     -62.96   -69.06   2003.12   381.51    49.20   -56.71    49.34   -81.00

Table 6: Average and maximum bias (%) for various estimators.

Comparing D̂_uj1 with D̂_sj1 and then comparing D̂_uj2 with D̂_sj2, we see that smoothing a first-order jackknife estimator never results in a better first-order estimator. On the other hand, smoothing a second-order jackknife estimator can result in significant performance improvement when γ² is large. Similarly, using higher-order Taylor expansions leads to mixed results. Second-order estimators perform better than first-order estimators when γ² is relatively small, but not when γ² is large. The difficulty is partially that the estimator γ̂²(D̂_uj1) tends to underestimate γ² when γ² is large, leading to underestimates of the number of classes. Moreover, the Taylor approximations underlying D̂_uj1, D̂_sj1, D̂_uj2, and D̂_sj2 are derived under the assumption of not too much variability between class sizes; this assumption is violated when γ² is large. There apparently is no systematic relation between the coefficient of skewness for the class sizes and the performance of second-order jackknife estimators.

As predicted in Sections 3.4 and 4, the estimators D̂_Sh and D̂_Sh3 behave poorly when γ² is relatively small, and D̂_Sh3 performs better than D̂_Sh when γ² is large. For small to medium values of γ², the modified estimator D̂_Sh2 has a smaller rmse than D̂_Sh or D̂_Sh3, and its performance is comparable to the generalized jackknife estimators.

    Sampling  γ² range      Statistic    D̂_uj2   D̂_uj2a   D̂_Sh2    D̂_Sh3   D̂_hybrid
    fraction
    5%        0 ≤ γ² < 1    Average      11.84    19.46   13.23   202.16      11.84
                            Maximum      39.56   192.64   46.59  3299.10      39.56
              1 ≤ γ² < 50   Average      65.34    27.47   36.67    93.92      27.47
                            Maximum     186.15    54.51   66.82  1042.73      54.51
              γ² ≥ 50       Average     388.77    23.00   71.23    21.45      26.17
                            Maximum     564.57    36.60   83.71    38.58      39.20
              all           Average      68.61    23.89   29.86   132.06      21.06
                            Maximum     564.57   192.64   83.71  3299.10      54.51
    10%       0 ≤ γ² < 1    Average       9.05    13.26   11.19    22.68       9.05
                            Maximum      31.73   120.14   44.83   131.15      31.73
              1 ≤ γ² < 50   Average      90.96    19.22   29.16    50.17      19.55
                            Maximum     267.08    48.12   54.03   357.43      48.12
              γ² ≥ 50       Average     682.55    17.82   58.82    11.51      11.51
                            Maximum    1133.61    27.30   73.14    21.81      21.81
              all           Average     103.38    16.71   24.18    36.10      14.69
                            Maximum    1133.61   120.14   73.14   357.43      48.12
    20%       0 ≤ γ² < 1    Average       5.77     8.12    8.30     9.05       5.77
                            Maximum      29.82    79.16   30.14    79.16      29.82
              1 ≤ γ² < 50   Average     123.00    17.44   20.88    17.91      17.69
                            Maximum     369.77    76.57   43.38    74.99      76.57
              γ² ≥ 50       Average    1093.07    37.30   42.58     6.32       6.32
                            Maximum    2010.61    83.69   56.72    10.62      10.62
              all           Average     150.28    15.20   17.47    13.44      12.00
                            Maximum    2010.61    83.69   56.72    79.16      76.57

Table 7: Average and maximum rmse (%) of D̂_uj2, D̂_uj2a, D̂_Sh2, D̂_Sh3, and D̂_hybrid.

For extremely large values of γ² and also for large sample sizes, the estimator D̂_Sh3 has the best performance of the three Shlosser-type estimators. (For a 20% sampling fraction, D̂_Sh3 in fact has the lowest average rmse of all the estimators considered.)

As indicated earlier, smoothing can improve the performance of the second-order jackknife estimator D̂_uj2. An alternative ad hoc technique for improving performance is to "stabilize" D̂_uj2 using a method suggested by Chao, Ma, and Yang (1993). Fix c ≥ 1 and remove any class whose frequency in the sample exceeds c; that is, remove from the sample all members of classes {C_j : j ∈ B}, where B = {1 ≤ j ≤ D : n_j > c}. Then compute the estimator D̂_uj2 from the reduced sample and subsequently increment it by |B| to produce the final estimate, denoted by D̂_uj2a. (Here |B| denotes the number of elements in the set B.) When computing D̂_uj2 from the reduced sample, take the population size as N − \sum_{j ∈ B} N̂_j, where each N̂_j is a method-of-moments estimator of N_j as in Section 3.2.3. If n − \sum_{j ∈ B} n_j = 0, then simply compute D̂_uj2 from the full sample. The idea behind this procedure is as follows. When γ² is large, the population consists of a few large classes and many smaller classes. By in effect removing the largest classes from the population,

sample size bd 2 uj2 bd uj2a bd Sh2 bd Sh3 bd hybrid 5% 0 ad < 1 Average 11.84 19.46 13.23 202.16 11.84 1 ad < 50 Maximum 39.56 192.64 46.59 3299.10 39.56 Average 65.34 27.47 36.67 93.92 27.47 50 Maximum 186.15 54.51 66.82 1042.73 54.51 Average 388.77 23.00 71.23 21.45 26.17 Maximum 564.57 36.60 83.71 38.58 39.20 all Average 68.61 23.89 29.86 132.06 21.06 Maximum 564.57 192.64 83.71 3299.10 54.51 10% 0 ad < 1 Average 9.05 13.26 11.19 22.68 9.05 1 ad < 50 Maximum 31.73 120.14 44.83 131.15 31.73 Average 90.96 19.22 29.16 50.17 19.55 50 Maximum 267.08 48.12 54.03 357.43 48.12 Average 682.55 17.82 58.82 11.51 11.51 Maximum 1133.61 27.30 73.14 21.81 21.81 all Average 103.38 16.71 24.18 36.10 14.69 Maximum 1133.61 120.14 73.14 357.43 48.12 20% 0 ad < 1 Average 5.77 8.12 8.30 9.05 5.77 1 ad < 50 Maximum 29.82 79.16 30.14 79.16 29.82 Average 123.00 17.44 20.88 17.91 17.69 50 Maximum 369.77 76.57 43.38 74.99 76.57 Average 1093.07 37.30 42.58 6.32 6.32 Maximum 2010.61 83.69 56.72 10.62 10.62 all Average 150.28 15.20 17.47 13.44 12.00 Maximum 2010.61 83.69 56.72 79.16 76.57 Table 7: Average ad maximum rmse (%) of b Duj2, b Duj2a, b DSh2, b DSh3, ad b Dhybrid. its performace is comparable to the geeralized jackkife estimators. For extremely large values of 2 ad also for large sample sizes, the estimator b DSh3 has the best performace of the three Shlosser-type estimators. (For a 20% samplig fractio, b DSh3 i fact has the lowest average rmse of all the estimators cosidered.) As idicated earlier, smoothig ca improve the performace of the secod-order jackkife estimator Duj2 b. A alterative ad hoc techique for improvig performace is to \stabilize" b Duj2 usig a method suggested by Chao, Ma, ad Yag (1993). Fix c 1 ad remove ay class whose frequecy i the sample exceeds c; that is, remove from the sample all members of classes f C j : j 2 B g, where B = f 1 j D : j >cg. The compute the estimator b Duj2 from the reduced sample ad subsequetly icremet it by jbj to produce the al estimate, deoted by b Duj2a. (Here jbj deotes the umber of elemets i the set B.) Whe computig b Duj2 from the reduced sample, take the populatio size as N, P j2b b N j, where each b Nj is a method-of-momets estimator of N j as i Sectio 3.2.3. If, P j2b j = 0, the simply compute b Duj2 from the full sample. The idea behid this procedure is as follows. Whe 2 is large, the populatio cosists of a few large classes ad may smaller classes. By i eect removig the largest classes from the populatio, 23