RJ (90521) May 29, 1996 (Revised 3/20/98)
Computer Science Research Report

ESTIMATING THE NUMBER OF CLASSES IN A FINITE POPULATION

Peter J. Haas
IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA

Lynne Stokes
Department of Management Science and Information Systems, University of Texas, Austin, TX

LIMITED DISTRIBUTION NOTICE
This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

IBM Research Division: Yorktown Heights, New York; San Jose, California; Zurich, Switzerland


ESTIMATING THE NUMBER OF CLASSES IN A FINITE POPULATION

Peter J. Haas
IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA
peterh@almaden.ibm.com

Lynne Stokes
Department of Management Science and Information Systems, University of Texas, Austin, TX
lstokes@mail.utexas.edu

ABSTRACT: We use an extension of the generalized jackknife approach of Gray and Schucany to obtain new nonparametric estimators for the number of classes in a finite population of known size. We also show that generalized jackknife estimators are closely related to certain Horvitz-Thompson estimators, to an estimator of Shlosser, and to estimators based on sample coverage. In particular, the generalized jackknife approach leads to a modification of Shlosser's estimator that does not suffer from the erratic behavior of the original estimator. The performance of both new and previous estimators is investigated by means of an asymptotic variance analysis and a Monte Carlo simulation study.

Keywords: jackknife, sample coverage, number of species, number of classes, database, census


1. Introduction

The problem of estimating the number of classes in a population has been studied for many years. A recent review article (Bunge and Fitzpatrick 1993) lists more than 125 references. In this article, we consider an important special case of the general problem: estimating the number of classes in a finite population of known size. Only a handful of papers have addressed this problem and none has reached an entirely satisfactory solution, despite the fact that the first attempt at solution appeared in the statistical literature nearly 50 years ago (Mosteller 1949).

The problem we consider has arisen in the literature in a variety of applications, including the following.

(i) In a company-sponsored contest, many entries (say several hundred thousand) have been received. It is known that some people have entered more than once. The goal is to estimate the number of different people who have entered from a sample of entries (Mosteller 1949; Sudman 1976).

(ii) A sampling frame is constructed by combining a number of lists that may contain overlapping entries. It is desired to estimate, using a sample from all lists, the number of units on the combined list (Deming and Glasser 1959; Goodman 1952; Kish 1965, Sec. 11.2; Sudman 1976, Sec. 3.6). An important example of such a problem is an "administrative records census," currently under study by the U.S. Bureau of the Census. In such a census, several administrative files (such as AFDC or IRS records) are combined, and the total number of distinct individuals included in the combined file is determined. Exact computation of the number of distinct individuals in the combined file is extremely expensive because of the high cost of determining the number of duplicated entries. A similar problem and proposed solution were discussed in the London Financial Times (March 2, 1949) by C. F. Carter, who was interested in estimating the number of different investors in British industrial stocks based on samples from share registers of companies (Mosteller 1949).
(iii) In a relational database system, data are organized in tables called relations (see, e.g., Korth and Silberschatz 1991, Chap. 3). In a typical relation, each row might represent a record for an individual employee in a company, and each column might correspond to a different attribute of the employee, such as salary, years of experience, department number, and so forth. A relational query specifies an output relation that is to be computed from the set of base relations stored by the system. Knowledge of the number of distinct values for each attribute in the base relations is central to determining the most efficient method for computing a specified output relation (Hellerstein and Stonebraker 1994; Selinger, Astrahan, Chamberlin, Lorie, and Price 1979). The size of the base relations in modern database systems often is so large that exact computation of the distinct-value parameters is prohibitively expensive, and thus estimation of these parameters is desired (Astrahan, Schkolnick, and Whang 1987; Flajolet and Martin 1985; Gelenbe and Gardy 1982; Hou, Ozsoyoglu, and Taneja 1988, 1989; Naughton and Seshadri 1990; Ozsoyoglu, Du, Tjahjana, Hou, and Rowland 1991; Whang, Vander-Zanden, and Taylor 1990).

In each of these applications, the size of the population (number of contest entries, total number of units over all lists, and number of rows in the base relation) is known, and this size is too large for easy computation of the number of classes.

The problem studied in this article can be described formally as follows. A population of size $N$ consists of $D$ mutually disjoint classes of items, labelled $C_1, C_2, \ldots, C_D$. Define $N_j$ to be the size of class $C_j$, so that $N = \sum_{j=1}^{D} N_j$. A simple random sample of $n$ items is selected (without replacement) from the population. This sample includes $n_j$ items from class $C_j$. The problem we consider is that of estimating $D$ using information from the sample along with knowledge of the value of $N$. We denote by $F_i$ the number of classes of size $i$ in the population, so that $D = \sum_{i=1}^{N} F_i$. Similarly, we denote by $f_i$ the number of classes represented exactly $i$ times in the sample and by $d$ the total number of classes represented in the sample. Thus $d = \sum_{i=1}^{n} f_i$ and $\sum_{i=1}^{n} i f_i = n$. Define vectors $\mathbf{N} = (N_1, N_2, \ldots, N_D)$, $\mathbf{n} = (n_1, n_2, \ldots, n_D)$, and $\mathbf{f} = (f_1, f_2, \ldots, f_n)$. Note that $\mathbf{n}$ is not observable, but $\mathbf{f}$ is. Because we sample without replacement, the random vector $\mathbf{n}$ has a multivariate hypergeometric distribution with probability mass function

$$P(\mathbf{n} \mid D, \mathbf{N}) = \frac{\binom{N_1}{n_1}\binom{N_2}{n_2}\cdots\binom{N_D}{n_D}}{\binom{N}{n}}. \tag{1}$$

The probability mass function of the observable random vector $\mathbf{f}$ is simply $P(\mathbf{n} \mid D, \mathbf{N})$ summed over all points $\mathbf{n}$ that correspond to $\mathbf{f}$:

$$P(\mathbf{f} \mid D, \mathbf{N}) = \sum_{\mathbf{n} \in S} P(\mathbf{n} \mid D, \mathbf{N}),$$

where $S = \{\, \mathbf{n} : \#(n_j = i) = f_i \text{ for } 1 \le i \le n \,\}$. The probability mass function $P(\mathbf{f} \mid D, \mathbf{N})$ does not have a closed-form expression in general.

In Section 2 we review the estimators that have been proposed for estimating $D$ from data generated under model (1). In Section 3 we provide several new estimators of $D$ based on an extension of the generalized jackknife approach of Gray and Schucany (1972).
We then show that generalized jackknife estimators of the number of classes in a population are closely related to certain "Horvitz-Thompson" estimators, to an estimator due to Shlosser (1981), and to estimators based on the notion of "sample coverage" (Chao and Lee 1992). In Section 4 we provide and compare approximate expressions for the asymptotic variance of several of the estimators, and in Section 5 apply our formulas to a well-known example from the literature. We provide a simulation-based empirical comparison of the various estimators in Section 6, and summarize our results and give recommendations in Section 7.
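As a concrete illustration of the notation in the problem statement, the following Python sketch draws a simple random sample without replacement and tabulates $d$ and the frequency counts $f_i$. The function name `sample_frequencies`, the toy population, and the fixed seed are assumptions made for this example; they do not come from the report.

```python
import random
from collections import Counter

def sample_frequencies(population, n, seed=0):
    """Return (d, f) for a simple random sample of n items drawn without replacement.

    population: list of class labels, one entry per item (length N).
    f maps i to f_i, the number of classes seen exactly i times in the sample.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, n)       # sampling without replacement
    per_class = Counter(sample)              # n_j for every class with n_j > 0
    f = Counter(per_class.values())          # f_i = #{j : n_j = i}
    d = len(per_class)                       # number of classes represented
    return d, dict(f)

# Toy population (hypothetical): N = 6 items in D = 3 classes of sizes 3, 2, 1.
population = ["a"] * 3 + ["b"] * 2 + ["c"]
d, f = sample_frequencies(population, n=4)
```

By construction the counts satisfy $\sum_i f_i = d$ and $\sum_i i f_i = n$ for any sample.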

2. Previous Estimators

Bunge and Fitzpatrick (1993) mention only two non-Bayesian estimators that have been developed as estimators of $D$ under model (1). These are the estimators of Goodman (1949) and Shlosser (1981). Goodman proved that

$$\hat D_{Good1} = d + \sum_{i=1}^{n} (-1)^{i+1}\, \frac{(N-n+i-1)!\,(n-i)!}{(N-n-1)!\,n!}\, f_i$$

is the unique unbiased estimator of $D$ when $n > M \stackrel{\mathrm{def}}{=} \max(N_1, N_2, \ldots, N_D)$. He further proved that no unbiased estimator of $D$ exists when $n \le M$. Unfortunately, unless the sampling fraction is quite large, the variance of $\hat D_{Good1}$ is so great and the numerical difficulties encountered when computing $\hat D_{Good1}$ are so severe that the estimator is unusable. Goodman, who made note of the high variance of $\hat D_{Good1}$ himself, suggested the alternative estimator

$$\hat D_{Good2} = N - \frac{N(N-1)}{n(n-1)}\, f_2$$

for overcoming the variance problem. Although $\hat D_{Good2}$ has lower variance than $\hat D_{Good1}$, it can take on negative values and can have a large bias for any $n$ if $D$ is small. For example, consider the case in which $D = 1$ and $n > 2$, and observe that $f_2 = 0$ and $\hat D_{Good2} = N$.

Under the assumption that the population size $N$ is large and the sampling fraction $q = n/N$ is nonnegligible, Shlosser (1981) derived the estimator

$$\hat D_{Sh} = d + f_1\, \frac{\sum_{i=1}^{n} (1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i}.$$

For the two examples considered in his paper, Shlosser found that use of $\hat D_{Sh}$ with a 10% sampling fraction resulted in an error rate below 20%. In our experiments, however, we observed root mean squared errors (rmse's) exceeding 200%, even for well-behaved populations with relatively little variation among the class sizes (see Sec. 6). Considering the relationship between $\hat D_{Sh}$ and generalized jackknife estimators (see Sec. 3.4) provides insight into the source of this erratic behavior and suggests some possible modifications of $\hat D_{Sh}$ to improve performance.

In related work, Burnham and Overton (1978, 1979) proposed a family of (traditional) generalized jackknife estimators for estimating the size of a closed population when capture probabilities vary among animals.
The $D$ individuals in the population play the role of our $D$ classes; a given individual can appear up to $n$ times in the overall sample if captured on one or more of $n$ possible trapping occasions. The capture probability for an individual is assumed to be constant over time, and the capture probabilities for the $D$ individuals are modeled as $D$ iid random samples from a fixed probability distribution. Burnham and Overton's sample design is clearly different from model (1). Under the Burnham and Overton model, for example, the quantities $f_1, f_2, \ldots, f_n$ have a joint multinomial distribution.

Closely related to the work of Burnham and Overton are the ordinary jackknife estimators of the number of species in a closed region developed by Heltshe and Forrester (1983) and Smith and van Belle (1984). The sample data consist of a list of the species that appear in each of $n$ quadrats. (The number of times that a species is represented in a quadrat is not recorded.) This setup is essentially identical to that of Burnham and Overton, with the $D$ species playing the role of the $D$ individuals and the $n$ quadrats playing the role of the $n$ trapping occasions.

3. Generalized Jackknife Estimators

In this section we outline an extension of the generalized jackknife approach to bias reduction and then use this approach to derive new estimators for the number of classes in a finite population. We also point out connections between our generalized jackknife approach and several other estimation approaches in the literature.

3.1. The Generalized Jackknife Approach

Let $\theta$ be an unknown real-valued parameter. A generalized jackknife estimator of $\theta$ is an estimator of the form

$$G(\hat\theta_1, \hat\theta_2) = \frac{\hat\theta_1 - R\,\hat\theta_2}{1 - R}, \tag{2}$$

where $\hat\theta_1$ and $\hat\theta_2$ are biased estimators of $\theta$ and $R$ ($\ne 1$) is a real number (Gray and Schucany 1972). The idea underlying the generalized jackknife approach is to try to choose $R$ such that $G(\hat\theta_1, \hat\theta_2)$ has lower bias than either $\hat\theta_1$ or $\hat\theta_2$. To motivate the choice of $R$, observe that for

$$R = \frac{E[\hat\theta_1] - \theta}{E[\hat\theta_2] - \theta}, \tag{3}$$

the estimator $G(\hat\theta_1, \hat\theta_2)$ is unbiased for $\theta$. This optimal value of $R$ is typically unknown, however, and can only be approximated, resulting in bias reduction but not complete bias elimination.

In the following, we extend the original definition of the generalized jackknife given by Gray and Schucany (1972) by allowing $R$ to depend on the data; that is, we allow $R$ to be random.

Recall that $d$ is the number of classes represented in the sample.
Write $d_n$ for $d$ to emphasize the dependence of $d$ on the sample size, and denote by $d_{n-1}(k)$ the number of classes represented in the sample after the $k$th observation has been removed. Set

$$d_{(n-1)} = \frac{1}{n} \sum_{k=1}^{n} d_{n-1}(k).$$
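The quantity $d_{(n-1)}$ can be computed by brute force, which is useful as a check on any shortcut formula. The sketch below is a minimal illustration; the function name `mean_leave_one_out_distinct` and the toy sample are hypothetical.

```python
def mean_leave_one_out_distinct(sample):
    """Average, over k, of the number of distinct classes left after deleting observation k."""
    n = len(sample)
    total = 0
    for k in range(n):
        reduced = sample[:k] + sample[k + 1:]   # sample with the k-th observation removed
        total += len(set(reduced))              # d_{n-1}(k)
    return total / n

sample = list("aabcdd")   # hypothetical sample: classes a and d seen twice, b and c once
d_loo = mean_leave_one_out_distinct(sample)
```

The loop costs $O(n^2)$ time, so it is meant for small checks only.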

We focus on generalized jackknife estimators that are obtained by taking $\hat\theta_1 = d_n$ and $\hat\theta_2 = d_{(n-1)}$ in (2); these are the usual choices for $\hat\theta_1$ and $\hat\theta_2$ in the classical first-order jackknife estimator (Miller 1974). Observe that $d_{n-1}(k) = d_n - 1$ if the class for the $k$th observation is represented only once in the sample; otherwise, $d_{n-1}(k) = d_n$. Thus $d_{(n-1)} = d_n - (f_1/n)$ and, by (2), $G(\hat\theta_1, \hat\theta_2) = \hat D$, where

$$\hat D = d_n + K\, \frac{f_1}{n} \tag{4}$$

and $K = R/(1-R)$. It follows from (3) that the optimal choice of $K$ is

$$K = \frac{E[d_n] - D}{E[d_{(n-1)}] - E[d_n]} = \frac{D - E[d_n]}{E[f_1]/n}. \tag{5}$$

To derive a more explicit formula for $K$, denote by $I[A]$ the indicator of event $A$ and observe that

$$E[d_n] = E\Bigl[\sum_{j=1}^{D} I[n_j > 0]\Bigr] = \sum_{j=1}^{D} P\{n_j > 0\} = D - \sum_{j=1}^{D} P\{n_j = 0\}.$$

Similar reasoning shows that

$$E[f_1] = \sum_{j=1}^{D} P\{n_j = 1\}, \tag{6}$$

so that

$$K = \frac{n \sum_{j=1}^{D} P\{n_j = 0\}}{\sum_{j=1}^{D} P\{n_j = 1\}}. \tag{7}$$

Following Shlosser (1981), we focus on the case in which the population size $N$ is large and the sampling fraction $q = n/N$ is nonnegligible, and we make the approximation

$$P\{n_j = k\} \approx \binom{N_j}{k} q^k (1-q)^{N_j - k} \tag{8}$$

for $0 \le k \le n$ and $1 \le j \le D$. That is, the probability distribution of each $n_j$ is approximated by the probability distribution of $n_j$ under a Bernoulli sample design in which each item is included in the sample with probability $q$, independently of all other items in the population. Use of this approximation leads to estimators that behave almost identically to estimators derived using the exact distribution of $\mathbf{n}$ but are simpler to compute and derive (see App. A for further discussion). Substituting (8) into (7), we obtain

$$K \approx \frac{n \sum_{j=1}^{D} (1-q)^{N_j}}{\sum_{j=1}^{D} N_j\, q\, (1-q)^{N_j - 1}}. \tag{9}$$
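To see how close the Bernoulli approximation is in a small case, the following sketch evaluates $K$ twice for a population whose class sizes are known: once from (7) with the exact hypergeometric marginals $P\{n_j = k\} = \binom{N_j}{k}\binom{N-N_j}{n-k}/\binom{N}{n}$, and once from (9). The function names and the toy population are assumptions made for illustration.

```python
from math import comb

def k_exact(class_sizes, n):
    """K from (7), using the exact hypergeometric marginals of each n_j."""
    N = sum(class_sizes)
    total = comb(N, n)
    p0 = sum(comb(N - Nj, n) / total for Nj in class_sizes)            # sum of P{n_j = 0}
    p1 = sum(Nj * comb(N - Nj, n - 1) / total for Nj in class_sizes)   # sum of P{n_j = 1}
    return n * p0 / p1

def k_bernoulli(class_sizes, n):
    """K from (9), using the Bernoulli approximation (8)."""
    q = n / sum(class_sizes)
    num = sum((1 - q) ** Nj for Nj in class_sizes)
    den = sum(Nj * q * (1 - q) ** (Nj - 1) for Nj in class_sizes)
    return n * num / den

sizes = [10] * 100                      # hypothetical population: D = 100 classes of size 10
k_hyp = k_exact(sizes, 100)             # sampling fraction q = 0.1
k_ber = k_bernoulli(sizes, 100)
```

For this toy population the two values are about 89.1 and 90, a difference of roughly 1%.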

The quantity $K$ defined in (9) depends on unknown parameters $N_1, N_2, \ldots, N_D$ that are difficult to estimate. Our approach is to approximate $K$ by a function of $D$ and of other parameters that are easier to estimate, thereby obtaining an approximate version of (4). The estimates for these parameters, including $\hat D$ for $D$, are then substituted into the approximate version of (4) and the resulting equation is solved for $\hat D$.

We also consider "smoothed" jackknife estimators. The idea is to replace the quantity $f_1/n$ in (4) by its expected value $E[f_1]/n$ in the hope that the resulting estimator of $D$ will be more stable than the original "unsmoothed" estimator. As with the parameter $K$, the quantity $E[f_1]/n$ depends on the unknown parameters $N_1, N_2, \ldots, N_D$; see (6) and (8). Thus our approach to estimating $E[f_1]/n$ is the same as our approach to estimating $K$.

Estimators also can be based on high-order jackknifing schemes that consider the number of distinct values in the sample when two elements are removed, when three elements are removed, and so forth. Typically, using a high-order jackknifing scheme requires estimating high-order moments (skewness, kurtosis, and so forth) of the set of numbers $\{N_1, N_2, \ldots, N_D\}$. Initial experiments indicated that the reduction in estimation error due to using the high-order jackknife is outweighed by the increase in error due to uncertainty in the moment estimates. Thus we do not pursue high-order jackknife schemes further.

3.2. The Estimators

Different approximations for $K$ and $E[f_1]/n$ lead to different estimators for $D$. Here we develop a number of the possible estimators.

3.2.1. First-Order Estimators

The simplest estimators of $D$ can be derived using a first-order approximation to $K$. Specifically, approximate each $N_j$ in (9) by the average value

$$\bar N = \frac{1}{D} \sum_{j=1}^{D} N_j = \frac{N}{D},$$

which gives $K \approx n(1-q)/(\bar N q) = (1-q)D$ because $n = qN$, and substitute the resulting expression for $K$ into (4) to obtain

$$\hat D = d_n + \frac{(1-q)\, f_1\, D}{n}. \tag{10}$$

Now substitute $\hat D$ for $D$ on the right side of (10) and solve for $\hat D$.
The resulting solution, denoted by $\hat D_{uj1}$, is given by

$$\hat D_{uj1} = \Bigl(1 - \frac{(1-q)\, f_1}{n}\Bigr)^{-1} d_n. \tag{11}$$

We refer to this estimator as the "unsmoothed first-order jackknife estimator."
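Formula (11) needs only $d_n$, $f_1$, $n$, and $N$. A minimal sketch follows; the function name `d_uj1` and the numbers in the usage line are hypothetical.

```python
def d_uj1(d, f1, n, N):
    """Unsmoothed first-order jackknife estimate, formula (11)."""
    q = n / N                                  # sampling fraction
    return d / (1.0 - (1.0 - q) * f1 / n)

# Hypothetical sample summary: n = 20 items from N = 200, d = 8 classes, f_1 = 5 singletons.
estimate = d_uj1(d=8, f1=5, n=20, N=200)
```

Two limiting cases are easy to check by hand: with $q = 1$ the estimate equals $d_n$, and when every sampled item is distinct ($d_n = f_1 = n$) the estimate equals $N$.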

To derive a "smoothed first-order jackknife estimator," observe that by (6) and (8),

$$\frac{E[f_1]}{n} \approx \frac{1}{n} \sum_{j=1}^{D} N_j\, q\, (1-q)^{N_j - 1}. \tag{12}$$

Approximating each $N_j$ in (12) by $\bar N$, we have

$$\frac{E[f_1]}{n} \approx (1-q)^{\bar N - 1}. \tag{13}$$

On the right side of (10), replace $f_1/n$ with the approximate expression for $E[f_1]/n$ given in (13), yielding

$$\hat D = d_n + D(1-q)^{\bar N}.$$

Replacing $D$ with $\hat D$ and $\bar N$ with $N/\hat D$ in the foregoing expression leads to the relation

$$\hat D \bigl(1 - (1-q)^{N/\hat D}\bigr) = d_n.$$

We define the smoothed first-order jackknife estimator $\hat D_{sj1}$ as the value of $\hat D$ that solves this equation. Given $d_n$, $n$, and $N$, $\hat D_{sj1}$ can be computed numerically using standard root-finding procedures. Observe that if in fact $N_1 = N_2 = \cdots = N_D = N/D$, then

$$E[d_n] \approx D \bigl(1 - (1-q)^{N/D}\bigr).$$

In this case $\hat D_{sj1}$ can be viewed as a simple method-of-moments estimator obtained by replacing $E[d_n]$ with the estimate $d_n$ and solving for $D$. If, moreover, the sampling fraction $q$ is small enough so that the distribution of $(n_1, n_2, \ldots, n_D)$ is approximately multinomial (see Sec. 3.3), then $\hat D_{sj1}$ is approximately equal to the maximum likelihood estimator for $D$ (see Good 1950). Observe that both $\hat D_{uj1}$ and $\hat D_{sj1}$ are consistent for $D$: $\hat D_{uj1} \to D$ and $\hat D_{sj1} \to D$ as $q \to 1$.

3.2.2. Second-Order Estimators

A second-order approximation to $K$ can be derived as follows. Denote by $\gamma^2$ the squared coefficient of variation of the class sizes $N_1, N_2, \ldots, N_D$:

$$\gamma^2 = \frac{(1/D) \sum_{j=1}^{D} (N_j - \bar N)^2}{\bar N^2}. \tag{14}$$

Suppose that $\gamma^2$ is relatively small, so that each $N_j$ is close to the average value $\bar N$. Substitute the Taylor approximations

$$(1-q)^{N_j} \approx (1-q)^{\bar N} + (1-q)^{\bar N} \ln(1-q)\,(N_j - \bar N)$$

and

$$N_j\, q\, (1-q)^{N_j - 1} \approx N_j\, q \bigl[(1-q)^{\bar N - 1} + (1-q)^{\bar N - 1} \ln(1-q)\,(N_j - \bar N)\bigr]$$

for $1 \le j \le D$ into (9) to obtain

$$K \approx D(1-q)\, \frac{1}{1 + \ln(1-q)\, \bar N \gamma^2} \approx D(1-q) \bigl(1 - \ln(1-q)\, \bar N \gamma^2\bigr). \tag{15}$$

The unknown parameter $\gamma^2$ can be estimated using the following approach (cf. Chao and Lee 1992). With the usual convention that $\binom{n}{m} = 0$ for $n < m$, we find that

$$\sum_{i=1}^{N} i(i-1) E[f_i] \approx \sum_{i=1}^{N} i(i-1) \sum_{j=1}^{D} \binom{N_j}{i} q^i (1-q)^{N_j - i} = q^2 \sum_{j=1}^{D} N_j (N_j - 1) \sum_{i=2}^{N_j} \binom{N_j - 2}{i - 2} q^{i-2} (1-q)^{N_j - i} = q^2 \sum_{j=1}^{D} N_j (N_j - 1),$$

so that

$$\gamma^2 \approx \frac{D}{n^2} \sum_{i=1}^{N} i(i-1) E[f_i] + \frac{D}{N} - 1.$$

Thus if $D$ were known, then a natural method-of-moments estimator $\hat\gamma^2(D)$ of $\gamma^2$ would be

$$\hat\gamma^2(D) = \max\Bigl(0,\ \frac{D}{n^2} \sum_{i=1}^{n} i(i-1) f_i + \frac{D}{N} - 1\Bigr). \tag{16}$$

To develop a second-order estimate of $D$, substitute (15) into (4) to obtain

$$\hat D = d_n + \frac{D f_1 (1-q)}{n} \bigl(1 - \ln(1-q)\, \bar N \gamma^2\bigr), \tag{17}$$

from which it follows that

$$\hat D = d_n + \frac{D f_1 (1-q)}{n} - \frac{f_1 (1-q) \ln(1-q)\, \gamma^2}{q}.$$

Replacing $D$ with $\hat D$ on the right side of this equation and solving for $\hat D$ yields the relation

$$\Bigl(1 - \frac{f_1 (1-q)}{n}\Bigr) \hat D = d_n - \frac{f_1 (1-q) \ln(1-q)\, \gamma^2}{q}. \tag{18}$$
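Equation (16) and relation (18) translate directly into code. The sketch below is illustrative only: the function names are hypothetical, `f` is assumed to be a dictionary mapping $i$ to $f_i$, and the sampling fraction must satisfy $q < 1$ so that $\ln(1-q)$ is defined.

```python
from math import log

def gamma2_hat(D, f, n, N):
    """Method-of-moments estimate (16) of the squared coefficient of variation."""
    s = sum(i * (i - 1) * fi for i, fi in f.items())
    return max(0.0, D * s / n**2 + D / N - 1.0)

def solve_relation_18(f, n, N, gamma2):
    """Solve the linear relation (18) for D-hat, for a given value of gamma^2 (requires q < 1)."""
    q = n / N
    d = sum(f.values())
    f1 = f.get(1, 0)
    rhs = d - f1 * (1 - q) * log(1 - q) * gamma2 / q
    return rhs / (1.0 - f1 * (1 - q) / n)

# Hypothetical data: 5 classes seen once and 3 classes seen five times (n = 20, N = 200).
f = {1: 5, 5: 3}
g2 = gamma2_hat(10, f, 20, 200)
d_hat = solve_relation_18(f, 20, 200, g2)
```

With $\gamma^2 = 0$ the solution of (18) reduces to the first-order estimate (11). Because $\ln(1-q) < 0$, a positive $\gamma^2$ increases the estimate.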

An estimator of $D$ can be obtained by substituting $\hat\gamma^2(\hat D)$ for $\gamma^2$ in (18) and solving for $\hat D$ numerically. Alternatively, we can start with a simple initial estimator of $D$ and then correct this estimator using (18). Following this latter approach, we use $\hat D_{uj1}$ as our initial estimator and define

$$\hat D_{uj2} = \Bigl(1 - \frac{f_1 (1-q)}{n}\Bigr)^{-1} \Bigl(d_n - \frac{f_1 (1-q) \ln(1-q)\, \hat\gamma^2(\hat D_{uj1})}{q}\Bigr).$$

A smoothed second-order jackknife estimator can be obtained by replacing the expression $f_1/n$ in (17) with the approximation to $E[f_1]/n$ given in (13), leading to

$$\hat D = d_n + D(1-q)^{\bar N} \bigl(1 - \ln(1-q)\, \bar N \gamma^2\bigr).$$

Replacing $D$ with $\hat D$ and proceeding as before, we obtain the estimator

$$\hat D_{sj2} = \bigl(1 - (1-q)^{\tilde N}\bigr)^{-1} \bigl(d_n - (1-q)^{\tilde N} \ln(1-q)\, N\, \hat\gamma^2(\hat D_{uj1})\bigr),$$

where $\tilde N = N/\hat D_{uj1}$. As with the first-order estimators $\hat D_{uj1}$ and $\hat D_{sj1}$, the second-order estimators $\hat D_{uj2}$ and $\hat D_{sj2}$ are consistent for $D$.

3.2.3. Horvitz-Thompson Jackknife Estimators

In this section we discuss an alternative approach to estimation of $K$ based on a technique of Horvitz and Thompson. (See Särndal, Swensson, and Wretman 1992 for a general discussion of Horvitz-Thompson estimators.) First, consider the general problem of estimating a parameter of the form $\theta(g) = \sum_{j=1}^{D} g(N_j)$, where $g$ is a specified function. Observe that because $P\{n_j > 0\} > 0$ for $1 \le j \le D$, we have $\theta(g) = E[X(g)]$, where

$$X(g) = \sum_{j=1}^{D} \frac{g(N_j)\, I(n_j > 0)}{P\{n_j > 0\}} = \sum_{\{j:\, n_j > 0\}} \frac{g(N_j)}{P\{n_j > 0\}}.$$

It follows from (8) that $P\{n_j > 0\} \approx 1 - (1-q)^{N_j}$, and the foregoing discussion suggests that we estimate $\theta(g)$ by

$$\hat\theta(g) = \sum_{\{j:\, n_j > 0\}} \frac{g(\hat N_j)}{1 - (1-q)^{\hat N_j}}, \tag{19}$$

where $\hat N_j$ is an estimator for $N_j$. The key point is that we need to estimate $N_j$ only when $n_j > 0$. To do this, observe that

$$E[n_j \mid n_j > 0] = \frac{E[n_j]}{P\{n_j > 0\}} \approx \frac{q N_j}{1 - (1-q)^{N_j}}.$$

Replacing $E[n_j \mid n_j > 0]$ with $n_j$ leads to the estimating equation

$$n_j = \frac{q N_j}{1 - (1-q)^{N_j}}, \tag{20}$$

and a method-of-moments estimator $\hat N_j$ can be defined as the value of $N_j$ that solves (20).

Now consider the problem of estimating $K$, and hence $D$. By (9), $K \approx \theta(f)/\theta(g)$, where $f(x) = (1-q)^x$ and $g(x) = x q (1-q)^{x-1}/n$. Thus a natural estimator of $K$ is given by $\hat\theta(f)/\hat\theta(g)$, leading to the final estimator

$$\hat D_{HTj} = d_n + \frac{\hat\theta(f)}{\hat\theta(g)}\, \frac{f_1}{n}.$$

A smoothed variant of $\hat D_{HTj}$ can be obtained by replacing $f_1/n$ with the Horvitz-Thompson estimator of $E[f_1]/n$, namely $\hat\theta(g)$. The resulting estimator, denoted by $\hat D_{HTsj}$, is given by

$$\hat D_{HTsj} = d_n + \hat\theta(f).$$

Finally, a hybrid estimator can be obtained using a first-order approximation for the numerator of $K$ and a Horvitz-Thompson estimator for the denominator. This leads to the estimator $\hat D_{hj}$, defined as the solution $\hat D$ of the equation

$$\hat D \Bigl(1 - \frac{f_1 (1-q)^{N/\hat D}}{n\, \hat\theta(g)}\Bigr) = d_n.$$

If we replace $f_1/n$ with the Horvitz-Thompson estimator for $E[f_1]/n$ in the foregoing equation in order to obtain a smoothed variant of $\hat D_{hj}$, then the resulting estimator coincides with $\hat D_{sj1}$.

Because $D = \theta(u)$, where $u(x) \equiv 1$, it may appear that a "non-jackknife" Horvitz-Thompson estimator $\hat D_{HT}$ can be defined by setting $\hat D_{HT} = \hat\theta(u)$. It is straightforward to show, however, that $\hat D_{HT} = \hat D_{HTsj}$, so that $\hat D_{HT}$ can in fact be viewed as a smoothed jackknife estimator.

Simulation experiments indicate that the behavior of the Horvitz-Thompson jackknife estimators $\hat D_{HTj}$ and $\hat D_{HTsj}$ is erratic (see App. D for detailed results). Overall, the poor performance of $\hat D_{HTj}$ and $\hat D_{HTsj}$ is caused by the inaccuracy of $\hat\theta(f)$ as an estimate of $\theta(f)$. The problem seems to be that when $N_j$ is small, the estimator $\hat N_j$ is unstable and yet typically has a large effect on the value of $\hat\theta(f)$ through the term $(1-q)^{\hat N_j} / \bigl(1 - (1-q)^{\hat N_j}\bigr)$. The estimator $\hat D_{hj}$ uses a Taylor approximation in place of $\hat\theta(f)$ and hence has lower bias and rmse than the other two Horvitz-Thompson jackknife estimators.
However, other estimators perform better than $\hat D_{hj}$, and we do not consider the estimators $\hat D_{HTj}$, $\hat D_{HTsj}$, and $\hat D_{hj}$ further.
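Although these estimators are not pursued further, the estimating equation (20) and the Horvitz-Thompson sum (19) are simple to sketch numerically. The code below is illustrative only: the function names are hypothetical, bisection is one possible root-finder (the report does not say which was used), and the input is assumed to be the list of per-class sample counts $n_j > 0$.

```python
def solve_class_size(nj, q, iterations=100):
    """Solve n_j = q*N_j / (1 - (1-q)**N_j) for N_j by bisection, equation (20)."""
    def h(x):
        return q * x / (1.0 - (1.0 - q) ** x)
    # h is increasing with h(1) = 1 <= n_j and h(n_j/q) >= n_j, so the root is bracketed.
    lo, hi = 1.0, nj / q
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if h(mid) < nj:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def d_ht(sample_counts, q):
    """Horvitz-Thompson estimate theta-hat(u) with u(x) = 1, summed over observed classes."""
    total = 0.0
    for nj in sample_counts:
        Nj_hat = solve_class_size(nj, q)
        total += 1.0 / (1.0 - (1.0 - q) ** Nj_hat)
    return total

counts = [1, 2, 5]                 # hypothetical per-class sample counts
estimate = d_ht(counts, q=0.2)
```

A class seen exactly once gets $\hat N_j = 1$ and contributes $1/q$ to the sum, which shows how strongly singletons drive this estimator when $q$ is small.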

3.3. Relation to Estimators Based on Sample Coverage

The generalized jackknife approach for deriving an estimator of $D$ works for sample designs other than hypergeometric sampling. For example, the most thoroughly studied version of the number-of-classes problem is that in which the population is assumed to be infinite and $\mathbf{n}$ is assumed to have a multinomial distribution with parameter vector $\boldsymbol\pi = (\pi_1, \pi_2, \ldots, \pi_D)$; that is,

$$P(\mathbf{n} \mid D, \boldsymbol\pi) = \binom{n}{n_1\ n_2\ \cdots\ n_D} \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_D^{n_D}. \tag{21}$$

When we proceed as in Section 3.1 to derive a generalized jackknife estimator under the model in (21), the estimator turns out to be nearly identical to the "coverage-based" estimator proposed by Chao and Lee (1992). To see this, start again with (4) and select $K$ as in (5). Because

$$E[d_n] - D = -\sum_{j=1}^{D} (1 - \pi_j)^n$$

under the model in (21), it follows that

$$K = \frac{\sum_{j=1}^{D} v_n(\pi_j)}{\sum_{j=1}^{D} \pi_j\, v_{n-1}(\pi_j)},$$

where $v_n(x) = (1-x)^n$. Set $\bar\pi = 1/D$ and use the Taylor approximations

$$v_n(\pi_j) \approx v_n(\bar\pi) + (\pi_j - \bar\pi)\, v_n'(\bar\pi)$$

and

$$\pi_j\, v_{n-1}(\pi_j) \approx \pi_j \bigl(v_{n-1}(\bar\pi) + (\pi_j - \bar\pi)\, v_{n-1}'(\bar\pi)\bigr)$$

in a manner analogous to the derivation in Section 3.2.2 to obtain

$$K \approx (D - 1) + (n - 1) \gamma^2, \tag{22}$$

where $\gamma^2 = -1 + D \sum_{j=1}^{D} \pi_j^2$ is the squared coefficient of variation of the numbers $\pi_1, \pi_2, \ldots, \pi_D$. Denote by $\hat D_{mult}$ the estimator of $D$ under the multinomial model. Then, by (4),

$$\hat D_{mult} = d_n + \bigl((D - 1) + (n - 1) \gamma^2\bigr) \frac{f_1}{n}. \tag{23}$$

Replace $D$ with $\hat D_{mult}$ and $\gamma^2$ with an estimator $\tilde\gamma^2$ in (23) and solve for $\hat D_{mult}$ to obtain

$$\hat D_{mult} = \frac{d_n}{\hat C} + \frac{1 - \hat C}{\hat C} \bigl((n - 1)\, \tilde\gamma^2 - 1\bigr),$$

where $\hat C = 1 - (f_1/n)$. When the sample size is large, the estimator $\hat D_{mult}$ is essentially the same as the estimator

$$\hat D_{CL} = \frac{d_n}{\hat C} + \frac{n (1 - \hat C)}{\hat C}\, \tilde\gamma^2$$

proposed by Chao and Lee (1992). The estimator $\hat D_{CL}$ was developed from a different point of view, using the concept of sample coverage. The sample coverage for an infinite population is defined as $\sum_{j=1}^{D} \pi_j I[n_j > 0]$, and the quantity $\hat C = 1 - (f_1/n)$ is a standard estimator of the sample coverage. Conversely, when Chao and Lee's derivation is modified to account for hypergeometric sampling, the resulting estimator is equal to $\hat D_{uj2}$ (see App. B). Thus at least some estimators based on sample coverage can be viewed as generalized jackknife estimators.

3.4. Relation to Shlosser's Estimator

Observe that the estimator $\hat D_{Sh}$, though not developed from a jackknife perspective, can be viewed as an estimator of the form (4) with $K$ estimated by

$$\hat K_{Sh} = \frac{n \sum_{i=1}^{n} (1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i}.$$

To analyze the behavior of $\hat D_{Sh}$, we first rewrite the jackknife quantity $K$ defined in (9) as follows:

$$K = \frac{n \sum_{i=1}^{N} (1-q)^i F_i}{\sum_{i=1}^{N} i q (1-q)^{i-1} F_i}. \tag{24}$$

Shlosser's justification of $\hat D_{Sh}$ assumes that

$$\frac{E[f_i]}{E[f_1]} \approx \frac{F_i}{F_1} \tag{25}$$

for $1 \le i \le N$. When the assumption in (25) holds and the sample size is large enough so that

$$f_i \approx E[f_i] \tag{26}$$

for $1 \le i \le N$, we have

$$\hat K_{Sh} \approx \frac{n \sum_{i=1}^{N} (1-q)^i E[f_i]}{\sum_{i=1}^{N} i q (1-q)^{i-1} E[f_i]} = \frac{n \sum_{i=1}^{N} (1-q)^i \bigl(E[f_i]/E[f_1]\bigr)}{\sum_{i=1}^{N} i q (1-q)^{i-1} \bigl(E[f_i]/E[f_1]\bigr)} \approx \frac{n \sum_{i=1}^{N} (1-q)^i \bigl(F_i/F_1\bigr)}{\sum_{i=1}^{N} i q (1-q)^{i-1} \bigl(F_i/F_1\bigr)} = K,$$

so that $\hat D_{Sh}$ behaves as a generalized jackknife estimator. Although the relations in (25) and (26) hold exactly for $n = N$ (implying that $\hat D_{Sh}$ is consistent for $D$), these relations can fail drastically for smaller sample sizes. For example, when $F_1 = 0$ and $F_i > 0$ for some $i > 1$, the right side of (25) is infinite, whereas the left side is finite for $n$ sufficiently small. This observation leads one to expect that $\hat D_{Sh}$ will not perform well when the sample size is relatively small and $N_1, N_2, \ldots, N_D$ have similar values (with $N_j > 1$ for each $j$). Both the variance analysis in Section 4 and the simulation experiments described in Section 6 bear out this conjecture.

The foregoing discussion suggests that replacing $\hat K_{Sh}$ with

$$\hat K'_{Sh} = K\, \frac{\hat K_{Sh}}{E[\hat K_{Sh}]} \tag{27}$$

in the formula for $\hat D_{Sh}$ might result in an improved estimator, because $\hat K'_{Sh}$ is unbiased for $K$. Of course we cannot perform this replacement exactly, since $K$ and $E[\hat K_{Sh}]$ are unknown, but we can approximate $\hat K'_{Sh}$ as follows. Using the fact that

$$E[f_r] = \sum_{j=1}^{D} P\{n_j = r\} \approx \sum_{j=1}^{D} \binom{N_j}{r} q^r (1-q)^{N_j - r} = \sum_{i=r}^{N} \binom{i}{r} q^r (1-q)^{i-r} F_i \tag{28}$$

for $1 \le r \le n$, we have, to first order,

$$E[\hat K_{Sh}] \approx \frac{n \sum_{i=1}^{N} (1-q)^i E[f_i]}{\sum_{i=1}^{N} i q (1-q)^{i-1} E[f_i]} = \frac{n \sum_{i=1}^{N} (1-q)^i \bigl((1+q)^i - 1\bigr) F_i}{\sum_{i=1}^{N} i q^2 (1-q^2)^{i-1} F_i}. \tag{29}$$

Using the first-order approximation $N_1 = N_2 = \cdots = N_D = \bar N$ together with (24), (27), and (29), we find that

$$\hat K'_{Sh} \approx \Bigl(\frac{q (1+q)^{\bar N - 1}}{(1+q)^{\bar N} - 1}\Bigr) \hat K_{Sh}.$$

We thus obtain a modified Shlosser estimator given by

$$\hat D_{Sh2} = d_n + f_1 \Bigl(\frac{q (1+q)^{\tilde N - 1}}{(1+q)^{\tilde N} - 1}\Bigr) \Bigl(\frac{\sum_{i=1}^{n} (1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i}\Bigr),$$

where $\tilde N$ is an initial estimate of $\bar N$ based on an initial estimate of $D$. We set $\tilde N$ equal to $N/\hat D_{uj1}$ throughout. As with $\hat D_{Sh}$, the estimator $\hat D_{Sh2}$ is consistent for $D$.
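Shlosser's estimator and the modified estimator $\hat D_{Sh2}$ differ only by the correction factor in front of the ratio of sums, as the following sketch makes explicit. The function names are hypothetical, `f` is assumed to be a dictionary mapping $i$ to $f_i$, and the early return for $f_1 = 0$ is an added guard (the correction term is then zero, and the ratio can be $0/0$ when $q = 1$).

```python
def _shlosser_ratio(f, q):
    """Ratio of sums shared by D-hat_Sh and D-hat_Sh2."""
    num = sum((1 - q) ** i * fi for i, fi in f.items())
    den = sum(i * q * (1 - q) ** (i - 1) * fi for i, fi in f.items())
    return num / den

def d_sh(f, n, N):
    """Shlosser's estimator."""
    q = n / N
    d = sum(f.values())
    f1 = f.get(1, 0)
    if f1 == 0:                      # added guard: no singletons, no correction
        return float(d)
    return d + f1 * _shlosser_ratio(f, q)

def d_sh2(f, n, N):
    """Modified Shlosser estimator, with N-tilde = N / D-hat_uj1."""
    q = n / N
    d = sum(f.values())
    f1 = f.get(1, 0)
    if f1 == 0:                      # added guard, as above
        return float(d)
    d_init = d / (1.0 - (1.0 - q) * f1 / n)          # D-hat_uj1, formula (11)
    n_tilde = N / d_init
    correction = q * (1 + q) ** (n_tilde - 1) / ((1 + q) ** n_tilde - 1)
    return d + f1 * correction * _shlosser_ratio(f, q)

f = {1: 5, 5: 3}                     # hypothetical frequency counts (n = 20)
estimates = (d_sh(f, 20, 200), d_sh2(f, 20, 200))
```

When every sampled item is distinct, both functions return $N$. The correction factor equals 1 at $\tilde N = 1$ and falls below 1 as $\tilde N$ grows, so $\hat D_{Sh2}$ shrinks Shlosser's correction term whenever the initial estimate suggests classes larger than one item.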

An alternative consistent estimator of $D$ can be obtained by directly using the expressions in (24), (27), and (29) with $F_i$ estimated by

$$\hat F_i = \frac{f_1 f_i}{\sum_{k=1}^{n} k q (1-q)^{k-1} f_k} \tag{30}$$

for $1 \le i \le N$; these estimators of $F_1, F_2, \ldots, F_N$ were proposed by Shlosser (1981) in conjunction with the estimator $\hat D_{Sh}$. Substituting the resulting estimators of $K$ and $E[\hat K_{Sh}]$ into (27) leads to the final estimator

$$\hat D_{Sh3} = d_n + f_1 \Bigl(\frac{\sum_{i=1}^{n} i q^2 (1-q^2)^{i-1} f_i}{\sum_{i=1}^{n} (1-q)^i \bigl((1+q)^i - 1\bigr) f_i}\Bigr) \Bigl(\frac{\sum_{i=1}^{n} (1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i}\Bigr)^{2}.$$

As with the estimator $\hat D_{Sh}$, Shlosser's justification of the estimators in (30) rests on the assumption in (25). Thus one might expect that, like $\hat D_{Sh}$, the estimator $\hat D_{Sh3}$ will be unstable when the sample size is relatively small and $N_1, N_2, \ldots, N_D$ have similar values. On the other hand, the reduction in bias of $\hat K'_{Sh}$ relative to $\hat K_{Sh}$ leads one to expect that $\hat D_{Sh3}$ will perform better than $\hat D_{Sh}$ when $\gamma^2$ is sufficiently large. (One might be tempted to avoid the assumption in (25) when estimating $F_1, F_2, \ldots, F_N$ by taking a method-of-moments approach: replace $E[f_r]$ with $f_r$ in (28) for $1 \le r \le n$ and solve the resulting set of linear equations either exactly or approximately. As pointed out by Shlosser (1981), however, this system of equations is nearly singular, and hence extremely unstable.)

4. Variance and Variance Estimates

Consider an estimator $\hat D$ that is a function of the sample only through $\mathbf{f} = (f_1, f_2, \ldots, f_M)$, where $M = \max(N_1, N_2, \ldots, N_D)$. All of the estimators introduced in Section 3 are of this type. In general, we also allow $\hat D$ to depend explicitly on the population size $N$ and write $\hat D = \hat D(\mathbf{f}, N)$. Suppose that, for any $N > 0$ and nonnegative $M$-dimensional vector $\mathbf{f} \ne \mathbf{0}$, the function $\hat D$ is continuously differentiable at the point $(\mathbf{f}, N)$ and

$$\hat D(c\mathbf{f}, cN) = c\, \hat D(\mathbf{f}, N) \tag{31}$$

for $c > 0$.
Approximating the hypergeometric sample design by a Bernoulli sample design as in (8), we can obtain the following approximate expression for the asymptotic variance of $\hat D(\mathbf{f}, N)$ as $D$ becomes large:

$$\mathrm{AVar}[\hat D(\mathbf{f}, N)] \approx \sum_{i=1}^{M} A_i^2\, \mathrm{Var}[f_i] + \sum_{\substack{1 \le i, i' \le M \\ i \ne i'}} A_i A_{i'}\, \mathrm{Cov}[f_i, f_{i'}], \tag{32}$$

where $A_i$ is the partial derivative of $\hat D$ with respect to $f_i$, evaluated at the point $(\mathbf{f}, N)$. (When computing each $A_i$, we replace each occurrence of $n$ and $d_n$ in the formula for $\hat D$ by $\sum_{i=1}^{M} i f_i$ and $\sum_{i=1}^{M} f_i$ before taking derivatives.) The approximation in (32) is valid when there is not too much variability in the class sizes (see App. C for a precise formulation and proof of this result). It follows from the proof that, to a good approximation, the variance of an estimator $\hat D$ satisfying (31) increases linearly as $D$ increases.

Straightforward calculations show that each of the specific estimators $\hat D_{uj1}$, $\hat D_{uj2}$, $\hat D_{Sh}$, $\hat D_{Sh2}$, and $\hat D_{Sh3}$ is continuously differentiable as stated previously and also satisfies (31). Thus we can use (32) to study the asymptotic variance of these estimators. We focus on $\hat D_{uj1}$, $\hat D_{uj2}$, $\hat D_{Sh2}$, and $\hat D_{Sh3}$ because each of these estimators performs best for at least one population studied in the simulation experiments described in Section 6; we also consider $\hat D_{Sh}$, because $\hat D_{Sh}$ is the most useful of the estimators previously proposed in the literature.

Computation of the $A_i$ coefficients for each estimator is tedious, but straightforward. When $\hat D = \hat D_{uj2}$, for example, we obtain

$$A_1^{(uj2)} = A_1^{(uj1)} - \frac{N (1-q) \ln(1-q)}{n - (1-q) f_1} \Biggl[\hat\gamma^2 + f_1 \Bigl(\frac{\hat\gamma^2 + 1}{\hat D_{uj1}}\, A_1^{(uj1)} - \frac{2}{n} \Bigl(\hat\gamma^2 + 1 - \frac{\hat D_{uj1}}{N}\Bigr)\Bigr) - \frac{q f_1 \hat\gamma^2}{n - (1-q) f_1}\Biggr]$$

and

$$A_i^{(uj2)} = A_i^{(uj1)} - \frac{N (1-q) \ln(1-q)}{n - (1-q) f_1} \Biggl[f_1 \Bigl(\frac{\hat\gamma^2 + 1}{\hat D_{uj1}}\, A_i^{(uj1)} - \frac{2i}{n} \Bigl(\hat\gamma^2 + 1 - \frac{\hat D_{uj1}}{N}\Bigr) + \frac{i(i-1)\, \hat D_{uj1}}{n^2}\Bigr) - \frac{i f_1 \hat\gamma^2}{n - (1-q) f_1}\Biggr]$$

for $1 < i \le n$, where $\hat\gamma^2 = \hat\gamma^2(\hat D_{uj1})$,

$$A_1^{(uj1)} = \hat D_{uj1} \Bigl(\frac{1}{d_n} + \frac{(1-q)\,(1 - f_1/n)}{n - (1-q) f_1}\Bigr),$$

and

$$A_i^{(uj1)} = \hat D_{uj1} \Bigl(\frac{1}{d_n} - \frac{i (1-q) (f_1/n)}{n - (1-q) f_1}\Bigr)$$

for $1 < i \le n$.

Figures 1 and 2 compare the variances of the estimators $\hat D_{uj1}$, $\hat D_{uj2}$, $\hat D_{Sh}$, $\hat D_{Sh2}$, and $\hat D_{Sh3}$ for a number of populations with equal class sizes. For these special populations, $\hat D_{uj1}$ and $\hat D_{uj2}$ are approximately unbiased, so that the relative variances of these estimators are appropriate measures of relative performance. It is particularly instructive to compare the variance of $\hat D_{uj1}$ and $\hat D_{uj2}$, since $\hat D_{uj2}$ is obtained from $\hat D_{uj1}$ by adjusting the latter estimator to compensate for bias induced by the assumption of equal class sizes.
This adjustment is unnecessary for our special populations, and a comparison allows evaluation of the penalty (i.e., the increase in variance) that is being paid for the adjustment.

Figure 1: Standard deviation of $\hat D_{uj1}$, $\hat D_{uj2}$, $\hat D_{Sh}$, $\hat D_{Sh2}$, and $\hat D_{Sh3}$ ($N = 15{,}000$ and $\bar N = 10$). Horizontal axis: sampling fraction ($q$); vertical axis: standard deviation.

Figure 2: Standard deviation of $\hat D_{uj1}$, $\hat D_{uj2}$, and $\hat D_{Sh2}$ ($D = 1500$ and $q = 0.10$). Horizontal axis: class size ($\bar N$); vertical axis: standard deviation.

Figure 1 displays the standard deviations of $\hat D_{uj1}$, $\hat D_{uj2}$, $\hat D_{Sh}$, $\hat D_{Sh2}$, and $\hat D_{Sh3}$ for an equal-class-size population with $N = 15{,}000$ and $D = 1500$ (so that $\bar N = 10$) as the sampling fraction $q$ varies. Observe that $\hat D_{uj2}$ is only slightly less efficient than $\hat D_{uj1}$, so that the penalty for bias adjustment is small in this case. Performance of the estimators $\hat D_{uj1}$ and $\hat D_{Sh2}$ is nearly indistinguishable. The most striking observation is that for this population, $\hat D_{Sh}$ and $\hat D_{Sh3}$ are not competitive with the other three estimators. The relative performance of $\hat D_{Sh}$ and $\hat D_{Sh3}$ is especially poor for small sampling fractions. On the other hand, the variance analysis indicates that modification of $\hat D_{Sh}$ as in (27) and (29) indeed reduces the instability of the original Shlosser estimator in this case. Thus we focus on the estimators $\hat D_{uj1}$, $\hat D_{uj2}$, and $\hat D_{Sh2}$ in the remainder of this section and in the next section. (We return to the estimator $\hat D_{Sh3}$ in Section 6, where our simulation experiments indicate that $\hat D_{Sh3}$ can exhibit smaller rmse than the other estimators, but only at large sample sizes and for certain "ill-conditioned" populations in which $\gamma^2$ is extremely large.)

Figure 2 compares the three estimators $\hat D_{uj1}$, $\hat D_{uj2}$, and $\hat D_{Sh2}$ for equal-class-size populations with a range of class sizes; for these calculations the number of classes and the sampling fraction are held constant at $D = 1500$ and $q = 0.10$. This figure illustrates the difficulty of precisely estimating $D$ when the class size is small (but greater than 1). Again, we see that these three estimators perform similarly, with nearly equal variability when $\bar N$ exceeds about 40.
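A calculation of this kind can be sketched numerically for $\hat D_{uj1}$ without the closed-form $A_i$ coefficients. The code below rests on several assumptions that are not taken from the report: the derivatives $A_i$ are approximated by central differences, they are evaluated at the expected frequencies $E[f_i] = D p_i$, and $q$ is held fixed as under a Bernoulli design. For $D$ classes of equal size $\bar N$ under that design, the counts $f_i$ are multinomial with $p_i = \binom{\bar N}{i} q^i (1-q)^{\bar N - i}$, which supplies the variances and covariances in (32). All function names are hypothetical.

```python
from math import comb, sqrt

def d_uj1_from_f(f, N, q):
    """D-hat_uj1 written as a function of f = [f_1, ..., f_M], with q held fixed."""
    d = sum(f)
    n = sum(i * fi for i, fi in enumerate(f, start=1))
    return d / (1.0 - (1.0 - q) * f[0] / n)

def asymptotic_sd(estimator, D, class_size, q, rel_step=1e-6):
    """Evaluate (32) for D classes of equal size under Bernoulli sampling with rate q."""
    N = D * class_size
    M = class_size
    p = [comb(M, i) * q**i * (1 - q) ** (M - i) for i in range(1, M + 1)]
    f = [D * pi for pi in p]                      # expected frequencies E[f_i] (assumption)
    A = []
    for i in range(M):                            # central differences approximate A_i
        h = rel_step * max(f[i], 1.0)
        up, down = list(f), list(f)
        up[i] += h
        down[i] -= h
        A.append((estimator(up, N, q) - estimator(down, N, q)) / (2 * h))
    var = 0.0
    for i in range(M):
        for k in range(M):
            # multinomial variances on the diagonal, covariances off it
            cov = D * p[i] * (1 - p[i]) if i == k else -D * p[i] * p[k]
            var += A[i] * A[k] * cov
    return sqrt(var)

sd_small = asymptotic_sd(d_uj1_from_f, D=1500, class_size=10, q=0.10)
sd_large = asymptotic_sd(d_uj1_from_f, D=15000, class_size=10, q=0.10)
```

Because $\hat D_{uj1}$ satisfies (31), the computed variance grows linearly in $D$, so the two standard deviations differ by a factor of $\sqrt{10}$.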
We checked the accuracy of the variance approximation in some example populations by comparing the values computed from (32) with results of a simulation experiment. (This experiment is discussed more completely in Section 6 below.) Simulated sampling with $q = 0.05$, $0.10$, and $0.20$ from the population examined in Figure 1 ($N = 15{,}000$, $D = 1500$) yields variance estimates within 10% (on average) of those calculated from (32). Similar results were found in sampling from an equal-class-size population with $N = 15{,}000$ and $D = 150$. The only difficulties we encountered occurred for equal-class-size populations with

class sizes of $\bar N = 1$ and $\bar N = 2$. For these small class sizes the variance approximation, which is based on the approximation of the hypergeometric sample design by a Bernoulli sample design, is not sufficiently accurate. In particular, the approximate variance strongly reflects random fluctuations in the sample size due to the Bernoulli sample design; such fluctuations are not present in the actual hypergeometric sample design. Simulation experiments indicate that for $\bar N \ge 3$ the differences caused by Bernoulli versus hypergeometric sampling become negligible. (Of course, if the sample design is in fact Bernoulli, then this problem does not occur.)

In practice, we estimate the asymptotic variance of an estimator $\hat D$ by substituting estimates for $\{\mathrm{Var}[f_i] : 1 \le i \le M\}$ and $\{\mathrm{Cov}[f_i, f_{i'}] : 1 \le i \ne i' \le M\}$ into (32). To obtain such estimates, we approximate the true population by a population with $D$ classes, each of size $N/D$. Under this approximation and the assumption in (8) of a Bernoulli sample design, the random vector $f$ has a multinomial distribution with parameters $D$ and $p = (p_1, p_2, \ldots, p_n)$, where

$$p_i = \binom{N/D}{i} q^i (1-q)^{(N/D)-i}$$

for $1 \le i \le n$. It follows that $\mathrm{Var}[f_i] = D p_i (1 - p_i)$ and $\mathrm{Cov}[f_i, f_{i'}] = -D p_i p_{i'}$. Each $p_i$ can be estimated either by

$$\hat p_i = \binom{N/\hat D}{i} q^i (1-q)^{(N/\hat D)-i}$$

or simply by $f_i/\hat D$. It turns out that the latter formula yields better variance estimates, and so we take

$$\widehat{\mathrm{Var}}[f_i] = f_i \left(1 - \frac{f_i}{\hat D}\right) \quad\text{and}\quad \widehat{\mathrm{Cov}}[f_i, f_{i'}] = -\frac{f_i f_{i'}}{\hat D}$$

for $1 \le i, i' \le n$. These formulas coincide with the estimators obtained using the "unconditional approach" of Chao and Lee (1992). A computer program that calculates $\hat D_{uj1}$, $\hat D_{uj2}$, $\hat D_{Sh2}$ and their estimated standard errors from sample data can be obtained from the second author.

5. An Example

The following example illustrates how knowledge of the population size $N$ can affect estimates of the number of classes. When the population size $N$ is unknown, Chao and Lee (1992, Sec. 3) have proposed that the estimator $\hat D_{CL}$ defined in Section 3.3 be used to

estimate the number of classes, because the formula for $\hat D_{CL}$ does not involve the unknown parameter $N$. When $N$ is known, a slight modification of the derivation of $\hat D_{CL}$ leads to the unsmoothed second-order jackknife estimator $\hat D_{uj2}$ (see App. B).

Our example is based on one discussed by Chao and Lee (1992), who borrowed data first described and analyzed by Holst (1981). These data arose from an application in numismatics in which 204 ancient coins were classified according to die type in order to estimate the number of different dies used in the minting process. Among the die types on the reverse sides of the 204 coins were 156 singletons, 19 pairs, 2 triplets, and 1 quadruplet ($f_1 = 156$, $f_2 = 19$, $f_3 = 2$, $f_4 = 1$, $d = 178$). Because the total number of coins minted is unknown in this case, model (1) is inappropriate for analyzing these data. But suppose that the same data had arisen from an application in which $N$ was known. For example, suppose that the data were obtained by selecting a simple random sample of 204 names from a sampling frame that had been constructed by combining 5 lists of 200 names each ($N = 1000$), 50 lists of 200 names each ($N = 10{,}000$), or 500 lists of 200 names each ($N = 100{,}000$). In each case our object is to estimate the number of unique individuals on the combined list, based on the sample results. We focus on the three estimators $\hat D_{uj1}$, $\hat D_{uj2}$, and $\hat D_{Sh2}$. The estimates for the three cases are given in Table 1; the standard errors displayed in Table 1 are estimated using the procedure outlined in Section 4.

N          $\hat D_{uj1}$    $\hat D_{uj2}$    $\hat D_{Sh2}$
1,000      (47)              (60)              (51)
10,000     (125)             (161)             (128)
100,000    (141)             (183)             (144)

Table 1: Values of $\hat D_{uj1}$, $\hat D_{uj2}$, and $\hat D_{Sh2}$ for three hypothetical combined lists. (Standard errors are in parentheses.)

We would expect similar inferences to be made from the same data under the multinomial model and the finite population model when $N$ is very large. Indeed, the value $\hat D_{uj2} = 835$ agrees closely with Chao and Lee's estimate $\hat D_{CL} = 844$ (se 187) when $N = 100{,}000$.
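The value $\hat D_{uj2} = 835$ quoted above can be reproduced with a short calculation. The sketch below is not the authors' program. It assumes the usual closed forms of the unsmoothed jackknife estimators, which this section does not restate: $\hat D_{uj1} = d\,(1 - (1-q) f_1/n)^{-1}$, $\hat\gamma^2(\hat D) = \max\{0,\ (\hat D/n^2)\sum_i i(i-1) f_i + \hat D/N - 1\}$, and $\hat D_{uj2} = (1 - (1-q) f_1/n)^{-1}\,\bigl(d - f_1 (1-q)\ln(1-q)\,\hat\gamma^2(\hat D_{uj1})/q\bigr)$, with $q = n/N$. All function names are illustrative.

```python
import math

def sample_summary(f, N):
    """Return (n, d, q) from counts f[i] = number of classes seen exactly i times."""
    n = sum(i * fi for i, fi in f.items())  # sample size
    d = sum(f.values())                     # distinct classes observed
    return n, d, n / N                      # q = n / N is the sampling fraction

def d_uj1(f, N):
    """Unsmoothed first-order jackknife estimate (closed form assumed)."""
    n, d, q = sample_summary(f, N)
    return d / (1.0 - (1.0 - q) * f.get(1, 0) / n)

def gamma2_hat(f, N, D_hat):
    """Estimated squared coefficient of variation of class sizes (form assumed)."""
    n, _, _ = sample_summary(f, N)
    s = sum(i * (i - 1) * fi for i, fi in f.items())
    return max(0.0, D_hat * s / n**2 + D_hat / N - 1.0)

def d_uj2(f, N):
    """Unsmoothed second-order jackknife estimate (closed form assumed)."""
    n, d, q = sample_summary(f, N)
    if q >= 1.0:  # the whole population was sampled, so d is exact
        return float(d)
    f1 = f.get(1, 0)
    g2 = gamma2_hat(f, N, d_uj1(f, N))
    correction = f1 * (1.0 - q) * math.log(1.0 - q) * g2 / q
    return (d - correction) / (1.0 - (1.0 - q) * f1 / n)

# Die-type counts from the coin data: 156 singletons, 19 pairs, 2 triplets, 1 quadruplet.
coins = {1: 156, 2: 19, 3: 2, 4: 1}
for N in (1_000, 10_000, 100_000):
    g2 = gamma2_hat(coins, N, d_uj1(coins, N))
    print(N, round(d_uj1(coins, N)), round(d_uj2(coins, N)), round(g2, 2))
```

Under these assumed forms, `d_uj2(coins, 100_000)` rounds to 835, and a sample consisting entirely of singletons returns the population size $N$ from both estimators.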
Moreover, when $N = 100{,}000$ we find that $\hat\gamma^2(\hat D_{uj1}) \approx 0.13$, which is the same estimate of $\gamma^2$ given by Chao and Lee. As the population size decreases, however, both our assessment of the magnitude of $D$ and our uncertainty about that magnitude decrease, because we are observing a larger and larger fraction of both the population and the classes. The most extreme divergence between the estimate obtained using $\hat D_{CL}$ and estimates obtained using $\hat D_{uj1}$, $\hat D_{uj2}$, or $\hat D_{Sh2}$ occurs when the sample consists of all singletons ($f_1 = n$). In that case, $\hat D_{CL} = \infty$, whereas $\hat D_{uj1} = \hat D_{uj2} = \hat D_{Sh2} = N$. This result indicates that when the population size $N$ is known, it is better to use an estimator that exploits knowledge

of $N$ than to sample with replacement and use the estimator $\hat D_{CL}$. In some applications, sampling with replacement is not even an option. For example, the only available sampling mechanism in at least one current database system is a one-pass reservoir algorithm (as in Vitter 1985).

The empirical results in Section 6 indicate that, of the three estimators displayed in Table 1, $\hat D_{uj2}$ is the superior estimator when $\gamma^2$ is small ($< 1$). Thus for our example, $\hat D_{uj2}$ would be the preferred estimator, since $\hat\gamma^2(\hat D_{uj1}) \approx 0.13$ in all three cases. Note that $\hat D_{uj2}$ consistently has the highest variance of the three estimators in Table 1. The bias of $\hat D_{uj2}$ is typically lower than that of $\hat D_{uj1}$ or $\hat D_{Sh2}$ when $\gamma^2$ is small, however, so that the overall rmse is lower.

6. Simulation Results

This section describes the results of a simulation study done to compare the performance of the various estimators described in Section 3. Our comparison is based on the performance of the estimators for sampling fractions of 5%, 10%, and 20% in 52 populations. (Initial experiments indicated that the performance of the various estimators is best viewed as a function of sampling fraction, rather than absolute sample size. This is in contrast to estimators of, for example, population averages.)

We consider several sets of populations. The first set comprises synthetic populations of the type considered in the literature. Populations EQ10 and EQ100 have equal class sizes of 10 and 100, respectively. In populations NGB/1, NGB/2, and NGB/4, the class sizes follow a negative binomial distribution. Specifically, the fraction $f(m)$ of classes in population NGB/$k$ with class size equal to $m$ is given by

$$f(m) \approx \binom{m-1}{k-1} r^k (1-r)^{m-k}$$

for $m \ge k$, where $r = 0.04$. Chao and Lee (1992) considered populations of this type.

The populations in the second set are meant to be representative of data that could be encountered when a sampling frame for a population census is constructed by combining a number of lists which may contain overlapping entries. Populations GOOD and SUDM were studied by Goodman (1949) and Sudman (1976), respectively.
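To make the negative binomial class-size fractions for the NGB/$k$ populations concrete, the following sketch evaluates $f(m)$ for $k = 1, 2, 4$ with $r = 0.04$, treating the expression as exact. The function names and the truncation point `m_max` are illustrative choices, not part of the original study. The fractions sum to 1, and the implied mean class size is $k/r$, a standard property of this distribution.

```python
import math

def ngb_fraction(m, k, r=0.04):
    """Fraction of classes of size m in population NGB/k (zero for m < k)."""
    if m < k:
        return 0.0
    return math.comb(m - 1, k - 1) * r**k * (1.0 - r) ** (m - k)

def ngb_mean_class_size(k, r=0.04, m_max=5000):
    """Mean class size implied by the fractions, summed up to m_max (hypothetical cutoff)."""
    return sum(m * ngb_fraction(m, k, r) for m in range(k, m_max + 1))

for k in (1, 2, 4):
    total = sum(ngb_fraction(m, k) for m in range(k, 5001))
    print(f"NGB/{k}: fractions sum to {total:.6f}, mean class size {ngb_mean_class_size(k):.2f}")
```

Larger $k$ shifts the distribution toward larger classes, since no class in NGB/$k$ can be smaller than $k$.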
Population FRAME2 mimics a sampling frame that might arise in an administrative records census of the type described in Section 1. One approach to such a census is to augment the usual census address list with a small number of relatively large administrative records files, such as AFDC or Food Stamps, and then estimate the number of distinct individuals on the combined list from a sample. We have constructed FRAME2 so that a given individual can appear at most five times, but most individuals appear exactly once, mimicking the case in which four administrative lists are used to supplement the census address list. Population FRAME3 is similar to FRAME2, but for the FRAME3 population it is assumed that the combined list is made up of a number of small lists (perhaps obtained from neighborhood-level organizations) rather than a few

large lists. The populations in the third set, denoted by Z20A, Z20B, and Z15, are used to study the behavior of the estimators when the data are extremely ill-conditioned. The class sizes in each of these populations follow a generalized Zipf distribution (see Knuth 1973, p. 398). Specifically, $N_j/N \propto j^{-\theta}$, where $\theta$ equals 1.5 or 2.0. These populations have extremely high values of $\gamma^2$. Descriptive statistics for these three sets of populations are given in Tables 2, 3, and 4. The column entitled "skew" displays the dimensionless coefficient of skewness, which is defined by

$$\frac{\sum_{j=1}^{D} (N_j - \bar N)^3 / D}{\left(\sum_{j=1}^{D} (N_j - \bar N)^2 / D\right)^{3/2}}.$$

Table 2: Characteristics of synthetic populations. Columns: Name, $N$, $D$, $\gamma^2$, Skew. Rows: EQ10, EQ100, NGB/1, NGB/2, NGB/4.

Table 3: Characteristics of "merged list" populations. Columns: Name, $N$, $D$, $\gamma^2$, Skew. Rows: GOOD, FRAME2, FRAME3, SUDM.

Table 4: Characteristics of "ill-conditioned" populations. Columns: Name, $N$, $D$, $\gamma^2$, Skew. Rows: Z20A, Z15, Z20B.

The final set comprises 40 real populations that demonstrate the type of distributions encountered when estimating the number of distinct values of an attribute in a relational database. Specifically, the populations studied correspond to various relational attributes from a database of enrollment records for students at the University of Wisconsin and a database of billing records from a large insurance company. The population size $N$ ranges from 15,469 to 1,654,700, with $D$ ranging from 3 to 1,547,606 and $\gamma^2$ ranging upward from 0 (see App. D for further details). It is notable that values of $\gamma^2$ encountered in the literature (Chao and Lee 1992; Goodman 1949; Shlosser 1981; Sudman 1976) tend not to exceed the value 2, and are typically less than 1, whereas the value of $\gamma^2$ exceeds 2 for more than 50% of the real populations.

For each estimator, population, and sampling fraction, we estimated the bias and rmse by repeatedly drawing a simple random sample from the population, evaluating the estimator, and then computing the error of each estimate. (When evaluating the estimator, we truncated each estimate below at $d$ and above at $N$.)
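The replication loop just described can be sketched as follows. This is a minimal illustration rather than the code used in the study. The population (150 classes of size 10) is hypothetical, the estimator plugged in is $\hat D_{uj1}$ in its usual closed form $d\,(1 - (1-q) f_1/n)^{-1}$ (assumed, since it is not restated in this section), and bias and rmse are taken as the mean error and the square root of the mean squared error, each expressed as a percentage of the true $D$.

```python
import math
import random
from collections import Counter

def uj1_truncated(sample, N):
    """D_uj1 (closed form assumed), truncated below at d and above at N."""
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)                                   # distinct classes in the sample
    f1 = sum(1 for c in counts.values() if c == 1)    # classes seen exactly once
    q = n / N                                         # sampling fraction
    estimate = d / (1.0 - (1.0 - q) * f1 / n)
    return min(max(estimate, d), N)

def simulate(population, q, estimator, reps=100, seed=0):
    """Return (bias %, rmse %) over `reps` simple random samples without replacement."""
    rng = random.Random(seed)
    N = len(population)
    D = len(set(population))
    n = round(q * N)
    errors = [estimator(rng.sample(population, n), N) - D for _ in range(reps)]
    bias = sum(errors) / reps
    rmse = math.sqrt(sum(e * e for e in errors) / reps)
    return 100.0 * bias / D, 100.0 * rmse / D

# Hypothetical equal-class-size population: each element is labelled by its class.
population = [j for j in range(150) for _ in range(10)]
for q in (0.05, 0.10, 0.20):
    bias_pct, rmse_pct = simulate(population, q, uj1_truncated)
    print(f"q={q:.2f}  bias={bias_pct:6.2f}%  rmse={rmse_pct:6.2f}%")
```

Passing the estimator as a function keeps the loop unchanged when other estimators are compared on the same samples; fixing the seed makes the runs repeatable.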
The final estimate of bias was obtained by averaging the error over all of the experimental replications, and rmse was

estimated as the square root of the averaged square error. We used 100 replications, which was sufficient to estimate the rmse with a standard error below 5% in nearly all cases; typically the standard error was much less.

Summary results from the simulations are displayed in Tables 5 and 6. Table 5 gives the average and maximum rmse's for each estimator of $D$ over all populations with $0 \le \gamma^2 < 1$, with $1 \le \gamma^2 < 50$, and with $\gamma^2 \ge 50$, as well as the average and maximum rmse's for each estimator over all populations combined. Similarly, Table 6 gives the average and maximum bias for each estimator. In these tables, the rmse and bias are each expressed as a percentage of the true number of classes. Tables 5 and 6 also display the rmse and bias of the estimator $\hat\gamma^2(\hat D_{uj1})$ used in the second-order jackknife estimators; the rmse and bias are expressed as a percentage of the true value $\gamma^2$ and are displayed in the column labelled $\hat\gamma^2$.

Table 5: Average and maximum rmse (%) for various estimators. Columns: $\hat D_{uj1}$, $\hat D_{sj1}$, $\hat D_{uj2}$, $\hat D_{sj2}$, $\hat D_{Sh}$, $\hat D_{Sh2}$, $\hat D_{Sh3}$, and $\hat\gamma^2$. Rows: average and maximum for each sampling fraction (5%, 10%, 20%) and each range of $\gamma^2$ ($0 \le \gamma^2 < 1$, $1 \le \gamma^2 < 50$, $\gamma^2 \ge 50$, and all populations).

Comparing Tables 5 and 6 indicates that for each estimator the major component of the rmse is almost always bias, not variance. Thus, even though the standard error can be estimated as in Section 4, this estimated standard error usually does not give an accurate picture of the error in estimation of $D$. Another consequence of the predominance of bias is that when $\gamma^2$ is large, the rmse for the second-order estimator $\hat D_{uj2}$ does not decrease

monotonically as the sampling fraction increases. (In all other cases the rmse decreases monotonically.)

Table 6: Average and maximum bias (%) for various estimators. Columns: $\hat D_{uj1}$, $\hat D_{sj1}$, $\hat D_{uj2}$, $\hat D_{sj2}$, $\hat D_{Sh}$, $\hat D_{Sh2}$, $\hat D_{Sh3}$, and $\hat\gamma^2$. Rows: average and maximum for each sampling fraction (5%, 10%, 20%) and each range of $\gamma^2$ ($0 \le \gamma^2 < 1$, $1 \le \gamma^2 < 50$, $\gamma^2 \ge 50$, and all populations).

Comparing $\hat D_{uj1}$ with $\hat D_{sj1}$ and then comparing $\hat D_{uj2}$ with $\hat D_{sj2}$, we see that smoothing a first-order jackknife estimator never results in a better first-order estimator. On the other hand, smoothing a second-order jackknife estimator can result in significant performance improvement when $\gamma^2$ is large. Similarly, using higher-order Taylor expansions leads to mixed results. Second-order estimators perform better than first-order estimators when $\gamma^2$ is relatively small, but not when $\gamma^2$ is large. The difficulty is partially that the estimator $\hat\gamma^2(\hat D_{uj1})$ tends to underestimate $\gamma^2$ when $\gamma^2$ is large, leading to underestimates of the number of classes. Moreover, the Taylor approximations underlying $\hat D_{uj1}$, $\hat D_{sj1}$, $\hat D_{uj2}$, and $\hat D_{sj2}$ are derived under the assumption of not too much variability between class sizes; this assumption is violated when $\gamma^2$ is large. There apparently is no systematic relation between the coefficient of skewness for the class sizes and the performance of second-order jackknife estimators.

As predicted in Sections 3.4 and 4, the estimators $\hat D_{Sh}$ and $\hat D_{Sh3}$ behave poorly when $\gamma^2$ is relatively small, and $\hat D_{Sh3}$ performs better than $\hat D_{Sh}$ when $\gamma^2$ is large. For small to medium values of $\gamma^2$, the modified estimator $\hat D_{Sh2}$ has a smaller rmse than $\hat D_{Sh}$ or $\hat D_{Sh3}$, and

its performance is comparable to the generalized jackknife estimators. For extremely large values of $\gamma^2$ and also for large sample sizes, the estimator $\hat D_{Sh3}$ has the best performance of the three Shlosser-type estimators. (For a 20% sampling fraction, $\hat D_{Sh3}$ in fact has the lowest average rmse of all the estimators considered.)

Table 7: Average and maximum rmse (%) of $\hat D_{uj2}$, $\hat D_{uj2a}$, $\hat D_{Sh2}$, $\hat D_{Sh3}$, and $\hat D_{hybrid}$. Rows: average and maximum for each sampling fraction (5%, 10%, 20%) and each range of $\gamma^2$ ($0 \le \gamma^2 < 1$, $1 \le \gamma^2 < 50$, $\gamma^2 \ge 50$, and all populations).

As indicated earlier, smoothing can improve the performance of the second-order jackknife estimator $\hat D_{uj2}$. An alternative ad hoc technique for improving performance is to "stabilize" $\hat D_{uj2}$ using a method suggested by Chao, Ma, and Yang (1993). Fix $c \ge 1$ and remove any class whose frequency in the sample exceeds $c$; that is, remove from the sample all members of classes $\{C_j : j \in B\}$, where $B = \{1 \le j \le D : n_j > c\}$. Then compute the estimator $\hat D_{uj2}$ from the reduced sample and subsequently increment it by $|B|$ to produce the final estimate, denoted by $\hat D_{uj2a}$. (Here $|B|$ denotes the number of elements in the set $B$.) When computing $\hat D_{uj2}$ from the reduced sample, take the population size as $N - \sum_{j \in B} \hat N_j$, where each $\hat N_j$ is a method-of-moments estimator of $N_j$ as described earlier. If $n - \sum_{j \in B} n_j = 0$, then simply compute $\hat D_{uj2}$ from the full sample. The idea behind this procedure is as follows. When $\gamma^2$ is large, the population consists of a few large classes and many smaller classes. By in effect removing the largest classes from the population,


More information

µ and π p i.e. Point Estimation x And, more generally, the population proportion is approximately equal to a sample proportion

µ and π p i.e. Point Estimation x And, more generally, the population proportion is approximately equal to a sample proportion Poit Estimatio Poit estimatio is the rather simplistic (ad obvious) process of usig the kow value of a sample statistic as a approximatio to the ukow value of a populatio parameter. So we could for example

More information

Modied moment estimation for the two-parameter Birnbaum Saunders distribution

Modied moment estimation for the two-parameter Birnbaum Saunders distribution Computatioal Statistics & Data Aalysis 43 (23) 283 298 www.elsevier.com/locate/csda Modied momet estimatio for the two-parameter Birbaum Sauders distributio H.K.T. Ng a, D. Kudu b, N. Balakrisha a; a Departmet

More information

Big Picture. 5. Data, Estimates, and Models: quantifying the accuracy of estimates.

Big Picture. 5. Data, Estimates, and Models: quantifying the accuracy of estimates. 5. Data, Estimates, ad Models: quatifyig the accuracy of estimates. 5. Estimatig a Normal Mea 5.2 The Distributio of the Normal Sample Mea 5.3 Normal data, cofidece iterval for, kow 5.4 Normal data, cofidece

More information

If, for instance, we were required to test whether the population mean μ could be equal to a certain value μ

If, for instance, we were required to test whether the population mean μ could be equal to a certain value μ STATISTICAL INFERENCE INTRODUCTION Statistical iferece is that brach of Statistics i which oe typically makes a statemet about a populatio based upo the results of a sample. I oesample testig, we essetially

More information

Element sampling: Part 2

Element sampling: Part 2 Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig

More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

Chapter 8: Estimating with Confidence

Chapter 8: Estimating with Confidence Chapter 8: Estimatig with Cofidece Sectio 8.2 The Practice of Statistics, 4 th editio For AP* STARNES, YATES, MOORE Chapter 8 Estimatig with Cofidece 8.1 Cofidece Itervals: The Basics 8.2 8.3 Estimatig

More information

Introducing Sample Proportions

Introducing Sample Proportions Itroducig Sample Proportios Probability ad statistics Aswers & Notes TI-Nspire Ivestigatio Studet 60 mi 7 8 9 0 Itroductio A 00 survey of attitudes to climate chage, coducted i Australia by the CSIRO,

More information

THE SYSTEMATIC AND THE RANDOM. ERRORS - DUE TO ELEMENT TOLERANCES OF ELECTRICAL NETWORKS

THE SYSTEMATIC AND THE RANDOM. ERRORS - DUE TO ELEMENT TOLERANCES OF ELECTRICAL NETWORKS R775 Philips Res. Repts 26,414-423, 1971' THE SYSTEMATIC AND THE RANDOM. ERRORS - DUE TO ELEMENT TOLERANCES OF ELECTRICAL NETWORKS by H. W. HANNEMAN Abstract Usig the law of propagatio of errors, approximated

More information

Goodness-Of-Fit For The Generalized Exponential Distribution. Abstract

Goodness-Of-Fit For The Generalized Exponential Distribution. Abstract Goodess-Of-Fit For The Geeralized Expoetial Distributio By Amal S. Hassa stitute of Statistical Studies & Research Cairo Uiversity Abstract Recetly a ew distributio called geeralized expoetial or expoetiated

More information

1 Review of Probability & Statistics

1 Review of Probability & Statistics 1 Review of Probability & Statistics a. I a group of 000 people, it has bee reported that there are: 61 smokers 670 over 5 960 people who imbibe (drik alcohol) 86 smokers who imbibe 90 imbibers over 5

More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

Double Stage Shrinkage Estimator of Two Parameters. Generalized Exponential Distribution

Double Stage Shrinkage Estimator of Two Parameters. Generalized Exponential Distribution Iteratioal Mathematical Forum, Vol., 3, o. 3, 3-53 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/.9/imf.3.335 Double Stage Shrikage Estimator of Two Parameters Geeralized Expoetial Distributio Alaa M.

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

Introducing Sample Proportions

Introducing Sample Proportions Itroducig Sample Proportios Probability ad statistics Studet Activity TI-Nspire Ivestigatio Studet 60 mi 7 8 9 10 11 12 Itroductio A 2010 survey of attitudes to climate chage, coducted i Australia by the

More information

Statistical Properties of OLS estimators

Statistical Properties of OLS estimators 1 Statistical Properties of OLS estimators Liear Model: Y i = β 0 + β 1 X i + u i OLS estimators: β 0 = Y β 1X β 1 = Best Liear Ubiased Estimator (BLUE) Liear Estimator: β 0 ad β 1 are liear fuctio of

More information

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a

More information

DS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10

DS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10 DS 00: Priciples ad Techiques of Data Sciece Date: April 3, 208 Name: Hypothesis Testig Discussio #0. Defie these terms below as they relate to hypothesis testig. a) Data Geeratio Model: Solutio: A set

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates Iteratioal Joural of Scieces: Basic ad Applied Research (IJSBAR) ISSN 2307-4531 (Prit & Olie) http://gssrr.org/idex.php?joural=jouralofbasicadapplied ---------------------------------------------------------------------------------------------------------------------------

More information

4. Hypothesis testing (Hotelling s T 2 -statistic)

4. Hypothesis testing (Hotelling s T 2 -statistic) 4. Hypothesis testig (Hotellig s T -statistic) Cosider the test of hypothesis H 0 : = 0 H A = 6= 0 4. The Uio-Itersectio Priciple W accept the hypothesis H 0 as valid if ad oly if H 0 (a) : a T = a T 0

More information

Simulation. Two Rule For Inverting A Distribution Function

Simulation. Two Rule For Inverting A Distribution Function Simulatio Two Rule For Ivertig A Distributio Fuctio Rule 1. If F(x) = u is costat o a iterval [x 1, x 2 ), the the uiform value u is mapped oto x 2 through the iversio process. Rule 2. If there is a jump

More information

A proposed discrete distribution for the statistical modeling of

A proposed discrete distribution for the statistical modeling of It. Statistical Ist.: Proc. 58th World Statistical Cogress, 0, Dubli (Sessio CPS047) p.5059 A proposed discrete distributio for the statistical modelig of Likert data Kidd, Marti Cetre for Statistical

More information

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random Part III. Areal Data Aalysis 0. Comparative Tests amog Spatial Regressio Models While the otio of relative likelihood values for differet models is somewhat difficult to iterpret directly (as metioed above),

More information

Binomial Distribution

Binomial Distribution 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0 1 2 3 4 5 6 7 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Overview Example: coi tossed three times Defiitio Formula Recall that a r.v. is discrete if there are either a fiite umber of possible

More information

Modified Ratio Estimators Using Known Median and Co-Efficent of Kurtosis

Modified Ratio Estimators Using Known Median and Co-Efficent of Kurtosis America Joural of Mathematics ad Statistics 01, (4): 95-100 DOI: 10.593/j.ajms.01004.05 Modified Ratio s Usig Kow Media ad Co-Efficet of Kurtosis J.Subramai *, G.Kumarapadiya Departmet of Statistics, Podicherry

More information

CHAPTER 10 INFINITE SEQUENCES AND SERIES

CHAPTER 10 INFINITE SEQUENCES AND SERIES CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece

More information

Bayesian Control Charts for the Two-parameter Exponential Distribution

Bayesian Control Charts for the Two-parameter Exponential Distribution Bayesia Cotrol Charts for the Two-parameter Expoetial Distributio R. va Zyl, A.J. va der Merwe 2 Quitiles Iteratioal, ruaavz@gmail.com 2 Uiversity of the Free State Abstract By usig data that are the mileages

More information

The Random Walk For Dummies

The Random Walk For Dummies The Radom Walk For Dummies Richard A Mote Abstract We look at the priciples goverig the oe-dimesioal discrete radom walk First we review five basic cocepts of probability theory The we cosider the Beroulli

More information

Improved Class of Ratio -Cum- Product Estimators of Finite Population Mean in two Phase Sampling

Improved Class of Ratio -Cum- Product Estimators of Finite Population Mean in two Phase Sampling Global Joural of Sciece Frotier Research: F Mathematics ad Decisio Scieces Volume 4 Issue 2 Versio.0 Year 204 Type : Double Blid Peer Reviewed Iteratioal Research Joural Publisher: Global Jourals Ic. (USA

More information

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara Poit Estimator Eco 325 Notes o Poit Estimator ad Cofidece Iterval 1 By Hiro Kasahara Parameter, Estimator, ad Estimate The ormal probability desity fuctio is fully characterized by two costats: populatio

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

Computing Confidence Intervals for Sample Data

Computing Confidence Intervals for Sample Data Computig Cofidece Itervals for Sample Data Topics Use of Statistics Sources of errors Accuracy, precisio, resolutio A mathematical model of errors Cofidece itervals For meas For variaces For proportios

More information

MEASURES OF DISPERSION (VARIABILITY)

MEASURES OF DISPERSION (VARIABILITY) POLI 300 Hadout #7 N. R. Miller MEASURES OF DISPERSION (VARIABILITY) While measures of cetral tedecy idicate what value of a variable is (i oe sese or other, e.g., mode, media, mea), average or cetral

More information

CS284A: Representations and Algorithms in Molecular Biology

CS284A: Representations and Algorithms in Molecular Biology CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Confidence Interval for Standard Deviation of Normal Distribution with Known Coefficients of Variation

Confidence Interval for Standard Deviation of Normal Distribution with Known Coefficients of Variation Cofidece Iterval for tadard Deviatio of Normal Distributio with Kow Coefficiets of Variatio uparat Niwitpog Departmet of Applied tatistics, Faculty of Applied ciece Kig Mogkut s Uiversity of Techology

More information

x = Pr ( X (n) βx ) =

x = Pr ( X (n) βx ) = Exercise 93 / page 45 The desity of a variable X i i 1 is fx α α a For α kow let say equal to α α > fx α α x α Pr X i x < x < Usig a Pivotal Quatity: x α 1 < x < α > x α 1 ad We solve i a similar way as

More information

Finally, we show how to determine the moments of an impulse response based on the example of the dispersion model.

Finally, we show how to determine the moments of an impulse response based on the example of the dispersion model. 5.3 Determiatio of Momets Fially, we show how to determie the momets of a impulse respose based o the example of the dispersio model. For the dispersio model we have that E θ (θ ) curve is give by eq (4).

More information

Singular Continuous Measures by Michael Pejic 5/14/10

Singular Continuous Measures by Michael Pejic 5/14/10 Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable

More information

Chapter 9: Numerical Differentiation

Chapter 9: Numerical Differentiation 178 Chapter 9: Numerical Differetiatio Numerical Differetiatio Formulatio of equatios for physical problems ofte ivolve derivatives (rate-of-chage quatities, such as velocity ad acceleratio). Numerical

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

TEACHER CERTIFICATION STUDY GUIDE

TEACHER CERTIFICATION STUDY GUIDE COMPETENCY 1. ALGEBRA SKILL 1.1 1.1a. ALGEBRAIC STRUCTURES Kow why the real ad complex umbers are each a field, ad that particular rigs are ot fields (e.g., itegers, polyomial rigs, matrix rigs) Algebra

More information

Estimation of Gumbel Parameters under Ranked Set Sampling

Estimation of Gumbel Parameters under Ranked Set Sampling Joural of Moder Applied Statistical Methods Volume 13 Issue 2 Article 11-2014 Estimatio of Gumbel Parameters uder Raked Set Samplig Omar M. Yousef Al Balqa' Applied Uiversity, Zarqa, Jorda, abuyaza_o@yahoo.com

More information

Expectation and Variance of a random variable

Expectation and Variance of a random variable Chapter 11 Expectatio ad Variace of a radom variable The aim of this lecture is to defie ad itroduce mathematical Expectatio ad variace of a fuctio of discrete & cotiuous radom variables ad the distributio

More information