RJ (90521) May 29, 1996 (Revised 3/20/98)
Computer Science Research Report

ESTIMATING THE NUMBER OF CLASSES IN A FINITE POPULATION

Peter J. Haas
IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA

Lynne Stokes
Department of Management Science and Information Systems, University of Texas, Austin, TX

LIMITED DISTRIBUTION NOTICE

This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

IBM Research Division: Yorktown Heights, New York; San Jose, California; Zurich, Switzerland


ABSTRACT: We use an extension of the generalized jackknife approach of Gray and Schucany to obtain new nonparametric estimators for the number of classes in a finite population of known size. We also show that generalized jackknife estimators are closely related to certain Horvitz-Thompson estimators, to an estimator of Shlosser, and to estimators based on sample coverage. In particular, the generalized jackknife approach leads to a modification of Shlosser's estimator that does not suffer from the erratic behavior of the original estimator. The performance of both new and previous estimators is investigated by means of an asymptotic variance analysis and a Monte Carlo simulation study.

Keywords: jackknife, sample coverage, number of species, number of classes, database, census


1. Introduction

The problem of estimating the number of classes in a population has been studied for many years. A recent review article (Bunge and Fitzpatrick 1993) lists more than 125 references. In this article, we consider an important special case of the general problem: estimating the number of classes in a finite population of known size. Only a handful of papers have addressed this problem and none has reached an entirely satisfactory solution, despite the fact that the first attempt at solution appeared in the statistical literature nearly 50 years ago (Mosteller 1949).

The problem we consider has arisen in the literature in a variety of applications, including the following.

(i) In a company-sponsored contest, many entries (say several hundred thousand) have been received. It is known that some people have entered more than once. The goal is to estimate the number of different people who have entered from a sample of entries (Mosteller 1949; Sudman 1976).

(ii) A sampling frame is constructed by combining a number of lists that may contain overlapping entries. It is desired to estimate, using a sample from all lists, the number of units on the combined list (Deming and Glasser 1959; Goodman 1952; Kish 1965, Sec. 11.2; Sudman 1976, Sec. 3.6). An important example of such a problem is an "administrative records census," currently under study by the U.S. Bureau of the Census. In such a census, several administrative files (such as AFDC or IRS records) are combined, and the total number of distinct individuals included in the combined file is determined. Exact computation of the number of distinct individuals in the combined file is extremely expensive because of the high cost of determining the number of duplicated entries. A similar problem and proposed solution was discussed in the London Financial Times (March 2, 1949) by C. F. Carter, who was interested in estimating the number of different investors in British industrial stocks based on samples from share registers of companies (Mosteller 1949).
(iii) In a relational database system, data are organized in tables called relations (see, e.g., Korth and Silberschatz 1991, Chap. 3). In a typical relation, each row might represent a record for an individual employee in a company, and each column might correspond to a different attribute of the employee, such as salary, years of experience, department number, and so forth. A relational query specifies an output relation that is to be computed from the set of base relations stored by the system. Knowledge of the number of distinct values for each attribute in the base relations is central to determining the most efficient method for computing a specified output relation (Hellerstein and Stonebraker 1994; Selinger, Astrahan, Chamberlin, Lorie, and Price 1979). The size of the base relations in modern database systems often is so large that exact computation of the distinct-value parameters is prohibitively expensive, and thus estimation of these parameters is desired (Astrahan, Schkolnick, and Whang 1987; Flajolet and Martin 1985; Gelenbe and Gardy 1982; Hou, Ozsoyoglu, and Taneja 1988, 1989; Naughton and Seshadri 1990; Ozsoyoglu, Du, Tjahjana, Hou, and Rowland 1991; Whang, Vander-Zanden, and Taylor 1990).

In each of these applications, the size of the population (number of contest entries, total number of units over all lists, and number of rows in the base relation) is known, and this size is too large for easy computation of the number of classes.

The problem studied in this article can be described formally as follows. A population of size $N$ consists of $D$ mutually disjoint classes of items, labelled $C_1, C_2, \ldots, C_D$. Define $N_j$ to be the size of class $C_j$, so that $N = \sum_{j=1}^{D} N_j$. A simple random sample of $n$ items is selected (without replacement) from the population. This sample includes $n_j$ items from class $C_j$. The problem we consider is that of estimating $D$ using information from the sample along with knowledge of the value of $N$. We denote by $F_i$ the number of classes of size $i$ in the population, so that $D = \sum_{i=1}^{N} F_i$. Similarly, we denote by $f_i$ the number of classes represented exactly $i$ times in the sample and by $d$ the total number of classes represented in the sample. Thus $d = \sum_{i=1}^{n} f_i$ and $\sum_{i=1}^{n} i f_i = n$. Define vectors $\mathbf{N} = (N_1, N_2, \ldots, N_D)$, $\mathbf{n} = (n_1, n_2, \ldots, n_D)$, and $\mathbf{f} = (f_1, f_2, \ldots, f_n)$. Note that $\mathbf{n}$ is not observable, but $\mathbf{f}$ is. Because we sample without replacement, the random vector $\mathbf{n}$ has a multivariate hypergeometric distribution with probability mass function

$$P(\mathbf{n} \mid D, \mathbf{N}) = \frac{\binom{N_1}{n_1}\binom{N_2}{n_2}\cdots\binom{N_D}{n_D}}{\binom{N}{n}}. \tag{1}$$

The probability mass function of the observable random vector $\mathbf{f}$ is simply $P(\mathbf{n} \mid D, \mathbf{N})$ summed over all points $\mathbf{n}$ that correspond to $\mathbf{f}$:

$$P(\mathbf{f} \mid D, \mathbf{N}) = \sum_{\mathbf{n} \in S} P(\mathbf{n} \mid D, \mathbf{N}),$$

where $S = \{\, \mathbf{n} : \#(n_j = i) = f_i \text{ for } 1 \le i \le n \,\}$. The probability mass function $P(\mathbf{f} \mid D, \mathbf{N})$ does not have a closed-form expression in general.

In Section 2 we review the estimators that have been proposed for estimating $D$ from data generated under model (1). In Section 3 we provide several new estimators of $D$ based on an extension of the generalized jackknife approach of Gray and Schucany (1972).
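The sampling model can be made concrete with a small simulation. The following is a minimal sketch, not part of the report: the function name `frequency_counts`, the seed, and the example population (50 classes of size 10) are all hypothetical choices for illustration. It draws a simple random sample without replacement and tabulates the observable counts $f_i$.

```python
import random
from collections import Counter

def frequency_counts(class_sizes, n, seed=0):
    """Draw n items without replacement; return {i: f_i} (hypothetical helper)."""
    # Label every item in the population by the index j of its class C_j.
    population = [j for j, size in enumerate(class_sizes) for _ in range(size)]
    rng = random.Random(seed)
    sample = rng.sample(population, n)   # simple random sample, no replacement
    n_j = Counter(sample)                # n_j per class (not observable in practice)
    return dict(Counter(n_j.values()))   # f_i: classes seen exactly i times

class_sizes = [10] * 50                  # assumed example: D = 50, N = 500
f = frequency_counts(class_sizes, n=100)
d = sum(f.values())                      # number of classes represented in the sample
n = sum(i * c for i, c in f.items())     # identity: sum_i i * f_i = n
```

Only `f` (and hence `d` and `n`) would be available to an estimator; the per-class counts `n_j` are used here only to build `f`.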
We then show that generalized jackknife estimators of the number of classes in a population are closely related to certain "Horvitz-Thompson" estimators, to an estimator due to Shlosser (1981), and to estimators based on the notion of "sample coverage" (Chao and Lee 1992). In Section 4 we provide and compare approximate expressions for the asymptotic variance of several of the estimators, and in Section 5 apply our formulas to a well-known example from the literature. We provide a simulation-based empirical comparison of the various estimators in Section 6, and summarize our results and give recommendations in Section 7.

2. Previous Estimators

Bunge and Fitzpatrick (1993) mention only two non-Bayesian estimators that have been developed as estimators of $D$ under model (1). These are the estimators of Goodman (1949) and Shlosser (1981). Goodman proved that

$$\hat{D}_{\mathrm{Good1}} = d + \sum_{i=1}^{n} (-1)^{i+1} \frac{(N-n+i-1)!\,(n-i)!}{(N-n-1)!\,n!}\, f_i$$

is the unique unbiased estimator of $D$ when $n > M \stackrel{\mathrm{def}}{=} \max(N_1, N_2, \ldots, N_D)$. He further proved that no unbiased estimator of $D$ exists when $n \le M$. Unfortunately, unless the sampling fraction is quite large, the variance of $\hat{D}_{\mathrm{Good1}}$ is so great and the numerical difficulties encountered when computing $\hat{D}_{\mathrm{Good1}}$ are so severe that the estimator is unusable. Goodman, who made note of the high variance of $\hat{D}_{\mathrm{Good1}}$ himself, suggested the alternative estimator

$$\hat{D}_{\mathrm{Good2}} = N - \frac{N(N-1)}{n(n-1)}\, f_2$$

for overcoming the variance problem. Although $\hat{D}_{\mathrm{Good2}}$ has lower variance than $\hat{D}_{\mathrm{Good1}}$, it can take on negative values and can have a large bias for any $n$ if $D$ is small. For example, consider the case in which $D = 1$ and $n > 2$, and observe that $f_2 = 0$ and $\hat{D}_{\mathrm{Good2}} = N$.

Under the assumption that the population size $N$ is large and the sampling fraction $q = n/N$ is nonnegligible, Shlosser (1981) derived the estimator

$$\hat{D}_{\mathrm{Sh}} = d + f_1\, \frac{\sum_{i=1}^{n} (1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i}.$$

For the two examples considered in his paper, Shlosser found that use of $\hat{D}_{\mathrm{Sh}}$ with a 10% sampling fraction resulted in an error rate below 20%. In our experiments, however, we observed root mean squared errors (rmse's) exceeding 200%, even for well-behaved populations with relatively little variation among the class sizes (see Sec. 6). Considering the relationship between $\hat{D}_{\mathrm{Sh}}$ and generalized jackknife estimators (see Sec. 3.4) provides insight into the source of this erratic behavior and suggests some possible modifications of $\hat{D}_{\mathrm{Sh}}$ to improve performance.

In related work, Burnham and Overton (1978, 1979) proposed a family of (traditional) generalized jackknife estimators for estimating the size of a closed population when capture probabilities vary among animals.
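As an illustration of the two simpler formulas above, the following sketch evaluates $\hat{D}_{\mathrm{Good2}}$ and $\hat{D}_{\mathrm{Sh}}$ from a dictionary of frequency counts. The function names and the example counts are assumptions made for this sketch; the formulas are the ones displayed above.

```python
def goodman2(f, N):
    """Goodman's second estimator: N - N(N-1)/(n(n-1)) * f_2; f maps i -> f_i."""
    n = sum(i * c for i, c in f.items())
    return N - N * (N - 1) / (n * (n - 1)) * f.get(2, 0)

def shlosser(f, N):
    """Shlosser's estimator: d + f_1 * sum (1-q)^i f_i / sum i q (1-q)^(i-1) f_i."""
    n = sum(i * c for i, c in f.items())
    d = sum(f.values())
    q = n / N
    num = sum((1 - q) ** i * c for i, c in f.items())
    den = sum(i * q * (1 - q) ** (i - 1) * c for i, c in f.items())
    return d + f.get(1, 0) * num / den

# The single-class case from the text: D = 1, so one class is seen n = 5 times.
good2_single_class = goodman2({5: 1}, N=100)

# Hypothetical sample: 40 classes seen once, 30 seen twice, from N = 1000.
sh_example = shlosser({1: 40, 2: 30}, N=1000)
```

In the single-class case $f_2 = 0$, so the sketch returns $N$ rather than 1, which is the bias the text describes.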
The $D$ individuals in the population play the role of our $D$ classes; a given individual can appear up to $n$ times in the overall sample if captured on one or more of $n$ possible trapping occasions. The capture probability for an individual is assumed to be constant over time, and the capture probabilities for the $D$ individuals are modeled as $D$ iid random samples from a fixed probability distribution. Burnham and Overton's sample design is clearly different from model (1). Under the Burnham and Overton model, for example, the quantities $f_1, f_2, \ldots, f_n$ have a joint multinomial distribution.

Closely related to the work of Burnham and Overton are the ordinary jackknife estimators of the number of species in a closed region developed by Heltshe and Forrester (1983) and Smith and van Belle (1984). The sample data consist of a list of the species that appear in each of $n$ quadrats. (The number of times that a species is represented in a quadrat is not recorded.) This setup is essentially identical to that of Burnham and Overton, with the $D$ species playing the role of the $D$ individuals and the quadrats playing the role of the trapping occasions.

3. Generalized Jackknife Estimators

In this section we outline an extension of the generalized jackknife approach to bias reduction and then use this approach to derive new estimators for the number of classes in a finite population. We also point out connections between our generalized jackknife approach and several other estimation approaches in the literature.

3.1. The Generalized Jackknife Approach

Let $\theta$ be an unknown real-valued parameter. A generalized jackknife estimator of $\theta$ is an estimator of the form

$$G(\hat\theta_1, \hat\theta_2) = \frac{\hat\theta_1 - R\,\hat\theta_2}{1 - R}, \tag{2}$$

where $\hat\theta_1$ and $\hat\theta_2$ are biased estimators of $\theta$ and $R$ ($\ne 1$) is a real number (Gray and Schucany 1972). The idea underlying the generalized jackknife approach is to try to choose $R$ such that $G(\hat\theta_1, \hat\theta_2)$ has lower bias than either $\hat\theta_1$ or $\hat\theta_2$. To motivate the choice of $R$, observe that for

$$R = \frac{E[\hat\theta_1] - \theta}{E[\hat\theta_2] - \theta}, \tag{3}$$

the estimator $G(\hat\theta_1, \hat\theta_2)$ is unbiased for $\theta$. This optimal value of $R$ is typically unknown, however, and can only be approximated, resulting in bias reduction but not complete bias elimination.

In the following, we extend the original definition of the generalized jackknife given by Gray and Schucany (1972) by allowing $R$ to depend on the data; that is, we allow $R$ to be random. Recall that $d$ is the number of classes represented in the sample.
Write $d_n$ for $d$ to emphasize the dependence of $d$ on the sample size, and denote by $d_{n-1}(k)$ the number of classes represented in the sample after the $k$th observation has been removed. Set

$$d_{(n-1)} = \frac{1}{n} \sum_{k=1}^{n} d_{n-1}(k).$$

We focus on generalized jackknife estimators that are obtained by taking $\hat\theta_1 = d_n$ and $\hat\theta_2 = d_{(n-1)}$ in (2); these are the usual choices for $\hat\theta_1$ and $\hat\theta_2$ in the classical first-order jackknife estimator (Miller 1974). Observe that $d_{n-1}(k) = d_n - 1$ if the class for the $k$th observation is represented only once in the sample; otherwise, $d_{n-1}(k) = d_n$. Thus $d_{(n-1)} = d_n - (f_1/n)$ and, by (2), $G(\hat\theta_1, \hat\theta_2) = \hat{D}$, where

$$\hat{D} = d_n + K\,\frac{f_1}{n} \tag{4}$$

and $K = R/(1-R)$. It follows from (3) that the optimal choice of $K$ is

$$K = \frac{E[d_n] - D}{E[d_{(n-1)}] - E[d_n]} = \frac{D - E[d_n]}{E[f_1]/n}. \tag{5}$$

To derive a more explicit formula for $K$, denote by $I[A]$ the indicator of event $A$ and observe that

$$E[d_n] = E\Bigl[\sum_{j=1}^{D} I[n_j > 0]\Bigr] = \sum_{j=1}^{D} P\{n_j > 0\} = D - \sum_{j=1}^{D} P\{n_j = 0\}.$$

Similar reasoning shows that

$$E[f_1] = \sum_{j=1}^{D} P\{n_j = 1\}, \tag{6}$$

so that

$$K = \frac{n \sum_{j=1}^{D} P\{n_j = 0\}}{\sum_{j=1}^{D} P\{n_j = 1\}}. \tag{7}$$

Following Shlosser (1981), we focus on the case in which the population size $N$ is large and the sampling fraction $q = n/N$ is nonnegligible, and we make the approximation

$$P\{n_j = k\} \approx \binom{N_j}{k} q^k (1-q)^{N_j - k} \tag{8}$$

for $0 \le k \le n$ and $1 \le j \le D$. That is, the probability distribution of each $n_j$ is approximated by the probability distribution of $n_j$ under a Bernoulli sample design in which each item is included in the sample with probability $q$, independently of all other items in the population. Use of this approximation leads to estimators that behave almost identically to estimators derived using the exact distribution of $\mathbf{n}$ but are simpler to compute and derive (see App. A for further discussion). Substituting (8) into (7), we obtain

$$K \approx \frac{n \sum_{j=1}^{D} (1-q)^{N_j}}{\sum_{j=1}^{D} N_j q (1-q)^{N_j - 1}}. \tag{9}$$
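The effect of the Bernoulli approximation (8) on $K$ can be checked numerically when the class sizes are known. The sketch below is illustrative only: the function names and the example population are assumptions. It evaluates (7) with exact hypergeometric marginals and compares the result with (9).

```python
from math import comb

def k_exact(class_sizes, n):
    """K from (7), using exact hypergeometric marginals for each n_j."""
    N = sum(class_sizes)
    total = comb(N, n)
    p0 = sum(comb(N - Nj, n) / total for Nj in class_sizes)           # P{n_j = 0}
    p1 = sum(Nj * comb(N - Nj, n - 1) / total for Nj in class_sizes)  # P{n_j = 1}
    return n * p0 / p1

def k_bernoulli(class_sizes, n):
    """K from (9), using the Bernoulli approximation (8) with q = n/N."""
    N = sum(class_sizes)
    q = n / N
    num = sum((1 - q) ** Nj for Nj in class_sizes)
    den = sum(Nj * q * (1 - q) ** (Nj - 1) for Nj in class_sizes)
    return n * num / den

sizes = [10] * 100          # assumed example: D = 100 equal classes, N = 1000
k7 = k_exact(sizes, n=100)
k9 = k_bernoulli(sizes, n=100)
```

For equal class sizes, (9) reduces to $D(1-q)$, so in this example the two values can be compared directly against $100 \times 0.9 = 90$.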

The quantity $K$ defined in (9) depends on unknown parameters $N_1, N_2, \ldots, N_D$ that are difficult to estimate. Our approach is to approximate $K$ by a function of $D$ and of other parameters that are easier to estimate, thereby obtaining an approximate version of (4). The estimates for these parameters, including $\hat{D}$ for $D$, are then substituted into the approximate version of (4) and the resulting equation is solved for $\hat{D}$.

We also consider "smoothed" jackknife estimators. The idea is to replace the quantity $f_1/n$ in (4) by its expected value $E[f_1]/n$ in the hope that the resulting estimator of $D$ will be more stable than the original "unsmoothed" estimator. As with the parameter $K$, the quantity $E[f_1]/n$ depends on the unknown parameters $N_1, N_2, \ldots, N_D$; see (6) and (8). Thus our approach to estimating $E[f_1]/n$ is the same as our approach to estimating $K$.

Estimators also can be based on high-order jackknifing schemes that consider the number of distinct values in the sample when two elements are removed, when three elements are removed, and so forth. Typically, using a high-order jackknifing scheme requires estimating high-order moments (skewness, kurtosis, and so forth) of the set of numbers $\{N_1, N_2, \ldots, N_D\}$. Initial experiments indicated that the reduction in estimation error due to using the high-order jackknife is outweighed by the increase in error due to uncertainty in the moment estimates. Thus we do not pursue high-order jackknife schemes further.

3.2. The Estimators

Different approximations for $K$ and $E[f_1]/n$ lead to different estimators for $D$. Here we develop a number of the possible estimators.

3.2.1. First-Order Estimators

The simplest estimators of $D$ can be derived using a first-order approximation to $K$. Specifically, approximate each $N_j$ in (9) by the average value

$$\bar{N} = \frac{1}{D} \sum_{j=1}^{D} N_j = \frac{N}{D}$$

and substitute the resulting expression for $K$ into (4) to obtain

$$\hat{D} = d_n + \frac{(1-q) f_1 D}{n}. \tag{10}$$

Now substitute $\hat{D}$ for $D$ on the right side of (10) and solve for $\hat{D}$.
The resulting solution, denoted by $\hat{D}_{\mathrm{uj1}}$, is given by

$$\hat{D}_{\mathrm{uj1}} = \Bigl(1 - \frac{(1-q) f_1}{n}\Bigr)^{-1} d_n. \tag{11}$$

We refer to this estimator as the "unsmoothed first-order jackknife estimator."
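Equation (11) is simple enough to evaluate directly. The following is a minimal sketch; the function name `uj1` and the example counts are assumptions made here for illustration.

```python
def uj1(f, N):
    """Unsmoothed first-order jackknife estimator (11); f maps i -> f_i."""
    n = sum(i * c for i, c in f.items())   # sample size
    d = sum(f.values())                    # classes represented in the sample
    q = n / N                              # sampling fraction
    return d / (1 - (1 - q) * f.get(1, 0) / n)

# Hypothetical sample from N = 1000: n = 100, d = 70, f_1 = 40, q = 0.1,
# so the estimate is 70 / (1 - 0.9 * 0.4).
estimate = uj1({1: 40, 2: 30}, N=1000)
```

When $n = N$ the factor $(1-q)$ vanishes and the sketch returns $d$, in line with the consistency property discussed in the text.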

To derive a "smoothed first-order jackknife estimator," observe that by (6) and (8),

$$\frac{E[f_1]}{n} \approx \frac{1}{n} \sum_{j=1}^{D} N_j q (1-q)^{N_j - 1}. \tag{12}$$

Approximating each $N_j$ in (12) by $\bar{N}$, we have

$$\frac{E[f_1]}{n} \approx (1-q)^{\bar{N} - 1}. \tag{13}$$

On the right side of (10), replace $f_1/n$ with the approximate expression for $E[f_1]/n$ given in (13), yielding

$$\hat{D} = d_n + D (1-q)^{\bar{N}}.$$

Replacing $D$ with $\hat{D}$ and $\bar{N}$ with $N/\hat{D}$ in the foregoing expression leads to the relation

$$\hat{D} \bigl(1 - (1-q)^{N/\hat{D}}\bigr) = d_n.$$

We define the smoothed first-order jackknife estimator $\hat{D}_{\mathrm{sj1}}$ as the value of $\hat{D}$ that solves this equation. Given $d_n$, $n$, and $N$, $\hat{D}_{\mathrm{sj1}}$ can be computed numerically using standard root-finding procedures. Observe that if in fact $N_1 = N_2 = \cdots = N_D = N/D$, then

$$E[d_n] \approx D \bigl(1 - (1-q)^{N/D}\bigr).$$

In this case $\hat{D}_{\mathrm{sj1}}$ can be viewed as a simple method-of-moments estimator obtained by replacing $E[d_n]$ with the estimate $d_n$ and solving for $D$. If, moreover, the sampling fraction $q$ is small enough so that the distribution of $(n_1, n_2, \ldots, n_D)$ is approximately multinomial (see Sec. 3.3), then $\hat{D}_{\mathrm{sj1}}$ is approximately equal to the maximum likelihood estimator for $D$ (see Good 1950). Observe that both $\hat{D}_{\mathrm{uj1}}$ and $\hat{D}_{\mathrm{sj1}}$ are consistent for $D$: $\hat{D}_{\mathrm{uj1}} \to D$ and $\hat{D}_{\mathrm{sj1}} \to D$ as $q \to 1$.

3.2.2. Second-Order Estimators

A second-order approximation to $K$ can be derived as follows. Denote by $\gamma^2$ the squared coefficient of variation of the class sizes $N_1, N_2, \ldots, N_D$:

$$\gamma^2 = \frac{(1/D) \sum_{j=1}^{D} (N_j - \bar{N})^2}{\bar{N}^2}. \tag{14}$$

Suppose that $\gamma^2$ is relatively small, so that each $N_j$ is close to the average value $\bar{N}$. Substitute the Taylor approximations

$$(1-q)^{N_j} \approx (1-q)^{\bar{N}} + (1-q)^{\bar{N}} \ln(1-q)\,(N_j - \bar{N})$$

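and

$$N_j q (1-q)^{N_j - 1} \approx N_j q \Bigl( (1-q)^{\bar{N} - 1} + (1-q)^{\bar{N} - 1} \ln(1-q)\,(N_j - \bar{N}) \Bigr)$$

for $1 \le j \le D$ into (9) to obtain

$$K \approx D(1-q)\, \frac{1}{1 + \ln(1-q)\,\bar{N}\gamma^2} \approx D(1-q) \bigl(1 - \ln(1-q)\,\bar{N}\gamma^2\bigr). \tag{15}$$

The unknown parameter $\gamma^2$ can be estimated using the following approach (cf. Chao and Lee 1992). With the usual convention that $\binom{n}{m} = 0$ for $n < m$, we find that

$$\sum_{i=1}^{N} i(i-1) E[f_i] \approx \sum_{i=1}^{N} \sum_{j=1}^{D} i(i-1) \binom{N_j}{i} q^i (1-q)^{N_j - i} = q^2 \sum_{j=1}^{D} N_j (N_j - 1) \sum_{i=2}^{N_j} \binom{N_j - 2}{i - 2} q^{i-2} (1-q)^{N_j - i} = q^2 \sum_{j=1}^{D} N_j (N_j - 1),$$

so that

$$\gamma^2 \approx \frac{D}{n^2} \sum_{i=1}^{N} i(i-1) E[f_i] + \frac{D}{N} - 1.$$

Thus if $D$ were known, then a natural method-of-moments estimator $\hat\gamma^2(D)$ of $\gamma^2$ would be

$$\hat\gamma^2(D) = \max\Bigl(0,\; \frac{D}{n^2} \sum_{i=1}^{n} i(i-1) f_i + \frac{D}{N} - 1\Bigr). \tag{16}$$

To develop a second-order estimate of $D$, substitute (15) into (4) to obtain

$$\hat{D} = d_n + \frac{D f_1 (1-q)}{n} \bigl(1 - \ln(1-q)\,\bar{N}\gamma^2\bigr), \tag{17}$$

from which it follows that

$$\hat{D} = d_n + \frac{D f_1 (1-q)}{n} - \frac{f_1 (1-q) \ln(1-q)\,\gamma^2}{q}.$$

Replacing $D$ with $\hat{D}$ on the right side of this equation and solving for $\hat{D}$ yields the relation

$$\Bigl(1 - \frac{f_1 (1-q)}{n}\Bigr) \hat{D} = d_n - \frac{f_1 (1-q) \ln(1-q)\,\gamma^2}{q}. \tag{18}$$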

An estimator of $D$ can be obtained by substituting $\hat\gamma^2(\hat{D})$ for $\gamma^2$ in (18) and solving for $\hat{D}$ numerically. Alternatively, we can start with a simple initial estimator of $D$ and then correct this estimator using (18). Following this latter approach, we use $\hat{D}_{\mathrm{uj1}}$ as our initial estimator and define

$$\hat{D}_{\mathrm{uj2}} = \Bigl(1 - \frac{f_1 (1-q)}{n}\Bigr)^{-1} \Bigl( d_n - \frac{f_1 (1-q) \ln(1-q)\,\hat\gamma^2(\hat{D}_{\mathrm{uj1}})}{q} \Bigr).$$

A smoothed second-order jackknife estimator can be obtained by replacing the expression $f_1/n$ in (17) with the approximation to $E[f_1]/n$ given in (13), leading to

$$\hat{D} = d_n + D (1-q)^{\bar{N}} \bigl(1 - \ln(1-q)\,\bar{N}\gamma^2\bigr).$$

Replacing $D$ with $\hat{D}$ and proceeding as before, we obtain the estimator

$$\hat{D}_{\mathrm{sj2}} = \bigl(1 - (1-q)^{\tilde{N}}\bigr)^{-1} \Bigl( d_n - (1-q)^{\tilde{N}} \ln(1-q)\, N\, \hat\gamma^2(\hat{D}_{\mathrm{uj1}}) \Bigr),$$

where $\tilde{N} = N/\hat{D}_{\mathrm{uj1}}$. As with the first-order estimators $\hat{D}_{\mathrm{uj1}}$ and $\hat{D}_{\mathrm{sj1}}$, the second-order estimators $\hat{D}_{\mathrm{uj2}}$ and $\hat{D}_{\mathrm{sj2}}$ are consistent for $D$.

3.2.3. Horvitz-Thompson Jackknife Estimators

In this section we discuss an alternative approach to estimation of $K$ based on a technique of Horvitz and Thompson. (See Sarndal, Swensson, and Wretman 1992 for a general discussion of Horvitz-Thompson estimators.) First, consider the general problem of estimating a parameter of the form $\theta(g) = \sum_{j=1}^{D} g(N_j)$, where $g$ is a specified function. Observe that because $P\{n_j > 0\} > 0$ for $1 \le j \le D$, we have $\theta(g) = E[X(g)]$, where

$$X(g) = \sum_{j=1}^{D} \frac{g(N_j)\, I(n_j > 0)}{P\{n_j > 0\}} = \sum_{\{j:\, n_j > 0\}} \frac{g(N_j)}{P\{n_j > 0\}}.$$

It follows from (8) that $P\{n_j > 0\} \approx 1 - (1-q)^{N_j}$, and the foregoing discussion suggests that we estimate $\theta(g)$ by

$$\hat\theta(g) = \sum_{\{j:\, n_j > 0\}} \frac{g(\hat{N}_j)}{1 - (1-q)^{\hat{N}_j}}, \tag{19}$$

where $\hat{N}_j$ is an estimator for $N_j$. The key point is that we need to estimate $N_j$ only when $n_j > 0$. To do this, observe that

$$E[n_j \mid n_j > 0] = \frac{E[n_j]}{P\{n_j > 0\}} \approx \frac{q N_j}{1 - (1-q)^{N_j}}.$$

Replacing $E[n_j \mid n_j > 0]$ with $n_j$ leads to the estimating equation

$$n_j = \frac{q N_j}{1 - (1-q)^{N_j}}, \tag{20}$$

and a method-of-moments estimator $\hat{N}_j$ can be defined as the value of $N_j$ that solves (20).

Now consider the problem of estimating $K$, and hence $D$. By (9), $K \approx \theta(f)/\theta(g)$, where $f(x) = (1-q)^x$ and $g(x) = x q (1-q)^{x-1}/n$. Thus a natural estimator of $K$ is given by $\hat\theta(f)/\hat\theta(g)$, leading to the final estimator

$$\hat{D}_{\mathrm{HTj}} = d_n + \frac{\hat\theta(f)}{\hat\theta(g)}\, \frac{f_1}{n}.$$

A smoothed variant of $\hat{D}_{\mathrm{HTj}}$ can be obtained by replacing $f_1/n$ with the Horvitz-Thompson estimator of $E[f_1]/n$, namely $\hat\theta(g)$. The resulting estimator, denoted by $\hat{D}_{\mathrm{HTsj}}$, is given by

$$\hat{D}_{\mathrm{HTsj}} = d_n + \hat\theta(f).$$

Finally, a hybrid estimator can be obtained using a first-order approximation for the numerator of $K$ and a Horvitz-Thompson estimator for the denominator. This leads to the estimator $\hat{D}_{\mathrm{hj}}$, defined as the solution $\hat{D}$ of the equation

$$\hat{D} \Bigl(1 - \frac{f_1 (1-q)^{N/\hat{D}}}{n\, \hat\theta(g)}\Bigr) = d_n.$$

If we replace $f_1/n$ with the Horvitz-Thompson estimator for $E[f_1]/n$ in the foregoing equation in order to obtain a smoothed variant of $\hat{D}_{\mathrm{hj}}$, then the resulting estimator coincides with $\hat{D}_{\mathrm{sj1}}$.

Because $D = \theta(u)$, where $u(x) \equiv 1$, it may appear that a "non-jackknife" Horvitz-Thompson estimator $\hat{D}_{\mathrm{HT}}$ can be defined by setting $\hat{D}_{\mathrm{HT}} = \hat\theta(u)$. It is straightforward to show, however, that $\hat{D}_{\mathrm{HT}} = \hat{D}_{\mathrm{HTsj}}$, so that $\hat{D}_{\mathrm{HT}}$ can in fact be viewed as a smoothed jackknife estimator.

Simulation experiments indicate that the behavior of the Horvitz-Thompson jackknife estimators $\hat{D}_{\mathrm{HTj}}$ and $\hat{D}_{\mathrm{HTsj}}$ is erratic (see App. D for detailed results). Overall, the poor performance of $\hat{D}_{\mathrm{HTj}}$ and $\hat{D}_{\mathrm{HTsj}}$ is caused by inaccurate estimation of $\hat\theta(f)$. The problem seems to be that when $N_j$ is small, the estimator $\hat{N}_j$ is unstable and yet typically has a large effect on the value of $\hat\theta(f)$ through the term $(1-q)^{\hat{N}_j} / \bigl(1 - (1-q)^{\hat{N}_j}\bigr)$. The estimator $\hat{D}_{\mathrm{hj}}$ uses a Taylor approximation in place of $\hat\theta(f)$ and hence has lower bias and rmse than the other two Horvitz-Thompson jackknife estimators.
However, other estimators perform better than $\hat{D}_{\mathrm{hj}}$, and we do not consider the estimators $\hat{D}_{\mathrm{HTj}}$, $\hat{D}_{\mathrm{HTsj}}$, and $\hat{D}_{\mathrm{hj}}$ further.
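To see where the instability of the Horvitz-Thompson estimators can come from, it may help to evaluate $\hat{D}_{\mathrm{HT}} = \hat\theta(u)$ on a small example. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and bisection is used to solve (20) only because the right side of (20) is increasing in $N_j$ and equals 1 at $N_j = 1$.

```python
def estimate_class_size(n_j, q, iters=200):
    """Solve the estimating equation (20), n_j = q*x / (1 - (1-q)^x), for x."""
    def h(x):
        return q * x / (1.0 - (1.0 - q) ** x)
    lo, hi = 1.0, n_j / q            # h(1) = 1 <= n_j and h(n_j / q) >= n_j
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if h(mid) < n_j:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def horvitz_thompson(f, N):
    """D_HT = theta-hat(u) from (19) with u(x) = 1; f maps i -> f_i."""
    n = sum(i * c for i, c in f.items())
    q = n / N
    total = 0.0
    for i, count in f.items():
        # Every class observed i times receives the same size estimate.
        size_hat = estimate_class_size(i, q)
        total += count / (1.0 - (1.0 - q) ** size_hat)
    return total

f = {1: 40, 2: 30}                   # hypothetical sample from N = 1000 (q = 0.1)
d_ht = horvitz_thompson(f, N=1000)
```

A class seen once gets a size estimate of 1, so each such class contributes about $1/q$ to the total. In this example the 40 singletons alone contribute roughly 400, which shows how strongly small estimated class sizes drive the result.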

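3.3. Relation to Estimators Based on Sample Coverage

The generalized jackknife approach for deriving an estimator of $D$ works for sample designs other than hypergeometric sampling. For example, the most thoroughly studied version of the number-of-classes problem is that in which the population is assumed to be infinite and $\mathbf{n}$ is assumed to have a multinomial distribution with parameter vector $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_D)$; that is,

$$P(\mathbf{n} \mid D, \boldsymbol{\pi}) = \binom{n}{n_1\, n_2\, \cdots\, n_D}\, \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_D^{n_D}. \tag{21}$$

When we proceed as in Section 3.1 to derive a generalized jackknife estimator under the model in (21), the estimator turns out to be nearly identical to the "coverage-based" estimator proposed by Chao and Lee (1992). To see this, start again with (4) and select $K$ as in (5). Because

$$E[d_n] - D = -\sum_{j=1}^{D} (1 - \pi_j)^n$$

under the model in (21), it follows that

$$K = \frac{\sum_{j=1}^{D} v_n(\pi_j)}{\sum_{j=1}^{D} \pi_j v_{n-1}(\pi_j)},$$

where $v_n(x) = (1-x)^n$. Set $\bar\pi = 1/D$ and use the Taylor approximations

$$v_n(\pi_j) \approx v_n(\bar\pi) + (\pi_j - \bar\pi)\, v_n'(\bar\pi)$$

and

$$\pi_j v_{n-1}(\pi_j) \approx \pi_j \bigl( v_{n-1}(\bar\pi) + (\pi_j - \bar\pi)\, v_{n-1}'(\bar\pi) \bigr)$$

in a manner analogous to the derivation in Section 3.2.2 to obtain

$$K \approx (D - 1) + (n - 1)\gamma^2, \tag{22}$$

where $\gamma^2 = -1 + D \sum_{j=1}^{D} \pi_j^2$ is the squared coefficient of variation of the numbers $\pi_1, \pi_2, \ldots, \pi_D$. Denote by $\hat{D}_{\mathrm{mult}}$ the estimator of $D$ under the multinomial model. Then, by (4),

$$\hat{D}_{\mathrm{mult}} = d_n + \bigl( (D - 1) + (n - 1)\gamma^2 \bigr) \frac{f_1}{n}. \tag{23}$$

Replace $D$ with $\hat{D}_{\mathrm{mult}}$ and $\gamma^2$ with an estimator $\tilde\gamma^2$ in (23) and solve for $\hat{D}_{\mathrm{mult}}$ to obtain

$$\hat{D}_{\mathrm{mult}} = \frac{d_n}{\hat{C}} + \frac{(1 - \hat{C})}{\hat{C}} \bigl( (n - 1)\tilde\gamma^2 - 1 \bigr),$$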

where $\hat{C} = 1 - (f_1/n)$. When the sample size $n$ is large, the estimator $\hat{D}_{\mathrm{mult}}$ is essentially the same as the estimator

$$\hat{D}_{\mathrm{CL}} = \frac{d_n}{\hat{C}} + \frac{n (1 - \hat{C})}{\hat{C}}\, \tilde\gamma^2$$

proposed by Chao and Lee (1992). The estimator $\hat{D}_{\mathrm{CL}}$ was developed from a different point of view, using the concept of sample coverage. The sample coverage for an infinite population is defined as $\sum_{j=1}^{D} \pi_j I[n_j > 0]$, and the quantity $\hat{C} = 1 - (f_1/n)$ is a standard estimator of the sample coverage. Conversely, when Chao and Lee's derivation is modified to account for hypergeometric sampling, the resulting estimator is equal to $\hat{D}_{\mathrm{uj2}}$ (see App. B). Thus at least some estimators based on sample coverage can be viewed as generalized jackknife estimators.

3.4. Relation to Shlosser's Estimator

Observe that the estimator $\hat{D}_{\mathrm{Sh}}$, though not developed from a jackknife perspective, can be viewed as an estimator of the form (4) with $K$ estimated by

$$\hat{K}_{\mathrm{Sh}} = \frac{n \sum_{i=1}^{n} (1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i}.$$

To analyze the behavior of $\hat{D}_{\mathrm{Sh}}$, we first rewrite the jackknife quantity $K$ defined in (9) as follows:

$$K = \frac{n \sum_{i=1}^{N} (1-q)^i F_i}{\sum_{i=1}^{N} i q (1-q)^{i-1} F_i}. \tag{24}$$

Shlosser's justification of $\hat{D}_{\mathrm{Sh}}$ assumes that

$$\frac{E[f_i]}{E[f_1]} \approx \frac{F_i}{F_1} \tag{25}$$

for $1 \le i \le N$. When the assumption in (25) holds and the sample size is large enough so that

$$f_i \approx E[f_i] \tag{26}$$

for $1 \le i \le N$, we have

$$\hat{K}_{\mathrm{Sh}} \approx \frac{n \sum_{i=1}^{N} (1-q)^i E[f_i]}{\sum_{i=1}^{N} i q (1-q)^{i-1} E[f_i]} = \frac{n \sum_{i=1}^{N} (1-q)^i \bigl(E[f_i]/E[f_1]\bigr)}{\sum_{i=1}^{N} i q (1-q)^{i-1} \bigl(E[f_i]/E[f_1]\bigr)} \approx \frac{n F_1^{-1} \sum_{i=1}^{N} (1-q)^i F_i}{F_1^{-1} \sum_{i=1}^{N} i q (1-q)^{i-1} F_i} = K,$$

so that $\hat{D}_{\mathrm{Sh}}$ behaves as a generalized jackknife estimator. Although the relations in (25) and (26) hold exactly for $n = N$ (implying that $\hat{D}_{\mathrm{Sh}}$ is consistent for $D$), these relations can fail drastically for smaller sample sizes. For example, when $F_1 = 0$ and $F_i > 0$ for some $i > 1$, the right side of (25) is infinite, whereas the left side is finite for $n$ sufficiently small. This observation leads one to expect that $\hat{D}_{\mathrm{Sh}}$ will not perform well when the sample size is relatively small and $N_1, N_2, \ldots, N_D$ have similar values (with $N_j > 1$ for each $j$). Both the variance analysis in Section 4 and the simulation experiments described in Section 6 bear out this conjecture.

The foregoing discussion suggests that replacing $\hat{K}_{\mathrm{Sh}}$ with

$$\hat{K}'_{\mathrm{Sh}} = K\, \frac{\hat{K}_{\mathrm{Sh}}}{E[\hat{K}_{\mathrm{Sh}}]} \tag{27}$$

in the formula for $\hat{D}_{\mathrm{Sh}}$ might result in an improved estimator, because $\hat{K}'_{\mathrm{Sh}}$ is unbiased for $K$. Of course we cannot perform this replacement exactly, since $K$ and $E[\hat{K}_{\mathrm{Sh}}]$ are unknown, but we can approximate $\hat{K}'_{\mathrm{Sh}}$ as follows. Using the fact that

$$E[f_r] = \sum_{j=1}^{D} P\{n_j = r\} \approx \sum_{j=1}^{D} \binom{N_j}{r} q^r (1-q)^{N_j - r} = \sum_{i=r}^{N} \binom{i}{r} q^r (1-q)^{i-r} F_i \tag{28}$$

for $1 \le r \le n$, we have, to first order,

$$E[\hat{K}_{\mathrm{Sh}}] \approx \frac{n \sum_{i=1}^{N} (1-q)^i E[f_i]}{\sum_{i=1}^{N} i q (1-q)^{i-1} E[f_i]} = \frac{n \sum_{i=1}^{N} (1-q)^i \bigl( (1+q)^i - 1 \bigr) F_i}{\sum_{i=1}^{N} i q^2 (1-q^2)^{i-1} F_i}. \tag{29}$$

Using the first-order approximation $N_1 = N_2 = \cdots = N_D = \bar{N}$ together with (24), (27), and (29), we find that

$$\hat{K}'_{\mathrm{Sh}} \approx \Bigl( \frac{q (1+q)^{\bar{N} - 1}}{(1+q)^{\bar{N}} - 1} \Bigr) \hat{K}_{\mathrm{Sh}}.$$

We thus obtain a modified Shlosser estimator given by

$$\hat{D}_{\mathrm{Sh2}} = d_n + f_1 \Bigl( \frac{q (1+q)^{\tilde{N} - 1}}{(1+q)^{\tilde{N}} - 1} \Bigr) \Bigl( \frac{\sum_{i=1}^{n} (1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i} \Bigr),$$

where $\tilde{N}$ is an initial estimate of $\bar{N}$ based on an initial estimate of $D$. We set $\tilde{N}$ equal to $N/\hat{D}_{\mathrm{uj1}}$ throughout. As with $\hat{D}_{\mathrm{Sh}}$, the estimator $\hat{D}_{\mathrm{Sh2}}$ is consistent for $D$.
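The modified estimator differs from Shlosser's original only by the multiplicative correction factor, which the following sketch makes explicit. The function name and the example counts are assumptions for illustration; $\tilde{N}$ is set to $N/\hat{D}_{\mathrm{uj1}}$ as in the text.

```python
def shlosser2(f, N):
    """Return (D_Sh, D_Sh2) for frequency counts f = {i: f_i}."""
    n = sum(i * c for i, c in f.items())
    d = sum(f.values())
    f1 = f.get(1, 0)
    q = n / N
    # Shlosser's ratio, i.e. K_Sh / n.
    num = sum((1 - q) ** i * c for i, c in f.items())
    den = sum(i * q * (1 - q) ** (i - 1) * c for i, c in f.items())
    ratio = num / den
    # Initial estimate of the mean class size from the first-order jackknife (11).
    d_uj1 = d / (1 - (1 - q) * f1 / n)
    n_tilde = N / d_uj1
    correction = q * (1 + q) ** (n_tilde - 1) / ((1 + q) ** n_tilde - 1)
    return d + f1 * ratio, d + f1 * correction * ratio

d_sh, d_sh2 = shlosser2({1: 40, 2: 30}, N=1000)   # hypothetical sample, q = 0.1
```

For this made-up sample the correction factor is well below 1, so the modified value lands much closer to the first-order jackknife value (about 109.4) than the original Shlosser value does.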

An alternative consistent estimator of $D$ can be obtained by directly using the expressions in (24), (27), and (29) with $F_i$ estimated by

$$\hat{F}_i = \frac{f_1 f_i}{\sum_{k=1}^{n} k q (1-q)^{k-1} f_k} \tag{30}$$

for $1 \le i \le N$; these estimators of $F_1, F_2, \ldots, F_N$ were proposed by Shlosser (1981) in conjunction with the estimator $\hat{D}_{\mathrm{Sh}}$. Substituting the resulting estimators of $K$ and $E[\hat{K}_{\mathrm{Sh}}]$ into (27) leads to the final estimator

$$\hat{D}_{\mathrm{Sh3}} = d_n + f_1 \Bigl( \frac{\sum_{i=1}^{n} i q^2 (1-q^2)^{i-1} f_i}{\sum_{i=1}^{n} (1-q)^i \bigl( (1+q)^i - 1 \bigr) f_i} \Bigr) \Bigl( \frac{\sum_{i=1}^{n} (1-q)^i f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i} \Bigr)^2.$$

As with the estimator $\hat{D}_{\mathrm{Sh}}$, Shlosser's justification of the estimators in (30) rests on the assumption in (25). Thus one might expect that, like $\hat{D}_{\mathrm{Sh}}$, the estimator $\hat{D}_{\mathrm{Sh3}}$ will be unstable when the sample size is relatively small and $N_1, N_2, \ldots, N_D$ have similar values. On the other hand, the reduction in bias of $\hat{K}'_{\mathrm{Sh}}$ relative to $\hat{K}_{\mathrm{Sh}}$ leads one to expect that $\hat{D}_{\mathrm{Sh3}}$ will perform better than $\hat{D}_{\mathrm{Sh}}$ when $\gamma^2$ is sufficiently large. (One might be tempted to avoid the assumption in (25) when estimating $F_1, F_2, \ldots, F_N$ by taking a method-of-moments approach: replace $E[f_r]$ with $f_r$ in (28) for $1 \le r \le n$ and solve the resulting set of linear equations either exactly or approximately. As pointed out by Shlosser (1981), however, this system of equations is nearly singular, and hence extremely unstable.)

4. Variance and Variance Estimates

Consider an estimator $\hat{D}$ that is a function of the sample only through $\mathbf{f} = (f_1, f_2, \ldots, f_M)$, where $M = \max(N_1, N_2, \ldots, N_D)$. All of the estimators introduced in Section 3 are of this type. In general, we also allow $\hat{D}$ to depend explicitly on the population size $N$ and write $\hat{D} = \hat{D}(\mathbf{f}, N)$. Suppose that, for any $N > 0$ and nonnegative $M$-dimensional vector $\mathbf{f} \ne \mathbf{0}$, the function $\hat{D}$ is continuously differentiable at the point $(\mathbf{f}, N)$ and

$$\hat{D}(c\mathbf{f}, cN) = c\, \hat{D}(\mathbf{f}, N) \tag{31}$$

for $c > 0$.
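The scaling property (31) is easy to check numerically for a given estimator. The sketch below implements $\hat{D}_{\mathrm{Sh3}}$ as displayed above and compares $\hat{D}(c\mathbf{f}, cN)$ with $c\,\hat{D}(\mathbf{f}, N)$. The function name, the example counts, and the choice $c = 3$ are assumptions made for this illustration.

```python
def shlosser3(f, N):
    """D_Sh3 for frequency counts f = {i: f_i}; requires q = n/N < 1."""
    n = sum(i * c for i, c in f.items())
    d = sum(f.values())
    q = n / N
    a = sum(i * q**2 * (1 - q**2) ** (i - 1) * c for i, c in f.items())
    b = sum((1 - q) ** i * ((1 + q) ** i - 1) * c for i, c in f.items())
    num = sum((1 - q) ** i * c for i, c in f.items())
    den = sum(i * q * (1 - q) ** (i - 1) * c for i, c in f.items())
    return d + f.get(1, 0) * (a / b) * (num / den) ** 2

f = {1: 40, 2: 30}                          # hypothetical sample from N = 1000
c = 3
scaled_f = {i: c * count for i, count in f.items()}
lhs = shlosser3(scaled_f, c * 1000)         # D-hat(c f, c N)
rhs = c * shlosser3(f, 1000)                # c * D-hat(f, N)
```

Scaling $\mathbf{f}$ and $N$ together leaves $q$ and every ratio of sums unchanged, so only $d_n$ and $f_1$ pick up the factor $c$; `lhs` and `rhs` should therefore agree up to rounding.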
Approximating the hypergeometric sample design by a Bernoulli sample design as in (8), we can obtain the following approximate expression for the asymptotic variance of $\hat{D}(\mathbf{f}, N)$ as $D$ becomes large:

$$\mathrm{AVar}[\hat{D}(\mathbf{f}, N)] \approx \sum_{i=1}^{M} A_i^2\, \mathrm{Var}[f_i] + \sum_{\substack{1 \le i, i' \le M \\ i \ne i'}} A_i A_{i'}\, \mathrm{Cov}[f_i, f_{i'}], \tag{32}$$

where $A_i$ is the partial derivative of $\hat{D}$ with respect to $f_i$, evaluated at the point $(\mathbf{f}, N)$. (When computing each $A_i$, we replace each occurrence of $n$ and $d_n$ in the formula for $\hat{D}$ by $\sum_{i=1}^{M} i f_i$ and $\sum_{i=1}^{M} f_i$ before taking derivatives.) The approximation in (32) is valid when there is not too much variability in the class sizes (see App. C for a precise formulation and proof of this result). It follows from the proof that, to a good approximation, the variance of an estimator $\hat{D}$ satisfying (31) increases linearly as $D$ increases.

Straightforward calculations show that each of the specific estimators $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{uj2}}$, $\hat{D}_{\mathrm{Sh}}$, $\hat{D}_{\mathrm{Sh2}}$, and $\hat{D}_{\mathrm{Sh3}}$ is continuously differentiable as stated previously and also satisfies (31). Thus we can use (32) to study the asymptotic variance of these estimators. We focus on $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{uj2}}$, $\hat{D}_{\mathrm{Sh2}}$, and $\hat{D}_{\mathrm{Sh3}}$ because each of these estimators performs best for at least one population studied in the simulation experiments described in Section 6; we also consider $\hat{D}_{\mathrm{Sh}}$, because $\hat{D}_{\mathrm{Sh}}$ is the most useful of the estimators previously proposed in the literature.

Computation of the $A_i$ coefficients for each estimator is tedious, but straightforward. When $\hat{D} = \hat{D}_{\mathrm{uj2}}$, for example, we obtain expressions for $A_1^{(\mathrm{uj2})}$ and for $A_i^{(\mathrm{uj2})}$, $1 < i \le n$, in terms of the corresponding first-order coefficients $A_1^{(\mathrm{uj1})}$ and $A_i^{(\mathrm{uj1})}$, where $\hat\gamma^2 = \hat\gamma^2(\hat{D}_{\mathrm{uj1}})$.

[The four displayed expressions for $A_1^{(\mathrm{uj2})}$, $A_i^{(\mathrm{uj2})}$, $A_1^{(\mathrm{uj1})}$, and $A_i^{(\mathrm{uj1})}$ are too badly garbled in this transcription to be reconstructed reliably and are omitted here.]

Figures 1 and 2 compare the variances of the estimators $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{uj2}}$, $\hat{D}_{\mathrm{Sh}}$, $\hat{D}_{\mathrm{Sh2}}$, and $\hat{D}_{\mathrm{Sh3}}$ for a number of populations with equal class sizes. For these special populations, $\hat{D}_{\mathrm{uj1}}$ and $\hat{D}_{\mathrm{uj2}}$ are approximately unbiased, so that the relative variances of these estimators are appropriate measures of relative performance. It is particularly instructive to compare the variance of $\hat{D}_{\mathrm{uj1}}$ and $\hat{D}_{\mathrm{uj2}}$, since $\hat{D}_{\mathrm{uj2}}$ is obtained from $\hat{D}_{\mathrm{uj1}}$ by adjusting the latter estimator to compensate for bias induced by the assumption of equal class sizes.
This adjustment is unnecessary for our special populations, and a comparison allows evaluation of the penalty (i.e., the increase in variance) that is being paid for the adjustment.
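For an equal-class-size population, the approximation (32) can be evaluated numerically without closed-form derivatives. The sketch below is one possible way to do this and is not the authors' program. It makes three assumptions: the partial derivatives $A_i$ are approximated by central differences at the expected value of $\mathbf{f}$; $q = n/N$ is allowed to vary with $n = \sum_i i f_i$ when differentiating; and, under Bernoulli sampling with $D$ classes of size $\bar{N}$, $\mathbf{f}$ is treated as multinomial with cell probabilities $p_i = \binom{\bar{N}}{i} q^i (1-q)^{\bar{N}-i}$, so that $\mathrm{Var}[f_i] = D p_i (1 - p_i)$ and $\mathrm{Cov}[f_i, f_{i'}] = -D p_i p_{i'}$.

```python
from math import comb, sqrt

def uj1(f, N):
    """First-order jackknife (11); f is a list [f_1, ..., f_M], possibly non-integer."""
    n = sum(i * fi for i, fi in enumerate(f, start=1))
    d = sum(f)
    q = n / N
    return d / (1 - (1 - q) * f[0] / n)

def asymptotic_sd(estimator, D, class_size, q, h=1e-4):
    """Approximate standard deviation from (32) for D classes of equal size."""
    N = D * class_size
    p = [comb(class_size, i) * q**i * (1 - q) ** (class_size - i)
         for i in range(1, class_size + 1)]
    f = [D * pi for pi in p]                 # evaluate the derivatives at E[f]
    A = []
    for i in range(len(f)):
        up, down = f[:], f[:]
        up[i] += h
        down[i] -= h
        A.append((estimator(up, N) - estimator(down, N)) / (2 * h))
    # With the multinomial moments above, (32) collapses to
    # D * (sum_i A_i^2 p_i - (sum_i A_i p_i)^2).
    s1 = sum(a * a * pi for a, pi in zip(A, p))
    s2 = sum(a * pi for a, pi in zip(A, p))
    return sqrt(max(D * (s1 - s2 * s2), 0.0))

sd_1500 = asymptotic_sd(uj1, D=1500, class_size=10, q=0.10)
sd_3000 = asymptotic_sd(uj1, D=3000, class_size=10, q=0.10)
```

Because the estimator satisfies (31), the derivatives are unchanged when $\mathbf{f}$ and $N$ are scaled together, so doubling $D$ should double the approximate variance, in line with the linear growth noted in the text.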

[Figures 1 and 2: plots not reproduced. Figure 1 plots standard deviation against sampling fraction ($q$); Figure 2 plots standard deviation against class size ($\bar{N}$).]

Figure 1: Standard deviation of $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{uj2}}$, $\hat{D}_{\mathrm{Sh}}$, $\hat{D}_{\mathrm{Sh2}}$, and $\hat{D}_{\mathrm{Sh3}}$ ($D = 15{,}000$ and $\bar{N} = 10$).

Figure 2: Standard deviation of $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{uj2}}$, and $\hat{D}_{\mathrm{Sh2}}$ ($D = 1500$ and $q = 0.10$).

Figure 1 displays the standard deviations of $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{uj2}}$, $\hat{D}_{\mathrm{Sh}}$, $\hat{D}_{\mathrm{Sh2}}$, and $\hat{D}_{\mathrm{Sh3}}$ for an equal-class-size population with $N = 15{,}000$ and $D = 1500$ (so that $\bar{N} = 10$) as the sampling fraction $q$ varies. Observe that $\hat{D}_{\mathrm{uj2}}$ is only slightly less efficient than $\hat{D}_{\mathrm{uj1}}$, so that the penalty for bias adjustment is small in this case. Performance of the estimators $\hat{D}_{\mathrm{uj1}}$ and $\hat{D}_{\mathrm{Sh2}}$ is nearly indistinguishable. The most striking observation is that for this population, $\hat{D}_{\mathrm{Sh}}$ and $\hat{D}_{\mathrm{Sh3}}$ are not competitive with the other three estimators. The relative performance of $\hat{D}_{\mathrm{Sh}}$ and $\hat{D}_{\mathrm{Sh3}}$ is especially poor for small sampling fractions. On the other hand, the variance analysis indicates that modification of $\hat{D}_{\mathrm{Sh}}$ as in (27) and (29) indeed reduces the instability of the original Shlosser estimator in this case. Thus we focus on the estimators $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{uj2}}$, and $\hat{D}_{\mathrm{Sh2}}$ in the remainder of this section and in the next section. (We return to the estimator $\hat{D}_{\mathrm{Sh3}}$ in Section 6, where our simulation experiments indicate that $\hat{D}_{\mathrm{Sh3}}$ can exhibit smaller rmse than the other estimators, but only at large sample sizes and for certain "ill-conditioned" populations in which $\gamma^2$ is extremely large.)

Figure 2 compares the three estimators $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{uj2}}$, and $\hat{D}_{\mathrm{Sh2}}$ for equal-class-size populations with a range of class sizes; for these calculations the number of classes and the sampling fraction are held constant at $D = 1500$ and $q = 0.10$. This figure illustrates the difficulty of precisely estimating $D$ when the class size is small (but greater than 1). Again, we see that these three estimators perform similarly, with nearly equal variability when $\bar{N}$ exceeds about 40.
We checked the accuracy of the variance approximation in some example populations by comparing the values computed from (32) with results of a simulation experiment. (This experiment is discussed more completely in Section 6 below.) Simulated sampling with $q = 0.05$, $0.10$, and $0.20$ from the population examined in Figure 1 ($N = 15{,}000$, $D = 1500$) yields variance estimates within 10% (on average) of those calculated from (32). Similar results were found in sampling from an equal-class-size population with $N = 15{,}000$ and $D = 150$. The only difficulties we encountered occurred for equal-class-size populations with class sizes of $\bar N = 1$ and $\bar N = 2$. For these small class sizes the variance approximation, which is based on the approximation of the hypergeometric sample design by a Bernoulli sample design, is not sufficiently accurate. In particular, the approximate variance strongly reflects random fluctuations in the sample size due to the Bernoulli sample design; such fluctuations are not present in the actual hypergeometric sample design. Simulation experiments indicate that for $\bar N \ge 3$ the differences caused by Bernoulli versus hypergeometric sampling become negligible. (Of course, if the sample design is in fact Bernoulli, then this problem does not occur.)

In practice, we estimate the asymptotic variance of an estimator $\hat D$ by substituting estimates for $\{\mathrm{Var}[f_i] : 1 \le i \le M\}$ and $\{\mathrm{Cov}[f_i, f_{i'}] : 1 \le i \ne i' \le M\}$ into (32). To obtain such estimates, we approximate the true population by a population with $D$ classes, each of size $N/D$. Under this approximation and the assumption in (8) of a Bernoulli sample design, the random vector $f$ has a multinomial distribution with parameters $D$ and $p = (p_1, p_2, \ldots, p_n)$, where

$$p_i = \binom{N/D}{i} q^i (1-q)^{(N/D)-i}$$

for $1 \le i \le n$. It follows that $\mathrm{Var}[f_i] = D p_i (1 - p_i)$ and $\mathrm{Cov}[f_i, f_{i'}] = -D p_i p_{i'}$. Each $p_i$ can be estimated either by

$$\hat p_i = \binom{N/\hat D}{i} q^i (1-q)^{(N/\hat D)-i}$$

or simply by $f_i/\hat D$. It turns out that the latter formula yields better variance estimates, and so we take

$$\widehat{\mathrm{Var}}[f_i] = f_i \left(1 - \frac{f_i}{\hat D}\right) \quad\text{and}\quad \widehat{\mathrm{Cov}}[f_i, f_{i'}] = -\frac{f_i f_{i'}}{\hat D}$$

for $1 \le i, i' \le n$. These formulas coincide with the estimators obtained using the "unconditional approach" of Chao and Lee (1992). A computer program that calculates $\hat D_{uj1}$, $\hat D_{uj2}$, $\hat D_{Sh2}$ and their estimated standard errors from sample data can be obtained from the second author.

5. An Example

The following example illustrates how knowledge of the population size $N$ can affect estimates of the number of classes. When the population size $N$ is unknown, Chao and Lee (1992, Sec. 3) have proposed that the estimator $\hat D_{CL}$ defined in Section 3.3 be used to estimate the number of classes, because the formula for $\hat D_{CL}$ does not involve the unknown parameter $N$. When $N$ is known, a slight modification of the derivation of $\hat D_{CL}$ leads to the unsmoothed second-order jackknife estimator $\hat D_{uj2}$ (see App. B).

Table 1: Values of $\hat D_{uj1}$, $\hat D_{uj2}$, and $\hat D_{Sh2}$ for three hypothetical combined lists. (Standard errors are in parentheses.)

N          D_uj1        D_uj2        D_Sh2
1,000      ... (47)     ... (60)     ... (51)
10,000     ... (125)    ... (161)    ... (128)
100,000    ... (141)    835 (183)    ... (144)

Our example is based on one discussed by Chao and Lee (1992), who borrowed data first described and analyzed by Holst (1981). These data arose from an application in numismatics in which 204 ancient coins were classified according to die type in order to estimate the number of different dies used in the minting process. Among the die types on the reverse sides of the 204 coins were 156 singletons, 19 pairs, 2 triplets, and 1 quadruplet ($f_1 = 156$, $f_2 = 19$, $f_3 = 2$, $f_4 = 1$, $d = 178$). Because the total number of coins minted is unknown in this case, model (1) is inappropriate for analyzing these data. But suppose that the same data had arisen from an application in which $N$ was known. For example, suppose that the data were obtained by selecting a simple random sample of 204 names from a sampling frame that had been constructed by combining 5 lists of 200 names each ($N = 1000$), 50 lists of 200 names each ($N = 10{,}000$), or 500 lists of 200 names each ($N = 100{,}000$). In each case our object is to estimate the number of unique individuals on the combined list, based on the sample results. We focus on the three estimators $\hat D_{uj1}$, $\hat D_{uj2}$, and $\hat D_{Sh2}$. The estimates for the three cases are given in Table 1; the standard errors displayed in Table 1 are estimated using the procedure outlined in Section 4. We would expect similar inferences to be made from the same data under the multinomial model and the finite population model when $N$ is very large. Indeed, the value $\hat D_{uj2} = 835$ agrees closely with Chao and Lee's estimate $\hat D_{CL} = 844$ (se 187) when $N = 100{,}000$.
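The coin example can be checked numerically. The sketch below is illustrative only: the closed forms used for $\hat D_{uj1}$, $\hat\gamma^2$, and $\hat D_{uj2}$ are not restated in this part of the paper, so they are assumptions recalled from the estimator definitions of Section 3, and all function names are hypothetical. The standard error is a numerical delta-method approximation that combines the Section 4 estimates of $\mathrm{Var}[f_i]$ and $\mathrm{Cov}[f_i, f_{i'}]$ with a finite-difference gradient, holding $q$ fixed and treating $n = \sum_i i f_i$ as a function of $f$; whether this coincides exactly with (32) is also an assumption. Under these assumptions the sketch gives $\hat D_{uj2} \approx 835$ for $N = 100{,}000$, and standard errors for $\hat D_{uj1}$ close to the values 47 and 141 shown in parentheses in Table 1.

```python
import math

# Coin data: f[i] = number of die types seen exactly i times in the sample.
f = {1: 156, 2: 19, 3: 2, 4: 1}


def d_uj1(f, N, q):
    """Assumed first-order form: d / (1 - (1 - q) * f1 / n)."""
    n = sum(i * fi for i, fi in f.items())
    d = sum(f.values())
    return d / (1.0 - (1.0 - q) * f.get(1, 0) / n)


def gamma2_hat(f, N, D):
    """Assumed estimator of the squared coefficient of variation of class sizes."""
    n = sum(i * fi for i, fi in f.items())
    s = sum(i * (i - 1) * fi for i, fi in f.items())
    return max(0.0, D * s / n**2 + D / N - 1.0)


def d_uj2(f, N, q):
    """Assumed second-order form, using gamma2_hat evaluated at d_uj1."""
    n = sum(i * fi for i, fi in f.items())
    d = sum(f.values())
    f1 = f.get(1, 0)
    g2 = gamma2_hat(f, N, d_uj1(f, N, q))
    correction = f1 * (1.0 - q) * math.log(1.0 - q) * g2 / q
    return (d - correction) / (1.0 - (1.0 - q) * f1 / n)


def delta_se(estimator, f, N, q, h=1e-4):
    """Delta-method standard error with Var[f_i] ~ f_i (1 - f_i / D_hat)
    and Cov[f_i, f_j] ~ -f_i f_j / D_hat (central-difference gradient)."""
    D_hat = estimator(f, N, q)
    keys = sorted(f)
    grad = {}
    for k in keys:
        up = dict(f)
        up[k] = f[k] + h
        dn = dict(f)
        dn[k] = f[k] - h
        grad[k] = (estimator(up, N, q) - estimator(dn, N, q)) / (2 * h)
    var = 0.0
    for a in keys:
        for b in keys:
            if a == b:
                cov = f[a] * (1.0 - f[a] / D_hat)
            else:
                cov = -f[a] * f[b] / D_hat
            var += grad[a] * grad[b] * cov
    return math.sqrt(var)


for N in (1_000, 10_000, 100_000):
    q = 204 / N  # sampling fraction n / N
    print(N, round(d_uj1(f, N, q)), round(d_uj2(f, N, q)),
          round(delta_se(d_uj1, f, N, q)))
```

Holding $q$ fixed while perturbing the $f_i$ mirrors the Bernoulli design assumed in Section 4, in which the sampling fraction is a design constant and the realized sample size is random.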
Moreover, when $N = 100{,}000$ we find that $\hat\gamma^2(\hat D_{uj1}) \approx 0.13$, which is the same estimate of $\gamma^2$ given by Chao and Lee. As the population size decreases, however, both our assessment of the magnitude of $D$ and our uncertainty about that magnitude decrease, because we are observing a larger and larger fraction of both the population and the classes.

The most extreme divergence between the estimate obtained using $\hat D_{CL}$ and estimates obtained using $\hat D_{uj1}$, $\hat D_{uj2}$, or $\hat D_{Sh2}$ occurs when the sample consists of all singletons ($f_1 = n$). In that case, $\hat D_{CL} = \infty$, whereas $\hat D_{uj1} = \hat D_{uj2} = \hat D_{Sh2} = N$. This result indicates that when the population size $N$ is known, it is better to use an estimator that exploits knowledge of $N$ than to sample with replacement and use the estimator $\hat D_{CL}$. In some applications, sampling with replacement is not even an option. For example, the only available sampling mechanism in at least one current database system is a one-pass reservoir algorithm (as in Vitter 1985).

The empirical results in Section 6 indicate that, of the three estimators displayed in Table 1, $\hat D_{uj2}$ is the superior estimator when $\gamma^2$ is small ($< 1$). Thus for our example, $\hat D_{uj2}$ would be the preferred estimator, since $\hat\gamma^2(\hat D_{uj1}) \approx 0.13$ in all three cases. Note that $\hat D_{uj2}$ consistently has the highest variance of the three estimators in Table 1. The bias of $\hat D_{uj2}$ is typically lower than that of $\hat D_{uj1}$ or $\hat D_{Sh2}$ when $\gamma^2$ is small, however, so that the overall rmse is lower.

6. Simulation Results

This section describes the results of a simulation study done to compare the performance of the various estimators described in Section 3. Our comparison is based on the performance of the estimators for sampling fractions of 5%, 10%, and 20% in 52 populations. (Initial experiments indicated that the performance of the various estimators is best viewed as a function of sampling fraction, rather than absolute sample size. This is in contrast to estimators of, for example, population averages.)

We consider several sets of populations. The first set comprises synthetic populations of the type considered in the literature. Populations EQ10 and EQ100 have equal class sizes of 10 and 100. In populations NGB/1, NGB/2, and NGB/4, the class sizes follow a negative binomial distribution. Specifically, the fraction $f(m)$ of classes in population NGB/$k$ with class size equal to $m$ is given by

$$f(m) \approx \binom{m-1}{k-1} r^k (1-r)^{m-k}$$

for $m \ge k$, where $r = 0.04$. Chao and Lee (1992) considered populations of this type.

The populations in the second set are meant to be representative of data that could be encountered when a sampling frame for a population census is constructed by combining a number of lists which may contain overlapping entries. Populations GOOD and SUDM were studied by Goodman (1949) and Sudman (1976).
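As a side check on the NGB/$k$ class-size distribution described above, the following minimal sketch evaluates $f(m)$ and confirms numerically that the fractions sum to 1 and that the mean class size is $k/r$ (25, 50, and 100 for $k = 1, 2, 4$ when $r = 0.04$). The mean $k/r$ is a standard property of this form of the negative binomial rather than a figure quoted in the text, and the function names and the truncation point `m_max` are assumptions made for illustration.

```python
from math import comb


def ngb_fraction(m, k, r=0.04):
    """Fraction of classes with size m in an NGB/k-style population (0 for m < k)."""
    if m < k:
        return 0.0
    return comb(m - 1, k - 1) * r**k * (1.0 - r) ** (m - k)


def ngb_summary(k, r=0.04, m_max=5000):
    """Total mass and mean class size, truncating the infinite sum at m_max."""
    sizes = range(k, m_max + 1)
    total = sum(ngb_fraction(m, k, r) for m in sizes)
    mean = sum(m * ngb_fraction(m, k, r) for m in sizes)
    return total, mean


for k in (1, 2, 4):
    total, mean = ngb_summary(k)
    print(k, round(total, 6), round(mean, 3))
```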
Population FRAME2 mimics a sampling frame that might arise in an administrative records census of the type described in Section 1. One approach to such a census is to augment the usual census address list with a small number of relatively large administrative records files, such as AFDC or Food Stamps, and then estimate the number of distinct individuals on the combined list from a sample. We have constructed FRAME2 so that a given individual can appear at most five times, but most individuals appear exactly once, mimicking the case in which four administrative lists are used to supplement the census address list. Population FRAME3 is similar to FRAME2, but for the FRAME3 population it is assumed that the combined list is made up of a number of small lists (perhaps obtained from neighborhood-level organizations) rather than a few large lists.

The populations in the third set, denoted by Z20A, Z20B, and Z15, are used to study the behavior of the estimators when the data are extremely ill-conditioned. The class sizes in each of these populations follow a generalized Zipf distribution (see Knuth 1973, p. 398). Specifically, $N_j/N \propto j^{-\theta}$, where $\theta$ equals 1.5 or 2.0. These populations have extremely high values of $\gamma^2$.

Descriptive statistics for these three sets of populations are given in Tables 2, 3, and 4. The column entitled "skew" displays the dimensionless coefficient of skewness, which is defined by

$$\mathrm{skew} = \frac{\sum_{j=1}^{D} (N_j - \bar N)^3 / D}{\left(\sum_{j=1}^{D} (N_j - \bar N)^2 / D\right)^{3/2}}.$$

Table 2: Characteristics of synthetic populations. [Columns: Name, $N$, $D$, $\gamma^2$, Skew; rows: EQ10, EQ100, NGB/1, NGB/2, NGB/4.]

Table 3: Characteristics of "merged list" populations. [Columns: Name, $N$, $D$, $\gamma^2$, Skew; rows: GOOD, FRAME2, FRAME3, SUDM.]

Table 4: Characteristics of "ill-conditioned" populations. [Columns: Name, $N$, $D$, $\gamma^2$, Skew; rows: Z20A, Z15, Z20B.]

The final set comprises 40 real populations that demonstrate the type of distributions encountered when estimating the number of distinct values of an attribute in a relational database. Specifically, the populations studied correspond to various relational attributes from a database of enrollment records for students at the University of Wisconsin and a database of billing records from a large insurance company. The population size $N$ ranges from 15,469 to 1,654,700, with $D$ ranging from 3 to 1,547,606 and $\gamma^2$ ranging upward from 0 (see App. D for further details). It is notable that values of $\gamma^2$ encountered in the literature (Chao and Lee 1992; Goodman 1949; Shlosser 1981; Sudman 1976) tend not to exceed the value 2, and are typically less than 1, whereas the value of $\gamma^2$ exceeds 2 for more than 50% of the real populations.

For each estimator, population, and sampling fraction, we estimated the bias and rmse by repeatedly drawing a simple random sample from the population, evaluating the estimator, and then computing the error of each estimate. (When evaluating the estimator, we truncated each estimate below at $d$ and above at $N$.)
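The simulation procedure just described (repeated simple random samples, truncation of each estimate to $[d, N]$, then bias and rmse expressed as a percentage of the true $D$) can be sketched as follows. This is a minimal illustration and not the authors' program: the estimator is an assumed closed form for the first-order unsmoothed jackknife, the population is a hypothetical small equal-class-size example (200 classes of size 10), and the function names and seed are arbitrary.

```python
import math
import random


def d_uj1_truncated(sample, N):
    """Assumed first-order form d / (1 - (1 - q) * f1 / n), truncated to [d, N]."""
    n = len(sample)
    counts = {}
    for label in sample:
        counts[label] = counts.get(label, 0) + 1
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    q = n / N
    # Since f1 <= n, the denominator is at least q and therefore positive.
    raw = d / (1.0 - (1.0 - q) * f1 / n)
    return min(max(raw, d), N)


def bias_and_rmse(population, q, estimator, reps=100, seed=0):
    """Relative bias and rmse (both in % of the true D) over repeated samples."""
    rng = random.Random(seed)
    N = len(population)
    D = len(set(population))
    n = round(q * N)
    errors = []
    for _ in range(reps):
        sample = rng.sample(population, n)  # simple random sample, no replacement
        errors.append(estimator(sample, N) - D)
    bias = sum(errors) / reps
    rmse = math.sqrt(sum(e * e for e in errors) / reps)
    return 100.0 * bias / D, 100.0 * rmse / D


# Hypothetical population: class labels 0..199, each appearing 10 times.
population = [j for j in range(200) for _ in range(10)]
bias_pct, rmse_pct = bias_and_rmse(population, q=0.10, estimator=d_uj1_truncated)
print(round(bias_pct, 2), round(rmse_pct, 2))
```

Because the mean squared error equals the squared bias plus the variance, the reported rmse can never be smaller than the absolute bias; comparing the two figures shows how much of the error is systematic.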
The final estimate of bias was obtained by averaging the error over all of the experimental replications, and rmse was estimated as the square root of the averaged square error. We used 100 replications, which was sufficient to estimate the rmse with a standard error below 5% in nearly all cases; typically the standard error was much less.

Summary results from the simulations are displayed in Tables 5 and 6. Table 5 gives the average and maximum rmse's for each estimator of $D$ over all populations with $0 \le \gamma^2 < 1$, with $1 \le \gamma^2 < 50$, and with $\gamma^2 \ge 50$, as well as the average and maximum rmse's for each estimator over all populations combined. Similarly, Table 6 gives the average and maximum bias for each estimator. In these tables, the rmse and bias are each expressed as a percentage of the true number of classes. Tables 5 and 6 also display the rmse and bias of the estimator $\hat\gamma^2(\hat D_{uj1})$ used in the second-order jackknife estimators; the rmse and bias are expressed as a percentage of the true value $\gamma^2$ and are displayed in the column labelled $\hat\gamma^2$.

Table 5: Average and maximum rmse (%) for various estimators. [Columns: sample size, $\gamma^2$ range, $\hat D_{uj1}$, $\hat D_{sj1}$, $\hat D_{uj2}$, $\hat D_{sj2}$, $\hat D_{Sh}$, $\hat D_{Sh2}$, $\hat D_{Sh3}$, $\hat\gamma^2$; rows: Average and Maximum for each $\gamma^2$ range ($0 \le \gamma^2 < 1$, $1 \le \gamma^2 < 50$, $\gamma^2 \ge 50$, all) at sampling fractions of 5%, 10%, and 20%.]

Table 6: Average and maximum bias (%) for various estimators. [Same columns and rows as Table 5.]

Comparing Tables 5 and 6 indicates that for each estimator the major component of the rmse is almost always bias, not variance. Thus, even though the standard error can be estimated as in Section 4, this estimated standard error usually does not give an accurate picture of the error in estimation of $D$. Another consequence of the predominance of bias is that when $\gamma^2$ is large, the rmse for the second-order estimator $\hat D_{uj2}$ does not decrease monotonically as the sampling fraction increases. (In all other cases the rmse decreases monotonically.)

Comparing $\hat D_{uj1}$ with $\hat D_{sj1}$ and then comparing $\hat D_{uj2}$ with $\hat D_{sj2}$, we see that smoothing a first-order jackknife estimator never results in a better first-order estimator. On the other hand, smoothing a second-order jackknife estimator can result in significant performance improvement when $\gamma^2$ is large. Similarly, using higher-order Taylor expansions leads to mixed results. Second-order estimators perform better than first-order estimators when $\gamma^2$ is relatively small, but not when $\gamma^2$ is large. The difficulty is partially that the estimator $\hat\gamma^2(\hat D_{uj1})$ tends to underestimate $\gamma^2$ when $\gamma^2$ is large, leading to underestimates of the number of classes. Moreover, the Taylor approximations underlying $\hat D_{uj1}$, $\hat D_{sj1}$, $\hat D_{uj2}$, and $\hat D_{sj2}$ are derived under the assumption of not too much variability between class sizes; this assumption is violated when $\gamma^2$ is large. There apparently is no systematic relation between the coefficient of skewness for the class sizes and the performance of second-order jackknife estimators.

As predicted in Sections 3.4 and 4, the estimators $\hat D_{Sh}$ and $\hat D_{Sh3}$ behave poorly when $\gamma^2$ is relatively small, and $\hat D_{Sh3}$ performs better than $\hat D_{Sh}$ when $\gamma^2$ is large. For small to medium values of $\gamma^2$, the modified estimator $\hat D_{Sh2}$ has a smaller rmse than $\hat D_{Sh}$ or $\hat D_{Sh3}$, and its performance is comparable to the generalized jackknife estimators. For extremely large values of $\gamma^2$ and also for large sample sizes, the estimator $\hat D_{Sh3}$ has the best performance of the three Shlosser-type estimators. (For a 20% sampling fraction, $\hat D_{Sh3}$ in fact has the lowest average rmse of all the estimators considered.)

Table 7: Average and maximum rmse (%) of $\hat D_{uj2}$, $\hat D_{uj2a}$, $\hat D_{Sh2}$, $\hat D_{Sh3}$, and $\hat D_{hybrid}$. [Columns: sample size, $\gamma^2$ range, $\hat D_{uj2}$, $\hat D_{uj2a}$, $\hat D_{Sh2}$, $\hat D_{Sh3}$, $\hat D_{hybrid}$; rows as in Table 5.]

As indicated earlier, smoothing can improve the performance of the second-order jackknife estimator $\hat D_{uj2}$. An alternative ad hoc technique for improving performance is to "stabilize" $\hat D_{uj2}$ using a method suggested by Chao, Ma, and Yang (1993). Fix $c \ge 1$ and remove any class whose frequency in the sample exceeds $c$; that is, remove from the sample all members of classes $\{C_j : j \in B\}$, where $B = \{1 \le j \le D : n_j > c\}$. Then compute the estimator $\hat D_{uj2}$ from the reduced sample and subsequently increment it by $|B|$ to produce the final estimate, denoted by $\hat D_{uj2a}$. (Here $|B|$ denotes the number of elements in the set $B$.) When computing $\hat D_{uj2}$ from the reduced sample, take the population size as $N - \sum_{j \in B} \hat N_j$, where each $\hat N_j$ is a method-of-moments estimator of $N_j$ as defined in an earlier section. If $n - \sum_{j \in B} n_j = 0$, then simply compute $\hat D_{uj2}$ from the full sample. The idea behind this procedure is as follows. When $\gamma^2$ is large, the population consists of a few large classes and many smaller classes. By in effect removing the largest classes from the population,


More information

Chapter 13, Part A Analysis of Variance and Experimental Design

Chapter 13, Part A Analysis of Variance and Experimental Design Slides Prepared by JOHN S. LOUCKS St. Edward s Uiversity Slide 1 Chapter 13, Part A Aalysis of Variace ad Eperimetal Desig Itroductio to Aalysis of Variace Aalysis of Variace: Testig for the Equality of

More information

Some Properties of the Exact and Score Methods for Binomial Proportion and Sample Size Calculation

Some Properties of the Exact and Score Methods for Binomial Proportion and Sample Size Calculation Some Properties of the Exact ad Score Methods for Biomial Proportio ad Sample Size Calculatio K. KRISHNAMOORTHY AND JIE PENG Departmet of Mathematics, Uiversity of Louisiaa at Lafayette Lafayette, LA 70504-1010,

More information

G. R. Pasha Department of Statistics Bahauddin Zakariya University Multan, Pakistan

G. R. Pasha Department of Statistics Bahauddin Zakariya University Multan, Pakistan Deviatio of the Variaces of Classical Estimators ad Negative Iteger Momet Estimator from Miimum Variace Boud with Referece to Maxwell Distributio G. R. Pasha Departmet of Statistics Bahauddi Zakariya Uiversity

More information

B Supplemental Notes 2 Hypergeometric, Binomial, Poisson and Multinomial Random Variables and Borel Sets

B Supplemental Notes 2 Hypergeometric, Binomial, Poisson and Multinomial Random Variables and Borel Sets B671-672 Supplemetal otes 2 Hypergeometric, Biomial, Poisso ad Multiomial Radom Variables ad Borel Sets 1 Biomial Approximatio to the Hypergeometric Recall that the Hypergeometric istributio is fx = x

More information

Section 1 of Unit 03 (Pure Mathematics 3) Algebra

Section 1 of Unit 03 (Pure Mathematics 3) Algebra Sectio 1 of Uit 0 (Pure Mathematics ) Algebra Recommeded Prior Kowledge Studets should have studied the algebraic techiques i Pure Mathematics 1. Cotet This Sectio should be studied early i the course

More information

Sampling Distributions, Z-Tests, Power

Sampling Distributions, Z-Tests, Power Samplig Distributios, Z-Tests, Power We draw ifereces about populatio parameters from sample statistics Sample proportio approximates populatio proportio Sample mea approximates populatio mea Sample variace

More information

Notes on iteration and Newton s method. Iteration

Notes on iteration and Newton s method. Iteration Notes o iteratio ad Newto s method Iteratio Iteratio meas doig somethig over ad over. I our cotet, a iteratio is a sequece of umbers, vectors, fuctios, etc. geerated by a iteratio rule of the type 1 f

More information

1 Covariance Estimation

1 Covariance Estimation Eco 75 Lecture 5 Covariace Estimatio ad Optimal Weightig Matrices I this lecture, we cosider estimatio of the asymptotic covariace matrix B B of the extremum estimator b : Covariace Estimatio Lemma 4.

More information

MA131 - Analysis 1. Workbook 9 Series III

MA131 - Analysis 1. Workbook 9 Series III MA3 - Aalysis Workbook 9 Series III Autum 004 Cotets 4.4 Series with Positive ad Negative Terms.............. 4.5 Alteratig Series.......................... 4.6 Geeral Series.............................

More information

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i

More information

NUMERICAL METHODS FOR SOLVING EQUATIONS

NUMERICAL METHODS FOR SOLVING EQUATIONS Mathematics Revisio Guides Numerical Methods for Solvig Equatios Page 1 of 11 M.K. HOME TUITION Mathematics Revisio Guides Level: GCSE Higher Tier NUMERICAL METHODS FOR SOLVING EQUATIONS Versio:. Date:

More information

Quasi-Monte Carlo methods

Quasi-Monte Carlo methods Quasi-Mote Carlo methods SE 702813 Semiar zur Numerik ud Stochastik) Lukas Eikemmer March 29, 2010 1 Itroductio The geeral problem we are iterested i is to umerically compute the itegral I : fx) dx, [0,1]

More information

The picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled

The picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled 1 Lecture : Area Area ad distace traveled Approximatig area by rectagles Summatio The area uder a parabola 1.1 Area ad distace Suppose we have the followig iformatio about the velocity of a particle, how

More information

Analysis of Experimental Data

Analysis of Experimental Data Aalysis of Experimetal Data 6544597.0479 ± 0.000005 g Quatitative Ucertaity Accuracy vs. Precisio Whe we make a measuremet i the laboratory, we eed to kow how good it is. We wat our measuremets to be both

More information

PAijpam.eu ON TENSOR PRODUCT DECOMPOSITION

PAijpam.eu ON TENSOR PRODUCT DECOMPOSITION Iteratioal Joural of Pure ad Applied Mathematics Volume 103 No 3 2015, 537-545 ISSN: 1311-8080 (prited versio); ISSN: 1314-3395 (o-lie versio) url: http://wwwijpameu doi: http://dxdoiorg/1012732/ijpamv103i314

More information

Probability, Expectation Value and Uncertainty

Probability, Expectation Value and Uncertainty Chapter 1 Probability, Expectatio Value ad Ucertaity We have see that the physically observable properties of a quatum system are represeted by Hermitea operators (also referred to as observables ) such

More information

A General Family of Estimators for Estimating Population Variance Using Known Value of Some Population Parameter(s)

A General Family of Estimators for Estimating Population Variance Using Known Value of Some Population Parameter(s) Rajesh Sigh, Pakaj Chauha, Nirmala Sawa School of Statistics, DAVV, Idore (M.P.), Idia Floreti Smaradache Uiversity of New Meico, USA A Geeral Family of Estimators for Estimatig Populatio Variace Usig

More information

Final Examination Solutions 17/6/2010

Final Examination Solutions 17/6/2010 The Islamic Uiversity of Gaza Faculty of Commerce epartmet of Ecoomics ad Political Scieces A Itroductio to Statistics Course (ECOE 30) Sprig Semester 009-00 Fial Eamiatio Solutios 7/6/00 Name: I: Istructor:

More information

The Sample Variance Formula: A Detailed Study of an Old Controversy

The Sample Variance Formula: A Detailed Study of an Old Controversy The Sample Variace Formula: A Detailed Study of a Old Cotroversy Ky M. Vu PhD. AuLac Techologies Ic. c 00 Email: kymvu@aulactechologies.com Abstract The two biased ad ubiased formulae for the sample variace

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

Bootstrap Intervals of the Parameters of Lognormal Distribution Using Power Rule Model and Accelerated Life Tests

Bootstrap Intervals of the Parameters of Lognormal Distribution Using Power Rule Model and Accelerated Life Tests Joural of Moder Applied Statistical Methods Volume 5 Issue Article --5 Bootstrap Itervals of the Parameters of Logormal Distributio Usig Power Rule Model ad Accelerated Life Tests Mohammed Al-Ha Ebrahem

More information

A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS

A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS J. Japa Statist. Soc. Vol. 41 No. 1 2011 67 73 A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS Yoichi Nishiyama* We cosider k-sample ad chage poit problems for idepedet data i a

More information

Recurrence Relations

Recurrence Relations Recurrece Relatios Aalysis of recursive algorithms, such as: it factorial (it ) { if (==0) retur ; else retur ( * factorial(-)); } Let t be the umber of multiplicatios eeded to calculate factorial(). The

More information

IIT JAM Mathematical Statistics (MS) 2006 SECTION A

IIT JAM Mathematical Statistics (MS) 2006 SECTION A IIT JAM Mathematical Statistics (MS) 6 SECTION A. If a > for ad lim a / L >, the which of the followig series is ot coverget? (a) (b) (c) (d) (d) = = a = a = a a + / a lim a a / + = lim a / a / + = lim

More information

A NOTE ON THE TOTAL LEAST SQUARES FIT TO COPLANAR POINTS

A NOTE ON THE TOTAL LEAST SQUARES FIT TO COPLANAR POINTS A NOTE ON THE TOTAL LEAST SQUARES FIT TO COPLANAR POINTS STEVEN L. LEE Abstract. The Total Least Squares (TLS) fit to the poits (x,y ), =1,,, miimizes the sum of the squares of the perpedicular distaces

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 9 Multicolliearity Dr Shalabh Departmet of Mathematics ad Statistics Idia Istitute of Techology Kapur Multicolliearity diagostics A importat questio that

More information

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would

More information

4 Conditional Distribution Estimation

4 Conditional Distribution Estimation 4 Coditioal Distributio Estimatio 4. Estimators Te coditioal distributio (CDF) of y i give X i = x is F (y j x) = P (y i y j X i = x) = E ( (y i y) j X i = x) : Tis is te coditioal mea of te radom variable

More information

Discrete probability distributions

Discrete probability distributions Discrete probability distributios I the chapter o probability we used the classical method to calculate the probability of various values of a radom variable. I some cases, however, we may be able to develop

More information

Comparison Study of Series Approximation. and Convergence between Chebyshev. and Legendre Series

Comparison Study of Series Approximation. and Convergence between Chebyshev. and Legendre Series Applied Mathematical Scieces, Vol. 7, 03, o. 6, 3-337 HIKARI Ltd, www.m-hikari.com http://d.doi.org/0.988/ams.03.3430 Compariso Study of Series Approimatio ad Covergece betwee Chebyshev ad Legedre Series

More information

SRC Technical Note June 17, Tight Thresholds for The Pure Literal Rule. Michael Mitzenmacher. d i g i t a l

SRC Technical Note June 17, Tight Thresholds for The Pure Literal Rule. Michael Mitzenmacher. d i g i t a l SRC Techical Note 1997-011 Jue 17, 1997 Tight Thresholds for The Pure Literal Rule Michael Mitzemacher d i g i t a l Systems Research Ceter 130 Lytto Aveue Palo Alto, Califoria 94301 http://www.research.digital.com/src/

More information

Confidence Intervals for the Population Proportion p

Confidence Intervals for the Population Proportion p Cofidece Itervals for the Populatio Proportio p The cocept of cofidece itervals for the populatio proportio p is the same as the oe for, the samplig distributio of the mea, x. The structure is idetical:

More information

Fastest mixing Markov chain on a path

Fastest mixing Markov chain on a path Fastest mixig Markov chai o a path Stephe Boyd Persi Diacois Ju Su Li Xiao Revised July 2004 Abstract We ider the problem of assigig trasitio probabilities to the edges of a path, so the resultig Markov

More information

In algebra one spends much time finding common denominators and thus simplifying rational expressions. For example:

In algebra one spends much time finding common denominators and thus simplifying rational expressions. For example: 74 The Method of Partial Fractios I algebra oe speds much time fidig commo deomiators ad thus simplifyig ratioal epressios For eample: + + + 6 5 + = + = = + + + + + ( )( ) 5 It may the seem odd to be watig

More information

Poisson approximation

Poisson approximation p^ 0.17 0.16 0.15 0.14 0.13 0.12 0.11 0.10 0.09 0.08 0.07 0.06 Poisso approximatio Normal approximatio 90 200 400 800 2000 5000 10,000 Figure 3: Poisso vs. ormal approximatios for large sample sizes. 14

More information

Table 12.1: Contingency table. Feature b. 1 N 11 N 12 N 1b 2 N 21 N 22 N 2b. ... a N a1 N a2 N ab

Table 12.1: Contingency table. Feature b. 1 N 11 N 12 N 1b 2 N 21 N 22 N 2b. ... a N a1 N a2 N ab Sectio 12 Tests of idepedece ad homogeeity I this lecture we will cosider a situatio whe our observatios are classified by two differet features ad we would like to test if these features are idepedet

More information

Problems from 9th edition of Probability and Statistical Inference by Hogg, Tanis and Zimmerman:

Problems from 9th edition of Probability and Statistical Inference by Hogg, Tanis and Zimmerman: Math 224 Fall 2017 Homework 4 Drew Armstrog Problems from 9th editio of Probability ad Statistical Iferece by Hogg, Tais ad Zimmerma: Sectio 2.3, Exercises 16(a,d),18. Sectio 2.4, Exercises 13, 14. Sectio

More information

DISTRIBUTION LAW Okunev I.V.

DISTRIBUTION LAW Okunev I.V. 1 DISTRIBUTION LAW Okuev I.V. Distributio law belogs to a umber of the most complicated theoretical laws of mathematics. But it is also a very importat practical law. Nothig ca help uderstad complicated

More information

Bayesian and E- Bayesian Method of Estimation of Parameter of Rayleigh Distribution- A Bayesian Approach under Linex Loss Function

Bayesian and E- Bayesian Method of Estimation of Parameter of Rayleigh Distribution- A Bayesian Approach under Linex Loss Function Iteratioal Joural of Statistics ad Systems ISSN 973-2675 Volume 12, Number 4 (217), pp. 791-796 Research Idia Publicatios http://www.ripublicatio.com Bayesia ad E- Bayesia Method of Estimatio of Parameter

More information

ESTIMATION AND PREDICTION BASED ON K-RECORD VALUES FROM NORMAL DISTRIBUTION

ESTIMATION AND PREDICTION BASED ON K-RECORD VALUES FROM NORMAL DISTRIBUTION STATISTICA, ao LXXIII,. 4, 013 ESTIMATION AND PREDICTION BASED ON K-RECORD VALUES FROM NORMAL DISTRIBUTION Maoj Chacko Departmet of Statistics, Uiversity of Kerala, Trivadrum- 695581, Kerala, Idia M. Shy

More information

Zeros of Polynomials

Zeros of Polynomials Math 160 www.timetodare.com 4.5 4.6 Zeros of Polyomials I these sectios we will study polyomials algebraically. Most of our work will be cocered with fidig the solutios of polyomial equatios of ay degree

More information

Number of fatalities X Sunday 4 Monday 6 Tuesday 2 Wednesday 0 Thursday 3 Friday 5 Saturday 8 Total 28. Day

Number of fatalities X Sunday 4 Monday 6 Tuesday 2 Wednesday 0 Thursday 3 Friday 5 Saturday 8 Total 28. Day LECTURE # 8 Mea Deviatio, Stadard Deviatio ad Variace & Coefficiet of variatio Mea Deviatio Stadard Deviatio ad Variace Coefficiet of variatio First, we will discuss it for the case of raw data, ad the

More information

Simon Blackburn. Sean Murphy. Jacques Stern. Laboratoire d'informatique, Ecole Normale Superieure, Abstract

Simon Blackburn. Sean Murphy. Jacques Stern. Laboratoire d'informatique, Ecole Normale Superieure, Abstract The Cryptaalysis of a Public Key Implemetatio of Fiite Group Mappigs Simo Blackbur Sea Murphy Iformatio Security Group, Royal Holloway ad Bedford New College, Uiversity of Lodo, Egham, Surrey TW20 0EX,

More information

Lecture 4. Random variable and distribution of probability

Lecture 4. Random variable and distribution of probability Itroductio to theory of probability ad statistics Lecture. Radom variable ad distributio of probability dr hab.iż. Katarzya Zarzewsa, prof.agh Katedra Eletroii, AGH e-mail: za@agh.edu.pl http://home.agh.edu.pl/~za

More information

REGRESSION (Physics 1210 Notes, Partial Modified Appendix A)

REGRESSION (Physics 1210 Notes, Partial Modified Appendix A) REGRESSION (Physics 0 Notes, Partial Modified Appedix A) HOW TO PERFORM A LINEAR REGRESSION Cosider the followig data poits ad their graph (Table I ad Figure ): X Y 0 3 5 3 7 4 9 5 Table : Example Data

More information