
RJ (90521) May 29, 1996 (Revised 3/20/98)
Computer Science Research Report

ESTIMATING THE NUMBER OF CLASSES IN A FINITE POPULATION

Peter J. Haas
IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA

Lynne Stokes
Department of Management Science and Information Systems, University of Texas, Austin, TX

LIMITED DISTRIBUTION NOTICE
This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

IBM Research Division: Yorktown Heights, New York; San Jose, California; Zurich, Switzerland
ABSTRACT: We use an extension of the generalized jackknife approach of Gray and Schucany to obtain new nonparametric estimators for the number of classes in a finite population of known size. We also show that generalized jackknife estimators are closely related to certain Horvitz-Thompson estimators, to an estimator of Shlosser, and to estimators based on sample coverage. In particular, the generalized jackknife approach leads to a modification of Shlosser's estimator that does not suffer from the erratic behavior of the original estimator. The performance of both new and previous estimators is investigated by means of an asymptotic variance analysis and a Monte Carlo simulation study.

Keywords: jackknife, sample coverage, number of species, number of classes, database, census
1. Introduction

The problem of estimating the number of classes in a population has been studied for many years. A recent review article (Bunge and Fitzpatrick 1993) lists more than 125 references. In this article, we consider an important special case of the general problem: estimating the number of classes in a finite population of known size. Only a handful of papers have addressed this problem and none has reached an entirely satisfactory solution, despite the fact that the first attempt at solution appeared in the statistical literature nearly 50 years ago (Mosteller 1949).

The problem we consider has arisen in the literature in a variety of applications, including the following.

(i) In a company-sponsored contest, many entries (say several hundred thousand) have been received. It is known that some people have entered more than once. The goal is to estimate the number of different people who have entered from a sample of entries (Mosteller 1949; Sudman 1976).

(ii) A sampling frame is constructed by combining a number of lists that may contain overlapping entries. It is desired to estimate, using a sample from all lists, the number of units on the combined list (Deming and Glasser 1959; Goodman 1952; Kish 1965, Sec. 11.2; Sudman 1976, Sec. 3.6). An important example of such a problem is an "administrative records census," currently under study by the U.S. Bureau of the Census. In such a census, several administrative files (such as AFDC or IRS records) are combined, and the total number of distinct individuals included in the combined file is determined. Exact computation of the number of distinct individuals in the combined file is extremely expensive because of the high cost of determining the number of duplicated entries. A similar problem and proposed solution were discussed in the London Financial Times (March 2, 1949) by C. F. Carter, who was interested in estimating the number of different investors in British industrial stocks based on samples from share registers of companies (Mosteller 1949).
(iii) In a relational database system, data are organized in tables called relations (see, e.g., Korth and Silberschatz 1991, Chap. 3). In a typical relation, each row might represent a record for an individual employee in a company, and each column might correspond to a different attribute of the employee, such as salary, years of experience, department number, and so forth. A relational query specifies an output relation that is to be computed from the set of base relations stored by the system. Knowledge of the number of distinct values for each attribute in the base relations is central to determining the most efficient method for computing a specified output relation (Hellerstein and Stonebraker 1994; Selinger, Astrahan, Chamberlin, Lorie, and Price 1979). The size of the base relations in modern database systems often is so large that exact computation of the distinct-value parameters is prohibitively expensive, and thus estimation of these parameters is desired (Astrahan, Schkolnick, and Whang 1987; Flajolet and Martin 1985; Gelenbe and Gardy 1982; Hou, Ozsoyoglu, and Taneja
1988, 1989; Naughton and Seshadri 1990; Ozsoyoglu, Du, Tjahjana, Hou, and Rowland 1991; Whang, Vander-Zanden, and Taylor 1990).

In each of these applications, the size of the population (number of contest entries, total number of units over all lists, and number of rows in the base relation) is known, and this size is too large for easy computation of the number of classes.

The problem studied in this article can be described formally as follows. A population of size $N$ consists of $D$ mutually disjoint classes of items, labelled $C_1, C_2, \ldots, C_D$. Define $N_j$ to be the size of class $C_j$, so that $N = \sum_{j=1}^{D} N_j$. A simple random sample of $n$ items is selected (without replacement) from the population. This sample includes $n_j$ items from class $C_j$. The problem we consider is that of estimating $D$ using information from the sample along with knowledge of the value of $N$. We denote by $F_i$ the number of classes of size $i$ in the population, so that $D = \sum_{i=1}^{N} F_i$. Similarly, we denote by $f_i$ the number of classes represented exactly $i$ times in the sample and by $d$ the total number of classes represented in the sample. Thus $d = \sum_{i=1}^{n} f_i$ and $\sum_{i=1}^{n} i f_i = n$. Define vectors $\mathbf{N} = (N_1, N_2, \ldots, N_D)$, $\mathbf{n} = (n_1, n_2, \ldots, n_D)$, and $\mathbf{f} = (f_1, f_2, \ldots, f_n)$. Note that $\mathbf{n}$ is not observable, but $\mathbf{f}$ is. Because we sample without replacement, the random vector $\mathbf{n}$ has a multivariate hypergeometric distribution with probability mass function

\[ P(\mathbf{n} \mid D, \mathbf{N}) = \frac{\binom{N_1}{n_1}\binom{N_2}{n_2}\cdots\binom{N_D}{n_D}}{\binom{N}{n}}. \tag{1} \]

The probability mass function of the observable random vector $\mathbf{f}$ is simply $P(\mathbf{n} \mid D, \mathbf{N})$ summed over all points $\mathbf{n}$ that correspond to $\mathbf{f}$:

\[ P(\mathbf{f} \mid D, \mathbf{N}) = \sum_{\mathbf{n} \in S} P(\mathbf{n} \mid D, \mathbf{N}), \]

where $S = \{\, \mathbf{n} : \#(n_j = i) = f_i \text{ for } 1 \le i \le n \,\}$. The probability mass function $P(\mathbf{f} \mid D, \mathbf{N})$ does not have a closed-form expression in general.

In Section 2 we review the estimators that have been proposed for estimating $D$ from data generated under model (1). In Section 3 we provide several new estimators of $D$ based on an extension of the generalized jackknife approach of Gray and Schucany (1972).
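The sample statistics defined above ($n$, $d$, and the frequency counts $f_i$) can be tabulated directly from a list of observed class labels. The following Python sketch is only an illustration; the function name `sample_statistics` and the toy sample are assumptions introduced here and are not part of the report.

```python
from collections import Counter

def sample_statistics(sample):
    """Return (n, d, f) for a sample given as a list of class labels.

    f is a dict mapping i to f_i, the number of classes that appear
    exactly i times in the sample (only nonzero f_i are stored).
    """
    class_counts = Counter(sample)          # n_j for every class seen in the sample
    f = dict(Counter(class_counts.values()))  # f_i = #{j : n_j = i}
    n = len(sample)                         # sample size
    d = len(class_counts)                   # number of distinct classes in the sample
    return n, d, f

# Hypothetical toy sample: class "a" appears 3 times, "b" twice, "c" and "d" once.
n, d, f = sample_statistics(["a", "b", "a", "c", "a", "d", "b"])
```

For this toy sample the identities $d = \sum_i f_i$ and $n = \sum_i i f_i$ from the text can be checked by hand.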
We then show that generalized jackknife estimators of the number of classes in a population are closely related to certain "Horvitz-Thompson" estimators, to an estimator due to Shlosser (1981), and to estimators based on the notion of "sample coverage" (Chao and Lee 1992). In Section 4 we provide and compare approximate expressions for the asymptotic variance of several of the estimators, and in Section 5 we apply our formulas to a well-known example from the literature. We provide a simulation-based empirical comparison of the various estimators in Section 6, and summarize our results and give recommendations in Section 7.
2. Previous Estimators

Bunge and Fitzpatrick (1993) mention only two non-Bayesian estimators that have been developed as estimators of $D$ under model (1). These are the estimators of Goodman (1949) and Shlosser (1981). Goodman proved that

\[ \widehat{D}_{\mathrm{Good1}} = d + \sum_{i=1}^{n} (-1)^{i+1} \frac{(N-n+i-1)!\,(n-i)!}{(N-n-1)!\,n!}\, f_i \]

is the unique unbiased estimator of $D$ when $n > M \stackrel{\mathrm{def}}{=} \max(N_1, N_2, \ldots, N_D)$. He further proved that no unbiased estimator of $D$ exists when $n \le M$. Unfortunately, unless the sampling fraction is quite large, the variance of $\widehat{D}_{\mathrm{Good1}}$ is so great and the numerical difficulties encountered when computing $\widehat{D}_{\mathrm{Good1}}$ are so severe that the estimator is unusable. Goodman, who made note of the high variance of $\widehat{D}_{\mathrm{Good1}}$ himself, suggested the alternative estimator

\[ \widehat{D}_{\mathrm{Good2}} = N - \frac{N(N-1)}{n(n-1)}\, f_2 \]

for overcoming the variance problem. Although $\widehat{D}_{\mathrm{Good2}}$ has lower variance than $\widehat{D}_{\mathrm{Good1}}$, it can take on negative values and can have a large bias for any $n$ if $D$ is small. For example, consider the case in which $D = 1$ and $n > 2$, and observe that $f_2 = 0$ and $\widehat{D}_{\mathrm{Good2}} = N$.

Under the assumption that the population size $N$ is large and the sampling fraction $q = n/N$ is nonnegligible, Shlosser (1981) derived the estimator

\[ \widehat{D}_{\mathrm{Sh}} = d + f_1\, \frac{\sum_{i=1}^{n} (1-q)^{i} f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i}. \]

For the two examples considered in his paper, Shlosser found that use of $\widehat{D}_{\mathrm{Sh}}$ with a 10% sampling fraction resulted in an error rate below 20%. In our experiments, however, we observed root mean squared errors (rmse's) exceeding 200%, even for well-behaved populations with relatively little variation among the class sizes (see Sec. 6). Considering the relationship between $\widehat{D}_{\mathrm{Sh}}$ and generalized jackknife estimators (see Sec. 3.4) provides insight into the source of this erratic behavior and suggests some possible modifications of $\widehat{D}_{\mathrm{Sh}}$ to improve performance.

In related work, Burnham and Overton (1978, 1979) proposed a family of (traditional) generalized jackknife estimators for estimating the size of a closed population when capture probabilities vary among animals.
The $D$ individuals in the population play the role of our $D$ classes; a given individual can appear up to $n$ times in the overall sample if captured on one or more of $n$ possible trapping occasions. The capture probability for an individual is assumed to be constant over time, and the capture probabilities for the $D$ individuals are modeled as $D$ iid random samples from a fixed probability distribution. Burnham and
Overton's sample design is clearly different from model (1). Under the Burnham and Overton model, for example, the quantities $f_1, f_2, \ldots, f_n$ have a joint multinomial distribution.

Closely related to the work of Burnham and Overton are the ordinary jackknife estimators of the number of species in a closed region developed by Heltshe and Forrester (1983) and Smith and van Belle (1984). The sample data consist of a list of the species that appear in each of $n$ quadrats. (The number of times that a species is represented in a quadrat is not recorded.) This setup is essentially identical to that of Burnham and Overton, with the $D$ species playing the role of the $D$ individuals and the $n$ quadrats playing the role of the $n$ trapping occasions.

3. Generalized Jackknife Estimators

In this section we outline an extension of the generalized jackknife approach to bias reduction and then use this approach to derive new estimators for the number of classes in a finite population. We also point out connections between our generalized jackknife approach and several other estimation approaches in the literature.

3.1. The Generalized Jackknife Approach

Let $\theta$ be an unknown real-valued parameter. A generalized jackknife estimator of $\theta$ is an estimator of the form

\[ G(\widehat{\theta}_1, \widehat{\theta}_2) = \frac{\widehat{\theta}_1 - R\,\widehat{\theta}_2}{1 - R}, \tag{2} \]

where $\widehat{\theta}_1$ and $\widehat{\theta}_2$ are biased estimators of $\theta$ and $R$ ($\ne 1$) is a real number (Gray and Schucany 1972). The idea underlying the generalized jackknife approach is to try to choose $R$ such that $G(\widehat{\theta}_1, \widehat{\theta}_2)$ has lower bias than either $\widehat{\theta}_1$ or $\widehat{\theta}_2$. To motivate the choice of $R$, observe that for

\[ R = \frac{E[\widehat{\theta}_1] - \theta}{E[\widehat{\theta}_2] - \theta}, \tag{3} \]

the estimator $G(\widehat{\theta}_1, \widehat{\theta}_2)$ is unbiased for $\theta$. This optimal value of $R$ is typically unknown, however, and can only be approximated, resulting in bias reduction but not complete bias elimination.

In the following, we extend the original definition of the generalized jackknife given by Gray and Schucany (1972) by allowing $R$ to depend on the data; that is, we allow $R$ to be random.

Recall that $d$ is the number of classes represented in the sample.
Write $d_n$ for $d$ to emphasize the dependence of $d$ on the sample size, and denote by $d_{n-1}(k)$ the number of classes represented in the sample after the $k$th observation has been removed. Set

\[ d_{(n-1)} = \frac{1}{n} \sum_{k=1}^{n} d_{n-1}(k). \]
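The leave-one-out average $d_{(n-1)}$ can be computed by brute force for a small sample. The sketch below is illustrative only; the function name and the toy sample are assumptions. Removing an observation lowers the distinct count only when that observation's class occurs once in the sample, so the brute-force value should agree with $d_n - f_1/n$, which the code computes alongside it for comparison.

```python
from collections import Counter

def leave_one_out_average(sample):
    """Average, over k, of the number of distinct classes left after
    deleting the k-th observation (the quantity d_(n-1) in the text)."""
    n = len(sample)
    total = 0
    for k in range(n):
        reduced = sample[:k] + sample[k + 1:]   # sample with observation k removed
        total += len(set(reduced))              # d_{n-1}(k)
    return total / n

# Hypothetical toy sample with two singleton classes ("c" and "d").
sample = ["a", "b", "a", "c", "a", "d", "b"]
d_bar = leave_one_out_average(sample)

counts = Counter(sample)
d_n = len(counts)                                  # distinct classes in the full sample
f1 = sum(1 for c in counts.values() if c == 1)     # number of singleton classes
shortcut = d_n - f1 / len(sample)                  # closed form d_n - f_1/n
```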
We focus on generalized jackknife estimators that are obtained by taking $\widehat{\theta}_1 = d_n$ and $\widehat{\theta}_2 = d_{(n-1)}$ in (2); these are the usual choices for $\widehat{\theta}_1$ and $\widehat{\theta}_2$ in the classical first-order jackknife estimator (Miller 1974). Observe that $d_{n-1}(k) = d_n - 1$ if the class for the $k$th observation is represented only once in the sample; otherwise, $d_{n-1}(k) = d_n$. Thus $d_{(n-1)} = d_n - (f_1/n)$ and, by (2), $G(\widehat{\theta}_1, \widehat{\theta}_2) = \widehat{D}$, where

\[ \widehat{D} = d_n + K\,\frac{f_1}{n} \tag{4} \]

and $K = R/(1-R)$. It follows from (3) that the optimal choice of $K$ is

\[ K = \frac{E[d_n] - D}{E[d_{(n-1)}] - E[d_n]} = \frac{D - E[d_n]}{E[f_1]/n}. \tag{5} \]

To derive a more explicit formula for $K$, denote by $I[A]$ the indicator of event $A$ and observe that

\[ E[d_n] = E\Bigl[\,\sum_{j=1}^{D} I[n_j > 0]\Bigr] = \sum_{j=1}^{D} P\{n_j > 0\} = D - \sum_{j=1}^{D} P\{n_j = 0\}. \]

Similar reasoning shows that

\[ E[f_1] = \sum_{j=1}^{D} P\{n_j = 1\}, \tag{6} \]

so that

\[ K = \frac{\sum_{j=1}^{D} P\{n_j = 0\}}{\sum_{j=1}^{D} P\{n_j = 1\}/n}. \tag{7} \]

Following Shlosser (1981), we focus on the case in which the population size $N$ is large and the sampling fraction $q = n/N$ is nonnegligible, and we make the approximation

\[ P\{n_j = k\} \approx \binom{N_j}{k} q^{k} (1-q)^{N_j - k} \tag{8} \]

for $0 \le k \le n$ and $1 \le j \le D$. That is, the probability distribution of each $n_j$ is approximated by the probability distribution of $n_j$ under a Bernoulli sample design in which each item is included in the sample with probability $q$, independently of all other items in the population. Use of this approximation leads to estimators that behave almost identically to estimators derived using the exact distribution of $\mathbf{n}$ but are simpler to compute and derive (see App. A for further discussion). Substituting (8) into (7), we obtain

\[ K \approx \frac{n \sum_{j=1}^{D} (1-q)^{N_j}}{\sum_{j=1}^{D} N_j\, q\, (1-q)^{N_j - 1}}. \tag{9} \]
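To get a feel for how close (9) is to (7), the two expressions can be evaluated for a small population whose class sizes are known. The Python sketch below does this under the hypergeometric marginals implied by model (1); the function names and the example population (200 classes of size 10, sampled at 10%) are assumptions chosen only for illustration.

```python
from math import comb

def k_exact(class_sizes, n):
    """K from (7), using the exact hypergeometric marginals of each n_j."""
    N = sum(class_sizes)
    total = comb(N, n)
    p0 = sum(comb(N - Nj, n) / total for Nj in class_sizes)            # sum_j P{n_j = 0}
    p1 = sum(Nj * comb(N - Nj, n - 1) / total for Nj in class_sizes)   # sum_j P{n_j = 1}
    return p0 / (p1 / n)

def k_bernoulli(class_sizes, n):
    """K from (9), using the Bernoulli-design approximation (8)."""
    N = sum(class_sizes)
    q = n / N
    num = sum((1 - q) ** Nj for Nj in class_sizes)
    den = sum(Nj * q * (1 - q) ** (Nj - 1) for Nj in class_sizes)
    return n * num / den

# Hypothetical population: D = 200 classes of size 10 (N = 2000), n = 200.
sizes = [10] * 200
k7 = k_exact(sizes, 200)
k9 = k_bernoulli(sizes, 200)
```

For equal class sizes, (9) reduces to $(1-q)D$, so `k9` equals 180 here, while the exact value from (7) works out to 179.1; the two differ by about half a percent in this example.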
The quantity $K$ defined in (9) depends on unknown parameters $N_1, N_2, \ldots, N_D$ that are difficult to estimate. Our approach is to approximate $K$ by a function of $D$ and of other parameters that are easier to estimate, thereby obtaining an approximate version of (4). The estimates for these parameters, including $\widehat{D}$ for $D$, are then substituted into the approximate version of (4) and the resulting equation is solved for $\widehat{D}$.

We also consider "smoothed" jackknife estimators. The idea is to replace the quantity $f_1/n$ in (4) by its expected value $E[f_1]/n$ in the hope that the resulting estimator of $D$ will be more stable than the original "unsmoothed" estimator. As with the parameter $K$, the quantity $E[f_1]/n$ depends on the unknown parameters $N_1, N_2, \ldots, N_D$; see (6) and (8). Thus our approach to estimating $E[f_1]/n$ is the same as our approach to estimating $K$.

Estimators also can be based on high-order jackknifing schemes that consider the number of distinct values in the sample when two elements are removed, when three elements are removed, and so forth. Typically, using a high-order jackknifing scheme requires estimating high-order moments (skewness, kurtosis, and so forth) of the set of numbers $\{N_1, N_2, \ldots, N_D\}$. Initial experiments indicated that the reduction in estimation error due to using the high-order jackknife is outweighed by the increase in error due to uncertainty in the moment estimates. Thus we do not pursue high-order jackknife schemes further.

3.2. The Estimators

Different approximations for $K$ and $E[f_1]/n$ lead to different estimators for $D$. Here we develop a number of the possible estimators.

3.2.1. First-Order Estimators

The simplest estimators of $D$ can be derived using a first-order approximation to $K$. Specifically, approximate each $N_j$ in (9) by the average value

\[ \overline{N} = \frac{1}{D} \sum_{j=1}^{D} N_j = \frac{N}{D} \]

and substitute the resulting expression for $K$ into (4) to obtain

\[ \widehat{D} = d_n + \frac{(1-q) f_1 D}{n}. \tag{10} \]

Now substitute $\widehat{D}$ for $D$ on the right side of (10) and solve for $\widehat{D}$.
The resulting solution, denoted by $\widehat{D}_{\mathrm{uj1}}$, is given by

\[ \widehat{D}_{\mathrm{uj1}} = \Bigl(1 - \frac{(1-q) f_1}{n}\Bigr)^{-1} d_n. \tag{11} \]

We refer to this estimator as the "unsmoothed first-order jackknife estimator."
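Formula (11) needs only $d_n$, $f_1$, $n$, and $N$. A minimal Python sketch follows; the function name `d_uj1` and the numerical inputs are illustrative assumptions, not values from the report.

```python
def d_uj1(d, f1, n, N):
    """Unsmoothed first-order jackknife estimator, equation (11).

    d  : number of distinct classes in the sample
    f1 : number of classes seen exactly once in the sample
    n  : sample size
    N  : population size
    """
    q = n / N                                   # sampling fraction
    return d / (1.0 - (1.0 - q) * f1 / n)

# Hypothetical inputs: 4 distinct classes, 2 singletons, n = 7 out of N = 70.
estimate = d_uj1(d=4, f1=2, n=7, N=70)

# With a full census (n = N, so q = 1) the estimator returns d itself.
census = d_uj1(d=5, f1=0, n=20, N=20)
```

With these inputs the estimate is $28/5.2 \approx 5.38$, somewhat above the observed count of 4 because two of the four classes were seen only once.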
To derive a "smoothed first-order jackknife estimator," observe that by (6) and (8),

\[ \frac{E[f_1]}{n} \approx \frac{1}{n} \sum_{j=1}^{D} N_j\, q\, (1-q)^{N_j - 1}. \tag{12} \]

Approximating each $N_j$ in (12) by $\overline{N}$, we have

\[ \frac{E[f_1]}{n} \approx (1-q)^{\overline{N} - 1}. \tag{13} \]

On the right side of (10), replace $f_1/n$ with the approximate expression for $E[f_1]/n$ given in (13), yielding

\[ \widehat{D} = d_n + D (1-q)^{\overline{N}}. \]

Replacing $D$ with $\widehat{D}$ and $\overline{N}$ with $N/\widehat{D}$ in the foregoing expression leads to the relation

\[ \widehat{D}\,\bigl(1 - (1-q)^{N/\widehat{D}}\bigr) = d_n. \]

We define the smoothed first-order jackknife estimator $\widehat{D}_{\mathrm{sj1}}$ as the value of $\widehat{D}$ that solves this equation. Given $d_n$, $n$, and $N$, $\widehat{D}_{\mathrm{sj1}}$ can be computed numerically using standard root-finding procedures. Observe that if in fact $N_1 = N_2 = \cdots = N_D = N/D$, then

\[ E[d_n] \approx D\,\bigl(1 - (1-q)^{N/D}\bigr). \]

In this case $\widehat{D}_{\mathrm{sj1}}$ can be viewed as a simple method-of-moments estimator obtained by replacing $E[d_n]$ with the estimate $d_n$ and solving for $D$. If, moreover, the sampling fraction $q$ is small enough so that the distribution of $(n_1, n_2, \ldots, n_D)$ is approximately multinomial (see Sec. 3.3), then $\widehat{D}_{\mathrm{sj1}}$ is approximately equal to the maximum likelihood estimator for $D$ (see Good 1950). Observe that both $\widehat{D}_{\mathrm{uj1}}$ and $\widehat{D}_{\mathrm{sj1}}$ are consistent for $D$: $\widehat{D}_{\mathrm{uj1}} \to D$ and $\widehat{D}_{\mathrm{sj1}} \to D$ as $q \to 1$.

3.2.2. Second-Order Estimators

A second-order approximation to $K$ can be derived as follows. Denote by $\gamma^2$ the squared coefficient of variation of the class sizes $N_1, N_2, \ldots, N_D$:

\[ \gamma^2 = \frac{(1/D) \sum_{j=1}^{D} (N_j - \overline{N})^2}{\overline{N}^2}. \tag{14} \]

Suppose that $\gamma^2$ is relatively small, so that each $N_j$ is close to the average value $\overline{N}$. Substitute the Taylor approximations

\[ (1-q)^{N_j} \approx (1-q)^{\overline{N}} + (1-q)^{\overline{N}} \ln(1-q)\,(N_j - \overline{N}) \]
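and

\[ N_j\, q\, (1-q)^{N_j - 1} \approx N_j\, q\, \Bigl( (1-q)^{\overline{N} - 1} + (1-q)^{\overline{N} - 1} \ln(1-q)\,(N_j - \overline{N}) \Bigr) \]

for $1 \le j \le D$ into (9) to obtain

\[ K \approx D(1-q)\, \frac{1}{1 + \ln(1-q)\, \overline{N} \gamma^2} \approx D(1-q)\,\bigl(1 - \ln(1-q)\, \overline{N} \gamma^2\bigr). \tag{15} \]

The unknown parameter $\gamma^2$ can be estimated using the following approach (cf. Chao and Lee 1992). With the usual convention that $\binom{n}{m} = 0$ for $n < m$, we find that

\[ \sum_{i=1}^{N} i(i-1) E[f_i] \approx \sum_{i=1}^{N} i(i-1) \sum_{j=1}^{D} \binom{N_j}{i} q^{i} (1-q)^{N_j - i} = q^2 \sum_{j=1}^{D} N_j (N_j - 1) \sum_{i=2}^{N_j} \binom{N_j - 2}{i - 2} q^{i-2} (1-q)^{N_j - i} = q^2 \sum_{j=1}^{D} N_j (N_j - 1), \]

so that

\[ \gamma^2 \approx \frac{D}{n^2} \sum_{i=1}^{N} i(i-1) E[f_i] + \frac{D}{N} - 1. \]

Thus if $D$ were known, then a natural method-of-moments estimator $\widehat{\gamma}^2(D)$ of $\gamma^2$ would be

\[ \widehat{\gamma}^2(D) = \max\Bigl( 0,\; \frac{D}{n^2} \sum_{i=1}^{n} i(i-1) f_i + \frac{D}{N} - 1 \Bigr). \tag{16} \]

To develop a second-order estimate of $D$, substitute (15) into (4) to obtain

\[ \widehat{D} = d_n + \frac{D f_1 (1-q)}{n}\,\bigl(1 - \ln(1-q)\, \overline{N} \gamma^2\bigr), \tag{17} \]

from which it follows that

\[ \widehat{D} = d_n + \frac{D f_1 (1-q)}{n} - \frac{f_1 (1-q) \ln(1-q)\, \gamma^2}{q}. \]

Replacing $D$ with $\widehat{D}$ on the right side of this equation and solving for $\widehat{D}$ yields the relation

\[ \Bigl(1 - \frac{f_1 (1-q)}{n}\Bigr) \widehat{D} = d_n - \frac{f_1 (1-q) \ln(1-q)\, \gamma^2}{q}. \tag{18} \]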
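Equations (16) and (18) translate directly into code. The sketch below is a hedged illustration: `gamma2_hat` evaluates (16) for a supplied value of $D$, and `solve_eq18` solves relation (18) for $\widehat{D}$ given some value of $\gamma^2$. Both function names and the toy frequency counts are assumptions; the code also assumes $n < N$, since $\ln(1-q)$ is undefined at $q = 1$.

```python
from math import log

def gamma2_hat(f, n, N, D):
    """Method-of-moments estimate (16) of the squared coefficient of
    variation of the class sizes, for a given value of D.
    f maps i to f_i (classes seen exactly i times)."""
    s = sum(i * (i - 1) * fi for i, fi in f.items())
    return max(0.0, D / n ** 2 * s + D / N - 1.0)

def solve_eq18(d, f1, n, N, gamma2):
    """Solve relation (18) for D-hat, given a value of gamma^2 (requires n < N)."""
    q = n / N
    rhs = d - f1 * (1.0 - q) * log(1.0 - q) * gamma2 / q
    return rhs / (1.0 - f1 * (1.0 - q) / n)

# Hypothetical frequency counts: two singletons, one pair, one triple.
f = {1: 2, 2: 1, 3: 1}
g2_at_10 = gamma2_hat(f, n=7, N=70, D=10)   # 38/49, about 0.776
g2_at_5 = gamma2_hat(f, n=7, N=70, D=5)     # truncated at zero by the max
d_no_skew = solve_eq18(d=4, f1=2, n=7, N=70, gamma2=0.0)
d_skewed = solve_eq18(d=4, f1=2, n=7, N=70, gamma2=0.5)
```

With $\gamma^2 = 0$ the solution of (18) collapses to the first-order form (11); a positive $\gamma^2$ raises the estimate because $\ln(1-q) < 0$.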
An estimator of $D$ can be obtained by substituting $\widehat{\gamma}^2(\widehat{D})$ for $\gamma^2$ in (18) and solving for $\widehat{D}$ numerically. Alternatively, we can start with a simple initial estimator of $D$ and then correct this estimator using (18). Following this latter approach, we use $\widehat{D}_{\mathrm{uj1}}$ as our initial estimator and define

\[ \widehat{D}_{\mathrm{uj2}} = \Bigl(1 - \frac{f_1 (1-q)}{n}\Bigr)^{-1} \Bigl( d_n - \frac{f_1 (1-q) \ln(1-q)\, \widehat{\gamma}^2(\widehat{D}_{\mathrm{uj1}})}{q} \Bigr). \]

A smoothed second-order jackknife estimator can be obtained by replacing the expression $f_1/n$ in (17) with the approximation to $E[f_1]/n$ given in (13), leading to

\[ \widehat{D} = d_n + D (1-q)^{\overline{N}}\,\bigl(1 - \ln(1-q)\, \overline{N} \gamma^2\bigr). \]

Replacing $D$ with $\widehat{D}$ and proceeding as before, we obtain the estimator

\[ \widehat{D}_{\mathrm{sj2}} = \bigl(1 - (1-q)^{\widetilde{N}}\bigr)^{-1} \Bigl( d_n - (1-q)^{\widetilde{N}} \ln(1-q)\, N\, \widehat{\gamma}^2(\widehat{D}_{\mathrm{uj1}}) \Bigr), \]

where $\widetilde{N} = N/\widehat{D}_{\mathrm{uj1}}$. As with the first-order estimators $\widehat{D}_{\mathrm{uj1}}$ and $\widehat{D}_{\mathrm{sj1}}$, the second-order estimators $\widehat{D}_{\mathrm{uj2}}$ and $\widehat{D}_{\mathrm{sj2}}$ are consistent for $D$.

3.2.3. Horvitz-Thompson Jackknife Estimators

In this section we discuss an alternative approach to estimation of $K$ based on a technique of Horvitz and Thompson. (See Särndal, Swensson, and Wretman 1992 for a general discussion of Horvitz-Thompson estimators.) First, consider the general problem of estimating a parameter of the form $\theta(g) = \sum_{j=1}^{D} g(N_j)$, where $g$ is a specified function. Observe that because $P\{n_j > 0\} > 0$ for $1 \le j \le D$, we have $\theta(g) = E[X(g)]$, where

\[ X(g) = \sum_{j=1}^{D} \frac{g(N_j)\, I(n_j > 0)}{P\{n_j > 0\}} = \sum_{\{j:\, n_j > 0\}} \frac{g(N_j)}{P\{n_j > 0\}}. \]

It follows from (8) that $P\{n_j > 0\} \approx 1 - (1-q)^{N_j}$, and the foregoing discussion suggests that we estimate $\theta(g)$ by

\[ \widehat{\theta}(g) = \sum_{\{j:\, n_j > 0\}} \frac{g(\widehat{N}_j)}{1 - (1-q)^{\widehat{N}_j}}, \tag{19} \]

where $\widehat{N}_j$ is an estimator for $N_j$. The key point is that we need to estimate $N_j$ only when $n_j > 0$. To do this, observe that

\[ E[n_j \mid n_j > 0] = \frac{E[n_j]}{P\{n_j > 0\}} \approx \frac{q N_j}{1 - (1-q)^{N_j}}. \]
Replacing $E[n_j \mid n_j > 0]$ with $n_j$ leads to the estimating equation

\[ n_j = \frac{q N_j}{1 - (1-q)^{N_j}}, \tag{20} \]

and a method-of-moments estimator $\widehat{N}_j$ can be defined as the value of $N_j$ that solves (20).

Now consider the problem of estimating $K$, and hence $D$. By (9), $K \approx \theta(f)/\theta(g)$, where $f(x) = (1-q)^{x}$ and $g(x) = x q (1-q)^{x-1}/n$. Thus a natural estimator of $K$ is given by $\widehat{\theta}(f)/\widehat{\theta}(g)$, leading to the final estimator

\[ \widehat{D}_{\mathrm{HTj}} = d_n + \frac{\widehat{\theta}(f)}{\widehat{\theta}(g)}\, \frac{f_1}{n}. \]

A smoothed variant of $\widehat{D}_{\mathrm{HTj}}$ can be obtained by replacing $f_1/n$ with the Horvitz-Thompson estimator of $E[f_1]/n$, namely $\widehat{\theta}(g)$. The resulting estimator, denoted by $\widehat{D}_{\mathrm{HTsj}}$, is given by

\[ \widehat{D}_{\mathrm{HTsj}} = d_n + \widehat{\theta}(f). \]

Finally, a hybrid estimator can be obtained using a first-order approximation for the numerator of $K$ and a Horvitz-Thompson estimator for the denominator. This leads to the estimator $\widehat{D}_{\mathrm{hj}}$, defined as the solution $\widehat{D}$ of the equation

\[ \widehat{D}\, \Bigl( 1 - \frac{f_1 (1-q)^{N/\widehat{D}}}{n\, \widehat{\theta}(g)} \Bigr) = d_n. \]

If we replace $f_1/n$ with the Horvitz-Thompson estimator for $E[f_1]/n$ in the foregoing equation in order to obtain a smoothed variant of $\widehat{D}_{\mathrm{hj}}$, then the resulting estimator coincides with $\widehat{D}_{\mathrm{sj1}}$.

Because $D = \theta(u)$, where $u(x) \equiv 1$, it may appear that a "non-jackknife" Horvitz-Thompson estimator $\widehat{D}_{\mathrm{HT}}$ can be defined by setting $\widehat{D}_{\mathrm{HT}} = \widehat{\theta}(u)$. It is straightforward to show, however, that $\widehat{D}_{\mathrm{HT}} = \widehat{D}_{\mathrm{HTsj}}$, so that $\widehat{D}_{\mathrm{HT}}$ can in fact be viewed as a smoothed jackknife estimator.

Simulation experiments indicate that the behavior of the Horvitz-Thompson jackknife estimators $\widehat{D}_{\mathrm{HTj}}$ and $\widehat{D}_{\mathrm{HTsj}}$ is erratic (see App. D for detailed results). Overall, the poor performance of $\widehat{D}_{\mathrm{HTj}}$ and $\widehat{D}_{\mathrm{HTsj}}$ is caused by the inaccuracy of $\widehat{\theta}(f)$. The problem seems to be that when $N_j$ is small, the estimator $\widehat{N}_j$ is unstable and yet typically has a large effect on the value of $\widehat{\theta}(f)$ through the term $(1-q)^{\widehat{N}_j} / \bigl(1 - (1-q)^{\widehat{N}_j}\bigr)$. The estimator $\widehat{D}_{\mathrm{hj}}$ uses a Taylor approximation in place of $\widehat{\theta}(f)$ and hence has lower bias and rmse than the other two Horvitz-Thompson jackknife estimators.
However, other estimators perform better than $\widehat{D}_{\mathrm{hj}}$, and we do not consider the estimators $\widehat{D}_{\mathrm{HTj}}$, $\widehat{D}_{\mathrm{HTsj}}$, and $\widehat{D}_{\mathrm{hj}}$ further.
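Although these Horvitz-Thompson estimators are not pursued further, the estimating equation (20) and the sums in (19) are easy to prototype. The sketch below solves (20) for each observed class by bisection (one possible root-finding choice, not necessarily the one used by the authors) and then evaluates $\widehat{\theta}(u)$ and $\widehat{\theta}(f)$. All function names and the toy counts are assumptions, and the code requires $0 < q < 1$.

```python
from math import expm1, log1p

def seen_probability(x, q):
    """Approximation 1 - (1 - q)**x of P{n_j > 0} for a class of size x,
    computed in a numerically stable way."""
    return -expm1(x * log1p(-q))

def solve_class_size(n_j, q, rel_tol=1e-12):
    """Solve n_j = q*x / (1 - (1-q)**x) for x by bisection (equation (20)).
    The left bracket is close to zero; the right bracket n_j/q overshoots."""
    lo, hi = 1e-9, n_j / q
    while hi - lo > rel_tol * hi:
        mid = 0.5 * (lo + hi)
        if q * mid / seen_probability(mid, q) < n_j:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def horvitz_thompson_sums(sample_counts, q):
    """Return (theta_hat(u), theta_hat(f)) from (19), where u(x) = 1 and
    f(x) = (1-q)**x, using the class-size estimates from (20)."""
    sizes = [solve_class_size(n_j, q) for n_j in sample_counts]
    theta_u = sum(1.0 / seen_probability(x, q) for x in sizes)
    theta_f = sum((1.0 - q) ** x / seen_probability(x, q) for x in sizes)
    return theta_u, theta_f

# Hypothetical sample: four classes observed 1, 1, 2, and 3 times, q = 0.1.
counts = [1, 1, 2, 3]
theta_u, theta_f = horvitz_thompson_sums(counts, q=0.1)
size_for_singleton = solve_class_size(1, 0.1)
size_for_triple = solve_class_size(3, 0.1)
```

In this sketch a class seen once is assigned an estimated size of exactly 1, so each singleton contributes $1/q$ to $\widehat{\theta}(u)$, which hints at why these estimators can be unstable. The output also satisfies $\widehat{\theta}(u) = d_n + \widehat{\theta}(f)$, the identity behind $\widehat{D}_{\mathrm{HT}} = \widehat{D}_{\mathrm{HTsj}}$.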
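3.3. Relation to Estimators Based on Sample Coverage

The generalized jackknife approach for deriving an estimator of $D$ works for sample designs other than hypergeometric sampling. For example, the most thoroughly studied version of the number-of-classes problem is that in which the population is assumed to be infinite and $\mathbf{n}$ is assumed to have a multinomial distribution with parameter vector $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_D)$; that is,

\[ P(\mathbf{n} \mid D, \boldsymbol{\pi}) = \binom{n}{n_1\; n_2\; \cdots\; n_D}\, \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_D^{n_D}. \tag{21} \]

When we proceed as in Section 3.1 to derive a generalized jackknife estimator under the model in (21), the estimator turns out to be nearly identical to the "coverage-based" estimator proposed by Chao and Lee (1992). To see this, start again with (4) and select $K$ as in (5). Because

\[ E[d_n] - D = -\sum_{j=1}^{D} (1 - \pi_j)^{n} \]

under the model in (21), it follows that

\[ K = \frac{\sum_{j=1}^{D} v_n(\pi_j)}{\sum_{j=1}^{D} \pi_j\, v_{n-1}(\pi_j)}, \]

where $v_n(x) = (1-x)^{n}$. Set $\overline{\pi} = 1/D$ and use the Taylor approximations

\[ v_n(\pi_j) \approx v_n(\overline{\pi}) + (\pi_j - \overline{\pi})\, v_n'(\overline{\pi}) \]

and

\[ \pi_j\, v_{n-1}(\pi_j) \approx \pi_j \bigl( v_{n-1}(\overline{\pi}) + (\pi_j - \overline{\pi})\, v_{n-1}'(\overline{\pi}) \bigr) \]

in a manner analogous to the derivation in Section 3.2.2 to obtain

\[ K \approx (D - 1) + (n - 1)\gamma^2, \tag{22} \]

where $\gamma^2 = -1 + D \sum_{j=1}^{D} \pi_j^2$ is the squared coefficient of variation of the numbers $\pi_1, \pi_2, \ldots, \pi_D$. Denote by $\widehat{D}_{\mathrm{mult}}$ the estimator of $D$ under the multinomial model. Then, by (4),

\[ \widehat{D}_{\mathrm{mult}} = d_n + \bigl( (D - 1) + (n - 1)\gamma^2 \bigr)\, \frac{f_1}{n}. \tag{23} \]

Replace $D$ with $\widehat{D}_{\mathrm{mult}}$ and $\gamma^2$ with an estimator $\widetilde{\gamma}^2$ in (23) and solve for $\widehat{D}_{\mathrm{mult}}$ to obtain

\[ \widehat{D}_{\mathrm{mult}} = \frac{d_n}{\widehat{C}} + \frac{1 - \widehat{C}}{\widehat{C}}\, \bigl( (n - 1)\widetilde{\gamma}^2 - 1 \bigr), \]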
where $\widehat{C} = 1 - (f_1/n)$. When the sample size $n$ is large, the estimator $\widehat{D}_{\mathrm{mult}}$ is essentially the same as the estimator

\[ \widehat{D}_{\mathrm{CL}} = \frac{d_n}{\widehat{C}} + \frac{n (1 - \widehat{C})}{\widehat{C}}\, \widetilde{\gamma}^2 \]

proposed by Chao and Lee (1992). The estimator $\widehat{D}_{\mathrm{CL}}$ was developed from a different point of view, using the concept of sample coverage. The sample coverage for an infinite population is defined as $\sum_{j=1}^{D} \pi_j I[n_j > 0]$, and the quantity $\widehat{C} = 1 - (f_1/n)$ is a standard estimator of the sample coverage. Conversely, when Chao and Lee's derivation is modified to account for hypergeometric sampling, the resulting estimator is equal to $\widehat{D}_{\mathrm{uj2}}$ (see App. B). Thus at least some estimators based on sample coverage can be viewed as generalized jackknife estimators.

3.4. Relation to Shlosser's Estimator

Observe that the estimator $\widehat{D}_{\mathrm{Sh}}$, though not developed from a jackknife perspective, can be viewed as an estimator of the form (4) with $K$ estimated by

\[ \widehat{K}_{\mathrm{Sh}} = \frac{n \sum_{i=1}^{n} (1-q)^{i} f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i}. \]

To analyze the behavior of $\widehat{D}_{\mathrm{Sh}}$, we first rewrite the jackknife quantity $K$ defined in (9) as follows:

\[ K = \frac{n \sum_{i=1}^{N} (1-q)^{i} F_i}{\sum_{i=1}^{N} i q (1-q)^{i-1} F_i}. \tag{24} \]

Shlosser's justification of $\widehat{D}_{\mathrm{Sh}}$ assumes that

\[ \frac{E[f_i]}{E[f_1]} \approx \frac{F_i}{F_1} \tag{25} \]

for $1 \le i \le N$. When the assumption in (25) holds and the sample size is large enough so that

\[ f_i \approx E[f_i] \tag{26} \]

for $1 \le i \le N$, we have

\[ \widehat{K}_{\mathrm{Sh}} \approx \frac{n \sum_{i=1}^{N} (1-q)^{i} E[f_i]}{\sum_{i=1}^{N} i q (1-q)^{i-1} E[f_i]} = \frac{n \sum_{i=1}^{N} (1-q)^{i} \bigl( E[f_i]/E[f_1] \bigr)}{\sum_{i=1}^{N} i q (1-q)^{i-1} \bigl( E[f_i]/E[f_1] \bigr)} \approx \frac{n\, F_1^{-1} \sum_{i=1}^{N} (1-q)^{i} F_i}{F_1^{-1} \sum_{i=1}^{N} i q (1-q)^{i-1} F_i} = K, \]
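so that $\widehat{D}_{\mathrm{Sh}}$ behaves as a generalized jackknife estimator. Although the relations in (25) and (26) hold exactly for $n = N$ (implying that $\widehat{D}_{\mathrm{Sh}}$ is consistent for $D$), these relations can fail drastically for smaller sample sizes. For example, when $F_1 = 0$ and $F_i > 0$ for some $i > 1$, the right side of (25) is infinite, whereas the left side is finite for $n$ sufficiently small. This observation leads one to expect that $\widehat{D}_{\mathrm{Sh}}$ will not perform well when the sample size is relatively small and $N_1, N_2, \ldots, N_D$ have similar values (with $N_j > 1$ for each $j$). Both the variance analysis in Section 4 and the simulation experiments described in Section 6 bear out this conjecture.

The foregoing discussion suggests that replacing $\widehat{K}_{\mathrm{Sh}}$ with

\[ \widehat{K}'_{\mathrm{Sh}} = K\, \frac{\widehat{K}_{\mathrm{Sh}}}{E[\widehat{K}_{\mathrm{Sh}}]} \tag{27} \]

in the formula for $\widehat{D}_{\mathrm{Sh}}$ might result in an improved estimator, because $\widehat{K}'_{\mathrm{Sh}}$ is unbiased for $K$. Of course we cannot perform this replacement exactly, since $K$ and $E[\widehat{K}_{\mathrm{Sh}}]$ are unknown, but we can approximate $\widehat{K}'_{\mathrm{Sh}}$ as follows. Using the fact that

\[ E[f_r] = \sum_{j=1}^{D} P\{n_j = r\} \approx \sum_{j=1}^{D} \binom{N_j}{r} q^{r} (1-q)^{N_j - r} = \sum_{i=r}^{N} \binom{i}{r} q^{r} (1-q)^{i-r} F_i \tag{28} \]

for $1 \le r \le n$, we have, to first order,

\[ E[\widehat{K}_{\mathrm{Sh}}] \approx \frac{n \sum_{i=1}^{N} (1-q)^{i} E[f_i]}{\sum_{i=1}^{N} i q (1-q)^{i-1} E[f_i]} = \frac{n \sum_{i=1}^{N} (1-q)^{i} \bigl( (1+q)^{i} - 1 \bigr) F_i}{\sum_{i=1}^{N} i q^2 (1-q^2)^{i-1} F_i}. \tag{29} \]

Using the first-order approximation $N_1 = N_2 = \cdots = N_D = \overline{N}$ together with (24), (27), and (29), we find that

\[ \widehat{K}'_{\mathrm{Sh}} \approx \Bigl( \frac{q (1+q)^{\overline{N} - 1}}{(1+q)^{\overline{N}} - 1} \Bigr) \widehat{K}_{\mathrm{Sh}}. \]

We thus obtain a modified Shlosser estimator given by

\[ \widehat{D}_{\mathrm{Sh2}} = d_n + f_1 \Bigl( \frac{q (1+q)^{\widetilde{N} - 1}}{(1+q)^{\widetilde{N}} - 1} \Bigr) \Bigl( \frac{\sum_{i=1}^{n} (1-q)^{i} f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i} \Bigr), \]

where $\widetilde{N}$ is an initial estimate of $\overline{N}$ based on an initial estimate of $D$. We set $\widetilde{N}$ equal to $N/\widehat{D}_{\mathrm{uj1}}$ throughout. As with $\widehat{D}_{\mathrm{Sh}}$, the estimator $\widehat{D}_{\mathrm{Sh2}}$ is consistent for $D$.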
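Shlosser's estimator and the modified version $\widehat{D}_{\mathrm{Sh2}}$ share the same ratio of sums and differ only by the correction factor $q(1+q)^{\widetilde{N}-1}/((1+q)^{\widetilde{N}}-1)$. The Python sketch below implements both from the frequency counts; the function names, the early return for $f_1 = 0$ (which avoids a zero denominator when $q = 1$), and the toy data are assumptions made for illustration.

```python
def shlosser_ratio(f, q):
    """Ratio sum (1-q)^i f_i / sum i q (1-q)^(i-1) f_i shared by both estimators."""
    num = sum((1.0 - q) ** i * fi for i, fi in f.items())
    den = sum(i * q * (1.0 - q) ** (i - 1) * fi for i, fi in f.items())
    return num / den

def d_sh(f, n, N):
    """Shlosser's estimator; f maps i to f_i."""
    d, f1 = sum(f.values()), f.get(1, 0)
    if f1 == 0:
        return float(d)              # the correction term vanishes
    return d + f1 * shlosser_ratio(f, n / N)

def d_sh2(f, n, N):
    """Modified Shlosser estimator, with N-tilde = N / D_uj1 as in the text."""
    q = n / N
    d, f1 = sum(f.values()), f.get(1, 0)
    if f1 == 0:
        return float(d)
    d_uj1 = d / (1.0 - (1.0 - q) * f1 / n)        # initial estimate of D, eq. (11)
    n_tilde = N / d_uj1                           # initial estimate of the mean class size
    factor = q * (1.0 + q) ** (n_tilde - 1.0) / ((1.0 + q) ** n_tilde - 1.0)
    return d + f1 * factor * shlosser_ratio(f, q)

# Hypothetical frequency counts with n = 7 sampled from N = 70 (q = 0.1).
f = {1: 2, 2: 1, 3: 1}
est_sh = d_sh(f, n=7, N=70)
est_sh2 = d_sh2(f, n=7, N=70)
```

Because the correction factor is at most 1 whenever $\widetilde{N} \ge 1$, the modified estimate never exceeds Shlosser's original estimate for the same data; in this toy example the original formula gives roughly 14.7 while the modified one stays close to the first-order jackknife value.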
An alternative consistent estimator of $D$ can be obtained by directly using the expressions in (24), (27), and (29) with $F_i$ estimated by

\[ \widehat{F}_i = \frac{f_1 f_i}{\sum_{m=1}^{n} m q (1-q)^{m-1} f_m} \tag{30} \]

for $1 \le i \le N$; these estimators of $F_1, F_2, \ldots, F_N$ were proposed by Shlosser (1981) in conjunction with the estimator $\widehat{D}_{\mathrm{Sh}}$. Substituting the resulting estimators of $K$ and $E[\widehat{K}_{\mathrm{Sh}}]$ into (27) leads to the final estimator

\[ \widehat{D}_{\mathrm{Sh3}} = d_n + f_1 \Bigl( \frac{\sum_{i=1}^{n} i q^2 (1-q^2)^{i-1} f_i}{\sum_{i=1}^{n} (1-q)^{i} \bigl( (1+q)^{i} - 1 \bigr) f_i} \Bigr) \Bigl( \frac{\sum_{i=1}^{n} (1-q)^{i} f_i}{\sum_{i=1}^{n} i q (1-q)^{i-1} f_i} \Bigr)^{2}. \]

As with the estimator $\widehat{D}_{\mathrm{Sh}}$, Shlosser's justification of the estimators in (30) rests on the assumption in (25). Thus one might expect that, like $\widehat{D}_{\mathrm{Sh}}$, the estimator $\widehat{D}_{\mathrm{Sh3}}$ will be unstable when the sample size is relatively small and $N_1, N_2, \ldots, N_D$ have similar values. On the other hand, the reduction in bias of $\widehat{K}'_{\mathrm{Sh}}$ relative to $\widehat{K}_{\mathrm{Sh}}$ leads one to expect that $\widehat{D}_{\mathrm{Sh3}}$ will perform better than $\widehat{D}_{\mathrm{Sh}}$ when $\gamma^2$ is sufficiently large. (One might be tempted to avoid the assumption in (25) when estimating $F_1, F_2, \ldots, F_N$ by taking a method-of-moments approach: replace $E[f_r]$ with $f_r$ in (28) for $1 \le r \le n$ and solve the resulting set of linear equations either exactly or approximately. As pointed out by Shlosser (1981), however, this system of equations is nearly singular, and hence extremely unstable.)

4. Variance and Variance Estimates

Consider an estimator $\widehat{D}$ that is a function of the sample only through $\mathbf{f} = (f_1, f_2, \ldots, f_M)$, where $M = \max(N_1, N_2, \ldots, N_D)$. All of the estimators introduced in Section 3 are of this type. In general, we also allow $\widehat{D}$ to depend explicitly on the population size $N$ and write $\widehat{D} = \widehat{D}(\mathbf{f}, N)$. Suppose that, for any $N > 0$ and nonnegative $M$-dimensional vector $\mathbf{f} \ne \mathbf{0}$, the function $\widehat{D}$ is continuously differentiable at the point $(\mathbf{f}, N)$ and

\[ \widehat{D}(c\mathbf{f}, cN) = c\, \widehat{D}(\mathbf{f}, N) \tag{31} \]

for $c > 0$.
Approximating the hypergeometric sample design by a Bernoulli sample design as in (8), we can obtain the following approximate expression for the asymptotic variance of $\widehat{D}(\mathbf{f}, N)$ as $D$ becomes large:

\[ \mathrm{AVar}[\widehat{D}(\mathbf{f}, N)] \approx \sum_{i=1}^{M} A_i^2\, \mathrm{Var}[f_i] + \sum_{\substack{1 \le i, i' \le M \\ i \ne i'}} A_i A_{i'}\, \mathrm{Cov}[f_i, f_{i'}], \tag{32} \]

where $A_i$ is the partial derivative of $\widehat{D}$ with respect to $f_i$, evaluated at the point $(\mathbf{f}, N)$. (When computing each $A_i$, we replace each occurrence of $n$ and $d$ in the formula for $\widehat{D}$ by
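$\sum_{i=1}^{M} i f_i$ and $\sum_{i=1}^{M} f_i$ before taking derivatives.) The approximation in (32) is valid when there is not too much variability in the class sizes (see App. C for a precise formulation and proof of this result). It follows from the proof that, to a good approximation, the variance of an estimator $\widehat{D}$ satisfying (31) increases linearly as $D$ increases.

Straightforward calculations show that each of the specific estimators $\widehat{D}_{\mathrm{uj1}}$, $\widehat{D}_{\mathrm{uj2}}$, $\widehat{D}_{\mathrm{Sh}}$, $\widehat{D}_{\mathrm{Sh2}}$, and $\widehat{D}_{\mathrm{Sh3}}$ is continuously differentiable as stated previously and also satisfies (31). Thus we can use (32) to study the asymptotic variance of these estimators. We focus on $\widehat{D}_{\mathrm{uj1}}$, $\widehat{D}_{\mathrm{uj2}}$, $\widehat{D}_{\mathrm{Sh2}}$, and $\widehat{D}_{\mathrm{Sh3}}$ because each of these estimators performs best for at least one population studied in the simulation experiments described in Section 6; we also consider $\widehat{D}_{\mathrm{Sh}}$, because $\widehat{D}_{\mathrm{Sh}}$ is the most useful of the estimators previously proposed in the literature.

Computation of the $A_i$ coefficients for each estimator is tedious, but straightforward. When $\widehat{D} = \widehat{D}_{\mathrm{uj2}}$, for example, differentiating with $q$ and $N$ held fixed (and with $\widehat{\gamma}^2 > 0$), we obtain

\[ A_1^{(\mathrm{uj2})} = A_1^{(\mathrm{uj1})} - \frac{N(1-q)\ln(1-q)}{n - (1-q) f_1} \Biggl[ \widehat{\gamma}^2 + f_1 \Biggl( \frac{A_1^{(\mathrm{uj1})}}{\widehat{D}_{\mathrm{uj1}}} \bigl( \widehat{\gamma}^2 + 1 \bigr) - \frac{2}{n} \Bigl( \widehat{\gamma}^2 + 1 - \frac{\widehat{D}_{\mathrm{uj1}}}{N} \Bigr) + \frac{\widehat{\gamma}^2}{n} - \frac{q\, \widehat{\gamma}^2}{n - (1-q) f_1} \Biggr) \Biggr] \]

and

\[ A_i^{(\mathrm{uj2})} = A_i^{(\mathrm{uj1})} - \frac{N(1-q)\ln(1-q)}{n - (1-q) f_1}\, f_1 \Biggl( \frac{A_i^{(\mathrm{uj1})}}{\widehat{D}_{\mathrm{uj1}}} \bigl( \widehat{\gamma}^2 + 1 \bigr) - \frac{2i}{n} \Bigl( \widehat{\gamma}^2 + 1 - \frac{\widehat{D}_{\mathrm{uj1}}}{N} \Bigr) + \frac{i(i-1)\, \widehat{D}_{\mathrm{uj1}}}{n^2} + \frac{i\, \widehat{\gamma}^2}{n} - \frac{i\, \widehat{\gamma}^2}{n - (1-q) f_1} \Biggr) \]

for $1 < i \le n$, where $\widehat{\gamma}^2 = \widehat{\gamma}^2(\widehat{D}_{\mathrm{uj1}})$,

\[ A_1^{(\mathrm{uj1})} = \widehat{D}_{\mathrm{uj1}} \Bigl( \frac{1}{d} + \frac{(1-q)\bigl(1 - f_1/n\bigr)}{n - (1-q) f_1} \Bigr), \]

and

\[ A_i^{(\mathrm{uj1})} = \widehat{D}_{\mathrm{uj1}} \Bigl( \frac{1}{d} - \frac{i (1-q) (f_1/n)}{n - (1-q) f_1} \Bigr) \]

for $1 < i \le n$.

Figures 1 and 2 compare the variances of the estimators $\widehat{D}_{\mathrm{uj1}}$, $\widehat{D}_{\mathrm{uj2}}$, $\widehat{D}_{\mathrm{Sh}}$, $\widehat{D}_{\mathrm{Sh2}}$, and $\widehat{D}_{\mathrm{Sh3}}$ for a number of populations with equal class sizes. For these special populations, $\widehat{D}_{\mathrm{uj1}}$ and $\widehat{D}_{\mathrm{uj2}}$ are approximately unbiased, so that the relative variances of these estimators are appropriate measures of relative performance. It is particularly instructive to compare the variance of $\widehat{D}_{\mathrm{uj1}}$ and $\widehat{D}_{\mathrm{uj2}}$, since $\widehat{D}_{\mathrm{uj2}}$ is obtained from $\widehat{D}_{\mathrm{uj1}}$ by adjusting the latter estimator to compensate for bias induced by the assumption of equal class sizes.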
This adjustment is unnecessary for our special populations, and a comparison allows evaluation of the penalty (i.e., the increase in variance) that is being paid for the adjustment.
[Figure 1: Standard deviation of $\widehat{D}_{\mathrm{uj1}}$, $\widehat{D}_{\mathrm{uj2}}$, $\widehat{D}_{\mathrm{Sh}}$, $\widehat{D}_{\mathrm{Sh2}}$, and $\widehat{D}_{\mathrm{Sh3}}$ plotted against the sampling fraction $q$ ($N = 15{,}000$ and $\overline{N} = 10$).]

[Figure 2: Standard deviation of $\widehat{D}_{\mathrm{uj1}}$, $\widehat{D}_{\mathrm{uj2}}$, and $\widehat{D}_{\mathrm{Sh2}}$ plotted against the class size $\overline{N}$ ($D = 1500$ and $q = 0.10$).]

Figure 1 displays the standard deviations of $\widehat{D}_{\mathrm{uj1}}$, $\widehat{D}_{\mathrm{uj2}}$, $\widehat{D}_{\mathrm{Sh}}$, $\widehat{D}_{\mathrm{Sh2}}$, and $\widehat{D}_{\mathrm{Sh3}}$ for an equal-class-size population with $N = 15{,}000$ and $D = 1500$ (so that $\overline{N} = 10$) as the sampling fraction $q$ varies. Observe that $\widehat{D}_{\mathrm{uj2}}$ is only slightly less efficient than $\widehat{D}_{\mathrm{uj1}}$, so that the penalty for bias adjustment is small in this case. Performance of the estimators $\widehat{D}_{\mathrm{uj1}}$ and $\widehat{D}_{\mathrm{Sh2}}$ is nearly indistinguishable. The most striking observation is that for this population, $\widehat{D}_{\mathrm{Sh}}$ and $\widehat{D}_{\mathrm{Sh3}}$ are not competitive with the other three estimators. The relative performance of $\widehat{D}_{\mathrm{Sh}}$ and $\widehat{D}_{\mathrm{Sh3}}$ is especially poor for small sampling fractions. On the other hand, the variance analysis indicates that modification of $\widehat{D}_{\mathrm{Sh}}$ as in (27) and (29) indeed reduces the instability of the original Shlosser estimator in this case. Thus we focus on the estimators $\widehat{D}_{\mathrm{uj1}}$, $\widehat{D}_{\mathrm{uj2}}$, and $\widehat{D}_{\mathrm{Sh2}}$ in the remainder of this section and in the next section. (We return to the estimator $\widehat{D}_{\mathrm{Sh3}}$ in Section 6, where our simulation experiments indicate that $\widehat{D}_{\mathrm{Sh3}}$ can exhibit smaller rmse than the other estimators, but only at large sample sizes and for certain "ill-conditioned" populations in which $\gamma^2$ is extremely large.)

Figure 2 compares the three estimators $\widehat{D}_{\mathrm{uj1}}$, $\widehat{D}_{\mathrm{uj2}}$, and $\widehat{D}_{\mathrm{Sh2}}$ for equal-class-size populations with a range of class sizes; for these calculations the number of classes and the sampling fraction are held constant at $D = 1500$ and $q = 0.10$. This figure illustrates the difficulty of precisely estimating $D$ when the class size is small (but greater than 1). Again, we see that these three estimators perform similarly, with nearly equal variability when $\overline{N}$ exceeds about 40.
We checked the accuracy of the variance approximation in some example populations by comparing the values computed from (32) with results of a simulation experiment. (This experiment is discussed more completely in Section 6 below.) Simulated sampling with $q = 0.05$, $0.10$, and $0.20$ from the population examined in Figure 1 ($N = 15{,}000$, $D = 1500$) yields variance estimates within 10% (on average) of those calculated from (32). Similar results were found in sampling from an equal-class-size population with $N = 15{,}000$ and $D = 150$. The only difficulties we encountered occurred for equal-class-size populations with
class sizes of $\overline{N} = 1$ and $\overline{N} = 2$. For these small class sizes the variance approximation, which is based on the approximation of the hypergeometric sample design by a Bernoulli sample design, is not sufficiently accurate. In particular, the approximate variance strongly reflects random fluctuations in the sample size due to the Bernoulli sample design; such fluctuations are not present in the actual hypergeometric sample design. Simulation experiments indicate that for $\overline{N} \ge 3$ the differences caused by Bernoulli versus hypergeometric sampling become negligible. (Of course, if the sample design is in fact Bernoulli, then this problem does not occur.)

In practice, we estimate the asymptotic variance of an estimator $\hat{D}$ by substituting estimates for $\{\mathrm{Var}[f_i] : 1 \le i \le M\}$ and $\{\mathrm{Cov}[f_i, f_{i'}] : 1 \le i \ne i' \le M\}$ into (32). To obtain such estimates, we approximate the true population by a population with $D$ classes, each of size $N/D$. Under this approximation and the assumption in (8) of a Bernoulli sample design, the random vector $f$ has a multinomial distribution with parameters $D$ and $p = (p_1, p_2, \ldots, p_n)$, where

$$p_i = \binom{N/D}{i} q^i (1-q)^{(N/D)-i}$$

for $1 \le i \le n$. It follows that $\mathrm{Var}[f_i] = D p_i (1 - p_i)$ and $\mathrm{Cov}[f_i, f_{i'}] = -D p_i p_{i'}$. Each $p_i$ can be estimated either by

$$\hat{p}_i = \binom{N/\hat{D}}{i} q^i (1-q)^{(N/\hat{D})-i}$$

or simply by $f_i/\hat{D}$. It turns out that the latter formula yields better variance estimates, and so we take

$$\widehat{\mathrm{Var}}[f_i] = f_i \left(1 - \frac{f_i}{\hat{D}}\right) \quad \text{and} \quad \widehat{\mathrm{Cov}}[f_i, f_{i'}] = -\frac{f_i f_{i'}}{\hat{D}}$$

for $1 \le i, i' \le n$. These formulas coincide with the estimators obtained using the "unconditional approach" of Chao and Lee (1992). A computer program that calculates $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{uj2}}$, $\hat{D}_{\mathrm{Sh2}}$ and their estimated standard errors from sample data can be obtained from the second author.

5. An Example

The following example illustrates how knowledge of the population size $N$ can affect estimates of the number of classes. When the population size $N$ is unknown, Chao and Lee (1992, Sec. 3) have proposed that the estimator $\hat{D}_{\mathrm{CL}}$ defined in Section 3.3 be used to
estimate the number of classes, because the formula for $\hat{D}_{\mathrm{CL}}$ does not involve the unknown parameter $N$. When $N$ is known, a slight modification of the derivation of $\hat{D}_{\mathrm{CL}}$ leads to the unsmoothed second-order jackknife estimator $\hat{D}_{\mathrm{uj2}}$ (see App. B).

Our example is based on one discussed by Chao and Lee (1992), who borrowed data first described and analyzed by Holst (1981). These data arose from an application in numismatics in which 204 ancient coins were classified according to die type in order to estimate the number of different dies used in the minting process. Among the die types on the reverse sides of the 204 coins were 156 singletons, 19 pairs, 2 triplets, and 1 quadruplet ($f_1 = 156$, $f_2 = 19$, $f_3 = 2$, $f_4 = 1$, $d = 178$). Because the total number of coins minted is unknown in this case, model (1) is inappropriate for analyzing these data. But suppose that the same data had arisen from an application in which $N$ was known. For example, suppose that the data were obtained by selecting a simple random sample of 204 names from a sampling frame that had been constructed by combining 5 lists of 200 names each ($N = 1000$), 50 lists of 200 names each ($N = 10{,}000$), or 500 lists of 200 names each ($N = 100{,}000$). In each case our object is to estimate the number of unique individuals on the combined list, based on the sample results. We focus on the three estimators $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{uj2}}$, and $\hat{D}_{\mathrm{Sh2}}$. The estimates for the three cases are given in Table 1; the standard errors displayed in Table 1 are estimated using the procedure outlined in Section 4. We would expect similar inferences to be made from the same data under the multinomial model and the finite population model when $N$ is very large. Indeed, the value $\hat{D}_{\mathrm{uj2}} = 835$ agrees closely with Chao and Lee's estimate $\hat{D}_{\mathrm{CL}} = 844$ (se 187) when $N = 100{,}000$.

Table 1: Values of $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{uj2}}$, and $\hat{D}_{\mathrm{Sh2}}$ for three hypothetical combined lists. (Standard errors are in parentheses.)

N          D_uj1        D_uj2        D_Sh2
1,000      … (47)       … (60)       … (51)
10,000     … (125)      … (161)      … (128)
100,000    … (141)      835 (183)    … (144)
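As a small illustration of the quantities used in this example, the sketch below recovers the sample size and the number of distinct classes from the coin frequency counts, and builds the plug-in covariance matrix of $f$ from the Section 4 formulas $\widehat{\mathrm{Var}}[f_i] = f_i(1 - f_i/\hat{D})$ and $\widehat{\mathrm{Cov}}[f_i, f_{i'}] = -f_i f_{i'}/\hat{D}$. The function name `estimated_cov_f` is hypothetical, and the value 835 is simply the $\hat{D}_{\mathrm{uj2}}$ estimate quoted above for $N = 100{,}000$. Turning this matrix into a standard error also requires the coefficients in (32), which are not reproduced here.

```python
import numpy as np

# Frequency counts from the coin data: f[i-1] = number of classes seen exactly i times.
f = np.array([156, 19, 2, 1], dtype=float)

n = int(sum((i + 1) * fi for i, fi in enumerate(f)))  # sample size
d = int(f.sum())                                      # distinct classes in the sample


def estimated_cov_f(f, d_hat):
    """Plug-in covariance matrix of f (hypothetical helper).

    Diagonal entries are f_i * (1 - f_i / d_hat); off-diagonal entries
    are -f_i * f_j / d_hat, following the Section 4 formulas.
    """
    f = np.asarray(f, dtype=float)
    return np.diag(f) - np.outer(f, f) / d_hat


# Assumption for illustration: use the quoted estimate D_uj2 = 835 as d_hat.
cov = estimated_cov_f(f, d_hat=835.0)

# Sampling fractions q = n / N for the three hypothetical combined lists.
for N in (1_000, 10_000, 100_000):
    print(f"N = {N}: q = {n / N:.5f}")
```

The three sampling fractions show how much more of the population the same 204 names cover when the combined list is small.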
Moreover, when $N = 100{,}000$ we find that $\hat{\gamma}^2(\hat{D}_{\mathrm{uj1}}) \approx 0.13$, which is the same estimate of $\gamma^2$ given by Chao and Lee. As the population size decreases, however, both our assessment of the magnitude of $D$ and our uncertainty about that magnitude decrease, because we are observing a larger and larger fraction of both the population and the classes. The most extreme divergence between the estimate obtained using $\hat{D}_{\mathrm{CL}}$ and estimates obtained using $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{uj2}}$, or $\hat{D}_{\mathrm{Sh2}}$ occurs when the sample consists of all singletons ($f_1 = n$). In that case, $\hat{D}_{\mathrm{CL}} = \infty$, whereas $\hat{D}_{\mathrm{uj1}} = \hat{D}_{\mathrm{uj2}} = \hat{D}_{\mathrm{Sh2}} = N$. This result indicates that when the population size $N$ is known, it is better to use an estimator that exploits knowledge
of $N$ than to sample with replacement and use the estimator $\hat{D}_{\mathrm{CL}}$. In some applications, sampling with replacement is not even an option. For example, the only available sampling mechanism in at least one current database system is a one-pass reservoir algorithm (as in Vitter 1985).

The empirical results in Section 6 indicate that, of the three estimators displayed in Table 1, $\hat{D}_{\mathrm{uj2}}$ is the superior estimator when $\gamma^2$ is small ($< 1$). Thus for our example, $\hat{D}_{\mathrm{uj2}}$ would be the preferred estimator, since $\hat{\gamma}^2(\hat{D}_{\mathrm{uj1}}) \approx 0.13$ in all three cases. Note that $\hat{D}_{\mathrm{uj2}}$ consistently has the highest variance of the three estimators in Table 1. The bias of $\hat{D}_{\mathrm{uj2}}$ is typically lower than that of $\hat{D}_{\mathrm{uj1}}$ or $\hat{D}_{\mathrm{Sh2}}$ when $\gamma^2$ is small, however, so that the overall rmse is lower.

6. Simulation Results

This section describes the results of a simulation study done to compare the performance of the various estimators described in Section 3. Our comparison is based on the performance of the estimators for sampling fractions of 5%, 10%, and 20% in 52 populations. (Initial experiments indicated that the performance of the various estimators is best viewed as a function of sampling fraction, rather than absolute sample size. This is in contrast to estimators of, for example, population averages.)

We consider several sets of populations. The first set comprises synthetic populations of the type considered in the literature. Populations EQ10 and EQ100 have equal class sizes of 10 and 100. In populations NGB/1, NGB/2, and NGB/4, the class sizes follow a negative binomial distribution. Specifically, the fraction $f(m)$ of classes in population NGB/$k$ with class size equal to $m$ is given by

$$f(m) \approx \binom{m-1}{k-1} r^k (1-r)^{m-k}$$

for $m \ge k$, where $r = 0.04$. Chao and Lee (1992) considered populations of this type.

The populations in the second set are meant to be representative of data that could be encountered when a sampling frame for a population census is constructed by combining a number of lists which may contain overlapping entries. Populations GOOD and SUDM were studied by Goodman (1949) and Sudman (1976).
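The negative binomial class-size fractions used for the NGB/$k$ populations can be tabulated directly. The following is a minimal sketch, not the authors' generator: the function names and the truncation point `m_max` are assumptions made for illustration, and the text does not say how the fractions were rounded into integer class counts. Under the stated formula the fractions sum to 1 and the mean class size is $k/r$, that is, 25, 50, and 100 for $k = 1, 2, 4$ when $r = 0.04$.

```python
from math import comb


def ngb_fraction(m: int, k: int, r: float = 0.04) -> float:
    """Fraction of classes of size m in population NGB/k (zero when m < k)."""
    if m < k:
        return 0.0
    return comb(m - 1, k - 1) * r**k * (1.0 - r) ** (m - k)


def ngb_summary(k: int, r: float = 0.04, m_max: int = 5000):
    """Return (sum of fractions, mean class size), truncating at m_max.

    m_max is an illustrative cutoff; the tail beyond it is negligible
    for r = 0.04.
    """
    sizes = range(k, m_max + 1)
    total = sum(ngb_fraction(m, k, r) for m in sizes)
    mean = sum(m * ngb_fraction(m, k, r) for m in sizes)
    return total, mean


for k in (1, 2, 4):
    total, mean = ngb_summary(k)
    print(f"NGB/{k}: fractions sum to {total:.6f}, mean class size {mean:.2f}")
```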
Population FRAME2 mimics a sampling frame that might arise in an administrative records census of the type described in Section 1. One approach to such a census is to augment the usual census address list with a small number of relatively large administrative records files, such as AFDC or Food Stamps, and then estimate the number of distinct individuals on the combined list from a sample. We have constructed FRAME2 so that a given individual can appear at most five times, but most individuals appear exactly once, mimicking the case in which four administrative lists are used to supplement the census address list. Population FRAME3 is similar to FRAME2, but for the FRAME3 population it is assumed that the combined list is made up of a number of small lists (perhaps obtained from neighborhood-level organizations) rather than a few
large lists.

The populations in the third set, denoted by Z20A, Z20B, and Z15, are used to study the behavior of the estimators when the data are extremely ill-conditioned. The class sizes in each of these populations follow a generalized Zipf distribution (see Knuth 1973, p. 398). Specifically, $N_j/N \propto j^{-\theta}$, where $\theta$ equals 1.5 or 2.0. These populations have extremely high values of $\gamma^2$. Descriptive statistics for these three sets of populations are given in Tables 2, 3, and 4. The column entitled "skew" displays the dimensionless coefficient of skewness, which is defined by

$$\mathrm{skew} = \frac{\sum_{j=1}^{D} (N_j - \overline{N})^3 / D}{\left( \sum_{j=1}^{D} (N_j - \overline{N})^2 / D \right)^{3/2}}.$$

Table 2: Characteristics of synthetic populations. (Columns: Name, $N$, $D$, $\gamma^2$, Skew. Rows: EQ10, EQ100, NGB/1, NGB/2, NGB/4.)

Table 3: Characteristics of "merged list" populations. (Columns: Name, $N$, $D$, $\gamma^2$, Skew. Rows: GOOD, FRAME2, FRAME3, SUDM.)

Table 4: Characteristics of "ill-conditioned" populations. (Columns: Name, $N$, $D$, $\gamma^2$, Skew. Rows: Z20A, Z15, Z20B.)

The final set comprises 40 real populations that demonstrate the type of distributions encountered when estimating the number of distinct values of an attribute in a relational database. Specifically, the populations studied correspond to various relational attributes from a database of enrollment records for students at the University of Wisconsin and a database of billing records from a large insurance company. The population size $N$ ranges from 15,469 to 1,654,700, with $D$ ranging from 3 to 1,547,606 and $\gamma^2$ ranging from 0 to … (see App. D for further details). It is notable that values of $\gamma^2$ encountered in the literature (Chao and Lee 1992; Goodman 1949; Shlosser 1981; Sudman 1976) tend not to exceed the value 2, and are typically less than 1, whereas the value of $\gamma^2$ exceeds 2 for more than 50% of the real populations.

For each estimator, population, and sampling fraction, we estimated the bias and rmse by repeatedly drawing a simple random sample from the population, evaluating the estimator, and then computing the error of each estimate. (When evaluating the estimator, we truncated each estimate below at $d$ and above at $N$.)
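The simulation loop just described can be sketched as follows, with bias taken as the average error and rmse as the square root of the averaged squared error, both expressed as a percentage of the true $D$. This is an illustrative harness under stated assumptions, not the authors' program: `naive_estimator` (which simply returns the number of distinct classes in the sample) is a hypothetical stand-in for the estimators of Section 3, and the population used here ($D = 150$ classes of size 10) is an arbitrary equal-class-size example rather than one of the 52 study populations.

```python
import numpy as np


def make_population(class_sizes):
    """Return an array of class labels, one entry per population member."""
    return np.repeat(np.arange(len(class_sizes)), class_sizes)


def naive_estimator(sample, N):
    """Hypothetical stand-in estimator: distinct classes seen in the sample."""
    return float(len(np.unique(sample)))


def simulate(population, D, q, estimator, replications=100, seed=0):
    """Estimate bias and rmse (as % of D) at sampling fraction q."""
    rng = np.random.default_rng(seed)
    N = len(population)
    n = int(round(q * N))
    errors = np.empty(replications)
    for rep in range(replications):
        # Simple random sample without replacement.
        sample = rng.choice(population, size=n, replace=False)
        d = len(np.unique(sample))
        # Truncate the estimate below at d and above at N.
        estimate = min(max(estimator(sample, N), d), N)
        errors[rep] = estimate - D
    bias = errors.mean()
    rmse = np.sqrt((errors ** 2).mean())
    return 100.0 * bias / D, 100.0 * rmse / D


# Illustrative equal-class-size population (sizes chosen for this sketch only).
D = 150
population = make_population([10] * D)
bias_pct, rmse_pct = simulate(population, D, q=0.10, estimator=naive_estimator)
print(f"bias = {bias_pct:.1f}% of D, rmse = {rmse_pct:.1f}% of D")
```

Because the stand-in estimator can never exceed $D$, its bias is negative by construction, which makes it a convenient sanity check for the harness before a real estimator is plugged in.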
The final estimate of bias was obtained by averaging the error over all of the experimental replications, and rmse was
estimated as the square root of the averaged square error. We used 100 replications, which was sufficient to estimate the rmse with a standard error below 5% in nearly all cases; typically the standard error was much less.

Summary results from the simulations are displayed in Tables 5 and 6. Table 5 gives the average and maximum rmse's for each estimator of $D$ over all populations with $0 \le \gamma^2 < 1$, with $1 \le \gamma^2 < 50$, and with $\gamma^2 \ge 50$, as well as the average and maximum rmse's for each estimator over all populations combined. Similarly, Table 6 gives the average and maximum bias for each estimator. In these tables, the rmse and bias are each expressed as a percentage of the true number of classes. Tables 5 and 6 also display the rmse and bias of the estimator $\hat{\gamma}^2(\hat{D}_{\mathrm{uj1}})$ used in the second-order jackknife estimators; the rmse and bias are expressed as a percentage of the true value $\gamma^2$ and are displayed in the column labelled $\hat{\gamma}^2$.

Table 5: Average and maximum rmse (%) for various estimators. (Columns: sample size, $\gamma^2$ range, $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{sj1}}$, $\hat{D}_{\mathrm{uj2}}$, $\hat{D}_{\mathrm{sj2}}$, $\hat{D}_{\mathrm{Sh}}$, $\hat{D}_{\mathrm{Sh2}}$, $\hat{D}_{\mathrm{Sh3}}$, $\hat{\gamma}^2$. Rows: average and maximum for each $\gamma^2$ range and for all populations, at each sampling fraction.)

Comparing Tables 5 and 6 indicates that for each estimator the major component of the rmse is almost always bias, not variance. Thus, even though the standard error can be estimated as in Section 4, this estimated standard error usually does not give an accurate picture of the error in estimation of $D$. Another consequence of the predominance of bias is that when $\gamma^2$ is large, the rmse for the second-order estimator $\hat{D}_{\mathrm{uj2}}$ does not decrease
monotonically as the sampling fraction increases. (In all other cases the rmse decreases monotonically.)

Table 6: Average and maximum bias (%) for various estimators. (Columns: sample size, $\gamma^2$ range, $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{sj1}}$, $\hat{D}_{\mathrm{uj2}}$, $\hat{D}_{\mathrm{sj2}}$, $\hat{D}_{\mathrm{Sh}}$, $\hat{D}_{\mathrm{Sh2}}$, $\hat{D}_{\mathrm{Sh3}}$, $\hat{\gamma}^2$. Rows: average and maximum for each $\gamma^2$ range and for all populations, at each sampling fraction.)

Comparing $\hat{D}_{\mathrm{uj1}}$ with $\hat{D}_{\mathrm{sj1}}$ and then comparing $\hat{D}_{\mathrm{uj2}}$ with $\hat{D}_{\mathrm{sj2}}$, we see that smoothing a first-order jackknife estimator never results in a better first-order estimator. On the other hand, smoothing a second-order jackknife estimator can result in significant performance improvement when $\gamma^2$ is large. Similarly, using higher-order Taylor expansions leads to mixed results. Second-order estimators perform better than first-order estimators when $\gamma^2$ is relatively small, but not when $\gamma^2$ is large. The difficulty is partially that the estimator $\hat{\gamma}^2(\hat{D}_{\mathrm{uj1}})$ tends to underestimate $\gamma^2$ when $\gamma^2$ is large, leading to underestimates of the number of classes. Moreover, the Taylor approximations underlying $\hat{D}_{\mathrm{uj1}}$, $\hat{D}_{\mathrm{sj1}}$, $\hat{D}_{\mathrm{uj2}}$, and $\hat{D}_{\mathrm{sj2}}$ are derived under the assumption of not too much variability between class sizes; this assumption is violated when $\gamma^2$ is large. There apparently is no systematic relation between the coefficient of skewness for the class sizes and the performance of second-order jackknife estimators.

As predicted in Sections 3.4 and 4, the estimators $\hat{D}_{\mathrm{Sh}}$ and $\hat{D}_{\mathrm{Sh3}}$ behave poorly when $\gamma^2$ is relatively small, and $\hat{D}_{\mathrm{Sh3}}$ performs better than $\hat{D}_{\mathrm{Sh}}$ when $\gamma^2$ is large. For small to medium values of $\gamma^2$, the modified estimator $\hat{D}_{\mathrm{Sh2}}$ has a smaller rmse than $\hat{D}_{\mathrm{Sh}}$ or $\hat{D}_{\mathrm{Sh3}}$, and
its performance is comparable to the generalized jackknife estimators. For extremely large values of $\gamma^2$ and also for large sample sizes, the estimator $\hat{D}_{\mathrm{Sh3}}$ has the best performance of the three Shlosser-type estimators. (For a 20% sampling fraction, $\hat{D}_{\mathrm{Sh3}}$ in fact has the lowest average rmse of all the estimators considered.)

Table 7: Average and maximum rmse (%) of $\hat{D}_{\mathrm{uj2}}$, $\hat{D}_{\mathrm{uj2a}}$, $\hat{D}_{\mathrm{Sh2}}$, $\hat{D}_{\mathrm{Sh3}}$, and $\hat{D}_{\mathrm{hybrid}}$. (Columns: sample size, $\gamma^2$ range, and the five estimators. Rows: average and maximum for each $\gamma^2$ range and for all populations, at each sampling fraction.)

As indicated earlier, smoothing can improve the performance of the second-order jackknife estimator $\hat{D}_{\mathrm{uj2}}$. An alternative ad hoc technique for improving performance is to "stabilize" $\hat{D}_{\mathrm{uj2}}$ using a method suggested by Chao, Ma, and Yang (1993). Fix $c \ge 1$ and remove any class whose frequency in the sample exceeds $c$; that is, remove from the sample all members of classes $\{C_j : j \in B\}$, where $B = \{1 \le j \le D : n_j > c\}$. Then compute the estimator $\hat{D}_{\mathrm{uj2}}$ from the reduced sample and subsequently increment it by $|B|$ to produce the final estimate, denoted by $\hat{D}_{\mathrm{uj2a}}$. (Here $|B|$ denotes the number of elements in the set $B$.) When computing $\hat{D}_{\mathrm{uj2}}$ from the reduced sample, take the population size as $N - \sum_{j \in B} \hat{N}_j$, where each $\hat{N}_j$ is a method-of-moments estimator of $N_j$ as in Section […]. If $n - \sum_{j \in B} n_j = 0$, then simply compute $\hat{D}_{\mathrm{uj2}}$ from the full sample. The idea behind this procedure is as follows. When $\gamma^2$ is large, the population consists of a few large classes and many smaller classes. By in effect removing the largest classes from the population,
More informationAnalysis of Experimental Data
Aalysis of Experimetal Data 6544597.0479 ± 0.000005 g Quatitative Ucertaity Accuracy vs. Precisio Whe we make a measuremet i the laboratory, we eed to kow how good it is. We wat our measuremets to be both
More informationPAijpam.eu ON TENSOR PRODUCT DECOMPOSITION
Iteratioal Joural of Pure ad Applied Mathematics Volume 103 No 3 2015, 537545 ISSN: 13118080 (prited versio); ISSN: 13143395 (olie versio) url: http://wwwijpameu doi: http://dxdoiorg/1012732/ijpamv103i314
More informationProbability, Expectation Value and Uncertainty
Chapter 1 Probability, Expectatio Value ad Ucertaity We have see that the physically observable properties of a quatum system are represeted by Hermitea operators (also referred to as observables ) such
More informationA General Family of Estimators for Estimating Population Variance Using Known Value of Some Population Parameter(s)
Rajesh Sigh, Pakaj Chauha, Nirmala Sawa School of Statistics, DAVV, Idore (M.P.), Idia Floreti Smaradache Uiversity of New Meico, USA A Geeral Family of Estimators for Estimatig Populatio Variace Usig
More informationFinal Examination Solutions 17/6/2010
The Islamic Uiversity of Gaza Faculty of Commerce epartmet of Ecoomics ad Political Scieces A Itroductio to Statistics Course (ECOE 30) Sprig Semester 00900 Fial Eamiatio Solutios 7/6/00 Name: I: Istructor:
More informationThe Sample Variance Formula: A Detailed Study of an Old Controversy
The Sample Variace Formula: A Detailed Study of a Old Cotroversy Ky M. Vu PhD. AuLac Techologies Ic. c 00 Email: kymvu@aulactechologies.com Abstract The two biased ad ubiased formulae for the sample variace
More informationInfinite Sequences and Series
Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet
More informationBootstrap Intervals of the Parameters of Lognormal Distribution Using Power Rule Model and Accelerated Life Tests
Joural of Moder Applied Statistical Methods Volume 5 Issue Article 5 Bootstrap Itervals of the Parameters of Logormal Distributio Usig Power Rule Model ad Accelerated Life Tests Mohammed AlHa Ebrahem
More informationA RANK STATISTIC FOR NONPARAMETRIC KSAMPLE AND CHANGE POINT PROBLEMS
J. Japa Statist. Soc. Vol. 41 No. 1 2011 67 73 A RANK STATISTIC FOR NONPARAMETRIC KSAMPLE AND CHANGE POINT PROBLEMS Yoichi Nishiyama* We cosider ksample ad chage poit problems for idepedet data i a
More informationRecurrence Relations
Recurrece Relatios Aalysis of recursive algorithms, such as: it factorial (it ) { if (==0) retur ; else retur ( * factorial()); } Let t be the umber of multiplicatios eeded to calculate factorial(). The
More informationIIT JAM Mathematical Statistics (MS) 2006 SECTION A
IIT JAM Mathematical Statistics (MS) 6 SECTION A. If a > for ad lim a / L >, the which of the followig series is ot coverget? (a) (b) (c) (d) (d) = = a = a = a a + / a lim a a / + = lim a / a / + = lim
More informationA NOTE ON THE TOTAL LEAST SQUARES FIT TO COPLANAR POINTS
A NOTE ON THE TOTAL LEAST SQUARES FIT TO COPLANAR POINTS STEVEN L. LEE Abstract. The Total Least Squares (TLS) fit to the poits (x,y ), =1,,, miimizes the sum of the squares of the perpedicular distaces
More informationLINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity
LINEAR REGRESSION ANALYSIS MODULE IX Lecture  9 Multicolliearity Dr Shalabh Departmet of Mathematics ad Statistics Idia Istitute of Techology Kapur Multicolliearity diagostics A importat questio that
More informationLecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting
Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would
More information4 Conditional Distribution Estimation
4 Coditioal Distributio Estimatio 4. Estimators Te coditioal distributio (CDF) of y i give X i = x is F (y j x) = P (y i y j X i = x) = E ( (y i y) j X i = x) : Tis is te coditioal mea of te radom variable
More informationDiscrete probability distributions
Discrete probability distributios I the chapter o probability we used the classical method to calculate the probability of various values of a radom variable. I some cases, however, we may be able to develop
More informationComparison Study of Series Approximation. and Convergence between Chebyshev. and Legendre Series
Applied Mathematical Scieces, Vol. 7, 03, o. 6, 3337 HIKARI Ltd, www.mhikari.com http://d.doi.org/0.988/ams.03.3430 Compariso Study of Series Approimatio ad Covergece betwee Chebyshev ad Legedre Series
More informationSRC Technical Note June 17, Tight Thresholds for The Pure Literal Rule. Michael Mitzenmacher. d i g i t a l
SRC Techical Note 1997011 Jue 17, 1997 Tight Thresholds for The Pure Literal Rule Michael Mitzemacher d i g i t a l Systems Research Ceter 130 Lytto Aveue Palo Alto, Califoria 94301 http://www.research.digital.com/src/
More informationConfidence Intervals for the Population Proportion p
Cofidece Itervals for the Populatio Proportio p The cocept of cofidece itervals for the populatio proportio p is the same as the oe for, the samplig distributio of the mea, x. The structure is idetical:
More informationFastest mixing Markov chain on a path
Fastest mixig Markov chai o a path Stephe Boyd Persi Diacois Ju Su Li Xiao Revised July 2004 Abstract We ider the problem of assigig trasitio probabilities to the edges of a path, so the resultig Markov
More informationIn algebra one spends much time finding common denominators and thus simplifying rational expressions. For example:
74 The Method of Partial Fractios I algebra oe speds much time fidig commo deomiators ad thus simplifyig ratioal epressios For eample: + + + 6 5 + = + = = + + + + + ( )( ) 5 It may the seem odd to be watig
More informationPoisson approximation
p^ 0.17 0.16 0.15 0.14 0.13 0.12 0.11 0.10 0.09 0.08 0.07 0.06 Poisso approximatio Normal approximatio 90 200 400 800 2000 5000 10,000 Figure 3: Poisso vs. ormal approximatios for large sample sizes. 14
More informationTable 12.1: Contingency table. Feature b. 1 N 11 N 12 N 1b 2 N 21 N 22 N 2b. ... a N a1 N a2 N ab
Sectio 12 Tests of idepedece ad homogeeity I this lecture we will cosider a situatio whe our observatios are classified by two differet features ad we would like to test if these features are idepedet
More informationProblems from 9th edition of Probability and Statistical Inference by Hogg, Tanis and Zimmerman:
Math 224 Fall 2017 Homework 4 Drew Armstrog Problems from 9th editio of Probability ad Statistical Iferece by Hogg, Tais ad Zimmerma: Sectio 2.3, Exercises 16(a,d),18. Sectio 2.4, Exercises 13, 14. Sectio
More informationDISTRIBUTION LAW Okunev I.V.
1 DISTRIBUTION LAW Okuev I.V. Distributio law belogs to a umber of the most complicated theoretical laws of mathematics. But it is also a very importat practical law. Nothig ca help uderstad complicated
More informationBayesian and E Bayesian Method of Estimation of Parameter of Rayleigh Distribution A Bayesian Approach under Linex Loss Function
Iteratioal Joural of Statistics ad Systems ISSN 9732675 Volume 12, Number 4 (217), pp. 791796 Research Idia Publicatios http://www.ripublicatio.com Bayesia ad E Bayesia Method of Estimatio of Parameter
More informationESTIMATION AND PREDICTION BASED ON KRECORD VALUES FROM NORMAL DISTRIBUTION
STATISTICA, ao LXXIII,. 4, 013 ESTIMATION AND PREDICTION BASED ON KRECORD VALUES FROM NORMAL DISTRIBUTION Maoj Chacko Departmet of Statistics, Uiversity of Kerala, Trivadrum 695581, Kerala, Idia M. Shy
More informationZeros of Polynomials
Math 160 www.timetodare.com 4.5 4.6 Zeros of Polyomials I these sectios we will study polyomials algebraically. Most of our work will be cocered with fidig the solutios of polyomial equatios of ay degree
More informationNumber of fatalities X Sunday 4 Monday 6 Tuesday 2 Wednesday 0 Thursday 3 Friday 5 Saturday 8 Total 28. Day
LECTURE # 8 Mea Deviatio, Stadard Deviatio ad Variace & Coefficiet of variatio Mea Deviatio Stadard Deviatio ad Variace Coefficiet of variatio First, we will discuss it for the case of raw data, ad the
More informationSimon Blackburn. Sean Murphy. Jacques Stern. Laboratoire d'informatique, Ecole Normale Superieure, Abstract
The Cryptaalysis of a Public Key Implemetatio of Fiite Group Mappigs Simo Blackbur Sea Murphy Iformatio Security Group, Royal Holloway ad Bedford New College, Uiversity of Lodo, Egham, Surrey TW20 0EX,
More informationLecture 4. Random variable and distribution of probability
Itroductio to theory of probability ad statistics Lecture. Radom variable ad distributio of probability dr hab.iż. Katarzya Zarzewsa, prof.agh Katedra Eletroii, AGH email: za@agh.edu.pl http://home.agh.edu.pl/~za
More informationREGRESSION (Physics 1210 Notes, Partial Modified Appendix A)
REGRESSION (Physics 0 Notes, Partial Modified Appedix A) HOW TO PERFORM A LINEAR REGRESSION Cosider the followig data poits ad their graph (Table I ad Figure ): X Y 0 3 5 3 7 4 9 5 Table : Example Data
More information