Variance Estimation for the General Regression Estimator


Richard Valliant
Westat, 1650 Research Boulevard, Rockville MD 20850
and
Joint Program for Survey Methodology, University of Maryland, University of Michigan

February 2002

ABSTRACT

A variety of estimators of the variance of the general regression (GREG) estimator of a mean have been proposed in the sampling literature, mainly with the goal of estimating the design-based variance. Estimators can be easily constructed that, under certain conditions, are approximately unbiased for both the design-variance and the model-variance. Several dual-purpose estimators are studied here in single-stage sampling. These choices are robust estimators of a model-variance even if the model that motivates the GREG has an incorrect variance parameter. A key feature of the robust estimators is the adjustment of squared residuals by factors analogous to the leverages used in standard regression analysis. We also show that the delete-one jackknife implicitly includes the leverage adjustments and is a good choice from either the design-based or model-based perspective. In a set of simulations, these variance estimators have small bias and produce confidence intervals with near-nominal coverage rates for several sampling methods, sample sizes, and populations in single-stage sampling. We also present simulation results for a skewed population where all variance estimators perform poorly. Samples that do not adequately represent the units with large values lead to estimated means that are too small, variance estimates that are too small, and confidence intervals that cover at far less than the nominal rate. These defects need to be avoided at the design stage by selecting samples that cover the extreme units well. However, in populations with inadequate design information this will not be feasible.

KEY WORDS: Confidence interval coverage; Hat matrix; Jackknife; Leverage; Model unbiased; Skewness

1. Introduction

Robust variance estimation is a key consideration in the prediction approach to finite population sampling. Valliant, Dorfman, and Royall (2000) synthesize much of the model-based literature. In that approach, a working model is formulated that is used to construct a point estimator of a mean or total. Variance estimators are created that are robust in the sense of being approximately model-unbiased and consistent for the model-variance even when the variance specification in the working model is incorrect. In this paper, that approach is extended to the general regression estimator (GREG) to construct variance estimators that are approximately model-unbiased but are also approximately design-unbiased in single-stage sampling. A number of alternatives are compared including the jackknife and some variants of the jackknife.

We will use a particular class of linear models along with Bernoulli or Poisson sampling as motivation for the variance estimators. However, some of these estimators can often be successfully applied in practice to single-stage designs where selections are not independent.

Associated with each unit $i$ in the population is a target variable $Y_i$ and a $p$-vector of auxiliary variables $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})'$ where $i = 1, \ldots, N$. The population vector of totals of the auxiliaries is $\mathbf{T}_x = (T_{x1}, \ldots, T_{xp})'$ where $T_{xk} = \sum_{i=1}^{N} x_{ik}$, $k = 1, \ldots, p$. The general regression estimator, defined below, is motivated by a linear model in which the $Y_i$'s are independent random variables with

$$E_M(Y_i) = \mathbf{x}_i'\boldsymbol{\beta}, \qquad \mathrm{var}_M(Y_i) = v_i. \tag{1.1}$$

In most situations (1.1) is a working model that is likely to be incorrect to some degree.

Assume that a probability sample $s$ is selected and that the selection probability of sample unit $i$ is $P(\delta_i = 1) = \pi_i$, where $\delta_i$ is a 0-1 indicator for whether unit $i$ is in the sample or not. We assume that the sample selection mechanism is ignorable. Roughly speaking, ignorability means that the joint distribution of the $Y_i$'s and the sample indicators, given the $\mathbf{x}_i$'s, can be factored into the product of the distribution for $Y$ given $\mathbf{x}$ and the distribution for the indicators given $\mathbf{x}$ (see Sugden and Smith 1984 for a formal definition). In that case, model-based inference can proceed using the model and ignoring the selection mechanism.

The $n$-vector of targets for the sample units is $\mathbf{Y}_s = (Y_1, \ldots, Y_n)'$, and the $n \times p$ matrix of auxiliaries for the sample units is $\mathbf{X}_s = (\mathbf{x}_1, \ldots, \mathbf{x}_n)'$. Define the diagonal matrix of selection probabilities as $\boldsymbol{\Pi}_s = \mathrm{diag}(\pi_i)$, $i \in s$, and the diagonal matrix of model-variances as $\mathbf{V}_s = \mathrm{diag}(v_i)$. The GREG estimator of the total, $T = \sum_{i=1}^{N} Y_i$, is then defined as the Horvitz-Thompson estimator or $\pi$-estimator, $\hat{T}_\pi = \sum_s Y_i/\pi_i$, plus an adjustment:

$$\hat{T}_G = \hat{T}_\pi + \hat{\mathbf{B}}'\left(\mathbf{T}_x - \hat{\mathbf{T}}_{x\pi}\right) \tag{1.2}$$

where $\hat{\mathbf{B}} = \mathbf{A}_{\pi s}^{-1}\mathbf{X}_s'\mathbf{V}_s^{-1}\boldsymbol{\Pi}_s^{-1}\mathbf{Y}_s$ with $\mathbf{A}_{\pi s} = \mathbf{X}_s'\mathbf{V}_s^{-1}\boldsymbol{\Pi}_s^{-1}\mathbf{X}_s$, and $\hat{\mathbf{T}}_{x\pi} = \sum_s \mathbf{x}_i/\pi_i$. The GREG estimator can also be written as

$$\hat{T}_G = \mathbf{g}_s'\boldsymbol{\Pi}_s^{-1}\mathbf{Y}_s \tag{1.3}$$

with $\mathbf{g}_s = \mathbf{V}_s^{-1}\mathbf{X}_s\mathbf{A}_{\pi s}^{-1}\left(\mathbf{T}_x - \hat{\mathbf{T}}_{x\pi}\right) + \mathbf{1}_s$ and $\mathbf{1}_s$ being an $n$-vector of 1's. Expression (1.3) will be useful for subsequent calculations.

A variant of the GREG, referred to as a cosmetic estimator, was introduced by Särndal and Wright (1984) and amplified by Brewer (1995, 1999). A cosmetic estimator also has design-based and model-based interpretations. The variance estimators in this paper could also be adapted to cover cosmetic estimation.

Assuming that $N$ is known, the GREG estimator of the mean is simply $\hat{\bar{Y}}_G = \hat{T}_G/N$. We will concentrate on the analysis of $\hat{\bar{Y}}_G$. (In some situations, particularly ones where multi-stage sampling is used, the population size is unknown and an estimate, $\hat{N}$, must be used in the denominator of $\hat{\bar{Y}}_G$. The following analysis for the mean does not apply in that case.)

Either quantitative or qualitative auxiliaries (or both) can be used in the GREG. If a qualitative variable like gender (male or female) is used, then two or more columns in $\mathbf{X}_s$ will be linearly dependent, in which case a generalized inverse, denoted by $\mathbf{A}_{\pi s}^{-}$, will be used in (1.2) and (1.3). Note that, although $\mathbf{A}_{\pi s}^{-}$ is not unique, the GREG estimator $\hat{\bar{Y}}_G$ is invariant to the choice of generalized inverse. The proof is similar to that of a theorem in Valliant, Dorfman, and Royall (2000).

The GREG estimator is model-unbiased under (1.1) and is approximately design-unbiased in large probability samples. Note that the model-unbiasedness requires only that $E_M(Y_i) = \mathbf{x}_i'\boldsymbol{\beta}$; if the variance parameters in (1.1) are misspecified, the GREG will still be model-unbiased. On the other hand, if $E_M(Y_i)$ is incorrectly specified, the GREG is model-biased and the model mean squared error may contain an important bias-squared term.

The estimation error of the GREG $\hat{\bar{Y}}_G$ is defined as

$$\hat{\bar{Y}}_G - \bar{Y} = N^{-1}\left(\mathbf{a}_s'\mathbf{Y}_s - \mathbf{1}_r'\mathbf{Y}_r\right)$$

where $\bar{Y} = T/N$, $\mathbf{a}_s = \boldsymbol{\Pi}_s^{-1}\mathbf{g}_s - \mathbf{1}_s$, $\mathbf{Y}_r$ is the $(N-n)$-vector of target variables for the nonsample units, and $\mathbf{1}_r$ is a vector of $N-n$ 1's. Next, suppose that the true model for $Y$ is

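$$E_M(Y_i) = \mathbf{x}_i'\boldsymbol{\beta}, \qquad \mathrm{var}_M(Y_i) = \psi_i, \tag{1.4}$$

i.e., the variance specification is different from (1.1) but $E_M(Y_i)$ is the same. Using the estimation error, the error-variance of $\hat{\bar{Y}}_G$ is then

$$\mathrm{var}_M\left(\hat{\bar{Y}}_G - \bar{Y}\right) = N^{-2}\left(\mathbf{a}_s'\boldsymbol{\Psi}_s\mathbf{a}_s + \mathbf{1}_r'\boldsymbol{\Psi}_r\mathbf{1}_r\right)$$

where $\boldsymbol{\Psi}_s = \mathrm{diag}(\psi_i)$ is the $n \times n$ covariance matrix for $\mathbf{Y}_s$ and $\boldsymbol{\Psi}_r$ is the $(N-n) \times (N-n)$ covariance matrix for $\mathbf{Y}_r$. When the sample and population sizes are both large and the sampling fraction, $f = n/N$, is negligible, the error-variance is approximately

$$\mathrm{var}_M\left(\hat{\bar{Y}}_G - \bar{Y}\right) \approx N^{-2}\sum_s a_i^2\psi_i. \tag{1.5}$$

Note that this variance depends on the true variance parameters, $\psi_i$, and on the working model variance parameters, $v_i$, because $v_i$ is part of $a_i$. Since $a_i$ is approximately the same as $g_i/\pi_i$ when selection probabilities are small, the error variance in that case is also approximately

$$\mathrm{var}_M\left(\hat{\bar{Y}}_G - \bar{Y}\right) \approx N^{-2}\sum_s \frac{g_i^2\psi_i}{\pi_i^2} = N^{-2}\,\mathbf{g}_s'\boldsymbol{\Pi}_s^{-1}\boldsymbol{\Psi}_s\boldsymbol{\Pi}_s^{-1}\mathbf{g}_s. \tag{1.6}$$

For model-based variance estimation, we will take either of the asymptotic forms in (1.5) or (1.6) as the target. However, when the sampling fraction is substantial, the term $N^{-2}\mathbf{1}_r'\boldsymbol{\Psi}_r\mathbf{1}_r$ can be an important part of the error-variance and (1.5) or (1.6) may be poor approximations.

We will consider the design variance under two single-stage plans: Bernoulli and Poisson. In Poisson sampling, the indicators $\delta_i$ for whether a unit is in the sample or not are independent with $P(\delta_i = 1) = 1 - P(\delta_i = 0) = \pi_i$ (see Särndal, Swensson, and Wretman 1992, sec.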

3.5, for a more detailed description). Bernoulli sampling is a special case of Poisson sampling in which each unit has the same inclusion probability. Under these two plans, the approximate design-variance of $\hat{\bar{Y}}_G$ is

$$\mathrm{var}_\pi\left(\hat{\bar{Y}}_G\right) \approx N^{-2}\sum_{i=1}^{N}\frac{1-\pi_i}{\pi_i}E_i^2 \tag{1.7}$$

where $E_i = Y_i - \mathbf{x}_i'\mathbf{B}$ and $\mathbf{B} = \left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{Y}$ is the regression parameter estimator evaluated for the full finite population. Särndal (1996) recommends using the GREG in conjunction with sampling plans for which (1.7) is valid on the grounds that the variance (1.7) is simple and that the use of regression estimation can often more than compensate for the random sample sizes that are a consequence of such designs.

The Bernoulli and Poisson designs and the linear models (1.1) and (1.4) serve mainly as motivation for the variance estimators presented in sections 2 and 3. As noted by Yung and Rao (1996, p. 24), it is common practice to use variance estimators that are appropriate to a design with independent selections or to a with-replacement design even when a sample has been selected without replacement. Likewise, variance estimators motivated by a linear model are often applied in cases where departures from the model are anticipated. This practical approach underlies the thinking in this paper and is illustrated in the simulation study reported in section 4.

2. Variance Estimators

Our general goal in variance estimation will be to find estimators that are consistent and approximately unbiased under both a model and a design. Kott (1990) also considered this problem. Note that the goal here is not the estimation of a combined (or anticipated) model-design variance, $E_M E_\pi\left[\left(\hat{\bar{Y}}_G - \bar{Y}\right) - E_M E_\pi\left(\hat{\bar{Y}}_G - \bar{Y}\right)\right]^2$. Rather we seek estimators that are useful for both $\mathrm{var}_M\left(\hat{\bar{Y}}_G - \bar{Y}\right)$ and $\mathrm{var}_\pi\left(\hat{\bar{Y}}_G\right)$. The arguments given here are largely heuristic ones used to motivate the forms of the variance estimators. Additional, formal conditions such as those found in Royall and Cumberland (1978) or Yung and Rao (2000) are needed for model-based and design-based consistency and approximate unbiasedness.

First, consider estimation of the approximate model-variance given in (1.5). In the following development, we assume that, as $N$ and $n$ become large, (i) $N\max_i(\pi_i) = O(n)$ and (ii) $\mathbf{A}_{\pi s}/N$ converges to a matrix of constants, $\mathbf{A}_o$. A residual associated with sample unit $i$ is $r_i = Y_i - \hat{Y}_i$ where $\hat{Y}_i = \mathbf{x}_i'\hat{\mathbf{B}}$. The vector of predicted values for the sample units can be written as

$$\hat{\mathbf{Y}}_s = \mathbf{H}_s\mathbf{Y}_s \tag{2.1}$$

where $\mathbf{H}_s = \mathbf{X}_s\mathbf{A}_{\pi s}^{-1}\mathbf{X}_s'\mathbf{V}_s^{-1}\boldsymbol{\Pi}_s^{-1}$. The predicted value for an individual unit is $\hat{Y}_i = \sum_{j \in s} h_{ij}Y_j$ where $h_{ij} = \mathbf{x}_i'\mathbf{A}_{\pi s}^{-1}\mathbf{x}_j/(v_j\pi_j)$ is the $(ij)$th element of $\mathbf{H}_s$. The matrix $\mathbf{H}_s$ is the analog to the usual hat matrix (Belsley, Kuh, and Welsch 1980) from standard regression analysis. The diagonal elements of the hat matrix are known as leverages and are a measure of the effect that a unit has on its own predicted value. Notice that the inverses of the selection probabilities are involved in (2.1), although these would have no role in purely model-based analysis. The following lemma, which is a variation of some results in a lemma of Valliant, Dorfman, and Royall (2000), gives some properties of the leverages and the hat matrix.
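The quantities defined so far can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical population with an intercept and one auxiliary, a working variance $v_i = x_i$, and Poisson sampling with $\pi_i$ proportional to $\sqrt{x_i}$. All variable names are illustrative. It computes the GREG total in the forms (1.2) and (1.3), the $g$-weights, and the hat matrix $\mathbf{H}_s$ of (2.1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population (not from the paper): intercept plus one auxiliary x.
N = 400
x_pop = rng.uniform(10.0, 200.0, size=N)
X_pop = np.column_stack([np.ones(N), x_pop])
y_pop = 5.0 + 3.0 * x_pop + rng.normal(scale=np.sqrt(x_pop))
Tx = X_pop.sum(axis=0)                       # known population totals T_x

# Poisson sampling with pi_i proportional to sqrt(x_i), expected size 60.
pi_pop = 60.0 * np.sqrt(x_pop) / np.sqrt(x_pop).sum()
in_sample = rng.random(N) < pi_pop
y, X = y_pop[in_sample], X_pop[in_sample]
pi, v = pi_pop[in_sample], x_pop[in_sample]  # working variance v_i = x_i (assumed)

w = 1.0 / (v * pi)                           # diagonal of V_s^{-1} Pi_s^{-1}
A = X.T @ (X * w[:, None])                   # A_{pi s}
A_inv = np.linalg.inv(A)
B_hat = A_inv @ (X.T @ (w * y))              # GREG slope estimator
Tx_pi = X.T @ (1.0 / pi)                     # pi-estimate of T_x
T_pi = np.sum(y / pi)                        # pi-estimate of T

T_greg_12 = T_pi + B_hat @ (Tx - Tx_pi)      # form (1.2)
g = 1.0 + (X @ A_inv @ (Tx - Tx_pi)) / v     # g-weights
T_greg_13 = np.sum(g * y / pi)               # form (1.3)

H = X @ A_inv @ (X.T * w)                    # hat matrix H_s of (2.1)
leverages = np.diag(H)                       # h_ii
```

In this sketch the two forms of the GREG agree, the weights $g_i/\pi_i$ reproduce the known totals $\mathbf{T}_x$, and the leverages sum to the number of columns of $\mathbf{X}_s$ because $\mathbf{H}_s$ is idempotent.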

Lemma 1. Assume that (i) and (ii) hold. For $\mathbf{H}_s = \mathbf{X}_s\mathbf{A}_{\pi s}^{-1}\mathbf{X}_s'\mathbf{V}_s^{-1}\boldsymbol{\Pi}_s^{-1}$ the following properties hold for all $i \in s$:

(a) $h_{ij} = O(n^{-1})$

(b) $\mathbf{H}_s$ is idempotent.

(c) $0 \le h_{ii} \le 1$.

Proof: Since $h_{ij} = \mathbf{x}_i'\mathbf{A}_{\pi s}^{-1}\mathbf{x}_j/(v_j\pi_j)$, conditions (i) and (ii) imply that $h_{ij} = O(n^{-1})$. Part (b) follows from direct multiplication, using the definition of $\mathbf{H}_s$. To prove (c) note that $h_{ii} \ge 0$ since it is a quadratic form. Part (b) implies that $h_{ii} = h_{ii}^2 + \sum_{j \ne i} h_{ij}h_{ji}$, which can hold only if $h_{ii} \le 1$.

Next, we write the residual as $r_i = Y_i(1 - h_{ii}) - \sum_{j \in s(i)} h_{ij}Y_j$ where $s(i)$ is the sample excluding unit $i$. Since $E_M(r_i) = 0$, we have $E_M(r_i^2) = \mathrm{var}_M(r_i)$ and

$$E_M\left(r_i^2\right) = \psi_i(1 - h_{ii})^2 + \sum_{j \in s(i)} h_{ij}^2\psi_j \tag{2.2}$$

under model (1.4). Using Lemma 1(a), we have $h_{ii} = o(1)$, $h_{ij} = o(1)$, and consequently, $E_M(r_i^2) \approx \psi_i$. Thus, in large samples, $r_i^2$ is an approximately unbiased estimator of the correct model-variance even though the variance specification in model (1.1) was incorrect. As a result,

$r_i^2$ is a robust estimator of the model-variance for unit $i$ regardless of the form of $\psi_i$. A simple, robust estimator of the approximate model-variance (1.5) is then

$$v_{R1}\left(\hat{\bar{Y}}_G\right) = N^{-2}\sum_s a_i^2 r_i^2 \tag{2.3}$$

which is a type of sandwich estimator (see, e.g., White 1982). (Note that a formal argument that $v_{R1}$ is robust would require conditions such that $nE_M(v_{R1})$ and $nN^{-2}\sum_s a_i^2\psi_i$ converge to the same quantity.) Another variance estimator, similar to $v_{R1}$ if $a_i \approx g_i/\pi_i$, is

$$v_{R2}\left(\hat{\bar{Y}}_G\right) = N^{-2}\sum_s \frac{g_i^2 r_i^2}{\pi_i^2}. \tag{2.4}$$

An estimator of the approximate design-variance in (1.7) is

$$v_\pi\left(\hat{\bar{Y}}_G\right) = N^{-2}\sum_s \frac{1-\pi_i}{\pi_i^2}r_i^2. \tag{2.5}$$

An alternative suggested by Särndal, Swensson, and Wretman (1989) as having better conditional properties is

$$v_{SSW}\left(\hat{\bar{Y}}_G\right) = N^{-2}\sum_s \frac{1-\pi_i}{\pi_i^2}g_i^2 r_i^2. \tag{2.6}$$

Another, similar estimator, used in the SUPERCARP software (Hidiroglou, Fuller, and Hickman 1980) and derived using Taylor series methods, is

$$v_T\left(\hat{\bar{Y}}_G\right) = N^{-2}\frac{n}{n-1}\left[\sum_s\left(\frac{g_i r_i}{\pi_i}\right)^2 - \frac{1}{n}\left(\sum_s\frac{g_i r_i}{\pi_i}\right)^2\right]. \tag{2.7}$$

As shown in the Appendix, the second term in brackets in (2.7) converges in probability to zero under model (1.1). Thus, $v_T \approx v_{R2}$ in large samples.
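As a rough illustration of (2.3) through (2.7), the sketch below computes the five estimators for one sample. The function name `basic_variance_estimators`, the simulated population, and the Bernoulli design with $\pi_i = 0.1$ are assumptions made for the example only; they do not come from the paper.

```python
import numpy as np

def basic_variance_estimators(y, X, pi, v, Tx, N):
    """Estimators (2.3)-(2.7) of the variance of the GREG mean (illustrative)."""
    n = len(y)
    w = 1.0 / (v * pi)                                   # V_s^{-1} Pi_s^{-1} diagonal
    A_inv = np.linalg.inv(X.T @ (X * w[:, None]))        # A_{pi s}^{-1}
    r = y - X @ (A_inv @ (X.T @ (w * y)))                # residuals r_i
    g = 1.0 + (X @ A_inv @ (Tx - X.T @ (1.0 / pi))) / v  # g-weights
    a = g / pi - 1.0                                     # a_i = g_i/pi_i - 1
    z = g * r / pi
    return {
        "v_R1": np.sum(a**2 * r**2) / N**2,                      # (2.3)
        "v_R2": np.sum(z**2) / N**2,                             # (2.4)
        "v_pi": np.sum((1.0 - pi) * (r / pi) ** 2) / N**2,       # (2.5)
        "v_SSW": np.sum((1.0 - pi) * z**2) / N**2,               # (2.6)
        "v_T": n / (n - 1) * (np.sum(z**2) - z.sum() ** 2 / n) / N**2,  # (2.7)
    }

# Hypothetical population and Bernoulli sample (assumed for illustration).
rng = np.random.default_rng(1)
N = 500
x_pop = rng.uniform(10.0, 200.0, size=N)
X_pop = np.column_stack([np.ones(N), x_pop])
y_pop = 5.0 + 3.0 * x_pop + rng.normal(scale=np.sqrt(x_pop))
pi_pop = np.full(N, 0.1)
s = rng.random(N) < pi_pop
n = int(s.sum())
est = basic_variance_estimators(y_pop[s], X_pop[s], pi_pop[s], x_pop[s],
                                X_pop.sum(axis=0), N)
```

Because every $\pi_i$ equals 0.1 in this sketch, `est["v_SSW"]` is exactly $0.9$ times `est["v_R2"]`, which makes the role of the $(1 - \pi_i)$ factor easy to see.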

When the selection probability of each unit is small, $v_{SSW}$ will be similar to $v_{R1}$, $v_{R2}$, and $v_T$. All three will be approximately model-unbiased under (1.4) and approximately design-unbiased under Bernoulli and Poisson sampling. On the other hand, $v_\pi$ is approximately design-unbiased but ignores the $g_i$ coefficients and is biased under either model (1.1) or (1.4).

As a simple example, consider Bernoulli sampling with $\pi_i = n/N$ and the working model $E_M(Y_i) = x_i\beta$, $\mathrm{var}_M(Y_i) = \sigma^2 x_i$. Then the GREG is the ratio estimator $\hat{\bar{Y}}_G = \bar{Y}_s\bar{x}/\bar{x}_s$ where $\bar{x}$ is a finite population mean. The approximate model variance under the more general specification, $\mathrm{var}_M(Y_i) = \psi_i$, is $n^{-1}\left(\bar{x}/\bar{x}_s\right)^2\bar{\psi}_s$ where $\bar{\psi}_s = \sum_s\psi_i/n$. The approximate design-variance is $(1-f)(nN)^{-1}\sum_{i=1}^{N}\left(Y_i - x_i\bar{Y}/\bar{x}\right)^2$ where $\bar{Y}$ is a finite population mean. The estimator $v_{R2} = n^{-2}\left(\bar{x}/\bar{x}_s\right)^2\sum_s\left(Y_i - x_i\bar{Y}_s/\bar{x}_s\right)^2$ is approximately unbiased for the model-variance and, because $\bar{x}/\bar{x}_s \approx 1$ in large Bernoulli samples, $v_{R2}$ is also approximately unbiased for the design-variance as long as $f$ is small. In contrast, $v_\pi = n^{-2}(1-f)\sum_s\left(Y_i - x_i\bar{Y}_s/\bar{x}_s\right)^2$ is approximately design-unbiased but is model-unbiased only in balanced samples where $\bar{x}_s = \bar{x}$. Royall and Cumberland (1981) noted similar results for the ratio estimator in simple random sampling without replacement.

3. Alternative Variance Estimators Using Adjusted Squared Residuals

The first alternative variance estimator we consider is the jackknife. The particular version to be studied is defined as

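$$v_J = \frac{n-1}{n}\sum_{i=1}^{n}\left[\hat{\bar{Y}}_{G(i)} - \hat{\bar{Y}}_{G(\cdot)}\right]^2 \tag{3.1}$$

where $\hat{\bar{Y}}_{G(i)}$ has the same form as the full sample estimator after omitting sample unit $i$. If the selection probability has the form $\pi_i = np_i$, then (3.1) can be rewritten. Using the convention that the subscript $(i)$ means that sample unit $i$ has been omitted, we have $\hat{\bar{Y}}_{G(i)} = \hat{T}_{G(i)}/N$, $\hat{\bar{Y}}_{G(\cdot)} = \sum_{i \in s}\hat{\bar{Y}}_{G(i)}/n$,

$$\hat{T}_{G(i)} = \hat{T}_{\pi(i)} + \hat{\mathbf{B}}_{(i)}'\left(\mathbf{T}_x - \hat{\mathbf{T}}_{x\pi(i)}\right),$$

$$\hat{T}_{\pi(i)} = \frac{n}{n-1}\sum_{j \in s(i)}\frac{Y_j}{\pi_j}, \qquad \hat{\mathbf{T}}_{x\pi(i)} = \frac{n}{n-1}\sum_{j \in s(i)}\frac{\mathbf{x}_j}{\pi_j},$$

and $\hat{\mathbf{B}}_{(i)} = \mathbf{A}_{\pi s(i)}^{-1}\mathbf{X}_{s(i)}'\mathbf{V}_{s(i)}^{-1}\boldsymbol{\Pi}_{s(i)}^{-1}\mathbf{Y}_{s(i)}$ with $\mathbf{A}_{\pi s(i)} = \mathbf{X}_{s(i)}'\mathbf{V}_{s(i)}^{-1}\boldsymbol{\Pi}_{s(i)}^{-1}\mathbf{X}_{s(i)}$. Another more conservative, but asymptotically equivalent, version of the jackknife replaces $\hat{\bar{Y}}_{G(\cdot)}$ with the full sample estimator $\hat{\bar{Y}}_G$. Design-based properties of the jackknife in (3.1) are usually studied in samples selected with replacement (see, e.g., Krewski and Rao 1981, Rao and Wu 1985, Yung and Rao 1996), but applied in practice to without-replacement designs. Note that for the linear estimator $\hat{\bar{Y}}_\pi = N^{-1}\sum_s Y_i/\pi_i$ in probability proportional to size without-replacement sampling, neither the jackknife, $v_J$, nor the approximations to $v_J$ given later in this section, reduce to the usual Horvitz-Thompson or Yates-Grundy variance estimators.

With some effort we can write the jackknife in a form that involves the residuals and the leverages. The rewritten form will make clear the relationship of the jackknife to the variance estimators in section 2. First, note the following equalities that are easily verified:

$$\hat{T}_{\pi(i)} = \frac{n}{n-1}\left(\hat{T}_\pi - \frac{Y_i}{\pi_i}\right), \qquad \hat{\mathbf{T}}_{x\pi(i)} = \frac{n}{n-1}\left(\hat{\mathbf{T}}_{x\pi} - \frac{\mathbf{x}_i}{\pi_i}\right) \tag{3.2}$$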

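$$\mathbf{X}_{s(i)}'\mathbf{V}_{s(i)}^{-1}\boldsymbol{\Pi}_{s(i)}^{-1}\mathbf{Y}_{s(i)} = \mathbf{X}_s'\mathbf{V}_s^{-1}\boldsymbol{\Pi}_s^{-1}\mathbf{Y}_s - \frac{\mathbf{x}_iY_i}{v_i\pi_i}, \qquad \mathbf{A}_{\pi s(i)} = \mathbf{A}_{\pi s} - \frac{\mathbf{x}_i\mathbf{x}_i'}{v_i\pi_i} \tag{3.3}$$

Using a standard formula for the inverse of the sum of two matrices, the slope estimator, omitting sample unit $i$, equals

$$\hat{\mathbf{B}}_{(i)} = \hat{\mathbf{B}} - \frac{\mathbf{A}_{\pi s}^{-1}\mathbf{x}_i r_i}{v_i\pi_i(1 - h_{ii})}.$$

Details of this and the succeeding computations are sketched in the Appendix. After a considerable amount of algebra, we have

$$\hat{T}_{G(i)} - \hat{T}_{G(\cdot)} = -\frac{n}{n-1}\left(D_i - \bar{D}_s\right) + \frac{n}{n-1}F_i$$

where $D_i = g_i r_i/[\pi_i(1 - h_{ii})]$, $\bar{D}_s = \sum_s D_i/n$, and $F_i$ is defined in the Appendix. The jackknife in (3.1) is then equal to

$$v_J\left(\hat{\bar{Y}}_G\right) = N^{-2}\frac{n}{n-1}\left[\sum_s\left(D_i - \bar{D}_s\right)^2 + \sum_s F_i^2 - 2\sum_s F_i\left(D_i - \bar{D}_s\right)\right]. \tag{3.4}$$

Expression (3.4) is an exact equality and could be used as a computational formula for the jackknife. This would sidestep the need to mechanically delete a unit, compute $\hat{\bar{Y}}_{G(i)}$, and so on, through the entire sample. In large samples the first term in brackets in (3.4) is dominant while the second and third are near zero under some reasonable conditions. Thus, in large samples the jackknife is approximated by $v_J\left(\hat{\bar{Y}}_G\right) \approx N^{-2}\sum_s\left(D_i - \bar{D}_s\right)^2$, or, equivalently,

$$v_J\left(\hat{\bar{Y}}_G\right) \approx N^{-2}\sum_s\left[\frac{g_i r_i}{\pi_i(1 - h_{ii})}\right]^2 - N^{-2}\frac{1}{n}\left[\sum_s\frac{g_i r_i}{\pi_i(1 - h_{ii})}\right]^2. \tag{3.5}$$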
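The claim that (3.4) reproduces the delete-one jackknife exactly can be checked numerically. The sketch below is an illustration under stated assumptions, not the author's code: the function names and the simulated Poisson sample are hypothetical, and the quantities $G_i$, $K_i$, and $F_i$ follow the Appendix definitions, $G_i = (h_{ii}Y_i - \hat{Y}_i)/[\pi_i(1-h_{ii})]$, $K_i = (\hat{\mathbf{B}} - \mathbf{Q}_i)'(n\mathbf{x}_i/\pi_i - \mathbf{T}_x)$, and $F_i = (G_i - \bar{G}_s) + n^{-1}(K_i - \bar{K}_s)$.

```python
import numpy as np

def greg_total(y, X, pi, v, Tx, scale=1.0):
    """GREG total (1.2); `scale` inflates the pi-estimates as in (3.2)."""
    w = 1.0 / (v * pi)
    B_hat = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
    return scale * np.sum(y / pi) + B_hat @ (Tx - scale * (X.T @ (1.0 / pi)))

def jackknife_brute_force(y, X, pi, v, Tx, N):
    """Delete-one jackknife (3.1), recomputing the GREG n times."""
    n = len(y)
    keep = ~np.eye(n, dtype=bool)            # row i drops sample unit i
    t_del = np.array([greg_total(y[k], X[k], pi[k], v[k], Tx, n / (n - 1))
                      for k in keep])
    return (n - 1) / n * np.sum((t_del - t_del.mean()) ** 2) / N**2

def jackknife_leverage_form(y, X, pi, v, Tx, N):
    """Exact form (3.4) and large-sample approximation (3.5)."""
    n = len(y)
    w = 1.0 / (v * pi)
    A_inv = np.linalg.inv(X.T @ (X * w[:, None]))
    B_hat = A_inv @ (X.T @ (w * y))
    y_fit = X @ B_hat
    r = y - y_fit
    h = np.einsum("ij,jk,ik->i", X, A_inv, X) * w         # leverages h_ii
    g = 1.0 + (X @ A_inv @ (Tx - X.T @ (1.0 / pi))) / v   # g-weights
    D = g * r / (pi * (1.0 - h))
    G = (h * y - y_fit) / (pi * (1.0 - h))
    Q = (X @ A_inv) * (r * w / (1.0 - h))[:, None]        # row i holds Q_i'
    K = np.sum((B_hat - Q) * (n * X / pi[:, None] - Tx), axis=1)
    F = (G - G.mean()) + (K - K.mean()) / n
    Dc = D - D.mean()
    exact = n / (n - 1) * (Dc @ Dc + F @ F - 2.0 * (F @ Dc)) / N**2  # (3.4)
    approx = (Dc @ Dc) / N**2                                        # (3.5)
    return exact, approx

# Hypothetical Poisson sample (assumed for illustration only).
rng = np.random.default_rng(2)
N = 400
x_pop = rng.uniform(10.0, 200.0, size=N)
X_pop = np.column_stack([np.ones(N), x_pop])
y_pop = 5.0 + 3.0 * x_pop + rng.normal(scale=np.sqrt(x_pop))
pi_pop = 60.0 * np.sqrt(x_pop) / np.sqrt(x_pop).sum()
s = rng.random(N) < pi_pop
args = (y_pop[s], X_pop[s], pi_pop[s], x_pop[s], X_pop.sum(axis=0), N)

v_brute = jackknife_brute_force(*args)
v_exact, v_approx = jackknife_leverage_form(*args)
```

The brute-force version refits the regression $n$ times, while the leverage form needs a single fit, which is the computational point made in the text about (3.4).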

As shown in the Appendix, the second term in (3.5) converges in probability to zero under model (1.1). Consequently, a further approximation to the jackknife is

$$v_J\left(\hat{\bar{Y}}_G\right) \approx N^{-2}\sum_s\frac{g_i^2 r_i^2}{\pi_i^2(1 - h_{ii})^2}. \tag{3.6}$$

As (3.5) and (3.6) show, the jackknife implicitly incorporates the $g_i$ coefficients needed for estimating the model-variance. The right-hand side of (3.6) is itself an alternative estimator that we will denote by $v_{J^*}\left(\hat{\bar{Y}}_G\right)$.

Yung and Rao (1996) also derived an approximation to the jackknife for the GREG in multistage sampling. For single-stage sampling, their approximation is equal to $v_T$, defined in (2.7), which is the same as (3.5) if the leverages are zero. Duchesne (2000) also presented a formula for the jackknife, which he denoted as $V_{JK}$, that involved sample leverages. The advantage of (3.4) is that it makes clear which parts of the jackknife are negligible in large samples. Duchesne also presented an estimator, denoted by $V_{JK2}$, that is essentially the same as $v_{R2}$ and is an approximation to the jackknife.

Expressions (3.5) and (3.6) explicitly show how the leverages affect the size of the jackknife. Weighted leverages, $h_{ii}$, that are not near zero will inflate $v_J$. Depending on the configuration of the $\mathbf{x}$'s, this could be a substantial effect on some samples. Since $h_{ii}$ approaches zero with increasing sample size, $v_J$, $v_{R2}$, $v_{SSW}$, and $v_T$ have the same asymptotic properties. In particular, the jackknife is approximately unbiased with respect to either the model or the design and is robust to misspecification of the variances in model (1.1). However, the factor $1 - h_{ii}$ in (3.6) is less than or equal to 1 and will make the jackknife larger

than the other variance estimators. This will typically result in confidence intervals based on the jackknife covering at a higher rate than ones using $v_{R2}$, $v_{SSW}$, or $v_T$.

Note, also, that if a without-replacement sample is used, and some first-order or second-order selection probabilities are not small, the choices $v_{R2}$, $v_D$, $v_J$, and $v_{J^*}$ will be overestimates of either the design-variance or the model-variance. To account for non-negligible selection probabilities, we can make some simple adjustments. An adjusted version of $v_{J^*}\left(\hat{\bar{Y}}_G\right)$, patterned after $v_{SSW}$, is

$$v_{JP}\left(\hat{\bar{Y}}_G\right) = N^{-2}\sum_s\frac{(1 - \pi_i)\,g_i^2 r_i^2}{\pi_i^2(1 - h_{ii})^2}.$$

This expression is similar to $V_{JK3}$ of Duchesne (2000), although $V_{JK3}$ omits the leverages.

Expression (3.6) also suggests another alternative that is closely related to an estimator of the error variance of the best linear unbiased predictor of the mean under model (1.1) (see Valliant, Dorfman, and Royall 2000, ch. 5). This estimator is somewhat less conservative than (3.6), but still adjusts using the leverages:

$$v_D\left(\hat{\bar{Y}}_G\right) = N^{-2}\sum_s\frac{g_i^2 r_i^2}{\pi_i^2(1 - h_{ii})}.$$

Because $h_{ii} = o(1)$, $v_D$ is also approximately model and design-unbiased. A variant of this that may perform better when some selection probabilities are large is

$$v_{DP}\left(\hat{\bar{Y}}_G\right) = N^{-2}\sum_s\frac{(1 - \pi_i)\,g_i^2 r_i^2}{\pi_i^2(1 - h_{ii})}.$$
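The four leverage-adjusted estimators differ only in the power of $(1 - h_{ii})$ and in whether the factor $(1 - \pi_i)$ is included, so they can be computed together. The following sketch assumes a hypothetical simulated sample; the function name `leverage_adjusted_estimators` and the dictionary keys are illustrative choices, and $v_{R2}$ is included only for comparison.

```python
import numpy as np

def leverage_adjusted_estimators(y, X, pi, v, Tx, N):
    """v_R2, v_D, v_DP, v_J* and v_JP for the GREG mean (illustrative sketch)."""
    w = 1.0 / (v * pi)
    A_inv = np.linalg.inv(X.T @ (X * w[:, None]))
    r = y - X @ (A_inv @ (X.T @ (w * y)))                 # residuals
    h = np.einsum("ij,jk,ik->i", X, A_inv, X) * w         # leverages h_ii
    g = 1.0 + (X @ A_inv @ (Tx - X.T @ (1.0 / pi))) / v   # g-weights
    z2 = (g * r / pi) ** 2                                # g_i^2 r_i^2 / pi_i^2
    return {
        "v_R2": np.sum(z2) / N**2,
        "v_D": np.sum(z2 / (1.0 - h)) / N**2,
        "v_DP": np.sum((1.0 - pi) * z2 / (1.0 - h)) / N**2,
        "v_Jstar": np.sum(z2 / (1.0 - h) ** 2) / N**2,            # (3.6)
        "v_JP": np.sum((1.0 - pi) * z2 / (1.0 - h) ** 2) / N**2,
    }

# Hypothetical Poisson sample (assumed for illustration only).
rng = np.random.default_rng(3)
N = 400
x_pop = rng.uniform(10.0, 200.0, size=N)
X_pop = np.column_stack([np.ones(N), x_pop])
y_pop = 5.0 + 3.0 * x_pop + rng.normal(scale=np.sqrt(x_pop))
pi_pop = 60.0 * np.sqrt(x_pop) / np.sqrt(x_pop).sum()
s = rng.random(N) < pi_pop
lev = leverage_adjusted_estimators(y_pop[s], X_pop[s], pi_pop[s], x_pop[s],
                                   X_pop.sum(axis=0), N)
```

Since $0 \le h_{ii} < 1$ and $0 < \pi_i \le 1$, the ordering $v_{R2} \le v_D \le v_{J^*}$ holds in every sample, as do $v_{DP} \le v_D$ and $v_{JP} \le v_{J^*}$.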

4. Simulation Results

To check the performance of the variance estimators, we conducted several simulation studies using three different populations. The first is the Hospitals population listed in Valliant, Dorfman, and Royall (2000, App. B). The second population is the Labor Force population described in Valliant (1993). The third is a modification of the Labor Force population. In all three populations, sampling is done without replacement, as described below. These sampling plans will test the notion that variance estimators motivated, in part, by with-replacement designs can still be useful when applied to without-replacement designs.

The Hospitals population has $N = 393$ and a single auxiliary value $x$, which is the number of inpatient beds in each hospital. The $Y$ variable is the number of patients discharged during a particular time period. The GREG estimator for this population is based on the model $E_M(Y) = \beta_1 x^{1/2} + \beta_2 x$, $\mathrm{var}_M(Y) = \sigma^2 x$. Samples of size 50 and 100 were selected using simple random sampling without replacement (srswor) and probability proportional to size (pps) without replacement with the size being the square root of $x$. For each combination of selection method and sample size, 3000 samples were selected. The estimators $\hat{\bar{Y}}_G$, $v_\pi$, $v_{R1}$, $v_{R2}$, $v_{SSW}$, $v_D$, $v_{DP}$, $v_J$, $v_{JP}$, and $v_{J^*}$ were calculated for each sample. For comparison we also included the $\pi$-estimator, $\hat{\bar{Y}}_\pi = \hat{T}_\pi/N$. The variance estimator $v_T$ was included but is not reported here since results were little different from $v_{R2}$.

The Labor Force population contains 10,841 persons. The auxiliary variables used were age, sex, and number of hours worked per week. The $Y$ variable was total weekly wages. Age was grouped into four categories: 19 years and under, 20-24, 25-34, and 35 or more. The model for the GREG included an intercept, main effects for age and sex, and the quantitative variable,

hours worked. A constant model-variance was used. Samples of size 50, 100, and 250 were selected. The two selection methods used were srswor and sampling without replacement with probability proportional to hours worked. (This population has some clustering but this was ignored in these simulations.)

The third population was a version of Labor Force designed to inject some outliers or skewness into the weekly wages variable. We denote this new version as LF(mod) for reference. In the original Labor Force population, weekly wages were top-coded at \$999. For each such top-coded wage, a new wage was generated equal to \$1000 plus a lognormal random variable whose distribution had scale and shape parameters of 6.9 and 1. Recoded wages were generated for 4.4% of the population. Prior to recoding, the annualized mean wage was \$19,359, and the maximum was \$51,948; after recoding, the mean was \$23,103 and the maximum was \$608,116. Thus, LF(mod) exhibits more of the skewness in income that would be found in a real population.

The resulting LF(mod) distribution is shown in Figure 1 where weekly wages is plotted against hours worked for subgroups defined by age. In each panel the black points are for males while the open circles are for females. A horizontal reference line is drawn in each panel at \$999. Although there is a considerable amount of over-plotting, the general features are clear. Wage levels and spread go up as age increases, hours worked per week is related, though somewhat weakly, to wages, and wages are most skewed for age groups 25-34 and 35+. Less evident is the fact that wages for males are generally higher than ones for females.

Table 1 shows the empirical percentage relative biases, defined as the average over the samples of $(\hat{T} - T)/T$, for the $\pi$-estimator and general regression estimator for the various populations and sample sizes. Root mean square errors (rmse's), defined as the square root of

the average over the samples of $(\hat{T} - T)^2$, are also shown. In the Hospitals population, both estimators have negligible bias at either sample size. The GREG is considerably more efficient in Hospitals than the $\pi$-estimator because of a strong relationship of $Y$ to $x$. In the two Labor Force populations, both the $\pi$-estimator and the GREG are nearly unbiased while the GREG is somewhat more efficient as measured by the rmse for all sample sizes and selection methods.

Table 2 lists the empirical relative biases (relbiases) of the nine variance estimators, defined as $100(\bar{v} - mse)/mse$, where $\bar{v}$ is the average of a variance estimator over the 3000 samples and $mse$ is the empirical mean square error of the GREG. The rows of the table are sorted by the size of the relbias in LF(mod) for srswor's of size 50, although the ordering would be similar for the other populations, sample sizes, and selection methods. In the Hospitals population, the sampling fraction is substantial, especially when $n = 100$. As might be expected, this results in the estimators that omit any type of finite population correction (fpc), namely $v_{R2}$, $v_D$, $v_J$, and $v_{J^*}$, being severe over-estimates in either srswor or pps samples. Because $v_{R1}$ lacks a term to reflect the model-variance of the nonsample sum, it under-estimates the mse badly when the sampling fraction is large.

In the Labor Force and LF(mod) populations, increasing sample size leads to decreasing bias. The estimators $v_\pi$, $v_{R1}$, $v_{R2}$, and $v_{SSW}$ have negative biases that tend to be less severe as the sample size increases. The jackknife $v_J$ and its variants, $v_{J^*}$ and $v_{JP}$, are over-estimates, especially at $n = 50$. The estimators $v_D$ and $v_{DP}$ are more nearly unbiased at each of the sample sizes than most of the other estimators.

The empirical coverages of 95% confidence intervals across the 3000 samples in each set are shown in Table 3 for the Hospitals population. The three choices of variance estimator that

use the leverage adjustments but not fpc's ($v_D$, $v_J$, and $v_{J^*}$) are larger and, thus, have higher coverage rates than $v_\pi$, $v_{R2}$, and $v_{SSW}$. The tendency of the jackknife to be larger than other variance estimates for the GREG has also been noted by Stukel, Hidiroglou, and Särndal (1996). This is an advantage for the smaller sample size, $n = 50$. When $n = 100$ and the sampling fraction is large, the estimators with the fpc's ($v_\pi$, $v_{SSW}$, $v_{DP}$, and $v_{JP}$) have closer to the nominal 95% coverage rates while $v_{R2}$, $v_D$, $v_J$, and $v_{J^*}$ cover in about 97 or 98% of the samples. The estimator $v_{JP}$, which approximates the jackknife but includes an fpc, is a good choice at either sample size or sampling plan.

Tables 4 and 5 show the coverage rates for the Labor Force and LF(mod) populations. For the former, $v_{DP}$, $v_D$, $v_J$, $v_{JP}$, and $v_{J^*}$ are clearly better at $n = 50$ for both srswor and pps samples. But, for $n = 250$, coverage rates are similar for all estimators. The purely design-based estimator, $v_\pi$, is unsatisfactory at the smaller sample sizes for either sampling plan. As in Hospitals, $v_{JP}$ gives near nominal coverage at each sample size in the Labor Force population.

The most striking results in Tables 4 and 5 are for LF(mod) where all variance estimators give poor coverage. Coverages range from 78.0% for the combination ($v_\pi$, $n = 50$, srswor) to 90.7% for ($v_J$ and $v_{J^*}$, $n = 250$, pps). Virtually all cases of non-coverage are because $\left(\hat{\bar{Y}}_G - \bar{Y}\right)/v^{1/2} < -1.96$, where $v$ is any of the variance estimators. The poor coverage rates occur even though the $\pi$-estimator and GREG are unbiased over all samples (see Table 1) and, in the cases of $v_J$, $v_{JP}$, and $v_{J^*}$, the variance estimators are overestimates (see Table 2).
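The coverage calculation behind Tables 3 through 5 can be mimicked on a small scale. The sketch below is only a toy version under loudly stated assumptions: it uses a hypothetical, well-behaved simulated population rather than Hospitals or Labor Force, Bernoulli sampling rather than srswor or pps, 500 rather than 3000 samples, and only $v_{JP}$. The function names and default arguments are illustrative.

```python
import numpy as np

def greg_mean_and_vjp(y, X, pi, v, Tx, N):
    """GREG mean and the fpc-adjusted jackknife approximation v_JP."""
    w = 1.0 / (v * pi)
    A_inv = np.linalg.inv(X.T @ (X * w[:, None]))
    B_hat = A_inv @ (X.T @ (w * y))
    r = y - X @ B_hat
    h = np.einsum("ij,jk,ik->i", X, A_inv, X) * w          # leverages
    Tx_pi = X.T @ (1.0 / pi)
    g = 1.0 + (X @ A_inv @ (Tx - Tx_pi)) / v               # g-weights
    mean_hat = (np.sum(y / pi) + B_hat @ (Tx - Tx_pi)) / N
    v_jp = np.sum((1.0 - pi) * (g * r / (pi * (1.0 - h))) ** 2) / N**2
    return mean_hat, v_jp

def simulate_coverage(n_reps=500, N=2000, n_expected=100, seed=4):
    """Share of Bernoulli samples whose 95% interval covers the true mean."""
    rng = np.random.default_rng(seed)
    x_pop = rng.uniform(10.0, 200.0, size=N)               # hypothetical auxiliary
    X_pop = np.column_stack([np.ones(N), x_pop])
    y_pop = 5.0 + 3.0 * x_pop + rng.normal(scale=np.sqrt(x_pop))
    Tx, y_bar = X_pop.sum(axis=0), y_pop.mean()
    pi_pop = np.full(N, n_expected / N)
    hits = 0
    for _ in range(n_reps):
        s = rng.random(N) < pi_pop
        est, var = greg_mean_and_vjp(y_pop[s], X_pop[s], pi_pop[s],
                                     x_pop[s], Tx, N)
        hits += int(abs(est - y_bar) <= 1.96 * np.sqrt(var))
    return hits / n_reps

coverage_rate = simulate_coverage()
```

A skewed population in the spirit of LF(mod) could be substituted for `y_pop` to explore the under-coverage described above; the sketch makes no claim about what rates would result.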

Negative estimation errors, $\hat{\bar{Y}}_G - \bar{Y}$, occur in samples that include relatively few persons with large weekly wages. Figure 2 is a plot of t-statistics based on $v_{JP}$, i.e., $\left(\hat{\bar{Y}}_G - \bar{Y}\right)/v_{JP}^{1/2}$, versus the number of sample persons with weekly wages of \$1000 or more in sets of 1000 samples for (srswor; $n = 50, 100, 250$). The negative estimation errors in samples with few persons with high incomes lead to negative t-statistics, and confidence intervals that miss the population mean on the low side. The problem decreases with increasing sample size, but the convergence to the nominal coverage rates is slow and occurs from the bottom up. Regardless of the variance estimator used, coverage will be less than 95% unless the sample is quite large.

We also examined how well the variance estimators perform, conditional on sample characteristics. We present only results related to bias of the variance estimators to conserve space. For the Hospitals population, we sorted the samples based on $D_x = \mathbf{1}'\left(\hat{\mathbf{T}}_{x\pi} - \mathbf{T}_x\right)$, which is the sum of the differences of the $\pi$-estimates of the totals of $x^{1/2}$ and $x$ from their population totals. Twenty groups of 150 samples each were then formed. In each group, we computed the bias of $\hat{\bar{Y}}_G$ along with the rmse, and the square root of the average of each variance estimator. The results are plotted in Figure 3 for srswor with $n = 50$ and 100 and for pps with $n = 50$ and 100. A subset of the variance estimators is plotted. The horizontal axis in each panel gives values of $D_x$. Since $v_J$, $v_{J^*}$, $v_D$, and $v_{R2}$ are similar through most of the range of $D_x$, only the jackknife $v_J$ is plotted. Also, $v_{DP}$ and $v_{JP}$ are close, and only the latter is plotted. The GREG does have a conditional bias that affects the rmse in off-balance samples. The poor conditional properties of $v_\pi$ are most evident in the simple random samples where the bias of $v_\pi$ as an estimate of the mse runs from negative to positive over the range of $D_x$. Among the other

variance estimates, conditional biases are similar to the unconditional biases in Table 2. Both $v_{JP}$ and $v_{SSW}$ are in theory approximately design and model-unbiased, and both track the rmse well.

Figure 4 is a similar plot for the samples from the Labor Force population. The following sets of estimates are very similar and only the first in each set is included in the plots: $(v_{SSW}, v_{R1}, v_{R2})$, $(v_D, v_{DP})$, and $(v_J, v_{J^*}, v_{JP})$. Only the srswor and pps samples of size $n = 50$ and 250 are included. The horizontal axis is again $D_x$, which is the sum of differences between the $\pi$-estimates and the population values of the totals for age and sex groups and the number of hours worked per week. The conditional bias of $v_\pi$ is evident in samples with the smallest values of $D_x$ but the problem diminishes for the larger sample size in both srswor and pps samples. The jackknife $v_J$ is, on average, the largest of the variance estimators throughout the range of $D_x$. The differences among the variance estimates and their biases are less for the larger sample size. The estimators $v_D$, $v_{SSW}$, and $v_J$ all track the rmse reasonably well except when $D_x$ is most negative, where all are somewhat low.

5. Conclusion

A variety of estimators of the variance of the general regression estimator have been proposed in the sampling literature, mainly with the goal of estimating the design-based variance. Estimators can be easily constructed that are approximately unbiased for both the design-variance and, under certain models, the model-variance. Moreover, the dual-purpose estimators studied here are robust estimators of a model-variance even if the model that motivates the GREG has an incorrect variance parameter.

A key feature of the best of these estimators is the adjustment of squared residuals by factors analogous to the leverages used in standard regression analysis. The desirability of using leverage corrections to regression variance estimators in order to combat heteroscedasticity is well-known in econometrics, having been proposed by MacKinnon and White (1985) and recently revisited by Long and Ervin (2000). One of the best choices is an approximation to the jackknife, denoted here by $v_{JP}$, that includes a type of finite population correction.

The robust estimators studied here are quite useful for variables whose distributions are reasonably well behaved. They adjust variance estimators in small and moderate size samples in a way that often results in better confidence interval coverage. However, they are no defense when variables are extremely skewed, and large observations are not well represented in a sample. Whether one refers to this problem as one of skewness or of outliers, the effect is clear. A sample that does not include a sufficient number of units with large values will produce an estimated mean that is too small. A variance estimator that is small often accompanies the small estimated mean. As the simulations in section 4 illustrate, in such samples even the best of the proposed variance estimators will not yield confidence intervals that cover at the nominal rate. The transformation methods of Chen and Chen (1996) might hold some promise, but that approach would have to be tested for the more complex GREG estimators studied here.

The most effective solution to the skewness problem does not appear to be to make better use of the sample data. Rather, the sample itself needs to be designed to include good representation of the large units. In many cases, however, like a survey of households to measure income or capital assets, this may be difficult or impossible if auxiliary information closely related to the target variable is not available. Better use of the sample data employing models for skewed variables may then be useful (see, e.g., Karlberg 2000).

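ACKNOWLEDGMENT

The author is indebted to Alan Dorfman, whose ideas were the impetus for this work, and to the Associate Editor and two referees for their careful reviews.

APPENDIX: Details of Jackknife Calculations

Using (3.2), (3.3), and the standard matrix result given as a lemma in Valliant et al. (2000), we have

$$\mathbf{A}_{\pi s(i)}^{-1} = \mathbf{A}_{\pi s}^{-1} + \frac{\mathbf{A}_{\pi s}^{-1}\mathbf{x}_i\mathbf{x}_i'\mathbf{A}_{\pi s}^{-1}/(v_i\pi_i)}{1-h_i}.$$

From this and the definition of $\hat{\mathbf{B}}_{(i)}$, the slope estimator omitting unit $i$ is $\hat{\mathbf{B}}_{(i)} = \hat{\mathbf{B}} - \mathbf{Q}_i$, where

$$\mathbf{Q}_i = \frac{\mathbf{A}_{\pi s}^{-1}\mathbf{x}_i r_i}{v_i\pi_i(1-h_i)}.$$

The GREG estimator, after deleting unit $i$, is

$$\hat{T}_{G(i)} = \frac{n}{n-1}\left(\hat{T}_\pi - \frac{Y_i}{\pi_i}\right) + \left(\hat{\mathbf{B}} - \mathbf{Q}_i\right)'\left[\mathbf{T}_x - \frac{n}{n-1}\left(\hat{\mathbf{T}}_x - \frac{\mathbf{x}_i}{\pi_i}\right)\right].$$

After some rearrangement, this can be rewritten as

$$\hat{T}_{G(i)} = \frac{n}{n-1}\hat{T}_G - \frac{n}{n-1}\frac{g_i r_i}{\pi_i(1-h_i)} + \frac{n}{n-1}G_i + \frac{1}{n-1}K_i,$$

where

$$G_i = \frac{h_i Y_i - \hat{Y}_i}{\pi_i(1-h_i)}, \qquad K_i = \left(\hat{\mathbf{B}} - \mathbf{Q}_i\right)'\left(\frac{n\mathbf{x}_i}{\pi_i} - \mathbf{T}_x\right),$$

and $\hat{Y}_i = \mathbf{x}_i'\hat{\mathbf{B}}$ is the fitted value for unit $i$. Writing $D_i = g_i r_i/[\pi_i(1-h_i)]$, it follows that

$$\hat{T}_{G(i)} - \hat{T}_{G(\cdot)} = -\frac{n}{n-1}\left(D_i - \bar{D}_s\right) + \frac{n}{n-1}F_i,$$

where $F_i = (G_i - \bar{G}_s) + n^{-1}(K_i - \bar{K}_s)$, with $\bar{D}_s$, $\bar{G}_s$, and $\bar{K}_s$ being sample means with the obvious definitions. Substituting in the jackknife formula (3.1) gives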

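$$v_J\left(\hat{\bar{Y}}_G\right) = N^{-2}\frac{n}{n-1}\left[\sum_s \left(D_i - \bar{D}_s\right)^2 + \sum_s F_i^2 - 2\sum_s F_i\left(D_i - \bar{D}_s\right)\right]. \tag{A.1}$$

Formula (A.1) is exact, but with some further approximations we can get the relative sizes of the terms. Using the values of $G_i$ and $K_i$ above and the fact that $h_i$ and the elements of $\mathbf{Q}_i$ are $o(1)$, we have

$$n^{-1}\left(nG_i + K_i\right) = \frac{h_i Y_i - \hat{Y}_i}{\pi_i(1-h_i)} + \left(\hat{\mathbf{B}} - \mathbf{Q}_i\right)'\left(\frac{\mathbf{x}_i}{\pi_i} - \frac{1}{n}\mathbf{T}_x\right) \approx -\frac{\hat{Y}_i}{\pi_i} + \hat{\mathbf{B}}'\frac{\mathbf{x}_i}{\pi_i} - \frac{1}{n}\hat{\mathbf{B}}'\mathbf{T}_x = -\frac{1}{n}\hat{\mathbf{B}}'\mathbf{T}_x,$$

where $\approx$ denotes "asymptotically equivalent to" and the last equality uses $\hat{Y}_i = \mathbf{x}_i'\hat{\mathbf{B}}$. The right-hand side does not depend on $i$, so it cancels when the sample mean is subtracted. It follows that $F_i \approx 0$ and that

$$v_J\left(\hat{\bar{Y}}_G\right) \approx N^{-2}\sum_s \left(D_i - \bar{D}_s\right)^2,$$

i.e., (3.5) holds.

Next, we can show that the second term in (3.5) converges in probability to zero. The vector of residuals can be expressed as $\mathbf{r}_s = (\mathbf{I} - \mathbf{H}_s)\mathbf{Y}_s$, and the second term in (3.5) is equal to

$$N^{-2}n^{-1}\,\mathbf{g}_s'\boldsymbol{\Pi}_s^{-1}\mathbf{U}_s\mathbf{r}_s\mathbf{r}_s'\mathbf{U}_s\boldsymbol{\Pi}_s^{-1}\mathbf{g}_s,$$

where $\mathbf{U}_s = \mathrm{diag}\left[(1-h_i)^{-1}\right]$, $i \in s$. Thus, the second term in (3.5) is the square of $B = N^{-1}n^{-1/2}\,\mathbf{g}_s'\boldsymbol{\Pi}_s^{-1}\mathbf{U}_s\mathbf{r}_s$, which has expectation zero under any model with $E_M(r_i) = 0$. The model-variance of $B$ is

$$\mathrm{var}_M\left(N^{-1}n^{-1/2}\,\mathbf{g}_s'\boldsymbol{\Pi}_s^{-1}\mathbf{U}_s\mathbf{r}_s\right) = N^{-2}n^{-1}\,\mathbf{g}_s'\boldsymbol{\Pi}_s^{-1}\mathbf{U}_s\left(\mathbf{I} - \mathbf{H}_s\right)\mathbf{V}_s\left(\mathbf{I} - \mathbf{H}_s\right)'\mathbf{U}_s\boldsymbol{\Pi}_s^{-1}\mathbf{g}_s, \tag{A.2}$$

which has order of magnitude $n^{-2}$ under the assumptions we have made. Consequently, the second term in (3.5) is the square of a term with mean zero and a model-variance that approaches zero as the sample size increases. The second term in (3.5) then converges to zero by Chebyshev's inequality. This justifies (3.6).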
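The delete-one algebra in this appendix can be checked numerically. The sketch below is an illustration under stated assumptions, not the author's code: it assumes the usual GREG definitions (a weighted least squares slope with weights $1/(v_i\pi_i)$, leverages $h_i = \mathbf{x}_i'\mathbf{A}_{\pi s}^{-1}\mathbf{x}_i/(v_i\pi_i)$, and g-weights $g_i = 1 + (\mathbf{T}_x - \hat{\mathbf{T}}_x)'\mathbf{A}_{\pi s}^{-1}\mathbf{x}_i/v_i$), uses two auxiliary variables (an intercept and one covariate), and works with made-up data, inclusion probabilities, and population totals. All function names are hypothetical. It recomputes the GREG total with each unit deleted and the remaining weights scaled by $n/(n-1)$, then compares the result with the closed form built from $D_i$, $G_i$, and $K_i$.

```python
import random

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, u):
    return [m[0][0] * u[0] + m[0][1] * u[1], m[1][0] * u[0] + m[1][1] * u[1]]

def dot(u, w):
    return u[0] * w[0] + u[1] * w[1]

def wls(x, y, wt):
    """Return (A, B) with A = sum wt_i x_i x_i' and B = A^{-1} sum wt_i x_i y_i."""
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for xi, yi, wi in zip(x, y, wt):
        for j in range(2):
            b[j] += wi * xi[j] * yi
            for k in range(2):
                A[j][k] += wi * xi[j] * xi[k]
    return A, matvec(inv2(A), b)

def greg_total(x, y, pi, v, Tx, scale=1.0):
    """GREG total T_pi + B'(Tx - Tx_hat) using design weights scale / pi_i."""
    d = [scale / p for p in pi]
    _, B = wls(x, y, [di / vi for di, vi in zip(d, v)])
    T_pi = sum(di * yi for di, yi in zip(d, y))
    Tx_hat = [sum(di * xi[j] for di, xi in zip(d, x)) for j in range(2)]
    return T_pi + dot(B, [Tx[j] - Tx_hat[j] for j in range(2)])

# Made-up sample: x_i = (1, z_i), model variance proportional to z_i.
random.seed(1)
n = 12
z = [random.uniform(1.0, 10.0) for _ in range(n)]
x = [[1.0, zi] for zi in z]
v = z[:]
pi = [0.02 + 0.01 * zi for zi in z]      # assumed inclusion probabilities
y = [3.0 + 2.0 * zi + random.gauss(0.0, 1.0) * zi ** 0.5 for zi in z]
Tx = [200.0, 1100.0]                     # assumed population totals (N, total of z)

A, B = wls(x, y, [1.0 / (vi * p) for vi, p in zip(v, pi)])
A_inv = inv2(A)
Tx_hat = [sum(xi[j] / p for xi, p in zip(x, pi)) for j in range(2)]
gap = [Tx[j] - Tx_hat[j] for j in range(2)]
T_G = greg_total(x, y, pi, v, Tx)
c = n / (n - 1)

brute, closed = [], []
for i in range(n):
    # Brute force: drop unit i and refit with weights scaled by n/(n-1).
    keep = [j for j in range(n) if j != i]
    brute.append(greg_total([x[j] for j in keep], [y[j] for j in keep],
                            [pi[j] for j in keep], [v[j] for j in keep],
                            Tx, scale=c))
    # Closed form from full-sample quantities only.
    Ax = matvec(A_inv, x[i])
    h = dot(x[i], Ax) / (v[i] * pi[i])
    y_hat = dot(x[i], B)
    r = y[i] - y_hat
    g = 1.0 + dot(gap, Ax) / v[i]
    Q = [a * r / (v[i] * pi[i] * (1.0 - h)) for a in Ax]
    D = g * r / (pi[i] * (1.0 - h))
    G = (h * y[i] - y_hat) / (pi[i] * (1.0 - h))
    K = dot([B[j] - Q[j] for j in range(2)],
            [n * x[i][j] / pi[i] - Tx[j] for j in range(2)])
    closed.append(c * T_G - c * D + c * G + K / (n - 1))

max_gap = max(abs(a - b) for a, b in zip(brute, closed))
print(max_gap)  # differences should be at rounding-error level
```

The closed form needs only one fit of the full sample, which is why the leverage factors $(1 - h_i)$ appear implicitly in the delete-one jackknife.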

REFERENCES

BELSLEY, D.A., KUH, E., AND WELSCH, R.E. (1980). Regression Diagnostics. New York: John Wiley & Sons.

BREWER, K.R.W. (1995). Combining design-based and model-based inference. Chapter 30 in Business Survey Methods (Eds. B.G. Cox, D.A. Binder, B.N. Chinnappa, A. Christianson, M.J. Colledge, and P.S. Kott). New York: John Wiley.

BREWER, K.R.W. (1999). Cosmetic calibration with unequal probability sampling. Survey Methodology, 25.

CHEN, G., AND CHEN, J. (1996). A transformation method for finite population sampling calibrated with empirical likelihood. Survey Methodology, 22.

DUCHESNE, P. (2000). A note on jackknife variance estimation for the general regression estimator. Journal of Official Statistics, 16.

HIDIROGLOU, M.A., FULLER, W.A., AND HICKMAN, R.D. (1980). SUPERCARP. Department of Statistics. Ames, Iowa: Iowa State University.

KARLBERG, F. (2000). Survey estimation for highly skewed populations in the presence of zeroes. Journal of Official Statistics, 16.

KOTT, P.S. (1990). Estimating the conditional variance of a design consistent regression estimator. Journal of Statistical Planning and Inference, 24.

KREWSKI, D., AND RAO, J.N.K. (1981). Inference from stratified samples: properties of the linearization, jackknife, and balanced repeated replication methods. Annals of Statistics, 9.

LONG, J.S., AND ERVIN, L.H. (2000). Using heteroscedasticity consistent standard errors in the linear regression model. The American Statistician, 54.

MACKINNON, J.G., AND WHITE, H. (1985). Some heteroskedastic consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics, 29.

RAO, J.N.K., AND WU, C.J.F. (1985). Inference from stratified samples: second-order analysis of three methods for nonlinear statistics. Journal of the American Statistical Association, 80.

ROYALL, R.M., AND CUMBERLAND, W.G. (1978). Variance estimation in finite population sampling. Journal of the American Statistical Association, 73.

ROYALL, R.M., AND CUMBERLAND, W.G. (1981). An empirical study of the ratio estimator and estimators of its variance. Journal of the American Statistical Association, 76.

SÄRNDAL, C.-E. (1996). Efficient estimators with simple variance in unequal probability sampling. Journal of the American Statistical Association, 91.

SÄRNDAL, C.-E., SWENSSON, B., AND WRETMAN, J. (1989). The weighted residual technique for estimating the variance of the general regression estimator. Biometrika, 76.

SÄRNDAL, C.-E., SWENSSON, B., AND WRETMAN, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.

SÄRNDAL, C.-E., AND WRIGHT, R. (1984). Cosmetic form of estimators in survey sampling. Scandinavian Journal of Statistics, 11.

STUKEL, D., HIDIROGLOU, M.A., AND SÄRNDAL, C.-E. (1996). Variance estimation for calibration estimators: a comparison of jackknifing versus Taylor linearization. Survey Methodology, 22.

SUGDEN, R.A., AND SMITH, T.M.F. (1984). Ignorable and informative designs in survey sampling inference. Biometrika, 71.

VALLIANT, R. (1993). Poststratification and conditional variance estimation. Journal of the American Statistical Association, 88.

VALLIANT, R., DORFMAN, A.H., AND ROYALL, R.M. (2000). Finite Population Sampling and Inference: A Prediction Approach. New York: John Wiley & Sons.

WHITE, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1-25.

YUNG, W., AND RAO, J.N.K. (1996). Jackknife linearization variance estimators under stratified multi-stage sampling. Survey Methodology, 22.

YUNG, W., AND RAO, J.N.K. (2000). Jackknife variance estimation under imputation for estimators using poststratification information. Journal of the American Statistical Association, 95.

Figure Titles

Figure 1. Scatterplots of Weekly Wages versus Hours Worked per Week in Four Age Groups for the LF(mod) population. Open circles are for females. Black circles are for males. A horizontal line is drawn at $999 per week, the maximum value in the original Labor Force population.

Figure 2. Plot of t-statistics versus the number of sample persons with weekly wages greater than $1000 in the sets of 1000 simple random samples of size n = 50, 100, 250 from the LF(mod) population. Horizontal reference lines are drawn at ±1.96. Points are jittered to minimize overplotting.

Figure 3. Plot of conditional biases, rmse's, and means of standard error estimates of the GREG for the samples from the Hospitals population. Horizontal and vertical reference lines are drawn at 0. The lowest curve in each panel is the bias of the GREG. The thick solid line is the conditional root mean square error.

Figure 4. Plot of conditional biases, rmse's, and means of standard error estimates of the GREG for the samples from the Labor Force population. Horizontal and vertical reference lines are drawn at 0. The lowest curve in each panel is the bias of the GREG. The thick solid line is the conditional root mean square error.

Table 1. Relative biases and root mean square errors (rmse's) of the π-estimator and the general regression estimator in different simulation studies of 3000 samples each.

Columns: Hospitals (n = 50, n = 100); Labor Force (n = 50, n = 100, n = 250); LF(mod) (n = 50, n = 100, n = 250).
Rows, simple random samples: $\hat{\bar{Y}}_\pi$ (Relbias (%), rmse); $\hat{\bar{Y}}_G$ (Relbias (%), rmse).
Rows, probability proportional to size samples: $\hat{\bar{Y}}_\pi$ (Relbias (%), rmse); $\hat{\bar{Y}}_G$ (Relbias (%), rmse).

Table 2. Relative biases of nine variance estimators for the general regression estimator in different simulation studies of 3000 samples each.

Columns: Hospitals (n = 50, n = 100); Labor Force (n = 50, n = 100, n = 250); LF(mod) (n = 50, n = 100, n = 250).
Rows, simple random samples: the nine variance estimators ($v_\pi$, $v_{SSW}$, $v_{R1}$, $v_{R2}$, $v_D$, $v_{DP}$, $v_J$, $v_J^{*}$, $v_{JP}$).
Rows, probability proportional to size samples: the same nine variance estimators.

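Table 3. 95% confidence interval coverage rates for simulations using the Hospitals population and nine variance estimators. Simple random samples were selected without replacement for samples of size 50 and 100. L is the percent of samples with $(\hat{\bar{Y}}_G - \bar{Y})/v^{1/2} < -1.96$; M is the percent with $|\hat{\bar{Y}}_G - \bar{Y}|/v^{1/2} \le 1.96$; U is the percent with $(\hat{\bar{Y}}_G - \bar{Y})/v^{1/2} > 1.96$.

Columns: n = 50 (L, M, U); n = 100 (L, M, U).
Rows, simple random samples: the nine variance estimators ($v_\pi$, $v_{SSW}$, $v_{R1}$, $v_{R2}$, $v_D$, $v_{DP}$, $v_J$, $v_J^{*}$, $v_{JP}$).
Rows, probability proportional to size samples: the same nine variance estimators.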

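Table 4. 95% confidence interval coverage rates for simulations using the Labor Force and LF(mod) populations and nine variance estimators. Simple random samples were selected without replacement for samples of size 50, 100, and 250. L is the percent of samples with $(\hat{\bar{Y}}_G - \bar{Y})/v^{1/2} < -1.96$; M is the percent with $|\hat{\bar{Y}}_G - \bar{Y}|/v^{1/2} \le 1.96$; U is the percent with $(\hat{\bar{Y}}_G - \bar{Y})/v^{1/2} > 1.96$.

Columns: n = 50 (L, M, U); n = 100 (L, M, U); n = 250 (L, M, U).
Rows, Labor Force: the nine variance estimators ($v_\pi$, $v_{SSW}$, $v_{R1}$, $v_{R2}$, $v_D$, $v_{DP}$, $v_J$, $v_J^{*}$, $v_{JP}$).
Rows, LF(mod): the same nine variance estimators.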

Table 5. 95% confidence interval coverage rates for simulations using the Labor Force and LF(mod) populations and nine variance estimators. Probability proportional to size samples were selected without replacement for samples of size 50, 100, and 250. L is the percent of samples with $(\hat{\bar{Y}}_G - \bar{Y})/v^{1/2} < -1.96$; M is the percent with $|\hat{\bar{Y}}_G - \bar{Y}|/v^{1/2} \le 1.96$; U is the percent with $(\hat{\bar{Y}}_G - \bar{Y})/v^{1/2} > 1.96$.

Columns: n = 50 (L, M, U); n = 100 (L, M, U); n = 250 (L, M, U).
Rows, Labor Force: the nine variance estimators ($v_\pi$, $v_{SSW}$, $v_{R1}$, $v_{R2}$, $v_D$, $v_{DP}$, $v_J$, $v_J^{*}$, $v_{JP}$).
Rows, LF(mod): the same nine variance estimators.
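The L, M, and U columns of the coverage tables are simple tallies of t-statistics across simulated samples. A minimal sketch of that tally follows, assuming each simulated sample supplies a point estimate and a variance estimate. The function name `coverage_rates` and the toy inputs are hypothetical and are not taken from the simulations reported here.

```python
import math

def coverage_rates(estimates, variances, truth, z=1.96):
    """Return (L, M, U): the percent of samples whose t-statistic
    (estimate - truth) / sqrt(variance) falls below -z, within [-z, z],
    or above z."""
    low = mid = high = 0
    for est, var in zip(estimates, variances):
        t = (est - truth) / math.sqrt(var)
        if t < -z:
            low += 1
        elif t > z:
            high += 1
        else:
            mid += 1
    total = low + mid + high
    return tuple(100.0 * k / total for k in (low, mid, high))


# Toy input: four "samples" whose t-statistics are -1, 0, 1 and 4.
L, M, U = coverage_rates([9.0, 10.0, 11.0, 14.0], [1.0, 1.0, 1.0, 1.0], truth=10.0)
print(L, M, U)  # 0.0 75.0 25.0
```

A nominal 95% interval corresponds to M near 95 with L and U each near 2.5. An L value well above 2.5 means that estimates falling below the true mean are paired with variance estimates that are too small, which is the pattern described for the skewed LF(mod) population.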

[Figure 1: four panels (Age <= 19, Age 20-24, Age 25-34, Age 35+); vertical axis Wages, horizontal axis Hours.]

[Figure 3: panels for srs and pps samples of size n = 50 and n = 100; curves labeled v.p, v.r1, vj.star.p, v.ssw, vj; horizontal axis Tx.hat - Tx.]

[Figure 4: panels for srs and pps samples of size n = 50 and n = 250; curves labeled v.p, vd, v.ssw, vj; horizontal axis Tx.hat - Tx.]
