Maximizing Overlap of Large Primary Sampling Units in Repeated Sampling: A Comparison of Ernst's Method with Ohlsson's Method

Reid Rottach and Padraic Murphy¹
U.S. Census Bureau
4600 Silver Hill Road, Washington DC 20233
padraic.a.murphy@census.gov, reid.a.rottach@census.gov

Abstract

Many large repeated or continuous demographic surveys employ a multi-stage design where large geographic areas (such as counties or clusters of contiguous counties) are sampled in the first or primary stage. Usually, a new sample of these primary sample units (PSUs) is selected periodically in order to account for changes in population, survey objectives, or other considerations. But because hiring and training new interviewers can be expensive, and replacing experienced interviewers with inexperienced ones may have an adverse effect on data quality, there is often a strong incentive to retain as many as possible of the PSUs from the old sample design when selecting the new PSU sample. At the same time, one wishes to also retain the advantages of having a probability sample. Various methods have been proposed to coordinate repeated samples with these two considerations in mind. This paper discusses and compares two such methods. The first method, due to Ernst (1986), has been used for demographic surveys at the U.S. Census Bureau. This method does not require independent sampling between strata in the previous design, and is cast as a constrained optimization problem, so in some respect the solution is optimal. The second method, due to Ohlsson (1996, 2001), uses exponential sampling, and does have the requirement of independent sampling; but it may be used repeatedly because it does not destroy independence in the current design.

Key Words: Repeated Sampling, Coordinated Sampling, Maximizing Overlap, Exponential Sampling, Permanent Random Numbers (PRNs)

1. Introduction

The Census Bureau is currently in the research phase of a sample redesign for several major demographic surveys. Sample will be selected following the 2010 Census. One of the areas of research is that of maximum overlap of its PSUs.
We define a method of maximum overlap as one that increases the probability of reselecting PSUs already in sample compared to independent selections, while maintaining unconditional probability proportional to size (pps) sampling. We are interested in comparing overlap procedures that would be suitable given the constraint that they can be used repeatedly across multiple designs.

The method of Ernst (1986) was first used at the Census Bureau following the 1980 redesign, and has been used in the 1990 and 2000 redesigns as well. The method of Ohlsson (1996, 2001) was an important development since it appears to be the only method that does not lead to dependent selections in the current design. This is at the heart of how Ohlsson's method satisfies the requirement for repeated use, whereas Ernst's method satisfies the requirement by not requiring independent sampling from stratum to stratum in the old design.

Our interest in presenting a direct comparison between the two methods comes from this feature they have in common, and from the lack of a direct numerical comparison of the two methods in statistical literature. Ernst (1999) discusses several different features of several methods of overlap, although he does not include their expected overlaps. Ohlsson (1996) compares the expected overlaps of several different methods, but he does not include Ernst, saying that in part it was to avoid linear programming.

For our numerical comparisons we use data from the previous two redesigns of the Current Population Survey (CPS), in which we formed and restratified PSUs following the 1990 and 2000 Censuses.

¹ This report is released to inform interested parties of ongoing research and to encourage discussion of work in progress. The views expressed on statistical issues are those of the author and not necessarily those of the U.S. Census Bureau.

2. PSU creation, stratification, and probabilities of selection

The primary motivation for PSU creation is to form areas that allow manageable interviewer workloads. Many PSUs are single counties, although they may be formed from any number of contiguous counties, or in some cases, county-equivalents. The PSUs are then stratified into like groups, such as by choosing the stratification that minimizes a sampling variance. PSUs are then assigned probabilities of selection that are proportional to size. For surveys that select one PSU per stratum, this is the measure of size of the PSU divided by the measure of size of the stratum. Otherwise, for the selection of two PSUs, the probability of selection is twice the measure of size of the PSU divided by the measure of size of the stratum; this is appropriate for without replacement sampling. Within each stratum, the joint selection probabilities for selecting pairs of PSUs are controlled using Durbin's formula (1967).

When selecting PSUs, we restrict ourselves to pps sampling, but do not necessarily constrain joint probabilities of selection of PSUs in different strata. In fact, we may follow an approach that leads to unknown joint probabilities of selection.

3. Overlap

We define overlap to be an indicator of whether a PSU, or some portion of the PSU, was in sample in two consecutive designs. For the current design, the sum of these indicator variables is the number of PSUs that were sampled in the previous design. Expected overlap is the expected value of the number of PSUs selected in both designs. This variable is defined at the stratum level for the current design. It does not depend on any realization in the old or new designs, but integrates over all possible outcomes.
From this, we may present the expected number of continuing PSUs (those sampled in both designs), which would be the sum of expected overlaps, or similarly, an average expected overlap.

Our working definition of maximum overlap is a method of sampling PSUs that:
- Is a probability sample; that is, has known selection probabilities
- Has a higher average expected overlap than sampling independently from the previous design

4. Sampling PSUs

4.1 Notation

In this paper we will identify PSUs as though their definition had not changed across designs, although in fact that will not be the case. The PSUs that changed definition were divided into pieces, and these pieces were treated as PSUs for the sake of overlap.

For a given stratum in the new design:
- $i$ represents a PSU
- $\pi_i$ is its probability of selection in the new design
- $p_i$ was its probability of selection in the old design
- Sums indexed by $i$ are over all PSUs in the new design stratum

4.2 Independent Sampling (A Lower Bound for Expected Overlap)

The overlap procedures we examine will perform at least as well as independent sampling in each stratum, so the expected overlap of independent sampling is an obvious lower bound. Furthermore, we would like to consider the possibility of using this approach if we can't show there are real benefits to using maximum overlap procedures. With independent selection, we ignore the outcome of the previous design when selecting PSUs in the new design, so for each PSU the probability it is in both designs is the product of its two selection probabilities. For each new design stratum, the expected overlap for independent sampling is:

$$\text{overlap}_{ind} = \sum_i p_i \pi_i$$

4.3 Poisson Sampling (An Upper Bound for Expected Overlap)

If we allowed variable sample sizes, we could implement a Poisson sampling procedure that would achieve an expected overlap higher than the procedures we are considering. Poisson sampling refers to an approach in which each PSU is selected independently of every other PSU in the stratum. That is, the PSUs are subjected to independent Bernoulli trials, in which the expected number of PSUs selected is a sum of the probabilities of selection. So, for example, if we were to select an expected one PSU per stratum, we may end up with some strata with no PSUs in sample, as well as strata with multiple PSUs.

Brewer, Early, and Joyce (1972) discuss an approach to sampling in which a PRN from a uniform [0,1] distribution is assigned to every PSU, and the PSU is selected if the PRN is less than the target number of PSUs times the probability of selecting that PSU. Using these PRNs in the next design will result in a maximum overlap approach to sampling, and one that is in fact optimal. Following an approach other than Poisson sampling, in which we add the constraint of a fixed sample size, will lead to an expected overlap no greater than that of the Poisson approach. We discuss Poisson sampling only as an upper bound for the expected overlap of the methods we will consider.
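The PRN-based Poisson selection described above, together with the independent-sampling overlap of Section 4.2, can be sketched in Python. This is only an illustrative sketch: the function names, the three-PSU stratum, its probabilities, and the PRN values are hypothetical, and the stratum is assumed to contain the same PSUs in both designs with a target of one PSU.

```python
def poisson_select(prns, probs, n=1):
    """Return every PSU whose PRN falls below n times its selection probability."""
    return [psu for psu in probs if prns[psu] < n * probs[psu]]


def independent_overlap(old_p, new_pi):
    """Expected overlap when the new sample ignores the old one: sum of p_i * pi_i."""
    return sum(old_p[psu] * new_pi[psu] for psu in new_pi)


def poisson_overlap(old_p, new_pi):
    """Expected overlap when both Poisson samples reuse the same PRNs (n = 1).

    A PSU is in both samples exactly when its PRN is below both probabilities,
    which happens with probability min(p_i, pi_i).
    """
    return sum(min(old_p[psu], new_pi[psu]) for psu in new_pi)


# Hypothetical stratum (illustration only, not CPS data).
old_p = {"A": 0.5, "B": 0.3, "C": 0.2}      # old-design probabilities p_i
new_pi = {"A": 0.4, "B": 0.4, "C": 0.2}     # new-design probabilities pi_i
prns = {"A": 0.35, "B": 0.38, "C": 0.90}    # permanent random numbers

old_sample = poisson_select(prns, old_p)
new_sample = poisson_select(prns, new_pi)
print(old_sample, new_sample)
print(independent_overlap(old_p, new_pi), poisson_overlap(old_p, new_pi))
```

With these hypothetical values the old sample contains one PSU and the new sample contains two, which illustrates the variable sample size that rules out Poisson sampling as a practical design here.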
For each new design stratum, the expected overlap for Poisson sampling is:

$$\text{overlap}_{poi} = \sum_i \min(p_i, \pi_i)$$

4.4 Ernst's Method

Ernst's method is a variant of an approach outlined in Causey, Cox, and Ernst (CCE, 1985). These authors address the problem of constrained optimization directly, in which the expected overlap is maximized using numerical techniques subject to the constraints on sample size and probabilities of selection. So, CCE is truly optimal, but has the drawback that it can only be used once for pps sampling since it requires the knowledge of joint probabilities of selection. These are difficult enough to determine after it is implemented that we consider them effectively unknown.

The way in which Ernst (1986) avoids the requirement of independent sampling from stratum to stratum is by selecting only one stratum from the old design to overlap with, similar to an earlier method described in Perkins (1970). Essentially, the expected overlap is optimized given the requirement that we will select just one stratum in the old design to overlap with. It is superior to Perkins' procedure in this respect, but it is not necessarily optimal among a broader class of overlapping algorithms. Using Ernst's method, the old design stratum is chosen probabilistically, with the probabilities determined via the optimization procedure.

The expected overlap for Ernst's method is determined by the optimization procedure and does not have a closed form. The expected overlap is the value of the objective function we will maximize by linear programming (PROC LP in SAS).

4.5 Ohlsson's Method

As with Poisson sampling, Ohlsson's method uses PRNs. For a one-PSU per stratum design, the approach is to transform the uniformly distributed PRNs and select the PSU with the smallest assigned value. In particular, for a given PSU $i$ with PRN equal to $X_i$, the transformed number is $-\log(1 - X_i) / \pi_i$. It is very simple to implement, and correlates with the selections in the old design only through the PRN. Although not immediately apparent, it can be shown to be a method of maximizing overlap; it satisfies our constraint of being a probability sample that increases the expected overlap when compared to sampling independently from the old design.

For each new design stratum, the expected overlap for Ohlsson's method is

$$\text{overlap}_{ohl} = \sum_i \frac{p_i \pi_i}{\pi_i \sum_{j \in A_i} p_j + p_i \sum_{j \in A'_i} \pi_j}$$

where, for each $i$, we define the following:
- $D_i$ is the set of PSUs $\{j\}$ in the same old and new strata as unit $i$ that satisfy $\pi_j / \pi_i > p_j / p_i$.
- $A_i$ is the set of PSUs in the same old design stratum as $i$, but not in $D_i$.
- $A'_i$ is the set of PSUs in the same new stratum as unit $i$, except those units in $A_i$.

This approach has been expanded to the selection of $n > 1$ PSUs per stratum (Ohlsson, 1999), but that case will not be considered here.

4.6 A Hybrid Approach

As already discussed, if we are to maintain probability sampling we cannot use Ohlsson's method without first selecting independently. One option would be to phase out Ernst's method and phase in Ohlsson's across multiple designs, by selecting independently first in some states. For example, if half the states were selected using Ernst's method, and the other half independently, then the average expected overlap would be approximately halfway between that of the two methods.

5. Results

Table 1. Average Expected Overlap

Method                       Average Expected Overlap
Ernst                        60%
Ohlsson                      61%
Independent (Lower Bound)    35%
Poisson (Upper Bound)        81%
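As an illustration of the selection rule in Section 4.5, the following Python sketch selects one PSU per stratum by exponential sampling and evaluates the expected overlap in a simplified case. It is a sketch under stated assumptions rather than the procedure used for the results above: the function names, the three-PSU stratum, and the PRN values are hypothetical, the transform is taken to be $-\log(1 - X)$ divided by the PSU's selection probability, and the stratum is assumed to contain exactly the same PSUs in both designs.

```python
import math


def exponential_select(prns, probs):
    """Return the PSU with the smallest transformed PRN, -log(1 - X) / probability."""
    return min(probs, key=lambda psu: -math.log(1.0 - prns[psu]) / probs[psu])


def overlap_same_stratum(old_p, new_pi):
    """Expected overlap when the stratum holds the same PSUs in both designs.

    PSU i is selected in both designs when its transformed PRN is the smallest
    under both sets of probabilities. Because -log(1 - X) is a unit exponential
    variate, this has probability 1 / sum_j max(p_j / p_i, pi_j / pi_i).
    """
    total = 0.0
    for i in new_pi:
        denom = sum(max(old_p[j] / old_p[i], new_pi[j] / new_pi[i]) for j in new_pi)
        total += 1.0 / denom
    return total


# Hypothetical stratum (illustration only, not CPS data).
old_p = {"A": 0.5, "B": 0.3, "C": 0.2}      # old-design probabilities p_i
new_pi = {"A": 0.4, "B": 0.4, "C": 0.2}     # new-design probabilities pi_i
prns = {"A": 0.35, "B": 0.38, "C": 0.90}    # permanent random numbers, reused

old_choice = exponential_select(prns, old_p)
new_choice = exponential_select(prns, new_pi)
expected = overlap_same_stratum(old_p, new_pi)
print(old_choice, new_choice, round(expected, 4))
```

For this hypothetical stratum the same PSU is selected in both designs, and the expected overlap is about 0.88, which lies between the values that the independent and Poisson formulas give for the same probabilities (0.36 and 0.90).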

[Figure 1. Expected Overlap for 374 Non-self-representing Strata. Vertical axis: Ernst's Method (0 to 1); horizontal axis: Ohlsson's Method (0 to 1).]

The average expected overlaps of Ernst and Ohlsson were very close, at 60% and 61%, respectively. Independent selection was 35% on average, and the upper bound of Poisson sampling resulted in 81%.

It is interesting to note the differing distributions of expected overlap in Ernst and Ohlsson, as shown in Figure 1. The diagonal line represents equality of the two axes. Ohlsson's method seems to perform better at the lower end of the scale, while Ernst's method seems to perform better at the higher end. Lower expected overlaps may suggest larger strata relative to the PSU sizes, which may also be related to the number of strata that are overlapped with. A possible reason for Ohlsson's performing better at the lower end is that the method uses information from all strata overlapped in the old design, rather than having to select just one to overlap with. Ernst's method will be optimal when stratum definitions do not change, and it seems that in general the method will work better when there are fewer old design strata that overlap, which may explain why it performs better at the higher end.

References

Brewer, K.R.W., Early, L.J. and Joyce, S.F. (1972). Selecting several samples from a single population. Australian Journal of Statistics, 14, 231-239.

Durbin, J. (1967). Design of Multi-Stage Surveys for the Estimation of Sampling Errors. Applied Statistics, 16, 152-164.

Ernst, L.R. (1986). Maximizing the Overlap Between Surveys When Information Is Incomplete. European Journal of Operational Research, 27, 192-200.

Ernst, L.R. (1999). The Maximization and Minimization of Sample Overlap Problems: A Half Century of Results. International Statistical Institute, Proceedings, Invited Papers, IASS Topics, 168-182.

Ernst, L.R. (2000). Discussion Paper - Session 31: Coordinating Sampling Between and Within Surveys. The Second International Conference on Establishment Surveys. Alexandria VA: American Statistical Association, 265-267.

Ohlsson, E. (1996). Methods for PPS Size One Sample Coordination. Institute of Actuarial Mathematics and Mathematical Statistics, Stockholm University, No. 194.

Ohlsson, E. (1999). Comparison of PRN Techniques for Small Sample Size PPS Sample Coordination. Institute of Actuarial Mathematics and Mathematical Statistics, Stockholm University, No. 210.

Ohlsson, E. (2000). Coordination of PPS Samples Over Time. The Second International Conference on Establishment Surveys. Alexandria VA: American Statistical Association, 255-264.

Perkins, W.M. (1970). 1970 CPS Redesign: Proposed Method for Deriving Sample PSU Selection Probabilities Within 1970 NSR Strata. Memorandum to Joseph Waksberg, U.S. Bureau of the Census.