of being selected and varying such probability across strata under optimal allocation leads to increased accuracy.


Merilyn Patterson
5 years ago

5 Sampling with Unequal Probabilities

Simple random sampling and systematic sampling are schemes where every unit in the population has the same chance of being selected. We will now consider unequal probability sampling. We have encountered an example under stratified sampling, in which the units in stratum $l$ have chance $n_l/N_l$ of being selected, and varying this probability across strata under optimal allocation leads to increased accuracy.

5.1 Sampling with Replacement

Using with-replacement sampling simplifies the calculations, and if the sampling fraction is small this model should give a reasonable approximation to the exact behaviour of the estimators in without-replacement sampling.

Let $p_j$ denote the probability of selecting unit $y_j$ on the $i$th draw, so

$$P(Y_i = y_j) = p_j, \quad j = 1, \dots, N,$$

where $Y_i$ represents a random variable, not a total value. The first two moments are

$$E(Y_i) = \sum_{j=1}^{N} p_j y_j \quad\text{and}\quad \mathrm{Var}(Y_i) = \sum_{j=1}^{N} p_j y_j^2 - \Big(\sum_{j=1}^{N} p_j y_j\Big)^2.$$

The Hansen and Hurwitz estimator for the population total $Y$ is

$$\hat Y_{HH} = \frac{1}{n}\sum_{i=1}^{n} \frac{y_i}{p_i}.$$

This estimator is unbiased for $Y$ since, writing $Y_1, \dots, Y_N$ for the population values,

$$E(\hat Y_{HH}) = \frac{1}{n}\sum_{i=1}^{n} E\Big(\frac{Y_i}{p_i}\Big) = \frac{1}{n}\, n\, E\Big(\frac{Y_i}{p_i}\Big) = \frac{Y_1}{p_1}p_1 + \frac{Y_2}{p_2}p_2 + \dots + \frac{Y_N}{p_N}p_N = Y.$$

SydU STAT ) Second semester Dr J Chan
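The unbiasedness argument above can be checked numerically. The sketch below is only an illustration: the three-unit population and its selection probabilities are hypothetical (they do not come from the notes), and the function name `hansen_hurwitz_total` is introduced here for convenience. It enumerates every ordered with-replacement sample of size 2 and averages the Hansen-Hurwitz estimates, weighted by the probability of each sample.

```python
from itertools import product


def hansen_hurwitz_total(y, p):
    """Hansen-Hurwitz estimate of a population total.

    y: values observed on the n with-replacement draws
    p: per-draw selection probabilities of the drawn units (same order as y)
    """
    if len(y) != len(p) or len(y) == 0:
        raise ValueError("y and p must be non-empty and of equal length")
    return sum(yi / pi for yi, pi in zip(y, p)) / len(y)


# Hypothetical population (an assumption, not data from the notes).
pop_y = [4.0, 10.0, 6.0]
pop_p = [0.3, 0.4, 0.3]
n = 2

# Enumerate all ordered samples and accumulate probability-weighted estimates.
expected_estimate = 0.0
for draws in product(range(len(pop_y)), repeat=n):
    prob = 1.0
    for j in draws:
        prob *= pop_p[j]
    estimate = hansen_hurwitz_total([pop_y[j] for j in draws],
                                    [pop_p[j] for j in draws])
    expected_estimate += prob * estimate

print(expected_estimate, sum(pop_y))
```

The two printed numbers should agree up to floating-point error, whatever valid probabilities are chosen.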

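Under sampling with replacement the random variables $Y_i/p_i$ are independent, so

$$\mathrm{Var}(\hat Y_{HH}) = \frac{1}{n}\mathrm{Var}\Big(\frac{Y_i}{p_i}\Big) = \frac{1}{n}\Big(\sum_{i=1}^{N}\frac{Y_i^2}{p_i} - Y^2\Big) = \frac{1}{n}\sum_{i=1}^{N} p_i\Big(\frac{Y_i}{p_i} - Y\Big)^2$$

and is estimated by

$$\mathrm{var}(\hat y_{HH}) = \frac{1}{n}\Big[\frac{1}{n-1}\sum_{i=1}^{n}\Big(\frac{y_i}{p_i} - \hat Y_{HH}\Big)^2\Big] = \frac{1}{n(n-1)}\Big[\sum_{i=1}^{n}\Big(\frac{y_i}{p_i}\Big)^2 - n\hat Y_{HH}^2\Big].$$

How do we choose the $p_i$ to minimise the variance? The minimum is achieved if we set $p_i = Y_i/Y$, $i = 1, \dots, N$, since then

$$\mathrm{Var}(\hat Y_{HH}) = \frac{1}{n}\sum_{i=1}^{N} p_i\Big(\frac{Y_i}{p_i} - Y\Big)^2 = \frac{1}{n}\sum_{i=1}^{N} p_i (Y - Y)^2 = 0.$$

Of course we cannot use these values in practice, as not all $Y_i$ (and hence $Y$) are known. Instead we look for another variable, $X_i$, that is known and highly correlated with $Y_i$ to construct probability estimates. Set $p_j = X_j/X$ where $X = \sum_{j=1}^{N} X_j$. Then

$$\hat Y_{HH} = \frac{X}{n}\sum_{i=1}^{n}\frac{y_i}{x_i},$$

the mean-of-ratios estimator in probability proportional to size (PPS) sampling. Note that

$$\mathrm{Var}(\hat Y_{HH}) = \frac{X^2}{n}\mathrm{Var}\Big(\frac{y_i}{x_i}\Big),$$

where $\mathrm{Var}(y_i/x_i)$ is estimated by the sample variance of the $y_i/x_i$.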
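As a small numerical companion to the variance formulas, the sketch below evaluates the population variance of the Hansen-Hurwitz estimator and its sample estimate. The function names and the three-unit population are assumptions made for illustration only. It shows the variance vanishing when $p_i = Y_i/Y$ and being positive under equal probabilities.

```python
def hh_variance(pop_y, pop_p, n):
    """Population variance of the HH total: (1/n) * sum p_i (Y_i/p_i - Y)^2."""
    total = sum(pop_y)
    return sum(p * (y / p - total) ** 2 for y, p in zip(pop_y, pop_p)) / n


def hh_variance_estimate(y, p):
    """Sample estimate: sum (y_i/p_i - Yhat)^2 / (n (n - 1)), for n >= 2 draws."""
    n = len(y)
    if n < 2:
        raise ValueError("at least two draws are needed")
    ratios = [yi / pi for yi, pi in zip(y, p)]
    y_hat = sum(ratios) / n
    return sum((r - y_hat) ** 2 for r in ratios) / (n * (n - 1))


# Hypothetical population values (an assumption, not data from the notes).
pop_y = [4.0, 10.0, 6.0]
optimal_p = [v / sum(pop_y) for v in pop_y]   # p_i proportional to Y_i
equal_p = [1.0 / len(pop_y)] * len(pop_y)     # equal selection probabilities

print(hh_variance(pop_y, optimal_p, n=2))     # essentially zero
print(hh_variance(pop_y, equal_p, n=2))       # strictly positive
```

In practice the optimal probabilities are unavailable, which is why the notes replace $Y_i$ by a correlated size measure $X_i$.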

STAT014/914 Applied Stat-Sampling C5-Uneq Prob sample

When $p_i = 1/N$,

$$\hat Y_{HH} = \frac{1}{n}\sum_{i \in \text{sample}}\frac{y_i}{p_i} = \frac{N}{n}\sum_{i=1}^{n} y_i = N\bar y,$$

the total estimator in SRS.

A natural alternative to PPS sampling here would be to use the ratio estimator

$$\hat Y_R = X\,\frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}, \quad\text{whereas}\quad \hat Y_{HH} = \frac{X}{n}\sum_{i=1}^{n}\frac{y_i}{x_i}.$$

One is based on the ratio of the averages whilst the other is the average of the ratios.

Example (Half Ackroyd case): Consider the population of firms A, B and C only, with $N = 3$, $\bar Y = 23/3 = 7.667$ and $n = 2$. The following tables show the selection probabilities of a particular sampling procedure and the sample estimates.

Sample Outcomes and HH estimators

Firm   Sales x_i   Selection probability p_i = x_i/X   Employees y_i
A      13          13/34 = 0.3824                      9
B      12          12/34 = 0.3529                      8
C       9           9/34 = 0.2647                      6

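Sample Outcomes and HH Estimates

The estimate of the population mean from each outcome is $\hat{\bar Y}_{HH} = \frac{1}{nN}\sum_{i=1}^{n} y_i/p_i$.

Outcome   Probability                 Estimate
A,A       13/34 x 13/34 = 0.1462      (1/(2x3)) (9/(13/34) + 9/(13/34)) = 7.8461
A,B       13/34 x 12/34 = 0.1349      (1/(2x3)) (9/(13/34) + 8/(12/34)) = 7.7009
A,C       13/34 x  9/34 = 0.1012      (1/(2x3)) (9/(13/34) + 6/(9/34))  = 7.7009
B,A       12/34 x 13/34 = 0.1349      (1/(2x3)) (8/(12/34) + 9/(13/34)) = 7.7009
B,B       12/34 x 12/34 = 0.1246      (1/(2x3)) (8/(12/34) + 8/(12/34)) = 7.5556
B,C       12/34 x  9/34 = 0.0934      (1/(2x3)) (8/(12/34) + 6/(9/34))  = 7.5556
C,A        9/34 x 13/34 = 0.1012      (1/(2x3)) (6/(9/34) + 9/(13/34))  = 7.7009
C,B        9/34 x 12/34 = 0.0934      (1/(2x3)) (6/(9/34) + 8/(12/34))  = 7.5556
C,C        9/34 x  9/34 = 0.0701      (1/(2x3)) (6/(9/34) + 6/(9/34))   = 7.5556

Then the expected value and variance of these estimates are

$$E(\hat{\bar Y}_{HH}) = (7.8461)(0.1462) + \dots + (7.5556)(0.0701) = 7.667,$$

$$\mathrm{Var}(\hat{\bar Y}_{HH}) = (7.8461 - 7.667)^2(0.1462) + \dots + (7.5556 - 7.667)^2(0.0701) = 0.00997.$$

Note that the estimator is indeed unbiased and that the variance agrees with the direct calculation:

$$\mathrm{Var}(\hat{\bar Y}_{HH}) = \frac{1}{nN^2}\Big(\sum_{i=1}^{N}\frac{Y_i^2}{p_i} - Y^2\Big) = \frac{1}{2(9)}\Big(\frac{9^2}{13/34} + \frac{8^2}{12/34} + \frac{6^2}{9/34} - 23^2\Big) = 0.00997.$$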
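The enumeration in this example can be reproduced with exact arithmetic. The sketch below assumes the values read from the example, employees $y = (9, 8, 6)$ and sales $x = (13, 12, 9)$ for firms A, B, C with $n = 2$; these inputs are a reading of a partly illegible table and should be treated as an assumption. The variable names are introduced here for illustration.

```python
from fractions import Fraction as F
from itertools import product

# Assumed example data: employees y and sales x for firms A, B, C.
y = {"A": 9, "B": 8, "C": 6}
x = {"A": 13, "B": 12, "C": 9}
N, n = 3, 2
X = sum(x.values())
p = {f: F(x[f], X) for f in x}          # selection probabilities x_i / X


def hh_mean(sample):
    """HH estimate of the population mean: (1/(nN)) * sum y_i / p_i."""
    return sum(F(y[f]) / p[f] for f in sample) / (n * N)


# All 9 ordered with-replacement outcomes and their probabilities.
outcomes = list(product("ABC", repeat=n))
prob = {s: p[s[0]] * p[s[1]] for s in outcomes}

mean_of_estimates = sum(prob[s] * hh_mean(s) for s in outcomes)
var_of_estimates = sum(prob[s] * (hh_mean(s) - mean_of_estimates) ** 2
                       for s in outcomes)

# Direct formula: (1/(n N^2)) * (sum Y_i^2 / p_i - Y^2).
direct_var = (sum(F(y[f] ** 2) / p[f] for f in y)
              - sum(y.values()) ** 2) / (n * N ** 2)

print(float(mean_of_estimates), float(var_of_estimates), float(direct_var))
```

Using `Fraction` avoids rounding, so the enumerated variance and the direct formula can be compared for exact equality rather than to a tolerance.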

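If we use SRS,

$$\mathrm{Var}(\hat{\bar Y}_{srs}) = \Big(1 - \frac{n}{N}\Big)\frac{S_y^2}{n} = \Big(1 - \frac{2}{3}\Big)\frac{1}{2}\cdot\frac{(9 - 7.667)^2 + (8 - 7.667)^2 + (6 - 7.667)^2}{3 - 1} = 0.3889,$$

which is larger.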

5.2 Inclusion PPS (IPPS) Sampling

If we are sampling without replacement, the nature of the population changes after each selection, so if at each step we select using probabilities proportional to size, the overall scheme will not necessarily be PPS. One way to achieve a PPS scheme is to use systematic sampling, where only one random selection is necessary.

Let $\pi_i$ denote the probability that $y_i$ is selected in the sample, not just on the $j$-th draw. Note that $\pi_i$ is a first order inclusion (not selection) probability. The Horvitz-Thompson estimator for the population total $Y$ is

$$\hat Y_{HT} = \sum_{i=1}^{n}\frac{y_i}{\pi_i}.$$

Let the first order inclusion indicator be

$$I_i = 1 \text{ if } y_i \text{ is selected}, \quad I_i = 0 \text{ otherwise}.$$

Write

$$\hat Y_{HT} = \sum_{i=1}^{N} I_i\frac{y_i}{\pi_i}.$$

Using this form,

$$E(\hat Y_{HT}) = \sum_{i=1}^{N} E(I_i)\frac{y_i}{\pi_i} = \sum_{i=1}^{N}\pi_i\frac{y_i}{\pi_i} = Y,$$

so $\hat Y_{HT}$ is unbiased for $Y$.

The second order inclusion indicator is

$$I_{ij} = I_i I_j = 1 \text{ if both } y_i \text{ and } y_j \text{ are selected}, \quad I_{ij} = 0 \text{ otherwise}.$$

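Note that $I_{ii} = I_i$. Let the second order inclusion probability be $\pi_{ij} = E(I_{ij})$. Then

$$\mathrm{Var}(\hat Y_{HT}) = \mathrm{Var}\Big(\sum_{i=1}^{N} I_i\frac{y_i}{\pi_i}\Big) = \sum_{i=1}^{N}\mathrm{Var}(I_i)\frac{y_i^2}{\pi_i^2} + \sum_{i=1}^{N}\sum_{j=1, j\ne i}^{N}\mathrm{Cov}(I_i, I_j)\frac{y_i}{\pi_i}\frac{y_j}{\pi_j}$$

$$= \sum_{i=1}^{N}\frac{1-\pi_i}{\pi_i}y_i^2 + \sum_{i=1}^{N}\sum_{j=1, j\ne i}^{N}\frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j}y_i y_j,$$

since $\mathrm{Var}(I_i) = \pi_i(1 - \pi_i)$ and $\mathrm{Cov}(I_i, I_j) = E(I_i I_j) - E(I_i)E(I_j) = \pi_{ij} - \pi_i\pi_j$.

We estimate this via

$$\mathrm{var}(\hat y_{HT,1}) = \sum_{i=1}^{n}\frac{1-\pi_i}{\pi_i^2}y_i^2 + \sum_{i=1}^{n}\sum_{j=1, j\ne i}^{n}\frac{\pi_{ij} - \pi_i\pi_j}{\pi_{ij}}\frac{y_i}{\pi_i}\frac{y_j}{\pi_j},$$

since $1/\pi_i$ ($= N/n$, say) adjusts the sum of $N$ terms to the sum of $n$ terms, and $(\pi_{ij} - \pi_i\pi_j)/\pi_{ij} = (\pi_i - \pi_i^2)/\pi_i = 1 - \pi_i$ if $i = j$. Hence

$$\mathrm{var}(\hat y_{HT,1}) = \sum_{i=1}^{n}\frac{1-\pi_i}{\pi_i^2}y_i^2 + 2\sum_{i=1}^{n}\sum_{j>i}\frac{\pi_{ij} - \pi_i\pi_j}{\pi_{ij}\,\pi_i\pi_j}y_i y_j.$$

Note:

1. The first order inclusion probabilities satisfy

$$\sum_{i=1}^{N}\pi_i = \sum_{i=1}^{N}E(I_i) = E\Big(\sum_{i=1}^{N} I_i\Big) = n, \quad\text{whereas}\quad \sum_{i=1}^{N} p_i = 1.$$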
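A minimal sketch of the Horvitz-Thompson total and the variance estimator $\mathrm{var}(\hat y_{HT,1})$ follows. The function names and the small SRS illustration are assumptions added here: for simple random sampling without replacement of $n = 2$ from $N = 4$, $\pi_i = n/N$ and $\pi_{ij} = n(n-1)/(N(N-1))$, and the estimator should then reduce to the familiar $N^2(1 - n/N)s^2/n$.

```python
def horvitz_thompson_total(y, pi):
    """HT estimate of the total: sum of y_i / pi_i over the sampled units."""
    return sum(yi / p for yi, p in zip(y, pi))


def ht_variance_estimate(y, pi, pi2):
    """First HT variance estimator.

    pi2[i][j] holds the second-order inclusion probability of sampled
    units i and j (only entries with i != j are read).
    """
    n = len(y)
    v = sum((1 - pi[i]) / pi[i] ** 2 * y[i] ** 2 for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            v += (2 * (pi2[i][j] - pi[i] * pi[j])
                  / (pi2[i][j] * pi[i] * pi[j]) * y[i] * y[j])
    return v


# Illustration with SRS without replacement (hypothetical numbers).
N, n = 4, 2
y = [3.0, 5.0]
pi = [n / N] * n
joint = n * (n - 1) / (N * (N - 1))
pi2 = [[None, joint], [joint, None]]

print(horvitz_thompson_total(y, pi))      # N * ybar
print(ht_variance_estimate(y, pi, pi2))   # N^2 (1 - n/N) s^2 / n
```

The estimator can go negative for designs where $\pi_i\pi_j > \pi_{ij}$ does not hold, so a negative return value is not necessarily a coding error.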

2. The estimator in SRS is a special case of the HT estimator:

$$\hat{\bar Y}_{HT} = \frac{1}{N}\Big(\frac{y_1}{\pi_1} + \frac{y_2}{\pi_2} + \dots + \frac{y_n}{\pi_n}\Big) = \frac{1}{N}\Big(\frac{y_1}{n/N} + \frac{y_2}{n/N} + \dots + \frac{y_n}{n/N}\Big) = \bar y.$$

3. The variance estimate can be negative if $\pi_i\pi_j > \pi_{ij}$. If we restrict the variance to 0 in these cases, we induce bias.

Example (Strip transect sampling): Consider a study area of 100 km$^2$ partitioned into strips 1 km wide but varying in length. A sample of $n = 4$ strips is selected draw-by-draw with replacement. The number of animals is counted in each of these 4 strips. The data are shown below:

Sample strip   Length of strip (in km)   p_i   y_i
[...]

Estimate the total number of animals in this area using the HT estimator and report its standard error.

Solution: The sample is drawn with replacement with $n = 4$, and three distinct strips appear in it, $s = \{7, [\dots], 56\}$. For each distinct strip $i \in s$, the first order inclusion probability is

$$\pi_i = 1 - (1 - p_i)^4, \quad\text{e.g.}\quad \pi_7 = 1 - (1 - p_7)^4, \quad \pi_{56} = 1 - (1 - p_{56})^4.$$

For each pair of distinct strips $i, j \in s$, the second order inclusion probability is

$$\pi_{ij} = \pi_i + \pi_j - [1 - (1 - p_i - p_j)^4].$$
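The inclusion probabilities used in this solution follow from the complement rule: a unit is included unless it is missed on all $n$ draws, and two units are both included by inclusion-exclusion. The sketch below illustrates those two formulas; the function names and the probabilities 0.1 and 0.2 are hypothetical and are not the strip data.

```python
def inclusion_prob_wr(p_i, n):
    """P(unit i appears at least once in n with-replacement draws)."""
    return 1 - (1 - p_i) ** n


def joint_inclusion_prob_wr(p_i, p_j, n):
    """P(units i and j both appear at least once in n with-replacement draws).

    Uses P(i and j) = P(i) + P(j) - P(i or j), where
    P(i or j) = 1 - (1 - p_i - p_j)^n.
    """
    either = 1 - (1 - p_i - p_j) ** n
    return inclusion_prob_wr(p_i, n) + inclusion_prob_wr(p_j, n) - either


# Hypothetical per-draw selection probabilities with n = 4 draws.
print(inclusion_prob_wr(0.1, 4))
print(joint_inclusion_prob_wr(0.1, 0.2, 4))
```

With these in hand, the HT estimate and its variance estimate only need the distinct sampled units, even though a unit may have been drawn more than once.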

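Similarly,

$$\pi_{7,56} = \pi_7 + \pi_{56} - [1 - (1 - p_7 - p_{56})^4],$$

and likewise for the remaining pair of sampled strips. Then

$$\hat Y_{HT} = \sum_{i \in S}\frac{y_i}{\pi_i} = [\dots],$$

$$\mathrm{var}(\hat y_{HT,1}) = \sum_{i \in S}\frac{1-\pi_i}{\pi_i^2}y_i^2 + 2\sum_{i \in S}\sum_{j>i}\frac{\pi_{ij} - \pi_i\pi_j}{\pi_{ij}\,\pi_i\pi_j}y_i y_j = [\dots],$$

$$\mathrm{se}(\hat y_{HT,1}) = \sqrt{\mathrm{var}(\hat y_{HT,1})} = [\dots].$$

5.3 IPPS sampling without replacement

For sampling without replacement (WOR), the inclusion probabilities satisfy

$$\sum_{j=1, j\ne i}^{N}\pi_{ij} = \sum_{j=1, j\ne i}^{N}E(I_i I_j) = E\Big(I_i\sum_{j=1, j\ne i}^{N} I_j\Big) = E[I_i(n - I_i)] = n\pi_i - \pi_i = (n-1)\pi_i,$$

$$\sum_{j=1, j\ne i}^{N}(\pi_{ij} - \pi_i\pi_j) = (n-1)\pi_i - \pi_i(n - \pi_i) = -\pi_i(1 - \pi_i).$$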

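Then the variance for the HT estimator becomes

$$\mathrm{Var}(\hat Y_{HT,2}) = \sum_{i=1}^{N}\pi_i(1-\pi_i)\Big(\frac{y_i}{\pi_i}\Big)^2 + \sum_{i=1}^{N}\sum_{j=1, j\ne i}^{N}(\pi_{ij} - \pi_i\pi_j)\frac{y_i}{\pi_i}\frac{y_j}{\pi_j}$$

$$= \sum_{i=1}^{N}\sum_{j=1, j\ne i}^{N}(\pi_i\pi_j - \pi_{ij})\Big(\frac{y_i}{\pi_i}\Big)^2 - \sum_{i=1}^{N}\sum_{j=1, j\ne i}^{N}(\pi_i\pi_j - \pi_{ij})\frac{y_i}{\pi_i}\frac{y_j}{\pi_j}$$

$$= \sum_{i=1}^{N}\sum_{j=i+1}^{N}(\pi_i\pi_j - \pi_{ij})\Big[\Big(\frac{y_i}{\pi_i}\Big)^2 + \Big(\frac{y_j}{\pi_j}\Big)^2 - 2\frac{y_i}{\pi_i}\frac{y_j}{\pi_j}\Big]$$

$$= \sum_{i=1}^{N}\sum_{j=i+1}^{N}(\pi_i\pi_j - \pi_{ij})\Big(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\Big)^2,$$

where the second line uses $\pi_i(1 - \pi_i) = \sum_{j\ne i}(\pi_i\pi_j - \pi_{ij})$. We estimate this via

$$\mathrm{var}(\hat y_{HT,2}) = \sum_{i=1}^{n}\sum_{j=i+1}^{n}\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\Big(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\Big)^2.$$

Note:

1. The variance estimate $\mathrm{var}(\hat y_{HT,2})$ is guaranteed to be positive for any WOR sampling method satisfying $\pi_i\pi_j > \pi_{ij}$. Also, $\mathrm{var}(\hat y_{HT,2})$ takes negative values less often than $\mathrm{var}(\hat y_{HT,1})$.

2. The calculation of $\pi_{ij}$ depends on the way the sample is selected. Note that $\binom{n}{2}$ values of $\pi_{ij}$ need to be calculated.

3. If the $Y_i$ are approximately proportional to an auxiliary variable $X_i$, i.e. $y_i \approx r x_i$, and we set $\pi_i = n x_i/\sum_{j=1}^{N} x_j = n p_i = c x_i$, then for all $(i, j)$

$$\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \approx \frac{r x_i}{c x_i} - \frac{r x_j}{c x_j} = 0,$$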

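or $\mathrm{Var}(\hat Y_{HT,2}) \approx 0$. The IPPS without replacement estimator is

$$\hat Y_{HT} = \sum_{i=1}^{n}\frac{y_i}{\pi_i} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i}.$$

Moreover, if $y_i$ and $x_i$ are perfectly correlated such that $y_i = r x_i$, $i = 1, \dots, N$, then $\pi_i = n y_i/Y$ and

$$\hat Y_{HT} = \sum_{i=1}^{n}\frac{y_i}{\pi_i} = \sum_{i=1}^{n}\frac{y_i}{n y_i/Y} = \frac{Y n}{n} = Y.$$

We expect good performance if $y_i \approx r x_i$ for all $i$ under PPS sampling.

4. Systematic IPPS is popular because of its simplicity, but an unbiased estimator for $\mathrm{Var}(\hat Y_{sys})$ is not available. Hartley and Rao (1962) showed that when $n x_i/X < 1$ for all $i$,

$$\mathrm{Var}(\hat Y_{sys,pps}) \approx \sum_{i=1}^{N}\pi_i\Big(1 - \frac{n-1}{n}\pi_i\Big)\Big(\frac{y_i}{\pi_i} - \frac{Y}{n}\Big)^2.$$

This expression can be estimated by

$$\mathrm{var}(\hat y_{sys,pps}) \approx \frac{n}{n-1}\sum_{i=1}^{n}\Big(1 - \frac{n-1}{n}\pi_i\Big)\Big(\frac{y_i}{\pi_i} - \frac{\hat Y_{HT}}{n}\Big)^2.$$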
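The Hartley-Rao type variance estimate needs only first order inclusion probabilities, which is what makes it usable for systematic IPPS samples. The sketch below assumes the simplified form $\frac{n}{n-1}\sum_i (1 - \frac{n-1}{n}\pi_i)(y_i/\pi_i - \hat Y_{HT}/n)^2$; both that exact form (in particular the leading factor) and the function name are assumptions, and the sample values are hypothetical.

```python
def hartley_rao_variance_estimate(y, pi):
    """Approximate variance estimate for the HT total under systematic IPPS.

    Assumed form: n/(n-1) * sum (1 - (n-1)/n * pi_i) * (y_i/pi_i - Yhat/n)^2.
    """
    n = len(y)
    if n < 2:
        raise ValueError("at least two sampled units are needed")
    y_hat = sum(yi / p for yi, p in zip(y, pi))   # HT estimate of the total
    shrink = (n - 1) / n
    return n / (n - 1) * sum(
        (1 - shrink * p) * (yi / p - y_hat / n) ** 2 for yi, p in zip(y, pi)
    )


# Hypothetical sample: values and first order inclusion probabilities.
print(hartley_rao_variance_estimate([3.0, 5.0], [0.5, 0.5]))
# When y_i is exactly proportional to pi_i every ratio equals Yhat/n,
# so the estimate collapses to (numerically) zero.
print(hartley_rao_variance_estimate([2.0, 4.0], [0.2, 0.4]))
```

To estimate the variance of a mean rather than a total, the result would be divided by $N^2$.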

Example (Half Ackroyd case): Consider the population of firms A, B and C only, with $N = 3$ and $\bar Y = 23/3 = 7.667$. The following table shows the inclusion probabilities of a particular sampling procedure and the sample estimates.

Sample outcomes and HT estimator

Sample   2nd order inclusion prob pi_ij   Estimate (1/N) sum y_i/pi_i
A,B      pi_12 = 0.5                      (1/3) (9/0.8 + 8/0.7) = 7.5595
A,C      pi_13 = 0.3                      (1/3) (9/0.8 + 6/0.5) = 7.7500
B,C      pi_23 = 0.2                      (1/3) (8/0.7 + 6/0.5) = 7.8095

The first-order inclusion probabilities are

Firm    y_i   pi_i
A       9     pi_1 = 0.5 + 0.3 = 0.8
B       8     pi_2 = 0.5 + 0.2 = 0.7
C       6     pi_3 = 0.3 + 0.2 = 0.5
Total   23    2.0

Note that

$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) = \pi_1 + \pi_2 - \pi_{12} = 0.8 + 0.7 - 0.5 = 1,$$

since in a sample of size $n = 2$ from $N = 3$, at least one element of the pair will be included. The variance of the HT estimator is:

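$$\mathrm{Var}(\hat{\bar Y}_{HT,2}) = \frac{1}{N^2}\Big[(\pi_1\pi_2 - \pi_{12})\Big(\frac{y_1}{\pi_1} - \frac{y_2}{\pi_2}\Big)^2 + (\pi_1\pi_3 - \pi_{13})\Big(\frac{y_1}{\pi_1} - \frac{y_3}{\pi_3}\Big)^2 + (\pi_2\pi_3 - \pi_{23})\Big(\frac{y_2}{\pi_2} - \frac{y_3}{\pi_3}\Big)^2\Big]$$

$$= \frac{1}{9}\Big[(0.8 \times 0.7 - 0.5)\Big(\frac{9}{0.8} - \frac{8}{0.7}\Big)^2 + (0.8 \times 0.5 - 0.3)\Big(\frac{9}{0.8} - \frac{6}{0.5}\Big)^2 + (0.7 \times 0.5 - 0.2)\Big(\frac{8}{0.7} - \frac{6}{0.5}\Big)^2\Big] = 0.0119.$$

The following calculation verifies that the HT estimator is unbiased and that its variance is indeed as calculated:

$$E(\hat{\bar Y}_{HT}) = (7.5595)(0.5) + (7.75)(0.3) + (7.8095)(0.2) = 7.667,$$

$$\mathrm{Var}(\hat{\bar Y}_{HT}) = (7.5595 - 7.667)^2(0.5) + (7.75 - 7.667)^2(0.3) + (7.8095 - 7.667)^2(0.2) = 0.0119.$$

The simple average estimator is biased when the inclusion probabilities are not the same for all elements. Note that

Outcome   Probability   Sample mean
A,B       0.5           (1/2)(9 + 8) = 8.5
A,C       0.3           (1/2)(9 + 6) = 7.5
B,C       0.2           (1/2)(8 + 6) = 7.0

$$E(\bar y) = 0.5(8.5) + 0.3(7.5) + 0.2(7.0) = 7.9 \ne 7.667.$$

Therefore, the sample mean $\bar y$ is biased.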
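The three checks in this example (unbiasedness of the HT mean, agreement of its variance with the pairwise-difference formula, and bias of the plain sample mean) can be verified exactly. The sketch below assumes the example's values $y = (9, 8, 6)$ and sample probabilities 0.5, 0.3, 0.2 for {A,B}, {A,C}, {B,C}; the variable names are introduced here for illustration.

```python
from fractions import Fraction as F

N = 3
y = {"A": 9, "B": 8, "C": 6}
# Probability of each possible sample of size 2; with n = 2 these are also
# the second-order inclusion probabilities pi_ij.
sample_prob = {("A", "B"): F(5, 10), ("A", "C"): F(3, 10), ("B", "C"): F(2, 10)}

# First-order inclusion probability: total probability of samples containing the firm.
pi = {f: sum(pr for s, pr in sample_prob.items() if f in s) for f in y}


def ht_mean(sample):
    """HT estimate of the population mean: (1/N) * sum y_i / pi_i."""
    return sum(F(y[f]) / pi[f] for f in sample) / N


# Expectation and variance by enumerating the three possible samples.
expected_ht = sum(pr * ht_mean(s) for s, pr in sample_prob.items())
var_ht = sum(pr * (ht_mean(s) - expected_ht) ** 2 for s, pr in sample_prob.items())

# Pairwise-difference form of the variance, divided by N^2 for the mean.
pairwise_var = sum(
    (pi[i] * pi[j] - pr) * (F(y[i]) / pi[i] - F(y[j]) / pi[j]) ** 2
    for (i, j), pr in sample_prob.items()
) / N ** 2

# Expectation of the unweighted sample mean, for comparison.
expected_sample_mean = sum(pr * F(y[i] + y[j], 2)
                           for (i, j), pr in sample_prob.items())

print(float(expected_ht), float(var_ht), float(pairwise_var),
      float(expected_sample_mean))
```

Exact fractions make it clear that the two variance calculations coincide rather than merely agreeing to a few decimals.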

5.4 How to draw IPPS sample?

It is very difficult. The Madow (1949) method gives a systematic IPPS sample.

Index   x     Partial sum                   Assigned interval
1       x_1   S_1 = x_1                     [0, S_1)
2       x_2   S_2 = x_1 + x_2               [S_1, S_2)
...
N       x_N   S_N = x_1 + ... + x_N = X     [S_{N-1}, S_N)

Define $d = X/n$.

Step 1: Select a random number $u'$ from the uniform distribution $U(0, 1)$ and set $u = u'd \in [0, d)$.

Step 2: Select the units whose assigned intervals contain $u, u + d, u + 2d, \dots, u + (n-1)d$.

For the systematic IPPS method, it can be shown that $\pi_i = x_i/d = n x_i/X = n z_i$. Hence

$$\hat Y_{sys} = \sum_{i=1}^{n}\frac{y_i}{\pi_i}$$

is unbiased for the population total $Y$. Indeed,

$$\pi_i = P[u \in [S_{i-1}, S_i) \text{ or } u + d \in [S_{i-1}, S_i) \text{ or } \dots \text{ or } u + (n-1)d \in [S_{i-1}, S_i)] = P[u \in [S_{i-1}, S_i) \bmod d] = \frac{S_i - S_{i-1}}{d} = \frac{x_i}{d} = \frac{n x_i}{X} = n z_i.$$

Note that

$$x_i > d \;\Rightarrow\; x_i > \frac{X}{n} \;\Rightarrow\; \frac{n x_i}{X} > 1 \;\Rightarrow\; n z_i > 1.$$

If $n z_i < 1$ for all $i$, any unit has probability $\pi_i = n z_i$ of being selected and no unit is selected more than once. If $n z_i > 1$ for one or more $i$, such units may be selected more than once in the sample, but the average frequency of selection is $n z_i$.
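The two steps of the Madow method translate directly into a cumulative-sum lookup. The sketch below is an illustration under stated assumptions: the function name is invented here, and the size measures `x` are hypothetical values chosen so that $X = 30$ and one unit has $n z_i > 1$. It returns 1-based unit indices, with repeats when a unit's interval is wider than $d$.

```python
from bisect import bisect_right
from itertools import accumulate


def madow_systematic_sample(x, n, u_unit):
    """Systematic IPPS selection by the cumulative-total method.

    x: positive size measures x_1..x_N
    n: sample size
    u_unit: a number in [0, 1); the random start is u = u_unit * d
    Returns the 1-based indices of the selected units (repeats possible).
    """
    if not 0 <= u_unit < 1:
        raise ValueError("u_unit must lie in [0, 1)")
    partial_sums = list(accumulate(x))      # S_1, ..., S_N
    d = partial_sums[-1] / n                # sampling interval d = X / n
    start = u_unit * d
    points = [start + k * d for k in range(n)]
    # Unit i is selected when S_{i-1} <= point < S_i; bisect_right counts the
    # partial sums that are <= point, which is exactly i - 1.
    return [bisect_right(partial_sums, pt) + 1 for pt in points]


# Hypothetical size measures (an assumption): X = 30, n = 3, so d = 10.
x = [2, 2, 11, 6, 4, 2, 3]
print(madow_systematic_sample(x, n=3, u_unit=0.65))
print(madow_systematic_sample(x, n=3, u_unit=0.45))   # unit 3 appears twice
```

Because only one random number is drawn, the possible samples are determined entirely by which sub-interval of $[0, d)$ the start falls in.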

Example (IPPS sample): For a population of 7 households, the number of visits to a local supermarket last week, $x_i$, and this week, $y_i$, are given below:

Household i   No. of visits last week x_i   No. of visits this week y_i
[...]

Use the random number 0.650. Select a systematic IPPS sample of size $n = 3$ using the Madow (1949) method. Estimate the average number of visits to the supermarket this week and its standard error.

Solution: $d = X/n = 30/3 = 10$. Note that $n z_3 = 3 \times 11/30 = 1.1 > 1$. Hence unit 3 can be selected more than once.

Index i   x_i    S_i    z_i = x_i/X   Assigned interval [S_{i-1}, S_i)   Assigned interval (mod d = 10)   Width       Prob
1         [...]  [...]  [...]         [0, [...])                         [0, [...])                       [...]       [...]
2         [...]  4      [...]         [[...], 4)                         [[...], 4)                       [...]       [...]
3         11     15     0.3667        [4, 15)                            [4, 10) or [0, 5)                6 + 5 = 11  1.1
4         6      21     0.2000        [15, 21)                           [5, 10) or [0, 1)                1 + 5 = 6   0.6
5         4      25     0.1333        [21, 25)                           [1, 5)                           4           0.4
6         2      27     0.0667        [25, 27)                           [5, 7)                           2           0.2
7         3      30     0.1000        [27, 30)                           [7, 10)                          3           0.3
Total     30

Now $u = 0.65 \times 10 = 6.5$ from the interval $[0, 10)$. The selection points are 6.5, 16.5 and 26.5, so the IPPS sample contains households {3, 4, 6}. A complete list of different samples based

on $u \in [0, 10)$ is given below.

u            IPPS sample
[0, 1)       {1, 3, 4}
[1, [...])   {1, 3, 5}
[[...], 4)   {2, 3, 5}
[4, 5)       {3, 3, 5}
[5, 7)       {3, 4, 6}
[7, 10)      {3, 4, 7}

It is difficult to estimate the standard error using the HT estimator because the second order inclusion probabilities $\pi_{ij}$ are unknown. Hence the estimate of the average number of visits to the supermarket this week and its standard error, using the Hartley and Rao (1962) estimator for systematic sampling, are

$$\hat{\bar Y}_{HT} = \frac{1}{N}\sum_{i=1}^{n}\frac{y_i}{\pi_i} = [\dots],$$

$$\mathrm{var}(\hat{\bar y}_{sys,ipps}) = \frac{1}{N^2}\cdot\frac{n}{n-1}\sum_{i=1}^{n}\Big(1 - \frac{n-1}{n}\pi_i\Big)\Big(\frac{y_i}{\pi_i} - \frac{\hat Y_{sys}}{n}\Big)^2 = [\dots],$$

$$\mathrm{se}(\hat{\bar y}_{sys,ipps}) = \sqrt{\mathrm{var}(\hat{\bar y}_{sys,ipps})} = [\dots].$$
