Cluster Sampling 2. Chapter Introduction


Job Williamson
6 years ago

Chapter 7 Cluster Sampling

7.1 Introduction

In this chapter, we consider two-stage cluster sampling, where the sample clusters are selected in the first stage and the sample elements are selected in the second stage. Formally, two-stage sampling can be described as follows:

1. Stage 1: Draw $A_I \subset U_I$ via $p_I(\cdot)$.
2. Stage 2: For every $i \in A_I$, draw $A_i \subset U_i$ via $p_i(\cdot \mid A_I)$.

The sample of elements is then given by $A = \cup_{i \in A_I} A_i$.

In two-stage sampling, we make two simplifying assumptions about the second-stage sampling design $p_i(\cdot \mid A_I)$:

1. Invariance of the second-stage design: $p_i(\cdot \mid A_I) = p_i(\cdot)$ for every $i \in U_I$ and for every $A_I$ such that $i \in A_I$.
2. Independence of the second-stage design: $\Pr(\cup_{i \in A_I} A_i \mid A_I) = \prod_{i \in A_I} \Pr(A_i \mid A_I)$.
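The two stages above can be sketched in a few lines of Python. This is only an illustration under assumed designs (simple random sampling without replacement at both stages); the function name `two_stage_sample`, the population `U`, and the sizes are hypothetical and do not come from the text.

```python
import random

def two_stage_sample(clusters, n_I, m, seed=0):
    """Draw a two-stage sample: SRS of clusters, then SRS of elements within each.

    clusters: dict mapping cluster id -> list of element ids (the sets U_i).
    n_I:      number of clusters drawn in stage 1.
    m:        number of elements drawn in each sampled cluster in stage 2.
    """
    rng = random.Random(seed)
    # Stage 1: draw A_I from U_I by SRS without replacement.
    A_I = rng.sample(sorted(clusters), n_I)
    # Stage 2: cluster i gets the same design (SRS of size m) whatever A_I is
    # (invariance), and each cluster is subsampled separately (independence).
    A = {i: rng.sample(clusters[i], min(m, len(clusters[i]))) for i in A_I}
    return A_I, A

# Hypothetical population: 5 clusters of 4 elements, labelled (cluster, element)
U = {i: [(i, k) for k in range(4)] for i in range(5)}
A_I, A = two_stage_sample(U, n_I=2, m=3)
sample = [e for i in A_I for e in A[i]]   # A is the union of the A_i
```

Because the second-stage draw for cluster `i` never looks at which other clusters were selected, both simplifying assumptions hold by construction in this sketch.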

Under independence, we have

$$\Pr(k \in A_i, l \in A_j) = \Pr(k \in A_i)\Pr(l \in A_j), \quad i \neq j.$$

If the invariance assumption does not hold, the sampling design is called a two-phase sampling design. Two-phase sampling design will be covered in Chapter 1.

In two-stage sampling, we use $n_I$ to denote the cluster sample size in the first-stage sampling and $m_i = |A_i|$ to denote the sample size in cluster $i$. The number of sampled elements is $\sum_{i \in A_I} m_i = |A|$. The first-order inclusion probability of element $k$ in cluster $i$ is the product of the cluster-level inclusion probability and the conditional inclusion probability given the cluster:

$$\pi_{ik} = \Pr\{(ik) \in A\} = \Pr(k \in A_i \mid i \in A_I)\Pr(i \in A_I) = \pi_{k|i}\,\pi_{Ii},$$

where $\pi_{Ii}$ is the cluster-level inclusion probability and $\pi_{k|i} = \Pr[k \in A_i \mid i \in A_I]$ is the element-level conditional inclusion probability. In general, $\pi_{k|i}$ is a random variable in the sense that it is a function of $A_I$. Under invariance, it is fixed.

The second-order inclusion probability between two elements can be expressed as

$$\pi_{ik,jl} = \begin{cases} \pi_{Ii}\,\pi_{k|i} & \text{if } i = j \text{ and } k = l \\ \pi_{Ii}\,\pi_{kl|i} & \text{if } i = j \text{ and } k \neq l \\ \pi_{Iij}\,\pi_{k|i}\,\pi_{l|j} & \text{if } i \neq j \end{cases}$$

where $\pi_{Iij}$ is the cluster-level joint inclusion probability and $\pi_{kl|i} = \Pr[k, l \in A_i \mid i \in A_I]$.

7.2 Estimation

In two-stage cluster sampling, we do not observe $Y_i$. Instead, we obtain $\hat Y_i$ from the second-stage sampling such that $E(\hat Y_i \mid A_I) = Y_i$, where the conditional expectation is taken with respect to the second-stage sampling. For simplicity, we write $E_2(\hat Y_i) = E(\hat Y_i \mid A_I)$. The HT estimator of $Y = \sum_{i \in U_I} Y_i = \sum_{i \in U_I}\sum_{k \in U_i} y_{ik}$ is given by

$$\hat Y_{HT} = \sum_{i \in A_I} \frac{\hat Y_i}{\pi_{Ii}} = \sum_{i \in A_I} \frac{1}{\pi_{Ii}} \sum_{k \in A_i} \frac{y_{ik}}{\pi_{k|i}}. \quad (7.1)$$
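As a small illustration of (7.1), the sketch below expands each sampled value first by the conditional probability within its cluster and then by the cluster-level probability. The function name `ht_total`, the data layout, and all numbers are hypothetical choices made for this example.

```python
def ht_total(sample, pi_I, pi_cond):
    """Horvitz-Thompson estimate of the population total, as in (7.1).

    sample:  dict cluster id -> list of (element id, y value) for sampled elements.
    pi_I:    dict cluster id -> first-stage inclusion probability pi_{Ii}.
    pi_cond: dict cluster id -> dict element id -> conditional probability pi_{k|i}.
    """
    total = 0.0
    for i, elements in sample.items():
        # Y_hat_i: expand the sampled y-values in cluster i by 1 / pi_{k|i}
        y_hat_i = sum(y / pi_cond[i][k] for k, y in elements)
        # Expand the estimated cluster total by 1 / pi_{Ii}
        total += y_hat_i / pi_I[i]
    return total

# Hypothetical sample: 2 of 4 clusters (pi_I = 0.5), half of each cluster's elements
sample = {"a": [(0, 3.0), (1, 5.0)], "b": [(0, 2.0), (2, 4.0), (5, 6.0)]}
pi_I = {"a": 0.5, "b": 0.5}
pi_cond = {"a": {0: 0.5, 1: 0.5}, "b": {0: 0.5, 2: 0.5, 5: 0.5}}
y_ht = ht_total(sample, pi_I, pi_cond)   # every element has pi_ik = 0.25, so 20 / 0.25 = 80
```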

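The HT estimator in (7.1) is unbiased and its variance can be computed by

$$V(\hat Y_{HT}) = V\{E(\hat Y_{HT} \mid A_I)\} + E\{V(\hat Y_{HT} \mid A_I)\}. \quad (7.2)$$

The first term is the variance due to the first-stage sampling (sampling of PSUs) and the second term is the variance due to the second-stage sampling (sampling of SSUs). Thus, we can write

$$V(\hat Y_{HT}) = V_{PSU} + V_{SSU} \quad (7.3)$$

where

$$V_{PSU} = V\left\{\sum_{i \in A_I} \frac{Y_i}{\pi_{Ii}}\right\} = \sum_{i \in U_I}\sum_{j \in U_I} (\pi_{Iij} - \pi_{Ii}\pi_{Ij}) \frac{Y_i}{\pi_{Ii}}\frac{Y_j}{\pi_{Ij}},$$

$$V_{SSU} = E\left\{\sum_{i \in A_I} \frac{V_i}{\pi_{Ii}^2}\right\} = \sum_{i \in U_I} \frac{V_i}{\pi_{Ii}},$$

and

$$V_i = V(\hat Y_i \mid A_I) = \sum_{k \in U_i}\sum_{l \in U_i} (\pi_{kl|i} - \pi_{k|i}\pi_{l|i}) \frac{y_{ik}}{\pi_{k|i}}\frac{y_{il}}{\pi_{l|i}}.$$

Example 7.1. Consider the following two-stage sampling design.

1. Stage One: Select $n_I$ sample clusters from $N_I$ population clusters by simple random cluster sampling.
2. Stage Two: Within sampled cluster $i$, select $m_i$ sample elements from $M_i$ population elements independently.

Under this two-stage sampling, we have

$$\hat Y_{HT} = \frac{N_I}{n_I}\sum_{i \in A_I} \hat Y_i = \frac{N_I}{n_I}\sum_{i \in A_I}\sum_{j \in A_i} \frac{M_i}{m_i}\, y_{ij}$$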

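and its variance is

$$V(\hat Y_{HT}) = \frac{N_I^2}{n_I}\left(1 - \frac{n_I}{N_I}\right) S_I^2 + \frac{N_I}{n_I}\sum_{i=1}^{N_I} \frac{M_i^2}{m_i}\left(1 - \frac{m_i}{M_i}\right) S_i^2 \quad (7.4)$$

where

$$S_I^2 = \frac{1}{N_I - 1}\sum_{i=1}^{N_I} (Y_i - \bar Y_I)^2, \qquad S_i^2 = \frac{1}{M_i - 1}\sum_{j=1}^{M_i} (y_{ij} - \bar Y_i)^2,$$

with $\bar Y_I = N_I^{-1}\sum_{i=1}^{N_I} Y_i$.

Now, consider estimation of the population mean $\bar Y = N^{-1}\sum_{i=1}^{N_I}\sum_{j=1}^{M_i} y_{ij}$, where $N = \sum_{i=1}^{N_I} M_i$ is assumed to be known. In this case,

$$\hat{\bar Y}_{HT} = \frac{\hat Y_{HT}}{N} = \frac{N_I}{N\, n_I}\sum_{i \in A_I} \hat Y_i = \frac{1}{n_I}\sum_{i \in A_I} \frac{\hat Y_i}{\bar M},$$

where $\bar M = N_I^{-1}\sum_{i=1}^{N_I} M_i$. Its variance is, using (7.4),

$$V\{\hat{\bar Y}_{HT}\} = \frac{1}{n_I}\left(1 - \frac{n_I}{N_I}\right) S_{q1}^2 + \frac{1}{n_I N_I \bar M^2}\sum_{i=1}^{N_I} \frac{M_i^2}{m_i}\left(1 - \frac{m_i}{M_i}\right) S_i^2$$

where $S_{q1}^2 = (N_I - 1)^{-1}\sum_{i=1}^{N_I} (q_i - \bar q_1)^2$ with $q_i = Y_i/\bar M$, $\bar q_1 = N_I^{-1}\sum_{i=1}^{N_I} q_i$, and $S_i^2 = (M_i - 1)^{-1}\sum_{k \in U_i} (y_{ik} - \bar Y_i)^2$. If the sampling rate for the second-stage sampling is constant, such that $m_i/M_i = f_2$, then we can write

$$V\{\hat{\bar Y}_{HT}\} = \frac{1}{n_I}(1 - f_1)\, S_{q1}^2 + \frac{1}{n_I \bar m}(1 - f_2)\,\frac{1}{N}\sum_{i=1}^{N_I} M_i S_i^2 = \frac{1}{n_I}(1 - f_1)\, B^2 + \frac{1}{n_I \bar m}(1 - f_2)\, W^2,$$

where $f_1 = n_I/N_I$, $\bar m = N_I^{-1}\sum_{i=1}^{N_I} m_i$, $B^2 = S_{q1}^2$, and $W^2 = N^{-1}\sum_{i=1}^{N_I} M_i S_i^2$.

Example 7.2. We now consider a special case of Example 7.1 where $M_i = M$ and $m_i = m$. In this case, (7.4) is further simplified to

$$V(\hat Y_{HT}) = \frac{N_I^2}{n_I}\left(1 - \frac{n_I}{N_I}\right) M\,\frac{SSB}{N_I - 1} + \frac{N_I}{n_I}\,\frac{M^2}{m}\left(1 - \frac{m}{M}\right)\frac{SSW}{M - 1}$$

$$= \frac{N_I^2}{n_I}\left(1 - \frac{n_I}{N_I}\right) M S_b^2 + \frac{N_I^2 M^2}{n_I m}\left(1 - \frac{m}{M}\right) S_w^2. \quad (7.5)$$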
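The following sketch evaluates the two terms of (7.4) for a tiny hypothetical population and checks that, with equal cluster sizes, they agree with the two terms of (7.5). It assumes simple random sampling without replacement at both stages, and it takes $S_b^2 = SSB/(N_I - 1)$ and $S_w^2 = SSW/\{N_I(M - 1)\}$, which is an assumption about the Chapter 6 definitions; the data and the function name are illustrative only.

```python
from statistics import mean, variance

def two_stage_srs_variance(clusters, n_I, m_sizes):
    """The two terms of (7.4): SRS of clusters, then SRS within clusters.

    clusters: list of lists; clusters[i] holds all y-values of population cluster i.
    n_I:      first-stage sample size.
    m_sizes:  second-stage sample size m_i for each population cluster.
    """
    N_I = len(clusters)
    totals = [sum(c) for c in clusters]
    S_I2 = variance(totals)                        # divisor N_I - 1
    v_psu = N_I**2 / n_I * (1 - n_I / N_I) * S_I2
    v_ssu = N_I / n_I * sum(
        len(c)**2 / m * (1 - m / len(c)) * variance(c)   # variance(c): divisor M_i - 1
        for c, m in zip(clusters, m_sizes)
    )
    return v_psu, v_ssu

# Hypothetical population: N_I = 3 clusters of M = 4 elements, n_I = 2, m = 2
clusters = [[1, 2, 3, 4], [2, 4, 6, 8], [0, 0, 1, 1]]
N_I, M, n_I, m = 3, 4, 2, 2
v_psu, v_ssu = two_stage_srs_variance(clusters, n_I, [m] * N_I)

# Equal-size form (7.5), using the assumed definitions of S_b^2 and S_w^2
grand_mean = mean(y for c in clusters for y in c)
S_b2 = M * sum((mean(c) - grand_mean) ** 2 for c in clusters) / (N_I - 1)
S_w2 = sum((y - mean(c)) ** 2 for c in clusters for y in c) / (N_I * (M - 1))
v_psu_eq = N_I**2 / n_I * (1 - n_I / N_I) * M * S_b2
v_ssu_eq = N_I**2 * M**2 / (n_I * m) * (1 - m / M) * S_w2
```

In this toy population the first-stage term is 122 and the second-stage term is 52, so most of the variance comes from differences between cluster totals.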

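For the case of mean estimation, we can simply divide (7.5) by $N^2 = N_I^2 M^2$ to get

$$V(\hat{\bar Y}) = \frac{1}{n_I}\left(1 - \frac{n_I}{N_I}\right)\frac{S_b^2}{M} + \frac{1}{n_I m}\left(1 - \frac{m}{M}\right) S_w^2. \quad (7.6)$$

Note that the sample size associated with the first term (the $V_{PSU}$ term) is $n_I M$, while the sample size associated with the second term (the $V_{SSU}$ term) is $n_I m$.

Now, we can express the variance in (7.6) in terms of the intracluster correlation coefficient. Using Table 6.1 and the property of $\rho$ given by (6.4), we have

$$S_b^2 = (N_I - 1)^{-1}\, SSB = (N_I - 1)^{-1} M^{-1}\,[1 + \rho(M - 1)]\, SST$$

and

$$S_w^2 = N_I^{-1}(M - 1)^{-1}\, SSW = N_I^{-1} M^{-1}\,(1 - \rho)\, SST.$$

Writing $S^2$ for the element-level population variance, these give $S_b^2 \doteq [1 + (M - 1)\rho]\, S^2$ and $S_w^2 \doteq (1 - \rho)\, S^2$. Thus, the variance in (7.6) reduces to, ignoring the $n_I/N_I$ term,

$$V(\hat{\bar Y}) \doteq \frac{1}{n_I M}\,[1 + (M - 1)\rho]\, S^2 + \frac{1}{n_I m}\left(1 - \frac{m}{M}\right)(1 - \rho)\, S^2 = \frac{S^2}{n_I m}\,\{1 + (m - 1)\rho\}. \quad (7.7)$$

Thus, the design effect becomes $1 + (m - 1)\rho$.

In this case of $M_i = M$ and $m_i = m$, the problem of finding the optimal choice of $m$ given the cost function $C = c_0 + c_1 n_I + c_2 n_I m$ can be formulated as minimizing

$$V(\hat{\bar Y}_{HT}) = \frac{S_b^2}{n_I M} + \frac{1}{n_I m}\left(1 - \frac{m}{M}\right) S_w^2 = \frac{1}{n_I}\left\{\frac{1}{M}\left(S_b^2 - S_w^2\right) + \frac{1}{m}\, S_w^2\right\}$$

subject to

$$C = c_0 + c_1 n_I + c_2 n_I m.$$

When the total cost $C$ is fixed, we have

$$n_I = \frac{C - c_0}{c_1 + c_2 m}$$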

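and the optimal choice is given by

$$m^* = \sqrt{\frac{c_1}{c_2}\cdot\frac{M S_w^2}{S_b^2 - S_w^2}}. \quad (7.8)$$

The optimal solution (7.8) is obtained by applying

$$\frac{a}{m} + bm \geq 2\sqrt{ab},$$

with equality if and only if $m = \sqrt{a/b}$. That is, since

$$V(\hat{\bar Y}_{HT}) \times (C - c_0) = \left\{\frac{1}{M}\left(S_b^2 - S_w^2\right) + \frac{1}{m}\, S_w^2\right\}(c_1 + c_2 m) = \text{const.} + \frac{c_1}{m}\, S_w^2 + \frac{c_2}{M}\left(S_b^2 - S_w^2\right) m,$$

the lower bound is achieved when

$$m = \frac{\{c_1 S_w^2\}^{1/2}}{\{c_2 M^{-1}(S_b^2 - S_w^2)\}^{1/2}},$$

which equals (7.8). For sufficiently large $M$, the optimal solution becomes

$$m^* = \sqrt{\frac{c_1}{c_2}\left(\frac{1}{\rho} - 1\right)}.$$

More generally, the objective function can be written as

$$V(\hat{\bar Y}_{HT}) = \frac{1}{n_I}\, B^2 + \frac{1}{n_I \bar m}(1 - f_2)\, W^2.$$

In this case, the optimal solution becomes

$$\bar m^* = \sqrt{\frac{c_1}{c_2}\cdot\frac{W^2}{B^2 - W^2/\bar M}}. \quad (7.9)$$

We now discuss variance estimation under two-stage cluster sampling.

Theorem 7.1. An unbiased estimator for the variance of the HT estimator in (7.1) under a two-stage sampling design is

$$\hat V(\hat Y_{HT}) = \sum_{i \in A_I}\sum_{j \in A_I} \frac{\pi_{Iij} - \pi_{Ii}\pi_{Ij}}{\pi_{Iij}}\,\frac{\hat Y_i}{\pi_{Ii}}\frac{\hat Y_j}{\pi_{Ij}} + \sum_{i \in A_I} \frac{1}{\pi_{Ii}}\,\hat V_i \quad (7.10)$$

where $\hat V_i$ satisfies $E(\hat V_i \mid A_I) = V(\hat Y_i \mid A_I)$.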
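Before the proof, the unbiasedness claim can be checked numerically on a very small case. The sketch below is an illustration, not part of the original argument: it assumes simple random sampling without replacement at both stages with equal cluster sizes, for which the first term of (7.10) becomes the usual SRS variance formula applied to the $\hat Y_i$ and the second term becomes $(N_I/n_I)\sum_{i \in A_I} \hat V_i$. It then enumerates every possible sample of a hypothetical population and compares the average of the variance estimates with the exact variance of the estimator.

```python
from itertools import combinations, product
from statistics import mean, variance, pvariance

def estimate(sample, N_I, M):
    """HT total and the variance estimator (7.10), specialised to SRS at both stages.

    sample: list of lists, the sampled y-values of each sampled cluster
            (n_I clusters, m values each, common cluster size M).
    """
    n_I, m = len(sample), len(sample[0])
    y_hat = [M * mean(s) for s in sample]                         # Y_hat_i = M * ybar_i
    v_hat = [M**2 / m * (1 - m / M) * variance(s) for s in sample]
    total = N_I / n_I * sum(y_hat)
    v1 = N_I**2 / n_I * (1 - n_I / N_I) * variance(y_hat)         # first term of (7.10)
    v2 = N_I / n_I * sum(v_hat)                                   # second term of (7.10)
    return total, v1 + v2

# Hypothetical population: N_I = 3 clusters of M = 3 elements each
U = [[1, 2, 6], [3, 3, 9], [0, 4, 5]]
N_I, M, n_I, m = 3, 3, 2, 2

results = []
for A_I in combinations(range(N_I), n_I):                         # every first-stage sample
    subsamples = [list(combinations(U[i], m)) for i in A_I]      # every second-stage sample
    for sample in product(*subsamples):
        results.append(estimate([list(s) for s in sample], N_I, M))

totals = [t for t, _ in results]
true_var = pvariance(totals)            # all 27 samples are equally likely
mean_v_hat = mean(v for _, v in results)
```

Averaged over all 27 equally likely samples, the estimated total equals the population total 33 and the variance estimate equals the exact variance, which also matches (7.4) for this population.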

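Proof. By (7.3),

$$V(\hat Y_{HT}) = \sum_{i=1}^{N_I}\sum_{j=1}^{N_I} (\pi_{Iij} - \pi_{Ii}\pi_{Ij})\,\frac{Y_i}{\pi_{Ii}}\frac{Y_j}{\pi_{Ij}} + \sum_{i=1}^{N_I} \frac{V_i}{\pi_{Ii}}.$$

By the independence assumption,

$$E_2(\hat Y_i \hat Y_j) = \begin{cases} Y_i^2 + V_i & \text{if } i = j \\ Y_i Y_j & \text{if } i \neq j \end{cases}$$

where $E_2$ denotes the expectation with respect to the second-stage sampling. Thus, with $\pi_{Iii} = \pi_{Ii}$,

$$E_2\left\{\sum_{i \in A_I}\sum_{j \in A_I} \frac{\pi_{Iij} - \pi_{Ii}\pi_{Ij}}{\pi_{Iij}}\,\frac{\hat Y_i}{\pi_{Ii}}\frac{\hat Y_j}{\pi_{Ij}}\right\} = \sum_{i \in A_I}\sum_{j \in A_I} \frac{\pi_{Iij} - \pi_{Ii}\pi_{Ij}}{\pi_{Iij}}\,\frac{Y_i}{\pi_{Ii}}\frac{Y_j}{\pi_{Ij}} + \sum_{i \in A_I} \frac{\pi_{Ii} - \pi_{Ii}^2}{\pi_{Ii}}\,\frac{V_i}{\pi_{Ii}^2}$$

and, since $E_2(\hat V_i) = V_i$, we have

$$E_2\{\hat V(\hat Y_{HT})\} = \sum_{i \in A_I}\sum_{j \in A_I} \frac{\pi_{Iij} - \pi_{Ii}\pi_{Ij}}{\pi_{Iij}}\,\frac{Y_i}{\pi_{Ii}}\frac{Y_j}{\pi_{Ij}} + \sum_{i \in A_I} \frac{V_i}{\pi_{Ii}^2}.$$

Taking the expectation of the above expression with respect to the first-stage sampling design, it equals the variance in (7.3).

The variance estimation formula in (7.10) is the sum of two terms. The first term is the variance estimation formula for the first-stage sampling applied to $\hat Y_i$, and the second term is the point estimator for the first-stage sampling applied to $\hat V_i$. The validity of the variance estimation formula (7.10) further holds even when $\hat Y_i$ and $\hat V_i$ are obtained from multi-stage sampling. That is, as long as $E(\hat Y_i \mid A_I) = Y_i$ and $E(\hat V_i \mid A_I) = V(\hat Y_i \mid A_I)$ hold, the variance estimation formula in (7.10) remains unbiased. This phenomenon was first discovered by Raj (1966).

If we use only the first term of (7.10),

$$\hat V_1 = \sum_{i \in A_I}\sum_{j \in A_I} \frac{\pi_{Iij} - \pi_{Ii}\pi_{Ij}}{\pi_{Iij}}\,\frac{\hat Y_i}{\pi_{Ii}}\frac{\hat Y_j}{\pi_{Ij}},$$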

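to estimate the total variance, the bias can be written

$$\mathrm{Bias}(\hat V_1) = -\sum_{i=1}^{N_I} V_i \quad (7.12)$$

and the bias term is of order $O(N_I)$. Since $\mathrm{Var}(\hat Y_{HT})$ is of order $O(n_I^{-1} N_I^2)$, the bias term is negligible when $n_I/N_I = o(1)$.

Under the setup of Example 7.1 where $M_i = M$ and $m_i = m$, the variance estimation formula in (7.10) reduces to

$$\hat V(\hat Y_{HT}) = \frac{N_I^2}{n_I}\left(1 - \frac{n_I}{N_I}\right)\frac{1}{n_I - 1}\sum_{i \in A_I}\left(\hat Y_i - \frac{1}{n_I}\sum_{j \in A_I} \hat Y_j\right)^2 + \frac{N_I}{n_I}\sum_{i \in A_I} \hat V_i \quad (7.13)$$

where $\hat Y_i = M \bar y_i$ and

$$\hat V_i = \frac{M^2}{m}\left(1 - \frac{m}{M}\right)\frac{1}{m - 1}\sum_{j \in A_i} (y_{ij} - \bar y_i)^2.$$

In the case of mean estimation, we can divide (7.10) by $N^2 = N_I^2 M^2$ to get

$$\hat V(\hat{\bar Y}) = \frac{1}{n_I}\left(1 - \frac{n_I}{N_I}\right) s_b^2 + \frac{1}{N_I m}\left(1 - \frac{m}{M}\right) s_w^2$$

where

$$s_b^2 = \frac{1}{n_I - 1}\sum_{i \in A_I} (\bar y_i - \hat{\bar Y})^2, \qquad s_w^2 = \frac{1}{n_I (m - 1)}\sum_{i \in A_I}\sum_{j \in A_i} (y_{ij} - \bar y_i)^2.$$

If $f_1 = n_I/N_I$ is negligible, then we can use $\hat V(\hat{\bar Y}) = n_I^{-1} s_b^2$, which has the same form as the variance estimator of the sample mean under simple random sampling, applied to the cluster means.

When the cluster sizes are unequal, simple random sampling in the first stage is not preferable. The following example is a very popular method of two-stage sampling for the case of unequal cluster sizes.

Example 7.3. Consider the following two-stage sampling design.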

1. Stage One: Select a sample of clusters of size $n_I$ by PPS sampling with size measure $M_i$.
2. Stage Two: Select elements by SRS of size $m$ from the $M_i$ elements in sampled cluster $i$.

We first consider estimation of the population total $Y = \sum_{i=1}^{N_I}\sum_{j=1}^{M_i} y_{ij}$. Under single-stage cluster sampling, we would have observed $Y_i = \sum_{j=1}^{M_i} y_{ij}$. In this case, an unbiased estimator of $Y$ is given by

$$\hat Y_{PPS} = \frac{N}{n_I}\sum_{k=1}^{n_I} \frac{Y_{a_k}}{M_{a_k}} \quad (7.15)$$

where $a_k$ is the index of the population cluster selected in the $k$-th draw of the PPS sampling. In two-stage sampling, we do not observe $Y_i$ but we obtain $\hat Y_i = M_i \bar y_i$, where $\bar y_i$ is the sample mean of the elements in cluster $i$. Thus, we can use

$$\hat Y_{PPS} = \frac{N}{n_I}\sum_{k=1}^{n_I} \frac{\hat Y_{a_k}}{M_{a_k}} = \frac{N}{n_I}\sum_{k=1}^{n_I} \bar y_{a_k} \quad (7.16)$$

to estimate the total $Y$. Assuming that there is no duplication among the selected clusters, the sampling weights are all equal to $N/(n_I m)$, which implies that every element in the population has the same probability of selection. A sampling design that leads to equal sampling weights is called a self-weighting design.

For estimation of the population mean $\bar Y = Y/N$, we have

$$\hat{\bar Y}_{PPS} = \frac{1}{n_I}\sum_{k=1}^{n_I} \bar y_{a_k}, \quad (7.17)$$

which takes the sample mean of the cluster means.

To discuss variance estimation, note that the point estimator (7.16) can be written as the sample mean of $z_1, \ldots, z_{n_I}$, where the $z_k$ are independently and identically distributed with the following discrete distribution:

$$z_1 = \hat Y_i / p_i \quad \text{with probability } p_i = M_i/N, \quad i = 1, \ldots, N_I.$$

Note that, given the second-stage estimates, $E(z_1) = \sum_{i=1}^{N_I} \hat Y_i$, which is unbiased for $Y = \sum_{i=1}^{N_I} Y_i$ as $E_2(\hat Y_i) = Y_i$. For variance estimation, since $\hat Y_{PPS}$ in (7.16) can be written as the sample mean of

independent $z_k$, we have

$$\hat V_{PPS}(\hat Y_{PPS}) = \frac{1}{n_I}\, S_z^2 = \frac{1}{n_I}\,\frac{1}{n_I - 1}\sum_{k=1}^{n_I} (z_k - \bar z)^2. \quad (7.18)$$

Variance estimation of the mean estimator (7.17) can be similarly constructed. Specifically, we can use

$$\hat V_{PPS}(\hat{\bar Y}_{PPS}) = \frac{1}{n_I}\,\frac{1}{n_I - 1}\sum_{k=1}^{n_I} (\bar y_{a_k} - \hat{\bar Y}_{PPS})^2 \quad (7.19)$$

as an unbiased estimator for the variance of the mean estimator (7.17).

To illustrate the use of two-stage sampling in Example 7.3, consider a finite population of households in a city. The city consists of $N_I$ clusters of houses and cluster $i$ consists of $M_i$ houses. We use the following two-stage cluster sampling.

[Stage 1] Select $n_I = 3$ sample clusters by PPS sampling with the measure of size equal to $M_i$.
[Stage 2] Within each selected cluster $i$, select $m_i = 4$ sample houses by simple random sampling.

Once the sample households are selected, we obtain two pieces of information. One is the number of household members in the house, $t_{ij}$, and the other is the number of household members with age under 6, $y_{ij}$. We are interested in estimating the proportion of the population with age under 6 in the city. That is, the parameter of interest is

$$P = \frac{\sum_{i=1}^{N_I}\sum_{j=1}^{M_i} y_{ij}}{\sum_{i=1}^{N_I}\sum_{j=1}^{M_i} t_{ij}} := \frac{Y}{T}.$$

The following table gives the summary of the realized sample households from the above two-stage sampling.

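Table columns: Sample Cluster ID, Sample household ID, $t_{ij}$, $y_{ij}$.

The proportion of the population with age under 6 in the city is estimated by

$$\hat P = \frac{\hat Y}{\hat T} = \frac{N n_I^{-1}\sum_{k=1}^{n_I} \bar y_k}{N n_I^{-1}\sum_{k=1}^{n_I} \bar t_k} = \frac{6/4 + 5/4 + 8/4}{28/4 + 41/4 + 20/4} \doteq 0.213,$$

where the second equality follows from (7.16). To estimate the variance of $\hat P$, we apply the variance formula for a sample mean of independent draws to the linearized values $(\bar y_i - \hat P\,\bar t_i)/\bar t$, where $\bar t = n_I^{-1}\sum_{i=1}^{n_I} \bar t_i$:

$$\hat V(\hat P) = \frac{1}{n_I (n_I - 1)}\sum_{i=1}^{n_I} \left(\frac{\bar y_i - \hat P\,\bar t_i}{\bar t}\right)^2.$$

The design effect can be computed as the ratio of $\hat V(\hat P)$ under the current sampling design to the variance of $\hat P$ under simple random sampling, which is computed by

$$\hat V_{SRS}(\hat p) = \frac{1}{n - 1}\,\hat p(1 - \hat p) = \left(\sum_{i=1}^{n_I}\sum_{j=1}^{m_i} t_{ij} - 1\right)^{-1} \hat p(1 - \hat p).$$

Thus, the estimated design effect, $\hat V(\hat P)/\hat V_{SRS}(\hat p)$, is .8105.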
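The calculations in Example 7.3 can be sketched as follows. The cluster means below are made-up numbers, not the table from this example, and the ratio variance uses the standard linearization approach (applying the sample-mean variance formula to the residuals $\bar y_k - \hat P\,\bar t_k$ scaled by the mean of the $\bar t_k$), which is an assumption about the intended formula. The function names are hypothetical.

```python
from statistics import mean, variance

def pps_mean_and_variance(cluster_means):
    """Mean estimator (7.17) and variance estimator (7.19) from the drawn cluster means."""
    n_I = len(cluster_means)
    return mean(cluster_means), variance(cluster_means) / n_I

def pps_ratio(y_means, t_means):
    """Ratio of two estimated totals with a linearised variance estimate."""
    n_I = len(y_means)
    p_hat = sum(y_means) / sum(t_means)        # the factor N / n_I cancels in Y_hat / T_hat
    t_bar = mean(t_means)
    # Linearised values; their sample mean is zero by construction
    e = [(y - p_hat * t) / t_bar for y, t in zip(y_means, t_means)]
    return p_hat, variance(e) / n_I

# Hypothetical cluster-level sample means from n_I = 3 PPS draws
y_means = [1.0, 2.0, 3.0]     # mean of y_ij within each drawn cluster
t_means = [4.0, 5.0, 6.0]     # mean of t_ij within each drawn cluster

ybar_pps, v_ybar = pps_mean_and_variance(y_means)
p_hat, v_p = pps_ratio(y_means, t_means)
```

Because the design is self-weighting, only the within-cluster sample means of the drawn clusters are needed; neither $N$ nor the $M_i$ enters the mean or ratio estimates.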

Reference

Raj, D. (1966). Some remarks on a simple procedure of sampling without replacement. Journal of the American Statistical Association 61.
