Imputation for Missing Data under PPSWR Sampling

July 5, 2010 Beijing Imputation for Missing Data under PPSWR Sampling Guohua Zou Academy of Mathematics and Systems Science Chinese Academy of Sciences 1 23

() Outline () Imputation method under PPSWR sampling () Special case: SRS sampling and uniform response () () 2 23

() Item nonresponse: occurs frequently in sample surveys. Example In sample survey on transportation, some vehicles may not be found, but their tonnage or seat capacity is known to us. Solutions: (1) Increasing response probability; (2) Imputing the missing values of the sampled units. Imputation methods: Ratio imputation, regression imputation, random imputation etc. Shortcoming: Uniform response (often), simple random sampling PPSWR sampling: the sampling with probability proportional to size with replacement which is often used in the first stage of multistage sampling. 3 23

() Imputation method under PPSWR sampling 1. Notation Survey population U: consist of N distinct units identified through the labels i = 1,..., N. Suppose that the auxiliary variable, X, is available for each unit of the population, but the variable of interest, Y, is missing for some of the sampled units. s n : sample with size n drawn from U by PPSWR sampling. s r : the respondent set with size r( 1). 4 23 s n r : the nonrespondent set with size n r.

Uniform response mechanism: independent response across sample units and equal response probability, p (q 1 p). Non-uniform response mechanism: independent response across sample units and possibly unequal response probabilities, p i (q i 1 p i ). Response indicator on y i : { 1, if the unit i responds to y i, I i = 0, otherwise. 5 23

2. Imputation method We first let p i be known. For the missing Y values, we suggest the following imputation method: y i = 1 n r q j y j x i, i s n r. p j s j x j r 6 23

It is interesting to note that yi is an approximation of the weighted least squares predictor under the following superpopulation model: y i = βx i + e i, ε(e i ) = 0, ε(e 2 i ) = σ2 x 2 i, ε(e ie j ) = 0 (i j). In fact, it can be seen that under the above superpopulation model, the weighted least squares estimator of β with the weights w i q i /(p i x 2 i ) is given by s ˆβ = r (q i y i )/(p i x i ) s r 1/p i r. Further, the expectation of s r 1/p i with respect to the response mechanism is n. 7 23

3. Estimator of population mean and its variance Applying the above imputation method to the PPSWR sampling, the corresponding Hansen-Hurwitz estimator of the population mean Ȳ is ˆȲ P P S = X n i s r y i = X p i x i n i s n y i p i x i I i. is design-unbiased under the non- Theorem 1. The estimator ˆȲ P P S uniform response mechanism. Theorem 2. Under the non-uniform response, the variance of ˆȲ P P S is given by V ( ˆȲ P P S) = 1 N 2 where Z i = X i /X. { 1 n N i=1 ( ) 2 Yi Z i Y + 1 Z i n N i=1 } q i Yi 2, p i Z i 8 23

4. Jackknife variance estimator Define the imputed values as follows: For i s n r, ( 1 q k y ) k x i, j s r, n r p k x k yi a k s (j) = r {j} ( 1 q k y ) k x i, j s n r, n r 1 p k x k k s r when the j-th sample unit is deleted. Based on these imputed values, the estimator of Ȳ can be obtained as X y i, j s r, n 1 p i x i ˆȲ P a i s P S(j) = r {j} X y i, j s n r, n 1 p i x i i s r when the j-th sample unit is deleted. 9 23

Define the j-th pseudovalue as follows: ˆȲ j = n ˆȲ P a P S (n 1) ˆȲ P P S(j). A jackknife variance estimator of ˆȲ P P S v J ( ˆȲ P P S ) = n 1 n = [ ˆȲ P P S j s n X 2 n(n 1) i s r is then given by y2 i p 2 i x2 i ˆȲ a P P S(j)] 2 1 n ( i s r y i p i x i ) 2. Theorem 3. Let n > 1. Then under the non-uniform response, we have E[v J ( ˆȲ P P S)] = V ( ˆȲ P P S). Theorem 3 shows that the jackknife variance estimator v J ( ˆȲ P P S ) is designunbiased under the non-uniform response. 10 23

5. Extension to the case of unknown response probability In practice, the response probability p i is rare to be known. Here we model the response probability p i by a parametric model p i = g(x i ; θ), where g is a known smooth function and θ is an unknown finitedimensional parameter. An example of such a model is the logistic regression model. The estimator ˆθ of the parameter θ can be obtained by the maximum likelihood approach. Let ˆp i = g(x i ; ˆθ) be the estimated response probability. Then the corresponding estimator of the population mean Ȳ and its jackknife variance estimator are given by ˆȲ e P P S = X n i s r y i ˆp i x i, 11 23

and respectively. v J ( ˆȲ e P P S) = X 2 n(n 1) i s r y2 i ˆp 2 i x2 i 1 n ( i s r y i ˆp i x i ) 2, For the relationship between the estimators with known and unknown p i, under some regularity conditions, we have and ˆȲ e P P S = ˆȲ P P S + O p ( 1 n ), ( ) v J ( ˆȲ P e P S) = v J ( ˆȲ 1 P P S) + O p. n 3/2 12 23

() Special case: SRS sampling and uniform response Imputed value: y i = Imputed estimator: 1 r y j x j s j r ˆȲ m SRS= ū r x n + where ū r = 1 u j with u j = y j /x j. r j s r x i, i s n r. r(n 1) (r 1)n (ȳ r ū r x r ), 13 23

The estimator ˆȲ SRS m is a design-unbiased estimator under uniform response. It is interesting that it is just the version of Hartley-Ross estimator (Hartley and Ross 1954, Nature) for estimating the population mean under two-phase sampling. Variance: V ( ˆȲ SRS) m = 1 ( ) np [S2 Y + (1 p)(2ȳ Ū X + Ū 2 T Ū 2 X2 2Ū Z)] 1 + O n 2, where T = 1 N and N T i with T i = Xi 2, and Z = 1 N i=1 S 2 Y = 1 N 1 N (Y i Ȳ )2. i=1 N Z i with Z i = Y i X i, i=1 14 23

Jackknife variance estimator: v J ( ˆȲ SRS m ) = 1 { (n 1)(r 1)ū 2 n(n 1)(r 1) r s 2 x(n) + 1 (r 2) 2[n(r 2) x n (nr r 2) x r ] 2 s 2 u(r) + (n 2)2 r 2 (r 2) 2 s2 y(r) where s y ν 1 u ν 2x ν 3(r) = 1 r 1 + (n 2)r(nr 2r2 + 4r 4) (r 2) 2 ū 2 r s 2 x(r) 2(n 2)r + (r 2) 2 [n(r 2) x n (nr r 2) x r ] s yu (r) 2 (r 2) 2[(n2 r n(r 2 r + 2) (r 1)(r 2))(r 2) x n (n 2 r 2 nr(r 2 + 4) + 2(3r 2 4r + 4)) x r ]ū r s ux (r) 2(n 2)r(nr r2 + r 2) (r 2) 2 ū r s yx (r) 2 r 2 [n(r 2) x n (nr r 2) x r ]s u 2 x(r) + 2(nr r2 + r 2) ū r s ux 2(r) r 2 2r +s u 2 x2(r) n (ȳ r ū r x r )s ux (r) + 2(n 2)r s yux (r) r 2 } r2 n(r 1) (ȳ r ū r x r ) 2, j s r (y j ȳ r ) ν 1 (u j ū r ) ν 2 (x j x r ) ν 3. 15 23

Two approximate Jackknife variance estimators: v J ( ˆȲ SRS m ) = nū2 1 rs 2 x(n) + 1 r [( x r x n ) 2 s 2 u(r) 2( x r x n )s yu (r) + s 2 y(r)] ( 1 + r 2 ) ( 1 ū 2 n rs 2 x(r) + 2 r 1 ) ū r [( x r x n )s ux (r) s yx (r)], n and v J( ˆȲ SRS) m = nū2 1 rs 2 x(n) + 1 ( 1 r s2 y(r) + r 2 ) ( 1 ū 2 n rs 2 x(r) 2 r 1 ) ū r s yx (r). n 16 23

Asymptotic design-unbiasedness: (i) E[v J ( ˆȲ SRS m m )] = V ( ˆȲ SRS ) + O ( ) 1 n. 2 (ii) E[v J ( ˆȲ SRS m m )] = V ( ˆȲ SRS ) + O ( ) 1 n. 2 (iii) E[v J ( ˆȲ SRS m m )] = V ( ˆȲ SRS ) + O ( ) 1 n. 2 Remark: We should note that the (approximate) design-unbiasedness is the main requirement for a good estimator in survey sampling. The approximate design-unbiasedness of the Jackknife variance estimators has been found first in Zou and Feng (1998), and then in Skinner and Rao (2002) and Zou et al. (2002). Such a property universally holds from our subsequent analysis. 17 23

() The data are generated from the three ratio models which are different only in the auxiliary variables: y i = 3.9x i + x i ε i with x i U(0.1, 2.1), N(1, 1), and N(20, 16), respectively, ε i N(0, 1), and x i and ε i are assumed to be independent. In the case of uniform response, we set p = 0.76; in the case of nonuniform response, the unequal response probability p i for the unit i follows the logistic model p i = exp( 1 + 2.3x i) 1 + exp( 1 + 2.3x i ). 18 23

We first generate a finite population with the size of N = 10, 000. Then the samples with n = 100 and n = 500 are drawn from the finite population by the PPSWR sampling, respectively. We repeat the process B = 5, 000 times. For the b-th run, denote the estimators of the population mean under the uniform (b) and non-uniform responses as and ˆȲ P P S, respectively. We calculate the ˆȲ I(b) P P S simulated means and variances as follows: and E ( ˆȲ I P P S) = 1 B V ( ˆȲ I P P S) = 1 B B ( b=1 B b=1 ˆȲ I(b) P P S, ˆȲ I(b) P P S Ȳ )2, E ( ˆȲ P P S) = 1 B V ( ˆȲ P P S) = 1 B B b=1 B ( b=1 ˆȲ (b) P P S ; ˆȲ (b) P P S Ȳ )2. Similarly, the corresponding jackknife variance estimators are calculated as E [v J ( ˆȲ I P P S)] = 1 B B b=1 v (b) J ( ˆȲ I P P S), and E [v J ( ˆȲ P P S)] = 1 B B b=1 v (b) J ( ˆȲ P P S). 19 23

20 23

Table 1 summarizes the results on the simulated mean, variance and jackknife variance estimate. It can be seen from the table that both of the estimators ˆȲ P I P S and ˆȲ P P S are very close to the true population means for the three distributions of auxiliary variable. Also, the jackknife variance estimators perform very well. 21 23

To study the effect of the response probability, we set various response probabilities: p = 0.5 for uniform response, and p i follows p i = exp{0.3(x i X)} 1 + exp{0.3(x i X)} for non-uniform response. The results are presented in Table 2. It is observed that the approximate design-unbiasedness of the proposed estimators still holds. On the other hand, it is also clear that the variances become larger for low response probability. For some other settings of response probability, we obtain the similar results but omit them here for saving space. 22 23

23 23

() Incomplete auxiliary information and multiple auxiliary information Estimation of response probability: the use of non-parametric approach. Other unequal probability sampling 24 23

Thank you! 25 23