Combining Non-probability and Probability Survey Samples Through Mass Imputation

Size: px

Start display at page:

Download "Combining Non-probability and Probability Survey Samples Through Mass Imputation"

Henry Ward
5 years ago
Views:

1 Combining Non-probability and Probability Survey Samples Through Mass Imputation Jae-Kwang Kim 1 Iowa State University & KAIST October 27, Joint work with Seho Park, Yilin Chen, and Changbao Wu

2 Outline 1 Introduction 2 Proposed method 3 Variance estimation 4 Replication variance estimation 5 A real data application 6 Conclusion Kim (ISU) Survey Data Integration October 27, / 26

3 1. Introduction We are interested in combining information from two samples, one with probability sampling and the other with non-probability sampling (such as voluntary sample). We observe X from the probability sample and observe (X, Y ) from the non-probability sample. The sampling mechanism for sample B is unknown. Table: Data Structure Data X Y Representativeness A! Yes B!! No Kim (ISU) Survey Data Integration October 27, / 26

4 Mass Imputation Rivers (2007) idea 1 Use X to find the nearest neighbor for each unit i A. 2 Compute ˆθ = i A w i y i where yi is the imputed value of y i in i A using nearest neighbor imputation. Based on MAR (missing at random) assumption f (y x, δ = 1) = f (y x) where δ i = { 1 if i B 0 if i / B. Kim (ISU) Survey Data Integration October 27, / 26

5 Mass Imputation for data integration Basic Steps 1 Use sample B to estimate the conditional distribution f (y x). 2 Predict y-values for sample A using the estimated conditional distribution. If f (y x) is correctly specified and MAR holds, then the mass imputation estimator is unbiased. Question: How to estimate the variance of mass imputation estimator? Kim (ISU) Survey Data Integration October 27, / 26

6 2. Proposed method Assume Y i = m(x i ; β) + e i (1) for some β with known function m( ), with E(e i x i ) = 0. We assume that ˆβ is the unique solution to Û(β) {y i m(x i ; β)} h(x i ; β) = 0 (2) i B for some p-dimensional vector h(x i ; β). Thus, we use the observations in sample B to obtain ˆβ and then use it to construct ŷ i = m(x i ; ˆβ) for all i A. Kim (ISU) Survey Data Integration October 27, / 26

7 Theorem Suppose that model (1) and MAR condition hold. Under some regularity conditions, the mass imputation estimator satisfies where ỹ I (β) = N 1 i A c = [ n 1 B ȳ I = 1 w i m(x i ; N ˆβ) (3) i A ȳ I = ỹ I (β 0 ) + o p (n 1/2 B ) (4) w i m(x i ; β) + n 1 B {y i m(x i ; β)} h(x i ; β) c, i B ] 1 N ṁ(x i ; β 0 )h (x i ; β 0 ) N 1 ṁ(x i ; β 0 ), i B β 0 is the true value of β in (1), and ṁ(x; β) = m(x; β)/ β. Kim (ISU) Survey Data Integration October 27, / 26 i=1

8 Also, and E{ỹ I (β 0 ) ȳ N } = 0, (5) { V {ỹ I (β 0 ) ȳ N } = V N 1 w i m(x i ; β 0 ) N 1 m(x i ; β 0 ) i A i U [ + E E ( ] ei 2 ) { x i h(xi ; β 0 ) c } 2, (6) where e i = y i m(x i ; β 0 ). n 2 B i B } Kim (ISU) Survey Data Integration October 27, / 26

9 Example Under the special case of linear model y i = x iβ + e i with e i (0, σe), 2 we can use ŷ i = x ˆβ i with ˆβ = ( i B x ix ) 1 i i B x iy i to construct regression mass imputation. If we assume SRS for sample A, the asymptotic variance in (6) reduces to ) V (ˆθ I,reg = 1 β n 1σ 2 x { σ 2 ( xn x B ) 2 } e + E A n B i B (x i x B ) 2 σe. 2 If sample B were an independent random sample of size n B, then the third term would of order O(n 2 B ) and is negligible. However, as sample B is a non-probability sample, the third term is not negligible. Kim (ISU) Survey Data Integration October 27, / 26

10 3. Variance estimation For variance estimation of the mass imputation estimator (3), we have only to estimate the variance of the linearized estimator ỹ I (β 0 ) in (4). Since the variance formula can be written as where { V {ỹ I (β 0 ) ȳ N } = V A + V B V A = V N 1 w i m(x i ; β 0 ) N 1 m(x i ; β 0 ) i A i U [ V B = E E ( ] ei 2 ) { x i h(xi ; β 0 ) c } 2, n 2 B i B we can estimate V A and V B separately. } Kim (ISU) Survey Data Integration October 27, / 26

11 To estimate ˆV A, we can use ˆV A = N 2 π ij π i π j w i m(x i ; ˆβ)w j m(x j ; ˆβ). π ij i A j A where π ij is the joint inclusion probability for unit i and j, which is assumed to be positive. To estimate V B, we can use ˆV B = n 2 B i B ê 2 i { h(x i ; ˆβ) ĉ } 2, (7) where ê i = y i m(x i ; ˆβ) and [ ] 1 ĉ = ṁ(x i ; β 0 )h (x i ; β 0 ) N 1 w i ṁ(x i ; β 0 ) n 1 B i B Hence, the variance of ȳ I,reg can be estimated by ˆV (ȳ I,reg ) = ˆV A + ˆV B. i A Kim (ISU) Survey Data Integration October 27, / 26

12 Remark If n A /n B = o(1), then V B is smaller order than V A and total variance is dominated by V A. Otherwise, the two variances both contribute to the total variance. If sample B is a big data, n B is huge and V B can be safely ignored. However, to compute ˆV B in (7), we use individual observations of (x i, y i ) in sample B, which is not necessarily available when only sample A with mass imputation is released to the public. Note that the goal of mass imputation is to produce a representative sample with synthetic observations using sample B as a training data. Once the mass imputation is performed, the training data is no longer necessary in computing the point estimation. So, it is desirable to develop a variance estimation method that does not require access to the sample B observations. Kim (ISU) Survey Data Integration October 27, / 26

13 4. Replication variance estimation We consider a bootstrap method for variance estimation that creates replicated synthetic data {ŷ (k) i, i A} corresponding to each set of bootstrap weights {w (k) i, i A} associated with sample A only. This method enables the user to correctly estimate the variance of the mass imputation estimator ȳ I,reg without access to the training data {(y i, x i ) : i B} from sample B. The data file will contain additional columns of {y (k) i : i A} associated with the columns of weights {w (k) i ; i A} (k = 1,, L), where L is the number of replicates created from sample A. Kim (ISU) Survey Data Integration October 27, / 26

14 Kim & Rao (2012) developed a replication method for survey integration when sample B is also a probability sample. 1 Obtain ˆβ (k), the k-th replicate of ˆβ, by solving the same estimating equation for β using the replication weights for sample B. 2 The k-th replicate of the mass imputation estimator ȳ I,reg = i A w iŷ i is ȳ (k) I,reg = w (k) i ŷ (k) i i A where ŷ i = m(x i ; ˆβ (k) ). How to modify the method of Kim & Rao (2012) when sample B is a non-probability sample? Kim (ISU) Survey Data Integration October 27, / 26

15 In order to develop valid bootstrap method for mass imputation estimator ȳ I,reg in (3), it is critical to develop a valid bootstrap method for estimating V (ˆβ) when ˆβ is computed from (2). Note that, under assumption (1) and MAR, we can obtain V (ˆβ) =. J 1 ΩJ 1 (8) where J = E { n 1 B i B ṁih } { i and Ω = E n 2 B i B E(e2 i x)h i h } i with ṁ i = ṁ(x i ; β 0 ) and h i = h(x i ; β 0 ). In (8), the reference distribution is the joint distribution of the superpopulation model and the unknown sampling mechanism for sample B. Kim (ISU) Survey Data Integration October 27, / 26

16 Interestingly, the variance formula in (8) is exactly equal to the variance of ˆβ when sample B is selected from simple random sampling (SRS). That is, even though the sample design for sample B is not SRS, its effect for the variance of ˆβ is essentially equal to that under SRS. Roughly speaking, the MAR assumption makes the effect of the sampling design for estimating β ignorable even though it is still not ignorable for Ȳ N. Therefore, we can safely ignore the sampling design for sample B when estimating β and develop a valid bootstrap method for variance estimation of ˆβ using the bootstrap method for SRS. Kim (ISU) Survey Data Integration October 27, / 26

17 Thus, the proposed bootstrap method can be described as in the following steps: 1 Treating sample B as a simple random sample, generate the k-th bootstrap sample from sample B to compute ˆβ (k), the k-th bootstrap replicate of ˆβ, using the same estimation formula (2) applied to the bootstrap sample. 2 Using ˆβ (k) from [Step 1], compute ŷ i = m(x i ; ˆβ (k) ) for each i A. Using the ŷ (k) i, we obtain the replicated mass imputation estimator ȳ (k) I,reg = i A w (k) i ŷ (k) i. 3 The resulting bootstrap variance estimator of ȳ I,reg is then ˆV b (ȳ I,reg ) = L 1 L k=1 ( ȳ (k) I,reg ȳ I,reg ) 2. (9) Kim (ISU) Survey Data Integration October 27, / 26

18 5. A real data application Pew Research Center (PRC) data in 2015: a non-probability sample data of size n = 9, 301 with 56 variables, provided by eight different vendors with unknown sampling and data collection strategies. The PRC dataset aims to study the relation between people and community. We choose 9 variables, among them 8 are binary and 1 is continuous, as response variables in our analysis. We consider two probability samples with common auxiliary variables. The first is the Behavioral Risk Factor Surveillance System (BRFSS) survey data and the second is the Volunteer Supplement survey data from the Current Population Survey (CPS), both collected in Kim (ISU) Survey Data Integration October 27, / 26

19 ˆX PRC ˆX BRFSS ˆX CPS Age category < >=30,< >=50,< >= Gender Female Race White only Race Black only Origin Hispanic/Latino Region Northeast Region South Region West Marital status Married Employment Working Kim Employment (ISU) Retired Survey Data Integration October , / 26 Comparison of covariates from three datasets Table: Estimated Population Mean of Covariates from the Three Samples

20 Comparison of covariates from three datasets (Cont d) ˆX PRC ˆX BRFSS ˆX CPS Education High school or less Education Bachelor s degree and above Education Bachelor s degree NA Education Postgraduate NA Household Presence of child in household NA Household Home ownership NA Health Smoke everyday NA Health Smoke never NA Financial status No money to see doctors NA Financial status Having medical insurance NA Financial status Household income < 20K NA Financial status Household income >100K NA Volunteer works Volunteered NA Kim (ISU) Survey Data Integration October 27, / 26

21 Remark There are noticeable differences between the naive estimates from the PRC sample and the estimates from the two probability samples for covariates such as Origin (Hispanic/Latino), Education (High school or less), Household (with children), Health (Smoking) and Volunteer works. It is strong evidence that the PRC dataset is not a representative sample for the population. Kim (ISU) Survey Data Integration October 27, / 26

22 Mass Imputation using a single set of common covariates Table: Estimated Population Mean Using A Single Set of Common Covariates Binary Response y ˆθ v l ( 10 5 ) v b ( 10 5 ) Talked with PRC neighbours frequently BRFSS CPS Tended to trust PRC neighbours BRFSS CPS Expressed opinions PRC at a government level BRFSS CPS Voted local PRC elections BRFSS CPS Kim (ISU) Survey Data Integration October 27, / 26

23 Mass Imputation using a single set of common covariates Table: Estimated Population Mean Using A Single Set of Common Covariates Binary Response y ˆθ v l ( 10 5 ) v b ( 10 5 ) Participated in PRC school groups BRFSS CPS Participated in PRC service organizations BRFSS CPS Participated in PRC sports organizations BRFSS CPS No money PRC to buy food BRFSS CPS Kim (ISU) Survey Data Integration October 27, / 26

24 Mass Imputation using a single set of common covariates Table: Estimated Population Mean Using A Single Set of Common Covariates Continuous Response y ˆθ I v l ( 10 2 ) v b ( 10 2 ) Days had at least PRC one drink last month BRFSS CPS Kim (ISU) Survey Data Integration October 27, / 26

25 Discussion There are substantial discrepancies between the mass imputation estimator and the naive estimator in most cases. The mass imputation estimates obtained with two different probability samples are comparable for all cases. The two variance estimators obtained by using the linearization and the bootstrap methods generally agree with each other. Kim (ISU) Survey Data Integration October 27, / 26

26 6. Conclusion Using a non-probability sample data as a training set for prediction, we can implement mass imputation for survey sample data. Nonparametric model can be used for the imputation model. Machine learning algorithm can also be used for mass imputation. A promising area of research. Kim (ISU) Survey Data Integration October 27, / 26

27 REFERENCES Kim, J. K. & Rao, J. N. K. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika 99, Rivers, D. (2007). Sampling for web surveys. In Proceedings of the Survey Research Methods Section of the American Statistical Association. Kim (ISU) Survey Data Integration October 27, / 26

Combining Non-probability and. Probability Survey Samples Through Mass Imputation

Combining Non-probability and. Probability Survey Samples Through Mass Imputation Combining Non-probability and arxiv:1812.10694v2 [stat.me] 31 Dec 2018 Probability Survey Samples Through Mass Imputation Jae Kwang Kim Seho Park Yilin Chen Changbao Wu January 1, 2019 Abstract. This paper