Combining data from two independent surveys: a model-assisted approach
Jae Kwang Kim, Iowa State University
January 20, 2012
Joint work with J. N. K. Rao, Carleton University
Reference: Kim, J. K. & Rao, J. N. K. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, in press. (Available online via Advance Access, doi:10.1093/biomet/asr063.)
Outline
1. Introduction
2. Projection estimation
3. Replication variance estimation
4. Efficient estimation: full information
5. Simulation study
6. Concluding remarks and discussion
1. Introduction: Two-phase sampling
Classical two-phase sampling:
- $A_1$: first-phase sample of size $n_1$
- $A_2$: second-phase sample of size $n_2$ ($A_2 \subset A_1$)
- $x$ observed in phase 1; both $y$ and $x$ observed in phase 2
- Assume that 1 is an element of $x_i$.
References: Neyman (1934), Hansen & Hurwitz (1946), Rao (1973), Kott & Stukel (1997), Binder et al. (2000), Kim et al. (2006), Hidiroglou et al. (2009).
1. Introduction: Two-phase sampling
GREG estimator of $Y = \sum_{i=1}^N y_i$:
$$\hat{Y}_G = \hat{X}_1' \hat{\beta}_2, \qquad \hat{X}_1 = \sum_{i \in A_1} w_{i1} x_i, \qquad \hat{\beta}_2 = \Big(\sum_{i \in A_2} w_{i2} x_i x_i'\Big)^{-1} \sum_{i \in A_2} w_{i2} x_i y_i.$$
Two ways of implementing the GREG estimator:
1. Calibration: create a data file for $A_2$:
$$\hat{Y}_G = \sum_{i \in A_2} w_{2G,i}\, y_i, \qquad w_{2G,i} = \hat{X}_1' \Big(\sum_{i \in A_2} w_{i2} x_i x_i'\Big)^{-1} w_{i2}\, x_i.$$
2. Projection: create a data file for $A_1$:
$$\hat{Y}_G = \sum_{i \in A_1} w_{i1}\, \tilde{y}_i, \qquad \tilde{y}_i = x_i' \hat{\beta}_2.$$
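As a quick numerical check of the algebra above, the sketch below (Python with NumPy; the sample sizes, weights, and generating model are illustrative assumptions, not taken from the paper) computes the GREG estimator both ways and confirms that the calibration and projection forms coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-phase data: x (with intercept) observed on the
# first-phase sample A1; (x, y) observed on the subsample A2.
n1, n2 = 500, 100
x1 = np.column_stack([np.ones(n1), rng.chisquare(2, n1)])  # A1 design matrix
idx2 = rng.choice(n1, n2, replace=False)                   # A2 subset of A1
x2 = x1[idx2]
y2 = 1 + 0.7 * x2[:, 1] + rng.normal(0, np.sqrt(2), n2)
w1 = np.full(n1, 20.0)   # first-phase weights (illustrative)
w2 = np.full(n2, 100.0)  # second-phase weights (illustrative)

# beta_hat_2 = (sum w2 x x')^{-1} sum w2 x y
XtWX = x2.T @ (w2[:, None] * x2)
beta2 = np.linalg.solve(XtWX, x2.T @ (w2 * y2))

# Calibration form: weights w_{2G,i} attached to the A2 data file
X1_hat = w1 @ x1                               # sum over A1 of w1 * x
w2G = X1_hat @ np.linalg.solve(XtWX, (w2[:, None] * x2).T)
Y_greg_cal = w2G @ y2

# Projection form: synthetic values ytilde = x'beta2 on the A1 file
Y_greg_proj = w1 @ (x1 @ beta2)

print(np.isclose(Y_greg_cal, Y_greg_proj))  # True: the two forms coincide
```

Both forms reduce algebraically to $\hat{X}_1' \hat{\beta}_2$, which is why the check holds exactly (up to floating point).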
1. Introduction: Domain projection estimators
Calibration estimator of the domain total $Y_d = \sum_{i=1}^N \delta_i(d)\, y_i$:
$$\hat{Y}_{Cal,d} = \sum_{i \in A_2} w_{2G,i}\, \delta_i(d)\, y_i,$$
where $\delta_i(d) = 1$ if $i$ belongs to domain $d$ and $\delta_i(d) = 0$ otherwise.
Note: $\hat{Y}_{Cal,d}$ is based only on the domain sample belonging to $A_2$, so it can have large variance if the domain sample in $A_2$ is very small.
1. Introduction: Domain projection estimators
Domain projection estimator (Fuller, 2003):
$$\hat{Y}_{d,p} = \sum_{i \in A_1} w_{i1}\, \delta_i(d)\, \tilde{y}_i.$$
Note: $\hat{Y}_{d,p}$ is based on a much larger domain sample (in $A_1$) than $\hat{Y}_{Cal,d}$ (in $A_2$), so it can be significantly more efficient if its relative bias is small.
Under the model $y_i = x_i'\beta + e_i$ with $E(e_i) = 0$, $\hat{Y}_{d,p}$ is model unbiased for $Y_d$. However, it is possible to construct populations for which $\hat{Y}_{d,p}$ is severely design biased (Fuller, 2003).
1. Introduction: Combining two independent surveys
Large sample $A_1$ collecting only $x$, with weights $\{w_{i1}, i \in A_1\}$; a much smaller sample $A_2$, drawn independently, collecting both $x$ and $y$, with weights $\{w_{i2}, i \in A_2\}$.
Example 1 (Hidiroglou, 2001): Canadian Survey of Employment, Payrolls and Hours.
- $A_1$: large sample drawn from a Canada Customs and Revenue Agency administrative data file; auxiliary variables $x$ observed.
- $A_2$: small sample from the Statistics Canada Business Register; study variables $y$ (number of hours worked by employees and summarized earnings) observed.
1. Introduction: Combining two independent surveys
Example 2 (Reiter, 2008):
- $A_2$: both self-reported health measurements $x$ and clinical measurements from physical examinations $y$ observed.
- $A_1$: only $x$ reported.
Synthetic values $\tilde{y}_i$, $i \in A_1$, are created by first fitting a working model $E(y) = m(x, \beta)$ relating $y$ to $x$ to the data $\{(y_i, x_i),\, i \in A_2\}$ and then predicting the $y_i$ associated with $x_i$, $i \in A_1$. Only the synthetic values $\tilde{y}_i = m(x_i, \hat{\beta})$, $i \in A_1$, and the associated weights $w_{i1}$, $i \in A_1$, are released to the public.
Our focus is on producing estimators of totals and domain totals from the synthetic data file $\{(\tilde{y}_i, w_{i1}),\, i \in A_1\}$.
2. Projection estimation: Estimation of $Y$
Projection estimator of $Y$:
$$\hat{Y}_p = \sum_{i \in A_1} w_{i1}\, \tilde{y}_i.$$
$\hat{Y}_p$ is asymptotically design unbiased if $\hat{\beta}$ satisfies
$$\sum_{i \in A_2} w_{i2}\,\{y_i - m(x_i, \hat{\beta})\} = 0. \qquad (*)$$
Note: under condition (*),
$$\hat{Y}_p = \sum_{i \in A_1} w_{i1}\, \tilde{y}_i + \sum_{i \in A_2} w_{i2}\,\{y_i - \tilde{y}_i\} = \text{prediction} + \text{bias correction}.$$
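A minimal sketch of this setup for two independent samples (sizes, weights, and the generating model are all illustrative assumptions): fitting the working model by weighted least squares with an intercept term makes the bias-correction term vanish, which is exactly what condition (*) requires.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent samples: the large sample A1 observes only x;
# the small sample A2 observes (x, y). Sizes/weights are illustrative.
n1, n2, N = 500, 100, 10_000
x1 = np.column_stack([np.ones(n1), rng.chisquare(2, n1)])
x2 = np.column_stack([np.ones(n2), rng.chisquare(2, n2)])
y2 = 1 + 0.7 * x2[:, 1] + rng.normal(0, np.sqrt(2), n2)
w1 = np.full(n1, N / n1)
w2 = np.full(n2, N / n2)

# Weighted least squares on A2 solves sum w2 x (y - x'b) = 0; because
# x contains the intercept, condition (*) holds automatically.
beta = np.linalg.solve(x2.T @ (w2[:, None] * x2), x2.T @ (w2 * y2))

Yp = w1 @ (x1 @ beta)              # projection estimator from the A1 file
bias_corr = w2 @ (y2 - x2 @ beta)  # sum over A2 of w2 (y - ytilde)

print(abs(bias_corr) < 1e-5)       # condition (*) is satisfied
```

Because the bias-correction term is identically zero here, releasing only $\{(\tilde{y}_i, w_{i1}), i \in A_1\}$ loses nothing for estimating the total.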
2. Projection estimation: Estimation of $Y$
Theorem 1: Under some regularity conditions, if $\hat{\beta}$ satisfies condition (*), we can write
$$\hat{Y}_p = \sum_{i \in A_1} w_{i1}\, m_0(x_i) + \sum_{i \in A_2} w_{i2}\,\{y_i - m_0(x_i)\} \equiv \hat{P}_1 + \hat{Q}_2,$$
where $m_0(x_i) = m(x_i, \beta_0)$ and $\beta_0 = \operatorname{plim} \hat{\beta}$ with respect to survey 2. Thus
$$E(\hat{Y}_p) = \sum_{i=1}^N m_0(x_i) + \sum_{i=1}^N \{y_i - m_0(x_i)\} = \sum_{i=1}^N y_i$$
and
$$V(\hat{Y}_p) = V(\hat{P}_1) + V(\hat{Q}_2).$$
2. Projection estimation
Model-assisted approach: the asymptotic unbiasedness of $\hat{Y}_p$ does not depend on the validity of the working model, but efficiency is affected.
Note: in the variance decomposition $V(\hat{Y}_p) = V(\hat{P}_1) + V(\hat{Q}_2) = V_1 + V_2$, $V_1$ is based on $n_1$ sample elements and $V_2$ is based on $n_2$ sample elements. If $n_2 \ll n_1$, then $V_1 \ll V_2$. If the working model is good, then the squared error terms $e_i^2 = \{y_i - m_0(x_i)\}^2$ are small and $V_2$ will also be small.
2. Projection estimation: When is condition (*) satisfied?
If 1 is an element of $x_i$, condition (*) is satisfied for the linear regression model $m(x_i, \beta) = x_i'\beta$ and the logistic regression model $\operatorname{logit}\{m(x_i, \beta)\} = x_i'\beta$ when $\hat{\beta}$ is obtained from the estimating equation
$$\sum_{i \in A_2} w_{i2}\, x_i\, (y_i - m_i) = 0.$$
For the ratio model, $\hat{\beta}$ is the solution of $\sum_{i \in A_2} w_{i2}\, (y_i - m_i) = 0$.
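For the logistic case the estimating equation has no closed form; a sketch under assumed data (weights, sample size, and coefficients are all illustrative) solves the weighted score equation by Newton-Raphson and verifies that the solution satisfies it, so condition (*) holds via the intercept component.

```python
import numpy as np

rng = np.random.default_rng(2)

# Solve sum_{A2} w_i x_i (y_i - m(x_i, b)) = 0 for a logistic working
# model m(x, b) = 1 / (1 + exp(-x'b)) by Newton-Raphson.
n2 = 200
x = np.column_stack([np.ones(n2), rng.normal(size=n2)])
p_true = 1 / (1 + np.exp(-(0.5 + 1.0 * x[:, 1])))
y = rng.binomial(1, p_true)
w = rng.uniform(50, 150, n2)   # illustrative survey weights

beta = np.zeros(2)
for _ in range(25):
    m = 1 / (1 + np.exp(-(x @ beta)))
    score = x.T @ (w * (y - m))                              # weighted score
    info = x.T @ (w[:, None] * (m * (1 - m))[:, None] * x)   # weighted information
    beta += np.linalg.solve(info, score)

m = 1 / (1 + np.exp(-(x @ beta)))
print(np.allclose(x.T @ (w * (y - m)), 0, atol=1e-6))  # score equation solved
```

The first component of the solved score equation is $\sum w_i (y_i - m_i) = 0$, which is condition (*).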
2. Projection estimation: Linearization variance estimation
Let $\hat{e}_i = y_i - \tilde{y}_i$. Then a variance estimator of $\hat{Y}_p$ is
$$v_L(\hat{Y}_p) = v_1(\tilde{y}_i) + v_2(\hat{e}_i),$$
where $v_1(z_i) = v(\hat{Z}_1)$ is the variance estimator for survey 1 applied to $\hat{Z}_1 = \sum_{i \in A_1} w_{i1} z_i$, and $v_2(z_i) = v(\hat{Z}_2)$ is the variance estimator for survey 2 applied to $\hat{Z}_2 = \sum_{i \in A_2} w_{i2} z_i$.
Note: $v_L(\hat{Y}_p)$ requires access to data from both surveys.
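A sketch of $v_L$ when both surveys are simple random samples without replacement (an assumption for illustration): the standard SRS variance formula for an estimated total is applied once to the synthetic values on $A_1$ and once to the residuals on $A_2$.

```python
import numpy as np

rng = np.random.default_rng(3)

def srs_var(z, n, N):
    """v(Zhat) = N^2 (1 - n/N) s_z^2 / n for an SRS-estimated total."""
    return N**2 * (1 - n / N) * np.var(z, ddof=1) / n

# Illustrative data: two independent SRSs from an assumed population model.
N, n1, n2 = 10_000, 500, 100
x1 = rng.chisquare(2, n1)
x2 = rng.chisquare(2, n2)
y2 = 1 + 0.7 * x2 + rng.normal(0, np.sqrt(2), n2)

# Fit the linear working model on A2 (equal weights under SRS).
X2 = np.column_stack([np.ones(n2), x2])
beta = np.linalg.solve(X2.T @ X2, X2.T @ y2)
ytilde1 = beta[0] + beta[1] * x1        # synthetic values on A1
e2 = y2 - (beta[0] + beta[1] * x2)      # residuals on A2

v1 = srs_var(ytilde1, n1, N)            # survey-1 component, v_1(ytilde)
v2 = srs_var(e2, n2, N)                 # survey-2 component, v_2(ehat)
vL = v1 + v2
print(vL > 0)
```

Note how $v_2$ uses the residuals rather than $y$ itself: a good working model shrinks $v_2$, mirroring the $V_2$ discussion above.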
2. Projection estimation: Estimation of the domain total $Y_d$
Projection domain estimator:
$$\hat{Y}_{d,p} = \sum_{i \in A_1} w_{i1}\, \delta_i(d)\, \tilde{y}_i.$$
$\hat{Y}_{d,p}$ is asymptotically unbiased if either
Case (i): $\sum_{i \in A_2} w_{i2}\, \delta_i(d)\, (y_i - \tilde{y}_i) = 0$, or
Case (ii): $\operatorname{Cov}\{\delta_i(d),\, y_i - m(x_i, \beta_0)\} = 0$.
2. Projection estimation: Estimation of the domain total $Y_d$
Case (i): For linear or logistic regression models, (i) is satisfied if $\delta_i(d)$ is an element of $x_i$. For planned domains specified in advance, augmented working models can be used; the survey 1 data file should then provide the planned domain indicators.
Case (ii): If the working model is good, the relative bias of $\hat{Y}_{d,p}$ will be small. $\hat{Y}_{d,p}$ is asymptotically model unbiased if the model is correct, but it can be significantly design biased for some populations.
3. Replication variance estimation: Replication variance estimation for $\hat{Y}_p$
Replication variance estimator for survey 1:
$$v_{1,\mathrm{rep}}(\hat{Z}_1) = \sum_{k=1}^{L_1} c_k \big(\hat{Z}_1^{(k)} - \hat{Z}_1\big)^2, \qquad \hat{Z}_1^{(k)} = \sum_{i \in A_1} w_{i1}^{(k)} z_i,$$
where $\{w_{i1}^{(k)}, i \in A_1\}$, $k = 1, \dots, L_1$, are the replication weights for survey 1.
Replication variance estimator for $\hat{Y}_p$:
$$v_{1,\mathrm{rep}}(\hat{Y}_p) = \sum_{k=1}^{L_1} c_k \big(\hat{Y}_p^{(k)} - \hat{Y}_p\big)^2, \qquad \hat{Y}_p^{(k)} = \sum_{i \in A_1} w_{i1}^{(k)}\, \tilde{y}_i^{(k)},$$
where $\{\tilde{y}_i^{(k)}, i \in A_1\}$ are the synthetic values for replicate $k$.
3. Replication variance estimation: Replication variance estimation for $\hat{Y}_p$
How to create the replicated synthetic data $\{\tilde{y}_i^{(k)}, i \in A_1\}$?
1. Create $\{w_{i2}^{(k)}, k = 1, \dots, L_1;\ i \in A_2\}$ such that $\sum_{k=1}^{L_1} c_k \big(\hat{Y}_2^{(k)} - \hat{Y}_2\big)^2 = v_2(\hat{Y}_2)$.
2. Compute $\hat{\beta}^{(k)}$ and $\tilde{y}_i^{(k)} = m(x_i, \hat{\beta}^{(k)})$ by solving
$$\sum_{i \in A_2} w_{i2}^{(k)}\, \{y_i - m(x_i, \beta)\}\, x_i = 0$$
for $\hat{\beta}^{(k)}$ (linear or logistic regression).
Then $v_{1,\mathrm{rep}}(\hat{Y}_p)$ is asymptotically unbiased. The data file for sample $A_1$ should contain additional columns of $\{\tilde{y}_i^{(k)}, i \in A_1\}$ and the associated $\{w_{i1}^{(k)}, i \in A_1\}$, $k = 1, \dots, L_1$.
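A simplified delete-one jackknife sketch of step 2 (illustrative data; the paper uses $L_1 = n_2$ random groups and also perturbs the survey-1 weights $w_{i1}^{(k)}$, whereas this sketch replicates only the survey-2 fit, so it tracks the component of variance coming from $\hat{\beta}$):

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative independent SRSs; linear working model fit on A2.
N, n1, n2 = 10_000, 500, 100
x1 = rng.chisquare(2, n1)
x2 = rng.chisquare(2, n2)
y2 = 1 + 0.7 * x2 + rng.normal(0, np.sqrt(2), n2)
w1 = np.full(n1, N / n1)

def fit(xs, ys):
    """OLS fit of y on (1, x): solves the replicate estimating equation."""
    X = np.column_stack([np.ones(len(xs)), xs])
    return np.linalg.solve(X.T @ X, X.T @ ys)

b = fit(x2, y2)
Yp = w1 @ (b[0] + b[1] * x1)          # full-sample projection estimator

# For each replicate k, refit beta without unit k, rebuild the synthetic
# values on A1, and recompute Yp; then apply the replication formula
# with c_k = (n2 - 1) / n2 (delete-one jackknife).
reps = np.empty(n2)
for k in range(n2):
    mask = np.arange(n2) != k
    bk = fit(x2[mask], y2[mask])
    reps[k] = w1 @ (bk[0] + bk[1] * x1)
v1_rep = (n2 - 1) / n2 * np.sum((reps - Yp) ** 2)
print(v1_rep > 0)
```

In the released file, the columns $\tilde{y}_i^{(k)}$ play the role of `reps` here: the analyst never needs the $A_2$ microdata.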
3. Replication variance estimation: Replication variance estimation for $\hat{Y}_{d,p}$
Let $\hat{Y}_{d,p}^{(k)} = \sum_{i \in A_1} w_{i1}^{(k)}\, \delta_i(d)\, \tilde{y}_i^{(k)}$. Then
$$v_{1,\mathrm{rep}}(\hat{Y}_{d,p}) = \sum_{k=1}^{L_1} c_k \big(\hat{Y}_{d,p}^{(k)} - \hat{Y}_{d,p}\big)^2$$
is asymptotically unbiased under either case (i) or case (ii).
4. Optimal estimator (full information): Estimation of the total $Y$
Three estimators for two parameters:
- Survey 1: $\hat{X}_1$ for $X$
- Survey 2: $(\hat{X}_2, \hat{Y}_2)$ for $(X, Y)$
Combine the information by generalized least squares: minimize
$$Q(X, Y) = \begin{pmatrix} \hat{X}_1 - X \\ \hat{X}_2 - X \\ \hat{Y}_2 - Y \end{pmatrix}' V^{-1} \begin{pmatrix} \hat{X}_1 - X \\ \hat{X}_2 - X \\ \hat{Y}_2 - Y \end{pmatrix}$$
with respect to $(X, Y)$, where $V$ is the variance-covariance matrix of $(\hat{X}_1, \hat{X}_2, \hat{Y}_2)$.
4. Optimal estimator (full information): Estimation of the total $Y$
Best linear unbiased estimator based on $\hat{X}_2$, $\hat{Y}_2$ and $\hat{X}_1$:
$$\tilde{Y}_{opt} = \hat{Y}_2 + B_{y \cdot x2}\, \big(\tilde{X}_{opt} - \hat{X}_2\big), \qquad \tilde{X}_{opt} = \frac{V_{xx2}\, \hat{X}_1 + V_{xx1}\, \hat{X}_2}{V_{xx1} + V_{xx2}},$$
where $B_{y \cdot x2} = V_{yx2} / V_{xx2}$, $V_{xx1} = V(\hat{X}_1)$, $V_{xx2} = V(\hat{X}_2)$, and $V_{yx2} = \operatorname{Cov}(\hat{Y}_2, \hat{X}_2)$.
Replace the variances in $\tilde{Y}_{opt}$ by estimated variances to get $\hat{Y}_{opt}$ and $\hat{X}_{opt}$.
4. Optimal estimator (full information): Estimation of the total $Y$
$\hat{Y}_{opt}$ can be expressed as
$$\hat{Y}_{opt} = \sum_{i \in A_2} w_{i2}^*\, y_i,$$
where $\{w_{i2}^*, i \in A_2\}$ are calibration weights satisfying $\sum_{i \in A_2} w_{i2}^*\, x_i = \hat{X}_{opt}$. Thus $\hat{Y}_{opt}$ can be computed from a data file for $A_2$ providing the weights $\{w_{i2}^*, i \in A_2\}$.
Example: simple random samples $A_1$ and $A_2$ (scalar $x$):
$$w_{i2}^* = \frac{N}{n_2} + \frac{x_i - \bar{x}_2}{\sum_{i \in A_2} (x_i - \bar{x}_2)^2}\, \big(\hat{X}_{opt} - \hat{X}_2\big),$$
where $\bar{x}_2$ is the mean of $x$ for $A_2$.
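The SRS example above can be checked numerically. The sketch below (illustrative population model and sample sizes) builds $\hat{X}_{opt}$ and $\hat{Y}_{opt}$ from estimated variances and verifies that the calibration-weight form reproduces both.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative independent SRSs without replacement.
N, n1, n2 = 10_000, 500, 100
x1 = rng.chisquare(2, n1)
x2 = rng.chisquare(2, n2)
y2 = 1 + 0.7 * x2 + rng.normal(0, np.sqrt(2), n2)

X1h, X2h, Y2h = N * x1.mean(), N * x2.mean(), N * y2.mean()
f1, f2 = 1 - n1 / N, 1 - n2 / N                    # finite population corrections
Vxx1 = N**2 * f1 * x1.var(ddof=1) / n1             # est. V(X1hat)
Vxx2 = N**2 * f2 * x2.var(ddof=1) / n2             # est. V(X2hat)
Vyx2 = N**2 * f2 * np.cov(y2, x2)[0, 1] / n2       # est. Cov(Y2hat, X2hat)

# Inverse-variance combination, then carry the adjustment to Y.
Xopt = (Vxx2 * X1h + Vxx1 * X2h) / (Vxx1 + Vxx2)
Yopt = Y2h + (Vyx2 / Vxx2) * (Xopt - X2h)

# Equivalent calibration-weight form on the A2 data file.
w_star = N / n2 + (x2 - x2.mean()) / np.sum((x2 - x2.mean()) ** 2) * (Xopt - X2h)
print(np.isclose(w_star @ x2, Xopt), np.isclose(w_star @ y2, Yopt))
```

The check holds exactly: $\sum w_{i2}^* x_i$ telescopes to $\hat{X}_2 + (\hat{X}_{opt} - \hat{X}_2)$, and the slope term in $\sum w_{i2}^* y_i$ reduces to $\hat{B}_{y \cdot x2}$ because the common factor $N^2 f_2 / n_2$ cancels.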
4. Optimal estimator (full information): Domain estimation
Calibration estimator: $\hat{Y}_d = \sum_{i \in A_2} w_{i2}^*\, \delta_i(d)\, y_i$, computed from the data file for $A_2$ only.
Projection estimator: $\hat{Y}_{d,p} = \sum_{i \in A_1} w_{i1}\, \delta_i(d)\, \tilde{y}_i$, computed from the data file for $A_1$.
Both $\hat{Y}_d$ and $\hat{Y}_{d,p}$ satisfy the internal consistency property:
$$\sum_d \hat{Y}_d = \hat{Y}_{opt}, \qquad \sum_d \hat{Y}_{d,p} = \hat{Y}_p.$$
4. Optimal estimator (full information): Domain estimation
$\hat{Y}_d$ is asymptotically design unbiased but can have large variance if the domain contains few sample $A_2$ units.
An optimal estimator $\hat{Y}_{opt,d}$ based on domain-specific variances does not satisfy internal consistency, may not be stable for small domain sample sizes, and cannot be implemented from the $A_2$ data file alone.
5. Simulation study: Simulation setup
Two artificial populations A and B of size $N = 10{,}000$: $\{(y_i, x_i, z_i);\ i = 1, \dots, N\}$.
Population A (regression model): $x_i \sim \chi^2(2)$, $y_i = 1 + 0.7 x_i + e_i$ with $e_i \sim N(0, 2)$, and $z_i \sim \mathrm{Unif}(0, 1)$ independent of $(x_i, y_i)$.
Population B (ratio model): same $(x_i, z_i)$, but $y_i = 0.7 x_i + u_i$ with $u_i \sim N(0, x_i)$.
$\operatorname{corr}(y, x) = 0.71$ for both populations.
Domain $d$: $\delta_i(d) = 1$ if $z_i < 0.3$; $\delta_i(d) = 0$ otherwise.
5. Simulation study: Simulation setup
Two independent simple random samples: $n_1 = 500$, $n_2 = 100$.
Working models: linear regression, ratio, augmented linear regression, augmented ratio.
Relative bias: $\mathrm{RB}(\hat{Y}) = \{E(\hat{Y}) - Y\}/Y$.
Relative efficiency: $\mathrm{RE}(\hat{Y}) = \mathrm{mse}(\hat{Y}_{opt})/\mathrm{mse}(\hat{Y})$.
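A scaled-down Monte Carlo in the spirit of this setup (population A only, regression working model, fewer replications than the paper; a sketch, not a reproduction of Table 1) illustrates the near-zero relative bias of the projection estimator of the total:

```python
import numpy as np

rng = np.random.default_rng(6)

# Fixed finite population A: y = 1 + 0.7 x + e, x ~ chi2(2), e ~ N(0, 2).
N, n1, n2, B = 10_000, 500, 100, 300
x = rng.chisquare(2, N)
y = 1 + 0.7 * x + rng.normal(0, np.sqrt(2), N)
Y = y.sum()

est = np.empty(B)
for b in range(B):
    s1 = rng.choice(N, n1, replace=False)   # sample A1: x only
    s2 = rng.choice(N, n2, replace=False)   # independent sample A2: (x, y)
    X2 = np.column_stack([np.ones(n2), x[s2]])
    beta = np.linalg.solve(X2.T @ X2, X2.T @ y[s2])
    est[b] = (N / n1) * np.sum(beta[0] + beta[1] * x[s1])  # projection est.

rb = (est.mean() - Y) / Y
print(round(rb, 3))   # close to zero
```

With more replications the Monte Carlo noise in `rb` shrinks toward the essentially unbiased behavior reported in Table 1.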
5. Simulation study: Simulation results

Table 1: Simulation results (point estimation)

                                         Population A       Population B
Parameter   Estimator                    RB      RE         RB      RE
Total       Regression projection        0.00    0.98       0.00    0.97
            Ratio projection             0.00    0.58       0.00    0.99
            Aug. reg. projection         0.00    0.97       0.00    0.97
            Aug. ratio projection        0.01    0.55       0.00    0.98
            Optimal                      0.00    1.00       0.00    1.00
Domain      Regression projection        0.00    1.96       0.01    2.01
            Ratio projection             0.01    1.22       0.01    2.05
            Aug. reg. projection         0.00    1.05       0.00    0.98
            Aug. ratio projection        0.00    0.64       0.00    0.96
            Optimal                     -0.01    1.00      -0.02    1.00
            Calibration                  0.00    0.45       0.00    0.53
5. Simulation study: Conclusions from Table 1 (estimation of the total $Y$)
1. The RB of all estimators is negligible: less than 2%.
2. The regression projection estimator is almost as efficient as $\hat{Y}_{opt}$ even when the true model is the ratio model. The ratio projection estimator is considerably less efficient when the true model has a substantial intercept term: model diagnostics help identify a good working model.
3. The augmented projection estimators are similar to the corresponding projection estimators in terms of RB and RE.
5. Simulation study: Conclusions from Table 1 (domain estimation)
1. The RB of all estimators is less than 5%: the simulation setup ensures that $\delta_i(d)$ is unrelated to $r_i = y_i - m(x_i; \beta)$.
2. The regression projection estimator is considerably more efficient than the calibration estimator or the optimal estimator, because the projection estimator is based on the larger sample.
3. The ratio projection estimator is considerably less efficient when the model has a substantial intercept term.
5. Simulation study: Jackknife variance estimation
$L_1 = n_2 = 100$ pseudo-replicates created by a random-group jackknife.

Table 2: Simulation results (relative biases of variance estimators)

Point estimator          Parameter   Pop. A     Pop. B
Regression projection    Total       -0.013      0.024
                         Domain      -0.030      0.006
Ratio projection         Total        0.032      0.000
                         Domain      -0.001     -0.017
Aug. reg. projection     Total        0.033      0.040
                         Domain       0.022      0.050
Aug. ratio projection    Total        0.059      0.030
                         Domain       0.064      0.061

The RB of the jackknife variance estimators is small: less than 7% in all cases.
6. Discussion: Some alternative approaches
The proposed method does not lead to the optimal estimator
$$\hat{Y}_{opt} = \hat{Y}_2 + \hat{B}_{y \cdot x2}\, \big(\hat{X}_{opt} - \hat{X}_2\big), \qquad \hat{X}_{opt} = \frac{\hat{V}_{xx2}\, \hat{X}_1 + \hat{V}_{xx1}\, \hat{X}_2}{\hat{V}_{xx1} + \hat{V}_{xx2}}.$$
To implement the optimal estimator using synthetic data, we may express
$$\hat{Y}_{opt} = \sum_{i \in A_1^*} w_{i3}\, \tilde{y}_i + \sum_{i \in A_2} w_{i2}\, (y_i - \tilde{y}_i),$$
where $\tilde{y}_i = x_i' \hat{B}_{y \cdot x2}$, $A_1^* = A_1 \cup A_2$, and $w_{i3}$ is the sampling weight for $A_1^*$ satisfying $\sum_{i \in A_1^*} w_{i3}\, x_i = \hat{X}_{opt}$.
6. Discussion: Some alternative approaches
If $\sum_{i \in A_2} w_{i2} = \sum_{i \in A_1^*} w_{i3}$, then we can further express
$$\hat{Y}_{opt} = \sum_{i \in A_1^*} w_{i3} \sum_{j \in A_2} w_{ij}^*\, (\hat{y}_i + \hat{e}_j),$$
where $w_{ij}^* = w_{j2} / \sum_{i \in A_2} w_{i2}$ and $\hat{e}_j = y_j - \hat{y}_j$. It now takes the form of the fractional imputation considered in Fuller & Kim (2005). To reduce the size of the data set, we may consider random selection of $M$ residuals to get $\hat{e}_j^*$ and
$$\hat{Y}_{FI} = \sum_{i \in A_1^*} w_{i3} \sum_{j=1}^M w_{ij}^{**}\, (\hat{y}_i + \hat{e}_j^*),$$
where $w_{ij}^{**}$ satisfies $\sum_{j=1}^M w_{ij}^{**}\, (1, \hat{e}_j^*) = \sum_{j \in A_2} w_{ij}^*\, (1, \hat{e}_j)$.
6. Discussion: Some alternative approaches
Nested two-phase sampling: $A_2 \subset A_1$. Non-nested two-phase sampling: $A_1$, $A_2$ independent.
We can convert non-nested two-phase sampling into nested two-phase sampling with $A_2 \subset A_1^*$, where $A_1^* = A_1 \cup A_2$. Synthetic data can then be released for $A_1^*$.
6. Discussion: Parametric multiple imputation
Assume that $f(y_i \mid x_i, \theta)$ is known for fixed $\theta$ and that $A_1$ and $A_2$ are simple random samples.
- Obtain the posterior distribution of $\theta$, $p(\theta \mid y_2, x_2)$, assuming a diffuse prior on $\theta$, where $(y_2, x_2)$ denotes the data from $A_2$.
- Draw $M$ values $\theta^{(1)}, \dots, \theta^{(M)}$ from the posterior distribution.
- Draw $y_i^{(l)}$ from $f(y_i \mid x_i, \theta^{(l)})$ for $i \in A_1$ and $l = 1, \dots, M$.
- Synthetic data sets: $\{y_i^{(l)}, i \in A_1\}$, $l = 1, \dots, M$.
Standard multiple imputation variance estimators do not work here. Reiter (2008) proposed a two-stage imputation procedure requiring $T$ synthetic data sets $\{y_{it}^{(l)} : i \in A_1,\ t = 1, \dots, T\}$ to be generated for each $\theta^{(l)}$; in all, $TM$ synthetic data sets are generated.
6. Discussion: Conclusion
The proposed method is based on deterministic imputation to generate the synthetic values. The synthetic data, along with the replicates, are created for survey 1, and only the survey 1 data are released. A significant efficiency gain is achieved for domain estimation. A stochastic imputation approach is under study.
REFERENCES
Binder, D. A., Babyak, C., Brodeur, M., Hidiroglou, M. & Jocelyn, W. (2000). Variance estimation for two-phase stratified sampling. Can. J. Statist. 28, 751–764.
Fuller, W. A. (2003). Estimation for multiple phase samples. In Analysis of Survey Data, R. L. Chambers & C. J. Skinner, eds. Chichester: Wiley.
Fuller, W. A. & Kim, J. K. (2005). Hot deck imputation for the response model. Survey Methodology 31, 139–149.
Hansen, M. & Hurwitz, W. (1946). The problem of non-response in sample surveys. J. Am. Statist. Assoc. 41, 517–529.
Hidiroglou, M. (2001). Double sampling. Survey Methodology 27, 143–154.
Hidiroglou, M. A., Rao, J. N. K. & Haziza, D. (2009). Variance estimation in two-phase sampling. Aust. N.Z. J. Stat. 51, 127–141.
Kim, J. K., Navarro, A. & Fuller, W. A. (2006). Replicate variance estimation after multi-phase stratified sampling. J. Am. Statist. Assoc. 101, 312–320.
Kott, P. & Stukel, D. (1997). Can the jackknife be used with a two-phase sample? Survey Methodology 23, 81–89.
Neyman, J. (1934). On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. J. R. Statist. Soc. 97, 558–606.
Rao, J. N. K. (1973). On double sampling for stratification and analytical surveys. Biometrika 60, 125–133.
Reiter, J. (2008). Multiple imputation when records used for imputation are not used or disseminated for analysis. Biometrika 95, 933–946.