Variance Estimation for Calibration to Estimated Control Totals

Size: px

Start display at page:

Download "Variance Estimation for Calibration to Estimated Control Totals"

Junior Pierce
6 years ago
Views:

1 Variance Estimation for Calibration to Estimated Control Totals Siyu Qing Coauthor with Michael D. Larsen Associate Professor of Statistics Tuesday, 11/05/2013

2 2 Outline A. Background B. Calibration Technique C. Calibration Weighting with Estimated Control Totals D. Variance Estimation E. Simulation E. Discussion

3 3 What is calibration weighting? Let {d k } be original survey (design) weights. t x = U x k is a known total in the population with indices U; x k can be a vector The calibrated weights {w k } are close to {d k } but satisfy a set of calibration equations: s w k x k = x k. U The closeness is measured by a distance function: 1 Linear calibration: E p { s (w k d k ) 2 /d k q k } 2 Raking: E p { s [w k log(w k /d k ) w k + d k ]} 3 Logit method and others. The calibration estimator ˆt yw = s w k y k.

4 4 Calibration Estimator Properties The calibration weights can be written as w k = d k F k (x k λ s ), where F k ( ) comes from the inverse of distance function. The linear calibration estimator AV (ˆt yw ) = AV (ˆt yl ). ˆt yl = w k y k = ˆt Ay + (t x ˆt Ax ) ˆBs s Calibration weigthing can reduce mean squared error. There are various ways to compute the weights, including in the survey and TeachingSampling packages in R.

5 5 Calibration on Estimated Control Totals In calibration only t x are needed from outside source. Sometimes t x are not available and one may seek estimated totals ˆt Cx instead. The calibration constraint equation becomes: w k x k = ˆt Cx. s Calibration estimator with estimated control totals ˆt yw (CEEC). Assuming Analytic Survey and Control Total Survey are independent.

6 6 Formula of the CEEC The linear CEEC ˆt yl = s w k y k = ˆt Ay + (ˆt Cx ˆt Ax ) ˆBs, where ˆB s = T 1 s s d k q k x k y k, and T s = s d k q k x k x k. The general CEEC ˆt yw = s w k y k = d k F k (x k γ s)y k s where γ s can be computationally solved from calibration constraints.

7 7 Assumptions for the CEEC 1 max x k and F k (0) are both uniformly bounded by K 1 < and K 2 <, respectively. 2 limn 1 t x exists. 3 N 1 (ˆt Ax t x ) 0 in design probability. 4 n 1/2 N 1 (ˆt Ax t x ) d MVN(0,Σ A ). 5 C λ = k U{λ : x k λ Im k(d k )} is a convex domain as well as an open neighborhood of 0. 6 N 1 (ˆt Cx t x ) 0 in design probability. 7 n α N 1 (ˆt Cx t x ) d MVN(0,Σ C ) with α 1/2, so the control total survey is at least as accurate as the analytic survey we conduct.

8 8 Properties of the CEEC ˆt yl and ˆt yl has exactly the same expectation. N 1 (ˆt yl ˆt yl ) = O p (n α ) ˆt yw is design consistent as well as asymptotically design unbiased. N 1 (ˆt yw ˆt Ay ) = O p (n 1/2 ) and N 1 (ˆt yw ˆt yl ) = O p (n 1 ). n 1/2 N 1 (ˆt yw ˆt yl ) = O p (n 1/2 α ).

9 9 Variance Estimation for the CEEC Extra variation brought by the estimated control totals. Variance might be underestimated by traditional methods. the contribution of the bias to the MSE is likely to be small. The key is to precisely estimate the inflation part of the variance. The asymptotic variance of ˆt yw is: AV (ˆt yw ) = AV (ˆt yw ) + B V (ˆt Cx )B = U kl d k E k d l E l + B V (ˆt Cx )B, where B = ( U x k x k ) 1 ( U x k y k ).

10 10 A traditional variance estimator (Naive) Traditional variance estimator considers the estimated control totals as they were true population totals. Deville and Särndal, 1992, JASA first defined the calibration estimator as well as gave a variance estimator. The formula is: V naive (ˆt yw ) = s ( kl /π kl )(w k e k )(w l e l ) where w k = d k F k (x k γ s), and e k = y k x k ˆB s is the sample fit residual. Whether this formula is a good estimate depends on the accuracy of control total survey.

11 Taylor linearization variance estimator (TL) The Taylor linearization variance estimator is: ˆV TL (ˆt yw ) = s ˇ kl w k e ks w l e ls + ˆB s V (ˆt Cx )ˆB s def = V naive + V inf E p (ˆB s V (ˆt Cx )ˆB s ) = B V (ˆt Cx )B + O p (n 2α ). V (ˆt Cx ) needs to be specified also from the outside source. Taylor Linearization method is likely to be faster than Jackknife methods.

12 Multivariate normal jackknife method (MVNJ) Dever and Valliant, 2010, Survey Methodology first used this method in variance estimation of poststratification to populatoin control totals. It is a delete-one jackknife method. The replicates ˆt Cx(j) = ˆt Cx + c n ˆɛ (j) 1/(n 1) where ˆɛ (j) i.i.d MVN(0, V (ˆt Cx )) (j = 1,2,...,n) and c n = 1/(n 1). The replicates of ˆt yl : ˆt yl(j) = ˆt Ay(j) + (ˆt Cx(j) ˆt Ax(j) ) ˆB s(j) The MVNJ variance estimator is: V MVNJ (ˆt yw ) = n 1 n n j=1 (ˆt yl(j) ˆt yl ) 2. E p { V MVNJ } = E p [ n 1 n n j=1 (ˆt yl(j) ˆt yl ) 2 ] + B V (ˆt Cx )B + O p (n 2α )

13 13 Fuller two-phase jackknife after Taylor linear. (F2TL) ˆt yl = s d k a ks E k +ˆt Cx B and ˆθ ˆt Cx B = O p(n 1 ), where ˆθ = ˆt Cx ˆB s. Fuller jackknife method is used to estimate V (ˆθ). Let V (ˆt Cx ) be m m matrix, and λ 1, λ 2,...,λ m be its eigenvalues with q 1, q 2,...,q m their corresponding eigenvectors. The replicates are: ˆt Cx(j) = ˆt Cx + c m λ 1/2 j q j and ˆθ (j) = ˆt Cx(j) ˆB s(j), where c m = (m 1) 1/2 m 1/2. Then the F2TL variance estimator is: V F 2TL (ˆt yw ) = V naive (ˆt yw ) + m 1 m m j=1 (ˆθ (j) ˆθ) 2. E p ( V F 2TL ) is equal to E p ( V naive ) + E p [ m 1 m n j=1 (ˆθ (j) ˆθ) 2 ] + B V (ˆt Cx )B + O p (n 2α ).

14 Simulation procedure 1 SDR 2010 is chosen as our populatoin. The population contains individuals, which were divided by us into 54 clusters. 2 At first stage np=30 PSU s (U 1, U 2,...,U 30 ) are sampled without replacement out of 54 clusters. 3 Secondly, from within each PSU sampled, we selected 50 individuals using simple random sampling without replacement. 4 Choose salary as the parameter we want to estimate and then choose m=20 auxiliary variables (X 1,X 2,...,X 20 ). 5 For each X i, calculate the PSU totals for each sampled PSU: (ˆt i1, ˆt i2,..., ˆt i30 ) 6 Estimate population totals of X i using PSU totals, then consider it as estimated control totals. 7 Calculate the calibrated estimator and its variance estimation with four methods mentioned above. 14

15 15 Simulation Result Variance estimator RBVE (%) Cover (%) MeanSE StdSE Simulation 1. np=20, linear calibration. HT ,002 23,373 Naive ,782 10,246 TL ,200 23,739 MVNJ ,788 48,803 F2TL ,668 28,489 Simulation 2. np=30, raking. HT ,940 10,888 Naive ,586 5,388 TL ,402 11,032 MVNJ ,470 27,049 F2TL ,537 15,280

16 16 Summary The CEEC is certainly a reasonable estimate when the population control totals are unknown. In simulation 2, the estimates are close to the true values. Overall, all the improved variance estimators we give are acceptable in simulation 2. They certainly mitigate the bias of the naive estimator.

17 Future plans 17

18 Thanks! Variance Estimation for Calibration to Estimated Control Totals Siyu Qing The George Washington University Department of Statistics

Implications of Ignoring the Uncertainty in Control Totals for Generalized Regression Estimators. Calibration Estimators

Implications of Ignoring the Uncertainty in Control Totals for Generalized Regression Estimators Jill A. Dever, RTI Richard Valliant, JPSM & ISR is a trade name of Research Triangle Institute. www.rti.org