Using the Superpopulation Model for Imputations and Variance Computation in Survey Sampling. Petr Novák, Václav Kosina, Czech Statistical Office


1 Using the Superpopulation Model for Imputations and Variance Computation in Survey Sampling Czech Statistical Office


3 Introduction

Situation: Let us have a population of $N$ units: $n$ sampled (sam) and $N-n$ unknown (imp). We want to estimate the population total $Y = \sum_{i=1}^{N} y_i$.

Model assumptions: $y_i = \beta x_i + \varepsilon_i$, where
- $\varepsilon_i$ are independent random variables with $E\varepsilon_i = 0$ and $\operatorname{var}\varepsilon_i = c_i\sigma^2$,
- $x_i$ and $c_i$ are known constants for all $i = 1,\dots,N$,
- $\beta$ and $\sigma^2$ are unknown parameters.


5 Imputation

Estimation: Estimate $\beta$ from the sampled part using the least squares method:

$$\hat\beta = \frac{\sum_{\mathrm{sam}} w_i x_i y_i / c_i}{\sum_{\mathrm{sam}} w_i x_i^2 / c_i},$$

where $w_i$ are some appropriate weights. Note: constant weights and $c_i = x_i$ give $\hat\beta = \sum_{\mathrm{sam}} y_i \big/ \sum_{\mathrm{sam}} x_i$.

Data imputation: For each unit from the unknown part we impute $\hat y_i = x_i\hat\beta$. The estimate of the population total is then

$$\hat Y = \sum_{\mathrm{sam}} y_i + \sum_{\mathrm{imp}} \hat y_i.$$
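The estimation and imputation steps can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `estimate_beta` and `estimate_total` and the toy data are assumptions, not part of the original slides.

```python
def estimate_beta(x, y, c, w=None):
    """Weighted least squares slope for y_i = beta * x_i + eps_i, var(eps_i) = c_i * sigma^2."""
    if w is None:
        w = [1.0] * len(x)  # constant weights
    num = sum(wi * xi * yi / ci for wi, xi, yi, ci in zip(w, x, y, c))
    den = sum(wi * xi * xi / ci for wi, xi, ci in zip(w, x, c))
    return num / den


def estimate_total(x_sam, y_sam, c_sam, x_imp, w=None):
    """Impute y_hat_i = x_i * beta_hat for non-sampled units and add the observed y_i."""
    beta_hat = estimate_beta(x_sam, y_sam, c_sam, w)
    y_imputed = [xi * beta_hat for xi in x_imp]
    return sum(y_sam) + sum(y_imputed), beta_hat, y_imputed


# Toy data, made up for illustration only.
x_sam = [1.0, 2.0, 3.0]
y_sam = [2.0, 4.0, 7.0]
x_imp = [4.0, 5.0]

# With c_i = x_i and constant weights, beta_hat reduces to sum(y) / sum(x).
total, beta_hat, y_imputed = estimate_total(x_sam, y_sam, x_sam, x_imp)
```

Here `c_sam` is set equal to `x_sam`, so the slope coincides with the ratio estimator mentioned in the note.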

6 Differences from classic techniques

Classic reweighting approach:
- $y_i$ treated as constants.
- Randomness enters through the sample inclusion indicators.
- Error computed through $\operatorname{var}\hat Y$.

Superpopulation model approach:
- $y_i$ treated as random variables.
- Real $y_i$ from the imputed part are predicted with $\hat y_i = x_i\hat\beta$.
- Error computed through $\operatorname{mse}\hat Y = E(\hat Y - Y)^2$.

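7 Error computation

The least squares estimator is unbiased ($E\hat\beta = \beta$). Therefore $E\hat y_i = E x_i\hat\beta = x_i\beta = E y_i$. Write $\hat Y_{\mathrm{imp}} = \sum_{\mathrm{imp}} \hat y_i$ and $Y_{\mathrm{imp}} = \sum_{\mathrm{imp}} y_i$; the sampled part is common to $\hat Y$ and $Y$ and cancels. The mean square error of the prediction is then

$$\begin{aligned}
\operatorname{mse}\hat Y &= E(\hat Y - Y)^2 = E(\hat Y_{\mathrm{imp}} - Y_{\mathrm{imp}})^2 \\
&= E(\hat Y_{\mathrm{imp}} - E\hat Y_{\mathrm{imp}} - Y_{\mathrm{imp}} + EY_{\mathrm{imp}})^2 \\
&= E(\hat Y_{\mathrm{imp}} - E\hat Y_{\mathrm{imp}})^2 + E(Y_{\mathrm{imp}} - EY_{\mathrm{imp}})^2 - 2E(\hat Y_{\mathrm{imp}} - E\hat Y_{\mathrm{imp}})(Y_{\mathrm{imp}} - EY_{\mathrm{imp}}) \\
&= \operatorname{var}\hat Y_{\mathrm{imp}} + \operatorname{var}Y_{\mathrm{imp}}.
\end{aligned}$$

The second line uses $E\hat Y_{\mathrm{imp}} = EY_{\mathrm{imp}}$. The cross term vanishes because $\hat Y_{\mathrm{imp}}$ depends only on the sampled $y_i$ and $Y_{\mathrm{imp}}$ only on the non-sampled ones, which are independent.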


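9 Variance computation

The variance of the estimated values is, with $X_{\mathrm{imp}} := \sum_{\mathrm{imp}} x_i$,

$$\operatorname{var}\hat Y_{\mathrm{imp}} = \operatorname{var}(X_{\mathrm{imp}}\hat\beta) = X_{\mathrm{imp}}^2 \operatorname{var}\hat\beta = X_{\mathrm{imp}}^2\, \frac{\sum_{\mathrm{sam}} w_i^2 x_i^2 / c_i}{\left(\sum_{\mathrm{sam}} w_i x_i^2 / c_i\right)^2}\, \sigma^2.$$

We denote $\operatorname{var}\hat\beta$ as $\sigma_\beta^2$. The variance of the predicted real values is

$$\operatorname{var}Y_{\mathrm{imp}} = \sum_{\mathrm{imp}} c_i \sigma^2.$$

Denote $C_{\mathrm{imp}} := \sum_{\mathrm{imp}} c_i$. We get

$$\operatorname{mse}\hat Y = X_{\mathrm{imp}}^2 \sigma_\beta^2 + C_{\mathrm{imp}} \sigma^2.$$

Possible estimators for $\sigma^2$: the unweighted

$$\frac{1}{n-1} \sum_{\mathrm{sam}} \frac{(y_i - \hat\beta x_i)^2}{c_i},$$

or a weighted version of the form $\frac{1}{K_w} \sum_{\mathrm{sam}} w_i \frac{(y_i - \hat\beta x_i)^2}{c_i}$, where $K_w$ is a normalizing constant depending on the weights $w_i$ (its exact form is not legible in the source).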
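A minimal sketch of the resulting error computation, using the unweighted estimator of $\sigma^2$ with divisor $n-1$. The function name `mse_of_total` and the toy data are illustrative assumptions.

```python
def mse_of_total(x_sam, y_sam, c_sam, x_imp, c_imp, w=None):
    """Estimate mse(Y_hat) = X_imp^2 * var(beta_hat) + C_imp * sigma^2."""
    n = len(x_sam)
    if w is None:
        w = [1.0] * n  # constant weights
    # Denominator sum of the least squares estimator.
    b = sum(wi * xi * xi / ci for wi, xi, ci in zip(w, x_sam, c_sam))
    beta_hat = sum(wi * xi * yi / ci
                   for wi, xi, yi, ci in zip(w, x_sam, y_sam, c_sam)) / b
    # Unweighted estimator of sigma^2 from scaled squared residuals.
    sigma2_hat = sum((yi - beta_hat * xi) ** 2 / ci
                     for xi, yi, ci in zip(x_sam, y_sam, c_sam)) / (n - 1)
    # var(beta_hat) = sum(w^2 x^2 / c) / B^2 * sigma^2
    var_beta = sum(wi * wi * xi * xi / ci
                   for wi, xi, ci in zip(w, x_sam, c_sam)) / b ** 2 * sigma2_hat
    x_imp_total = sum(x_imp)
    c_imp_total = sum(c_imp)
    mse = x_imp_total ** 2 * var_beta + c_imp_total * sigma2_hat
    return mse, var_beta, sigma2_hat


# Toy data, made up for illustration only; c_i = x_i throughout.
x_sam = [1.0, 2.0, 3.0]
y_sam = [2.0, 4.0, 7.0]
x_imp = [4.0, 5.0]
mse, var_beta, sigma2_hat = mse_of_total(x_sam, y_sam, x_sam, x_imp, x_imp)
```

For this toy input the estimated $\sigma^2$ is $1/12$ and the two terms of the mean square error are $81/72$ and $9/12$.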

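10 Special cases

If $w_i \equiv$ const. and $c_i = x_i$, we get $\sigma_\beta^2 = \frac{1}{X_{\mathrm{sam}}}\sigma^2$ and therefore

$$\operatorname{mse}\hat Y = \frac{X_{\mathrm{imp}}^2}{X_{\mathrm{sam}}}\sigma^2 + X_{\mathrm{imp}}\sigma^2 = \frac{X_{\mathrm{imp}} X_{\mathrm{all}}}{X_{\mathrm{sam}}}\sigma^2,$$

where $X_{\mathrm{sam}} = \sum_{\mathrm{sam}} x_i$ and $X_{\mathrm{all}} = X_{\mathrm{sam}} + X_{\mathrm{imp}}$.

If we have no auxiliary information available and set $x_i \equiv 1$, we impute the sample mean for each unit. Then $X_{\mathrm{sam}} = n$, $X_{\mathrm{imp}} = N - n$, $X_{\mathrm{all}} = N$, and we get the commonly used formula

$$\operatorname{mse}\hat Y = \frac{(N-n)N}{n}\sigma^2 = \frac{N^2}{n}\left(1 - \frac{n}{N}\right)\sigma^2.$$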
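As a quick numerical check of the two special cases, the closed forms can be evaluated directly. The function names and the numbers below are illustrative assumptions only.

```python
def mse_ratio_case(x_sam_total, x_imp_total, sigma2):
    """Constant weights and c_i = x_i: mse = X_imp * X_all / X_sam * sigma^2."""
    x_all_total = x_sam_total + x_imp_total
    return x_imp_total * x_all_total / x_sam_total * sigma2


def mse_mean_case(pop_size, sample_size, sigma2):
    """x_i = c_i = 1 (sample mean imputed): mse = N^2 / n * (1 - n / N) * sigma^2."""
    return pop_size ** 2 / sample_size * (1 - sample_size / pop_size) * sigma2


# Made-up example: N = 100 units, n = 20 sampled, sigma^2 = 4.
# With x_i = 1 the totals are X_sam = 20 and X_imp = 80.
mse_from_ratio = mse_ratio_case(20.0, 80.0, 4.0)
mse_from_mean = mse_mean_case(100, 20, 4.0)
```

Both calls describe the same situation, so the two values should agree up to floating-point rounding.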

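11 Chain imputation

Situation: $x_i$ not known, but estimated from $z_i$.

Model (mean, variance): $y_i \mid x_i \sim (x_i\beta_{yx},\ c_i\sigma_{yx}^2)$, $x_i \sim (z_i\beta_{xz},\ d_i\sigma_{xz}^2)$.

With the help of the conditional variance decomposition we get

$$\begin{aligned}
\operatorname{mse}(\hat Y) &= \operatorname{var}\hat Y_{\mathrm{imp}} + \operatorname{var}Y_{\mathrm{imp}} \\
&= E\operatorname{var}[\hat Y_{\mathrm{imp}} \mid X] + \operatorname{var}E[\hat Y_{\mathrm{imp}} \mid X] + E\operatorname{var}[Y_{\mathrm{imp}} \mid X] + \operatorname{var}E[Y_{\mathrm{imp}} \mid X] \\
&= \dots = E\operatorname{mse}(\hat Y \mid X) + \beta_{yx}^2 \operatorname{mse}(\hat X).
\end{aligned}$$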

12 Chain imputation

Estimated error:

$$\operatorname{mse}\hat Y = \operatorname{mse}(\hat Y \mid \hat X) + \hat\beta_{yx}^2 \operatorname{mse}\hat X.$$

The chain structure can be followed up and stacked until we get to an auxiliary variable which is known for all units, such as administrative data.
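The stacking of chained errors can be sketched as a simple recursion. The function name `stacked_mse`, the list-of-pairs layout and the numbers are assumptions made for illustration; the conditional mean square errors themselves would come from the single-level formulas.

```python
def stacked_mse(links):
    """Accumulate mse along an imputation chain.

    `links` is a list of (conditional_mse, beta_hat) pairs, ordered from the
    link closest to the fully known auxiliary variable up to the target
    variable. Each step applies mse = conditional_mse + beta_hat^2 * mse_below.
    """
    mse = 0.0  # the fully known auxiliary variable carries no error
    for conditional_mse, beta_hat in links:
        mse = conditional_mse + beta_hat ** 2 * mse
    return mse


# Made-up numbers: x is imputed from known z, then y is imputed from x.
mse_x_given_z, beta_xz_hat = 2.0, 0.5
mse_y_given_x, beta_yx_hat = 3.0, 1.5
mse_y = stacked_mse([(mse_x_given_z, beta_xz_hat), (mse_y_given_x, beta_yx_hat)])
```

In the first step the slope multiplies a zero error, so the result is just the conditional error of $\hat X$; the second step reproduces the estimated-error formula for $\hat Y$.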

13 Stratification level shifts

Situation: The population is divided into strata (size class, NACE, region). There are several stratification levels, going from relatively small groups to larger ones. When there are not enough responding units to estimate $\beta$ in one stratum, we use the estimate from the corresponding higher-level stratum.

[Diagram: strata labelled S, S1, S2]

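14 Stratification level shifts

If the estimated total of the whole population divided into strata $m_1,\dots,m_K$ is $\hat Y = \sum_j \hat Y_{m_j}$, the mean square error is

$$\begin{aligned}
\operatorname{mse}\hat Y &= \operatorname{var}\hat Y^{\mathrm{imp}} + \operatorname{var}Y^{\mathrm{imp}} = \operatorname{var}\sum_j \hat Y^{\mathrm{imp}}_{m_j} + \operatorname{var}\sum_j Y^{\mathrm{imp}}_{m_j} \\
&= \sum_j \operatorname{var}\hat Y^{\mathrm{imp}}_{m_j} + \sum_{j \ne k} \operatorname{cov}(\hat Y^{\mathrm{imp}}_{m_j}, \hat Y^{\mathrm{imp}}_{m_k}) + \sum_j \operatorname{var}Y^{\mathrm{imp}}_{m_j}.
\end{aligned}$$

Both the variances of the estimated values and those of the real values can be computed with the methods from above.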
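Once the per-stratum variances and the pairwise covariances are available, the sum above can be assembled as follows. This is a sketch only; the function name `mse_over_strata` and the input layout (lists plus a symmetric matrix of covariances) are assumptions.

```python
def mse_over_strata(var_estimated, cov_estimated, var_real):
    """Combine per-stratum terms into the mse of the overall total.

    var_estimated[j]    : var of the imputed total in stratum j
    cov_estimated[j][k] : cov of imputed totals in strata j and k (j != k used)
    var_real[j]         : var of the real non-sampled total in stratum j
    """
    k_strata = len(var_estimated)
    cov_sum = sum(cov_estimated[j][k]
                  for j in range(k_strata)
                  for k in range(k_strata)
                  if j != k)  # each unordered pair is counted twice
    return sum(var_estimated) + cov_sum + sum(var_real)


# Made-up numbers for two strata.
var_estimated = [1.0, 2.0]
cov_estimated = [[0.0, 0.5],
                 [0.5, 0.0]]
var_real = [3.0, 4.0]
mse = mse_over_strata(var_estimated, cov_estimated, var_real)
```

The diagonal of `cov_estimated` is ignored, since the variances are passed separately.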

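15 Stratification level shifts - covariance computation

Let $m_1$ and $m_2$ be two basic strata, with $\hat\beta$ estimated from superstrata $S_1$ and $S_2$ respectively. Denote $S_d = S_1 \cap S_2$, which is the smaller of $S_1$ and $S_2$ if the stratification levels are well ordered. Denote $S = S_1 \cup S_2$, which is then the larger of both. Then

$$\begin{aligned}
\operatorname{cov}(\hat Y_{m_1}, \hat Y_{m_2}) &= \operatorname{cov}(X^{\mathrm{imp}}_{m_1}\hat\beta_{S_1},\ X^{\mathrm{imp}}_{m_2}\hat\beta_{S_2}) = X^{\mathrm{imp}}_{m_1} X^{\mathrm{imp}}_{m_2} \operatorname{cov}(\hat\beta_{S_1}, \hat\beta_{S_2}) \\
&= X^{\mathrm{imp}}_{m_1} X^{\mathrm{imp}}_{m_2} \operatorname{cov}\left( \frac{\sum_{S_1^{\mathrm{sam}}} w_i x_i y_i / c_i}{\sum_{S_1^{\mathrm{sam}}} w_i x_i^2 / c_i},\ \frac{\sum_{S_2^{\mathrm{sam}}} w_i x_i y_i / c_i}{\sum_{S_2^{\mathrm{sam}}} w_i x_i^2 / c_i} \right).
\end{aligned}$$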

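16 Stratification level shifts - covariance computation

The variables $y_i$ belonging to either $S_1$ or $S_2$ but not to $S_d$ are mutually independent, so only the sampled units in $S_d$, which enter both estimators, contribute to the covariance. Denote as $B_{S_1}$ and $B_{S_2}$ the sums in the denominators:

$$\begin{aligned}
\operatorname{cov}(\hat Y_{m_1}, \hat Y_{m_2}) &= \frac{X^{\mathrm{imp}}_{m_1} X^{\mathrm{imp}}_{m_2}}{B_{S_1} B_{S_2}} \operatorname{var}\sum_{S_d^{\mathrm{sam}}} w_i x_i y_i / c_i \\
&= \frac{X^{\mathrm{imp}}_{m_1} X^{\mathrm{imp}}_{m_2}}{B_{S_1} B_{S_2}} \sum_{S_d^{\mathrm{sam}}} \frac{w_i^2 x_i^2}{c_i^2} \operatorname{var}y_i \\
&= \frac{X^{\mathrm{imp}}_{m_1} X^{\mathrm{imp}}_{m_2}}{B_{S_1} B_{S_2}} \sum_{S_d^{\mathrm{sam}}} \frac{w_i^2 x_i^2}{c_i}\, \sigma^2_{S_d} \\
&= X^{\mathrm{imp}}_{m_1} X^{\mathrm{imp}}_{m_2} \frac{B_{S_d}}{B_S}\, \sigma^2_{\beta, S_d}.
\end{aligned}$$

The last step uses $B_{S_1} B_{S_2} = B_{S_d} B_S$ (the two superstrata are $S_d$ and $S$ in some order) and $\sigma^2_{\beta, S_d} = \sum_{S_d^{\mathrm{sam}}} (w_i^2 x_i^2 / c_i)\, \sigma^2_{S_d} / B_{S_d}^2$.

This way we can compute all the covariances between base strata and the mean square error of the whole sum.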
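The closed form for the covariance can be checked numerically against the intermediate expression. The sketch below assumes nested superstrata and identical weights for the shared units; the helper names `b_sum` and `cov_between_strata` and all numbers are made up for illustration.

```python
def b_sum(x, c, w):
    """Denominator sum B = sum of w_i * x_i^2 / c_i over a sampled set."""
    return sum(wi * xi * xi / ci for xi, ci, wi in zip(x, c, w))


def cov_between_strata(x_imp_m1, x_imp_m2, b_sd, b_s, var_beta_sd):
    """Closed form: X_m1 * X_m2 * (B_Sd / B_S) * var(beta_hat estimated on S_d)."""
    return x_imp_m1 * x_imp_m2 * b_sd / b_s * var_beta_sd


# Sampled units of the smaller superstratum S_d and of the larger one S (S_d is a subset of S).
x_sd, c_sd, w_sd = [1.0, 2.0], [1.0, 2.0], [1.0, 1.0]
x_s, c_s, w_s = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]
sigma2_sd = 2.0              # assumed value of sigma^2 in S_d
x_imp_m1, x_imp_m2 = 4.0, 5.0  # totals of x over non-sampled units in m_1, m_2

b_sd = b_sum(x_sd, c_sd, w_sd)
b_s = b_sum(x_s, c_s, w_s)
sum_w2x2_over_c = sum(wi * wi * xi * xi / ci for xi, ci, wi in zip(x_sd, c_sd, w_sd))
var_beta_sd = sum_w2x2_over_c / b_sd ** 2 * sigma2_sd

cov_closed = cov_between_strata(x_imp_m1, x_imp_m2, b_sd, b_s, var_beta_sd)
# Intermediate expression before simplification, with {S_1, S_2} = {S_d, S}.
cov_direct = x_imp_m1 * x_imp_m2 / (b_sd * b_s) * sum_w2x2_over_c * sigma2_sd
```

Both expressions should give the same value, which illustrates the simplification in the last step of the derivation.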

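17 Stratification level shifts - chained imputations

If we have a sophisticated stratification structure and chained imputations, we need to compute the chained covariance as well. The covariances are computed with the help of the conditional covariance decomposition:

$$\begin{aligned}
\operatorname{cov}(\hat Y_{m_1}, \hat Y_{m_2}) &= E\operatorname{cov}[\hat Y_{m_1}, \hat Y_{m_2} \mid X] + \operatorname{cov}(E[\hat Y_{m_1} \mid X],\ E[\hat Y_{m_2} \mid X]) \\
&= E\operatorname{cov}[\hat Y_{m_1}, \hat Y_{m_2} \mid X] + \beta_{S_1}\beta_{S_2} \operatorname{cov}(\hat X_{m_1}, \hat X_{m_2}).
\end{aligned}$$

Computing the mean of the first term with respect to $X$ would be rather difficult, so we substitute it with an estimate based on $\hat X$:

$$\widehat{\operatorname{cov}}(\hat Y_{m_1}, \hat Y_{m_2}) = \widehat{\operatorname{cov}}[\hat Y_{m_1}, \hat Y_{m_2} \mid \hat X] + \hat\beta_{S_1}\hat\beta_{S_2} \operatorname{cov}(\hat X_{m_1}, \hat X_{m_2}).$$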

18 Choosing the weights

- If no stratification shifts are involved and no outliers are present, we can use $w_i \equiv 1$.
- If we compute $\hat\beta$ from a superstratum $S$ consisting of basic strata $k = 1,\dots,K$, we can use $w_i \equiv N_k/n_k$ for units from stratum $k$. Data from the larger strata then influence the estimates more than data from the smaller strata.
- If we apply an outlier-detection method, we can use $w_i = 0$ for data which may not fit the model, so that they do not influence the estimates.
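The three weighting rules can be combined in one small helper. This is a sketch under assumptions: the function name `choose_weights`, the dictionary inputs and the outlier flags are illustrative, and the outlier detection itself is not shown.

```python
def choose_weights(stratum_of_unit, pop_size, sample_size, is_outlier=None):
    """Return one weight per sampled unit.

    stratum_of_unit : stratum label k for each sampled unit
    pop_size        : dict mapping stratum label k to N_k
    sample_size     : dict mapping stratum label k to n_k
    is_outlier      : optional list of flags; flagged units get weight 0
    """
    weights = []
    for i, k in enumerate(stratum_of_unit):
        if is_outlier is not None and is_outlier[i]:
            weights.append(0.0)  # excluded from the estimation of beta
        else:
            weights.append(pop_size[k] / sample_size[k])
    return weights


# Made-up example: two basic strata "a" and "b" pooled into one superstratum.
strata = ["a", "a", "b"]
pop_size = {"a": 10, "b": 30}
sample_size = {"a": 2, "b": 1}
flags = [False, True, False]
weights = choose_weights(strata, pop_size, sample_size, flags)
```

With a single stratum and no flagged units all weights are equal, which for the estimation of $\hat\beta$ is equivalent to $w_i \equiv 1$.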

19 Conclusions

The superpopulation model allows us to:
- Estimate the target variable for each unit separately.
- Report the estimated population totals with respect to any groupings of choice, regardless of the sampling plan.
- Easily compute the mean square error of the estimated sums.
- Develop methods of error computation in complex stratification and chaining structures.

Drawbacks:
- The approach is model-based, so the results may be inaccurate if the assumptions are not met, especially the linear dependence of $y_i$ on $x_i$ and the choice of $c_i$.
- Auxiliary variables $x_i$ and $c_i$ must be available for all units.
