A comparison of weighted estimators for the population mean. Ye Yang Weighting in surveys group


1 A comparison of weighted estimators for the population mean Ye Yang Weighting in surveys group

2 Motivation
A survey sample in which auxiliary variables are known for the whole population, while the outcome variable is known only for the sampled and responding units.
Interest lies in estimating the population mean of the outcome variable.
Many estimators of the mean are available, but it is unclear which one to choose.
A simulation study assesses the performance of the estimators under different scenarios.

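3 Estimators
Horvitz-Thompson estimator: $\hat\mu_{HT} = \frac{1}{N} \sum_{i \in s} w_i y_i$, with $w_i = 1/\pi_i$
Robust Horvitz-Thompson estimator: $\hat\mu_{RHT} = \hat\mu_{HT} - (B^{HT}_{\min} + B^{HT}_{\max}) / (2N)$, where $B^{HT}_i = (w_i - 1)\, y_i$ [...] and $B^{HT}_{\min}$, $B^{HT}_{\max}$ are the smallest and largest $B^{HT}_i$ in the sample
Horvitz-Thompson estimator with post-stratification: $\hat\mu_{PS} = \frac{1}{N} \sum_{i \in s} w_i y_i \, (N_j / \hat N_j)$, where $i$ belongs to post-stratum $j$ and $\hat N_j = \sum_{i \in s_j} w_i$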
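A minimal Python sketch of the estimators on this slide, for illustration only. The post-stratified form rescales the weights so that each post-stratum's weights sum to its known size, which is the standard construction and is assumed here. The robust correction uses only the leading term $(w_i - 1) y_i$ of $B^{HT}_i$ that is legible on the slide, which is an assumption, not the author's full definition. Function names and the toy data are hypothetical.

```python
def horvitz_thompson_mean(y, w, N):
    """HT mean: weighted sample total divided by the known population size N."""
    return sum(wi * yi for wi, yi in zip(w, y)) / N


def poststratified_ht_mean(y, w, strata, N_by_stratum):
    """HT mean after rescaling weights so each post-stratum sums to its known size N_j."""
    N = sum(N_by_stratum.values())
    w_sum = {}
    for wi, j in zip(w, strata):
        w_sum[j] = w_sum.get(j, 0.0) + wi  # N_hat_j: sum of weights in post-stratum j
    total = sum(wi * N_by_stratum[j] / w_sum[j] * yi
                for yi, wi, j in zip(y, w, strata))
    return total / N


def robust_mean(mu, b_values, N):
    """Robust correction from the slide: mu - (B_min + B_max) / (2N)."""
    return mu - (min(b_values) + max(b_values)) / (2 * N)


# Toy data (hypothetical): four sampled units in two post-strata of 30 units each.
y = [2.0, 4.0, 6.0, 8.0]
w = [10.0, 10.0, 20.0, 20.0]
strata = ["a", "a", "b", "b"]
N_by_stratum = {"a": 30, "b": 30}

mu_ht = horvitz_thompson_mean(y, w, N=60)
mu_ps = poststratified_ht_mean(y, w, strata, N_by_stratum)

# Assumption: B_i is approximated by its leading term (w_i - 1) * y_i only.
b_values = [(wi - 1.0) * yi for wi, yi in zip(w, y)]
mu_rht = robust_mean(mu_ht, b_values, N=60)
```

In this toy sample the design weights sum to 20 and 40 in the two post-strata, so post-stratification pulls both totals to 30 before averaging.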

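4 Estimators
Hajek estimator: $\hat\mu_{HA} = \sum_{i \in s} w_i y_i \,/\, \sum_{i \in s} w_i$
Robust Hajek estimator: $\hat\mu_{RHA} = \hat\mu_{HA} - (B^{HA}_{\min} + B^{HA}_{\max}) / (2N)$, where $B^{HA}_i = (w_i - 1)\, e_i$ [...] and $e_i = y_i - \hat\mu_{HA}$
Trimmed estimator: $\hat\mu$ computed from trimmed weights $w^*_i$, with $w^*_i = w_0$ if $w_i \ge w_0$ and $w^*_i = \gamma w_i$ if $w_i < w_0$, where $w_0 = 3$ [...], $\gamma =$ [...], and $k_i = 1\{w_i \ge w_0\}$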
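The Hajek estimator and the weight-trimming step can be sketched as follows. The cutoff $w_0$ and the constant $\gamma$ are not fully legible on the slide, so this sketch takes $w_0$ as an argument and chooses $\gamma$ so that the trimmed weights keep the original total weight. That choice of $\gamma$ is an assumption, as are the function names and the toy data.

```python
def hajek_mean(y, w):
    """Hajek mean: weighted sample total divided by the sum of the weights."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def trim_weights(w, w0):
    """Cap weights at w0 and scale the remaining weights by gamma.

    Assumption: gamma is chosen so that sum(w*) equals sum(w).
    """
    k = [1 if wi >= w0 else 0 for wi in w]  # k_i = 1{w_i >= w0}
    free_total = sum(wi for wi, ki in zip(w, k) if ki == 0)
    if free_total == 0:
        return [w0 for _ in w]  # every weight was capped; nothing left to rescale
    gamma = (sum(w) - w0 * sum(k)) / free_total
    return [w0 if ki else gamma * wi for wi, ki in zip(w, k)]


# Toy data (hypothetical): one unit has both a large weight and a large outcome.
y = [1.0, 2.0, 3.0, 4.0, 50.0]
w = [1.0, 1.0, 1.0, 1.0, 10.0]

mu_hajek = hajek_mean(y, w)
w_star = trim_weights(w, w0=3.0)  # w0 = 3.0 is an illustrative cutoff
mu_trimmed = hajek_mean(y, w_star)
```

Because the large weight sits on the largest outcome, trimming moves the estimate substantially, which is the situation in which trimming can introduce bias.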

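5 Estimators
Beaumont estimator: $\hat\mu$ computed from smoothed weights, obtained by regressing $w_i$ on a spline of $y_i$
Penalized spline of propensity prediction (PSPP): $y_i$ is modeled as a penalized spline in $\pi_i$ plus error, $y_i =$ [...] $+\ \varepsilon_i$, $\varepsilon_i \sim N(0, \pi_i^{2} \sigma^2)$, with $(x)_+ = x$ if $x > 0$ and $0$ otherwise; $\hat\mu =$ [...]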

6 Questions to answer
How does dividing the weighted sum by $N$, as in the Horvitz-Thompson estimator, compare with dividing by $\sum_{i \in s} w_i$, as in the Hajek estimator?
How do the robust versions of HT and Hajek improve upon the respective estimators?
How do weight trimming and the Beaumont method improve upon the Hajek estimate?
How do the weighted estimators compare with [...] and PSPP?
What is the impact of the variance structure on PSPP?

7 Simulations
10 scenarios covering stratified PPS sampling, SRS with nonresponse, and PPS sampling with nonresponse.
500 replications.
Performance of estimators compared using relative root mean squared error (RRMSE): $\mathrm{RRMSE}_{\text{estimator}} = 100 \times$ [...]
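The RRMSE comparison can be sketched as below. The normalizing term of the formula is missing from the slide, so this sketch assumes the RMSE of each estimator is expressed as a percentage of the RMSE of a reference estimator. That normalization, the function names, and the toy replicate values are all assumptions.

```python
import math


def rmse(estimates, truth):
    """Root mean squared error of replicate estimates around the true mean."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def rrmse(estimates, reference_estimates, truth):
    """Assumed form: 100 * RMSE(estimator) / RMSE(reference estimator)."""
    return 100.0 * rmse(estimates, truth) / rmse(reference_estimates, truth)


# Toy replicate estimates (hypothetical); the true population mean is 1.0.
candidate = [1.1, 0.9, 1.2, 0.8]
reference = [1.2, 0.8, 1.4, 0.6]

value = rrmse(candidate, reference, truth=1.0)
```

Under this convention a value below 100 means the candidate has smaller RMSE than the reference. In the toy data the candidate's errors are half the reference's, so the value is close to 50.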

8 Scenario 1
Population:
Three strata: $Z = 1, 2, 3$
Stratum population sizes $N_1 = 5{,}000$, $N_2 = 6{,}000$, $N_3 = 9{,}000$
$X$: $X_1 = \mathrm{floor}(100\, x_1)$, $\log(x_1) \sim N(0.5, 0.5)$
Sample:
$n = 100, 500$
Proportional allocation across strata
Stratified PPS sampling with $X$ as size variable
Weights:
Selection probability $\pi_{hi} = n_h X_{hi} / \sum_{j=1}^{N_h} X_{hj}$
Sample weight $w_{hi} = 1 / \pi_{hi}$
$Y \mid X \sim N([\dots] \log X,\ 1)$

10 Relative RMSE for Scenario 1

11 Scenario 2
Population:
Three strata: $Z = 1, 2, 3$
Stratum population sizes $N_1 = 5{,}000$, $N_2 = 6{,}000$, $N_3 = 9{,}000$
$X$: $X_1 = \mathrm{floor}(100\, x_1)$, $\log(x_1) \sim N(0.5, 0.5)$
Sample:
$n = 100, 500$
Proportional allocation across strata
Stratified PPS sampling with $X$ as size variable
Weights:
Selection probability $\pi_{hi} = n_h X_{hi} / \sum_{j=1}^{N_h} X_{hj}$
Sample weight $w_{hi} = 1 / \pi_{hi}$
$Y \mid Z, X \sim N([\dots] \log X + 0.5 \cdot 1\{Z=2\}\ [\dots]\ 1\{Z=3\}\ [\dots]\ 0.2 \cdot 1\{Z=2\} \log X + 0.3 \cdot 1\{Z=3\} \log X,\ 1)$

13 Relative RMSE for Scenario 2

14 Scenario 3
Population:
$N = 20{,}000$
$X \sim \mathrm{Gamma}(1.5, 0.001)$
Sample:
$n = 100, 500$
PPS sampling with $X$ as size variable
Weights:
Selection probability $\pi_i = n x_i / \sum_{j=1}^{N} x_j$
Sample weight $w_i = 1 / \pi_i$
$Y \mid X \sim N(0, 1)$
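One replicate of Scenario 3 could be generated roughly as follows. Two points are assumptions rather than something the slide states: Gamma(1.5, 0.001) is read as shape 1.5 and rate 0.001 (so scale 1000), and the PPS draw uses systematic sampling, since the slide does not name the PPS scheme. The seed and function names are hypothetical.

```python
import random


def pps_inclusion_probs(x, n):
    """Selection probabilities pi_i = n * x_i / sum_j x_j."""
    total = sum(x)
    return [n * xi / total for xi in x]


def systematic_pps_sample(pi, rng):
    """Systematic PPS draw (assumed scheme): step through the cumulative
    probabilities in unit steps from a random start in [0, 1)."""
    point = rng.random()
    cum = 0.0
    chosen = []
    for i, p in enumerate(pi):
        cum += p
        while point < cum:  # a selection point falls inside unit i's interval
            chosen.append(i)
            point += 1.0
    return chosen


rng = random.Random(2024)  # arbitrary seed for reproducibility
N, n = 20_000, 100

# Assumption: shape 1.5 and rate 0.001; random.gammavariate takes shape and scale.
x = [rng.gammavariate(1.5, 1000.0) for _ in range(N)]
y = [rng.gauss(0.0, 1.0) for _ in range(N)]  # Y | X ~ N(0, 1)

pi = pps_inclusion_probs(x, n)
sample = systematic_pps_sample(pi, rng)
w = [1.0 / pi[i] for i in sample]  # sample weights w_i = 1 / pi_i

mu_ht = sum(wi * y[i] for wi, i in zip(w, sample)) / N
mu_hajek = sum(wi * y[i] for wi, i in zip(w, sample)) / sum(w)
```

Here the outcome is unrelated to the size variable, so the weights add variability without removing bias, and the two estimates differ only through how the weighted total is normalized.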

16 Relative RMSE for Scenario 3

17 Scenario 4
Population:
$N = 20{,}000$
$X \sim \mathrm{Gamma}(1.5, 0.001)$
$e \sim N(0, 1)$
$Y \mid X, e = 10 X + 3 X e$
If $Y \le 0$ then $Y = 1$
Sample:
$n = 100, 500$
PPS sampling with $X$ as size variable
Weights:
Selection probability $\pi_i = n x_i / \sum_{j=1}^{N} x_j$
Sample weight $w_i = 1 / \pi_i$

19 Relative RMSE for Scenario 4

20 Scenario 5
Population:
$N = 20{,}000$
$X \sim \mathrm{Gamma}(1.5, 0.001)$
$e \sim N(0, 1)$
$Y \mid X, e = [\dots]\, X e$
Sample:
$n = 100, 500$
PPS sampling with $X$ as size variable
Weights:
Selection probability $\pi_i = n x_i / \sum_{j=1}^{N} x_j$
Sample weight $w_i = 1 / \pi_i$

22 Relative RMSE for Scenario 5

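23 Scenario 6
Population:
$N = 20{,}000$
$Z_1 \sim \mathrm{Bernoulli}(0.5)$
$Z_2 \mid Z_1 \sim \mathrm{Bernoulli}([\dots] Z_1)$
$U \mid Z_1, Z_2 \sim N([\dots] Z_1\ [\dots]\ Z_2,\ 0.01)$
$Y \mid U \sim N([\dots] U + 0.1 U^2,\ 1)$
$R \mid U \sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}([\dots] U))$
Sample:
$n = 100, 500$
SRS
$Y$ observed only when $R = 1$
Weights:
Estimate the response probability $\pi_i$ from a logistic regression on $U$.
Weight $w_i = N / (n \hat\pi_i)$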

24 Distribution of population

25 Relative RMSE for Scenario 6

26 Scenario 7
Population:
$N = 20{,}000$
$Z_1 \sim \mathrm{Bernoulli}(0.5)$
$Z_2 \mid Z_1 \sim \mathrm{Bernoulli}([\dots] Z_1)$
$U \mid Z_1, Z_2 \sim N([\dots] Z_1\ [\dots]\ Z_2,\ 0.01)$
$Y \mid U \sim N([\dots] U + 3 U^2,\ 1)$
$R \mid U \sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(0.5 U))$
Sample:
$n = 100, 500$
SRS
$Y$ observed only when $R = 1$
Weights:
Estimate the response probability $\pi_i$ from a logistic regression on $U$.
Weight $w_i = N / (n \hat\pi_i)$
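The nonresponse weighting step can be sketched as follows, using the response model $\mathrm{logit}^{-1}(0.5U)$ from this scenario. The distribution of $U$ is not fully recoverable from the slide, so a standard normal $U$ is used as a stand-in. That stand-in, the hand-written Newton-Raphson logistic fit, the seed, and the reading of the weight as $N$ divided by $n$ times the estimated response probability are all assumptions.

```python
import math
import random


def inv_logit(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(u, r, iters=25):
    """Newton-Raphson fit of logit P(R = 1 | U = u) = a + b * u."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for ui, ri in zip(u, r):
            p = inv_logit(a + b * ui)
            v = p * (1.0 - p)
            g0 += ri - p             # score for the intercept
            g1 += (ri - p) * ui      # score for the slope
            h00 += v                 # information matrix entries
            h01 += v * ui
            h11 += v * ui * ui
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


rng = random.Random(7)  # arbitrary seed for reproducibility
N, n = 20_000, 500

# Assumption: U ~ N(0, 1) stands in for the U values of the n sampled units.
u = [rng.gauss(0.0, 1.0) for _ in range(n)]
r = [1 if rng.random() < inv_logit(0.5 * ui) else 0 for ui in u]

a_hat, b_hat = fit_logistic(u, r)

# Respondent weights: SRS weight N / n divided by the estimated response probability.
w = [N / (n * inv_logit(a_hat + b_hat * ui)) for ui, ri in zip(u, r) if ri == 1]
```

Respondents with a low estimated response probability receive the largest weights, which is where extreme weights in the nonresponse scenarios come from.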

27 Distribution of population

28 Relative RMSE for Scenario 7

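29 Scenario 8
Population:
$N = 20{,}000$
$Z_1 \sim \mathrm{Bernoulli}(0.5)$
$Z_2 \mid Z_1 \sim \mathrm{Bernoulli}([\dots] Z_1)$
$U \mid Z_1, Z_2 \sim N([\dots] Z_1\ [\dots]\ Z_2,\ 0.01)$
$Y \mid U \sim N([\dots] U + 3 U^2,\ 1)$
$R \mid U \sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}([\dots] U))$
Sample:
$n = 100, 500$
SRS
$Y$ observed only when $R = 1$
Weights:
Estimate the response probability $\pi_i$ from a logistic regression on $U$.
Weight $w_i = N / (n \hat\pi_i)$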

31 Relative RMSE for Scenario 8

32 Scenario 9
Population:
$N = 20{,}000$
$X = \mathrm{floor}(100\, z)$, $\log(z) \sim N(0.5, 0.5)$
$Y \mid X \sim N(10 + \log X + 2 \log^2 X,\ \log^2 X)$
$R \mid X \sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}([\dots] \log X))$
Sample:
$n = 100, 500$
PPS sampling with $X$ as size variable
$Y$ observed only when $R = 1$
Weights:
Estimate the response probability $\pi_i$ from a logistic regression on $X$.
Weight $w_i = N / (n \hat\pi_i)$

34 Relative RMSE for Scenario 9

35 Scenario 10
Population:
$N = 20{,}000$
$X = \mathrm{floor}(100\, z)$, $\log(z) \sim N(0.5, 0.5)$
$Y \mid X \sim N(10 + \log X + 2 \log^2 X,\ \log^2 X)$
$R \mid X \sim \mathrm{Bernoulli}(\mathrm{logit}^{-1}(11.2 - 2 \log X))$
Sample:
$n = 100, 500$
PPS sampling with $X$ as size variable
$Y$ observed only when $R = 1$
Weights:
Estimate the response probability $\pi_i$ from a logistic regression on $X$.
Weight $w_i = N / (n \hat\pi_i)$

37 Relative RMSE for Scenario 10

38 Summary
Hajek is generally better than Horvitz-Thompson when $y_i$ and $\pi_i$ are not proportional.
Weight trimming fails when $y_i$ is strongly associated with $w_i$.
Little to no difference between Beaumont and Hajek.
PSPP generally does best when $y$ is a continuous function of $\pi$, but is less successful for discontinuous functions and in the presence of extreme outliers.
Minor differences between variance structures in PSPP, with small gains when the error variance is not constant.

39 Next steps Explore additional models with discontinuous mean functions and extreme outliers. Variance estimation and inference.
