Small Area Confidence Bounds on Small Cell Proportions in Survey Populations

Size: px

Start display at page:

Download "Small Area Confidence Bounds on Small Cell Proportions in Survey Populations"

Duane Baker
5 years ago
Views:

1 Small Area Confidence Bounds on Small Cell Proportions in Survey Populations Aaron Gilary, Jerry Maples, U.S. Census Bureau U.S. Census Bureau Eric V. Slud, U.S. Census Bureau Univ. Maryland College Park DC-AAPOR / WSS 1

2 Outline General Problem Cell-Based vs. Model-Based Methods Alternative Definitions of Effective Sample Size Data Example Erroneous Enumeration of HUs in Census Coverage Measurement study (CCM) Numerical Results Conclusion and Future Plans 2

3 General Problem Setting: given direct (ratio) estimates for small proportions ˆπ i at level of cell or domain i (e.g., county) Problem: specify upper confidence bound for ˆπ i either bound in transformed measurement scale h(π i ) from data within domain i or model-based bound connecting values π i across domains using predictors x i Approach: using effective sample-sizes n i, adapt binomial/srs estimator V ar(ˆπ i ) = ˆπ i (1 ˆπ i )/n i 3

4 Applications demographic tables in ACS, the American Community Survey; area-level Erroneous Enumeration rates in CCM, Census Coverage Measurement; and rates in other Census Bureau surveys. 4

5 A Good Cell-Based Method Liu and Kott 2009, Survey Methodology π i = true proportion n i = effective sample size, yi Binom(n i, π i) ˆπ i = y i n i = y i n i direct estimator Transformation: asin( ˆπ) (Variance-stabilizing) centered at asin( π), Var 1/(4n i ) Obtain UCB on arcsin scale, transform back to prob. scale by sin 2 (x) 5

6 Ideas of Model-Based Methods π i includes { predicted part ηi = x i β unmodeled random component (1) Fay-Herriot: π i = sin 2 (η i +u i ), u i N(0, σ 2 u) (2) Logistic: π i = eη i +v i 1+e η i +v i, v i N(0, σ 2 v ) (3) Beta-Binomial: π i = Beta( τ eη i 1+e η i, τ 1+e η i ) Parameters σ 2 u, σ 2 v, (1 + τ) 1 quantify imprecision of π i in terms of η i 6

7 Model-Based Methods, Continued Recall y i /n i = ˆπ i Fay-Herriot: arcsin( ˆπ i ) N(arcsin( π i ), Logistic & Beta-Binomial: 1 4n i ) y i Binom(n i, π i) 7

8 Estimation in Model-Based Methods Point Predictor for π i generally different from direct estimator BLUP: E(π i y i ) EBLUP: substitute MLE for β & (σ u, σ v or τ) Upper Confidence Bound for π i based on V ar(eblup ) 8

9 Defining Effective Sample Size n i If variance V i n = π i(1 π i ) i n of direct π i estimator in area i i is reliably estimated and sampling fraction f 1: DEFFi = V i π i (1 π i ), n i = n i DEFFi What if ˆV i is erratic or π i too small? Proposal 1: Area size via higher-level DEFF With ˆV /n = V (ˆπ), and DEFF at higher (e.g., State) level DEFF = ˆV π(1 π), n i = n i DEFF 9

10 Effective Sample Size, Continued Proposal 2: Eff. size modified by sampling weights w ik indiv.-level sampling weights within area i n i = n i DEFF k w 2 ik ( k w ik ) 2 / j,k w 2 jk ( j,k w jk ) 2 Effective sizes for survey CIs : Liu & Kott 2009 in Bayesian analysis: Chen et al. 2011, Malec

11 Data Example Census Coverage Measurement (CCM): evaluation of Census performance. finds Erroneous Enumeration (EE) rates for counties in sample. 170k Housing Units (HUs) and 1,728 counties. Census publishes detailed estimates for 128 counties with pop. 500k 11

12 Specific CCM Details National EE rate 2.7% for metro areas (Olson, 2012). For our 128 counties, P(ˆπ i = 0) > 0. EE Rates (i = county, k = HU): ˆπ = Σ j,k W jk Y jk Σ j,k W jk, ˆπ i = Σ k W ik Y ik Σ k W ik 12

13 Specific CCM Modeling Details Area- or cell-level models, all covariates from Census. BIC model-selection penalizes max loglik by k ln(n) (k = # regr. coef s, n = total sample size). Our selected model had 5 predictor variables: state EE rate, a synthetic estimator; area rate of single-unit households; area rate of large multi-unit households; area rate of urban households; area enumeration rate. 13

14 Variability Across Counties n = Proposal 2 Eff. S.S., n = Proposal 1 Eff. S.S. Mean 1st Q Med. 3rd Q St DEFF Cou n Cou n /n NB: n, n based on counties with 20 HUs in sample. 14

15 Results Inclusion of n led to wider, more conservative estimates for the UCBs. The Fay-Herriot and Logistic methods had slightly higher means and medians for UCBs for areas with ˆπ 0, but the Cell and Beta-Binomial methods had the highest maximum UCBs. Overall, all four methods created upper bounds within a similar range. 15

16 UCB of Production Counties where ˆπ i = Cell FH LOG BB 16

17 Conclusion & Future Plans Conclusions: For our project, we chose the Cell-Based method. With the results so close, a simpler method without a specified regression model was preferrable. We used the Proposal 2 n i for accurately capturing the variance because it did not assume SRS or equal weights within area. Results were more conservative which was a priority in this case. Future plans: extend approach to other applications test performance of DEFF under different assumptions? generalized R package for Census Bureau use? 17

18 References Chen, C., Lumley, T. and Wakefield, J. (2011), The Use of Sampling Weights in Bayesian Hierarchical Models for Small Area Estimation. U. Wash. Stat. Dept. Tech. Rep. #583. Fay, R. and Herriot, R. (1979, JASA). Liu, Y. and Kott, P. (2009, Jour. Official Stat.) Malec, D. (2005, Jour. Official Stat.) Small Area Estimation... Housing Units. Maples, J., Bell, W. and Huang, E. (2009 ASA Proc.) GVF s for area-level variances Olson, Doug (2012) Census Coverage Measurement Report on Modeling. DSSD #2010-G-13. Prentice, R. (1986, JASA) Beta-binomial regression Slud, Eric V. (2012). Small-Area Confidence Bounds... in Tables. Jour. Indian. Soc. Agric. Stat. 18

19 Thank You! 19

20 A Cell-Based Method Variance-Stabilizing transformation to arcsin sqrt scale of estimated area proportion y i /n i = yi /n i For area i : π i = true proportion, n i = eff. samp. size y i Binom(n i, π i) N (n i π i, n i π i (1 π i )) arcsin y i n i N (arcsin π i, 1 4n i ) -method Transformed scale 90% CI: arcsin y i /n i ± 1.645/ 4 n i Transform back to the probability scale by sin 2 (x) 20

21 Models which Borrow Strength across Areas Notation: ˆπ i = y i n = y i i n, â i = arcsin ˆπ i, η i = x i β i area-level observed predictors x i u i N (0, σ 2 u), v i N (0, σ 2 v ) random effects (1) Fay-Herriot : â i = η i + u i + ɛ i, ɛ i N (0, 1 4n ) i (2) Logistic Random-Intercept : y i Bin(n i, e η i +v i 1+e η i +v i ) (3) Beta-Binomial : y i Bin(n i, π i), π i Beta( τ eη i 1+e η i, τ 1+e η i ) Targets for small-area prediction: sin 2 (η i + u i ) in (1); e η i +v i 1+e η i +v i in (2); and π i in (3). 21

22 Fay-Herriot Model UCB Mod1 (FH): EBLUP ˆπ i = sin 2 (ˆθ i /n i ), based on θ i = η i + µ i, ˆθ i = ˆγ i y i + (1 ˆγ i ) x iˆβ, ˆγ i = UCB i = sin 2 ( 1 n i ˆσ 2 u ˆσ 2 u + (4n i ) 1 (ˆθ i + z ) α 2 ((1 ˆγ i) ˆσ u 2 + (1 ˆγ i ) 2 x i ˆVˆβ x i} 1/2 ) Fay and Herriot (1979), Slud (2012) NB: includes sample variability of ˆβ, not ˆσ 2 u Rao (2003) has more inclusive formulas for related mse 22

23 Logistic Random-Intercept Model-Based UCB Mod2 (LgstRI): EBLUP ˆπ i = g(y i + 1, n i + 1, ˆη i, ˆω 2 ) g(y i, n i, ˆη, ˆω2 ) where ˆη i = x i ˆβ, ˆω 2 = ˆσ 2 v + x i ˆVˆβ x i g(k, n, η, ω 2 ) = e (η+ωz)k (1 + e η+ωz φ(z)dz, φ( ) N (0, 1) ) n [ g(y UCB i = ˆπ i i + 2, n i + 2, ˆη, ˆω2 ) g(yi, n i, ˆη, ˆω2 ) ˆπ 2 i ] 1/2 NB: includes sample variability of ˆβ, not ˆσ 2 v. Jiang and Lahiri 2006, Slud

24 Beta-Binomial Model-Based UCB Mod3 (Beta-Bin): Estimation via Posterior, µ i = eη i 1 + e η i, π i y i Beta(τ µ i + y i, τ (1 µ i) + n i y i ) Empirical Bayes ˆπ i = y i + ˆτ ˆµ i n i + ˆτ Bootstrap approach to UCB 24

Eric V. Slud, Census Bureau & Univ. of Maryland Mathematics Department, University of Maryland, College Park MD 20742

COMPARISON OF AGGREGATE VERSUS UNIT-LEVEL MODELS FOR SMALL-AREA ESTIMATION Eric V. Slud, Census Bureau & Univ. of Maryland Mathematics Department, University of Maryland, College Park MD 20742 Key words: