Weight calibration and the survey bootstrap

Size: px

Start display at page:

Download "Weight calibration and the survey bootstrap"

Frank Henry
6 years ago
Views:

1 Weight and the survey Department of Statistics University of Missouri-Columbia March 7, 2011

2 Motivating questions 1 Why are the large scale samples always so complex? 2 Why do I need to use weights? 3 What is, and why is it useful for complex survey data? 4 If I have calibrated my data, shouldn t the standard errors go down? 5 How can I see that? What variance estimator should I use? 6 : interplay between and various variance methods. Scattered additional ideas: Survey

4 Most large scale data sets utilize features of complex survey sampling designs (Cochran 1977, Kish 1965, Korn & Graubard 1999): stratification: sampling independently within parts of population or from different frames clustering: sampling groups of observation units varying probabilities of selection: optimize performance of the sampling design multiple stages of selection: districts, schools, classes, students Some features are aimed at improving precision, while others, at minimizing costs.

sample design http://nces.ed.gov/nationsreportcard/tdw/images/sample_design/componentsofthe2007samples.jpg A, B, C,.

5 sample design A, B, C,..., Alpha, Beta, Gamma, Delta samples Stratification of public schools by jurisdiction, private schools by affiliation Systematic sampling of schools with probability proportional to size Sampling of students to attain (approximately) self-weighting sample Oversampling of (areas and schools with large % of) minorities Matrix sampling of items into booklets

7 Setting Finite population U with L strata and N h units in each Designs: sampling with replacement, n h units from stratum h Sample s of size n = h n h = total number of PSUs Estimates: t[y] for T [Y ] ˆθ = f (t1,..., t k ) for θ = f (T 1,..., T k ) ˆθ from estimating equations u(x, ˆθ) = 0 Goal: of design variance

8 Expansion weight: w i = 1 π i = Weighted 1 IPr [ unit i in sample ] Expansion estimator (Horvitz & Thompson 1952) of the total: t[y] = i s w i y i Weighted mean: Analytical use: ȳ w = i s w iy i i s w i = t[y] t[1] ˆβ WLS = (X WX) 1 X Wy, W = diag(w 1,..., w n ) Pseudo-maximum likelihood: wi l(θ, x i ) max θ

9 Sampling weights Pfeffermann (1993), Binder & Roberts (2009): 1 The weights can be used to test and protect against nonignorable sampling designs which could cause selection bias. Without the weights, the parameters may be biased if the selection probabilities depend on characteristics of the units. 2 The weights can be used to protect against misspecification of the model holding in the population. Estimation will be consistent for the census regression, even though a formal one.

10 Replicate weights Besides the main sampling weights, microdata often come with (dozens or hundreds of) replicate weights. Necessary for variance It suffices for the end user to run the model with all sets of weights and compute the variance estimator as the empirical variance of the results with different weights

11 Accuracy of estimates Stratification Weighting Population Sample y

12 Stratification Accuracy of estimates Weighting Population Sample y

13 Deville & Sarndal (1992), Särndal (2007), Kott (2009) Auxiliary information: variables x k Either x k s are known for every unit k in the population, or the population totals, or independent control totals T [x] = k U x k are known Human population samples: demography (from census data or CPS) Calibrate the design weights d i w i so that equations (benchmark constraints) hold: x k (1) i s w i x i = k U Typical procedures: raking, generalized raking (Deville, Sarndal & Sautory 1993)

14 Extensions Instrumental variables (Kott 2006) using information at different levels (Estevao & Säarndal 2006, Isaki, Tsay & Fuller 2004) Restricted weight ranges (Chen, Sitter & Wu 2002, Théberge 2000) Non-response adjustments (targeting bias rather than variance) (Crouse & Kott 2004, Kott 2006)

15 A popular functional form: Linear w i = d i (1 + λ x i ) (2) where λ is the vector of Lagrange multipliers Solution: λ = ( i s d i x i x i ) 1 ( k U x k i s w i x i ) The resulting expansion estimator of the total is identical to GREG estimator (Särndal, Swensson & Wretman 1992) Disadvantage: weights may become negative.

16 Owen (1988) proposed to construct confidence intervals for the population mean using profile empirical likelihood function: { n R(µ) = nw i w i 0, i=1 n w i = 1, i=1 Maximum empirical likelihood estimator: ˆµ = arg max R(µ) Asymptotic distributional results: n i=1 ˆµ is asy. normal; 2 log R(ˆµ) } w i y i = µ, (3) d χ 2 Extensions from i.i.d. to regression models(owen 1991), models based on estimating equations (Qin & Lawless 1994),...

17 Pseudo-EL for survey data Chen & Qin (1993), Chen & Sitter (1999), Wu (2004b), Rao & Wu (2009) discussed complex survey extensions: l(w) = d i ln w i (4) use: l(w) max w Solution: s.t. N i s w i = 0 = i s Nd i j s d j i s w i x i = k U d i j s d j λ (x i X) x k, w i 0, i s w i = 1 x i X 1 + λ (x i X) (5) (6)

18 Pseudo-EL for survey data Convex hull condition: the population value X is in the convex hull of the sample x i Non-negative weights Convex optimization problem, solution always exists, and algorithms find it reliably (Wu 2004b) Optimality properties: Wu (2003) Chen, Sitter & Wu (2002) discuss how to utilize EL methods to construct bounded weights (may have to relax the calibrating constraints) Combining information from multiple surveys (Wu 2004a)

20 Wolter (2007), Eltinge (1996) Reporting and analytic purposes: goals a survey analyst needs standard errors to include in the report an applied researcher needs standard errors to construct confidence intervals and test their substantive models Design purposes: a sample designer needs to know population variances to come up with efficient designs, strata allocations, small area estimators

21 Expansion estimator Horvitz-Thompson estimator t[y] = i s d i y i = i s y i /π i (with respect to sampling, i.e., design-variance): V { t[y] } = Estimator of variance: k,l U v { t[y] } = i,j s (π kl π k π l ) y k π k y l π l π ij π i π j π ij y i π i y j π j

22 variance estimator By taking Taylor series expansion near the population value θ, one can obtain linearization estimators for: Function of moments: [ f ] v L [ˆθ] MSE[ˆθ] v t k t k Example: ratio estimator of the total t r = t[y]t [x]/t[x], variance estimator v L [t r ] = T [x]2 t[x] 2 v[e i], e i = y i rx i, T [x] 0 Remind notation: slide 2 k

23 variance estimator By taking Taylor series expansion near the population value θ, one can obtain linearization estimators for: Estimator defined by estimating equations: v L [ˆθ] MSE[ˆθ] ( u) 1 1T v[u(x, ˆθ)]( u) Examples: regression (Fuller 1975) i s w i u(x i, y i, θ) i s Closely related to White (1980) heteroskedasticity-robust estimator w i x i (y i x i θ) = 0 examples: GLM (Binder 1983), pseudo-mle (Skinner 1989) Remind notation: slide 2

24 estimators Särndal, Swensson & Wretman (1992), Deville & Sarndal (1992): variables x Parameter of interest: total of y of the calibrated estimator: V[t cal ] = ij U(π ij π i π j )(d i E i )(d j E j ), where E i = y i x ib are residuals from the census regression (c.f. y i Ȳ for H-T estimator, y i rx i for ratio) estimator: v[t cal ] = ij s π ij π i π j π ij (w i e i )(w j e j ) where e i = y i x i ˆβ are residuals from the sample (weighted) LS regression

25 methods For a given procedure (X 1,..., X n ) ˆθ: 1 To create data for replicate r, reshuffle PSUs, omitting some and/or repeating others, according to a certain replication scheme 2 Using the original procedure and the replicate data, obtain parameter estimate ˆθ (r) 3 Repeat Steps 1 2 for r = 1,..., R 4 Estimate variance/mse as v m [ˆθ] = A R R (ˆθ (r) θ) 2 (7) r=1 where A is a scaling constant, θ = r ˆθ (r) /R for variance and θ = ˆθ for MSE Alternative implementation: use replicate weights w (r) hij instead of actually resampling the data.

26 The jackknife Kish & Frankel (1974), Krewski & Rao (1981) Replicates: omit only one PSU from the entire sample Replicate weights: if unit k from stratum g is omitted, 0, h = g, i = k w (gk) n hij = g n g 1 w hij, h = g, i k w hij, h g Number of replicates: R = n estimators: v J1 = h v J2 = h n h 1 n h n h 1 n h (ˆθ (hi) ˆθ h ) 2 i (ˆθ (hi) ˆθ) 2 i

27 Balanced repeated replication (BRR) Design restriction: n h = 2 PSUs/stratum Replicates (half-samples): omit one of the two PSUs from each stratum Replicate weights: w (r) hij = { 2w hij, PSU hi is retained 0, PSU hi is omitted (2nd order) balance conditions: each PSU is used R/2 times each pair of PSUs is used R/4 times Number of replicates: L R 2 L McCarthy (1969): L R = 4m L + 3 using Hadamard matrices Scaling factor in (7): A = 1

28 Rescaling (RBS) Rao & Wu (1988): for parameter θ = f ( x), 1 Sample with replacement m h out of n h units in stratum h. 2 Compute pseudo-values x (r) h = x h + m 1/2 h (n h 1) 1/2 ( x ( r) h x h ), x (r) = W h x (r) h, θ(r) = f ( x (r) ) (8) h 3 Repeat steps 1 2 for r = 1,..., R. 4 Compute v RBS [ˆθ] using (7) with A = 1.

29 Scaling of weights Rao, Wu & Yue (1992): weights can be scaled instead of values. m ( r) hi = # times the i-th unit from stratum h is used in the r-th replicate Replicate weight is ( w (r) hik [1 = mh ) 1/2 ( mh ) 1/2 nh ] + m ( r) n h 1 n h 1 m hi w hik h (9) Equivalent to RBS for functions of moments Applicable to ˆθ obtained from estimating equations

30 Choice of m h : Bootstrap scheme options m h n h 1 to ensure non-negative replicate weights m h = n h 1: no need for internal scaling m h = n h 3: matching third moments (Rao & Wu 1988) evidence (Kovar, Rao & Wu 1988): for n h = 5, the choice m h = n h 1 leads to more stable estimators with better coverage than m h = n h 3 Choice of R: No theoretical foundations Popular choices: R = 100, 200 or 500 R design degrees of freedom = n L

31 Yung (1997), Yeo, Mantel & Liu (1999) Mean Confidentiality protection: units with a weight of 0 belong to the same PSU, risk of identification. Replace the number of draws m ( r) hi m ( r) hi = 1 K rk k=(r 1)K +1 m ( k) hi Take K large enough so that IPr [ m ( r) hi = 0] is small. Proceed to compute the weights (9). Compute v MBOOT using (7) with scaling factor A = K. Number of resulting weight variables = R/K. Warning: no formal theory have been developed so far. by

32 Pros and cons of resampling estimators + Analysts and data users only need software that does weighted + No need to program specific estimators for each model + No need to release strata and sampling unit identifiers in public data sets (jackknife?) Computationally intensive, especially when n Post-stratification and non-response adjustments need to be performed on every set of weights Bulky data files with many weight variables

33 Goal of this study Give the calibrated estimators to people! for calibrated estimators requires knowledge of what the calibrating variable are, and what their population totals are Such information is unlikely to be available to analysis and public micro data users Resampling estimators must play a role

35 Population 1/10 of IPUMS 2000 Census 5% sample (Ruggles, Alexander, Genadek, Goeken, Schroeder & Sobek 2010) Alaska, Hawaii, DC excluded individuals 2052 PUMAs From 298 to 2080 individuals in each PUMA Demographic variables: age, gender, race, ethnicity, education, language spoken Economic variables: income, employment status PUMA = Public Use Microdata Area: 100,000+ residents; whole central city, MSA, PMSA, non-metro places

36 Sample 41 strata state (4 New England states conglomerated; 5 Mountains states conglomerated) PSU = PUMA In population, 11 N h 233 PSU/stratum In sample, 2 n h 6 PSU/stratum (more in larger strata) Resulting sample size: n = 120 PSUs 1st stage of selection: PPS (Rao-Hartley-Cochran method) 2nd stage of selection: persons per PSU (fewer in rural areas, more in urban areas) 2000 Monte Carlo replications Sample size: individuals

37 Statistics collected: English proficiency (6 categories) Marital status (6 categories) Statistics Logistic regression: high school graduate = (age + age 2 ) sex + black + white + hispanic Gini coefficient of income inequality (Lorenz curve) variables (if used): Age groups (0 to 10, 10 to 20,... ) Race (white, black, Asian) Gender Hispanic (0/1) Strata indicators Linear (2), EL (5)

38 methods + Deville & Sarndal (1992) estimator for calibrated data + [ {Jackknife, Bootstrap R = 100, Bootstrap R = 990, Mean R = } {No, of the main weight only, ] of the main and replicate weights}

39 Weights method None EL Linear Mean CV Median of min{w i } Median of max{w i } Median of max{w i} min{w i } Min w i /d i Max w i /d i : increases the spread of weights

40 bias Average relative bias of parameter estimates: method None EL Linear Language 0.03% -0.21% -0.22% Marital status -0.14% -0.16% -0.14% Logistic regression 7.61% 7.0% 7.0% Gini index -0.40% -0.38% -0.37% :??

41 efficiency Average ratio of MSEs of parameter estimates: method None EL Linear Language Marital status Logistic regression Gini index Smaller than 1: calibrated is better. : substantial efficiency gains in simple statistics minute loss of efficiency in complicated statistics

42 Relative bias of variance estimates for the descriptive statistics: None EL Linear main main+rep main main+rep -2.5% 55.4% 55.0% D & S % -10.5% Jackknife -1.8% 56.6% 35.2% 56.1% 35.1% BS (R = 100) -3.7% 53.5% -1.8% 53.2% -2.2% BS (R = 990) -2.7% 55.4% -1.1% 54.7% -1.1% MBS % 55.1% -10.4% 54.7% -10.4% : must be accounted for in variance! Without such account, the design-consistent methods report the design variance of the uncalibrated point estimator.

43 Relative bias of variance estimates None EL Linear main main+rep main main+rep Logistic regression -11.2% -11.8% -11.1% Jackknife 3.9% 3.9% 40.6% 4.6% 41.5% BS (R = 100) 13.0% 14.6% 15.5% 15.2% 16.3% BS (R = 990) 17.0% 16.7% 18.2% 17.4% 80.4% MBS % -11.0% -16.4% -10.3% -15.8% Gini index 42.4% 38.6% 37.8% Jackknife -14.1% -16.4% 14.6% -16.9% 13.3% BS (R = 100) -19.0% -20.8% -20.5% -22.0% -21.6% BS (R = 990) -18.4% -20.5% -19.7% -21.1% -20.6% MBS % -20.2% -25.3% -20.9% -26.1% : nothing works for Gini.

44 CI coverage Tail probabilities of the nominal 90% CI. None EL main main+rep Descriptive statistics Most methods % % Logistic regression 6.2% + 6.4% 6.3% +6.6% Jackknife 5.0% + 5.1% 5.0% +5.2% 2.9%+ 3.0% BS (R = 100) 4.5% + 4.5% 4.6% +4.6% 4.4%+ 4.4% BS (R = 990) 4.3% + 4.2% 4.3% +4.4% 4.3%+ 4.2% Mean BS % + 6.3% 6.2% +6.4% 6.8%+ 7.0% Gini index Most methods 3.5% % Monte Carlo margin of error: 3σ = 1.5% : notable asymmetry for fractions and Gini.

45 Limited simulation evidence suggests: s I improves precision of the descriptive statistics, not so much for the model parameter estimates The linearization and the mean are the most stable methods, followed by the jackknife and by other methods (not shown) using strata indicators does not work well with the jackknife. With on demographics only (not shown), the jackknife demonstrated performance comparable to linearization and the mean

46 s II Performance of the mean was virtually the same as that of linearization and jackknife, where applicable, or Deville & Sarndal (1992) estimator for calibrated weights: it had similar (downward) bias, stability, and coverage End users of complex survey data sets with calibrated main weights will not be able to construct adequate replicate weights There are analysis that neither variance method can handle adequately

47 What I covered was...

48 My Survey statistics Latent variable (structural equation) modeling Econometrics Multilevel modeling Statistical programming and computing

49 I Binder, D. A. (1983), On the variances of asymptotically normal estimators from complex surveys, International Statistical Review 51, Binder, D. A. & Roberts, G. R. (2009), Design- and model-based inference for model parameters, in D. Pfeffermann & C. R. Rao, eds, Sample Surveys: Inference and Analysis, Vol. 29B of Handbook of Statistics, Elsevier, Oxford, UK, chapter 24. Chen, J. & Qin, J. (1993), for finite populations and the effective usage of auxiliary information, Biometrika 80(1), Chen, J. & Sitter, R. R. (1999), A pseudo empirical likelihood approach to the effective use of auxiliary information in complex surveys, Statistica Sinica 9, Chen, J., Sitter, R. R. & Wu, C. (2002), Using empirical likelihood methods to obtain range restricted weights in regression estimators for surveys, Biometrika 89(1),

50 II Cochran, W. G. (1977), Sampling Techniques, 3rd edn, John Wiley and Sons, New York. Crouse, C. & Kott, P. S. (2004), Evaluating alternative schemes for an economic survey with large nonresponse, Survey Research Methodology Section, American Statistical Association, pp Deville, J. C. & Sarndal, C. E. (1992), estimators in survey sampling, Journal of the American Statistical Association 87(418), Deville, J. C., Sarndal, C. E. & Sautory, O. (1993), Generalized raking procedures in survey sampling, Journal of the American Statistical Association 88(423), Eltinge, J. (1996), Discussion of resampling methods in sample surveys by j. shao, Statistics 27, Estevao, V. M. & Säarndal, C.-E. (2006), Survey estimates by on complex auxiliary information, International Statistical Review 74(2),

51 III Fuller, W. A. (1975), Regression analysis for sample survey, Sankhya Series C 37, Horvitz, D. G. & Thompson, D. J. (1952), A generalization of sampling without replacement from a finite universe, Journal of the American Statistical Association 47(260), Isaki, C. T., Tsay, J. H. & Fuller, W. A. (2004), Weighting sample data subject to independent controls, Survey Methodology 30(1), Kish, L. (1965), Survey Sampling, John Wiley and Sons, New York. Kish, L. & Frankel, M. R. (1974), Inference from complex samples, Journal of the Royal Statistical Society, Series B 36, 1 37., S. (2010), Resampling inference with complex survey data, Stata Journal 10, Korn, E. L. & Graubard, B. I. (1999), Analysis of Health Surveys, John Wiley and Sons.

52 IV Kott, P. S. (2006), Using weightingtoadjust for nonresponse andcoverage errors, Survey Methodology 32(2), Kott, P. S. (2009), weighting: Combining probability samples and linear prediction models, in D. Pfeffermann & C. R. Rao, eds, Sample Surveys: Inference and Analysis, Vol. 29B of Handbook of Statistics, Elsevier, Oxford, UK, chapter 25. Kovar, J. G., Rao, J. N. K. & Wu, C. F. J. (1988), Bootstrap and other methods to measure errors in survey estimates, Canadian Journal of Statistics 16, Krewski, D. & Rao, J. N. K. (1981), Inference from stratified samples: Properties of the linearization, jackknife and balanced repeated replication methods, The Annals of Statistics 9(5), McCarthy, P. J. (1969), Pseudo-replication: Half samples, Review of the International Statistical Institute 37(3),

53 V Owen, A. B. (1988), ratio confidence intervals for a single functional, Biometrika 75(2), Owen, A. B. (1991), for linear models, The Annals of Statistics 19(4), Pfeffermann, D. (1993), The role of sampling weights when modeling survey data, International Statistical Review 61, Qin, J. & Lawless, J. (1994), and general estimating equations, The Annals of Statistics 22(1), Rao, J. N. K. & Wu, C. (2009), methods, in Sample Surveys: Theory, Methods and Inference, Vol. 29 of Handbook of Statistics, Elsevier. Rao, J. N. K. & Wu, C. F. J. (1988), Resampling inference with complex survey data, Journal of the American Statistical Association 83(401),

54 VI Rao, J. N. K., Wu, C. F. J. & Yue, K. (1992), Some recent work on resampling methods for complex surveys, Survey Methodology 18(2), Ruggles, S., Alexander, J. T., Genadek, K., Goeken, R., Schroeder, M. B. & Sobek, M. (2010), Integrated Public Use Microdata Series: Version 5.0 [Machine-readable database], University of Minnesota, Minneapolis. Särndal, C.-E. (2007), The approach in survey theory and practice, Survey Methodology 33(2), Särndal, C.-E., Swensson, B. & Wretman, J. (1992), Model Assisted Survey Sampling, Springer, New York. Skinner, C. J. (1989), Domain means, regression and multivariate analysis, in C. J. Skinner, D. Holt & T. M. Smith, eds, Analysis of Complex Surveys, Wiley, New York, chapter 3, pp Théberge, A. (2000), and restricted weights, Survey Methodology 26(1),

55 VII White, H. (1980), A heteroskedasticity-consistent covariance-matrix estimator and a direct test for heteroskedasticity, Econometrica 48(4), Wolter, K. M. (2007), to Estimation, 2nd edn, Springer, New York. Wu, C. (2003), Optimal estimators in survey sampling, Biometrika 90(4), Wu, C. (2004a), Combining information from multiple surveys through the empirical likelihood method, Can J Statistics 32(1), Wu, C. (2004b), Some algorithmic aspects of the empirical likelihood method in survey sampling, Statistica Sinica 14, Yeo, D., Mantel, H. & Liu, T.-P. (1999), Bootstrap variance for the National Population Health Survey, in Proceedings of Survey Research Methods Section, The American Statistical Association, pp

56 VIII Yung, W. (1997), for public use files under confidentiality constraints, in Proceedings of Statistics Canada Symposium, Statistics Canada, pp

Empirical Likelihood Methods for Sample Survey Data: An Overview

AUSTRIAN JOURNAL OF STATISTICS Volume 35 (2006), Number 2&3, 191 196 Empirical Likelihood Methods for Sample Survey Data: An Overview J. N. K. Rao Carleton University, Ottawa, Canada Abstract: The use