An Overview of the Pros and Cons of Linearization versus Replication in Establishment Surveys

Similar documents
Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling

A decision theoretic approach to Imputation in finite population sampling

On the bias of the multiple-imputation variance estimator in survey sampling

Introduction to Survey Data Analysis

VARIANCE ESTIMATION FOR NEAREST NEIGHBOR IMPUTATION FOR U.S. CENSUS LONG FORM DATA

Combining data from two independent surveys: model-assisted approach

Sampling from Finite Populations Jill M. Montaquila and Graham Kalton Westat 1600 Research Blvd., Rockville, MD 20850, U.S.A.

Regression Diagnostics for Survey Data

SAS/STAT 14.2 User s Guide. Introduction to Survey Sampling and Analysis Procedures

Fractional Imputation in Survey Sampling: A Comparative Review

The Effect of Multiple Weighting Steps on Variance Estimation

REPLICATION VARIANCE ESTIMATION FOR THE NATIONAL RESOURCES INVENTORY

Jong-Min Kim* and Jon E. Anderson. Statistics Discipline Division of Science and Mathematics University of Minnesota at Morris

Two-phase sampling approach to fractional hot deck imputation

Model Assisted Survey Sampling

Multidimensional Control Totals for Poststratified Weights

6. Fractional Imputation in Survey Sampling

Implications of Ignoring the Uncertainty in Control Totals for Generalized Regression Estimators. Calibration Estimators

Estimation of Parameters and Variance

Monte Carlo Study on the Successive Difference Replication Method for Non-Linear Statistics

Fractional hot deck imputation

Empirical Likelihood Methods for Sample Survey Data: An Overview

Nonresponse weighting adjustment using estimated response probability

Estimation of change in a rotation panel design

Variance Estimation for Calibration to Estimated Control Totals

Imputation for Missing Data under PPSWR Sampling

Deriving indicators from representative samples for the ESF

Parametric fractional imputation for missing data analysis

Successive Difference Replication Variance Estimation in Two-Phase Sampling

Weight calibration and the survey bootstrap

In Praise of the Listwise-Deletion Method (Perhaps with Reweighting)

BOOK REVIEW Sampling: Design and Analysis. Sharon L. Lohr. 2nd Edition, International Publication,

Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design

Chapter 5: Models used in conjunction with sampling. J. Kim, W. Fuller (ISU) Chapter 5: Models used in conjunction with sampling 1 / 70

Unequal Probability Designs

University of Michigan School of Public Health

SAS/STAT 13.1 User s Guide. Introduction to Survey Sampling and Analysis Procedures

Empirical and Constrained Empirical Bayes Variance Estimation Under A One Unit Per Stratum Sample Design

REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLES

Task. Variance Estimation in EU-SILC Survey. Gini coefficient. Estimates of Gini Index

SAS/STAT 13.2 User s Guide. Introduction to Survey Sampling and Analysis Procedures

Jong-Min Kim* and Jon E. Anderson. Statistics Discipline Division of Science and Mathematics University of Minnesota at Morris

Workpackage 5 Resampling Methods for Variance Estimation. Deliverable 5.1

Finite Population Sampling and Inference

Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

A Design-Sensitive Approach to Fitting Regression Models With Complex Survey Data

Comments on Design-Based Prediction Using Auxilliary Information under Random Permutation Models (by Wenjun Li (5/21/03) Ed Stanek

Weighting Missing Data Coding and Data Preparation Wrap-up Preview of Next Time. Data Management

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Temi di discussione. Accounting for sampling design in the SHIW. (Working papers) April by Ivan Faiella. Number

Ordered Designs and Bayesian Inference in Survey Sampling

A note on multiple imputation for general purpose estimation

Resampling Variance Estimation in Surveys with Missing Data

Miscellanea A note on multiple imputation under complex sampling

Data Integration for Big Data Analysis for finite population inference

New Developments in Nonresponse Adjustment Methods

Estimation from Purposive Samples with the Aid of Probability Supplements but without Data on the Study Variable

Taking into account sampling design in DAD. Population SAMPLING DESIGN AND DAD

Plausible Values for Latent Variables Using Mplus

ANALYSIS OF ORDINAL SURVEY RESPONSES WITH DON T KNOW

Complex sample design effects and inference for mental health survey data

New Developments in Econometrics Lecture 9: Stratified Sampling

Advanced Methods for Agricultural and Agroenvironmental. Emily Berg, Zhengyuan Zhu, Sarah Nusser, and Wayne Fuller

5.3 LINEARIZATION METHOD. Linearization Method for a Nonlinear Estimator

Statistical Education - The Teaching Concept of Pseudo-Populations

A MODEL-BASED EVALUATION OF SEVERAL WELL-KNOWN VARIANCE ESTIMATORS FOR THE COMBINED RATIO ESTIMATOR

A Note on Bayesian Inference After Multiple Imputation

Statistical Methods. Missing Data snijders/sm.htm. Tom A.B. Snijders. November, University of Oxford 1 / 23

INSTRUMENTAL-VARIABLE CALIBRATION ESTIMATION IN SURVEY SAMPLING

New estimation methodology for the Norwegian Labour Force Survey

Comparative study on the complex samples design features using SPSS Complex Samples, SAS Complex Samples and WesVarPc.

Statistical Practice

Graybill Conference Poster Session Introductions

BOOTSTRAPPING SAMPLE QUANTILES BASED ON COMPLEX SURVEY DATA UNDER HOT DECK IMPUTATION

F. Jay Breidt Colorado State University

Survey nonresponse and the distribution of income

J.N.K. Rao, Carleton University Department of Mathematics & Statistics, Carleton University, Ottawa, Canada

STA304H1F/1003HF Summer 2015: Lecture 11

Causal Inference with a Continuous Treatment and Outcome: Alternative Estimators for Parametric Dose-Response Functions

A comparison of stratified simple random sampling and sampling with probability proportional to size

Alexina Mason. Department of Epidemiology and Biostatistics Imperial College, London. 16 February 2010

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Methods Used for Estimating Statistics in EdSurvey Developed by Paul Bailey & Michael Cohen May 04, 2017

COMPLEX SAMPLING DESIGNS FOR THE CUSTOMER SATISFACTION INDEX ESTIMATION

C. J. Skinner Cross-classified sampling: some estimation theory

A Bayesian solution for a statistical auditing problem

Empirical Likelihood Methods

Calibrated Bayes: spanning the divide between frequentist and. Roderick J. Little

Preliminaries The bootstrap Bias reduction Hypothesis tests Regression Confidence intervals Time series Final remark. Bootstrap inference

Development of methodology for the estimate of variance of annual net changes for LFS-based indicators

Variance estimation on SILC based indicators

Combining multiple observational data sources to estimate causal eects

Preliminaries The bootstrap Bias reduction Hypothesis tests Regression Confidence intervals Time series Final remark. Bootstrap inference

MISSING or INCOMPLETE DATA

What is Survey Weighting? Chris Skinner University of Southampton

Some methods for handling missing values in outcome variables. Roderick J. Little

TWO-WAY CONTINGENCY TABLES UNDER CONDITIONAL HOT DECK IMPUTATION

Biostat 2065 Analysis of Incomplete Data

Software for GEE: PROC GENMOD and SUDAAN Babubhai V. Shah, Research Triangle Institute, Research Triangle Park, NC

A BOOTSTRAP VARIANCE ESTIMATOR FOR SYSTEMATIC PPS SAMPLING

Transcription:

An Overview of the Pros and Cons of Linearization versus Replication in Establishment Surveys Richard Valliant University of Michigan and Joint Program in Survey Methodology University of Maryland 1

Introduction Nonlinear estimators are rule not exception in survey estimation Means: ratios of estimated totals Totals: nonlinear due to nonresponse adjustments, poststratification, other calibration estimation 2

More Complex Examples Price indexes Long-term index = product of short-term indexes across time periods Each short-term index may be ratio of long-term indexes Regression parameter estimates from X-sectional survey Autoregressive parameter estimates from panel survey Time series models with trend, seasonal, irregular terms 3

Options for Variance Estimation Linearization - Standard linearization - Jackknife linearization Replication - Jackknife - BRR - Bootstrap 4

Examples of Establishment Survey Designs Stratified, single-stage (often equal probability within strata) - Current Employment Statistics (US) - Occupational Employment Statistics (US) - Business Payrolls Survey (Canada) - Survey of Manufacturing (Canada) Stratified, two-stage - Consumer Price Index (US); geographic PSUs - National Compensation Survey (US); geographic PSUs - Occupational Safety and Health Statistics (US); establishments are PSUs/injury cases sampled within 5

Goals of Variance Estimation Construct confidence intervals to make inference about pop parameters Estimate variance components for survey design Desiderata - Design consistent under a design close to what was actually used - Model consistent under model that motivates the point estimator - Easy application to derived estimates (differences or ratios in domain means, interquartile ranges) 6

Ratio estimator in srs Example ˆ ˆ, ˆ y T s R = NXβ β =. Motivated by model x s 2 ( ) = β ; var ( ) = σ E y x y x M k k M k k Design-consistent but not model-consistent estimator: 2 2 N n rk 1 s L =, k = k k v r y x n N n 1 Both design-consistent and model-consistent: v X = 2 v 2 L xs ˆ β 7

Frequentist approach Objections: piecemeal, every problem needs a new solution Bootstrap is more general, frequentist solution for some problems generate entire distribution of statistic Generate pseudo-population: Booth, Butler, & Hall, JASA (1994) Canty and Davison, The Statistician (1999) More specialized bootstraps: Rao & Wu JASA (1988) Langlet, Faucher & Lesage, Proc JSM (2003) 8

(One) Bayesian solution - Generate entire posterior of population parameter; use to estimate mean, intervals for parameter, etc - Polya posterior: Ghosh & Meeden, Bayesian Methods for Finite Pop Sampling (1997) - Unknown in practice but theoretically interesting; not available for clustered pops 9

Practical Issues/Work-arounds How to reflect weight adjustments in variance estimates - Unknown eligibility - Nonresponse - Use of auxiliary data (calibration) Linearization: some steps often ignored (like NR adjustment) Replication: Units combined into groups for jackknife, other methods - Loss of degrees of freedom; poor combinations can inject bias Item Imputation: special procedures needed 10

More Practical Issues/Work-arounds Design compromise: assume 1 st stage units selected with replacement - Without replacement theory possible but not always practical - Joint selection probabilities not tracked or uncomputable in many (most?) designs Adaptive procedures - Cell collapsing in PS, NR - Weight censoring 11

Some designs do not permit design-unbiased or consistent variance estimates - Systematic sampling from an ordered list - Standard practice is PISE (pretend it s something else) Replication estimators are often both model consistent (assuming independent 1 st stage units) and design consistent (assuming with-replacement sampling of 1 st stage units) 12

Basic Linearization Method Write linear approx to statistic; compute (design or model) variance of approx t Writing ˆ j ˆ θ θ = g tˆ tˆ ˆ θ ( ) 1, 2,, tp p j= 1 g tˆ j tˆ = t ( tˆ ) j tj = tˆ and reversing PSU (i) and h i s h variable (j) sums and noting t j is constant: var g θ θ tˆ hi, sh j= 1 ˆ jhi t = t t j jhi ( ˆ ) p var ˆ 13

The variance can be w.r.t. a model or design Assuming units in different strata are independent (model) or sampling is independent from stratum to stratum (design): var u i ( ˆ ) ( ) θ θ var u h i s i p g = 1 ˆ tˆ j= tˆ = t t j jhi h (linear substitute method) Variance is computed under whatever design or model is appropriate. Assumption of with-replacement selection of PSUs not necessary but often used for design-based calculation. 14

Issue of evaluating partial derivatives - when to substitute estimates for unknown quantities? Binder, Surv Meth (1996) - Take total differential of statistic - Evaluate derivatives at sample estimates where needed - Leads to variance estimators with better conditional (model) properties 15

Example More general ratio estimator: ˆ Evaluating partials at pop values gives standard linearization: t ˆ ˆ y t ˆ R t ty t = x w k s krk t x w k = survey base weight ty rk = yk x k; ( t ˆ ) x t x ˆ 1 t t= t = in partials x Binder recipe: ˆ t t x R t w r tˆ k s x Retains t ˆ x x k k t R = t tˆ x x tˆ t in variance estimate y 16

In case of srs without replacement Standard linearization: v L 2 k 2 N n r = 1 s n N n 1 Royal-Cumberland/Binder: v X = 2 v 2 L xs Design consistent (under srswor) and approximately model-unbiased (under model that motivates ratio estimator); special case of sandwich estimator 17

Problems in Panel Surveys Estimating change over time involves data from 2 or more time periods Linear substitute useful in multi-stage design if PSUs same in all periods (true in US CPI) Not clean solution in single-stage sample with rotation of PSUs (establishments) need to worry about non-overlap, births, deaths when computing variance of change 18

Price Indexes in Panel Surveys Hard to Use Linearization Jevons (geometric mean) price index of change from time 0 to time t is product of 1-period price changes. Each 1- period change is estimated by a geomean: Pˆ t 1 ( p, p ) = Pˆ ( p, p,s ) where J t 0 J u+ 1 u u+ 1 u= 0 ( ) ( u+ 1 u p ) + 1, p, + 1 P s = p p J u u u k k k s u+ 1 we a E k = proportion of expenditure due to item k at a reference period a. s u + 1 = set of sample items at u+1. k a k 19

With t time periods, this is function of 1-period geometric means t 1 ( p p ) = ˆ ( p p ) log Pˆ, log P,, s J t 0 J u+ 1 u u+ 1 u= 0 t 1 a u u = we k k log pk pk u= 0 k s u+ 1 ( + 1 ) In case of US CPI, could expand sum over k s u + 1 in terms of strata and PSUs, reverse sum over time and samples. Then get linear substitute. Add to sum as time moves on. Method would work (between decennial censuses) because PSU sample is fixed. With single-stage establishment sample, may not be feasible. 20

Jackknife Delete one 1 st -stage unit at a time; compute estimate from each replicate n h 1 v ˆ ˆ J = θ h i sh ( hi n ) θ h ( ) 2 ˆ θ = ˆ ˆ ˆ ; smooth g, with-replacement Works for g( t ) 1, t2,, t p sampling of 1 st -stage units Example: GREG estimator of a total tˆ = tˆ ˆ π + B t t ( ˆ ) G x x 21

Exact formula for jackknife for GREG is available (Valliant, Surv Meth 2004) for single-stage, unequal probability sampling An approximation is v ˆ ( t ) J G s wgr k k k 1 hk r ˆ k = y k x kb; gk = 1+ x k XWX tx t x = g-weight 2 1 ˆ ( ) ( ) h k = weighted regression leverage 22

Jackknife Linearization Method Yung & Rao, (Surv Meth 1996, JASA 2000) General idea: get linear approximation to ˆ θ ( hi ) substitute in v J n For GREG h ( v = r r ) 2 where r JL h hi h n 1 i sh h hi = w k s k gkr k ; rh = r i s hi n h hi h ˆ θ and In single-stage sampling n v w g r wgr ( ( ) ) 2 h JL = k k k hn 1 k sh h h (missing leverage adjustment, but good large sample design/model properties) 23

More Advanced Linearization Techniques Deville, Surv Meth (1999) - Formulation in terms of influence functions - Goal: estimate some parameter θ = T( F), a function of the distribution function of y ˆ T( Fˆ ) θ = can be linearized near F - Compute variance of linear approx; - Deville (1999) gives many examples: correlation coefficient, implicit parameters (logistic β), Gini coefficient, quantiles, principal components 24

Demnati & Rao, Surv Meth (2004) - Extension of Deville unique way to evaluate partials - Estimating equations - Two-phase sampling 25

Accounting for Imputations If imputations made for missing items, variance of resulting estimates (totals, ratio means, etc) usually increased. Treating imputed values as if real can lead to severe underestimates of variances. Special procedures needed to account for effect of imputing. Some choices: Multiple imputation MI (Rubin) Adjusted jackknife or BRR (Rao, Shao) Model-assisted MA (Särndal) 26

To get theory for these methods, 4 different probabilistic distributions can be considered: Superpopulation model Response mechanism Sample design Imputation mechanism Assumptions needed for response mechanism, e.g. random response within certain groups and, in the cases of MI and MA, a superpopulation model that describes the analysis variable. How methods are implemented and what assumptions are needed for each depends, in part, on the imputation method used (hot deck, regression, etc). 27

Multiple Imputation MI uses a specialized form of replication. M imputed values created for a missing item must be a random element to how the imputations are created. z ˆI ( k ) = estimate based on the k-th completed data set VˆI ( k ) = naïve variance estimator that treats imputed values as if they were observed. an exact formula. V ˆI( k MI point estimator of θ is ) could be linearization, replication, or zˆ M M 1 = zˆi ( k ), M k = 1 28

Variance estimator is ˆ 1 V = U + 1+ B M M M M, where 1 M ˆ k= 1 UM = M VI ( k ) and 1 M ( 1) ˆI( k) k= 1 ( ˆ ) 2 M B M = M z z. For hot deck imputation, the MI method assumes a uniform response probability model and a common mean model within each hot deck cell. Overestimation in cluster samples Kim, Brick, Fuller (JRSS-B 2006) 29

Adjusted jackknife Rao and Shao (1992) adjusted jackknife (AJ) variance estimator. In jackknife variance formula use ( ˆ ) 1 ˆ zˆ = g y,, y = full sample estimate including any I I Ip imputed values ( ) zˆ I( hi) = g yˆi 1( hi),, yˆip ( hi) = an adjusted estimated with adjustment dependent on imputation method 30

Example: 1-stage stratified sample - Hot deck method: cells formed and donor selected with probability proportional to sampling weight - Adjusted ŷ value, associated with deleting unit hi, is G yˆ I( hi) = wj( hi) yj + w j( hi) yj + e j( hi) g= 1 j ARg j AMg g = hot deck cell (which can cut across strata) A Rg, A Mg = sets of responding and missing units in g w j( hi) = adjusted weight for unit j when unit i in stratum h is omitted y j = hot deck imputed value for unit j e jhi ( ) = y Rghi ( ) y Rg, a residual specific to a replicate The AJ method assumes a uniform response probability model within each hot deck cell. 31

Software Options Stata, SUDAAN, SPSS, WesVar, R survey package Off-the-shelf software may not cover what you need Likely omissions: NR adjustment, adaptive collapsing, specialized estimates (price indexes), item imputations Write your own 32

Summary Linearization Replication Pros Pros - good large sample properties - good large sample properties - applies to complex forms of estimates - applies to complex forms of estimates - can be computationally faster than replication - sample adjustments easy to reflect in variance estimates - maximizes degrees of freedom - applies to analytic subpopulations - sandwich version is modelrobust - user does not need to know or understand sample design Cons Cons - separate formula for each type - computationally intensive of estimate - special purpose programming - may be unclear how best to form replicates - hard to account for some - increased file sizes sample adjustments, e.g., - sometimes applied in ways that nonresponse, adaptive methods lose degrees of freedom 33