A Semi-Parametric Approach to Account for Complex Designs in Multiple Imputation


Hanzhi Zhou, Trivellore E. Raghunathan and Michael R. Elliott
Program in Survey Methodology, University of Michigan
Department of Biostatistics, University of Michigan
zhouhanz@umich.edu, teraghu@umich.edu, mrelliot@isr.umich.edu

Multiple imputation (MI) has become one of the leading approaches for dealing with missing data in survey research. However, existing software packages and procedures typically do not incorporate complex sample design features in the imputation process. Research has demonstrated that implementing MI under a simple random sampling (SRS) assumption can cause severe bias in estimation and hence invalid inferences, especially when the design features are highly related to the survey variables of interest (Reiter et al. 2006). Recent work to accommodate complex sample designs in imputation has focused on model-based methods, which directly model the complex design features in the formulation of the imputation model. In this paper, we propose a semi-parametric procedure as an alternative approach to incorporate complex sample designs in MI. Specifically, we divide the imputation process into two stages: the complex features of the survey design (including weights and clusters) are fully accounted for at the first stage by applying a nonparametric method to generate a series of synthetic datasets; we then perform conventional parametric MI for missing data at the second stage using readily available imputation software designed for an SRS sample. A new combining rule for the point and variance estimates is derived to make valid inferences based on the two-stage procedure. Using health survey data from the Behavior Risk Factor Surveillance System, we evaluate the proposed method with a simulation study and compare it with the model-based method with respect to complete data analysis. Results show that the proposed method yields smaller bias and is more efficient than the model-based method.
Keywords: missing data, complex sample design, multiple imputation, Bayesian bootstrap, synthetic data

1. Introduction and Research Question

Multiple imputation (MI) is a principled method for dealing with missing data in survey research and has been adopted by federal statistical agencies in recent years. A very important point underlying MI theory is that the method was designed for complex sample surveys and hence requires the imputation to be made conditional on the sample design. The purpose is to make the missing data mechanism ignorable: since design features are usually related to the survey variables of interest in real survey data, severe bias in the estimates can be avoided when they are properly accounted for in the process (Reiter et al. 2006). However, survey practitioners usually assume simple random sampling (SRS) when they perform MI, largely because standard software packages are inadequate for handling complex sample designs. A typical example is the Sequential Regression Multivariate Imputation procedure (SRMI) using IVEware (Raghunathan et al. 2001), which has been gaining popularity for handling multivariate missing data in large-scale surveys. Applications such as MI for missing income data in NHIS (Schenker et al. 2006) focused on strategies for modeling complicated data structures and did not recognize the importance of incorporating complex sample design features, reflecting an inconsistency between theory and practice.

The question is then: how do we fully incorporate the sample design in MI to achieve valid statistical inferences? Reiter et al. (2006) proposed a fixed-effect modeling method to address the problem, in which design variables are included as predictors in the imputation model; this outperforms the SRS scenario in terms of correcting the bias. Their conclusion thus supports the general advice of including complex sample designs in the MI procedure. However, they did not look at the survey weight as another important design variable. Treating the weight as a scalar summary of the design information in the imputation model may not work well for inference beyond means or totals, since interactions between the probabilities of selection and the population parameters of interest will not be accounted for (Elliott 2007). Besides, their results for the real data application did not show gains from incorporating designs over ignoring them as significant as those in their simulation study. Little follow-up work has been done to replicate their results with real survey datasets or to investigate other potential methods, for example whether there is a way to incorporate design information within the MI framework other than including it as covariates in the imputation model. The goal of this paper is to propose a semi-parametric two-step method as an alternative to the existing fully model-based methods.
Specifically, we divide the imputation process into two steps: the complex features of the survey design (including weights and clusters) are fully accounted for at the first step by applying a nonparametric method to generate a series of synthetic datasets; we then perform conventional parametric MI for missing data at the second step using readily available imputation software designed for an SRS sample. A new combining rule for the point and variance estimates is derived to make valid inferences based on the two-step procedure.

The rest of this paper is structured as follows. Section 2 describes the proposed method in detail: we first lay out the conceptual idea and then present the theoretical results. Section 3 provides the results from a simulation study in a PPS sampling design setting to empirically evaluate the new method. Section 4 concludes with discussion and directions for future research.

2. Two-step MI Procedure

In this section, we propose a two-step procedure to perform multiple imputation in which accounting for complex designs and multiply imputing missing data are divided into two separate steps. The basic idea is to "uncomplex" the complex designs before implementing the standard MI procedure. By uncomplex, we mean a statistical procedure that makes the design features irrelevant at the analysis stage, i.e. turns a dataset into one that is self-weighting, so that we can treat the uncomplexed populations as if they were simple random samples from a superpopulation or the true population. Thus, simple estimation formulae for self-weighting samples directly apply. In our case, this is achieved by generating synthetic populations through a nonparametric resampling procedure adapted from the Bayesian Bootstrap (Rubin 1981), i.e. the Finite Population Bayesian Bootstrap (FPBB), which will be introduced in Section 2.2, so that the complication of modeling those designs in the imputation can be avoided. The standard practice of MI assuming SRS can then be applied directly to the generated populations, which are free of sampling designs.

Two major assumptions are made in implementing the proposed method:

1) The method is proposed to treat the item nonresponse problem under the MI framework, rather than unit nonresponse. Therefore all input weights are assumed to be final weights after unit nonresponse adjustment and calibration.

2) Missing at random (MAR), as a more practical missing data mechanism, is assumed for all types of analysis under study. We do not consider MCAR (Missing Completely At Random), which is too idealized for real-world surveys, nor NMAR (Not Missing At Random), which is a separate research topic.

2.1. Conceptual Idea

Figure 1 and Figure 2 together illustrate the conceptual idea of the procedure. Figure 1 shows the proposed procedure to account for complex designs in MI. Denote by $Q$ the population parameter we are interested in. Denote the actual sample survey data as $D = (Y_{mis}, Y_{obs})$, where $Y_{mis}$ represents the portion of item-missing data and $Y_{obs}$ represents the portion of observed data. The first step, the synthetic data generation approach, creates $B$ FPBB synthetic populations $\{D^{(1)}, D^{(2)}, \ldots, D^{(B)}\}$, where $D^{(b)} = (Y^{(b)}_{mis}, Y^{(b)}_{obs})$, $b = 1, 2, \ldots, B$. The second step, multiple imputation, creates $M$ imputed datasets $\{D^{(b)}_1, D^{(b)}_2, \ldots, D^{(b)}_M\}$ for each FPBB synthetic population generated at the first step, $b = 1, 2, \ldots, B$. Thus we end up with

$$\{D^{(1)}_1, D^{(1)}_2, \ldots, D^{(1)}_M,\; D^{(2)}_1, D^{(2)}_2, \ldots, D^{(2)}_M,\; \ldots,\; D^{(B)}_1, D^{(B)}_2, \ldots, D^{(B)}_M\},$$

which contains all the imputed synthetic population datasets generated by the two-step procedure.

Figure 2 (a)-(c) shows the evolution of the data structure in the proposed procedure. Suppose we have three survey variables of interest, Y1, Y2 and Y3, and a survey sample of size $n$ was drawn from a target population of size $N$ through some type of complex sampling design. The characters in pink squares denote the missing part of each survey variable in both the sample data and the synthetic population data.

At the first step, we can think of the unobserved elements of the population as missing by design, and we treat missing values as a separate category for each variable. By applying the adapted FPBB method, a synthetic population is created, ideally a plausible reflection of the target population. Note that in this process the missing data are also brought up to the population level. At the second step, conventional SRMI assuming SRS is applied to fill in the item-missing data, and we end up with a complete dataset that we call an imputed synthetic population. The whole process is replicated $B$ by $M$ times.
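The nesting described above (several synthetic populations, each imputed several times) can be sketched in a few lines of Python. This is only an illustration of the data flow, not the procedure itself: `synthesize` is a plain weighted resampling stand-in rather than the Finite Population Bayesian Bootstrap, `impute` is a random hot-deck stand-in rather than sequential regression imputation, and all function names are hypothetical.

```python
import numpy as np

def synthesize(sample, weights, N, rng):
    """Stand-in for step 1 (NOT the FPBB): weighted resampling up to size N.
    Missing values (NaN) are carried along to the population level."""
    p = weights / weights.sum()
    idx = rng.choice(len(sample), size=N, replace=True, p=p)
    return sample[idx]

def impute(population, rng):
    """Stand-in for step 2 (NOT SRMI): random hot deck within each column."""
    filled = population.copy()
    for j in range(filled.shape[1]):
        col = filled[:, j]                 # view into `filled`
        miss = np.isnan(col)
        col[miss] = rng.choice(col[~miss], size=miss.sum())
    return filled

def two_step(sample, weights, N, B, M, rng):
    """Return a list of B lists, each holding M imputed synthetic populations."""
    results = []
    for _ in range(B):
        pop_b = synthesize(sample, weights, N, rng)                  # step 1
        results.append([impute(pop_b, rng) for _ in range(M)])      # step 2
    return results
```

The result has the same B-by-M layout as the collection of imputed synthetic populations in the text, which is the shape the combining rule later operates on.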

Figure 1. Proposed Procedure to Account for Complex Designs in MI. [Flow chart: Original Sample with Missing Data; Nonparametric Approach for Uncomplexing the Design: Use FPBB to Generate Synthetic Populations $D^{(1)}, D^{(2)}, \ldots, D^{(B)}$; Parametric Approach for Imputing the Missing Data: Standard Multiple Imputation by SRMI, giving $D^{(1)}_1, \ldots$; $D^{(2)}_1, \ldots$; $D^{(B)}_1, \ldots$; Combining Rule for Valid Inference: $f(Q)$.]

Figure 2. Data Structure Evolution. [Three panels, each with columns Y1, Y2, Y3: (a) Actual Sample Dataset, rows 1 to n; (b) Synthetic Dataset, rows 1 to N; (c) Imputed Synthetic Dataset, rows 1 to N.]

2.2. Methods

Now we demonstrate how the adapted-FPBB method for unequal probability sampling (Cohen 1997) can be applied to uncomplex the weight as one design feature, and how this method can be adapted to uncomplex the cluster as another, so that more complicated designs such as stratified multistage cluster sampling can be handled. SRMI is illustrated as one option for performing conventional MI once the complex designs have been appropriately accounted for.

2.2.1. First Step: Nonparametric Approach to Uncomplex Designs

Given the limitations of fully model-based methods stated in Section 1, we propose using a nonparametric method, the adapted FPBB of Cohen (1997), to account for complex sample designs in MI. Specifically, the adapted FPBB serves as a procedure to restore the existing complex survey sample to an SRS-type, self-weighting data structure. This is realized by generating populations from the complex sample repeatedly, in a spirit similar to the synthetic population generation method in the context of MI for disclosure risk limitation (Raghunathan et al. 2003). A resampling technique is used in order to fully capture the uncertainty in the original sample. The nonparametric approach makes minimal assumptions about the distributional form of random effects and is robust to the model misspecification that usually poses problems for model-based methods. In addition, FPBB is a method developed from the conventional bootstrap whose Bayesian nature makes it fit the MI framework well.

2.2.1.1. Finite Population Bayesian Bootstrap

Pólya's Urn Scheme: Denote an urn containing a finite number of balls as $\{y_1, y_2, \ldots, y_n\}$. A ball is randomly drawn from the urn, and an identical ball from outside the urn is added to the urn along with the originally picked one. Repeat this selection process until $m$ balls have been selected as a sample; call this sample a Pólya sample of size $m$. Figure 3 is a flow chart of how a Pólya sample of size $m = N-n$ is drawn.

Figure 3. A Flow Chart of Drawing a Pólya Sample of size $m = N-n$. [The original urn holds $n$ balls; it holds $n+1$ after the 1st draw, $n+2$ after the 2nd draw, and $n+(N-n)$ in the final urn after the $(N-n)$th draw.]

Adapted-FPBB Method: Based on the Pólya's urn scheme described above, the adapted-FPBB populations can be generated by the following three-step procedure.

Step 1: Take a Pólya sample of size $N-n$, denoted by $y^*_1, y^*_2, \ldots, y^*_{N-n}$, from the urn $\{y_1, y_2, \ldots, y_n\}$. In this process, each $y_i$ in the urn is selected at the $k$th draw with probability

$$\frac{w_i - 1 + l_{i,k-1}\,\dfrac{N-n}{n}}{N - n + (k-1)\,\dfrac{N-n}{n}}, \qquad (1)$$

where $w_i$ is the case weight for the $i$th unit and $l_{i,k-1}$ is the number of bootstrap selections of unit $i$ up to the $(k-1)$th selection, setting $l_{i,0} = 0$.

Step 2: Form the FPBB population $y_1, y_2, \ldots, y_n, y^*_1, y^*_2, \ldots, y^*_{N-n}$, so that the FPBB population has exact size $N$.

Step 3: Repeat the previous steps a large number of times, say $B$ times, to obtain $B$ FPBB populations.

2.2.1.2. Relating FPBB to Multiple Imputation for Item Missing Data

If we think of the sampling process and the responding process as one combined process, in other words if we treat the responding process as another level of sampling of responding units given the originally sampled units, then it is easy to see the connection between FPBB and standard MI. FPBB tries to bootstrap the sample to the entire population, while MI tries to bootstrap the responding pool to the complete sample. A typical example is the approximate Bayesian bootstrap (ABB) suggested by Rubin & Schenker (1986) as a way of generating MI when the original sample can be regarded as IID and the response mechanism is ignorable. Now that Cohen's FPBB extends to unequal probability selection, we may think of the unsampled part of the population as missing and multiply impute this part using the adapted-FPBB procedure described above. Hence the essence of our approach is to carry out two levels of multiple imputation: 1) MI for unit nonresponse, by Pólya-ing a complex sample up to a population in which missing values are treated as a separate category for each variable with missing data, and 2) MI for item nonresponse, by SRMI for the entire population.

2.2.1.3. Uncomplex Weights

The adapted FPBB can be applied directly to weights in a sampling design such as probability proportional to size (PPS) sampling. Note that in practice the input weight in Formula (1) should be the final weight or poststratified weight after all types of adjustment, including unit nonresponse adjustment, calibration for undercoverage, etc.
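The three-step adapted Finite Population Bayesian Bootstrap procedure can be illustrated with a short Python sketch, assuming the selection probability in Formula (1) has numerator $w_i - 1 + l_{i,k-1}(N-n)/n$ and denominator $N - n + (k-1)(N-n)/n$. The function names are hypothetical, and the rescaling of the weights so that they sum to $N$ is an assumption added here so that the probabilities sum to one; it is not stated in the text.

```python
import numpy as np

def selection_probs(w, counts, k, N):
    """Formula (1): selection probability of each sampled unit at the k-th draw.
    `counts[i]` is l_{i,k-1}, the number of times unit i was selected so far."""
    n = len(w)
    ratio = (N - n) / n          # units represented by one extra selection
    return (w - 1.0 + counts * ratio) / ((N - n) + (k - 1) * ratio)

def fpbb_population(y, w, N, rng):
    """Draw one synthetic population of size N from sample y with weights w."""
    y = np.asarray(y)
    w = np.asarray(w, dtype=float)
    n = len(y)
    w = w * N / w.sum()          # assumption: weights rescaled to sum to N
    assert np.all(w >= 1.0), "each sampled unit must represent at least itself"
    counts = np.zeros(n)
    picks = np.empty(N - n, dtype=int)
    for k in range(1, N - n + 1):            # Step 1: Polya sample of size N-n
        p = selection_probs(w, counts, k, N)
        p = p / p.sum()                      # guard against rounding error
        i = rng.choice(n, p=p)
        picks[k - 1] = i
        counts[i] += 1
    return np.concatenate([y, y[picks]])     # Step 2: population of exact size N

def fpbb(y, w, N, B, rng):
    """Step 3: repeat to obtain B synthetic populations."""
    return [fpbb_population(y, w, N, rng) for _ in range(B)]
```

With equal weights $w_i = N/n$, `selection_probs` returns $1/n$ for every unit at the first draw, which matches the SRS interpretation of Formula (1).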
The reason for using the final weight is straightforward: if the base/design weight were used instead in applying the adapted FPBB to generate synthetic populations, the potential problems of undercoverage or unit nonresponse existing in the original complex sample would be brought up to the population level without any adjustment. Since the ultimate analyses are based on such imputed synthetic populations, a direct consequence would be biased inference even if the procedure by itself is efficient.

Intuitive interpretation by assuming an SRS sample design: Formula (1), which adjusts the bootstrap selection probability based on the case weight, can be interpreted as follows. Let $k$ index the bootstrap selections and $i = 1, 2, \ldots, n$ index the sampled units. Before making any bootstrap selection of unobserved units in the population from the observed original complex sample $y_1, y_2, \ldots, y_n$, i.e. when $k = 1$ and $l_{i,k-1} = l_{i,0} = 0$, the probability of selecting unit $i$ with sampling weight $w_i$ is $(w_i - 1)/(N-n)$. To make this simpler to understand, suppose we have a simple random sample of $n$ units in the first place. Then $w_i = N/n$ for all sampled units, each representing $N/n$ units in the population, and the probability that any one unit from the SRS sample is selected before any bootstrap selection is $(N/n - 1)/(N-n)$, which is the selection probability of any unit among the remaining $N-n$ units in the population; this exactly equals $1/n$. As we proceed with the bootstrap selection, we adjust this selection probability according to the number of times each unit among $y_1, y_2, \ldots, y_n$ has been selected during the FPBB procedure: each unit now represents $(N-n)/n$ units among the $N-n$ units to be selected during one bootstrap whenever it is selected once. After each selection, the denominator of the prior probability function needs to be inflated to reflect the total units represented by all the bootstrap selections so far, while the numerator also needs to be inflated to reflect the total units represented by unit $i$ in the process. We therefore obtain the probability in Formula (1).

2.2.1.4. Uncomplex Sampling Error Codes (i.e. strata and clusters)

Now we show how the adapted FPBB also works when clusters are involved in the sample design. Suppose we have a stratified two-stage cluster design, with the probability of selection for each primary sampling unit (PSU)/cluster being proportional to its population size (PPS) within strata. As usual, we treat each stratum independently and apply the method separately within each. We now have two layers of bootstrap selection, one at the cluster level and the other at the element level. Suppose there are $C_h$ clusters in the $h$th stratum in the population, among which $c_h$ were sampled; denote them by $z_{hj}$, where $j = 1, 2, \ldots, c_h$. Treating each cluster as the sampling unit, we can apply the same procedure as in the previous section, where only elements were involved. That is, we bootstrap-select $C_h - c_h$ clusters out of $z_{h1}, z_{h2}, \ldots, z_{hc_h}$ to form a population of clusters: $z_{h1}, z_{h2}, \ldots, z_{hc_h}, z^*_{h1}, z^*_{h2}, \ldots, z^*_{h(C_h - c_h)}$.

Accordingly, we need to change the corresponding terms in Formula (1) to make it a cluster-level selection probability. Let $w_{hj}$, $j = 1, 2, \ldots, c_h$, be the sampling weight for the $j$th cluster. Formula (1) can then be adapted as Formulae (2) and (2)', corresponding to the cluster-level selection and the element-level selection, respectively:

$$\frac{w_{hj} - 1 + l_{j,k-1}\,\dfrac{C_h - c_h}{c_h}}{C_h - c_h + (k-1)\,\dfrac{C_h - c_h}{c_h}}, \qquad (2)$$

and

$$\frac{w_{ci} - 1 + l_{i,k-1}\,\dfrac{N_{ch} - n_{ch}}{n_{ch}}}{N_{ch} - n_{ch} + (k-1)\,\dfrac{N_{ch} - n_{ch}}{n_{ch}}}. \qquad (2)'$$

Once the clusters have been selected by the adapted FPBB procedure using Formula (2), we can further select elements using the same method within each selected cluster using Formula (2)'. Notice that at the element level of selection, each selected cluster should now be treated as a population; therefore the sampling weights for elements in the original sample cannot be used anymore, and we need to derive a new set of weights for elements conditional on the cluster being selected. First, we obtain the conditional probability of selection for an element given its cluster (Formula (3)), and then invert it to get the corresponding weight $w_{ci}$ in Formula (2)':

$$\Pr(i\text{th element selected} \mid j\text{th cluster selected}) = \frac{\Pr(i\text{th element selected and } j\text{th cluster selected})}{\Pr(j\text{th cluster selected})} = \frac{\Pr(i\text{th element selected in the original sample})}{\Pr(j\text{th cluster selected})} = \frac{p_{hji}}{p_{hj}}, \qquad (3)$$

where $p_{hj}$ is the selection probability of cluster $j$ among all clusters in stratum $h$, and $p_{hji}$ is the selection probability of unit $i$ among all units in stratum $h$.

2.2.2. Second Step: MI for Missing Data Using SRMI

Now that we have uncomplexed the sampling design, we are in a good position to proceed with conventional multiple imputation. SRMI, a popular technique for complex survey data structures, is one option. Since the self-weighting FPBB population generated at the previous step removes the need to include design variables in the imputation model, our task can now concentrate on correctly modeling the covariates, as well as interactions among them whenever necessary.

2.3. Theoretical Results

Rubin's (1987) standard MI rule for combining point and variance estimates does not fit the two-step MI procedure. A new combining rule is developed accordingly, which accommodates two sources of variability: synthesizing populations by a nonparametric method at the first step, and multiply imputing missing data by SRMI at the second step. The validity of the new combining rule is to be justified both from a Bayesian perspective and from a repeated sampling perspective.

Combining rule for the point estimate:

$$\bar{q} = \frac{1}{B}\sum_{b=1}^{B}\frac{1}{M}\sum_{m=1}^{M} q^{(b)}_m = \frac{1}{BM}\sum_{b=1}^{B}\sum_{m=1}^{M} q^{(b)}_m. \qquad (4)$$

Combining rule for the variance: writing $\bar{q}_b = \frac{1}{M}\sum_{m=1}^{M} q^{(b)}_m$ for the mean over the imputations of the $b$th synthetic population and $\bar{q} = \frac{1}{B}\sum_{b=1}^{B}\bar{q}_b$ for the overall point estimate,

$$T = \left(1 + \frac{1}{B}\right)\frac{1}{B-1}\sum_{b=1}^{B}(\bar{q}_b - \bar{q})^2 + \frac{1}{B}\sum_{b=1}^{B} U_b$$

$$\phantom{T} = \left(1 + \frac{1}{B}\right)\frac{1}{B-1}\sum_{b=1}^{B}\left(\frac{1}{M}\sum_{m=1}^{M} q^{(b)}_m - \frac{1}{BM}\sum_{b=1}^{B}\sum_{m=1}^{M} q^{(b)}_m\right)^2 + \frac{1}{B}\sum_{b=1}^{B}\left(1 + \frac{1}{M}\right)\frac{1}{M-1}\sum_{m=1}^{M}\left(q^{(b)}_m - \frac{1}{M}\sum_{m=1}^{M} q^{(b)}_m\right)^2$$

$$\phantom{T} = \left(1 + \frac{1}{B}\right)V_B + \left(1 + \frac{1}{M}\right)\bar{V}_M = T_B + T_M, \qquad (5)$$

where $V_B = \frac{1}{B-1}\sum_{b=1}^{B}(\bar{q}_b - \bar{q})^2$ and $\bar{V}_M = \frac{1}{B}\sum_{b=1}^{B}\frac{1}{M-1}\sum_{m=1}^{M}(q^{(b)}_m - \bar{q}_b)^2$.

When $n$, $B$ and $M$ are large, the inference can be approximated by a normal distribution, so the 95% confidence interval can be computed as $[\bar{q} - z_{0.975}\sqrt{T},\; \bar{q} + z_{0.975}\sqrt{T}]$. (See Appendix A for detailed derivation and notation.)

Bayesian Proof of the New Combining Rules for Inference: With the imputed FPBB synthetic datasets generated by the proposed two-step procedure, we need to find a way to combine inferences from both steps. We now show how the combining rules for the normal approximation are derived, assuming large samples.

Let the complete data be $D = (D_{obs}, D_{mis})$, where $D_{obs} = (X, Y_{obs}, R_{inc}, I)$, $X$ is the covariate matrix, $Y_{obs}$ is the observed part of the survey variable with missing data, $R_{inc}$ is the response indicator for all sampled units, and $I$ is the sampling indicator.

Let the FPBB synthetic population be $D^{(b)} = (D^{(b)}_{obs}, D^{(b)}_{mis})$, $b = 1, 2, \ldots, B$.

Let the imputed synthetic population be $D^{(b)}_m = (D^{(b)}_{obs}, D^{(b)}_{mis,m})$, $m = 1, 2, \ldots, M$ and $b = 1, 2, \ldots, B$.

The posterior mean and variance of $Q$ follow immediately from the rules for finding unconditional moments from conditional moments (according to Result 3.2 of Rubin 1987). In our case, we condition on two layers of observed data because of the two-step MI procedure.

Posterior mean:

$$E(Q \mid D_{obs}) = E\{E[E(Q \mid D^{(b)}_m) \mid D^{(b)}_{obs}] \mid D_{obs}\} = E\{\bar{Q}_b \mid D_{obs}\} = \bar{Q}, \qquad (10)$$

where $\bar{Q}_b = \lim_{M\to\infty}\frac{1}{M}\sum_{m=1}^{M}\hat{Q}^{(b)}_m = E(\hat{Q}^{(b)} \mid D^{(b)}_{obs})$; note that $\hat{Q}^{(b)}$ is the complete-data statistic for the synthetic population and $\{\hat{Q}^{(b)}_1, \ldots, \hat{Q}^{(b)}_M\}$ are repeated values from the posterior distribution of $\hat{Q}^{(b)}$. Likewise, $\bar{Q} = \lim_{B\to\infty}\lim_{M\to\infty}\frac{1}{BM}\sum_{b=1}^{B}\sum_{m=1}^{M}\hat{Q}^{(b)}_m = E(\hat{Q} \mid D_{obs})$; note that $\hat{Q}$ is the complete-data statistic for the original population, with repeated values $\{\hat{Q}^{(1)}_1, \ldots, \hat{Q}^{(1)}_M, \ldots, \hat{Q}^{(B)}_1, \ldots, \hat{Q}^{(B)}_M\}$.

Posterior variance:

$$V(Q \mid D_{obs}) = V\{E[E(Q \mid D^{(b)}_m) \mid D^{(b)}_{obs}] \mid D_{obs}\} + E\{V[E(Q \mid D^{(b)}_m) \mid D^{(b)}_{obs}] \mid D_{obs}\} = V\{\bar{Q}_b \mid D_{obs}\} + E\{T_M \mid D_{obs}\} = T_B + T_M, \qquad (11)$$

where

$$T_B = \lim_{B\to\infty}\left(1 + \frac{1}{B}\right)\frac{1}{B-1}\sum_{b=1}^{B}(\bar{Q}_b - \bar{Q})^2, \qquad T_M = \lim_{M\to\infty}\left(1 + \frac{1}{M}\right)\frac{1}{M-1}\sum_{m=1}^{M}(\hat{Q}^{(b)}_m - \bar{Q}_b)^2.$$

3. Simulation Study

A simulation study was designed to investigate the properties of inference based on the proposed method. In particular, we are interested in how the two-step MI procedure performs in comparison with existing alternative methods: (1) complete case analysis, (2) ignoring designs in the imputation model, and (3) including designs as fixed effects in the imputation model.

3.1. Data

Data from a real survey were manipulated to serve as our population. Specifically, BRFSS 2009 in the state of Michigan used a disproportionate stratified sampling (DSS) design with no PSU (cluster level) involved and is therefore suitable for our purpose of looking at the weight as a single design variable in multiple imputation. There are four strata in Michigan, and for simplicity we chose only one stratum as the basis of our simulation study. Eight categorical variables were selected that we thought would be potentially correlated with the survey weight. Table 1 shows the recoded variables chosen for analysis. After some data cleaning, we ended up with a complete dataset of N=1323 cases, which serves as our population.

Table 1. Survey Variables under Analysis

Survey Variable                     Coding
Race                                1: Whites; 2: Non-Whites
Whether or Not Have Health Plan     1: Yes; 2: No
Income Level                        1: Low; 2: Medium; 3: High
Employment Status                   1: Employed; 2: Unemployed; 3: Other
Marital Status                      1: Married; 2: Unmarried
Education                           1: Lower Than High School; 2: High School and Higher
Whether or Not Have Diabetes        1: Yes; 2: No
Whether or Not Have Asthma          1: Yes; 2: No

3.2. Design

Our strategy for evaluating the performance of the proposed method in comparison with the alternative methods consists of the following steps.

Step 1: Make the design variable related to the survey variables. Since we want to examine the new method under the assumption that the design is relevant to missingness on the survey variables, we enforced this assumption by regressing the weights, as the dependent variable, on all other survey variables as predictors, and used the predicted values of the weight for the simulation. The purpose is to make sure that the weight as a design variable is at least moderately related to the survey variables.

Step 2: Draw 100 samples/replicates with probability proportionate to the inverse of the predicted weights, each of size 200 (we call these the before-deletion samples). In this way, we obtained probability proportional to size (PPS) sampling weights that are directly related to the predicted weights from the previous step, since we were implicitly treating the predicted weights as a kind of measure of size. The PPS sampling weight now becomes our target design variable to be examined with the new method.

Step 3: Impose missingness on the complete data under a MAR mechanism. We used a deletion function that relates a binary indicator for missingness to Race, applied to each unit's values of the other survey variables (except marital status and education), whose values we purposefully deleted. Thus we obtained 100 after-deletion samples. Table 2 shows the fractions of missingness by race for each survey variable of interest. The fractions of missingness by race range from 15% to 40%, and the missingness for income was made generally higher than for all other variables.

Table 2. Fractions of Missingness on Survey Variables of Interest by Race

Race          Employment Status   Health Plan   Diabetes   Asthma   Income
Whites        15%                 15%           15%        15%      20%
Non-Whites    25%                 25%           25%        25%      40%

Step 4: Generate B = 100 FPBB synthetic populations for each replicate sample, with the PPS sampling weights as input weights, using Cohen's (1997) adapted FPBB method. In the process, we made the population size five times the original PPS sample size, so each FPBB population is of size 1000.

Step 5: Create M = 5 multiply imputed datasets by the SRMI procedure for each FPBB population within each replicate sample. We thus obtained 5*100*100 = 50,000 imputed datasets, each of size 1000.

Step 6: Obtain multiply imputed synthetic population estimates for the mean of each survey variable of interest. For each replicate sample, we use formulae (4) and (5) as our new combining rules to combine the estimated means and variances from the 500 imputed synthetic datasets.

3.3. Results

Three critical statistics were examined for comparison across the four methods: absolute relative bias, root mean square error (RMSE), and the empirical coverage rate of the nominal 95% confidence interval. All are calculated from the 100 replicate samples. Note also that, except for the new method, all estimates under the other three methods, as well as those from the actual samples before deletion, are design-based.

Figure 4 displays the Q-Q plot matrix of 100 pairs of estimated proportions from the actual samples before deletion versus those from the corresponding imputed synthetic populations under the proposed method, for each survey variable by level. The plots show a nearly perfect 45-degree straight line for all the variable levels. This indicates that the distributions of the estimates from the imputed synthetic populations practically match those from the actual samples before deletion.

Table 3 gives the detailed results from the simulation study. For simplicity, we only display the results for three variables: employment status and health plan, which have higher correlations with the design variable (weight), and income, whose fraction of missingness is the highest of all. According to Table 3, the new method has much smaller bias than its competitors in nearly all cases. In terms of RMSE, although the gain is not as substantial as that in absolute relative bias, the new method generally performs the best. The nominal 95% CI coverage of the point estimate under the proposed method ranges from 86% to 96%; considering the categorical nature of all the survey variables, this is a reasonable result. Except for complete case analysis, which generally has the lowest CI coverage, there seems to be little difference between the new method and the other two model-based MI methods.
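Steps 2 and 3 of the design can be illustrated with a minimal sketch. It is not the authors' simulation code: the successive without-replacement draw is an assumed stand-in for the PPS scheme, the variable names (`employ`, `hlthplan`, and so on) are hypothetical labels, and deleting each variable independently is an assumption. Only the missingness rates are taken from Table 2.

```python
import random

# Missingness rates by race (1: Whites, 2: Non-Whites), copied from Table 2.
# The short variable names are hypothetical labels chosen for this sketch.
MISSING_RATE = {
    1: {"employ": 0.15, "hlthplan": 0.15, "diabetes": 0.15, "asthma": 0.15, "income": 0.20},
    2: {"employ": 0.25, "hlthplan": 0.25, "diabetes": 0.25, "asthma": 0.25, "income": 0.40},
}


def draw_pps_sample(population, predicted_weight, n, rng):
    """Draw n distinct units, each draw proportional to 1 / predicted weight.

    Successive sampling without replacement is an assumption of this sketch;
    the text does not say which PPS scheme was used.
    """
    remaining = list(range(len(population)))
    sample = []
    for _ in range(n):
        sizes = [1.0 / predicted_weight[i] for i in remaining]
        pos = rng.choices(range(len(remaining)), weights=sizes, k=1)[0]
        sample.append(dict(population[remaining.pop(pos)]))
    return sample


def impose_mar(sample, rng):
    """Delete values with a probability that depends only on the observed race.

    Each variable is deleted independently (an assumption); race, marital
    status and education are never deleted because they have no rate entry.
    """
    out = []
    for unit in sample:
        unit = dict(unit)
        for var, rate in MISSING_RATE[unit["race"]].items():
            if rng.random() < rate:
                unit[var] = None
        out.append(unit)
    return out


def missing_fraction(sample, var, race):
    """Share of units of the given race whose value of var was deleted."""
    group = [u for u in sample if u["race"] == race]
    return sum(u[var] is None for u in group) / len(group)
```

Because missingness depends only on race, which is always observed, the mechanism in this sketch is MAR by construction. With only 200 units per replicate, the realised fractions will scatter around the Table 2 rates.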

Figure 4. Q-Q Plot Matrix for Estimated Proportions: Actual Sample Before Deletion versus FPBB+SRMI
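The points behind one panel of such a Q-Q plot can be computed without any plotting library. The sketch below is only an illustration under simple assumptions: both inputs hold the same number of replicate estimates, and the function names are hypothetical.

```python
def qq_pairs(before_deletion, imputed_synthetic):
    """Pair the order statistics of two equally sized sets of estimates."""
    if len(before_deletion) != len(imputed_synthetic):
        raise ValueError("both sets must contain the same number of estimates")
    return list(zip(sorted(before_deletion), sorted(imputed_synthetic)))


def max_gap_from_diagonal(pairs):
    """Largest vertical distance of any Q-Q point from the 45-degree line."""
    return max(abs(x - y) for x, y in pairs)
```

A small value of `max_gap_from_diagonal` relative to the spread of the estimates corresponds to points lying close to the 45-degree line.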

Table 3. Absolute Relative Bias, RMSE and 95% CI Coverage Rates Compared across Four Methods

                FPBB+SRMI                    Include Weights in the       Do Not Include Weights       Complete Case Analysis
                                             Imputation Model             in the Imputation Model
Variable/level  Relbias   RMSE      CI cov.  Relbias   RMSE      CI cov.  Relbias   RMSE      CI cov.  Relbias   RMSE      CI cov.
EMPLOY_
  1             0.30%     3.99E-02  96%      1.03%     3.59E-02  95%      2.10%     3.93E-02  96%      3.23%     7.34E-02  95%
  2             0.17%     5.41E-02  87%      19.15%    5.76E-02  86%      8.73%     5.67E-02  86%      5.98%     5.97E-02  79%
  3             0.17%     7.69E-02  90%      3.66%     6.80E-02  90%      3.58%     7.69E-02  90%      1.02%     8.43E-02  92%
INCOME_
  1             0.84%     5.54E-02  88%      1.38%     5.27E-02  91%      1.22%     5.63E-02  89%      6.77%     8.94E-02  88%
  2             0.33%     6.16E-02  88%      0.25%     5.99E-02  85%      0.63%     6.20E-02  87%      3.17%     8.39E-02  93%
  3             1.91%     3.02E-02  96%      5.93%     3.33E-02  95%      6.68%     3.35E-02  94%      13.53%    5.48E-02  90%
HLTHPLAN_
  1             0.03%     6.68E-02  86%      4.59%     7.88E-02  87%      1.51%     6.96E-02  85%      0.80%     7.11E-02  80%
  2             0.19%     6.68E-02  86%      32.70%    7.88E-02  87%      10.80%    6.96E-02  85%      2.40%     1.12E-01  80%
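The three statistics reported in the table can be computed from replicate-level output roughly as follows. This is a sketch under stated assumptions, not the authors' code: `truth` stands for whatever benchmark value the bias is measured against, and the interval is a normal-approximation interval with `z = 1.96`, whereas the study may have used a different reference distribution.

```python
import math


def evaluate_replicates(estimates, variances, truth, z=1.96):
    """Return (absolute relative bias, RMSE, CI coverage rate).

    estimates, variances: one point estimate and one variance per replicate.
    truth: assumed benchmark value of the estimand (must be non-zero).
    """
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    abs_rel_bias = abs(mean_estimate - truth) / abs(truth)
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)
    covered = sum(
        1 for e, v in zip(estimates, variances)
        if abs(e - truth) <= z * math.sqrt(v)
    )
    return abs_rel_bias, rmse, covered / n
```

In the setting described here, `estimates` and `variances` would each have 100 entries, one per replicate sample.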

4. Discussion

Our primary goal was to propose a new method to account for complex sample design features in multiple imputation for item missing data and to evaluate the performance of the new method in a PPS sample setting. There are several advantages of the proposed two-step MI procedure. First, it relaxes the usually strong distributional assumptions about random effects in parametric models. Second, it potentially protects against model misspecification, for example, wrongly including or excluding interactions between design variables and other covariates in the imputation model. Meanwhile, it retains the attractive features of SRMI in handling complex data structures and various types of missing variables. Another advantage is that both steps of the procedure are of a Bayesian flavor, which implies a proper imputation method. This fits into the standard MI paradigm, which requires a Bayesian-derived imputation method to attain randomization validity. A further advantage is that, unlike the fully model-based methods, which include the designs in the imputation model and still require complex survey packages to analyze the imputed datasets, the new method fully accounts for the designs by "uncomplexing" them and restoring a population in a separate step. Therefore only simple, unweighted complete-data analysis techniques are needed for inference with the newly developed combining rules, which considerably reduces the burden on data users.

Our findings in the simulation study suggest that the new method can bring about substantial gains in bias relative to the existing model-based methods without losing efficiency. Moreover, even for categorical variables, the nominal 95% confidence interval coverage rates under the new method are quite reasonable.

A direct application of the proposed approach is MI for disclosure risk limitation. Using the MI framework to generate synthetic populations has become popular for protecting the confidentiality of survey respondents, yet most such methods are model-based or based on a nonparametric method alone (i.e., the Bayesian bootstrap not accounting for designs). The semi-parametric approach, as a combined use of the two, serves as a good alternative, especially for complex survey designs.

Our current work focuses on more comprehensive simulation studies to assess the general performance of the proposed method, including its robustness to different degrees of model misspecification. We also aim to extend the application of our method to more complex sample design settings where both unequal probabilities of selection and clustering are involved.
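To make concrete how little machinery the combining step needs, the sketch below combines unweighted point estimates from imputed datasets nested within synthetic populations. It follows one reading of equations (6) to (9) in Appendix A, in which the between-synthetic-population component and the averaged between-imputation component are added; this reading, and the function and argument names, are assumptions and should be checked against formulae (4) and (5).

```python
def combine_two_step(q):
    """Combine nested point estimates q[b][m].

    b indexes synthetic populations, m indexes imputed datasets within a
    population. Returns (point estimate, total variance) under the assumed
    reading: T = (1 + 1/B) * V_B + mean_b[(1 + 1/M) * V_M^(b)].
    """
    B = len(q)
    M = len(q[0])
    if B < 2 or M < 2 or any(len(row) != M for row in q):
        raise ValueError("need at least 2 populations and 2 imputations each")

    # Mean over imputations within each synthetic population.
    q_bar_b = [sum(row) / M for row in q]
    # Overall point estimate: mean over synthetic populations.
    q_bar = sum(q_bar_b) / B

    # Between-synthetic-population variance.
    v_between = sum((x - q_bar) ** 2 for x in q_bar_b) / (B - 1)
    # Between-imputation variance within each population; no within-imputation
    # term, since each synthetic population is treated as a full population.
    u_b = [
        (1 + 1 / M) * sum((x - mean_b) ** 2 for x in row) / (M - 1)
        for row, mean_b in zip(q, q_bar_b)
    ]
    total = (1 + 1 / B) * v_between + sum(u_b) / B
    return q_bar, total
```

For the simulation described here, `q` would be a 100 by 5 array of unweighted sample means for one replicate sample.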

Appendix A. Direct Derivation for the Two-step Combining Rules

The rules are derived by constructing an approximate posterior distribution of $Q$ given $D_{MB}$, in analogy with the standard theory of multiple imputation for missing data, where $D_B$ denotes the FPBB synthetic populations and $D_{MB}$ the multiply imputed synthetic datasets. The conceptual framework of the proposed procedure suggests the following decomposition:

$$f(Q \mid D_{MB}) = \int f(Q \mid D_{MB}, D_B, V_B, V_M)\, f(D_B, V_B \mid D_{MB}, V_M)\, f(V_M \mid D_{MB})\, dD_B\, dV_B\, dV_M$$

$$\phantom{f(Q \mid D_{MB})} = \int f(Q \mid D_B, V_B)\, f(D_B, V_B \mid D_{MB}, V_M)\, f(V_M \mid D_{MB})\, dD_B\, dV_B\, dV_M$$

Here $V_M^{(b)}$ is the variance of the posterior mean $\bar{q}_M^{(b)}$ for each synthetic population $b$, $V_B$ is the variance of the $q^{(b)}$, and $V_M$ is the average of the $V_M^{(b)}$.

Step 1: Combining rule from synthesizing data by the adapted FPBB: $f(Q \mid D_B, V_B)$

Let $Q = f(Y)$ be a scalar population quantity of interest. For example, it can be the population mean of survey variable $Y$, or a regression coefficient. Let $q$ and $u$ be the point and variance estimates for $Q$ based on the actual sample data, and let $\{q^{(1)}, q^{(2)}, \ldots, q^{(B)}\}$ denote the point estimates from all $B$ FPBB populations. Adapted from the combining rules developed by Raghunathan et al. (2003) for fully synthetic data, the posterior mean and variance can be estimated as:

$$\bar{q}_B = \frac{1}{B} \sum_{b=1}^{B} q^{(b)} \tag{6}$$

$$T_B = \text{between-synthetic variance} + \text{within-synthetic variance} = (1 + B^{-1})\, V_B + \bar{U}_B = (1 + B^{-1})\, \frac{1}{B-1} \sum_{b=1}^{B} \left(q^{(b)} - \bar{q}_B\right)^2 + \frac{1}{B} \sum_{b=1}^{B} U^{(b)} \tag{7}$$

where $U^{(b)} = T_M^{(b)}$ is the between-imputation variance for FPBB population $b$, defined in Step 2.

Step 2: Combining rules from multiply imputing missing data: $f(D_B, V_B \mid D_{MB}, V_M)\, f(V_M \mid D_{MB})$

Let $\{q_1^{(b)}, q_2^{(b)}, \ldots, q_M^{(b)}\}$ denote the point estimates from all $M$ multiply imputed datasets for FPBB population $b$. Adapted from Rubin's (1987) conventional combining rules for multiple imputation, the posterior mean and variance estimates can be written as:

$$\bar{q}_M^{(b)} = \frac{1}{M} \sum_{m=1}^{M} q_m^{(b)} \tag{8}$$

$$T_M^{(b)} = U^{(b)} = \text{between-imputation variance} + \text{within-imputation variance} = (1 + M^{-1})\, V_M^{(b)} + \bar{U}_M^{(b)} = (1 + M^{-1})\, V_M^{(b)} = (1 + M^{-1})\, \frac{1}{M-1} \sum_{m=1}^{M} \left(q_m^{(b)} - \bar{q}_M^{(b)}\right)^2 \tag{9}$$

The within-imputation variance disappears because at this step we are multiply imputing missing data for a population (each FPBB population is treated as a population), so no sampling variance is involved, i.e. $\bar{U}_M^{(b)} = 0$.

Two-step combining rules for inference: $f(Q \mid D_{MB})$

When we derived $f(Q \mid D_B)$ at the first step, $q^{(b)}$ was viewed as the sufficient summary of FPBB population $b$. After constructing the unbiased estimator $\bar{q}_M^{(b)}$ for each $q^{(b)}$ from the multiply imputed datasets nested within each FPBB population, we approximate $f(Q \mid D_B)$ with $f(Q \mid D_{MB})$ by substituting the $q^{(b)}$ with their estimators $\bar{q}_M^{(b)}$. This yields the posterior mean and variance for the two-step procedure as in formulae (4) and (5), where $V_B$ and $V_M$ represent the between-synthetic-population variance and the between-multiple-imputation variance, respectively. Notice that the within variance component at the first step is represented by the between variance from the second step.

References:

1. Cohen, Michael P. (1997). The Bayesian bootstrap and multiple imputation for unequal probability sample designs. ASA Proceedings of the Section on Survey Research Methods, 635-638.
2. Dong, Qi. (2011). Combining Information from Multiple Complex Surveys. University of Michigan. Unpublished Dissertation.
3. Efron, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics, 7, 1-26.
4. Elliott, M.R. (2007). Bayesian Weight Trimming for Generalized Linear Regression Models. Survey Methodology 33(1):23-34.
5. Gross, S. (1980). Median estimation in sample surveys. Presented at the 1980 Joint Statistical Meetings.
6. Kim, J.K. (2004). Finite Sample Properties of Multiple Imputation Estimator. Annals of Statistics 32(2):766-783.
7. Kim, Jae Kwang, Brick, J. Michael, Fuller, Wayne A. and Kalton, Graham (2006). On the bias of the multiple-imputation variance estimator in survey sampling. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 68, 509-521.
8. Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data (Second Edition). New York: J. Wiley & Sons.
9. Lo, Albert Y. (1988). A Bayesian Bootstrap for a Finite Population. The Annals of Statistics 16(4):1684-1695.
10. Meng, Xiao-Li. (1994). Multiple-Imputation Inferences with Uncongenial Sources of Input. Statistical Science 9(4):538-558.
11. Raghunathan, T.E., Lepkowski, J.M., Van Hoewyk, J. and Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 27(1):85-95.
12. Raghunathan, T.E., Reiter, J.P. and Rubin, D.B. (2003). Multiple Imputation for Statistical Disclosure Limitation. Journal of Official Statistics 19(1):1-16.
13. Rao, J.N.K. and Wu, C.F.J. (1988). Resampling Inference with Complex Survey Data. Journal of the American Statistical Association 83:231-241.
14. Reiter, J.P., Raghunathan, T.E. and Kinney, Satkartar K. (2006). The Importance of Modeling the Sampling Design in Multiple Imputation for Missing Data. Survey Methodology 32(2):143-149.
15. Reiter, J.P. (2004). Simultaneous use of multiple imputation for missing data and disclosure limitation. Survey Methodology 30(2):235-242.
16. Rubin, D.B. (1981). The Bayesian bootstrap. Annals of Statistics, 9, 130-134.
17. Rubin, D.B. and Schenker, N. (1986). Multiple Imputation for Interval Estimation from Simple Random Samples with Ignorable Nonresponse. Journal of the American Statistical Association, 81(394):366-374.
18. Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
19. Rubin, D.B. (1996). Multiple Imputation After 18+ Years. Journal of the American Statistical Association, 91(434):473-489.
20. Schafer, J.L. (1999). Multiple Imputation: A Primer. Statistical Methods in Medical Research 8:3-15.
21. Schafer, J.L., Ezzati-Rice, T.M., Johnson, W., Khare, M., Little, R.J.A. and Rubin, D.B. (1997). The NHANES III Multiple Imputation Project.
22. Schenker, Nathaniel, Raghunathan, T.E., Chiu, Pei-Lu, Makuc, D.M., Zhang, Guangyu and Cohen, A.J. (2006). Multiple Imputation of Missing Income Data in the National Health Interview Survey. Journal of the American Statistical Association 101(475):924-933.
23. Yu, Mandi (2008). Disclosure Risk Assessments and Control. Dissertation. The University of Michigan.