Two-phase sampling approach to fractional hot deck imputation

Similar documents
Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling

6. Fractional Imputation in Survey Sampling

VARIANCE ESTIMATION FOR NEAREST NEIGHBOR IMPUTATION FOR U.S. CENSUS LONG FORM DATA

Fractional Imputation in Survey Sampling: A Comparative Review

Parametric fractional imputation for missing data analysis

On the bias of the multiple-imputation variance estimator in survey sampling

Chapter 5: Models used in conjunction with sampling. J. Kim, W. Fuller (ISU) Chapter 5: Models used in conjunction with sampling 1 / 70

Combining data from two independent surveys: model-assisted approach

Nonresponse weighting adjustment using estimated response probability

A note on multiple imputation for general purpose estimation

Fractional hot deck imputation

Data Integration for Big Data Analysis for finite population inference

Recent Advances in the analysis of missing data with non-ignorable missingness

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Bootstrap inference for the finite population total under complex sampling designs

TWO-WAY CONTINGENCY TABLES UNDER CONDITIONAL HOT DECK IMPUTATION

REPLICATION VARIANCE ESTIMATION FOR THE NATIONAL RESOURCES INVENTORY

ANALYSIS OF ORDINAL SURVEY RESPONSES WITH DON T KNOW

Model Assisted Survey Sampling

An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data

Chapter 4: Imputation

Likelihood-based inference with missing data under missing-at-random

Sampling from Finite Populations Jill M. Montaquila and Graham Kalton Westat 1600 Research Blvd., Rockville, MD 20850, U.S.A.

A measurement error model approach to small area estimation

Introduction to Survey Data Analysis

Finite Population Sampling and Inference

arxiv: v2 [math.st] 20 Jun 2014

Shu Yang and Jae Kwang Kim. Harvard University and Iowa State University

An Overview of the Pros and Cons of Linearization versus Replication in Establishment Surveys

A decision theoretic approach to Imputation in finite population sampling

INSTRUMENTAL-VARIABLE CALIBRATION ESTIMATION IN SURVEY SAMPLING

Statistical Methods. Missing Data snijders/sm.htm. Tom A.B. Snijders. November, University of Oxford 1 / 23

REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLES

Monte Carlo Study on the Successive Difference Replication Method for Non-Linear Statistics

Analyzing Pilot Studies with Missing Observations

Calibration estimation using exponential tilting in sample surveys

analysis of incomplete data in statistical surveys

Chapter 8: Estimation 1

Combining Non-probability and Probability Survey Samples Through Mass Imputation

SAS/STAT 14.2 User s Guide. Introduction to Survey Sampling and Analysis Procedures

Statistical Methods for Handling Missing Data

EFFICIENT REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLING

Graybill Conference Poster Session Introductions

J.N.K. Rao, Carleton University Department of Mathematics & Statistics, Carleton University, Ottawa, Canada

Pooling multiple imputations when the sample happens to be the population.

Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design

Some methods for handling missing values in outcome variables. Roderick J. Little

ASYMPTOTIC NORMALITY UNDER TWO-PHASE SAMPLING DESIGNS

Miscellanea A note on multiple imputation under complex sampling

Asymptotic Normality under Two-Phase Sampling Designs

A Note on Bayesian Inference After Multiple Imputation

Basics of Modern Missing Data Analysis

ASA Section on Survey Research Methods

Estimation from Purposive Samples with the Aid of Probability Supplements but without Data on the Study Variable

STATISTICAL METHODS FOR MISSING DATA IN COMPLEX SAMPLE SURVEYS

Introduction to Survey Data Integration

Introduction An approximated EM algorithm Simulation studies Discussion

arxiv:math/ v1 [math.st] 23 Jun 2004

Propensity score adjusted method for missing data

University of Michigan School of Public Health

Chapter 4. Replication Variance Estimation. J. Kim, W. Fuller (ISU) Chapter 4 7/31/11 1 / 28

5. Fractional Hot deck Imputation

A weighted simulation-based estimator for incomplete longitudinal data models

PIRLS 2016 Achievement Scaling Methodology 1

Inference with Imputed Conditional Means

On the Use of Compromised Imputation for Missing data using Factor-Type Estimators

Imputation for Missing Data under PPSWR Sampling

MISSING or INCOMPLETE DATA

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

New Developments in Nonresponse Adjustment Methods

Weighting in survey analysis under informative sampling

Imputation Algorithm Using Copulas

Some methods for handling missing data in surveys

Modelling Dropouts by Conditional Distribution, a Copula-Based Approach

Multiple Imputation for Missing Data in Repeated Measurements Using MCMC and Copulas

Label Switching and Its Simple Solutions for Frequentist Mixture Models

BOOTSTRAPPING SAMPLE QUANTILES BASED ON COMPLEX SURVEY DATA UNDER HOT DECK IMPUTATION

Statistical Practice

Combining Non-probability and. Probability Survey Samples Through Mass Imputation

ANALYSIS OF PANEL DATA MODELS WITH GROUPED OBSERVATIONS. 1. Introduction

TWO-WAY CONTINGENCY TABLES WITH MARGINALLY AND CONDITIONALLY IMPUTED NONRESPONDENTS

Testing a Normal Covariance Matrix for Small Samples with Monotone Missing Data

The Use of Survey Weights in Regression Modelling

Eric Shou Stat 598B / CSE 598D METHODS FOR MICRODATA PROTECTION

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

Chapter 3: Element sampling design: Part 1

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Multiple Imputation for Missing Values Through Conditional Semiparametric Odds Ratio Models

PQL Estimation Biases in Generalized Linear Mixed Models

Regression: Lecture 2

Small area estimation with missing data using a multivariate linear random effects model

Don t be Fancy. Impute Your Dependent Variables!

Slides 12: Output Analysis for a Single Model

Deriving indicators from representative samples for the ESF

Toutenburg, Fieger: Using diagnostic measures to detect non-mcar processes in linear regression models with missing covariates

Exact balanced random imputation for sample survey data

Conservative variance estimation for sampling designs with zero pairwise inclusion probabilities

Nonrespondent subsample multiple imputation in two-phase random sampling for nonresponse

Statistical Inference of Covariate-Adjusted Randomized Experiments

EM Algorithm II. September 11, 2018

Transcription:

Two-phase sampling approach to fractional hot deck imputation Jongho Im 1, Jae-Kwang Kim 1 and Wayne A. Fuller 1 Abstract Hot deck imputation is popular for handling item nonresponse in survey sampling. In hot deck imputation, imputed values are taken from the respondents in the same imputation cell, where imputation cells are used to approximate the imputation model. We extend the fractional hot deck imputation of Kim and Fuller (2004) to the case where the imputation cells are not defined in advance. The proposed method of fractional hot deck imputation is performed in two steps and has a structure similar to that of two-phase systematic sampling. The proposed hot deck imputation method is applicable to multivariate missing data. A replication method is used for variance estimation. Results from two simulation studies are presented. Key Words: Cell mean model, Item nonresponse, EM algorithm, Multivariate missing, Replication variance estimation. 1. Introduction Nonresponse is frequently encountered in survey sampling. Unit nonresponse and item nonresponse are two major types of nonresponse (Kalton and Kasprzyk,1986). While weighting adjustment is commonly used to compensate for unit nonresponse, imputation is preferred to handle item nonresponse. Haziza (2009) provides a comprehensive overview of imputation methods. In hot deck imputation, the imputed values are real observations taken from the respondents in the same sample. Hot deck imputation is popular because it does not create artificial values and does not rely on strong model assumptions. In hot deck imputation, creating imputation cells to achieve homogeneity within imputation cells is critical. In Brick and Kalton (1996), all auxiliary variables are treated as categorical and imputation cells are formed as a combination of those categorized auxiliary variables. A nearest-neighbor imputation approach that uses a metric distance of auxiliary variables to find the set of donors has been used by Cotton (1991), Rancourt, Särndal and Lee (1994), and Chen and Shao (2000). Haziza and Beaumont (2007) used the score estimated by the regression of response on the auxiliary variables to create imputation cells. Rubin and Schenker (1986) proposed approximate Bayesian bootstrap (ABB) imputation as a hot deck approach to multiple imputation. Variance estimation after hot deck imputation is a challenging problem because it is well known that naive approach of treating imputed values as if observed underestimates the true variance. Rubin (1987) proposed multiple imputation as a general tool for inference with imputed data. In multiple imputation, M(> 1), imputed estimates are created for each missing item and then the imputation values are used for variance estimation. Fractional imputation proposed by Kalton and Kish (1984), and investigated by Kim and Fuller (2004), is a way of achieving efficient hot deck imputation. As in multiple imputation, M imputed values are generated for each missing value, but a single data set is created after fractional imputation. Fractional weights are assigned to the imputed values 1 Department of Statistics, Iowa State University, Ames, U.S.A. 1030

and replication methods are used for variance estimation. Kim and Fuller (2004) and Fuller and Kim (2005) describe some properties of fractional hot deck imputation and discuss variance estimation. Imputation cells are pre-determined and the determination of imputation cells is not discussed in the fractional hot deck imputation of Kim and Fuller (2004). In this paper, we extend fractional hot deck imputation in two ways. First, instead of assuming the imputation cells to be given, we allow multiple cells for each missing item. The multiple cells can be understood to be a nonparametric approximation of the true model by a finite mixture model. The implementation of fractional hot deck imputation under the finite mixture model is made through a two-phase systematic sampling mechanism. Second, the proposed method is applied to multivariate missing data with arbitrary missing patterns. In Section 2, the basic setup is introduced. The proposed two-phase fractional imputation and variance estimator are discussed for the univariate case in Section 3. In Section 4, the proposed method is extended to the case of multivariate missing data. Results from two limited simulation studies are presented in Section 5, with concluding remarks in Section 6. 2. Basic setup: univariate missing case Suppose that we have a finite population of size N, indexed by U = {1, 2,, N}, and let A be the index set for the units in the sample selected by a probability sampling mechanism. Let A be partitioned into G groups based on the auxiliary information x, where x takes values on {1,, G}. Thus, we can write A = A 1 A G. In addition to x, we collect y and z where y is the study variable and z is another categorical variable that takes values on {1,, H}. The cross classification of x and z forms imputation cells and we assume that y i (x i = g, z i = h) ii(µ gh, σgh 2 ), i U, (1) for some µ gh and σgh 2 > 0, where ii denotes independently and identically distributed. Here, x i is always observed but (y i, z i ) is subject to missingness. Define δ i = 1 if (y i, z i ) is observed and δ i = 0 otherwise. From unit responses, A can be re-partitioned into A R = {j A; δ j = 1}, A M = {j A; δ j = 0} with A = A M A R. Also, A g can be subpartitioned into A Rg = {j A g ; δ j = 1} and A Mg = {j A g ; δ j = 0}. Let n Rg and n Mg be respectively the size of A Rg and A Mg. We assume that the response mechanism is missing at random (MAR) in the sense that δ is conditionally independent of (y, z) given x. That is, f(y, z x, δ) = f(y, z x). (2) The MAR condition (2) implies that model (1) also holds for the responding units. That is, y i (x i = g, z i = h, δ i = 1) ii(µ gh, σgh 2 ). (3) We now consider a hot deck imputation estimator of Y N = N i=1 y i under nonresponse. By the condition (2), f(y x, δ = 1) = H P (z = h x, δ = 1)f(y x, z = h, δ = 1). (4) Expression (4) takes the form of a finite mixture model. Let π h g = P (z = h x = g, δ = 1) be the conditional probability of z = h given x = g. The vector (x, z) defines the 1031

imputation cell for hot deck imputation. Note that, from (4), E(y i x i = g, δ i = 1) = H π h g E(y i x i = g, z i = h, δ i = 1). Thus, if π h g is known, we use all respondents in the cell to estimate E(y i x i = g, z i = h) to get Ŷ F EF I = i A g { δ i y i + (1 δ i ) } H π h g ˆµ gh, (5) where ˆµ gh = j A w jδ j a jgh y j j A w jδ j a jgh, with a jgh = 1 if x j = g and z j = h, and a jgh = 0 otherwise. The estimator of (5) uses all observed values as donors in the imputation cell and is called the fully efficient fractional efficient (FEFI) estimator (Kim and Fuller, 2004). 3. Fractional hot deck imputation: univariate missing case We now propose a new fractional hot deck imputation (FHDI) procedure that does not require imputation cell information be given in advance. Given the finite mixture model in (4), the imputed values are taken from the imputation cells with propbability proportional to the conditional cell probabilities, π h g. In practice, the cell probabilities π h g are unknown and need to be estimated. The proposed fractional hot deck imputation is similar in spirit to two-phase sampling for stratification (Rao, 1973; Kim, Navarro, and Fuller, 2006). In phase one, the cells are determined and the cell probabilities π h g are estimated. In phase two, M donors are selected in each imputation cell. The π h g are estimated so that H ˆπ h g = 1 for each group g. Using the definition π h g = P r(z i = h x i = g, δ i = 1), an estimator of π h g is ˆπ h g = j A w jδ j a jgh j A w jδ j a jg, (6) where a jg = H a jgh. Thus, the FEFI estimator of (5) can be rewritten as { } H Ŷ F EF I = δ i y i + (1 δ i ) ˆπ h g ˆµ gh, = i A g i A g δ iy i + (1 δ i ) wij,f EF Iy j, (7) j A where w ij,f EF I = H ˆπ h g{w j δ j a jgh / l A δ lw l a lgh } is the fractional weights of the j-th donor for the i-th recipient. 1032

Given M imputed values selected for each recipient, the two-phase fractional imputation (FI) estimator of Y N is defined as { } H Ŷ F I = δ i y i + (1 δ i ) ˆπ h g ȳi = i A g i A g δ iy i + (1 δ i ) M j=1 w ijy (j) i, (8) where y (j) i is the j-th imputed value of y i, ȳi = M 1 M j=1 y (j) i is a mean of imputed values, and wij are the fractional weights for the FI estimator. Note that the FI estimator can be also expressed in terms of the FEFI estimator, Ŷ F I = ŶF EF I + (ŶF I ŶF EF I) = ŶF EF I + (ȳi ˆµ g ), (9) i A Mg where ˆµ g = j A Rg w j y j / j A Rg w j. While the fully efficient fractional imputation estimator ŶF EF I has no variance due to the selection of donors, the fractional estimator Ŷ F I has additional variance due to the donor selection procedure. Theorem 1 presents some asymptotic properties of the FI estimator. Theorem 1 Let the fractional hot deck imputation estimator ŶF I in (8) be constructed using the two-phase systematic pps sampling. (A1) A sequence of probability samples is drawn from a sequence of finite populations (Fuller, 2009) and Ŷn = i A y i is design-unbiased for Y N, where is the inverse of the selection probability. (A2) The cell mean model (1) and the MAR condition (2) hold for the sequence of populations and samples. (A3) Let U g, g = 1,..., G, be subsets of the finite population with size of N g, where N g is fixed, and ( ˆNg N g, ˆN Rg N Rg, ŶRg Y Rg ) = O p (n 1/2 N), where ( ˆN g, ˆN Rg, ŶRg) = i A g (1, δ i, δ i y i ) and (N g, N Rg, Y Rg ) = i U g (1, δ i, δ i y i ). (A5) The cell mean estimator, ˆµ g, satisfies ˆµ g µ g = O p (n 1/2 ), where ˆµ g = Y Rg /N Rg and µ g = H π h gµ gh. and where Then, Ŷ F I = ỸF I + o p (n 1/2 N), (10) Ỹ F I = ỸF EF I + E(ỸF I Y N ) = 0, (11) G i A Mg (ȳ i ˆµ g ), 1033

and Ỹ F EF I = {µ g + Rg 1 δ i (y i µ g )}, i A g with R g = N Rg /N g. Also, we have ) V (ỸF I) = V (ỸF EF I + E wi 2 V (ȳi ˆµ g A g ), (12) i A Mg and ) V (ỸF EF I G = V µ g + E i A g R 2 g wi 2 δ i (y i µ g ) 2. (13) i A g See Appendix A for the proof. By equation (12), the variance of ŶF I is the variance of ŶF EF I plus the variance of the mean of sample donors as an estimator of the cell mean. Our strategy is to obtain an approximation of FEFI using systematic PPS sampling. Note that, for recipient i A Mg, we have n Rg FEFI donors with the fractional weights wij,f EF I. Thus, M imputed values for y i can be systematically selected from the donors in A Rg with probability proportional to wij,f EF I. The detailed procedure is given in Appendix A. The efficiency of the procedures will depend on the efficiency of the sampling scheme used to select donors. We propose to estimate the variance by a variance estimator that approximate the variance estimator for the FEFI estimator. Recall that the donors are ordered on the y-variable. A method to construct jackknife replicates is: (V1) Delete unit k. If k A M, then the w ij are not changed for i A Mg. (V2) For i A Mg, k A Rg, and r the closest integer to k, then the wij for replicate k are, wij w ij,f EF I if j = r w (k) ij = wij + (w rj,f EF I ) wij,f EF I if j r j s w ij,f EF I otherwise. where w ij For each k A Rg, we identify the nearest FI donor to the deleted element among {1,, M}. For example, if the FEFI donor set has 20 donors {1, 2,, 20} ordered on [j] and we have FI with M = 5 donors, {2, 7, 12, 18, 20}, then the nearest FI donor for k = 3 is r = 2. Once w (k) ij are obtained for each k = 1,..., n, then a jackknife variance estimator is Ŷ (k) F I ˆV (ŶF I) = G = w (k) i i A g n k=1 c k (Ŷ (k) F I ŶF I) 2, (14) δ iy i + (1 δ i ) M j=1 w (k) ij y (j) i Note that the variance estimator (14) is biased because the changes in the fractional weights for the deleted unit are only partially reflected by deleting the closest unit. As demonstrated in the simulation study in Section 5, the relative bias in variance estimator decrease as imputation size M increases.. 1034

4. Extension to Multivariate missing data We now extend the proposed method in Section 3 to multivariate missing case, y = (y 1,, y p ). For simplicity in description, we present the proposed method for two variables. For each item, assume that we have discretized values of y p, denoted by z p, that can be predetermined or approximated using sample quantiles. Assume that z 1 takes values of {1,..., Q} and z 2 takes values of {1,..., S}. Let δ pi be the response indicator function for y pi. If y pi is missing, then z pi is also missing. Note that z = (z 1, z 2 ) can be viewed as (x, z) of univariate missing case, but z 1 and z 2 are now both subject missingness. That is, A cannot be partitioned into subgroups based on z 1 or z 2. However, we can decompose A such that A = A R A M, where A R = {j A; δ j = 1} and A M = {j A; δ j = 0} with δ j = p l=1 δ lj. A proposed strategy is that missing items for a unit in A M are imputed using values of a donor in A R. For example, if y 1i and y 2i are both missing, then imputed values of y (j) 1i and y (j) 2i should be selected from the same donor j. Thus, A R is assumed to be non-empty. Let (y i,obs, y i,mis ) be respectively the observation parts and the missing parts of y i. Similarly, we have (z i,obs, z i,mis ) for z. There are four patterns of δ i as (δ 1i = 1, δ 2i = 1), (δ 1i = 1, δ 2i = 0), (δ 1i = 0, δ 2i = 1), and (δ 1i = 0, δ 2i = 0). Thus, we have (y i,obs, y i,mis ) = (y 2i, y 1i =?) and (z i,obs, z i,mis ) = (z 2i, z 1i =?) for (δ 1i = 0, δ 2i = 1) pattern, where? denotes missing value. Similarly, we can identify y = (y i,obs, y i,mis ) and z = (z i,obs, z i,mis ) for other patterns. We assume that the cell mean model (3) holds for cells determined by z, y pi (z 1 = q, z 2 = s) ii(µ pqs, σ 2 pqs). (15) Once z i,mis are imputed, then the conditional distribution of f(y i,mis y i,obs ) can be approximated by f(y i,mis y i,obs ) = p(z i,mis z i,obs )f(yi,mis z i,obs, z i,mis), (16) z i,mis where p(z i,mis z i,obs) is the conditional cell probability of z i,mis given z i,obs. The selection of donors is similar to the univariate missing case in the sense that the approximation (16) has the same mixture model structure of (4) under the MAR assumption. The multivariate version of two-phase systematic pps sampling is as follows: [Phase 1]: Estimation of cell probabilities First step for the multivariate hot deck imputation is to estimate the joint cell probabilities p(z), where p(z) is a cell probability for a particular value of z. Since we have missing items on z i,mis, we cannot directly estimate cell probabilities as in the univariate missing case. We use a modified EM algorithm. The procedure avoids producing positive probabilities for structural zeros. See Appendix C for a description of the modified EM algorithm. To estimate the conditional cell probabilities, we consider a subdivision of A M based on observed vectors of z. From the observed z of recipients, A M can be sub-partitioned into G = Q + S + 1 groups, denoted by z 1,..., z G, corresponding {(1,..., Q),?)}, {?, (1,..., S)}, and (?,?). Thus, any nonresponding unit i A M belongs to one subgroup, A Mg = {j A M ; z j,obs = z g,obs }, g = 1,..., G. We also define A Rg = {j A R ; z j,obs = z g,obs }, g = 1,..., G. A responding unit i A R can belong to multiple subgroups. For example, z i = (1, 1) can be an element of subgroups for both z g = (1,?) and z g = (?, 1). 1035

Now, let H g be the size of all possible vectors for z g,mis. For i A Mg, if z g,mis are imputed for z g,mis, then an estimated conditional cell probability ˆp(z g,mis z g,obs) is where z (h) g H g ˆπ h g = ˆp(z (h) g )/ ˆp(z (h) g ), (17) = (z g,obs, z (h) g,mis ). Then, the FEFI estimator of Y p = N i=1 y pi is Ŷ p,f EF I = H G δ g piy pi + (1 δ pi )a ig ˆπ h g ˆµ gh i A = δ piy pi + (1 δ pi ) wij,f EF Iy pj i A j A where a jgh = 1 if (z j,obs, z j,mis ) = (z g,obs, z (h) g,mis ) and 0 otherwise, a jg = Hg a jgh, wij,f EF I = G a H ig ˆπ h g{w j δ j a jgh / l A w lδ l a lgh }, and j A ˆµ pgh = w jδ j a jgh y pj j A w. jδ j a jgh [Phase 2]: Systematic PPS sampling for missing y mis The fractional hot deck imputation for multivariate case can be implemented in a manner similar to the univariate case. The (S1) procedure in Appendix A needs to be replaced with the following (M1): (M1) Let n Rg be the number of FEFI donors in A Rg. Then, n Rg FEFI donors are ordered with y values by the half-ascending and half-descending order as in the univariate case. The sorting method depends on the number of missing items in z g,mis, (a) Single missing item: the FEFI donors are sorted based on y p values corresponding to the missing item. (b) Multiple missing items: the FEFI donors are first sorted by values of z that has the highest response rate among missing items. After then, the FEFI donors are sequentially sorted by values of z in order of item response rates. Note that, if we use y instead of z in sorting of the FEFI donors, the final order only depends on the missing item that has the lowest response rate. Once M donors are selected from the FEFI donors with probability proportional to wij,f EF I for each recipient, the FI estimator of Y p is Ŷ p,f I = H G δ g piy pi + (1 δ pi )a ig ˆπ h g ȳpi i A = M δ piy pi + (1 δ pi ) wijy (j) pi i A where y (j) pi is the y pj of j-th donor for y pi, ȳpi = M 1 M j=1 y (j) pi is a mean of imputed values, and wij = M 1. For variance estimation, we first calculate w (k) ij based on (V 1) and (V 2) in Section 3 and then compute the jackknife variance estimate from the formula in (14). j=1 1036

5.1 Univariate missing case 5. Simulation Study To check the performance of the proposed method in the univariate case, Y i = (Y 1i, Y 2i ), i = 1,, n are randomly generated from Y 1 U(0, 2), Y 2 = 1 + Y 1 + e 2, where e 2 is independent of Y 1 and is generated from a standard normal distribution. Here, Y 1 is fully observed observed but Y 2 is subject missingness with δ Bernoulli(0.7). Thus, Y 1 plays the role of x in Section 3. In the simulation, B=5,000 Monte Carlo samples are generated with size of n = 300. To implement the fractional hot deck imputation, Y 1 and Y 2 are categorized into Ỹ1 and Ỹ2 that respectively play roles of x and z of Section 2. The auxiliary variable,y 1, is categorized into five groups and the study variable, Y 2, is partitioned into two groups based on the sample quantiles of the respondents. For example, observations with y 2 values less than the median belong to group 1 (i.e. ỹ 2 = 1). For each recipient, M = 10 and M = 20 donors are respectively selected using the systematic sampling with probability proportional to the fractional weights of the FEFI donors. If M is no greater than n Rg, then we select all FEFI donors and use FEFI donors fractional weights as the fractional weights for the FI estimator. We consider five parameters: θ 1 = E(Y 2 ), θ 2 = P (Y 2 < 2), θ 3 = E(Y 2 D = 1) with D Bernoulli(0.3), θ 4 is the slope of regression of Y 2 on Y 1 and θ 5 is the correlation between Y 1 and Y 2. The five parameters are estimated using the FEFI estimator and the FI estimator. For θ 4 and θ 5, the FEFI or the FI estimator is i A ˆθ 4 = {δ i(y 1i ȳ 1 )(y 2i ȳ2i ) + (1 δ i) j A δ jwij (y 1i ȳ 1 )(y 2j ȳ2i )} i A (y 1i ȳ 1 ) 2, and ˆθ 5 = i A {δ i(y 1i ȳ 1 )(y 2i ȳ 2I ) + (1 δ i) j A δ jw ij (y 1i ȳ 1 )(y 2j ȳ 2I )} { i A (y 1i ȳ 1 ) 2 } 1/2 [ i A {δ i(y 2i ȳ 2I )2 + (1 δ i ) j A δ jw ij (y 2j ȳ 2I )2 }] 1/2, where ȳ 2I is a mean of imputed samples in y 2. In addition to point estimators, we also computed variance estimators using the replication method in Section 3. The jackknife variance estimator in (14) is used to compute variance estimates of the FI estimator. Table 1 presents the Monte Carlo means, standardized variances of the point estimators and relative bias of variance estimates. All point estimators are nearly unbiased. Slight biases in estimation of regression slope and correlation are due to the discrete approximation. The variances are reported with respect to variances of the FEFI estimators. The FI esimators are as efficient as the FEFI estimators. For larger imputation size, we have smaller relative bias in variance estimates. 5.2 Multivariate case Now we extend the proposed method to a multivariate missing case. We generated Y i = (Y 1i, Y 2i ), i = 1,, n, from Y 1 Gamma(1, 1), Y 2 = 1 + 0.5Y 1 + e 2, Y 3 = 2 + 0.5Y 1 + 0.5Y 2 + e 3 1037

Table 1: Monte Carlo results of mean, standardized variance and relative bias of variance estimators for the univariate missing case Parameter Estimator Bias of ˆθ Std. Var. of ˆθ Rel.Bias (%) of ˆV {ˆθ} θ 1 FEFI -0.00019 100.0 E(Y 2 ) FI(M=10) -0.00016 100.0-8.0 FI(M=20) -0.00019 100.0-6.1 θ 2 FEFI -0.00009 100.0 P (Y 2 < 2) FI(M=10) -0.00008 100.0-2.7 FI(M=20) -0.00009 100.0-0.8 θ 3 FEFI 0.00125 100.0 E(Y 2 D = 1) FI(M=10) 0.00113 100.0-2.7 FI(M=20) 0.00124 100.0-1.5 θ 4 FEFI -0.01201 100.0 Slope FI(M=10) -0.01202 100.0-6.5 FI(M=20) -0.01198 100.0-3.1 θ 5 FEFI -0.00599 100.0 Corr(Y 1, Y 2 ) FI(M=10) -0.00599 100.0-7.9 FI(M=20) -0.00598 100.0-3.9 where e 2 and e 3 are independently generated from a standard distribution. We generated δ li Bernoulli(p l ) independently for each Y l, l = 1, 2, 3, with p 1 = 0.5, p 2 = 0.7 and p 3 = 0.9 so that all variables are subject to missingness. Each variable is firstly categorized into three groups and then collapsed so that A Rg has at least two elements. We select M = 10 and M = 20 donors for each recipient using systematic sampling with probability proportional to the fractional weights of the FEFI donors. If M n Rg, then we select all possible donors and assign the fractional weights of the FEFI estimator as the fractional weights of the FI estimator. We generate B = 5, 000 Monte Carlo samples with size of n = 500. We computed estimators of θ 1 = E(Y 1 ), θ 2 = E(Y 2 ), θ 3 = E(Y 3 ), θ 4 = P (Y 1 < 1, Y 2 < 2) and θ 5 = E(Y 2 D = 1) with D Bernoulli(0.3). For variance estimation of the FI estimator, we used the jackknife estimator with formula (14). Table 2 presents the Monte Carlo means, standardized variances of the point estimators and relative bias of variance estimates. All estimators are nearly unbiased and the proposed FEFI and FI estimator perform well in this simulation. The variances are reported based on the FEFI estimates. As univariate missing case, the FI estimators are also as efficient as the FEFI estimators. The relative biases of variance estimators are smaller with M = 20 than the biases with M = 10. 6. Concluding remarks A fractional hot deck imputation in this paper mimics two-phase systematic sampling in the sense that imputation cells are created and missing items are imputed using the cells. The variance estimator has a potential for bias because the component due to donor selection is ignored. For multivariate imputation, joint cell probabilities are used to define conditional cell probabilities. The joint distribution of the study vector is approximated by a discrete approximation. The choice for the optimal level of discrete approximation can be viewed 1038

Table 2: Monte Carlo results of mean, standardized variance and relative bias of variance estimators for the multivariate missing case Parameter Estimator Bias of ˆθ Std. Var. of ˆθ Rel.Bias (%) of ˆV {ˆθ} θ 1 FEFI -0.00053 100.0 E(Y 1 ) FI(M=10) -0.00052 100.0-6.4 FI(M=20) -0.00054 100.0-3.5 θ 2 FEFI -0.00017 100.0 E(Y 2 ) FI(M=10) -0.00015 100.0-10.2 FI(M=20) -0.00016 100.0-5.7 θ 3 FEFI -0.00156 100.0 E(Y 3 ) FI(M=10) -0.00156 100.0-8.2 FI(M=20) -0.00156 100.0-6.9 θ 4 FEFI -0.00375 100.0 E(Y 1 < 1, Y 2 < 2) FI(M=10) -0.00376 100.0-5.3 FI(M=20) -0.00375 100.0 0.3 θ 5 FEFI 0.00071 100.0 E(Y 2 D = 1) FI(M=10) 0.00065 100.0-3.6 FI(M=20) 0.00077 100.0-2.0 as bandwidth selection for a nonparametric procedure. A modified EM algorithm is introduced for computation of joint cell probabilities. One desirable feature of the proposed method is that the covariance structure of multivariate variables is retained after imputation because imputed values are jointly generated and are selected to mimic distribution of variables as closely possible. An efficient sampling algorithm such as systematic PPS sampling is required. While the proposed FI estimator is nearly as efficient as the FEFI estimator, the size of the finally imputed data set will be relatively small compared to the use of FEFI. An R software package of the proposed method is under development. A. Systematic PPS sampling procedure Appendix (S1) Sort n Rg FEFI donors in terms of y values by the half-ascending and half-descending order. For example, {1, 2,..., 10, 11} is sorted as follows: 1, 3, 5, 7, 9, 11, 10, 8, 6, 4, 2. Let [j],j = 1,..., n Rg, be the j-th sorted unit in A Rg. (S2) Construct the interval of (L [j], U [j] ) for the systematic pps sampling. (a) Set j = 1 and L [1] = 0. (b) For current j, U [j] = L [j] + M w ij,f EF I (c) Set j = j + 1 and L [j] = U [j 1] and go to step (b) until j = n Rg. (S3) Let (RN) g be a random number generated from U(0, 1). For each i A Mg, we select M donors as follows: For l = 1,..., M, if L [j] (RN) g + (i 1) n Mg + (l 1) U [j] 1039

for some j, then y [j] be the l-th imputed value for unit i. B. Proof of Theorem 1 Before we prove Theorem 1, assume that δ i (i = 1,..., N) is extended to the entire population and assumed to be independent random variable. This extension has been discussed by Fay (1991) and used by Rao and Shao (1992). First, we rewrite the FEFI estimator in (7), Ŷ F EF I = = = = H δ i y i + (1 δ i ) ˆπ h g ˆµ gh i A g i A g H H δ i ˆπ h g ˆµ gh + (1 δ i ) ˆπ h g ˆµ gh i A g i A g H ˆπ h g ˆµ gh i A g G ˆN g (ŶRg/ ˆN Rg ) (B.1) Now, applying Taylor expansion on the ŶF EF I defined in (B.1), we have Ŷ F EF I = G + N g N Rg Y Rg + G G Y Rg N Rg ( ˆN g N g ) N g N Rg (ŶRg Y Rg ) G Y Rg N g NRg 2 ( ˆN Rg N Rg ) + S n + G n, (B.2) where S n = 1 (ŶRg Y Rg )( N ˆN g N g ) N g Rg NRg 2 (ŶRg Y Rg )( ˆN Rg N Rg ) Y Rg NRg 2 ( ˆN Rg N Rg )( ˆN g N g ) + Y RgN g NRg 3 ( ˆN Rg N Rg ) 2, and G n is a remainder term. From the assumption (A4), S n has the order of O p (n 1 N). Thus, by the assumption (A4) and (A5), (B.2) can be expressed with Ŷ F EF I = i A g {y i + (R 1 g δ i 1)(y i µ g )} + o p (n 1/2 N), (B.3) where, R g = N Rg /N g and µ g = H π h gµ gh for i U g. Henceforth, we define Ỹ F EF I = G i A g γ ig with γ ig = y i + (Rg 1 δ i 1)(y i µ g ). Thus, from (9) and ỸF EF I, we have the result (10), Ŷ F I = ỸF I + o p (n 1/2 N), where ỸF I = ỸF EF I + G i A Mg (ȳi ˆµ g). (B.4) 1040

Let E I ( ) be an expectation on imputation mechanism, we have E I (ỸF I) = ỸF EF I. Thus, to prove (11), it suffices to show that E(ỸF EF I Y N ) = 0. Taking expectation on ỸF EF I, we have E(ỸF EF I) = E { E(ȲF EF I F N ) } = E + E g δ i 1)(y i µ g ) y i i U g = E(Y N ) + E (R 1 i U g (R 1 i U g g δ i 1)(y i µ g ), where F N is a set of finite population. On the other hand, E g δ i 1)(y i µ g ) = 0. (R 1 i U g (B.5) (B.6) (B.7) From (B.5), (B.6) and (B.7), we have E(ỸF I Y N ) = 0, that is, (11) is established. We now consider variance of the FI estimator. From expression (B.3), we fist have V γ ig = V µ g + V w i Rg 1 δ i (y i µ g ) i A g i A g i A g + Cov w i µ g, Rg 1 δ i (y i µ g ). (B.8) i A g i A g For the second term of (B.8), we have G V Rg 1 δ i (y i µ g ) = E wi 2 Rg 2 δ i (y i µ g ) 2, (B.9) i A g i A g { G } where the equality comes from E i A g Rg 1 δ i (y i µ g ) = 0. For the third term of (B.8), we also have Cov µ g, Rg 1 δ i (y i µ g ) i A g i A g = E µ g Rg 1 δ i (y i µ g ) = 0 (B.10) i U g { G } where the equality comes from E i A g Rg 1 δ i (y i µ g ) = 0 and the cell mean model in (1). From (B.8)-(B.10), we write V (ỸF EF I) = V µ g + E wi 2 Rg 2 δ i (y i µ g ) 2 (B.11) i A g i A g 1041

Note that the variance of N 2 Ỹ F EF I converges to the variance of N 2 Ŷ F EF I as n goes to infinity such that 2 V γ ig = E ˆR g 1 w i δ i (y i µ g ) + (Cross product term) i A g i A g G +E (Rg 1 1 ˆR g ) 2 w i δ i (y i µ g ) (B.12) i A g = V ˆR g 1 δ i y i + O(n 3/2 N 2 ) i A g ) = V (ŶF EF I + o(n 1 N 2 ), (B.13) where the second term of (B.12) converges to 0 with order of O(n 2 N 2 ) and the cross product term converges to 0 with order of O(n 3/2 N 2 ) by the condition (A4) and the Schwarz inequality. Thus, from (B.11) and (B.13), we show (13) such that V (ŶF EF I) = V µ g + E wi 2 Rg 2 δ i (y i µ g ) 2 + o p(n 1 N 2 ). i A g i A g We now write, V (ŶF I ŶF EF I) = V { )} { )} E I (ŶF I ŶF EF I + E V I (ŶF I ŶF EF I, (B.14) where ŶF I ŶF EF I = G i A Mg (ȳi ˆµ gh) and V I ( ) is a variance on imputation mechanism. On the imputation mechanism, V I (ŶF I ŶF EF I) = i A Mg w 2 i V (ȳ i ˆµ g A g ). (B.15) E I (ŶF I ŶF EF I) = 0. (B.16) Thus, by the result of (B.15) and (B.16), V (ŶF I ŶF EF I) = E wi 2 V (ȳi ˆµ g A g ). i A Mg (B.17) Also, since E I (ŶF I) = ŶF EF I, we have Therefore, by (B.17) and (B.18), (12) is established C. Description of the EM algorithm Cov(ŶF I ŶF EF I, ŶF EF I) = 0 (B.18) The EM algorithm is used here in a slightly modified way. For each unit i, the conditional probability of z i,mis given z i,obs is computed using the current estimate of the joint probability ˆp(z), where ˆp (z) = 1. This is the E-step of the EM algorithm. The initial conditional probabilities are z w (h) i = ˆp 0(z i,obs, z i,mis = z (h) i,mis ) Hi ˆp 0(z i,obs, z (h) i,mis ). (C.1) 1042

where ˆp 0 (z) is the estimated joint probability computed from the full respondents and z (h) i,mis is the h-th imputed vector for the missing missing items of unit i A M. Here, H i denotes the number of imputed vectors in z (h) i,mis. The M-step computes the joint probability of particular combination of z = (z obs, z mis ), ˆp(z ) = ( n ) 1 i=1 n H i i=1 w (h) i I(z i,obs = z obs, z (h) i,mis = z mis). (C.2) Equations (C.1) and (C.2) form a set of iterative computations for the EM algorithm. In the iteration, ˆp 0 is replaced by ˆp to compute w (h) i and update ˆp(z ) again until it converges. REFERENCES Brick, J.M. and Kalton, G. (1996). Handling missing data in servey research. Statistical Methods in Medical Research, 5, 215 238. Chen, J. and Shao, J. (2000). Nearest neighbor imputation for survey data. Journal of Official Statistics, 16, 113 132. Cotton, C. (1991). Functional description of the Generalized Edit and Imputation System. Business Survey Methods Division, Statistics Canada. Fay, R. E. (1991). A design-based perspective on missing data variance. In Proceedings of Bureau of the Census Annual Research Conference, American Statistical Association, 429 440. Fuller, W.A. (2009). Sampling Statistics. Hoboken, New Jersey: John Wiley & Sons, Inc. Fuller, W.A. and Kim, J.K. (2005). Hot deck imputation for the response model. Survey Methodology, 31, 139 149. Haziza, D. (2009). Imputation and inference in the presence of missing data. In Handbook of Statistics, volume 29, Sample Surveys: Theory Methods and Inference, Edited by C.R. Rao and D. Pfeffermann, 215 246. Haziza, D. and Beaumont, J.F. (2007). On the construction of imputation classes in surveys. International Statistical Review, 75, 25 43. Kalton, G. and Kasprzyk, D. (1986). The treatment of missing survey data. Survey Methodology, 12, 1 16. Kalton, G. and Kish, L. (1984). Some efficient random imputation methods Communications in Statistics, 13, 1919 1939. Kim, J.K. and Fuller, W.A. (2004). Fractional hot deck imputation. Biometrika, 91, 559 578. Kim,J.K., Navarro, A., and Fuller, W.A. (2006). Replication variance estimation for two-phase stratified sampling. Journal of American Statistical Association, 101, 312 320. Rancourt, E., Särndal, C.E., and Lee, H. (1994). Estimation of the variance in presence of nearest neighbor imputation. In Proceedings of the Section on Survey Research Methods, American Statistical Association, 888 893. Rao, J.N.K. (1973). On double sampling for stratification and analytical surveys. Biometrika, 60, 125 133. Rao, J.N.K., and Shao, J. (1992). Jackknife variance estimation with survey data under hot deck imputation. Biometrika, 79, 811 822. Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys, New York: John Wiley & Sons, Inc. Rubin, D.B. and Schenker, N. (1986). Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association, 81, 366 374. 1043