Fractional hot deck imputation


Biometrika (2004), 91, 3. Biometrika Trust. Printed in Great Britain.

BY JAE KWANG KIM
Department of Applied Statistics, Yonsei University, Seoul, Korea

AND WAYNE FULLER
Department of Statistics, Iowa State University, Ames, Iowa 50011, U.S.A.

SUMMARY

To compensate for item nonresponse, hot deck imputation procedures replace missing values with values that occur in the sample. Fractional hot deck imputation replaces each missing observation with a set of imputed values and assigns a weight to each imputed value. Under the model in which observations in an imputation cell are independently and identically distributed, fractional hot deck imputation is shown to be an effective imputation procedure. A consistent replication variance estimation procedure for estimators computed with fractional imputation is suggested. Simulations show that fractional imputation and the suggested variance estimator are superior to multiple imputation estimators in general, and much superior to multiple imputation for estimating the variance of a domain mean.

Some key words: Cell mean model; Missing data; Nonresponse; Replication variance estimation.

1. INTRODUCTION

Item nonresponse occurs when a sampled unit provides some information but fails to respond to all items. Hot deck imputation is an imputation procedure in which the value assigned for a missing item is taken from respondents in the current sample. Many hot deck imputation procedures use auxiliary variables known for both the respondents and nonrespondents to divide the sample into so-called imputation cells. The hot deck imputation method assigns the value from a record with a response to the record with a missing value. The record providing the value is called the donor and the record with the missing value is called the recipient.

A desirable property of hot deck imputation is that all imputed values are observed values. For example, imputed values for categorical variables will also be categorical with the same number of categories as observed for the respondents. In random hot deck imputation, nonrespondents are assigned values chosen at random from respondents in the same imputation cell. Random hot deck imputation preserves the distributional properties of the imputed dataset; that is, the distribution function for imputed data within a cell differs from the distribution function for the respondents in the cell only because of the randomness of imputation.
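As an illustration of random hot deck imputation within cells, the following minimal Python sketch draws each missing value at random, with replacement, from the respondents in the same cell. The function name `random_hot_deck`, the use of `None` to mark a missing item and the toy data are assumptions made for this example; they are not part of the paper.

```python
import random

def random_hot_deck(values, cells, seed=0):
    """Replace each None in `values` by a value drawn at random
    (with replacement) from respondents in the same imputation cell.
    Assumes every cell has at least one respondent."""
    rng = random.Random(seed)
    donors = {}
    for y, g in zip(values, cells):
        if y is not None:
            donors.setdefault(g, []).append(y)
    completed = []
    for y, g in zip(values, cells):
        # every imputed value is an observed value from the same cell
        completed.append(rng.choice(donors[g]) if y is None else y)
    return completed

# Hypothetical toy data: two cells, one nonrespondent in each.
y = [1.2, None, 0.7, 3.4, None, 2.9]
cell = ["a", "a", "a", "b", "b", "b"]
completed = random_hot_deck(y, cell)
```

Because donors are drawn at random, repeating the imputation with a different seed generally changes the completed dataset; this is the imputation variance discussed next.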

Random selection of donors introduces variability that is termed imputation variance. Brick & Kalton (1996) describe two methods for reducing imputation variance. One is through the sample design used for selecting donors within each imputation cell. For instance, selecting donors by simple random sampling without replacement is preferable to simple random sampling of donors with replacement. A second approach is to use fractional imputation (Kalton & Kish, 1984; Fay, 1996), which involves using more than one donor for a recipient. For example, three imputed values might be assigned to each nonrespondent, with each entry allocated a weight of one-third of the nonrespondent's original weight.

Multiple imputation, proposed by Rubin (1978), is a procedure for handling missing data that allows the data analyst to use analyses designed for complete data, while at the same time providing a method for estimating the uncertainty due to the missing data. Both fractional imputation and multiple imputation can be called repeated imputation, in the sense that more than one value is assigned for each missing item. However, fractional imputation was designed to reduce the imputation variance, while the primary goals of multiple imputation are to simplify estimation and to provide an estimator of the variance. The approximate Bayesian bootstrap described in Rubin & Schenker (1986) can be viewed as a hot deck imputation method in the multiple imputation context.

In addition to the multiple imputation procedure of Rubin & Schenker (1986), several other methods have been proposed for estimating the variance of an estimated total after hot deck imputation. Rao & Shao (1992) proposed a jackknife variance estimator for hot deck imputation in which the donors are selected with replacement with the selection probability proportional to the sampling weights.
Tollefson & Fuller (1992) proposed a variance estimator for without-replacement hot deck imputation. Särndal (1992), Fay (1996) and Chen & Shao (2001) also proposed variance estimators for certain types of hot deck imputation; see Little & Rubin (2002) for discussion of imputation methods.

In this paper, we give assumptions under which hot deck imputation produces consistent estimators, provide a general representation for hot deck imputation estimators and give the variance of the estimators. Secondly, we propose a class of unbiased variance estimators for the variance of the imputed estimator that is applicable for complex survey designs. The proposed variance estimator is general in the sense that it does not depend on the specific hot deck imputation method used. For example, the proposed variance estimator is unbiased for both with-replacement and without-replacement selection of donors. Fractional imputation combined with the proposed replication variance estimator gives a set of replication weights that can be used to construct unbiased variance estimators for estimators based on the completely responding variables as well as for estimators based on imputed data.

2. MODELS FOR IMPUTATION

2.1. Introduction

Consider a population of $N$ elements identified by a set of indices $U = \{1, 2, \ldots, N\}$. Associated with unit $i$ of the population is the value of the study variable $y_i$, and $Y = (y_1, y_2, \ldots, y_N)$ denotes the vector of values of the study variable for the $N$ units in the population. Let $A$ denote the set of indices of the elements in a sample selected by a set of probability rules called the sampling mechanism. Responses are obtained from the selected sample according to a probability mechanism called the response mechanism. Let the population quantity of interest be $\theta_N = \theta(y_1, y_2, \ldots, y_N)$ and let $\hat\theta$ be a linear estimator of $\theta_N$ based on the full sample,

\[ \hat\theta = \sum_{i \in A} w_i y_i. \tag{1} \]

If we observe $y_i$ on every element of the sample, then $\hat\theta$ with $w_i = \{\mathrm{pr}(i \in A)\}^{-1}$ is a design-unbiased estimator of the population sum of the $y_i$.

Assume that the finite population $U$ is made up of $G$ imputation cells. Let $n_g$ be the number of sample elements in imputation cell $g$ and let $r_g$, $r_g > 0$, be the number of respondents in imputation cell $g$. The elements in cell $g$ $(g = 1, \ldots, G)$ of the finite population are assumed to be a realisation of independently and identically distributed random variables with mean $\mu_g$ and variance $\sigma_g^2$. Thus, independently for each $i \in U_g$,

\[ y_i \sim (\mu_g, \sigma_g^2), \tag{2} \]

where $U_g$ denotes the set of indices for the $g$th imputation cell. If the $y_i$ are independent of the sampling mechanism and of the response mechanism conditional on the cell, then model (2) for the cell holds for the responding units as well as for the nonrespondents; that is, independently for each $i \in A_g$,

\[ y_i \mid (A, A_R) \sim (\mu_g, \sigma_g^2), \tag{3} \]

where $A_g$ is the set of indices of sample elements in cell $g$ and $A_R$ is the set of indices of the sample respondents. We call model (3) the cell mean model. Note that the $y_i$ in the population may be related to the design but the division into cells removes that dependence.

2.2. Hot deck imputation

A hot deck imputation method can be described by two factors, the first of which is the way in which donors are selected for each missing item. The distribution of

\[ d = \{d_{ij};\ i \in A_R,\ j \in A_M\}, \tag{4} \]

where $d_{ij}$ is the number of times that $y_i$ is used as a donor for $y_j$ and $A_M$ is the set of indices of the sample nonrespondents, is called the imputation mechanism. The second factor is the way the weight of the donor is defined for each missing item.
Let $w^*_{ij}$ be the fraction of the original weight of element $j$ assigned to the value from donor $i$ used for the missing value of element $j$. For element $j$ with missing $y$,

\[ y_{Ij} = \sum_{i \in A_R} w^*_{ij} y_i \tag{5} \]

is the weighted mean of the imputed values. Note that $w^*_{ii} = 1$ for $i \in A_R$ and $w^*_{ii} = 0$ for $i \in A_M$. The sum of the imputation fractions for each item is required to be one:

\[ \sum_{i \in A_R} w^*_{ij} = 1, \tag{6} \]

for all $j \in A$. In the typical situation with $M$ donors for element $j$ $(j \in A_M)$, the $w^*_{ij}$ are all equal to $M^{-1}$ for $d_{ij} = 1$.
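The imputation fractions of (5) and (6) can be illustrated with a minimal Python sketch, given here under stated assumptions rather than as the authors' implementation. The function name `fractional_hot_deck`, the dictionary layout of the fractions and the toy data are hypothetical. Each nonrespondent receives up to $M$ distinct donors from its own cell, each with fraction $1/M$, and the total donor weight is accumulated as $\sum_{j} w_j w^*_{ij}$.

```python
import random

def fractional_hot_deck(y, w, cells, M=3, seed=0):
    """Return (w_star, alpha): w_star[(i, j)] is the fraction of recipient
    j's weight given to donor i; alpha[i] is the total weight of donor i."""
    rng = random.Random(seed)
    resp = [i for i, v in enumerate(y) if v is not None]
    w_star = {(i, i): 1.0 for i in resp}           # respondents keep their own value
    for j, v in enumerate(y):
        if v is None:
            pool = [i for i in resp if cells[i] == cells[j]]
            m = min(M, len(pool))                  # cannot use more donors than exist
            for i in rng.sample(pool, m):          # m distinct donors from the cell
                w_star[(i, j)] = 1.0 / m
    alpha = {i: 0.0 for i in resp}
    for (i, j), frac in w_star.items():
        alpha[i] += w[j] * frac
    return w_star, alpha

# Hypothetical data: two cells of four units, one nonrespondent in each.
y = [2.0, None, 3.0, 4.0, 5.0, None, 6.0, 7.0]
w = [1.0] * 8
cells = [0, 0, 0, 0, 1, 1, 1, 1]
w_star, alpha = fractional_hot_deck(y, w, cells, M=3)
theta_hat = sum(alpha[i] * y[i] for i in alpha)    # weighted total using donor weights
```

With three respondents per cell and $M = 3$, every respondent is a donor, so the result does not depend on the random seed: each missing value is effectively replaced by its cell mean.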

A linear estimator using hot deck imputation can be written in the form

\[ \hat\theta_I = \sum_{i \in A_R} \Big( \sum_{j \in A} w_j w^*_{ij} \Big) y_i \equiv \sum_{i \in A_R} \alpha_i y_i. \tag{7} \]

The sum of $w_j w^*_{ij}$ over all recipients for which $i$ is a donor, including $i$ itself, is the total weight of donor $i$, and is denoted by $\alpha_i$.

3. ESTIMATION AFTER HOT DECK IMPUTATION

The properties of the estimator (7) under the cell mean model (3) are given in Theorem 1.

THEOREM 1. Let the population satisfy the cell mean model (3). Assume the distribution of $d$ is independent of $Y$ and depends only on $(r_1, r_2, \ldots, r_G)$ and $(n_1, n_2, \ldots, n_G)$. Also, $\mathrm{pr}(d_{ij} > 0) > 0$ if and only if $i$ and $j$ belong to the same imputation cell. Let $\hat\theta$ be a linear estimator of the form (1) constructed from the full sample and assume that $\hat\theta$ is design-unbiased for the population quantity $\theta_N$. Let the fractionally imputed estimator $\hat\theta_I$ be defined by (7) and assume that the imputation fractions are constructed so that (6) holds. Then

\[ E(\hat\theta_I - \theta_N) = 0, \tag{8} \]

\[ \mathrm{var}(\hat\theta_I) = \mathrm{var}\Big( \sum_{g=1}^{G} \sum_{i \in A_g} w_i \mu_g \Big) + E\Big( \sum_{g=1}^{G} \sum_{i \in A_{Rg}} \alpha_i^2 \sigma_g^2 \Big), \tag{9} \]

where $w_i$ is proportional to $\{\mathrm{pr}(i \in A)\}^{-1}$, $\alpha_i$ is the total weight of donor $i$ defined in (7), $A$ is the set of sample indices, $A_R$ is the set of respondent indices, $G$ is the number of imputation cells, $A_g$ is the set of indices of the sample for the $g$th imputation cell, and $A_{Rg}$ is the set of indices of respondents for the $g$th imputation cell. If $\theta_N = \sum_{i=1}^{N} y_i$ and $\sum_{i \in A} w_i = N$, then

\[ \mathrm{var}(\hat\theta_I - \theta_N) = \mathrm{var}\Big( \sum_{g=1}^{G} \sum_{i \in A_g} w_i \mu_g \Big) + E\Big\{ \sum_{g=1}^{G} \sum_{i \in A_{Rg}} (\alpha_i^2 - \alpha_i) \sigma_g^2 \Big\}. \tag{10} \]

See Appendix 1 for the proof.

In Theorem 1, expression (9) is the variance of $\hat\theta_I$ as an estimator of the superpopulation parameter. Expression (10) is the variance of the estimator of the finite population total. The expectations on the right-hand side of (9) and on the right-hand side of (10) are with respect to the joint distribution defined by the superpopulation model, the sampling mechanism, the response mechanism and the imputation mechanism.
Correspondingly, the variances are with respect to the joint distribution.

Under the cell mean model (3), the variance of the estimator is a function of the expectation of $\alpha_i^2$, where the expectation is a function of the procedure used to select donors. Under model (3), a procedure that produces equal $\alpha_i$ will minimise the conditional variance, conditional on the observed sample indices. While the best estimator of the cell mean under the cell mean model is the simple sample cell mean, this estimator is seldom used in practice. The practitioner may be willing to use the model to impute for missing values but not for estimation of the mean. Also, the model may not be appropriate for other y-values. Common approaches under model (3) are to select donors within the cell with equal probabilities or to select donors with probabilities proportional to the original weights of donors.
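The role of $\sum \alpha_i^2$ in (9) can be made concrete with a small numerical sketch. The helper names `donor_weights` and `sum_sq` and the three donor assignments are hypothetical; the example takes one cell with four respondents, two nonrespondents and unit sampling weights, and compares three ways of choosing donors.

```python
def donor_weights(n_resp, assignments, w=1.0):
    """alpha_i for one cell with equal sampling weights w.
    `assignments` holds one dict {donor index: fraction} per recipient."""
    alpha = [w] * n_resp                       # each respondent carries its own weight
    for fractions in assignments:
        assert abs(sum(fractions.values()) - 1.0) < 1e-12   # condition (6)
        for i, frac in fractions.items():
            alpha[i] += w * frac
    return alpha

def sum_sq(alpha):
    return sum(a * a for a in alpha)

same_donor  = donor_weights(4, [{0: 1.0}, {0: 1.0}])        # one donor used twice
diff_donors = donor_weights(4, [{0: 1.0}, {1: 1.0}])        # two different donors
fractional  = donor_weights(4, [{i: 0.25 for i in range(4)}] * 2)  # all four donors

s_same, s_diff, s_frac = sum_sq(same_donor), sum_sq(diff_donors), sum_sq(fractional)
```

The three sums of squares are 12, 10 and 9. The total weight is 6 in every case, so only the spread of the $\alpha_i$ changes, and the assignment with equal $\alpha_i$ gives the smallest multiplier of $\sigma_g^2$.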

Given nonresponse, one estimator of $\theta_N$ is the ratio estimator

\[ \hat\theta_{FE} = \sum_{g=1}^{G} \Big( \sum_{i \in A_g} w_i \Big) \frac{\sum_{i \in A_{Rg}} w_i y_i}{\sum_{i \in A_{Rg}} w_i}, \tag{11} \]

where $A_g$ and $A_{Rg}$ are defined in Theorem 1. The estimator (11) is called fully efficient because it contains no variability due to random selection of donors. As discussed, (11) is not fully model efficient under the cell mean model when the $w_i$ differ.

The estimator (11) can be implemented by using fractional imputation. Every responding unit in the imputation cell is used as a donor for each nonresponding element in the cell and the imputation fractions $w^*_{ij}$ are defined to be proportional to the sampling weights. The resulting fractionally imputed estimator is

\[ \hat\theta_{FEFI} = \sum_{j \in A} \sum_{i \in A_R} w_j w^*_{ij} y_i, \tag{12} \]

where $w_j w^*_{ij}$ is the weight of donor $i$ for recipient $j$, and $w^*_{ij}$ is the imputation fraction for donor $i$ as a donor for recipient $j$, defined as

\[ w^*_{ij} = \begin{cases} \big( \sum_{s \in A_{Rg}} w_s \big)^{-1} w_i, & \text{if } j \in A_{Mg} \text{ and } i \in A_{Rg}, \\ 1, & \text{if } j \in A_R \text{ and } i = j. \end{cases} \tag{13} \]

The estimator (12) with $w^*_{ij}$ of (13), algebraically equivalent to (11), is called the fully efficient fractionally imputed estimator. If the sampling weights in a cell are the same, this estimator has the minimum conditional variance, conditional on the number in the cell, among linear unbiased estimators under the cell mean model. The imputed dataset permits the simple computation of statistics such as percentiles without recomputing the ratio.

4. REPLICATION VARIANCE ESTIMATION

In this section we consider variance estimation based on replication; Wolter (1985) and Rust & Rao (1996) contain discussions of such procedures. Let a replication variance estimator for a complete sample be

\[ \hat V(\hat\theta) = \sum_{k=1}^{L} c_k \big( \hat\theta^{(k)} - \hat\theta \big)^2, \tag{14} \]

where $\hat\theta^{(k)}$ is the $k$th estimator of $\theta_N$ based on the observations included in the $k$th replicate, $L$ is the number of replicates, and $c_k$ is a factor associated with replicate $k$ determined by the replication method.
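The algebraic equivalence of the ratio estimator (11) and the fractionally imputed form (12) with fractions (13) can be checked numerically. The sketch below is an illustration under assumed names (`ratio_estimator`, `fefi_estimator`) and made-up data with unequal weights; `None` marks a missing $y$.

```python
def ratio_estimator(y, w, cells):
    """Estimator (11): per cell, total weight times weighted respondent mean."""
    total = 0.0
    for g in sorted(set(cells)):
        idx = [i for i in range(len(y)) if cells[i] == g]
        resp = [i for i in idx if y[i] is not None]
        w_all = sum(w[i] for i in idx)
        total += w_all * sum(w[i] * y[i] for i in resp) / sum(w[i] for i in resp)
    return total

def fefi_estimator(y, w, cells):
    """Estimator (12) with fractions (13): every respondent in the cell
    donates to each nonrespondent with fraction proportional to w_i."""
    resp = [i for i, v in enumerate(y) if v is not None]
    total = 0.0
    for j in range(len(y)):
        if y[j] is not None:
            total += w[j] * y[j]                      # w*_jj = 1 for respondents
        else:
            donors = [i for i in resp if cells[i] == cells[j]]
            w_sum = sum(w[i] for i in donors)
            total += w[j] * sum((w[i] / w_sum) * y[i] for i in donors)
    return total

# Hypothetical data with unequal sampling weights.
y = [2.0, None, 3.0, 5.0, None, 7.0]
w = [1.0, 2.0, 3.0, 1.0, 1.0, 2.0]
cells = [0, 0, 0, 1, 1, 1]
theta_fe = ratio_estimator(y, w, cells)
theta_fefi = fefi_estimator(y, w, cells)
```

Both functions return the same value for these data, which is the practical point of (12): the fractionally imputed dataset reproduces the ratio estimator while remaining usable for other statistics.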
When the original estimator $\hat\theta$ is a linear estimator of the form (1), the $k$th replicate of $\hat\theta$ can be written

\[ \hat\theta^{(k)} = \sum_{i \in A} w_i^{(k)} y_i, \tag{15} \]

where $w_i^{(k)}$ denotes the replicate weight for the $i$th unit of the $k$th replication.

Theorem 2 provides criteria for constructing an unbiased replication variance estimator for the fractionally imputed estimator.

THEOREM 2. Let the conditions of Theorem 1 hold. Assume that $\hat\theta_I$ is of the form in (7), where the imputation fractions satisfy (6). Let the complete sample replication variance estimator of (14) be design-unbiased for the design variance of the complete sample estimator.

Let the $k$th replicate of $\hat\theta_I$ for the imputed sample be of the form

\[ \hat\theta_I^{(k)} = \sum_{j \in A} \sum_{i \in A_R} w_j^{(k)} w_{ij}^{*(k)} y_i \tag{16} \]

\[ \phantom{\hat\theta_I^{(k)}} = \sum_{i \in A_R} \alpha_i^{(k)} y_i, \tag{17} \]

where $\alpha_i^{(k)} = \sum_{j \in A} w_j^{(k)} w_{ij}^{*(k)}$, $w_j^{(k)}$ is the $k$th replication weight of unit $j$ for the complete sample and $w_{ij}^{*(k)}$ is the $k$th replicate of the imputation fraction $w^*_{ij}$. Assume replicates are constructed so that

\[ \sum_{k=1}^{L} c_k \sum_{i \in A_{Rg}} \big( \alpha_i^{(k)} - \alpha_i \big)^2 = \sum_{i \in A_{Rg}} \alpha_i^2 \quad (g = 1, 2, \ldots, G), \tag{18} \]

\[ \sum_{i \in A_R} w_{ij}^{*(k)} = 1 \quad (\text{for all } j \in A). \tag{19} \]

Then the variance estimator

\[ \hat V_I(\hat\theta_I) = \sum_{k=1}^{L} c_k \big( \hat\theta_I^{(k)} - \hat\theta_I \big)^2 \tag{20} \]

is unbiased for the variance of $\hat\theta_I$.

See Appendix 1 for the proof.

The estimator $\hat V_I(\hat\theta_I)$ is unbiased for the variance of $\hat\theta_I$, where, as in Theorem 1, the variance is defined by the joint distribution. The result holds for any survey design for which a replication variance estimator is available. Requirement (18) for the replicates follows from the representation of the variance given in (9) of Theorem 1.

There are numerous ways of constructing replicates that satisfy (18) and (19). We consider replicates of the jackknife type. A natural starting place is to consider replicates constructed by removing all of the imputed values associated with a deleted respondent and increasing the weights of the other donors. By slightly modifying this procedure we can construct an unbiased estimator of the variance.

Let $\alpha_{1i}^{(k)} = \sum_{j \in A} w_j^{(k)} w^*_{ij}$ and let $P_k$ be the set of indices of elements deleted, or that have their weights reduced, at the $k$th replication. Assume that the $P_k$ $(k = 1, 2, \ldots, L)$ are such that each element appears in one and only one of the $P_k$. The replicates will satisfy (18) if

\[ \sum_{i \in A_{Rg}} c_k \big( \alpha_i^{(k)} - \alpha_i \big)^2 - \sum_{i \in A_{Rg}} c_k \big( \alpha_{1i}^{(k)} - \alpha_i \big)^2 = \sum_{s \in P_k \cap A_{Rg}} \Big\{ \alpha_s^2 - \sum_{i=1}^{L} c_i \big( \alpha_{1s}^{(i)} - \alpha_s \big)^2 \Big\}, \tag{21} \]

for $g = 1, 2, \ldots, G$ and $k = 1, 2, \ldots, L$.

Consider a fractionally imputed procedure with $M$ distinct donors for each missing value and $w^*_{ij} = M^{-1}$ for $d_{ij} = 1$ and $j \in A_M$. Define jackknife replicate fractions for missing unit $j$ in cell $g$ of replicate $k$ by

\[ w_{ij}^{*(k)} = \begin{cases} w^*_{ij} - d_{kg}, & \text{if } M_{jk} > 0 \text{ and } i \in P_k, \\ w^*_{ij} + (M - M_{jk})^{-1} M_{jk}\, d_{kg}, & \text{if } M_{jk} > 0 \text{ and } i \notin P_k, \\ w^*_{ij}, & \text{otherwise}, \end{cases} \tag{22} \]

where $M_{jk} = \sum_{i \in P_k \cap A_R} d_{ij}$ is the number of donors in $P_k$ that are used for missing unit $j$, and the adjustments $d_{kg}$ $(k = 1, 2, \ldots, L)$, distinguished from the donor counts $d_{ij}$ by their subscripts, are to be determined. Equation (21) expressed as a quadratic function of $d_{kg}$ is

\[
\begin{aligned}
& \sum_{i \in P_k \cap A_{Rg}} c_k \Big\{ \alpha_{1i}^{(k)} - \alpha_i - d_{kg} \sum_{j \neq i} w_j^{(k)} d_{ij} \Big\}^2
+ \sum_{i \in A_{Rg},\, i \notin P_k} c_k \Big\{ \alpha_{1i}^{(k)} - \alpha_i + d_{kg} \sum_{j \neq i} (M - M_{jk})^{-1} M_{jk}\, w_j^{(k)} d_{ij} \Big\}^2 \\
& \qquad - \sum_{i \in A_{Rg}} c_k \big( \alpha_{1i}^{(k)} - \alpha_i \big)^2
- \sum_{s \in P_k \cap A_{Rg}} \Big\{ \alpha_s^2 - \sum_{i=1}^{L} c_i \big( \alpha_{1s}^{(i)} - \alpha_s \big)^2 \Big\} = 0,
\end{aligned}
\tag{23}
\]

where $\alpha_{1i}^{(k)} = \sum_{j \in A} w_j^{(k)} w^*_{ij}$. Thus, the $d_{kg}$, for $k = 1, 2, \ldots, L$ and $g = 1, 2, \ldots, G$, can be obtained by solving the $LG$ quadratic equations. If the weights are to be positive, $d_{kg}$ must be less than the smallest $w^*_{ij}$. There are certain replication methods and sampling configurations where it is not possible to find a $d_{kg}$ less than $w^*_{ij}$ satisfying (23). In such cases an unbiased variance estimator can be constructed by adding replicates where the $w_{ij}^{*(k)}$ differ from $w^*_{ij}$, but $w_i^{(k)} = w_i$ for all $i$.

5. ESTIMATION OF LINEAR COMBINATIONS

Imputed datasets are often used to construct estimates for subdomains of the population where these domains were not part of the model used in the construction of imputed values. In fact, imputation may be carried out under the conscious assumption that the imputation model holds across domains. An estimated domain mean for the full sample can be represented as

\[ \hat\theta_{zy} = \Big( \sum_{j \in A} w_j z_j \Big)^{-1} \sum_{j \in A} w_j z_j y_j, \]

where

\[ z_j = \begin{cases} 1, & \text{if } j \text{ is in the domain}, \\ 0, & \text{otherwise}. \end{cases} \]

The estimator $\hat\theta_{zy}$ is a linear estimator if the $z$ values are considered fixed. It is not linear in the survey design context because the $z_i$ vary from sample to sample.

To evaluate the performance of the fractional imputation variance estimator for the domain mean, first consider the estimation of the mean of cross-products of $z$ and $y$, where $z_i$ is known for all $i \in A$ and some missing values of $y$ have been imputed using fractional imputation. Writing the estimator as a function of the observed $y$'s, we have

\[ \hat\theta_{zy,I} = \sum_{i \in A_R} \Big( w_i z_i + \sum_{j \in A_M} w_j w^*_{ij} z_j \Big) y_i \equiv \sum_{i \in A_R} \alpha_{z,i}\, y_i. \]

In constructing replicates for variance estimation for $\sum_{i \in A} w_i y_i$, we create replicates that satisfy (18), where

\[ \alpha_i^{(k)} = w_i^{(k)} + \sum_{j \in A_M} w_j^{(k)} w_{ij}^{*(k)}. \]

Only in special cases will the replicate weights

\[ \alpha_{z,i}^{(k)} = w_i^{(k)} z_i + \sum_{j \in A_M} w_j^{(k)} w_{ij}^{*(k)} z_j \]

also satisfy equality (18). Hence, the replicates constructed for $\hat\theta_I$ will give a biased estimator of the variance of $\hat\theta_{zy,I}$. The relative bias in the fractionally imputed variance estimator for jackknife replicates for a simple random sample of size $n$, a single imputation cell and $w^*_{ij} = M^{-1}$ is of order $n^{-1}(n - r)M^{-1}$, where $r$ is the number of respondents. Thus, the bias in the fractional imputation variance estimator can be reduced by increasing $M$, where $M$ is bounded by $r$.

6. SIMULATION STUDIES

Simulation was used to compare some imputation methods. In the first study, simple random samples were generated from an infinite population composed of two equally sized cells, where, independently for each $i$,

\[ y_i \sim \begin{cases} N(2.8,\ 1.16), & \text{in cell 1}, \\ N(3.8,\ 1.735), & \text{in cell 2}. \end{cases} \tag{24} \]

For every $i$, the variable $z_i$, where the $z_i$ are independently and identically distributed indicator variables with $\mathrm{pr}(z = 1) = 0.25$, was created. The $z_i$ are independent of the $y_i$, define membership in a domain, and are always observed. The response rate for $y$ is 0.7 in cell 1 and 0.6 in cell 2. Thirty thousand samples of size 60 and samples of size 120 were generated. All of the realised samples had more than one respondent in each imputation cell.

The parameters of interest are the mean of $y$, denoted by $\theta_1$, the mean of $y$ for $z = 1$, denoted by $\theta_2$, and the fraction of $y$'s less than 2.0, denoted by $\theta_3$. The parameter $\theta_3$ is the mean of the variables

\[ u_i = \begin{cases} 1, & \text{if } y_i < 2, \\ 0, & \text{otherwise}. \end{cases} \]

The full sample estimator of $\theta_2$ is

\[ \hat\theta_2 = \Big( \sum_{i \in A} z_i \Big)^{-1} \sum_{i \in A} z_i y_i. \]

Estimators were constructed by fully efficient fractional imputation, by a fractional imputation scheme with $M = 3$ donors explained below, by the approximate Bayesian bootstrap of Rubin & Schenker (1986) with $M = 3$ imputations, by fractional imputation with $M = 10$ and by the approximate Bayesian bootstrap with $M = 10$.
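The data-generating mechanism of the first study can be sketched in a few lines of Python. This is an illustrative reading of model (24) and the surrounding text, not the authors' simulation code: the function name `simulate_sample` and the record layout are hypothetical, and the second argument of each normal distribution in (24) is assumed to be a variance.

```python
import math
import random

def simulate_sample(n, seed=0):
    """Draw one simple random sample of size n under model (24).
    Each record holds the cell, the domain indicator z and y (None if missing)."""
    rng = random.Random(seed)
    # cell -> (mean, variance, response rate); variance interpretation is assumed
    spec = {1: (2.8, 1.16, 0.7), 2: (3.8, 1.735, 0.6)}
    sample = []
    for _ in range(n):
        g = 1 if rng.random() < 0.5 else 2            # two equally sized cells
        mu, var, p_resp = spec[g]
        y = rng.gauss(mu, math.sqrt(var))
        z = 1 if rng.random() < 0.25 else 0           # domain membership, always observed
        responded = rng.random() < p_resp
        sample.append({"cell": g, "z": z, "y": y if responded else None})
    return sample

sample = simulate_sample(60)
big = simulate_sample(20000, seed=1)
cell1 = [u for u in big if u["cell"] == 1]
rate1 = sum(u["y"] is not None for u in cell1) / len(cell1)   # close to 0.7
```

A sample generated this way can be passed to any of the imputation schemes compared in the study; samples with fewer than two respondents in a cell would need to be handled separately.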
The fractional imputation method was performed using without-replacement selection of $M$ donors for each nonrespondent. Let there be $r_g$ respondents and $m_g$ missing values in cell $g$. Let $M m_g = t_g r_g + k_g$, where $t_g$ and $k_g$ are integers with $0 \le k_g < r_g$. Then $k_g$ respondents are used $t_g + 1$ times, and $r_g - k_g$ respondents are used $t_g$ times. Those to be used $t_g + 1$ times are selected with equal probabilities without replacement. The replicates were calculated using (22) and (23).
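The donor-use counts described above can be sketched as follows. The function `allocate_donors` follows the stated decomposition $M m_g = t_g r_g + k_g$. The second function, `assign_donors`, is an assumption of this example: the text does not say how the uses are distributed over recipients, so a simple round-robin layout is used that gives each recipient $M$ distinct donors whenever no donor is used more than $m_g$ times.

```python
import random

def allocate_donors(r_g, m_g, M, seed=0):
    """Use counts per respondent: k_g respondents are used t_g + 1 times,
    the remaining r_g - k_g respondents t_g times."""
    rng = random.Random(seed)
    t_g, k_g = divmod(M * m_g, r_g)
    extra = set(rng.sample(range(r_g), k_g))   # equal probabilities, without replacement
    return [t_g + 1 if i in extra else t_g for i in range(r_g)]

def assign_donors(counts, m_g):
    """Hypothetical round-robin assignment of donor uses to the m_g recipients."""
    assert max(counts) <= m_g                  # needed for distinct donors per recipient
    sequence = [i for i, c in enumerate(counts) for _ in range(c)]
    donors = [[] for _ in range(m_g)]
    for position, i in enumerate(sequence):
        donors[position % m_g].append(i)
    return donors

counts = allocate_donors(r_g=7, m_g=3, M=3)    # 9 uses = 1 * 7 + 2
donors = assign_donors(counts, m_g=3)
```

For $r_g = 7$, $m_g = 3$ and $M = 3$, two respondents are used twice and five once, so the donor weights within the cell differ by at most one imputation fraction.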

The variance estimator for multiple imputation adopted from Rubin (1987) is

\[ \hat V_M = W_M + (1 + M^{-1}) B_M, \tag{25} \]

where

\[ (W_M, \bar\theta_M) = M^{-1} \sum_{t=1}^{M} \big( \hat V_{(t)}, \hat\theta_{(t)} \big), \qquad B_M = (M - 1)^{-1} \sum_{t=1}^{M} \big( \hat\theta_{(t)} - \bar\theta_M \big)^2, \]

$M$ is the number of the multiple imputations, $\hat V_{(t)}$ is the complete-data variance estimator applied to the $t$th imputation dataset, and $\hat\theta_{(t)}$ is a version of $\hat\theta$ computed from the $t$th imputation dataset.

Table 1 shows the means, the variances and the standardised variances of the point estimators under the five imputation schemes. The standardised variance is the variance divided by the variance of the fully efficient fractionally imputed estimator, and multiplied by 100.

Table 1: First simulation. Means, variances and standardised variances of the point estimators under five different imputation schemes, based on samples
Columns: n; Parameter; Imputation scheme; Mean; Variance; Std var.
Rows: for each of n = 60 and n = 120, the parameters Mean ($\theta_1$), Domain mean ($\theta_2$) and Proportion ($\theta_3$); for each parameter, the schemes FEFI, FI (M=3), ABB (M=3), FI (M=10) and ABB (M=10). The numerical entries are not preserved in this copy.
FEFI, fully efficient fractional imputation; FI, fractional imputation; ABB, approximate Bayesian bootstrap; Std var, standardised variance.

All five imputation methods are unbiased for the three parameters and the Monte Carlo results are consistent with that property. Under model (24), the theoretical variance of the full sample estimator of $\theta_1$ is for $n = 60$, and the variance of the fully efficient fractional imputation estimator of $\theta_1$ is

\[ \mathrm{var}(\hat\theta_{1,FEFI}) = n^{-2} \big\{ \mathrm{var}(n_1 \mu_1 + n_2 \mu_2) + E(n_1^2 r_1^{-1}) \sigma_1^2 + E(n_2^2 r_2^{-1}) \sigma_2^2 \big\} \doteq \begin{cases} \ \ , & \text{if } n = 60, \\ \ \ , & \text{if } n = 120, \end{cases} \]

where $\mu_g$ is the mean of cell $g$, $\sigma_g^2$ is the variance of cell $g$, and $n_g$ is the sample number falling in cell $g$. The theoretical increases in the variance of the estimator of the mean $\theta_1$ for $n = 60$ due to missing values are , , , and for fully efficient fractional imputation, fractional imputation with $M = 3$, the approximate Bayesian bootstrap imputation with $M = 3$, fractional imputation with $M = 10$ and the approximate Bayesian bootstrap imputation with $M = 10$, respectively. Thus, the corresponding approximate relative efficiencies as estimators of the mean of the missing values are 100, 98.9, 75.5, 99.8 and 91.1 for a sample of size 60. The corresponding relative efficiencies for $n = 120$ are 100, 98.0, 75.2, 99.8 and

The Monte Carlo results are in general agreement with the theoretical approximations. The Monte Carlo variances of Table 1 show that the fully efficient fractional imputation is always the most efficient and the approximate Bayesian bootstrap imputation with $M = 3$ is always the least efficient. The fractional imputation scheme with $M = 3$ shows only about 1% loss in efficiency relative to fully efficient fractional imputation for estimation of the mean. Fractional imputation with $M = 10$ has less than 0.5% loss in efficiency relative to the fully efficient procedure for the mean. Losses are somewhat larger for the domain mean with a loss of about 8% for $M = 3$ and a loss of about 2% for $M = 10$ relative to fully efficient fractional imputation.
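The multiple imputation variance formula (25) used for the approximate Bayesian bootstrap comparisons can be written out directly. The sketch below is a plain transcription of (25); the function name `rubin_variance` and the three made-up point estimates and variance estimates are assumptions for illustration only.

```python
from statistics import mean, variance

def rubin_variance(estimates, within_variances):
    """Combine M completed-data results as in (25):
    V_M = W_M + (1 + 1/M) * B_M."""
    M = len(estimates)
    W = mean(within_variances)        # average complete-data variance estimate
    B = variance(estimates)           # between-imputation variance, divisor M - 1
    return W + (1.0 + 1.0 / M) * B

# Hypothetical results from M = 3 imputed datasets.
theta_t = [3.0, 3.2, 3.1]
v_t = [0.040, 0.050, 0.045]
v_m = rubin_variance(theta_t, v_t)
```

Here $W_M = 0.045$ and $B_M = 0.01$, so the combined estimate is $0.045 + (4/3)(0.01) \approx 0.0583$. With only $M - 1 = 2$ degrees of freedom behind $B_M$, the between-imputation component is estimated with little precision.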
Table 2 shows the mean, relative bias, relative variance, standardised variance and t-statistics for the variance estimators. The relative bias of the estimated variance is the Monte Carlo bias divided by the Monte Carlo mean. The relative variance is the Monte Carlo variance of the variance estimator divided by the square of the variance, where the variance is given in Table 1. The t-statistic for testing the hypothesis of zero bias is the Monte Carlo estimated bias divided by the Monte Carlo standard error of the estimated bias.

Table 2: First simulation. Means, relative biases, relative variances, standardised variances and t-statistics for the variance estimator, based on samples
Columns: n; Method; Mean; RB (%); RV (%); Std var; t-statistic.
Rows: for each of n = 60 and n = 120 and for each of the three parameters, the methods FEFI, FI (M=3), ABB (M=3), FI (M=10) and ABB (M=10). The numerical entries are not preserved in this copy.
FEFI, fully efficient fractional imputation; FI, fractional imputation; ABB, approximate Bayesian bootstrap; RB, relative bias; RV, relative variance; Std var, standardised variance.

The fractional imputation variance estimation procedures are unbiased for the variance of the mean and the Monte Carlo results for $\theta_1$ and $\theta_3$ are in agreement with that property. For small imputation cell sizes, the approximate Bayesian bootstrap variance estimators are biased for the variance of the mean. The bias of the multiple imputation variance estimator for the approximate Bayesian bootstrap method is, up to $O(n^{-2})$ terms,

\[ E(\hat V_M) - \mathrm{var}(\hat\theta_{ABB}) \doteq -n^{-2} \sum_{g=1}^{2} \Big\{ \frac{(n_g - r_g)\, n_g}{r_g} \Big( \frac{2}{n} + \frac{1}{n_g} + \frac{1}{r_g} \Big) \sigma_g^2 \Big\}. \]

The approximate percent relative bias of the variance estimator for $\hat\theta_{1,M}$ is $-3.9\%$ for $(n, M) = (60, 10)$ and $-1.9\%$ for $(n, M) = (120, 10)$, agreeing with the Monte Carlo results.

The fractionally imputed variance estimator is biased for the variance of $\hat\theta_2$. The bias comes from two sources, the first being the bias described in Section 5. The second source is the bias in the jackknife variance estimator for a ratio, where the domain mean is a ratio estimator. The relative bias in the estimated variance of the domain mean computed with the full sample is about 8% for $n = 60$ and about 3% for $n = 120$. The ratio bias decreases as $n$ increases and the imputation bias decreases as $M$ increases.

Table 2 illustrates that multiple imputation with the approximate Bayesian bootstrap produces a seriously biased estimator of the variance of the estimator of $\theta_2$. The existence of a bias in the multiple imputation variance estimator for a domain mean, where the domain information is not used for imputation, was pointed out by Fay (1992). The relative biases of the approximate Bayesian bootstrap variance estimators for $\hat\theta_2$ are over 50%, for both sample sizes. This is judged to be a serious shortcoming of multiple imputation variance estimation because the construction of estimates for small domains is one of the reasons to choose imputation over weighting as the adjustment for nonresponse.

The bias in the multiple imputation variance estimator for the domain mean results primarily because the $B_M$ of (25) is a biased estimator of the effect of imputing for missing values. The model used to generate the data is the model used in imputation in that the analysis variable is assumed to be unrelated to $z$. The assumed independence is reflected in the imputation in that the assignment of donors is made without consideration of $z$. As a result, the estimator for a domain calculated with imputed data can have smaller variance than the estimator computed from a complete sample. The multiple imputation variance estimator does not reflect the fact that some of the imputed values used in the domain come from observations outside the domain. Increasing $M$ or increasing sample size reduces only the bias in the multiple imputation variance estimator due to the bias in the jackknife estimator of the variance of a ratio.

The variance estimators for the fractionally imputed procedures are more stable than those of the approximate Bayesian bootstrap methods. Fully efficient fractional imputation has the uniformly smallest variance of the variance estimators and the approximate Bayesian bootstrap method with $M = 3$ has the largest variance of the variance estimators. The relative efficiency of the fractional imputation variance estimator relative to the multiple imputation variance estimator with $M = 10$ for $\theta_1$ is 140% for $n = 60$ and 197% for $n = 120$. The relative efficiency of the fractional imputation variance estimator with $M = 3$ relative to the multiple imputation variance estimator with $M = 3$ for $\theta_1$ is 375% for $n = 60$ and 664% for $n = 120$.

The variance of the imputed estimator has two components, the variance of the full sample estimator and the variance due to imputation. The approximate Bayesian bootstrap method estimates the two components of variance separately.
The degrees of freedom for the estimator of the variance due to imputation is $M - 1$ and, hence, does not increase as $n$ increases. Although the variance of the estimated sampling variance is of order $n^{-1}$, the estimated variance due to imputation places a lower bound on the variance of the estimated variance. Unlike the multiple imputation variance estimator, the variance of the fractional imputation estimated variance is inversely proportional to the number of sampling units. Hence, for fixed $M$, the relative efficiency of the fractional imputation variance estimator relative to the approximate Bayesian bootstrap increases without bound as sample size increases.

Table 3 displays the means and variances of the lengths of 95% confidence intervals, and the size of a two-sided test for the true value. The coverage probability of a nominal 95% interval is one minus the size. The confidence intervals are $(\hat\theta - t_\nu \hat V^{1/2},\ \hat\theta + t_\nu \hat V^{1/2})$, where $t_\nu$ is the upper 2.5 percentile of the t-distribution with $\nu$ degrees of freedom. The degrees of freedom for $t$ for fractional imputation methods is one less than the number of respondents. For the approximate Bayesian bootstrap method, the degrees of freedom is that suggested by Barnard & Rubin (1999).

The confidence intervals based on fractional imputation show better performance than the confidence intervals based on the approximate Bayesian bootstrap. For $\theta_1$ and $n = 60$ the fractional imputation confidence intervals have coverages near the nominal level of 95%. The size given in Table 3 is the size of a nominal 5% two-tailed test for the mean and is one minus the coverage probability. The approximate Bayesian bootstrap with $M = 10$ produces intervals that average 2% narrower than the fractional imputation intervals, but the coverage is significantly less than 95%. Also the lengths of the approximate Bayesian bootstrap intervals are much more variable.
The coverages of confidence intervals for h 3 constructed with the fractional imputation procedure are superior to those of the approximate Bayesian bootstrap procedure, but all

13 Fractional hot deck imputation Table 3: First simulation. Standardised means of confidence interval width, standardised variances of confidence interval width and sizes of a twotailed test, based on samples 571 n Method Std mean Std var Size (%) 60 h FEF F (M=3) ABB (M=3) F (M=10) ABB (M=10) h FEF F (M=3) ABB (M=3) F (M=10) ABB (M=10) h FEF F (M=3) ABB (M=3) F (M=10) ABB (M=10) h FEF F (M=3) ABB (M=3) F (M=10) ABB (M=10) h FEF F (M=3) ABB (M=3) F (M=10) ABB (M=10) h FEF F (M=3) ABB (M=3) F (M=10) ABB (M=10) FF, fully efficient fractional imputation; F, fractional imputation; ABB, approximate Bayesian bootstrap; Std mean, standardised mean; Std var, standardised variance. procedures have coverages significantly less than 95%. Coverages deviate from 95% because the distribution of the test statistic deviates from normality. Coverages of intervals for h 3 generally improve as sample size increases for all procedures. The coverages of the fractional imputation procedure are close to 95% for h 2, the domain mean. The sizes of the approximate Bayesian bootstrap procedures for the domain mean are much smaller than the nominal 5% because the variance is seriously overestimated. The sizes of the approximate Bayesian bootstrap procedure for the domain mean do not improve as sample size increases. We also simulated the performance of a multiple imputation procedure, similar to that described by Rubin (1987, p. 83), that assumes the distribution of y to be known. n our case the distribution of y in a cell is normal and we call the procedure normal

distribution multiple imputation. The approximate Bayesian bootstrap imputation and normal distribution multiple imputation procedures have similar efficiencies for $\theta_1$ and $\theta_2$, with the approximate Bayesian bootstrap marginally more efficient. The normal distribution multiple imputation is slightly more efficient for $\theta_3$ because it uses correct information about the distribution. The sizes of normal distribution multiple imputation were slightly better than the approximate Bayesian bootstrap for $\theta_1$ and worse for $\theta_2$. The normal distribution multiple imputation overestimated the variance of the imputed estimator of $\theta_2$ by more than 50% for both sample sizes. The normal distribution multiple imputation procedure was generally the best procedure for $\theta_3$, although it overestimated the variance and thus led to sizes less than 5% for $n=120$.

In a second simulation we generated simple random samples of clusters from an infinite population of clusters, where the population contains four types of cluster. The clusters are composed of elements selected from the two imputation cells. One-quarter of the clusters have one element from cell one and five elements from cell two; one-quarter of the clusters have two elements from cell one and four from cell two; one-quarter of the clusters have four elements from cell one and two elements from cell two; and one-quarter of the clusters have five elements from cell one and one element from cell two. The response rates for the two cells are 0.7 in cell 1 and 0.6 in cell 2, as in the first simulation.

Three scenarios were simulated. In the first scenario, called population (A), the $y$ variable is generated by specification (24). In addition to the $y$-variable, a variable $x$ was generated, distributed as a uniform random variable independent of $y$. The estimator of the mean of $y$ and the regression coefficient, i.e. slope, for the regression of $y$ on $x$ were computed.
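The cluster design of the second simulation can be sketched as follows. This is a minimal illustration, not the authors' code: specification (24) is not reproduced in this section, so `placeholder_y` is a hypothetical stand-in, the range (0, 1) for the uniform $x$ is assumed, and response is assumed to be an independent Bernoulli draw for each element.

```python
import random

# Cluster type -> (elements from cell 1, elements from cell 2), as in the text.
CLUSTER_TYPES = {1: (1, 5), 2: (2, 4), 3: (4, 2), 4: (5, 1)}
RESPONSE_RATE = {1: 0.7, 2: 0.6}


def placeholder_y(rng, cell):
    # Hypothetical stand-in for specification (24); replace with the real model.
    return rng.gauss(float(cell), 1.0)


def draw_cluster(rng, draw_y=placeholder_y):
    """Draw one cluster; the four types are equally likely."""
    cluster_type = rng.choice([1, 2, 3, 4])
    n_cell1, n_cell2 = CLUSTER_TYPES[cluster_type]
    cluster = []
    for cell in [1] * n_cell1 + [2] * n_cell2:
        cluster.append({
            "type": cluster_type,
            "cell": cell,
            "y": draw_y(rng, cell),
            "x": rng.random(),  # uniform on (0, 1), drawn independently of y
            "responds": rng.random() < RESPONSE_RATE[cell],
        })
    return cluster


def draw_sample(n_clusters=20, seed=0):
    """Simple random sample of clusters from the infinite population."""
    rng = random.Random(seed)
    return [draw_cluster(rng) for _ in range(n_clusters)]


sample = draw_sample()
```

Every cluster has six elements, and across the four types the two cells are equally represented, so the average response rate is 0.65.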
The slope was calculated as

$$\frac{\sum_{i \in A_R} \sum_{j \in A} (x_j - \bar{x})(y_i - \bar{y}_I)\, w^*_{ij}}{\sum_{i \in A} (x_i - \bar{x})^2},$$

where $w^*_{ij} = M^{-1} d_{ij}$ if $j \in A_M$ and $i \in A_R$, and $\bar{y}_I$ is the mean of the imputed sample. Replicates for variance estimation were calculated using (22) and (23). For the population generated in this manner, model (2) holds for the variable $y$. Hence, as the numbers in the first block of rows in Table 4 indicate, the imputed estimator is unbiased for the mean of $y$. Also, because $y$ is independent of $x$, the true slope is zero and the slope estimated with imputed data is unbiased for zero. The multiple imputation variance estimator is slightly biased for the variance of the mean because of the cell size bias described earlier.

In this experiment, unlike the experiment of Table 2, the variance of the estimated multiple imputation variance of the mean for $M=10$ is smaller than the variance of the fractional imputation variance estimator for $M=10$. To estimate the full sample variance, the multiple imputation procedure uses the average of the full sample variance estimators for ten sets of imputed data. Since the imputed cluster samples are not perfectly correlated, the average of the $M$ estimated full sample variances is more efficient for the full sample variance than the usual variance estimator based on the full sample. The relative variance of the full sample variance estimator is about 10%, while the relative variance of the multiple imputation variance estimator of the full sample variance is about 7%. Thus, for multiple imputation with $M=10$, the contribution of the $B_M$ component to the total variance of the estimated variance is such that the relative variance of the multiple imputation variance estimator is less than 10%.
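The slope computation with fractional weights can be sketched as below. The sketch assumes equal sampling weights, takes the fraction to be one for a respondent paired with itself, and gives each donor of a missing value the fraction $1/M$; the function name and data layout are hypothetical.

```python
def fractional_slope(x, y, donors):
    """Slope of y on x computed from a fractionally imputed sample.

    x      -- x values for all n sampled elements
    y      -- y values, with None where y is missing
    donors -- dict: index of a missing element -> list of donor indices
    Equal sampling weights are assumed (an assumption of this sketch).
    """
    n = len(x)
    x_bar = sum(x) / n

    # (recipient j, donor i, fraction w*_ij); a respondent donates to itself.
    triples = []
    for j in range(n):
        if y[j] is not None:
            triples.append((j, j, 1.0))
        else:
            m = len(donors[j])
            triples.extend((j, i, 1.0 / m) for i in donors[j])

    # Mean of the imputed sample.
    y_bar = sum(frac * y[i] for _, i, frac in triples) / n

    numerator = sum((x[j] - x_bar) * (y[i] - y_bar) * frac
                    for j, i, frac in triples)
    denominator = sum((xj - x_bar) ** 2 for xj in x)
    return numerator / denominator


# Complete data: reduces to the ordinary least squares slope.
slope_complete = fractional_slope([0, 1, 2, 3], [1, 3, 5, 7], {})
# One missing value, imputed from the three respondents.
slope_imputed = fractional_slope([0, 1, 2, 3], [1, 3, None, 7], {2: [0, 1, 3]})
```

With no missing values the function returns the usual regression slope, which gives a quick check on the weighting.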

Table 4. Second simulation. Properties of estimators for different estimation schemes and populations based on samples of 20 clusters
Columns: Parameter (Population); Method; Mean; Variance; RB (%) of Est. var.; RV (%) of Est. var.; Power (%). Rows: Mean (A), Slope (A), Mean (B) and Slope (C), each with the methods FI ($M=3$), MI ($M=3$), FI ($M=10$) and MI ($M=10$). [Numerical entries not preserved.]
FI, fractional imputation; MI, multiple imputation; RB, relative bias; RV, relative variance; Est. var., estimated variance.

The relative variance of the fractional imputation variance estimator is about 20% larger than the relative variance of the full sample variance estimator for the full sample. The fractional imputation variance estimator is based on the same number of replicates as the full sample estimator and those replicates reflect the increase in variability due to imputation.

The nature of the bias in the fractional imputation estimated variance of the slope was described in § 5. As that discussion suggested, the empirical relative bias for the fractional imputation variance estimator declines from 15% at $M=3$ to 7% at $M=10$. The multiple imputation variance estimator has a bias of about 80% for the slope in this simulation. The nature of the bias in the multiple imputation estimated variance of the slope is that described in the discussion for the estimation of domains.

In population (B), the variable $y_{2i}$ was generated by the model $y_{2i} = y_i + b_i$, where $y_i$ is defined in (24), $b_i = 0.8$ if the element is in a cluster of type 1, $b_i = 0.4$ if the element is in a cluster of type 2, $b_i = -0.4$ if the element is in a cluster of type 3, and $b_i = -0.8$ if the element is in a cluster of type 4. The remaining simulation parameters are as defined for population (A). The elements in a cell of population (B) are not identically distributed.
However, the imputed estimator of the mean is unbiased because the response rate is the same for all elements in a cell. The multiple imputation estimator of the variance of the full sample estimator is constructed under the assumption that the imputed values in the cluster have the same mean as the missing values. Since that condition does not hold in population (B), the multiple

imputation variance estimator has a negative bias. On the other hand, the method of estimation employed by fractional imputation reflects the presence of a cluster effect, and the fractional imputation variance estimator for $M=10$ is nearly unbiased.

To study the behaviour of the procedures under a misspecified model, a variable was generated by the rule $y_{3i} = 0.7x_i + y_i$, where $y_i$ is defined in (24). For this population, called population (C), the remaining parts of the data configuration are as described for population (A). The imputation procedures are as previously described and assume that $y_3$ is independent of $x$. As a result, the estimator of the slope of $y_3$ on $x$ is biased. The expected value of the imputed estimator is 0.455, which is the true value, 0.7, multiplied by the average response rate of 0.65; see Table 4.

A desirable feature of an imputation procedure is the ability to identify incorrect model specifications. Given that the imputation is constructed under the assumption that $y_3$ is independent of $x$, one might consider testing the independence assumption by using the imputed data to compute the regression of $y_3$ on $x$. The entries in the final column of Table 4 for the estimated slope of population (C) are the powers of the nominal 5% test of the hypothesis that the slope is zero. Since the multiple imputation variance estimator has a bias of about 80%, the power of a test based on the multiple imputation variance is very poor. The bias of the fractional imputation variance estimator is much smaller than that of multiple imputation and hence the power for the test based on the fractional imputation estimated variance is much larger than that of multiple imputation. Thus the fractional imputation variance provides a much greater chance of identifying an incorrect model specification than does multiple imputation.
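The attenuation of the slope towards the true value times the response rate can be checked with a small Monte Carlo sketch. It is a deliberate simplification and not the simulation reported in Table 4: it uses a single imputation cell, one randomly chosen donor per missing value, a uniform $x$ on (0, 1), and an assumed noise standard deviation of 0.1.

```python
import random


def imputed_slope(n=50000, beta=0.7, response_rate=0.65, seed=1):
    """Slope of y on x after hot deck imputation that ignores x.

    All parameter values are illustrative assumptions of this sketch.
    """
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    y = [beta * xi + rng.gauss(0.0, 0.1) for xi in x]
    responds = [rng.random() < response_rate for _ in range(n)]

    # Donors are drawn without regard to x, as if y were independent of x.
    donor_pool = [yi for yi, r in zip(y, responds) if r]
    y_imp = [yi if r else rng.choice(donor_pool) for yi, r in zip(y, responds)]

    x_bar = sum(x) / n
    y_bar = sum(y_imp) / n
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y_imp))
    denominator = sum((xi - x_bar) ** 2 for xi in x)
    return numerator / denominator


slope = imputed_slope()
```

Imputed values carry no information about $x$, so only the respondents contribute to the covariance and the estimated slope settles close to $0.7 \times 0.65$ rather than 0.7.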
ACKNOWLEDGEMENT

This research was partially supported by a subcontract between Westat and Iowa State University under a contract between Westat and the Department of Education. We thank two referees, the associate editor and the editor for useful comments.

APPENDIX 1

Proofs

Proof of Theorem 1. By the assumptions of the imputation mechanism and by (3), independently for each $i \in A_g$,

$$y_i \mid (A, A_R, d) \sim (\mu_g, \sigma_g^2). \tag{A1}$$

Let the linear estimator for the full sample be as defined in (1) and let $\hat{\theta}_I$ be the imputed estimator. Under the model (A1),

$$E(\hat{\theta}_I \mid A, A_R, d) = E\Big(\sum_{g=1}^{G} \sum_{i \in A_{Rg}} \alpha_i \mu_g \,\Big|\, A, A_R, d\Big) = \sum_{g=1}^{G} \sum_{i \in A_{Rg}} \alpha_i \mu_g = \sum_{g=1}^{G} \sum_{i \in A_g} w_i \mu_g, \tag{A2}$$

where the last equality follows from (6), and result (8) is established.

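The conditional variance of $\hat{\theta}_I$ is

$$\mathrm{var}(\hat{\theta}_I \mid A, A_R, d) = \sum_{g=1}^{G} \sum_{i \in A_{Rg}} \sum_{h=1}^{G} \sum_{j \in A_{Rh}} \alpha_i \alpha_j\, \mathrm{cov}(y_i, y_j \mid A, A_R, d) = \sum_{g=1}^{G} \sum_{i \in A_{Rg}} \alpha_i^2 \sigma_g^2, \tag{A3}$$

by (A1). Inserting (A2) and (A3) into

$$\mathrm{var}(\hat{\theta}_I) = \mathrm{var}\{E(\hat{\theta}_I \mid A, A_R, d)\} + E\{\mathrm{var}(\hat{\theta}_I \mid A, A_R, d)\}, \tag{A4}$$

we obtain result (9). To show (10), note that, by (A2),

$$E(\hat{\theta}_I - \theta_N \mid A, A_R, d) = \sum_{g=1}^{G} \sum_{i \in A_g} w_i \mu_g - \sum_{g=1}^{G} \sum_{i \in U_g} \mu_g \tag{A5}$$

under the model (A1). For $\theta_N = \sum_{i=1}^{N} y_i$,

$$\mathrm{cov}(\hat{\theta}_I, \theta_N \mid A, A_R, d) = \sum_{g=1}^{G} \sum_{i \in A_{Rg}} \alpha_i \sigma_g^2, \qquad \mathrm{var}(\theta_N \mid A, A_R, d) = \sum_{g=1}^{G} \sum_{i \in U_g} \sigma_g^2,$$

and $\mathrm{var}(\hat{\theta}_I \mid A, A_R, d)$ is given in (A3). Therefore,

$$\mathrm{var}(\hat{\theta}_I - \theta_N \mid A, A_R, d) = \sum_{g=1}^{G} \sum_{i \in A_{Rg}} (\alpha_i^2 - 2\alpha_i) \sigma_g^2 + \sum_{g=1}^{G} \sum_{i \in U_g} \sigma_g^2.$$

Note that, by (6),

$$E\Big(\sum_{g=1}^{G} \sum_{i \in A_{Rg}} \alpha_i \sigma_g^2\Big) = E\Big(\sum_{g=1}^{G} \sum_{i \in A_g} w_i \sigma_g^2\Big) = \sum_{g=1}^{G} \sum_{i \in U_g} \sigma_g^2.$$

Therefore,

$$E\{\mathrm{var}(\hat{\theta}_I - \theta_N \mid A, A_R, d)\} = E\Big\{\sum_{g=1}^{G} \sum_{i \in A_{Rg}} (\alpha_i^2 - 2\alpha_i) \sigma_g^2 + \sum_{g=1}^{G} \sum_{i \in U_g} \sigma_g^2\Big\} = E\Big\{\sum_{g=1}^{G} \sum_{i \in A_{Rg}} (\alpha_i^2 - \alpha_i) \sigma_g^2\Big\}. \tag{A6}$$

Therefore, result (10) follows by inserting (A5) and (A6) into a decomposition of the form (A4) for $\hat{\theta}_I - \theta_N$. □

Proof of Theorem 2. We write

$$E\Big\{\sum_{k=1}^{L} c_k (\hat{\theta}_I^{(k)} - \hat{\theta}_I)^2\Big\} = E\Big[\sum_{k=1}^{L} c_k \{E(\hat{\theta}_I^{(k)} - \hat{\theta}_I \mid A, A_R, d)\}^2\Big] + E\Big[\sum_{k=1}^{L} c_k \{\mathrm{var}(\hat{\theta}_I^{(k)} - \hat{\theta}_I \mid A, A_R, d)\}\Big]. \tag{A7}$$

We can write

$$\hat{\theta}_I^{(k)} - \hat{\theta}_I = \sum_{i \in A_R} \alpha_i^{(k)} y_i - \sum_{i \in A_R} \alpha_i y_i$$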

and, by (6) and (19), $\sum_{i \in A_R \cap U_g} (\alpha_i^{(k)}, \alpha_i) = \sum_{i \in A \cap U_g} (w_i^{(k)}, w_i)$. Therefore, under the cell mean model, we have

$$E(\hat{\theta}_I^{(k)} - \hat{\theta}_I \mid A, A_R, d) = \sum_{g=1}^{G} \sum_{i \in A \cap U_g} (w_i^{(k)} - w_i) \mu_g, \tag{A8}$$

$$\mathrm{var}(\hat{\theta}_I^{(k)} - \hat{\theta}_I \mid A, A_R, d) = \sum_{g=1}^{G} \sum_{i \in A_R \cap U_g} (\alpha_i^{(k)} - \alpha_i)^2 \sigma_g^2. \tag{A9}$$

If we use (A8) and the unbiasedness of the complete sample variance estimator, the first term on the right-hand side of the decomposition in (A7) reduces to $\mathrm{var}\{\sum_{g=1}^{G} \sum_{i \in A \cap U_g} w_i \mu_g\}$. By (A9) and (18), the second term on the right-hand side of (A7) is $E\{\sum_{g=1}^{G} \sum_{i \in A_R \cap U_g} \alpha_i^2 \sigma_g^2\}$. Therefore, the expectation in (A7) is equal to the variance given in (9). □

APPENDIX 2

Illustrative calculation for fractional imputation

We illustrate the weight construction with a cluster sample of four clusters. The data are given in Table A1. To simplify the presentation, we use initial weights of one. We use a single imputation cell and assume that $y_{11}$, $y_{21}$ and $y_{22}$ are the randomly selected donors for missing $y_{13}$ and that $y_{11}$, $y_{21}$ and $y_{31}$ are the three randomly selected donors for missing $y_{33}$. Fractional imputation is used with three donors for each missing value and with equal fractions assigned to each. The weights for the imputed dataset are given in the column headed Original.

Table A1. Replicate weights
Columns: Cluster; Obs.; Imp. value; Weights: Original, Rep. 1, Rep. 2, Rep. 3, Rep. 4. [Table entries not preserved.]
Obs., observation; Imp., imputed; Rep., replicate.

In general, it is a good idea to construct replicates with $c_k = 1$. Such replicates will generally reduce the degrees-of-freedom effect of observing a subset of respondents.
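The weights carried by each respondent in the imputed dataset can be sketched as follows, using the donor assignments stated above, initial weights of one, and equal fractions of one-third. The respondent labels `y12` and `y32` are hypothetical placeholders for the respondents that are not named in this section; only the four donors and `y41` appear in the text.

```python
from fractions import Fraction


def respondent_weights(respondents, donors, weight=1):
    """Total weight on each respondent: its own weight plus donated shares.

    respondents -- labels of the responding elements
    donors      -- dict: missing element -> list of its donors
    Every element is assumed to have the same initial weight.
    """
    alpha = {r: Fraction(weight) for r in respondents}
    for recipient, donor_list in donors.items():
        share = Fraction(weight, len(donor_list))  # equal fractions
        for donor in donor_list:
            alpha[donor] += share
    return alpha


# y12 and y32 are hypothetical labels, used only to make seven respondents.
respondents = ["y11", "y12", "y21", "y22", "y31", "y32", "y41"]
donors = {"y13": ["y11", "y21", "y22"], "y33": ["y11", "y21", "y31"]}
alpha = respondent_weights(respondents, donors)
```

A donor used for both missing values, such as $y_{11}$, ends with weight $1 + 1/3 + 1/3$, and the weights sum to the total number of sampled elements, so the imputed dataset preserves the sum of the original weights.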
For a simple random sample of size $n$, a jackknife replicate for the mean with $c_k = 1$ can be created by reducing the weight on the deleted unit from $n^{-1}$ to $n^{-1} - \{n^{-3}(n-1)\}^{1/2}$, and increasing the weights on the remaining units by $(n-1)^{-1}\{n^{-3}(n-1)\}^{1/2}$, so that the replicate weights still sum to one.
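A minimal sketch of these replicate weights is given below. It assumes that each remaining unit gains $(n-1)^{-1}\{n^{-3}(n-1)\}^{1/2}$, which keeps the weights summing to one and reproduces the basic replicate weights 0.134 and 1.289 that appear in Table A1 when $n=4$ and the weights are rescaled to one. Function names are illustrative.

```python
from math import sqrt


def jackknife_weights(n, deleted):
    """Replicate weights for the mean with c_k = 1 (simple random sample)."""
    delta = sqrt((n - 1) / n ** 3)
    return [1 / n - delta if i == deleted else 1 / n + delta / (n - 1)
            for i in range(n)]


def replicate_variance(y):
    """Sum over replicates of squared deviations of replicate means, c_k = 1."""
    n = len(y)
    y_bar = sum(y) / n
    total = 0.0
    for k in range(n):
        w = jackknife_weights(n, k)
        rep_mean = sum(wi * yi for wi, yi in zip(w, y))
        total += (rep_mean - y_bar) ** 2
    return total


weights = jackknife_weights(4, 0)
scaled = [round(4 * w, 3) for w in weights]  # weights of one instead of 1/4
v_hat = replicate_variance([1.0, 2.0, 3.0, 4.0])
```

With these weights the replicate variance estimator equals the usual $s^2/n$ for the sample mean, which is what $c_k = 1$ is meant to achieve.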

Thus, the first replicate for the mean with a sample of size 4 is created by decreasing the weight on the first cluster from one-quarter to 0.0335 and changing the other three weights to 0.3222. In the illustration we use weights of one, instead of one-quarter, and the basic replicate weights given in Table A1 are 0.134 and 1.289. The computations for Table A1 were carried out as follows.

Step 1. Write the fractional replication weights in terms of the $d_k$; see Table A1.

Step 2. Calculate $\alpha_{1i}^{(k)} = \sum_{j \in A} w_j^{(k)} w^*_{ij}$ and $T_i = \alpha_i^2 - \sum_{k=1}^{L} c_k (\alpha_{1i}^{(k)} - \alpha_i)^2$; see Table A2. Then the determining equation for $d_k$, in (21) of the text, can be written

$$\sum_{i \in A_{Rg}} c_k (\alpha_i^{(k)} - \alpha_i)^2 - \sum_{i \in A_{Rg}} c_k (\alpha_{1i}^{(k)} - \alpha_i)^2 = \sum_{i \in A_{Rg} \cap P_k} T_i, \tag{A10}$$

for $k = 1, 2, 3, 4$.

Table A2. Weights for respondents
Columns: Obs.; Original, $\alpha_i$; Rep. 1, $\alpha_{1i}^{(1)}$; Rep. 2, $\alpha_{1i}^{(2)}$; Rep. 3, $\alpha_{1i}^{(3)}$; Rep. 4, $\alpha_{1i}^{(4)}$; $T_i$. Rows: the seven respondents. [Numerical entries not preserved.]
Obs., observation; Rep., replicate.

Step 3. Write the determining equations (A10) in terms of the $d_k$, for $k = 1$, $k = 2$ and $k = 3$. [Displayed equations not preserved.] There is no donor in cluster four so there is no $d$ to be computed for replicate four.

Step 4. Solve the determining equations to get $d_1 = 0.267$, $d_2 = 0.223$ and $d_3 = 0$.

REFERENCES

BARNARD, J. & RUBIN, D. B. (1999). Small-sample degrees of freedom with multiple imputation. Biometrika 86.
BRICK, J. M. & KALTON, G. (1996). Handling missing data in survey research. Statist. Meth. Med. Res. 5.
CHEN, J. & SHAO, J. (2001). Jackknife variance estimation for nearest-neighbor imputation. J. Am. Statist. Assoc. 96.
FAY, R. E. (1992). When are inferences from multiple imputation valid? In Proc. Survey Res. Meth. Sect., Am. Statist. Assoc. Washington, DC: American Statistical Association.
FAY, R. E. (1996). Alternative paradigms for the analysis of imputed survey data. J. Am. Statist. Assoc. 91.
KALTON, G. & KISH, L. (1984). Some efficient random imputation methods. Commun. Statist. A 13.
LITTLE, R. J. A. & RUBIN, D. B. (2002). Statistical Analysis with Missing Data. New York: Wiley.
RAO, J. N. K. & SHAO, J. (1992). Jackknife variance estimation with survey data under hot deck imputation. Biometrika 79.
RUBIN, D. B. (1978). Multiple imputations in sample surveys: A phenomenological Bayesian approach to nonresponse. In Proc. Survey Res. Meth. Sect., Am. Statist. Assoc. Washington, DC: American Statistical Association.
RUBIN, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
RUBIN, D. B. & SCHENKER, N. (1986). Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. J. Am. Statist. Assoc. 81.
RUST, K. F. & RAO, J. N. K. (1996). Variance estimation for complex surveys using replication techniques. Statist. Meth. Med. Res. 5.
SÄRNDAL, C.-E. (1992). Methods for estimating the precision of survey estimates when imputation has been used. Survey Methodol. 18.
TOLLEFSON, M. & FULLER, W. A. (1992). Variance estimation for samples with random imputation. In Proc. Survey Res. Meth. Sect., Am. Statist. Assoc. Washington, DC: American Statistical Association.
WOLTER, K. M. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.
[Received May Revised February 2004]
