Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling

1 Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling
Jae-Kwang Kim, Iowa State University
June 26,
Joint work with Shu Yang

2 Introduction 1 Introduction 2 Review 3 Fractional Hot deck imputation 4 Simulation Study 5 Conclusion Kim (ISU) Fractional Hot Deck Imputation June 26, / 44

3 Introduction Basic Setup
$U = \{1, 2, \ldots, N\}$: index set of the finite population.
$(x_i, y_i)$: study variables for unit $i$ in the population.
$\eta$: parameter of interest, defined as the solution to
$$\sum_{i=1}^{N} U(\eta; x_i, y_i) = 0.$$
Examples:
1. Population mean: $U(\eta; x, y) = y - \eta$
2. Population proportion of $Y$ less than $q$: $U(\eta; x, y) = I(y < q) - \eta$
3. Population $p$-th quantile: $U(\eta; x, y) = I(y < \eta) - p$
4. Population regression coefficient: $U(\eta; x, y) = (y - x\eta)x$
5. Domain mean: $U(\eta; x, y) = (y - \eta)d(x)$

4 Introduction Basic Setup (Cont'd)
$A$: index set of the sample ($A \subset U$) obtained from a probability sampling design, with $\pi_i$ being the first-order inclusion probability of unit $i$.
From the sample, we collect measurements of $(x_i, y_i)$.
Under complete response, a consistent estimator of $\eta$ can be obtained by solving
$$\sum_{i \in A} w_i U(\eta; x_i, y_i) = 0 \tag{1}$$
for $\eta$, where $w_i = \pi_i^{-1}$.
Under some regularity conditions, the solution to (1) is consistent and asymptotically normally distributed (Binder and Patak, 1994).
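
As a rough illustration of equation (1), the sketch below solves the weighted estimating equation for two of the example parameters: the mean (closed form) and the $p$-th quantile (by scanning the sorted sample). The function names and the toy data are assumptions made for illustration only.

```python
def weighted_mean(y, w):
    """Solve sum_i w_i (y_i - eta) = 0 for eta."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def weighted_quantile(y, w, p):
    """Approximate root of sum_i w_i {I(y_i < eta) - p} = 0.

    The left side is a step function of eta, so we return the smallest
    observed y at which the weighted share of values <= y reaches p.
    """
    total = sum(w)
    cum = 0.0
    for yi, wi in sorted(zip(y, w)):
        cum += wi
        if cum >= p * total:
            return yi
    return max(y)


# Toy sample: design weights w_i = 1 / pi_i (all values are made up).
y = [2.0, 4.0, 6.0, 8.0]
pi = [0.5, 0.25, 0.25, 0.5]
w = [1.0 / p for p in pi]

eta_mean = weighted_mean(y, w)
eta_median = weighted_quantile(y, w, 0.5)
```

Because the quantile estimating function is not continuous in $\eta$, the scan returns a sample value that makes the weighted sum change sign rather than an exact root.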

5 Introduction Basic Setup (Cont'd)
Assume that $x_i$ are always observed and $y_i$ are subject to nonresponse. Define
$$\delta_i = \begin{cases} 1 & \text{if } y_i \text{ is observed} \\ 0 & \text{otherwise.} \end{cases}$$
A consistent estimator of $\eta$ is then obtained by taking the conditional expectation and solving $\bar{U}(\eta) = 0$ for $\eta$, where
$$\bar{U}(\eta) = \sum_{i \in A} w_i \left[ \delta_i U(\eta; x_i, y_i) + (1 - \delta_i) E\{U(\eta; x_i, Y) \mid x_i, \delta_i = 0\} \right]. \tag{2}$$

6 Introduction How to compute the conditional expectation in (2)?
1. Often, start by assuming missing at random (MAR). That is, $f(y \mid x, \delta) = f(y \mid x)$.
2. Build a (parametric) model for $f(y \mid x)$. That is, $f(y \mid x) = f(y \mid x; \theta)$ for some $\theta$.
3. Obtain a consistent estimator $\hat\theta$ of $\theta$ from the set of respondents. That is, solve
$$\sum_{i \in A} w_i \delta_i S(\theta; x_i, y_i) = 0$$
for $\theta$, where $S(\theta; x, y)$ is the score function of $\theta$.
4. Compute the conditional expectation by a Monte Carlo approximation using samples from $f(y \mid x; \hat\theta)$:
$$E\{U(\eta; x_i, Y) \mid x_i\} \cong \frac{1}{M} \sum_{j=1}^{M} U(\eta; x_i, y_i^{*(j)}),$$
where $y_i^{*(j)} \sim f(y \mid x_i; \hat\theta)$.
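
The four steps can be sketched in a few lines, assuming a normal linear model $y \mid x \sim N(\beta_0 + \beta_1 x, \sigma^2)$ for $f(y \mid x; \theta)$. Under that assumption the weighted score equation reduces to weighted least squares. The function names, the simulated respondents, and the choice $U = I(y < 1)$ at $x = 2$ are hypothetical.

```python
import math
import random


def fit_normal_regression(x, y, w):
    """Solve the weighted score equation for y = b0 + b1 x + e, e ~ N(0, s2)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    s2 = sum(wi * (yi - b0 - b1 * xi) ** 2 for wi, xi, yi in zip(w, x, y)) / sw
    return b0, b1, s2


def mc_cond_expectation(u, x_i, theta, M, rng):
    """Average u(x_i, y*) over M draws y* ~ N(b0 + b1 x_i, s2)."""
    b0, b1, s2 = theta
    sd = math.sqrt(s2)
    draws = [rng.gauss(b0 + b1 * x_i, sd) for _ in range(M)]
    return sum(u(x_i, yj) for yj in draws) / M


rng = random.Random(1)
# Simulated respondents with equal weights (illustrative data only).
x_resp = [rng.expovariate(1.0) for _ in range(500)]
y_resp = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x_resp]
w_resp = [1.0] * len(x_resp)
theta_hat = fit_normal_regression(x_resp, y_resp, w_resp)

# Monte Carlo approximation of E{I(Y < 1) | x = 2} with M = 20000 draws.
approx = mc_cond_expectation(lambda x, y: float(y < 1.0), 2.0, theta_hat, 20000, rng)

# Closed-form value under the same fitted normal model, for comparison.
b0, b1, s2 = theta_hat
exact = 0.5 * (1.0 + math.erf((1.0 - b0 - b1 * 2.0) / math.sqrt(2.0 * s2)))
```

For this particular model the expectation has a closed form, which is used here only to check the Monte Carlo value; the Monte Carlo route matters when no closed form is available.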

7 Introduction Imputation
Imputation: Monte Carlo approximation of the conditional expectation (given the observed data).
$$E\{U(\eta; x_i, Y) \mid x_i\} \cong \frac{1}{M} \sum_{j=1}^{M} U(\eta; x_i, y_i^{*(j)})$$
1. Bayesian approach: generate $y_i^*$ from $f(y_i \mid x_i, \theta^*)$, where $\theta^*$ is generated from $p(\theta \mid x, y)$.
2. Frequentist approach: generate $y_i^*$ from $f(y_i \mid x_i; \hat\theta)$, where $\hat\theta$ is a consistent estimator.
Once the conditional expectation is computed (approximately), we can obtain $\hat\eta$ by solving the imputed estimating equation.

8 Introduction Imputation Remark
Imputation can be applied even when the parameter of interest $\eta$ is not known at the time of imputation. Thus, it is a useful tool for general-purpose estimation.
It works even when $M = 1$ (single imputation).
To reduce the variance and to enable variance estimation, $M > 1$ is often used.
Bayesian approach: multiple imputation of Rubin (1987).
Frequentist approach: parametric fractional imputation of Kim (2011).

9 Review 1 Introduction 2 Review 3 Fractional Hot deck imputation 4 Simulation Study 5 Conclusion

10 Review Multiple imputation
Generate $M$ imputed values (with equal weights).
Features
1. Imputed values are generated from the posterior predictive distribution, which is the average of $f(y_i \mid x_i; \theta)$ over the posterior distribution $\pi(\theta \mid x, y_{obs})$.
2. The variance estimation formula is simple (Rubin's formula):
$$\hat{V}_{MI}(\bar\eta_M) = W_M + \left(1 + \frac{1}{M}\right) B_M,$$
where $W_M = M^{-1} \sum_{m=1}^{M} \hat{V}_I^{(m)}$, $B_M = (M-1)^{-1} \sum_{m=1}^{M} (\hat\eta^{(m)} - \bar\eta_M)^2$, $\bar\eta_M = M^{-1} \sum_{m=1}^{M} \hat\eta^{(m)}$ is the average of the $M$ imputed estimators of $\eta$, and $\hat{V}_I^{(m)}$ is the imputed version of the variance estimator of $\hat\eta$ under complete response.
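
Rubin's combining rule is short enough to write out directly. The sketch below is a minimal version; the function name and the three made-up point estimates and variance estimates are for illustration only.

```python
def rubin_variance(estimates, variances):
    """Combine M imputed analyses: returns (eta_bar, W_M + (1 + 1/M) B_M)."""
    M = len(estimates)
    eta_bar = sum(estimates) / M
    within = sum(variances) / M                                    # W_M
    between = sum((e - eta_bar) ** 2 for e in estimates) / (M - 1)  # B_M
    return eta_bar, within + (1.0 + 1.0 / M) * between


# Three imputed data sets, each with a point estimate and a
# complete-data variance estimate (made-up numbers).
eta_bar, v_mi = rubin_variance([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```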

11 Review Multiple imputation Remark
The sampling design is incorporated by including $w_i$ among the covariates in order to make the sampling design non-informative.
Thus, the imputed values are generated from the sample model, not from the population model:
$$y_i^* \sim f(y \mid x_i, I_i = 1),$$
where $I_i$ is the indicator function for sample inclusion.
MAR is assumed at the sample level,
$$f(y \mid x, I = 1, \delta = 0) = f(y \mid x, I = 1, \delta = 1),$$
which is different from MAR at the population level,
$$f(y \mid x, \delta = 0) = f(y \mid x, \delta = 1).$$

12 Review Multiple imputation Remark (Cont'd)
If the sampling design is non-informative, then the sample model and the population model are equivalent, and the sample MAR and the population MAR are equivalent.
Variance estimation (using Rubin's formula) does not work when the sampling design is informative.
Even when the sampling design is non-informative, consistency of the variance estimator is questionable (Kim et al., 2006).

13 Review Multiple imputation Variance estimation
Rubin's formula is based on the following decomposition:
$$V(\hat\eta_{MI}) = V(\hat\eta_n) + V(\hat\eta_{MI} - \hat\eta_n).$$
Basically, the $W_M$ term estimates $V(\hat\eta_n)$ and the $(1 + M^{-1}) B_M$ term estimates $V(\hat\eta_{MI} - \hat\eta_n)$.
In general, we have
$$V(\hat\eta_{MI}) = V(\hat\eta_n) + V(\hat\eta_{MI} - \hat\eta_n) + 2\,\mathrm{Cov}(\hat\eta_{MI} - \hat\eta_n, \hat\eta_n),$$
and the covariance term can be non-negligible.
The condition of zero covariance is called congeniality by Meng (1994).
Congeniality holds when $\hat\eta_{MI}$ is a smooth function of the MLE of $\theta$ in $f(y \mid x; \theta)$. Otherwise, Rubin's variance estimator can be biased, which will be discussed in the simulation section.

14 Review Parametric Fractional Imputation
Parametric fractional imputation of Kim (2011)
1. More than one (say $M$) imputed values of $y_i$: $y_i^{*(1)}, \ldots, y_i^{*(M)}$, generated from some (initial) density $h(y_i \mid x_i)$.
2. Create a weighted data set
$$\left\{ \left( w_i w_{ij}^*, x_i, y_i^{*(j)} \right); j = 1, 2, \ldots, M; i \in A \right\},$$
where
$$w_{ij}^* \propto f(y_i^{*(j)} \mid x_i; \hat\theta) / h(y_i^{*(j)} \mid x_i)$$
and $\hat\theta$ is the (pseudo) maximum likelihood estimator of $\theta$.
3. The weights $w_{ij}^*$ are the normalized importance weights and are called fractional weights.

15 Review Parametric Fractional Imputation (Cont'd)
Product: fractionally imputed data set of size $nM$,
$$\left\{ \left( w_i w_{ij}^*, x_i, y_i^{*(j)} \right); j = 1, 2, \ldots, M; i \in A \right\}.$$
Property: for sufficiently large $M$,
$$\sum_{j=1}^{M} w_{ij}^* g(x_i, y_i^{*(j)}) \cong \frac{\int g(x_i, y) \frac{f(y \mid x_i; \hat\theta)}{h(y \mid x_i)} h(y \mid x_i)\,dy}{\int \frac{f(y \mid x_i; \hat\theta)}{h(y \mid x_i)} h(y \mid x_i)\,dy} = E\{g(x_i, Y) \mid x_i; \hat\theta\}$$
for any $g$ such that the expectation exists.
Can handle an informative sampling design by incorporating the sampling weights into the score equation. That is, solve
$$\sum_{i \in A} w_i \delta_i S(\theta; x_i, y_i) = 0, \tag{3}$$
where $S(\theta; x, y) = \partial \log f(y \mid x; \theta) / \partial \theta$ is the score function of $\theta$.
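
The importance-weighting property can be checked numerically. In the sketch below both densities are assumptions chosen for illustration: the proposal $h$ is $N(0, 2^2)$ and the fitted model $f(y \mid x_i; \hat\theta)$ is $N(1, 1)$, so the fractionally weighted average of $g(y) = y$ should land near 1.

```python
import math
import random


def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fractional_weights(y_star, f_pdf, h_pdf):
    """Normalized importance weights w*_ij proportional to f / h."""
    raw = [f_pdf(yj) / h_pdf(yj) for yj in y_star]
    total = sum(raw)
    return [r / total for r in raw]


rng = random.Random(7)
M = 5000
# Draw the imputed values from the proposal h = N(0, 4).
y_star = [rng.gauss(0.0, 2.0) for _ in range(M)]
w_star = fractional_weights(
    y_star,
    lambda y: normal_pdf(y, 1.0, 1.0),   # assumed fitted model f
    lambda y: normal_pdf(y, 0.0, 2.0),   # assumed proposal h
)

# Weighted average of g(y) = y approximates E(Y | x_i; theta_hat) = 1.
cond_mean = sum(wj * yj for wj, yj in zip(w_star, y_star))
```

A proposal with heavier tails than the target keeps the ratio $f/h$ bounded, which is why a wider normal is used for $h$ here.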

16 Review Parametric Fractional Imputation (Cont'd) Remark
Imputed values are generated from the population model, not from the sample model:
$$y_i^* \sim f(y \mid x_i) \neq f(y \mid x_i, I_i = 1).$$
Thus, we assume population MAR, not sample MAR.
For variance estimation, either the linearization method or the replication method can be used.

17 Fractional Hot deck imputation 1 Introduction 2 Review 3 Fractional Hot deck imputation 4 Simulation Study 5 Conclusion

18 Fractional Hot deck imputation Fractional Hot deck imputation Motivation
Hot deck imputation: imputed values are real observations; very popular in household surveys.
We want to implement a hot deck version of fractional imputation.
Kim (2004) and Fuller and Kim (2005) already considered fractional hot deck imputation for the case where $x$ is categorical in $f(y \mid x)$.
Kim, Fuller and Bell (2011) extended the method of Kim (2004) to nearest neighbor imputation.
We now want to extend it to the case when $x$ has continuous components.

19 Fractional Hot deck imputation Fractional Hot deck imputation Proposed method: Three steps
1. Fully efficient fractional imputation (FEFI) by choosing all the respondents as donors. That is, we use $M = n_R$ imputed values for each missing unit, where $n_R$ is the number of respondents in the sample.
2. Use systematic PPS sampling to select $m$ ($\ll n_R$) donors from the FEFI.
3. Use a calibration weighting technique to compute the final fractional weights (which lead to the same estimates as FEFI for some items).

20 Fractional Hot deck imputation Fractional Hot deck imputation Step 1: FEFI step
We want to find the fractional weights $w_{ij}^*$ when the $j$-th imputed value $y_i^{*(j)}$ is taken from the $j$-th value in the set of respondents.
Without loss of generality, we assume that the first $n_R$ elements respond and write $y_i^{*(j)} = y_j$.
Recall that $w_{ij}^* \propto f(y_i^{*(j)} \mid x_i; \hat\theta) / h(y_i^{*(j)} \mid x_i)$ when the $y_i^{*(j)}$ are generated from $h(y \mid x_i)$.
We have only to find $h(y_i^{*(j)} \mid x_i)$ when we use $y_i^{*(j)} = y_j$.

21 Fractional Hot deck imputation Fractional Hot deck imputation Step 1: FEFI step (Cont'd)
We can treat $\{y_i; \delta_i = 1\}$ as a realization from $f(y \mid \delta = 1)$, the marginal distribution of $y$ among respondents.
Now, we can write
$$f(y_j \mid \delta_j = 1) = \int f(y_j \mid x, \delta_j = 1) f(x \mid \delta_j = 1)\,dx = \int f(y_j \mid x) f(x \mid \delta_j = 1)\,dx \cong \frac{1}{N_R} \sum_{k=1}^{N} \delta_k f(y_j \mid x_k),$$
where $N_R = \sum_{i=1}^{N} \delta_i$ is the population size of (potential) respondents.

22 Fractional Hot deck imputation Fractional Hot deck imputation Step 1: FEFI step (Cont'd)
Using the survey weights, we can approximate
$$f(y_j \mid \delta_j = 1) \cong \frac{\sum_{k \in A_R} w_k f(y_j \mid x_k)}{\sum_{k \in A_R} w_k},$$
and the fractional weight for $y_i^{*(j)} = y_j$ becomes
$$w_{ij}^* \propto \frac{f(y_j \mid x_i; \hat\theta)}{\sum_{k \in A_R} w_k f(y_j \mid x_k; \hat\theta)} \tag{4}$$
with $\sum_{j \in A_R} w_{ij}^* = 1$, where $A_R = \{i \in A; \delta_i = 1\}$ and $\hat\theta$ is computed from the weighted score equation in (3).
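
A minimal sketch of the FEFI weights in (4), assuming the normal linear imputation model $y \mid x \sim N(\beta_0 + \beta_1 x, \sigma^2)$. The function names, the three respondents, and the value of $\hat\theta$ are hypothetical.

```python
import math


def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fefi_weights(x_i, x_resp, y_resp, w_resp, theta):
    """Fractional weights w*_ij over all respondents j for one missing unit i.

    w*_ij is proportional to f(y_j | x_i) / sum_k w_k f(y_j | x_k),
    then normalized so the weights sum to one over the respondents.
    """
    b0, b1, s2 = theta
    sd = math.sqrt(s2)
    raw = []
    for yj in y_resp:
        num = normal_pdf(yj, b0 + b1 * x_i, sd)
        den = sum(wk * normal_pdf(yj, b0 + b1 * xk, sd)
                  for wk, xk in zip(w_resp, x_resp))
        raw.append(num / den)
    total = sum(raw)
    return [r / total for r in raw]


# Three respondents with equal design weights (made-up values).
x_resp = [0.0, 1.0, 2.0]
y_resp = [0.1, 0.4, 1.2]
w_resp = [1.0, 1.0, 1.0]
theta_hat = (0.0, 0.5, 1.0)   # assumed (b0, b1, sigma^2)

# Nonrespondent with x_i = 2: donors with y near 0.5 * 2 = 1 get more weight.
w_star = fefi_weights(2.0, x_resp, y_resp, w_resp, theta_hat)
```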

23 Fractional Hot deck imputation Fractional Hot deck imputation Step 2: Sampling Step
FEFI uses all the elements in $A_R$ as donors for each missing $i$.
We want to reduce the number of donors to, say, $m = 10$.
For each $i$, we can treat the FEFI donor set as the weighted population and apply a sampling method to select a smaller set of donors.
The fractional weights (4) for FEFI can be used as the selection probabilities for the PPS sampling.
That is, our goal is to obtain a (systematic) PPS sample $D_i$ of size $m$ from the FEFI donor set of size $M = n_R$, using $w_{ij}^*$ as the selection probability assigned to the $j$-th element in $A_R$. (Note that $w_{ij}^*$ satisfies $\sum_{j=1}^{M} w_{ij}^* = 1$ and $w_{ij}^* > 0$.)
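
One common way to draw a systematic PPS sample is to lay the scaled size measures end to end on $[0, m)$ and take $m$ equally spaced points from a random start. The sketch below follows that scheme; the function name and the equal-probability toy input are assumptions, and the slides do not say which variant of systematic PPS was used.

```python
import random


def systematic_pps(probs, m, rng):
    """Systematic PPS draw of m donor indices with size measures `probs`.

    `probs` must sum to one. A donor whose m * probs[j] exceeds 1 can be
    selected more than once.
    """
    start = rng.random()
    points = [start + k for k in range(m)]   # m equally spaced points in [0, m)
    selected = []
    cum = 0.0   # left end of the current donor's interval
    j = 0
    for pt in points:
        # Advance to the donor whose interval [cum, cum + m * probs[j]) holds pt.
        while j < len(probs) - 1 and cum + m * probs[j] <= pt:
            cum += m * probs[j]
            j += 1
        selected.append(j)
    return selected


rng = random.Random(3)
# Four donors with equal fractional weights, two donors to select.
donors = systematic_pps([0.25, 0.25, 0.25, 0.25], 2, rng)
```

With equal weights and $m = 2$, each donor occupies an interval of length 0.5, so the two selected indices are always two positions apart.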

24 Fractional Hot deck imputation Fractional Hot deck imputation Step 3: Weighting Step
After we select $D_i$ from the complete set of respondents, the selected donors in $D_i$ are assigned the initial fractional weights $w_{ij0}^* = 1/m$.
The fractional weights are further adjusted to satisfy
$$\sum_{i \in A} \sum_{j \in D_i} w_i \left\{ (1 - \delta_i) w_{ij,c}^* q(x_i, y_j) \right\} = \sum_{i \in A} \sum_{j \in A_R} w_i \left\{ (1 - \delta_i) w_{ij}^* q(x_i, y_j) \right\} \tag{5}$$
for some $q(x_i, y_j)$, and $\sum_{j \in D_i} w_{ij,c}^* = 1$ for all $i$ with $\delta_i = 0$, where $w_{ij}^*$ is the fractional weight for the FEFI method, as defined in (4).
Regarding the choice of the control function $q(x, y)$ in (5), we can use $q(x, y) = (y, y^2)$, which will lead to fully efficient estimates for the mean and the variance of $y$.
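
The slides do not spell out the calibration technique. As one possible reading, the sketch below uses a regression-type (linear) adjustment with a scalar control $q(x, y) = y$, which is a simplification of the $(y, y^2)$ choice above. The function name, the two nonrespondents, their donors, and the FEFI target total are all made up.

```python
def calibrate_fractional_weights(w, donors_y, target):
    """Linear calibration of fractional weights for nonrespondents.

    w[i]        : design weight of nonrespondent i
    donors_y[i] : y values of the m donors selected for unit i
    target      : FEFI total, sum_i w_i sum_j w*_ij y_j over all respondents
    Returns weights that sum to one within each unit and reproduce `target`.
    """
    init = [[1.0 / len(ys)] * len(ys) for ys in donors_y]       # w*_ij0 = 1/m
    ybar = [sum(w0 * y for w0, y in zip(w0s, ys))
            for w0s, ys in zip(init, donors_y)]
    current = sum(wi * yb for wi, yb in zip(w, ybar))
    spread = sum(wi * sum(w0 * (y - yb) ** 2 for w0, y in zip(w0s, ys))
                 for wi, w0s, ys, yb in zip(w, init, donors_y, ybar))
    slope = (target - current) / spread
    # Adding a multiple of (y - ybar_i) leaves each unit's weight sum at one.
    return [[w0 * (1.0 + slope * (y - yb)) for w0, y in zip(w0s, ys)]
            for w0s, ys, yb in zip(init, donors_y, ybar)]


w = [2.0, 3.0]
donors_y = [[1.0, 3.0], [2.0, 6.0]]
fefi_total = 17.4   # assumed value of the right side of (5)
w_cal = calibrate_fractional_weights(w, donors_y, fefi_total)
calibrated_total = sum(wi * sum(wc * y for wc, y in zip(wcs, ys))
                       for wi, wcs, ys in zip(w, w_cal, donors_y))
```

A linear adjustment can produce negative weights when the target is far from the initial estimate; other calibration distances avoid this at the cost of an iterative solve.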

25 Fractional Hot deck imputation Fractional Hot deck imputation Remark
For variance estimation, a replication method can be used. The imputed values are not changed; only the fractional weights are changed for each replication. (Details skipped.)
The proposed fractional hot deck imputation is less sensitive to model mis-specification in $f(y \mid x; \theta)$. (Details skipped.)
The proposed method can be extended to a non-ignorable missing case under a parametric model assumption on the response mechanism. (Details skipped.)

26 Simulation Study 1 Introduction 2 Review 3 Fractional Hot deck imputation 4 Simulation Study 5 Conclusion

27 Simulation Study Simulation Study - Study One
Factors considered
Correct vs. incorrect imputation model: to see the effect of model mis-specification of $f(y \mid x)$.
Imputation methods: MI, PFI, FHDI.
Parameters of interest: mean, proportion.

28 Simulation Study Simulation Study - Study One Simulation Setup
Two sets of models
1. Model A: $y_i = 0.5 x_i + e_i$, where $x_i \sim \exp(1)$ and $e_i \sim N(0, 1)$.
2. Model B: same as Model A except for $e_i \sim \{\chi^2(2) - 2\}/2$.
Response mechanism: $y_i$ is observed only when $\delta_i = 1$, where $\delta_i \sim \mathrm{Bernoulli}(\pi_i)$ and
$$\pi_i = \{1 + \exp(0.2 - x_i)\}^{-1}.$$
Thus, we have MAR with 65% overall response in both models.
$B = 5{,}000$ Monte Carlo samples of size $n = 200$.
We used $y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$ as the imputation model in both cases. (Thus, the imputation model is mis-specified under Model B.)
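
The data-generating step of Study One can be sketched as follows. The response model is read here as $\pi_i = \{1 + \exp(0.2 - x_i)\}^{-1}$, an assumption that reproduces the stated 65% overall response rate when $x_i \sim \exp(1)$. The function name and the seed are arbitrary.

```python
import math
import random


def simulate_sample(n, model, rng):
    """Draw n triples (x, y, delta) under Model A or Model B."""
    sample = []
    for _ in range(n):
        x = rng.expovariate(1.0)                    # x ~ exp(1)
        if model == "A":
            e = rng.gauss(0.0, 1.0)                 # e ~ N(0, 1)
        else:
            # chi-square(2) is exponential with mean 2, so this is
            # {chi2(2) - 2} / 2: mean 0, variance 1, right-skewed.
            e = (rng.expovariate(0.5) - 2.0) / 2.0
        y = 0.5 * x + e
        p = 1.0 / (1.0 + math.exp(0.2 - x))         # assumed response probability
        delta = 1 if rng.random() < p else 0
        sample.append((x, y, delta))
    return sample


rng = random.Random(0)
# A large sample, used only to check the overall response rate.
big = simulate_sample(100000, "A", rng)
response_rate = sum(d for _, _, d in big) / len(big)
```

The response probability depends only on $x$, which is always observed, so the mechanism is MAR by construction.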

29 Simulation Study Simulation Study - Study One Simulation Setup (Cont'd)
Two parameters considered:
1. $\eta_1 = E(Y)$: the population mean of $y$.
2. $\eta_2 = \Pr(Y < 1)$: the proportion of $Y$ less than one.
Four estimators computed:
1. Full sample estimator (FULL), computed using the full sample.
2. Multiple imputation (MI) estimator with imputation size $m = 10$.
3. Parametric fractional imputation (PFI) with imputation size $m = 10$.
4. Fractional hot deck imputation (FHDI) with imputation size $m = 10$.

30 Simulation Study Simulation Study - Study One Simulation Results under Model A
Table: Point estimation
Columns: Parameter, Method, Mean, Var, Std Var
$\eta_1 = \mu_y$: Full, MI, PFI, FHDI
$\eta_2 = \Pr(Y < 1)$: Full, MI, PFI, FHDI

31 Simulation Study Simulation Study - Study One Simulation Results under Model A
Table: Variance estimation
Columns: Parameter, Method, R.B. (%), t-statistics
$V(\hat\eta_1)$: MI, PFI, FHDI
$V(\hat\eta_2)$: MI, PFI, FHDI

32 Simulation Study Simulation Study - Study One Discussion for Model A results
Point estimation is unbiased for both parameters under the correct model.
For $\eta_1 = E(Y)$, imputation increases the variance by roughly 45-53%:
$$V(\hat\eta_{1,imp}) \cong \frac{1}{n} \sigma_y^2 + \left( \frac{1}{n_R} - \frac{1}{n} \right) \sigma_e^2.$$
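
As a back-of-the-envelope check, the variance formula $V(\hat\eta_{1,imp}) \cong \sigma_y^2 / n + (1/n_R - 1/n)\,\sigma_e^2$ can be evaluated under Model A. The sketch assumes $n_R$ is fixed at 65% of $n = 200$ and uses $\sigma_e^2 = 1$ and $\mathrm{Var}(x) = 1$ for $x \sim \exp(1)$; these plug-in values are assumptions, not figures taken from the tables.

```python
n = 200
response_rate = 0.65          # stated overall response rate
n_r = response_rate * n       # expected number of respondents

sigma2_e = 1.0                        # Var(e) under Model A
sigma2_y = 0.5 ** 2 * 1.0 + sigma2_e  # Var(y) = 0.25 Var(x) + Var(e)

v_full = sigma2_y / n
v_imp = sigma2_y / n + (1.0 / n_r - 1.0 / n) * sigma2_e
relative_increase = v_imp / v_full - 1.0
```

This first-order value treats $n_R$ as fixed at its expected size and ignores Monte Carlo error, so it need not match the simulated 45-53% exactly.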

33 Simulation Study Simulation Study - Study One Discussion for Model A results (Cont'd)
For $\eta_2 = \Pr(Y < 1)$, imputation increases the variance by roughly 25% for MI and PFI.
Note that
$$\hat\eta_{2,imp} = \frac{1}{n} \sum_{i=1}^{n} \left[ \delta_i I(y_i < 1) + (1 - \delta_i) E\{I(Y < 1) \mid x_i\} \right],$$
where we used the imputation model in computing the conditional expectation. Thus, it borrows strength by making use of the normality assumption at the time of imputation.
In some sense, the above imputation estimator can be viewed as a composite estimator, where a composite estimator is a weighted average of a direct estimator and a synthetic estimator.

34 Simulation Study Simulation Study - Study One Discussion for Model A results (Cont'd)
In fact, under full response, there are two estimators of $\eta_2 = \Pr(Y < 1)$:
$$\hat\eta_{2,MME} = n^{-1} \sum_{i=1}^{n} I(y_i < 1), \qquad \hat\eta_{2,MLE} = \int_{-\infty}^{1} \frac{1}{\hat\sigma} \phi\left( \frac{y - \hat\mu}{\hat\sigma} \right) dy.$$
The MLE is more efficient than the MME but it is less robust.
The congeniality condition holds when the MLE is used, but not when the MME is used.
Rubin's variance estimator for MI requires the congeniality condition. FI does not require congeniality.
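
The two full-response estimators of $\Pr(Y < 1)$ differ only in how much they lean on normality. A minimal sketch, with hypothetical function names and a made-up four-point sample, computes both; the MLE plugs the sample mean and the (divide-by-$n$) standard deviation into the normal distribution function.

```python
import math


def eta2_mme(y, q=1.0):
    """Method-of-moments estimator: sample proportion of y below q."""
    return sum(1.0 for yi in y if yi < q) / len(y)


def eta2_mle(y, q=1.0):
    """Normal-theory MLE: Phi((q - mu_hat) / sigma_hat)."""
    n = len(y)
    mu = sum(y) / n
    sigma = math.sqrt(sum((yi - mu) ** 2 for yi in y) / n)
    return 0.5 * (1.0 + math.erf((q - mu) / (sigma * math.sqrt(2.0))))


y = [0.0, 0.5, 1.5, 2.0]   # symmetric around 1, so both estimators agree here
mme = eta2_mme(y)
mle = eta2_mle(y)
```

On skewed data the two values can differ noticeably, which is the robustness issue that shows up under Model B.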

35 Simulation Study Simulation Study - Study One Simulation Results under Model B
Table: Point estimation
Columns: Parameter, Method, Mean, Var, Std Var
$\eta_1 = \mu_y$: Full, MI, PFI, FHDI
$\eta_2 = \Pr(Y < 1)$: Full, MI, PFI, FHDI

36 Simulation Study Simulation Study - Study One Simulation Results under Model B
Table: Variance estimation
Columns: Parameter, Method, R.B. (%), t-statistics
$V(\hat\eta_1)$: MI, PFI, FHDI (each with $m = 10$)
$V(\hat\eta_2)$: MI, PFI, FHDI (each with $m = 10$)

37 Simulation Study Simulation Study - Study One Discussion for Model B results
Point estimation is unbiased for $\eta_1 = E(Y)$ even when the imputation model is incorrect.
Note that, for $m \to \infty$, the imputed estimator of $\eta_1$ can be written
$$\hat\eta_{1,imp} = \frac{1}{n} \sum_{i=1}^{n} \{\delta_i y_i + (1 - \delta_i) \hat{y}_i\} = \frac{1}{n} \sum_{i=1}^{n} \hat{y}_i,$$
which is called the projection estimator. The second equality holds because the respondents' residuals $y_i - \hat{y}_i$ sum to zero when the regression with an intercept is fitted to the respondents.
Kim and Rao (2012) showed design-consistency of the projection estimator.
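
The identity between the imputed estimator and the projection estimator can be verified numerically. The sketch assumes an equal-weight sample, a least squares fit with an intercept on the respondents, and a constant 65% response probability; the constant probability is a simplification chosen for brevity, and the variable names are hypothetical.

```python
import random

rng = random.Random(11)
n = 200
x = [rng.expovariate(1.0) for _ in range(n)]
y = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
delta = [1 if rng.random() < 0.65 else 0 for _ in range(n)]

# Ordinary least squares with an intercept, fitted on respondents only.
xr = [xi for xi, d in zip(x, delta) if d == 1]
yr = [yi for yi, d in zip(y, delta) if d == 1]
xbar = sum(xr) / len(xr)
ybar = sum(yr) / len(yr)
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(xr, yr))
      / sum((a - xbar) ** 2 for a in xr))
b0 = ybar - b1 * xbar

# Predicted values for every sampled unit, respondent or not.
y_hat = [b0 + b1 * xi for xi in x]

# Imputed estimator: observed y for respondents, prediction otherwise.
eta_imputed = sum(d * yi + (1 - d) * yh
                  for d, yi, yh in zip(delta, y, y_hat)) / n
# Projection estimator: prediction for everyone.
eta_projection = sum(y_hat) / n
```

The two values agree up to floating-point error because the fitted residuals of the respondents sum to zero.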

38 Simulation Study Simulation Study - Study One Discussion for Model B results (Cont'd)
However, all imputed estimators are biased for $\eta_2 = \Pr(Y < 1)$. The biases are much higher for MI and PFI than for FHDI; the corresponding z-statistics are -34.8, -33.5, and 5.5 for MI, PFI, and FHDI, respectively.
Note that the true error distribution is $e_i \sim \{\chi^2(2) - 2\}/2$, while the imputation model errors are generated from $e_i^* \sim N(0, \hat\sigma_e^2)$. (See the picture on the next page.)
In FHDI, the donors are still generated from the true distribution; only the fractional weights are computed from the wrong model. Thus, the effect of model mis-specification is less severe than for the other imputation methods, which create synthetic values from the wrong model.

39 Simulation Study
[Figure: density curves of the true model and the imputation model; vertical axis: Density, horizontal axis: x]

40 Simulation Study Simulation Study - Study Two
Bivariate data $(x_i, y_i)$ of size $n = 100$ with
$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 (x_i^2 - 1) + e_i, \tag{6}$$
where $(\beta_0, \beta_1, \beta_2) = (0, 0.9, 0.06)$, $x_i \sim N(0, 1)$, $e_i \sim N(0, 0.16)$, and $x_i$ and $e_i$ are independent.
The variable $x_i$ is always observed, but the probability that $y_i$ responds is 0.5.
The imputation model is $Y_i = \beta_0 + \beta_1 x_i + e_i$. That is, the imputer's model uses the extra information that $\beta_2 = 0$.
From the imputed data, we fit model (6) and computed the power of a test of $H_0: \beta_2 = 0$ at the 0.05 significance level.
In addition, we also considered the complete-case (CC) method, which simply uses only the complete cases for the regression analysis.

41 Simulation Study Simulation Study - Study Two
Table 5: Simulation results for the Monte Carlo experiment based on 10,000 Monte Carlo samples.
Columns: Method, $E(\hat\theta)$, $V(\hat\theta)$, R.B. ($\hat{V}$), Power
Rows: MI, FI, CC
Table 5 shows that MI provides a more efficient point estimator than the CC method, but its variance estimation is very conservative (more than 100% overestimation).
Because of the serious positive bias of the MI variance estimator, the statistical power of the test based on MI is actually lower than that of the CC method.

42 Conclusion 1 Introduction 2 Review 3 Fractional Hot deck imputation 4 Simulation Study 5 Conclusion

43 Conclusion Concluding Remarks
Advantages
1. Hot deck imputation: uses real observations for imputed values.
2. Robust against model mis-specification.
3. Applicable even when the sampling design is informative.
4. Does not require the congeniality condition for valid variance estimation.
Disadvantage: may have a higher imputation variance than imputation methods that use synthetic values.

44 Conclusion Future work
Extension to single imputation ($m = 1$). The imputation variance component needs to be estimated.
Instead of the calibration weighting step (in Step 3), we may consider using balanced imputation (Chauvet et al., 2011).
FHDI for multivariate missing data. To be presented at the ISI meeting in Hong Kong.
To be implemented in SAS (in Proc Surveyimpute).

45 References
Binder, D. and Z. Patak (1994), Use of estimating functions for estimation from complex surveys, Journal of the American Statistical Association 89.
Chauvet, G., J.-C. Deville and D. Haziza (2011), On balanced random imputation in surveys, Biometrika 98.
Fuller, W. A. and J. K. Kim (2005), Hot deck imputation for the response model, Survey Methodology 31.
Kim, J. K. (2004), Finite sample properties of multiple imputation estimators, The Annals of Statistics 32.
Kim, J. K. (2011), Parametric fractional imputation for missing data analysis, Biometrika 98.
Kim, J. K. and J. N. K. Rao (2012), Combining data from two independent surveys: a model-assisted approach, Biometrika 99.
Kim, J. K., M. J. Brick, W. A. Fuller and G. Kalton (2006), On the bias of the multiple imputation variance estimator in survey sampling, Journal of the Royal Statistical Society: Series B 68.

46 Conclusion
Kim, J. K., W. A. Fuller and W. R. Bell (2011), Variance estimation for nearest neighbor imputation for U.S. Census long form data, Annals of Applied Statistics 5.
Meng, X. L. (1994), Multiple-imputation inferences with uncongenial sources of input (with discussion), Statistical Science 9.
Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys, Wiley, New York.
