arxiv: v1 [stat.me] 27 Aug 2015


Submitted to Statistical Science

Fractional Imputation in Survey Sampling: A Comparative Review

Shu Yang and Jae Kwang Kim
Harvard University and Iowa State University

Abstract. Fractional imputation (FI) is a relatively new method of imputation for handling item nonresponse in survey sampling. In FI, several imputed values with their fractional weights are created for each missing item. Each fractional weight represents the conditional probability of the imputed value given the observed data, and the parameters in the conditional probabilities are often computed by an iterative method such as the EM algorithm. The underlying model for FI can be fully parametric, semiparametric, or nonparametric, depending on the plausibility of the assumptions and the data structure. In this paper, we give an overview of FI, introduce key ideas and methods to readers who are new to the FI literature, and highlight some new developments. We also provide guidance on practical implementation of FI and valid inferential tools after imputation. We demonstrate the empirical performance of FI relative to multiple imputation using a pseudo finite population generated from a sample in the Monthly Retail Trade Survey of the US Census Bureau.

Key words and phrases: Item nonresponse, Missing at random, Monte Carlo EM, Multiple imputation, Synthetic imputation.

1. INTRODUCTION

In survey sampling, it is common practice to collect data on a large number of items. Even when a sampled unit responds to the survey, this unit may not respond to some items. In this scenario, imputation can be used to create a complete data set by filling in missing values with plausible values to facilitate data analyses. The goal of imputation is three-fold. First, by providing complete data, subsequent analyses are easy to implement and can achieve consistency among different users. Second, imputation reduces the selection bias associated with using only the respondent set, which may not represent the original sample.
Third, the imputed data can incorporate extra information so that the resulting analyses are statistically efficient and coherent. Combining information from several surveys or creating synthetic data from planned missingness are cases in point (Schenker and Raghunathan 2007). When the imputed data set is released to the public, it should meet the goal of multiple uses both for planned and unplanned parameters (Haziza 2009).

Room 437A, HSPH2, 655 Huntington Ave, Boston, MA (e-mail: shuyang@hsph.harvard.com). Snedecor Hall, Iowa State University, Ames, IA (e-mail: jkim@iastate.edu).

imsart-sts ver. 2014/10/16 file: paper_review_fi_sts.tex date: August 28, 2015

In a typical survey situation, the imputers may know some of the parameters of interest at the time of imputation, but hardly know the full set of possible parameters to be estimated from the data. Single imputation, such as hot deck imputation, regression imputation and stochastic regression imputation, replaces each missing value with one plausible value. Although single imputation has been widely used, one drawback is that it does not account for the full uncertainty of the missing data and often falls short of multiple-purpose estimation. Multiple imputation (MI) was proposed by Rubin (1976) to replace each missing value with multiple plausible values that reflect the full uncertainty in the prediction of the missing data. Several authors (Rubin 1987; Little and Rubin 2002; Schafer 1997) have promoted MI as a standard approach for general-purpose estimation under item nonresponse in survey sampling. Although the variance estimation formula of Rubin (1987) is simple and easy to apply, it is not always consistent (Fay 1992; Wang and Robins 1998; Kim et al. 2006). For the MI variance estimation formula to be valid, the congeniality condition of Meng (1994) needs to be met, which can be restrictive for general-purpose inference. For example, Kim (2011) pointed out that an MI procedure that is congenial for mean estimation is not necessarily congenial for proportion estimation.

Fractional imputation (FI) is another effective imputation tool for general-purpose estimation, with the advantage of not requiring the congeniality condition. FI was originally proposed by Kalton and Kish (1984) to reduce the variance of single imputation methods by replacing each missing value with several plausible values with different probabilities, reflected through fractional weights. Fay (1996), Kim and Fuller (2004), Fuller and Kim (2005), Durrant (2005), and Durrant and Skinner (2006) discussed FI as a nonparametric imputation method for descriptive parameters of interest in survey sampling. Kim (2011) and Kim and Yang (2014) presented FI under fully parametric model assumptions.
More generally, FI can also serve as a computational tool for implementing the expectation step (E-step) in the EM algorithm (Wei and Tanner 1990; Kim 2011). When the conditional expectation in the E-step is not available in closed form, the parametric FI of Kim (2011) simplifies computation by drawing on the importance sampling idea. Through fractional weights, FI can reduce the burden of iterative computation, such as Markov chain Monte Carlo, for evaluating the conditional expectation associated with missing data. Kim and Hong (2012) extended parametric FI to a more general class of incomplete data, including measurement error models.

Despite these advantages, FI has not been widely used in applied research, owing to a lack of accessible material that gives researchers a comprehensive understanding of the approach. The goal of this paper is to bring more attention to FI by reviewing existing research on FI, introducing key ideas and methods, and highlighting some new developments, mainly in the context of survey sampling. This paper also provides guidance on practical implementations and applications of FI.

This paper is organized as follows. Section 2 provides the basic setup and Section 3 introduces FI under parametric model assumptions. Section 4 discusses a nonparametric approach to FI, especially in the context of hot deck imputation. Section 5 introduces synthetic data imputation using FI in the context of two-phase sampling and statistical matching. Section 6 deals with practical considerations and variations of FI, including imputation sizes, choices of proposal distributions and doubly robust FI. Section 7 compares FI with MI in terms of efficiency of the point estimator and the variance estimator. Section 8 presents a simulation study based on an actual data set. A discussion concludes this paper in Section 9.

2. BASIC SETUP

Consider a finite population of $N$ units identified by a set of indices $U = \{1, 2, \ldots, N\}$ with $N$ known. The $p$-dimensional study variable $y_i = (y_{1i}, \ldots, y_{pi})$, associated with each unit $i$ in the population, is subject to missingness. We assume that the finite population at hand is a realization from an infinite population, called a superpopulation. In the superpopulation model, we often postulate a parametric distribution, $f(y; \theta)$, with the parameter $\theta \in \Omega$. We can express the density for the joint distribution of $y_i$ as

\[ f(y_i; \theta) = f_1(y_{1i}; \theta_1)\, f_2(y_{2i} \mid y_{1i}; \theta_2) \cdots f_p(y_{pi} \mid y_{1i}, \ldots, y_{p-1,i}; \theta_p), \tag{2.1} \]

where $\theta_k$ is the parameter in the conditional distribution of $y_{ki}$ given $y_{1i}, \ldots, y_{k-1,i}$. Now let $A$ denote the set of indices for units in a sample selected by a probability sampling mechanism. Each unit is associated with a sampling weight, the inverse of the probability of being selected into the sample, denoted by $w_i$. We are interested in estimating $\eta$, defined as a (unique) solution to the population estimating equation $\sum_{i=1}^{N} U(\eta; y_i) = 0$. For example, a population mean of $y$ is obtained by letting $U(\eta; y_i) = \eta - y_i$; a population proportion of $y$ less than a threshold $c$ is obtained by specifying $U(\eta; y_i) = \eta - I\{y_i < c\}$, where $I$ is an indicator function; a population median of $y$ is obtained by choosing $U(\eta; y_i) = 0.5 - I\{y_i < \eta\}$; and so on. Under complete response, a consistent estimator of $\eta$ is obtained by solving

\[ \sum_{i \in A} w_i\, U(\eta; y_i) = 0. \tag{2.2} \]

Godambe and Thompson (1986), Binder and Patak (1994) and Rao, Yung, and Hidiroglou (2002) rigorously investigated the estimator obtained from (2.2) under complex sampling.
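As an illustration of how one estimating function $U$ yields different parameters through (2.2), the following minimal Python sketch solves the weighted estimating equation by bisection for the mean, a proportion, and the median. The toy data, the threshold `c`, and the helper name `solve_estimating_equation` are hypothetical choices made for this sketch, not part of the paper.

```python
import numpy as np

def solve_estimating_equation(U, w, y, lo, hi, tol=1e-10):
    """Find eta in [lo, hi] where sum_i w_i * U(eta; y_i) changes sign (bisection)."""
    g = lambda eta: float(np.sum(w * U(eta, y)))
    g_lo_positive = g(lo) > 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (g(mid) > 0) == g_lo_positive:
            lo = mid          # no sign change yet between lo and mid
        else:
            hi = mid          # sign change lies in [lo, mid]
    return 0.5 * (lo + hi)

# Hypothetical toy sample with sampling weights w_i.
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
w = np.array([1.0, 1.0, 2.0, 1.0, 1.0])
c = 3.5  # threshold for the proportion example (an arbitrary choice)

# U(eta; y) = eta - y  -> weighted mean
mean_hat = solve_estimating_equation(lambda eta, v: eta - v, w, y, y.min(), y.max())
# U(eta; y) = eta - I{y < c}  -> weighted proportion below c
prop_hat = solve_estimating_equation(
    lambda eta, v: eta - (v < c).astype(float), w, y, 0.0, 1.0)
# U(eta; y) = 0.5 - I{y < eta}  -> weighted median
median_hat = solve_estimating_equation(
    lambda eta, v: 0.5 - (v < eta).astype(float), w, y, y.min(), y.max())
```

Bisection is used here only because it also copes with the step-function estimating equations of the proportion and the median; for smooth $U$ a Newton iteration would be the more usual choice.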
In the presence of missing data, first consider decomposing $y_i = (y_{obs,i}, y_{mis,i})$, where $y_{obs,i}$ and $y_{mis,i}$ are the observed and missing parts of $y_i$, respectively. We assume that the response mechanism is missing at random (MAR) in the sense of Rubin (1976). That is, the probability of nonresponse does not depend on the missing value itself. Under MAR, a consistent estimator of $\eta$ can be obtained by solving the conditional estimating equation, given the observed data $y_{obs} = (y_{obs,1}, \ldots, y_{obs,n})$,

\[ \sum_{i \in A} w_i\, E\{U(\eta; y_i) \mid y_{obs,i}\} = 0, \tag{2.3} \]

where the above conditional expectation is taken with respect to the prediction model (also called the imputation model),

\[ f(y_{mis,i} \mid y_{obs,i}; \theta) = \frac{f(y_{obs,i}, y_{mis,i}; \theta)}{\int f(y_{obs,i}, y_{mis,i}; \theta)\, dy_{mis,i}}, \tag{2.4} \]

which depends on the unknown parameter $\theta$. Imputation is thus a computational tool for computing the conditional expectation in (2.3) for arbitrary choices of the estimating function $U(\eta; y)$. The resulting conditional expectation using imputation can be called the imputed estimating function.

Table 1 presents a summary of Bayesian and frequentist approaches to statistical inference with missing data. In the Bayesian approach, $\theta$ is treated as a random variable and the reference distribution is the joint distribution of $\theta$ and the latent (missing) data, given the observed data. In the frequentist approach, on the other hand, $\theta$ is treated as fixed and the reference distribution is the conditional distribution of the latent data, conditional on the observed data, for a given parameter $\theta$. The learning algorithm, that is, the algorithm for updating information about the parameters from the observed data, is data augmentation (Tanner and Wong 1987) for the Bayesian approach, while for the frequentist approach it is usually the EM algorithm.

Table 1
Comparison of two approaches of inference with missing data

                       Bayesian                      Frequentist
Model                  Posterior distribution        Prediction model
                       f(latent, θ | Obs.)           f(latent | Obs., θ)
Learning algorithm     Data augmentation             EM algorithm
Prediction             Imputation (I)-step           Expectation (E)-step
Parameter update       Posterior (P)-step            Maximization (M)-step
Imputation             Multiple imputation           Fractional imputation
Variance estimation    Rubin's formula               Linearization or replication

MI is a Bayesian imputation method, and the imputed estimating function is computed with respect to the posterior predictive distribution,

\[ f(y_{mis,i} \mid y_{obs}) = \int f(y_{mis,i} \mid y_{obs,i}; \theta)\, p(\theta \mid y_{obs})\, d\theta, \]

which is the average of the predictive distribution $f(y_{mis,i} \mid y_{obs,i}; \theta)$ over the posterior distribution of $\theta$. In the frequentist approach, on the other hand, the conditional expectation in (2.3) is taken with respect to the prediction model (2.4) evaluated at $\theta = \hat\theta$, a consistent estimator of $\theta$. For example, one can use the pseudo MLE $\hat\theta$ of $\theta$ obtained by solving the pseudo mean score equation (Louis 1982; Pfeffermann et al. 1998),

\[ \bar S(\theta) \equiv \sum_{i \in A} w_i\, E\{S(\theta; y_i) \mid y_{obs,i}; \theta\} = 0, \tag{2.5} \]

where $S(\theta; y_i) = \partial \log f(y_i; \theta)/\partial\theta$.

While the Bayesian approach to imputation, especially in the context of MI, is well studied, the literature on the frequentist approach to imputation is somewhat sparse. FI has been proposed to fill this important gap. In FI, the conditional expectation in (2.3) is computed by a weighted mean of the imputed estimating functions

\[ E\{U(\eta; y_i) \mid y_{obs,i}\} \cong \sum_{j=1}^{M} w_{ij}^{*}\, U(\eta; y_{obs,i}, y_{mis,i}^{(j)}), \tag{2.6} \]

where $y_{mis,i}^{(j)}$, for $j = 1, \ldots, M$, are $M$ imputed values for $y_{mis,i}$ (if $y_i$ is completely observed, $y_{mis,i}^{(j)} \equiv y_{mis,i}$), and $w_{ij}^{*}$ are the fractional weights that satisfy $w_{ij}^{*} \ge 0$, $\sum_{j=1}^{M} w_{ij}^{*} = 1$ and

\[ \sum_{i \in A} w_i \sum_{j=1}^{M} w_{ij}^{*}\, S(\hat\theta; y_{obs,i}, y_{mis,i}^{(j)}) = 0. \]

Once the FI data are constructed, the FI estimator of $\eta$ is obtained by solving

\[ \sum_{i \in A} w_i \sum_{j=1}^{M} w_{ij}^{*}\, U(\eta; y_{obs,i}, y_{mis,i}^{(j)}) = 0. \tag{2.7} \]

In general, the FI method augments the original data set as

\[ S_{FI} = \left\{ \delta_i (w_i, y_i) + (1 - \delta_i)(w_i w_{ij}^{*}, y_{ij}^{*});\ j = 1, \ldots, M,\ i \in A \right\}, \tag{2.8} \]

where $\delta_i$ is the indicator of full response for $y_i$, and $y_{ij}^{*} = (y_{obs,i}, y_{mis,i}^{(j)})$. If (2.6) holds for an arbitrary $U$ function, the resulting estimator is approximately unbiased for a fairly large class of parameters, which makes the imputation attractive for general-purpose estimation. Kim (2011) used the importance sampling technique to satisfy (2.6) for general $U$ functions, which is presented in the next section.

3. PARAMETRIC FRACTIONAL IMPUTATION

Parametric Fractional Imputation (PFI), proposed by Kim (2011), features a parametric model for fractional imputations, and the parameters in the imputation model are estimated by a computationally efficient EM algorithm.

To compute the conditional estimating equation in (2.3) by PFI, for each missing value $y_{mis,i}$, generate $M$ imputed values, denoted by $\{y_{mis,i}^{(1)}, \ldots, y_{mis,i}^{(M)}\}$, from a proposal distribution $h(y_{mis,i} \mid y_{obs,i})$. How to choose a proposal distribution is discussed in Section 6.2. Once the imputed values are generated from $h(\cdot)$, compute

\[ w_{ij}^{*} \propto \frac{f(y_{mis,i}^{(j)} \mid y_{obs,i}; \hat\theta)}{h(y_{mis,i}^{(j)} \mid y_{obs,i})}, \]

subject to $\sum_{j=1}^{M} w_{ij}^{*} = 1$, as the fractional weights assigned to $y_{ij}^{*} = (y_{obs,i}, y_{mis,i}^{(j)})$, where $\hat\theta$ is the pseudo MLE of $\theta$ to be determined by the EM algorithm below. Since $\sum_{j=1}^{M} w_{ij}^{*} = 1$, the above fractional weight is the same as $w_{ij}^{*} = w_{ij}^{*}(\hat\theta)$, where

\[ w_{ij}^{*}(\theta) \propto \frac{f(y_{obs,i}, y_{mis,i}^{(j)}; \theta)}{h(y_{mis,i}^{(j)} \mid y_{obs,i})}, \tag{3.1} \]

which only requires knowledge of the joint distribution $f(y; \theta)$ and the proposal distribution $h$. The pseudo MLE of $\theta$ can be computed by solving the imputed mean score equation,

\[ \sum_{i \in A} w_i \sum_{j=1}^{M} w_{ij}^{*}(\theta)\, S(\theta; y_{obs,i}, y_{mis,i}^{(j)}) = 0. \tag{3.2} \]

To solve (3.2), we can either use the Newton method or the following EM algorithm:

I-step. For each missing value $y_{mis,i}$, $M$ imputed values are generated from a proposal distribution $h(y_{mis,i} \mid y_{obs,i})$.

W-step. Using the current value of the parameter estimates $\hat\theta_{(t)}$, compute the fractional weights as $w_{ij(t)}^{*} \propto f(y_{obs,i}, y_{mis,i}^{(j)}; \hat\theta_{(t)}) / h(y_{mis,i}^{(j)} \mid y_{obs,i})$, subject to $\sum_{j=1}^{M} w_{ij(t)}^{*} = 1$.

M-step. Update the parameter $\hat\theta_{(t+1)}$ by solving the imputed score equation,

\[ \sum_{i \in A} w_i \sum_{j=1}^{M} w_{ij(t)}^{*}\, S(\theta; y_{ij}^{*}) = 0, \]

where $y_{ij}^{*} = (y_{obs,i}, y_{mis,i}^{(j)})$ and $S(\theta; y) = \partial \log f(y; \theta)/\partial\theta$ is the score function of $\theta$.

Iteration. Set $t = t + 1$ and go to the W-step. Stop if $\hat\theta_{(t+1)}$ meets the convergence criterion.

Here, the I-step is the imputation step, the W-step is the weighting step, and the M-step is the maximization step. The I- and W-steps can be combined to implement the E-step of the EM algorithm. Unlike the Monte Carlo EM (MCEM) method, the imputed values are not changed at each EM iteration; only the fractional weights are changed. Thus, the FI method has computational advantages over the MCEM method. Convergence is achieved because the imputed values are not changed. Kim (2011) showed that, given the $M$ imputed values $y_{mis,i}^{(1)}, \ldots, y_{mis,i}^{(M)}$, the sequence of estimators $\{\hat\theta_{(0)}, \hat\theta_{(1)}, \ldots\}$ from the W- and M-steps converges to a stationary point $\hat\theta_M^{*}$ for fixed $M$. The stationary point $\hat\theta_M^{*}$ converges to the pseudo MLE of $\theta$ as $M \to \infty$. The resulting weight $w_{ij}^{*}$ after convergence is the fractional weight assigned to $y_{ij}^{*} = (y_{obs,i}, y_{mis,i}^{(j)})$.

We may add an additional step to monitor the distribution of the fractional weights so that no extremely large fractional weight dominates the others.

Once the fractionally imputed data set is constructed from the above steps, it can be used to estimate other parameters of interest. That is, we can use (2.7) to estimate $\eta$ from the FI data set.

We now consider a bivariate missing data example to illustrate the use of the EM algorithm in FI.

Example 1. Suppose a probability sample consists of $n$ units of $z_i = (x_i, y_{1i}, y_{2i})$ with sampling weight $w_i$, where $x_i$ is always observed and $y_i = (y_{1i}, y_{2i})$ is subject to missingness. Let $A_{11}$, $A_{10}$, $A_{01}$, and $A_{00}$ be the partition of the sample based on the missing pattern, where the subscript 1/0 in the $k$-th position denotes that the $k$-th $y$ item is observed/missing, respectively. For example, $A_{10}$ is the set of sample units with $y_1$ observed and $y_2$ missing. The conditional expectation in (2.3) involves evaluating the conditional distribution of $y_{mis,i}$ given the observed data $x_i$ and $y_{obs,i}$ for each missing pattern,

which is then decomposed into

\[ \sum_{i \in A} w_i E\{U(\eta; z_i) \mid x_i, y_{obs,i}\} = \sum_{i \in A_{11}} w_i U(\eta; x_i, y_{1i}, y_{2i}) + \sum_{i \in A_{00}} w_i E\{U(\eta; x_i, Y_{1i}, Y_{2i}) \mid x_i\} + \sum_{i \in A_{01}} w_i E\{U(\eta; x_i, Y_{1i}, y_{2i}) \mid x_i, y_{2i}\} + \sum_{i \in A_{10}} w_i E\{U(\eta; x_i, y_{1i}, Y_{2i}) \mid x_i, y_{1i}\}. \]

Suppose the joint distribution in (2.1) is

\[ f(x, y_1, y_2; \theta) = f_x(x; \theta_0)\, f_1(y_1 \mid x; \theta_1)\, f_2(y_2 \mid x, y_1; \theta_2). \tag{3.3} \]

From the full respondent sample in $A_{11}$, obtain $\hat\theta_{1(0)}$ and $\hat\theta_{2(0)}$, which are initial parameter estimates for $\theta_1$ and $\theta_2$. In the I-step, for each missing value $y_{mis,i}$, generate $M$ imputed values from $h(y_{mis,i} \mid x_i, y_{obs,i}) = f(y_{mis,i} \mid x_i, y_{obs,i}; \hat\theta_{(0)})$, where

\[ f(y_{mis,i} \mid x_i, y_{obs,i}; \hat\theta_{(0)}) = \begin{cases} f_2(y_{2i} \mid x_i, y_{1i}; \hat\theta_{2(0)}) & \text{if } i \in A_{10}, \\ f(y_{1i} \mid x_i, y_{2i}; \hat\theta_{(0)}) & \text{if } i \in A_{01}, \\ f(y_{1i}, y_{2i} \mid x_i; \hat\theta_{(0)}) & \text{if } i \in A_{00}, \end{cases} \tag{3.4} \]

and

\[ f(y_{1i} \mid x_i, y_{2i}; \hat\theta_{(0)}) = \frac{f_1(y_{1i} \mid x_i; \hat\theta_{1(0)})\, f_2(y_{2i} \mid x_i, y_{1i}; \hat\theta_{2(0)})}{\int f_1(y_{1i} \mid x_i; \hat\theta_{1(0)})\, f_2(y_{2i} \mid x_i, y_{1i}; \hat\theta_{2(0)})\, dy_{1i}}. \tag{3.5} \]

Note that the marginal distribution of $x$, $f_x(x; \theta_0)$, is not used in (3.5). Except for some special cases, such as when both $f_1$ and $f_2$ are normal distributions, the conditional distribution in (3.5) is not of a known form. Thus, computational tools such as Metropolis-Hastings (Hastings 1970) or SIR (Sampling Importance Resampling, Smith and Gelfand 1992) are needed to generate samples from (3.5) for $A_{01}$. For example, SIR consists of the following steps:

1. Generate $B$ (say $B = 100$) Monte Carlo samples, denoted by $y_{1i}^{(1)}, \ldots, y_{1i}^{(B)}$, from $f_1(y_{1i} \mid x_i; \hat\theta_{1(0)})$.
2. Among the $B$ samples obtained from Step 1, select one sample with selection probability proportional to $f_2(y_{2i} \mid x_i, y_{1i}^{(k)}; \hat\theta_{2(0)})$, where $y_{1i}^{(k)}$ is the $k$-th sample from Step 1 ($k = 1, \ldots, B$).
3. Repeat Step 1 and Step 2 independently $M$ times to obtain $M$ imputed values.

Once we obtain $M$ imputed values of $y_{1i}$, we can use $h(y_{1i} \mid x_i, y_{2i}) \propto f_1(y_{1i} \mid x_i; \hat\theta_{1(0)})\, f_2(y_{2i} \mid x_i, y_{1i}; \hat\theta_{2(0)})$ as the proposal density in (3.4). Since $\sum_{j=1}^{M} w_{ij}^{*} = 1$, we do not need to compute the normalizing constant in (3.5). For $A_{10}$, $M$ imputed values of $y_{2i}$ are generated from $f_2(y_{2i} \mid x_i, y_{1i}; \hat\theta_{2(0)})$.
For $A_{00}$, $M$ imputed values of $y_{1i}$ are generated from $f_1(y_{1i} \mid x_i; \hat\theta_{1(0)})$, and then $M$ imputed values of $y_{2i}$ are generated from $f_2(y_{2i} \mid x_i, y_{1i}^{(j)}; \hat\theta_{2(0)})$, where $y_{1i}^{(j)}$ is the $j$-th imputed value of $y_{1i}$.
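The I-, W- and M-steps of PFI can be sketched on a deliberately simplified problem. The sketch below is an assumption-laden toy, not the authors' implementation: it uses a single study variable with $y \mid x \sim N(\beta_0 + \beta_1 x, \sigma^2)$, a response mechanism that depends on $x$ only, equal sampling weights, and the complete-case fit as the proposal $h$. All variable names and the simulated data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y | x ~ N(b0 + b1 * x, s2); y is missing at random given x.
n, M = 400, 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
observed = rng.random(n) < 1.0 / (1.0 + np.exp(-(0.5 + x)))
w = np.ones(n)  # sampling weights w_i (equal here for simplicity)

def normal_logpdf(v, mean, s2):
    return -0.5 * np.log(2.0 * np.pi * s2) - 0.5 * (v - mean) ** 2 / s2

def weighted_fit(xv, yv, wt):
    """Solve the weighted normal score equations (weighted least squares)."""
    X = np.column_stack([np.ones_like(xv), xv])
    beta = np.linalg.solve(X.T @ (wt[:, None] * X), X.T @ (wt * yv))
    s2 = np.sum(wt * (yv - X @ beta) ** 2) / np.sum(wt)
    return beta, s2

# Initial estimates from the full respondents.
beta0, s20 = weighted_fit(x[observed], y[observed], w[observed])

# I-step: draw M values per missing unit from the proposal h = f(. | x; theta_(0)).
xm = x[~observed]
mean0 = beta0[0] + beta0[1] * xm
y_imp = mean0[:, None] + np.sqrt(s20) * rng.normal(size=(xm.size, M))
log_h = normal_logpdf(y_imp, mean0[:, None], s20)

# Augmented data: respondents followed by M imputed rows per nonrespondent.
x_aug = np.concatenate([x[observed], np.repeat(xm, M)])
y_aug = np.concatenate([y[observed], y_imp.ravel()])

beta, s2 = beta0.copy(), s20
for t in range(200):
    # W-step: fractional weights proportional to f / h, normalised within each unit.
    log_ratio = normal_logpdf(y_imp, (beta[0] + beta[1] * xm)[:, None], s2) - log_h
    log_ratio -= log_ratio.max(axis=1, keepdims=True)   # numerical stability
    w_star = np.exp(log_ratio)
    w_star /= w_star.sum(axis=1, keepdims=True)
    # M-step: solve the imputed score equation on the weighted augmented data.
    w_aug = np.concatenate([w[observed], (w[~observed][:, None] * w_star).ravel()])
    beta_new, s2_new = weighted_fit(x_aug, y_aug, w_aug)
    change = max(np.max(np.abs(beta_new - beta)), abs(s2_new - s2))
    beta, s2 = beta_new, s2_new
    if change < 1e-8:
        break

# FI estimate of the population mean of y, solving (2.7) with U = eta - y.
eta_hat = np.sum(w_aug * y_aug) / np.sum(w_aug)
```

Note that the imputed values `y_imp` are drawn once and never regenerated; only `w_star` changes across iterations, which is the feature that distinguishes this scheme from Monte Carlo EM.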

In the W-step, the fractional weights are computed by

\[ w_{ij(t)}^{*} \propto \frac{f_1(y_{1i}^{(j)} \mid x_i; \hat\theta_{1(t)})\, f_2(y_{2i}^{(j)} \mid x_i, y_{1i}^{(j)}; \hat\theta_{2(t)})}{h(y_{mis,i}^{(j)} \mid x_i, y_{obs,i})} \]

with $\sum_{j=1}^{M} w_{ij(t)}^{*} = 1$, where $y_{1i}^{(j)} = y_{1i}$ if $y_{1i}$ is observed and $y_{2i}^{(j)} = y_{2i}$ if $y_{2i}$ is observed.

The above example covers a broad range of applications in the missing data literature, such as missing covariate problems, measurement error models, generalized linear mixed models, and so on. Yang and Kim (2014) considered regression analyses with missing covariates in survey data using FI, where, in the current notation, $f(y_2 \mid x, y_1)$ is a regression model with $y_2$ and $x$ fully observed and $y_1$ subject to missingness. In generalized linear mixed models, $f(y_2 \mid x, y_1)$ is a generalized linear mixed model where $y_1$ is the latent random effect. See Yang, Kim, and Zhu (2013) for the use of FI to estimate parameters in generalized linear mixed models.

For variance estimation, note that the imputed estimator $\hat\eta_{FI}$ obtained from the imputed estimating equation (2.7) depends on $\hat\theta$ obtained from (3.2). To reflect this dependence, we can write $\hat\eta_{FI} = \hat\eta_{FI}(\hat\theta)$. To account for the sampling variability of $\hat\theta$ in the imputed estimator $\hat\eta_{FI}$, either the linearization method or replication methods can be used. In the linearization method, the imputation model is needed in order to compute partial derivatives of the score functions. To avoid disclosing the imputation model, replication methods are often preferred (Rao and Shao 1992).

To implement replication variance estimation in FI, we first obtain the $k$-th replicate pseudo MLE $\hat\theta^{[k]}$ of $\hat\theta$ by solving

\[ \bar S^{*[k]}(\theta) \equiv \sum_{i \in A} w_i^{[k]} \sum_{j=1}^{M} w_{ij}^{*}(\theta)\, S(\theta; y_{ij}^{*}) = 0, \tag{3.6} \]

where $w_i^{[k]}$ is the $k$-th replication weight and $w_{ij}^{*}(\theta)$ is defined in (3.1). To obtain $\hat\theta^{[k]}$ from (3.6), either the EM algorithm or the one-step Newton method can be used. The EM algorithm can be implemented in the same way as before. For the one-step Newton method, we have

\[ \hat\theta^{[k]} = \hat\theta - \left\{ \frac{\partial}{\partial\theta^{T}} \bar S^{*[k]}(\hat\theta) \right\}^{-1} \sum_{i \in A} w_i^{[k]} \sum_{j=1}^{M} w_{ij}^{*}(\hat\theta)\, S(\hat\theta; y_{ij}^{*}), \]

where

\[ \frac{\partial}{\partial\theta^{T}} \bar S^{*[k]}(\theta) = \sum_{i \in A} w_i^{[k]} \sum_{j=1}^{M} w_{ij}^{*}(\theta)\, \dot S(\theta; y_{ij}^{*}) + \sum_{i \in A} w_i^{[k]} \sum_{j=1}^{M} w_{ij}^{*}(\theta) \left\{ S(\theta; y_{ij}^{*}) - \sum_{l=1}^{M} w_{il}^{*}(\theta)\, S(\theta; y_{il}^{*}) \right\}^{\otimes 2}, \]

with $\dot S(\theta; y) = \partial S(\theta; y)/\partial\theta^{T}$ and $B^{\otimes 2} = BB^{T}$. Once $\hat\theta^{[k]}$ is obtained, we obtain the $k$-th replicate $\hat\eta^{[k]}$ of $\hat\eta$ by solving

\[ \sum_{i \in A} w_i^{[k]} \sum_{j=1}^{M} w_{ij}^{*[k]}\, U(\eta; y_{ij}^{*}) = 0 \]

for $\eta$, where $w_{ij}^{*[k]} = w_{ij}^{*}(\hat\theta^{[k]})$.

4. NONPARAMETRIC FRACTIONAL IMPUTATION

4.1 Fractional Hot Deck Imputation

Hot deck imputation uses observed responses from the sample as imputed values. The unit with a missing value is called the recipient and the unit providing the value for the imputation is called the donor. Durrant (2009), Haziza (2009) and Andridge and Little (2010) provided comprehensive overviews of hot deck imputation in survey sampling. The attractive features of hot deck imputation include the following. First, unlike model-based imputation methods that generate artificial imputed values, in hot deck imputation only plausible values can be imputed, and therefore distributional properties of the data are preserved. For example, imputed values for categorical variables will also be categorical, as observed from the respondents. Second, compared to fully parametric methods, hot deck imputation makes fewer or no distributional assumptions and is therefore more robust. For these reasons, hot deck imputation is a widely used imputation method, especially in household surveys.

Fractional hot deck imputation (FHDI) combines the ideas of FI and hot deck imputation. It is efficient (due to FI), and it inherits the aforementioned good properties of hot deck imputation. Kim and Fuller (2004), Fuller and Kim (2005), and Kim and Yang (2014) considered FHDI for univariate missing data. We now describe a multivariate FHDI procedure to deal with missing data with an arbitrary missing pattern (Im et al. 2015).

We first consider categorical data. Let $z = (z_1, \ldots, z_K)$ be the vector of study variables that take categorical values. Let $z_i = (z_{i1}, \ldots, z_{iK})$ be the $i$-th realization of $z$. Let $\delta_{ij}$ be the response indicator variable for $z_{ij}$. That is, $\delta_{ij} = 1$ if $z_{ij}$ is observed and $\delta_{ij} = 0$ otherwise. Assume that the response mechanism is MAR.
Based on $\delta_i = (\delta_{i1}, \ldots, \delta_{iK})$, the original observation $z_i$ can be decomposed into $(z_{obs,i}, z_{mis,i})$, which are the observed and missing parts of $z_i$, respectively. Let $D_i = \{z_{mis,i}^{(1)}, \ldots, z_{mis,i}^{(M_i)}\}$ be the set of all possible values of $z_{mis,i}$, that is, $(z_{obs,i}, z_{mis,i}^{(j)})$ is one of the values actually observed among the respondents, for $j = 1, \ldots, M_i$, with $M_i > 0$. If all of the $M_i$ possible values are taken as the imputed values for $z_{mis,i}$, the fractional weight assigned to the $j$-th imputed value $z_{mis,i}^{(j)}$ is

\[ w_{ij}^{*} = \frac{\pi(z_{obs,i}, z_{mis,i}^{(j)})}{\sum_{k \in D_i} \pi(z_{obs,i}, z_{mis,i}^{(k)})}, \tag{4.1} \]

where $\pi(z)$ is the joint probability of $z$. If the joint probability is nonparametrically modeled, it is computed by

\[ \pi(z) = \frac{\sum_{i \in A} w_i \sum_{j \in D_i} w_{ij}^{*}\, I\{(z_{obs,i}, z_{mis,i}^{(j)}) = z\}}{\sum_{i \in A} w_i}, \tag{4.2} \]

where $z_{mis,i}^{(j)} \equiv z_{mis,i}$ and $w_{ij}^{*} = M_i^{-1}$, for $j = 1, \ldots, M_i$, if $z_i$ is completely observed. To compute (4.1) and (4.2), the EM algorithm by weighting (Ibrahim 1990) can be used, with the initial values of the fractional weights being $w_{ij(0)}^{*} = M_i^{-1}$. Equations (4.1) and (4.2) correspond to the E-step and M-step of the EM algorithm, respectively. The M-step (4.2) can be changed if there is a parametric model for the joint probability $\pi(z)$. For example, if the joint probability can be modeled by a multinomial distribution with parameter $\alpha$, say $\pi(z; \alpha)$, then the M-step replaces (4.2) with solving the imputed score equation of $\alpha$ to update the estimate of $\alpha$.

For continuous data $y_i = (y_{i1}, \ldots, y_{iK})$, we consider a discrete approximation. Discretize each continuous variable by dividing its range into a small finite number of segments (for example, by quantiles). Let $z_{ik}$ denote the discrete version of $y_{ik}$. Note that $z_{ik}$ is observed only if $y_{ik}$ is observed. Let the support of $z$, denoted by $\{z_1, \ldots, z_G\}$, which is the same as the sample support of $z$ from the full respondents, specify the donor cells. The joint probability of $z$, denoted by $\pi(z_g)$, for $g = 1, \ldots, G$, can be obtained by the EM algorithm for categorical missing data as described above.

As in the categorical missing data problem, let $D_i = \{z_{mis,i}^{(1)}, \ldots, z_{mis,i}^{(M_i)}\}$ be the set of all possible values of $z_{mis,i}$. Using a finite mixture model, a nonparametric approximation of $f(y_{mis,i} \mid y_{obs,i})$ is

\[ f(y_{mis,i} \mid y_{obs,i}) \cong \sum_{j=1}^{M_i} P(z_i = z_i^{(j)} \mid y_{obs,i})\, f(y_{mis,i} \mid z_i^{(j)}). \tag{4.3} \]

Each $z_i^{(j)} = (z_{obs,i}, z_{mis,i}^{(j)})$ defines an imputation cell. The approximation in (4.3) is based on the assumption that

\[ P(y_{mis} \mid y_{obs}, z) = P(y_{mis} \mid z), \tag{4.4} \]

which requires (approximate) conditional independence between $y_{mis}$ and $y_{obs}$ given $z$. Thus, we assume that the covariance structure between items is captured by the discrete approximation and that the within-cell errors can be safely assumed to be independent.
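The EM-by-weighting iteration between (4.1) and (4.2) for categorical data can be sketched as follows. This is a minimal illustration on a hypothetical two-item binary data set with equal sampling weights, with `None` marking a missing item; the names `donors`, `pi` and `w_star` are choices made for this sketch only.

```python
import numpy as np

# Hypothetical toy sample of z_i = (z_i1, z_i2); None marks a missing item.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0),
        (0, None), (1, None), (None, 1), (None, 0)]
w = np.ones(len(data))  # sampling weights w_i

# Sample support of z among the full respondents (the donor cells).
support = sorted({z for z in data if None not in z})

def donors(z):
    """Cells in the support that agree with the observed part of z (the set D_i)."""
    return [c for c in support if all(a is None or a == b for a, b in zip(z, c))]

D = [donors(z) for z in data]
# Initial fractional weights w*_ij(0) = 1 / M_i.
w_star = [np.full(len(d), 1.0 / len(d)) for d in D]

for _ in range(500):
    # M-step (4.2): weighted relative frequency of each cell in the augmented data.
    pi = {c: 0.0 for c in support}
    for wi, d, ws in zip(w, D, w_star):
        for cell, frac in zip(d, ws):
            pi[cell] += wi * frac
    pi = {c: v / w.sum() for c, v in pi.items()}
    # E-step (4.1): renormalise the cell probabilities over each unit's donor set.
    new_w_star = []
    for d in D:
        p = np.array([pi[c] for c in d])
        new_w_star.append(p / p.sum())
    change = max(np.max(np.abs(a - b)) for a, b in zip(new_w_star, w_star))
    w_star = new_w_star
    if change < 1e-12:
        break
```

Fully observed units have a single donor cell (their own), so their fractional weight stays at one throughout; only the units with missing items have their weights redistributed across compatible cells.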
Once the imputation cells are formed to satisfy (4.4), we select $m_g$ imputed values for $y_{mis,i}$, denoted by $y_i^{(j)} = (y_{obs,i}, y_{mis,i}^{(j)})$, for $j = 1, \ldots, m_g$, randomly from the full respondents in the same cell, with selection probability proportional to the sampling weights. The final fractional weight assigned to $y_i^{(j)}$ is $w_{ij}^{*} = \hat P(z_{mis,i}^{(j)} \mid y_{obs,i})\, m_g^{-1}$.

This FHDI procedure resembles two-phase stratified sampling (Rao 1973; Kim et al. 2006), where forming the imputation cells corresponds to stratification (phase one) and conducting hot deck imputation corresponds to stratified sampling (phase two). For more details, see Im, Kim, and Fuller (2015).

If we select all possible donors in the same cell, the resulting FI estimator is fully efficient in the sense that it does not introduce additional randomness due to hot deck imputation. Such fractional hot deck imputation is called fully efficient fractional imputation (FEFI). The FEFI option is currently available in PROC SURVEYIMPUTE in SAS (SAS Institute Inc. 2015).

4.2 Nonparametric Fractional Imputation Using Kernels

In real-data applications, nonparametric methods are preferred if little is known about the true underlying data model. Hot deck imputation makes fewer or no distributional assumptions and is therefore more robust than fully parametric methods. In what follows, we discuss an alternative way of calculating the fractional weights that links the FI estimator to some well-known nonparametric estimators, such as the Nadaraya-Watson kernel regression estimator (Nadaraya 1964).

For simplicity, suppose we have bivariate data $(x_i, y_i)$ where $x_i$ is completely observed and $y_i$ is subject to missingness. Assume the missing data mechanism is MAR. Let $\delta_i$ be the response indicator that takes the value one if $y_i$ is observed and zero otherwise. We are interested in estimating $\eta$, which is defined through $E\{U(\eta; X, Y)\} = 0$. Let $A_R = \{i \in A;\ \delta_i = 1\}$ be the index set of respondents. To calculate the conditional estimating equation (2.3) nonparametrically, we use the following fractional imputation: for each unit $i$ with $\delta_i = 0$, $r = |A_R|$ imputed values of $y_i$ are taken from $A_R$, denoted by $y_i^{(1)}, \ldots, y_i^{(r)}$, and we compute the kernel-based fractional weights

\[ w_{ij}^{*} = \frac{K_h(x_i - x^{(j)})}{\sum_{k \in A_R} K_h(x_i - x^{(k)})}, \]

where $K_h(\cdot)$ is the kernel function with bandwidth $h$ and $x^{(j)}$ is the covariate associated with $y_i^{(j)}$. The resulting FI estimating equation can be written as

\[ \sum_{i \in A} w_i \left\{ \delta_i\, U(\eta; x_i, y_i) + (1 - \delta_i) \sum_{j \in A_R} w_{ij}^{*}\, U(\eta; x_i, y_i^{(j)}) \right\} = 0, \tag{4.5} \]

where the nonparametric fractional weights measure the degree of similarity based on the distance between $x_i$ and $x^{(j)}$.

The FI estimator uses $\hat U(\eta; x_i) \equiv \sum_{j \in A_R} w_{ij}^{*}\, U(\eta; x_i, y_i^{(j)})$ to approximate $E\{U(\eta; x_i, y_i) \mid x_i\}$ nonparametrically. For fixed $\eta$, $\hat U(\eta; x_i)$ is often called the Nadaraya-Watson kernel regression estimator of $E\{U(\eta; x_i, y_i) \mid x_i\}$ in the nonparametric estimation framework. Note that this FI estimator does not rely on any parametric model assumptions and so is nonparametric; however, it is not assumption free, because it makes an implicit assumption of the continuity of $E\{U(\eta; x, y) \mid x\}$ through the choice of kernels that define the similarity (Nadaraya 1964).
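The kernel-based fractional weights and the estimating equation (4.5) can be illustrated for the population mean, $U(\eta; x, y) = \eta - y$. The sketch below is a toy under stated assumptions: simulated data, equal sampling weights, a Gaussian kernel, and an ad hoc bandwidth `h`; none of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: x always observed, y missing at random given x.
n = 500
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * x) + x + rng.normal(scale=0.2, size=n)
delta = rng.random(n) < 0.4 + 0.5 * x     # response indicator delta_i
w = np.ones(n)                            # sampling weights w_i

def gaussian_kernel(u, h):
    return np.exp(-0.5 * (u / h) ** 2)

h = 0.05                                  # bandwidth, chosen ad hoc for this sketch
x_resp, y_resp = x[delta], y[delta]       # donors: all respondents in A_R

# Fractional weights w*_ij = K_h(x_i - x_j) / sum_k K_h(x_i - x_k), one row per nonrespondent.
K = gaussian_kernel(x[~delta][:, None] - x_resp[None, :], h)
w_star = K / K.sum(axis=1, keepdims=True)

# With U = eta - y, equation (4.5) has a closed-form solution: a weighted mean of
# observed y's and of the kernel-weighted donor averages for the nonrespondents.
imputed_mean = w_star @ y_resp
eta_fi = (np.sum(w[delta] * y_resp) + np.sum(w[~delta] * imputed_mean)) / np.sum(w)

# Complete-case mean, for comparison only (ignores the nonrespondents).
eta_cc = np.sum(w[delta] * y_resp) / np.sum(w[delta])
```

Each row of `w_star` is exactly the set of Nadaraya-Watson weights at the nonrespondent's covariate value, so `imputed_mean` is the kernel regression estimate of $E(y \mid x_i)$.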
Notably, while the convergence of $\hat U(\eta; x_i)$ to $E\{U(\eta; x_i, y_i) \mid x_i\}$ does not achieve the order of $O_p(n^{-1/2})$, the solution $\hat\eta_{FI}$ to (4.5) satisfies $\hat\eta_{FI} - \eta = O_p(n^{-1/2})$ under some regularity conditions, which was proved by Wang and Chen (2009) in the IID setup. Such kernel-based nonparametric fractional imputation is directly applicable to complex survey sampling scenarios. More developments are expected from coupling FI with other nonparametric methods, such as the nearest neighbor imputation method (Chen and Shao 2001; Kitamura et al. 2009; Kim et al. 2011) or predictive mean matching (Vink et al. 2014).

5. SYNTHETIC DATA IMPUTATION

Synthetic imputation is a technique for creating imputed values for unobserved items by incorporating information from other surveys. For example, suppose that there are two independent surveys, called Survey 1 and Survey 2, and we observe $x_i$ from Survey 1 and observe $(x_i, y_i)$ from Survey 2. In this case, we may want to create synthetic values of $y_i$ in Survey 1 by first fitting a model relating $y$ to $x$ to the data from Survey 2 and then predicting the $y$ associated with each $x$ observed in Survey 1. Synthetic imputation is particularly useful when Survey 1 is a large-scale survey and item $y$ is very expensive to measure. Schenker and Raghunathan (2007) reported several applications of synthetic imputation, using

a model-based method to estimate parameters associated with variables not observed in Survey 1 but observed in a much smaller Survey 2. In one application, both self-reported health measurements $x$ and clinical measurements from physical examinations $y$ were observed for a small sample $A_2$ of individuals. In the much larger Survey 1, only the self-reported measurements $x$ were observed. Only the imputed or synthetic data from Survey 1 and the associated survey weights were released to the public.

The setup of two independent samples with common items is often called non-nested two-phase sampling. Two-phase sampling can be treated as a missing data problem, where the missingness is planned and the response probability is known.

5.1 Fractional Imputation for Two-phase Sampling

In two-phase sampling, suppose we observe $x_i$ in the first-phase sample and observe $(x_i, y_i)$ in the second-phase sample, where the second-phase sample is not necessarily nested within the first-phase sample. Let $A_1$ and $w_{i1}$ be the set of indices and the sampling weights for the first-phase sample, respectively. Let $A_2$ and $w_{i2}$ be the corresponding quantities for the second-phase sample. Assume a working model $m(x; \beta)$ for $E(y \mid x)$. For estimation of the population total of $y$, the two-phase regression estimator can be written as

\[ \hat Y_{tp} = \sum_{i \in A_1} w_{i1}\, m(x_i; \hat\beta) + \sum_{i \in A_2} w_{i2} \{ y_i - m(x_i; \hat\beta) \}, \tag{5.1} \]

where the subscript "tp" stands for two-phase, and $\hat\beta$ is estimated from the second-phase sample. The two-phase regression estimator is efficient if the working model is well specified. The first term of (5.1) is called the projection estimator. Note that if the second term of (5.1) is equal to zero, the two-phase regression estimator is equivalent to the projection estimator. Some asymptotic properties of the two-phase estimator and variance estimation methods have been discussed in Kim, Navarro, and Fuller (2006), and Kim and Yu (2011a). Kim and Rao (2012) discussed asymptotic properties of the projection estimator under non-nested two-phase sampling.
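As a numerical illustration of (5.1), the following sketch computes the projection term and the bias-correction term separately under a linear working model fitted by weighted least squares on the second-phase sample. The population size, sample sizes, and simulated data are hypothetical, and the two samples are drawn independently (the non-nested case).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: population size N, large phase-one and small phase-two samples.
N, n1, n2 = 100_000, 2000, 200
x1 = rng.normal(2.0, 1.0, n1)                     # phase one: x only
x2 = rng.normal(2.0, 1.0, n2)                     # phase two: x and y
y2 = 3.0 + 1.5 * x2 + rng.normal(scale=1.0, size=n2)
w1 = np.full(n1, N / n1)                          # sampling weights w_i1
w2 = np.full(n2, N / n2)                          # sampling weights w_i2

def m(xv, beta):
    """Linear working model m(x; beta) for E(y | x)."""
    return beta[0] + beta[1] * xv

# Estimate beta from the second-phase sample by weighted least squares.
X2 = np.column_stack([np.ones(n2), x2])
beta_hat = np.linalg.solve(X2.T @ (w2[:, None] * X2), X2.T @ (w2 * y2))

projection = np.sum(w1 * m(x1, beta_hat))             # first term of (5.1)
correction = np.sum(w2 * (y2 - m(x2, beta_hat)))      # second term of (5.1)
Y_tp = projection + correction
```

Because the working model here contains an intercept and is fitted with the second-phase weights, the weighted residuals sum to zero, so `correction` vanishes up to rounding and `Y_tp` coincides with the projection estimator, as noted in the text.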
In a large-scale survey, it is common practice to produce estimates for domains. Creating an imputed data set for the first-phase sample, often called mass imputation, is one method for incorporating the second-phase information into the first-phase sample. Breidt and Fuller (1996) discussed the possibility of using imputation to get improved estimates for domains. Fuller (2003) investigated mass imputation in the context of two-phase sampling.

The FI procedure can be used to obtain the two-phase regression estimator in (5.1) and, at the same time, improve domain estimation. Note that the two-phase regression estimator (5.1) can be written as

\[ \hat Y_{FEFI} = \sum_{i \in A_1} \sum_{j \in A_2} w_{i1}\, w_{ij}^{*}\, y_i^{(j)}, \tag{5.2} \]

where $y_i^{(j)} = \hat y_i + \hat e_j$, $\hat y_i = m(x_i; \hat\beta)$, $\hat e_j = y_j - \hat y_j$, $w_{ij}^{*} = w_{j2} / (\sum_{k \in A_2} w_{k2})$, and we assume $\sum_{i \in A_1} w_{i1} = \sum_{i \in A_2} w_{i2}$. The expression (5.2) implies that we impute all the elements in the first-phase sample, including the elements that also belong to the second-phase sample. The estimator (5.2) is computed using an

augmented data set of $n_1 \times n_2$ records, where $n_1$ and $n_2$ are the sizes of $A_1$ and $A_2$, respectively, and the $(i, j)$-th record has an (imputed) observation $y_i^{*(j)} = \hat{y}_i + \hat{e}_j$ with weight $w_{i1} w_{ij}^{*}$. That is, for each unit $i \in A_1$, we impute $n_2$ values of $y_i^{*(j)}$ with fractional weight $w_{ij}^{*}$. The method in (5.2) imputes all the elements in $A_2$ and is called the fully efficient fractional imputation (FEFI) method, according to Fuller and Kim (2005). The FEFI estimator is algebraically equivalent to the two-phase regression estimator of the population total of $y$, and can also provide consistent estimates for other parameters such as population quantiles.

If it is desirable to limit the number of imputations to a small value $m$ ($m < n_2$), FI using the regression weighting method in Fuller and Kim (2005) can be adopted. We first select $m$ values of $y_i^{*(j)}$, denoted by $y_i^{**(1)}, \ldots, y_i^{**(m)}$, among the set of $n_2$ imputed values $\{y_i^{*(j)}; j \in A_2\}$ using an efficient sampling method. The fractional weights $w_{ij}^{**}$ assigned to the selected $y_i^{**(j)}$ are determined so that

$$\sum_{j=1}^{m} w_{ij}^{**} \left(1, y_i^{**(j)}\right) = \sum_{j \in A_2} w_{ij}^{*} \left(1, y_i^{*(j)}\right) \tag{5.3}$$

holds for each $i \in A_1$. The fractional weights satisfying (5.3) can be computed using the regression weighting method or the empirical likelihood method; see Section 6.1 for details. The resulting FI data $y_i^{**(j)}$ with weights $w_{i1} w_{ij}^{**}$ are constructed with $n_1 \times m$ records, which integrate available information from two phases. Replication variance estimation with FI, similar to Fuller and Kim (2005), can be developed. See Section 8.7 of Kim and Shao (2013).

5.2 Fractional Imputation for Statistical Matching

Statistical matching is used to integrate two or more data sets when information available for matching records for individual participants across data sets is incomplete. Statistical matching can be viewed as a missing data problem where a researcher wants to perform a joint analysis of variables not jointly observed. Statistical matching techniques can be used to construct fully augmented data files to enable statistically valid data analysis.
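Returning briefly to the FEFI construction of Section 5.1, the sketch below builds the $n_1 \times n_2$ augmented records of (5.2) and checks numerically that their weighted sum reproduces the two-phase regression estimator (5.1). It assumes a hypothetical fitted working model `predict` (a fixed line standing in for $m(x; \hat\beta)$) and toy data with equal weight totals in the two phases; none of these values come from the paper.

```python
"""Sketch of the FEFI augmented data set (5.2) for two-phase sampling."""


def fefi_records(x1, w1, x2, y2, w2, predict):
    """Build the n1 * n2 records (i, j, imputed value, record weight)."""
    total_w2 = sum(w2)
    resid = [y - predict(x) for x, y in zip(x2, y2)]   # e_j = y_j - yhat_j
    frac = [w / total_w2 for w in w2]                  # w*_ij = w_j2 / sum_k w_k2
    records = []
    for i, (xi, wi) in enumerate(zip(x1, w1)):
        yhat_i = predict(xi)
        for j, (ej, fj) in enumerate(zip(resid, frac)):
            records.append((i, j, yhat_i + ej, wi * fj))
    return records


def two_phase_total(x1, w1, x2, y2, w2, predict):
    """Two-phase regression estimator (5.1) computed directly."""
    proj = sum(w * predict(x) for w, x in zip(w1, x1))
    corr = sum(w * (y - predict(x)) for w, x, y in zip(w2, x2, y2))
    return proj + corr


def predict(x):
    """Hypothetical fitted working model m(x; beta-hat)."""
    return 0.5 + 1.8 * x


# Hypothetical toy data; both weight totals equal 60, as (5.2) assumes.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w1 = [10.0] * 6
x2 = [1.5, 3.5, 5.5]
y2 = [3.1, 7.2, 10.8]
w2 = [20.0] * 3

records = fefi_records(x1, w1, x2, y2, w2, predict)
fefi_total = sum(weight * value for _, _, value, weight in records)
tp_total = two_phase_total(x1, w1, x2, y2, w2, predict)

# A domain estimate uses the same records, restricted to units in the domain.
domain_total = sum(wt * v for i, _, v, wt in records if x1[i] <= 3.0)
print(len(records), fefi_total, tp_total, domain_total)
```

Because every first-phase unit carries its own imputed values, domain totals and other statistics can be computed from the same file without refitting the model.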
Table 2. A Simple Data Structure for Matching

            X     Y1    Y2
Sample A    o     o
Sample B    o           o

To simplify the setup, suppose that there are two surveys, Survey A and Survey B, each containing a random sample with partial information about the population. Suppose that we observe $x$ and $y_1$ from the Survey A sample and observe $x$ and $y_2$ from the Survey B sample. Table 2 illustrates a simple data structure for matching. Without loss of generality, consider imputing $y_1$ in Survey B, since imputing $y_2$ in Survey A is symmetric. Under this setup, we can use FI to generate $y_1$ from the conditional distribution of $y_1$ given the observations. That is, we generate $y_1$ from

$$f(y_1 \mid x, y_2) \propto f(y_2 \mid x, y_1)\, f(y_1 \mid x). \tag{5.4}$$
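One way to read (5.4) is as an importance sampling recipe: draw candidate values of $y_1$ from $f(y_1 \mid x)$ and attach fractional weights proportional to $f(y_2 \mid x, y_1)$. The sketch below illustrates this for a single Survey B record. It assumes normal linear models with known, identified parameters (`a`, `s1`, `b`, `s2` are hypothetical values, not estimates from the paper), which allows a comparison with the closed-form conditional mean.

```python
"""Sketch of fractional weights implied by (5.4) for one Survey B record."""
import math
import random


def normal_pdf(v, mean, sd):
    """Density of N(mean, sd^2) at v."""
    return math.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def matching_fractional_weights(x, y2, a, s1, b, s2, M, rng):
    """Draw M candidates y1* ~ f(y1 | x); weight each by f(y2 | x, y1*)."""
    cands = [rng.gauss(a[0] + a[1] * x, s1) for _ in range(M)]
    raw = [normal_pdf(y2, b[0] + b[1] * x + b[2] * c, s2) for c in cands]
    total = sum(raw)
    return cands, [r / total for r in raw]


rng = random.Random(2015)
x, y2 = 1.0, 2.5
a, s1 = (0.0, 1.0), 1.0          # assumed model: y1 | x ~ N(a0 + a1 x, s1^2)
b, s2 = (0.5, 0.5, 1.0), 1.0     # assumed model: y2 | x, y1 ~ N(b0 + b1 x + b2 y1, s2^2)

cands, wts = matching_fractional_weights(x, y2, a, s1, b, s2, 20000, rng)
fi_mean = sum(w * c for w, c in zip(wts, cands))

# Closed-form conditional mean of y1 given (x, y2) under the two normal models.
prec = 1 / s1 ** 2 + b[2] ** 2 / s2 ** 2
exact_mean = ((a[0] + a[1] * x) / s1 ** 2
              + b[2] * (y2 - b[0] - b[1] * x) / s2 ** 2) / prec
print(fi_mean, exact_mean)
```

In an application the parameters are unknown and would be updated inside an EM loop, with the fractional weights recomputed at each iteration.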

Of note, assumptions are needed to identify the parameters in the joint model. For example, Kim, Berg, and Park (2015) used an instrumental variable assumption to identify the model. To generate $y_1$ from (5.4), the EM algorithm by FI can be used. For more details, see Kim, Berg, and Park (2015).

6. FRACTIONAL IMPUTATION VARIANTS

6.1 The Choice of M and Calibration Fractional Imputation

The choice of the imputation size $M$ is a tradeoff between statistical efficiency and computational efficiency: a small $M$ may lead to large variability in the Monte Carlo approximation, whereas a large $M$ may increase computational cost. The magnitude of the imputation error is usually $O(1/\sqrt{M})$, which can be reduced for large $M$. Thus, if computational power allows, the larger $M$, the better. In survey practice, a large imputation size may not be desirable. Thus, instead of releasing a large number of imputed values for each missing item to the public, a subset of the initial imputed values can be selected to reduce the imputation size. In this case, the FI procedure can be developed in three stages. The first stage, called Fully Efficient Fractional Imputation (FEFI), computes the pseudo MLE of the parameters in the superpopulation model with a sufficiently large imputation size $M$, say $M = 1{,}000$. The second stage is the Sampling Stage, which selects a small number $m$ (say, $m = 10$) of imputed values from the set of $M$ imputed values. The third stage is Calibration Weighting, which involves constructing the final fractional weights for the $m$ final imputed values to satisfy some calibration constraints. This procedure can be called Calibration FI. The FEFI step is the same as in the previous section. In what follows, we describe the last two stages in detail.

In the Sampling Stage, a subset of imputed values is selected to reduce the imputation size. For each $i$, we have $M$ imputed values $y_{ij}^{*} = (y_{obs,i}, y_{mis,i}^{*(j)})$ with their fractional weights $w_{ij}^{*}$.
We treat $y_i^{*} = \{y_{ij}^{*}, j = 1, \ldots, M\}$ as a weighted finite population with weight $w_{ij}^{*}$ and use an unequal probability sampling method such as probability-proportional-to-size (PPS) sampling to select a sample of size $m$, say $m = 10$, from $y_i^{*}$ using $w_{ij}^{*}$ as the selection probability. Let $\tilde{y}_{i1}^{*}, \ldots, \tilde{y}_{im}^{*}$ be the $m$ elements sampled from $y_i^{*}$. The initial fractional weights for the sampled $m$ imputed values are given by $w_{ij0}^{*} = m^{-1}$. This set of fractional weights may not necessarily satisfy the imputed score equation

$$\sum_{i \in A} w_i \sum_{j=1}^{m} w_{ij}^{*}\, S(\hat\theta; \tilde{y}_{ij}^{*}) = 0, \tag{6.1}$$

where $\hat\theta$ is the pseudo MLE of $\theta$ computed at the FEFI stage. It is desirable for the solution to the imputed score equation with small $m$ to be equal to the pseudo MLE of $\theta$, which specifies the calibration constraints. At the Calibration Weighting stage, the initial set of weights is modified to satisfy the constraint (6.1). The calibrated fractional weights can be found by the regression weighting technique, which yields fractional weights that satisfy (6.1) and $\sum_{j=1}^{m} w_{ij}^{*} = 1$. The regression fractional weights are constructed by

$$w_{ij}^{*} = w_{ij0}^{*} + w_{ij0}^{*}\, \Delta\, (S_{ij}^{*} - \bar{S}_{i\cdot}^{*}), \tag{6.2}$$

where $S_{ij}^{*} = S(\hat\theta; \tilde{y}_{ij}^{*})$, $\bar{S}_{i\cdot}^{*} = \sum_{j=1}^{m} w_{ij0}^{*} S_{ij}^{*}$, and

$$\Delta = -\Big\{ \sum_{i \in A} w_i \sum_{j=1}^{m} w_{ij0}^{*} S_{ij}^{*} \Big\}^{T} \Big\{ \sum_{i \in A} w_i \sum_{j=1}^{m} w_{ij0}^{*} (S_{ij}^{*} - \bar{S}_{i\cdot}^{*})^{\otimes 2} \Big\}^{-1}.$$

Note that some of the fractional weights computed by (6.2) can take negative values. To avoid negative weights, alternative algorithms other than regression weighting should be used. For example, the fractional weights of the form

$$w_{ij}^{*} = \frac{w_{ij0}^{*} \exp(\Delta S_{ij}^{*})}{\sum_{k=1}^{m} w_{ik0}^{*} \exp(\Delta S_{ik}^{*})}$$

are approximately equal to the regression fractional weights in (6.2) and are always positive.

6.2 The Choice of the Proposal Distribution

PFI is based on sampling from an importance sampling density $h$ called the proposal distribution. The choice of the proposal distribution is somewhat arbitrary. However, with finite samples and imputations, a well-specified proposal distribution may improve the performance of the imputation estimator. There are a number of ways to specify the proposal distribution and to assess the goodness of specification. For a planned parameter, e.g., $\eta$, the population mean of $y$, Kim (2011) showed that the optimal $h$, which makes the Monte Carlo approximation variance of $\bar{y}_i^{*} = \sum_{j=1}^{M} w_{ij}^{*} y_{ij}^{*}$ as small as possible, is given by

$$h^{*}(y_{mis,i} \mid y_{obs,i}) = f(y_{mis,i} \mid y_{obs,i}, \hat\theta)\, \frac{\left| y_i - E\{y_i \mid y_{obs,i}, \hat\theta\} \right|}{E\left\{ \left| y_i - E\{y_i \mid y_{obs,i}, \hat\theta\} \right| \,\middle|\, y_{obs,i}, \hat\theta \right\}},$$

where $\hat\theta$ is the MLE of $\theta$. For general-purpose estimation, where $\eta$ is often unknown at the time of imputation (Fay 1992), $h(y_{mis,i} \mid y_{obs,i}) = f(y_{mis,i} \mid y_{obs,i}; \hat\theta)$ is a reasonable choice in terms of statistical efficiency. For importance sampling, since we do not know $\hat\theta$ at the outset of the EM algorithm, we may want to have a good initial guess $\theta_0$ and use $h(y_{mis,i} \mid x_i, y_{obs,i}) = f(y_{mis,i} \mid x_i, y_{obs,i}; \theta_0)$. If we do not have a good initial guess of the true value of $\theta$, we can use a prior distribution $\pi(\theta)$ to get $h(y_{mis,i} \mid y_{obs,i}) = \int f(y_{mis,i} \mid y_{obs,i}; \theta)\, \pi(\theta)\, d\theta$.

We now discuss a special choice of the proposal distribution $h$, based on the realized values of the variables having missing values, which is akin to hot deck imputation.
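Before turning to that special choice, the Calibration Weighting stage of Section 6.1 can be illustrated with a small sketch. It assumes a scalar score, so that $\Delta$ reduces to a ratio of two sums, and uses hypothetical score values and sampling weights. The regression weights satisfy the calibration constraint (6.1) exactly, while the exponential form only approximates it but keeps every weight positive.

```python
"""Sketch of calibrated fractional weights (6.2) for a scalar score."""
import math


def weighted_score(scores, samp_w, frac_w):
    """Left-hand side of (6.1): sum_i w_i sum_j w*_ij S_ij."""
    return sum(w * sum(f * s for f, s in zip(frow, srow))
               for w, frow, srow in zip(samp_w, frac_w, scores))


def regression_weights(scores, samp_w):
    """Return (delta, weights) with w_ij = w0 + w0 * delta * (S_ij - Sbar_i)."""
    m = len(scores[0])
    w0 = 1.0 / m                                     # initial weights 1/m
    sbar = [sum(w0 * s for s in row) for row in scores]
    t = sum(w * sb for w, sb in zip(samp_w, sbar))
    v = sum(w * sum(w0 * (s - sb) ** 2 for s in row)
            for w, row, sb in zip(samp_w, scores, sbar))
    delta = -t / v                                   # scalar version of Delta
    weights = [[w0 * (1.0 + delta * (s - sb)) for s in row]
               for row, sb in zip(scores, sbar)]
    return delta, weights


def tilted_weights(scores, delta):
    """Positive alternative: w_ij proportional to w0 * exp(delta * S_ij)."""
    out = []
    for row in scores:
        raw = [math.exp(delta * s) for s in row]     # w0 cancels in the ratio
        total = sum(raw)
        out.append([r / total for r in raw])
    return out


# Hypothetical scores S(theta-hat; y_ij) for 3 units with m = 3 sampled values.
scores = [[0.4, -0.1, 0.3], [-0.5, 0.2, 0.1], [0.6, -0.2, 0.0]]
samp_w = [1.0, 2.0, 1.5]

initial = [[1.0 / 3.0] * 3 for _ in scores]
delta, reg_w = regression_weights(scores, samp_w)
exp_w = tilted_weights(scores, delta)

print(weighted_score(scores, samp_w, initial))   # nonzero before calibration
print(weighted_score(scores, samp_w, reg_w))     # numerically zero
print(weighted_score(scores, samp_w, exp_w))     # small, not exactly zero
```

For a vector-valued score the same construction applies with $\Delta$ computed from the matrix expression given after (6.2).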
Without loss of generality, assume that $y_i$ is observed in the first $r$ elements, $y_i$ is missing in the remaining $(n - r)$ elements, and $x_i$ is completely observed in the sample. Using the importance sampling idea, we assign a fractional weight to donor $y_j$ ($1 \le j \le r$) for the missing item $y_i$ ($r + 1 \le i \le n$) by choosing $h(y_j) = f(y_j \mid \delta_j = 1)$. In calculating the fractional weights, we approximate $f(y_j \mid \delta_j = 1)$ by its empirical distribution $n_R^{-1} \sum_{k=1}^{n} \delta_k f(y_j \mid x_k)$, where $n_R$ is the number of respondents. The EM algorithm takes the following steps:

I-step: For each missing value $y_i$, $i = r + 1, \ldots, n$, take all values in $A_R = \{y_1, \ldots, y_r\}$ as donors.

W-step: With the current estimate of $\theta$, denoted by $\hat\theta^{(t)}$, compute the fractional weights by

$$w_{ij(t)}^{*} \propto \frac{f(y_j \mid x_i; \hat\theta^{(t)})}{\sum_{k \in A_R} w_k\, f(y_j \mid x_k; \hat\theta^{(t)})}. \tag{6.3}$$

M-step: Update the parameter $\hat\theta^{(t+1)}$ by solving the following imputed score equation,

$$\hat\theta^{(t+1)}: \text{ solution to } \sum_{i=1}^{r} S(\theta; x_i, y_i) + \sum_{i=r+1}^{n} \sum_{j=1}^{r} w_{ij(t)}^{*}\, S(\theta; x_i, y_j) = 0.$$

Iteration: Set $t = t + 1$ and go to the W-step. Stop if $\hat\theta^{(t+1)}$ meets the convergence criterion.

The semiparametric fractional imputation (SFI) estimator of $\bar{Y}$ is

$$\hat{\bar{Y}}_{SFI} = \frac{1}{n} \Big\{ \sum_{i=1}^{r} y_i + \sum_{i=r+1}^{n} \sum_{j=1}^{r} w_{ij}^{*}\, y_j \Big\}.$$

Kim and Yang (2014) showed that the resulting estimator gains robustness: it is less sensitive to departures from the assumed conditional regression model.

6.3 Doubly Robust Fractional Imputation

Suppose we have bivariate data $(x_i, y_i)$ where $x_i$ is completely observed, $y_i$ is subject to missingness, and the missing data mechanism is MAR. Assume also an outcome regression (OR) model, given by $E(y_i \mid x_i) = m(x_i; \beta_0)$, and a response propensity (RP) model, given by $P(\delta_i = 1 \mid x_i, y_i) = P(\delta_i = 1 \mid x_i) = \pi(x_i; \phi_0)$. Denote the set of respondents as $A_R = \{i; \delta_i = 1\}$, where $\delta_i$ is the response indicator of $y_i$. We are interested in the population total $\eta = \sum_{i=1}^{N} y_i$. Note that the OR and RP models are not both needed to construct consistent estimators of $\eta$. For example, $\hat\eta_1 = \sum_{i \in A} w_i\, m(x_i; \hat\beta)$, with $\hat\beta$ being a consistent estimator of $\beta_0$, is consistent for $\eta$ under the OR model, and $\hat\eta_2 = \sum_{i \in A_R} w_i\, y_i / \pi(x_i; \hat\phi)$, with $\hat\phi$ being a consistent estimator of $\phi_0$, is consistent for $\eta$ under the RP model. An estimator of $\eta$ is doubly robust (DR) if it is consistent when either the OR model or the RP model is correct, but not necessarily both. This property guards the estimator against possible model misspecification. DR estimators have been extensively studied in the literature, including Robins, Rotnitzky, and Zhao (1994), Bang and Robins (2005), Tan (2006), Kang and Schafer (2007), Cao, Tsiatis, and Davidian (2009), and Kim and Haziza (2014). We now discuss a fractional imputation estimator that has the double robustness feature.
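Before the doubly robust construction, the I-, W-, and M-steps of the SFI algorithm in Section 6.2 can be sketched as follows. The sketch assumes a normal linear working model $y \mid x \sim N(\beta_0 + \beta_1 x, \sigma^2)$ and equal sampling weights $w_k$, so that the M-step reduces to weighted least squares. The function names, the simulated data, and the response model are all hypothetical and serve only to make the steps concrete.

```python
"""Sketch of semiparametric fractional imputation (SFI) via EM."""
import math
import random


def normal_pdf(v, mean, sd):
    """Density of N(mean, sd^2) at v."""
    return math.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def wls(points):
    """Weighted fit of y = b0 + b1 * x + e; points are (x, y, weight)."""
    sw = sum(w for _, _, w in points)
    xbar = sum(w * x for x, _, w in points) / sw
    ybar = sum(w * y for _, y, w in points) / sw
    sxx = sum(w * (x - xbar) ** 2 for x, _, w in points)
    sxy = sum(w * (x - xbar) * (y - ybar) for x, y, w in points)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sigma2 = sum(w * (y - b0 - b1 * x) ** 2 for x, y, w in points) / sw
    return b0, b1, math.sqrt(sigma2)


def sfi_em(x_resp, y_resp, x_miss, n_iter=30, tol=1e-8):
    """I-step: every respondent value y_j is a donor for every missing y_i."""
    observed = [(x, y, 1.0) for x, y in zip(x_resp, y_resp)]
    theta = wls(observed)                      # start from the complete cases
    weights = []
    for _ in range(n_iter):
        b0, b1, sd = theta
        # W-step (6.3) with equal w_k: f(y_j | x_i) / sum_k f(y_j | x_k).
        denom = [sum(normal_pdf(yj, b0 + b1 * xk, sd) for xk in x_resp)
                 for yj in y_resp]
        weights = []
        for xi in x_miss:
            raw = [normal_pdf(yj, b0 + b1 * xi, sd) / d
                   for yj, d in zip(y_resp, denom)]
            total = sum(raw)
            weights.append([v / total for v in raw])
        # M-step: for the normal model the imputed score equation is a
        # weighted least squares problem on observed plus imputed records.
        points = list(observed)
        for xi, row in zip(x_miss, weights):
            points.extend((xi, yj, wij) for yj, wij in zip(y_resp, row))
        new_theta = wls(points)
        converged = max(abs(a - b) for a, b in zip(new_theta, theta)) < tol
        theta = new_theta
        if converged:
            break
    n = len(x_resp) + len(x_miss)
    imputed_sum = sum(w * y for row in weights for w, y in zip(row, y_resp))
    return theta, weights, (sum(y_resp) + imputed_sum) / n


# Hypothetical simulated data with response depending on x only (MAR).
rng = random.Random(0)
x_all = [rng.gauss(2.0, 1.0) for _ in range(150)]
y_all = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in x_all]
resp = [rng.random() < 1.0 / (1.0 + math.exp(-(0.5 + 0.5 * (x - 2.0))))
        for x in x_all]
x_resp = [x for x, d in zip(x_all, resp) if d]
y_resp = [y for y, d in zip(y_all, resp) if d]
x_miss = [x for x, d in zip(x_all, resp) if not d]

theta, weights, ybar_sfi = sfi_em(x_resp, y_resp, x_miss)
print(theta, ybar_sfi, sum(y_all) / len(y_all))
```

Only observed values of $y$ are ever used as imputed values, which is what makes the procedure resemble hot deck imputation; the parametric model enters only through the fractional weights.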
For each missing $y_i$, let $y_{ij}^{*} = \hat{y}_i + \hat{e}_j$ be the $j$-th imputed value from the donor $j \in A_R$, where $\hat{y}_i = m(x_i; \hat\beta)$ with $\hat\beta$ fitted under the OR model and $\hat{e}_j = y_j - m(x_j; \hat\beta)$. If $\sum_{j \in A_R} w_j / \pi(x_j; \hat\phi) = \sum_{i \in A} w_i$, each unit $j \in A_R$ represents $1/\pi(x_j; \hat\phi)$ copies of the sample. Then, the fractional weight $w_{ij}^{*}$ associated with the $j$-th imputed value $y_{ij}^{*}$ is proportional to $\{1/\pi(x_j; \hat\phi) - 1\}$ over the donor pool

$A_R$ (minus one because $y_j$ itself counts one), that is,

$$w_{ij}^{*} = \frac{w_j \{1/\pi(x_j; \hat\phi) - 1\}}{\sum_{k \in A} w_k \delta_k \{1/\pi(x_k; \hat\phi) - 1\}}. \tag{6.4}$$

Under this weight construction, the fractional imputation estimator is given by

$$\hat\eta_{FI} = \sum_{i \in A} w_i \Big[ \delta_i y_i + (1 - \delta_i) \Big\{ \sum_{j \in A} \delta_j\, w_{ij}^{*}\, y_{ij}^{*} \Big\} \Big]. \tag{6.5}$$

We show that the fractional imputation estimator $\hat\eta_{FI}$ in (6.5) is doubly robust. First notice that $\hat\eta_{FI}$ is algebraically equal to

$$\hat\eta_{FI} = \sum_{i \in A} w_i \Big[ m(x_i; \hat\beta) + \frac{\delta_i}{\pi(x_i; \hat\phi)} \{y_i - m(x_i; \hat\beta)\} \Big]. \tag{6.6}$$

Let $\hat\eta_n = \sum_{i \in A} w_i y_i$ be the full sample estimator of $\eta$; then

$$\hat\eta_{FI} - \hat\eta_n = \sum_{i \in A} w_i \Big\{ \frac{\delta_i}{\pi(x_i; \hat\phi)} - 1 \Big\} \{y_i - m(x_i; \hat\beta)\}.$$

This is an asymptotically unbiased estimator of zero if either the OR model or the RP model is correct, but not necessarily both. Kim and Haziza (2014) discussed efficient estimation of $(\beta, \phi)$ in survey sampling.

7. COMPARISON WITH MULTIPLE IMPUTATION

7.1 Statistical Efficiency

In the presence of missing data with MAR, multiple imputation (MI) is a popular method. It is thus of interest to compare the behavior of these two methods. We start from a simple setting with the complete data $z$ being randomly drawn from a population whose density is $f(z; \theta)$, where $\theta \in R^d$ is an unknown parameter to be estimated. Suppose that $m$ complete data sets are created by imputing the missing data $z_{mis}$ from the posterior predictive distribution given the observed data $z_{obs}$,

$$f(z_{mis} \mid z_{obs}) = \int f(z_{mis} \mid z_{obs}; \theta)\, \pi(\theta \mid z_{obs})\, d\theta,$$

where $\pi(\theta \mid z_{obs})$ is the posterior distribution of $\theta$. The MI estimator of $\theta$, denoted by $\hat\theta_{MI}$, is $\hat\theta_{MI} = m^{-1} \sum_{k=1}^{m} \hat\theta^{(k)}$, where $\hat\theta^{(k)}$ is the MLE applied to the $k$-th imputed data set. Rubin's formula is used for variance estimation in MI,

$$\hat{V}_{MI}(\hat\theta_{MI}) = W_m + (1 + m^{-1}) B_m,$$

where $W_m = m^{-1} \sum_{k=1}^{m} \hat{V}^{(k)}$, $B_m = (m - 1)^{-1} \sum_{k=1}^{m} (\hat\theta^{(k)} - \hat\theta_{MI})^{\otimes 2}$, and $\hat{V}^{(k)}$ is the variance estimator of $\hat\theta$ under complete response applied to the $k$-th imputed data set.
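Rubin's combining rules are simple enough to state as a short sketch. The version below handles a scalar parameter only, and the five point estimates and complete-data variance estimates are hypothetical numbers chosen to make the arithmetic easy to follow.

```python
"""Sketch of Rubin's combining rules for a scalar parameter."""


def rubin_combine(estimates, variances):
    """Return (theta_MI, W_m, B_m, V_MI) from m imputed-data analyses."""
    m = len(estimates)
    theta_mi = sum(estimates) / m                     # MI point estimator
    w_m = sum(variances) / m                          # within-imputation variance
    b_m = sum((e - theta_mi) ** 2 for e in estimates) / (m - 1)  # between
    v_mi = w_m + (1.0 + 1.0 / m) * b_m                # Rubin's formula
    return theta_mi, w_m, b_m, v_mi


# Hypothetical results from m = 5 imputed data sets.
estimates = [10.0, 12.0, 11.0, 9.0, 13.0]
variances = [1.0, 1.2, 0.8, 1.1, 0.9]

theta_mi, w_m, b_m, v_mi = rubin_combine(estimates, variances)
print(theta_mi, w_m, b_m, v_mi)
```

The factor $(1 + m^{-1})$ inflates the between-imputation component to account for using a finite number of imputations.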

Of note, Bayesian MI is a simulation-based method and thus introduces additional noise. This explains why the asymptotic variance of the MI estimator, given by Wang and Robins (1998),

$$V_{MI} = I_{obs}^{-1} + m^{-1} I_{com}^{-1} I_{mis} I_{com}^{-1} + m^{-1} J^{T} I_{obs}^{-1} J, \tag{7.1}$$

is strictly larger than the asymptotic variance of the FI estimator

$$V_{FI} = I_{obs}^{-1} + m^{-1} I_{com}^{-1} I_{mis} I_{com}^{-1}, \tag{7.2}$$

where $I_{com} = E\{S(\theta)^{\otimes 2}\}$, $I_{obs} = E\{S_{obs}(\theta)^{\otimes 2}\}$, $I_{mis} = I_{com} - I_{obs}$, $S(\theta) = S(Z; \theta) = \partial \log f(Z; \theta) / \partial\theta$ is the log likelihood score if the data were completely observed, $S_{obs}(\theta) = E\{S(\theta) \mid Z_{obs}\}$ is the score function of the observed data log likelihood, and $J = I_{mis} I_{com}^{-1}$ is the fraction of missing information matrix (Rubin 1987, Chapter 4). The difference between (7.1) and (7.2) can be sizable for a small $m$. Furthermore, for a large $m$, although the MI estimator is efficient, the inference is inefficient since Rubin's variance estimator of the MI estimator is only weakly unbiased, that is, $\hat{V}_{MI}(\hat\theta_{MI})$ converges in distribution instead of converging in probability to $V_{MI}$. This leads to much broader confidence intervals and less powerful tests than a consistent variance estimator would give (Nielsen 2003).

For MI inference to be valid for general-purpose estimation, imputations must be proper according to Rubin (1987). A sufficient condition is given by Meng (1994). The so-called congeniality condition, imposed on both the imputation model and the form of subsequent complete-sample analyses, is quite restrictive for general-purpose estimation. Otherwise, as discussed by Fay (1992; 1996), Kott (1995), Binder and Sun (1996), Robins and Wang (2000), Nielsen (2003), and Kim et al. (2006), the MI variance estimator is not always consistent. Kim et al. (2011) pointed out that MI that is congenial for mean estimation is not necessarily congenial for proportion estimation. Yang and Kim (2015b) showed that the MI variance estimator can be positively or negatively biased when the method of moments estimator is used as the complete-sample estimator.
In contrast, FI, as we discussed in Section 4, does not require congeniality and always results in a consistent variance estimator for general-purpose estimation.

7.2 Imputation under Informative Sampling

Under informative sampling, the MAR assumption is subtle. We assume that the response mechanism is MAR at the population level, now referred to as population missing at random (PMAR), to be distinguished from the concept of sample missing at random (SMAR). For simplicity, assume $y$ is a one-dimensional variable which is subject to missingness, $\delta$ is its response indicator, and $I$ is the sample inclusion indicator. PMAR assumes that $y \perp \delta \mid x$, that is, MAR holds at the population level, $f(y \mid x) = f(y \mid x, \delta)$. On the other hand, SMAR assumes $y \perp \delta \mid (x, I = 1)$, that is, MAR holds at the sample level, $f(y \mid x, I = 1) = f(y \mid x, I = 1, \delta)$. The two assumptions are not testable empirically. The plausibility of these assumptions should be judged by subject matter experts. Often, PMAR is more realistic because an individual's decision on whether or not to respond to a survey depends on his or her own characteristics, rather than on the fact of being in the sample or not.

Figure 1. A directed acyclic graph (DAG), with nodes $\delta$, $I$, $Y$, $X$, and $U$, for a setup where PMAR holds but SMAR does not hold. Variable $U$ is latent in the sense that it is never observed.

For a noninformative sampling design, we have $P(I = 1 \mid x, y) = P(I = 1 \mid x)$, under which PMAR implies SMAR; for an informative sampling design, however, PMAR does not necessarily imply SMAR. In such cases, using an imputation model fitted to the sample data for generating imputations can result in biased estimation. FI does not require SMAR to hold in addition to PMAR. Under PMAR, we have $f(y \mid x, \delta = 0) = f(y \mid x)$. Let $f(y \mid x; \beta)$ be a parametric model of $f(y \mid x)$. The parameter $\beta$ can be consistently estimated by solving (2.5), even under informative sampling. Since FI generates the imputations from $f(y \mid x; \hat\beta)$, with a consistent estimator $\hat\beta$, the resulting FI estimator is approximately unbiased (Berg et al. 2015). MI, in contrast, tends to be problematic under informative sampling. By using an augmented model, where the imputation model is augmented to include sampling weights or some function of them, as $f(y \mid x, w)$, the MI point estimator was claimed to be approximately unbiased (Rubin 1996; Schenker et al. 2006). However, as pointed out by Berg, Kim, and Skinner (2015), this is not always true. For example, $Y$ is conditionally independent of $\delta$ given $X$ as presented in Figure 1. However, $Y$ is not conditionally independent of $\delta$ given $X$ and $I$. Augmenting $X$ by including sampling weights does not solve the problem. The existence of the latent variable $U$, which is correlated with $I$ and $\delta$, makes SMAR unachievable.

8. SIMULATION STUDY

We investigated the performance of FI compared to MI by a limited simulation study using an artificial finite population generated from real survey data. The pseudo finite population was generated from a single month of the U.S. Census Bureau's Monthly Retail Trade Survey (MRTS). Each month, the MRTS surveys a sample of about 12,000 retail businesses with paid employees to collect data on sales and inventories.
The MRTS is an economic indicator survey whose monthly estimates are inputs to the Gross Domestic Product estimates. The MRTS sample design is typical of business surveys, employing one-stage stratified sampling with stratification based on major industry, further substratified by the estimated annual sales. The sample design requires higher sampling rates in strata with larger units than in strata with smaller units. More details about MRTS can be found in Mulry, Oliver, and Kaputa (2014). The original population file contains 19,601 retail businesses stratified into 16 strata, with a strata identifier ($h$), sales ($y$), and inventory values ($x$). For simulation purposes, we focus on the first 5 strata as a finite population, consisting of 7,260 retail businesses. Figure 2 shows the scatter plot of sales and inventory
