Jong-Min Kim* and Jon E. Anderson. Statistics Discipline Division of Science and Mathematics University of Minnesota at Morris


1 Jackknife Variance Estimation of the Regression and Calibration Estimator for Two 2-Phase Samples
Jong-Min Kim* and Jon E. Anderson
Statistics Discipline, Division of Science and Mathematics, University of Minnesota at Morris
2005 ENAR, March 21

2 Outline
- Jackknife Background and Definitions
- Regression Estimator in Simple Random Sampling
- Regression Estimator in Stratified Random Sampling
- Calibration Estimation in Stratified Random Sampling
- Conclusions

3 Jackknife Background
- Introduced by Quenouille (1949, 1956) as a method to reduce bias.
- Popularized by Tukey (1958), who used it for variances and CIs.
- Arvesen (1969) was the first to propose a two-sample jackknife estimator.

4 Jackknife Definition
- Let $\hat\theta$ be an estimate.
- Let $\hat\theta(-j)$ be an estimator of the same form with observation $j$ deleted.
- The jackknife estimate of the variance of $\hat\theta$ is
$$\frac{n-1}{n}\sum_{j=1}^{n}\left[\hat\theta(-j) - \hat\theta\right]^2 .$$
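The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `jackknife_variance` and the toy data are assumptions, not part of the talk. For the sample mean, the delete-one jackknife variance reduces algebraically to the usual $s^2/n$, which gives a convenient check.

```python
from statistics import mean, variance

def jackknife_variance(data, estimator):
    """Delete-one jackknife variance of estimator(data); data is a list."""
    n = len(data)
    theta_hat = estimator(data)
    # Recompute the estimator with each observation j removed in turn.
    theta_minus = [estimator(data[:j] + data[j + 1:]) for j in range(n)]
    return (n - 1) / n * sum((t - theta_hat) ** 2 for t in theta_minus)

# Toy data (hypothetical), used only to compare against s^2 / n for the mean.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
v_jack = jackknife_variance(sample, mean)
v_classic = variance(sample) / len(sample)
```

Any estimator that accepts a list can be passed in place of `mean`.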

5 Main Idea

6 Jackknife Variance Estimation of the Regression Estimator for Two Samples, Two-Phase, Simple Random Sampling
Two first-phase simple random samples, $s'_1$ of size $n'_1$ and $s'_2$ of size $n'_2$, are taken without replacement from a population of $N$ elements. Simple random subsamples $s_1$ of size $n_1$ and $s_2$ of size $n_2$ are then taken without replacement from $s'_1$ and $s'_2$, respectively.

7 The simple linear regression estimators for two samples under two-phase sampling are
$$\bar y_{lr1} = \bar y_1 + \hat\beta_1(\bar x'_1 - \bar x_1) \quad\text{and}\quad \bar y_{lr2} = \bar y_2 + \hat\beta_2(\bar x'_2 - \bar x_2),$$
where $\bar x'_1$ and $\bar x'_2$ are the means of $x$ for the first-phase samples $s'_1$ and $s'_2$; $\bar x_1$, $\bar x_2$ and $\bar y_1$, $\bar y_2$ are the means of $x$ and $y$ for the second-phase samples $s_1$ and $s_2$; and $\hat\beta_k = \hat\sigma_{xyk}/\hat\sigma^2_{xk}$ is the least squares regression coefficient of $y$ on $x$ based on the second-phase sample $s_k$, $k = 1, 2$.

8 We obtain a jackknife variance estimator for $\bar y_{lrk}$ by recalculating $\bar y_{lrk}$ with the $j$th element removed for each $j \in s'_k$, then using the variance of these $n'_k$ jackknife values, $\bar y_{lrk}(-j)$. Clearly, deleting unit $j$ will affect $\bar x_k$, $\bar y_k$ and $\hat\beta_k$ only if $j \in s_k$ and not if $j \in s'_k - s_k$, while it will affect $\bar x'_k$ for all $j \in s'_k$. Define
$$\bar y_{lrk}(-j) = \bar y_k(-j) + \hat\beta_k(-j)\left[\bar x'_k(-j) - \bar x_k(-j)\right] \quad\text{for all } j \in s'_k,$$
where
$$\bar x'_k(-j) = \frac{n'_k\bar x'_k - x_j}{n'_k - 1} \quad\text{for all } j \in s'_k,$$
$$\bar x_k(-j) = \begin{cases} (n_k\bar x_k - x_j)/(n_k - 1), & \text{if } j \in s_k,\\ \bar x_k, & \text{if } j \in s'_k - s_k, \end{cases}$$

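9
$$\bar y_k(-j) = \begin{cases} (n_k\bar y_k - y_j)/(n_k - 1), & \text{if } j \in s_k,\\ \bar y_k, & \text{if } j \in s'_k - s_k, \end{cases}$$
$$\hat\beta_k(-j) = \begin{cases} \hat\beta_k - \dfrac{(x_j - \bar x_k)\,d_j}{(n_k - 1)\,\hat\sigma^2_{xk}\,(1 - k_j)}, & \text{if } j \in s_k,\\ \hat\beta_k, & \text{if } j \in s'_k - s_k, \end{cases}$$
where $d_j = y_j - \bar y_k - \hat\beta_k(x_j - \bar x_k)$ and $k_j = 1/n_k + (x_j - \bar x_k)^2/\left[(n_k - 1)\hat\sigma^2_{xk}\right]$. Now apply the usual jackknife method to $\bar y_{lrk}(-j)$ to get
$$v_{Jlrk} = \frac{n'_k - 1}{n'_k}\sum_{j \in s'_k}\left[\bar y_{lrk}(-j) - \bar y_{lrk}\right]^2 .$$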
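As a rough illustration of the delete-one updates for a single two-phase sample, the sketch below computes the jackknife variance of the regression estimator twice: once with the closed-form updates for the means and slope, and once by refitting the regression after each deletion. The data layout (the first `len(y_second)` entries of `x_first` are the second-phase units), the function names, and the simulated numbers are all assumptions made for this example.

```python
import random

def _mean(values):
    return sum(values) / len(values)

def _slope(x, y):
    xb, yb = _mean(x), _mean(y)
    sxx = sum((a - xb) ** 2 for a in x)
    return sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sxx

def regression_estimate(x_first, y_second):
    """y_bar + beta_hat * (x_bar' - x_bar) for one two-phase sample."""
    x_second = x_first[:len(y_second)]
    return (_mean(y_second)
            + _slope(x_second, y_second) * (_mean(x_first) - _mean(x_second)))

def jackknife_closed_form(x_first, y_second):
    n_first, n_second = len(x_first), len(y_second)   # n' and n
    x = x_first[:n_second]
    xb_first, xb, yb = _mean(x_first), _mean(x), _mean(y_second)
    beta = _slope(x, y_second)
    sxx = sum((a - xb) ** 2 for a in x)   # equals (n - 1) * sigma_hat_x^2
    y_lr = yb + beta * (xb_first - xb)
    total = 0.0
    for j in range(n_first):
        xb_first_j = (n_first * xb_first - x_first[j]) / (n_first - 1)
        if j < n_second:   # unit j is in the second-phase sample
            xb_j = (n_second * xb - x[j]) / (n_second - 1)
            yb_j = (n_second * yb - y_second[j]) / (n_second - 1)
            d_j = y_second[j] - yb - beta * (x[j] - xb)
            k_j = 1 / n_second + (x[j] - xb) ** 2 / sxx
            beta_j = beta - (x[j] - xb) * d_j / (sxx * (1 - k_j))
        else:              # unit j is only in the first-phase sample
            xb_j, yb_j, beta_j = xb, yb, beta
        total += (yb_j + beta_j * (xb_first_j - xb_j) - y_lr) ** 2
    return (n_first - 1) / n_first * total

def jackknife_brute_force(x_first, y_second):
    n_first, n_second = len(x_first), len(y_second)
    y_lr = regression_estimate(x_first, y_second)
    total = 0.0
    for j in range(n_first):
        x_del = x_first[:j] + x_first[j + 1:]
        y_del = y_second[:j] + y_second[j + 1:] if j < n_second else y_second
        total += (regression_estimate(x_del, y_del) - y_lr) ** 2
    return (n_first - 1) / n_first * total

# Hypothetical data: 30 first-phase units, the first 10 also observed on y.
rng = random.Random(0)
x_first = [rng.gauss(100.0, 10.0) for _ in range(30)]
y_second = [0.8 * x + rng.gauss(0.0, 5.0) for x in x_first[:10]]
v_closed = jackknife_closed_form(x_first, y_second)
v_brute = jackknife_brute_force(x_first, y_second)
```

The two routes should agree up to rounding; the closed form simply avoids refitting the regression for every deleted unit.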

10 Rao and Sitter Approach to Jackknife Variance Estimator for Two Samples, SRS
To use convenient one-phase sample variance formulae, Rao and Sitter (1995) proposed a similar device to facilitate computations for ratio imputation. For regression imputation we define, for sample $k = 1, 2$,
$$\hat z_{ki}(-j) = y^*_{ki} + \{y^*_{ki}(-j) - y^*_{ki}\},$$
where $y^*_{ki}(-j)$ is defined as
$$y^*_{ki}(-j) = \bar y_k(-j) + \hat\beta_k(-j)\left(x_{ki} - \bar x_k(-j)\right),$$
so that $\hat z_{ki}(-j) = y^*_{ki}$ for $j \in s'_k - s_k$ and $\hat z_{ki}(-j) = y^*_{ki}(-j)$ for $j \in s_k$, in sample $k = 1, 2$.

11 We also define the adjusted estimator
$$\bar y^a_{kI}(-j) = \frac{1}{n'_k - 1}\sum_{i=1,\, i \neq j}^{n'_k}\hat z_{ki}(-j),$$
and this helps define the jackknife variance estimator for sample $k$,
$$v_{Jlrk} = \frac{n'_k - 1}{n'_k}\sum_{j \in s'_k}\left[\bar y^a_{kI}(-j) - \bar y_{kI}\right]^2,$$
where $\bar y_{kI} = \bar y_{lrk}$ under regression imputation. The jackknife variance estimator based on the adjusted imputed estimators $\bar y_{kI}$, $k = 1, 2$, is a weighted average of the two estimators, given by
$$v^a_{Jlr} = \frac{n'_1 v_{Jlr1} + n'_2 v_{Jlr2}}{n'_1 + n'_2} = \frac{n'_1 - 1}{n'_1 + n'_2}\sum_{j \in s'_1}\left[\bar y^a_{1I}(-j) - \bar y_{1I}\right]^2 + \frac{n'_2 - 1}{n'_1 + n'_2}\sum_{l \in s'_2}\left[\bar y^a_{2I}(-l) - \bar y_{2I}\right]^2 .$$
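The weighted average of the two per-sample jackknife variances is simple to express in code. The sketch below assumes the weights are the two full sample sizes, as in the formula on this slide; the function name and the numbers are hypothetical.

```python
def pooled_jackknife_variance(v1, size1, v2, size2):
    """Sample-size-weighted average of two per-sample jackknife variances."""
    return (size1 * v1 + size2 * v2) / (size1 + size2)

# Hypothetical inputs: variances 0.04 and 0.06 from samples of 1000 and 3000.
v_pooled = pooled_jackknife_variance(0.04, 1000, 0.06, 3000)
```

With equal sample sizes the result is just the simple average of the two variances.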

12 Simulation Study Design: Simple Random Sampling
- Population size is [value missing in source].
- Pop Y is related to Pop X: $Y = 0.8X + \varepsilon$.
[Figure: scatterplot of the population, Y against X.]
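A population of this form and one two-phase draw from it could be set up roughly as follows. Only the relation $Y = 0.8X + \varepsilon$ comes from the slide; the population size, the normal distributions and their parameters, the sample sizes, and the function names are assumptions chosen for illustration.

```python
import random

def make_population(size, rng, x_mean=100.0, x_sd=10.0, error_sd=5.0):
    """List of (x, y) pairs with y = 0.8 * x + error (distributions assumed)."""
    xs = [rng.gauss(x_mean, x_sd) for _ in range(size)]
    return [(x, 0.8 * x + rng.gauss(0.0, error_sd)) for x in xs]

def two_phase_sample(population, n_first, n_second, rng):
    """Indices of a first-phase SRS and of an SRS subsample drawn from it."""
    first = rng.sample(range(len(population)), n_first)   # without replacement
    second = rng.sample(first, n_second)                   # subsample of phase 1
    return first, second

rng = random.Random(2005)
population = make_population(10_000, rng)   # size is an assumed value
first_ids, second_ids = two_phase_sample(population, 2000, 200, rng)
```

For the two-sample setting, calling `two_phase_sample` twice gives two such samples from the same population.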

13 Simulation: Simple Random Sampling
[Figure: One Sample Jackknife Variance, Simple Random Sampling. Vertical axis: J. Variance; horizontal axis: Second Phase Sample Size; series by first-phase size (First Phase = 2000 and a second value missing in source).]

14 Simulation: Simple Random Sampling
[Figure: Two Sample Jackknife Variance, Simple Random Sampling. Vertical axis: J. Variance; horizontal axis: Second Phase Sample Size; series by first-phase size (First Phase = 1000 and a second value missing in source).]

15 Simulation: One vs. Two SRS Comparison
[Figure: Mean Jackknife Variance, SRS. Vertical axis: J. Variance; horizontal axis: Second Phase Sample Size; series: Two Samples, One Sample.]

16 Simulation: One vs. Two SRS Comparison
[Figure: SD Jackknife Variance, SRS. Vertical axis: J. Variance; horizontal axis: Second Phase Sample Size; series: Two Samples, One Sample.]

17 Simulation: Missing and Complete Comparison SRS
[Figure: Two Sample Jackknife Variance, Simple Random Sampling. Vertical axis: J. Variance; horizontal axis: Second Phase Sample Size; series: Complete Data, Missing Data.]

18 Jackknife Variance Estimator for Two Samples, Two-Phase, Stratified Random Sampling
Assume that $x$ is observed on all sample units, $s'_{hk}$, for sample $k = 1, 2$ in stratum $h$. Simple linear regression imputation uses
$$y^*_{hki} = \bar y_{hk} + \hat\beta_{hk}(x_{hki} - \bar x_{hk}) \quad\text{for } i \in s'_{hk} - s_{hk},$$
where $\bar y_{hk}$ and $\bar x_{hk}$ are the means of $y$ and $x$ for the respondents in group $s_{hk}$ in stratum $h$, and $\hat\beta_{hk}$ is the ordinary least squares regression coefficient based on the respondents $s_{hk}$ in stratum $h$.

19 The imputed values $y^*_{hki}$ are best predictors of the unobserved $y_{hki}$ under the following superpopulation model $\xi$:
$$E_\xi(y_{hki}) = \alpha_{hk} + \beta_{hk}x_{hki}, \quad V_\xi(y_{hki}) = \sigma^2_h, \quad \mathrm{cov}_\xi(y_{hki}, y_{hkj}) = 0 \ \text{ for } i \neq j,$$
provided that the model also holds for the respondents $s_{hk}$. Under regression imputation,
$$\bar y_{kI} = \sum_{h} W_h\bar y_{hkI} = \sum_{h} W_h\left[\bar y_{hk} + \hat\beta_{hk}(\bar x'_{hk} - \bar x_{hk})\right], \quad\text{for } k = 1, 2.$$
It is readily seen that
$$y^*_{hki}(-hkj) = \bar y_{hk}(-hkj) + \hat\beta_{hk}(-hkj)\left[x_{hki} - \bar x_{hk}(-hkj)\right]$$
under regression imputation when the $hkj$th respondent is deleted, where $\hat\beta_{hk}(-hkj)$ is the least squares regression coefficient when the $hkj$th respondent is deleted.

20 Rao and Sitter Approach to Jackknife Variance Estimator for Two Samples, Stratified Random Sampling
To use convenient one-phase sample variance formulae, Rao and Sitter (1995) proposed a similar device to facilitate computations for ratio imputation. For regression imputation we define
$$\hat z_{hki}(-hkj) = y^*_{hki} + \{y^*_{hki}(-hkj) - y^*_{hki}\},$$
so that $\hat z_{hki}(-hkj) = y^*_{hki}$ for $hkj \in s'_{hk} - s_{hk}$ and $\hat z_{hki}(-hkj) = y^*_{hki}(-hkj)$ for $hkj \in s_{hk}$, in sample $k = 1, 2$, stratum $h$. We also define the adjusted estimator
$$\bar y^a_{hkI}(-hkj) = \frac{1}{n'_{hk} - 1}\sum_{i=1,\, i \neq j}^{n'_{hk}}\hat z_{hki}(-hkj),$$

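21 Using these values, the jackknife variance estimator of $\bar y_{kI}$ is given by
$$v_{Jlr}(\bar y_{kI}) = \sum_{h}\frac{n'_{hk} - 1}{n'_{hk}}\sum_{j=1}^{n'_{hk}}\left[\bar y^a_{kI}(-hkj) - \bar y_{kI}\right]^2 .$$
Noting that $\bar y^a_{kI}(-hkj) - \bar y_{kI} = W_h\left[\bar y^a_{hkI}(-hkj) - \bar y_{hkI}\right]$, where $\bar y^a_{hkI}(-hkj)$ is the adjusted imputed estimator of the $h$th stratum mean $\bar Y_h$ when the $hkj$th sample unit is deleted, we get
$$v_{Jlr}(\bar y_{kI}) = \sum_{h} W_h^2\, v_{Jlr}(\bar y_{hkI}) = \sum_{h} W_h^2\,\frac{n'_{hk} - 1}{n'_{hk}}\sum_{j=1}^{n'_{hk}}\left(\bar y^a_{hkI}(-hkj) - \bar y_{hkI}\right)^2 .$$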
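The stratified form is a sum of per-stratum jackknife variances weighted by $W_h^2$. A minimal sketch follows, assuming the delete-one replicates of each stratum estimator have already been computed; the function name, argument layout, and toy numbers are hypothetical.

```python
def stratified_jackknife_variance(weights, stratum_estimates, stratum_replicates):
    """Sum over strata of W_h^2 * (n_h - 1) / n_h * sum_j (replicate_j - estimate_h)^2."""
    total = 0.0
    for w, estimate, replicates in zip(weights, stratum_estimates, stratum_replicates):
        n_h = len(replicates)   # one replicate per deleted sample unit
        v_h = (n_h - 1) / n_h * sum((r - estimate) ** 2 for r in replicates)
        total += w ** 2 * v_h
    return total

# Toy input: two strata with weight 0.5 each and two replicates per stratum.
v_strat = stratified_jackknife_variance(
    [0.5, 0.5], [1.0, 2.0], [[0.0, 2.0], [1.0, 3.0]]
)
```

In the toy input each stratum contributes a jackknife variance of 1.0, so the weighted total is 0.25 + 0.25.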

22 Simulation Study Design: Stratified Random Sampling
- Population size is $N = N_1 + N_2 + N_3 = 10{,}000$.
- Pop Y is related to Pop X: $Y = 0.8X + \varepsilon$.
- Three strata: $X < 90$, $90 \le X \le 110$, $110 < X$.
- Stratum 1 size $N_1 = 1633$, so that $W_1 = N_1/N = 0.1633$.
- Stratum 2 size $N_2 = 6805$, so that $W_2 = N_2/N = 0.6805$.
- Stratum 3 size $N_3 = 1562$, so that $W_3 = N_3/N = 0.1562$.
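The stratum weights follow directly from the stratum sizes on this slide. The short sketch below recomputes them and encodes the three cut points; the names `stratum_sizes`, `weights`, and `stratum_of` are illustrative only.

```python
# Stratum sizes as given on the slide.
stratum_sizes = {1: 1633, 2: 6805, 3: 1562}
N = sum(stratum_sizes.values())
weights = {h: size / N for h, size in stratum_sizes.items()}   # W_h = N_h / N

def stratum_of(x):
    """Stratum label for an x value, using the cut points 90 and 110."""
    if x < 90:
        return 1
    if x <= 110:
        return 2
    return 3
```

The boundary values 90 and 110 both fall in stratum 2, matching $90 \le X \le 110$.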

23 Simulation Study Design: Stratified Random Sampling
[Figure: scatterplot of Population Y Values against Population X Values, with the three strata marked (Stratum = 1, 2, 3).]

24 Simulation: Stratified Random Sampling
[Figure: One Sample Jackknife Variance, Stratified Random Sampling. Vertical axis: J. Variance; horizontal axis: Second Phase Sample Size; series by first-phase size (First Phase = 2000 and a second value missing in source).]

25 Simulation: Stratified Random Sampling
[Figure: Two Sample Jackknife Variance, Stratified Random Sampling. Vertical axis: J. Variance; horizontal axis: Second Phase Sample Size; series by first-phase size (First Phase = 1000 and a second value missing in source).]

26 Calibration Approach to Jackknife Variance Estimation
Three major advantages of the calibration approach in survey sampling:
- Leads to consistent estimates.
- Provides an important class of techniques for the efficient combination of data sources.
- Has a computational advantage for estimates.
Apply the Tracy et al. (2003) calibration in stratified and double sampling to the jackknife variance estimator.

27 Calibration Approach to Jackknife Variance Estimation
We apply calibration estimation with ratio imputation in stratified random sampling. Suppose the population of $N$ units consists of $L$ strata such that the $h$-th stratum consists of $N_h$ units and $\sum_{h=1}^{L} N_h = N$. Suppose that an auxiliary variable, $x$, closely related to an item $y$ is observed on all sample units, $s'_{hk}$, in sample $k = 1, 2$ for stratum $h$. Ratio imputation uses
$$y^*_{hki} = (\bar y_{hk}/\bar x_{hk})\,x_{hki} \quad\text{for } i \in s'_{hk} - s_{hk},$$
and $y^*_{hki}$ equals $y_{hki}$ when it is observed in sample $s_{hk}$. Note that $\bar y_{hk}$ and $\bar x_{hk}$ are the means of $y$ and $x$ for the respondents $s_{hk}$ in stratum $h$.
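Ratio imputation within one stratum and sample can be sketched as below. The layout (respondents listed first in `x_all`) and the function name are assumptions for this example. A useful check is that the mean of the completed data equals the respondent ratio times the full-sample mean of $x$.

```python
def ratio_impute(x_all, y_observed):
    """Complete y by ratio imputation.

    x_all holds x for every unit of the full sample; the first
    len(y_observed) units are the respondents (assumed layout).
    """
    r = len(y_observed)
    ratio = (sum(y_observed) / r) / (sum(x_all[:r]) / r)   # y_bar / x_bar
    # Observed values are kept; missing ones are replaced by ratio * x.
    return list(y_observed) + [ratio * x for x in x_all[r:]]

# Toy data: four sampled units, y observed on the first two.
x_all = [1.0, 2.0, 3.0, 4.0]
y_completed = ratio_impute(x_all, [2.0, 4.0])
imputed_mean = sum(y_completed) / len(y_completed)
```

Here the respondent ratio is 2 and the full-sample mean of `x_all` is 2.5, so the completed-data mean is 5.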

28 Under ratio imputation presented by Särndal (1992),
$$\bar y_{kI} = \sum_{h=1}^{L} W_h\bar y_{hkI} = \sum_{h=1}^{L} W_h(\bar y_{hk}/\bar x_{hk})\,\bar x'_{hk},$$
where $\bar x'_{hk}$ is the $x$ mean for the full sample $s'_{hk}$ from stratum $h$. Let
$$\hat\sigma'^2_{hk}(x) = (n'_{hk} - 1)^{-1}\sum_{i=1}^{n'_{hk}}(x_{hki} - \bar x'_{hk})^2$$
be the variance of $x$ in the first sample $s'_{hk}$ from stratum $h$. The variance of $x$ in the subsample $s_{hk}$, stratum $h$, is
$$\hat\sigma^2_{hk}(x) = (n_{hk} - 1)^{-1}\sum_{i \in s_{hk}}(x_{hki} - \bar x_{hk})^2 .$$
Also, let
$$\hat\sigma^2_{hk}(y) = (n_{hk} - 1)^{-1}\sum_{i \in s_{hk}}(y_{hki} - \bar y_{hk})^2$$
be the variance of the target characteristic for the respondents in the subsample $s_{hk}$ from stratum $h$.

29 We are considering the jackknife variance estimator of $\bar y_I$ using calibration estimation. Let's define
$$\bar y^*_{kI} = \sum_{h=1}^{L} W^*_h\bar y_{hkI} = \sum_{h=1}^{L} W^*_h(\bar y_{hk}/\bar x_{hk})\,\bar x'_{hk},$$
where $W^*_h$ are the calibrated weights such that the chi-square distance
$$\sum_{h=1}^{L}\frac{(W^*_h - W_h)^2}{W_hQ_h}$$
is minimized subject to the constraints given below, and the $Q_h$ are predefined weights used to obtain different estimators. The above distance is minimized subject to these constraints:
$$\sum_{h=1}^{L} W^*_h\bar x_{hk} = \sum_{h=1}^{L} W_h\bar x'_{hk},$$

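30 and
$$\sum_{h=1}^{L} W^*_h\hat\sigma^2_{hk}(x) = \sum_{h=1}^{L} W_h\hat\sigma'^2_{hk}(x),$$
where $W_h = N_h/N$ are known stratum weights. Then we get calibrated weights as
$$W^*_h = W_h + \{W_hQ_h\bar x_{hk}A\}/C + \{W_hQ_h\hat\sigma^2_{hk}(x)B\}/C,$$
where
$$A = \left[\sum_{h=1}^{L} W_h(\bar x'_{hk} - \bar x_{hk})\right]\left[\sum_{h=1}^{L} W_hQ_h\hat\sigma^4_{hk}(x)\right] - \left[\sum_{h=1}^{L} W_h\left(\hat\sigma'^2_{hk}(x) - \hat\sigma^2_{hk}(x)\right)\right]\left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}\hat\sigma^2_{hk}(x)\right],$$
$$B = \left[\sum_{h=1}^{L} W_h\left(\hat\sigma'^2_{hk}(x) - \hat\sigma^2_{hk}(x)\right)\right]\left[\sum_{h=1}^{L} W_hQ_h\bar x^2_{hk}\right] - \left[\sum_{h=1}^{L} W_h(\bar x'_{hk} - \bar x_{hk})\right]\left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}\hat\sigma^2_{hk}(x)\right],$$
$$C = \left[\sum_{h=1}^{L} W_hQ_h\bar x^2_{hk}\right]\left[\sum_{h=1}^{L} W_hQ_h\hat\sigma^4_{hk}(x)\right] - \left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}\hat\sigma^2_{hk}(x)\right]^2,$$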
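A small numerical sketch can confirm that weights of the form $W_h + W_hQ_h\bar x_{hk}A/C + W_hQ_h\hat\sigma^2_{hk}(x)B/C$ satisfy the two calibration constraints. Apart from the stratum weights, which are taken from the simulation design, the stratum summaries below are invented purely for illustration, and the function name `calibrated_weights` is hypothetical.

```python
def calibrated_weights(W, Q, xbar, s2, xbar_first, s2_first):
    """Chi-square calibrated stratum weights under two constraints.

    W, Q: stratum weights and tuning weights; xbar, s2: second-phase means
    and variances of x; xbar_first, s2_first: first-phase means and variances.
    """
    d1 = sum(w * (xf - x) for w, xf, x in zip(W, xbar_first, xbar))
    d2 = sum(w * (sf - s) for w, sf, s in zip(W, s2_first, s2))
    a = sum(w * q * x * x for w, q, x in zip(W, Q, xbar))
    b = sum(w * q * x * s for w, q, x, s in zip(W, Q, xbar, s2))
    c = sum(w * q * s * s for w, q, s in zip(W, Q, s2))
    A = d1 * c - d2 * b
    B = d2 * a - d1 * b
    C = a * c - b * b   # must be nonzero for the weights to exist
    return [w + w * q * x * A / C + w * q * s * B / C
            for w, q, x, s in zip(W, Q, xbar, s2)]

# Stratum weights from the simulation design; all other numbers are made up.
W = [0.1633, 0.6805, 0.1562]
Q = [1.0, 1.0, 1.0]
xbar, s2 = [85.0, 100.0, 115.0], [20.0, 30.0, 25.0]
xbar_first, s2_first = [84.5, 100.2, 114.8], [21.0, 29.0, 26.0]
W_star = calibrated_weights(W, Q, xbar, s2, xbar_first, s2_first)

lhs_mean = sum(ws * x for ws, x in zip(W_star, xbar))
rhs_mean = sum(w * xf for w, xf in zip(W, xbar_first))
lhs_var = sum(ws * s for ws, s in zip(W_star, s2))
rhs_var = sum(w * sf for w, sf in zip(W, s2_first))
```

The calibrated weights need not sum to one, since no such constraint is imposed here.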

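31 A calibrated estimator of $\bar y_{kI}$ is given by
$$\bar y^*_{kI} = \sum_{h=1}^{L} W_h\bar y_{hkI} + \hat\beta_{x1}\left[\sum_{h=1}^{L} W_h(\bar x'_{hk} - \bar x_{hk})\right] + \hat\beta_{x2}\left[\sum_{h=1}^{L} W_h\left(\hat\sigma'^2_{hk}(x) - \hat\sigma^2_{hk}(x)\right)\right],$$
with $\bar y_{hkI} = (\bar y_{hk}/\bar x_{hk})\,\bar x'_{hk}$, where
$$\hat\beta_{x1} = \frac{1}{C}\left\{\left[\sum_{h=1}^{L} W_hQ_h\hat\sigma^4_{hk}(x)\right]\left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}\bar y_{hkI}\right] - \left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}\hat\sigma^2_{hk}(x)\right]\left[\sum_{h=1}^{L} W_hQ_h\hat\sigma^2_{hk}(x)\bar y_{hkI}\right]\right\},$$
$$\hat\beta_{x2} = \frac{1}{C}\left\{\left[\sum_{h=1}^{L} W_hQ_h\bar x^2_{hk}\right]\left[\sum_{h=1}^{L} W_hQ_h\hat\sigma^2_{hk}(x)\bar y_{hkI}\right] - \left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}\hat\sigma^2_{hk}(x)\right]\left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}\bar y_{hkI}\right]\right\},$$
and $C = \left[\sum_h W_hQ_h\bar x^2_{hk}\right]\left[\sum_h W_hQ_h\hat\sigma^4_{hk}(x)\right] - \left[\sum_h W_hQ_h\bar x_{hk}\hat\sigma^2_{hk}(x)\right]^2$. Equivalently, the two correction terms together equal $\left\{A\sum_h W_hQ_h\bar x_{hk}\bar y_{hkI} + B\sum_h W_hQ_h\hat\sigma^2_{hk}(x)\bar y_{hkI}\right\}/C$, which is what substituting the calibrated weights $W^*_h$ into $\sum_h W^*_h\bar y_{hkI}$ gives.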

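32 It is readily seen that
$$y^*_{hki}(-hkj) = \left[\bar y_{hk}(-hkj)/\bar x_{hk}(-hkj)\right]x_{hki}$$
under ratio imputation when the $hkj$th respondent is deleted, and
$$\bar y^*_{kI}(-hkj) = \sum_{h=1}^{L} W_h\frac{\bar y_{hk}(-hkj)}{\bar x_{hk}(-hkj)}\,\bar x'_{hk}(-hkj) + \hat\beta_{x1}(-hkj)\left[\sum_{h=1}^{L} W_h\left(\bar x'_{hk}(-hkj) - \bar x_{hk}(-hkj)\right)\right] + \hat\beta_{x2}(-hkj)\left[\sum_{h=1}^{L} W_h\left(\hat\sigma'^2_{hk}(-hkj)(x) - \hat\sigma^2_{hk}(-hkj)(x)\right)\right],$$
where
$$\bar x'_{hk}(-hkj) = \frac{n'_{hk}\bar x'_{hk} - x_{hkj}}{n'_{hk} - 1} \quad\text{for all } j \in s'_{hk},$$
$$\bar y_{hk}(-hkj) = \frac{n_{hk}\bar y_{hk} - y_{hkj}}{n_{hk} - 1}, \qquad \bar x_{hk}(-hkj) = \frac{n_{hk}\bar x_{hk} - x_{hkj}}{n_{hk} - 1} \quad\text{for } j \in s_{hk},$$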

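33
$$\hat\beta_{x1}(-hkj) = \frac{1}{C(-hkj)}\left\{\left[\sum_{h=1}^{L} W_hQ_h\hat\sigma^4_{hk}(-hkj)(x)\right]\left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}(-hkj)\bar y_{hkI}(-hkj)\right] - \left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}(-hkj)\hat\sigma^2_{hk}(-hkj)(x)\right]\left[\sum_{h=1}^{L} W_hQ_h\hat\sigma^2_{hk}(-hkj)(x)\bar y_{hkI}(-hkj)\right]\right\},$$
$$\hat\beta_{x2}(-hkj) = \frac{1}{C(-hkj)}\left\{\left[\sum_{h=1}^{L} W_hQ_h\bar x^2_{hk}(-hkj)\right]\left[\sum_{h=1}^{L} W_hQ_h\hat\sigma^2_{hk}(-hkj)(x)\bar y_{hkI}(-hkj)\right] - \left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}(-hkj)\hat\sigma^2_{hk}(-hkj)(x)\right]\left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}(-hkj)\bar y_{hkI}(-hkj)\right]\right\},$$
with $\bar y_{hkI}(-hkj) = \left(\bar y_{hk}(-hkj)/\bar x_{hk}(-hkj)\right)\bar x'_{hk}(-hkj)$, and
$$\hat\sigma'^2_{hk}(-hkj)(x) = (n'_{hk} - 1)^{-1}\sum_{i=1}^{n'_{hk}}\left(x_{hki} - \bar x'_{hk}(-hkj)\right)^2, \qquad \hat\sigma^2_{hk}(-hkj)(x) = (n_{hk} - 1)^{-1}\sum_{i \in s_{hk}}\left(x_{hki} - \bar x_{hk}(-hkj)\right)^2,$$
where
$$A(-hkj) = \left[\sum_{h=1}^{L} W_h\left(\bar x'_{hk}(-hkj) - \bar x_{hk}(-hkj)\right)\right]\left[\sum_{h=1}^{L} W_hQ_h\hat\sigma^4_{hk}(-hkj)(x)\right] - \left[\sum_{h=1}^{L} W_h\left(\hat\sigma'^2_{hk}(-hkj)(x) - \hat\sigma^2_{hk}(-hkj)(x)\right)\right]\left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}(-hkj)\hat\sigma^2_{hk}(-hkj)(x)\right],$$
$$B(-hkj) = \left[\sum_{h=1}^{L} W_h\left(\hat\sigma'^2_{hk}(-hkj)(x) - \hat\sigma^2_{hk}(-hkj)(x)\right)\right]\left[\sum_{h=1}^{L} W_hQ_h\bar x^2_{hk}(-hkj)\right] - \left[\sum_{h=1}^{L} W_h\left(\bar x'_{hk}(-hkj) - \bar x_{hk}(-hkj)\right)\right]\left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}(-hkj)\hat\sigma^2_{hk}(-hkj)(x)\right],$$
$$C(-hkj) = \left[\sum_{h=1}^{L} W_hQ_h\bar x^2_{hk}(-hkj)\right]\left[\sum_{h=1}^{L} W_hQ_h\hat\sigma^4_{hk}(-hkj)(x)\right] - \left[\sum_{h=1}^{L} W_hQ_h\bar x_{hk}(-hkj)\hat\sigma^2_{hk}(-hkj)(x)\right]^2 .$$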

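34 Using these values, the jackknife variance estimator of $\bar y_{kI}$ is given by
$$v_{Jcr}(\bar y_{kI}) = \sum_{h}\frac{n'_{hk} - 1}{n'_{hk}}\sum_{j=1}^{n'_{hk}}\left[\bar y^a_{kI}(-hkj) - \bar y_{kI}\right]^2 .$$
Noting that $\bar y^a_{kI}(-hkj) - \bar y_{kI} = W_h\left[\bar y^a_{hkI}(-hkj) - \bar y_{hkI}\right]$, where $\bar y^a_{hkI}(-hkj)$ is the adjusted imputed estimator of the $h$th stratum mean $\bar Y_h$ when the $hkj$th sample unit is deleted, we get
$$v_{Jcr}(\bar y_{kI}) = \sum_{h} W_h^2\, v_{Jcr}(\bar y_{hkI}) = \sum_{h} W_h^2\,\frac{n'_{hk} - 1}{n'_{hk}}\sum_{j=1}^{n'_{hk}}\left(\bar y^a_{hkI}(-hkj) - \bar y_{hkI}\right)^2 .$$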

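35 The jackknife variance estimator is a weighted average of two estimators, given by
$$v_{Jcr} = \frac{n'_1 v_{Jcr}(\bar y_{1I}) + n'_2 v_{Jcr}(\bar y_{2I})}{n'_1 + n'_2} = \frac{n'_1}{n'_1 + n'_2}\sum_{h} W_h^2\,\frac{n'_{h1} - 1}{n'_{h1}}\sum_{j=1}^{n'_{h1}}\left(\bar y^a_{h1I}(-h1j) - \bar y_{h1I}\right)^2 + \frac{n'_2}{n'_1 + n'_2}\sum_{h} W_h^2\,\frac{n'_{h2} - 1}{n'_{h2}}\sum_{j=1}^{n'_{h2}}\left(\bar y^a_{h2I}(-h2j) - \bar y_{h2I}\right)^2,$$
where $n'_1 = \sum_{h} n'_{h1}$ and $n'_2 = \sum_{h} n'_{h2}$.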

36 Conclusions
- Simulation shows greater benefit from increasing the first-phase sample size than from increasing the second-phase sample size.
- The jackknife variance estimator for two samples has only a slightly smaller mean than the jackknife variance estimator for one sample.
- The jackknife variance estimator for two samples has a smaller SD (less variation) than the jackknife variance estimator for one sample.
- Future: apply the jackknife variance estimator for two samples to stratified multistage sampling.

