Optimal Calibration Estimators Under Two-Phase Sampling

Size: px

Start display at page:

Download "Optimal Calibration Estimators Under Two-Phase Sampling"

Francis Brown
5 years ago
Views:

1 Journal of Of cial Statistics, Vol. 19, No. 2, 2003, pp. 119±131 Optimal Calibration Estimators Under Two-Phase Sampling Changbao Wu 1 and Ying Luan 2 Optimal calibration estimators require in general complete auxiliary information. When such information is not available, the estimation procedure can be combined with two-phase sampling where a large, less costly rst-phase sample measured over the auxiliary variables is used to get estimates for certain population quantities related to the covariates. In this article we propose optimal calibration estimators for the population mean, the distribution function, the population variance and other second-order nite population quantities under two-phase sampling. The proposed optimal calibration estimators for a second-order nite population quantity such as the population variance can ideally be used to obtain more ef cient variance estimators for a rst-order nite population quantity such as the total or the distribution function. The estimation strategy can be applied to various measurement error and nonresponse problems. The design-based nite sample performances of proposed estimators are investigated through a simulation study using real survey data from the 1996 Statistics Canada Family Expenditure FAME) Survey. Key words: Auxiliary information; measurement error; model-assisted approach; nonresponse; variance estimation. 1. Introduction Many highly ef cient estimation techniques in survey sampling require strong information about auxiliary variables, x. For example, the calibration estimator of Deville and SaÈrndal 1992) requires, the population totals. When such information is not available, a twophase sampling scheme can be used where a large, less costly rst-phase sample measured over the x variable is used to obtain a good estimate of. A smaller and more costly second-phase sample can then be taken and the study variable y is observed. The major advantage of using two-phase sampling is the gain in high precision without substantial increase in cost. SaÈrndal et al. 1992, Chapter 9) provide an excellent account of two-phase sampling. Recently, Wu and Sitter 2001) proposed a model-calibration approach to using auxiliary information from surveys. The proposed model-calibration estimator is not only highly ef cient but also optimal among a class of calibration estimators Wu 1 Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada. cbwu@uwaterloo.ca 2 Department of Statistics, University of California, Riverside, CA 92521, USA. Acknowledgments: This research was supported by a grant fromthe Natural Sciences and Engineering Research Council of Canada. The authors thank Professor Randy R. Sitter for helpful comments. Thanks are also due to the Associate Editor and three anonymous referees for constructive suggestions and comments that greatly improved the presentation. q Statistics Sweden

2 120 Journal of Of cial Statistics 2002). To compute the model-calibration estimator, however, one generally requires the values of the x variables to be known for the entire nite population. In practice this complete auxiliary information is often unavailable. Two-phase sampling provides an ideal solution for the use of optimal calibration estimators under such situations. Suppose U ˆf1; 2;...; Ng is the set of labels for the nite population and s is the set of labels for the sampled units. Let y i and x i be the values of the response variable y and the vector of covariates x associated with the i th unit. Let p i be the rst-order inclusion probabilities. Assuming the population totals ˆ P N i ˆ 1 x i are known, the conventional calibration estimator Deville and SaÈrndal 1992) for the nite population total Y ˆ P N i ˆ 1 y i is constructed as Y Ã C ˆ P w i y i, where the weights w i minimize a distance measure F s between the w i 's and the basic design weights d i ˆ 1=p i subject to the so-called benchmark constraints w i x i ˆ 1 The x i used in 1) are also referred to as calibration variables. There are two basic components in the construction of a calibration estimator: a distance measure F s and a set of constraints such as 1). The chi-squared distance measure F s ˆ P w i d i 2 = d i q i is commonly used, where the weighting factors q i are unrelated to d i. Other distance measures can also be considered but it has been shown by Deville and SaÈrndal 1992) that the resulting calibration estimator is asymptotically equivalent to the one using a chi-squared distance measure with a certain choice of q i. The benchmark constraints 1) which calibrate directly over the individual x variables are indeed ad hoc. There is no compelling reason to exclude the use of other types of constraints. Optimal calibration estimators have been studied by Wu 2002) under a model-assisted framework. The following semi-parametric superpopulation model was used to motivate the optimality considerations, E y y i x i ˆm x i ; v ; V y y i x i ˆ v x i Š 2 j 2 2 where m ; and v are known functions, v and j 2 are unknown model parameters, and E y, V y denote the expectation and the variance under the model, y. It was also assumed that y 1 ; y 2 ;...; y N are conditionally independent given the x i 's. Wu 2002) considered the class of calibration estimators for the population total Y obtained by using a) any chi-squared distance measure with the weighting factors satisfying q i > q for some constant q > 0andN 1 P N i ˆ 1 qi 2 ˆ O 1 ; b) any constraint through a dimension-reduction variable u i ˆ u x i ; v, i.e., w i u x i ; v ˆN u x i ; v 3 i ˆ 1 where the functional form of u ; can be arbitrary. Some of the major results can be summarized as follows: i) For any consistent estimator of v such that v Ã ˆ v o p 1, replacing v by v Ã in the constraint 3) will not change the resulting calibration estimator asymptotically. ii) Let v Ã P ˆ d i q i x i x T i 1 P d i q i x i y i. If we use u i ˆ xi T Ãv as calibration

3 Wu and Luan: Optimal Calibration Estimators Under Two-Phase Sampling 121 variable in 3), where T denotes transpose, the resulting calibration estimator is identical to the conventional calibration estimator Ã Y C using 1). Hence, the class of calibration estimators considered by Wu 2002) is very general and includes the conventional calibration estimator as a special member. iii) The model-calibration estimator Ã ÅY MC of Wu and Sitter 2001) obtained by using u i ˆ E y y i x i ˆm x i ; v in 3) is optimal under the criterion of minimum model expectation of the asymptotic design-based variance, E y fav p Ã Y g. Here AV p refers to the asymptotic variance under the sampling design, p, and Ã Y is any calibration estimator from the class. iv) For the estimation of the population distribution function N 1 F Y t ˆN I y i # t i ˆ 1 where I denotes the indicator function, the optimal calibration variable is given by g x i ; t ˆE y I y i # t jx i ŠˆP y i # t x i, which is dependent on the particular value of t. v) To estimate a second-order nite population quantity such as the population variance or more generally Q ˆ P N P Nj i ˆ 1 ˆ i 1 f y i ; y j, the optimal calibration variable is u ij ˆ E y f y i ; y j x i ; x j Š. For proofs, the required regularity conditions and more discussion we refer to Wu 2002) which is available online at It is clear that optimal calibration estimation requires in general the complete auxiliary information x 1 ; x 2 ;...; x N. In this article we focus on the use of optimal calibration estimators under two-phase sampling assuming that the complete auxiliary information is not available but a large rst-phase sample on the x variables can be easily collected with affordable cost. Estimation of the population mean or total using a regression-type estimator under twophase sampling has been discussed in the literature. Less attention, however, has been given to the estimation of the distribution function, the variance or other second-order nite population quantities under the context of two-phase sampling. In Section 2, we present optimal calibration estimators for various rst- and second-order nite population quantities under two-phase sampling under a uni ed framework. In Section 3, the optimal calibration estimators for second-order nite population quantities are naturally used to construct more ef cient variance estimators for the regression estimator for the population mean. Variance estimation for the distribution function is addressed in Section 4. Some empirical results regarding the nite sample design-based performances of the proposed estimators are reported in Section 5 using real survey data of the 1996 Statistics Canada Family Expenditure FAME) Survey for the province of Ontario. Applications to measurement error and nonresponse problems are brie y discussed in Section 6. Some concluding remarks are given in Section Optimal Calibration Estimators Under Two-Phase Sampling For simplicity of presentation, we consider cases where a relatively large rst-phase sample s 0 of xed size n 0 is selected using simple random sampling without replacement,

4 122 Journal of Of cial Statistics and x i is measured for all 0. A second-phase sample s of size n is selected using a general sampling design p s 0 with rst- and second-order inclusion probabilities p i s 0 and p ij s 0, and the response variable y i is observed for. The estimators presented below, however, can be extended to cases where the rst-phase sampling design is also arbitrary. Let d i s 0 ˆ 1=p i s 0 and d i ˆ N=n 0 d i s Estimating the population total Under Model 2), the model-calibration estimator for the population total Y is de ned as ÃY MC ˆ P w i y i, where the calibrated weights w i minimize F s ˆ w i d i 2 = d i q i subject to w i m x i ; v ˆ N=n Ã 0 m x i ; v Ã 0 Note that the right-hand side of 4) is an estimate for P N i ˆ 1 m x i ; v Ã based on the rstphase sample s 0. It is straightforward to show that " ÃY MC ˆ N n 0 d i s 0 y i m x i ; v Ã # ) d i s 0 m x i ; v Ã ÃB Y 0 where B Ã Y ˆ P d i s 0 q i Ãm i y i = P d i s 0 q i Ãm i 2 and Ãm i ˆ m x i ; v. Ã It can be shown that, under some mild regularity conditions, YMC Ã is asymptotically equivalent to " YMC ˆ N n 0 d i j s 0 y i m x i ; v N # ) d i j s 0 m x i ; v N B Y 0 where B Y ˆ P N i ˆ 1 q i m i y i = P N i ˆ 1 q i mi 2, m i ˆ m x i ; v N,andv N are the nite population parameters estimated by v. Ã See Wu and Sitter 2001) for a detailed discussion on the estimation of model parameters. The required regularity conditions were detailed in an unpublished master's essay at the University of Waterloo Luan 2001). It follows that AV p Y Ã MC ˆV p YMC ˆN 2 1 n 0 1 N S 2 Y N n 0 " # 2 E 1 V 2 d i j s 0 y i B Y m i where E 1 and V 2 denote the expectation and the variance under the rst-phase and the second-phase sampling design, respectively, and S 2 Y is the nite population variance for the y variable. Since the value of S 2 Y is unaffected by different choices of the calibration variable u i and E y E 1 V 2 ŠˆE 1 fe y V 2 g Š, following the same argument as in Theorem 1 of Wu 2002), we can show that Ã YMC minimizes E y AV p Ã Y Š among the class of calibration estimators Ã Y similarly de ned as in Section 1 with 3) modi ed for two-phase sampling in the form of 4) Estimating the distribution function The model-calibrated pseudo-empirical maximum likelihood estimator ME), which is asymptotically equivalent to the optimal calibration estimator Wu and Sitter 2001), is 4

5 Wu and Luan: Optimal Calibration Estimators Under Two-Phase Sampling 123 particularly appealing for the estimation of the distribution function. Under two-phase sampling, the ME estimator for F Y t is de ned as F Ã ME t ˆP Ãp i I y i # t where the weights Ãp i maximize the pseudo-empirical log-likelihood function Ãl p ˆN n 0 d i j s 0 log p i 5 subject to p i ˆ 1 0 < p i < 1 and p i g x i ; t ˆ 1 n 0 g x i ; t 6 0 where g x i ; t ˆE y I y i # t jx i ŠˆP y i # tjx i. Under a regression working model y i ˆ m x i ; v v x i «i ; i ˆ 1; 2;...; N 7 where the «i 's are independent and identically distributed iid) as N 0; j 2,wehave g x i ; t ˆG fy m x i ; v g=fv x i jgš The G is the cumulative distribution function of N 0; 1. In applications the unknown model parameters v and j 2 will have to be replaced by sample-based estimates. Note that if a prespeci ed t ˆ t 0 is used in 6) while the resulting weights Ãp i are used in Ã F ME t for an arbitrary t, Ã FME t will itself be a genuine distribution function. The optimality of ÃF ME t at t ˆ t 0 can be established along the lines of Theorem 2 in Wu 2002). When the regression model 7) is not desirable, a slightly less ef cient but more robust logistic regression model can be used to obtain g x i ; t. See Chen and Wu 2002) or Wu 2002) for details. A simple algorithm for computing the pseudo-empirical likelihood weights Ãp i can be found in Chen et al. 2002) Estimating second-order nite population quantities The population variance, the variance of a linear estimator or more generally a second-order nite population quantity of the form Q ˆ P N P Nj i ˆ 1 ˆ i 1 f y i ; y j can also be ef ciently estimated through optimal calibration. Under two-phase sampling, the optimal model-calibration estimator for Q is given by QMC Ã ˆ P P j > i w ij f y i ; y j, where the weights w ij minimize the modi ed distance measure F s ˆ P P j > i w ij d ij 2 = d ij q ij subject to 1 w ij E y f y i ; y j x i ; x j ŠˆN N n 0 n 0 E 1 y f y i ; y j x i ; x j Š 8 j > i Here d ij ˆ d ij s 0 N N 1 Š= n 0 n 0 1 Š and d ij s 0 ˆ 1=p ij s 0. The weighting factors q ij in the distance measure are prespeci ed and a common choice would be q ij ˆ 1. This optimal model-calibration estimator QMC Ã can be explicitly expressed as a regressiontype estimator, as detailed below for the population variance SY 2. Consider the nite population variance SY 2 ˆ N 1 1 P N i ˆ 1 y i Y 2 which can be alternatively expressed as SY 2 ˆ N N 1 Š 1 P N P Nj i ˆ 1 ˆ i 1 y i y j 2. Under a linear regression working model y i ˆ x T i v v x i «i ; i ˆ 1; 2;...; N 9 0 j > i

6 124 Journal of Of cial Statistics where the «i 's are iid with mean zero and variance j 2,wehave E y y i y j 2 x i ; x j Šˆv T x i x j x i x j T v j 2 v 2 x i v 2 x j Š Let u ij ˆ Ãv T x i x j x i x j T v Ã Ãj 2 v 2 x i v 2 x j Š and v ij ˆ y i y j 2.Itcanbe shown that the resulting optimal model-calibration estimator for SY 2 is given by " ÃS MC 2 1 ˆ n 0 n 0 d 1 ij s 0 v ij u ij ) # d ij s 0 u ij ÃB S j > i 0 j > i j > i where B Ã S ˆ P P j > i d ij s 0 q ij u ij v ij = P P j > i d ij s 0 q ij uij 2. For a general second-order nite population quantity such as Q, one needs to use v ij ˆ f y i ; y j, u ij ˆ E y f y i ; y j x i ; x j Š, and replace 1= n 0 n 0 1 Š by N N 1 = n 0 n 0 1 Š in the above formulation. If simple random sampling without replacement SRSWOR) is also used at the second phase, S Ã 2 MC reduces to "!# v 2 x i 1 v 2 x n i ÃB S 0 ÃS 2 MC ˆ s 2 Y Ã v T s 0 2 s 2 Ã v Ãj 2 1 n 0 where s 2 Y is the usual sample variance based on y i,, ands 0 2 and s 2 are the sample variance-covariance matrices for the x variable based on s 0 and s, respectively. If further the regression model 9) has a homogeneous variance structure, i.e., v x i ; 1, we have ÃS 2 MC ˆ s 2 Y Ã v T s 0 2 s 2 Ã v Ã B S More Ef cient Variance Estimators for the Regression Estimator Under the commonly used regression model 9), it can be shown that B Ã Y ˆ 1 and the optimal model-calibration estimator for the population mean ÅY ˆ N 1 P N i ˆ 1 y i is identical to the conventional regression estimator under two-phase sampling Wu and Sitter 2001): ÃÅY MC ˆ 1 n 0 " d i j s 0 y i x i! T # d i s 0 x i v Ã 0 where v Ã are the estimated regression coef cients. If SRSWOR is used at the second phase, the approximate design-based variance of Ã YMC Å is given by V p Ã YMC Å Çˆ 1 n 0 1 SY 2 1 N n 1 n 0 SE 2 where SY 2 is de ned as before, SE 2 is the nite population variance de ned over the residual variable e i ˆ y i xi T v N,andv N is the nite population regression coef cients. The conventional variance estimator for Ã YMC Å is obtained if one replaces SY 2 by sy 2 and SE 2 by se, 2 the sample variances based on the second-phase sample s, and using Ãe i ˆ y i xi T Ãv. See for example Cochran 1977, p. 343). We denote this basic variance estimator by v 0 Ã ÅY MC. An improved variance estimator which makes more complete use of the large rstphase sample measured over the covariates x can be easily developed as follows. The SY 2 can be more ef ciently estimated by S Ã MC 2 given by 10). As for SE 2, the optimal model-calibration estimator can be similarly constructed by noting that E y e i ˆ0and

7 Wu and Luan: Optimal Calibration Estimators Under Two-Phase Sampling 125 E y e i e j 2 x i ; x j Š Çˆ j 2 v 2 x i v 2 x j Š. The resulting estimator for SE 2 is given by " # ÃS E 2 ˆ se 2 1 n 0 v 2 x i 1 v 2 x n i ÃB E 0 where B Ã E is similarly de ned as B Ã S but using v ij ˆ Ãe i Ãe j 2 and u ij ˆ v 2 x i v 2 x j.it is interesting to notice that, if v x i ; 1, S Ã 2 E ˆ se, 2 the large rst-phase sample on x contains no extra information to improve the estimation of SE. 2 See Wu 2002) for more discussion on this issue. Our proposed variance estimator is as follows: v 1 Ã YMC Å ˆ 1 n 0 1 sy 2 1 N n 1 n 0 se 2 1 n 0 1 Ãv T 0 s 2 N s 2 v Ã B Ã S " # 1 n 0 v 2 x i 1 v 2 1 x n i n 0 1 ÃB N S 1 n 1 ÃB n 0 E Ãj 2 0 If v x i ; 1, the very last term in v 1 Ã YMC Å vanishes, and in this case v 1 Ã YMC Å reduces to v 1 Åy lr proposed by Sitter 1997) if we also replace B Ã S by 1. Note that, under the model 9), E y B Ã S Çˆ 1, E y B Ã E Çˆ 1, an alternative variance estimator is obtained if we replace both B Ã S and B Ã E by 1. This is given by v 2 Ã YMC Å ˆ 1 n 0 1 sy 2 1 N n 1 n 0 se 2 1 n 0 1 Ãv T 0 s 2 N s 2 v Ã 1 n 1 " # 1 N n 0 v 2 x i 1 v 2 x n i Ãj 2 0 Both v 1 and v 2 are design-consistent. The design-based nite sample performances of v 1 and v 2 are further investigated in Section 5 through a simulation study. Also included in the simulation study is the delete-1 jackknife variance estimator, v J. The detailed formulation of v J was given by Sitter 1997). 4. Variance Estimation for the Distribution Function 4.1. Analytical variance estimators Analytical variance estimators for F Ã ME t which make more ef cient use of the rst-phase sample can be developed in the same spirit to v 1 or v 2. Following the arguments of Chen and Wu 2002), it can be shown that ÃF ME t ˆ 1 Ãn 0 d i s 0 I y i # t " # 1 n 0 g x i ; t 1 Ãn 0 d i s 0 g x i ; t B N o p 0 1 p n where Ãn 0 ˆ P d i s 0, B N ˆ P N i ˆ 1 g x i ; t g N ŠI y i # t = P N i ˆ 1 g x i ; t g N Š 2,and g N ˆ N 1 P N i ˆ 1 g x i ; t. Under two-phase sampling with SRSWOR at both phases,! 11

8 126 Journal of Of cial Statistics we have V p F Ã ME t Š Çˆ 1 n 0 1 SI 2 1 N n 1 n 0 SD 2 where SI 2 and SD 2 are the nite population variances de ned over the variable I i ˆ I y i # t and D i ˆ I y i # t g x i ; t B N, respectively. The usual substitution estimator for V p F Ã ME t Š is denoted by v 0 F Ã ME t Š, withsi 2 replaced by si 2 and SD 2 replaced by sd, 2 the sample variances based on s. A more ef cient variance estimator is readily available if we estimate both SI 2 and SD 2 using the general method described in Section 2.3. The resulting estimators S Ã I 2 and S Ã D 2 can both be expressed as " # ÃS 2 ˆ s n 0 n 0 u 1 ij u n n 1 ij ÃB F 0 j > i P j > i u ij v ij = P j > i P j > i u 2 ij and s 2 is the corresponding sample var- where B Ã F ˆ P iance based on s. ForSI 2, v ij ˆ I i I j 2, u ij ˆ E y I i I j 2 x i ; x j Šˆg i g j 2g i g j, and g i ˆ g x i ; t ; for SD, 2 v ij ˆ D i D j 2, u ij ˆ E y D i D j 2 x i ; x j Š Çˆ g i 1 g i g j 1 g j. As usual, any unknown model parameters appearing in D i or u ij will be replaced by appropriate sample-based estimates. Let v 1 F Ã ME t Š be the estimator of V p F Ã ME t Š obtained by using S Ã I 2 and S Ã D 2 in place of SI 2 and SD, 2 respectively. Note that SI 2 ˆ N N 1 1 F Y t 1 F Y t Š, an alternative estimator for SI 2 could simply be N N 1 1 FME Ã t 1 F Ã ME t Š. We denote the resulting estimator of V p F Ã ME t Š as v 2 F Ã ME t Š, where SD 2 is estimated by S Ã D,asinv 2 1 F Ã ME t Š. Similar to the mean case, both v 1 and v 2 are design-consistent estimators of V p F Ã ME t Š Jackknife variance estimator The pseudo-empirical maximum likelihood estimator F Ã ME t is a nonlinear estimator and its exact design-based variance does not have a closed form. The conventional delete-1 jackknife variance estimator often provides an attractive alternative in such cases. Under two-phase sampling, the delete-1 estimator F Ã ME jš needs to be computed differently for j [ s and for j [ s 0 s. Let s j denote the set s with the j th unit deleted. Let FME Ã jš ˆP j Ãp i I y i # t if j [ s, where the weights Ãp i i Þ j maximize l Ã p ˆP j d i s 0 log p i subject to P j p i ˆ 1 0 < p i < 1 and P j p i g x i ; t ˆ n P 0 j g x i ; t ; let FME Ã jš ˆP Ãp i I y i # t if j [ s 0 s, where the weights Ãp i maximize l Ã p ˆP d i s 0 log p i subject to P p i ˆ 1 0 < p i < 1 and P p i g x i ; t ˆ n P 0 j g x i ; t. The usual jackknife variance estimator is v J F Ã ME t Š ˆ n 0 1 n 0 ff Ã ME jš F Ã ME t g 2 12 j [ s 0 This variance estimator is consistent if SRSWOR is used at both phases and the rst-phase sampling fraction n 0 =N is negligible, since F Ã ME t is asymptotically equivalent to a regression type estimator given by 11) Sitter 1997). If the rst-phase sampling fraction cannot be ignored, an ad hoc adjustment can be made by multiplying v J by 1 n 0 =N. This has been done in the simulation study.

9 Wu and Luan: Optimal Calibration Estimators Under Two-Phase Sampling 127 The major advantage of using v J is the operational convenience. For parameters in a general form of W ˆ WfF Y t g, the jackknife variance estimator for Ã W ˆ Wf Ã F ME t g can be readily formulated by replacing Ã F ME jš and Ã F ME t in 12)by Ã W jš ˆWf Ã F ME jšg and ÃW ˆ Wf Ã F ME t g. This is most useful when the method is applied to in nite populations where parameters of interest are often in the form of v ˆ v F. 5. Some Empirical Results In this section we report some simulation results regarding the design-based nite sample performances of the proposed estimators using real survey data from the 1996 Statistics Canada Family Expenditure FAME)Survey for the province of Ontario. The data set contains 2,396 observations over a variety of variables. In the simulation, the variable y, total expenditure, was used as the response, and x 1 number of people in the household) and x 2 total income after taxes)were included as auxiliary variables. It was found that a simple linear regression model y ˆ b 0 b 1 x 1 b 2 x 2 «gives a reasonable t to the data, with the nonhomogeneous variance structure V y «ˆx 2 j 2 tting slightly better than the constant variance model. Treating the data set as a xed nite population, we rst selected a two-phase sample with SRSWOR at both phases, and then computed various estimators presented in Sections 2, 3, and 4. For the distribution function F Y t we included results on ve different values of t corresponding to the 10th, 25th, 50th, 75th, and 90th population quantiles. The process was repeated B ˆ 5,000 times for point estimators Ã Å YMC, Ã S 2 MC and Ã F ME t,and B ˆ 1,000 times for variance estimators v 0, v 1, v 2,andv J. The performance of an estimator Ã T for a certain population quantity T was evaluated using the relative percentage bias RB%)and the relative ef ciency RE)de ned as RB% ˆ 1 B B b ˆ 1 ÃT b T T 100 and RE ˆ MSE Ã T 0 MSE Ã T where Ã T b was computed from the b th simulated sample, MSE Ã T ˆB 1 P B b ˆ 1 Ã T b T 2, and Ã T 0 denotes the baseline estimator for comparison. For the estimation of ÅY, S 2 Y and F Y t, Ã T0 is given by Åy ˆ n 1 P y i, s 2 Y ˆ n 1 1 P y i y 2 and ÃF Y t ˆn 1 P I y i # t, respectively. Table 1 reports the simulated RE's for the optimal-calibration or the model-calibrated pseudo-empirical maximum likelihood estimators for the population mean Y, the population variance S 2 Y and the nite population distribution function F Y t at t ˆ t a, a ˆ 0:10, 0.25, 0.50, 0.75, and The rst-phase sample size was chosen as n 0 ˆ 400. The absolute values of the RB's are all less than 2% and are not reported here to save space. Table 1. Relative ef ciency of estimators for ÅY, S 2 Y and F Y t n 0 n ÃÅY MC Ã S 2 MC Ã FME t at t ˆ t a

10 128 Journal of Of cial Statistics Table 2. Relative bias in %) of variance estimators for Ã ÅY MC and ÃF ME t n 0 n v ÃÅY MC Ã FME t at t ˆ t a v v v v J v v v v J v v v v J The proposed estimators perform uniformly better and are much superior to the conventional estimator Ã T 0 in all cases. With xed rst-phase sample size, the gain in ef ciency seems larger when the second-phase sample size n is smaller, although such a trend is not monotonic in terms of sample size. As pointed out by one of the referees, this is due to the fact that V p Ã T! V p Ã T 0 as n! n 0 for xed n 0. As for the distribution function F Y t, the improvement is more dramatic when t is not in the tail regions. For variance estimators, the true values i.e., T in the de nitions of RB and RE) of V p Ã Å YMC and V p Ã F ME t were estimated from another 5,000 independent simulation runs. The simulated RB's of v 0, v 1, v 2 and v J are reported in Table 2. All RB's are less than 11%, except for those cases relating to the estimation of the variance of ÃF ME 0:10 and Ã F ME 0:90 when n ˆ 40. The RB's in these cases all take negative values and are smaller than 20% for v 0, v 1 and v 2. The jackknife estimator v J also signi cantly overestimates the variance in these cases. It seems that none of these estimators, including the naive v 0, provides acceptable estimates when n is small and t is in the tail region. The simulated RE's for the variance estimators of Ã Å YMC and Ã F ME t are presented in Table 3. Note that we excluded the RE 's for the case of n ˆ 40 at t ˆ t 0:10 and t 0:90 where Table 3. Relative ef ciency of variance estimators for Ã ÅY MC and ÃF ME t n 0 n v ÃÅY MC Ã FME t at t ˆ t a v Ð Ð v Ð Ð v J 0.85 Ð Ð 80 v v v J v v v J

11 Wu and Luan: Optimal Calibration Estimators Under Two-Phase Sampling 129 the bias is not negligible, since comparison in terms of RE under such cases is probably misleading. The variance estimator for baseline comparison is v 0. Table 3 can be summarized as follows: i) v 1 and v 2 perform similar to each other and are uniformly better than v 0 ; ii) the improvement from using v 1 and v 2 over v 0 is only marginal for estimating V p Ã ÅY MC ; iii) v 1 and v 2 seem more ef cient for estimating V p Ã F ME t for t at large quantiles; and iv) the jackknife variance estimator, which does not reuse the auxiliary information available at the rst phase, is less stable and less ef cient in most cases. One explanation for ii) is probably the fact that the linear model y ˆ b 0 b 1 x 1 b 2 x 2 «with the nonhomogeneous variance structure V y «ˆx 2 j 2 only provides a slightly better t than the constant variance model. Under such circumstances there is little room to improve the estimation of S 2 E by using auxiliary information through a nonhomogeneous variance structure. The marginal gain comes from the estimation of S 2 Y using Ã S 2 MC. When the variance structure is clearly nonhomogeneous, higher ef ciency can be expected from using v 1 and v 2. See the theoretical and empirical results reported in Wu 2002) for the case of known complete auxiliary information. 6. Applications to Measurement Error and Nonresponse The proposed optimal calibration estimators under two-phase sampling are applicable to estimation problems under measurement errors or nonresponse. Let y i be the true value of a characteristic of interest for the i th unit in the nite population. Suppose it is dif cult or very expensive to obtain exact measurement but an inaccurate value i.e., measurement with error), say z i,ofy for the i th unit, can be obtained quite easily. In such situations the two-phase sampling technique is often employed. A large rst-phase sample s 0 is taken and the cheap inaccurate measurements z i are collected for all 0. A smaller secondphase sample s Ì s 0 is drawn and the exact measurements y i are collected using more extensive effort. The goal is to estimate various nite population quantities de ned over y i. The relationship between the y i 's and the z i 's can be described by the so-called regression calibration model Carroll et al. 1995), y i ˆ a bz i «i ; i ˆ 1; 2;...; N 13 where the «i 's are assumed independent and identically distributed random variates with E m «i ˆ0 and V m «i ˆj 2, and the subscript m refers to the measurement error model 13). If we treat z i as an auxiliary variable, the methodology developed in Sections 2, 3, and 4 can be applied directly here for estimation under the assumed measurement error model. Item nonresponse is often handled through imputation. If the response variable y is highly correlated with certain covariate s) x and missing values do not occur for the x variables, a direct analysis based on the observed sample data using our proposed estimation strategy may be more desirable. The original sample with complete information on x can be viewed as a rst-phase sample, and the subsample with observed responses on y is treated as a second-phase sample. The loss of information due to the nonresponse can be retreated from auxiliary information through the optimal estimation strategy. The

12 130 Journal of Of cial Statistics inference is conditionally valid for the given sample. An advantage of this approach is that the resulting optimal calibration estimators are design-consistent even if the superpopulation model 2) is misspeci ed. This is in contrast to estimation based on imputed data where the validity of the results depends on the correctness of the model used for imputation. 7. Concluding Remarks The calibration method has gained much popularity since its emergence in the early 90s, and calibration estimators are routinely computed by many survey organizations. The optimal calibration approach provides a uni ed framework for the ef cient estimation of various nite population quantities when complete auxiliary information is available. It has been demonstrated in this article that the method also provides a exible and ef cient way of using auxiliary information at the estimation stage under a two-phase sampling scheme. The proposed optimal calibration estimators for a second-order nite population quantity such as the population variance can perfectly be used to obtain more ef cient variance estimators for a rst-order nite population quantity such as the total or the distribution function. The gain in ef ciency can be appreciable even for a rst-phase sample with moderate size. Applications to measurement error problems or nonresponse are straightforward. The method has the potential to be applied to estimation problems under in nite populations where a two-phase sample is available. 8. References Carroll,R.J.,Ruppert,D.,and Stefanski,L.A. 1995). Measurement Errors in Non-Linear Models. Chapman & Hall. Chen,J. and Wu,C. 2002). Estimation of Distribution Function and Quantiles Using the Model-calibrated Pseudo Empirical Likelihood Method. Statistica Sinica,12, 1223±1239. Chen,J.,Sitter,R.R.,and Wu,C. 2002). Using Empirical Likelihood Method to Obtain Range Restricted Weights in Regression Estimators for Surveys. Biometrika,89, 230±237. Cochran,W.G. 1977). Sampling Techniques 3rd ed.). Wiley,New York. Deville,J.C. and SaÈrndal,C.E. 1992). Calibration Estimators in Survey Sampling. Journal of the American Statistical Association,87,376±382. Luan,Y. 2001). Model-Calibration Approach under Two-Phase Sampling. Unpublished master's essay,department of Statistics and Actuarial Science,University of Waterloo, Canada. Sitter,R.R. 1997). Variance Estimation for the Regression Estimator in Two-Phase Sampling. Journal of the American Statistical Association,92,780±787. SaÈrndal,C.E.,Swensson,B.,and Wretman,J. 1992). Model Assisted Survey Sampling. Springer,New York. Wu,C. 2002). Optimal Calibration Estimators in Survey Sampling. Working paper Department of Statistics and Actuarial Science,University of Waterloo, Canada.

13 Wu and Luan: Optimal Calibration Estimators Under Two-Phase Sampling 131 Wu, C. and Sitter, R.R. 2001). A Model-calibration Approach to Using Complete Auxiliary Information from Survey Data. Journal of the American Statistical Association, 96, 185±193. Received March 2002 Revised November 2002

Bias Correction in the Balanced-half-sample Method if the Number of Sampled Units in Some Strata Is Odd

Journal of Of cial Statistics, Vol. 14, No. 2, 1998, pp. 181±188 Bias Correction in the Balanced-half-sample Method if the Number of Sampled Units in Some Strata Is Odd Ger T. Slootbee 1 The balanced-half-sample