BIAS-ROBUSTNESS AND EFFICIENCY OF MODEL-BASED INFERENCE IN SURVEY SAMPLING

Size: px

Start display at page:

Download "BIAS-ROBUSTNESS AND EFFICIENCY OF MODEL-BASED INFERENCE IN SURVEY SAMPLING"

Spencer Hart
5 years ago
Views:

1 Statistica Sinica 22 (2012), doi: BIAS-ROBUSTNESS AND EFFICIENCY OF MODEL-BASED INFERENCE IN SURVEY SAMPLING Desislava Nedyalova and Yves Tillé University of Neuchâtel Abstract: In model-based inference, the selection of balanced samples has been considered to give protection against misspecification of the model. A recent development in finite population sampling is that balanced samples can be randomly selected. There are several possible strategies that use balanced samples. We give a definition of balanced sample that embodies overbalanced, mean-balanced, and π-balanced samples, and we derive strategies in order to equalize a d-weighted estimator with the best linear unbiased estimator. We show the value of selecting a balanced sample with inclusion probabilities proportional to the standard deviations of the errors with the Horvitz-Thompson estimator. This is a strategy that is design-robust and efficient. We show its superiority compared to other strategies that use balanced samples in the model-based framewor. In particular, we show that this strategy is preferable to the use of overbalanced samples in the polynomial model. The problem of bias-robustness is also discussed, and we show how overspecifying the model can protect against misspecification. Key words and phrases: Balanced sampling, finite population sampling, polynomial model, ratio model, robust estimation. 1. Introduction The principal difference between the model-based and the classical designbased approach for estimating finite population totals lies in the source of randomness they use (Särndal (1978)). In design-based sampling, the inference is based on the stochastic structure induced by the sampling design. In the modelbased, or prediction approach, the inference depends on the validity of the model used to describe the data. In this case, the randomness is due to the population model and not to the sampling design. The model-based approach was developed, among others, by Royall (1976, 1992), Royall and Cumberland (1981), and Chambers (1996). When the data are assumed to follow a linear model, Royall (1976) proposed the use of the best linear unbiased predictor. The model-based approach has been criticized due to the fact that it may lead to severe bias if the model assumptions are violated. In contrast to model-based inference, design-based inference is design-robust by

2 778 DESISLAVA NEDYALKOVA AND YVES TILLÉ definition. Brewer and Särndal (1983) point out that, since the inference is not based on a model, there is no need to worry about model misspecification. Much of the wor in model-based research has been devoted to the construction of robust strategies. More specifically, in order to protect the inference against a misspecified model, Royall and Herson (1973a,b) and Scott, Brewer, and Ho (1978) point out the importance of balanced samples, where balance is achieved by equalizing the sample moments of the independent variables with those in the population. They came to the conclusion that the sample must be balanced, but not necessarily random. Another way to accomplish design-robustness in the model-based approach is to choose an appropriate sampling design. Since Deville and Tillé (2004) s paper, it is now possible to randomly select balanced samples using a procedure called the cube method. Nedyalova and Tillé (2008) have shown that under a random balanced sampling design, with inclusion probabilities proportional to the standard deviations of the errors of the model, and under certain conditions defined as fully explainable heteroscedasticity, the best linear unbiased estimator is the Horvitz-Thompson estimator. This is an optimal strategy that reconciles the two approaches. Scott, Brewer, and Ho (1978) and Royall and Herson (1973a) recommend the use of balanced samples in order to protect against a misspecification of the model, while Nedyalova and Tillé (2008) recommend using balanced sampling for minimizing the anticipated variance under a linear model. An important particular case is the polynomial model. Scott, Brewer, and Ho (1978) suggest using overbalanced samples with an ad-hoc weighting system. We investigate the different strategies leading to design-robust estimations under the model-based framewor, and consider the polynomial model, overbalancing and robustness under misspecification in light of Nedyalova and Tillé (2008). The notation and definitions are given in Section 2. The model-based framewor is briefly introduced in Section 3. In Section 4, we consider a large class of balanced designs, called d-balanced designs, that embody balanced, π- balanced, mean-balanced, and overbalanced samples. We also consider a class of weighted estimators, called d-weighted estimators, and discuss several strategies for which the Best Linear Unbiased (BLU) estimator is the d-estimator. These results generalize some results of Nedyalova and Tillé (2008) for the Horvitz- Thompson estimator. In Section 5, we give a definition of design-robustness and show that an appropriate strategy to protect against misspecification of the model consists of overspecifying the model and then balancing on the independent variables of the model. In Section 6, we revisit the polynomial model to show that the overbalancing strategy of Scott, Brewer, and Ho (1978) is suboptimal, and to give an alternative strategy that minimizes the anticipated variance.

3 BIAS-ROBUSTNESS AND EFFICIENCY IN MODEL-BASED INFERENCE 779 In Section 7, we draw some general conclusions on selecting a sample when one reasonably believes in a linear model but sees protection against misspecification. 2. Notation and Definitions Consider a population U of size N. Each unit of the population is identified by a label = 1,..., N. Suppose that a register is available, and that the values of p auxiliary variables are nown for each unit of the population. Let y be the value taen by the variable of interest y on the th unit of the population. The values y are unnown. We are interested in estimating the population total Y = y. The total Y is estimated by a sample s of size n, where s is a subset of U. A sample is only a subset of the population and is not necessarily randomly selected. A sampling design is a tool to randomly select a sample. It is defined by assigning to each sample s a probability p(s) of being selected. Let S denote the random sample such that Pr(S = s) = p(s). The inclusion probability π is then the probability that unit is selected in the sample. We denote by E p ( ) and var p ( ), respectively, the expectation and variance under the sampling design p( ), and by S the set of units of the population which are not in S. Definition 1. An estimator Ŷ is design-unbiased if E p(ŷ ) = Y. Definition 2. A sample s is d-balanced on a set of variables x = (x 1 x p ) if and only if x, d x = s where d 1,..., d,..., d N is a set of weights that do not depend on the sample s. The weights d 1,..., d,..., d N are positive values. An important example is given by the d = 1/π used in the Horvitz-Thompson estimator. When the sample is randomly selected, an inclusion probability can be assigned to each statistical unit. If π > 0, for all U, the Horvitz-Thompson estimator of Y, Ŷ π = y, π is design-unbiased, i.e. E p (Ŷπ) = Y. For a balanced sample, the inclusion probabilities are π = 1/d, and the procedure that randomly selects a balanced sample is called a balanced sampling design. According to Deville and Tillé (2004), a sampling design p( ) is balanced on the auxiliary variables x 1,..., x p if and only if x = x. (2.1) π

4 780 DESISLAVA NEDYALKOVA AND YVES TILLÉ Authors such as Cumberland and Royall (1981) and Kott (1986) would call this a π-balanced sampling, as opposed to a mean-balanced sampling defined by the equation 1 x = 1 x. n N We use the expression balanced sampling to denote a sampling design that satisfies (2.1) for one or more auxiliary variables, a mean-balanced sampling being a particular case of this balanced sampling when the sample is selected with inclusion probabilities n/n. If the population size is small, a balanced sampling design can be implemented by a linear program. For larger population sizes, the cube method may be used (see Deville and Tillé (2004) or Tillé (2006)). Whatever the algorithm used for selecting a balanced sample, an exact balanced sample cannot generally be found because it does not exist. It is however, always possible to select a sample that is almost balanced. We assume that the balancing error can be neglected. The selection of a balanced sample requires the use of a register that contains the values of the auxiliary variables for each unit of the population. 3. Model-based Strategy and BLU Estimator We assume that the population follows a linear model M, y = x β + ε, (3.1) where β is the vector of regression coefficients and ε is a vector of random variables ε such that E M (ε ) = 0, var M (ε ) = σ 2 ν 2, cov M(ε, ε l ) = 0 if l, Suppose ν, U, are nown. For simplicity, we scale them so that ν = N. Model (3.1) includes the possibility of heteroscedasticity. The heteroscedasticity of ν or ν 2 may or may not be proportional to some auxiliary variables included in x. Nedyalova and Tillé (2008) have shown the importance of using auxiliary variables whose linear combination is equal to ν or ν 2. Indeed, in this case, the optimal strategies for the model-based and design-based framewors are the same. An important and common hypothesis is that the random sample S and the errors ε of the model are independent. The symbols E M ( ), var M ( ), cov M ( ) denote, respectively, the expected value, the variance, and the covariance under M. Definition 3. An estimator Ŷ is model-unbiased if E M(Ŷ Y ) = 0.

5 BIAS-ROBUSTNESS AND EFFICIENCY IN MODEL-BASED INFERENCE 781 Definition 4. The model mean squared error of an estimator Ŷ is MSE M(Ŷ ) = E M (Ŷ Y ) 2. When Ŷ is model-unbiased, the model mean squared error is also called the error variance, for instance in Royall and Cumberland (1981). Definition 5 (Isai and Fuller (1982)). The anticipated mean squared error of an estimator Ŷ is MSE pm(ŷ ) = E pe M (Ŷ Y )2. When Ŷ is design-unbiased, the anticipated mean squared error is also called the anticipated variance. Royall (1976) showed that, in the framewor of model-based inference, the Best Linear Unbiased (BLU) estimator is Ŷ BLU = = y + S x β BLU = x β BLU + (y x β BLU ) x β BLU + e, (3.2) e = y x β BLU, (3.3) where β BLU = A 1 x y ν 2, A = x x ν 2. The error variance of the best linear unbiased estimator is E M (ŶBLU Y ) 2 = σ 2 x A 1 x l + ν 2. S l S S We refer to the usual definition given, for instance, in Háje (1959, 1981), Ramarishnan (1975), and Joshi (1979). Definition 6. A strategy is a pair (p( ), Ŷ ) consisting of a sampling design and an estimator. Strategy 1. Use the best linear unbiased estimator and choose a sample of size n that minimizes E M (ŶBLU Y ) 2. This is an optimal, purely unbiased strategy that is not design-robust because, in some cases, it can lead to the choice of a very extreme sample, as in the

6 782 DESISLAVA NEDYALKOVA AND YVES TILLÉ following example. Suppose the model has no intercept and only one regressor (see for instance, Royall and Herson (1973a)): y = x β + ε, (3.4) with var M (ε ) = σ 2 ν 2 and ν2 x. Then the BLU estimator is Ŷ BLU = x y, x which, in this case, is the ordinary ratio estimator ŶR. As ν 2 x and ν = N, we have that ν 2 = N 2 x ) 2. x ( The error variance of ŶR is given by the expression ( E M (ŶR Y ) 2 = σ 2 N 2 ) 2 S x ) x. x ( x Thus, in this case, the optimal purely model-based strategy consists of choosing the units with the n largest values of variable x (Royall (1970)), a very extreme sample. The strategy can be dangerous if the model is wrong. It is thus reasonable to opt for a strategy that guarantees correct estimation when the model is misspecified, and that leads to design-unbiased inference. For these reasons, Strategy 1 is rarely used (see Hansen, Madow, and Tepping (1983)). 4. Balanced Sample under a Linear Model In this section we generalize the concept of balanced samples. Our results are thus more general than those in Nedyalova and Tillé (2008). Consider the d-weighted estimator Ŷ d = d y = d x β BLU + d e, (4.1) where e is defined in (3.3). Under d-balanced sampling, Ŷd is model-unbiased and its error variance is E M (Ŷd Y ) 2 = σ 2 (d 1) 2 ν 2 + ν 2. (4.2) S By comparing (4.1) with (3.2), we generalize Result 7 of Nedyalova and Tillé (2008).

7 BIAS-ROBUSTNESS AND EFFICIENCY IN MODEL-BASED INFERENCE 783 Result 1. A sufficient condition that ŶBLU = Ŷd is that the sampling design is d-balanced on x, e (d 1) = 0. A particular case of Result 1 is given below. Corollary 1. A sufficient condition that ŶBLU = Ŷd is that the sampling design is d-balanced on x, there exists a vector λ such that λ x = ν 2 (d 1), for all U. Proof. If then λ x ν 2 (d 1) = 1, λ x e (d 1) = ν 2 (d 1) e (d 1) = Consider a strategy that meets the conditions of Result 1. Strategy 2. λ x ν 2 e = 0. Use a d-balanced sampling design on x, with d chosen so that d = (ν 2 + λ x )/ν 2, for all U, use the d-weighted estimator. A particular case of Strategy 2 was recommended by Scott, Brewer, and Ho (1978) in the case of a polynomial model. Since the conditions of Result 1 are met, we have that ŶBLU = Ŷd. The strategy judiciously chooses the d s to equalize the d-weighted estimator and the BLU estimator. For a sample size n, this strategy can be implemented by using the inclusion probabilities ν 2 π = ν 2 +. λ x The value of λ can be chosen freely subject to π = After some algebra, it is possible to show that E M (Ŷd Y ) 2 = σ 2 S ν 2 ν 2 + λ x = n. d ν 2 = σ2 ( S ν 2 + S λ x ).

8 784 DESISLAVA NEDYALKOVA AND YVES TILLÉ Moreover, E p E M (Ŷd Y ) 2 = σ 2 λ x. Nevertheless, we see below that this is not necessarily the best strategy. Eventually, consider a strategy proposed by Nedyalova and Tillé (2008) that also meets the conditions of Result 1. Strategy 3. Use a d-balanced sampling design on x, with d ν 1 where x is complemented so that there exist two vectors α and γ such that α x = ν 2 and γ x = ν, for all U, a fully explainable heteroscedasticity, use the d-weighted estimator. Strategy 3 was recommended by Nedyalova and Tillé (2008) because it minimizes the anticipated variance among the strategies that are design-unbiased (see also Fuller (2009, p.187)). It is thus better than Strategy in the class of design-unbiased strategies. With Strategy 3, we have ŶBLU = Ŷd. The condition of fully explainable heteroscedasticity can be obtained by adding ν 2 and d ν 2 to the list of balancing variables. Thus, ν 2, d ν 2, d ν 2 = d 2 ν2 = and the error variance of the d-weighted estimator given in (4.2) simplifies to E M (Ŷd Y ) 2 = σ 2 (d 1)ν 2. For a sample size n, we must tae d = N/(ν n), and π = 1/d. The error variance of the d-weighted estimator, given in (4.2), simplifies to E M (Ŷd Y ) 2 = σ ( ) 2 Nν (N n ν2 = σ 2 2 n ) ν Bias-robustness in Submodels A large part of the model-based inference is dedicated to the bias-robustness of the BLU estimator in the case of misspecification of the model. In this section, we propose a formalization of the question of bias-robustness and an analysis of the consequences of model misspecification. We assume that a model M is used to conceive the strategy, but that the true underlying model is M.

9 BIAS-ROBUSTNESS AND EFFICIENCY IN MODEL-BASED INFERENCE 785 Definition 7. A strategy is bias-robust for a model M if E M (Ŷ Y ) = 0. Consider a model and an alternative model M : y = x β + ε M : y = z γ + η, where E M (η ) = 0. There is no assumption on the covariance matrix of the vector of η. Definition 8. Model M is a submodel of M if there exists a matrix A such that Ax = z, not necessarily of full ran. The basic fact needed to show that a strategy is robust is the following. Result 2. The strategy that consists of using a d-balanced sampling design on x (with any vector of d ) and the d-weighted estimator is bias-robust for any submodel of M. Proof. Under model M, and using a d-balanced sample, Ŷ d = d y = d (z γ + η ) = d (A x γ + η ) = A x γ + d η. Thus, E M (Ŷd Y ) = E M ( d η η ) = 0. Result 2 encourages the statistician to overspecify the model, i.e., to introduce additional variables into model M in order to ensure that M is really a submodel of M. Indeed, if model M* is true, the variance under the model of the d-weighted estimator is the same irrespective of whether the sample is balanced on x or on z. An over-specification of the model does not increase the variance under the model of the estimator because the independent variables of the model are only used for balancing the sample. Moreover, the d-weighted estimator does not depend on an estimated coefficient that relies on the number of auxiliary variables. Suppose now that the model is misspecified, i.e., that model M is not a submodel of M. If the sampling design is d-balanced on the independent variables of model M, then ( ) E M (Ŷd Y ) = γ. d z z

10 786 DESISLAVA NEDYALKOVA AND YVES TILLÉ Let f = z γ x ϕ be the residuals of a linear regression of (z γ) on x and ϕ the regression coefficient vector. Then, if the sampling design is d-balanced on x, [ E M (Ŷd Y ) = d (x ϕ + f ) ] (x ϕ + f ) = d f f, (5.1) for any value ϕ, in particular when ϕ = ( x x var(η ) ) 1 x z γ var(η ). The model variance thus only depends on the residuals of the regression, i.e., on the part of the model that remains misspecified. If we do not possess information about the z, we select a random sample with inclusion probabilities π = 1/d, because in selecting the sample randomly, the expected value under the sampling design of (5.1) is zero. Moreover, we can expect that, under reasonable asymptotic assumptions, the quantity E M (Ŷd Y ) N = 1 N ( f ) f, π remains bounded in probability with respect to the sampling design when multiplied by n. The random selection of a sample and the use of a design-unbiased estimator give an ultimate bias protection in the case where it is not possible to overspecify the model. This guarantees a negligible bias under M when n is large. 6. Application to the Polynomial Model 6.1. Presentation of the model The polynomial model was studied, among others, by Royall and Herson (1973a), Scott, Brewer, and Ho (1978), and Valliant, Dorfman, and Royall (2000). The model is J y = δ j β j x j + ε, (6.1) j=0 where x is the only independent variable, β j is the jth regression coefficient, δ j is equal to 1 or 0 as the term β j x appears or not in the regression, E M(ε ) = 0, var M (ε ) = σ 2 ν 2, and cov M(ε, ε l ) = 0, when l. We assume ν = N.

11 BIAS-ROBUSTNESS AND EFFICIENCY IN MODEL-BASED INFERENCE 787 Now, from Result 2, for any set of vectors of d, the d-weighted estimator is bias-robust under any submodel of (6.1) provided d x j = This implies several results on the polynomial model A first suboptimal strategy Let S (J) be a particular sample for which S x j = x S x j, for j = 0,..., J. (6.2) x j+1 /ν 2 x 2 /ν2, for j = 0,..., J. (6.3) With a sample satisfying (6.3), Scott, Brewer, and Ho (1978) showed that the estimator Ŷ 0 = y + x y x /ν 2, S x2 /ν2 is BLU for any polynomial Model (6.1), and any value of ν. This simple condition on the sample implies that the estimator is bias-robust for a large class of polynomial models. It can easily be shown that a sufficient condition for a sample to satisfy (6.3) is ( x j 1 + λx ) ν 2 = x j, for j = 0,..., J, (6.4) where λ is a scalar that does not depend on j. Equality (6.4) can be satisfied by using Strategy 2. In this particular case, we have π = 1/d and d = 1 + λx /ν 2. Strategy 2. (for polynomial model) Select a balanced sample such that with the unequal inclusion probabilities x j = x j π, j = 0,..., J, π = λx /ν 2, U.

12 788 DESISLAVA NEDYALKOVA AND YVES TILLÉ The constant λ is chosen as a function of the desired sample size, where π = λx /ν 2 = n. (6.5) The solution of (6.5) in λ is unique. Strategy 2 is recommended by Scott, Brewer, and Ho (1978). The inclusion probabilities are, however, chosen so that the BLU estimator is equal to the Horvitz-Thompson estimator and is thus not optimized. Two particular cases of (6.3) are as follows. (a) When ν 2 x, under the condition ν = N, In this case, (6.3) reduces to S xj S x ν 2 = = ( xj x N 2 x ) 2. x Thus, the sample should satisfy the condition 1 n x j = 1 N n x j = 1 N S, for j = 0,..., J. x j, for j = 0,..., J. Royall and Herson (1973a) call such samples balanced. In this case, Ŷ0 reduces to the ordinary ratio estimator Ŷ R = x y x. (b) When ν 2 x2, it is easily shown that ν2 = N 2 x 2 / ( x ) 2. In this case, (6.3) reduces to x j 1 n = S xj S x, for j = 0,..., J. The sample S (J) is called overbalanced (Scott, Brewer, and Ho (1978)), and Ŷ 0 reduces to Ŷ OB = ( ) 1 y y + x. n x S

13 BIAS-ROBUSTNESS AND EFFICIENCY IN MODEL-BASED INFERENCE An alternative strategy for the polynomial model Strategy 3. (for polynomial model) Use inclusion probabilities that are proportional to ν, subject to π = n, 0 π 1. Select a balanced sample according to the balancing equations Use the Horvitz-Thompson estimator. x j = x j π, j = 0,..., J, (6.6) ν 2 = ν 2 π, ν = ν. (6.7) π Note that, since π nν /N, (6.7) becomes N n 1 = N and only means that the sample size must be fixed. Nedyalova and Tillé (2008) proved that this strategy minimizes the anticipated variances in the class of design-unbiased strategies. Strategy 3 is thus better than Strategy 2 in the sense that its anticipated variance is always smaller. With Strategy 3, the inclusion probabilities are chosen to minimize the variance, while with Strategy 2 the inclusion probabilities are chosen to meet the technical condition given in Result 1. Two particular cases are as follows. (a) When ν 2 x with ν = N, then ν 2 = ( N 2 x ) 2. x As π ν with π = n, it follows that π = (n/n)ν. In this case, if the sample has a fixed sample size and is balanced on x, it is automatically balanced on ν and ν 2. (b) When ν 2 x2, with ν = N, then ν 2 = N 2 x 2 ( x ) 2. Here too, we have π = (n/n)ν. In this case, if the sample has a fixed sample size and is balanced on x 2, it is automatically balanced on ν and ν 2.

14 790 DESISLAVA NEDYALKOVA AND YVES TILLÉ 6.4. A particular case: the ratio model Consider again the model without intercept and only one regressor with ν 2 x, as at (3.4), a particular case of the polynomial model. We have seen that the BLU estimator under this model is the ordinary ratio estimator Ŷ R = x y x, with error variance ( E M (ŶR Y ) 2 = σ 2 N 2 ) 2 S x ) x. x ( x Here Strategy 2 is endorsed by Royall and Herson (1973a). Strategy 2. (for the ratio model) Select a mean-balanced sample of size n and use the ratio estimator. With a mean-balanced sample on x, satisfying the condition 1 n x = 1 N n x, S the ratio estimator reduces to the sample mean Ŷ R = x y x = 1 y. n Moreover, E M (ŶR Y ) 2 = σ 2 N 2 (N n) n ( x x ) 2 = E p E M (ŶR Y ) 2. Strategy 3. (for the ratio model) Select a balanced sample on x with inclusion probabilities π ν x and use the Horvitz-Thompson estimator. In order to show that Strategy 3 is better than Strategy 2, we compare the anticipated and model-variances, E p E M (Ŷ Y )2 and E M (Ŷ Y )2, of the ratio and Horvitz-Thompson estimators under (3.4). Nedyalova and Tillé (2008) have shown that, under Strategy 3, ( E M (Ŷπ Y ) 2 = E p E M (Ŷπ Y ) 2 = σ 2 N 2 n ) ν 2.

15 BIAS-ROBUSTNESS AND EFFICIENCY IN MODEL-BASED INFERENCE 791 After replacing ν with N x / x, we obtain [ E p E M (Ŷπ Y ) 2 = σ 2 N 2 n N 2 x ] ( ) 2. x With D = E p E M (ŶR Y ) 2 E p E M (Ŷπ Y ) 2 = E M (ŶR Y ) 2 E M (Ŷπ Y ) 2, simplification yields D = N 2 n [ N x ( x ) 2 1 Thus, Strategy 3 is better than Strategy 2 under (3.4). 7. Discussion ] 0. Under a linear model, the use of a purely model-based strategy (Strategy 1) can be dangerous. Balanced samples offer good protection against model misspecification. A d-balanced sampling design with the d-weighted estimator is a bias-robust strategy that assures protection against misspecification of the model. There however exist several ways of selecting balanced samples. The d-weighted estimator can be equivalent to the BLU estimator if some technical conditions are met. These conditions can be met by either choosing the ad hoc inclusion probabilities (Strategy 2) or by adding ν and ν 2 to the list of balancing variables and choosing the optimal inclusion probability (Strategy 3). For the polynomial model, Royall and Herson (1973a) and Scott, Brewer, and Ho (1978) used ad hoc inclusion probabilities. They showed that, in this case, an unweighted ratio estimator is BLU for a large class of polynomial models. Nevertheless, this strategy consists of choosing the inclusion probabilities in such a way that a technical property is satisfied. Strategy 3 is more appropriate, because the technical condition is met by adding two balancing variables and the inclusion probabilities can be chosen to minimize the anticipated variance. Strategy 2 is thus not admissible in the sense that it is always possible to obtain a smaller anticipated variance by selecting the units with inclusion probabilities proportional to the standard deviations of the errors of the model and using the Horvitz-Thompson estimator. Strategy 3 has the advantage of providing a bias-robust estimator, even if the model is misspecified. Indeed, in both cases, the estimator does not depend on the auxiliary variables and is always design-unbiased. In case of misspecification of the model or of measurement errors the estimators of totals thus remain unbiased but a part of the efficiency can

16 792 DESISLAVA NEDYALKOVA AND YVES TILLÉ be lost. The importance of this loss depends on the way in which the correlation between the independent variables of the model and the variables of interest is hindered by the measurement errors. Eventually, the best protection against a misspecification of the model consists of extending the list of balancing variables. Indeed, the addition of balancing variables that could be correlated to the variables of interest does not increase the model variance, but protects against bias under the model. If the model cannot really be specified, the ultimate protection against misspecification is always the random selection of the sample. We hope that these results give a general guideline on the way of planning a sample survey under a realistic linear model. In practice, designing a survey is often a more complex exercise for the following reasons: It is difficult to specify a model without nowing the variable(s) of interest. The sampling design is necessarily conceived before nowing the dependent variable. Thus the validation of the model is difficult. A survey is generally done for multiple purposes. Arbitration must thus be done between the variables of interest, the areas of interest, and the parameters to estimate, and thus between the models. Nonrepsonse implies that the set of respondents is never exactly balanced. Our results however give a general and ideal framewor on how to plan and estimate a total under a linear model. The result of Neyman (1934) concerning optimal stratification suffers the same problems. Optimality depends on the variable of interest, on the parameters to estimate, on the areas of interest. The variances within the strata are never really nown, thus they must be estimated using another survey, or must be derived from a hypothesis of the existence of a size effect. Indeed, optimal stratification is a particular case of our result. The generalization of our results to cluster sampling, two-stage, and two-phase sampling remains a challenging topic of research. Acnowledgement The authors than two anonymous reviewers and the Editor for their constructive comments that helped us improve the quality of this paper. This research is supported by the Swiss National Science Foundation (grant no. FN /1). References Brewer, K. and Särndal, C.-E. (1983). Six approaches to enumerative survey sampling. In Incomplete Data in Sample Surveys (Edited by W. G. Madow, I. Olin, and D. B. Rubin), Academic Press, N.Y.

17 BIAS-ROBUSTNESS AND EFFICIENCY IN MODEL-BASED INFERENCE 793 Chambers, R. L. (1996). Robust case-weighting for multipurpose establishment surveys. J. Off. Statist. 12, Cumberland, W. and Royall, R. (1981). Prediction models in unequal probability sampling. J. Roy. Statist. Soc. Ser. B 43, Deville, J.-C. and Tillé, Y. (2004). Efficient balanced sampling: The cube method. Biometria 91, Fuller, W. A. (2009). Sampling Statistics. John Wiley, New Jersey. Háje, J. (1959). Optimum strategy and other problems in probability sampling. Cosopis. Pest. Mat. 84, Háje, J. (1981). Sampling from a Finite Population. Marcel Deer, New Yor. Hansen, M. H., Madow, W. G. and Tepping, B. J. (1983). An Evaluation of model-dependent and probability-sampling inferences in sample surveys. J. Amer. Statist. Assoc. 78, Isai, C. and Fuller, W. A. (1982). Survey design under a regression population model. J. Amer. Statist. Assoc. 77, Joshi, V. M. (1979). The best strategy for estimating the mean of a finite population. Ann. Statist. 7, Kott, P. S. (1986). When a mean-of-ratios is the best linear unbiased estimator under a model. Amer. Statist. 40, Neyman, J. (1934). On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection, J. Roy. Statist. Soc. 97, Nedyalova, D. and Tillé, Y. (2008). Optimal sampling and estimation strategies under the linear model. Biometria 95, Ramarishnan, M. (1975). Choice of an optimum sampling strategy. Ann. Statist. 3, Royall, R. (1970). On finite population sampling theory under certain linear regression models. Biometria 57, Royall, R. (1976). The linear least squares prediction approach to two-stage sampling. J. Amer. Statist. Assoc. 71, Royall, R. (1992). Robustness and optimal design under prediction models for finite populations. Surv. Methodol. 18, Royall, R. and Cumberland, W. (1981). The finite population linear regression estimator and estimators of its variance. An empirical study. J. Amer. Stat. Assoc. 76, Royall, R. and Herson, J. (1973a). Robust estimation in finite populations I. J. Amer. Statist. Assoc. 68, Royall, R. and Herson, J. (1973b). Robust estimation in finite populations II: Stratification on a size variable. J. Amer. Statist. Assoc. 68, Särndal, C.-E. (1978). Design-based and model-based inference in survey sampling. Scand. J. Statist. 5, Scott, A., Brewer, K. and Ho, E. (1978). Finite population sampling and robust estimation. J. Amer. Statist. Assoc. 73, Tillé, Y. (2006). Sampling Algorithms. Springer-Verlag, New Yor. Valliant, R., Dorfman, A. and Royall, R. (2000). Finite Population Sampling and Inference: A Prediction Approach. Wiley, New Yor.

18 794 DESISLAVA NEDYALKOVA AND YVES TILLÉ Institute of statistics, University of Neuchâtel, Pierre à Mazel 7, 2000 Neuchâtel, Switzerland. Institute of statistics, University of Neuchâtel, Pierre à Mazel 7, 2000 Neuchâtel, Switzerland. (Received October 2010; accepted may 2011)

INSTRUMENTAL-VARIABLE CALIBRATION ESTIMATION IN SURVEY SAMPLING

Statistica Sinica 24 (2014), 1001-1015 doi:http://dx.doi.org/10.5705/ss.2013.038 INSTRUMENTAL-VARIABLE CALIBRATION ESTIMATION IN SURVEY SAMPLING Seunghwan Park and Jae Kwang Kim Seoul National Univeristy