Estimation of regression coefficients with unequal probability samples


Retrospective Theses and Dissertations, Iowa State University Capstones, Theses and Dissertations, 2007

Estimation of regression coefficients with unequal probability samples

Yu Y. Wu, Iowa State University

Recommended Citation: Wu, Yu Y., "Estimation of regression coefficients with unequal probability samples" (2007). Retrospective Theses and Dissertations.

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Retrospective Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository.

Estimation of regression coefficients with unequal probability samples

by

Yu Y. Wu

A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Major: Statistics

Program of Study Committee:
Wayne A. Fuller, Major Professor
Jean D. Opsomer
Tapabrata Maiti
Song X. Chen
Wolfgang Kliemann

Iowa State University
Ames, Iowa
2007

Copyright © Yu Y. Wu. All rights reserved.

UMI Number:

UMI Microform. Copyright 2007 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI

To my family

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS

CHAPTER 1 Introduction
  1.1 Overview
  1.2 Objectives
  1.3 Modeling Framework
  1.4 Review of Estimation Approaches
    Ordinary Least Squares Estimator
    Probability Weighted Estimator
    Pfeffermann-Sverchkov Estimator
    Instrumental Variable Estimator
    A Test for Importance of Weights
    Preliminary Testing Procedures

CHAPTER 2 Design Consistent Estimation
  2.1 H Estimator
    Motivation
    Central Limit Theorem
  2.2 Pfeffermann-Sverchkov Estimator

CHAPTER 3 Instrumental Variable Estimation
  Introduction
  Instrumental Variables for Weighted Samples
  Central Limit Theorem
  A Test for Endogeneity

CHAPTER 4 Preliminary Testing Procedure
  Pretest Procedure
  Example
  Test for Importance of Weights
  Instrumental Variable Pretest Procedure

CHAPTER 5 Simulation Design
  Monte Carlo Study Set-up
  Estimators for the Population Parameter
  Pfeffermann-Sverchkov Estimator
  Instrumental Variable Estimator
  Preliminary Testing Procedures
  Two-step Preliminary Testing Procedure
  Simulation Results
  Case One Study
  Case Two Study
  Case Three Study
  Concluding Remarks
  A Sample Example

CHAPTER 6 Summary and Conclusion

BIBLIOGRAPHY

LIST OF TABLES

Table 5.1  Monte Carlo Correlations for corr_1 = corr(p_i, x_i) and corr_2 = corr(p_i, e_i) (10,000 samples)
Table 5.2  Monte Carlo Means for E_1 = E{p_i x_i}, E_2 = E{p_i x_i²} and E_3 = E{p_i e_i} (10,000 samples)
Table 5.3  Monte Carlo Bias Ratio Bias(β̂_0)/[V{β̂_0}]^{1/2} for estimators of β_0 (10,000 samples)
Table 5.4  Monte Carlo Bias Ratio Bias(β̂_1)/[V{β̂_1}]^{1/2} for estimators of β_1 (10,000 samples)
Table 5.5  Monte Carlo Mean Squared Error (×1000) for estimators of β_0 (10,000 samples)
Table 5.6  Monte Carlo Mean Squared Error (×1000) for estimators of β_1 (10,000 samples)
Table 5.7  Monte Carlo Mean Squared Error Ratio MSE(β̂_PW,0)/MSE(β̂_PS,0) for estimators of β_0 (10,000 samples)
Table 5.8  Monte Carlo Mean Squared Error Ratio MSE(β̂_PW,1)/MSE(β̂_PS,1) for estimators of β_1 (10,000 samples)
Table 5.9  Monte Carlo Mean Squared Error (×1000) for estimators of β_0 (10,000 samples)
Table 5.10 Monte Carlo Mean Squared Error (×1000) for estimators of β_1 (10,000 samples)
Table 5.11 Monte Carlo Probability that |t_{β̂_0}| > t_{.025} (10,000 samples)
Table 5.12 Monte Carlo Probability that |t_{β̂_1}| > t_{.025} (10,000 samples)
Table 5.13 Monte Carlo Mean Squared Error (×1000) for estimators of β_0 (10,000 samples)
Table 5.14 Monte Carlo Mean Squared Error (×1000) for estimators of β_1 (10,000 samples)
Table 5.15 Monte Carlo Probability that |t_{β̂_0}| > t_{.025} (10,000 samples)
Table 5.16 Monte Carlo Probability that |t_{β̂_1}| > t_{.025} (10,000 samples)
Table 5.17 Example Sample Data

LIST OF FIGURES

Figure 4.1 Plot of MSE(μ̂) vs. μ
Figure 5.1 Plot of r vs. x
Figure 5.2 Plot of MSEs relative to that of β̂_PS,
Figure 5.3 Plot of MSEs relative to that of β̂_PS,
Figure 5.4 Plot of MSEs relative to that of β̂_IV0,
Figure 5.5 Plot of MSEs relative to that of β̂_IV0,

ACKNOWLEDGEMENTS

This work was supported by Cooperative Agreement No. 68-3A between the USDA Natural Resources Conservation Service and the Center for Survey Statistics and Methodology at Iowa State University.

I would like to thank my advisor, Dr. Wayne A. Fuller, for guiding me through this study and for teaching me new statistical ideas and issues at each meeting. I would also like to thank my committee members, Dr. Tapabrata Maiti, Dr. Jean Opsomer, Dr. Song X. Chen, Dr. Wolfgang Kliemann and Dr. Domenico D'Alessandro. I am very thankful to my fellow survey graduate students for providing feedback and encouragement.

CHAPTER 1 Introduction

1.1 Overview

In a simple random sample, an unbiased estimator of the population regression coefficient is the ordinary least squares (OLS) estimator, and an estimator of its variance is easy to calculate. In many surveys, however, the elements enter the sample with unequal probabilities. In these cases, the sampling weights, commonly the inverses of the selection probabilities, can be used to construct the probability weighted (PW) estimator. When the weights are related to the values of the response variable after conditioning on the independent variables in the model, the sampling process is informative, and the model holding for the sample data differs from the model holding in the population. When the selection probabilities are related to the error terms, use of the OLS estimator can yield large biases. In complex analyses such as regression, the weighted estimator requires a more complicated calculation and often has a larger variance than the unweighted version of the estimator. The OLS and PW estimators are straightforward procedures, but under complex sampling designs neither always performs well.

1.2 Objectives

One objective is to develop consistent weighted estimators that are more efficient than the PW estimator under complex sampling designs. The alternative estimators are

based on a superpopulation model with error variances determined by values of a covariate. The procedures include a design consistent estimator based on estimated variances, the Pfeffermann-Sverchkov estimator, and an instrumental variable estimator.

We will construct a testing procedure for the importance of weights and discuss an estimation strategy. If the test statistic is not significant, the unweighted estimator is used. When the testing procedure indicates that a weighted analysis is preferred, we use a consistent weighted estimator that is more efficient than the PW estimator. Preliminary testing (pretest) procedures are procedures in which a test of a model assumption is used to decide between two estimation procedures. We will develop pretest procedures to obtain a compromise between the unweighted estimator and the weighted estimator.

This thesis is organized as follows. In Section 1.3 the regression models are presented. In Section 1.4 we briefly review two common estimators, introduce some alternative estimation procedures, and describe a test for the importance of weights and a pretest procedure. In Chapter 2 we discuss two proposed regression estimators in detail: in Section 2.1 we describe a design consistent estimator based on estimated variances and give some of its limiting properties, and in Section 2.2 we describe the Pfeffermann-Sverchkov estimator. In Chapter 3 we introduce instrumental variable estimators, describe some limiting properties, and describe a test for endogeneity for the instrumental variable procedure. In Chapter 4 pretest procedures are discussed in detail: we describe a pretest procedure based on the test for the importance of weights and a pretest procedure based on the test for endogeneity in the instrumental variable procedure. Chapter 5 contains the details of an example based on simulated data and a Monte Carlo simulation study designed to illustrate the performance of the alternative estimators and test statistics, and to compare the alternative estimators with the OLS and PW estimators described in Section 1.4. The main findings of this study are discussed in Chapter 6.

1.3 Modeling Framework

Survey data can be viewed as the output of two random processes: the process generating the values of the finite population from a superpopulation, known as the superpopulation model, and the process selecting a sample from the finite population, referred to as the sample selection mechanism. See Pfeffermann et al. (1998).

We assume the finite population to be generated by a random process, called the superpopulation. We will use script F to denote the finite population, U to denote the set of indices of the finite population, and A to denote the set of indices of the sample. We assume that there is a function p(·) such that p(A) gives the probability of selecting sample A from U.

Suppose the superpopulation generates an infinite sequence of y values, y_1, y_2, y_3, ..., where y_k = (y_k, x_k')' is the value tied to the k-th element. Thus {y_1, y_2, ...} is a sequence of iid (μ, Σ) random variables. Let θ be a superpopulation parameter. Consider a sequence of populations U_1, U_2, U_3, ..., where U_N consists of N elements from the infinite sequence of elements, that is, U_N = {1, 2, ..., N}. Let θ_N be an estimator of θ based on U_N. For each population U_N, a sample A_N of size n_N is selected. We assume that n_1 < n_2 < n_3 < ..., so that n_N → ∞ as N → ∞. Let θ̂ be a finite sample estimator of θ_N, based on the observed y_k values for k ∈ A_N.

Consider a regression model relating y_i to x_i,

    y_i = x_i'β + e_i,    (1.1)

where the e_i are independent (0, σ²) random variables independent of x_j for all i and j. The model for the finite population can be written as

    y_N = X_N β + e_N,    e_N ~ (0, I_N σ²),    (1.2)

where y_N is the N-dimensional vector of values of the dependent variable, X_N is the N × k matrix of values of the explanatory variables, and the error vector e_N is an N-dimensional vector independent of X_N.

Assume a simple random sample (SRS) of size n is selected from the finite population. Then we can write the model for the sample as

    y = Xβ + e,    e ~ (0, Iσ²),    (1.3)

where y is the n-dimensional column vector of observations, X is the n × k matrix of observations on the explanatory variables, and e is the n-dimensional error vector. Because the sample is a simple random sample, e is independent of X.

Let E{· | F} be the average over all possible samples under the design for the particular finite population F; the conditional expectation E{· | F} is called the design expectation in survey sampling. Let E{·} = E{E(· | F)} be the overall average over all possible samples from all possible finite populations. Let V{· | F} be the analogous design variance and let V{·} be the analogous overall variance (Fuller, 2006, p. 17). Let E{· | X} be the conditional expectation given the explanatory variables X, and let V{· | X} be the analogous conditional variance.

We will use the concepts of order from real analysis. Let a_n be a sequence of real numbers and g_n a sequence of positive real numbers. We say a_n is of smaller order than g_n, and write a_n = o(g_n), if

    lim_{n→∞} g_n^{-1} a_n = 0.

We say a_n is at most of order g_n, and write a_n = O(g_n),

if there exists a real number M such that |g_n^{-1} a_n| ≤ M for all n.

For sequences of random variables, we will use the definitions of order in probability. First we give the concept of convergence in probability. The sequence of random variables X_n converges in probability to the random variable X, and we write plim X_n = X, if for every ε > 0,

    lim_{n→∞} P{|X_n − X| > ε} = 0.

We use O_p to denote "at most of order in probability." Let X_n be a sequence of random variables and g_n a sequence of positive real numbers. If for every fixed ε > 0 there exists a positive real number M_ε such that

    P{|X_n| ≥ M_ε g_n} ≤ ε

for all n, we say X_n is at most of order in probability g_n and write X_n = O_p(g_n). Let X_n be a k-dimensional random vector. If for every ε > 0 there exists a positive real number M_ε such that

    P{|X_jn| ≥ M_ε g_n} ≤ ε,    j = 1, 2, ..., k,

for all n, then we say X_n is at most of order in probability g_n and write X_n = O_p(g_n).

Let B_n be a k × r matrix of random variables with elements b_ijn. If for every ε > 0 there exists a positive real number M_ε such that

    P{|b_ijn| ≥ M_ε g_n} ≤ ε,    i = 1, 2, ..., k,  j = 1, 2, ..., r,

for all n, then we say B_n is at most of order in probability g_n and write B_n = O_p(g_n).

We use o_p to denote "of smaller order in probability." If plim g_n^{-1} X_n = 0, we say X_n is of smaller order in probability than g_n and write X_n = o_p(g_n). Let X_n be a k-dimensional random vector. If for every ε > 0 and δ > 0 there exists an N such that for all n > N,

    P{|X_jn| > ε g_n} < δ,    j = 1, 2, ..., k,

then we say X_n is of smaller order in probability than g_n and write X_n = o_p(g_n). Let B_n be a k × r matrix of random variables. If for every ε > 0 and δ > 0 there exists an N such that for all n > N,

    P{|b_ijn| > ε g_n} < δ,    i = 1, 2, ..., k,  j = 1, 2, ..., r,

then we say B_n is of smaller order in probability than g_n and write B_n = o_p(g_n).
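As a concrete illustration of the O_p(·) notation above (a simulation sketch, not part of the thesis), the mean of n iid (μ, σ²) draws satisfies X̄_n − μ = O_p(n^{-1/2}): the scaled error n^{1/2}(X̄_n − μ) stays bounded in probability as n grows. The parameter values below are illustrative choices.

```python
import numpy as np

def scaled_mean_errors(n, reps=2000, mu=5.0, sigma=2.0, seed=0):
    """Return sqrt(n) * (sample mean - mu) over `reps` Monte Carlo replicates."""
    rng = np.random.default_rng(seed)
    draws = rng.normal(mu, sigma, size=(reps, n))
    return np.sqrt(n) * (draws.mean(axis=1) - mu)

# The spread of sqrt(n) * (Xbar_n - mu) is stable in n (close to sigma = 2),
# while the unscaled error Xbar_n - mu shrinks at the rate n^{-1/2}.
for n in (100, 10_000):
    print(n, scaled_mean_errors(n).std())
```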

These definitions of order in probability follow Fuller (1996, p. 216). Given a sequence of finite populations F_N, the estimator θ̂ is said to be design consistent for the finite population parameter θ_N if, for any fixed ε > 0,

    lim_{N→∞} Pr{|θ̂ − θ_N| > ε | F_N} = 0.

This notation indicates that, given the fixed sequence of finite populations, the probability depends only on the sample design (Fuller, 2006, p. 42).

1.4 Review of Estimation Approaches

Ordinary Least Squares Estimator

On the basis of model (1.3), the ordinary least squares (OLS) estimator of β is

    β̂_ols = (Σ_{i∈A} x_i x_i')^{-1} Σ_{i∈A} x_i y_i = (X'X)^{-1} X'y,    (1.4)

with conditional covariance matrix

    V{β̂_ols | X} = (X'X)^{-1} σ².

An estimator of the conditional variance of β̂_ols is

    V̂{β̂_ols} = (X'X)^{-1} σ̂²_ols,    (1.5)

where σ̂²_ols = (n − k)^{-1} Σ_{i∈A} ê²_{i,ols}, k is the dimension of x_i, and ê_{i,ols} = y_i − x_i'β̂_ols. For fixed X, the OLS estimator is the best linear unbiased estimator of the superpopulation parameter β under model (1.3).

Assume now that a sample is drawn from the population using a sample design p(A) with associated π_i, where π_i is the inclusion probability, or the selection probability,

for element i. The inclusion probability is the probability that element i is selected into the sample. The conditional expected value of X'y is

    E{X'y | F} = E{Σ_{i∈A} x_i y_i | F} = Σ_{i∈U} x_i π_i y_i = Σ_{i∈U} x_i π_i x_i'β + Σ_{i∈U} x_i π_i e_i,

and the conditional expected value of X'X is

    E{X'X | F} = E{Σ_{i∈A} x_i x_i' | F} = Σ_{i∈U} x_i π_i x_i'.

Under the moment assumption of Theorem 1 in Section 2.1.2, it can be proven that

    β̂_ols − β_{Dπ,N} | F = (n^{-1} Σ_{i∈U} x_i π_i x_i')^{-1} [n^{-1} Σ_{i∈A} x_i (y_i − x_i'β_{Dπ,N})] + O_p(n^{-1}),    (1.6)

where

    β_{Dπ,N} = (X_N' D_{π,N} X_N)^{-1} X_N' D_{π,N} y_N

and D_{π,N} = diag(π_1, π_2, ..., π_N). Also,

    plim_{N→∞} β_{Dπ,N} = β.

The probability limit of the OLS estimator is the weighted regression coefficient for the superpopulation, where the weights are the selection probabilities. The approximate bias of the OLS estimator is zero if π_i x_i and e_i are independent. If π_i x_i and e_i are correlated, then E{X'e | F} ≠ 0 and the OLS estimator (1.4) is biased.

Probability Weighted Estimator

Consider a model for a sample of size n selected from the finite population (1.2),

    y = Xβ + e,    (1.7)

where y is the n-dimensional column vector of observations on y, X is the n × k matrix of observations on the explanatory variables, and e is the n-dimensional error vector. Assume that the sample is selected with unequal probabilities π_i. Under unequal probability sampling, a common procedure to account for possible sampling effects is the probability weighted (PW) estimator. The PW estimator, constructed with the inverses of the selection probabilities, is

    β̂_PW = (Σ_{i∈A} x_i π_i^{-1} x_i')^{-1} Σ_{i∈A} x_i π_i^{-1} y_i = (X'WX)^{-1} X'Wy,    (1.8)

where W = diag(π_1^{-1}, π_2^{-1}, ..., π_n^{-1}) =: diag(w_1, w_2, ..., w_n). We call w_i the sampling weight, the inverse of the selection probability π_i. The sampling weight w_i can be viewed as the number of units in the population represented by the sample observation y_i.

Under the moment assumption of Theorem 1 in Section 2.1.2, it can be proven that

    β̂_PW − β_N | F = (N^{-1} Σ_{i∈U} x_i x_i')^{-1} [N^{-1} Σ_{i∈A} x_i π_i^{-1} (y_i − x_i'β_N)] + O_p(n^{-1}),    (1.9)

where

    β_N = (X_N'X_N)^{-1} X_N'y_N.

The coefficient β_N is the ordinary least squares regression coefficient of y_N on X_N in the population. It can also be proven that

    β̂_PW − β_N | F = O_p(n^{-1/2}),    (1.10)

so β̂_PW is design consistent for the finite population parameter β_N. Under the model,

    E{n^{-1} Σ_{i∈A} x_i π_i^{-1} e_i} = E{E{n^{-1} Σ_{i∈A} x_i π_i^{-1} e_i | F}} = E{n^{-1} Σ_{i∈U} x_i e_i} = 0,

and V{β_N − β} = O(N^{-1}). Thus the probability weighted regression coefficient β̂_PW is a consistent estimator of the superpopulation parameter β.

If e is independent of π, where π = (π_1, π_2, ..., π_n)', e = (e_1, e_2, ..., e_n)' and e ~ (0, Iσ²), then the conditional covariance matrix of β̂_PW − β under model (1.7) is

    V{(β̂_PW − β) | X} = (X'WX)^{-1} X'WWX (X'WX)^{-1} σ².    (1.11)

If the selection is such that y_i π_i^{-1} is uncorrelated with y_j π_j^{-1} for i ≠ j, an estimated covariance matrix of β̂_PW is

    V̂{β̂_PW} = (X'WX)^{-1} X'W D̂_{ee,PW} WX (X'WX)^{-1},    (1.12)

where D̂_{ee,PW} = diag(ê²_{1,PW}, ê²_{2,PW}, ..., ê²_{n,PW}) and ê_{i,PW} = y_i − x_i'β̂_PW. In most cases the variance of the PW estimator is larger than the variance of the OLS estimator.

Pfeffermann-Sverchkov Estimator

When the sample selection probabilities are correlated with the model response variables after conditioning on the auxiliary variables, the sampling mechanism is called informative. Sugden and Smith (1984) examine the role of the sample selection mechanism in a model-based approach to finite population inference. Krieger and Pfeffermann (1992) and Pfeffermann (1993) discuss notions of informative sampling design based on the distribution of the population measurements and the distribution of the sample measurements.

Pfeffermann (1993) provides an example of an informative design. Suppose (y_i, x_i) are independent draws from a bivariate normal distribution N_2(μ, Σ). Suppose {(y_i, x_i), i = 1, ..., n} are observed for a sample of size n, and we want to estimate the population mean of y_i, μ_y = E{y}. If the sample is selected by simple random sampling with

replacement, then the simple sample mean ȳ is unbiased for μ_y, and the sample selection scheme can be ignored in the inference process. However, suppose the sample is selected with probabilities proportional to x_i with replacement, such that at each draw k = 1, ..., n,

    P(i ∈ s) ∝ x_i / Σ_{i=1}^N x_i.

If Corr(Y, X) > 0 and P(y_i > μ_y | i ∈ s) > 0.5, then the distribution of the y_i's in the sample differs from that in the population, and E{ȳ} > μ_y. Ignoring the sampling scheme and estimating μ_y by ȳ is misleading in this case. When the selection probabilities are related to the values of the response variable, the empirical sample distribution is not consistent with the distribution of the population measurements, and the selection effects need to be accounted for in the inference process. The OLS estimator, which ignores the sample selection process, can yield large biases.

Skinner (1994) proposes an approach for extracting the population model from models fitted to the sample data. He showed two important propositions about relationships between the population distribution and the sample distribution. The first proposition shows that, given the sampling weights, the conditional sample distributions of y_i and x_i are identical to the conditional population distributions of y_i and x_i. The second proposition shows that the population distribution of the sampling weights can be obtained by weighting the sample distribution of the sampling weights. Skinner presented a Monte Carlo study comparing the proposed procedure with the OLS estimator and the PW estimator in terms of empirical bias, variance and mean squared error.

Pfeffermann et al. (1998) propose a general method of inference for the population distribution under informative sampling that consists of approximating the parametric sample distribution. They showed how the sample distribution may be derived from the population distribution using the first order selection probabilities of the sample units.
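Pfeffermann's mean example is easy to reproduce by simulation. The sketch below (with illustrative population sizes and parameters, not values from the thesis) draws pps samples with replacement with probabilities proportional to x_i when Corr(Y, X) > 0: the unweighted mean ȳ overestimates μ_y, while a probability weighted mean of the Hájek form, using w_i = (n p_i)^{-1}, removes most of the selection effect.

```python
import numpy as np

rng = np.random.default_rng(7)
N, n, reps = 5000, 200, 300
mu_y = 10.0

unweighted, weighted = [], []
for _ in range(reps):
    x = rng.uniform(1.0, 3.0, size=N)
    y = mu_y + 2.0 * (x - x.mean()) + rng.normal(size=N)  # Corr(Y, X) > 0
    p = x / x.sum()                       # single-draw selection probabilities
    idx = rng.choice(N, size=n, p=p)      # pps sampling with replacement
    w = 1.0 / (n * p[idx])                # inverses of the selection probabilities
    unweighted.append(y[idx].mean())
    weighted.append(np.sum(w * y[idx]) / np.sum(w))

bias_unweighted = np.mean(unweighted) - mu_y   # clearly positive: E{ybar} > mu_y
bias_weighted = np.mean(weighted) - mu_y       # near zero
```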
Pfeffermann and Sverchkov (1999) propose two new classes of estimators for regression models fitted to survey data. The proposed estimators account for the effect of informative sampling schemes and are derived from relationships between the population distribution and the sample distribution. The first class consists of estimators obtained by extracting the sample distribution as a function of the population distribution and the sample selection probabilities, and applying maximum likelihood theory to the sample distribution. The second class consists of estimators obtained by using relationships between the moments of the two distributions. The basis for the second class of estimators is that the population regression E{y_i | x_i} can be obtained from the sample regressions of y_i w_i and of w_i on x_i. The proposed estimator of the regression coefficient β from the second class is called the Pfeffermann-Sverchkov (PS) estimator in Chapter 2.

Instrumental Variable Estimator

Under the superpopulation model

    y_i = x_i'β + e_i,    (1.13)

assume some members of x_i are not independent of e_i. The crucial assumption for consistency of the OLS estimator is that x_i is independent of e_i in the superpopulation. A variable that is independent of the error term is called an exogenous variable. An explanatory variable that is correlated with the error term is sometimes called an endogenous explanatory variable. It is known that the presence of errors of measurement in the explanatory variables and the presence of endogenous explanatory variables in the regression model make the OLS estimator inconsistent and biased. For such cases, additional information is needed to obtain consistent parameter estimators. Assume some additional variables, denoted by r_i, are available with the superpopulation properties

    E{r_i e_i} = 0    (1.14)

and

    |E{r_i x_i'}| ≠ 0,    (1.15)

where |C| denotes the determinant of the matrix C. Variables satisfying (1.14) and (1.15) are called superpopulation instrumental variables, or instruments. Thus, an instrumental variable (IV) must have two properties: (1) it must be uncorrelated with the error term of the structural equation; (2) it must be correlated with the endogenous explanatory variable. One method of instrumental variable estimation is called two-stage least squares (2SLS); for details see Wooldridge (2000).

The method of instrumental variables has been used for more than sixty years. In the 1940s the IV method was introduced for use in the errors in variables model; see Reiersøl (1941, 1945). Geary (1947) shows that in certain cases consistent estimators may be obtained by the use of instrumental variables. Durbin (1954) reviews the IV approach to the problem of finding a consistent estimator of the regression coefficient. Sargan (1958) applies the IV method to a more general case and discusses the effect of increasing the number of instrumental variables. Sargan's (1958) work and the instrumental variable character of two-stage least squares (2SLS) have made IV estimation widely used. The 2SLS method yields consistent estimates when one or more explanatory variables in a regression model are endogenous. The trade-off between bias and variance in the choice between the OLS estimator and the 2SLS estimator was considered in a Monte Carlo study by Summers (1965). Richardson and Wu (1971) analytically compare properties of the distribution functions of the OLS and 2SLS estimators of structural coefficients in a simultaneous equation model that includes two endogenous variables. They showed that the distribution function of the OLS estimator of the coefficient of the endogenous variable has the same form as that of the 2SLS estimator, and compared the biases and mean squared errors of the estimators.
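The 2SLS method can be sketched directly in numpy. In the illustrative design below (my construction, not a model from the thesis), the regressor x is endogenous through a shared shock u, the instrument r satisfies conditions (1.14) and (1.15), OLS is biased for the slope, and 2SLS recovers it.

```python
import numpy as np

def two_stage_least_squares(Z, X, y):
    """2SLS: regress X on the instruments Z (first stage), then regress y on the fitted X."""
    first_stage = np.linalg.solve(Z.T @ Z, Z.T @ X)  # coefficients of X on Z
    Xhat = Z @ first_stage                           # first stage fitted values
    return np.linalg.solve(Xhat.T @ X, Xhat.T @ y)   # second stage coefficients

rng = np.random.default_rng(3)
n = 5000
r = rng.normal(size=n)            # instrument: correlated with x, independent of e
u = rng.normal(size=n)            # shared shock that makes x endogenous
x = r + u + rng.normal(size=n)
e = u + rng.normal(size=n)        # Corr(x, e) > 0 through u
y = 2.0 + 3.0 * x + e             # beta = (2, 3)'

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), r])
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)    # slope biased upward (about +1/3)
beta_2sls = two_stage_least_squares(Z, X, y)    # slope close to the true value 3
```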
Feldstein (1974) suggests and evaluates alternative procedures for balancing the loss of efficiency in IV estimation against the potential gain of reduced bias. He considered two types of estimators: (1) a linear combination of the OLS and IV estimators, and (2) a method of choosing between the OLS and IV estimators on the basis of sample information. Carter

and Fuller (1980) compare a modified IV estimator with randomly weighted average estimators of the type considered by Huntsberger (1955) and Feldstein (1974). The Carter and Fuller study showed that the randomly weighted average estimators constructed from the IV estimator and the OLS estimator display the same type of behavior as the randomly weighted average estimators of two means studied by Mosteller (1948) and Huntsberger (1955). Aldrich (1993) presents a detailed study of the work of Reiersøl and Geary in the 1940s to explain the idea of instrumental variables.

A Test for Importance of Weights

In practice, it is often the case that not all the design variables are known for the whole population, or that there are too many variables for all to be incorporated in the analysis. Not including all the design variables does not necessarily imply that the inference is biased, and Sugden and Smith (1984) indicate that incorporating partial design information in the model can be sufficient for analytic inference about model parameters. A natural question arising from this topic is how to test whether the design can be ignored in estimation, given the available design information. The studies reviewed on this important aspect of the modeling process are mostly in the area of regression analysis. The ignorability of the design is tested by testing the significance of the difference between the OLS estimator β̂_ols and the weighted estimator β̂_PW under a working model that assumes the design is ignorable. DuMouchel and Duncan (1983) show that the difference between the weighted and unweighted estimates can be used as an aid in choosing the appropriate model, and hence the appropriate estimator. The test is based on the difference Δ̂ = β̂_PW − β̂_ols, with the null hypothesis H_0: E{Δ̂} = 0. The test statistic is

    λ = Δ̂'[V̂(Δ̂)]^{-1}Δ̂,    (1.16)

where V̂(Δ̂) is an estimator of the covariance matrix of Δ̂.
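A minimal numpy sketch of the statistic (1.16). The variance estimator for Δ̂ used here, a plain iid bootstrap over sample elements, is a placeholder assumption for illustration only; design-based estimators of V(Δ̂) are the subject of the literature reviewed in this section. With weights unrelated to the errors, λ behaves roughly like a χ² variable with k degrees of freedom, while weights related to the errors produce a large λ.

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares coefficients (X'WX)^{-1} X'Wy."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

def weights_importance_stat(X, y, w, n_boot=200, seed=0):
    """lambda = Dhat' [Vhat(Dhat)]^{-1} Dhat with Dhat = beta_PW - beta_OLS.
    Vhat(Dhat) comes from a simple iid bootstrap (an illustrative choice only)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    d_hat = wls(X, y, w) - wls(X, y, np.ones(n))
    boot = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        boot[b] = wls(X[idx], y[idx], w[idx]) - wls(X[idx], y[idx], np.ones(n))
    v_hat = np.cov(boot.T)
    return float(d_hat @ np.linalg.solve(v_hat, d_hat))

rng = np.random.default_rng(11)
n = 2000
x = rng.normal(size=n)
e = rng.normal(size=n)
y = 1.0 + 2.0 * x + e
X = np.column_stack([np.ones(n), x])

lam_null = weights_importance_stat(X, y, rng.uniform(1.0, 3.0, size=n))  # weights unrelated to e
lam_inf = weights_importance_stat(X, y, np.exp(0.5 * e))                 # weights related to e
```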

The use of λ of (1.16) for testing the importance of weights illustrates an important role for the sampling weights in the modeling process. The PW estimator is a design consistent estimator of the population parameter β_N. If the sampling design is ignorable, the OLS estimator is likewise consistent for β_N. However, when the ignorability conditions are not satisfied, the OLS estimator is no longer consistent for β_N and the two estimators converge to different limits.

Fuller (1984) considers the case of a cluster sample within strata and estimates the covariance matrix V(Δ̂) by estimating the corresponding randomization covariance matrix. The resulting test statistic has an approximate F distribution under H_0 with k and (n − 2k − L) degrees of freedom, where L is the number of strata. The use of the randomization distribution to estimate V(Δ̂) is more robust than DuMouchel and Duncan's approach because it does not depend on the regression model assumption.

Pfeffermann and Sverchkov (1999) suggest a formal test of sampling ignorability that uses the moment relationships between the population distribution and the sample distribution of the regression residuals. Let ε_i = y_i − E{y_i | x_i} denote the error term associated with unit i. Classical test procedures for comparing two distributions are not applicable, since no observations are available from the population distribution of the residuals. However, under general conditions the set of all moments of a distribution determines the distribution, provided the moments exist. Thus the null hypothesis is

    H_0: E{ε_i^k} = E_s{ε_i^k},    k = 1, 2, ...,

where E_s is the expectation under the sample distribution. The cumulative distribution function of the sample y_i's is defined by

    P{y_i ≤ b | i ∈ A} = F_s(b),    (1.17)

and

    E_s{y_i} = ∫ y_i dF_s.    (1.18)

Pfeffermann and Sverchkov showed the moment relationship between the sample and

population pdf's,

    E{u_i | v_i} = [E_s{w_i | v_i}]^{-1} E_s{u_i w_i | v_i},    (1.19)

for any pair of vector random variables (u_i, v_i), where w_i = π_i^{-1} is the sampling weight for unit i. By the moment relationship E{ε_i^k} = [E_s{w_i}]^{-1} E_s{ε_i^k w_i}, which is a special case of (1.19), an equivalent set of hypotheses is

    H_0k: Corr_s(ε_i^k, w_i) = 0,    k = 1, 2, ...,

where Corr_s is the correlation under the sample distribution. In their simulation study, Pfeffermann and Sverchkov use as the test statistic a standardized form of the Fisher transformation of the correlation coefficient,

    FT(k) = (1/2) log[(1 + r_k)/(1 − r_k)]    and    FTS(k) = FT(k)/ŜD(FT(k)),

where r_k is the empirical correlation Ĉorr(ε̂_i^k, w_i), ε̂_i^k = (y_i − x_i'β̂)^k, and ŜD(FT(k)) is the bootstrap standard deviation of FT(k). The test statistic has an asymptotic normal distribution with mean zero.

Preliminary Testing Procedures

The motivation for the preliminary testing (pretest) procedure is to accept bias in return for reduced variance. Pretest estimators are a class of estimators that trade a risk of bias for smaller variance. The pretest procedure is characterized by a test statistic T calculated from the data set, and the test determines the estimation method: if T is statistically significant, a given procedure is used to estimate the parameter; otherwise an alternative procedure is used. The general idea of using a pretest to determine an estimation procedure is discussed by Bancroft (1944), Mosteller (1948) and Huntsberger (1955). Bancroft (1944) proposes

pretest procedures for testing homogeneity of variances and for testing a regression coefficient. Bancroft's paper does not provide a clear-cut prescription of when or whether to pretest. Mosteller (1948) discusses a simple problem concerning the pooling of data and presents several ways of pooling data from two samples to estimate the mean of the population of one of them. Mosteller pointed out that if the difference between the true means can be thought of as normally distributed from sample to sample, pooling with unequal weights is preferable. A generalization of the sometimes-pool procedure for pooling two estimators based on a pretest was described by Huntsberger (1955), who compared the efficiencies of the generalized weighting procedure and of the sometimes-pool procedure for the special case where the estimators are normally distributed.

To formulate the sometimes-pool idea, we follow Huntsberger (1955). Suppose we have a random sample with two unknown parameters θ_1 and θ_2. Let θ̂_1 and θ̂_2 be estimators of θ_1 and θ_2. In general, when θ_1 = θ_2, a pooled estimator g(θ̂_1, θ̂_2) provides a better estimator of θ_1 than θ̂_1 alone. When it is unclear whether θ_1 is equal to θ_2, the pooled estimator may still provide some gains, but may lose when the two parameters are very different. Let the pretest statistic T be the statistic for testing the null hypothesis H_0: θ_1 = θ_2 against the alternative hypothesis H_a: θ_1 ≠ θ_2. An estimator of θ_1 can be formed using the function

    W(T) = φ(T)θ̂_1 + [1 − φ(T)]g(θ̂_1, θ̂_2),    (1.20)

where φ(T) is an indicator defined as

    φ(T) = 0 if T ∈ A_α,  1 if T ∈ R_α,    (1.21)

where A_α and R_α are the acceptance and rejection regions for the test of the null hypothesis θ_1 = θ_2 at significance level α.
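A small sketch of the sometimes-pool estimator (1.20)-(1.21) for the two-means case. The two-sample test, the normal critical value, and the sample-size pooling function g are common textbook choices assumed here for illustration; they are not the specific choices studied by Huntsberger (1955).

```python
import numpy as np

def sometimes_pool(sample1, sample2, critical_value=1.96):
    """Pretest estimator of theta_1: pool the two sample means unless the
    two-sample test of H0: theta_1 = theta_2 rejects at the given cutoff."""
    t1, t2 = sample1.mean(), sample2.mean()
    se_diff = np.sqrt(sample1.var(ddof=1) / len(sample1)
                      + sample2.var(ddof=1) / len(sample2))
    T = (t1 - t2) / se_diff
    if abs(T) >= critical_value:   # phi(T) = 1: T falls in the rejection region
        return t1                  # use thetahat_1 alone
    n1, n2 = len(sample1), len(sample2)
    return (n1 * t1 + n2 * t2) / (n1 + n2)   # phi(T) = 0: pooled estimator g

rng = np.random.default_rng(5)
est_same = sometimes_pool(rng.normal(0.0, 1.0, 400), rng.normal(0.0, 1.0, 400))
est_far = sometimes_pool(rng.normal(0.0, 1.0, 400), rng.normal(5.0, 1.0, 400))
# est_same is typically the pooled mean (near 0); est_far rejects pooling and
# stays near theta_1 = 0 instead of being dragged toward the second mean, 5.
```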

Rao (1966) arrives at a conclusion similar to Huntsberger's (1955). Under a probability proportional to size (pps) sampling design, he proposes two estimators: the Horvitz-Thompson (H-T) estimator, which is a weighted estimator, and an alternative estimator, which is unweighted and has the form of the simple sample mean. Rao states that the criterion for choosing between the two estimators is the correlation between the characteristic of interest $y$ and the selection probabilities. If a characteristic is poorly correlated with the selection probabilities under the pps sampling design, the alternative estimator may be used; for other characteristics, the weighted estimator should be used. When a characteristic is poorly correlated with the selection probabilities, the alternative estimator has smaller mean squared error than the H-T estimator and its bias is small relative to its standard error. Rao did not propose a test procedure for testing whether a characteristic of interest $y$ and the selection probabilities are correlated.

Bock, Yancey and Judge (1973) develop the properties of the pretest estimator for the general model and determine the characteristics of the risk function of the pretest estimator under the squared error loss criterion; the choice of the level of significance for the test is also discussed. Cohen (1974) gives necessary and sufficient conditions for procedures based on a preliminary test of significance to be admissible. As stated by Cohen (1974), the pretest procedure is a compromise between a Bayesian procedure and the usual procedure. Inference based on the pretest procedure requires, in general, less prior knowledge on the part of research workers than the use of Bayesian inference procedures. With limited knowledge, researchers tend to check the assumptions by using pretest procedures.
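Rao's criterion can be illustrated with a small Monte Carlo sketch under Poisson sampling; all names and numerical choices below are illustrative assumptions. When $y$ is unrelated to the selection probabilities, the unweighted sample mean is essentially unbiased and competitive with the H-T estimator; when $y$ is strongly related to them, the unweighted mean is biased because large units are oversampled.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
x = rng.uniform(1.0, 2.0, N)              # size measure driving selection
pi = 0.05 * x / x.mean()                  # pps-style inclusion probabilities

def both_estimates(y):
    s = rng.random(N) < pi                # Poisson sample indicator
    ht = np.sum(y[s] / pi[s]) / N         # Horvitz-Thompson mean estimator
    unweighted = y[s].mean()              # simple (unweighted) sample mean
    return ht, unweighted

y_uncorr = rng.normal(10.0, 1.0, N)       # y unrelated to the probabilities
y_corr = 5.0 * x + rng.normal(0.0, 0.1, N)  # y strongly related to them

ht_u, unw_u = both_estimates(y_uncorr)
ht_c, unw_c = both_estimates(y_corr)
# ht_u and unw_u are both close to y_uncorr.mean(); unw_c overshoots
# y_corr.mean() because large-x (hence large-y) units are oversampled.
```

In the uncorrelated case the unweighted mean is also noticeably less variable than the H-T estimator here, which is Rao's point.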
Wallace (1977) reviews the properties of various pretest estimators both in the general case (multiple restrictions) and in the Bancroft case (a single restriction) and studies optimal pretest critical values. Bancroft and Han (1977) compile a bibliography of references on pretest procedures. Grossman (1986) suggests a compromise sampling strategy between a large supposedly

unpoolable sample and a smaller supposedly poolable sample based on pretest procedures in social experiments. Gregoire, Arabatzis and Reynolds (1992) provide a pretest procedure for the intercept in a simple linear regression model. Magnus and Durbin (1999) use a pretest procedure for estimation of the regression coefficients of interest when the other regression coefficients are a vector of nuisance parameters.

In the context of pretest procedures based on the IV estimator, Sargan (1958) suggests obtaining the confidence interval for the IV estimator and notes that if the OLS estimator lies outside this interval, it is probably significantly biased. Although Sargan proposed no formal test, his remarks suggest the following rule: if the absolute difference between the OLS estimator and the IV estimator is greater than the standard error of the IV estimator, infer that the OLS estimator is biased and use the IV estimator; if not, use the OLS estimator. This procedure is a pretest estimator.
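Sargan's informal rule can be sketched as follows for a single endogenous regressor; the data-generating values, the use of a homoskedastic IV standard error, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
z = rng.normal(size=n)                 # instrument
u = rng.normal(size=n)                 # shared disturbance inducing endogeneity
x = z + u + rng.normal(size=n)         # regressor correlated with the error
y = 2.0 * x + u + rng.normal(size=n)   # true slope is 2.0

b_ols = (x @ y) / (x @ x)              # OLS slope (no intercept), biased here
b_iv = (z @ y) / (z @ x)               # IV slope using z as the instrument

# Homoskedastic standard error of the IV slope (a textbook sketch)
resid = y - b_iv * x
sigma = np.sqrt((resid @ resid) / (n - 1))
se_iv = sigma * np.sqrt(z @ z) / abs(z @ x)

# Sargan-style pretest: keep the efficient OLS estimate unless it
# lies more than one IV standard error away from the IV estimate
b_pretest = b_iv if abs(b_ols - b_iv) > se_iv else b_ols
```

In this design OLS converges to roughly $2 + 1/3$, so the gap between the two estimates far exceeds the IV standard error and the rule selects the IV estimate.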

CHAPTER 2 Design Consistent Estimation

2.1 H Estimator

Motivation

Assume that the ordinary least squares (OLS) estimator is biased. In such cases it is necessary to incorporate the sampling weights into the analysis. One approach is to use the probability weighted (PW) estimator. The PW estimator is consistent for $\beta$, but it can be inefficient when the unequal selection probabilities are not proportional to the conditional variance of $y$. An alternative approach to constructing design consistent estimators that are more efficient under the model (1.7) than the PW estimator is to find procedures that scale the sampling weights toward one or toward the conditional variance of $y$. Consider an estimator of the form
\[
\hat\beta_H = \left(\sum_{i \in A} x_i \pi_i^{-1} h_i x_i'\right)^{-1} \sum_{i \in A} x_i \pi_i^{-1} h_i y_i
= (X'WHX)^{-1} X'WHy, \tag{2.1}
\]
where $H$ is a diagonal matrix with diagonal elements $h_i$, $h_i$ is defined for all $i \in U$, and $h_i$ is independent of $e_i$. We call the estimator $\hat\beta_H$ the H estimator.

Central Limit Theorem

In this section, we show that the H estimator is consistent for the population parameter under mild assumptions and has a limiting normal distribution. We begin with

two useful lemmas. Lemma 1 is adapted from Schenker and Welsh (1988); a proof of Lemma 1 is given by Legg (2006).

Lemma 1. Let $\{V_n\}$ be a sequence of random variables in $R^k$ such that, for some function $h$, as $n \to \infty$,
\[
h(V_1, \ldots, V_n) \xrightarrow{L} \Gamma, \tag{2.2}
\]
where $\Gamma$ has a distribution function $G$. If $\{L_n\}$ is a sequence of random variables in $R^k$ such that
\[
P\{L_n - h(V_1, \ldots, V_n) \le s \mid V_1, \ldots, V_n\} \to F(s) \tag{2.3}
\]
almost surely for all $s \in R^k$, where $F$ is a continuous distribution function, "$\mid$" represents "conditional upon," and "$\le$" is taken to mean jointly less than elementwise, then
\[
P(L_n \le t) \to (G * F)(t) \tag{2.4}
\]
for all $t \in R^k$, where $*$ denotes convolution.

Lemma 1 is stated in terms of generic CDFs $G$ and $F$. Lemma 2, a special case of Lemma 1, is an application of Lemma 1 to normal CDFs. The proof of Lemma 2 is given in Legg (2006).

Lemma 2. Let $\{F_N\}$ be a sequence of finite populations and let $\theta_N$ be a function on $R^k$ of the elements of $F_N$ such that
\[
N^{1/2}(\theta_N - \theta) \xrightarrow{L} N_k(0, V_{11}). \tag{2.5}
\]
Let a design and an estimator, $\hat\theta_N$, and a sequence of conditional variance matrices $V_{22,N}$ be such that
\[
N^{1/2}(\hat\theta_N - \theta_N) \mid F_N \xrightarrow{L} N_k(0, V_{22}) \text{ a.s.}, \tag{2.6}
\]

where $V_{11} + V_{22,N}$ is positive definite for all $N$ and
\[
\lim_{N \to \infty} V_{22,N} = V_{22} \text{ a.s.} \tag{2.7}
\]
Then
\[
N^{1/2}(V_{11} + V_{22,N})^{-1/2}(\hat\theta_N - \theta) \xrightarrow{L} N_k(0, I_k), \tag{2.8}
\]
where $I_k$ is the $k \times k$ identity matrix.

We prove the main result of this section in the following theorem.

Theorem 1. Let $\{(y_i, x_i, h_i)\}$ be a sequence of independent identically distributed random variables with finite ninth moments. Let $\{U_N, F_N : N = k+3, k+4, \ldots\}$ be a sequence of finite populations, where $U_N$ is the set of indices identifying the elements and $F_N = ((y_1, x_1, h_1), \ldots, (y_N, x_N, h_N))$. In the superpopulation, $y_i$ is related to $x_i$ through a regression model,
\[
y_i = x_i'\beta + e_i, \quad e_i \sim \operatorname{ind}(0, \sigma^2). \tag{2.9}
\]
Assume $(x_i, h_i)$ is independent of $e_i$, and assume the $h_i$ are positive for all $i$. Let $z_i = (y_i, x_i')'$,
\[
M_{ZHZ,N} = N^{-1} Z_N' H_N Z_N \tag{2.10}
\]
and
\[
M_{ZHZ} = E\{M_{ZHZ,N}\}, \tag{2.11}
\]
where $H_N = \operatorname{diag}(h_1, h_2, \ldots, h_N)$. Assume $M_{zz} = E\{z_i z_i'\}$ and $M_{ZHZ}$ are positive definite. Let $Z = (z_1, z_2, \ldots, z_{n_N})'$, where we index the sample elements from one to $n_N$. Let
\[
\hat M_{ZHZ} = N^{-1} Z' W H Z, \tag{2.12}
\]

where $W = \operatorname{diag}(\pi_1^{-1}, \pi_2^{-1}, \ldots, \pi_{n_N}^{-1}) =: \operatorname{diag}(w_1, w_2, \ldots, w_{n_N})$ and $H = \operatorname{diag}(h_1, h_2, \ldots, h_{n_N})$. Assume the sequence of sample designs is such that, for any $z$ with finite third moments,
\[
\lim_{N \to \infty} n_N V\{\bar z_{HT} - \bar z_N \mid F_N\} = V_{zz,\infty} \text{ a.s.} \tag{2.13}
\]
and
\[
[V\{\bar z_{HT} - \bar z_N \mid F_N\}]^{-1/2}(\bar z_{HT} - \bar z_N) \mid F_N \xrightarrow{L} N(0, I) \text{ a.s.}, \tag{2.14}
\]
where $\bar z_{HT} = N^{-1}\sum_{i \in A} \pi_i^{-1} z_i$, $\bar z_N$ is the finite population mean of $z$, and $V\{\bar z_{HT} - \bar z_N \mid F_N\}$ and $V_{zz,\infty}$ are positive definite. Assume $\lim_{N \to \infty} f_N = f$ a.s., where $f_N = N^{-1} n_N$ and $0 \le f < 1$. Let $\hat V\{\bar z_{HT}\}$ be the Horvitz-Thompson variance estimator of $V\{\bar z_{HT} \mid F_N\}$, and assume
\[
\hat V\{\bar z_{HT}\} - V\{\bar z_{HT} \mid F_N\} = o_p(n_N^{-1}) \tag{2.15}
\]
for any $z$ with finite third moments. Let $\hat\beta_H$ be defined by (2.1). Then
\[
\hat\beta_H - \beta = M_{XHX}^{-1} \bar b_{HT} + O_p(n_N^{-1}) \tag{2.16}
\]
and
\[
\hat\beta_H - \beta_N = M_{XHX,N}^{-1}(\bar b_{HT} - \bar b_N) + O_p(n_N^{-1}), \tag{2.17}
\]
where
\[
\beta_N = (X_N' H_N X_N)^{-1} X_N' H_N y_N =: M_{XHX,N}^{-1} M_{XHy,N}, \tag{2.18}
\]
\[
\bar b_N = N^{-1}\sum_{i \in U} b_i, \quad \bar b_{HT} = N^{-1}\sum_{i \in A} \pi_i^{-1} b_i,
\]

and $b_i = x_i h_i e_i$. Then
\[
n_N^{1/2}(V_{11,N} + f_N V_{22})^{-1/2}(\hat\beta_H - \beta) \xrightarrow{L} N(0, I) \tag{2.19}
\]
as $N \to \infty$, where $V_{11,N} = n_N M_{XHX,N}^{-1} V\{\bar b_{HT} \mid F_N\} M_{XHX,N}^{-1}$, $V_{22} = M_{XHX}^{-1} \Sigma_{bb} M_{XHX}^{-1}$ and $\Sigma_{bb} = E\{b_i b_i'\}$. Let $\hat\Sigma_{bb} = \Sigma_{bb} + o_p(1)$. Then the estimated variance is
\[
\hat V\{\hat\beta_H\} = \hat M_{XHX}^{-1}\left(\hat V\{\bar b_{HT}\} + N^{-1}\hat\Sigma_{bb}\right)\hat M_{XHX}^{-1}
= n_N^{-1}(V_{11,N} + f_N V_{22}) + o_p(n_N^{-1}), \tag{2.20}
\]
where $\hat V\{\bar b_{HT}\}$ is the Horvitz-Thompson estimated sampling variance of $\bar b_{HT}$ calculated with $\hat b_i = x_i h_i \hat e_i$ and $\hat e_i = y_i - x_i'\hat\beta_H$.

Proof. By the design,
\[
E\left\{N^{-1}\sum_{i \in A} x_i w_i h_i x_i' \,\Big|\, F_N\right\} = N^{-1}\sum_{i \in U} x_i h_i x_i'.
\]
By the moment assumptions and assumption (2.13),
\[
N^{-1}\sum_{i \in A} x_i w_i h_i x_i' - N^{-1}\sum_{i \in U} x_i h_i x_i' = \hat M_{XHX} - M_{XHX,N} \,\Big|\, F_N = O_p(n_N^{-1/2}) \text{ a.s.}
\]
From a Taylor expansion,
\[
\hat M_{XHX}^{-1} - M_{XHX,N}^{-1} \,\Big|\, F_N = O_p(n_N^{-1/2}) \text{ a.s.}
\]
Similarly,
\[
N^{-1}\sum_{i \in A} x_i w_i h_i e_i - N^{-1}\sum_{i \in U} x_i h_i e_i = \hat M_{XHe} - M_{XHe,N} \,\Big|\, F_N = O_p(n_N^{-1/2}) \text{ a.s.}
\]
Under the model (2.9), by assumption, $e_N$ is independent of $(X_N, H_N)$. Thus
\[
M_{XHe} = E\{M_{XHe,N}\} = 0,
\]

and $\hat M_{XHe} = O_p(n_N^{-1/2})$. By the moment assumptions, $M_{XHX,N} - M_{XHX} = O_p(N^{-1/2})$. Thus
\[
\begin{aligned}
\hat\beta_H - \beta &= (X'WHX)^{-1} X'WH(y - X\beta) \\
&= \hat M_{XHX}^{-1}\, N^{-1}\sum_{i \in A} x_i w_i h_i e_i \\
&= \hat M_{XHX}^{-1} \hat M_{XHe} \\
&= M_{XHX,N}^{-1} \hat M_{XHe} + O_p(n_N^{-1}) \\
&=: M_{XHX,N}^{-1} \bar b_{HT} + O_p(n_N^{-1}),
\end{aligned}
\]
and (2.16) is proven. Also,
\[
\begin{aligned}
\beta_N - \beta &= (X_N'H_N X_N)^{-1} X_N'H_N(y_N - X_N\beta) \\
&= M_{XHX,N}^{-1}\, N^{-1}\sum_{i \in U} x_i h_i e_i \\
&= M_{XHX}^{-1} \bar b_N + O_p(N^{-1}).
\end{aligned}
\]
It follows that
\[
\begin{aligned}
\hat\beta_H - \beta_N &= (\hat\beta_H - \beta) - (\beta_N - \beta) \\
&= M_{XHX,N}^{-1}(\bar b_{HT} - \bar b_N) + O_p(n_N^{-1}) \\
&= M_{XHX}^{-1}(\bar b_{HT} - \bar b_N) + O_p(n_N^{-1}),
\end{aligned}
\]
and (2.17) is proven. From the variance assumption (2.13) and the normality assumption (2.14),
\[
n_N^{1/2}(\bar b_{HT} - \bar b_N) \mid F_N \xrightarrow{L} N(0, V_{bb,\infty}) \text{ a.s.},
\]

where $V_{bb,\infty}$ is defined by analogy to (2.13). Therefore
\[
n_N^{1/2}(\hat\beta_H - \beta_N) \mid F_N \xrightarrow{L} N(0, V_{11}) \text{ a.s.}, \tag{2.21}
\]
where $V_{11} = M_{XHX}^{-1} V_{bb,\infty} M_{XHX}^{-1}$. By the central limit theorem for independent random variables and the moment assumptions,
\[
N^{1/2}\bar b_N \xrightarrow{L} N(0, \Sigma_{bb})
\]
as $N \to \infty$. Then, as $N \to \infty$,
\[
N^{1/2}(\beta_N - \beta) \xrightarrow{L} N(0, V_{22}), \tag{2.22}
\]
where $V_{22}$ is defined in (2.19). Applying Lemma 2, result (2.19) then follows from (2.21) and (2.22).

By the design, $\hat V\{\bar b_{HT}\} = O_p(n_N^{-1})$ and $\hat\Sigma_{bb} = O_p(1)$, so
\[
\hat V\{\bar b_{HT}\} + N^{-1}\hat\Sigma_{bb} = O_p(n_N^{-1}). \tag{2.23}
\]
By (2.23) and (2.15),
\[
\begin{aligned}
\hat M_{XHX}^{-1}\left(\hat V\{\bar b_{HT}\} + N^{-1}\hat\Sigma_{bb}\right)\hat M_{XHX}^{-1}
&= M_{XHX,N}^{-1}\left(\hat V\{\bar b_{HT}\} + N^{-1}\hat\Sigma_{bb}\right)M_{XHX,N}^{-1} + O_p(n_N^{-3/2}) \\
&= M_{XHX,N}^{-1}\left(V\{\bar b_{HT} \mid F_N\} + N^{-1}\Sigma_{bb} + o_p(n_N^{-1})\right)M_{XHX,N}^{-1} + O_p(n_N^{-3/2}) \\
&= M_{XHX,N}^{-1}\left(V\{\bar b_{HT} \mid F_N\} + N^{-1}\Sigma_{bb}\right)M_{XHX,N}^{-1} + o_p(n_N^{-1})
\end{aligned}
\]

\[
\begin{aligned}
&= M_{XHX,N}^{-1} V\{\bar b_{HT} \mid F_N\} M_{XHX,N}^{-1} + N^{-1} M_{XHX}^{-1}\Sigma_{bb} M_{XHX}^{-1} + o_p(n_N^{-1}) \\
&= n_N^{-1} V_{11,N} + N^{-1} V_{22} + o_p(n_N^{-1}) \\
&= n_N^{-1}(V_{11,N} + f_N V_{22}) + o_p(n_N^{-1}).
\end{aligned} \tag{2.24}
\]
By a theorem in Fuller (2006), we can replace $e_i$ with $\hat e_i$ in (2.24). Result (2.20) then follows.

Theorem 1 is a consequence of almost sure convergence assumptions. Under the regularity conditions of Theorem 1, $\hat\beta_H$ is consistent for $\beta$ for any $H$ matrix that meets the moment assumptions in Theorem 1 and is independent of $e$. The $H$ matrix plays little role because of the independence assumptions. The proof of Theorem 1 is also a proof of consistency for the probability weighted regression estimator, obtained by setting $H = I$.

To illustrate how to construct an H estimator, we give an example of choosing the $H$ matrix. Assume Poisson sampling from a finite population generated as iid random variables and consider a model
\[
y_i = \beta x_i + e_i, \tag{2.25}
\]
where $e_i = [g(x_i)]^{1/2} a_i$, $a_i \sim \operatorname{ind}(0, 1)$ is independent of $x_i$, and $g(\cdot)$ is a positive bounded function. In the superpopulation, $x_i$ is independent of $e_i$. Assume $(x_i, \sigma^2_{ei}, a_i)$ are iid random vectors, where $\sigma^2_{ei} = g(x_i)$ is the variance of $e_i$. Under Poisson sampling, $\pi_i^{-1} e_i$ is independent of $\pi_j^{-1} e_j$ for all $i \neq j$, and
\[
V\left\{\sum_{i \in A} x_i \pi_i^{-1} h_i e_i \,\Big|\, F\right\} = \sum_{i \in U} \pi_i^{-1}(1 - \pi_i) x_i^2 h_i^2 e_i^2. \tag{2.26}
\]
Assume
\[
w_i = \pi_i^{-1} = k(x_i) + v_i,
\]

where $k(\cdot)$ is a positive bounded function and $v_i$ is independent of $(x_i, e_i)$, with $E\{v_i\} = 0$. We have
\[
E\left\{\sum_{i \in A} \pi_i^{-1} x_i h_i e_i \,\Big|\, F\right\} = \sum_{i \in U} x_i h_i e_i. \tag{2.27}
\]
Under the independence assumptions, the unconditional variance of $\sum_{i \in A} \pi_i^{-1} x_i h_i e_i$ is
\[
\begin{aligned}
V\left\{\sum_{i \in A} \pi_i^{-1} x_i h_i e_i\right\}
&= E\left\{V\left\{\sum_{i \in A} \pi_i^{-1} x_i h_i e_i \,\Big|\, F\right\}\right\} + V\left\{E\left\{\sum_{i \in A} \pi_i^{-1} x_i h_i e_i \,\Big|\, F\right\}\right\} \\
&= E\left\{\sum_{i \in U} \pi_i^{-1}(1 - \pi_i) x_i^2 h_i^2 e_i^2\right\} + V\left\{\sum_{i \in U} x_i h_i e_i\right\} \\
&= E\left\{\sum_{i \in U} \pi_i^{-1} x_i^2 h_i^2 e_i^2\right\} - E\left\{\sum_{i \in U} x_i^2 h_i^2 e_i^2\right\} + E\left\{\sum_{i \in U} x_i^2 h_i^2 e_i^2\right\} \\
&= E\left\{\sum_{i \in U} (k(x_i) + v_i)\, x_i^2 h_i^2 e_i^2\right\} \\
&= E\left\{\sum_{i \in U} k(x_i) x_i^2 h_i^2 e_i^2\right\} + E\left\{\sum_{i \in U} v_i x_i^2 h_i^2 e_i^2\right\} \\
&= \sum_{i \in U} E\{k(x_1) x_1^2 h_1^2 g(x_1)\}.
\end{aligned}
\]
Then
\[
V\{\hat\beta_H - \beta\} = \left(N E\{x_1^2 h_1\}\right)^{-2} N E\{k(x_1) x_1^2 h_1^2 g(x_1)\}. \tag{2.28}
\]
Therefore, if $k(x_i)$ and $g(x_i)$ are known, we choose
\[
h_i = k(x_i)^{-1} g(x_i)^{-1}, \quad i = 1, 2, \ldots, n, \tag{2.29}
\]
to minimize the unconditional variance (2.28). Also see Fuller (2006).

We give a consistent variance estimator of $\hat\beta_H - \beta$ for Poisson sampling. The conditional variance of $X'WHe$ for Poisson sampling from a finite population is
\[
V\left\{\sum_{i \in A} x_i \pi_i^{-1} h_i e_i \,\Big|\, F\right\} = \sum_{i \in U} \pi_i^{-1}(1 - \pi_i) x_i h_i^2 e_i^2 x_i'.
\]
The estimated conditional variance is
\[
\hat V\left\{\sum_{i \in A} x_i \pi_i^{-1} h_i e_i \,\Big|\, F\right\} = \sum_{i \in A} \pi_i^{-2}(1 - \pi_i) x_i h_i^2 \hat e_i^2 x_i'.
\]
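A small simulation sketch shows the variance payoff of the choice (2.29); all numerical choices ($\beta = 1$, $g(x) = x^4$, deterministic weights $w_i = k(x_i) = 40/x_i$, $x_i$ uniform on $[1, 3]$) are illustrative assumptions, not part of the development above.

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps = 20_000, 400
beta = 1.0

def draw_estimates():
    x = rng.uniform(1.0, 3.0, N)
    g = x ** 4                          # error variance function g(x)
    e = np.sqrt(g) * rng.normal(size=N)
    y = beta * x + e
    k = 40.0 / x                        # w_i = k(x_i): pi_i increases with x
    pi = 1.0 / k
    s = rng.random(N) < pi              # Poisson sample indicator
    w = 1.0 / pi[s]
    xs, ys = x[s], y[s]
    # PW estimator: H = I, i.e. h_i = 1
    b_pw = np.sum(xs * w * ys) / np.sum(xs * w * xs)
    # H estimator with h_i = k(x_i)^{-1} g(x_i)^{-1}, as in (2.29)
    h = 1.0 / (k[s] * g[s])
    b_h = np.sum(xs * w * h * ys) / np.sum(xs * w * h * xs)
    return b_pw, b_h

draws = np.array([draw_estimates() for _ in range(reps)])
var_pw, var_h = draws.var(axis=0)
# Both estimators center on beta = 1; the H estimator with the
# optimal h_i has noticeably smaller Monte Carlo variance.
```

With these choices, expression (2.28) gives a theoretical variance ratio of roughly 1.8 in favor of the optimal $h_i$, which the Monte Carlo variances reproduce.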

Also,
\[
E\left\{\hat V\left\{\sum_{i \in A} x_i \pi_i^{-1} h_i e_i \,\Big|\, F\right\}\right\}
= E\left\{\sum_{i \in U} \pi_i^{-1}(1 - \pi_i) x_i h_i^2 e_i^2 x_i'\right\}
= E\left\{\sum_{i \in U} \pi_i^{-1} x_i h_i^2 e_i^2 x_i'\right\} - E\left\{\sum_{i \in U} x_i h_i^2 e_i^2 x_i'\right\}
\]
and
\[
V\left\{E\left\{\sum_{i \in A} x_i \pi_i^{-1} h_i e_i \,\Big|\, F\right\}\right\}
= V\left\{\sum_{i \in U} x_i h_i e_i\right\}
= E\left\{\sum_{i \in U} x_i h_i^2 e_i^2 x_i'\right\}.
\]
Therefore, the unconditional variance of $\sum_{i \in A} x_i \pi_i^{-1} h_i e_i$ is
\[
V\left\{\sum_{i \in A} x_i \pi_i^{-1} h_i e_i\right\} = E\left\{\sum_{i \in U} \pi_i^{-1} x_i h_i^2 e_i^2 x_i'\right\}.
\]
The estimated unconditional variance is
\[
\hat V\left\{\sum_{i \in A} x_i \pi_i^{-1} h_i e_i\right\} = \sum_{i \in A} \pi_i^{-2} x_i h_i^2 x_i' \hat e_i^2,
\]
and an estimated covariance matrix of $\hat\beta_H$ for Poisson sampling is
\[
\hat V\{\hat\beta_H\} = n(n - k)^{-1}(X'WHX)^{-1} X'WH \hat D_{ee,H} HWX (X'WHX)^{-1}, \tag{2.30}
\]
where $k$ is the dimension of $x_i$, $\hat D_{ee,H} = \operatorname{diag}(\hat e_1^2, \hat e_2^2, \ldots, \hat e_n^2)$, and $\hat e_i = y_i - x_i'\hat\beta_H$. The estimated variance (2.30) is a consistent estimator of the variance of $\hat\beta_H - \beta$.

The variance expression (2.30) can also be used for a stratified sample selected from a population that is a simple random sample from a stratified superpopulation. For stratified sampling, the conditional variance of the stratified mean is
\[
V\{\bar y_{st} - \bar y_N \mid F\} = \sum_{h=1}^{H} \sum_{i \in U_h} N^{-2} \pi_{hi}^{-1}(1 - \pi_{hi}) N_h (N_h - 1)^{-1} e_{hi}^2,
\]

where $\pi_{hi} = n_h N_h^{-1}$ is the selection probability in stratum $h$, $e_{hi} = y_i - \bar y_{N_h}$, $\bar y_N$ is the finite population mean, and $\bar y_{N_h} = N_h^{-1}\sum_{i \in U_h} y_i$. The estimated conditional variance is
\[
\hat V\{\bar y_{st} - \bar y_N \mid F\} = \sum_{h=1}^{H} \sum_{i \in A_h} N^{-2} \pi_{hi}^{-2}(1 - \pi_{hi}) n_h (n_h - 1)^{-1} \hat e_{hi}^2,
\]
where $\hat e_{hi} = y_i - \bar y_{n_h}$ and $\bar y_{n_h} = n_h^{-1}\sum_{i \in A_h} y_i$. Thus the estimated unconditional variance is
\[
\hat V\{\bar y_{st} - \bar y_N\} = \sum_{h=1}^{H} \sum_{i \in A_h} N^{-2} \pi_{hi}^{-2} n_h (n_h - 1)^{-1} \hat e_{hi}^2.
\]
For a general design, if a finite population correction can be ignored, a variance estimator of $\hat\beta_H$ is
\[
\hat V\{\hat\beta_H \mid F\} = (X'WHX)^{-1} \hat V\{X'WHe\}(X'WHX)^{-1}, \tag{2.31}
\]
where $\hat V\{X'WHe\} = \hat V\{\sum_{i \in A} x_i \pi_i^{-1} h_i e_i\}$ is the Horvitz-Thompson estimator of the variance of the sum, calculated with $x_i h_i \hat e_i$ and $\hat e_i = y_i - x_i'\hat\beta_H$. Under the assumptions given in Theorem 1, the variance estimator (2.31) is consistent for $V\{\hat\beta_H - \beta_N \mid F\}$.

2.2 Pfeffermann-Sverchkov Estimator

Pfeffermann and Sverchkov (1999) consider the regression model (1.7) and a design in which the $\pi_i$ may be a function of $x_i$ and $e_i$. They propose an estimator, which we call the Pfeffermann-Sverchkov (PS) estimator, obtained by utilizing information about the moments in the population. Estimation is a two-step procedure:

(1) Calculate $\hat w_i$ by the regression of $w_i$ on known functions of $x_i$ using the sample measurements.

(2) Compute
\[
\hat\beta_{PS} = \arg\min_\beta \left\{ n^{-1}\sum_{i \in A} \hat w_i^{-1} w_i (y_i - x_i'\beta)^2 \right\}.
\]

The $\hat w_i$ is an estimator of $E_s\{w_i \mid x_i\}$. The PS estimator $\hat\beta_{PS}$ of $\beta$ is calculated as
\[
\hat\beta_{PS} = \left(\sum_{i \in A} q_i x_i x_i'\right)^{-1} \sum_{i \in A} q_i x_i y_i
\]

\[
= (X'QX)^{-1} X'Qy, \tag{2.32}
\]
where $Q = \operatorname{diag}(q_1, q_2, \ldots, q_n)$, $q_i = w_i \hat w_i^{-1}$, and $\hat w_i$ is the fitted value from the OLS regression of $w_i$ on known functions of $x_i$. The PS estimator is a version of the H estimator (2.1) with $H = \hat W^{-1} = \operatorname{diag}(\hat w_1^{-1}, \hat w_2^{-1}, \ldots, \hat w_n^{-1})$.

If the model has constant error variances and the correlation between $w_i$ and $e_i^2$ is modest, then $E_s\{w_i \mid x_i\}$ will be highly correlated with $V\{\pi_i^{-1} e_i\}$ and the PS estimator will perform well. If there is reasonable correlation between $x_i$ and $\pi_i^{-1} e_i^2$, then the PS estimator can be used as the initial estimator to provide $\hat e_i^2$ with which to estimate $h_i$ for constructing an H estimator.

The PW estimator and the PS estimator coincide when the $\pi_i$ are independent of the $x_i$, so that $E_s\{w_i \mid x_i\} = [E\{\pi_i \mid x_i\}]^{-1}$ is constant. When $w_i$ is a deterministic function of $x_i$, $q_i = 1$ and $\hat\beta_{PS} = \hat\beta_{ols}$. In general, $\hat w_i$ is not a consistent estimator of the superpopulation expected value of $w_i$ given $x_i$, and the variance from estimating $E_s\{w_i \mid x_i\}$ may reduce the efficiency of the PS estimator. See Skinner (1994).

The PS estimator $\hat\beta_{PS}$ is not strictly unbiased for $\beta$, but under the assumptions of Theorem 1 it is consistent. We note that
\[
\begin{aligned}
E\{N^{-1} X'Qe\} &= E\left\{N^{-1}\sum_{i \in A} x_i q_i e_i\right\} \\
&\approx E\left\{N^{-1}\sum_{i \in U} \pi_i\left(x_i w_i \hat w_i^{-1}\right) e_i\right\} \\
&= E\left\{N^{-1}\sum_{i \in U} x_i \hat w_i^{-1} e_i\right\} \\
&= 0.
\end{aligned}
\]
The approximation follows from the fact that $\hat w_i$ is an estimator of $E\{w_i \mid x_i\}$ and de-


More information

Econometrics II - EXAM Answer each question in separate sheets in three hours

Econometrics II - EXAM Answer each question in separate sheets in three hours Econometrics II - EXAM Answer each question in separate sheets in three hours. Let u and u be jointly Gaussian and independent of z in all the equations. a Investigate the identification of the following

More information

ASYMPTOTIC NORMALITY UNDER TWO-PHASE SAMPLING DESIGNS

ASYMPTOTIC NORMALITY UNDER TWO-PHASE SAMPLING DESIGNS Statistica Sinica 17(2007), 1047-1064 ASYMPTOTIC NORMALITY UNDER TWO-PHASE SAMPLING DESIGNS Jiahua Chen and J. N. K. Rao University of British Columbia and Carleton University Abstract: Large sample properties

More information

A note on multiple imputation for general purpose estimation

A note on multiple imputation for general purpose estimation A note on multiple imputation for general purpose estimation Shu Yang Jae Kwang Kim SSC meeting June 16, 2015 Shu Yang, Jae Kwang Kim Multiple Imputation June 16, 2015 1 / 32 Introduction Basic Setup Assume

More information

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015 Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS In our work on hypothesis testing, we used the value of a sample statistic to challenge an accepted value of a population parameter. We focused only

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

More information

Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design

Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design 1 / 32 Empirical Likelihood Methods for Two-sample Problems with Data Missing-by-Design Changbao Wu Department of Statistics and Actuarial Science University of Waterloo (Joint work with Min Chen and Mary

More information

Instrumental Variables, Simultaneous and Systems of Equations

Instrumental Variables, Simultaneous and Systems of Equations Chapter 6 Instrumental Variables, Simultaneous and Systems of Equations 61 Instrumental variables In the linear regression model y i = x iβ + ε i (61) we have been assuming that bf x i and ε i are uncorrelated

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

Chapter 5: Models used in conjunction with sampling. J. Kim, W. Fuller (ISU) Chapter 5: Models used in conjunction with sampling 1 / 70

Chapter 5: Models used in conjunction with sampling. J. Kim, W. Fuller (ISU) Chapter 5: Models used in conjunction with sampling 1 / 70 Chapter 5: Models used in conjunction with sampling J. Kim, W. Fuller (ISU) Chapter 5: Models used in conjunction with sampling 1 / 70 Nonresponse Unit Nonresponse: weight adjustment Item Nonresponse:

More information

LECTURE 2 LINEAR REGRESSION MODEL AND OLS

LECTURE 2 LINEAR REGRESSION MODEL AND OLS SEPTEMBER 29, 2014 LECTURE 2 LINEAR REGRESSION MODEL AND OLS Definitions A common question in econometrics is to study the effect of one group of variables X i, usually called the regressors, on another

More information

Panel Data Models. Chapter 5. Financial Econometrics. Michael Hauser WS17/18 1 / 63

Panel Data Models. Chapter 5. Financial Econometrics. Michael Hauser WS17/18 1 / 63 1 / 63 Panel Data Models Chapter 5 Financial Econometrics Michael Hauser WS17/18 2 / 63 Content Data structures: Times series, cross sectional, panel data, pooled data Static linear panel data models:

More information

The regression model with one fixed regressor cont d

The regression model with one fixed regressor cont d The regression model with one fixed regressor cont d 3150/4150 Lecture 4 Ragnar Nymoen 27 January 2012 The model with transformed variables Regression with transformed variables I References HGL Ch 2.8

More information

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y

More information

Chapter 3: Maximum Likelihood Theory

Chapter 3: Maximum Likelihood Theory Chapter 3: Maximum Likelihood Theory Florian Pelgrin HEC September-December, 2010 Florian Pelgrin (HEC) Maximum Likelihood Theory September-December, 2010 1 / 40 1 Introduction Example 2 Maximum likelihood

More information

Econometrics. Week 8. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague

Econometrics. Week 8. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Econometrics Week 8 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 25 Recommended Reading For the today Instrumental Variables Estimation and Two Stage

More information

Regression and Statistical Inference

Regression and Statistical Inference Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF

More information

Lecture 7: Dynamic panel models 2

Lecture 7: Dynamic panel models 2 Lecture 7: Dynamic panel models 2 Ragnar Nymoen Department of Economics, UiO 25 February 2010 Main issues and references The Arellano and Bond method for GMM estimation of dynamic panel data models A stepwise

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator

Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator by Emmanuel Flachaire Eurequa, University Paris I Panthéon-Sorbonne December 2001 Abstract Recent results of Cribari-Neto and Zarkos

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 1 Bootstrapped Bias and CIs Given a multiple regression model with mean and

More information

Mathematical statistics

Mathematical statistics October 4 th, 2018 Lecture 12: Information Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation Chapter

More information

Formal Statement of Simple Linear Regression Model

Formal Statement of Simple Linear Regression Model Formal Statement of Simple Linear Regression Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor

More information

δ -method and M-estimation

δ -method and M-estimation Econ 2110, fall 2016, Part IVb Asymptotic Theory: δ -method and M-estimation Maximilian Kasy Department of Economics, Harvard University 1 / 40 Example Suppose we estimate the average effect of class size

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

Applied Econometrics (QEM)

Applied Econometrics (QEM) Applied Econometrics (QEM) The Simple Linear Regression Model based on Prinicples of Econometrics Jakub Mućk Department of Quantitative Economics Jakub Mućk Applied Econometrics (QEM) Meeting #2 The Simple

More information

The Use of Survey Weights in Regression Modelling

The Use of Survey Weights in Regression Modelling The Use of Survey Weights in Regression Modelling Chris Skinner London School of Economics and Political Science (with Jae-Kwang Kim, Iowa State University) Colorado State University, June 2013 1 Weighting

More information

Bias Variance Trade-off

Bias Variance Trade-off Bias Variance Trade-off The mean squared error of an estimator MSE(ˆθ) = E([ˆθ θ] 2 ) Can be re-expressed MSE(ˆθ) = Var(ˆθ) + (B(ˆθ) 2 ) MSE = VAR + BIAS 2 Proof MSE(ˆθ) = E((ˆθ θ) 2 ) = E(([ˆθ E(ˆθ)]

More information

Spatial Regression. 3. Review - OLS and 2SLS. Luc Anselin. Copyright 2017 by Luc Anselin, All Rights Reserved

Spatial Regression. 3. Review - OLS and 2SLS. Luc Anselin.   Copyright 2017 by Luc Anselin, All Rights Reserved Spatial Regression 3. Review - OLS and 2SLS Luc Anselin http://spatial.uchicago.edu OLS estimation (recap) non-spatial regression diagnostics endogeneity - IV and 2SLS OLS Estimation (recap) Linear Regression

More information

Exogeneity tests and weak identification

Exogeneity tests and weak identification Cireq, Cirano, Départ. Sc. Economiques Université de Montréal Jean-Marie Dufour Cireq, Cirano, William Dow Professor of Economics Department of Economics Mcgill University June 20, 2008 Main Contributions

More information

ECON The Simple Regression Model

ECON The Simple Regression Model ECON 351 - The Simple Regression Model Maggie Jones 1 / 41 The Simple Regression Model Our starting point will be the simple regression model where we look at the relationship between two variables In

More information

Marcia Gumpertz and Sastry G. Pantula Department of Statistics North Carolina State University Raleigh, NC

Marcia Gumpertz and Sastry G. Pantula Department of Statistics North Carolina State University Raleigh, NC A Simple Approach to Inference in Random Coefficient Models March 8, 1988 Marcia Gumpertz and Sastry G. Pantula Department of Statistics North Carolina State University Raleigh, NC 27695-8203 Key Words

More information

Lecture 14 Simple Linear Regression

Lecture 14 Simple Linear Regression Lecture 4 Simple Linear Regression Ordinary Least Squares (OLS) Consider the following simple linear regression model where, for each unit i, Y i is the dependent variable (response). X i is the independent

More information

Modification and Improvement of Empirical Likelihood for Missing Response Problem

Modification and Improvement of Empirical Likelihood for Missing Response Problem UW Biostatistics Working Paper Series 12-30-2010 Modification and Improvement of Empirical Likelihood for Missing Response Problem Kwun Chuen Gary Chan University of Washington - Seattle Campus, kcgchan@u.washington.edu

More information

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods Chapter 4 Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods 4.1 Introduction It is now explicable that ridge regression estimator (here we take ordinary ridge estimator (ORE)

More information

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics The candidates for the research course in Statistics will have to take two shortanswer type tests

More information

Empirical Likelihood Methods for Sample Survey Data: An Overview

Empirical Likelihood Methods for Sample Survey Data: An Overview AUSTRIAN JOURNAL OF STATISTICS Volume 35 (2006), Number 2&3, 191 196 Empirical Likelihood Methods for Sample Survey Data: An Overview J. N. K. Rao Carleton University, Ottawa, Canada Abstract: The use

More information

ECON 3150/4150, Spring term Lecture 6

ECON 3150/4150, Spring term Lecture 6 ECON 3150/4150, Spring term 2013. Lecture 6 Review of theoretical statistics for econometric modelling (II) Ragnar Nymoen University of Oslo 31 January 2013 1 / 25 References to Lecture 3 and 6 Lecture

More information

On Testing for Informative Selection in Survey Sampling 1. (plus Some Estimation) Jay Breidt Colorado State University

On Testing for Informative Selection in Survey Sampling 1. (plus Some Estimation) Jay Breidt Colorado State University On Testing for Informative Selection in Survey Sampling 1 (plus Some Estimation) Jay Breidt Colorado State University Survey Methods and their Use in Related Fields Neuchâtel, Switzerland August 23, 2018

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data

Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data Recent Advances in the Field of Trade Theory and Policy Analysis Using Micro-Level Data July 2012 Bangkok, Thailand Cosimo Beverelli (World Trade Organization) 1 Content a) Classical regression model b)

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

A measurement error model approach to small area estimation

A measurement error model approach to small area estimation A measurement error model approach to small area estimation Jae-kwang Kim 1 Spring, 2015 1 Joint work with Seunghwan Park and Seoyoung Kim Ouline Introduction Basic Theory Application to Korean LFS Discussion

More information

WISE International Masters

WISE International Masters WISE International Masters ECONOMETRICS Instructor: Brett Graham INSTRUCTIONS TO STUDENTS 1 The time allowed for this examination paper is 2 hours. 2 This examination paper contains 32 questions. You are

More information

Multivariate Regression Analysis

Multivariate Regression Analysis Matrices and vectors The model from the sample is: Y = Xβ +u with n individuals, l response variable, k regressors Y is a n 1 vector or a n l matrix with the notation Y T = (y 1,y 2,...,y n ) 1 x 11 x

More information

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection SG 21006 Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection Ioan Tabus Department of Signal Processing Tampere University of Technology Finland 1 / 28

More information

Master s Written Examination - Solution

Master s Written Examination - Solution Master s Written Examination - Solution Spring 204 Problem Stat 40 Suppose X and X 2 have the joint pdf f X,X 2 (x, x 2 ) = 2e (x +x 2 ), 0 < x < x 2

More information

arxiv: v2 [math.st] 20 Jun 2014

arxiv: v2 [math.st] 20 Jun 2014 A solution in small area estimation problems Andrius Čiginas and Tomas Rudys Vilnius University Institute of Mathematics and Informatics, LT-08663 Vilnius, Lithuania arxiv:1306.2814v2 [math.st] 20 Jun

More information