The Effective Use of Complete Auxiliary Information From Survey Data

by

Changbao Wu
B.S., Anhui Laodong University, China, 1982
M.S. Diploma, East China Normal University, 1986

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Mathematics and Statistics

© Changbao Wu 1999
SIMON FRASER UNIVERSITY
August, 1999

All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.

APPROVAL

Name: Changbao Wu
Degree: Doctor of Philosophy
Title of thesis: The Effective Use of Complete Auxiliary Information From Survey Data
Examining Committee:
Dr. Katherine Heinrich, Chair
Dr. Randy R. Sitter, Senior Supervisor
Dr. Charmaine Dean
Dr. Richard Lockhart
Dr. Carl Schwarz
Dr. Gemai Chen, External Examiner, Department of Math & Stats, University of Regina
Date Approved:

Abstract

This thesis develops a unified framework for the effective use of complete auxiliary information from survey data at the estimation stage. The proposed method involves modeling the relationship between the variable of interest and the auxiliary variables, and then incorporating the auxiliary information into design-consistent estimators of finite population means, totals, distribution functions and quantiles through the predicted values using calibration and empirical likelihood methods. The proposed model-calibration estimators can effectively handle any linear or non-linear models, and the estimators of means and totals reduce to the generalized regression estimators under linear models. The pseudo-empirical likelihood approach (Chen and Sitter, 1999), when used in this setting, gives an estimator that is asymptotically equivalent to the model-calibration estimator but with positive weights, and therefore is preferred. Some existing estimators which use complete auxiliary information are shown to be special cases of this unified approach. The approach also provides a simple and elegant algorithm for obtaining an approximately generalized regression estimator with positive weights. This approach can be termed model-assisted, as the resulting estimators are design-consistent regardless of the working model and particularly efficient if the working model adequately describes the true relationship in the population.

Variance estimation and confidence intervals are also considered. Consistent analytical and jackknife variance estimators are obtained for estimators of means, totals and distribution functions. The small sample performance of these variance estimators is investigated through a limited simulation study. Better conditional performance of the jackknife is highlighted. These variance estimates can be used in normal theory confidence intervals. In the case of distribution functions, a simple transformation technique for obtaining better performing confidence intervals is proposed. Woodruff confidence intervals for quantiles are re-examined, and a surprising property of this interval is found and confirmed both empirically and theoretically.

Finally, on a somewhat independent tack, a purely model-based approach is considered. The main new results in this area involve the development of consistent analytical and jackknife variance estimators for the model-based estimator of the distribution function of Chambers and Dunstan (1986). The analytical variance estimators involve kernel smoothing and require substantial re-derivation for every new model. The jackknife variance estimators, however, are operationally simple and easy to extend to new models.

Acknowledgments

First of all, I wish to express my deepest appreciation and gratitude to my senior supervisor Dr. Randy Sitter for his enthusiasm, encouragement and guidance during the past four years. I thank all the statistics faculty members at SFU who led me into the real world of statistics and sharpened my mind for critical thinking. Special thanks are due to Dr. Charmaine Dean and Dr. Richard Lockhart for their consistent support. I would also like to thank Sylvia, Maggie, Judy, Diane, Casey and other staff members of the department for their kindness and help. Many, many thanks to my friends and officemates Derek, Phil, Heidi, Chandanie, Mike, Melody, Carolyn, Darby, Chuck, Jason, Peter, Jerry, Xucai, Hilary and many others, who have always been supportive and encouraging. I thank Dr. Jiahua Chen at the University of Waterloo for many valuable discussions.

I have benefited very much from various financial support during the course of my studies, especially the C. D. Nelson Memorial Scholarship from Simon Fraser University, Research and Teaching Assistantships from the Department of Mathematics and Statistics, and the E. C. Bryant Scholarship from the American Statistical Association and Westat, Inc.

Most of all, I wish to thank my wife Jianchuan and my daughter Domeny for coming to Canada and sharing with me these unforgettable years of life at Simon Fraser University.

Dedication

To the memory of my mother.
To Jianchuan and Domeny.

Contents

Abstract
Acknowledgments
Dedication
List of Tables
List of Figures
1 Introduction
2 A Review
  2.1 General Settings
  2.2 Estimation of The Finite Population Mean and Total
    2.2.1 The generalized regression estimator
    2.2.2 The calibration estimator
    2.2.3 The pseudo-empirical maximum likelihood estimator
  2.3 Estimation of The Distribution Function
    2.3.1 The design-based difference estimator
    2.3.2 The model-based prediction estimator
    2.3.3 Supplementary remarks
  2.4 A Discussion
3 Model Calibration
  3.1 Modeling
  3.2 Estimation of The Population Mean
    3.2.1 Model-calibration
    3.2.2 Pseudo-empirical likelihood approach
    3.2.3 The generalized difference estimator
    3.2.4 Some comparative comments
    3.2.5 A simulation
  3.3 Estimating The Distribution Function
    3.3.1 Estimation of F(t) under a regression model
    3.3.2 Estimation of F(t) under a general model
  3.4 Quantile Estimation
  3.5 Proofs
4 Positive Weights in Regression Estimation
  4.1 Introduction
  4.2 Model Calibration and the Empirical Likelihood Approach
  4.3 Algorithm for Obtaining Positive Weights
  4.4 Comparing Two Sets of Weights
5 Variance Estimation and Confidence Intervals
  5.1 Variance Estimation for The Finite Population Mean
  5.2 Variance Estimation for The Distribution Function
    5.2.1 Analytical variance estimation for $\hat{F}_d(t)$
    5.2.2 Jackknife variance estimation for $\hat{F}_d(t)$
    5.2.3 A simulation
    5.2.4 Illustration of (5.4) under sampling with replacement
  5.3 Confidence Intervals for the Distribution Function
    5.3.1 A transformation technique
    5.3.2 An empirical comparison
  5.4 Confidence Intervals for Quantiles
    5.4.1 Woodruff intervals for large and small quantiles: a simulation
    5.4.2 Investigation of the phenomenon
    5.4.3 Supplementary remarks
6 Model-based Inference
  6.1 Model-based Prediction Estimators
    6.1.1 Model-based estimator of the population mean
    6.1.2 Model-based estimator for the distribution function
  6.2 Variance Estimation
    6.2.1 Analytical variance estimation
    6.2.2 Jackknife variance estimation
    6.2.3 An empirical comparison of variance estimators
  6.3 Proof of Theorem
7 Concluding Remarks and Future Research
  7.1 Concluding Remarks
  7.2 Some Future Work
Bibliography

List of Tables

3.1 Relative bias and efficiency for estimating the mean
Relative efficiency for estimating $F(t)$
Relative bias and efficiency for estimating the quantiles (1)
Relative bias and efficiency for estimating the quantiles (2)
Relative bias and efficiency for estimating the quantiles (3)
Relative bias and instability of variance estimators for $\hat{F}_d(t)$
Coverage probabilities and tail errors for transformation intervals
Coverage probabilities and tail errors for Woodruff intervals
Coverage probabilities for idealized Woodruff intervals
Coverage probabilities and tail errors for modified Woodruff intervals
Relative bias and instability of the variance estimators for quantiles
Relative bias and instability of variance estimators for $\hat{F}_m(t)$

List of Figures

3.1 Scatter plot of population
Graphical representation of $g(\lambda)$
Plot of R-weight versus P-weight
Plot of conditional performance of variance estimators

Chapter 1

Introduction

This thesis considers the use of complete auxiliary information from survey data, where "complete" means that the values of the auxiliary variables are known for the entire finite population, not just for the selected sample, $s$. In sample surveys, auxiliary information on the finite population is regularly used to obtain more precise estimates of unknown finite population quantities. Sometimes this auxiliary information is used at the design stage. Examples include probability proportional to size sampling, where the values of an auxiliary variable are the size measures, and stratified sampling, where auxiliary variables serve as stratum indicators. But often this information is incorporated into the construction of estimators at the estimation stage. Ratio and regression estimators are early examples. Recently, several more complex procedures have been proposed in the literature. However, as discussed in Chapter 2, existing estimators for finite population means and totals essentially incorporate the auxiliary information only through the known population means or totals of the auxiliary variables, even when complete auxiliary information is available.

In this thesis, a unified framework for effectively using complete auxiliary information at the estimation stage is proposed and developed. This framework adopts a general modeling process and incorporates the complete auxiliary information into design-consistent estimators of finite population means, totals, distribution functions and quantiles through the fitted values, using calibration and empirical likelihood methods. The logical connection and the practical difference between estimation of the finite population mean and estimation of the distribution function using complete auxiliary information become more apparent under the unified approach. Variance estimation is also considered. For the mean it is shown to be quite straightforward, but this is not so for the distribution function. We propose to use a jackknife variance estimator for the distribution function and establish its consistency for some important cases. Our approach in Chapters 3, 4 and 5 can be termed model-assisted, as the resulting estimators are design-consistent regardless of the working model and particularly efficient if the working model adequately describes the true relationship in the population.

In Chapter 2, we first describe the general setting and notation and then briefly review existing methods that use auxiliary information at the estimation stage. In particular, we discuss the explicit or implicit nature of a linear working model used in these methods and call for more sophisticated modeling in using complete auxiliary information.

In Chapter 3, we introduce a unified framework for the use of complete auxiliary information through a general modeling process and a general approach which we term model-calibration. New estimators for population means, totals, distribution functions and quantiles are proposed and compared under this framework. We assume a superpopulation model specified through the first and second moments of the $y$ variable. This model is very general and includes linear and non-linear regression models and generalized linear models as special cases. The problem of estimating the model parameters is carefully treated in Section 3.1, following the proposal of Godambe and Thompson (1986) and using the theory of estimating equations. Design-based estimates of the model parameters can then be obtained, which enable us to proceed to our model-assisted approach.
We argue in Section 3.2 that complete auxiliary information should be used through fitted values under the working model. A general method to do this is what we term model-calibration. This can be accomplished by first using $(y_i, x_i)$ for $i \in s$ to build the model and then calibrating to the predicted values from the model using: (1) a direct calibration argument similar to that of Deville and Särndal (1992); (2) a pseudo-empirical likelihood approach (Chen and Sitter, 1999); or (3) a generalized difference estimator (Cassel, Särndal and Wretman, 1976; Särndal, 1980). The proposed model-calibration estimators for finite population means and totals can effectively handle any linear or non-linear models and reduce to the conventional calibration estimators (Deville and Särndal, 1992) and/or the generalized regression estimators under linear models. The pseudo-empirical maximum likelihood estimator (Chen and Sitter, 1999), when applied in a similar way, is shown to yield an estimator that is asymptotically equivalent to the model-calibration estimator but with positive weights, and is therefore preferred.

Estimating the finite population distribution function, $F(t)$, with complete auxiliary information amounts to finding fitted values for indicator variables involving $y_i$, and not for $y_i$ itself. Several estimators are proposed in Section 3.3 and some existing methods are shown to be special cases of this unified framework.

Estimators of quantiles are usually obtained by inverting estimators of the distribution function. This can be easily done if the estimator of the distribution function is itself a true distribution function. This is usually not the case when auxiliary variables are incorporated into the estimator. Alternatively, in Section 3.4, we propose to use a difference estimator and a regression-type estimator for the quantiles when complete auxiliary information is available.

In Chapter 4, we digress and consider estimation of the mean or total of $y$ when only the vector of finite population means of the auxiliary variables, $\bar{X}$, is known. The generalized regression estimator (Cassel, Särndal and Wretman, 1976; Särndal, 1980) is one of the most commonly used procedures in this situation.
One of the drawbacks of this estimator is that, when it is written as a weighted average of the $y_i$'s from the sample, it can result in negative weights, which is very unattractive to practitioners. Some algorithms have been proposed in the literature to adjust the so-called regression weights iteratively so that the adjusted estimator is asymptotically equivalent to the generalized regression estimator but with positive weights. See, for instance, Huang and Fuller (1978). As a surprising by-product of our more generally applicable model-calibration approach, we propose a simple and elegant algorithm to obtain positive weights for the generalized regression estimator in this setting by using the idea of model-calibration combined with the pseudo-empirical likelihood approach. The algorithm requires no seeds, is not iterative, and guarantees that if a solution exists, it will be obtained.

In Chapter 5, we consider issues related to variance estimation and confidence intervals for the general estimation methods proposed in Chapter 3. Variance estimation for the mean is shown to be simple. Variance estimation for the finite population distribution function in the presence of complete auxiliary information is more difficult, due to the discreteness of the indicator functions used in the estimators. In particular, we must deal with the fact that the model parameters used inside the indicator functions in these estimators have themselves been estimated from the sample data. We focus on one important case: the design-based difference estimator of Rao, Kovar and Mantel (1990), which turns out to be a special case of our model-calibration approach. We show that in this case the estimated model parameters do not change the asymptotic design variance of the estimator for some commonly used designs. This result is critical in establishing the consistency of the proposed variance estimators.

We go on to propose a jackknife variance estimator for the distribution function. The jackknife variance estimator for the design-based difference estimator does not have any great advantage other than operational simplicity in some practical settings, since the analytical variance and its estimator can be developed quite easily. However, while examining the small sample performance of this variance estimator through a simulation study, we demonstrate that the jackknife performs better conditionally, that is, conditioning on the means of the auxiliary variables for a given sample.

After a point estimate and its estimated variance have been obtained, confidence intervals can be constructed using the conventional Z statistic and the normal approximation.
However, for the distribution function, this interval performs poorly in the tail regions. In Section 5.3, we propose a simple transformation technique to obtain better behaved confidence intervals for the distribution function.

The well-known Woodruff confidence intervals for quantiles are obtained by inverting the normal confidence intervals for the distribution function. One might intuitively believe that Woodruff intervals should not be recommended for large or small quantiles, since in these cases the normal intervals perform badly. Surprisingly, we demonstrate that, despite this fact, Woodruff intervals for large or small quantiles based on inverting these badly behaved intervals perform very well. We investigate this both empirically and theoretically in Section 5.4.

Finally, on a somewhat independent tack, a purely model-based approach is considered in Chapter 6. The main new results in this area involve the development of consistent analytical and jackknife variance estimators for the model-based estimator of the distribution function of Chambers and Dunstan (1986). Unlike the case of the design-based difference estimator (Rao, Kovar and Mantel, 1990) of Section 5.2, where we show that the estimated model parameters do not change the asymptotic design variance, the estimated model parameters have to be taken into account in variance estimation for this model-based prediction estimator. Analytical variance formulas are very difficult to obtain and must be derived one at a time for different assumed models. These variance estimators also involve kernel density estimation. On the other hand, the jackknife variance estimator is easy to compute, remains operationally the same for different superpopulation models, and neatly avoids kernel density estimation. The consistency of the jackknife is established under some very mild regularity conditions in Section 6.2.
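To make the Woodruff construction mentioned above concrete, the following Python sketch inverts a normal-theory interval for the distribution function to obtain an interval for the quantile. It is only an illustration under assumptions not made in the text: simple random sampling without replacement, no auxiliary information, and a plug-in standard error $\{(1 - n/N)\,p(1-p)/n\}^{1/2}$. The function name `woodruff_interval` and its arguments are hypothetical.

```python
import math
import numpy as np

def woodruff_interval(y_s, p, N, z=1.96):
    """Sketch of a Woodruff-type interval for the p-th quantile.

    Assumes simple random sampling without replacement from a population
    of size N (an assumption of this sketch, not of the thesis).
    """
    y_sorted = np.sort(np.asarray(y_s, dtype=float))
    n = len(y_sorted)
    # Plug-in standard error of the estimated distribution function at the quantile.
    se = math.sqrt((1.0 - n / N) * p * (1.0 - p) / n)

    def inv_cdf(q):
        # Generalized inverse of the empirical distribution function:
        # the smallest sample value whose empirical CDF is at least q.
        q = min(max(q, 0.0), 1.0)
        k = max(int(math.ceil(q * n - 1e-12)), 1)
        return float(y_sorted[k - 1])

    # Invert the normal interval [p - z*se, p + z*se] for the distribution function.
    return inv_cdf(p - z * se), inv_cdf(p + z * se)
```

With the sample values 1, 2, ..., 100, `p = 0.5` and `N = 10000`, the interval endpoints are order statistics on either side of the sample median.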

Chapter 2

The Use of Auxiliary Information From Surveys: A Review

In sample surveys, auxiliary information on the finite population is often used at the estimation stage to increase the precision of estimators of the finite population mean, total or distribution function. In the simplest settings, customary ratio and regression estimators incorporate known finite population means of auxiliary variables. For more general situations, three main methods have been proposed in the literature which can be categorized as model-assisted approaches: the generalized regression estimator (GR) (Cassel, Särndal and Wretman, 1976; Särndal, 1980); calibration estimators (Deville and Särndal, 1992); and, more recently, empirical likelihood methods (Chen and Qin, 1993; Zhong and Rao, 1998; Chen and Sitter, 1999). Recently, several estimators for the finite population distribution function using auxiliary information have also been proposed. We now briefly review these developments and address some related issues. First, the general setting and notation are described in Section 2.1.

2.1 General Settings

The finite population

Suppose that the finite population consists of $N$ identifiable units. Associated with the $i$-th unit are the study variable, $y_i$, and a vector of auxiliary variables, $x_i$. The values $x_1, x_2, \ldots, x_N$ are known for the entire population but $y_i$ is known only if the $i$-th unit is selected in the sample, $s$. Let $U = \{(y_i, x_i) : i = 1, 2, \ldots, N\}$ be the set of units for the finite population and $s = \{(y_i, x_i) : i = 1, 2, \ldots, n\}$ be the set of units in the sample.

Parameters of interest

Although the parameters of interest can be formed more generally, throughout this work we will only consider the estimation problem for the finite population mean $\bar{Y} = N^{-1}\sum_{i=1}^{N} y_i$, total $Y = \sum_{i=1}^{N} y_i$, distribution function $F(t) = N^{-1}\sum_{i=1}^{N} I_{[y_i \le t]}$, where $I_{[\cdot]}$ is the indicator function, and quantiles $\xi_p = \inf\{t : F(t) \ge p\}$.

Asymptotic set-up

We study the theoretical large sample properties of the various proposed estimators. The finite sample performance of these estimators is then examined through limited simulation studies. We assume there is a sequence of finite populations and a sequence of sampling designs, both indexed by $\nu$. The population and the sample sizes are denoted by $N_\nu$ and $n_\nu$, respectively. All limiting processes are understood to be as $\nu \to \infty$. We also assume $N_\nu \to \infty$, $n_\nu \to \infty$ and $n_\nu / N_\nu \to \pi \in [0, 1)$ as $\nu \to \infty$. However, for simplicity of presentation, the index $\nu$ will be suppressed.

The use of a superpopulation model

There exist three general approaches in survey sampling theory:

(1) The design-based approach, also called the probability sampling approach, treats the finite population as fixed. That is, $U = \{(y_i, x_i) : i = 1, 2, \ldots, N\}$ are fixed values indexed by $i$. Inferences are made based on the randomization induced by repeated sampling of the indices.

(2) The model-based approach, also termed the prediction approach, assumes the finite population values $y_1, \ldots, y_N$ are realizations of random variables $(Y_1, \ldots, Y_N)$, generated from a superpopulation model. The model distribution will lead to valid inference based on the particular set of sampled units, irrespective of the sampling design.

(3) The model-assisted approach considers only those estimators which are design-consistent and also approximately model-unbiased under what is termed a working model. This approach attempts to provide valid conditional inferences under the assumed model and at the same time protects against model misspecification, in the sense of providing valid design-based inferences irrespective of the population $y$-values (Rao, 1994).

We will adopt a modified model-assisted approach in this work, though for simplicity we will refer to it as model-assisted. First, we use a superpopulation model to describe the relationship between the $y$ variable and the $x$ variables. We then construct estimators that are design-consistent but will be particularly efficient under the working model. They will also be approximately model-unbiased if the design-based estimates for the model parameters are close to the true values. The only exception is Chapter 6, where we consider the model-based framework.

Some notation

1) $\pi_i = P(i \in s)$ denote the inclusion probabilities of a complex sampling scheme; $d_i = 1/\pi_i$ denote the basic design weights.
2) $E_p$ and $V_p$ denote the expectation and variance with respect to the sampling design (design-based); $E_\xi$ and $V_\xi$ denote the expectation and variance under the assumed superpopulation model. If such a distinction is not necessary, we will use $E$ and $Var$.
3) For vectors $\theta$ and $\hat{\theta}$, $\hat{\theta} = \theta + O_p(n^{-1/2})$ means $\hat{\theta}_k = \theta_k + O_p(n^{-1/2})$ for each component of the vectors. Similar notation is used for random matrices.
4) For vectors $\theta$, $\theta_1$ and $\theta_2$, $\theta \in (\theta_1, \theta_2)$ means $\theta_k \in (\theta_{1k}, \theta_{2k})$ for each component of the vectors.
5) $X_n \stackrel{L}{\to} X$ denotes that $X_n$ converges to $X$ in distribution.
6) $\bar{s}$ denotes the set of non-sampled units in the finite population.
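The four parameters of interest defined in this section can be computed directly when the whole population is available, which is useful for checking estimators in simulation studies. The sketch below is a minimal illustration assuming NumPy; the function names are hypothetical and the small tolerance in `quantile` is only a guard against floating-point rounding.

```python
import math
import numpy as np

def population_mean(y):
    """Finite population mean: N^{-1} * sum of y_i."""
    return float(np.mean(y))

def population_total(y):
    """Finite population total: sum of y_i."""
    return float(np.sum(y))

def distribution_function(y, t):
    """F(t): proportion of units with y_i <= t."""
    return float(np.mean(np.asarray(y) <= t))

def quantile(y, p):
    """xi_p = inf{t : F(t) >= p}, i.e. the smallest y-value at which F reaches p."""
    y_sorted = np.sort(np.asarray(y, dtype=float))
    N = len(y_sorted)
    # Smallest k with k/N >= p; the tolerance guards against rounding in p * N.
    k = max(int(math.ceil(p * N - 1e-12)), 1)
    return float(y_sorted[k - 1])
```

Note that `quantile` returns an actual population value rather than an interpolated one, in line with the infimum definition above.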

2.2 Estimation of The Finite Population Mean and Total

The finite population mean and total are defined as
$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i \quad \text{and} \quad Y = \sum_{i=1}^{N} y_i.$$
In the absence of supplementary population information, the design-unbiased Horvitz-Thompson estimator $\hat{\bar{Y}}_{HT} = N^{-1}\sum_{i \in s} d_i y_i$ or $\hat{Y}_{HT} = \sum_{i \in s} d_i y_i$ is typically used. This estimator, which could possibly incorporate auxiliary information at the design stage, uses no auxiliary information at the estimation stage. In this section, we briefly review the three main model-assisted approaches that incorporate auxiliary information into the estimation of the finite population mean and total. Estimators are presented for the mean case only.

2.2.1 The generalized regression estimator

In many sampling situations, the population means of auxiliary variables are known. This information may not have been used in the sampling design, and it is highly desirable to incorporate this information into the estimation procedure. Among commonly used procedures, the generalized regression estimator (GR) (Cassel, Särndal and Wretman, 1976; Särndal, 1980) is the most general one in that the GR is easy to compute and can handle multiple auxiliary variables, continuous or discrete.

Suppose that the finite population was generated from an underlying superpopulation described by a linear regression model,
$$y_i = x_i'\theta + \varepsilon_i, \quad i = 1, \ldots, N, \tag{2.1}$$
where the $\varepsilon_i$'s are independent and identically distributed with $E_\xi(\varepsilon_i) = 0$, $V_\xi(\varepsilon_i) = \sigma^2$, and $\theta$ is the vector of unknown superpopulation parameters. A design-based estimator, $\hat{\theta}$, of the regression coefficients $\theta$ can be obtained using the sample observations $\{(y_i, x_i), i \in s\}$ (see Section 3.1 for detailed discussions). The fitted values of the $y_i$'s are $\hat{y}_i = x_i'\hat{\theta}$, $i = 1, \ldots, N$. The total prediction error from the model is
$$\sum_{i=1}^{N} e_i = \sum_{i=1}^{N} y_i - \sum_{i=1}^{N} \hat{y}_i,$$
which is itself a finite population total that can be estimated by a Horvitz-Thompson type estimator
$$\sum_{i \in s} d_i e_i = \sum_{i \in s} d_i y_i - \sum_{i \in s} d_i \hat{y}_i.$$
This yields
$$\hat{\bar{Y}}_{GR} = \frac{1}{N}\Big\{\sum_{i \in s} d_i y_i + \sum_{i=1}^{N} \hat{y}_i - \sum_{i \in s} d_i \hat{y}_i\Big\} = \hat{\bar{Y}}_{HT} + \{\bar{X} - \hat{\bar{X}}_{HT}\}'\hat{\theta},$$
where $\bar{X} = N^{-1}\sum_{i=1}^{N} x_i$ and $\hat{\bar{X}}_{HT} = N^{-1}\sum_{i \in s} d_i x_i$.

The generalized regression estimator can be motivated without appealing to a superpopulation model. For instance, it is a calibration estimator under a chi-square distance measure (Deville and Särndal, 1992; see also Section 2.2.2). However, the effectiveness of $\hat{\bar{Y}}_{GR}$ depends on how strongly the $y$ variable is linearly related to the $x$ variables. Note that the above construction of $\hat{\bar{Y}}_{GR}$ uses all the fitted values $\hat{y}_i = x_i'\hat{\theta}$ for $i = 1, 2, \ldots, N$, but the resulting estimator needs only the known $\bar{X}$ to be implemented.

The ratio estimator $\hat{\bar{Y}}_R = (\hat{\bar{Y}}_{HT} / \hat{\bar{X}}_{HT})\bar{X} = \hat{R}\bar{X}$ is a special case of the regression estimator in that it can be motivated along the same lines as before by assuming $y_i = \beta x_i + \varepsilon_i$, where $x$ is a univariate auxiliary variable and $\bar{X}$ is its finite population mean. Another commonly used procedure is poststratification, which can be considered as a special case of regression estimation in which the regression variables are indicator variables for the post strata (Särndal, Swensson and Wretman, 1992, p. 264).

The generalized regression estimator is asymptotically design-unbiased, and is very efficient, in the sense of having a small mean squared error, under the linear working model (2.1). It also possesses the very desirable property that, if we rewrite $\hat{\bar{Y}}_{GR}$ as a weighted average of the $y_i$'s in the sample, $\sum_{i \in s} w_i y_i$, the regression weights,
$$w_i = N^{-1} d_i \Big\{1 + (\bar{X} - \hat{\bar{X}}_{HT})'\Big[\sum_{i \in s} d_i (x_i - \bar{x})(x_i - \bar{x})'\Big]^{-1}(x_i - \bar{x})\Big\},$$
satisfy benchmark constraints, i.e., $\sum_{i \in s} w_i x_i = \bar{X}$.
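As a rough illustration of the construction above, the following Python sketch computes the Horvitz-Thompson and generalized regression estimators of the mean together with the implied regression weights. Several choices here are assumptions of the sketch rather than statements from the text: it uses NumPy, takes $\hat{\theta}$ to be the design-weighted least squares estimate (one possible design-based estimator; the details are deferred to Section 3.1), and obtains the weights by collecting the coefficient of each $y_i$ in the estimator. All function names are hypothetical.

```python
import numpy as np

def ht_mean(y_s, d_s, N):
    """Horvitz-Thompson estimator of the mean: N^{-1} * sum_s d_i y_i."""
    return float(np.sum(d_s * y_s) / N)

def greg_mean(y_s, x_s, d_s, x_bar_pop, N):
    """Generalized regression estimator of the mean.

    x_s is an (n, p) array of auxiliary values (include a column of ones
    for an intercept); x_bar_pop holds the p known population means.
    """
    # Design-weighted least squares estimate of theta (assumed choice).
    xtwx = x_s.T @ (d_s[:, None] * x_s)
    theta_hat = np.linalg.solve(xtwx, x_s.T @ (d_s * y_s))
    x_bar_ht = (d_s @ x_s) / N
    return float(ht_mean(y_s, d_s, N) + (x_bar_pop - x_bar_ht) @ theta_hat)

def greg_weights(x_s, d_s, x_bar_pop, N):
    """Weights w_i such that greg_mean equals sum_s w_i y_i."""
    xtwx = x_s.T @ (d_s[:, None] * x_s)
    x_bar_ht = (d_s @ x_s) / N
    lam = np.linalg.solve(xtwx, x_bar_pop - x_bar_ht)
    return d_s * (1.0 / N + x_s @ lam)
```

When $y$ is an exact linear function of $x$, `greg_mean` reproduces the population mean from any sample for which the weighted cross-product matrix is invertible, and `greg_weights` applied to `x_s` returns `x_bar_pop`, which is the benchmark property described above. Nothing prevents individual weights from being negative.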

2.2.2 The calibration estimator

One can also approach the use of auxiliary information directly by revising the basic design weights, $d_i$, to satisfy certain benchmark constraints. That is, the weighted sample sum of the auxiliary variables, using the revised weights, $w_i$, should equal the known population totals (or means) for the auxiliary variables. Deville and Särndal (1992) proposed a general method of deriving so-called calibration estimators by first choosing a distance measure $\Phi_s$ between the basic design weights and the revised calibration weights and then minimizing this distance subject to specified benchmark constraints. The most commonly used distance measure is the chi-square distance,
$$\Phi_s = \sum_{i \in s} \frac{(w_i - d_i)^2}{d_i q_i}, \tag{2.2}$$
where the $q_i$'s are known positive weights unrelated to $d_i$. The uniform weights $q_i = 1$ are used in most applications, but unequal weights can also be motivated as in Example 1 of Deville and Särndal (1992). The calibration estimator of $\bar{Y}$ is constructed as $\hat{\bar{Y}}_C = N^{-1}\sum_{i \in s} w_i y_i$, where the calibration weights, $w_i$, are chosen to minimize $\Phi_s$ subject to the constraint
$$\sum_{i \in s} w_i x_i = X, \tag{2.3}$$
with $X = N\bar{X}$ the known vector of population totals of the auxiliary variables. For the chi-square distance, the resulting calibration estimator is
$$\hat{\bar{Y}}_C = N^{-1}\sum_{i \in s} w_i y_i = \hat{\bar{Y}}_{HT} + (\bar{X} - \hat{\bar{X}}_{HT})'\hat{\beta}, \tag{2.4}$$
where $\hat{\bar{X}}_{HT} = N^{-1}\sum_{i \in s} d_i x_i$ and $\hat{\beta} = \{\sum_{i \in s} d_i q_i x_i x_i'\}^{-1}\sum_{i \in s} d_i q_i x_i y_i$.

Several interesting points are observed here. First, the motivation for calibration estimators does not require an assumed superpopulation model. Second, the calibration weights, $w_i$, give perfect estimates when applied to the auxiliary variables. Deville and Särndal (1992) argued that weights that perform well for the auxiliary variable also should perform well for the study variable. However, it is the implicit underlying assumption that $y$ and $x$ are linearly related that makes this a valid argument. For example, in the case of scalar $x$ with $x_i = (1, x_i)'$ used in (2.3), it is clear that $y_i = \beta_0 + \beta_1 x_i$ implies $\hat{\bar{Y}}_C = \bar{Y}$. If a curved relationship exists between $y$ and $x$, the so-constructed calibration estimator could be very inefficient. For instance, if $\log(y_i) \doteq \beta_0 + \beta_1 x_i$, then there is no compelling reason to use $\hat{\bar{Y}}_C$. Lastly, it is possible to choose a different distance measure, but the resulting calibration estimators are all asymptotically equivalent to the generalized regression estimator (Deville and Särndal, 1992).

2.2.3 The pseudo-empirical maximum likelihood estimator

The nonparametric empirical likelihood for independent random variables has been extensively studied by Owen (1988, 1990, 1991) and other subsequent authors. It has been shown that the empirical likelihood ratio statistics have limiting chi-square distributions in certain situations. Tests and confidence limits for parameters that can be expressed as functions of an unknown distribution function can be obtained through the empirical likelihood ratio statistics.

The use of empirical likelihood in the survey context was considered by Chen and Qin (1993). They show that, for simple random sampling without replacement, auxiliary information in the form of known population means or quantiles can be incorporated into the so-called empirical maximum likelihood estimator through proper constraints on the maximization of the empirical likelihood. They show that the empirical maximum likelihood estimator is asymptotically equivalent to the customary regression estimator. However, the idea of using a likelihood approach in surveys goes back to Hartley and Rao (1968), who considered what they termed the scale-load approach. By assuming the finite population characteristic is measured on a known scale with a finite set of scale points, they were able to write down the likelihood as a multidimensional hyper-geometric distribution and, in the limiting case, as a multinomial distribution for simple random sampling without replacement.
Recently, Chen and Sitter (1999) extended the empirical likelihood approach from simple random sampling to general sampling schemes through a pseudo-empirical likelihood approach. First, the whole finite population $\{y_i, i = 1, 2, \ldots, N\}$ can be viewed as iid observations from a certain underlying distribution, $F$. The corresponding empirical likelihood would then be $L(F) = \prod_{i=1}^{N} p_i$ with log-likelihood function
$$l(p) = \sum_{i=1}^{N} \log(p_i), \tag{2.5}$$
where $p_i = p(y_i)$ is the density or probability mass at observation $y_i$. To overcome the difficulty of not knowing $y_i$ for the entire population, they view the log-likelihood function $l(p)$ in (2.5) as a finite population total. A design-unbiased estimator of $l(p)$ is then available, namely
$$\hat{l}(p) = \sum_{i \in s} d_i \log(p_i), \tag{2.6}$$
where $d_i$ is the basic design weight and $E_p\{\sum_{i \in s} d_i \log(p_i)\} = \sum_{i=1}^{N} \log(p_i)$. Recall that $E_p$ denotes the expectation with respect to the sampling design. $\hat{l}(p)$ is termed the pseudo-empirical log-likelihood.

Auxiliary information in the form of known population means can be incorporated into the estimation of $\bar{Y}$ by using the Pseudo-empirical Maximum Likelihood Estimator (EL), $\hat{\bar{Y}}_{EL} = \sum_{i \in s} \hat{p}_i y_i$, where the $\hat{p}_i$'s maximize the pseudo-empirical log-likelihood $\hat{l}(p)$ subject to
$$\sum_{i \in s} p_i = 1, \quad \sum_{i \in s} p_i (x_i - \bar{X}) = 0 \quad (0 \le p_i \le 1). \tag{2.7}$$

One of the surprising facts about the pseudo-empirical maximum likelihood estimator is that $\hat{\bar{Y}}_{EL}$ is asymptotically equivalent to the generalized regression estimator $\hat{\bar{Y}}_{GR}$. It is surprising since the likelihood-type motivation underlying EL is so different from that of GR or calibration. On the other hand, it is not surprising if we look at the way auxiliary information is used here. The $\hat{p}_i$'s are the revised weights and the constraint $\sum_{i \in s} p_i (x_i - \bar{X}) = 0$ is identical to the calibration equation used in Section 2.2.2. We may also view $-\hat{l}(p) = -\sum_{i \in s} d_i \log(p_i) > 0$ as a distance measure between the $d_i$'s and the $p_i$'s (it is not a true distance measure, since $p_i = d_i$ for all $i$ does not imply $\hat{l}(p) = 0$). Thus, much like the calibration method, there is implicit use of a linear relationship between $y$ and $x$, and a regression type estimator is expected in such situations.

Another important feature of EL is that the weights, $\hat{p}_i$, are intrinsically positive.
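For a single auxiliary variable, the constrained maximization of (2.6) subject to (2.7) can be sketched in a few lines. Applying the Lagrange multiplier method gives weights of the form $\hat{p}_i = d_i^* / \{1 + \lambda (x_i - \bar{X})\}$ with $d_i^* = d_i / \sum_{i \in s} d_i$, where the scalar $\lambda$ solves $\sum_{i \in s} d_i^* (x_i - \bar{X}) / \{1 + \lambda (x_i - \bar{X})\} = 0$. The Python sketch below finds $\lambda$ by bisection, using the fact that the left-hand side is strictly decreasing in $\lambda$ on the interval where all denominators are positive. Bisection is a choice made for this illustration, not the algorithm developed in the thesis, and the function names are hypothetical.

```python
import numpy as np

def pel_weights(x_s, d_s, x_bar_pop, tol=1e-12, max_iter=200):
    """Pseudo-empirical likelihood weights for a scalar auxiliary variable."""
    u = np.asarray(x_s, dtype=float) - x_bar_pop
    d_star = np.asarray(d_s, dtype=float) / np.sum(d_s)
    # A solution with positive weights needs the known mean inside the sample range.
    if not (u.min() < 0.0 < u.max()):
        raise ValueError("x_bar_pop must lie strictly inside the range of x_s")

    def g(lam):
        return np.sum(d_star * u / (1.0 + lam * u))

    # All denominators 1 + lam * u_i are positive on the open interval (lo, hi);
    # g is strictly decreasing there, from +infinity down to -infinity.
    lo, hi = -1.0 / u.max(), -1.0 / u.min()
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    lam = 0.5 * (lo + hi)
    return d_star / (1.0 + lam * u)

def pel_mean(y_s, x_s, d_s, x_bar_pop):
    """Pseudo-empirical maximum likelihood estimator of the mean."""
    return float(np.sum(pel_weights(x_s, d_s, x_bar_pop) * np.asarray(y_s, dtype=float)))
```

The returned weights are positive, sum to one, and reproduce the known mean of $x$ up to the bisection tolerance. With a vector of auxiliary variables, $\lambda$ becomes a vector and a one-dimensional search like this no longer applies.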

Survey statisticians have long recognized that some estimation procedures can result in negative weights, which is very undesirable in some situations (see, for instance, Huang and Fuller, 1978 and Rao and Singh, 1997). The consequence could be, for example, a negative estimate for a population quantity known to be positive, or a non-monotonic estimated distribution function. We will address this issue further in subsequent chapters.

For multiple auxiliary variables, the computational difficulties associated with EL are not trivial. By using the Lagrange multiplier method, it can be shown that

\[ \hat{p}_i = \frac{d_i^*}{1 + \lambda'(x_i - \bar{X})}, \]

where $d_i^* = d_i / \sum_{i \in s} d_i$ and $\lambda$ is the solution to

\[ \sum_{i \in s} \frac{d_i^* (x_i - \bar{X})}{1 + \lambda'(x_i - \bar{X})} = 0. \]

Solving the above nonlinear system with a vector Lagrange multiplier $\lambda$ can be computationally awkward. A new partial solution to this problem is given in a later chapter.

Estimation of The Distribution Function

The finite population distribution function evaluated at $t$ is defined as the proportion of units with $y$-values less than or equal to $t$,

\[ F(t) = \frac{1}{N} \sum_{i=1}^{N} I_{[y_i \le t]}. \]

By replacing $y_i$ by $I_{[y_i \le t]}$, many of the estimators that were constructed for estimating the population mean can be used for estimating $F(t)$. For instance, the Horvitz-Thompson estimator for $F(t)$ is $\hat{F}_{HT}(t) = N^{-1} \sum_{i \in s} d_i I_{[y_i \le t]}$. When auxiliary information is available, some special care is needed in constructing estimators for the distribution function. With complete auxiliary information (i.e. $x_i$ known for $i = 1, 2, \ldots, N$), there are two leading estimators for the finite population distribution function: the design-based (model-assisted) difference estimator (Rao, Kovar and Mantel, 1990) and the model-based prediction estimator (Chambers and Dunstan, 1986).

The design-based difference estimator

The generalized regression estimator introduced earlier can be rewritten as a generalized difference estimator (GD),

\[ \hat{\bar{Y}}_{GD} = \frac{1}{N} \Big\{ \sum_{i \in s} d_i y_i + \sum_{i=1}^{N} \hat{\mu}_i - \sum_{i \in s} d_i \hat{\mu}_i \Big\}, \]

where $\hat{\mu}_i = x_i'\hat{\theta}$ is a design-based estimator of $\mu_i = E_\xi(y_i \mid x_i) = x_i'\theta$. A model-assisted difference estimator of $F(t)$ can be constructed by replacing $y_i$ by $I_{[y_i \le t]}$ and $\mu_i$ by $G_i = E_\xi(I_{[y_i \le t]} \mid x_i) = Pr(y_i \le t \mid x_i)$, and plugging in a proper estimate for $G_i$. This difference estimator was first proposed by Rao, Kovar and Mantel (1990). Under a simple linear regression working model

\[ y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, 2, \ldots, N, \tag{2.8} \]

we have $G_i = Pr\{\varepsilon_i \le t - \alpha - \beta x_i\} = G(t - \alpha - \beta x_i)$, where $G(\cdot)$ is the cumulative distribution function of the error term, $\varepsilon_i$, which can be estimated by an empirical distribution function using the fitted residuals. The resulting estimator of $F(t)$ is

\[ \hat{F}_d(t) = \frac{1}{N} \Big\{ \sum_{i \in s} \pi_i^{-1} I_{[y_i \le t]} + \sum_{i=1}^{N} \hat{G}_i - \sum_{i \in s} \pi_i^{-1} \hat{G}_{ic} \Big\}, \tag{2.9} \]

where

\[ \hat{G}_i = \Big\{ \sum_{k \in s} \pi_k^{-1} I_{[\tilde{\varepsilon}_k \le t - \tilde{\alpha} - \tilde{\beta} x_i]} \Big\} \Big/ \sum_{k \in s} \pi_k^{-1}, \qquad \hat{G}_{ic} = \Big\{ \sum_{k \in s} \frac{\pi_i}{\pi_{ik}} I_{[\tilde{\varepsilon}_k \le t - \tilde{\alpha} - \tilde{\beta} x_i]} \Big\} \Big/ \sum_{k \in s} \frac{\pi_i}{\pi_{ik}}, \tag{2.10} \]

$\tilde{\beta} = \sum_{i \in s} \pi_i^{-1} (x_i - \tilde{x})(y_i - \tilde{y}) / \sum_{i \in s} \pi_i^{-1} (x_i - \tilde{x})^2$, $\tilde{\alpha} = \tilde{y} - \tilde{\beta}\tilde{x}$, $\tilde{\varepsilon}_k = y_k - \tilde{\alpha} - \tilde{\beta} x_k$, $\tilde{x} = \sum_{i \in s} \pi_i^{-1} x_i / \sum_{i \in s} \pi_i^{-1}$, $\tilde{y} = \sum_{i \in s} \pi_i^{-1} y_i / \sum_{i \in s} \pi_i^{-1}$, and $\pi_i$, $\pi_{ij}$ are the first- and second-order inclusion probabilities. $\hat{G}_i$ is design-unbiased for $G_i$ and $\hat{G}_{ic}$ is conditionally design-unbiased for $G_i$ given that the $i$-th unit is selected in the sample.

Extension from the simple linear regression working model to a general regression model is straightforward. Note that a regression working model and complete auxiliary information are essential for the implementation of this estimator. The error cumulative distribution function, $G(\cdot)$, attached to the regression model, plays a key role in the derivation of the estimator. $\hat{F}_d(t)$ is asymptotically design-unbiased under a general sampling design and approximately model-unbiased under a working model such as (2.8). Godambe (1989) derived $\hat{F}_d(t)$ based on the model- and design-based optimum estimating function theory and showed that $\hat{F}_d(t)$ is robust against departures from the superpopulation model.

The model-based prediction estimator

The paper of Chambers and Dunstan (1986) motivated much of the later work in this area. In their model-based framework, $x$ and $y$ are assumed to follow a superpopulation model. Though the results can be extended to more complex models, for simplicity of presentation we will restrict attention to the simple linear regression model (2.8). Under model (2.8), the model-based estimator of $F(t)$ is given by

\[ \hat{F}_m(t) = \frac{1}{N} \Big\{ \sum_{i \in s} I_{[y_i \le t]} + \frac{1}{n} \sum_{j \in \bar{s}} \sum_{i \in s} I_{[y_i \le t - \hat{\beta}(x_j - x_i)]} \Big\}, \]

where $\hat{\beta} = \sum_{i \in s} (y_i - \bar{y})(x_i - \bar{x}) / \sum_{i \in s} (x_i - \bar{x})^2$. $\hat{F}_m(t)$ is asymptotically model-unbiased for $F(t)$. We will consider this estimator in detail in Chapter 6. A crucial point here is that $\hat{F}_m(t)$ is independent of the sampling design, as $(y_i, x_i)$ for $i = 1, 2, \ldots, N$ are viewed as independent sample values from superpopulation model (2.8) regardless of whether they belong to the set of sampled units, $s$, or to the set of nonsampled units, $\bar{s}$.

Supplementary remarks

The use of complete auxiliary information in estimating the finite population distribution function has attracted increased attention in recent literature. Several other estimators which incorporate knowledge of an auxiliary variable known for every unit in the finite population have also been proposed, and their performances examined and compared. See, for example, Chambers, Dorfman and Hall (1992), Kuk (1993), Silva and Skinner (1995) and Wang and Dorfman (1996).
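As a small illustration of the two simplest estimators of $F(t)$ reviewed in this section, the sketch below computes the Horvitz-Thompson estimator and the Chambers and Dunstan (1986) prediction estimator under the simple linear regression model (2.8). It is an illustrative sketch only: the function names `f_ht` and `f_cd` are hypothetical, a single auxiliary variable is assumed, and the slope is the ordinary least squares estimate from the sample.

```python
def f_ht(y_s, d_s, N, t):
    """Horvitz-Thompson estimator of F(t): design-weighted share of y_i <= t."""
    return sum(d for y, d in zip(y_s, d_s) if y <= t) / N


def f_cd(y_s, x_s, x_nonsampled, t):
    """Model-based prediction estimator of F(t) under y = alpha + beta*x + error.

    y_s, x_s      : values for the sampled units
    x_nonsampled  : x values for the units not in the sample
    """
    n = len(y_s)
    N = n + len(x_nonsampled)
    x_bar = sum(x_s) / n
    y_bar = sum(y_s) / n
    # Ordinary least squares slope computed from the sampled units.
    beta = (sum((y - y_bar) * (x - x_bar) for y, x in zip(y_s, x_s))
            / sum((x - x_bar) ** 2 for x in x_s))
    # Sampled units contribute their observed indicators.
    total = float(sum(1 for y in y_s if y <= t))
    # Each nonsampled unit j contributes the average, over sampled units i,
    # of the indicator that y_i shifted to x_j falls at or below t.
    for xj in x_nonsampled:
        total += sum(1 for y, x in zip(y_s, x_s) if y <= t - beta * (xj - x)) / n
    return total / N
```

For example, with an exactly linear population, `f_cd([2.0, 4.0, 6.0], [1.0, 2.0, 3.0], [4.0, 5.0], 8.0)` reproduces the true value of $F(8)$ for the population $y = 2x$, $x = 1, \ldots, 5$, whereas `f_ht` uses only the sampled $y$-values and their design weights.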

The model-based $\hat{F}_m(t)$ is model-unbiased but design-inconsistent. Rao, Kovar and Mantel (1990) demonstrate through simulation that the model-based $\hat{F}_m(t)$ has superior performance in small samples when the superpopulation model is correctly specified, but is much more vulnerable than $\hat{F}_d(t)$ to model misspecification and can perform poorly in large samples. Chambers, Dorfman and Hall (1992) carry out a theoretical comparison under simple random sampling and conclude that there is no clear winner. Whether one chooses to work under a model-based framework and use $\hat{F}_m(t)$ or a design-based framework and use $\hat{F}_d(t)$, variance estimation will need to be considered. We do this for $\hat{F}_d(t)$ in Section 5.2 and for $\hat{F}_m(t)$ in a later section.

A Discussion

We have presented the three main model-assisted estimation procedures for finite population means and totals. All of these methods have only been discussed in the context of a linear regression working model, and they essentially incorporate the auxiliary variables through their known population means even when the auxiliary variables are known for the entire population. The generalized regression estimator, $\hat{\bar{Y}}_{GR}$, has (2.1) as its base model, and the variance reduction from using $\hat{\bar{Y}}_{GR}$ over $\hat{\bar{Y}}_{HT}$ is directly related to the magnitude of the linear correlation coefficients between $y$ and $x$. The calibration estimator and the pseudo-empirical maximum likelihood estimator can be motivated from different perspectives without assuming a model. However, their effectiveness does rely on the implicit assumption that $y$ and $x$ are linearly related. They are both asymptotically equivalent to the GR, and calibrating directly on the $x$ variables requires a linear working model for its justification. To answer the fundamental question of how complete auxiliary information can be effectively used at the estimation stage, we need to use more sophisticated modeling.
It is the model structure (the relationship between $y$ and $x$) that determines how the auxiliary information should best be used. The $x$ variables do not necessarily provide direct information about population quantities of $y$; they provide relevant information through a model. We need a general approach that can handle any linear or nonlinear relationship between $y$ and $x$. Also, the approach should be model-assisted in that the resulting estimator should be asymptotically design-unbiased irrespective of the correctness of the working model, but particularly efficient when the working model is correctly specified.

The estimation of the finite population distribution function using complete auxiliary information needs special treatment. Although $F(t)$ can be viewed as a finite population mean for the indicator variable $I_{[y \le t]}$, estimators constructed for estimating $\bar{Y}$ may not be transplantable to the estimation of $F(t)$. Part of the reason is that, for example, a simple linear regression model assumed for $y$ and $x$ cannot be carried over to $I_{[y \le t]}$ and $x$, or to $I_{[y \le t]}$ and $I_{[x \le t]}$. The model must be used in its original form while we deal with the dichotomous variable $I_{[y \le t]}$. When a more complex working model is used, this issue becomes more prominent. We will discuss the various approaches in Chapter 3 under a general modeling process.

Chapter 3

The Effective Use of Complete Auxiliary Information Through Model-Calibration

In this chapter, we consider the use of more complex working models in obtaining model-assisted estimators by first generalizing the calibration method reviewed in Chapter 2. We term the approach model-calibration for reasons which will become readily apparent. We argue that, under a general modeling process, complete auxiliary information should be incorporated into the construction of estimators through fitted values. How to do this properly is fairly straightforward in the case of a GR (see Section 3.2.3) but not so for calibration. We introduce a general framework which is simple, and under which estimators for the population mean and total reduce to the usual estimators under a linear model.

Once this generalization is realized, some interesting relationships between a linear model and the use of complete auxiliary information become more obvious and are discussed. Also, some differences between the approaches become more distinct. For example, it has been noted that the calibration estimator reduces to a GR under a chi-square distance measure (Deville and Särndal, 1992), where an underlying linear regression model is used. This is no longer the case when the methods are generalized to nonlinear models, and the proposed model-calibration estimators perform better.

The proposed model-calibration estimators of the population mean and total can effectively handle any linear or non-linear models and reduce to the conventional calibration estimator (the generalized regression estimator) under a linear model. We then go on to similarly generalize the pseudo-empirical maximum likelihood estimator (Chen and Sitter, 1999) and show that it gives an estimator that is asymptotically equivalent to the model-calibration estimator but with positive weights, and therefore is preferred. Finite sample performance of these estimators is investigated through a limited simulation study.

First, the modeling issue is addressed in Section 3.1. Estimation of the population mean and total through model-calibration and pseudo-empirical likelihood is then introduced in Section 3.2. Special treatment for the distribution function under a general modeling process is given in Section 3.3. In Section 3.4, we propose a difference estimator and a regression-type estimator for the quantile process using complete auxiliary information and a general model. All the proofs are deferred to a later section.

3.1 Modeling

Assume the relationship between $y$ and $x$ can be described by a superpopulation model through the first and second moments,

\[ E_\xi(y_i \mid x_i) = \mu(x_i, \theta), \qquad V_\xi(y_i \mid x_i) = v_i^2 \sigma^2, \qquad i = 1, 2, \ldots, N, \tag{3.1} \]

where $\theta = (\theta_0, \ldots, \theta_p)'$ and $\sigma^2$ are unknown superpopulation parameters, $\mu(x, \theta)$ is a known function of $x$ and $\theta$, the $v_i$'s are known constants for given $x_i$'s, and $E_\xi$ and $V_\xi$ denote the expectation and variance with respect to the superpopulation model. We also assume that $(y_1, x_1), \ldots, (y_N, x_N)$ are mutually independent.
The model structure (3.1) is quite general and includes two very important cases:

(i) the linear or non-linear regression model,

\[ y_i = \mu(x_i, \theta) + v_i \varepsilon_i, \quad i = 1, 2, \ldots, N, \tag{3.2} \]

where the $\varepsilon_i$'s are independent and identically distributed random variables with $E_\xi(\varepsilon_i) = 0$ and $V_\xi(\varepsilon_i) = \sigma^2$, and $v_i = v(x_i)$ is a strictly positive known function of $x_i$ only;

(ii) the generalized linear model,

\[ g(\mu_i) = x_i'\theta, \qquad V_\xi(y_i \mid x_i) = v(\mu_i), \qquad i = 1, 2, \ldots, N, \tag{3.3} \]

where $\mu_i = E_\xi(y_i \mid x_i)$, $g(\cdot)$ is a link function and $v(\cdot)$ is a variance function.

Consider the estimation problem for the model parameters:

(a) When a model-based approach is employed (see Chapter 6), $\{(y_i, x_i), i \in s\}$ is viewed as an iid sample from the superpopulation; that is, randomization is with respect to the underlying distribution for the superpopulation, and sampling schemes are irrelevant. The superpopulation parameters, $\theta$, can then be estimated using standard procedures.

(b) Under the design-based framework, we need design-based estimates for the model parameters. Randomization is with respect to repeated sampling, the sample data obtained from a complex sampling scheme may not follow the same model structure as that of the whole finite population, and the superpopulation parameter $\theta$ may be meaningless or not interpretable from the design-based point of view. In this case, following Godambe and Thompson (1986), we replace $\theta$ by $\theta_N$, a model-based estimate of $\theta$ based on the data from the entire finite population. $\theta_N$ is itself a finite population quantity and can then be estimated by a design-based estimator, $\hat{\theta}$, from the sample data. The notion underlying this argument is as follows: when the superpopulation model is correct, $\theta$ and $\theta_N$ are usually very close to each other since $N$ is usually large, so an estimator of $\theta_N$ is essentially an estimator of $\theta$ when the purpose is to estimate $\theta$; when the superpopulation model is incorrect, $\theta_N$ is still a clearly defined finite population quantity and design-based inference is still valid.

For illustration, consider two important cases.

Case I. $\theta_N$ can be expressed explicitly as a function of population totals for properly defined population variables.
For example, suppose the superpopulation follows a homogeneous linear regression model and $\theta_N$ is defined as the vector of regression coefficients for the finite population: $\theta_N = (X_N' X_N)^{-1} X_N' Y_N$, where $X_N$ is the $N \times (p+1)$ matrix with rows $(1, x_i')$ for $i = 1, \ldots, N$ and $Y_N = (y_1, \ldots, y_N)'$. Note that

\[ \theta_N = \Big( \sum_{i=1}^{N} x_i x_i' \Big)^{-1} \sum_{i=1}^{N} x_i y_i, \]

where $x_i$ is understood here to include the constant term, and a design-based estimator $\hat{\theta}$ is obtained by plugging in design-based estimates for the various population totals in $\theta_N$:

\[ \hat{\theta} = \Big( \sum_{i \in s} d_i x_i x_i' \Big)^{-1} \sum_{i \in s} d_i x_i y_i = (X_n' D X_n)^{-1} X_n' D Y_n, \]

where $D = \mathrm{diag}(d_1, \ldots, d_n)$ and the $d_i$'s are the basic design weights, with $X_n$ and $Y_n$ in obvious notation.

Case II. $\theta_N$ is defined by estimating equations. Suppose that the generalized linear model (3.3) is assumed. We define $\theta_N$ as the maximum quasi-likelihood estimator of $\theta$ based on the entire finite population, i.e., the solution of the estimating equations (Molina and Skinner, 1992)

\[ \sum_{i=1}^{N} X_i \big[ g^{(1)}\{\mu(x_i, \theta)\} \, v\{\mu(x_i, \theta)\} \big]^{-1} [y_i - \mu(x_i, \theta)] = 0, \tag{3.4} \]

where $X_i = (1, x_i')'$ and $g^{(1)}(u) = dg(u)/du$. The estimating functions on the left hand side of (3.4) are population totals, and $\hat{\theta}$ is defined as the solution of the design-based sample version of (3.4), i.e., the solution of the following estimating equations:

\[ \sum_{i \in s} d_i X_i \big[ g^{(1)}\{\mu(x_i, \theta)\} \, v\{\mu(x_i, \theta)\} \big]^{-1} [y_i - \mu(x_i, \theta)] = 0. \]

The estimate $\hat{\theta}$ is then obtained by standard Newton-Raphson iterative procedures,

\[ \theta^{(m+1)} = \theta^{(m)} + \delta \, \big( X_n' G^{-1} W^{-1} G^{-1} X_n \big)^{-1} X_n' G^{-1} W^{-1} (Y_n - \mu_n) \Big|_{\theta = \theta^{(m)}}, \]

where $G = \mathrm{diag}(g^{(1)}(\mu_1), \ldots, g^{(1)}(\mu_n))$, $W = \mathrm{diag}(\pi_1 v(\mu_1), \ldots, \pi_n v(\mu_n))$, $\mu_n = (\mu_1, \ldots, \mu_n)'$ and $\mu_i = \mu(x_i, \theta)$. $\delta \in (0, 1)$ is a pre-chosen constant to accelerate the convergence.
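The two cases above can be sketched in a few lines of code. This is a minimal sketch under simplifying assumptions, not the implementation used in the thesis: a single covariate plus an intercept (so the linear systems are 2 by 2), design weights $d_i = 1/\pi_i$ supplied directly, and hypothetical function names `solve_2x2`, `design_weighted_ls` and `design_weighted_glm`. The iteration in the second function is the damped update displayed above, written with sums instead of matrices.

```python
import math


def solve_2x2(a, b):
    """Solve the 2x2 linear system a z = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def design_weighted_ls(x, y, d):
    """Case I: theta_hat = (X'DX)^{-1} X'Dy with rows X_i = (1, x_i)."""
    sxx = [[sum(d), sum(di * xi for di, xi in zip(d, x))],
           [sum(di * xi for di, xi in zip(d, x)),
            sum(di * xi * xi for di, xi in zip(d, x))]]
    sxy = [sum(di * yi for di, yi in zip(d, y)),
           sum(di * xi * yi for di, xi, yi in zip(d, x, y))]
    return solve_2x2(sxx, sxy)


def design_weighted_glm(x, y, d, inv_link, g1, v, theta0,
                        delta=0.5, tol=1e-10, max_iter=200):
    """Case II: damped iteration for the design-weighted estimating equations.

    inv_link maps the linear predictor to mu, g1 is the derivative of the
    link function g, and v is the variance function; delta is the step factor.
    """
    theta = list(theta0)
    for _ in range(max_iter):
        info = [[0.0, 0.0], [0.0, 0.0]]   # X' G^{-1} W^{-1} G^{-1} X
        score = [0.0, 0.0]                # X' G^{-1} W^{-1} (Y - mu)
        for xi, yi, di in zip(x, y, d):
            mu = inv_link(theta[0] + theta[1] * xi)
            row = (1.0, xi)
            w_score = di / (g1(mu) * v(mu))
            w_info = di / (g1(mu) ** 2 * v(mu))
            for r in range(2):
                score[r] += w_score * row[r] * (yi - mu)
                for c in range(2):
                    info[r][c] += w_info * row[r] * row[c]
        step = solve_2x2(info, score)
        theta = [theta[k] + delta * step[k] for k in range(2)]
        if max(abs(s) for s in step) < tol:
            break
    return theta
```

For a log link with variance function $v(\mu) = \mu$, one might call `design_weighted_glm(x, y, d, math.exp, lambda m: 1.0 / m, lambda m: m, theta0)`, taking `theta0` from `design_weighted_ls` applied to $\log y_i$ when all $y_i$ are positive. The starting value and the choice `delta=0.5` are illustrative assumptions rather than recommendations from the source.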
