Comments on "Design-Based Prediction Using Auxiliary Information under Random Permutation Models" (by Wenjun Li) (5/21/03) Ed Stanek

Here are comments on the Draft Manuscript. They are all suggestions that you can consider.

General Comment.

Detailed Comments.

Page 1, lines 3-4. Change wording from:

permutation probability underlying SRS, and the joint permutation of response and auxiliary variables is modeled using seemingly unrelated regression. We use Royall's linear least square

To:

permutation probability underlying SRS. The joint permutation of response and auxiliary variables is modeled using seemingly unrelated regression. We use Royall's linear least square

Page 1, line 6. I think the model we use will be called a super-population model by others. Change:

variables. The predictors have similar functional form to those derived using design-based, model-assisted and calibration approaches, but depend on neither superpopulation nor regression model assumptions.

To:

variables. The predictors have similar functional form to those derived using design-based, model-assisted and calibration approaches, but arise directly from the sample design and do not require additional regression model assumptions.

Page 2, 1st paragraph. I'd suggest re-working this. From:

In sample survey research, auxiliary information such as gender, age, income and chronic disease-bearing history are completely or partially known. Such information can be used to assist sampling design and improve estimation. Methods of improving estimation with auxiliary information have been discussed in numerous occasions in both design-based (Cassel, Särndal and Wretman 1977; Cochran 1977; Särndal and Wright 1984; Deville and Särndal 1992; Särndal, Swensson and Wretman 1992) and prediction-based (Bolfarine and Zacks 1992; Valliant, Dorfman and Royall 2000). The two approaches are distinct by their choice of probabilistic model for inference.

To:

Estimation can be improved by accounting for auxiliary information such as gender, age, income and chronic disease-bearing history that may be completely or partially known in a population. Methods of improving estimation with auxiliary information have been discussed on numerous occasions in both design-based (Cassel, Särndal and Wretman 1977; Cochran 1977; Särndal and Wright 1984; Deville and Särndal 1992; Särndal, Swensson and Wretman 1992) and prediction-based (Bolfarine and Zacks 1992; Valliant, Dorfman and Royall 2000) settings. The two approaches are distinguished by different probability models and assumptions.

Page 2, 2nd paragraph. I think this needs re-working. Suggest change from:

The design-based approach uses probability sampling for both sample selection and inference from sample data. The probability distribution associated with the randomized sample selection provides the basis for probabilistic inferences about the target population. Common design-based approach of using known auxiliary information include ratio estimator and regression estimators (Brewer 1963; Royall 1970; Cochran 1977). The bias, variance and mean squared error (MSE) are defined in terms of the expectation over all possible samples under the sampling design, and thus the inference of design-based approach is often referred as unconditional. This approach leads to valid repeated sampling inferences regardless of the population structure, and is free from model misspecification (Horvitz and Thompson 1952; Godambe 1955; Cassel, Särndal and Wretman 1977).

To:

The design-based approach uses the sampling design probabilities and additional model assumptions to develop estimators. Linear regression models may be assumed between the response and auxiliary variable (Brewer 1963; Royall 1970; Cochran 1977), or non-linear functions may be defined as ratios of response and auxiliary variables. Estimators minimize the expected mean squared error (MSE), which is defined in terms of the expectation over all possible samples under the sampling design. This approach leads to valid repeated sampling inferences regardless of the population structure, and is free from the model misspecification (Horvitz and Thompson 1952; Godambe 1955; Cassel, Särndal and Wretman 1977). [I don't understand this statement. We have to specify a regression model. Why is it free of model misspecification?]

Page 2. Line 3 continuing to Page 3. Change and insert the following:

From:

In model-assisted approach, plausible population models are often used to choose efficient estimators with good design-based properties, but the sample selection is based on randomization and statistical properties of estimators are computed with respect to the probability sampling distribution (Särndal, Swensson and Wretman 1992; Rao 1999).

To:

In model-assisted approach, efficient estimators with good design-based properties are developed for plausible population models, with efficiency determined from the probability sampling distribution (Särndal, Swensson and Wretman 1992; Rao 1999). The model is specified as an additive response error model with functional (usually linear) relationship between response and auxiliary variables on an elementary unit. This approach provides a formal framework for using auxiliary information at the estimation stage. [Is this your understanding? If so, why do you call them population models?]

Page 3, line 2. You refer to estimation stage. I think the manuscript should only talk about estimation, not design.

This approach provides a formal framework for using auxiliary information at the estimation stage. The popular generalized regression estimator (GREG) is an example of this type (Cassel, Särndal and Wretman 1976; Särndal, Swensson and Wretman 1992).

Page 3, line. Change wording:

From:

Prediction- or model-based approach assumes that the target population follows a specified (superpopulation) model, the source of randomness is only attributable to the model, thus the model distribution yields inferences conditional on the particular sample (Rao 1997; Brewer 1999; Valliant, Dorfman and Royall 2000). A population is typically considered as a realization of such superpopulation. The bias, variance and mean squared error of the predictor are defined in terms of the expectation over all possible realizations of a stochastic model that connects the variable of interest to a set of auxiliary variables (Brewer 1995). The best linear unbiased predictor (BLUP) is a typical model-based estimator (Ghosh and Rao 1994; Rao 1997). Model-dependent methods can perform poorly as evaluated by the expected value of the model based MSE over the design in large samples if the model is misspecified (Hansen, Madow and Tepping 1983; Rao 1997).

To:

Prediction- or model-based approach assumes that the target population is a realization of a superpopulation defined by a superpopulation model (Rao 1997; Brewer 1999; Valliant, Dorfman and Royall 2000). Inference is conditional on the sample, and focuses on predicting functions of the unobserved random variables in the potentially realized population. The bias, variance and mean squared error of the predictor are defined in terms of the expectation over the superpopulation model (Brewer 1995). The best linear unbiased predictor (BLUP) is a typical model-based estimator (Ghosh and Rao 1994; Rao 1997). Model-dependent methods can perform poorly when evaluated by the expected value (with respect to the design) of the model based MSE if the superpopulation model is misspecified (Hansen, Madow and Tepping 1983; Rao 1997).

Page 3 to 4. Last paragraph. I don't think it is a prediction-based approach that is appealing, but a model based approach. The idea of prediction is somewhat unique to sampling, and most people consider estimation, not prediction, more important. The key is that they don't see estimation as prediction of the random variables not realized. The model based methods in the absence of a finite population are easier since they are not encumbered by the detailed structure of the population, such as actual distributions of ancillary variables. I distinguish model based from prediction based methods, since model based does not require a population at all, just a model. Prediction based (for me) implies the idea of a target to be predicted. For many, prediction also does not require a population to be conceptualized.

Prediction-based approach appeals to many statisticians who practice primarily in fields other than survey statistics because the prediction-based estimators generally have regression-like presentation, and inference is similar to those regression methods used in mainstream statistics. In addition, prediction-based approach appears to provide an easier platform for adapting a rich collection of estimation techniques developed in regression model researches, such as generalized linear mixed models (citation, (Rao 2003)). This feature is especially pronounced in application of generalized linear mixed models in small area estimations (Rao 2003), which seems impossible in design-based framework otherwise.

Page 4. Re-wording second paragraph. Change:

An intrigue question to survey statistician is whether and how the rich collection of estimation techniques in prediction-based method can be adapted in a design-based framework and whether the design-based estimators can be communicated with a presentation that are parallel to prediction-based predictors. To answer this, this paper illustrates a method that applies common estimation techniques used in prediction-based approach under a simple design-based framework. More specifically, we develop a design-based prediction method of using auxiliary information in estimation under simple random sampling without replacement (SRS). This method makes use of the random permutation probability underlying SRS, and requires no additional assumptions. It incorporates known auxiliary information through simple transformation of the auxiliary variables. As an example, we show how this method can be applied to derive best linear unbiased predictor (BLUP) of the population total of a response variable.

To:

An intriguing question is whether prediction based methods can be used in a design-based framework. Some work in this area has been done by Brewer ( ). Random permutation superpopulation models, as developed by Rao and Bellhouse (1978), provide a link when estimating the population mean in simple random sampling, or two stage sampling settings. However, additional model assumptions are needed to account for ancillary variables. We present a design-based approach that uses prediction theory methods to estimate the mean (or total) with auxiliary variables under simple random sampling. The method frames the problem similarly to a seemingly unrelated regression problem with an underlying random permutation model. No additional model assumptions are required. Known auxiliary information is incorporated through simple transformation of the auxiliary variables. We illustrate the methods in an example.

Page 4. line -3. Change From:

Let a finite population $P$ consists of $N$ labeled subjects, $s = 1, 2, \ldots, N$, where $N$ is

To:

Let a finite population $P$ consists of $N$ labeled subjects, $s = 1, 2, \ldots, N$, where $N$ is

Page 5. Line 4. Change From:

where $y^{(k)} = (y_1^{(k)} \cdots y_s^{(k)} \cdots y_N^{(k)})'$ is an $N \times 1$ column vector for the $k$-th response; $y_s = (y_s^{(0)}\ y_s^{(1)} \cdots y_s^{(p)})'$ is a $(p+1) \times 1$ column vector for subject $s$. Further, the population values are alternatively represented as an $N(p+1) \times 1$ column vector $z$, such that $z = \mathrm{vec}(y)$.

To:

where $y^{(k)} = (y_1^{(k)} \cdots y_s^{(k)} \cdots y_N^{(k)})'$ is an $N \times 1$ column vector for the $k$-th response. We summarize the population values as $z = \mathrm{vec}(y)$, an $N(p+1) \times 1$ column vector.

Page 5, line 5. Change from:

Population parameters for the mean, total and variance for the $k$-th variate are given by $\mu^{(k)} = \frac{1}{N}\mathbf{1}_N' y^{(k)}$, $T^{(k)} = \mathbf{1}_N' y^{(k)}$, $\sigma_k^2 = \frac{1}{N-1}\, y^{(k)\prime} P\, y^{(k)}$, where $P = I_N - \frac{1}{N} J_N$, $I_N$ is an $N \times N$ dimensional identity matrix and $J_N$ is an $N \times N$ matrix of ones. The variance-covariance matrix of the $p+1$ variates are summarized as

$$\Sigma_{(p+1)\times(p+1)} = \begin{pmatrix} \sigma_0^2 & \sigma_{01} & \cdots & \sigma_{0p} \\ \sigma_{10} & \sigma_1^2 & \cdots & \sigma_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p0} & \sigma_{p1} & \cdots & \sigma_p^2 \end{pmatrix}.$$

To:

Population parameters for the mean and total of the $k$-th variate are given by $\mu^{(k)} = \frac{1}{N}\mathbf{1}_N' y^{(k)}$ and $T^{(k)} = \mathbf{1}_N' y^{(k)}$. The covariance of the $k$-th and $k^*$-th variates is defined as $\sigma_{kk^*} = \frac{1}{N-1}\, y^{(k)\prime} P\, y^{(k^*)}$, where $P = I_N - \frac{1}{N} J_N$, $I_N$ is an $N \times N$ dimensional identity matrix and $J_N$ is an $N \times N$ matrix of ones. The variance-covariance matrix of the $p+1$ variates is given by $\Sigma_{(p+1)\times(p+1)}$, where

$$\Sigma_{(p+1)\times(p+1)} = \begin{pmatrix} \sigma_0^2 & \sigma_{01} & \cdots & \sigma_{0p} \\ \sigma_{10} & \sigma_1^2 & \cdots & \sigma_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p0} & \sigma_{p1} & \cdots & \sigma_p^2 \end{pmatrix}.$$

Page 5, line -4. Add. I think it should appear in the text the first time in addition to the abstract. Change:

Suppose a sample of size $n$ is selected via SRS from a finite population of known size $N$. We assume that the parameter of interest is $\mu^{(0)}$, and that $\mu^{(k)}$, $k = 1, \ldots, p$, is known. We represent sampling with a random permutation model. To do so, we define a set of indicator random variables $U_{is}$, $i = 1, 2, \ldots, N$, that have a value of 1 if the subject in the $i$-th position in a permutation is subject $s$, and 0 otherwise. Let the matrix $U = (U_1\ U_2 \cdots U_N)'$ represent a matrix of indicator random variables, where $U_i = (U_{i1}\ U_{i2} \cdots U_{iN})'$. When all permutation are equally likely (consequence of SRS), $E(U) = \frac{1}{N} J_N$ and $\mathrm{cov}(\mathrm{vec}(U)) = \frac{1}{N-1}\, P \otimes P$. The matrix of random variables representing a joint permutation of $p+1$ variables is given by $Y_{N\times(p+1)} = U\, y_{N\times(p+1)}$, where $Y = (Y^{(0)}\ Y^{(1)} \cdots Y^{(k)} \cdots Y^{(p)})$, $Y^{(k)}$ is the random variable corresponding to variable $y^{(k)}$, $k = 0, 1, \ldots, p$. For simplicity, we denote $\mathrm{vec}(Y) = (I_{p+1} \otimes U)\, \mathrm{vec}(y)$. It is shown that $E(\mathrm{vec}(Y)) = (I_{p+1} \otimes \mathbf{1}_N)\, \mu$ and $\mathrm{cov}(\mathrm{vec}(Y)) = \Sigma \otimes P$, where $\Sigma$ is the population variance-covariance matrix of the $p+1$ variables defined above.

To:

Suppose a sample of size $n$ is selected via simple random sampling without replacement (SRS) from a finite population of known size $N$. We represent the sample as the first $n$ units in a random permutation of population units. SRS occurs when each permutation is equally likely. The random variables corresponding to the bivariate values of the permuted units constitute the random permutation model. We formalize these definitions by defining a set of indicator random variables $U_{is}$, $i = 1, 2, \ldots, N$, that have a value of 1 if the subject in the $i$-th position in a permutation is subject $s$, and 0 otherwise. Let the matrix $U = (U_1\ U_2 \cdots U_N)'$ represent a matrix of indicator random variables, where $U_i = (U_{i1}\ U_{i2} \cdots U_{iN})'$. Then the matrix of random variables representing a joint permutation of $p+1$ variables is given by $Y_{N\times(p+1)} = U\, y_{N\times(p+1)}$, where $Y = (Y^{(0)}\ Y^{(1)} \cdots Y^{(k)} \cdots Y^{(p)})$, and $Y^{(k)}$ is the random variable corresponding to variable $y^{(k)}$, $k = 0, 1, \ldots, p$. For simplicity, we partition $Y$ into the response vector and the auxiliary vector, such that $\mathrm{vec}(Y) = (Y^{(0)\prime}\ Y^{(\cdot)\prime})'$, where $Y^{(\cdot)} = (Y^{(1)\prime}\ Y^{(2)\prime} \cdots Y^{(p)\prime})'$.

Expressions for the mean and variance can be developed using the properties of the indicator random variables. Since all permutations are equally likely, $E(U) = \frac{1}{N} J_N$ and $\mathrm{cov}(\mathrm{vec}(U)) = \frac{1}{N-1}\, P \otimes P$. Using these results, and using the expansion $\mathrm{vec}(Y) = (I_{p+1} \otimes U)\, \mathrm{vec}(y)$, we can show that $E(\mathrm{vec}(Y)) = (I_{p+1} \otimes \mathbf{1}_N)\, \mu$ and $\mathrm{cov}(\mathrm{vec}(Y)) = \Sigma \otimes P$. Through proper rearrangement, the random variables can be partitioned into a sample and remaining portion. The sample portion corresponds to the random variables in the first $n$ positions (rows) of $Y$.

[WENJUN, you need to write this out formally. The partition matrix should be introduced. You may do this in an appendix if it is too complex. I also think you need to provide some details, or at least explain how you will get the various variance terms. You may want to introduce some simpler notation. For example, let $Y_I = (Y_I^{(0)\prime}\ Y_I^{(\cdot)\prime})'$ and define $Y_{II}$ similarly so that

$$\mathrm{var}\begin{pmatrix} Y_I \\ Y_{II} \end{pmatrix} = \begin{pmatrix} V_I & V_{I,II} \\ V_{II,I} & V_{II} \end{pmatrix}.]$$
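
The moment results quoted above can be checked numerically for a small population by enumerating every permutation. The following is a minimal sketch, not part of the manuscript: the toy population `y` and all variable names are hypothetical, and it assumes the definitions $P = I_N - J_N/N$ and $\sigma_{kk^*} = y^{(k)\prime} P\, y^{(k^*)}/(N-1)$ together with column-stacking for vec.

```python
import itertools
import numpy as np

# Hypothetical toy population: N = 4 subjects, p + 1 = 2 variables
# (column 0 = response, column 1 = one auxiliary variable).
y = np.array([[3.0, 1.0],
              [5.0, 2.0],
              [4.0, 2.5],
              [9.0, 6.0]])
N, q = y.shape  # q = p + 1

P = np.eye(N) - np.ones((N, N)) / N   # P = I_N - J_N / N
Sigma = y.T @ P @ y / (N - 1)         # population variance-covariance matrix
mu = y.mean(axis=0)                   # vector of population means


def vec(A):
    """Stack the columns of A into a single vector."""
    return A.reshape(-1, order="F")


# Enumerate all N! permutations; each is equally likely under SRS.
U_all = []
for perm in itertools.permutations(range(N)):
    U = np.zeros((N, N))
    U[np.arange(N), list(perm)] = 1.0  # U[i, s] = 1 if position i holds subject s
    U_all.append(U)
U_all = np.array(U_all)

# Exact moments over the permutation distribution (bias=True divides by N!).
EU = U_all.mean(axis=0)
cov_vecU = np.cov(np.array([vec(U) for U in U_all]), rowvar=False, bias=True)

vecY = np.array([vec(U @ y) for U in U_all])
E_vecY = vecY.mean(axis=0)
cov_vecY = np.cov(vecY, rowvar=False, bias=True)

print("E(U) = J/N:", np.allclose(EU, np.ones((N, N)) / N))
print("cov(vec U) = (P kron P)/(N-1):", np.allclose(cov_vecU, np.kron(P, P) / (N - 1)))
print("E(vec Y) = (I kron 1_N) mu:",
      np.allclose(E_vecY, np.kron(np.eye(q), np.ones((N, 1))) @ mu))
print("cov(vec Y) = Sigma kron P:", np.allclose(cov_vecY, np.kron(Sigma, P)))
```

Full enumeration is only feasible for very small $N$, but it gives the reader a concrete way to check the stated expectations and covariances without relying on "it can be shown".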

Page 7. Line. Add: Wenjun, do you really want to assume that $\mu^{(k)}$ is known here? I also think that you should spend some more time on motivating this. The parameter of interest is non-stochastic. You need to express the parameter as a linear combination of random variables. This is what allows you to view the problem as a prediction problem, conditional on the observed sample. This is not automatic in other literature, so you should spell it out.

We assume that the parameter of interest is $\mu^{(0)}$, and that $\mu^{(k)}$, $k = 1, \ldots, p$, is known.

Page 7, line 6 etc. I'm not sure you have the correct organization for the ideas. I think it may be easier to understand if you present the random permutation model followed by the seemingly unrelated regression model. With the seemingly unrelated regression model, then present the parameters of interest (and discuss them). Finally, talk about the partitioning and the sampling. This would then lead into the estimation section.

In the estimation section, don't refer to Royall's result. Instead, develop your result using the steps that Royall followed, and state that your development is parallel to that of Royall. I have had referees state that Royall's theorem doesn't apply to our setting (although I don't see why not).

Avoid statements like "it can be shown". Instead, provide some guidance as to how you showed it. The reviewers are not stupid, but some of the stuff you did is complex. You may make them feel stupid if you don't provide guidance. You don't want them just to believe you; you want them to have enough information to check your results. It is OK to refer to your thesis for more details.
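
The request to express the non-stochastic parameter as a linear combination of random variables can be sketched as follows. This is only an illustration under assumed notation: $Y_I^{(0)}$ denotes the first $n$ rows of $Y^{(0)}$ (the sample) and $Y_{II}^{(0)}$ the remaining $N-n$ rows, which may differ from the symbols used in the manuscript. Because every permutation contains each subject exactly once, $\mathbf{1}_N' U = \mathbf{1}_N'$, so

```latex
\mu^{(0)} = \frac{1}{N}\mathbf{1}_N' y^{(0)}
          = \frac{1}{N}\mathbf{1}_N' U y^{(0)}
          = \frac{1}{N}\mathbf{1}_N' Y^{(0)}
          = \frac{1}{N}\left(\mathbf{1}_n' Y_I^{(0)} + \mathbf{1}_{N-n}' Y_{II}^{(0)}\right)
```

Once the sample is observed, the first term in the last expression is realized, so estimating $\mu^{(0)}$ amounts to predicting the linear combination $\mathbf{1}_{N-n}' Y_{II}^{(0)}$ of the unobserved random variables.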

I would like to see more discussion of equation 9, and a re-expression of this equation as predicting the unobserved random variables. In its present form, this is obscure. I think it is important.

I haven't yet read after page 9, but there are a fair number of things for you to work on prior to that. I hope these comments are helpful. I'm gone for 1 week, but will be back after that.

Ed
