Random permutation models with auxiliary variables. Design-based random permutation models with auxiliary information. Wenjun Li


Running head: Random permutation models with auxiliary variables

Design-based random permutation models with auxiliary information

Wenjun Li
Division of Preventive and Behavioral Medicine, University of Massachusetts Medical School, Shaw Building SH2-230, 55 Lake Avenue North, Worcester, MA 01655, USA
Telephone: (508) Fax: (508)

Edward J. Stanek
Department of Public Health, University of Massachusetts, 715 N. Pleasant Street, Amherst, MA 01003, USA
Telephone: (413) Fax: (413)

Julio da Motta Singer
Departamento de Estatística, Universidade de São Paulo, Caixa Postal 66281, São Paulo, SP, Brazil
Phone: Fax:

LiW_06.doc 9/27/2006 3:03 PM Page 1 of 15

Abstract

We extend the random permutation model proposed by Stanek, Singer and Lencina (2004) to obtain best linear unbiased estimators of a finite population mean under simple random sampling without replacement in situations where auxiliary information is available. The procedure provides a systematic design-based justification for well-known results involving common estimators and may serve as the basis for extending such a minimum assumption theory to more complicated sample designs.

Keywords: auxiliary variable; design-based inference; prediction; finite sampling; random permutation model; simultaneous permutation

1. Introduction

Improvements in the precision of estimates of population parameters based on random samples can be made by accounting for auxiliary information (e.g., age, gender, etc.). Many estimators with such features have been proposed, but they either require assumptions beyond those pertaining to the sample design or lack an integrated theory. For example, model-based approaches (Ghosh and Rao 1994, Rao 1997) that generate best linear unbiased predictors (BLUP) ignore the sample design but require a postulated model. Additional superpopulation model assumptions are required for model-assisted approaches that lead to generalized regression (GREG) estimators (Särndal, Swensson and Wretman 1992). Calibration estimators, on the other hand, optimize benchmark weights by adjusting them to known population quantities on some set of auxiliary variables, but lack an integrated theory. Other design-based approaches consider the finite population as a sample realization from an infinite population, and thus make additional assumptions beyond the sampling design (Fuller 2002).

We develop a design-based estimator of a linear function of the response that accounts for auxiliary information and requires no assumptions beyond those defining simple random sampling. The development extends the use of the random permutation model (Stanek and Singer 2004, Stanek, Singer and Lencina 2004) to account for auxiliary variables. Under this minimal assumption setup, the results establish that the commonly used estimator (Cochran 1977) is
BLUE. In addition, the development highlights novel ideas that emerge from the random permutation model framework. The first is the expression of population parameters as sums of random variables. Another is the classification of the underlying random variables into those that will be realized and those that need to be predicted.

This paper is organized as follows. We first present definitions and notation, and introduce the random permutation model. We next include multiple auxiliary variables, and use the model to derive the best linear unbiased estimator (BLUE) of the population mean. We conclude with an example and discussion.

2. A Design-Based Model for Simple Random Sampling

We represent sampling formally by a set of indicator random variables whose partial realization specifies a selected sample. These random variables permute the units in the population, and hence we refer to the underlying stochastic model as a random permutation model. Elements of the population, including the response of interest and auxiliary variables, are non-stochastic, but not necessarily observed. The population is represented as a vector of random variables. We use the stochastic model for these random variables to develop an optimal estimator of a parameter defined in the population, assuming the population mean is known for the auxiliary variables. Unlike
previous work (Fuller 2002), our definition does not require the population to be a random sample from some infinite population.

Let the population consist of $N$ subjects, indexed by noninformative labels $s = 1, 2, \ldots, N$. A non-stochastic, potentially observable vector $\mathbf{z}_s = ((z_{ks}))$, $k = 0, 1, \ldots, p$, is associated with subject $s$, where $z_{0s} = y_s$ denotes the outcome of interest, and $z_{ks} = x_{ks} - \mu_k$ for $k = 1, \ldots, p$ denote auxiliary variables (centered at zero), with $\mathbf{x}_s = ((x_{ks}))$ and $\boldsymbol{\mu}_x = ((\mu_k)) = \frac{1}{N}\sum_{s=1}^{N}\mathbf{x}_s$. The mean of the auxiliary variables is assumed known in the population. We represent the population vector of means by $\boldsymbol{\mu}_z = (\mu_y \;\; \mathbf{0}_p')'$, where $\mu_y = \frac{1}{N}\sum_{s=1}^{N} y_s$, and the population variance by

$$\boldsymbol{\Sigma} = \frac{1}{N-1}\sum_{s=1}^{N}(\mathbf{z}_s - \boldsymbol{\mu}_z)(\mathbf{z}_s - \boldsymbol{\mu}_z)' = \begin{pmatrix} \sigma_y^2 & \boldsymbol{\sigma}_{yX}' \\ \boldsymbol{\sigma}_{yX} & \boldsymbol{\Sigma}_X \end{pmatrix},$$

where $\boldsymbol{\sigma}_{yX} = (\sigma_{yx_1} \;\; \sigma_{yx_2} \;\; \cdots \;\; \sigma_{yx_p})'$ and $\boldsymbol{\Sigma}_X = ((\sigma_{x_k x_{k^*}}))$.

We define the random permutation model as the set of all possible equally likely permutations of subjects in the population. Following Stanek, Singer and Lencina (2004), we explicitly define a set of indicator random variables $U_{is}$, $i, s = 1, 2, \ldots, N$, that have a value of one if subject $s$ is in position $i$ in a permutation, and zero otherwise. Using this notation, we define $\mathbf{Z}_i = \sum_{s=1}^{N} U_{is}\mathbf{z}_s = \mathbf{z}'\mathbf{U}_i$, where $\mathbf{U}_i = (U_{i1} \;\; U_{i2} \;\; \cdots \;\; U_{iN})'$ and $\mathbf{z} = (\mathbf{z}_1 \;\; \mathbf{z}_2 \;\; \cdots \;\; \mathbf{z}_N)'$, and
$\mathbf{Z} = \mathbf{U}\mathbf{z}$, where $\mathbf{U} = (\mathbf{U}_1 \;\; \mathbf{U}_2 \;\; \cdots \;\; \mathbf{U}_N)'$. We refer to $\mathbf{U}$ as a random permutation matrix, and require each realization to be equally likely, subject to the constraints that one unit is allocated to each position, $\mathbf{U}\mathbf{1}_N = \mathbf{1}_N$, and all units are assigned to a position, $\mathbf{U}'\mathbf{1}_N = \mathbf{1}_N$, where $\mathbf{1}_N$ is an $N \times 1$ vector with all elements equal to 1. Taking the expectation over all possible permutations, $E(\mathbf{Z}) = \mathbf{1}_N\boldsymbol{\mu}_z'$ and $\mathrm{cov}(\mathrm{vec}(\mathbf{Z})) = \boldsymbol{\Sigma} \otimes \mathbf{P}_{N,N}$, where $\mathbf{P}_{a,b} = \mathbf{I}_a - \frac{1}{b}\mathbf{J}_a$, $\mathbf{I}_a$ is an $a \times a$ identity matrix, and $\mathbf{J}_a = \mathbf{1}_a\mathbf{1}_a'$.

The random variables in $\mathbf{Z}$ represent a full permutation of the subjects in the population, with the first column, $\mathbf{Y}$, representing response, and the remaining columns, $\mathbf{X}_k$, $k = 1, \ldots, p$, representing auxiliary variables. Note that subjects are not identifiable in this representation. Without loss of generality, we assume that the sample corresponds to the random variables in rows $i = 1, \ldots, n$, i.e., $\mathbf{Z}_S$, with the remainder in the rows corresponding to $i = n+1, \ldots, N$, i.e., $\mathbf{Z}_R$, where $\mathbf{Z} = (\mathbf{Z}_S' \;\; \mathbf{Z}_R')'$. This notation explicitly represents the process of simple random sampling by a stochastic model.

We simplify estimation by defining a column expansion of the random variables for the sample, $\mathbf{Z}_I = \mathrm{vec}(\mathbf{Z}_S)$, and the remainder, $\mathbf{Z}_{II} = \mathrm{vec}(\mathbf{Z}_R)$. The expected values of the sample and remaining random variables are $E(\mathbf{Z}_I) = \mathbf{G}_I\mu_y$ and
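$E(\mathbf{Z}_{II}) = \mathbf{G}_{II}\mu_y$, where $\mathbf{G}_I = (\mathbf{1}_n' \;\; \mathbf{0}_{np}')'$ and $\mathbf{G}_{II} = (\mathbf{1}_{N-n}' \;\; \mathbf{0}_{(N-n)p}')'$. The covariance structure is

$$\mathrm{var}\begin{pmatrix}\mathbf{Z}_I \\ \mathbf{Z}_{II}\end{pmatrix} = \begin{pmatrix}\mathbf{V}_I & \mathbf{V}_{I,II} \\ \mathbf{V}_{I,II}' & \mathbf{V}_{II}\end{pmatrix},$$

where $\mathbf{V}_I = \boldsymbol{\Sigma} \otimes \mathbf{P}_{n,N}$, $\mathbf{V}_{II} = \boldsymbol{\Sigma} \otimes \mathbf{P}_{N-n,N}$, and $\mathbf{V}_{I,II} = -\frac{1}{N}\boldsymbol{\Sigma} \otimes \mathbf{J}_{n \times (N-n)}$, where $\mathbf{J}_{n \times (N-n)} = \mathbf{1}_n\mathbf{1}_{N-n}'$. Consequently, the partitioned model that reflects simple random sampling can be represented as

$$\begin{pmatrix}\mathbf{Z}_I \\ \mathbf{Z}_{II}\end{pmatrix} = \begin{pmatrix}\mathbf{G}_I \\ \mathbf{G}_{II}\end{pmatrix}\mu_y + \mathbf{E}. \tag{1}$$

3. BLUE of a linear function

Our interest lies in linear functions of the permuted response variate, namely $\theta = \mathbf{c}'\mathbf{Y} = \sum_{i=1}^{N} c_i Y_i$, or equivalently,

$$\theta = \mathbf{C}_I'\mathbf{Z}_I + \mathbf{C}_{II}'\mathbf{Z}_{II}, \tag{2}$$

where $\mathbf{C}_I = (\mathbf{c}_I' \;\; \mathbf{0}_{np}')'$, $\mathbf{C}_{II} = (\mathbf{c}_{II}' \;\; \mathbf{0}_{(N-n)p}')'$ and $\mathbf{c} = (\mathbf{c}_I' \;\; \mathbf{c}_{II}')'$. For example, when $c_i = 1/N$ for all $i = 1, \ldots, N$, $\theta = \mu_y$; when $c_i = 1$ for all $i = 1, \ldots, N$, $\theta$ is the population total; or when $c_i = 1/n^*$ for $i = 1, \ldots, n^* < n$ (and $c_i = 0$ otherwise), $\theta$ may correspond to the mean response for an interviewer of the first $n^*$ sample subjects.

After sampling, only $\mathbf{C}_{II}'\mathbf{Z}_{II}$ will be unknown; thus, estimating $\theta$ is equivalent to predicting $\mathbf{C}_{II}'\mathbf{Z}_{II}$. Following Royall's prediction approach (Royall 1976), we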
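develop the best linear unbiased predictor (BLUP) of $\mathbf{C}_{II}'\mathbf{Z}_{II}$ which, when added to $\mathbf{C}_I'\mathbf{Z}_I$, generates the BLUE of $\theta$. We require the predictor to be a linear function of the sample, i.e., $\mathbf{w}'\mathbf{Z}_I$, to be an unbiased predictor of $\mathbf{C}_{II}'\mathbf{Z}_{II}$, i.e., $E(\mathbf{w}'\mathbf{Z}_I) = E(\mathbf{C}_{II}'\mathbf{Z}_{II})$, and to have minimum expected mean squared error (MSE). As a result, the estimator of $\theta$ can be expressed as $P = \mathbf{C}_I'\mathbf{Z}_I + \mathbf{w}'\mathbf{Z}_I$. The unbiasedness constraint implies that $\mathbf{w}'\mathbf{G}_I - \mathbf{C}_{II}'\mathbf{G}_{II} = 0$. The variance of the prediction error $P - \theta = \mathbf{w}'\mathbf{Z}_I - \mathbf{C}_{II}'\mathbf{Z}_{II}$ is given by

$$\mathrm{var}(P - \theta) = \mathbf{w}'\mathbf{V}_I\mathbf{w} - 2\mathbf{w}'\mathbf{V}_{I,II}\mathbf{C}_{II} + \mathbf{C}_{II}'\mathbf{V}_{II}\mathbf{C}_{II}.$$

We then apply Royall's prediction theorem (Royall 1976) to find the value of $\mathbf{w}$ that minimizes

$$\Phi(\mathbf{w}) = \mathbf{w}'\mathbf{V}_I\mathbf{w} - 2\mathbf{w}'\mathbf{V}_{I,II}\mathbf{C}_{II} + 2\lambda(\mathbf{w}'\mathbf{G}_I - \mathbf{C}_{II}'\mathbf{G}_{II}),$$

where $\lambda$ is a Lagrangian multiplier. The unique solution is

$$\hat{\mathbf{w}} = \mathbf{V}_I^{-1}\left\{\mathbf{V}_{I,II} + \mathbf{G}_I(\mathbf{G}_I'\mathbf{V}_I^{-1}\mathbf{G}_I)^{-1}(\mathbf{G}_{II}' - \mathbf{G}_I'\mathbf{V}_I^{-1}\mathbf{V}_{I,II})\right\}\mathbf{C}_{II} = \bar{c}_{II}\left(\frac{N-n}{n}\mathbf{1}_n' \;\; -\frac{N}{n}\boldsymbol{\beta}' \otimes \mathbf{1}_n'\right)', \tag{3}$$

where $\bar{c}_{II} = \frac{1}{N-n}\sum_{i=n+1}^{N} c_i$, $f = n/N$ and $\boldsymbol{\beta} = \boldsymbol{\Sigma}_X^{-1}\boldsymbol{\sigma}_{yX} = (\beta_1 \;\; \beta_2 \;\; \cdots \;\; \beta_p)'$. Consequently,

$$\hat{P} = \mathbf{c}_I'\mathbf{Y}_I + (N-n)\bar{c}_{II}\left\{\bar{Y} - \frac{1}{1-f}\sum_{k=1}^{p}\beta_k(\bar{X}_k - \mu_k)\right\}, \tag{4}$$

where $\mathbf{Y}_I$ is the vector of sample responses and $\bar{Y}$ and $\bar{X}_k$, $k = 1, \ldots, p$, are sample means. With $\bar{c}_I = \frac{1}{n}\sum_{i=1}^{n} c_i$, the variance of $\hat{P}$ is given by

$$\mathrm{var}(\hat{P}) = (\mathbf{c}_I'\mathbf{c}_I - n\bar{c}_I^2)\sigma_y^2 + N^2\{f\bar{c}_I + (1-f)\bar{c}_{II}\}^2(1-f)\frac{\sigma_y^2}{n} - N^2\{2f\bar{c}_I\bar{c}_{II} + (1-2f)\bar{c}_{II}^2\}(1-f)\rho_{yX}^2\frac{\sigma_y^2}{n}, \tag{5}$$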
where $\rho_{yX}^2 = \boldsymbol{\sigma}_{yX}'\boldsymbol{\Sigma}_X^{-1}\boldsymbol{\sigma}_{yX}/\sigma_y^2$ is the squared multiple correlation coefficient of $Y$ on $X$. In practical applications, $\boldsymbol{\beta}$, $\sigma_y^2$, and $\rho_{yX}^2$ are not known, and must be replaced by sample estimators.

4. Example

As an example, suppose we are interested in estimating the mean response given by $\mu_y = \frac{1}{N}\sum_{i=1}^{N} Y_i$ based on a simple random sample, accounting for auxiliary information. Since $c_i = 1/N$ for $i = 1, \ldots, N$, the BLUE is

$$\hat{P} = f\bar{Y} + (1-f)\left\{\bar{Y} - \frac{1}{1-f}\sum_{k=1}^{p}\beta_k(\bar{X}_k - \mu_k)\right\}, \tag{6}$$

or equivalently, $\hat{P} = \bar{Y} - \sum_{k=1}^{p}\beta_k(\bar{X}_k - \mu_k)$, with variance

$$\mathrm{var}(\hat{P}) = (1 - \rho_{yX}^2)(1-f)\frac{\sigma_y^2}{n}. \tag{7}$$

As a practical application, suppose there is interest in estimating the smoking rate $\mu_y = \pi_y$ in a population based on a simple random sample with both smoking status (1 = smoker, 0 = non-smoker) and gender (1 = male, 0 = female) recorded on the sample subjects. We assume that the proportion of males in the population, $\mu_x = \pi_x$, is known, and represent the sample estimate of the proportion smoking as $\bar{Y} = \hat{\pi}_y$, the proportion of males in the sample as $\bar{X} = \hat{\pi}_x$, and the proportion of male smokers in the sample as $\hat{\pi}_{xy}$. With this notation, $(n-1)\hat{\sigma}_y^2 = n\hat{\pi}_y(1 - \hat{\pi}_y)$, $(n-1)\hat{\sigma}_x^2 = n\hat{\pi}_x(1 - \hat{\pi}_x)$, and
$(n-1)\hat{\sigma}_{xy} = n(\hat{\pi}_{xy} - \hat{\pi}_x\hat{\pi}_y)$. Using these estimators, we estimate $\beta$ by

$$b = \frac{\hat{\pi}_{xy} - \hat{\pi}_x\hat{\pi}_y}{\hat{\pi}_x(1 - \hat{\pi}_x)} = \hat{\pi}_{y[male]} - \hat{\pi}_{y[female]},$$

which is the estimated difference in male and female prevalences based on the sample. Substituting these expressions into (6) and (7) results in

$$\hat{P} = \frac{1}{N}\left\{n\hat{\pi}_y + (N-n)\hat{\pi}_y - Nb(\hat{\pi}_x - \pi_x)\right\} = \hat{\pi}_y - (\hat{\pi}_{y[male]} - \hat{\pi}_{y[female]})(\hat{\pi}_x - \pi_x),$$

which is the well-known post-stratified estimator, with estimated variance $\widehat{\mathrm{var}}(\hat{P}) = (1 - \hat{\rho}^2)(1-f)\hat{\sigma}_y^2/n$, where

$$\hat{\rho}^2 = \frac{(\hat{\pi}_{xy} - \hat{\pi}_x\hat{\pi}_y)^2}{\hat{\pi}_x(1 - \hat{\pi}_x)\,\hat{\pi}_y(1 - \hat{\pi}_y)}.$$

5. Discussion

We have shown that the estimator (4) is the best linear unbiased estimator (BLUE) of a linear combination of the response under simple random sampling without replacement. The results establish that the commonly used estimator developed under alternative frameworks is BLUE. The estimator has the same form as those commonly seen in multiple linear regression models that do not account for the finite population (Graybill 1976), but its variance includes a finite population correction factor. Results (6) and (7) are also identical to those for difference estimators with optimal coefficients (Montanari 1987), to the GREG estimator (Särndal, Swensson and Wretman 1992), and to the multiple regression estimator developed under a superpopulation model (Fuller 2002).
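The smoking-rate example of Section 4 can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `smoking_rate_estimates`, its argument names and the sample counts used in the call are all hypothetical. It computes the regression form of the estimator with the plug-in slope $b$, the post-stratified form, and the plug-in version of the variance (7), using exact rational arithmetic so that the two forms can be compared for exact equality.

```python
from fractions import Fraction


def smoking_rate_estimates(male_smokers, males, female_smokers, females, pi_x, N):
    """Return (regression form, post-stratified form, estimated variance).

    The counts describe a hypothetical sample; pi_x is the known population
    proportion of males and N is the population size.
    """
    n = males + females
    pi_y_hat = Fraction(male_smokers + female_smokers, n)  # sample smoking rate
    pi_x_hat = Fraction(males, n)                           # sample proportion male
    pi_xy_hat = Fraction(male_smokers, n)                   # sample proportion of male smokers

    # Plug-in slope b: difference between male and female prevalences.
    b = (pi_xy_hat - pi_x_hat * pi_y_hat) / (pi_x_hat * (1 - pi_x_hat))

    # Regression (difference) form of the estimator.
    regression_form = pi_y_hat - b * (pi_x_hat - pi_x)

    # Post-stratified form: known stratum weights times sample stratum rates.
    post_stratified = (pi_x * Fraction(male_smokers, males)
                       + (1 - pi_x) * Fraction(female_smokers, females))

    # Plug-in variance: (1 - rho^2)(1 - f) * sigma_y^2 / n.
    sigma2_y_hat = Fraction(n, n - 1) * pi_y_hat * (1 - pi_y_hat)
    rho2_hat = ((pi_xy_hat - pi_x_hat * pi_y_hat) ** 2
                / (pi_x_hat * (1 - pi_x_hat) * pi_y_hat * (1 - pi_y_hat)))
    f = Fraction(n, N)
    var_hat = (1 - rho2_hat) * (1 - f) * sigma2_y_hat / n
    return regression_form, post_stratified, var_hat


# Hypothetical sample: 60 males (24 smokers), 40 females (8 smokers),
# known population proportion of males 1/2, population size 1000.
reg, post, var_hat = smoking_rate_estimates(24, 60, 8, 40, Fraction(1, 2), 1000)
print(reg, post)        # both forms give 3/10
print(float(var_hat))   # plug-in variance estimate
```

With these illustrative counts the sample smoking rate is 8/25, and shifting it by $b(\hat{\pi}_x - \pi_x) = \tfrac{1}{5}\cdot\tfrac{1}{10}$ reproduces the post-stratified value 3/10 exactly.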

The survey sampling literature has struggled to reconcile design-based and model-based theories of estimation and prediction. Model-based methods recently popularized by Valliant, Dorfman and Royall (2000) stem from the prediction approach developed by Royall (1973, 1976). The underlying theoretical structure is important, since it allows such methods to be extended relatively easily to different applications of increasing complexity. The limitation of such theory is that it does not account for the sample design.

A similar unifying theory has not been developed for design-based methods. Cochran's (1977) original approach was to postulate a linear regression model, and then determine the regression coefficients by minimizing the variance. Other approaches, such as the GREG or the calibration approaches (Särndal, Swensson and Wretman 1992), have combined model-based and design-based ideas, or begun with ad hoc functional forms of estimators and optimized them in special settings. These approaches have been successful in addressing many practical problems in a design-based framework (Särndal, Swensson and Wretman 1992, Brewer 2002). However, they have not provided a consistent conceptual and theoretical basis that can be readily extended to more complex applications.

We believe that representing the sample design via a random permutation model, and then predicting functions of unobserved subjects in a systematic way, provides an appealing, straightforward foundation for finite population inference. There are steps in this process that break with tradition, such as expressing a parameter as a sum of random variables. Focusing attention on predicting unobserved quantities is certainly intuitively satisfying, but unusual in the context of estimation. The development also blurs the distinction between the traditional use of the term predictor (for random variables) and estimator (for parameters).

We have illustrated how the design-based random permutation model theory can be extended to include auxiliary variables in a straightforward manner. These results extend the scope of previous results (Stanek and Singer 2004, Stanek, Singer and Lencina 2004) to a broader class of problems. The previous developments of the theory have identified subtleties in interpreting random effects in simple random sampling (Stanek, Singer and Lencina 2004) and developed predictors of realized random effects in balanced two-stage sampling problems with response error (Stanek and Singer 2004). Current research is extending these results to clustered population settings where clusters are of different size and there is unequal probability sampling, and to settings where there is missing data. In each case, a similar approach is considered, with estimators (or predictors) developed via a clear optimization theory.
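The random permutation representation discussed above can be made concrete by brute force on a toy population. The sketch below is an illustration under stated assumptions, not part of the original development: the population values `y` and `x`, the sizes `N = 5` and `n = 3`, and the helper names `mean`, `pop_cov` and `perm_cov` are all hypothetical. It enumerates every equally likely permutation, treats the first `n` positions as the sample, and checks with exact arithmetic that (i) the permuted responses have the mean, variance and covariance implied by $\boldsymbol{\Sigma} \otimes \mathbf{P}_{N,N}$, and (ii) the estimator in (6) with known $\beta$ is unbiased for $\mu_y$ with the variance given in (7).

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical population of N = 5 subjects: response y and one auxiliary x.
y = [Fraction(v) for v in (3, 7, 4, 9, 2)]
x = [Fraction(v) for v in (1, 4, 2, 5, 3)]
N, n = len(y), 3


def mean(values):
    return sum(values) / len(values)


def pop_cov(a, b):
    """Finite-population covariance with divisor N - 1."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)


def perm_cov(a, b):
    """Covariance over the permutation distribution (all outcomes equally likely)."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)


mu_y, mu_x = mean(y), mean(x)
sigma2_y, sigma2_x, sigma_xy = pop_cov(y, y), pop_cov(x, x), pop_cov(x, y)
beta = sigma_xy / sigma2_x
rho2 = sigma_xy ** 2 / (sigma2_x * sigma2_y)
f = Fraction(n, N)

first_pos, second_pos, estimates = [], [], []
for perm in permutations(range(N)):
    Y = [y[s] for s in perm]  # permuted responses Y_1, ..., Y_N
    X = [x[s] for s in perm]  # permuted auxiliary values
    first_pos.append(Y[0])
    second_pos.append(Y[1])
    # Positions 1..n form the sample; regression estimator with known beta.
    estimates.append(mean(Y[:n]) - beta * (mean(X[:n]) - mu_x))

# Moments of the permuted responses.
assert mean(first_pos) == mu_y
assert perm_cov(first_pos, first_pos) == (1 - Fraction(1, N)) * sigma2_y
assert perm_cov(first_pos, second_pos) == -sigma2_y / N

# Unbiasedness and variance of the estimator.
assert mean(estimates) == mu_y
assert perm_cov(estimates, estimates) == (1 - rho2) * (1 - f) * sigma2_y / n
```

Enumeration is only feasible for very small $N$ (here 120 permutations), but it makes explicit that the only source of randomness in these checks is the permutation itself.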

In practice, covariances in the expressions for the estimators need to be estimated. Some simulation study results on the impact of such estimation are given by Li (2003). The resulting estimator coincides with those developed by GREG or calibration approaches, and strengthens the appeal of the random permutation model. Still, much more work is needed to extend the methods to more complex settings, including two-stage designs with cluster and unit covariates, longitudinal studies, and settings where units are randomized to treatments. We consider the basic results developed here to provide a foundation for additional work in these directions.

6. Acknowledgements

This research was partially supported by an NIH grant (NIH-PHS-R01-HD36848). The authors wish to thank Drs. John Buonaccorsi and Carol Bigelow for their constructive comments. The content of this article is part of the first author's dissertation conducted at the Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst, Massachusetts.

7. References

Brewer, K. R. W. (2002), Combined Survey Sampling Inference: Weighing Basu's Elephants, London; New York: Arnold; distributed in the United States of America by Oxford University Press.

Cochran, W. G. (1977), Sampling Techniques (3rd ed.), New York: John Wiley and Sons.

Fuller, W. A. (2002), "Regression Estimation for Survey Samples," Survey Methodology, 28.

Ghosh, M., and Rao, J. N. K. (1994), "Small Area Estimation: An Appraisal," Statistical Science, 9.

Graybill, F. A. (1976), Theory and Application of the Linear Model (Vol. 1), Belmont, CA: Wadsworth Publishing Company, Inc.

Li, W. (2003), "Use of Random Permutation Model in Rate Estimation and Standardization," Ph.D. dissertation, University of Massachusetts, Department of Biostatistics and Epidemiology.

Montanari, G. E. (1987), "Post-Sampling Efficient QR-Prediction in Large-Sample Surveys," International Statistical Review, 55.

Rao, J. N. K. (1997), "Developments in Sample Survey Theory: An Appraisal," Canadian Journal of Statistics, 25, 1-21.

Royall, R. M. (1973), "The Prediction Approach to Finite Population Sampling Theory: Application to the Hospital Discharge Survey," Technical report, National Center for Health Statistics, Office of Statistical Methods.

Royall, R. M. (1976), "The Linear Least-Squares Prediction Approach to Two-Stage Sampling," Journal of the American Statistical Association, 71.

Särndal, C. E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.

Stanek, E. J., and Singer, J. M. (2004), "Predicting Random Effects from Finite Population Clustered Samples with Response Error," Journal of the American Statistical Association, 99.

Stanek, E. J., Singer, J. M., and Lencina, V. B. (2004), "A Unified Approach to Estimation and Prediction under Simple Random Sampling," Journal of Statistical Planning and Inference, 121.

Valliant, R., Dorfman, A. H., and Royall, R. M. (2000), Finite Population Sampling and Inference: A Prediction Approach, New York: John Wiley & Sons.
