DESIGN-BASED RANDOM PERMUTATION MODELS WITH AUXILIARY INFORMATION. Wenjun Li. Division of Preventative and Behavioral Medicine

DESIGN-BASED RANDOM PERMUTATION MODELS WITH AUXILIARY INFORMATION

Wenjun Li
Division of Preventive and Behavioral Medicine, University of Massachusetts Medical School, Worcester, MA 01655

Edward J. Stanek III
Department of Public Health, University of Massachusetts, Amherst, MA 01003

Corresponding address: Wenjun Li, Ph.D., Biostatistics Research Center, Division of Preventive and Behavioral Medicine, University of Massachusetts Medical School, Shaw Building S-30, 55 Lake Avenue North, Worcester, MA 01655

The content of this article is a part of the first author's dissertation conducted at the Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst, Massachusetts. This research was partially supported by an NIH grant (NIH-PHS-R01-HD36848). The authors wish to thank Drs. Julio Singer, John Buonaccorsi and Carol Bigelow for their constructive comments.

Corresponding author email address: Wenjun.Li@UMASSMED.EDU

LiW_05_m_v5.doc 9/9/005 i

ABSTRACT

We develop design-based estimators of the population mean using auxiliary information under simple random without replacement sampling by extending the random permutation model of Stanek, Singer and Lencina (2004) and Stanek and Singer (2004). A key step in the development is representing the population mean, defined as a non-stochastic average of unit values, as the sum of random variables constructed from a random permutation of the population. Using methods similar to those in model-based approaches, the random variable representing the sum of non-sampled units is predicted to form a design-based estimator of the population mean. The random permutation model is extended by the joint permutation of the response and auxiliary variables, with known auxiliary information incorporated through centering the auxiliary variables on their respective means. The estimators are required to be linear functions of the sample, unbiased, and have minimum mean squared error. The estimators are identical to model-assisted and calibration estimators. However, the results provide the building blocks for extending the design-based random permutation model theory to include covariates in more complicated sample designs, and for bridging the theoretical results to practical applications. The concordance of results with other less systematic approaches is an added appeal of the methodology.

Key words: auxiliary variable, design-based inference, prediction, finite sampling, random permutation model, simultaneous permutation

Running title: Random permutation models with auxiliary information

1. INTRODUCTION

Estimators can be made more precise by accounting for auxiliary information, such as gender, age, income and chronic disease-bearing history, that is partially or completely known in a population. Methods of improving estimation with auxiliary information have been discussed in a model-based approach (Bolfarine and Zacks 1992; Valliant et al. 2000) and using model-assisted or calibration methods (Cassel et al. 1977; Cochran 1977; Särndal and Wright 1984; Deville and Särndal 1992; Särndal et al. 1992). In a model-based approach, the result is the best linear unbiased predictor (BLUP) (Ghosh and Rao 1994; Rao 1997). The model-assisted approach combines ideas from model-based and design-based approaches, resulting in generalized regression (GREG) estimators. The calibration approach produces a weighted estimator with weights that minimize the distance to benchmark weights while being calibrated to known population quantities on some set of auxiliary variables. The actual sample design plays no role in inference based on the model-based approach. Additional superpopulation model assumptions beyond the sample design are required for development of GREG estimators. The calibration approach lacks an integrated theoretical framework. All of these approaches require additional assumptions or lack an integrated theory that builds simply on the sample design.

We develop a design-based estimator of the population mean that accounts for auxiliary information. The development extends use of the random permutation model for simple random sampling (SRS) and two stage cluster sampling (Stanek, Singer and Lencina 2004; Stanek and Singer 2004) to account for auxiliary variables. The method avoids the limitations of the superpopulation model approach, but takes advantage of some of the optimization tools used in developing predictors with superpopulation model assumptions.
Beginning with a vector of responses for each subject in a finite population, a SRS of subjects is represented by a random permutation probability model that permutes the subjects' response vectors. Auxiliary information is incorporated through a simple linear transformation. The results provide a direct design-based method to account for auxiliary information when estimating the population mean.

This paper is organized as follows. We first present definitions and notation, and introduce the random permutation model. We next present a simultaneous random permutation model in a simple setting with a response and one auxiliary variable, and use it to derive the best linear unbiased estimator (BLUE) of the population mean. Subsequently, we extend the results to scenarios with multiple auxiliary variables. We conclude by illustrating the method with an example and include results from a small scale simulation evaluating empirical estimators.

2. DEFINITIONS AND NOTATION

Let the population consist of $N$ subjects labeled $s = 1, \ldots, N$, and assume that the labels are non-informative, serving only to identify the subjects. Associated with subject $s$ is a non-stochastic potentially observable vector $(y_s\ \mathbf{x}_s')'$, where $y_s$ denotes the response of interest and $\mathbf{x}_s = ((x_{ks}))$ is a $p \times 1$ vector of auxiliary variables. We represent the population mean by the vector $(\mu_Y\ \boldsymbol{\mu}_X')'$, where $\mu_Y$ is associated with the response and $\boldsymbol{\mu}_X = ((\mu_k))$ is a $p \times 1$ vector associated with the explanatory variables. The population covariance structure is defined by the $(p+1) \times (p+1)$ matrix

$$\Sigma = \begin{pmatrix} \sigma_Y^2 & \boldsymbol{\sigma}_{XY}' \\ \boldsymbol{\sigma}_{XY} & \Sigma_X \end{pmatrix},$$

where $\boldsymbol{\sigma}_{XY} = (\sigma_{Y1}\ \sigma_{Y2}\ \cdots\ \sigma_{Yp})'$ and $\Sigma_X = ((\sigma_{kk^*}))$ is a $p \times p$ matrix. Individual terms are defined in the usual manner such that

$$\mu_Y = \frac{1}{N}\sum_{s=1}^{N} y_s, \qquad \mu_k = \frac{1}{N}\sum_{s=1}^{N} x_{ks}, \qquad \sigma_Y^2 = \frac{1}{N-1}\sum_{s=1}^{N} (y_s - \mu_Y)^2,$$

$$\sigma_{Yk} = \frac{1}{N-1}\sum_{s=1}^{N} (y_s - \mu_Y)(x_{ks} - \mu_k), \qquad \text{and} \qquad \sigma_{kk^*} = \frac{1}{N-1}\sum_{s=1}^{N} (x_{ks} - \mu_k)(x_{k^*s} - \mu_{k^*}),$$

with $\sigma_k^2 = \sigma_{kk}$.
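The population quantities defined in this section can be computed directly when the whole population is listed. The following is a minimal sketch, assuming the $N-1$ divisor for variances and covariances (the divisor under which the permutation covariance takes the form $\sigma^2 P$ used later); the function name `population_moments` and the toy data are hypothetical and not part of the original paper.

```python
import numpy as np

def population_moments(y, x):
    """Return the mean vector (mu_Y, mu_X')' and the (p+1)x(p+1) matrix Sigma.

    y : length-N sequence of responses
    x : length-N sequence (p = 1) or N x p array of auxiliary values
    Assumption: variances and covariances use the N-1 divisor.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)   # force N x p
    data = np.column_stack([y, x])                        # N x (p+1)
    mu = data.mean(axis=0)
    sigma = np.cov(data, rowvar=False, ddof=1)            # columns are variables
    return mu, sigma

# Toy population of N = 4 subjects with one auxiliary variable.
mu, sigma = population_moments([1, 2, 3, 4], [2, 4, 6, 9])
print(mu)
print(sigma)
```

The first row and column of the returned matrix hold $\sigma_Y^2$ and the covariances between the response and each auxiliary variable; the remaining block is $\Sigma_X$.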

3. THE RANDOM PERMUTATION MODEL

We define a random permutation model as the set of all possible equally likely permutations of subjects in the population. We represent a random permutation by the sequence of random variables $Y_1, Y_2, \ldots, Y_N$, where the subscript refers to the position occupied in the random permutation. Following Stanek, Singer and Lencina (2004), we explicitly define these random variables in terms of a set of indicator random variables $U_{is}$, $i = 1, \ldots, N$; $s = 1, \ldots, N$, that have a value of one if subject $s$ is in position $i$ in a permutation and zero otherwise. Using this notation, the response for the subject in position $i$ in a permutation is represented by the random variable $Y_i = \sum_{s=1}^{N} U_{is} y_s$; using vector notation, $Y_i = U_i' y$, where $U_i = (U_{i1}\ U_{i2}\ \cdots\ U_{iN})'$ represents a vector of random variables that selects a subject in position $i$ and $y = (y_1\ y_2\ \cdots\ y_N)'$ represents the population response vector. Defining $U = (U_1\ U_2\ \cdots\ U_N)'$, the random variables representing a permutation of response are given by $Y = Uy$. Similar vectors can be defined to represent a permutation of auxiliary variables, $X_k = U x_k$, $k = 1, \ldots, p$. This notation simplifies representation of the problem and development of the expected value and of the covariance structure of the vectors.

We use properties of the indicator random variables to evaluate the expected value and covariance structure of the random variables. Taking expectation over all possible permutations, $E(U_{is}) = 1/N$ for all $i = 1, \ldots, N$; $s = 1, \ldots, N$. As a result, $E(Y_i) = E(U_i' y) = \frac{1}{N} 1_N' y = \mu_Y$ for all $i = 1, \ldots, N$, where $1_N$ is an $N \times 1$ vector of ones. In a similar manner, $E(U_{is} U_{i^*s^*}) = \frac{1}{N(N-1)}$ when $i \neq i^*$ and $s \neq s^*$, $E(U_{is} U_{i^*s^*}) = \frac{1}{N}$ when $i = i^*$ and $s = s^*$, and $E(U_{is} U_{i^*s^*}) = 0$ otherwise. Using these expressions, we can show that $\mathrm{var}(Y_i) = \frac{N-1}{N}\sigma_Y^2$ for all $i = 1, \ldots, N$, and that $\mathrm{cov}(Y_i, Y_{i^*}) = -\frac{\sigma_Y^2}{N}$ when $i \neq i^*$.

We combine these results to express the expected value and the covariance structure of the random permutation using vector notation. As a result, $E(Y) = \mu_Y 1_N$ and $\mathrm{var}(Y) = \sigma_Y^2 P_{N,N}$, where $P_{a,b} = I_a - \frac{1}{b} J_a$, $I_a$ is an $a \times a$ identity matrix and $J_a = 1_a 1_a'$. Similar expressions apply to the auxiliary variables. In addition, $\mathrm{cov}(Y, X_k) = \sigma_{Yk} P_{N,N}$ and, for $k \neq k^*$, $\mathrm{cov}(X_k, X_{k^*}) = \sigma_{kk^*} P_{N,N}$.

4. SIMULTANEOUS RANDOM PERMUTATION MODEL

We next consider the random variables representing the random permutation of response and auxiliary variables simultaneously. For simplicity, we first assume that there is a single auxiliary variable (i.e., $p = 1$) corresponding to the subject's age (which we represent by $x_s$), with population mean $\mu_X$, variance $\sigma_X^2$ and covariance with the response $\sigma_{XY}$. Age is assumed to be known for all subjects in the population. The simultaneous random permutation model is defined by concatenating the permutation vectors for response and the auxiliary variables. The resulting model is an example of a seemingly unrelated regression model (Zellner 1963) with both regressions corresponding to simple mean models such that

$$\begin{pmatrix} Y \\ X \end{pmatrix} = \begin{pmatrix} 1_N & 0_N \\ 0_N & 1_N \end{pmatrix} \begin{pmatrix} \mu_Y \\ \mu_X \end{pmatrix} + E, \qquad \text{i.e.,} \qquad Z = G\mu + E \tag{1}$$

where $Z = (Y'\ X')'$, $G = I_2 \otimes 1_N$, $\mu = (\mu_Y\ \mu_X)'$, and where $\otimes$ denotes the Kronecker product (Graybill 1976). In this model, $E(Z) = G\mu$ and $\mathrm{var}(Z) = \Sigma \otimes P_{N,N}$.

The mean age $\mu_X$ is known since age is assumed to be known for all subjects. This implies that the random variables arising from permuting the auxiliary variable are constrained by $\frac{1}{N}\sum_{i=1}^{N} X_i = \mu_X$. We eliminate $\mu_X$ from model (1) by subtracting $(0_N'\ \mu_X 1_N')'$ from both sides (which is equivalent to multiplying each term in the model by $R = \begin{pmatrix} I_N & 0 \\ 0 & P_{N,N} \end{pmatrix}$). Representing the elements of $X^*$ by $X_i^* = X_i - \mu_X$, the transformed model is given by

$$\begin{pmatrix} Y \\ X^* \end{pmatrix} = \begin{pmatrix} 1_N \\ 0_N \end{pmatrix} \mu_Y + E^*, \qquad \text{i.e.,} \qquad Z^* = G^*\mu_Y + E^* \tag{2}$$

where $Z^* = RZ$, $E^* = RE$ and $G^* = (1_N'\ 0_N')'$, so that $RG\mu = G^*\mu_Y$. Since $P_{N,N}$ is idempotent, $E(Z^*) = G^*\mu_Y$ and $\mathrm{cov}(Z^*) = \Sigma \otimes P_{N,N}$. Model (2) is equivalent to model (1) with the auxiliary variate centered at zero.

5. SAMPLING AND PARTITIONING THE RANDOM VARIABLES

We represent random variables for a SRS of size $n$ by the first $n$ elements of the random permutation vectors. The sampled ($Z_I^*$) and remaining ($Z_{II}^*$) portions of $Z^*$ can be obtained by pre-multiplication by

$$K = \begin{pmatrix} I_2 \otimes (I_n\ \ 0_{n \times (N-n)}) \\ I_2 \otimes (0_{(N-n) \times n}\ \ I_{N-n}) \end{pmatrix}, \qquad \text{i.e.,} \qquad KZ^* = \begin{pmatrix} Z_I^* \\ Z_{II}^* \end{pmatrix}.$$

Since $K$ is non-stochastic, it follows that $E(Z_I^*) = \mu_Y (1_n'\ 0_n')'$, $E(Z_{II}^*) = \mu_Y (1_{N-n}'\ 0_{N-n}')'$ and

$$\mathrm{cov}\begin{pmatrix} Z_I^* \\ Z_{II}^* \end{pmatrix} = \begin{pmatrix} V_I & V_{I,II} \\ V_{II,I} & V_{II} \end{pmatrix},$$

where $V_I = \Sigma \otimes P_{n,N}$, $V_{II} = \Sigma \otimes P_{N-n,N}$ and $V_{I,II} = V_{II,I}' = -\frac{1}{N}\Sigma \otimes J_{n \times (N-n)}$, with $J_{n \times (N-n)} = 1_n 1_{N-n}'$. Consequently, the partitioned model that reflects SRS sampling can be represented as

$$\begin{pmatrix} Z_I^* \\ Z_{II}^* \end{pmatrix} = \begin{pmatrix} G_I^* \\ G_{II}^* \end{pmatrix} \mu_Y + E^* \tag{3}$$

where $G_I^* = (1_n'\ 0_n')'$ and $G_{II}^* = (1_{N-n}'\ 0_{N-n}')'$.

6. PARAMETER OF INTEREST

We assume that the parameter of interest is the population mean of the response variate, which is defined as a linear function of the permuted random variables, namely

$$\mu_Y = \sum_{i=1}^{n} c\,Y_i + \sum_{i=n+1}^{N} c\,Y_i, \qquad \text{where } c = \frac{1}{N},$$

or equivalently as

$$\mu_Y = C'Z^* = C_I'Z_I^* + C_{II}'Z_{II}^* \tag{4}$$

where $C = c(1_N'\ 0_N')'$, $C_I = c(1_n'\ 0_n')'$ and $C_{II} = c(1_{N-n}'\ 0_{N-n}')'$. After sampling, only $C_{II}'Z_{II}^*$ will be unknown; thus, estimating $\mu_Y$ is equivalent to finding a predictor of $C_{II}'Z_{II}^* = c\sum_{i=n+1}^{N} Y_i$. We develop the best linear unbiased predictor (BLUP) of $C_{II}'Z_{II}^*$ and refer to the estimator of $\mu_Y$ as the BLUE of $\mu_Y$.

7. BLUE OF A LINEAR FUNCTION

We require the predictor of $C_{II}'Z_{II}^*$ to be a linear function of the sample, $w'Z_I^* = \sum_{i=1}^{n} w_{Yi} Y_i + \sum_{i=1}^{n} w_{Xi} X_i^*$, to be unbiased, i.e., $E(w'Z_I^*) = E(C_{II}'Z_{II}^*)$, and to have minimum mean squared error (MSE). The estimator of $\mu_Y$ is given by

$$\hat{P} = (C_I + w)'Z_I^* = c\sum_{i=1}^{n} Y_i + \left( \sum_{i=1}^{n} w_{Yi} Y_i + \sum_{i=1}^{n} w_{Xi} X_i^* \right). \tag{5}$$

The unbiased constraint implies that $w'G_I^* - C_{II}'G_{II}^* = 0$. Since $\mu_Y$ is a constant, the variance of $\hat{P}$ equals the variance of the prediction error $w'Z_I^* - C_{II}'Z_{II}^*$, so that $\mathrm{var}(\hat{P}) = w'V_I w - 2w'V_{I,II}C_{II} + C_{II}'V_{II}C_{II}$. With such a setup, the prediction theorem of Royall (1976) can be applied. The minimum variance unbiased estimator is obtained by minimizing the function

$$\Phi(w) = w'V_I w - 2w'V_{I,II}C_{II} + 2\lambda\left( w'G_I^* - C_{II}'G_{II}^* \right) \tag{6}$$

where $\lambda$ is a Lagrangian multiplier, resulting in

$$\hat{w} = V_I^{-1}\left[ V_{I,II} + G_I^*\left( G_I^{*\prime} V_I^{-1} G_I^* \right)^{-1}\left( G_{II}^{*\prime} - G_I^{*\prime} V_I^{-1} V_{I,II} \right) \right] C_{II}. \tag{7}$$

After simplification, (7) reduces to

$$\hat{w} = \frac{1-f}{n}\begin{pmatrix} 1_n \\ -\dfrac{\beta}{1-f}\,1_n \end{pmatrix} \tag{8}$$

where $f = n/N$ and $\beta = \sigma_{XY}/\sigma_X^2$. Consequently, the estimator and its variance are

$$\hat{P} = \frac{1}{N}\left\{ \sum_{i=1}^{n} Y_i + \sum_{i=n+1}^{N}\left[ \bar{Y} - \frac{\beta}{1-f}\left( \bar{X} - \mu_X \right) \right] \right\} = f\bar{Y} + (1-f)\left[ \bar{Y} - \frac{\beta}{1-f}\left( \bar{X} - \mu_X \right) \right] = \bar{Y} - \beta\left( \bar{X} - \mu_X \right) \tag{9}$$

$$\mathrm{var}(\hat{P}) = \left( 1 - \rho^2 \right)(1-f)\frac{\sigma_Y^2}{n} \tag{10}$$

where $\rho = \sigma_{XY}/(\sigma_X \sigma_Y)$ is the correlation coefficient of $Y$ on $X$, and $\bar{Y}$ and $\bar{X}$ are the sample means of the response variable and the auxiliary variable, respectively. This expression is similar to the expressions commonly seen in linear regression estimators, but includes a finite population correction factor.

The expressions in equation (9) have interesting interpretations. The first expression is divided into the sum of random variables in the sample and a predictor of the random variables in the remainder. Since the weighted total of the sample and remainder random variables equals the population mean, it is clear that for a realized sample we simply substitute the values observed for the sample random variables in the estimator. Rather than predicting the value of each remaining random variable by the sample mean $\bar{Y}$, we include an adjustment that accounts for the auxiliary variable. The adjustment shrinks the observed difference between the auxiliary sample mean and the population auxiliary mean. For example, if the value observed for the auxiliary sample mean $\bar{X}$ exceeds the population mean $\mu_X$, and the response and auxiliary random variables are positively correlated, we may anticipate that the sample mean $\bar{Y}$ will also exceed the population mean $\mu_Y$. The predictor of the remaining random variables compensates for the overestimation of $\mu_Y$ by the sample random variables. The correction is proportional to the auxiliary difference $\bar{X} - \mu_X$ and weighted by the regression coefficient $\beta$, inflated by one over the finite population correction factor $1 - f$.

The second expression in equation (9) is very similar to the BLUP of a realized cluster mean based on a mixed model fit to a two stage cluster sample. The similarity provides an intuition connecting BLUP in the two stage sampling context to adjustments for auxiliary variables in SRS. In the context of two stage cluster sampling with equal size clusters, the BLUP of a realized cluster mean under a two stage random permutation model is given by

$$\hat{T}_i = f\bar{Y}_i + (1-f)\left[ \bar{Y}_i - \frac{\beta}{1-f}\left( \bar{Y}_i - \bar{Y} \right) \right]$$

where $f = m/M$ is the common sampling fraction of units in clusters, $\bar{Y}_i$ is the sample mean of the realized cluster, $\bar{Y}$ is the average of these means over sampled clusters, and $\beta = \dfrac{(1-f)\sigma_e^2}{m\sigma^2 + (1-f)\sigma_e^2}$ is a function of the variance between clusters $\sigma^2$ and the average variance within clusters $\sigma_e^2$ (see Stanek and Singer 2004). The predictor is the sum of two terms. The first term substitutes the values observed for the sample units for the realized cluster in the predictor. The second term predicts the contribution of the remaining units in the realized cluster by adjusting the realized cluster sample mean. The adjustment is proportional to the difference between the realized cluster sample mean and the overall mean.

The BLUP of a realized cluster mean in two stage cluster sampling thus has the same form as the estimator of the population mean with auxiliary information in SRS. The difference between the sample mean for the auxiliary variable and the population mean in SRS is replaced by the difference between the realized cluster sample mean and the overall mean in two stage cluster sampling. The regression coefficient relating the response to the auxiliary variable in SRS is replaced by a coefficient that is a function of the within and between cluster variance components.

8. EXTENSION TO CASES WITH MULTIPLE COVARIATES

Extension of the above results to scenarios with multiple auxiliary variables is straightforward. Suppose $p$ auxiliary variables are available for use in estimation. The matrix of random variables representing a joint permutation of a response variable and the $p$ covariates is given by $U(y\ x_1\ \cdots\ x_p) = (Y\ X_1\ \cdots\ X_p)$. A simultaneous random permutation model can be defined similar to (1) with $Z = \mathrm{vec}(Y\ X_1\ \cdots\ X_p)$ and $G = I_{p+1} \otimes 1_N$. It follows that $E(Z) = G\mu$ and that $\mathrm{cov}(Z) = \Sigma \otimes P_{N,N}$. When the auxiliary means $\mu_k$, $k = 1, \ldots, p$, are known, a similar centering transformation can be applied to $Z$ such that

$$Z^* = \begin{pmatrix} I_N & 0 \\ 0 & I_p \otimes P_{N,N} \end{pmatrix} Z.$$

A reparameterized model incorporating known auxiliary information has a form identical to Model (2), but where $G^* = (1_N'\ 0_{Np}')'$. With these differences, a linear unbiased minimum variance estimator can be derived following the same steps as in Section 7 (see the Appendix). The unique solution for the vector of coefficients $w$ is

$$\hat{w} = \frac{1-f}{n}\begin{pmatrix} 1_n \\ -\dfrac{1}{1-f}\,\boldsymbol{\beta} \otimes 1_n \end{pmatrix}$$

where $\boldsymbol{\beta} = \Sigma_X^{-1}\boldsymbol{\sigma}_{XY} = (\beta_1\ \beta_2\ \cdots\ \beta_p)'$. Accordingly, the estimator and its variance are

$$\hat{P} = f\bar{Y} + (1-f)\left[ \bar{Y} - \frac{1}{1-f}\boldsymbol{\beta}'\left( \bar{X} - \boldsymbol{\mu}_X \right) \right] = \bar{Y} - \boldsymbol{\beta}'\left( \bar{X} - \boldsymbol{\mu}_X \right) \tag{11}$$

and

$$\mathrm{var}(\hat{P}) = \left( 1 - \rho_X^2 \right)(1-f)\frac{\sigma_Y^2}{n} \tag{12}$$

where $\bar{Y}$ is the sample mean of the response variable, $\bar{X}$ is the vector of sample means of the $p$ auxiliary variables, and $\rho_X^2 = \boldsymbol{\sigma}_{XY}'\Sigma_X^{-1}\boldsymbol{\sigma}_{XY}/\sigma_Y^2$ is the squared multiple correlation coefficient of $Y$ on $X$. Notice that the only difference between (12) and (10) is the multiple correlation coefficient $\rho_X$ instead of $\rho$. This expression is similar to the expressions commonly seen in multiple linear regression models (Graybill 1976), but includes a finite population correction factor. Results (11) and (12) are also the same as difference estimators with optimal coefficients (Montanari 1987) and the GREG estimator (Särndal et al. 1992).

9. EMPIRICAL ESTIMATES

The estimators in equations (9) and (11) are functions of the regression coefficients, which in turn are functions of variance components in the population. In practice, although $\Sigma_X$ may be known, $\boldsymbol{\sigma}_{XY}$ will need to be estimated. Särndal et al. (1992) recommend estimating both terms. When $p = 1$, the estimate of $\beta$ is given by $b = \hat{\sigma}_{XY}/\hat{\sigma}_X^2$, where $\hat{\sigma}_{XY} = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})$ and $\hat{\sigma}_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$. An alternative estimator uses the known population variance of $X$ to estimate $\beta$ by $b^* = \hat{\sigma}_{XY}/\sigma_X^2$.

We conducted a small scale simulation study to compare different empirical estimators in the context of a simple problem where there is interest in estimating the smoking rate $\mu_Y = \pi_Y$ in a population of size $N$ based on a SRS with both smoking status and gender recorded on sample subjects. We assume that the response variable is an indicator of smoking status ($y_s = 1$ if subject $s$ is a smoker and zero otherwise) and the auxiliary variable is an indicator of male gender ($x_s = 1$ if subject $s$ is male and zero otherwise). We also assume that the proportion of males in the population, $\mu_X = \pi_X$, is known. By including the gender status of the sample subjects, theoretically we can reduce the variance of the estimate of smoking prevalence in the population relative to the simple sample proportion, with the percent reduction given by $100\rho^2$. In practice, since the regression parameter is not known, some of the reduction in the MSE is lost due to uncertainty in the regression coefficient.

The simulation study generated a series of hypothetical populations of sizes 50, 100, 200, 400, 800 and 1600, each with an auxiliary variable representing 50% men ($\pi_X = 0.5$). In all populations, the overall prevalence of smokers is assumed to be 40% ($\pi_Y = 0.4$). The prevalence of smoking in men ranged from 40% to 68%, translating into a relative risk of smoking for men ranging from 1 to 5.67. We evaluated the estimators by comparing the average MSE over 5000 independent simple random samples. To compare the MSE of the estimators, we expressed the MSE of estimators that account for gender as a percentage of the MSE of the estimator corresponding to the simple sample smoking prevalence. Three estimators are compared, differing by whether $b$, $b^*$ or $\beta$ is used for the regression coefficient.
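A simulation of this kind can be sketched in a few lines. The code below is an illustrative reconstruction under stated assumptions, not the authors' program: the helper names (`make_population`, `regression_estimates`, `simulate`), the random seed, and the rule of setting the coefficient to zero when a sample contains only one gender are all hypothetical choices. Each SRS is drawn as the first $n$ positions of a random permutation, and the three regression-type estimators use $b$, $b^*$ (written `b_star`) and the population coefficient $\beta$.

```python
import numpy as np

def make_population(N, p_male=0.5, p_smoke=0.4, p_smoke_male=0.6):
    """Build 0/1 vectors y (smoker) and x (male) with the requested proportions."""
    n_male = int(round(N * p_male))
    n_smoke_male = int(round(n_male * p_smoke_male))
    n_smoke_female = int(round(N * p_smoke)) - n_smoke_male
    x = np.r_[np.ones(n_male), np.zeros(N - n_male)]
    y = np.r_[np.ones(n_smoke_male), np.zeros(n_male - n_smoke_male),
              np.ones(n_smoke_female), np.zeros(N - n_male - n_smoke_female)]
    return y, x

def regression_estimates(y_s, x_s, mu_x, var_x, beta):
    """Return [sample mean, estimate with b, estimate with b*, estimate with beta]."""
    ybar, xbar = y_s.mean(), x_s.mean()
    s_xy = np.cov(y_s, x_s, ddof=1)[0, 1]
    s_xx = x_s.var(ddof=1)
    # Assumption: with a single-gender sample the slope is undefined; use 0.
    b = s_xy / s_xx if s_xx > 0 else 0.0
    b_star = s_xy / var_x                      # known population variance of x
    diff = xbar - mu_x
    return np.array([ybar, ybar - b * diff, ybar - b_star * diff, ybar - beta * diff])

def simulate(y, x, n, reps=5000, seed=0):
    """MSE of the three adjusted estimators as a percentage of the sample-mean MSE."""
    rng = np.random.default_rng(seed)
    N = len(y)
    mu_y, mu_x = y.mean(), x.mean()
    var_x = x.var(ddof=1)
    beta = np.cov(y, x, ddof=1)[0, 1] / var_x
    sq_err = np.zeros(4)
    for _ in range(reps):
        idx = rng.permutation(N)[:n]           # first n positions of a permutation
        est = regression_estimates(y[idx], x[idx], mu_x, var_x, beta)
        sq_err += (est - mu_y) ** 2
    mse = sq_err / reps
    return 100.0 * mse[1:] / mse[0]

y, x = make_population(200, p_smoke_male=0.6)
print(simulate(y, x, n=25))                    # order: b, b*, beta
```

Values below 100 indicate a lower MSE than the simple sample prevalence. Varying `p_smoke_male` changes the relative risk of smoking for males and therefore the correlation between response and auxiliary variable.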

The simulation results are illustrated in Figures 1 and 2 and summarize the reduction in MSE that may result from accounting for auxiliary information. Figure 1 presents results when N=00 and n=5. Results were very similar to those in Figure 1 with larger N and larger size samples. Figure 2 presents results when N=50 and n=5. In both figures, notice that a reduction in the MSE will almost always occur when the auxiliary information is accounted for using a known regression coefficient. In the simulation, the two empirical estimators (based on $b$ and on $b^*$) had basically equivalent MSE in all settings, differing only slightly. Using the empirical estimators, a reduction in the MSE will not occur unless there is a sufficiently large difference (measured here by the relative risk of smoking for males) in response by the auxiliary variable. From Figure 1, a reduction in the MSE will occur when RR>.75, while from Figure 2, a reduction will occur when RR>.8.

10. EXAMPLE

As a simple example, suppose there is interest in estimating the smoking rate $\mu_Y = \pi_Y$ in a population of size $N$ based on a SRS with both smoking status and gender recorded on sample subjects. We assume that the proportion of males in the population, $\mu_X = \pi_X$, is known, and represent the sample estimate of the proportion smoking as $\bar{Y} = \hat{\pi}_Y$, the proportion male in the sample as $\bar{X} = \hat{\pi}_X$, and the proportion of male smokers in the sample as $\hat{\pi}_{XY}$. With this notation, $\hat{\sigma}_Y^2 = \hat{\pi}_Y(1 - \hat{\pi}_Y)$, $\hat{\sigma}_X^2 = \hat{\pi}_X(1 - \hat{\pi}_X)$, $\sigma_X^2 = \pi_X(1 - \pi_X)$ and $\hat{\sigma}_{XY} = \hat{\pi}_{XY} - \hat{\pi}_X\hat{\pi}_Y$. Using these estimators, we estimate $\beta$ by

$$b = \frac{\hat{\pi}_{XY} - \hat{\pi}_X\hat{\pi}_Y}{\hat{\pi}_X(1 - \hat{\pi}_X)}$$

or

$$b^* = \frac{\hat{\pi}_{XY} - \hat{\pi}_X\hat{\pi}_Y}{\pi_X(1 - \pi_X)}$$

and estimate $\rho$ by

$$\hat{\rho} = \frac{\hat{\pi}_{XY} - \hat{\pi}_X\hat{\pi}_Y}{\sqrt{\pi_X(1 - \pi_X)\,\hat{\pi}_Y(1 - \hat{\pi}_Y)}}.$$

Substituting these expressions into the estimator in (9) using $b^*$ results in

$$\hat{P} = \frac{1}{N}\left\{ n\hat{\pi}_Y + (N - n)\left[ \hat{\pi}_Y - \frac{N}{N-n}\,b^*\left( \hat{\pi}_X - \pi_X \right) \right] \right\} = \hat{\pi}_Y - \frac{\hat{\pi}_{XY} - \hat{\pi}_X\hat{\pi}_Y}{\pi_X(1 - \pi_X)}\left( \hat{\pi}_X - \pi_X \right)$$

with variance estimated by $\widehat{\mathrm{var}}(\hat{P}) = \left( 1 - \hat{\rho}^2 \right)(1 - f)\dfrac{\hat{\sigma}_Y^2}{n}$.

As a particular application, suppose that $N = 100$ and $n = 20$, so that $f = 0.20$, and assume that half the population is male, i.e., $\pi_X = 0.5$. Among the sample, suppose further that thirty percent smoke ($\hat{\pi}_Y = 0.3$), sixty percent are male ($\hat{\pi}_X = 0.6$), and that twenty-five percent of the sample subjects are male smokers ($\hat{\pi}_{XY} = 0.25$). Using this information,

$$b^* = \frac{0.25 - 0.6(0.3)}{0.5(1 - 0.5)} = 0.28, \qquad \hat{\rho} = \frac{0.25 - 0.6(0.3)}{\sqrt{0.5(1 - 0.5)\,0.3(1 - 0.3)}} = 0.3055,$$

and the proportion of smokers in the population is estimated by

$$\hat{P} = \frac{1}{100}\left\{ 6 + 80\left[ 0.3 - \frac{100}{80}(0.28)(0.6 - 0.5) \right] \right\} = \frac{1}{100}\left\{ 6 + 24 - 28(0.1) \right\} = \frac{1}{100}\left\{ 6 + 21.2 \right\} = 0.272.$$

Thus, rather than using the sample estimate of the smoking prevalence (i.e., $\hat{\pi}_Y = 0.3$), the estimate of smoking prevalence accounting for gender is given by $\hat{P} = 0.272$. To form this estimate, the predicted number of smokers among the non-sampled subjects is reduced from 24 to 21.2 to account for the larger percentage of male subjects in the sample (relative to the percentage of males in the population). The same estimate can be obtained by directly applying the regression results, $\hat{P} = \hat{\pi}_Y - b^*(\hat{\pi}_X - \pi_X)$, or $\hat{P} = 0.3 - 0.28(0.6 - 0.5) = 0.272$, but this expression does not reveal the essential prediction of non-sampled random variables underlying the method development. Using $b$ to estimate $\beta$, $\hat{P} = 0.271$, a result that corresponds to the simple weighted average prevalence of smoking using population gender weights. These estimators are identical to post-stratified estimators of prevalence rate as discussed in calibration (Deville and Särndal 1992) and model-assisted approaches (Särndal et al. 1992). The estimated standard error is given by $\sqrt{\widehat{\mathrm{var}}(\hat{P})} = 0.39$. The standard error is $\sqrt{1 - \hat{\rho}^2} = 0.952$ of the estimated standard error based solely on the sample (given by $\sqrt{\widehat{\mathrm{var}}(\hat{\pi}_Y)} = \sqrt{(1 - f)\,\hat{\pi}_Y(1 - \hat{\pi}_Y)/n}$).

11. DISCUSSION

The survey sampling literature has struggled to reconcile design-based and model-based theories of estimation and prediction. Model-based methods, recently popularized by Valliant, Dorfman and Royall (2000), have a theoretical structure based on the prediction-based methods developed by Royall (1973, 1976). This structure is important since it allows methods to be extended relatively easily to different applications with increasing complexity. The limitation of the theory is that it does not account for the sample design.

A similar theory has not previously emerged using design-based methods. Instead, a mixture of approaches such as GREG or calibration approaches (Särndal et al. 1992) has been developed. These approaches combine model-based and design-based ideas, or begin with ad hoc functional forms of estimators and optimize them in special settings. These approaches have been successful in addressing many practical problems in a design-based framework. However, they have not led to an approach that provides a consistent conceptual and theoretical base or that can be readily extended to applications with increasing complexity.

We have illustrated how the design-based random permutation model theory can be extended to include auxiliary variables in a straightforward manner. These results extend the scope of the random permutation model theory to a broader class of problems. Previous developments of the theory have identified subtleties in interpreting random effects in simple random sampling (Stanek, Singer and Lencina 2004) and developed predictors of realized random effects in balanced two stage sampling problems with response error (Stanek and Singer 2004). Current research is extending these results to clustered population settings where clusters are of different size and there is unequal probability sampling, and to settings where there is missing data. In each case, the same basic approach is considered, with estimators (or predictors) developed based on a clear optimization theory. The present results illustrate how the theory can be used to account for covariates. The fact that the results coincide with those developed by GREG or calibration approaches strengthens the appeal of the random permutation model approach as a design-based competitor to the model-based superpopulation theory.

Still, much more work is needed to extend the methods. Extensions are being developed to more complex settings, including two stage designs with cluster and unit covariates, longitudinal studies, and settings with randomization of units to treatments. We consider the basic results developed here to provide a foundation for additional work in these directions.

REFERENCES

Bolfarine H. and S. Zacks (1992). Prediction Theory for Finite Populations. New York, Springer-Verlag.
Brewer K. R. W. (1995). Combining design-based and model-based inference (Chapter 30). Business Survey Methods. B. G. Cox, D. A. Binder, B. N. Chinnappa, A. Christianson, M. J. Colledge and P. S. Kott (eds.). New York, John Wiley & Sons Inc.: 589-606.
Brewer K. R. W. (1999). "Cosmetic calibration with unequal probability sampling." Survey Methodology 25: 205-212.
Brewer K. R. W. (1999). "Design-based or prediction-based inference? Stratified random vs stratified balanced sampling." International Statistical Review 67(1): 35-47.

Brewer K. R. W., M. Hanif et al. (1988). "How Nearly Can Model-based Prediction and Design-based Estimation Be Reconciled?" Journal of the American Statistical Association 83: 128-132.
Cassel C. M., C.-E. Särndal et al. (1977). Foundations of Inference in Survey Sampling. New York, NY, Wiley.
Cassel C. M., C. E. Särndal et al. (1976). "Some Results on Generalized Difference Estimation and Generalized Regression Estimation for Finite Populations." Biometrika 63: 615-620.
Cochran W. G. (1977). Sampling Techniques. New York, John Wiley and Sons.
Deville J. C. and C. E. Särndal (1992). "Calibration Estimators in Survey Sampling." Journal of the American Statistical Association 87: 376-382.
Ghosh M. and J. N. K. Rao (1994). "Small area estimation: an appraisal." Statistical Science 9(1): 55-76.
Graybill F. A. (1976). Theory and Application of the Linear Model. Belmont, CA, Wadsworth Publishing Company Inc.
Horvitz D. G. and D. J. Thompson (1952). "A generalization of sampling without replacement from a finite universe." Journal of the American Statistical Association 47(260): 663-685.
Montanari G. E. (1987). "Post-sampling Efficient QR-prediction in Large-sample Surveys." International Statistical Review 55: 191-202.
Rao C. R. (1967). Least squares theory using an estimated dispersion matrix and its application to measurement of signals. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA.
Rao J. N. K. (1997). "Developments in sample survey theory: an appraisal." Canadian Journal of Statistics 25(1): 1-21.
Rao J. N. K. (1999). "Some Current Trends in Sample Survey Theory and Methods." Sankhya: The Indian Journal of Statistics, Series B 61(1): 1-57.
Royall R. M. (1973). The prediction approach to finite population sampling theory: application to the hospital discharge survey. Rockville, Md. DHEW No. (HSM) 73-1329: 1-31.
Royall R. M. (1976). "The linear least-squares prediction approach to two-stage sampling." Journal of the American Statistical Association 71: 657-664.
Särndal C. E., B. Swensson et al. (1992). Model Assisted Survey Sampling. New York, Springer-Verlag.
Särndal C. E. and R. L. Wright (1984). "Cosmetic Form of Estimators in Survey Sampling." Scandinavian Journal of Statistics 11: 146-156.
Stanek E. J., J. d. M. Singer et al. (2004). "A unified approach to estimation and prediction under simple random sampling." Journal of Statistical Planning and Inference 121: 325-338.
Stanek E. J. and J. M. Singer (2004). "Predicting Random Effects from Finite Population Clustered Samples with Response Error." Journal of the American Statistical Association 99(468): 1119-1130.
Valliant R., A. H. Dorfman et al. (2000). Finite Population Sampling and Inference: A Prediction Approach. New York, John Wiley & Sons.
Wu C. B. and R. R. Sitter (2001). "A model-calibration approach to using complete auxiliary information from survey data." Journal of the American Statistical Association 96(453): 185-193.
Zellner A. (1963). "Estimators of Seemingly Unrelated Regression Equations: Some Exact Finite Sample Results." Journal of the American Statistical Association 58: 977-992.

Figure 1. Percent Change in MSE of Smoking Prevalence Estimates Accounting for Gender, by Relative Risk of Smoking (N=00, n=5). [Horizontal axis: Relative Risk of Smoking for Males. Vertical axis: % Change in MSE relative to Sample Prevalence. Series: Estimate all Var Comp.: b; Estimate XY Cov only: b*; Assume all Var Known: Beta.]

Figure 2. Percent Change in MSE of Smoking Prevalence Estimates Accounting for Gender, by Relative Risk of Smoking (N=50, n=5). [Horizontal axis: Relative Risk of Smoking for Males. Vertical axis: % Change in MSE relative to Sample Prevalence. Series: Estimate all Var Comp.: b; Estimate XY Cov only: b*; Assume all Var Known: Beta.]

APPENDIX. PROOF OF RESULTS (11) AND (12)

When there are $p$ auxiliary variables, a simultaneous random permutation model can be defined similar to (1) with $Z = \mathrm{vec}(Y\ X_1\ \cdots\ X_p)$ and $G = I_{p+1} \otimes 1_N$. When $\boldsymbol{\mu}_X$ is known, let $X^* = P_{N,N}X$, where $X = (X_1\ X_2\ \cdots\ X_p)$. The transformation can be represented as $Z^* = RZ$, where

$$R = \begin{pmatrix} I_N & 0 \\ 0 & I_p \otimes P_{N,N} \end{pmatrix},$$

so that $RG\mu = G^*\mu_Y$ with $G^* = (1_N'\ 0_{Np}')'$, and $E^* = RE$. The reparameterized model that incorporates the constraints thus has a form identical to Model (2). To represent the SRS sampling, let

$$K = \begin{pmatrix} I_{p+1} \otimes (I_n\ \ 0_{n \times (N-n)}) \\ I_{p+1} \otimes (0_{(N-n) \times n}\ \ I_{N-n}) \end{pmatrix},$$

and thus the partitioned simultaneous permutation model takes the same form as (3),

$$\begin{pmatrix} Z_I^* \\ Z_{II}^* \end{pmatrix} = \begin{pmatrix} G_I^* \\ G_{II}^* \end{pmatrix}\mu_Y + E^*$$

where $G_I^* = (1_n'\ 0_{np}')'$ and $G_{II}^* = (1_{N-n}'\ 0_{(N-n)p}')'$. A linear function of the random variables can be defined similar to (4) with $C = c(1_N'\ 0_{Np}')'$, $C_I = c(1_n'\ 0_{np}')'$ and $C_{II} = c(1_{N-n}'\ 0_{(N-n)p}')'$. The estimator of $C_{II}'Z_{II}^*$ can then be defined as a linear function of the sample, $w'Z_I^*$, where $w = (w_Y'\ w_1'\ \cdots\ w_p')'$ is a $(p+1)n \times 1$ vector of coefficients. With these, a linear unbiased minimum variance estimator can be derived by minimizing the function

$$\Phi(w) = w'V_I w - 2w'V_{I,II}C_{II} + 2\lambda\left( w'G_I^* - C_{II}'G_{II}^* \right).$$

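Specifically, differentiating the above function with respect to $w$ and $\lambda$ and setting the derivatives to zero results in the following estimating equations

$$\begin{pmatrix} V_I & G_I^* \\ G_I^{*\prime} & 0 \end{pmatrix}\begin{pmatrix} \hat{w} \\ \hat{\lambda} \end{pmatrix} = \begin{pmatrix} V_{I,II}C_{II} \\ G_{II}^{*\prime}C_{II} \end{pmatrix}$$

where $G_I^* = (1_n'\ 0_{np}')'$ and $G_{II}^* = (1_{N-n}'\ 0_{(N-n)p}')'$. The unique solution is

$$\hat{w} = V_I^{-1}\left[ V_{I,II} + G_I^*\left( G_I^{*\prime}V_I^{-1}G_I^* \right)^{-1}\left( G_{II}^{*\prime} - G_I^{*\prime}V_I^{-1}V_{I,II} \right) \right]C_{II}. \tag{A.1}$$

Since $V_I = \Sigma \otimes P_{n,N}$, $V_{I,II} = -\frac{1}{N}\Sigma \otimes J_{n \times (N-n)}$, $P_{n,N}^{-1}1_n = \frac{N}{N-n}1_n$, and $G_I^* = u \otimes 1_n$ with $u = (1\ 0_p')'$, we have the following useful identities:

$$G_I^{*\prime}V_I^{-1}G_I^* = \frac{nN}{N-n}\,u'\Sigma^{-1}u \qquad \text{and} \qquad V_I^{-1}V_{I,II} = -\frac{1}{N-n}\,I_{p+1} \otimes J_{n \times (N-n)}.$$

Subsequently, noting that $\left( G_{II}^{*\prime} - G_I^{*\prime}V_I^{-1}V_{I,II} \right)C_{II} = c(N-n) + cn = 1$, (A.1) can be simplified as follows:

$$\hat{w} = -\frac{1}{N-n}\left( I_{p+1} \otimes J_{n \times (N-n)} \right)C_{II} + \frac{N-n}{nN}\left( u'\Sigma^{-1}u \right)^{-1}\left( \Sigma^{-1}u \otimes P_{n,N}^{-1}1_n \right) = -c\,(u \otimes 1_n) + \frac{1}{n}\left( u'\Sigma^{-1}u \right)^{-1}\left( \Sigma^{-1}u \otimes 1_n \right).$$

Since

$$\Sigma^{-1} = \begin{pmatrix} \left( \sigma_Y^2 - \boldsymbol{\sigma}_{XY}'\Sigma_X^{-1}\boldsymbol{\sigma}_{XY} \right)^{-1} & -\left( \sigma_Y^2 - \boldsymbol{\sigma}_{XY}'\Sigma_X^{-1}\boldsymbol{\sigma}_{XY} \right)^{-1}\boldsymbol{\sigma}_{XY}'\Sigma_X^{-1} \\ -\Sigma_X^{-1}\boldsymbol{\sigma}_{XY}\left( \sigma_Y^2 - \boldsymbol{\sigma}_{XY}'\Sigma_X^{-1}\boldsymbol{\sigma}_{XY} \right)^{-1} & \left( \Sigma_X - \boldsymbol{\sigma}_{XY}\sigma_Y^{-2}\boldsymbol{\sigma}_{XY}' \right)^{-1} \end{pmatrix},$$

it follows that

$$\left( u'\Sigma^{-1}u \right)^{-1}\Sigma^{-1}u = \left( \sigma_Y^2 - \boldsymbol{\sigma}_{XY}'\Sigma_X^{-1}\boldsymbol{\sigma}_{XY} \right)\begin{pmatrix} \left( \sigma_Y^2 - \boldsymbol{\sigma}_{XY}'\Sigma_X^{-1}\boldsymbol{\sigma}_{XY} \right)^{-1} \\ -\Sigma_X^{-1}\boldsymbol{\sigma}_{XY}\left( \sigma_Y^2 - \boldsymbol{\sigma}_{XY}'\Sigma_X^{-1}\boldsymbol{\sigma}_{XY} \right)^{-1} \end{pmatrix} = \begin{pmatrix} 1 \\ -\Sigma_X^{-1}\boldsymbol{\sigma}_{XY} \end{pmatrix} = \begin{pmatrix} 1 \\ -\boldsymbol{\beta} \end{pmatrix}$$

where $\boldsymbol{\beta} = \Sigma_X^{-1}\boldsymbol{\sigma}_{XY}$. Therefore we have

$$\hat{w} = \frac{(N-n)c}{n}\begin{pmatrix} 1_n \\ -\dfrac{N}{N-n}\,\boldsymbol{\beta} \otimes 1_n \end{pmatrix}.$$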

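Consequently,

$$\hat{P} = (C_I + \hat{w})'Z_I^* = c\sum_{i=1}^{n} Y_i + c\sum_{i=n+1}^{N}\left[ \bar{Y} - \frac{N}{N-n}\,\boldsymbol{\beta}'\left( \bar{X} - \boldsymbol{\mu}_X \right) \right] = f\bar{Y} + (1-f)\left[ \bar{Y} - \frac{1}{1-f}\,\boldsymbol{\beta}'\left( \bar{X} - \boldsymbol{\mu}_X \right) \right] = \bar{Y} - \boldsymbol{\beta}'\left( \bar{X} - \boldsymbol{\mu}_X \right)$$

and

$$\mathrm{var}(\hat{P}) = (C_I + \hat{w})'V_I(C_I + \hat{w}) = \left( 1 - \rho_X^2 \right)(1-f)\frac{\sigma_Y^2}{n}$$

where $\rho_X^2 = \boldsymbol{\sigma}_{XY}'\Sigma_X^{-1}\boldsymbol{\sigma}_{XY}/\sigma_Y^2$ is the squared multiple correlation coefficient of $Y$ on $X$.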
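The closed-form coefficients can be checked numerically against a direct solution of the estimating equations. The sketch below is only an illustrative check under the covariance structure stated in the Appendix ($V_I = \Sigma \otimes P_{n,N}$, $V_{I,II} = -\frac{1}{N}\Sigma \otimes J_{n \times (N-n)}$); the function names and the example covariance matrix are hypothetical.

```python
import numpy as np

def blup_weights_numeric(Sigma, N, n):
    """Solve [[V_I, G_I], [G_I', 0]] (w, lambda)' = (V_I_II C_II, G_II' C_II)' for w."""
    p1 = Sigma.shape[0]                                   # p + 1
    c = 1.0 / N
    u = np.zeros(p1)
    u[0] = 1.0
    P_n = np.eye(n) - np.ones((n, n)) / N                 # P_{n,N}
    V_I = np.kron(Sigma, P_n)
    V_I_II = -np.kron(Sigma, np.ones((n, N - n))) / N
    G_I = np.kron(u, np.ones(n))                          # (1_n' 0_{np}')'
    G_II = np.kron(u, np.ones(N - n))
    C_II = c * G_II
    m = p1 * n
    A = np.zeros((m + 1, m + 1))
    A[:m, :m] = V_I
    A[:m, m] = G_I
    A[m, :m] = G_I
    rhs = np.r_[V_I_II @ C_II, G_II @ C_II]
    return np.linalg.solve(A, rhs)[:m]                    # drop the multiplier

def blup_weights_closed_form(Sigma, N, n):
    """Closed form: ((1-f)/n) * (1_n', -(1/(1-f)) (beta kron 1_n)')'."""
    f = n / N
    beta = np.linalg.solve(Sigma[1:, 1:], Sigma[1:, 0])   # Sigma_X^{-1} sigma_XY
    coef = np.r_[1.0, -beta / (1.0 - f)]
    return (1.0 - f) / n * np.kron(coef, np.ones(n))

# Hypothetical positive definite Sigma with p = 2 auxiliary variables.
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
w_num = blup_weights_numeric(Sigma, N=10, n=4)
w_cf = blup_weights_closed_form(Sigma, N=10, n=4)
print(np.allclose(w_num, w_cf))
```

Solving the bordered system directly avoids forming $V_I^{-1}$ explicitly, which is why it serves as an independent check on the algebraic simplification.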