The Role of Models in Model-Assisted and Model- Dependent Estimation for Domains and Small Areas

The Role of Moels in Moel-Assiste an Moel- Depenent Estimation for Domains an Small Areas Risto Lehtonen University of Helsini Mio Myrsylä University of Pennsylvania Carl-Eri Särnal University of Montreal Ari Veijanen Statistics Finlan Abstract This paper investigates omain total estimation for moel-assiste generalize regression (GREG) an moel-epenent EBLUP estimators uner probability proportional to size (PPS) sampling. Two particular issues are aresse: (i) how to account for the omain ifferences in the moel formulation, an (ii) how to account for the PPS esign. Results are base on Monte Carlo experiments. In the experiments, the bias of GREG estimator remaine negligible for all moels, an accuracy improve when the PPS size variable was inclue in the moel. For EBLUP, bias was large for wea moels, but the bias ecrease substantially when the PPS size variable was inclue in the moel. Thus ouble-use of the auxiliary information seeme profitable. As an alternative to the classic unweighte EBLUP, we propose a new weighte EBLUP estimator for PPS esigns. In the experiments the weighte EBLUP behave much better than the unweighte EBLUP. email: myrsylm@sas.upenn.eu

Introuction Estimation of reliable statistics for population sub-groups or omains constitutes an area of increasing importance in emography. Examples where omain level estimation is neee inclue prouction of official population statistics by emographic an geographic status, estimation of isease prevalence rates in sub-groups of the population, an poverty mapping. Typically, a survey is planne to prouce reliable statistics for the entire population an large or major areas. Stanar esign-base irect estimators, such as the Horvitz-Thompson estimator, are often use for such cases. The tas can become challenging when the number of sample elements in a number of omains remains small or minor. In this case, more avance methos that effectively use the auxiliary information are neee. The auxiliary information may be obtaine, for instance, from previous surveys, from aministrative registries, or from census ata sources. Methos available for the estimation of totals for omains an small areas inclue moel-assiste esignbase estimators, referring to the family of generalize regression (GREG) estimators (Särnal, Swensson an Wretman 99, Estevao an Särnal 999, 004), an moel-epenent techniques, such as the EBLUP estimator (Empirical Best Linear Unbiase Preictor) an synthetic estimators (Ghosh 00, Rao 003). Properties of these estimator types are iscusse for example in Lehtonen an Veijanen (998, 999) an Lehtonen, Veijanen an Särnal (003, 005). The ocumentation of the EURAREA project inclues useful comparative materials on properties of moel-epenent estimators (EURAREA Consortium 004, Heay an Ralphs 005). Known esign-base properties relate to bias, precision an accuracy of moel-assiste estimators an moel-epenent estimators are summarize in Table. Moel-assiste estimators are approximately esign-unbiase by efinition, but their variance can become large in omains where the sample size is small. Moel-epenent estimators are esign-biase: the bias can be large for omains where the moel oes not fit well. The variance of a moel-epenent estimator can be small even for small omains, but the accuracy tens to be poor because the square bias often ominates the mean square error (MSE), as shown for example by Lehtonen, Veijanen an Särnal (003 an 005). The ominance of the bias component together with a small variance can cause poor coverage rates an invali confience intervals for a moelepenent estimator. For moel-assiste esign-base estimators, on the other han, vali confience intervals can be constructe. Typically, moel-assiste estimators are use for major or not-so-small omains an moel-epenent estimators are use for small omains where moel-assiste estimators can fail.

3 Table. Design-base properties of moel-assiste an moel-epenent estimators for omains an small areas. Design bias Precision (Variance) Accuracy (Mean Square Error, MSE) Confience Intervals Design-base moel-assiste methos - GREG family Design unbiase (approximately) by the construction principle Variance may be large for small omains Variance tens to ecrease with increasing omain sample size MSE = Variance (or nearly so) Vali intervals can be constructe Moel-epenent methos SYN an EBLUP Design biase Bias oes not necessarily approach zero with increasing omain sample size Variance can be small even for small omains Variance tens to ecrease with increasing omain sample size MSE = Variance + square Bias Accuracy can be poor if the bias is substantial Vali intervals not necessarily obtaine Researcher often faces challenging methoological choices when aiming at reliable estimation of population totals for omains an small areas. These choices inclue, for example, the inferential framewor, moel type (mathematical form, specification, parametrization, estimation of moel parameters), an estimator type (point estimator, estimator of variance or MSE) for the unnown omain totals. Relate to the problem of moel choice, or the role of the moel in moel-assiste estimators an in moel-epenent estimators, the two questions of special interest in this stuy are: (i) (ii) How to account for the omain ifferences in the moel formulation (relevant for moelassiste estimators in particular)? How to account for the unerlying unequal probability sampling esign (relevant for moelepenent estimators in particular)? We iscuss points (i) an (ii) to some extent from a esign-base perspective, uner the fixe finite population approach. More specifically, we compare the relative performance (bias an accuracy) of the two estimator types of omain totals, GREG, an EBLUP, uner ifferent moel choices. A continuous response variable is assume. In the construction of moels we use both linear fixe-effects moels an linear mixe moels, where ranom effects are inclue in aition to the fixe effects. We fit the linear moels with ifferent parametrizations. In the estimation of the moel parameters, we use both weighte an unweighte estimation proceures. An unerlying unequal probability sampling esign is assume. The case of unequal probability sampling is of importance for practical purposes in official statistics an many fiels of empirical research. Withoutreplacement type fixe-size Probability Proportional to Size sampling (systematic PPS) was selecte to represent an example of an unequal probability sampling esign. This stuy extens the case of equal probability sampling investigate in Lehtonen, Särnal an Veijanen (003, 005) to unequal probability sampling esigns. The woring paper is organize as follows. Chapter introuces our notation an moels an estimators use. Results for GREG an EBLUP estimators are given in Chapter 3. Conclusions are in Chapter 4.

4 Methos. Moels an estimators of omain totals We are intereste in the estimation of totals of a continuous response variable y for the omains of interest. Availability of powerful auxiliary information is essential for the estimators of omain totals consiere. We assume that we have access to unit-level ata, which inclue omain membership inicators an vectors of auxiliary x-variables, for all units in the population. The auxiliary ata vector also contains the size variable use in the PPS sampling proceure. The auxiliary ata are incorporate in the estimation proceure by an appropriate moel. Thus, the choice of the moel that unerlies the GREG, SYN an EBLUP estimators of omain totals is consiere important. Our question (i) was How to account for the omain ifferences in the moel formulation?. The omain ifferences can be accounte for by a proper moel formulation. Basically, there are two options to facilitate the omain ifferences: () introuction of omain-specific fixe effects in the moel, an () accounting for the omain ifferences by omain-specific ranom effects, such as ranom intercepts. It is obvious that these options are relevant for moel-assiste estimators in particular. The reason is that in a stanar GREG setting, a fixe-effects linear moel is routinely use as the assisting moel (Estevao an Särnal 999, 004), an a GREG estimator that uses a mixe moel, the MGREG estimator, has been introuce only recently (Lehtonen an Veijanen 999, Lehtonen et al. 003, see also Golstein 003, p. 65). On the other han, a mixe moel formulation has a long traition in the context of EBLUP estimation of small area totals (Fay an Herriot 979, Rao 003). The problem of moel choice is iscusse in a more general spirit in Firth an Bennett (998). To throw some light on question (ii) How to account for the unerlying unequal probability sampling esign?, we stuy the ifferent options to incorporate the information of the sampling esign into the estimation proceure. In the moelling phase, there are two main options to account for the sampling esign: (a) the incorporation of sampling weights in the estimation of moel parameters, an (b) the inclusion of sampling esign variables as aitional covariates in the moel. By efault, sampling weights are incorporate in the estimation proceures for all assisting moels of GREG estimators. As a rule, sampling weights are ignore in the estimation proceures for SYN estimators. Typically, the unerlying mixe moel of a stanar EBLUP estimator is fitte in an unweighte manner. Rao (003) introuce a pseuo EBUP estimator, where sampling weights are inclue in the construction of the EBLUP estimator, but the parameters of the mixe moel are estimate by unweighte techniques. As an alternative to the unweighte EBLUP an pseuo EBLUP, we will introuce a new EBLUP estimator, where sampling weights are incorporate in the estimation of parameters of the unerlying mixe moel. We will also compare options (a) an (b) in their successfulness in accounting for the sampling esign. It is obvious that these options are relevant for EBLUP estimators in particular. We stuy the bias an accuracy properties of the estimators of omain totals by empirical methos. Our Monte Carlo simulation experiments consiste of repeate raws of systematic PPS samples from an artificially constructe fixe finite population. Table shows the moel-epenent an moel-assiste estimators to be iscusse, in a two-way arrangement by estimator type an by moel choice. Each of the rows correspons to a ifferent moel choice. CC moel (common intercepts, common slopes) is one whose only parameters are fixe effects

5 efine at the population level; it contains no omain specific parameters. We obtain SYN-CC an GREG- CC estimators. SC moel (separate intercepts, common slopes) is one having at least some of its parameters or effects efine at the omain level. These are fixe effects for SYN-SC an GREG-SC an ranom effects for EBLUP-SC, EBLUPW-SC an MGREG-SC. Table also shows the estimation methos that are use in the estimation of moel parameters. To aress points (i) an (ii) of Chapter, we iscuss in more etail GREG-SC an MGREG-SC for GREG family estimators an EBLUP-SC an EBLUPW-SC for EBLUP family estimators. Table. Schematic presentation of the moel-epenent an moel-assiste estimators of omain totals for a continuous response variable by moel choice an estimator type, uner unequal probability sampling. Moel choice Estimator type Moel abbreviatio n Moel specification Effect type Estimation of moel parameters Moelepenent estimators Moel-assiste Estimators CC SC Common intercepts Common slopes Separate intercepts Common slopes Fixe effects Fixe effects Fixe an ranom OLS SYN-CC Not applicable(**) WLS Not applicable(*) GREG-CC OLS SYN-SC Not applicable (**) WLS Not GREG-SC applicable(*) REML EBLUP-SC Not GLS applicable (**) Weighte REML EBLUPW-SC MGREG-SC GWLS OLS Orinary least squares WLS Weighte least squares (sampling weights) GLS Generalize least squares GWLS Generalize weighte least squares (sampling weights) REML Restricte (resiual) maximum lielihoo Weighte REML Restricte pseuo maximum lielihoo (sampling weights) (*) In SYN, weights are ignore in the estimation proceure by efault. (**) In GREG, weights are incorporate in the estimation proceure by efault. We next introuce the notation use in this stuy. Population an sampling esign { } U =,,...,,..., N Population (fixe, finite) U,..., U,..., U Domains of interest (non-overlapping) Y = y, =,..., D Target parameters (omain totals) x U D = ( x,..., x ) Auxiliary variable vector p I = if U Domain membership inicators, I = 0 otherwise =,..., D Note that we assume the vector value x an omain membership to be nown for every U.

6 Sampling esign: Systematic PPS with sample size n s Sample from U s = s U Ranom part of s falling in omain a π n Inclusion probability for U = / π x = x U Sampling weight for s We observe y for s. Note that for estimation purposes, sample ata an auxiliary ata are merge at the micro level by using unique ID eys that are available in both ata sources. Moels for continuous response y Linear fixe-effects moels CC moels y = β + β x +... + β x + ε 0 p p SC moels y = β + β x +... + β x + ε, =,..., D 0 p p Fitte values uner fixe-effects moels yˆ Linear mixe moels = x βˆ SC moels y = β + u + β x +... + β x + ε, =,..., D where u 0 p p are omain-specific ranom intercepts Fitte values uner mixe moels yˆ = x βˆ + uˆ, =,..., D Note that fitte values y ˆ are calculate for every U. Estimators of omain totals The preictions { yˆ ; U } specification, the estimator of the omain total Y types (SYN, GREG, EBLUP): Moel-assiste GREG estimators Yˆ = yˆ + a ( y yˆ ) GREG U s Moel-epenent SYN estimators Yˆ = yˆ SYN Moel-epenent EBLUP estimators Yˆ = y + yˆ where =,..., D. U EBLUP s U s iffer from one moel specification to another. For a given moel = y has the following structure for the three estimator U Note that Y ˆSYN an Y ˆEBLUP rely heavily on the truth of the moel, an can be biase if the moel is misspecifie. On the other han, Y ˆGREG has a secon term that protects against moel misspecification. We aopt the following conventions (Table ). In SYN-CC, SYN-SC, GREG-CC an GREG-SC, a fixeeffects moel formulation is assume. A mixe moel is assigne for EBLUP-SC, EBLUPW-SC an MGREG-SC estimators.

7 Measures use in Monte Carlo simulations In Monte Carlo simulation experiments, by using estimates Y ˆ ( s ) from repeate samples s ; v =,,..., K, we compute for each omain =,..., D the following Monte Carlo summary measures of bias an accuracy. (i) Absolute relative bias (ARB), efine as the ratio of the absolute value of bias to the true value: v v K Yˆ ( s ) Y / Y v K v = (ii) Relative root mean square error (RRMSE), efine as the ratio of the root MSE to the true value: ˆ K ( Y ( sv ) Y ) / Y K v = Details of the simulations There were 00 omains in the population. The size of omain was proportional to exp( q ), where q was simulate from U(0,.9). Each observation was allocate to a omain by geometric probability: intervals of length exp( q ) were concatenate an a ranom point was chosen in (0, exp( q ) ). The interval containing the point etermine the omain of the observation. There were 47 minor omains, 9 meium-size omains an 34 major omains in the population. These three classes were efine on the basis of expecte sample size n( N / N ) : less than 70, 70-9 an 0 or more units, respectively. The smallest omain of the generate population ha,7 units an the largest ha 8,96. The variable x is the size variable use in PPS sampling. The variable was simulate from uniform istribution U(,). Another auxiliary variable xwas simulate from N(0,9). The ranom effects u were simulate inepenently from N(0,0.5). The error term ε followe N(0,). Responses were simulate as y = + x +.5 x + u + ε ( ) Correlations of the variables in the population were: corr( y, x ) = 0.779, corr( y, x ) = 0.607 an corr( x, x ) = 0.00. Domain means of the response variable were approximately equal, but the totals iffere consierably: The means of omain totals were 45,64 for minor omains, 7,308 for meium omains an 4,57 for major omains. Our population size is N =,000,000 an sample size n = 0,000. In Monte Carlo experiments, K = 000 inepenent systematic PPS samples were generate. The inclusion probabilities are π = nx / x. The weights a = / π varie between 54.6 an 596.5.

8 3 Results 3. GREG estimators We first iscuss results for GREG estimators. Our point (i) evote to GREG was How to account for the omain ifferences in the moel formulation. This is emonstrate by the eight ifferent moel formulations in Table 3. In moels A, B, C an D, the omain ifferences are accounte for by omainspecific fixe effects β 0. In moels A, B, C an D, we use ranom intercepts 0 u the fixe intercept common for all omains, an the ranom term β +, where β 0 is u is omain-specific. In aition, we x, an have two explanatory variables at our isposal: the variable x, the size variable in PPS esign, an auxiliary variable uncorrelate to x. Note that both variables correlate quite strongly with the response variable y. For x an x, slope parameters β an β are common fixe effects for all omains. For GREG, we incorporate the sampling weights in the estimation proceure of moel parameters, incluing the mixe moel unerlying the MGREG-SC estimator. This facilitates the conition of internal bias calibration (a proper combination of moel formulation an estimation proceure uner a given sampling esign) propose by Firth an Bennett (998). Table 3 also shows our moel builing strategy. We start with simple moels A an A an procee towars the population generating moel D. In all moels, GREG family estimators are essentially unbiase, an a fixe-effects an a mixe moel formulations yiel similar accuracy. An explanation for this observation is that in the simulation setting, the average levels of the response i not vary much over the omains. Best accuracy (excluing the true moel) is for moels where the PPS size variable x is inclue. This emonstrates the accuracy gains attaine from the ouble-use of x both in the sampling esign an in the estimation esign; see also Särnal (996). We also note that accuracy ifferences between the ifferent GREG estimators are substantial, an accuracy improves with increasing the omain sample size. Table 3. Average absolute relative bias ARB (%) an average relative root mean square error RRMSE (%) of GREG estimators for minor, meium-size an major omains of the generate population. Moel an estimator Minor (0-69) y = β + ε Average ARB (%) Average RRMSE (%) Domain size class Domain size class Meium Major Minor Meium (70-9) (0+) (0-69) (70-9) Major (0+) Moel A 0 GREG-SC.4 0.5 0.3 3.7 8. 5.7 Moel A y = β0 + u + ε MGREG-SC 0. 0. 0. 3.7 8. 5.6 Moel B y = β0 + βx + ε GREG-SC 0. 0. 0.0 7.8 4.6 3. Moel B y = β0 + u + βx + ε MGREG-SC 0. 0. 0.0 7.8 4.6 3.3 Moel C y = β0 + βx + ε GREG-SC.4 0.5 0.3.6 6.8 4.8 Moel C y = β0 + u + βx + ε MGREG-SC 0. 0. 0..6 6.8 4.7 Moel D y = β0 + βx + βx + ε GREG-SC 0.0 0.0 0.0.7.0 0.7 Moel D y = β0 + u + βx + βx + ε (Population generating moel) MGREG-SC 0.0 0.0 0.0.7.0 0.7 Variables x Size variable in PPS sampling, x Auxiliary variable

9 3. EBLUP estimators For estimators of the EBLUP family, we ase How to account for the unerlying unequal probability sampling esign?. We propose two options for this purpose: (a) the incorporation of sampling weights in the estimation of moel parameters, an (b) the inclusion of sampling esign variables as aitional covariates in the moel. We compare unweighte an weighte EBLUP estimators constructe with four mixe moel formulations. Moel A inclues a ranom intercept, variable x is inclue in Moel B, variable x is inclue in Moel C an both variables appear in the population generating moel D. Similarly as for GREG, omain ifferences are accounte for by ranom intercept terms, an slope parameters are common for all omains. For all moels (except D), EBLUP estimators are calculate with unweighte an weighte estimation of moel parameters. For Moels A an C, unweighte estimators EBLUP-SC are seriously biase. For these moels, the PPS sampling esign is not accounte for. The bias eclines consierably when the sampling weights are incorporate in the estimation of the mixe moel, as shown by the new EBLUPW-SC estimators for Moels A an C. The unweighte estimator EBLUP-SC uner Moel B shows best bias behaviour, inicating that the inclusion of the PPS size variable in the moel can offer a powerful tool for bias reuction for EBLUP family estimators. Use of both weighting an the inclusion of x in the moel appears to be less powerful. Accuracy behaviour of all EBLUP estimators is infecte by the ominance of the square bias component in the MSE, as inicate by the RRMSE figures. This hols for all three omain size classes. Because of large bias an small variance, invali confience intervals can be obtaine. This means that point estimates can be systematically far away from the true value, inepenently of the omain sample size. In aition, accuracy oes not improve much with increasing the omain sample size. Table 4. Average absolute relative bias ARB (%) an average relative root mean square error RRMSE (%) of EBLUP estimators for minor, meium-size an major omains of the generate population. Moel an estimator Minor (0-69) = β + u + ε Average ARB (%) Average RRMSE (%) Domain size class Domain size class Meium Major Minor Meium (70-9) (0+) (0-69) (70-9) Major (0+) Moel A y 0 EBLUP-SC.9 3..7.9 3.3.8 EBLUPW-SC 3.7 3.5 3.3 3.9 3.6 3.5 y = + u + x + Moel B β0 β ε EBLUP-SC.8.4 0.7.8.5. EBLUPW-SC 3.5 3.5 3.3 3.5 3.6 3.3 y = + u + x + Moel C β0 β ε EBLUP-SC.3 3..8.4 3..9 EBLUPW-SC 3.7 3.6 3. 3.9 3.7 3.3 y = + u + x + x + (Population generating moel) Moel D β0 β β ε EBLUP-SC 0.3 0. 0.0.3 0.8 0.6 Variables x Size variable in PPS sampling, x Auxiliary variable

0 4 Conclusions Results inicate that uner unequal probability sampling, moel-assiste GREG family estimators are quite insensitive to the moel choice, a property also shown in our previous research to hol uner SRSWOR. Moel formulation an the estimation strategy of the moel are critical for moel-epenent EBLUP family estimators. This is especially true when using EBLUP for unequal sampling esigns. Bias of GREG estimators remaine negligible for all moel choices. Double-use of the same auxiliary information, that is, the use of the size variable in the PPS sampling esign an in the assisting moel, appeare to be beneficial with respect to accuracy. The accuracy improve with increasing the omain sample size. In this case, the mixe moel formulation i not outperform the fixe-effects moel formulation. For moel-epenent EBLUP family estimators, the bias can be large for a misspecifie moel. The PPS sampling esign coul be accounte for with two options, by the inclusion of the PPS size variable in the mixe moel, or by the use of the weighte version of the EBLUP estimator, where the sampling weights are incorporate in the estimation proceure of moel parameters. Of these two options, the first one appeare to be more effective, proucing an EBLUP estimator with small bias an goo accuracy. However, for both options, the square bias component can still ominate the MSE, even in minor omains, tening to invaliate the construction of proper confience intervals. Dominance of the bias component also can cause that the accuracy oes not show improvement, when increasing the omain sample size.

References Estevao, V.M. an Särnal, C.-E. (999) The use of auxiliary information in esign-base estimation for omains. Survey Methoology, 5, 3. Estevao, V.M. an Särnal, C.-E. (004) Borrowing strength is not the best technique within a wie class of esign-consistent omain estimators. Journal of Official Statistics, 0, 645 669. EURAREA Consortium (004) Project Reference Volume, Parts, an 3 (PDF). Website: www.statistics.gov.u/eurarea/ Fay, R.E. an Herriot, R.A. (979) Estimation of income for small places: an application of James-Stein proceures to census ata. Journal of the American Statistical Association, 74, 69 77. Firth, D. an Bennett, K.E. (998) Robust moels in probability sampling. (With iscussion) Journal of the Royal Statistical Society, B, 60, 3 56. Ghosh, M. (00) Moel-epenent small area estimation: theory an practice. In: Lehtonen an Djerf K. (es) Lecture Notes on Estimation for Population Domains an Small Areas. Helsini: Statistics Finlan, Reviews 00/5, 5 08. Golstein, H. (003) Multilevel Statistical Moels. Thir Eition. Lonon: Arnol. Heay, P. an Ralphs, M. (005) EURAREA: an overview of the project an its finings. Statistics in Transition, 7, 557 570. Lehtonen, R., Särnal, C.-E. an Veijanen, A. (003) The effect of moel choice in estimation for omains, incluing small omains. Survey Methoology, 9, 33 44. Lehtonen, R., Särnal, C.-E. an Veijanen, A. (005) Does the moel matter? Comparing moel-assiste an moel-epenent estimators of class frequencies for omains. Statistics in Transition, 7, 649 673. Lehtonen, R. an Veijanen, A. (998). Logistic generalize regression estimators. Survey Methoology, 4, 5 55. Lehtonen, R. an Veijanen, A. (999) Domain estimation with logistic generalize regression an relate estimators. Proceeings, IASS Satellite Conference on Small Area Estimation, Riga, August 999. Riga: Latvian Council of Science, 8. Rao, J.N.K. (003) Small Area Estimation. Hoboen: Wiley. Särnal, C.-E. (996) Efficient estimators with simple variance in unequal probability sampling. Journal of the American Statistical Association, 9, 86 300. Särnal, C.-E., Swensson, B. an Wretman, J.H. (99) Moel Assiste Survey Sampling. New Yor: Springer-Verlag.