arxiv: v2 [math.st] 20 Jun 2014


A solution in small area estimation problems

Andrius Čiginas and Tomas Rudys
Vilnius University Institute of Mathematics and Informatics, LT-08663 Vilnius, Lithuania

arXiv:1306.2814v2 [math.ST] 20 Jun 2014

Abstract

We present a new method for problems where estimates are needed for finite population domains with small or even zero sample sizes. In contrast to known estimation methods, auxiliary information is used to model the sizes of population units rather than to predict their values of interest directly. In particular, via an additional characterization of regression models, we incorporate the scatter and variability of the unit sizes into an estimator, which then uses the information of the whole sample by taking into account the location of the estimation domain inside the population. To reduce the impact of the bias of the introduced domain total estimator on the mean square error, we also construct a regression type version of the estimator. The efficiency of the proposed method is shown in a simulation study.

Keywords: small area estimation, auxiliary information, linear regression, order statistics.

MSC classes: 62D05 (62J05)

1 Introduction

Small area estimation (SAE) has become increasingly relevant, and thus popular, in recent decades because of the need to draw more inferences about survey populations than seems possible with the given samples, even after careful planning. The historically established but misleading term SAE actually means estimation in a finite population domain (not necessarily a small one) where the sample size is not sufficient to obtain estimates of adequate precision. More specifically, problems of this kind arise when the sample must be planned to estimate population characteristics in a number of population subsets that are determined by various classifications of the population and that intersect in various ways.
A sampling design that ensures a sample in each of the intersections leads to a large total sample size and thus increases the cost of the survey. Next, if the samples obtained in the domains are small, applications of classical estimation theory usually fail in terms of the quality of the estimates, and this failure depends little on the availability of good auxiliary information. In SAE, auxiliary information plays a crucial role, just as it does in traditional survey sampling, where it is used extensively via ratio, regression, calibration estimators, etc.; see, e.g., [1, 12, 4, 5].

(Footnote: The research of the first author is partially supported by the European Union Structural Funds project Postdoctoral Fellowship Implementation in Lithuania.)

The main difference is that,

in distinctive SAE methods, the auxiliary data are used to extract information about the variable or parameter of interest from the sample outside the domain (so-called indirect estimation), while in the classical case only the sample information from the domain of interest is incorporated into the estimation (direct estimation).

SAE methods can be classified roughly into two large groups: design-based and model-based methods. The design-based methodologies correspond to the classical understanding of estimation. Data models included in estimation of this type are used deliberately so as to keep the basic properties of the resulting estimators, such as asymptotic consistency and at least approximate unbiasedness with respect to the sampling design; that is, randomness is considered here through the distribution of the possible samples. The validity of the underlying model is not very important: if the model is incorrect, the variance of the initial estimator may remain unreduced, but the consistency and an acceptable unbiasedness of the estimator still hold. The underlying model is therefore called assisting. See [7] and also [9] for a wide review of the design-based methods.

However, the unbiasedness of the design-based estimators has its cost: for population domains with markedly small sample sizes, the variances of the estimators are often large. Estimators from the model-based class can then be a better choice, as pointed out, e.g., in [11]. A number of model-based methods are collected in [10]; see also [9] and [3] for recent results. We refer also to the reviews in [6] and [8]. In contrast to the design-based methods, the mean square error (MSE) of a model-based estimator is defined and estimated with respect to a model of explicit form.
To better explain where our estimation approach fits, let us briefly discuss two popular types of small area estimators: the synthetic estimator and the empirical best linear unbiased predictor (EBLUP). The former can be placed between the classes described above: its properties can be considered from the point of view of the sampling design, while it is indirect and uses data models implicitly. The synthetic estimator works well if the population (or a larger area containing the domain of interest) has model characteristics similar to those of the domain. Estimators of the EBLUP type belong to the class of model-based estimators. They take into account the particularity of the domain. The estimators we propose in this paper have the mentioned properties of synthetic estimators, except that the effects of the estimation domain are incorporated in a new way, which is in a sense similar to that in the EBLUP case.

Let us turn to specific assumptions. Consider a finite population $U = \{1, \dots, N\}$ of size $N$. Assume that, in order to estimate a parameter of the population, the sample $s = \{i_1, \dots, i_n\}$ of size $n$ is drawn from $U$ according to a sampling design $p(\cdot)$. Here $p(s)$ is the probability of obtaining the particular sample $s$. Let $\pi_i = P\{i \in s\} > 0$ and $\pi_{ij} = P\{i, j \in s\} > 0$ be the sample inclusion probabilities for the population element $i$ and for the pair of elements $i$ and $j$. Here the operator $P$ and, further, $E$, $\mathrm{Var}$ and $\mathrm{Cov}$ denote the probability, expectation, variance and covariance with respect to the sampling design, respectively.

It is common for the construction of the design $p(\cdot)$ to be closely related to an auxiliary variable, say $z$, with values $\{z_1, \dots, z_N\}$ known for all units of $U$. In particular, e.g., for a stratified simple random sampling design and for probability proportional-to-size sampling, the inclusion probability $\pi_i$ represents the importance of the population unit $i$ through the relative size of $z_i$. We renumber the population $U$ so that $z_1 \le \dots \le z_N$.
Let $y$ be the variable of interest with fixed values $\{y_1, \dots, y_N\}$ in the population. We aim to estimate

the total
$$t = \sum_{i \in D} y_i, \tag{1}$$
where $D \subseteq U$ is any non-empty set. If the particular estimation domain $D$ is known before the sample selection, then the sampling design $p(\cdot)$ can be organized to obtain a sample size in that domain that is sufficient for the quality requirements of the estimates. But if we become interested in $D$ after the sample selection and the collection of the data $y_{i_1}, \dots, y_{i_n}$, the sample size in $D$ can be too small. This leads to estimates of poor quality if the estimator of (1) is, e.g., the direct Horvitz-Thompson (H-T) estimator
$$\hat t_{HT} = \sum_{i \in s \cap D} d_i y_i, \quad \text{where } d_i = \frac{1}{\pi_i}, \tag{2}$$
or a direct generalized regression (GREG) estimator, which can also be represented in a form similar to (2). Furthermore, if the sample in the domain is empty, the direct estimators cannot be applied at all.

Assume that, at the estimation stage, we have an auxiliary variable $x$ different from $z$, and that all its values $\{x_1, \dots, x_N\}$ are known. Let these values be the realization of independent random variables $X_1, \dots, X_N$ modeled by the linear regression model
$$X_i = \alpha_1 + \alpha_2 z_i + \delta_i, \quad i = 1, \dots, N, \tag{3}$$
which we call $\eta$. Here, in general, $X_i$ has the distribution function $F_i(\cdot)$ with $E_\eta X_i = \alpha_1 + \alpha_2 z_i$ and $\mathrm{Var}_\eta X_i = \tau_i^2$, where the symbols $E_\eta$ and $\mathrm{Var}_\eta$ denote expectation and variance with respect to the model $\eta$. Thus we assume a superpopulation model, where $\alpha_1$, $\alpha_2$ and $\tau_1, \dots, \tau_N$ are unknown model parameters.

In traditional survey sampling estimation problems, the introduction of the auxiliary variable $x$ usually means that it is a better linear predictor of $y$ than $z$. Then the superpopulation model
$$Y_i = \beta_1 + \beta_2 x_i + \varepsilon_i, \quad i = 1, \dots, N, \tag{4}$$
which we call $\xi$, is used in the estimation. Here $Y_1, \dots, Y_N$ are independent random versions of the fixed $y$ values with expectations $E_\xi Y_i = \beta_1 + \beta_2 x_i$ and variances $\mathrm{Var}_\xi Y_i = \sigma_i^2$. To proceed step by step, we do not treat model (4) as essential in Section 2, where we only keep in mind that both $z$ and $x$ represent, more or less, the size of $y$.
Next, in Section 3, we take linear relation (4) into account.

Estimators of the synthetic and EBLUP types are based on the common principle of predicting the $y$ values in the domain of interest using linear relations between $y$ and auxiliary variables, estimated from the data of neighbouring population areas. The approach proposed in Section 2 is different. Here we model the sizes of the population elements instead of their $y$ values. More specifically, we introduce an estimator of the form
$$\hat t_{HR} = \sum_{i \in s} \hat w_i y_i, \quad \text{where } \hat w_i = \hat\theta_{i;D}\, d_i, \quad 0 \le \hat\theta_{i;D} \le 1,$$
similar to those in design-based estimation, cf. (2). Here the numbers $\hat\theta_{i;D}$, $i \in s$, obtained from model (3), describe how similar the sample elements are to the elements of the domain $D$. Such an incorporation of the size model (3) into the estimation can be interpreted, for instance, as mimicking a change of the population over time or as treating size as a relative notion. Therefore, we give the name hidden randomness (HR) to the method, which we present in detail in Section 2. Further, in

Section 3, we aim to reduce the bias of the HR estimator by constructing a regression type version of the estimator. In Section 4, we present the simulation study, where we compare the introduced estimators with the synthetic, EBLUP and direct GREG estimators. Conclusions are given in Section 5, where a relaxation of the auxiliary model (assumption) (3) is also discussed.

2 Method of hidden randomness and small area estimator

Let $X_{(1)} \le \dots \le X_{(N)}$ denote the order statistics of the random variables $X_1, \dots, X_N$ generated by linear regression model (3). We introduce the probabilities
$$p_{ij} = P_\eta\{X_i = X_{(j)}\}, \quad i, j = 1, \dots, N. \tag{5}$$
These numbers are parameters of the model $\eta$. They depend on the basic parameters $\alpha_2$ and $\tau_1, \dots, \tau_N$, but also characterize the model additionally. In particular, characteristics (5) contain information about the scatter of the given $z$ values in $U$. For a fixed unit $i \in U$, we interpret the probabilities as follows: the collection $\{p_{ij},\ j = 1, \dots, N\}$ shows how the initial size $z_i$ tends to vary.

Define the numbers $t_j = \sum_{i \in U} p_{ij} y_i$, $j = 1, \dots, N$. Let $D = \{j_1, \dots, j_m\} \subseteq U$ be the domain of interest of size $1 \le m \le N$. Since the identity $\sum_{i \in U} y_i = \sum_{j \in U} t_j$ is satisfied (it follows from $\sum_{j \in U} p_{ij} = 1$ for every $i$ when ties occur with probability zero), we approximate target sum (1) as follows:
$$t \approx \sum_{j \in D} t_j = \sum_{i \in U} \theta_{i;D}\, y_i, \quad \text{where } \theta_{i;D} = \sum_{j \in D} p_{ij}$$
can be treated as the probability that the population element $i$ belongs to the domain $D$. We define the HR estimator of the sum $t$ by
$$\hat t_{HR} = \sum_{i \in s} \hat\theta_{i;D}\, d_i y_i, \quad \text{with } \hat\theta_{i;D} = \sum_{j \in D} \hat p_{ij}. \tag{6}$$
Here the estimates $\hat p_{ij}$ are plugged in instead of the probabilities $p_{ij}$, because the parameters $\alpha_1$, $\alpha_2$ and $\tau_1, \dots, \tau_N$ of model (3) are assumed unknown and $F_1(\cdot), \dots, F_N(\cdot)$ are not specified. An exact calculation of (5) is complicated even if the mentioned model characteristics are known, since, except in trivial cases, the random variables $X_1, \dots, X_N$ are not identically distributed.
For instance, evaluating the distributions of the corresponding order statistics requires intensive computing; see, e.g., [2] and references therein. Therefore, we propose two alternative ways to evaluate the numbers $\theta_{i;D}$, $i = 1, \dots, N$; see Appendix A.

Remark 1. In the case of $D = U$, estimator (6) coincides with the design-unbiased H-T estimator. Two other connections with known estimators are found in the following special cases of (6). First, assuming that the ranks of $X_1, \dots, X_N$ satisfy $\{R_1, \dots, R_N\} \equiv \{1, \dots, N\}$, or that the coefficient of correlation between the variables $x$ and $z$ is $\rho_{xz} = 1$, we get $p_{ij} = I\{i = j\}$, where $I\{\cdot\}$ is the indicator function. Then (6) yields the direct H-T estimator (2). Second, taking $\alpha_2 = 0$, $\tau_i = \tau > 0$

and $F_i(\cdot) \equiv F(\cdot)$ in model (3), we obtain $p_{ij} = N^{-1}$. With this absence of linear dependence between $x$ and $z$, (6) becomes
$$\hat t_S = \frac{m}{N} \sum_{i \in s} d_i y_i, \tag{7}$$
which is the simplest synthetic estimator, where the additional information $x$ is not used (or is not helpful) in the estimation. In comparison with (2), estimator (7) has a small variance, but its bias can be large if the domain $D$ is not homogeneous with respect to the population.

Let us consider the MSE of estimator (6). Indirect estimators are generally design-biased, and the construction of (6) implies bias too. While the variance part of the MSE is easily estimated using standard design-based methods, estimating the bias is more difficult. There are specific MSE estimation methods for synthetic estimators, see, e.g., [10], which can also be applied to the HR estimator, but here we follow the course of our methodology. The bias of (6) has the expression
$$B_y = \mathrm{BIAS}(\hat t_{HR}) = \sum_{i \in U} \hat\theta_{i;D}\, y_i - \sum_{i \in D} y_i, \tag{8}$$
which depends on parameter (1). Therefore, a design-unbiased estimator of (8) is not suitable. We introduce the following estimator:
$$\hat B_y = \frac{1 - \rho_{xz}^2}{\rho_{xz}^2} \sum_{i \in s} \Big( \frac{m}{N} - \hat\theta_{i;D} \Big) d_i y_i. \tag{9}$$
This estimator is consistent with the first two special cases of Remark 1, where bias (8) equals zero. Only in the case of $\rho_{xz} = 0$ is estimator (9) not clearly defined, because of the complexity of (the properties of) parameters (5).

To summarize, we formulate a statement on the accuracy of estimator (6). For short, we denote $a_{ij} = d_i d_j \pi_{ij} - 1$ with $a_{ii} = d_i - 1$. Similarly, we write $\tilde a_{ij} = a_{ij} / \pi_{ij}$.

Result 1. (i) The MSE of the estimator $\hat t_{HR}$ of the sum $t$ in the domain $D$ is
$$\mathrm{MSE}(\hat t_{HR}) = \mathrm{Var}\, \hat t_{HR} + (B_y)^2, \tag{10}$$
where $\mathrm{Var}\, \hat t_{HR} = \sum_{i,j \in U} \hat\theta_{i;D} \hat\theta_{j;D}\, a_{ij}\, y_i y_j$, and $B_y$ is given by (8).

(ii) The estimator of (10) is
$$\widehat{\mathrm{MSE}}(\hat t_{HR}) = \widehat{\mathrm{Var}}\, \hat t_{HR} + (\hat B_y)^2, \tag{11}$$
where $\widehat{\mathrm{Var}}\, \hat t_{HR} = \sum_{i,j \in s} \hat\theta_{i;D} \hat\theta_{j;D}\, \tilde a_{ij}\, y_i y_j$ is the unbiased estimator of $\mathrm{Var}\, \hat t_{HR}$, and $\hat B_y$ is given in (9).

Remark 2.
If the total sample size $n$ and the size $m$ of the domain $D$ are small, then the random variable $\hat t_{1;D} := \sum_{i \in s} \hat\theta_{i;D}\, d_i$, induced by the sampling design, varies more about $m$. Therefore, by applying the ratio estimator
$$\hat t_{HR1} = m\, \frac{\hat t_{HR}}{\hat t_{1;D}},$$
one can expect to reduce the impact of this error source on the MSE of the HR estimator.
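As an illustration of estimator (6), its special cases from Remark 1 and the ratio adjustment of Remark 2, here is a minimal Python sketch. It assumes simple random sampling without replacement and treats the probabilities $p_{ij}$ as known rather than estimated; the function names, the toy population and the choice of domain are hypothetical and are not taken from the paper.

```python
import numpy as np

def theta_from_p(p_hat, domain):
    """theta_hat_{i;D} = sum over j in D of p_hat[i, j], for every unit i."""
    return p_hat[:, domain].sum(axis=1)

def hidden_randomness_estimate(y_s, d_s, theta_s):
    """Estimator (6): sum over the sample of theta_hat_{i;D} * d_i * y_i."""
    return float(np.sum(theta_s * d_s * y_s))

def ratio_adjusted_estimate(y_s, d_s, theta_s, m):
    """Ratio version from Remark 2: m * t_hat / t_hat_{1;D}."""
    t_1d = float(np.sum(theta_s * d_s))
    return m * hidden_randomness_estimate(y_s, d_s, theta_s) / t_1d

# Hypothetical toy population under simple random sampling without replacement.
rng = np.random.default_rng(0)
N, n, m = 20, 8, 5
y = rng.normal(10.0, 2.0, size=N)
domain = np.arange(m)                      # D = first m units (illustrative)
sample = rng.choice(N, size=n, replace=False)
d = np.full(N, N / n)                      # design weights d_i = 1 / pi_i

# Special cases of Remark 1, with p_ij assumed known instead of estimated.
theta_identity = theta_from_p(np.eye(N), domain)                # p_ij = I{i = j}
theta_uniform = theta_from_p(np.full((N, N), 1.0 / N), domain)  # p_ij = 1 / N

t_direct = hidden_randomness_estimate(y[sample], d[sample], theta_identity[sample])
t_synthetic = hidden_randomness_estimate(y[sample], d[sample], theta_uniform[sample])
t_ratio = ratio_adjusted_estimate(y[sample], d[sample], theta_uniform[sample], m)
```

With the identity matrix the sketch reproduces the direct estimator (2), and with uniform probabilities it reproduces the synthetic estimator (7). In the uniform case under this design $\hat t_{1;D}$ equals $m$ exactly, so the ratio adjustment leaves the estimate unchanged.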

3 Regression type estimator

Assume that the auxiliary variable $x$ explains the study variable $y$ well according to model (4). By Result 1, the variance of (6) does not exceed the variance of the direct H-T estimator of the whole population total. Therefore, we aim to exploit the variable $x$ to reduce the bias of the HR estimator by considering a regression type estimator of the form $\hat t_{HR} + b\,(t_{x;D} - \hat t_{x;D})$ with a properly chosen characteristic $b$. In particular, minimizing the MSE of this expression, we get
$$b_0 = \frac{C_{yx} + B_y B_x}{V_x + (B_x)^2}, \tag{12}$$
where $C_{yx} = \mathrm{Cov}(\hat t_{HR}, \hat t_{x;D}) = \sum_{i,j \in U} \hat\theta_{i;D} \hat\theta_{j;D}\, a_{ij}\, y_i x_j$ and $V_x = \mathrm{Var}\, \hat t_{x;D} = C_{xx}$. Then we introduce the regression type (R) estimator
$$\hat t_R = \hat t_{HR} + \hat b_0\, (t_{x;D} - \hat t_{x;D}), \tag{13}$$
where
$$\hat b_0 = \frac{\hat C_{yx} + \hat B_y \hat B_x}{\hat V_x + (\hat B_x)^2} \tag{14}$$
is the estimator of the parameter $b_0$. Here
$$\hat C_{yx} = \sum_{i,j \in s} \hat\theta_{i;D} \hat\theta_{j;D}\, \tilde a_{ij}\, y_i x_j \quad \text{and} \quad \hat V_x = \sum_{i,j \in s} \hat\theta_{i;D} \hat\theta_{j;D}\, \tilde a_{ij}\, x_i x_j$$
are the design-unbiased estimators of $C_{yx}$ and $V_x$, respectively.

To evaluate the MSE of estimator (13), we apply the traditional Taylor linearization to the function of estimators $\hat t_R$ at the point $(E \hat t_{HR}, E \hat t_{x;D}, C_{yx}, V_x, B_y, B_x)$, which yields
$$\hat t_R \approx \tilde t_R := \hat t_{HR} + b_0 (t_{x;D} - \hat t_{x;D}) - B_x \big( V_x + (B_x)^2 \big)^{-1} \Big\{ \hat C_{yx} - C_{yx} - b_0 (\hat V_x - V_x) + B_x (\hat B_y - B_y) + (B_y - 2 b_0 B_x)(\hat B_x - B_x) \Big\}.$$
According to this one-term Taylor approximation, the bias of estimator (13) is approximated by the bias of $\tilde t_R$. Next, one can verify that the latter bias consists of the term $B_y - b_0 B_x$ plus a remainder that is negligible if the biases of the estimators $\hat B_y$ and $\hat B_x$ are of a smaller order than the corresponding biases $B_y$ and $B_x$. The same reasoning implies that the variance of the term $\hat t_{HR} + b_0 (t_{x;D} - \hat t_{x;D})$ dominates in the variance of $\tilde t_R$. Therefore, we formulate the following result on the accuracy of estimator (13).

Result 2.
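(i) The MSE approximation of the estimator $\hat t_R$ of the sum $t$ in the domain $D$ is
$$\mathrm{MSE}(\hat t_R) \approx \mathrm{Var}\, \hat t_{HR} + b_0^2 V_x - 2 b_0 C_{yx} + (B_y - b_0 B_x)^2, \tag{15}$$
where $\mathrm{Var}\, \hat t_{HR}$ is the same as in (10), $C_{yx}$ and $V_x$ are from expression (12) of $b_0$, and $B_y$ and $B_x$ are given by formula (8).

(ii) The estimator of approximation (15) is
$$\widehat{\mathrm{MSE}}(\hat t_R) = \widehat{\mathrm{Var}}\, \hat t_{HR} + \hat b_0^2 \hat V_x - 2 \hat b_0 \hat C_{yx} + (\hat B_y - \hat b_0 \hat B_x)^2, \tag{16}$$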

where $\widehat{\mathrm{Var}}\, \hat t_{HR}$ is the same as in (11), $\hat C_{yx}$ and $\hat V_x$ are from expression (14) of $\hat b_0$, and $\hat B_y$ and $\hat B_x$ are given by (9).

One can conclude from (15) that the R estimator reduces bias (8) of estimator (6) if the variables $y$ and $x$ are well correlated.

4 Simulation study

In this section, we compare the R estimator (13) with the hidden randomness (HR) estimator (6) and with the synthetic (SYN), EBLUP and GREG estimators. The latter three estimators are taken from formulas (4.2.2), (7.2.16) and (2.3.6) of [10], and we denote them by $\hat t_{SYN}$, $\hat t_{EBLUP}$ and $\hat t_{GREG}$, respectively.

The simulations are based on populations of two types. Let $N = 500$ and $m = 50$. The values of the variable $z$ in the population $(P_1)$ are obtained as follows: for the elements of the domain $D$, they are generated from the distribution $N(4, 1)$, and for the elements in $U \setminus D$, from $N(6, 1.25)$. To get the values of $z$ for the population of the type $(P_2)$, we generate the numbers from the exponential distribution $E(1)$ for the whole population, and multiply them by 2 for the elements outside the domain $D$.

Next, for each population, in order to generate and fix the values of the variable $x$, we use model (3) in the form $X_i = z_i + \delta_i$, where the independent errors $\delta_1, \dots, \delta_N$ are distributed as $N(0, \tau^2)$. We are interested in different correlations between $x$ and $z$. Therefore, first, we set the variances $\tau^2$ that return the correlations $\rho_{xz} = 0.1, 0.2, \dots, 0.9$ on average, and second, for each population, we choose 9 particular collections of the values of $x$ that realize the expected correlations with a small error. Next, for each set of the values of $x$, we generate collections of the values of $y$ similarly (choosing the variances of a model), in order to ensure correlations close to $\rho_{yx} = 0.1, 0.2, \dots, 0.9$. Here, however, we use three different data generation mechanisms.

Case (A). Model (4) in the form $Y_i = 2 + x_i + \varepsilon_i$, with the independent errors $\varepsilon_i$ distributed as $N(0, \sigma^2)$, is applied for all $i \in U$.

Case (B).
The model is $Y_i = 2 + \beta_{2i} x_i + \varepsilon_i$, with the independent errors $\varepsilon_i$ distributed as $N(0, \sigma^2)$, where $\beta_{2i} = 1.25$ for $i \in D$, and $\beta_{2i} = 1$ for $i \in U \setminus D$.

Case (C). The model is $Y_i = 2 + x_i + \varepsilon_i$, with the independent errors $\varepsilon_i$ from $N(0, c_i \sigma^2)$, where $c_i = 3$ for $i \in D$, and $c_i = 1$ for $i \in U \setminus D$.

Finally, each of the types $(P_1)$ and $(P_2)$, combined with one of Cases (A), (B) and (C), contains 81 different trios of the sets of the values of $z$, $x$ and $y$ in the population.

The sampling design $p(\cdot)$ is simple random sampling without replacement in all cases, and we take $n = 75$. Then, in the domain $D$, the expectation and the standard deviation of the sample size are 7.5 and 2.4, respectively. To evaluate the MSEs of the estimators, we apply Monte Carlo (M-C) simulations by independently drawing $10^3$ samples without replacement from the populations. The probabilities $\theta_{i;D}$, $i = 1, \dots, N$, are also estimated using the M-C method, see Appendix A.1, with the number of replications $R = 10^6$.

We assume that the survey statistician has no knowledge of the models in Cases (B) and (C) and, at the estimation stage, fits the simplest linear regressions given by (4). This assumption is realistic because, in the domain $D$, the realized sample sizes are too small to test for differences between regressions. The variable $z$ is not incorporated into the estimators $\hat t_{SYN}$, $\hat t_{EBLUP}$ and $\hat t_{GREG}$ here,

see Section 5 for an explanation. Note that the sampling design does not depend on this variable either.

Appendices B.1 to B.6 present simulation results for the populations $(P_1)$ and $(P_2)$ paired with Cases (A), (B) and (C). In the larger figures, e.g., Figures 1 to 4, we present the ratios of the MSEs of HR, SYN, EBLUP and GREG to the MSE of R, respectively. The smaller figures that follow, e.g., inside Figure 5, summarize the previous four.

As we expected, the R estimator improves on the HR estimator under stronger correlations $\rho_{yx}$ (and with stronger $\rho_{xz}$). This holds for 5 of 6 cases, in Figures 1, 6, 16, 21, 26, the exception being Figure 11 in Appendix B.3. Therefore, in B.3, we compare the HR estimator with the other ones.

Let us focus further, when considering the figures, on more reliable correlations between the size variables $x$ and $z$, for instance, $\rho_{xz} \ge 0.4$. Then, in some of the 5 cases, the SYN estimator is a better predictor than R where the correlation $\rho_{yx}$ is strong, but it loses its power more than R when this correlation decreases (Figures 2, 7, 17, 22). The SYN estimator outperforms R in Figure 2. However, Case (C) seems the most unsuccessful for SYN (Figures 22, 27), as do the cases with populations of the type $(P_2)$. In the separate case where we compare the HR estimator with SYN, the picture (Figure 12) is different. Here one can state that, for $\rho_{xz} \ge 0.8$, the SYN estimator is better than HR, but for $0.4 \le \rho_{xz} \le 0.6$ the result is the opposite.

Similarly, but in only 2 of all 6 cases, the EBLUP estimator works well for strong correlations $\rho_{yx}$, see Figures 3, 13. In all cases of the populations $(P_2)$, the R estimator is evidently better than EBLUP (Figures 8, 18, 28). With respect to the different generation models for the values of $y$, large differences in the efficiencies of R and EBLUP also occur in Case (C) (Figures 23, 28).

Our estimation approach is much better than the GREG estimator, except in Appendix B.3, where GREG improves on the HR estimator for strong correlations $\rho_{yx}$ and $\rho_{xz}$.
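The population setup of this section can be sketched in Python as follows, for the type $(P_1)$ with Case (A) only. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes that the second argument of $N(\cdot, \cdot)$ is a variance, it places the domain on the first $m$ units, and it sets the error variances from the closed-form relation $\tau^2 = \mathrm{Var}(z)\,(\rho^{-2} - 1)$ that holds for additive independent noise. All function names are hypothetical.

```python
import numpy as np

def error_variance_for_correlation(v, rho):
    """Variance of additive independent noise giving correlation rho
    with a variable whose variance is v."""
    return v * (1.0 / rho**2 - 1.0)

def generate_p1_case_a(rho_xz, rho_yx, N=500, m=50, seed=0):
    """One population of type (P1) with y generated as in Case (A)."""
    rng = np.random.default_rng(seed)
    in_domain = np.arange(N) < m           # assumption: D = first m units
    # Assumption: N(4, 1) and N(6, 1.25) are parametrized by variance.
    z = np.where(in_domain,
                 rng.normal(4.0, 1.0, size=N),
                 rng.normal(6.0, np.sqrt(1.25), size=N))
    tau2 = error_variance_for_correlation(np.var(z), rho_xz)
    x = z + rng.normal(0.0, np.sqrt(tau2), size=N)          # X_i = z_i + delta_i
    sigma2 = error_variance_for_correlation(np.var(x), rho_yx)
    y = 2.0 + x + rng.normal(0.0, np.sqrt(sigma2), size=N)  # Y_i = 2 + x_i + eps_i
    return z, x, y, in_domain

def domain_sample_size_moments(N, m, n):
    """Mean and standard deviation of the domain sample size under simple
    random sampling without replacement (hypergeometric distribution)."""
    mean = n * m / N
    var = n * (m / N) * (1.0 - m / N) * (N - n) / (N - 1)
    return mean, float(np.sqrt(var))

z, x, y, in_domain = generate_p1_case_a(rho_xz=0.6, rho_yx=0.8)
mean_nd, sd_nd = domain_sample_size_moments(N=500, m=50, n=75)
```

For $N = 500$, $m = 50$ and $n = 75$, the hypergeometric moments give a mean of 7.5 and a standard deviation of about 2.4, in agreement with the values quoted above. A single draw realizes the target correlations only approximately, whereas the study selects collections that match them with a small error.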
We conclude from the simulations that the R estimator competes well with the SYN, EBLUP and GREG estimators under the non-normal populations $(P_2)$, and where the underlying models of Cases (B) and (C) are misspecified.

5 Conclusions

In the simulation examples, the particular special case $X_i = z_i + \delta_i$ of model (3), with independent and normally distributed errors, is taken intentionally. In this way, we show that, in order to apply estimators (6) and (13), it is not necessary to have at least two auxiliary variables. In particular, we can generate the population values of $z$ from the available values of the variable $x$ (using the corresponding regression). Then the choice of the variance $\tau^2$ of the generating model is an optimization problem. While this problem remains unsolved, the present simulation study suggests setting a weaker $\rho_{xz}$ when the estimate of $\rho_{yx}$ indicates a strong correlation, and vice versa. More specifically, with this rule, the correlation $\rho_{xz}$ could vary between 0.4 and 0.9. The size of the parameter $\rho_{xz}$ (or $\tau^2$) can be interpreted as a noise level that serves to smooth the data for the estimation domain. If two auxiliary variables $x$ and $z$ are at our disposal, then the general methodology presented seems more natural and leaves much freedom for specific interpretations. If there are more than two auxiliary variables, generalizations of the method are possible.

By definition, estimator (6) has the important additivity (coherence) property: if the

population is partitioned into non-overlapping domains, then the sum of the estimators over the domains coincides with the H-T estimator of the whole population. But the regression type estimator (13) is not additive.

Many well-known small area estimators are expressible as a linear combination (composition) of a direct estimator and a synthetic one. According to Remark 1, estimator (6) is such a composition too, but it is not linear. We formulate the hypothesis
$$\hat t_{HR} \approx \rho_{xz}^2\, \hat t_{HT} + (1 - \rho_{xz}^2)\, \hat t_S,$$
which we applied in the construction of estimator (9) of bias (8) of the HR estimator.

As the simulation study indicates, estimators (6) and (13) are comparatively robust against misspecifications of model (4). The additivity feature also implies that the HR estimator can be efficient for domains of any size in the population.

A Appendix

A.1 Monte Carlo estimates of $p_{ij}$

Depending on the specification of model (3), we first use the data $(x_i, z_i)$, $i = 1, \dots, N$, to get estimates (finite population characteristics) $\hat\alpha_1$, $\hat\alpha_2$ and $\hat\tau_1, \dots, \hat\tau_N$ of the superpopulation model parameters, and also to obtain estimates $\hat F_1(\cdot), \dots, \hat F_N(\cdot)$. Second, we apply the estimated model $\hat\eta$:
$$\tilde X_i = \hat\alpha_1 + \hat\alpha_2 z_i + \tilde\delta_i, \quad i = 1, \dots, N,$$
where $\tilde X_i$ has the distribution function $\hat F_i(\cdot)$ with $E_{\hat\eta} \tilde X_i = \hat\alpha_1 + \hat\alpha_2 z_i$ and $\mathrm{Var}_{\hat\eta} \tilde X_i = \hat\tau_i^2$, to generate independently the collections $(x_{1r}, \dots, x_{Nr})$, $r = 1, \dots, R$, where $R$ is the number of M-C iterations. Finally, for each $r$, we order the numbers, $x_{(1)r} \le \dots \le x_{(N)r}$, and take
$$\hat p_{ij} = \frac{1}{R} \sum_{r=1}^{R} I\{x_{ir} = x_{(j)r}\}, \quad i, j = 1, \dots, N. \tag{17}$$
The relative frequencies (17) are consistent estimates of probabilities (5) as $N \to \infty$, provided that the assumptions of model (3) are sufficient for the consistency of its parameter estimators, and because of the law of large numbers as $R = R(N) \to \infty$.
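The Monte Carlo procedure behind (17) can be sketched in Python as follows. The sketch assumes normally distributed errors, as in the special case used in the simulation study; the paper leaves the distributions $F_i(\cdot)$ general. The function name, its arguments and the default number of replications are hypothetical.

```python
import numpy as np

def mc_rank_probabilities(z, alpha1, alpha2, tau, R=2000, seed=0):
    """Relative frequencies (17): p_hat[i, j] is the share of replications in
    which unit i takes the j-th smallest simulated value (0-based j).
    Assumption: normal errors with standard deviations tau."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    tau = np.broadcast_to(np.asarray(tau, dtype=float), z.shape)
    N = z.size
    counts = np.zeros((N, N))
    rows = np.arange(N)
    for _ in range(R):
        x = alpha1 + alpha2 * z + tau * rng.standard_normal(N)
        ranks = np.argsort(np.argsort(x))  # 0-based rank of each unit
        counts[rows, ranks] += 1.0
    return counts / R

# Illustrative use: sizes z sorted in increasing order, common error scale.
z_demo = np.linspace(0.0, 5.0, 6)
p_demo = mc_rank_probabilities(z_demo, alpha1=0.0, alpha2=1.0, tau=1.0, R=500)
theta_demo = p_demo[:, [0, 1]].sum(axis=1)  # theta_hat_{i;D} for D = {1, 2}
```

Each row and each column of the returned matrix sums to one, which is the property that gives estimator (6) its additivity over a partition of the population. With zero error variance and strictly increasing $z$, the matrix reduces to the identity, matching the first special case of Remark 1.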
A.2 Approximations to $\theta_{i;d}$

We propose approximations of the following form:

$$\hat\theta_{i;d} = \hat\theta_{i;d}(a_0) = c\,m \sum_{j \in D} I\left\{ |z_j - z_i| \le a_0\, \hat\alpha_2^{-1} \left(\hat\tau_j^2 + \hat\tau_i^2\right)^{1/2} \right\}, \qquad i = 1, \dots, N,$$

where $c$ is the normalizing constant such that $\sum_{i \in U} \hat\theta_{i;d} = m$, and $a_0 > 0$ is a chosen number. Here $\hat\alpha_2 > 0$ and $\hat\tau_1, \dots, \hat\tau_N$ are estimates of the parameters $\alpha_2$ and $\tau_1, \dots, \tau_N$ of model (3), obtained from the data $(x_i, z_i)$, $i = 1, \dots, N$. An optimal $a_0$ can be chosen iteratively: starting from $a_0 = 0.01$ and continuing with a step of similar order, stop the iterations when the sign changes fewer than 4 times in the sequence of differences $\hat\theta_{i+1;d}(a_0) - \hat\theta_{i;d}(a_0)$, $i = 1, \dots, N - 1$.
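The approximation and the iterative choice of $a_0$ can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the indicator compares $|z_j - z_i|$ with the tolerance $a_0 \hat\alpha_2^{-1}(\hat\tau_j^2 + \hat\tau_i^2)^{1/2}$, that units are indexed in increasing order of $z$, and that zero differences do not count as sign changes. The function names, the step size of 0.01 and the toy data are hypothetical.

```python
import math

def theta_hat(z, tau, domain, m, alpha2, a0):
    """Approximate theta_{i;d} for every unit i; the values sum to m."""
    raw = []
    for i in range(len(z)):
        # Count domain units j whose z-value lies within the tolerance of z_i.
        raw.append(sum(
            1 for j in domain
            if abs(z[j] - z[i]) <= a0 / alpha2 * math.hypot(tau[j], tau[i])
        ))
    total = sum(raw)
    if total == 0:
        raise ValueError("domain must be non-empty")
    # The normalizing constant is c = 1 / total, so the result sums to m.
    return [m * r / total for r in raw]

def sign_changes(values):
    """Number of sign changes in successive differences (zeros skipped)."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    signs = [1 if d > 0 else -1 for d in diffs if d != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

def choose_a0(z, tau, domain, m, alpha2, start=0.01, step=0.01, max_steps=100000):
    """Increase a0 until the differences change sign fewer than 4 times."""
    for k in range(max_steps):
        a0 = start + k * step
        theta = theta_hat(z, tau, domain, m, alpha2, a0)
        if sign_changes(theta) < 4:
            return a0, theta
    raise RuntimeError("no suitable a0 found within max_steps")

# Toy data (hypothetical): ten units sorted by z, domain of five scattered units.
z = [float(i) for i in range(10)]
tau = [0.5] * 10
a0_star, theta = choose_a0(z, tau, domain=[0, 2, 4, 6, 8], m=2.0, alpha2=1.0)
```

For small $a_0$ the values alternate between domain and non-domain units; increasing $a_0$ widens the tolerance and smooths the sequence until the stopping rule is met.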

B Appendix

B.1 Case (A)., SYN, EBLUP, GREG vs R under $(P_1)$

Figure 1: $(P_1)$, Case (A), vs R
Figure 2: $(P_1)$, Case (A), SYN vs R
Figure 3: $(P_1)$, Case (A), EBLUP vs R
Figure 4: $(P_1)$, Case (A), GREG vs R
Figure 5: Counts for $(P_1)$, Case (A), and, SYN, EBLUP, GREG vs R

B.2 Case (A)., SYN, EBLUP, GREG vs R under $(P_2)$

Figure 6: $(P_2)$, Case (A), vs R
Figure 7: $(P_2)$, Case (A), SYN vs R
Figure 8: $(P_2)$, Case (A), EBLUP vs R
Figure 9: $(P_2)$, Case (A), GREG vs R
Figure 10: Counts for $(P_2)$, Case (A), and, SYN, EBLUP, GREG vs R

B.3 Case (B). R, SYN, EBLUP, GREG vs under $(P_1)$

Figure 11: $(P_1)$, Case (B), R vs
Figure 12: $(P_1)$, Case (B), SYN vs
Figure 13: $(P_1)$, Case (B), EBLUP vs
Figure 14: $(P_1)$, Case (B), GREG vs
Figure 15: Counts for $(P_1)$, Case (B), and R, SYN, EBLUP, GREG vs

B.4 Case (B)., SYN, EBLUP, GREG vs R under $(P_2)$

Figure 16: $(P_2)$, Case (B), vs R
Figure 17: $(P_2)$, Case (B), SYN vs R
Figure 18: $(P_2)$, Case (B), EBLUP vs R
Figure 19: $(P_2)$, Case (B), GREG vs R
Figure 20: Counts for $(P_2)$, Case (B), and, SYN, EBLUP, GREG vs R

B.5 Case (C)., SYN, EBLUP, GREG vs R under $(P_1)$

Figure 21: $(P_1)$, Case (C), vs R
Figure 22: $(P_1)$, Case (C), SYN vs R
Figure 23: $(P_1)$, Case (C), EBLUP vs R
Figure 24: $(P_1)$, Case (C), GREG vs R
Figure 25: Counts for $(P_1)$, Case (C), and, SYN, EBLUP, GREG vs R

B.6 Case (C)., SYN, EBLUP, GREG vs R under $(P_2)$

Figure 26: $(P_2)$, Case (C), vs R
Figure 27: $(P_2)$, Case (C), SYN vs R
Figure 28: $(P_2)$, Case (C), EBLUP vs R
Figure 29: $(P_2)$, Case (C), GREG vs R
Figure 30: Counts for $(P_2)$, Case (C), and, SYN, EBLUP, GREG vs R
