What level should we use in small area estimation?


University of Wollongong Research Online
University of Wollongong Thesis Collection, 2011

What level should we use in small area estimation?
Mohammad Reza Namazi Rad, University of Wollongong

Recommended Citation: Namazi Rad, Mohammad Reza, What level should we use in small area estimation?, Doctor of Philosophy thesis, School of Mathematics and Applied Statistics, University of Wollongong.

Research Online is the open access institutional repository for the University of Wollongong.

What Level Should We Use in Small Area Estimation?

Mohammad Reza NAMAZI RAD
B.Sc.: Mathematical Statistics
M.Sc.: Economics and Social Statistics

A Thesis presented for the degree of Doctor of Philosophy
School of Mathematics and Applied Statistics
University of Wollongong, Australia
2011

Abstract

Many different Small Area Estimation (SAE) methods have been proposed to overcome the challenge of finding reliable estimates for small domains. Often, the data required for various research purposes are available at different levels of aggregation. Depending on the available data, individual-level or aggregated-level models are used in SAE. If unit-level data are available, SAE is usually based on models formulated at the unit level, but these are ultimately used to produce estimates at the area level. However, parameter estimates obtained from individual-level and aggregated-level analyses may differ in practice. Individual-level analysis usually results in small area estimates with smaller variances. However, if the unit-level working model is misspecified by the exclusion of important auxiliary variables, parameter estimates obtained from the individual-level and aggregated-level analyses will have different expectations.

This thesis investigates the circumstances in which using an area-level model may be more effective. This may happen because of substantial contextual or area-level effects in the covariates, which may be misspecified in an individual-level model. Ignoring these contextual effects leads to biased estimates. In particular, if an existing contextual variable is ignored, the parameter estimates calculated from an individual-level analysis will be biased, whereas an aggregated-level analysis can lead to small area estimates with less bias. Even if contextual variables are included in unit-level modeling, the variance of the parameter estimates may increase because of the larger number of variables in the working model. In this thesis, synthetic estimators and Empirical Best Linear Unbiased Predictors (EBLUPs) are evaluated in SAE based on different levels of linear mixed models. Using a numerical simulation study, the key role of contextual effects is examined for models used in SAE.

Certification

I, Mohammad-Reza NAMAZI-RAD, declare that this thesis, submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy, in the School of Mathematics and Applied Statistics, University of Wollongong, is wholly my own work unless otherwise referenced or acknowledged. The document has not been submitted for qualifications at any other academic institution.

Mohammad Reza NAMAZI RAD
25 July 2011

Acknowledgements

First of all, I would like to express my deep sense of gratitude to my supervisor, Professor David Steel. I greatly appreciate his enthusiasm, patience, encouragement and endless support during the completion of this work. His dedication to research and his determination to complete each task to the fullest have been truly inspiring to me. I would like to extend my sincere appreciation to my co-supervisor, Professor Raymond Chambers, for his support and insightful discussions on my research. I would also like to thank Dr. Ayoub Saei and Dr. Hukum Chandra for suggesting and pointing out the direction and resources for this research. Their insights, ideas and guidance were crucial to accomplishing this dissertation. Finally, special thanks to my dear parents, who have provided emotional support throughout my study. Without them, I would not have had the courage to carry through.

List of Publications

The following publications have emerged from this thesis so far.

[1] Namazi-Rad, M., and Steel, D. (2011). What Level of Statistical Model Should We Use in Small Area Estimation? Australian and New Zealand Journal of Statistics. Submitted for publication.
[2] Namazi-Rad, M., and Steel, D. (2011). Contextual Effects in Modeling for Small Domain Estimation. In Proceedings of the Fourth Annual ASEARC Research Conference (February 2011).

Conference Presentations

The following conference papers have emerged from this thesis so far.

[1] Namazi-Rad, M., and Steel, D. (2011). Contextual Effects in Modeling for Small Domain Estimation. Presented at the Fourth Annual ASEARC Research Conference. Sydney, Australia.
[2] Namazi-Rad, M., and Steel, D. (2010). What Level of Statistical Model Should We Use in Small Area Estimation? Presented at ASC2010, Australian Statistical Conference. Fremantle, Australia.
[3] Namazi-Rad, M., and Steel, D. (2009). What Level Should be Used in Small Area Estimation? Presented at SAE2009, Conference on Small Area Estimation. Elche, Spain.

Table of Contents

Abstract
Certification
Acknowledgements
List of Publications
Conference Presentations
Table of Contents
List of Figures
List of Tables

1 Introduction
    Basic Concepts and Notation
    Overview of Linear Statistical Models
    Fixed-effects Models
    Mixed Effects Models
    Issues and Problems to be Considered
    Chapter Outlines

2 Small Area Estimation Techniques
    Introduction
    Design-based and Model-based Approaches
    Direct Estimation
    Ratio Estimator
    Post-Stratified Estimator
    Horvitz-Thompson Estimator
    GREG Estimator
    Synthetic Estimation
    Composite Estimator
    Empirical Bayes Method
    Hierarchical Bayes Method
    Small Area Estimation Around the World
    SAE in Australia
    SAE in the United States of America (USA)
    SAE in Canada
    SAE in Europe
    SAE in Iran
    SAE in Korea
    Summary

3 SAE based on Linear Mixed Models
    Introduction
    Linear Mixed Models in SAE
    Unit-level Modeling
    Area-level Modeling
    Regression-Synthetic Estimator
    EBLUP Techniques
    Conclusions and Further Discussions

4 Contextual Effects in Modeling for SAE
    Introduction
    Contextual Models
    Area-level BLUP in a General Format
    Model Comparison under Population Model
    Prediction based on W1 under P
    Prediction based on W2 under P
    Prediction based on W3 under P
    Prediction based on W4 under P
    Prediction based on W5 under P
    Prediction based on W6 under P
    Model Comparison under Population Model
    Prediction based on W1 under P
    Prediction based on W2 under P
    Prediction based on W3 under P
    Prediction based on W4 under P
    Prediction based on W5 under P
    Prediction based on W6 under P
    Conclusion

5 Model-Assisted Design-Based Simulation
    Introduction
    Methodology
    Population Model Parameters
    Key Ideas
    Simulation Results
    MSE of Estimation in Population
    MSE of Estimation in Population
    Model Comparison using Root MSE and Bias
    Discussion of Simulation Results
    Estimated MSE
    Conclusion

Appendix
A Constants and Formulas
    A.1 Model Variables and Parameters within LMMs
    A.2 Variance-Covariance Matrices in LMM
    A.2.1 Calculating Variance-Covariance Inverse Matrices
    A.3 Fisher Scoring Method
    A.4 BLUP
    A.4.1 Estimating Random Effects Using Individual-level LMM
    A.4.2 Conditional Distribution for Sample Means
    A.5 Variance of BLUP based on W1 under P
    A.6 Variance of BLUP based on W2 under P
    A.7 Variance of BLUP based on W3 under P
    A.8 Variance of BLUP based on W1 under P
    A.9 Variance of BLUP based on W2 under P
    A.10 Parameter Estimates in Linear Mixed Models

References

List of Figures

5.1 Fitting LMM on available Individual-level Data from Australian Census
Fitting a Linear Model on available Area-level Data from Australian Census
Fitting a Linear Model on available Area-level Data from Australian Census 2006 Excluding Pilbara
MSE under Population
Bias under Population
Variance under Population
MSE under Population
Bias under Population
Variance under Population
rrmse under Population
rrmse under Population
The Relative Efficiency of Unit-level Model to Area-level Model
Resulting Bias for Synthetic Estimators and EBLUPs
Resulting Variance for Synthetic Estimators and EBLUPs under P
The Relative Efficiency of Synthetic Estimator to Direct Estimator under P
The Relative Efficiency of EBLUP to Direct Estimator under P
Estimated Relative Efficiency for EBLUPs based on W1 over W2 under P

List of Tables

4.1 Summary of Possible Working Models
Categorizing Different Situations based on the Data Availability for Model Fitting
True Sample Models under Population Model
True Sample Models under Population Model
Summary of Possible Working Models and Predictors
Estimating model coefficients based on different working models
Summary of Model Characteristics based on Different True Population Models
Model Comparisons
Weekly Income and Hours Worked in Australian Census
Parameter Values Considered in Population Models
Description of Empirical Bias for Area Synthetic Estimates
Description of Empirical Bias for Area EBLUPs
Description of Empirical Root MSE for Synthetic Estimators
Description of Empirical Root MSE for EBLUPs
Mean rrmse of Synthetic Estimators Considering Different Area Sample Sizes
Mean and Variance Parameter Estimates under Population
The Population Size for Different Statistical Subdivisions
Weekly Gross Salary
Hours Worked
The Sample Size for Different Statistical Subdivisions

Chapter 1 Introduction

Sample surveys allow efficient estimation and other forms of inference about a large population when the available resources do not permit collecting relevant information from every member of the population. Each year, many sample surveys are conducted around the world to obtain the statistical information required for decision and policy making. In recent years, demand has grown markedly for comprehensive statistical information not only at the national level but also for sub-national domains. To reduce the risk of a distorted view of the population, the sample selection process should be designed scientifically, accounting for the variability of characteristics among items in the population. Finding a good survey design that achieves well-defined goals while respecting all the limitations of conducting the survey can be a daunting task. Relevance and accuracy are two guiding principles in collecting valid survey data and calculating the required estimates. Taking the sample design into account, data analysts and statisticians use the available data to calculate reliable estimates with appropriate statistical methods.

Statistical bureaus and other survey organizations widely use sample surveys to produce various types of estimates. However, not all statistical requirements are fulfilled by these surveys. The governmental and non-governmental sectors require comprehensive statistical information not only at national and regional levels but also for smaller areas. Sample surveys are important sources of statistical information in many countries, and there are financial limitations for conducting

surveys capable of producing reliable estimates at small area levels. Small Areas are the geographic or demographic subsets of the population whose domain-specific sample size is not large enough to produce reliable direct estimates, while a Large Area is one with enough domain-specific sample information to warrant the use of direct estimators. Note that a quite small sample size can be allocated to a geographically large area, depending on the characteristic of interest. Such an area is then considered a small area for estimation purposes. During the last few decades, different Small Area Estimation (SAE) techniques have been developed to overcome the challenging problem of finding reliable estimates for small areas [Khoshgooyanfard and Taheri Monazah, 2006].

Small area estimation involves using statistical modeling techniques to produce the required estimates for geographic sub-populations (such as cities, provinces and states) and socio-demographic sub-domains (such as age, gender and race groups) when the available survey data are not sufficient to calculate reliable direct estimates. Usually, auxiliary variables related to the variable of interest are used in statistical models to obtain reliable estimates in the different SAE techniques (Rao, 2003; Chapter 1).

Producing small area statistics has become an important research topic in survey methods in the last few decades, stimulated by increasing demand from government agencies and from the advertising, marketing and business sectors for data at different geographic and socio-demographic levels. At present, many large-scale surveys are designed for estimating national quantities but, sometimes almost as an afterthought, are used for inferences about small areas.

Statisticians face two different situations in using small area estimation. Sometimes the survey is designed for national purposes, and small area estimation can be useful to increase the accuracy of sub-national estimates. In other situations, it is known in advance that small area estimation will be used in a survey, and statisticians have the opportunity to select the best design for this purpose. In both situations, finding the best statistical model to fit to the available data and selecting the best estimation method are very challenging. Statistical models in small area estimation can be formulated at the unit level or

area level. Unit-level models use the available data for different individuals, while area-level models work with the information available at the area level and use aggregate data for estimation purposes. Area-level models are useful when data are accessible only at the area level. An area-level model can also be derived by applying aggregating (averaging) techniques to the individual data.

Choosing the most reliable set of basic assumptions about the actual population model is always a challenge in model-based estimation techniques. In many practical cases, some important factors may be misspecified in modeling the available data. This leads to biased estimates. Statistical analysis based on unit-level modeling makes it possible to use the available individual-level information for estimation purposes. The main advantage of unit-level modeling over area-level modeling is that the resulting estimates have smaller variances. On the other hand, possible area-level covariates for certain auxiliary variables may be misspecified in a general unit-level model. In certain cases, area-level modeling accounts for these area-level covariates automatically. This leads to less biased area-level estimates than those based on unit-level modeling. Even if the area-level effects can be included in fitting the unit-level model under correct assumptions, the resulting estimates may have larger variances because of the increased number of model parameters to be estimated.

In this thesis, assuming the target of inference to be at the area level, the performance of area-level models is compared with that of unit-level models when both individual and aggregate data are available. The aim of this work is to derive statistical models based on available survey data and to compare the accuracy of small area estimation techniques, both theoretically and experimentally.
Monte-Carlo simulations are used to compare the estimation techniques using different levels of statistical models.

1.1 Basic Concepts and Notation

In this section, the notation used in this study is introduced. We consider a finite population of size $N$ from which a sample of size $n$ is drawn. The target variable within the population is denoted by $Y$. When sample data or observations are obtained for the attribute or characteristic of interest, they are denoted by $y$. Suppose there are $K$ different small geographical areas in the target population. In each area, demographic characteristics of the population, such as age and sex, as well as other socio-demographic variables, including marital status, household composition, living arrangements, ethnicity, education and occupational class, may be of interest for estimation purposes and define sub-groups of the population. The number of these sub-groups is denoted by $L$. Therefore, $y_{ilk}$ denotes the sample value of the target variable $Y$ for the $i$th unit in the $l$th demographic or socio-demographic sub-group within the $k$th small geographical sub-division, where $i = 1, 2, \ldots, n_{lk}$; $l = 1, 2, \ldots, L$ and $k = 1, 2, \ldots, K$. For example, if males and females in four age groups and three marital-status categories are considered in a sample survey, the statistical analysis is based on $L = 2 \times 4 \times 3 = 24$ categories. Here $N_{lk}$ and $n_{lk}$ denote, respectively, the population size and the sample size for the $l$th demographic or socio-demographic sub-group in the $k$th small geographical area. This labeling gives a complete cross-classification into $LK$ cells, with $N_{lk}$ population members and $n_{lk}$ sample units in cell $(l, k)$, where:

$$\sum_{k=1}^{K} \sum_{l=1}^{L} n_{lk} = n, \qquad \sum_{k=1}^{K} \sum_{l=1}^{L} N_{lk} = N. \tag{1.1}$$

Based on the sampling design, data are collected from some individuals within the whole population $U$. The part of the whole sample, $s$, that falls into the $k$th area is $s_k = s \cap U_k$, and the part of the non-sampled set, $r$, in that area is $r_k = r \cap U_k$. The usual dot notation is used for sums over sub-areas. For example, $y_{\cdot\cdot k}$ is the sum of the sample values of the characteristic of interest in the $k$th small geographical sub-division, $y_{\cdot l \cdot}$ is the sample sum for the $l$th demographic or socio-demographic sub-group, and $y_{\cdot l k}$ is the sample sum for the $l$th sub-group in the $k$th small geographical sub-division. The sum of the sample values over the whole area is denoted by $y_{\cdot\cdot\cdot}$:

$$\sum_{i \in s_{lk}} y_{ilk} = y_{\cdot l k}, \qquad \sum_{k=1}^{K} \sum_{i \in s_{lk}} y_{ilk} = y_{\cdot l \cdot}, \qquad \sum_{l=1}^{L} \sum_{i \in s_{lk}} y_{ilk} = y_{\cdot\cdot k}, \qquad \sum_{k=1}^{K} \sum_{l=1}^{L} \sum_{i \in s_{lk}} y_{ilk} = y_{\cdot\cdot\cdot}. \tag{1.2}$$

Lowercase letters refer to sample values and statistics, and uppercase letters to the population. Therefore, all definitions given in (1.2) apply to the population after changing the lowercase letters to uppercase. Note that matrices are shown in bold print. Depending on the method of estimation and the purpose of each particular study, precise estimates may be needed only for the whole population, with no further disaggregation required ($L = 1$ and $K = 1$). Also, the term small area may be used only for small geographic areas ($L = 1$), or a socio-demographic sub-group alone may be the small area of interest ($K = 1$). Depending on the situation, some indices may be omitted. For instance, $y_{ik}$ denotes the sample observation of the characteristic of interest $Y$ for the $i$th unit in the $k$th small geographical area.

1.2 Overview of Linear Statistical Models

Different linear models are used in the following chapters, so some basic definitions are reviewed here. More detailed theoretical discussions can be found in the following chapters. In this chapter, the models are formulated for the population data. Assuming a correct model for the actual population, a true sampling model can be derived by changing the uppercase letters to lowercase within the population model

assuming the differences between the sample and population indicators to be ignorable. A simple individual-level linear population model for the $i$th unit within the $k$th area is given by:

$$Y_{ik} = \beta_0 + \beta_1 X_{ik1} + \beta_2 X_{ik2} + \cdots + \beta_P X_{ikP} + \varepsilon_{ik}, \qquad \varepsilon_{ik} \overset{iid}{\sim} N(0, \sigma_\varepsilon^2) \tag{1.3}$$

where $P$ is the number of available auxiliary variables, $Y_{ik}$ is the response random variable for the $i$th individual within the $k$th area, $X_{ikp}$, $p \in \{1, 2, \ldots, P\}$, are observable non-random covariates, and the $\varepsilon_{ik}$ are error terms, which are uncorrelated random variables each with expected value 0 and variance $\sigma_\varepsilon^2$. The unobservable parameters of the model are the regression coefficients $\beta_0, \beta_1, \ldots, \beta_P$. The linear model above can be written for the whole population in matrix form as:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \tag{1.4}$$

where $\mathbf{Y}$ is an $N \times 1$ column vector of random variables, $\mathbf{X}$ is an $N \times (P + 1)$ matrix of known quantities whose rows correspond to statistical units, $\boldsymbol{\beta}$ is a vector of $(P + 1)$ parameters, and $\boldsymbol{\epsilon}$ is an $N \times 1$ vector of errors [Davison (2003)].

When the observations are not independent, Linear Mixed Models (LMMs) can be used to handle the data. Linear mixed models, which are also known as hierarchical linear models, may be expressed in different but equivalent forms. In linear mixed model procedures, the errors are assumed to be correlated. Linear mixed models are a further generalization of linear models that supports the analysis of continuous dependent variables with random effects, hierarchical effects and repeated measures. Linear mixed model procedures are mostly used to model means, variances and covariances in data that display correlation and non-constant variability [Jiang, 2007].

The random variable $\boldsymbol{\epsilon}$ in an LMM is divided into area random effects and individual random model errors, which are independently distributed with mean zero and

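covariance matrices $\mathbf{G}$ and $\mathbf{R}$, respectively (Rao (2003); Chapter 6):

$$\boldsymbol{\epsilon} = \mathbf{Z}\mathbf{u} + \mathbf{e}, \qquad \operatorname{Var}\begin{pmatrix} \mathbf{u} \\ \mathbf{e} \end{pmatrix} = \begin{pmatrix} \mathbf{G} & \mathbf{0} \\ \mathbf{0} & \mathbf{R} \end{pmatrix}, \qquad E(\mathbf{e}) = \mathbf{0} \ \text{and} \ E(\mathbf{u}) = \mathbf{0}, \tag{1.5}$$

where $\mathbf{Z}$ is an $N \times q$ matrix of random-effect regressors, and $\mathbf{u}$ and $\mathbf{e}$ are, respectively, $q \times 1$ and $N \times 1$ random vectors. The matrices $\mathbf{G}$ and $\mathbf{R}$ are assumed to be known $q \times q$ and $N \times N$ covariance matrices, respectively. If $\mathbf{Z} = \mathbf{0}$ and $\mathbf{X} \neq \mathbf{0}$, the model includes only the individual random errors. A model with random effects only is obtained when $\mathbf{Z} \neq \mathbf{0}$ and $\mathbf{X} = \mathbf{0}$. Finally, the model is a mixed model when $\mathbf{Z} \neq \mathbf{0}$ and $\mathbf{X} \neq \mathbf{0}$. In this situation, the model includes the individual and area random effects simultaneously.

1.2.1 Fixed-effects Models

In the fixed-effects model, $\mathbf{Z} = \mathbf{0}$ and $\mathbf{X} \neq \mathbf{0}$. This model is defined as follows:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}, \qquad E(\mathbf{e}) = \mathbf{0} \ \text{and} \ \operatorname{Var}(\mathbf{e}) = \sigma^2 \mathbf{R}. \tag{1.6}$$

Although the fixed-effect coefficients can reflect relationships across the population, the model does not allow any within-group relations to be specified. The Best Linear Unbiased Estimator (BLUE) of the vector of fixed-effects parameters is the estimator with the smallest variance within the class of linear unbiased estimators. The best linear unbiased estimator of the vector of parameters is:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{R}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{R}^{-1} \mathbf{y} \tag{1.7}$$

This estimator is discussed in more detail in a later chapter.

1.2.2 Mixed Effects Models

Sometimes, some effects in a statistical model are modelled as being random. In this case, the parameters which describe the distribution of these random effects should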

be estimated [Snijders, 2005; Faraway, 2006]. When $\mathbf{Z} \neq \mathbf{0}$ and $\mathbf{X} \neq \mathbf{0}$, a mixed effects model can be used. This model can be written as:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e} \tag{1.8}$$

The random effects in this model reflect the area effects, which are important factors in SAE. The Best Linear Unbiased Predictor (BLUP) of the random effects and the Best Linear Unbiased Estimator (BLUE) of the model parameters are given by [Searle (1997)]:

$$\tilde{\mathbf{u}} = \mathbf{G}\mathbf{Z}^{\top} \mathbf{V}^{-1} (\mathbf{Y} - \mathbf{X}\tilde{\boldsymbol{\beta}}), \qquad \tilde{\boldsymbol{\beta}} = (\mathbf{X}^{\top} \mathbf{V}^{-1} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{V}^{-1} \mathbf{Y} \tag{1.9}$$

where:

$$\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}^{\top} + \mathbf{R} \tag{1.10}$$

The BLUP of the random terms within the model is the best predictor in the sense that it has minimum mean square error within the class of linear unbiased predictors. As can be seen above, both quantities are linear functions of the data vector $\mathbf{Y}$. The term predictor is used to distinguish them from the estimators of the fixed effects (Robinson (1991)). For two known column vectors $\mathbf{m}$ and $\mathbf{l}$, the best linear unbiased predictor of the linear combination of the fixed and random effects, $s = \mathbf{l}^{\top}\boldsymbol{\beta} + \mathbf{m}^{\top}\mathbf{u}$, can also be derived as [Henderson (1975)]:

$$\tilde{s} = \mathbf{l}^{\top}\tilde{\boldsymbol{\beta}} + \mathbf{m}^{\top}\mathbf{G}\mathbf{Z}^{\top} \mathbf{V}^{-1} (\mathbf{Y} - \mathbf{X}\tilde{\boldsymbol{\beta}}) \tag{1.11}$$

More detailed discussions of different linear mixed models and appropriate estimation methods can be found in a later chapter.

1.3 Issues and Problems to be Considered

SAE techniques have expanded considerably during the last few decades with increasing demand for social and economic official statistics at sub-regional and

provincial levels [Prasad and Rao, 1990; Ghosh and Rao, 1994; Pfeffermann, 2002]. Auxiliary information is usually used in statistical methods to increase the efficiency of estimators. Considering the variety of available small area estimation methods, three important factors can affect the estimation procedure:

- Variable types (categorical or non-categorical).
- Situations in which the estimates are obtainable.
- Auxiliary data type.

Different models are fitted to the available data to obtain more precise small area estimates. Usually, auxiliary information and its relationship with the target indicators are used to improve the quality of statistical modeling. The term auxiliary data is sometimes used only for population information obtained from a previous census or some other official data system. These data resources may not be obtainable or reliable in practice. In certain cases, survey data can be found which may be more relevant and dependable for modeling. A model is developed depending on the type of the target variable (continuous/quantitative or discrete/qualitative) and on the auxiliary information available for the target area or similar areas. This model should specify possible relationships between the source of auxiliary information and the variable of interest. Auxiliary data can be obtained, at the individual or area level, from one of the sources below:

- Accurate information for larger areas, or data available at the national level, obtained from a census or other official data gathering systems.
- Accurate available information or survey data about the characteristics of interest in other similar areas.
- Reliable past information about the characteristics of interest for the desired area.
- Accurate, updated information about similar characteristics within the area of interest.
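
As a minimal sketch of how auxiliary information can enter a model, the following Python fragment fits a simple pooled regression of the target variable on one auxiliary variable and then predicts area means from census-type covariate means. All names and numbers (the areas A, B and C, the sample values, the function names) are hypothetical and chosen only for illustration. The estimator shown is a basic regression-synthetic form and is not taken from the computations of this thesis.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical unit-level sample: (area label, auxiliary value x, target value y).
sample = [("A", 1.0, 5.5), ("A", 2.0, 7.5), ("B", 3.0, 11.5), ("B", 4.0, 13.5)]

# Hypothetical area means of x, assumed known from a census.
# Area C has no sampled units at all.
census_x_mean = {"A": 1.5, "B": 3.5, "C": 2.5}

# Pooled fit over the whole sample, ignoring area membership.
b0, b1 = fit_simple_ols([x for _, x, _ in sample], [y for _, _, y in sample])

def direct_estimate(area):
    """Sample mean of y in the area, or None when no unit was sampled there."""
    ys = [y for a, _, y in sample if a == area]
    return mean(ys) if ys else None

def synthetic_estimate(area):
    """Pooled regression evaluated at the area's census mean of x."""
    return b0 + b1 * census_x_mean[area]

for area in census_x_mean:
    print(area, direct_estimate(area), round(synthetic_estimate(area), 3))
```

In this sketch, area C has no direct estimate because it contributes no sample units, yet it still receives a model-based estimate through its auxiliary mean.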

In this thesis, we use the available information about the auxiliary variable from the current sample or a census.

If unit-level data are available, SAE is usually based on models formulated at the unit level, but these are ultimately used to produce estimates at the area level. Therefore, this approach can be used where the targets of inference are at the area level. For example, if the area means are the main targets of inference, a unit-level model can first be fitted to the sample data. The required area-level estimates are then calculated using aggregating (averaging) techniques (Longford, 2005).

We would expect a unit-level modeling approach to lead to estimates with smaller variances than estimates based on the area-level approach. However, if an important area-level effect is misspecified in the unit-level model, the parameter estimates obtained from the unit-level and area-level models will have different expectations. Even if all important area-level covariates are included in an individual-level analysis, the variance of the parameter estimates may increase because of the larger number of variables in the working model. This work investigates the circumstances in which directly using an area-level model is more effective. The performance of area-level models is compared with that of unit-level models when both individual and aggregate data are available. A key aspect is whether there are substantial area-level effects involving the covariates. Ignoring these effects in unit-level working models can cause biased estimates, while the same effects can be accounted for automatically in area-level models.

Knowing the correct model is a strong assumption in practice. Therefore, choosing the best working model is always a challenge for survey methodologists. Depending on previous information on the target variable and its relation with the available auxiliary data, statisticians try to find a suitable model. In certain cases, area-level effects in the population model are ignored in model fitting. This causes the resulting parameter estimates to be biased. Even if possible contextual variables are included in an individual-level analysis, the variance of the parameter estimates may increase because of the larger number of variables in the model. Considering that the targets of inference are the area means, we might expect that a simple area-level working model can be reasonably reliable for estimation purposes when

area means are present in the population model as contextual effects.

When the data come from a complex sample, a common approach is to use area-level estimates that account for the complex sampling, together with regression models of the form introduced by Fay and Herriot (1979). In the Fay-Herriot model, the variance of the sampling error is assumed to account for the complex sampling and to be known, and all other model parameters are estimated based on this key value. In this thesis, we try not to use this unrealistic assumption in the model-based estimation. Numerical methods are introduced as an alternative for estimating variance components at different levels of estimation.

Usually, sample information is not sufficient for direct estimation in small areas because of small sample sizes. Therefore, the resulting area-specific direct estimators are not reliable. In such cases, indirect techniques are used, with models linking other data sources, to achieve estimators of acceptable quality (Rao, 2001). In the following chapters, some model-based estimation techniques are introduced. Synthetic and composite estimation methods are discussed in this thesis as two commonly used indirect SAE techniques. The resulting estimates based on the different methods are then compared empirically.

1.4 Chapter Outlines

In Chapter 2, several direct and indirect estimators are introduced and the literature on the use of SAE is reviewed. Several examples of indirect estimation methods used in different parts of the world are also discussed in Chapter 2. The focus of Chapter 3 is on small area estimation techniques based on unit-level and area-level linear mixed models. Synthetic estimators and Empirical Best Linear Unbiased Predictors (EBLUPs) are introduced in that chapter as the two main methods commonly used in SAE. Contextual models are introduced in Chapter 4. Recognizing that area means are the targets of inference, a number of working models to be fitted to the unit-level and area-level sample data are discussed based on different assumed population models. The implications of having area means as possible contextual effects in

the actual population model, but ignored in the fitted model, are considered in that chapter. To do this, two population models are considered in this study. Area means are present in one population model as contextual effects, while the other population model does not include any contextual effects. Several small area estimates are then evaluated based on six working models under each population model. Model features are presented and the resulting estimates based on the different working models are compared theoretically. A simulation study is presented in Chapter 5 to compare the possible working models under the two assumed population models. Population data are generated in this study based on available information from the Australian Census.
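
The role of a contextual effect can be seen in a small noise-free sketch. It assumes a hypothetical population model $Y_{ik} = \beta_0 + \beta_1 X_{ik} + \beta_2 \bar{X}_k$ with made-up values $\beta_0 = 1$, $\beta_1 = 1$, $\beta_2 = 2$ and three areas of two units each; none of these numbers come from the thesis, and the two fitted models are simplified stand-ins for the working models discussed in later chapters. A unit-level working model that omits the area mean $\bar{X}_k$ is compared with an area-level model fitted to the area means.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical covariate values by area (assumption: three areas, two units each).
x_by_area = {1: [0.0, 2.0], 2: [2.0, 4.0], 3: [4.0, 6.0]}
# Assumed population parameters; beta2 is the contextual (area-mean) effect.
beta0, beta1, beta2 = 1.0, 1.0, 2.0

# Noise-free values: y = beta0 + beta1 * x + beta2 * (area mean of x).
y_by_area = {
    k: [beta0 + beta1 * x + beta2 * mean(xs) for x in xs]
    for k, xs in x_by_area.items()
}
x_bar = {k: mean(xs) for k, xs in x_by_area.items()}
y_bar = {k: mean(ys) for k, ys in y_by_area.items()}
areas = sorted(x_bar)

# Unit-level working model that ignores the contextual variable.
unit_x = [x for k in areas for x in x_by_area[k]]
unit_y = [y for k in areas for y in y_by_area[k]]
u0, u1 = fit_simple_ols(unit_x, unit_y)

# Area-level working model fitted to the area means.
a0, a1 = fit_simple_ols([x_bar[k] for k in areas], [y_bar[k] for k in areas])

# Error of each fitted model when predicting the true area means.
unit_error = {k: u0 + u1 * x_bar[k] - y_bar[k] for k in areas}
area_error = {k: a0 + a1 * x_bar[k] - y_bar[k] for k in areas}

print("unit-level slope:", round(u1, 4), "area-level slope:", round(a1, 4))
print("unit-level errors:", {k: round(e, 4) for k, e in unit_error.items()})
print("area-level errors:", {k: round(e, 4) for k, e in area_error.items()})
```

In this constructed example the area-level slope equals $\beta_1 + \beta_2$, so the area means are reproduced without error, whereas the pooled unit-level slope falls between $\beta_1$ and $\beta_1 + \beta_2$ and the predicted means of the outer areas are off.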

Chapter 2 Small Area Estimation Techniques

2.1 Introduction

The demand for comprehensive statistical information has grown markedly in recent years, not only at the national level but also for sub-national domains. Over the past three decades, methods have been developed for estimating small area characteristics in both the public and private sectors. Statistical bureaus and other survey organizations widely use sample surveys to produce estimates not only for the total population but also for various local areas and other small domains. These small domains are usually defined by geographic or demographic subdivisions.

Survey sampling methods were developed in the first half of the last century in order to provide statistical techniques for constructing appropriate sample designs, describing the process of selecting sample individuals from the target population, and producing estimates. Key references are Tchuprow (1923), Neyman (1934), Mahalanobis (1944), Yates (1946), Hansen et al. (1953), Cochran (1953), Sukhatme (1954), Kish (1965) and Kish & Frankel (1974).

In recent decades, the academic world has taken an increasing interest in ways of using various sources of auxiliary data to produce accurate and reliable estimates for minor geographic areas and small subgroups of the population. The auxiliary data give information about the variation between areas and can be obtained from previous censuses or administrative databases that are available for the area.

Often, a large population is sampled, but some governmental and non-governmental analysts may be more interested in reliable estimates at various small area or group levels. A problem arises when only a few or no sample units fall in all or some of these small areas. In such situations, other sources of information may be used to strengthen the estimates. The sources of data on auxiliary variables should be evaluated, and a model should be developed to specify the relationship between the existing auxiliary information and the target variables (Chaudhuri and Maiti, 1994; Jiang and Lahiri, 2006).

SAE deals with the problem of small sample sizes in the target areas. The common approach is to use information from external sources to improve the required estimates. Borrowing strength in SAE involves borrowing information from available data resources in related or similar areas to strengthen estimation. The key idea of this technique is to increase the effective sample size in order to obtain estimates with smaller Mean Square Errors (MSEs) (Rao, 2003). Using the entire sample data, statisticians usually apply predictive statistical models to borrow strength across the target domains (Estevao and Särndal, 2004). We call the resulting estimators indirect, as they are not based solely on the sample information available in the target areas.

The auxiliary information may be accessible at the unit level or at the area level. If the auxiliary information is available for individuals and the targets of inference are at the area level, two types of statistical model can be fitted to the data. The area-level estimates can be calculated directly from an area-level model, or they can be obtained from a unit-level model by aggregation. Depending on the situation and the nature of the available data, the estimates resulting from one model can be more precise than those from the other.
On the other hand, if the auxiliary information is available only at the area level, then the only choice is to fit the model at this level. In this case, statisticians may be concerned about losing efficiency, as they cannot use all sample individuals in the model. However, they may not lose much efficiency, and in certain cases the selected area-level model can be better than unit-level alternatives even if the individual data are available.

The models can be fitted to the available data at the unit level or at the area level. Usually, unit-level models are preferred, as they can use the available data for individual units and the resulting estimates are supposed to be more reliable (Yurdusev, 1993; Sinha and Rao, 2008). Area-level analyses are typically used when the required data are available only at the area level. Statistical agencies prefer area-level models because aggregate data are readily available and the required calculations are less complicated.

In this study, we focus on area means as the main targets of inference and we assume that the sample information is available at the individual level. Certain cases are then discussed in which individual-level information is provided but fitting an area-level model directly may have some advantages and may provide less biased estimates. The aim of this thesis is then to show that finding the best model for calculating the most reliable estimates is very challenging in many situations and can affect the final results considerably.

In the 1970s, Leslie Kish contributed to the development of the field through his direction of three doctoral dissertations at the University of Michigan, the results of which appeared in several publications (Ericksen, 1973; Kalsbeek, 1973; Purcell and Kish, 1979, 1980). Purcell and Kish (1979) reviewed demographic and statistical methods of estimation for small domains. Some early studies on SAE methods were published by Zidek (1982), Rao (1986), McCullagh & Zidek (1987), and Chaudhuri (1992). Statistics Canada (1987) illustrated methods used to estimate the population at various levels in Canada. It evaluated the relevant population estimation methods and the methods that provide estimates for local areas in Canada.

Direct estimates are based on the sample information obtained only from the area of interest and are not very reliable when the sample size is not reasonably large.
Therefore, some additional auxiliary information is used to obtain reliable indirect estimates. Indirect approaches borrow strength by incorporating external data resources to gain more precise estimators.

As discussed before, direct estimation methods are not reliable for small domains because of the lack of effective sample size. Therefore, different kinds of indirect methods have been suggested to overcome this problem. Synthetic estimation is one popular SAE method. In this type of estimation, the information is obtained simultaneously from the sample units located in the target domain and from sample individuals in other areas. The required area-level synthetic estimators are then calculated from a statistical model fitted to the whole sample data. If there are many small areas in the survey, some areas may have no sample units. The main advantage of synthetic estimation is that it can provide estimators for all areas, even for those that have no sample data.

Another way to calculate reliable estimates is to use indirect and direct techniques at the same time. The composite estimation technique combines two available estimators for the target small area by selecting appropriate weights. Usually, a composite estimator is derived as a linear combination of existing direct and indirect estimators.

Several small area estimation methods are introduced in this chapter. Depending on the type of estimation technique and the types of parameter used to produce the small area estimators, these estimation methods can be considered design-based or model-based. The design-based and model-based methods can be described as follows [Brakel and Bethlehem, 2008; Heady et al., 2003]:

Design-based Approach: Under this approach, estimators are justified under the stochastic structure induced by the repeated randomized sampling design.

Model-based Approach: Under this approach, the estimators are derived from a model of the inherent probabilistic structure assumed for the population.

Design-based estimation is a traditional method based on repeated sampling from a finite population. This technique can be found in standard references such as Hansen et al. (1953), Kish (1965), Cochran (1977), and Särndal et al. (1992). Unlike the design-based approach, model-based methods are based on an assumed model for the population. This kind of estimation method was followed by authors such as Ghosh & Meeden (1997), Valliant et al. (2000), Rao (2003) and Brakel & Bethlehem (2008).
The small area estimation theory under the design-based approach is based on the randomization distribution of an estimator over repeated sampling from a fixed finite population. Despite the fact that the direct design-based estimators are design consistent, they do not borrow strength from related small areas (You and Rao, 2003; Pfeffermann and Sverchkov, 2007). Estevao and Särndal (2004) indicated that borrowing strength is not possible under the design-based approach. However, under the model-based approach to small area estimation, optimal estimators can be derived under the assumed models (Rao, 2003). A design-based procedure is justified by the structure of the randomized sample design, while model-based methods are based on a model of the inherent probabilistic structure underlying the population itself (Heady et al., 2003). In the following sections, these approaches are considered in more detail.

2.2 Design-based and Model-based Approaches

Sampling is an important device for large-scale data collection. A sample survey uses a random process to obtain subsets of individuals, in order to gather reliable information about the required characteristics that can be generalized to the whole population. The main idea of statistical inference from a sample survey is to obtain information about a finite population by taking a probability sample from the population, and then to use the information from the sample to make inferences about particular population characteristics such as means, totals and ratios. This statistical procedure can be considered under a design-based or a model-based approach [Ghosh et al., 2008].

Suppose that the target population consists of $N$ distinct elements within $K$ sub-domains labeled $k = 1, 2, \ldots, K$, and that $y_{ik}$ denotes the observed value of the characteristic of interest $Y$ on the $i$th unit in the $k$th sub-domain. Here, $T$ denotes a linear combination of the required variable within the population. For example, $T$ can be defined as the sum of the required variable over the whole population:

\[ T = Y_{..} = \sum_{k=1}^{K} \sum_{i=1}^{N_k} Y_{ik}. \]

Over repeated sampling from the whole population, the simple direct estimator under the design-based approach is

\[ \hat{T}^{DB} = \sum_{k=1}^{K} \hat{T}^{DB}_{k}, \]

where $\hat{T}^{DB}_{k}$ is the design-based estimate of a linear combination of the target variable within the $k$th sub-domain and $\hat{T}^{DB}$ is the same estimator for the whole population.

The expectation of the estimator over the randomization distribution, $E[\hat{T}^{DB}]$, is the expectation over repeated sampling from the population of fixed values. Suppose that $S$ denotes the set of all possible samples (the sample space) and $p(s)$ denotes the probability of selecting a particular sample $s$ [Särndal et al., 1992]:

\[ \Pr(\text{selected sample} = s) = p(s); \qquad \sum_{s \in S} p(s) = 1. \]

Then, the expectation of the design-based estimator can be specified as:

\[ E(\hat{T}^{DB}) = \sum_{s \in S} p(s)\, \hat{T}(s), \tag{2.1} \]

where $\hat{T}(s)$ denotes the value of $\hat{T}^{DB}$ for sample $s$. Note that the population values are treated as fixed constants under the design-based approach and the sample membership is the only source of randomness. Therefore, the design-based estimate depends on $s$, and it has been temporarily written as $\hat{T}(s)$ (Thompson, 1997). The variance of this estimator is given by:

\[ Var(\hat{T}^{DB}) = E\left\{ \left[ \hat{T}^{DB} - E(\hat{T}^{DB}) \right]^{2} \right\}. \tag{2.2} \]

The repeated sampling distribution of the sample error $(\hat{T}^{DB} - T)$ is the basis of design-based inference about $T$. Then, $\hat{T}^{DB}$ can be considered a good estimator if the repeated sampling distribution of $(\hat{T}^{DB} - T)$ has zero (or approximately zero) mean and small variance.

Under the model-based approach, the individual population values are looked upon as realized values of a random variable $Y$ which follows the probability structure of an assumed statistical model for the whole population (Mukhopadhyay, 1998). The expectation of the estimator of the characteristic of interest under the model-based approach is the expectation over repeated realizations of a population model, conditional on the sample $s$. Through this model, the values of the characteristic of interest for all units in the population ($Y_i$; $i = 1, 2, \ldots, N$) are assumed to be generated by a process under a statistical model. Firstly, a probability structure generated by a stochastic model should be assumed and, secondly, it is usual to use an auxiliary variable $X$.
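The randomization expectation in (2.1) and the variance in (2.2) can be checked by brute force on a very small population. The following is a minimal sketch under stated assumptions: the sampling design is taken to be SRSWOR (so that $p(s)$ is the same for every sample), the estimator is the expansion estimator of the population total, and the population values and the function name `design_expectation_and_variance` are hypothetical, chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations


def design_expectation_and_variance(population, n):
    """Enumerate every SRSWOR sample of size n and return the design-based
    expectation (2.1) and variance (2.2) of the expansion estimator of the
    population total. Exact arithmetic is used via Fraction."""
    N = len(population)
    samples = list(combinations(range(N), n))
    # Assumption: SRSWOR, so every sample has the same probability p(s).
    p = Fraction(1, len(samples))
    # T_hat(s) = (N / n) * (sum of the sampled values)
    estimates = [Fraction(N, n) * sum(population[i] for i in s) for s in samples]
    expectation = sum(p * t for t in estimates)                    # eq. (2.1)
    variance = sum(p * (t - expectation) ** 2 for t in estimates)  # eq. (2.2)
    return expectation, variance


# Hypothetical fixed population values (treated as constants, not random).
population = [3, 7, 8, 12, 20]
expectation, variance = design_expectation_and_variance(population, n=2)
true_total = sum(population)
print("E(T_hat) =", expectation, "| true total =", true_total)
print("Var(T_hat) =", variance)
```

Because the population values are fixed and only the sample membership varies, the expectation returned here coincides with the true total, which is what design-unbiasedness means in this setting.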

Survey weights are typically used in survey estimation methods under the design-based approach. Once the sample data have been collected, statisticians develop a weighting algorithm to adjust the required variables for the required survey estimates. Design-based weights reflect the probability of being sampled under the sampling design. A typical form of the design-based estimator weights the available sample units by the inverse of their inclusion probabilities ($w^{DB}_{ik} = 1/\pi_{ik}$). Note that $\pi_{ik}$ denotes the probability of being selected in the sample for the $i$th individual in the $k$th geographical area. Then, the design-based estimator of the total of the target variable within area $k$ is [Särndal et al., 1992]:

\[ \hat{Y}^{DB}_{.k} = \sum_{i \in s_k} w^{DB}_{ik}\, y_{ik}. \tag{2.3} \]

Model-based weights can also be developed from the assumed statistical model for the population. Using model-based weights, a model-based estimator can be derived as:

\[ \hat{Y}^{MB}_{.k} = \sum_{i \in s_k} w^{MB}_{ik}\, y_{ik}, \tag{2.4} \]

where $w^{MB}_{ik}$ depends on the auxiliary data and the assumed population model. Note that the sample design $p(s)$ is ignorable in this approach under certain conditions. The issue of ignorability in the model-based approach has been discussed by Skinner et al. (1989). It has been shown in many statistical investigations that ignoring the sample design, or the population structure as reflected in a sample, may result in biased and misleading estimates and may substantially underestimate the standard error (Brick et al., 2000). If the assumed model being fitted is true, the resulting model-based estimates are efficient and valid (Binder and Roberts, 2006; Ghosh et al., 2008). Therefore, using the model-based approach results in ignoring the survey design, unless the sampling design is an inherent part of the assumed model (Binder and Roberts, 2003).
In brief, under the design-based approach the statistical inference is based on the stochastic structure induced by the sampling design, conditional on the population values. The probability structure of the sampling design plays a less pronounced role in the model-based context, since the inference there is based on the probability structure of an assumed statistical model, conditional on the sample.
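The inverse-probability weighting behind (2.3) can be sketched in a few lines. This is an illustrative sketch only: the function name `design_weighted_total`, the sample values and the inclusion probabilities are hypothetical, and the inclusion probabilities are assumed to be known from the sampling design.

```python
def design_weighted_total(y_sample, inclusion_probs):
    """Design-based estimate of an area total as in (2.3):
    each sampled value y_ik is weighted by w_ik = 1 / pi_ik."""
    if len(y_sample) != len(inclusion_probs):
        raise ValueError("one inclusion probability is needed per sampled unit")
    total = 0.0
    for y, pi in zip(y_sample, inclusion_probs):
        if not 0.0 < pi <= 1.0:
            raise ValueError("inclusion probabilities must lie in (0, 1]")
        total += y / pi  # w_ik * y_ik with w_ik = 1 / pi_ik
    return total


# Hypothetical sample from one area with unequal inclusion probabilities.
y_k = [12.0, 15.0, 9.0]
pi_k = [0.5, 0.25, 0.1]
total_hat = design_weighted_total(y_k, pi_k)
print(total_hat)
```

A unit with a small inclusion probability represents many unsampled units, so it receives a large weight; with equal probabilities $n_k/N_k$ the estimator reduces to the usual expansion of the sample total.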

Note that the main targets of inference in this thesis are the area means, which can be estimated using the available total estimates for the different areas. Letting $\bar{Y}_k$ denote the $k$th area mean within the target population, the estimator of this quantity can be calculated as $\hat{\bar{Y}}_k = N_k^{-1} \hat{Y}_{.k}$.

Direct estimation is not reliable if the sample size is extremely small, and it is not possible when there are zero observations in some sub-domains. Small area estimation methods are required in these circumstances: it is necessary to borrow strength from related areas to produce more accurate estimates for the target areas. The use of linear regression models is an example of borrowing strength from other small domains. In these methods, the relationship among the alternative sources of covariate information about the area of interest is explored through a model-based approach, and these covariates are used to predict the required variable (Dagne, 2001). For instance, suppose that the relation between the variable of interest $Y$ and the auxiliary variable $X$ is given as:

\[ Y_i = \beta_0 + X_i \beta_1 + \varepsilon_i; \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2). \tag{2.5} \]

Assuming this model to be true for the target population, it is fitted to the sample data as a working model. The parameter estimates based on this model can be used to calculate model-based estimates of the variable of interest. Then, the resulting model-based estimate of the $k$th area mean is:

\[ \hat{\bar{Y}}_k = \hat{\beta}_0 + \bar{X}_k \hat{\beta}_1, \tag{2.6} \]

where:

\[ \hat{\beta} = (x'x)^{-1} x'y; \qquad \beta = [\beta_0 \;\; \beta_1]'. \tag{2.7} \]

Here, $y$ is a column vector of all sample information on the target variable and the matrix $x$ contains the sample auxiliary information, as follows:

\[ y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \quad \text{and} \quad x = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}. \tag{2.8} \]

The expectation and variance of the overall mean of the target variable can be obtained through the assumed model as below:

\[ E_{\xi}(\bar{Y}) = \bar{X}\beta \quad \text{and} \quad Var_{\xi}(\bar{Y}) = \frac{\sigma^2}{n}, \]

where the subscript $\xi$ denotes expectation and variance under the assumed population model. If the true model is fitted, this technique uses all sample information from the whole target population in order to provide reliable estimates for all areas, even those that have no sample data.

Additionally, small area models can be considered as special cases of general Linear Mixed Models (LMMs), which include both fixed and random effects (Prasad and Rao, 1990; Jiang and Lahiri, 2006). Such models are discussed in a later subsection. A detailed discussion of typical SAE techniques based on different LMMs is presented in Chapter 3. Some standard methods for area estimation are discussed in the following subsections. Then, alternative techniques are introduced in the thesis in order to improve the resulting estimates for the target small areas.

Direct Estimation

A direct estimate is generated using the sample data obtained only from the area of interest. Suppose that there are $K$ small geographical areas in the population. Using Simple Random Sampling Without Replacement (SRSWOR), the required information is gathered from the sample selected from the target population. If some sample units fall in a particular small area ($n_k \geq 1$), a simple estimate of the $k$th area mean of the variable of interest is given by:

\[ \hat{\bar{Y}}_k = \frac{1}{n_k} \sum_{i \in s_k} y_{ik}, \tag{2.9} \]

which is conditionally unbiased for a fixed $n_k$. Note that in this thesis the small areas are strata, and therefore $n_k$ is considered a fixed value:

\[ E(\hat{\bar{Y}}_k) = E(\bar{y}_k \mid n_k) = \bar{Y}_k. \]

If $N_k$ is known for a particular small area, the expansion estimator of the total of the variable of interest is given by:

\[ \hat{Y}^{exp}_{.k} = \frac{N_k}{n_k} \sum_{i \in s_k} y_{ik}. \tag{2.10} \]
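The direct estimators in (2.9) and (2.10) can be sketched as follows. This is an illustration rather than an implementation taken from the thesis: the function names and sample values are hypothetical, and the choice to raise an error for an area without sample units is one possible way of expressing that no direct estimate exists in that case.

```python
def direct_mean(y_sample_k):
    """Direct estimate of the k-th area mean, eq. (2.9): the sample mean
    of the units that fall in area k."""
    n_k = len(y_sample_k)
    if n_k == 0:
        # With zero observations, direct estimation is not possible.
        raise ValueError("no sample units in this area")
    return sum(y_sample_k) / n_k


def expansion_total(y_sample_k, N_k):
    """Expansion estimate of the k-th area total, eq. (2.10),
    assuming the area population size N_k is known."""
    return N_k * direct_mean(y_sample_k)


# Hypothetical sample of three units from an area of 50 units.
y_k = [2.0, 4.0, 6.0]
mean_hat = direct_mean(y_k)
total_hat = expansion_total(y_k, N_k=50)
print(mean_hat, total_hat)
```

Both estimators use only the units sampled in area $k$, which is why their precision depends entirely on $n_k$.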

If $N_k$ is unknown for a particular small area, a second but less efficient estimator of the total of the variable can be derived. This is a simple expansion estimator:

\[ \hat{Y}^{exp}_{.k} = \frac{N}{n} \sum_{i \in s_k} y_{ik}. \tag{2.11} \]

Considering $n_k$ as a random variable, the expansion estimator is not design-unbiased, but it can be a model-unbiased estimate of $Y_{.k}$, depending on the assumed model for the population. If $n_k$ is assumed to be fixed, the expansion estimator under SRSWOR is design-unbiased for $Y_{.k}$:

\[ E(\hat{Y}^{exp}_{.k} \mid n_k) = N_k\, E(\bar{y}_k \mid n_k) = N_k \bar{Y}_k = Y_{.k}. \]

Suppose that $f_k = n_k / N_k$, where $N_k$ is known for a particular small area. Then the conditional variance of $\bar{y}_k$ for a fixed value of $n_k$ is given by:

\[ Var(\bar{y}_k) = \frac{1 - f_k}{n_k}\, S^2_{y,k}, \tag{2.12} \]

where

\[ S^2_{y,k} = \frac{1}{N_k - 1} \sum_{i=1}^{N_k} (y_{ik} - \bar{Y}_k)^2. \]

Ratio Estimator

Ratio estimation uses knowledge of the population total of an auxiliary variable to improve the estimates of a variable of interest. This estimator was introduced by Cochran (1940) in order to improve estimation efficiency in survey sampling (Singh, 2003; Chapter 3). Suppose $y_{ik}$ and $x_{ik}$ denote the values of two different variables obtained through survey sampling for the $i$th sample unit within the $k$th small area. If information on the variable $X$ is available for the whole population within the $k$th area, the ratio estimate of the population total of the variable of interest in the $k$th area can be defined as:

\[ \hat{Y}^{R}_{.k} = X_{.k}\, \hat{R}_k, \tag{2.13} \]

where:

\[ \hat{R}_k = \frac{y_{.k}}{x_{.k}} = \frac{\bar{y}_k}{\bar{x}_k}. \tag{2.14} \]

Here, $X_{.k}$ denotes the known population total of the auxiliary variable, and $y_{.k}$ and $x_{.k}$ are the totals obtained for a sample of $n_k$ individuals. The ratio estimator of the population mean can be defined as:

\[ \hat{\bar{Y}}^{R}_k = \bar{X}_k\, \hat{R}_k, \tag{2.15} \]

which is approximately unbiased for $\bar{Y}_k$. The variance of this estimator can be approximated through the equation given below [Cochran, 1953; Chapter 1]:

\[ Var\left( \hat{\bar{Y}}^{R}_k \mid \bar{X}_k \right) \approx \frac{1 - f_k}{n_k}\, S^2_{R,k}, \tag{2.16} \]

where, with $R_k = \bar{Y}_k / \bar{X}_k$ and $\rho_k$ denoting the population correlation between $Y$ and $X$ in the $k$th area:

\[ S^2_{R,k} = \frac{1}{N_k - 1} \sum_{i=1}^{N_k} (Y_{ik} - R_k X_{ik})^2 = S^2_{y,k} + R_k^2 S^2_{x,k} - 2 R_k \rho_k S_{y,k} S_{x,k}. \tag{2.17} \]

The ratio estimator of the $k$th area mean is more precise in terms of variance than the simple direct estimator presented in (2.9) when $S^2_{R,k} < S^2_{y,k}$, which (for $R_k > 0$) implies that:

\[ R_k^2 S^2_{x,k} - 2 R_k \rho_k S_{y,k} S_{x,k} < 0 \;\Longrightarrow\; \rho_k > \frac{R_k S_{x,k}}{2 S_{y,k}}. \]

Thus, ratio estimation can be more accurate than simple expansion estimation if the auxiliary variable is sufficiently correlated with the variable of interest (Cochran, 1977; Barnett, 2004). When the variables are highly correlated and there is an approximately proportional relationship between $Y$ and $X$, the ratio estimation method can be much more precise than number-raised estimation (James et al., 2004).

Considering $n_k$ as a random variable (which is the case when the $k$th small area is not a stratum), the sample and population sizes of the different areas are used as auxiliary information. Using this idea, the aforementioned expansion estimator (2.9) can be considered as a ratio estimator, as shown below:

\[ \hat{Y}^{RS}_{.k} = N_k\, \frac{y_{.k}}{n_k}. \tag{2.18} \]
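The ratio estimator of the area mean in (2.15), and the comparison between $S^2_{R,k}$ in (2.17) and $S^2_{y,k}$ in (2.12), can be illustrated numerically. The sketch below makes strong simplifying assumptions: the whole area population is treated as known so that the two population quantities can be computed exactly, and the data and function names are hypothetical.

```python
from statistics import mean


def ratio_mean_estimate(y_sample, x_sample, X_bar_k):
    """Ratio estimate of the area mean, eq. (2.15): X_bar_k * (y_bar / x_bar),
    where X_bar_k is the known population mean of the auxiliary variable."""
    return X_bar_k * (mean(y_sample) / mean(x_sample))


def residual_and_raw_variances(y_pop, x_pop):
    """Return (S2_R, S2_y) for a fully known area population: S2_R as in
    (2.17) with R = Y_bar / X_bar, and S2_y as defined under (2.12)."""
    N = len(y_pop)
    y_bar = mean(y_pop)
    R = y_bar / mean(x_pop)
    S2_y = sum((y - y_bar) ** 2 for y in y_pop) / (N - 1)
    S2_R = sum((y - R * x) ** 2 for y, x in zip(y_pop, x_pop)) / (N - 1)
    return S2_R, S2_y


# Hypothetical area population in which Y is roughly proportional to X.
x_pop = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pop = [2.1, 3.9, 6.2, 7.8, 10.0]
S2_R, S2_y = residual_and_raw_variances(y_pop, x_pop)

# Ratio estimate from a hypothetical sample of two units (units 1 and 2).
estimate = ratio_mean_estimate([2.1, 3.9], [1.0, 2.0], X_bar_k=mean(x_pop))
print(S2_R, S2_y, estimate)
```

With a nearly proportional relationship, the residuals $Y_{ik} - R_k X_{ik}$ are small, so $S^2_{R,k}$ falls well below $S^2_{y,k}$ and the ratio estimator has the smaller approximate variance.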

Known population totals of relevant auxiliary variables are used in ratio estimation methods in order to improve the weighting from sample values to population estimates (Wright et al., 1997; Kadillar and Cingi, 2004; Gupta and Shabbir, 2007). The basic idea behind ratio estimation is to use the relationship between the auxiliary variables and the variable of interest to calculate more reliable estimates.

Post-Stratified Estimator

Stratification is one of the most common techniques, in which the sampling frame is divided into non-overlapping groups or strata and a sample is taken independently from each stratum. This technique can improve the representativeness of the sample and reduce the sampling error at the same time. Sometimes, stratification is used after the sample has been selected. In this case, the sampled individuals are cross-classified according to their socio-demographic characteristics after the sample selection. This procedure is known as post-stratification (Holt and Smith, 1979; Valliant, 1993).

In each of the $K$ areas within the target population, some demographic characteristics of the population, such as age, sex, education, and similar factors, can be used to obtain a post-stratified estimate. Assuming the population is divided into $Q$ socio-demographic strata, the post-stratified estimate of the total of the variable of interest within the $k$th small geographical area is given by (Chhikara et al., 1995):

\[ \hat{Y}^{pst}_{.k.} = \sum_{q=1}^{Q} \hat{Y}^{pst}_{.kq}, \tag{2.19} \]

where:

\[ \hat{Y}^{pst}_{.kq} = \hat{N}_{kq}\, \hat{\bar{Y}}^{D}_{kq}. \tag{2.20} \]

Here, $\hat{N}_{kq}$ denotes the estimated size of the $q$th post-stratum within the $k$th geographical area, and $\hat{\bar{Y}}^{D}_{kq}$ is an estimator of the mean response within that socio-demographic stratum. $\hat{N}_{kq}$ is usually obtained from the census or from administrative data sources, but it can also be a population estimate obtained from a much better sample survey. The post-stratified estimate of the total of the target variable within the $q$th post-stratum is:

\[ \hat{Y}^{pst}_{..q} = \sum_{k=1}^{K} \hat{Y}^{pst}_{.kq}. \tag{2.21} \]

The estimator $\hat{\bar{Y}}^{D}_{kq}$ can be obtained directly as the sample mean of the $q$th post-stratum within the $k$th geographical area, as shown below, where $s_{kq}$ denotes the sample units in that post-stratum:

\[ \hat{\bar{Y}}^{D}_{kq} = \frac{1}{n_{kq}} \sum_{i \in s_{kq}} y_{ikq} = \bar{y}_{kq}. \tag{2.22} \]

For more detailed information about post-stratified estimation, see Chhikara et al. (1995) and Holt & Smith (1979). The post-stratified estimator presented in (2.20) can also be defined using an auxiliary variable $X$, as below:

\[ \hat{Y}^{pst\,R}_{.kq} = \hat{R}_{kq}\, X_{.kq}, \tag{2.23} \]

where:

\[ \hat{R}_{kq} = \frac{y_{.kq}}{x_{.kq}}. \tag{2.24} \]

Here, $y_{.kq}$ and $x_{.kq}$ are, respectively, the sample totals of the variable of interest and of the auxiliary variable for the $q$th post-stratum within the $k$th area. Note that $X_{.kq}$ is assumed to be known in (2.23).

Horvitz-Thompson Estimator

Under complex designs, where the probability of falling into the sample differs between individuals because of the sampling design, the Horvitz-Thompson estimator can be used as an unbiased estimator. This estimator was developed by Narain (1951) and by Horvitz and Thompson (1952) for unequal probability sampling without replacement from finite populations, as follows:

\[ \hat{Y}^{HT}_{.kl} = \sum_{i \in s_{kl}} \frac{y_{ikl}}{\pi_{ikl}}. \tag{2.25} \]

Here, $\pi_{ikl}$ denotes the probability of being selected in the sample for the $i$th individual in the $l$th post-stratum within the $k$th geographical area (Midha, 1988). Suppose $I_{ikl}$ denotes the number of times the $i$th population unit has been selected in the sampling process from the $l$th demographic or socio-demographic sub-group within the $k$th geographical area. Then, the sample set can be defined as below:

\[ s_{kl} = \{ i \in U_{kl};\; I_{ikl} > 0 \}, \]

where $U_{kl}$ contains all population units and $s_{kl}$ denotes the sample data set for the $l$th demographic or socio-demographic sub-group within the $k$th geographical area, and $I_{ikl}$ is the number of times unit $i$ is selected. Every unit within the population is assumed to have a non-zero probability of inclusion in the sample:

\[ \pi_{ikl} = E(I_{ikl}) > 0 \;\; \text{for all } i \in U_{kl}; \qquad \pi_{(ij)kl} = E(I_{ikl} I_{jkl}). \]

The design-based variance of this estimator can be calculated through the equation below [Särndal et al., 1992; Stehman and Overton, 1987]:

\[ Var(\hat{Y}^{HT}_{.kl}) = \sum_{i \in U_{kl}} (Y_{ikl})^2\, \frac{1 - \pi_{ikl}}{\pi_{ikl}} + \sum_{i \in U_{kl}} \sum_{\substack{j \neq i \\ j \in U_{kl}}} \left( \pi_{(ij)kl} - \pi_{ikl}\pi_{jkl} \right) \frac{Y_{ikl} Y_{jkl}}{\pi_{ikl}\pi_{jkl}} = \frac{1}{2} \sum_{i \in U_{kl}} \sum_{\substack{j \neq i \\ j \in U_{kl}}} \left( \pi_{ikl}\pi_{jkl} - \pi_{(ij)kl} \right) \left( \frac{Y_{ikl}}{\pi_{ikl}} - \frac{Y_{jkl}}{\pi_{jkl}} \right)^2, \tag{2.26} \]

where $\pi_{(ij)kl}$ is the pairwise inclusion probability of individuals $i$ and $j$ in the $l$th sub-group within the $k$th geographical area, that is, the probability that both elements $i$ and $j$ will be included in the sample (Stehman and Overton, 1994). Using sample data, Horvitz and Thompson (1952) proposed a formula for estimating this variance, given as:

\[ \widehat{Var}_{HT}(\hat{Y}^{HT}_{.kl}) = \sum_{i \in s_{kl}} \left( \frac{y_{ikl}}{\pi_{ikl}} \right)^2 (1 - \pi_{ikl}) + \sum_{i \in s_{kl}} \sum_{\substack{j \neq i \\ j \in s_{kl}}} \left( \frac{\pi_{(ij)kl} - \pi_{ikl}\pi_{jkl}}{\pi_{(ij)kl}} \right) \frac{y_{ikl}\, y_{jkl}}{\pi_{ikl}\pi_{jkl}}. \tag{2.27} \]

Yates & Grundy (1953) and Sen (1953) used another formula for estimating this variance, as below:

\[ \widehat{Var}_{SYG}(\hat{Y}^{HT}_{.kl}) = \frac{1}{2} \sum_{i \in s_{kl}} \sum_{\substack{j \neq i \\ j \in s_{kl}}} \left( \frac{\pi_{ikl}\pi_{jkl} - \pi_{(ij)kl}}{\pi_{(ij)kl}} \right) \left( \frac{y_{ikl}}{\pi_{ikl}} - \frac{y_{jkl}}{\pi_{jkl}} \right)^2. \tag{2.28} \]

Using the inclusion probabilities of the sample individuals, the post-stratified estimator for such cases has been described by Smith (1991) as the so-called Hájek estimator,

which is given as:

\[ \hat{Y}^{pst\,HT}_{.kq} = N_{kq} \left( \frac{\hat{Y}^{HT}_{.kq}}{\hat{N}_{kq}} \right), \tag{2.29} \]

where

\[ \hat{N}_{kq} = \sum_{i=1}^{n_{kq}} \pi_{ikq}^{-1}. \tag{2.30} \]

A more detailed description of the Hájek estimator can be found in Dorfman & Valliant (1997) and Valliant et al. (2000).

GREG Estimator

Generalized Regression (GREG) estimation is another technique often used to produce area-level estimates of the target variable, using the available domain-specific information on the auxiliary variables in the study. Assume a regression model for the population values of $Y$ of the form $E_{\xi}(Y_i \mid X_i) = \mu_{Y_i}$, where $\mu_{Y_i}$ is a function of $X_i$ that includes unknown parameters. Then, a possible design-based estimator is based on:

\[ T = \sum_{i \in U} \mu_{Y_i} + \sum_{i \in U} (Y_i - \mu_{Y_i}). \tag{2.31} \]

The common population model used in simple regression is presented in (2.5). Under this model we have:

\[ E_{\xi}(Y_i \mid X_i) = \mu_{Y_i} = \beta_0 + X_i \beta_1, \qquad Var_{\xi}(Y_i \mid X_i) = \sigma^2, \qquad Cov_{\xi}(Y_i, Y_j \mid X_i, X_j) = 0. \]

Considering the linear model presented in (2.5) as the true population model, the function presented in (2.31) can be calculated through the equation below [Valliant et al., 2000; Chapter 5]:

\[ T^{GREG} = \sum_{i \in U} \mu_{Y_i} + \sum_{i \in U} \varepsilon_i. \tag{2.32} \]

Fitting the model presented in (2.5) to the available sample data, the model parameters can be estimated as shown in (2.7). Then, $\hat{\mu}_{\bar{Y}} = \hat{\beta}_0 + \bar{X} \hat{\beta}_1$. Replacing each of the unknown population values $\mu_{Y_i}$ by a suitable estimate $\hat{\mu}_{Y_i}$, which corresponds to the fitted value of the regression of $y$ on $x$, and the sum of the $\varepsilon_i$ by the Horvitz-Thompson estimator of the residuals, a new estimator is defined as follows:

\[ \hat{T}^{GREG} = \sum_{i \in U} \hat{\mu}_{Y_i} + \sum_{i \in s} \frac{\hat{\varepsilon}_i}{\pi_i}. \tag{2.33} \]

This estimator is referred to as the Generalized Regression Estimator, or GREG, by Särndal et al. (1992), Axelson (1998), Hedlin et al. (2001), and Beaumont & Alavi (2004). The design-based variance of the GREG estimator based on the Taylor series expansion is given by:

\[ Var(\hat{T}^{GREG}) = \sum_{i \in s} \sum_{\substack{j \neq i \\ j \in s}} \left( \pi_{(ij)} - \pi_i \pi_j \right) \frac{(y_i - \mu_{Y_i})(y_j - \mu_{Y_j})}{\pi_i \pi_j}. \tag{2.34} \]

The Horvitz-Thompson estimator of this variance is:

\[ \widehat{Var}_{HT}(\hat{T}^{GREG}) = \sum_{i \in s} \sum_{\substack{j \neq i \\ j \in s}} \left( \frac{\pi_{(ij)} - \pi_i \pi_j}{\pi_{(ij)}} \right) \frac{(y_i - \mu_{Y_i})(y_j - \mu_{Y_j})}{\pi_i \pi_j}. \tag{2.35} \]

The Sen-Yates-Grundy (SYG) version of this estimator, written in the same form as (2.28), is:

\[ \widehat{Var}_{SYG}(\hat{T}^{GREG}) = \frac{1}{2} \sum_{i \in s} \sum_{\substack{j \neq i \\ j \in s}} \left( \frac{\pi_i \pi_j - \pi_{(ij)}}{\pi_{(ij)}} \right) \left( \frac{y_i - \mu_{Y_i}}{\pi_i} - \frac{y_j - \mu_{Y_j}}{\pi_j} \right)^2. \tag{2.36} \]

More detailed information about the Generalized Regression Estimator and its robustness can be found in Beaumont and Alavi (2004).

2.3 Synthetic Estimation

Synthetic estimation is a technique that provides the required area-specific estimates using a combination of available data from the target domain and information from larger areas, based on certain assumptions. Synthetic State Estimates of Disability (1968), published by the National Center for Health Statistics, can be considered one of the first applications of synthetic estimates. In this report, synthetic estimators were used to generate estimates of long-term and short-term physical disabilities based on the National Health Interview Survey.

Levy (1971) used a synthetic estimator to compute State estimates of deaths from motor vehicles for the year 1960 and evaluated the synthetic approach by computing the average relative errors of the synthetic estimates for States. Synthetic estimates are reviewed by Gonzalez (1973). Gonzalez and Waksberg (1973) presented a comparison between the synthetic and direct estimates in terms of the probable errors in small areas. An empirical study was carried out by the Australian Bureau of Statistics (ABS) to produce synthetic estimates of income and work force status for Australian Census Statistical Divisions (Purcell and Linacre, 1976). Using unemployment data for counties from the Current Population Survey (CPS) in the USA and the 1970 census, Gonzalez and Hoza (1978) investigated the errors under the synthetic approach. Schaible et al. (1979) reviewed the relevant literature on synthetic approaches and investigated small area estimation techniques in the Health Interview Survey of the National Center for Health Statistics.

A synthetic estimator is defined by Gonzalez (1973) as an estimator calculated from the required parameter estimates for a large area, obtained from a sample survey. This estimator is then used to derive estimates for its sub-areas, under the assumption that the sub-areas are similar in the characteristics of interest. It is assumed that a population is divided into small areas and that an estimate of some characteristic is required in each of the areas. Suppose that information about the characteristic of interest is available for the larger area (which contains the target smaller areas), and that the data distribution is homogeneous in terms of the parameters of interest across the whole larger area. Then, under the synthetic approach, the estimated values for the larger area can be used for the smaller areas.
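The simplest form of the synthetic approach described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the estimator used later in the thesis: the larger-area mean is estimated by the pooled sample mean (which presumes equal-probability sampling), the area population sizes are assumed known, and the function name and data are hypothetical.

```python
def synthetic_area_totals(sample_by_area, population_size_by_area):
    """Synthetic estimates of area totals: the mean estimated for the
    larger area is applied to every small area, so that
    total_k = N_k * (larger-area mean). Areas without any sample units
    still receive an estimate."""
    pooled = [y for ys in sample_by_area.values() for y in ys]
    if not pooled:
        raise ValueError("the larger area has no sample units")
    # Assumption: equal-probability sampling, so the pooled sample mean
    # is used as the direct estimate for the larger area.
    large_area_mean = sum(pooled) / len(pooled)
    return {area: N_k * large_area_mean
            for area, N_k in population_size_by_area.items()}


# Hypothetical data: area "C" has no sample units at all.
sample_by_area = {"A": [4.0, 6.0], "B": [5.0], "C": []}
population_size_by_area = {"A": 100, "B": 40, "C": 10}
estimates = synthetic_area_totals(sample_by_area, population_size_by_area)
print(estimates)
```

The estimate for each area is only as good as the homogeneity assumption: if the true mean of an area differs from the larger-area mean, the synthetic estimate for that area is biased, however large the pooled sample is.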
Note that in a large homogeneous population, individuals share the same structural characteristics. Sometimes, small areas with few or no sample units are combined in order to increase the effective sample size for area-specific estimation. A reliable direct estimator can then be obtained from the available data for the resulting larger area. Assuming homogeneity of the target parameters across the larger area, the obtained estimator can be used for each small area. In this method of estimation, the higher the homogeneity between different

areas, the higher the accuracy in estimating the characteristics within the small areas. In the real world, it is very difficult to find a situation in which the population is homogeneous enough for the direct estimators for the larger areas to be used within the small areas without large errors. Therefore, some auxiliary data are used in an attempt to decrease the errors in the synthetic approach. In other words, the changes in the characteristics of interest from one area to another that are related to the auxiliary variable are accounted for in the final estimation. In this approach, different estimated values are obtained for different areas unless the areas are very similar in structure and have the same values of the auxiliary variable.

As before, suppose $K$ denotes the number of small areas in a research study and $\bar{Y}_1, \ldots, \bar{Y}_K$ are the population mean values of the characteristic $Y$ in these areas. Then $\bar{Y}$ is the mean value for the bigger area, which contains all smaller areas, and the sample size in this big area is adequate for obtaining a reliable direct estimator with reasonable sampling error. The direct estimate of the parameter of interest is found from the observed values in each area separately. Now, let $\hat{\bar{Y}}^D_1, \ldots, \hat{\bar{Y}}^D_K$ and $\hat{\bar{Y}}^D$ be the resulting direct estimators of the mean values in the small areas and the direct estimator of the mean for the whole population, respectively. It is assumed that the estimators are design unbiased. The design-based variances of these estimators are then $\mathrm{Var}(\hat{\bar{Y}}^D_1), \ldots, \mathrm{Var}(\hat{\bar{Y}}^D_K)$ and $\mathrm{Var}(\hat{\bar{Y}}^D)$. The sample size for the whole area is much bigger than the sample size for the smaller areas ($n_k < n$) and therefore $\mathrm{Var}(\hat{\bar{Y}}^D) < \mathrm{Var}(\hat{\bar{Y}}^D_k)$. Here, the question arises whether $\hat{\bar{Y}}^D$ is better than $\hat{\bar{Y}}^D_k$ for estimating $\bar{Y}_k$. To answer this question, consider $\hat{\bar{Y}}^D_k$ and $\hat{\bar{Y}}^D$ as two possible choices for estimating $\bar{Y}_k$.
Then, the MSE values for these estimators are as follows:

\[ \mathrm{MSE}(\hat{\bar{Y}}^D_k; \bar{Y}_k) = \mathrm{Var}(\hat{\bar{Y}}^D_k) + \left[ E(\hat{\bar{Y}}^D_k) - \bar{Y}_k \right]^2 = \mathrm{Var}(\hat{\bar{Y}}^D_k), \]

\[ \mathrm{MSE}(\hat{\bar{Y}}^D; \bar{Y}_k) = \mathrm{Var}(\hat{\bar{Y}}^D) + \left[ E(\hat{\bar{Y}}^D) - \bar{Y}_k \right]^2, \]

where $E(\hat{\bar{Y}}^D) = \bar{Y}$. Then $\hat{\bar{Y}}^D$ is more suitable than $\hat{\bar{Y}}^D_k$ for estimating $\bar{Y}_k$ when $\mathrm{Var}(\hat{\bar{Y}}^D_k) > \mathrm{MSE}(\hat{\bar{Y}}^D; \bar{Y}_k)$, that is, when

\[ \mathrm{Var}(\hat{\bar{Y}}^D_k) - \mathrm{Var}(\hat{\bar{Y}}^D) > (\bar{Y} - \bar{Y}_k)^2 \quad \Longleftrightarrow \quad \frac{S_k^2}{n_k} - \frac{S^2}{n} > (\bar{Y} - \bar{Y}_k)^2, \]

where $S_k$ and $S$ are the standard deviations of the variable of interest within area $k$ and for the whole population, respectively. Assuming these standard deviations to be approximately the same ($S_k \approx S$), the condition becomes:

\[ \frac{1}{n_k} - \frac{1}{n} > \frac{(\bar{Y}_k - \bar{Y})^2}{S^2} \quad \Longleftrightarrow \quad n_k < \frac{n S^2}{S^2 + n(\bar{Y}_k - \bar{Y})^2} = n_k^*. \]

If the sample size for the kth area is bigger than the boundary value $n_k^*$, direct estimation for the required area is more appropriate than using the direct estimate for the bigger area. In other words, when there are enough sample units within the kth area ($n_k > n_k^*$), $\hat{\bar{Y}}^D$ should no longer be used instead of $\hat{\bar{Y}}^D_k$. Furthermore, if the population is truly homogeneous with respect to the characteristics or parameters of interest, then $\bar{Y}_k = \bar{Y}$. Since the variances of these estimates depend on the number of sample units within a homogeneous population, the variance of direct estimates should normally be bigger for smaller areas ($\mathrm{Var}(\hat{\bar{Y}}^D) < \mathrm{Var}(\hat{\bar{Y}}^D_k)$). In other words, higher levels of homogeneity can provide higher levels of accuracy in estimating $\bar{Y}_k$. Therefore, using $\hat{\bar{Y}}^D$ instead of $\hat{\bar{Y}}^D_k$ would be more appropriate for homogeneous populations.

Synthetic estimators can be used when there are not enough sample units falling in the area of interest, but some estimates are available for larger subsets of the population. Census data can often be considered a useful source of such data. Depending on the purpose of a particular study, a larger subset could mean a geographic area (such as a city, province, state, country or the world) or a demographic group (such as an age group, gender group, work status, food consumption, etc.). In order to obtain an appropriate estimate for the particular small area, some available weighted data from larger areas should be used.
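The sample-size boundary $n_k^*$ derived earlier in this section can be checked numerically. The sketch below is only an illustration: it assumes simple random sampling without a finite population correction and a common variance $S^2$ in the area and in the whole population, and the function names and input values are hypothetical rather than taken from the thesis.

```python
def threshold_sample_size(n, s2, ybar_k, ybar):
    """Boundary n_k* = n*S^2 / (S^2 + n*(Ybar_k - Ybar)^2).

    Assumes a common variance s2 (S_k ~ S) and ignores the finite
    population correction; both are simplifying assumptions.
    """
    gap2 = (ybar_k - ybar) ** 2
    return n * s2 / (s2 + n * gap2)


def direct_is_preferred(n_k, n, s2, ybar_k, ybar):
    """True when the area-level direct mean has the smaller MSE."""
    var_direct = s2 / n_k                       # variance of the area direct mean
    mse_pooled = s2 / n + (ybar_k - ybar) ** 2  # variance plus squared bias
    return var_direct < mse_pooled


# Hypothetical inputs, chosen only for illustration.
n_star = threshold_sample_size(n=1000, s2=4.0, ybar_k=10.5, ybar=10.0)
print(round(n_star, 2))
print(direct_is_preferred(10, 1000, 4.0, 10.5, 10.0))
print(direct_is_preferred(20, 1000, 4.0, 10.5, 10.0))
```

With these inputs the boundary falls between 10 and 20 sample units, so the pooled estimate is preferred for an area with 10 units and the direct estimate for an area with 20 units. When the area mean equals the overall mean, the boundary equals $n$, which reflects the homogeneous case discussed above.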
The synthetic estimates for a small area are examined by Gonzalez (1973), under the assumption that the characteristics of interest are similar within and between sub-areas. Gonzalez (1973) suggested

an average MSE measure for evaluating the synthetic estimator and used estimates of the number of dilapidated housing units to investigate the bias of his estimator. Holt et al. (1979) considered a population divided into $K$ mutually exclusive and exhaustive sub-areas labeled $k = 1, 2, \ldots, K$. There were also $L$ identifiable sub-groups within each sub-area, labeled $l = 1, 2, \ldots, L$. The proportions $P_{kl}$ were assumed to be known from the previous census or another source of accurate information, where:

\[ P_{kl} = \frac{N_{kl}}{N_k}, \qquad \sum_{l=1}^{L} P_{kl} = 1. \tag{2.37} \]

Here, $N_k$ denotes the population size of the kth sub-area and $N_{kl}$ denotes the population size of the lth sub-group within the kth sub-area. Using the sample data, a direct estimate of the mean for the lth sub-group is:

\[ \hat{\bar{Y}}^D_{.l} = \frac{\sum_{k=1}^{K} \hat{Y}^D_{k.l}}{\sum_{k=1}^{K} N_{kl}}. \tag{2.38} \]

The given proportions $P_{kl}$ were then applied to the sub-group means to obtain a combined estimate of the mean for the kth sub-area based on SRSWOR as below [Holt et al., 1979]:

\[ \hat{\bar{Y}}^{Syn}_k = \sum_{l=1}^{L} P_{kl}\, \hat{\bar{Y}}^D_{.l}. \tag{2.39} \]

The basic assumption of the synthetic approach here is that the mean values of the target variable for the $K$ sub-areas are approximately equal within the lth sub-group. Usually, the synthetic estimator is biased, and the bias depends on the differences between $\bar{Y}_{.l}$ and $\bar{Y}_{kl}$. The expectation and mean square error of the synthetic estimator introduced above are given as [Heeringa, 1981]:

\[ E(\hat{\bar{Y}}^{Syn}_k) = \bar{Y}_k + \sum_{l=1}^{L} \left[ \frac{N_{kl}}{N_k} (\bar{Y}_{.l} - \bar{Y}_{kl}) \right], \]

\[ \mathrm{MSE}(\hat{\bar{Y}}^{Syn}_k) = \sum_{l=1}^{L} \left[ \left( \frac{N_{kl}}{N_k} \right)^2 \mathrm{Var}(\hat{\bar{Y}}^D_{.l}) \right] + \left\{ \sum_{l=1}^{L} \left[ \frac{N_{kl}}{N_k} (\bar{Y}_{.l} - \bar{Y}_{kl}) \right] \right\}^2. \tag{2.40} \]
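The synthetic estimator of Holt et al. (1979) in (2.37) to (2.39) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `synthetic_means`, the nested-list layout (`N[k][l]` for known population counts and `y_tot_hat[k][l]` for direct estimates of sub-group totals) and the numbers are all hypothetical and are not part of the thesis.

```python
def synthetic_means(N, y_tot_hat):
    """Synthetic estimates of the area means.

    N[k][l]         : known population count of sub-group l in area k
    y_tot_hat[k][l] : direct estimate of the total of Y in that cell
    """
    K, L = len(N), len(N[0])

    # Sub-group means pooled over all areas, as in (2.38).
    group_mean = [
        sum(y_tot_hat[k][l] for k in range(K)) / sum(N[k][l] for k in range(K))
        for l in range(L)
    ]

    estimates = []
    for k in range(K):
        N_k = sum(N[k])
        # Weight each pooled sub-group mean by P_kl = N_kl / N_k, as in (2.39).
        estimates.append(sum(N[k][l] / N_k * group_mean[l] for l in range(L)))
    return estimates


# Hypothetical counts and estimated totals for K = 2 areas and L = 2 sub-groups.
N = [[10, 30], [20, 20]]
y_tot_hat = [[20.0, 120.0], [40.0, 80.0]]
syn = synthetic_means(N, y_tot_hat)
print(syn)
```

The two areas receive different synthetic estimates only because their sub-group compositions differ; the sub-group means themselves are shared across areas, which is exactly the homogeneity assumption stated above.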

46 Chapter 2 Small Area Estimation Techniques In the mentioned method of estimation which is suggested by Holt et al. (1979), the population sizes are used as weights. The population sizes are assumed to be nown and may be obtained from previous census or some source of accurate information. Purcell & Kish (1979) and Ghosh & Rao (1994) proposed a different series of weights which were obtained from suitable auxiliary data using a census or some other source of accurate information. These series of weights are given as below: R l = X.l X... (2.41) where X.l denotes the total value of auxiliary variable obtained form lth sub-group within th area, and X.. denotes the total value for th area. The use of these weights imply the relationship of the small area to required sub-group with respect to auxiliary variable X. Under the similarity assumption of auxiliary variable characteristics within and between the sub-areas, Laae & Langva (1976), Purcell & Linacre (1976), Ghangurde & Sirgh (1977), and Schaible et al. (1979) have applied another type of synthetic estimator using available data from national household samples defined as: Ȳ Syn = 1 N L R l Ŷ D... (2.42) l=1 Therefore, the synthetic estimator Ȳ Syn is unbiased for Ȳ by assuming: Y.l X.l = Y.. X... A similar definition for a synthetic estimator has been presented by Rao (2000); Chapter 12. National Center for Health Statistics (NCHS) of USA (1968) used a similar synthetic estimator in study of disability indicators as follows: where: Ȳ Syn l = L l=1 R l Ȳ D, (2.43) R l = X l X l. (2.44) X l is the mean of population auxiliary individuals falling in lth sub-group while X l is the mean of those falling in th area and lth sub-group. Then, Ȳ Syn l is an 33

unbiased estimator for the population mean of the required variable within the lth sub-group, assuming that the conditions below are satisfied within the target population:

\[ \bar{Y}_{kl} = \bar{Y}_k, \qquad \bar{X}_{kl} = \bar{X}_k. \]

The assumption above states that the variation between the sub-group means of the auxiliary variable is reasonably small, which hardly happens in real situations [Singh et al., 2002].

To use the synthetic procedure in small area estimation, a high level of homogeneity with respect to the characteristics of interest is desirable within the sub-areas. This happens when the statistical properties of any one part of an overall dataset are the same as those of any other part. However, homogeneity rarely holds for the whole population under study in real situations, although it can sometimes be found separately within different groups of sub-domains. In this case, post-stratification may help to obtain more reliable estimates. Considering $Q$ homogeneous post-stratified sub-groups instead of $L$ socio-demographic sub-domains in the synthetic estimators introduced above, post-stratified synthetic estimators can be formed. Using the Horvitz-Thompson estimator, a generalized class of design-unbiased synthetic estimators can be defined as:

\[ \hat{\bar{Y}}^{Syn}_{k,HT} = \sum_{l=1}^{L} P_{kl}\, \hat{\bar{Y}}^{HT}_{.l} = \sum_{l=1}^{L} \frac{P_{kl}}{\sum_{k=1}^{K} N_{kl}} \sum_{k=1}^{K} \sum_{i \in s_{kl}} \frac{y_{kil}}{\pi_{kil}}. \tag{2.45} \]

More detailed descriptions of this estimator can be found in Mukhopadhyay (1998); Chapter 18.

Synthetic estimates can be calculated easily, which can be considered a substantial advantage of this kind of estimate. Furthermore, the variance of the synthetic estimator is smaller than that of the post-stratified estimator, as it uses other available data sources to reduce the errors caused by the loss of information in small sub-domains, but it will be biased in general. Holt et al. (1979) presented various one-way and two-way fixed-effect models and used synthetic estimation methods to predict the individual population values.
A simple one-way fixed-effect linear model, which decomposes the response into an overall mean and a model error term for the required variable $Y$, can be written as:

\[ Y_{ki} = \mu + \varepsilon_{ki}; \quad i = 1, \ldots, N_k \;\&\; k = 1, \ldots, K, \qquad E(\varepsilon_{ki}) = 0 \;\&\; \mathrm{Var}(\varepsilon_{ki}) = \sigma^2. \tag{2.46} \]

The Best Linear Unbiased Estimator (BLUE) for the parameter $\mu$ is given as:

\[ \tilde{\mu} = \bar{Y}. \tag{2.47} \]

To use this model in a sampling design with inclusion probabilities $\pi_{ki}$, the weighted model-motivated design-based estimator for $\mu$ can be defined as below:

\[ \hat{\mu} = \left( \sum_{k=1}^{K} \sum_{i \in s_k} \frac{y_{ki}}{\pi_{ki}} \right) \left( \sum_{k=1}^{K} \sum_{i \in s_k} \frac{1}{\pi_{ki}} \right)^{-1}. \tag{2.48} \]

A regression model can also be formulated within the population for $P$ auxiliary random variables:

\[ y_{ki} = x_{ki}'\beta + e_{ki}; \quad i = 1, \ldots, N_k \;\&\; k = 1, \ldots, K, \]
\[ E(\varepsilon_{ki}) = 0; \quad \mathrm{Var}(\varepsilon_{ki}) = \sigma^2, \]
\[ E(y_{ki} \mid x_{ki}) = x_{ki}'\beta; \quad \mathrm{Var}(y_{ki} \mid x_{ki}) = \sigma^2 \upsilon_{ki}, \tag{2.49} \]

where $x_{ki} = (x_{ki1}, \ldots, x_{kiP})'$ is a vector of $P$ auxiliary random variables for the ith unit in the kth small area and $\beta = (\beta_1, \ldots, \beta_P)'$ denotes the vector of unknown parameters. The term $e_{ki}$ denotes the random effect for the ith individual within the kth area and $\upsilon_{ki}$ is a known function of $x_{ki}$. Then the weighted estimator of the regression parameter is given as:

\[ \hat{\beta} = \left( \sum_{k=1}^{K} \sum_{i \in s_k} \frac{x_{ki} x_{ki}'}{\upsilon_{ki} \pi_{ki}} \right)^{-1} \left( \sum_{k=1}^{K} \sum_{i \in s_k} \frac{x_{ki} y_{ki}}{\upsilon_{ki} \pi_{ki}} \right). \tag{2.50} \]

Using the available information about the auxiliary variable $X$, the synthetic predictor for the kth area mean can be defined as $\hat{\bar{Y}}^{Syn}_k = \bar{X}_k'\hat{\beta}$. More detailed descriptions of the estimators above and of other synthetic estimation methods of this kind are presented in Mukhopadhyay (1998); Chapter 18.

If consistency between local survey data and the data obtained from other parts of the target population is seen as a desirable goal, it is clear that synthetic estimation has a significant advantage over alternative procedures, such as post-stratified estimation. However, the presence of homogeneity of rates or proportions across the

whole population is very rare and, hence, the relations observed in the large areas cannot easily be applied to smaller domains. Furthermore, under the synthetic approach, census data or other sources of accurate information are used to obtain estimates for the smaller areas separately. These synthetic estimates may fail to reflect the actual effects of local area factors. For instance, when data from the previous census are used as the auxiliary data, changes in the structure of the population since the census are ignored.

Synthetic estimation has both advantages and disadvantages. As the effective sample size increases under the synthetic approach, the variance of the resulting estimates decreases. This is typically the main advantage of this estimation technique. On the other hand, synthetic estimates are usually biased in practice. Using multi-level modeling to generate synthetic estimates considerably decreases the cost of a survey, but the modeling process requires significant compromises regarding the complexity of detailed analysis and calculations. These may not be so extensive when only one global estimate is being made [Twigg and Moon, 2002].

2.4 Composite Estimator

The bias of a synthetic estimator depends on the degree of homogeneity of the parameters across the target small areas. A low level of homogeneity will lead to larger biases. On the other hand, a direct estimator is typically unbiased, but has a higher variance than a synthetic estimator because of its smaller effective sample size. Composite estimation methods provide the flexibility of using both direct and synthetic estimation methods by changing the weights given to each estimate according to the level of homogeneity across the target population.
A composite estimate can be obtained by taking a weighted average of a synthetic estimator and a direct or traditional survey estimator (Datta, 2009). Composite techniques are very popular for estimating the required characteristics within different small areas. Depending on the weighting method, different composite estimators can be obtained. Composite estimation is a technique which

can balance the smaller variance of a synthetic estimator and the unbiasedness of the direct estimator. Although small sample sizes within the areas can cause large sampling errors, direct estimators are usually unbiased. On the other hand, a synthetic estimator, which can be calculated from the data obtained from more sample units within a larger area, will have a smaller variance but is usually biased. In comparison with the synthetic estimator for a small area, the composite estimator is less biased because of the weight given to the direct estimator. It is also more stable (in terms of variance) than a direct estimator, as it gives some weight to the synthetic estimator. A straightforward way to balance the potential bias of a synthetic estimator against the instability of an unbiased direct estimator is to use the composite estimator, which takes a weighted average of these two types of estimators (Rao, 2003; EURAREA Consortium, 2004).

Suppose $\hat{\bar{Y}}^D_k$ denotes the direct estimator for the kth small area. Then, considering $\hat{\bar{Y}}^D$ as the direct estimator for a bigger area, the composite estimator can be defined as:

\[ \hat{\bar{Y}}^{Comp}_k = w_k \hat{\bar{Y}}^D_k + (1 - w_k) \hat{\bar{Y}}^D; \quad 0 \le w_k \le 1. \tag{2.51} \]

If the weight allocated to the kth area is equal to zero or one, $\hat{\bar{Y}}^D$ and $\hat{\bar{Y}}^D_k$ result as the composite estimator, respectively. Rewriting the above equation as $\hat{\bar{Y}}^{Comp}_k = \hat{\bar{Y}}^D + w_k (\hat{\bar{Y}}^D_k - \hat{\bar{Y}}^D)$ shows that composite estimation methods aim to reduce the differences between the direct and synthetic estimators by using appropriate weights. As the direct estimator for the kth area is shrunk towards the direct estimator from the bigger area, the composite estimator has also been referred to as a shrinkage estimator. The main problem in deriving the composite estimator is to find the best value for $w_k$. One approach is to find the weights which minimize the MSE of the composite estimator.
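\[ \mathrm{MSE}(\hat{\bar{Y}}^{Comp}_k; \bar{Y}_k) = w_k^2\, \mathrm{Var}(\hat{\bar{Y}}^D_k) + (1 - w_k)^2\, \mathrm{MSE}(\hat{\bar{Y}}^D) + 2 w_k (1 - w_k)\, \mathrm{Cov}(\hat{\bar{Y}}^D_k, \hat{\bar{Y}}^D) \tag{2.52} \]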

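Minimizing the above expression with respect to $w_k$ gives the optimal weight:

\[ w_k^{opt} = \frac{\mathrm{MSE}(\hat{\bar{Y}}^D) - \mathrm{Cov}(\hat{\bar{Y}}^D_k, \hat{\bar{Y}}^D)}{\mathrm{Var}(\hat{\bar{Y}}^D_k) + \mathrm{MSE}(\hat{\bar{Y}}^D) - 2\,\mathrm{Cov}(\hat{\bar{Y}}^D_k, \hat{\bar{Y}}^D)}, \tag{2.53} \]

and the minimized value of the MSE is:

\[ \min\left[ \mathrm{MSE}(\hat{\bar{Y}}^{Comp}_k) \right] = \mathrm{Var}(\hat{\bar{Y}}^D_k) - \frac{\left[ \mathrm{Var}(\hat{\bar{Y}}^D_k) - \mathrm{Cov}(\hat{\bar{Y}}^D_k, \hat{\bar{Y}}^D) \right]^2}{\mathrm{Var}(\hat{\bar{Y}}^D_k) + \mathrm{MSE}(\hat{\bar{Y}}^D) - 2\,\mathrm{Cov}(\hat{\bar{Y}}^D_k, \hat{\bar{Y}}^D)}. \tag{2.54} \]

In the above equation, $\mathrm{Cov}(\hat{\bar{Y}}^D_k, \hat{\bar{Y}}^D)$ can usually be ignored (Khoshgooyanfard and Taheri Monazah, 2006). For instance, suppose the mean value for the whole target population is estimated as below:

\[ \hat{\bar{Y}}^D = \frac{\sum_{k=1}^{K} h_k \hat{\bar{Y}}^D_k}{\sum_{k=1}^{K} h_k}, \tag{2.55} \]

where $h_k$ are weights for the area means calculated from the sampling design. These weights can be calculated using the sample or population sizes of the different areas, or they can be obtained from other available resources. Then, the covariance for the kth small area can be calculated as:

\[ \mathrm{Cov}(\hat{\bar{Y}}^D_k, \hat{\bar{Y}}^D) = \frac{h_k}{\sum_{k=1}^{K} h_k}\, \mathrm{Var}(\hat{\bar{Y}}^D_k). \]

Assuming a small sample size is allocated to the kth small area, the ratio $h_k / \sum_{k} h_k$ would be close to zero and the whole covariance term can be ignored. In such a case, the optimal weight and the minimum MSE can be calculated as below [Khoshgooyanfard and Taheri Monazah, 2006]:

\[ w_k = \frac{1}{\left[ \mathrm{Var}(\hat{\bar{Y}}^D_k) / \mathrm{MSE}(\hat{\bar{Y}}^D) \right] + 1} = \frac{\mathrm{MSE}(\hat{\bar{Y}}^D)}{\mathrm{Var}(\hat{\bar{Y}}^D_k) + \mathrm{MSE}(\hat{\bar{Y}}^D)}, \]

\[ \min\left[ \mathrm{MSE}(\hat{\bar{Y}}^{Comp}_k) \right] = \mathrm{Var}(\hat{\bar{Y}}^D_k) - \frac{\left[ \mathrm{Var}(\hat{\bar{Y}}^D_k) \right]^2}{\mathrm{Var}(\hat{\bar{Y}}^D_k) + \mathrm{MSE}(\hat{\bar{Y}}^D)}, \tag{2.56} \]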
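The weight on the area-level direct estimator that minimizes the composite MSE in (2.52) can be illustrated numerically. The sketch below is hedged: the function names and the input values are hypothetical, the variance, MSE and covariance are treated as known inputs rather than estimated, and the weight is obtained by setting the derivative of the quadratic in (2.52) to zero.

```python
def composite_mse(w, var_direct, mse_pooled, cov=0.0):
    """MSE of w * direct_k + (1 - w) * pooled, following (2.52)."""
    return (w ** 2 * var_direct
            + (1 - w) ** 2 * mse_pooled
            + 2 * w * (1 - w) * cov)


def optimal_weight(var_direct, mse_pooled, cov=0.0):
    """Weight on the area-level direct estimator that minimizes (2.52).

    Obtained by differentiating the quadratic in w and solving for zero.
    """
    return (mse_pooled - cov) / (var_direct + mse_pooled - 2 * cov)


# Hypothetical values: a noisy direct estimator and a more stable pooled one.
var_direct, mse_pooled = 4.0, 1.0
w_opt = optimal_weight(var_direct, mse_pooled)
mse_min = composite_mse(w_opt, var_direct, mse_pooled)

# Coarse grid search over [0, 1] as an independent check of the minimizer.
grid = [i / 1000 for i in range(1001)]
w_grid = min(grid, key=lambda w: composite_mse(w, var_direct, mse_pooled))
print(w_opt, mse_min, w_grid)
```

In this example the covariance is set to zero, so the weight reduces to the simplified form in (2.56) and the minimized MSE is smaller than both the variance of the direct estimator and the MSE of the pooled estimator.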

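where:

\[ \mathrm{MSE}(\hat{\bar{Y}}^D) = \mathrm{MSE}(\hat{\bar{Y}}^D; \bar{Y}_k) = E(\hat{\bar{Y}}^D - \bar{Y}_k)^2 = \mathrm{Var}(\hat{\bar{Y}}^D) + (\bar{Y} - \bar{Y}_k)^2. \tag{2.57} \]

In order to calculate the optimal weights for the composite estimator, $\mathrm{Var}(\hat{\bar{Y}}^D_k)$ and $\mathrm{MSE}(\hat{\bar{Y}}^D)$ should be estimated. Usually, unbiased estimators of $\mathrm{Var}(\hat{\bar{Y}}^D_k)$ and $\mathrm{Var}(\hat{\bar{Y}}^D)$ can be obtained from the sampling design, but it is not straightforward to estimate the bias term $(\bar{Y} - \bar{Y}_k)^2$ in (2.57). In order to simplify the estimation process, an estimated value of the variance of the area means is used instead of the area-specific bias term:

\[ \mathrm{MSE}(\hat{\bar{Y}}^D) = \mathrm{Var}(\hat{\bar{Y}}^D) + \sigma_B^2, \tag{2.58} \]

where:

\[ \sigma_B^2 = \frac{\sum_{k=1}^{K} h_k (\bar{Y}_k - \bar{Y})^2}{\sum_{k=1}^{K} h_k}. \tag{2.59} \]

Note that $\sigma_B^2$ measures the variation between areas. Under a simple random effects model we have $E\left[ (\bar{Y}_k - \bar{Y})^2 \right] = \sigma_B^2$. Therefore, as the homogeneity increases, $\sigma_B^2$ gets smaller and $\mathrm{MSE}(\hat{\bar{Y}}^D)$ gets closer to the variance of $\hat{\bar{Y}}^D$. $\mathrm{Var}(\hat{\bar{Y}}^D)$ can be estimated using standard formulas appropriate for the sampling design. In order to estimate $\sigma_B^2$ from the sample data, define:

\[ S_B^2 = \sum_{k=1}^{K} h_k (\hat{\bar{Y}}^D_k - \hat{\bar{Y}}^D)^2, \]

where:

\[ E(S_B^2) = \sum_{k=1}^{K} h_k (\bar{Y}_k - \bar{Y})^2 + \sum_{k=1}^{K} h_k \left[ \mathrm{Var}(\hat{\bar{Y}}^D_k) + \mathrm{Var}(\hat{\bar{Y}}^D) - 2\,\mathrm{Cov}(\hat{\bar{Y}}^D_k, \hat{\bar{Y}}^D) \right] = \sigma_B^2 \left( \sum_{k=1}^{K} h_k \right) + \sum_{k=1}^{K} h_k \left[ \mathrm{Var}(\hat{\bar{Y}}^D_k - \hat{\bar{Y}}^D) \right]. \]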

If $\mathrm{Cov}(\hat{\bar{Y}}^D_k, \hat{\bar{Y}}^D)$ can be assumed to be close to zero, then an unbiased estimate of $\sigma_B^2$ is:

\[ \hat{\sigma}_B^2 = \frac{1}{\sum_{k=1}^{K} h_k} \left\{ S_B^2 - \sum_{k=1}^{K} h_k \left[ \mathrm{Var}(\hat{\bar{Y}}^D_k) + \mathrm{Var}(\hat{\bar{Y}}^D) \right] \right\}. \tag{2.60} \]

Now, the weights above can be calculated as:

\[ w_k = \frac{\mathrm{Var}(\hat{\bar{Y}}^D) + \hat{\sigma}_B^2}{\mathrm{Var}(\hat{\bar{Y}}^D_k) + \mathrm{Var}(\hat{\bar{Y}}^D) + \hat{\sigma}_B^2}. \tag{2.61} \]

The weights presented in (2.61) can be used for composite estimators. A detailed discussion of optimal weighting methods in composite estimation is given by Schaible (1978).

As the sample size in an area increases, the resulting estimate based on the sample data becomes more precise and has less variance. An appropriate sample size in the kth area helps to produce a reasonably precise direct estimate $\hat{\bar{Y}}^D_k$ of the area mean. As the sample size increases, the variance of this estimator decreases, and $w_k$ gets closer to one. On the other hand, if the sample size within the kth area is very small, the direct estimate of the whole population mean is more influential in the composite estimator for the kth area mean, because $w_k$ gets closer to zero in such a case. To better understand how different weights affect the accuracy of the composite estimator compared with the direct estimator, we have [Rao, 2003]:

\[ \mathrm{MSE}(\hat{\bar{Y}}^{Comp}_k; \bar{Y}_k) - \mathrm{Var}(\hat{\bar{Y}}^D_k) = w_k^2\, \mathrm{Var}(\hat{\bar{Y}}^D_k) + 2 w_k (1 - w_k)\, \mathrm{Cov}(\hat{\bar{Y}}^D_k, \hat{\bar{Y}}^D) + (1 - w_k)^2 \left[ \mathrm{Var}(\hat{\bar{Y}}^D) + (\bar{Y}_k - \bar{Y})^2 \right] - \mathrm{Var}(\hat{\bar{Y}}^D_k) \]

\[ = w_k^2 \left[ \mathrm{Var}(\hat{\bar{Y}}^D_k) - 2\,\mathrm{Cov}(\hat{\bar{Y}}^D_k, \hat{\bar{Y}}^D) + \mathrm{Var}(\hat{\bar{Y}}^D) + (\bar{Y}_k - \bar{Y})^2 \right] + w_k \left[ 2\,\mathrm{Cov}(\hat{\bar{Y}}^D_k, \hat{\bar{Y}}^D) - 2\,\mathrm{Var}(\hat{\bar{Y}}^D) - 2(\bar{Y}_k - \bar{Y})^2 \right] + \mathrm{Var}(\hat{\bar{Y}}^D) + (\bar{Y}_k - \bar{Y})^2 - \mathrm{Var}(\hat{\bar{Y}}^D_k). \tag{2.62} \]

If the expression above is set equal to zero, a quadratic equation in $w_k$ is obtained with roots $2w_k^{opt} - 1$ and one. The expression is therefore negative when $2w_k^{opt} - 1 < w_k < 1$. Hence, the composite estimator is more precise than the pure direct estimator for the kth area when $w_k \in (\max\{0,\, 2w_k^{opt} - 1\},\, 1)$, and the most precise composite estimator is the one calculated with $w_k^{opt}$. Now, assume that $w_k$ is estimated using (2.61) with a value $\sigma^2$ used in place of $\hat{\sigma}_B^2$. In this case, the composite estimator improves on the direct estimator when:

\[ (\bar{Y}_k - \bar{Y})^2 < \mathrm{Var}(\hat{\bar{Y}}^D_k) + \mathrm{Var}(\hat{\bar{Y}}^D) - 2\,\mathrm{Cov}(\hat{\bar{Y}}^D_k, \hat{\bar{Y}}^D) + 2\sigma^2 \quad \Longrightarrow \quad \frac{1}{2} \left[ (\bar{Y}_k - \bar{Y})^2 - \mathrm{Var}(\hat{\bar{Y}}^D_k - \hat{\bar{Y}}^D) \right] < \sigma^2. \tag{2.63} \]

Therefore, the composite estimates are more efficient than the direct estimates for all areas if $\sigma^2$ is bigger than $\max_k \left\{ \frac{1}{2} \left[ (\bar{Y}_k - \bar{Y})^2 - \mathrm{Var}(\hat{\bar{Y}}^D_k - \hat{\bar{Y}}^D) \right] \right\}$. Note that if $\sigma^2 \to \infty$, then $w_k \to 1$ and the composite estimator reduces to the direct estimator. For more detailed discussions of composite estimators and optimal weights, see Schaible (1978) and Rao (2003).

Empirical Bayes Method

In a Bayesian approach, prior distributions on the unknown quantities in the model are needed. The prior probability distribution for a population proportion expresses the knowledge or belief about it before the data are taken into account. The conditional distribution of the unknown parameters given the observed data can then be computed as a posterior distribution based on the prior distribution and the likelihood (Carlin and Louis, 2000). The empirical Bayes (EB) approach to statistical decision problems refers to the class of techniques which evaluate or approximate the prior distributions by using some observed data. In these methods, the parameter itself is treated as a random variable. Additionally, it is assumed in these approaches that a personal or conventional distribution of the parameter can change with experience (Robbins, 1964). Ghosh and Rao (1994) reviewed existing approaches, including EB techniques, in

small area estimation. More information about the empirical Bayes approach to small area estimation can be found in Rao (2003).

Hierarchical Bayes Method

The hierarchical Bayes (HB) procedure can be considered a topic in modern Bayesian analysis for statistical modeling. Assuming a prior distribution for the model parameters, HB methods identify the posterior distribution of the parameters of interest. Datta and Ghosh (1991) used HB methods to find small area mean estimates, and Ghosh and Rao (1994) reviewed HB techniques in small area estimation. More information can be found in Rao (2003).

Small Area Estimation Around the World

During the last four decades, the increasing demand for various types of statistics for local sub-domains has led to various SAE techniques being used in statistical agencies around the world. In this subsection, some small area estimation techniques used around the world are discussed.

SAE in Australia

The Australian Bureau of Statistics (ABS) produces a range of data for regions such as states, territories, local government areas, and other Statistical Local Areas. The Australian Standard Geographical Classification (ASGC) is a hierarchical classification system of geographical areas and consists of a number of interrelated structures. It provides a common framework of statistical geography and enables the production of comparable statistics. Australia has six states and two territories as its administrative divisions. The ASGC 1996 Census contains 1336 Statistical Local Areas (SLA), 194 Statistical Subdivisions (SSD) and 66 Statistical Divisions (SD). Estimates of social indicators in different local areas are required for local authorities to receive their budget from the Australian Government. Therefore, efforts started in the early 1970s to use various small area estimation techniques to

obtain more precise local area estimates in Australia. Since 2003, small area estimation techniques have been implemented in the ABS by the Analytical Services Unit. The sample size in many Local Government Areas (LGAs) is usually very small. Therefore, the LGA has been considered the small area in most SAE applications (Davies et al., 2009). Some examples of the use of small area estimation methods in Australia are mentioned below:

Nowosilskyj (2004) used a combination of aggregated individual income tax data from the ATO and aggregated income support customer data from the Australian Government Department of Family and Community Services (FaCS) to provide regional experimental estimates of personal income for small areas.

The Department of Education, Employment and Workplace Relations (DEEWR) in Australia has been producing the Small Area Labour Markets (SALM) estimates publication since the late 1980s (DEEWR, 2008). The SALM estimates are based on the Structure Preserving Estimation (SPREE) methodology presented by Purcell and Kish (1979), which is a synthetic approach that makes use of direct estimates (Rao, 2003; Chapter 4) [Davies et al., 2009].

A program of household surveys was first developed by the ABS in the late 1950s. The first household survey conducted by the ABS was the Australian Labour Force Survey (LFS), which was conducted in November. Initially, the LFS was conducted quarterly in February, May, August, and November, and only the non-Indigenous civilian population aged 14 years and over living in state capital cities was included in the survey. From August 1966, the civilian population aged 15 years and over was included in the LFS, and it changed to a monthly survey in February. Then, the estimates based on LFS information became the official measure of unemployment in Australia [Trewin, 2005]. The LFS sample design uses a rotation sampling method.
According to this design, around one-eighth of the sample is replaced each month. Until June 2008, the LFS was implemented under the 2006 LFS sample design with

sample individuals. From July 2008, the sample size for the LFS decreased by 24% due to financial limitations.

In 2008, a new project was started by the ABS and DEEWR to provide small area estimates from labour force survey data. In this project, a binomial logistic model with random intercepts was used. The binomial classes were Local Government Area (LGA) by 5 age groups by sex. They originally used a Maximum Likelihood (ML) estimation method, which is described in detail in the report written by Saei and Chambers (2003b). Considering the SAE approach in DEEWR, Davies et al. (2009) applied a Generalized Linear Mixed Model (GLMM) to the employment status variable in simulated data based on available LFS information from August. They investigated the quality of small area estimates of labour force status at the LGA level through their estimated relative Root Mean Squared Errors (RRMSEs). This research shows that the small area estimates produced by the SAE team in the Analytical Services Unit of the ABS are of reasonable quality. They also mentioned some important issues for further investigation in SAE applications in the ABS.

The Survey of Disability, Ageing and Carers (SDAC) was conducted by the ABS in 1998 and 2003 to capture a broad range of information on Australian people who had one or more health impairments or limitations (Australian Institute of Health and Welfare, 2005). Elazar and Conn (2004) then planned an empirical study of different small area estimation techniques to identify reliable estimates based on the 2003 Survey of Disability, Ageing and Carers and the 2001 Population Census. Using Disability Support Pension data from Centrelink as auxiliary data, they used logistic synthetic regression models with and without random effects to produce small area estimates of disability.

The ABS web site also contains useful information about different survey designs in Australia.
For more information about the different uses of small area estimation in Australia, see the manual published in 2006 by the National Statistical Centres and Client Services in Australia (Australian Bureau of Statistics, 2006).

SAE in the United States of America (USA)

The United States has 50 States and 1 district as its administrative divisions. Given the large number of divisions, small area estimation techniques are used in various federal agencies with statistical programs to estimate different social and economic indicators. Some examples are listed below:

The National Center for Health Statistics (NCHS) is part of the Centers for Disease Control and Prevention (CDC), which is part of the United States Department of Health and Human Services. This center collects and publishes vital statistics on socio-demographic and health characteristics. NCHS produces detailed statistical tabulations for each State, county and city and for some smaller areas (NCHS, 1989). As the vital records (birth and death certificates) provide only limited information about social, demographic, health and medical indicators, statistical techniques are needed to obtain acceptable levels of accuracy. Synthetic estimation, which can borrow strength from other relevant data sources, is one of the common techniques at NCHS. NCHS was one of the pioneers in using synthetic estimation techniques to generate estimates of long- and short-term physical disabilities based on the National Health Interview Survey. The center has sponsored or cosponsored three research conferences on small area estimation, including conferences in 1978 and 1984. Recently, attention has increased considerably on indirect estimation methods which can incorporate local area information and provide more accurate estimates [Schaible, 1996].

The first three National Health Surveys in the USA were the National Health Examination Surveys (NHES) I, II, and III, conducted by NCHS between 1959 and 1970 and including approximately 7500 sample individuals for each survey.
The National Health and Nutrition Examination Survey (NHANES) was first conducted by NCHS. NHANES is a complex sample survey which

combines interviews and physical examinations in order to assess the health and nutrition status of individuals living within the United States [Leavitt et al., 2005]. The original target sample size of NHANES is reported by McDowell et al. (2008). However, many sub-domains did not have a large enough sample to obtain reliable estimates from NHANES data in different time periods. Using the Fay-Herriot model, Ybarra and Lohr (2008) applied SAE methods to predict required quantities measured in the NHANES, with auxiliary information from the U.S. National Health Interview Survey (NHIS).

The United States Census Bureau is a government agency, working as a division of the United States Department of Commerce, which collects various national socio-demographic and economic data. The first U.S. national census was conducted on 2nd August. Since 1903, the United States Census Bureau has been officially responsible for conducting the decennial census in the United States in order to collect and provide required data and information for the different states [Wilson and Fischetti, 2010]. The U.S. Census Bureau provides poverty indicators for school-age groups, which are used in the Small Area Income & Poverty Estimates (SAIPE) program for school districts, counties, and states. The SAIPE program aims to provide more current estimates of selected statistical measures of income and poverty for school districts, counties, and states (Bell et al., 2007). Available data from administrative records, intercensal population estimates, and the decennial census are used together with direct estimates from the American Community Survey in order to provide the required estimates of poverty and income.

In larger states, there is high sampling variability in the estimates for some counties.
In order to estimate the child poverty rate for different counties in a U.S. state, Ghosh and Maiti (2004) proposed pseudo empirical best linear unbiased estimators of small-area means. Using available data on children in the 5-17 age group, they performed a simulation study referring to both the design for sample selection and the natural exponential family quadratic variance function models fitted to the data.

The U.S. Bureau of Economic Analysis (BEA) is an agency of the Department of Commerce and can be considered a part of the Department's Economics and Statistics Administration. BEA produces economic accounts statistics for different States and local areas that enable government and business decision-makers, researchers, and the American public to follow and understand the performance of the Nation's economy. The program started by estimating income payments by State. Through the years, some indirect estimation methods have been used to produce acceptable economic statistics. Synthetic estimation is one of the most important methods that have been used to borrow strength from other data sources. Ghosh and Maiti (1999) mentioned the annual estimates of per capita income calculated by BEA as an important data source for obtaining economic rankings of several states, based on the median income of four-person families, using SAE methods.

The purpose of the Local Area Unemployment Statistics (LAUS) program in the United States is to develop monthly employment and unemployment estimates for approximately 7300 areas in the U.S. (Rahman, 2008). Using the results from LAUS, the Bureau of Labor Statistics (BLS) of the United States Department of Labor is responsible for estimating the change in the number of persons employed as a key indicator of economic performance. Current and historical data from the Current Population Survey (CPS), the Employment Statistics (ES) program, and State unemployment insurance (UI) systems are used as information resources in LAUS.
Model-based estimation techniques are used in order to develop reliable unemployment estimates based on available resources on historical and current relationships within each sub-national division.

There are other statistical agencies and institutes in the United States which may use small area estimation methods depending on their requirements. For example, the Substance Abuse and Mental Health Services Administration produces estimates of substance abuse in states and metropolitan areas in the U.S. Crop yields for required areas are estimated in the National Agricultural Statistics Service's County Estimates Program using SAE methods (Hughes et al., 2008; USDA, 2007; Rahman, 2008). For more information about different statistical agencies and various indirect estimators in U.S. federal programmes, see the book published by Schaible (1996).

SAE in Canada

Statistics Canada is the Canadian federal government department commissioned with meeting the statistical requirements of Canada. In addition to conducting about 350 active surveys on virtually all aspects of Canadian life, Statistics Canada undertakes a country-wide census every five years, in the first and sixth year of each decade. Different small area estimation techniques are employed at Statistics Canada in order to estimate health statistics, average weekly earnings, under-coverage in the census, and unemployment rates. A brief summary of existing procedures for producing these official estimates has been given by Hidiroglou (2007). A description of two small area estimates produced at Statistics Canada is given below.

The Canadian census records are used for various purposes. A variety of methods, including small area estimation techniques, are used for under-coverage estimation in the census. The Census in Canada provides accurate population estimates by age and sex within each province and territory. In principle, the census should collect data from all residents in all dwellings. Unfortunately, some eligible persons are not enumerated correctly by the census.
Therefore, Statistics Canada conducts the Reverse Record Check (RRC), a sample survey which estimates the net number of people missed by the Census. Although estimates of missed persons are required by single year of age for both sexes

within each province and territory, the RRC can only produce reliable direct estimates for large domains such as sex-age combinations at the national level. Therefore, some small area estimation methods are used as a part of the evaluation of the census in order to model the small domain estimates with smaller mean square error. Through these methods, the required estimates are produced in a two-step procedure. Firstly, the estimate for each broad age group within a province is created by an empirical Bayes regression model. Secondly, the estimates for single years of age within the broad age groups are generated by a synthetic method. Finally, the raking ratio procedure is applied to the detailed estimates [Gambino and Dick, 2000].

Another estimation approach is used at Statistics Canada in order to produce small area estimates for employment, payrolls and hours through the Labour Force Survey (LFS), which provides estimates of employment and unemployment. These estimates can be considered among the most important measures of the performance of the Canadian economy. The LFS is a monthly survey which publishes labour force estimates by industry, occupation, age-sex group and geographical area. Data on wages, hours of work and union membership are also gathered through this survey (Gambino and Dick, 2000). More detailed descriptions of the survey design, methodology and different procedures used in the LFS are given by Beaumont et al. (2008). Taking the Canadian Labour Force Survey as an example of a survey utilizing a clustered sample design, Drew et al. (1982) reviewed the Sample Size Dependent (SSD) estimator, which has been used at Statistics Canada for about twenty years. They empirically evaluated some alternative small area estimation techniques including synthetic, post-stratified domain and composite estimators.

Unemployment information is known as a key indicator of the pace of local economic growth.
Therefore, reliable estimates of the unemployment rate are required in Canada at sub-national levels such as Census Metropolitan Areas (CMAs) and Census Agglomerations (CAs). However, direct estimates are not

sufficiently reliable when the sample size is too low in a target area (You et al., 2003). You (2008) discussed various models, including the Fay-Herriot model and cross-sectional and time series models, in order to estimate the unemployment rates for local sub-provincial areas such as Census Metropolitan Areas (CMAs) and Urban Centers (UCs). An integrated non-linear mixed effects model has been proposed by You (2008) under the hierarchical Bayes (HB) framework for the LFS unemployment rate estimation.

Smoking is the leading preventable cause of death. Pickett et al. (2000) used available data from the 1994/95 National Population Health Survey (NPHS) and the 1996 Census of Canada to study youth smoking behaviors at the provincial level in Canada. Stratum-specific information about working life and health behavior, such as disability, cause-specific mortality and unemployment, is used in their study. A regression approach is then used to calculate synthetic predictors for the occurrence of the smoking indicators among youth (15-24 years).

Life expectancy at birth is often used as an indicator of human health and environment in a given population. Fines (2006) used a log-linear model in order to estimate life expectancy at birth in small cities based on available information from the Winnipeg Census Metropolitan Area (CMA).

SAE in Europe

The Small Area Methods for Poverty and Living Condition Estimates (SAMPLE) project is a research program funded by the European Commission, which is the executive body of the European Union. Information from local surveys and other data resources is used in the SAMPLE project to conduct research for identifying and developing new indicators and models for poverty and inequality at the national and sub-national levels. This project started in March 2008. Molina (2009) used Fay-Herriot model Empirical Best Linear Unbiased Predictors (EBLUPs) for poverty indicators in Spanish provinces. Quintano et al. (2007) used SAE methods for estimating poverty rates in Italy.

In 2004, the Office for National Statistics (ONS) in the UK participated in a collaborative research project funded by Eurostat. The project was named EURAREA. People from six European countries, spread across twelve National Statistics Institutes and universities, were involved in this major project. The aim of the EURAREA project was to provide efficient SAE methods for population characteristics in different sub-national domains within Europe. The results were published in EURAREA Consortium (2004). A review of the theoretical components in EURAREA can be found in Saei and Chambers (2003b).

Using model-based methods, ONS estimates the unemployment rate for Local Authority Districts and Unitary Authorities. A number of different approaches are utilized to produce improved unemployment estimates for various small areas. The main approach developed at ONS is based on regression models linking the unemployment estimates to the number of Jobseeker's Allowance benefit claimants. Longford (2004) applied SAE methods to available data from the Labour Force Survey in the UK in order to estimate local area unemployment rates and economic inactivity. His study made no distributional assumption and relied only on the high correlation between the unemployment rates and the rates of claiming unemployment benefits in the UK.

SAE in Iran

The Statistical Center of Iran (SCI) is responsible for studying and identifying the statistical requirements of Iran and for developing the statistical standards of the country. Many surveys are conducted each year by SCI to permit statistical analysis of available social, economic and industrial information in Iran. The Statistical Research and Training Center (SRTC) was established in 1999 to provide statistical research in

SCI. Research studies funded by SRTC aim to improve the quality of survey designs and statistical analysis based on the most appropriate methods and recent findings. The Survey of Households Employment and Unemployment Characteristics in Iran is one of many surveys conducted by SCI; it provides information about economic growth and labour force conditions within 31 provinces. Industrial budget allocation to different urban and rural areas in Iran is based on the unemployment rates estimated by SCI. SRTC has funded several projects in order to improve the estimation techniques applied in this survey. Using available data from 1996 about the labour force in Iran, Khoshgooyanfard and Taheri Monazah (2006) compared synthetic and composite methods for producing model-based indirect provincial estimates. Khoshgooyanfard et al. (2007) presented an empirical Bayes estimation approach accounting for the rotation pattern used in the Survey of Households Employment and Unemployment Characteristics in Iran. As a part of another project funded by SRTC, Towhidi and Namazi-Rad (2010) studied available methods for calculating the error in estimating the difference between unemployment rates in two consecutive time periods. Considering different rotation patterns in the sampling design, they also presented an alternative estimation method for different linear combinations of the area means in dissimilar sample periods, based on the linear additive model with dependent errors.

SAE in Korea

The Economically Active Population Survey (EAPS) has been conducted by the Korea National Statistical Office (KNSO) to collect information on changes in the activity pattern of the labor force and to produce unemployment statistics for metropolitan areas and provincial levels.
The main target of the monthly survey is people aged over 15 who are residents of 30,000 sample households, from whom information on the economic status of the population is collected. The small areas (local self-government areas, LSGAs) are not reflected in the sample design, and the sample data are collected through 15 large areas. With the use of composite estimation and hierarchical Bayes estimation, Chung et al. (2003) suggested some adjustment methods for the small area estimates of the unemployment statistics through the

area-specific model and secured the reliability of the estimates.

2.8 Summary

There has been rapidly growing demand for small area statistics, with increasing concern about local distributions in critical policy making. In particular, producing reliable estimates of the required characteristics and indicators within the target sub-national domains is the main challenge in statistical surveys. Several direct estimation methods are discussed in this chapter. The resulting direct estimate for each area is based only on the sample information available for that area. In large national surveys, the effective sample sizes are usually not large enough to calculate a valid direct estimate for every target sub-domain, so the resulting direct estimates are not reliable for such domains. SAE techniques are developed in order to tackle the problem of producing reliable estimates within areas or sub-domains in which the effective sample size is small. The key idea is to use other relevant information to improve the direct area-specific estimates. Several SAE techniques are presented in this chapter, and examples of using such SAE methods around the world are briefly discussed. In the next chapter, we focus on certain model-based indirect estimates in order to provide reliable small area statistics.

Chapter 3 SAE based on Linear Mixed Models

3.1 Introduction

A wide variety of estimation methods have been developed to handle Small Area Estimation (SAE) problems. Initially, demographic and design-based methods were used, but more sophisticated model-based methods have been increasingly employed over the last two decades (Khoshgooyanfard and Taheri Monazah, 2006). Model-based methods are more flexible than other estimation methods in using auxiliary variables to improve the quality of indirect estimates. It is also straightforward to estimate the MSE of a model-based estimator, while this is not the case for design-based estimators. Additionally, model assumptions are explicit and, usually, there are statistical methods available to examine whether the assumptions are satisfied. This benefit may be considered a limitation as well, because a statistical model may not be valid if its assumptions are not satisfied by the data, leading to biased or inefficient estimators. In this chapter, we introduce a general definition of the Linear Mixed Model (LMM) for SAE purposes at different levels. Model-based estimation techniques are then evaluated when area means are the main targets of inference.

3.2 Linear Mixed Models in SAE

Statistical models utilized for SAE purposes can be unit-level or area-level. Area-level models can be used when the auxiliary data are available only at the area level. If the auxiliary data are available for individuals, the required information can be used for each unit to derive a unit-level model. In this case, the area-level model can also be derived by aggregating (averaging) techniques. Note that, after aggregating the data within the areas, some information such as the within-area variation can no longer be studied, or its study has to rely on some unstable assumptions (Longford, 2005).

Unit-level Modeling

We consider a population of size $N$ divided into $K$ small areas with $N_k$ individuals in the $k$th small area ($N = \sum_{k=1}^{K} N_k$). A unit-level LMM consists of a population model formulated at the individual level, which relates the unit values of the study variable to unit-specific auxiliary variables and includes both fixed and random effects:

$$Y_{ik} = X_{ik}'\beta + u_k + e_{ik}; \quad i = 1,\dots,N_k \ \&\ k = 1,\dots,K, \qquad u_k \overset{iid}{\sim} N(0,\sigma_u^2); \quad e_{ik} \overset{iid}{\sim} N(0,\sigma_e^2), \tag{3.1}$$

where $X_{ik}$ is the vector of $P$ auxiliary variables (preceded by a constant term) for the $i$th unit within the $k$th area:

$$X_{ik} = [\,1 \ \ X_{ik1} \ \dots \ X_{ikP}\,]'. \tag{3.2}$$

Here, $\beta = [\,\beta_0 \ \beta_1 \ \dots \ \beta_P\,]'$ denotes the vector of unknown regression parameters. The random effect for the $k$th area is $u_k$, and $e_{ik}$ denotes the random error for the $i$th individual within the $k$th area. The random effects and random errors are independently distributed in the model. Later in this thesis, the term Population Model 1 ($P_1$) is used for the model presented in (3.1) for evaluation purposes. Model (3.1) can also be presented in matrix form as follows:

Unit-Level Population Model:

$$Y = X\beta + Zu + e, \qquad u \sim N(0, \sigma_u^2 I_K); \quad e \sim N(0, \sigma_e^2 I_N). \tag{3.3}$$

Here, $Y$ and $e$ are column vectors with $N$ elements, $X$ is an $N \times (P+1)$ dimensional matrix, and $u$ is a column vector with $K$ elements. In model (3.3), $Z$ is an $N \times K$ dimensional matrix of 1s and 0s which assigns the same value $u_k$ to all the rows referring to the units within the $k$th area. Note that matrices are printed in bold in this thesis.

$$X = \begin{bmatrix} X_{11}' \\ \vdots \\ X_{N_1 1}' \\ X_{12}' \\ \vdots \\ X_{N_2 2}' \\ \vdots \\ X_{1K}' \\ \vdots \\ X_{N_K K}' \end{bmatrix}, \qquad Z = \begin{bmatrix} 1_{N_1} & 0 & \cdots & 0 \\ 0 & 1_{N_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1_{N_K} \end{bmatrix} \tag{3.4}$$

A basic unit-level population model assumes that the variable of interest for the $i$th population unit in the $k$th small area ($Y_{ik}$) is related to the auxiliary variables for the same population unit ($X_{ik}$) through a nested error regression model (Rao, 2002). Battese et al. (1988) used a nested error regression model to predict areas under corn and soybeans in 12 Iowa counties using Landsat satellite data in conjunction with survey data. Using the satellite images, auxiliary information was available in the form of pixels identified as corn and soybeans (You and Rao, 2003; Pfeffermann, 2002).

In model (3.3), two different units in the same area are assumed to be correlated [Longford, 1993; Chapter 2]:

$$Var(Y_{ik}) = \sigma_u^2 + \sigma_e^2 \quad \& \quad Cov(Y_{ik}, Y_{i'k}) = \sigma_u^2, \quad i \neq i'. \tag{3.5}$$

Longford (1993) has used the terms elementary-level and cluster-level variance components for $\sigma_e^2$ and $\sigma_u^2$, respectively. The within-cluster correlation, which is the correlation of two different units in the same area, is:

$$\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}. \tag{3.6}$$

Longford (1993) introduced the variance component ratio as follows:

$$\lambda = \frac{\sigma_u^2}{\sigma_e^2}. \tag{3.7}$$

Model (3.3) is a population model by definition and can be fitted to population data, which are not usually available in practice. Sample surveys are commonly used in order to gather relevant information about the target population. The sample $s$ of size $n$ is assumed to be selected from the population $U$. The part of the whole sample $s$ which falls into the $k$th area is $s_k = s \cap U_k$ and is of size $n_k$. It is often the case that reliable direct estimates cannot be obtained from the available sample data due to small sample sizes in all or some of the areas. In order to calculate model-based estimators, a model should be developed to specify the relationship between the auxiliary information and the variable of interest based on the available sample data. In this thesis, the term working model is used for the statistical model fitted using the sample data, assuming a correct model for the population. Note that the assumed population model may not be correct in practice. Assuming (3.3) to be the actual population model, the unit-level working model which can be fitted to individual-level sample data is given as:

$$y = x\beta + zu + e, \qquad u \sim N(0, \sigma_u^2 I_K); \quad e \sim N(0, \sigma_e^2 I_n). \tag{3.8}$$

Note that lowercase and uppercase letters refer to sample and population statistics (or values), respectively. The vector $y$ contains the sample values of the required variable, and $x$ denotes the matrix of auxiliary data values for the individuals falling into the sample. The corresponding data for $s_k$ are $y_k$ and $x_k$, respectively.
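The variance structure in (3.5) to (3.7) can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the thesis: the function names, the standard-normal auxiliary variable and the parameter values are illustrative assumptions. It simulates model (3.1) with many areas of two units each and compares the empirical correlation of the two model errors in an area with $\rho$.

```python
import numpy as np

def simulate_unit_level(area_sizes, beta, sigma2_u, sigma2_e, rng):
    """Draw Y_ik = X_ik' beta + u_k + e_ik for every unit, as in model (3.1)."""
    area_sizes = np.asarray(area_sizes)
    beta = np.asarray(beta, dtype=float)
    area = np.repeat(np.arange(area_sizes.size), area_sizes)  # area label k of each unit
    n_units = area.size
    # Intercept plus standard-normal auxiliary variables (an illustrative assumption).
    X = np.column_stack([np.ones(n_units), rng.normal(size=(n_units, beta.size - 1))])
    u = rng.normal(0.0, np.sqrt(sigma2_u), size=area_sizes.size)  # area effects u_k
    e = rng.normal(0.0, np.sqrt(sigma2_e), size=n_units)          # unit errors e_ik
    return X @ beta + u[area] + e, X, area

def within_area_correlation(sigma2_u, sigma2_e):
    """Within-cluster correlation rho of (3.6)."""
    return sigma2_u / (sigma2_u + sigma2_e)

def variance_ratio(sigma2_u, sigma2_e):
    """Variance component ratio lambda of (3.7)."""
    return sigma2_u / sigma2_e

rng = np.random.default_rng(0)
true_beta = np.array([1.0, 2.0])
Y, X, area = simulate_unit_level([2] * 20000, true_beta, sigma2_u=1.0, sigma2_e=3.0, rng=rng)
pairs = (Y - X @ true_beta).reshape(-1, 2)            # u_k + e_ik for the two units of each area
print(np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1])     # empirical value, close to rho
print(within_area_correlation(1.0, 3.0))               # theoretical rho from (3.6)
```

With 20000 simulated areas the empirical correlation should lie near the theoretical value, which illustrates that the shared area effect $u_k$ is the only source of dependence between units in this model.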

We assume that the sampling scheme used is uninformative. Therefore, the same model can be used for the sample and the population at the individual level.

$$x = \begin{bmatrix} x_{11}' \\ \vdots \\ x_{n_1 1}' \\ x_{12}' \\ \vdots \\ x_{n_2 2}' \\ \vdots \\ x_{1K}' \\ \vdots \\ x_{n_K K}' \end{bmatrix}, \qquad x_{ik} = [\,1 \ \ x_{ik1} \ \ x_{ik2} \ \dots \ x_{ikP}\,]' \tag{3.9}$$

Usually the model parameter estimates are calculated using the information obtained from the sample survey. In order to define the Maximum Likelihood (ML) technique for a simple random sample design, $L(y; \beta, \sigma_u^2, \sigma_e^2)$ is assumed to be the twice differentiable probability density function of $y$:

$$L(y; \beta, \sigma_u^2, \sigma_e^2) = c\,|\Sigma|^{-\frac12} \exp\left[-\tfrac12 (y - x\beta)'\Sigma^{-1}(y - x\beta)\right]. \tag{3.10}$$

Here, $c$ is a constant and $\Sigma$ is the block-diagonal variance-covariance matrix:

$$\Sigma = diag\left(\sigma_u^2 1_{n_1}1_{n_1}' + \sigma_e^2 I_{n_1}, \dots, \sigma_u^2 1_{n_K}1_{n_K}' + \sigma_e^2 I_{n_K}\right) = \sigma_u^2\, diag(J_{n_k}) + \sigma_e^2 I_n, \quad \text{where } J_{n_k} = 1_{n_k}1_{n_k}'. \tag{3.11}$$

Now, let $l(\beta, \sigma_u^2, \sigma_e^2; y)$ be the log-likelihood function:

$$l(\beta, \sigma_u^2, \sigma_e^2; y) = \ln\left[L(y; \beta, \sigma_u^2, \sigma_e^2)\right] = \ln(c) - \tfrac12 \ln|\Sigma| - \tfrac12 (y - x\beta)'\Sigma^{-1}(y - x\beta) = \ln(c) - \tfrac12 \sum_{k=1}^{K} \ln|\Sigma_k| - \tfrac12 \sum_{k=1}^{K} \varsigma_k'\Sigma_k^{-1}\varsigma_k \tag{3.12}$$

where:

$$\Sigma_k = \sigma_u^2 1_{n_k}1_{n_k}' + \sigma_e^2 I_{n_k}; \qquad \Sigma_k^{-1} = \sigma_e^{-2}\left(I_{n_k} - \frac{\gamma_k}{n_k} 1_{n_k}1_{n_k}'\right) \tag{3.13}$$

and,

$$\gamma_k = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2/n_k} \quad \& \quad \varsigma_k = (y_k - x_k\beta). \tag{3.14}$$

The ML estimates are then calculated by maximizing the right-hand side of this log-likelihood equation (Ruppert et al., 2003). Assuming $\sigma_u$ and $\sigma_e$ to be known, the ML estimator for $\beta$ is:

$$\hat\beta^U = (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}y \tag{3.15}$$

where $\hat\beta^U$ denotes the ML estimate of the parameter vector $\beta$ using the unit-level sample data. We will subsequently discuss the Best Linear Unbiased Estimator (BLUE) for $\beta$ in the same format as shown in (3.15). Calculating parameter estimates is more challenging when we drop the unrealistic assumption that the variance components are already known. On substituting $\hat\beta^U$ into the log-likelihood expression, the profile log-likelihood function for $(\sigma_e^2, \sigma_u^2)$ is obtained as:

$$l_P(\sigma_u^2, \sigma_e^2) = \ln(c) - \tfrac12 \ln|\Sigma| - \tfrac12\, y'\Sigma^{-1}\left[I - x(x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\right]y. \tag{3.16}$$

As there is no closed-form solution for maximizing the profile likelihood over $(\sigma_e^2, \sigma_u^2)$, numerical methods have been developed. The Fisher scoring algorithm is a form of Newton's method commonly used to find maximum likelihood parameter estimates in mixed models (Osborne, 1992). The parameters $\beta$, $\sigma_e^2$ and $\sigma_u^2$ can be estimated by the Fisher scoring algorithm as a general method for finding ML estimates (Longford, 1987; Longford, 2001). The Fisher scoring and Gauss-Newton methods have been mentioned in the statistical literature as two useful numerical techniques for solving ML equations (Lange, 2004; Ruppert, 2005). Detailed descriptions of these methods can be found in a paper published by Wang (2007). However, mixed model packages often use Restricted Maximum Likelihood (REML) estimation techniques in order to maximize the restricted log-likelihood expression

and estimate the variance parameters:

$$l_R(\sigma_u^2, \sigma_e^2) = l_P(\sigma_u^2, \sigma_e^2) - \tfrac12 \log|x'\Sigma^{-1}x|. \tag{3.17}$$

The additional term in the restricted log-likelihood expression ($l_R$) is based on contrast arguments that account for the estimation of $\beta$ (McCulloch et al., 2008). Detailed discussions of different methods of estimating model parameters can be found in Searle et al. (1992), Chapter 6, and Ruppert et al. (2003), Chapter 4. ML and REML are the most common strategies for calculating model parameter estimates. Here, an estimation technique is presented using the Fisher scoring algorithm for ML estimation. Longford (1993) defined the Fisher scoring algorithm for estimating the value of the parameter $\theta$ as:

$$\theta^{(t+1)} = \theta^{(t)} + I^{-1}(\theta^{(t)})\, S(\theta^{(t)}) \tag{3.18}$$

where:

$$I(\theta^{*}) = -E\left(\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\right)\bigg|_{\theta=\theta^{*}} \quad \& \quad S(\theta^{*}) = \frac{\partial l}{\partial\theta}\bigg|_{\theta=\theta^{*}}. \tag{3.19}$$

The notations $(t)$ and $(t+1)$ denote the previous and new estimated values of the parameters, respectively. The score vector and the Fisher information matrix at the point $\theta^{*} = (\beta, \sigma_u^2, \sigma_e^2)$ for (3.12) are, respectively [EURAREA Consortium, 2004]:

$$S(\theta^{*}) = \left(\frac{\partial l}{\partial\beta}, \frac{\partial l}{\partial\sigma_u^2}, \frac{\partial l}{\partial\sigma_e^2}\right)' = \left(S_\beta, S_{\sigma_u^2}, S_{\sigma_e^2}\right)' \tag{3.20}$$

and,

$$I(\theta^{*}) = -E\left(\frac{\partial S(\theta^{*})}{\partial\theta}\right) = \begin{bmatrix} F_{\beta\beta} & F_{\beta\sigma_u^2} & F_{\beta\sigma_e^2} \\ F_{\beta\sigma_u^2}' & F_{\sigma_u^2\sigma_u^2} & F_{\sigma_u^2\sigma_e^2} \\ F_{\beta\sigma_e^2}' & F_{\sigma_e^2\sigma_u^2} & F_{\sigma_e^2\sigma_e^2} \end{bmatrix}, \tag{3.21}$$

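where the elements of the score vector are:

$$S_\beta = \frac{1}{\sigma_e^2}\sum_{k=1}^{K}\left(x_k'\varsigma_k - \frac{\gamma_k}{n_k}\, x_k' 1_{n_k}1_{n_k}'\varsigma_k\right) = x'\Sigma^{-1}(y - x\beta)$$

$$S_{\sigma_u^2} = -\frac12\sum_{k=1}^{K}\frac{n_k(1-\gamma_k)}{\sigma_e^2} + \frac12\sum_{k=1}^{K}\left(\frac{1-\gamma_k}{\sigma_e^2}\right)^2 \varsigma_k' 1_{n_k}1_{n_k}'\varsigma_k \tag{3.22}$$

$$S_{\sigma_e^2} = -\frac12\sum_{k=1}^{K}\frac{n_k-\gamma_k}{\sigma_e^2} + \frac{1}{2(\sigma_e^2)^2}\sum_{k=1}^{K}\left[\varsigma_k'\varsigma_k + \frac{\gamma_k(\gamma_k-2)}{n_k}\,\varsigma_k' 1_{n_k}1_{n_k}'\varsigma_k\right].$$

The elements of the Fisher information matrix can also be calculated through the equations below:

$$F_{\beta\beta} = \frac{1}{\sigma_e^2}\sum_{k=1}^{K}\left(x_k'x_k - \frac{\gamma_k}{n_k}\, x_k' 1_{n_k}1_{n_k}' x_k\right), \qquad F_{\sigma_u^2\sigma_u^2} = \frac12\sum_{k=1}^{K}\left(\frac{n_k(1-\gamma_k)}{\sigma_e^2}\right)^2,$$

$$F_{\sigma_u^2\sigma_e^2} = \frac12\sum_{k=1}^{K} n_k\left(\frac{1-\gamma_k}{\sigma_e^2}\right)^2, \qquad F_{\sigma_e^2\sigma_e^2} = \frac{1}{2(\sigma_e^2)^2}\sum_{k=1}^{K}\left[n_k + \gamma_k(\gamma_k-2)\right], \tag{3.23}$$

and $F_{\beta\sigma_u^2} = F_{\beta\sigma_e^2} = 0$, where:

$$x_k = \begin{bmatrix} x_{1k}' \\ x_{2k}' \\ \vdots \\ x_{n_k k}' \end{bmatrix}. \tag{3.24}$$

Simpler elements, stability and regularity of the results are mentioned in the statistical literature as considerable advantages of Fisher scoring over the Newton-Raphson algorithm. The Fisher scoring function for the unit-level regression parameter vector $\beta$ follows from $S_\beta$ and $F_{\beta\beta}$:

$$\hat\beta^{(t+1)} = \hat\beta^{(t)} + (x'\Sigma^{-1}x)^{-1}\, x'\Sigma^{-1}(y - x\hat\beta^{(t)}). \tag{3.25}$$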
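Because the log-likelihood is quadratic in $\beta$ when the variance components are held fixed, a single scoring step of the form (3.25) lands on the estimator (3.15) from any starting value. The sketch below, in Python with NumPy, illustrates this using the closed-form inverse in (3.13). It is an illustration under stated assumptions rather than the thesis's implementation: the function names are hypothetical and the variance components are treated as known.

```python
import numpy as np

def sigma_k_inverse(n_k, sigma2_u, sigma2_e):
    """Closed-form inverse of Sigma_k = sigma2_u * 1 1' + sigma2_e * I, as in (3.13)."""
    gamma_k = sigma2_u / (sigma2_u + sigma2_e / n_k)               # gamma_k of (3.14)
    return (np.eye(n_k) - (gamma_k / n_k) * np.ones((n_k, n_k))) / sigma2_e

def score_and_information_beta(y, x, area, sigma2_u, sigma2_e, beta):
    """Accumulate F_bb = x' Sigma^{-1} x and S_b = x' Sigma^{-1} (y - x beta) area by area."""
    p = x.shape[1]
    information, score = np.zeros((p, p)), np.zeros(p)
    for k in np.unique(area):
        rows = area == k
        v_inv = sigma_k_inverse(int(rows.sum()), sigma2_u, sigma2_e)
        information += x[rows].T @ v_inv @ x[rows]
        score += x[rows].T @ v_inv @ (y[rows] - x[rows] @ beta)
    return score, information

def scoring_step_beta(y, x, area, sigma2_u, sigma2_e, beta):
    """One Fisher scoring update for beta with the variance components held fixed."""
    score, information = score_and_information_beta(y, x, area, sigma2_u, sigma2_e, beta)
    return beta + np.linalg.solve(information, score)
```

Calling `scoring_step_beta` from two different starting vectors with the same variance components should return the same value, which is why the iterations in practice are driven by the updates of the variance parameters rather than by $\beta$.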

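Using the variance component ratio introduced in (3.7), the estimated value of this parameter can be calculated numerically, as below [Longford, 1993; p.108]:

$$\frac{\partial l(\theta^{*}; y)}{\partial\lambda} = -\frac12\sum_{k=1}^{K} 1_{n_k}'W_k^{-1}1_{n_k} + \frac{1}{2\sigma_e^2}\sum_{k=1}^{K}\left(\varsigma_k'W_k^{-1}1_{n_k}\right)^2 \tag{3.26}$$

and,

$$-E\left(\frac{\partial^2 l(\theta^{*}; y)}{\partial\lambda^2}\right) = \frac12\sum_{k=1}^{K}\left(1_{n_k}'W_k^{-1}1_{n_k}\right)^2 = \frac12\sum_{k=1}^{K}\left(\vartheta_k^{-1}\,1_{n_k}'1_{n_k}\right)^2, \qquad E\left(\frac{\partial^2 l(\theta^{*}; y)}{\partial\beta\,\partial\lambda}\right) = 0 \ \text{ since } E(e_{ik}) = 0, \tag{3.27}$$

where $\theta^{*} = (\beta, \sigma_u^2, \sigma_e^2)$, $\vartheta_k = 1 + n_k\lambda$ and,

$$W = \sigma_e^{-2}\Sigma; \quad W_k = \sigma_e^{-2}\left(\sigma_u^2 1_{n_k}1_{n_k}' + \sigma_e^2 I_{n_k}\right) = \lambda 1_{n_k}1_{n_k}' + I_{n_k}; \qquad W^{-1} = \sigma_e^2\Sigma^{-1}; \quad W_k^{-1} = -\frac{\sigma_u^2}{\sigma_e^2 + n_k\sigma_u^2}\, 1_{n_k}1_{n_k}' + I_{n_k}. \tag{3.28}$$

More details of this algorithm can be found in Appendix A.3.

Then, given estimates $\hat\beta^U_{(t)}$ and $\hat\sigma^2_{e(t)}$ of $\beta$ and $\sigma_e^2$, respectively, the estimated value of the required parameter $\lambda$ can be calculated as follows:

$$\hat\lambda_{(t+1)} = \hat\lambda_{(t)} + \left(\frac12\sum_{k=1}^{K}\left(\vartheta_{k(t)}^{-1}1_{n_k}'1_{n_k}\right)^2\right)^{-1}\left[-\frac12\sum_{k=1}^{K}\left(\vartheta_{k(t)}^{-1}1_{n_k}'1_{n_k}\right) + \frac{1}{2\hat\sigma^2_{e(t)}}\sum_{k=1}^{K}\left(\vartheta_{k(t)}^{-1}\hat\varsigma_{k(t)}'1_{n_k}\right)^2\right]$$

$$= \hat\lambda_{(t)} + \left(\frac12\sum_{k=1}^{K}\frac{n_k^2}{\vartheta_{k(t)}^2}\right)^{-1}\left[-\frac12\sum_{k=1}^{K}\frac{n_k}{\vartheta_{k(t)}} + \frac{1}{2\hat\sigma^2_{e(t)}}\sum_{k=1}^{K}\left(\vartheta_{k(t)}^{-1}\hat\varsigma_{k(t)}'1_{n_k}\right)^2\right], \tag{3.29}$$

where $\vartheta_{k(t)} = 1 + n_k\hat\lambda_{(t)}$, and $\hat\varsigma_{k(t)} = y_k - x_k\hat\beta^U_{(t)}$.

Given the estimates of $\beta$ and $\sigma_e^2$, the sample data only affect the calculation in equation (3.29) through $\hat\varsigma_{k(t)}'1_{n_k} = n_k(\bar y_k - \bar x_k'\hat\beta^U_{(t)})$, which are the area-level residuals scaled by the area sample sizes. To use the Fisher scoring algorithm for a unit-level linear mixed model, the consecutive steps are as follows. Firstly, the parameter $\beta$ is estimated by the Ordinary Least Squares (OLS) method. The preliminary estimated value for $\beta$ is then given by: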

$$\hat\beta^U_{(1)} = (x'x)^{-1}x'y,$$

where there is no need to estimate $\Sigma$. Using this preliminary value, the individual-level residuals can be calculated:

$$\hat e_{(1)} = y - x\hat\beta^U_{(1)}.$$

With the use of these residuals, $\sigma_e^2$ can be estimated as below:

$$\hat\sigma^2_{e(1)} = \frac{1}{n-p}\,(y - x\hat\beta^U_{(1)})'(y - x\hat\beta^U_{(1)}).$$

Suppose $\lambda_{(t)} = \sigma^2_{u(t)}/\sigma^2_{e(t)}$; then a new estimated value for $\lambda$ can be calculated through the equation given below. Note that $\hat\sigma^2_{e(1)}/1000$ is taken to be the preliminary value for $\hat\sigma_u^2$. Therefore, $\lambda_{(1)} = \frac{1}{1000}$, and the scoring function for calculating further values of this parameter is given as:

$$\hat\lambda_{(t+1)} = \hat\lambda_{(t)} + \left(\frac12\sum_{k=1}^{K}\frac{n_k^2}{\vartheta_{k(t)}^2}\right)^{-1}\left[-\frac12\sum_{k=1}^{K}\frac{n_k}{\vartheta_{k(t)}} + \frac{1}{2\hat\sigma^2_{e(t)}}\sum_{k=1}^{K}\left(\vartheta_{k(t)}^{-1}\hat\varsigma_{k(t)}'1_{n_k}\right)^2\right].$$

Then, $\hat\sigma^2_{u(t+1)} = \hat\lambda_{(t+1)}\hat\sigma^2_{e(t)}$. Using $\hat\sigma^2_{u(t+1)}$ and $\hat\sigma^2_{e(t)}$, $\hat\Sigma_{(t+1)}$ can be derived from the equation below:

$$\hat\Sigma_{(t+1)} = \hat\sigma^2_{u(t+1)}\, diag(J_{n_k}) + \hat\sigma^2_{e(t)} I_n, \tag{3.30}$$

and $\hat\beta^U_{(t+1)}$ can be estimated by:

$$\hat\beta^U_{(t+1)} = (x'\hat\Sigma^{-1}_{(t+1)}x)^{-1}x'\hat\Sigma^{-1}_{(t+1)}y. \tag{3.31}$$

Then, $\hat\sigma^2_{e(t+1)}$ can be calculated by:

$$\hat\sigma^2_{e(t+1)} = \frac{1}{n}\,(y - x\hat\beta^U_{(t+1)})'\,\hat W^{-1}_{(t+1)}\,(y - x\hat\beta^U_{(t+1)}),$$

where:

$$\hat W^{-1}_{(t+1)} = diag\left(\hat W^{-1}_{k(t+1)}\right); \qquad \hat W^{-1}_{k(t+1)} = -\frac{\hat\sigma^2_{u(t+1)}}{\hat\sigma^2_{e(t)} + n_k\hat\sigma^2_{u(t+1)}}\, 1_{n_k}1_{n_k}' + I_{n_k}. \tag{3.32}$$

The steps are repeated until the differences between consecutive iterations are negligible and the estimates have converged. Models of this type can be fitted in statistical software such as S, S-Plus and R using the lme function. A detailed theoretical discussion of this function is presented in Pinheiro and Bates (2000).
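The steps above can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the thesis's code and not the algorithm used inside lme: the function names, the convergence tolerance and the truncation of $\hat\lambda$ at zero are additions made here, and $\hat\sigma_u^2$ is reported as $\hat\lambda\hat\sigma_e^2$ with both factors taken from the final iteration.

```python
import numpy as np

def apply_w_inverse(m, area_idx, n_k, lam):
    """Multiply by W^{-1}, whose k-th block is I - lam / (1 + n_k * lam) * 1 1' (see (3.28))."""
    m2 = m.reshape(m.shape[0], -1)
    sums = np.zeros((n_k.size, m2.shape[1]))
    np.add.at(sums, area_idx, m2)                          # within-area column sums
    shrink = (lam / (1.0 + n_k * lam))[area_idx][:, None]
    return (m2 - shrink * sums[area_idx]).reshape(m.shape)

def fit_unit_level(y, x, area, lam_start=1e-3, tol=1e-8, max_iter=500):
    """Alternate the lambda scoring step (3.29) with the updates (3.31) and (3.32)."""
    _, area_idx = np.unique(area, return_inverse=True)
    n_k = np.bincount(area_idx).astype(float)
    n, p = x.shape
    beta = np.linalg.lstsq(x, y, rcond=None)[0]            # preliminary OLS estimate
    resid = y - x @ beta
    sigma2_e = resid @ resid / (n - p)                     # preliminary sigma_e^2
    lam = lam_start                                        # lambda_(1) = 1/1000
    for _ in range(max_iter):
        resid_sums = np.bincount(area_idx, weights=y - x @ beta)   # n_k * area-level residual
        theta = 1.0 + n_k * lam
        score = -0.5 * np.sum(n_k / theta) + 0.5 / sigma2_e * np.sum((resid_sums / theta) ** 2)
        information = 0.5 * np.sum((n_k / theta) ** 2)
        lam_new = max(lam + score / information, 0.0)      # truncation at 0 is an added safeguard
        wx = apply_w_inverse(x, area_idx, n_k, lam_new)
        beta = np.linalg.solve(x.T @ wx, wx.T @ y)         # (3.31); sigma_e^2 cancels out
        resid = y - x @ beta
        sigma2_e_new = resid @ apply_w_inverse(resid, area_idx, n_k, lam_new) / n   # (3.32)
        converged = abs(lam_new - lam) < tol and abs(sigma2_e_new - sigma2_e) < tol
        lam, sigma2_e = lam_new, sigma2_e_new
        if converged:
            break
    return beta, lam * sigma2_e, sigma2_e                  # beta, sigma_u^2, sigma_e^2
```

A call such as `beta_hat, s2u_hat, s2e_hat = fit_unit_level(y, x, area)` takes the sample vector `y`, the design matrix `x` (including a column of ones) and an array `area` of area labels. Working with $W^{-1}$ through within-area sums avoids forming the $n \times n$ matrix $\hat\Sigma$ explicitly.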

Area-level Modeling

An area-level random effects model was originally used by Fay and Herriot (1979). They applied a linear regression with area random effects, in the case of unequal variances, for predicting the mean per capita income (PCI) in small geographical areas. A basic area-level mixed model assumes that the population small area totals are related to the area-specific auxiliary data through a linear model with random area effects (Rao, 2002). Therefore, this model can be used when the auxiliary information is available only at the area level. In such a case, the Fay-Herriot model is:

$$\bar Y_k^D = \bar Y_k + \varepsilon_k; \quad k = 1,\dots,K, \tag{3.33}$$

where $\bar Y_k$ is the true population value of the $k$th area mean of the target variable, $\bar Y_k^D$ denotes its direct estimate and $\varepsilon_k \mid \bar Y_k \sim N(0, \sigma^2_{\varepsilon_k})$. $\bar Y_k$ is assumed in (3.33) to be related to $P$ auxiliary variables as follows:

$$\bar Y_k = \bar X_k'\beta + u_k; \quad \text{where } u_k \sim N(0, \sigma_u^2), \tag{3.34}$$

where $\bar X_k$ is the vector of the $k$th area population means of the $P$ auxiliary variables:

$$\bar X_k = [\,1 \ \ \bar X_{k1} \ \ \bar X_{k2} \ \dots \ \bar X_{kP}\,]'. \tag{3.35}$$

The variance of the error term ($\sigma^2_{\varepsilon_k}$) is typically assumed to be associated with the complex sampling error for the $k$th area, and it is assumed to be known in the Fay-Herriot model. This strong assumption seems unrealistic in practice (González-Manteiga et al., 2010). The implications of having to estimate variance components and the effectiveness of the aggregated-level approach in SAE are considered in the following sections.

Considering the target of inference to be the actual area means, the typical approach for estimating the target variables is to use the unit-level model with aggregating (averaging) techniques (Longford, 2005). In this approach, the model parameters are first estimated using the individual-level data. Then, the unit-level parameter estimates are used to estimate the variable of interest at the required area level by aggregating the data. The aggregated relationship between the small

area means and the area-specific auxiliary variables in the population model (3.1) can be derived as follows:

$$\bar Y_k = \bar X_k'\beta + u_k + \bar e_k; \quad k = 1,\dots,K, \qquad u_k \overset{iid}{\sim} N(0,\sigma_u^2); \quad \bar e_k = \frac{1}{N_k}\sum_{i=1}^{N_k} e_{ik} \sim N\left(0, \frac{\sigma_e^2}{N_k}\right). \tag{3.36}$$

An alternative approach is to fit the model directly at the area level. The linear mixed model used in SAE to relate the area values of the study variable to $P$ area-specific auxiliary variables within the population can also be presented in matrix form as follows:

Area-Level Population Model:

$$\bar Y = \bar X\beta + u + \bar e, \qquad u \sim N(0, \sigma_u^2 I_K); \quad \bar e \sim N\left(0, diag\left(\frac{\sigma_e^2}{N_1}, \dots, \frac{\sigma_e^2}{N_K}\right)\right). \tag{3.37}$$

Here, $\bar Y$ and $\bar e$ are column vectors with $K$ elements, and $\bar X$ is a $K \times (P+1)$ dimensional matrix ($\bar X = [\,\bar X_1 \ \bar X_2 \ \dots \ \bar X_K\,]'$). A basic area-level model seems appropriate when the data are available only at the area level and the estimation process is possible only on the basis of aggregate data. However, we will consider the issue of whether there are advantages in using an area-level model when unit-level data are available. Considering the model presented in equation (3.37) to be properly specified for the population data, the appropriate model to be fitted to the aggregated-level sample data is:

$$\bar y_k = \bar x_k'\beta + \epsilon_k; \quad k = 1,\dots,K, \tag{3.38}$$

where:

$$\bar x_k = [\,1 \ \ \bar x_{k1} \ \ \bar x_{k2} \ \dots \ \bar x_{kP}\,]' \tag{3.39}$$

and $\epsilon_k = u_k + \bar e_k = \bar y_k - \bar x_k'\beta$. Here, we assume that the sample data come from simple random sampling. The Fay-Herriot model is therefore a more general form of model (3.38), as it incorporates complex sample design effects by assuming the sampling error variance to be known. Although this is an unrealistic assumption, it is considered in some of the literature as an advantage of this type of modeling. In this thesis, variance components are not assumed to be known. A numerical method for estimating the model parameters at the unit level is presented in the subsection on unit-level modeling above.

79 Chapter 3 SAE based on Linear Mixed Models For aggregated-level sample data, a similar function can be developed for parameter estimation based on the model for sample data presented in (3.38). The log-lielihood function for the area-level model is given by: l(β, σu, 2 σe; 2 ȳ) = 1 { ] } ln(2kπ) + ln det( Σ) + ϵ 2 Σ 1 ϵ, (3.40) where: ϵ = ϵ 1 ϵ 2... ϵ K ] & Σ = diag ( σ 2 u + σ2 e n 1,..., σ 2 u + σ2 e n K ). (3.41) Then, the following expressions are required in the Fisher scoring algorithm for the parameter λ. Longford, 2005] l(θ ; ȳ) = 1 ( K n 1 λ λn σ 2 u ϵ ) W 1 n 1 W 1 ϵ = 1 2 ( K n 1 + λn 1 σ 2 u K ) n ϵ 2 (1 + λn ) 2 (3.42) where: ( ) 2 l(θ ; ȳ) E = 1 ( K ) n ϵ 2 2 λ 2 (1 + λn ) 2 W = σe 2 Σ = diag(λ + 1,..., λ + 1 ). (3.43) n 1 n K Assuming variance components to be nown in the area-level model, the ML estimator for parameter β based on area-level sample data is: ˆβ A = ( x Σ 1 x) 1 x Σ 1ȳ, (3.44) where: ȳ = ȳ 1 ȳ 2... ȳ K ] & x = x 1 x 2... x K ]. (3.45) Using area-level data, expressions for the Fisher scoring algorithm for the parameter λ is exactly the same as in (3.29) (Longford (2005); p.198). The preliminary value for σe 2 would be obtained from the unweighted ordinary least squared method. Then, using the Fisher scoring algorithm for the variance ratio, new estimated random effects for th area in iteration (t+1) can be calculated as below: ˆσ u(t+1) 2 = ˆλ (t+1)ˆσ e(t) 2. (3.46) 66

Using $\hat{\sigma}^2_{u(t+1)}$ and $\hat{\sigma}^2_{e(t)}$, new estimates $\hat{\bar{\Sigma}}_{(t+1)}$ and $\hat{\beta}^{A}_{(t+1)}$ can be obtained. Then, a new estimated value for $\sigma_e^2$ can be calculated from:

$$\hat{\sigma}^2_{e(t+1)} = \frac{1}{K - P}\, \hat{\epsilon}_{(t+1)}'\, \hat{\bar{W}}^{-1}_{(t+1)}\, \hat{\epsilon}_{(t+1)}, \tag{3.47}$$

where $\hat{\epsilon}_{(t+1)} = \bar{y} - \bar{x}\hat{\beta}^{A}_{(t+1)}$. Note that the algorithms for calculating parameter estimates using individual- and aggregated-level analysis are very similar. The main difference lies in the calculation of $\hat{\sigma}^2_{e(t+1)}$, which uses $\hat{W}_{(t+1)}$ with unit-level data and $\hat{\bar{W}}_{(t+1)}$ with area-level data. There are some differences in using $\hat{\beta}^{A}$ instead of $\hat{\beta}^{B}$ for area-level estimation purposes. However, these estimators will have the same expectation if the unit-level model is properly specified.

Note that sample information about the auxiliary variables may not be available in practice. In such a case, population information can be used in the area-level working models. This information is typically obtained from the previous census or a larger survey. Using population area means in the model in (3.38), we have:

$$\bar{y}_k = \bar{X}_k'\beta + u_k + \bar{e}_k^{(s)} \, ; \quad k = 1, \dots, K, \qquad u_k \overset{iid}{\sim} N(0, \sigma_u^2) \, ; \quad \bar{e}_k^{(s)} = \frac{1}{n_k}\sum_{i=1}^{n_k} e_{ik} \sim N\!\left(0, \frac{\sigma_e^2}{n_k}\right). \tag{3.48}$$

A comparison among different working models is presented in the following chapters.

3.3 Regression-Synthetic Estimator

If sufficient sample data are collected from each specific area, parameter estimates can be calculated separately for each area. In such a case, a regression estimator is $\hat{\bar{Y}}_k = \bar{X}_k'\hat{\beta}_k$. A GREG estimator can be constructed by adding an appropriate estimate of the model error for the $k$th area. Usually, funds are not available to collect samples large enough to allow a reliable direct survey estimate for each sub-area. Therefore, a basic synthetic estimator for the small areas within the broad area is calculated under the assumption that the target areas are homogeneous with respect to the means of the required characteristic. This is a strong assumption and users need to be made aware of it.
To decrease the mean squared error of estimates for small areas, auxiliary variables from administrative records are often

used as covariates in a mixed linear model. Under the regression-synthetic approach, the mean value for the $k$th area is estimated through the equation $\bar{Y}_k^{Syn} = \bar{X}_k'\hat{\beta}^{Syn}$, where $\bar{X}_k$ is a column vector containing the population mean values of the $P$ auxiliary variables in the $k$th area. The estimate $\hat{\beta}^{Syn}$ is calculated using the whole available sample information. Here the homogeneity assumption is that $\beta$ is the same for each area.

Two types of small area models have been developed in the literature. In the first type, auxiliary data are available for each of the population elements. The parameter vector $\beta = (\beta_0, \beta_1, \dots, \beta_P)'$ is therefore estimated using all available sample data collected from the various small areas. Such models are considered by Fuller and Harter (1987), Battese et al. (1988), MacGibbon and Tomberlin (1989), Datta and Ghosh (1991), and Kleffe and Rao (1992). In the second type of models, only area-specific auxiliary data are available. These models are considered by Fay and Herriot (1979), Kackar and Harville (1984), Fay (1987), Spjotvoll and Thomsen (1987), Ericksen and Kadane (1985, 1987, 1992), Cressie (1989, 1990, 1992), Prasad and Rao (1990), Ghosh et al. (1991), Datta et al. (1992), Ghosh and Rao (1994), Singh et al. (1994), and Chand and Alexander (1995). A brief assessment of small area estimates based on different linear models is presented by Ghosh and Rao (1994).

Given estimates of the regression parameters, the area mean of the target variable can be estimated from the fitted working models through synthetic estimation as follows:

$$\bar{Y}_k^{SU} = \bar{X}_k'\hat{\beta}^{U} \quad \text{or} \quad \bar{Y}_k^{SA} = \bar{X}_k'\hat{\beta}^{A}, \tag{3.49}$$

where $\bar{Y}_k^{SU}$ and $\bar{Y}_k^{SA}$ respectively denote the unit-level and area-level synthetic estimators for the target variable within the $k$th area.
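The estimate $\hat{\beta}^{U}$ is the ML estimated value for the parameter vector $\beta$ using the individual-level sample data and $\hat{\beta}^{A}$ is the ML estimated value using the aggregated-level sample data.

$$\hat{\beta}^{U} = (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}y, \qquad \hat{\beta}^{A} = (\bar{x}'\bar{\Sigma}^{-1}\bar{x})^{-1}\bar{x}'\bar{\Sigma}^{-1}\bar{y} \tag{3.50}$$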

where:

$$\Sigma^{-1} = \frac{1}{\sigma_e^2}\,\mathrm{diag}\!\left(I_{n_1} - \frac{\gamma_1}{n_1}\mathbf{1}_{n_1}\mathbf{1}_{n_1}', \; \dots, \; I_{n_K} - \frac{\gamma_K}{n_K}\mathbf{1}_{n_K}\mathbf{1}_{n_K}'\right), \qquad \bar{\Sigma}^{-1} = \mathrm{diag}\!\left(\frac{n_1}{n_1\sigma_u^2 + \sigma_e^2}, \; \dots, \; \frac{n_K}{n_K\sigma_u^2 + \sigma_e^2}\right). \tag{3.51}$$

Note that $\gamma_k = \dfrac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2/n_k}$, and $\Sigma$ and $\bar{\Sigma}$ are respectively $n$- and $K$-dimensional variance-covariance matrices, which should be derived from known or estimated values of $\sigma_e^2$ and $\sigma_u^2$ (A.2). The column vector $y$ contains all observed values of the target variable, and $\bar{y}$ is the vector of area sample means. $x$ denotes the matrix of individual-level auxiliary sample data and $\bar{x}$ is the matrix of sample area means for the $P$ auxiliary variables (A.1).

The Mean Square Error (MSE) of the regression-synthetic estimator for the $k$th area mean of the required variable can be calculated through the equation below when the aggregated-level model has been derived by averaging the unit-level model:

$$\begin{aligned}
MSE_\xi(\bar{Y}_k^{Syn}) &= E_\xi\big[(\bar{Y}_k^{Syn} - \bar{Y}_k)^2\big] = E_\xi\big[(\bar{X}_k'\hat{\beta}^{Syn} - \bar{X}_k'\beta - u_k - \bar{e}_k)^2\big] \\
&= E_\xi\big[\big(\bar{X}_k'(\hat{\beta}^{Syn} - \beta) - u_k - \bar{e}_k\big)^2\big] = \bar{X}_k'\big[MSE_\xi(\hat{\beta}^{Syn})\big]\bar{X}_k + \sigma_u^2 + \frac{\sigma_e^2}{N_k}.
\end{aligned} \tag{3.52}$$

The MSE is the sum of the variance and the squared bias of the estimator. If the actual population model is properly specified, the resulting parameter estimates are unbiased. In such a case the MSE and variance of $\hat{\beta}$ are equal. Considering the models (3.3) and (3.37) as the actual population models for the unit-level and area-level analysis respectively, the resulting MSEs for $\hat{\beta}$ are:

$$MSE_\xi(\hat{\beta}^{U}) = Var_\xi(\hat{\beta}^{U}) = (x'\Sigma^{-1}x)^{-1}, \qquad MSE_\xi(\hat{\beta}^{A}) = Var_\xi(\hat{\beta}^{A}) = (\bar{x}'\bar{\Sigma}^{-1}\bar{x})^{-1}. \tag{3.53}$$

3.4 EBLUP Techniques

The definition of general Linear Mixed Models (LMM) given by Searle et al. (1992), Cnaan et al. (1997), and Demidenko (2004) is:

$$Y = X\beta + Zu + e. \tag{3.54}$$

This is a more general format for model (3.3), where $Z$ is an $N \times K$ matrix of random-effect regressors. Here, $u$ and $e$ are assumed to be distributed independently with mean zero and covariance matrices $G$ and $R$, respectively. Note that $G$ and $R$ depend on the variance components $\theta = [\theta_1, \dots, \theta_m]'$.

$$Var\begin{bmatrix} u \\ e \end{bmatrix} = \begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix}, \qquad E(e) = 0 \;\; \& \;\; E(u) = 0 \tag{3.55}$$

The mean vector and covariance matrix of $Y$ are respectively $\mu_Y = X\beta$ and $V = ZGZ' + R$. The Best Linear Unbiased Estimation (BLUE) of the fixed effects $\beta$ and the Best Linear Unbiased Prediction (BLUP) of the random effects $u$ in the LMM have been defined by Henderson (1950; 1975) as solutions to the following simultaneous equations. A detailed discussion can be found in Rao (2003, Chapter 6).

$$\begin{aligned}
X'R^{-1}X\tilde{\beta} + X'R^{-1}Z\tilde{u} &= X'R^{-1}Y \\
Z'R^{-1}X\tilde{\beta} + (Z'R^{-1}Z + G^{-1})\tilde{u} &= Z'R^{-1}Y
\end{aligned} \tag{3.56}$$

Robinson (1991) defined the Best Linear Unbiased Prediction (BLUP) as the linear function of the data which is unbiased. The results of these estimation methods are best in the sense that they minimize the generalized mean square error within the class of linear unbiased estimators, and they are unbiased in the sense that the average value of the estimates is equal to the average value of the quantity being estimated (Morris, 2001). Note that, within the statistical literature, it is conventional to speak of estimation for fixed effects and prediction for random effects. Considering the equations in (3.56), $V^{-1}$ can be written as follows in order to simplify the calculations:

$$V^{-1} = R^{-1} - R^{-1}Z(Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1}. \tag{3.57}$$

Then, we have:

$$GZ'V^{-1} = (Z'R^{-1}Z + G^{-1})^{-1}Z'R^{-1}. \tag{3.58}$$

The plug-in formulas for $\tilde{\beta}$ and $\tilde{u}$ are obtained by solving the equations above. These formulas are presented below [Morris (2001)]:

$$\tilde{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}Y, \qquad \tilde{u} = GZ'V^{-1}(Y - X\tilde{\beta}). \tag{3.59}$$
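The equivalence between the simultaneous equations (3.56) and the closed forms (3.59) can be checked numerically. The sketch below is illustrative only and rests on assumptions that are not part of the derivation above: it uses an intercept-only nested-error model ($X = \mathbf{1}_N$, $G = \sigma_u^2 I_K$, $R = \sigma_e^2 I_N$) with known variance components, pure-Python lists instead of a matrix library, and hypothetical function names (`solve`, `henderson_blup`, `closed_form_blup`). Under these assumptions the closed form reduces to a weighted mean of the area means for $\tilde{\beta}$ and to $\tilde{u}_k = \gamma_k(\bar{y}_k - \tilde{\beta})$.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def henderson_blup(groups, sigma_u2, sigma_e2):
    """Solve the mixed model equations (3.56) for an intercept-only model.

    groups: list of lists, one list of observed y values per area.
    Returns (beta_tilde, [u_tilde_1, ..., u_tilde_K]).
    """
    k = len(groups)
    n = [len(g) for g in groups]
    totals = [sum(g) for g in groups]
    a = [[0.0] * (k + 1) for _ in range(k + 1)]
    b = [0.0] * (k + 1)
    a[0][0] = sum(n) / sigma_e2                # X'R^{-1}X
    b[0] = sum(totals) / sigma_e2              # X'R^{-1}Y
    for j in range(k):
        a[0][j + 1] = a[j + 1][0] = n[j] / sigma_e2         # X'R^{-1}Z
        a[j + 1][j + 1] = n[j] / sigma_e2 + 1.0 / sigma_u2  # Z'R^{-1}Z + G^{-1}
        b[j + 1] = totals[j] / sigma_e2                     # Z'R^{-1}Y
    sol = solve(a, b)
    return sol[0], sol[1:]


def closed_form_blup(groups, sigma_u2, sigma_e2):
    """Closed forms (3.59) specialised to the same intercept-only model."""
    n = [len(g) for g in groups]
    ybar = [sum(g) / len(g) for g in groups]
    w = [1.0 / (sigma_u2 + sigma_e2 / nk) for nk in n]
    beta = sum(wk * yk for wk, yk in zip(w, ybar)) / sum(w)
    gamma = [sigma_u2 / (sigma_u2 + sigma_e2 / nk) for nk in n]
    u = [g * (yk - beta) for g, yk in zip(gamma, ybar)]
    return beta, u


if __name__ == "__main__":
    data = [[1.0, 2.0, 3.0], [4.0, 6.0], [10.0]]  # made-up illustration data
    print(henderson_blup(data, 2.0, 1.0))
    print(closed_form_blup(data, 2.0, 1.0))
```

Both functions should return the same values up to rounding error, which illustrates why the $(K+1)$-dimensional system in (3.56) is often preferred computationally: it avoids inverting the $N \times N$ matrix $V$.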

84 Chapter 3 SAE based on Linear Mixed Models Therefore, the ML estimator for the parameter β presented in (3.15) is the same as BLUE for this model parameter. Using the estimators presented in (3.59), the BLUP for the required random variable Y can be derived as: Ŷ BLUP = Ỹ = X β + ũ. (3.60) To calculate BLUP values in (3.59) and (3.60), variance components have been assumed to be nown. Replacing the estimated values for the variance components in the mentioned equation, a two-stage estimator will be obtained. This estimator has been presented by Harville (1991) as an empirical BLUP or EBLUP. To approximate the covariance matrix of (Ỹ Y) Datta et al. (1992) presented a second-order approximation as below: MSE(ŶEBLUP ) = MSE(Ỹ) G 1 (θ) + G 2 (θ) + G 3 (θ), (3.61) where: G 1 (θ) = R R V 1 R G 2 (θ) = R V 1 X(X V 1 X)X V 1 R G 3 (θ) = R K 3 R H(σu) 2, and K = V 1 V 1 X(R V 1 X) 1 R V 1 H(σ 2 u) = 2 trace(v 2 ). Under the general definition of linear mixed model, a linear combination for the predictions of the fixed and random effects has been discussed by Henderson (1975), Prasad and Rao (1990), and Datta and Lahiri (2000) as follows: T (θ, Ȳ) = l β + m u, (3.62) when the elements l and m are defined as below: l = X & m = (0, 0,..., 0, 1, 0,..., 0). }{{} 71

85 Chapter 3 SAE based on Linear Mixed Models Then, in this especial case, the mentioned linear combination is presented as: T (θ, Ȳ) = µȳ = X β + u, (3.63) and the BLUP (or BLUE) for this combination is: Henderson (1975)] T (θ, Ȳ) = X β + m GZ V 1 (Y X β). (3.64) In the case of unit-level mixed model presented in (3.3), we have G = σ 2 u I K & R = σ 2 e I N. In such a case, the mentioned linear combination of the fixed and random effects prediction presented by Henderson (1975) is: T (σ, Ȳ) = µȳ = X β + ũ = X β + γ (Ȳ X β) = γ Ȳ + (1 γ ) X β, (3.65) where: σ = σ e σ u, X = 1 X 1., γ = σ2 u σ 2 u + ψ, X P and ψ = V ar(ē Ȳ). Sampling survey is one of the most popular ways to estimate the random and fixed effects within the models (3.3) and (3.37). Then, using the true sampling model, we have: Ȳ ȳ N { X β x β, σ2 u + σe/n 2 σu 2 + σe/n 2 σ 2 u + σ 2 e/n σ 2 u + σ 2 e/n } (Ȳ ȳ ) N{ X β + (σ 2 u + σ2 e N )(σ 2 u + σ2 e n ) 1( ȳ x β )], (σ 2 u + σ2 e N ) ] } (σu 2 + σ2 e N )(σu 2 + σ2 e n ) 1 (σu 2 + σ2 e N ) { ] } (Ȳ ȳ ) N X β + γ (ȳ x β), (σu 2 + σ2 e N ) (σu 2 + σ2 e N ) 2 (σu 2 + σ2 e n ) 1 (3.66) 72

For detailed calculations see (A.4.2). For the case where area means are the actual targets of inference, Ghosh and Rao (1994) defined the BLUP under the general LMM based on available sample data. Considering $\mu_{\bar{Y}_k} = E(\bar{Y}_k \mid u_k)$, equation (3.65) can be based on available sample data as follows (A.4.1):

$$\begin{aligned}
T(\sigma, \bar{Y}_k) = \tilde{\mu}_{\bar{Y}_k} &= \bar{X}_k'\tilde{\beta} + \tilde{u}_k = \bar{X}_k'\tilde{\beta} + \gamma_k(\bar{y}_k - \bar{x}_k'\tilde{\beta}) \\
&= \bar{X}_k'\tilde{\beta} + \gamma_k(\bar{y}_k - \bar{x}_k'\tilde{\beta}) + (\gamma_k\bar{X}_k'\tilde{\beta} - \gamma_k\bar{X}_k'\tilde{\beta}) \\
&= \gamma_k\big[\bar{y}_k + (\bar{X}_k - \bar{x}_k)'\tilde{\beta}\big] + (1 - \gamma_k)\bar{X}_k'\tilde{\beta},
\end{aligned} \tag{3.67}$$

where $\mu_{\bar{Y}_k} = E(\bar{Y}_k \mid u_k)$. Considering the variance components to be unknown and replacing them with their estimated values in the equation above, the EBLUP is calculated as follows:

$$T(\hat{\sigma}, \bar{Y}_k) = \hat{\mu}_{\bar{Y}_k} = \hat{\gamma}_k\big[\bar{y}_k + (\bar{X}_k - \bar{x}_k)'\tilde{\beta}\big] + (1 - \hat{\gamma}_k)\bar{X}_k'\tilde{\beta}, \tag{3.68}$$

where $\tilde{\beta}$ is the estimate of the model coefficients obtained using estimated values of the variance components in the variance-covariance matrix, and:

$$\bar{x}_k = [\, 1 \;\; \bar{x}_{k1} \; \dots \; \bar{x}_{kP} \,]' \quad \& \quad \hat{\gamma}_k = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \hat{\sigma}_e^2/n_k}.$$

Note that $\bar{X}_{kp}$ denotes the mean value of the $p$th auxiliary variable in the population within the $k$th area, while $\bar{x}_{kp}$ is the mean value of the same variable calculated from the sample data. Considering $\bar{X}_k'\hat{\beta}$ as a regression-synthetic estimator, Ghosh and Rao (1994) defined the composite estimator in a similar way.

Based on a linear mixed model fitted to the available sample data, the model-based Mean Square Error (MSE) of the resulting BLUP for the $k$th actual area mean

87 Chapter 3 SAE based on Linear Mixed Models can be calculated as follows: MSE ξ ( Ȳ BLUP ) ) = MSEξ ( Ȳ = Eξ ( Ȳ Ȳ) 2 = E ξ ( X β + ũ ) ( X β + u ) ] = E ξ X β + γ (ȳ x β) X β u ] 2 = E ξ X β + γ (ȳ x β + x β x β) X ] 2 β u = E ξ ( X γ x )( β β) ] 2 + Eξ γ (ȳ x β) u = G 2 (σ) + G 1 (σ), ] 2 (3.69) where: G 1 (σ) = V ar ξ (u ȳ, β) = (1 γ )σu 2 G 2 (σ) = ( X γ x ) MSE ξ ( β) ] ( X (3.70) γ x ). In order to calculate G 1 (σ), conditional distribution of u is considered as follows: u { N 0 }, σ2 u σu 2 ȳ x β σu 2 + σe/n 2 σ 2 u (u ȳ ) N N { ( σ 2 u { σ 2 u + σ 2 e/n γ ϵ, (1 γ )σ 2 u ) ] ȳ x β, σu 2 } σ 4 u σ 2 u + σ 2 e/n } (3.71) N { γ ϵ, ( 1 } λ ) σ 2 λ + 1/n u. Calculating MSE of resulting EBLUP based under LMM fitted on sample data is more challenging as in: MSE = E ( Ȳ EBLUP EBLUP ( Ȳ ) = E ( Ȳ EBLUP BLUP Ȳ + ) 2 Ȳ ) 2 Ȳ BLUP Ȳ ) 2 = E( Ȳ Ȳ + Ȳ Ȳ = E( Ȳ Ȳ ) 2 ) + MSE ( Ȳ + 2E ( Ȳ Ȳ )] )( Ȳ Ȳ. (3.72) MSE of the BLUP for th area mean is presented in (3.69). The cross product term on the right hand side is negligible. However, the remaining term is not ignorable and calculating this term E ( Ȳ ) 2 Ȳ is not straightforward. Considering the definition 74

88 Chapter 3 SAE based on Linear Mixed Models presented in (3.64), Kacer and Harville (1984) proposed a general approximation using Taylor series as follows: E T ( θ) T (θ) ] 2 = trace { A(θ)E( θ θ)( θ θ) ] }. (3.73) where A(θ) is the covariance matrix of D(θ) which is given as: D(θ) = T (θ) θ Prasad and Rao (1990) proposed a second-level approximation to MSE of EBLUP. In this approximation they used the method presented by Kacer and Harville (1984) by further approximation as below: trace { A( θ)e( θ θ)( θ θ) ] } = trace { ( b)v( b )E( θ θ)( θ θ) ] }, (3.74) where b = col 1 j m ( b θ j ) & b = m GZ V 1. In case of mixed model presented in (3.3) to be the true population model, we have: (G = σ 2 u I K & R = σ 2 e I N ). In such a case, a second order approximation for the remaining term in calculating the MSE presented in (3.72) is then as follows: Prasad and Rao (1990)] G 3 (σ) = E ( Ȳ Ȳ ) 2 { = trace (1 n Σ 1 n ) n 2 γ σ u γ σ e. γ σ u ( ) = n 2 σu 2 + σ2 3 e V ar(σ 2 n uˆσ e 2 ˆσ uσ 2 e) 2. γ σ e ]V ar(σ 2 uˆσ 2 e ˆσ 2 uσ 2 e) } (3.75) Considering a true woring model to be fitted on available sample data, an approximation for the MSE of EPLUPs under general linear mixed model is: where G 1 (σ) = (1 γ )σ 2 u MSE ξ ( Ȳ EBLUP ) = MSE ξ ( Ȳ ) G 1 (σ) + G 2 (σ) + G 3 (σ), (3.76) G 2 (σ) = ( X γ x ) MSE ξ ( β) ] ( X γ x ) G 3 (σ) = ( σ 2 e n ) 2 ( σ 2 u + σ2 e n ) 3 + V ar ξ (ˆσ 2 u) + σ4 u σ 4 e ] V ar ξ (ˆσ e) 2 2 σ2 u Cov σe 2 ξ (ˆσ u, 2 ˆσ e) 2. (3.77) 75

89 Chapter 3 SAE based on Linear Mixed Models The subscript ξ denotes the MSE, expectation and variance under the assumed population model. In the next chapter, MSE of the resulting parameter estimate for β based on different woring models under two selected population model are discussed. Considering model presented in (3.3) to be the actual population model, MSE of the resulting parameter estimate for β is as follows: MSE ξ ( β) = V arξ ( β) = (x Σ 1 x) 1. Replacing σ u and σ e respectively with σ u and σ e, an estimation can be calculated for the equations presented in (3.76) as below: where MSE ξ ( Ȳ EBLUP ) = MSE ξ ( Ȳ ) Ĝ1(σ) + Ĝ2(σ) + 2Ĝ3(σ) (3.78) Ĝ 1 (σ) = (1 ˆγ )ˆσ 2 u Ĝ 2 (σ) = ( X ˆγ x ) MSE ξ ( β) ] ( X ˆγ x ) Ĝ 3 (σ) = ( ˆσ 2 e n ) 2 ( ˆσ 2 u + ˆσ2 e n ) 3 + V ar ξ (ˆσ 2 u) + ˆσ4 u ˆσ 4 e ] V ar ξ (ˆσ e) 2 2 ˆσ2 u Cov ˆσ e 2 ξ (ˆσ u, 2 ˆσ e) 2. (3.79) In order to calculate the required factors in Ĝ3(σ), the Fisher information matrix is used as below: Rao, 2003; Chapter 7] I 1 (σ) = V σu 2σ2 u V σ 2 e σ 2 u V σ 2 u σ 2 e V σ 2 e σ 2 e. (3.80) Then, we have: V ar(ˆσ 2 e) = V σ 2 e σ 2 e V ar(ˆσ u) 2 = V σ 2 u σ (3.81) u 2 Cov(ˆσ u, 2 ˆσ e) 2 = V σ 2 u σe 2. The elements of Fisher information matrix presented in (3.82) are calculated through the equations shown in (3.23). I(σ) = F σuσ 2 u 2 F σ 2 e σ 2 u F σ 2 u σ 2 e F σ 2 e σ 2 e (3.82) 76

As can be seen in (3.78), an additional term is added to this equation compared with (3.76). This term is due to [Rao (2003), p. 104]:

$$E\big[G_1(\hat{\sigma})\big] \approx G_1(\sigma) - G_3(\sigma). \tag{3.83}$$

A detailed discussion of the MSE of EBLUPs is presented by Prasad and Rao (1990), and Saei and Chambers (2003a).

3.5 Conclusions and Further Discussions

Statistical models used for SAE purposes can be unit-level or area-level. The former consists of a model formulated at the individual level and uses unit-level data to estimate the model parameters. The latter specifies the model at the area level and uses area-level data for estimation purposes. In other words, the response variable of a unit-level model for the target population is the value of the required variable for each unit within the population, while the response variable of an area-level model is the value of the required variable for each area. Therefore, the number of elements in the response vector is equal to the overall population size for a unit-level model, and equal to the number of areas for an area-level model.

When data are available for individuals, the usual approach is to estimate regression coefficients and variance components based on a unit-level linear mixed model. However, it is also possible to aggregate the data to area level and estimate these parameters based on a linear model for the area means. When the unit-level model is properly specified, the parameter estimates from the individual- and aggregated-level analyses will have the same expectations, but we would expect the parameter estimates obtained using unit-level data to have smaller variances. In practice, however, the parameter estimates from different levels of data analysis often differ because of model misspecification.
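The variance comparison stated above for a properly specified unit-level model can be illustrated with the two expressions in (3.53). The sketch below is a hedged illustration, not part of the thesis: it assumes a single covariate plus an intercept, known variance components, and the standard nested-error identity that splits the unit-level information $x'\Sigma^{-1}x$ into the area-level information $\bar{x}'\bar{\Sigma}^{-1}\bar{x}$ plus a within-area sum of squares divided by $\sigma_e^2$. The function names `information` and `slope_variance` are hypothetical.

```python
def information(xs_by_area, sigma_u2, sigma_e2, level):
    """2x2 information matrix for (intercept, slope) under model (3.51).

    level="area" gives x_bar' Sigma_bar^{-1} x_bar;
    level="unit" gives x' Sigma^{-1} x, which adds the within-area part.
    """
    info = [[0.0, 0.0], [0.0, 0.0]]
    for xs in xs_by_area:
        n_k = len(xs)
        xbar = sum(xs) / n_k
        between = 1.0 / (sigma_u2 + sigma_e2 / n_k)  # n_k / (n_k s_u^2 + s_e^2)
        mean_row = (1.0, xbar)
        for r in range(2):
            for c in range(2):
                info[r][c] += between * mean_row[r] * mean_row[c]
        if level == "unit":
            # Within-area variation only informs the slope.
            info[1][1] += sum((x - xbar) ** 2 for x in xs) / sigma_e2
    return info


def slope_variance(info):
    """Slope entry of the inverse of a symmetric 2x2 information matrix."""
    det = info[0][0] * info[1][1] - info[0][1] ** 2
    return info[0][0] / det


if __name__ == "__main__":
    x_sample = [[1.0, 2.0, 3.0], [2.0, 4.0], [5.0, 6.0, 7.0, 8.0]]  # made up
    for level in ("unit", "area"):
        print(level, slope_variance(information(x_sample, 1.0, 4.0, level)))
```

Because the unit-level information equals the area-level information plus a non-negative within-area term, the unit-level slope variance can never exceed the area-level one when the model is correct. The advantage of area-level analysis discussed in this thesis therefore has to come from bias under misspecification, not from variance.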
Given that the targets of inference are at the area-level, the question arises as to whether it is sometimes preferable to use an area-level analysis and under what conditions an area-level analysis may be better. If the correct unit-level population model includes area-level means, the area-level analysis should produce less biased estimates of the regression coefficients.

It might be expected that individual-level analysis would produce better small area estimates. However, if the unit-level model is misspecified by the exclusion of important area-level auxiliary variables (contextual effects), the parameter estimates obtained from the unit-level and area-level analyses will have different expectations. In particular, if an important contextual variable is omitted, the parameter estimates obtained from an individual-level analysis will be biased, whereas an aggregated-level analysis can produce less biased estimates of the regression coefficients. Even if contextual variables are included in an individual-level analysis, the variance of the parameter estimates may increase because of the larger number of variables in the population model.

In order to estimate the model parameters, reliable estimates of the variance components are needed. Compared with the analysis of individual-level data, it is more challenging to calculate these estimates in aggregated analysis because of the lack of information about individual model errors. On the other hand, possible contextual effects in the population model may be ignored in individual analysis. Area-level models can incorporate these area-level effects automatically, and also derive the required predictors directly at the area level. In the next chapter, contextual models are introduced and the consequences of misspecifying contextual effects in unit-level modeling are studied.

The main aim of this thesis is to evaluate unit-level and area-level modeling approaches when both micro and aggregate data are available. Using a Monte Carlo simulation in Chapter 5, parameter estimates based on different levels of statistical modeling are studied when area-level effects are involved in the unit-level population model as contextual effects.
Our aim is to identify situations where aggregated-level analysis can provide more reliable estimates than unit-level analysis. This may happen due to the presence of contextual or area-level effects in the small area distribution of the target variable. Ignoring these effects in unit-level models can lead to biased estimates. However, such area-level effects are automatically included in area-level models in certain cases.

In the following chapters, the estimators will be calculated based on synthetic and EBLUP methods. The main advantage of these techniques is that they can provide estimators for all areas, even for those which have no

sample data. Estimation in these techniques is based on certain statistical models fitted to the whole sample. Assuming the area-level auxiliary information to be available, the required estimators can then be calculated for the target areas. The effects of synthetic and EBLUP methods on the efficiency of small area estimates are also evaluated in the following chapters.
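How the synthetic estimator (3.49) and the composite form (3.68) produce a value for every area, including an area without sample data, can be sketched as follows. This is a minimal illustration under stated assumptions: the coefficient vector and the variance components are treated as already estimated, the function names (`dot`, `shrinkage`, `synthetic`, `eblup`) are hypothetical, and the convention $\hat{\gamma}_k = 0$ for $n_k = 0$ is an assumption made here (it is the limit of $\hat{\gamma}_k$ as $\sigma_e^2/n_k$ grows), so that the EBLUP falls back to the synthetic estimator.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def shrinkage(n_k, sigma_u2, sigma_e2):
    """gamma_k = sigma_u^2 / (sigma_u^2 + sigma_e^2 / n_k); 0 if n_k == 0."""
    if n_k == 0:
        return 0.0  # assumed convention for an unsampled area
    return sigma_u2 / (sigma_u2 + sigma_e2 / n_k)


def synthetic(pop_means, beta):
    """Regression-synthetic estimate X_bar_k' beta, as in (3.49)."""
    return dot(pop_means, beta)


def eblup(pop_means, sample_means, ybar, n_k, beta, sigma_u2, sigma_e2):
    """Composite form (3.68):
    gamma [ybar + (X_bar - x_bar)' beta] + (1 - gamma) X_bar' beta.
    """
    if n_k == 0:
        return synthetic(pop_means, beta)
    g = shrinkage(n_k, sigma_u2, sigma_e2)
    diff = [big - small for big, small in zip(pop_means, sample_means)]
    return g * (ybar + dot(diff, beta)) + (1.0 - g) * dot(pop_means, beta)


if __name__ == "__main__":
    beta_hat = [1.0, 2.0]       # made-up (intercept, slope) estimates
    X_bar = [1.0, 3.0]          # population means, leading 1 for the intercept
    x_bar = [1.0, 2.5]          # sample means for a sampled area
    print(synthetic(X_bar, beta_hat))
    print(eblup(X_bar, x_bar, 6.4, 5, beta_hat, 1.0, 4.0))
    print(eblup(X_bar, None, None, 0, beta_hat, 1.0, 4.0))  # unsampled area
```

For a sampled area the result equals $\bar{X}_k'\hat{\beta} + \hat{\gamma}_k(\bar{y}_k - \bar{x}_k'\hat{\beta})$, so the EBLUP moves the synthetic value toward the direct information by a fraction $\hat{\gamma}_k$ that grows with $n_k$.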

Chapter 4

Contextual Effects in Modeling for SAE

4.1 Introduction

Indirect techniques for SAE purposes rely implicitly or explicitly on statistical models which include available data on auxiliary variables. These models usually involve random effects to explain the variation between target areas within the population, as well as several covariates based on the available auxiliary variables (Chambers and Tzavidis, 2006). Most small area estimates rely on unit-level statistical models when individual data are available. If only aggregated-level data are available, then area-level models may be used.

It is often the case that social characteristics of different individuals are affected by relevant surrounding factors and social behaviours. In recent years, scientists have increasingly recognized cases in which both individual- and aggregated-level covariates may be needed simultaneously within a statistical model. However, specifying the area-level covariates to be added to the unit-level models is a challenge in practice. Even if relevant area-level covariates are included in fitting the unit-level model, the number of model parameters will increase. This may increase the variance of the parameter estimates because of the larger number of variables in the fitted working model. Analytical techniques are used in the present study to investigate the implications of omitting required area-level effects from the unit-level

working model. This causes an error in the parameter estimation process, as the resulting parameter estimates in this case will be biased. Alternative modeling approaches are presented to implicitly or explicitly include some unknown area-level effects.

4.2 Contextual Models

Contextual modeling is a method to analyze the effects of the surrounding context on available individual data by developing a model which contains both individual-level variables and aggregated context variables (Duncan et al., 1961; Boyd & Iversen, 1979; and Snijders & Bosker, 1999). Achen and Shively (1995) defined the term context in social and behavioral sciences synonymously with the term environment, which was previously defined by Bronfenbrenner and Morris (1983) as the events and interrelated conditions which can affect the person's development or may be influenced by it. Classified ecological contexts are presented in Bronfenbrenner (1977), Bronfenbrenner (1986) and Little et al. (2007).

Sometimes, the data are available at two levels of observation. In such situations, the data available at the higher level of study can be added to the linear model as a contextual effect, in order to model individual (micro) level and group (macro) level data simultaneously (Mason et al., 1983). Linear mixed models such as (3.1) are commonly used in SAE. However, area-level covariates can also be included in the unit-level models in order to improve efficiency in certain cases. Supposing $T_k$ to denote the vector of $k$th area-level covariates included in the unit-level population model, we have [Kreft and Leeuw, 1998]:

$$Y_{ik} = (X_{ik}'; \, T_k')\beta + u_k + e_{ik} \, ; \quad i = 1, \dots, N_k \;\; \& \;\; k = 1, \dots, K, \qquad u_k \overset{iid}{\sim} N(0, \sigma_u^2) \, ; \quad e_{ik} \overset{iid}{\sim} N(0, \sigma_e^2). \tag{4.1}$$

Here, $\beta$ is the vector of model coefficients for both unit- and area-level covariates, as in:

$$\beta = [\, \beta_0 \;\; (\beta^{I})' \;\; (\beta^{C})' \,]', \tag{4.2}$$

where $\beta^{I}$ and $\beta^{C}$ are the vectors of regression coefficients for unit-level covariates

95 Chapter 4 Contextual Effects in Modeling for SAE and area-level covariates, respectively. ( β I ) = β I 1 β I 2... β I P ] & ( β C) = β C 1 β C 2... β C P ] (4.3) In the statistical literature, area-level covariates such as those in (4.1) are sometimes referred to as contextual effects and model (4.1) is then described as a contextual model. In a contextual model, both individual level and group area-level covariates are included simultaneously (Mason et al., 1983). A special case of T is where the contextual effects are small area population means, as in: Y i = X iβ + u + e i ; i = 1,..., N & = 1,..., K (4.4) iid N(0, σu 2 ) ; iid N(0, σe 2 ), u e i Here, X i involves both individual-level and aggregated-level covariates for ith unit within the th area as below: X i = X i X ], (4.5) where: X = X 1 X2... XP ]. (4.6) Considering P auxiliary variables in modeling, X i is the vector of individual-level covariates for ith unit within the th area introduced in (3.2) while X is the vector of th area population means. Note that, X i includes the intercept term, whereas X does not. Later in this chapter, the term Population Model 2 (P 2 ) is used for model presented in (4.4). The aggregated form of this population model is given as: u Ȳ = X β + u + ē iid N(0, σu 2 ) ; ē = 1 N e i N(0, σ2 e ) N N i=1 (4.7) where β is the vector regression coefficient for the area-level contextual model presented in (4.7). Each element in this vector is the sum of regression coefficients for the corresponding individual-level and aggregated-level covariates as follows: β = β 0 β1 I + β1 C β2 I + β2 C β I P. + β C P. (4.8) 82
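The statement in (4.8) that each aggregated coefficient is the sum of the corresponding individual-level and contextual coefficients can be illustrated with a noise-free toy population. The sketch below is only an illustration under assumptions that go beyond the text: it sets $u_k = e_{ik} = 0$, uses a single covariate, replaces the mixed-model ML fit by ordinary least squares, and uses hypothetical function names (`ols_slope`, `contextual_population`, `slopes`). The data values are made up.

```python
def ols_slope(x, y):
    """Slope of the ordinary least squares line of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def contextual_population(areas, beta0, beta_i, beta_c):
    """Noise-free version of model (4.4): Y_ik = b0 + bI x_ik + bC X_bar_k."""
    ys = []
    for xs in areas:
        xbar = sum(xs) / len(xs)
        ys.append([beta0 + beta_i * x + beta_c * xbar for x in xs])
    return ys


def slopes(areas, beta0, beta_i, beta_c):
    """Return (unit-level slope omitting the context, area-level slope)."""
    ys = contextual_population(areas, beta0, beta_i, beta_c)
    # Unit-level fit that ignores the contextual covariate X_bar_k.
    flat_x = [x for xs in areas for x in xs]
    flat_y = [y for area_y in ys for y in area_y]
    unit_slope = ols_slope(flat_x, flat_y)
    # Area-level fit on the area means, as in the aggregated model (4.7).
    xbar = [sum(xs) / len(xs) for xs in areas]
    ybar = [sum(area_y) / len(area_y) for area_y in ys]
    area_slope = ols_slope(xbar, ybar)
    return unit_slope, area_slope


if __name__ == "__main__":
    population = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    print(slopes(population, beta0=0.5, beta_i=1.0, beta_c=2.0))
```

In this toy population the area-level slope equals $\beta^{I} + \beta^{C}$, while the unit-level fit that omits $\bar{X}_k$ returns a slope that matches neither $\beta^{I}$ nor $\beta^{I} + \beta^{C}$. Since the target of inference is the area mean, the summed coefficient is the one needed for prediction at the area level.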

96 Chapter 4 Contextual Effects in Modeling for SAE Models (4.4) and (4.7) can be also presented in matrix forms as follows: Unit-Level Contextual Population Model : Y = X β + Zu + e u N(0, σ 2 u I K) ; e N(0, σ 2 e I N) (4.9) Area-Level Contextual Population Model : Ȳ = Xβ + u + ē u N(0, σ 2 u I K) ; ē N ( 0, diag( σ2 e N 1,..., σ 2 e N K ) ) (4.10) where: X = X 11. X N 1 1 X 12. X N X 1K. X N K K. (4.11) Assuming (4.4) to be the actual population model, the true model that should be fitted on available sample data is: y i = X (s)iβ + u + e i ; i = 1,..., n & = 1,..., K (4.12) iid N(0, σu 2 ) ; iid N(0, σe 2 ) u e i where X (s)i includes individual-level sample information about the ith individual falling in the th area and the area population means for such area as in: X (s)i = x i X ]. (4.13) 83

97 Chapter 4 Contextual Effects in Modeling for SAE The aggregated form of this model is given as: where: u X ȳ = (s)β + u + ē iid N(0, σu 2 ) ; ē = 1 n e i N(0, σ2 e ) n n i=1 (4.14) X (s) = x X ]. (4.15) When the population data about the auxiliary variables are not available in analysis of sample information, an alternative woring model would be: y i = x i β + u + e i ; i = 1,..., n & = 1,..., K (4.16) iid N(0, σu 2 ) ; iid N(0, σe 2 ) u e i Here, x i included auxiliary information about the ith sample individual within the th area as well as the th area sample means as below: x i = x i x ], (4.17) where: x = x 1 x 2... x P ]. (4.18) The aggregated form of this model presented in (4.16) is given as: u ȳ = x β + u + ē iid N(0, σu 2 ) ; ē = 1 n e i N(0, σ2 e ) n n i=1 (4.19) In aggregated-level analysis, the models presented in (3.38) and (4.19) are actually the same. This is the ey point in this thesis which shows that the area-level models can involve existing contextual effects within the actual population model, automatically. It is well nown that regression coefficients obtained from individual-level analysis can be different from those obtained based on analysis of aggregate data. This is referred to as the ecological fallacy and can happen when the population model should include both unit-level and area-level fixed effects (Steel and Holt, 1996). Contextual models help researchers to understand and study the issue of the ecological fallacy, which occurs when researchers want to draw a conclusion about an 84

individual-level relationship based on aggregated-level data analysis. This causes an error in the interpretation of statistical data, as results based on purely aggregated-level analysis may not be appropriate for inference about an individual-based characteristic (Gehlke & Biehl, 1934; Robinson, 1950; Yule & Kendall, 1950; Hammond, 1973; Visser, 1994; Seiler & Alvarez, 2000). When contextual effects exist in the population model but are ignored in working models, the resulting regression coefficient estimates from unit-level and area-level sample data will differ in expectation. This is referred to as the ecological fallacy.

Information is obtained from the population using a sampling design. After the required information has been gathered, different working models can be fitted to the available sample data. The working models considered in this thesis are presented in Table 4.1. Differences between the selected working model and the real population model lead to possible biases in the resulting parameter estimates.

Table 4.1: Summary of Possible Working Models

| Working Model | Model | Equation within the Thesis |
|---|---|---|
| W1 | $y^{(W_1)}_{ik} = x_{ik}'\beta + u_k + e_{ik}$ | (3.8) |
| W2 | $\bar{y}^{(W_2)}_{k} = \bar{x}_k'\beta + u_k + \bar{e}_k$ | (3.38) |
| W3 | $\bar{y}^{(W_3)}_{k} = \bar{X}_k'\beta + u_k + \bar{e}_k$ | (3.48) |
| W4 | $y^{(W_4)}_{ik} = x^{*\prime}_{ik}\beta + u_k + e_{ik}$ | (4.16) |
| W5 | $y^{(W_5)}_{ik} = X^{*\prime}_{(s)ik}\beta + u_k + e_{ik}$ | (4.12) |
| W6 | $\bar{y}^{(W_6)}_{k} = \bar{X}^{*\prime}_{(s)k}\beta + u_k + \bar{e}_k$ | (4.14) |

Working model W1 is the standard unit-level model, which does not include any contextual effects and so will produce biased estimates if such effects are present. Working model W2 is obtained from W1 by simple aggregation and uses the sample area means. Working model W3 uses the population area means and corresponds to the Fay-Herriot approach. It has the advantage of enabling the use of covariates not available in the sample data. Working models W4, W5 and W6

attempt to include possible contextual effects. Working model W4 uses the sample area means instead of the population area means as contextual effects and produces biased estimates because of the differences between the sample and population area means. Working model W5 is a true unit-level contextual model which includes individual-level covariates and population area means. Working model W6 is the aggregate counterpart to W5 and so includes sample and population area means as covariates. While W5 is the most general unit-level contextual model, it has $2P + 1$ regression coefficients. The aggregate contextual model W6 also includes the contextual effects and has $2P + 1$ regression coefficients. There may also be a problem of collinearity arising from including both sample and population area means of the covariates. On the other hand, working model W2 avoids these issues and implicitly accounts for the contextual effects by efficiently replacing the population area means by the sample area means, and is effectively the aggregate counterpart to W4. Note that $P + 1$ regression coefficients are to be estimated using W2. Table 4.2 summarizes the situations in which the previously introduced working models can be used.

Table 4.2: Categorizing Different Situations based on the Data Availability for Model Fitting

| Working Model | Individual-level: Sample info. for Target Var. ($y_{ik}$) | Individual-level: Auxiliary info. in the Sample ($x_{ik}$) | Individual-level: Auxiliary info. in Population ($X_{ik}$) | Area-level: Population info. for Target Var. ($\bar{y}_k$) | Area-level: Auxiliary info. in the Sample ($\bar{x}_k$) | Area-level: Auxiliary info. in Population ($\bar{X}_k$) |
|---|---|---|---|---|---|---|
| W1 | ✓ | ✓ | | | | |
| W2 | | | | ✓ | ✓ | |
| W3 | | | | ✓ | | ✓ |
| W4 | ✓ | ✓ | | | ✓ | |
| W5 | ✓ | ✓ | | | | ✓ |
| W6 | | | | ✓ | ✓ | ✓ |

Individual-level sample data on both auxiliary and target variables are required for

$W_1$, while this information is needed at the aggregated level in the case of $W_2$. Both individual- and aggregated-level sample information about the auxiliary variables is needed to form $W_4$. When the sample data on the auxiliary variables required to derive $W_2$ and $W_4$ are not available, area-level population information is used instead to form $W_3$ and $W_5$, respectively. Both sample and population area means are needed to derive $W_6$. The working models presented in Table 4.1 are compared under two main population models in the following sections.

4.3 Area-level BLUP in a General Format

In the definition of the area-level BLUP given in Chapter 3, the model was assumed to be properly specified and the resulting parameter estimates were assumed to be unbiased. In the following sections we show some examples in which this is not the case. A general definition of the area-level BLUP is [Rao, 2003; Chapter 6]:

\[
\begin{aligned}
\hat{\bar{Y}}_k^{BLUP} &= \bar{X}_k'\hat\beta + \hat{u}_k = \bar{X}_k'\hat\beta + \gamma_k(\bar{y}_k - \bar{x}_k'\hat\beta) \\
&= \bar{X}_k'\big[\beta + (\hat\beta - \beta)\big] + \gamma_k\big\{\bar{y}_k - \bar{x}_k'\big[\beta + (\hat\beta - \beta)\big]\big\} \\
&= \bar{X}_k'\beta + \gamma_k(\bar{y}_k - \bar{x}_k'\beta) + (\bar{X}_k - \gamma_k\bar{x}_k)'(\hat\beta - \beta).
\end{aligned}
\tag{4.20}
\]

Then, since $\bar{Y}_k = \bar{X}_k'\beta + u_k + \bar{e}_k$,

\[
\hat{\bar{Y}}_k^{BLUP} - \bar{Y}_k = \underbrace{\big[\gamma_k(\bar{y}_k - \bar{x}_k'\beta) - u_k - \bar{e}_k\big]}_{a} + \underbrace{(\bar{X}_k - \gamma_k\bar{x}_k)'(\hat\beta - \beta)}_{b}.
\tag{4.21}
\]

In order to calculate the MSE for the BLUP presented in (4.20), we have:

\[
E\big[(\hat{\bar{Y}}_k^{BLUP} - \bar{Y}_k)^2\big] = E(a^2) + E(b^2) + 2E(ab),
\tag{4.22}
\]

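where:

\[
a = \gamma_k(\bar{y}_k - \bar{x}_k'\beta) - u_k - \bar{e}_k = \gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k = (\gamma_k - 1)u_k + \gamma_k\bar{e}_{(s)k} - \bar{e}_k,
\]

\[
\begin{aligned}
E(a^2) &= (1-\gamma_k)^2\sigma_u^2 + \gamma_k^2\Big(\frac{\sigma_e^2}{n_k}\Big) + \frac{\sigma_e^2}{N_k} - 2\gamma_k E(\bar{e}_k\bar{e}_{(s)k}) \\
&= (1-\gamma_k)^2\sigma_u^2 + \gamma_k^2\Big(\frac{\sigma_e^2}{n_k}\Big) + \frac{\sigma_e^2}{N_k} - 2\gamma_k\frac{\sigma_e^2}{n_k}\frac{n_k}{N_k} \\
&= \sigma_u^2 - 2\gamma_k\sigma_u^2 + \gamma_k^2\Big(\sigma_u^2 + \frac{\sigma_e^2}{n_k}\Big) + (1-2\gamma_k)\frac{\sigma_e^2}{N_k}.
\end{aligned}
\tag{4.23}
\]

Note that $\bar{e}_{(s)k}$ is the area-level model error for the $k$th area calculated from the available sample data, while $\bar{e}_k$ is the $k$th area-level population model residual, so that $E(\bar{e}_k\bar{e}_{(s)k}) = \sigma_e^2/N_k$. Also we have:

\[
E(b^2) = E\Big\{\big[(\bar{X}_k - \gamma_k\bar{x}_k)'(\hat\beta - \beta)\big]^2\Big\} = (\bar{X}_k - \gamma_k\bar{x}_k)'\,E\big[(\hat\beta - \beta)(\hat\beta - \beta)'\big]\,(\bar{X}_k - \gamma_k\bar{x}_k).
\tag{4.24}
\]

In order to minimize the term $E(a^2)$, $\gamma_k$ should be calculated as

\[
\gamma_k = \frac{\sigma_u^2 + (\sigma_e^2/N_k)}{\sigma_u^2 + (\sigma_e^2/n_k)}.
\]

As the population sizes of the target areas are assumed to be quite large, we have $\gamma_k \approx \sigma_u^2 / \big(\sigma_u^2 + (\sigma_e^2/n_k)\big)$, which is the usual optimal choice of $\gamma_k$ for calculating the BLUP of $\bar{X}_k'\beta + u_k$. Then, using $\gamma_k^2(\sigma_u^2 + \sigma_e^2/n_k) = \gamma_k\sigma_u^2$,

\[
\begin{aligned}
E(a^2) &= \sigma_u^2 - 2\gamma_k\sigma_u^2 + \gamma_k^2\Big(\sigma_u^2 + \frac{\sigma_e^2}{n_k}\Big) + (1-2\gamma_k)\frac{\sigma_e^2}{N_k} \\
&= \sigma_u^2 - 2\gamma_k\sigma_u^2 + \gamma_k\sigma_u^2 + (1-2\gamma_k)\frac{\sigma_e^2}{N_k} \\
&= (1-\gamma_k)\sigma_u^2 + (1-2\gamma_k)\frac{\sigma_e^2}{N_k}.
\end{aligned}
\tag{4.25}
\]

In order to calculate $E(ab)$, we have:

\[
\begin{aligned}
E(ab) &= E\Big\{\big[\gamma_k(\bar{y}_k - \bar{x}_k'\beta) - u_k - \bar{e}_k\big](\bar{X}_k - \gamma_k\bar{x}_k)'(\hat\beta - \beta)\Big\} \\
&= E\Big\{\big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big](\bar{X}_k - \gamma_k\bar{x}_k)'(\hat\beta - \beta)\Big\}.
\end{aligned}
\tag{4.26}
\]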
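The choice of $\gamma_k$ that minimizes $E(a^2)$ can be checked numerically. The following is a minimal sketch, not part of the thesis: it assumes $Var(u_k) = \sigma_u^2$, $Var(\bar{e}_{(s)k}) = \sigma_e^2/n_k$, $Var(\bar{e}_k) = \sigma_e^2/N_k$ and $Cov(\bar{e}_{(s)k}, \bar{e}_k) = \sigma_e^2/N_k$, and the function names and numerical values are hypothetical.

```python
import numpy as np


def expected_a_squared(gamma, sigma_u2, sigma_e2, n_k, N_k):
    """E(a^2) for a = (gamma - 1) * u_k + gamma * ebar_sample - ebar_population.

    Assumes Var(u_k) = sigma_u2, Var(ebar_sample) = sigma_e2 / n_k,
    Var(ebar_population) = sigma_e2 / N_k and, because the sample is part of
    the area population, Cov(ebar_sample, ebar_population) = sigma_e2 / N_k.
    """
    return ((1.0 - gamma) ** 2 * sigma_u2
            + gamma ** 2 * sigma_e2 / n_k
            + sigma_e2 / N_k
            - 2.0 * gamma * sigma_e2 / N_k)


def minimising_gamma(sigma_u2, sigma_e2, n_k, N_k):
    """Closed-form minimiser of expected_a_squared with respect to gamma."""
    return (sigma_u2 + sigma_e2 / N_k) / (sigma_u2 + sigma_e2 / n_k)


def large_population_gamma(sigma_u2, sigma_e2, n_k):
    """Approximation of the minimiser when N_k is large."""
    return sigma_u2 / (sigma_u2 + sigma_e2 / n_k)


# Hypothetical values, chosen only for illustration.
sigma_u2, sigma_e2, n_k, N_k = 1.0, 4.0, 10, 5000

# Brute-force minimisation over a fine grid of candidate gamma values.
grid = np.linspace(0.0, 1.0, 100001)
gamma_grid = grid[np.argmin(expected_a_squared(grid, sigma_u2, sigma_e2, n_k, N_k))]
gamma_closed = minimising_gamma(sigma_u2, sigma_e2, n_k, N_k)
gamma_limit = large_population_gamma(sigma_u2, sigma_e2, n_k)
```

With these illustrative values the grid search and the closed form agree to within the grid spacing, and the large-population approximation differs from the exact minimiser only in the fourth decimal place.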

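In the case of regression coefficients estimated from an individual-level analysis we have:

\[
\begin{aligned}
\hat\beta^{U} &= (\check{x}'\Sigma^{-1}\check{x})^{-1}\check{x}'\Sigma^{-1}y = (\check{x}'\Sigma^{-1}\check{x})^{-1}\check{x}'\Sigma^{-1}(x\beta + zu + e_{(s)}) \\
&= \underbrace{(\check{x}'\Sigma^{-1}\check{x})^{-1}\check{x}'\Sigma^{-1}x}_{c}\,\beta + \underbrace{(\check{x}'\Sigma^{-1}\check{x})^{-1}\check{x}'\Sigma^{-1}}_{d}\,(zu + e_{(s)}) = c\beta + d(zu + e_{(s)}),
\end{aligned}
\tag{4.27}
\]

where $\check{x}$ is the matrix of covariates considered in the unit-level fitting model, based on assumptions about the available auxiliary information. Note that there may be differences between the elements of this matrix and those of the matrix in the true sample model, due to model misspecification. Then we have:

\[
\hat\beta^{U} - \beta = (c - I)\beta + d(zu + e_{(s)}).
\tag{4.28}
\]

Considering $E(ab) = E(ba)$, we have:

\[
\begin{aligned}
E_\xi(ba) &= E_\xi\Big\{(\bar{X}_k - \gamma_k\bar{x}_k)'(\hat\beta^{U} - \beta)\big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big]\Big\} \\
&= (\bar{X}_k - \gamma_k\bar{x}_k)'\,E_\xi\Big\{\big[(c - I)\beta + d(zu + e_{(s)})\big]\big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big]\Big\} \\
&= (\bar{X}_k - \gamma_k\bar{x}_k)'\,E_\xi\Big\{d(zu + e_{(s)})\big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big]\Big\} \\
&= (\bar{X}_k - \gamma_k\bar{x}_k)'\Big\{d\Big[(\gamma_k - 1)\sigma_u^2 1_{n_k} + \gamma_k\frac{\sigma_e^2}{n_k}1_{n_k} - \frac{\sigma_e^2}{N_k}1_{n_k}\Big]\Big\}.
\end{aligned}
\tag{4.29}
\]

Note that $(c - I)\beta$ is a constant, so it does not contribute to the covariance with the model random effects. Then, taking the covariance between model random effects for different areas to be zero, we have:

\[
f = \big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big](zu + e_{(s)}) = \big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big]
\begin{bmatrix} u_1 1_{n_1} + e_{(s)1} \\ \vdots \\ u_k 1_{n_k} + e_{(s)k} \\ \vdots \\ u_K 1_{n_K} + e_{(s)K} \end{bmatrix}.
\tag{4.30}
\]

In order to find $E_\xi(f)$, only the cross products for the $k$th area are required, so that in (4.29) $d$ multiplies a vector whose only non-zero block is that of the $k$th area:

\[
E_\xi\Big\{\big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big](u_k 1_{n_k} + e_{(s)k})\Big\} = (\gamma_k - 1)\sigma_u^2 1_{n_k} + \gamma_k\frac{\sigma_e^2}{n_k}1_{n_k} - \frac{\sigma_e^2}{N_k}1_{n_k}.
\tag{4.31}
\]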

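If $\gamma_k = \sigma_u^2 / \big(\sigma_u^2 + (\sigma_e^2/n_k)\big)$, we have:

\[
\begin{aligned}
E_\xi\Big\{\big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big](u_k 1_{n_k} + e_{(s)k})\Big\}
&= \Big[(\gamma_k - 1)\sigma_u^2 + \gamma_k\frac{\sigma_e^2}{n_k} - \frac{\sigma_e^2}{N_k}\Big]1_{n_k} \\
&= \Big[-\sigma_u^2 + \gamma_k\Big(\sigma_u^2 + \frac{\sigma_e^2}{n_k}\Big) - \frac{\sigma_e^2}{N_k}\Big]1_{n_k} \\
&= \Big[-\sigma_u^2 + \sigma_u^2 - \frac{\sigma_e^2}{N_k}\Big]1_{n_k} = -\frac{\sigma_e^2}{N_k}1_{n_k}.
\end{aligned}
\tag{4.32}
\]

In conclusion, the general formula for the MSE of the BLUP based on the individual-level analysis for a given $\gamma_k$ is:

\[
\begin{aligned}
MSE_\xi\big(\hat{\bar{Y}}_k^{BLUP_U}\big) ={}& \sigma_u^2 - 2\gamma_k\sigma_u^2 + \gamma_k^2\Big(\sigma_u^2 + \frac{\sigma_e^2}{n_k}\Big) + (1-2\gamma_k)\frac{\sigma_e^2}{N_k} \\
&+ (\bar{X}_k - \gamma_k\bar{x}_k)'\,E\big[(\hat\beta^{U} - \beta)(\hat\beta^{U} - \beta)'\big]\,(\bar{X}_k - \gamma_k\bar{x}_k) \\
&+ 2(\bar{X}_k - \gamma_k\bar{x}_k)'\Big\{d\Big[(\gamma_k - 1)\sigma_u^2 1_{n_k} + \gamma_k\frac{\sigma_e^2}{n_k}1_{n_k} - \frac{\sigma_e^2}{N_k}1_{n_k}\Big]\Big\}.
\end{aligned}
\tag{4.33}
\]

When $\gamma_k = \sigma_u^2 / \big(\sigma_u^2 + (\sigma_e^2/n_k)\big)$, and considering the area population sizes to be large, we have:

\[
MSE_\xi\big(\hat{\bar{Y}}_k^{BLUP_U}\big) = (1-\gamma_k)\sigma_u^2 + (\bar{X}_k - \gamma_k\bar{x}_k)'\,E\big[(\hat\beta^{U} - \beta)(\hat\beta^{U} - \beta)'\big]\,(\bar{X}_k - \gamma_k\bar{x}_k).
\tag{4.34}
\]

In the case of regression coefficients estimated from an aggregated-level analysis we have:

\[
\begin{aligned}
\hat\beta^{A} &= (\check{\bar{x}}'\Sigma^{-1}\check{\bar{x}})^{-1}\check{\bar{x}}'\Sigma^{-1}\bar{y} = (\check{\bar{x}}'\Sigma^{-1}\check{\bar{x}})^{-1}\check{\bar{x}}'\Sigma^{-1}(\bar{x}\beta + u + \bar{e}_{(s)}) \\
&= \underbrace{(\check{\bar{x}}'\Sigma^{-1}\check{\bar{x}})^{-1}\check{\bar{x}}'\Sigma^{-1}\bar{x}}_{\bar{c}}\,\beta + \underbrace{(\check{\bar{x}}'\Sigma^{-1}\check{\bar{x}})^{-1}\check{\bar{x}}'\Sigma^{-1}}_{\bar{d}}\,(u + \bar{e}_{(s)}) = \bar{c}\beta + \bar{d}(u + \bar{e}_{(s)}),
\end{aligned}
\tag{4.35}
\]

where $\check{\bar{x}}$ is the matrix of covariates considered in the area-level fitting model, based on assumptions about the available auxiliary information, which may not be properly specified. Then we have:

\[
\hat\beta^{A} - \beta = (\bar{c} - I)\beta + \bar{d}(u + \bar{e}_{(s)}).
\tag{4.36}
\]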

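Considering $E(ab) = E(ba)$, we have:

\[
\begin{aligned}
E_\xi(ba) &= E_\xi\Big\{(\bar{X}_k - \gamma_k\bar{x}_k)'(\hat\beta^{A} - \beta)\big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big]\Big\} \\
&= (\bar{X}_k - \gamma_k\bar{x}_k)'\,E_\xi\Big\{\big[(\bar{c} - I)\beta + \bar{d}(u + \bar{e}_{(s)})\big]\big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big]\Big\} \\
&= (\bar{X}_k - \gamma_k\bar{x}_k)'\,E_\xi\Big\{\bar{d}(u + \bar{e}_{(s)})\big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big]\Big\} \\
&= (\bar{X}_k - \gamma_k\bar{x}_k)'\Big\{\bar{d}\Big[(\gamma_k - 1)\sigma_u^2 + \gamma_k\frac{\sigma_e^2}{n_k} - \frac{\sigma_e^2}{N_k}\Big]\Big\}.
\end{aligned}
\tag{4.37}
\]

Note that $(\bar{c} - I)\beta$ is a constant, so it does not contribute to the covariance with the model random effects. Then, taking the covariance between model random effects for different areas to be zero, we have:

\[
\bar{f} = \big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big](u + \bar{e}_{(s)}) = \big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big]
\begin{bmatrix} u_1 + \bar{e}_{(s)1} \\ \vdots \\ u_k + \bar{e}_{(s)k} \\ \vdots \\ u_K + \bar{e}_{(s)K} \end{bmatrix}.
\tag{4.38}
\]

In order to find $E_\xi(\bar{f})$, only the cross product for the $k$th area is required:

\[
E_\xi\Big\{\big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big](u_k + \bar{e}_{(s)k})\Big\} = (\gamma_k - 1)\sigma_u^2 + \gamma_k\frac{\sigma_e^2}{n_k} - \frac{\sigma_e^2}{N_k}.
\tag{4.39}
\]

If $\gamma_k = \sigma_u^2 / \big(\sigma_u^2 + (\sigma_e^2/n_k)\big)$, we have:

\[
E_\xi\Big\{\big[\gamma_k(u_k + \bar{e}_{(s)k}) - u_k - \bar{e}_k\big](u_k + \bar{e}_{(s)k})\Big\} = (\gamma_k - 1)\sigma_u^2 + \gamma_k\frac{\sigma_e^2}{n_k} - \frac{\sigma_e^2}{N_k} = -\frac{\sigma_e^2}{N_k}.
\tag{4.40}
\]

In conclusion, the general formula for the MSE of the BLUP based on the aggregated-level analysis for a given $\gamma_k$ is:

\[
\begin{aligned}
MSE_\xi\big(\hat{\bar{Y}}_k^{BLUP_A}\big) ={}& \sigma_u^2 - 2\gamma_k\sigma_u^2 + \gamma_k^2\Big(\sigma_u^2 + \frac{\sigma_e^2}{n_k}\Big) + (1-2\gamma_k)\frac{\sigma_e^2}{N_k} \\
&+ (\bar{X}_k - \gamma_k\bar{x}_k)'\,E\big[(\hat\beta^{A} - \beta)(\hat\beta^{A} - \beta)'\big]\,(\bar{X}_k - \gamma_k\bar{x}_k) \\
&+ 2(\bar{X}_k - \gamma_k\bar{x}_k)'\Big\{\bar{d}\Big[(\gamma_k - 1)\sigma_u^2 + \gamma_k\frac{\sigma_e^2}{n_k} - \frac{\sigma_e^2}{N_k}\Big]\Big\}.
\end{aligned}
\tag{4.41}
\]

When $\gamma_k = \sigma_u^2 / \big(\sigma_u^2 + (\sigma_e^2/n_k)\big)$, and considering the area population sizes to be large, we have:

\[
MSE_\xi\big(\hat{\bar{Y}}_k^{BLUP_A}\big) = (1-\gamma_k)\sigma_u^2 + (\bar{X}_k - \gamma_k\bar{x}_k)'\,E\big[(\hat\beta^{A} - \beta)(\hat\beta^{A} - \beta)'\big]\,(\bar{X}_k - \gamma_k\bar{x}_k).
\tag{4.42}
\]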
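The large-population MSE in (4.34) and (4.42) has the same form for both levels of analysis and is straightforward to evaluate once a value for $E[(\hat\beta - \beta)(\hat\beta - \beta)']$ is available. The sketch below is illustrative only; the function name `blup_mse_large_population` and all numerical inputs are hypothetical.

```python
import numpy as np


def blup_mse_large_population(X_bar_k, x_bar_k, gamma_k, sigma_u2, beta_mse):
    """(1 - gamma_k) * sigma_u2 + w' M w with w = X_bar_k - gamma_k * x_bar_k.

    beta_mse is the matrix M = E[(beta_hat - beta)(beta_hat - beta)'], which
    contains both the variance and the squared bias of beta_hat.
    """
    weight = np.asarray(X_bar_k, dtype=float) - gamma_k * np.asarray(x_bar_k, dtype=float)
    return (1.0 - gamma_k) * sigma_u2 + float(weight @ beta_mse @ weight)


# Hypothetical inputs: intercept plus one covariate.
X_bar_k = np.array([1.0, 2.0])      # population area means (with intercept)
x_bar_k = np.array([1.0, 2.3])      # sample area means (with intercept)
beta_mse = np.diag([0.01, 0.02])    # assumed MSE matrix of beta_hat

mse_k = blup_mse_large_population(X_bar_k, x_bar_k, 0.5, 1.0, beta_mse)
# With beta known exactly, only the random-effect term remains.
mse_known_beta = blup_mse_large_population(X_bar_k, x_bar_k, 0.5, 1.0, np.zeros((2, 2)))
```

Passing a matrix that includes the squared bias of $\hat\beta$ shows directly how a misspecified working model inflates the second term while leaving $(1-\gamma_k)\sigma_u^2$ unchanged.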

These results show that, provided the appropriate $\gamma_k$ is used, a biased estimator of $\beta$ affects only the term involving $MSE(\hat\beta)$ in the MSE of the BLUP. In practice, $\gamma_k$ needs to be estimated, and using an incorrect model will affect the estimation of the variance components ($\sigma_e^2$ and $\sigma_u^2$) and hence $\hat\gamma_k$, which is used in calculating the resulting BLUP.

4.4 Model Comparison under Population Model 1

Assuming in this section that the linear mixed model presented in (3.1) is the actual population model, possible working models are discussed. The true unit-level and area-level sample models are defined in Table 4.3 for population model (3.1).

Table 4.3: True Sample Models under Population Model 1

Population Model 1 ($P_1$):                      $Y_{ik}^{(P_1)} = X_{ik}'\beta + u_k + e_{ik}$
True Unit-level Sample Model under ($P_1$):      $y_{ik}^{(W_1)} = x_{ik}'\beta + u_k + e_{ik}$
True Area-level Sample Model under ($P_1$):      $\bar{y}_k^{(W_2)} = \bar{x}_k'\beta + u_k + \bar{e}_k$

When $P_1$ is the actual population model, $W_1$ is the appropriate working model to be fitted to the unit-level sample data, while $W_2$ is the appropriate working model for area-level sample data. The bias, variance and MSE of the estimators of the population area means obtained from the different working models are discussed in the following subsections under $P_1$.

Prediction based on $W_1$ under $P_1$

A unit-level working model without any contextual effects, presented in (3.8), is considered as $W_1$ in Table 4.1. Individual-level sample information about the auxiliary variables is used in this model. The vector of regression coefficients $\beta$ is then estimated based on $W_1$ as follows:

\[
\hat\beta^{(W_1)} = \tilde\beta^{(W_1)} = (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}y.
\tag{4.43}
\]

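Then, considering $P_1$ presented in Table 4.3 as the actual population model, the resulting expectation and variance of $\tilde\beta^{(W_1)}$ are:

\[
\begin{aligned}
E_{\xi(P_1)}\big(\tilde\beta^{(W_1)}\big) &= (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\big[E_{\xi(P_1)}(y)\big] = (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}x\beta = \beta, \\
Var_{\xi(P_1)}\big(\tilde\beta^{(W_1)}\big) &= (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\big[Var_{\xi(P_1)}(y)\big]\big[(x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\big]' \\
&= (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\,\Sigma\,\big[(x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\big]' \\
&= (x'\Sigma^{-1}x)^{-1}x'\big[(x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\big]' = (x'\Sigma^{-1}x)^{-1}.
\end{aligned}
\tag{4.44}
\]

As can be seen in (4.44), the estimator of $\beta$ is unbiased using $W_1$ under $P_1$. Here, the variance components are assumed to be known. Replacing the actual variance components with their estimated values in the variance-covariance matrix ($\Sigma$) gives an estimator of this matrix ($\hat\Sigma$), which can be used in calculating other model parameters. The synthetic estimator for the $k$th area mean based on $W_1$ is then:

\[
\hat{\bar{Y}}_k^{Syn(W_1)} = \bar{X}_k'\hat\beta^{(W_1)}.
\tag{4.45}
\]

The bias, variance and MSE of the synthetic estimator presented in (4.45) under $P_1$ are:

\[
\begin{aligned}
Bias_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_1)}\big) &= E_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_1)} - \bar{Y}_k\big) = E_{\xi(P_1)}\big[\bar{X}_k'\hat\beta^{(W_1)} - (\bar{X}_k'\beta + u_k + \bar{e}_k)\big] \\
&= E_{\xi(P_1)}\big[\bar{X}_k'(\hat\beta^{(W_1)} - \beta)\big] - E_{\xi(P_1)}(u_k + \bar{e}_k) = \bar{X}_k'\,E_{\xi(P_1)}\big(\hat\beta^{(W_1)} - \beta\big) = 0, \\
Var_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_1)}\big) &= \bar{X}_k'\big[Var_{\xi(P_1)}(\tilde\beta^{(W_1)})\big]\bar{X}_k = \bar{X}_k'(x'\Sigma^{-1}x)^{-1}\bar{X}_k,
\end{aligned}
\tag{4.46}
\]

\[
\begin{aligned}
MSE_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_1)}\big) &= E_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_1)} - \bar{Y}_k\big)^2 = E_{\xi(P_1)}\big[\bar{X}_k'\hat\beta^{(W_1)} - (\bar{X}_k'\beta + u_k + \bar{e}_k)\big]^2 \\
&= E_{\xi(P_1)}\big[\bar{X}_k'(\hat\beta^{(W_1)} - \beta)\big]^2 + E_{\xi(P_1)}(u_k + \bar{e}_k)^2 \\
&= \bar{X}_k'\,E_{\xi(P_1)}\big[(\hat\beta^{(W_1)} - \beta)(\hat\beta^{(W_1)} - \beta)'\big]\,\bar{X}_k + \sigma_u^2 + \frac{\sigma_e^2}{N_k} \\
&= \bar{X}_k'(x'\Sigma^{-1}x)^{-1}\bar{X}_k + \sigma_u^2 + \frac{\sigma_e^2}{N_k}.
\end{aligned}
\]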
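The two properties in (4.44) and the synthetic-estimator variance in (4.46) can be verified on a small artificial design. This is a minimal sketch under stated assumptions: the area sizes, variance components, covariate values and the helper names `unit_level_sigma` and `gls_projector` are hypothetical, and the variance components are treated as known.

```python
import numpy as np


def unit_level_sigma(area_sizes, sigma_u2, sigma_e2):
    """Block-diagonal Sigma with blocks sigma_u2 * 1 1' + sigma_e2 * I."""
    n = sum(area_sizes)
    sigma = np.zeros((n, n))
    start = 0
    for n_k in area_sizes:
        block = sigma_u2 * np.ones((n_k, n_k)) + sigma_e2 * np.eye(n_k)
        sigma[start:start + n_k, start:start + n_k] = block
        start += n_k
    return sigma


def gls_projector(x, sigma):
    """Return A = (x' S^-1 x)^-1 x' S^-1 and V = (x' S^-1 x)^-1."""
    sigma_inv = np.linalg.inv(sigma)
    v = np.linalg.inv(x.T @ sigma_inv @ x)
    return v @ x.T @ sigma_inv, v


rng = np.random.default_rng(0)
area_sizes = [4, 6, 5]                       # hypothetical sample sizes n_k
n = sum(area_sizes)
x = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma = unit_level_sigma(area_sizes, sigma_u2=1.0, sigma_e2=4.0)

a, v = gls_projector(x, sigma)
# E(beta_tilde) = A x beta = beta requires A x = I.
unbiased = np.allclose(a @ x, np.eye(2))
# Var(beta_tilde) = A Sigma A' collapses to (x' Sigma^-1 x)^-1.
sandwich_collapses = np.allclose(a @ sigma @ a.T, v)

X_bar_k = np.array([1.0, 0.2])               # hypothetical population area mean
synthetic_variance = float(X_bar_k @ v @ X_bar_k)
```

The same projector applied under a population model other than $P_1$ would no longer satisfy $A\,E(y) = \beta$, which is the source of the bias terms derived later in the chapter.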

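The BLUP for the $k$th area mean based on $W_1$ is given as:

\[
\hat{\bar{Y}}_k^{BLUP(W_1)} = \tilde{\bar{Y}}_k^{(W_1)} = \bar{X}_k'\tilde\beta^{(W_1)} + \tilde{u}_k^{(W_1)} = \gamma_k\big[\bar{y}_k + (\bar{X}_k - \bar{x}_k)'\tilde\beta^{(W_1)}\big] + (1-\gamma_k)\bar{X}_k'\tilde\beta^{(W_1)}.
\tag{4.47}
\]

The expected value and variance calculated for $\hat\beta^{(W_1)}$ are two important factors in calculating the properties of synthetic estimators and BLUPs of different area means. The bias, variance and MSE of the BLUP presented in (4.47) are calculated under $P_1$ as follows:

\[
\begin{aligned}
Bias_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{BLUP(W_1)}\big) &= E_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{BLUP(W_1)} - \bar{Y}_k\big) \\
&= E_{\xi(P_1)}\big[\bar{X}_k'\tilde\beta^{(W_1)} + \gamma_k(\bar{y}_k - \bar{x}_k'\tilde\beta^{(W_1)}) - \bar{X}_k'\beta - u_k - \bar{e}_k\big] \\
&= E_{\xi(P_1)}\big[\bar{X}_k'\tilde\beta^{(W_1)} + \gamma_k(\bar{y}_k - \bar{x}_k'\tilde\beta^{(W_1)} + \bar{x}_k'\beta - \bar{x}_k'\beta) - \bar{X}_k'\beta - u_k - \bar{e}_k\big] \\
&= E_{\xi(P_1)}\big[(\bar{X}_k - \gamma_k\bar{x}_k)'(\tilde\beta^{(W_1)} - \beta)\big] + E_{\xi(P_1)}\big[\gamma_k(\bar{y}_k - \bar{x}_k'\beta) - u_k\big] - E_{\xi(P_1)}(\bar{e}_k) \\
&= (\bar{X}_k - \gamma_k\bar{x}_k)'\big[Bias_{\xi(P_1)}(\tilde\beta^{(W_1)})\big] = 0, \\
Var_{\xi(P_1)}\big(\tilde{\bar{Y}}_k^{(W_1)}\big) &= \gamma_k\sigma_u^2 - \gamma_k^2\,\bar{x}_k'(x'\Sigma^{-1}x)^{-1}\bar{x}_k + \bar{X}_k'(x'\Sigma^{-1}x)^{-1}\bar{X}_k, \\
MSE_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{BLUP(W_1)}\big) &= MSE_{\xi(P_1)}\big(\tilde{\bar{Y}}_k^{(W_1)}\big) = E_{\xi(P_1)}\big(\tilde{\bar{Y}}_k^{(W_1)} - \bar{Y}_k\big)^2 \\
&= (\bar{X}_k - \gamma_k\bar{x}_k)'\big[MSE_{\xi(P_1)}(\tilde\beta^{(W_1)})\big](\bar{X}_k - \gamma_k\bar{x}_k) + (1-\gamma_k)\sigma_u^2 \\
&= (\bar{X}_k - \gamma_k\bar{x}_k)'(x'\Sigma^{-1}x)^{-1}(\bar{X}_k - \gamma_k\bar{x}_k) + (1-\gamma_k)\sigma_u^2.
\end{aligned}
\tag{4.48}
\]

A detailed derivation of the variance given in (4.48) can be found in A.5. Usually, the population sizes of the different areas are quite large, which makes the term $\sigma_e^2/N_k$ negligible in practice. The MSE is obtained directly from (4.34). The variance and MSE of an estimator of the $k$th area mean are defined as follows:

\[
MSE\big(\hat{\bar{Y}}_k\big) = E\big[(\hat{\bar{Y}}_k - \bar{Y}_k)^2\big], \qquad
Var\big(\hat{\bar{Y}}_k\big) = E\Big\{\big[\hat{\bar{Y}}_k - E(\hat{\bar{Y}}_k)\big]^2\Big\}.
\tag{4.49}
\]

Here, the variance measures the error relative to $E(\hat{\bar{Y}}_k)$, whereas the MSE measures the error in estimating the actual population area mean $\bar{Y}_k$. Hence, the variance and MSE calculated for an unbiased estimator may not be exactly the same.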
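The two forms of the BLUP in (4.47), the synthetic part plus a predicted random effect and the $\gamma_k$-weighted composite, can be compared directly. The following sketch is illustrative: the function names and the numerical inputs are hypothetical, and $\tilde\beta$ is taken as given rather than estimated.

```python
import numpy as np


def blup_from_random_effect(y_bar_k, x_bar_k, X_bar_k, beta, gamma_k):
    """X_bar_k' beta + u_tilde_k with u_tilde_k = gamma_k * (y_bar_k - x_bar_k' beta)."""
    u_tilde_k = gamma_k * (y_bar_k - x_bar_k @ beta)
    return X_bar_k @ beta + u_tilde_k


def blup_composite(y_bar_k, x_bar_k, X_bar_k, beta, gamma_k):
    """gamma_k * [y_bar_k + (X_bar_k - x_bar_k)' beta] + (1 - gamma_k) * X_bar_k' beta."""
    survey_regression = y_bar_k + (X_bar_k - x_bar_k) @ beta
    synthetic = X_bar_k @ beta
    return gamma_k * survey_regression + (1.0 - gamma_k) * synthetic


# Hypothetical inputs: intercept plus one covariate.
beta = np.array([2.0, 0.5])
x_bar_k = np.array([1.0, 3.1])     # sample area means
X_bar_k = np.array([1.0, 2.8])     # population area means
y_bar_k = 3.9
sigma_u2, sigma_e2, n_k = 1.0, 4.0, 10
gamma_k = sigma_u2 / (sigma_u2 + sigma_e2 / n_k)

blup_a = blup_from_random_effect(y_bar_k, x_bar_k, X_bar_k, beta, gamma_k)
blup_b = blup_composite(y_bar_k, x_bar_k, X_bar_k, beta, gamma_k)
```

The composite form makes the role of $\gamma_k$ explicit: as $n_k$ grows, $\gamma_k$ approaches 1 and the predictor moves from the synthetic estimator towards the regression-adjusted sample mean.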

As shown in Table 4.3, $W_1$ is the true unit-level sample model when $P_1$ is the actual population model. Therefore, the resulting estimates of the regression coefficients are unbiased in this case, as shown in (4.44). Synthetic estimators and BLUPs based on $W_1$ under $P_1$ are unbiased as well (see (4.46) and (4.48)).

Prediction based on $W_2$ under $P_1$

An area-level working model without any contextual effects is considered as $W_2$ (Table 4.1). Aggregated-level sample information about the auxiliary variables is used in $W_2$ as follows:

\[
\bar{y}^{(W_2)} = \bar{x}\beta + u + \bar{e}; \qquad u \sim N(0, \sigma_u^2 I_K), \quad \bar{e} \sim N\Big(0, \mathrm{diag}\Big(\frac{\sigma_e^2}{n_1}, \ldots, \frac{\sigma_e^2}{n_K}\Big)\Big).
\tag{4.50}
\]

The vector of regression coefficients $\beta$ is then estimated, based on $W_2$, as follows:

\[
\hat\beta^{(W_2)} = \tilde\beta^{(W_2)} = (\bar{x}'\Sigma^{-1}\bar{x})^{-1}\bar{x}'\Sigma^{-1}\bar{y}.
\tag{4.51}
\]

Then, considering $P_1$ presented in Table 4.3 as the actual population model, the resulting expectation and variance of $\tilde\beta^{(W_2)}$ are calculated as below:

\[
\begin{aligned}
E_{\xi(P_1)}\big(\tilde\beta^{(W_2)}\big) &= (\bar{x}'\Sigma^{-1}\bar{x})^{-1}\bar{x}'\Sigma^{-1}\big[E_{\xi(P_1)}(\bar{y})\big] = (\bar{x}'\Sigma^{-1}\bar{x})^{-1}\bar{x}'\Sigma^{-1}\bar{x}\beta = \beta, \\
Var_{\xi(P_1)}\big(\tilde\beta^{(W_2)}\big) &= (\bar{x}'\Sigma^{-1}\bar{x})^{-1}\bar{x}'\Sigma^{-1}\big[Var_{\xi(P_1)}(\bar{y})\big]\big[(\bar{x}'\Sigma^{-1}\bar{x})^{-1}\bar{x}'\Sigma^{-1}\big]' \\
&= (\bar{x}'\Sigma^{-1}\bar{x})^{-1}\bar{x}'\Sigma^{-1}\,\Sigma\,\big[(\bar{x}'\Sigma^{-1}\bar{x})^{-1}\bar{x}'\Sigma^{-1}\big]' \\
&= (\bar{x}'\Sigma^{-1}\bar{x})^{-1}\bar{x}'\big[(\bar{x}'\Sigma^{-1}\bar{x})^{-1}\bar{x}'\Sigma^{-1}\big]' = (\bar{x}'\Sigma^{-1}\bar{x})^{-1}.
\end{aligned}
\tag{4.52}
\]

As can be seen in (4.52), the estimator of $\beta$ is unbiased using $W_2$ under $P_1$. A synthetic estimator for the $k$th area mean based on $W_2$ is:

\[
\hat{\bar{Y}}_k^{Syn(W_2)} = \bar{X}_k'\hat\beta^{(W_2)}.
\tag{4.53}
\]

The bias, variance and MSE of the synthetic estimator presented in (4.53) are then

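calculated under $P_1$ as below:

\[
\begin{aligned}
Bias_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_2)}\big) &= E_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_2)} - \bar{Y}_k\big) = E_{\xi(P_1)}\big[\bar{X}_k'\hat\beta^{(W_2)} - (\bar{X}_k'\beta + u_k + \bar{e}_k)\big] \\
&= E_{\xi(P_1)}\big[\bar{X}_k'(\hat\beta^{(W_2)} - \beta)\big] - E_{\xi(P_1)}(u_k + \bar{e}_k) = \bar{X}_k'\,E_{\xi(P_1)}\big(\hat\beta^{(W_2)} - \beta\big) = 0, \\
Var_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_2)}\big) &= \bar{X}_k'\big[Var_{\xi(P_1)}(\tilde\beta^{(W_2)})\big]\bar{X}_k = \bar{X}_k'(\bar{x}'\Sigma^{-1}\bar{x})^{-1}\bar{X}_k,
\end{aligned}
\tag{4.54}
\]

\[
\begin{aligned}
MSE_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_2)}\big) &= E_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_2)} - \bar{Y}_k\big)^2 = E_{\xi(P_1)}\big[\bar{X}_k'\hat\beta^{(W_2)} - (\bar{X}_k'\beta + u_k + \bar{e}_k)\big]^2 \\
&= E_{\xi(P_1)}\big[\bar{X}_k'(\hat\beta^{(W_2)} - \beta)\big]^2 + E_{\xi(P_1)}(u_k + \bar{e}_k)^2 \\
&= \bar{X}_k'\,E_{\xi(P_1)}\big[(\hat\beta^{(W_2)} - \beta)(\hat\beta^{(W_2)} - \beta)'\big]\,\bar{X}_k + \sigma_u^2 + \frac{\sigma_e^2}{N_k} \\
&= \bar{X}_k'(\bar{x}'\Sigma^{-1}\bar{x})^{-1}\bar{X}_k + \sigma_u^2 + \frac{\sigma_e^2}{N_k}.
\end{aligned}
\]

The BLUP for the $k$th area mean based on $W_2$ is given as:

\[
\hat{\bar{Y}}_k^{BLUP(W_2)} = \tilde{\bar{Y}}_k^{(W_2)} = \bar{X}_k'\tilde\beta^{(W_2)} + \tilde{u}_k^{(W_2)} = \gamma_k\big[\bar{y}_k + (\bar{X}_k - \bar{x}_k)'\tilde\beta^{(W_2)}\big] + (1-\gamma_k)\bar{X}_k'\tilde\beta^{(W_2)}.
\tag{4.55}
\]

The bias, variance and MSE of the BLUP presented in (4.55) are calculated as follows:

\[
\begin{aligned}
Bias_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{BLUP(W_2)}\big) &= E_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{BLUP(W_2)} - \bar{Y}_k\big) = E_{\xi(P_1)}\big[\bar{X}_k'\tilde\beta^{(W_2)} + \tilde{u}_k^{(W_2)} - (\bar{X}_k'\beta + u_k + \bar{e}_k)\big] \\
&= E_{\xi(P_1)}\big[\bar{X}_k'(\tilde\beta^{(W_2)} - \beta)\big] + E_{\xi(P_1)}\big(\tilde{u}_k^{(W_2)} - u_k\big) - E_{\xi(P_1)}(\bar{e}_k) \\
&= \bar{X}_k'\,E_{\xi(P_1)}\big(\tilde\beta^{(W_2)} - \beta\big) = 0, \\
Var_{\xi(P_1)}\big(\tilde{\bar{Y}}_k^{(W_2)}\big) &= \gamma_k^2\Big(\sigma_u^2 + \frac{\sigma_e^2}{n_k}\Big) - \gamma_k^2\,\bar{x}_k'(\bar{x}'\Sigma^{-1}\bar{x})^{-1}\bar{x}_k + \bar{X}_k'(\bar{x}'\Sigma^{-1}\bar{x})^{-1}\bar{X}_k, \\
MSE_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{BLUP(W_2)}\big) &= MSE_{\xi(P_1)}\big(\tilde{\bar{Y}}_k^{(W_2)}\big) = E_{\xi(P_1)}\big(\tilde{\bar{Y}}_k^{(W_2)} - \bar{Y}_k\big)^2 \\
&= (\bar{X}_k - \gamma_k\bar{x}_k)'\big[MSE_{\xi(P_1)}(\tilde\beta^{(W_2)})\big](\bar{X}_k - \gamma_k\bar{x}_k) + (1-\gamma_k)\sigma_u^2 \\
&= (\bar{X}_k - \gamma_k\bar{x}_k)'(\bar{x}'\Sigma^{-1}\bar{x})^{-1}(\bar{X}_k - \gamma_k\bar{x}_k) + (1-\gamma_k)\sigma_u^2.
\end{aligned}
\tag{4.56}
\]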

A detailed derivation of the variance given in (4.56) can be found in A.6, and the MSE is calculated based on (4.42).

As shown in Table 4.3, $W_1$ and $W_2$ are the true unit-level and area-level sample models, respectively, when $P_1$ is assumed to be the actual population model. Selected properties of the resulting estimates based on $W_1$ were previously discussed in Subsection 3.1. In the case of $W_2$, the resulting estimates of the regression coefficients are again unbiased, as shown in (4.52). Considering area means to be the target of inference, synthetic estimators and BLUPs based on $W_2$ under $P_1$ are also unbiased (see (4.54) and (4.56)).

As mentioned, the resulting estimates based on $W_1$ and $W_2$ are unbiased under $P_1$. However, the resulting synthetic estimators and EBLUPs for a given area have different MSEs under $W_1$ and $W_2$. This difference is due to the difference between the variances of the estimator of $\beta$ in each case:

\[
Var_{\xi(P_1)}\big(\tilde\beta^{(W_1)}\big) = (x'\Sigma^{-1}x)^{-1}, \qquad
Var_{\xi(P_1)}\big(\tilde\beta^{(W_2)}\big) = (\bar{x}'\Sigma^{-1}\bar{x})^{-1}.
\]

Prediction based on $W_3$ under $P_1$

Another area-level working model without any contextual effects is considered in Table 4.1 as $W_3$. This model uses only the aggregated-level population information about the auxiliary variables to model the area means, as follows:

\[
\bar{y}^{(W_3)} = \bar{X}\beta + u + \bar{e}; \qquad u \sim N(0, \sigma_u^2 I_K), \quad \bar{e} \sim N\Big(0, \mathrm{diag}\Big(\frac{\sigma_e^2}{n_1}, \ldots, \frac{\sigma_e^2}{n_K}\Big)\Big).
\tag{4.57}
\]

The vector of regression coefficients $\beta$ is then estimated based on $W_3$ as below:

\[
\hat\beta^{(W_3)} = \tilde\beta^{(W_3)} = (\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}'\Sigma^{-1}\bar{y}.
\tag{4.58}
\]
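The difference between the variances of $\tilde\beta^{(W_1)}$ and $\tilde\beta^{(W_2)}$ under $P_1$ can be examined on an artificial design by aggregating the unit-level data to area means. The sketch below is illustrative only: the helper names, area sizes, variance components and covariate values are hypothetical, and the variance components are treated as known.

```python
import numpy as np


def unit_level_sigma(area_sizes, sigma_u2, sigma_e2):
    """Block-diagonal covariance with blocks sigma_u2 * 1 1' + sigma_e2 * I."""
    n = sum(area_sizes)
    sigma = np.zeros((n, n))
    start = 0
    for n_k in area_sizes:
        sigma[start:start + n_k, start:start + n_k] = (
            sigma_u2 * np.ones((n_k, n_k)) + sigma_e2 * np.eye(n_k))
        start += n_k
    return sigma


def aggregation_matrix(area_sizes):
    """K x n matrix that maps unit-level values to sample area means."""
    a = np.zeros((len(area_sizes), sum(area_sizes)))
    start = 0
    for k, n_k in enumerate(area_sizes):
        a[k, start:start + n_k] = 1.0 / n_k
        start += n_k
    return a


def gls_variance(design, sigma):
    """(design' sigma^-1 design)^-1."""
    return np.linalg.inv(design.T @ np.linalg.solve(sigma, design))


rng = np.random.default_rng(1)
area_sizes = [3, 5, 4, 6, 8]                 # hypothetical n_k
sigma_u2, sigma_e2 = 1.0, 4.0
n = sum(area_sizes)
x = np.column_stack([np.ones(n), rng.normal(size=n)])

sigma = unit_level_sigma(area_sizes, sigma_u2, sigma_e2)
a = aggregation_matrix(area_sizes)
x_bar = a @ x                                # sample area means
sigma_bar = a @ sigma @ a.T                  # covariance of the area means

var_w1 = gls_variance(x, sigma)              # unit-level analysis
var_w2 = gls_variance(x_bar, sigma_bar)      # area-level analysis
# Non-negative eigenvalues mean var_w2 - var_w1 is positive semi-definite.
smallest_eigenvalue = np.linalg.eigvalsh(var_w2 - var_w1).min()
```

Because the area-level estimator is a linear unbiased function of the unit-level data, the unit-level GLS variance cannot exceed it, which is consistent with the smaller variances usually obtained from individual-level analysis when the model is correctly specified.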

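Then, considering $P_1$ presented in Table 4.3 as the actual population model, the resulting expectation and variance of $\tilde\beta^{(W_3)}$ are calculated as below:

\[
\begin{aligned}
\hat\beta^{(W_3)} &= \tilde\beta^{(W_3)} = (\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}'\Sigma^{-1}\bar{y}, \\
E_{\xi(P_1)}\big(\tilde\beta^{(W_3)}\big) &= (\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}'\Sigma^{-1}\big[E_{\xi(P_1)}(\bar{y})\big] = (\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}'\Sigma^{-1}\bar{x}\beta \\
&= (\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}'\Sigma^{-1}\big[(\bar{x} - \bar{X}) + \bar{X}\big]\beta \\
&= (\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}'\Sigma^{-1}(\bar{x} - \bar{X})\beta + \beta,
\end{aligned}
\tag{4.59}
\]

\[
\begin{aligned}
Var_{\xi(P_1)}\big(\tilde\beta^{(W_3)}\big) &= (\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}'\Sigma^{-1}\big[Var_{\xi(P_1)}(\bar{y})\big]\big[(\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}'\Sigma^{-1}\big]' \\
&= (\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}'\Sigma^{-1}\,\Sigma\,\big[(\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}'\Sigma^{-1}\big]' \\
&= (\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}'\big[(\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}'\Sigma^{-1}\big]' = (\bar{X}'\Sigma^{-1}\bar{X})^{-1}.
\end{aligned}
\]

As can be seen in (4.59), the estimator of $\beta$ is biased using $W_3$ under $P_1$, conditional on the sample. If the design-based expectation of the difference between the $k$th population and sample area means is equal to zero ($E_D(\bar{X}_k - \bar{x}_k) = 0$), then $E_D\big[E_{\xi(P_1)}(\tilde\beta^{(W_3)})\big] = \beta$. This is essentially the justification of the Fay-Herriot approach. The synthetic estimator for the $k$th area mean based on $W_3$ is then:

\[
\hat{\bar{Y}}_k^{Syn(W_3)} = \bar{X}_k'\hat\beta^{(W_3)}.
\tag{4.60}
\]

The variances of biased estimators are presented so that we can see how the variance, as well as the bias, is affected by using an incorrect model. The bias, variance and MSE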

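of the synthetic estimator presented in (4.60) are then calculated under $P_1$ as below:

\[
\begin{aligned}
Bias_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_3)}\big) &= E_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_3)} - \bar{Y}_k\big) = E_{\xi(P_1)}\big[\bar{X}_k'\hat\beta^{(W_3)} - (\bar{X}_k'\beta + u_k + \bar{e}_k)\big] \\
&= E_{\xi(P_1)}\big[\bar{X}_k'(\hat\beta^{(W_3)} - \beta)\big] - E_{\xi(P_1)}(u_k + \bar{e}_k) = \bar{X}_k'\,E_{\xi(P_1)}\big(\hat\beta^{(W_3)} - \beta\big) \\
&= \bar{X}_k'(\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}'\Sigma^{-1}(\bar{x} - \bar{X})\beta, \\
Var_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_3)}\big) &= \bar{X}_k'\big[Var_{\xi(P_1)}(\tilde\beta^{(W_3)})\big]\bar{X}_k = \bar{X}_k'(\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}_k,
\end{aligned}
\tag{4.61}
\]

\[
\begin{aligned}
MSE_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_3)}\big) &= E_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{Syn(W_3)} - \bar{Y}_k\big)^2 = E_{\xi(P_1)}\big[\bar{X}_k'\hat\beta^{(W_3)} - (\bar{X}_k'\beta + u_k + \bar{e}_k)\big]^2 \\
&= E_{\xi(P_1)}\big[\bar{X}_k'(\hat\beta^{(W_3)} - \beta)\big]^2 + E_{\xi(P_1)}(u_k + \bar{e}_k)^2 \\
&= \bar{X}_k'\,E_{\xi(P_1)}\big[(\hat\beta^{(W_3)} - \beta)(\hat\beta^{(W_3)} - \beta)'\big]\,\bar{X}_k + \sigma_u^2 + \frac{\sigma_e^2}{N_k}.
\end{aligned}
\]

Note that:

\[
E_{\xi(P_1)}\big[(\hat\beta^{(W_3)} - \beta)(\hat\beta^{(W_3)} - \beta)'\big] = Var_{\xi(P_1)}\big(\hat\beta^{(W_3)}\big) + E_{\xi(P_1)}\big(\hat\beta^{(W_3)} - \beta\big)\,E_{\xi(P_1)}\big(\hat\beta^{(W_3)} - \beta\big)'.
\]

The BLUP for the $k$th area mean based on $W_3$ is given as:

\[
\hat{\bar{Y}}_k^{BLUP(W_3)} = \tilde{\bar{Y}}_k^{(W_3)} = \bar{X}_k'\tilde\beta^{(W_3)} + \tilde{u}_k^{(W_3)}
\tag{4.62}
\]

\[
= \gamma_k\bar{y}_k + (1-\gamma_k)\bar{X}_k'\tilde\beta^{(W_3)}.
\tag{4.63}
\]

The bias, variance and MSE of the BLUP presented in (4.63) are calculated as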

follows:

\[
\begin{aligned}
Bias_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{BLUP(W_3)}\big) &= E_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{BLUP(W_3)} - \bar{Y}_k\big) = E_{\xi(P_1)}\big[\bar{X}_k'\tilde\beta^{(W_3)} + \tilde{u}_k^{(W_3)} - (\bar{X}_k'\beta + u_k + \bar{e}_k)\big] \\
&= E_{\xi(P_1)}\big[\bar{X}_k'(\tilde\beta^{(W_3)} - \beta)\big] + E_{\xi(P_1)}\big(\tilde{u}_k^{(W_3)} - u_k\big) - E_{\xi(P_1)}(\bar{e}_k) \\
&= \bar{X}_k'\,E_{\xi(P_1)}\big(\tilde\beta^{(W_3)} - \beta\big), \\
Var_{\xi(P_1)}\big(\tilde{\bar{Y}}_k^{(W_3)}\big) &= \gamma_k\sigma_u^2 + (1-\gamma_k^2)\,\bar{X}_k'(\bar{X}'\Sigma^{-1}\bar{X})^{-1}\bar{X}_k,
\end{aligned}
\tag{4.64}
\]

\[
\begin{aligned}
MSE_{\xi(P_1)}\big(\hat{\bar{Y}}_k^{BLUP(W_3)}\big) &= MSE_{\xi(P_1)}\big(\tilde{\bar{Y}}_k^{(W_3)}\big) = E_{\xi(P_1)}\big(\tilde{\bar{Y}}_k^{(W_3)} - \bar{Y}_k\big)^2 \\
&= (1-\gamma_k)^2\,\bar{X}_k'\big[MSE_{\xi(P_1)}(\tilde\beta^{(W_3)})\big]\bar{X}_k + (1-\gamma_k)\sigma_u^2.
\end{aligned}
\]

A detailed derivation of the variance given in (4.64) can be found in A.7, and the MSE is calculated based on (4.42).

The working model $W_3$ is not a true sample model when $P_1$ is assumed to be the true population model. The differences between the population and sample area means are important in calculating the bias in estimating the model parameters. When the sample area means are close to the actual population area means, the resulting estimate of $\beta$ is less biased (see (4.59)). However, given the small sample sizes in some areas, the differences between the sample and population area means for such areas may be substantial. Considering area means to be the target of inference, the bias of $\tilde\beta^{(W_3)}$ is the main factor in calculating the bias of the synthetic estimators and BLUPs based on $W_3$ (see (4.61) and (4.64)).

Prediction based on $W_4$ under $P_1$

The unit-level contextual model presented in (4.16) is considered as $W_4$ in Table 4.1. Both individual- and aggregated-level sample information is used in this model. The matrix form of this model is:

\[
y^{(W_4)} = x^{*}\beta^{*} + Zu + e; \qquad u \sim N(0, \sigma_u^{*2} I_K), \quad e \sim N(0, \sigma_e^{*2} I_n),
\tag{4.65}
\]

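where:

\[
x^{*} = \big[\,x \;\; \bar{X}^{*}\,\big], \qquad
\bar{X}^{*} = \big[\,\bar{X}_1^{*\prime} \;\; \bar{X}_2^{*\prime} \;\; \ldots \;\; \bar{X}_K^{*\prime}\,\big]', \qquad
\bar{X}_k^{*} = \underbrace{\big[\,\bar{x}_k \;\; \bar{x}_k \;\; \ldots \;\; \bar{x}_k\,\big]'}_{n_k}.
\tag{4.66}
\]

The vector of regression coefficients $\beta^{*}$ is then estimated based on $W_4$ as below:

\[
\hat\beta^{*(W_4)} = \tilde\beta^{*(W_4)} = (x^{*\prime}\Sigma^{*-1}x^{*})^{-1}x^{*\prime}\Sigma^{*-1}y,
\tag{4.67}
\]

where:

\[
\begin{aligned}
\Sigma^{*} &= \mathrm{diag}\big(\sigma_u^{*2}1_{n_1}1_{n_1}' + \sigma_e^{*2}I_{n_1}, \ldots, \sigma_u^{*2}1_{n_K}1_{n_K}' + \sigma_e^{*2}I_{n_K}\big), \\
\Sigma^{*-1} &= \mathrm{diag}\Big(-\frac{\sigma_e^{*-2}\sigma_u^{*2}}{\sigma_e^{*2} + n_1\sigma_u^{*2}}1_{n_1}1_{n_1}' + \sigma_e^{*-2}I_{n_1}, \ldots, -\frac{\sigma_e^{*-2}\sigma_u^{*2}}{\sigma_e^{*2} + n_K\sigma_u^{*2}}1_{n_K}1_{n_K}' + \sigma_e^{*-2}I_{n_K}\Big) \\
&= \sigma_e^{*-2}\,\mathrm{diag}\Big(I_{n_1} - \frac{\gamma_1^{*}}{n_1}1_{n_1}1_{n_1}', \ldots, I_{n_K} - \frac{\gamma_K^{*}}{n_K}1_{n_K}1_{n_K}'\Big),
\end{aligned}
\tag{4.68}
\]

and

\[
\gamma_k^{*} = \frac{\sigma_u^{*2}}{\sigma_u^{*2} + \sigma_e^{*2}/n_k}.
\tag{4.69}
\]

Here, we assume that the variance components within the contextual model may differ from the variance components in a model without any area-level or contextual effects. When $\sigma_e^{*} = \sigma_e$ and $\sigma_u^{*} = \sigma_u$, we have $\Sigma^{*} = \Sigma$ and $\gamma_k^{*} = \gamma_k$. When $P_1$ presented in Table 4.3 is considered as the actual population model, the resulting expectation and variance of $\hat\beta^{*(W_4)}$ are calculated as below:

\[
\begin{aligned}
E_{\xi(P_1)}\big(\hat\beta^{*(W_4)}\big) &= (x^{*\prime}\Sigma^{*-1}x^{*})^{-1}x^{*\prime}\Sigma^{*-1}E_{\xi(P_1)}(y) = (x^{*\prime}\Sigma^{*-1}x^{*})^{-1}x^{*\prime}\Sigma^{*-1}x\beta \\
&= (x^{*\prime}\Sigma^{*-1}x^{*})^{-1}x^{*\prime}\Sigma^{*-1}x^{*}\begin{bmatrix}\beta \\ 0\end{bmatrix} = \begin{bmatrix}\beta \\ 0\end{bmatrix}, \\
Var_{\xi(P_1)}\big(\hat\beta^{*(W_4)}\big) &= (x^{*\prime}\Sigma^{*-1}x^{*})^{-1}x^{*\prime}\Sigma^{*-1}\big[Var_{\xi(P_1)}(y)\big]\big[(x^{*\prime}\Sigma^{*-1}x^{*})^{-1}x^{*\prime}\Sigma^{*-1}\big]' \\
&= (x^{*\prime}\Sigma^{*-1}x^{*})^{-1}x^{*\prime}\Sigma^{*-1}\,\Sigma\,\big[(x^{*\prime}\Sigma^{*-1}x^{*})^{-1}x^{*\prime}\Sigma^{*-1}\big]' \\
&= (x^{*\prime}\Sigma^{*-1}x^{*})^{-1}x^{*\prime}\Sigma^{*-1}\,\Sigma\,\Sigma^{*-1}x^{*}(x^{*\prime}\Sigma^{*-1}x^{*})^{-1}.
\end{aligned}
\tag{4.70}
\]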
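The closed-form inverse of each diagonal block of the unit-level covariance matrix in (4.68) can be checked against a direct numerical inverse. This is a minimal sketch with hypothetical variance components and block size; the function names are illustrative and not taken from the thesis.

```python
import numpy as np


def sigma_block(n_k, sigma_u2, sigma_e2):
    """One diagonal block: sigma_u2 * 1 1' + sigma_e2 * I."""
    return sigma_u2 * np.ones((n_k, n_k)) + sigma_e2 * np.eye(n_k)


def sigma_block_inverse(n_k, sigma_u2, sigma_e2):
    """Closed form: (I - (gamma_k / n_k) * 1 1') / sigma_e2."""
    gamma_k = sigma_u2 / (sigma_u2 + sigma_e2 / n_k)
    return (np.eye(n_k) - (gamma_k / n_k) * np.ones((n_k, n_k))) / sigma_e2


# Hypothetical values, chosen only for illustration.
n_k, sigma_u2, sigma_e2 = 7, 1.5, 4.0

block = sigma_block(n_k, sigma_u2, sigma_e2)
closed_form = sigma_block_inverse(n_k, sigma_u2, sigma_e2)
product = block @ closed_form                 # should equal the identity
numerical = np.linalg.inv(block)              # direct inverse for comparison
```

Using the closed form avoids inverting an $n \times n$ matrix when fitting unit-level models to large samples, since each area contributes only a rank-one correction to a scaled identity.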

115 Chapter 4 Contextual Effects in Modeling for SAE Note that, ˆβ = ˆβ 0 ( ) ˆβ I ( ) ] ˆβ C, and ˆβ = ˆβ 0 ( ) ˆβ I ( ) ] + ˆβ C. Therefore, when E ξ(p1 )( ˆβ (W 4 ) ) = ( 0 β ), then we have: Eξ(P1 )( ˆβ (W 4 ) ) = β. The parameter estimate for β is unbiased using W 4 under P 1 as shown in (4.70). However, the unnecessary inclusion of the area-level effect will lead to increased variance of the estimated regression coefficients. The synthetic estimator for th area mean is: Ȳ Syn(W 4) = X ˆβ (W 4) (4.71) The bias, variance and MSE of the synthetic estimator presented in (4.71) are then calculated under P 1 as below: Bias ( Ȳ Syn(W 4) ξ(p1 ) ) = E ( Ȳ Syn(W 4) ξ(p1 ) Ȳ) = E ξ(p1 ) X ˆβ (W4) ( X β + u + ē ) ] = E ξ(p1 ) X ( ˆβ (W4) β) ] E (u ξ(p1 ) + ē ) = X E ξ(p1 ) ( ˆβ (W 4 ) β) ] = 0 V ar ( Ȳ Syn(W 4) ξ(p1 ) ) = X V ar ξ(p1 )] ) (β (W 4) X (4.72) MSE ( Ȳ Syn(W 4) ξ(p1 ) ) = E ( Ȳ Syn(W 4) ξ(p1 ) Ȳ) 2 = E ξ(p1 ) X ˆβ (W4) ( X β + u + ē ) ] 2 = E ξ(p1 ) X ( ˆβ (W4) β) ] 2 + Eξ(P1 (u ) + ē ) 2 = X Eξ(P1 ) ( ˆβ (W 4) β) 2] X + σ 2 u + σ2 e N The BLUP for the th area mean based on W 4 is given as: Ȳ BLUP (W 4) = Ȳ (W 4) = X β (W 4) (W 4) + ũ (4.73) The bias, variance and MSE of the BLUP presented in (4.73) are calculated as 102

116 Chapter 4 Contextual Effects in Modeling for SAE follows: Bias ξ(p1 ) ( ) Ȳ BLUP (W 4) = E ( Ȳ (W 4) ξ(p1 ) Ȳ) = E ξ(p1 ) X β (W4) + ũ (W4) ( X β + u + ē ) ] = E ξ(p1 ) X ( β (W4) β) ] E ξ(p1 ) (ũ (W 4) u ) E (ē ξ(p1 ) ) = X E ξ(p1 ) ( β (W 4 ) β) ] = 0 MSE ξ(p1 ) ( Ȳ BLUP (W 4) ) (W 4) ) (W 4) = MSEξ(P1)( Ȳ = Eξ(P1)( Ȳ ) 2 Ȳ = ( X γ x ) MSE ξ(p1 ) ( β (W 4) ) ] ( X γ x ) + (1 γ )σ 2 u The MSE is obtained from (4.34). (4.74) The woring model W 4 is not the true sample model when P 1 is considered as the actual population model. Sample area means are considered in W 4 as contextual effects while these area level effects do not exist in the assumed population model (P 1 ). Although the estimated value for parameter β is still unbiased in this case under P 1, it is over-specified (4.70). In comparison with W 1 unbiased synthetic estimators and BLUPs for area means based on W 4 are expected to have larger MSE due to the over-fitting of model parameters (4.72 and 4.74). Collinearity is an important issue which should be considered in using W 4. The correlation between population and sample area means in estimation process based on unit-level contextual model W 4 is important and should be calculated considering the equations below: Cov(x i, x ) = Cov(x i, 1 n xi ) = 1 n V ar(xi ) + i j cov(x i, x j ) ] = 1 n V ar(xi ) + (n 1)ρ x V ar(x i ) ] = V ar(x i) n 1 + (n 1)ρ x ] Corr( x, X ) = V ar(x i ) n 1 + (n 1)ρ x ] V ar(x i ) V ar(x i) n = 1 n 1 + (n 1)ρ x ] (4.75) Prediction based on W 5 under P 1 Another unit-level contextual model is presented in (4.12). This model is considered in Table 4.1 as W 5. Individual-level sample information used while aggregated-level 103

117 Chapter 4 Contextual Effects in Modeling for SAE population information are also present in this model. Actual population area means are added to this model as possible contextual effects. W 5 can also be defined in a matrix form as follows: y (W 5) = X (s)β + Zu + e u N(0, σ 2 u I K) ; e N(0, σ 2 e I n) (4.76) where: X (s) = x X(s) ] ] X (s) = X (s)1 X (s)2... X (s)k ] (4.77) X (s) = X X... X }{{} n The vector of regression coefficient β is then estimated based on W 5 as below: ˆβ (W 5) = β (W 5) = (X (s)σ 1 X (s)) 1 X (s)σ 1 y. (4.78) Then, considering P 1 presented in Table 4.3 as the actual population model, resulting bias and variance for ˆβ (W 5) are calculated as below: E ξ(p1 ) ( ) ( ) ˆβ (W 5 ) = (X (s)σ 1 X (s)) 1 X (s)σ 1 E ξ(p1 ) y = (X (s)σ 1 X (s)) 1 X (s)σ 1 xβ = (X (s)σ 1 X (s)) 1 X ( ) = E ξ(p1 ˆβ (W 5) = β ) (s)σ 1 X (s) β = β 0 0 V ar ξ(p1 ) ( ) ˆβ (W 5 ) = (X (s)σ 1 X (s)) 1 X (s)σ 1 V ar ξ(p1 ) (y)(x (s)σ 1 X (s)) 1 X = (X (s)σ 1 X (s)) 1 X (s)σ 1 Σ (X (s)σ 1 X (s)) 1 X (s)σ 1 ] (s)σ 1 ] (4.79) The parameter estimate for β is unbiased using W 5 under P 1 as shown in (4.79). However, this model is over-specified as the area-level effect included in W 5 does not exist in the assumed actual population model. The over-estimation factor in this case maes the variance calculations for ˆβ (W 5) to be more challenging. Then, the synthetic estimator for th area mean is: Ȳ Syn(W 5) = X ˆβ (W 5). (4.80) 104

118 Chapter 4 Contextual Effects in Modeling for SAE The bias, variance and MSE of the synthetic estimator presented in (4.80) are then calculated under P 1 as below: Bias ( Ȳ Syn(W 5) ξ(p1 ) ) = E ( Ȳ Syn(W 5) ξ(p1 ) Ȳ) = E ξ(p1 ) X ˆβ (W5) ( X β + u + ē ) ] = E ξ(p1 ) X ( ˆβ (W5) β) ] E (u ξ(p1 ) + ē ) = X E ξ(p1 ) ( ˆβ (W 5 ) β) ] = 0, V ar ( Ȳ Syn(W 5) ξ(p1 ) ) = X V ar ξ(p1 ) (βw 5 )] X, (4.81) MSE ( Ȳ Syn(W 5) ξ(p1 ) ) = E ( Ȳ Syn(W 5) ξ(p1 ) Ȳ) 2 = E ξ(p1 ) X ˆβ (W5) ( X β + u + ē ) ] 2 = E ξ(p1 ) X ( ˆβ (W5) β) ] 2 + Eξ(P1 (u ) + ē ) 2 = X Eξ(P1 ) ( ˆβ (W 5) β) 2] X + σ 2 u + σ2 e N. The BLUP for th area mean based on W 5 is given as: Ȳ BLUP (W 5) = Ȳ (W 5) = X β (W 5) (W 5) + ũ. (4.82) The bias, variance and MSE of the BLUP presented in (4.82) are calculated as follows: Bias ξ(p1 ) ( ) Ȳ BLUP (W 5) = E ( Ȳ Syn(W 5) ξ(p1 ) Ȳ) = E ξ(p1 ) X β (W5) + ũ (W5) ( X β + u + ē ) ] = E ξ(p1 ) X ( β (W5) β) ] E ξ(p1 ) (ũ (W 5) u ) E (ē ξ(p1 ) ) = X E ξ(p1 ) ( β (W 5 ) β) ] = 0, MSE ξ(p1 ) ( Ȳ BLUP (W 5) ) (W 5) ) (W 5) = MSEξ(P1)( Ȳ = Eξ(P1)( Ȳ ) 2 Ȳ = ( X γ x ) MSE ( β (W5) ξ(p1 ) ] ( X ) γ x ) + (1 γ )σu 2 (4.83) The MSE is obtained directly from (4.34). The model W 5 is not a true sample model when P 1 is considered as the actual population model. Population area means are considered in W 5 as contextual effects while these area level effects do not exist in the assumed population model (P 1 ). 105

119 Chapter 4 Contextual Effects in Modeling for SAE Although the estimated value for parameter β is still unbiased in this case under P 1, it is overestimated (4.79). In comparison with W 1 unbiased synthetic estimators and BLUPs for area means based on W 5 have larger MSE due to overestimation of model parameters (4.81 and 4.83). Co-linearity is an important issue which should be considered in using W 5. The correlation between population and sample area means in estimation process based on unit-level contextual model W 5 is important and should be calculated considering the equations below: Cov(x i, X ) = Cov(X i + ε, 1 n Xi ) = 1 N V ar(xi ) + V ar(ε ) + i j cov(x i, X j ) ] = 1 N V ar(xi ) + V ar(ε ) + (N 1)ρ X V ar(x i ) ] Corr( x, X ) = 1 N V ar(xi ) + V ar(ε ) + (N 1)ρ X V ar(x i ) ] V ar(x i ) V ar(x i) N = V ar(ε ] ) N V ar(x i ) + (N 1)ρ X. (4.84) Prediction based on W 6 under P 1 The area-level contextual model introduced (4.14) is presented as W 6 in Table 4.1. The aggregated-level information on the sample and population area means are considered as the contextual effects in this model. A matrix form of this model is: Here, where: ȳ = X (s)β + u + ē u N(0, σ 2 u I K) ; ē N ( 0, diag( σ2 e n 1,..., σ 2 e n K ) ). (4.85) X (s) = x X], (4.86) X = X 1 X 2... X K]. (4.87) The vector of regression coefficient β is then estimated based on W 6 as below: ˆβ (W6) = β (W6) = ( X Σ (s) 1 X (s)) 1 X Σ (s) 1 ȳ, (4.88) 106

120 Chapter 4 Contextual Effects in Modeling for SAE where: ( ) Σ = diag σu 2 + σ2 e,..., σ u 2 n + σ2 e 1 n K ( ) Σ ( 1) n 1 n K = diag n 1 σu 2 +,..., σ2 e n K σu 2 +. σ2 e (4.89) Then, considering P 1 presented in Table 4.3 as the actual population model, resulting bias and variance for ˆβ (W 6) are calculated as below: E ξ(p1 ) ( ) β (W 6 ) = ( X Σ (s) 1 X (s)) 1 X Σ 1 (ȳ)] (s) E ξ(p1 ) = ( X Σ (s) 1 X (s)) 1 X (s) Σ 1 xβ = ( X Σ (s) 1 X (s)) 1 X Σ (s) 1 X (s) β 0 = β 0, V ar ξ(p1 )( β (W 6 ) ) = = ( X Σ (s) 1 X (s)) 1 X Σ (s) 1 V ar ξ(p1 (y)( X Σ ) (s) 1 X (s)) 1 X Σ (s) 1 ] = ( X Σ (s) 1 X (s)) 1 X Σ (s) 1 Σ ( X Σ (s) 1 X (s)) 1 X Σ ] (s) 1. (4.90) The parameter estimate for β is unbiased using W 6 under P 1 as shown in (4.90). However, this model is over-fitted as the area-level effect included in W 6 does not exist in the assumed actual population model. The over-fitting factor in this case maes the variance calculations for ˆβ (W 6) to be more challenging. Then, the synthetic estimator for th area mean is: Ȳ Syn(W 6) = X ˆβ (W 6). (4.91) The bias, variance and MSE of the synthetic estimator presented in (4.91) are then 107

121 Chapter 4 Contextual Effects in Modeling for SAE calculated under P 1 as below: Bias ( Ȳ Syn(W 6) ξ(p1 ) ) = E ( Ȳ Syn(W 6) ξ(p1 ) Ȳ) = E ξ(p1 ) X ˆβ (W6) ( X β + u + ē ) ] = E ξ(p1 ) X ( ˆβ (W6) β) ] E (u ξ(p1 ) + ē ) = X E ξ(p1 ) ( ˆβ (W 6 ) β) ] = 0, V ar ( Ȳ Syn(W 6) ξ(p1 ) ) = X V ar ξ(p1 ) (βw 6 )] X, (4.92) MSE ( Ȳ Syn(W 6) ξ(p1 ) ) = E ( Ȳ Syn(W 6) ξ(p1 ) Ȳ) 2 = E ξ(p1 ) X ˆβ (W6) ( X β + u + ē ) ] 2 = E ξ(p1 ) X ( ˆβ (W6) β) ] 2 + Eξ(P1 (u ) + ē ) 2 = X Eξ(P1 ) ( ˆβ (W 6) β) 2] X + σ 2 u + σ2 e N. The BLUP for th area mean based on W 6 is given as: Ȳ BLUP (W 6) = Ȳ (W 6) = X β (W 6) (W 6) + ũ. (4.93) Bias, variance and MSE of the BLUP presented in (4.93) are calculated as follows: ( ) Bias ξ(p1 Ȳ BLUP (W 6) ) = E ( Ȳ Syn(W 6) ξ(p1 ) Ȳ) = E ξ(p1 ) X β (W6) + ũ (W6) ( X β + u + ē ) ] = E ξ(p1 ) X ( β (W6) β) ] E ξ(p1 ) (ũ (W 6) u ) E (ē ξ(p1 ) ) = X E ξ(p1 ) ( β (W 6 ) β) ] = 0, MSE ξ(p1 ) ( Ȳ BLUP (W 6) ) (W 6) ) (W 6) = MSEξ(P1)( Ȳ = Eξ(P1)( Ȳ ) 2 Ȳ = ( X γ x ) MSE ξ(p1 ) ( β (W 6) ) ] ( X γ x ) + (1 γ )σ 2 u. The MSE is calculated based on (4.42). (4.94) The woring model W 6 includes both population and sample area means in an area-level model while the true sample model based on P 1 has a different format presented in Table 4.3. Although the estimated value for parameter β is still unbiased in this case under P 1, it is over-fitted (4.90). In comparison with W 2 unbiased synthetic estimators and BLUPs for area means based on W 6 have larger MSE due to over-fitting of model parameters (4.92 and 4.94). 108

122 Chapter 4 Contextual Effects in Modeling for SAE Co-linearity is an important issue which should be considered in using W 6. The correlation between population and sample area means in the estimation process based on the area-level contextual model W 6 is important. The equation below gives an introduction of this issue: x = X + ε (ε : Sampling error) Cov( x, X ) = V ar( X ) V ar( x ) = V ar( X ) + V ar(ε ) Corr( x, X ) = { ] V ar( X )+V ar(ε ) V ar( X ) } V ar( X 1/2 ) (4.95) = ( 1 + V ar(ε ) 1/2 ( ) 1/2 ) V ar(ε ) V ar( X = 1 + n. ) V ar(x ) 4.5 Model Comparison under Population Model 2 In this section the contextual model presented in (4.4) is considered as the actual population model. True sample model are then defined for this case in Table (4.4). Table 4.4: True Sample Models under Population Model 2 Population Model 2 (P 2 ): Y (P 2) i = X iβ + u + e i True Unit-level Sample Model under (P 2 ): y (W 5) i = X (s)iβ + u + e i True Area-level Sample Model under (P 2 ): ȳ (W 6) = X (s)β + u + ē The woring model W 5 is the appropriate woring model for unit-level data and W 6 is appropriate for area-level data under P 2. When area means are present within the actual population model as possible contextual effects, W 1 is not appropriate to be fitted on the sample data, anymore. The resulting estimates based on W 1 are expected to be biased due to the contextual effects being ignored in the fitting model. This is not the case for W 2. As we discuss in following subsections, arealevel or contextual effects within P 2 can be approximately involved in W 2 and the 109

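resulting estimates based on this model are less biased than those calculated based on $W_1$ under $P_2$.

Prediction based on $W_1$ under $P_2$

Considering the definition given for $W_1$ in (3.8), the vector of regression coefficients $\beta$ is estimated based on $W_1$ as shown in (4.43). Note that $P_2$, presented in Table 4.4, is assumed to be the actual population model in the current subsection. Area population means of the auxiliary variables are considered as contextual effects in $P_2$. The expectation of $\tilde\beta^{(W_1)}$, and hence its bias, is then calculated as below:

\[
\begin{aligned}
E_{\xi(P_2)}\bigl(\tilde\beta^{(W_1)}\bigr)
&= (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}E_{\xi(P_2)}(y)
 = (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}X_{(s)}^{*}\beta^{*} \\
&= \bigl[\,I_{(P+1)}\;\;(x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\bar X_{(s)}\,\bigr]\beta^{*}
 = \begin{bmatrix}\beta_0\\ \beta_I\end{bmatrix} + (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\bar X_{(s)}\begin{bmatrix}0\\ \beta_C\end{bmatrix} \\
&= \beta + (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}(\bar X_{(s)} - x)\begin{bmatrix}0\\ \beta_C\end{bmatrix},
\end{aligned}
\tag{4.96}
\]

since

\[
\begin{aligned}
(x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\bar X_{(s)}\begin{bmatrix}0\\ \beta_C\end{bmatrix}
&= (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\bigl[x + (\bar X_{(s)} - x)\bigr]\begin{bmatrix}0\\ \beta_C\end{bmatrix} \\
&= \begin{bmatrix}0\\ \beta_C\end{bmatrix} + (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}(\bar X_{(s)} - x)\begin{bmatrix}0\\ \beta_C\end{bmatrix},
\end{aligned}
\]

where

\[
\bar X_{(s)} = \bigl[\bar X_{(s)1}'\;\;\bar X_{(s)2}'\;\cdots\;\bar X_{(s)K}'\bigr]', \qquad
\bar X_{(s)k} = \bigl[\underbrace{\bar X_k\;\;\bar X_k\;\cdots\;\bar X_k}_{n_k}\bigr]'.
\tag{4.97}
\]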

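The resulting variance of $\tilde\beta^{(W_1)}$ under $P_2$ is:

\[
\begin{aligned}
\mathrm{Var}_{\xi(P_2)}\bigl(\tilde\beta^{(W_1)}\bigr)
&= (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\,\mathrm{Var}_{\xi(P_2)}(y)\,\bigl[(x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\bigr]' \\
&= (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\,\Sigma\,\bigl[(x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\bigr]' \\
&= (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\,\Sigma\,\Sigma^{-1}x(x'\Sigma^{-1}x)^{-1}.
\end{aligned}
\tag{4.98}
\]

The parameter estimate for $\beta$ is biased using $W_1$ under $P_2$, as shown in (4.96). The bias term is due to misspecification of the contextual effect in $W_1$. Note that the aggregated-level population model is as below when $P_2$ is the actual population model:

\[
\bar Y_k = \bar X_k'\beta + u_k + \bar e_k .
\]

In this model, the individual-level and contextual components are added together and $\beta$ is the relevant regression coefficient.

Synthetic estimators and BLUPs using $W_1$ under $P_2$ are shown in the following equations. Considering the synthetic estimator presented in (4.45), the bias, variance and MSE of this estimator are calculated under $P_2$ as below:

\[
\begin{aligned}
\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_1)}\bigr)
&= E_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_1)} - \bar Y_k\bigr)
 = E_{\xi(P_2)}\bigl[\bar X_k'\hat\beta^{(W_1)} - (\bar X_k'\beta + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k'(\hat\beta^{(W_1)} - \beta)\bigr] - E_{\xi(P_2)}(u_k + \bar e_k)
 = \bar X_k'E_{\xi(P_2)}\bigl(\hat\beta^{(W_1)} - \beta\bigr) \\
&= \bar X_k'(x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}(\bar X_{(s)} - x)\begin{bmatrix}0\\ \beta_C\end{bmatrix}, \\
\mathrm{Var}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_1)}\bigr)
&= \bar X_k'\,\mathrm{Var}_{\xi(P_2)}\bigl(\hat\beta^{(W_1)}\bigr)\,\bar X_k, \\
\mathrm{MSE}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_1)}\bigr)
&= E_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_1)} - \bar Y_k\bigr)^2
 = E_{\xi(P_2)}\bigl[\bar X_k'\hat\beta^{(W_1)} - (\bar X_k'\beta + u_k + \bar e_k)\bigr]^2 \\
&= E_{\xi(P_2)}\bigl[\bar X_k'(\hat\beta^{(W_1)} - \beta)\bigr]^2 + E_{\xi(P_2)}(u_k + \bar e_k)^2 \\
&= \bar X_k'E_{\xi(P_2)}\bigl[(\hat\beta^{(W_1)} - \beta)(\hat\beta^{(W_1)} - \beta)'\bigr]\bar X_k + \sigma_u^2 + \frac{\sigma_e^2}{N_k}.
\end{aligned}
\tag{4.99}
\]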

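The BLUP for the $k$th area mean based on $W_1$ is given in (4.47). Bias, variance and MSE of the BLUP presented in (4.47) are calculated under $P_2$ as follows:

\[
\begin{aligned}
\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_1)}\bigr)
&= E_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_1)} - \bar Y_k\bigr)
 = E_{\xi(P_2)}\bigl[\bar X_k'\tilde\beta^{(W_1)} + \tilde u_k^{(W_1)} - (\bar X_k'\beta + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k'\tilde\beta^{(W_1)} + \gamma_k(\bar y_k - \bar x_k'\tilde\beta^{(W_1)}) - (\bar X_k'\beta + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k'\tilde\beta^{(W_1)} + \gamma_k(\bar y_k - \bar x_k'\tilde\beta^{(W_1)} + \bar x_k'\beta - \bar x_k'\beta) - \bar X_k'\beta - u_k - \bar e_k\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k'(\tilde\beta^{(W_1)} - \beta)\bigr] - E_{\xi(P_2)}\bigl[\gamma_k\bar x_k'(\tilde\beta^{(W_1)} - \beta)\bigr]
 + E_{\xi(P_2)}\bigl[\gamma_k(\bar y_k - \bar x_k'\beta) - u_k\bigr] - E_{\xi(P_2)}(\bar e_k) \\
&= (\bar X_k - \gamma_k\bar x_k)'E_{\xi(P_2)}\bigl(\tilde\beta^{(W_1)} - \beta\bigr), \\
E_D\bigl[\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_1)}\bigr)\bigr]
&= (1-\gamma_k)\,\bar X_k'E_{\xi(P_2)}\bigl(\tilde\beta^{(W_1)} - \beta\bigr), \\
\mathrm{Var}_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_1)}\bigr)
&= \gamma_k^2\Bigl(\sigma_u^2 + \frac{\sigma_e^2}{n_k}\Bigr)
 + \bigl[\gamma_k^2\,\bar x_k'(x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1} - 2O_k\bigr]\Sigma\,\Sigma^{-1}x(x'\Sigma^{-1}x)^{-1}\bar x_k \\
&\quad + \bigl[2\gamma_kO_k + (1-2\gamma_k)\,\bar X_k'(x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}\bigr]\Sigma\,\Sigma^{-1}x(x'\Sigma^{-1}x)^{-1}\bar X_k, \\
\mathrm{MSE}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_1)}\bigr)
&= \mathrm{MSE}_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_1)}\bigr)
 = E_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_1)} - \bar Y_k\bigr)^2 \\
&= (\bar X_k - \gamma_k\bar x_k)'\,\mathrm{MSE}_{\xi(P_2)}\bigl(\tilde\beta^{(W_1)}\bigr)\,(\bar X_k - \gamma_k\bar x_k) + (1-\gamma_k)\sigma_u^2.
\end{aligned}
\tag{4.100}
\]

A detailed derivation of the variance given in (4.48) is presented in A.8, and the MSE is obtained from (4.34).

As discussed previously, area means in $P_2$ are considered as contextual effects. $W_1$ is a unit-level model in which the contextual effects within $P_2$ are not included. This makes the resulting parameter estimates based on $W_1$ biased under $P_2$. As the differences between the sample individuals and the actual area means, $(\bar X_{(s)} - x)$, increase, the estimate $\hat\beta^{(W_1)}$ becomes more biased under $P_2$ (see (4.96)). The resulting synthetic estimators and BLUPs for area means are therefore also biased (see (4.99) and (4.100)).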
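To make the structure of the two predictors concrete, the following is a minimal Python sketch of a synthetic estimate and a BLUP-type estimate for a single area, for given (assumed known) coefficients and variance components. The function names are hypothetical and chosen for illustration only. The shrinkage factor is taken as $\gamma_k = \sigma_u^2/(\sigma_u^2 + \sigma_e^2/n_k)$, which is the standard nested-error form and is assumed here to match the $\gamma_k$ used in the derivations above.

```python
from typing import Sequence


def shrinkage_factor(sigma2_u: float, sigma2_e: float, n_k: int) -> float:
    """gamma_k = sigma_u^2 / (sigma_u^2 + sigma_e^2 / n_k) (assumed standard form)."""
    if n_k <= 0:
        raise ValueError("n_k must be a positive sample size")
    return sigma2_u / (sigma2_u + sigma2_e / n_k)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(p * q for p, q in zip(a, b))


def synthetic_estimate(X_bar_k: Sequence[float], beta: Sequence[float]) -> float:
    """Synthetic estimate: population area means times the coefficient vector."""
    return dot(X_bar_k, beta)


def blup_estimate(X_bar_k: Sequence[float], x_bar_k: Sequence[float],
                  y_bar_k: float, beta: Sequence[float], gamma_k: float) -> float:
    """BLUP-type estimate: synthetic part plus the shrunken area residual
    gamma_k * (y_bar_k - x_bar_k' beta)."""
    return dot(X_bar_k, beta) + gamma_k * (y_bar_k - dot(x_bar_k, beta))


if __name__ == "__main__":
    # Illustrative numbers only (not taken from the thesis).
    beta = [10.0, 3.0]        # intercept and slope
    X_bar_k = [1.0, 2.0]      # population area mean (with intercept term)
    x_bar_k = [1.0, 1.5]      # sample area mean (with intercept term)
    gamma_k = shrinkage_factor(sigma2_u=1.0, sigma2_e=4.0, n_k=4)
    print(synthetic_estimate(X_bar_k, beta))
    print(blup_estimate(X_bar_k, x_bar_k, 15.5, beta, gamma_k))
```

As $n_k$ grows, $\gamma_k$ approaches 1 and the BLUP-type estimate leans more on the area's own sample residual; for very small $n_k$ it stays close to the synthetic estimate.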

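Prediction based on $W_2$ under $P_2$

Considering the definition given for $W_2$ in sub-section (4.3.2), the estimation formula for the vector of regression coefficients $\beta$ based on $W_2$ is given in (4.51). Considering $P_2$, presented in Table 4.4, as the actual population model, the resulting bias and variance of $\hat\beta^{(W_2)}$ are calculated as below:

\[
\begin{aligned}
E_{\xi(P_2)}\bigl(\tilde\beta^{(W_2)}\bigr)
&= (\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}E_{\xi(P_2)}(\bar y)
 = (\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}\bar X_{(s)}^{*}\beta^{*} \\
&= \bigl[\,I_{(P+1)}\;\;(\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}\bar X\,\bigr]\beta^{*}
 = \begin{bmatrix}\beta_0\\ \beta_I\end{bmatrix} + (\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}\bar X\begin{bmatrix}0\\ \beta_C\end{bmatrix} \\
&= \beta + (\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}(\bar X - \bar x)\begin{bmatrix}0\\ \beta_C\end{bmatrix},
\end{aligned}
\tag{4.101}
\]

since

\[
\begin{aligned}
(\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}\bar X\begin{bmatrix}0\\ \beta_C\end{bmatrix}
&= (\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}\bigl[\bar x + (\bar X - \bar x)\bigr]\begin{bmatrix}0\\ \beta_C\end{bmatrix} \\
&= \begin{bmatrix}0\\ \beta_C\end{bmatrix} + (\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}(\bar X - \bar x)\begin{bmatrix}0\\ \beta_C\end{bmatrix}.
\end{aligned}
\]

The resulting variance of $\hat\beta^{(W_2)}$ under $P_2$ is:

\[
\begin{aligned}
\mathrm{Var}_{\xi(P_2)}\bigl(\tilde\beta^{(W_2)}\bigr)
&= (\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}\,\mathrm{Var}_{\xi(P_2)}(\bar y)\,\bigl[(\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}\bigr]' \\
&= (\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}\,\Sigma\,\bigl[(\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}\bigr]'.
\end{aligned}
\tag{4.102}
\]

The parameter estimate for $\beta$ is biased using $W_2$ under $P_2$, as shown in (4.101). The bias term is due to the differences between the sample area means in $W_2$ and the actual population area means considered in the definition of $P_2$.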

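Then, considering the synthetic estimator presented in (4.53), the bias, variance and MSE of this estimator are calculated under $P_2$ as below:

\[
\begin{aligned}
\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_2)}\bigr)
&= E_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_2)} - \bar Y_k\bigr)
 = E_{\xi(P_2)}\bigl[\bar X_k'\hat\beta^{(W_2)} - (\bar X_k'\beta + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k'(\hat\beta^{(W_2)} - \beta)\bigr] - E_{\xi(P_2)}(u_k + \bar e_k)
 = \bar X_k'E_{\xi(P_2)}\bigl(\hat\beta^{(W_2)} - \beta\bigr), \\
\mathrm{Var}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_2)}\bigr)
&= \bar X_k'\,\mathrm{Var}_{\xi(P_2)}\bigl(\hat\beta^{(W_2)}\bigr)\,\bar X_k, \\
\mathrm{MSE}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_2)}\bigr)
&= E_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_2)} - \bar Y_k\bigr)^2
 = E_{\xi(P_2)}\bigl[\bar X_k'\hat\beta^{(W_2)} - (\bar X_k'\beta + u_k + \bar e_k)\bigr]^2 \\
&= E_{\xi(P_2)}\bigl[\bar X_k'(\hat\beta^{(W_2)} - \beta)\bigr]^2 + E_{\xi(P_2)}(u_k + \bar e_k)^2 \\
&= \bar X_k'E_{\xi(P_2)}\bigl[(\hat\beta^{(W_2)} - \beta)(\hat\beta^{(W_2)} - \beta)'\bigr]\bar X_k + \sigma_u^2 + \frac{\sigma_e^2}{N_k}.
\end{aligned}
\tag{4.103}
\]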

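The BLUP for the $k$th area mean based on $W_2$ is given in (4.55). The bias, variance and MSE of the BLUP presented in (4.55) are calculated under $P_2$ as follows:

\[
\begin{aligned}
\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_2)}\bigr)
&= E_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_2)} - \bar Y_k\bigr)
 = E_{\xi(P_2)}\bigl[\bar X_k'\tilde\beta^{(W_2)} + \tilde u_k^{(W_2)} - (\bar X_k'\beta + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k'\tilde\beta^{(W_2)} + \gamma_k(\bar y_k - \bar x_k'\tilde\beta^{(W_2)}) - (\bar X_k'\beta + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k'\tilde\beta^{(W_2)} + \gamma_k(\bar y_k - \bar x_k'\tilde\beta^{(W_2)} + \bar x_k'\beta - \bar x_k'\beta) - \bar X_k'\beta - u_k - \bar e_k\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k'(\tilde\beta^{(W_2)} - \beta)\bigr] - E_{\xi(P_2)}\bigl[\gamma_k\bar x_k'(\tilde\beta^{(W_2)} - \beta)\bigr]
 + E_{\xi(P_2)}\bigl[\gamma_k(\bar y_k - \bar x_k'\beta) - u_k\bigr] - E_{\xi(P_2)}(\bar e_k) \\
&= (\bar X_k - \gamma_k\bar x_k)'E_{\xi(P_2)}\bigl(\tilde\beta^{(W_2)} - \beta\bigr), \\
E_D\bigl[\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_2)}\bigr)\bigr]
&= (1-\gamma_k)\,\bar X_k'E_{\xi(P_2)}\bigl(\tilde\beta^{(W_2)} - \beta\bigr), \\
\mathrm{Var}_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_2)}\bigr)
&= \gamma_k^2\Bigl(\sigma_u^2 + \frac{\sigma_e^2}{n_k}\Bigr) + (\gamma_k^2 - 2)\,\bar x_k'(\bar x'\Sigma^{-1}\bar x)^{-1}\bar x_k + \bar X_k'(\bar x'\Sigma^{-1}\bar x)^{-1}\bar X_k \\
&= \gamma_k\sigma_u^2 + (\gamma_k^2 - 2)\,\bar x_k'(\bar x'\Sigma^{-1}\bar x)^{-1}\bar x_k + \bar X_k'(\bar x'\Sigma^{-1}\bar x)^{-1}\bar X_k, \\
\mathrm{MSE}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_2)}\bigr)
&= \mathrm{MSE}_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_2)}\bigr)
 = E_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_2)} - \bar Y_k\bigr)^2 \\
&= (\bar X_k - \gamma_k\bar x_k)'\,\mathrm{MSE}_{\xi(P_2)}\bigl(\tilde\beta^{(W_2)}\bigr)\,(\bar X_k - \gamma_k\bar x_k) + (1-\gamma_k)\sigma_u^2.
\end{aligned}
\tag{4.104}
\]

The MSE is calculated based on (4.42).

The working model $W_2$ is not the true sample model for aggregated data under $P_2$ (Table 4.4). Therefore, the resulting estimate of $\beta$ under $P_2$ is biased due to the differences between the sample and population area means. This bias is negligible when the area sample means are close to the actual population area means, as shown in (4.101). The resulting bias in synthetic estimators and BLUPs for area means using $W_2$ decreases as $\hat\beta^{(W_2)}$ becomes less biased under $P_2$ ((4.103) and (4.104)).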
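The contrast between $W_1$ and $W_2$ under a contextual population model can be illustrated with a small Monte Carlo sketch. The code below is a simplified illustration under stated assumptions, not the estimation procedure of this thesis: it uses a single covariate, ordinary least squares in place of the GLS/BLUE formulas, an effectively infinite population within each area (so the population area mean equals the generating mean), and hypothetical parameter values. It fits a unit-level regression of $y_{ik}$ on $x_{ik}$ that ignores the contextual term (in the spirit of $W_1$) and an area-level regression of $\bar y_k$ on $\bar x_k$ (in the spirit of $W_2$), and compares both slopes with the aggregated coefficient $\beta_I + \beta_C$.

```python
import random
from typing import List, Tuple


def ols_slope(x: List[float], y: List[float]) -> float:
    """Slope of a simple least-squares regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_slopes(K: int = 200, n: int = 10, b0: float = 1.0,
                    b_I: float = 1.0, b_C: float = 2.0,
                    seed: int = 1) -> Tuple[float, float]:
    """Generate data with a contextual effect and return
    (unit-level slope ignoring context, area-level slope on sample means).
    All distributional choices below are illustrative assumptions."""
    rng = random.Random(seed)
    unit_x, unit_y, area_x, area_y = [], [], [], []
    for _ in range(K):
        X_bar_k = rng.gauss(0.0, 1.0)   # population area mean of the covariate
        u_k = rng.gauss(0.0, 0.5)       # area random effect
        xs = [X_bar_k + rng.gauss(0.0, 2.0) for _ in range(n)]
        ys = [b0 + b_I * x + b_C * X_bar_k + u_k + rng.gauss(0.0, 1.0) for x in xs]
        unit_x.extend(xs)
        unit_y.extend(ys)
        area_x.append(sum(xs) / n)      # sample area mean of x
        area_y.append(sum(ys) / n)      # sample area mean of y
    return ols_slope(unit_x, unit_y), ols_slope(area_x, area_y)


if __name__ == "__main__":
    unit_slope, area_slope = simulate_slopes()
    print("target b_I + b_C:", 1.0 + 2.0)
    print("unit-level slope (context ignored):", round(unit_slope, 3))
    print("area-level slope (sample means):   ", round(area_slope, 3))
```

Under these assumptions the area-level slope should lie closer to $\beta_I + \beta_C$ than the unit-level slope, and the remaining gap shrinks as the sample area means approach the population area means (larger `n`), which mirrors the qualitative conclusion drawn from (4.96) and (4.101).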

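Prediction based on $W_3$ under $P_2$

Considering the definition given for $W_3$ in sub-section (4.3.3), the vector of regression coefficients $\beta$ is estimated based on $W_3$ as shown in (4.58). The resulting bias and variance of $\hat\beta^{(W_3)}$ under $P_2$ are then calculated as below:

\[
\begin{aligned}
E_{\xi(P_2)}\bigl(\tilde\beta^{(W_3)}\bigr)
&= (\bar X'\Sigma^{-1}\bar X)^{-1}\bar X'\Sigma^{-1}E_{\xi(P_2)}(\bar y)
 = (\bar X'\Sigma^{-1}\bar X)^{-1}\bar X'\Sigma^{-1}\bar X_{(s)}^{*}\beta^{*} \\
&= \bigl[\,(\bar X'\Sigma^{-1}\bar X)^{-1}\bar X'\Sigma^{-1}\bar x\;\;I_{(P+1)}\,\bigr]\beta^{*}
 = (\bar X'\Sigma^{-1}\bar X)^{-1}\bar X'\Sigma^{-1}\bar x\begin{bmatrix}0\\ \beta_I\end{bmatrix} + \begin{bmatrix}\beta_0\\ \beta_C\end{bmatrix} \\
&= (\bar X'\Sigma^{-1}\bar X)^{-1}\bar X'\Sigma^{-1}(\bar x - \bar X)\begin{bmatrix}0\\ \beta_I\end{bmatrix} + \begin{bmatrix}0\\ \beta_I\end{bmatrix} + \begin{bmatrix}\beta_0\\ \beta_C\end{bmatrix} \\
&= (\bar X'\Sigma^{-1}\bar X)^{-1}\bar X'\Sigma^{-1}(\bar x - \bar X)\begin{bmatrix}0\\ \beta_I\end{bmatrix} + \beta, \\
\mathrm{Var}_{\xi(P_2)}\bigl(\tilde\beta^{(W_3)}\bigr)
&= (\bar X'\Sigma^{-1}\bar X)^{-1}\bar X'\Sigma^{-1}\,\mathrm{Var}_{\xi(P_2)}(\bar y)\,\bigl[(\bar X'\Sigma^{-1}\bar X)^{-1}\bar X'\Sigma^{-1}\bigr]' \\
&= (\bar X'\Sigma^{-1}\bar X)^{-1}\bar X'\Sigma^{-1}\,\Sigma\,\bigl[(\bar X'\Sigma^{-1}\bar X)^{-1}\bar X'\Sigma^{-1}\bigr]'.
\end{aligned}
\tag{4.105}
\]

The parameter estimate for $\beta$ is biased using $W_3$ under $P_2$, as shown in (4.105). The bias term arises because the sample area means are omitted from $W_3$ although they are implied by the actual population model $P_2$. A true sample model to be fitted on aggregated data under $P_2$ should involve both sample and population area means, whereas the working model $W_3$ considers only population area means. This leads to biased estimates. The bias decreases as the differences between the sample and population area means decrease.

Considering the synthetic estimator presented in (4.60), the bias, variance and MSE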

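of this estimator are calculated under $P_2$ as below:

\[
\begin{aligned}
\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_3)}\bigr)
&= E_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_3)} - \bar Y_k\bigr)
 = E_{\xi(P_2)}\bigl[\bar X_k'\hat\beta^{(W_3)} - (\bar X_k'\beta + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k'(\hat\beta^{(W_3)} - \beta)\bigr] - E_{\xi(P_2)}(u_k + \bar e_k)
 = \bar X_k'E_{\xi(P_2)}\bigl(\hat\beta^{(W_3)} - \beta\bigr), \\
\mathrm{Var}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_3)}\bigr)
&= \bar X_k'\,\mathrm{Var}_{\xi(P_2)}\bigl(\hat\beta^{(W_3)}\bigr)\,\bar X_k, \\
\mathrm{MSE}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_3)}\bigr)
&= E_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_3)} - \bar Y_k\bigr)^2
 = E_{\xi(P_2)}\bigl[\bar X_k'\hat\beta^{(W_3)} - (\bar X_k'\beta + u_k + \bar e_k)\bigr]^2 \\
&= E_{\xi(P_2)}\bigl[\bar X_k'(\hat\beta^{(W_3)} - \beta)\bigr]^2 + E_{\xi(P_2)}(u_k + \bar e_k)^2 \\
&= \bar X_k'E_{\xi(P_2)}\bigl[(\hat\beta^{(W_3)} - \beta)(\hat\beta^{(W_3)} - \beta)'\bigr]\bar X_k + \sigma_u^2 + \frac{\sigma_e^2}{N_k}.
\end{aligned}
\tag{4.106}
\]

The BLUP for the $k$th area mean based on $W_3$ is given in (4.63). The bias and MSE of the BLUP presented in (4.63) are calculated under $P_2$ as follows:

\[
\begin{aligned}
\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_3)}\bigr)
&= E_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_3)} - \bar Y_k\bigr)
 = E_{\xi(P_2)}\bigl[\bar X_k'\tilde\beta^{(W_3)} + \tilde u_k^{(W_3)} - (\bar X_k'\beta + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k'(\tilde\beta^{(W_3)} - \beta)\bigr] + E_{\xi(P_2)}(\tilde u_k - u_k) - E_{\xi(P_2)}(\bar e_k)
 = \bar X_k'E_{\xi(P_2)}\bigl(\tilde\beta^{(W_3)} - \beta\bigr), \\
\mathrm{MSE}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_3)}\bigr)
&= \mathrm{MSE}_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_3)}\bigr)
 = E_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_3)} - \bar Y_k\bigr)^2 \\
&= (1-\gamma_k)^2\,\bar X_k'\,\mathrm{MSE}_{\xi(P_2)}\bigl(\tilde\beta^{(W_3)}\bigr)\,\bar X_k + (1-\gamma_k)\sigma_u^2.
\end{aligned}
\tag{4.107}
\]

The MSE is calculated based on (4.42).

The working model $W_3$ is not a true sample model under $P_2$ (Table 4.4). $W_3$ includes only the actual population area means, whereas both sample and population area means are present in the true area-level sample model under $P_2$. Therefore, the resulting estimate of $\beta$ under $P_2$ is biased because the sample area means are omitted from $W_3$. This bias is negligible when the area sample means are close to the actual population area means, as shown in (4.105). The resulting bias in synthetic estimators and BLUPs for area means using $W_3$ decreases as $\hat\beta^{(W_3)}$ becomes less biased under $P_2$ ((4.106) and (4.107)).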

Prediction based on $W_4$ under $P_2$

Considering the definition given for $W_4$ in sub-section (4.3.4), the vector of regression coefficients is estimated based on $W_4$ as shown in (4.67). The resulting bias and variance of $\hat\beta^{*(W_4)}$ under $P_2$ are then calculated as below:

\[
\begin{aligned}
E_{\xi(P_2)}\bigl(\hat\beta^{*(W_4)}\bigr)
&= (x^{*\prime}\Sigma^{-1}x^{*})^{-1}x^{*\prime}\Sigma^{-1}E_{\xi(P_2)}(y)
 = (x^{*\prime}\Sigma^{-1}x^{*})^{-1}x^{*\prime}\Sigma^{-1}X_{(s)}^{*}\beta^{*} \\
&= (x^{*\prime}\Sigma^{-1}x^{*})^{-1}x^{*\prime}\Sigma^{-1}(X_{(s)}^{*} - x^{*} + x^{*})\beta^{*}
 = \beta^{*} + (x^{*\prime}\Sigma^{-1}x^{*})^{-1}x^{*\prime}\Sigma^{-1}(X_{(s)}^{*} - x^{*})\beta^{*} \\
&= \beta^{*} + (x^{*\prime}\Sigma^{-1}x^{*})^{-1}x^{*\prime}\Sigma^{-1}\bigl[\,0\;\;(\bar X - \bar x)\,\bigr]\beta^{*}
 = \beta^{*} + (x^{*\prime}\Sigma^{-1}x^{*})^{-1}x^{*\prime}\Sigma^{-1}(\bar X - \bar x)\beta_C, \\
\mathrm{Var}_{\xi(P_2)}\bigl(\hat\beta^{*(W_4)}\bigr)
&= (x^{*\prime}\Sigma^{-1}x^{*})^{-1}x^{*\prime}\Sigma^{-1}\,\mathrm{Var}_{\xi(P_2)}(y)\,\bigl[(x^{*\prime}\Sigma^{-1}x^{*})^{-1}x^{*\prime}\Sigma^{-1}\bigr]' \\
&= (x^{*\prime}\Sigma^{-1}x^{*})^{-1}x^{*\prime}\Sigma^{-1}\,\Sigma\,\bigl[(x^{*\prime}\Sigma^{-1}x^{*})^{-1}x^{*\prime}\Sigma^{-1}\bigr]'
 = (x^{*\prime}\Sigma^{-1}x^{*})^{-1}.
\end{aligned}
\tag{4.108}
\]

The parameter estimate is biased using $W_4$ under $P_2$, as shown in (4.108). The bias term arises because the actual population area means, which exist in the population model $P_2$, are not correctly specified in $W_4$. The true unit-level sample model under $P_2$ involves population area means as the contextual effects, whereas $W_4$ includes sample area means instead. Therefore, the difference between sample and population area means leads to biased estimates using $W_4$.

Then, considering the synthetic estimator presented in (4.71), the bias, variance

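and MSE of this estimator are calculated under $P_2$ as below:

\[
\begin{aligned}
\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_4)}\bigr)
&= E_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_4)} - \bar Y_k\bigr)
 = E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}\hat\beta^{*(W_4)} - (\bar X_k^{*\prime}\beta^{*} + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}(\hat\beta^{*(W_4)} - \beta^{*})\bigr] - E_{\xi(P_2)}(u_k + \bar e_k)
 = \bar X_k^{*\prime}E_{\xi(P_2)}\bigl(\hat\beta^{*(W_4)} - \beta^{*}\bigr), \\
\mathrm{Var}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_4)}\bigr)
&= \bar X_k^{*\prime}\,\mathrm{Var}_{\xi(P_2)}\bigl(\hat\beta^{*(W_4)}\bigr)\,\bar X_k^{*}, \\
\mathrm{MSE}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_4)}\bigr)
&= E_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_4)} - \bar Y_k\bigr)^2
 = E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}\hat\beta^{*(W_4)} - (\bar X_k^{*\prime}\beta^{*} + u_k + \bar e_k)\bigr]^2 \\
&= E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}(\hat\beta^{*(W_4)} - \beta^{*})\bigr]^2 + E_{\xi(P_2)}(u_k + \bar e_k)^2 \\
&= \bar X_k^{*\prime}E_{\xi(P_2)}\bigl[(\hat\beta^{*(W_4)} - \beta^{*})(\hat\beta^{*(W_4)} - \beta^{*})'\bigr]\bar X_k^{*} + \sigma_u^2 + \frac{\sigma_e^2}{N_k}.
\end{aligned}
\tag{4.109}
\]

The BLUP for the $k$th area mean based on $W_4$ is given in (4.73). Bias and MSE of the BLUP presented in (4.73) are calculated under $P_2$ as follows:

\[
\begin{aligned}
\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_4)}\bigr)
&= E_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_4)} - \bar Y_k\bigr)
 = E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}\tilde\beta^{*(W_4)} + \tilde u_k^{(W_4)} - (\bar X_k^{*\prime}\beta^{*} + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}(\tilde\beta^{*(W_4)} - \beta^{*})\bigr] + E_{\xi(P_2)}\bigl(\tilde u_k^{(W_4)} - u_k\bigr) - E_{\xi(P_2)}(\bar e_k)
 = \bar X_k^{*\prime}E_{\xi(P_2)}\bigl(\tilde\beta^{*(W_4)} - \beta^{*}\bigr), \\
\mathrm{MSE}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_4)}\bigr)
&= \mathrm{MSE}_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_4)}\bigr)
 = E_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_4)} - \bar Y_k\bigr)^2 \\
&= (\bar X_k^{*} - \gamma_k\bar x_k^{*})'\,\mathrm{MSE}_{\xi(P_2)}\bigl(\tilde\beta^{*(W_4)}\bigr)\,(\bar X_k^{*} - \gamma_k\bar x_k^{*}) + (1-\gamma_k)\sigma_u^2.
\end{aligned}
\tag{4.110}
\]

The MSE is obtained using (4.34).

Although $W_4$ includes the sample information about the area means as possible contextual effects, it is not exactly a true sample model under $P_2$, because it does not account for possible differences between the sample and population area means.

Prediction based on $W_5$ under $P_2$

Considering the definition given for $W_5$ in sub-section (4.3.5), the vector of regression coefficients is estimated based on $W_5$ as shown in (4.78). Population area means are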

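included in $W_5$ as contextual effects. The resulting bias and variance of $\hat\beta^{*(W_5)}$ under $P_2$ are then calculated as below:

\[
\begin{aligned}
E_{\xi(P_2)}\bigl(\hat\beta^{*(W_5)}\bigr)
&= (X_{(s)}^{*\prime}\Sigma^{-1}X_{(s)}^{*})^{-1}X_{(s)}^{*\prime}\Sigma^{-1}E_{\xi(P_2)}(y)
 = (X_{(s)}^{*\prime}\Sigma^{-1}X_{(s)}^{*})^{-1}X_{(s)}^{*\prime}\Sigma^{-1}X_{(s)}^{*}\beta^{*} = \beta^{*}, \\
\mathrm{Var}_{\xi(P_2)}\bigl(\hat\beta^{*(W_5)}\bigr)
&= (X_{(s)}^{*\prime}\Sigma^{-1}X_{(s)}^{*})^{-1}X_{(s)}^{*\prime}\Sigma^{-1}\,\mathrm{Var}_{\xi(P_2)}(y)\,\bigl[(X_{(s)}^{*\prime}\Sigma^{-1}X_{(s)}^{*})^{-1}X_{(s)}^{*\prime}\Sigma^{-1}\bigr]' \\
&= (X_{(s)}^{*\prime}\Sigma^{-1}X_{(s)}^{*})^{-1}X_{(s)}^{*\prime}\Sigma^{-1}\,\Sigma\,\bigl[(X_{(s)}^{*\prime}\Sigma^{-1}X_{(s)}^{*})^{-1}X_{(s)}^{*\prime}\Sigma^{-1}\bigr]'
 = (X_{(s)}^{*\prime}\Sigma^{-1}X_{(s)}^{*})^{-1}.
\end{aligned}
\tag{4.111}
\]

Note that $W_5$ is the true unit-level sample model under $P_2$. The parameter estimate is unbiased using $W_5$ under $P_2$, as presented in (4.111).

Considering the synthetic estimator presented in (4.80), the bias, variance and MSE of this estimator are calculated under $P_2$ as below:

\[
\begin{aligned}
\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_5)}\bigr)
&= E_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_5)} - \bar Y_k\bigr)
 = E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}\hat\beta^{*(W_5)} - (\bar X_k^{*\prime}\beta^{*} + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}(\hat\beta^{*(W_5)} - \beta^{*})\bigr] - E_{\xi(P_2)}(u_k + \bar e_k)
 = \bar X_k^{*\prime}E_{\xi(P_2)}\bigl(\hat\beta^{*(W_5)} - \beta^{*}\bigr) = 0, \\
\mathrm{Var}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_5)}\bigr)
&= \bar X_k^{*\prime}\,\mathrm{Var}_{\xi(P_2)}\bigl(\hat\beta^{*(W_5)}\bigr)\,\bar X_k^{*}, \\
\mathrm{MSE}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_5)}\bigr)
&= E_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_5)} - \bar Y_k\bigr)^2
 = E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}\hat\beta^{*(W_5)} - (\bar X_k^{*\prime}\beta^{*} + u_k + \bar e_k)\bigr]^2 \\
&= E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}(\hat\beta^{*(W_5)} - \beta^{*})\bigr]^2 + E_{\xi(P_2)}(u_k + \bar e_k)^2 \\
&= \bar X_k^{*\prime}E_{\xi(P_2)}\bigl[(\hat\beta^{*(W_5)} - \beta^{*})(\hat\beta^{*(W_5)} - \beta^{*})'\bigr]\bar X_k^{*} + \sigma_u^2 + \frac{\sigma_e^2}{N_k}.
\end{aligned}
\tag{4.112}
\]

The BLUP for the $k$th area mean based on $W_5$ is given in (4.82). Bias and MSE of the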

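BLUP presented in (4.82) are calculated under $P_2$ as follows:

\[
\begin{aligned}
\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_5)}\bigr)
&= E_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_5)} - \bar Y_k\bigr)
 = E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}\tilde\beta^{*(W_5)} + \tilde u_k^{(W_5)} - (\bar X_k^{*\prime}\beta^{*} + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}(\tilde\beta^{*(W_5)} - \beta^{*})\bigr] + E_{\xi(P_2)}\bigl(\tilde u_k^{(W_5)} - u_k\bigr) - E_{\xi(P_2)}(\bar e_k)
 = \bar X_k^{*\prime}E_{\xi(P_2)}\bigl(\tilde\beta^{*(W_5)} - \beta^{*}\bigr) = 0, \\
\mathrm{MSE}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_5)}\bigr)
&= \mathrm{MSE}_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_5)}\bigr)
 = E_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_5)} - \bar Y_k\bigr)^2 \\
&= \bigl[\bar X_k^{*} - \gamma_k\bar x_k^{*}\bigr]'\bigl\{\mathrm{MSE}_{\xi(P_2)}\bigl(\tilde\beta^{*(W_5)}\bigr)\bigr\}\bigl[\bar X_k^{*} - \gamma_k\bar x_k^{*}\bigr] + (1-\gamma_k)\sigma_u^2.
\end{aligned}
\tag{4.113}
\]

The MSE is obtained from (4.34).

As presented in Table 4.4, $W_5$ is a true unit-level sample model when $P_2$ is considered as the actual population model. Therefore, the parameter estimates based on $W_5$ under $P_2$ are unbiased, as shown in (4.111). Unbiased parameter estimates in this case yield unbiased synthetic estimators and EBLUPs for area means (see (4.112) and (4.113)).

Prediction based on $W_6$ under $P_2$

Considering the definition given for $W_6$ in sub-section (4.3.6), the vector of regression coefficients is estimated based on $W_6$ as shown in (4.88). Both sample and population area means are included in $W_6$. The resulting bias and variance of $\hat\beta^{*(W_6)}$ under $P_2$ are then calculated as below:

\[
\begin{aligned}
E_{\xi(P_2)}\bigl(\tilde\beta^{*(W_6)}\bigr)
&= (\bar X_{(s)}^{*\prime}\Sigma^{-1}\bar X_{(s)}^{*})^{-1}\bar X_{(s)}^{*\prime}\Sigma^{-1}E_{\xi(P_2)}(\bar y)
 = (\bar X_{(s)}^{*\prime}\Sigma^{-1}\bar X_{(s)}^{*})^{-1}\bar X_{(s)}^{*\prime}\Sigma^{-1}\bar X_{(s)}^{*}\beta^{*} = \beta^{*}, \\
\mathrm{Var}_{\xi(P_2)}\bigl(\tilde\beta^{*(W_6)}\bigr)
&= (\bar X_{(s)}^{*\prime}\Sigma^{-1}\bar X_{(s)}^{*})^{-1}\bar X_{(s)}^{*\prime}\Sigma^{-1}\,\mathrm{Var}_{\xi(P_2)}(\bar y)\,\bigl[(\bar X_{(s)}^{*\prime}\Sigma^{-1}\bar X_{(s)}^{*})^{-1}\bar X_{(s)}^{*\prime}\Sigma^{-1}\bigr]' \\
&= (\bar X_{(s)}^{*\prime}\Sigma^{-1}\bar X_{(s)}^{*})^{-1}\bar X_{(s)}^{*\prime}\Sigma^{-1}\,\Sigma\,\bigl[(\bar X_{(s)}^{*\prime}\Sigma^{-1}\bar X_{(s)}^{*})^{-1}\bar X_{(s)}^{*\prime}\Sigma^{-1}\bigr]'
 = (\bar X_{(s)}^{*\prime}\Sigma^{-1}\bar X_{(s)}^{*})^{-1}.
\end{aligned}
\tag{4.114}
\]

Note that $W_6$ is the true area-level sample model under $P_2$. The parameter estimate is unbiased using $W_6$ under $P_2$, as presented in (4.114).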

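Considering the synthetic estimator presented in (4.91), the bias, variance and MSE of this estimator are calculated under $P_2$ as below:

\[
\begin{aligned}
\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_6)}\bigr)
&= E_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_6)} - \bar Y_k\bigr)
 = E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}\hat\beta^{*(W_6)} - (\bar X_k^{*\prime}\beta^{*} + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}(\hat\beta^{*(W_6)} - \beta^{*})\bigr] - E_{\xi(P_2)}(u_k + \bar e_k)
 = \bar X_k^{*\prime}E_{\xi(P_2)}\bigl(\hat\beta^{*(W_6)} - \beta^{*}\bigr) = 0, \\
\mathrm{Var}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_6)}\bigr)
&= \bar X_k^{*\prime}\,\mathrm{Var}_{\xi(P_2)}\bigl(\hat\beta^{*(W_6)}\bigr)\,\bar X_k^{*}, \\
\mathrm{MSE}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_6)}\bigr)
&= E_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{Syn(W_6)} - \bar Y_k\bigr)^2
 = E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}\hat\beta^{*(W_6)} - (\bar X_k^{*\prime}\beta^{*} + u_k + \bar e_k)\bigr]^2 \\
&= E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}(\hat\beta^{*(W_6)} - \beta^{*})\bigr]^2 + E_{\xi(P_2)}(u_k + \bar e_k)^2 \\
&= \bar X_k^{*\prime}E_{\xi(P_2)}\bigl[(\hat\beta^{*(W_6)} - \beta^{*})(\hat\beta^{*(W_6)} - \beta^{*})'\bigr]\bar X_k^{*} + \sigma_u^2 + \frac{\sigma_e^2}{N_k}.
\end{aligned}
\tag{4.115}
\]

The BLUP for the $k$th area mean based on $W_6$ is given in (4.93). Bias and MSE of the BLUP presented in (4.93) are calculated under $P_2$ as follows:

\[
\begin{aligned}
\mathrm{Bias}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_6)}\bigr)
&= E_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_6)} - \bar Y_k\bigr)
 = E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}\tilde\beta^{*(W_6)} + \tilde u_k^{(W_6)} - (\bar X_k^{*\prime}\beta^{*} + u_k + \bar e_k)\bigr] \\
&= E_{\xi(P_2)}\bigl[\bar X_k^{*\prime}(\tilde\beta^{*(W_6)} - \beta^{*})\bigr] + E_{\xi(P_2)}\bigl(\tilde u_k^{(W_6)} - u_k\bigr) - E_{\xi(P_2)}(\bar e_k)
 = \bar X_k^{*\prime}E_{\xi(P_2)}\bigl(\tilde\beta^{*(W_6)} - \beta^{*}\bigr) = 0, \\
\mathrm{MSE}_{\xi(P_2)}\bigl(\hat{\bar Y}_k^{BLUP(W_6)}\bigr)
&= \mathrm{MSE}_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_6)}\bigr)
 = E_{\xi(P_2)}\bigl(\tilde{\bar Y}_k^{(W_6)} - \bar Y_k\bigr)^2 \\
&= \bigl[\bar X_k^{*} - \gamma_k\bar x_k^{*}\bigr]'\bigl\{\mathrm{MSE}_{\xi(P_2)}\bigl(\tilde\beta^{*(W_6)}\bigr)\bigr\}\bigl[\bar X_k^{*} - \gamma_k\bar x_k^{*}\bigr] + (1-\gamma_k)\sigma_u^2.
\end{aligned}
\tag{4.116}
\]

The MSE is calculated based on (4.42).

As presented in Table 4.4, $W_6$ is a true area-level sample model when $P_2$ is considered as the actual population model. Therefore, the parameter estimates based on $W_6$ under $P_2$ are unbiased, as shown in (4.114). Unbiased parameter estimates in this case yield unbiased synthetic estimators and EBLUPs for area means (see (4.115) and (4.116)).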

4.6 Conclusion

As discussed in this chapter, selecting an appropriate model to be fitted on the sample data is a major challenge. The misspecification problem was addressed in this chapter, and an area-level model ($W_2$) was proposed as a good choice when area means are the targets of inference, in order to avoid misspecification of existing contextual effects.

The six working models presented in Table 4.1 are discussed in this chapter. Their properties differ according to the two different assumptions about the actual population model. A summary of the synthetic estimators and EBLUPs based on the six working models is presented in Table 4.5.

Table 4.5: Summary of Possible Working Models and Predictors

Working model; Synthetic estimator; EBLUP
$y_{ik}^{(W_1)} = x_{ik}'\beta + u_k + e_{ik}$; $\bar X_k'\hat\beta$; $\bar X_k'\hat\beta + \hat u_k$
$\bar y_k^{(W_2)} = \bar x_k'\beta + u_k + \bar e_k$; $\bar X_k'\hat\beta$; $\bar X_k'\hat\beta + \hat u_k$
$\bar y_k^{(W_3)} = \bar X_k'\beta + u_k + \bar e_k$; $\bar X_k'\hat\beta$; $\bar X_k'\hat\beta + \hat u_k$
$y_{ik}^{(W_4)} = x_{ik}^{*\prime}\beta^{*} + u_k + e_{ik}$; $\bar X_k^{*\prime}\hat\beta^{*}$; $\bar X_k^{*\prime}\hat\beta^{*} + \hat u_k$
$y_{ik}^{(W_5)} = X_{(s)ik}^{*\prime}\beta^{*} + u_k + e_{ik}$; $\bar X_k^{*\prime}\hat\beta^{*}$; $\bar X_k^{*\prime}\hat\beta^{*} + \hat u_k$
$\bar y_k^{(W_6)} = \bar X_{(s)k}^{*\prime}\beta^{*} + u_k + \bar e_k$; $\bar X_k^{*\prime}\hat\beta^{*}$; $\bar X_k^{*\prime}\hat\beta^{*} + \hat u_k$

The expected values of the model coefficients using the different working models are presented under $P_1$ and $P_2$ in Table 4.6.

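Table 4.6: Estimating model coefficients based on different working models

$W_1$: $\hat\beta^{(W_1)} = \tilde\beta^{(W_1)} = (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}y$
  $E_{\xi(P_1)}\bigl(\tilde\beta^{(W_1)}\bigr) = \beta$
  $E_{\xi(P_2)}\bigl(\tilde\beta^{(W_1)}\bigr) = \beta + (x'\Sigma^{-1}x)^{-1}x'\Sigma^{-1}(\bar X_{(s)} - x)\beta_C$

$W_2$: $\hat\beta^{(W_2)} = \tilde\beta^{(W_2)} = (\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}\bar y$
  $E_{\xi(P_1)}\bigl(\tilde\beta^{(W_2)}\bigr) = \beta$
  $E_{\xi(P_2)}\bigl(\tilde\beta^{(W_2)}\bigr) = \beta + (\bar x'\Sigma^{-1}\bar x)^{-1}\bar x'\Sigma^{-1}(\bar X - \bar x)\beta_C$

$W_3$: $\hat\beta^{(W_3)} = \tilde\beta^{(W_3)} = (\bar X'\Sigma^{-1}\bar X)^{-1}\bar X'\Sigma^{-1}\bar y$
  $E_{\xi(P_1)}\bigl(\tilde\beta^{(W_3)}\bigr) = (\bar X'\Sigma^{-1}\bar X)^{-1}\bar X'\Sigma^{-1}\bigl[(\bar x - \bar X) + \bar X\bigr]\beta$
  $E_{\xi(P_2)}\bigl(\tilde\beta^{(W_3)}\bigr) = (\bar X'\Sigma^{-1}\bar X)^{-1}\bar X'\Sigma^{-1}(\bar x - \bar X)\beta_I + \beta$

$W_4$: $\hat\beta^{*(W_4)} = \tilde\beta^{*(W_4)} = (x^{*\prime}\Sigma^{-1}x^{*})^{-1}x^{*\prime}\Sigma^{-1}y$
  $E_{\xi(P_1)}\bigl(\hat\beta^{*(W_4)}\bigr) = \begin{bmatrix}\beta\\ 0\end{bmatrix}$
  $E_{\xi(P_2)}\bigl(\hat\beta^{*(W_4)}\bigr) = \beta^{*} + (x^{*\prime}\Sigma^{-1}x^{*})^{-1}x^{*\prime}\Sigma^{-1}(\bar X - \bar x)\beta_C$

$W_5$: $\hat\beta^{*(W_5)} = \tilde\beta^{*(W_5)} = (X_{(s)}^{*\prime}\Sigma^{-1}X_{(s)}^{*})^{-1}X_{(s)}^{*\prime}\Sigma^{-1}y$
  $E_{\xi(P_1)}\bigl(\hat\beta^{*(W_5)}\bigr) = \begin{bmatrix}\beta\\ 0\end{bmatrix}$
  $E_{\xi(P_2)}\bigl(\hat\beta^{*(W_5)}\bigr) = \beta^{*}$

$W_6$: $\hat\beta^{*(W_6)} = \tilde\beta^{*(W_6)} = (\bar X_{(s)}^{*\prime}\Sigma^{-1}\bar X_{(s)}^{*})^{-1}\bar X_{(s)}^{*\prime}\Sigma^{-1}\bar y$
  $E_{\xi(P_1)}\bigl(\tilde\beta^{*(W_6)}\bigr) = \begin{bmatrix}\beta\\ 0\end{bmatrix}$
  $E_{\xi(P_2)}\bigl(\hat\beta^{*(W_6)}\bigr) = \beta^{*}$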

Working model $W_1$ can be fitted on the individual-level sample data when the required information about the anonymized group labels is provided. The resulting model parameter estimates using $W_1$ are unbiased with the least variance under $P_1$. When $P_2$ is the actual population model, fitting $W_1$ leads to biased estimates due to misspecification of contextual effects.

The area-level working model $W_2$ uses aggregated-level data provided with anonymized group labels. The resulting estimates using $W_2$ are unbiased under $P_1$. Existing contextual effects within the population model can also be involved in $W_2$.

When sample information about the auxiliary variable is not available, $W_3$ can be used for area-level estimation purposes. Using $W_3$ leads to biased estimates under $P_1$ and $P_2$. The bias is due to the differences between sample and population area means. If the difference between sample and population area means is negligible, the resulting parameter estimates using $W_2$ and $W_3$ are approximately unbiased under both population models.

When $P_2$ is assumed to be the actual population model, $W_5$ and $W_6$ are respectively the true unit-level and area-level sample models, leading to unbiased estimates. However, using $W_4$ leads to biased estimates due to possible differences between sample and population area means. When $P_1$ is the true population model, using the contextual working models ($W_4$, $W_5$ and $W_6$) leads to larger variances in parameter estimation due to over-fitting.

Features of the different working models in this study under $P_1$ and $P_2$ are summarized in Table 4.7. Assuming the contextual model presented in (4.4) applies to the population data, the first working model ($W_1$) leads to biased parameter estimates. As the true area-level sample model under $P_2$ is the one presented in (4.14), parameter estimates based on $W_2$ may also be biased, as sample area means ($\bar x$) and population area means ($\bar X$) may differ.

However, $W_2$ includes $P+1$ regression coefficients to be estimated, while $2P+1$ regression coefficients are included in the sampling models presented in (4.12) and (4.14). Therefore, dimension reduction in calculating the model parameter estimates is a considerable advantage of applying $W_2$ in SAE. When there is not enough sample information available about the auxiliary variables, $W_3$ can be used for area-level estimation purposes, but the actual group labels are required in this case.

When $P_2$ is considered as the actual population model, the resulting estimates based on the contextual working models ($W_4$, $W_5$ and $W_6$) are expected to be more reliable, as the existing contextual effects within the actual population model are involved in such working models. Another choice is to use $W_2$ for area-level estimation under $P_2$. Although possible contextual effects in the population model are not explicitly specified in $W_2$, they are automatically involved in the estimation process. The main advantage of using $W_2$ is the dimension reduction in calculating the required estimators. Model properties are compared for certain cases in Table 4.8.

The working models introduced in this thesis are evaluated numerically using the simulation study presented in Chapter 5.

Table 4.7: Summary of Model Characteristics based on Different True Population Models

Generated population based on $P_1$:
$W_1$: Unbiased parameter estimates. Parameter estimation with minimum variance. Using individual-level sample data. Need for anonymized group labels.
$W_2$: Unbiased parameter estimates. Parameter estimation not with minimum variance. Using aggregate-level sample data. Need for anonymized group labels.
$W_3$: Biased parameter estimates. No auxiliary sample data. Parameter estimation not with minimum variance. Using aggregate-level sample and population data. Need for actual group labels at area-level.
$W_4$: Unbiased parameter estimates. Over-fitting covariates. Using individual-level and aggregate-level sample data. Need for anonymized group labels.
$W_5$: Unbiased parameter estimates. Over-fitting covariates. Using individual-level sample data. Using aggregate-level population data. Need for actual group labels at unit-level.
$W_6$: Unbiased parameter estimates. Over-fitting covariates. Using aggregate-level population and sample data. Need for actual group labels at area-level.

Generated population based on $P_2$:
$W_1$: Biased parameter estimates. Ignores the contextual effects. Using individual-level sample data. Need for anonymized group labels.
$W_2$: Approximately unbiased parameter estimates. Somehow considers the contextual effects. Using aggregate-level sample data. Need for anonymized group labels.
$W_3$: Approximately unbiased parameter estimates. No auxiliary sample data. Somehow considers the contextual effects. Using aggregate-level sample and population data. Need for actual group labels at area-level.
$W_4$: Approximately unbiased parameter estimates. Parameter estimation with minimum variance. Using individual-level and aggregate-level sample data. Need for anonymized group labels.
$W_5$: Unbiased parameter estimates. Parameter estimation with minimum variance. Using individual-level sample data. Using aggregate-level population data. Need for actual group labels at unit-level.
$W_6$: Unbiased parameter estimates. Parameter estimation with minimum variance. Using aggregate-level population and sample data. Need for actual group labels at area-level.

Table 4.8: Model Comparisons

$W_1$ vs. $W_2$: $W_2$ somehow considers contextual effects when $P_2$ is true. $W_2$ leads to less biased parameter estimates when $P_2$ is true. $W_1$ leads to more stable parameter estimates.
$W_2$ vs. $W_3$: $W_3$ is useful when there are not enough sample data on the auxiliary variable. Actual group labels are required for $W_3$.
$W_2$ vs. $W_6$: $W_6$ is the true sample model when $P_2$ is true. Using $W_6$, the number of required parameter estimates would be doubled. The problem of co-linearity should be considered in $W_6$. Actual group labels are required for $W_6$.
$W_2$ vs. $W_4$: $W_4$ leads to more stable parameter estimates when $P_2$ is true. Using $W_4$, the number of required parameter estimates would be doubled. $W_2$ just requires the aggregated-level data. $W_2$ somehow considers the possible area effect in the true population model.
$W_4$ vs. $W_5$: The problem of co-linearity should be more carefully considered in $W_4$. Actual group labels are required for W

Chapter 5 Model-Assisted Design-Based Simulation

5.1 Introduction

This chapter presents the results of a model-assisted design-based simulation study to empirically assess the bias and Mean Square Error (MSE) of synthetic estimators and EBLUPs based on the unit-level and area-level working models discussed in Chapter 4. The population data in this study are generated according to population models $P_1$ and $P_2$, based on available area information within Australia. There are six states and two mainland territories in Australia, and each is divided into several statistical sub-divisions by the Australian Bureau of Statistics (ABS), forming a total of 57 statistical sub-divisions. These sub-divisions are considered in different survey designs by the ABS. As a hypothetical example, we suppose that there is an interest in the mean value of income for the 57 statistical sub-divisions within Australia.

5.2 Methodology

Suppose that the Australian government is interested in reliable estimates of area means of income for different sub-national domains. The population data in this computational study are generated to resemble a real situation in Australia. The total number of individuals aged 15 and over for 57 sub-divisions in Australia is

shown in Table 5.9 on Page 164. Tables 5.10 and 5.11 on Pages 165 and 166 respectively contain the area means and standard deviations of weekly gross salary and hours worked for different sub-divisions obtained in the 2006 Australian Census. Note that information about a 1% random sample from the whole census was available for this study. Area-level information was also available on the ABS web-site. The area-level indicators provided in Tables 5.10 and 5.11 are calculated based on the information available to this study.

We assume that there is a linear relationship between the weekly gross salary (as the target variable) and the weekly hours worked for individuals aged 15 and over. In this simulation, population data are generated based on the population models presented in Tables 4.3 and 4.4. Therefore, the generated data contain no underlying relationship between the two variables beyond the one imposed by these models. However, it is important to investigate such relationships in practice. Model parameter values in this study are obtained based on available information on the relation between weekly gross salary and hours worked for individuals over 15 in the 2006 Australian Census.

A population of individuals was generated, with total size corresponding to the population aged 15 and over in the 2006 Australian Census. Population data are generated based on the assumed population models presented in (3.3) and (4.9). Then, based on a stratified sample design, a sample of size $n = 2133$ is determined in order to estimate the population mean (see Tables 5.9 and 5.12). Considering the area means as the targets of inference, synthetic estimates and EBLUPs are then calculated based on the six working models presented in Table 4.1 being fitted on the sample data. This allows a comparison to be made among different working models when area means of the auxiliary variable are involved in the population model as possible contextual effects.
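As a rough illustration of how a population could be generated under the two population models, the sketch below draws a single covariate per individual and builds the target variable with an optional contextual term. It is a minimal sketch under stated assumptions, not the generator used in this study: the covariate is drawn from a normal distribution with given area means and standard deviations, all parameter names and example values are hypothetical, and setting `beta_C = 0` is used to mimic $P_1$ while a non-zero `beta_C` mimics $P_2$.

```python
import random
from typing import List, Sequence, Tuple


def generate_population(area_sizes: Sequence[int],
                        x_means: Sequence[float],
                        x_sds: Sequence[float],
                        beta0: float, beta_I: float, beta_C: float,
                        sigma_u: float, sigma_e: float,
                        seed: int = 0) -> List[Tuple[int, float, float]]:
    """Return a list of (area index, x, y) records.

    y_ik = beta0 + beta_I * x_ik + beta_C * X_bar_k + u_k + e_ik,
    where X_bar_k is the realised population mean of x in area k.
    Normality of x, u and e is an assumption of this sketch.
    """
    rng = random.Random(seed)
    population = []
    for k, (N_k, m_k, s_k) in enumerate(zip(area_sizes, x_means, x_sds)):
        xs = [rng.gauss(m_k, s_k) for _ in range(N_k)]
        X_bar_k = sum(xs) / N_k              # population area mean (contextual variable)
        u_k = rng.gauss(0.0, sigma_u)        # area random effect
        for x in xs:
            e_ik = rng.gauss(0.0, sigma_e)   # individual error
            y = beta0 + beta_I * x + beta_C * X_bar_k + u_k + e_ik
            population.append((k, x, y))
    return population


if __name__ == "__main__":
    # Hypothetical values for two tiny areas, for illustration only.
    pop_p1 = generate_population([5, 3], [40.0, 35.0], [10.0, 8.0],
                                 beta0=100.0, beta_I=20.0, beta_C=0.0,
                                 sigma_u=50.0, sigma_e=200.0, seed=1)
    pop_p2 = generate_population([5, 3], [40.0, 35.0], [10.0, 8.0],
                                 beta0=100.0, beta_I=20.0, beta_C=5.0,
                                 sigma_u=50.0, sigma_e=200.0, seed=1)
    print(len(pop_p1), len(pop_p2))
```

Using the same seed for both calls keeps the covariate values and random effects identical, so that the two generated populations differ only through the contextual term.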
The area-specific information given in Tables 5.10 and 5.11 and some individual-level data available for this study are used in the following section to calculate parameter values for the population models in order to generate the target population. After the target population is generated based on the two models presented in Tables 4.3 and 4.4, sample units are selected by a stratified sampling method.

Based on a simple random sample design, a sample of size $n = 2133$ is determined

in order to estimate the actual population means. Note that the margin of error, which is 1.96 times the standard error, is $30. The variance of the target of inference within the whole population is taken to be the value calculated from the information available to this study.

\[
n = \frac{\bigl(\tfrac{z}{d}\bigr)^2\mathrm{Var}(y)}{1 + \tfrac{1}{N}\bigl(\tfrac{z}{d}\bigr)^2\mathrm{Var}(y)}, \qquad z = 1.96,\quad d = 30, \qquad\text{which gives } n = 2133.
\tag{5.1}
\]

The sample size is allocated to the 57 areas proportionally to population size. This means that the allocated sample size for the $k$th area is $n_k = (N_k/N)\,n$. This allocation provides samples from each area with which to calculate direct design-based estimates. Some of the 57 areas can be considered as small areas because their sample sizes are relatively small. The sample sizes allocated to the different sub-divisions can be seen in Table 5.12; they vary from 1 to 398, with an average sample size of 37.

As mentioned, the two populations are generated and then 1000 samples are selected using the presented stratified sampling design with the sample sizes specified in Table 5.12. The models presented in Table 4.1 are then fitted on each set of resulting sample data separately. Synthetic estimators and EBLUPs for area means are then calculated based on each working model.

Two different populations are generated in this study, and six working models are fitted on the sample data obtained from each population. EBLUPs and synthetic estimators are then calculated based on the parameter estimates obtained for each working model. Thus, 12 estimation techniques are to be evaluated in this study for each generated population. This evaluation helps us answer the following questions:

- What level of estimation should be used when there is no contextual effect in the actual population model?
- How do the contextual effects in the actual population model affect the estimation results for different small areas?
- How should a suitable working model be chosen when we do not know whether there are possible contextual effects within the actual population model?
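The sample-size calculation and the proportional allocation described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the formula is the usual finite-population sample-size expression for estimating a mean with margin of error $d$, the function names are hypothetical, the example inputs are illustrative and are not the values used in this study, and forcing at least one unit per area is a convenience assumed here rather than a rule stated in the text.

```python
from typing import List, Sequence


def required_sample_size(z: float, d: float, variance: float, N: int) -> float:
    """Sample size for a mean with margin of error d = z * standard error,
    including the finite-population correction (unrounded)."""
    n0 = (z / d) ** 2 * variance          # size needed for an infinite population
    return n0 / (1.0 + n0 / N)            # correction for a population of size N


def proportional_allocation(area_sizes: Sequence[int], n: int) -> List[int]:
    """Allocate n to areas proportionally to population size, n_k = (N_k / N) * n.
    Rounding means the allocated sizes need not sum exactly to n; the minimum
    of one unit per area is an assumption of this sketch."""
    N = sum(area_sizes)
    return [max(1, round(n * N_k / N)) for N_k in area_sizes]


if __name__ == "__main__":
    # Illustrative inputs only.
    print(required_sample_size(z=2.0, d=1.0, variance=100.0, N=1000))
    print(proportional_allocation([500, 300, 200], n=100))
```

For a very large population the correction term is negligible and the result is close to $(z/d)^2\,\mathrm{Var}(y)$, which is why small areas end up with very small allocated sample sizes under proportional allocation.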

The aim of this simulation is to empirically study the unit-level and area-level models introduced in previous chapters under two population models. Additionally, the efficiency of the EBLUP technique is examined in comparison with synthetic estimation.

In order to perform the experimental assessment, 1000 samples are selected in each case of the population generation. Then, the empirical MSE is used for the different estimation methods to quantify the amount by which the estimators differ from the true value of the quantity being estimated. The purpose is to estimate the means of the required variable $Y$ for different areas. Therefore, the empirical MSE can be calculated through the equation below:

\[
\mathrm{MSE}_k = \frac{1}{1000}\sum_{m=1}^{1000}\bigl(\hat{\bar Y}_k^{(m)} - \bar Y_k\bigr)^2; \qquad k = 1, \ldots, 57,
\tag{5.2}
\]

where $\hat{\bar Y}_k^{(m)}$ is the estimate of the $k$th area mean based on the $m$th sampling iteration.

Finally, to compare the MSEs of different estimators, the relative efficiency is calculated for each area separately. The bias and variance of each estimator are calculated as well. This helps us to empirically assess different aspects of each estimation method.

5.3 Population Model Parameters

As mentioned before, data available from the 2006 Australian Census are used in this simulation. The ABS confidentiality policies allow researchers to access information for only a limited number of unidentified individuals from the Australian Census 2006. Categorical information is available in this data set for weekly income and hours worked. Table 5.1 shows the categories considered in the Australian Census 2006 for individual weekly income and hours worked.
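The empirical MSE in (5.2), together with the empirical bias and variance mentioned in Section 5.2, can be computed for one area as in the following sketch. The function names are hypothetical, and the variance is computed with divisor $M$ (the number of replicates), an assumption made here so that the identity MSE = bias$^2$ + variance holds exactly.

```python
from typing import Sequence


def empirical_mse(estimates: Sequence[float], true_mean: float) -> float:
    """Average squared deviation of the replicate estimates from the true area mean."""
    return sum((e - true_mean) ** 2 for e in estimates) / len(estimates)


def empirical_bias(estimates: Sequence[float], true_mean: float) -> float:
    """Average of the replicate estimates minus the true area mean."""
    return sum(estimates) / len(estimates) - true_mean


def empirical_variance(estimates: Sequence[float]) -> float:
    """Variance of the replicate estimates around their own average (divisor M)."""
    mean = sum(estimates) / len(estimates)
    return sum((e - mean) ** 2 for e in estimates) / len(estimates)


if __name__ == "__main__":
    # Three hypothetical replicate estimates of one area mean whose true value is 10.
    reps = [9.0, 11.0, 13.0]
    print(empirical_mse(reps, 10.0))
    print(empirical_bias(reps, 10.0), empirical_variance(reps))
```

In the simulation described here, `estimates` would hold the 1000 replicate estimates for a given area and estimator, and the calculation would be repeated for each of the 57 areas.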

Table 5.1: Weekly Income and Hours Worked in Australian Census 2006

Weekly Income        Hours Worked
Negative income      Less than 1 hour and Not applicable
Nil income           1-15 hours
$1-$…                … hours
$150-$…              … hours
$250-$…              … hours
$400-$…              … hours
$600-$…              … hours
$800-$…              … hours and over
$1,000-$1,299        Not stated
$1,300-$1,599        Overseas visitor
$1,600-$1,999
$2,000 or more
Not stated
Not applicable
Overseas visitor

Here, we want to generate weekly income and hours worked as two related continuous variables based on the categorical information provided by the ABS. Note that only the available positive values are considered in this simulation. One possible way to generate a continuous value for each required quantity is to assume a uniform distribution within each category and to draw a random value from the interval recorded for each individual, separately. Here, $5000 is taken as the maximum weekly income and 100 hours as the maximum time spent working in a week. Based on this method, weekly income and hours worked for individuals over 15 are modeled in the graph shown in Figure 5.1, ignoring missing data and negative values. Individuals with zero income and people who did not work were also omitted.
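The uniform-within-category approach described above can be sketched as follows. The caps of $5000 and 100 hours and the intervals $1,000-$1,299, $1,600-$1,999, "$2,000 or more" and 1-15 hours come from the text; the lower bound of 40 hours used for the open-ended hours band, the record layout, and the function name are hypothetical.

```python
import random

MAX_WEEKLY_INCOME = 5000.0  # cap on weekly income stated in the text
MAX_WEEKLY_HOURS = 100.0    # cap on weekly hours stated in the text


def draw_within_category(bounds, rng):
    """Draw a continuous value uniformly from the recorded interval (lower, upper)."""
    lower, upper = bounds
    return rng.uniform(lower, upper)


# Each record stores the bounds of the categories reported for one individual.
records = [
    {"income": (1000.0, 1299.0), "hours": (1.0, 15.0)},
    {"income": (1600.0, 1999.0), "hours": (1.0, 15.0)},
    # Open-ended categories are closed with the caps; 40.0 is a hypothetical lower bound.
    {"income": (2000.0, MAX_WEEKLY_INCOME), "hours": (40.0, MAX_WEEKLY_HOURS)},
]

rng = random.Random(2006)
generated = [
    {
        "income": draw_within_category(record["income"], rng),
        "hours": draw_within_category(record["hours"], rng),
    }
    for record in records
]
print(generated)
```

Drawing income and hours independently within their intervals keeps each value consistent with its recorded category, but adds noise within wide categories such as the open-ended top income band.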

Figure 5.1: Fitting LMM on available Individual-level Data from Australian Census 2006

Information about the sub-divisions was not provided by the ABS for the individual-level data. Therefore, the OLS method is used in R to calculate initial values for the model regression parameters. The intercept and slope of the red line in Figure 5.1 form the estimate \hat{\beta}, and the standard deviation of the model residuals is computed from the same fit. As can be seen in Figure 5.1, the generated data are highly scattered around the fitted line, and the standard deviation of the model residuals is clearly too large.

Here, we want to select our model parameters so that they reflect the real situation in Australia. As the data provided for this study are not sufficient for a definitive decision, we tried several methods to calculate the model parameters more appropriately. To obtain a smaller standard deviation of the model residuals, another method was used to generate the original data. In this approach, the value closest to the red line in Figure 5.1 is selected for each individual within the recorded intervals. The resulting standard deviation of the model residuals is still large, but it is the minimum value that could be obtained from the available information and was used in this study.

Using the area-level information available from the whole census for the variables, based on the categories introduced in Table 5.1, the mean and standard deviation within the different sub-divisions are calculated (Tables 5.10 and 5.11).
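The two data-generation approaches above (uniform draws, then the value closest to the fitted line within each recorded interval) can be compared with the sketch below. The thesis used R; this Python version is illustrative only. The toy hours values and income intervals are hypothetical. Treating hours as fixed and adjusting only income is an assumption, as is measuring both residual standard deviations around the same initial line, because the text does not say whether the line was refitted.

```python
import math
import random


def ols_fit(x, y):
    """Intercept and slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_sd(x, y, intercept, slope):
    """Residual standard deviation around a given line, with n - 2 degrees of freedom."""
    squared = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    return math.sqrt(sum(squared) / (len(x) - 2))


def closest_to_line(x, interval, intercept, slope):
    """Value inside the recorded interval that is nearest to the fitted line at x."""
    lower, upper = interval
    return min(max(intercept + slope * x, lower), upper)


# Hypothetical toy data: hours worked and the recorded income interval per individual.
hours = [10.0, 20.0, 30.0, 38.0, 40.0, 45.0]
income_intervals = [(0.0, 200.0), (200.0, 400.0), (600.0, 800.0),
                    (400.0, 600.0), (1000.0, 1300.0), (800.0, 1000.0)]

rng = random.Random(1)
income_uniform = [rng.uniform(lo, hi) for lo, hi in income_intervals]
intercept, slope = ols_fit(hours, income_uniform)

income_closest = [closest_to_line(h, iv, intercept, slope)
                  for h, iv in zip(hours, income_intervals)]

sd_uniform = residual_sd(hours, income_uniform, intercept, slope)
sd_closest = residual_sd(hours, income_closest, intercept, slope)
print(sd_uniform, sd_closest)
```

Around a fixed line, the second residual standard deviation cannot exceed the first, because each adjusted value is the point of its interval nearest the line. This matches the reduction described in the text.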
