Technical Report Series GO Spatial Disaggregation and Small-Area Estimation Methods for Agricultural Surveys: Solutions and Perspectives


Technical Report Series GO Spatial Disaggregation and Small-Area Estimation Methods for Agricultural Surveys: Solutions and Perspectives September 2015


Table of Contents

Preface
Acknowledgements
Acronyms and Abbreviations
General Introduction

Review of Spatial Disaggregation and SAE Methods

1. Spatial Disaggregation: Mapping Techniques
  Introduction
  Areal interpolation methods
  Simple area weighting method
  Pycnophylactic interpolation methods
  Dasymetric mapping
  Examples
  Spatial disaggregation with interpolation based on Regression models
  Regression models
  The EM algorithm
  Examples

2. Small-Area Estimators
  Introduction
  A classification of SAE models
  Model-assisted estimators
  GREG estimator
  Example of calculation of GREG estimator
  Model-Based estimators: area-level
  FH-EBLUP
  Example of calculation of the FH-EBLUP estimator
  SEBLUP
  Example of calculation of FH-SEBLUP estimator
  Applications to agricultural data
  Final remarks on the FH-EBLUP and FH-SEBLUP
  Model-based estimators: unit-level
  EBLUP
  SEBLUP
  The MQ estimator
  The MQGWR estimator
  Example of calculation of EBLUP, MQ and MQGWR estimators
  Application to agricultural data
  Final remarks on the EBLUP, SEBLUP, MQ and MQGWR
  Extensions of the previous small-area models
  Semi-parametric Fay and Herriot model
  NPEBLUP specified at the unit level
  Non-parametric MQ specified at unit level
  GWEBLUP
  2.6.5 MBDE and SMBDE
  A note on Bayesian SAE methods
  SAE for binary and count data
  Geostatistical methods
  Geoadditive models
  Kriging
  GWR

References (Part I)

Resilience of SAE Methods to Non-Standard Situations
  Introduction

3. Sensitivity of SAE Predictors to Spatial Model Specifications
  Introduction
  Model-based simulation experiment
  Design-based simulation experiment
  Remarks and findings

4. The Modifiable Area Unit Problem
  Introduction
  An evaluation of the impact of the scale effect on SAE predictors and interpolation methods
  Remarks and findings

5. The Robustness of SAE Predictors
  Introduction
  Small-area robust estimators
  MQ estimators
  Robust EBLUP
  Assessment of the robustness of the EBLUP, MQ and robust EBLUP
  The RSEBLUP: robust SAE using geo-referenced information in the mixed-model approach
  Evaluating the Spatial REBLUP estimator using simulation studies
  Remarks and findings

6. The Complexity of Sample Design
  Introduction
  Design-consistent small-area estimators
  Expansion estimator
  Modified GREG estimators
  The Pseudo-EBLUP
  Weighted MQ estimators
  Simulation study of the impact of ignorable and non-ignorable designs
  Description of the simulation experiment
  Simulation results
  Investigating the impact of sampling designs on data interpolation
  A short introduction about the design effect on data interpolation
  A simulation experiment to assess the impact of the design effect on spatial interpolation
  Remarks and findings

7. Missing Data in Spatial Datasets
  Introduction
  Missing values in datasets: general concepts and solutions
  Multiple imputation
  Missing values in spatial data as measurement error
  Missing data in spatial analysis
  Missing spatial information
  Missing values in auxiliary and target variables
  Missing information in methods of data integration
  Remarks and findings

8. Analysis of Zero-Inflated Data in SAE
  Introduction
  Bayesian small-area estimator for zero-inflated data
  Frequentist SAE for zero-inflated data
  Empirical evaluation for the frequentist approach
  Remarks and findings

9. Final Remarks and Recommendations

References (Part II)
General Summary


Preface

This Technical Report on Spatial Disaggregation and Small-Area Estimation Methods for Agricultural Surveys: Solutions and Perspectives was prepared within the framework of the Global Strategy to Improve Agricultural and Rural Statistics. The Global Strategy is an initiative endorsed in 2010 by the United Nations Statistical Commission to provide a framework and a blueprint to meet current and emerging data requirements and the needs of policymakers and other data users. Its goal is to contribute to greater food security, reduced food price volatility, higher incomes and greater well-being for rural populations through evidence-based policies. The Global Strategy is centred upon three pillars: (1) establishing a minimum set of core data; (2) integrating agriculture into National Statistical Systems (NSSs); and (3) fostering the sustainability of the statistical system through governance and statistical capacity building.

The Action Plan to Implement the Global Strategy includes an important research programme to address methodological issues for improving the quality of agricultural and rural statistics. The intended outcome of the research programme is a set of scientifically sound and cost-effective methods that will be used as inputs to prepare practical guidelines for use by country statisticians, training institutions, consultants, etc.

To enable countries and partners to benefit at an early stage from research results that are already available, it has been decided to establish a Technical Reports Series to disseminate widely the available technical reports and advanced draft guidelines and handbooks. This will also provide an opportunity for countries to give feedback on the papers.
Technical reports and draft guidelines and handbooks published in this Technical Report Series have been prepared by senior consultants and experts and reviewed by the Scientific Advisory Committee (SAC)¹ of the Global Strategy, the Research Coordinator at the Global Office and other independent senior experts. For some of the research topics, field tests will be organized before final results are included in guidelines and handbooks.

The aim of this report on Spatial Disaggregation and Small-Area Estimation Methods for Agricultural Surveys: Solutions and Perspectives is to enhance disaggregation methods for adaptation to various agricultural situations and datasets.

Part 1 reviews the literature on this subject under two topics: i) mapping techniques and ii) small-area estimators. With regard to mapping techniques, the main areal interpolation methods based on regression techniques are presented. SAE methods are classified as: i) model-assisted methods, for example the generalized regression estimator; and ii) model-based methods, considered under unit-level and area-level specifications: the empirical best linear unbiased predictor, the M-quantile estimator and the Fay and Herriot estimator, with spatial specifications where available. Assumptions are explained and the information needed for each method is given, with illustrations from applications to rural and agricultural statistics or to socio-economic statistics.

Part 2 examines the reliability of the methods in non-standard situations that commonly arise in agricultural surveys. The main topics are sensitivity to spatial model specification, the modifiable area unit problem, robustness of predictors, complexity of sample design, missing data in spatial datasets and excess of zeros in survey data. This part analyses the methods presented in Part 1, presents the main contributions to the topics and proposes methodological and operational solutions.

Part 3 summarizes the issues in the review of mapping techniques and small-area estimators. It also offers remarks and recommendations based on the analysis of the reliability of the methods and draft guidelines for applying them in field tests.

¹ The SAC is composed of ten well-known senior experts in various fields relevant to the Research Programme of the Global Strategy. They are selected for a two-year term. The membership at the time of preparation of this report was composed of: Fred Vogel, Sarah Nusser, Ben Kiregyera, Seghir Bouzaffour, Miguel Galmes, Cristiano Ferraz, Ray Chambers, Vijay Bhatia, Jacques Delincé and Anders Walgreen.

Acknowledgements

This paper was prepared by Monica Pratesi (Professor) of the University of Pisa, with the assistance of Alessandra Petrucci (Professor) of the University of Florence and Nicola Salvati (Professor) of the University of Pisa, and supported by Caterina Giusti (PhD, researcher) and Stefano Marchetti (PhD, researcher) of the University of Pisa, with the guidance and supervision of Elisabetta Carfagna and Naman Keita of the Global Office of the Global Strategy to Improve Agricultural and Rural Statistics (FAO).

The report was reviewed by the Scientific Advisory Committee of the Global Strategy, who provided comments and inputs. Valuable inputs and comments were provided at various stages by Luigi Biggeri (Emeritus Professor) of the University of Florence and by Loredana di Consiglio of the Italian Istituto Nazionale di Statistica.

This publication was prepared with support from the Trust Fund of the Global Strategy funded by the United Kingdom Department for International Development and the Bill & Melinda Gates Foundation.

Acronyms and Abbreviations

ANC  Acid-Neutralizing Capacity
AvRBias  Average Relative Bias
BARE  Broad Area Ratio Estimator
CAR  Conditional Auto-Regressive
CV  Cross-Validation
DFID  Department For International Development
EBLUP  Empirical Best Linear Unbiased Predictor
EM  Expectation-Maximization (algorithm)
EMAP  Environmental Monitoring and Assessment Program
FAO  Food and Agriculture Organization of the United Nations
FH-EBLUP  Fay and Herriot Empirical Best Linear Unbiased Predictor
GIS  Geographic Information System
GLM  Generalized Linear Mixed (model)
GREG  Generalized Regression (estimator)
GREG-LV  GREG under Lehtonen-Veijanen specification
GREG-S  GREG with Sample weights
GWEBLUP  Geographically Weighted Empirical Best Linear Unbiased Predictor
GWR  Geographically Weighted Regression
GWR-W  Geographically Weighted Regression with Sample Weights
HT  Horvitz and Thompson (expansion estimator)
HUC  Hydrologic Unit Code
ISTAT  Istituto Nazionale di Statistica (Italian National Statistics Institute)
LMM  Linear Mixed Model
MAE  Model-Assisted Estimator
MAR  Missing At Random
MAUP  Modifiable Areal Unit Problem
MBDE  Model-Based Direct Estimator
MBE  Model-Based Estimator
MCAR  Missing Completely At Random
MCMC  Markov Chain Monte Carlo
MI  Multiple Imputation
ML  Maximum Likelihood
MNAR  Missing Not At Random
MPML  Multi-level Pseudo Maximum Likelihood
MQ  M-Quantile
MQ-GC  M-Quantile Geographic Coordinates
MQGWR  M-Quantile Geographically Weighted Regression (estimator)
MQGWR-CD  M-Quantile Geographically Weighted Regression
MQ-WR  M-Quantile Welsh-Ronchetti
MSE  Mean Squared Error
NPEBLUP  Non-Parametric Empirical Best Linear Unbiased Predictor
P-spline  Polynomial spline
RB  Relative Bias
REML  REstricted Maximum Likelihood
RRMSE  Relative Root Mean Squared Error
SAE  Small-Area Estimation
SAC  Scientific Advisory Committee (of the Global Strategy)
SAR  Simultaneously Auto-Regressive
SEBLUP  Spatial Empirical Best Linear Unbiased Predictor
SMBDE  Spatial Model-Based Direct Estimator
WMQ  Weighted M-Quantile
ZIEBLUP  Zero-Inflated data Empirical Best Linear Unbiased Predictor

General Introduction

This report presents methods for estimating agricultural and rural statistics at the local level: small-area estimation (SAE). The local level is the geographical level at which data are requested with a view to planning sub-regional policies or evaluating the results of policy. For this purpose a region can be split into subsets or domains of study. The local and regional levels will vary from country to country: administrative areas include municipalities and census divisions and localities such as the tehsil in India and the woreda in Ethiopia; the area level may also depend on the method applied. The domains may refer to a demographic group or a geographical area, or both.

In agro-environmental studies, such administrative areas are the traditional spatial area for which statistical data are available. Numbers 11, 12 and 14 of the Statistical Development Series of the Food and Agriculture Organization of the United Nations (FAO) review the data available in each country after the 2000 agricultural census.² Satellite imagery and geographic information systems (GIS) facilitate surveys of the dynamics of change in land use and cover at any required level of spatial resolution; the LUCAS project in Europe³ is an example. Some of this work focuses on yield forecasting and estimation only. In practice, however, the area harvested is needed to determine total production: even if a model provides information about crop yields, it may say nothing about the area being harvested. Unfortunately, the data needed to produce detailed information and maps for various phenomena are often unavailable.

FAO publications show that the main sources of detailed local-level agricultural and rural data are agricultural censuses, sample surveys and administrative registers.
But the censuses are conducted only periodically, and the FAO publications on the 2000 agricultural census reveal that national surveys of agricultural and other statistics differ widely and may be constrained by cost and other considerations. Not every country has a current survey of agriculture, and some countries, often developing countries, conduct agricultural censuses on a sample basis.⁴ And where a country has statistical data from which to compute indicators of land use, crop production, livestock, farm structure, incomes and living conditions, their quality at the local level is not homogeneous.

Two situations stem from the analysis of the main sources of agricultural and rural data:

i. Data on the target variable at the local level can be obtained from a survey such as an agricultural census or a sample or other survey.⁵ This will not be as exhaustive as the Cross-Cutting Experiment in India, for example, where the sample size and number of observations in the domain of interest are large enough to provide statistically sound estimates, and where information at the local level is available.

² No. 11, A system of integrated agricultural censuses and surveys; No. 12, World Census of Agriculture; No. 14, World Census of Agriculture, methodological review.

³ Since 2006 the EUROSTAT LUCAS survey has observed changes in land use and land cover in the European Union every three years. The latest survey in 2012 covered all 27 countries with observations at 270,000 locations.

⁴ Follow-up surveys can be register-based as in Denmark and Kuwait, or census-based as in India. See SDS 12, p. ix: The Denmark agriculture census is linked to the registers of Integrated Administration and Control System (IACS) which contain information relating to area under major crops for all the farms applying for crop subsidies. In India, the administrative functions of maintaining land ownership records and doing seasonal crop enumeration are vested in a single office at village level. The services of this office are utilized to carry out an agriculture census (limited to crops) once every five years by re-tabulating the land ownership registers to obtain a list of agricultural holders which provides the frame for the agriculture census and a follow-up survey.

⁵ In many administrative data sources such as registers of farmers, location information may be part of the original microdata set or its metadata.

ii. Local-level data on the target variable are not collected or are collected from a small number of sources in a sample survey. Such surveys can provide local-level data, but to be truly representative they must come from large samples, which increases the costs significantly.

Data collection for countries in special situations (developing countries are an example) requires a good deal of work because the data sources mentioned above are frequently not available. When the survey data provide few observations and cost constraints prevent additional surveys or additional sampling of the study area, existing information must be integrated and harmonized to produce credible statistics on the dynamics of change at the local level. In this case, an estimate of the target variable for local domains can be obtained from reliable data relating to a larger domain that includes the domains in question. In short, the available aggregate data for broad areas must be disaggregated at the local level for small areas.

A more reliable estimate of the target variable at the local level can be obtained when auxiliary variables are known for larger areas or domains and for the small areas of interest, for both sampled and non-sampled units in the domains. Spatial auxiliary information is crucial in many applications of the estimation method for local areas, because it can increase the efficiency and effectiveness of the estimations. Spatial auxiliary information can be derived from administrative archives and maps of the territory under study, and geographical information systems can provide spatial data relating to coverage, perimeters, extensions and distances. Today, the quality and coverage of spatial information on land use are generally satisfactory. Such information constitutes the bulk of relevant auxiliary information; this also applies to developing countries.
GIS satellites provide maps of land use from which indications of crop quantities and yields can be obtained, with acceptable and useful indications for official statisticians and other stakeholders. As we showed in a previous FAO report,⁶ there are several methods for estimating small areas for which no data or insufficient or low-quality data are available. These can be classified into four groups: data interpolation, data integration, data fusion and data disaggregation. The techniques of the first three groups are well known, and are used for crop-yield estimates; they usually use a detailed cell size or a grid laid over the study zone, and they integrate various sources of data.⁷ The application of remotely sensed data frequently depends on the characteristics of the study zone and the quality of satellite imagery.⁸ For these reasons it is difficult to provide a set of methods that will be useful for most countries.

Where there are credible aggregate data for large areas that include the small areas in question, data-disaggregation methods are applicable; this applies in developing and developed countries. These methods have general conditions of applicability, and they can be useful in countries involved in the Global Strategy because they can be adapted to a range of situations.

The spatial disaggregation techniques developed by GIS and geo-statistics researchers (Kim and Yao, 2010; Li et al., 2007) are used to break down maps and spatially aggregated data into a zoning system with finer spatial resolution. They are based on estimation and data-interpolation techniques, and take into account assumptions about the spatial distribution of the target variable or the relationship between the target variable and the auxiliary geographical variable. The local area (the target zone) is smaller in extent than the source zone, which is generally a broad area. In other words, local area is synonymous with area of small extension.

⁶ Review of projects and contributions on statistical methods for spatial disaggregation and for integration of various kinds of geographical information and geo-referenced survey data. Available at:

⁷ The history of applying GIS and remote sensing data analysis to crop production forecasting is long and rich. A comprehensive collection of the possible results using a variety of observation data and processing models is accessible through Crop Explorer gov/cropexplorer

⁸ Data integration, aggregation and fusion techniques are applied to a convergence of evidence including weather patterns, actual ground observations and remotely sensed data. There are, however, numerous countries where some of these areas of evidence are not available.

The SAE methods developed by sample survey statisticians are used to produce statistically sound estimates on the basis of data from surveys and administrative records when the sample size is small or equal to zero in the target area and provides few observations on the variable in question. The term small area is used to describe domains with too few observations to give statistically significant results. The estimates are based on the specification of a model that borrows strength from the related areas and links the study variable to the existing auxiliary information (Rao, 2010).

This report contains an overview of methods, particularly those related to the specification of statistical models for SAE, and considers the problems of some advanced SAE methods in particular situations or where deviation from standard assumptions occurs in agricultural surveys. Full understanding of the methods presented requires advanced knowledge of statistics, but our presentation in terms of simulation studies and actual applications should make them useful and accessible to people with relatively limited knowledge of statistics. Some of the methods are complex, and relevant software is identified where needed.

The report is in two parts, and contains nine chapters.

Part I: Review of spatial data disaggregation and SAE methods
Chapter 1: Mapping techniques.
Chapter 2: Small-area estimators.

These chapters review the most common methods for data disaggregation and the most effective models for SAE, which usually use spatial information obtained by aggregation and integration of existing data sources. The assumptions on which the methods are based are clarified.
The information needed for each method is described (auxiliary information, coordinates and geographical information, for example), and its role in the production of almost cost-free official statistics is suggested.⁹ Most of the methods are based on the assumption that the data come from a simple random sample of the population; extensions to more complex sampling systems are also presented. A comparison of the methods on the basis of available information is given, and there are examples of applications to rural and agricultural statistics or to general social and economic statistics.

Part II: Reliability of SAE methods in non-standard situations
Chapter 3: Sensitivity of SAE predictors to spatial model specifications.
Chapter 4: The modifiable area unit problem.
Chapter 5: Robustness of SAE predictors.
Chapter 6: The complexity of sample design.
Chapter 7: Missing data in spatial datasets.
Chapter 8: Excess of zeros in survey data.

There are several open issues with regard to the quality of small-area estimates. These are important for practitioners because they show the benefits and limitations of some advanced methods in specific situations or where deviations from standard assumptions occur in agricultural surveys. We refer in particular to the sensitivity and robustness of SAE methods in the specification of the model linking the study variable to spatial auxiliary information (see Chapter 3). The level of aggregation of spatial data to define target areas affects the fit of the model: in other words the area units are modifiable, and this affects the significance of the SAE models (see Chapter 4). The type and quality of available data determine the accuracy of small-area estimates: in general, the robustness of SAE predictors is important when applying the methods in the presence of outliers and data errors (see Chapter 5).
The data used for estimations often come from sample surveys that do not follow the simple random-sampling assumption common to many SAE models. We review the effect of the complexity of the sample design on the model (see Chapter 6), the effect of missing data in the spatial dataset used as auxiliary data (see Chapter 7) and of the excess of zeros in the study variable (see Chapter 8).

Chapter 9: Conclusions and final remarks. This summarizes our main findings and recommendations.

The report is intended to stimulate research and application studies of SAE applied to agricultural and rural statistics when geo-referenced auxiliary information is used. For this reason, and because many topics were intentionally omitted, it cannot be considered a compendium of methods of integrating spatial information into SAE applied to agricultural surveys. But it reflects, to the best of our knowledge, the state of the art with regard to several crucial issues.

⁹ The Australian Bureau of Statistics provides an online guide to SAE for stakeholders interested in having local data (see nss/home.nsf/). Interesting work has also been done on the deliverables of the ESSnet SAE and MEMOBUST handbook projects; see:

Review of Spatial Disaggregation and SAE Methods

1. Spatial Disaggregation: Mapping Techniques

1.1 Introduction

The main idea underlying the techniques in this chapter is to disaggregate spatially aggregated data into a zoning system of higher resolution. The original areas, with known data, are called source zones; the targeted areas are called target zones (Lam, 1983). Spatial disaggregation methods, which are based on areal interpolation techniques, can be classified according to various criteria such as underlying assumptions or the use of ancillary data (Wu et al., 2005). In all such techniques error is inevitably generated by the assumptions about the distribution of the objects (homogeneity of density, for example) or by the spatial relationship imposed in the disaggregation process (the size of the target zones, for example) (Li et al., 2007).

Areal interpolation, the process whereby data from one set of source polygons are redistributed to another set of overlapping target polygonal areas,¹⁰ is used primarily when the target variables are estimated on the basis of data available from various sources covering the same area, but with different internal boundaries. This approach frequently uses census data as the input, and applies interpolation or disaggregation techniques to obtain a refined population surface.

Two groups of techniques are considered below: i) interpolation based on proportionality to the density distribution of the target variable or of other, auxiliary, variables (the simple area-weighting method, pycnophylactic, or mass-preserving, interpolation methods, and dasymetric mapping); and ii) interpolation based on regression models, such as the expectation-maximization (EM) algorithm.
Other methods are then proposed to show some non-plausible hypotheses of the first set, and examples of practical applications are presented; further examples can be found at: Most of these methods require digital maps and GIS data to estimate target variables such as crop production, land use and pesticide use. 1.2 Areal interpolation methods Simple area weighting method The simplest interpolation approach for disaggregating data is the basic area weighting method, which apportions the attribute of interest by area, given the geometric intersection of the source zones with the target zones. This method assumes that the target variable y is uniformly distributed in each source zone. Given this hypothesis, the data in each target zone can be estimated as: (1.1) 10 Includes the simple case of the existence only of non-overlapping target areas in a source area. 17

18 where is the estimated value of the target variable in the target zone t, is the observed value of the target variable in source zone s, is the area of source zone s and is the area of the intersection of the source and target zones. This method satisfies the pycnophylactic or volume-preserving property, which requires the preservation of the initial data: the predicted value for source area s obtained by aggregating the predicted values at intersections with area s should coincide with the observed value for area s (Do et al., 2013; Li et al., 2007). Several studies show, however, that the overall accuracy of simple area weighting is low compared with other techniques (see, for example Langford, 2006; Gregory and Paul, 2005; Reibel and Aditya, 2006). To extend the assumption of homogeneity in the simple area-weighting method it is very rarely acceptable several approaches have been proposed. A number of studies, for example, aim to overcome the problem by smoothing with density functions such as kernel-based surface functions around area centroids, and there is Tobler s (1979) pycnophylactic-interpolation method (Kim and Yao, 2010) Pycnophylactic interpolation methods Tobler (1979) proposed the pycnophylactic interpolation method as an extension of simple area weighting to produce smooth population-density data from areally aggregated data. It calculates the target area values on the basis of the values and weighted distance-from-the-centre of neighbouring source areas, maintaining volume consistency in the source areas. It uses the following algorithm: 1. intersect a dense grid over the study region; 2. assign a value to each grid cell using simple area weighting; 3. smooth the values of all the cells by replacing each cell value with the average of its neighbours; 4. calculate the value in each source region by summing all the cell values; 5. weight the values of the target cells in each source area equally, so that source-area values are consistent; and 6. 
repeat steps 3 to 5 until there are no further changes to a specified tolerance. In this approach, the choices of the appropriate smooth-density function and of the search window size depend on the characteristics of individual applications. The underlying assumption is that the value of a spatial variable in neighbouring target areas tends to be similar; Tobler s first law of geography asserts that neighbouring things are more related than distant ones (Tobler, 1979). Comber et al. (2008), for example, refer to an application of pycnophylactic interpolation to agricultural data to identify land-use areas from aggregated agricultural census data Dasymetric mapping The dasymetric mapping method (Wright, 1936; Mennis and Hultgren, 2006, Langford, 2003) is different. To reflect density variation in source zones, this method uses other information x to distribute y: that is, it uses additional information to estimate the actual distribution of aggregated data with the target units of analysis with a view to allocating y to the small intersection zones in the sources provided that the relationship between x and y is proportional and strongly correlated. Hence this method replaces the homogeneity assumption of simple area weighting with the assumption that data are proportional to the auxiliary information on any sub-region. Considering a quantitative variable x, the dasymetric mapping method extends formula (1.1) by substituting x for the area: (1.2) The simplest scheme for implementing dasymetric mapping is to use a binary mask of land-cover types (Langford and Unwin, 1994; Langford and Fisher, 1996; Eicher and Brewer, 2001; Mennis and Hultgren, 2006). In this case the auxiliary information is categorical, and its level defines the control zones (see Figure 2 in Example 1: Mask area weighting). The classic case, called binary dasymetric mapping, is population estimation when there are two 18

19 control zones, one known to be populated and the other unpopulated, and it is assumed that the count density is uniform throughout the control zones. In this case formula (1.1) becomes: (1.3) where is the estimated population in the target zone t, is the total population in source zone s, is the source zone area identified as populated, and is the area of overlap between target zone t and source zone s, with land cover identified as populated. Several multi-class extensions to binary dasymetric mapping have been proposed (Kim and Yao, 2010; Mennis, 2003; Langford, 2006). Li et al. (2007) present three-class dasymetric mapping for population estimation that takes advantage of binary dasymetric mapping and a regression model with a limited number of ancillary class variables non-urban, low-density residential and high-density residential to present a range of residential densities in each source zone. The technique is based on a relaxed assumption about homogeneous density for each land class in each source zone: (1.4) Here, is the area of intersection between target zone t and source zone s identified as land class c, and is the area of source zone s identified as land class c. Hence represents the density estimate for class c in zone s. These densities can be estimated in a regression model, as described below. The dasymetric and pycnophylactic methods have complementary strengths and shortcomings for population estimation and target variable disaggregation. For this reason, several hybrid pycnophylactic/dasymetric methods have been proposed (Kim and Yao, 2010; Mohammed et al., 2012; Comber et al., 2008). These use dasymetric mapping for a preliminary population/variable of interest redistribution, and an iterative pycnophylactic-interpolation process to obtain a volume-preserved smoothed surface. Comber et al. 
(2008) use the hybrid method to disaggregate agricultural census data and obtain fine-grained 1 km² maps of agricultural land use in the United Kingdom.

Examples

Example 1 - Simple tabular and graphical examples

This first example uses simple cases to show how the different methods work. Figure 1.1 illustrates a hypothetical example taken from Shu et al. (2010), with data for three source areas that are to be split into 25 target areas. The example also compares the two approaches: volume-preserving and non-volume-preserving interpolation. Block (b) shows the results of applying the simple area weighting method; blocks (c) and (d) show the results obtained by applying volume-preserving and non-volume-preserving interpolation. In block (c), non-volume-preserving interpolation returns totals of 112, 76 and 304 for the three polygons; these differ from the original values of 90, 80 and 360 because the volumes are not preserved.

Figure 1.1: Example of volume-preserving interpolation

Figure 1.2 shows a simplified simulation of areal interpolation of artificial population data (see: integrated-assessment.eu/guidebook/spatial_disaggregation). It compares the simple area-weighting method with dasymetric mapping, taking into account the different density distributions of the areas. The simple area-weighting method attributes T = (100 × 0.25) + (60 × 0.25) = 40 persons to the target area. The calculation is done in proportion to the relative extent of the intersections between source and target areas, T_A and T_B. Mask area weighting limits the intersection to the populated zone, thereby modifying the relative weights of T_A and T_B to the values 0.5 and [...]. Finally, dasymetric disaggregation takes into account the distribution of the population in the intersections and their relative extents.
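The two weighting schemes described for Figure 1.2 can be sketched in a few lines of R. This is a minimal illustration rather than the procedure used to produce the figure: the source populations (100 and 60) and the overlap shares (0.25 each) are those quoted in the text, whereas the populated areas used for the mask are hypothetical values chosen only so that the first mask weight equals 0.5.

```r
# Simple area weighting for one target zone overlapping two source zones.
# Populations and overlap shares are the values quoted for Figure 1.2.
pop_source    <- c(A = 100, B = 60)      # population of source zones A and B
share_overlap <- c(A = 0.25, B = 0.25)   # share of each source zone's area inside the target

t_simple <- sum(pop_source * share_overlap)
print(t_simple)                          # 40

# Mask (binary dasymetric) weighting: shares are computed over populated land only.
# HYPOTHETICAL areas, used only to illustrate the calculation.
area_populated         <- c(A = 40, B = 50)   # populated area of each source zone
area_populated_overlap <- c(A = 20, B = 10)   # populated area falling inside the target

share_mask <- area_populated_overlap / area_populated   # 0.5 and 0.2
t_mask     <- sum(pop_source * share_mask)
print(t_mask)                            # 62
```

Restricting the weights to populated land changes the allocation even though the source totals are the same, which is the point of the mask.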

Figure 1.2: Simulation of areal interpolation of population data

Figure 1.3 shows an example of the results of mask area weighting used to disaggregate county pesticide usage in East Anglia in the UK, where the target areas are defined on a 5 × 5 km grid.

Figure 1.3: Disaggregation of county pesticide usage

Example 2 - Comparison of simple area weighting methods to estimate crop production, applied by using three different kinds of proportions

You and Wood (2006) presented an application in which Brazilian state-level production statistics were used to generate pixel-level crop production data for eight crops. The robustness of the results of this entropy-based approach was compared with short-cut approaches to allocating crop-production statistics. They examined three possible short-cut methods for assigning state-level crop areas to municipalities: i) in proportion to the total land area of the municipalities; ii) in proportion to the cropland area of each municipality; and iii) in proportion to the amount of biophysically suitable land for the production of each crop in each municipality. For all crops, the proposed approach was the most successful in predicting municipality crop areas, by a large margin in the case of wheat and beans. The simplest procedure, distributing crop production in proportion to the total areas of the municipalities, was the second-best method for maize and beans, which are grown extensively to meet ubiquitous demand for primary foods and commodities such as maize-based feed.

Example 3 - Application of pycnophylactic interpolation and dasymetric mapping

Comber et al. (2008) describe an approach combining dasymetric and volume-preserving techniques to create a national land-use dataset at 1 km² resolution. The results for an English county are compared with contemporaneous aggregated habitat data, and show that accurate estimates of local arable and grass land-use patterns can be obtained when individual 1 km squares are combined into blocks of > 9 squares, thereby providing local estimates of agricultural land use. This in turn allows more detailed modelling of land uses related to livestock and cropping activities.

Example 4 - Adaptation of dasymetric mapping to agricultural data

De Belém et al. (2012) examine the adaptation of dasymetric mapping methods to agricultural data, including testing and transposition, to recover the underlying statistical surface, that is, an approximation of the real distribution of the data. The method was applied in the Alentejo region of Portugal using data from the 1999 agricultural census; several counties were used as source zones. The aim was to generate a distribution of agro-forestry occupations as close as possible to reality. Two lines of analysis were followed: i) simultaneous application of the method to all counties to obtain a definition of regional densities; and ii) separate application of the method to different sub-areas with similar characteristics to obtain a definition of sub-regional densities. The results were validated through error indicators at the county level and in a sample of parishes. The second variant of the method, which gave more precise results and was better suited to the types of data available, yielded maps in which the distribution of the most relevant agro-forestry occupations was closest to reality.

1.3 Spatial disaggregation with interpolation based on regression models

Regression models

The dasymetric weighting schemes in the previous paragraph have several restrictions: i) the assumption of proportionality of y and x; ii) the fact that the auxiliary information should be known at the intersection level; and iii) the limitation to a single auxiliary variable. Spatial disaggregation techniques based on regression models can overcome these constraints (Langford et al., 1991; Yuan et al., 1997; Shu and Lam, 2011). Another limitation of the dasymetric method is that, when predicting at the level of the source/target intersection s-t, only the areal datum y_s of the source zone in which the intersection is nested is used for prediction.
This is not the case for regression: in general, regression techniques involve a regression of the source-level data of y on the target or control values of x. Generally speaking, regression models for estimating population counts assume that the given source zone population may be expressed as a sum of a set of densities related to the areas assigned to the different land classes. Other ancillary variables may be included for these area densities, but the basic model is:

$$y_s = \sum_{c} \beta_c A_{sc} + \varepsilon_s \qquad (1.5)$$

where $y_s$ is the total population count for each source zone s, c is the land cover class, $A_{sc}$ is the area size for each land class in each source zone, $\beta_c$ is the coefficient of the regression model and $\varepsilon_s$ is the random error. The output of the regression model is an estimate of the population densities.

A problem with this regression model is that the densities are derived from a global context and remain spatially stable in each land class across the study area; it has therefore been suggested that the locally fitted approach used by the dasymetric method will always outperform the global fitting approach used by regression models (Li et al., 2007). To overcome this limitation, locally fitted regression models have been proposed in which the globally estimated density for each land class is locally adjusted in each source zone by the ratio of the predicted population and census counts. The absolute value of the population densities then varies between source zones, reflecting differences in local population density. These methods were developed initially to ensure that the populations reported in target zones were constrained to match the sum of the source zones, that is, the pycnophylactic property.
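As an illustration of model (1.5) and of the local adjustment, the following R sketch fits global densities by least squares and then rescales them zone by zone. All data are hypothetical, the three land classes are named after those of Li et al. (2007) purely for readability, and the adjustment ratio is taken here as census count divided by predicted count, which is an assumption about the direction of the ratio.

```r
# HYPOTHETICAL source-zone data: area (km^2) by land class and census counts.
areas <- data.frame(
  non_urban = c(50, 30, 80, 20, 60, 10),
  low_res   = c(10, 20,  5, 25, 15, 30),
  high_res  = c( 2,  8,  1, 12,  4, 15)
)
pop <- c(420, 910, 260, 1300, 560, 1650)   # total population y_s per source zone

# Model (1.5): no intercept, one density coefficient per land class.
fit      <- lm(pop ~ 0 + non_urban + low_res + high_res, data = areas)
d_global <- coef(fit)                      # globally fitted densities

# Local adjustment: rescale the global densities in each source zone by the
# ratio of the census count to the count predicted by the global model.
ratio   <- pop / fitted(fit)
d_local <- outer(ratio, d_global)          # one row of densities per source zone

# With the adjusted densities each source-zone total is reproduced exactly.
pop_check <- rowSums(as.matrix(areas) * d_local)
print(round(d_global, 2))
print(round(unname(pop_check)))
```

The last check shows why the adjustment restores the pycnophylactic property at source-zone level: each row of adjusted densities, multiplied by the zone's land-class areas, sums to the census count.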

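1.3.2 The EM algorithm

Another statistical approach is based on the EM algorithm (Flowerdew and Green, 1992). Rather than using a regression approach, the interpolation problem is set up as a missing-data problem in which the intersection values of the target variable are unknown and the source values are known. The EM algorithm is used to predict the intersection values. This method is useful when the variable of interest is not a count but can be assumed to follow a normal distribution. Let $y_{st}$ be the mean of the $n_{st}$ values of the variable of interest in the intersection zone s-t, and assume that:

$$y_{st} \sim N\left(\mu_{st}, \frac{\sigma^2}{n_{st}}\right) \qquad (1.6)$$

The values $n_{st}$ are assumed to be known or are interpolated from the intersection areas $A_{st}$. Hence:

$$y_s = \frac{1}{n_s}\sum_{t} n_{st}\, y_{st} \sim N\left(\mu_s, \frac{\sigma^2}{n_s}\right) \qquad (1.7)$$

and:

$$\mu_s = \frac{1}{n_s}\sum_{t} n_{st}\, \mu_{st}, \qquad n_s = \sum_{t} n_{st} \qquad (1.8)$$

If the $y_{st}$ were known, we would obtain the mean in target zone t as $y_t = \frac{1}{n_t}\sum_{s} n_{st}\, y_{st}$, with $n_t = \sum_{s} n_{st}$. Setting $\hat{y}_{st} = y_s$ would give the simple areal weighting solution. With the EM algorithm, the interpolated values are instead obtained by alternating E-step and M-step operations until convergence is reached:

E-step: $\hat{y}_{st} = \hat{\mu}_{st} + (y_s - \hat{\mu}_s)$, where $\hat{\mu}_s = \frac{1}{n_s}\sum_{t} n_{st}\, \hat{\mu}_{st}$.

M-step: Treat the $\hat{y}_{st}$ as a sample of independent observations with distribution $N(\mu_{st}, \sigma^2/n_{st})$ and fit the model for $\mu_{st}$ by weighted least squares.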
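The E-step and M-step can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the implementation of Flowerdew and Green (1992): the intersection counts, source means and target-zone covariate are hypothetical, the mean model is assumed to be linear in a single covariate known for each target zone, and the E-step uses the conditional mean of a normal intersection value given the observed source mean.

```r
# HYPOTHETICAL inputs: 3 source zones (rows) and 4 target zones (columns).
n_st <- matrix(c(10,  5,  0, 0,
                  0,  8, 12, 0,
                  0,  0,  6, 9), nrow = 3, byrow = TRUE)  # units per intersection
y_s <- c(2.0, 3.5, 5.0)            # observed source-zone means
x_t <- c(0.1, 0.4, 0.6, 0.9)       # covariate known for each target zone

cells <- which(n_st > 0, arr.ind = TRUE)   # intersections that actually exist
s_idx <- cells[, "row"]
t_idx <- cells[, "col"]
w     <- n_st[cells]                       # weights n_st
n_s   <- rowSums(n_st)
n_t   <- colSums(n_st)

y_hat <- y_s[s_idx]                        # start from simple areal weighting
for (iter in seq_len(200)) {
  # M-step: weighted least squares of the current y_hat on the covariate
  fit   <- lm(y ~ x, data = data.frame(y = y_hat, x = x_t[t_idx]), weights = w)
  mu_st <- as.numeric(fitted(fit))
  # E-step: shift the fitted means so that each source-zone mean is reproduced
  mu_s  <- as.numeric(tapply(w * mu_st, s_idx, sum)) / n_s
  y_new <- mu_st + (y_s - mu_s)[s_idx]
  converged <- max(abs(y_new - y_hat)) < 1e-8
  y_hat <- y_new
  if (converged) break
}

# Interpolated target-zone means: weighted mean of y_hat over source zones
y_t <- as.numeric(tapply(w * y_hat, t_idx, sum)) / n_t
print(round(y_t, 3))
```

Whatever the number of iterations, the E-step guarantees that the weighted mean of the intersection values within each source zone equals the observed source mean, so the interpolation preserves the source data.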

These steps are repeated until convergence, and the interpolated values $\hat{y}_t$ are then computed as the weighted mean of the $\hat{y}_{st}$ from the E-step:

$$\hat{y}_t = \frac{1}{n_t}\sum_{s} n_{st}\, \hat{y}_{st} \qquad (1.9)$$

If convergence cannot be achieved, an alternative non-iterative scheme can be used (see Flowerdew and Green, 1992). Analogous regression models can also be used to disaggregate count, binary and categorical data (Langford and Harvey, 2001; Tassone et al., 2010).

Examples

Example 5 - Application of regression models to describe the spatial patterns of corn yield

A study by Kaspar et al. (2003) developed a linear-regression model to describe the spatial patterns of corn yield for a 16 ha field in central Iowa, USA. The study examined the relationship between corn (Zea mays L.) yield in six crop years and relative elevation, slope and curvature. Relative elevations were measured by GIS imagery; slopes and curvatures were then determined by digital terrain analysis. The data showed that in the four years with less rain than usual in the growing season, corn yield was negatively correlated with relative elevation, slope and curvature, whereas in the two years with more rain than usual, yield was positively correlated with relative elevation and slope. A multiple linear regression model based on relative elevation, slope and curvature was developed that predicted 78 percent of the spatial variability of the average yield of the transect plots for the four dry years, and that identified the spatial patterns in the entire field for yield monitoring data from 1997, which was one of the dry years. The relationship between terrain attributes and corn yield spatial patterns may provide opportunities for site-specific crop management.
Example 6 - Application of the EM algorithm to produce population density grids

Gallego (2010) described four methods for producing dasymetric population density grids by combining population data by commune with the CORINE land-cover map, which is available across the European Union. The four methods apply different versions of the dasymetric method and the EM algorithm. An accuracy assessment in five countries for which a reliable 1 km population-density grid exists showed that the improvement compared with the choropleth map by commune ranged from 20 percent for the weakest result in Finland to 62 percent for the best result in the Netherlands. All methods overestimate populations in agricultural, heterogeneous and forest areas; the overestimation is often smaller for the EM method, but this approach significantly overestimates the population in the class "infrastructure" because that class appears mainly in highly populated communes.

2. Small-Area Estimators

2.1 Introduction

Many target parameters in agricultural and rural statistics can be expressed in the form of means and percentages.[11] A common practice is to estimate these quantities for sub-populations or domains with survey data but, as stated in the Introduction, there are geographic domains, or areas, for which sufficiently precise direct estimates cannot be produced. Survey designs usually focus on achieving a particular degree of precision for estimates at a level of aggregation higher than that of small areas. Knowledge of the parameter for a given domain or small area can be obtained in three ways, depending on the level and type of information available from administrative archives and survey data:[12]

The broad area ratio estimator (BARE) is one of the simplest types of small-area model; it is applicable when the study variable is known for a larger domain in which the small area is included. In this case the estimator is calculated by applying the rate for a broad area obtained from a survey (for example, crop yield rates at the district or community-development block level from the crop-cutting experiments in India[13]) to the small-area population obtained from a population census or demographic estimate. The success of BARE relies on the choice of the broad area, which must be large enough in terms of sample size to allow for a reliable direct survey estimate, but small enough to support the assumption that the small areas within it are homogeneous in terms of the characteristic of interest. This is a major assumption, so users must be aware of it. As with direct estimation, BARE can be used to validate more complex approaches.

The BARE with auxiliary data approach uses information that is correlated with the variable of interest and is available at the small-area level to derive an estimate adjusted for compositional differences in small areas.
It is a deterministic model that assumes that crop yield rates vary only by household size or farm size; it does not allow for other effects. It can, however, be applied in association with a broad area ratio estimate. As in BARE, there is a strong underlying assumption of homogeneity in broad areas. Survey evidence shows that in many developing countries crop-yield rates can be correlated with household size. This could mean that areas with a large proportion of large households will have high crop yields. The estimator applies household size and crop yields to a small-area population classified by household size.

These two methods of estimation for small domains are similar to the simplest spatial disaggregation techniques in chapter 1, but they are based on different assumptions. The spatial disaggregation techniques disaggregate maps into more detailed maps on the basis of the assumed spatial relations between source and target zones. The BARE estimators start from survey data and estimates, and spatial relations are not essential.

[11] In Europe, many agro-environmental indicators are expressed in percentages, combining different kinds of data with arable land or, mostly, the utilized agricultural area: the total area taken up by arable land, permanent grassland, permanent crops and kitchen gardens (see Eurostat, LUCAS).
[12] See also: Australian Bureau of Statistics. A Guide to Small-Area Estimation. Canberra.
[13] In this context, crop yield, or agricultural output, refers to the measure of the yield of a crop per unit area of land cultivation (see Sud et al., 2011).

In a third approach based on SAE methods, spatial relations between areas can be inserted in the definition of model-assisted estimators (MAE) and model-based estimators (MBE), which are based on regression models and make it possible to base small-area predictions on a number of variables. These come from sources other than the sample survey and refer to the area of interest (a municipality or a district, for example) or to the unit of interest,
which could be a person, a rural household or a farm. In the first case area-level small-area models are defined; in the second, unit-level small-area models.

The term small area is used to describe domains whose sample sizes are not large enough to enable sufficiently precise direct estimates. In practice it is not possible to plan for all possible areas or domains and uses of survey data, because "the client will always require more than is specified at the design stage" (Fuller, 1999). When direct estimation is not possible, one must rely on alternative model-based methods for producing small-area estimates: these depend on the availability of population-level auxiliary information related to the variable in question and use linear mixed models; they are commonly referred to as indirect methods (see Rao, 2003; Ghosh and Rao, 1994; Pfeffermann, 2002; Jiang and Lahiri, 2006a; and Pfeffermann, 2013). Accurate spatial data can nowadays be derived from satellite imagery and GIS.

The SAE models are generally regression models. Figure 2.1 shows how an SAE model works with the MAE or MBE approach. To understand the characteristics of SAE methods, suppose that a population U of size N is divided into m non-overlapping subsets $U_i$ (i = 1, ..., m), which may be domains of study or small areas, of size $N_i$. These domains refer to geographical areas such as municipalities or census divisions, or to an agricultural group such as a type of production or a farm, or to a demographic group such as a population defined by age, gender and race in a large geographical area, or to a cross-classification of these. The index j identifies the units of the population, the index i the small areas. The population data consist of values $y_{ij}$ of the variable of interest, and of values $x_{ij}$ of a vector of p auxiliary variables. A sample s of units is drawn from the population according to some sampling design such that the inclusion probability of unit j in area i is given by $\pi_{ij}$.
The values of $y_{ij}$ are known for the area-specific samples $s_i$ of size $n_i$, and unknown for each unit of the set $r_i$, which contains the $N_i - n_i$ non-sampled units in small area i. The p-vector of auxiliary variables $x_{ij}$ is known for each unit of area i from external sources such as a census. At least the area-level totals or means are accurately known for all the small areas of interest. Spatial information can enrich the auxiliary variables for sampled and non-sampled units. Note that an area-specific sample of size $n_i > 0$ may not be available for every area. The sample design may leave some areas with $n_i = 0$, in which case $s_i$ is the empty set and the areas are out-of-sample areas. Figure 2.1 shows the fixed-effects covariates $f(x_{ij})$ and the random effects at the area level, $g(u_i)$, and at the individual level, $e_{ij}$.

Figure 2.1: How an SAE unit-level model works

Once the model has been fitted to the data, the SAE estimates are obtained by combining the direct estimates and the predicted values $\hat{y}_{ij}$. Their properties are evaluated in the MAE or MBE approach (see section 2.2) to find efficient predictors in terms of mean squared error (MSE) for the target area parameters. The objective of SAE predictors is to produce accurate estimates for small areas, and they should improve on the precision of direct estimates. Although direct estimators have several sound properties, direct estimates often lack precision when domain sample sizes are small.

2.2 A classification of SAE models

As we have indicated, target variables at the area level can be estimated through: i) the design-based approach (see Hansen et al., 1953; Kish, 1965; Cochran, 1977); ii) the MAE approach (see Särndal et al., 1992); and iii) the MBE approach (see Ghosh and Meeden, 1997; Valliant et al., 2000; Rao, 2003). The resulting small-area estimates are either direct or indirect (see Figure 2.2).

Direct estimates are obtained in the design-based approach from the data of a single survey by applying Horvitz-Thompson-type estimators. For the production of significant direct estimates at the small-area level, design issues that affect small-area estimation must be considered, particularly in the context of large-scale surveys. Rao (2003) discusses some of these design issues, and refers to Singh et al. (1994) for detailed analysis. When designing a survey, the use of direct estimators in SAE can be facilitated by: i) minimizing clustering; ii) replacing broad strata with several narrow strata from which samples are drawn; iii) adopting compromise sample allocations to satisfy reliability requirements at the small-area level and the large-area level; and iv) integrating surveys such as dual-frame surveys and repeated surveys.

The indirect estimates use auxiliary information, or variables, to improve the accuracy of survey estimates and to break down known values for large areas by using regression models. Indirect estimates are obtained in the model-assisted and model-based approaches, where a statistical model, usually a regression model, is specified with a view to borrowing strength from the auxiliary variables. Figure 2.2 shows the classification of SAE methods used in this report.

Figure 2.2: A classification of SAE estimation methods

In the MAE approach estimators generally have design-based properties, and their accuracy as measured by MSE is derived with respect to the sampling design used to collect the survey data. In the MBE approach the properties of the estimators and their accuracy are evaluated under the statistical model specified to borrow strength from the auxiliary variables.

In the last 30 years, indirect estimates have become popular. The generalized regression (GREG) estimator and its modifications, together with the empirical best linear unbiased predictors (EBLUP) specified in linear mixed models (LMMs), are currently applied by many statistics agencies. In LMMs the distribution of the study variable is a function of area-specific random effects and of unit random effects. Area random effects make it possible to include differences between areas in the model. The characteristics of the data available for the study have motivated the specification of area-level and unit-level models. The best linear unbiased predictors can be obtained from an area-level model, the Fay and Herriot model (FH-EBLUP), or from a unit-level model (EBLUP; Henderson, 1975; Rao, 2003), which takes into account the area-specific random effect and the individual random effect.

These predictors can incorporate geographic information referring to the areas of interest, as in the spatial empirical best linear unbiased predictor (SEBLUP). Models can include random effects; those that do not are known as synthetic models. All SAE models must reflect the nature of the underlying data (continuous, count or categorical, for example) and must take account of specific characteristics of the distribution of the target variable, for instance through non-parametric specifications or methods unaffected by outliers.

A recent approach to SAE is based on the use of M-quantile (MQ) models, which are specified at the unit level (Chambers and Tzavidis, 2006). Differences between areas can be captured through M-quantile coefficients. This approach can be extended with the M-quantile geographically weighted regression (MQGWR) estimator (Salvati et al., 2012). Geostatistical models such as geo-additive models, kriging and MQGWR can also play a role in the spatial extension of SAE estimators (see section 2.7 of Part I and chapter 3 of Part II).

A number of further developments have taken place in the SAE literature in recent years. The estimation of parameters other than averages and totals is the subject of several papers: examples include quantiles of the small-area distribution function of the outcome of interest (Tzavidis et al., 2010) and complex indicators (Molina and Rao, 2010; Marchetti et al., 2012). Opsomer et al. (2008) focused on non-parametric versions of the random-effects model; others focused on the specification of models that borrow strength in spatial terms by applying models with spatially correlated or non-stationary random effects (Benedetti et al., 2012; Salvati et al., 2012; Chandra et al., 2012). The issue of outlier-robust SAE has attracted interest mainly because in many real data applications the Gaussian assumptions of the conventional random effects model are not satisfied (Sinha and Rao, 2009).
Categorical survey variables are not suited to standard SAE methods based on LMMs. One option in such cases is to adopt an empirical best predictor based on generalized LMMs. Some details are given in section 2.6.7; a Bayesian approach to the non-spatial and spatial mixed effects models for SAE is described in section 2.6.6.

The main estimators of the MAE and MBE approaches are described in sections 2.3, 2.4 and 2.5. We give an example of calculation for each estimator, and present previous work applied to agricultural data. For each estimator, the data needed to apply it are specified, its advantages and disadvantages in comparison with others are discussed, and the extensions applied to overcome the disadvantages are identified. Additional extensions or alternatives to the main solutions that will be used in Part II are reviewed in section 2.6.

Model-assisted estimators

GREG estimator

GREG design-based estimators were introduced for SAE by Särndal (1984). The class of GREG estimators, which encompasses a range of estimators assisted by a model, is characterized by asymptotic design-unbiasedness and design consistency. GREG estimators of the small-area mean share the following structure:

$$\hat{\bar{y}}_i^{GREG} = \frac{1}{N_i}\left\{ \sum_{j \in U_i} \hat{y}_{ij} + \sum_{j \in s_i} w_{ij}\,(y_{ij} - \hat{y}_{ij}) \right\} \qquad (2.1)$$

where $\hat{y}_{ij}$ is the value predicted by the assisting model for unit j in area i and $w_{ij}$ is the sampling weight of that unit.

Different GREG estimators are obtained in association with different estimation models, that is, different models for calculating the predicted values $\hat{y}_{ij}$, $j \in U_i$. To define this estimator and subsequent estimators we assume that $x_{ij}$ contains 1 as its first component. In the simplest case, a fixed-effects regression model is assumed: $E(y_{ij}) = x_{ij}^T\beta$, $j \in U_i$, $i = 1, \dots, m$, where the expectation is taken with respect to the assisting model. When sampling weights are used in the estimation of the regression model, this leads to the estimator GREG-S:

$$\hat{\bar{y}}_i^{GREG\text{-}S} = \frac{1}{N_i}\left\{ \sum_{j \in U_i} x_{ij}^T\hat{\beta} + \sum_{j \in s_i} w_{ij}\,(y_{ij} - x_{ij}^T\hat{\beta}) \right\} \qquad (2.2)$$

where $\hat{\beta} = \left(\sum_{i}\sum_{j \in s_i} w_{ij}\, x_{ij} x_{ij}^T\right)^{-1} \sum_{i}\sum_{j \in s_i} w_{ij}\, x_{ij}\, y_{ij}$ and $w_{ij} = 1/\pi_{ij}$ (Rao, 2003, section 2.5). Note that in this case the regression coefficients are calculated on the basis of data from the whole sample and are not area-specific.

Table 2.1 summarizes the characteristics of the linear GREG estimator, with a focus on its underlying assumptions, its behaviour as an out-of-sample predictor, its design consistency and its robustness against outliers. It also highlights its advantages and disadvantages, which determine its extensions.

Table 2.1: Linear GREG: advantages, disadvantages, extensions

Properties | Advantages | Disadvantages | Extensions
Model assumptions | Design-based, from linear-regression model | One-level linear regression only, with fixed effects | Two-level linear model extension (Lehtonen and Veijanen, 1999)
Design consistency | Asymptotic design-unbiasedness and design consistency | Sensitivity to extreme values of sampling inclusion probabilities | -
Robustness to outliers | - | Not robust against outliers | Robust GREG (Duchesne, 1999)
Out-of-sample predictions | Yes | Prediction not inclusive of spatial information | Spatial versions not yet developed; spatial auxiliary information admitted (coordinates of sampled and non-sampled units)

Model assumptions

The model assisting the estimation is a fixed-effect linear model with common regression parameters, as in Rao (2003), Section 2.5.
In this case the resulting small-area estimators can overlook the area effects (the inter-area variation beyond that accounted for by model covariates) and may be inefficient. For this reason, Lehtonen and Veijanen (1999) introduce a supporting two-level model where $E(y_{ij} \mid u_i) = x_{ij}^T(\beta + u_i)$, which is a model with area-specific regression coefficients. In practice not all coefficients need to be random, and models with area-specific intercepts that mimic LMMs may be used (see Lehtonen et al., 2003). In this case the estimator GREG-LV takes the form (2.1) with $\hat{y}_{ij} = x_{ij}^T(\hat{\beta} + \hat{u}_i)$. The estimators $\hat{\beta}$ and $\hat{u}_i$ are obtained by using generalized least squares and restricted maximum likelihood methods (see Lehtonen and Pahkinen, 2004, section 6.3).

Design consistency

Design consistency is a general-purpose form of protection against model failures in that it guarantees that estimates make sense even if the assumed model fails completely, at least for large domains. The GREG estimator is asymptotically design-unbiased and consistent, but it can be sensitive to extreme values of inclusion probabilities (Fabrizi et al., 2014). GREG estimators supported by LMMs rely on model-based estimation of the parameters of the model, so the efficiency of the resulting small-area estimators depends on the validity of the model assumptions, and typically on the normality of the residuals.
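The GREG-S calculation can be made concrete with a short base-R sketch. It is only an illustration under stated assumptions: the function name `greg_s`, its arguments and all the data are hypothetical, a single auxiliary variable is used, and areas with no sampled units receive the purely synthetic prediction because their weighted residual sum is empty.

```r
# GREG-S sketch: one survey-weighted regression fitted on the whole sample,
# then area means from population predictions plus weighted sample residuals.
# greg_s() and its arguments are HYPOTHETICAL names used for illustration.
greg_s <- function(y, x, w, area, x_pop, area_pop) {
  fit   <- lm(y ~ x, weights = w)          # weighted least squares, common beta
  beta  <- unname(coef(fit))
  areas <- sort(unique(area_pop))
  sapply(areas, function(a) {
    in_pop <- area_pop == a
    in_s   <- area == a
    synth  <- sum(beta[1] + beta[2] * x_pop[in_pop])                   # model predictions
    corr   <- sum(w[in_s] * (y[in_s] - beta[1] - beta[2] * x[in_s]))   # 0 if area unsampled
    (synth + corr) / sum(in_pop)
  })
}

# HYPOTHETICAL population of 10 units in three areas; area C is out of sample.
x_pop    <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
area_pop <- rep(c("A", "B", "C"), times = c(4, 4, 2))

# Sample of two units from each of areas A and B, each with weight N_i / n_i = 2.
x    <- c(1, 3, 5, 8)
y    <- c(5.2, 10.9, 17.1, 25.8)
w    <- c(2, 2, 2, 2)
area <- c("A", "A", "B", "B")

est <- greg_s(y, x, w, area, x_pop, area_pop)
print(round(est, 2))
```

For areas A and B the estimate is the synthetic mean corrected by the weighted residuals of the sampled units; for area C only the synthetic part contributes.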

Robustness to outliers

The GREG and GREG-S expressions allow for survey weighting of outlying observations, but this does not guarantee protection against them. A robust version of GREG was proposed in Duchesne (1999).

Predictions for out-of-sample areas

Predictions for the out-of-sample areas (those with zero sample size) are based on the estimated parameters of the linear regression model and on the auxiliary information X:

$$\hat{\bar{y}}_i = \frac{1}{N_i}\sum_{j \in U_i} x_{ij}^T\hat{\beta} = \bar{x}_i^T\hat{\beta} \qquad (2.3)$$

Spatial versions of the GREG estimator have not yet been developed. Nonetheless, the coordinates of the positions of the sampled and non-sampled units and other auxiliary geographical variables referring to the same area can be included in the regression model (see Part II, chapter 3). This method takes spatial interaction into account when it results from the covariates themselves and not from the spatial relation between the areas in the study zone.

Example of calculation of GREG estimator

This example shows how to apply the generalized-regression estimator (2.1) introduced by Särndal (1984) to obtain small-area estimates of area mean values. The target parameter is the mean forest biomass per hectare in municipalities, taken as small areas. The data are from the Norwegian National Forest Inventory, which provides estimates of forest parameters at the national and regional scales from a network of permanent sample plots. The dataset is in the public domain, and detailed information is available in Breidenbach and Astrup (2012). The application uses R, a language and environment for statistical computing and graphics that can be downloaded free of charge. It offers many packages of routines and functions to implement SAE techniques. The forest in Vestfold county in Norway is a finite population subdivided into 14 municipalities, which are the small areas of interest.
Above-ground forest biomass per hectare is the variable of interest, and mean forest biomass per hectare in the municipalities is the population characteristic of interest. Data on forest biomass per hectare (biomass/ha) are available for 145 sample plots; auxiliary data on mean canopy height are also available from GIS images. Table 2.2 shows the first 6 of the 145 lines on the sample plots of the Norwegian National Forest Inventory. The R package JoSAE contains the function eblup.mse.f.wrap, which can be used to obtain GREG and EBLUP small-area estimates. The data can be loaded into R by using the commands:

    library(JoSAE)
    data(JoSAE.sample.data)

Table 2.2: Norwegian National Forest Inventory sample data

sample.id | domain.id | biomass.ha | mean.canopy.ht

The relationship between the target and the auxiliary data available from the sample is shown in Figure 2.3. The command to obtain the scatterplot is:

    plot(biomass.ha ~ mean.canopy.ht, data = JoSAE.sample.data)

Figure 2.3: Scatterplot of the biomass/ha vs mean canopy height sample data

Table 2.3: Population data of mean canopy height from digital aerial images

domain.id | N.i | mean.canopy.ht.bar

Data on the mean canopy height are also available for all the population elements. The population here is the forest covered by GIS images, from which the mean canopy values are available. Hence the population elements are the tiles in the forest for which the auxiliary variable, mean canopy height, was calculated from the image data. To load the data in R, use the command:

    data(JoSAE.domain.data)

Using the data in Tables 2.2 and 2.3 we can obtain small-area GREG estimates of the mean forest biomass/ha in the 14 municipalities. First, an LMM must be estimated to obtain the predicted values $\hat{y}_{ij}$.

The R commands are:

    library(nlme)  # provides lme()
    fit.lme <- lme(biomass.ha ~ mean.canopy.ht, data = JoSAE.sample.data,
                   random = ~1 | domain.id)

where biomass.ha is the response variable, mean.canopy.ht is the auxiliary variable, JoSAE.sample.data is the data source and random = ~1 | domain.id indicates that an LMM is being fitted in which the second-level units are identified by domain.id.

Check that the name of the auxiliary variable is the same in the population and sample datasets. This is not the case in the package example data, so the name of the mean canopy data must be changed, for example from mean.canopy.ht.bar to mean.canopy.ht. This can be done in R with the commands:

    d.data <- JoSAE.domain.data
    names(d.data)[3] <- "mean.canopy.ht"

This provides all the information needed to obtain the GREG estimates by using the eblup.mse.f.wrap function:

    results <- eblup.mse.f.wrap(domain.data = d.data, lme.obj = fit.lme)

The eblup.mse.f.wrap function takes two arguments: domain.data, which contains the population data, in this case the dataset d.data, and lme.obj, which contains the fitted LMM, in this case fit.lme. The function automatically produces several results, including GREG point and MSE small-area estimates. These results can be obtained with the commands:

    # column names GREG and GREG.se as in the header of Table 2.4
    results.greg <- cbind(results$GREG, results$GREG.se)
    results.greg

The output is a matrix with 14 rows, one per municipality, with the point estimates in column [,1] and their standard errors in column [,2]; the standard error is NA in rows 1, 12 and 13. The results are also shown in Table 2.4.

Table 2.4: Point and MSE GREG estimates of the mean forest biomass/ha for the 14 municipalities
Columns: domain.id, GREG, GREG.se

Note that for the three municipalities with only one sample observation (domains 1, 12 and 13) no MSE values are estimated.

Table 2.5 summarizes the data and software needed to implement the GREG estimator. It is data-hungry in that unit-level information is required to fit the model. Given the estimated regression parameters, however, the out-of-sample predictions can be obtained even when only the population-level average values of the auxiliary variables are known. The method is popular, and routines to implement it can be downloaded free from several websites. In this paper, the main references are the websites of two SAE projects funded by the European Commission: the EURAREA project (footnote 14) and the SAMPLE project (footnote 15). The Italian Istituto Nazionale di Statistica (ISTAT; National Statistics Institute) also provides the SMART2 software (Fasulo et al., 2013). In any case, GREG estimates can be obtained by applying the R functions described in this section.

Table 2.5: Model-assisted methods: data needed and available software
GREG
  Study variable Y: individual Y microdata classified by areas (sampled units) + individual survey weights (sampled units)
  Aux info X: individual X microdata classified by areas (sampled and non-sampled units)
  Software: EURAREA project website; AMELI project website; ISTAT R functions
GREG_S
  Study variable Y: individual Y microdata classified by areas (sampled units) + individual survey weights (sampled units)
  Aux info X: individual X microdata classified by areas (sampled and non-sampled units)
  Software: EURAREA project website; AMELI project website; ISTAT R functions, SMART2

2.4 Model-Based estimators: area-level

The most popular methods used for model-based SAE employ LMMs. Publications dealing with LMMs include Searle et al. (1992), Longford (1995), McCulloch and Searle (2001) and Demidenko (2004). Model-dependent estimators that rely on linear mixed or random-effects models have gained popularity (Rao, 2003; Jiang and Lahiri, 2006a) because they enable the inclusion of a random area effect to explain inter-area variation in addition to that explained by fixed-effect covariates. The reliability of these methods depends, however, on the validity of the model assumptions, a criticism often raised in design-based research (Estevao and Särndal, 2004).

FH-EBLUP

The FH-EBLUP is the most popular method for producing small-area estimates from area-level data. The model can be extended to include correlated random area effects, which gives the FH spatial EBLUP (FH-SEBLUP). Let $\theta$ be the m × 1 vector of the parameters of inferential interest (typically small-area totals or small-area means, for areas i = 1, ..., m) and assume that the direct estimator $\hat{\theta}$ is available and is design-unbiased:

$$\hat{\theta} = \theta + e \qquad (2.4)$$

where e is a vector of independent sampling errors with mean vector 0 and a known diagonal variance matrix $\mathrm{diag}(\psi_1, \ldots, \psi_m)$, with $\psi_i$ representing the sampling variance of the direct estimator of the parameter of interest for area i.
Usually, $\psi_i$ is unknown and is estimated by various methods such as generalized variance functions (Wolter, 1985; Wang and Fuller, 2003). The basic area-level model assumes that an m × p matrix X of area-specific auxiliary variables (including an intercept term) is linearly related to $\theta$ as:

$$\theta = X\beta + u \qquad (2.5)$$

where $\beta$ is the p × 1 vector of regression parameters and u is the m × 1 vector of independent random area-specific effects with zero mean and m × m covariance matrix $\sigma_u^2 I_m$, where $I_m$ is the m × m identity matrix. The combined model (Fay and Herriot, 1979) can be written as:

$$\hat{\theta} = X\beta + u + e \qquad (2.6)$$

It is a special type of LMM in which normality and symmetry of the distributions of the u and e components hold. In this model, the EBLUP is extensively used to obtain model-based indirect estimators of small-area parameters and associated measures of variability. This approach and its modifications (footnote 16) allow the survey data to be combined with other data in a synthetic regression fitted using population area-level covariates. The EBLUP estimate of $\theta_i$ is a composite estimate of the form:

$$\tilde{\theta}_i = \hat{\gamma}_i \hat{\theta}_i + (1 - \hat{\gamma}_i)\, x_i^T \hat{\beta} \qquad (2.7)$$

where $\hat{\gamma}_i = \hat{\sigma}_u^2 / (\psi_i + \hat{\sigma}_u^2)$, $\hat{\beta}$ is the weighted least squares estimate of $\beta$, with weights $(\psi_i + \hat{\sigma}_u^2)^{-1}$, obtained by regressing $\hat{\theta}_i$ on $x_i$, and $\hat{\sigma}_u^2$ is an estimate of the variance component $\sigma_u^2$. The EBLUP estimate gives more weight to the synthetic estimate $x_i^T \hat{\beta}$ when the sampling variance $\psi_i$ is large or $\hat{\sigma}_u^2$ is small, and moves towards the direct estimate $\hat{\theta}_i$ as $\psi_i$ decreases or $\hat{\sigma}_u^2$ increases.

Table 2.6 summarizes the properties of the FH-EBLUP.

Table 2.6: EBLUP under area-level specification: advantages, disadvantages and extensions
Model assumptions
  Advantages: efficiency under the assumption of normality of the LMM
  Disadvantages: linearity of the relation with the fixed-effects auxiliary variables; lack of correlation between the random area effects
  Extensions: non-parametric extension of the EBLUP (Giusti et al., 2012); SEBLUP (Petrucci and Salvati, 2006; Pratesi and Salvati, 2008)
Design consistency
  Advantages: design-consistent
Robustness to outliers
  Disadvantages: not robust against outliers
Out-of-sample predictions
  Disadvantages: prediction not inclusive of spatial information
  Extensions: SEBLUP (Petrucci and Salvati, 2006; Pratesi and Salvati, 2008, 2009)

Model assumptions
The EBLUP is popular and is efficient under the assumption of normality of LMMs. It is specified under the assumption of linearity of the relation between the study variable and the auxiliary variables. Giusti et al. (2012) extended it, however, with a semi-parametric specification obtained by P-splines, which allows non-linearities in the relationship between the response variable and the auxiliary variables (see section 2.6.1).
The correlation between random area effects is introduced in the SEBLUP (Petrucci and Salvati, 2006; Pratesi and Salvati, 2008, 2009).

Design consistency
The FH-EBLUP is design-consistent, although the estimator makes use of survey weights only to compute the direct estimates involved in expression (2.7) and, in general, in the expression of the $\psi_i$ representing the sampling variances of the direct estimators of the area parameters of interest. These predictors are model-based, and their statistical properties such as bias and MSE are evaluated with respect to the distribution induced by the data-generating process and not with respect to the randomization induced by the sampling design.

16 Jiang et al. (2011) derive the best predictive estimator of the fixed parameters under the Fay-Herriot model and the nested-error regression model. This leads to a new prediction procedure called observed best prediction, which is different from the EBLUP. The authors show that the best predictive estimator is more reasonable than the traditional estimators derived from estimation considerations, such as maximum likelihood and restricted maximum likelihood, if the main interest is the estimation of small-area means, which is a mixed-model prediction problem.
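The composite form of the FH-EBLUP in (2.7) can be illustrated with a few lines of base R. This is only a sketch under stated assumptions: the numbers are hypothetical, the variance component is treated as already estimated, and the shrinkage factor is taken to be the usual Fay-Herriot one, sigma2_u / (psi_i + sigma2_u). All object names are illustrative and are not part of any package.

```r
# Toy inputs (hypothetical values, not taken from the report)
theta_dir <- c(10, 12, 8)          # direct estimates for three areas
psi       <- c(1, 4, 0.25)         # known sampling variances
X         <- cbind(1, c(2, 3, 1))  # intercept plus one area-level covariate
sigma2_u  <- 1                     # variance component, assumed already estimated

# Weighted least squares estimate of beta with weights 1 / (psi_i + sigma2_u)
w    <- 1 / (psi + sigma2_u)
beta <- solve(t(X) %*% (w * X), t(X) %*% (w * theta_dir))

# Shrinkage factor and composite estimate:
# gamma_i * direct + (1 - gamma_i) * synthetic
gamma     <- sigma2_u / (psi + sigma2_u)
synthetic <- drop(X %*% beta)
eblup     <- gamma * theta_dir + (1 - gamma) * synthetic

print(cbind(psi, gamma, theta_dir, synthetic, eblup))
```

The area with the largest sampling variance (the second one) receives the smallest gamma, so its estimate is pulled furthest towards the regression-synthetic value, which is the behaviour described in the text.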

Robustness to outliers
The FH-EBLUP is not outlier-robust, but it is anticipated that the protection inserted by Sinha and Rao (2009) in the fitting procedure of the unit-level EBLUP can also be used in the FH-EBLUP to reduce the effect of influential residuals.

Predictions for out-of-sample areas
For non-sampled areas, the EBLUP estimate is given by the regression-synthetic estimate, using the known covariates associated with the non-sampled areas. This allows for the inclusion of geographical auxiliary variables, such as the coordinates of the centroids of the areas, as suggested for GREG. Geographical covariates can take into account spatial interaction when it results from the covariates themselves. In this case it is reasonable to assume that the random small-area effects are independent and that the EBLUP is still a valid predictor. There are circumstances, however, where the spatial interactions between the areas are not fully captured by the covariates and the random effects are consequently spatially correlated. This motivated the spatial extensions of the method (Petrucci and Salvati, 2006; Pratesi and Salvati, 2008).

Example of calculation of the FH-EBLUP estimator

This section describes the use of the FH-EBLUP to estimate the mean agrarian surface area used for grape production ($\theta_i$) in the 274 municipalities of Tuscany. The population is based on the 2000 Italian Agricultural Census for the region, which collected information about farmland by type of cultivation, amount of breeding, kind of production, and structure and amount of farm employment. The municipalities are taken to be small areas with population sizes $N_i$, i = 1, ..., m, from the census. The aim is to estimate $\theta_i$ in each municipality by using the agrarian surface area for production in hectares ($x_{1i}$) and the average number of working days in the reference year ($x_{2i}$) as covariates in the model.
The sample data are collected from a simple random sample of size $n_i$ from each area, with sampling fractions $n_i/N_i$ approximately constant. These are used to compute, for each municipality, the direct estimator of the mean surface area for grape production in hectares ($y_i$) and its sampling variance ($\psi_i$). The census data provided the agrarian production area in hectares ($x_{1i}$) and the average number of working days in the reference year ($x_{2i}$). The FH-EBLUP is computed with the R functions available in the sae package created by I. Molina, which builds on the R functions developed in the SAMPLE project. Table 2.7 lists the data for the first ten areas: sample size $n_i$, direct estimate $y_i$, the production area in hectares ($x_{1i}$), the average number of working days in the reference year ($x_{2i}$) and the sampling variance of the direct estimator ($\psi_i$).

Table 2.7: Data on grape production
Columns: Small area, $n_i$, $y_i$ (grapehect), $x_{1i}$ (area), $x_{2i}$ (workdays), $\psi_i$ (var)

The following R code reads the dataset in Table 2.7 and runs the function eblupFH on it. Load the package with the command:

> library(sae)

Load the dataset:

> data(grapes)

The formula of the mixed-effects model (formula), the variance of the direct estimator (vardir), the estimation method, the maximum number of iterations (MAXITER) and the tolerance (PRECISION) must be specified in the function:

> resultreml <- eblupFH(formula = grapehect ~ area + workdays - 1, vardir = var, data = grapes, MAXITER = 500, PRECISION = 1e-04)

The function returns a list with the following items:

  eblup: vector with the values of the estimators for the domains.
  fit: a list containing the following items:
    method: type of fitting method applied ("REML", "ML" or "FH").
    convergence: a logical value equal to TRUE if the Fisher-scoring algorithm converges in less than MAXITER iterations.
    iterations: number of iterations performed by the Fisher-scoring algorithm.
    estcoef: a data frame with the estimated model coefficients in the first column (beta), their asymptotic standard errors in the second column (std.error), the t statistics in the third column (tvalue) and the p-values of the significance of each coefficient in the fourth column (pvalue).
    refvar: estimated random effects variance.
    goodness: vector containing three goodness-of-fit measures: loglike, AIC and BIC.

The results for the first ten municipalities are shown in Table 2.8. The results for all 274 municipalities can be obtained at

Table 2.8: Data on grape production
Columns: Small area, $n_i$, $y_i$ (grapehect), FH-EBLUP, MSE

The MSE estimates can be obtained with the function mseFH:

> resultmse <- mseFH(grapehect ~ area + workdays - 1, var, data = grapes)
> resultmse$mse

Table 2.8 shows that the MSE of the FH-EBLUP is lower than the variance of the direct estimates in Table 2.7.

SEBLUP

Salvati (2004), Singh, B. et al. (2005), Petrucci and Salvati (2006) and Pratesi and Salvati (2008) proposed the introduction of spatial autocorrelation in SAE under the Fay-Herriot model. The spatial dependence among small areas is introduced by specifying an LMM with spatially correlated random area effects for $\theta$:

$$\theta = X\beta + Dv \qquad (2.8)$$

where D is an m × m matrix of known positive constants and v is an m × 1 vector of spatially correlated random area effects given by the following simultaneous auto-regressive (SAR) process with SAR coefficient $\rho$ and spatial contiguity matrix W (Cressie, 1993; Anselin, 1992):

$$v = \rho W v + u \quad\Rightarrow\quad v = (I_m - \rho W)^{-1} u \qquad (2.9)$$

The W matrix describes the spatial interaction structure of the small areas, usually defined through the neighbourhood relationship between areas; generally speaking, W has a value of 1 in row i and column j if areas i and j are neighbours, and 0 otherwise. The auto-regressive coefficient $\rho$ defines the strength of the spatial relationship among the random effects associated with neighbouring areas. For ease of interpretation the spatial interaction matrix is generally defined in row-standardized form, in which the row elements sum to 1; in this case $\rho$ is called a spatial autocorrelation parameter (Banerjee et al., 2004).

Combining (2.4) and (2.8), the model with spatially correlated errors can be written as:

$$\hat{\theta} = X\beta + Dv + e \qquad (2.10)$$

The error terms v have the SAR covariance matrix $G = \sigma_u^2 [(I_m - \rho W)^T (I_m - \rho W)]^{-1}$, and the covariance matrix of $\hat{\theta}$ is given by $V = \mathrm{diag}(\psi_i) + D G D^T$. Under model (2.10), the SEBLUP estimator is:

$$\tilde{\theta}_i^{S} = x_i^T \hat{\beta} + b_i^T \hat{G} D^T \hat{V}^{-1} (\hat{\theta} - X\hat{\beta}) \qquad (2.11)$$

where $\hat{\beta} = (X^T \hat{V}^{-1} X)^{-1} X^T \hat{V}^{-1} \hat{\theta}$, $\hat{G}$ and $\hat{V}$ are G and V evaluated at the estimated parameters, and $b_i^T$ is a 1 × m vector with value 1 in the i-th position and 0 elsewhere. The predictor is obtained from Henderson's (1975) results for general LMMs involving fixed and random effects. In the SEBLUP estimator the values of $\sigma_u^2$ and $\rho$ are obtained either by maximum likelihood (ML) or restricted maximum likelihood (REML) methods based on the normality assumption of the random effects (see Singh, B. et al., 2005; Pratesi and Salvati, 2008). The main features of the SEBLUP are summarized in Table 2.9.

Model assumptions
The SEBLUP is efficient under the assumption of normality and spatial correlation in the LMM. Its main advantage is the introduction of the spatial relation among the targeted areas through the spatial correlation of the random area effects. When the strength of the spatial relationship among the random effects associated with neighbouring areas is relevant (autoregressive coefficient > 0.5), the efficiency gains are appreciable in comparison with the FH-EBLUP. It relies, however, on the stationarity of the spatial relation in the studied zone. It can be extended to allow for local non-stationarity (Benedetti et al., 2012).

Design consistency
The FH-SEBLUP is not design-consistent; it is model-based like the FH-EBLUP. It makes use of survey weights only to compute the direct estimates involved in expression (2.11) and, in general, in the expression of the $\psi_i$ representing the sampling variances of the direct estimators of the area parameters of interest.
Table 2.9: SEBLUP under area-level specification: advantages, disadvantages and extensions
Model assumptions
  Advantages: efficiency under the assumption of stationarity of spatial correlation
  Disadvantages: stationarity of spatial correlation given the contiguity matrix
  Extensions: local stationarity extension (Benedetti et al., 2012)
Design consistency
  Disadvantages: not design-consistent
Robustness to outliers
  Disadvantages: not robust against outliers
  Extensions: robust-to-outliers extension (Schmid and Münnich, 2013)
Out-of-sample predictions
  Advantages: prediction based on individual X information, and on spatial contiguity and spatial correlation of out-of-sample areas
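To make the SEBLUP in (2.11) concrete, the following base-R sketch evaluates it on a hypothetical four-area example. It is not the estimation routine of any package: the function name seblup_fh is made up for illustration, D is assumed to be the identity matrix, and rho and sigma2_u are treated as known rather than estimated by ML or REML. The SAR covariance is computed as sigma2_u * [(I - rho W)' (I - rho W)]^{-1}, which follows from v = (I - rho W)^{-1} u.

```r
# Hypothetical helper: SEBLUP for given rho and sigma2_u, assuming D = I
seblup_fh <- function(theta_dir, X, psi, W, rho, sigma2_u) {
  m <- length(theta_dir)
  B <- diag(m) - rho * W
  G <- sigma2_u * solve(t(B) %*% B)   # SAR covariance matrix of v
  V <- diag(psi, m) + G               # covariance matrix of the direct estimates
  Vinv <- solve(V)
  beta <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% theta_dir)  # GLS estimate
  drop(X %*% beta + G %*% Vinv %*% (theta_dir - X %*% beta))
}

# Toy data: four areas on a line (1-2-3-4), row-standardized contiguity matrix
A <- matrix(0, 4, 4)
A[cbind(1:3, 2:4)] <- 1
A <- A + t(A)
W <- A / rowSums(A)

theta_dir <- c(10, 11, 9, 12)
psi       <- c(1, 2, 0.5, 1.5)
X         <- cbind(1, c(2, 3, 1, 4))

est_spatial <- seblup_fh(theta_dir, X, psi, W, rho = 0.6, sigma2_u = 1)
est_nonspat <- seblup_fh(theta_dir, X, psi, W, rho = 0,   sigma2_u = 1)
print(rbind(est_spatial, est_nonspat))
```

With rho = 0 the SAR covariance reduces to sigma2_u times the identity, so the second call returns the non-spatial composite estimate; comparing the two rows shows how much the neighbouring areas contribute in this toy setting.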

Robustness to outliers
The FH-SEBLUP is not outlier-robust, but there is a robustified version of it that protects the estimates against outlying observations (Schmid and Münnich, 2013). The protection is based on the extension of the correction by Chambers et al. (2014) to the SEBLUP, considering area and individual outliers in u and e.

Predictions for out-of-sample areas
The main advantage of the FH-SEBLUP is the introduction of the spatial relation among the target areas into the predictions for the out-of-sample areas (see Saei and Chambers, 2005b). When the strength of the spatial relationship among the random effects associated with neighbouring areas is relevant (autoregressive coefficient > 0.5), introducing it can mitigate the smoothing of the variability of the predicted values in comparison with those obtained by the FH-EBLUP (see Saei and Chambers, 2005b).

Example of calculation of FH-SEBLUP estimator

The objective is still to estimate the mean agrarian surface area used for grape production ($\theta_i$) in each municipality of Tuscany. The data in the FH-EBLUP example can also be regarded as lattice data. In this case the information on the spatial structure of the areas is to be included in the estimation process. The spatial relation between contiguous areas is described by the SAR process. In order to apply the FH-SEBLUP, the centroids of the municipalities are taken as spatial reference points. The m × m proximity matrix $W = (w_{ij})$ was obtained from the neighbourhood structure of the municipalities. We first set $w_{ij}$ equal to 1 if municipality i shares an edge with municipality j, and 0 otherwise. Next the rows of W are standardized so that their elements sum to 1.

[Matrix: W for the first ten areas, with rows and columns indexed by area]
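The two-step construction of W described above (0/1 contiguity, then row standardization) can be sketched in base R as follows. The neighbour list is hypothetical and stands in for the shared-edge structure that would normally be derived from a GIS map.

```r
# Hypothetical neighbour list for five areas: nb[[i]] holds the areas sharing an edge with area i
nb <- list(c(2, 3), c(1, 3, 4), c(1, 2, 5), c(2), c(3))

m <- length(nb)
W <- matrix(0, m, m)
for (i in seq_len(m)) W[i, nb[[i]]] <- 1   # w_ij = 1 if i and j share an edge, 0 otherwise

stopifnot(isSymmetric(W))                  # contiguity is a symmetric relation

# Row standardization: divide each row by its number of neighbours.
# An area without neighbours would give a division by zero and needs separate handling.
W_std <- W / rowSums(W)
print(round(W_std, 2))
```

Note that the row-standardized matrix is in general no longer symmetric, because areas differ in their number of neighbours.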

The function eblupSFH can be used for fitting the spatial Fay-Herriot model. This function gives small-area estimators based on a spatial Fay-Herriot model in which the area effects follow a SAR(1) process. Compared with the eblupFH function, the only addition is the proximity matrix W (grapesprox), passed as an argument of the function:

> data(grapesprox)
> resultreml.sp <- eblupSFH(formula = grapehect ~ area + workdays - 1, vardir = var, proxmat = grapesprox, data = grapes, MAXITER = 500, PRECISION = 1e-04)

The results returned by this function are the same as those of eblupFH, with the addition of the estimated spatial correlation (spatialcorr), which here is equal to 0.61:

> resultreml.sp

The printed object contains the elements $eblup and $fit. In this example $fit$method is "REML", $fit$convergence is TRUE and $fit$iterations is 6; $fit$estcoef reports beta, std.error, tvalue and pvalue for the covariates area and workdays, and the remaining elements are $fit$refvar, $fit$spatialcorr and $fit$goodness (loglike, AIC and BIC).

The MSE estimates can be obtained with the function mseSFH (Molina et al., 2009). Table 2.10 shows the small-area estimates and their MSE.

Table 2.10: Data on grape production
Columns: Small area, $n_i$, $y_i$ (grapehect), FH-SEBLUP, MSE

A comparison of Table 2.8 with Table 2.10 shows that in this case the FH-SEBLUP has lower MSE values than the FH-EBLUP. This happens because the data are spatially correlated.

Applications to agricultural data

In the last 40 years the Fay-Herriot model has been applied in many empirical studies. The results are generally satisfactory given the characteristics of each case study. Fuller (1981) applied the area-level FH-EBLUP model to estimate mean soybean hectares per segment in 1978 at the county level in the USA. He used the mean number of pixels of soybeans per area segment obtained by satellite imagery and the mean soybean hectares from the 1974 United States Agricultural Census as area-level covariates. Survey estimates for a sample of m = 10 counties were obtained by sampling area segments in the sampled counties (see also Rao, 2010). Petrucci et al. (2005) applied the FH-SEBLUP to estimate the average production of olives per farm in 42 local economic systems in Tuscany. The authors note that the introduction of spatial interaction improves the estimates obtained by the SEBLUP by reducing the MSE. This happens because the covariates cannot take into account the spatial interaction in the target variable. Sud et al. (2011) applied the FH-EBLUP to estimate crop yield at the district level in Uttar Pradesh in India. They used data from supervised crop-cutting experiments on paddy rice under the Improvement of Crop Statistics scheme for the kharif (autumn crop) season, collected during 2009/10. The state is divided into 70 districts: under the Improvement of Crop Statistics scheme there are sample data in 58 districts, and 12 districts are out-of-sample areas. The population census provides the auxiliary variables average household size and female population in marginal households, which are also used for the SAE in the out-of-sample districts.
Judged by their estimated coefficients of variation, the estimators have a high degree of reliability when compared with the direct survey estimates.

Final remarks on the FH-EBLUP and FH-SEBLUP

The FH-EBLUP and FH-SEBLUP require only the direct survey estimates at the area level; they do not require access to microdata. For this reason the methods are frequently applied, and routines to implement them can be downloaded free from several websites. Table 2.11 shows the data and software required.

The FH-SEBLUP also requires spatial information on the area of interest. Spatial-contiguity matrices, centroids of the areas and their coordinates, and the distances between them are easily obtained when GIS maps of the studied areas are available. Using GIS, the available dataset for a study can be combined with digital maps of the study area to enrich the description of small areas with geographical coordinates, their geometric properties and neighbourhood structures. Other spatial reference data such as land-parcel codes, street addresses and postal codes can be added from other digital maps to facilitate the link with additional auxiliary variables from other sources.

Table 2.11: Model-based methods under area-level specification: data and software
EBLUP
  Study variable Y: area-level direct estimates of means, percentages and totals (sampled areas)
  Aux info X: area-level auxiliary information (sampled and non-sampled areas)
  Software: EURAREA project website; SAMPLE project website; CRAN repository
SEBLUP
  Study variable Y: area-level direct estimates of means, percentages and totals (sampled areas)
  Aux info X: area-level auxiliary information and contiguity matrix W (sampled and non-sampled areas)
  Software: EURAREA project website; SAMPLE project website; ISTAT; CRAN repository

In practical applications it is important to complement the estimates with the estimated MSE as a measure of their accuracy. Estimators of the MSE are available for both the FH-EBLUP and the FH-SEBLUP. The routines for application provide estimates of the MSE based on this approximately unbiased analytical estimator:

$$\mathrm{mse}[\tilde{\theta}_i(\hat{\omega})] = g_{1i}(\hat{\omega}) + g_{2i}(\hat{\omega}) + 2 g_{3i}(\hat{\omega})$$

where $\hat{\omega}$ is equal to $\hat{\sigma}_u^2$ when the EBLUP is used, and to $(\hat{\sigma}_u^2, \hat{\rho})$ when considering the SEBLUP estimator. The MSE estimator is the same as that derived by Prasad and Rao (1990). For more details on the specification of the g components in both models see Pratesi and Salvati (2009), and Giusti et al. (2012) for the semi-parametric version of the FH-EBLUP.
For a detailed discussion of the MSE and its estimation for the EBLUP based on the traditional Fay-Herriot model, see Rao (2003). An alternative procedure for estimating the MSE of the FH-EBLUP and FH-SEBLUP estimators can be based on the bootstrapping procedures proposed by Gonzalez-Manteiga et al. (2007), Molina et al. (2009) and Opsomer et al. (2008).

The explicit modelling of spatial effects in the FH-EBLUP is advisable when: i) there are no geographic covariates that can take into account the spatial interaction in the target variable; or ii) there are some geographic covariates but the spatial interaction is so strong (autoregressive spatial coefficient > 0.5) that the small-area random effects are presumed to be still correlated. In this case, taking advantage of the information about the related areas appears to be the best solution; the FH-SEBLUP is more efficient than the FH-EBLUP with uncorrelated area random effects.

Both predictors are useful for estimating small-area parameters efficiently when the model assumptions hold, but they can be sensitive to representative outliers or to departures from the assumed normal distributions for the random effects in the model. Chambers (1986) defines a representative outlier as a sample element with a value that has been correctly recorded and that cannot be regarded as unique. In particular, there is no reason to assume that there are no more similar outliers in the non-sampled part of the population. Welsh and Ronchetti (1998) regard representative outliers as extreme relative to the bulk of the data. That is, the deviations from the underlying distributions or assumptions refer to the fact that a small proportion of the data may come from an arbitrary distribution rather than the underlying true distribution, which may result in outliers or influential observations in the data. When the outlying observations are representative, the protections suggested by Sinha and Rao (2009) and Schmid and Münnich (2013) are recommended.

2.5 Model-based estimators: unit-level

The EBLUP based on unit-level data is the standard tool for producing small-area estimates. As with the area-level specification, it can be extended to correlated random area effects to obtain the SEBLUP. The MQ small-area approach is a robust alternative to the standard approach based on mixed-effects models, and relies on MQ regression models. The MQGWR model extends the MQ model to include spatial relations between areas, hence enabling local rather than global robust parameters for MQ models.

EBLUP

Let $x_{ij}$ denote a vector of p auxiliary variables for each population unit j in small area i and assume that information for the variable of interest y is available only from the sample. The aim is to use the data to estimate various area-specific quantities. A popular approach is to use mixed-effects models with random area effects. A linear mixed-effects model is:

$$y_{ij} = x_{ij}^T \beta + z_{ij} u_i + e_{ij} \qquad (2.12)$$

where $\beta$ is the p × 1 vector of regression coefficients, $u_i$ denotes a random area effect that characterizes differences in the conditional distribution of y given x between the m small areas, $z_{ij}$ is a constant whose value is known for all units in the population and $e_{ij}$ is the error term associated with the j-th unit in the i-th area. Conventionally, $u_i$ and $e_{ij}$ are assumed to be independent and normally distributed, with mean zero and variances $\sigma_u^2$ and $\sigma_e^2$ respectively. The EBLUP of the mean for small area i (Battese et al., 1988; Rao, 2003) is then:

$$\hat{m}_i = N_i^{-1} \Big\{ \sum_{j \in s_i} y_{ij} + \sum_{j \in r_i} \hat{y}_{ij} \Big\} \qquad (2.13)$$

where $\hat{y}_{ij} = x_{ij}^T \hat{\beta} + z_{ij} \hat{u}_i$, $s_i$ denotes the sampled units in area i, $r_i$ denotes the remaining units in area i, and $\hat{\beta}$ and $\hat{u}_i$ are obtained by substituting an optimal estimate of the covariance matrix of the random effects in (2.12) into the best linear unbiased estimator of $\beta$ and the best linear unbiased predictor of $u_i$. For the estimation of the MSE of (2.13) see Prasad and Rao (1990). Table 2.12 shows the properties of the EBLUP under the unit-level specification.
Table 2.12: EBLUP under unit-level specification: advantages, disadvantages and extensions
Model assumptions
  Advantages: efficiency assuming normality of the LMM; random effects at area and unit levels
  Disadvantages: linearity of the relation with the auxiliary information; lack of correlation between the random area effects
  Extensions: non-parametric extension (Opsomer et al., 2008); geographically weighted EBLUP (GWEBLUP) (Chandra et al., 2012); SEBLUP (Petrucci and Salvati, 2006; Pratesi and Salvati, 2008)
Design consistency
  Disadvantages: not design-consistent
  Extensions: design-consistent weighted extensions (Kott, 1989; Prasad and Rao, 1990; Rao, 2003)
Robustness to outliers
  Disadvantages: not robust against outliers
  Extensions: robust-to-outliers extension (Sinha and Rao, 2009)
Out-of-sample predictions
  Disadvantages: prediction not inclusive of spatial information
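The structure of the unit-level EBLUP of an area mean in (2.13), observed values for sampled units plus model predictions for the remaining units, can be traced with a small base-R calculation. Everything below is hypothetical: the fixed-effect estimates and the predicted area effect are assumed values standing in for the output of a fitted LMM, and the known constant multiplying the area effect is taken to be 1 for all units.

```r
# Hypothetical area i with N_i = 6 units, of which the first two are sampled
x       <- c(1.0, 2.0, 1.5, 3.0, 2.5, 0.5)   # auxiliary variable, known for all units
sampled <- c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
y_s     <- c(5.2, 7.1)                        # observed y for the sampled units

beta_hat <- c(3, 2)   # assumed estimates of intercept and slope
u_hat    <- 0.4       # assumed predicted random effect for area i

# Predicted values for the non-sampled units: x' beta_hat + u_hat
y_pred <- beta_hat[1] + beta_hat[2] * x[!sampled] + u_hat

# EBLUP of the area mean: (sum of observed + sum of predicted) / N_i
eblup_mean <- (sum(y_s) + sum(y_pred)) / length(x)

# Synthetic predictor for comparison: drops both the observed values and u_hat
synth_mean <- mean(beta_hat[1] + beta_hat[2] * x)

print(c(eblup = eblup_mean, synthetic = synth_mean))
```

The comparison with the synthetic value shows what an out-of-sample area loses: without sample units there is no predicted area effect, so all variation between areas comes from the auxiliary information.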

Model assumptions
The predictor is widely used in real-life applications. It has been extended to overcome the disadvantages deriving from the linearity of the relation with the auxiliary variables and the independence of the random area effects. There is a non-parametric extension by Opsomer et al. (2008) and there are two extensions that include geography and correlation of the random area effects in the specification of the model. The first is the GWEBLUP by Chandra et al. (2012); see section 2.6.4. The second is the EBLUP at the unit level specified under the SAR assumption to derive the SEBLUP, which is described in the following section.

Design consistency
Model-based estimators using unit-level models typically do not make use of survey weights, and the derived estimators are generally not design-consistent unless the sampling design is self-weighting within areas. Modifications to achieve design consistency were proposed by Kott (1989), Prasad and Rao (1999) and You and Rao (2002). Although they are design-consistent, these predictors are model-based and their statistical properties such as bias and MSE are evaluated with respect to the distribution induced by the data-generating process and not with respect to randomization. Jiang and Lahiri (2006b) obtained design-consistent predictors for generalized linear models, and evaluated their corresponding MSEs with respect to the joint randomization-model distribution.

Robustness to outliers
Sinha and Rao (2009) proposed a robust version of (2.13) that works well in the presence of outlying values. It is based on a modification of the iterative method for the SAE model fitting based on M-estimation.

Out-of-sample predictions
In the mixed model (2.12) the synthetic mean predictor for out-of-sample area i is:

$$\hat{m}_i = N_i^{-1} \sum_{j=1}^{N_i} x_{ij}^T \hat{\beta}$$

Note that all variation in the area-specific predictions comes from the area-specific auxiliary information.
The conventional synthetic estimation for out-of-sample areas can potentially be improved by using a model that borrows strength over space. Besides the SEBLUP spatial extension, which is described below, there are a non-parametric extension and geographically weighted extensions (see sections 2.6.2 and 2.6.4).

SEBLUP

As in the area-level models, model (2.12) can be extended to allow for correlated random area effects by specifying an SAR mixed model. Let the deviations v from the fixed part of the model be the result of an autoregressive process with parameter $\rho$ and proximity matrix W (Cressie, 1993): $v = \rho W v + u$; then $v = (I - \rho W)^{-1} u$. The matrix $(I - \rho W)$ needs to be strictly positive-definite to ensure the existence of $(I - \rho W)^{-1}$. This happens if $\rho \in (1/\lambda_{\min}, 1/\lambda_{\max})$, where the $\lambda$s are the eigenvalues of matrix W. The model with spatially correlated errors can be expressed as:

$$y = X\beta + Z(I - \rho W)^{-1} u + e \qquad (2.14)$$

with e independent of v. Under (2.14), the spatial best linear unbiased predictor of the small-area mean and its empirical version, the SEBLUP, are obtained following Henderson (1975). In particular, the SEBLUP of the small-area mean is given by expression (2.15), in which $y_s$ is the vector of the sample observations, the variance and spatial autocorrelation parameters are replaced by asymptotically consistent estimators obtained by ML or REML estimation, and $b_i^T$ is a vector with value 1 in the i-th position. For the MSE of the predictor (2.15), see Singh, B. et al. (2005). Table 2.13 summarizes the main characteristics of the SEBLUP.

Table 2.13: SEBLUP under unit-level specification: advantages, disadvantages and extensions
Model assumptions
  Advantages: efficiency assuming stationarity of spatial correlation
  Disadvantages: stationarity of spatial correlation given the contiguity matrix
  Extensions/alternatives: GWEBLUP (Chandra et al., 2012); spatial model-based direct estimator (Chandra et al., 2007)
Design consistency
  Disadvantages: not design-consistent
Robustness to outliers
  Disadvantages: not robust against outliers
  Extensions/alternatives: possible extensions to outlier resistance (Sinha and Rao, 2009; Schmid and Münnich, 2013)
Out-of-sample predictions
  Advantages: prediction based on individual X information and on spatial contiguity and spatial correlation of out-of-sample areas

Model assumptions
An alternative approach for incorporating the same spatial information in the model is the spatial model-based direct estimator by Chandra et al. (2007). This also assumes the stationarity of spatial correlation given the contiguity matrix. A further approach assumes that the regression coefficients vary spatially across the geography of interest. Models of this type can be fitted using geographically weighted regression (GWR), and are suitable for modelling spatial non-stationarity (Brunsdon et al., 1998; Fotheringham et al., 2002). Chandra et al. (2012) proposed a GWEBLUP for a small-area average and an estimator of its conditional MSE (see sections 2.6 and 2.7).
Design consistency
As with the unit-level EBLUP, the SEBLUP typically does not make use of survey weights, and the derived estimators are generally not design-consistent unless the sampling design is self-weighting within areas. Modifications following the solutions proposed for the EBLUP can be design-consistent, but their statistical properties such as bias and MSE are evaluated with respect to the distribution induced by the data-generating process and not with respect to randomization.

Robustness to outliers
Sinha and Rao (2009) proposed a robust version of (2.13) that works well in the presence of outlying values. This solution may possibly be extended to the SEBLUP following Schmid and Münnich (2013).

Out-of-sample predictions
Because the area random effects are spatially correlated, the SEBLUP predictions for area parameters can be computed taking into account the contribution of the random part of the model for both sampled and out-of-sample areas (see Saei and Chambers, 2005a). This counters the tendency to smooth the variability of the predicted values in comparison with those obtained by the EBLUP.

The MQ estimator

None of the predictors described above is robust to deviations from the underlying distributions and assumptions unless it is extended with another estimator designed to address the problem. A recently proposed approach to SAE based on MQ models (Chambers and Tzavidis, 2006) is naturally robust against the effect of outlying observations on the validity of the small-area model, a feature that should be useful when the method is applied to agricultural surveys. A linear MQ regression model is one where the q-th MQ of the conditional distribution of y given x satisfies:

$Q_q(x; \psi) = x^T \beta_\psi(q)$ (2.16)

Here $\psi$ denotes the influence function associated with the MQ, usually a Huber-type function $\psi(t) = t\, I(-c \le t \le c) + c\, \mathrm{sgn}(t)\, I(|t| > c)$, where c is the tuning constant and t is the error. Other influence functions can be used. For specified q and continuous $\psi$, an estimate of $\beta_\psi(q)$ is obtained from iterative weighted least squares.

The MQ coefficients of the population units obtained by fitting the model are the basis for constructing an alternative to random effects for characterizing the variability across the population. For unit j with values $y_j$ and $x_j$, this coefficient is the value $q_j$ such that $Q_{q_j}(x_j; \psi) = y_j$. If a hierarchical structure explains part of the variability in the population data, units within areas or clusters defined by this hierarchy are expected to have similar MQ coefficients.
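
The iterative weighted least squares fit mentioned above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the SAMPLE project code: the function names `mq_weights` and `mq_fit` are hypothetical, the scale is assumed to be the median absolute deviation of the residuals, and 1.345 is used only as an illustrative tuning constant.

```r
# IWLS weights for the M-quantile of order q (hypothetical helper):
# Huber weights psi(u)/u on the scaled residuals, multiplied by 2q for
# positive residuals and by 2(1 - q) for the others.
mq_weights <- function(res, q, scale, c_tune = 1.345) {
  u <- res / scale
  huber_w <- ifelse(abs(u) <= c_tune, 1, c_tune / abs(u))
  2 * huber_w * ifelse(res > 0, q, 1 - q)
}

# Linear M-quantile regression of order q fitted by IWLS (sketch).
mq_fit <- function(X, y, q = 0.5, c_tune = 1.345, tol = 1e-6, maxit = 100) {
  beta <- qr.solve(X, y)                       # least-squares starting values
  for (iter in seq_len(maxit)) {
    res <- as.vector(y - X %*% beta)
    scale <- median(abs(res - median(res))) / 0.6745   # MAD estimate of scale
    if (scale < 1e-12) break                   # (near) perfect fit: stop
    w <- mq_weights(res, q, scale, c_tune)
    beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * y))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- as.vector(beta_new)
    if (converged) break
  }
  beta
}

# Toy data: one auxiliary variable plus an intercept.
set.seed(1)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50, sd = 0.3)
X <- cbind(1, x)
beta_q25 <- mq_fit(X, y, q = 0.25)
beta_q75 <- mq_fit(X, y, q = 0.75)
```

With q = 0.5 and a very large tuning constant, all weights equal 1 and the fit reduces to ordinary least squares, which gives a quick sanity check of the sketch.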
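
When the conditional MQs are assumed to follow a linear model, with $\beta_\psi(q)$ a sufficiently smooth function of q, a predictor of small-area parameters is suggested in the form:

$\hat{m}_i^{MQ} = N_i^{-1} \left\{ \sum_{j \in s_i} y_j + \sum_{j \in r_i} x_j^T \hat{\beta}_\psi(\hat{\theta}_i) \right\}$ (2.17)

where $s_i$ and $r_i$ denote the sampled and non-sampled units in area i, $N_i$ is the area population size, and $\hat{\theta}_i$ is an estimate of the average value of the MQ coefficients of the units in area i. This is typically the average of the estimates of these coefficients for sample units in the area. These unit-level coefficients are estimated by solving $\hat{Q}_{q_j}(x_j; \psi) = y_j$ for $q_j$, with $\hat{Q}_q(x; \psi)$ denoting the estimated value of (2.16) at q. When there are no sample units in area i, then $\hat{\theta}_i = 0.5$.

Tzavidis et al. (2010) refer to (2.17) as the naïve MQ predictor and note that it can be biased. To address the problem, the authors propose a bias-adjusted MQ predictor of the small-area parameter derived as the mean functional of the Chambers and Dunstan (1986) estimator of the distribution function, given by:

$\hat{m}_i^{MQ/CD} = N_i^{-1} \left\{ \sum_{j \in s_i} y_j + \sum_{j \in r_i} x_j^T \hat{\beta}_\psi(\hat{\theta}_i) + \frac{N_i - n_i}{n_i} \sum_{j \in s_i} \left[ y_j - x_j^T \hat{\beta}_\psi(\hat{\theta}_i) \right] \right\}$ (2.18)

where $n_i$ is the sample size in area i.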
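
The difference between the naïve and the bias-adjusted predictor can be illustrated with a few lines of base R. The sketch assumes the usual form of the two predictors (sample values plus model predictions for the non-sampled units, with the bias adjustment adding the scaled sum of the sample residuals); the function name `mq_predict_area` and the toy numbers are hypothetical, and the area-level coefficients are taken as given rather than estimated.

```r
# Naive and bias-adjusted (Chambers-Dunstan type) predictors of one area mean.
# y_s: sample values; X_s, X_r: design matrices of sampled and non-sampled
# units; beta: M-quantile coefficients evaluated at the area coefficient.
mq_predict_area <- function(y_s, X_s, X_r, beta) {
  n_i <- length(y_s)
  N_i <- n_i + nrow(X_r)
  fit_s <- as.vector(X_s %*% beta)   # fitted values for sampled units
  fit_r <- as.vector(X_r %*% beta)   # predictions for non-sampled units
  naive <- (sum(y_s) + sum(fit_r)) / N_i
  cd <- naive + ((N_i - n_i) / n_i) * sum(y_s - fit_s) / N_i
  c(naive = naive, cd = cd)
}

# Toy area with 2 sampled and 3 non-sampled units.
beta <- c(1, 2)
X_s <- cbind(1, c(1, 2))
y_s <- c(4, 6)                       # both residuals equal +1
X_r <- cbind(1, c(3, 4, 5))
pred <- mq_predict_area(y_s, X_s, X_r, beta)
```

In this toy area every sample residual is positive, so the bias-adjusted value is larger than the naïve one: the adjustment carries the average sample residual over to the non-sampled units.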

Note that under simple random sampling in the small areas, (2.18) is also derived from the expected value functional of the area i version of the Rao et al. (1990) distribution function estimator, which is a design-consistent and model-consistent estimator of the finite population distribution function. Table 2.14 shows the main characteristics of the method.

Table 2.14: MQ method under unit-level specification: advantages, disadvantages and extensions

Model assumptions
  Advantages: it does not require normality of errors
  Disadvantages: it requires a linear regression model for quantiles
  Extensions: non-parametric non-linear extension (Pratesi et al., 2008)
Design consistency
  Disadvantages: not design-consistent
  Extensions: design-consistent weighted extension (Fabrizi et al., 2014)
Robustness to outliers
  Advantages: robust against outliers (Giusti et al., 2014)
  Disadvantages: possible failures in protection against outliers
  Extensions: extension for bias correction (Chambers et al., 2014)
Out-of-sample predictions
  Disadvantages: prediction does not include spatial information and is based only on individual X information
  Extensions: spatial extension MQGWR (Salvati et al., 2012)

Model assumptions
The method does not require distributional assumptions about errors. This is an advantage because it makes the method useful for describing non-normal study variables, but the MQ requires the relation between the quantiles of the study variable and the auxiliary variables to be linear. Its non-parametric extension to allow for non-linearities is described in section 2.6. The linear model for quantiles can be extended to local geographical regression in the MQGWR described below.

Design consistency
As with the other unit-level SAE models, MQ models typically do not make use of survey weights, and the derived estimators are not design-consistent unless the sampling design is self-weighting within areas. Fabrizi et al. (2014) proposed a modification for use in the model-assisted approach: a version of MQ that includes the sample weights and is design-consistent.
Its statistical properties such as the bias and MSE are evaluated with respect to the distribution induced by the data-generating process and by the sampling design.

Robustness to outliers
The proposed outlier-robust small-area estimators can be substantially biased when outliers are drawn from a distribution that has a different mean from the rest of the survey data. Chambers et al. (2014) proposed an outlier-robust bias correction for these estimators and two analytical MSE estimators for the ensuing bias-corrected outlier-robust estimators. Resistance to outliers is the result of the M-estimation algorithm and of the Chambers and Dunstan adjustment (Giusti et al., 2014).

Out-of-sample predictions
The predictions for the out-of-sample areas are given by:

$\hat{m}_i^{MQ} = N_i^{-1} \sum_{j \in U_i} x_j^T \hat{\beta}_\psi(0.5)$

where $U_i$ denotes the population units in area i,

and are based only on individual X information and the conventional MQ average coefficient value. The expression can be modified to include the distances between the areas.

The MQGWR estimator

SAR mixed models are global models in the sense that it is assumed that the relations being modelled hold everywhere in the study area and that spatial correlation at the area level is allowed for. One way of incorporating the spatial structure of the data in the MQ small-area model is through an MQGWR model (Salvati et al., 2012); for a description of GWR see section 2.7. Unlike SAR mixed models, MQGWR models are local models that allow for a spatially non-stationary process in the mean structure of the model. Given n observations at a set of L locations $\{h_l;\ l = 1, \dots, L\}$, with $n_l$ data values $\{(y_{jl}, x_{jl});\ j = 1, \dots, n_l\}$ observed at location $h_l$, an MQGWR model is defined as:

$Q_q(x_{jl}; \psi, h_l) = x_{jl}^T \beta_\psi(h_l; q)$ (2.19)

where $\beta_\psi(h_l; q)$ now varies with h as well as with q. The MQGWR is a local model for the entire conditional distribution, not just the mean, of y given x. Estimates of $\beta_\psi(h; q)$ in (2.19) can be obtained by solving:

$\sum_{l=1}^{L} w(h_l, h) \sum_{j=1}^{n_l} \psi_q\{ y_{jl} - x_{jl}^T \beta_\psi(h; q) \}\, x_{jl} = 0$ (2.20)

where $w(h_l, h)$ is a spatial weighting function, generally a function of the Euclidean distance between the locations $h_l$ and h, and $\psi_q(\varepsilon) = 2\psi(s^{-1}\varepsilon)\{ q\, I(\varepsilon > 0) + (1 - q)\, I(\varepsilon \le 0) \}$, where s is a suitable robust estimate of scale such as the median absolute deviation estimate; $\psi$ is normally assumed to be a Huber-type influence function, but other influence functions are also possible. A Huber-type function gives $\psi(\varepsilon) = \varepsilon\, I(-c \le \varepsilon \le c) + c\, \mathrm{sgn}(\varepsilon)\, I(|\varepsilon| > c)$. Provided c, the tuning constant, is bounded away from zero, an iteratively re-weighted least squares algorithm can be used to solve (2.20), leading to estimates of the form:

$\hat{\beta}_\psi(h; q) = \{ X^T W^*(h; q) X \}^{-1} X^T W^*(h; q)\, y$ (2.21)

In (2.21), y is the vector of n sample y values and X is the corresponding design matrix of order n × p of sample x values. The matrix $W^*(h; q)$ is a diagonal matrix of order n. Each entry corresponds to a particular sample observation and equals the product of the spatial weight of this observation, which depends on its distance from location h, and the weight that this observation has when the sample data are used to calculate the spatially stationary MQ estimate.

Provided there are sample observations in area i, an area-specific MQGWR coefficient $\hat{\theta}_i$ can be defined as the average value of the sample MQGWR coefficients in area i. Following Salvati et al. (2012), the bias-adjusted MQGWR predictor of the mean in small area i is:

$\hat{m}_i^{MQGWR/CD} = N_i^{-1} \left\{ \sum_{j \in s_i} y_j + \sum_{j \in r_i} \hat{Q}_{\hat{\theta}_i}(x_j; \psi, h_j) + \frac{N_i - n_i}{n_i} \sum_{j \in s_i} \left[ y_j - \hat{Q}_{\hat{\theta}_i}(x_j; \psi, h_j) \right] \right\}$ (2.22)

where $\hat{Q}_{\hat{\theta}_i}(x_j; \psi, h_j)$ is defined through the model (2.19). For details of the MSE estimator of predictor (2.22) see Salvati et al. (2012). Table 2.15 summarizes the main features of the MQGWR method.

Table 2.15: MQGWR under unit-level specification: advantages, disadvantages and extensions

Model assumptions
  Advantages: it does not require normality of errors
  Disadvantages: it requires linear models for quantiles, but includes spatial regression
  Extensions: possible extensions to other spatial weighting functions in the spatial regression
Design consistency
  Disadvantages: not design-consistent
Robustness to outliers
  Advantages: robust against outliers
Out-of-sample predictions
  Advantages: prediction based on individual X information and on a weighting function based on distances between in-sample and out-of-sample areas

Model specifications
In addition to the characteristics of the MQ method, which are common to this extension, the spatial-regression coefficient allows for the representation of local non-stationarity in the data. The spatial weight is derived from a spatial-weighting function whose value depends on the distance from sample location $h_l$ to h, such that sample observations with locations close to the prediction location h receive more weight than those further away. A popular approach to defining such a weighting function is to use:

$w(h_l, h) = \exp\{ -d_{h_l, h}^2 / (2 b^2) \}$

where $d_{h_l, h}$ denotes the Euclidean distance between $h_l$ and h, and b is the bandwidth, which is best defined using a least-squares criterion (Fotheringham et al., 2002). Alternative weighting functions such as the bi-square function (Ibid.) can also be used.

Out-of-sample predictions
The main advantage of the MQGWR in comparison with the MQ estimator is in the out-of-sample predictions. Focusing on the spatial structure of the estimator, the out-of-sample predictions are obtained by:

$\hat{m}_i^{MQGWR} = N_i^{-1} \sum_{j \in U_i} x_j^T \hat{\beta}_\psi(h_j; 0.5)$

With non-spatial modelling, all variation in the area-specific predictions comes from the area-specific auxiliary information.
As described above, one way of improving the conventional synthetic estimation for out-of-sample areas is to use a model that borrows strength over space. In the MQGWR model the improvement is sought by assuming local rather than global stationarity of the spatial relation.
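
The role of the spatial weighting function can be made concrete with a short base R sketch. It assumes a Gaussian kernel in the Euclidean distance and a bi-square alternative, and computes one locally weighted least squares step at a single location h, with the weights formed as the product of the spatial weight and an observation weight. The function names (`gauss_weight`, `bisquare_weight`, `local_wls`), the argument `mq_w` and the toy coordinates are hypothetical; this is not the mqgwr.sae code.

```r
# Gaussian and bi-square spatial kernels with bandwidth b (illustrative).
gauss_weight <- function(d, b) exp(-0.5 * (d / b)^2)
bisquare_weight <- function(d, b) ifelse(d < b, (1 - (d / b)^2)^2, 0)

# One locally weighted least squares step at location h.
# mq_w holds the (robust, quantile-specific) observation weights; the default
# of 1 gives an ordinary geographically weighted regression step.
local_wls <- function(X, y, coords, h, b, mq_w = rep(1, length(y)),
                      kernel = gauss_weight) {
  d <- sqrt((coords[, 1] - h[1])^2 + (coords[, 2] - h[2])^2)  # Euclidean distance
  w <- kernel(d, b) * mq_w                                    # product of weights
  as.vector(solve(t(X) %*% (w * X), t(X) %*% (w * y)))
}

# Toy example: four sampled points and an exactly linear relation.
coords <- cbind(lon = c(0, 1, 2, 3), lat = c(0, 0, 1, 1))
x <- c(1, 2, 3, 5)
y <- 2 + 3 * x
beta_h <- local_wls(cbind(1, x), y, coords, h = c(1, 0), b = 1.5)
```

Because the toy relation is exactly linear, any choice of weights returns the same coefficients; with real data the estimates change with h, which is what allows the out-of-sample predictions to vary over space. The bi-square kernel gives zero weight beyond the bandwidth, so it needs enough observations within distance b of each prediction location.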

Design consistency
The MQGWR typically does not make use of sampling weights, and it is not design-consistent. Possible extensions such as that of Fabrizi et al. (2014) have not yet been tested.

Robustness to outliers
As with the MQ predictor, robustness to outliers is the result of the M-estimation algorithm and of the Chambers and Dunstan adjustment (Giusti et al., 2014). However, there are no studies that test the resilience of the MQGWR to outliers.

Example of calculation of EBLUP, MQ and MQGWR estimators

This section gives an example of applying the EBLUP, the MQ and the MQGWR methods to obtain small-area estimates. The target parameter is mean acid neutralizing capacity (ANC), which is an indicator of the acidification risk in bodies of water. This indicator has to be estimated at the level of the hydrologic unit, a domain for which there are few observations in the United States Environmental Protection Agency's northeast lakes survey (Larsen et al., 2001). A sample of 334 lakes is selected from the population of 21,026 using a random-systematic design. The lakes in this population are grouped according to 113 hydrologic unit codes (HUCs), of which 64 contain fewer than 5 observations and 27 have no observations. The variable of interest, y, is the ANC indicator: the higher the ANC, the more acid a body of water can neutralize and the less susceptible it is to acidification. The number of observed sites is 349, with 551 measurements: this gives 86 sampled areas out of 113 areas, with a sample of 551 units. For each sampled location, the Environmental Monitoring and Assessment Program (EMAP) dataset includes the elevation, which is the auxiliary variable x used in the small-area models, and the geographical coordinates of the centroid of each lake in the target area. Table 2.16 shows the first 6 of the 551 lines in the EMAP survey dataset.

Table 2.16: EMAP Northeast lakes survey data
Columns: Lake id, HUC, lon, lat, Elev, ANC
Lake ids of the first six records: ME750L, ME250L, ME751L, ME753L, ME519L, ME251L (values of the other columns not recoverable)

For each lake in the target population, the HUC, the elevation, the longitude and the latitude are known. Table 2.17 shows the first 6 of the 21,026 lines of the population data.

Table 2.17: EMAP Northeast lakes population data
Columns: Lake id, HUC, lon, lat, Elev
Lake ids of the first six records: ME750L, ME250L, ME751L, ME753L, ME519L, ME251L (values of the other columns not recoverable)

For each unit in the population, sampled and non-sampled, the HUC (the small area to which the lake belongs), the longitude, the latitude and the elevation are known, and the ANC target variable is also known for the 551 sampled units. From these data we can obtain small-area estimates of the mean of ANC using the EBLUP, the MQ and the MQGWR.
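
Before fitting the models it can help to check how the sample is spread over the small areas, since the text above distinguishes areas with few observations from areas with none. The sketch below uses base R on made-up area codes; the object names and the toy data are hypothetical and do not reproduce the EMAP counts.

```r
# All area codes in the population and the area code of each sampled unit.
all_hucs <- c("H1", "H2", "H3", "H4")
sample_huc <- c("H1", "H1", "H1", "H1", "H1", "H1", "H2", "H2", "H4")

# Sample size per area, keeping areas that have no sampled unit.
n_i <- table(factor(sample_huc, levels = all_hucs))

n_sampled_areas <- sum(n_i > 0)            # areas with at least one observation
n_empty_areas   <- sum(n_i == 0)           # out-of-sample areas
n_sparse_areas  <- sum(n_i > 0 & n_i < 5)  # sampled areas with fewer than 5 units
```

Declaring the factor levels explicitly is the design choice that matters here: a plain `table(sample_huc)` would silently drop the areas with no sampled units.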

Estimates of the EBLUP can be obtained using the function eblupBHF in the R package sae. The MQ and the MQGWR estimates can be obtained using the R functions mq_function and mqgwr.sae, available under the SAMPLE project.

EBLUP: function eblupBHF, package sae

The function eblupBHF requires the following arguments:
formula: A symbolic description of the model to be fitted, for example y~1+x (ANC~1+Elev)
dom: Vector of small-area codes for sampled units (dom = HUC in Table 2.16)
meanxpop: Data frame with domain codes in the first column. Each remaining column contains the population means of each of the p auxiliary variables for all the domains. The domains considered in meanxpop must contain those specified in selectdom (in the example, meanxpop contains the small-area means of the Elev variable)
popnsize: Data frame with small-area codes in the first column and the corresponding small-area population sizes in the second column (in the example, popnsize contains the population size of each of the 113 hydrologic units)
method: A character string. If "REML", the model is fitted by maximizing the restricted log-likelihood. If "ML", the log-likelihood is maximized. Defaults to "REML"
data: Optional data frame containing the variables named in formula and dom. By default the variables are taken from the environment from which eblupBHF is called.

An example of usage of the R package sae follows, listing all the commands used:
>library(sae)
This command loads the package sae, which contains the function eblupBHF. Now the EMAP survey dataset shown in Table 2.16 and the EMAP population dataset shown in Table 2.17 are loaded:
>survey.lake=read.table("emaplakesurvey.txt",header=TRUE,dec=",")
>population.lake=read.table("emaplakepopulation.txt",header=TRUE,dec=",")
The data frame survey.lake contains the sampled-unit data; the population data are in the data frame population.lake.
The eblupBHF function can now be run. The area means and the area population sizes are first arranged as data frames with the area code in the first column, as required by meanxpop and popnsize:
>area.means = aggregate(Elev ~ HUC, data=population.lake, FUN=mean)
>area.sizes = aggregate(Elev ~ HUC, data=population.lake, FUN=length)
>SaeEst = eblupBHF(formula=ANC~1+Elev, dom=HUC, meanxpop=area.means, popnsize=area.sizes, method="REML", data=survey.lake)

The function eblupBHF returns a list with the following objects:
eblup: Data frame with number of rows equal to the number of selected domains (113 in this example), containing in its columns the domain codes (domain) and the EBLUPs of the means of selected domains based on the nested error linear regression model (eblup). For domains with zero sample size, the EBLUPs are the synthetic regression estimators.
fit: A list containing the following objects:
  summary: Summary of the unit-level model fitting
  fixed: Vector with the estimated values of the fixed regression coefficients
  random: Vector with the predicted random effects
  errorvar: Estimated model error variance
  refvar: Estimated random effects variance
  loglike: Log-likelihood
  residuals: Vector with raw residuals
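
The two data-frame arguments described above (area codes in the first column, followed by the population means of the auxiliary variable or by the population sizes) can be prepared with base R alone. The sketch below uses a made-up population of six units; the column names `HUC`, `Elev` and `N` and the toy values are assumptions for illustration only.

```r
# Toy population: area code and auxiliary variable for every unit.
population <- data.frame(
  HUC  = c(101, 101, 101, 102, 102, 103),
  Elev = c(10, 20, 30, 100, 200, 50)
)

# Area code in the first column, population mean of the auxiliary variable next.
meanxpop <- aggregate(Elev ~ HUC, data = population, FUN = mean)

# Area code in the first column, population size in the second.
popnsize <- aggregate(Elev ~ HUC, data = population, FUN = length)
names(popnsize)[2] <- "N"
```

Both data frames are sorted by area code and contain one row per area, including areas that have no sampled unit, which is what allows synthetic estimates to be produced for those areas.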

Results are shown in Table 2.18. The estimates for the 27 out-of-sample areas are obtained with a synthetic estimator.

Table 2.18: Estimated average ANC for all the 113 HUCs obtained using the eblupBHF R function
Columns: HUC, EBLUP, RMSE, CV (table values not recoverable)
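
The RMSE and CV columns reported in tables of this kind can be derived from the point estimates and their estimated MSEs. The sketch below assumes that the RMSE is the square root of the estimated MSE and that the CV is the RMSE expressed as a percentage of the absolute value of the estimate; the report does not state these definitions explicitly, and the function name `accuracy_table` and the input values are hypothetical.

```r
# Build an accuracy table from area codes, estimates and estimated MSEs.
accuracy_table <- function(code, est, mse) {
  rmse <- sqrt(mse)
  data.frame(
    HUC      = code,
    estimate = est,
    RMSE     = rmse,
    CV       = 100 * rmse / abs(est)   # assumed definition: percentage CV
  )
}

# Made-up values for two areas.
acc <- accuracy_table(code = c("A", "B"), est = c(200, 50), mse = c(400, 25))
```

Using the absolute value of the estimate keeps the CV positive when the estimated mean is negative, which can happen for variables that are not restricted to positive values.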

MQ: function mq, available on the SAMPLE project website

The mq_function requires the following arguments:
x: Matrix of auxiliary variables for sampled units
y: The numeric response vector for sampled units
regioncode.s: Area code for sampled units
m: The number of small areas
p: The number of columns of x plus 1
x.outs: Matrix of auxiliary variables for out-of-sample units
regioncode.r: Area code for out-of-sample units
tol.value: Convergence tolerance limit for the MQ model (the example below uses 0.0001)
maxit.value: Maximum number of iterations for the iterative weighted least squares procedure; defaults to 100
k.value: Tuning constant used with the Huber proposal 2 scale estimation (the example below uses 1.345)

The following is an example of how to use mq_function with the EMAP dataset. First, the functions in the file mq.sae.r are loaded (at the time of writing an R package is not yet available):
>source("mq.sae.r")
The EMAP survey dataset shown in Table 2.16 and the EMAP population dataset shown in Table 2.17 are loaded:
>survey.lake=read.table("emaplakesurvey.txt",header=TRUE, dec=",")
>population.lake=read.table("emaplakepopulation.txt",header=TRUE, dec=",")
The data frame survey.lake contains the sampled-unit data; the population data are in the data frame population.lake.
To run mq_function, the sampled units must be removed from the population data frame to obtain the out-of-sample data needed by the function. Because the lake identifiers are character codes, the sampled units are dropped by matching identifiers rather than by row position:
>s=survey.lake$id
>outsample.lake=population.lake[!(population.lake$id %in% s),]
The function can now be run:
>SaeEst = mq_function(x=survey.lake$Elev, y=survey.lake$ANC, regioncode.s=survey.lake$HUC, m=86, p=2, x.outs=outsample.lake$Elev, regioncode.r=outsample.lake$HUC, tol.value=0.0001, maxit.value=100, k.value=1.345)

The function returns small-area estimates of the mean under the MQ model and the corresponding MSE estimates:
mq.cd: Estimates of small-area means using the MQ Chambers and Dunstan estimator (Tzavidis et al., 2010)
mq.naive: Estimates of small-area means using the MQ naive estimator (Chambers and Tzavidis, 2006)
mse.cd: MSE estimates for the MQ Chambers and Dunstan small-area means
mse.naive: MSE estimates for the MQ naive small-area means
code.area: The codes of the small areas

Results for the first three areas are shown in Table 2.19 as an example. Estimates for out-of-sample areas are synthetic estimates obtained at quantile 0.5.

Table 2.19: Estimated average of ANC for all the 113 HUCs obtained using the mq_function R function
Columns: HUC, MQ, RMSE, CV (table values not recoverable)

MQGWR: function mqgwr.sae, available on the SAMPLE project website

The mqgwr.sae R function requires the following arguments:
x.s: matrix of auxiliary variables for sampled units (x.s = Elev in Table 2.16)
y: numeric response vector for sampled units (y = ANC in Table 2.16)
area.s: vector of small-area codes for sampled units (area.s = HUC in Table 2.16)
lon.s: vector of longitudes of the points representing the spatial positions of the sampled observations (lon.s = lon in Table 2.16)
lat.s: vector of latitudes of the points representing the spatial positions of the sampled observations (lat.s = lat in Table 2.16)
x.r: matrix of auxiliary variables for out-of-sample units (x.r = Elev in Table 2.17)
area.r: vector of small-area codes for out-of-sample units (area.r = HUC in Table 2.17)
lon.r: vector of longitudes of the points representing the spatial positions of the out-of-sample observations (lon.r = lon in Table 2.17)
lat.r: vector of latitudes of the points representing the spatial positions of the out-of-sample observations (lat.r = lat in Table 2.17)
k.value: tuning constant used for Huber proposal 2 scale estimation (the example below uses 1.345)
method: a character string. If "mqgwr", the MQGWR model is used to fit the MQ surface. If "mqgwr-li", the MQGWR-local intercepts (LI) model is used; defaults to "mqgwr"
mqgwrweight: geographical weighting function: gwr.gauss() if TRUE or gwr.bisquare() if FALSE; defaults to TRUE

The following is an example of using the R function mqgwr.sae.
First, the functions in the file mqgwr.sae.r are loaded (at the time of writing an R package is not yet available):
>source("mqgwr.sae.r")
Then the EMAP survey dataset shown in Table 2.16 and the EMAP population dataset shown in Table 2.17 are loaded:
>survey.lake=read.table("emaplakesurvey.txt",header=TRUE, dec=",")
>population.lake=read.table("emaplakepopulation.txt",header=TRUE, dec=",")
The data frame survey.lake contains the sampled-unit data; the population data are in the data frame population.lake. To run the function mqgwr.sae, the sampled units must be discarded from the population data frame to obtain the out-of-sample data needed by the function. Because the lake identifiers are character codes, the sampled units are dropped by matching identifiers rather than by row position:
>s=survey.lake$id
>outsample.lake=population.lake[!(population.lake$id %in% s),]
The mqgwr.sae function can now be run, passing the out-of-sample arguments listed above together with the sampled-unit arguments:
>SaeEst = mqgwr.sae(x.s=survey.lake$Elev, y=survey.lake$ANC, m=86, area.s=survey.lake$HUC, lon.s=survey.lake$lon, lat.s=survey.lake$lat, x.r=outsample.lake$Elev, area.r=outsample.lake$HUC, lon.r=outsample.lake$lon, lat.r=outsample.lake$lat, k.value=1.345, method="mqgwr", mqgwrweight=TRUE)

The function mqgwr.sae returns:
area.code.in: unique list of the area codes of the sampled areas (86 codes)
area.code.out: unique list of the area codes of the non-sampled areas (27 codes)
est.mean.in: small-area estimates of the mean for the sampled areas (86 areas)
est.mean.out: small-area estimates of the mean for the non-sampled areas (27 areas)
est.mse.in: estimates of the MSE of the mean estimator (only available for the 86 sampled areas)
Note: The sampled areas are the areas where there is at least one observation in the sample; the non-sampled areas are those where there is no observation.

As an example, the first three estimates are reported in Table 2.20. With this estimator the spatial information is used to obtain the out-of-sample area estimates.

Table 2.20: Estimated average of ANC for all the 113 HUCs obtained using the mqgwr.sae R function
Columns: HUC, MQGWR, RMSE, CV (table values not recoverable)

Application to agricultural data

Since Battese et al. (1988), SAE unit-level models have been applied to various case studies in agriculture. This section also refers to the model-based direct estimator (MBDE) and the spatial model-based direct estimator (SMBDE), which are presented in Section 2.6.5.

Battese et al. (1988) applied the EBLUP to estimate the area under corn and soybeans for each of 12 counties in north central Iowa using farm-interview data in conjunction with LANDSAT data. Each county was divided into area segments, and the areas under corn and soybeans were ascertained for a sample of segments by interviewing farmers. The number of sample segments $n_i$ in a county ranged from 1 to 5. In this application the auxiliary variables $x_{ij}$ are the number of pixels classified as corn and the number of pixels classified as soybeans in the j-th area segment of the i-th county, and the response variable is the number of hectares of corn or soybeans in the j-th sample area segment of the i-th county.
Unit-level auxiliary data in the form of the number of pixels classified as corn and soybeans were also obtained for all the area segments, including the sample segments, in each county using the LANDSAT readings.

Chandra et al. (2007) use real data and design-based simulation to evaluate the performance of EBLUP, MBDE, SEBLUP and SMBDE in the context of a real population and realistic sampling methods, using the ISTAT farm structure survey in Tuscany. They generated a population of N = 22,977 farms by sampling with replacement from the original sample of 529 farms, with probabilities proportional to their sample weights. The small areas of interest are defined by the 23 local economic systems of northern Tuscany.
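
The construction of a pseudo-population by weighted resampling, as described above, can be sketched with base R. The data frame, its column names, the weights and the population size below are made up for illustration; only the resampling mechanism (sampling with replacement with probabilities proportional to the sample weights) follows the description in the text.

```r
set.seed(2007)

# Made-up sample of farms with survey weights.
sample_farms <- data.frame(
  id     = 1:5,
  area   = c("A", "A", "B", "B", "B"),
  weight = c(10, 40, 20, 20, 10)
)

# Pseudo-population of size N: draw sample rows with replacement, with
# selection probabilities proportional to the sample weights.
N <- 100
idx <- sample(seq_len(nrow(sample_farms)), size = N, replace = TRUE,
              prob = sample_farms$weight)
pseudo_pop <- sample_farms[idx, ]

# Area population sizes in the generated pseudo-population.
area_sizes <- table(pseudo_pop$area)
```

The `prob` argument of `sample` does not need to sum to one, so the weights can be passed directly. Repeated samples can then be drawn from `pseudo_pop` under the design of interest to study the design-based behaviour of each estimator.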

Sample sizes in these areas were fixed to be the same as in the original sample. The aim was to estimate average olive production in quintals (100 kg units) in each local economic system using the surface utilized for olives in hectares as the auxiliary variable. The results show that EBLUP and SEBLUP are unstable in a few small areas, mainly because there is little or no variability in the variable of interest in these areas. In contrast, the MBDE and SMBDE methods appear unaffected by such behaviour. The median relative bias of MBDE is smaller than that of EBLUP, whereas the median relative root mean squared error (RRMSE) of EBLUP is smaller than that of MBDE. The median relative bias and median RRMSE of SEBLUP are marginally smaller than those of EBLUP.

Salvati et al. (2009) applied EBLUP, SEBLUP and a spatial version of the MQ predictor to estimate the average production of olives per farm in quintals for each of the small areas making up the local economic systems in Tuscany. In this application the authors employed data from the 2003 ISTAT farm structure survey, which is carried out every two years to collect information on farmland by type of cultivation, amount of animal production, and structure and amount of farm employment on 55,030 farms. The GIS Atlas of Coverage of the Tuscany Region provided information on coordinates, surface area and positions of the small areas of interest. The centroid of each area is the spatial reference for all the units residing in the same small area. The auxiliary variable employed in the models is the surface area used for olive production.

Coelho and Pereira (2011) describe the design of a Monte Carlo simulation study and present empirical results on the performance of direct and indirect estimators using a real dataset from an agricultural survey conducted by the Portuguese Statistical Office.
In particular, the authors analyse the performance of the EBLUP with random small-area effects that have a spatial covariance structure following an isotropic exponential model. To explore the behaviour of the small-area predictors, the authors built a pseudo-population from a real dataset containing the responses to the 1993 Agricultural Structure Survey, which is carried out by the Portuguese Statistical Office between agricultural censuses. The responses for the variable total production of cereals were extracted and restricted to the Nomenclatura das Unidades Territoriais para Fins Estatísticos 2 of the Alentejo region. The total sample size was 7,060 and the population size 47,049. Production in 1989 is used as an auxiliary variable in the models applied in the simulation. Geographical coordinates associated with the centroids of freguesias (administrative divisions) are recorded. This is the lowest level of aggregation for which geographical referencing is available. From the results of the simulation experiment it is evident that when the data display spatial variability, the estimators that reflect the spatial correlation between observations tend to present reductions in bias and bias ratio when compared with estimators that ignore this variability. These reductions are usually accompanied by a modest loss of precision, resulting in bias ratios that are generally substantially lower than those obtained for the other estimators.

Final remarks on the EBLUP, SEBLUP, MQ and MQGWR

The following remarks stem from the review of the features of the unit-level models.

1. The EBLUP, SEBLUP, MQ and MQGWR require access to microdata files from the survey, with the sampled units classified by area of interest. Microdata are also needed for the auxiliary variables, which must be classified by area. The methods are frequently applied nonetheless, and routines to implement them can be downloaded free from several websites.
The references here are mainly to routines developed under the EURAREA and SAMPLE projects. Table 2.21 summarizes the data and software needed to implement the methods.

2. The SEBLUP and MQGWR require spatial information on the individual units and areas, and localization of the individual units. Many spatial references useful to describe the spatial relation can be obtained from GIS digital maps: for example, the localization of the units and their coordinates, spatial contiguity matrices for areas, and the centroids of the areas with their coordinates and the distances between them.

Table 2.21: Model-based methods under unit-level specification: data needed and software

EBLUP
  Study variable Y: individual Y microdata classified by area (sampled units)
  Aux info X: individual X microdata classified by area (sampled and non-sampled units)
  Software: EURAREA project website; SAMPLE project website
SEBLUP
  Study variable Y: individual Y microdata classified by area (sampled units)
  Aux info X: individual X microdata classified by area (sampled and non-sampled units) + contiguity matrix W (sampled and non-sampled areas)
  Software: EURAREA project website; SAMPLE project website
MQ
  Study variable Y: individual Y microdata classified by area (sampled units)
  Aux info X: individual X microdata classified by area (sampled and non-sampled units)
  Software: SAMPLE project website
MQGWR
  Study variable Y: individual Y microdata classified by area (sampled units)
  Aux info X: individual X microdata classified by area (sampled and non-sampled units) + Euclidean distances between the centroids of the areas (sampled and non-sampled areas)
  Software: SAMPLE project website
Website path given in the table: guide-method/methodquality/general-methodology/small-area-estimation/eurarea/index.html

3. In practical applications it is important to complement estimates with the estimated MSE as a measure of their accuracy. For EBLUP, SEBLUP, MQ and MQGWR the routines for their application provide estimates of MSE. The MSE estimator is obtained following Chambers et al. (2011), who proposed a method of MSE estimation for estimators of finite population-domain means that can be expressed in pseudo-linear form, that is, as weighted sums of sample values. In particular, it can be used for estimating the MSE of the EBLUP, the MB direct estimator and the MQ-based predictors. There are many applications where the performance of the predictors is compared in real-life case studies. Gains in accuracy can be obtained when the underlying models fit the distribution of the study variable more closely.
The estimators that reflect the spatial correlation between observations tend to present reductions in bias and bias ratio when compared with estimators that ignore this variability.

4. Explicit modelling of spatial effects in the SEBLUP and MQGWR becomes necessary when there are insufficient geographic covariates to explain local interactions. Simulation studies show that this happens when the autoregressive spatial coefficient is more than 0.5. In this case, the best solution appears to be to take advantage of the information of the related areas via an SAR model or a local geographic regression: the SEBLUP will outperform the basic EBLUP model and the MQGWR will outperform the MQ model.

5. With regard to unit-level specifications, the EBLUP and SEBLUP can be very sensitive to representative outliers or departures from the assumed normal distributions for the random effects in the model (see Sinha and Rao, 2009 for a robust version of EBLUP; Schmid and Münnich, 2013 for a robust version of SEBLUP). Simulation studies show that MQ predictors have better resistance to outliers than the traditional EBLUP (Giusti et al., 2014). There are no specific studies to test the resilience to outliers of the MQGWR.
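
Several of the comparisons summarized in this section rely on the relative bias and the relative root mean squared error (RRMSE) of a predictor over simulation replicates. A minimal base R sketch follows, assuming the common definitions in which both quantities are expressed relative to the true area value; the function names and the toy replicates are hypothetical.

```r
# Relative bias of a predictor over simulation replicates for one area.
relative_bias <- function(est, truth) mean(est - truth) / truth

# Relative root mean squared error over the same replicates.
rrmse <- function(est, truth) sqrt(mean((est - truth)^2)) / truth

# Toy example: four replicates of an estimate of a true area mean of 10.
est <- c(9, 11, 10, 12)
truth <- 10
rb <- relative_bias(est, truth)   # mean error 0.5 relative to 10
rr <- rrmse(est, truth)           # sqrt(1.5) relative to 10
```

Medians of these two quantities across areas give summary figures of the kind quoted above when predictors are compared.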

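2.6 Extensions of the previous small-area models

2.6.1 Semi-parametric Fay and Herriot model

An alternative approach for introducing spatial correlation in an area-level model was proposed by Giusti et al. (2012), with a semi-parametric specification of the Fay-Herriot model obtained by truncated polynomial splines (P-splines). This allows non-linearities in the relationship between the response variable and the auxiliary variables. A semi-parametric additive model (hereinafter the semi-parametric model) with one covariate $x_1$ can be written as $\tilde{m}(x_1)$, where the function $\tilde{m}(\cdot)$ is unknown but assumed to be sufficiently well approximated by the function:

$$m(x_1; \boldsymbol{\beta}, \boldsymbol{\gamma}) = \beta_0 + \beta_1 x_1 + \dots + \beta_q x_1^q + \sum_{k=1}^{K} \gamma_k (x_1 - \kappa_k)_+^q \tag{2.23}$$

where $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_q)^T$ is the vector of the coefficients of the polynomial function, $\boldsymbol{\gamma} = (\gamma_1, \dots, \gamma_K)^T$ is the coefficient vector of the truncated polynomial spline (P-spline) basis and $q$ is the degree of the spline, with $(t)_+^q = t^q$ if $t > 0$ and 0 otherwise. The spline portion of the model handles departures from a $q$-degree polynomial fit in the structure of the relationship. In this portion, $\kappa_k$ for $k = 1, \dots, K$ is a set of fixed knots, and if $K$ is sufficiently large the class of functions in (2.23) is very large and can approximate most smooth functions. Details of the choice of bases and knots can be found in Ruppert et al. (2003).

Since a P-spline model can be viewed as a random-effects model (Ruppert et al., 2003; Opsomer et al., 2008), it can be combined with the Fay-Herriot model to obtain a semi-parametric SAE framework based on LMM regression. Correspondingly, for $m$ small areas the $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ vectors define:

$$\mathbf{X}_1 = \begin{bmatrix} 1 & x_{11} & \cdots & x_{11}^q \\ \vdots & \vdots & & \vdots \\ 1 & x_{1m} & \cdots & x_{1m}^q \end{bmatrix}, \qquad \mathbf{Z} = \begin{bmatrix} (x_{11} - \kappa_1)_+^q & \cdots & (x_{11} - \kappa_K)_+^q \\ \vdots & & \vdots \\ (x_{1m} - \kappa_1)_+^q & \cdots & (x_{1m} - \kappa_K)_+^q \end{bmatrix}$$

Following the notation introduced previously for the Fay-Herriot model, and adding the $\mathbf{X}_1$ matrix to the fixed-effect matrix $\mathbf{X}$, the model becomes:

$$\hat{\boldsymbol{\theta}} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\gamma} + \mathbf{D}\mathbf{u} + \mathbf{e} \tag{2.24}$$

where $\hat{\boldsymbol{\theta}}$ is the vector of direct estimates, $\boldsymbol{\beta}$ is a vector of regression coefficients, $\mathbf{u}$ is the vector of area random effects with design matrix $\mathbf{D}$, $\mathbf{e}$ is the vector of sampling errors, and the $\boldsymbol{\gamma}$ component can be treated as a $K \times 1$ vector of independent and identically distributed random variables with mean 0 and $K \times K$ variance matrix $\boldsymbol{\Sigma}_\gamma = \sigma_\gamma^2 \mathbf{I}_K$. The covariance matrix of model (2.24) is $\mathbf{V}(\boldsymbol{\psi}) = \mathbf{Z}\boldsymbol{\Sigma}_\gamma\mathbf{Z}^T + \mathbf{D}\boldsymbol{\Sigma}_u\mathbf{D}^T + \mathbf{R}$, where $\boldsymbol{\psi} = (\sigma_\gamma^2, \sigma_u^2)^T$, $\boldsymbol{\Sigma}_u = \sigma_u^2 \mathbf{I}_m$ and $\mathbf{R}$ is the covariance matrix of the sampling errors.

Model-based estimation of the small-area parameters can be obtained by using the EBLUP (Henderson, 1975):

$$\hat{\boldsymbol{\theta}}^{NPEBLUP} = \mathbf{X}\hat{\boldsymbol{\beta}} + \mathbf{Z}\hat{\boldsymbol{\gamma}} + \mathbf{D}\hat{\mathbf{u}} \tag{2.25}$$

with $\hat{\boldsymbol{\gamma}} = \hat{\boldsymbol{\Sigma}}_\gamma \mathbf{Z}^T \hat{\mathbf{V}}^{-1}(\hat{\boldsymbol{\theta}} - \mathbf{X}\hat{\boldsymbol{\beta}})$ and $\hat{\mathbf{u}} = \hat{\boldsymbol{\Sigma}}_u \mathbf{D}^T \hat{\mathbf{V}}^{-1}(\hat{\boldsymbol{\theta}} - \mathbf{X}\hat{\boldsymbol{\beta}})$, where $\hat{\boldsymbol{\beta}}$ is the generalized least squares estimator of $\boldsymbol{\beta}$; this predictor is hereinafter called the non-parametric EBLUP (NPEBLUP).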
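As a hedged illustration of the truncated polynomial basis behind (2.23), the following Python sketch builds the polynomial and spline design blocks and fits them by penalized least squares. The function names, the single knot and the fixed penalty `lam` are assumptions made for this example and are not taken from Giusti et al. (2012); in the mixed-model formulation the amount of smoothing would instead follow from the estimated variance components.

```python
import numpy as np


def truncated_power_basis(x, knots, degree=1):
    """Build the two design blocks of a truncated polynomial spline.

    Returns X with columns 1, x, ..., x^degree and Z with columns
    (x - kappa_k)_+^degree, one per knot. Assumes degree >= 1.
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    X = np.vander(x, degree + 1, increasing=True)
    Z = np.clip(x[:, None] - knots[None, :], 0.0, None) ** degree
    return X, Z


def fit_penalized_spline(x, y, knots, degree=1, lam=1.0):
    """Penalized least squares with a ridge penalty on the spline part only."""
    X, Z = truncated_power_basis(x, knots, degree)
    C = np.hstack([X, Z])
    # No penalty on the polynomial coefficients, penalty lam on each gamma_k.
    penalty = np.diag([0.0] * X.shape[1] + [lam] * Z.shape[1])
    coef = np.linalg.solve(C.T @ C + penalty, C.T @ np.asarray(y, dtype=float))
    return coef[: X.shape[1]], coef[X.shape[1]:]


# Hypothetical data: a covariate on [0, 1] and a piecewise-linear signal.
x = np.linspace(0.0, 1.0, 41)
y = 1.0 + 2.0 * x + 3.0 * np.clip(x - 0.5, 0.0, None)

# With no penalty and a knot at the true break point, the fit is exact.
beta_hat, gamma_hat = fit_penalized_spline(x, y, knots=[0.5], degree=1, lam=0.0)

# A positive penalty shrinks the spline coefficient towards zero.
_, gamma_shrunk = fit_penalized_spline(x, y, knots=[0.5], degree=1, lam=10.0)
```

Penalizing only the spline coefficients is what allows them to be read as random effects: the penalty plays the role of a ratio between the error variance and the variance of the spline coefficients.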

When geographically referenced responses play a central role in the analysis and need to be converted to maps, bivariate smoothing can be used by specifying a semi-parametric bivariate additive model (see Giusti et al., 2012).

2.6.2 NPEBLUP specified at the unit level

Although useful in many estimation contexts, LMMs depend on distributional assumptions for the random part of the model and do not easily allow for outlier-robust inference. The fixed part of the model may also not be flexible enough to handle estimation contexts in which the relationship between the variable of interest and some covariates is more complex than a linear model. Opsomer et al. (2008) usefully extend model (2.12) to the case in which the small-area random effects can be combined with a smooth non-parametrically specified trend. In the simplest case:

$$y_{ij} = \tilde{m}(x_{1ij}) + u_i + e_{ij} \tag{2.26}$$

where $\tilde{m}(\cdot)$ is an unknown smooth function of the variable $x_1$, $u_i$ is the random effect of area $i$ and $e_{ij}$ is the unit-level error. The estimator of the small-area mean is:

$$\hat{\bar{y}}_i^{NPEBLUP} = N_i^{-1}\Big\{\sum_{j \in s_i} y_{ij} + \sum_{j \in r_i} \hat{y}_{ij}^{NPEBLUP}\Big\}$$

as in (2.13), where $\hat{y}_{ij}^{NPEBLUP} = \mathbf{x}_{ij}^T\hat{\boldsymbol{\beta}} + \mathbf{z}_{ij}^T\hat{\boldsymbol{\gamma}} + \hat{u}_i$, and $s_i$ and $r_i$ denote the sampled and non-sampled units of area $i$. By using penalized splines as the representation for the non-parametric trend, Opsomer et al. (2008) express the non-parametric small-area estimation problem as a mixed-effect model regression. The latter can be easily extended to handle bivariate smoothing and additive modelling.

Ugarte et al. (2009) proposed a P-spline model to analyse trends in small areas and to forecast future values of the response. The prediction MSE for the fitted and the predicted values, together with estimators for those quantities, were also derived.

2.6.3 Non-parametric MQ specified at unit level

Pratesi et al. (2008) extended this approach to the MQ method for estimating the small-area parameters, using a non-parametric specification of the conditional MQ of the response variable given the covariates. When the functional form of the relationship between the $q$th MQ and the covariates deviates from the assumed form, the traditional MQ regression can lead to biased estimators of the small-area parameters.
Using P-splines for MQ regression exploits the properties of MQ models and also makes it possible to deal with an undefined functional relationship that can be estimated from the data. When the relationship between the $q$th MQ and the covariates is not linear, a P-spline MQ regression model may have significant advantages compared to the linear MQ model:

$$MQ_q(x_1; \psi) = \tilde{m}_{\psi,q}(x_1) \tag{2.27}$$

where $\tilde{m}_{\psi,q}(\cdot)$ is an unknown smooth function and $\psi$ is the influence function of the MQ. The small-area estimator of the mean may be taken as in (2.17), where the unobserved value for population unit $j$ in area $i$ is predicted using:

$$\hat{y}_{ij} = \mathbf{x}_{ij}^T\hat{\boldsymbol{\beta}}_\psi(\hat{\theta}_i) + \mathbf{z}_{ij}^T\hat{\boldsymbol{\gamma}}_\psi(\hat{\theta}_i)$$

where $\hat{\boldsymbol{\beta}}_\psi$ and $\hat{\boldsymbol{\gamma}}_\psi$ are the coefficient vectors of the parametric and spline portions respectively of the fitted P-spline MQ regression function at $\hat{\theta}_i$, the estimated MQ coefficient of area $i$. In the case of P-splines and MQ regression models, the bias-adjusted estimator for the mean is given by:

$$\hat{\bar{y}}_i^{NPMQ/CD} = N_i^{-1}\Big\{\sum_{j \in U_i} \hat{y}_{ij} + \frac{N_i}{n_i}\sum_{j \in s_i}(y_{ij} - \hat{y}_{ij})\Big\} \tag{2.28}$$

where $\hat{y}_{ij}$ denotes the predicted values for the population units in $s_i$ and in $U_i$. The use of bivariate P-spline approximations to fit non-parametric unit-level nested error and MQ regression models makes it possible to reflect spatial variation in the data and then to use these non-parametric models for SAE.

2.6.4 GWEBLUP

An alternative approach to incorporating spatial information in the model is to assume that the regression coefficients vary spatially across the geography of interest. Models of this type can be fitted using GWR (see section 2.5.4) and are suitable for modelling spatial non-stationarity (Brunsdon et al., 1998; Fotheringham et al., 2002). Chandra et al. (2012) proposed a GWEBLUP for a small-area average and an estimator of its conditional MSE. The GWEBLUP is based on a mixed model that allows for spatially non-stationary linear fixed effects as well as random area effects. It is obtained by local linear fitting of an LMM using weights that are a function of the distance between the sample data points. Parameter estimation for the GWEBLUP is performed by extending the maximum likelihood estimation of the conventional LMM to incorporate the geographical information contained in these distances.

2.6.5 MBDE and SMBDE

Chandra and Chambers (2005) proposed an alternative approach to SAE based on the use of MBDE in the small areas. In this case an estimate for a small area of interest corresponds to a weighted linear combination of the sample data for that area, with weights based on a population-level version of the LMM. These weights borrow strength through this model, which includes random area effects. Provided the assumed small-area model is true, the EBLUP is asymptotically the most efficient estimator for a particular small area. In practice, however, the true model for the data is unknown, and the EBLUP can be inefficient if the model is wrongly specified. Chandra and Chambers (2005) noted that in such circumstances MBDE offers an alternative to a potentially unstable EBLUP.
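In particular, MBDE is easy to implement, produces sensible estimates when the sample data exhibit patterns of variability that are inconsistent with the assumed model (for example, when they contain too many zeros) and generates robust MSE estimates. Under the population-level LMM, the sample weights that define the EBLUP for the population total of $y$ are:

$$\mathbf{w}^{EBLUP} = \mathbf{1}_n + \hat{\mathbf{H}}^T(\mathbf{X}^T\mathbf{1}_N - \mathbf{X}_s^T\mathbf{1}_n) + (\mathbf{I}_n - \hat{\mathbf{H}}^T\mathbf{X}_s^T)\hat{\mathbf{V}}_{ss}^{-1}\hat{\mathbf{V}}_{sr}\mathbf{1}_{N-n} \tag{2.29}$$

where $\hat{\mathbf{H}} = (\mathbf{X}_s^T\hat{\mathbf{V}}_{ss}^{-1}\mathbf{X}_s)^{-1}\mathbf{X}_s^T\hat{\mathbf{V}}_{ss}^{-1}$, and $\hat{\mathbf{V}}_{ss}$ and $\hat{\mathbf{V}}_{sr}$ are the sample and sample-by-non-sample blocks of the estimated covariance matrix of $y$ (see Royall, 1976). The MBDE (see Chandra and Chambers, 2005) of the $i$th small-area mean is then defined as:

$$\hat{\bar{y}}_i^{MBDE} = \frac{\sum_{j \in s_i} w_j^{EBLUP} y_j}{\sum_{j \in s_i} w_j^{EBLUP}} \tag{2.30}$$

Chandra et al. (2007) proposed an SMBDE in which the $i$th small-area mean is given by (2.30), with the weights (2.29) replaced by the spatial EBLUP weights $\mathbf{w}^{SEBLUP}$, defined as in (2.29) but where now:

$$\hat{\mathbf{V}}_{ss} = \hat{\sigma}_e^2\mathbf{I}_n + \mathbf{Z}_s\hat{\boldsymbol{\Omega}}\mathbf{Z}_s^T \quad \text{and} \quad \hat{\mathbf{V}}_{sr} = \mathbf{Z}_s\hat{\boldsymbol{\Omega}}\mathbf{Z}_r^T$$

with $\hat{\boldsymbol{\Omega}} = \hat{\sigma}_u^2[(\mathbf{I}_m - \hat{\rho}\mathbf{W}^T)(\mathbf{I}_m - \hat{\rho}\mathbf{W})]^{-1}$ the estimated SAR covariance matrix of the area effects, $\mathbf{W}$ the spatial proximity matrix, $\rho$ the spatial autoregressive coefficient, and $\mathbf{Z}_s$ and $\mathbf{Z}_r$ the area-indicator matrices of the sampled and non-sampled units.

2.6.6 A note on Bayesian SAE methods

Bayesian alternatives to the non-spatial and spatial mixed-effects models for SAE include Datta and Ghosh (1991 and 2012), Ghosh et al. (1998) and Rao (2003). In particular, Bayesian small-area spatial modelling has been successful in similar contexts such as the estimation of rates of disease in different geographic regions (Best et al., 2005). Complex mixed effects and correlation between areas can be easily handled and modelled hierarchically in different layers of the model.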

Although implementation of complex Bayesian models requires computationally intensive Markov Chain Monte Carlo (MCMC) simulation algorithms (Gilks et al., 1995), there are a number of potential benefits of the Bayesian approach for SAE. Gomez-Rubio et al. (2010) present these advantages:
1. It offers a coherent framework that can handle different types of target variable (continuous, dichotomous and categorical, for example), different random-effect structures (such as independent and spatially correlated), areas with no direct survey information and models to smooth the survey sample variance estimates, in a consistent way and using the same computational methods and software whatever the model.
2. Uncertainty about all model parameters is automatically captured by the posterior distribution of the small-area estimates and any functions of these, such as their rank, and by the predictive distribution of estimates for small areas not included in the survey sample.
3. Bayesian methods are particularly suited to sparse-data problems, such as when the survey sample size per area is small, because Bayesian posterior inference is exact and does not rely on asymptotic arguments.
4. The posterior distribution obtained from a Bayesian model provides a richer output than the traditional point-and-interval estimates from a corresponding likelihood-based model. In particular, the ability to make direct probability statements about unknown quantities (for example, the probability that the target variable exceeds some specified threshold in each area) and to quantify all sources of uncertainty in the model makes Bayesian SAE suitable for informing and evaluating policy decisions.

2.6.7 SAE for binary and count data

Let $y_{ij}$ be the value of the outcome of interest, a discrete or a categorical variable, for unit $j$ in area $i$, and let $\mathbf{x}_{ij}$ denote a vector of unit-level covariates, including an intercept.
Working within a frequentist paradigm, one can follow Jiang and Lahiri (2001), who propose an empirical best predictor for a binary response, or Jiang (2003), who extends these results to generalized linear mixed models. Nevertheless, use of the empirical best predictor can be computationally challenging (Molina and Rao, 2010). Despite their attractive properties as far as modelling non-normal outcomes is concerned, generalized LMMs require numerical approximations to be fitted. In particular, the likelihood function defined by a generalized LMM can involve high-dimensional integrals that cannot be evaluated analytically (see McCulloch, 1994 and 1997; Song et al., 2005). In such cases, numerical approximations can be used, as for example in the R function glmer in the package lme4. Alternatively, estimates of the model parameters can be obtained by using an iterative procedure that combines maximum penalized quasi-likelihood and REML estimation (Saei and Chambers, 2003).

Estimates of generalized LMM parameters can be sensitive to outliers or departures from underlying distributional assumptions. Large deviations from the expected response and outlying points in the space of the explanatory variables are known to have a significant influence on classical maximum-likelihood inference based on generalized linear models. Nonetheless, in the case of discrete outcomes, model-based SAE conventionally employs a generalized LMM for $\mu_{ij} = E[y_{ij} \mid \mathbf{u}_i]$ of the form:

$$g(\mu_{ij}) = \eta_{ij} = \mathbf{x}_{ij}^T\boldsymbol{\beta} + \mathbf{z}_{ij}^T\mathbf{u}_i \tag{2.31}$$

where $g$ is a link function. When $y_{ij}$ is binary-valued, a popular choice for $g$ is the logistic link function, and the individual $y_{ij}$ values in area $i$ are taken to be independent Bernoulli outcomes with:

$$\mu_{ij} = P(y_{ij} = 1 \mid \mathbf{u}_i) = \exp(\eta_{ij})\{1 + \exp(\eta_{ij})\}^{-1} \quad \text{and} \quad Var[y_{ij} \mid \mathbf{u}_i] = \mu_{ij}(1 - \mu_{ij}).$$

When $y_{ij}$ is a count outcome, the logarithmic link function is commonly used and the individual $y_{ij}$ values in area $i$ are assumed to be independent Poisson random variables with:

$$\mu_{ij} = \exp(\eta_{ij}) \quad \text{and} \quad Var[y_{ij} \mid \mathbf{u}_i] = \mu_{ij}.$$

The $q$-dimensional vector $\mathbf{u}_i$ is generally assumed to be independently distributed between areas according to a normal distribution with mean 0 and covariance matrix $\boldsymbol{\Sigma}_u$. This matrix depends on parameters $\boldsymbol{\delta}$, which are referred to as the variance components, and $\boldsymbol{\beta}$ in (2.31) is the vector of fixed effects. If the target of inference is the small-area $i$ mean (proportion), $\bar{y}_i = N_i^{-1}\sum_{j \in U_i} y_{ij}$, and the Poisson or Bernoulli generalized LMM is assumed, the approximation to the minimum MSE predictor of $\bar{y}_i$ is $N_i^{-1}\{\sum_{j \in s_i} y_{ij} + \sum_{j \in r_i} \mu_{ij}\}$. Since $\mu_{ij}$ depends on $\boldsymbol{\beta}$ and $\mathbf{u}_i$, a further stage of approximation is required in which unknown parameters are replaced by suitable estimates. This leads to the conditional expectation predictor for the area $i$ mean (proportion) under the logarithmic or logistic link:

$$\hat{\bar{y}}_i^{CEP} = N_i^{-1}\Big\{\sum_{j \in s_i} y_{ij} + \sum_{j \in r_i} \hat{\mu}_{ij}\Big\} \tag{2.32}$$

where $\hat{\mu}_{ij} = \exp(\hat{\eta}_{ij})$ or $\hat{\mu}_{ij} = \exp(\hat{\eta}_{ij})\{1 + \exp(\hat{\eta}_{ij})\}^{-1}$, $\hat{\eta}_{ij} = \mathbf{x}_{ij}^T\hat{\boldsymbol{\beta}} + \mathbf{z}_{ij}^T\hat{\mathbf{u}}_i$, $\hat{\boldsymbol{\beta}}$ is the vector of the estimated fixed effects and $\hat{\mathbf{u}}_i$ denotes the vector of the predicted area-specific random effects. We refer to (2.32) in this case as a random intercepts conditional expectation predictor. For details, see Saei and Chambers (2003), Jiang and Lahiri (2006a) and Gonzalez-Manteiga et al. (2007). Note, however, that (2.32) is not the proper empirical best predictor of Jiang (2003). The proper empirical best predictor does not have a closed form and needs to be computed by numerical approximation. For this reason the conditional expectation predictor (2.32) is used in practice, as with the small-area estimates of labour force activity currently produced by the United Kingdom Office for National Statistics.

2.7 Geostatistical methods

Geostatistics is concerned with the problem of producing a map of a quantity of interest over a particular geographical region, based on usually noisy measurements taken at a set of locations in the region. The aim is to describe and analyse the geographical pattern of the phenomenon of interest. Geostatistical methods are developed and applied in areas such as environmental studies and epidemiology, where spatial information is recorded and available.
In recent years the diffusion of spatially detailed statistical data has increased considerably, and this kind of procedure, with modifications as appropriate, can be used in other fields of application such as studies of the demographic and socio-economic characteristics of a population in a particular region. To obtain a surface estimate, one can exploit exact knowledge of the latitude and longitude of the studied phenomenon by using bivariate smoothing techniques such as kernel estimates or kriging. Bivariate smoothing deals with the flexible smoothing of point clouds to obtain surface estimates that can be used to produce maps. The geographical application, however, is not the only use of bivariate smoothing, because the method can be applied to handle the non-linear relation between any two continuous predictors and a response variable (Cressie, 1993; Ruppert et al., 2003). Kriging, a widely used method for interpolating or smoothing spatial data, has a close connection with P-spline smoothing. Its aims are akin to those of non-parametric regression, and the understanding of spatial estimates can be enriched through their interpretation as smoothing estimates (Nychka, 2000). The spatial information alone, however, does not properly explain the pattern of the response variable: one therefore needs to introduce some covariates in a more complex model.

2.7.1 Geoadditive models

Geoadditive models were introduced by Kammann and Wand (2003) to analyse the spatial distribution of the study variable while accounting for possible covariate effects through an LMM representation. The first half of the model formulation involves a low-rank mixed-model representation of additive models; the geographical component is then added by expressing kriging as an LMM. This is then merged with the additive model to obtain a single mixed model, the geoadditive model. The model is specified as:

$$y_i = \beta_0 + f(s_i) + g(t_i) + \boldsymbol{\beta}_1^T\mathbf{x}_i + S(\mathbf{x}_i) + \varepsilon_i \tag{2.33}$$

where, in the first part, $s_i$, $t_i$ and $y_i$ represent measurements on two predictors $s$ and $t$ and a response variable $y$ for unit $i$, and $f$ and $g$ are smooth but otherwise unspecified functions of $s$ and $t$ respectively. The second part of the model is the simple universal kriging model, with $\mathbf{x}_i$ representing the geographical location and $S(\mathbf{x})$ a stationary zero-mean stochastic process. Because the first and the second part of model (2.33) can each be specified as an LMM, the whole model (2.33) can also be formulated as a single LMM that can be fitted using standard mixed-model software. In a geoadditive model the LMM structure therefore enables the inclusion of the area-specific effect as an additional random component. In particular, a geoadditive SAE model has two random-effect components: the area-specific effects and the spatial effects (Bocci, 2009). Kammann and Wand (2003) provide more details on geoadditive model specifications.

Having a mixed-model specification, geoadditive models can be used to obtain small-area estimators under a non-parametric approach (Opsomer et al., 2008; see also Part II). In this respect, Bocci et al. (2012) use a two-part geoadditive SAE model to estimate the per-farm average grape production, specified as a semi-continuous skewed variable, at the agrarian region level using data from the fifth Italian agricultural census. In more detail, the response variable, which is assumed to have a significant spatial pattern, has a semi-continuous structure: a fraction of its values are equal to zero, and the remaining values follow a continuous skewed distribution. Hence the variable can be recoded as:

$$I_{ij} = \begin{cases} 1 & \text{if } y_{ij} > 0 \\ 0 & \text{if } y_{ij} = 0 \end{cases} \tag{2.34}$$

and

$$y_{ij}^* = \begin{cases} y_{ij} & \text{if } y_{ij} > 0 \\ \text{irrelevant} & \text{if } y_{ij} = 0 \end{cases} \tag{2.35}$$

For this variable, Bocci et al. (2012) specify two uncorrelated geoadditive small-area models, one for the logit probability of $I_{ij} = 1$ and one for the conditional mean of the logarithm of the response. Another extension to the work of Kammann and Wand (2003) is the geoadditive model proposed by Cafarelli and Castrignanò (2011), which was used to analyse the spatial distribution of grain weight, a common indicator of wheat production, taking into account its non-linear relations with other crop features.
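The two-part recoding of a semi-continuous response can be sketched in a few lines of Python. This is an illustration only: the function name and the use of `NaN` to mark the undefined positive part are assumptions, and the log transform simply follows the description of the second model above rather than any published code from Bocci et al. (2012).

```python
import numpy as np


def recode_two_part(y):
    """Split a semi-continuous response into the two parts of a two-part model.

    Returns an indicator of a positive value (target of the logit model) and
    the log of the positive values (target of the conditional-mean model).
    Units with y == 0 get NaN in the second part, since they do not enter it.
    """
    y = np.asarray(y, dtype=float)
    positive = y > 0
    indicator = positive.astype(int)
    log_positive = np.full(y.shape, np.nan)
    log_positive[positive] = np.log(y[positive])
    return indicator, log_positive


# Hypothetical per-farm production values, with two farms producing nothing.
production = np.array([0.0, 1.0, 0.0, np.e, 10.0])
indicator, log_positive = recode_two_part(production)
```

Only the rows with `indicator == 1` would be passed to the second model, which is why the zero-production units are marked as missing rather than set to zero.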

2.7.2 Kriging

The principles of geostatistics and interpolation by kriging are described in a large body of literature that includes Burrough (1986), Cressie (1993), Deutsch and Journel (1992), Isaaks and Srivastava (1989), Journel and Huijbregts (1978), Matheron (1963) and Webster and Oliver (2001). Only the basic notions are outlined here. An account of the origins of kriging is given by Cressie (1990). Kriging is based on the concept of random functions: the surface or volume is assumed to be one realisation of a random function with a certain spatial covariance (Journel and Huijbregts, 1978; Matheron, 1963). In this sense kriging is a form of weighted average in which the weights depend on the location of the observed points and on the structure of their covariance or semivariogram (Hemyari and Nofziger, 1987). The weights are chosen so that the prediction error is smaller than for any other linear sum. A semivariogram is a function used to indicate spatial correlation in observations measured at sample locations. The literature on kriging provides a choice of functions that can be used as theoretical semivariograms: spherical, exponential, Gaussian or Bessel, for example. The parameters of these functions are then optimized for the best fit to the experimental semivariogram. Kriging is used extensively to produce contour maps (Dowd, 1985; Sabin, 1985), for example to predict the values of soil attributes at non-sampled locations. All kriging estimators are variants of the basic equation:

$$\hat{Z}(x_0) - \mu = \sum_{i=1}^{N} \lambda_i [Z(x_i) - \mu(x_0)] \tag{2.36}$$

where $\mu$ is a known stationary mean, assumed to be constant over the whole domain and calculated as the average of the data (Wackernagel, 2003). The parameter $\lambda_i$ is the kriging weight, $N$ is the number of sampled points used to make the estimation (it depends on the size of the search window) and $\mu(x_0)$ is the mean of the samples within the search window. The kriging weights are estimated by minimizing the variance, as follows:

$$Var[\hat{Z}(x_0)] = E[\{\hat{Z}(x_0) - Z(x_0)\}^2] = \sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i\lambda_j C(x_i, x_j) + C(x_0, x_0) - 2\sum_{i=1}^{N} \lambda_i C(x_i, x_0) \tag{2.37}$$

where $Z(x_0)$ is the true value expected at point $x_0$, $N$ represents the number of observations to be included in the estimation and $C(x_i, x_j) = Cov[Z(x_i), Z(x_j)]$ (Isaaks and Srivastava, 1989).

The main strengths of kriging are the statistical quality of its predictions (its unbiasedness, for example) and the ability to predict the spatial distribution of uncertainty. It has been less successful in applications where local geometry and smoothness are the key issues, and where other methods prove to be competitive or even better (Deutsch and Journel, 1992; Hardy, 1990).

In ordinary kriging, the value of the attribute is estimated using equation (2.36) by replacing $\mu$ with a local mean $\mu(x_0)$, that is, the mean of the samples within the search window, and forcing $1 - \sum_{i=1}^{N}\lambda_i = 0$, that is, $\sum_{i=1}^{N}\lambda_i = 1$. Plugging this constraint into equation (2.36) gives $\hat{Z}(x_0) = \sum_{i=1}^{N}\lambda_i Z(x_i) + [1 - \sum_{i=1}^{N}\lambda_i]\mu(x_0) = \sum_{i=1}^{N}\lambda_i Z(x_i)$ (Clark and Harper, 2001; Goovaerts, 2010). Ordinary kriging estimates the local constant mean, then performs simple kriging on the corresponding residuals; it only requires the mean to be stationary within the local search window (Goovaerts, 2010).
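To make the minimization in (2.37) concrete, the sketch below solves the ordinary kriging system for the weights with NumPy. It is a minimal example under stated assumptions: an exponential covariance with known `sill` and `corr_range` (in practice these would be fitted to the experimental semivariogram), no nugget effect, and all observations used rather than a search window. The function names are hypothetical.

```python
import numpy as np


def exponential_covariance(h, sill=1.0, corr_range=1.0):
    """Exponential covariance C(h) = sill * exp(-h / corr_range)."""
    return sill * np.exp(-np.asarray(h, dtype=float) / corr_range)


def ordinary_kriging(coords, values, x0, sill=1.0, corr_range=1.0):
    """Ordinary kriging prediction, variance and weights at location x0."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    n = len(values)

    # Distances between data points, and from each data point to x0.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    dist0 = np.linalg.norm(coords - x0, axis=1)

    # Augmented system: covariances plus the constraint sum(weights) = 1,
    # enforced through a Lagrange multiplier in the last row and column.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = exponential_covariance(dist, sill, corr_range)
    A[n, n] = 0.0
    b = np.append(exponential_covariance(dist0, sill, corr_range), 1.0)

    solution = np.linalg.solve(A, b)
    weights, multiplier = solution[:n], solution[n]
    prediction = weights @ values
    variance = sill - weights @ b[:n] - multiplier
    return prediction, variance, weights


# Hypothetical observations at the corners of the unit square.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([1.0, 2.0, 3.0, 4.0])

pred_centre, var_centre, w_centre = ordinary_kriging(coords, values, [0.5, 0.5])
pred_corner, var_corner, w_corner = ordinary_kriging(coords, values, coords[0])
```

By symmetry the four corners receive equal weight at the centre of the square, and at an observed location the predictor reproduces the observation with zero kriging variance, which is the exact-interpolation property of kriging without a nugget.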

2.7.3 GWR

GWR has its roots in a linear-regression framework. Standard regression assumes that observations are independent, which is clearly not true for spatial data, whose defining characteristic is that nearby observations are more similar than those far apart. Another assumption in regression is that the parameters of the model remain constant over the domain; in other words, there is no local change in the parameter values (Fotheringham et al., 2002). As an illustration, a simple example of GWR on a two-dimensional dataset is considered. To accommodate spatial correlation, GWR assumes a linear model in which the regression parameters change as a function of the coordinates. The parameters of the GWR model depend on a weight function $w(h_l, h)$, which is chosen so that data points $h_l$ near the prediction location $h$ have more influence than points far away. Common weight functions include the bi-square and Gaussian functions.

GWR is a popular spatial interpolation method. It is designed for spatial interpolation of a single dataset. There is no provision for incorporating multiple data sources, though such an extension might include additional equations for additional datasets in the model, in which case the parameters would have to be identical across datasets. The method also assumes that data are at point-level support. Little work has been done to address change of support in GWR, though studies that apply GWR to modifiable areal unit problems, a class of change-of-support problem where continuous spatial processes are aggregated into districts, found extreme variation in GWR regression parameters (Fotheringham and Wong, 1991).

To use GWR, the parameters must be estimated at a set of locations, typically the locations associated with the data themselves. The computational order of this process is usually $O(N^3)$, where $N$ is the number of data points. Hence GWR does not scale well with increases in data size (Grose et al., 2008). Modifications for large datasets include choosing a fixed number $p$ of locations, with $p \ll N$, at which the model parameters are evaluated. Another possible approach is to separate GWR into several non-interacting processes, which can be solved in parallel using grid computing methods (Grose et al., 2008).
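The local fitting step of GWR can be sketched as a weighted least squares problem solved separately at each target location. The example below is a minimal sketch that assumes a Gaussian weight function with a fixed, user-chosen bandwidth; the function name, the simulated data and the bandwidth value are hypothetical and are not taken from any of the cited implementations.

```python
import numpy as np


def gwr_coefficients(coords, X, y, target, bandwidth):
    """Local regression coefficients at one target location.

    Solves (X' W X) beta = X' W y, where W is diagonal with Gaussian weights
    that decay with the distance between each data point and the target.
    """
    coords = np.asarray(coords, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    dist = np.linalg.norm(coords - np.asarray(target, dtype=float), axis=1)
    w = np.exp(-0.5 * (dist / bandwidth) ** 2)
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)


# Hypothetical two-dimensional dataset with one covariate and an intercept.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 1.0, size=(60, 2))
x = rng.normal(size=60)
X = np.column_stack([np.ones(60), x])

# Case 1: constant coefficients, so every local fit recovers them exactly.
y_constant = X @ np.array([0.5, 2.0])
beta_constant = gwr_coefficients(coords, X, y_constant, [0.3, 0.7], bandwidth=0.3)

# Case 2: the slope grows from west to east, which a global fit would miss.
y_varying = 0.5 + (1.0 + 2.0 * coords[:, 0]) * x
beta_west = gwr_coefficients(coords, X, y_varying, [0.1, 0.5], bandwidth=0.2)
beta_east = gwr_coefficients(coords, X, y_varying, [0.9, 0.5], bandwidth=0.2)
```

Each call solves one small linear system, so evaluating the coefficients at only $p \ll N$ target locations, or distributing the calls across independent workers, follows directly from this structure.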

References (Part I)

Anselin, L. Spatial Econometrics: Method and Models. Boston, USA, Kluwer Academic Publishers.
Battese, G., Harter, R. & Fuller, W. An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data. Journal of the American Statistical Association 83.
Banerjee, S., Carlin, B.P. & Gelfand, A.E. Hierarchical Modelling and Analysis for Spatial Data. New York, Chapman & Hall.
Benedetti, R., Pratesi, M. & Salvati, N. Local stationarity in small-area estimation models. Statistical Methods and Applications 22(1).
Best, N., Richardson, S. & Thomson, A. A comparison of Bayesian spatial models for disease mapping. Statistical Methods in Medical Research 14(1).
Bocci, C. Geoadditive models for data with spatial information. PhD Thesis, Department of Statistics, University of Florence.
Bocci, C., Petrucci, A. & Rocco, E. Small-Area Methods for Agricultural Data: a Two-Part Geoadditive Model to Estimate the Agrarian Region Level Means of the Grapevines Production in Tuscany. Journal of the Indian Society of Agricultural Statistics 66(1).
Breidenbach, J. & Astrup, R. Small-area estimation of forest attributes in the Norwegian National Forest Inventory. European Journal of Forest Research 131.
Brunsdon, C., Fotheringham, A.S. & Charlton, M. Geographically weighted regression-modelling spatial non-stationarity. Journal of the Royal Statistical Society, Series D 47(3).
Burrough, P.A. Principles of Geographical Information Systems for Land Resources Assessment. Oxford, UK, Oxford University Press.
Cafarelli, B. & Castrignanò, A. The use of geoadditive models to estimate the spatial distribution of grain weight in an agronomic field: a comparison with kriging with external drift. Environmetrics 22.
Chambers, R.L. Outlier-robust finite population estimation. Journal of the American Statistical Association 81.
Chambers, R. & Dunstan, P. Estimating distribution function from survey data. Biometrika 73.
Chambers, R. & Tzavidis, N. M-quantile models for small area estimation. Biometrika 93.
Chambers, R., Chandra, H. & Tzavidis, N. On bias-robust mean squared error estimation for pseudo-linear small area estimators. Survey Methodology 37.
Chambers, R., Chandra, H., Salvati, N. & Tzavidis, N. Outlier-robust small-area estimation. Journal of the Royal Statistical Society, Series B 76(1).

Chandra, H. & Chambers, R.L. Comparing EBLUP and C-EBLUP for small-area estimation. Statistics in Transition 7.
Chandra, H., Salvati, N. & Chambers, R. Small-area estimation for spatially correlated populations: a comparison of direct and indirect model-based methods. Statistics in Transition 8.
Chandra, H., Salvati, N., Chambers, R. & Tzavidis, N. Small-area estimation under spatial non-stationarity. Computational Statistics and Data Analysis 56.
Clark, I. & Harper, W.V. Practical Geostatistics. Alloa, Scotland, UK, Geostokos (Ecosse) Ltd.
Cochran, W.G. Sampling Techniques, 3rd edn. New York, Wiley.
Coelho, P.S. & Pereira, L.N. A spatial unit-level model for small-area estimation. RevStat Statistical Journal 9(2).
Comber, A., Proctor, C. & Anthony, S. The creation of a national agricultural land-use dataset: combining pycnophylactic interpolation with dasymetric mapping techniques. Transactions in GIS 12(6).
Cressie, N. The origins of kriging. Mathematical Geology 22(3).
Cressie, N. Statistics for Spatial Data. New York, Wiley.
Datta, G.S. & Ghosh, M. Bayesian prediction in linear models: Applications to small-area estimation. The Annals of Statistics 19.
Datta, G. & Ghosh, M. Small-area shrinkage estimation. Statistical Science 27.
De Belém Costa Freitas Martins, M., de Sousa Xavier, A.M. & de Sousa Fragoso, R.M. Redistributing agricultural data by a dasymetric mapping methodology. Agricultural and Resource Economics Review 41(3).
Demidenko, E. Mixed Models: Theory and Applications. New York, Wiley.
Deutsch, C.V. & Journel, A.G. Geostatistical Software Library and User's Guide. New York, Oxford University Press. 340 pp.
Do, V.H., Thomas-Agnan, C. & Vanhems, A. Spatial Reallocation of Areal Data: a Review. Toulouse, France, Toulouse School of Economics.
Dowd, P.A. A Review of Geostatistical Techniques for Contouring. In: Earnshaw, R.A. (ed.), Fundamental Algorithms for Computer Graphics. NATO ASI Series, vol. F17. Berlin, Springer-Verlag.
Duchesne, P. Robust calibration estimators. Survey Methodology 25.
Eicher, C.L. & Brewer, C.A. Dasymetric mapping and areal interpolation, implementation and evaluation. Cartography and Geographic Information Science 28(2).

Estevao, V.M. & Särndal, C.E. Borrowing strength is not the best technique within a wide class of design-consistent domain estimators. Journal of Official Statistics 20.
Fabrizi, E., Salvati, N., Tzavidis, N. & Pratesi, M. Outlier-robust model-assisted small-area estimation. Biometrical Journal 56; doi: /bimj
Fay, R.E. & Herriot, R.A. Estimates of income for small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association 74.
FAO. A System of Integrated Agricultural Censuses and Surveys. Statistical Development Series, no. 11. Rome.
FAO. World Census of Agriculture. Statistical Development Series, no. 12. Rome.
FAO. World Census of Agriculture : Methodological Review. Statistical Development Series, no. 14. Rome.
Fasulo, A., D'Alò, M., Di Consiglio, L., Falorsi, S. & Solari, F. SMART2: a new web system for small-area estimation. Paper in Book of Abstract of ITACOSM2013.
Flowerdew, R. & Green, M. Statistical methods for inference between incompatible zonal systems. In: Goodchild, M.F. & Gopal, S. (eds.), The Accuracy of Spatial Databases. London, Taylor and Francis.
Fotheringham, A.S., Brunsdon, C. & Charlton, M. Geographically Weighted Regression. Bognor Regis, UK, John Wiley and Sons.
Fotheringham, A.S. & Wong, D.W.S. The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning 23(7).
Fuller, W.A. Regression estimation for small areas. In: Gilford, D.M., Nelson, G.L. & Ingram, L. (eds.), Rural America in Passage: Statistics for Policy. Washington DC, National Academy Press.
Fuller, W.A. Environmental surveys over time. Journal of Agricultural, Biological and Environmental Statistics 4.
Gallego, F.J. A population density grid of the European Union. Population and Environment 31.
Ghosh, M. & Rao, J.N.K. Small-area estimation: an appraisal (with discussion). Statistical Science 9(1).
Ghosh, M., Natarajan, K., Stroud, T.W.F. & Carlin, B.P. Generalized linear models for small-area estimation. Journal of the American Statistical Association 93.
Gilks, W.R., Richardson, S. & Spiegelhalter, D.J. Markov Chain Monte Carlo in Practice. Boca Raton, FL, USA, Chapman and Hall/CRC.
Giusti, C., Marchetti, S., Pratesi, M. & Salvati, N. Semi-parametric Fay-Herriot model using penalized splines. Journal of the Indian Society of Agricultural Statistics 66.

Giusti, C., Tzavidis, N., Pratesi, M. & Salvati, N. Resistance to Outliers of M-Quantile and Robust Random Effects Small-Area Models. Communication in Statistics: Simulation and Computation 43(3).
Gomez-Rubio, V., Best, N., Richardson, S., Li, G. & Clarke, P. Bayesian Statistics Small-Area Estimation. Technical Report. London, Imperial College.
Goovaerts, P. Combining areal and point data in geostatistical interpolation: applications to soil science and medical geography. Mathematical Geosciences 42.
Ghosh, M. & Meeden, G. Bayesian Methods for Finite Population Sampling. London, Chapman and Hall.
Gregory, I.N. & Paul, S.E. Breaking the boundaries: geographical approaches to integrating 200 years of the census. Journal of the Royal Statistical Society 168.
Grose, D.J., Harris, R., Brunsdon, C. & Kilham, D. Grid enabling geographically weighted regression. Available at:
Hansen, M.H., Hurwitz, W.N. & Madow, W.G. Sample Survey Methods and Theory. New York, Wiley.
Hardy, R.L. Theory and applications of the multiquadric-biharmonic method. Computers and Mathematics with Applications 19.
Hemyari, P. & Nofziger, D.L. Analytical solution for punctual kriging in one dimension. Soil Science Society of America Journal 51.
Henderson, C. Best linear unbiased estimation and prediction under a selection model. Biometrics 31.
Jiang, J. & Lahiri, P. Empirical best prediction for small-area inference with binary data. Annals of the Institute of Statistical Mathematics 53.
Jiang, J. Empirical best prediction for small-area inference based on generalized linear mixed models. Journal of Statistical Planning and Inference 111.
Jiang, J. & Lahiri, P. 2006a. Mixed model prediction and small-area estimation. TEST 15.
Jiang, J. & Lahiri, P. 2006b. Estimation of finite population domain means: a model-assisted empirical best prediction approach. Journal of the American Statistical Association 101.
Jiang, J., Nguyen, T. & Rao, J.S. Best predictive small-area estimation. Journal of the American Statistical Association 106.
Journel, A.G. & Huijbregts, C.J. Mining Geostatistics. London, Academic Press.
Kammann, E.E. & Wand, M.P. Geoadditive models. Journal of Applied Statistics 52.
Kaspar, T.C., Colvin, T.S., Jaynes, D.B., Karlen, D.L., James, D.E., Meek, D.W., Pulido, D. & Butler, H. Relationships between six years of corn yields and terrain attributes. Precision Agriculture 4.

Kim, H. & Yao, X. Pycnophylactic interpolation revisited: integration with the dasymetric mapping method. International Journal of Remote Sensing 31(21).
Kish, L. Survey Sampling. New York, Wiley.
Kott, P. 1989. Robust small domain estimation using random effects modelling. Survey Methodology 15.
Isaaks, E.H. & Srivastava, R.M. Applied Geostatistics. New York, Oxford University Press.
Lam, N. Spatial interpolation methods: a review. American Cartographer 10.
Langford, M. Refining methods for dasymetric mapping. In: Mesev, V. (ed.), Remotely Sensed Cities. London, Taylor and Francis.
Langford, M. Obtaining population estimations in non-census reporting zones: an evaluation of the three-class dasymetric method. Computers, Environment and Urban Systems 30.
Langford, M. & Harvey, J.T. The use of remotely sensed data for spatial disaggregation of published census population counts. IEEE/ISPRS Joint Workshop on Remote Sensing and Data Fusion over Urban Areas, La Sapienza University, Rome.
Langford, M., Maguire, D. & Unwin, D. The areal interpolation problem: estimating population using remote sensing in a GIS framework. In: Masser, E. & Blakemore, M. (eds.), Handling Geographic Information: Methodology and Potential Applications. London, Longman.
Langford, M. & Fisher, P.F. Modelling sensitivity to accuracy in classification imagery: a study of areal interpolation by dasymetric mapping. Professional Geographer 48(3).
Langford, M. & Unwin, D.J. Generating and mapping population density surfaces within a GIS. Cartographic Journal 31.
Larsen, T., Nagoda, D. & Anderson, J.R. The Barents Sea Ecoregion: a Biodiversity Assessment. Oslo, World Wildlife Fund.
Lehtonen, R. & Veijanen, A. Domain estimation with logistic generalized regression and related estimators. IASS Satellite Conference on Small-Area Estimation. Riga, Latvian Council of Science.
Lehtonen, R., Särndal, C.E. & Veijanen, A. The effect of model choice in estimation for domains, including small domains. Survey Methodology 29.
Lehtonen, R. & Pahkinen, E. Practical Methods for Design and Analysis of Complex Surveys. New York, Wiley.
Li, T., Pullar, D., Corcoran, J. & Stimson, R. A comparison of spatial disaggregation techniques as applied to population estimation for South East Queensland, Australia. Applied GIS 3(9).
Longford, N.T. Random Coefficient Models. London, Clarendon Press.

76 Marchetti, S., Tzavidis, N. & Pratesi, M Non-parametric bootstrap mean squared error estimation for M-quantile estimators of small-area averages, quantiles and poverty indicators. Computational Statistics and Data Analysis 56, Matheron, G.F Principles of geostatistics. Economic Geology 58, McCullogh, C.E Maximum likelihood variance components estimation for binary data. Journal of the American Statistical Association 89, McCullogh, C.E Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association 92, McCullogh, P. & Searle, S.R Generalized, Linear and Mixed Models. New York, Wiley. Mennis, J Generating surface models of population using dasymetric mapping. Professional Geographer 55(1): Mennis, J. & Hultgren, T Intelligent dasymetric mapping and its application to areal interpolation. Cartography and Geographic Information Science 33, Mohammed, J.I., Comber, A. & Brunsdon C Population estimation in small areas: combining dasymetric mapping with pycnophylactic interpolation. GIS Research UK Conference, Lancaster University. Molina, I., Salvati, N. & Pratesi, M Bootstrap for estimating the MSE of the Spatial EBLUP. Computational Statistics and Data Analysis 24, Molina, I. & Rao, J.N.K Small-area estimation of poverty indicators. Canadian Journal of Statistics 38, Nychka, D Spatial process estimates as smoothers. In: Schimek, M.G. (ed.), Smoothing and Regression: Approaches, Computation and Application, New York, Wiley, pp Opsomer, J.D., Claeskens, G., Ranalli, M.G., Kauermann, G. & Breidt, F. J Non-parametric small-area estimation using penalized spline regression. Journal of the Royal Statistical Society, Series B 70, Petrucci, A., Pratesi, M. & Salvati, N Geographic information in small-area estimation: small-area models and spatially correlated random area effects. Statistics in Transition 7, Petrucci, A. & Salvati, N Small-area estimation for spatial correlation in watershed erosion assessment. 
Journal of Agricultural, Biological and Environmental Statistics 11, Pfeffermann, D Small-area estimation: new developments and directions. International Statistical Review 70(1): Pfeffermann, D New important developments in small-area estimation. Statistical Science 28(1): Prasad, N. & Rao, J The estimation of mean squared error of small-area estimators. Journal of the American Statistical Association 85,

77 Prasad, N. & Rao, J On robust small-area estimation using a simple random-effects model. Survey Methodology 25, Pratesi, M. & Salvati, N Small-area estimation: the EBLUP estimator based on spatially correlated random area effects. Statistical Methods and Applications 17, Pratesi, M. & Salvati, N Small-area estimation in the presence of correlated random area effects. Journal of Official Statistics 25, Pratesi, M., Ranalli, M.G. & Salvati, N Semi-parametric M-quantile regression for estimating the proportion of acidic lakes in 8-digit HUCs of the north-eastern United States. Environmetrics 19, Rao, J.N.K., Kovar, J.G. & Mantel, H.J On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika 77, Rao, J.N.K Small-Area Estimation. New York, Wiley. Rao, J.N.K Small-area estimation with applications to agriculture. In: Benedetti, R., Bee, M., Espa, G. & Piersimoni, F. (eds.) Agricultural Survey Methods, London, John Wiley and Sons. Reibel, M. & Aditya, A Land use weighted areal interpolation. GIS Planet 2005 International Conference, Estoril, Portugal. Royall, R.M Current advances in sampling theory: implications for human observational studies. American Journal of Epidemiology 104, Ruppert, D., Wand, M.P. & Carroll, R Semiparametric Regression. Cambridge, UK and New York, Cambridge University Press. Sabin, M.A Contouring: the state of the art. In: Earnshaw, R.A. (ed.), Fundamental Algorithms for Computer Graphics, NATO ASI series, vol. F17. Heidelberg, Springer-Verlag. Saei, A. & Chambers, R Small-area estimation under linear and generalized linear mixed models with time and area effects. In: University of Southampton Statistical Sciences Research Institute, S3RI Methodology Working Papers, Southampton, UK, pp Saei, A. and Chambers, R. 2005a. Empirical best linear unbiased prediction for out-of-sample areas. Working paper M05/03. Southampton, University of Southampton Statistical Sciences Research Institute. Saei, A. 
& Chambers, R. 2005b. Out-of-sample estimation for small areas using area-level data. Working paper M05/011. Southampton, University of Southampton Statistical Sciences Research Institute. Salvati, N Small-area estimation by spatial models: the spatial empirical best linear unbiased predictor (Spatial EBLUP). Working paper no. 2004/04. Florence, Italy, University of Florence Department of Statistics. Salvati, N., Pratesi, M., Tzavidis, N & Chambers, R Spatial M-quantile models for small-area estimation. Statistics in Transition vol. 10(2):

78 Salvati, N., Tzavidis, N., Pratesi, M. & Chambers, R Small-area estimation via M-quantile geographically weighted regression. TEST 21, Särndal, C.E Design-consistent versus model-dependent estimation for small domains. Journal of the American Statistical Association 79, Särndal, C.E., Swensson, B. & Wretman, J Model-Assisted Survey Sampling. New York, Springer Verlag. Searle, S.R., Casella, G. & McCullogh, P Variance Components. New York, Wiley. Shu, Y. & Lam, N.S.N Spatial disaggregation of carbon dioxide emissions from road traffic based on multiple linear regression model. Atmospheric Environment 45, Shu, Y., Lam N.S.N. & Reams, M A new method for estimating carbon dioxide emissions from transportation at fine spatial scales. Environmental Research Letters 5. Singh, B.B., Shukla, G.K. & Kundu, D Spatio-temporal models in small-area estimation. Survey Methodology 31, Schmid, T. & Münnich, R Spatial-robust small-area estimation. Statistical Papers. DOI: /s y. Singh, M.P., Gambino, J. & Mantel, H.J Issues and strategies for small-area data. Survey Methodology 20, Sinha, S.K. & Rao, J.N.K Robust small-area estimation. Canadian Journal of Statistics 37, Song, P. X., Fan, Y. & Kalbfleisch, J Maximization by parts in likelihood inference (with discussion). Journal of the American Statistical Association 100, Sud, U.C., Bhatia, V.K., Chandra, H. & Srivastava, A.K Crop yield estimation at district level by combining improvement of crop statistics scheme data and census data. Wye City Group on Rural Statistics and Agricultural Household Income, 4th meeting, Rio de Janeiro. Tassone, E.C., Miranda, M.L. & Gelfand, A.E Disaggregated spatial modelling for areal unit categorical data. Journal of the Royal Statistical Society 59(1): Tobler, W Smooth pycnophylactic interpolation for geographical regions. Journal of the American Statistical Association 74, Tzavidis, N., Marchetti, S. & Chambers, R Robust estimation of small-area means and quantiles. 
Australian and New Zealand Journal of Statistics 52, Ugarte, M.D., Goicoa, T., Militino, A.F. & Durban, M Spline smoothing in small area trend estimation and forecasting. Computational Statistics and Data Analysis 53, Valliant, R., Dorfman, A.H. & Royall, R.M Finite Population Sampling and Inference: a Prediction Approach. New York, Wiley. 78

79 You, L. & Wood, S An entropy approach to spatial disaggregation of agricultural production. Agricultural Systems 90, You, Y. & Rao, J.N.K A pseudo-empirical best linear unbiased prediction approach to small-area estimation using survey weights. Canadian Journal of Statistics 30, Yuan, Y., Smith, R.M. & Limp, W.F Remodelling census population with spatial information from Landsat TM imagery. Computers, Environment and Urban Systems 21, pp Wackernagel, H Multivariate Geostatistics: an Introduction with Applications. Berlin, Springer Verlag. Wang, J. & Fuller, W.A Mean squared error of small-area predictors constructed with estimated area variances. Journal of the American Statistical Association 92, Webster, R. & Oliver, M Geostatistics for Environmental Scientists. Chichester, UK, John Wiley and Sons. Welsh, A.H. & Ronchetti, E Bias-calibrated estimation from sample surveys containing outliers. Journal of the Royal Statistical Society B60, Wolter, K Introduction to Variance Estimation. New York, Springer-Verlag. Wright, J.K A method of mapping densities of population. Geographical Review 26, Wu, S., Qiu, X. & Wang L Population estimation methods in GIS and remote sensing: a review. GI Science and Remote Sensing 42(1):


Resilience of SAE Methods to Non-Standard Situations

Introduction

Part II discusses several open issues concerning the resilience of SAE methods in non-standard situations that may occur in agricultural surveys, particularly with regard to assessing the quality of small-area estimates and applying the methods used for official statistics. Their relevance in agro-environmental applications is also discussed.

Chapter 3: Sensitivity of SAE predictors to spatial model specifications
SAE estimators are model-based: they rely on the specification of an operational model that links the study variable to the auxiliary variables. The model can accurately represent the real spatial distribution of the study variable, or it can simply mimic it. Because the spatial distribution of crops and land use in small areas is likely to be non-stationary and to show specific levels of spatial correlation, it is important to assess the extent to which the model's goodness-of-fit affects the quality of small-area estimates.

Chapter 4: The impact of the modifiable areal unit
Point-based census or survey data may be aggregated into areas or regular cells such as enumeration districts, administrative areas or any other spatial partition; the areal units are hence modifiable. The problem is that, when the spatial or other relation between variables is analysed, the result can differ when the same relation is measured in areal units at different scales. This can give misleading results in the specification of SAE models, and affects the quality of SAE itself.

Chapter 5: The robustness of the predictors to departures from normality and the robustness of small-area estimators to outlier observations
Many traditional SAE models assume that the study variable has a normal distribution, an assumption that can rarely be accepted when the distribution of agro-environmental variables is studied.
Even when a normal distribution can be achieved by transforming the original data, the presence of outliers can compromise the efficiency of the estimates. In particular, when the data-production process can cause errors, as may happen in statistical agencies, the use of robust estimators is suggested to minimize bias in the estimates.

Chapter 6: The effect on target variables of the complexity of sample designs in a survey
A survey's design for sampling target variables can affect disaggregation methods because the sampling design may have an effect on the small-area estimators. Stratification, clustering and varying probabilities of inclusion can alter the properties of statistical models, which are in general developed on the assumption of simple random sampling from an entire population. When selection follows a more complex design, the effects on the estimates produced by the model must be assessed. The problem is discussed here for SAE: cases in which the design can have an impact on the estimators are examined, together with two alternative small-area estimators that account for the sample design. The impact of sample design is assessed using a design-based simulation based on real agricultural data.

Chapter 7: Missing data in spatial datasets
Complete knowledge of the spatial distribution of an auxiliary variable correlated with a study or target variable, and knowledge of the exact location of all the units, can be useful in SAE. The performance of an SAE predictor is likely to be impaired to the extent that such knowledge is affected by errors, missing geographical data and missing values in the study and auxiliary variables. This often occurs in practice, and it is therefore important to assess proposed ways of protecting the validity of SAE.

Chapter 8: The excess of zeros in survey data
This problem is relevant when the target variable is skewed and non-negative, and also characterized by a large number of zeros. This is likely to happen when the study variable is crop production. In survey data, zero values for crop production can be observed at many sampled farms because, for example, a crop is not cultivated over a wide area or because the land was not used to cultivate that crop in the survey period.

3. Sensitivity of SAE Predictors to Spatial Model Specifications

3.1 Introduction

Many target parameters in agricultural and rural statistics can be expressed in the form of means and percentages. In Europe, many agro-environmental indicators are expressed in percentages and combine different kinds of data with arable land, usually expressed as the utilized agricultural area, that is, the total area occupied by arable land, permanent grassland, permanent crops and kitchen gardens. This is the case in the LUCAS surveys, for example. [17]

Using survey data to estimate these quantities of interest for sub-populations (domains) is a common practice. There are, however, geographic domains for which direct estimates of adequate precision cannot be produced: these are known as small areas. Survey designs usually focus on achieving a particular degree of precision in estimates at a higher level of aggregation than the small area, so sample sizes for small areas are typically small.

To explain the setting and objectives of the experiment described in this chapter and to interpret its main findings, attention is drawn to three major issues identified in the literature review in Part I. Small-area estimates are obtained by fitting statistical models to survey data, and then applying them to auxiliary information available for the small-area population of interest (see Chapter 2). These data can be administrative and geographic, but they must always refer to the same domains or population units, which leads to the more general problem of the use of misaligned data and their spatial integration, as discussed in Chapter 4. Small-area estimates are new statistics that are not otherwise available from surveys or administrative data sources. Often, a number of potential models are considered that involve various combinations of the auxiliary variables (see Chapter 2).
Because the estimates are obtained by fitting a model to the data, the robustness of the results to departures from the assumed normal distribution of the population values becomes an important issue. Various quality diagnostics must be examined to determine which of the potential small-area predictors to use. Once the model is chosen, users must be given an assessment of its quality and of the quality of the small-area estimates produced from it. Of the various diagnostics used to assess the accuracy, validity and consistency of small-area estimates (Brown et al., 2001), the most common are: i) a bias test that compares the small-area predictions with the direct estimates, usually by comparing the absolute relative bias of the small-area estimates with direct estimates in a simulation study; and ii) the RRMSE test, which is analogous to the sampling errors calculated for survey estimates and measures the efficiency, in terms of accuracy, of the small-area estimates.

With these issues in mind, this chapter gives the results of a simulation in which the performance of different small-area predictors of the small-area mean is compared in alternative scenarios relevant to the production of agricultural and rural statistics at the small-area level. The objectives of the study are:
i. to provide evidence of the sensitivity of SAE predictors to the specification of a model to describe the spatial structure of the available data;
ii. to discuss the properties of SAE predictors at different levels of availability of survey and auxiliary data; and
iii. to start discussion of the robustness of SAE predictors to departures from the hypothesis of normal distribution of population values and their resilience to outliers (see Chapter 5).

The experiment involves two simulation studies. The first, a model-based experiment, analyses the sensitivity of the estimators to different specifications of the spatial structure of the data.
The sample remains fixed, and many realizations of the same spatial population model are simulated; the properties of the predictors are studied separately for each spatial model, and the results obtained in three spatial models are compared (see section 3.2). The second, a design-based experiment, is pseudo-real in that it is based on real data collected by the United States Environmental Protection Agency; because the population is real, the properties of the predictors are evaluated on the basis of replicates of the sampling design applied to the population (see section 3.3).

[17] Since 2006, EUROSTAT has carried out a survey every three years of the state and the dynamics of changes in land use and cover in the European Union: the LUCAS surveys, which are based on observations made and registered on the ground. The most recent, in 2012, covered all 27 European Union countries and made observations at 270,000 points.

In both experiments the performance of the SAE predictors is evaluated in terms of bias by the values of the average relative bias (AvRBias):

$$\mathrm{AvRBias}_i = \left(T^{-1}\sum_{t=1}^{T} m_{it}\right)^{-1}\left\{T^{-1}\sum_{t=1}^{T}\left(\hat{m}_{it}-m_{it}\right)\right\}$$

The relative bias is calculated for each small area $i$ and averaged over the $T$ replicates of the simulation study. In the expression, $m_{it}$ is the actual average for area $i$ at simulation $t$ and $\hat{m}_{it}$ is the estimated small-area average. The efficiency is evaluated by calculating the average RRMSE (AvRRMSE):

$$\mathrm{AvRRMSE}_i = \left(T^{-1}\sum_{t=1}^{T} m_{it}\right)^{-1}\left\{T^{-1}\sum_{t=1}^{T}\left(\hat{m}_{it}-m_{it}\right)^{2}\right\}^{1/2}$$

The results obtained for the traditional small-area predictors are compared with those of the so-called spatial predictors. We consider the following estimators: EBLUP; GREG, which on p. 136 in Rao (2003) is called modified GREG; MBDE (Chandra and Chambers, 2005); MQ (see Chapter 2); SEBLUP (Petrucci and Salvati, 2006); SMBDE (Chandra et al., 2007); GWEBLUP (Chandra et al., 2012); and two common predictors used in spatial interpolation, GWR (Fotheringham et al., 2002) and the predictor based on ordinary kriging interpolation (see Cressie, 1993). Here, for GWEBLUP and GWR, a Gaussian specification for the weighting function is used: $w(u_l, u) = \exp\{-d_{u_l,u}^{2}/(2b^{2})\}$, where $d_{u_l,u}$ denotes the Euclidean distance between $u_l$ and $u$ and $b$ is the bandwidth. As the distance between $u_l$ and $u$ increases, the spatial weight decreases exponentially.
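The Gaussian weighting scheme can be illustrated with a minimal sketch. It assumes the usual Gaussian kernel w = exp{-d^2 / (2 b^2)}; the function name `gaussian_weights`, the example coordinates and the bandwidth value are hypothetical and are not taken from the report.

```python
import numpy as np

def gaussian_weights(coords, target, bandwidth):
    """Gaussian kernel weights exp(-d^2 / (2 b^2)) for each row of coords.

    coords    : (n, 2) array of unit locations (e.g. longitude, latitude)
    target    : length-2 location at which the local model is fitted
    bandwidth : positive scalar b controlling how fast the weights decay
    """
    coords = np.asarray(coords, dtype=float)
    target = np.asarray(target, dtype=float)
    d = np.linalg.norm(coords - target, axis=1)  # Euclidean distances to target
    return np.exp(-0.5 * (d / bandwidth) ** 2)

# Hypothetical locations and bandwidth, for illustration only.
locations = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
weights = gaussian_weights(locations, target=[0.0, 0.0], bandwidth=1.0)
print(np.round(weights, 4))
```

A unit located exactly at the target point receives weight 1, and the weights shrink towards 0 as the distance grows; a larger bandwidth makes them decay more slowly and so gives a smoother local fit.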
The bandwidth $b$ is a measure of the rate at which the weighting function decays with increasing distance, and so determines the roughness of the fitted GWR function. Here the bandwidth is chosen by minimizing the cross-validation (CV) criterion proposed by Fotheringham et al. (2002), who also discuss other weighting functions and the computation of the bandwidth. The SEBLUP, SMBDE and GWEBLUP are built to integrate geo-referenced information and to model it at the small-area level (see Chapter 2). This report also considers standard MQ predictors, which are naturally robust to outliers and to skewed distributions (see Chapter 5).

3.2 Model-based simulation experiment

Three alternative spatial models are specified to generate the population values of the study variable, $y_{ij}$, for unit $j$ of area $i$. A sample is drawn from each realization of the population. The number of small areas is fixed at A = 20 in each replication, following a simulation setting by Chandra et al. (2012). The level of availability of survey and auxiliary variables is high. In other words, survey and auxiliary data are not misaligned, population values of the study and auxiliary variables are known for each unit $j$ and each area $i$, and the spatial coordinates of the sampled and non-sampled population units identify their location. In GIS, the spatial coordinates of the centroids of the small areas are provided for sampled and non-sampled areas. The latter, also called out-of-sample areas, are those where no sample observations are made and, in practice, where direct estimators cannot be computed. In field applications the study variable is observed in sampled units, but only auxiliary information is available for out-of-sample units; the coordinates of the locations of out-of-sample units may be unknown, which is why the centroid of the small area they belong to is determined. [18]

1. Spatially stationary model
This model generates a spatially stationary set of population values. Crop yield, for example, is generated under a spatial random process whose properties do not vary by location. The population values of $y_{ij}$ and of the auxiliary variable $x_{ij}$ are generated according to a two-level model with random area effects and level-one errors, whose distributions correspond to a given intra-area correlation.

2. SAR stationary model
This model is used to generate the population corresponding to a nested error regression model, with random area effects for neighbouring areas distributed according to a simultaneously autoregressive (SAR) spatial correlation structure. In this case the distribution of $y_i$ is not conditional: the marginal distributions for all $y_i$ are specified as a system of simultaneous equations. It is not unusual, for example, for the crop yield of a point to be linked with neighbouring values of $y$. An alternative is the conditional auto-regressive (CAR) spatial correlation structure, in which the distribution of $y_i$ conditional on all the other y-values is normal. In this simulation an SAR process is preferred as being likely for agricultural data (Pratesi and Salvati, 2005). The model is a nested error regression in which the random area effects follow an SAR process, where $W$ is a proximity matrix of order A, $I$ is the identity matrix of order A, and $\rho$ is the SAR coefficient, which is set at 0.75 (high spatial correlation). The element $w_{kl}$ of the contiguity matrix $W$ takes the value 1 if area $k$ shares an edge with area $l$, and 0 otherwise. The distributions of the auxiliary variable and of the random area effects are the same as in the spatially stationary model.

3. Spatially non-stationary model
The third model uses the same distributions for the auxiliary variable and for the random area effects as the first, but it also allows the intercept and the slope of the linear model for $y_{ij}$ to vary according to longitude and latitude. This leads to a spatially non-stationary set of population values. Crop yield, for example, is generated in this case under a spatial random process whose properties vary by location: the intercept and slope of the two-level model depend on the coordinates of each unit, with the location coordinates for each unit of the population generated independently (see Salvati et al., 2012).

The small-area population sizes were randomly drawn from a uniform distribution on [450, 500] and kept fixed over the simulations. A sample was selected from each simulated population, with small-area sample sizes proportional to the fixed small-area population sizes. These area-specific sample sizes were kept fixed in the simulations and the small areas were treated as strata, with the final sample selection carried out by random sampling in each small area. A total of T = 500 simulations were carried out.

The small-area estimators compared in the simulations are the EBLUP (Rao, 2003), GREG (Rao, 2003), SEBLUP (Petrucci and Salvati, 2006; Pratesi and Salvati, 2009), MBDE (Chandra and Chambers, 2005), SMBDE (Chandra et al., 2007), GWEBLUP (Chandra et al., 2012) and the MQ regression small-area estimator (Chambers and Tzavidis, 2006).
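The population-generation and sampling steps for the stationary and SAR scenarios can be sketched as follows. This is only an illustration under stated assumptions: the regression coefficients, the variances, the sampling fraction, the placement of the 20 areas on a 4 x 5 grid and the row-standardisation of the contiguity matrix are all hypothetical choices, not values from the report. Only A = 20, the SAR coefficient 0.75, the population-size range [450, 500] and the stratified design come from the text.

```python
import numpy as np

A = 20                        # number of small areas (from the text)
RHO = 0.75                    # SAR coefficient (from the text)
ROWS, COLS = 4, 5             # hypothetical grid layout of the areas
BETA0, BETA1 = 1.0, 2.0       # hypothetical regression coefficients
SIGMA_U, SIGMA_E = 0.5, 1.0   # hypothetical standard deviations
SAMPLING_FRACTION = 0.05      # hypothetical sampling fraction

rng = np.random.default_rng(12345)

def rook_contiguity(rows, cols):
    """Contiguity matrix (1 if two grid cells share an edge), row-standardised.

    Row-standardisation is an assumption made here so that (I - rho W) is
    safely invertible for |rho| < 1.
    """
    n = rows * cols
    W = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            k = r * cols + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    W[k, rr * cols + cc] = 1.0
    return W / W.sum(axis=1, keepdims=True)

def simulate_population(pop_sizes, spatial_effects):
    """Generate y = b0 + b1 x + v_i + e_ij for every unit of every area."""
    area = np.repeat(np.arange(A), pop_sizes)
    u = rng.normal(0.0, SIGMA_U, size=A)            # independent area effects
    if spatial_effects:
        W = rook_contiguity(ROWS, COLS)
        v = np.linalg.solve(np.eye(A) - RHO * W, u)  # v = (I - rho W)^{-1} u
    else:
        v = u
    x = rng.uniform(0.0, 1.0, size=area.size)       # hypothetical auxiliary variable
    e = rng.normal(0.0, SIGMA_E, size=area.size)    # level-one errors
    return area, x, BETA0 + BETA1 * x + v[area] + e

def stratified_sample(area, sample_sizes):
    """Simple random sampling without replacement within each area (stratum)."""
    picks = [rng.choice(np.flatnonzero(area == i), size=sample_sizes[i], replace=False)
             for i in range(A)]
    return np.concatenate(picks)

# Population and sample sizes are drawn once and then kept fixed over replicates.
pop_sizes = rng.integers(450, 501, size=A)
sample_sizes = np.maximum(1, np.round(SAMPLING_FRACTION * pop_sizes).astype(int))

area, x, y = simulate_population(pop_sizes, spatial_effects=True)
sample = stratified_sample(area, sample_sizes)
print(sample.size, round(float(y[sample].mean()), 3))
```

In a full model-based experiment, `simulate_population` and `stratified_sample` would be called T = 500 times with the same `pop_sizes` and `sample_sizes`, and each predictor would be computed on every replicate.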
Note that in these model-based simulations the NPEBLUP (Opsomer et al., 2008) and the small-area estimator based on the MQGWR (Salvati et al., 2012) are not evaluated because they perform well only if the spatial coordinates are available for sampled and out-of-sample units, which is not the case in this scenario, where the availability of the auxiliary information is restricted to the sampled areas. A drawback of MQGWR is that it is computationally intensive (see Salvati et al., 2012). The NPEBLUP can capture the spatial relationship through P-splines, which are useful when the functional form of the relationship between the variable of interest and the covariates is unspecified and the data are characterized by complex patterns of spatial dependence. It does not guarantee good performance when only the spatial coordinates of centroids are available, because in that case a non-parametric model cannot use a large number of knots: in the proposed simulation experiments with 20 small areas, the NPEBLUP could use 10 or 11 knots. In these simulations the interpolation methods GWR (Fotheringham et al., 2002) and kriging (Cressie, 1993) are also evaluated (see Chapter 2). Note that any method for taking spatial information into account must include some geographic covariates for each small area, by considering data on the spatial location such as the centroid coordinates and/or GIS-generated auxiliary geographical variables referring to the same area.

[18] An example of availability of geographical auxiliary data for out-of-sample data is the Baseline project of the Italian Ministry of Agricultural, Food and Forestry Policies (MIPAAF), which integrates data available from the AGRIT project with the POPOLUS spatial frame after it has been refreshed on the basis of CORINE land cover (see the LUCAS project).

Table 3.1.
Definitions of models and small-area predictors used in the simulation studies

Acronym     Predictor / model
EBLUP       Empirical best linear unbiased predictor
EBLUP-GC    EBLUP + geographical coordinates
GREG        Generalized regression estimator
GREG-GC     GREG + geographical coordinates
SEBLUP      Spatial EBLUP
MBDE        Model-based direct estimator
SMBDE       Spatial MBDE
GWEBLUP     Geographically weighted EBLUP
MQ          M-quantile model
MQ-GC       MQ + geographical coordinates
GWR         Geographically weighted regression
KRIG        Kriging

The covariates should be able to take spatial interaction into account when it results from the covariates themselves: for this reason we have evaluated the performance of small-area estimators based on EBLUP, GREG and MQ with longitude and latitude added as covariates. The models and the estimators considered in our empirical evaluations are summarized in Table 3.1. The performance of the different small-area estimators is evaluated by computing the average relative bias (AvRBias) and the AvRRMSE for each small area as follows:

$$\mathrm{AvRBias}_i = \left(T^{-1}\sum_{t=1}^{T} m_{it}\right)^{-1}\left\{T^{-1}\sum_{t=1}^{T}\left(\hat{m}_{it}-m_{it}\right)\right\}, \qquad \mathrm{AvRRMSE}_i = \left(T^{-1}\sum_{t=1}^{T} m_{it}\right)^{-1}\left\{T^{-1}\sum_{t=1}^{T}\left(\hat{m}_{it}-m_{it}\right)^{2}\right\}^{1/2}$$

and summarizing the results over the T = 500 realizations of the population model. Here $m_{it}$ denotes the actual average for area $i$ at simulation $t$, with $\hat{m}_{it}$ denoting the estimated small-area average.

Table 3.2. Summary of results from model-based simulations

            AvRRMSE% (ratio to EBLUP)
Predictor   Spatially stationary   SAR stationary   Spatially non-stationary
EBLUP       (1.00)                 (1.00)           (1.00)
EBLUP-GC    (1.02)                 (1.00)           (0.64)
GREG        (1.07)                 (1.05)           (1.01)
GREG-GC     (1.07)                 (1.05)           (0.79)
SEBLUP      (1.02)                 (0.98)           (0.97)
MBDE        (1.47)                 (1.44)           (1.81)
SMBDE       (1.47)                 (1.44)           (1.74)
GWEBLUP     (1.10)                 (1.07)           (0.56)
MQ          (1.15)                 (1.22)           (1.03)
MQ-GC       (1.14)                 (1.11)           (0.60)
KRIG        (1.47)                 (1.30)           (0.71)
GWR         (1.54)                 (1.33)           (0.61)

Note: Values are expressed as percentages. The values of the ratios of AvRRMSE to EBLUP are given in parentheses.

Table 3.2 shows the mean of the distribution of values of AvRBias and AvRRMSE over simulations for the spatially stationary, SAR stationary and spatially non-stationary population models. In the stationary case, all the estimators show small average relative bias (0.013 for MQ) and, as one would expect, the EBLUP has a lower RRMSE than the other estimators. Things change, however, in the spatially non-stationary case, where there is evidence of a substantial gain in efficiency, as measured by a lower RRMSE, when the GWEBLUP and GWR are compared with the other small-area predictors. The small-area estimators that take the spatial coordinates into account in the model increase in efficiency. The mean value of the AvRRMSE of the EBLUP is 2.639, and that of the EBLUP-GC is 1.688; the mean value of the AvRRMSE of the MQ-GC is 1.586, which is lower than the corresponding value for MQ. This happens for the GREG-GC estimator as well, whose ratio of AvRRMSE to that of the EBLUP is 0.79, against 1.01 for GREG.
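The two performance measures can be computed directly from the simulated true and estimated area means. The sketch below assumes the per-area definitions of relative bias and RRMSE (mean error, or root mean squared error, over the T replicates divided by the mean true value); the function names and the toy arrays are hypothetical.

```python
import numpy as np

def av_rbias(m_true, m_hat):
    """Relative bias per area: mean error over replicates / mean true value.

    m_true, m_hat : arrays of shape (T, A), replicates in rows, areas in columns.
    """
    m_true = np.asarray(m_true, dtype=float)
    m_hat = np.asarray(m_hat, dtype=float)
    return (m_hat - m_true).mean(axis=0) / m_true.mean(axis=0)

def av_rrmse(m_true, m_hat):
    """Relative root mean squared error per area, same array layout as above."""
    m_true = np.asarray(m_true, dtype=float)
    m_hat = np.asarray(m_hat, dtype=float)
    rmse = np.sqrt(((m_hat - m_true) ** 2).mean(axis=0))
    return rmse / m_true.mean(axis=0)

# Toy example with T = 2 replicates and A = 2 areas (made-up numbers).
m_true = np.array([[10.0, 20.0],
                   [10.0, 20.0]])
m_hat = np.array([[11.0, 18.0],
                  [9.0, 22.0]])

rb = av_rbias(m_true, m_hat)
rrmse = av_rrmse(m_true, m_hat)
print(100 * rb, 100 * rrmse)   # both measures expressed as percentages
```

In this toy case the errors cancel within each area, so the relative bias is zero for both areas while the RRMSE is 10 percent for both. Dividing each predictor's mean AvRRMSE by that of the EBLUP gives ratios of the kind reported in parentheses in Table 3.2.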
Under the SAR stationary population model, the mean AvRRMSE of the SEBLUP estimator is 0.98 times that of the EBLUP (Table 3.2): the SEBLUP is hence the most efficient estimator in this scenario. In the SAR stationary scenario, the models that use spatial information perform better than those that do not: the ratio of AvRRMSE to that of the EBLUP, for example, is 1.11 for MQ-GC and 1.22 for MQ. From the results of the simulation experiments it is evident that better estimates can be obtained by using the spatial information in both the fixed and the random parts of the models, or even by specifying models with spatially correlated random area effects. The evidence from the case studies is that SEBLUP with correlated random area effects following an SAR process performs better when the spatial correlation in the study variable is high. But the inclusion of covariates that capture the spatial effects may be useful when the process is spatially non-stationary. In view of the results obtained by MQ with spatial coordinates, fitting the MQGWR should further improve the efficiency of the small-area estimator (see Salvati et al., 2012).

3.3 Design-based simulation experiment

The aims were: i) to compare the performance of the different small-area predictors and interpolation methods of the mean in each small area; and ii) to evaluate the performance of the different predictors for estimating the mean for out-of-sample areas. The level of availability of spatial information for survey and auxiliary variables is higher than in the model-based simulation; the spatial coordinates of the population units are available for sampled and non-sampled areas.

The data are drawn from the Environmental Monitoring and Assessment Program (EMAP) of the Space Time Aquatic Resources Modelling and Analysis Program at Colorado State University in the United States. The data set has been studied intensively in SAE experiments with spatial data (see Opsomer et al., 2008 and Salvati et al., 2012). The survey data used in this design-based simulation come from the United States Environmental Protection Agency's northeast lakes survey (Larsen et al., 2001), and are the same as those used in some examples in Chapter 2. To recap, between 1991 and 1995 researchers from the Environmental Protection Agency conducted an environmental health study of the lakes in the north-eastern states using a sample of 334 lakes from the population of 21,026, which were grouped according to 8-digit HUCs, of which 64 contained fewer than 5 observations and 27 had no observations. The variable of interest was acid neutralizing capacity (ANC), an indicator of the acidification risk of water bodies. Because some lakes were visited several times during the study and because some of these were measured at more than one site, the total number of observed sites was 349, with 551 measurements.
The EMAP data set also contained the elevation and geographical coordinates of the centroid of each lake in the target area. For sampled locations, the exact spatial coordinates of the corresponding location are known, and for non-sampled locations the centroid of the lake is known. Detailed information on the spatial coordinates for non-sampled locations therefore exists, because the geography defined by the lakes is finer than the geography of interest defined by the HUCs.

The aims of the simulation were: i) to compare the performance of the different small-area predictors and interpolation methods for the mean of ANC in each HUC; and ii) to evaluate the performance of the different predictors for estimating the mean ANC for out-of-sample HUCs. To do this, a population of ANC values was created with spatial characteristics similar to those of the lakes sampled by EMAP, with the value of the estimated spatial correlation equal to 0.7. A total of 200 independent random samples were taken from each HUC sampled by EMAP, with sample sizes determined from $n_i$, the sample size of each HUC in the original EMAP dataset. No observations were taken from HUCs that had not been sampled by EMAP. This process resulted in a sample of 652 ANC values from 86 HUCs. For details on the generation of the population see Salvati et al. (2012).

In the simulations, the small-area predictors evaluated in Part I are compared; in this case the performance of the MQGWR and the NPEBLUP is also evaluated. The relative bias (RB) and the RRMSE of estimates of the mean value of ANC in each HUC were computed. Summaries of the across-area distributions of RB and RRMSE are set out in Tables 3.3 and 3.4 for sampled areas, and in Tables 3.5 and 3.6 for out-of-sample areas. GREG, GREG-GC, MBDE and SMBDE cannot be computed for out-of-sample areas because they require observed y values. For this reason the results for these estimators are not shown in Tables 3.5 and 3.6.
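The logic of a design-based experiment, in which the population stays fixed and only the sample changes between replicates, can be sketched as below. Everything except the number of replicates (200) is a hypothetical stand-in: the population is synthetic rather than the EMAP-based ANC population, the five areas and their sample sizes are invented, and the direct sample mean is used only as a placeholder for the small-area predictors compared in the report.

```python
import numpy as np

rng = np.random.default_rng(2015)

# Hypothetical fixed population: 5 areas with skewed values standing in for ANC.
pop_sizes = np.array([40, 60, 80, 50, 70])
n_areas = pop_sizes.size
area = np.repeat(np.arange(n_areas), pop_sizes)
y = rng.gamma(shape=2.0, scale=100.0, size=area.size) + 50.0 * area

sample_sizes = np.array([5, 5, 8, 5, 7])   # hypothetical, kept fixed over replicates
T = 200                                    # number of replicate samples (from the text)

# The true area means are fixed because the population does not change.
true_mean = np.array([y[area == i].mean() for i in range(n_areas)])

estimates = np.empty((T, n_areas))
for t in range(T):
    for i in range(n_areas):
        units = np.flatnonzero(area == i)
        s = rng.choice(units, size=sample_sizes[i], replace=False)
        estimates[t, i] = y[s].mean()      # placeholder: direct estimator of the area mean

# Relative bias and relative root mean squared error for each area.
rb = (estimates - true_mean).mean(axis=0) / true_mean
rrmse = np.sqrt(((estimates - true_mean) ** 2).mean(axis=0)) / true_mean
print(np.round(100 * rb, 2))
print(np.round(100 * rrmse, 2))
```

Replacing the placeholder line with a model-based predictor fitted on each replicate sample gives the across-area distributions of RB and RRMSE that tables such as 3.3 to 3.6 summarize. For out-of-sample areas no units are drawn, so only predictors that do not need observed y values in the area can be evaluated.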

89 In Tables 3.3 and 3.4, all small-area predictors based on variants of the MQGWR model have significantly lower RB than the EBLUP, SEBLUP and NPEBLUP; the MQGWR predictor performs best. With regard to performance in terms of RRMSE, the small-area predictors that account for the spatial structure of the data have on average smaller root mean squared errors; GREG-GC is an exception. The NPEBLUP, SEBLUP and the MQGWR predictor perform best. These results show that there is a substantial number of in-sample HUCs where the MQGWR predictor has lower RRMSE than the NPEBLUP and SEBLUP. The results also confirm that the MQGWR predictor is a good competitor of NPEBLUP and SEBLUP in sampled areas: in other words the MQGWR predictor is not expected to be uniformly better than the SEBLUP, but it is expected to be more efficient in some HUCs. The results for the GWR and kriging interpolation methods show smaller RRMSE than the small-area predictors. This could be because of the small about 15 percent intra-class correlation coefficient in the data. This value is the ratio ; it measures the presence of area effects in the data. When there is little heterogeneity across the areas, a synthetic predictor such as those based on GWR and kriging can perform better than the predictors that take area effects into account. Note that the GWR is a particular case of MQGWR with a high tuning constant at quantile 0.5; this is the expectile version of MQGWR. For out-of-sample areas, MQGWR-based small-area predictors have lower relative bias and lower root mean squared errors than the EBLUP, NPEBLUP and SEBLUP. It seems that the MQGWR model offers a straightforward approach for improving synthetic estimation for out-of-sample areas. The performance of the SEBLUP in this case may be surprising, but it should be borne in mind that there is evidence in this case of spatial non-stationary behaviour of the study variable. A synthetic SEBLUP was also used for out-of-sample areas. 
A more elaborate method for out-of-sample areas in the SAR model was proposed by Saei and Chambers (2005). Another result to consider is that the small-area methods that take into account the spatial information in the covariates (EBLUP-GC and MQ-GC) perform better in terms of RRMSE than the small-area predictors that do not use it (EBLUP and MQ). The interpolation methods also work well for out-of-sample areas: in particular, GWR shows low bias and RRMSE, with values close to those of MQGWR. It can be concluded from the design-based simulation experiment that if the intra-class correlation is small and spatial information is available and shows spatial non-stationarity, the interpolation methods can be used for estimation, and the MQGWR is the preferred predictor in the class of small-area estimators.
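As a minimal sketch of how performance measures of this kind can be computed, the following Python code evaluates the relative bias (RB%) and the relative root mean squared error (RRMSE%) of a predictor in each area over repeated samples, and then summarizes their across-area distribution with the same statistics used as column headings in Tables 3.3 to 3.6. The function names, the array layout (simulations by areas) and the toy numbers are assumptions made for illustration; the definitions used in the report are those of Section 3.2.

```python
import numpy as np

def relative_bias_pct(estimates, true_means):
    """RB% per area: average over simulations of (estimate - truth) / truth, times 100."""
    estimates = np.asarray(estimates, dtype=float)    # shape (T simulations, m areas)
    true_means = np.asarray(true_means, dtype=float)  # shape (m areas,)
    return 100.0 * np.mean((estimates - true_means) / true_means, axis=0)

def rrmse_pct(estimates, true_means):
    """RRMSE% per area: root of the simulated MSE divided by |truth|, times 100."""
    estimates = np.asarray(estimates, dtype=float)
    true_means = np.asarray(true_means, dtype=float)
    mse = np.mean((estimates - true_means) ** 2, axis=0)
    return 100.0 * np.sqrt(mse) / np.abs(true_means)

def across_area_summary(values):
    """Min, Q1, median, mean, Q3 and max of a per-area performance measure."""
    values = np.asarray(values, dtype=float)
    q = np.quantile(values, [0.0, 0.25, 0.5, 0.75, 1.0])
    return {
        "Min": float(q[0]), "Q1": float(q[1]), "Median": float(q[2]),
        "Mean": float(values.mean()), "Q3": float(q[3]), "Max": float(q[4]),
    }

# Toy illustration (hypothetical numbers): two areas, two simulated samples.
true_means = np.array([10.0, 20.0])
estimates = np.array([[11.0, 18.0],
                      [9.0, 22.0]])
rb = relative_bias_pct(estimates, true_means)
rrmse = rrmse_pct(estimates, true_means)
summary = across_area_summary(rrmse)
```

In the toy data the errors cancel within each area, so the relative bias is zero while the RRMSE is not; this is why both measures are reported side by side.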

Table 3.3. Design-based simulation results using the EMAP data for 86 sampled areas
Columns: Predictor; summary of across-area distribution (Min, Q1, Median, Mean, Q3, Max)
Predictors: EBLUP, EBLUP-GC, GREG, GREG-GC, SEBLUP, MBDE, SMBDE, GWEBLUP, MQ, MQ-GC, MQGWR, NPEBLUP, GWR, KRIG
Note: Results show across-area distribution of RB% over simulations.

Table 3.4. Design-based simulation results using the EMAP data for 86 sampled areas
Columns: Predictor; summary of across-area distribution (Min, Q1, Median, Mean, Q3, Max)
Predictors: EBLUP, EBLUP-GC, GREG, GREG-GC, SEBLUP, MBDE, SMBDE, GWEBLUP, MQ, MQ-GC, MQGWR, NPEBLUP, GWR, KRIG
Note: Results show across-area distribution of RRMSE% over simulations.

Table 3.5. Design-based simulation results using the EMAP data for 27 out-of-sample areas
Columns: Predictor; summary of across-area distribution (Min, Q1, Median, Mean, Q3, Max)
Predictors: EBLUP, EBLUP-GC, SEBLUP, GWEBLUP, MQ, MQ-GC, MQGWR, NPEBLUP, GWR, KRIG
Note: Results show across-area distribution of RB% over simulations.

Table 3.6. Design-based simulation results using the EMAP data for 27 out-of-sample areas
Columns: Predictor; summary of across-area distribution (Min, Q1, Median, Mean, Q3, Max)
Predictors: EBLUP, EBLUP-GC, SEBLUP, GWEBLUP, MQ, MQ-GC, MQGWR, NPEBLUP, GWR, KRIG
Note: Results show across-area distribution of RRMSE% over simulations.

3.4 Remarks and findings

The main findings for each objective of the study are summarized below.

i. With regard to the sensitivity of the SAE predictors to different spatial models, it can be concluded that: i) provided the spatial correlation is high (greater than 0.5), good results are obtained using spatial information in the fixed and the random parts of the models; ii) when the process is spatially non-stationary, the inclusion of covariates that capture the spatial effects can be useful because they improve the efficiency of the predictors; and iii) when spatial heterogeneity is relevant across the areas, the SAE approach performs better than synthetic estimates from the kriging and GWR interpolation methods.

92 ii. The properties of the SAE predictors change with different levels of availability of survey and auxiliary data, and the recommended estimators change. The design-based experiment shows that MQGWR competes with the other predictors and the interpolation methods, especially when estimates for out-of-sample areas are required and the coordinates of the population units are available. iii. When data are generated under the assumption of normality, the EBLUP family of predictors EBLUP, EBLUP- GC and GWEBLUP show a substantial gain in efficiency as measured by a lower RRMSE for the spatially stationary, SAR stationary and spatially non-stationary models. Of the interpolation methods, only GWR shows a competitive performance in terms of efficiency. The MQ approach does not gain in efficiency in normality scenarios; its resilience to departures from normality and to outliers is considered in Chapter 5. It should be noted with regard to finding (ii) above that the availability of auxiliary spatial information is a crucial issue in the application of SAE predictors. Auxiliary information can consist of geo-coded GIS data about the spatial distribution of these domains and units. Such information can, for example, be obtained from digital maps that cover the domains of interest and so enable the calculation of their centroids, borders, perimeters and areas and the distances between them. Alternatively, spatial coordinates are available for all sampled and non-sampled population units and out-of-sample units as in the design-based case study. These attributes are commonly available in statistical agencies, and they are helpful in the analysis of social-economic data relating to these domains because these often show spatial structure that is, they are correlated with the geography of the landscape. In this context it is useful to recall Tobler s (1970) first law of geography: Everything is related to everything else, but near things are more related than distant things. 
The law is also valid for small geographical areas: nearby areas are more likely to have values similar to those of the target parameter than widely separated areas. This suggests that appropriate use of geographical information and geographical modelling can help to produce accurate estimates for small-area parameters. And in fact the spatially-based estimators presented above make it possible to use all components of survey data, including geographical data. This is an advantage for environmental and agro-environmental studies in which geographical information is fundamental in understanding the spatial pattern of the phenomena being analysed (Petrucci et al., 2005) The spatial approach presented here has its limitations. The models and estimators presented are variable specific solutions: it is a matter of fact that geographical information relevant to one study variable cannot be relevant to another. Nevertheless, even if geographical information is not informative by itself, it must be accepted that the spatial conformation of a study area land use, elevation and percentages of hill, mountain and plain are likely to have a strong influence on many environmental and socio-economic phenomena and their distribution by small area of interest (Petrucci et al. 2005). Another source of sensitivity is the definition of the geographical units under analysis. The modifiable areal unit problem (MAUP) (Unwin, 1996) is a potential source of error that can affect spatial studies, which utilize aggregate data sources and SAE results. The MAUP occurs in spatial analysis of aggregated data in which the results differ when the same analysis is applied to the same data but different aggregation schemes are used. It takes two forms: the scale effect, and the zone effect. The scale effect gives different results when the same analysis is applied to the same data, but changes the scale of the aggregation units. 
Analysis using data aggregated by county, for example, will differ from analysis using data aggregated by census tract. This difference in results is often valid in that each analysis asks a different question because each evaluates the data from a different perspective or a different scale.
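The scale effect can be illustrated with a small simulation. The sketch below is not the simulation design of this report: it assumes a one-dimensional location, two variables that share a smooth spatial trend plus independent noise, and aggregation of contiguous points into groups of equal size. All names and parameter values are hypothetical. It shows that the same statistics (variance and correlation) computed on point data and on aggregated data give different answers.

```python
import numpy as np

def aggregate_contiguous(values, location, group_size):
    """Average values over groups of `group_size` points that are contiguous in space."""
    order = np.argsort(location)                      # sort points along the spatial axis
    ordered = np.asarray(values, dtype=float)[order]
    n_groups = ordered.size // group_size             # drop any incomplete last group
    return ordered[: n_groups * group_size].reshape(n_groups, group_size).mean(axis=1)

def simulate_scale_effect(n=10_000, group_size=100, seed=0):
    """Compare variance and correlation before and after spatial aggregation."""
    rng = np.random.default_rng(seed)
    location = rng.uniform(0.0, 1.0, size=n)          # assumed 1-D coordinate
    trend = np.sin(2.0 * np.pi * location)            # shared spatial signal
    x = trend + rng.normal(size=n)                    # independent unit-level noise
    y = trend + rng.normal(size=n)
    x_agg = aggregate_contiguous(x, location, group_size)
    y_agg = aggregate_contiguous(y, location, group_size)
    return {
        "var_points": float(np.var(x)),
        "var_aggregated": float(np.var(x_agg)),
        "corr_points": float(np.corrcoef(x, y)[0, 1]),
        "corr_aggregated": float(np.corrcoef(x_agg, y_agg)[0, 1]),
    }

result = simulate_scale_effect()
```

Averaging within groups removes most of the unit-level noise, so the aggregated series has a smaller variance and a stronger correlation than the point data, even though the underlying process is unchanged.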

The zone effect is observed when the scale of analysis is fixed, but the shape of the aggregation units is changed. Analysis using data aggregated into one-mile grid cells, for example, will differ from analysis using one-mile hexagonal cells. The zone effect is a problem because it is an analysis, at least in part, of the aggregation scheme rather than the data themselves. A simple strategy to deal with MAUP in SAE is to carry out analyses at several scales or in several zones. But this can conflict with budget and time issues, which often constrain the production of small-area statistics (see Chapter 4).

Sensitivity of SAE predictors

Spatial models
Use of spatial information is recommended in the fixed and random parts of the models when the spatial correlation is high (> 0.5).
Use of covariates that can capture the spatial effects is recommended when the process is spatially non-stationary.
Use of the SAE approach, rather than the kriging and GWR approaches, is recommended when spatial heterogeneity is relevant across the areas.

Availability of survey and auxiliary data
The MQGWR competes with the other predictors and the interpolation methods when estimates for out-of-sample areas are required and coordinates of the population units are available.

Normality assumption
When the assumption is satisfied, the EBLUPs show a substantial gain in efficiency for spatially stationary, SAR stationary and spatially non-stationary models.
Among the interpolation methods, only GWR has competitive performance in terms of efficiency.
The M-quantile approach does not gain in efficiency in normality scenarios.

94 4. The Modifiable Area Unit Problem 4.1 Introduction The availability of desktop computing power and GIS software has created interest in and a need to learn more about the MAUP, which has been discussed in spatial analysis literature since the 1930s (Unwin, 1996). The term is due to Openshaw and Taylor (1979) and it has long been recognized as a potentially troublesome feature of aggregated data. The MAUP is a source of statistical bias that can radically affect the results of statistical analysis. It affects results when point-based measures of spatial phenomena such as population density are aggregated into larger areas, in that the resulting summary values totals, rates and proportions are influenced by the choice of the area boundaries. Point-based census or survey data, for example, may be aggregated into census enumeration districts, postcode areas or any other spatial partition and hence the areal units are modifiable. The problem is particularly relevant in the production and analysis of agro-environmental data and in the analysis of socio-economic data in general. But it seems to be far from being solved, as indicated in the next two sections. This section provides evidence of the MAUP effect in the application of SAE predictors and interpolation methods. Knowledge of the spatial distribution of the localities of the sampled units in the small areas, and the corresponding point-based measures of the study variable y, are assumed. The auxiliary variables are also available for out-ofsample units, as specified in Chapter 3. To explain the rationale of the simulation experiment, the two forms of the MAUP the scale effect and the zone effect must be recalled: i. The scale effect give different results when the same analysis is applied to the same data but there are changes in the scale of the aggregation units. 
Analysis of average crop production carried out with data aggregated by county, for example, will differ from analysis using data aggregated by agrarian zone. ii. The zone effect is observed when the scale of analysis is fixed but the shape of the aggregation units is changed. Analysis of data aggregated into one-mile grid cells, for example, will give different results from analysis based on one-mile hexagonal cells. The zone effect is a problem because it is an analysis, at least in part, of the aggregation scheme rather than the data themselves. In current spatial analysis, in fact, the level of aggregation of point-based measures will often have been decided, and data gathered for particular areal units such as census tracts, enumeration areas, municipalities or other aggregated geographical zone of interest are used. When the values are averaged over the process of aggregation, variability in the dataset is lost as a result of the scale effect, and the results of the same statistical technique will tend to vary according to the level of spatial resolution (Openshaw and Taylor, 1979). This difference in results may be valid nonetheless in that each analysis asks a different question because each evaluates the data from a different perspective or different scale. In our opinion, the zone effect is a secondary problem because the level of aggregation of point-based measures and the shape (square, hexagon, triangle cells) of the aggregated unit is often imposed by the goals of the analysis in real life applications. The focus of this section is hence limited to the study of the scale effect on SAE predictors. As far as we know, no previous studies have been devoted to the topic. In their search for a method to overcome the MAUP, Tobler (1989) and Fotheringham (1989) looked for statistical methods whose results would be relatively robust to the definition of the spatial units for which data are recorded. 
Following their approach, evidence is given here from which to assess the robustness of SAE methods to different scales of aggregation of point-based measures in particular small areas or domains of interest. The rationale of the simulation is to determine the extent to which one can aggregate 94

95 individual values in small areas and still achieve an acceptably accurate estimate of the small-area parameter. It is recognized that point-based measures, even as the best option, are costly and that geographical and statistical aggregation can be an easier alternative. 4.2 An evaluation of the impact of the scale effect on SAE predictors and interpolation methods Sensitivity to the level of aggregation of point-based measures is taken into account when the small-area parameter the area mean is predicted by EBLUP (Rao, 2003), the generalized regression estimator (Rao, 2003), SEBLUP (Petrucci and Salvati, 2006; Pratesi and Salvati, 2009), MBDE (Chandra and Chambers, 2005), SMBDE (Chandra et al., 2007), the MQ regression small-area estimator (Chambers and Tzavidis, 2006) and interpolation methods GWR (Fotheringham et al., 2002) and ordinary kriging (Cressie, 1993). The simulation experiment is based on the second model presented in Section 3.2, where the population is generated by a nested-error regression model with random area effects for neighbouring areas distributed according to an simultaneously auto-regressive (SAR) spatial correlation structure with spatial auto-regressive coefficient sets equal to 0.75 high spatial correlation. It is based on about 10,000 points, each representing an individual unit located randomly in 20 small areas. The small areas are in the form of quadrats; their population sizes are randomly drawn from a uniform distribution of [450,500] and kept fixed over the simulations. The location coordinates for each unit of the population are independently generated from a uniform random variable. It is assumed that the only spatial information available is the spatial coordinates of the sampled units and the spatial coordinates of the centroids of the small areas to which they belong. To examine the scale effect, the points are aggregated into a mean 101 areal units or clusters in each small area. 
Spatial aggregation is carried out by aggregating a number of contiguous point spatial units into a single cluster unit whose boundaries are irregular as defined by a stopping rule of 100 individual units. The extension and shape of the cluster of 100 individuals depends on the random distribution of the point locations of the individual units. The small area sizes of the aggregated population vary between 89 and 108 clusters. A sample of size clusters is selected from each simulated population, with small-area sample sizes proportional to the fixed small-area population sizes, giving an average area sample size of clusters. These area-specific sample sizes are kept fixed in the simulations and the small areas are treated as strata, with the final sample selection carried out by random sampling in each small area. A total of T = 500 simulations is carried out. The models used in the simulation study are presented Section 3, Table 3.1. The performance of different small-area estimators is evaluated by computing for each small area the average relative bias (AvRBias) and the AvRRMSE, as in Section 3.2. Table 4.1 gives the results for the original simulation experiment in Section 3.2 and the results for the aggregated population. The results for the latter show that the MQ-type, EBLUP-type and SEBLUP estimators perform best in RRMSE. Kriging and GWR show less bias than the small-area predictors. 95

Table 4.1. Results from model-based simulations in 20 areas; SAR-stationary process
Columns: Predictor; Original population (AvRBias%, AvRRMSE%); Aggregated population (AvRBias%, AvRRMSE%)
Percentage increase of RRMSE from the original to the aggregated population, by predictor:
EBLUP (+18.7%)
EBLUP_GC (+18.2%)
GREG (+20.8%)
GREG_GC (+20.9%)
SEBLUP (+21.5%)
MBDE (+21.0%)
SMBDE (+21.1%)
MQ (+0.4%)
MQ_GC (+7.0%)
KRIG (-2.5%)
GWR (+15.5%)
Note: Values are expressed as percentages. The figure in parentheses is the percentage increase of RRMSE for each predictor from the original population to the aggregated population.

To evaluate the scale effect, the last column of Table 4.1 shows in parentheses the percentage increase of RRMSE for each predictor from the original population to the aggregated population. The SEBLUP predictor shows the largest increase in RRMSE. The reason could be the decrease in the value of the spatial auto-correlation parameter: variables, parameters and processes that are important at one scale or unit are frequently not important or predictive at another scale or unit. The MQ-type estimators have the lowest increase of RRMSE. Kriging has the best performance, with a reduction of 2.5 percent in RRMSE. The performance of MQ can be explained by the fact that the changes in geography do not affect the MQ coefficients at the area level. GWR also performs well, which suggests that locally varying models may be less influenced by MAUP issues than traditional linear regression and linear mixed models.

4.3 Remarks and findings

The MAUP has been studied for univariate statistics (mean, variance and Moran coefficient) and for bivariate and multivariate statistics, using datasets or simulation studies. Qi and Wu (1996) noted that the Moran coefficient, Geary ratio and Cliff-Ord statistic are scale-dependent: the spatial correlation values decline with scale, and are dependent on the zoning system used in the aggregation.
In the case of bivariate statistics, Gehlke and Biehl (1934) noted that the coefficient of correlation increases as regions are aggregated into smaller numbers of larger regions. Openshaw and Taylor (1979) discovered that they could obtain almost any value of the correlation between voting behaviour and age in Iowa merely by aggregating counties in different ways. Fotheringham and Wong (1991) presented the results of an analysis of the effects of aggregation on linear regression and logit models, and demonstrated that some relationships can be relatively stable to data aggregation while others appear to be highly sensitive. Many authors have tried to overcome the MAUP even though it has traditionally been written off as intractable. Steel and Holt (1996), Holt et al. (1996) and Tranmer and Steel (1998) propose a model structure that includes an extra set of grouping variables z that can be measured at the individual level and that are in some way related to the processes being measured at the aggregate level. The grouping variables are used to adjust the aggregate-level 96

97 variance covariance matrix for the model so that it approximates the unknown individual-level variance covariance matrix more closely. Following Tobler (1989) and Fotheringham (1989) in search of SAE methods that are relatively robust to the definition of the spatial units for which data are recorded, we have obtained evidence of the effect of changing scale in most SAE predictors of small-area means. The results were also compared with those obtained by kriging and GWR. Two main results stem from Table 4.1: i. The more the operational model underlying SAE is linked to a defined spatial structure of the data, the worse the performance when changing scale. This is what happened to SEBLUP, EBLUP, GREG and MBDE, which registered the best performance in the original population. Methods based on spatial auto-correlation are directly affected by the MAUP and by scale-dependent spatial correlation coefficients (Qi and Wu, 1996). ii. Methods that are naturally robust to outliers and not linked to distributional assumptions about the study variable as in MQ and MQ-GC models seem to perform better and to be more resilient to changing scales of analysis. Their performance, which is worse than the EBLUP-based methods used for the original population, does not become any worse at the new scale of analysis, probably because the changes in geography do not affect the MQ coefficients at the area level. The performance of kriging and GWR comes somewhere between these. The GWR estimator is worse than kriging, probably because the measure of spatial correlation in the definition of local regression parameters is scale-dependent. Sensitivity of SAE predictors to MAUP SAE models linked to a defined spatial structure SEBLUP, EBLUP, GREG and MBDE perform worst when the scale changes. SAE models based on spatial auto-correlation suffer because the spatial correlation coefficient is scaledependent. 
SAE models that are naturally robust to outliers and not linked to distributional assumptions about the study variable (MQ and MQ-GC) are more resistant to changes in the scale of analysis. GWR performs worse than kriging.

5. The Robustness of SAE Predictors

5.1 Introduction

This section proposes robust estimators for small areas. The presence of outliers in agricultural data is common and should be taken into account in the estimation process. Two approaches are presented here: the MQ and the robust mixed model. The literature contains numerous proposals for small-area estimators, particularly for means or totals. The most commonly used is the EBLUP estimator (see Chapter 2). If the LMM assumptions on which the EBLUP is based are respected, then it is the best available estimator in its class in terms of efficiency. But in many real applications the presence of outliers and the skewness of the data cause the EBLUP to become biased and inefficient. A significant body of literature establishes the effect that outliers can have on the parameter estimates of random-effects models (Huggins, 1993; Richardson and Welsh, 1995), so many different small-area estimators that take into account departures from normality assumptions in the LMM and the possible presence of outliers have been developed in recent years. In this section, robust estimators are presented that are alternatives to the EBLUP. The initial tools for robust SAE come from Chambers and Tzavidis (2006) and Tzavidis et al. (2010), who suggested the use of MQ linear models to obtain small-area estimates. Sinha and Rao (2009) studied the effect of outliers on the widely used EBLUP of the small-area mean. In practical applications, the use of robust estimators is suggested with a view to protecting estimates from bias induced by outliers in the data.

5.2 Small-area robust estimators

This section focuses on two different approaches to robust SAE with robustness against outliers. The first is based on the MQ linear model, the second on the robust random effect model.

5.2.1 MQ estimators

The MQ small-area estimator of the mean was described in Chapter 2.
This estimator is an M-type estimator, so outlier-robust estimation is automatically achieved. But it is possible to improve resistance to outliers by introducing a robust function in the MQ-CD estimator presented in Chapter 2, defining a Welsh-Ronchetti (1998) MQ estimator, MQ-WR: (5.1) where, as usual, the population units are identified by $j$ and the small areas by $i$; the data consist of values $y_{ij}$ of the outcome and values $x_{ij}$ of a vector of $p$ auxiliary variables, which includes the constant term as its first component; a sample $s$ is drawn and the area-specific samples $s_i$ of size $n_i \geq 0$ are available for each area/domain; the set $r_i$ contains the $N_i - n_i$ indices of the non-sampled units in small area $i$; and values of $y_{ij}$ are known only for sampled units, while for the $p$-vector of auxiliary variables it is assumed that unit-level data are accurately known from external sources. The difference from the MQ-CD estimator is in the third addend on the right-hand side of equation (5.1), where the scale term is a robust estimate of scale, such as the median absolute deviation of the residuals, and the function applied to the scaled residuals is the influence function associated with the MQ (see Tzavidis et al. 2010). Analytic and bootstrap MSE estimation for the MQ small-area estimators is described in Chambers and Tzavidis (2006), Chambers et al. (2011) and Tzavidis et al. (2010).
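For orientation, one form of the robust bias-corrected MQ predictor that is consistent with the description above is sketched below, following the notation commonly used in the M-quantile literature (for example Tzavidis et al., 2010). The symbols $\hat{\theta}_i$ (area-level MQ coefficient), $\hat{\beta}_\psi(\cdot)$ (MQ regression coefficients), $\omega_{ij}$ (robust scale estimate) and $\phi$ (influence function) are assumptions about notation introduced here for illustration, and the expression should be read as a sketch rather than as the report's own equation (5.1).

```latex
\hat{m}_i^{\mathrm{MQ\text{-}WR}}
  = N_i^{-1}\left\{
      \sum_{j \in s_i} y_{ij}
      + \sum_{j \in r_i} x_{ij}^{T}\hat{\beta}_\psi(\hat{\theta}_i)
      + \frac{N_i - n_i}{n_i}
        \sum_{j \in s_i} \omega_{ij}\,
        \phi\!\left(
          \frac{y_{ij} - x_{ij}^{T}\hat{\beta}_\psi(\hat{\theta}_i)}{\omega_{ij}}
        \right)
    \right\}
```

Under this form, taking $\phi$ to be the identity function makes the third addend equal to $(N_i - n_i)/n_i$ times the sum of the sample residuals, which is a bias correction of the Chambers-Dunstan type; bounding $\phi$ limits the contribution that any single large residual can make to that correction.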

5.2.2 Robust EBLUP

There is ample documentation to show that the generalized least-squares estimator of $\beta$ and the ML or REML estimators of the variance components are sensitive to outliers (Fellner, 1986; Huggins, 1993; Richardson and Welsh, 1995). This in turn can affect the small-area estimates. Sinha and Rao (2009) recognized this problem and proposed an estimator of the small-area mean using an outlier-robust version of the LMM that is in practice an extension of the EBLUP estimator of the mean (see Chapter 2). In particular, Sinha and Rao (2009) suggested obtaining fixed-effects and variance components by using the robust ML proposal II defined by Richardson and Welsh (1995): (5.2) where $r = U^{-1/2}(y - X\beta)$ is the vector of unit-level residuals, $U$ is a diagonal matrix with its elements equal to the diagonal elements of $V$, $K$ is a diagonal matrix with elements $E[\psi^2(r)]$, and $\psi$ is Huber's function $\psi(u) = \min\{c, \max(u, -c)\}$ with tuning constant $c$; $V$ is the covariance matrix of the LMM, and $\theta_l$ is the $l$-th component of $\theta = (\theta_1, \ldots, \theta_q)^T$, the vector of variance components of the covariance matrix $V$. For estimating outlier-robust random effects $v$, Sinha and Rao (2009) suggested the use of the equation of Fellner (1986): (5.3) where $G$ and $R$ are the covariance matrices of the random-area effect and of the unit-level effect in the LMM. The outlier-robust predictor REBLUP of the small-area mean is then obtained by substituting in the equation of the EBLUP mean estimator the robust estimates obtained through (5.2) and (5.3). Denoting by subscript $M$ the robust estimates of the fixed and random effects, the robust version of the EBLUP is then: (5.4) See Sinha and Rao (2009) for further details and for estimation of the MSE of the robust EBLUP estimator in equation (5.4).

5.3 Assessment of the robustness of the EBLUP, MQ and robust EBLUP

Various empirical SAE studies assess the influence of outliers on the EBLUP, REBLUP and MQ small-area estimators.
To compare the efficiency of the REBLUP with that of the EBLUP, Sinha and Rao (2009) generated different populations according to the LMM, introducing outliers in the area random error or the unit-level error or both. Their empirical studies of model-based and design-based simulations show the superiority of the REBLUP over the EBLUP in terms of RRMSE, or efficiency. Giusti et al. (2014) compared estimators in the small-area framework (the EBLUP, the REBLUP of equation 5.4, the MQ and the MQ-WR of equation 5.1, among others) using different model-based simulations. Their results show that the presence of outliers can significantly affect small-area estimates, suggesting that outlier-robust small-area methods should be used in real data applications.
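The mechanism by which the robust estimating equations of Section 5.2.2 limit the effect of outliers is Huber's function, which leaves small standardized residuals unchanged and truncates large ones. A minimal Python sketch follows. The function name and the default tuning constant of 1.345 are assumptions made for illustration (1.345 is a value commonly used with Huber's function), not values taken from this report.

```python
import numpy as np

def huber_psi(u, c=1.345):
    """Huber's function: psi(u) = u for |u| <= c, and sign(u) * c otherwise."""
    return np.clip(np.asarray(u, dtype=float), -c, c)

# Hypothetical standardized residuals, two of which are outlying.
residuals = np.array([-10.0, -1.0, 0.0, 0.5, 8.0])
bounded = huber_psi(residuals)

# The contribution of the two outliers is capped at +/- c, so they can no longer
# dominate sums of residuals that enter the estimating equations.
raw_sum = residuals.sum()
robust_sum = bounded.sum()
```

With these toy residuals the raw sum is pulled to -2.5 by the two extreme values, while the sum of the bounded residuals is -0.5, which is determined by the well-behaved observations alone.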

100 In the Giusti et al. (2014) comparison of the MQ and REBLUP approaches to estimating the small-area mean, the two performed similarly. In their model-based simulation experiment the REBLUP performed well in the estimation of small-area means, followed by the MQ-WR and MQ estimators. Their comparison of the precision and efficiency of the MSE estimator showed that the bootstrap of the bias-corrected MQ estimators performed best in the worst scenarios that is, with outliers affecting area-level and unit-level residuals. The bootstrap of the REBLUP performed quite well, but it is a more computer-intensive technique. Giusti et al. (2014) also carried out an application to real income data from the 2008 Italian survey of income and living conditions and 2001 census, in which income data was not collected. The estimates obtained with the REBLUP, MQ-CD and MQ-WR estimators turned out to be similar, and the computation of a goodness-of-fit diagnostic suggested that none of these estimators can be considered statistically different from direct estimates. Note that in this case good results were also obtained for the EBLUP, though preliminary diagnostics suggested that the hypotheses of normality did not hold for these data. 5.4 The RSEBLUP: robust SAE using geo-referenced information in the mixedmodel approach As stated previously, in economic, environmental and epidemiological applications spatially close observations tend to be more alike than observations made further apart, and it can be important to include the available geographic information in models used for SAE. Examples of small-area predictors that explicitly incorporate spatial information in the mixed-model approach to SAE are the SEBLUP (Petrucci and Salvati, 2006; Singh et al., 2005; Pratesi and Salvati, 2008) and the GWEBLUP (Chandra et al., 2012). In the MQ approach to SAE, Salvati et al. 
(2012) proposed an MQ-GWR model to define a bias-robust predictor of the small-area characteristic of interest that also accounts for spatial association in the data. Schmid and Munnich (2013) proposed the SREBLUP extension to cover spatial area effects in a simultaneous auto-regressive model. Like the MQ-GWR model, the SREBLUP integrates the concepts of bias-robust SAE and a unified framework to enhance spatial accuracy. As explained in the previous section, the estimator (5.4) is robust to model mis-specifications or outliers, but it does not consider spatial dependencies in the data. This can be significant in many applications. Following Salvati (2004), Petrucci et al. (2005) and Singh et al. (2005), Schmid (2011) and Schmid and Munnich (2013) proposed the introduction of a simultaneously auto-regressive (SAR) model in the REBLUP to obtain the SREBLUP. Denoting by the SAR parameter defining the strength of the spatial dependencies, and by W the proximity matrix between the areas, the vector v of spatially correlated random area effects is given by: (5.5) Given this, the model with spatial correlated random effects is: (5.6) where D is a matrix of known positive constants. The covariance matrix of the y is given by: (5.7) 100
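To make the role of the SAR parameter and the proximity matrix $W$ more concrete, the sketch below builds a row-standardized proximity matrix from area coordinates and the covariance matrix that a standard SAR specification induces on the area effects. It assumes the usual form $v = (I - \rho W)^{-1} u$ with independent $u$ of variance $\sigma_u^2$, so that $\mathrm{Cov}(v) = \sigma_u^2 [(I - \rho W)^T (I - \rho W)]^{-1}$. That form, the distance-threshold neighbourhood rule, the function names and all numerical values are assumptions made for illustration.

```python
import numpy as np

def proximity_matrix(coords, threshold):
    """Row-standardized W: areas are neighbours if their distance is below `threshold`."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    w = (dist < threshold).astype(float)
    np.fill_diagonal(w, 0.0)                       # an area is not its own neighbour
    row_sums = w.sum(axis=1, keepdims=True)
    # Areas without neighbours keep a row of zeros.
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)

def sar_covariance(w, rho, sigma2_u):
    """Covariance of v = (I - rho W)^{-1} u when Cov(u) = sigma2_u * I."""
    a = np.eye(w.shape[0]) - rho * w
    return sigma2_u * np.linalg.inv(a.T @ a)

# Hypothetical centroids: areas 0 and 1 are close, area 2 is isolated.
coords = [(0.0, 0.0), (0.1, 0.0), (0.5, 0.0)]
W = proximity_matrix(coords, threshold=0.15)
G_spatial = sar_covariance(W, rho=0.5, sigma2_u=1.0)
G_independent = sar_covariance(W, rho=0.0, sigma2_u=1.0)
```

With a zero SAR parameter the area effects are uncorrelated and the covariance reduces to $\sigma_u^2 I$; with a positive value, neighbouring areas 0 and 1 have positively correlated effects while the isolated area 2 remains uncorrelated with both.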

Following Sinha and Rao (2009), Schmid and Münnich (2013) suggested estimating the model parameters using outlier-robust ML equations solved with a Newton-Raphson algorithm. Having thus obtained the parameter estimates, estimates of the spatially robust random effects can be obtained using Fellner's equation (Fellner, 1986). The SREBLUP of the small-area means is then given by (5.8). Like Sinha and Rao (2009), Schmid and Münnich (2013) proposed a parametric bootstrap estimator for the MSE of the REBLUP.

Evaluating the Spatial REBLUP estimator using simulation studies

Schmid (2011) carried out simulations to investigate and compare the performance of the EBLUP, REBLUP and SREBLUP point and MSE estimators in the absence or presence of outlying observations and spatial dependence in the data. Different models were used to generate the data, simulating several scenarios to allow for violations of the normality assumptions of the EBLUP and to include spatial dependencies between areas. The population data were generated for m = 100 areas, with normally distributed covariate values. With regard to the spatial structure of the areas, the latitude and longitude of each area were generated independently from a uniform distribution and assigned to each unit inside the area. From this population, a simple random sample was selected in each area.

The values of the dependent variable y were generated from a model in which the area effects and the error terms follow contaminated distributions. Two parameters determine the amount of contamination in the data, with both set to zero being the situation without outliers; if $\rho = 0$, there are no spatial dependencies in the data and the standard EBLUP is obtained. Spatial dependence is introduced into the model through the G matrix, with the W matrix obtained with the nearest-neighbour approach.
A small area j is defined as a neighbour of small area i if the Euclidean distance between them is less than 0.15, a value chosen because it produces the realistic situation in which each small area has on average 5 neighbours. With regard to the presence of outlying observations, the simulation study considers 5 percent of contaminated errors, under four different scenarios with no spatial dependence ($\rho = 0$): (0,0) scenario, no contamination; (e,0) scenario, contamination only in the individual errors e; (0,u) scenario, contamination only in the area errors u; (e,u) scenario, contamination in both errors. To evaluate the effect of outlying observations at the area level more accurately, in scenarios (0,u) and (e,u) the areas with contamination in u are always the same: areas 96, 97, 98, 99 and 100. These four scenarios are also considered together with spatial correlation between the areas, that is with $\rho \neq 0$: $\{(0,0)_\rho, (e,0)_\rho, (0,u)_\rho, (e,u)_\rho\}$. The study simulated each scenario R = 1,000 times and in each run estimated the small-area means with the following estimators: GREG, standard EBLUP, SEBLUP, robust EBLUP, SREBLUP, naïve MQ, MQ-CD and

MQ-GWR. To evaluate and compare the performance of these estimators, the following statistics were computed for each area i: the RRMSE,

$$\mathrm{RRMSE}_i^{A} = \frac{\sqrt{R^{-1}\sum_{r=1}^{R}\left(\hat{\bar{y}}_{ir}^{A} - \bar{y}_i\right)^2}}{\bar{y}_i},$$

and the RB,

$$\mathrm{RB}_i^{A} = \frac{R^{-1}\sum_{r=1}^{R}\left(\hat{\bar{y}}_{ir}^{A} - \bar{y}_i\right)}{\bar{y}_i},$$

where A represents the estimator, r = 1, ..., R is the index over the replications, $\hat{\bar{y}}_{ir}^{A}$ is the estimate of the mean of y in area i and replication r using estimator A, and $\bar{y}_i$ is the true mean of y in area i. The RRMSE is a measure of the accuracy of the estimators; the RB is a measure of their bias.

Table 5.1. Mean values of the RRMSE (%) in the non-spatial scenarios with 5% symmetric outlier contamination
Columns: No outliers [0,0]; Area outliers [0,u]; Indiv. outliers [e,0]; Both outliers [e,u]; [0,u]; [e,u]
Rows: GREG, EBLUP, SEBLUP, REBLUP, SREBLUP, MQ, MQ-CD, MQGWR
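The two performance measures can be computed directly from the matrix of simulated estimates. The following is a minimal sketch, assuming the usual definitions (mean error and root mean squared error over replications, each divided by the true area mean and expressed as a percentage); the function names and array layout are illustrative.

```python
import numpy as np

def relative_bias(estimates, truth):
    """RB (%) per area.

    estimates: array of shape (R, m), one row per replication.
    truth:     array of shape (m,), true area means.
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return 100.0 * (estimates - truth).mean(axis=0) / truth

def rrmse(estimates, truth):
    """RRMSE (%) per area, same input layout as relative_bias."""
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    mse = ((estimates - truth) ** 2).mean(axis=0)
    return 100.0 * np.sqrt(mse) / truth
```

Averaging the returned vectors over areas gives summary values of the kind reported in the tables of this section.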

Table 5.2. Mean values of the RB (%) in the non-spatial scenarios with 5% symmetric outlier contamination
Columns: No outliers [0,0]; Area outliers [0,u]; Indiv. outliers [e,0]; Both outliers [e,u]; [0,u]; [e,u]
Rows: GREG, EBLUP, SEBLUP, REBLUP, SREBLUP, MQ, MQ-CD, MQGWR

Tables 5.1 and 5.2 show that when outlier contamination is present in non-spatial scenarios, the robust estimators outperform the traditional ones, as expected. The REBLUP and the SREBLUP perform better than the MQ estimators, especially the MQ-CD, which suffers from the individual outliers. This is true for both the RRMSE and the RB. Because the scenarios are non-spatial in this case, no particular gain is expected for the spatial estimators.

The positive effect of spatial modelling is evident in Tables 5.3 and 5.4, which show the results from the spatial scenarios. The SEBLUP outperforms the EBLUP in the settings with no contamination, and the SREBLUP outperforms the REBLUP. The MQ-GWR performs better than the MQ and MQ-CD. A comparison of the SREBLUP and the MQ-GWR shows that the latter has a slightly lower RB, whereas the former has a slightly lower RRMSE. In general there are no large differences between the results of the small-area models in the scenarios with symmetric contamination, because symmetric outliers tend to follow the assumptions of the underlying model more closely than non-symmetric outliers do.

Table 5.3. Mean values of the RRMSE (%) in the spatial scenarios with 5% symmetric outlier contamination
Columns: No outliers [0,0]; Area outliers [0,u]; Indiv. outliers [e,0]; Both outliers [e,u]; [0,u]; [e,u]
Rows: GREG, EBLUP, SEBLUP, REBLUP, SREBLUP, MQ, MQ-CD, MQGWR

Table 5.4. Mean values of the RB (%) in the spatial scenarios with 5% symmetric outlier contamination
Columns: No outliers [0,0]; Area outliers [0,u]; Indiv. outliers [e,0]; Both outliers [e,u]; [0,u]; [e,u]
Rows: GREG, EBLUP, SEBLUP, REBLUP, SREBLUP, MQ, MQ-CD, MQGWR

Schmid (2011) also considers four alternative non-symmetric simulation scenarios, in which the area-specific random effects $u_i$ and the random errors $e_{ij}$ are generated using a non-symmetric contamination model. Results for these simulations are presented in Tables 5.5 and 5.6 for the non-spatial settings, and in Tables 5.7 and 5.8 for the spatial settings.

Table 5.5 shows that all robust small-area estimators suffer from a high RRMSE except for the MQ-CD. This estimator captures the effect of the individual outliers $e_{ij}$ in the sample, which helps with the estimation of the area means in the population. Further research is needed to enhance understanding of the behaviour of the other robust estimators. The EBLUP performs moderately well in the scenarios with individual non-symmetric representative outliers in the data, but poorly when contamination occurs at the area level.

With regard to the RB in the non-spatial scenarios with non-symmetric outlier contamination, it is evident from Table 5.6 that the robust estimators REBLUP, SREBLUP, MQ and MQGWR suffer from a negative bias of approximately 1 percent, caused by the fact that these estimators treat individual outliers in the sample as unique in the population. The MQ-CD, on the other hand, corrects the bias in the corresponding settings. In the spatial scenarios of Tables 5.7 and 5.8 the results are much the same as in the non-spatial scenarios: the spatial effect in this setting does not improve the spatial estimators as it did in the case of symmetric outliers.

Table 5.5. Mean values of the RRMSE (%) in the non-spatial scenarios with 5% non-symmetric outlier contamination
Columns: No outliers [0,0]; Area outliers [0,u]; Indiv. outliers [e,0]; Both outliers [e,u]; [0,u]; [e,u]
Rows: GREG, EBLUP, SEBLUP, REBLUP, SREBLUP, MQ, MQ-CD, MQGWR

Table 5.6. Mean values of the RB (%) in the non-spatial scenarios with 5% non-symmetric outlier contamination
Columns: No outliers [0,0]; Area outliers [0,u]; Indiv. outliers [e,0]; Both outliers [e,u]; [0,u]; [e,u]
Rows: GREG, EBLUP, SEBLUP, REBLUP, SREBLUP, MQ, MQ-CD, MQGWR

Table 5.7. Mean values of the RRMSE (%) in the spatial scenarios with 5% non-symmetric outlier contamination
Columns: No outliers [0,0]; Area outliers [0,u]; Indiv. outliers [e,0]; Both outliers [e,u]; [0,u]; [e,u]
Rows: GREG, EBLUP, SEBLUP, REBLUP, SREBLUP, MQ, MQ-CD, MQGWR

Table 5.8. Mean values of the RB (%) in the spatial scenarios with 5% non-symmetric outlier contamination
Columns: No outliers [0,0]; Area outliers [0,u]; Indiv. outliers [e,0]; Both outliers [e,u]; [0,u]; [e,u]
Rows: GREG, EBLUP, SEBLUP, REBLUP, SREBLUP, MQ, MQ-CD, MQGWR

Remarks and findings

Analysis of the literature on SAE does not reveal a dominant robust estimator among the REBLUP, the MQ and the MQ-WR. What emerges is the inefficiency of the EBLUP when there are outliers in the data. In the light of the studies described in the literature, the practitioner is advised to use one of the suggested robust estimators, REBLUP or MQ-WR, if there is evidence or a suspicion of the presence of outliers in the data.

Sensitivity of small-area estimators to outliers
EBLUP is inefficient in the presence of outliers.
REBLUP and MQ-WR are to be preferred when there is evidence of outliers.
Neither REBLUP nor MQ-WR dominates the other.

6. The Complexity of Sample Design

6.1 Introduction

This section considers the problem of sample design in SAE. First, the effect of a design on the estimators is discussed by means of a design-based simulation based on real agricultural data; second, two alternative small-area estimators are suggested that take the sample design into account.

SAE techniques generally focus on model-based and model-assisted estimators. The most commonly used model-based small-area estimators do not make use of sample weights, and they are not design-consistent unless the sampling design is self-weighting within areas. Design consistency is nevertheless a desirable property for a model-based estimator, in that it guarantees that estimates make sense, at least for large domains, even if the model fails.

With regard to the effect of a particular sampling system on small-area estimators, there are two categories of design: ignorable and non-ignorable (Sugden and Smith, 1984; see Rubin, 1987 for a discussion of ignorability). In the field of small-area research, a design is considered ignorable if all the variables contributing to the calculation of the sampling weights are included in the model, and non-ignorable otherwise. Hence, as far as SAE methods are concerned, the design itself does not matter: the only issue is whether it is ignorable or non-ignorable.

The effect of non-ignorable sample designs on SAE is assessed in the following paragraphs, and alternative estimators are presented that take the sample design into account. In particular, the expansion (see 6.2.1), GREG, pseudo-EBLUP and weighted MQ small-area estimators are considered. The effect of ignorable and non-ignorable designs on these estimators is evaluated in a design-based simulation based on real data; see Fabrizi et al. (2013), You and Rao (2002) and Särndal (1982).

The first part of this section introduces some design-consistent small-area estimators. This is followed by a design-based simulation study to assess the effect of the designs on small-area estimators.
6.2 Design-consistent small-area estimators

Suppose that a population U of size N is partitioned into m subsets $U_i$ (domains of study, or areas) of size $N_i$, $i = 1, \ldots, m$. The population units are identified by j and the small areas by i. The population data consist of the values $y_{ij}$ of the variable of interest and the values $x_{ij}$ of a vector of p auxiliary variables that includes the constant term as its first component. Suppose that a sample s is drawn according to some possibly complex sampling design such that the inclusion probability of unit j in area i is $\pi_{ij}$, and that area-specific samples $s_i \subset U_i$ of size $n_i \geq 0$ are available for each area or domain. Note that it is possible to have non-sampled areas, for which $n_i = 0$ and $s_i$ is the empty set. The set $r_i \subset U_i$ contains the $N_i - n_i$ indices of the non-sampled units in small area i. Values of $y_{ij}$ and $x_{ij}$ are known only for the sampled units; for the p-vector of auxiliary variables it is assumed that the area-level totals $X_i$, or the corresponding means, are accurately known from external sources.

6.2.1 Expansion estimator

The expansion estimator, also known as the Horvitz and Thompson (HT) estimator (Horvitz and Thompson, 1952), is defined in (6.1).
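As an illustration of the expansion principle, the sketch below computes an HT-type estimate of an area mean by weighting each sampled value by the inverse of its inclusion probability. It assumes the form $N_i^{-1}\sum_{j \in s_i} y_{ij}/\pi_{ij}$ with $N_i$ known; the Hájek variant, which divides by the estimated area size instead, is included for comparison. Both function names are illustrative.

```python
import numpy as np

def expansion_mean(y, pi, N_i):
    """HT-type estimate of the area mean, assuming N_i is known."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return (y / pi).sum() / N_i

def hajek_mean(y, pi):
    """Ratio-type variant: divides by the estimated area size sum(1/pi)."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return (y / pi).sum() / (1.0 / pi).sum()
```

With equal inclusion probabilities $n_i/N_i$ both functions return the ordinary sample mean of the area.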

A popular variance estimator for (6.1) is given in (6.2); it depends on the second-order inclusion probabilities, that is, the probability that both units j and k of area i are included in the sample $s_i$. Many alternative estimators of the variance are available (see Särndal et al., 2003).

6.2.2 Modified GREG estimators

The GREG estimators have the structure given in (6.3), which defines the estimate of the mean in area i. The class of estimators in (6.3) changes according to the model used to fit the target variable. The most popular choice is the linear regression model (6.4) (see Rao, 2003, section 2.5). This is the linear version of the GREG. By generalizing the model on which the linear GREG is based, alternative estimators can be obtained, such as a GREG based on a random-intercept model, which has the advantage of taking between-area variability into account (Lehtonen and Veijanen, 1999). The regression parameters and the area effects $u_i$ of this model are estimated by generalized least squares and restricted maximum likelihood (see Lehtonen and Pahkinen, 2004, section 6.3). An MSE estimator is also suggested in Lehtonen and Pahkinen (2004).

6.2.3 The Pseudo-EBLUP

The pseudo-EBLUP is a design-consistent small-area estimator of the area mean proposed by You and Rao (2002). It is based on the random-intercept regression model, with the assumption that the sample design is ignorable given the auxiliary variables included in the model. The design-consistent pseudo-EBLUP estimator of the i-th area mean is given by (6.5) and (6.6), which combine the regression coefficient vector and the area-effect estimates from the fitting of a random-intercept model with the estimates of the variance components of that model obtained, for example, by the restricted maximum likelihood method. Prasad and Rao (1999) and You and Rao (2002) provided formulae for the model-based MSE associated with the pseudo-EBLUP estimators of the area mean.
A similar alternative design-consistent estimator was proposed by Jiang and Lahiri (2006).
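The model-assisted logic of the linear GREG can be sketched as follows: a synthetic regression prediction based on the known population mean of the auxiliary variables is corrected by the design-weighted mean of the sample residuals. The sketch assumes that the coefficient vector `beta` has already been estimated, that the weights are the inverse inclusion probabilities and that the residual correction is divided by the known area size; these choices and all names are assumptions for illustration rather than the exact form of (6.3).

```python
import numpy as np

def greg_area_mean(y_s, X_s, w_s, xbar_pop, N_i, beta):
    """Linear GREG-type estimate of one area mean.

    y_s, X_s, w_s: values, auxiliary variables and design weights
                   of the units sampled in the area.
    xbar_pop:      known population mean vector of the auxiliaries.
    N_i:           known area population size.
    beta:          regression coefficients fitted beforehand.
    """
    y_s = np.asarray(y_s, dtype=float)
    X_s = np.asarray(X_s, dtype=float)
    w_s = np.asarray(w_s, dtype=float)
    residuals = y_s - X_s @ beta
    synthetic = np.asarray(xbar_pop, dtype=float) @ beta
    return synthetic + (w_s * residuals).sum() / N_i
```

When the model fits the sampled units exactly the correction term vanishes and the estimate coincides with the synthetic prediction.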

6.2.4 Weighted MQ estimators

Using the M-quantile approach to small-area estimation, Fabrizi et al. (2013) proposed the design-consistent small-area estimator of the mean given in (6.7), in which the regression coefficient vector is estimated according to the MQ linear model accounting for the sample weights. In particular, for a quantile q,

$$\hat{\beta}_w(q) = \left(X^T W C(q) X\right)^{-1} X^T W C(q)\, y,$$

where X is the $n \times p$ design matrix of auxiliary variables, y is the n-vector of the sample y values, W is the diagonal sampling weight matrix of order n and C(q) is a diagonal matrix of order n defined by the weights obtained from the iterative re-weighted least squares algorithm used to fit the design-weighted M-quantile regression coefficient at q (see Fabrizi et al., 2013 for details).

The estimator in (6.7) was proved by Fabrizi et al. (2013) to be design-consistent under some assumptions. It offers several advantages over the GREG estimator, given that the use of an area-specific coefficient in MQ regression accounts for area characteristics that are not explained by the auxiliary variables, and the use of M-estimation offers outlier-robust estimation. An analytic MSE estimator was proposed by Fabrizi et al. (2013), but it underestimated the actual mean squared error of (6.7), particularly when the overall sample size is not moderate and the sampling variance of the $y_{ij}$ and $x_{ij}$ does not dominate the variance associated with the uncertainty in estimating the regression coefficients. Alternative estimators of the MSE of (6.7) based on the bootstrap were also proposed by Fabrizi et al. (2013).

6.3 Simulation study of the impact of ignorable and non-ignorable designs

The simulation study carried out to assess the effect of a design on small-area estimates is based on a real dataset from the Australian Agricultural and Grazing Industries Survey. A sample of 1,652 broad-acre farms in 29 regions is studied.
A population of N = 81,982 farms is generated by bootstrapping the original survey sample: that is, the 1,652 farms in the original sample are themselves sampled with replacement, using selection probabilities proportional to each farm's survey sample weight, where the sum of the survey sample weights is equal to 81,982 (Fabrizi et al., 2013). Because the interest is in the design-based properties of the estimators, this population is kept fixed and repeatedly sampled according to a sampling design. To assess the impact of ignorable and non-ignorable designs on the small-area estimates, the bias and MSE of the proposed estimators are compared with the bias and MSE of selected alternatives.

Description of the simulation experiment

The synthetic survey population consists of 15 variables for 81,982 farms. In this simulation farms define the lower level (level 1) and the 29 Australian regions define the small areas of interest (level 2). The size of the regions in terms of farms ranges from 79 to 10,930. The target variable is total cash costs (TCC; Y), that is, payments made by the business for materials and services and for permanent and casual labour, excluding owner-managers, partners and family labour; its distribution shows strong positive skewness. For each farm, the following auxiliary variables (X) are available: the total revenues received by the business during the financial year (TTR) and the total area of the farm in hectares (FarmArea). A group of six binary variables is also available for each farm, cross-classifying farms by climatic zone and size (SizeZone). The six levels of SizeZone are defined as: i) pastoral zone, and area of 50,000 ha or less; ii) pastoral zone, and area of more than 50,000 ha; iii) wheat/sheep zone, and area of 1,500 ha or less; iv) wheat/sheep zone, and area of more than 1,500 ha; v) high-rainfall zone, and area of 750 ha or less; and vi) high-rainfall zone, and area of more than 750 ha.
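The construction of the synthetic population described at the start of this section can be sketched in a few lines: sample records are drawn with replacement, with probabilities proportional to their survey weights, until the population size implied by the sum of the weights is reached. The function name and the use of NumPy's random generator are illustrative assumptions.

```python
import numpy as np

def bootstrap_population(sample_weights, rng):
    """Return indices of sample records forming a synthetic population.

    The population size is the rounded sum of the survey weights, and
    each record is drawn with probability proportional to its weight.
    """
    w = np.asarray(sample_weights, dtype=float)
    N = int(round(w.sum()))
    return rng.choice(len(w), size=N, replace=True, p=w / w.sum())

# Toy usage: three sample records representing 10 population units.
rng = np.random.default_rng(0)
population_index = bootstrap_population([1.0, 3.0, 6.0], rng)
```

The returned index vector can be used to replicate the rows of the sample data file; the resulting population is then kept fixed across the Monte Carlo replications.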
Three sets of auxiliary variables are taken into account to create three models with different values of $R^2$, calculated by fitting an ordinary linear-regression model: i) weak linear relationship between Y and $X_1$

= [SizeZone], characterized by $R^2$ = 0.16 (low scenario); ii) medium linear relationship between Y and $X_2$ = [SizeZone, FarmArea], characterized by $R^2$ = 0.40 (medium scenario); and iii) strong linear relationship between Y and $X_3$ = [SizeZone, TTR], with $R^2$ = 0.90 (high scenario). To check model diagnostics and the characteristics of the synthetically generated population, a two-level mixed model with area-specific random effects is fitted in the different scenarios, using the population data. In all cases, analysis of the residuals shows that the normality assumption fails; the lack of normality of the model residuals is probably caused by several outliers in the regions. This situation can penalize the GREG and pseudo-EBLUP estimators, but it represents a realistic agricultural scenario.

Samples are selected according to a fixed-size, unequal-probability, without-replacement sampling design, using the maximum entropy method (Tillé, 2006, chapter 5). The sample size is set at 578, corresponding approximately to a 0.7 percent sampling rate. Two alternative sets of inclusion probabilities are defined to be proportional to two size variables Z: i) livestock (beef, sheep and wool); and ii) a uniform variable on the interval (1, 20). In case i), π j = 0.2 z j was defined for all j in U to minimize the number of inclusion probabilities equal to 1 (Fabrizi et al., 2013). The design is non-ignorable when, conditionally on the covariates X, livestock is used as a size variable. The correlation between TCC and livestock was computed on the population conditionally on each set of covariates; when conditioning on $X_2$ it falls to 0.16. The design would become ignorable if livestock were included as a covariate in the model (Fabrizi et al., 2013). This option is not considered here because the aim of the simulation is to mimic situations where not all design variables are available to the analyst.
When the inclusion probabilities are generated proportional to the uniform variable, the design is ignorable given X. The first scenario is called the non-ignorable design; the second is called the ignorable design. The compared estimators are the weighted MQ (WMQ), the HT, the pseudo-EBLUP, the GREG-S and the GREG-LV. The Monte Carlo experiment consists of drawing R = 5,000 samples from this population and calculating small-area estimates of the mean of TCC. The performance of the small-area estimators is evaluated using the RB and the RRMSE of the estimates of the small-area means. The RB for small area i is computed as:

$$\mathrm{RB}_i = \frac{R^{-1}\sum_{r=1}^{R}\left(\hat{\bar{y}}_{ir} - \bar{y}_i\right)}{\bar{y}_i}, \qquad (6.8)$$

and the relative RMSE for area i is computed as:

$$\mathrm{RRMSE}_i = \frac{\sqrt{R^{-1}\sum_{r=1}^{R}\left(\hat{\bar{y}}_{ir} - \bar{y}_i\right)^2}}{\bar{y}_i}. \qquad (6.9)$$

In (6.8) and (6.9) the subscript r = 1, ..., R indexes the Monte Carlo simulations, i indexes the areas, $\hat{\bar{y}}_{ir}$ is the estimate for area i in simulation r and $\bar{y}_i$ represents the true value of the parameter, the mean in area i.

Simulation results

Table 6.1 shows the results for the mean RB and the RRMSE for the three model scenarios and for the two possible designs, ignorable and non-ignorable. Results are as usual averaged over areas and simulations.

The results in Table 6.1 show that the proposed design-consistent small-area estimators are nearly unbiased. In particular, the WMQ and the HT estimators show very little bias even when the model has a poor fit (the low scenario), whereas the other small-area estimators show some bias in both the ignorable and the non-ignorable cases. As expected, the HT estimator has a very large RRMSE compared with the WMQ and the pseudo-EBLUP, particularly when the model holds.

Table 6.1. Design-based simulation results; population generated using the AAGIS data
Panels: Average RB%; Average RRMSE%
Columns: Non-ignorable design (Low, Medium, High); Ignorable design (Low, Medium, High)
Rows: WMQ, Pseudo-EBLUP, GREG-S, GREG-LV, Expansion
Results show the RB% and the RRMSE% averaged over areas in the three model scenarios.

Results for estimators that are not design-consistent, such as the EBLUP (Rao, 2003) and the MQ (Chambers and Tzavidis, 2006), are not given here, but they show greater bias under the non-ignorable design, particularly when the model does not hold (the low scenario); as expected, they do show good results for the ignorable design. The WMQ estimator performs best in terms of RRMSE for the non-ignorable design; this is particularly evident for the medium and high scenarios. The GREG-S is less efficient than the pseudo-EBLUP and the WMQ because it does not allow for area-specific regression coefficients. Ignorable designs can be handled using EBLUP-based and MQ-based estimators that do not use survey weights.

6.4 Investigating the impact of sampling designs on data interpolation

This section investigates the effect of the sample design on interpolation methods.
A brief introduction to the effect of the design on data interpolation is followed by a simulation experiment to assess the impact of three sample designs (simple random, two-stage and stratified two-stage) on two interpolation models, the GWR and ordinary kriging.

A short introduction about the design effect on data interpolation

In the mathematical field of numerical analysis, interpolation is a method of constructing new data points within the range of a discrete set of known data points. In real applications, the known data points are

usually a sample of a finite or infinite population. The way in which the sample is drawn is known as the sample design. Sampling weights weigh sample data to correct for the disproportionality of the sample with respect to the target population of interest; they reflect unequal sample inclusion probabilities and compensate for differential non-response and frame under-coverage. They are routinely included in survey data files released to analysts. Sampling weights can be vital in two aspects of the modelling process: i) they can be used to test and protect against non-ignorable sampling designs that could cause selection bias; and ii) they can be used to protect against mis-specification of the model holding in the population. When the design is non-ignorable, the estimation process for the model parameters should take the sample weights into account.

When a model is chosen to interpolate a set of sampled points, a desirable property of the estimators of the model parameters is their design consistency. In classical statistical theory, consistency refers to the limiting behaviour of a sample statistic as the sample size is increased to infinity: hence defining the concept of consistency in finite population sampling requires that the population size is also allowed to increase. This raises the question, however, of a suitable formulation of the way in which the population and the sample increase such that their structure is preserved. A sample statistic $t_s(n)$ is said to be design-consistent for a descriptive population quantity $T(N)$ if

$$\operatorname*{plim}_{n \to \infty,\, N \to \infty} \left( t_s(n) - T(N) \right) = 0,$$

where plim stands for the limit in probability under the randomization distribution, n is the sample size and N is the population size (Pfeffermann, 1993). Why is design consistency a desirable property for an estimator? The answer is robustness.
If the model holds in the population and the estimation technique used yields a corresponding descriptive population quantity that is consistent under the model, then as the population size increases the corresponding descriptive population quantity will converge to the model parameter. The corresponding descriptive population quantity is defined as follows. Let Y be generated from a distribution indexed by a vector of unknown parameters $\theta$. Let $U(Y, \theta) = 0$ define a set of estimating equations for $\theta$ obtained by an estimation rule. The solution $T(Y)$ such that $U(Y, T(Y)) = 0$ is the corresponding descriptive population quantity for $\theta$ under the rule.

When the sample is selected by simple random sampling, the model holding for the sample data is the same as the model holding in the population before sampling. With the complex sampling designs often used in practice the two models can be very different, however, and failure to account for the sample selection process might bias the inference on the target parameters. Incorporating the sampling weights in the analysis is the preferred way of dealing with the effects of the design (Pfeffermann, 1993; Kish, 1990).

In some cases, the sample design becomes ignorable. The definition of ignorability and the conditions under which the design is ignorable are discussed in the literature, for example by Little (1982), Rubin (1976), Scott (1977) and Sugden and Smith (1984). The ignorability of the design refers to the information provided by the selection scheme beyond what is already provided by the design variables. The ignorability conditions (Rubin, 1976) are clearly satisfied in sampling schemes that depend only on the design variables. Because analysts often do not know all the design variables, Sugden and Smith (1984) explore the conditions under which a sampling scheme that depends only on the design variables is ignorable, given partial information on the design.
The ignorability of the sampling design depends on the design and the available design information, and also on the model and the parameters of interest. Hence if the regressor variables in a regression model include all the design variables, the sampling design is ignorable for estimating the regression model. If, however, the design variable values are known only for the units in the sample, the sampling design is non-ignorable for estimating the unconditional mean and variance of the dependent variable of the regression. This last case is not relevant in this section.

In a study of the effects of ignoring the sample selection process when fitting models to survey data, Skinner et al. (1989) conclude that failure to account for all the important design variables, or incorrectly specifying the conditional distribution of the survey variables given the design information, can have severe effects on the inference process. Frequently, however, the analyst has only limited knowledge of the actual sampling process: in such cases, the sampling weights come into play. Estimators of model parameters are modified so that they are design-consistent for the corresponding descriptive population quantity in the finite population from which the sample was drawn. Consider, for example, as a descriptive population quantity the parameter B of the simple linear model $y_i = x_i^T B + e_i$, $i = 1, \ldots, N$, where $y_i$ is the variable of interest for population unit i, $x_i$ is the vector of p auxiliary variables for unit i and $e_i$ is the error term that fulfils the standard assumptions of the linear model; N is the size of the population. The descriptive population quantity B is then:

$$B = \left( \sum_{i=1}^{N} x_i x_i^T \right)^{-1} \sum_{i=1}^{N} x_i y_i. \qquad (6.10)$$

Given that the design is non-ignorable, the ordinary least squares estimator of B is not design-consistent, so a different estimator of B is defined using the sample weights:

$$\hat{B}_w = \left( \sum_{i \in s} w_i x_i x_i^T \right)^{-1} \sum_{i \in s} w_i x_i y_i, \qquad (6.11)$$

that is, the solution of the equation $\sum_{i \in s} w_i x_i \left( y_i - x_i^T B \right) = 0$. Here, s is the set of sample units drawn from the population of interest following a complex design and $w_i$ is the sample weight of unit i. For more sophisticated models the principle is the same: when the design is non-ignorable the sample weights should be included in the estimation process.

A simulation experiment to assess the impact of the design effect on spatial interpolation

This section describes a simulation experiment to assess the impact of simple random sampling, two-stage cluster sampling and stratified two-stage cluster sampling designs on some spatial interpolation models such as the GWR and ordinary kriging.
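The design-weighted least-squares estimator (6.11) of Section 6.4.1 solves the weighted normal equations $\sum_{i \in s} w_i x_i (y_i - x_i^T B) = 0$. A minimal NumPy sketch follows; the function name is illustrative, and the weights are assumed to be the inverse first-order inclusion probabilities.

```python
import numpy as np

def weighted_ls(X, y, w):
    """Solve (X' W X) B = X' W y with W = diag(w)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    Xw = X * w[:, None]          # each row of X scaled by its weight
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)
```

With equal weights the function returns the ordinary least squares solution, which is why weighted and unweighted fits coincide under simple random sampling.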
The experiment consists of generating populations with different spatial structures and drawing samples using different designs. For each sample, model parameters are estimated and predictions are made for all population units using the GWR and ordinary kriging models. The effect of the design on the performance of the model is evaluated with the bias and the RMSE of the predicted values. The focus here is on ignorable sample designs. The experiment settings are inspired by the work of Crainiceanu et al. (2005), Marley and Wand (2010) and Bocci and Rocco (2011).

Three different populations are generated and kept fixed in the simulations. Each population is generated from a model that involves the auxiliary variable x, a function of the spatial location s and an individual error term, where s is a vector that represents the spatial location generated by a different spatial point process in each population and the spatial function is obtained as a bivariate normal mixture density. The bivariate normal mixture density is obtained as the weighted mean of three bivariate normal densities

, with weights 3/7 for $W_a$ and $W_b$ and 1/7 for $W_c$. The resulting density is shown in Figure 6.1.

Figure 6.1 Bivariate normal mixture density

Each population has N = 3,000 units located in the unit square O = [0,1] x [0,1]. This square is divided into four equal squares: $O_1$ = [0,0.5] x [0,0.5], $O_2$ = [0,0.5] x [0.5,1], $O_3$ = [0.5,1] x [0,0.5] and $O_4$ = [0.5,1] x [0.5,1]. The auxiliary variable (x) is common to the three populations, as is the individual error term. The difference between the three populations lies in the spatial point process used to generate the locations of the units and consequently the values of the target variable. The three generating processes are:
A: non-homogeneous Poisson process on O;
B: non-homogeneous Poisson process on $O_i$, i = 1,2,3,4; and
C: Matérn cluster process on $O_i$, i = 1,2,3,4.
The surface O is divided into 64 clusters and four strata so that it is possible to carry out simple random, two-stage cluster and stratified two-stage cluster sample designs. Figure 6.2 shows the three populations, the strata and the clusters.

Figure 6.2 Spatial distribution of the population units
A: non-homogeneous Poisson process on the O region. B: non-homogeneous Poisson process on the $O_i$ regions (i = 1, ..., 4). C: Matérn cluster process on the $O_i$ regions (i = 1, ..., 4). Red lines identify the strata and black lines identify the clusters. The $O_i$ regions are not drawn.

In each population, the strata contain 9, 15, 15 and 25 clusters. Table 6.1 shows the distribution of the units across the clusters in each generated population. Populations A and B are quite similar in terms of units in each cluster, whereas population C has a high concentration of units in the clusters, 25 percent of which have more than 93 units.

Table 6.1 Distribution over clusters of the populations A, B and C and number of void clusters
Columns: 1st quartile, Median, Mean, 3rd quartile, Void clusters
Rows: Population A, Population B, Population C

The Poisson process and the Matérn cluster process used to generate the unit locations are similar to those described in Bocci and Rocco (2011). For each population of N = 3,000 units three sample designs have been carried out:
1. Simple random sampling without replacement: 150 units are drawn from the target population.
2. Two-stage sampling: a simple random sample of 30 clusters is drawn from the 64 clusters, and a simple random sample of 5 units is drawn from each cluster; void clusters, which are present only in population C, are dropped from the sample and replaced randomly with another cluster until a non-void cluster is sampled.
3. Stratified two-stage sampling: each population is divided into 4 spatial strata; the strata are fixed for populations A, B and C. Then in each stratum $k_j$ (j = 1, ..., 4) clusters are sampled, and from each cluster $h_j$ (j = 1, ..., 4) units are selected. The selection of clusters and units is obtained with simple random sampling. Void clusters are treated as described in (2).
The clusters and units drawn from each population are:
A: k = (4, 7, 7, 20) and h = (5, 5, 5, 3)
B: k = (4, 7, 7, 12) and h = (5, 5, 5, 5)
C: k = (4, 7, 12, 15) and h = (10, 5, 1, 4)
The allocation of the sample in the clusters follows the proportion in Bocci and Rocco (2011). For each sample, GWR and ordinary kriging models are estimated, and the target variable y is predicted for all population units (N = 3,000). The GWR model parameters are estimated with and without the sample weights, which are, as usual, the multiplicative inverse of the first-order inclusion probabilities. The prediction is made considering the x values and the locations (s) of all population units as known. The structures of the GWR and ordinary kriging models are:

The Monte Carlo experiment has been carried out with L = 1,000 replications. The performances of the interpolation models Geographically Weighted Regression (GWR), Geographically Weighted Regression with sample Weights (GWR-W) and ordinary kriging are evaluated in terms of bias and RMSE, computed from the true value $y_i$ of unit i and the predicted value $\hat{y}_{il}$ for unit i in replication l under GWR, GWR-W or ordinary kriging. Tables 6.2, 6.3 and 6.4 show the results for each population and each sample design.

Table 6.2 Results of the Monte Carlo experiment for Population A. Columns: BIAS (GWR, GWR-W, KRIG) and RMSE (GWR, GWR-W, KRIG). Rows: SRS (simple random sampling), TS (two-stage sampling), STS (stratified two-stage sampling).

Table 6.3 Results of the Monte Carlo experiment for Population B. Columns: BIAS (GWR, GWR-W, KRIG) and RMSE (GWR, GWR-W, KRIG). Rows: SRS, TS, STS.

Table 6.4 Results of the Monte Carlo experiment for Population C. Columns: BIAS (GWR, GWR-W, KRIG) and RMSE (GWR, GWR-W, KRIG). Rows: SRS, TS, STS.
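As an illustration of how such performance measures can be computed, the sketch below pools the prediction errors over all units and all replications. The exact aggregation used in the report is not reproduced here, so this pooled form is an assumption: bias is taken as the mean of (predicted minus true) and RMSE as the square root of the mean squared error. The function name and the toy numbers are hypothetical.

```python
import math


def bias_and_rmse(y_true, predictions):
    """Pooled Monte Carlo bias and RMSE.

    y_true: list of N true values.
    predictions: list of L replications, each a list of N predicted values.
    """
    n_rep = len(predictions)
    n_units = len(y_true)
    sum_err = 0.0
    sum_sq = 0.0
    for pred in predictions:
        for y, y_hat in zip(y_true, pred):
            err = y_hat - y          # prediction error for one unit in one replication
            sum_err += err
            sum_sq += err * err
    bias = sum_err / (n_rep * n_units)
    rmse = math.sqrt(sum_sq / (n_rep * n_units))
    return bias, rmse


# Toy example: 3 units, 2 replications (numbers are made up for illustration).
y_true = [1.0, 2.0, 3.0]
preds = [[1.5, 2.0, 2.5],
         [1.5, 2.0, 3.5]]
bias, rmse = bias_and_rmse(y_true, preds)
print(bias, rmse)
```

Pooling in this way gives a single bias and a single RMSE per population, design and method, which matches the layout of Tables 6.2 to 6.4; a per-unit RMSE averaged over units would be an equally plausible alternative.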

Results for populations A and B are similar, showing the superior performance of ordinary kriging in terms of bias and RMSE. In terms of bias, kriging is not influenced by the design, whereas in terms of RMSE it is: Tables 6.2 and 6.3 show a higher RMSE for ordinary kriging in the two-stage sample design. The GWR and GWR-W show the same results under simple random sampling, as expected. In the two-stage and stratified two-stage designs, results for the GWR and GWR-W are similar, with the GWR slightly better. This result is not surprising, given that the designs are ignorable. In terms of bias, the GWR gains in the two-stage and stratified two-stage sample designs. The GWR-W behaves in the same way as the GWR for population A, whereas for population B its performance in terms of bias is similar in the three designs. For populations A and B, the RMSE of the GWR and GWR-W increases from simple random to stratified two-stage to two-stage sampling. The conclusion is that complex designs negatively affect this interpolation method in the cases analysed.

Population C, where units are more clustered, shows slightly different results. Ordinary kriging is superior to the GWR and GWR-W in terms of RMSE, but in terms of bias the results are similar in simple random and two-stage sampling, with a small advantage for the GWR-W in stratified two-stage sampling. In this population the GWR-W shows less bias than the GWR and a higher RMSE in the two-stage and stratified two-stage sampling designs. Therefore, as expected, the weights have a positive effect on bias and a small negative effect on RMSE.

In conclusion, note that only ignorable sample designs are considered. Ordinary kriging is negatively influenced by complex sample designs, whereas the performance of the GWR and GWR-W depends on the design and on the spatial structure of the population.
Further investigation is needed into other spatial interpolation models and other population spatial structures; research on including sample weights in spatial interpolation models would be a valuable contribution.

6.5 Remarks and findings

Design-consistent small-area estimators improve upon the efficiency of the traditional estimators, the EBLUP and MQ, when the sample design is non-ignorable. It is strongly recommended that one of the estimators presented above be adopted to obtain small-area estimates in cases where a design is non-ignorable. A common problem in agricultural statistics is the presence of outliers. In such cases the use of the WMQ estimator, or of other robust estimators that are design-consistent, is recommended. Specific sampling designs do not significantly influence the behaviour of design-consistent small-area estimators; what does influence the estimates is large variation in survey weights (Gelman, 2007). Münnich and Burgard (2012) assess the effects of large variation in survey weights on some small-area estimators as a result of different sampling designs. Their suggestions agree with the analysis here: i) design-consistent estimators such as the pseudo-EBLUP must be used to reduce the negative effect on the stability of the estimates caused by variability in sample weights; and ii) robust small-area estimators such as the WMQ must be used when outliers are present.

Practitioners should note that a design is considered non-ignorable when no variables contributing to the calculation of sampling weights are included in the model used. In such cases the pseudo-EBLUP estimator should be used if there is no evidence of outliers; if outliers are present in the data, the WMQ estimator should be used. As stated at the beginning of the section, cases in which the design affects small-area estimates are identified, that is, cases in which the design is non-ignorable, and two design-consistent estimators are proposed.
The problem of outliers is addressed with the WMQ. With regard to interpolation, the literature shows that ignoring the sample selection process can lead to failure in the inference process; this occurs when design variables are not included in the model. In such cases sampling weights should be used. Estimators of model parameters are modified to be design-consistent for the corresponding descriptive population quantity in the finite population from which the sample has been drawn. The simulations show that ordinary kriging is negatively influenced by complex sample designs, whereas the performance of the GWR and GWR-W depends on the design and spatial structure of the population. Further investigation is needed, and indeed encouraged.

Effect of sample design on small-area estimators
When the design is non-ignorable, that is, when all the variables used to obtain sample weights are excluded from the model, design-consistent small-area estimators should be used.
The influence of the design on design-consistent small-area estimators is related more closely to the variability of sample weights than to the design scheme.
Pseudo-EBLUP and WMQ estimators are design-consistent small-area estimators.
WMQ should be preferred when outliers are present.

Effect of the sample design on spatial interpolation
When the design is non-ignorable, sample weights should be used in the process of estimating parameters.
Inferences about parameters can fail if sample weights are ignored.

7. Missing Data in Spatial Datasets

7.1 Introduction

This section addresses the issue of missing data in spatial datasets. An introduction to the problem highlights the main definitions and suggestions for handling cases of missing data in general datasets, and gives an overview of the concept of missing information from a measurement error perspective, with a focus on spatial datasets. Given the general relevance of the problem of missing values in spatial datasets, two particular problems are considered: missing data in geographical information, and missing data in study and auxiliary variables.

With regard to the treatment of missing spatial information in statistical models with spatial effects, the effect of missing point locations for out-of-sample units is considered when MQGWR or geoadditive models are used to estimate the parameter of interest in some geographical domains. Point locations are usually available for sampled units, whereas for the population units not included in the sample only the area they belong to is usually known. But if a geostatistical model is to be applied to these data, the missing locations must be filled in. The classic approach is to locate all the units belonging to the same area at the coordinates of the geographical centroid of their area; this solution is an approximation that can affect the final estimates. This chapter evaluates this effect in two simulation studies. In the first, which is based on the design-based simulation in Chapter 3, the effect on MQGWR small-area estimates is evaluated in terms of bias and variability when the exact location of each unit, known in this case for sampled and out-of-sample units, is replaced by the location of the centroid of its area. In the second, an imputation method recently proposed in the literature on geoadditive models as an improvement on classic centroid imputation is compared with the classical approach.
Then, with regard to the general issue of missing data in study variables and auxiliary variables in spatial datasets, some recommendations and case studies from the literature are presented, with a focus on crop yield data and SAR models. This is followed by consideration of a particularly relevant issue: the effect of informative unit non-response on small-area estimates. As noted by Giusti and Rocco (2010), a possible solution when values for the study variable are missing is to use a weighting approach with a weight function for the response probabilities. Because these probabilities are usually unknown, they need to be estimated. Some of the simulation results proposed by Giusti and Rocco (2010) are presented with a view to evaluating the effect on the small-area mean estimator of the study variable resulting from different missing-data mechanisms (homogeneous or non-homogeneous between areas) and different estimation techniques for the response probabilities (such as weighting within cells or the logit model).

7.2 Missing values in datasets: general concepts and solutions

Missing data are a pervasive problem in applied research. Typically, a researcher is interested in analysing the data in a rectangular dataset: a matrix where each row represents a unit (case, observation or subject) and each column represents a variable, which may for example be continuous or categorical, measured for each unit. Conventional statistical methods and software presume that all the values of this matrix are observed. Unfortunately, some values are often missing. If data are missing on all the variables for some cases, we have what is commonly known in sample surveys as unit non-response, as opposed to item non-response, in which some but not all the values of the variables are missing for a given unit.
It has been established that when values are missing in a given dataset, any method chosen to treat them, or even simply ignoring them, can have a significant effect on the results of the analysis of interest (Little and Rubin, 2002). This is true in any applied field, regardless of the analysis to be carried out on the dataset. Missing data can introduce bias into estimates derived from statistical models, for example (Schafer, 1997; Allison, 2002), and they can cause a loss of information and of statistical power (Little and Rubin, 2002). The effect of missing data on the methods applied and the subsequent results depends on the pattern of missing data and on the mechanism that led to missing data.

Figure 7.1 Representation of three multivariate missing-data patterns: (i) and (ii) monotone, (iii) general and (iv) matching datasets

In a multivariate setting the pattern of the data indicates which values are missing, that is, which of the situations in Figure 7.1 applies. Schemes (i) and (ii) in Figure 7.1 are monotone missing-data patterns: (i) has only one variable subject to missingness, (ii) has more variables. Scheme (iii) is a general pattern of missingness where the variables cannot be ordered to obtain a monotone pattern. Scheme (iv) is the typical pattern obtained after datasets have been matched, with some of the variables never jointly observed. In some cases the pattern of missingness can help in deciding how to treat the missing values.

With regard to the missing-data mechanism, it must be noted that no method for handling missing data can be expected to perform well unless there are some restrictions on how the data came to be missing. If we indicate with $Y$ the data in the dataset of interest, they can be partitioned into two parts, observed and missing: $Y = (Y_{obs}, Y_{mis})$. Then let $R$ be the response indicator, a binary variable recording whether $Y$ is missing or observed. The missing-data mechanism concerns the distribution of $R$ given $Y$: that is, it specifies a model for the response probabilities. A variable is said to be missing completely at random (MCAR) when:

$$P(R \mid Y_{obs}, Y_{mis}) = P(R) \qquad (7.1)$$

that is, when the probability that $Y$ is missing depends neither on the observed variables nor on the missing values.
The concept can be generalized to more than one variable with missing data, in which case data are said to be missing completely at random if the probability that any variable is missing cannot depend on any other variable in the model of interest, or on the potentially missing values themselves (Little and Rubin, 2002). For most datasets, the MCAR assumption is unlikely to be precisely satisfied because it requires the strong assumption that the missingness of $Y$ does not depend even on the observed variables included in the model. A weaker assumption is the missing-at-random (MAR) hypothesis:

$$P(R \mid Y_{obs}, Y_{mis}) = P(R \mid Y_{obs}) \qquad (7.2)$$

In this case the missingness of $Y$ may depend on the observed data but not on the values of $Y$ itself. As with MCAR, the extension to more than one variable with missing data requires care in stating the assumption (Rubin, 1976), but the basic idea is the same: the probability that a variable is missing may depend on anything that is observed, but it cannot depend on any of the unobserved values of the variables with missing data, even after adjusting for observed values. This means that the MAR hypothesis can be made more plausible by including as many observed variables as possible in the model to be estimated, because in this way the residual dependence of the missingness on $Y$ itself can be reduced or eliminated.

There is an additional technical definition linked to the MAR assumption. The missing-data mechanism, the process whereby missingness was generated, is said to be ignorable if the data are MAR and the parameters governing the missing-data mechanism are distinct from the parameters in the model to be estimated. This last condition is usually satisfied in real-world situations, so it is commonplace to use the terms MAR and ignorability interchangeably. As the name suggests, if the missing-data mechanism is ignorable then it is possible to obtain valid optimal estimates of parameters without directly modelling the missing-data mechanism.

A final possibility for the missing-data mechanism is the following: if the MAR assumption is violated, the data are said to be missing not at random (MNAR). In this case, the missingness of $Y$ depends on the missing values: a classic example is non-response to personal income questions in sample surveys. When the data are MNAR, the missing-data mechanism is not ignorable, and valid estimation requires that the missing-data mechanism be modelled as part of the estimation process.
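The practical difference between the three mechanisms can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original study: the data-generating model, the response probabilities, the function names and the convention that the indicator equals 1 when the value is observed are all hypothetical. The study variable y has true mean 0 and is correlated with a fully observed covariate x.

```python
import random
import statistics

random.seed(0)
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]          # fully observed covariate
y = [xi + random.gauss(0, 1) for xi in x]           # study variable, true mean 0


def response_indicator(y, x, mechanism):
    """Return r with r[i] = 1 if y[i] is observed (convention assumed here)."""
    r = []
    for yi, xi in zip(y, x):
        if mechanism == "MCAR":
            p_obs = 0.5                              # depends on nothing
        elif mechanism == "MAR":
            p_obs = 0.2 if xi > 0 else 0.8           # depends only on observed x
        else:                                        # "MNAR"
            p_obs = 0.2 if yi > 0 else 0.8           # depends on y itself
        r.append(1 if random.random() < p_obs else 0)
    return r


def complete_case_mean(y, r):
    """Mean of y over responding units only."""
    return statistics.fmean([yi for yi, ri in zip(y, r) if ri == 1])


def cell_adjusted_mean(y, x, r):
    """Weighting within cells defined by the sign of x."""
    total = 0.0
    for positive in (True, False):
        cell = [(yi, ri) for yi, xi, ri in zip(y, x, r) if (xi > 0) == positive]
        observed = [yi for yi, ri in cell if ri == 1]
        total += len(cell) / len(y) * statistics.fmean(observed)
    return total


results = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    r = response_indicator(y, x, mechanism)
    results[mechanism] = (complete_case_mean(y, r), cell_adjusted_mean(y, x, r))
    print(mechanism, results[mechanism])
```

In this setup the complete-case mean is close to the true value under MCAR but biased under MAR and MNAR, while the cell-adjusted mean removes the bias under MAR because missingness depends only on the cell. Under MNAR, adjusting for x does not fully remove the bias, which is why the mechanism itself must then be modelled.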
Because every MNAR situation is different, the model for the missing-data mechanism must be adapted to each situation, which is why standard statistical analysis software usually assumes the missing-data mechanism to be ignorable. The methods proposed for the treatment of data subject to missingness can be grouped into four partially overlapping categories (Little and Rubin, 2002):
i. Procedures based on completely recorded units. These include the available case analysis and the complete case analysis, which involve discarding the units with missing values from the analysis of interest. These methods can cause a substantial reduction in sample size, and they require the hypothesis of MCAR data. If this is not the case, estimates from the analysis model can be severely biased.
ii. Imputation-based procedures. In these methods the missing values are filled in with plausible values, and standard methods of analysis are applied to the completed dataset. Examples include hot-deck, mean and regression imputation, each of which has its advantages and disadvantages. Imputation is the solution commonly used for item non-response in sample surveys. A possible drawback of applying a model to a dataset completed using single imputation, that is, imputing one value for each missing value, is underestimation of the true variability of the estimates; for this reason, Rubin (1987) proposed multiple imputation, which involves imputing more than one value for each missing value.
iii. Weighting procedures. Randomization inference from sample surveys without non-response is usually based on design weights, which are inversely proportional to the probability of selection. Weighting procedures, which modify the design weights to adjust for non-response, represent the class of methods commonly used to treat unit non-response.
iv. Model-based procedures.
These are generated by defining a model for partially missing data and then basing inferences on the likelihood under that model, with parameters estimated by methods such as maximum likelihood. These procedures are usually flexible and satisfy the three desirable properties of a good missing-data method: minimizing bias, maximizing the use of the available information and yielding good estimates of uncertainty (Allison, 2002). Some methods in this category are intended for monotone patterns of missing data (see Figure 7.1). The EM algorithm, on the other hand, which is a general technique for finding maximum likelihood estimates for incomplete data, is applicable to more general missing-data patterns, but it can involve a noticeable increase in computing.

Multiple imputation

Multiple imputation (MI) has become more popular in recent years as a method for addressing the problem of missing data. Introduced by Rubin (1987) in the context of complex sample surveys, its main objective is to overcome the limits of single imputation, that is, treating imputed values as observed ones, with consequent underestimation of the variability resulting from the imputation step. MI has its drawbacks, of course: in official statistics, for example, consideration must be given to the additional burden for users deriving from the release of many completed datasets and from the need to use special formulas to obtain estimates of interest. From the methodological point of view, MI has another drawback in that, in practice, multiply imputed datasets from complex sample designs are typically imputed under simple random sampling assumptions and then analysed using methods that account for the design features. Methods for accounting for complex sample designs directly in the multiple imputation procedure were proposed by Zhou (2014).

With MI, m imputations are created for each missing value; the variability of the different imputations reflects the uncertainty arising from the original missing values. Let $Q$ be the quantity of interest. After the m imputations have been computed, each of the m completed datasets is analysed with traditional statistical techniques to obtain estimates $\hat{Q}_j$ and the corresponding estimated variances $U_j$, j = 1, ..., m. The MI estimate can then be computed as:

$$\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j$$

whereas the estimate of its total variance is:

$$T = \bar{U} + \left(1 + \frac{1}{m}\right)B, \qquad \bar{U} = \frac{1}{m}\sum_{j=1}^{m}U_j, \qquad B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{Q}_j - \bar{Q}\right)^2$$

Thus, the total MI variance is the mean of the m variances, $\bar{U}$, plus $(1 + 1/m)$ times the variance between the MI estimates, $B$. This last quantity measures the increase in variance arising from the missing values.
Thus the quantity:

$$\lambda = \frac{(1 + 1/m)\,B}{T}$$

where $B$ denotes the variance between the MI estimates and $T$ the total MI variance, is called the fraction of missing information, which measures the contribution of the missing values to the inferential uncertainty about $Q$ (Schafer, 1997). This quantity depends on the corresponding percentage of missing values, but it is usually lower because it is positively influenced by the information blended into the imputation model (Schenker et al., 2006). Rubin (1987) demonstrates the properties of MI from a Bayesian perspective and gives the conditions to obtain valid inferences through the randomization theory. In both theories, if the MI procedure has some basic desirable properties, the fraction of missing information can be used to evaluate the quantity $(1 + \lambda/m)^{-1}$, the relative efficiency of an estimate based on m multiple imputations with respect to one based on an infinite number of imputations. For example, if λ = 0.3, that is, 30 percent of missing information, the relative efficiency with m = 5 MIs is already 94 percent (Schafer and Olsen, 1998). Therefore even a small number of MIs can lead to efficient estimates.
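The combining rules described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the function names and the toy estimates are hypothetical, and the fraction of missing information is computed in its simple large-sample form, the between-imputation share of the total variance.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates and variances with Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)          # MI point estimate
    u_bar = statistics.fmean(variances)          # within-imputation variance
    b = statistics.variance(estimates)           # between-imputation variance (divisor m - 1)
    t = u_bar + (1 + 1 / m) * b                  # total variance
    lam = (1 + 1 / m) * b / t                    # fraction of missing information (simple form)
    return q_bar, t, lam


def relative_efficiency(lam, m):
    """Efficiency of m imputations relative to an infinite number."""
    return 1 / (1 + lam / m)


# Toy example with m = 5 completed datasets (numbers are made up).
q_bar, t, lam = pool_rubin([10.0, 11.0, 12.0, 9.0, 8.0], [1.0, 1.0, 1.0, 1.0, 1.0])
print(q_bar, t, lam)

# The case quoted in the text: lambda = 0.3 and m = 5.
print(round(relative_efficiency(0.3, 5), 3))
```

With λ = 0.3 and m = 5 the relative efficiency is 1/1.06, about 0.94, which reproduces the 94 percent figure quoted from Schafer and Olsen (1998).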

To achieve this efficiency in actual applications, the MI procedure should have some desirable characteristics, which can be summarized as follows (Giusti, 2009):
i. introduce into the imputation model all the covariates potentially influencing the missing mechanism to enhance the MAR hypothesis;
ii. include as covariates the variables related to the sampling scheme of the survey, which is particularly important for improper MIs (Rubin, 1996);
iii. consider as covariates the variables that are likely to be used in analysis of the imputed datasets, because incoherence between the imputation and the analysis model can lead to the uncongeniality problem (Meng, 1994, 2002; Fay, 1992; Rubin, 1996); and
iv. make the imputations under different models to study the sensitivity of the final results to the imputation model.
MI is a highly flexible tool because it can be used in different settings and models, and hence can be used in the special case of missing values in spatial datasets and surveys.

Missing values in spatial data as measurement error

A research area of geographical information science has recently been developed: i) to investigate the ways in which uncertainty in spatial data arises and is distributed through GIS operations; and ii) to assess the probable effects on subsequent decision-making (Heuvelink, 1998; Zhang and Goodchild, 2002). Leung et al. (2004) observe: With the ever increasing volume of geo-referenced data being generated, transferred and utilized, the amount of uncertainty embedded in spatial databases has become a major issue of crucial theoretical importance and practical consideration. Uncertainty as to attribute values and positions in spatial databases generally reflects the accuracy, statistical precision and bias in initial values or in estimated coefficients. Spatial uncertainty also includes the estimation of errors in the final output that results from the propagation of external and internal uncertainty.
It is therefore important to be able to track the occurrence and propagation of uncertainties (Goodchild, 1991). Research on accuracy is closely associated with the study of errors in GIS, and the literature on this subject is extensive: see Goodchild and Gopal (1989), Heuvelink (1998), Leung and Yan (1998), Mowrer and Congalton (2000), Stanislawski et al. (1996), Wolf and Ghilani (1997) and Zhang and Goodchild (2002). The error taxonomy of Veregin (1989) recognizes that different classes of spatial data exhibit different types of errors, and that errors may be introduced and propagated in various stages of data manipulation and spatial processing. Errors in spatial databases are generally divided into inherent errors and operational errors: inherent errors are those present in source documents and include errors in maps used as input to a GIS; operational errors occur throughout data manipulation and spatial modelling and are introduced during the processes of data entry or capture and manipulation functions of a GIS (Leung et al., 2004). From the modelling point of view, the errors can also be classified as systematic or random. The systematic component can usually be removed by modifying the model, but it is impossible to avoid random errors in measurements entirely (Wolf and Ghilani, 1997). Dealing with such measurement error is one of the most important problems in the use of geo-referenced data. To support the determination of error structures in GIS location coordinates, the concept of a measurement-based GIS was proposed by Goodchild (1999): a system that provides access to measurements used to determine the locations of objects, to the geographical procedures (transformation functions) that link measurements to quantities to be measured, and to the rules used to determine interpolated positions. The basic idea is to retain details of measurements so that error can be analysed. Leung et al. 
(2004) also propose a framework in which error propagation and the statistical approach to the analysis of measurement error can be formulated.

The measurement-error analysis approach is a geographical science approach involving some statistical tools and concepts, but it is mainly a technical approach. In statistics, the measurement-error problem concerns the effect on regression models of independent variables that are contaminated with errors or otherwise not measured accurately on all subjects. The literature establishes that disregarding measurement error in a predictor distorts its estimated relationship with the response variable and produces biased estimates of the regression coefficients in linear (Buonaccorsi, 1995; Fuller, 1987, Chapter 1) and non-linear models (Carroll et al., 2006, Chapter 3). Hence most of measurement-error analysis is concerned with correcting for such effects. Measurement-error models usually have two components: i) an underlying model for the response variable y in terms of some predictors, distinguishing between predictors measured without error, z, and predictors that cannot be observed exactly, x; and ii) a variable w that is related to the unobservable x. The parameters in the model relating y and (z,x) cannot be estimated directly because x is not observed. The aim of measurement-error modelling is to obtain nearly unbiased estimates of these parameters indirectly by fitting a model for y in terms of (z,w). In assessing measurement error, attention must be given to the type and nature of the error and to the sources of data that enable modelling of the error (Carroll et al., 2006). A fundamental prerequisite for analysis of a measurement-error problem is the specification of a model for the measurement-error process. The two general types are: i) the classical error model, where the conditional distribution of w given (x,z) is modelled; and ii) the Berkson error model, where the conditional distribution of x given (w,z) is modelled (Berkson, 1950).
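The different consequences of the two error types can be seen in a small simulation of a simple linear regression. This is a sketch under stated assumptions, not taken from the cited literature: it uses the simplest additive forms (w = x + u for the classical model, x = w + u for the Berkson model), standard normal variables, a true slope of 2 and a hypothetical helper `ols_slope`. Under these assumptions, regressing y on w attenuates the slope by the factor var(x) / (var(x) + var(u)) in the classical case and leaves it approximately unbiased in the Berkson case.

```python
import random

random.seed(1)


def ols_slope(x, y):
    """Ordinary least squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx


n = 20000
beta = 2.0  # true slope (assumed for the illustration)

# Classical error: the true x exists first and w is a noisy measurement of it.
x_true = [random.gauss(0, 1) for _ in range(n)]
w_classical = [xi + random.gauss(0, 1) for xi in x_true]
y_classical = [beta * xi + random.gauss(0, 0.5) for xi in x_true]
slope_classical = ols_slope(w_classical, y_classical)   # attenuated towards beta / 2

# Berkson error: w is assigned (e.g. a group value) and the true x scatters around it.
w_berkson = [random.gauss(0, 1) for _ in range(n)]
x_berkson = [wi + random.gauss(0, 1) for wi in w_berkson]
y_berkson = [beta * xi + random.gauss(0, 0.5) for xi in x_berkson]
slope_berkson = ols_slope(w_berkson, y_berkson)          # close to beta

print(slope_classical, slope_berkson)
```

The Berkson case is the relevant one when all units in an area are given the same value of the error-prone covariate, although the extra scatter still inflates the residual variance.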
In their simplest form, the two models correspond to:
classical error model: $w_i = x_i + u_i$, with $E(u_i \mid x_i) = 0$
Berkson error model: $x_i = w_i + u_i$, with $E(u_i \mid w_i) = 0$
where u can be distributed in various ways. For details of these specifications, see Fuller (1987) and Carroll et al. (2006). The basic difference between the two types of error model is that the classical model is to be used if the error-prone variable has to be measured uniquely for every individual, whereas the Berkson model is preferable if all individuals in a small group or stratum are given the same value of the error-prone covariate. The literature on statistical measurement error analysis is enormous. Examples include Carroll et al. (1993), Bollinger (1998), Richardson et al. (2002), Chesher and Schluter (2002), Wang (2004), Carroll et al. (2004), Ganguli et al. (2005), Ybarra and Lohr (2008) and Torabi et al. (2009). In recent years, published applications of models with spatial measurement error include Zhuly et al. (2003), Gryparis et al. (2007), Madsen et al. (2008), Goovaerts (2009) and Gryparis et al. (2009).

7.3 Missing data in spatial analysis

All the concepts introduced in the previous sections are relevant in the particular case of data missing from a dataset containing spatial data. Spatial data raise additional issues that must be taken into account (Haining, 2003, chapters 2 and 4). Missing values in geographical information, such as the coordinates of the units under study, and missing values in SAE study variables and auxiliary variables are particular problems in geographical data analysis.

Missing spatial information

Implementation of geostatistical methods requires that the statistical units are referenced at point locations. If the aim is to analyse the spatial pattern or to produce a spatial interpolation of a studied phenomenon, then such spatial information is required only for the sampled statistical units. But if GWR or a geoadditive model is used to produce estimates of a parameter of interest for some geographical domains, the spatial information is required for all population units. This information is not always easily available, especially when socio-economic data are involved. The coordinates for sampled units, which could be specially collected for the analysis, are usually known, but the exact location of the non-sampled population units is not: only the areas to which they belong, such as census districts or municipalities, are known. In such situations, the classic approach that allows the use of geostatistical techniques is to locate all the units belonging to the same area at the latitude and longitude coordinates of the centroid of that area. This is obviously an approximation based on a geometrical property, and the strength of its effect on the estimates will depend on the level of non-linearity in the spatial pattern and on the area dimension.

To evaluate the effect on small-area estimates of imputing missing unit locations using centroids, the M-quantile Geographically Weighted Regression model based on the Chambers-Dunstan Correction (MQGWR-CD) and the design-based simulation study on EMAP data in chapter 3 are considered. To check the effect of missing unit locations on the MQGWR-CD estimator, the same model was fitted with each unit located at the centroid of its area for sampled units and for out-of-sample households. Table 7.1 shows the median values over the areas of the percentage RB and of the RRMSE for the MQGWR-CD estimator, as presented in chapter 3, and for the same estimator using the centroids of the areas (MQGWR-CD centroid).
Table 7.1 Median values of the percentage RB and percentage RRMSE for the MQGWR-CD and MQGWR-CD centroid predictors in design-based simulations (Chapter 3). Columns: Predictor, RB (%), RRMSE (%). Rows: MQGWR-CD and MQGWR-CD centroid for the 86 sampled HUCs; MQGWR-CD and MQGWR-CD centroid for the out-of-sample HUCs.

When all locations at the unit level are missing, it is clear that replacing them with the centroids can increase the bias and the variability of the final small-area estimates. In the design-based simulation study this happens for the sampled small areas and for the out-of-sample areas. These results suggest that when a large number of unit locations is missing, alternative small-area models, such as area-level models, or an alternative imputation method should be considered.

Instead of using the centroids, the lack of geographical information can be dealt with in a measurement-error approach, whereby a distribution for the locations inside each area is imposed. Let $x_{ij}$ be the vector of the exact spatial coordinates of unit i belonging to area j, and $w_j$ the coordinates of the centroid of area j. This enables formulation of the hypothesis as a Berkson-type error model: $x_{ij} = w_j + u_{ij}$, where $E(u_{ij} \mid w_j) = 0$ and u can assume distributions with different parameters in each area. A proposal by Little and Rubin (1987) for filling gaps in geographical information follows a stochastic imputation approach instead of the classic deterministic approach using the centroids. In another interesting approach, Bocci and Rocco (2011) proposed to deal with the absence of point-referenced geographical data in a geoadditive model, which requires the location of all units to be known, by imposing a distribution to locate the units inside each area.

The intention is to improve on imputing the locations using the area centroids. Because this idea could be extended to other small-area and interpolation models, it is presented here with the main results obtained by Bocci and Rocco (2011). The proposal by Bocci and Rocco (2011) is realized through a hierarchical Bayesian formulation of a geoadditive model in which a prior distribution of the spatial coordinates is defined. The performance of the imputation approach is evaluated through various MCMC experiments in scenarios that differ in two respects: the true distribution of the spatial coordinates (homogeneous Poisson process, non-homogeneous Poisson process and beta distribution) and the a priori coordinate distribution used in the hierarchical Bayesian formulation (centroid, uniform and beta). The model is not a complete measurement-error model: it is assumed that the measurement error does not influence the estimation of the parameters of the geoadditive model, because the spatial information is available for the sample, but that it does occur when the parameter of interest is predicted for the areas using the covariates of the whole population.
As stated in chapter 2, exact knowledge of the spatial coordinates of the studied phenomenon can be exploited to obtain a surface estimate by using bivariate smoothing techniques such as kernel estimates or kriging (Cressie, 1993; Ruppert et al., 2003). But spatial information alone does not properly explain the pattern of the response variable, and some covariates must be introduced in a more complex model. Geoadditive models, introduced by Kammann and Wand (2003), answer this problem because they analyse the spatial distribution of the study variable while accounting for possible linear or non-linear covariate effects.
Under the additivity assumption they can handle such covariate effects by merging an additive model that accounts for the relationship between the variables and a kriging model that accounts for the spatial correlation, and by expressing both as an LMM. The LMM representation is a useful instrument because it enables estimation with mixed-model methods and software (see Kammann and Wand, 2003). The addition of other explanatory variables is straightforward: smoothing components are added in the random effects term, and linear components can be incorporated as fixed effects. The mixed-model structure provides a unified modular framework that enables straightforward extension of the model to include various kinds of generalization and evolution (Ruppert et al., 2009). The mixed model can be fitted in a frequentist framework using a best linear unbiased predictor or penalized quasi-likelihood estimation. A Bayesian inferential perspective can also be adopted by placing priors on the model parameters and simulating their joint posterior distribution. The posterior density is often analytically unavailable, but it can be simulated using MCMC. The posterior distribution of any explicit function of the model parameters can be obtained as a by-product of the simulation algorithm.
Let a population of $N$ units be divided into $Q$ regions, with interest in estimating the regional mean of a study variable $y$. A sample of $n$ units is taken, for which the response variable $y$, the location $s$ and possibly some other covariates that are known without error for all the population units are recorded. To obtain the regional mean, a model-based mean estimator is required that combines the observed values of the sampled units with model predictions for the non-sampled units. In this estimator $N_q$ is the total number of units in region $q$, and $S_q$ and $R_q$ indicate the sets of the sampled and non-sampled units belonging to region $q$. The estimated parameters are obtained from the sampled units, but if $s$ is not known for the non-sampled units the estimator cannot be used directly.
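Under the notation just introduced, a model-based regional mean estimator of this kind is typically written in the following standard form. This is a sketch, not a transcription of the report's own display: the symbol $\hat{y}_i$, denoting the model prediction for a non-sampled unit, is an assumption introduced here.

```latex
\hat{\bar{y}}_q \;=\; \frac{1}{N_q}
  \left( \sum_{i \in S_q} y_i \;+\; \sum_{i \in R_q} \hat{y}_i \right),
  \qquad q = 1, \dots, Q .
```

The second sum is where the missing locations matter: each $\hat{y}_i$ for $i \in R_q$ depends on the location $s_i$ of a non-sampled unit, so some value has to be imputed for $s_i$ before the prediction can be computed.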
To show the problem more clearly, consider a linear predictor $x_i$ of $y_i$ at spatial location $s_i$ and the following spline-regression model:

where a low-rank thin-plate spline with $K_s$ knots is used to represent the unspecified bivariate smooth function of $s$. The corresponding model-based mean estimator is denoted (7.3). The relevant issue now is: how can this estimator still be applied if $s_i$ is not known for the non-sampled units in $R_q$? In the classic approach the $s_i$ values are replaced with the region centroid $c_q$, which is a constant for all the units in region $q$. But this can, as stated above, have drawbacks with regard to the final estimates of interest. As suggested by Bocci and Rocco (2011), lack of geographical information can be treated as a particular problem of missing data. Instead of using the same coordinates $c_q$ for all the units in region $q$, which may be regarded as a particular case of deterministic imputation, they suggest a stochastic Bayesian imputation approach: a prior distribution for $s_i$ inside each region $q$ is included in the hierarchical Bayesian formulation of the geoadditive model (Ruppert et al., 2003, chapter 16), and the joint posterior distribution of all parameters given the data is used as the basis of inference (see Bocci and Rocco, 2011). Some of the simulation results are given here to evaluate the performance of this approach in comparison with the classic centroid approach.
In the experiments, Bocci and Rocco (2011) follow the settings and examples presented in Crainiceanu et al. (2005) and Marley and Wand (2010). All scenarios share the same setting, with the study variable simulated by a model in which $x$ is a dummy variable known for the whole population, $s$ represents the spatial location, generated by a different spatial point process in each scenario, and the function $f(s)$ is a bivariate normal mixture density.
The population consists of $N = 3{,}000$ units located in the unit square $O = [0,1] \times [0,1]$, which is divided into $Q = 9$ rectangular regions, each of which can be represented by its vertices $[(l_{1q}, m_{1q}); (l_{2q}, m_{1q}); (l_{2q}, m_{2q}); (l_{1q}, m_{2q})]$. The regions are obtained by using a random binary splitting procedure. Each scenario differs from the others in the spatial point process used to generate $s$. Four data-generating processes are considered by Bocci and Rocco (2011), as shown in Figure 7.2.

Figure 7.2 Spatial distributions of population units
Legend: (a) homogeneous Poisson process; (b) non-homogeneous Poisson process; (c) non-homogeneous Poisson process on each region; (d) independent bivariate beta distribution on each region. Bocci and Rocco (2011).
For each population setting, three MCMC experiments are performed to estimate the mean of $y$ in the 9 regions, applying the estimator (7.3) and using the complete hierarchical Bayesian formulation of the geoadditive model. They are characterized by three different choices of the prior distribution for $s_i$ inside each region $q$, that is, by three different imputation models: centroid imputation, uniform imputation and beta imputation.
The results of the simulation studies are presented in Tables 7.2, 7.3, 7.4 and 7.5, where the performance of the small-area mean estimator is evaluated in terms of RB and RRMSE for the three imputation approaches. From Tables 7.2 and 7.5 it is evident that the stochastic imputation approach produces better estimates than the classic centroid approach when the imputation distribution corresponds to the population spatial distribution. This is the case with the uniform approach in scenario (a) and with the beta approach in scenario (d). The beta imputation approach also works well in scenario (a) because the true spatial distribution in each region is a special case of the bivariate beta distribution, but it produces less precise estimates than the uniform imputation because the beta parameters need to be estimated in the fitting process.
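The two performance measures used in these tables can be computed from simulation replicates as in the sketch below. It assumes the usual definitions (bias and root mean squared error over replicates, each divided by the true value and expressed as a percentage); the report's own definitions are given elsewhere and may differ in detail, and the function name `rb_rrmse_percent` is hypothetical.

```python
import numpy as np

def rb_rrmse_percent(estimates, true_value):
    """Percentage relative bias and relative root MSE over replicates.

    estimates  : one estimate of the regional mean per simulation replicate
    true_value : the true regional mean (assumed non-zero)
    """
    est = np.asarray(estimates, dtype=float)
    # Relative bias: average error as a percentage of the true value.
    rb = 100.0 * (est.mean() - true_value) / true_value
    # Relative root MSE: root of the average squared error, same scaling.
    rrmse = 100.0 * np.sqrt(np.mean((est - true_value) ** 2)) / true_value
    return rb, rrmse
```

An estimator can have an RB close to zero and still a large RRMSE when its errors are large but cancel on average, which is why both measures are reported side by side.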

In scenarios (b) and (c), reported in Tables 7.3 and 7.4, none of the imputation models corresponds to the population spatial distribution, but the beta approach still performs well. This is because the beta distribution has the advantage of modelling different shapes depending on the values of its parameters. In the approach presented here these parameters are estimated directly in MCMC, exploiting the spatial distribution of the sampled units and producing a posterior bivariate beta distribution that is as similar as possible to the sample spatial distribution. The good performance of this approach obviously relies on the representativeness of the sample.
As a final remark on the classic centroid approach, the results suggest that in almost all cases it performs worse than the beta imputation, even if there are particular situations in which it seems a good choice. This depends strictly on the spatial distribution of specific units and the values of $y$ in that region. This consideration also applies to the behaviour of the uniform distribution in scenarios (b), (c) and (d): generally it does not work well, but it may be good in particular situations. The good performance of the beta imputation in all the scenarios is reflected in the mean estimation for the overall area $O$ (see Bocci and Rocco, 2011).
Table 7.2 Empirical RB % and RRMSE % of the model-based mean estimator for the three imputation approaches. Scenario (a): homogeneous Poisson process
Region    Centroid imputation (RB %, RRMSE %)    Uniform imputation (RB %, RRMSE %)    Beta imputation (RB %, RRMSE %)
Overall
Table 7.3 Empirical RB % and RRMSE % of the model-based mean estimator for the three imputation approaches. Scenario (b): non-homogeneous Poisson process
Region    Centroid imputation (RB %, RRMSE %)    Uniform imputation (RB %, RRMSE %)    Beta imputation (RB %, RRMSE %)
Overall

Table 7.4 Empirical RB % and RRMSE % of the model-based mean estimator for the three imputation approaches. Scenario (c): non-homogeneous Poisson process on each region
Region    Centroid imputation (RB %, RRMSE %)    Uniform imputation (RB %, RRMSE %)    Beta imputation (RB %, RRMSE %)
Overall
Table 7.5 Empirical RB % and RRMSE % of the model-based mean estimator for the three imputation approaches. Scenario (d): bivariate beta distribution on each region
Region    Centroid imputation (RB %, RRMSE %)    Uniform imputation (RB %, RRMSE %)    Beta imputation (RB %, RRMSE %)
Overall
To show the relevance of the spatial representativeness of the sample, Bocci and Rocco (2011) present other MCMC experiments. In this case, $s$ is assumed to be univariate, so that the regions are actually intervals. In the new simulations the study variable $y$ is simulated by a model with $\alpha = 10$ and $\beta_x = 0.4$, in which $x$ is a dummy variable known for the whole population, $s$ represents the spatial location and is generated from a uniform distribution in every region, and $f(s) = \sin(3\pi s^3)$. The population of $N = 3{,}000$ units is located in the interval $O = [0, 1]$, which is divided into $Q = 4$ intervals: $[0, 0.2]$, $[0.2, 0.5]$, $[0.5, 0.82]$ and $[0.82, 1]$.

The population obtained is shown in Figure 7.3(a), where the green dots correspond to the units with $x_i = 0$ and the black dots to the units with $x_i = 1$, the vertical dashed lines indicate the regions, and the red lines indicate the deterministic component of the model.
Figure 7.3 Scenario settings
Legend: (a) simulated population; green dots correspond to units with $x_i = 0$, black dots to units with $x_i = 1$; vertical dashed lines indicate regions, red lines indicate the deterministic component of the model; (b) distribution of a representative sample; (c) distribution of a type-1 non-representative sample; (d) distribution of a type-2 non-representative sample. Bocci and Rocco (2011).
Three scenarios are considered, each with a different type of sample selected from the population. For each scenario, three MCMC experiments are performed to estimate the mean of $y$ in the four regions. The three types of sample are stratified samples of $n = 500$ units, with strata corresponding to the four regions and proportional allocation of sampled units in each stratum. They differ in the sampling design used to select the units in each stratum:
representative sample: a simple random sample is selected in each stratum;
type-1 non-representative sample: in each stratum 70 percent of the sample is randomly selected among the units with $s$ values lower than the centroid, and the remaining 30 percent is randomly selected among the units with $s$ values greater than the centroid; and
type-2 non-representative sample: in each stratum the units are selected with probability proportional to the inverse of the $y$ values.
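The three within-stratum selection schemes just described can be sketched as follows. This is an illustrative reading of the text rather than the simulation code of Bocci and Rocco (2011): the function name `select_stratum_sample`, its arguments, and the use of sequential weighted draws for the type-2 design are assumptions, and $y$ is assumed to be strictly positive.

```python
import numpy as np

def select_stratum_sample(s, y, n, design, rng, centroid=None):
    """Return the indices of n units selected within one stratum (interval).

    s, y     : locations and study-variable values of the stratum units
    design   : 'representative', 'type1' or 'type2'
    centroid : interval midpoint, required for the 'type1' design
    """
    idx = np.arange(len(s))
    if design == "representative":
        # Simple random sampling without replacement.
        return rng.choice(idx, size=n, replace=False)
    if design == "type1":
        # 70 percent of the sample below the centroid, 30 percent above it.
        low, high = idx[s < centroid], idx[s >= centroid]
        n_low = int(round(0.7 * n))
        return np.concatenate([
            rng.choice(low, size=n_low, replace=False),
            rng.choice(high, size=n - n_low, replace=False),
        ])
    if design == "type2":
        # Draw weights proportional to 1 / y (assumes y > 0). Sequential
        # weighted draws only approximate exact probability-proportional
        # inclusion probabilities.
        p = 1.0 / y
        return rng.choice(idx, size=n, replace=False, p=p / p.sum())
    raise ValueError("unknown design")
```

Under the type-1 design the sampled locations are shifted towards the lower end of each interval, and under the type-2 design units with small $y$ are over-represented, so in both cases the spatial distribution of the sample no longer mirrors that of the population.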

Examples of spatial distribution in the three samples are shown in Figure 7.3. The MCMC experiments follow the settings previously described in this section and are replicated $m = 100$ times to take into account variability in the model and the sampling design. Function $f(s)$ is modelled with a low-rank truncated linear spline with $K_s = 30$ knots located on the quantiles of the sample distribution of $s$. The posterior densities of the regional model-based mean estimator in the three scenarios are presented in Figure 7.4. It is evident that when a simple random sample is selected in each stratum, the uniform imputation, which corresponds to the true spatial distribution, and the beta imputation work well, as in the bivariate scenario (a). In the other two scenarios the performance of the two imputation approaches deteriorates, with the beta imputation more affected. This is because the beta imputation exploits the spatial distribution of the sampled units to estimate its parameters: whenever the spatial distribution of the sample does not reflect that of the population, the estimated parameters produce a posterior spatial distribution different from the true distribution. The uniform imputation, on the other hand, does not exploit any sample information and so it correctly imputes the coordinates of the non-sampled units. But because the selection of sampled units depends on their location or on their $y$ value, which is connected to $s$ by $f(s)$, the joint spatial distribution of sampled and imputed units will not be uniform. Similar considerations apply to the classic centroid imputation approach. Hence, whichever imputation approach is used, the mean estimator will be affected by the non-representativity of the sample. It is important to note that the non-representativity of the sample mainly affects the imputation step of the presented analysis.
Its semi-parametric spline structure makes the geoadditive model robust to sample non-representativity, and the model-fitting step is hardly influenced by it.
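The low-rank truncated linear spline with knots at the sample quantiles of $s$ can be illustrated with a short sketch. It fits the spline by penalized least squares, which corresponds to the mixed-model representation when the penalty `lam` is read as the ratio of the error variance to the spline-coefficient variance; the function names, the default penalty and the frequentist fit are assumptions made here, whereas the experiments described above use a hierarchical Bayesian fit by MCMC.

```python
import numpy as np

def truncated_linear_basis(s, knots):
    """Design matrix with one column (s - kappa_k)_+ per knot kappa_k."""
    s = np.asarray(s, dtype=float)
    return np.maximum(s[:, None] - np.asarray(knots)[None, :], 0.0)

def fit_penalized_spline(s, y, n_knots=30, lam=1e-3):
    """Fit y = b0 + b1 * s + sum_k u_k (s - kappa_k)_+ with a ridge penalty on u."""
    s = np.asarray(s, dtype=float)
    y = np.asarray(y, dtype=float)
    # Knots located at equally spaced quantiles of the sampled locations.
    probs = np.arange(1, n_knots + 1) / (n_knots + 1)
    knots = np.quantile(s, probs)

    def design(s_any):
        s_any = np.asarray(s_any, dtype=float)
        fixed = np.column_stack([np.ones_like(s_any), s_any])
        return np.hstack([fixed, truncated_linear_basis(s_any, knots)])

    C = design(s)
    # Penalize only the spline coefficients, not the intercept and slope.
    D = np.diag([0.0, 0.0] + [1.0] * n_knots)
    coef = np.linalg.solve(C.T @ C + lam * D, C.T @ y)
    return knots, coef, lambda s_new: design(s_new) @ coef
```

Because the knots follow the quantiles of the sampled locations, a spatially unbalanced sample moves the knots but still leaves the basis flexible enough to track $f(s)$ where data exist, which is in line with the robustness of the fitting step noted above.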

Figure 7.4 Posterior density of the regional model-based mean estimator in the three scenarios and for the three imputation approaches
Legend: centroid = green line; uniform = red line; beta = blue line. Vertical lines indicate true mean values. Bocci and Rocco (2011).
Missing values in auxiliary and target variables
With regard to the issue of missing auxiliary variables in geographic datasets, an initial consideration is that the MAR hypothesis does not necessarily imply that the missing values are geographically distributed at random. Observations can be missing at regular intervals in the region under study, or they can be clustered in some subregions; in the latter case the remaining data will have a strong influence on the fit of models used to describe surface variation. As with other data, it is important to consider why the values are missing before deciding on the approach to be adopted. Missing remotely sensed data, for example, do not necessarily violate the MAR assumption because sensor failure along a scan line is not usually related to the underlying surface. Similarly, if unemployment data are not recorded in some areas as a result of strike action, this does not imply a non-ignorable missing-data mechanism. By contrast, the number of non-responses to questions on crime is often higher in inner-city areas as a result of a Missing Not At Random (MNAR) mechanism linked to crime levels in those areas.
The spatial continuity of the observations should be used to impute missing values. In the case of clustered missing data in a sub-region, therefore, imputations could be based largely on the values observed in the closest areas. And if the aim is to map the spatial variability of given characteristics, an estimate of prediction error should be included to check the effect of the missing values; this could be represented in a map as well.

Apart from situations where the number of missing data items is low, analysis using complete cases only should be avoided because it can severely bias the results of the analysis of interest, and removing a relevant covariate from the analysis because of missing values could cause a mis-specification of the model. The usual solution for treating missing data in datasets containing spatial data is, therefore, to impute the missing information. All the imputation techniques defined for general datasets can be used for spatial data as well, with all the extra information provided by the spatial distribution of the values included in the imputation process.
Of course, with spatial data there may be particular circumstances to take into account. Some values of covariates may be missing for some observations, for example, while the total at the area level is known. If data are missing at the small-area level, the total for a wider area including all the small areas may be known. In these cases, the missing values should be filled in so that the final estimates are benchmarked to the wider-area total.
Haining (2003) suggests that missing variable values should be imputed in spatial data matrices using one of the following approaches:
i. Spatial mean imputation with equal or unequal weights assigned to each data value. The idea is to impute the missing value with the arithmetical mean (strictly speaking, the median) of the data values in a spatial window defined around the area with the missing value. The mean could be weighted to avoid cluster effects in the distribution of irregularly shaped areas.
ii. Spatial hot-deck imputation. In this approach a missing value is imputed by drawing it from the empirical distribution of the variable, considering the values obtained from a given spatial window.
iii. Spatial regression imputation.
This approach extends regression imputation by including among the predictors the neighbouring values of a fully observed covariate, weighted using a contiguity matrix of the areas.
iv. Maximum likelihood approach. This involves the iterative estimation of model parameters and prediction of missing values; it is similar to the EM algorithm and to simple and universal kriging.
Lokupitiya et al. (2006) compared the effect of four techniques to impute missing crop-yield data for barley in the 1997 database of the National Agricultural Statistics Service (NASS). The data considered were crop yields aggregated at the county level, as reported by NASS and by the Census of Agriculture. The NASS crop-yield data are produced annually from statistical sampling and surveys of selected farms in a county; Census of Agriculture crop-yield estimates, produced every five years, are based on a survey covering almost all farms in a county. Both datasets present missing data, but the aim of Lokupitiya et al. (2006) was to fill the gaps in the NASS database because it reports yields every year. A major source of missingness is that the survey only covers states that produce 90 percent to 95 percent of the national total for each crop. Lokupitiya et al. (2006) compared the following imputation methods: regression, kernel smoothing, universal kriging and multiple imputation (MI). As covariate information in these models, they used data on crop yields from the Census of Agriculture. In the multiple imputation procedure, an MCMC method was used to impute the missing values. The mean vector and covariance matrices of the data without missing values were computed as starting values and taken as the prior distribution. A complete dataset was then created by filling the missing values with random numbers drawn from this distribution.
The mean vector and covariance matrices were re-computed for the complete dataset to obtain the posterior distribution. The missing values were then imputed again by generating random numbers from the posterior distribution. This procedure was repeated until the mean vector and covariance matrices were stable. Imputations from the final iteration were taken to form a dataset with no missing values.
In the simulation studies Lokupitiya et al. (2006) used the omit-one cross-validation method and the deleting-k multifold cross-validation method with $k = 5$ to compare the performance of the different techniques. In the omit-one method, for a sample of size $n$ the model was fitted $n$ times, each time on a sub-sample containing all the observations but one: the first sub-sample contained observations $2, 3, \dots, n$; the second contained observations $1, 3, 4, \dots, n$; and so on, until the $n$th sub-sample contained observations $1, 2, \dots, n-1$. The omitted observation changed with each sub-sample so that every observation was held out exactly once; in each case the sub-sample was used to estimate the omitted observation, and the estimate was compared with the omitted value. In multifold cross-validation, on the other hand, several ($k > 1$) observations were deleted in each sub-sample. Table 7.6 shows the mean absolute prediction error obtained with the two cross-validation methods.
Table 7.6 Mean absolute prediction errors (MAPE) for each imputation method under the two cross-validation approaches
Method    Omit-one MAPE*    Deleting-5 multifold MAPE*
Regression
Multiple imputation
Universal kriging
Kernel smoothing
* Mean absolute prediction error. Lokupitiya et al. (2006).
The results of the simulations show that regression and multiple imputation performed best, followed by universal kriging and kernel smoothing. Lokupitiya et al. (2006) suggested that the main problem of kernel smoothing was over-estimation, which arises because it is a distance-based method and can occur, for example, when estimating a low crop yield in a zone surrounded by high crop yields. They suggested that universal kriging performed poorly because it depended on the hypothesis of isotropy; estimation could be improved in this case by correcting for anisotropy. Nevertheless, the final suggestion was to use regression imputation when data from the Census of Agriculture were available, and to use MI otherwise.
Several studies consider the issue of missing values in spatial datasets where the SAR or CAR hypothesis can be used to specify the model of interest.
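The omit-one and deleting-k cross-validation schemes used in the comparison of imputation methods above can be sketched generically as follows. The function name `cv_mape` and the `fit`/`predict` callables are hypothetical; a simple linear regression stands in for the imputation model and is not the specification used by Lokupitiya et al. (2006).

```python
import numpy as np

def cv_mape(x, y, fit, predict, k=1, rng=None):
    """Mean absolute prediction error under delete-k cross-validation.

    k = 1 gives omit-one cross-validation; k = 5 deletes five observations
    from each sub-sample. Every observation is held out exactly once.
    """
    n = len(y)
    # Optionally shuffle so that the deleted groups are random.
    order = np.arange(n) if rng is None else rng.permutation(n)
    errors = []
    for start in range(0, n, k):
        held = order[start:start + k]
        train = np.setdiff1d(order, held)
        model = fit(x[train], y[train])
        errors.extend(np.abs(predict(model, x[held]) - y[held]))
    return float(np.mean(errors))

# Stand-in imputation model: straight-line regression of y on x.
def fit_line(x, y):
    return np.polyfit(x, y, 1)

def predict_line(model, x):
    return np.polyval(model, x)
```

For example, `cv_mape(x, y, fit_line, predict_line, k=1)` gives the omit-one error and `cv_mape(x, y, fit_line, predict_line, k=5, rng=np.random.default_rng(0))` the deleting-5 error; swapping in a different `fit`/`predict` pair allows several imputation methods to be compared on the same held-out observations.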
Wang and Lee (2013) consider SAR panel models with randomly missing data in the dependent variable: they suggest that missing data can occur even more frequently in spatial-panel data because missingness may arise in both the temporal and the cross-sectional dimensions. To deal with this type of missing data, they consider three approaches: generalized method of moments estimation based on linear moments; non-linear least-squares estimation, in which the reduced form of the panel SAR model is used; and two-stage least-squares estimation with imputation. Wang and Lee (2013) also propose the use of the spatial Mundlak approach (Mundlak, 1978) if individual effects are correlated with the included regressors.
Polasek et al. (2010) proposed a spatial extension of the Chow and Lin (1971) method, which was the first to develop a unified framework for the three problems (interpolation, extrapolation and distribution) of predicting time series by related series. This model predicts unobserved dependent data using indicators observed at the same disaggregated regional level and a spatial SAR model specification. In a similar approach, Horabik and Nahorski (2011) proposed a method derived from areal-to-areal data realignment for imputing missing data, accounting for spatial clustering in a CAR specification. In both studies, the primary interest is to allocate or estimate the variable of interest at a finer geographical scale than that currently available.
Because this is the primary objective of SAE, the issue of missing data is addressed here with a focus on unit non-response. In addition to the difficulty associated with small sample sizes, an SAE problem can be further complicated by the fact that not all the units in the sample respond to the survey and that the probability that a sampled unit responds may be related to the study variable. Giusti and Rocco (2010) proposed a probability-weighted estimation procedure to adjust for the effect of a non-ignorable non-response mechanism on the small-area mean predictor when a small-area model at the unit level is adopted. Consider a one-fold nested error linear regression model with a fixed covariate vector, a fixed vector of parameters, a known constant, and normally distributed, mutually independent error terms at area and unit level with mean zero and constant variances. There is an informative non-response at the first level when some of the values of the target variable are missing and the associated non-response probabilities are related to the target variable even after conditioning on the covariates.
In this context the method suggested by Giusti and Rocco (2010) is the pseudo maximum likelihood approach introduced by Skinner (1989) to adjust for informative sample designs, extended to the case of informative unit non-response in SAE problems. This extension requires consideration of the population as two-level, with the individuals (the first-level units) nested in the small areas (the second-level units). To compensate for the effects of an informative non-response at the first level, it is possible to use a multi-level pseudo maximum likelihood approach, which requires knowledge of the survey weights at every level of the population structure. When the sample design is self-weighting and non-response concerns only the first-level units, the first-level survey weights for unit $j$ belonging to small area $i$ can be specified in terms of the response probability of the unit. Because response probabilities are usually unknown, they must be estimated using the available information.
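For reference, one standard way of writing a one-fold nested error regression model and response-based first-level weights is shown below. The notation ($x_{ij}$, $\beta$, $u_i$, $e_{ij}$, $k_{ij}$, $\rho_{ij}$, $w_{j|i}$) is assumed here for illustration and is not necessarily that of Giusti and Rocco (2010); in particular, taking the weight as the inverse of the response probability is the usual choice under a self-weighting design, stated here as an assumption.

```latex
y_{ij} = x_{ij}^{\top}\beta + u_i + k_{ij}\, e_{ij},
\qquad u_i \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2),
\qquad
w_{j \mid i} = \frac{1}{\rho_{ij}},
```

where $i$ indexes the small areas, $j$ the units within an area, $k_{ij}$ is a known constant, and $\rho_{ij}$ is the probability that unit $j$ in area $i$ responds. Units that are less likely to respond receive larger weights, which is how the pseudo-likelihood compensates for their under-representation among the respondents.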
The simplest and perhaps most common way to estimate individual response probabilities is to partition the sampled units into weighting classes, assumed to be homogeneous with respect to the response mechanism, and then to estimate the response probabilities as the rates of respondent units in each class. Another common way to estimate individual response probabilities is to express them as a logit function of a set of known variables. The small-area mean estimator used by Giusti and Rocco (2010), referred to as (7.4), is expressed in terms of the set of respondents and the set of non-sampled plus non-respondent units in area $i$, with model parameters obtained by multi-level pseudo maximum likelihood (MPML) estimation using weights based on response probabilities that may be true or estimated.
To illustrate the bias of the small-area mean estimator that can occur when an informative response mechanism is ignored, and to assess the performance of the MPML estimation procedure on the basis of the true or estimated response probabilities, Giusti and Rocco (2010) designed three simulation studies (A, B and C), each consisting of the following steps:
i. Generate the area indexes and the area population sizes using a truncated distribution; the resulting population sizes lie in the range [70, 126].
ii. Generate the population random area effects and the covariates. The rather complicated formula used to generate the auxiliary variables follows Pfeffermann and Sverchkov (2007) and guarantees that the covariates are the same in each of the three groups into which the areas are divided, except for the random disturbances.
iii. Generate the values of the study variable according to the nested error regression model defined above.

iv. Associate with each level-1 unit a response probability as follows. In study A, the response probability of each unit in each area is obtained through an exponential function. In study B, the areas are split into four groups using the quartiles of the distribution of the random area effects, and in each group the response probabilities are generated through an exponential function of the same kind, but with parameters that change from one group to another. In study C the procedure is the same as in study B, but the exponential function used to generate the non-response is assumed to depend only on the individual random effects. In all the studies, the parameters of the non-response generating function are chosen to produce an expected overall population response rate of about 0.7.
v. Select a stratified sample of the first-level units with strata equal to the second-level units and a sampling fraction equal to 0.1 in each stratum.
vi. Classify each level-1 unit in the sample as respondent or non-respondent by carrying out a Bernoulli experiment for each of them.
vii. Repeat steps ii to vi 1,000 times.
In study A, for each set of respondents the following six predictors of the area means were computed:
(a) the standard unweighted EBLUP estimator calculated on the set of respondents;
(b) the MPML predictor (7.4) with weights computed using the true response probabilities;
(c) the MPML predictor (7.4) with weights computed using response probabilities estimated with the weighting-within-cells method, using the $z_{ij}$ values to define the cells;
(d) the MPML predictor (7.4) with weights computed using response probabilities estimated with a logit model as a function of the $z_{ij}$, supposed known for all the population units;
(e) the MPML predictor (7.4) with weights computed using response probabilities estimated as in point (d), but with an alternative explanatory variable in the logit model;
(f) the standard unweighted EBLUP estimator calculated on the entire sample.
For study B and study C, the same predictors are computed except that the estimator described at point (e) is replaced by:
(g) the MPML predictor (7.4) with weights computed using response probabilities estimated with a logit model that includes as covariates not only $z_{ij}$ but also a categorical variable that identifies the groups of areas with different response mechanisms.
Figures 7.5, 7.6 and 7.7 show the area percentage relative biases of the six predictors, namely (a), (b), (c), (d), (e) or (g), and (f), by study.
Figure 7.5 Study A

Figure 7.6 Study B
Figure 7.7 Study C
In all the figures the predictor (f), which corresponds to the hypothesis of complete response, is shown as a benchmark. It is evident from all the figures that an informative response mechanism may induce a significant bias in the estimation of the small-area means if the hierarchical regression model is fitted using the standard ML estimation method (case a). The bias can be reduced effectively by the MPML, assuming unrealistically that the response probabilities are known (case b). Figure 7.5 also provides evidence of the reduction of bias that occurs if auxiliary variables predictive of the response behaviour are available and the unknown response probabilities are estimated through a logit model (cases d and e). Obviously, the more predictive the available auxiliary variables, the greater the bias reduction (case d versus case e). When the response mechanism conditional on the auxiliary variables becomes fully ignorable, the estimated response probabilities produce a bias reduction equivalent to that obtained with the true response probabilities. The performance of the weighting-within-cells method (case c) is equivalent to the performance of the true response probabilities when it is based on auxiliary variables predictive of the response behaviour.
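The weighting-within-cells estimate of the response probabilities used in case (c) can be sketched as follows. The function name `response_weights_within_cells` and the inverse-probability form of the weights are assumptions made for illustration; how the cells are defined from the auxiliary variable is left to the user.

```python
import numpy as np

def response_weights_within_cells(cell, responded):
    """Estimate response probabilities as respondent rates within cells.

    cell      : weighting-class label of each sampled unit
    responded : True for respondents, False for non-respondents
    Returns the estimated response probability of every sampled unit and the
    non-response adjustment weight of every respondent (0 for non-respondents).
    """
    cell = np.asarray(cell)
    responded = np.asarray(responded, dtype=bool)
    prob = np.empty(len(cell), dtype=float)
    for c in np.unique(cell):
        in_cell = cell == c
        # Response rate in the class, shared by all units in that class.
        prob[in_cell] = responded[in_cell].mean()
    weights = np.zeros(len(cell), dtype=float)
    # Respondents always belong to a class with a positive response rate.
    weights[responded] = 1.0 / prob[responded]
    return prob, weights
```

The weights of the respondents in a class add up to the number of sampled units in that class, so respondents stand in for the non-respondents of their own class. The method removes non-response bias only to the extent that the response mechanism is homogeneous within classes.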

The two response probability estimation methods that use only the $z_{ij}$ as covariates, that is, the parametric method and the non-parametric weighting-class method, appear equivalent: in study A they both perform well, and in studies B and C neither significantly reduces the bias of the traditional EBLUP estimator. In these two studies, good performance of the suggested MPML predictor requires the inclusion, in the estimation process for the response probabilities, of a categorical variable that identifies the groups of areas with different response mechanisms. The advantage of introducing this categorical variable to estimate the response probabilities in studies B and C is obvious. The point in question is the following: the response estimation procedures that use only the $z_{ij}$ values almost remove the bias of the direct estimator of the whole-population mean (see Table 7.7).
Table 7.7 Percentage relative bias of the whole-population mean direct estimator
Study    Response estimation method: Logit model    Response estimation method: Weighting within cell
A
B
C
It follows that a researcher who calculates the survey weights but is not interested in the SAE problem may not realize this advantage. In other words, compensating for non-response using a method that works well for the estimation of the overall population mean, without considering estimation at the small-area level, may or may not reduce the bias of the small-area mean predictions (predictors c and d). This depends on the compensation method, but also on the response mechanism. From Table 7.7 it is also evident that in study C the bias of the whole-population direct estimator not adjusted for non-response is less than half of the bias of the corresponding unweighted small-area mean predictor (see Figure 7.5). This result indicates that an informative response mechanism may have a modest effect on population estimators, but a significant effect on small-area estimators.
This study shows that the unit non-response and the SAE problems should probably be addressed simultaneously, because the non-response probabilities may depend on individual unit characteristics and also on area-level issues such as administrative problems in conducting the survey in certain areas.

Missing information in methods of data integration

Merging datasets from multiple sources can cause data to be missing from the resulting integrated dataset. A statistical matching problem is a missing-data problem with a non-monotone missingness pattern (see Figure 7.1); it is often described as a missing-by-design pattern. The treatment of missing data in this approach is usually more problematic than in a standard non-monotone pattern when the interest is in analysing the variables that have not been jointly observed. The inherent identification problem in statistical matching requires a conditional independence assumption between the variables that have not been jointly observed, given the variables that have been jointly observed (Rassler, 2002). The analysis can be further complicated because the matched subset of data can be affected by additional missing-data mechanisms such as unit non-response. In such situations, even if there is a missing-by-design pattern, the missing-data mechanism is not MCAR because there is another underlying missing-data problem. Assuming conditional independence, the hypothesis can be maintained that the missing-data mechanism is still ignorable because conditional independence should include ignorability (Koller-Meinfelder, 2009).

With regard to spatial analysis, the data in an integrated dataset can suffer from spatial-temporal misalignment because the location and time characteristics of the original data do not align well.
To investigate health effects of air pollution, for example, data on air pollution and health are needed, but the location and time stamps of air pollution data may be imperfectly aligned with the location and time stamps of aggregated, disaggregated or individual-level health data. To address such a misalignment problem, environmental data have to be imputed to the spatial-temporal stamps of health data (Liang and Kumar, 2013).

It should be noted that data gaps deriving from the matching of spatial datasets fall into the problem category of incompatible spatial data. The problem has been described in various ways: the ecological inference problem, the MAUP, spatial data transformation, the scaling problem, inference between incompatible zonal systems, block kriging, pycnophylactic geographic interpolation, the polygonal overlay problem, areal interpolation, inference with spatially misaligned data, contour re-aggregation, multi-scale and multi-resolution modelling, and the change-of-support problem (Gotway and Young, 2002). Hence data gaps deriving from misaligned spatial datasets can sometimes be treated using the methods suggested for the corresponding incompatible spatial data problems, such as the areal interpolation method using the expectation-maximization algorithm.

Analysis of time-space datasets that come from different sources requires that: i) they are aligned with respect to location and time; ii) they are arranged on the same spatial-temporal scales; and iii) missing values are filled. If adequate data points across geographic space and time are available, different methods of interpolation can be employed to impute values at a given location and time. A recently suggested method is time-space kriging, which can be attractive because it minimizes mean squared prediction errors among linear unbiased predictors. Time-space kriging can therefore address multiple problems arising from the convergence of time and space domains: misalignment, missing values and mismatches in spatial-temporal resolutions (Kumar, 2012).
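The idea of a linear unbiased predictor with minimum mean squared prediction error can be sketched with ordinary kriging on (x, y, t) coordinates. The separable exponential covariance, its range parameters and the data below are assumptions made for illustration only; they are not the models of Kumar (2012) or Liang and Kumar (2013).

```python
import numpy as np


def exp_cov(a, b, space_range=1.0, time_range=1.0):
    """Assumed separable exponential covariance between (x, y, t) points."""
    ds = np.linalg.norm(a[:, None, :2] - b[None, :, :2], axis=2)  # spatial lag
    dt = np.abs(a[:, None, 2] - b[None, :, 2])                    # temporal lag
    return np.exp(-ds / space_range) * np.exp(-dt / time_range)


def ordinary_kriging(obs_pts, obs_vals, new_pts):
    """Predict at new_pts; the weights are constrained to sum to one."""
    n = len(obs_pts)
    # Kriging system: covariance block bordered by the unbiasedness constraint.
    lhs = np.ones((n + 1, n + 1))
    lhs[:n, :n] = exp_cov(obs_pts, obs_pts)
    lhs[n, n] = 0.0
    rhs = np.ones((n + 1, len(new_pts)))
    rhs[:n, :] = exp_cov(obs_pts, new_pts)
    weights = np.linalg.solve(lhs, rhs)[:n, :]  # drop the Lagrange multiplier
    return weights.T @ obs_vals, weights


# Hypothetical observations at four space-time stamps.
obs_pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1]], dtype=float)
obs_vals = np.array([1.0, 2.0, 3.0, 4.0])
# Impute at a new stamp and, as a check, at an already observed stamp.
new_pts = np.array([[0.5, 0.5, 0.5], [1.0, 0.0, 0.0]])
pred, weights = ordinary_kriging(obs_pts, obs_vals, new_pts)
```

With no measurement-error (nugget) term, the predictor reproduces the observed value at an observed stamp, which is why the second prediction equals the second observation.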
Liang and Kumar (2013) proposed a Bayesian hierarchical spatial-temporal method of interpolation, Markov cube kriging, to deal with spatial-temporal misalignment, mismatches in spatial-temporal scales and missing values across space and time in large spatial-temporal datasets (see also the MAUP in chapter 4).

7.4 Remarks and findings

With regard to the problem of missing geographical information, this chapter has highlighted what happens when a geostatistical model is to be applied that requires knowledge of the exact locations of all population units, information that is seldom available. When estimates for geographical domains are needed, locating all the units in the centroid of the corresponding area can be a strong approximation. This was highlighted in the first simulation study, where an MQGWR model was fitted to estimate the mean of the variable of interest in areas in two alternative settings: using the exact coordinates for all units, or locating them on the centroid of their area. With this approximation, the performance of the MQGWR estimator was affected by an increase in bias and variability. The other option presented was the imputation technique suggested by Bocci and Rocco (2011) in the context of geoadditive models in a Bayesian framework. The simulation results showed that in the absence of prior knowledge of the spatial distribution, an approach that imputes spatial coordinates using a Beta prior distribution is certainly preferable to the classic approach that locates each unit at its corresponding area centroid. The proposal made by Bocci and Rocco (2011) is promising, and it could be extended to other settings and geostatistical models.

The chapter also described the effects on small-area estimates of data gaps resulting from an informative non-response mechanism.
Giusti and Rocco (2010) suggested that when data are missing for the units in a sample and the interest is in producing small-area estimates, a possible solution is to use an MPML approach in which the units are assigned weights that are functions of the estimated response probabilities. The simulation studies presented in this chapter lead to the important conclusion that the issues of missing data and SAE should be addressed together if possible. This is because the unit non-response mechanism could be different between areas or between groups of areas, and so a weighting approach that reduces the non-response bias for the whole population may not reduce it for small-area estimators. These observations should be taken into account in further contributions to the study of missing data in SAE problems.

Missing geographical information
Applying a geostatistical model requires knowledge of the exact locations of all population units. Locating all the units in the centroid of the corresponding area can be a strong approximation. In the context of geoadditive models in a Bayesian framework, a spatial coordinates imputation approach using a Beta prior distribution is certainly preferable.

Missing target/auxiliary variables
Apart from situations where the amount of missing data is small, analysis using complete cases only should be avoided. Removing a relevant covariate from the analysis because of missing values could cause a misspecification of the model. The usual solution for treating missing data in datasets containing spatial data is to impute the missing information. All the imputation techniques defined for general datasets can also be used for spatial data; this includes imputing information on the spatial distribution.

Missing non-random target data
The issues of missing data and SAE should be addressed together. A possible solution is to use a multi-level pseudo-maximum likelihood approach where the units are assigned weights that are functions of the estimated response probabilities.

8. Analysis of Zero-Inflated Data in SAE

8.1 Introduction

This section addresses the problem of the zero-inflated dataset. In many agricultural datasets there can be a large number of zeros in quantitative variables of interest, which leads to problems in the inference process. The expression zero-inflated data is used here to mean data that have a larger proportion of zeros than expected from pure-count Poisson data (see for example Barry and Welsh, 2002). Estimates for this particular type of dataset can be obtained by following a Bayesian or a frequentist approach; both are presented in this section.

As stated earlier, SAE techniques are usually based on the LMM. If the LMM is true, neither will be as efficient as the EBLUP if spatial information is not available. But the small-area estimator based on the LMM can be inefficient in zero-inflated data situations. In effect, zero inflation in the data invalidates the assumptions of the LMM (McCullagh and Nelder, 1989) and so problems with inference can occur if this feature of the data is not known. When the focus of the inference is on small areas, the presence of excess zeros in small areas will be more influential than in the overall sample.

This kind of mixed distribution has been considered recently in the SAE literature. In particular, the problem of zero-inflated data has been considered following the Bayesian paradigm (Pfeffermann et al., 2008; Dreassi et al., 2013) and the frequentist paradigm (Chandra and Chambers, 2014; Chandra and Sud, 2012). The zero-inflation problem can be addressed in both paradigms using a two-part mixed model, of which the first part is the logistic function used to model the probability of a positive outcome, and the second is a linear model with normal error terms fitted to the non-zero responses. Both models include individual-level and area-level covariates and area random effects that account for variations not explained by the covariates.
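The definition of zero inflation given in the introduction suggests a simple first check for count data: compare the observed share of zeros with the share a Poisson distribution with the same mean would produce, which is exp(-mean). The sketch below does this for a small hypothetical sample; the function name and the data are illustrative assumptions, and the check is a descriptive diagnostic rather than a formal test.

```python
import math


def zero_excess(counts):
    """Return (observed share of zeros, share expected under a Poisson model)."""
    n = len(counts)
    mean = sum(counts) / n
    observed = sum(1 for c in counts if c == 0) / n
    expected = math.exp(-mean)  # P(Y = 0) for a Poisson with this mean
    return observed, expected


# Hypothetical counts, e.g. a crop recorded as absent on most holdings.
counts = [0, 0, 0, 0, 0, 0, 3, 5, 4, 8]
observed, expected = zero_excess(counts)
```

Here the sample mean is 2, so a Poisson model would expect a share of zeros of about 0.135, far below the observed 0.6.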
8.2 Bayesian small-area estimator for zero-inflated data

Suppose that a population U of size N is partitioned into m subsets U_i (domains of study or areas) of size N_i, i = 1, ..., m. The population units are identified by j and the small areas by i. The population data consist of values y_ij of the variable of interest and values x_ij of a vector of p auxiliary variables that includes the constant term as first component. Suppose that a sample s is drawn and that area-specific samples s_i ⊂ U_i of size n_i ≥ 0 are available for each area or domain. Note that it is possible to have non-sample areas, so n_i = 0, in which case s_i is the empty set. The set r_i ⊂ U_i contains the N_i − n_i indices of the non-sampled units in small area i. Values of y_ij are known only for sampled units, while for the p-vector of auxiliary variables it is assumed that area-level totals X_i or their means are accurately known from external sources. Given y the response variable and Z the covariate variables and random effects (Pfeffermann et al., 2008):

(8.1)

For the classical SAE problem, a random-intercept model can be applied with areas or domains defining the first level and units defining the second level. For a unit j in area i with covariates Z_ij = z, the following relationship holds:

(8.2)

The two parts in the right-hand side of (8.2) can be modelled separately. For units with positive target values, a random intercept model is assumed:

(8.3)

where y+_ij and x+_ij are the positive outcome and the vector of covariates for units with positive outcomes, v_i is the area-level error and e_ij is the unit-level error; standard mixed-model assumptions are considered true. To model the probability of positive outcomes (the second part of equation (8.2)), the generalized LMM is used:

(8.4)

where x_ij is the vector of covariates for unit j in area i and u_i represents random area effects not accounted for by the covariates; standard assumptions are considered true. The Bayesian model framework also allows for a non-zero correlation between the area random effects of the two parts, v_i and u_i (see Pfeffermann et al., 2008). Given that the parameter of interest is the small-area mean or total, unobserved outcomes should be predicted: where define appropriate sample estimates. Adding estimates of the unit-level errors to the estimated mean values reflects the variability of the positive responses more closely. In the Bayesian approach, the missing scores are predicted by drawing at random from their predictive distribution. The proportion of non-zero outcomes is predicted in the frequentist approach as:

(8.5) (8.6)

where:

(8.7)

A Bayesian solution consists of predicting the indicators I(y>0) by drawing at random from their predictive distribution.

Methods for estimating fixed and random effects when fitting LMMs or generalized LMMs alone have been developed in the last two decades in the frequentist and the Bayesian paradigms. These methods make it possible to compute estimators of the MSE or the Bayes risk of the small-area predictors that account for hyper-parameter estimation to the correct order (see Rao, 2003; Jiang and Lahiri, 2005). The use of Bayesian methods requires specification of prior distributions for the fixed parameters underlying the two-part model.
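As a point of reference, a two-part model of the kind described above is commonly written as follows. The symbols β, γ and the variance parameters are assumed notation for illustration only and may differ from the notation used in equations (8.2) to (8.4).

```latex
% Decomposition of the conditional mean into two parts
E(y_{ij} \mid Z_{ij} = z) = E(y_{ij} \mid Z_{ij} = z,\; y_{ij} > 0)\,\Pr(y_{ij} > 0 \mid Z_{ij} = z)

% Random-intercept model for the positive outcomes
y_{ij}^{+} = x_{ij}^{+\prime}\beta + v_i + e_{ij}, \qquad
v_i \sim N(0, \sigma_v^2), \quad e_{ij} \sim N(0, \sigma_e^2)

% Generalized LMM with logit link for the probability of a positive outcome
\operatorname{logit}\{\Pr(y_{ij} > 0 \mid x_{ij}, u_i)\} = x_{ij}^{\prime}\gamma + u_i, \qquad
u_i \sim N(0, \sigma_u^2)
```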
With the aid of MCMC simulations, the application of this approach permits sampling from the posterior distribution of the fixed parameters and the random effects, and hence sampling from the predictive distribution of the unobserved responses. The approach therefore yields the whole posterior distribution of the small-area parameters of interest, thereby enabling computation of correct MSE or posterior variance measures, or of confidence intervals, that account for all the sources of variation (Pfeffermann et al., 2008). The MSE or Bayes risk is estimated by computing the empirical variance of the sampled values. Credibility intervals with coverage rates of (1 − α) are defined by the α/2 and (1 − α/2) level quantiles of the empirical posterior distribution.

Dreassi et al. (2013) suggest a hierarchical Bayesian approach to SAE for dealing with semi-continuous, skewed and spatially structured data, which occur frequently in agricultural applications. None of the methods mentioned earlier appear to be directly applicable to this problem, however, because of the nature of the response variable: its distribution is zero-inflated, highly skewed for the non-zero values and presents a spatial trend. To describe these features, a suitable extension to current methods is proposed that considers the highly skewed distribution of the positive responses. This justifies the choice of the gamma model in the second part of the model, whose effectiveness is confirmed by the results. In the SAE framework, the skewness of data is usually treated using the log-normal distribution. Because it is highly flexible, the gamma distribution could be a valid alternative (see for example Firth, 1988). When the target variable shows a spatial trend, appropriate use of geographical information makes it possible to achieve more accurate SAE. Further investigation is needed, even though the suggested approach provides encouraging results: in particular, the conditions that make the full two-part model preferable to the separate ones need to be evaluated.
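The posterior summaries described in this section (point estimate, empirical variance of the sampled values, and quantile-based credibility interval) can be computed directly from a vector of posterior draws of a small-area parameter. The sketch below is a minimal illustration; the function name is hypothetical and the normal draws merely stand in for genuine MCMC output.

```python
import numpy as np


def posterior_summary(draws, alpha=0.05):
    """Summarize posterior draws of one small-area parameter."""
    draws = np.asarray(draws, dtype=float)
    point = draws.mean()           # posterior mean as point estimate
    post_var = draws.var(ddof=1)   # empirical variance of the sampled values
    # Credibility interval from the alpha/2 and (1 - alpha/2) quantiles.
    lower, upper = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return point, post_var, (lower, upper)


# Stand-in for MCMC output: 5000 hypothetical draws of a small-area mean.
rng = np.random.default_rng(0)
draws = rng.normal(loc=10.0, scale=2.0, size=5000)
point, post_var, (lower, upper) = posterior_summary(draws)
```

Because the interval is read off the empirical quantiles, it needs no normality assumption and can be asymmetric around the point estimate when the posterior is skewed.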
8.3 Frequentist SAE for zero-inflated data

The frequentist approach is similar to the Bayesian approach: the difference lies in the parameter estimation of the two models (the linear model for positive outcomes and the model for zero and non-zero outcomes) and in the estimation of the MSE. The models involved in the estimation of small-area means for zero-inflated data are a linear random-effect model for the positive outcomes and a model for the indicator of a positive outcome, a binary variable assumed to follow a generalized LMM with logit link function. The model that links the probability of positive values with the covariates contains a vector of unknown fixed-effect parameters and the random area effect u_i associated with area i, which is assumed to be normal with zero mean and constant variance. The model for positive outcomes contains its own vector of unknown fixed-effect parameters and the random area effect v_i associated with area i, which is also assumed to be normal with zero mean and constant variance; x+_ij is the vector of auxiliary variables for positive-outcome unit j in area i, and p_ij is the probability of observing a positive outcome. In the frequentist approach, it is difficult to take into account a non-zero correlation between u_i and v_i, so they are considered uncorrelated. An estimate of the unknown parameters and the variance components of the linear random-effect model can be obtained by maximum likelihood or restricted maximum likelihood estimation, while the generalized LMM parameters and random effects can be estimated with the penalized quasi-likelihood method combined with restricted maximum likelihood estimation (Saei and Chambers, 2003). An approximately model-unbiased estimate of the small-area mean is:

(8.8)
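One common form of such a predictor, assumed here for illustration rather than taken from (8.8), adds the observed sample values to predicted values for the non-sampled units, each prediction being the product of the estimated probability of a positive outcome and the estimated mean of the positive outcomes. The sketch also assumes that unit-level predictions are available for the non-sampled units; the function name and the numbers are hypothetical.

```python
def two_part_area_mean(y_sampled, p_hat_nonsampled, mu_hat_nonsampled):
    """Plug-in small-area mean under an assumed two-part model.

    y_sampled:         observed values (zeros included) for units in s_i
    p_hat_nonsampled:  estimated P(y > 0) for each unit in r_i
    mu_hat_nonsampled: estimated mean of y given y > 0 for each unit in r_i
    """
    if len(p_hat_nonsampled) != len(mu_hat_nonsampled):
        raise ValueError("one probability and one mean are needed per unit")
    area_size = len(y_sampled) + len(p_hat_nonsampled)  # N_i = n_i + (N_i - n_i)
    predicted_total = sum(p * mu for p, mu
                          in zip(p_hat_nonsampled, mu_hat_nonsampled))
    return (sum(y_sampled) + predicted_total) / area_size


# Hypothetical area: four sampled units and two non-sampled units.
area_mean = two_part_area_mean(
    y_sampled=[0.0, 4.0, 0.0, 6.0],
    p_hat_nonsampled=[0.5, 0.25],
    mu_hat_nonsampled=[8.0, 4.0],
)
```

In this example the sampled total is 10 and the predicted non-sampled total is 0.5 × 8 + 0.25 × 4 = 5, so the predicted mean over the six units is 2.5.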


More information

Directorate E: Sectoral and regional statistics Unit E-4: Regional statistics and geographical information LUCAS 2018.

Directorate E: Sectoral and regional statistics Unit E-4: Regional statistics and geographical information LUCAS 2018. EUROPEAN COMMISSION EUROSTAT Directorate E: Sectoral and regional statistics Unit E-4: Regional statistics and geographical information Doc. WG/LCU 52 LUCAS 2018 Eurostat Unit E4 Working Group for Land

More information

Selection of small area estimation method for Poverty Mapping: A Conceptual Framework

Selection of small area estimation method for Poverty Mapping: A Conceptual Framework Selection of small area estimation method for Poverty Mapping: A Conceptual Framework Sumonkanti Das National Institute for Applied Statistics Research Australia University of Wollongong The First Asian

More information

EVALUATING THE COST- EFFICIENCY OF REMOTE SENSING IN DEVELOPING COUNTRIES

EVALUATING THE COST- EFFICIENCY OF REMOTE SENSING IN DEVELOPING COUNTRIES EVALUATING THE COST- EFFICIENCY OF REMOTE SENSING IN DEVELOPING COUNTRIES Elisabetta Carfagna Research Coordinator of the Global Strategy to Improve Agricultural and Rural Statistics - FAO Statistics Division

More information

Discussion paper on spatial units

Discussion paper on spatial units Discussion paper on spatial units for the Forum of Experts in SEEA Experimental Ecosystem Accounting 2018 Version: 8 June 2018 Prepared by: SEEA EEA Revision Working Group 1 on spatial units (led by Sjoerd

More information

The Global Statistical Geospatial Framework and the Global Fundamental Geospatial Themes

The Global Statistical Geospatial Framework and the Global Fundamental Geospatial Themes The Global Statistical Geospatial Framework and the Global Fundamental Geospatial Themes Sub-regional workshop on integration of administrative data, big data and geospatial information for the compilation

More information

Calculating the Natura 2000 network area in Europe: The GIS approach

Calculating the Natura 2000 network area in Europe: The GIS approach Calculating the Natura 2000 network area in Europe: The GIS approach 1. INTRODUCTION A precise area calculation is needed to check to what extent member states have designated Natura 2000 sites of their

More information
