Contributors: France, Germany, Italy, Netherlands, Norway, Poland, Spain, Switzerland


ESSNET ON SMALL AREA ESTIMATION
REPORT ON WORK PACKAGE 5 CASE STUDIES
FINAL VERSION REVISION FEBRUARY 1
Contributors: France, Germany, Italy, Netherlands, Norway, Poland, Spain, Switzerland

Contents

1. Overview of case studies
   Background
   Outline
   Methods
   Software
   Results
   Discussion
2. Case study reports
   France
   Germany
   Italy
   Netherlands
   Norway
   Poland
   Spain
   Switzerland

1. Overview of case studies

Case studies have been conducted by seven NSIs partnering in the ESSnet on Small Area Estimation (SAE). The present document reports on these case studies. This chapter gives an overview of the case studies described in the report.

Background

Italy
Istat has experience in small area estimation. Since 1 SAE estimates have been disseminated for the annual averages of employed and unemployed counts at local labour market area level, based on the Labour Force Survey. Furthermore, Istat has contributed to the development of a web-based system (SMART) for the production of SAE estimates for research purposes.

France
The French experience in SAE is limited. This is the first time that small area estimation has been applied to the Labour Force Survey in France.

Norway
Domain mortality rates are needed in order to calculate life expectancy at a disaggregated population level. In the domain of small areas, confidence intervals of life expectancy are calculated at three regional levels in Norway, with the municipality as the most detailed level of aggregation.

Poland
The statistical office of Poland (GUS) has experience with small area estimation. Research has been done on the application of small area estimation to measure the extent of unemployment, poverty and household structure, as well as on its application in agriculture-related surveys. This research has resulted in a number of publications. Small area estimation is not yet used for official statistical output, only for experimental studies.

Spain
INE started to use SAE in 1, when it participated in the FP5 EU project EURAREA. This project was followed by the MODEAP project, created by INE and UMH to adapt and apply some model-assisted and model-based estimators to the LFS in order to estimate totals of unemployed and employed people and unemployment rates by gender and NUTS level 4. They introduced and computed resampling methods to evaluate the design-based MSE of model-based estimators (a two-stage bootstrap and a delete-one-cluster jackknife). After that they were involved in the application of estimators based on area-level multinomial logit mixed models. They also applied some calibration techniques.

Switzerland
The practical experience of the Federal Statistical Office (FSO) in Switzerland is limited. As a first step, small area estimators are tested by means of a simulation study. To this end small area estimation is applied to a (former) census.

Netherlands
Statistics Netherlands has experience with small area estimation. Small area estimation is implemented in the production process of the Dutch Labour Force Survey (LFS). In this case study the application of small area estimation to the National Safety Monitor (NSM) is investigated.

Outline of the case studies

Italy
For the case study Istat has considered the Health Survey (HS). HS data are collected every five years and the target population consists of all resident individuals, excluding the population in collective households. The domains are given by 188 Health Districts (HD), which are administrative partitions of the Italian regions defined for the administration and allocation of health funds and expenses. The parameters to be estimated at HD level are the population counts of (1) people who had at least one specialist visit, (2) obese people, (3) women aged from 5 to 69 who had at least one mammography and (4) women aged from 5 to 69 who had at least one pap test. Since direct estimates seem to be reliable for parameter (4), SAE methods are only used for the estimation of parameters (1), (2) and (3). For the small area estimation, auxiliary information is taken from demographic statistics at municipal level, given by population counts cross-classified by age class and sex. Other available covariates are estimated from the LFS: educational level and household type counts. For the case study HS data are used from 5. The population contains 58,14,88 persons and the sample size is 18,4. The sample sizes of the small areas vary from 43 to 496.

France
The French Labour Force Survey (LFS) is generally not sufficient to allow local estimation of unemployment.
The aim of this case study is to test other methods and the possibility of a quality assessment of the estimation of unemployment in France. The survey used in this study is the Labour Force Survey. The scope of the quarterly LFS 7 is defined as all people who are 15 years or older at the end of December 7 and who are living in an ordinary household in France. The areas are the ZE (zones d'emploi), of which there are 348. The sample size is not equal across areas; it depends on the size of the cities included in the ZE. At the extremes, there are 3 ZE with a sample size between 1 and 1 respondents and 3 ZE with a sample size larger than 15 respondents. The auxiliary data are extracted from six different sources: the Census 7, the DEFM (Demandes d'Emploi en Fin de Mois) from Pôle Emploi, a specific database with specific economic indicators, the Tabard typology (a typology of geographical units), the RMI (Revenu Minimum d'Insertion) base and finally the ZUS (areas considered as sensitive). 18 auxiliary variables are considered. The preselection of variables is based on a standard regression model at the domain level. First a Pearson correlation analysis is used and then a stepwise selection is applied. The list of auxiliary variables is thereby reduced to the seven most explanatory ones, in addition to an intercept. Those seven variables are: the proportion of people who spontaneously declare to be looking for a job, the proportion of people who declare to live alone, the proportion of males registered at PE aged between 15 and 19 with a high level of study, the proportion of males aged between 3 and 49 with a low level of study, the proportion of males aged between 5 and 64 with a low level of study, the proportion of people in the ZE living in the area defined by Tabard code number 7, and finally the difference between the flows of factories starting and ending an activity between and 6, divided by the number of active factories in the ZE in 6.

Norway
Two objectives are considered here, a statistical one and a theoretical one. The statistical aim of the study is the investigation of domain mortality rates for the calculation of life expectancy at a low aggregation level, and the theoretical aim is to develop extensions to the basic smoothing approach to account for potential clustering (cohort) effects that may exist in the population. This analysis is a special case in small area research, because it does not use a survey but registers. The mortality data (numbers of deaths) are compiled from the death records over the five-year period from 4-8. Three dimensions are necessary for the classification of the domains: the area of interest, the sex of the individuals and the age. The areas are the municipalities, of which there are 43. The target quantity is the theoretical life expectancy at birth, which in this study depends on the estimates of the domain mortality rates. Sex and age are considered as covariates.
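As an illustration of how life expectancy at birth depends on a set of domain mortality rates, the following is a minimal sketch of a standard period life table calculation. It is not taken from the Norwegian report: the function name, the single-year age groups and the assumption that deaths occur on average halfway through each year are choices made here for illustration only.

```python
def life_expectancy_at_birth(rates):
    """Period life expectancy at birth from single-year mortality rates.

    rates[x] is the mortality rate at age x; the last entry is treated as
    an open-ended age group. Deaths are assumed to occur, on average,
    halfway through each one-year interval (an assumption of this sketch).
    """
    survivors = 1.0      # l_x: share of the cohort alive at exact age x
    person_years = 0.0   # running sum of L_x, the person-years lived
    for m in rates[:-1]:
        q = m / (1.0 + 0.5 * m)            # probability of dying within the year
        deaths = survivors * q
        person_years += survivors - 0.5 * deaths
        survivors -= deaths
    # Open-ended last group: survivors live 1/m further years on average.
    person_years += survivors / rates[-1]
    return person_years


# Hypothetical example: a constant mortality rate of 0.02 at every age.
e0 = life_expectancy_at_birth([0.02] * 101)
```

Because the life expectancy is a non-linear function of all the age-specific rates, any smoothing applied to the domain mortality rates carries through to the estimated life expectancy of the domain.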
Poland
GUS has focused its case study on the application of small area estimation to the Polish Labour Force Survey (LFS). The goal is to estimate the percentage of unemployed persons in the population aged 15 years or older at the NUTS3 level. The NUTS3 level consists of 66 regional domains, which form the small areas in the case study. The target variable is a binomial variable that measures unemployment as defined by the International Labour Organization (ILO unemployment). For the case study the following data sources were available: the Register of Unemployment, the Vital Statistics Register and the Tax Register. These registers provide auxiliary information about age, sex, place of residence (rural/urban), volume and directions of commuter traffic, income and tax payments. All these variables are available at individual level. The data used in the case study refer to the first quarter of the 8 LFS survey wave. The sample sizes of the 66 domains vary from 15 to 1375 respondents.

Spain
The objective of the Census is to determine the basic structure of the population for each municipality. The aim of this study is to evaluate the census estimator, along with some small area estimators that may be an alternative when the municipality sample size is too small, by carrying out simulation experiments that allow the municipality estimates to be assessed. The 11 Census of Population and Housing is a combination of a pre-census file (PCF), a comprehensive 11 Building Census to allow georeferencing of all buildings, and a sample survey. The pre-census file is composed of the Municipal Population Register and information provided by the Home Office, the Social Security, the Tax Agency and other sources (i.e., the register of educational levels of the Ministry of Education). The size of the sample survey is relatively large in order to capture the characteristics of individuals and households, but also to comply with the coverage regulations established by Eurostat. Small areas are not given by geographical areas but by subdomains (i.e., categories whose relative size in the population is significantly small and which could not be included as planned domains). They simulate the census survey as closely as possible and, for one of the two regions, they have implemented a systematic sampling design for each municipality. First calibration groups are constructed, and then the census estimator is constructed as a Generalized Regression Estimator with a standard regression model fitted to the municipality sample. After calibration there are still 453 samples. They performed Monte Carlo simulation to evaluate small area estimators for municipality population totals by educational level, by marital status, or of foreigners by nationality. All these variables are considered as auxiliary variables.

Switzerland
In order to test small area estimation the FSO has done a simulation study based on a census from. The census was held among persons aged 15 years or older and it measures demographic, spatial, social and economic developments in Switzerland. The small areas are defined by the 896 communes in Switzerland. In the simulation study the communes are classified into seven size classes according to the number of inhabitants. The target parameters are the proportions of active persons (in the labour market) for all the communes. The auxiliary information is also available from the census and is known for the whole population.
The available covariates are age (in four classes), sex, nationality, civil status and NUTS (regional division of the country). All variables are categorical and are available at individual level. The simulation study is based on the census from. The population contains 6,43,35 persons. There are drawn 1 samples of size, from this population. The sample sizes of the communes vary from to 1,598 persons.

Netherlands
Statistics Netherlands has studied the application of small area estimation to the National Safety Monitor (NSM). The NSM is an annual survey that provides information on crime victimization, public safety and satisfaction among persons aged 15 years or older residing in the Netherlands. The small areas are given by the police districts, a subdivision of the Netherlands into 5 regions. For the case study five target parameters are considered. One of them is measured on a ten-point scale (perceived nuisance) and the others are given as percentages (victim of property crime, victim of violent crime, satisfaction with the police and feelings of unsafety). The auxiliary information is available from four different sources: 1) demographic information given by municipal administrations, like gender, age, level of urbanization and mean house prices; 2) information about reported crimes and offences provided by the Police Register of Reported Offences; 3) sample information for the NSM from preceding periods; 4) estimates on crime victimization from other surveys, in this case the Integrated Safety Monitor (ISM), which is conducted in parallel with the NSM. Most of the auxiliary information is given at the area level. For the case study NSM data are used from the years 8 and 9. In both years there are about 6 respondents, who are equally divided over the domains, so there are about 4 respondents per area.

Methods

Italy
The investigated small area estimators are the EBLUP based on a linear mixed model (LMM) and on a generalized linear mixed model (GLMM), namely the logistic mixed model. Both area-level and unit-level estimators were considered. The MSE is estimated on the basis of the analytic mean cross product error, including the components g1, g2, g3 and g4. Model properties have been assessed by means of bias diagnostics, in which the model-based estimates are compared with direct estimates. Model selection is done by considering diagnostics like the maximum likelihood, AIC and BIC. For each of the target parameters the cross-classification of the covariates sex and age is used, but with different age classes. In order to find the optimal classification of the variable age, a non-parametric method based on spline functions and CART (Classification and Regression Tree) have been used.

France
The Fay-Herriot model and Poisson models are considered in this study. For the Fay-Herriot model, the dependent variable considered is a dummy variable equal to 1 if the individual is jobless and 0 otherwise. Two situations are considered here: the case of a sample size > and a second situation where the sample size > 49. In the Poisson model, all the ZE are considered (there is no restriction on the size of the sample). The criteria applied for quality assessment are AIC and BIC. They also consider graphs of the residuals and of the predicted random effects, and they compute tests to analyse the sensitivity of the estimation.

Norway
They consider a basic smoothing approach. The composite estimator is defined by a Poisson-Gamma distribution.
Under the basic smoothing approach, they first deal with separate subsets of domains, so one may refer to smoothing by age or by area. The composite estimator may coincide with the Empirical Best predictor (EB); in this case they assume a Gamma distribution of the composite estimator. An alternative to the direct modelling proposed is to consider the composite estimator as the mean of the within-domain individual RR. The MSE of the composite estimator does not take into account the estimation uncertainty of the estimated H². Moreover, given that the MSE is a non-linear function of H², the variance of the estimated H² will turn into a bias of the direct plug-in estimator. To deal with those two problems, they propose the unified jackknife MSE estimator. Over-shrinkage may also be addressed by a method that pays attention to all the properties of the composite estimator. Under the basic smoothing approach, the estimation is carried out in separate subsets of domains, also referred to as clusters of domains. They develop a variance component model for the domain RR, which allows for potential cluster-specific random effects.
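The way a Poisson-Gamma model leads to a composite estimator can be sketched as follows. This is a textbook illustration rather than the Norwegian implementation: it assumes that the number of deaths in a domain is Poisson with mean E times theta, where E is the expected number of deaths and theta the domain relative risk, and that theta follows a Gamma distribution with known shape and rate. In the case study the hyperparameters are estimated, which is exactly the source of the extra MSE uncertainty discussed above; here they are simply taken as given, and all names are hypothetical.

```python
def poisson_gamma_composite(deaths, expected, shape, rate):
    """Posterior-mean relative risk per domain under
    deaths ~ Poisson(expected * theta), theta ~ Gamma(shape, rate).

    Returns a list of (direct, weight, composite) tuples, where
    composite = weight * direct + (1 - weight) * prior_mean.
    """
    prior_mean = shape / rate
    results = []
    for y, e in zip(deaths, expected):
        direct = y / e              # direct relative risk (an SMR)
        weight = e / (e + rate)     # larger domains rely more on their own data
        composite = weight * direct + (1.0 - weight) * prior_mean
        results.append((direct, weight, composite))
    return results


# Hypothetical data: a small domain with no deaths and a larger domain.
smoothed = poisson_gamma_composite(
    deaths=[0, 30], expected=[2.0, 25.0], shape=10.0, rate=10.0
)
```

The composite value equals the posterior mean (shape + deaths) / (rate + expected), so the small domain with zero observed deaths is pulled strongly towards the prior mean instead of receiving a relative risk of zero.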

Poland
In the case study, synthetic small area estimators are considered, based on a linear mixed model (LMM) at unit level and on a linear mixed model at area level with a pooled sample estimate of the within-area variance. Furthermore, the EBLUP is used, based on both a unit-level and an area-level linear mixed model. An EBLUP estimator using spatial correlation and a synthetic population-level estimator are also considered. The small area estimators are compared to the direct estimator (Horvitz-Thompson estimator) and the Generalized Regression Estimator (GREG). The selection of the covariates is done by stepwise selection starting from the complete set of available covariates. The goodness-of-fit measures used are the root of the MSE estimate, R², adjusted R², the dependent mean and the coefficient of variation.

Spain
They evaluate the following estimators for each municipality: (1) the Census estimator, (2) the Broad Area Ratio Estimator (BARE), which is also called the ratio-synthetic estimator, and (3) an estimator based on post-stratification, called the count-synthetic estimator. To evaluate how well the proposed small area estimators estimate the municipal totals of the target variables, K samples are simulated in the selected region and the following performance criteria are considered: (1) the percentage relative bias of the small area estimator in municipality M (average relative bias); (2) the mean of the relative bias; (3) the percentage relative root mean square error of the small area estimator in M; (4) the mean of the relative root mean square error; (5) the comparison between the true values of the target variables and the mean of the model-based estimates over the simulated samples.

Switzerland
The FSO has considered different small area estimators based on mixed models in which the explanatory variables are treated as fixed effects and the communes as random effects.
Five estimators are based on a linear mixed model: the EBLUP defined at unit level and at area level, the synthetic estimator defined at unit level and at area level, and the You-Rao estimator defined at unit level. Furthermore, there are two estimators based on a generalized mixed model: the Empirical Best Predictor defined at area level and the Binomial Predictor at unit level, both based on a logistic mixed model. For all models the same covariates are used. The choice of the covariates is made by means of a linear area-level model studied over the whole population. Model evaluation is done by investigating the significance of the regression coefficients and the R² coefficient, and by plotting the Studentized residuals against the predicted values. There was a constraint on the number of covariates used in the model: due to the amount of computation, at most 13 parameters could be used.
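The area-level EBLUP that recurs in several of these case studies can be sketched as a weighted combination of the direct estimate and a regression-synthetic estimate. The code below is a simplified illustration, not the implementation used by any of the NSIs: it assumes a basic area-level (Fay-Herriot type) model with an intercept and a single covariate, treats the sampling variances as known, and takes the model variance as given instead of estimating it. All names and numbers are hypothetical.

```python
def area_level_predict(direct, x, sampling_var, model_var):
    """Area-level predictor with one covariate and a given model variance.

    Returns a list of (gamma, synthetic, estimate) tuples, where
    estimate = gamma * direct + (1 - gamma) * synthetic and
    gamma = model_var / (model_var + sampling_var).
    """
    # Weighted least squares fit, weights 1 / (model_var + sampling_var).
    w = [1.0 / (model_var + psi) for psi in sampling_var]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, direct)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, direct))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    out = []
    for yi, xi, psi in zip(direct, x, sampling_var):
        synthetic = intercept + slope * xi
        gamma = model_var / (model_var + psi)   # weight of the direct estimate
        out.append((gamma, synthetic, gamma * yi + (1.0 - gamma) * synthetic))
    return out


# Hypothetical direct estimates, covariate values and sampling variances.
direct = [0.08, 0.12, 0.05, 0.10]
x = [0.07, 0.10, 0.06, 0.11]
psi = [0.0004, 0.0001, 0.0009, 0.0002]
shrunk = area_level_predict(direct, x, psi, model_var=0.0004)
collapsed = area_level_predict(direct, x, psi, model_var=0.0)
```

With a positive model variance, areas with a small sampling variance keep most of their direct estimate. With a model variance of zero every weight gamma is zero and all areas receive the purely synthetic estimate.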

Netherlands
Since most of the auxiliary information is only available at the area level, Statistics Netherlands has based the small area estimation for the NSM on the basic area-level model, which is a linear mixed model. In the case study two estimators are considered, the Empirical Best Linear Unbiased Predictor (EBLUP) and the Hierarchical Bayesian (HB) approach. For the EBLUP the model variance is estimated as zero, leading to synthetic estimates for all the domains. This is undesirable and it probably also leads to an underestimation of the mean square error (MSE). For this reason the HB approach is used for the small area estimates and the MSE estimates. The covariates for the models are selected through a step-forward variable selection procedure. The conditional Akaike Information Criterion (cAIC) and the leave-one-out cross-validation measure (LOO) are used as comparison measures to select the most suitable models. In order to reduce the dimension of the covariate space, Principal Component Analysis (PCA) is applied.

Software

The software packages used for the case studies are R and SAS: R is used by Italy, Norway, Switzerland and the Netherlands, and SAS is used by France, Poland, Spain and Switzerland.

Results

Italy
By investigating the coefficient of variation (cv) of the model-based estimates, it can be concluded that the small area estimation methods seem to perform well. For most of the areas the values of the cv are below 16.5%. The model-based estimates are compared with the direct estimates by plotting the point estimates of the direct and the model-based methods against each other. This shows that there is a bias for all models. Furthermore, confidence intervals of the model-based estimates have been computed. They are displayed in plots sorted by area size. Only for the logistic model is there an effective reduction of the confidence interval size. As expected, the confidence intervals derived from the logistic model are wider than the corresponding intervals from the linear models.
The spatial distribution of the model-based estimates is investigated by making two-dimensional and three-dimensional geographic plots. The synthetic and the EBLUP estimates give opposite results, but the EBLUP estimates are preferred. This is because the synthetic estimators, being affected by a strong shrinkage effect, are extremely sensitive to minor changes. Finally, the distributions of the residuals and of the random effects are investigated. The distributions seem to be close to normal, but the random effects display spatial autocorrelation.

France
The Fay-Herriot model: The fitting of the model was restricted to the ZE where the sample size was greater than or equal to 5 respondents. The first step of the analysis was the comparison of the quality of the models defined by different sets of variables. The selected model is defined by five auxiliary variables: the proportion of people who are looking for a job, the proportion of people who declare to live alone, and three other variables that give an indication of the characteristics of the respondent (sex, education, age). Given this model, they find that 9% of the estimated proportions of BIT jobless people lie between .6% and 6.5%. This has to be compared with the estimated proportions given by direct estimation; in this case, 9% of the estimated proportions lie between % and 1%. They also find that in about one half of the ZE, the contribution of the direct estimator from the LFS is more than % in the final EBLUP. There is a positive correlation between the size of the ZE and the contribution of the direct estimator.

The Poisson model: All the ZE with no respondents are excluded from the analysis. Given the strong influence of the sample size in the FH model, they tried here to compare different configurations (the sample size bigger than or bigger than 49 and so on) using a full mixed model with overdispersion. They find that the more they restrict the set of ZE, the larger the gap between the direct and the small area estimates, which may imply more bias. The overdispersion parameter also decreases, which may imply a smaller variance for the small area estimators. Therefore they fit the model on the whole set of ZE. The model selection also suggests the use of a different model with more variables. In this case, an explanatory variable on the economic situation of the country also plays a role in the estimation of unemployment. They also consider the classical Poisson model and the full mixed model. The full mixed model does not seem to bring better results than the classical one.

Norway
First of all they checked the effect of sex on cohort mortality rates. They find that the mortality rate seems to be higher for male newborns. Between the ages of 1 and 15 there is hardly any difference in the mortality rates due to sex. Between 15 and, the male mortality rate increases to almost double the female mortality rate. The difference remains constant between 3 and 4 years, with both rates increasing with age.
Finally, from age 6 towards the end of life, there is a steady acceleration in the growth of male mortality. The results are presented in different gamma Q-Q plots. The gamma distribution chosen to match the empirical variance of the set of component estimators does not fit well in the case of the basic smoothing approach and the variance homogeneity assumption. Moreover, the parametric Gamma model does not fit the data well enough in most clusters, so the semi-parametric approach to the composite estimator is preferred. With the basic smoothing approach the variance of the SMR, H², varies considerably among cohorts between age and, and there is no variation among other cohorts. To stabilize the estimated H², a moving average is applied instead of the direct within-cluster estimate. The moving average estimator is more robust against outlying domains. For the MSE estimation, they find that in the worst case there may be 3% underestimation, while the median value of the MSE ratio is very close to 1. They also conclude that it is acceptable to use the plug-in MSE estimator.

Poland
In the case study the point estimates of the different estimators (model-based and direct) are computed and compared with each other and with the registered unemployment from the Register of Unemployment. Furthermore, a geographical plot is made of the direct estimates for all NUTS3 domains.

Spain
Based on the coefficient of variation defined by the RMSE, the accuracy of the census estimator is acceptable for the majority of the municipalities. The small areas considered here are not geographical areas but subdomains (corresponding to categories whose relative size in the population is significantly small). Based on the ratio RB (between the absolute value of the bias and the RMSE), they find that the census estimator is approximately design-unbiased. The synthetic estimates (BARE and POSS) outperform the direct estimates (Census estimates) in those municipalities where the relative sizes are similar to those in the broad area (province). To improve the synthetic estimators (that is, to reduce their bias for small areas whose behaviour differs from that of the province), they grouped areas with similar mean values. The new variables considered are demographic variables (percentage of people belonging to 4 age groups, and percentage of foreigners) and economic variables (total income tax per household, percentage of investment income tax and percentage of agrarian income tax). The BARE estimates are then based on the new clusters of small areas. The alternative BARE (CBAR) estimates based on clusters improve the estimates more often than the BARE based on the province. The results from the quality diagnostics confirm that the Census estimator is unbiased (except for Nationality: UE). Differences may also be observed between the BARE and CBAR bias for Educational level and Marital status. This may be explained by the presence of outliers.

Switzerland
All available covariates turned out to be significant and they improve the estimates of the proportion of active persons. Due to the amount of computation the number of covariates used had to be limited to 13. To this end a hybrid form of the covariates NUTS3 (6 levels) and NUTS2 (7 levels) is used, consisting of nine classes. The other selected variables are age in four classes and sex. No interaction terms are involved.
In the simulation framework the quality of the small area estimators is investigated. The quality measures considered are the bias, the standard deviation and the root mean square error (rmse). These measures are investigated across the communes, which are divided into seven size classes. For each size class the median, the upper quartile and the maximum are computed and a box plot is drawn. In general the model-based estimators lead to much smaller rmse values than the GREG for all size classes. The difference decreases strongly with the commune size. The maximum values of the rmse can be higher for the model-based estimators. For all estimators the rmse decreases with the commune size. For the model-based estimators this decrease is mainly driven by the bias and for the GREG by the standard deviation. The model-based estimators show a similar performance. Hence, while the estimators based on the binomial random intercept model seem to be more appropriate for the binomial target variable, they do not outperform the other small area estimators. In the simulation framework the behaviour of some quality measures is also investigated for the situation where only one sample is available. This is done by comparing the rmse estimated by analytic formulas with the simulated rmse. The analytic estimates are only computed for the linear models, not for the binomial models. For the GREG the estimator used for the standard deviation turns out to be almost (design) unbiased, so it can be used to construct valid confidence intervals. For the model-based estimators the analytic rmse estimates are much larger than the simulated rmse estimates for smaller communes. The difference becomes smaller for larger communes and tends to zero for the largest communes. Finally, the bias based on single samples is investigated by means of the regression technique. This technique gives a good indication of the size of the shrinkage effect.

Netherlands
For each of the five target variables and for both years an optimal model of covariates is chosen. The PCA method to reduce the dimension of the covariate space turned out not to work sufficiently well and is therefore not used in the selection process of the covariates. Most of the models include covariates from the parallel Integrated Safety Monitor. For the property crimes only auxiliary information from the police registers is used. The HB estimates are compared with the generalized regression estimator in terms of the coefficient of variation (cv). For all target variables the cv is reduced by HB, with reductions varying from 33% to 55%. Furthermore, some model diagnostics were computed. The HB (point) estimates are plotted against the direct estimates to demonstrate that the SAE estimates are smoothed compared to the direct estimates. Normal Q-Q plots are drawn to show that the residuals are normally distributed. Finally, the spatial distribution of the estimates is investigated by geographical plots of the small area estimates.

Discussion

France
The main result obtained in this case study is the important effect of the size of the ZE on the different estimations. In the case of the Poisson models, the full mixed model does not bring better results than the simplified Poisson model. Finally, as presented in the FH analysis, they conclude that a serious improvement in the local accuracy of unemployment statistics can be expected relative to the direct estimation strategy.

Norway
In comparing the different methods to calculate the estimates of municipality life expectancy at birth, the methods do not show different patterns for the two sexes, so the methodology does not need to differ according to sex.
The basic smoothing approach retains too much over-shrinkage to be useful in reality. The neighbouring variance homogeneity assumption is more plausible than the global homogeneity assumption. Smoothing leads to a considerable reduction of the range of the estimated municipality life expectancy compared to direct estimators. Smoothing also yields gains in efficiency for the smaller municipalities.

Switzerland
Several small area estimators are investigated on the basis of a simulation study. From the simulation study it is observed that the performance of the model-based estimators strongly depends on the predictive power of the underlying model, so model choice is a crucial point. For the small area estimators the estimated bias is much higher than the estimated standard deviation. In the combined small area estimators the synthetic part turns out to be dominant. The GREG shows the opposite behaviour, i.e., large standard deviation and small bias. In terms of rmse the model-based estimators show in general a considerable gain in precision compared to the GREG. The model-based estimators show a similar performance with regard to the rmse. Hence, while the estimators based on the binomial random intercept model seem to be more appropriate for the binomial target variable, they do not outperform the other small area estimators. Finally, the rmse calculated from the simulation study is compared with the analytic rmse estimates. From this it can be concluded that, at commune level, the mse obtained from an mse estimate for a model-based estimator cannot be interpreted in the same way as the mse for a direct estimator or the mse obtained in a simulation study. The question is how to find similar precision requirements for model-based estimators.

Netherlands
From the perspective of increasing the precision of the direct estimates, the case study performed by Statistics Netherlands can be called successful, for the margins are reduced by approximately 4%. The model selection and estimation methods have also performed well, with no indication of model misspecification. The report ends with some further guidelines. It is advised to carry out some sensitivity analysis of the model selection and to investigate the use of covariates with sampling errors. Furthermore, it is recommended to develop a policy for the publication of model-based estimates, since publishing model-based estimates is still uncommon at NSIs.
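The simulation-based quality measures used in the Spanish and Swiss studies (bias, standard deviation, rmse and their relative versions) can be sketched as follows for a single small area. This is an illustrative sketch with hypothetical names and numbers, not code from either study. It assumes that the true value of the target parameter is known from the census and that one estimate is available per simulated sample.

```python
import math


def simulation_quality(estimates, true_value):
    """Monte Carlo quality measures of one estimator for one small area.

    estimates: one estimate per simulated sample.
    true_value: the known population value of the target parameter.
    """
    k = len(estimates)
    mean_est = sum(estimates) / k
    bias = mean_est - true_value
    # Variance over the simulated samples (divisor k, so that
    # mse = bias**2 + variance holds exactly).
    variance = sum((e - mean_est) ** 2 for e in estimates) / k
    mse = sum((e - true_value) ** 2 for e in estimates) / k
    return {
        "bias": bias,
        "sd": math.sqrt(variance),
        "rmse": math.sqrt(mse),
        "relative_bias_pct": 100.0 * bias / true_value,
        "relative_rmse_pct": 100.0 * math.sqrt(mse) / true_value,
    }


# Hypothetical estimates of a proportion whose true value is 0.50.
quality = simulation_quality([0.50, 0.52, 0.54, 0.56], true_value=0.50)
```

The decomposition of the mse into squared bias plus variance is what makes it possible to say that the rmse of a model-based estimator is dominated by its bias, whereas the rmse of the GREG is dominated by its standard deviation.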

2. Case study reports
France
Germany
Italy
Netherlands
Norway
Poland
Spain
Switzerland

Case study report about Labor force survey in France (Insee)

I) Background
Information about jobless people is a strong local asset. The French national statistical institute (Insee) disseminates monthly data at the "département"¹ level and quarterly data at a finer geographical level named "Zone d'emploi" (ZE). Those statistics are based on the quarterly Labor force survey (LFS) results and on external data coming from "Pôle Emploi" (PE), a national administration which registers each month a large part of the jobless people looking for employment. Up to now, there are no explicit norms concerning the dissemination of unemployment figures, but it is clear that the French LFS alone is (generally) not sufficient to allow reliable local estimations. The present method to produce those estimations uses a synthetic estimator based on an implicit model which represents a proportional relation between the number of jobless people in a domain and the number of people registered at PE. It consists in dividing the national estimation (by sex × age) from the LFS into local estimations proportionally to the local number of registered people (by sex × age) at PE. This method makes good sense and users seem to be satisfied with it, but we have the opportunity to test other methods which should bring a more formalized background and the possibility of some quality assessment, the present methodology being a rather empirical one. Experience with small area estimation in France is limited, and this is in fact the first time such an approach is applied to the LFS. Our main goal is to experiment with different small area estimators and to compare them with the present estimator calculated by Insee. The case study concerns the first quarter of 2007. This choice is motivated by the availability of external data for this period.
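The proportional allocation behind the present synthetic estimator can be sketched as follows. This is only an illustration of the principle described above, not Insee's production code: the function name, the cell labels and all figures are hypothetical.

```python
def synthetic_allocation(national_lfs, registered):
    """Split national LFS totals across areas in proportion to registered counts.

    national_lfs: {cell: national LFS estimate of jobless people in the cell}
    registered:   {area: {cell: number of people registered at PE}}
    Returns {area: estimated number of jobless people}.
    """
    # National number of registered people in each sex x age cell.
    national_registered = {}
    for cells in registered.values():
        for cell, count in cells.items():
            national_registered[cell] = national_registered.get(cell, 0) + count

    estimates = {}
    for area, cells in registered.items():
        # Each area receives the share of the national estimate that matches
        # its share of the registered people, cell by cell.
        estimates[area] = sum(
            national_lfs[cell] * count / national_registered[cell]
            for cell, count in cells.items()
            if national_registered[cell] > 0
        )
    return estimates


# Hypothetical figures, for illustration only.
national_lfs = {("M", "15-24"): 300.0, ("F", "15-24"): 200.0}
registered = {
    "ZE_A": {("M", "15-24"): 30, ("F", "15-24"): 10},
    "ZE_B": {("M", "15-24"): 70, ("F", "15-24"): 30},
}
estimates = synthetic_allocation(national_lfs, registered)
```

By construction the local estimates add up to the national LFS estimate, which is one reason why such an estimator is stable but potentially biased at the local level.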
Now that we know better what is useful, the overall process could rather easily be extended to more recent quarters and years in order to see whether the results are stable over time. If there is some agreement between the different alternative estimators mentioned above, and if there is moreover enough stability over time, then it could be the occasion to question the real benefits of maintaining the present methodology. This paper presents an application using two models defined at the domain level (the Fay & Herriot model and the Poisson model), but for information we carried out a wider study involving other models (calibrations on local margins, Eblup_B model, logistic model).

II) Description of the survey and of the external data

a) About the Labor force survey
The scope of the 2007 quarterly LFS is defined by all people who are 15 years old or more at the end of December 2007 and who are currently living in an ordinary household in France, except overseas territories. The sampling design is a rotational one, with stratification and several stages. The finest sample units are clusters of (approximately ) dwellings, in which each individual in the scope is drawn. One faces a serious technical difficulty for variance estimation because, for some of the stages, the sample size is equal to one. The final sampling weights come from a "one step" calibration, which in the same step corrects for non-response and decreases the sampling variance. The questionnaire takes about 1 hour and has different modules. Taking into account the relations between different questions, a qualitative activity variable is built which gives the status of each respondent, distinguishing people in employment, BIT² jobless people and inactive people. This variable is the central information used to estimate the total number of BIT jobless people in France during the period concerned.
1 The "département" is an administrative area, and also a standard reference area for local statistics in France; there are 95 départements in metropolitan France.
2 BIT: Bureau International du Travail = International Labour Office.

The national responding³ sample has a size of individuals. The sample weights vary from the extreme minimum 51 to the extreme maximum 1141, but for approximately 9% of the individuals the weights lie between 5 and 13 (the median is 61). With this set of weights, the direct national (metropolitan) estimation of the population in the scope is and the estimated total number of BIT jobless people is 416.

The small area concerned here is the "Zone d'emploi" (ZE), which is the smallest fraction of the French territory for which Insee produces official quarterly figures about employment and unemployment. Those areas were defined in 199 on the basis of the census data and were modified very recently, in 2011. One can distinguish 348 ZE in metropolitan France. Each French "commune"⁴ belongs to a unique ZE, the city of Paris being the only town which coincides exactly with a ZE. Each region is an exact union of ZE, but a ZE can straddle 2 (or more) "départements". Table 1 gives the distribution of the population size by ZE in the LFS scope, and table 2 gives the distribution of the respondent sample size by ZE. One can see that there is a very wide range of situations.

Table 1 : population size by ZE
Quantile Population 1% Max % % % % Q % Median % Q % % 46 1% 1536 % Min 788

Table 2 : respondent sample size by ZE
Quantile Respondent sample size 1% Max 31 99% % 693 9% 5 75% Q3 7 5% Median 18 5% Q1 7 1% 38 5% 3 1% 1 % Min 1

We add that the largest ZE in terms of population size is the city of Paris. Concerning table 2, on the one hand there are 3 ZE with a sample size between 1 and 1 respondents and 7 ZE with a sample size between 11 and respondents, and on the other hand we find 3 ZE with a sample size larger than 15 respondents. Moreover, it appears that 10 ZE are not covered by the LFS sample (all those ZE have a small population).
b) About the auxiliary data
We gathered data from 6 different sources:

The census: in reality, this is not a classical exhaustive survey but a very large sample survey. Each "small" city (fewer than 10 000 inhabitants) is exhaustively surveyed every 5 years. Each "large" city (more than 10 000 inhabitants) is surveyed each year through a sample of dwellings with a sampling rate equal to 8%. The 2007 census figures come from the aggregation of the 5 consecutive yearly surveys from 2005 to 2009. In the small cities stratum, the census is thus exhaustive, but the data come from different years. In the large cities stratum, the data come from different years too, but we face a sample survey with a rate of 40%.

3 The proportion of partial interviews is equal to .4% of the total number of complete plus partial interviews.
4 A "commune" is a city (municipality) having its own administration.

The auxiliary census data are therefore not strictly exact data, but we will treat them as exact because the variances due to the census design at the ZE level are obviously very small compared to the variance of the small area estimators (we calculated that the effective census sample size by ZE for 2007 lies between 55 people and 4 people).

The DEFM⁵ from "Pôle Emploi": as mentioned, PE disseminates each month its statistics about people registered with this administration and looking for a job. Insee gets the individual-level data, which include information about sex, age and education level. There is also a classification of the individual cases which distinguishes 8 categories of jobless people. The BIT official definition corresponds to the first 3 categories if the individual had no activity during the month.

A specific database with (among others) economic indicators: for each ZE, it was possible to gather information from the business register and from other administrative files about employment in firms and about individual incomes, in order to describe some demographic aspects of the firms and to build income indicators. The 39 initial variables cannot be listed here⁶; we just give one example: the ratio of the net flow of factories (the flow of factories beginning an activity between and 2006 minus the flow of factories ending an activity between and 2006 in the ZE) to S, the number of active factories in the ZE in 2006.

The "Tabard" typology: it is a typology of geographical units from a socio-demographic point of view, built from the 1999 census data. There are 7 categories to classify the geographical units, each individual belonging to a category depending on the area in which he or she lives. For each ZE, we calculated the structure of the population in the scope divided into the 7 categories.

The RMI⁷ base: it gives, for each ZE, the total number of people who are entitled to receive a statutory minimum income because they do not have enough financial resources.
The ZUS⁸ base: in France, about 85 areas, named ZUS, are classified as "sensitive" areas because of a deteriorated economic and/or social situation. For each ZE, we have the proportion of people living in a ZUS.

In the end, besides the individual data coming from the LFS table, which are available for modelling at the individual level, we obtain a large database at the domain level, all the variables being quantitative variables defined as a proportion of the population in the LFS scope. For normalization purposes, when we fit a model at the domain level and in order to limit the dispersion of the dependent variable, we chose to model a ratio, named P: its numerator is the estimated number of BIT jobless people and its denominator is the population size in the scope as estimated by the LFS (so it is not the unemployment rate!). Then, to get the total number of BIT jobless people in the ZE, it remains to multiply this ratio by the (pseudo-exact) population size coming from the 2007 census. In other words, what we call the direct ZE estimator of the total number of jobless people is a simple poststratified estimator with a calibration on the local population size. Table 3 below gives the distribution of the direct ratio of BIT jobless people by ZE: the weights are the final weights coming from the national table of LFS data.

5 Demandes d'Emploi en Fin de Mois.
6 This database was built before this study and for other reasons, so an appreciable number of those 39 variables are obviously useless for our purpose! That is why, in fact, we immediately discarded some of them at the very beginning.
7 Revenu Minimum d'Insertion.
8 Zone Urbaine Sensible.
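The direct ZE estimator described above (weighted ratio P, then multiplication by the census population) can be sketched as follows. The function name and the figures are hypothetical; the weights stand for the final LFS weights of the respondents in one ZE.

```python
def direct_estimate(weights, jobless, census_population):
    """Return (ratio P, estimated total of jobless people) for one ZE.

    weights:           final LFS weights of the respondents in the ZE
    jobless:           1 if the respondent is BIT jobless, 0 otherwise
    census_population: population size of the ZE taken from the census
    """
    n_hat = sum(weights)                                   # LFS estimate of the population in scope
    y_hat = sum(w * y for w, y in zip(weights, jobless))   # LFS estimate of the jobless total
    ratio = y_hat / n_hat                                  # the ratio P (not the unemployment rate)
    return ratio, ratio * census_population


# Hypothetical figures, for illustration only.
weights = [500.0, 700.0, 600.0, 200.0]
jobless = [1, 0, 0, 0]
ratio, total = direct_estimate(weights, jobless, census_population=10000)
```

Multiplying the ratio by the census population rather than by the LFS population estimate is what makes this a poststratified estimator calibrated on the local population size.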

Table 3 : distribution of the direct estimator of the proportion of BIT jobless people by ZE
Quantile Ratio (%) 1% Max .7 99% % 1. 9% 8. 75% Q3 6. 5% Median 4. 5% Q1 .4 1% . 5% . 1% . % Min .

III) Approach to the problem
The aim is to test different methods: some of them use a model defined at the ZE level, others use the individual data directly. To fit a model at the ZE level, we can directly use any variable from the large database presented in part II, regardless of whether this variable is in the questionnaire. To build an estimator which uses the individual level, we have to distinguish the calibration technique from any technique using a stochastic model. Concerning the calibration approach, the natural way is to consider the potential explanatory variables among those available jointly in the questionnaire (to get the individual values) and in the large database (to get the true values at ZE level). However, it is always possible to add a new individual variable to the LFS sample table, built as a constant for every individual in a given ZE and equal to the true value for the ZE, itself extracted from the large ZE-level database. Fitting a model at the individual level is much more constraining when we deal with a qualitative variable⁹, because we need to calculate explicitly the predicted probabilities for each individual in the population. As the formula is non-linear, we need a single database with all the individual auxiliary variables known for each individual in the population. As we have a (too) large number of candidate variables in the large ZE-level database to explain the BIT jobless phenomenon (more than 300), we select a priori a set of explanatory variables, then we fit different subsets of variables and choose the one which seems the most suitable. Afterwards, we try to assess the quality of the model through some indicators and/or graphs.
Then we look at the distribution of the small area estimators and compare them to alternative estimators, mainly to the present statistics calculated by Insee (which remain an internal reference because a change in methodology has a cost). We focus in this document on two models involving the domain level only. The preselection of variables was based on a standard regression model at the domain level where the proportion P is the dependent variable (with a $\sigma^2 I$ variance matrix) and was made in two steps. First, we kept only the variables whose Pearson correlation with the estimated proportions P is beyond a given threshold (lower than -1% or higher than 1% or %, depending on the kind of variable). We got 18 variables among the 345 initial variables. This first crude screening is certainly not an optimal process from the statistical point of view, but the idea is to reduce the dimension of the next selection step (which is more conventional) and to keep a minimum standard of interpretability for the final explanatory variables.

9 With a quantitative variable, it is different: the model should be linear and so we do not need a single database, since in the end we just need to know the true totals of the auxiliary variables.
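The first screening step can be sketched as below. This is a simplified illustration: it assumes a single symmetric threshold on the absolute Pearson correlation, whereas the study uses thresholds that depend on the kind of variable. The variable names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def prescreen(candidates, target, threshold):
    """Keep the candidates whose |correlation| with the target exceeds the threshold."""
    kept = {}
    for name, values in candidates.items():
        r = pearson(values, target)
        if abs(r) > threshold:
            kept[name] = r
    return kept


# Hypothetical ZE-level data: direct proportions P and two candidate regressors.
p_direct = [0.03, 0.05, 0.04, 0.07, 0.06]
candidates = {
    "x_strong": [1, 3, 2, 5, 4],   # closely related to P
    "x_weak": [2, 1, 3, 2, 2],     # only weakly related to P
}
kept = prescreen(candidates, p_direct, threshold=0.5)
```

Such a filter only looks at one variable at a time, which is why it is followed by a multivariate selection step.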

In the second step, we applied a stepwise selection method, using the SAS procedure Proc Glmselect. This procedure can select a set of models through a criterion 1, which concretely drives the stepwise selection process, and stop the stepwise process when a local minimum of a criterion 2 is reached. We tried different selection processes: successively, a significance-level technique for criterion 1 (selecting the variables according to the F-value attached to the significance test) with the adjusted R² for criterion 2; a technique using the adjusted R² for criterion 1 and Mallows' Cp for criterion 2; and, as a third attempt, a cross-validation approach based on the PRESS statistic (the sample data are split into 5 blocks). Here is the syntax for the 3 successive selections:

SELECTION = stepwise (select=sl SLE=. SLS=. stop=adjrsq)
SELECTION = stepwise (select=adjrsq stop=cp)
CVMETHOD=BLOCK(5) CVDETAILS=ALL SELECTION = STEPWISE (select=cv)

The first selection is interesting because it is a kind of compromise between a strategy purely based on the significance level of the individual variables and a strategy based on a global quality criterion, whereas the two approaches do not a priori agree (for instance, a variable may be significant but not retained in the model because the global criterion deteriorates if we add it). Cross-validation is, in our view, the most logical strategy, better suited to assessing absolute quality than a best-fit strategy (a fitting strategy, mainly in the spirit of the first two selections, gives the best model, but the best is not necessarily "good"; cross-validation, through the PRESS criterion, quantifies an "absolute" relevance of the individual predictions).
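The idea behind the cross-validation criterion can be illustrated with a minimal sketch. It is not a reimplementation of Proc Glmselect: it assumes a single regressor, ordinary least squares, and an assignment of observations to 5 blocks by index modulo 5 (the way SAS forms its blocks is not reproduced here). All names and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    return mean_y - b * mean_x, b


def cv_press(x, y, n_blocks=5):
    """Sum of squared prediction errors, each block being predicted
    from a model fitted on the other blocks."""
    n = len(x)
    press = 0.0
    for k in range(n_blocks):
        test = [i for i in range(n) if i % n_blocks == k]
        train = [i for i in range(n) if i % n_blocks != k]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        press += sum((y[i] - (a + b * x[i])) ** 2 for i in test)
    return press


# Hypothetical data: the response depends linearly on x_good only.
x_good = list(range(10))
x_poor = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
y = [2 * v + 1 for v in x_good]

scores = {"x_good": cv_press(x_good, y), "x_poor": cv_press(x_poor, y)}
best = min(scores, key=scores.get)
```

Because every prediction is made for observations left out of the fit, a small PRESS says something about predictive relevance and not only about goodness of fit.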
Those different selection methods do not yield exactly the same list of significant variables; however, there are quite a few variables in common if one compares the respective outcomes. So, gathering the results and recognizing that there is again some empiricism in this step, 15 variables among the 18 initial ones appear as potentially the most explanatory ones. After a final fit, and taking into account the t-value associated with each variable (in order to exclude the variables with too high a p-value), we end the preselection step with a list of 7 variables, in addition to an intercept (see table 4).

Table 4 : preselection of explanatory variables (domain level model)
Columns: Variable, Parameter Estimate, Standard Error, t Value, Pr > |t|
Rows: Intercept, t_rech_oui, t_couple_, t_age15_19hiplbit, t_age5_64hnonipl, t_age3_49hnoniplbit, c_txsoletab_, part_epcom_

t_rech_oui : proportion of people who spontaneously declare that they are looking for a job;
t_couple_ : proportion of people who declare that they live alone;
t_age15_19hiplbit : proportion of people with the following characteristics: registered in PE (categories 1, 2, 3), aged between 15 and 19, male, high education level;
t_age3_49hnoniplbit : proportion of people with the following characteristics: registered in PE (categories 1, 2, 3), aged between 3 and 49, male, low education level;
t_age5_64hnoniplbit : proportion of people with the following characteristics: registered in PE (categories 1, 2, 3)¹⁰, aged between 5 and 64, male, low education level;
c_txsoletab_6 : ratio of the net flow of factories (the flow of factories beginning an activity between and 2006 minus the flow of factories ending an activity between and 2006 in the ZE) to S, the number of active factories in the ZE in 2006;
part_epcom_4 : proportion of people in the ZE living in an area characterized by a Tabard code N 7 (it has something to do with a location near the sea or in a tourist area).

IV) Tested methods

A) The Fay and Herriot model
We very briefly summarize the model (see [Rao]). If $d$ identifies the ZE and $s_d$ the LFS sample in ZE $d$, we note $\hat{\bar{Y}}_d = \hat{Y}_d / \hat{N}_d$ with $\hat{Y}_d = \sum_{i \in s_d} w_i Y_i$ and $\hat{N}_d = \sum_{i \in s_d} w_i$. For a given $d$, the true (known) means of the auxiliary variables are $\bar{X}_d$. The model is

$$\hat{\bar{Y}}_d = \bar{X}_d^t B + v_d + \varepsilon_d$$

where $B$ is an unknown vector parameter. Moreover, $E(v_d) = E(\varepsilon_d) = 0$, $Var(v_d) = \sigma_v^2$ ($v_d$ being the random area effect) and $Var(\varepsilon_d) = \psi_d$, the sampling variance (supposed to be known). The random terms $v_d$ and $\varepsilon_d$ are supposed to be independent. The empirical best linear unbiased predictor (EBLUP) of the true mean $\bar{Y}_d$ is

$$\hat{\bar{Y}}_d^H = \hat{\gamma}_d \hat{\bar{Y}}_d + (1 - \hat{\gamma}_d) \bar{X}_d^t \hat{B}$$

where

$$\hat{B} = \left[ \sum_d \frac{\bar{X}_d \bar{X}_d^t}{\hat{\sigma}_v^2 + \psi_d} \right]^{-1} \sum_d \frac{\bar{X}_d \hat{\bar{Y}}_d}{\hat{\sigma}_v^2 + \psi_d} \quad \text{and} \quad \hat{\gamma}_d = \frac{\hat{\sigma}_v^2}{\hat{\sigma}_v^2 + \psi_d}.$$

Most often, $\hat{\sigma}_v^2$ is the maximum likelihood estimator, the errors being supposed to follow a Gaussian distribution. Thus, this estimator mixes a direct estimator, (almost) unbiased but with a high sampling variance, with a synthetic estimator, biased but stable. One can show that

$$E\left( \hat{\bar{Y}}_d^H - \bar{Y}_d \right)^2 = \gamma_d \psi_d + O\left( \frac{1}{m} \right)$$

where $m$ is the total number of ZE (so $m = 338$ if we keep the whole set of ZE for which $n_d > 0$).

In our context, $Y_i$ is equal to 1 if the individual $i$ is jobless, and 0 otherwise. So, if $n_d > 0$, $\hat{Y}_d$ estimates the total number of jobless people in ZE $d$. The sampling variance is supposed to be known, but in practice it must be estimated, and this is the main technical difficulty. For that, we use the regional design effect calculated as an output of the variance estimation process for the LFS estimation of the total number of BIT jobless people at the national and regional level.
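The way the EBLUP combines the direct and the synthetic components can be sketched for a single area as follows. The function name and all numerical values are hypothetical; the regression prediction and the variance of the area effect are taken as given here, whereas in the study they come out of the model fit.

```python
def fay_herriot_eblup(direct, synthetic, sigma_v2, psi):
    """Return (gamma, EBLUP) for one area.

    direct:    direct estimate of the area proportion
    synthetic: regression prediction X'B for the area
    sigma_v2:  variance of the random area effect
    psi:       sampling variance of the direct estimate
    """
    gamma = sigma_v2 / (sigma_v2 + psi)   # weight given to the direct estimate
    return gamma, gamma * direct + (1.0 - gamma) * synthetic


# Hypothetical values, for illustration only.
sigma_v2 = 1e-4
precise_area = fay_herriot_eblup(direct=0.08, synthetic=0.05, sigma_v2=sigma_v2, psi=1e-4)
noisy_area = fay_herriot_eblup(direct=0.08, synthetic=0.05, sigma_v2=sigma_v2, psi=4e-4)
```

With the same direct and synthetic values, the area with the larger sampling variance gets a smaller weight on its direct estimate and is pulled further towards the regression prediction.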
To stabilize this estimated variance, we use an arithmetic mean $deff$ of the regional design effects over the different quarterly values from the fourth quarter of 2005 to the second quarter of 2008. Concerning the simple random sampling part of the variance, we use the proportion $\tilde{p}_d$ coming from the present Insee methodology, but only if $3\% \le \tilde{p}_d \le 16\%$ (it is a synthetic estimator, so it avoids extreme values). Otherwise, we use the national ratio. This has the decisive advantage of avoiding too much variability among the $\hat{\psi}_d$. Finally

$$\hat{\psi}_d = deff \cdot \frac{\tilde{p}_d (1 - \tilde{p}_d)}{n_d}$$

If we select only the ZE with $n_d \ge 5$ (justification below), we get the distribution of the sampling variance given by table 5.

Table 5 : distribution of $\hat{\psi}_d$ (subset of ZE with $n_d \ge 5$)
Quantile Variance × 10^4 1% Max % % % % Q % Median 4.1 5% Q1 .31 1% % .83 1% .44 % Min .39

The estimated coefficient of variation of $\hat{\bar{Y}}_d$ lies between 11,% and 616,6 %, with Q1 = 33,8%, a median of 51,1 % and Q3 = 7,5%: this poor efficiency justifies the implementation of small area estimation techniques (remember that those CV are multiplied by 2 to assess the accuracy of the estimate, since the confidence limits for the true mean are $\hat{\bar{Y}}_d (1 \pm 2\,CV)$).

In order to assess the quality of this model in our context, we use different tools (cf. part V):
- Standard criteria (AIC, BIC, ...) to choose the explanatory variables, in conjunction with t-values to assess the significance of the explanatory variables;
- Graphs of residuals and predicted random effects to assess the goodness of fit of the model and the appropriateness of the Gaussian hypothesis;
- Examination of the stability / sensitivity of the estimation $\hat{\sigma}_v^2$;
- Comparison of the model-based estimation with the direct estimation at an aggregate level, to assess the impact of bias;
- Graphs comparing the model-based estimation with the direct estimation for a large set of domains, in the same spirit.

The software used to get the Fay and Herriot estimator is SAS, through the procedure Proc GLIMMIX. As the sampling variances must be held as fixed values, we use the PARMS / HOLD= option of the procedure, with a final covariance matrix of VC type. So, the optimization concerns the parameters $B$ and $\sigma_v^2$. In the SAS framework, there is one G parameter (which is $\sigma_v^2$) and 87 R parameters (which are the held $\hat{\psi}_d$). The matrix X has 6 columns (the intercept plus 5 regressors) and the matrix Z has 87 columns (Z is the identity matrix, since we have 87 random effects). The algorithm used is a Newton-Raphson method and the objective function to minimize is defined as minus 2 times the restricted log likelihood. In our context, convergence was quick, achieved after 4 iterations. We must note that the program we use is very easy to write (a few lines of SAS) and works well but, very strangely, it requires a very large amount of RAM: for 87 domains, Proc Glimmix requests 15 Giga of RAM and the CPU time is 6 seconds on a very powerful device¹¹.

B) The Poisson models
In any ZE, as soon as $n_d > 0$, the direct estimator $\hat{N}_d^c$ of the total number of jobless people is supposed to follow a Poisson distribution with an unknown real parameter $\mu_d$:

$$\hat{N}_d^c \sim P(\mu_d) \quad \text{with} \quad g(\mu_d) = X_d^t \beta + v_d$$

where $g(\cdot)$ is a given function and $Var(v_d) = \sigma_v^2$. The (possible) local effects $v_d$ are supposed to be mutually independent Gaussian random variables. As it is a generalized linear mixed model (GLMM), one uses a restricted maximum likelihood technique through an approximate linear mixed model (LMM) which involves a (calculable) pseudo variable $A_d$:

$$A_d = X_d^t \beta + v_d + \varepsilon_d$$

where $\varepsilon_d$ is an ad hoc random variable with $E(\varepsilon_d) = 0$. When the distribution of $\hat{N}_d^c$ is a genuine Poisson distribution conditional on $v_d$, then $Var(\hat{N}_d^c \mid v_d) = \mu_d$. The reality may be different: often, there is an "overdispersion" phenomenon because, for some (unknown) parameter $\varphi$, $Var(\hat{N}_d^c \mid v_d) = \varphi \mu_d$. In this wider context, we can verify that $Var(\varepsilon) = M(\varphi)$ ($\varepsilon$ is the vector with the components $\varepsilon_d$ and $M(\varphi)$ is a complex matrix depending on $\varphi$). The parameter $\varphi$ is estimated by the restricted maximum likelihood technique, as for $\beta$ and $\sigma_v^2$.
Finally, after fitting (using $\hat{\sigma}_v^2$ and $\hat{\varphi}$ if the corresponding parameters are introduced), one gets the small area Poisson estimator:

$$\hat{N}_d = \hat{\mu}_d = g^{-1}(X_d^t \hat{\beta} + \hat{v}_d)$$

For more information, see [McCulloch]. As unfortunately we do not have a software tool to optimize the selection of the explanatory variables in this context, and since it is not reasonable to fit $2^p$ models when there are $p$ explanatory variables, we considered it a correct starting point to launch an initial fit with the 15 potential explanatory variables selected at the end of the preselection step (cf. part III). A model using logarithms seems relevant. Indeed, one can expect a relation like

$$\frac{\hat{N}_d^c}{poptot7_d} = \prod_{k=1}^{p} (X_{d,k})^{\beta_k}$$

where $poptot7_d$ is the total 2007 population in ZE $d$, obtained from the census, and the $X_{d,k}$ are different proportions characterizing some specific subpopulations, as if the probability of being jobless were "almost" (see the powers $\beta_k$) a product of probabilities of belonging to such and such a subpopulation. So

$$Log(\hat{N}_d^c) = Log(poptot7_d) + \sum_{k=1}^{p} \beta_k \, Log(X_{d,k})$$

The natural link function is therefore $g(\mu) = Log(\mu)$. So finally, for any ZE participating in the fit, we have $\hat{N}_d = \exp(X_d^t \hat{\beta} + \hat{v}_d)$. For any ZE not involved in the fit, we agree upon $\hat{N}_d = \exp(X_d^t \hat{\beta})$.

Note that the coefficient of the variable Log(poptot7) must remain equal to 1 in the fit: SAS deals with this special case if one declares this variable as an "offset" variable. After adapting two variables and getting rid of one variable (because they are sometimes null or negative), we fit a model using 14 explanatory variables (plus the intercept). We (inevitably) exclude any ZE with $n_d = 0$, but we can keep any ZE with $n_d \ge 1$, and also any ZE with $\hat{N}_d^c = 0$ (even if it may look strange). We use Proc GLIMMIX of SAS, which makes it possible to introduce an overdispersion parameter. The CPU time is here very short and there is no RAM problem.

11 3 "application servers" IBM x385 M, each made of 4 processors Xeon Dual-Core 711N at ,5 GHz, with 4 Giga bytes of RAM.

V) The results

A) Fay and Herriot model
The fitting of the model was restricted to the ZE with $n_d \ge 5$. So, 87 ZE are concerned. Starting with the set of 7 variables listed in table 4 (plus the intercept), we use the t-value criterion to see the impact of eliminating the least significant variable, in line with a stepwise strategy.
It is interesting to remark that, at each step, there is always one (or two) variable(s) with a p-value greater than 5% which appeared significant in the previous step. Table 6 gives the values of standard criteria to assess the quality of the model, depending on the different sets of variables (starting with the full model, minus the least significant variable at each step).

Table 6 : some criteria to assess the quality of the Fay & Herriot model, depending on the explanatory variables
Columns: CRITERION, 7 variables, 5 variables, 4 variables, 3 variables, 2 variables, 1 variable
Rows: -2 restricted Log likelihood, AIC, BIC, Generalized Chi-2

This result allows the elimination of the following variables (in the same step): c_txsoletab_6 and part_epcom_4. The other variables belong to the final model. The restricted maximum likelihood technique gives $\hat{\sigma}_v^2$. Table 7 gives the different estimated values, depending on the explanatory variables used in the model. Fortunately, the parameter appears significantly different from zero in every situation. The aim here is to assess the stability of the estimator with respect to the set of explanatory variables, which looks fine. However, the CV of $\hat{\sigma}_v^2$ is about 3%, which is not really good.

Table 7 : different estimations of $\hat{\sigma}_v^2$, depending on the explanatory variables
Statistic variables variables variables variables variables variable
Rows: Estimation $\hat{\sigma}_v^2$, Standard error of $\hat{\sigma}_v^2$

The auxiliary variables used in the final model are listed in table 8.

Table 8 : final Fay & Herriot model (Solutions for Fixed Effects)
Columns: Effect, Estimate, Standard Error, DF, t Value, Pr > |t|
Rows: Intercept, t_rech_oui, t_couple_, t_age15_19hiplbit, t_age5_64hnonipl, t_age3_49hnoniplbit

We can see that it is possible to face a situation where some variables are not significant from the point of view of the t-test (p-value threshold at 5%) but where the model is considered the best one from the point of view of the standard quality criteria, with the conclusion that they are kept in the final model. The (strange) decimal values for the degrees of freedom come from the Kenward and Roger option, which is a technical option available in Proc Glimmix and recommended to improve the test statistics. The (surprisingly) very large value of the coefficient of the variable t_age15_19hiplbit is due to the rareness of the corresponding population (therefore this rareness does not prevent the variable from being significant!). From this model, table 9 gives the distribution of the estimated proportion of BIT jobless people in the 87 ZE, which is indeed the main output. We can see that 9% of the estimated proportions lie between ,6% and 6,5%. Those results may be compared to the direct estimations given in table 3.

Table 9 : final Fay and Herriot model : distribution of the 87 estimations
Quantile FH-Estimate (%) 1% Max % 8. 95% % % Q % Median % Q % .85 5% .58 1% 1.86 % Min 1.8

The histogram of the predicted random local effects $\hat{v}_d$ (cf. graph 1) visually seems to confirm the hypothesis of normality of the $v_d$. We can see that the local effect predictors lie between -1,5 and points of percentage.

Graph 1 : predicted random local effects in the final Fay & Herriot model (histogram; horizontal axis: Effet_domaine midpoint, vertical axis: frequency)

Graph 2 concerns different plots of the conditional studentized residuals, defined as

$$\hat{U}_d = \frac{\hat{\bar{Y}}_d - \bar{X}_d^t \hat{B} - \hat{v}_d}{\sqrt{\hat{\psi}_d}}.$$

Graph 2 : quality assessment outputs about the final Fay & Herriot model

The cloud of points is nice since there is no obvious pattern in those residuals. The histogram of the residuals visually seems to confirm the hypothesis of normality of the $\varepsilon_d$, which is confirmed by the Q-Q plot. The final box-plot shows a balanced distribution, mainly spread between -2 and +2. However, there is (essentially) one clearly atypical ZE (the ZE of Mantes-la-Jolie, near Paris): in this ZE there is a considerable value for $\hat{\bar{Y}}_d$ coming from the LFS (,7%, with a CV of 45,6%) and the model proposes a $\hat{\gamma}_d$ equal to ,13, giving a final EBLUP equal to 7,4%. The coefficient $\gamma_d$ makes it possible, on the one hand, to assess the relative weight of the direct estimator in the EBLUP estimator and, on the other hand, to assess the gain in variance over the direct estimator (cf. IV-A). The distribution of the estimated coefficient $\hat{\gamma}_d$ is given in table 10.

Table 10 : final Fay and Herriot model : distribution of $\hat{\gamma}_d$ for 87 ZE
Quantile Gamma
100% Max .74
99% .71
95% .57
90% .49
75% Q3 .33
50% Median .1
25% Q1 .14
10% .1
5% .9
1% .7
0% Min .5
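The link between the coefficient gamma and the gain in accuracy (leading term of the mean squared error equal to gamma times the sampling variance, cf. IV-A) can be illustrated numerically. The sketch below assumes, for illustration, a sampling variance of 4 × 10⁻⁴ and a gamma of about 0.2, which is the order of magnitude of a "median" ZE; the O(1/m) term is ignored and the variable names are hypothetical.

```python
import math


def half_width(mse):
    """Approximate 95% half-width: two root mean squared errors."""
    return 2.0 * math.sqrt(mse)


psi = 4e-4      # assumed sampling variance of the direct estimator
gamma = 0.20    # assumed weight of the direct estimator in the EBLUP

direct_half_width = half_width(psi)          # uncertainty of the direct estimator
eblup_mse = gamma * psi                      # leading term only, O(1/m) ignored
eblup_half_width = half_width(eblup_mse)     # uncertainty of the EBLUP
```

Under these assumptions the uncertainty on the proportion drops from about 4 percentage points for the direct estimator to roughly 2 percentage points for the Fay and Herriot estimator.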

We can see that for 90% of the ZE, priority is given to the synthetic estimator. In about one half of the ZE, the contribution of the direct estimator from the LFS is more than 20% in the final EBLUP, which is not negligible. The maximum contribution of the direct estimator, equal to 74 %, is attached to the ZE of Paris, and this is not surprising since it is the largest ZE ($n_d$ = 31), with a CV of 11,3% for the direct estimation ($\hat{\psi}_d = 0,4 \times 10^{-4}$). The minimum contribution of the direct estimator is attached to a very small ZE: $n_d$ = 5, the CV for the direct estimation equals 71% and $\hat{\psi}_d = 19, \times 10^{-4}$.

If we consider a "median" ZE, it will have a sampling variance of $4 \times 10^{-4}$ (see table 5), so the true proportion of BIT jobless people in this ZE will be estimated with an uncertainty of about $2\sqrt{4 \times 10^{-4}}$ = 4 points of percentage (which is bad). But this median ZE will also have a gamma of approximately 20% (see table 10), so the Fay & Herriot estimator will have a total error (MSE) of $0.8 \times 10^{-4}$ according to the theoretical result given in IV-A, let's say $10^{-4}$: so, the true proportion of BIT jobless people in this ZE will be estimated with an uncertainty of about $2\sqrt{10^{-4}}$, in other words 2 points of percentage, which is two times smaller than the direct uncertainty! So, even if the final result may be considered improvable (in itself, the CV obtained with this model is not really impressive), it becomes acceptable, and in any case we can expect a serious improvement in the local accuracy of unemployment statistics compared with the direct estimation strategy.

It is interesting (but also worrying) to notice that the estimation $\hat{\sigma}_v^2$ is very sensitive to the extreme values of the variances $\hat{\psi}_d$. A first proof testifies to it: if one does not select the ZE with $n_d \ge 5$ before fitting the model, the software gives $\hat{\sigma}_v^2 = 0$ and the estimation process fails! The model is fundamentally based on a balance between two kinds of variance: we guess that those (realistic) conditions give too much variance coming from the sample, and thus the variance coming from the model appears really negligible. As a second proof, imagine that we fix all the $\hat{\psi}_d$ equal to a (fictitious) constant value and see how the estimation process reacts: with 15 ZE concerned (to simplify), table 11 gives the sensitivity of $\hat{\sigma}_v^2$ and $\hat{\gamma}$ (which is then a constant) to the constant value of $\hat{\psi}_d$.

Table 11 : sensitivity of $\hat{\sigma}_v^2$ and $\hat{\gamma}$ to a fictitious constant value of $\hat{\psi}_d$
Columns: Constant $\hat{\psi}_d$, $\hat{\sigma}_v^2$, $\hat{\gamma}$

That is why we selected at the very beginning only the 87 ZE with a sample size greater than or equal to 5 responding people. This threshold appears to be a satisfactory one from the point of view of the numerical results. To summarize the scale of the modifications introduced by the model, table 12 gives the distribution of the gaps (in % points) between the LFS direct estimator of the proportion of BIT jobless people and the Fay and Herriot estimations in the 87 ZE. We can see that there are sometimes very strong corrections due to the Fay and Herriot method.

Table 12: absolute difference between $\hat{Y}$ and the EBLUP (Fay & Herriot)
Quantile      Absolute gap (in % points)
100% Max
%             6.
95%           3.5
90%           .
75% Q3        .9
50% Median    -.1
25% Q1        -1.
10%           -.4
5%            -.8
1%            -3.6
0% Min        -4.

We calculate the sum, over the 287 ZE, of the direct estimates of the total number of BIT jobless people, thus covering the largest part of France (remember that the direct estimator of the total in a ZE is $\hat{N} = N\hat{Y}$). We likewise calculate the Fay & Herriot national estimate as $\hat{N}^{H} = \sum_{ZE} N\hat{Y}^{H}$, and the total of the local estimates from the Insee methodology. We can then compare 3 estimates (see table 13):

Table 13: sums for three rival methods at a pseudo national level (287 ZE)
(pseudo) national estimation                         Total estimation
Fay and Herriot                                      339
Labour force survey, direct estimation               34
Insee methodology, implicit synthetic estimator      35

The closeness of the Fay and Herriot estimate and the direct estimate is exceptional here. It probably owes something to a lucky stroke of fate, but in any case we consider it a form of validation of the Fay and Herriot approach.

Graph 3 makes it possible to assess the bias of the estimator as far as the sampling randomness is concerned. The idea consists in comparing the distribution of the direct estimates, which are unbiased, with the corresponding Fay and Herriot estimates, whose theoretical unbiasedness is justified only if we take into account both the randomness of the model and the sampling randomness, and moreover under the fundamental hypothesis that the model is "exact". If the cloud of points spreads along the line Y = X, there is a strong presumption of lack of bias. In our particular context there is a gap, which is admittedly not considerable, but we cannot ignore it. A conventional test moreover concludes that the regression line differs significantly from the line Y = X. We have to allow for a phenomenon of "shrinkage", which is rather classic when one uses an estimator with a synthetic component.
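The comparison with the line Y = X can be sketched as follows. This is only an illustration under assumptions: the report does not say which variable is on which axis or which test statistic was used, so the function below (a hypothetical helper) regresses the direct estimates on the model-based ones and returns a t statistic for the hypothesis "slope = 1". The data are simulated, not the LFS figures.

```python
import numpy as np

def slope_vs_identity(model_est, direct_est):
    """OLS of direct estimates on model-based estimates.

    Returns (intercept, slope, t statistic for H0: slope = 1).
    """
    x = np.asarray(model_est, dtype=float)
    y = np.asarray(direct_est, dtype=float)
    n = x.size
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - 2)            # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)       # covariance of the OLS coefficients
    t_slope = (beta[1] - 1.0) / np.sqrt(cov[1, 1])
    return beta[0], beta[1], t_slope

# Simulated example: model estimates are the direct ones shrunk towards 8 (%).
rng = np.random.default_rng(1)
theta = rng.normal(8.0, 2.0, size=287)              # "true" proportions
direct = theta + rng.normal(0.0, 1.5, size=287)     # noisy direct estimates
model = 8.0 + 0.6 * (direct - 8.0) + rng.normal(0.0, 0.2, size=287)

intercept, slope, t_stat = slope_vs_identity(model, direct)
print(f"slope = {slope:.2f}, t statistic for slope = 1: {t_stat:.1f}")
```

A slope clearly above 1 in this orientation means the model-based estimates are less spread out than the direct ones, which is the shrinkage pattern described in the text.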

Graph 3

The estimates for the 61 ZE with n < 50 are classical synthetic estimates, that is to say $\hat{Y}^{SYN} = X^{t}\hat{B}$. In particular, this strategy includes the 1 ZE with n = 0 (without having to distinguish them specifically). By construction, the synthetic estimates are appreciably less scattered than the direct estimates, and thus the phenomenon of shrinkage is very marked, as we can see in graph 4 below, which is restricted to the 61 ZE in question:

Graph 4
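As a reminder of how the composite estimator weighs its two components, here is a minimal sketch. It assumes the usual Fay and Herriot form (weight gamma = model variance / (model variance + sampling variance), leading MSE term gamma × sampling variance); the function name and the input values are illustrative and are not taken from the report's data.

```python
import numpy as np

def fay_herriot_composite(direct, synthetic, psi, sigma2_v):
    """Blend direct and synthetic estimates area by area.

    psi      : sampling variances of the direct estimates
    sigma2_v : variance of the local (area) effects, treated as known here
    """
    direct = np.asarray(direct, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    psi = np.asarray(psi, dtype=float)
    gamma = sigma2_v / (sigma2_v + psi)           # weight of the direct estimator
    composite = gamma * direct + (1.0 - gamma) * synthetic
    mse_leading = gamma * psi                     # leading term of the MSE
    return gamma, composite, mse_leading

# Illustrative area: sampling variance 4e-4, model variance 1e-4 (made-up inputs).
gamma, composite, mse = fay_herriot_composite(
    direct=[0.10], synthetic=[0.08], psi=[4e-4], sigma2_v=1e-4)

print("gamma:", gamma[0])
print("composite estimate:", composite[0])
print("2*sqrt(var), direct vs composite:", 2 * np.sqrt(4e-4), 2 * np.sqrt(mse[0]))
```

When the sampling variance dominates, gamma is small and the composite stays close to the synthetic value; for the ZE with n < 50 the report simply uses the synthetic estimate alone.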

When we consider the whole set of the 348 ZE of metropolitan France, and if we are interested in the (complete) national estimate of the number of BIT jobless people obtained from the various rival methods, we find (see table 14):

Table 14: sums for three rival methods at a national level (348 ZE)
(pseudo) national estimation                         Total estimation
Fay and Herriot (or synthetic)                       43
Labour force survey, direct estimation
Insee methodology, implicit synthetic estimator      48

We thus find a very satisfactory closeness to the direct estimate. However, as we want the national sum of the model-based estimates $\hat{Y}^{mod}$ to be equal to the "official" total number of BIT jobless people given by the LFS, equal to 416, we apply a final benchmarking to every ZE:

$\hat{Y} = 416 \times \dfrac{\hat{Y}^{mod}}{\sum_{ZE} \hat{Y}^{mod}}$

The two following graphs take into account the 348 ZE and compare the direct estimates to the model-based estimator (Fay and Herriot or synthetic estimator, depending on the ZE). Graph 5 gives the entire cloud of points and graph 6 is just a zoom on the subpopulation of the ZE with less than 5 BIT jobless people (from the direct estimation point of view), so that we get rid of the influence of the largest ZE.

Graph 5

12 Calculated here as the sum of the 348 direct post-stratified estimates, so it differs from the "official" total (416), which uses the LFS weights.
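The ratio benchmarking described above can be sketched in a few lines. The function name and the numbers are illustrative only; the sole assumption is that every area estimate is multiplied by the same factor, official total divided by the sum of the model-based estimates.

```python
import math

def benchmark(model_estimates, official_total):
    """Rescale area estimates so that they add up to the official total."""
    total = sum(model_estimates)
    if total == 0:
        raise ValueError("model-based estimates sum to zero")
    factor = official_total / total
    return [y * factor for y in model_estimates]

# Made-up area estimates and official total.
areas = [120.0, 80.0, 300.0]
adjusted = benchmark(areas, official_total=510.0)
print(adjusted, sum(adjusted))
```

The correction is proportional, so the ranking of the areas and their relative shares are unchanged; only the level moves.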

Graph 6

Obviously, the benchmarking procedure has decreased the sampling bias, even if a slight shrinkage phenomenon remains 13.

B) Poisson model

The first issue we face is the preselection of ZE before the fit, since we must exclude at least all the ZE with n = 0 (no estimate $\hat{N}^{c}$). For information, graph 7 gives the distribution of the $\hat{N}^{c}$.

Graph 7: Number of BIT jobless people by ZE, approximate distribution, ZE with n > 0 (axes: FREQUENCY against NbponchomeurBIT MIDPOINT)

13 Caution: the order of the axes is inverted with regard to the graphs 3 and

However, we wondered whether it would not be preferable to strengthen the preselection in order to limit the disturbances which could occur because of the ZE with a small size n (as we noticed in the Fay and Herriot experience) and/or a small $\hat{N}^{c}$. Thus, we compared 4 configurations using a full mixed model with overdispersion (15 variables plus the intercept). In table 15, we give the results concerning the national Poisson estimate (sum of the ZE estimates) and the estimated overdispersion parameter $\hat{\phi}$, with three columns about the traditional test for a possible sampling bias (fit coefficient $R^2$, slope of the regression line $\hat{b}$, and result of the test). Remember that the national direct estimate of the number of jobless people is 416.

Table 15: some quality indicators for 4 rival preselection patterns
Columns: selection criterion, national estimation, $\hat{\phi}$, $R^2$, $\hat{b}$, conclusion of the test
n >                                  No bias
n >                                  No bias
n > 49 and $\hat{N}^{c}$ >           No bias
n > 49 and  < $\hat{N}^{c}$ <        Bias

Finally, the more we select the ZE, the more we create a gap between the direct estimate and the small area estimate, which signifies an increasing suspicion of bias. On the other hand, the overdispersion parameter decreases, which should give rise to a smaller variance for the small area estimators. In the end, we chose to fit the model on the whole set of ZE, as soon as n > 0.

a. The classical model

This model is a synthetic approach, since it has no local effect v. Starting with the full model (14 variables, see IV.B), we get the following results (table 16).

Table 16: some quality indicators about the full model
Fit statistics, full model (classic version): -2 Log Likelihood, AIC (smaller is better), BIC (smaller is better), Pearson Chi-Square

However, examining the p-values of each variable, we can see that 2 variables are largely not significant. If we remove them, we get the definitive model (table 17 and table 18).

Table 17: some quality indicators about the adapted model (definitive model)
Fit statistics, definitive model (classic version): -2 Log Likelihood, AIC (smaller is better), BIC (smaller is better), Pearson Chi-Square

Here we notice that the traditional quality criteria are not in agreement. The difference between the two models involved is admittedly rather subtle, and we thus relied on the traditional method of the Student tests.

Table 18: fitting results for the definitive model
Parameter estimates, definitive model (classic version)
Columns: Effect, Estimate, Standard Error, t Value, Pr > |t|
Effect                 Pr > |t|
Intercept              <.0001
log_t_rech_oui         <.0001
log_t_couple_          <.0001
log_t_age3_49hnoni     <.0001
log_t_age5_64hnoni     <.0001
log_b5_partbasrev
log_a6_partagri        <.0001
log_c8_partslbtp       <.0001
log_c8_partslsante
log_c8_partslfabri     <.0001
log_c8_partslgest      <.0001
c_txsoletab_           <.0001
t_age15_19hiplbit      <.0001

Unlike the Fay & Herriot case, we now have a large set of explanatory variables, which means that the two competing models react quite differently. The coefficient of log_t_rech_oui is nearly equal to 1 and the other coefficients are close to zero: this is because the proportion of jobless people is not far from the proportion of people looking for a job. The intercept is quite negative, but this comes from the fact that the explanatory percentages are multiplied by 100 (so we should add something like log 100 to appreciate the intercept). As for the coefficient of t_age15_19hiplbit (which is not a logarithm), we recall that this variable concerns a very rare population.

If we fit the model with the 3 explanatory variables used in part V.B.b (see table 24), the quality of the fit is clearly lower (AIC = , BIC = , Pearson Chi-Square = ). With only the explanatory variable log_t_rech_oui (and the intercept), the degradation increases (AIC = , BIC = , Pearson Chi-Square = 596 31).

Note that if we declare an overdispersion parameter, the Poisson estimators are rigorously equal ZE by ZE, because the variance matrix remains diagonal up to the overdispersion factor, so $\hat{\phi}$ does not interfere in the expression of $\hat{\beta}$ and there is no consequence for the final estimator (which is not true for a mixed model).

Considering graphs 8 and 9, the residuals look fine, even if an additional study could probably point to a few outliers (but there is no special or clear pattern, and the distribution looks Gaussian), and there is no shrinkage. Therefore, the estimators may be considered free of sampling bias.
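The remark that overdispersion leaves the classical Poisson estimates unchanged can be illustrated with a small sketch. It is not the SAS procedure used in the case study: it assumes a log-link Poisson regression fitted by iteratively reweighted least squares, with the overdispersion estimated afterwards as Pearson chi-square divided by the residual degrees of freedom. Function names and data are hypothetical.

```python
import numpy as np

def fit_poisson_log_link(X, y, n_iter=50, tol=1e-10):
    """Poisson regression with log link, fitted by IRLS.

    X must contain a column of ones as its first column.
    Returns (beta, Poisson covariance of beta, overdispersion estimate phi).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())                  # start from the overall mean
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu                 # working response
        XtWX = X.T @ (mu[:, None] * X)          # working weights are mu
        new_beta = np.linalg.solve(XtWX, X.T @ (mu * z))
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    mu = np.exp(X @ beta)
    cov = np.linalg.inv(X.T @ (mu[:, None] * X))
    pearson = np.sum((y - mu) ** 2 / mu)
    phi = pearson / (len(y) - X.shape[1])       # Pearson chi-square / df
    return beta, cov, phi

# Simulated overdispersed counts (gamma-mixed Poisson), not the LFS data.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
X = np.column_stack([np.ones(300), x])
y = rng.poisson(np.exp(3.0 + 0.5 * x) * rng.gamma(shape=2.0, scale=0.5, size=300))

beta, cov, phi = fit_poisson_log_link(X, y)
se_poisson = np.sqrt(np.diag(cov))
se_overdispersed = np.sqrt(phi) * se_poisson    # only the standard errors change
print(beta, phi, se_poisson, se_overdispersed)
```

The coefficients, and therefore the predicted counts, are computed before phi is even estimated; declaring overdispersion only inflates the standard errors by the square root of phi.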

Graph 8: Residuals for the definitive classical Poisson model

Graph 9: Poisson GLM

The benchmarking phase does not appear here as an asset: it is probably not efficient to use a benchmarking correction in this context.

b. The mixed model

The full mixed model (using 14 variables, see IV.B) without overdispersion gives $\hat{\sigma}_v^2$ = 4,96 (standard error = ,44). This value is huge because $\hat{\sigma}_v \approx 2{,}2$ means that we are facing a non negligible number of local effects v between 3 and 4 (for instance), so the corrective local effects affecting the final mixed estimator $\hat{N}^{c}$ are about $e^{3}$ or $e^{4}$, which is difficult to believe. On the other hand, we get (cf. tables 19 and 20):

Table 19: some quality indicators 14 about the full model without overdispersion (mixed model)
Fit statistics, full model, mixed, without overdispersion: -2 Res Log Pseudo-Likelihood, Generalized Chi-Square 357.

Table 20: fitting results for the full model without overdispersion (mixed model)
Solutions for fixed effects, full model, mixed, without overdispersion
Columns: Effect, Estimate, Standard Error, DF, t Value, Pr > |t|
Effects: Intercept, log_t_rech_oui, log_t_couple_, log_t_age3_49hnoni, log_t_age5_64hnoni, log_b5_partbasrev, log_t_allocrmi, log_t_natc_afri, log_a6_partagri, log_c8_partslbtp, log_c8_partslsante, log_c8_partslfabri, log_c8_partslgest, c_txsoletab_, t_age15_19hiplbit

The variable c8_partslsante represents the proportion (in 2006) of employed people whose function at work concerns health or social activity, and c8_partslgest6 the proportion (in 2006) of employed people whose function at work concerns management. It may be surprising that those two functions are especially highlighted (more than the others). On the other hand, and in principle, the presence of economic explanatory variables of this type in a model about employment is rather natural: the dynamism of employment can vary greatly from one sector to another, and the structure of economic activity in a ZE cannot be unconnected with the scale of unemployment.

14 SAS does not produce other indicators in the context of generalized linear mixed models; in particular we do not have the AIC and BIC criteria. We guess that this is a consequence of the insufficient theoretical knowledge about the behaviour of these two criteria when such complex models are used.

Using those parameter estimates, the national total estimate is equal to 4 14 BIT jobless people: we obviously get an extreme result which is unacceptable. We can also see that aberrations occur in the residuals (graph 10):

Graph 10

If we introduce an overdispersion parameter, then we get $\hat{\sigma}_v^2$ = ,39 with a standard error equal to ,4 and $\hat{\phi}$ = 1434 with a standard error equal to 169,9. So $\hat{\sigma}_v^2$ is not clearly significant (let us admit that we are on the verge of the critical region; in any case we hesitate to conclude). On the other hand, $\phi$ is resolutely positive and the size of the numerical estimate $\hat{\phi}$ may be surprising (anyway, we find the same order of magnitude for $\hat{\phi}$ with the classical model with overdispersion). Obviously, the model assigns the largest part of the variance of the $\hat{N}^{c}$ to the overdispersion ("R-effect") and not to the local effect ("G-effect").

Moreover, when we introduce overdispersion, the national total estimate becomes acceptable, since we get 449 BIT jobless people, and the residuals recover a pleasing look (see graph 11). The traditional test concludes that there is no bias. Finally, the characteristics of the estimation are given in tables 21 and 22.

Table 21: some quality indicators about the full model with overdispersion (mixed model)
Fit statistics, full mixed model, with overdispersion: -2 Res Log Pseudo-Likelihood, Generalized Chi-Square

Table 22: fitting results for the full model with overdispersion (mixed model)
Solutions for fixed effects, full mixed model, with overdispersion
Columns: Effect, Estimate, Standard Error, DF, t Value, Pr > |t|
Effects: Intercept, log_t_rech_oui, log_t_couple_, log_t_age3_49hnoni, log_t_age5_64hnoni, log_b5_partbasrev, log_t_allocrmi, log_t_natc_afri, log_a6_partagri, log_c8_partslbtp, log_c8_partslsante, log_c8_partslfabri, log_c8_partslgest, c_txsoletab_, t_age15_19hiplbit

The variable c8_partslfabri represents the proportion (in 2006) of employed people whose function at work concerns manufacturing.

Out of curiosity, still using 14 regressors, if one restricts the population of ZE with the condition n > 49 and  < $\hat{N}^{c}$ < 5, then $\hat{\sigma}_v^2$ = ,13 with a standard error equal to ,48 (so $\hat{\sigma}_v^2$ becomes explicitly significant) and $\hat{\phi}$ collapses, since $\hat{\phi}$ = 65 with a standard error equal to 8. In parallel, the significance of the "central" variable log_t_rech_oui quickly decreases (p-value = .), and the traditional test concludes that there is a bias, while the national jobless estimate seems satisfactory. In a situation which should be improved by the eviction of the smallest ZE, that is, the most disturbing elements for the fit, we finally face a rather disconcerting mechanism, mainly concerning the explanatory variables.

Keeping only the three significant variables (we adopt a tolerant position) which appear in the full model, the fitting results for this first simplified model are presented in tables 23 and 24:

Table 23: some quality indicators about the first simplified model with overdispersion (mixed model)
Fit statistics, 1st simplified mixed model: -2 Res Log Pseudo-Likelihood, Generalized Chi-Square

Table 24: fitting results for the first simplified model with overdispersion (mixed model)
Solutions for fixed effects, 1st simplified mixed model
Columns: Effect, Estimate, Standard Error, DF, t Value, Pr > |t|
Effect                 Pr > |t|
Intercept              <.0001
log_t_rech_oui         <.0001
log_c8_partslfabri
log_c8_partslgest

With this first simplified model, $\hat{\sigma}_v^2$ = ,1 with a standard error equal to , ($\hat{\sigma}_v^2$ is now very clearly non significant) while $\hat{\phi}$ = 1548 with a standard error equal to 166. The national estimate is about 447 jobless people (which remains good) and the test concludes that there is no bias.

If we again remove the variables which appear explicitly as non significant in table 24, then log_t_rech_oui remains alone (with the intercept) and we get the second simplified model (see the results in tables 25 and 26).

Table 25: some quality indicators about the second simplified model with overdispersion (mixed model)
Fit statistics, 2nd simplified mixed model: -2 Res Log Pseudo-Likelihood, Generalized Chi-Square

Table 26: fitting results for the second simplified model with overdispersion (mixed model)
Solutions for fixed effects, 2nd simplified mixed model
Columns: Effect, Estimate, Standard Error, DF, t Value, Pr > |t|
Effect                 Pr > |t|
Intercept              <.0001
log_t_rech_oui         <.0001

The national estimate becomes equal to 43 jobless people and, according to the test, there is no bias. On the other hand, $\hat{\sigma}_v^2$ = ,9 with a standard error of ,17 ($\hat{\sigma}_v^2$ is highly non significant) and $\hat{\phi}$ = 169 with a standard error equal to 17. So, when we reduce the list of explanatory variables, we strengthen the non significant character of $\hat{\sigma}_v^2$ and we tend to increase $\hat{\phi}$ in return, but in a reasonable proportion which allows us to say that $\hat{\phi}$ is relatively stable.

These successive results are a little puzzling. Indeed, with the traditional calculated criteria, the full model is the best one (compare tables 21, 23 and 25), but it includes a lot of non significant variables (see table 22). If one wishes to keep only the significant variables, then it is enough to retain the variable log_t_rech_oui alone (with the intercept), and we get a good national estimate, but in this configuration the traditional criteria degrade (compare tables 21 and 25).

Finally, if we decide to retain the full model, letting the closeness of the fit to the classical modelling tip the balance (compare tables 18 and 22), we obtain:

Graph 11: Fitting results (residuals) for the full mixed model with overdispersion

Graph 12: Bias assessment from the full mixed model with overdispersion (Poisson GLMM)

The fit is globally satisfactory. On the other hand, if we prefer the second simplified model, letting simplicity tip the balance, we obtain:

Graph 13: Fitting results (residuals) for the second simplified mixed model with overdispersion

Graph 14: Bias assessment from the second simplified mixed model with overdispersion (Poisson GLMM)

As far as the sampling bias is concerned, we have confirmation that the full mixed model is not "better" than the very simplified one. Moreover, a benchmarking step seems useless.

To finish, tables 27 (for the full model) and 28 (for the second simplified model) make it possible to assess the gap (in relative value and without any benchmarking) between the (classical or mixed) Poisson estimate and the direct estimate for an intermediate, but highly aggregated, geographical level named ZEAT: a ZEAT is a group of regions and metropolitan France is divided into 9 ZEAT (and 22 regions), so the Labour Force Survey may be considered reliable at the ZEAT level.

Table 27: Bias assessment for the full model using ZEAT level estimation
Columns: ZEAT, Classical Poisson, Poisson mixed, Direct, Gap classical Poisson, Gap Poisson mixed
Gap classical Poisson    Gap Poisson mixed
,1 %                     ,8 %
, %                      3,6 %
,1 %                     1,7 %
, %                      9,5 %
, %                      1,8 %
, %                      1,6 %
, %                      , %
,9 %                     1,6 %
France: Sigma = 9,5      Sigma = 4,6

Table 28: Bias assessment for the second simplified model using ZEAT level estimation
Columns: ZEAT, Classical Poisson, Poisson mixed, Direct, Gap classical Poisson, Gap Poisson mixed
Gap classical Poisson    Gap Poisson mixed
,5 %                     , %
,4 %                     ,4 %
,9 %                     4,4 %
, %                      11,6 %
, %                      ,8 %
, %                      , %
,5 %                     3,4 %
, %                      1,3 %
France: Sigma = 31,7     Sigma = 3,1

The "sigma" indicator is the sum of the percentages representing the gaps, taken without their sign. It has no particular interpretation at the global level, but it allows us to assess the numerical importance of the relative errors independently of the populations concerned. It could also be a quality indicator for comparing different small area estimators.

We can see that the classical Poisson and the mixed Poisson give rather coherent gaps in order of magnitude (except for ZEAT 1 and 3 in the full model, see table 27). Moreover, generally speaking, the results are not so different between the full model and the very simplified model. The best "sigma" is produced by the full mixed model. Lastly, the final quality seems very sensitive to the ZEAT: this could be an incentive to fit a specific model in the ZEAT for which the results are not good enough.
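The "sigma" indicator can be written down directly. In this sketch the relative gap is taken with respect to the direct estimate, which is an assumption (the report only says "in relative value"); the function name and the figures are made up.

```python
import math

def sigma_indicator(model_estimates, direct_estimates):
    """Relative gaps (in %) to the direct estimates and the sum of their absolute values."""
    gaps = [100.0 * (m - d) / d for m, d in zip(model_estimates, direct_estimates)]
    sigma = sum(abs(g) for g in gaps)
    return gaps, sigma

# Three fictitious aggregated areas.
gaps, sigma = sigma_indicator(model_estimates=[105.0, 190.0, 300.0],
                              direct_estimates=[100.0, 200.0, 300.0])
print(gaps, sigma)
```

Because signs are dropped before summing, positive and negative gaps cannot cancel out, which is what makes the indicator usable for comparing estimators.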

VI) Conclusion

We can see the good global performance of the composite Fay & Herriot and Poisson estimators for estimating the total number of jobless people at the "Zone d'emploi" level. Despite the lack of quality assessment of the Insee methodology currently in use, those model-dependent small area estimators seem, according to several indications, to be at least as efficient as the synthetic estimators used to produce the official French statistics.

Because of the very complex structure of the French LFS, which involves several stages with ad hoc man-made sampling units, we regret that we did not feel able to carry out a simulation in order to quantify the sampling bias affecting the different estimators involved in this experiment, despite the proximity between the available Census data about activity and the data collected through the Labour Force Survey. However, the existence of quarterly labour force datasets would make it possible to test some time series models to "borrow strength" from the past. On the other hand, the obvious spatial correlation between the indicators of neighbouring "Zones d'emploi" suggests fitting a model with an ad hoc variance structure for the local effects.

To be more exhaustive, we should try in the future to assess the stability of those outcomes over time: for that, it would be desirable to apply all of the above procedures to other quarterly LFS samples, and not merely to the 2007 data. Moreover, we can imagine that some improvement could be obtained by using higher level area indicators in the different models (such as "département" or "région" indicators).

References

[McCulloch] C.E. McCulloch, S.R. Searle, J.M. Neuhaus, "Generalized, Linear, and Mixed Models", Wiley, 2008
[Rao] J.N.K. Rao, "Small Area Estimation", Wiley, 2003

Case study report DESTATIS, Germany

Preliminary Note

Due to staff reductions and unplanned extraordinary supplementary workloads for the German Census 2011, DESTATIS was unable to carry out the planned case study and report results meeting a minimum of quality standards in time. To the deepest regret of DESTATIS, this decision had to be made because of the circumstances mentioned above.

Background

Historically, German official statistics on labour force issues take advantage of two crucial sources. The first is the German Microcensus, an annual survey comprising 1% of German private households, whose purpose is to obtain census-like information at regular intervals. The second is data from the Bundesagentur fuer Arbeit (BA, the German Federal Employment Agency), which provides statistical information directly distilled from German labour offices; BA data therefore has more or less the character of register-based data. It has to be noted that German labour offices do not record all jobless persons (there are also additional community-based institutions recording jobless persons). Nevertheless, based on German Social Law, the BA has been authorised to secure the quality of German labour statistics since unemployment aid and benefit payments were combined at local authority level after the end of 2004.

Unfortunately, results from the two sources differ significantly, and for a long time great efforts have been undertaken to eliminate this difference or at least minimise it. Since enormous differences already exist at a very high aggregate level (up to the federal level, although results are published up to NUTS 2), no results at finely disaggregated levels taking both data sources into account have been officially analysed or published so far. In detail, standard analyses are deemed reliable only for regions with at least 3, inhabitants, since mean square errors of estimated results have been considered acceptable only down to this size of region. This would be sufficient for all NUTS 1 and NUTS 2 areas, whilst a lot of regional districts (Kreise) at NUTS 3 level are below that limit.

Of course there is considerable interest in less aggregated results, but besides difficulties in reconciling the above-mentioned databases, precision constraints have so far not permitted a breakdown into NUTS 2 or NUTS 3-level results. This project offers at least the methodological units of DESTATIS and Information und Technik Nordrhein-Westfalen (IT.NRW, the statistical office of the German Bundesland North Rhine-Westphalia) an opportunity to take a first step towards (small area) modelling of this problem at a further disaggregated level, under the scrutinising eyes of national statistical offices in Europe that are more established and experienced in small area modelling. The main objective of the project would be the improvement of accuracy in terms of mean square errors. Due to the restriction to the use of aggregate data, only area level models seem appropriate for carrying out analysis at NUTS 3 level.

Internally, there is also considerable interest from sections of DESTATIS not related to employment (foreign trade statistics, agricultural statistics) in improving their results at finer regional levels. Transferable results from this case study might be very supportive for future methodological work.

Description of survey or register, and of all data sources used

In August 2010 a one-day meeting took place in Düsseldorf with microcensus experts from IT.NRW and DESTATIS, concerning data supply, quality and sources of available data, and the special analytical interests of both the federal statistical office and the statistical office of the German Bundesland North Rhine-Westphalia. It was agreed in what way and at what level data could be provided by IT.NRW and how the expected workload of the case study would be shared. It was also discussed how the matching of data from different sources could be managed.

As already introduced in the previous chapter, the variables to be estimated originate from the German Microcensus. Apart from demography and household composition issues, one of its main data contents refers to labour force data. The sample units are constructed as clusters of around 8 to 1 households based on the results of the German Census back in . Additional strata have been incorporated into the Microcensus design for buildings occupied for the first time after May . The data were chosen from a fairly recent survey carried out in 2009, and the selected variables of interest are the number of jobless persons and the number of employed persons, with definitions according to the International Labour Force Survey. For the whole of Germany, nearly 400,000 households, resulting in more than 800,000 inhabitants, have been interviewed.

To begin with, not all the states of Germany have been included in this analysis. We concentrate only on regions/municipalities inside the state of North Rhine-Westphalia, which serves as a good starting point since this is the largest German state and its areas of interest are quite large compared to most other states. North Rhine-Westphalia, with a total of around 18 million inhabitants, is divided into 5 NUTS 2 areas (Regierungsbezirke) and 53 NUTS 3 areas (Kreise), the latter ranging from 1, to 1 million inhabitants.

The data used from the BA stem from data continuously collected on several labour force variables. A selection of data can be found on the website. The exogenous variables for our proposed model are again the number of employed persons and the number of registered jobless persons. This data, serving as auxiliary data, is taken from 395 labour office areas in North Rhine-Westphalia, which cannot be matched exactly to the regional administrative structures used in the German Microcensus. Since BA data are provided at several time points per year, we have taken an average over the whole year.

Jobless persons: the dataset comprises in total 8, jobless persons, varying across the areas between 83 (Heimbach) and 54,16 (Köln) jobless persons, with a mean of ,6 people.

Table 1: Descriptive statistics of number of jobless persons in 395 labour office areas in North Rhine-Westphalia
Columns: Mean, Variance, Q1, Q2 (median), Q3
Values: ,6 4,647, ,513

Employed persons: the dataset comprises in total 5.8 million people in the state of North Rhine-Westphalia, varying from 1,7 (Dahlem) to 331,56 (Köln) employed persons in labour office areas, with an average of 14,434 people.

Table 2: Descriptive statistics of number of employed persons in 395 labour office areas in North Rhine-Westphalia
Columns: Mean, Variance, Q1, Q2 (median), Q3
Values: 14, ,7,477 4,37 7,156 13,111

A first analysis based on these totals will show whether there is further potential for less aggregated data. Follow-up studies are then planned, extending the examination levels in several directions:

(i) Regional refinements: given a successful analysis of data at labour office level, a further disaggregation to municipality level would be considered.
(ii) Categorical refinements come immediately into consideration, since BA data is also available for demographic subgroups, namely by gender and 8 age groups (under 15, 15-24, 25-34, 35-44, 45-54, 55-64, and 74 and over).
(iii) Assuming positive outcomes, the study should be extended to all German states.
(iv) Borrowing additional strength by using Microcensus data at NUTS 1 or NUTS 2 level, respectively, as auxiliary variables.

Case study report [Istat, Italy]

Background

The Italian National Statistical Institute (Istat) carries out several household sample surveys with the aim of estimating parameters related to social, demographic and economic characteristics. The sample size of these surveys is adequate to provide direct estimates of reasonable accuracy only at the level of the main territorial domains (e.g. geographical regions). Therefore, direct estimates cannot respond appropriately to local targets. In the past, in order to meet those needs, Istat often resorted to oversampling. However, to overcome the financial, organizational and methodological problems connected with collecting, processing and analyzing statistical data in the oversized sample, Istat, like other National Statistical Institutes, began to focus on Small Area Estimation (SAE) methods, with the aim of improving the reliability of the estimates for small domains for which the direct estimators are not considered reliable. These methods are based on explicit/implicit models which borrow strength from the areas surrounding the area of interest (Rao, 2003). A key aspect in the production of good-quality SAE estimates is therefore the choice and validation of the underlying statistical model.

The first practical application of SAE methods to the production of official statistics at Istat concerns the annual average estimates of employed and unemployed counts at local labour market area level on the basis of Labour Force (LF) data. These estimates have been disseminated by Istat, since 1, on the basis of an agreement funded by the Ministry of Economy and Finance. Furthermore, some Regional Statistical Offices require the production of similar estimates for different municipal aggregations regarded as more useful for regional economic planning. Considering this need, and in order to increase the number of potential users, Istat in collaboration with CISIS 1 has developed a web-based system (SMART). SMART does not produce Istat official statistics but rather SAE estimates for research purposes. The users are required to validate the estimates, e.g. by comparing them over time with other available sources. It is worthwhile to note that by the end of 2011 a new version of SMART will be released. This will provide SAE estimates for a larger number of Istat surveys. Among them is the Health Survey (HS), which is the target survey of this case study. HS data are collected every five years. The last available sample data come from 2005 and they will be used to perform the model selection.

Description of survey or register, and of all data sources used

The target population of the HS is all resident individuals, excluding the population living in collective households. The sampling strategy of the survey, whose data collection is carried out through personal interviews, is a two-stage sampling design with stratification of the primary sampling units. More specifically, municipalities are the primary sampling units and households are the secondary sampling units. Stratification of the municipalities according to their demographic size is

1 CISIS, Centro Interregionale per i sistemi informatico, geografico e statistico, is a technical organ of the Conference of Italian Regions and Italian Autonomous Provinces formed to ensure effective coordination of information tools and geographic and statistical information.

47 carrie out in each Area Vasta (AV). AVs are an aggregation of Health Districts which are aministrative partitions of the Italian regions efine for the aministration an allocation of health funs an expenses. Within each AV, the larger municipalities are classifie as Self- Representing Areas (SRAs) while an Non Self-Representing Areas (NSRAs) are given by the smaller ones. The househols are selecte by means of systematic sampling in each SRA an using a two stage sample esign for NSRAs. In this case, the municipalities are the primary sampling units (PSUs), while the househols are the Seconary Sampling Units (SSUs). PSUs are arrange into strata of the same imension in terms of population size an from each strata one PSU is rawn with probability proportional to the PSU population size. SSUs are selecte by means of systematic sampling in each PSU. For all the sample municipality all members of each sample househol are interviewe. The sample sizes in the AVs are efine as a compromise between proportional an equal allocation. The small areas of interest are the HDs. Roma an Torino are ivie in more HDs, but each of them was consiere in this stuy as a whole HD. Therefore the overall number of HDs taken into account is 188. Population an sample characteristics of the HDs are shown in table 1. Mean Min Q1 Meian Q3 Max Overall Population 398,35 56,61 19,843 31,85 47,746,59,19 58,14,88 Sample ,35 4,96 18,4 Table 1. Population an sample characteristics of HDs. The parameters to be estimate at HD level are population counts of: (1) people having at least one specialist visit one, () obese people, (3) women age from 5 to 69 who having ha at least one mammography, an (4) women age from 5 to 64 who ha at least one pap test. Auxiliary ata were use in the estimation SAE process. Yearly upate population counts cross-classifie by age classes an sex are known from emographic statistics at municipality level. 
Population counts for other auxiliary variables expected to increase the predictive power of the SAE models are estimated from the LF survey. These are educational level and household type counts. For educational level, sampling units younger than 18 were assigned the maximum educational level of the household. Because the educational level counts are estimated from the LF survey, their reliability at small area level was checked.

Approach to the problem

Dealing with the small area estimation problem requires several steps. The first is the analysis of the direct estimates, and in particular of the related CVs. Table 2 displays the distribution of the CVs of the direct estimates for each of the parameters listed above. The thresholds used to define the CV classes are those suggested by ABS to check the reliability of LF survey data (ABS, 2006). The first class refers to estimates with no release restrictions, estimates in the second class should be accompanied by a warning, and estimates in the last class are not recommended for dissemination.

Variable        CV% <     CV% 33.3     CV% > 33.3
Parameter (1)
Parameter (2)
Parameter (3)
Parameter (4)

Table 2. CV% of direct estimates.
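The three-class screening of direct estimates described above can be sketched as follows. This is a minimal illustration, not the procedure used by Istat: the function names are hypothetical, the CV is assumed to be 100 times the standard error divided by the estimate, and the lower threshold is left as a required parameter because only the upper value of 33.3 is given here.

```python
import math

def cv_percent(estimate, variance):
    """Coefficient of variation in percent: 100 * sqrt(variance) / estimate."""
    if estimate <= 0:
        raise ValueError("estimate must be positive")
    return 100.0 * math.sqrt(variance) / estimate

def cv_class(cv, lower, upper=33.3):
    """Return 1 (no restriction), 2 (release with warning) or 3 (not recommended).

    `lower` is a user-supplied threshold (assumption); `upper` is the 33.3
    cut-off quoted in the text.
    """
    if cv < lower:
        return 1
    if cv <= upper:
        return 2
    return 3

def cv_distribution(estimates, variances, lower, upper=33.3):
    """Count how many small area estimates fall in each of the three classes."""
    counts = {1: 0, 2: 0, 3: 0}
    for est, var in zip(estimates, variances):
        counts[cv_class(cv_percent(est, var), lower, upper)] += 1
    return counts
```

Calling cv_distribution with the direct estimates and their estimated variances for the 188 HDs, and a chosen lower threshold, would give one row of a table laid out like table 2.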

It is clear from table 2 that SAE methods would be particularly useful for the estimation of parameters (1) and (2), for which a large number of small area estimates fall in the second class. For parameters (3) and (4) the direct estimates seem to be reliable. Nonetheless, in order to evaluate a possible gain in efficiency, the SAE methods have been applied to parameter (3) as well. By contrast, the high precision attained by the direct estimates for parameter (4) suggests that SAE methods need not be applied in that case.

When the direct estimates are considered too unstable, the best approach is to borrow strength from the other areas by means of explicit or implicit use of models. The choice of the SAE method depends on the data collected and on how the phenomena under study are spread over the territory. The formal specification of SAE models is useful for selecting the best set of auxiliary information and the most suitable broad area.

Methods

SAE methods

The SAE models investigated are linear mixed models (LMMs) and generalized linear mixed models (GLMMs), namely the logistic mixed model, in both cases with uncorrelated area random effects. Model selection for each target variable was carried out using diagnostic criteria such as maximum likelihood, AIC and BIC, as in Boonstra et al. (2008), Boonstra, Buelens and Smeets (2009) and D'Alò et al. (2009). Except for variable (3), for which the most predictive variable is educational level, the cross-classification of sex and age proved to be very predictive. Different age classes were defined for males and females, and for each of the three variables. To choose the optimal classification of age into classes, a non-parametric method based on spline functions and CART classification were used. Broad areas defined according to geographical characteristics were also considered when selecting the broad area; they performed worse than using the whole territory as the broad area.
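For reference, the two model families named above are commonly written as follows. This is the standard textbook form (see e.g. Rao, 2003), not a specification taken from this report; the symbols y_ij (response of unit j in area i), x_ij (auxiliary variables), beta (fixed effects), u_i (area random effect), e_ij (unit error), sigma_u^2, sigma_e^2 and p_ij are notation introduced here for illustration.

```latex
% Unit level linear mixed model with uncorrelated area random effects
y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i + e_{ij},
\qquad u_i \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2)

% Logistic mixed model for a binary response y_{ij}
\operatorname{logit}(p_{ij}) = \log\frac{p_{ij}}{1 - p_{ij}}
  = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i,
\qquad y_{ij} \mid u_i \sim \mathrm{Bernoulli}(p_{ij})
```

In both cases the area effects u_i are assumed independent across areas, which corresponds to the "uncorrelated area random effects" mentioned in the text.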
In this case study, both area and unit level estimators based on linear mixed models were considered. Moreover, given the binary nature of the target variable observed on the units of interest, a more appropriate unit level logit estimator was also considered.

Auxiliary information

The variables considered in the experimental study for model selection are some of the variables usually used in the household surveys carried out by Istat: sex, age, educational level, and household type. As far as age is concerned, besides checking for a possible cross-classification effect with sex, we need to define the optimal age classes for each target variable. Two alternative methods have been used. The first is a nonparametric regression method based on penalized splines (Eilers and Marx, 1996; Ruppert et al., 2003). The second classification results from the use of regression tree methods (CART, Classification and Regression Tree; Breiman et al., 1984). Furthermore, CART has been used to define a complete partition of the population taking into account the whole set of auxiliary variables. Unfortunately, the resulting classification cannot be used in the estimation process, since the estimates of the resulting population counts are not reliable.

For educational level we considered two levels: ISCED levels 0 to 2, and ISCED levels 3 to 6. The choice to split the variable into only two levels derives from the lack of reliable estimates of the population counts for finer classifications. Estimates of the corresponding population counts are obtained from Labour Force survey 2005 data. To improve the explanatory power of educational level, we assigned the maximum educational level of the parents to all sampling units younger than 18.

Two distinct classifications have been studied for household type. The first contains seven groups: single person, couple with or without children and without isolated persons, single parent, couple with or without children with isolated persons, single parent with isolated persons, multi-family household, one family household. The second is coarser and has only two levels: single and not single. It can be obtained from the first by joining the first, third, fifth and seventh groups into the level single, and the second, fourth and sixth groups into the level not single. Only the population counts for the second classification can be estimated with a level of precision adequate for use in the SAE process. A further classification was obtained by means of the CART algorithm. However, it was not used, since it was not possible to map the classification onto the LF survey data in order to compute the related population count estimates.

Spline

Spline functions are obtained by dividing the interval of definition into subintervals and choosing for each a polynomial of degree d. Consecutive polynomials are then required to join smoothly. The resulting function is called a spline function. The idea is to use the subintervals in which the spline function can be considered approximately linear as a possible classification of the auxiliary variable. The plots in figure 1 show the spline functions for target parameter (1), computed separately by sex.

Figure 1. Spline functions for linear and logistic mixed model for M (up) and F (down). People having at least one specialist visit.
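The idea of reading age classes off the approximately linear stretches of a fitted curve can be sketched as below. The report does not state how the subintervals were identified (the classes appear to have been read from the plots), so the rule used here, starting a new class whenever the local slope moves away from the slope at the start of the current class by more than a tolerance, is an assumption made for illustration. The function name and the tolerance parameter are hypothetical.

```python
def linear_segments(ages, fitted, tol):
    """Split an age grid into runs over which the fitted curve is roughly linear.

    A new class starts whenever the slope between consecutive grid points
    differs from the slope at the start of the current run by more than tol.
    Returns a list of (first_age, last_age) tuples; neighbouring classes
    share their breakpoint.
    """
    if len(ages) != len(fitted) or len(ages) < 2:
        raise ValueError("need at least two matching grid points")
    slopes = [(fitted[i + 1] - fitted[i]) / (ages[i + 1] - ages[i])
              for i in range(len(ages) - 1)]
    segments = []
    start = 0
    ref = slopes[0]
    for i in range(1, len(slopes)):
        if abs(slopes[i] - ref) > tol:
            segments.append((ages[start], ages[i]))
            start = i
            ref = slopes[i]
    segments.append((ages[start], ages[-1]))
    return segments
```

Here ages would be a grid of ages and fitted the values of the smooth term evaluated on that grid; a smaller tolerance yields more, narrower classes.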

From figure 1, the following classification for the variable people having at least one specialist visit is derived:

Group 1: 1 years M and F
Group 2: 11 5 years M
Group 3: 6 6 years M
Group 4: 61 7 years M
Group 5: years M
Group 6: >85 years M
Group 7: 11 5 years F
Group 8: 6 35 years F
Group 9: 36 5 years F
Group 10: 51 6 years F
Group 11: 61 7 years F
Group 12: years F
Group 13: >85 years F

The plots in figures 3 and 4, describing the spline functions for the linear mixed model and the logistic mixed model respectively, show different patterns. Therefore, in this case it is advisable to assume distinct classifications for the linear and the logistic models.

Figure 3. Spline functions for linear mixed model for M (up) and F (down). Obese people.

The analysis of the plots in figure 3 suggests the following classification:

Group 1: 14 years M and F
Group 2: 15 4 years M
Group 3: 41 6 years M
Group 4: 61 8 years M
Group 5: >81 years M
Group 6: 15 4 years F
Group 7: years F
Group 8: years F
Group 9: >85 years F

Figure 4. Spline functions for logistic mixed model for M (up) and F (down). Obese people.

From figure 4, we can draw the following classification for the cross-classification of sex and age:

Group 1: 1 years M and F
Group 2: 11 5 years M
Group 3: 6 6 years M
Group 4: 61 7 years M
Group 5: years M
Group 6: >85 years M
Group 6: 11 5 years F
Group 7: 6 35 years F
Group 8: 36 5 years F
Group 9: 51 6 years F
Group 10: 61 7 years F
Group 11: years F
Group 12: >85 years F

The range for target variable (3) is women aged from 50 to 69; therefore, the x axis range is a subset of the range in the previous plots.

Figure 5. Spline functions for linear mixed model for women aged from 50 to 69 who have had at least one mammography.

The resulting classification is reported below.

Figure 6. Spline functions for logistic mixed model for women aged from 50 to 69 who have had at least one mammography.

Examining the plots in the previous figure, the following classification for the variable age is suggested: years years years

Classification and Regression Tree

Classification trees are non-parametric statistical algorithms that help in identifying and defining predictive models. In particular, they provide a partition of the population based on the statistical relationship between a dependent variable and multiple independent variables (both qualitative and quantitative). The final output is a hierarchical tree in which each statistical unit follows a path from the root, the starting node that includes all units, to a leaf, namely a terminal node.

In the applications developed for the case study, the CART algorithm was used. It is particularly suitable when, as in this case, the dependent variable is binary. The CART algorithm allows only binary splits at each node. Starting from the root node, each split is the best possible subdivision of the units, such that the child nodes are more homogeneous than the parent node with respect to the response variable. This goal is achieved by using the Gini index of impurity: at each step, the split chosen is the one that maximizes the reduction in impurity of the child nodes compared to the parent node. Characteristic elements of the CART algorithm are as follows:

- the independent variables can be either categorical or numerical;
- an independent variable can be used several times in subsequent steps.

The iterative procedure stops when further subdivision of the nodes does not cause a decrease in impurity greater than a threshold, or when the number of leaves is equal to a threshold fixed in advance.

For each of the three target variables, three types of analysis through the application of CART have been performed.

target variable  analysis  sex  age  educational level  household type
(1)              1         x    x    x                  x
(1)              2         x    x
(1)              3                                      x
(2)              1         x    x    x                  x
(2)              2         x    x
(2)              3                                      x
(3)              1              x    x                  x
(3)              2              x
(3)              3                                      x

Table 3. Analysis performed for each target variable.

Table 3 summarizes the classifications considered for each of the target variables. The three types of analysis correspond to three separate applications of the CART algorithm: to the whole set of auxiliary variables (analysis 1), to the cross-classification of sex and age (analysis 2), and to the classification of household type (analysis 3). As stated in the previous sections, the application of CART was aimed not mainly at choosing the most relevant auxiliary variables, but at identifying optimal classifications for each of them.
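The split criterion described above, choosing the binary split that maximizes the reduction in Gini impurity, can be sketched for a single numeric variable such as age and a binary response. This is a minimal illustration of one splitting step only, not the SPSS implementation used in the case study; the function names are hypothetical, and the stopping rules and the handling of categorical variables are left out.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(x, y):
    """Find the threshold t on numeric x that maximizes the impurity reduction
    of the binary split (x <= t) versus (x > t).

    Returns (threshold, reduction); threshold is None if no split helps.
    """
    n = len(y)
    parent = gini(y)
    best_t, best_gain = None, 0.0
    values = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive observed values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        # Impurity of the children, weighted by their share of the units.
        children = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - children
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

Applying best_split recursively to each child node, until the gain falls below a threshold or a maximum number of leaves is reached, gives the kind of partition of age shown in the classification trees.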

Figure 7. Classification tree for sex and age. People having at least one specialist visit.

Figure 8. Classification tree for sex and age. Obese people.

Figure 9. Women aged from 50 to 69 who have had at least one mammography.

From the classification trees in figures 7, 8 and 9, the following cross-classification of sex and age has been defined for each target variable:

People having at least one specialist visit

Group 1: 3 years M and F
Group 2: 3 5 years M and F
Group 3: 6 16 years M
Group 4: years M
Group 5: 35 5 years M
Group 6: >5 years M
Group 7: 6 5 years F
Group 8: 6 38 years F
Group 9: years F
Group 10: 57 8 years F
Group 11: >8 years F

Obese people

Group 1: 17 years M and F
Group 2: 18 7 years M
Group 3: 8 31 years M
Group 4: 3 4 years M
Group 5: 43 5 years M
Group 6: years M
Group 7: years M
Group 8: > 81 years M
Group 9: 18 7 years F
Group 10: 8 4 years F
Group 11: 43 5 years F
Group 12: years F
Group 13: years F
Group 14: >81 years F

Women aged from 50 to 69 who have had at least one mammography

years

Table 4 displays all the models considered in the model selection process. Both the spline-based method and CART were used to create a cross-classification of sex and age classes. Furthermore, all the models were estimated considering three different broad areas: the whole national territory, a partition into three broad areas (North, Centre and South), and a partition into five broad areas (North-West, North-East, Centre, South, Islands).

model    Description
model1   sex age classes
model2   sex age classes + household type
model3   sex age classes + educational level
model4   sex age classes + educational level + household type

Table 4. List of models.

Models for parameter (3), women aged from 50 to 69 who have had at least one mammography, do not include the variable sex.

Quality assurance methods

After computing the EBLUP for the LMMs and the EBLUP-type estimator for the GLMMs and evaluating their MSE, model properties were assessed by means of bias diagnostics. All these diagnostics are based on the comparison between model-based and direct estimates (Brown et al., 2001; Heady et al., 2003). MSE was estimated on the basis of the analytic mean cross product error, including the g1, g2, g3 and g4 components. The hypothesis underlying the bias diagnostic is that, even if the direct estimates have high variability, they can be considered unbiased or only slightly biased. One of the diagnostic measures proposed by Brown et al. (2001) is computed by fitting a linear model between direct and model-based estimates. Under the hypothesis that no bias is introduced by the model, the scatter plot of direct and model-based estimates is expected to be randomly spread around the line y=x.

Software

All the computations were performed using the software R, with the exception of CART, which was implemented by means of SPSS. The other non-parametric method used to determine the age classes, based on spline functions, was carried out using the gam function in the package mgcv, available from CRAN.
All the LMMs and GLMMs were compared for model selection by computing the AIC and BIC criteria using the function lmer in the R package lme4. Finally, small area estimates and MSE estimates were calculated by applying the R functions developed by Istat within WP4. These functions perform small area prediction and MSE evaluation for unit level linear mixed models with uncorrelated and spatially correlated area random effects (mixed.unit.sae and spatial.eblup), for area level linear mixed models (mixed.area.sae), and for logistic mixed models (logistic.mixed.sae).
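The comparison of candidate models by AIC and BIC can be illustrated with the usual definitions, AIC = 2k - 2 log L and BIC = k log(n) - 2 log L, where k is the number of estimated parameters and n the number of observations. The sketch below is in Python rather than R and only reproduces the arithmetic of the criteria from a fitted log-likelihood; it is not the lme4 workflow used in the case study, and the function names are hypothetical.

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion from a maximized log-likelihood."""
    return 2.0 * n_params - 2.0 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion from a maximized log-likelihood."""
    return n_params * math.log(n_obs) - 2.0 * loglik

def rank_models(fits, n_obs):
    """Rank models by AIC (smaller is better).

    `fits` maps a model name to (log-likelihood, number of parameters).
    Returns a list of (name, AIC, BIC) tuples sorted by AIC.
    """
    rows = []
    for name, (loglik, k) in fits.items():
        rows.append((name, aic(loglik, k), bic(loglik, k, n_obs)))
    return sorted(rows, key=lambda r: r[1])
```

The criteria are comparable only across models fitted to the same data by the same likelihood, which is why the comparison is made separately for each model type and each broad-area definition.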

Results

This section describes the main results obtained in the case study. From now on, for the sake of simplicity, only the results related to the variable obese people are reported.

Table 5a. AIC, BIC and log-likelihood (LLH) for the unit level linear mixed model (with uncorrelated random effects) for the parameter obese people, for model1 to model4 and for three broad-area definitions (Italy, 3 macroareas, 5 macroareas). Cross-classification of sex and age by means of splines. The figures displayed in red denote the best values of the AIC, BIC and log-likelihood criteria.

Table 5b. AIC, BIC and log-likelihood (LLH) for the unit level linear mixed model (with uncorrelated random effects) for the parameter obese people, for model1 to model4 and for three broad-area definitions (Italy, 3 macroareas, 5 macroareas). Cross-classification of sex and age by means of CART. The figures displayed in red denote the best values of the AIC, BIC and log-likelihood criteria.

Table 5c. AIC, BIC and log-likelihood (LLH) for the logistic mixed model for the parameter obese people, for model1 to model4 and for three broad-area definitions (Italy, 3 macroareas, 5 macroareas). Cross-classification of sex and age by means of splines. The figures displayed in red denote the best values of the AIC, BIC and log-likelihood criteria.

Table 5d. AIC, BIC and log-likelihood (LLH) for the logistic mixed model (with uncorrelated random effects) for the parameter obese people, for model1 to model4 and for three broad-area definitions (Italy, 3 macroareas, 5 macroareas). Cross-classification of sex and age by means of CART. The figures displayed in red denote the best values of the AIC, BIC and log-likelihood criteria.

Table 5e. AIC, BIC and log-likelihood (LLH) for the area level model for the parameter obese people, for model1 to model4. Cross-classification of sex and age by means of splines (columns 2 to 4) and CART (columns 5 to 7). The figures displayed in red denote the best values of the AIC, BIC and log-likelihood criteria.

Table 6 shows the distribution of the CVs of the model-based estimates for target variable (2). All the small area estimation methods seem to perform well.

Estimator                  Variable        CV% <    CV% 33.3    CV% > 33.3
EBLUP unit level           Parameter (2)   186
EBLUP area level           Parameter (2)   188
EBLUP spatial unit level   Parameter (2)   186

Table 6. CV% of model-based estimates.

The goodness of fit diagnostics are used to test whether the model-based estimates are affected by bias. This is done by comparing the model-based estimates to the direct estimates (ID, WP3 report). Figures 10 to 12 display the model bias diagnostic results for model-based estimates computed using the unit level LMM (figure 10), the area level LMM (figure 11), and the logistic mixed model (figure 12). A hypothesis test can be carried out to verify whether the regression line departs from the line y=x.
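The bias diagnostic described above can be sketched as an ordinary least squares fit of the direct estimates on the model-based estimates, followed by t statistics for the hypotheses intercept = 0 and slope = 1. This is a simplified, unweighted version written for illustration; the report does not give the exact form of the fit (for instance whether the estimates are transformed or weighted), and the function name is hypothetical.

```python
import math

def bias_diagnostic(model_based, direct):
    """Fit direct = a + b * model_based by OLS.

    Returns the intercept, the slope, and t statistics for the hypotheses
    a = 0 and b = 1 (to be compared with a t distribution on n - 2 degrees
    of freedom).
    """
    n = len(direct)
    if n != len(model_based) or n < 3:
        raise ValueError("need at least three paired estimates")
    mx = sum(model_based) / n
    my = sum(direct) / n
    sxx = sum((x - mx) ** 2 for x in model_based)
    sxy = sum((x - mx) * (y - my) for x, y in zip(model_based, direct))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((y - a - b * x) ** 2 for x, y in zip(model_based, direct))
    s2 = rss / (n - 2)  # residual variance
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1.0 / n + mx ** 2 / sxx))
    return {
        "intercept": a,
        "slope": b,
        "t_intercept": a / se_a if se_a > 0 else float("nan"),
        "t_slope": (b - 1.0) / se_b if se_b > 0 else float("nan"),
    }
```

A slope far from 1, or an intercept far from 0, relative to their standard errors indicates that the model-based estimates depart systematically from the direct ones.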

Figure 10. EBLUP unit level vs direct estimates. Obese people.

Figure 11. EBLUP area level vs direct estimates. Obese people.

Figure 12. EBLUP logit model vs direct estimates. Obese people.

The plots show that the slope estimates for the three models are very different from 1, implying a bias for all models. By contrast, for all the models the intercept is, as expected, very close to zero. Even though failing the model bias diagnostic may suggest poor performance of the estimators, analyses carried out within the ESSnet SAE group suggested that this diagnostic should not be treated as a strict constraint when establishing the quality of an estimation method. Furthermore, separate analyses for different groups of areas could lead to more satisfying results.

Figure 13. Confidence intervals for EBLUP unit level sorted by area size. Obese people.

Figure 14. Confidence intervals for EBLUP area level sorted by area size. Obese people.

Figure 15. Confidence intervals for logistic EBLUP-type sorted by area size. Obese people.

Figures 13 to 15 show that an effective reduction in confidence interval size for the EBLUP or the logistic EBLUP-type estimator is evident only for the logistic model. As expected, the confidence intervals derived from the logistic model are wider than the corresponding intervals resulting from the linear model. This is because the estimates of MSE from an LMM are known to be negatively biased when a departure from normality is observed.
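The comparison of interval widths can be made concrete with a small sketch. The report does not state how the plotted intervals were computed; the code below assumes normal-theory intervals of the form estimate plus or minus z times the square root of the estimated MSE, with z = 1.96 for a nominal 95% level. The function names are hypothetical.

```python
import math

def confidence_interval(estimate, mse, z=1.96):
    """Normal-theory interval: estimate +/- z * sqrt(mse)."""
    half = z * math.sqrt(mse)
    return estimate - half, estimate + half

def width_ratio(mse_model, mse_direct):
    """Ratio of the model-based to the direct interval width.

    With the same z for both intervals, the ratio reduces to the square
    root of the ratio of the two MSE estimates.
    """
    return math.sqrt(mse_model / mse_direct)
```

Under this assumption, an underestimated MSE translates directly into intervals that are too narrow, which is the concern raised above for the linear mixed model.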

Figure 16. Spatial distribution of the synthetic component of the EBLUP based on the unit level linear mixed model. Obese people.

Figure 17. Spatial distribution of the EBLUP based on the unit level linear mixed model. Obese people.

Figures 16 and 17 show that the synthetic estimates take lower values in central and southern Italy, while the highest values correspond to the area surrounding the Po river. The EBLUP estimates give completely opposite indications: they identify the central and northern part of the country as the regions with the lowest estimates, and the south and the islands as the areas with the largest ones. EBLUP estimates are to be preferred, since the synthetic estimates, being affected by a strong shrinkage effect around the mean line, are extremely sensitive to minor changes.

Figure 18. Spatial distribution of the synthetic component of the EBLUP based on the area level linear mixed model. Obese people.

Figure 19. Spatial distribution of the EBLUP based on the area level linear mixed model. Obese people.

This is confirmed by the plots in figures 18 and 19. The plots make use of a nonparametric model based on spline functions of latitude and longitude for the synthetic and EBLUP estimates (see Eilers and Marx, 1996; Ruppert et al., 2003). In both graphs the estimates show a spatial trend from north-west to south-east, with a weaker presence of the phenomenon in northern Italy.

Figure 20. Spatial distribution of EBLUP-type based on the logistic mixed model. Obese people.

Finally, the analogous plot for the logistic model (figure 20) presents results similar to those of the EBLUP based on the linear mixed model.

Figure 21. Residual distribution (left) and random effect distribution (right) for the unit level linear mixed model. Obese people.

Figure 22. Residual distribution (left) and random effect distribution (right) for the area level linear mixed model. Obese people.

Figure 23. Residual distribution (left) and random effect distribution (right) for the logistic mixed model. Obese people.

The area random effects, displayed in figures 21 to 23, seem to have a distribution close to normal, but they display spatial autocorrelation with respect to the ASL code, the codes being sorted by geographical position.

References

Australian Bureau of Statistics (2006). A guide to small area estimation. Version 1.1. ID 55, WP3 report.

Breiman, L., Friedman, J. H., Olshen, R. A., Stone, C. J. (1984). Classification and Regression Trees. Wadsworth.

Brown, J., Chambers, R., Heady, P., Heasman, D. (2003). Evaluation of small area estimation methods: an application to the unemployment estimates from the UK LFS. In Proceedings of Statistics Canada Symposium 2001. ID, WP3 report.

Heady, P., Clarke, P., Brown, G., Ellis, K., Heasman, D., Hennell, S., Longhurst, J., Mitchell, B. (2003). Small Area Estimation Project Report. Model-Based Small Area Estimation Series, No. 2, ONS Publication. ID 1, WP3 report.

Boonstra, H. J., Buelens, B., Smeets, M. (2009). Estimation of unemployment for Dutch municipalities. Small Area Estimation 2009 Conference, Elche (Spain), June 29-July 1. ID 51, WP3 report.

Boonstra, H. J., Brakel, J. van den, Buelens, B., Krieg, S., Smeets, M. (2008). Towards small area estimation at Statistics Netherlands. Metron, Vol. LIV, No. 1. ID 5, WP3 report.

D'Alò, M., Di Consiglio, L., Falorsi, S., Solari, F. (2008). Towards small area estimation at Statistics Netherlands. ISI 2009 Conference, Durban (South Africa), 16-22 August. ID 7, WP3 report.

Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statist. Sci., 11.

Rao, J. N. K. (2003). Small Area Estimation. Wiley, Hoboken (NJ).

Ruppert, D., Wand, M. P. and Carroll, R. J. (2003). Semiparametric Regression. Cambridge University Press, Cambridge.

Case study report Statistics Netherlands

Bart Buelens, Jan van den Brakel, Harm Jan Boonstra, Marc Smeets and Virginie Blaess

Background

Information on crime victimization, public safety and satisfaction with police performance is obtained by the Dutch National Safety Monitor (NSM). This is an annual survey which was conducted in the first quarter of each year, from 2006 through 2008, at full scale with about 19,000 respondents. It is designed to provide adequate direct estimates at the national level and at the level of police districts, a subdivision of the Netherlands into 25 regions. In 2008 the NSM was redesigned and changed to the Integrated Safety Monitor (ISM), which is conducted annually in the fourth quarter, starting in 2008. In order to maintain consistent series, the NSM is conducted in parallel, with a size of about 6,000 respondents, in 2008, 2009 and 2010. The sample allocated to the parallel NSM is designed to provide direct estimates with sufficient precision at the national level only. Nevertheless there is interest in estimates based on the NSM for the 25 police districts. The sample allocated to the NSM in the parallel run is only a third of the sample size allocated to the regular survey. As a result, this sample size is too small to obtain adequate direct estimates even for the planned domains of the ISM. In this case study model-based estimators are considered in order to construct estimates of sufficient accuracy for these domains.

The reason for continuing the NSM for three years in parallel with the ISM is twofold. First, government policy on crime and public safety was to be measured through indicators derived from the NSM. It was necessary to have a continuous series until the year 2010, the end of that particular program. Second, conducting the parallel surveys allows for the quantification of discontinuities in the time series of survey variables measured in both surveys. Details of the latter aspect of the NSM are outside the scope of the present case study.
Description of data sources

NSM survey data

The National Safety Monitor (NSM) is based on a stratified two-stage sample design of persons aged 15 years or older residing in the Netherlands. The Netherlands is divided into 25 police districts, which are used as the stratification variable in the sample design. In the first stage municipalities are drawn with selection probabilities proportional to population size. In the second stage, persons are drawn from the municipalities selected in the first stage, with a minimum sample size of one. In the first quarters of 2006, 2007 and 2008, the survey is conducted at full scale with a net sample size of about 19,000 respondents. In the last quarters of 2008, 2009 and 2010, the NSM is conducted at a limited scale, in parallel with the ISM, with a net sample size of about 6,000 respondents.

The data collection of the NSM is based on a mixed mode design using computer assisted personal interviewing (CAPI) and computer assisted telephone interviewing (CATI). Persons with a non-secret permanent telephone connection are interviewed by telephone, and other persons are interviewed face-to-face. The generalized regression (GREG) estimator (Särndal et al. 1992) is used to estimate population parameters at the national level for the years that the NSM is conducted at a limited scale, and also for police districts in the years that the survey is conducted with full sample size.

The focus of this case study is on the NSM in the years 2008 and 2009, when it was conducted at limited scale. Data for the year 2010 were not available at the time this research was started. The sample size of the limited scale NSM is too small to produce adequate direct estimates for the police districts, which are therefore considered as the small areas. The purpose of this study is to produce model-based estimates for these areas.

Auxiliary data

Four important sources of auxiliary information from registers and surveys are distinguished:

1) Demographic information is available or derived from the municipal administrations: gender, age, ethnicity, etc. The level of urbanization (address density) and mean house prices are also available from administrative registers.
2) The Police Register of Reported Offences (PRRO) contains information about the number and type of crimes and offences reported to the police. Data are available at an aggregated level, per police district.
3) Long standing surveys, like the Dutch crime victimization survey, are conducted repeatedly over time. In these cases there is also sample information available from preceding periods. In particular, full scale NSM surveys were conducted prior to the limited scale ones.
4) The NSM is conducted in parallel with the ISM. From the ISM there are adequate direct estimates available for police districts, as the ISM sample is larger. Estimates from both surveys will differ due to the redesign, but can be expected to correlate well.

Statistics Netherlands has conducted the ISM since 2008 with a sample size of about 19,000 respondents and employs a sampling design similar to that of the NSM.
In addition, municipalities and police districts can participate in the survey on a voluntary basis. In these regions, additional samples are drawn with the purpose of providing precise estimates for these regions. Statistics Netherlands collects data for the national sample using a sequential mixed mode design based on web interviewing (WI), paper and pencil interviewing (PAPI), CAPI and CATI. All persons included in the sample receive an advance letter in which they are asked to complete a questionnaire via the internet (WI). Persons can receive a paper version of the questionnaire on request (PAPI). After two reminders, non-respondents are contacted by telephone, if a telephone number is available, to complete the questionnaire (CATI). The remaining persons are visited at home by an interviewer to complete the questionnaire face to face (CAPI).

The data collection in the additional regional samples is conducted by various market research bureaus. For these samples the WI, PAPI and CATI modes are mandatory. The use of the CAPI mode is recommended but not mandatory, since this mode is very costly. For the purpose of the present case study, the data collected by market research bureaus in selected regions are ignored. In what follows, the ISM is understood to refer to the sample excluding the regional oversampling.

The NSM is a multipurpose survey that collects information about many outcome variables. In the present study five key survey variables are considered, see Table 1. Variables sourced from the municipal administration are listed in Table 2. Table 3 shows the variables obtained from the Police Register of Reported Offences. Table 4 lists the variables from the ISM and past NSM surveys considered for use as covariates. These include precisely the five target variables under consideration, extended with seven more variables of potential benefit. All auxiliary variables are named with prefixes indicating their source.

Table 1: Five key NSM survey variables considered in the present study.

Variable    Description
nuisance    perceived nuisance in the neighbourhood on a ten point scale; this includes nuisance by drunk people, neighbours, or groups of youngsters, harassment, and drug related problems
unsafe      percentage of people feeling unsafe at times
propvict    percentage of people saying to have been victim to property crime in the last 12 months
violvict    percentage of people saying to have been victim to violent crime in the last 12 months
satispol    percentage of people satisfied with police at their last contact (if contact in last 12 months)

Table 2: Auxiliary data from administrative registers. Data are at police district level.

Variable        Description
adm_immigr      percentage of immigrants in population
adm_immigrnw    percentage of non-western immigrants in population
adm_young       percentage of young people (aged 15 to 25)
adm_old         percentage of elderly people (aged over 65)
adm_urban       level of urbanization (in 5 categories)
adm_house       mean house price
adm_benefit     percentage of social benefit claimants

Table 3: Auxiliary data from the Police Register of Reported Offences. Figures are reported offences per 1 inhabitants.

Variable         Description
prro_propcrim    property crimes
prro_bicycle     bicycle thefts (subset of property crimes)
prro_violcrim    violent crimes
prro_assault     physical assaults (subset of violent crimes)
prro_threat      threats (subset of violent crimes)
prro_traffic     traffic offences
prro_drugs       illicit drug offences
prro_weapon      weapon offences
prro_damage      damage to public and private property
prro_puborder    disturbance of public order

Table 4: Auxiliary data from the parallel ISM surveys, and from the past NSM survey. The first five variables correspond to the NSM survey target variables. The prefix surv is either ism or nsm, reflecting the source of the variable.

Variable         Description
surv_nuisance    perceived nuisance in the neighbourhood on a ten point scale; this includes nuisance by drunk people, neighbours, or groups of youngsters, harassment, and drug related problems
surv_unsafe      percentage of people feeling unsafe at times
surv_propvict    percentage of people saying to have been victim to property crime in the last 12 months
surv_violvict    percentage of people saying to have been victim to violent crime in the last 12 months
surv_satispol    percentage of people satisfied with police at their last contact (if contact in last 12 months)
surv_degrad      perceived degradation of the neighbourhood, on a ten point scale
surv_funcpol     opinion on functioning of the police, on a ten point scale
surv_victim      percentage of people saying to have been victim to crime
surv_crimes      number of offences per 1 inhabitants (derived from victim reports)
surv_propcrim    number of property crimes per 1 inhabitants
surv_violcrim    number of violent crimes per 1 inhabitants
surv_bicycle     number of bicycle thefts per 1 inhabitants

Approach to the problem

Direct estimates for police districts are not precise enough, which motivates the investigation of alternatives. Small area estimation methods are designed specifically for such situations. The direct estimates are obtained through generalized regression (GREG) (Särndal et al., 1992) using a model containing the covariates gender, age, ethnicity, degree of urbanization, household size, marital status, income and police district.

For successful model-based inference, auxiliary data that correlate well with the target variables are needed. In addition to the variables used in the GREG model, registrations of crimes and offences reported to the police are good candidate covariates, at least for variables related to victimization. In this particular case study, the survey of interest is conducted in parallel with a larger survey on the same topic, crime. Although the data collection modes and questionnaires differ between the NSM and ISM, it can be expected that the outcomes of both surveys correlate rather well. Finally, the survey has a history of several years, so past statistics may again correlate well with present ones. In combination, variables from these different sources could have a lot of predictive power, enabling the successful use of model-based methods.

A limitation here is that many covariates are available only at the area level. The police records are provided to us at an aggregated level, per district. Furthermore, using outcomes from other sample surveys, either from the past or conducted in parallel, forces the use of aggregated data, as units sampled in one survey are generally not sampled in the other.

Considering model-based approaches, there are two extreme ways to handle small areas. Either the areas are included as fixed effects, resulting in one regression coefficient per area and effectively just giving the direct estimates as a result, or they are not included at all, which would imply a synthetic estimator.
The intermediate form is to use a mixed effects model and to include area as a random effect. In that case there is one parameter related to the areas: the between-area variance. This basic yet powerful model is the one pursued here. It is widely used and fairly well established in the small area literature, and is commonly known as the Fay-Herriot model (Fay and Herriot, 1979; Rao, 2003).

Since in the present study there are many potential covariates, one of the aims is selecting which covariates to use in the models. This includes an attempt to reduce the dimension of the covariate space. The ultimate aim of course is to produce estimates for the small areas that have lower mean squared errors (MSEs) than the direct estimates.

It is generally a good idea to conduct a simulation study when possible. Since we do not have micro-level data available for all covariates, a simulation study based on repeated sampling from a population is not trivial. In particular, the construction of such a population is problematic as it would require modelling and model assumptions in the first place. Earlier simulation studies conducted at Statistics Netherlands in the context of the Labour Force Survey confirmed the validity of the procedure that we follow here for the crime survey (Boonstra et al., 2008). In particular, this earlier work showed the usefulness of the model selection criteria we will use, and of the Hierarchical Bayesian estimation approach, discussed in more detail below.

Methods

SAE method

As motivated above, the basic area level model (Fay and Herriot, 1979) is considered in this application. The Fay-Herriot model is a linear mixed model and estimation commonly proceeds using Empirical Best Linear Unbiased Prediction (EBLUP), see Rao (2003) for details. Frequently applied methods to estimate the variance of the random area effects are the Fay-Herriot moment estimator, maximum likelihood and restricted maximum likelihood estimation. A weakness of these methods is that in some situations the estimated model variance tends to zero, see e.g. Bell (1999) and Rao (2003). Zero estimates can occur when the between-area variation that remains after controlling for the covariates is small, for example in the case of strong auxiliary information. Zero estimates can also occur when the number of areas is small, resulting in imprecise variance estimates. This is the case in the present application of the NSM.

A zero or a significantly underestimated model variance leads to undesirable situations, as in these cases the EBLUP estimator gives too much weight to the synthetic regression part, and too little weight to the direct estimates, even in domains with larger sample sizes. This results in less plausible domain predictions. Furthermore there is a large risk of MSE underestimation because the area effects are estimated at zero and the variation between areas is attributed entirely to the variation in the auxiliary variables, which is in most cases not realistic.

These problems can be avoided with the Hierarchical Bayesian (HB) approach. Therefore the full HB approach for the Fay-Herriot model, summarized by Rao (2003), section 10.3, is used as an alternative in the present case study. Small area estimates and MSE estimates are obtained as the posterior mean and the posterior variance of the posterior density of the domain parameters. The Bayesian estimate for the model variance is obtained as the mean of the posterior density of the model variance, assuming a non-informative prior.
This estimator is always positive, and therefore avoids the problems with zero or even negative estimates for the model variance, resulting in more plausible parameter and MSE estimates for the domain variables. A similar approach was previously followed at Statistics Netherlands for applications of small area estimation in the Labour Force Survey (Boonstra et al., 2008, 2011).

An important part of the available auxiliary information is based on a sample survey, as explained before, resulting in auxiliary variables observed with sampling errors. Ybarra and Lohr (2008) developed an EBLUP estimator under the Fay-Herriot model to account for uncertainty in the auxiliary variables. In the present application the sampling error in the auxiliary variables is approximately constant over the domains, since the allocation of the sample in the regular survey of the NSM and the ISM is designed such that about 75 respondents are obtained within each police district, as explained before. If the sampling error in the auxiliary information is constant over the domains, Ybarra and Lohr (2008) argue that the variance of the sampling error in the auxiliary variables can be ignored. The underlying idea is that the uncertainty in the covariates will be reflected in a worse fitting model, and hence in the absorption of this uncertainty in the estimated model variance.
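The contrast between a likelihood-based estimate of the model variance and its posterior mean can be illustrated with a minimal numerical sketch. It assumes a flat prior on the regression coefficients and on the model variance, in which case the posterior of the model variance is proportional to the restricted likelihood, and it evaluates that posterior on a truncated grid. The toy data, the grid bounds and all names (`restricted_loglik`, `psi`, `sigma2`) are hypothetical and are not taken from the case study, which was implemented in R.

```python
import numpy as np

def restricted_loglik(sigma2, y, X, psi):
    """Restricted log-likelihood of the model variance (up to a constant)."""
    v = sigma2 + psi                       # diagonal of V = sigma2*I + diag(psi)
    XtVinv = X.T / v                       # X' V^-1
    A = XtVinv @ X                         # X' V^-1 X
    beta = np.linalg.solve(A, XtVinv @ y)  # generalised least squares fit
    resid = y - X @ beta
    _, logdet_A = np.linalg.slogdet(A)
    return -0.5 * (np.sum(np.log(v)) + logdet_A + np.sum(resid**2 / v))

# Hypothetical toy data: 25 areas whose direct estimates vary far less
# than their (assumed known) sampling variances psi would suggest.
m = 25
y = 20.0 + 0.5 * np.sin(np.arange(m))
X = np.ones((m, 1))                        # intercept-only model
psi = np.full(m, 4.0)

grid = np.linspace(0.0, 50.0, 5001)        # truncated support for sigma2 (assumption)
loglik = np.array([restricted_loglik(s, y, X, psi) for s in grid])
weights = np.exp(loglik - loglik.max())    # flat prior: posterior proportional to likelihood
weights /= weights.sum()

mode = grid[np.argmax(loglik)]             # REML-type estimate restricted to the grid
posterior_mean = np.sum(grid * weights)    # Bayesian estimate of the model variance
print(mode, posterior_mean)
```

In this constructed example the restricted likelihood is maximal at zero, so the likelihood-based estimate sits on the boundary, whereas the posterior mean averages over the whole density and is strictly positive.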

Therefore the estimated auxiliary variables are substituted for their unknown values in the EBLUP and the HB estimator, and treated as if observed without error in this application. The main advantage of this approach is that the standard HB estimator can be used to avoid the problems with zero estimates for the model variance. Ybarra and Lohr (2008) also reported problems with zero and negative estimates for the model variance in their simulation study. Further research on this particular aspect may prove useful, but is outside the scope of the present case study.

Quality assurance methods

The covariates for the models are selected from a set of suitable auxiliary variables through a step-forward variable selection procedure. The conditional Akaike Information Criterion (cAIC) and a cross validation measure (CV) are used as comparison measures to select the most suitable models. The cAIC is proposed by Vaida and Blanchard (2005) for mixed models. In the case of a fixed effects model, the penalty for the model complexity is the number of model parameters. The random part of a mixed model also contributes to the number of model degrees of freedom with a value between zero in the case of no domain effects (i.e. the model variance equals zero) and the total number of domains in the case of fixed domain effects (i.e. the model variance tends to infinity). In the cAIC, the penalty for the model complexity is the effective degrees of freedom of the mixed model and is defined as the trace of the hat matrix, which maps the observed data to the fitted values, see Hodges and Sargent (2001).

The CV measure to be used is given by Hastie et al. (2003), ch. Leave-one-out (LOO) cross validation is used, which measures predictive accuracy by fitting the model using all but one of the areas, and predicting the outcome for that one area. This is repeated for all areas and the prediction errors are averaged.

Since the number of domains in this application is limited to 25 police districts, models with more than a few covariates can be expected to easily overfit the data.
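The effective degrees of freedom described above can be sketched for an area-level model in which the sampling variances and the model variance are treated as known. This is an illustrative assumption: it ignores the uncertainty from estimating the model variance, and the function name `effective_df` and the simulated inputs are hypothetical rather than the R code used in the case study.

```python
import numpy as np

def effective_df(X, psi, sigma2):
    """Trace of the hat matrix of an area-level model with known variances."""
    v = sigma2 + psi                               # total variance per area
    gamma = sigma2 / v                             # weight given to the direct estimate
    XtVinv = X.T / v                               # X' V^-1
    G = X @ np.linalg.solve(XtVinv @ X, XtVinv)    # maps y to the GLS regression fit
    H = np.diag(gamma) + (1.0 - gamma)[:, None] * G  # maps y to the fitted area values
    return np.trace(H)

rng = np.random.default_rng(0)
m, p = 25, 3                                       # areas, fixed-effect parameters
X = np.column_stack([np.ones(m), rng.normal(size=(m, p - 1))])
psi = rng.uniform(1.0, 4.0, size=m)                # hypothetical sampling variances

df_none = effective_df(X, psi, 0.0)                # no area effects
df_mid = effective_df(X, psi, 2.0)                 # intermediate model variance
df_fixed = effective_df(X, psi, 1e12)              # area effects behave as fixed effects
print(df_none, df_mid, df_fixed)
```

In this sketch the trace counts the fixed-effect parameters as well, so it runs from p when the model variance is zero up to the number of areas m when the model variance becomes very large; the contribution of the random part is the difference from p.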
Therefore, Principal Component Analysis (PCA) is applied to reduce the dimension of the covariate space, see e.g. Hastie et al. (2003). PCA is a linear transformation of the original covariate space into a space where the dimensions are orthogonal directions of maximal variance. The coordinates of this space are referred to as the principal components. The principal components that correlate sufficiently well with the target variables are selected in the model; these are not necessarily the first principal components, which explain the largest amount of the variance. As the principal components are linear combinations of the original covariates, all original variables may contribute to the principal components that are retained in the model, which might result in models with a substantially reduced number of covariates. Instead of the original covariates, the principal components are used in the same way in the step-forward variable selection procedure outlined above to select optimal models.

Quality measures for the EBLUP and HB estimates are provided by appropriate MSE estimates. A prerequisite is that the MSE estimates themselves are of good quality. Generally, applying model based SAE methods can be seen as successful and worthwhile if the precision is significantly better than that of the direct estimates. A good way to assess

this is comparing the coefficients of variation (ratio of standard error over mean) of the direct and model based estimates. As a single measure, the average percentage reduction of the coefficients of variation can be used. When small areas vary a lot in (sample) size, such an average may conceal differences between the areas. Consequently, investigating individual areas remains useful.

Software

This case study has been implemented in R, including the models, fitting and estimation procedures. Most of the generic SAE code that is used had been implemented before at Statistics Netherlands (Boonstra et al., 2011). Parts of this code have been transferred to the R scripts that are developed within the context of WP4 of this ESSnet, in particular the model selection measures cAIC and CV.

Results

Model selection

As mentioned above, step-forward model selection is used. In this procedure, the starting model is the constant model, consisting of an intercept only. Covariates are added one by one. At each step, the covariate that results in the largest improvement is added, based on either the cAIC or CV measures (see above). This process stops when there is no further improvement possible, or when all covariates are included. It was found in the present case study that models selected in this way using cAIC are generally smaller than, but nested within, models found based on CV. Since the number of areas is relatively small, there is a risk of overfitting. Therefore smaller models are preferred. Based on this consideration, it was chosen to use cAIC as the model selection criterion in this case study.

A limitation of the covariate data at hand is that including interaction terms is not possible. All covariates are either included as main effects only, or not included at all. Since many different covariates are available, various subsets of covariates are considered in turn. This enables the investigation of the importance of the various types of covariates. For different subsets the optimal model is selected.
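The step-forward procedure described above can be sketched as follows. As a simplifying assumption, the sketch scores models with leave-one-out cross validation of an ordinary least squares fit instead of the cAIC or CV measures of the Fay-Herriot model used in the case study. The covariate names and simulated data are hypothetical.

```python
import numpy as np

def loo_cv(y, X):
    """Leave-one-out mean squared prediction error of an OLS fit."""
    m = len(y)
    errors = np.empty(m)
    for i in range(m):
        keep = np.arange(m) != i                        # drop area i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errors[i] = y[i] - X[i] @ beta                  # predict the left-out area
    return np.mean(errors**2)

def forward_select(y, covariates, criterion=loo_cv):
    """Add covariates one by one while the criterion (lower is better) improves."""
    m = len(y)

    def design(names):
        return np.column_stack([np.ones(m)] + [covariates[n] for n in names])

    selected = []
    best = criterion(y, design(selected))               # intercept-only starting model
    while len(selected) < len(covariates):
        trials = {n: criterion(y, design(selected + [n]))
                  for n in covariates if n not in selected}
        name = min(trials, key=trials.get)              # largest improvement
        if trials[name] >= best:                        # no further improvement
            break
        selected.append(name)
        best = trials[name]
    return selected, best

rng = np.random.default_rng(1)
m = 25
covs = {name: rng.normal(size=m) for name in ("x_signal", "x_noise1", "x_noise2")}
y = 2.0 + 3.0 * covs["x_signal"] + rng.normal(scale=0.5, size=m)

selected, best = forward_select(y, covs)
print(selected, best)
```

Any criterion with the same call signature could be passed as `criterion`, which is how a cAIC-type measure would slot into the same loop.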
The resulting cAIC measures are listed in Table 5. In brackets the average reduction in coefficients of variation is given, for the estimates based on the selected model. This reduction reflects the gain achieved by the model based estimates compared to the initial direct estimates of the NSM. Each column in Table 5 corresponds to a subset of covariates, as follows:

admin     covariates from administrative registers only
PRRO      admin data together with PRRO variables
ISM       admin, PRRO, and data from the parallel ISM
NSM       admin, PRRO, and data from the past NSM
ISM&NSM   admin, PRRO, parallel ISM and past NSM

Table 5. Optimal cAIC values of models based on covariates from 5 different subsets. The values in brackets are the mean reduction in cv of SAE estimates using the corresponding model, with respect to the direct GREG estimates.
variable year admin PRRO ISM NSM ISM&NSM nuisance (-9%) -.5 (-35%) (-51%) -3. (-45%) (-51%) (-3%) -5. (-35%) -9.9 (-4%) -8.1 (-4%) (-4%) unsafe (-4%) (-9%) 17.3 (-37%) 17.7 (-39%) 17.7 (-39%) (-%) (-7%) 18.4 (-38%) (-8%) 18.4 (-38%) propvict (-49%) 14.6 (-49%) 15. (-51%) 14.6 (-49%) 15. (-51%) (-37%) 14.5 (-49%) 14.5 (-49%) 14.5 (-49%) 14.5 (-49%) violvict (-49%) 93. (-49%) 89.7 (-5%) 91.5 (-51%) 86.8 (-5%) (-36%) 98.5 (-37%) 97.3 (-33%) 98.5 (-35%) 97.3 (-33%) satispol (-5%) (-5%) (-55%) 164. (-49%) (-55%) (-4%) (-4%) 169. (-39%) 17.3 (-39%) 17.3 (-39%)

The past NSM variables are those from the last full scale NSM held prior to the 2008 small scale NSM. The full scale NSM covariates are the same for 2008 and 2009. An alternative would be to use the model based SAE estimates of 2008 as covariates in 2009.

Models using administrative data only are worse than models including PRRO variables as well. Expanding these models with ISM variables delivers the largest gains. Past NSM data is useful too, but it outperforms the ISM data only on some occasions. Using both ISM and past NSM data offers little improvement over using only ISM data.

Police register data from the PRRO is helpful for victimization of crime (propvict and violvict) and satisfaction with the police, but not as powerful for nuisance and unsafe. These latter two benefit a lot from using other survey data. An explanation could be that they are not as directly correlated with police register data, but rather reflect personal opinions and feelings.

From this table it is seen that the SAE methodology applied here reduces average coefficients of variation by 30% to 50%. Generally, model fits for the same variables differ between years, in that different covariates are selected in one year compared to the other.
It may be desirable to establish models that are valid in both years. This could be achieved through simultaneous selection of models for both years, seeking the optimum of the average of the model selection criteria. This procedure would select a model that is best on average, in both years. In the present case study, estimation accuracy is of primary importance; hence models are allowed to vary between years.

The total number of covariates to select from in this case study is very large. It is larger than the number of areas. Since models consisting of more than a few variables will overfit, it may be that not all relevant covariates get used. Therefore it is investigated how this can be remedied. In particular Principal Component Analysis (PCA) is used to reduce the dimension of the covariate space. The original covariates are linearly transformed into covariates referred to as principal component covariates, along directions of maximal variance. Since this transformation occurs independently of the target variable, it is not guaranteed to deliver improvements, but this can be assessed.

Table 6 is the equivalent of Table 5, but now using principal component covariates. Principal components are determined within the same subsets of original covariates defined above. Comparing the results presented in Tables 5 and 6, it is seen that using PCA does not lead to large improvements when using administrative data only, or in combination with PRRO variables.
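The construction of principal component covariates, and the point that they are formed without reference to the target variable, can be sketched as below. The standardisation of the covariates before the decomposition and the ranking of components by absolute correlation with the target are assumptions made for illustration, and the data are simulated. The report does not specify these details.

```python
import numpy as np

def principal_components(X):
    """Principal component scores of standardised covariates via the SVD."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise each covariate (assumption)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                             # same as Z @ Vt.T
    explained = s**2 / np.sum(s**2)            # share of variance per component
    return scores, explained

def rank_by_correlation(scores, y):
    """Order components by absolute correlation with the target variable."""
    corr = np.array([np.corrcoef(scores[:, k], y)[0, 1]
                     for k in range(scores.shape[1])])
    return np.argsort(-np.abs(corr)), corr

rng = np.random.default_rng(2)
m, k = 25, 6                                   # areas, hypothetical covariates
X = rng.normal(size=(m, k))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=m)

scores, explained = principal_components(X)
order, corr = rank_by_correlation(scores, y)
print(explained.round(3), order)
```

The component ordering by explained variance and the ordering by correlation with the target need not agree, which is why a later component may be more useful as a covariate than the first one.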

Somewhat surprising are the results when ISM and/or NSM variables are included in PCA: the principal component variables give rise to models that are in many cases worse than those using the covariates directly. A possible reason is that some variables possess little predictive power while others are good predictors. In PCA these get combined together and may form variables correlating less well with the target variables considered in this study. This may require further investigation, in particular into methods where the correlation with the target variables is taken into account.

For the present case study it is concluded that PCA does not offer sufficient benefits to warrant its use. Therefore the models underlying the results in Table 5 are used. Details of these models are given in Table 7, which shows which variables are used as covariates in the optimal models. An NSM variable is used in only one case (violvict in 2008). Models for all but one variable always include ISM variables, which are clearly strong predictors. Only for property crimes do they not offer benefits compared to variables from police and administrative registers. This could be caused by the fact that property crimes are generally well reported to the police (maybe better than violence, although this cannot be confirmed based on the available data) and that perceived crime victimization is naturally more closely related to reported crime than is the case for other variables.

Table 6. Optimal cAIC values of models based on principal components calculated from 5 different subsets. The values in brackets are the mean reduction in cv of SAE estimates using the corresponding model, with respect to the direct GREG estimates.
variable year admin PRRO ISM NSM ISM&NSM nuisance (-31%) -3.1 (-34%) -6. (-38%) -3.7 (-34%) -6.1 (-36%) (-6%) -.5 (-9%) -9.5 (-41%) -3.8 (-34%) -9.1 (-4%) unsafe (-4%) (-3%) (-3%) 13.8 (-3%) (-3%) (-19%) (-1%) (-7%) 136. (-5%) (-5%) propvict (-48%) 18.1 (-44%) 17.3 (-47%) 18. (-46%) 16. 
(-47%) (-36%) 11.8 (-38%) 19.6 (-41%) 18.9 (-4%) 18.5 (-4%) violvict (-49%) 93.5 (-47%) 91.5 (-49%) 93.5 (-47%) 91.3 (-47%) (-33%) 1.7 (-34%) 98.4 (-3%) 1.7 (-34%) 98.6 (-34%) satispol (-51%) 164. (-5%) (-53%) 16.8 (-53%) 164. (-5%) (-41%) (-39%) (-38%) (-39%) 17. (-4%)

Table 7. Selected models for the five key survey variables. All models include an intercept (not shown).
variable   year   cAIC-selected model
nuisance   2008   ism_nuisance, adm_old
           2009   ism_nuisance, ism_victim, ism_bicycle, prro_threat
unsafe     2008   ism_nuisance, adm_benefit, prro_propcrim, prro_drugs
           2009   ism_unsafe, ism_satispol, ism_violvict
propvict   2008   prro_propcrim, adm_old
           2009   prro_propcrim, prro_traffic
violvict   2008   ism_propvict, nsm_propvict, prro_bicycle
           2009   ism_bicycle, adm_house, prro_damage, ism_unsafe, prro_assault, adm_age
satispol   2008   ism_funcpol
           2009   adm_old, adm_young, ism_satispol, prro_puborder

Small area estimates

General results

The models listed in Table 7 are used to produce small area estimates using Hierarchical Bayesian (HB) estimation methods (see above). The HB estimates are compared with the direct estimates in terms of coefficient of variation (cv), see Table 8. Property crime victimization is improved a lot in both years. Reductions in cv of around 50% are also achieved for nuisance, violvict and satispol, although not in both years. Unsafe seems the most difficult variable, which may be caused by this being a personal opinion (different people will feel different levels of safety in the same objective circumstances). Somewhat suspicious may be the finding that the improvements in 2009 seem to be consistently worse than those in 2008, at least for three variables. None of these use the past NSM survey as a source of covariates, so that is not the explanation. There are no other clear reasons; it could be due to chance as well.

Table 8. Mean coefficients of variation for the direct GREG estimates and the HB small area estimates, and the percentage reduction.
variable year direct HB SAE Reduction nuisance 8 1.8% 5.3% -5.7% 9 1.6% 6.% -4.% unsafe % 9.4% -37.1% % 8.6% -37.6% propvict 8 4.5% 11.6% -49.3% 9 4.8% 11.6% -49.4% violvict % 16.8% -49.8% %.9% -33.% satispol 8 1.5% 5.6% -55.% 9 1.9% 7.8% -39.6%

In the preceding sections it was argued that Bayesian estimation of the model variance is superior to other methods if the model variance is small. To illustrate this point, REML estimates of the model variance are compared to the mean of the posterior distribution in the Bayesian approach. In this application, using the optimal models, there is only one instance where the REML estimate of the model variance is non-zero: unsafe in 2009. The REML estimate is .161 while the posterior mean is 3.4, see Figure 1, bottom panel. The top panel illustrates what happens if the likelihood has its maximum at zero: the REML estimate is zero, while the posterior mean is not.
In this case the latter is equal to

Figure 1. Posterior distributions of the model variance for the variable unsafe for 2008 (top) and 2009 (bottom). The dashed line indicates the mean. In 2009 the maximum is away from zero, while in 2008 the maximum is at zero.

Given the Bayesian estimate of the model variance, it is possible to proceed in two ways. Either the BLUP can be used with the Bayesian estimate of the model variance plugged in, or the full Hierarchical Bayesian approach can be further implemented to obtain posterior distributions of the variables to be estimated. These two methods are compared in Table 9. This table shows the relative percentage differences between the BLUP point estimates and the HB estimates, and between the MSE estimates using the Prasad-Rao MSE estimation for the BLUP, and the posterior variances in the HB approach. All values are averages over the areas. As expected from the theory (Rao, 2003), the point estimates are equal up to very small differences, most likely due to numerical rounding. The MSE estimates, however, differ more. The MSE estimates in the BLUP are approx. 5% lower than the HB estimates. As the BLUP MSE estimates do not account for uncertainty in the estimation of the model variance, the HB estimates are preferred and recommended.

Table 9. Comparing differences between BLUP and HB, point estimates and MSE estimates. Percentage differences of the BLUP relative to the HB are shown.
variable year BLUP vs. HB point estimates BLUP vs. HB MSE estimates nuisance 8.1% -5.38% 9.1% -4.67% unsafe 8.1% -4.63% 9.% -4.35% propvict 8.% -5.33% 9.5% -5.9% violvict 8.1% -4.38% 9.15% -3.9% satispol 8 -.% -5.76% 9.4% -4.78%

The BLUP can be written as a weighted sum of the direct estimate and the model estimate, with the weight applied to the direct estimate commonly denoted by γ. The weight of the model prediction is then (1 - γ). The value of γ depends on the variance of the direct estimates and on the model variance (Rao, 2003). Table 10 lists the γ-values found in the present study. Values vary roughly between .15 and .3, showing that the largest contribution to the model based estimates is delivered by the model predictions. Variables with low γ-values can be expected to benefit more from the modelling approach, for example propvict. This is in accordance with the results shown in Table 8. Similarly, unsafe was found to benefit least (Table 8), and is now seen to have larger values of γ.

Table 10. γ-values for the target variables.
variable year γ nuisance unsafe propvict violvict satispol
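The weighting of direct estimate and model prediction can be made concrete with a small sketch. It uses the standard form of the weight in the basic area level model, γ = σ² / (σ² + ψ), where σ² is the model variance and ψ the sampling variance of the direct estimate (Rao, 2003). All numbers below are hypothetical and chosen for easy arithmetic; they are not values from the case study.

```python
import numpy as np

def shrinkage_weights(sigma2, psi):
    """Weight of the direct estimate: model variance over total variance."""
    return sigma2 / (sigma2 + psi)

def composite(direct, synthetic, gamma):
    """Weighted sum of the direct estimate and the model prediction."""
    return gamma * direct + (1.0 - gamma) * synthetic

sigma2 = 1.0                              # hypothetical model variance
psi = np.array([9.0, 4.0, 1.0])           # hypothetical sampling variances per area
direct = np.array([30.0, 20.0, 10.0])     # hypothetical direct estimates
synthetic = np.array([20.0, 20.0, 20.0])  # hypothetical model predictions

gamma = shrinkage_weights(sigma2, psi)    # 0.1, 0.2, 0.5
estimates = composite(direct, synthetic, gamma)
print(gamma, estimates)                   # estimates: 21.0, 20.0, 15.0
```

The area with the largest sampling variance receives the smallest γ and is pulled most strongly towards the model prediction, which is the behaviour described above for variables with low γ-values.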

Some detailed results

Until now only global results and characteristics of the models and estimates have been considered. Nevertheless, area estimates should be investigated in detail too, as averages taken over areas may conceal important aspects of the applied methods. For the present report two variables are selected as an example: unsafe and propvict. Unsafe was found to benefit less from the SAE modelling approach than propvict, which makes them interesting to compare.

In Figure 2, HB estimates are plotted against direct estimates for the years 2008 and 2009. The dashed line is a linear regression line fitted to the data. As is commonly seen, the SAE estimates are smoothed compared to the direct estimates, as indicated by the slope of the regression line being smaller than 1. This can be expected since the direct estimates have larger variances than the HB estimates, while there should not be a systematic bias. The latter is easily checked by comparing direct and HB estimates at the national level. For unsafe, the HB estimate is .4% higher than the direct estimate, for propvict it is .3% higher. These differences are very small and hence give no indication that the HB estimates are biased. If desired, the HB estimates can be calibrated to the direct estimate, as is sometimes done for consistency purposes (e.g. Boonstra et al., 2011).

Another diagnostic is analysis of the residuals and their distribution. Figure 3 is a Q-Q plot of the residuals of the two variables under consideration. The distributions do not seem to deviate very much from normal. These plots contain data of the two years.

Figure 2. HB estimates versus direct estimates for the two variables under consideration. The solid line is the diagonal, y = x, and the dashed line is a linear fit.

Figure 3. Normal Q-Q plots for the residuals of unsafe and propvict.

Average reductions in the coefficients of variation (cv) are shown in Table 8 above. Figure 4 now shows details for the two variables under consideration for the year 2009. The cvs of both NSM (direct) and HB model based small area estimates are plotted for the 25 police districts, which are ordered by increasing population size. Since the NSM employs a proportional allocation over the strata (in this case the police districts), this ordering is also according to NSM sample size. The smaller areas have larger cvs when using direct estimates. They benefit more than the larger areas from applying small area models, as their cvs are reduced more than those of the larger areas. Nevertheless there are some gains too for the largest police districts.

Finally, since the police districts are geographic areas, it may be informative to represent the outcomes of this case study on a map. This is done in Figure 5, showing the HB estimates for unsafe and propvict for 2009. It can be seen that there appear to be clusters in the north east and south west with lower values of both unsafe and propvict, while the more centrally

located areas as well as the south east are characterized by rather larger values for both variables. The colour scale for propvict does not work very well, as a single district with a very high value pulls the scale upwards, resulting in all other values being plotted in mid-range to low colours (white-blue).

The observed spatial clustering corresponds to what is generally known about the socio-economics and demography of regions within the Netherlands. For example the central areas contain the larger cities The Hague, Amsterdam and Rotterdam, while the north east has a lower population density and is much more rural. These observations could warrant a further investigation into the use of spatial components in the small area models, which has not been done in the present case study.

Figure 4. Coefficients of variation (cv) of direct (NSM) and model based (HB) estimates for two variables for the year

Figure 5. Spatial distribution of the HB estimates of unsafe (upper panel) and propvict (lower panel) in

Discussion

From the perspective of increasing the precision of the direct estimates, this case study can be called a success, as margins are reduced by approximately 40%. The model selection and estimation methods appear to have performed well, with no indication that models are misspecified. It is believed that the approach followed in this case study can serve as a model when investigating the applicability and usability of SAE methodology in a particular setting.

However, there are a number of issues that need careful consideration, maybe more than they received in this case study.

(1) There is the model choice that varies between years. If SAE were to be taken into production, it may be desirable that the exact models used remain unchanged from year to year, as is commonly the case for regression models underlying GREG estimates. On the other hand, maybe methods should be able to accommodate changes over time in the correlation between covariates and survey variables. This can occur for example with operational or policy changes affecting the register data from which covariates are obtained.

(2) A publication policy with respect to model based estimates may be required. Publishing model based estimates is uncommon for NSIs. Statistics Netherlands started recently with the publication of labour force statistics using model based methods. These publications are accompanied by a methodological justification. In the opinion of the authors of this case study report, the methodology that is used to produce particular statistics is of secondary importance and does not warrant any form of special treatment of model based statistics compared to direct estimates. This view is not always shared by all parties involved.

(3) While the methodology applied in this case study is defendable, the models and estimation methods used are certainly not the only applicable ones.
Other models could lead to equally acceptable but different results, in which case there is a discussion to be had about which is right and which is wrong. Again, this may stress the importance of consistently using the same models and methodology when applying small area estimation methods repeatedly over time.

(4) Some possible improvements could be further investigated. For example, so far no spatial components have been incorporated into the models, while that could prove beneficial, as suggested by the results in this case study, and as seen in some other case studies conducted within this ESSnet.

(5) The issue that some covariates are in fact sample survey outcomes and hence have errors associated with them has been brushed over rather quickly. This deserves further attention, in particular in more general cases where the errors on the area covariates vary between areas.

Nevertheless, this case study has delivered useful outcomes, both in terms of the resulting small area estimates and in terms of the experience of applying SAE methodology to the crime survey. At the same time it has delivered some concrete research topics that can be taken further in future research projects.

References

Bell, W.R. (1999). Accounting for uncertainty about variances in small area estimation, Bulletin of the International Statistical Institute
Boonstra, H.J., J.A. Van den Brakel, B. Buelens, S. Krieg and M. Smeets (2008). Towards small area estimation at Statistics Netherlands, Metron Vol. LXVI, Nr. 1, pp
Boonstra, H.J., B. Buelens, K. Leufkens and M. Smeets (2011). Small area estimates of labour status in Dutch municipalities, Discussion paper 11, Statistics Netherlands, available online: A EB//11x1.pf
Fay, R.E. and R.A. Herriot (1979). Estimation of income for small places: an application of James-Stein procedures to census data, Journal of the American Statistical Association, 74, pp
Hastie, T., R. Tibshirani and J.H. Friedman (2003). The elements of statistical learning, Springer-Verlag, New York.
Hodges, J.S. and D.J. Sargent (2001). Counting degrees of freedom in hierarchical and other richly parameterized models, Biometrika, 88, pp
Rao, J.N.K. (2003). Small Area Estimation, John Wiley and Sons, New York.
Särndal, C.E., B. Swensson and J. Wretman (1992). Model Assisted Survey Sampling, Springer-Verlag, New York.
Vaida, F. and S. Blanchard (2005). Conditional Akaike Information for Mixed-Effects models, Biometrika, 92, pp
Ybarra, L.M.R. and S.L. Lohr (2008). Small area estimation when auxiliary information is measured with error, Biometrika, 95, pp

Case Study report: Modeling of domain mortality rates

Li-Chun Zhang
Statistics Norway

1 Background

Domain (or sub-population) mortality rates are needed in order to calculate life expectancy at a disaggregated population level. At least three dimensions are necessary for the classification of the domains: (i) the areas of interest, denoted by $i = 1, \ldots, m$ for fixed $m$, (ii) sex, denoted by $j = 1$ for male and $j = 2$ for female, and (iii) age, denoted by $a = 0, \ldots, 99$. (There are very few people over the age of 99. These can either be ignored, or put in the same group as those of age 99. There is hardly any effect on the calculated life expectancy in either case.)

The number of deaths in the population over any given period is available in the relevant register at Statistics Norway. Denote by $y_{ija}$ the recorded within-domain number of deaths, i.e. mortality. Since the domain population sizes necessarily vary over the time period for which the death records are collected, some kind of hypothetical equivalent population size must be constructed. Denote by $n_{ija}$ this equivalent domain population size. A common model, as in the case of disease mapping, is to assume that $y_{ija}$ follows a Poisson distribution with parameter $\lambda_{ija} = n_{ija} \tau_{ija}$, where $\tau_{ija}$ denotes the theoretical (or superpopulation) mortality rate for domain $(i, j, a)$. A direct estimator of $\tau_{ija}$ is then given by

$$\hat{\tau}_{ija} = y_{ija} / n_{ija} \qquad (1)$$

The problem is that $\hat{\tau}_{ija}$ can be highly unstable at low aggregation levels, causing the extreme mortality rates to be dominated by the low-population domains, which are the least reliable. Some evidence is available in Pedersen (1), who calculated the confidence intervals of life expectancy at three regional levels in Norway, with the municipality as the most detailed level of aggregation.
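The instability of the direct estimator in small domains follows directly from the Poisson assumption: with a common rate τ, the standard deviation of y/n is the square root of τ/n. A short simulation sketch illustrates this. The rate, the domain sizes and the number of replications are hypothetical values chosen for illustration and are not taken from the Norwegian data.

```python
import numpy as np

rng = np.random.default_rng(3)
tau = 0.01                                 # hypothetical common mortality rate
n = np.array([50, 500, 5000, 50000])       # hypothetical equivalent population sizes
reps = 2000                                # simulated replications per domain

# Deaths are Poisson with mean n * tau; one column per domain size.
deaths = rng.poisson(lam=n * tau, size=(reps, n.size))
direct = deaths / n                        # direct estimator of the mortality rate

sd_direct = direct.std(axis=0)             # empirical spread of the direct estimates
sd_theory = np.sqrt(tau / n)               # Poisson standard deviation of y / n
print(sd_direct, sd_theory)
```

Although all domains share the same underlying rate in this sketch, the smallest domain produces by far the most variable estimates, so the most extreme estimated rates will tend to come from the smallest domains.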
A simple alternative is a synthetic estimator which, in the current setting, may be given by

$$\hat\tau^S_{ija} \stackrel{def}{=} \Big(\sum_{i=1}^m y_{ija}\Big)\Big/\Big(\sum_{i=1}^m n_{ija}\Big) = y_{.ja}/n_{.ja} = \xi_{ja} \qquad (2)$$

The potential drawback is that $\hat\tau^S_{ija}$ may lead to over-smoothing, because the between-area variation is then, in expectation, entirely due to the variation in the domain population size, now that $\hat\lambda^S_{ija} = n_{ija}\xi_{ja}$. Indeed, the synthetic estimates would yield a single life expectancy for each sex, thus removing all the potential differences across the areas that are of interest.

Thus, modeling of domain mortality rates aims to achieve a sensible balance between these two alternatives. On the one hand, the resulting estimator should have reduced variance compared to the direct estimator in order to bring stability over time. On the other hand, it should avoid the extreme over-smoothing of the synthetic estimator in order to provide useful information on the between-area differences in life expectancy.

There is a body of literature on disease mapping that is clearly relevant. See e.g. Section 9.5 in Rao (2003) for an overview. The key difference is the existence of cohorts in addition to the areas of interest, where each distinct sex-age group naturally forms a population cohort. Initially, within

each cohort $(j, a)$, the observed area-specific mortality rates can be smoothed using a standard disease mapping technique to yield estimates of $\{\tau_{ija}; i = 1, ..., m\}$. Under the assumption of independence across the cohorts, one may repeat the same procedure in each cohort to obtain the estimates of all $\{\tau_{ija}; i = 1, ..., m$ and $j = 1, 2$ and $a = 0, ..., 99\}$. However, this basic smoothing approach has a drawback in practice: it is easy to conceive that the estimated cohort-specific smoothing parameters, whatever these may be with regard to the chosen smoothing technique, may not vary smoothly over the neighboring cohorts. For instance, the neighboring cohorts may be the neighboring age groups, either for a given sex or when both sexes are considered pairwise. Yet there appears to be no reason a priori why one should accept such jerky parameter estimates should it be the case, and the estimation may suffer from this kind of random variation over time.

More generally, the cohort effects can be regarded as a particular form of clustering that tends to exist in all natural populations. For instance, there may be spatial clustering effects among the neighboring areas. It is therefore desirable in general to have means to account for the potential clustering effects, in order to get one step beyond the basic smoothing approach.

In summary, there are two overall goals of this study. The first, statistical aim is to investigate possible alternatives that achieve a better balance between the direct and synthetic estimates of domain mortality rates for the calculation of life expectancy at a low aggregation level. The second, theoretical aim is to develop extensions to the basic smoothing approach in order to account for potential clustering (or cohort) effects that may exist in the population.

2 Data and target quantity

For this study, we set the municipalities to be the areas of interest, and there are 430 of them.
The mortality, or number of deaths, is compiled from the death records over the five-year period from 2004 to 2008. The database was initially prepared for Pedersen (1), and re-organized for this study by M. Lillegård at Statistics Norway. We do not use any additional data.

Our target quantity is the theoretical life expectancy at birth. It can be constructed from the mortality rates in various ways. However, variations in the theory of life expectancy calculation are unlikely to be critical for one's evaluation of the statistical estimation methods for the mortality rates. The scheme which we have adopted for this study can be briefly described as follows. Given the mortality rate $\tau_{ija}$, one first calculates the corresponding probability of death by

$$q_{ija} = \tau_{ija}/(1 + \omega_{ija}\tau_{ija})$$

where $\omega_{ija}$ is a tuning constant which may vary according to different theoretical considerations. We simply use $\omega_{ija} \equiv 0.5$. Next, the expected end-of-year cohort size, denoted by $l_{ij(a+1)}$, is given by

$$l_{ij(a+1)} = l_{ija}(1 - q_{ija})$$

conditional on the start-of-year cohort size $l_{ija}$. Finally, the expected average live cohort size, denoted by $L_{ija}$, is given by

$$L_{ija} = l_{ij(a+1)} + \omega_{ija}(l_{ija} - l_{ij(a+1)}) = l_{ija}(1 - q_{ija} + \omega_{ija}q_{ija}) \stackrel{\omega_{ija} \equiv 0.5}{=} l_{ija}(1 - q_{ija}/2)$$
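The scheme above translates directly into a short loop. The sketch below is an illustration under stated assumptions: the function name `life_expectancy` and the constant example rate are hypothetical, the radix is $l_{ij0} = 1$, and the average live cohort sizes $L_{ija}$ are summed over the supplied ages to give the life expectancy at birth.

```python
def life_expectancy(rates, omega=0.5):
    """Life expectancy at birth from age-specific mortality rates tau_0, ..., tau_A.

    q_a     = tau_a / (1 + omega * tau_a)    probability of death at age a
    l_(a+1) = l_a * (1 - q_a)                expected end-of-year cohort size
    L_a     = l_a * (1 - q_a + omega * q_a)  expected average live cohort size
    The result is the sum of L_a over all supplied ages, starting from l_0 = 1.
    """
    survivors = 1.0  # l_0
    total = 0.0
    for tau in rates:
        q = tau / (1.0 + omega * tau)
        total += survivors * (1.0 - q + omega * q)
        survivors *= 1.0 - q
    return total


# Hypothetical example: a constant rate of 0.01 at every age 0, ..., 99.
print(life_expectancy([0.01] * 100))
```

With zero mortality at every age the function returns the number of ages supplied, which is the upper bound implied by truncating the sum at the last age.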

In this way, the theoretical life expectancy at birth in domain $(i, j)$ is given by

$$T_{ij} = \sum_{a=0}^{A} L_{ija} \quad \text{for } l_{ij0} \equiv 1 \text{ and } A = 99$$

In other words, all one needs is the estimates of the domain mortality rates $\tau_{ija}$.

3 Approach

We follow the general flow chart described in the WP6 report. Here is a summary of the key steps.

The data and metadata have been described above. In particular, the areas of interest are the municipalities. According to the ad hoc convention adopted in the WP6 report, sex and age are auxiliaries that can be incorporated into the hierarchical structure of the data.

The direct estimator is given by (1) which, as noted before, has been shown to lead to estimated life expectancies that are too unstable in most of the municipalities.

A natural synthetic estimator is given by (2), which is unacceptable as it leads to a single life expectancy for males across all the areas, and another one for females across all the areas.

Composite estimates can be derived under the basic smoothing approach outlined above, i.e. within each sex-age cohort. As expected, we encountered the problem of jerky parameter estimates over the neighboring age groups, in addition to the well-known over-shrinkage problem (ref. WP6 report). We conclude that the composite estimator under the basic smoothing approach is not satisfactory, neither statistically nor theoretically.

Adjustment of over-shrinkage can be achieved for the composite estimator by tuning the respective coefficients assigned to the direct and synthetic estimates (Spjøtvoll and Thomsen, 1987). In addition, we develop a variance component model as an extension to the two-level model underlying the basic smoothing approach, where the deviation between a domain mean and the corresponding overall mean is given as the product of a cluster-level random effect and another within-cluster domain-level random effect.
It turns out, however, that the variance of the cluster-level random effect is very small compared to that of the within-cluster domain-level random effect, such that jerky parameter estimates can only be avoided by imposing a variance homogeneity assumption across the neighboring age groups.

In conclusion, statistically speaking, shrinkage-adjusted composite estimates under a suitable variance homogeneity assumption can yield estimates of the life expectancy that are acceptable at the municipality level. From a theoretical point of view, these estimates can be motivated for the purpose of smoothing, but not directly from a prediction perspective.

4 Methods

4.1 Basic smoothing approach

We can adapt the standard disease mapping procedure to yield the basic smoothing approach as follows. The overall cohort-specific mortality rate $\xi_{ja}$ will be treated as fixed, providing a reference

mortality rate for all the relevant domains $\{(i, j, a); i = 1, ..., m\}$. Let $\theta_{ija} = \tau_{ija}/\xi_{ja}$ be the relative risk (RR), or standardized mortality rate (SMR), of domain $(i, j, a)$. Without loss of generality we may assume $E(\theta_{ija}) = 1$. Conditional on $\theta_{ija}$, $y_{ija}$ is assumed to follow the Poisson distribution with parameter $\lambda_{ija} = n_{ija}\xi_{ja}\theta_{ija}$. To complete the model, we need to specify at least the variance of $\theta_{ija}$. Below we first describe some domain- and individual-level alternatives.

First, notice that, under the basic smoothing approach, we deal with separate subsets of domains in the same way. For example, we may apply a composite estimator to the set of domains $\{(i, j, a); i = 1, ..., m\}$ for given $(j, a)$, and repeat the procedure for all $(j, a)$, which we refer to as smoothing by age (and sex). Or, we may apply a composite estimator to the set of domains $\{(i, j, a); a = 0, ..., 99\}$ for given $(i, j)$, and repeat the procedure for all $(i, j)$, which we may refer to as smoothing by area (and sex). It is convenient to use the same description in either case, if we adopt a simplified notation where the only subscript $i$ is the generic index of whatever domains are of concern. That is, in the case of smoothing by age, $i$ refers to municipality and the indices for sex and age are suppressed, whereas in the case of smoothing by area, $i$ refers to age and the indices of sex and area are suppressed.

Now, a composite estimator for a set of $\{\theta_i; i = 1, ..., m\}$ can be obtained under the following assumptions:

$$y_i | \theta_i \sim \text{Poisson}(\lambda_i) \quad \text{for } \lambda_i = \mu_i\theta_i \text{ and } \mu_i = n_i\xi_i$$
$$E(\theta_i) = 1 \quad \text{and} \quad V(\theta_i) = \sigma^2$$

where $\xi_i$ is a suitable constant reference mortality rate. A composite estimator has the form

$$\hat\theta^C_i = \phi_i\hat\theta_i + (1 - \phi_i) \cdot 1 \qquad (3)$$

where $\hat\theta_i = y_i/\mu_i$. The coefficient $\phi_i$ may be chosen to minimize the MSE of the composite estimator.
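We have

$$MSE(\hat\theta^C_i) = \phi_i^2 MSE(\hat\theta_i) + (1 - \phi_i)^2 MSE(1) = \phi_i^2/\mu_i + (1 - \phi_i)^2 V(\theta_i)$$

because $E\{(\hat\theta_i - \theta_i)(1 - \theta_i)\} = 0$ and, by the Poisson distribution of $y_i$ given $\theta_i$,

$$MSE(\hat\theta_i) = E\big(E(\hat\theta_i^2 | \theta_i) - 2E(\hat\theta_i | \theta_i)\theta_i + \theta_i^2\big) = E\big(V(\hat\theta_i | \theta_i)\big) = E(\theta_i/\mu_i) = 1/\mu_i$$
$$MSE(1) = E\big((\theta_i - 1)^2\big) = V(\theta_i)$$

such that the minimum MSE is achieved at

$$\phi_i = MSE(1)/\big(MSE(\hat\theta_i) + MSE(1)\big) = \mu_i/\big(\mu_i + 1/V(\theta_i)\big)$$

We have, then,

$$MSE(\hat\theta^C_i) = \phi_i^2/\mu_i + (1 - \phi_i)^2 V(\theta_i) = 1/\big(\mu_i + 1/V(\theta_i)\big)$$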

such that the relative efficiency (RE) compared to the direct estimator $\hat\theta_i$ is given by

$$RE = MSE(\hat\theta^C_i)/MSE(\hat\theta_i) = \mu_i/\big(\mu_i + 1/V(\theta_i)\big) < 1 \quad \text{provided } V(\theta_i) > 0$$

Notice that $\hat\theta^C_i$ is a weighted average of the direct estimator $\hat\theta_i$ and the overall mean, which is $1 = E(\theta_i)$ under the model. Provided $\sigma^2 > 0$, we have

$$\hat\theta^C_i > 0 \text{ even if } \hat\theta_i = 0, \qquad \hat\theta^C_i \to \hat\theta_i \text{ as } n_i \to \infty$$

The estimator $\hat\theta^C_i$ therefore has two intuitively appealing properties: (i) it is bounded away from zero even if the direct estimator is zero, which can often occur in small domains when the theoretical mortality rate is close to zero, and (ii) it gets closer to the direct estimator as the domain population size $n_i$ grows larger and the direct estimator becomes more reliable.

In practice, of course, we first need to estimate $V(\theta_i) = \sigma^2$ in order to obtain $\hat\phi_i$ and, then, $\hat\theta^C_i$ by (3). Marshall (1991) proposed a simple method of moments estimator (MME). Let

$$s^2 = \sum_{i=1}^m w_i s_i^2 \quad \text{where } s_i^2 = (\hat\theta_i - 1)^2 \text{ and } w_i = n_i/n_. \text{ and } n_. = \sum_{i=1}^m n_i$$

Notice that it is possible to use any set of weights such that $\sum_i w_i = 1$. The difference concerns the efficiency of the MME. Provided independence between the $\hat\theta_i$'s, the optimal choice is given by $w_i = V(s_i^2)^{-1}/\sum_{g=1}^m V(s_g^2)^{-1}$ which, however, is unknown and must be estimated. The choice of $w_i = n_i/n_.$ may be sensible provided $V(s_i^2) = O(1/n_i)$ asymptotically. In any case,

$$E(s^2) = \sum_i w_i E\big((y_i/\mu_i - 1)^2\big) = \sum_i w_i\big((V(y_i) + \mu_i^2)/\mu_i^2 - 1\big) = V(\theta_i) + \sum_i w_i/\mu_i$$

now that $E(y_i | \theta_i) = V(y_i | \theta_i) = \mu_i\theta_i$ under the Poisson assumption. An MME of $\sigma^2$ is given as

$$\hat\sigma^2 = \max\Big(0, s^2 - \sum_i w_i/\mu_i\Big)$$

4.2 Discussions

4.2.1 EB estimator

The composite estimator (3) coincides with the so-called EB estimator given as follows. Put

$$y_i | \theta_i \sim \text{Poisson}(\lambda_i) \quad \text{for } \lambda_i = \mu_i\theta_i \text{ and } \mu_i = n_i\xi_i$$
$$\theta_i \sim \text{Gamma}(\alpha, \alpha)$$

That is, one assumes explicitly a Gamma distribution of $\theta_i$, where the shape parameter equals the rate (inverse-scale) parameter such that $E(\theta_i) = \alpha/\alpha \equiv 1$ and $V(\theta_i) = 1/\alpha \stackrel{def}{=} \sigma^2$.
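The moment estimator and the composite estimator (3) described above can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the function names and the example numbers are assumptions, and the shrinkage coefficient is written as $\mu_i\hat\sigma^2/(\mu_i\hat\sigma^2 + 1)$, which is algebraically the same as $\mu_i/(\mu_i + 1/\hat\sigma^2)$ but remains defined when the estimate is truncated at zero.

```python
def mme_sigma2(y, mu, n):
    """Method-of-moments estimate of sigma^2 with weights w_i = n_i / n. ."""
    total = sum(n)
    weights = [ni / total for ni in n]
    # s^2 = sum_i w_i * (theta_hat_i - 1)^2 with theta_hat_i = y_i / mu_i
    s2 = sum(w * (yi / mi - 1.0) ** 2 for w, yi, mi in zip(weights, y, mu))
    # Poisson noise term sum_i w_i / mu_i, subtracted before truncation at zero
    noise = sum(w / mi for w, mi in zip(weights, mu))
    return max(0.0, s2 - noise)


def composite(y, mu, sigma2):
    """Composite estimator (3): phi_i * theta_hat_i + (1 - phi_i) * 1."""
    estimates = []
    for yi, mi in zip(y, mu):
        phi = mi * sigma2 / (mi * sigma2 + 1.0)  # equals mu / (mu + 1 / sigma2)
        estimates.append(phi * yi / mi + (1.0 - phi))
    return estimates


# Hypothetical domains: population sizes n, reference rate xi = 0.01, deaths y.
n = [100.0, 1000.0, 10000.0]
mu = [ni * 0.01 for ni in n]
y = [3.0, 14.0, 95.0]
sigma2_hat = mme_sigma2(y, mu, n)
print(sigma2_hat, composite(y, mu, sigma2_hat))
```

In this example the smallest domain is pulled almost all the way to 1 while the largest keeps most of its direct estimate, which mirrors property (ii) above.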
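We have $\theta_i | y_i \sim \text{Gamma}(y_i + \alpha, \mu_i + \alpha)$, so that the empirical best (EB) predictor is given as

$$\hat\theta^{EB}_i = \hat E[\theta_i | y_i, \xi] = \frac{y_i + \hat\alpha}{\mu_i + \hat\alpha} = \hat\gamma_i\hat\theta_i + (1 - \hat\gamma_i) \cdot 1$$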

where $\hat\gamma_i = \mu_i/(\mu_i + \hat\alpha) = \mu_i/(\mu_i + 1/\hat\sigma^2) = n_i\xi_i/\big(n_i\xi_i + 1/\hat V(\theta_i)\big)$, which is also known as the empirical Bayes estimator. Clearly, the EB estimator is the same as the composite estimator given by (3) with a plug-in estimate of $V(\theta_i) = \sigma^2$.

Given the fully parametric model assumptions, it is possible to use other methods of estimation, such as the maximum likelihood estimator (MLE). This would improve the efficiency of estimation in situations where the model fits the data reasonably well, whereas the composite estimator is valid under a more general setting. Moreover, under the Poisson-Gamma model, it seems natural to assess the estimation uncertainty by the conditional MSE of prediction (CMSEP) given the domain observation $y_i$. The CMSEP of the best predictor, denoted by $\hat\theta^B_i = E(\theta_i | y_i)$ provided the parameter $\alpha$ is known, is given by

$$CMSEP(\hat\theta^B_i) = E\big((\hat\theta^B_i - \theta_i)^2 | y_i\big) = V(\theta_i | y_i) = (y_i + \alpha)/(\mu_i + \alpha)^2$$

In comparison, the CMSEP of the direct estimator $\hat\theta_i$ is given by

$$CMSEP(\hat\theta_i) = E\big((\hat\theta_i - \theta_i)^2 | y_i\big) = V(\theta_i | y_i) + \big(\hat\theta_i - E(\theta_i | y_i)\big)^2$$

It follows that the relative efficiency (RE) of the best predictor is given by

$$RE = CMSEP(\hat\theta^B_i)/CMSEP(\hat\theta_i) = V(\theta_i | y_i)\Big/\Big(V(\theta_i | y_i) + \big(\hat\theta_i - E(\theta_i | y_i)\big)^2\Big)$$

which is less than 1 unless $\hat\theta_i$ happens to be equal to $E(\theta_i | y_i)$.

4.2.2 Modeling variance at individual level

An alternative to direct modeling of $\theta_i$ at the domain level is to consider it as the mean of the within-domain individual RR, say, $\theta_{ik}$ for $k = 1, ..., n_i$. That is, $\theta_i = \sum_{k=1}^{n_i}\theta_{ik}/n_i$. As before, we assume that, given $\theta_i$, $y_i$ follows the Poisson distribution with parameter $\lambda_i = \mu_i\theta_i$. For the individual RR, consider first the gamma distribution

$$\theta_{ik} \stackrel{iid}{\sim} \text{Gamma}(\alpha, \alpha) \quad \text{where } E(\theta_{ik}) = 1 \text{ and } V(\theta_{ik}) = 1/\alpha \stackrel{def}{=} \sigma^2$$

It follows that $E(\theta_i) = 1$ and $V(\theta_i) = \sigma^2/n_i$. Indeed, we have $n_i\theta_i \sim \text{Gamma}(n_i\alpha, \alpha)$, and $n_i\theta_i | y_i \sim \text{Gamma}(y_i + n_i\alpha, \xi_i + \alpha)$, such that the EB estimator is given by

$$\hat\theta^{EB}_i = n_i^{-1}\frac{y_i + n_i\hat\alpha}{\xi_i + \hat\alpha} = \hat\gamma_i\hat\theta_i + (1 - \hat\gamma_i) \cdot 1$$

where $\hat\gamma_i = \xi_i/(\xi_i + 1/\hat\sigma^2) = (n_i\xi_i)/\big(n_i\xi_i + 1/\hat V(\theta_i)\big) = \mu_i/\big(\mu_i + 1/\hat V(\theta_i)\big)$.

As in the case of direct modeling of the domain-level variance, it is possible to derive the same estimator as a composite estimator, based on only the moment assumptions $E(\theta_{ik}) = 1$ and $V(\theta_{ik}) = \sigma^2$. Let $s^2$ be the same as before. An MME of $\sigma^2$ is in this case given by

$$\hat\sigma^2 = \max\Big(0, \Big(s^2 - \sum_i w_i/\mu_i\Big)\Big/\Big(\sum_i w_i/n_i\Big)\Big)$$

Next, $MSE(\hat\theta_i) = 1/\mu_i$ as before, whereas $MSE(1) = V(\theta_i) = \sigma^2/n_i$, such that

$$\hat\theta^C_i = \hat\phi_i\hat\theta_i + (1 - \hat\phi_i) \cdot 1 = \hat\theta^{EB}_i \quad \text{where } \hat\phi_i = (\hat\sigma^2/n_i)/(\hat\sigma^2/n_i + 1/\mu_i) = \hat\gamma_i$$

Now, just like before, this composite estimator is also bounded away from zero even if $\hat\theta_i = 0$, as long as $\hat\sigma^2 > 0$. However, it does not get any closer to the direct estimator $\hat\theta_i$ as $n_i$ grows larger, because $\hat\gamma_i = \xi_i/(\xi_i + 1/\hat\sigma^2)$ no longer depends on $n_i$. Rather, under the individual-level model, the direct estimator $\hat\theta_i$ itself will inevitably tend to unity as the domain population size grows to infinity, now that $\theta_i \stackrel{P}{\to} 1$ as $n_i \to \infty$ since $V(\theta_i) = \sigma^2/n_i$. In other words, while the difference in $\theta_i$ is a theoretical necessity under the domain-level model, it appears merely as a spurious effect under the individual-level model. Indeed, the individual-level model seems to suggest that ideally one should be able to explain the difference in mortality risk by some fundamental factors associated with the individuals, and that it only makes sense to speak of the between-area difference as the result of different area-specific congregations of individuals.

4.2.3 Jackknife MSE estimation

The MSE of $\hat\theta^C_i$ given by (3) with known $\sigma^2$ is given as $MSE(\hat\theta^C_i) = 1/(\mu_i + \sigma^{-2})$. A direct plug-in MSE estimator is given by $\widehat{MSE} = 1/(\mu_i + \hat\sigma^{-2})$ which, however, has two drawbacks. Firstly, it does not take into account the estimation uncertainty of $\hat\sigma^2$. Secondly, because the MSE is a non-linear function of $\sigma^2$, the variance of $\hat\sigma^2$ will turn into a bias of the direct plug-in estimator. Jiang, Lahiri and Wan (2002) proposed a unified jackknife MSE estimator that deals with both drawbacks. An application of their approach to the current setting can be given as follows.

Write the hypothetical composite estimator with known variance components as $\tilde\theta^C_i = \phi_i\hat\theta_i + (1 - \phi_i) \cdot 1$, in order to distinguish it from the empirical estimator which is designated as $\hat\theta^C_i$. The shrinkage coefficient can be written as $\phi_i = \phi(\mu_i, \sigma^2)$, i.e. as a function of $\mu_i$ and $\sigma^2$.
Moreover, write

$$MSE(\tilde\theta^C_i) = \phi_i^2/\mu_i + (1 - \phi_i)^2\sigma^2 \stackrel{def}{=} \kappa(\mu_i, \sigma^2)$$

i.e. also as a function of $(\mu_i, \sigma^2)$, and write $\hat\theta^C_i$ explicitly as $\hat\theta^C_i = \hat\phi_i\hat\theta_i + (1 - \hat\phi_i) \cdot 1$ where $\hat\phi_i = \phi(\mu_i, \hat\sigma^2)$ is a plug-in estimator of $\phi_i$. Observe the general decomposition

$$MSE(\hat\theta^C_i) = E\big((\tilde\theta^C_i - \theta_i)^2\big) + E\big((\hat\theta^C_i - \tilde\theta^C_i)^2\big) = \kappa(\mu_i, \sigma^2) + E\big((\hat\theta^C_i - \tilde\theta^C_i)^2\big)$$

where the first term is purely the prediction uncertainty with the parameters all given, and the second term is entirely due to the uncertainty associated with the estimation of the parameters. Jiang, Lahiri and Wan (2002) apply the jackknife to each of the two terms separately.

Under the current setting, at the $k$-th jackknife iteration, domain $k$ is deleted, and the remaining data are used to obtain the delete-$k$ estimator of the parameters, denoted by $\hat\sigma^2_{(-k)}$ here. These yield $\hat\phi_{i(-k)} = \phi(\mu_i, \hat\sigma^2_{(-k)})$, and $\hat\theta^C_{i(-k)}$ given as $\tilde\theta^C_i$ calculated using $\hat\phi_{i(-k)}$ instead of $\phi_i$, and $\kappa(\mu_i, \hat\sigma^2_{(-k)})$.
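The delete-one-domain loop just described can be sketched as follows. It combines a bias-corrected plug-in estimate of $\kappa$ with the usual jackknife estimate of the second term, using the factor $(K - 1)/K$ on both sums. The sketch is illustrative and rests on assumptions: all function names are hypothetical, the variance is re-estimated by the method-of-moments estimator with weights $n_i/n_.$ renormalized over the retained domains, and $\kappa(\mu, \sigma^2)$ is coded in the equivalent closed form $\sigma^2/(\mu\sigma^2 + 1)$ so that $\hat\sigma^2 = 0$ causes no division by zero.

```python
def mme_sigma2(y, mu, n):
    """Method-of-moments estimate of sigma^2 with weights n_i / n. ."""
    total = sum(n)
    s2 = sum(ni / total * (yi / mi - 1.0) ** 2 for yi, mi, ni in zip(y, mu, n))
    noise = sum(ni / total / mi for mi, ni in zip(mu, n))
    return max(0.0, s2 - noise)


def phi(mu_i, sigma2):
    """Shrinkage coefficient mu / (mu + 1 / sigma2), safe at sigma2 = 0."""
    return mu_i * sigma2 / (mu_i * sigma2 + 1.0)


def kappa(mu_i, sigma2):
    """MSE of the composite estimator with known sigma2: phi^2/mu + (1 - phi)^2 * sigma2."""
    return sigma2 / (mu_i * sigma2 + 1.0)


def composite(theta_hat, mu_i, sigma2):
    weight = phi(mu_i, sigma2)
    return weight * theta_hat + (1.0 - weight)


def jackknife_mse(y, mu, n, i):
    """Jackknife MSE estimate for the composite estimator of domain i."""
    K = len(y)
    theta_hat = y[i] / mu[i]  # the direct estimate of domain i is always kept
    sigma2_full = mme_sigma2(y, mu, n)
    estimate_full = composite(theta_hat, mu[i], sigma2_full)
    bias_sum = 0.0
    spread_sum = 0.0
    for k in range(K):
        keep = [g for g in range(K) if g != k]  # delete domain k for estimation only
        sigma2_k = mme_sigma2([y[g] for g in keep],
                              [mu[g] for g in keep],
                              [n[g] for g in keep])
        bias_sum += kappa(mu[i], sigma2_k) - kappa(mu[i], sigma2_full)
        spread_sum += (composite(theta_hat, mu[i], sigma2_k) - estimate_full) ** 2
    factor = (K - 1) / K
    return kappa(mu[i], sigma2_full) - factor * bias_sum + factor * spread_sum


# Hypothetical data for three domains.
print(jackknife_mse([3.0, 14.0, 95.0], [1.0, 10.0, 100.0], [100.0, 1000.0, 10000.0], 1))
```

Because of the bias correction, the result is not guaranteed to be positive in very small samples; the sketch returns it without truncation.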

Based on all the, say, $K$ jackknife iterations, we obtain

$$\hat\kappa(\mu_i, \sigma^2) = \kappa(\mu_i, \hat\sigma^2) - \frac{K - 1}{K}\sum_{k=1}^K \big(\kappa(\mu_i, \hat\sigma^2_{(-k)}) - \kappa(\mu_i, \hat\sigma^2)\big)$$

i.e. the plug-in estimator of $\kappa$ with a jackknife bias correction, and

$$\hat E\big((\hat\theta^C_i - \tilde\theta^C_i)^2\big) = \frac{K - 1}{K}\sum_{k=1}^K \big(\hat\theta^C_{i(-k)} - \hat\theta^C_i\big)^2$$

i.e. the usual jackknife estimator of $E\big((\hat\theta^C_i - \tilde\theta^C_i)^2\big)$ directly. In particular, notice that the domains are deleted only for the estimation of the parameters, because to calculate $\hat\theta^C_{i(-k)}$ one must use $\hat\theta_i$.

Finally, we notice that Lohr and Rao (2009) proposed a modified jackknife estimator for the conditional MSE of prediction, which can be useful if the fully parametric Poisson-Gamma model is adopted. We shall not enter into the details here.

4.2.4 Over-shrinkage

Small area estimation methods are often developed from the point of view of area-specific prediction. So is the composite estimator given by (3). However, this may be unsatisfactory if we are interested in some aspects of the distribution of the small area parameters as well. For instance, the between-area variation of the area-specific empirical best predictor is often found to be much smaller than the true variation in the population, which is known as over-shrinkage. The problem has mostly been studied under the Bayesian framework (e.g. Louis, 1984; Spjøtvoll and Thomsen, 1987; Ghosh, 1992). Zhang (2003) showed that the empirical best linear unbiased predictor (EBLUP) suffers from over-shrinkage as expected, and proposed a simultaneous estimator which has better ensemble properties from the frequentist perspective.

Ensemble properties, such as the range, standard deviation, minimum and maximum of the life expectancy at the municipality level, are of course of much interest in the present context. Over-shrinkage is therefore an aspect that we will pay attention to. The simultaneous estimator proposed by Zhang (2003) can be applied under the fully parametric Poisson-Gamma model.
For the composite estimator given by (3), a convenient shrinkage-adjusted alternative is given by

$$\hat\theta^A_i = \phi_i^\eta\hat\theta_i + (1 - \phi_i^\eta) \cdot 1 \quad \text{for } 0 < \eta < 1 \qquad (4)$$

Provided $\phi_i \in (0, 1)$, $\phi_i^\eta$ will assign more weight to the direct estimator for any $\eta \in (0, 1)$ and, in this way, works against over-shrinkage towards 1. In particular, the choice $\eta = 1/2$ can be motivated using a constrained empirical Bayes argument (e.g. Spjøtvoll and Thomsen, 1987), which minimizes the MSE of the composite estimator subject to the constraint that the resulting estimator has a between-area variance that matches the postulated variance of $\theta_i$.

4.3 Variance component model for domain relative risk

Under the basic smoothing approach, estimation is carried out in separate subsets of domains, to be referred to as clusters of domains. But there may naturally exist neighboring clusters, such as the neighboring age groups in smoothing by age (and sex). It is then unappealing theoretically if

the parameter estimates do not vary smoothly over the neighboring clusters. Also, the parameter estimates for the same cluster may suffer from undue variations over time. As a potential remedy we now develop a variance component model for the domain RR, which enables us to borrow strength across the clusters while allowing for potential cluster-specific random effects.

Again we use a simplified domain index to explain the basic approach. Let $h = 1, ..., H$ be the cluster index, and let $(hi)$ denote the $i$-th domain within the $h$-th cluster, for $i = 1, ..., m_h$. For example, $h = (j, a)$ if each sex-age cohort forms a cluster of domains, and $h = (i, j)$ in the case of smoothing by area (and sex). Consider the following hierarchical model

$$y_{hi} | \theta_{hi} \sim \text{Poisson}(\lambda_{hi}) \quad \text{for } \lambda_{hi} = \mu_{hi}\theta_{hi} \text{ and } \mu_{hi} = n_{hi}\xi_{hi}$$
$$\theta_{hi} = \psi_h\psi_{hi} \quad \text{where } E(\psi_h) = E(\psi_{hi}) = 1 \text{ and } V(\psi_h) = \sigma^2_\psi \text{ and } V(\psi_{hi}) = \sigma^2_h$$

and $\psi_h$ and $\psi_{hi}$ are independent of each other, and $\psi_{hi}$ and $\psi_{hg}$ are independent of each other for $i \neq g$. That is, the domain RR $\theta_{hi}$ is the product of a cluster-level RR $\psi_h$ and a domain-to-cluster RR $\psi_{hi}$. Notice that in general we allow $V(\psi_{hi}) = \sigma^2_h$ to vary across the clusters. A special case is given by the homogeneity assumption of a single domain-to-cluster variance component

$$\sigma^2_h = \sigma^2 \qquad (5)$$

We refer to this as the variance homogeneity assumption which, of course, would remove all the risks of obtaining jerky estimates of $V(\theta_{hi})$ across the clusters.

We have $E(y_{hi}) = \mu_{hi}$ and $V(y_{hi}) = \mu_{hi} + \mu_{hi}^2 V(\theta_{hi})$ under the Poisson assumption, where

$$V(\theta_{hi}) = \sigma^2_\psi\sigma^2_h + \sigma^2_\psi + \sigma^2_h \stackrel{def}{=} v_h \quad \text{and} \quad Cov(y_{hi}, y_{hg}) = \mu_{hi}\mu_{hg}\sigma^2_\psi$$

Let $s_h^2 = \sum_{i=1}^{m_h} w_{hi}(\hat\theta_{hi} - 1)^2$, where $\sum_{i=1}^{m_h} w_{hi} = 1$ and $\hat\theta_{hi} = y_{hi}/\mu_{hi}$. We have

$$E(s_h^2) = \sum_i w_{hi}V(\hat\theta_{hi}) = \sum_i w_{hi}\Big(v_h + \frac{1}{\mu_{hi}}\Big) = v_h + \beta_{1h} \quad \text{for } \beta_{1h} = \sum_i w_{hi}/\mu_{hi}$$

Next, let $e_h^2 = \sum_{i=1}^{m_h} w_{hi}(\hat\theta_{hi} - \hat{\bar\theta}_{h.})^2$, where $\hat{\bar\theta}_{h.} = \sum_{i=1}^{m_h}\hat\theta_{hi}/m_h$. We have

$$E(e_h^2) = \sum_i w_{hi}V(\hat\theta_{hi} - \hat{\bar\theta}_{h.})$$
$$= \sum_i w_{hi}\Big\{\Big(\frac{m_h - 1}{m_h}\Big)^2 V\Big(\frac{y_{hi}}{\mu_{hi}}\Big) + \frac{1}{m_h^2}\sum_{g \neq i}V\Big(\frac{y_{hg}}{\mu_{hg}}\Big) - \frac{m_h - 1}{m_h^2}\sum_{g \neq i}Cov\Big(\frac{y_{hi}}{\mu_{hi}}, \frac{y_{hg}}{\mu_{hg}}\Big) - \frac{1}{m_h^2}\sum_{l \neq g \neq i}Cov\Big(\frac{y_{hl}}{\mu_{hl}}, \frac{y_{hg}}{\mu_{hg}}\Big)\Big\}$$
$$= \beta_{2h} + v_h(m_h - 1)/m_h - \sigma^2_\psi(m_h - 1)(2m_h - 3)/m_h^2$$

where

$$\beta_{2h} = \sum_i w_{hi}\Big\{\frac{(m_h - 1)^2}{m_h^2\mu_{hi}} + \frac{1}{m_h^2}\sum_{g \neq i}\frac{1}{\mu_{hg}}\Big\} = \frac{1}{m_h^2}\sum_i\frac{1}{\mu_{hi}} + \frac{m_h - 2}{m_h}\sum_i\frac{w_{hi}}{\mu_{hi}}$$

It follows that, based on $\{\hat\theta_{hi}; i = 1, ..., m_h\}$, an MME estimator of $\sigma^2_\psi$ can be given as

$$\hat\sigma^2_{\psi,h} = \max\Big(0, \Big(s_h^2 - \frac{m_h}{m_h - 1}e_h^2 + \frac{m_h}{m_h - 1}\beta_{2h} - \beta_{1h}\Big)\Big/(2 - 3/m_h)\Big)$$

To combine the estimates across $h$, we use the weights $w_h$ where $\sum_h w_h = 1$, to obtain

$$\hat\sigma^2_\psi = \sum_h w_h\hat\sigma^2_{\psi,h}$$

For instance, a choice may be $w_h = n_h/n_.$ where $n_. = \sum_h n_h$ and $n_h = \sum_i n_{hi}$. Of course, an alternative is to combine $s_h^2$, $e_h^2$, $\beta_{1h}$ and $\beta_{2h}$ first, and then derive an estimate of $\sigma^2_\psi$ based on the combined sample statistics. The difference is that the proposed $\hat\sigma^2_\psi$ tends to be more robust, in the sense that it is much less likely to obtain 0 as an estimate of $\sigma^2_\psi$, since the effects of the cluster-specific sample statistics are limited by the truncation at zero within each cluster. The numerical stability can be useful in situations where $\sigma^2_\psi$ is small.

Next, having obtained $\hat\sigma^2_\psi$, we estimate $\sigma^2_h$ by $\hat\sigma^2_h = (\hat\sigma^2_{h,s} + \hat\sigma^2_{h,e})/2$ where

$$\hat\sigma^2_{h,s} = \max\big(0, (s_h^2 - \beta_{1h} - \hat\sigma^2_\psi)/(1 + \hat\sigma^2_\psi)\big)$$

and

$$\hat\sigma^2_{h,e} = \max\Big(0, \big(e_h^2 - \beta_{2h} + (1 - 1/m_h)(1 - 3/m_h)\hat\sigma^2_\psi\big)\Big/\big((1 - 1/m_h)(1 + \hat\sigma^2_\psi)\big)\Big)$$

Notice that, again, we apply potential truncation at zero to $\hat\sigma^2_{h,s}$ and $\hat\sigma^2_{h,e}$ separately, before combining them to obtain $\hat\sigma^2_h$. Moreover, an estimator of $\sigma^2$ under the variance homogeneity assumption can be obtained using the same weights as above, i.e.

$$\hat\sigma^2 = \sum_h w_h\hat\sigma^2_h$$

Finally, given all the estimates of the variance components, we obtain a composite estimator

$$\hat\theta^C_{hi} = \hat\phi_{hi}\hat\theta_{hi} + (1 - \hat\phi_{hi}) \cdot 1 \quad \text{for } \hat\phi_{hi} = \mu_{hi}/(\mu_{hi} + 1/\hat v_h) \qquad (6)$$

where $\hat v_h = \hat\sigma^2_\psi\hat\sigma^2_h + \hat\sigma^2_\psi + \hat\sigma^2_h$. Notice that the composite estimator (6) reduces to that given by (3), i.e. the variance-component approach reduces to the basic smoothing approach, if we set $\hat\sigma^2_\psi \equiv 0$ and estimate $\sigma^2_h$ separately based on each set of $\{\hat\theta_{hi}; i = 1, ..., m_h\}$.

Notice also that, since the reference mortality rate $\xi_{ja}$ is the overall mean mortality rate within each sex-age cohort, we expect the cluster-level variance component $\sigma^2_\psi$ to be small when $h = (j, a)$.
Whereas, in the case of $h = (i, j)$ with age $a$ as the within-cluster domain index, if a municipality tends to have a higher than usual mortality rate in all the age groups, then it may be the case that a non-negligible portion of all the $\theta_{hi}$'s may be attributed to a common area-level effect $\psi_h$, in which case $\sigma^2_\psi$ should be appreciable compared to the $\sigma^2_h$'s.

Denote by $\tilde\theta^C_{hi}$ the hypothetical composite estimator with known parameters. We have

$$MSE(\tilde\theta^C_{hi}; v_h) = 1/(\mu_{hi} + v_h^{-1}) = 1/\big(\mu_{hi} + 1/V(\theta_{hi})\big)$$
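Given estimates of the two variance components, the composite estimator (6) is a one-line computation. The sketch below is a minimal illustration with hypothetical function names and inputs. The optional exponent `eta` applies the shrinkage adjustment of (4) to the coefficient of (6); combining the two in a single function is an assumption of this sketch, and `eta = 1` gives (6) unchanged.

```python
def total_variance(sigma2_psi, sigma2_h):
    """v_h = sigma2_psi * sigma2_h + sigma2_psi + sigma2_h, the variance of theta_hi."""
    return sigma2_psi * sigma2_h + sigma2_psi + sigma2_h


def composite_vc(y, mu, sigma2_psi, sigma2_h, eta=1.0):
    """Composite estimate of the relative risk of one domain under the variance component model.

    phi = mu / (mu + 1 / v_h), coded as mu * v / (mu * v + 1) so that v = 0 is allowed.
    The weight given to the direct estimate y / mu is phi ** eta.
    """
    v = total_variance(sigma2_psi, sigma2_h)
    phi = mu * v / (mu * v + 1.0)
    weight = phi ** eta
    return weight * (y / mu) + (1.0 - weight)


# Hypothetical domain: 0 deaths observed where 1 is expected under the reference rate.
print(composite_vc(0.0, 1.0, 0.0, 1.0))           # basic smoothing, sigma2_psi = 0
print(composite_vc(0.0, 1.0, 0.0, 1.0, eta=0.5))  # shrinkage-adjusted
```

For this domain the adjusted estimate lies closer to the direct estimate of zero than the unadjusted one, as intended by (4).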

A plug-in MSE estimator for $\hat\theta^C_{hi}$ is thus given by $\widehat{MSE} = 1/(\mu_{hi} + \hat v_h^{-1})$ directly. As pointed out earlier, this has its drawbacks. The jackknife MSE estimator may be applied as follows. Write $\tilde\theta^C_{hi} = \gamma_{hi}\hat\theta_{hi} + (1 - \gamma_{hi}) \cdot 1$ where the shrinkage coefficient $\gamma_{hi} = \gamma(\mu_{hi}, v_h) = \gamma(\mu_{hi}, \sigma^2_\psi, \sigma^2_h)$, and

$$MSE(\tilde\theta^C_{hi}) = \gamma_{hi}^2/\mu_{hi} + (1 - \gamma_{hi})^2 v_h \stackrel{def}{=} \kappa(\mu_{hi}, \sigma^2_\psi, \sigma^2_h)$$

where $MSE(\hat\theta_{hi}) = 1/\mu_{hi}$ under the Poisson assumption and $MSE(1) = V(\theta_{hi}) = v_h$. Next, write $\hat\theta^C_{hi} = \hat\gamma_{hi}\hat\theta_{hi} + (1 - \hat\gamma_{hi}) \cdot 1$ where $\hat\gamma_{hi} = \gamma(\mu_{hi}, \hat v_h)$ is a plug-in estimator of $\gamma_{hi}$. Observe the general decomposition

$$MSE(\hat\theta^C_{hi}) = E\big((\tilde\theta^C_{hi} - \theta_{hi})^2\big) + E\big((\hat\theta^C_{hi} - \tilde\theta^C_{hi})^2\big) = \kappa(\mu_{hi}, \sigma^2_\psi, \sigma^2_h) + E\big((\hat\theta^C_{hi} - \tilde\theta^C_{hi})^2\big)$$

In the $k$-th jackknife iteration, cluster $k$ is deleted, and the remaining data are used to obtain the delete-$k$ estimator of the variance components, denoted by $\hat\sigma^2_{\psi(-k)}$ and $\hat\sigma^2_{h(-k)}$. These yield $\hat\gamma_{hi(-k)} = \gamma(\mu_{hi}, \hat\sigma^2_{\psi(-k)}, \hat\sigma^2_{h(-k)})$, and $\hat\theta^C_{hi(-k)}$ given as $\tilde\theta^C_{hi}$ calculated using $\hat\gamma_{hi(-k)}$ instead of $\gamma_{hi}$, and $\kappa(\mu_{hi}, \hat\sigma^2_{\psi(-k)}, \hat\sigma^2_{h(-k)})$. Based on all the, say, $K$ jackknife iterations, we obtain

$$\hat\kappa(\mu_{hi}, \sigma^2_\psi, \sigma^2_h) = \kappa(\mu_{hi}, \hat\sigma^2_\psi, \hat\sigma^2_h) - \frac{K - 1}{K}\sum_{k=1}^K\big(\kappa(\mu_{hi}, \hat\sigma^2_{\psi(-k)}, \hat\sigma^2_{h(-k)}) - \kappa(\mu_{hi}, \hat\sigma^2_\psi, \hat\sigma^2_h)\big)$$
$$\hat E\big((\hat\theta^C_{hi} - \tilde\theta^C_{hi})^2\big) = \frac{K - 1}{K}\sum_{k=1}^K\big(\hat\theta^C_{hi(-k)} - \hat\theta^C_{hi}\big)^2$$

5 Results

5.1 Model diagnostics

5.1.1 Exploring sex-age cohort effects

We start by taking a look at the reference sex-age cohort mortality rate $\xi_{ja}$ for $j = 1, 2$ and $a = 0, ..., 99$. In Figure 1 we plot the rates by three age groups (0-20, 15-55, 50-90). In the left panel of each row, the mortality rates of both sexes are plotted against age. In the right panel, the mortality rates of males are plotted against those of females. In addition, a fitted simple regression line between the two sets of mortality rates is plotted.

The effects of sex on cohort mortality rates can be summarized as follows. To start with, there seems to be a higher mortality rate for a male newborn (i.e. age 0).
Next, roughly speaking, between age 1 and 15, there is hardly any difference in the mortality rates due to sex. Then comes a short period in life, roughly between 15 and 20, where the male mortality rate increases from equal to almost double the female mortality rate.

The difference remains rather constant for the next 30-40 years, at the same time as both are of course increasing with age. Finally, in the last long period, roughly from age 60 towards the end of life, there is a steady acceleration in the growth of male mortality compared to that of the female.

It does seem possible to combine the mortality data of both sexes using a piece-wise regression model, i.e. fitted separately to suitably defined age groups. However, there is really little to be gained from this in the current setting, where the $\xi_{ja}$'s are treated as constant reference values, since any potential effects of sex on the mortality rate are already taken into consideration now that the RR (or SMR) $\theta_{ija}$ is defined in relation to $\xi_{ja}$.

5.1.2 Basic smoothing approach and variance homogeneity assumption

Next, we explore the basic smoothing approach. Figure 2 shows some diagnostics related to the composite estimator (3) for, respectively, the sex-age cohort with $(j, a) = (1, 6)$ in the top row, and the cluster of age groups with $(j, i) = (1, 1)$ in the bottom row.

The left column contains the scatter plots of the observed domain mortality vs. the composite estimate given by (3). The estimates appear to track the observed values without bias in both cases.

The middle column contains Q-Q normal plots of the Anscombe residuals for the Poisson distribution, given by

$$r_i = \frac{3y_i^{2/3} - 3\hat\lambda_i^{2/3}}{2\hat\lambda_i^{1/6}}$$

see e.g. McCullagh and Nelder (1989). The idea is to transform a non-normal random variable to a scale where the normal approximation is the best. While the Poisson distribution does not seem too bad for the domains across all the areas within the given sex-age cohort (i.e. the top middle plot), the assumption is quite unacceptable for the male age groups in the first municipality (i.e. the bottom middle plot).

Finally, the right column contains Q-Q gamma plots of the predicted SMR $\hat\theta_i$, where the gamma distribution is chosen to match the empirical variance of the set of $\hat\theta_i$'s.
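The Anscombe residual used in the middle column is simple to compute once the fitted Poisson means are available. The following sketch is illustrative only: the function name and the example values are assumptions, and the fitted mean $\hat\lambda_i$ is taken as given and must be positive.

```python
def anscombe_residual(y, lam):
    """Anscombe residual for a Poisson count y with fitted mean lam > 0.

    r = (3 * y**(2/3) - 3 * lam**(2/3)) / (2 * lam**(1/6))
    """
    return 1.5 * (y ** (2.0 / 3.0) - lam ** (2.0 / 3.0)) / lam ** (1.0 / 6.0)


# Hypothetical domains: observed deaths and fitted Poisson means.
observed = [0.0, 3.0, 12.0]
fitted = [1.2, 2.5, 10.0]
print([anscombe_residual(y, lam) for y, lam in zip(observed, fitted)])
```

The residuals can then be sorted and plotted against normal quantiles to produce a Q-Q plot of the kind described above.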
In neither case does the gamma distribution seem to fit well.

We notice that Figure 2 illustrates just two examples of the many separate clusters of domains under the basic smoothing approach, i.e. 100 sex-age cohorts for each sex in the case of smoothing by age (and sex) and 430 area-clusters of domains for each sex in the case of smoothing by area (and sex). Having examined many cases, we reached the following two conclusions. (i) The parametric gamma model does not fit the data well enough in most clusters, such that the semi-parametric approach to the composite estimator is to be preferred. (ii) Still, the composite estimator does not behave smoothly enough across the clusters of domains, and even more so in the case of smoothing by area than by age.

Figure 3 illustrates the problem with the basic smoothing approach. The dashed line shows how the variance of the SMR, $\sigma^2$, varies across the sex-age cohorts under basic smoothing. It can be seen that there are considerable fluctuations among the cohorts between age 0 and 20, for both male and female, i.e. $\hat\sigma^2$ can be large in one age (and sex) cohort but zero in the neighboring age cohort. The same fluctuations also exist among the elderly age cohorts, which however are not as visible in Figure 3 due to the fact that the scale is dominated by the large $\hat\sigma^2$-values among the younger age cohorts. One possibility to stabilize $\hat\sigma^2$ is to use a moving average instead of

the direct within-cluster estimate. Denote by $\hat\sigma^2_{j,a}$ the direct within sex-age (i.e. $(j, a)$) cohort estimate of $\sigma^2$. Denote by $\delta = 3$ a chosen bandwidth (BW), for which we derive the corresponding moving average estimate of $\sigma^2_{j,a}$ as

$$\tilde\sigma^2_{(j,a)} = \sum_{b=\max(0,a-\delta)}^{\min(99,a+\delta)} w_{j,b}\hat\sigma^2_{j,b} \quad \text{where } w_{j,b} = n_{j,b}\Big/\Big(\sum_{b'=\max(0,a-\delta)}^{\min(99,a+\delta)} n_{j,b'}\Big)$$

The moving average estimates are shown in the dotted lines in Figure 3, for 3 different values of the bandwidth. Notice that setting $\delta \geq 99$ leads to a single global estimate of $\sigma^2$.

Now, the moving average estimator is unbiased under a variance homogeneity assumption among the neighboring age cohorts, the number of which is determined by the bandwidth. However, given the variance homogeneity assumption, a direct estimate of $\sigma^2$ can be obtained by applying the basic smoothing approach to all the domains in the neighboring cohorts, regardless of which age group a domain belongs to. These direct estimates of $\sigma^2$ under the neighborhood variance homogeneity assumption are shown in solid lines in Figure 3.

For male age cohorts, the moving average estimate and the neighborhood variance homogeneity estimate agree quite well with each other no matter which bandwidth is chosen. The agreement is not as good for the female age cohorts, especially as the bandwidth increases. The reason seems to be the existence of certain outlier female domains, which cause more erratic behavior of the direct estimator under the neighborhood variance homogeneity assumption. As explained earlier, when describing the composite estimator (6), the moving average estimator is more robust against the outlier domains, because the effect of such an outlier is limited to the cohort (or cluster) it belongs to. We have an indication of this here if we compare the estimates of $\sigma^2$ for male and female both with $\delta = 1$. The moving average estimate for female appears much more plausible than the corresponding direct estimate.

5.1.3 Fitting variance component model

Table 1: Estimated variance components $(\hat\sigma^2_\psi, \hat\sigma^2)$ for each sex. Clusters by age or area. With $\sigma^2_\psi$ or not.

Cluster   With $\sigma^2_\psi$   Male            Female
Age       Yes                    (.65, )         (.13, 1.584)
Age       No                     (0, 1.75)       (0, 1.711)
Area      Yes                    (.9, 4.633)     (.6, )
Area      No                     (0, 4.793)      (0, )

Finally, we examine the variance component model, which allows for explicit cluster-level random effects. Table 1 shows the various estimates of the variance components $\sigma^2_\psi$ and $\sigma^2$ under the variance homogeneity assumption (5). For instance, $(\hat\sigma^2_\psi, \hat\sigma^2) = (.65, )$ are obtained under the variance component model with clusters being the sex-age cohorts, i.e. $h = (j, a)$, whereas $(\hat\sigma^2_\psi, \hat\sigma^2) = (0, 1.75)$ are obtained for the same clusters of domains, without allowing for the cluster-level random effects and only under the variance homogeneity assumption.

As explained before, it is not surprising that $\hat\sigma^2_\psi$ is small under smoothing by age (and sex). But neither is $\sigma^2_\psi$ appreciable under smoothing by area, especially in comparison to the domain-to-cluster variance component $\sigma^2$. Somewhat liberally, one may draw the conclusion that the SMRs,

given by $\theta_{ija} = \tau_{ija}/\tau_{ja}$, are essentially independent of each other, both across age and area, conditional on the reference cohort mortality rates $\xi_{ja}$. More cautiously, one can at least conclude that whatever area effect there may be among $\{\theta_{ija};\ i = 1, ..., m\}$, it cannot be captured by the variance component model.

Main conclusions

In summary, we reached the following two main conclusions. Practically, composite estimates of the domain mortality rates need to be derived under a neighborhood variance homogeneity assumption, of which the global variance homogeneity assumption is a special case. Theoretically, the approach should not be motivated from an unbiased model-based prediction perspective, but rather from a bias-variance trade-off, MSE-stabilizing consideration, which we shall refer to as the smoothing perspective.

5.2 MSE estimation

The MSE of the estimated life expectancy on birth can be obtained by the delta method, given the MSEs of the corresponding estimated mortality rates. We have, for $l_{ij0} \equiv 1$ and $\omega_{ija} \equiv 1/2$,

$$T_{ij} = \sum_{x=0}^{A} L_{ijx} = \sum_{x=0}^{A} l_{ijx}\,(1 - q_{ijx}/2) = (1 - q_{ij0}/2) + \sum_{x=1}^{A} (1 - q_{ijx}/2) \prod_{y=0}^{x-1} (1 - q_{ijy}) .$$

It follows that

$$\partial T_{ij}/\partial \tau_{ij0} = -\Big\{ 1/2 + (1 - q_{ij1}/2) + \sum_{y=2}^{A} (1 - q_{ijy}/2) \prod_{x=1}^{y-1} (1 - q_{ijx}) \Big\} \Big/ (1 + \tau_{ij0})^2$$

and, for $a > 0$,

$$\partial T_{ij}/\partial \tau_{ija} = -\Big\{ \Big( \prod_{x=0}^{a-1} (1 - q_{ijx}) \Big)/2 + \sum_{y=a+1}^{A} (1 - q_{ijy}/2) \prod_{x=0;\,x \neq a}^{y-1} (1 - q_{ijx}) \Big\} \Big/ (1 + \tau_{ija})^2 .$$

To reduce the variance of the MSE estimator, we evaluate all the partial derivatives at $\tau_{ija} = \xi_{ja}$, and obtain

$$\mathrm{MSE}(\hat T_{ij}) = \sum_{x=0}^{A} \big( \partial T_{ij}/\partial \tau_{ijx} \big)^2 \Big|_{\tau_{ijx} = \xi_{jx}} \mathrm{MSE}(\hat\tau_{ijx}) \qquad (7)$$

for any estimator of $\tau_{ija} = \xi_{ja}\theta_{ija}$, for $i = 1, ..., m$, and $j = 1, 2$, and $a = 1, ..., A$.

One needs to be aware of the underlying assumptions for the simple expression (7). In the case of the direct within-domain estimator of the mortality rate, it requires independence between the direct estimators across the domains of sex, age and Municipality.
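The delta-method expression (7) can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the report: the helper `death_prob` is a hypothetical conversion from a mortality rate to a death probability, the partial derivatives are approximated numerically by central differences instead of the closed forms, and all function and argument names are invented for illustration.

```python
def death_prob(tau):
    # ASSUMPTION: hypothetical rate-to-probability conversion; replace it
    # with the definition of q used in the actual life-table construction.
    return tau / (1.0 + tau)


def life_expectancy(tau):
    """T = sum_x l_x (1 - q_x / 2), with l_0 = 1 and l_{x+1} = l_x (1 - q_x)."""
    total, survivors = 0.0, 1.0
    for rate in tau:
        q = death_prob(rate)
        total += survivors * (1.0 - q / 2.0)  # person-years lived at this age
        survivors *= 1.0 - q                  # survivors entering the next age
    return total


def delta_method_mse(xi, mse_tau, step=1e-6):
    """Sum over ages of (dT/dtau_x)^2 * MSE(tau_hat_x), derivatives taken at xi."""
    mse = 0.0
    for a in range(len(xi)):
        up, down = list(xi), list(xi)
        up[a] += step
        down[a] -= step
        # Central difference approximation of the partial derivative.
        slope = (life_expectancy(up) - life_expectancy(down)) / (2.0 * step)
        mse += slope ** 2 * mse_tau[a]
    return mse
```

Evaluating the derivatives at the reference rates `xi` rather than at the noisy domain estimates mirrors the variance-reduction step described in the text; the numerical differentiation is only a convenience of this sketch.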
While it may be questionable whether it is appropriate to treat the mortalities within neighboring age-sex groups as independent, the practice is standard in the relevant demographic statistical literature (e.g. Chiang, 1984). The potential inter-domain dependence, however, is less critical to the smoothing estimates considered

in this study. The modeling approach is aimed at the SMR $\theta_{ija}$, given the reference mortality rate $\xi_{ja}$ defined in (). In other words, what is required is conditional independence between the $\hat\theta_{ija}$'s given $\xi_{ja}$. Results of fitting the variance component model contain no evidence against this assumption of conditional independence. Another requirement is that one can ignore the covariance between the direct within-domain estimator of the mortality rate and the estimator of the variance component. Given the large number of Municipalities, this seems acceptable.

The plug-in MSE estimator is convenient, but we would like to check its performance against the jackknife MSE estimator. Assume global variance homogeneity and no cluster-level variance component, i.e. $\sigma^2_\psi = 0$. On deletion of an age-sex cohort at each jackknife iteration, the corresponding jackknife replicate estimate of the global variance component $\sigma^2$ is simply the weighted average of the remaining within-cluster variance estimates $\hat\sigma^2_h$. No recalculation is necessary, provided the basic smoothing approach has been applied once to each age-sex cohort. Using this jackknife replicate estimate of $\sigma^2$, we obtain all the corresponding jackknife replicate estimates of the within-cohort domain mortality rates, as well as all the corresponding plug-in MSE estimates.

In Figure 4 the two MSE estimators are compared to each other. Given age and sex, the ratios between the two estimates across the Municipalities are summarized in terms of the minimum, median and maximum value. It can be seen that, while there may be around 3% underestimation in the worst case, the median value of the MSE ratio is very close to 1. Since the MSE of the life expectancy estimator is a weighted sum of the relevant domain MSEs by (7), we conclude that it is acceptable to use the plug-in MSE estimator in this case.
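The delete-one-cohort replicates of the global variance component, and the minimum/median/maximum summary of the MSE ratios, can be sketched as follows. This is an illustration under a stated assumption: the weights of the weighted average are taken to be proportional to hypothetical cohort sizes `size`, which this section does not spell out. The function names are likewise invented.

```python
from statistics import median


def jackknife_global_variance(var_hat, size):
    """Leave-one-cohort-out replicates of the pooled variance estimate.

    var_hat[h] is the within-cluster variance estimate of cohort h.
    size[h] is its weight (ASSUMPTION: proportional to the cohort size).
    """
    weighted_sum = sum(n * v for n, v in zip(size, var_hat))
    weight_total = sum(size)
    # Dropping cohort h only removes its own term, so nothing is refitted.
    return [
        (weighted_sum - n * v) / (weight_total - n)
        for n, v in zip(size, var_hat)
    ]


def ratio_summary(plug_in_mse, jackknife_mse):
    """Minimum, median and maximum of the plug-in / jackknife MSE ratios."""
    ratios = [p / j for p, j in zip(plug_in_mse, jackknife_mse)]
    return min(ratios), median(ratios), max(ratios)
```

Because each replicate reuses the cohort-level estimates already computed, the cost is linear in the number of cohorts, which is the practical point made in the text.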
5.3 Estimates of Municipality life expectancy on birth

We now calculate the Municipality life expectancy on birth for male and female, respectively, based on the following estimators of the domain mortality rate by sex, age and Municipality:

- Direct estimator
- Basic smoothing within each sex-age cohort
- Basic smoothing within each sex-age cohort, with over-shrinkage adjustment (4) for η = .5
- Neighboring variance homogeneity assumption and moving average variance estimator with bandwidth 5
- Neighboring variance homogeneity assumption and moving average variance estimator with bandwidth 5, with over-shrinkage adjustment (4) for η = .5
- Global variance homogeneity assumption
- Global variance homogeneity assumption, with over-shrinkage adjustment (4) for η = .5

The estimated life expectancies are plotted in Figure 5 for male and in Figure 6 for female. Moreover, the quartiles of the life expectancies across the Municipalities are given in Table 2. We notice the following.

The comparisons between different methods show a similar pattern of performance for male and female. The choice of methodology does not need to differ according to sex.

Table 2: Quartiles of estimated life expectancies across the Municipalities. Basic: Basic smoothing. Neighbor: Neighboring variance homogeneity assumption and moving average variance estimator of bandwidth 5. Global: Global variance homogeneity assumption. Adjusted: Over-shrinkage adjustment for η = .5.

Sex      Method               Minimum   25%-Quantile   Median   75%-Quantile   Maximum
Male     Direct Estimator
         Basic
         Basic, Adjusted
         Neighbor
         Neighbor, Adjusted
         Global
         Global, Adjusted
Female   Direct Estimator
         Basic
         Basic, Adjusted
         Neighbor
         Neighbor, Adjusted
         Global
         Global, Adjusted

Optimal area-specific composite estimation leads to over-shrinkage of the domain mortality rates towards the synthetic mean and, subsequently, over-shrinkage of the life expectancies. Plausible adjustments can be obtained using a simple constrained empirical Bayes approach (Spjøtvoll and Thomsen, 1987). This applies to all three variance assumptions.

The basic smoothing approach, nevertheless, retains too much over-shrinkage to be useful in reality. The reason is that the cohort-specific variance component estimates are often zero, or too small, for the elder age groups.

In theory, the neighboring variance homogeneity assumption is more plausible than the global homogeneity assumption. Empirically, however, the differences are small. The global homogeneity assumption is simple in practice: it smooths all the domain mortality rate estimates by means of a single global variance component estimate. This is likely to be a more stable option over time, and it avoids the ad hoc choice of bandwidth for the moving average variance component estimator under the neighboring variance homogeneity assumption.

Smoothing leads to a considerable reduction of the range of the estimated Municipality life expectancy, compared to direct estimation. For the global smoothing option with over-shrinkage adjustment by η = .5, the range is about halved for female, and more so for male.
Still, it can be seen from Figures 5 and 6 that the extreme life expectancy values that have been suppressed by smoothing belong initially to the smaller Municipalities. The smoothing estimates agree well with the direct estimates for large Municipalities. Notice that, in the final application, the shrinkage tuning parameter η can be set closer to 1 in order to increase the range of the smoothing estimates. But this is a choice that should be made

together with the subject-matter specialist.

The estimated root MSE (RMSE) of the direct estimator of the Municipality life expectancy is compared to that of the global smoothing in Figure 7 for all the Municipalities. Smoothing yields considerable gains of efficiency for the smaller Municipalities. The median value of the relative efficiency (RE) is 48% for male without over-shrinkage adjustment and 54% with adjustment. The median RE is 41% for female without over-shrinkage adjustment and 48% with adjustment. Therefore, over-shrinkage adjustment does not cause much loss of efficiency. The mean (or median) RMSE of the Municipality life expectancy can be reduced to below 1 year for both male and female.

References

Chiang, C.L. (1984). The Life Table and its Applications. Florida: Robert E. Krieger Publ. Co.
Ghosh, M. (1992). Constrained Bayes estimation with applications. Journal of the American Statistical Association, 87.
Jiang, J., Lahiri, P., and Wan, S. (2002). A unified jackknife theory for empirical best prediction with M-estimation. The Annals of Statistics, 30.
Lohr, S. and Rao, J.N.K. (2009). Jackknife estimation of mean square error of small area predictors in nonlinear mixed models. Biometrika, 96.
Louis, T. (1984). Estimating a population of parameter values using Bayes and empirical Bayes methods. Journal of the American Statistical Association, 79.
Marshall, R.J. (1991). Mapping disease and mortality rates using empirical Bayes estimators. J. Roy. Statist. Soc. C, 40.
McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models. London: Chapman and Hall.
Pedersen, H.E. (1). Konfidensintervaller for regionale levealderestimater (in Norwegian). Master's thesis, University of Oslo.
Rao, J.N.K. (2003). Small Area Estimation. New York: Wiley.
Spjøtvoll, E. and Thomsen, I. (1987). Application of some empirical Bayes methods to small area statistics. Bulletin of the International Statistical Institute.
Zhang, L.-C. (2003). Simultaneous estimation of the mean of a binary variable from a large number of small areas. Journal of Official Statistics, 19.

[Figure 1 panels, three rows. Left: "Cohort Mortality Rates" (mortality rate against age; male and female). Right: "Scatter plot: Mortality Rate" (female against male).]

Figure 1: Left column: Reference cohort mortality rate by age (-, 15-55, 5-9) and sex (male, solid; female, dashed). Right column: Scatter plot of reference cohort mortality rate: male vs. female by age group; simple regression line (solid line).

[Figure 2 panels, two rows. Left: "Scatter plot: Mortality" (observed against predicted). Middle: "Residual Q-Q plot" (standardized residual against theoretical normal). Right: "Q-Q plot (with empirical variance)" (predicted RR against theoretical gamma).]

Figure 2: Diagnostic plots of basic smoothing. Top row: sex-age cohort with (j, a) = (1, 6). Bottom row: clusters of age groups with (j, i) = (1, 1). Left column: Observed vs. predicted mortality. Middle column: Q-Q normal plot of Anscombe residual. Right column: Q-Q gamma plot of predicted SMR.

[Figure 3 panels, two rows by three columns, titled "Cluster=Age; Sex=...; BW=..." (one column per bandwidth). Each panel: within-cluster variance against age.]

Figure 3: Variance of SMR under smoothing by age and sex. Top row: male, bottom row: female. Variance homogeneity assumption across neighboring cohorts (solid), moving average of neighboring age groups with number of age groups given by bandwidth (dotted), basic smoothing within sex-age cohort (dashed).

[Figure 4 panels: "Ratio Plugin vs. jackknife MSE; Male" and "Ratio Plugin vs. jackknife MSE; Female". Each panel: MSE ratio against age.]

Figure 4: Ratio between estimated MSEs of the domain mortality rate estimator: Plug-in MSE estimate vs. jackknife MSE estimate. Domain by sex, age and Municipality. Minimum (solid), median (dashed) and maximum (dotted) MSE ratio across the Municipalities, given age and sex.

[Figure 5 panels, three rows (one per bandwidth) by two columns, titled "Male: Dir Est vs. Age Smoothing; BW=...", with "Comp Est" on the left and "Shrink Adj" on the right. Each panel: life expectancy against log(Municipality size).]

Figure 5: Estimates of Municipality life expectancy for male against log Municipality population size. Direct estimates (circle) vs. alternative estimates (square). Basic: Basic smoothing. Neighbor: Neighboring variance homogeneity assumption and moving average variance estimator of bandwidth 5. Global: Global variance homogeneity assumption. Adjusted: Over-shrinkage adjustment for η =

[Figure 6 panels, three rows (one per bandwidth) by two columns, titled "Female: Dir Est vs. Age Smoothing; BW=...", with "Comp Est" on the left and "Shrink Adj" on the right. Each panel: life expectancy against log(Municipality size).]

Figure 6: Estimates of Municipality life expectancy for female against log Municipality population size. Direct estimates (circle) vs. alternative estimates (square). Basic: Basic smoothing. Neighbor: Neighboring variance homogeneity assumption and moving average variance estimator of bandwidth 5. Global: Global variance homogeneity assumption. Adjusted: Over-shrinkage adjustment for η =

[Figure 7 panels, two rows (male, female) by two columns, titled "Age smoothing; Comp Est" and "Age smoothing; Shrink Adj". Each panel: RMSE(Life Exp) against log(Municipality size).]

Figure 7: Root MSE estimates: Direct estimator (circle) of Municipality life expectancy and global smoothing (square).

Case study report [GUS, Poland]

Background

First attempts at applying various approaches to parameter estimation for small areas in Poland were undertaken after the International Conference on Small Area Statistics held in Warsaw in 1992 (eds. Kalton, Kordos, Platek, 1993). There were only a few attempts to apply small area estimation (SAE) methods to measure the scope of unemployment, poverty and household structure, and in agriculture-related surveys (Kordos, Paradysz, ). Further applications and an examination of the properties of standard indirect estimators were undertaken within the EURAREA project 1 after the IASS Satellite Conference held in Riga in 1999 (Pfeffermann D., 1999). The standard estimators were defined in the project as: the techniques of domain estimation (synthetic estimators, GREGs and composite estimators) which entered into use in the United States and Canada in the 1980s, and have been the subject of steady theoretical refinement since (EURAREA Documents, IST-2000-26290, Annex 1 - Description of Work, p.4).

The first important study to apply SAE methodology in the LFS was conducted in 2003. The authors (Bracha, Lednicki and Wieczorkowski) estimated totals and rates of several labour market characteristics by region, subregion and poviat (NUTS2, NUTS3 and NUTS4). They used direct, synthetic and composite estimators. The direct estimates were obtained from the LFS in 1995, and appropriate data from the Census of Population and Housing were used as auxiliary information. The composite estimator is a combination of the direct and synthetic estimators with the weight assumed to equal .5. The quality of the estimators was evaluated using the bootstrap approach, with bootstrap variances and bootstrap coefficients of variation. The use of SAE estimation was accompanied by a publication (Bracha, Lednicki and Wieczorkowski, 2003); however, it was intended to be treated as an experimental study. No information about the software used was given.

The second important study was conducted in 2004 by E. Gołata.
The study was intended to rely on EURAREA project experiences. The Polish database, the so-called superpopulation labelled POLDATA, was created on the basis of 3 data sources: the 1995 Microcensus, the 1995 Household Budget Survey and the Local Data Bank. POLDATA represented as closely as possible the characteristics of Poland in 1995 with respect to the new administrative division of the country which was introduced in January. For the purposes of applying the standard estimators, the proportion of ILO unemployed (in the whole population over 15) in an area was estimated. Unemployment was estimated as a binary variable, not a multivariate or Poisson variable. In choosing covariates, a set of variable categories was harmonized for the common model for all the countries taking part in the EURAREA project. The intention was to include all variable categories which experience has shown to be effective. In the case of ILO unemployment the standard variables were: age, sex, education, employment status and housing. The simulation study was conducted on samples drawn from POLDATA according to two-stage sampling with unequal probabilities. The standard estimators were applied. The domain was defined as the poviat (NUTS4).

1 The EURAREA project no. IST-2000-26290, entitled Enhancing Small Area Estimation Techniques to meet European needs, is part of the 5th framework programme for research, technological development and demonstration of the EU. Its main co-ordinator is ONS (Office for National Statistics, UK).

High values of relative

estimation errors, which for the NUTS4 level ranged from 5-4%, seemed unsatisfactory. Results obtained for NUTS3 were much better, as the average value did not exceed 5%. It seemed that there was a possibility of improving the estimates by creating a model that would incorporate more correlated variables, i.e. registered unemployment instead of the proportion of unemployed claiming the unemployment benefit. The use of SAE estimation was accompanied by a publication (Gołata, 2004), which was intended to be treated as an experimental study and became the main part of E. Gołata's habilitation dissertation. The software used within the study was the EURAREA SAS programs.

One should also mention the study conducted in 2004 by J. Kubacki. The parameter of interest in the study was unemployment size at the NUTS 2 and NUTS 4 levels. Registered unemployment constituted the additional data source used in the study (with covariates: the number of unemployed persons, the number of employed persons, the number of economically inactive persons, the number of dwellings and the number of persons aged 15 and above). Both design-based and model-based estimators were applied. The design-based estimators included post-stratification methods (both ratio and regression estimators) and the synthetic estimator (both ratio and regression estimators); the model-based estimators included the empirical Bayes (EB) estimator and the hierarchical Bayes (HB) estimator. The use of design-based estimation was mainly motivated by the availability of data about registered unemployment coming directly from the PLFS (special question) and also by the availability of such information at the NUTS 2 and NUTS 4 levels. The model-based estimators used Census data and also data from statistical information sources (e.g. demographic data and employment data). The reason for choosing such methods was partially the need to compare such estimates with methods applied in the official survey, and the possibility of extending these methods to the model approach (with different a priori estimates).
The design-based estimator was mainly assessed using the random group technique. The bootstrap technique was also used (for data from ). Model-based estimates were assessed using the naive empirical Bayes technique and by means of the so-called ergodic variance (for results produced by MCMC simulation carried out with the WinBUGS software). The use of SAE estimation was accompanied by three main publications (Kubacki, 2004, 2005, 2008) and was not an official statistical output but an experimental study.

Recent years have seen a growing interest in new possibilities and tools developed to meet the growing demand for estimates at the local level. Special projects were carried out in the Central Statistical Office (CSO) in cooperation with university researchers. The projects covered different subjects, for example: the labour market, household structure, and business statistics including small business. From the perspective of the Population Census 2011, which was in progress at the time, special research was undertaken within the Central Statistical Office and the newly established Centre for Small Area Statistics in Poznan. It was aimed at examining administrative registers, their quality and their usefulness as sources of auxiliary variables in small domain estimation. But the application of SAE methods in official statistics in Poland is still not part of normal practice.

General setup

In this case study we intend to continue explorations of estimating labour market characteristics for small domains. First, the data infrastructure referring to economic activity and unemployment is presented and discussed. The intention was to include all variable categories which experience has shown to be effective. In the case of ILO unemployment the standard variables are: age, sex, education, employment status and housing. The study will use data from the following sources:

- Register of Unemployed
- Vital Statistics register
- Tax Register: data will be used in an indirect form via the Commuting Survey, which was based on data from the Tax Register; the first edition of the survey is available for 2006.

Taking into account special features of the labour market in Poland, especially its high territorial differentiation, various estimation techniques will be analysed. Theoretical approaches to estimation with spatial effects proposed by R. Chambers & A. Saei (2003), together with the SAS software provided by the EURAREA project, are considered and adjusted to fit specific arrangements. The standard EURAREA estimators used in the study are: Direct (for comparative reasons), Generalised REGression estimator (GREG), Synthetic and EBLUP. EBLUPGREG, which takes into account the spatial correlation structure, was also applied within the case study.

Labour market: the structure of the economically active population and, especially, unemployment and its structure are of exceptional social interest in Poland. Unemployment has, since the very beginning of the 90s, assumed alarming dimensions and is characterised by great territorial differentiation at the national as well as the regional level. This characteristic is due to structural differences in the economy and to regional inequalities shaped by distinct historical experiences as well as by the transformation process. Regularities observed at the national level, in most cases, cannot be generalised and differ from region to region.
For example, in June 2011 the highest registered unemployment rate in Poland was observed in the Warmia-Masuria Province, 19,5%, and the lowest in the Wielkopolska Province, 9,1% [province (voivodship) refers to the NUTS 2 level according to the Eurostat territorial division]. This situation requires advanced studies reflecting regional specificities.

Data available from the Labour Force Survey enable estimates of the employed and unemployed for the whole country by age, sex and place of residence (urban and rural areas). But at the regional level (NUTS2) only aggregate data can be obtained from the LFS, differentiated by sex and place of residence (urban and rural areas), but not by age.

ILO unemployment: unemployment defined according to the International Labour Organization, which is comparable among European and non-European countries.

The goal is to estimate the percentage of unemployed people in the population aged 15 and older 3 at the NUTS3 level. Although there had been previous attempts to estimate unemployment at the NUTS4 level, they were not very satisfactory.

Description of the survey and of all data sources used

A) Labour Force Survey

The LFS methodology is based on the definitions of the economically active population, the employed and the unemployed adopted by the Thirteenth International Conference of Labour Statisticians (October 1982) and recommended by the International Labour Office. The survey concentrates on the economic activity of the population, i.e. the fact of being employed, unemployed or economically inactive in the reference week.

The Labour Force Survey is a probability sample survey. Since the 4th quarter of 1999 the LFS has been carried out as a continuous survey. This means that in each of the 13 weeks in a quarter interviewers visit a determined number of randomly sampled dwellings and collect data concerning economic activity during the preceding week. The survey covers all people aged 15 years and more living in the sampled dwellings. The sample of dwellings to be visited is changed every week. Weekly samples result from a random distribution of a quarterly sample into 13 parts. It was constructed in such a way that each of the 13 weekly samples not only has the same size but also has the same structure. The survey results are compiled and published quarterly. Simplifying somewhat, it can be said that the quarterly results are calculated as the mean values of the results from the 13 weeks of a given quarter.

Sampling for the LFS follows two-stage household sampling. The primary sampling units, subject to the first-stage selection, are census units called census clusters (CCs) in towns, while in rural areas they are enumeration districts (EDs) 4. Second-stage sampling units are dwellings 5.
The primary sampling units (PSUs) are sampled with stratification. Strata correspond to provinces (voivodships). Within provinces, 3 to 5 additional strata were defined; only in the Silesian Voivodship could 7 strata be created. Strata within provinces were created depending on the size of a place; rural areas were included in the smallest ones. Strata within provinces were created according to the situation in a particular voivodship, without any fixed dividing criterion.

3 It is different from the unemployment rate, but such an assumption will significantly simplify the computations, especially as far as the MSE of the estimators is concerned.

4 In rural areas the application of smaller first-stage sampling units is useful for organizational reasons, but negatively influences the precision of results for these areas. In order to improve this, the principle of the so-called overrepresentation of rural areas was applied, i.e. the number of dwellings sampled from rural areas is about 1% higher than the number resulting from the so-called proportional allocation (related to the number of dwellings in the whole population).

5 Sampling of primary sampling units and dwellings is conducted on the basis of the Domestic Territorial Division Register, including among others a list of territorial statistical units and a list of dwelling addresses within CCs and EDs.

The estimation process consists in defining the appropriate generalizing factors, referred to as weights. This is achieved in three steps.

The first step provides primary weights, which basically are the reciprocals of the selection probabilities for the ultimate sampling units (i.e. dwellings), which compensates for the disproportionate construction of the sample:

$$w_j = \frac{1}{\pi_j} \qquad (1)$$

Then, the so-called interview rates $R_k$ are calculated by formula (2), where $\hat N_k$ is the number of interviewed dwellings estimated using primary weights and $B_k$ is the estimate of the number of dwellings that were qualified for the survey but were not interviewed, regardless of the reasons. The interview rates are calculated in six categories of place of residence ($k = 1, 2, ..., 6$).

$$R_k = \frac{\hat N_k}{\hat N_k + B_k} \qquad (2)$$

where

$$\hat N_k = \sum_{j \in s_k} \frac{1}{\pi_j} \qquad (3)$$

$\pi_j$ - selection probability of the j-th dwelling in the part of the sample $s_k$ belonging to the k-th category of place of residence
$\hat N_k$ - estimator of the number of dwellings in the part of the sample belonging to the k-th category of place of residence
$B_k$ - estimate of the number of dwellings to be examined in the survey that were not interviewed

The secondary weights (4) are calculated in the next step by dividing the primary weights by $R_k$, where the rate $R_k$ depends on the category of place of residence of a given dwelling (the rural area or one of the five town classes mentioned above). The secondary weights are also final for the results concerning households and families.

$$w_{jk} = \frac{1}{R_k}\, w_j \qquad (4)$$

where:
$w_{jk}$ - secondary weights
$w_j$ - primary weights

$R_k$ - interview rates, calculated in six categories of place of residence ($k = 1, 2, ..., 6$)

Final weights for the results concerning the population are calculated in the third step. The purpose of this step is to adjust the LFS results to the current demographic estimates. This is done by calculating so-called modifiers for each of the 48 categories defined by place of residence (urban/rural), sex and 12 age groups. The modifiers are calculated by dividing the number of people in each group according to the demographic estimates by the number of people in these categories calculated from the LFS results by applying the secondary weights from the second step. Final weights (7) result from multiplying the secondary weights by the adequate modifiers.

$$\hat g_l = \sum_{k} \frac{F}{R_k} \sum_{j \in s_k} x_{jl} \qquad (5)$$

where:
$\hat g_l$ - the estimated number of people in the l-th category based on the LFS sample
$F$ - the reciprocal of the frequency of selected dwellings (F= for cities and F=818 for rural areas)
$R_k$ - interview rates, calculated in six categories of place of residence ($k = 1, 2, ..., 6$)
$x_{jl}$ - number of people belonging to the l-th category in the j-th dwelling in the LFS sample

Categories ($l = 1, ..., 48$): rural/urban area; sex; twelve age groups: 15-17 years, 18-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65 and more.

$$M_l = \frac{G_l}{\hat g_l} \qquad (6)$$

where $G_l$ is the number of people in the l-th category according to the demographic estimates.

$$w_{jkl} = w_{jk}\, M_l \qquad (7)$$

The variances of the complex estimators obtained in the LFS cannot be estimated with ordinary methods, and special, approximate procedures must be employed. Since 2003 one of the most popular approximate methods has been chosen for this purpose, based on resampling and the bootstrap rule. Resampling-based methods make it possible to treat different parameter estimators uniformly and eliminate the necessity of deriving complicated analytical formulas. The application of the bootstrap rule to the complex sample designs most frequently used in sample surveys requires the introduction of adequate modifications.
In the case of the complex two-stage sampling design used in the LFS, variance estimation is performed with the use of data from quarterly samples, drawn with stratification at the level of primary sampling units. The chosen variant of the bootstrap method is applied separately in each stratum to obtain the corresponding variance component estimate.
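The three weighting steps of formulas (1) to (7) can be sketched as follows. This is only an illustration with invented names and data layout (`dwellings`, `not_interviewed`, `demographic_totals` are hypothetical), not the production procedure; following the verbal description of the third step, the category counts are obtained here by applying the secondary weights.

```python
from collections import defaultdict


def lfs_weights(dwellings, not_interviewed, demographic_totals):
    """Primary -> secondary -> final weights for a toy LFS-like sample.

    dwellings: list of dicts with keys
        'pi'      selection probability of the dwelling,
        'k'       category of place of residence,
        'persons' dict mapping person category l to a head count x_jl.
    not_interviewed: dict k -> B_k (eligible but not interviewed dwellings).
    demographic_totals: dict l -> G_l (external demographic estimates).
    All three structures are hypothetical and chosen for illustration.
    """
    # Formula (3): estimated number of interviewed dwellings per category k.
    n_hat = defaultdict(float)
    for d in dwellings:
        n_hat[d["k"]] += 1.0 / d["pi"]

    # Formula (2): interview rate per category k.
    rate = {k: n_hat[k] / (n_hat[k] + not_interviewed[k]) for k in n_hat}

    # Formulas (1) and (4): primary weight divided by the interview rate.
    secondary = [(1.0 / d["pi"]) / rate[d["k"]] for d in dwellings]

    # Survey-based head count per person category, using secondary weights.
    g_hat = defaultdict(float)
    for d, w in zip(dwellings, secondary):
        for category, count in d["persons"].items():
            g_hat[category] += w * count

    # Formula (6): modifiers aligning survey counts with demographic totals.
    modifier = {c: demographic_totals[c] / g_hat[c] for c in g_hat}

    # Formula (7): final person weight per dwelling and person category.
    final = [
        {c: w * modifier[c] for c in d["persons"]}
        for d, w in zip(dwellings, secondary)
    ]
    return secondary, modifier, final
```

By construction, summing the final weights times the head counts over all dwellings reproduces the demographic total of each category, which is the purpose of the third step.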

The detailed description of the bootstrap procedure applied in the variance estimation for LFS estimates in Poland is presented in Bracha et al. (2003).

The data used in the case study refer only to the 1st quarter of the 2008 LFS wave, and the spatial distribution of the population and of the samples for this quarter is presented below.

Fig. 1.

Fig. 2.

B) Tax Register - general characteristics of an employment-related population flow study in 2006

The use of administrative registers in Polish public statistics is at an initial stage. The only larger study in which they played a supporting role was the 2011 National Census, which relies on information from various sources in order to collect certain data (reducing the burden on respondents), to update the sampling frame for sample surveys and to update the database of buildings and dwellings. Wider access to administrative databases provides an opportunity for Polish public statistics to develop methods for using them in statistical reporting and, consequently, to upgrade the statistical infrastructure.

A study of employment-related population flow was conducted on the basis of data from the tax system collected in the POLTAX database in the Statistical Office in Poznan in 2009. The study was intended to provide estimates of the volume and directions of commuter traffic involving people in paid employment, using data from 31 December 2006. The results and the methodological details of the study have been made partially available in the Regional Data Bank and in the book entitled Commuting in Poland, edited by K. Kruszka, Poznan, in October 2010.

Registers, after verification and cleaning, turned out to be a good source of information about the structure of economic activity from the territorial perspective. Additionally, the set included the characteristics of sex and age (variables derived from the variable "birth date"), which enabled another breakdown. One disadvantage is their incompleteness. In addition, they do not cover all the characteristics reported in other studies (such as education, class of place of residence, etc., which are included in the LFS); this means that these datasets cannot fully replace the previously used measurement tools. Nevertheless, registers can play a supporting role and provide a good source of auxiliary variables in indirect estimation of labour market characteristics.
Approach to the problem The problem in case of Polish team in the project was to estimate the percentage of unemploye people at the lower level of aggregation than presente in the CSO s publications. Having in min that ata available from the Labour Force Survey enable irect estimates of employe an unemploye for the whole country by age, sex an place of resience an for aggregate ata at the regional level (NUTS), it was ecie to try to get small area estimates at the NUTS3 level. Another problem was to evaluate the results. The applie software of course enables us to compute MSEs for the stuie estimators, but there was a serious ifficulty to valiate the estimates against population values. It was ecie that espite of small ifferences in the efinition of LFS an registere unemployment the latter will be use as a kin of benchmark. Our natural choice was to use the EURAREA coe

To use the EURAREA code, the user has to be equipped with:
- a sample dataset
- a population dataset, or a dataset with means/totals from the population at the domain level

The sample dataset should contain:
- area/domain id (numerical variable)
- target variable (in our case a dummy variable, whose value is 1 if a person is unemployed, 0 otherwise)
- covariates
- weight from the sampling design

The population dataset, or the dataset with means/totals, should contain:
- area/domain id (numerical variable)
- means/totals from the population at the domain level

The structure of the sample and of the population is presented below (as an example only):
Fig. 3. Dataset with population totals.
Fig. 4. Sample dataset.

As potential covariates we chose:

- commuting to work,
- place of residence,
- sex,
- 6 age groups

All covariates were recoded into binary variables, producing nine variables:
X1 commuting to work (1 if a person commutes to work, 0 otherwise)
X2 place of residence (1 if a person lives in a rural area, 0 otherwise)
X3 sex (1 if a person is a male, 0 otherwise)
X4 group of age (1 if a person is up to years of age, 0 otherwise)
X5 group of age (1 if a person is -4 years of age, 0 otherwise)
X6 group of age (1 if a person is 5-34 years of age, 0 otherwise)
X7 group of age (1 if a person is years of age, 0 otherwise)
X8 group of age (1 if a person is years of age, 0 otherwise)
X9 group of age (1 if a person is over 55 years of age, 0 otherwise)

Then we used stepwise selection to get the following model:

Parameter Estimates (columns: Variable, Label, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|)
Variables retained: Intercept, X1, X2, X3, X4, X5, X7, X8

Its goodness of fit is as follows:
Root MSE .136
R-Square .7191
Dependent Mean .55
Adj R-Sq

Coeff Var .3754

Although two variables are not significant, we decided to include them in the model because, in our opinion, they are of special importance (place of residence and sex).

Methods

The seven standard estimators were applied:
- synthetic population level estimator NSMEAN
- direct estimator
- GREG with a standard linear regression model
- synthetic estimator considered under two different models:
  a) a linear two-level model with individual data,
  b) a linear model with area-level covariates and a pooled sample estimate of within-area variance,
- EBLUP estimator using models:
  a) a linear two-level model with individual data,
  b) a linear model with area-level covariates and a pooled sample estimate of within-area variance,
  c) EBLUPGREG: a linear two-level model with individual data, taking into account the spatial correlation structure.

Direct estimator

$$\hat{\bar Y}_d^{DIRECT} = \frac{1}{\hat N_d}\sum_{i\in s_d} w_i y_i \qquad (8)$$

where $s_d$ is the set of sampled units in domain $d$, $w_i = 1/\pi_i$ and $\hat N_d = \sum_{i\in s_d} w_i$.

MSE estimator:

$$\widehat{MSE}(\hat{\bar Y}_d^{DIRECT}) = \frac{1}{\hat N_d^2}\sum_{i\in s_d} w_i (w_i - 1)\,(y_i - \hat{\bar Y}_d^{DIRECT})^2 \qquad (9)$$

(assuming that the joint inclusion probabilities satisfy $\pi_{ij} = \pi_i\pi_j$ for all pairs of distinct units, whether in the same domain or in different domains).

GREG estimator

$$y_i = x_i^T\beta + \varepsilon_i \qquad (10)$$

with $E(\varepsilon_i) = 0$ and $Var(\varepsilon_i) = \sigma^2$.

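$$\hat{\bar Y}_d^{GREG} = \frac{1}{\hat N_d}\sum_{i\in s_d} w_i y_i + \Big(\bar X_d - \frac{1}{\hat N_d}\sum_{i\in s_d} w_i x_i\Big)^T\hat\beta \qquad (11)$$

where $s_d$ is the set of sampled units in domain $d$, $s$ is the whole sample, $\bar X_d$ is the vector of population means of the covariates in domain $d$, $\hat N_d = \sum_{i\in s_d} w_i$, and $\hat\beta$ is estimated by least squares weighted by the weights resulting from the sampling process:

$$\hat\beta = \Big(\sum_{i\in s} w_i x_i x_i^T\Big)^{-1}\sum_{i\in s} w_i x_i y_i \qquad (12)$$

assuming $E(\varepsilon_i) = 0$ and $Var(\varepsilon_i) = \sigma^2$.

With residuals $r_i = y_i - x_i^T\hat\beta$ and g-weights

$$g_{id} = \frac{1}{\hat N_d}\,I(i\in s_d) + \Big(\bar X_d - \frac{1}{\hat N_d}\sum_{j\in s_d} w_j x_j\Big)^T\Big(\sum_{j\in s} w_j x_j x_j^T\Big)^{-1} x_i$$

the estimator can be written as $\hat{\bar Y}_d^{GREG} = \sum_{i\in s} w_i g_{id} y_i$, and its MSE is estimated by

$$\widehat{MSE}(\hat{\bar Y}_d^{GREG}) = \sum_{i\in s}\sum_{j\in s}\frac{\pi_{ij} - \pi_i\pi_j}{\pi_{ij}\,\pi_i\pi_j}\; g_{id} r_i\, g_{jd} r_j \qquad (13)$$

SYNTHETIC ESTIMATORS

MODEL A

The model is the standard two-level linear model:

$$y_{id} = x_{id}^T\beta + u_d + e_{id} \qquad (14)$$

$$u_d \sim iid\ N(0, \sigma_u^2), \qquad e_{id} \sim iid\ N(0, \sigma_e^2)$$

$$\hat{\bar Y}_d^{SYNTH} = \bar X_d^T\hat\beta \qquad (15)$$

where $\bar X_d = (\bar X_{d,1}, \dots, \bar X_{d,p})^T$. This estimator does not respect the sampling weights.

$$\widehat{MSE}(\hat{\bar Y}_d^{SYNTH}) = \hat\sigma_u^2 + \bar X_d^T\hat V\bar X_d \qquad (16)$$

where $\hat V$ is the estimated covariance matrix of the regression coefficients $\hat\beta$.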
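As an illustration of the direct estimator (8) and its MSE estimator (9), on which the GREG and EBLUP estimators build, the following Python sketch computes both for a single domain. It is not the EURAREA SAS code; the function name and the toy data are hypothetical, and the MSE formula relies on the same independence assumption on the inclusion probabilities as (9).

```python
def direct_estimate(y, w):
    """Direct estimate of a domain mean and its estimated MSE.

    y: values of the target variable for the sampled units of one domain
    w: design weights of the same units (1 / inclusion probability)
    """
    n_hat = sum(w)  # estimated domain size, the sum of the weights
    mean = sum(wi * yi for wi, yi in zip(w, y)) / n_hat
    mse = sum(wi * (wi - 1.0) * (yi - mean) ** 2
              for wi, yi in zip(w, y)) / n_hat ** 2
    return mean, mse


# Toy domain: 1 = unemployed, 0 = otherwise (hypothetical data).
y = [1, 0, 0, 1]
w = [10.0, 10.0, 20.0, 20.0]
mean, mse = direct_estimate(y, w)
print(mean, mse)
```

With a 0/1 target variable, as in the case study, the estimate is the weighted share of unemployed persons in the domain.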

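MODEL B

The model for domain $d$ is as follows:

$$\bar y_d = \bar x_d^T\beta + \xi_d \qquad (17)$$

where $\xi_d \sim iid\ N(0, \sigma_u^2 + \sigma_e^2/n_d)$ and $n_d$ denotes the sample size for area $d$.

The variance $\sigma_e^2$ is estimated according to the formula:

$$\hat\sigma_e^2 = \frac{1}{n - n_A}\sum_d\sum_{i\in s_d}(y_{id} - \bar y_d)^2 \qquad (18)$$

where:
$n$ - sample size;
$n_A$ - number of domains in the sample.

This is a one-level regression model with $\beta$ and $\sigma_u^2$ estimated iteratively from:

$$\hat\beta = (x^T D^{-1} x)^{-1} x^T D^{-1} y \qquad (19)$$

where
$y$ - vector of the sample elements $\bar y_d$,
$x$ - matrix with rows consisting of $\bar x_d^T$,
$D$ - diagonal matrix with iteratively updated values $(\hat\sigma_u^2 + \hat\sigma_e^2/n_d)$ on the diagonal.

$$\hat{\bar Y}_d^{SYNTH} = \bar X_d^T\hat\beta \qquad (20)$$

$$\widehat{MSE}(\hat{\bar Y}_d^{SYNTH}) = \hat\sigma_u^2 + \bar X_d^T\hat V\bar X_d \qquad (21)$$

where $\hat V$ is the estimate of the covariance matrix $(x^T D^{-1} x)^{-1}$.

EBLUP ESTIMATORS

MODEL A

$$\hat{\bar Y}_d^{EBLUP} = w_d^{EBLUP}\,\hat{\bar Y}_d^{GREG} + (1 - w_d^{EBLUP})\,\hat{\bar Y}_d^{SYNTH} \qquad (22)$$

$$w_d^{EBLUP} = \frac{\hat\sigma_u^2}{\hat\sigma_u^2 + \hat\sigma_e^2/n_d} \qquad (23)$$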
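The composite form of the EBLUP, a weighted average of a direct-type estimate and a synthetic estimate with weight $\hat\sigma_u^2/(\hat\sigma_u^2 + \hat\sigma_e^2/n_d)$, can be illustrated with a short Python sketch. The function names and the variance values below are hypothetical; in the case study the variance components come from the fitted model.

```python
def shrinkage_weight(sigma2_u, sigma2_e, n_d):
    """Weight given to the direct-type component for a domain of sample size n_d."""
    if n_d == 0:
        return 0.0  # no sample in the domain: rely on the synthetic part only
    return sigma2_u / (sigma2_u + sigma2_e / n_d)


def composite_estimate(direct_like, synthetic, weight):
    """Weighted average of a direct-type estimate and a synthetic estimate."""
    return weight * direct_like + (1.0 - weight) * synthetic


# Hypothetical variance components: between-area 0.0004, within-area 0.04.
for n_d in (0, 25, 100, 400):
    w = shrinkage_weight(0.0004, 0.04, n_d)
    print(n_d, round(w, 3), round(composite_estimate(0.08, 0.05, w), 4))
```

The weight grows with the domain sample size, so well-sampled domains stay close to their direct-type estimate, while poorly sampled ones are pulled towards the synthetic estimate.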

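In more detail, the models may be presented as follows:

EBLUP using MODEL A

$$\hat{\bar Y}_d^{EBLUP} = \hat\gamma_d\,(\bar y_d - \bar x_d^T\hat\beta) + \bar X_d^T\hat\beta \qquad (24)$$

where:

$$\hat\gamma_d = \frac{\hat\sigma_u^2}{\hat\sigma_u^2 + \hat\sigma_e^2/n_d} \qquad (25)$$

$\bar x_d$ and $\bar y_d$ are the corresponding sample means of the covariates and of $y$ for domain $d$;
$\hat\beta$, $\hat\sigma_e^2$ and $\hat\sigma_u^2$ are parameters estimated using the standard two-level linear model.

MSE Estimators

ESTIMATOR 1:

$$\widehat{MSE}(\hat{\bar Y}_d^{EBLUP}) = \hat\gamma_d\frac{\hat\sigma_e^2}{n_d} + (1 - \hat\gamma_d)^2\,\bar X_d^T\hat V\bar X_d \qquad (26)$$

ESTIMATOR 2:

$$\widehat{MSE}(\hat{\bar Y}_d^{EBLUP}) = \hat\gamma_d\frac{\hat\sigma_e^2}{n_d} + (1 - \hat\gamma_d)^2\,\bar X_d^T\hat V\bar X_d + \frac{2}{n_d^2}\Big(\hat\sigma_u^2 + \frac{\hat\sigma_e^2}{n_d}\Big)^{-3}\Big[\hat\sigma_e^4\,\widehat{Var}(\hat\sigma_u^2) + \hat\sigma_u^4\,\widehat{Var}(\hat\sigma_e^2) - 2\hat\sigma_e^2\hat\sigma_u^2\,\widehat{Cov}(\hat\sigma_u^2, \hat\sigma_e^2)\Big] \qquad (27)$$

where the estimated variances and the covariance of the variance component estimators, $\widehat{Var}(\hat\sigma_e^2)$, $\widehat{Var}(\hat\sigma_u^2)$ and $\widehat{Cov}(\hat\sigma_u^2, \hat\sigma_e^2)$, are computed from:
$m_1$ - number of selected units,
$m_2$ - number of domains,
$p$ - number of covariates,
and the ratio of the variance components $\hat\sigma_e^2$ and $\hat\sigma_u^2$ (the variance ratio).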

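Here $\hat V$ is the estimated covariance matrix of the regression coefficients $\hat\beta$, $\widehat{Var}(\hat\sigma_e^2)$ is the estimated variance of $\hat\sigma_e^2$ and $\widehat{Var}(\hat\sigma_u^2)$ is the estimated variance of $\hat\sigma_u^2$. Estimator 1 is the corresponding approximation when the number of domains is large. Estimator 2 may be applied in any case.

MODEL B

$$\hat{\bar Y}_d^{EBLUP} = w_d^{EBLUP}\,\hat{\bar Y}_d^{DIRECT} + (1 - w_d^{EBLUP})\,\hat{\bar Y}_d^{SYNTH} \qquad (29)$$

$$w_d^{EBLUP} = \frac{\hat\sigma_u^2}{\hat\sigma_u^2 + \hat\sigma_e^2/n_d} \qquad (30)$$

EBLUP using MODEL B

$$\hat{\bar Y}_d^{EBLUP} = \hat\gamma_d\,\hat{\bar Y}_d^{DIRECT} + (1 - \hat\gamma_d)\,\bar X_d^T\hat\beta \qquad (31)$$

$$\hat\gamma_d = \frac{\hat\sigma_u^2}{\hat\sigma_u^2 + \hat\sigma_e^2/n_d} \qquad (32)$$

where

$$\hat\beta = (x^T D^{-1} x)^{-1} x^T D^{-1} y \qquad (33)$$

and:
$y$ - vector of the sample elements $\bar y_d$,
$x$ - matrix with rows consisting of $\bar x_d^T$,
$D$ - diagonal matrix with iteratively updated values $(\hat\sigma_u^2 + \hat\sigma_e^2/n_d)$ on the diagonal.

MSE Estimators

ESTIMATOR 1:

$$\widehat{MSE}(\hat{\bar Y}_d^{EBLUP}) = \hat\gamma_d\frac{\hat\sigma_e^2}{n_d} + (1 - \hat\gamma_d)^2\,\bar X_d^T\hat V\bar X_d \qquad (26)$$

where $\hat\sigma_e^2$ is an estimator of the residual variance inside domains.

ESTIMATOR 2: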

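$$\widehat{MSE}(\hat{\bar Y}_d^{EBLUP}) = \hat\gamma_d\frac{\hat\sigma_e^2}{n_d} + (1 - \hat\gamma_d)^2\,\bar X_d^T\hat V\bar X_d + \Big(\frac{\hat\sigma_e^2}{n_d}\Big)^2\Big(\hat\sigma_u^2 + \frac{\hat\sigma_e^2}{n_d}\Big)^{-3}\widehat{Var}(\hat\sigma_u^2) \qquad (34)$$

ESTIMATOR 3:

$$\widehat{MSE}(\hat{\bar Y}_d^{EBLUP}) = \hat\gamma_d\frac{\hat\sigma_e^2}{n_d} + (1 - \hat\gamma_d)^2\,\bar X_d^T\hat V\bar X_d + 2\Big(\frac{\hat\sigma_e^2}{n_d}\Big)^2\Big(\hat\sigma_u^2 + \frac{\hat\sigma_e^2}{n_d}\Big)^{-3}\widehat{Var}(\hat\sigma_u^2) \qquad (35)$$

where $\hat V$ is the estimated covariance matrix of the regression coefficients $\hat\beta$ and $\widehat{Var}(\hat\sigma_u^2)$ is an estimate of the variance of $\hat\sigma_u^2$.

EBLUP using spatial correlation

The software prepared by ISTAT was based on a re-formulation of the expressions contained in Saei and Chambers (2003), in order to obtain more efficient SAS code. The model considered is the general linear mixed model

$$y = X\beta + Zu + e \qquad (36)$$

where:
$X$ and $Z$ are known matrices of order $N \times P$ and $N \times D$ respectively; $X$ is the matrix of the population values of the covariates and $Z$ is the incidence matrix for the spatial random area effect;
$e$ and $u$ are vectors of random variables with mean and variance-covariance matrices given respectively by the pairs

$$e \sim N[0, \sigma_e^2 I_N], \qquad u \sim N[0, \sigma_u^2 A]$$

$I_N$ being the identity matrix of order $N$ and $A$ a square matrix of order $D$ that allows a spatial correlation structure to be included in the model. The generic element of $A$ is given by

$$a(d, d') = \Big[1 + \delta(d, d')\exp\Big(\frac{dist(d, d')}{\alpha}\Big)\Big]^{-1} \qquad (37)$$

with $\delta(d, d') = 0$ for $d = d'$ and $\delta(d, d') = 1$ for $d \neq d'$, where $dist(d, d')$ is the distance between areas $d$ and $d'$ and $\alpha$ is a scale parameter.

The SAS program for the production of the small area estimator based on the spatial linear mixed model has been developed by ISTAT.

Software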

The software used in the case study was SAS. We started by using PROC SURVEYMEANS to compute the direct estimator. Then the EURAREA code was implemented. Not only were the seven standard estimators computed, but the EBLUPGREG software with its options was also tested, to find out whether or not to allow for spatial autocorrelation. The SAS software was also used for computing the coordinates of the centroids, which are of special importance in estimation conducted via EBLUPGREG taking into account the spatial dependence.

Partial code

    /* Macro variables */
    %let CENTROID=1;
    %let CENTROID_OPT=GEOMETRIC;
    /* Definition of output parameters */
    %let maplib=maps;
    %let mapcat=nuts3;
    %let mapname=NUTS3;
    %let cathow=create;
    %let spalib=maps;
    %let spaname=NUTS3;
    %let spahow=NUTS3;
    /* SCL (SAS Component Language) instruction */
    dm 'af c=sashelp.gisimp.batch.scl';

Fig. 3.

These coordinates will be input into the EBLUPGREG code, allowing for the spatial correlation structure.
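How centroid coordinates feed the spatial correlation structure can be sketched in Python. The sketch assumes Euclidean distances between centroids and the correlation function $a(d, d') = [1 + \delta(d, d')\exp(dist(d, d')/\alpha)]^{-1}$, with $\delta = 0$ on the diagonal and $1$ elsewhere. The function name, the coordinates and the value of $\alpha$ are hypothetical, and the sketch is not the ISTAT SAS program.

```python
import math


def spatial_matrix(centroids, alpha):
    """Build the D x D matrix A from area centroids (x, y) and scale alpha > 0."""
    size = len(centroids)
    a = [[0.0] * size for _ in range(size)]
    for r, (xr, yr) in enumerate(centroids):
        for c, (xc, yc) in enumerate(centroids):
            if r == c:
                a[r][c] = 1.0  # delta = 0 on the diagonal
            else:
                ratio = math.hypot(xr - xc, yr - yc) / alpha
                # Very distant areas: treat the correlation as zero and
                # avoid overflow in exp().
                a[r][c] = 0.0 if ratio > 700 else 1.0 / (1.0 + math.exp(ratio))
    return a


# Three hypothetical centroids (same units as alpha).
A = spatial_matrix([(0.0, 0.0), (3.0, 4.0), (30.0, 40.0)], alpha=5.0)
for row in A:
    print([round(v, 4) for v in row])
```

Under this form the off-diagonal elements are always below 0.5 and decrease with distance, so neighbouring areas share more of the random area effect than distant ones.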

Results

Fig. 4.
Fig. 5.

Figures 4 and 5 show that the spatial pattern of the direct estimates at the NUTS3 level fits the pattern of the values coming from the administrative registers quite well. The following figures (Fig. 6 to Fig. 12) give an overview of how the applied estimators reproduce registered unemployment. Figure 6 shows that the direct estimates of ILO (International Labour Organization) unemployment are in most cases lower for the NUTS3 areas than registered unemployment. Figure 7 presents quite a different situation: for the model-assisted estimator (GREG), the unemployment estimated from the LFS is higher than the registered one. However, if the ranges of unemployment are compared, one can see that the direct estimates have a slightly narrower range than GREG. Registered unemployment ranges from 1,41% to 1,77%, the direct estimates range from 1,1% to 7,89% and GREG from ,9% to 1,89%.

Fig. 6. DIRECT estimates vs registered unemployment.
Fig. 7. GREG estimates vs registered unemployment.

The Synth_A estimator has the smallest variance, but the bias in this case is unacceptable. Synth_B estimates are quite similar to the direct ones. However, the range of the estimates obtained with Synth_B is very short, from ,35% to 5,94%, so one could conclude that the smoothing of the estimates is too strong.

Fig. 8. Synth_A estimates vs registered unemployment.
Fig. 9. Synth_B estimates vs registered unemployment.

The comparison of the EBLUP estimators based on different assumptions revealed that they should be of special interest in applications of small area methodology to the Labour Force Survey. The results obtained suggest that their bias is relatively small, with relatively small variation. For instance, the range of estimates produced by EBLUPGREG (which takes into consideration the spatial structure of the data) is from 4,5% to 1,34%. One should notice that, even though some basic assumptions relating to the nature of the variable are violated (the number of unemployed people aged 15 and over is not a continuous variable and its distribution cannot be normal), the EBLUPs behave quite well. The estimates obtained with the EBLUP_A and EBLUP_B models show the same patterns as their direct components: EBLUP_A is a linear combination of GREG and SYNTH_A, while EBLUP_B is a linear combination of DIRECT and SYNTH_B.

We also compared the MSEs of the studied estimators. The software produced in the EURAREA project computes the MSE of the seven standard estimators and also of the spatial version of the EBLUPGREG estimator. However, the approach presented here involves two simplifications. First, for the DIRECT and GREG estimators the MSE consists mainly of variance, as these two estimators are design-unbiased. Second, the reader applying the EURAREA code should be aware that the quality assessment measures are computed assuming simple random sampling, whereas the sampling design applied in LFS surveys is in most cases stratified and two-stage.

Fig. 10. EBLUP_A estimates vs registered unemployment.
Fig. 11. EBLUP_B estimates vs registered unemployment.

Fig. 12. EBLUPGREG estimates vs registered unemployment.
Fig. 13. Quality assessment: square root of MSE for DIRECT, GREG, SYNTH_A, SYNTH_B, EBLUP_A, EBLUP_B and EBLUPGREG_SPATIAL; NUTS3 areas sorted according to sample size (increasing order).

The last figure (Fig. 13) presents the distribution of the MSEs, with the NUTS3 areas ordered by increasing sample size. The highest values of MSE are connected with the SYNTH_A estimator; in this case bias is the main contribution to its value. The design-unbiased estimators DIRECT and GREG show quite a high amount of variance, which is slightly smaller in the case of GREG. However, when the sample size increases, the variance of the DIRECT estimator decreases, whereas in the case of GREG the variance is rather constant. The best performance, as far as the behaviour of the estimation is concerned, is connected with the EBLUPs. They are quite similar and have the smallest MSE.

References

1. Bracha, C., Lednicki, B. and Wieczorkowski, R., 2003, Estimation of Data from the Polish Labour Force Surveys by poviats (counties) in 1995 (in Polish), Central Statistical Office of Poland, Warsaw, 97p.
2. Chandra H., Salvati N., Chambers R., 2009, Small Area Estimation for Spatially Correlated Populations - A Comparison of Direct and Indirect Model-Based Methods, Southampton Statistical Sciences Research Institute, Methodology Working Paper M07/09, University of Southampton.
3. D'Alò M., Falorsi S., Solari F., 2004, EURAREA Documentation on SAS/IML program on Linear Mixed Model with Spatial Correlated Area Effects in Small Area Estimation, EURAREA Deliverable.
4. EURAREA Project Reference Volume.
5. EURAREA EBLUPGREG Software Documentation, Statistics Finland - EURAREA Consortium, Deliverables D.3., D3.3., 2004.
6. Gołata E., 2004, Estymacja pośrednia bezrobocia na lokalnym rynku pracy, Wydawnictwo AE w Poznaniu (in Polish).
7. Kruszka K. (ed.), 2010, Commuting in Poland, Statistical Office in Poznań.
8. Kubacki, J., 2004, Application of the Hierarchical Bayes Estimation to the Polish Labour Force Survey, Statistics in Transition, Vol. 6, No 5.
9. Kubacki, J., 2006, Remarks on the Polish LFS and Population Census Data for Unemployment Estimation by County, Statistics in Transition, Vol. 7, No 4.
10. Kubacki, J., 2008, Application of Bayesian estimation methods for small domains in the Polish Labor Force Survey, Acta Universitatis Lodziensis, Folia Oeconomica 216, Łódź.
11. Saei A., Chambers R., 2003, Small Area Estimation: A Review of Methods Based on the Application of Mixed Models, EURAREA.
12. Saei A., Chambers R., 2004, Small Area Estimation Under Linear and Generalized Linear Mixed Models With Time and Area Effects, University of Southampton.

Final report on the case study (INE, Spain).

1.- Introduction

The demographic censuses to be carried out this year are, in fact, three different censuses: the census of population, the housing census and the census of buildings. Of the three, the Census of Population is clearly the one with the greatest impact and the longest tradition. It is noteworthy that, for the first time, a Community Regulation on the Census has been developed in order to ensure both the availability of information and the consistency of results at European level. The EU Regulation on the Census of Population and Housing provides for a variety of collection methods, grouped around the concepts of conventional censuses and censuses based only on records; between these extremes, combinations of samples and additional use of records are contemplated. In Spain the Municipal Population Register is a consolidated record and places us among the countries with better conditions to conduct a census based on a combination of records and sample surveys.

The 2011 Census of Population and Housing is presented as an operation based on the combination of the following elements:
- A pre-census file (PCF), built taking the Register as the basic element of its structure and adding information provided by the Home Office (to deal with identifiers of people, both Spanish and foreign), the Social Security, the Tax Agency and other sources (register of educational levels of the Ministry of Education, etc.).
- A data collection work that includes two major operations:
  o A comprehensive 2011 Building Census to allow georeferencing of all buildings.
  o A sample survey with a relatively large sample size, to find out the characteristics of individuals and households as well as to comply with the coverage regulations established by Eurostat.
The combination of the PCF with the information obtained from the survey will provide all the census information.

One of the census objectives is to determine, for each of the Spanish municipalities, the basic structure of the population (stock and its distribution by sex, age and country of birth/nationality). In an operation like this, where one of the pillars is a survey, there may not be enough information for small domains, yet the current project addresses the need to provide data for several subpopulations in all municipalities. This target will be achieved by having a large sample size in the operation and by applying small area estimation techniques where needed. So the goal of our case study is to evaluate the census estimator along with some small area estimators that may be the alternative when the municipality sample size is too small, by carrying out simulation experiments that allow us to assess the municipality estimates.

1.1. Background in small area estimation

INE formally started to work on SAE methodology in 2001, in collaboration with the University Miguel Hernández of Elche (UMH), by participating in the FP5 EU project EURAREA (Enhancing Small Area Estimation Techniques to meet European needs), following the research line on Complex Designs. Once this project was finished, the MODEAP project (Research into Models for Small Area Estimates) was created by INE and UMH with a clear goal: to adapt and apply some model-assisted and model-based estimators to the LFS in order to estimate totals of unemployed and employed people and the unemployment rates by gender and NUTS level 4 (comarcas). The small area estimators applied to the LFS were those included in the EURAREA software and new ones implemented in SAS (post-stratified, count-synthetic and Sample Size Dependent estimators).

To calculate all of them, weights $w_j$ for each sampling unit are used, but they are not sampling design constants. In fact they are sample dependent, $w_j = w_j(s)$, because of both the non-response adjustment and the calibration, and consequently they are random quantities. This is an important argument in favour of introducing resampling methods to evaluate the design-based MSEs of model-based estimators. So two resampling techniques, a two-stage bootstrap and a delete-one-cluster jackknife, were considered and implemented in SAS. After this first stage, in order to solve some estimation problems that arose from using linear models, the application of estimators based on area-level multinomial logit mixed models was considered.

In addition, because the population totals used in the LFS are INE's Demographic Population Projections, which are not available at NUTS level 4 (comarcas), some work was undertaken within the scope of the project to provide estimates of those missing figures, using data from the Municipal Population Register as input together with calibration techniques.

2.- Description of the 2011 Census Survey

The main objectives of this census are:
a) To estimate the total population of some groups, in such a way that the register figure corresponding to them is corrected. To this end, we will use some recounting factors obtained from the probabilities of belonging to the sample.
b) To estimate the characteristics of the population and of the dwellings at different geographical breakdown levels. For this purpose, we will obtain weights derived from the sampling design and calibrated to the municipal populations.

In accordance with the above, and considering that the population census is the only means available for obtaining information broken down at the level of the census section (the primary sampling unit used in household surveys), we will draw a sample in all census sections. The sampling units are dwellings; they are selected without replacement and with equal probabilities. All the individuals in a selected dwelling are investigated. To select the sample, the dwellings of the municipality are grouped into two sampling frames:
- Frame A, comprising the set of locatable dwellings of the PCF

- Frame B, made up of the set of properties that are registered during the comprehensive route carried out in the fieldwork of the 2011 Building Census.

Taking into account the objective of providing information at municipal level and the budget available, sampling fractions vary with the size of the municipality. Thus the smaller municipalities will be thoroughly investigated, while the larger municipalities will have the lower sampling fractions, as shown in the following tables:

DWELLINGS
Columns: Population brackets; Municipalities; Dwellings: average per municipality; Frame A: locatable main dwellings, average; Sampling fraction (%); Average sample per municipality; Total sample
uner ,334
5 to ,694
1 to 199 1, ,717
to 499 1, ,7
5 to 999 1, ,17
1, to 1, ,571
, to 4,999 1,11 1, ,67
5, to 9, , ,631
1, to 19, , ,94
, to 49, , ,76 31,481
5, to 99, , ,968 46,33
1, to 199, , ,9 198,33
, to 499, , ,966 5,11
5, to 999, , , ,355
1,, an over 1,, ,84 11,69
TOTAL 8, ,4,76
Table 1. Distribution of dwellings by municipalities.

POPULATION
Columns: Population brackets; Municipalities; Average population by municipality; Sampling fraction; Average sample per municipality; Total sample
uner ,7
5 to ,758
1 to ,148
to ,655
5 to ,653
1, to 1, , ,74
, to 4, , ,78
5, to 9, , ,867
1, to 19, , ,87 455,757
, to 49, ,599 9.,77 687,84
5, to 99, , , ,6
1, to 199, , ,54 446,86
, to 499, , , ,399
5, to 999, , ,38 61,313
1,, an over ,45, ,186 46,373
TOTAL 8, ,797,
Table 2. Distribution of population by municipalities.

The estimators of the characteristics of dwellings and persons in a given municipality are expansion estimators with a correction for non-response, to which calibration techniques are applied as appropriate. At the municipal level, the census survey includes the estimation of the following population parameters:
- total by nationality: Spain, EU (except Spain), Africa, America, Asia, Oceania, Stateless.
- total by educational level: 10 categories, from illiterate to doctorate (see Appendix A).
- total by marital status: Single, Married, Widowed, Separated, Divorced.

In our case study, we will perform Monte Carlo simulations to evaluate small area estimators of municipality population totals by educational level, by marital status and of foreigners by nationality. The next section shows how we obtained estimates of subpopulation totals at municipality level with weights calibrated to the municipal populations.

3.- Approach to the problem

In order to perform the simulation study, one of the first activities undertaken was the creation of a data file from the 2001 Population Census that contains the objective variables (see Appendix A). So far this database has been compiled for two regions which comprise 3 provinces, containing 44 variables and 1,19,46 and 1,58,53 entries respectively. For each record, 42 variables come from the 2001 Population and Dwellings Census and two are derived variables, taking up to 11 characters. The record unit is the person resident in a main family dwelling on the census reference date (1 November 2001). The dwelling to which the person belongs can also be identified using a common identification number for all members of the same dwelling. The idea is then to simulate the census survey as closely as possible and, for one of the two regions, we have implemented a systematic sampling design for each of the 45 municipalities included in the selected region. Finally, 500 samples have been drawn.

The main differences from the real world are that there are no recounting factors, that the data file is the only frame available and that there is no non-response. In this context, the calibrated estimator of subpopulation totals in the municipality $M$ is given by the expression

$$\hat Y_M = \sum_{g=1}^{G}\sum_{i\in S_{Mg}} w_i^c\, y_i$$

where the index $g$ is for the calibration group and the index $i$ is for the individual.
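The selection step of the simulation can be illustrated with a minimal Python sketch of systematic sampling with a random start from a list of dwellings. It is not the SAS code used in the study; the function name, the frame and the sample size are hypothetical.

```python
import random


def systematic_sample(units, n, rng):
    """Select n units from the ordered list `units` by systematic sampling."""
    if not 0 < n <= len(units):
        raise ValueError("n must be between 1 and the frame size")
    step = len(units) / n          # sampling interval (may be fractional)
    start = rng.random() * step    # random start in [0, step)
    last = len(units) - 1
    # Take the unit whose position contains start, start + step, ...
    return [units[min(int(start + k * step), last)] for k in range(n)]


# Hypothetical municipality frame of 100 dwellings, sample of 10 dwellings.
dwellings = list(range(1, 101))
sample = systematic_sample(dwellings, 10, random.Random(2011))
print(sample)
```

Repeating the call with different random starts yields the replicate samples of a Monte Carlo study; every person living in a selected dwelling would then enter the sample.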

$y_i$ takes the value 1 if the individual belongs to the class whose total is estimated, and zero otherwise.
$w_i^c$ is the calibrated weight, defined by

$$w_i^c = w_i\,\frac{N_{Mg}}{\hat N_{Mg}} \quad\text{with}\quad \hat N_{Mg} = \sum_{i\in S_{Mg}} w_i$$

The weight $w_i$ is the sampling weight, defined as the ratio between the population and the sample sizes in the municipality in terms of dwellings, $V_M/v_M$. $N_{Mg}$ is the total number of individuals in the municipality $M$ and the calibration group $g$. In this case, if the above formula is developed, the calibrated weights $w_i^c$ reduce to the ratio $N_{Mg}/n_{Mg}$, where the denominator is the number of individuals in the sample for the municipality $M$ who belong to the calibration group $g$.

The construction of the calibration groups

To define the calibration groups for each municipality in this region, we initially considered a partition of the municipal population by sex and age, crossed with Spanish and foreign people. So we contemplated 2x19x2=76 calibration groups in all municipalities. After selecting 10 samples in each municipality and analysing the groups without observations in the sample, the definition of the calibration groups was modified by aggregating some age groups and eliminating the crossing between sex-age and nationality. So, finally, at the municipality level we considered 34 sex-age groups and 3 nationality groups defined by:
- group 1: Spanish people
- group 2: foreigners from the European Union (15), U.S., Japan or Oceania
- group 3: other foreigners
without crossing; together these groups are not a partition of the municipal population.

Consequently the methodology to calibrate the census estimator to these 37 groups becomes more complex, as shown in the next section.

Modification of the census estimator

In order to estimate without error the population totals of the new calibration groups, given that these groups are not a partition of the municipality, the census estimator is constructed as a Generalized Regression Estimator (GREG) with a standard regression model fitted to the municipality sample. Therefore the census estimator is the direct estimator

$$\hat Y_M^{GREG} = \hat Y_M + (X_M - \hat X_M)^T\hat\beta_M$$
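For calibration groups that partition the sample, the calibrated weight $w_i^c = w_i N_{Mg}/\hat N_{Mg}$ can be sketched in Python as follows. The function name and the data are hypothetical; the sketch also reproduces the situation described in the text in which calibration is impossible because a group has no sample observations.

```python
from collections import defaultdict


def calibrate_weights(groups, base_weight, group_totals):
    """Calibrated weights for one municipality.

    groups: calibration group label of each sampled individual
    base_weight: common sampling weight w_i = V_M / v_M
    group_totals: known population total N_Mg of each calibration group
    """
    n_hat = defaultdict(float)
    for g in groups:
        n_hat[g] += base_weight  # estimated group total before calibration
    empty = set(group_totals) - set(n_hat)
    if empty:
        raise ValueError("no sample observations in groups: %s" % sorted(empty))
    return [base_weight * group_totals[g] / n_hat[g] for g in groups]


# Hypothetical sample of three individuals in two calibration groups.
weights = calibrate_weights(["a", "a", "b"], 10.0, {"a": 30.0, "b": 5.0})
print(weights)
```

Because the base weight is constant within the municipality, each calibrated weight equals the group population total divided by the number of sampled individuals in the group, in line with the ratio $N_{Mg}/n_{Mg}$ given in the text.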

where

$\hat Y_M = \sum_{i\in S_M} w_i y_i$ is the direct estimator of the total of the variable $Y$.

$X_M = (X_{M1}, \dots, X_{MG})^T$ is the vector of the $G$ calibration group population totals.

$\hat X_M = (\hat X_{M1}, \dots, \hat X_{MG})^T$ is the vector of the $G$ estimated calibration group population totals, where $\hat X_{Mg} = \sum_{i\in S_M} w_i x_{ig}$ for $g = 1, \dots, G$, and $x_{ig}$ takes the value 1 if the individual belongs to the calibration group $g$ and zero otherwise.

$$\hat\beta_M = \Big(\sum_{i\in S_M} w_i x_i x_i^T\Big)^{-1}\sum_{i\in S_M} w_i x_i y_i$$

with $x_i = (x_{i1}, \dots, x_{iG})^T$ the vector of $G$ values that take the value 1 if the individual $i$ belongs to the calibration group and zero otherwise.

The GREG estimator can be written as a weighted sum,

$$\hat Y_M^{GREG} = \sum_{i\in S_M} g_i w_i y_i$$

where the g weights are

$$g_i = 1 + (X_M - \hat X_M)^T\Big(\sum_{i\in S_M} w_i x_i x_i^T\Big)^{-1} x_i$$

These g weights depend on the municipality sample and verify that

$$\sum_{i\in S_M} g_i w_i x_i = X_M$$

That is, the new weights defined by $g_i w_i$ are calibrated to the known population totals of the $G$ calibration groups and, consequently, they are the new calibrated weights $w_i^c$.

Process to calibrate the 500 samples

First we obtained the calibration weights, considering the 37 new groups, for the individuals in the 10 samples already selected. Unfortunately, we found some problems, for example:
- in a very few cases of foreigners, the calibrated weights were found to be negative
- with a relatively high frequency the calibrated weights were positive but less than one
- there were still situations where it was impossible to calibrate because there were no sample observations for some group (most of these cases are in groups of foreigners, especially in group 2)

Before selecting the remaining 490 samples and working with them, some decisions were taken to avoid these situations. The solution to the first problem was to truncate the calibrated weights, forcing them to fulfil the following rule:

$$0.1 \le \frac{w_i^c}{w_i} \le 10$$

For the second problem we have not done anything, because it is not relevant for our study. Finally, for the third situation we decided to remove the sample if the problem affects a sex-age group and, otherwise, to join all foreigners in only one group. In other words, we gave priority to the calibration of the sex-age groups and sacrificed the two groups of foreigners if necessary. Also, for each municipality $M$ and group of foreigners, the expected domain sample size was calculated as

$$E(n_{Mg}) = n_M\,\frac{N_{Mg}}{N_M}$$

and if it was less than 6, the foreigners in that municipality formed a single group ($N_M$ is the population total of the municipality $M$).

Finally, after calibrating the 500 samples, we still found a few samples that failed the calibration in a municipality because there were no sample observations in a group. Given that the synthetic estimators to be evaluated are based on provincial estimates, those samples with calibration problems should be eliminated. However, to avoid reducing the number of samples in the simulation study too much, we analysed them case by case and applied the above rule of joining foreigners, or of eliminating the sample in order to respect the sex-age groups. As a consequence of this last process, 47 samples have been eliminated and the final calibration groups are:

CALIBRATION GROUPS OF
Sex and Age: 45 out of 45
Nationality (Spanish people / foreigners): 5 out of
Nationality (groups 1, 2 and 3): out of 45
Table 3. Total of municipalities in the region investigated according to the type of calibration groups applied.

Once the calibrated weights are calculated in all samples, we evaluate the census estimator constructed as a GREG and two synthetic estimators, which are defined in the next section.
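The two rules used in this process, truncation of the weight ratio and the check on the expected domain sample size, can be sketched in Python. The truncation bounds and the threshold are passed as parameters because the values used here are only illustrative assumptions; the function names are hypothetical as well.

```python
def truncate_weight(w_cal, w, lower, upper):
    """Force the ratio w_cal / w into the interval [lower, upper]."""
    ratio = w_cal / w
    return w * min(max(ratio, lower), upper)


def expected_group_sample(n_m, n_mg_pop, n_m_pop):
    """Expected sample size of group g: E(n_Mg) = n_M * N_Mg / N_M."""
    return n_m * n_mg_pop / n_m_pop


def merge_foreigners(n_m, foreigner_pops, n_m_pop, threshold):
    """True if any group of foreigners is expected to be below the threshold."""
    return any(expected_group_sample(n_m, pop, n_m_pop) < threshold
               for pop in foreigner_pops)


# Hypothetical municipality: 2000 inhabitants, sample of 200 persons,
# two groups of foreigners with 50 and 120 persons.
print(truncate_weight(-3.0, 10.0, lower=0.1, upper=10.0))
print(expected_group_sample(200, 50, 2000))
print(merge_foreigners(200, [50, 120], 2000, threshold=6))
```

In this toy case the negative weight is raised to the lower bound, and the first group of foreigners is expected to contribute only 5 sampled persons, so the two groups would be joined.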

4.- Methodology

4.1. SAE methods.

For each municipality $M$, the following estimators will be evaluated:

1) Census estimator.

2) Broad Area Ratio Estimator (BARE):
This estimator is the ratio-synthetic estimator (4.2.2) appearing in Rao's book (see reference [157] in the WP2 Final Report)

$$\hat Y_M^{BARE} = N_M\,\frac{\hat Y_P^{GREG}}{\hat N_P^{GREG}}$$

where the ratio estimator is calculated for the province $P$ to which the municipality belongs:

$$\hat Y_P^{GREG} = \sum_{M\in P}\hat Y_M^{GREG}; \qquad \hat N_P^{GREG} = \sum_{M\in P}\hat N_M^{GREG}$$

3) Estimator based on post-stratification (POSS):
This estimator is the count-synthetic estimator (4.2.3) appearing in Rao's book (see reference [157] in the WP2 Final Report)

$$\hat Y_M^{POSS} = \sum_g N_{Mg}\,\frac{\hat Y_{Pg}^{GREG}}{\hat N_{Pg}^{GREG}}$$

where $g$ runs only over the sex-age groups (the two or three calibration groups based on Spanish and foreign people are excluded). The ratio estimator for each summand is calculated for the province $P$ to which the municipality belongs:

$$\hat Y_{Pg}^{GREG} = \sum_{M\in P}\hat Y_{Mg}^{GREG}; \qquad \hat N_{Pg}^{GREG} = \sum_{M\in P}\hat N_{Mg}^{GREG}$$

This synthetic estimator has been previously evaluated in the context of the Spanish Labour Force Survey (see chapter, section 7 in the WP2 Final Report).
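The two synthetic estimators defined above can be sketched in Python. The inputs are assumed to be already computed GREG estimates for the municipalities of one province; the function names and all figures are hypothetical.

```python
def bare(n_m, y_hat_by_mun, n_hat_by_mun):
    """Broad Area Ratio Estimator: N_M times the provincial ratio Y_P / N_P."""
    return n_m * sum(y_hat_by_mun) / sum(n_hat_by_mun)


def poss(n_mg, y_hat_pg, n_hat_pg):
    """Post-stratified synthetic estimator: provincial ratios by sex-age group.

    n_mg: known population of the municipality in each sex-age group
    y_hat_pg, n_hat_pg: provincial GREG estimates for the same groups
    """
    return sum(n_mg[g] * y_hat_pg[g] / n_hat_pg[g] for g in n_mg)


# Hypothetical province with three municipalities.
y_hat = [30.0, 50.0, 120.0]      # GREG estimates of the target total
n_hat = [400.0, 600.0, 1000.0]   # GREG estimates of the population
print(bare(500.0, y_hat, n_hat))

# Hypothetical municipality with two sex-age groups.
print(poss({"g1": 300.0, "g2": 200.0},
           {"g1": 80.0, "g2": 120.0},
           {"g1": 1000.0, "g2": 1000.0}))
```

Both estimators borrow the provincial rate and apply it to known municipal population counts. POSS does so separately within each sex-age group, which protects against differences in the age-sex structure between the municipality and its province.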

4.2. Quality assurance methods.

In order to evaluate how well the proposed small area estimators estimate the municipal totals of the objective variables, K=500 samples are simulated in the selected region. Let $\hat Y_M(k)$ be the estimate of the total $Y_M$ for the municipality $M$ in the $k$-th replicate sample. The following standard performance criteria are considered (see Evaluation Measures in section III-A and the reference [335] of the WP3 Final Report):

1. The percentage relative bias of the small area estimator in the municipality $M$ (average relative bias):

$$ARB(\hat Y_M) = 100\,\Big|\frac{1}{K}\sum_{k=1}^{K}\frac{\hat Y_M(k) - Y_M}{Y_M}\Big|$$

2. The mean of the average relative bias:

$$\overline{ARB} = \frac{1}{M_P}\sum_{M\in P} ARB(\hat Y_M)$$

where $M_P$ is the number of municipalities belonging to the province $P$.

3. The percentage relative root mean square error of the small area estimator in the municipality $M$ (relative mean square error):

$$RMSE(\hat Y_M) = 100\,\sqrt{\frac{1}{K}\sum_{k=1}^{K}\Big(\frac{\hat Y_M(k) - Y_M}{Y_M}\Big)^2}$$

4. The mean of the relative root mean square error:

$$\overline{RMSE} = \frac{1}{M_P}\sum_{M\in P} RMSE(\hat Y_M)$$

In addition, the model diagnostics proposed by Brown et al. (2001), discussed in section IV-A Guides (reference []) and IV-D Model diagnostics of the WP3 Final Report, are considered. The purpose of the analysis is to study the consistency between the results obtained from the model diagnostics and the standard performance criteria mentioned above (1-4). Since those diagnostics compare direct with model-based estimates, and in this case study the true values of the target variables are available, the direct estimates are replaced by the true values, which are compared with the mean of the model-based estimates over the simulated samples, $\bar{\hat Y}_M$, for each estimator:

$$\bar{\hat Y}_M = \frac{1}{K}\sum_{k=1}^{K}\hat Y_M(k), \qquad MSE(\hat Y_M) = \frac{1}{K}\sum_{k=1}^{K}\big(\hat Y_M(k) - Y_M\big)^2$$

The evaluation measures are also calculated for the direct estimator, as a benchmark.
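The performance criteria can be computed from the replicate estimates as in the following Python sketch. It assumes that the relative bias is taken in absolute value and that both measures are expressed as percentages; the function names and the replicate values are hypothetical.

```python
import math


def arb(estimates, true_total):
    """Percentage absolute relative bias over the replicate estimates."""
    k = len(estimates)
    return 100.0 * abs(sum((e - true_total) / true_total for e in estimates) / k)


def rrmse(estimates, true_total):
    """Percentage relative root mean square error over the replicates."""
    k = len(estimates)
    return 100.0 * math.sqrt(
        sum(((e - true_total) / true_total) ** 2 for e in estimates) / k)


def province_mean(values):
    """Mean of a municipality-level measure over the municipalities of a province."""
    return sum(values) / len(values)


# Hypothetical replicate estimates of a true municipal total of 100.
unbiased = [90.0, 110.0]
biased = [110.0, 120.0]
print(arb(unbiased, 100.0), rrmse(unbiased, 100.0))
print(arb(biased, 100.0), rrmse(biased, 100.0))
print(province_mean([arb(unbiased, 100.0), arb(biased, 100.0)]))
```

In the first case the errors cancel, so the bias measure is zero while the relative root mean square error is about 10%. In the second case both measures are positive, which is the typical signature of a synthetic estimator applied to an atypical area.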

5. Software.

The platform used in the simulation study is SAS:

- Sample selection is carried out with standard procedures (PROC SURVEYSELECT).
- All the estimates are obtained through specific programming with SAS Macro and SAS IML, implementing the GREG estimator and imposing restrictions on the variation of the weights (before and after applying GREG).
- The programs proposed in WP4 for model diagnostics and calibration are used, with modifications that add more flexibility in the input and output, using SAS Macro.
- Specific programs developed for the quality assurance methods described in section 4.2 are also used.
- Cluster analysis is also accomplished by standard procedures (PROC CLUSTER and PROC TREE).

6. First results.

In order to evaluate the census estimates we have studied their RMSEs for each of the parameters investigated, and the distribution of these RMSEs according to the classes suggested by Statistics Canada to check LFS data reliability (Statistics Canada, 1), which are based on the coefficient of variation (CV). In our case, the RMSE is used instead of the CV; for the census estimator the RMSE corresponds to the CV, as the census estimator is approximately unbiased. Obviously, this indicator is not calculated if the population total of the category ($Y_M$) is zero, that is, if the proportion ($P_M = Y_M / N_M$) is zero. Table 4 displays the distribution of the small areas (45) according to this classification.

[Table 4. CV% of CENSUS estimates. Columns: $P_M = 0$; CV(%) in [0, 16.5]; (16.5, 33.3]; > 33.3. Rows: By Nationality (Spain, EU (except Spain), Africa, America, Asia, Oceania, Stateless); By Marital Status (Single, Married, Widowed, Separated, Divorced); By Educational Level.]

Given that the sampling fractions (f) in the municipalities of the region investigated range from 1% to 5%, depending on the population size, the accuracy of the census estimator is acceptable for the majority of the municipalities and categories of the target variables. Note that in this study the small areas considered are not geographical areas but subdomains, which correspond to categories whose relative size in the population ($P_M = Y_M / N_M$) is very small and which could not be included as planned domains (e.g. Marital status: Divorced).

To analyse the bias of the census estimator we have calculated the ratio BR between the absolute value of the bias and the root of the mean square error, expressed as a percentage. For each estimated parameter, its distribution has been studied according to a classification derived from the suggestion mentioned in Särndal's book (reference [167] in the WP2 Final Report). Table 5 shows this distribution; the conclusion is that the census estimator is approximately design-unbiased, as expected.

[Table 5. BR% of CENSUS estimates. Columns: $P_M = 0$; BR% ≤ 70.7; BR% > 70.7. Rows: By Nationality (Spain, EU (except Spain), Africa, America, Asia, Oceania, Stateless); By Marital Status (Single, Married, Widowed, Separated, Divorced); By Educational Level. Cell values as transcribed: EU (except Spain) 43 -; Africa 43 -; Asia 3 -; Stateless 43 -.]

It is clear from Tables 4 and 5 that SAE methods and, in particular, the synthetic estimators considered should be useful for estimating the small area population totals of the categories whose estimates fall in the third class of Table 4 (except for Asia, Oceania and Stateless, which should be merged into a single category because of their residual character). We therefore focus on EU (except Spain), Africa and America for Nationality, Separated and Divorced for Marital Status, and from 6 (Secondary education) to 1 (Doctorate) for Educational Level.
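The two screening indicators used in Tables 4 and 5 can be written down as a short Python sketch. It assumes the Statistics Canada class boundaries of 16.5% and 33.3% quoted for Table 4 and the definition of BR given in the text (absolute bias over root mean square error, in percent); the function names and the toy replicates are hypothetical.

```python
import math


def reliability_class(rrmse_pct):
    """Class 1, 2 or 3 for an RRMSE (or CV) expressed in percent."""
    if rrmse_pct <= 16.5:
        return 1
    if rrmse_pct <= 33.3:
        return 2
    return 3


def bias_ratio_pct(estimates, true_total):
    """BR%: 100 * |bias| / sqrt(MSE), both taken over the replicates."""
    k = len(estimates)
    bias = sum(estimates) / k - true_total
    mse = sum((e - true_total) ** 2 for e in estimates) / k
    if mse == 0:
        return 0.0  # every replicate hits the true value: no error at all
    return 100.0 * abs(bias) / math.sqrt(mse)


# Made-up replicates: bias 5, MSE 150, so BR% is 100 * 5 / sqrt(150).
br = bias_ratio_pct([90.0, 110.0, 100.0, 120.0], 100.0)
classes = [reliability_class(v) for v in (10.0, 16.5, 20.0, 40.0)]
```

Since MSE equals squared bias plus variance, BR% lies between 0 and 100, and values above roughly 70.7 mean that the squared bias exceeds the variance.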
For the categories mentioned, the analysis of the results shows that the behaviour of both synthetic estimators is similar. This is consistent with the fact that they are based on the same broad area, the province, which in this study includes all the small areas, and that the overall sampling fraction is high (around 1%). Tables 6 and 7 show the distribution of the RRMSEs and BRs of the BARE and POSS estimates respectively.

[Table 6. RRMSE% of BARE and POSS estimates (based on the province). Columns: $P_M = 0$; then, for BARE and for POSS, RRMSE(%) in [0, 16.5]; (16.5, 33.3]; > 33.3. Rows: By Nationality (EU (except Spain), Africa, America); By Marital Status (Separated, Divorced); By Educational Level.]

[Table 7. BR% of BARE and POSS estimates (based on the province). Columns: $P_M = 0$; then, for BARE and for POSS, BR% ≤ 70.7; BR% > 70.7. Rows: By Nationality (EU (except Spain), Africa, America); By Marital Status (Separated, Divorced); By Educational Level.]

The error in both cases is mainly bias, as the BR% ratio falls in the second class for most of the small area estimates. This is a result of the very low variability of the estimates around their average over all the simulated samples, because the relative size in the province is estimated with high accuracy.

Figure 1 shows the ARB of both estimators for Marital status: Divorced, with the municipalities ordered by their relative size in the population ($P_M$). A decreasing pattern can be observed as $P_M$ grows: the estimators overestimate in those municipalities where the relative size is lower than the overall one (in the province) and underestimate where it is higher. The minimum bias is therefore found in those municipalities where the relative size is similar to that of the province, as expected for BARE, but usually also for POSS.

[Figure 1. ARB of BARE and POSS estimates of small area total for Marital status: Divorced (based on the province), ordered by relative size in the population ($P_M$). Vertical axis: ARB(%); horizontal axis: Municipalities; series: BARE, POSS.]

Finally, comparing the accuracy of BARE and POSS with that of the Census estimator, the synthetic estimates outperform the direct estimate in those municipalities where the relative sizes are similar to the ones in the broad area, that is, the province, and they provide acceptable estimates only when the two are very close, as can be seen in Figure 2. This suggests that estimators based on a proper definition of the broad area, dependent on the municipality and related to the target variable, could be considered as an alternative to the Census estimator.

[Figure 2. RMSE of Census, BARE and POSS estimates of small area totals for Marital status: Divorced, ordered by relative size in the population ($P_M$). Vertical axis: RMSE(%); horizontal axis: Municipalities; series: CENSUS, BARE, POSS.]

7. New approach and results to improve the synthetic estimators.

In order to reduce the bias of the synthetic estimators in those small areas whose behaviour is very different from that of the province, cluster analysis has been used, grouping areas with similar mean values for certain variables available in the real world. Ward's minimum-variance method has been used for cluster generation (Ward (1963)). The following variables were considered:

1) Demographic variables:
- Distribution of the population according to 4 age groups (0-15, 16-24, 25-69, 70-)
- Percentage of foreigners

2) Economic variables:
- Total income per dwelling
- Percentage of investment income over total income
- Percentage of agrarian income over total income

4 groups have been created, as can be seen in Figures 3 and 4 ( R =. 539 ).

[Figure 3. Tree generated in the cluster analysis.]

[Figure 4. Clusters of small areas.]
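To make the clustering step concrete, the sketch below implements a naive version of Ward's minimum-variance agglomeration in plain Python. It is not the procedure run in the study, which relied on SAS PROC CLUSTER and PROC TREE: the function name is hypothetical, the toy points are made up, and whether the auxiliary variables were standardised beforehand is not stated in the report, so no scaling is applied here.

```python
def ward_clusters(points, n_clusters):
    """Group points by repeatedly merging the pair of clusters whose merge
    increases the within-cluster sum of squares the least (Ward's criterion).

    points     : list of equal-length numeric tuples, one per small area
    n_clusters : number of clusters to stop at
    Returns a list of sorted index lists, one per cluster.
    """
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                na, nb = len(ma), len(mb)
                d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                cost = na * nb / (na + nb) * d2  # increase in within-cluster SS
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters[a] = (ma + mb, centroid)
        del clusters[b]
    return [sorted(members) for members, _ in clusters]


# Four made-up areas described by two auxiliary variables.
groups = ward_clusters([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], 2)
```

The resulting groups would then play the role of the broad area in the BARE formula, in place of the province. The pairwise search is quadratic in the number of clusters at every step, which is acceptable for a few hundred municipalities but not for large problems.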

Tables 8 and 9 show respectively the distribution of the RRMSEs and BRs of the BARE and POSS estimates based on those clusters.

[Table 8. RRMSE% of BARE and POSS estimates (based on clusters of small areas). Columns: $P_M = 0$; then, for BARE and for POSS, RRMSE(%) in [0, 16.5]; (16.5, 33.3]; > 33.3. Rows: By Nationality (EU (except Spain), Africa, America); By Marital Status (Separated, Divorced); By Educational Level.]

[Table 9. BR% of BARE and POSS estimates (based on clusters of small areas). Columns: $P_M = 0$; then, for BARE and for POSS, BR% ≤ 70.7; BR% > 70.7. Rows: By Nationality (EU (except Spain), Africa, America); By Marital Status (Separated, Divorced); By Educational Level.]

Both tables show that the alternative synthetic estimators behave similarly for most of the estimated parameters. Given that the simplicity of the method is always an advantage, in the rest of the analysis we will consider only the BARE estimator, naming its cluster-based version CBAR.

7.1 Census estimator vs. BARE estimators.

To compare the different choices (census, BARE-province or BARE-cluster (CBAR) estimator), the domains where the census estimates have an RRMSE% greater than 33.3% are considered (the last column of Table 4). Table 10 describes the distribution of those cases.

[Table 10. Distribution of small areas when census estimates have RRMSE% > 33.3, according to the best choice (the one with the smallest RMSE). Columns (best choice): CENSUS estimator; BARE estimator (province); CBAR estimator (cluster). Rows: By Nationality (EU (except Spain), Africa, America); By Marital Status (Separated, Divorced); By Educational Level; Total.]

As expected, the alternative BARE estimator based on clusters improves the estimates more often than the BARE based on a larger area such as the province. Moreover, in this case the census (direct) estimator works well in many cases, supporting the hypothesis already mentioned that the small areas are not so much the municipalities as subdomains with a very small relative population size. Table 11 shows the ARB and the RMSE indicators averaged over a subset of the small areas according to the following situations:

- CV% of the census estimate is not greater than 33.3.
- CV% of the census estimate is greater than 33.3 but the census estimate outperforms the CBAR estimate.
- CV% of the census estimate is greater than 33.3 and the better alternative is the CBAR estimate.

[Table 11. Mean values of ARB and RMSE over the small areas. Columns: Population total; $P_M = 0$; then TOTAL, ARB and RMSE for each of three groups: CV% ≤ 33.3; CV% > 33.3 and Census estimator outperforms CBAR; CV% > 33.3 and CBAR outperforms Census estimator. Rows: By Nationality (EU (except Spain), Africa, America); By Marital Status (Separated, Divorced); By Educational Level.]

Since the variables used to construct the clusters are not exactly the same as the target variables, the synthetic estimator improves the estimates on many occasions (see the highlighted table results) but not always. Two significant cases are described below. One is the estimation of the small area total of foreigners from Africa, where the RMSE of the synthetic estimator is smaller than that of the census estimator in 9 small areas vs. 15, but the mean of the ARB represents more than half of the error. The other is the estimation of doctorates, where the RMSE of the synthetic estimates is smaller in 3 small areas vs. 7 and the ARB represents a much smaller percentage of the error than in the previous case (approximately 18%).
Although in both cases there is a reduction in RMSE of over 65%, in the first the error made by the alternative estimator is mainly bias whereas in the second it is not, and the number of small area estimates improved is also different. Figures 5 and 6 describe each situation.

[Figure 5. RMSE and absolute value of the ARB for the 9 small areas where the CBAR estimate improves the census estimates of small area totals for Nationality: Africa, ordered by clusters and relative size in the population ($P_M$). Horizontal axis: Municipalities; series: RMSE_CBAR, ABS(ARB_CBAR), P_M.]

[Figure 6. RMSE and absolute value of the ARB for the 3 small areas where the CBAR estimate improves the census estimates of small area totals for Educational level: Doctorate, ordered by clusters and relative size in the population ($P_M$). Horizontal axis: Municipalities; series: RMSE_CBAR, ABS(ARB_CBAR), P_M.]

Comparing the previous figures, it can be observed that the range of the small area relative size ($P_M$) is much smaller in the second one than in the first (-,8 and -,1 respectively). It is also more homogeneous inside each cluster in the second than in the first. Both factors are relevant to whether or not the census estimates are improved.

Therefore the behaviour of the synthetic estimator differs from one situation to another. In order to improve it in any situation by reducing its bias, it seems necessary to build the broad region ad hoc for the target parameter, although this strategy is not feasible when there are many variables, i.e. in the case of a multipurpose survey, as in our case. For this reason the synthetic estimator could be a good strategy when there is only one target variable or a small set of highly correlated ones. Otherwise it is better to apply a composite estimator ([WP2 Final Report, 157], page 57), which is a convex linear combination of the direct estimator and the synthetic one and a good way to balance the bias of the synthetic estimator against the instability of the direct estimator.
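The composite estimator mentioned above is simple to state in code. The Python sketch below is illustrative only: the function names are hypothetical, and the weight formula is the usual approximation that minimises the MSE of the combination when the covariance between the direct and synthetic estimators is neglected, an assumption that is not discussed in this report.

```python
def composite(direct, synthetic, phi):
    """Convex combination phi * direct + (1 - phi) * synthetic."""
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must lie in [0, 1]")
    return phi * direct + (1.0 - phi) * synthetic


def approx_optimal_weight(mse_direct, mse_synthetic):
    """Approximate MSE-minimising weight on the direct estimator,
    assuming the covariance between the two estimators is negligible."""
    return mse_synthetic / (mse_direct + mse_synthetic)


# Made-up figures: an unstable direct estimate and a more precise synthetic one.
phi = approx_optimal_weight(mse_direct=300.0, mse_synthetic=100.0)
estimate = composite(direct=100.0, synthetic=80.0, phi=phi)
```

The weight moves towards the direct estimator when the synthetic estimator has a large MSE (for example because of bias) and towards the synthetic estimator when the direct one is unstable.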

7.2 Model diagnostics.

The results from the bias, goodness of fit, coverage, and calibration diagnostics are shown in Tables 12 to 15 for the subset of variables selected in section 6 (adding Nationality: Asia and Nationality: Oceania). The diagnostics for the target variables not included here gave similar results.

It is relevant to remark that, as mentioned in section 4.2, true values are used instead of direct estimates, so their MSE is 0, and the model-based estimates are in reality means over the simulated samples, so their MSE is significantly lower than the MSE of an estimate from a single sample (see the approximation proposed in section 4.2). This peculiarity has an impact on those diagnostics that make use of the MSE, that is, the goodness of fit diagnostic and the coverage diagnostic, compared with the case where only one sample is available and the true values are unknown. The impact is strongest for biased estimators, where it makes the bias more apparent.

The unbiasedness of the Census estimator is confirmed by the bias diagnostics shown in Table 12, focusing on the lines marked in yellow, which correspond to transformed data (discussed subsequently): H0 is not rejected for any of the target variables except Nationality: EU (not Spain), which is reasonable taking into account that the Census estimator is actually a GREG estimator and therefore only approximately design-unbiased.

More striking results are found for BARE and CBAR, since the tests fail to detect bias, with significantly high p-values. Only when the bias is substantial, as in the BARE estimator for Educational levels and Marital status (see Table 12), is the test able to detect it. This can be explained by the presence of influential observations in the BARE estimator, as can be observed in Figure 6 (marked out in red).

[Figure 6. BARE and CBAR bias scatterplots for Nationality: America.]

Those observations are identified as influential by standard diagnostics like Cook's D, COVRATIO and DFFITS (Belsley et al. (1980)). When they are removed, the test is able to detect the bias in the BARE estimates (left-hand picture in Figure 7), whereas CBAR remains globally unbiased.

[Figure 7. BARE and CBAR bias scatterplots for Nationality: America (square root), excluding influential observations.]

[Figure 8. BARE and CBAR bias scatterplots for Educational level: Doctorate (square root).]

Significant differences can still be found between BARE and CBAR for Educational level and Marital status, as can be observed in Figure 8 and Table 12, so that in this case the diagnostic plays a role as a comparative assessment tool. It is not, however, able to differentiate the behaviour of the Census and CBAR estimators, again because of the presence of influential observations.

Because of the variability in the population sizes, an initial transformation to ensure the homoscedasticity assumption was required for nearly all the target variables. In some cases (see Nationality: Asia and Nationality: Oceania) the homoscedasticity test failed due to the presence of many domains where $P_M = 0$, although in those cases the test on the transformed variables was performed too. Looking at the results obtained using the untransformed data (lines not marked in yellow), the intercepts and slopes for BARE and CBAR are sometimes quite far from their target values, because of the instability of the estimates, and the tests give confusing results compared with the tests on the transformed data.

In conclusion, as Brown et al. pointed out, the use of this diagnostic requires that the heteroskedasticity problem is addressed, but it is also necessary to alleviate the effect of influential observations.
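The core of the bias diagnostic can be sketched as an ordinary least squares fit whose intercept and slope are compared with the target values 0 and 1. The Python code below is a simplified illustration, not the WP4 SAS program: the function names are hypothetical, the true values are placed on the vertical axis and the mean model-based estimates on the horizontal axis by assumption, the square-root transformation follows the figure captions, and the formal test of the null hypothesis and the influence statistics are left out.

```python
import math


def ols_line(x, y):
    """Intercept and slope of the least squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def bias_diagnostic(true_totals, mean_estimates, sqrt_transform=True):
    """Fit true values against mean model-based estimates.

    An estimator that is unbiased across areas should give an intercept
    close to 0 and a slope close to 1.
    """
    f = math.sqrt if sqrt_transform else float
    x = [f(v) for v in mean_estimates]
    y = [f(v) for v in true_totals]
    return ols_line(x, y)


# Made-up totals for four areas.
truth = [4.0, 9.0, 16.0, 25.0]
fit_unbiased = bias_diagnostic(truth, truth)
# Estimates four times too large: slope 0.5 on the square-root scale.
fit_biased = bias_diagnostic(truth, [16.0, 36.0, 64.0, 100.0])
```

A single area with an extreme value on the horizontal axis can pull the fitted line towards the target values, which is one way influential observations can hide a bias that is visible once they are excluded.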


More information

Linear First-Order Equations

Linear First-Order Equations 5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)

More information

Lower Bounds for the Smoothed Number of Pareto optimal Solutions

Lower Bounds for the Smoothed Number of Pareto optimal Solutions Lower Bouns for the Smoothe Number of Pareto optimal Solutions Tobias Brunsch an Heiko Röglin Department of Computer Science, University of Bonn, Germany brunsch@cs.uni-bonn.e, heiko@roeglin.org Abstract.

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

arxiv: v1 [hep-lat] 19 Nov 2013

arxiv: v1 [hep-lat] 19 Nov 2013 HU-EP-13/69 SFB/CPP-13-98 DESY 13-225 Applicability of Quasi-Monte Carlo for lattice systems arxiv:1311.4726v1 [hep-lat] 19 ov 2013, a,b Tobias Hartung, c Karl Jansen, b Hernan Leovey, Anreas Griewank

More information

Gaussian processes with monotonicity information

Gaussian processes with monotonicity information Gaussian processes with monotonicity information Anonymous Author Anonymous Author Unknown Institution Unknown Institution Abstract A metho for using monotonicity information in multivariate Gaussian process

More information

The Principle of Least Action

The Principle of Least Action Chapter 7. The Principle of Least Action 7.1 Force Methos vs. Energy Methos We have so far stuie two istinct ways of analyzing physics problems: force methos, basically consisting of the application of

More information

APPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France

APPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France APPROXIMAE SOLUION FOR RANSIEN HEA RANSFER IN SAIC URBULEN HE II B. Bauouy CEA/Saclay, DSM/DAPNIA/SCM 91191 Gif-sur-Yvette Ceex, France ABSRAC Analytical solution in one imension of the heat iffusion equation

More information

On colour-blind distinguishing colour pallets in regular graphs

On colour-blind distinguishing colour pallets in regular graphs J Comb Optim (2014 28:348 357 DOI 10.1007/s10878-012-9556-x On colour-blin istinguishing colour pallets in regular graphs Jakub Przybyło Publishe online: 25 October 2012 The Author(s 2012. This article

More information

CONFIRMATORY FACTOR ANALYSIS

CONFIRMATORY FACTOR ANALYSIS 1 CONFIRMATORY FACTOR ANALYSIS The purpose of confirmatory factor analysis (CFA) is to explain the pattern of associations among a set of observe variables in terms of a smaller number of unerlying latent

More information

Optimization of Geometries by Energy Minimization

Optimization of Geometries by Energy Minimization Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.

More information

18 EVEN MORE CALCULUS

18 EVEN MORE CALCULUS 8 EVEN MORE CALCULUS Chapter 8 Even More Calculus Objectives After stuing this chapter you shoul be able to ifferentiate an integrate basic trigonometric functions; unerstan how to calculate rates of change;

More information

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems Construction of the Electronic Raial Wave Functions an Probability Distributions of Hyrogen-like Systems Thomas S. Kuntzleman, Department of Chemistry Spring Arbor University, Spring Arbor MI 498 tkuntzle@arbor.eu

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

Research Article When Inflation Causes No Increase in Claim Amounts

Research Article When Inflation Causes No Increase in Claim Amounts Probability an Statistics Volume 2009, Article ID 943926, 10 pages oi:10.1155/2009/943926 Research Article When Inflation Causes No Increase in Claim Amounts Vytaras Brazauskas, 1 Bruce L. Jones, 2 an

More information

Code_Aster. Detection of the singularities and calculation of a map of size of elements

Code_Aster. Detection of the singularities and calculation of a map of size of elements Titre : Détection es singularités et calcul une carte [...] Date : 0/0/0 Page : /6 Responsable : DLMAS Josselin Clé : R4.0.04 Révision : Detection of the singularities an calculation of a map of size of

More information

Influence of weight initialization on multilayer perceptron performance

Influence of weight initialization on multilayer perceptron performance Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -

More information

One-dimensional I test and direction vector I test with array references by induction variable

One-dimensional I test and direction vector I test with array references by induction variable Int. J. High Performance Computing an Networking, Vol. 3, No. 4, 2005 219 One-imensional I test an irection vector I test with array references by inuction variable Minyi Guo School of Computer Science

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

Web Appendix to Firm Heterogeneity and Aggregate Welfare (Not for Publication)

Web Appendix to Firm Heterogeneity and Aggregate Welfare (Not for Publication) Web ppeni to Firm Heterogeneity an ggregate Welfare Not for Publication Marc J. Melitz Harvar University, NBER, an CEPR Stephen J. Reing Princeton University, NBER, an CEPR March 6, 203 Introuction his

More information

How to Minimize Maximum Regret in Repeated Decision-Making

How to Minimize Maximum Regret in Repeated Decision-Making How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:

More information

A variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs

A variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs Mathias Fuchs, Norbert Krautenbacher A variance ecomposition an a Central Limit Theorem for empirical losses associate with resampling esigns Technical Report Number 173, 2014 Department of Statistics

More information

A SIMPLE ENGINEERING MODEL FOR SPRINKLER SPRAY INTERACTION WITH FIRE PRODUCTS

A SIMPLE ENGINEERING MODEL FOR SPRINKLER SPRAY INTERACTION WITH FIRE PRODUCTS International Journal on Engineering Performance-Base Fire Coes, Volume 4, Number 3, p.95-3, A SIMPLE ENGINEERING MOEL FOR SPRINKLER SPRAY INTERACTION WITH FIRE PROCTS V. Novozhilov School of Mechanical

More information

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion Hybri Fusion for Biometrics: Combining Score-level an Decision-level Fusion Qian Tao Raymon Velhuis Signals an Systems Group, University of Twente Postbus 217, 7500AE Enschee, the Netherlans {q.tao,r.n.j.velhuis}@ewi.utwente.nl

More information

Inverse Theory Course: LTU Kiruna. Day 1

Inverse Theory Course: LTU Kiruna. Day 1 Inverse Theory Course: LTU Kiruna. Day Hugh Pumphrey March 6, 0 Preamble These are the notes for the course Inverse Theory to be taught at LuleåTekniska Universitet, Kiruna in February 00. They are not

More information

Modeling the effects of polydispersity on the viscosity of noncolloidal hard sphere suspensions. Paul M. Mwasame, Norman J. Wagner, Antony N.

Modeling the effects of polydispersity on the viscosity of noncolloidal hard sphere suspensions. Paul M. Mwasame, Norman J. Wagner, Antony N. Submitte to the Journal of Rheology Moeling the effects of polyispersity on the viscosity of noncolloial har sphere suspensions Paul M. Mwasame, Norman J. Wagner, Antony N. Beris a) epartment of Chemical

More information

The total derivative. Chapter Lagrangian and Eulerian approaches

The total derivative. Chapter Lagrangian and Eulerian approaches Chapter 5 The total erivative 51 Lagrangian an Eulerian approaches The representation of a flui through scalar or vector fiels means that each physical quantity uner consieration is escribe as a function

More information

Balancing Expected and Worst-Case Utility in Contracting Models with Asymmetric Information and Pooling

Balancing Expected and Worst-Case Utility in Contracting Models with Asymmetric Information and Pooling Balancing Expecte an Worst-Case Utility in Contracting Moels with Asymmetric Information an Pooling R.B.O. erkkamp & W. van en Heuvel & A.P.M. Wagelmans Econometric Institute Report EI2018-01 9th January

More information

Thermal runaway during blocking

Thermal runaway during blocking Thermal runaway uring blocking CES_stable CES ICES_stable ICES k 6.5 ma 13 6. 12 5.5 11 5. 1 4.5 9 4. 8 3.5 7 3. 6 2.5 5 2. 4 1.5 3 1. 2.5 1. 6 12 18 24 3 36 s Thermal runaway uring blocking Application

More information

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments 2 Conference on Information Sciences an Systems, The Johns Hopkins University, March 2, 2 Time-of-Arrival Estimation in Non-Line-Of-Sight Environments Sinan Gezici, Hisashi Kobayashi an H. Vincent Poor

More information

Unit #6 - Families of Functions, Taylor Polynomials, l Hopital s Rule

Unit #6 - Families of Functions, Taylor Polynomials, l Hopital s Rule Unit # - Families of Functions, Taylor Polynomials, l Hopital s Rule Some problems an solutions selecte or aapte from Hughes-Hallett Calculus. Critical Points. Consier the function f) = 54 +. b) a) Fin

More information

Real-time arrival prediction models for light rail train systems EDOUARD NAYE

Real-time arrival prediction models for light rail train systems EDOUARD NAYE DEGREE PROJECT IN TRANSPORT AND LOCATION ANALYSIS STOCKHOLM, SWEDEN 14 Real-time arrival preiction moels for light rail train systems EDOUARD NAYE KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ARCHITECTURE

More information

Monte Carlo Methods with Reduced Error

Monte Carlo Methods with Reduced Error Monte Carlo Methos with Reuce Error As has been shown, the probable error in Monte Carlo algorithms when no information about the smoothness of the function is use is Dξ r N = c N. It is important for

More information

arxiv:physics/ v2 [physics.ed-ph] 23 Sep 2003

arxiv:physics/ v2 [physics.ed-ph] 23 Sep 2003 Mass reistribution in variable mass systems Célia A. e Sousa an Vítor H. Rorigues Departamento e Física a Universiae e Coimbra, P-3004-516 Coimbra, Portugal arxiv:physics/0211075v2 [physics.e-ph] 23 Sep

More information

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University

More information

State observers and recursive filters in classical feedback control theory

State observers and recursive filters in classical feedback control theory State observers an recursive filters in classical feeback control theory State-feeback control example: secon-orer system Consier the riven secon-orer system q q q u x q x q x x x x Here u coul represent

More information

Math 1271 Solutions for Fall 2005 Final Exam

Math 1271 Solutions for Fall 2005 Final Exam Math 7 Solutions for Fall 5 Final Eam ) Since the equation + y = e y cannot be rearrange algebraically in orer to write y as an eplicit function of, we must instea ifferentiate this relation implicitly

More information

Situation awareness of power system based on static voltage security region

Situation awareness of power system based on static voltage security region The 6th International Conference on Renewable Power Generation (RPG) 19 20 October 2017 Situation awareness of power system base on static voltage security region Fei Xiao, Zi-Qing Jiang, Qian Ai, Ran

More information

PD Controller for Car-Following Models Based on Real Data

PD Controller for Car-Following Models Based on Real Data PD Controller for Car-Following Moels Base on Real Data Xiaopeng Fang, Hung A. Pham an Minoru Kobayashi Department of Mechanical Engineering Iowa State University, Ames, IA 5 Hona R&D The car following

More information

Image Denoising Using Spatial Adaptive Thresholding

Image Denoising Using Spatial Adaptive Thresholding International Journal of Engineering Technology, Management an Applie Sciences Image Denoising Using Spatial Aaptive Thresholing Raneesh Mishra M. Tech Stuent, Department of Electronics & Communication,

More information

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties Journal of Machine Learning Research 16 (2015) 1547-1572 Submitte 1/14; Revise 9/14; Publishe 8/15 Flexible High-Dimensional Classification Machines an Their Asymptotic Properties Xingye Qiao Department

More information

Copyright 2015 Quintiles

Copyright 2015 Quintiles fficiency of Ranomize Concentration-Controlle Trials Relative to Ranomize Dose-Controlle Trials, an Application to Personalize Dosing Trials Russell Reeve, PhD Quintiles, Inc. Copyright 2015 Quintiles

More information

Optimal Signal Detection for False Track Discrimination

Optimal Signal Detection for False Track Discrimination Optimal Signal Detection for False Track Discrimination Thomas Hanselmann Darko Mušicki Dept. of Electrical an Electronic Eng. Dept. of Electrical an Electronic Eng. The University of Melbourne The University

More information

Topic Modeling: Beyond Bag-of-Words

Topic Modeling: Beyond Bag-of-Words Hanna M. Wallach Cavenish Laboratory, University of Cambrige, Cambrige CB3 0HE, UK hmw26@cam.ac.u Abstract Some moels of textual corpora employ text generation methos involving n-gram statistics, while

More information

Some New Thoughts on the Multipoint Method for Reactor Physics Applications. Sandra Dulla, Piero Ravetto, Paolo Saracco,

Some New Thoughts on the Multipoint Method for Reactor Physics Applications. Sandra Dulla, Piero Ravetto, Paolo Saracco, Jeju, Korea, April 16-20, 2017, on USB 2017 Some New Thoughts on the Multipoint Metho for Reactor Physics Applications Sanra Dulla, Piero Ravetto, Paolo Saracco, Politecnico i Torino, Dipartimento Energia,

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms

More information

Sparse Reconstruction of Systems of Ordinary Differential Equations

Sparse Reconstruction of Systems of Ordinary Differential Equations Sparse Reconstruction of Systems of Orinary Differential Equations Manuel Mai a, Mark D. Shattuck b,c, Corey S. O Hern c,a,,e, a Department of Physics, Yale University, New Haven, Connecticut 06520, USA

More information

2Algebraic ONLINE PAGE PROOFS. foundations

2Algebraic ONLINE PAGE PROOFS. foundations Algebraic founations. Kick off with CAS. Algebraic skills.3 Pascal s triangle an binomial expansions.4 The binomial theorem.5 Sets of real numbers.6 Surs.7 Review . Kick off with CAS Playing lotto Using

More information

Evaluation of Column Breakpoint and Trajectory for a Plain Liquid Jet Injected into a Crossflow

Evaluation of Column Breakpoint and Trajectory for a Plain Liquid Jet Injected into a Crossflow ILASS Americas, 1 st Annual Conference on Liqui Atomization an Spray Systems, Orlano, Floria, May 008 Evaluation of Column Breakpoint an Trajectory for a Plain Liqui Jet Injecte into a Crossflow S.M. Thawley,

More information

Level Construction of Decision Trees in a Partition-based Framework for Classification

Level Construction of Decision Trees in a Partition-based Framework for Classification Level Construction of Decision Trees in a Partition-base Framework for Classification Y.Y. Yao, Y. Zhao an J.T. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canaa S4S

More information

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1 Lecture 5 Some ifferentiation rules Trigonometric functions (Relevant section from Stewart, Seventh Eition: Section 3.3) You all know that sin = cos cos = sin. () But have you ever seen a erivation of

More information