Variance estimation on SILC based indicators
Emilio Di Meglio, Eurostat, emilio.di-meglio@ec.europa.eu
Guillaume Osier, STATEC, guillaume.osier@statec.etat.lu
3rd EU-LFS/EU-SILC European User Conference

Our main message today
EU-SILC is a sample survey. This means that when looking at indicators we should also look at accuracy measures.
EU-SILC is a complex survey. Naive methods are not directly applicable.
We present the first results of variance estimation using linearization techniques.

Why variance estimation?
Requested by regulation: quality report, compliance
Requested by users: policy relevance of indicators
Requested by researchers

Current quality precision requirements
According to Reg. 1982/2003, the X and L (initial sample) data are to be based on a nationally representative probability sample of the population residing in private households.
Representative probability samples shall be achieved both for households and for individual persons in the target population.
The sampling frame and methods of sample selection should ensure that every individual and household in the target population is assigned a known and non-zero probability of selection.
Reg. 1177/2003 defines the minimum effective sample sizes to be achieved.

Minimum effective sample size

Country   Households                       Persons aged 16+
          Cross-sectional   Longitudinal   Cross-sectional   Longitudinal
BE        4 750             3 500          8 750             6 500
BG        4 500             3 500          10 000            7 500
DK        4 250             3 250          7 250             5 550
DE        8 250             6 000          14 500            10 500

DEFF (Kish, 1965)
Definition: "The ratio of the variance under the given sample design, to the variance under a simple random sample of the same size"
Importance: a tool to measure the efficiency of a complex sample design
Calculation based on the at-risk-of-poverty rate

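As a rough illustration of Kish's definition, the sketch below computes the design effect as a ratio of two variances, together with the effective sample size n/DEFF that it implies. The function names and all input figures are hypothetical, chosen only to show the arithmetic; they are not EU-SILC results.

```python
def srs_variance_of_proportion(p: float, n: int) -> float:
    """Variance of an estimated proportion under simple random sampling
    (with replacement, no finite population correction): p(1-p)/n."""
    return p * (1.0 - p) / n


def design_effect(var_design: float, var_srs: float) -> float:
    """Kish's DEFF: variance under the actual design divided by the
    variance under a simple random sample of the same size."""
    if var_srs <= 0:
        raise ValueError("SRS variance must be positive")
    return var_design / var_srs


def effective_sample_size(n: int, deff: float) -> float:
    """Size of the simple random sample that would give the same precision."""
    return n / deff


# Hypothetical inputs: a 16% poverty rate estimated from 6 000 households,
# with a complex design assumed to inflate the variance by 50%.
p, n = 0.16, 6000
var_srs = srs_variance_of_proportion(p, n)
var_design = 1.5 * var_srs  # assumption made for illustration only
deff = design_effect(var_design, var_srs)
n_eff = effective_sample_size(n, deff)
```

With these made-up inputs the 6 000 sampled households are worth about 4 000 households drawn by simple random sampling, which is the quantity the minimum effective sample sizes refer to.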
What influences variance?
Variability of the phenomenon
Sample size
Indicator value
Sampling design
Imputation
Calibration
Weighting

Main challenges for EU-SILC
Difficulty of finding the «best» possible method for variance estimation:
  Different designs (flexibility)
  Missing information
  Debate on methods ongoing
Need to differentiate between accuracy estimates for policy usage and accuracy estimates for researchers.

Sampling design by country (2010)

Sampling of dwellings/addresses
  Simple random sampling: Malta
  Stratified simple random sampling: Luxembourg, Austria*
  Stratified simple random sampling from former participants of micro census: Germany
  Stratified multi-stage sampling: Czech Republic, Spain, France, Hungary, Latvia, The Netherlands, Poland, Portugal, Romania, United Kingdom

Sampling of households
  Stratified simple random sampling: Cyprus, Slovakia
  Stratified multi-stage sampling: Belgium, Bulgaria, Greece, Ireland, Italy

Sampling of individuals
  Simple random or systematic sampling: Denmark, Iceland, Sweden, Norway
  Stratified simple random or systematic sampling: Estonia, Lithuania
  Stratified two-phase sampling: Finland
  Stratified two-stage sampling: Slovenia

Sample design variables
DB050: primary strata
DB060: primary sampling units
DB062: secondary sampling units
DB070: order of selection of primary sampling units
DB030: household ID

Our objective
Resampling that takes into account all the possible elements from 32 countries would be extremely computationally and resource intensive.
Variance estimation methods that balance scientific accuracy against administrative considerations (time, cost, simplicity) are the only viable solution.
Aim: to quickly provide users and policy makers with standard errors for the SILC-based indicators, particularly the AROPE, its components and its main breakdowns.

The proposed approach
We considered different methods: bootstrap, jackknife, linearisation.
We carried out comparative experiments on a limited number of countries, and the results are similar.
We chose to work with linearisation (the ultimate cluster approach proposed by Net-SILC2), which can provide acceptable results given the constraints we face.
The approach was discussed at the Net-SILC2 workshop on accuracy and validated by the SILC WG.

The method (synthesis)
Linearization is a technique that uses a linear approximation to reduce non-linear statistics to a linear form, justified by asymptotic properties of the estimator (Särndal et al., 1992; Deville, 1999; Wolter, 2006; Osier, 2009).
The "ultimate cluster" approach (Särndal et al., 1992) is a simplification that calculates the variance taking into account only the variation among Primary Sampling Unit (PSU) totals.
This method requires first-stage sampling fractions to be small, which is nearly always the case.
This method allows great flexibility and simplifies the calculation of variances.
It can also be generalized to calculate the variance of differences from one year to another (Berger, 2004, 2010).
Applicable with the main statistical packages (SAS, R, STATA).

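To make the two ideas concrete, here is a minimal sketch of a linearized, ultimate-cluster variance estimate for a weighted ratio R = Y/X. It is a simplified illustration under stated assumptions, not the production method: PSUs are treated as sampled with replacement within strata, the linearized variable is the standard one for a ratio, (y - R x)/X, and effects such as an estimated poverty threshold, calibration or imputation are ignored. The function name and the record layout are hypothetical.

```python
from collections import defaultdict


def ratio_variance_ultimate_cluster(records):
    """Estimate R = sum(w*y) / sum(w*x) and its variance.

    records: iterable of (stratum, psu, weight, y, x) tuples.
    Only the variation among weighted PSU totals within each stratum
    is used (ultimate cluster approach).
    """
    records = list(records)
    total_y = sum(w * y for _, _, w, y, _ in records)
    total_x = sum(w * x for _, _, w, _, x in records)
    ratio = total_y / total_x

    # Weighted linearized variable, summed to PSU level.
    psu_totals = defaultdict(float)
    for stratum, psu, w, y, x in records:
        psu_totals[(stratum, psu)] += w * (y - ratio * x) / total_x

    # Group the PSU totals by stratum.
    by_stratum = defaultdict(list)
    for (stratum, _), z in psu_totals.items():
        by_stratum[stratum].append(z)

    # Between-PSU variance within each stratum, summed over strata.
    variance = 0.0
    for stratum, zs in by_stratum.items():
        n_h = len(zs)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two PSUs")
        mean = sum(zs) / n_h
        variance += n_h / (n_h - 1) * sum((z - mean) ** 2 for z in zs)
    return ratio, variance


# Made-up toy data: two strata with two PSUs each; y flags "poor", x = 1.
toy = [
    ("A", 1, 10.0, 1, 1), ("A", 1, 10.0, 0, 1),
    ("A", 2, 10.0, 0, 1), ("A", 2, 10.0, 0, 1),
    ("B", 1, 5.0, 1, 1), ("B", 1, 5.0, 1, 1),
    ("B", 2, 5.0, 0, 1), ("B", 2, 5.0, 1, 1),
]
rate, var_rate = ratio_variance_ultimate_cluster(toy)
std_error = var_rate ** 0.5
```

Because everything is reduced to one total per PSU, the same routine works whatever happens at later sampling stages, which is where the flexibility of the approach comes from.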
The method and the results
The described methodology has been applied at Eurostat to estimate variances and confidence intervals:
  for AROPE and its subcomponents, for 3 age groups (0-17, 18-64, 65+) and gender breakdowns;
  for net changes.
We used the SAS procedures SURVEYMEANS and SURVEYFREQ, which allow the survey design to be specified.
Depending on the characteristics and availability of data for each country, we used different variables to specify strata and cluster information.

The method and the results
We used SAS PROC SURVEYFREQ (linearization), adapting the strata and cluster parameters according to the following groups:
GROUP 1: BE, BG, CZ, IE, EL, ES, FR, IT, LV, HU, NL, PL, PT, RO, SI, UK, HR
  Strata=DB050, Cluster=DB060
GROUP 2: DE, EE, CY, LT, LU, AT, SK, FI, CH
  Strata=DB050, Cluster=DB030
GROUP 3: DK, IS, MT, NO, SE
  Cluster=DB030

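The grouping can be written down as a small lookup table, for instance to drive a loop over countries. The sketch below only transcribes the three groups from the slide; the dictionary layout and the helper name `design_variables` are hypothetical, and Group 3 is given no strata variable because none is listed for it.

```python
# Strata and cluster variables per country group, transcribed from the slide.
DESIGN_GROUPS = {
    "GROUP 1": {
        "countries": ["BE", "BG", "CZ", "IE", "EL", "ES", "FR", "IT", "LV",
                      "HU", "NL", "PL", "PT", "RO", "SI", "UK", "HR"],
        "strata": "DB050",
        "cluster": "DB060",
    },
    "GROUP 2": {
        "countries": ["DE", "EE", "CY", "LT", "LU", "AT", "SK", "FI", "CH"],
        "strata": "DB050",
        "cluster": "DB030",
    },
    "GROUP 3": {
        "countries": ["DK", "IS", "MT", "NO", "SE"],
        "strata": None,  # no strata variable listed for this group
        "cluster": "DB030",
    },
}


def design_variables(country: str):
    """Return the (strata, cluster) variable names to use for a country code."""
    for spec in DESIGN_GROUPS.values():
        if country in spec["countries"]:
            return spec["strata"], spec["cluster"]
    raise KeyError(f"no design group defined for {country!r}")
```

For example, `design_variables("BE")` gives `("DB050", "DB060")`, while `design_variables("DK")` gives `(None, "DB030")`.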
Results on AROPE
For 6 countries, the 95% confidence interval for AROPE is equal to or smaller than ±1.0% (CZ, IT, SI, DE, FI, SE).
For 11 countries, the 95% confidence interval for AROPE is between ±1% and ±1.5% (ES, HU, PL, UK, EE, AT, SK, CH, DK, IS, NO).
For 8 countries, the 95% confidence interval for AROPE is between ±1.5% and ±2% (BE, BG, EL, LV, NL, PT, CY, MT).
For 4 countries, the 95% confidence interval for AROPE is larger than ±2% (IE, RO, LT, HR).
Complete results are in the EU-SILC quality report.

Results, example

Member State   Indicator value   Standard error (%)   CI 95% lower bound   CI 95% upper bound
EU27           16.4              0.14                 16.08                16.64
BE             14.6              0.74                 13.13                16.06
BG             20.7              0.85                 19.03                22.35
CZ             9.0               0.44                 8.14                 9.86
IE             16.1              0.98                 14.13                17.98

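The bounds in such a table follow from the usual normal approximation, estimate ± 1.96 × standard error. The sketch below applies it to the CZ row; the function name is hypothetical. Other rows may differ slightly in the second decimal, presumably because the published bounds were computed from unrounded values.

```python
def confidence_interval(estimate: float, std_error: float, z: float = 1.96):
    """Normal-approximation confidence interval: estimate +/- z * SE."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


# CZ row of the table: indicator value 9.0, standard error 0.44.
lower, upper = confidence_interval(9.0, 0.44)
cz_bounds = (round(lower, 2), round(upper, 2))
```

Here `cz_bounds` evaluates to `(8.14, 9.86)`, matching the bounds shown for CZ.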
Measurement of net changes
Purpose: to measure the significance of the evolution of social indicators.
Example: when the at-risk-of-poverty or social exclusion rate for Cyprus goes from 22.9% in 2010 to 23.5% in 2011, are we able to say that this change is significant?
Exercise already done for: AROPE, AROPE (0-17), ARP, ARP (65+), SMD, VLWI, IWP, UMNC

Problem statement
$t = x/y$
Absolute change: $\Delta = t_2 - t_1$
Aim: estimation of the variance of the change
Major problem: temporal correlation between indicators
$\mathrm{Var}(\Delta) = \mathrm{Var}(t_1) + \mathrm{Var}(t_2) - 2\,\mathrm{corr}(t_1, t_2)\sqrt{\mathrm{Var}(t_1)\,\mathrm{Var}(t_2)}$
NET-SILC2: multivariate linear regression approach (Berger and Priam, Statistics Canada Symposium, 2010; SAS code developed by G. Osier)

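The role of the correlation term can be seen with a few lines of arithmetic. The sketch below evaluates the variance-of-change formula for given variances and correlation; the function name and the numbers are hypothetical, and in practice the correlation itself has to be estimated (which is what the regression approach is for).

```python
import math


def variance_of_change(var_t1: float, var_t2: float, corr: float) -> float:
    """Var(t2 - t1) = Var(t1) + Var(t2) - 2*corr*sqrt(Var(t1)*Var(t2))."""
    if not -1.0 <= corr <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    if var_t1 < 0 or var_t2 < 0:
        raise ValueError("variances must be non-negative")
    return var_t1 + var_t2 - 2.0 * corr * math.sqrt(var_t1 * var_t2)


# Made-up values: equal variances of 0.25 at both waves.
independent = variance_of_change(0.25, 0.25, corr=0.0)  # samples treated as independent
overlapping = variance_of_change(0.25, 0.25, corr=0.5)  # positive temporal correlation
```

With a positive correlation, as expected when part of the sample is common to both years, the variance of the change is smaller than the sum of the two variances, so ignoring the correlation would overstate the uncertainty.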
Algorithm used
1. Preparation of the data
2. Aggregation at PSU level (ultimate cluster approach)
3. SE estimation for the X estimator at T0
4. Steps 1-3 repeated for the X estimator at T1
5. SE estimation for changes in X estimators using the multivariate regression approach:
   Response variables: 4 totals
   Regressors:
   1. Stratification dummy variables
   2. Rotation variable at T0 (dummy variable which specifies which PSUs are observed at T0)
   3. Rotation variable at T1

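Step 2 and the rotation variables can be sketched as follows. This is only an illustration of the data layout, not the regression step itself: it assumes that the four totals are the weighted numerator and denominator of the ratio at T0 and at T1, and the function name, record layout and field names are hypothetical.

```python
from collections import defaultdict


def aggregate_to_psu(records):
    """Aggregate unit-level data to one row per PSU.

    records: iterable of (psu, wave, weight, x, y), with wave 0 for T0
    and wave 1 for T1. Each output row holds four weighted totals
    (x0, y0, x1, y1) and two rotation dummies saying whether the PSU
    was observed at T0 and at T1.
    """
    rows = defaultdict(lambda: {"x0": 0.0, "y0": 0.0, "x1": 0.0, "y1": 0.0,
                                "in_t0": 0, "in_t1": 0})
    for psu, wave, weight, x, y in records:
        if wave not in (0, 1):
            raise ValueError("wave must be 0 (T0) or 1 (T1)")
        row = rows[psu]
        row[f"x{wave}"] += weight * x
        row[f"y{wave}"] += weight * y
        row[f"in_t{wave}"] = 1
    return dict(rows)


# Made-up toy data: PSU p1 is observed at both waves, p2 only at T0.
psu_rows = aggregate_to_psu([
    ("p1", 0, 2.0, 1, 1),
    ("p1", 1, 2.0, 0, 1),
    ("p2", 0, 3.0, 1, 1),
])
```

Rows of this shape, together with stratum dummies, are the kind of input a multivariate regression of the totals on the design variables would take.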
Output

Country   AROPE (2010) %   AROPE (2011) %   Difference 2011-2010 (% points)   Standard error (% points)   Margin of error (% points) = 1.96*SE   Significance of change
BE        20.8             21               0.1                               0.076                       0.1                                    N
BG        41.6             49.1             7.5                               0.726                       1.4                                    Y
CY        22.9             23.5             0.5                               0.605                       1.2                                    N
DK        18.3             18.9             0.5                               0.448                       0.9                                    N

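The significance column can be reproduced from the difference and its standard error: a change is flagged when its absolute value exceeds the margin of error 1.96 × SE. The sketch below applies this rule to the four rows of the table; the function names are hypothetical, and the decision rule is the one implied by the column headers.

```python
def margin_of_error(std_error: float, z: float = 1.96) -> float:
    """Half-width of the 95% confidence interval of the change."""
    return z * std_error


def is_significant(difference: float, std_error: float, z: float = 1.96) -> bool:
    """True when the change is larger in absolute value than its margin of error."""
    return abs(difference) > margin_of_error(std_error, z)


# (country, difference 2011-2010, standard error), both in percentage points.
rows = [("BE", 0.1, 0.076), ("BG", 7.5, 0.726), ("CY", 0.5, 0.605), ("DK", 0.5, 0.448)]
flags = {country: is_significant(diff, se) for country, diff, se in rows}
```

Only BG is flagged as significant, in line with the Y/N column; for Cyprus the 0.5 point increase is well inside its margin of error of about 1.2 points.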
Conclusion and future plans
The methodology is relatively simple to apply.
It can be considered a good compromise between scientific soundness and feasibility under current constraints.
SILC-based indicators in the current implementation can be considered as having an overall acceptable accuracy.
The next steps are to improve these calculations further by asking Member States to provide the necessary information where it is missing.
Dissemination of further information to users is under investigation.