The Effect of Microaggregation Procedures on the Estimation of Linear Models: A Simulation Study


Der Einfluss von Mikroaggregationsverfahren auf die Schätzung linearer Modelle - eine Simulationsstudie

By Matthias Schmid and Hans Schneeweiss, Munich

Abstract

Microaggregation is a set of procedures that distort empirical data in order to guarantee the factual anonymity of the data. At the same time, the information content of the data sets should not be reduced too much, so that they remain useful for scientific research. This paper investigates the effect of microaggregation on the estimation of a linear regression by ordinary least squares. By way of an extensive simulation experiment, it studies the bias of the slope parameter estimator induced by various microaggregation techniques. Some microaggregation procedures lead to consistent estimates, while others imply an asymptotic bias for the estimator.

Zusammenfassung

Microaggregation techniques are used for the factual anonymization of continuous data. Since the protective effect of microaggregation rests on reducing the information content of data sets, a distortion of statistical analyses based on microaggregated data cannot be ruled out. This paper examines the effect of microaggregation on the least-squares estimation of a linear model. By means of an extensive simulation study, we analyze the bias in the estimate of the slope parameter that results from different variants of microaggregation. Some of the microaggregation procedures considered guarantee consistent estimation of the slope parameter, while others lead to an asymptotic bias of the naive estimator.

JEL: C13, C15, C21, C81
Keywords: microaggregation, disclosure control, simple linear model, bias, consistency

1 Introduction

A problem statistical offices are increasingly faced with is providing sufficient information to scientists while at the same time maintaining the confidentiality required by data protection laws. To handle this trade-off (commonly referred to as the statistical disclosure control problem), the information content of released microdata sets is often reduced by means of masking procedures. However, while reducing the disclosure risk of a data file, masking procedures also affect the results of statistical analyses.

One of the most promising masking techniques is microaggregation (Defays/Anwar 1998, Domingo-Ferrer/Mateo-Sanz 2002), a procedure for continuous data which has been widely discussed in recent years. The main idea behind microaggregation is to group the observations in a data set and replace the original data values with their corresponding group means. To reduce the information loss imposed by microaggregation, it is considered advisable to group only those data values which are similar in terms of a similarity criterion. The various types of microaggregation techniques mainly differ in the similarity criterion used to form the groups.

While the disclosure risk of anonymized data sets has been the subject of intensive research in recent years (Elliot 2001, Willenborg/de Waal 2001, Yancey et al. 2002), the impact of microaggregation on statistical analyses is still widely unexplored. Empirical studies based on the analysis of selected data sets include Mateo-Sanz/Domingo-Ferrer (1998), Domingo-Ferrer/Mateo-Sanz (2001), Domingo-Ferrer/Torra (2001), and Rosemann (2004). However, while providing important insight into the effects of microaggregation on statistical analyses, the results of these studies may depend on various (unknown) characteristics of the data sets and on the uncertainty about whether the statistical models are correctly specified.

In this paper, we instead focus on simulated data sets. Thus, we are able to control the data structure and the model specification and can concentrate on the microaggregation effect alone. The effect of microaggregation on statistical analyses depends on the type of analysis carried out by the researcher. It would therefore be necessary to study all forms of models considered for statistical analysis. In the following, we restrict our investigation to the estimation of a simple linear regression model. This is done by means of a systematic simulation study. Our main interest is in the potential bias of the naive estimator of the slope parameter. It turns out that all aggregation methods considered in this paper lead to a bias, at least for small sample sizes. In some cases the bias persists asymptotically, while in other cases it decreases to zero with growing sample size, thus giving rise to consistent estimates.

In Section 2 we start with a description of the various microaggregation procedures used for masking data. Section 3 contains a systematic simulation study of the effects of microaggregation on the estimation of a simple linear model. Moreover, we consider in detail the microaggregation techniques which induce an asymptotic bias in the estimated slope parameter. In Section 4, we discuss a very simple situation involving a discrete error structure; this illustrates how the bias is generated. In Section 5, we outline the effects of microaggregation on the estimation of a multiple linear regression model. Section 6 contains a concluding summary.

2 Microaggregation Techniques

As stated in the introduction, there are various types of microaggregation methods which mainly differ in the similarity criterion used for grouping the data. For our simulation study, we chose five of the most commonly applied microaggregation techniques, namely:

1. Microaggregation using a leading variable (Paass/Wauschkuhn 1985): The data values are sorted with respect to one variable in the data set (the so-called leading variable). Groups are then formed by data records having similar values for the leading variable. The group size (or aggregation level) is kept fixed for every group. In a simple linear model, the leading variable can be either the regressor or the dependent variable. Feige/Watts (1972) have shown that if the regressor is used as the leading variable, linear model estimates based on the aggregated data are unbiased. Concerning microaggregation using the dependent variable as the leading variable, Feige/Watts (1972) hint at the possibility that estimates might show an aggregation bias. However, little is known about this bias. Therefore, in the following sections, we restrict ourselves to the case where the dependent variable is the leading variable.

2. Individual ranking (Defays/Anwar 1998): Each variable is microaggregated separately. First, the data set is sorted by the first variable, and the values of this variable are microaggregated. Then the same procedure is repeated for the second variable, and so on. Again, the group size is kept fixed.

3. Microaggregation using principal component analysis: Multivariate data are first projected onto the first principal axis. The projected values then serve as a leading variable as described in 1. The group size is kept fixed.

4. Multivariate microaggregation based on Euclidean distances: This type of microaggregation uses the Euclidean distance to determine the similarity of data records. Before microaggregation, all variables are standardized. Again, a fixed group size is used. For details on the algorithm we refer to Domingo-Ferrer/Mateo-Sanz (2002).

5. Multivariate microaggregation based on Ward hierarchical clustering (Domingo-Ferrer/Mateo-Sanz 2002): A cluster analysis with a minimum group size is performed to aggregate the data. Under the constraint that each group must consist of at least k data values, groups are formed by minimizing the within-group variance. The group size is allowed to vary over the groups. Again, we refer to Domingo-Ferrer/Mateo-Sanz (2002) for an exact description of the algorithm.
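Techniques 1-3 above share a common core: order the records by some one-dimensional score, cut the ordered records into consecutive groups of size A, and replace the original values by the group means. The following Python sketch is our own simplified illustration of that core for the leading-variable and individual-ranking variants (function names such as microaggregate_lv are ours); it is not the implementation used in the study.

```python
import numpy as np

def group_means(values, order, A):
    """Replace values by the mean of consecutive groups of size A,
    where the grouping follows the given sort order."""
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    for start in range(0, len(order), A):
        idx = order[start:start + A]
        out[idx] = values[idx].mean()
    return out

def microaggregate_lv(data, leading, A=3):
    """Leading-variable microaggregation: sort all records by the column
    `leading` and average every column within groups of size A."""
    order = np.argsort(data[:, leading], kind="stable")
    return np.column_stack([group_means(data[:, j], order, A)
                            for j in range(data.shape[1])])

def microaggregate_ir(data, A=3):
    """Individual ranking: every column is sorted and aggregated on its
    own, so the grouping generally differs from column to column."""
    return np.column_stack([group_means(data[:, j],
                                        np.argsort(data[:, j], kind="stable"), A)
                            for j in range(data.shape[1])])
```

The Euclidean-distance and Ward-based schemes (4 and 5) replace the one-dimensional sort by a multivariate grouping step and are not reproduced here; see Domingo-Ferrer/Mateo-Sanz (2002) for the exact algorithms.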

3 Simulations

We consider the simple linear model

Y = α + βX + ε.   (1)

Y denotes the continuous response (or endogenous variable), while X denotes the continuous covariate (or exogenous variable). γ := (α, β) is the corresponding parameter vector. The random error ε is independent of X. Moreover, ε is assumed to have zero mean and constant variance σ_ε².

We applied the five microaggregation techniques described in Section 2 to data sets simulated from model (1). We then estimated a simple linear model from the aggregated data (so-called naive estimation). Table 1 shows the averages of the naive estimates β̂ over 1000 replications for selected values of β (β = 0, 0.5, 1, 5) and various sample sizes (n = 50, 100, 200, 400). For symmetry reasons we do not consider negative values of β. The residual variance σ_ε² was set equal to 0.5. The values of the covariate X were drawn from a standard normal distribution. For microaggregation using Y as a leading variable (LV), microaggregation using individual ranking (IR), microaggregation using principal component analysis (PCA), and microaggregation using Euclidean distances (Eucl), the group size was set equal to three. For microaggregation based on Ward hierarchical clustering (kw), the minimum group size was set equal to three as well.

                 n=50    n=100   n=200   n=400
β = 0    Eucl   -0.001   0.002  -0.001  -0.001
         IR      0.001   0.002   0.000   0.000
         LV     -0.006   0.002   0.000   0.000
         PCA    -0.002   0.004   0.003  -0.007
         kw      0.001   0.001   0.002   0.001
β = 0.5  Eucl    0.529   0.513   0.510   0.504
         IR      0.492   0.496   0.499   0.499
         LV      0.769   0.762   0.753   0.751
         PCA     0.614   0.611   0.611   0.609
         kw      0.521   0.511   0.507   0.503
β = 1    Eucl    1.029   1.018   1.008   1.005
         IR      0.988   0.997   0.998   0.998
         LV      1.163   1.158   1.155   1.155
         PCA     1.084   1.085   1.082   1.082
         kw      1.036   1.021   1.012   1.006
β = 5    Eucl    5.016   5.011   5.009   5.004
         IR      4.972   4.990   4.995   4.998
         LV      5.031   5.035   5.034   5.033
         PCA     5.030   5.027   5.028   5.028
         kw      5.036   5.034   5.028   5.017

Table 1: Naive estimates of β for various sample sizes (σ_ε = 0.5)
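The simulation design just described is easy to restate in code. The following Python sketch is a minimal reimplementation under our own simplifications (helper names such as lv_aggregate are ours, the random seed is arbitrary, and we take σ_ε = 0.5 as in the caption of Table 1); it is not the authors' original code, so the average it produces will match Table 1 only approximately.

```python
import numpy as np

rng = np.random.default_rng(0)

def lv_aggregate(x, y, A=3):
    """Microaggregation with Y as the leading variable: sort the records
    by y, form consecutive groups of size A, and replace both x and y by
    their within-group means."""
    order = np.argsort(y, kind="stable")
    xa, ya = x.astype(float), y.astype(float)
    for s in range(0, len(y), A):
        idx = order[s:s + A]
        xa[idx], ya[idx] = x[idx].mean(), y[idx].mean()
    return xa, ya

alpha, beta, sigma_eps, n, reps = 0.0, 0.5, 0.5, 400, 1000
naive = []
for _ in range(reps):
    x = rng.standard_normal(n)                        # X ~ N(0, 1)
    y = alpha + beta * x + sigma_eps * rng.standard_normal(n)
    xa, ya = lv_aggregate(x, y, A=3)
    naive.append(np.polyfit(xa, ya, 1)[0])            # naive OLS slope

print(np.mean(naive))   # clearly above the true beta = 0.5 (cf. the LV rows of Table 1)
```

The other schemes of Section 2 can be plugged in at the same place to mimic the remaining rows of the tables.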

From Table 1 we see that if n is small and β > 0, all estimates show a bias. Interestingly, for LV, PCA, Eucl, and kw the bias is positive, meaning that the effect of X on the dependent variable Y is overestimated. If IR is used to aggregate the data, the slope parameter is underestimated. The only exception is the case β = 0: here, for all values of n, the naive estimates are unbiased. For β > 0 we see that if IR, Eucl, or kw is used for microaggregation, the bias decreases as n increases and apparently goes to zero as n → ∞. Thus, IR, Eucl, and kw seem to lead to consistent naive estimates of the slope parameter. In contrast, estimates show an asymptotic bias if LV and PCA are used for aggregation. This bias depends on the value of β. Comparing LV to PCA, we see that PCA induces a smaller bias than LV, at least for the values of β chosen, but see below.

Tables 2 and 3 show what happens if the residual variance σ_ε² is increased. Apparently, the bias of the estimates becomes larger as σ_ε² gets larger. In addition, convergence to the true value of β slows down if IR, Eucl, or kw is applied. Interestingly, if σ_ε = 1.5 and β = 0.5, the biases of the naive estimates based on LV and PCA are almost equal; in this case, PCA does not perform better than LV. At the end of this section, we will consider in detail the effect of LV and PCA on the naive estimate of β.

                 n=50    n=100   n=200   n=400
β = 0    Eucl    0.004  -0.002   0.002   0.002
         IR     -0.002  -0.001   0.002  -0.002
         LV     -0.026  -0.015  -0.002   0.000
         PCA    -0.026  -0.005  -0.015   0.026
         kw      0.003  -0.001   0.000  -0.003
β = 0.5  Eucl    0.535   0.519   0.512   0.507
         IR      0.494   0.493   0.499   0.500
         LV      1.120   1.083   1.081   1.076
         PCA     0.911   0.909   0.896   0.890
         kw      0.555   0.525   0.513   0.507
β = 1    Eucl    1.044   1.034   1.019   1.009
         IR      0.985   0.997   0.999   1.002
         LV      1.545   1.515   1.515   1.508
         PCA     1.330   1.315   1.309   1.310
         kw      1.084   1.047   1.027   1.014
β = 5    Eucl    5.042   5.027   5.021   5.010
         IR      4.952   4.986   4.987   4.995
         LV      5.136   5.132   5.136   5.130
         PCA     5.115   5.112   5.110   5.112
         kw      5.135   5.107   5.075   5.004

Table 2: Naive estimates of β for various sample sizes (σ_ε = 1)

                 n=50    n=100   n=200   n=400
β = 0    Eucl    0.013  -0.005   0.000   0.000
         IR      0.000   0.005  -0.004   0.001
         LV     -0.008  -0.007   0.012   0.013
         PCA     0.050  -0.074   0.001   0.071
         kw     -0.003   0.000   0.005  -0.001
β = 0.5  Eucl    0.537   0.521   0.511   0.510
         IR      0.491   0.490   0.498   0.497
         LV      1.282   1.254   1.262   1.260
         PCA     1.273   1.273   1.263   1.252
         kw      0.576   0.539   0.519   0.511
β = 1    Eucl    1.050   1.037   1.017   1.012
         IR      0.981   0.991   1.002   0.993
         LV      1.917   1.881   1.874   1.867
         PCA     1.683   1.650   1.638   1.634
         kw      1.146   1.079   1.041   1.025
β = 5    Eucl    5.087   5.052   5.033   5.009
         IR      4.947   4.970   4.992   4.997
         LV      5.302   5.305   5.293   5.294
         PCA     5.269   5.253   5.247   5.245
         kw      5.276   5.198   5.129   5.074

Table 3: Naive estimates of β for various sample sizes (σ_ε = 1.5)

                 A=3     A=6     A=9     A=12
β = 0    Eucl    0.002   0.003  -0.002   0.002
         IR      0.002   0.001   0.001   0.000
         LV     -0.002   0.007   0.010  -0.040
         PCA    -0.015  -0.014   0.000  -0.043
         kw      0.000  -0.001  -0.002   0.003
β = 0.5  Eucl    0.512   0.509   0.511   0.510
         IR      0.499   0.496   0.492   0.494
         LV      1.081   1.536   1.771   1.936
         PCA     0.896   1.024   1.075   1.098
         kw      0.513   0.538   0.562   0.588
β = 1    Eucl    1.019   1.018   1.018   1.015
         IR      0.999   0.997   0.993   0.981
         LV      1.515   1.731   1.822   1.866
         PCA     1.309   1.410   1.436   1.461
         kw      1.027   1.065   1.109   1.137
β = 5    Eucl    5.021   5.020   5.022   5.022
         IR      4.987   4.981   4.970   4.960
         LV      5.136   5.168   5.179   5.183
         PCA     5.110   5.141   5.151   5.153
         kw      5.075   5.145   5.173   5.181

Table 4: Naive estimates of β for various group sizes (σ_ε = 1, n = 200)

Table 4 shows the naive estimates of β for various group sizes (here, σ_ε was set equal to one and n was set equal to 200). The group size is denoted by A. We see that as A increases and β > 0, the bias of the naive estimate increases in most cases, particularly if LV or PCA is used for microaggregation. The only exception is the case where Eucl is used to aggregate the data: here, the group size does not seem to have any effect on the bias of β̂. It should be mentioned that the bias of β̂ induced by kw is not directly comparable to the bias induced by the other microaggregation methods. This is because kw allows the group sizes to vary, implying that groups containing more than three records are possible. The average group size of kw estimated from the simulated data sets was about 3.70.

Let us now have a closer look at the nature of the bias of β̂ if LV or PCA is used to aggregate the data. We estimated the bias of β̂ for β = 0, 0.005, ..., 0.495, 0.5, 0.75, ..., 2.75, 3 and the corresponding negative values. The sample size n was set equal to 400, the group size was set equal to three. As before, the values of X were drawn from a standard normal distribution. The results are shown in Fig. 1 (here, σ_ε was set equal to 0.5). Again, we see that if β > 0, β is overestimated by β̂. Obviously, if β is close to zero, PCA induces a larger bias than LV does; for large values of β, PCA performs better. We also see that if β = 0, estimates are unbiased. As β → ∞, bias(β̂) goes to zero. This is a plausible result, because using Y as the leading variable is approximately the same as using X as the leading variable when β is large and σ_ε² is kept constant (in fact, the correlation between X and Y is close to ±1 in this case). In the same way, if X and Y are highly correlated, using PCA is almost the same as using X as the leading variable. Now, if X is used as the leading variable, estimates are unbiased (see Feige/Watts 1972), and thus bias(β̂) should be close to zero if Y is the leading variable and β is large.

Figure 1: Bias of β̂ if LV (solid line) or PCA (dashed line) is used for microaggregation (σ_ε = 0.5)

Fig. 2 shows what happens when σ_ε is increased: as σ_ε increases, the bias increases, and so does the difference between the two bias curves. Interestingly, Fig. 2 shows that bias(β̂) is a non-monotonic function of β. In the next section, we give an explanation of this phenomenon.

Figure 2: Bias of β̂ if LV (solid line) or PCA (dashed line) is used for microaggregation (panels for σ_ε = 0.5, 1, 1.5, 2)

4 Microaggregation using Y as the leading variable: Analysis of a Linear Model with Discrete Errors

We have seen in the previous section that the bias of the naive estimate β̂ is a non-monotonic function of β if LV or PCA is applied to aggregate the data. For β > 0, as β increases, the bias first increases from zero up to a maximum value and then decreases again to zero. To get an idea of why bias(β̂) behaves like this, we study a very simple linear model involving a discrete error structure and a discrete regressor X. As the bias curves based on LV and PCA have the same shape, we only consider microaggregation using Y as the leading variable.

Let the vector x be given by (1, ..., 8, 1, ..., 8) and consider the deterministic vector of residuals e = (0.5, ..., 0.5, -0.5, ..., -0.5). Assuming α to be zero, the response vector y becomes

y = (β·1 + 0.5, ..., β·8 + 0.5, β·1 - 0.5, ..., β·8 - 0.5).   (2)

Fig. 3 shows the resulting plot of y vs. x for β = 1.

Figure 3: Plot of y vs. x (β = 1)
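This construction is easy to reproduce numerically. The following Python sketch is our own illustration (the helper lv_aggregate is an illustrative name, and we aggregate with respect to Y at level A = 2, as in the discussion that follows): it builds x, e, and y for a given β, aggregates the data, and reports the bias of the naive least-squares slope.

```python
import numpy as np

def lv_aggregate(x, y, A=2):
    """Sort the records by y, form consecutive groups of size A, and
    replace x and y by their within-group means."""
    order = np.argsort(y, kind="stable")
    xa, ya = x.astype(float), y.astype(float)
    for s in range(0, len(y), A):
        idx = order[s:s + A]
        xa[idx], ya[idx] = x[idx].mean(), y[idx].mean()
    return xa, ya

x = np.tile(np.arange(1, 9), 2)            # (1, ..., 8, 1, ..., 8)
e = np.repeat([0.5, -0.5], 8)              # deterministic residuals
for beta in (0.05, 0.18, 0.25, 0.5, 1.5):  # the values considered in Figs. 4-6 below
    y = beta * x + e                       # alpha = 0
    xa, ya = lv_aggregate(x, y, A=2)
    print(beta, round(np.polyfit(xa, ya, 1)[0] - beta, 3))   # bias of the naive slope
```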

Let us now see what happens if the data are aggregated with respect to Y (in the following, we use an aggregation level of A = 2). Figs. 4-6 show the effects of the aggregation step by step for five values of β (β = 0.05, 0.18, 0.25, 0.5, and 1.5). The filled-in dots represent the aggregated data values. The aggregated data vectors are denoted by ỹ and x̃.

As long as β is close to zero, aggregating the data set with respect to Y is the same as aggregating the points lying below the true regression line and the points lying above the true regression line separately. As the order of (x_1, ..., x_8) and (x_9, ..., x_16) is the same as the order of (y_1, ..., y_8) and (y_9, ..., y_16), respectively, the least squares estimate β̂ is unbiased, and the estimated regression line based on the aggregated data values is the same as the true regression line (Fig. 4).

Figure 4: Plot of y vs. x and ỹ vs. x̃ for β = 0.05 (dotted line = true regression)

Fig. 5 (β = 0.18) shows a different picture: as β increases, the middle points (1, β·1 + 0.5), (7, β·7 - 0.5) and (2, β·2 + 0.5), (8, β·8 - 0.5) are grouped, forcing the corresponding aggregated data values to move in the direction of x̄ := (1/n) Σ_{i=1}^n x_i. It is well known from linear model theory that extreme data values situated far from x̄ (in this case, (1.5, ((β·1 - 0.5) + (β·2 - 0.5))/2) and (7.5, ((β·7 + 0.5) + (β·8 + 0.5))/2)) have a big influence on the slope of the estimated regression line. This is why β̂ in Fig. 5 has a positive bias.

Figure 5: Plot of y vs. x and ỹ vs. x̃ for β = 0.18 and β = 0.25 (dotted line = true regression)

This effect becomes even stronger if β continues to increase (Fig. 5, β = 0.25): two more aggregated data values move in the direction of x̄, causing the bias of β̂ to increase even more. However, as the true regression line becomes steeper, the number of points lying exactly on the true regression line increases, too (two points if β = 0.18, four points if β = 0.25). This has an adverse effect on the estimate of the slope: the bias of β̂ begins to decline as more and more aggregated data values lie on the true regression line (Fig. 6, β = 0.5). Finally, as β goes to infinity, all aggregated data values lie on the true regression line, and β̂ equals the true β again. This can be seen from Fig. 6 (β = 1.5), where the two regression lines have become identical, just as in Fig. 4.

Thus we can conclude that the bias of β̂ is zero as long as β is close to zero. As the values of β increase, bias(β̂) becomes positive at first; as β → ∞, bias(β̂) declines and becomes zero again. Fig. 7 illustrates this result. It is also clear from Figs. 4-7 that for negative values of β, bias(β̂) becomes negative at first and, as β → -∞, becomes zero again.

Figure 6: Plot of y vs. x and ỹ vs. x̃ for β = 0.5 and β = 1.5 (dotted line = true regression)

Figure 7: Plot of bias(β̂) vs. β, x = (1, ..., 8, 1, ..., 8)

Fig. 8 shows what happens if the sample size n is increased (here, the first half of x is (1, 1.02, 1.04, 1.06, ..., 24)): the bias of the least squares estimate β̂ is almost a smooth function of β.

Figure 8: Plot of bias(β̂) vs. β, first half of x equals (1, 1.02, 1.04, 1.06, ..., 24)

Finally, we modify the above model by replacing the deterministic residuals with a simple stochastic error structure. Let the vector x be (1, 2, 3, 5). The residuals ε_1, ..., ε_4 are now assumed to take on the values +0.5 or -0.5, each with probability 1/2. As there are 2^4 = 16 possible values for the vector e = (ε_1, ..., ε_4), the mean of β̂ can be computed by averaging the 16 least squares estimates for each value of β. The resulting bias curve, shown in Fig. 9, is very similar to Fig. 7, and the conclusions concerning the deterministic-error model can be applied to the above stochastic-error model as well.
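Because the error vector can take only 2^4 = 16 values, the exact mean of β̂ is a finite average that can be enumerated directly. The sketch below is our own illustration of this computation, assuming (as in the deterministic example above) aggregation with respect to Y at level A = 2; the helper lv_aggregate is an illustrative name, not code from the paper.

```python
import numpy as np
from itertools import product

def lv_aggregate(x, y, A=2):
    """Sort the records by y, form consecutive groups of size A, and
    replace x and y by their within-group means."""
    order = np.argsort(y, kind="stable")
    xa, ya = x.astype(float), y.astype(float)
    for s in range(0, len(y), A):
        idx = order[s:s + A]
        xa[idx], ya[idx] = x[idx].mean(), y[idx].mean()
    return xa, ya

x = np.array([1.0, 2.0, 3.0, 5.0])
for beta in np.arange(0.0, 2.01, 0.25):
    slopes = []
    for signs in product([0.5, -0.5], repeat=4):    # all 2**4 = 16 error vectors
        y = beta * x + np.array(signs)
        xa, ya = lv_aggregate(x, y, A=2)
        slopes.append(np.polyfit(xa, ya, 1)[0])
    print(round(float(beta), 2), round(np.mean(slopes) - beta, 3))  # bias of the naive slope
```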

Figure 9: Plot of bias(β̂) vs. β, x = (1, 2, 3, 5)

5 Estimating a Linear Model with Two Covariates

In this section we outline the effects of microaggregation on the estimation of a multiple linear regression model. We therefore expand model (1) by adding an additional covariate X_2:

Y = α + β_1 X_1 + β_2 X_2 + ε.   (3)

Again, we study the effects of the five microaggregation techniques described in Section 2 by means of a systematic simulation study. The values of the covariates X_1 and X_2 were independently drawn from a standard normal distribution, and the number of replications was 1000. The residual variance σ_ε² was set equal to one. We estimated model (3) for various sample sizes, setting β_1 = 1, β_2 = -0.5, and A = 3. Tables 5 and 6 show the results obtained.
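To mimic this two-covariate experiment in code, the following sketch (ours; helper names and the seed are illustrative, and only the LV and PCA variants are shown) stacks (Y, X_1, X_2), orders the records either by Y or by their scores on the first principal axis, and refits model (3) on the aggregated data by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

def microaggregate(data, score, A=3):
    """Average every column of `data` within consecutive groups of size A,
    where the records are ordered by the given score vector."""
    order = np.argsort(score, kind="stable")
    out = data.astype(float)
    for s in range(0, len(score), A):
        idx = order[s:s + A]
        out[idx] = data[idx].mean(axis=0)
    return out

n, beta1, beta2, sigma_eps = 400, 1.0, -0.5, 1.0
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
y = beta1 * x1 + beta2 * x2 + sigma_eps * rng.standard_normal(n)
data = np.column_stack([y, x1, x2])

centred = data - data.mean(axis=0)
pc1 = centred @ np.linalg.svd(centred, full_matrices=False)[2][0]   # scores on the first principal axis

for name, score in [("LV", data[:, 0]), ("PCA", pc1)]:
    agg = microaggregate(data, score, A=3)
    X = np.column_stack([np.ones(n), agg[:, 1:]])          # intercept, aggregated X1 and X2
    coef = np.linalg.lstsq(X, agg[:, 0], rcond=None)[0]
    print(name, coef[1:].round(3))                          # naive estimates of (beta1, beta2)
```

Averaged over many replications, as in Section 3, the printed estimates should behave like the n = 400 column of Tables 5 and 6; a single draw only fluctuates around those values.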

                   n=50    n=100   n=200   n=400
β_1 = 1    Eucl    1.094   1.068   1.046   1.030
           IR      0.989   0.986   1.001   1.001
           LV      1.439   1.443   1.425   1.422
           PCA     1.264   1.258   1.265   1.261
           kw      1.149   1.107   1.063   1.044

Table 5: Naive estimates of β_1 for various sample sizes (σ_ε = 1)

                    n=50    n=100   n=200   n=400
β_2 = -0.5  Eucl   -0.569  -0.544  -0.523  -0.520
            IR     -0.495  -0.504  -0.500  -0.501
            LV     -0.725  -0.710  -0.720  -0.713
            PCA    -0.617  -0.636  -0.624  -0.629
            kw     -0.592  -0.547  -0.530  -0.520

Table 6: Naive estimates of β_2 for various sample sizes (σ_ε = 1)

Apparently, the results of Section 3 (concerning the estimation of the simple linear model (1)) can be applied to the estimation of the multiple linear regression model (3) as well. The naive estimates β̂_1 and β̂_2 based on Eucl, IR, and kw are biased if the sample size n is small. However, Eucl, IR, and kw lead to consistent estimates of the slope parameters β_1 and β_2. In contrast, β̂_1 and β̂_2 are asymptotically biased if LV and PCA are used to aggregate the data.

To explore how the asymptotic bias induced by LV and PCA depends on the regression parameter values, we estimated β_1 and β_2 for β_1 = 0, 0.005, ..., 0.495, 0.5, 0.75, ..., 2.75, 3, β_2 = 0, 0.005, ..., 0.495, 0.5, 0.75, ..., 2.75, 3, and the corresponding negative values. As in Section 3, the sample size n was set equal to 400 and the group size A was set equal to three. The values of X_1 and X_2 were drawn independently from a standard normal distribution. The results are shown in Figs. 10 (LV) and 11 (PCA). (Note that for symmetry reasons, bias(β̂_2) is exactly the same as bias(β̂_1); this is why we do not include the graphs showing bias(β̂_2).) We see that, just as in Figs. 1 and 2, bias(β̂_1) is positive if β_1 is positive and negative if β_1 is negative. Moreover, if one of the parameters β_1 and β_2 is large in absolute value, bias(β̂_1) and bias(β̂_2) both go to zero.

This result can be explained by the fact that if either β_1 or β_2 is large in absolute value, LV and PCA are almost the same as microaggregation using X_1 or X_2 as the leading variable. In this case, the naive estimates β̂_1 and β̂_2 are unbiased (see Feige/Watts 1972).

Figure 10: Plot of bias(β̂_1) vs. β_1 and β_2 if LV is used for microaggregation

Figure 11: Plot of bias(β̂_1) vs. β_1 and β_2 if PCA is used for microaggregation

6 Conclusion

We have analyzed the effects of microaggregation techniques on the estimation of a linear model, one of the most frequently encountered statistical estimation problems. By means of a simulation study, we have investigated the behavior of the naive least-squares estimator of the slope parameter in a simple linear model based on microaggregated data. The main results are:

1. All five microaggregation techniques considered in this paper induce a bias in the naive estimator of the slope parameter β in model (1). If IR is used to aggregate the data and β > 0, the bias is negative, whereas if LV, PCA, Eucl, or kw is used for aggregation and β > 0, the bias is positive. For symmetry reasons, if β < 0, the bias is positive for IR and negative for LV, PCA, Eucl, and kw. If β = 0, β̂ is unbiased despite microaggregation.

2. Concerning the asymptotic behavior of the naive least-squares estimator, the simulation results show that for IR, Eucl, and kw the bias disappears for large sample sizes. Thus, IR, Eucl, and kw lead to consistent estimates of β.

3. If LV and PCA are used to mask the data, estimates are asymptotically biased. The asymptotic bias is a non-monotonic function of β and converges to 0 as |β| → ∞. For small values of β, the estimates based on PCA show a larger bias than those based on LV; if β is large, PCA induces a smaller bias than LV does.

4. The bias of the estimates becomes larger if the residual variance σ_ε² is increased. In addition, if IR, Eucl, or kw is used for microaggregation, convergence to the true value of β slows down. The same effect can be seen when the group size A is increased (except for the estimates based on Eucl, which do not seem to depend on the group size).

5. The above results also hold in a multiple linear regression model with two covariates. In addition, if either one of the two slope parameters β_1 or β_2 is large in absolute value, the bias of the naive estimates is close to zero.

We thus see that, in terms of consistency, Eucl and kw are reasonable alternatives to IR, which is considered to be less protective than the other microaggregation techniques analyzed in this paper (see Winkler 2002). However, Eucl and kw are clearly the most computationally demanding microaggregation procedures. Concerning LV and PCA, we have seen that estimates are asymptotically biased. An obvious solution to this problem would be to develop estimators that correct for the biases induced by LV and PCA. This can be done if the bias has been evaluated analytically. For LV this has been achieved, see Schmid et al. (2005).

Acknowledgements

We gratefully acknowledge financial support from the Deutsche Forschungsgemeinschaft (German Science Foundation).

References

Defays, D., M.N. Anwar (1998), Masking Microdata Using Micro-Aggregation. Journal of Official Statistics, Vol. 14, No. 4, pp. 449-461.

Domingo-Ferrer, J., J.M. Mateo-Sanz (2001), An Empirical Comparison of SDC Methods for Continuous Microdata in Terms of Information Loss and Re-Identification Risk. Second Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Skopje, Macedonia, March 14-16, 2001.

Domingo-Ferrer, J., J.M. Mateo-Sanz (2002), Practical Data-Oriented Microaggregation for Statistical Disclosure Control. IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 1, pp. 189-201.

Domingo-Ferrer, J., V. Torra (2001), A Quantitative Comparison of Disclosure Control Methods for Microdata. In: P. Doyle, J. Lane, J. Theeuwes, L. Zayatz (eds.), Confidentiality, Disclosure, and Data Access, North-Holland, Amsterdam, pp. 111-133.

Elliot, M. (2001), Disclosure Risk Assessment. In: P. Doyle, J. Lane, J. Theeuwes, L. Zayatz (eds.), Confidentiality, Disclosure, and Data Access, North-Holland, Amsterdam, pp. 75-90.

Feige, E.L., H.W. Watts (1972), An Investigation of the Consequences of Partial Aggregation of Micro-Economic Data. Econometrica, Vol. 40, No. 2, pp. 343-360.

Mateo-Sanz, J.M., J. Domingo-Ferrer (1998), A Comparative Study of Microaggregation Methods. Qüestiió, Vol. 22, No. 3, pp. 511-526.

Paass, G., U. Wauschkuhn (1985), Datenzugang, Datenschutz und Anonymisierung - Analysepotential und Identifizierbarkeit von anonymisierten Individualdaten. Berichte der Gesellschaft für Mathematik und Datenverarbeitung, Nr. 148, Oldenbourg, München.

Rosemann, M. (2004), Auswirkungen unterschiedlicher Varianten der Mikroaggregation auf die Ergebnisse linearer und nicht linearer Schätzungen. Contribution to the Workshop "Econometric Analysis of Anonymized Firm Data", Tübingen, Institut für Angewandte Wirtschaftsforschung, March 18-19, 2004.

Schmid, M., H. Schneeweiss, H. Küchenhoff (2005), Consistent Estimation of a Simple Linear Model Under Microaggregation. Discussion Paper 415, SFB 386, Institut für Statistik, Ludwig-Maximilians-Universität München.

Willenborg, L., T. de Waal (2001), Elements of Statistical Disclosure Control. Springer, New York.

Winkler, W.E. (2002), Single-Ranking Micro-aggregation and Re-identification. Statistical Research Division Report RR 2002/08, U.S. Bureau of the Census, Washington.

Yancey, W.E., W.E. Winkler, R.H. Creecy (2002), Disclosure Risk Assessment in Perturbative Microdata Protection. In: J. Domingo-Ferrer (ed.), Inference Control in Statistical Databases, Springer, New York, pp. 135-152.

Dipl.-Stat. Matthias Schmid
University of Munich, Department of Statistics
Ludwigstr. 33, D-80539 Munich
Tel.: +49 89 2180 3165, Fax: +49 89 2180 5308
Email: matthias.schmid@stat.uni-muenchen.de

Prof. Dr. Hans Schneeweiss
University of Munich, Department of Statistics
Akademiestr. 1, D-80799 Munich
Tel.: +49 89 2180 5386, Fax: +49 89 2180 5044
Email: hans.schneeweiss@stat.uni-muenchen.de