Eric Shou Stat 598B / CSE 598D METHODS FOR MICRODATA PROTECTION
INTRODUCTION
Statistical disclosure control is part of the preparations for disseminating microdata. Data perturbation techniques fall into two groups:
- Methods that assure anonymity during the interview (e.g., randomized response)
- Methods that are part of the editing process (e.g., resampling, suppression (blanking), imputation, data swapping, noise addition)
The methods differ in the level of protection they provide and in the usefulness of the resulting data.
OUTLINE
- Blanking: description of method; problems
- Noise addition: three methods; problems
- SIMEX: explanation of method
- Combination of blanking and noise addition: description of method; Monte Carlo experiment
BLANKING
Previous uses:
- Cells suppressed because they would lead to identity disclosure if released, given external information (e.g., Bill Gates's income).
- Low counts in contingency tables and tabular data.
- k-anonymity: if a quasi-identifier combination does not occur at least k times, it is suppressed.
BLANKING
Protection method:
1. Create the blanked data set by removing observations lying outside a critical quantile range.
2. Compute the corresponding conditional probabilities.
3. Provide the researcher with the blanked data set and the conditional probabilities.
BLANKING
Conditional probability: Y_i is the value of variable Y at observation i, and D_i indicates whether observation i is retained:
D_i = 1 if q_θl ≤ Y_i ≤ q_θu, and D_i = 0 otherwise.
P(D_i = 1 | Y_i) is then the probability that, given its value Y_i, observation i is included in the released data set.
Conditional probability: P(D_i = 1 | Y_i)
EXAMPLE
Percent Body Fat   Weight   Height
12.3               154.25   67.75
6.1                173.25   72.25
25.3               154.00   66.25
10.4               184.75   72.25
28.7               184.25   71.25
20.9               210.25   74.75
19.2               181.00   69.75
12.4               176.00   72.50
4.1                191.00   74.00
11.7               198.25   73.50
EXAMPLE
Observations between the 10th and 90th percentiles of each variable are kept in the data set; values outside that range are blanked (shown as -):
Percent Body Fat   Weight   Height
12.3               154.25   67.75
6.1                173.25   72.25
25.3               -        -
10.4               184.75   72.25
-                  184.25   71.25
20.9               -        -
19.2               181.00   69.75
12.4               176.00   72.50
-                  191.00   74.00
11.7               198.25   73.50
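The blanking step above can be sketched in a few lines of NumPy. The `blank` helper and its quantile defaults are illustrative, not code from the paper:

```python
import numpy as np

def blank(y, lower_q=0.10, upper_q=0.90):
    # Retain only observations inside the critical quantile range.
    # D_i = 1 means observation i is kept, D_i = 0 means it is blanked.
    q_lo, q_hi = np.quantile(y, [lower_q, upper_q])
    d = (y >= q_lo) & (y <= q_hi)
    return d, y[d]

body_fat = np.array([12.3, 6.1, 25.3, 10.4, 28.7, 20.9, 19.2, 12.4, 4.1, 11.7])
d, kept = blank(body_fat)
p_keep = d.mean()  # overall fraction of observations retained
```

On the body-fat column this drops the minimum (4.1) and maximum (28.7), matching the table above.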
PROBLEMS
- Blanking only protects specific observations. What if an attacker with external information wants to learn something about a person in the data set whose record is not blanked?
- The blanked data set is not useful if the researcher is interested in the tails of the data. E.g., a researcher wants to look at the income of families below the poverty level, but most of those incomes are blanked.
- Difficulty estimating true parameter values, illustrated here using M-estimation. M-estimation is a method in which statistics are obtained as the solution to the problem of minimizing the sum of a certain function of the data (Wikipedia).
M-ESTIMATION SETUP
Consider the conditional expectation function E[Y_i | X_i] = μ(x_i, θ_0), where θ_0 is the true k x 1 parameter vector. Example: in the linear regression model, μ(x_i, θ_0) = X_i β_0.
Let Z_i = (Y_i, X_i), and let q(z_i, θ) be an objective function to be minimized. Example: in linear regression, Y_i is the response variable and X_i is the set of predictors; we want the β that minimizes the squared distance between Y and Ŷ, so q(z_i, θ) = (Y_i - X_i β)², with θ = β.
Let D_i be the dummy variable indicating whether observation i is retained (D_i = 1) or blanked (D_i = 0).
M-ESTIMATION
                            Unblanked                      Blanked
θ_0 (i.e., β) minimizes     E[q(Z_i, θ)]                   E[D_i q(Z_i, θ)]
M-estimator minimizes       n⁻¹ Σ q(z_i, θ)                n⁻¹ Σ D_i q(z_i, θ)
The parameter and the M-estimator are not the same for the unblanked and blanked data sets unless an assumption is made.
Missing at Random (MAR) assumption: assume the missing-data mechanism is ignorable, i.e., Z_i is independent of D_i given W_i, where W_i is the vector of covariates at observation i.
Explanation: missing values are not randomly distributed across all observations, but they are randomly distributed within one or more subsamples.
Is this a reasonable assumption?
M-ESTIMATION
Based on the MAR assumption, weight the observed moment function by the inverse of each observation's probability of not being blanked given the vector of covariates: Inverse Probability Weighting (IPW) (Horvitz, Wooldridge).
E[ D_i q(z_i, θ_0) / P(D_i = 1 | W_i) ] = E[ q(z_i, θ_0) ]
Thus the weighted M-estimator is the solution of:
min over θ of  n⁻¹ Σ D_i q(z_i, θ) / P(D_i = 1 | W_i)
NOISE ADDITION
A method of data perturbation. Three algorithms:
- Adding noise
- Adding noise and linear transformations (Kim)
- Adding noise and nonlinear transformations (Sullivan)
ADDING NOISE
For a vector of a variable, x_j ~ (μ_j, σ_j²), create the perturbed vector z_j = x_j + ε_j, where ε_j is the noise:
- ε_j ~ N(0, σ_εj²)
- Cov(ε_t, ε_l) = 0 for all t ≠ l
- Cov(x_t, ε_j) = 0 for all t, j
EXAMPLE
Percent Body Fat   Weight   Height
12.3               154.25   67.75
6.1                173.25   72.25
25.3               154.00   66.25
10.4               184.75   72.25
28.7               184.25   71.25
20.9               210.25   74.75
19.2               181.00   69.75
12.4               176.00   72.50
4.1                191.00   74.00
11.7               198.25   73.50
EXAMPLE
Pct Body Fat   Noise (var=9)   Pct Body Fat2   Weight   Noise (var=25)   Weight2   Height   Noise (var=2)   Height2
12.3           -2.32            9.98           154.25    6.40            160.65    67.75     2.08           69.83
6.1             2.69            8.79           173.25    0.77            174.02    72.25    -1.17           71.08
25.3           -1.00           24.30           154.00   -1.36            152.64    66.25    -1.34           64.91
10.4            1.05           11.45           184.75    0.14            184.89    72.25    -2.85           69.40
28.7           -1.77           26.93           184.25   -4.77            179.48    71.25     1.30           72.55
20.9           -1.82           19.08           210.25    2.06            212.31    74.75     0.03           74.78
19.2            4.43           23.63           181.00    3.93            184.93    69.75     1.02           70.77
12.4           -1.51           10.89           176.00   -4.11            171.89    72.50    -0.90           71.60
4.1             2.00            6.10           191.00   -2.75            188.25    74.00    -1.19           72.81
11.7            1.02           12.72           198.25   -8.68            189.57    73.50     1.00           74.50
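The masking step in the table can be sketched as follows; the noise variance matches the Weight column of the slide, and the seed is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(1)
weight = np.array([154.25, 173.25, 154.00, 184.75, 184.25,
                   210.25, 181.00, 176.00, 191.00, 198.25])

# z_j = x_j + eps_j with eps_j ~ N(0, 25), as in the Weight column above
eps = rng.normal(0.0, 5.0, weight.shape)
weight_masked = weight + eps
```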
PROBLEMS
- Poor protection for extreme values.
- Perturbed values might not make sense (e.g., values that become negative).
- The distribution of the masked variables is not known if the original variable is not normally distributed.
- Sample variances of the masked data are asymptotically biased estimators of the variances of the original data; sample correlations are also biased. (An estimator is biased if its expected value differs from the true value of the parameter it estimates.)
BIAS DUE TO ADDING NOISE
The general assumption is that the variance of ε_j is proportional to the variance of the original variable (Spruill, Sullivan, Tendick):
σ_ε² = α σ_x², where α is a positive constant that varies the amount of noise.
σ_z² = σ_x² + σ_ε² = σ_x² + α σ_x² = (1 + α) σ_x²
Correlation between two masked variables z_i, z_j:
ρ_{z_i,z_j} = Cov(z_i, z_j) / sqrt(V(z_i) V(z_j))
            = (1 / (1 + α)) · Cov(x_i, x_j) / sqrt(V(x_i) V(x_j))
            = (1 / (1 + α)) · ρ_{x_i,x_j}
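A quick simulation confirms the attenuation factor 1/(1 + α); all parameter choices here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n, alpha = 100_000, 0.5

# Two correlated original variables
x1 = rng.normal(0.0, 1.0, n)
x2 = 0.8 * x1 + rng.normal(0.0, 0.6, n)

# Add noise with variance alpha * Var(x) to each variable
z1 = x1 + rng.normal(0.0, np.sqrt(alpha * x1.var()), n)
z2 = x2 + rng.normal(0.0, np.sqrt(alpha * x2.var()), n)

rho_x = np.corrcoef(x1, x2)[0, 1]
rho_z = np.corrcoef(z1, z2)[0, 1]
# theory: rho_z = rho_x / (1 + alpha)
```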
ADDING NOISE & LINEAR TRANSFORMATION
z_j = x_j + ε_j, j = 1, ..., p
g_j = c z_j + i d_j
where g_j is the masked and transformed variable, i is a vector of ones, c is a constant, and d_j differs between variables.
Given the restrictions E(g_j) = E(z_j) and V(g_j) = V(x_j), d_j = (1 - c) E(x_j) (Kim).
ADDING NOISE & LINEAR TRANSFORMATION
Two possible transformations of the form g_j = c z_j + i d_j, with x̄_j the sample mean of x_j:
1. g_j,1 = c z_j + (1 - c) x̄_j, with c = sqrt[(n - 1) / (n(1 + α) - 1)]
2. g_j,2 = c z_j + (1 - c) x̄_j, with c = sqrt[(n - 1 - α) / ((n - 1)(1 + α))]
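The role of the restriction can be checked numerically with the population version of the transformation, c = 1/sqrt(1 + α) and d = (1 - c)E(x), a simplification of the finite-sample constants; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 200_000, 0.25
mu, sigma = 50.0, 10.0
x = rng.normal(mu, sigma, n)

# Mask: z = x + eps with Var(eps) = alpha * Var(x)
z = x + rng.normal(0.0, np.sqrt(alpha) * sigma, n)

# Population-version transformation: c = 1/sqrt(1+alpha), d = (1 - c) * E(x)
c = 1.0 / np.sqrt(1.0 + alpha)
g = c * z + (1.0 - c) * mu

# By construction E(g) = E(x) and V(g) = c^2 (1 + alpha) V(x) = V(x)
```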
ADDING NOISE & LINEAR TRANSFORMATION
- Suitable for continuous variables only.
- Preserves expected values and covariances, due to the restriction used to determine c.
- The univariate distribution is not preserved unless the original variables are normally distributed to begin with.
ADDING NOISE & NONLINEAR TRANSFORMATION
- Can be used for both continuous and discrete data.
- Univariate distributions are approximately preserved.
ADDING NOISE & NONLINEAR TRANSFORMATION
1. Calculate the empirical distribution function for every variable.
2. Smooth the empirical distribution function using a moving average.
3. Convert the smoothed function into a uniform random variable, then convert the uniform random variable into a standard normal random variable using the quantile function (the inverse of the cumulative distribution function, cdf).
4. Add noise to the standard normal variable; the masking is similar to the method of adding noise and linear transformation.
5. Back-transform to values of the distribution function.
6. Back-transform to the original scale.
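The six steps can be sketched as below, with two simplifications: rank-based normal scores stand in for the smoothed empirical distribution function, and the back-transform maps the noisy scores' ranks onto the original values, so in this toy version the univariate distribution is preserved exactly:

```python
import numpy as np
from statistics import NormalDist

def mask_nonlinear(x, alpha=0.2, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(x)
    nd = NormalDist()
    # Steps 1-3: empirical CDF -> uniform -> standard normal scores
    ranks = x.argsort().argsort() + 1
    u = ranks / (n + 1)
    s = np.array([nd.inv_cdf(p) for p in u])
    # Step 4: add noise in the normal domain
    s_noisy = s + rng.normal(0.0, np.sqrt(alpha), n)
    # Steps 5-6: back-transform via ranks to the original scale
    return np.sort(x)[s_noisy.argsort().argsort()]

body_fat = np.array([12.3, 6.1, 25.3, 10.4, 28.7, 20.9, 19.2, 12.4, 4.1, 11.7])
masked = mask_nonlinear(body_fat)
```

The masking here amounts to shuffling nearby ranks, so protection comes from the chance that an observation's value is swapped with a neighbor's.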
PROBLEMS
- Procedures following the transformation and noise addition are needed to correct for differences in correlation (usually when the observed variables are not normally distributed).
- These corrections mean the level of protection is not the same.
- Variances of continuous variables are larger than those of the original variables, due to the transformations.
NOISE ADDITION & BLANKING
Provides stronger disclosure limitation: observations with high original values, which are not well protected by noise addition, are protected by data blanking.
NOISE ADDITION & BLANKING
- Problem with blanking: not all observations are protected. Noise addition corrects this because it perturbs all of the data.
- Problem with noise addition: extreme outliers are not protected well. Blanking corrects this because extreme outliers are suppressed.
NOISE ADDITION & BLANKING
1. Add independent noise to the sensitive variables.
2. Create the blanked data set from the masked data by removing observations outside the critical quantile range.
3. Compute the corresponding conditional probabilities.
4. Provide the researcher with the blanked data set, the conditional probabilities, and the variance of the measurement-error term u_i.
SIMEX SIMEX (Simulation Extrapolation) is a procedure that uses simulation to estimate parameters (e.g. in linear regression, use SIMEX to estimate β).
SIMEX
Consider a linear regression model with response y and predictor x:
- i = 1, ..., n = 10
- b = 1, ..., B = 2
- u_{i,b} ~ N(0, σ_u²)
- t = 1, ..., T = 4
- λ_0 = 0, λ_1 = 0.5, λ_2 = 1, λ_3 = 1.5, λ_4 = 2
SIMEX
At each level of λ, create B = 2 new data sets with X_{i,b}(λ_t) = X_i + sqrt(λ_t) u_{i,b}; the response (weight) stays the same in each data set.
Calculate β̂_b for each data set.
For each level of λ, calculate β̂(λ_t) by taking the average of all the β̂_b.
Now, with a value of β̂ for each level of λ, extrapolate to find the value of β̂ at λ = -1. This is the (approximately) unbiased estimate of β (Carroll).
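A self-contained SIMEX sketch for a simple regression slope; the model, noise level σ_u, number of remeasurements B, and the quadratic extrapolant are all illustrative choices, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma_u = 200, 0.7

# True model y = 2 + 3x, but only the error-contaminated w = x + u is observed
x = rng.normal(0.0, 1.0, n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, n)
w = x + rng.normal(0.0, sigma_u, n)

def ols_slope(pred, resp):
    return np.cov(pred, resp, bias=True)[0, 1] / pred.var()

lambdas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
B = 200
slopes = []
for lam in lambdas:
    bs = []
    for _ in range(B):
        # remeasure with extra noise: total error variance (1 + lambda) sigma_u^2
        w_b = w + np.sqrt(lam) * rng.normal(0.0, sigma_u, n)
        bs.append(ols_slope(w_b, y))
    slopes.append(np.mean(bs))

# Extrapolate the slope-vs-lambda curve back to lambda = -1 (no noise)
coef = np.polyfit(lambdas, slopes, 2)
beta_simex = np.polyval(coef, -1.0)
beta_naive = slopes[0]   # attenuated toward zero by the measurement error
```

The naive slope is attenuated by roughly 1/(1 + σ_u²); the extrapolated SIMEX estimate moves much of the way back toward the true value 3.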
MONTE CARLO EXPERIMENT
Monte Carlo methods use stochastic techniques to simulate the behavior of a system: generate random data based on the original distribution of the variables.
Used here to simulate the effect of blanking and noise addition on microdata, and to simulate the SIMEX approach for estimating the IPW M-estimators.
MONTE CARLO EXPERIMENT
Used the multivariate linear regression model Y_i = α + βX_1i + γX_2i + e_i, i = 1, ..., n, with the covariates and error drawn from a normal distribution (parameters not shown).
Sample sizes of n = 100 and n = 1000 with R = 1000 replicates.
SIMEX approach: 0 = λ_0 < λ_1 = 0.5 < λ_2 = 1 < λ_3 = 1.5 < λ_4 = 2, with B = 50 samples.
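The experimental loop can be sketched as below. This is an illustrative stand-in, not the paper's full blanking/SIMEX/IPW pipeline: it uses plain noise addition on one covariate plus naive OLS, and the distributional parameters are invented since the slide omits them:

```python
import numpy as np

def one_replicate(n, rng):
    # Draw data from an assumed linear model, mask a covariate, re-estimate beta
    x1 = rng.normal(0.0, 1.0, n)
    x2 = rng.normal(0.0, 1.0, n)
    y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(0.0, 1.0, n)
    x1_masked = x1 + rng.normal(0.0, 0.1, n)   # sigma_u^2 = 0.01
    X = np.column_stack([np.ones(n), x1_masked, x2])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]   # estimate of beta

rng = np.random.default_rng(11)
R, n, beta_true = 1000, 100, 2.0
estimates = np.array([one_replicate(n, rng) for _ in range(R)])
bias = estimates.mean() - beta_true
rmse = np.sqrt(np.mean((estimates - beta_true) ** 2))
```

Repeating this over the estimator variants and designs yields the bias/RMSE comparisons reported in the results.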
FOUR DIFFERENT MONTE CARLO DESIGNS
The designs vary two factors: the variance of the measurement error (noise) used in noise addition, and the critical quantile used for blanking.
DESIGN 1: σ_u² = 0.01, q_θu = 0.95
Root mean square error: RMSE(θ̂) = sqrt(E[(θ̂ - θ)²])
Relative standard error: estimated SE / true SE
Compared: estimates from the original data set, the ordinary least squares estimate on the masked data (a bad estimate), and estimates from SIMEX.
DESIGN 2: σ_u² = 0.01, q_θu = 0.90
DESIGN 3: σ_u² = 0.5, q_θu = 0.95
DESIGN 4: σ_u² = 0.5, q_θu = 0.90
MONTE CARLO EXPERIMENT RESULTS
- Bias and RMSE of the estimates are reduced compared to the naive OLS estimate.
- Estimated variances are smaller than the naive estimates but larger than those from the original data set.
- Bias and RMSE are larger for n = 100 than for n = 1000.
- More noise (larger σ_u²) yields more biased estimates.
- Due to low RELSE for small sample sizes, standard errors cannot be estimated precisely; RELSE gets worse when n = 1000. Their explanation: not enough bootstrap replications.
COMMENTS
- Is too much information given away with the conditional probabilities and the variance of the noise?
- The data set is still not useful if the researcher is interested in the tails of the data.
- What about protection from identity disclosure using quasi-identifiers and/or external information?
- Possible use of imputation with blanked data? It was previously used for non-response.
- Any applications to categorical data?
- Where is the proof that the SIMEX method applied to the IPW estimator carries over to nonlinear models?
CONCLUSIONS
- Blanking protects sensitive observations, but not all of the data.
- Noise protects all of the data to some extent, but has only a small impact on outliers.
- The combination of the two compensates for each method's weaknesses: apply the SIMEX approach to the IPW estimator.
- Monte Carlo experiments show that the bias of the estimators is small, but the RELSE is not that good. More research needs to be conducted.
REFERENCES
Flossmann, A. and Lechner, S. (2006). Combining Blanking and Noise Addition as a Data Disclosure Limitation Method. Privacy in Statistical Databases, Lecture Notes in Computer Science, Vol. 4302. Berlin/Heidelberg: Springer, pp. 152-163.
Brand, R. (2002). Microdata Protection through Noise Addition. In J. Domingo-Ferrer (ed.), Inference Control in Statistical Databases, Lecture Notes in Computer Science, Vol. 2316. Berlin: Springer, pp. 97-116.
M-estimator. Wikipedia. <http://en.wikipedia.org/wiki/M-estimator>. Accessed 8 Nov. 2007.
Carroll, R.J., Ruppert, D., and Stefanski, L.A. (1994). Measurement Error in Nonlinear Models. Journal of the American Statistical Association, 89, 1314-1328.