Eric Shou Stat 598B / CSE 598D METHODS FOR MICRODATA PROTECTION


INTRODUCTION Statistical disclosure control is part of the preparation for disseminating microdata. Data perturbation techniques include: methods assuring anonymity during the interview (e.g. randomized response) and methods that are part of the editing process (e.g. resampling, suppression (blanking), imputation, data swapping, noise addition). The methods differ in their level of protection and their usefulness.


OUTLINE
Blanking: description of method; problems
Noise addition: three methods; problems
SIMEX: explanation of method
Combination of blanking and noise addition: description of method; Monte Carlo experiment

BLANKING Previous uses: Cells were suppressed because, combined with external information, they would lead to identity disclosure if released (e.g. Bill Gates' income). Low counts in contingency tables and tabular data. K-anonymity: if a quasi-identifier does not occur at least k times, it is suppressed.

BLANKING Protection method: 1. Create the blanked data set by removing observations lying outside a critical quantile range. 2. Compute the corresponding conditional probabilities. 3. Provide the researcher with the blanked data set and the conditional probabilities.
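Step 1 can be sketched in a few lines of NumPy, using the body-fat values from the example slide; the 10%/90% cutoffs follow the example, and np.quantile's default linear interpolation is an assumption of this sketch:

```python
import numpy as np

def blank_outside_quantiles(y, lower=0.10, upper=0.90):
    """Keep only the observations inside the [lower, upper] quantile range."""
    q_lo, q_hi = np.quantile(y, [lower, upper])
    keep = (y >= q_lo) & (y <= q_hi)      # D_i = 1 exactly for the kept rows
    return y[keep], keep

# body-fat values from the example slide
body_fat = np.array([12.3, 6.1, 25.3, 10.4, 28.7, 20.9, 19.2, 12.4, 4.1, 11.7])
kept, mask = blank_outside_quantiles(body_fat)
```

On these ten values the extremes 4.1 and 28.7 fall outside the critical range and are blanked, matching the example slide.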

BLANKING Conditional probability: Let Y_i be the value of variable Y at observation i, and let the dummy variable D_i indicate whether observation i is retained:

D_i = 1 if q_θl <= Y_i <= q_θu, and D_i = 0 otherwise,

where q_θl and q_θu are the lower and upper critical quantiles. The conditional probability

P(D_Yi = 1 | Y_i) = P(q_θl <= Y_i <= q_θu | Y_i)

is, given a value Y_i, the probability that it is included in the released data set.

EXAMPLE

Percent Body Fat   Weight   Height
12.3               154.25   67.75
6.1                173.25   72.25
25.3               154.00   66.25
10.4               184.75   72.25
28.7               184.25   71.25
20.9               210.25   74.75
19.2               181.00   69.75
12.4               176.00   72.50
4.1                191.00   74.00
11.7               198.25   73.50

EXAMPLE

Percent Body Fat   Weight   Height
12.3               154.25   67.75
6.1                173.25   72.25
25.3               184.75   72.25
10.4               184.25   71.25
20.9               181.00   69.75
19.2               176.00   72.50
12.4               191.00   74.00
11.7               198.25   73.50

Observations between the 10th and 90th percentiles are kept in the dataset.

PROBLEMS Blanking only protects specific observations. What if an attacker with external information wants to learn something about someone in the data set whose information is not blanked? The blanked data set is not useful if the researcher is concerned with the tails of the data, e.g. the researcher wants to look at the income of families below the poverty level, but most of those incomes are blanked. There is also difficulty estimating the true parameter values, illustrated using M-estimation. M-estimation is a method where statistics are obtained as the solution to the problem of minimizing the sum of a certain function of the data (Wikipedia).

M-ESTIMATION SETUP Consider the conditional expectation function E[Y_i | X_i] = μ(X_i, θ_0), where θ_0 is the true k x 1 parameter vector. Example: in the linear regression model, μ(X_i, θ_0) = X_i β_0. Let Z_i = (Y_i, X_i), and let q(Z_i, θ) be an objective function to be minimized. Example: in linear regression, Y_i is the response variable and X_i is the set of predictors; we want β such that the squared distance between Y and Y-hat is minimized, so q(Z_i, θ) = (Y_i - X_i β)^2, with θ = β. Let the dummy variable D_i = 1 if observation i is not blanked, and D_i = 0 otherwise.

M-ESTIMATION

                               Unblanked             Blanked
θ_0 (i.e. β) minimizes         E[q(Z_i, θ)]          E[D_i q(Z_i, θ)]
M-estimator of θ_0 minimizes   n^-1 Σ_i q(Z_i, θ)    n^-1 Σ_i D_i q(Z_i, θ)

The parameter and the M-estimator are not the same for the unblanked and blanked data sets unless an assumption is made. Missing at Random (MAR) assumption: assume the missing-data mechanism is ignorable, Z_i ⊥ D_i | W_i, where ⊥ means independence and W_i is the vector of covariates at observation i. Explanation: missing values are not randomly distributed across all observations, but are randomly distributed within one or more subsamples. Is this a reasonable assumption?

M-ESTIMATION Based on the MAR assumption, weight the observed moment function by the inverse of the individual probability of not being blanked given the vector of covariates: Inverse Probability Weighting (IPW) (Horvitz, Wooldridge).

E[ D_i q(Z_i, θ_0) / P(D_i = 1 | W_i) ] = E[ q(Z_i, θ_0) ]

Thus, the weighted M-estimator is the solution of: min_θ n^-1 Σ_i D_i q(Z_i, θ) / P(D_i = 1 | W_i)

NOISE ADDITION A method of data perturbation. Three algorithms: 1. Adding noise. 2. Adding noise and linear transformations (Kim). 3. Adding noise and nonlinear transformations (Sullivan).

ADDING NOISE For a vector of a variable, x_j ~ (μ_j, σ_j^2), create the perturbed vector z_j = x_j + ε_j, where ε_j is the noise:
ε_j ~ N(0, σ_εj^2)
Cov(ε_t, ε_l) = 0 for all t ≠ l
Cov(x_t, ε_l) = 0 for all t, l
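A minimal sketch of plain noise addition, using the Weight column from the example slide with noise variance 25 (the seed and the helper name are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def add_noise(x, noise_var):
    """Return a perturbed copy z = x + eps with eps ~ N(0, noise_var)."""
    return x + rng.normal(0.0, np.sqrt(noise_var), size=x.shape)

# the Weight column from the example slide, masked with noise variance 25
weight = np.array([154.25, 173.25, 154.00, 184.75, 184.25,
                   210.25, 181.00, 176.00, 191.00, 198.25])
weight_masked = add_noise(weight, noise_var=25.0)
```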

EXAMPLE

Percent Body Fat   Weight   Height
12.3               154.25   67.75
6.1                173.25   72.25
25.3               154.00   66.25
10.4               184.75   72.25
28.7               184.25   71.25
20.9               210.25   74.75
19.2               181.00   69.75
12.4               176.00   72.50
4.1                191.00   74.00
11.7               198.25   73.50

EXAMPLE

Pct Body Fat  Noise (var=9)  Pct Body Fat2  Weight   Noise (var=25)  Weight2  Height  Noise (var=2)  Height2
12.3          -2.32           9.98          154.25    6.40           160.65   67.75    2.08          69.83
6.1            2.69           8.79          173.25    0.77           174.02   72.25   -1.17          71.08
25.3          -1.00          24.30          154.00   -1.36           152.64   66.25   -1.34          64.91
10.4           1.05          11.45          184.75    0.14           184.89   72.25   -2.85          69.40
28.7          -1.77          26.93          184.25   -4.77           179.48   71.25    1.30          72.55
20.9          -1.82          19.08          210.25    2.06           212.31   74.75    0.03          74.78
19.2           4.43          23.63          181.00    3.93           184.93   69.75    1.02          70.77
12.4          -1.51          10.89          176.00   -4.11           171.89   72.50   -0.90          71.60
4.1            2.00           6.10          191.00   -2.75           188.25   74.00   -1.19          72.81
11.7           1.02          12.72          198.25   -8.68           189.57   73.50    1.00          74.50

PROBLEMS Poor protection for extreme values. Perturbed values might not make sense (e.g. values that are negative). The distribution of the masked variables is not known if the original variable is not normally distributed. Sample variances of the masked data are asymptotically biased estimators of the variances of the original data, and sample correlations are also biased. (An estimator is biased if its expected value differs from the value of the true parameter it is estimating.)

BIAS DUE TO ADDING NOISE The general assumption is that the variance of ε_j is proportional to the variance of the original variable (Spruill, Sullivan, Tendick):

σ_ε^2 = α σ_x^2, where α is a positive constant varying the amount of noise, so
σ_z^2 = σ_x^2 + σ_ε^2 = σ_x^2 + α σ_x^2 = (1 + α) σ_x^2

Correlation between two masked variables z_i, z_j:

ρ_zi,zj = Cov(z_i, z_j) / sqrt(V(z_i) V(z_j)) = (1 / (1 + α)) Cov(x_i, x_j) / sqrt(V(x_i) V(x_j)) = (1 / (1 + α)) ρ_xi,xj
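The attenuation factor 1/(1 + α) is easy to verify by simulation; a minimal sketch, where the true correlation of 0.8 and α = 0.5 are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n, alpha = 200_000, 0.5

# two correlated original variables with Corr(x1, x2) = 0.8
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)

# noise variance proportional to the original variance: sigma_eps^2 = alpha * sigma_x^2
z1 = x1 + rng.normal(0.0, np.sqrt(alpha * x1.var()), size=n)
z2 = x2 + rng.normal(0.0, np.sqrt(alpha * x2.var()), size=n)

rho_x = np.corrcoef(x1, x2)[0, 1]
rho_z = np.corrcoef(z1, z2)[0, 1]
# theory: rho_z = rho_x / (1 + alpha), i.e. 0.8 / 1.5
```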


EXAMPLE

ADDING NOISE & LINEAR TRANSFORMATION
z_j = x_j + ε_j, j = 1, …, p
g_j = c z_j + i d_j
where g_j is the masked and transformed variable, i is a vector of ones, c is a constant, and d_j differs between variables. Given the restrictions E(g_j) = E(z_j) and V(g_j) = V(x_j), d_j = (1 - c) E(x_j) (Kim).

ADDING NOISE & LINEAR TRANSFORMATION Two possible transformations for g_j = c z_j + i d_j:
1. g_j,1 = c z_j + (1 - c) x̄_j with c = [(n - 1) / (n(1 + α) - 1)]^(1/2)
2. g_j,2 = c z_j + (1 - c) x̄_j with c = [(n - 1 - α) / ((n - 1)(1 + α))]^(1/2)
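The mean- and variance-preserving effect of the shrinkage can be checked by simulation. A minimal sketch, where the data-generating values are illustrative and the square-root form of c is an assumption reconstructed from the slide:

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 100_000, 0.25
x = rng.normal(loc=50.0, scale=10.0, size=n)

# step 1: add noise with variance alpha * V(x)
z = x + rng.normal(0.0, np.sqrt(alpha * x.var()), size=n)

# step 2: shrink toward the mean so E(g) = E(x) and V(g) = V(x)
# (c as reconstructed from the slide; the 1/2 power is an assumption here)
c = np.sqrt((n - 1) / (n * (1 + alpha) - 1))
g = c * z + (1 - c) * x.mean()
```

After the transformation the masked variable g has (up to sampling error) the same mean and variance as the original x, while z alone has variance inflated by the factor (1 + α).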

ADDING NOISE & LINEAR TRANSFORMATION Suitable for continuous variables only. Preserves expected values and covariances due to restriction for determining c. Univariate distribution not preserved, unless original variables are normally distributed to begin with.

ADDING NOISE & NONLINEAR TRANSFORMATION Can be used for continuous and discrete data. Univariate distributions are approximately sustained.

ADDING NOISE & NONLINEAR TRANSFORMATION 1. Calculate the empirical distribution function for every variable. 2. Smooth the empirical distribution function using a moving average. 3. Convert the smoothed function into a uniform random variable, and then convert the uniform random variable into a standard normal random variable using the quantile function (inverse of the cumulative distribution function (cdf)). 4. Add noise to the standard normal variable, masking similarly to the method of adding noise and linear transformation. 5. Back-transform to values of the distribution function. 6. Back-transform to the original scale.
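A simplified sketch of these steps, skipping the smoothing in step 2 and using the raw empirical CDF; the exponential example and α = 0.2 are illustrative assumptions:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(11)
x = rng.exponential(scale=2.0, size=2000)    # clearly non-normal original
n = x.size
nd = NormalDist()                            # standard normal cdf / quantile

# steps 1-3: empirical CDF (shifted to avoid 0 and 1), then probit transform
u = (np.argsort(np.argsort(x)) + 0.5) / n    # ECDF value (rank-based) of each x_i
s = np.array([nd.inv_cdf(p) for p in u])     # standard normal scores

# step 4: mask on the normal scale, rescaling so the scores stay ~ N(0, 1)
alpha = 0.2
s_masked = (s + rng.normal(0.0, np.sqrt(alpha), size=n)) / np.sqrt(1 + alpha)

# steps 5-6: back through the normal CDF to the empirical quantiles of the original
x_masked = np.quantile(x, [nd.cdf(v) for v in s_masked])
```

Because the final step maps back through the original variable's empirical quantiles, the masked values stay on the original scale and the univariate distribution is approximately sustained.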

PROBLEMS Procedures following the transformation and noise addition are needed to correct for differences in correlation (usually when the observed variables are not normally distributed). Because of these corrections, the level of protection is not the same. Variances of continuous variables are larger than those of the original variables due to the transformations.

NOISE ADDITION & BLANKING More disclosure limitation: observations with high original values, which are not well protected by noise addition, are protected by data blanking.

NOISE ADDITION & BLANKING Problem with blanking: Not all observations are protected. Problem corrected with noise addition because this method perturbs all data. Problem with noise addition: Extreme outliers not protected well. Problem corrected with blanking because extreme outliers will be suppressed.

NOISE ADDITION & BLANKING 1. Add independent noise to the sensitive variables. 2. Create the blanked data set from the masked data by removing observations outside a critical quantile range. 3. Compute the corresponding conditional probabilities. 4. Provide the researcher with the blanked data set, the conditional probabilities, and the variance of the measurement (noise) term μ_i.
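The combined procedure can be sketched end to end; a minimal sketch in which the lognormal "income" variable, the noise level, and the 5%/95% cutoffs are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

def mask_and_blank(y, noise_var, lower=0.05, upper=0.95):
    """Add independent noise, then blank the tails of the masked variable."""
    z = y + rng.normal(0.0, np.sqrt(noise_var), size=y.shape)   # step 1: noise
    q_lo, q_hi = np.quantile(z, [lower, upper])                 # step 2: blank
    d = (z >= q_lo) & (z <= q_hi)
    # steps 3-4: the researcher would receive the blanked masked values,
    # the retention indicator (for the conditional probabilities), and noise_var
    return z[d], d, noise_var

income = rng.lognormal(mean=10.0, sigma=0.8, size=1000)
z_blanked, kept, sigma_u2 = mask_and_blank(income, noise_var=0.01 * income.var())
```

Roughly 90% of the masked observations survive the blanking step; the extreme outliers, which noise addition alone would protect poorly, are exactly the ones suppressed.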

SIMEX SIMEX (Simulation Extrapolation) is a procedure that uses simulation to estimate parameters (e.g. in linear regression, use SIMEX to estimate β).

SIMEX Consider a linear regression model with response y and predictor x.
i = 1, …, n = 10
b = 1, …, B = 2
u_i,b ~ N(0, σ_u^2)
t = 0, 1, …, T = 4
λ_0 = 0, λ_1 = .5, λ_2 = 1, λ_3 = 1.5, λ_4 = 2

SIMEX At each level of λ, create B = 2 new datasets with X_i,b(λ_t) = X_i + sqrt(λ_t) u_i,b; the response (weight) stays the same in each dataset. Calculate β̂_b for each data set. For each level of λ, calculate β̂(λ_t) by taking the average of the β̂_b. Now, with a value of β̂ for each level of λ, extrapolate to find the value of β̂ at λ = -1. This is the (approximately) unbiased estimate of β (Carroll).
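The procedure can be sketched for a simple mismeasured-predictor regression. The data-generating values, B = 50, and the quadratic extrapolant below are illustrative assumptions, not the paper's settings; note that the quadratic extrapolation only partially removes the attenuation bias:

```python
import numpy as np

rng = np.random.default_rng(9)
n, sigma_u2 = 2000, 0.5
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=n)     # true slope is 2.0
w = x + rng.normal(0.0, np.sqrt(sigma_u2), size=n)   # mismeasured predictor

def ols_slope(pred, resp):
    return np.cov(pred, resp)[0, 1] / np.var(pred, ddof=1)

lambdas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
B = 50
slopes = []
for lam in lambdas:
    # B remeasured data sets with extra noise of variance lam * sigma_u2
    b_slopes = [ols_slope(w + rng.normal(0.0, np.sqrt(lam * sigma_u2), size=n), y)
                for _ in range(B)]
    slopes.append(np.mean(b_slopes))

# fit a quadratic in lambda and extrapolate to lambda = -1 (no measurement error)
coef = np.polyfit(lambdas, slopes, deg=2)
beta_simex = np.polyval(coef, -1.0)
beta_naive = slopes[0]
```

The naive slope at λ = 0 is attenuated toward zero (about 2 / 1.5 here), while the extrapolated SIMEX estimate moves most of the way back toward the true slope.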

MONTE CARLO EXPERIMENT Monte Carlo methods use stochastic techniques to simulate the behavior of a system: generate random data based on the original distribution of the variables. Here they are used to simulate the effect of blanking and noise addition on microdata, and to simulate the SIMEX approach for estimating the IPW M-estimators.

MONTE CARLO EXPERIMENT A multivariate linear regression model was used: Y_i = α + βX_1i + γX_2i + e_i, i = 1, …, n, with (X_1i, X_2i, e_i) ~ N(·, ·). Sample sizes of n = 100 and n = 1000 with R = 1000 replicates. SIMEX approach: 0 = λ_0 < λ_1 = .5 < λ_2 = 1 < λ_3 = 1.5 < λ_4 = 2, with B = 50 samples.

FOUR DIFFERENT MONTE CARLO DESIGNS The designs vary the variance of the measurement error (noise) used in the noise addition (σ_u^2) and the critical quantile used in the blanking method (q_θu).

DESIGN 1: σ_u^2 = .01, q_θu = .95
Root Mean Square Error: RMSE(θ̂) = sqrt(MSE(θ̂)) = sqrt(E((θ̂ - θ)^2))
Relative Standard Error (RELSE): estimated SE / true SE
Reported: estimates from the original dataset, the ordinary least squares estimate (a bad, naïve estimate), and the estimates from SIMEX.
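The RMSE and RELSE criteria on this slide can be computed directly from Monte Carlo replicates. A minimal sketch for the OLS slope of a simple linear model (an illustrative model, not the paper's design):

```python
import numpy as np

rng = np.random.default_rng(21)
R, n, beta_true = 1000, 100, 2.0
est = np.empty(R)        # slope estimate in each replicate
est_se = np.empty(R)     # estimated standard error in each replicate
for r in range(R):
    x = rng.normal(size=n)
    y = 1.0 + beta_true * x + rng.normal(size=n)
    sxx = np.sum((x - x.mean()) ** 2)
    b = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    a = y.mean() - b * x.mean()
    resid = y - a - b * x
    est[r] = b
    est_se[r] = np.sqrt(resid @ resid / (n - 2) / sxx)

rmse = np.sqrt(np.mean((est - beta_true) ** 2))   # root mean square error
relse = est_se.mean() / est.std(ddof=1)           # estimated SE / "true" (MC) SE
```

A RELSE near 1 means the estimated standard errors track the true sampling variability; values well below 1 indicate the standard errors are too optimistic.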

DESIGN 2: σ_u^2 = .01, q_θu = .90

DESIGN 3: σ_u^2 = .5, q_θu = .95

DESIGN 4: σ_u^2 = .5, q_θu = .90

MONTE CARLO EXPERIMENT RESULTS Bias and RMSE of the estimates are reduced compared to the naïve OLS estimate. Estimated variances are smaller than the naïve estimates but larger than those from the original dataset. Bias and RMSE are larger when n = 100 than when n = 1000. More noise (larger σ_u^2) yields more biased estimates. Due to low RELSE for small sample sizes, standard errors cannot be estimated precisely; RELSE gets worse when n = 1000 — their explanation: not enough bootstrap replications.

COMMENTS Is too much information given away with the conditional probabilities and the variance of the noise? The dataset is still not useful if the researcher is concerned with the tails of the data. What about protection from identity disclosure using quasi-identifiers and/or external information? Possible use of imputation with blanked data? It was previously used for non-response. Any applications to categorical data? Where is the proof that the SIMEX method applied to the IPW estimator can be extended to nonlinear models?

CONCLUSIONS Blanking protects sensitive observations, but not all of the data. Noise protects all of the data to some extent, but has little impact on outliers. A combination of both compensates for each method's weaknesses. Apply the SIMEX approach to the IPW estimator. Monte Carlo experiments show the bias of the estimators is small, but the RELSE is not that good. More research needs to be conducted.

REFERENCES
Flossmann, A. and Lechner, S. (2006). Combining Blanking and Noise Addition as a Data Disclosure Limitation Method. Privacy in Statistical Databases, Lecture Notes in Computer Science, Vol. 4302. Springer, Berlin/Heidelberg, pp. 152-163.
Brand, R. (2002). Microdata Protection through Noise. In: Inference Control in Statistical Databases (J. Domingo-Ferrer, ed.), Lecture Notes in Computer Science, Vol. 2316. Springer, Berlin, pp. 97-116.
M-estimator. Wikipedia, <http://en.wikipedia.org/wiki/M-estimator>, accessed 8 Nov. 2007.
Carroll, R.J., Ruppert, D., and Stefanski, L.A. (1994). Measurement Error in Nonlinear Models. Journal of the American Statistical Association, 89, 1314-1328.