Should Non-Sensitive Attributes be Masked? Data Quality Implications of Data Perturbation in Regression Analysis


Sumitra Mukherjee
Nova Southeastern University

Abstract

Ensuring the security of sensitive data is an increasingly important challenge for information systems managers. A widely used technique to protect sensitive data is to mask the data by adding zero-mean noise. Noise addition affects the quality of data available for legitimate statistical use. This article develops a framework that may be used to analyze the implications of additive noise data masking on data quality when the data is used for regression analysis. The framework is used to investigate whether noise should be added to non-sensitive attributes when only a subset of attributes in the database is considered sensitive, an issue that has not been addressed in the literature. Our analysis indicates that adding noise to all the attributes is preferable to the existing practice of masking only the subset of sensitive attributes.

1. Introduction

Ensuring the security of sensitive data is an increasingly important challenge for information systems managers (see [2] and [11]). Common security measures such as password protection and data encryption often fail to provide adequate protection, because a user may be able to infer sensitive information on the basis of information to which he has authorized access [7]. Recognizing the significance of this problem in statistical databases, researchers have developed several disclosure limitation methods (see [1] and [6] for surveys of security methods). A widely used method for disclosure limitation is data perturbation, under which sensitive data are masked by adding zero-mean noise. Noise addition, while reducing the risk of disclosure, has a negative effect on the quality of data available for legitimate statistical use. Information systems managers thus face the conflicting mandates of maintaining data quality while ensuring security.
Researchers have mainly focused on the security aspects of noise addition methods (see [3] and [10] for good reviews of the literature). While there is a growing body of research on data quality that addresses issues such as accuracy, consistency, timeliness, and completeness (see, e.g., [12]), the subject has not received much attention in the data security literature. Accuracy and consistency issues in masked data have been addressed by Matloff [9], Kim [8], and Tendick [13]. Matloff investigated univariate distributions, while the latter studies considered regression analysis. We focus on regression models since they are widely used in statistical analysis. Kim's work used the bias in the regression coefficient estimates as a measure of data quality, while Tendick's study was based on squared correlation coefficients. In each case the noise addition method was found to be optimal when the covariance matrix of the noise vector is proportional to the covariance matrix of the attributes to which the noise is added. Both studies assumed that all the attributes in the database were sensitive and hence had to be masked. Typically, however, only a few attributes in a database are considered sensitive, and the literature has not addressed the data quality implications of masking non-sensitive attributes.

In this article a general framework is developed to analyze the effect of noise addition on data quality when only a subset of attributes in the database is considered sensitive. The framework is used to investigate whether non-sensitive data should be masked. The utility of this approach is further demonstrated by comparing the method proposed by Kim [8] and Tendick [13] to the more common implementation of adding independent noise. In the case of independent noise our results are related to the extensive literature developed by statisticians and econometricians on measurement error models (see [5] and Fuller [4] for reviews of the literature).
In section 2 the mathematical framework used for the analysis is developed. Measures of data quality are derived in section 3. The framework is used to compare alternative additive noise data masking schemes in section 4. Section 5 provides an example to demonstrate the results, and section 6 concludes with discussions.

2. The Model

Since regression analysis is a widely used statistical tool, we use the classical normal linear regression (CNLR) model for the investigation of data quality issues. The model may be represented as:

Y = β′X + e (1)

where Y is the attribute representing the dependent variable, X is a (K×1) vector of attributes representing the regressors, β is a (K×1) vector of regression coefficients to be estimated, and e is the disturbance term. The assumptions of normality and independence in the CNLR model imply that:

(X′, Y)′ ~ N( (µ′, β′µ)′, [Σ, Σβ; β′Σ, β′Σβ + σ²] )

where µ is the (K×1) vector of means for X, Σ is the (K×K) covariance matrix for X, and σ² is the variance of the disturbance term e. The masked data may be represented as

M = X + U

where M is a (K×1) vector of masked attributes and U is a (K×1) vector of normally distributed zero-mean additive noise with covariance matrix Ω. Assume further, as is the case under all implementations of noise addition methods, that X and U are statistically independent. It follows that:

(M′, Y)′ ~ N( (µ′, β′µ)′, [Σ + Ω, Σβ; β′Σ, β′Σβ + σ²] ) (2)

Noise addition schemes may be characterized by the covariance matrix Ω of the additive noise vector. For instance, the methods proposed by Kim [8] and Tendick [13] require that

Ω = αΣ (3)

where α is a proportionality constant. Under an implementation where the noise components are independent of each other, Ω is a diagonal matrix. For obvious reasons, it is common practice to select the variances of the noise components to be proportional to the variances of the corresponding attributes. In this case:

Ω = Diagonal(αΣ) (4)

Under an implementation where only the subset of sensitive attributes is masked, the noise addition scheme may be characterized by a partitioned matrix.
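As a concrete illustration of the masking model M = X + U, the following sketch (Python/NumPy, with a hypothetical 3-attribute covariance matrix — these values do not come from the paper) generates masked data under scheme (3) and checks that the covariance of the released data is Σ + Ω = (1+α)Σ:

```python
import numpy as np

# Illustrative 3-attribute covariance matrix (hypothetical values).
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])
mu = np.zeros(3)
alpha = 0.25

# Scheme (3): noise covariance proportional to Sigma, so Cov(M) = (1 + alpha) * Sigma.
Omega = alpha * Sigma

rng = np.random.default_rng(42)
n = 200_000
X = rng.multivariate_normal(mu, Sigma, size=n)   # true attributes
U = rng.multivariate_normal(mu, Omega, size=n)   # zero-mean masking noise, independent of X
M = X + U                                        # masked (released) data

# The sample covariance of the masked data approaches Sigma + Omega = 1.25 * Sigma.
print(np.round(np.cov(M, rowvar=False), 2))
```

The sample covariance of M converges to Σ + Ω, which is why the covariance structure of the noise fully characterizes a masking scheme in this framework.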
Without loss of generality, when only the first p of the K attributes are masked, the covariance matrix has the structure:

Ω = [Ω₁₁, 0; 0, 0]

where Ω₁₁ is a (p×p) sub-matrix. Under this approach Tendick's method would require that

Ω₁₁ = αΣ₁₁ (5)

and an implementation using independent noise components requires that

Ω₁₁ = α Diagonal(Σ₁₁) (6)

where Σ₁₁ is the corresponding (p×p) sub-matrix obtained by partitioning Σ. While in this article we investigate noise addition methods characterized by the covariance matrices specified by (3), (4), (5), and (6), the framework is general enough to accommodate any alternate implementation. Based on this structure, we specify measures for data quality in the context of regression analysis.

3. Measures of Data Quality

The expected value of the regression coefficients when Y is regressed on the true values of the attributes X is given by β. When masked attributes are used, it follows from (2) that:

E[Y|M] = β′[I − Σ(Σ+Ω)⁻¹]µ + β′Σ(Σ+Ω)⁻¹M (7)

where I is the (K×K) identity matrix. Hence, the expected value of the coefficients obtained by regressing Y on M is:

b = (Σ+Ω)⁻¹Σβ (8)

and the bias in the regression coefficients is:

B = β − b = [I − (Σ+Ω)⁻¹Σ]β (9)

Following the work of Kim [8], we use the bias B as a measure of data quality. This choice is further supported by the extensive literature on errors in variables (see, e.g., [5]). An alternate measure of data quality may be based on the squared correlation coefficient R². In a regression, R² represents the proportion of the variance in the outcome variable that is explained by the explanatory variables. Tendick's work [13] on optimal noise addition methods motivates the choice of this measure. Within our framework we take the ratio of the R² obtained when Y is regressed on the masked attributes M to that obtained by regressing Y on the true values X. The lower the ratio, the worse the quality of the masked data.
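The partitioned noise covariance matrices in (5) and (6) can be constructed directly. A minimal sketch, again with a hypothetical Σ and assuming the first p attributes are the sensitive ones:

```python
import numpy as np

Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])   # hypothetical attribute covariance
alpha = 0.25
p = 2   # suppose only the first p attributes are sensitive

K = Sigma.shape[0]

# Scheme (5): Omega_11 = alpha * Sigma_11, zeros everywhere else.
Omega_prop = np.zeros((K, K))
Omega_prop[:p, :p] = alpha * Sigma[:p, :p]

# Scheme (6): independent noise, Omega_11 = alpha * Diagonal(Sigma_11).
Omega_ind = np.zeros((K, K))
Omega_ind[:p, :p] = np.diag(alpha * np.diag(Sigma)[:p])

print(Omega_prop)
print(Omega_ind)
```

Either matrix can then be plugged into equations (7)–(9) in place of Ω.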
Using (1) and (2) it can be shown that this ratio is:

γ = R²_M / R²_X = β′Σ(Σ+Ω)⁻¹Σβ / (β′Σβ) (10)

Under alternate noise addition schemes, the bias B and the ratio γ of squared correlation coefficients may be computed to evaluate the impact of the masking method on data quality when the data is used for regression analysis. The relative merits of these candidate measures will also be discussed.

4. Comparison of Alternate Additive Noise Data Masking Schemes

In this section we illustrate how the framework may be used to compare implementations of data perturbation where all attributes are masked with those where only the sensitive attributes are masked. In the latter case we consider both the addition of noise with covariance matrix proportional to the covariance matrix of the attributes and implementations where the noise components are statistically independent.

Case I

Under this implementation noise is added not only to the sensitive attributes but also to the non-sensitive attributes. The covariance matrix of the noise vector is proportional to the covariance matrix of the attribute vector, as specified in (3).

The issue of bias. Substituting (3) in (8), we find the expected value of the regression coefficients:

b_i = β_i/(1+α) for i = 1, ..., K (11)

and hence the bias:

B_i = αβ_i/(1+α) for i = 1, ..., K (12)

This implies that under such a scheme all the regression coefficients are attenuated by the same factor 1/(1+α). For instance, when the variances of the additive noise components are 25% of the variances of the corresponding attributes (as may be the case in a practical data masking scheme), the attenuation in the regression coefficients is 20%. More significantly, if the proportionality factor α is published, the necessary adjustments can be made to obtain unbiased estimates for all coefficients.

The issue of reduction in R². The ratio γ of the squared correlation coefficients may be computed by substituting (3) in (10).
This measure of data quality is given by:

γ = 1/(1+α) (13)

The result implies that for additive noise with variance equal to 25% of the variance of the attributes, the R² for the regression decreases by 20% when the masked values rather than the true values are used as regressors. This reduction in R² has implications for the confidence that the user has in estimates obtained using masked data.

Case II

Under this implementation noise is added only to the subset of sensitive attributes. The covariance matrix of the noise vector is proportional to the covariance matrix of the corresponding sensitive attributes.

The issue of bias. Substituting (5) in (8), the expected value of the coefficients has the form:

b = [P₁₁, 0; P₂₁, I] β (14)

where I is an identity matrix and P₁₁ and P₂₁ are (p×p) and ((K−p)×p) matrices with elements that are functions of α and Σ. The exact expressions are complex (interested readers may obtain them by inverting the partitioned matrix using the Binomial Inverse Theorem (Woodbury 1950)), but they provide some insight into the nature of the biases introduced. The coefficients of the sensitive attributes are biased but not affected by the coefficients of the unperturbed attributes. However, the coefficients of the unperturbed attributes are influenced by those of the sensitive attributes. More significantly, the magnitude and sign of the bias in the coefficient estimates do not follow any easily discernible pattern. Some of the coefficients may be attenuated while others are amplified. In general, the signs of the coefficients may even be reversed, depending on the correlations between the attributes. Hence, unlike in the case where noise is added to all attributes, adding noise to only the sensitive attributes prevents the legitimate user from adopting convenient procedures to compensate for the biases introduced by noise addition. The estimates provided may be quite inaccurate and misleading, especially as the levels of noise increase.
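Equation (8) makes the contrast between the two cases easy to compute directly. The sketch below (hypothetical Σ and β, with α = 0.25) shows the uniform attenuation of Case I alongside the irregular biases of Case II, including the bias that spills into an unmasked coefficient:

```python
import numpy as np

Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])   # hypothetical attribute covariance
beta = np.array([1.0, 1.0, 1.0])      # hypothetical true coefficients
alpha, p = 0.25, 2

def expected_coefficients(Sigma, Omega, beta):
    # Equation (8): b = (Sigma + Omega)^{-1} Sigma beta.
    return np.linalg.solve(Sigma + Omega, Sigma @ beta)

# Case I: mask everything with Omega = alpha * Sigma -> uniform attenuation 1/(1+alpha).
b_full = expected_coefficients(Sigma, alpha * Sigma, beta)

# Case II: mask only the first p attributes -> attribute-dependent biases.
Omega_part = np.zeros_like(Sigma)
Omega_part[:p, :p] = alpha * Sigma[:p, :p]
b_part = expected_coefficients(Sigma, Omega_part, beta)

print(b_full)   # every coefficient equals 1/(1+0.25) = 0.8 times its true value
print(b_part)   # biases differ across coefficients, including the unmasked one
```

For this particular Σ the coefficient of the unmasked third attribute is amplified rather than attenuated, illustrating the irregular spillover described above.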
The issue of reduction in R². The ratio γ of the R² for perturbed and unperturbed data may be obtained by substituting (5) in (10). This measure of data quality has the form:

γ = β′ [Q₁₁, Σ₁₂; Σ₂₁, Σ₂₂] β / (β′Σβ) (15)

The matrices in the numerator and denominator differ only in the elements of the (p×p) sub-matrix corresponding to the sensitive attributes. The reduction in R² is hence typically small, especially when only a small subset of the attributes is masked. Further, it can be shown that the ratio is greater than 1/(1+α) (the proof is beyond the scope of this article, but a special case is established in equations (18) and (19)). Thus, based on this measure alone, adding noise to only the subset of sensitive attributes seems more attractive than adding noise to all attributes. This can be intuitively explained by the fact that the percentage of the variation in the outcome variable explained by the regressors is higher when fewer regressors are disturbed. However, concerns about biases in the coefficient estimates may far outweigh the lower reduction in R², and this measure may not be an appropriate candidate for evaluating data quality.

Case III

When independent noise is added, but only the sensitive attributes are masked, the covariance matrix of the noise vector is specified by (6). The expressions for the bias and the reduction in R² are complex and are treated exhaustively in the literature on errors in variables (see [4]). Rather than reproducing the results, we draw on this body of knowledge to investigate the data quality implications of data perturbation. To keep the analysis simple we consider the case when only one (i = 1) of the K attributes is sensitive. Notice that this implementation is equivalent to the previous case with p = 1.

The issue of bias. Based on results from [5] it can be shown that:

b₁ = β₁/(1 + α/(1 − R₁²)) (16)

b_i = β_i + αδ_iβ₁/(1 + α − R₁²) for i = 2, ..., K (17)

where R₁² is the squared correlation coefficient obtained by regressing the sensitive attribute X₁ on the K−1 non-sensitive attributes, and δ_i is the regression coefficient for X_i when X₁ is regressed on the other K−1 non-sensitive attributes. Comparing these results to those obtained in Case I, the following observations can be made:

a) For the sensitive attribute, the regression coefficient is attenuated. Moreover, the attenuation is greater than in the previous case, when noise is added to all K attributes.
This is obvious from (11) and (16), since the squared correlation coefficient R₁² is less than 1.

b) The bias in the coefficient of a non-sensitive (and hence unmasked) attribute is proportional to the bias in the coefficient of the sensitive attribute: from (9), (16), and (17), B_i = −δ_iB₁ for i = 2, ..., K. Note that the coefficients of the non-sensitive attributes are not necessarily attenuated; the bias may be positive or negative.

c) The absolute value of the bias in the coefficient of a non-sensitive attribute is smaller here than in Case I iff R₁² < (1+α)(1 − |δ_iβ₁/β_i|). Hence, it is not necessarily true that adding noise to non-sensitive attributes increases the bias of the regression coefficient estimates.

Results based on models with more than one disturbed regressor provide similar insights. The biases in the coefficients are not necessarily increased when noise is added to all the attributes. Hence, if the bias in the regression coefficients is taken as the measure of data quality, then adding independent noise provides no advantage over the methods proposed by Kim [8] and Tendick [13].

The issue of reduction in R². In investigating the effect on R², it can be shown that the matrix Σ(Σ+Ω)⁻¹Σ differs from the covariance matrix Σ only in its first-row, first-column element. Hence the ratio of the squared correlation coefficients may be written in terms of this difference c as:

γ = 1 − αβ₁²c/(β′Σβ) (18)

with

c < (β′Σβ)/(β₁²(1+α)) (19)

Comparing γ when all attributes are masked (13) to the case when only the first attribute is masked ((18) and (19)), we find that the ratio of squared correlation coefficients is always greater in the latter case. This is because the reduction in the percentage of variation in the outcome variable explained by the regressors is smaller when only a single regressor is disturbed. Notice that the results are consistent with those obtained for Case II.
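The closed forms (16) and (17) can be checked numerically against the general expression b = (Σ+Ω)⁻¹Σβ from (8). A sketch with hypothetical values (R₁² and δ are obtained from the population regression of X₁ on the remaining attributes):

```python
import numpy as np

Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])   # hypothetical attribute covariance
beta = np.array([1.0, 0.5, -0.5])     # hypothetical true coefficients
alpha = 0.25

# Independent noise on the single sensitive attribute X_1:
# Omega = diag(alpha * sigma_1^2, 0, ..., 0).
Omega = np.zeros_like(Sigma)
Omega[0, 0] = alpha * Sigma[0, 0]
b = np.linalg.solve(Sigma + Omega, Sigma @ beta)   # equation (8)

# R_1^2 and delta from the regression of X_1 on X_2, ..., X_K.
delta = np.linalg.solve(Sigma[1:, 1:], Sigma[1:, 0])
R2_1 = (Sigma[0, 1:] @ delta) / Sigma[0, 0]

b1_closed = beta[0] / (1 + alpha / (1 - R2_1))                       # (16)
bi_closed = beta[1:] + alpha * delta * beta[0] / (1 + alpha - R2_1)  # (17)

print(b)
print(b1_closed, bi_closed)
```

The two routes agree exactly, which also makes the proportionality B_i ∝ δ_iB₁ easy to verify for any chosen Σ.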
In view of the fact that the coefficients can be estimated with much greater accuracy under Kim's method, the higher R² obtained by masking only the sensitive attribute with independent noise may not be sufficient justification for that implementation. We now use an example to illustrate the results obtained in this section. In addition to the three cases analyzed, we also consider an implementation in which independent noise is added to all the attributes in the database.

5. An Example

Consider a database with eight attributes, of which only the first three are considered sensitive. Assume that the attributes are multinormally distributed with a covariance matrix Σ given by:

[8×8 covariance matrix Σ for attributes x1, ..., x8 not reproduced here.]

Case I. Consider the case when noise is added to all attributes with the covariance of the noise vector Ω = αΣ. For a choice of α = 0.25, the expected values of the coefficients estimated using masked data are b_i = 0.8β_i for i = 1, ..., 8, and the ratio of squared correlation coefficients is γ = 0.8. These are consistent with the theoretical results obtained in the previous section. Users may compensate for the systematic bias if the value of the parameter α is known.

Case II. When noise is added to only the three sensitive attributes, the covariance matrix has the structure specified by (5). The expected values of the coefficients are given by (Σ+Ω)⁻¹Σβ, and for α = 0.25 the matrix (Σ+Ω)⁻¹Σ can be computed [matrix not reproduced here]. To get a sense of the magnitude and sign of the biases, consider the case where the true regression coefficients are all one (that is, β = [1 1 1 1 1 1 1 1]′). In this case the expected values of the coefficients estimated using the masked data can be computed [values not reproduced here]. Notice that bias is introduced in the coefficient estimates for both the masked and the unperturbed attributes. An estimated coefficient can be more than 17 times the true value of the coefficient. The signs of some coefficients are reversed (as is the case with b₁ and b₅). In general, the biases do not follow a pattern and are determined by the covariance matrix Σ. The ratio of squared correlation coefficients is given by (10); for this example γ = 519.9/521 ≈ 0.998. This surprisingly high ratio is due to the fact that the elements by which the matrix Σ(Σ+Ω)⁻¹Σ differs from Σ are small compared to the elements they have in common. However, as the biases indicate, this ratio may be a poor measure of data quality.
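The simple correction available in Case I can be demonstrated end to end: a user who knows the published α regresses Y on the masked data and multiplies the estimates by (1+α). A simulation sketch with a hypothetical 3-attribute Σ, hypothetical β, and unit disturbance variance (none of these values come from the paper's example):

```python
import numpy as np

# Hypothetical setup: 3 regressors, all masked with Omega = alpha * Sigma (Case I).
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.5],
                  [0.8, 0.5, 1.0]])
beta = np.array([1.0, -0.5, 2.0])
alpha = 0.25

rng = np.random.default_rng(7)
n = 200_000
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
Y = X @ beta + rng.normal(0.0, 1.0, size=n)                      # CNLR model (1)
M = X + rng.multivariate_normal(np.zeros(3), alpha * Sigma, size=n)

b_masked, *_ = np.linalg.lstsq(M, Y, rcond=None)   # attenuated by 1/(1+alpha)
b_corrected = (1 + alpha) * b_masked               # correction when alpha is published

print(np.round(b_masked, 3))     # roughly 0.8 * beta
print(np.round(b_corrected, 3))  # roughly beta
```

No such uniform rescaling exists for Cases II and III, which is the practical cost of masking only the sensitive subset.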
Case III. Now assume that noise is added to the sensitive attributes only and that the noise components are independent of each other, as characterized in (6). For α = 0.25 the matrix (Σ+Ω)⁻¹Σ can be computed [matrix not reproduced here]. Again, when the true regression coefficients are given by β = [1 1 1 1 1 1 1 1]′, the expected values of the coefficients estimated using the masked data can be computed [values not reproduced here]. Notice that there is considerable variation in the magnitude and sign of the bias; the signs of some coefficients are reversed. Comparing the coefficient estimates to those obtained in Case II, we find that the magnitude and sign of the biases are quite similar for each coefficient regardless of whether the noise components for the masked attributes are independent or follow the same covariance pattern as the vector of attributes. The ratio of squared correlation coefficients (10) can be computed for this example using the matrix Σ(Σ+Ω)⁻¹Σ; with β = [1 1 1 1 1 1 1 1]′ it is γ = 514.4/521 ≈ 0.987. Again, the high ratio is a result of the fact that only a few regressors are disturbed and should not be taken to indicate high data quality.

Case IV. Finally we consider an implementation in which all attributes are masked using independent noise (4). For α = 0.25 the matrix (Σ+Ω)⁻¹Σ can be computed [matrix not reproduced here]. With the true regression coefficients given by β = [1 1 1 1 1 1 1 1]′, the expected values of the coefficients estimated using the masked data can be computed [values not reproduced here]. Notice that the signs of the coefficients are all consistent. Moreover, the biases are not very significant (except in the case of the last coefficient) and are of the same order as those obtained using Kim's method. Compared to implementations that mask only the subset of sensitive attributes, adding independent noise to all attributes is much more desirable. The ratio of squared correlation coefficients (10) can be computed using the matrix Σ(Σ+Ω)⁻¹Σ; with β = [1 1 1 1 1 1 1 1]′ it is γ = 468.2/521 ≈ 0.899. This is lower than that obtained when only the sensitive attributes are masked, but higher than that obtained with Kim's method. The only significant disadvantage of this method is that the procedure to compensate for the biases is not as simple as in the first case. This example reinforces the analytical results obtained in the last section; the significant results are summarized in the discussion that follows.

6. Discussion

This article develops a framework that may be used to analyze the implications of additive noise data masking on data quality when the data is used for regression analysis.
Two measures of data quality that have been proposed in the literature are considered: the bias in the regression coefficient estimates and the proportion of the variation (R²) in the outcome variable that is explained by the regressors. The framework is used to investigate whether noise should be added to non-sensitive attributes when only a subset of attributes in the database is considered sensitive. Two common implementations of noise addition schemes are considered. The first adds noise with the covariance matrix of the noise vector proportional to the covariance matrix of the attribute vector; under the second scheme the noise components are statistically independent of each other. Based on our analysis the following observations can be made:

1. Of the two measures of data quality considered, the bias in the coefficient estimates is the more appropriate one. This is because the reduction in R² is strongly influenced by the proportion of attributes masked and may be misleading about the accuracy of the statistical estimates.

2. Adding noise to all the attributes is preferable to masking only the subset of sensitive attributes. The estimated coefficients are much more accurate in the former case. This is true regardless of the covariance structure of the noise vector.

3. Using noise with a covariance matrix proportional to the covariance matrix of the attributes (as proposed by Kim [8] and Tendick [13]) is preferable to adding statistically independent noise. Preserving the covariance structure of the data allows simple procedures to compensate for the bias introduced by noise addition.

Our results provide practical guidance to database administrators and data users. Database administrators have the conflicting mandates of protecting sensitive data while ensuring access to good quality data for legitimate statistical analysis. The literature has not considered the typical case when some of the attributes in the database are non-sensitive. In our investigation of this issue we focus on data quality, since security is not a concern for non-sensitive data. Our analysis indicates that both sensitive and non-sensitive data should be masked using the method proposed by Kim [8] and Tendick [13]. Under this implementation we specify a simple procedure for users to correct the estimates for bias. The only major drawback of this approach is that it adversely affects users interested in univariate statistics of non-sensitive attributes. Our analysis is restricted to the classical normal linear regression model; generalizations of our results pose significant and interesting research challenges. We relate our results to the literature on errors-in-variables models, since this extensive body of knowledge may prove useful in identifying and answering questions on data security and data quality.

References

[1] Adam, N.R. and Wortmann, J.C., "Security-Control Methods for Statistical Databases: A Comparative Study," ACM Computing Surveys, 1989, 21, 4.

[2] Duncan, G.T. and Pearson, R.W., "Enhancing Access to Data while Protecting Confidentiality: Prospects for the Future," Statistical Science, 1991, 6, 3.

[3] Fuller, W.A., "Masking Procedures for Microdata Disclosure Limitation," Journal of Official Statistics, 1993, 9, 2.

[4] Fuller, W.A., Measurement Error Models, New York: John Wiley.

[5] Garber, S. and Klepper, S., "Extending the Classical Normal Errors-in-Variables Model," Econometrica, 1980, 48, 6.

[6] Jabine, T.B., "Statistical Disclosure Limitation Practices of the United States Statistical Agencies," Journal of Official Statistics, 1993, 9, 2.

[7] Landwehr, C.E. (editor), Data Security: Status and Prospects, New York: North Holland.

[8] Kim, J., "A Method for Limiting Disclosure in Microdata Based on Random Noise and Transformation," ASA Proceedings of the Survey Research Method Section, 1986.

[9] Matloff, N.S., "Another Look at Noise Addition for Database Security," Proceedings of the IEEE Computer Society Symposium on Research in Security and Privacy, 1986.

[10] Muralidhar, K. and Batra, D., "Accessibility, Security, and Accuracy in Statistical Databases: The Case for the Multiplicative Fixed Data Perturbation Approach," Management Science, 1995, 41, 9.

[11] Straub, D.W. and Collins, R.W., "Key Information Liability Issues Facing Managers: Software Piracy, Proprietary Databases, and Individual Rights to Privacy," MIS Quarterly, 1990, 14.

[12] Strong, D.M., Lee, Y.W., and Wang, R.Y., "Data Quality in Context," Communications of the ACM, 1997, 40, 5.

[13] Tendick, P., "Optimal Noise Addition for the Preservation of Confidentiality in Multivariate Data," Journal of Statistical Planning and Inference, 1991, 27.


Pooling multiple imputations when the sample happens to be the population.

Pooling multiple imputations when the sample happens to be the population. Pooling multiple imputations when the sample happens to be the population. Gerko Vink 1,2, and Stef van Buuren 1,3 arxiv:1409.8542v1 [math.st] 30 Sep 2014 1 Department of Methodology and Statistics, Utrecht

More information

Very simple marginal effects in some discrete choice models *

Very simple marginal effects in some discrete choice models * UCD GEARY INSTITUTE DISCUSSION PAPER SERIES Very simple marginal effects in some discrete choice models * Kevin J. Denny School of Economics & Geary Institute University College Dublin 2 st July 2009 *

More information

Least Squares Estimation-Finite-Sample Properties

Least Squares Estimation-Finite-Sample Properties Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions

More information

Bootstrap Approach to Comparison of Alternative Methods of Parameter Estimation of a Simultaneous Equation Model

Bootstrap Approach to Comparison of Alternative Methods of Parameter Estimation of a Simultaneous Equation Model Bootstrap Approach to Comparison of Alternative Methods of Parameter Estimation of a Simultaneous Equation Model Olubusoye, O. E., J. O. Olaomi, and O. O. Odetunde Abstract A bootstrap simulation approach

More information

Why is the field of statistics still an active one?

Why is the field of statistics still an active one? Why is the field of statistics still an active one? It s obvious that one needs statistics: to describe experimental data in a compact way, to compare datasets, to ask whether data are consistent with

More information

Heteroskedasticity-Consistent Covariance Matrix Estimators in Small Samples with High Leverage Points

Heteroskedasticity-Consistent Covariance Matrix Estimators in Small Samples with High Leverage Points Theoretical Economics Letters, 2016, 6, 658-677 Published Online August 2016 in SciRes. http://www.scirp.org/journal/tel http://dx.doi.org/10.4236/tel.2016.64071 Heteroskedasticity-Consistent Covariance

More information

Differentially Private Linear Regression

Differentially Private Linear Regression Differentially Private Linear Regression Christian Baehr August 5, 2017 Your abstract. Abstract 1 Introduction My research involved testing and implementing tools into the Harvard Privacy Tools Project

More information

Quantitative Analysis of Financial Markets. Summary of Part II. Key Concepts & Formulas. Christopher Ting. November 11, 2017

Quantitative Analysis of Financial Markets. Summary of Part II. Key Concepts & Formulas. Christopher Ting. November 11, 2017 Summary of Part II Key Concepts & Formulas Christopher Ting November 11, 2017 christopherting@smu.edu.sg http://www.mysmu.edu/faculty/christophert/ Christopher Ting 1 of 16 Why Regression Analysis? Understand

More information

A Note on Bayesian Inference After Multiple Imputation

A Note on Bayesian Inference After Multiple Imputation A Note on Bayesian Inference After Multiple Imputation Xiang Zhou and Jerome P. Reiter Abstract This article is aimed at practitioners who plan to use Bayesian inference on multiplyimputed datasets in

More information

SOME BASICS OF TIME-SERIES ANALYSIS

SOME BASICS OF TIME-SERIES ANALYSIS SOME BASICS OF TIME-SERIES ANALYSIS John E. Floyd University of Toronto December 8, 26 An excellent place to learn about time series analysis is from Walter Enders textbook. For a basic understanding of

More information

6.435, System Identification

6.435, System Identification System Identification 6.435 SET 3 Nonparametric Identification Munther A. Dahleh 1 Nonparametric Methods for System ID Time domain methods Impulse response Step response Correlation analysis / time Frequency

More information

Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions

Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions JKAU: Sci., Vol. 21 No. 2, pp: 197-212 (2009 A.D. / 1430 A.H.); DOI: 10.4197 / Sci. 21-2.2 Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions Ali Hussein Al-Marshadi

More information

6. Assessing studies based on multiple regression

6. Assessing studies based on multiple regression 6. Assessing studies based on multiple regression Questions of this section: What makes a study using multiple regression (un)reliable? When does multiple regression provide a useful estimate of the causal

More information

Specification Errors, Measurement Errors, Confounding

Specification Errors, Measurement Errors, Confounding Specification Errors, Measurement Errors, Confounding Kerby Shedden Department of Statistics, University of Michigan October 10, 2018 1 / 32 An unobserved covariate Suppose we have a data generating model

More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit

More information

Outline. Possible Reasons. Nature of Heteroscedasticity. Basic Econometrics in Transportation. Heteroscedasticity

Outline. Possible Reasons. Nature of Heteroscedasticity. Basic Econometrics in Transportation. Heteroscedasticity 1/25 Outline Basic Econometrics in Transportation Heteroscedasticity What is the nature of heteroscedasticity? What are its consequences? How does one detect it? What are the remedial measures? Amir Samimi

More information

1. The General Linear-Quadratic Framework

1. The General Linear-Quadratic Framework ECO 317 Economics of Uncertainty Fall Term 2009 Slides to accompany 21. Incentives for Effort - Multi-Dimensional Cases 1. The General Linear-Quadratic Framework Notation: x = (x j ), n-vector of agent

More information

Selection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models

Selection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models Selection Criteria Based on Monte Carlo Simulation and Cross Validation in Mixed Models Junfeng Shang Bowling Green State University, USA Abstract In the mixed modeling framework, Monte Carlo simulation

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Robustness of Logit Analysis: Unobserved Heterogeneity and Misspecified Disturbances

Robustness of Logit Analysis: Unobserved Heterogeneity and Misspecified Disturbances Discussion Paper: 2006/07 Robustness of Logit Analysis: Unobserved Heterogeneity and Misspecified Disturbances J.S. Cramer www.fee.uva.nl/ke/uva-econometrics Amsterdam School of Economics Department of

More information

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors:

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors: Wooldridge, Introductory Econometrics, d ed. Chapter 3: Multiple regression analysis: Estimation In multiple regression analysis, we extend the simple (two-variable) regression model to consider the possibility

More information

6 Pattern Mixture Models

6 Pattern Mixture Models 6 Pattern Mixture Models A common theme underlying the methods we have discussed so far is that interest focuses on making inference on parameters in a parametric or semiparametric model for the full data

More information

Interpreting Regression Results

Interpreting Regression Results Interpreting Regression Results Carlo Favero Favero () Interpreting Regression Results 1 / 42 Interpreting Regression Results Interpreting regression results is not a simple exercise. We propose to split

More information

Inference about Clustering and Parametric. Assumptions in Covariance Matrix Estimation

Inference about Clustering and Parametric. Assumptions in Covariance Matrix Estimation Inference about Clustering and Parametric Assumptions in Covariance Matrix Estimation Mikko Packalen y Tony Wirjanto z 26 November 2010 Abstract Selecting an estimator for the variance covariance matrix

More information

SIMILAR-ON-THE-BOUNDARY TESTS FOR MOMENT INEQUALITIES EXIST, BUT HAVE POOR POWER. Donald W. K. Andrews. August 2011

SIMILAR-ON-THE-BOUNDARY TESTS FOR MOMENT INEQUALITIES EXIST, BUT HAVE POOR POWER. Donald W. K. Andrews. August 2011 SIMILAR-ON-THE-BOUNDARY TESTS FOR MOMENT INEQUALITIES EXIST, BUT HAVE POOR POWER By Donald W. K. Andrews August 2011 COWLES FOUNDATION DISCUSSION PAPER NO. 1815 COWLES FOUNDATION FOR RESEARCH IN ECONOMICS

More information

The LIML Estimator Has Finite Moments! T. W. Anderson. Department of Economics and Department of Statistics. Stanford University, Stanford, CA 94305

The LIML Estimator Has Finite Moments! T. W. Anderson. Department of Economics and Department of Statistics. Stanford University, Stanford, CA 94305 The LIML Estimator Has Finite Moments! T. W. Anderson Department of Economics and Department of Statistics Stanford University, Stanford, CA 9435 March 25, 2 Abstract The Limited Information Maximum Likelihood

More information

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Dale J. Poirier University of California, Irvine September 1, 2008 Abstract This paper

More information

Chapter 2 Sampling for Biostatistics

Chapter 2 Sampling for Biostatistics Chapter 2 Sampling for Biostatistics Angela Conley and Jason Pfefferkorn Abstract Define the terms sample, population, and statistic. Introduce the concept of bias in sampling methods. Demonstrate how

More information

Differential Privacy in an RKHS

Differential Privacy in an RKHS Differential Privacy in an RKHS Rob Hall (with Larry Wasserman and Alessandro Rinaldo) 2/20/2012 rjhall@cs.cmu.edu http://www.cs.cmu.edu/~rjhall 1 Overview Why care about privacy? Differential Privacy

More information

POL 681 Lecture Notes: Statistical Interactions

POL 681 Lecture Notes: Statistical Interactions POL 681 Lecture Notes: Statistical Interactions 1 Preliminaries To this point, the linear models we have considered have all been interpreted in terms of additive relationships. That is, the relationship

More information

Lecture Notes on Measurement Error

Lecture Notes on Measurement Error Steve Pischke Spring 2000 Lecture Notes on Measurement Error These notes summarize a variety of simple results on measurement error which I nd useful. They also provide some references where more complete

More information

Testing Homogeneity Of A Large Data Set By Bootstrapping

Testing Homogeneity Of A Large Data Set By Bootstrapping Testing Homogeneity Of A Large Data Set By Bootstrapping 1 Morimune, K and 2 Hoshino, Y 1 Graduate School of Economics, Kyoto University Yoshida Honcho Sakyo Kyoto 606-8501, Japan. E-Mail: morimune@econ.kyoto-u.ac.jp

More information

Estimation of the Conditional Variance in Paired Experiments

Estimation of the Conditional Variance in Paired Experiments Estimation of the Conditional Variance in Paired Experiments Alberto Abadie & Guido W. Imbens Harvard University and BER June 008 Abstract In paired randomized experiments units are grouped in pairs, often

More information

Confidence Estimation Methods for Neural Networks: A Practical Comparison

Confidence Estimation Methods for Neural Networks: A Practical Comparison , 6-8 000, Confidence Estimation Methods for : A Practical Comparison G. Papadopoulos, P.J. Edwards, A.F. Murray Department of Electronics and Electrical Engineering, University of Edinburgh Abstract.

More information

Chapter 8 Heteroskedasticity

Chapter 8 Heteroskedasticity Chapter 8 Walter R. Paczkowski Rutgers University Page 1 Chapter Contents 8.1 The Nature of 8. Detecting 8.3 -Consistent Standard Errors 8.4 Generalized Least Squares: Known Form of Variance 8.5 Generalized

More information

Multinomial Data. f(y θ) θ y i. where θ i is the probability that a given trial results in category i, i = 1,..., k. The parameter space is

Multinomial Data. f(y θ) θ y i. where θ i is the probability that a given trial results in category i, i = 1,..., k. The parameter space is Multinomial Data The multinomial distribution is a generalization of the binomial for the situation in which each trial results in one and only one of several categories, as opposed to just two, as in

More information

Sigmaplot di Systat Software

Sigmaplot di Systat Software Sigmaplot di Systat Software SigmaPlot Has Extensive Statistical Analysis Features SigmaPlot is now bundled with SigmaStat as an easy-to-use package for complete graphing and data analysis. The statistical

More information

PRINCIPAL COMPONENTS ANALYSIS

PRINCIPAL COMPONENTS ANALYSIS 121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves

More information

Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems

Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems Wooldridge, Introductory Econometrics, 3d ed. Chapter 9: More on specification and data problems Functional form misspecification We may have a model that is correctly specified, in terms of including

More information

Variable Selection and Model Building

Variable Selection and Model Building LINEAR REGRESSION ANALYSIS MODULE XIII Lecture - 37 Variable Selection and Model Building Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur The complete regression

More information

Efficient Choice of Biasing Constant. for Ridge Regression

Efficient Choice of Biasing Constant. for Ridge Regression Int. J. Contemp. Math. Sciences, Vol. 3, 008, no., 57-536 Efficient Choice of Biasing Constant for Ridge Regression Sona Mardikyan* and Eyüp Çetin Department of Management Information Systems, School of

More information

Statistical Practice

Statistical Practice Statistical Practice A Note on Bayesian Inference After Multiple Imputation Xiang ZHOU and Jerome P. REITER This article is aimed at practitioners who plan to use Bayesian inference on multiply-imputed

More information

ACE 564 Spring Lecture 8. Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information. by Professor Scott H.

ACE 564 Spring Lecture 8. Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information. by Professor Scott H. ACE 564 Spring 2006 Lecture 8 Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information by Professor Scott H. Irwin Readings: Griffiths, Hill and Judge. "Collinear Economic Variables,

More information

Linear Regression Models

Linear Regression Models Linear Regression Models Model Description and Model Parameters Modelling is a central theme in these notes. The idea is to develop and continuously improve a library of predictive models for hazards,

More information

Econometrics II. Nonstandard Standard Error Issues: A Guide for the. Practitioner

Econometrics II. Nonstandard Standard Error Issues: A Guide for the. Practitioner Econometrics II Nonstandard Standard Error Issues: A Guide for the Practitioner Måns Söderbom 10 May 2011 Department of Economics, University of Gothenburg. Email: mans.soderbom@economics.gu.se. Web: www.economics.gu.se/soderbom,

More information

Working Papers in Econometrics and Applied Statistics

Working Papers in Econometrics and Applied Statistics T h e U n i v e r s i t y o f NEW ENGLAND Working Papers in Econometrics and Applied Statistics Finite Sample Inference in the SUR Model Duangkamon Chotikapanich and William E. Griffiths No. 03 - April

More information

A NONLINEARITY MEASURE FOR ESTIMATION SYSTEMS

A NONLINEARITY MEASURE FOR ESTIMATION SYSTEMS AAS 6-135 A NONLINEARITY MEASURE FOR ESTIMATION SYSTEMS Andrew J. Sinclair,JohnE.Hurtado, and John L. Junkins The concept of nonlinearity measures for dynamical systems is extended to estimation systems,

More information

Section I. Define or explain the following terms (3 points each) 1. centered vs. uncentered 2 R - 2. Frisch theorem -

Section I. Define or explain the following terms (3 points each) 1. centered vs. uncentered 2 R - 2. Frisch theorem - First Exam: Economics 388, Econometrics Spring 006 in R. Butler s class YOUR NAME: Section I (30 points) Questions 1-10 (3 points each) Section II (40 points) Questions 11-15 (10 points each) Section III

More information

0 0'0 2S ~~ Employment category

0 0'0 2S ~~ Employment category Analyze Phase 331 60000 50000 40000 30000 20000 10000 O~----,------.------,------,,------,------.------,----- N = 227 136 27 41 32 5 ' V~ 00 0' 00 00 i-.~ fl' ~G ~~ ~O~ ()0 -S 0 -S ~~ 0 ~~ 0 ~G d> ~0~

More information

Regression M&M 2.3 and 10. Uses Curve fitting Summarization ('model') Description Prediction Explanation Adjustment for 'confounding' variables

Regression M&M 2.3 and 10. Uses Curve fitting Summarization ('model') Description Prediction Explanation Adjustment for 'confounding' variables Uses Curve fitting Summarization ('model') Description Prediction Explanation Adjustment for 'confounding' variables MALES FEMALES Age. Tot. %-ile; weight,g Tot. %-ile; weight,g wk N. 0th 50th 90th No.

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

,i = 1,2,L, p. For a sample of size n, let the columns of data be

,i = 1,2,L, p. For a sample of size n, let the columns of data be MAC IIci: Miller Asymptotics Chapter 5: Regression Section?: Asymptotic Relationship Between a CC and its Associated Slope Estimates in Multiple Linear Regression The asymptotic null distribution of a

More information

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM Subject Business Economics Paper No and Title Module No and Title Module Tag 8, Fundamentals of Econometrics 3, The gauss Markov theorem BSE_P8_M3 1 TABLE OF CONTENTS 1. INTRODUCTION 2. ASSUMPTIONS OF

More information

GLS and FGLS. Econ 671. Purdue University. Justin L. Tobias (Purdue) GLS and FGLS 1 / 22

GLS and FGLS. Econ 671. Purdue University. Justin L. Tobias (Purdue) GLS and FGLS 1 / 22 GLS and FGLS Econ 671 Purdue University Justin L. Tobias (Purdue) GLS and FGLS 1 / 22 In this lecture we continue to discuss properties associated with the GLS estimator. In addition we discuss the practical

More information

Estimation of Production Functions using Average Data

Estimation of Production Functions using Average Data Estimation of Production Functions using Average Data Matthew J. Salois Food and Resource Economics Department, University of Florida PO Box 110240, Gainesville, FL 32611-0240 msalois@ufl.edu Grigorios

More information

LECTURE 2 LINEAR REGRESSION MODEL AND OLS

LECTURE 2 LINEAR REGRESSION MODEL AND OLS SEPTEMBER 29, 2014 LECTURE 2 LINEAR REGRESSION MODEL AND OLS Definitions A common question in econometrics is to study the effect of one group of variables X i, usually called the regressors, on another

More information

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline.

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline. MFM Practitioner Module: Risk & Asset Allocation September 11, 2013 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y

More information

coefficients n 2 are the residuals obtained when we estimate the regression on y equals the (simple regression) estimated effect of the part of x 1

coefficients n 2 are the residuals obtained when we estimate the regression on y equals the (simple regression) estimated effect of the part of x 1 Review - Interpreting the Regression If we estimate: It can be shown that: where ˆ1 r i coefficients β ˆ+ βˆ x+ βˆ ˆ= 0 1 1 2x2 y ˆβ n n 2 1 = rˆ i1yi rˆ i1 i= 1 i= 1 xˆ are the residuals obtained when

More information

Identifying the Monetary Policy Shock Christiano et al. (1999)

Identifying the Monetary Policy Shock Christiano et al. (1999) Identifying the Monetary Policy Shock Christiano et al. (1999) The question we are asking is: What are the consequences of a monetary policy shock a shock which is purely related to monetary conditions

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003.

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003. A Note on Bootstraps and Robustness Tony Lancaster, Brown University, December 2003. In this note we consider several versions of the bootstrap and argue that it is helpful in explaining and thinking about

More information

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima Applied Statistics Lecturer: Serena Arima Hypothesis testing for the linear model Under the Gauss-Markov assumptions and the normality of the error terms, we saw that β N(β, σ 2 (X X ) 1 ) and hence s

More information

Uncertainty due to Finite Resolution Measurements

Uncertainty due to Finite Resolution Measurements Uncertainty due to Finite Resolution Measurements S.D. Phillips, B. Tolman, T.W. Estler National Institute of Standards and Technology Gaithersburg, MD 899 Steven.Phillips@NIST.gov Abstract We investigate

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

IDENTIFICATION OF TREATMENT EFFECTS WITH SELECTIVE PARTICIPATION IN A RANDOMIZED TRIAL

IDENTIFICATION OF TREATMENT EFFECTS WITH SELECTIVE PARTICIPATION IN A RANDOMIZED TRIAL IDENTIFICATION OF TREATMENT EFFECTS WITH SELECTIVE PARTICIPATION IN A RANDOMIZED TRIAL BRENDAN KLINE AND ELIE TAMER Abstract. Randomized trials (RTs) are used to learn about treatment effects. This paper

More information

Mediation analyses. Advanced Psychometrics Methods in Cognitive Aging Research Workshop. June 6, 2016

Mediation analyses. Advanced Psychometrics Methods in Cognitive Aging Research Workshop. June 6, 2016 Mediation analyses Advanced Psychometrics Methods in Cognitive Aging Research Workshop June 6, 2016 1 / 40 1 2 3 4 5 2 / 40 Goals for today Motivate mediation analysis Survey rapidly developing field in

More information

Poisson regression: Further topics

Poisson regression: Further topics Poisson regression: Further topics April 21 Overdispersion One of the defining characteristics of Poisson regression is its lack of a scale parameter: E(Y ) = Var(Y ), and no parameter is available to

More information

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline. Practitioner Course: Portfolio Optimization September 10, 2008 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y ) (x,

More information

Comparison of Some Improved Estimators for Linear Regression Model under Different Conditions

Comparison of Some Improved Estimators for Linear Regression Model under Different Conditions Florida International University FIU Digital Commons FIU Electronic Theses and Dissertations University Graduate School 3-24-2015 Comparison of Some Improved Estimators for Linear Regression Model under

More information

Conservative variance estimation for sampling designs with zero pairwise inclusion probabilities

Conservative variance estimation for sampling designs with zero pairwise inclusion probabilities Conservative variance estimation for sampling designs with zero pairwise inclusion probabilities Peter M. Aronow and Cyrus Samii Forthcoming at Survey Methodology Abstract We consider conservative variance

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Covariance to PCA. CS 510 Lecture #14 February 23, 2018

Covariance to PCA. CS 510 Lecture #14 February 23, 2018 Covariance to PCA CS 510 Lecture 14 February 23, 2018 Overview: Goal Assume you have a gallery (database) of images, and a probe (test) image. The goal is to find the database image that is most similar

More information

Approximate analysis of covariance in trials in rare diseases, in particular rare cancers

Approximate analysis of covariance in trials in rare diseases, in particular rare cancers Approximate analysis of covariance in trials in rare diseases, in particular rare cancers Stephen Senn (c) Stephen Senn 1 Acknowledgements This work is partly supported by the European Union s 7th Framework

More information

Measuring Social Influence Without Bias

Measuring Social Influence Without Bias Measuring Social Influence Without Bias Annie Franco Bobbie NJ Macdonald December 9, 2015 The Problem CS224W: Final Paper How well can statistical models disentangle the effects of social influence from

More information

Lack-of-fit Tests to Indicate Material Model Improvement or Experimental Data Noise Reduction

Lack-of-fit Tests to Indicate Material Model Improvement or Experimental Data Noise Reduction Lack-of-fit Tests to Indicate Material Model Improvement or Experimental Data Noise Reduction Charles F. Jekel and Raphael T. Haftka University of Florida, Gainesville, FL, 32611, USA Gerhard Venter and

More information

4.7 Confidence and Prediction Intervals

4.7 Confidence and Prediction Intervals 4.7 Confidence and Prediction Intervals Instead of conducting tests we could find confidence intervals for a regression coefficient, or a set of regression coefficient, or for the mean of the response

More information

Investigation into the use of confidence indicators with calibration

Investigation into the use of confidence indicators with calibration WORKSHOP ON FRONTIERS IN BENCHMARKING TECHNIQUES AND THEIR APPLICATION TO OFFICIAL STATISTICS 7 8 APRIL 2005 Investigation into the use of confidence indicators with calibration Gerard Keogh and Dave Jennings

More information

WHEN IS A MAXIMAL INVARIANT HYPOTHESIS TEST BETTER THAN THE GLRT? Hyung Soo Kim and Alfred O. Hero

WHEN IS A MAXIMAL INVARIANT HYPOTHESIS TEST BETTER THAN THE GLRT? Hyung Soo Kim and Alfred O. Hero WHEN IS A MAXIMAL INVARIANT HYPTHESIS TEST BETTER THAN THE GLRT? Hyung Soo Kim and Alfred. Hero Department of Electrical Engineering and Computer Science University of Michigan, Ann Arbor, MI 489-222 ABSTRACT

More information