Preserving Privacy in Data Mining using Data Distortion Approach


Mrs. Prachi Karandikar #, Prof. Sachin Deshpande *
# M.E. Comp, VIT, Wadala, University of Mumbai
* VIT, Wadala, University of Mumbai
1. prachiv21@yahoo.co.in
2. sachin.deshpande@vit.edu.in

Abstract. Data mining, the extraction of hidden predictive information from large databases, is essentially the discovery of hidden value in the data warehouse. Because of the increasing ability to trace and collect large amounts of personal information, privacy preservation has become an important concern in data mining applications. Data distortion is one of the well-known techniques for privacy-preserving data mining. The objective of these data perturbation techniques is to distort the individual data values while preserving the underlying statistical distribution properties. The techniques are usually assessed in terms of both their privacy parameters and their associated utility measures. In this paper, we study the use of non-negative matrix factorization (NMF) with sparseness constraints for data distortion.

Keywords: Data Mining, Privacy, Data Distortion, NMF, Sparseness

INTRODUCTION

Data mining [1], the extraction of hidden predictive information from large databases, is a powerful technology with great potential to help companies focus on the most important information in their data warehouses. Several data mining applications deal with privacy-sensitive data such as financial transactions and health care records. Because of the increasing ability to trace and collect large amounts of personal data, privacy preservation has become an important concern in data mining applications, and there is growing concern among citizens about protecting their privacy.

Data is stored either in a centralized database or in a distributed database, and different privacy-preserving techniques are used depending on how it is stored. Among the techniques used for privacy-preserving data mining are generalization, data sanitation, data distortion, blocking, and cryptographic techniques. We focus our study on the data distortion approach, i.e., data distortion via data perturbation. The objective of data perturbation is to distort the individual data values while preserving the underlying statistical distribution properties. These data perturbation techniques are usually assessed in terms of both their privacy parameters and their associated utility measures: while the privacy parameters reflect the ability of a technique to hide the original data values, the data utility measures assess whether the dataset preserves the performance of data mining techniques after the distortion. Our objective is to study the use of truncated non-negative matrix factorization (NMF) with sparseness constraints for data perturbation.

The rest of the paper is organized as follows. In Section 2, we review the non-negative matrix factorization technique. The data distortion measures and the utility measures are reviewed in Sections 3 and 4 respectively. Experimental results are given in Section 5. Conclusion and future scope are given in Section 6.

NONNEGATIVE MATRIX FACTORIZATION

Non-negative matrix factorization (NMF) [8] refers to a class of algorithms that can be formulated as follows: given a nonnegative $n \times r$ data matrix $V$, NMF finds an approximate factorization $V \approx WH$, where $W$ and $H$ are both nonnegative matrices of size $n \times m$ and $m \times r$ respectively.
The reduced rank $m$ of the factorization is generally chosen so that $(n + r)m < nr$, and hence the product $WH$ can be regarded as a compressed form of the data matrix $V$. The optimal choices of the matrices $W$ and $H$ are defined to be those nonnegative matrices that minimize the reconstruction error between $V$ and $WH$.
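As a concrete illustration, the factorization step can be sketched in Python with scikit-learn's NMF solver. The paper does not specify an implementation, so the solver settings and the random stand-in data below are illustrative assumptions:

```python
# Minimal NMF sketch: factor a nonnegative n x r data matrix V into
# W (n x m) and H (m x r) with a small reduced rank m, so that V ~ W @ H.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((569, 30))          # stand-in for an n x r nonnegative dataset

m = 2                              # reduced rank, cf. the experiments below
model = NMF(n_components=m, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(V)         # n x m nonnegative basis matrix
H = model.components_              # m x r nonnegative coefficient matrix

# Frobenius reconstruction error that the factorization minimizes
print("||V - WH||_F =", np.linalg.norm(V - W @ H, "fro"))
```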

Various error functions have been proposed; the most widely used are (i) the squared error (Euclidean distance) function and (ii) the K-L divergence function [2]:

$$ \min_{W,H} \sum_{i,j} \big( V_{ij} - (WH)_{ij} \big)^2, \qquad (1) $$

$$ D(V \,\|\, WH) = \sum_{i,j} \Big( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \Big). \qquad (2) $$

Non-negative matrix factorization requires all entries of both matrices to be non-negative, i.e., the data is described using additive components only.

NMF with Sparseness Constraint

Several measures of sparseness have been proposed. The sparseness of a vector $x$ of dimension $n$ is given by [7]:

$$ S(x) = \frac{\sqrt{n} - \big( \sum_i |x_i| \big) \big/ \sqrt{\sum_i x_i^2}}{\sqrt{n} - 1}. \qquad (3) $$

Most NMF algorithms produce a sparse representation of the data; such a representation encodes much of the data using only a few active components. However, the sparseness obtained by these techniques is a side-effect rather than a controlled parameter, i.e., one cannot control the degree to which the representation is sparse. Our aim is to constrain NMF to find solutions with desired degrees of sparseness. The sparseness constraint can be imposed on $W$, on $H$, or on both. For example, a doctor analyzing a dataset that describes disease patterns might assume that most diseases are rare (hence sparse) but that each disease can cause a large number of symptoms. Assuming that symptoms make up the rows of her matrix and the columns denote different individuals, in this case it is the coefficients that should be sparse and the basis vectors unconstrained. We have studied the projected gradient descent algorithm for NMF with sparseness constraints proposed in [7].

Truncation on NMF with Sparseness Constraint

In order to control the degree of achievable data distortion, the elements in the sparsified matrix $H$ with values less than a specified truncation threshold $\epsilon$ are truncated to zero. The overall data distortion can thus be summarized as follows: (i) perform NMF with sparseness constraint $S_h$ on $H$ to obtain $H_{S_h}$; (ii) truncate the elements in $H_{S_h}$ that are less than $\epsilon$ to obtain $H_{S_h,\epsilon}$. The perturbed dataset is then given by $W H_{S_h,\epsilon}$. The new dataset is therefore distorted twice by the proposed algorithm, which has three parameters: the reduced rank $m$, the sparseness parameter $S_h$, and the truncation threshold $\epsilon$.
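A minimal sketch of this two-stage distortion is given below. It implements Hoyer's sparseness measure from equation (3) and the truncation stage exactly; for brevity, a plain NMF stands in for the projected-gradient, sparseness-constrained algorithm of [7], so the sparseness of $H$ is only measured here rather than enforced:

```python
# Two-stage distortion sketch: (i) NMF on V (a sparseness-constrained NMF in
# the paper, plain NMF here), (ii) truncate entries of H below eps to zero.
import numpy as np
from sklearn.decomposition import NMF

def hoyer_sparseness(x):
    """Sparseness of a vector x of dimension n, equation (3): ranges from 0
    (fully dense) to 1 (only one nonzero component)."""
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

def distort(V, m=2, eps=0.04):
    model = NMF(n_components=m, init="nndsvd", max_iter=500, random_state=0)
    W = model.fit_transform(V)
    H = model.components_
    # The paper defines sparseness for vectors; here it is reported for H
    # flattened into a single vector, purely as a diagnostic.
    print("sparseness of H:", hoyer_sparseness(H.ravel()))
    H_eps = np.where(H < eps, 0.0, H)   # stage (ii): threshold truncation
    return W @ H_eps                    # perturbed dataset W @ H_{Sh,eps}

rng = np.random.default_rng(1)
V = rng.random((569, 30))               # stand-in nonnegative dataset
V_bar = distort(V, m=2, eps=0.04)       # eps value illustrative, cf. Table 2
```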

DATA DISTORTION MEASURES

Throughout this work, we adopt the set of privacy parameters proposed in [5]. The value difference (VD) parameter measures the value difference after the data distortion algorithm is applied to the original data matrix. Let $V$ and $\bar{V}$ denote the original and distorted data matrices respectively. Then VD is given by

$$ VD = \frac{\| V - \bar{V} \|_F}{\| V \|_F}, \qquad (4) $$

where $\| \cdot \|_F$ denotes the Frobenius norm of the enclosed argument.

After a data distortion, the order of the values of the data elements also changes, and several metrics are used to measure this position difference. For a dataset $V$ with $n$ data objects and $m$ attributes, let $Rank_i^j$ denote the rank (in ascending order) of the $j$-th element in attribute $i$, and let $\overline{Rank}_i^j$ denote the rank of the corresponding distorted element. The RP parameter measures the position difference; it indicates the average change of rank over all attributes after distortion and is given by

$$ RP = \frac{1}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} \left| Rank_i^j - \overline{Rank}_i^j \right|. \qquad (5) $$

RK represents the percentage of elements that keep their rank in each column after distortion and is given by

$$ RK = \frac{1}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} Rk_i^j, \qquad (6) $$

where $Rk_i^j = 1$ if an element keeps its position in the order of values, and $Rk_i^j = 0$ otherwise.

Similarly, the CP parameter measures how the rank of the average value of each attribute varies after the data distortion. In particular, CP defines the change of rank of the average values of the attributes and is given by

$$ CP = \frac{1}{m} \sum_{i=1}^{m} \left| RankVV_i - \overline{RankVV}_i \right|, \qquad (7) $$

where $RankVV_i$ and $\overline{RankVV}_i$ denote the rank of the average value of the $i$-th attribute before and after the data distortion, respectively. Similar to RK, CK measures the percentage of attributes that keep the rank of their average value after distortion.
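These privacy parameters translate directly into code. The sketch below is an illustrative NumPy/SciPy implementation of equations (4)-(7), assuming per-attribute (per-column) ascending ranks with ties averaged; the paper does not state its tie-breaking rule:

```python
# Privacy parameters of [5]: VD (value difference), RP/RK (element rank
# change / rank retention), CP/CK (attribute-average rank change / retention).
import numpy as np
from scipy.stats import rankdata

def privacy_parameters(V, V_bar):
    n, m = V.shape                                    # n objects, m attributes
    VD = np.linalg.norm(V - V_bar, "fro") / np.linalg.norm(V, "fro")
    R = np.apply_along_axis(rankdata, 0, V)           # ranks within each column
    R_bar = np.apply_along_axis(rankdata, 0, V_bar)
    RP = np.abs(R - R_bar).sum() / (n * m)            # average rank change
    RK = (R == R_bar).sum() / (n * m)                 # fraction keeping rank
    avg_r = rankdata(V.mean(axis=0))                  # ranks of column averages
    avg_r_bar = rankdata(V_bar.mean(axis=0))
    CP = np.abs(avg_r - avg_r_bar).sum() / m          # average-rank change
    CK = (avg_r == avg_r_bar).mean()                  # columns keeping avg rank
    return VD, RP, RK, CP, CK
```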

From the data privacy perspective, a good data distortion algorithm should result in high values for the RP and CP parameters and low values for the RK and CK parameters.

UTILITY MEASURE

The data utility measures assess whether the dataset preserves the performance of data mining techniques after the data distortion. The accuracy of a simple K-nearest neighbor (KNN) classifier [9] can be used as the data utility measure.

EXPERIMENTAL RESULTS

In order to test the performance of the proposed method, we conducted a series of experiments on some real-world datasets. In this section, we present a sample of the results obtained when applying the technique to the original Wisconsin breast cancer dataset downloaded from the UCI Machine Learning Repository [10]. For the breast cancer database, we used 569 observations and 30 attributes (with positive values) to perform our experiment. For the accuracy calculation, the TANAGRA data mining tool was used; throughout the experiments, the KNN classifier was used with K = 19.
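As an illustration of the utility measure, the sketch below estimates KNN accuracy with K = 19 on the same 569 x 30 Wisconsin dataset. The paper used the TANAGRA tool, so scikit-learn and 10-fold cross-validation are substitutions assumed here:

```python
# Utility measure sketch: KNN (K = 19) accuracy on original vs. distorted data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)      # 569 x 30 Wisconsin dataset
knn = KNeighborsClassifier(n_neighbors=19)
acc = cross_val_score(knn, X, y, cv=10).mean()
print(f"KNN (K=19) accuracy on original data: {acc:.4f}")
# Re-running with X replaced by distort(X) from the earlier sketch shows how
# much accuracy the perturbed dataset retains.
```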

Using the Squared Error Function

From Figure 1, it is clear that m = 2 provides the best choice with respect to the privacy parameters.

Fig. 1. Effect of the reduced rank m on the privacy parameters (curves: RK, RP, CP, CK, VD).

Table 1 shows how the privacy parameters and accuracy vary with the sparseness constraint $S_h$.

TABLE I: EFFECT OF THE SPARSENESS CONSTRAINT ON THE PRIVACY PARAMETERS AND ACCURACY

S_h     RP       RK       CP      CK      VD       Acc
0       180.28   0.0371   0.266   0.866   0.0431   93.14 %
0.45    185.90   0.0153   0.266   0.866   0.3751   92.23 %
0.65    187.41   0.0115   0.6     0.6     0.98     91.21 %
0.75    186.43   0.015    0.26    0.86    0.99     92.61 %

Table 2 shows the effect of the threshold ε on the privacy parameters. From the table, it is clear that there is a trade-off between the privacy parameters.

TABLE II: EFFECT OF THE THRESHOLD ON THE PRIVACY PARAMETERS

ε       RP       RK       CP       CK       VD
0.035   184.90   0.0179   0.2667   0.8667   0.1398
0.040   195.54   0.0173   0.2667   0.8667   0.1044
0.045   196.11   0.0166   0.2667   0.8667   0.0952
0.050   197.71   0.0144   0.2667   0.8667   0.0830

Using the K-L Divergence Error Function

From Figure 2, it is clear that m = 3 provides the best choice with respect to the privacy parameters. We fixed m = 2 throughout the rest of our experiments with this dataset.

Fig. 2. Effect of the reduced rank m on the privacy parameters (curves: RP/100, RK*100, CP, CK*10, VD, ACC/10 plotted against the reduced rank m).

Table 3 shows how the privacy parameters and accuracy vary with the sparseness constraint $S_h$.

TABLE III: EFFECT OF THE SPARSENESS CONSTRAINT ON THE PRIVACY PARAMETERS AND ACCURACY

S_h     RP       RK       CP      CK      VD       Acc
0       187.51   0.0113   0       1       0.0536   92.14 %
0.3     187.28   0.0103   3.2     0.53    0.999    90.56 %
0.65    185.93   0.0108   6.06    0.366   0.999    90.86 %

From the results in Table 3, it is clear that $S_h = 0.65$ not only improves the values of the privacy parameters but also improves the classification accuracy. Table 4 shows the effect of the threshold on the privacy parameters and accuracy. From the table, it is clear that there is a trade-off between the privacy parameters and the accuracy.

TABLE IV: EFFECT OF THE THRESHOLD ON THE PRIVACY PARAMETERS

ε       RP        RK       CP      CK       VD
0.005   186.09    0.0108   6.066   0.3667   0.999
0.01    187.064   0.0108   6.066   0.3667   0.999
0.05    194.84    0.0083   6.066   0.3667   0.999
0.1     198.51    0.0064   6.066   0.3667   0.999

CONCLUSION AND FUTURE SCOPE

Non-negative matrix factorization with sparseness constraints can provide an effective data perturbation tool for privacy-preserving data mining. In order to test the performance of the proposed method further, we would like to conduct a set of experiments on additional standard real-world datasets and, using the above-mentioned privacy parameters, test the ability of these techniques to hide the original data values.

REFERENCES

[1] M. Chen, J. Han, and P. Yu, "Data Mining: An Overview from a Database Perspective", IEEE Trans. Knowledge and Data Engineering, 8, 1996.
[2] Z. Yang, S. Zhong, and R. N. Wright, "Privacy preserving classification of customer data without loss of accuracy", in Proceedings of the 5th SIAM International Conference on Data Mining, Newport Beach, CA, April 21-23, 2005.
[3] S. M. A. Kabir, A. M. Youssef, and A. K. Elhakeem, "On data distortion for privacy preserving data mining", Concordia University, Montreal, Quebec, Canada.
[4] R. Agrawal and R. Srikant, "Privacy-preserving data mining", in Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, Dallas, Texas, May 2000, ACM Press.
[5] S. Xu, J. Zhang, D. Han, and J. Wang, "Data distortion for privacy protection in a terrorist analysis system", in P. Kantor et al. (Eds.): ISI 2005, LNCS 3495, pp. 459-464, 2005.
[6] V. P. Pauca, F. Shahnaz, M. Berry, and R. Plemmons, "Text mining using non-negative matrix factorizations", in Proc. SIAM International Conference on Data Mining, Orlando, April 2004.
[7] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints", Journal of Machine Learning Research, 5 (2004), pp. 1457-1469.

[8] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization", in Advances in Neural Information Processing Systems 13 (Proc. NIPS 2000), MIT Press, 2001.
[9] R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley and Sons, 2001.