On Variable Constraints in Privacy Preserving Data Mining


Charu C. Aggarwal, Philip S. Yu
IBM T. J. Watson Research Center
{ charu, psyu }@us.ibm.com

Abstract

In recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications. In many cases, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. A recent framework performs privacy preserving data mining by using a condensation based approach. In this framework, the privacy of all records is treated homogeneously. It is, however, inefficient to design a system with a uniform privacy requirement over all records. We discuss a new framework for privacy preserving data mining in which the privacy of all records is not the same, but can vary considerably. This is often the case in many real applications, in which different groups of individuals may have different privacy requirements. We discuss a condensation based approach for privacy preserving data mining in which an efficient method is used for constructing the condensation in a heterogeneous way. The heterogeneous condensation is capable of handling both static and dynamic data sets. We present empirical results illustrating the effectiveness of the method.

1 Introduction

Privacy preserving data mining has become an important problem in recent years because of the large amount of consumer data tracked by automated systems on the internet. The proliferation of electronic commerce on the world wide web has resulted in the storage of large amounts of transactional and personal information about users. In addition, advances in hardware technology have also made it feasible to track information about individuals from transactions in everyday life. In many cases, users are not willing to supply such personal data unless its privacy is guaranteed. Therefore, in order to ensure effective data collection, it is important to design methods which can mine the data with a guarantee of privacy.
Some interesting discourses on the nature of privacy in the context of recent trends in information technology may be found in [6, 9, 10]. The recent focus on privacy in data collection has resulted in a considerable amount of research on the subject [1, 2, 3, 4, 5, 7, 11, 12, 15, 16]. A recent approach to privacy preserving data mining has been a condensation-based technique [2]. This technique essentially creates condensed groups of records which are then utilized in one of two ways:

The statistical information in the pseudo-groups can be utilized to generate a new set of pseudo-data which can be utilized with a variety of data mining algorithms.

The condensed pseudo-groups can be utilized directly with minor modifications of existing data mining algorithms.

The condensation approach of [2] is also referred to as the k-indistinguishability model. A record is said to be k-indistinguishable when there are at least k other records in the data (including itself) from which it cannot be distinguished. Clearly, when a record is 1-indistinguishable, it has no privacy. The k-indistinguishability of a record is achieved by placing it in a group with at least (k − 1) other records. This model shares a number of conceptual characteristics with the k-anonymity model [18], though the algorithms for achieving it are quite different. Another important difference between the two schemes is that the former does not rely on domain specific hierarchies (as in the case of the k-anonymity model). The k-indistinguishability model can also work effectively in a dynamic environment such as that created by data streams. In the model discussed in [2], it was assumed that all records have the same privacy requirement. This is also the case for the k-anonymity model, in which the level of privacy is fixed a priori. In most practical applications, this is not a reasonable assumption. For example, when a data repository contains records from heterogeneous data sources, it is rarely the case that each source has the same privacy requirement.
Similarly, in an application tracking the data for brokerage customers, the privacy requirements of retail investors are likely to be different from those of institutional investors. Even among a particular class of customers, some customers

(such as high net-worth individuals) may desire a higher level of privacy than others. In general, we would like to associate a different privacy level with each record in the data set. Let us assume that we have a database D containing N records. The records are denoted by X_1 ... X_N. We denote the desired privacy level for record X_i by p(i). The process of finding condensed groups with varying levels of point specific privacy makes the problem significantly more difficult from a practical standpoint. This is because it is not advisable to pre-segment the data into different privacy levels before performing the condensation separately for each segment. When some of the segments contain very few records, such a condensation may result in an inefficient representation of the data. In some cases, the number of records for a given level of privacy k may be lower than k. Clearly, it is not even possible to create a group containing only records with privacy level k, since the privacy level of the entire group would then be less than k. Therefore, it is not possible to create an efficient (and feasible) system of group condensation without mixing records of different privacy levels. This leads to a number of interesting trade-offs between information loss and privacy preservation. We will discuss these trade-offs and the algorithms to optimize them. In many cases, the data may be available at one time, or it may be available in a more dynamic and incremental fashion. We discuss two cases for our algorithm:

We discuss an algorithm to perform the condensation when the entire data is available at one time.

We discuss an algorithm for the case when the data is available incrementally. This is a more difficult case because it is often not possible to design the most effective condensation at the moment the data becomes available.

We will show that in most cases, the algorithm for performing the dynamic group construction is able to achieve results which are comparable to the algorithm for static group construction. This paper is organized as follows.
In the next section, we will discuss some notations and definitions, and also introduce the locality sensitive condensation approach. We will first discuss the simple case in which an entire data set is available for application of the privacy preserving approach. This approach will be extended to incrementally updated data sets in section 3. The empirical results are discussed in section 4. Finally, section 5 contains the conclusions and summary.

2 The Condensation Approach

In this section, we will discuss the condensation approach for privacy preserving data mining. Before describing the details of the algorithm, we will discuss some notations and definitions. We assume that we have a set of N records, each of which contains d dimensions. We also assume that associated with each data point i, we have a corresponding privacy level p(i). The overall database is denoted by D, whereas the database corresponding to the privacy level p is denoted by D_p. The privacy level for a record is defined as follows:

Definition 2.1. The privacy level for a given record is defined as the minimum number of other records in the data from which it cannot be distinguished.

In the condensation based approach, the data is partitioned into groups of records. Records within a given group cannot be distinguished from one another. For each group, we maintain certain summary statistics about the records. These summary statistics provide the ability to apply data mining algorithms directly to the condensed groups of records. This information also suffices to preserve information about the mean and correlations across the different dimensions. The size of the groups may vary, but each group's size is at least equal to the desired privacy level of each record in that group. Thus, a record with privacy level equal to p(i) may be condensed with records of privacy levels different from p(i). However, the size of that group must be at least equal to the maximum privacy level of any record in that group. Each group of records is referred to as a condensed unit. Let G be a condensed group containing the records {X_1 ... X_k}.
Let us also assume that each record X_i contains the d dimensions which are denoted by (x_i^1 ... x_i^d). The following information is maintained about each group of records G:

For each attribute j, we maintain the sum of corresponding values. The corresponding value is given by Σ_{i=1}^{k} x_i^j. We denote the corresponding first-order sums by Fs_j(G). The vector of first order sums is denoted by Fs(G).

For each pair of attributes i and j, we maintain the sum of the product of corresponding attribute values. The corresponding sum is given by Σ_{t=1}^{k} x_t^i · x_t^j. We denote the corresponding second order sums by Sc_ij(G). The vector of second order sums is denoted by Sc(G).

We maintain the sum of the privacy levels of the records in the group. This number is denoted by Ps(G).

Figure 1: The Efficiency of Mixing Different Privacy Levels (data points with privacy levels 3 and 4, shown before and after attrition)

We maintain the total number of records k in that group. This number is denoted by n(G).

The following facts are true about the records in a given group.

Observation 2.1. The mean value of attribute j in group G is given by Fs_j(G)/n(G).

Observation 2.2. The covariance between attributes i and j in group G is given by Sc_ij(G)/n(G) − Fs_i(G) · Fs_j(G)/n(G)².

We note that the algorithm for group construction must try to put each record in a group whose size is at least equal to the maximum privacy level of any record in the group. A natural solution is to first classify the records based on their privacy levels and then independently create the groups for the varying privacy levels. Unfortunately, this does not lead to the most efficient method for packing the sets of records into different groups. This is because the most effective method for constructing the groups may require us to combine records from different privacy levels. For example, a record with a very low privacy requirement may sometimes naturally be combined with a group of high privacy records in its locality. An attempt to construct a separate group of records with a low privacy requirement may lead to an even higher loss of information. In order to illustrate this point better, we will provide an example. Consider the set of records illustrated in Figure 1. In this case, there are 3 records with privacy level 3 and 5 records with privacy level 4. One way of grouping the records is to place all the records of privacy level 3 in one group and all records with privacy level 4 in the other. Unfortunately, the group corresponding to privacy level 4 turns out to be ineffective in representing the data. The condensed group created from this set of records has poor statistical characteristics, since one of the data points is far removed from the rest of the group.
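The group statistics above, together with Observations 2.1 and 2.2, can be sketched in a few lines of Python. This is a minimal illustration only; the class and method names are hypothetical, not from the paper:

```python
import numpy as np

class CondensedGroup:
    """Summary statistics for one condensed group G (illustrative sketch)."""
    def __init__(self, d):
        self.Fs = np.zeros(d)        # first-order sums Fs_j(G)
        self.Sc = np.zeros((d, d))   # second-order sums Sc_ij(G)
        self.Ps = 0                  # sum of privacy levels Ps(G)
        self.n = 0                   # record count n(G)

    def add(self, x, privacy_level):
        """Fold one d-dimensional record into the condensed statistics."""
        x = np.asarray(x, dtype=float)
        self.Fs += x
        self.Sc += np.outer(x, x)
        self.Ps += privacy_level
        self.n += 1

    def mean(self):
        # Observation 2.1: mean of attribute j is Fs_j(G) / n(G)
        return self.Fs / self.n

    def covariance(self):
        # Observation 2.2: Cov(i, j) = Sc_ij/n - Fs_i * Fs_j / n^2
        return self.Sc / self.n - np.outer(self.Fs, self.Fs) / self.n ** 2
```

Because only sums and products are kept, two groups can be merged by adding their statistics, which is what makes the condensed representation convenient for incremental maintenance.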
Since the condensed statistics of the group do not represent the variations within it, this can lead to an inefficient representation in many cases. In the situation illustrated in Figure 1, it is better to place the outlying record of privacy level 4 into the group with privacy level 3. We also note that it may not be possible to place this outlying record in a group with only two pre-existing members, because of the higher privacy requirement of the record.

Algorithm ConstructGroups(Level: MaxPrivacyLevel, Database: D);
p = 2;
H_1 = Groups from singleton points in D_1;
while (p ≤ MaxPrivacyLevel) do
  H_p = Segment(D_p, p);
  (H_{p−1}, H_p) = Cannibalize(H_{p−1}, H_p);
  (H_{p−1}, H_p) = Attrition(H_{p−1}, H_p);
  H_p = H_p ∪ H_{p−1};
  p = p + 1;
end;

Figure 2: The Process of Group Construction for Privacy Preserving Data Mining

Algorithm Segment(Database: D_p, Privacy level: p);
while D_p contains at least p data points do
  Sample a data point X from D_p;
  Find the (p − 1) data points closest to X in D_p;
  Create a group G of p data points comprising X and the (p − 1) closest data points;
  Add G to the set of groups H;
end;
Assign remaining data points in D_p to closest groups;

Figure 3: Group Segmentation

First, we need a measure to quantify the effectiveness of a given condensation based approach. In general, this effectiveness is related to the level of compactness with which we can partition the data into different groups. As a goal, this compactness is not very different from the aim of most clustering algorithms. However, the difference here is that there are several constraints on the cardinality of the data points in each group as

Algorithm Cannibalize(Groups: H_{p−1}, H_p);
for each group G ∈ H_{p−1} do
  for each point in G perform temporary assignment to closest group in H_p;
  if (SSQ of temporary assignment is lower) or (H_{p−1} contains fewer than (p − 1) members)
    then make assignment permanent
    else keep old assignment;

Figure 4: Cannibalization Algorithm

Algorithm Attrition(Groups: H_{p−1}, H_p, Privacy Level: p);
for each data point X in H_p do
  Distc(X, p) = distance of X to centroid of its current group in H_p;
  Dist(X, p − 1) = distance of X to centroid of its closest viable group in H_{p−1};
  Improve(X) = Distc(X, p) − Dist(X, p − 1);
end;
for each group in H_p with p′ > p points do
  find (if any) the at most (p′ − p) data points with the largest value of the Improve(·) function which is larger than 0;
  Assign these at most (p′ − p) points to their corresponding closest groups in H_{p−1};

Figure 5: Attrition Algorithm

Figure 6: An Example of Cannibalization (Group 1 at privacy level 2 is cannibalized; its points are reassigned to Groups 2 and 3 at privacy level 3)

well as the identity of the data points which can be added to a group with given cardinality. Thus, for the process of quantification of the condensation quality, we simply use the square sum error of the data points in each group. While the privacy level of a group is determined by the number of records in it, the information loss is defined by the average variance of the records about their centroid. We will refer to this quantity as the Sum Squared Error (SSQ). The method of group construction is different depending upon whether an entire database of records is available or whether the data records arrive in an incremental fashion. We will discuss two approaches for the construction of class statistics. The first approach is utilized for the case when the entire database of records is available. The second approach is utilized in an incremental scheme in which the data points arrive one at a time. First, we will discuss the static case in which the entire database of records is available.
The essence of the static approach is to construct the groups using an iterative method in which the groups are processed with increasing privacy level. The overall process of group construction is illustrated in Figure 2. The input to the algorithm is the database D and the maximum privacy level, which is denoted by MaxPrivacyLevel. We assume that the segment of the database with a privacy level requirement of p is denoted by D_p. We also assume that the set of groups with privacy level p is denoted by H_p. We note that the database D_1 consists of the set of points which have no privacy constraint at all. Therefore, the group set H_1 comprises the singleton items from the database D_1. Next, we construct the statistics of the groups in H_p using an iterative algorithm. In each iteration, we increase the privacy level p by 1, and construct the condensed groups H_p which have privacy level p. The first step is to construct the group set H_p by using a purely segmentation based process. This process is denoted by Segment in Figure 2. This segmentation process is a straightforward iterative approach. In each iteration, a record X is sampled from the database D_p. The closest (p − 1) records to this individual record X are added to its group. Let us denote this group by G. The statistics of the p records in G are computed. Next, the p records in G are removed from D_p. The process is repeated iteratively, until the database D_p is empty. We note that at the end of the process, it is possible that between 1 and (p − 1) records may remain. These records can be added to their nearest sub-group in the data. Thus, a small number of groups in the data may contain more than p data points. The segmentation procedure is illustrated in Figure 3.
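The segmentation step described above can be sketched as follows. This is a simplified illustration under the assumption that points are numpy vectors; the function name and data layout are my own, not from the paper:

```python
import random
import numpy as np

def segment(points, p):
    """Sketch of the Segment step: repeatedly sample a point, group it with
    its (p - 1) nearest neighbours, and remove the group; the final 1 to
    (p - 1) leftover points join their closest existing group."""
    remaining = list(points)
    groups = []
    while len(remaining) >= p:
        x = remaining[random.randrange(len(remaining))]
        # sort by distance to the sampled point; x itself sorts first
        remaining.sort(key=lambda y: np.linalg.norm(y - x))
        groups.append(remaining[:p])   # x plus its (p - 1) closest points
        remaining = remaining[p:]
    for y in remaining:                # leftovers: assign to nearest centroid
        g = min(groups,
                key=lambda g: np.linalg.norm(np.mean(g, axis=0) - y))
        g.append(y)
    return groups
```

Note that, exactly as in the text, some groups may end up with more than p points once the leftovers are absorbed; those excess points are what the later Attrition step tries to relocate.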

Once the segmentation procedure has been performed, we apply the processes of Attrition and Cannibalize in order to further reduce the level of information loss without compromising on the privacy requirements. The purpose of the Cannibalize procedure is slightly different. In this procedure, we intend to cannibalize some of the groups in H_{p−1} and reassign their data points to better fitting groups in H_p. Consider the example illustrated in Figure 6. In this case, we have illustrated three groups. One of the groups (containing two points) has a privacy level of two, and the other groups (containing three points each) have a privacy level of three. However, the group with privacy level two does not form a natural cluster of data points. In such a case, it may be desirable to break up the group with privacy level 2 and assign one point each to the groups with privacy level 3. Thus, cannibalization is performed when the group G ∈ H_{p−1} does not form a natural cluster. In such cases, it is more effective to cannibalize the group G and reassign its group members to one or more clusters in H_p. Another example of a situation when cannibalization is desirable is when H_{p−1} has fewer than (p − 1) members. Such a situation arises when there are very few records for a given privacy level. Consequently, it is not possible to create a group containing only the points at a particular privacy level. We refer to this test for cannibalization as the numerical test. If the group passes the numerical test, we perform an additional qualitative test to see if cannibalization should be performed. In order to test whether the cannibalization procedure should be performed, we calculate the SSQ of the regrouping when a temporary assignment of the data points in G is performed to one or more groups in H_p. If the SSQ of the resulting assignment is lower, then we make this assignment permanent. The pseudo-code for the cannibalization process is illustrated in Figure 4. By performing this operation, the appropriate privacy level of all data points is maintained.
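The two cannibalization tests (the numerical test and the SSQ-based qualitative test) can be sketched as follows. The helper names are hypothetical, and groups are represented simply as lists of coordinate tuples:

```python
import numpy as np

def ssq(groups):
    """Sum Squared Error: squared distances of points about their group centroids."""
    total = 0.0
    for g in groups:
        c = np.mean(g, axis=0)
        total += sum(np.linalg.norm(np.asarray(x) - c) ** 2 for x in g)
    return total

def should_cannibalize(group, target_groups, p):
    """Sketch of the cannibalization decision for a group at privacy level
    p - 1: break it up if it is numerically infeasible (fewer than p - 1
    members), or if temporarily reassigning its points to the closest
    groups in H_p lowers the overall SSQ."""
    if len(group) < p - 1:
        return True                          # numerical test
    before = ssq([group] + target_groups)
    moved = [list(g) for g in target_groups]
    for x in group:                          # temporary assignment
        j = min(range(len(moved)),
                key=lambda j: np.linalg.norm(
                    np.mean(moved[j], axis=0) - np.asarray(x)))
        moved[j].append(x)
    return ssq(moved) < before               # qualitative test
```

A tight, isolated low-privacy cluster fails the test and survives; a group whose points already sit inside some higher-privacy cluster is absorbed.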
This is because the cannibalization process only assigns data points to groups with a higher privacy level. Therefore, the assigned data points find themselves in a group with at least their corresponding required privacy. We note that some groups in H_p may sometimes contain more than p data points. This is due to the effects of the Segment and Cannibalize procedures discussed earlier. The idea in the Attrition procedure is to move these excess points to a better fitting group in H_{p−1}. The movement of these excess points is likely to improve the quality of data representation in terms of reducing the level of information loss. An example of such a case is illustrated in Figure 1. In this case, the group with five data points contains one record which does not fit very well with the rest of the group. In such a case, the reassignment of the data point to a group with privacy level 3 results in a more compact representation. We note that the reassigned data point has privacy level 4. However, the reassignment process results in the group with privacy level 3 containing 4 data points. Therefore, even though the data point with privacy level 4 was assigned to a group with a lower privacy level, the resulting group continues to maintain the desired level of privacy for the reassigned data point. For this purpose, during the attrition process we consider only those groups which are viable for reassignment. For a group to be considered viable, it must contain at least as many data points as the privacy level (after the assignment). Furthermore, for a group G containing p′ data points and with privacy level p, we can remove at most (p′ − p) data points from it without disturbing the privacy level of the remaining group. In order to perform the actual reassignment, we calculate a function called Improve(X) for each data point X ∈ G. The value of Improve(X) is defined to be the difference between the distance of X from its current centroid and the distance from its closest viable centroid. Clearly, the reassignment of the data point X to another group is useful only when the value of Improve(X) is larger than 0.
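The Improve(·) computation and the bounded reassignment of excess points can be sketched as follows. This is an illustrative reading of Figure 5 for a single oversized group; the function and helper names are my own:

```python
import math

def attrition(group, lower_groups, p):
    """Sketch of the Attrition step: for an H_p group holding p_prime > p
    points, move at most (p_prime - p) points with the largest positive
    Improve value to the closest viable group in H_(p-1).  A lower group
    is viable for a point of privacy level p if it holds at least p
    points after the assignment.  Points are coordinate tuples."""
    p_prime = len(group)
    cen = lambda g: tuple(sum(c) / len(g) for c in zip(*g))  # centroid
    moves = []
    for x in group:
        viable = [g for g in lower_groups if len(g) + 1 >= p]
        if not viable:
            continue
        g = min(viable, key=lambda g: math.dist(cen(g), x))
        # Improve(X) = dist to current centroid - dist to closest viable centroid
        improve = math.dist(x, cen(group)) - math.dist(x, cen(g))
        if improve > 0:
            moves.append((improve, x, g))
    moves.sort(key=lambda t: t[0], reverse=True)
    for _, x, g in moves[: p_prime - p]:     # at most (p_prime - p) moves
        group.remove(x)
        g.append(x)
    return group, lower_groups
```

In the Figure 1 scenario, the outlier in the five-point group has the only positive Improve value, so exactly one point migrates to the smaller group.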
We reassign the at most (p′ − p) data points with the largest values of Improve(·), provided that the value of Improve(·) for each of these data points is larger than 0. The overall attrition procedure is illustrated in Figure 5. The processes of segmentation, cannibalization and attrition are applied iteratively to the segment D_p of the database for each value of the privacy level p. The value of p is incremented by 1 in each iteration up to the maximum privacy level. The set of groups constructed at this point is returned as the final condensation. Once the condensed statistics have been constructed, anonymized data can be generated as discussed in [2]. The anonymized data is generated using the statistical properties which can be derived from the group. While this new set of points resembles the original data distribution, it maintains the privacy of the data. The process of anonymized group construction is achieved by first constructing a d × d covariance matrix for each group G. This matrix is denoted by C(G). The ijth entry of the covariance matrix is the covariance between the attributes i and j of the set of records in G. The eigenvectors of this covariance matrix are determined by decomposing the matrix C(G) in the following form:

(2.1) C(G) = P(G) · Δ(G) · P(G)^T

The columns of P(G) are the eigenvectors of C(G). The diagonal entries λ_1(G) ... λ_d(G) of Δ(G) represent

the corresponding eigenvalues. It can be shown that the eigenvectors of a covariance matrix form an orthonormal axis system. This orthonormal axis system represents the directions along which the second order correlations are zero. If the data were represented using this orthonormal axis system, then the covariance matrix would be the diagonal matrix corresponding to Δ(G). The diagonal entries of Δ(G) represent the variances along the individual dimensions in this new axis system. We can assume without loss of generality that the eigenvalues λ_1(G) ... λ_d(G) are ordered in decreasing magnitude. The corresponding eigenvectors are denoted by e_1(G) ... e_d(G). The anonymized data for each group is reconstructed assuming that the data within each group is independently and uniformly distributed along the different eigenvectors. Furthermore, the variance of the distribution along each eigenvector is equal to the corresponding eigenvalue. These approximations are reasonable when only a small spatial locality is used.

Figure 7: Splitting Group Statistics (Illustration). A group of range a along the split plane is divided into two halves of range a/2 each, with the centers of the split groups offset along the principal eigenvector.

3 Dynamic Maintenance of Groups

The process of dynamic maintenance of groups is useful in a variety of settings such as that of data streams. In the process of dynamic maintenance, the points in the data stream are processed incrementally. It is assumed that a set S of data points (of size InitNumber) is available at the beginning of the process. The static process ConstructGroups is applied to this set S. Once the initial groups have been constructed, a dynamic process of group maintenance is applied in order to maintain the condensed groups of varying privacy levels. The incremental algorithm works by using a nearest neighbor approach. When an incoming data point X_i is received, we find the closest cluster to it using the distance of the data point X_i to the different centroids. While it is desirable to add X_i to its closest centroid, we cannot add X_i to a given cluster which has fewer than p(i) − 1 data points in it.
Therefore, the data point X_i is added to the closest cluster which also happens to have at least p(i) − 1 data points inside it. In general, it is not desirable to have groups whose sizes are large compared to their constituent privacy levels. When such a situation arises, it effectively means that a higher level of representational inaccuracy is created than is really necessary for the privacy requirements of the points within the group. The average privacy level of the group G can be computed from the condensed statistics. This number is equal to Ps(G)/n(G). This is because Ps(G) is equal to the sum of the privacy levels of the data points in the group. The split criterion used by our algorithm is that a group is divided when the number of items in the group is more than twice the average privacy level of the items in the group. Therefore, the group is split when the following holds true:

(3.2) n(G) > 2 · Ps(G)/n(G)

As in the case of anonymized data construction, we utilize the uniformity assumption in order to split the group statistics. In each case, the group is split along the eigenvector with the largest eigenvalue. This also corresponds to the direction with the greatest level of variance. This is done in order to reduce the overall variance of the resulting clusters and ensure the greatest compactness of representation. An example of this case is illustrated in Figure 7. We assume without loss of generality that the eigenvector e_1 with the lowest index is the chosen direction of the split. The corresponding eigenvalue is denoted by λ_1. Since the variance of the data along e_1 is λ_1, the range a of the corresponding uniform distribution along e_1 is given by a = √(12 · λ_1).¹ In such a case, the original group of size 2·k is split into two groups of equal size. We need to determine the first order and second order statistical data about each of the split groups M_1 and M_2. We assume that the privacy component Ps(G) is also equally divided between the two groups. We first derive the centroid and eigenvector directions for each group.
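The split test (3.2) and the uniform-range relation a = √(12·λ₁) can be written directly as code. A minimal sketch with hypothetical function names:

```python
import math

def should_split(n_G, Ps_G):
    """Split criterion (3.2): a group splits when its size n(G) exceeds
    twice the average privacy level Ps(G)/n(G) of its members."""
    return n_G > 2.0 * Ps_G / n_G

def uniform_range(lam1):
    """Range a of a uniform distribution whose variance is lam1.  Since the
    standard deviation of a uniform distribution of range a is a/sqrt(12),
    a variance of lam1 gives a = sqrt(12 * lam1)."""
    return math.sqrt(12.0 * lam1)
```

For instance, a group of 5 points with privacy levels summing to 14 does not split (5 < 2·14/5 = 5.6), but after a sixth point of privacy level 3 joins, it does (6 > 2·17/6 ≈ 5.67).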
These values are sufficient to reconstruct the values of Fs_i(G) and Sc_ij(G) for each group. Assume that the centroid of the unsplit group M is denoted by Y(M). This centroid can be computed

Footnote 1: This calculation was done by using the formula for the standard deviation of a uniform distribution with range a. The corresponding standard deviation is given by a/√12.

from the first order values Fs(M) as follows:

(3.3) Y(M) = (Fs_1(M), ..., Fs_d(M))/n(M)

Once the centroid has been computed, those of each of the split groups can be computed as well. From Figure 7, it is easy to see that the centroids of each of the split groups M_1 and M_2 are given by Y(M) − (a/4)·e_1 and Y(M) + (a/4)·e_1 respectively. By substituting a = √(12 · λ_1), it is easy to see that the new centroids of the groups M_1 and M_2 are given by Y(M) − (√(12 · λ_1)/4)·e_1 and Y(M) + (√(12 · λ_1)/4)·e_1 respectively. We will now discuss how to compute the second order statistical values. The first step is the determination of the covariance matrix of the split groups. Let us assume that the ijth entry of the covariance matrix for the group M_1 is given by C_ij(M_1). We also note that the eigenvectors of M_1 and M_2 are identical to the eigenvectors of M, since the directions of zero correlation remain unchanged by the splitting process. Therefore, we have:

e_1(M_1) = e_1(M_2) = e_1(M)
e_2(M_1) = e_2(M_2) = e_2(M)
...
e_d(M_1) = e_d(M_2) = e_d(M)

The eigenvalue (in the split groups M_1 and M_2) corresponding to e_1(M) is equal to λ_1/4. This is because the splitting process along e_1 reduces the corresponding variance by a factor of 4. The other eigenvalues remain unchanged. Let P(M) represent the eigenvector matrix of M, and Δ(M) represent the corresponding diagonal matrix. Then, the new diagonal matrix Δ(M_1) = Δ(M_2) of M_1 can be derived by dividing the entry λ_1(M) by 4. Therefore, we have:

λ_1(M_1) = λ_1(M_2) = λ_1(M)/4

The other eigenvalues of M_1 and M_2 remain the same:

λ_2(M_1) = λ_2(M_2) = λ_2(M)
λ_3(M_1) = λ_3(M_2) = λ_3(M)
...
λ_d(M_1) = λ_d(M_2) = λ_d(M)

Thus, the (identical) covariance matrices of M_1 and M_2 may be determined as follows:

C(M_1) = P(M_1) · Δ(M_1) · P(M_1)^T

From Observation 2.2, it is clear that the second order statistics of M_1 may be determined as follows (where k = n(M_1) is the number of points in the split group):

Sc_ij(M_1) = k · C_ij(M_1) + Fs_i(M_1) · Fs_j(M_1)/k

An important observation is that even though the covariance matrices of M_1 and M_2 are identical, the values of Sc_ij(M_1) and Sc_ij(M_2) are different, because of the different first order aggregates substituted in the above formula. The overall process for splitting the group statistics is illustrated in Figure 7. Another interesting point to be noted is that the entire purpose of splitting is to keep group sizes sufficiently compact for data mining algorithms. The process of splitting itself can never result in the violation of the privacy condition, since the split is based on a split of the statistics, but not of the data points themselves. In order to understand this point, let us consider the following example of a case where the split condition seems to violate privacy. Consider a group having 5 tuples, the privacy constraints of the tuples being 2, 2, 2, 3, and 5 respectively. The group does not split because 5 < 2 · 14/5. Now, if a new tuple having privacy constraint 3 joins the group, the splitting condition is satisfied, since 6 > 2 · 17/6. Hence each of the split groups corresponds to the statistics of 3 data points. Therefore, it would apparently seem that the privacy of the tuple with requirement 5 has been violated. This is not the case, since we split the statistics into two pseudo-groups of 3 points each, rather than actually splitting the points themselves. The process of performing the split partitions the statistics based on a probability distribution assumption (uniform distribution) rather than using the actual points themselves (which have already been lost in the merged statistics). The tuple with privacy condition 5 may contribute to the statistics of both groups when the splitting condition is used.
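The entire statistics-splitting derivation above can be collected into one sketch. This is an illustrative implementation under the paper's uniformity assumption (even group size, principal eigenvector split); the function name and return layout are my own:

```python
import numpy as np

def split_group(Fs, Sc, Ps, n):
    """Sketch of the statistics split: place the two new centroids at
    Y(M) -/+ (sqrt(12*lam1)/4) * e1, divide lam1 by 4, keep all other
    eigenvalues and eigenvectors, then rebuild Fs and Sc for each half
    from Observation 2.2.  Assumes n is even, as in the paper."""
    Y = Fs / n
    C = Sc / n - np.outer(Fs, Fs) / n ** 2          # covariance C(M)
    lam, P = np.linalg.eigh(C)                      # eigen-decomposition
    i = np.argmax(lam)                              # principal direction e1
    e1, lam1 = P[:, i], lam[i]
    offset = (np.sqrt(12.0 * lam1) / 4.0) * e1      # a/4 along e1
    lam_new = lam.copy()
    lam_new[i] = lam1 / 4.0                         # variance along e1 shrinks 4x
    C_new = P @ np.diag(lam_new) @ P.T              # identical for both halves
    k = n // 2
    halves = []
    for Y_new in (Y - offset, Y + offset):
        Fs_new = k * Y_new
        # Observation 2.2 inverted: Sc = k*C + Fs*Fs^T / k
        Sc_new = k * C_new + np.outer(Fs_new, Fs_new) / k
        halves.append((Fs_new, Sc_new, Ps / 2.0, k))
    return halves
```

As the text notes, the two halves share one covariance matrix but carry different second order sums, because different first order aggregates enter the reconstruction formula.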
Each pseudo-group thus has a privacy level as high as the unsplit group from the perspective of the old data points in it, but at the same time we would need to use the size of the pseudo-group while considering the addition of further data points into the smaller pseudo-groups. In order to test the quality of our results, we applied our approach to a nearest neighbor classifier. In the classification process, the condensation was performed separately for each class. In the next section, we will discuss the behavior of this nearest neighbor classifier.

4 Empirical Results

We tested the privacy preserving approach over a wide range of data sets and metrics. An important question which arises in the context of a privacy preserving

Figure 8: Accuracy of Classifier with Increasing Privacy Level (Ionosphere Data Set). Classification accuracy vs. alpha (maximum group size) for static condensation, dynamic condensation, and the original data.

Figure 9: Covariance Compatibility of Condensed Data Set with Increasing Privacy Level (Ionosphere Data Set). Covariance compatibility coefficient vs. alpha for static and dynamic condensation.

Figure 10: Accuracy of Classifier with Increasing Privacy Level (Ecoli Data Set).

Figure 11: Covariance Compatibility of Condensed Data Set with Increasing Privacy Level (Ecoli Data Set).

Figure 12: Accuracy of Classifier with Increasing Privacy Level (Pima Indian Data Set).

Figure 13: Covariance Compatibility of Condensed Data Set with Increasing Privacy Level (Pima Indian Data Set).

Figure 14: Accuracy of Classifier with Increasing Privacy Level (Abalone Data Set).

Figure 15: Covariance Compatibility of Condensed Data Set with Increasing Privacy Level (Abalone Data Set).

approach is the nature of the metric to be used in order to test the quality of the approach. The first step is to test the nature of the tradeoff between increased levels of privacy and the resulting information loss. While the level of privacy is controlled by the average condensed group size, the information loss is measured indirectly in terms of the effect of the perturbation on the quality of data mining algorithms. We tested the accuracy of a simple k-nearest neighbor classifier with the use of different levels of privacy. The minimum privacy level of each data point was generated from a (discrete) uniform distribution in the range [α − β, α]. By changing the value of α, it is possible to vary the level of privacy during the condensation process. The aim of our approach is to show that a high level of privacy can be achieved without significantly compromising accuracy. Another useful metric for testing the quality of the privacy preserving process arises from the level of matching between the original and perturbed data.
This provides insight into the nature of the relationship between the original data set and the perturbed data set. The first step is therefore to identify the statistics used for testing the effectiveness of the perturbation process. One simple method is to test how well the covariance structure of the perturbed data set matches that of the original data set. This is because the covariance structure of the data identifies the essential data properties up to a second-order approximation. If the newly created data set has very similar data characteristics to the original data set, then the condensed data set is a good substitute for most data mining algorithms. For each dimension pair (i, j), let the corresponding entries in the covariance matrices for the original and the perturbed data be denoted by c_ij and c_ij^p respectively. We computed the statistical coefficient of correlation between the data entry pairs (c_ij, c_ij^p). Let us denote this value by µ. When the two matrices are identical, the value of µ is 1. On the other hand, when there is perfect negative correlation between the entries, the value of µ is −1. A number of real data sets from the UCI machine learning repository² were used for the testing. We used the Ionosphere, Ecoli, Pima Indian and Abalone data sets. The last data set was a regression modeling problem, and therefore the classification measure needed to be redefined. For this problem, the classification accuracy measure used was the percentage of the time that the age was predicted within an accuracy of less than one year by the nearest neighbor classifier. In many cases, the number of data points for a given privacy level was lower than the numerical value of the privacy level itself. In such cases, the mixing of data points of different privacy levels is inevitable. Thus, the condensation process could not have been performed in such cases using the homogeneous k-anonymity model or k-indistinguishability model [2, 18]. The results on classification accuracy for the Ionosphere, Ecoli, Pima Indian, and Abalone data sets are illustrated in Figures 8, 10, 12 and 14 respectively. The value of β was fixed to 4, whereas the value of α was varied over the different data sets. The range of values of α is determined by the number of data points in the particular data set at hand. This value of α is illustrated on the X-axis. On the Y-axis, we have plotted the classification accuracy of the nearest neighbor classifier when the condensation technique was used. For each graph, we have illustrated the results using both static and dynamic condensation. In addition, a baseline is marked on each graph. This baseline is a horizontal line which shows the classification accuracy on the original data. It is clear that in most cases, the accuracy of classification reduced with increasing group size.
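The covariance compatibility coefficient µ described above can be computed directly (a minimal sketch; the synthetic data and perturbation are ours, for illustration only):

```python
import numpy as np

def covariance_compatibility(original, perturbed):
    # mu: correlation coefficient between the entries of the
    # covariance matrices of the original and perturbed data.
    # mu = 1 means identical second-order structure; mu = -1
    # means perfect negative correlation between the entries.
    c_o = np.cov(original, rowvar=False).ravel()
    c_p = np.cov(perturbed, rowvar=False).ravel()
    return np.corrcoef(c_o, c_p)[0, 1]

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
noisy = X + rng.normal(scale=0.05, size=X.shape)
mu = covariance_compatibility(X, noisy)
print(round(mu, 3))
```

For a mild perturbation such as the one above, µ is close to 1, mirroring the near-perfect compatibility values reported for the condensed data.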
This is a natural tradeoff because a greater amount of privacy is achieved with larger group sizes. At the same time, it leads to a higher amount of information loss. In many cases, the quality of the classification improved because of the condensation process. While the aim of our approach was to provide a high level of privacy without losing information, it appears that the process of condensation itself actually helped in removing the anomalies in the data for the purpose of classification. This phenomenon is likely to be helpful over a number of different data mining problems in which the aggregate behavior of the data is exposed by the condensation process. Furthermore, the static condensation approach provided higher quality results than the dynamic technique. This is because the splitting algorithm of the dynamic condensation process introduced an additional level of approximation into the data representation. The splitting procedure assumed a uniform distribution of the data within a condensed group of data points. The accuracy of this approximation reduces when group sizes are small. In such cases, there are simply too few data points to make an accurate estimation of the values of the split group statistics. Thus, the use of the uniform distribution approximation reduces the quality of the covariance statistics in the split groups for small group sizes. For this reason, the dynamic condensation process was sometimes less effective than the static condensation approach. However, in all cases, the dynamic condensation approach worked almost as effectively as the classifier on the original data. One notable exception to the general advantage of the static condensation process was the behavior on the Pima Indian data set. In this case, the dynamic condensation process provided results of higher quality for larger group sizes. The reason for this was that the splitting process seemed to improve the quality of the classification. The data set seemed to contain a number of anomalies, and these anomalies were removed by the splitting process.

2 http://mlearn
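The split group statistics mentioned above presuppose that each condensed group maintains moment statistics from which its mean and covariance can be reconstructed. A simplified sketch of such group statistics follows (our reading of the condensation framework of [2]; the class and method names are illustrative, not the authors' data structure):

```python
import numpy as np

class CondensedGroup:
    # Per-group condensed statistics: record count, first-order
    # sums, and second-order sums. Mean and covariance of the
    # group are reconstructed from these alone, without keeping
    # the original records.
    def __init__(self, d):
        self.n = 0
        self.fs = np.zeros(d)        # first-order sums
        self.sc = np.zeros((d, d))   # second-order sums

    def add(self, x):
        self.n += 1
        self.fs += x
        self.sc += np.outer(x, x)

    def mean(self):
        return self.fs / self.n

    def covariance(self):
        m = self.mean()
        return self.sc / self.n - np.outer(m, m)

rng = np.random.default_rng(2)
pts = rng.normal(size=(200, 3))
g = CondensedGroup(3)
for p in pts:
    g.add(p)
# The reconstructed covariance matches the population covariance
# of the points that were absorbed into the group.
ok = np.allclose(g.covariance(), np.cov(pts, rowvar=False, bias=True))
print(ok)
```

When such a group is split, only these statistics are available, which is why a uniform distribution within the group must be assumed and why the approximation degrades for small groups.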
This resulted in a higher classification accuracy for the dynamic approach. We also compared the covariance characteristics of the data sets. The results are illustrated in Figures 9, 11, 13 and 15 respectively. For most data sets, the value of the statistical correlation is almost perfect, with correlation values very close to 1 in most cases. For some examples, such as the Abalone data set (illustrated in Figure 15), the covariance compatibility value was even higher. These results emphasize the fact that the perturbed data is similar to the original data in terms of its statistical structure. As in the previous case, the results for the case of static condensation were better than those for dynamic condensation. This is again because of the additional inaccuracy introduced by the splitting process. In all cases, the absolute correlation provided by the scheme was very high. In the dynamic case, the correlation coefficient tended to drop for small group sizes. The only exception to this general rule was the Ionosphere data set, in which the covariance compatibility values were slightly lower for the static case. The covariance compatibility also reduced for extremely large group sizes. This is because in such a case, the pseudo-data no longer represents a particular data locality well. Thus, the covariance compatibility was highest in those cases in which the data contained tight clusters comprising a relatively modest number of

data points. This is because of the following reasons:

- When the number of points in each cluster is large, the accuracy of the uniform distribution assumption during the splitting process is maintained.
- When the clusters are tight, the data points represent a small spatial locality with respect to the rest of the data set. An approximation in a small spatial locality does not significantly affect the overall correlation structure.

We note that the process of representing a small spatial locality in a group and that of representing a larger number of data points in a group are two competing and contradictory goals. It is important to pick a balance between the two, since this tradeoff defines the quality of performance of the underlying data mining algorithm. This balance is externally defined, since the average group size is determined by the privacy requirements of the users. In general, since our approach continued to be as effective as the base classification accuracy over a wide range of group sizes, this illustrates the effectiveness of our methodology in most practical scenarios.

5 Conclusions and Summary

In this paper, we discussed a scheme for privacy preserving data mining in which the data points are allowed to have variable privacy levels. This is useful in a number of applications in which different records have inherently different privacy requirements. We propose a method for privacy protection in a data stream environment using condensed statistics of the data set. These condensed statistics can either be generated statically, or they can be generated dynamically in a data stream environment. We tested our results on a number of real data sets from the UCI machine learning repository. The results show that our method produces data sets which are quite similar to the original data in structure, and also exhibit similar accuracy results.

References

[1] C. C. Aggarwal and S. Parthasarathy, Mining Massively Incomplete Data Sets by Conceptual Reconstruction, Proceedings of the ACM KDD Conference, (2001).
[2] C. C. Aggarwal and P. S.
Yu, A Condensation Based Approach to Privacy Preserving Data Mining, Proceedings of the EDBT Conference, (2004).
[3] D. Agrawal and C. C. Aggarwal, On the Design and Quantification of Privacy Preserving Data Mining Algorithms, Proceedings of the ACM PODS Conference, (2002).
[4] R. Agrawal and R. Srikant, Privacy Preserving Data Mining, Proceedings of the ACM SIGMOD Conference, (2000).
[5] P. Benassi, TRUSTe: An online privacy seal program, Communications of the ACM, 42(2), (1999).
[6] C. Clifton and D. Marks, Security and Privacy Implications of Data Mining, ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, (1996).
[7] J. Vaidya and C. Clifton, Privacy Preserving Association Rule Mining in Vertically Partitioned Data, ACM KDD Conference, (2002).
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, Inc., New York, (1991).
[9] L. F. Cranor (Ed.), Special Issue on Internet Privacy, Communications of the ACM, 42(2), (1999).
[10] The Economist, The End of Privacy, (1999).
[11] V. Estivill-Castro and L. Brankovic, Data Swapping: Balancing privacy against precision in mining for logic rules, Data Warehousing and Knowledge Discovery, Springer-Verlag, Lecture Notes in Computer Science 1676, (1999).
[12] A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke, Privacy Preserving Mining of Association Rules, ACM KDD Conference, (2002).
[13] A. Hinneburg and D. A. Keim, An Efficient Approach to Clustering in Large Multimedia Databases with Noise, ACM KDD Conference, (1998).
[14] V. S. Iyengar, Transforming Data to Satisfy Privacy Constraints, ACM KDD Conference, (2002).
[15] C. K. Liew, U. J. Choi and C. J. Liew, A data distortion by probability distribution, ACM TODS Journal, 10(3), (1985).
[16] T. Lau, O. Etzioni and D. S. Weld, Privacy Interfaces for Information Management, Communications of the ACM, 42(10), (1999).
[17] S. Murthy, Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey, Data Mining and Knowledge Discovery, 2, (1998).
[18] P. Samarati and L. Sweeney, Protecting Privacy when Disclosing Information: k-Anonymity and its Enforcement Through Generalization and Suppression, Proceedings of the IEEE Symposium on Research in Security and Privacy, (1998).
[19] S. L. Warner, Randomized Response: A survey technique for eliminating evasive answer bias, Journal of the American Statistical Association, 60(309), (1965).


More information

T Algorithmic methods for data mining. Slide set 6: dimensionality reduction

T Algorithmic methods for data mining. Slide set 6: dimensionality reduction T-61.5060 Algrithmic methds fr data mining Slide set 6: dimensinality reductin reading assignment LRU bk: 11.1 11.3 PCA tutrial in mycurses (ptinal) ptinal: An Elementary Prf f a Therem f Jhnsn and Lindenstrauss,

More information

Lab 1 The Scientific Method

Lab 1 The Scientific Method INTRODUCTION The fllwing labratry exercise is designed t give yu, the student, an pprtunity t explre unknwn systems, r universes, and hypthesize pssible rules which may gvern the behavir within them. Scientific

More information

NUMBERS, MATHEMATICS AND EQUATIONS

NUMBERS, MATHEMATICS AND EQUATIONS AUSTRALIAN CURRICULUM PHYSICS GETTING STARTED WITH PHYSICS NUMBERS, MATHEMATICS AND EQUATIONS An integral part t the understanding f ur physical wrld is the use f mathematical mdels which can be used t

More information

Sections 15.1 to 15.12, 16.1 and 16.2 of the textbook (Robbins-Miller) cover the materials required for this topic.

Sections 15.1 to 15.12, 16.1 and 16.2 of the textbook (Robbins-Miller) cover the materials required for this topic. Tpic : AC Fundamentals, Sinusidal Wavefrm, and Phasrs Sectins 5. t 5., 6. and 6. f the textbk (Rbbins-Miller) cver the materials required fr this tpic.. Wavefrms in electrical systems are current r vltage

More information

making triangle (ie same reference angle) ). This is a standard form that will allow us all to have the X= y=

making triangle (ie same reference angle) ). This is a standard form that will allow us all to have the X= y= Intrductin t Vectrs I 21 Intrductin t Vectrs I 22 I. Determine the hrizntal and vertical cmpnents f the resultant vectr by cunting n the grid. X= y= J. Draw a mangle with hrizntal and vertical cmpnents

More information

Thermodynamics and Equilibrium

Thermodynamics and Equilibrium Thermdynamics and Equilibrium Thermdynamics Thermdynamics is the study f the relatinship between heat and ther frms f energy in a chemical r physical prcess. We intrduced the thermdynamic prperty f enthalpy,

More information

Math Foundations 20 Work Plan

Math Foundations 20 Work Plan Math Fundatins 20 Wrk Plan Units / Tpics 20.8 Demnstrate understanding f systems f linear inequalities in tw variables. Time Frame December 1-3 weeks 6-10 Majr Learning Indicatrs Identify situatins relevant

More information

CHAPTER 24: INFERENCE IN REGRESSION. Chapter 24: Make inferences about the population from which the sample data came.

CHAPTER 24: INFERENCE IN REGRESSION. Chapter 24: Make inferences about the population from which the sample data came. MATH 1342 Ch. 24 April 25 and 27, 2013 Page 1 f 5 CHAPTER 24: INFERENCE IN REGRESSION Chapters 4 and 5: Relatinships between tw quantitative variables. Be able t Make a graph (scatterplt) Summarize the

More information

Five Whys How To Do It Better

Five Whys How To Do It Better Five Whys Definitin. As explained in the previus article, we define rt cause as simply the uncvering f hw the current prblem came int being. Fr a simple causal chain, it is the entire chain. Fr a cmplex

More information

IN a recent article, Geary [1972] discussed the merit of taking first differences

IN a recent article, Geary [1972] discussed the merit of taking first differences The Efficiency f Taking First Differences in Regressin Analysis: A Nte J. A. TILLMAN IN a recent article, Geary [1972] discussed the merit f taking first differences t deal with the prblems that trends

More information

Chapters 29 and 35 Thermochemistry and Chemical Thermodynamics

Chapters 29 and 35 Thermochemistry and Chemical Thermodynamics Chapters 9 and 35 Thermchemistry and Chemical Thermdynamics 1 Cpyright (c) 011 by Michael A. Janusa, PhD. All rights reserved. Thermchemistry Thermchemistry is the study f the energy effects that accmpany

More information

We can see from the graph above that the intersection is, i.e., [ ).

We can see from the graph above that the intersection is, i.e., [ ). MTH 111 Cllege Algebra Lecture Ntes July 2, 2014 Functin Arithmetic: With nt t much difficulty, we ntice that inputs f functins are numbers, and utputs f functins are numbers. S whatever we can d with

More information

Differentiation Applications 1: Related Rates

Differentiation Applications 1: Related Rates Differentiatin Applicatins 1: Related Rates 151 Differentiatin Applicatins 1: Related Rates Mdel 1: Sliding Ladder 10 ladder y 10 ladder 10 ladder A 10 ft ladder is leaning against a wall when the bttm

More information

SPH3U1 Lesson 06 Kinematics

SPH3U1 Lesson 06 Kinematics PROJECTILE MOTION LEARNING GOALS Students will: Describe the mtin f an bject thrwn at arbitrary angles thrugh the air. Describe the hrizntal and vertical mtins f a prjectile. Slve prjectile mtin prblems.

More information

PSU GISPOPSCI June 2011 Ordinary Least Squares & Spatial Linear Regression in GeoDa

PSU GISPOPSCI June 2011 Ordinary Least Squares & Spatial Linear Regression in GeoDa There are tw parts t this lab. The first is intended t demnstrate hw t request and interpret the spatial diagnstics f a standard OLS regressin mdel using GeDa. The diagnstics prvide infrmatin abut the

More information

o o IMPORTANT REMINDERS Reports will be graded largely on their ability to clearly communicate results and important conclusions.

o o IMPORTANT REMINDERS Reports will be graded largely on their ability to clearly communicate results and important conclusions. BASD High Schl Frmal Lab Reprt GENERAL INFORMATION 12 pt Times New Rman fnt Duble-spaced, if required by yur teacher 1 inch margins n all sides (tp, bttm, left, and right) Always write in third persn (avid

More information

Fall 2013 Physics 172 Recitation 3 Momentum and Springs

Fall 2013 Physics 172 Recitation 3 Momentum and Springs Fall 03 Physics 7 Recitatin 3 Mmentum and Springs Purpse: The purpse f this recitatin is t give yu experience wrking with mmentum and the mmentum update frmula. Readings: Chapter.3-.5 Learning Objectives:.3.

More information

Writing Guidelines. (Updated: November 25, 2009) Forwards

Writing Guidelines. (Updated: November 25, 2009) Forwards Writing Guidelines (Updated: Nvember 25, 2009) Frwards I have fund in my review f the manuscripts frm ur students and research assciates, as well as thse submitted t varius jurnals by thers that the majr

More information

Physics 2010 Motion with Constant Acceleration Experiment 1

Physics 2010 Motion with Constant Acceleration Experiment 1 . Physics 00 Mtin with Cnstant Acceleratin Experiment In this lab, we will study the mtin f a glider as it accelerates dwnhill n a tilted air track. The glider is supprted ver the air track by a cushin

More information

Introduction to Quantitative Genetics II: Resemblance Between Relatives

Introduction to Quantitative Genetics II: Resemblance Between Relatives Intrductin t Quantitative Genetics II: Resemblance Between Relatives Bruce Walsh 8 Nvember 006 EEB 600A The heritability f a trait, a central cncept in quantitative genetics, is the prprtin f variatin

More information

Lecture 17: Free Energy of Multi-phase Solutions at Equilibrium

Lecture 17: Free Energy of Multi-phase Solutions at Equilibrium Lecture 17: 11.07.05 Free Energy f Multi-phase Slutins at Equilibrium Tday: LAST TIME...2 FREE ENERGY DIAGRAMS OF MULTI-PHASE SOLUTIONS 1...3 The cmmn tangent cnstructin and the lever rule...3 Practical

More information

1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1-4, 1996, Vol. 7, pp

1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1-4, 1996, Vol. 7, pp THE POWER AND LIMIT OF NEURAL NETWORKS T. Y. Lin Department f Mathematics and Cmputer Science San Jse State University San Jse, Califrnia 959-003 tylin@cs.ssu.edu and Bereley Initiative in Sft Cmputing*

More information

Pipetting 101 Developed by BSU CityLab

Pipetting 101 Developed by BSU CityLab Discver the Micrbes Within: The Wlbachia Prject Pipetting 101 Develped by BSU CityLab Clr Cmparisns Pipetting Exercise #1 STUDENT OBJECTIVES Students will be able t: Chse the crrect size micrpipette fr

More information

On Huntsberger Type Shrinkage Estimator for the Mean of Normal Distribution ABSTRACT INTRODUCTION

On Huntsberger Type Shrinkage Estimator for the Mean of Normal Distribution ABSTRACT INTRODUCTION Malaysian Jurnal f Mathematical Sciences 4(): 7-4 () On Huntsberger Type Shrinkage Estimatr fr the Mean f Nrmal Distributin Department f Mathematical and Physical Sciences, University f Nizwa, Sultanate

More information

Coalition Formation and Data Envelopment Analysis

Coalition Formation and Data Envelopment Analysis Jurnal f CENTRU Cathedra Vlume 4, Issue 2, 20 26-223 JCC Jurnal f CENTRU Cathedra Calitin Frmatin and Data Envelpment Analysis Rlf Färe Oregn State University, Crvallis, OR, USA Shawna Grsspf Oregn State

More information

CHAPTER 3 INEQUALITIES. Copyright -The Institute of Chartered Accountants of India

CHAPTER 3 INEQUALITIES. Copyright -The Institute of Chartered Accountants of India CHAPTER 3 INEQUALITIES Cpyright -The Institute f Chartered Accuntants f India INEQUALITIES LEARNING OBJECTIVES One f the widely used decisin making prblems, nwadays, is t decide n the ptimal mix f scarce

More information

Keysight Technologies Understanding the Kramers-Kronig Relation Using A Pictorial Proof

Keysight Technologies Understanding the Kramers-Kronig Relation Using A Pictorial Proof Keysight Technlgies Understanding the Kramers-Krnig Relatin Using A Pictrial Prf By Clin Warwick, Signal Integrity Prduct Manager, Keysight EEsf EDA White Paper Intrductin In principle, applicatin f the

More information

Churn Prediction using Dynamic RFM-Augmented node2vec

Churn Prediction using Dynamic RFM-Augmented node2vec Churn Predictin using Dynamic RFM-Augmented nde2vec Sandra Mitrvić, Jchen de Weerdt, Bart Baesens & Wilfried Lemahieu Department f Decisin Sciences and Infrmatin Management, KU Leuven 18 September 2017,

More information

APPLICATION OF THE BRATSETH SCHEME FOR HIGH LATITUDE INTERMITTENT DATA ASSIMILATION USING THE PSU/NCAR MM5 MESOSCALE MODEL

APPLICATION OF THE BRATSETH SCHEME FOR HIGH LATITUDE INTERMITTENT DATA ASSIMILATION USING THE PSU/NCAR MM5 MESOSCALE MODEL JP2.11 APPLICATION OF THE BRATSETH SCHEME FOR HIGH LATITUDE INTERMITTENT DATA ASSIMILATION USING THE PSU/NCAR MM5 MESOSCALE MODEL Xingang Fan * and Jeffrey S. Tilley University f Alaska Fairbanks, Fairbanks,

More information

ROUNDING ERRORS IN BEAM-TRACKING CALCULATIONS

ROUNDING ERRORS IN BEAM-TRACKING CALCULATIONS Particle Acceleratrs, 1986, Vl. 19, pp. 99-105 0031-2460/86/1904-0099/$15.00/0 1986 Grdn and Breach, Science Publishers, S.A. Printed in the United States f America ROUNDING ERRORS IN BEAM-TRACKING CALCULATIONS

More information

UNIT 6 DETERMINATION OF FLASH AND FIRE POINT OF A LUBRICATING OIL BY OPEN CUP AND CLOSED CUP METHODS

UNIT 6 DETERMINATION OF FLASH AND FIRE POINT OF A LUBRICATING OIL BY OPEN CUP AND CLOSED CUP METHODS UNIT 6 DETERMINATION OF FLASH AND FIRE POINT OF A LUBRICATING OIL BY OPEN CUP AND CLOSED CUP METHODS Determinatin f Flash and Fire Pint f a Cup and Clsed Cup Structure 6. Intrductin Objectives 6. Experiment

More information

Module 4: General Formulation of Electric Circuit Theory

Module 4: General Formulation of Electric Circuit Theory Mdule 4: General Frmulatin f Electric Circuit Thery 4. General Frmulatin f Electric Circuit Thery All electrmagnetic phenmena are described at a fundamental level by Maxwell's equatins and the assciated

More information

and the Doppler frequency rate f R , can be related to the coefficients of this polynomial. The relationships are:

and the Doppler frequency rate f R , can be related to the coefficients of this polynomial. The relationships are: Algrithm fr Estimating R and R - (David Sandwell, SIO, August 4, 2006) Azimith cmpressin invlves the alignment f successive eches t be fcused n a pint target Let s be the slw time alng the satellite track

More information

Physics 2B Chapter 23 Notes - Faraday s Law & Inductors Spring 2018

Physics 2B Chapter 23 Notes - Faraday s Law & Inductors Spring 2018 Michael Faraday lived in the Lndn area frm 1791 t 1867. He was 29 years ld when Hand Oersted, in 1820, accidentally discvered that electric current creates magnetic field. Thrugh empirical bservatin and

More information

Revisiting the Socrates Example

Revisiting the Socrates Example Sectin 1.6 Sectin Summary Valid Arguments Inference Rules fr Prpsitinal Lgic Using Rules f Inference t Build Arguments Rules f Inference fr Quantified Statements Building Arguments fr Quantified Statements

More information