The Effect of Homogeneity on the Computational Complexity of Combinatorial Data Anonymization

Size: px

Start display at page:

Download "The Effect of Homogeneity on the Computational Complexity of Combinatorial Data Anonymization"

Amberly Lyons
6 years ago
Views:

1 Preprint To appear in Data Mining and Knowledge Discovery DOI /s Online available The Effect of Homogeneity on the Computational Complexity of Combinatorial Data Anonymization Robert Bredereck 1,, André Nichterlein 1, Rolf Niedermeier 1, Geevarghese Philip 2, Received: 18 April 2012 / Accepted: 24 September 2012 Abstract A matrix M is said to be k-anonymous if for each row r in M there are at least k 1 other rows in M which are identical to r The NP-hard k-anonymity problem asks, given an n m-matrix M over a fixed alphabet and an integer s > 0, whether M can be made k-anonymous by suppressing (blanking out) at most s entries Complementing previous work, we introduce two new data-driven parameterizations for k-anonymity the number t in of different input rows and the number t out of different output rows both modeling aspects of data homogeneity We show that k-anonymity is fixed-parameter tractable for the parameter t in, and that it is NP-hard even for t out = 2 and alphabet size four Notably, our fixed-parameter tractability result implies that k-anonymity can be solved in linear time when t in is a constant Our computational hardness results also extend to the related privacy problems p-sensitivity and l-diversity, while our fixed-parameter tractability results extend to p-sensitivity and the usage of domain generalization hierarchies, where the entries are replaced by more general data instead of being completely suppressed Keywords k-anonymity p-sensitivity l-diversity Domain generalization hierarchies Matrix modification problems Parameterized algorithmics Fixed-parameter tractability NP-hardness An extended abstract entitled The Effect of Homogeneity on the Complexity of k- Anonymity appeared in Proceedings of the 18th International Symposium on Fundamentals of Computation Theory (FCT 11), volume 6914 of LNCS, pages 53-64, Springer 2011 Apart from the full proofs omitted in that version, the current article also contains new results on l-diversity, p-sensitivity, and on the usage of domain generalization hierarchies Supported by the DFG, research project PAWS, NI 369/10 Supported by the Indo-German Max Planck Centre for Computer Science (IMPECS) 1 Institut für Softwaretechnik und Theoretische Informatik, TU Berlin, Germany 2 Max-Planck-Institut für Informatik, Saarbrücken, Germany {robertbredereck,andrenichterlein,rolfniedermeier}@tu-berlinde gphilip@mpi-infmpgde

2 1 Introduction Assume that data about n individuals are represented by length-m vectors consisting of attribute values In other words, assume that our input is an n m data matrix with entries from some (potentially large) alphabet Σ If all vectors (that is, rows) are identical, then we have full homogeneity and thus full anonymity of all individuals Relaxing full anonymity to k-anonymity (requiring that every vector occurs at least k times), in this work we investigate how the degree of (in)homogeneity (measured by the number of different vectors) both with respect to the input data matrix as well as the corresponding anonymized output data matrix influences the computational complexity of the NP-hard problem of making sets of individuals k-anonymous Moreover, we extend our investigations to further combinatorial data anonymization problems including the concepts of p-sensitivity and l-diversity Samarati and Sweeney (1998), Samarati (2001), and Sweeney (2002b) devised the notion of k-anonymity to better quantify the degree of anonymity in sanitized data This notion formalizes the intuition that entities who have identical sets of attributes cannot be distinguished from one another For a positive integer k we say that a matrix M is k-anonymous if, for each row r in M, there are at least k 1 other rows in M which are identical to r Thus k-anonymity provides a clear and simple combinatorial model for sanitizing data: choose a value of k which would satisfy the relevant privacy requirements, and then try to modify at minimum cost the matrix in such a way that it becomes k-anonymous The corresponding decision problem k- Anonymity asks, additionally given an upper bound s for the number of suppressions allowed, whether a matrix can be made k-anonymous by suppressing (blanking out) at most s entries While k-anonymity is our central data anonymization problem, our results also extend to several more restrictive anonymization problems, 1 including p-sensitivity (Truta and Vinay, 2006) and l-diversity (Machanavajjhala et al, 2007) Motivation The increased use of technology brings with it various non-obvious threats to personal privacy For example, location data from tracking devices such as Global Positioning System (GPS) devices used in public transport systems or communication devices such as mobile phones can be used by an adversary in space or time-correlated inference attacks to deduce private information about individuals Another seemingly unlikely source of breach of privacy lies in the search logs of web search engines, which are sometimes released as a valuable aid for research on information retrieval Yet other, perhaps more obvious, sources of private information are medical data or the relationship graphs of social networks The protection of privacy is widely recognized as a human right, and at the same time there are large economic and scientific incentives for analyzing data sets which could potentially reveal private information To be able to do this without harming privacy, k- 1 See Section 2 for formal definitions 2

3 Anonymity and its variants have been proposed as a possible approach to addressing privacy concerns arising in all these scenarios (Campan and Truta, 2009; Gedik and Liu, 2008; Gkoulalas-Divanis et al, 2010; Gruteser and Grunwald, 2003; Monreale et al, 2010; Navarro-Arribas et al, 2012; Zhou and Pei, 2011) 2 We focus on a better understanding of the computational complexity and on tractable special cases of combinatorial data privacy problems; see Machanavajjhala et al (2007) and Sweeney (2002b) for discussions on the pros and cons of these models in terms of privacy vs preservation of meaningful data In particular, note that with differential privacy (cleverly adding some random noise) a statistical model has become very popular as well (Dwork, 2011; Fung et al, 2010); we do not study this here However, k-anonymity and its variants are very natural combinatorial problems (with potential applications beyond data privacy) that are easy to interpret with respect to performed data modifications, and for instance in the case of one-time anonymization, are obviously valuable for the sake of providing simple models that do not introduce noise Important Variants of k-anonymity While the k-anonymity problem is described in terms of the operation of suppressing blanking out data entries, this level of obfuscation is sometimes not required or desirable Sweeney (2002a) introduced the notion of domain generalization and domain generalization hierarchies (DGH) as a generalization of and an alternative to suppression for achieving k-anonymous datasets The operation of generalization involves replacing a data entry with a less specific element For instance, the precise age of a person might be replaced with an age range which includes the given age The new value, while still being correct for the person, is not as precise as the previous one was This idea is extended to the notion of domain generalization hierarchies, which can be thought of as increasingly more general ranges of data The operation of suppressing a data element is thus a special case of generalization The DGH-k-Anonymity problem asks for a minimal under a certain natural ordering generalization of the input matrix which is k-anonymous Observe that the k-anonymity problem is a special case of DGH-k-Anonymity where the only generalization maps each data element to the -symbol; thus DGH-k-Anonymity is at least as hard as k-anonymity; we extend some of our positive results for k-anonymity to DGH-k-Anonymity The attributes used in the well-known linking attack method by Sweeney (2002b) for re-identification were gender, ZIP code and birth date She called such attributes containing publicly available data quasi-identifiers Published datasets, such as medical data, consist of these quasi-identifiers and private information (eg disease) To protect against a linking attack, it is sufficient 2 While it is beyond the scope of this work to fully address all the potential weaknesses of k-anonymity, we mention for the sake of completeness that k-anonymity is known to be vulnerable against the attack models attribute linkage, table linkage, and probabilistic attack (Fung et al, 2010) 3

4 to make the quasi-identifiers k-anonymous Hence, the input matrix M for k- Anonymity contains only quasi-identifiers This leads to a major drawback of k-anonymity, namely the possibility that the private information may end up being highly uniform in the output For example, an attacker cannot determine the row corresponding to one individual but using the linking attack method of Sweeney (2000) he may identify k rows from which exactly one corresponds to the individual If the private information stored in these k rows is equal, then the attacker does not know the row corresponding to the individual but the private information belonging to the individual is revealed To prevent this, different concepts were introduced, eg p-sensitivity (Truta and Vinay, 2006), l-diversity (Machanavajjhala et al, 2007), and t-closeness (Li et al, 2007) We will partially extend our findings for k-anonymity to p-sensitivity and l-diversity p-sensitivity is an enhancement to k-anonymity in the sense that there is the additional requirement that in the output matrix each maximal set of rows that are identical in the quasi-identifiers contains at least p different private values l-diversity is an enhancement requiring that each such set of rows contains at least l well represented private values The combinatorial realization of well represented that we use is to require that in each such set of rows the relative frequency of each private value is at most 1/l (Wong et al, 2006; Xiao and Tao, 2006) Computational Complexity k-anonymity and many related problems are NP-hard (Meyerson and Williams, 2004), even when the input matrix is highly restricted Concerning polynomial-time approximability, k-anonymity is APX-hard when k = 3, even when the alphabet size is just two (Bonizzoni et al, 2011a); it is MAX SNP-hard when k = 7, even when the number of columns in the input dataset is just three (Chakaravarthy et al, 2010); and it is MAX SNP-hard when k = 3, even when the number of columns in the input dataset is 27 (Blocki and Williams, 2010) Moreover, it is APX-hard when k = 3, even when the input matrix has only three columns (Bonizzoni et al, 2011b) On the positive side, many polynomial-time approximation algorithms have been developed for the problem and its variants Meyerson and Williams (2004) devised an approximation algorithm with the approximation ratio O(k ln k) which runs in O(n 2k ) time, and another one which runs in time polynomial in both n and k and has the ratio O(k ln n) Gionis and Tassa (2009) came up with an algorithm which runs in O(n 2k ) time and has the approximation ratio O(ln k) This was later improved upon by Park and Shim (2007) and by Kenig and Tassa (2012), who devised approximation algorithms with the same ratio and the same worst-case running time, but which run much faster on large classes of inputs Parameterized Complexity Confronted with this computational hardness, we study aspects of the parameterized (or multivariate) computational complexity of k-anonymity and its variants as initiated by Evans et al (2009) The 4

5 central question here is how naturally occurring parameters influence computational complexity For example, consider the following parameterization Is k-anonymity polynomial-time solvable for constant values of k? While 2-Anonymity is polynomial-time solvable (Blocki and Williams, 2010), the general answer is no since already 3-Anonymity is NP-hard (Meyerson and Williams, 2004), even on binary data sets (Bonizzoni et al, 2011a) Thus, k alone does not yield a promising parameterization k-anonymity has a number of meaningful parameterizations beyond k, including the number of rows n, the alphabet size Σ, the number of columns m, and, in the spirit of multivariate algorithmics (Fellows, 2009; Niedermeier, 2010), various combinations of single parameters Here the arity of the alphabet Σ may range from binary (such as gender) to unbounded (such as net worth) For instance, answering an open question of Evans et al (2009), Bonizzoni et al (2011b) showed that k-anonymity is fixed-parameter tractable with respect to the combined parameter (m, Σ ), whereas there is no hope for fixedparameter tractability with respect to the single parameters m and Σ (Evans et al, 2009) We emphasize that Bonizzoni et al (2011b) made use of the fact that the value Σ m is an upper bound on the number of different input rows, thus implicitly exploiting a very rough upper bound on input homogeneity In this work, we refine this view by asking how the degree of homogeneity of the input matrix influences the complexity of k-anonymity In other words, is k-anonymity fixed-parameter tractable for the parameter number of different input rows? In a similar vein, we also study the effect of the degree of homogeneity of the output matrix on the complexity of k-anonymity Table 1, which extends similar tables due to Evans et al (2009) and Bonizzoni et al (2011b), summarizes known and new results for k-anonymity Our Contributions We introduce the homogeneity parameters the number t in of different input rows and the number t out of different output rows, for studying the computational complexity of k-anonymity and related problems Typically, we expect t in n and t in Σ m The latter relation is obvious since Σ m just refers to having all possible input rows over the alphabet Σ The relation t in n, however, may also be justified in natural settings First, assume that the data are already k -anonymous and shall be made k-anonymous with k > k This may be due stronger safety requirements Then we have t in n/k Second, the parameter t in can be viewed as a distance from triviality-parameterization in the sense that if t in is small then the input is almost k-anonymous and hopefully needs not many modifications to be made k-anonymous This may be particularly true when the matrix entries represent ranges (such as age range 40-50) and are not too fine-grained (such as specifying the exact birth day) Third, we study the adult data set (Frank and Asuncion, 2010): Preparing it as described in Machanavajjhala et al (2007), it consists of n = rows and t in = input row types, that is, t in 06 n Furthermore, the data set has at least one column with 72 different values, its alphabet size is Σ = 72, and it has eight columns Hence, for this dataset we have Σ m = , and so t in Σ m Indeed, 5

6 Table 1 The parameterized complexity of k-anonymity FPT stands for fixed-parameter tractability and W[1]-hardness means presumable fixed-parameter intractability, see Section 2 for definitions Results proved in this paper are in bold The column and row entries represent parameters For instance, the entry in row and column s refers to the (parameterized) complexity for the single parameter s whereas the entry in row m and column s refers to the (parameterized) complexity for the combined parameter (s, m) An entry NPhard means that k-anonymity remains NP-hard even if the parameter adopts constant values In addition,? marks a currently unknown parameterized complexity status k s k, s NP-hard a NP-hard a W[1]-hard b W[1]-hard b Σ NP-hard c NP-hard c?? m NP-hard d NP-hard d FPT e FPT e n FPT e FPT e FPT e FPT e Σ, m FPT b FPT e FPT e FPT e Σ, n FPT e FPT e FPT e FPT e t in FPT FPT FPT FPT t out NP-hard? FPT FPT Σ, t out NP-hard? FPT FPT k s Σ m n t in t out parameters degree of anonymity number of suppressions alphabet size number of columns number of rows number of different input rows number of different output rows a Meyerson and Williams (2004) b Bonizzoni et al (2011b) c Aggarwal et al (2005) d Bonizzoni et al (2011a) e Evans et al (2009) t in is a data-driven parameterization in the sense that one can efficiently measure in advance the instance-specific value of t in, whereas Σ m denotes the maximum possible degree of inhomogeneity As to parameterization by output homogeneity t out, we basically show that this leads to computational intractability even when t out = 2 and the alphabet size is t out +2, irrespective of the variant of t out being part of the input or not etc which we consider This holds for k-anonymity as well as for the more restrictive privacy concepts that lead to the problems p-sensitivity (Truta and Vinay, 2006) and l-diversity (Machanavajjhala et al, 2007) On the positive side we show that parameterization by input homogeneity t in yields several tractability results For instance, we derive an algorithm that solves k-anonymity in O(nm + 2 t2 in t 2 in (m + t in log(t in ))) time, which compares favorably with the Bonizzoni et al (2011b) algorithm which runs in O(2 ( Σ +1)m kmn 2 ) time In particular, when t in is a constant, our algorithm 6

7 solves k-anonymity in time linear in the size of the input Moreover, we prove that k-anonymity becomes fixed-parameter tractable for the parameter m when t out is a constant, and that k-anonymity becomes fixed-parameter tractable for the combined parameter (t out, s) where s denotes the number of allowed suppressions in the matrix We also show that our positive results for k- Anonymity parameterized by t in generalize to the problems p-sensitivity, DGH-k-Anonymity, and DGH-p-Sensitivity, with similar time bounds In Table 1 we summarize our results on k-anonymity and present them in the context of previous results on the parameterized complexity of k-anonymity Organization of the Paper In the next section we introduce the notation and terminology which we use in the paper In Section 3 we consider the homogeneity of the output matrix and demonstrate that achieving a very homogeneous output matrix (as might seem desirable for privacy purposes) is computationally very hard This applies to all problems studied in our work In Section 4, we show that for homogeneous input matrices one can gain several fixedparameter tractability results for most problems considered in this article, leaving the case of l-diversity as a major open question We conclude in Section 5 with a discussion of directions for future research 2 Preliminaries We briefly overview some relevant concepts from parameterized complexity which we use later We then describe the notation and terminology used in the discussion on k-anonymity, p-sensitivity, and l-diversity The definitions specific to DGH-k-Anonymity can be found in Subsection 44 Parameterized Complexity Our algorithmic results mostly rely on concepts of parameterized algorithmics (Downey and Fellows, 1999; Flum and Grohe, 2006; Niedermeier, 2006) The fundamental idea herein is, given a computationally hard problem X, to identify a parameter p (typically a positive integer or a tuple of positive integers) for X and to determine whether a size-n input instance of X can be solved in f(p) n O(1) time, where f is an arbitrary computable function If this is the case, then one says that X is fixed-parameter tractable for the parameter p The corresponding complexity class is called FPT If X could only be solved in polynomial running time where the degree of the polynomial depends on p (such as n O(p) ), then, for parameter p, problem X is said to lie in the strictly larger parameterized complexity class XP Finally, we also consider the parameterized complexity class W[1] with FPT W[1] XP It is widely believed that a problem which is W[1]- hard based on the so-called parameterized reductions (Downey and Fellows, 1999) does not have FPT algorithms 7

8 k-anonymity Our inputs are datasets in the form of n m-matrices, where the n rows refer to the individuals and the m columns correspond to attributes with entries drawn from an alphabet Σ Suppressing an entry M[i, j] of an n m-matrix M over alphabet Σ with 1 i n and 1 j m means to simply replace M[i, j] Σ by the new symbol ending up with a matrix over the alphabet Σ { } Definition 1 A row type is a string from (Σ { }) m We say that a row in a matrix has a certain row type if it coincides in all its entries with the row type In what follows, synonymously, we sometimes also speak of a row lying in a row type, and that the row type contains these rows Definition 2 (k-anonymity (Samarati, 2001; Samarati and Sweeney, 1998; Sweeney, 2002b)) A matrix is k-anonymous if for every row in the matrix one can find at least k 1 other identical rows A natural objective when trying to achieve k-anonymity is to minimize the number of suppressed matrix entries For a row y (with some entries suppressed) in the output matrix, we call a row x in the input matrix the preimage of y if y is obtained from x by suppressing in x the -entry positions of y The basic problem of this work reads as follows k-anonymity Input: An n m-matrix M over a fixed alphabet Σ and nonnegative integers k, s Question: Can one suppress at most s elements of M to obtain a k- anonymous matrix M? We study two new parameterizations t in and t out, referring to the input and the output homogeneity, respectively, where t in denotes the number of row types in the input matrix and t out denotes the number of row types in the output k-anonymous matrix Note that as we show later t in can be determined efficiently In Section 3, we discuss several possibilities for a more fine-grained and mathematically precise definition of t out p-sensitivity and l-diversity For the problems p-sensitivity and l- Diversity, the input matrix M consists of a matrix where the last column is declared as private and the other columns as quasi-identifiers 3 The row types are defined over the submatrix restricted to the quasi-identifier columns (denoted by M qi ) That is, when two rows are identical in their quasi-identifier entries quasi-identifiers for short then they belong to the same row type, irrespective of whether they differ in their private columns This definition fits to the anonymization concepts: The more homogeneous the quasi-identifiers are, the easier p-sensitivity and l-diversity will be to solve A similar 3 Following an approach due to Xiao et al (2010) we consider the case with one private information column 8

9 argument does not hold for the private information: Homogeneous private information does not make these two problems more efficiently solvable Hence, row types are defined over the quasi-identifiers only Definition 3 (p-sensitive (Truta and Vinay, 2006)) A matrix is p-sensitive if for each row type there are p different values in the private column p-sensitivity Input: An n m-matrix M over a fixed alphabet Σ and nonnegative integers k, s, and p Question: Can one suppress at most s elements in M qi to obtain a p- sensitive matrix M where M qi is k-anonymous? Definition 4 (l-diverse (Wong et al, 2006; Xiao and Tao, 2006)) A matrix is l-diverse if for each row type the relative frequency of each value in the private column is at most 1/l Note that Definition 4 implies that in an l-diverse matrix each row type is of size at least l l-diversity Input: An n m-matrix M over a fixed alphabet Σ and nonnegative integers l and s Question: Can one suppress at most s elements in M qi to obtain an l- diverse matrix M? 3 Hardness of Achieving Homogeneous Outputs In Section 31, we discuss some basic concepts of output homogeneity In Section 32, we demonstrate the computational intractability of achieving homogeneous outputs for k-anonymity, l-diversity and p-sensitivity 31 Output Homogeneity There are two cases to consider when investigating the homogeneity of the output matrix: The user specifies the required output homogeneity by providing the parameter t out as part of the input The output homogeneity is not specified by the user In the following, we briefly discuss both cases 9

10 User-Specified Output Homogeneity In this case, the user wants to specify the number of output row types Here, we consider three variants: The minimum number of output row types is specified in the input The maximum number of output row types is specified in the input The exact number of output row types is specified in the input There is a close relationship between k-anonymity and clustering problems where one is also interested in grouping together similar objects Such a relationship has already been observed in related work (Aggarwal et al, 2010; Bredereck et al, 2011; Fard and Wang, 2010) This clustering view on k- Anonymity makes the three described problem variants very interesting because they allow the user to specified a minimum, maximum, or exact number of clusters which is a useful feature for many applications Output Homogeneity not Specified by the User If the output homogeneity is not specified, then it is a property of the set of feasible solutions Here, the two simplest properties yield the following two parameters: The minimum number t min out of output row types among all feasible solutions The maximum number t max out of output row types among all feasible solutions The parameter t max out is the more interesting one from the point of view of anonymization Given a fixed number s of allowed suppressions and a required degree k of anonymity, in this case the user of the data is interested in output matrices with as many output row types as possible Due to the higher diversity more output row types potentially provide more information to the end-user of the data The parameter t min out is more interesting than t max out from an algorithmic per- t max out, fixed-parameter algorithms with t min out as the pa- are potentially more effective It is not difficult to is much smaller than t max spective Since t min out rameter instead of t max out think of real-world examples where t min out out 32 Hardness Results It is a natural question whether k-anonymity is fixed-parameter tractable with respect to the number of output types, for each of the parameters discussed in Section 31 It is easy to see that the problem becomes polynomialtime solvable when there is only one output row type Answering this question for more than one output row types in the negative, we show that k- Anonymity is NP-hard even when there are only two output types and the alphabet has size four, destroying any hope for fixed-parameter tractability already for the combined parameter number of output types and alphabet size This intractability also transfers to all the three variants of user-specified output homogeneity discussed above The hardness proof uses a polynomial-time many-one reduction from Balanced Complete Bipartite Subgraph: 10

11 Balanced Complete Bipartite Subgraph (BCBS) Input: A bipartite graph G = (A B, E) and an integer k 1 Question: Is there a complete bipartite subgraph G = (A B, E ) of G with A A and B B such that A = B k? BCBS is NP-complete by a reduction from Clique (Johnson, 1987) We provide a polynomial-time many-one reduction from a special case of BCBS with a balanced input graph, that is, a bipartite graph that can be partitioned into two independent sets of the same size, and k = V /4 This special case is also NP-complete by a simple reduction from the general BCBS problem: Let V := A B with A and B being the two vertex partition classes of the input graph If the input graph is not balanced, that is, if A B 0, then add A B isolated vertices to the partition class of smaller size If k < V /4, then repeat the following until k = V /4: Add a new vertex a to A and a new vertex b to B Make a adjacent to all vertices in B, and b adjacent to all vertices in A Increment k by 1 If k > V /4, then add 2k V /2 isolated vertices to each of A and B We devise a reduction from BCBS with a balanced input graph and k = V /4 to show the NP-hardness of k-anonymity This reduction also introduces the basic structure of the constructions used in all the hardness proofs in this work Proposition 1 k-anonymity is NP-complete for two output row types and alphabet size four Proof We only have to prove NP-hardness, since containment in NP is clear Let (G, n/4) be a BCBS-instance where G = (V, E) is a balanced bipartite graph, and where n := V Let A = {a 1, a 2,, a n/2 } and B = {b 1, b 2,, b n/2 } be the two vertex partition classes of G We construct an (n/4+1)-anonymityinstance that is a yes-instance if and only if (G, n/4) BCBS To this end, the main idea is to use a matrix expressing the adjacencies between A and B as the input matrix Partition class A corresponds to the rows and partition class B corresponds to the columns The salient points of our reduction are as follows 1 By making a matrix with 2x rows x-anonymous we ensure that there are at most two output types One of the types, the solution type, corresponds to the solution set of the original instance: Solution set vertices from A are represented by rows that are preimages of rows in the solution type and solution set vertices from B are represented by columns in the solution type that are not suppressed 2 We add one row that contains the -symbol in each entry Since the - symbol is not used in any other row, this enforces the other output type to be fully suppressed, that is, each column is suppressed 11

12 b 1 b 2 b n/2 1 b n/2 a a a n/ a n/ a n/ Fig 1 Typical structure of the matrix D of the k-anonymity instance obtained by the reduction from Balanced Complete Bipartite Subgraph for t out = 2 3 Since the rows in the solution type have to agree on the columns that are not suppressed, we have to ensure that they agree on adjacencies to model BCBS This is done by using two different types of 0-symbols representing non-adjacency The 1-symbol represents adjacency The matrix D is described in the following and illustrated in Figure 1 There is one row for each vertex in A and one column for each vertex in B The value in the i th column of the j th row is 1 if a j is adjacent to b i and, otherwise, 0 1 if 4 j n/4 and 0 2 if j > n/4 Additionally, there are two further rows, one containing only 1s and one containing only -symbols The number of allowed suppressions is s := (n/4 + 1) n/2 + (n/4 + 1) n/4 This completes the construction It remains to show that (G, n/4) is a yes-instance of BCBS if and only if the constructed matrix can be transformed into an (n/4 + 1)-anonymous matrix by suppressing at most s elements We start with the easier direction: : Let A A and B B be the vertex subsets forming a balanced complete bipartite subgraph of G such that A = B = n/4 The (n/4 + 1)- anonymous matrix looks as follows: The first output type is fully suppressed and contains the all- row, and each row which corresponds to a vertex from A \ A The second output type contains the remaining rows; in it each column corresponding to a vertex in B \ B is suppressed These rows clearly agree in each unsuppressed column, because each vertex from A is adjacent to each vertex from B Hence, each unsuppressed column contains a 1 The fully suppressed output type needs (n/4 + 1) n/2 suppressions and the second output type needs at most (n/4+1) n/4 suppressions Thus, the total number of suppressions is at most s : The (n/4 + 1)-anonymous matrix consists of exactly two output types because having only one type would cause (n/2+2) n/2 > s suppressions (due to the all- row everything would have to be suppressed) and an (n/4 + 1)- anonymous matrix with n/2+2 rows cannot have more than two output types One output type contains n/4 + 1 fully suppressed rows because of the all- 4 We assume without loss of generality that n is divisible by four 12

13 b 1 b 2 b n/2 1 b n/2 d 1 d 2 d n/2 a a a n/ a n/ a n/ n/ n/ n/4 + 1 t t t t t t t t Fig 2 Structure of the matrix D of the k-anonymity instance obtained by the extended reduction from Balanced Complete Bipartite Subgraph for general t out New parts of the reduction are in boldface row in the input Since the total number s of suppressions is (n/4 + 1) n/2 + (n/4 + 1) n/4 and for the first type one already needs at least (n/4 + 1) n/2 suppressions, the other type contains at most n/4 suppressed columns Since no symbol other than 1 appears in more than n/4 rows (by construction), each unsuppressed column consists entirely of 1s Now, consider the vertex subset A A corresponding to the rows of the second output type and the vertex subset B B corresponding to the unsuppressed columns Since at most one row in the second output type does not correspond to a vertex from A (the all-1 row), A n/4 Furthermore, B n/4 because at most n/4 of the n/2 columns can be suppressed in the second output type Since the second output type only consists of - and 1-entries, the subgraph induced by A and B is a balanced complete bipartite subgraph of G Observe that, for the matrix D constructed in the proof of Proposition 1, t min out = t max out = 2: If the matrix D is a yes-instance, then by construction it must have exactly two output types It follows that k-anonymity is NPcomplete for the maximization of the number of output row types, for its minimization, as well as when asking for an exact number of output row types In the following, we show that this hardness result holds for all fixed constants greater than or equal to two Theorem 1 For any fixed constant t out 2, k-anonymity is NP-complete, where t out = t min out = t max out and the alphabet size is t out + 2, even if the exact, minimum, or maximum number of output row types is additionally specified in the input 13

14 Proof We will show that the reduction from Proposition 1 can be extended such that for any given constant t out 2, a solution must have exactly t out output row types An illustration of the reduction can be found in Figure 2 As before, the reduction is from BCBS First, we construct a k-anonymity instance as described in Proposition 1 Then, we add n/2 dummy columns (d 1,, d n/2 ) filled with 1-entries Finally, for each integer 2 < i t out, we add n/4 + 1 dummy rows filled with the new symbol i in each entry As in the previous proof, the number of allowed suppressions is s := (n/4 + 1) n/2 + (n/4 + 1) n/4, and the degree of anonymity is k := (n/4 + 1) : Given a solution for BCBS, constructing the solution for the extended k-anonymity instance works exactly like before Each dummy row occurs n/4 + 1 times in the reduced instance, and so none of their entries need be suppressed The resulting solution has exactly t out output types : two output types as in the previous proof, and t out 2 other output types each of which consists of all the dummy rows which have the same symbol : Consider a solution for the k-anonymity instance If a dummy row and a row from another type have to be included in the same output type, then this requires suppressing all the entries of both these rows, at a cost of (n/4 + 1) n suppressions Since this is more than the allowed cost, it follows that in the output the dummy rows are left as they are So the dummy rows form t out 2 output row types By the same argument as in the proof of Proposition 1, it follows that there is a solution for the corresponding BCBS instance which is encoded in two further output row types Observe that the (n/4+1)-anonymous output matrix consists of exactly t out output row types Hence the theorem holds for t min out and t max out This clearly transfers to the cases where t out is specified in the input By allowing more symbols in the alphabet, the intractability result of Theorem 1 also extends to p-sensitivity and l-diversity Corollary 1 For any fixed constant t out 2, 1 p-sensitivity is NP-complete, where t out = t min out = t max out and the alphabet size is max(t out + 2, p), even if the exact, minimum, or maximum number of output row types is additionally specified in the input 2 l-diversity is NP-complete, where t out = t min out = t max out, even if the exact, minimum, or maximum number of output row types is additionally specified in the input Proof To extend the reduction given for the proof of Theorem 1 to also work for l-diversity and p-sensitivity, we define all original columns as quasiidentifiers, and add one extra private column To keep the argument simple, we only sketch the proof for 2-Sensitivity Cases with larger values of p can be handled in a similar fashion, with an alphabet of size max(t out + 2, p) 1 For 2-Sensitivity, set the private entries for the all-1 row and the all- row to 1 For each dummy row type, set half of the private entries to 0 and the rest to 1 Finally, set the private entries of the remaining rows to 0 and set p to 2 14

15 2 For l-diversity, set the private entry for each row to a new unique symbol and set l := k = (n/4 + 1) Now, for every row type containing at least (n/4 + 1) rows, the relative frequency of each private value is at most 1/(n/4 + 1) Furthermore, for every row type containing less than (n/4 + 1) rows, the relative frequency of each private value is at least 1/(n/4) Thus, every l-diverse matrix is also k-anonymous (restricted on the quasi-identifiers) The rest of the proof works analogously to the proof of Theorem 1 4 Tractability Results As we saw in the previous section, k-anonymity and its variants l- Diversity and p-sensitivity are all NP-hard even when there are just two different row types in the output As such, unless P=NP, these problems cannot be solved in polynomial time where the degree of the polynomial is a constant independent of the input even when it is specified as part of the input that the output has to contain a bounded number of row types In this section we see that the situation changes for the better when the number of row types in the input matrix is small We show, for instance, that when the number of row types in the input matrix is a constant, then we can solve k-anonymity in time linear in the input size (see Subsection 41) We then show how to adjust the algorithm to achieve fixed-parameter tractability for the combined parameter (t out, s) (where s denotes the allowed number of suppressions) and for the parameter number m of columns when t out is a constant (see Subsection 42) We also derive results for p-sensitivity (see Subsection 43) and for anonymization using the so-called domain generalization hierarchies (see Subsection 44) The algorithms described in this section find solutions that have at most t out many output row types If t out is not part of the input, then by iteratively trying the values 1,, t out the minimum value such that a solution exists can be determined Thus, if t out is not given, we have t out = t min out 41 Parameter t in In this subsection we show that k-anonymity is fixed-parameter tractable with respect to the parameter number t in of input row types Since t in Σ m and t in n, this result implies that k-anonymity is also fixed-parameter tractable with respect to the combined parameter (m, Σ ) and the single parameter n Both these latter results were known previously; Bonizzoni et al (2011b) demonstrated fixed-parameter tractability for the parameter (m, Σ ), and Evans et al (2009) showed the same for the parameter n Besides achieving a fixed-parameter tractability result for a typically smaller parameter, we improve their results by giving a simpler algorithm with a (usually) better running time 15

16 Algorithm 1 Pseudo-code for solving k-anonymity The function solverowassignment solves Row Assignment in polynomial time, see Lemma 1 1: procedure solvekanonymity(m, k, s, t out) 2: Determine the row types R 1,, R tin Phase 1, Step 1 3: for each possible A : [0, 1] t in t out do Phase 1, Step 2 4: for j 1 to t out do Phase 1, Step 3 5: if A[1, j] = A[2, j] = = A[t in, j] = 0 then 6: delete empty output row type R j 7: decrease t out by one 8: else 9: Determine all entries of R j 10: if solverowassignment then Phase 2 11: return yes 12: return no Let (M, k, s) be an instance of k-anonymity, and let M be the (unknown) k-anonymous solution matrix which we seek to obtain from M by suppressing at most s elements Our fixed-parameter algorithm works in two phases In the first phase, the algorithm guesses the entries of each row type R j in M In the second phase, the algorithm computes an assignment of the rows of M to the row types R j in M see Algorithm 1 for an outline We now explain the two phases in detail, beginning with Phase 1 To first determine the row types R i of M (line 2 in Algorithm 1), the algorithm constructs a trie (Fredkin, 1960) on the rows of M The leaves of the trie correspond to the row types of M For later use, the algorithm also keeps track of the numbers n 1, n 2,, n tin of each type of row that is present in M; this can clearly be done by keeping a counter at each leaf of the trie and incrementing it by one whenever a new row matches the path to a leaf All of this can be done in a single pass over M For implementing the guess in Step 2 of Phase 1, the algorithm goes over all binary matrices of dimension t in t out ; such a matrix A is interpreted as follows: A row of type R i is mapped 5 to a row of type R j if and only if A[i, j] = 1 (see line 3) Note that we allow in our guessing step that an output type may contain no row of any input row type These empty output row types are deleted Hence, with our guessing in Step 2, we guess not only output matrices M with exactly t out types, but also matrices M with at most t out types Now the algorithm computes the entries of each row type R j, 1 j t out, of M (Step 3 of Phase 1) Assume for ease of notation that R 1,, R l are the row types of M which contribute (according to the guessing) at least one row to the (as yet unknown) output row type R j Now, for each 1 i m, if R 1 [i] = R 2 [i] = = R l [i], then set R j [i] := R 1[i]; otherwise, set R j [i] := This yields the entries of the output type R j, and the number ω j of suppres- 5 Note that not all rows of an input type need to be mapped to the same output type 16

17 sions required to convert any input row (if possible) to the type R j is the number of -entries in R j The guessing in Step 2 of Phase 1 takes time exponential in the parameter (t in, t out ), but Phase 2 can be done in polynomial time To show this, we prove that Row Assignment is polynomial-time solvable We do this in the next lemma, after formally defining the Row Assignment problem To this end, we use the two sets T in = {1,, t in } and T out = {1,, t out } Row Assignment Input: Nonnegative integers k, s, ω 1,, ω tout and n 1,, n tin with tin i=1 n i = n, and a function a : T in T out {0, 1} Question: Is there a function g : T in T out {0,, n} such that t in t out i=1 j=1 a(i, j) n g(i, j) i T in j T out (1) t in i=1 g(i, j) k j T out (2) t out g(i, j) = n i i T in (3) j=1 g(i, j) ω j s (4) Row Assignment formally defines the remaining problem in Phase 2: At this stage of the algorithm the input row types R 1,, R tin and the number of rows n 1,, n tin in these input row types are known The algorithm has also computed the output row types R 1,, R t out and the number of suppressions ω 1,, ω tout in these output row types Now, the algorithm computes an assignment of the rows of the input row types to output row types such that: The assignment of the rows respects the guessing in Step 2 of Phase 1 This is secured by Inequality (1) M is k-anonymous, that is, each output row type contains at least k rows This is secured by Inequality (2) All rows of each input row type are assigned This is secured by Equation (3) The total cost of the assignment is at most s This is secured by Inequality (4) Note that in the definition of Row Assignment no row type occurs and, hence, the problem is independent of the specific entries of the input or output row types Lemma 1 Row Assignment can be solved in O((t in +t out ) log(t in +t out )(t in t out + (t in + t out ) log(t in + t out ))) time Proof We reduce Row Assignment to the Uncapacitated Minimum Cost Flow problem, which is defined as follows (Orlin, 1988): 17

18 n 1 v 1 n 2 v 2 n 3 v 3 n 4 v 4 n 5 v 5 ω 1 ω 1 ω 3 ω 2 ω 2 ω 3 ω 4 ω 4 k u 1 k u 2 k u 3 k u ( tin ) n i k t out i=1 t Fig 3 Example of the constructed network with t in = 5 and t out = 4 The number on each arc denotes its cost The number next to each node denotes its demand Uncapacitated Minimum Cost Flow Input: Task: A network (directed graph) D = (V, A) with demands d : V Z on the nodes and costs c : V V N Find a function f which minimizes (u,v) A c(u, v) f(u, v) and satisfies: f(u, v) f(v, u) = d(u) u V {v (u,v) A} {v (v,u) A} f(u, v) 0 (u, v) A We first describe the construction of the network with demands and costs For each n i, 1 i t in, add a node v i with demand n i (that is, a supply of n i ) and for each ω j add a node u j with demand k If a(i, j) = 1, then add an arc (v i, u j ) with cost ω j Finally, add a sink t with demand ( n i ) k t out and the arcs (u j, t) with cost zero See Figure 3 for an example of the construction Note that, although the arc capacities are unbounded, the maximum flow over one arc is implicitly bounded by n because the sum of all supplies is tin i=1 n i = n The Uncapacitated Minimum Cost Flow problem is solvable in O( V log( V )( A + V log( V ))) time in a network (directed graph) D = (V, A) (Orlin, 1988) Since our constructed network has O(t in + t out ) nodes and O(t in t out ) arcs, we can solve our Uncapacitated Minimum Cost Flow-instance in O((t in + t out ) log(t in + t out )(t in t out + (t in + t out ) log(t in + t out ))) time It remains to prove that the Row Assignment-instance is a yes-instance if and only if the constructed network has a minimum cost flow of cost at most s : Assume that g is a function fulfilling constraints (1) to (4) Then define a flow f as follows: For each 1 i t in, 1 j t out, set f(v i, u j ) = g(i, j) and f(u j, t) = t in i=1 g(i, j) k The flow f fulfills the demands on the nodes due to Equation (3) and Inequality (2) Since g fulfills Inequality (4) and the cost of each arc (u j, t), 1 j t out, is zero, flow f has cost of at most s 18

19 : Assume that f is a flow with cost of at most s All costs and demands are integer valued, and hence, due to the Integrality Property of network flow problems, an optimal flow has also integer values Then set g(i, j) = f(v i, u j ) for each 1 i t in, 1 j t out Note that g fulfills Equation (3) and Inequality (2) due to the demands on the nodes of the network Since n i n for all 1 i t in, also Inequality (1) is fulfilled Note that f has cost of at most s and, hence, g fulfills Inequality (4) Putting all these together, we arrive at the following theorem: Theorem 2 k-anonymity can be solved in O(nm + 2 tintout (t in t out m + (t in + t out ) log(t in + t out )(t in t out + (t in + t out ) log(t in + t out )))) time Proof The correctness of the described two-phase-algorithm (see Algorithm 1) follows from the fact that Phase 1 performs an exhaustive search, Lemma 1, and (as discussed above) that Row Assignment is indeed the remaining problem As for the running time: As described above, Step 1 of Phase 1 (line 2 in Algorithm 1) can be done in one pass of the input matrix, that is, in O(nm) time The loop starting at line 3 iterates O(2 tintout ) times The loop starting at line 4 iterates at most t out times; the condition inside this loop can be checked in O(t in ) time, and the column in A corresponding to an empty output row type can be marked as deleted in constant time The entries of each R j can be computed in O(mt in) time, as described above As we saw in Lemma 1, the Row Assignment problem can be solved in O((t in + t out ) log(t in + t out )(t in t out + (t in + t out ) log(t in + t out ))) time Putting all these together, the algorithm runs in O(nm + 2 tintout (t in t out m + (t in + t out ) log(t in + t out )(t in t out + (t in + t out ) log(t in + t out )))) time In case that t out is not given, the algorithm simple tries all possible values from 0 to t out The running time is O(nm+ t out i=1 2tini (t in im+(t in +i) log(t in + i)(t in i+(t in +i) log(t in +i)))) = O(nm+2 tintout (t in t out m+(t in +t out ) log(t in + t out )(t in t out + (t in + t out ) log(t in + t out )))) We now show that the exponential dependence of the running time of the algorithm of Theorem 2 on the number t out of output types can be done away with To this end, we exploit that the problem definition (see Section 2) does not impose any restriction on t out In particular, we are free to look for a solution which minimizes the number of output types We take advantage of this to show that, without loss of generality, one may assume t out t in (remember that in this section we set t out := t min out ) Lemma 2 Let (M, k, s) be a yes-instance of k-anonymity If M has t in row types, then by suppressing at most s elements one can construct a k-anonymous matrix M Proof Let M be a k-anonymous matrix obtained from M by suppressing at most s elements If M has at most t in row types, then there is nothing to prove So let M have more than t in row types We now describe an operation 19

20 that reduces the number of row types in M without increasing the cost (in terms of the number of suppressed elements) of obtaining M from M Call a row type R in M redistributable if, for each row r in R, there is a row type R r in M such that (i) the preimage r of r in M can be converted to the type R r by suppressing in r the -symbol positions of R r, and (ii) R r contains at most as many -symbols as R Observe that if a row type in M is redistributable, then we can eliminate this row type from M by moving each of its rows to a different, at most as expensive row type, while preserving the condition that there are at least k rows per (output) row type This operation reduces the number of row types in M without increasing the cost of obtaining M from M As long as there are redistributable row types left in M, we repeatedly eliminate row types in this manner If the number of remaining row types is at most t in, then we are done So let there be more than t in row types left in M, none of which is redistributable Consider a row type R in M that is not redistributable Let r be a row in R such that the cost of suppressing the preimage r of r to match any row type R in M with R R is more than the cost of suppressing r to match R Since R is not redistributable such a row r must exist Let R be the row type of r in M Then the cost of suppressing any row in R to match any row type R in M with R R is more than the cost of suppressing this row to match R It follows that R is the only row type in M having this property with respect to R Let f be a mapping that takes each row type R in M to some row type R for which R is the unique cheapest row type From the above argument, such an f exists and is one-to-one It follows that M has more than t in row types, a contradiction Combining Lemma 2 with Theorem 2 we get: Theorem 3 k-anonymity can be solved in O(nm + 2 t2 in t 2 in (m + t in log(t in ))) time In the beginning of this section, we remarked that our algorithm implies the fixed-parameter tractability of k-anonymity with respect to the combined parameter ( Σ, m) This was previously shown by Bonizzoni et al (2011b) They presented an algorithm for k-anonymity which runs in O(2 ( Σ +1)m kmn 2 ) time and works similarly to our algorithm in two phases: First, their algorithm guesses all possible output row types together with their entries in O(2 ( Σ +1)m ) time In Phase 1 our algorithm guesses the output row types producible from M within O(2 tintout t in t out m + mn) time using a different approach Note that, in general, t in is much smaller than the number Σ m of all possible different input types Hence, in general the guessing step of our algorithm is faster For instances where Σ m t in t out, one can exchange the guessing step with the one from Bonizzoni et al which runs in O(2 ( Σ +1)m ) time Next, we compare Phase 2 of our algorithm to the second step of Bonizzoni et al s algorithm In both algorithms, the same problem Row Assignment is solved Bonizzoni et al did this by finding a maximum matching on a bipartite 20

Using Patterns to Form Homogeneous Teams

Algorithmica manuscript No. (will be inserted by the editor) Using Patterns to Form Homogeneous Teams Robert Bredereck, Thomas Köhler, André Nichterlein, Rolf Niedermeier, Geevarghese Philip Abstract Homogeneous