DATA SAFEKEEPING AGAINST KNOWLEDGE DISCOVERY. Seunghyun Im


DATA SAFEKEEPING AGAINST KNOWLEDGE DISCOVERY by Seunghyun Im. A dissertation submitted to the faculty of The University of North Carolina at Charlotte in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Information Technology. Charlotte, 2006. Approved by: Dr. Zbigniew Ras, Dr. Mirsad Hadzikadic, Dr. Agnieszka Dardzinska, Dr. Cem Saydam, Dr. Richard Hartshorne.

© 2006 Seunghyun Im. ALL RIGHTS RESERVED.

ABSTRACT

SEUNGHYUN IM. Data Safekeeping against Knowledge Discovery (Under the direction of DR. ZBIGNIEW RAS)

This dissertation examines an important issue of data mining: how to provide meaningful knowledge without compromising data confidentiality. In other words, information systems should provide knowledge extracted from their data that can be used to identify underlying trends and patterns, but the knowledge should not be used to reveal confidential data. In conventional database systems, data confidentiality is maintained by hiding sensitive data from unauthorized users. However, hiding secret data is not sufficient in knowledge discovery systems (KDS) because of Chase [Raś and Dardzińska, 2005b] [Im and Raś, 2005]. Chase is a widely used data mining technique that predicts missing values in information systems by using knowledge obtained through pattern or rule extraction. For example, if an attribute in an information system is incomplete, we can use Chase to approximate its missing values and make the system more complete. Chase is also used to answer user queries containing non-local attributes [Raś and Joshi, 1997]: if attributes in a query are locally unknown, we search for their definitions in the KDS and use the results to replace the unknown attributes in the query. The problem Chase poses for data confidentiality comes from its ability to reveal hidden data. Sensitive data may be hidden from an information system, for example, by replacing them with null values or encrypting them with cryptographic technologies. However, any user in a KDS who has access to knowledge is able to reconstruct hidden or missing data with Chase. For example, in a standalone

information system with a partially confidential attribute, the Chase algorithm, using knowledge extracted from the non-confidential part, can reconstruct hidden (confidential) values [Im et al., 2005b]. When a system is distributed with autonomous sites, Chase can also reveal sensitive data in local information systems using knowledge extracted from both local and remote information systems [Im et al., 2005b]. Clearly, mechanisms that protect sensitive data from these vulnerabilities have to be implemented in order to build a security-aware KDS, and this dissertation presents algorithms that minimize disclosure of confidential data.

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor Dr. Zbigniew Ras for the opportunity to explore the field of data mining. His support and encouragement throughout my Ph.D. studies were invaluable. Without his insightful comments and guidance, the study of security in data mining and the completion of my Ph.D. dissertation would have been impossible. I also would like to acknowledge Dr. Mirsad Hadzikadic, Dr. Agnieszka Dardzinska, Dr. Cem Saydam and Dr. Richard Hartshorne for their support as my professors and committee members.

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
  1.1 Motivation and Problem Statement
  1.2 Approach
CHAPTER 2: BACKGROUND
  2.1 Security and Data Mining
  2.2 Incomplete Information System
  2.3 Local Chase
  2.4 Distributed Chase
CHAPTER 3: DATA SECURITY AND CHASE
  3.1 Confidential Data Disclosure by Chase
  3.2 Concept of Confidentiality
CHAPTER 4: PROTECTION WITH MINIMUM DATA LOSS
  4.1 Algorithm One: Bottom Up Approach
  4.2 Algorithm Two: Top Down Approach
CHAPTER 5: HIERARCHICAL DATA MASKING
  5.1 Rule Extraction and Hierarchical Attribute Chase
  5.2 Applicability
  5.3 Method Description
CHAPTER 6: PROTECTION WITH MINIMUM KNOWLEDGE LOSS
  6.1 Knowledge Loss Measurement based on Certainty Factor
  6.2 Knowledge Loss Measurement based on Strength Factor
  6.3 Knowledge Loss Measurement based on Coverage Factor
CHAPTER 7: PROACTIVE DATA PROTECTION AGAINST CHASE
  7.1 Proactive Data Protection based on Reducts
CHAPTER 8: CONCLUSION
  8.1 Summary
  8.2 Future Work

CHAPTER 1: INTRODUCTION

1.1 Motivation and Problem Statement

Knowledge discovery in databases (a.k.a. data mining) is recognized as an essential tool for analyzing and creating value from ever increasing amounts of data in a wide variety of domains. Its distinctive capability of extracting hidden knowledge from large volumes of data has resolved many complicated issues. One widely used application of data mining is called Chase [Raś and Dardzińska, 2004b] [Raś and Dardzińska, 2005a]. Chase is a generalized null value imputation algorithm designed to predict unknown values. For example, we can take advantage of its prediction ability to build a medical decision support system. Medical data often contain large amounts of null values due to insufficient information. There are various reasons for missing data. For instance, some medical tests cannot be taken because they are dangerous, would make a patient's condition worse, or are too expensive. In these cases, prediction of the test result can provide significant benefits to doctors, patients, or insurance companies [Korver and Janssens, 1993]. The prediction made by Chase is particularly useful and reliable because approximated values reflect the actual characteristics of the data set of an information system. The prediction capability, however, may create security breaches if an information system contains confidential data that have to be kept secret [Im and Raś, 2005] [Im, 2006]. Typical scenarios that show confidential data disclosure by Chase are the

following (see Table 1.1). Suppose that an attribute in an information system S_1 contains patient diagnosis information: part of this information is not confidential (e.g., consent has been given by the patients) while the rest should be kept secret. In this case, Chase may reveal a set of confidential data in S_1 by using the knowledge extracted at S_1. In other words, self-generated rules from non-confidential data in S_1 can be used to predict confidential data. Another example is hidden data reconstruction in a distributed knowledge discovery system (DKDS). The key concept of DKDS is to generate global knowledge through knowledge sharing. Each site in a DKDS develops knowledge independently, and these are used jointly to produce global knowledge without complex data integration. Knowledge sharing is particularly important in real world environments where data are often collected and stored in information systems residing at many different locations, built independently, instead of being placed at a single location. Assume that two sites S_1 and S_2 in a DKDS share their knowledge in order to obtain global knowledge, and that an attribute of site S_1 is confidential; the exact values of the confidential data in S_1 can be hidden by replacing them with null values. However, users at S_1 may treat them as null or missing data and try to reconstruct them with the knowledge extracted from S_2. A distributed medical information system is a good example in which an attribute is confidential for one information system, but the same attribute is not considered secret at another site (e.g., privacy regulation is less restrictive in many countries outside the U.S.). The vulnerabilities illustrated in these examples show that security-aware data management is an essential component for any KDS to ensure data confidentiality. Hiding confidential data from an information system does not guarantee the

secrecy against Chase. Regardless of the data hiding method, such as data encryption [Coppersmith, 1994] [Daemen and Rijmen, ], access control [Lunt, 1989], or simple null value replacement, hidden values may be reconstructed by Chase.

  KDS Type         | Description
  Single KDS       | Hidden data reconstruction is based on the non-confidential part of the local information system
  Distributed KDS  | Hidden data reconstruction is based on rules from local and remote information systems

  Table 1.1: Vulnerabilities by type of KDS

1.2 Approach

There are two input parameters required by Chase when predicting missing (confidential) values. One is knowledge in terms of inference rules. The other is the non-confidential part of the data. Considering that the main objective of any KDS is to discover underlying knowledge in information systems and provide it to users, protection algorithms should preserve the accuracy of the knowledge as much as possible. Therefore, we should prevent the prediction of confidential data by controlling the second input parameter, that is, the non-confidential part of the data. There is a trade-off between the strength of confidentiality and data (or knowledge) availability in this approach. As we modify more data in an information system, less knowledge is applicable by Chase and disclosure risk decreases in general. However, preserving the original data is also important for the information system to return precise answers to user queries or to generate more accurate knowledge (We

will discuss these problems further in Chapters 4 and 6). In addition, some degree of data or knowledge loss is almost inevitable to block the reconstruction of confidential data. Therefore, the key issue is to limit possible disclosure with the least amount of loss in terms of data or knowledge (see Table 1.2). As we have seen in the previous examples, different vulnerabilities exist depending on the type of knowledge discovery system. Clearly, if a KDS consists of a single information system and a confidential attribute is completely hidden, secret data cannot be reconstructed by Chase. If part of the data is not confidential, we may have to hide additional data because those data can be used by Chase. For a distributed KDS in which a confidential attribute is partially or completely hidden, some of the sensitive data can still be reconstructed by global knowledge. Whether a KDS is local or distributed, predictions often form a chain, meaning rules in KB reconstruct not only confidential data but also non-confidential data that are in turn used to predict some confidential data. This means that additionally hidden data may be reconstructed by another set of rules, and these predictions have to be evaluated again to ensure the confidentiality of sensitive data. There are two types of protection schemes depending on knowledge availability (see Table 1.3): (1) The complete set of knowledge applicable by Chase is known before we hide additional data. In this case, the system stores all available rules in a knowledge base (KB), and users acquire knowledge only from the KB. This is a common mechanism for many knowledge discovery systems, and we assure data confidentiality by hiding or modifying additional attribute values based only on the Chase-applicable knowledge in the KB. (2) The second case is when knowledge is unknown or partially known at the time of applying a protection algorithm. This is closely

related to the support and confidence of rules. If the assumption is that users are able to extract rules with any support and confidence values on the fly, a different strategy (called proactive protection) has to be taken, because hiding additional attribute values against one set of rules may not prevent data reconstruction from another set of rules.

  Method                  | Description
  Minimum Data Loss       | Minimize additional data hiding while preventing reconstruction of confidential data
  Minimum Knowledge Loss  | Preserve interesting knowledge as much as possible when hiding additional data to achieve data security

  Table 1.2: Minimum data and knowledge loss

  Protection Scheme     | Description
  Active Protection     | Applicable knowledge for Chase is known prior to execution of the protection algorithm
  Proactive Protection  | Applicable knowledge for Chase is partially known or unknown

  Table 1.3: Protection scheme by knowledge availability
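The chained predictions described above suggest a simple fixed-point procedure for the active protection scheme: hide a value, re-check which rules still fire, and hide again until no rule can reach a confidential value. The following is only an illustrative sketch under strong simplifications (single-valued rows, rules as premise/decision pairs; all names and structures here are assumptions, not the dissertation's algorithms):

```python
# Illustrative fixed-point hiding loop. Rules are (premise, (attr, value))
# pairs, where a premise is a set of (attr, value) conditions; rows are
# plain dicts. A sketch, not the dissertation's Algorithm One or Two.

def protect(row, confidential_attr, rules):
    """Hide values from `row` until no rule predicts the confidential
    attribute, directly or through a chain of re-predicted values."""
    row = dict(row)
    row.pop(confidential_attr, None)       # hide the secret value itself
    targets = {confidential_attr}          # values that must stay unpredictable
    changed = True
    while changed:
        changed = False
        for premise, (dattr, _dval) in rules:
            fires = dattr in targets and all(row.get(a) == v for a, v in premise)
            if fires:
                # Disable the rule by hiding one premise value; that value
                # joins the target set (this is the prediction chain).
                a, _v = next(iter(premise))
                if a in row:
                    del row[a]
                    targets.add(a)
                    changed = True
    return row
```

Hiding the first premise value is an arbitrary choice here; Chapters 4 and 6 are precisely about choosing which values to hide so that data or knowledge loss is minimized.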

CHAPTER 2: BACKGROUND

This chapter first reviews security in data mining as studied in other disciplines. It then discusses a null value imputation algorithm, Chase, and its use in local and distributed information systems (DIS).

2.1 Security and Data Mining

Security in KDS has been studied in various disciplines such as cryptography, statistics, and data mining. A well known security problem in the cryptography area is how to acquire global knowledge in a distributed system while exchanging data securely. In other words, the objective is to extract global knowledge without disclosing any data stored at each local site. Proposed solutions are based primarily on the idea of a secure multiparty protocol [Yao, 1996] [Du and Atallah, 2001] [Du, 2001], which ensures that no participant can learn more than its own input and the outcome of a public function. Various authors expanded the idea to build secure data mining systems. Clifton and Kantarcioglou applied the concept to association rule mining [Agrawal and Srikant, 1994] for vertically [Clifton et al., 2002] and horizontally [Kantarcioglou and Clifton, 2002] partitioned data. Du et al. [Du and Zhan, 2002] [Du and Zhan, 2003] pursued a similar idea to build a decision tree. They observed performance improvements by sending data to a 3rd party server, which they termed a commodity server. Lindell et al. [Lindell and Pinkas, 2000] presented a privacy preserving data mining scheme for the ID3 algorithm [Quinlan, 1993] using a secure multiparty protocol. They focused on improving the generic secure multiparty protocol for decision trees. All these works have a common drawback: they require expensive encryption and decryption mechanisms. Considering the extremely large amount of data to be processed, performance has to be improved before these algorithms can be applied to real-world data. Another research area at the intersection of data security and data mining is called perturbation. A dataset is perturbed (e.g., by noise addition or data swapping) before its release to the public in an effort to minimize the disclosure risk of confidential data, while maintaining statistical characteristics (e.g., mean and variance) of the original data. Muralidhar and Sarathy [Muralidhar and Sarathy, 2003] [Muralidhar and Sarathy, 1999] provided a theoretical basis for data perturbation in terms of data utilization and disclosure risks, and conducted a survey of existing perturbation methods. In the KDD area, protection of sensitive rules with minimum side effects has been discussed by several researchers. In [Oliveira and Zaiane, 2002], the authors suggested a solution for protecting sensitive association rules in the form of a sanitization process where protection is achieved by hiding selected patterns from the frequent itemsets. There has been another interesting proposal [Saygin et al., 2002] for hiding sensitive association rules. They introduced an interval of minimum support and confidence values to measure the degree of sensitivity of rules. The interval is specified by the user and only the rules within the interval are to be removed. The key contribution of this study, among many others, is to provide data security algorithms for distributed knowledge sharing systems. Previous and related works concentrated only on a single information system, or did not consider the sharing of knowledge.

2.2 Incomplete Information System

Now, we present background on incomplete information systems and the Chase algorithm. One of the assumptions of many rule extraction algorithms is that rules are extracted from an information system where the information about objects is either precisely known or not known at all. This implies that either a single value of an attribute is assigned to an object as its property or no value is assigned. However, it happens quite often that users do not have exact knowledge about objects, which makes it difficult to determine a unique set of values for an object. To overcome this problem, the notion of an incomplete information system [Dardzińska and Raś, 2003] was introduced, which is a generalization of the information system introduced by Pawlak [Pawlak, 1991] [Pawlak et al., 1995]. More formally, by an information system we mean S = (X, A, V), where X is a finite set of objects, A is a finite set of attributes, and V is a set of attribute values. In particular, we say that S is an incomplete information system of type λ [Dardzińska and Raś, 2003] if the following three conditions hold:

1. a_S(x) is defined for any x ∈ X, a ∈ A,
2. (∀x ∈ X)(∀a ∈ A)[(a_S(x) = {(a_i, p_i) : 1 ≤ i ≤ m}) → Σ_{i=1}^{m} p_i = 1],
3. (∀x ∈ X)(∀a ∈ A)[(a_S(x) = {(a_i, p_i) : 1 ≤ i ≤ m}) → (∀i)(p_i ≥ λ)].

Incompleteness is understood as having a set of weighted attribute values as the value of an attribute. The concept of multiple possible values is used for replacing null values in Chase. (Hereafter, we will use the two terms data and attribute value (AV) interchangeably.)

Example 2.1. In Table 2.1, the values of attribute a for object x_1 are a_1 and a_2 with weights 1/3 and 2/3, respectively. Clearly, we are able to assign multiple AVs
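The three conditions can be checked mechanically. A small validator, under the assumption that a_S(x) is stored as a list of (value, weight) pairs (the representation is ours, not the dissertation's):

```python
def is_type_lambda(system, lam):
    """Check the three conditions of an incomplete information system of
    type lambda: a_S(x) defined for every (x, a), weights sum to 1, and
    every weight p_i is at least lambda."""
    for x, attrs in system.items():
        for a, pairs in attrs.items():
            if not pairs:                                    # condition 1
                return False
            if abs(sum(p for _, p in pairs) - 1.0) > 1e-9:   # condition 2
                return False
            if any(p < lam for _, p in pairs):               # condition 3
                return False
    return True

# The value of a(x1) from Example 2.1: weights 1/3 and 2/3.
S = {"x1": {"a": [("a1", 1/3), ("a2", 2/3)], "b": [("b1", 1.0)]}}
```

With these weights, S is of type 0.3 but not of type 0.5, since 1/3 < 0.5.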

as possible values in an incomplete information system. Before we continue this discussion to the Chase algorithm, we first have to decide on the interpretation of the functors "or" and "and", denoted in this paper by + and ·, correspondingly. We will adopt the semantics of terms proposed by Ras and Joshi in [Raś and Joshi, 1997], as their semantics has all the properties required for the query transformation process to be sound and complete (see [Raś and Joshi, 1997]). It was shown that their semantics satisfies the following distributive property: t_1 · (t_2 + t_3) = (t_1 · t_2) + (t_1 · t_3). Let us assume that S = (X, A, V) is an information system of type λ and t is a term in predicate calculus constructed, in a standard way, from values of attributes in V seen as constants and from the two functors + and ·. By N_S(t), we mean the standard interpretation of a term t in S, defined as (see [Raś and Joshi, 1997]):

1. N_S(v) = {(x, p) : (v, p) ∈ a(x)}, for any v ∈ V_a,
2. N_S(t_1 + t_2) = N_S(t_1) ⊕ N_S(t_2),
3. N_S(t_1 · t_2) = N_S(t_1) ⊗ N_S(t_2),

where, for any N_S(t_1) = {(x_i, p_i)}_{i ∈ I} and N_S(t_2) = {(x_j, q_j)}_{j ∈ J}, we have:

1. N_S(t_1) ⊕ N_S(t_2) = {(x_i, p_i)}_{i ∈ I−J} ∪ {(x_j, q_j)}_{j ∈ J−I} ∪ {(x_i, max(p_i, q_i))}_{i ∈ I∩J},
2. N_S(t_1) ⊗ N_S(t_2) = {(x_i, p_i · q_i)}_{i ∈ I∩J}.

2.3 Local Chase

The incomplete value imputation algorithm Chase, based on the above semantics, converts an information system S of type λ to a new, more complete information system Chase(S) of the same type. This algorithm assumes partial incompleteness of data (sets of weighted AVs can be assigned to an object as its value) in system S. The main phase of the algorithm is the following:

1. identify all incomplete AVs in S.
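Treating N_S(t) as a mapping from objects to weights, the two operations are a max-union and a product-intersection. A direct transcription (dict-based; the function names are ours):

```python
def join(n1, n2):
    """N_S(t1) (+) N_S(t2): union of supports, max weight on the overlap."""
    out = dict(n1)
    for x, q in n2.items():
        out[x] = max(out[x], q) if x in out else q
    return out

def product(n1, n2):
    """N_S(t1) (.) N_S(t2): intersection of supports, product of weights."""
    return {x: p * n2[x] for x, p in n1.items() if x in n2}
```

The distributive property t_1 · (t_2 + t_3) = t_1 · t_2 + t_1 · t_3 holds here because p · max(q, r) = max(p · q, p · r) for non-negative weights.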

2. extract rules from S describing these incomplete AVs.
3. replace incomplete AVs in S by the values suggested by the rules.
4. repeat steps 1-3 until a fixed point is reached.

More specifically, suppose that KB = {(t → v_c) ∈ D : c ∈ In(A)} (called a knowledge base) is the set of all rules extracted from S = (X, A, V) by ERID(S, λ_1, λ_2), where In(A) is the set of incomplete attributes in S and λ_1, λ_2 are thresholds for minimum support and minimum confidence, correspondingly. ERID [Raś and Dardzińska, 2005c] is the algorithm for discovering rules from incomplete information systems, and it is used as part of the null value imputation algorithm Chase. Now, let R_S(x_i) ⊆ KB be the set of rules whose entire conditional part matches the AVs of x_i ∈ S, and let d be a null value. Then, there are three cases:

1. R_S(x_i) = ∅. In this case, d cannot be replaced.
2. R_S(x_i) = {r_1 = [t_1 → d_1], r_2 = [t_2 → d_1], ..., r_k = [t_k → d_1]}. In this case, every rule implies a single decision AV, and d' = d_1.
3. R_S(x_i) = {r_1 = [t_1 → d_1], r_2 = [t_2 → d_2], ..., r_k = [t_k → d_k]}. In this case, the rules imply multiple decision values, and the replacement is determined by the confidence of the predicted values.

We define the following formula [Im and Raś, 2005] to compute the confidence of each predicted value d' in case 3. Assume that the support and confidence of a rule r_i are [s_i, c_i], and that the product of the weights of the AVs of x that match t_i is P_a(t_i), for i ≤ k:

conf(d') = Σ{ P_a(t_i) · s_i · c_i : d_i = d' } / Σ{ P_a(t_i) · s_i · c_i },  1 ≤ i ≤ k   (2.1)

Clearly, in case 2, the confidence of d_1 is 1. We replace the null value d with d' when conf(d') is greater than a threshold value λ.
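Equation (2.1) is a weighted vote among the matching rules. A sketch of the case-3 computation, assuming each matching rule is reduced to its decision value, premise-weight product, support, and confidence (this flattened rule format is our assumption):

```python
def chase_confidence(matching_rules, threshold):
    """Compute conf(d') of equation (2.1) for every candidate decision
    value and keep those above the threshold. Each rule is a tuple
    (decision_value, premise_weight_product, support, confidence)."""
    total = sum(w * s * c for _, w, s, c in matching_rules)
    conf = {}
    for d, w, s, c in matching_rules:
        conf[d] = conf.get(d, 0.0) + w * s * c / total
    return {d: v for d, v in conf.items() if v > threshold}
```

In case 2 all rules share one decision value, so the single candidate gets confidence 1, as the text notes.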

[Figure 2.1: Null value imputation by Chase — number of null value imputations per iteration. Data set: 1984 US Congressional Voting; number of objects: 435; number of attributes used: 7; support: 50; confidence: 80%.]

Example 2.2. Suppose that Table 3.2 is an S of λ = 0.3, and that the rules in KB are summarized in Table 3.3. For instance, r_1 = [a_1 · c_1 → d_1] is an example of a rule belonging to KB. For a(x_6), we have two rules {r_5, r_6} whose decision values are values of a. By using equation (2.1), we have Conf_S(a_1, x_6, KB) = 0.413 and Conf_S(a_2, x_6, KB) = 0.587. Because the weights of a_1 and a_2 are greater than the threshold value, the AVs assigned to a(x_6) by Chase are {(a_1, 0.413), (a_2, 0.587)}.

Chase is an iterative process. The execution of the algorithm that generates a new information system is repeated until it reaches a state where no additional null value imputation is available. Figure 2.1 shows the number of null value imputations at each iteration for a sample data set describing US congressional voting [S. Hettich and Merz, 1998]. After the first iteration, 163 null values are replaced. In the second iteration, 10 more slots are filled. The execution stops after the third iteration.

  X    | a                     | b                     | c                     | d   | e
  x_1  | {(a_1,1/3),(a_2,2/3)} | {(b_1,2/3),(b_2,1/3)} | c_1                   | d_1 | {(e_1,1/2),(e_2,1/2)}
  x_2  | {(a_2,1/4),(a_3,3/4)} | {(b_1,1/3),(b_2,2/3)} |                       | d_2 | e_1
  x_3  |                       | b_2                   | {(c_1,1/2),(c_3,1/2)} | d_2 | e_3
  x_4  | a_3                   |                       | c_2                   | d_1 | {(e_1,2/3),(e_2,1/3)}
  x_5  | {(a_1,2/3),(a_2,1/3)} | b_1                   | c_2                   |     | e_1
  x_6  | a_2                   | b_2                   | c_3                   | d_2 | {(e_2,1/3),(e_3,2/3)}
  x_7  | a_2                   | {(b_1,1/4),(b_2,3/4)} | {(c_1,1/3),(c_2,2/3)} | d_2 | e_2
  x_8  |                       | b_2                   | c_1                   | d_1 | e_3

  Table 2.1: Information System S_1

  X    | a                     | b                     | c                     | d   | e
  x_1  | {(a_1,1/3),(a_2,2/3)} | {(b_1,2/3),(b_2,1/3)} | c_1                   | d_1 | {(e_1,1/3),(e_2,2/3)}
  x_2  | {(a_2,1/4),(a_3,3/4)} | b_1                   | {(c_1,1/3),(c_2,2/3)} | d_2 | e_1
  x_3  | a_1                   | b_2                   | {(c_1,1/2),(c_3,1/2)} | d_2 | e_3
  x_4  | a_3                   |                       | c_2                   | d_1 | e_2
  x_5  | {(a_1,3/4),(a_2,1/4)} | b_1                   | c_2                   |     | e_1
  x_6  | a_2                   | b_2                   | c_3                   | d_2 | {(e_2,1/3),(e_3,2/3)}
  x_7  | a_2                   | {(b_1,1/4),(b_2,3/4)} | c_1                   | d_2 | e_2
  x_8  | {(a_1,2/3),(a_2,1/3)} | b_2                   | c_1                   | d_1 | e_3

  Table 2.2: Information System S_2

2.4 Distributed Chase

In distributed information systems (or distributed knowledge discovery systems) it is quite possible that an attribute is missing at one site while it occurs at many others. Also, in one information system an attribute might be partially hidden, while in other systems the same attribute is either complete or close to complete. Assume that a user submits a query to one of the information systems (called a client) which involves some non-local attributes. In such a case, network communication technology is used to get definitions of these unknown attributes from other information systems (called servers). All these new definitions form a knowledge base which can be used to chase missing attributes at the client site. In Figure 2.2, we present two consecutive states of a distributed information system consisting of S_1, S_2, S_3. In the first state, all values of all missing attributes in all three information systems have to be identified. System S_1 sends a request q_S1 to the other two information systems asking them for definitions of its missing attributes. Similarly, system S_2 sends a request q_S2 to the other two information systems, and system S_3 sends a request q_S3. Next, rules describing the requested definitions are extracted from each of these three information systems and sent to the systems which requested them. That is, the set KB_1 is sent to S_2 and S_3, the set KB_2 is sent to S_1 and S_3, and the set KB_3 is sent to S_1 and S_2. The second state of the distributed information system, presented in Figure 2.2, shows all three information systems with the corresponding KB_i sets, i ∈ {1, 2, 3},

all abbreviated as KB. Now, the Chase algorithm is run independently at each of our three sites. The resulting information systems are Chase(S_1), Chase(S_2), and Chase(S_3). Now, the whole process is recursively repeated. That is, incomplete attributes in all three new information systems are identified again. Next, each of these three systems sends requests to the other two systems asking for definitions of its incomplete attributes, and when these definitions are received, they are stored in the corresponding KB sets. Then the Chase algorithm is run again at each of the three sites. The whole process is repeated until some fixed point is reached (no changes in the AVs assigned to objects are observed in any of the three systems). When this step is accomplished, a query containing some missing AVs can be submitted to any S_i, i ∈ {1, 2, 3}, and processed in a standard way. The following gives the formal definition of distributed Chase. Assume that S_1 and S_2 (see Tables 2.1 and 2.2) are partially incomplete information systems, both of type λ. The same set X of objects is stored in both systems and the same set A of attributes is used to describe them. The meaning and granularity of the values of attributes from A in both systems S_1, S_2 are also the same. Additionally, we assume that a_S1(x) = {(a_1i, p_1i) : 1 ≤ i ≤ m_1} and a_S2(x) = {(a_2i, p_2i) : 1 ≤ i ≤ m_2}. We say that the δ-containment relation Ψ holds between S_1 and S_2 if the following three conditions hold:

1. (∀x ∈ X)(∀a ∈ A)[card(a_S1(x)) ≥ card(a_S2(x))],
2. (∀x ∈ X)(∀a ∈ A)[[card(a_S1(x)) = card(a_S2(x))] → [Σ_{i≠j} |p_2i − p_2j| > Σ_{i≠j} |p_1i − p_1j|]],
3. [Σ_{i≠j} |p_2i − p_2j| − Σ_{i≠j} |p_1i − p_1j|] ≥ δ.

Instead of saying that the δ-containment relation holds between S_1 and S_2, we can equivalently say that S_1 was transformed into S_2 by a δ-containment mapping Ψ. This fact

can be presented as the statement Ψ(S_1) = S_2, or (∀x ∈ X)(∀a ∈ A)[Ψ(a_S1(x)) = a_S2(x)]. Similarly, we can either say that a_S1(x) was transformed into a_S2(x) by Ψ, or that the δ-containment relation Ψ holds between a_S1(x) and a_S2(x). So, if a δ-containment mapping Ψ converts an information system S to S', then S' is more complete than S. In other words, for at least one pair (a, x) ∈ A × X, either Ψ has to decrease the number of AVs in a_S(x), or the average difference between the confidences assigned to the AVs in a_S(x) has to increase by at least δ. To give an example of a δ-containment mapping Ψ, let us take the two information systems S_1, S_2, both of type λ, represented as Table 2.1 and Table 2.2. Also, we assume that δ = 1/6. It can easily be checked that the values assigned to e(x_1), b(x_2), c(x_2), a(x_3), e(x_4), a(x_5), c(x_7), and a(x_8) in S_1 are different from the corresponding values in S_2. In each of these eight cases, the AV assigned to an object in S_2 is less general than the value assigned to the same object in S_1. Also, it can easily be checked that Ψ satisfies the δ restriction. This means that Ψ(S_1) = S_2. The knowledge base KB contains rules extracted locally at the local site as well as rules extracted from information systems at remote sites. Since rules are extracted from different information systems, inconsistencies in semantics, if any, have to be resolved before any null value imputation can be applied. There are two options:

1. the knowledge base KB at the local site is kept consistent (in this scenario all inconsistencies have to be resolved before rules are stored in the knowledge base);
2. the knowledge base at the local site is inconsistent (values of the same attribute used in two rules extracted at different sites may be of different granularity

levels and may have different semantics associated with them). In general, we assume that the information stored in an ontology [Benjamins, 1998] [Guarino and Giaretta, 1995] and, if needed, in inter-ontologies (if they are provided) is sufficient to resolve inconsistencies in the semantics of all sites involved in Chase [Raś and Dardzińska, 2004a]. In other words, any meta-information in a DIS is described by one or more ontologies, and the inter-ontology relationships are used as a semantic bridge between autonomous information systems in order to collaborate and understand each other [Raś, 1994]. If we have a case where this assumption does not hold, that is, the rules stored in KB have different semantics that require different interpretations in order to be applicable by Chase, rough semantics can be used for interpreting the rules in KB [Raś and Dardzińska, 2004a]. In this dissertation, for simplicity, we assume that the semantics of attributes are consistent among all sites. For example, if a ∈ A_i ∩ A_j, then conceptually its meaning is the same in both S_i and S_j.
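One plausible reading of the δ-containment conditions defined earlier in this section, with AVs again as (value, weight) lists (the `spread` helper and the exact branch structure are our interpretation, not the dissertation's code):

```python
def spread(pairs):
    """Sum of pairwise absolute differences between the confidences."""
    ps = [p for _, p in pairs]
    return sum(abs(ps[i] - ps[j])
               for i in range(len(ps)) for j in range(i + 1, len(ps)))

def delta_contains(av1, av2, delta):
    """Check whether a_S1(x) -> a_S2(x) respects δ-containment: the value
    set must not grow, and if its size is unchanged, the confidence
    spread must increase by at least delta."""
    if len(av2) > len(av1):
        return False                      # condition 1 violated
    if len(av2) == len(av1):
        gain = spread(av2) - spread(av1)  # conditions 2 and 3
        return gain > 0 and gain >= delta
    return True
```

For a(x_5) in Tables 2.1 and 2.2, the weights move from (2/3, 1/3) to (3/4, 1/4); the spread grows from 1/3 to 1/2, a gain of 1/6, so the δ = 1/6 restriction is met.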

[Figure 2.2: Global extraction and exchange of knowledge. State 1: sites S_1 (attributes a, b, c, d), S_2 (b, a, d, e), and S_3 (g, a, b, c) exchange the requests q_S1 = [a, c, d : b], q_S2 = [b, a, e : d], and q_S3 = [a, b, c : g]. State 2: the rules r_1, r_2 extracted from S_2, r_3, r_4 extracted from S_3, and r_5, r_6 extracted from S_1 are stored in the KB of each requesting site.]
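The two-state process in Figure 2.2 — publish rules for the peers' missing attributes, run Chase locally, repeat until nothing changes — can be mimicked with a toy fixed-point loop. Everything here (one-premise rules, None for null values, a merged KB) is a deliberate simplification for illustration:

```python
def distributed_chase(sites, knowledge_bases):
    """Toy version of the Figure 2.2 exchange: every site repeatedly fills
    its None slots using the union of all sites' rule sets until no site
    changes. A rule ((attr, value), (attr, value)) means premise -> decision."""
    kb = [r for rules in knowledge_bases for r in rules]  # exchanged KBs
    changed = True
    while changed:
        changed = False
        for site in sites:
            for row in site.values():
                for (pa, pv), (da, dv) in kb:
                    if row.get(da) is None and row.get(pa) == pv:
                        row[da] = dv
                        changed = True
    return sites
```

The outer loop plays the role of the "whole process is recursively repeated" step; it terminates because each pass can only fill None slots, never create new ones.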

CHAPTER 3: DATA SECURITY AND CHASE

This chapter presents how confidential data can be reconstructed, and the notion of data confidentiality against Chase.

3.1 Confidential Data Disclosure by Chase

To illustrate the data confidentiality problem, let's consider the following example. Suppose a local information system S ∈ {S_i : i ∈ I} operates in a DIS. An attribute d in S contains a set of confidential data. To protect d from disclosure, we hide d from S (see Table 3.1) and construct S_d = (X, A, V) (see Table 3.2), where:

1. a_S(x) = a_Sd(x), for any a ∈ A − {d}, x ∈ X,
2. d_Sd(x) is undefined, for any x ∈ X,

and user queries are now answered by S_d in place of S. The problem is that hiding attribute d may not be enough, due to Chase. In order to reconstruct d, a request for a definition of d can be sent from site S_d to some of its remote sites involved in the DIS. These definitions, fetched and stored in the KB of S_d, are used by the distributed Chase algorithm to impute the missing values for a number of hidden AVs at S_d. Figure 3.1 shows the overall process of confidential data disclosure.

Example 3.1. We see that object x_3 is supported by three rules {r_1, r_2, r_3} (see Table 3.3), which predict d_1 with confidence Conf_Sd(d_1, x_3, KB) computed by equation (2.1). In this case, we need to hide additional AVs from x_3 to make the rules inapplicable and therefore protect the confidential value d_1.
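The check in Example 3.1 — which KB rules still fire on an object once d is hidden — is mechanical. A sketch with a simplified single-valued row and rules as (premise_dict, decision_attr, decision_value, confidence) tuples; the layout and any confidence figures used with it are our assumptions, not Table 3.3's actual entries:

```python
def disclosing_rules(row, hidden_attr, kb):
    """Return (value, confidence) for every rule whose premise is fully
    matched by the visible part of `row` and whose decision attribute
    is the hidden one."""
    visible = {a: v for a, v in row.items() if a != hidden_attr}
    return [(dval, conf)
            for premise, dattr, dval, conf in kb
            if dattr == hidden_attr
            and all(visible.get(a) == v for a, v in premise.items())]
```

For an x_3-like row, the three d-rules of Example 3.1 all fire; hiding c_1 and f_1 empties the result, which is exactly the goal of the protection step.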

[Figure 3.1: An example of confidential data reconstruction. (1) The local information system extracts a definition of the hidden attribute D (e.g., the rules a1 -> d1, b3 -> d1, a1*b1 -> d2) from remote servers in a distributed information system under the same ontology; (2) the KD engine uses these rules to predict the null (hidden) values of D in the local table.]

  X    | a                  | b                  | c                  | d   | e                  | f   | g
  x_1  | (a_1,2/3)(a_2,1/3) | b_1                | c_1                | d_1 | e_1                | f_1 | g_1
  x_2  | (a_2,2/5)(a_3,3/5) | (b_1,1/3)(b_2,2/3) |                    | d_2 | e_1                | f_2 |
  x_3  | a_1                | b_1                | (c_1,1/2)(c_3,1/2) | d_1 | e_3                | f_1 |
  x_4  | a_3                |                    | c_2                | d_1 | (e_1,2/3)(e_2,1/3) | f_2 |
  x_5  | (a_1,2/3)(a_3,1/3) | (b_1,1/2)(b_2,1/2) | c_1                | d_1 | e_1                | f_2 | g_1
  x_6  |                    | b_1                | (c_1,1/3)(c_3,2/3) | d_1 | e_1                | f_1 | g_1
  x_7  | a_1                | b_1                | c_1                |     | e_1                | f_3 | g_1
  ...
  x_i  | (a_3,1/2)(a_4,1/2) | b_2                | c_2                |     | e_3                | f_2 |

  Table 3.1: Information System S. λ = 0.3

  X    | a                  | b                  | c                  | d | e                  | f   | g
  x_1  | (a_1,2/3)(a_2,1/3) | b_1                | c_1                |   | e_1                | f_1 | g_1
  x_2  | (a_2,2/5)(a_3,3/5) | (b_1,1/3)(b_2,2/3) |                    |   | e_1                | f_2 |
  x_3  | a_1                | b_1                | (c_1,1/2)(c_3,1/2) |   | e_1                | f_1 |
  x_4  | a_3                |                    | c_2                |   | (e_1,2/3)(e_2,1/3) | f_2 |
  x_5  | (a_1,2/3)(a_3,1/3) | (b_1,1/2)(b_2,1/2) | c_1                |   | e_1                | f_2 | g_1
  x_6  |                    | b_1                | (c_1,1/3)(c_3,2/3) |   | e_1                | f_1 | g_1
  x_7  | a_1                | b_1                | c_1                |   | e_1                | f_3 | g_1
  ...
  x_i  | (a_3,1/2)(a_4,1/2) | b_2                | c_2                |   | e_3                | f_2 |

  Table 3.2: Information System S_d. λ = 0.3

Rule | definition     | sup | conf | source
r1   | a1 * c1 -> d1  | 20  | 95%  | S2
r2   | b1 * c1 -> d1  | 20  | 92%  | S2
r3   | f1 -> d1       | ?   | ?    | S1
r4   | f? -> d2       | ?   | ?    | S1
r5   | b1 * c1 -> a1  | ?   | ?    | S_d
r6   | c1 * f3 -> a2  | ?   | ?    | S_d
r7   | c1 -> b1       | ?   | ?    | S_d
r8   | e? -> b1       | ?   | ?    | S_d
r9   | a1 * g1 -> c1  | ?   | ?    | S_d
r10  | a1 * c1 -> e1  | 20  | 92%  | S_d
r11  | e1 * g1 -> c1  | ?   | ?    | S_d
r12  | c? -> b1       | ?   | ?    | S_d

Table 3.3: Rules contained in KB. Values after the arrow are the decision values; a '?' marks a digit or entry that is not legible in this copy

To examine the additional disclosure risk caused by Local Chase, let us consider the locally extracted rules in KB (that is, r5..r12 in Table 3.3). In other words, we run the Chase algorithm not only for the confidential AV, but also for the AV's that we have hidden.

Example 3.2 The rules {r1, r2, r3} are supported by x3 in Example 3.1. Suppose we hide the two AV's {c1, f1} from x3. Clearly, all three rules are now inapplicable, and the minimum number of hidden AV required to protect d1 becomes 2. However, we can see that c1, which was just hidden, is reconstructed by {r9, r11} through local Chase. This means that {r9, r11} are in the prediction path, and we therefore need to hide additional AV's again.

3.2 Concept of Confidentiality

Before we consider the reconstruction of an information system, it is important to specify the services the system provides and the tasks users can perform, since these determine the type and amount of information users can acquire from the system. We assume that the following meta-information about an information system S is known to users:

1. The list of attributes is known to users.
2. The list of objects is known to users.
3. Attribute values are hidden from users.

In addition, the following tasks can be performed by users:

1. Request the list of rules in the knowledge base.
2. Request the list of objects that satisfy the rules.
3. Request the estimated value of undefined values (e.g. null values).

We assume that no attribute value is known to users at the time the knowledge base is built. Some attribute values contribute to rule generation and others do not.

Those attribute values that are involved in rule generation can generally be identified; note, however, that this applies only to rules generated locally. Rules that came from remote sites need not have this property. Moreover, attribute values cannot be identified precisely if (1) the original information system is not shown to the users and (2) the information system is incomplete.

Before we discuss our algorithms, we define the confidentiality of hidden AV's against Chase. Assume that KB contains the rules extracted from local and remote sites, and let the confidential attribute value be d_S(x) = d_j. Then there are three cases:

1. If Conf_{S_d}(d_j, x, KB) ≥ λ and (∃ d ≠ d_j)[Conf_{S_d}(d, x, KB) ≥ λ], then d_j is secure (the prediction is ambiguous) and we do not hide any additional slots for x.
2. If Conf_{S_d}(d_j, x, KB) ≥ λ and (∀ d ≠ d_j)[Conf_{S_d}(d, x, KB) < λ], then d_j is not secure and we hide additional slots for x.
3. If Conf_{S_d}(d_j, x, KB) < λ and (∃ d ≠ d_j)[Conf_{S_d}(d, x, KB) ≥ λ], then d_j is secure and we do not hide any additional slots for x.

Case 2 can be further divided depending on the system. When a confidential value has been predicted by Chase, the weight of the predicted value may be substantially different from that of the actual value. If so, protection may not be required, because an adversary cannot have enough confidence in the prediction. To determine whether a prediction is valid, we use a measurement function and compare the result with a threshold value τ. Suppose the weight of the actual confidential value d_i is denoted p_d[i] and the weight of the predicted value is denoted p'_d[i]. The degree of validity associated with the prediction is defined as

    p(v) = 1 − |p_d[i] − p'_d[i]| / p_d[i]    (3.1)

and we say d_i is secure against Chase if p(v) < τ.

Example 3.3 Assume that τ = 0.8 and a confidential AV is {(d1, 3/4), (d2, 1/4)}. Suppose the weights of its predicted values are {(d1, 1/4), (d2, 3/4)}. Then d1 is not considered a valid prediction because, using function (3.1), p(v) = 1 − (0.5/0.75) ≈ 0.33 < 0.8.
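Formula (3.1) and Example 3.3 can be checked directly; `validity` is a hypothetical helper name for the measurement function described above.

```python
def validity(p_actual, p_predicted):
    """Degree of validity of a Chase prediction, formula (3.1):
    p(v) = 1 - |p_d[i] - p'_d[i]| / p_d[i]."""
    return 1 - abs(p_actual - p_predicted) / p_actual

# Example 3.3: the actual weight of d1 is 3/4, the predicted weight is 1/4.
tau = 0.8
pv = validity(3/4, 1/4)  # 1 - (0.5 / 0.75) = 1/3
print(pv < tau)          # the prediction is not valid, so d1 is secure
```

Since p(v) ≈ 0.33 falls below τ = 0.8, an adversary cannot place enough confidence in the predicted weight, and d1 counts as secure against Chase.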

CHAPTER 4: PROTECTION WITH MINIMUM DATA LOSS

In this chapter, we present two algorithms for protecting confidential data with minimum data loss. The first algorithm is a bottom-up approach that examines the Chase closure [Im et al., 2005a] of attribute values (or data) involved in hidden value reconstruction, and uses the result to detect the largest set of attribute values that can remain unchanged. The second algorithm is a top-down approach that identifies the AV's whose removal eliminates the largest number of Chase-applicable rules. To describe the algorithms, we first define the following sets:

1. α(x), the set of attribute values used to describe x in S_d,
2. α(t), the set of attribute values used in t, where t is their conjunction,
3. R(x) = {(t → d) : α(t) ⊆ α(x)} ⊆ KB, the set of rules in KB whose condition-part attribute values are contained in α(x),
4. β(x) = {d : [t → d] ∈ R(x)} − {d_c}, where d_c is the confidential value.

4.1 Algorithm One: Bottom-Up Approach

The first algorithm is a bottom-up approach that examines the Chase closure of attribute values involved in hidden value reconstruction, and uses the result to detect the largest set of attribute values that can remain unchanged [Im et al., 2005b]. In our example (see Tables 3.1, 3.2, and 3.3), R(x1) = {r1, r2, r3, r5, r6, r9, r10, r11, r12} and α(x1) = {a1, b1, c1, e1, f1, g1}. Clearly, by the procedures described in Section 2.2, d1 replaces the hidden slots through rules from {r8, r9, r10}. In addition, other

rules from R(x1) also predict the attribute values listed in {t8, t9, t10}. These interconnections often build up a complex chain of predictions, and the task of blocking such prediction chains while identifying a minimal set of concealed values is not straightforward. Let us consider the following example. Suppose we have {r1 = [a1 * b1 → d1], r2 = [b1 * c1 → d1], r3 = [b1 * e1 → d1]}, all inferring d1. In this case, b1 is covered by all 3 rules, and eliminating it ensures the protection. However, if there were 3 other rules {h1 → b1, i1 → h1, k1 → i1}, the additional values {h1, i1, k1} would also have to be hidden, and b1 would not have been the best choice. In general, when a large number of attributes and rules exist, such an overlap-based approach often produces a large and complex graph as we trace all connections. Another issue is that the order in which we eliminate attribute values can have a high impact on the final result. For example, hiding values in the order c1, f1, a1 versus c1, a1, f1 may produce different results, because the attribute value set {c1, f1} removes the inference to d1, while {c1, a1} does not, and we would have to hide b1 as well.

To find the minimum amount of values used to predict the confidential values, a bottom-up approach has been adopted. We check the values that can remain unchanged, starting from singleton sets each containing one attribute value and using the Chase closure [Im et al., 2005a], and we increase the initial set size as much as possible. Chase closure is similar to transitive closure [Abiteboul et al., 1995], except that a predicted value is not added to the closure if its weight is less than λ. For example, object x7 supports two rules {r5, r6} ⊆ KB that predict {a1, a2}. In this case, a1 is not included in the closure if λ_{S_d} = 0.45, because Conf(a1, x7, KB) < 0.45. Using the Chase closure identifies must-be-hidden values without checking every possible superset of α(t).
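The Chase closure just described can be sketched as a fixed-point loop over the applicable rules. The rule list below is a plain-Python rendering of the fully recoverable rules of Table 3.3; confidence checks against λ are omitted for brevity, so this is a simplified sketch rather than the exact procedure.

```python
# (condition set, decision) pairs drawn from Table 3.3.
rules = [
    ({"a1", "c1"}, "d1"),   # r1
    ({"b1", "c1"}, "d1"),   # r2
    ({"f1"}, "d1"),         # r3
    ({"b1", "c1"}, "a1"),   # r5
    ({"c1"}, "b1"),         # r7
    ({"a1", "g1"}, "c1"),   # r9
    ({"a1", "c1"}, "e1"),   # r10
    ({"e1", "g1"}, "c1"),   # r11
]

def chase_closure(values, rules):
    """Repeatedly add the decision of every rule whose whole condition
    is already contained in the closure, until nothing changes."""
    closure = set(values)
    changed = True
    while changed:
        changed = False
        for cond, decision in rules:
            if cond <= closure and decision not in closure:
                closure.add(decision)
                changed = True
    return closure

print(chase_closure({"c1"}, rules))  # contains d1, so {c1} gets marked
print(chase_closure({"f1"}, rules))  # contains d1 as well: marked
print(chase_closure({"a1"}, rules))  # stays {'a1'}: unmarked
```

The closure of {c1} grows to {a1, b1, c1, d1, e1} through the chain c1 → b1, b1*c1 → a1, a1*c1 → d1, matching the singleton computation shown below for x1.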
The justification of this is quite simple.

Chase closure is monotone: if a set s is contained in a set s', then the closure of s is contained in the closure of s'. Clearly, then, if the closure of a set of attribute values predicts d1, that set must be hidden regardless of the presence or absence of other attribute values, and none of its supersets needs to be checked.

To outline the procedure, we start with the set α(x) for object x1, whose construction is supported by 9 rules from KB, and check the Chase closure of each singleton subset δ(x) of that set. If the Chase closure of δ(x) contains the classified attribute value d1, then δ(x) cannot remain: it is marked and is not considered in later steps. Otherwise, the set remains unmarked. In the second iteration of the algorithm, all two-element subsets of α(x) built only from unmarked sets are considered. If the Chase closure of such a set does not contain d1, the set remains unmarked and is used in the later steps of the algorithm; otherwise, the set gets marked. If either all sets in the current iteration step are marked or we have reached the set α(x), the algorithm stops. In this way we only check a fraction of the subsets of α(x), which is typically much smaller than the full power set.

The following singleton sets are considered for x1 (a superscript + denotes the closure of the given set):

{a1}+ = {a1} : unmarked
{b1}+ = {b1} : unmarked
{c1}+ = {a1, b1, c1, e1, d1} ⊇ {d1} : marked *
{e1}+ = {e1} : unmarked
{f1}+ = {d1, f1} ⊇ {d1} : marked *
{g1}+ = {g1} : unmarked

Clearly, c1 and f1 have to be hidden. The next step is to build sets of length 2 and determine which of them can remain. We take the union of two sets only if they are

both unmarked and one of them is a singleton set:

{a1, b1}+ = {a1, b1} : unmarked
{a1, e1}+ = {a1, e1} : unmarked
{a1, g1}+ = {a1, b1, c1, d1, e1, g1} ⊇ {d1} : marked *
{b1, e1}+ = {b1, e1} : unmarked
{b1, g1}+ = {b1, g1} : unmarked
{e1, g1}+ = {a1, b1, c1, d1, e1, g1} ⊇ {d1} : marked *

Now we build 3-element sets from the previous sets that have not been marked:

{a1, b1, e1}+ = {a1, b1, e1} : unmarked

{b1, e1, g1}+ is not considered, as it is a superset of {e1, g1}, which was marked. We are left with {a1, b1, e1} as the unmarked set that contains the maximum number of elements and whose Chase closure does not contain d1. If multiple unmarked sets are identified at the last iteration, we can randomly pick one of them. In a similar way, we compute the maximal sets for any object x_i. A precise description of the algorithm is given in Figure 4.1.

4.2 Algorithm Two: Top-Down Approach

Another approach is to identify the most promising attribute values, i.e., those whose removal eliminates the largest number of supported rules, and use the result to determine the order in which AV's are hidden from an object. The algorithm is the following. For each attribute value v_i ∈ α(t) ∪ β(x), we use the notation v_i^c to denote the frequency with which v_i appears in the conditional part of rules r ∈ R(x); thus v_i^c represents the number of overlaps for v_i between rules. We denote by v_i^d the frequency with which v_i appears in

AlgorithmOne(S_d, KB)
begin
  i := 1;
  while i ≤ l do
  begin
    for all v ∈ α(x_i) do Mark(v) := F;
    for all v ∈ α(x_i) do
    begin
      α1(x_i, v) := {v};
      β(x_i, v) := Chase(S_d, KB, α1(x_i, v));
      if d_c ∈ β(x_i, v) then Mark(v) := T;
    end
    j := 2;
    while j ≤ k_i − 1 do
    begin
      for each w ⊆ α(x_i) such that card(w) = j
          and all subsets of w are unmarked do
      begin
        α1(x_i, w) := w;
        β(x_i, w) := Chase(S_d, KB, α1(x_i, w));
        if d_c ∈ β(x_i, w) then Mark(w) := T;
      end
      j := j + 1;
    end
    x_i := max_card(α1);
    i := i + 1;
  end
end

Chase(S, KB, x_i)
begin
  α(x_i, w) := {v ∈ x_i};
  repeat
    prev := card(α(x_i, w));
    R(x_i) := {r ∈ KB : (∃ t ⊆ α(x_i, w))[r = t → d]};
    α(x_i, w) := α(x_i, w) ∪ {d : ([t → d] ∈ R(x_i)),
                 {d} ∩ α(x_i, w) = ∅, conf(d) > λ_S};
  until card(α(x_i, w)) = prev;
  return α(x_i, w);
end

Figure 4.1: Minimum data hiding algorithm: bottom-up approach
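The marking scheme of AlgorithmOne can be condensed into an executable sketch. The helper names are hypothetical, the rule list is limited to the fully recoverable rules of Table 3.3, and confidence thresholds are omitted; under those assumptions it reproduces the hand computation above, returning the largest subset of α(x1) whose closure avoids d1.

```python
from itertools import combinations

# (condition set, decision) pairs drawn from Table 3.3.
rules = [
    ({"a1", "c1"}, "d1"), ({"b1", "c1"}, "d1"), ({"f1"}, "d1"),
    ({"b1", "c1"}, "a1"), ({"c1"}, "b1"), ({"a1", "g1"}, "c1"),
    ({"a1", "c1"}, "e1"), ({"e1", "g1"}, "c1"),
]

def chase_closure(values, rules):
    """Fixed-point closure: add decisions of rules whose condition holds."""
    closure, changed = set(values), True
    while changed:
        changed = False
        for cond, dec in rules:
            if cond <= closure and dec not in closure:
                closure.add(dec)
                changed = True
    return closure

def largest_safe_subset(alpha, rules, d_c):
    """Bottom-up marking: grow unmarked subsets level by level and return
    the largest set whose Chase closure does not contain d_c."""
    marked = set()
    best = frozenset()
    for size in range(1, len(alpha) + 1):
        found = False
        for w in map(frozenset, combinations(sorted(alpha), size)):
            # supersets of marked sets are skipped: by monotonicity their
            # closures contain d_c as well
            if any(m <= w for m in marked):
                continue
            if d_c in chase_closure(w, rules):
                marked.add(w)
            else:
                best, found = w, True
        if not found:
            break
    return set(best)

alpha_x1 = {"a1", "b1", "c1", "e1", "f1", "g1"}
print(largest_safe_subset(alpha_x1, rules, "d1"))  # a1, b1, e1 may remain
```

The result agrees with the worked example: {a1, b1, e1} can stay visible, while c1, f1, and g1 (whose combinations close to d1) must be hidden.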

the decision part of rules r ∈ R(x). The weight ω(v_i) of each attribute value is then

    ω(v_i) = v_i^c − v_i^d

The logic behind the weight function is to select an AV that predicts many other AV's but is itself predicted few times. We start by hiding the v_i that has the maximum weight and is contained in the rules that directly predict the confidential AV. Then we check the Chase closure of the remaining AV set to see whether the confidential value can be reconstructed. If it can, we compute the weights again, excluding the AV's that have already been hidden. We continue hiding the maximum-weight v_i until the confidential AV can no longer be reconstructed with R(x). When two or more v_i have equal weight, we randomly select one. A precise description of the algorithm is given in Figure 4.2.

Example 4.2 Figure 4.3 shows an example of additional AV hiding for x7 using this algorithm. x7 is supported by 7 rules: {a1*c1 → d1, b1*c1 → d1, b1*c1 → a1, a1*g1 → c1, a1*c1 → e1, e1*g1 → c1, c1 → b1}. The set {v_i} = α(t) ∪ β(x) is {a1, b1, c1, e1, g1}, and ω(v_i) is initially {2, 1, 4, 0, 2}. Clearly, we first hide c1 from {v_i} because it has the maximum weight and is contained in the conditional part of the rules that directly predict the confidential AV. In the next iteration we hide g1, after which d1 cannot be reconstructed. Because g1 and a1 have equal weight, it is possible that a1 is hidden first; in that case, the total number of hidden AV for x7 becomes 3 instead of 2.

We implemented the proposed methods and conducted experiments to answer the following questions:

1. What percentage of confidential AV are reconstructed by Chase?
2. What percentage of additional AV are required to be hidden to protect the confidential AV?

AlgorithmTwo(S_d, KB)
begin
  i := 1;
  while i ≤ l do
  begin
    for each x_i ∈ X do
    begin
      h(v) := ∅;
      α(x_i, v) := α(t) ∪ β(x);
      loop
        for all v_j ∈ α(x_i, v) do ω(v_j) := v_j^c − v_j^d;
        if β(x) ≠ ∅ then
          h(v) := h(v) ∪ max_ω(v_j), v_j ∈ β(x);
        else
          h(v) := h(v) ∪ max_ω(v_j);
        end if
        α(x_i, v) := α(x_i, v) − h(v);
        α(x_i, v) := Chase(S_d, KB, α(x_i, v));
        exit when d_c ∉ α(x_i, v);
      end loop
    end
    x_i := α(x_i, v);
    i := i + 1;
  end
end

Figure 4.2: Minimum data hiding algorithm: top-down approach

[Figure 4.3 shows the rule set R(x7) and the weights ω(v_i) across two iterations of the top-down approach: c1, the maximum-weight value, is hidden in iteration (1), g1 is hidden in iteration (2), and the rules made inapplicable at each step are crossed out.]

Figure 4.3: Additional attribute value hiding using top down approach
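The weight computation of the top-down approach can be sketched as below, using the seven rules listed for x7 in Example 4.2. Note that the exact counts depend on precisely which rules of Table 3.3 are taken to apply to x7, so the weights printed here are illustrative rather than a verbatim reproduction of Figure 4.3; `weights` is a hypothetical helper name.

```python
from collections import Counter

# The rules x7 supports, as listed in Example 4.2: (condition, decision).
rules_x7 = [
    ({"a1", "c1"}, "d1"), ({"b1", "c1"}, "d1"), ({"b1", "c1"}, "a1"),
    ({"a1", "g1"}, "c1"), ({"a1", "c1"}, "e1"), ({"e1", "g1"}, "c1"),
    ({"c1"}, "b1"),
]

def weights(rules, candidates):
    """omega(v) = v^c - v^d: occurrences of v in rule conditions minus
    occurrences of v in rule decisions."""
    cond = Counter(v for c, _ in rules for v in c)
    dec = Counter(d for _, d in rules)
    return {v: cond[v] - dec[v] for v in candidates}

candidates = {"a1", "b1", "c1", "e1", "g1"}
w = weights(rules_x7, candidates)
print(w)
print(max(w, key=w.get))  # c1 has the largest weight and is hidden first
```

With this rule list, c1 receives the maximum weight and is therefore hidden first, matching the first step of Example 4.2.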

3. What are the differences between the two algorithms with regard to the number of additional AV hidden and to performance?

Our experiments were performed on 4 data sets taken from the UCI data repository [S. Hettich and Merz, 1998]; their characteristics are shown in Table 4.1. We assume that each data set is an independent DIS. To keep the DIS environment as simple as possible (without the problems of handling different granularity and different semantics of attributes at different sites, and without either using a global ontology or building inter-ontology bridges between local ontologies), each data set was randomly partitioned into 3 tables with the same attributes and an equal number of objects. One of these tables is called the local site and the two others are called remote sites; all of them represent sites in the DIS. At the local site, we chose one of the attributes as the confidential attribute and replaced all of its values with nulls. For example, the number of objects for the car evaluation data set was 576 at the local site, and the two remote sites had equal numbers of objects, so the total number of objects in the DIS was 1728. The confidential attribute was the acceptance level, which had four different classifications. The data set was complete and had no missing (null) values. We then applied ERID to learn the descriptions of the remaining (unhidden) attributes from the local site, to be used later by local Chase. At each remote site, we also generated rules using ERID to learn descriptions of the confidential attribute and stored them in its KB. All these descriptions, in the form of rules, were fetched into the KB of the local site. A threshold value λ = 0.3 was used for all data sets. We used several different support and confidence values (see Table 4.2) for each data set in order to generate a sufficient number of rules, and assumed that these rules represented all the knowledge in the DIS. We discuss the issue related to the

amount of knowledge in more detail in Chapter 7.

Data Set      | # objects (local, total) | # attributes | missing values | confidential attribute (# classifiers)
Car Eval.     | 576 (1728)   | 7  | No  | Acceptance Level (4)
Breast Cancer | 233 (699)    | 10 | Yes | Diagnosis (2)
Nursery       | 4320 (16200) | 9  | No  | Application Rank (5)
Cong. Voting  | 145 (435)    | 17 | Yes | Party (2)

Table 4.1: Characteristics of the sample data sets

Data Set      | sup  | conf | # rules in KB (local + remote) | # confidential data reconstructions
Car Eval.     | 10%  | 85%  | 11 (4 + 7)  | 71% (410)
Breast Cancer | 15%  | 85%  | 77          | 67% (156)
Nursery       | 7.5% | 85%  | 32 (24 + 8) | 48% (2055)
Cong. Voting  | 27%  | 100% | 159         | 95% (138)

Table 4.2: Number of confidential AV reconstructed by Chase

Table 4.2 shows that a substantial number of confidential AV are reconstructed by Chase. For the car evaluation data set, 71% (410 out of 576) of the confidential AV were reconstructed. Its KB consists of 11 rules: 7 of them came from the remote servers and describe the confidential AV, and the 4 local rules describe the values of the remaining attributes. The experiments on the three other data sets also exhibit a significant number of confidential AV reconstructions. To examine the number of additional AV hiding that is required to protect

Figure 4.4: Sample interface for SCIKD

confidential AV from disclosure, we ran our proposed algorithms; the results are shown in Table 4.3. For congressional voting, 723 AV in the client table were additionally hidden with the bottom-up approach, and 7.39% (739) of AV were additionally hidden with the top-down approach. We then ran the Chase algorithm on the new tables (after hiding the 723 and 739 AV, respectively) and no confidential AV were reconstructed correctly in either case. Clearly, the bottom-up approach identifies the minimum number of values to hide; there was a 2.3% difference in the number of hidden AV. The difference was primarily due to ties in the weight ω: random picks among values of equal weight do not always lead to the optimal decision. The percentage of additional AV hidden does not have a linear relationship with the number of rules. This is due to the overlaps between rules, which means that hiding one AV can make a

number of rules inapplicable.

              | Additional data hiding       | Computation time (in sec.)
Data Set      | Bottom up    | Top down      | Bottom up | Top down
Car Eval.     | 17.9% (727)  | 18.0% (723)   |           |
Breast Cancer | 34.4% (802)  | 34.5% (804)   |           |
Nursery       | 4.8% (1901)  | 4.8% (1901)   |           |
Cong. Voting  | 13.5% (333)  | 15.8% (391)   |           |

Table 4.3: Additional data hiding

With regard to performance, the two proposed algorithms were written in PL/SQL and executed in an Oracle Database 10g instance (184 MB SGA) running on Windows XP with a 1.4 GHz Pentium M and 512 MB of memory. The performance of the bottom-up approach was better in certain cases because no weight computation (ω) is performed. However, when many rules had long conditional parts, the top-down approach showed better performance: in the bottom-up approach, singleton attribute values were then not eliminated in the first iteration, and therefore a larger number of supersets had to be tested.

CHAPTER 5: HIERARCHICAL DATA MASKING

In the previous chapter, we protected confidential data by replacing a set of data with null values [Im and Raś, 2005][Im et al., 2005b]. Another approach, which we discuss in this chapter, is to mask the exact value by substituting it with a more general value from a higher level of a hierarchical attribute structure (see Figure 5.1). Representing an information system with hierarchical attribute structures is a way to make a knowledge discovery system more flexible [Raś et al., 2005]. Unlike in a single-level attribute system, data collected at different granularity levels can be stored in the information system together with their semantic relations. For example, when the age of a person is recorded, the value can be '20s' or 'young', as shown in Table 5.1. In this environment, we may report that a person is young, instead of revealing that she is in her 20s, if disclosure of the value 'young' does not compromise the person's privacy. The advantage of this second approach is that users are able to acquire more explicit answers to their queries. Clearly, we need to assume that a hierarchical attribute structure is given for each attribute, and that these structures are part of a common ontology which is large and approximately the same among sites. The sites should come from the same world (e.g. medical information), so that rules generated from different sites are close in terms of their meanings. In addition, each site is forced to accept a new version of the ontology whenever a change is made. The hierarchical structure must also be visible to users. Users have freedom of querying any level of values in the

hierarchy.

X   | Age                | Marriage                      | Education     | Salary
x1  | (20s,1/3)(30s,2/3) | sp present                    | middle school | (30K,1/2)(40K,1/2)
x2  | 30s                | (sp absent,1/2)(divorced,1/2) | bachelor      | 40K
x3  | 40s                | never married                 | bachelor      | 50K
x4  |                    | widowed                       | tertiary      | 80K
x5  | young              | never married                 | high school   | 20K
x6  | 50s                | divorced                      | master        | 80K
x7  | middle aged        | never married                 | high school   | (60K,2/3)(70K,1/3)
x8  | 60s                | married                       | high school   | 50K

Table 5.1: Information system represented with hierarchical attributes, λ = 1/3

[Figure 5.1 shows two attribute hierarchies: age, with top-level values young, middle aged, and old above the leaf values 20s, 30s, 40s, 50s, 60s, and 70s+; and education, with top-level values primary, secondary, and tertiary above the leaf values elementary, middle school, high school, bachelor, master, and PhD.]

Figure 5.1: Attribute hierarchy for age and education
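Masking with the Figure 5.1 trees can be sketched as follows. The exact grouping of leaves under each top-level value is an assumption here (e.g. that 'young' covers 20s and 30s), as is the rule, described in Section 5.1, that the weight of a generalized value is the sum of its children's weights.

```python
# Hypothetical parent map for the age hierarchy of Figure 5.1;
# the leaf-to-parent grouping is assumed, not stated in the figure.
AGE_PARENT = {
    "20s": "young", "30s": "young",
    "40s": "middle aged", "50s": "middle aged",
    "60s": "old", "70s+": "old",
}

def generalize(weighted_value, parent):
    """Replace each (value, weight) pair by its parent node, summing the
    weights of siblings that map to the same parent."""
    out = {}
    for value, weight in weighted_value:
        p = parent[value]
        out[p] = out.get(p, 0.0) + weight
    return sorted(out.items())

# x1's age in Table 5.1 is (20s, 1/3)(30s, 2/3); masked one level up,
# it becomes simply 'young' with the combined weight of its children.
print(generalize([("20s", 1/3), ("30s", 2/3)], AGE_PARENT))
```

The masked value 'young' still answers coarse queries correctly while no longer revealing which decade the person's age falls in.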

5.1 Rule Extraction and Hierarchical Attributes

Before we discuss Chase and data security, we need to examine how rules are generated from an information system S represented with hierarchical attribute structures. Such a rule extraction algorithm is essential: in many real-world knowledge discovery systems, data are drawn from various sources by different methods, and the assumption of uniform granularity over all attribute values cannot be expected to hold. Users may also be interested in rules at higher levels of abstraction. For example, instead of searching for a flight departing at 8:30AM, a user may want an answer about flights departing in the morning. One solution to these problems is to organize a multilevel attribute structure that represents the attribute values of a domain, so that values can be stored and accessed at several levels of granularity. Clearly, the attribute hierarchy can be part of the domain ontology [Raś and Dardzińska, 2004a], defined by a domain expert before data from the various sources are collected and a knowledge discovery method is applied. The knowledge discovery system should allow users to extract rules from any level in the hierarchy. To achieve this, a value transformation scheme must be established for transforming values from one level to another. In this chapter, we extend the notion of ERID to discover rules from an information system represented with hierarchical attributes. Assume that an information system S = (X, A, V) is a partially incomplete information system of type λ, and that a set of tree-like attribute hierarchies H_S is assigned to S, where h_a ∈ H_S represents all possible values of an attribute a ∈ A. If we denote a node in h_a by a_i, the set {a_ik : 1 ≤ k ≤ m} contains all the children of a_i, as shown

in Figure 5.2.

[Figure 5.2 shows a generic attribute tree: the attribute a at the root; values a1, a2, a3 at the next level; children a21, a22, ..., a2m2 and a31, a32, ..., a3m3 below them; and a311, a312, a313 at the lowest level. Nodes at the same depth form one level of attribute values.]

Figure 5.2: Possible values of an attribute by level

Many different combinations of attribute levels can be chosen for rule extraction. To extract rules at particular levels of interest in S, we need to transform attribute values before the rule extraction algorithm ERID is executed. In the following, we use the term generalization of a(x) for a transformation of a(x) to a node value on the path from a(x) to the root of the hierarchy, and specification for a transformation of a(x) to a node value on the path toward a leaf node. As defined, each attribute value in an incomplete information system is a value/weight pair (a(x), p). When attribute values are transformed, the new value and weight are interpreted as follows:

1. If a(x) is specialized, it is replaced by a null value. This means that a parent node is treated as a null value for any of its child nodes.
2. If a(x) is generalized, it is replaced by the node a_i ∈ h_a at the given level on the path. The weight of the new value is the sum of the weights of its children nodes; intermediate nodes along the path, if any, are computed in the same way. That is,

    p_{a(x)} = Σ p_{a(x)_ik},  where 1 ≤ k ≤ m and p_{a(x)_ik} ≥ λ.

Clearly, the root node of each tree is the attribute name itself, and it is equivalent to a null value; a null value assigned to an object is interpreted as all possible values of the attribute, with equal confidence assigned to each of them. Now, let L_H be the set of attribute levels to be used, λ1 the support, and λ2 the confidence value. ERID for hierarchical attributes is then written ERID-H(S, H_S, L_H, λ1, λ2). The user interface of the implementation is shown in Figure 5.3.

Figure 5.3: User interface for ERID-H

5.2 Chase Applicability

Chase applicability is also different from that in single-level attribute systems. Suppose that a knowledge base KB for S contains a set of rules. In order for the Chase algorithm to be applicable to S, it has to satisfy the following conditions


More information

Random Multiplication based Data Perturbation for Privacy Preserving Distributed Data Mining - 1

Random Multiplication based Data Perturbation for Privacy Preserving Distributed Data Mining - 1 Random Multiplication based Data Perturbation for Privacy Preserving Distributed Data Mining - 1 Prof. Ja-Ling Wu Dept. CSIE & GINM National Taiwan University Data and User privacy calls for well designed

More information

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM SPATIAL DATA MINING Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM INTRODUCTION The main difference between data mining in relational DBS and in spatial DBS is that attributes of the neighbors

More information

Differential Privacy and its Application in Aggregation

Differential Privacy and its Application in Aggregation Differential Privacy and its Application in Aggregation Part 1 Differential Privacy presenter: Le Chen Nanyang Technological University lechen0213@gmail.com October 5, 2013 Introduction Outline Introduction

More information

Inductive Learning. Inductive hypothesis h Hypothesis space H size H. Example set X. h: hypothesis that. size m Training set D

Inductive Learning. Inductive hypothesis h Hypothesis space H size H. Example set X. h: hypothesis that. size m Training set D Inductive Learning size m Training set D Inductive hypothesis h - - + - + + - - - + + - + - + + - - + + - p(x): probability that example x is picked + from X Example set X + - L Hypothesis space H size

More information

CHAPTER-17. Decision Tree Induction

CHAPTER-17. Decision Tree Induction CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 3: Query Processing Query Processing Decomposition Localization Optimization CS 347 Notes 3 2 Decomposition Same as in centralized system

More information

An Unconditionally Secure Protocol for Multi-Party Set Intersection

An Unconditionally Secure Protocol for Multi-Party Set Intersection An Unconditionally Secure Protocol for Multi-Party Set Intersection Ronghua Li 1,2 and Chuankun Wu 1 1 State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences,

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Jun Zhang Department of Computer Science University of Kentucky

Jun Zhang Department of Computer Science University of Kentucky Application i of Wavelets in Privacy-preserving Data Mining Jun Zhang Department of Computer Science University of Kentucky Outline Privacy-preserving in Collaborative Data Analysis Advantages of Wavelets

More information

Quantifying Privacy for Privacy Preserving Data Mining

Quantifying Privacy for Privacy Preserving Data Mining Quantifying Privacy for Privacy Preserving Data Mining Justin Zhan Carnegie Mellon University justinzh@rew.cmu.edu Abstract Data privacy is an important issue in data mining. How to protect respondents

More information

REDUCTS AND ROUGH SET ANALYSIS

REDUCTS AND ROUGH SET ANALYSIS REDUCTS AND ROUGH SET ANALYSIS A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES AND RESEARCH IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE UNIVERSITY

More information

Lecture 9 and 10: Malicious Security - GMW Compiler and Cut and Choose, OT Extension

Lecture 9 and 10: Malicious Security - GMW Compiler and Cut and Choose, OT Extension CS 294 Secure Computation February 16 and 18, 2016 Lecture 9 and 10: Malicious Security - GMW Compiler and Cut and Choose, OT Extension Instructor: Sanjam Garg Scribe: Alex Irpan 1 Overview Garbled circuits

More information

A PRIMER ON ROUGH SETS:

A PRIMER ON ROUGH SETS: A PRIMER ON ROUGH SETS: A NEW APPROACH TO DRAWING CONCLUSIONS FROM DATA Zdzisław Pawlak ABSTRACT Rough set theory is a new mathematical approach to vague and uncertain data analysis. This Article explains

More information

Lecture 9 - Symmetric Encryption

Lecture 9 - Symmetric Encryption 0368.4162: Introduction to Cryptography Ran Canetti Lecture 9 - Symmetric Encryption 29 December 2008 Fall 2008 Scribes: R. Levi, M. Rosen 1 Introduction Encryption, or guaranteeing secrecy of information,

More information

Differential Privacy

Differential Privacy CS 380S Differential Privacy Vitaly Shmatikov most slides from Adam Smith (Penn State) slide 1 Reading Assignment Dwork. Differential Privacy (invited talk at ICALP 2006). slide 2 Basic Setting DB= x 1

More information

Lecture Notes 17. Randomness: The verifier can toss coins and is allowed to err with some (small) probability if it is unlucky in its coin tosses.

Lecture Notes 17. Randomness: The verifier can toss coins and is allowed to err with some (small) probability if it is unlucky in its coin tosses. CS 221: Computational Complexity Prof. Salil Vadhan Lecture Notes 17 March 31, 2010 Scribe: Jonathan Ullman 1 Interactive Proofs ecall the definition of NP: L NP there exists a polynomial-time V and polynomial

More information

Privacy-Preserving Data Mining

Privacy-Preserving Data Mining CS 380S Privacy-Preserving Data Mining Vitaly Shmatikov slide 1 Reading Assignment Evfimievski, Gehrke, Srikant. Limiting Privacy Breaches in Privacy-Preserving Data Mining (PODS 2003). Blum, Dwork, McSherry,

More information

: Cryptography and Game Theory Ran Canetti and Alon Rosen. Lecture 8

: Cryptography and Game Theory Ran Canetti and Alon Rosen. Lecture 8 0368.4170: Cryptography and Game Theory Ran Canetti and Alon Rosen Lecture 8 December 9, 2009 Scribe: Naama Ben-Aroya Last Week 2 player zero-sum games (min-max) Mixed NE (existence, complexity) ɛ-ne Correlated

More information

R E A D : E S S E N T I A L S C R U M : A P R A C T I C A L G U I D E T O T H E M O S T P O P U L A R A G I L E P R O C E S S. C H.

R E A D : E S S E N T I A L S C R U M : A P R A C T I C A L G U I D E T O T H E M O S T P O P U L A R A G I L E P R O C E S S. C H. R E A D : E S S E N T I A L S C R U M : A P R A C T I C A L G U I D E T O T H E M O S T P O P U L A R A G I L E P R O C E S S. C H. 5 S O F T W A R E E N G I N E E R I N G B Y S O M M E R V I L L E S E

More information

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel Lecture 2 Judging the Performance of Classifiers Nitin R. Patel 1 In this note we will examine the question of how to udge the usefulness of a classifier and how to compare different classifiers. Not only

More information

Anonymous Credential Schemes with Encrypted Attributes

Anonymous Credential Schemes with Encrypted Attributes Anonymous Credential Schemes with Encrypted Attributes Bart Mennink (K.U.Leuven) joint work with Jorge Guajardo (Philips Research) Berry Schoenmakers (TU Eindhoven) Conference on Cryptology And Network

More information

Quantization of Rough Set Based Attribute Reduction

Quantization of Rough Set Based Attribute Reduction A Journal of Software Engineering and Applications, 0, 5, 7 doi:46/sea05b0 Published Online Decemer 0 (http://wwwscirporg/ournal/sea) Quantization of Rough Set Based Reduction Bing Li *, Peng Tang, Tommy

More information

Combining Memory and Landmarks with Predictive State Representations

Combining Memory and Landmarks with Predictive State Representations Combining Memory and Landmarks with Predictive State Representations Michael R. James and Britton Wolfe and Satinder Singh Computer Science and Engineering University of Michigan {mrjames, bdwolfe, baveja}@umich.edu

More information

Introduction to Cryptography Lecture 13

Introduction to Cryptography Lecture 13 Introduction to Cryptography Lecture 13 Benny Pinkas June 5, 2011 Introduction to Cryptography, Benny Pinkas page 1 Electronic cash June 5, 2011 Introduction to Cryptography, Benny Pinkas page 2 Simple

More information

Should Non-Sensitive Attributes be Masked? Data Quality Implications of Data Perturbation in Regression Analysis

Should Non-Sensitive Attributes be Masked? Data Quality Implications of Data Perturbation in Regression Analysis Should Non-Sensitive Attributes be Masked? Data Quality Implications of Data Perturbation in Regression Analysis Sumitra Mukherjee Nova Southeastern University sumitra@scis.nova.edu Abstract Ensuring the

More information

Patrol: Revealing Zero-day Attack Paths through Network-wide System Object Dependencies

Patrol: Revealing Zero-day Attack Paths through Network-wide System Object Dependencies Patrol: Revealing Zero-day Attack Paths through Network-wide System Object Dependencies Jun Dai, Xiaoyan Sun, and Peng Liu College of Information Sciences and Technology Pennsylvania State University,

More information

Lecture 18 - Secret Sharing, Visual Cryptography, Distributed Signatures

Lecture 18 - Secret Sharing, Visual Cryptography, Distributed Signatures Lecture 18 - Secret Sharing, Visual Cryptography, Distributed Signatures Boaz Barak November 27, 2007 Quick review of homework 7 Existence of a CPA-secure public key encryption scheme such that oracle

More information

[Title removed for anonymity]

[Title removed for anonymity] [Title removed for anonymity] Graham Cormode graham@research.att.com Magda Procopiuc(AT&T) Divesh Srivastava(AT&T) Thanh Tran (UMass Amherst) 1 Introduction Privacy is a common theme in public discourse

More information

Notes on BAN Logic CSG 399. March 7, 2006

Notes on BAN Logic CSG 399. March 7, 2006 Notes on BAN Logic CSG 399 March 7, 2006 The wide-mouthed frog protocol, in a slightly different form, with only the first two messages, and time stamps: A S : A, {T a, B, K ab } Kas S B : {T s, A, K ab

More information

Privacy and Fault-Tolerance in Distributed Optimization. Nitin Vaidya University of Illinois at Urbana-Champaign

Privacy and Fault-Tolerance in Distributed Optimization. Nitin Vaidya University of Illinois at Urbana-Champaign Privacy and Fault-Tolerance in Distributed Optimization Nitin Vaidya University of Illinois at Urbana-Champaign Acknowledgements Shripad Gade Lili Su argmin x2x SX i=1 i f i (x) Applications g f i (x)

More information

From Secure MPC to Efficient Zero-Knowledge

From Secure MPC to Efficient Zero-Knowledge From Secure MPC to Efficient Zero-Knowledge David Wu March, 2017 The Complexity Class NP NP the class of problems that are efficiently verifiable a language L is in NP if there exists a polynomial-time

More information

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05 Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Sample vs population Example task with red and black cards Statistical terminology Permutation test and hypergeometric test Histogram on a sample vs population

More information

Knowledge representation DATA INFORMATION KNOWLEDGE WISDOM. Figure Relation ship between data, information knowledge and wisdom.

Knowledge representation DATA INFORMATION KNOWLEDGE WISDOM. Figure Relation ship between data, information knowledge and wisdom. Knowledge representation Introduction Knowledge is the progression that starts with data which s limited utility. Data when processed become information, information when interpreted or evaluated becomes

More information

Lecture 14: Secure Multiparty Computation

Lecture 14: Secure Multiparty Computation 600.641 Special Topics in Theoretical Cryptography 3/20/2007 Lecture 14: Secure Multiparty Computation Instructor: Susan Hohenberger Scribe: Adam McKibben 1 Overview Suppose a group of people want to determine

More information

Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification

Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification Syracuse University SURFACE Electrical Engineering and Computer Science LC Smith College of Engineering and Computer Science 1-1-2004 Privacy-Preserving Multivariate Statistical Analysis: Linear Regression

More information

CPSC 467: Cryptography and Computer Security

CPSC 467: Cryptography and Computer Security CPSC 467: Cryptography and Computer Security Michael J. Fischer Lecture 22 November 27, 2017 CPSC 467, Lecture 22 1/43 BBS Pseudorandom Sequence Generator Secret Splitting Shamir s Secret Splitting Scheme

More information

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

Mining Molecular Fragments: Finding Relevant Substructures of Molecules Mining Molecular Fragments: Finding Relevant Substructures of Molecules Christian Borgelt, Michael R. Berthold Proc. IEEE International Conference on Data Mining, 2002. ICDM 2002. Lecturers: Carlo Cagli

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

PRIVACY PRESERVING INFORMATION SHARING

PRIVACY PRESERVING INFORMATION SHARING PRIVACY PRESERVING INFORMATION SHARING A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

More information

Data-Driven Logical Reasoning

Data-Driven Logical Reasoning Data-Driven Logical Reasoning Claudia d Amato Volha Bryl, Luciano Serafini November 11, 2012 8 th International Workshop on Uncertainty Reasoning for the Semantic Web 11 th ISWC, Boston (MA), USA. Heterogeneous

More information

Solutions to the Mathematics Masters Examination

Solutions to the Mathematics Masters Examination Solutions to the Mathematics Masters Examination OPTION 4 Spring 2007 COMPUTER SCIENCE 2 5 PM NOTE: Any student whose answers require clarification may be required to submit to an oral examination. Each

More information

Pseudonym and Anonymous Credential Systems. Kyle Soska 4/13/2016

Pseudonym and Anonymous Credential Systems. Kyle Soska 4/13/2016 Pseudonym and Anonymous Credential Systems Kyle Soska 4/13/2016 Moving Past Encryption Encryption Does: Hide the contents of messages that are being communicated Provide tools for authenticating messages

More information

Information Flow on Directed Acyclic Graphs

Information Flow on Directed Acyclic Graphs Information Flow on Directed Acyclic Graphs Michael Donders, Sara Miner More, and Pavel Naumov Department of Mathematics and Computer Science McDaniel College, Westminster, Maryland 21157, USA {msd002,smore,pnaumov}@mcdaniel.edu

More information

Static Program Analysis

Static Program Analysis Static Program Analysis Xiangyu Zhang The slides are compiled from Alex Aiken s Michael D. Ernst s Sorin Lerner s A Scary Outline Type-based analysis Data-flow analysis Abstract interpretation Theorem

More information

Revisiting Cryptographic Accumulators, Additional Properties and Relations to other Primitives

Revisiting Cryptographic Accumulators, Additional Properties and Relations to other Primitives S C I E N C E P A S S I O N T E C H N O L O G Y Revisiting Cryptographic Accumulators, Additional Properties and Relations to other Primitives David Derler, Christian Hanser, and Daniel Slamanig, IAIK,

More information

Interpreting Low and High Order Rules: A Granular Computing Approach

Interpreting Low and High Order Rules: A Granular Computing Approach Interpreting Low and High Order Rules: A Granular Computing Approach Yiyu Yao, Bing Zhou and Yaohua Chen Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 E-mail:

More information

A Privacy Preserving Markov Model for Sequence Classification

A Privacy Preserving Markov Model for Sequence Classification A Privacy Preserving Markov Model for Sequence Classification Suxin Guo Department of Computer Science and Engineering SUNY at Buffalo Buffalo 14260 U.S.A. suxinguo@buffalo.edu Sheng Zhong State Key Laboratory

More information

SOBER Cryptanalysis. Daniel Bleichenbacher and Sarvar Patel Bell Laboratories Lucent Technologies

SOBER Cryptanalysis. Daniel Bleichenbacher and Sarvar Patel Bell Laboratories Lucent Technologies SOBER Cryptanalysis Daniel Bleichenbacher and Sarvar Patel {bleichen,sarvar}@lucent.com Bell Laboratories Lucent Technologies Abstract. SOBER is a new stream cipher that has recently been developed by

More information

Short Note: Naive Bayes Classifiers and Permanence of Ratios

Short Note: Naive Bayes Classifiers and Permanence of Ratios Short Note: Naive Bayes Classifiers and Permanence of Ratios Julián M. Ortiz (jmo1@ualberta.ca) Department of Civil & Environmental Engineering University of Alberta Abstract The assumption of permanence

More information

Lecture Notes 20: Zero-Knowledge Proofs

Lecture Notes 20: Zero-Knowledge Proofs CS 127/CSCI E-127: Introduction to Cryptography Prof. Salil Vadhan Fall 2013 Lecture Notes 20: Zero-Knowledge Proofs Reading. Katz-Lindell Ÿ14.6.0-14.6.4,14.7 1 Interactive Proofs Motivation: how can parties

More information

ANALYSIS OF PRIVACY-PRESERVING ELEMENT REDUCTION OF A MULTISET

ANALYSIS OF PRIVACY-PRESERVING ELEMENT REDUCTION OF A MULTISET J. Korean Math. Soc. 46 (2009), No. 1, pp. 59 69 ANALYSIS OF PRIVACY-PRESERVING ELEMENT REDUCTION OF A MULTISET Jae Hong Seo, HyoJin Yoon, Seongan Lim, Jung Hee Cheon, and Dowon Hong Abstract. The element

More information

Parts 3-6 are EXAMPLES for cse634

Parts 3-6 are EXAMPLES for cse634 1 Parts 3-6 are EXAMPLES for cse634 FINAL TEST CSE 352 ARTIFICIAL INTELLIGENCE Fall 2008 There are 6 pages in this exam. Please make sure you have all of them INTRODUCTION Philosophical AI Questions Q1.

More information

Multi-Party Privacy-Preserving Decision Trees for Arbitrarily Partitioned Data

Multi-Party Privacy-Preserving Decision Trees for Arbitrarily Partitioned Data INTERNATIONAL JOURNAL OF INTELLIGENT CONTROL AND SYSTEMS VOL. 12, NO. 4, DECEMBER 2007, 351-358 Multi-Party Privacy-Preserving Decision Trees for Arbitrarily Partitioned Data Shuguo HAN, and Wee Keong

More information

Foundations of Classification

Foundations of Classification Foundations of Classification J. T. Yao Y. Y. Yao and Y. Zhao Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 {jtyao, yyao, yanzhao}@cs.uregina.ca Summary. Classification

More information

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

More information

A Logical Formulation of the Granular Data Model

A Logical Formulation of the Granular Data Model 2008 IEEE International Conference on Data Mining Workshops A Logical Formulation of the Granular Data Model Tuan-Fang Fan Department of Computer Science and Information Engineering National Penghu University

More information

Benny Pinkas Bar Ilan University

Benny Pinkas Bar Ilan University Winter School on Bar-Ilan University, Israel 30/1/2011-1/2/2011 Bar-Ilan University Benny Pinkas Bar Ilan University 1 Extending OT [IKNP] Is fully simulatable Depends on a non-standard security assumption

More information

Locally Differentially Private Protocols for Frequency Estimation. Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha

Locally Differentially Private Protocols for Frequency Estimation. Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha Locally Differentially Private Protocols for Frequency Estimation Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha Differential Privacy Differential Privacy Classical setting Differential Privacy

More information

Improving the Reliability of Causal Discovery from Small Data Sets using the Argumentation Framework

Improving the Reliability of Causal Discovery from Small Data Sets using the Argumentation Framework Computer Science Technical Reports Computer Science -27 Improving the Reliability of Causal Discovery from Small Data Sets using the Argumentation Framework Facundo Bromberg Iowa State University Dimitris

More information

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

More information

Answering Many Queries with Differential Privacy

Answering Many Queries with Differential Privacy 6.889 New Developments in Cryptography May 6, 2011 Answering Many Queries with Differential Privacy Instructors: Shafi Goldwasser, Yael Kalai, Leo Reyzin, Boaz Barak, and Salil Vadhan Lecturer: Jonathan

More information

Banacha Warszawa Poland s:

Banacha Warszawa Poland  s: Chapter 12 Rough Sets and Rough Logic: A KDD Perspective Zdzis law Pawlak 1, Lech Polkowski 2, and Andrzej Skowron 3 1 Institute of Theoretical and Applied Informatics Polish Academy of Sciences Ba ltycka

More information

Verification of the TLS Handshake protocol

Verification of the TLS Handshake protocol Verification of the TLS Handshake protocol Carst Tankink (0569954), Pim Vullers (0575766) 20th May 2008 1 Introduction In this text, we will analyse the Transport Layer Security (TLS) handshake protocol.

More information

Guest Speaker. CS 416 Artificial Intelligence. First-order logic. Diagnostic Rules. Causal Rules. Causal Rules. Page 1

Guest Speaker. CS 416 Artificial Intelligence. First-order logic. Diagnostic Rules. Causal Rules. Causal Rules. Page 1 Page 1 Guest Speaker CS 416 Artificial Intelligence Lecture 13 First-Order Logic Chapter 8 Topics in Optimal Control, Minimax Control, and Game Theory March 28 th, 2 p.m. OLS 005 Onesimo Hernandez-Lerma

More information

1 Secure two-party computation

1 Secure two-party computation CSCI 5440: Cryptography Lecture 7 The Chinese University of Hong Kong, Spring 2018 26 and 27 February 2018 In the first half of the course we covered the basic cryptographic primitives that enable secure

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

1. Courses are either tough or boring. 2. Not all courses are boring. 3. Therefore there are tough courses. (Cx, Tx, Bx, )

1. Courses are either tough or boring. 2. Not all courses are boring. 3. Therefore there are tough courses. (Cx, Tx, Bx, ) Logic FOL Syntax FOL Rules (Copi) 1. Courses are either tough or boring. 2. Not all courses are boring. 3. Therefore there are tough courses. (Cx, Tx, Bx, ) Dealing with Time Translate into first-order

More information

CTR mode of operation

CTR mode of operation CSA E0 235: Cryptography 13 March, 2015 Dr Arpita Patra CTR mode of operation Divya and Sabareesh 1 Overview In this lecture, we formally prove that the counter mode of operation is secure against chosen-plaintext

More information

Removing trivial associations in association rule discovery

Removing trivial associations in association rule discovery Removing trivial associations in association rule discovery Geoffrey I. Webb and Songmao Zhang School of Computing and Mathematics, Deakin University Geelong, Victoria 3217, Australia Abstract Association

More information

Zero-Knowledge Against Quantum Attacks

Zero-Knowledge Against Quantum Attacks Zero-Knowledge Against Quantum Attacks John Watrous Department of Computer Science University of Calgary January 16, 2006 John Watrous (University of Calgary) Zero-Knowledge Against Quantum Attacks QIP

More information

Exploring Spatial Relationships for Knowledge Discovery in Spatial Data

Exploring Spatial Relationships for Knowledge Discovery in Spatial Data 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Exploring Spatial Relationships for Knowledge Discovery in Spatial Norazwin Buang

More information

CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing

CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing Hector Garcia-Molina Zoltan Gyongyi CS 347 Notes 03 1 Query Processing! Decomposition! Localization! Optimization CS 347

More information

The Road to Improving your GIS Data. An ebook by Geo-Comm, Inc.

The Road to Improving your GIS Data. An ebook by Geo-Comm, Inc. The Road to Improving your GIS Data An ebook by Geo-Comm, Inc. An individual observes another person that appears to be in need of emergency assistance and makes the decision to place a call to 9-1-1.

More information

Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur

Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur March 19, 2018 Modular Exponentiation Public key Cryptography March 19, 2018 Branch Prediction Attacks 2 / 54 Modular Exponentiation

More information