DATA SAFEKEEPING AGAINST KNOWLEDGE DISCOVERY. Seunghyun Im


DATA SAFEKEEPING AGAINST KNOWLEDGE DISCOVERY by Seunghyun Im. A dissertation submitted to the faculty of The University of North Carolina at Charlotte in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Information Technology. Charlotte, 2006. Approved by: Dr. Zbigniew Ras, Dr. Mirsad Hadzikadic, Dr. Agnieszka Dardzinska, Dr. Cem Saydam, Dr. Richard Hartshorne.

© 2006 Seunghyun Im. ALL RIGHTS RESERVED.

ABSTRACT

SEUNGHYUN IM. Data Safekeeping against Knowledge Discovery (Under the direction of DR. ZBIGNIEW RAS)

This dissertation examines an important issue of data mining: how to provide meaningful knowledge without compromising data confidentiality. In other words, information systems should provide knowledge extracted from their data that can be used to identify underlying trends and patterns, but the knowledge should not be used to reveal confidential data. In conventional database systems, data confidentiality is maintained by hiding sensitive data from unauthorized users. However, hiding secret data is not sufficient in knowledge discovery systems (KDS) because of Chase [Raś and Dardzińska, 2005b] [Im and Raś, 2005]. Chase is a widely used data mining technique that predicts missing values in information systems by using knowledge obtained through pattern or rule extraction. For example, if an attribute in an information system is incomplete, we can use Chase to approximate its missing values and make the system more complete. Chase is also used to answer user queries containing non-local attributes [Raś and Joshi, 1997]: if attributes in a query are locally unknown, we search for their definitions in the KDS and use the results to replace the unknown attributes in the query. The problem Chase poses for data confidentiality comes from its ability to reveal hidden data. Sensitive data may be hidden from an information system, for example, by replacing them with null values or encrypting them with cryptographic technologies. However, any user in a KDS who has access to knowledge is able to reconstruct hidden or missing data with Chase. For example, in a standalone

information system with a partially confidential attribute, the Chase algorithm, using knowledge extracted from the non-confidential part, can reconstruct hidden (confidential) values [Im et al., 2005b]. When a system is distributed with autonomous sites, Chase can also reveal sensitive data in local information systems using knowledge extracted from both local and remote information systems [Im et al., 2005b]. Clearly, mechanisms that protect sensitive data from these vulnerabilities have to be implemented in order to build a security-aware KDS, and this dissertation presents algorithms that minimize disclosure of confidential data.

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor Dr. Zbigniew Ras for the opportunity to explore the field of data mining. His support and encouragement throughout my Ph.D. studies were invaluable. Without his insightful comments and guidance, the study of security in data mining and the completion of my Ph.D. dissertation would have been impossible. I also would like to acknowledge Dr. Mirsad Hadzikadic, Dr. Agnieszka Dardzinska, Dr. Cem Saydam and Dr. Richard Hartshorne for their support as my professors and committee members.

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
  1.1 Motivation and Problem Statement
  1.2 Approach
CHAPTER 2: BACKGROUND
  2.1 Security and Data Mining
  2.2 Incomplete Information System
  2.3 Local Chase
  2.4 Distributed Chase
CHAPTER 3: DATA SECURITY AND CHASE
  3.1 Confidential Data Disclosure by Chase
  3.2 Concept of Confidentiality
CHAPTER 4: PROTECTION WITH MINIMUM DATA LOSS
  4.1 Algorithm One: Bottom Up Approach
  4.2 Algorithm Two: Top Down Approach
CHAPTER 5: HIERARCHICAL DATA MASKING
  5.1 Rule Extraction and Hierarchical Attribute Chase
  5.2 Applicability
  5.3 Method Description
CHAPTER 6: PROTECTION WITH MINIMUM KNOWLEDGE LOSS
  6.1 Knowledge Loss Measurement based on Certainty Factor
  6.2 Knowledge Loss Measurement based on Strength Factor
  6.3 Knowledge Loss Measurement based on Coverage Factor
CHAPTER 7: PROACTIVE DATA PROTECTION AGAINST CHASE
  7.1 Proactive Data Protection based on Reducts
CHAPTER 8: CONCLUSION
  8.1 Summary
  8.2 Future Work

CHAPTER 1: INTRODUCTION

1.1 Motivation and Problem Statement

Knowledge discovery in databases (a.k.a. data mining) is recognized as an essential tool for analyzing and creating value from ever increasing amounts of data in a wide variety of domains. Its distinctive capability of extracting hidden knowledge from large volumes of data has resolved many complicated issues. One widely used application of data mining is called Chase [Raś and Dardzińska, 2004b] [Raś and Dardzińska, 2005a]. Chase is a generalized null value imputation algorithm designed to predict unknown values. For example, we can take advantage of its prediction ability to build a medical decision support system. Medical data often contain large amounts of null values due to insufficient information. There are various reasons for missing data. For instance, some medical tests cannot be taken because they are dangerous, would make a patient's condition worse, or are too expensive. In these cases, prediction of the test result can provide significant benefits to doctors, patients, or insurance companies [Korver and Janssens, 1993]. The prediction made by Chase is particularly useful and reliable because approximated values reflect the actual characteristics of the data set of an information system. The prediction capability, however, may create security breaches if an information system contains confidential data that have to be kept secret [Im and Raś, 2005] [Im, 2006]. Typical scenarios that show confidential data disclosure by Chase are the

following (see Table 1.1). Suppose that an attribute in an information system S_1 contains patient diagnosis information: part of this information is not confidential (e.g., consent has been given by the patients) while the rest should be kept secret. In this case, Chase may reveal a set of confidential data in S_1 by using the knowledge extracted at S_1. In other words, self-generated rules from non-confidential data in S_1 can be used to predict confidential data. Another example is hidden data reconstruction in a distributed knowledge discovery system (DKDS). The key concept of DKDS is to generate global knowledge through knowledge sharing. Each site in a DKDS develops knowledge independently, and these are used jointly to produce global knowledge without complex data integration. Knowledge sharing is particularly important in real world environments where data are often collected and stored in information systems residing at many different locations, built independently, instead of being placed at a single location. Assume that two sites S_1 and S_2 in a DKDS share their knowledge in order to obtain global knowledge, and that an attribute of site S_1 is confidential; the exact values of the confidential data in S_1 can be hidden by replacing them with null values. However, users at S_1 may treat them as null or missing data and try to reconstruct them with the knowledge extracted from S_2. A distributed medical information system is a good example in which an attribute is confidential for one information system, but the same attribute is not considered secret at another site (e.g., privacy regulation is less restrictive in many countries outside the U.S.). The vulnerabilities illustrated in these examples show that security-aware data management is an essential component for any KDS to ensure data confidentiality. Hiding confidential data from an information system does not guarantee the

secrecy against Chase. Regardless of the data hiding method, such as data encryption [Coppersmith, 1994] [Daemen and Rijmen, ], access control [Lunt, 1989], or simple null value replacement, hidden values may be reconstructed by Chase.

  KDS Type         | Description
  Single KDS       | Hidden data reconstruction is based on the non-confidential part of the local information system
  Distributed KDS  | Hidden data reconstruction is based on rules from local and remote information systems

  Table 1.1: Vulnerabilities by type of KDS

1.2 Approach

There are two input parameters required by Chase when predicting missing (confidential) values. One is knowledge in terms of inference rules. The other is the non-confidential part of the data. Considering that the main objective of any KDS is to discover underlying knowledge in information systems and provide it to users, protection algorithms should preserve the accuracy of the knowledge as much as possible. Therefore, we should prevent the prediction of confidential data by controlling the second input parameter, that is, the non-confidential part of the data. There is a trade-off between the strength of confidentiality and data (or knowledge) availability in this approach. As we modify more data in an information system, less knowledge is applicable by Chase and disclosure risk decreases in general. However, preserving the original data is also important for the information system to return precise answers to user queries or to generate more accurate knowledge (We

will discuss these problems further in Chapters 4 and 6). In addition, some degree of data or knowledge loss is almost inevitable to block the reconstruction of confidential data. Therefore, the key issue is to limit possible disclosure with the least amount of loss in terms of data or knowledge (see Table 1.2). As we have seen in the previous examples, different vulnerabilities exist depending on the type of knowledge discovery system. Clearly, if a KDS consists of a single information system and a confidential attribute is completely hidden, secret data cannot be reconstructed by Chase. If part of the data is not confidential, we may have to hide additional data because those data can be used by Chase. For a distributed KDS in which a confidential attribute is partially or completely hidden, some of the sensitive data can still be reconstructed by global knowledge. Whether a KDS is local or distributed, predictions often form a chain, meaning rules in KB reconstruct not only confidential data but also non-confidential data that are in turn used to predict some confidential data. This means that additionally hidden data may be reconstructed by another set of rules, and these predictions have to be evaluated again to ensure the confidentiality of sensitive data. There are two types of protection schemes depending on knowledge availability (see Table 1.3): (1) The complete set of knowledge applicable by Chase is known before we hide additional data. In this case, the system stores all available rules in a knowledge base (KB), and users acquire knowledge only from the KB. This is a common mechanism for many knowledge discovery systems, and we assure data confidentiality by hiding or modifying additional attribute values based only on the Chase-applicable knowledge in the KB. (2) The second case is when knowledge is unknown or partially known at the time of applying a protection algorithm. This is closely

related to the support and confidence of rules. If the assumption is that users are able to extract rules with any support and confidence values on the fly, a different strategy (called proactive protection) has to be taken, because hiding additional attribute values against one set of rules may not prevent data reconstruction from another set of rules.

  Method                  | Description
  Minimum Data Loss       | Minimize additional data hiding while preventing reconstruction of confidential data
  Minimum Knowledge Loss  | Preserve interesting knowledge as much as possible when hiding additional data to achieve data security

  Table 1.2: Minimum data and knowledge loss

  Protection Scheme     | Description
  Active Protection     | Applicable knowledge for Chase is known prior to execution of the protection algorithm
  Proactive Protection  | Applicable knowledge for Chase is partially known or unknown

  Table 1.3: Protection scheme by knowledge availability
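The chained predictions described above suggest a simple fixed-point procedure for the active protection scheme: hide a value, re-check which rules still fire, and hide again until no rule can reach a confidential value. The following is only an illustrative sketch under strong simplifications (single-valued rows, rules as premise/decision pairs; all names and structures here are assumptions, not the dissertation's algorithms):

```python
# Illustrative fixed-point hiding loop. Rules are (premise, (attr, value))
# pairs, where a premise is a set of (attr, value) conditions; rows are
# plain dicts. A sketch, not the dissertation's Algorithm One or Two.

def protect(row, confidential_attr, rules):
    """Hide values from `row` until no rule predicts the confidential
    attribute, directly or through a chain of re-predicted values."""
    row = dict(row)
    row.pop(confidential_attr, None)       # hide the secret value itself
    targets = {confidential_attr}          # values that must stay unpredictable
    changed = True
    while changed:
        changed = False
        for premise, (dattr, _dval) in rules:
            fires = dattr in targets and all(row.get(a) == v for a, v in premise)
            if fires:
                # Disable the rule by hiding one premise value; that value
                # joins the target set (this is the prediction chain).
                a, _v = next(iter(premise))
                if a in row:
                    del row[a]
                    targets.add(a)
                    changed = True
    return row
```

Hiding the first premise value is an arbitrary choice here; Chapters 4 and 6 are precisely about choosing which values to hide so that data or knowledge loss is minimized.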

CHAPTER 2: BACKGROUND

This chapter first reviews security in data mining as studied in other disciplines. It then discusses a null value imputation algorithm, Chase, and its use in local and distributed information systems (DIS).

2.1 Security and Data Mining

Security in KDS has been studied in various disciplines such as cryptography, statistics, and data mining. A well known security problem in the cryptography area is how to acquire global knowledge in a distributed system while exchanging data securely. In other words, the objective is to extract global knowledge without disclosing any data stored at each local site. Proposed solutions are based primarily on the idea of a secure multiparty protocol [Yao, 1996] [Du and Atallah, 2001] [Du, 2001], which ensures that no participant can learn more than its own input and the outcome of a public function. Various authors expanded the idea to build secure data mining systems. Clifton and Kantarcioglou applied the concept to association rule mining [Agrawal and Srikant, 1994] for vertically [Clifton et al., 2002] and horizontally [Kantarcioglou and Clifton, 2002] partitioned data. Du et al. [Du and Zhan, 2002] [Du and Zhan, 2003] pursued a similar idea to build a decision tree. They observed performance improvements by sending data to a 3rd party server, which they termed a commodity server. Lindell et al. [Lindell and Pinkas, 2000] presented a privacy preserving data mining scheme for the ID3 algorithm [Quinlan, 1993] using a secure multiparty protocol. They focused on improving the generic secure multiparty protocol for decision trees. All these works have a common drawback: they require expensive encryption and decryption mechanisms. Considering the extremely large amount of data to be processed, performance has to be improved before these algorithms can be applied to real-world data. Another research area at the intersection of data security and data mining is called perturbation. A dataset is perturbed (e.g., by noise addition or data swapping) before its release to the public in an effort to minimize the disclosure risk of confidential data, while maintaining statistical characteristics (e.g., mean and variance) of the original data. Muralidhar and Sarathy [Muralidhar and Sarathy, 2003] [Muralidhar and Sarathy, 1999] provided a theoretical basis for data perturbation in terms of data utilization and disclosure risks, and conducted a survey of existing perturbation methods. In the KDD area, protection of sensitive rules with minimum side effects has been discussed by several researchers. In [Oliveira and Zaiane, 2002], the authors suggested a solution for protecting sensitive association rules in the form of a sanitization process where protection is achieved by hiding selected patterns from the frequent itemsets. There has been another interesting proposal [Saygin et al., 2002] for hiding sensitive association rules. They introduced an interval of minimum support and confidence values to measure the degree of sensitivity of rules. The interval is specified by the user and only the rules within the interval are to be removed. The key contribution of this study, among many others, is to provide data security algorithms for distributed knowledge sharing systems. Previous and related works concentrated only on a single information system, or did not consider the sharing of knowledge.

2.2 Incomplete Information System

Now, we present background on incomplete information systems and the Chase algorithm. One of the assumptions of many rule extraction algorithms is that rules are extracted from an information system where the information about objects is either precisely known or not known at all. This implies that either a single value of an attribute is assigned to an object as its property or no value is assigned. However, it happens quite often that users do not have exact knowledge about objects, which makes it difficult to determine a unique set of values for an object. To overcome this problem, the notion of an incomplete information system [Dardzińska and Raś, 2003] was introduced, which is a generalization of the information system introduced by Pawlak [Pawlak, 1991] [Pawlak et al., 1995]. More formally, by an information system we mean S = (X, A, V), where X is a finite set of objects, A is a finite set of attributes, and V is a set of attribute values. In particular, we say that S is an incomplete information system of type λ [Dardzińska and Raś, 2003] if the following three conditions hold:

1. a_S(x) is defined for any x ∈ X, a ∈ A,
2. (∀x ∈ X)(∀a ∈ A)[(a_S(x) = {(a_i, p_i) : 1 ≤ i ≤ m}) → Σ_{i=1}^{m} p_i = 1],
3. (∀x ∈ X)(∀a ∈ A)[(a_S(x) = {(a_i, p_i) : 1 ≤ i ≤ m}) → (∀i)(p_i ≥ λ)].

Incompleteness is understood as having a set of weighted attribute values as the value of an attribute. The concept of multiple possible values is used for replacing null values in Chase. (Hereafter, we will use the two terms data and attribute value (AV) interchangeably.)

Example 2.1. In Table 2.1, the values of attribute a for object x_1 are a_1 and a_2 with weights 1/3 and 2/3, respectively. Clearly, we are able to assign multiple AVs
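The three conditions can be checked mechanically. A small validator, under the assumption that a_S(x) is stored as a list of (value, weight) pairs (the representation is ours, not the dissertation's):

```python
def is_type_lambda(system, lam):
    """Check the three conditions of an incomplete information system of
    type lambda: a_S(x) defined for every (x, a), weights sum to 1, and
    every weight p_i is at least lambda."""
    for x, attrs in system.items():
        for a, pairs in attrs.items():
            if not pairs:                                    # condition 1
                return False
            if abs(sum(p for _, p in pairs) - 1.0) > 1e-9:   # condition 2
                return False
            if any(p < lam for _, p in pairs):               # condition 3
                return False
    return True

# The value of a(x1) from Example 2.1: weights 1/3 and 2/3.
S = {"x1": {"a": [("a1", 1/3), ("a2", 2/3)], "b": [("b1", 1.0)]}}
```

With these weights, S is of type 0.3 but not of type 0.5, since 1/3 < 0.5.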

as possible values in an incomplete information system. Before we continue this discussion to the Chase algorithm, we first have to decide on the interpretation of the functors "or" and "and", denoted in this paper by + and ·, correspondingly. We will adopt the semantics of terms proposed by Ras and Joshi in [Raś and Joshi, 1997], as their semantics has all the properties required for the query transformation process to be sound and complete (see [Raś and Joshi, 1997]). It was shown that their semantics satisfies the following distributive property: t_1 · (t_2 + t_3) = (t_1 · t_2) + (t_1 · t_3). Let us assume that S = (X, A, V) is an information system of type λ and t is a term in predicate calculus constructed, in a standard way, from values of attributes in V seen as constants and from the two functors + and ·. By N_S(t), we mean the standard interpretation of a term t in S, defined as (see [Raś and Joshi, 1997]):

1. N_S(v) = {(x, p) : (v, p) ∈ a(x)}, for any v ∈ V_a,
2. N_S(t_1 + t_2) = N_S(t_1) ⊕ N_S(t_2),
3. N_S(t_1 · t_2) = N_S(t_1) ⊗ N_S(t_2),

where, for any N_S(t_1) = {(x_i, p_i)}_{i ∈ I} and N_S(t_2) = {(x_j, q_j)}_{j ∈ J}, we have:

1. N_S(t_1) ⊕ N_S(t_2) = {(x_i, p_i)}_{i ∈ I−J} ∪ {(x_j, q_j)}_{j ∈ J−I} ∪ {(x_i, max(p_i, q_i))}_{i ∈ I∩J},
2. N_S(t_1) ⊗ N_S(t_2) = {(x_i, p_i · q_i)}_{i ∈ I∩J}.

2.3 Local Chase

The incomplete value imputation algorithm Chase, based on the above semantics, converts an information system S of type λ to a new, more complete information system Chase(S) of the same type. This algorithm assumes partial incompleteness of data (sets of weighted AVs can be assigned to an object as its value) in system S. The main phase of the algorithm is the following:

1. identify all incomplete AVs in S.
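Treating N_S(t) as a mapping from objects to weights, the two operations are a max-union and a product-intersection. A direct transcription (dict-based; the function names are ours):

```python
def join(n1, n2):
    """N_S(t1) (+) N_S(t2): union of supports, max weight on the overlap."""
    out = dict(n1)
    for x, q in n2.items():
        out[x] = max(out[x], q) if x in out else q
    return out

def product(n1, n2):
    """N_S(t1) (.) N_S(t2): intersection of supports, product of weights."""
    return {x: p * n2[x] for x, p in n1.items() if x in n2}
```

The distributive property t_1 · (t_2 + t_3) = t_1 · t_2 + t_1 · t_3 holds here because p · max(q, r) = max(p · q, p · r) for non-negative weights.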

2. extract rules from S describing these incomplete AVs.
3. replace incomplete AVs in S by the values suggested by the rules.
4. repeat steps 1-3 until a fixed point is reached.

More specifically, suppose that KB = {(t → v_c) ∈ D : c ∈ In(A)} (called a knowledge base) is the set of all rules extracted from S = (X, A, V) by ERID(S, λ_1, λ_2), where In(A) is the set of incomplete attributes in S and λ_1, λ_2 are thresholds for minimum support and minimum confidence, correspondingly. ERID [Raś and Dardzińska, 2005c] is the algorithm for discovering rules from incomplete information systems, and it is used as part of the null value imputation algorithm Chase. Now, let R_S(x_i) ⊆ KB be the set of rules whose entire conditional part matches the AVs of x_i ∈ S, and let d be a null value. Then, there are three cases:

1. R_S(x_i) = ∅. In this case, d cannot be replaced.
2. R_S(x_i) = {r_1 = [t_1 → d_1], r_2 = [t_2 → d_1], ..., r_k = [t_k → d_1]}. In this case, every rule implies a single decision AV, and d' = d_1.
3. R_S(x_i) = {r_1 = [t_1 → d_1], r_2 = [t_2 → d_2], ..., r_k = [t_k → d_k]}. In this case, the rules imply multiple decision values, and the replacement is determined by the confidence of the predicted values.

We define the following formula [Im and Raś, 2005] to compute the confidence of each predicted value d' in case 3. Assume that the support and confidence of a rule r_i are [s_i, c_i], and that the product of the weights of the AVs of x that match t_i is P_a(t_i), for i ≤ k:

conf(d') = Σ{ P_a(t_i) · s_i · c_i : d_i = d' } / Σ{ P_a(t_i) · s_i · c_i },  1 ≤ i ≤ k   (2.1)

Clearly, in case 2, the confidence of d_1 is 1. We replace the null value d with d' when conf(d') is greater than a threshold value λ.
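Equation (2.1) is a weighted vote among the matching rules. A sketch of the case-3 computation, assuming each matching rule is reduced to its decision value, premise-weight product, support, and confidence (this flattened rule format is our assumption):

```python
def chase_confidence(matching_rules, threshold):
    """Compute conf(d') of equation (2.1) for every candidate decision
    value and keep those above the threshold. Each rule is a tuple
    (decision_value, premise_weight_product, support, confidence)."""
    total = sum(w * s * c for _, w, s, c in matching_rules)
    conf = {}
    for d, w, s, c in matching_rules:
        conf[d] = conf.get(d, 0.0) + w * s * c / total
    return {d: v for d, v in conf.items() if v > threshold}
```

In case 2 all rules share one decision value, so the single candidate gets confidence 1, as the text notes.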

[Figure 2.1: Null value imputation by Chase — number of null value imputations per iteration. Data set: 1984 US Congressional Voting; number of objects: 435; number of attributes used: 7; support: 50; confidence: 80%.]

Example 2.2. Suppose that Table 3.2 is an S of λ = 0.3, and that the rules in KB are summarized in Table 3.3. For instance, r_1 = [a_1 · c_1 → d_1] is an example of a rule belonging to KB. For a(x_6), we have two rules {r_5, r_6} whose decision values are values of a. By using equation (2.1), we have Conf_S(a_1, x_6, KB) = 0.413 and Conf_S(a_2, x_6, KB) = 0.587. Because the weights of a_1 and a_2 are greater than the threshold value, the AVs assigned to a(x_6) by Chase are {(a_1, 0.413), (a_2, 0.587)}.

Chase is an iterative process. The execution of the algorithm that generates a new information system is repeated until it reaches a state where no additional null value imputation is available. Figure 2.1 shows the number of null value imputations at each iteration for a sample data set describing US congressional voting [S. Hettich and Merz, 1998]. After the first iteration, 163 null values are replaced. In the second iteration, 10 more slots are filled. The execution stops after the third iteration.

  X    | a                     | b                     | c                     | d   | e
  x_1  | {(a_1,1/3),(a_2,2/3)} | {(b_1,2/3),(b_2,1/3)} | c_1                   | d_1 | {(e_1,1/2),(e_2,1/2)}
  x_2  | {(a_2,1/4),(a_3,3/4)} | {(b_1,1/3),(b_2,2/3)} |                       | d_2 | e_1
  x_3  |                       | b_2                   | {(c_1,1/2),(c_3,1/2)} | d_2 | e_3
  x_4  | a_3                   |                       | c_2                   | d_1 | {(e_1,2/3),(e_2,1/3)}
  x_5  | {(a_1,2/3),(a_2,1/3)} | b_1                   | c_2                   |     | e_1
  x_6  | a_2                   | b_2                   | c_3                   | d_2 | {(e_2,1/3),(e_3,2/3)}
  x_7  | a_2                   | {(b_1,1/4),(b_2,3/4)} | {(c_1,1/3),(c_2,2/3)} | d_2 | e_2
  x_8  |                       | b_2                   | c_1                   | d_1 | e_3

  Table 2.1: Information System S_1

  X    | a                     | b                     | c                     | d   | e
  x_1  | {(a_1,1/3),(a_2,2/3)} | {(b_1,2/3),(b_2,1/3)} | c_1                   | d_1 | {(e_1,1/3),(e_2,2/3)}
  x_2  | {(a_2,1/4),(a_3,3/4)} | b_1                   | {(c_1,1/3),(c_2,2/3)} | d_2 | e_1
  x_3  | a_1                   | b_2                   | {(c_1,1/2),(c_3,1/2)} | d_2 | e_3
  x_4  | a_3                   |                       | c_2                   | d_1 | e_2
  x_5  | {(a_1,3/4),(a_2,1/4)} | b_1                   | c_2                   |     | e_1
  x_6  | a_2                   | b_2                   | c_3                   | d_2 | {(e_2,1/3),(e_3,2/3)}
  x_7  | a_2                   | {(b_1,1/4),(b_2,3/4)} | c_1                   | d_2 | e_2
  x_8  | {(a_1,2/3),(a_2,1/3)} | b_2                   | c_1                   | d_1 | e_3

  Table 2.2: Information System S_2

2.4 Distributed Chase

In distributed information systems (or distributed knowledge discovery systems) it is quite possible that an attribute is missing at one site while it occurs at many others. Also, in one information system an attribute might be partially hidden, while in other systems the same attribute is either complete or close to complete. Assume that a user submits a query to one of the information systems (called a client) which involves some non-local attributes. In such a case, network communication technology is used to get definitions of these unknown attributes from other information systems (called servers). All these new definitions form a knowledge base which can be used to chase missing attributes at the client site. In Figure 2.2, we present two consecutive states of a distributed information system consisting of S_1, S_2, S_3. In the first state, all values of all missing attributes in all three information systems have to be identified. System S_1 sends a request q_S1 to the other two information systems asking them for definitions of its missing attributes. Similarly, system S_2 sends a request q_S2 to the other two information systems, and system S_3 sends a request q_S3. Next, rules describing the requested definitions are extracted from each of these three information systems and sent to the systems which requested them. That is, the set KB_1 is sent to S_2 and S_3, the set KB_2 is sent to S_1 and S_3, and the set KB_3 is sent to S_1 and S_2. The second state of the distributed information system, presented in Figure 2.2, shows all three information systems with the corresponding KB_i sets, i ∈ {1, 2, 3},

all abbreviated as KB. Now, the Chase algorithm is run independently at each of our three sites. The resulting information systems are Chase(S_1), Chase(S_2), and Chase(S_3). Now, the whole process is recursively repeated. That is, incomplete attributes in all three new information systems are identified again. Next, each of these three systems sends requests to the other two systems asking for definitions of its incomplete attributes, and when these definitions are received, they are stored in the corresponding KB sets. Then the Chase algorithm is run again at each of the three sites. The whole process is repeated until some fixed point is reached (no changes in the AVs assigned to objects are observed in any of the three systems). When this step is accomplished, a query containing some missing AVs can be submitted to any S_i, i ∈ {1, 2, 3}, and processed in a standard way. The following gives the formal definition of distributed Chase. Assume that S_1 and S_2 (see Tables 2.1 and 2.2) are partially incomplete information systems, both of type λ. The same set X of objects is stored in both systems and the same set A of attributes is used to describe them. The meaning and granularity of the values of attributes from A in both systems S_1, S_2 are also the same. Additionally, we assume that a_S1(x) = {(a_1i, p_1i) : 1 ≤ i ≤ m_1} and a_S2(x) = {(a_2i, p_2i) : 1 ≤ i ≤ m_2}. We say that the δ-containment relation Ψ holds between S_1 and S_2 if the following three conditions hold:

1. (∀x ∈ X)(∀a ∈ A)[card(a_S1(x)) ≥ card(a_S2(x))],
2. (∀x ∈ X)(∀a ∈ A)[[card(a_S1(x)) = card(a_S2(x))] → [Σ_{i≠j} |p_2i − p_2j| > Σ_{i≠j} |p_1i − p_1j|]],
3. [Σ_{i≠j} |p_2i − p_2j| − Σ_{i≠j} |p_1i − p_1j|] ≥ δ.

Instead of saying that the δ-containment relation holds between S_1 and S_2, we can equivalently say that S_1 was transformed into S_2 by a δ-containment mapping Ψ. This fact

can be presented as the statement Ψ(S_1) = S_2, or (∀x ∈ X)(∀a ∈ A)[Ψ(a_S1(x)) = a_S2(x)]. Similarly, we can either say that a_S1(x) was transformed into a_S2(x) by Ψ, or that the δ-containment relation Ψ holds between a_S1(x) and a_S2(x). So, if a δ-containment mapping Ψ converts an information system S to S', then S' is more complete than S. In other words, for at least one pair (a, x) ∈ A × X, either Ψ has to decrease the number of AVs in a_S(x), or the average difference between the confidences assigned to the AVs in a_S(x) has to increase by at least δ. To give an example of a δ-containment mapping Ψ, let us take the two information systems S_1, S_2, both of type λ, represented as Table 2.1 and Table 2.2. Also, we assume that δ = 1/6. It can easily be checked that the values assigned to e(x_1), b(x_2), c(x_2), a(x_3), e(x_4), a(x_5), c(x_7), and a(x_8) in S_1 are different from the corresponding values in S_2. In each of these eight cases, the AV assigned to an object in S_2 is less general than the value assigned to the same object in S_1. Also, it can easily be checked that Ψ satisfies the δ restriction. This means that Ψ(S_1) = S_2. The knowledge base KB contains rules extracted locally at the local site as well as rules extracted from information systems at remote sites. Since rules are extracted from different information systems, inconsistencies in semantics, if any, have to be resolved before any null value imputation can be applied. There are two options:

1. the knowledge base KB at the local site is kept consistent (in this scenario all inconsistencies have to be resolved before rules are stored in the knowledge base);
2. the knowledge base at the local site is inconsistent (values of the same attribute used in two rules extracted at different sites may be of different granularity

levels and may have different semantics associated with them). In general, we assume that the information stored in an ontology [Benjamins, 1998] [Guarino and Giaretta, 1995] and, if needed, in inter-ontologies (if they are provided) is sufficient to resolve inconsistencies in the semantics of all sites involved in Chase [Raś and Dardzińska, 2004a]. In other words, any meta-information in a DIS is described by one or more ontologies, and the inter-ontology relationships are used as a semantic bridge between autonomous information systems in order to collaborate and understand each other [Raś, 1994]. If we have a case where this assumption does not hold, that is, the rules stored in KB have different semantics that require different interpretations in order to be applicable by Chase, rough semantics can be used for interpreting the rules in KB [Raś and Dardzińska, 2004a]. In this dissertation, for simplicity, we assume that the semantics of attributes are consistent among all sites. For example, if a ∈ A_i ∩ A_j, then conceptually its meaning is the same in both S_i and S_j.
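One plausible reading of the δ-containment conditions defined earlier in this section, with AVs again as (value, weight) lists (the `spread` helper and the exact branch structure are our interpretation, not the dissertation's code):

```python
def spread(pairs):
    """Sum of pairwise absolute differences between the confidences."""
    ps = [p for _, p in pairs]
    return sum(abs(ps[i] - ps[j])
               for i in range(len(ps)) for j in range(i + 1, len(ps)))

def delta_contains(av1, av2, delta):
    """Check whether a_S1(x) -> a_S2(x) respects δ-containment: the value
    set must not grow, and if its size is unchanged, the confidence
    spread must increase by at least delta."""
    if len(av2) > len(av1):
        return False                      # condition 1 violated
    if len(av2) == len(av1):
        gain = spread(av2) - spread(av1)  # conditions 2 and 3
        return gain > 0 and gain >= delta
    return True
```

For a(x_5) in Tables 2.1 and 2.2, the weights move from (2/3, 1/3) to (3/4, 1/4); the spread grows from 1/3 to 1/2, a gain of 1/6, so the δ = 1/6 restriction is met.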

[Figure 2.2: Global extraction and exchange of knowledge. State 1: sites S_1 (attributes a, b, c, d), S_2 (b, a, d, e), and S_3 (g, a, b, c) exchange the requests q_S1 = [a, c, d : b], q_S2 = [b, a, e : d], and q_S3 = [a, b, c : g]. State 2: the rules r_1, r_2 extracted from S_2, r_3, r_4 extracted from S_3, and r_5, r_6 extracted from S_1 are stored in the KB of each requesting site.]
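The two-state process in Figure 2.2 — publish rules for the peers' missing attributes, run Chase locally, repeat until nothing changes — can be mimicked with a toy fixed-point loop. Everything here (one-premise rules, None for null values, a merged KB) is a deliberate simplification for illustration:

```python
def distributed_chase(sites, knowledge_bases):
    """Toy version of the Figure 2.2 exchange: every site repeatedly fills
    its None slots using the union of all sites' rule sets until no site
    changes. A rule ((attr, value), (attr, value)) means premise -> decision."""
    kb = [r for rules in knowledge_bases for r in rules]  # exchanged KBs
    changed = True
    while changed:
        changed = False
        for site in sites:
            for row in site.values():
                for (pa, pv), (da, dv) in kb:
                    if row.get(da) is None and row.get(pa) == pv:
                        row[da] = dv
                        changed = True
    return sites
```

The outer loop plays the role of the "whole process is recursively repeated" step; it terminates because each pass can only fill None slots, never create new ones.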

CHAPTER 3: DATA SECURITY AND CHASE

This chapter presents how confidential data can be reconstructed, and the notion of data confidentiality against Chase.

3.1 Confidential Data Disclosure by Chase

To illustrate the data confidentiality problem, let's consider the following example. Suppose a local information system S ∈ {S_i : i ∈ I} operates in a DIS. An attribute d in S contains a set of confidential data. To protect d from disclosure, we hide d from S (see Table 3.1) and construct S_d = (X, A, V) (see Table 3.2), where:

1. a_S(x) = a_Sd(x), for any a ∈ A − {d}, x ∈ X,
2. d_Sd(x) is undefined, for any x ∈ X,

and user queries are now answered by S_d in place of S. The problem is that hiding attribute d may not be enough, due to Chase. In order to reconstruct d, a request for a definition of d can be sent from site S_d to some of its remote sites involved in the DIS. These definitions, fetched and stored in the KB of S_d, are used by the distributed Chase algorithm to impute the missing values for a number of hidden AVs at S_d. Figure 3.1 shows the overall process of confidential data disclosure.

Example 3.1. We see that object x_3 is supported by three rules {r_1, r_2, r_3} (see Table 3.3), which predict d_1 with confidence Conf_Sd(d_1, x_3, KB) computed by equation (2.1). In this case, we need to hide additional AVs from x_3 to make the rules inapplicable and therefore protect the confidential value d_1.
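The check in Example 3.1 — which KB rules still fire on an object once d is hidden — is mechanical. A sketch with a simplified single-valued row and rules as (premise_dict, decision_attr, decision_value, confidence) tuples; the layout and any confidence figures used with it are our assumptions, not Table 3.3's actual entries:

```python
def disclosing_rules(row, hidden_attr, kb):
    """Return (value, confidence) for every rule whose premise is fully
    matched by the visible part of `row` and whose decision attribute
    is the hidden one."""
    visible = {a: v for a, v in row.items() if a != hidden_attr}
    return [(dval, conf)
            for premise, dattr, dval, conf in kb
            if dattr == hidden_attr
            and all(visible.get(a) == v for a, v in premise.items())]
```

For an x_3-like row, the three d-rules of Example 3.1 all fire; hiding c_1 and f_1 empties the result, which is exactly the goal of the protection step.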

[Figure 3.1: An example of confidential data reconstruction. (1) The local information system extracts a definition of the hidden attribute D (e.g., the rules a1 -> d1, b3 -> d1, a1*b1 -> d2) from remote servers in a distributed information system under the same ontology; (2) the KD engine uses these rules to predict the null (hidden) values of D in the local table.]

  X    | a                  | b                  | c                  | d   | e                  | f   | g
  x_1  | (a_1,2/3)(a_2,1/3) | b_1                | c_1                | d_1 | e_1                | f_1 | g_1
  x_2  | (a_2,2/5)(a_3,3/5) | (b_1,1/3)(b_2,2/3) |                    | d_2 | e_1                | f_2 |
  x_3  | a_1                | b_1                | (c_1,1/2)(c_3,1/2) | d_1 | e_3                | f_1 |
  x_4  | a_3                |                    | c_2                | d_1 | (e_1,2/3)(e_2,1/3) | f_2 |
  x_5  | (a_1,2/3)(a_3,1/3) | (b_1,1/2)(b_2,1/2) | c_1                | d_1 | e_1                | f_2 | g_1
  x_6  |                    | b_1                | (c_1,1/3)(c_3,2/3) | d_1 | e_1                | f_1 | g_1
  x_7  | a_1                | b_1                | c_1                |     | e_1                | f_3 | g_1
  ...
  x_i  | (a_3,1/2)(a_4,1/2) | b_2                | c_2                |     | e_3                | f_2 |

  Table 3.1: Information System S. λ = 0.3

  X    | a                  | b                  | c                  | d | e                  | f   | g
  x_1  | (a_1,2/3)(a_2,1/3) | b_1                | c_1                |   | e_1                | f_1 | g_1
  x_2  | (a_2,2/5)(a_3,3/5) | (b_1,1/3)(b_2,2/3) |                    |   | e_1                | f_2 |
  x_3  | a_1                | b_1                | (c_1,1/2)(c_3,1/2) |   | e_1                | f_1 |
  x_4  | a_3                |                    | c_2                |   | (e_1,2/3)(e_2,1/3) | f_2 |
  x_5  | (a_1,2/3)(a_3,1/3) | (b_1,1/2)(b_2,1/2) | c_1                |   | e_1                | f_2 | g_1
  x_6  |                    | b_1                | (c_1,1/3)(c_3,2/3) |   | e_1                | f_1 | g_1
  x_7  | a_1                | b_1                | c_1                |   | e_1                | f_3 | g_1
  ...
  x_i  | (a_3,1/2)(a_4,1/2) | b_2                | c_2                |   | e_3                | f_2 |

  Table 3.2: Information System S_d. λ = 0.3

Rule | definition     | sup | conf | source
r1   | a1 * c1 -> d1  | 20  | 95%  | S2
r2   | b1 * c1 -> d1  | 20  | 92%  | S2
r3   | f1 -> d1       | ?   | ?    | S1
r4   | f? -> d2       | ?   | ?    | S1
r5   | b1 * c1 -> a1  | ?   | ?    | S_d
r6   | c1 * f3 -> a2  | ?   | ?    | S_d
r7   | c1 -> b1       | ?   | ?    | S_d
r8   | e? -> b1       | ?   | ?    | S_d
r9   | a1 * g1 -> c1  | ?   | ?    | S_d
r10  | a1 * c1 -> e1  | 20  | 92%  | S_d
r11  | e1 * g1 -> c1  | ?   | ?    | S_d
r12  | c? -> b1       | ?   | ?    | S_d

Table 3.3: Rules contained in KB. Values after the arrow are the decision values; a '?' marks a digit or entry that is not legible in this copy

To examine the additional disclosure risk caused by Local Chase, let us consider the locally extracted rules in KB (that is, r5..r12 in Table 3.3). In other words, we run the Chase algorithm not only for the confidential AV, but also for the AV's that we have hidden.

Example 3.2 The rules {r1, r2, r3} are supported by x3 in Example 3.1. Suppose we hide the two AV's {c1, f1} from x3. Clearly, all three rules are now inapplicable, and the minimum number of hidden AV required to protect d1 becomes 2. However, we can see that c1, which was just hidden, is reconstructed by {r9, r11} through local Chase. This means that {r9, r11} are in the prediction path, and we therefore need to hide additional AV's again.

3.2 Concept of Confidentiality

Before we consider the reconstruction of an information system, it is important to specify the services the system provides and the tasks users can perform, since these determine the type and amount of information users can acquire from the system. We assume that the following meta-information about an information system S is known to users:

1. The list of attributes is known to users.
2. The list of objects is known to users.
3. Attribute values are hidden from users.

In addition, the following tasks can be performed by users:

1. Request the list of rules in the knowledge base.
2. Request the list of objects that satisfy the rules.
3. Request the estimated value of undefined values (e.g. null values).

We assume that no attribute value is known to users at the time the knowledge base is built. Some attribute values contribute to rule generation and others do not.

Those attribute values that are involved in rule generation can generally be identified; note, however, that this applies only to rules generated locally. Rules that came from remote sites need not have this property. Moreover, attribute values cannot be identified precisely if (1) the original information system is not shown to the users and (2) the information system is incomplete.

Before we discuss our algorithms, we define the confidentiality of hidden AV's against Chase. Assume that KB contains the rules extracted from local and remote sites, and let the confidential attribute value be d_S(x) = d_j. Then there are three cases:

1. If Conf_{S_d}(d_j, x, KB) ≥ λ and (∃ d ≠ d_j)[Conf_{S_d}(d, x, KB) ≥ λ], then d_j is secure (the prediction is ambiguous) and we do not hide any additional slots for x.
2. If Conf_{S_d}(d_j, x, KB) ≥ λ and (∀ d ≠ d_j)[Conf_{S_d}(d, x, KB) < λ], then d_j is not secure and we hide additional slots for x.
3. If Conf_{S_d}(d_j, x, KB) < λ and (∃ d ≠ d_j)[Conf_{S_d}(d, x, KB) ≥ λ], then d_j is secure and we do not hide any additional slots for x.

Case 2 can be further divided depending on the system. When a confidential value has been predicted by Chase, the weight of the predicted value may be substantially different from that of the actual value. If so, protection may not be required, because an adversary cannot have enough confidence in the prediction. To determine whether a prediction is valid, we use a measurement function and compare the result with a threshold value τ. Suppose the weight of the actual confidential value d_i is denoted p_d[i] and the weight of the predicted value is denoted p'_d[i]. The degree of validity associated with the prediction is defined as

    p(v) = 1 − |p_d[i] − p'_d[i]| / p_d[i]    (3.1)

and we say d_i is secure against Chase if p(v) < τ.

Example 3.3 Assume that τ = 0.8 and a confidential AV is {(d1, 3/4), (d2, 1/4)}. Suppose the weights of its predicted values are {(d1, 1/4), (d2, 3/4)}. Then d1 is not considered a valid prediction because, using function (3.1), p(v) = 1 − (0.5/0.75) ≈ 0.33 < 0.8.
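Formula (3.1) and Example 3.3 can be checked directly; `validity` is a hypothetical helper name for the measurement function described above.

```python
def validity(p_actual, p_predicted):
    """Degree of validity of a Chase prediction, formula (3.1):
    p(v) = 1 - |p_d[i] - p'_d[i]| / p_d[i]."""
    return 1 - abs(p_actual - p_predicted) / p_actual

# Example 3.3: the actual weight of d1 is 3/4, the predicted weight is 1/4.
tau = 0.8
pv = validity(3/4, 1/4)  # 1 - (0.5 / 0.75) = 1/3
print(pv < tau)          # the prediction is not valid, so d1 is secure
```

Since p(v) ≈ 0.33 falls below τ = 0.8, an adversary cannot place enough confidence in the predicted weight, and d1 counts as secure against Chase.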

CHAPTER 4: PROTECTION WITH MINIMUM DATA LOSS

In this chapter, we present two algorithms for protecting confidential data with minimum data loss. The first algorithm is a bottom-up approach that examines the Chase closure [Im et al., 2005a] of attribute values (or data) involved in hidden value reconstruction, and uses the result to detect the largest set of attribute values that can remain unchanged. The second algorithm is a top-down approach that identifies the AV's whose removal eliminates the largest number of Chase-applicable rules. To describe the algorithms, we first define the following sets:

1. α(x), the set of attribute values used to describe x in S_d,
2. α(t), the set of attribute values used in t, where t is their conjunction,
3. R(x) = {(t → d) : α(t) ⊆ α(x)} ⊆ KB, the set of rules in KB whose condition-part attribute values are contained in α(x),
4. β(x) = {d : [t → d] ∈ R(x)} − {d_c}, where d_c is the confidential value.

4.1 Algorithm One: Bottom-Up Approach

The first algorithm is a bottom-up approach that examines the Chase closure of attribute values involved in hidden value reconstruction, and uses the result to detect the largest set of attribute values that can remain unchanged [Im et al., 2005b]. In our example (see Tables 3.1, 3.2, and 3.3), R(x1) = {r1, r2, r3, r5, r6, r9, r10, r11, r12} and α(x1) = {a1, b1, c1, e1, f1, g1}. Clearly, by the procedures described in Section 2.2, d1 replaces the hidden slots through rules from {r8, r9, r10}. In addition, other

rules from R(x1) also predict the attribute values listed in {t8, t9, t10}. These interconnections often build up a complex chain of predictions, and the task of blocking such prediction chains while identifying a minimal set of concealed values is not straightforward. Let us consider the following example. Suppose we have {r1 = [a1 * b1 → d1], r2 = [b1 * c1 → d1], r3 = [b1 * e1 → d1]}, all inferring d1. In this case, b1 is covered by all 3 rules, and eliminating it ensures the protection. However, if there were 3 other rules {h1 → b1, i1 → h1, k1 → i1}, the additional values {h1, i1, k1} would also have to be hidden, and b1 would not have been the best choice. In general, when a large number of attributes and rules exist, such an overlap-based approach often produces a large and complex graph as we trace all connections. Another issue is that the order in which we eliminate attribute values can have a high impact on the final result. For example, hiding values in the order c1, f1, a1 versus c1, a1, f1 may produce different results, because the attribute value set {c1, f1} removes the inference to d1, while {c1, a1} does not, and we would have to hide b1 as well.

To find the minimum amount of values used to predict the confidential values, a bottom-up approach has been adopted. We check the values that can remain unchanged, starting from singleton sets each containing one attribute value and using the Chase closure [Im et al., 2005a], and we increase the initial set size as much as possible. Chase closure is similar to transitive closure [Abiteboul et al., 1995], except that a predicted value is not added to the closure if its weight is less than λ. For example, object x7 supports two rules {r5, r6} ⊆ KB that predict {a1, a2}. In this case, a1 is not included in the closure if λ_{S_d} = 0.45, because Conf(a1, x7, KB) < 0.45. Using the Chase closure identifies must-be-hidden values without checking every possible superset of α(t).
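The Chase closure just described can be sketched as a fixed-point loop over the applicable rules. The rule list below is a plain-Python rendering of the fully recoverable rules of Table 3.3; confidence checks against λ are omitted for brevity, so this is a simplified sketch rather than the exact procedure.

```python
# (condition set, decision) pairs drawn from Table 3.3.
rules = [
    ({"a1", "c1"}, "d1"),   # r1
    ({"b1", "c1"}, "d1"),   # r2
    ({"f1"}, "d1"),         # r3
    ({"b1", "c1"}, "a1"),   # r5
    ({"c1"}, "b1"),         # r7
    ({"a1", "g1"}, "c1"),   # r9
    ({"a1", "c1"}, "e1"),   # r10
    ({"e1", "g1"}, "c1"),   # r11
]

def chase_closure(values, rules):
    """Repeatedly add the decision of every rule whose whole condition
    is already contained in the closure, until nothing changes."""
    closure = set(values)
    changed = True
    while changed:
        changed = False
        for cond, decision in rules:
            if cond <= closure and decision not in closure:
                closure.add(decision)
                changed = True
    return closure

print(chase_closure({"c1"}, rules))  # contains d1, so {c1} gets marked
print(chase_closure({"f1"}, rules))  # contains d1 as well: marked
print(chase_closure({"a1"}, rules))  # stays {'a1'}: unmarked
```

The closure of {c1} grows to {a1, b1, c1, d1, e1} through the chain c1 → b1, b1*c1 → a1, a1*c1 → d1, matching the singleton computation shown below for x1.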
The justification of this is quite simple.

Chase closure is monotone: if a set s is contained in a set s', then the closure of s is contained in the closure of s'. Clearly, then, if the closure of a set of attribute values predicts d1, that set must be hidden regardless of the presence or absence of other attribute values, and none of its supersets needs to be checked.

To outline the procedure, we start with the set α(x) for object x1, whose construction is supported by 9 rules from KB, and check the Chase closure of each singleton subset δ(x) of that set. If the Chase closure of δ(x) contains the classified attribute value d1, then δ(x) cannot remain: it is marked and is not considered in later steps. Otherwise, the set remains unmarked. In the second iteration of the algorithm, all two-element subsets of α(x) built only from unmarked sets are considered. If the Chase closure of such a set does not contain d1, the set remains unmarked and is used in the later steps of the algorithm; otherwise, the set gets marked. If either all sets in the current iteration step are marked or we have reached the set α(x), the algorithm stops. In this way we only check a fraction of the subsets of α(x), which is typically much smaller than the full power set.

The following singleton sets are considered for x1 (a superscript + denotes the closure of the given set):

{a1}+ = {a1} : unmarked
{b1}+ = {b1} : unmarked
{c1}+ = {a1, b1, c1, e1, d1} ⊇ {d1} : marked *
{e1}+ = {e1} : unmarked
{f1}+ = {d1, f1} ⊇ {d1} : marked *
{g1}+ = {g1} : unmarked

Clearly, c1 and f1 have to be hidden. The next step is to build sets of length 2 and determine which of them can remain. We take the union of two sets only if they are

both unmarked and one of them is a singleton set:

{a1, b1}+ = {a1, b1} : unmarked
{a1, e1}+ = {a1, e1} : unmarked
{a1, g1}+ = {a1, b1, c1, d1, e1, g1} ⊇ {d1} : marked *
{b1, e1}+ = {b1, e1} : unmarked
{b1, g1}+ = {b1, g1} : unmarked
{e1, g1}+ = {a1, b1, c1, d1, e1, g1} ⊇ {d1} : marked *

Now we build 3-element sets from the previous sets that have not been marked:

{a1, b1, e1}+ = {a1, b1, e1} : unmarked

{b1, e1, g1}+ is not considered, as it is a superset of {e1, g1}, which was marked. We are left with {a1, b1, e1} as the unmarked set that contains the maximum number of elements and whose Chase closure does not contain d1. If multiple unmarked sets are identified at the last iteration, we can randomly pick one of them. In a similar way, we compute the maximal sets for any object x_i. A precise description of the algorithm is given in Figure 4.1.

4.2 Algorithm Two: Top-Down Approach

Another approach is to identify the most promising attribute values, i.e., those whose removal eliminates the largest number of supported rules, and use the result to determine the order in which AV's are hidden from an object. The algorithm is the following. For each attribute value v_i ∈ α(t) ∪ β(x), we use the notation v_i^c to denote the frequency with which v_i appears in the conditional part of rules r ∈ R(x); thus v_i^c represents the number of overlaps for v_i between rules. We denote by v_i^d the frequency with which v_i appears in

AlgorithmOne(S_d, KB)
begin
  i := 1;
  while i ≤ l do
  begin
    for all v ∈ α(x_i) do Mark(v) := F;
    for all v ∈ α(x_i) do
    begin
      α1(x_i, v) := {v};
      β(x_i, v) := Chase(S_d, KB, α1(x_i, v));
      if d_c ∈ β(x_i, v) then Mark(v) := T;
    end
    j := 2;
    while j ≤ k_i − 1 do
    begin
      for each w ⊆ α(x_i) such that card(w) = j
          and all subsets of w are unmarked do
      begin
        α1(x_i, w) := w;
        β(x_i, w) := Chase(S_d, KB, α1(x_i, w));
        if d_c ∈ β(x_i, w) then Mark(w) := T;
      end
      j := j + 1;
    end
    x_i := max_card(α1);
    i := i + 1;
  end
end

Chase(S, KB, x_i)
begin
  α(x_i, w) := {v ∈ x_i};
  repeat
    prev := card(α(x_i, w));
    R(x_i) := {r ∈ KB : (∃ t ⊆ α(x_i, w))[r = t → d]};
    α(x_i, w) := α(x_i, w) ∪ {d : ([t → d] ∈ R(x_i)),
                 {d} ∩ α(x_i, w) = ∅, conf(d) > λ_S};
  until card(α(x_i, w)) = prev;
  return α(x_i, w);
end

Figure 4.1: Minimum data hiding algorithm: bottom-up approach
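The marking scheme of AlgorithmOne can be condensed into an executable sketch. The helper names are hypothetical, the rule list is limited to the fully recoverable rules of Table 3.3, and confidence thresholds are omitted; under those assumptions it reproduces the hand computation above, returning the largest subset of α(x1) whose closure avoids d1.

```python
from itertools import combinations

# (condition set, decision) pairs drawn from Table 3.3.
rules = [
    ({"a1", "c1"}, "d1"), ({"b1", "c1"}, "d1"), ({"f1"}, "d1"),
    ({"b1", "c1"}, "a1"), ({"c1"}, "b1"), ({"a1", "g1"}, "c1"),
    ({"a1", "c1"}, "e1"), ({"e1", "g1"}, "c1"),
]

def chase_closure(values, rules):
    """Fixed-point closure: add decisions of rules whose condition holds."""
    closure, changed = set(values), True
    while changed:
        changed = False
        for cond, dec in rules:
            if cond <= closure and dec not in closure:
                closure.add(dec)
                changed = True
    return closure

def largest_safe_subset(alpha, rules, d_c):
    """Bottom-up marking: grow unmarked subsets level by level and return
    the largest set whose Chase closure does not contain d_c."""
    marked = set()
    best = frozenset()
    for size in range(1, len(alpha) + 1):
        found = False
        for w in map(frozenset, combinations(sorted(alpha), size)):
            # supersets of marked sets are skipped: by monotonicity their
            # closures contain d_c as well
            if any(m <= w for m in marked):
                continue
            if d_c in chase_closure(w, rules):
                marked.add(w)
            else:
                best, found = w, True
        if not found:
            break
    return set(best)

alpha_x1 = {"a1", "b1", "c1", "e1", "f1", "g1"}
print(largest_safe_subset(alpha_x1, rules, "d1"))  # a1, b1, e1 may remain
```

The result agrees with the worked example: {a1, b1, e1} can stay visible, while c1, f1, and g1 (whose combinations close to d1) must be hidden.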

the decision part of rules r ∈ R(x). The weight ω(v_i) of each attribute value is then

    ω(v_i) = v_i^c − v_i^d

The logic behind the weight function is to select an AV that predicts many other AV's but is itself predicted few times. We start by hiding the v_i that has the maximum weight and is contained in the rules that directly predict the confidential AV. Then we check the Chase closure of the remaining AV set to see whether the confidential value can be reconstructed. If it can, we compute the weights again, excluding the AV's that have already been hidden. We continue hiding the maximum-weight v_i until the confidential AV can no longer be reconstructed with R(x). When two or more v_i have equal weight, we randomly select one. A precise description of the algorithm is given in Figure 4.2.

Example 4.2 Figure 4.3 shows an example of additional AV hiding for x7 using this algorithm. x7 is supported by 7 rules: {a1*c1 → d1, b1*c1 → d1, b1*c1 → a1, a1*g1 → c1, a1*c1 → e1, e1*g1 → c1, c1 → b1}. The set {v_i} = α(t) ∪ β(x) is {a1, b1, c1, e1, g1}, and ω(v_i) is initially {2, 1, 4, 0, 2}. Clearly, we first hide c1 from {v_i} because it has the maximum weight and is contained in the conditional part of the rules that directly predict the confidential AV. In the next iteration we hide g1, after which d1 cannot be reconstructed. Because g1 and a1 have equal weight, it is possible that a1 is hidden first; in that case, the total number of hidden AV for x7 becomes 3 instead of 2.

We implemented the proposed methods and conducted experiments to answer the following questions:

1. What percentage of confidential AV are reconstructed by Chase?
2. What percentage of additional AV are required to be hidden to protect the confidential AV?

AlgorithmTwo(S_d, KB)
begin
  i := 1;
  while i ≤ l do
  begin
    for each x_i ∈ X do
    begin
      h(v) := ∅;
      α(x_i, v) := α(t) ∪ β(x);
      loop
        for all v_j ∈ α(x_i, v) do ω(v_j) := v_j^c − v_j^d;
        if β(x) ≠ ∅ then
          h(v) := h(v) ∪ max_ω(v_j), v_j ∈ β(x);
        else
          h(v) := h(v) ∪ max_ω(v_j);
        end if
        α(x_i, v) := α(x_i, v) − h(v);
        α(x_i, v) := Chase(S_d, KB, α(x_i, v));
        exit when d_c ∉ α(x_i, v);
      end loop
    end
    x_i := α(x_i, v);
    i := i + 1;
  end
end

Figure 4.2: Minimum data hiding algorithm: top-down approach

[Figure 4.3 shows the rule set R(x7) and the weights ω(v_i) across two iterations of the top-down approach: c1, the maximum-weight value, is hidden in iteration (1), g1 is hidden in iteration (2), and the rules made inapplicable at each step are crossed out.]

Figure 4.3: Additional attribute value hiding using top down approach
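The weight computation of the top-down approach can be sketched as below, using the seven rules listed for x7 in Example 4.2. Note that the exact counts depend on precisely which rules of Table 3.3 are taken to apply to x7, so the weights printed here are illustrative rather than a verbatim reproduction of Figure 4.3; `weights` is a hypothetical helper name.

```python
from collections import Counter

# The rules x7 supports, as listed in Example 4.2: (condition, decision).
rules_x7 = [
    ({"a1", "c1"}, "d1"), ({"b1", "c1"}, "d1"), ({"b1", "c1"}, "a1"),
    ({"a1", "g1"}, "c1"), ({"a1", "c1"}, "e1"), ({"e1", "g1"}, "c1"),
    ({"c1"}, "b1"),
]

def weights(rules, candidates):
    """omega(v) = v^c - v^d: occurrences of v in rule conditions minus
    occurrences of v in rule decisions."""
    cond = Counter(v for c, _ in rules for v in c)
    dec = Counter(d for _, d in rules)
    return {v: cond[v] - dec[v] for v in candidates}

candidates = {"a1", "b1", "c1", "e1", "g1"}
w = weights(rules_x7, candidates)
print(w)
print(max(w, key=w.get))  # c1 has the largest weight and is hidden first
```

With this rule list, c1 receives the maximum weight and is therefore hidden first, matching the first step of Example 4.2.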

3. What are the differences between the two algorithms with regard to the number of additional AV hidden and to performance?

Our experiments were performed on 4 data sets taken from the UCI data repository [S. Hettich and Merz, 1998]; their characteristics are shown in Table 4.1. We assume that each data set is an independent DIS. To keep the DIS environment as simple as possible (without the problems of handling different granularity and different semantics of attributes at different sites, and without either using a global ontology or building inter-ontology bridges between local ontologies), each data set was randomly partitioned into 3 tables with the same attributes and an equal number of objects. One of these tables is called the local site and the two others are called remote sites; all of them represent sites in the DIS. At the local site, we chose one of the attributes as the confidential attribute and replaced all of its values with nulls. For example, the number of objects for the car evaluation data set was 576 at the local site, and the two remote sites had equal numbers of objects, so the total number of objects in the DIS was 1728. The confidential attribute was the acceptance level, which had four different classifications. The data set was complete and had no missing (null) values. We then applied ERID to learn the descriptions of the remaining (unhidden) attributes from the local site, to be used later by local Chase. At each remote site, we also generated rules using ERID to learn descriptions of the confidential attribute and stored them in its KB. All these descriptions, in the form of rules, were fetched into the KB of the local site. A threshold value λ = 0.3 was used for all data sets. We used several different support and confidence values (see Table 4.2) for each data set in order to generate a sufficient number of rules, and assumed that these rules represented all the knowledge in the DIS. We discuss the issue related to the

amount of knowledge in more detail in Chapter 7.

Data Set      | # objects (local, total) | # attributes | missing values | confidential attribute (# classifiers)
Car Eval.     | 576 (1728)   | 7  | No  | Acceptance Level (4)
Breast Cancer | 233 (699)    | 10 | Yes | Diagnosis (2)
Nursery       | 4320 (16200) | 9  | No  | Application Rank (5)
Cong. Voting  | 145 (435)    | 17 | Yes | Party (2)

Table 4.1: Characteristics of the sample data sets

Data Set      | sup  | conf | # rules in KB (local + remote) | # confidential data reconstructions
Car Eval.     | 10%  | 85%  | 11 (4 + 7)  | 71% (410)
Breast Cancer | 15%  | 85%  | 77          | 67% (156)
Nursery       | 7.5% | 85%  | 32 (24 + 8) | 48% (2055)
Cong. Voting  | 27%  | 100% | 159         | 95% (138)

Table 4.2: Number of confidential AV reconstructed by Chase

Table 4.2 shows that a substantial number of confidential AV are reconstructed by Chase. For the car evaluation data set, 71% (410 out of 576) of the confidential AV were reconstructed. Its KB consists of 11 rules: 7 of them came from the remote servers and describe the confidential AV, and the 4 local rules describe the values of the remaining attributes. The experiments on the three other data sets also exhibit a significant number of confidential AV reconstructions. To examine the number of additional AV hiding that is required to protect

Figure 4.4: Sample interface for SCIKD

confidential AV from disclosure, we ran our proposed algorithms; the results are shown in Table 4.3. For congressional voting, 723 AV in the client table were additionally hidden with the bottom-up approach, and 7.39% (739) of AV were additionally hidden with the top-down approach. We then ran the Chase algorithm on the new tables (after hiding the 723 and 739 AV, respectively) and no confidential AV were reconstructed correctly in either case. Clearly, the bottom-up approach identifies the minimum number of values to hide; there was a 2.3% difference in the number of hidden AV. The difference was primarily due to ties in the weight ω: random picks among values of equal weight do not always lead to the optimal decision. The percentage of additional AV hidden does not have a linear relationship with the number of rules. This is due to the overlaps between rules, which means that hiding one AV can make a

number of rules inapplicable.

              | Additional data hiding       | Computation time (in sec.)
Data Set      | Bottom up    | Top down      | Bottom up | Top down
Car Eval.     | 17.9% (727)  | 18.0% (723)   |           |
Breast Cancer | 34.4% (802)  | 34.5% (804)   |           |
Nursery       | 4.8% (1901)  | 4.8% (1901)   |           |
Cong. Voting  | 13.5% (333)  | 15.8% (391)   |           |

Table 4.3: Additional data hiding

With regard to performance, the two proposed algorithms were written in PL/SQL and executed in an Oracle Database 10g instance (184 MB SGA) running on Windows XP with a 1.4 GHz Pentium M and 512 MB of memory. The performance of the bottom-up approach was better in certain cases because no weight computation (ω) is performed. However, when many rules had long conditional parts, the top-down approach showed better performance: in the bottom-up approach, singleton attribute values were then not eliminated in the first iteration, and therefore a larger number of supersets had to be tested.

CHAPTER 5: HIERARCHICAL DATA MASKING

In the previous chapter, we protected confidential data by replacing a set of data with null values [Im and Raś, 2005][Im et al., 2005b]. Another approach, which we discuss in this chapter, is to mask the exact value by substituting it with a more general value from a higher level of a hierarchical attribute structure (see Figure 5.1). Representing an information system with hierarchical attribute structures is a way to make a knowledge discovery system more flexible [Raś et al., 2005]. Unlike in a single-level attribute system, data collected at different granularity levels can be stored in the information system together with their semantic relations. For example, when the age of a person is recorded, the value can be '20s' or 'young', as shown in Table 5.1. In this environment, we may report that a person is young, instead of revealing that she is in her 20s, if disclosure of the value 'young' does not compromise the person's privacy. The advantage of this second approach is that users are able to acquire more explicit answers to their queries. Clearly, we need to assume that a hierarchical attribute structure is given for each attribute, and that these structures are part of a common ontology which is large and approximately the same among sites. The sites should come from the same world (e.g. medical information), so that rules generated from different sites are close in terms of their meanings. In addition, each site is forced to accept a new version of the ontology whenever a change is made. The hierarchical structure must also be visible to users. Users have freedom of querying any level of values in the

hierarchy.

X   | Age                | Marriage                      | Education     | Salary
x1  | (20s,1/3)(30s,2/3) | sp present                    | middle school | (30K,1/2)(40K,1/2)
x2  | 30s                | (sp absent,1/2)(divorced,1/2) | bachelor      | 40K
x3  | 40s                | never married                 | bachelor      | 50K
x4  |                    | widowed                       | tertiary      | 80K
x5  | young              | never married                 | high school   | 20K
x6  | 50s                | divorced                      | master        | 80K
x7  | middle aged        | never married                 | high school   | (60K,2/3)(70K,1/3)
x8  | 60s                | married                       | high school   | 50K

Table 5.1: Information system represented with hierarchical attributes, λ = 1/3

[Figure 5.1 shows two attribute hierarchies: age, with top-level values young, middle aged, and old above the leaf values 20s, 30s, 40s, 50s, 60s, and 70s+; and education, with top-level values primary, secondary, and tertiary above the leaf values elementary, middle school, high school, bachelor, master, and PhD.]

Figure 5.1: Attribute hierarchy for age and education
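Masking with the Figure 5.1 trees can be sketched as follows. The exact grouping of leaves under each top-level value is an assumption here (e.g. that 'young' covers 20s and 30s), as is the rule, described in Section 5.1, that the weight of a generalized value is the sum of its children's weights.

```python
# Hypothetical parent map for the age hierarchy of Figure 5.1;
# the leaf-to-parent grouping is assumed, not stated in the figure.
AGE_PARENT = {
    "20s": "young", "30s": "young",
    "40s": "middle aged", "50s": "middle aged",
    "60s": "old", "70s+": "old",
}

def generalize(weighted_value, parent):
    """Replace each (value, weight) pair by its parent node, summing the
    weights of siblings that map to the same parent."""
    out = {}
    for value, weight in weighted_value:
        p = parent[value]
        out[p] = out.get(p, 0.0) + weight
    return sorted(out.items())

# x1's age in Table 5.1 is (20s, 1/3)(30s, 2/3); masked one level up,
# it becomes simply 'young' with the combined weight of its children.
print(generalize([("20s", 1/3), ("30s", 2/3)], AGE_PARENT))
```

The masked value 'young' still answers coarse queries correctly while no longer revealing which decade the person's age falls in.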

5.1 Rule Extraction and Hierarchical Attributes

Before we discuss Chase and data security, we need to examine how rules are generated from an information system S represented with hierarchical attribute structures. Such a rule extraction algorithm is essential: in many real-world knowledge discovery systems, data are drawn from various sources by different methods, and the assumption of uniform granularity over all attribute values cannot be expected to hold. Users may also be interested in rules at higher levels of abstraction. For example, instead of searching for a flight departing at 8:30AM, a user may want an answer about flights departing in the morning. One solution to these problems is to organize a multilevel attribute structure that represents the attribute values of a domain, so that values can be stored and accessed at several levels of granularity. Clearly, the attribute hierarchy can be part of the domain ontology [Raś and Dardzińska, 2004a], defined by a domain expert before data from the various sources are collected and a knowledge discovery method is applied. The knowledge discovery system should allow users to extract rules from any level in the hierarchy. To achieve this, a value transformation scheme must be established for transforming values from one level to another. In this chapter, we extend the notion of ERID to discover rules from an information system represented with hierarchical attributes. Assume that an information system S = (X, A, V) is a partially incomplete information system of type λ, and that a set of tree-like attribute hierarchies H_S is assigned to S, where h_a ∈ H_S represents all possible values of an attribute a ∈ A. If we denote a node in h_a by a_i, the set {a_ik : 1 ≤ k ≤ m} contains all the children of a_i, as shown

in Figure 5.2.

[Figure 5.2 shows a generic attribute tree: the attribute a at the root; values a1, a2, a3 at the next level; children a21, a22, ..., a2m2 and a31, a32, ..., a3m3 below them; and a311, a312, a313 at the lowest level. Nodes at the same depth form one level of attribute values.]

Figure 5.2: Possible values of an attribute by level

Many different combinations of attribute levels can be chosen for rule extraction. To extract rules at particular levels of interest in S, we need to transform attribute values before the rule extraction algorithm ERID is executed. In the following, we use the term generalization of a(x) for a transformation of a(x) to a node value on the path from a(x) to the root of the hierarchy, and specification for a transformation of a(x) to a node value on the path toward a leaf node. As defined, each attribute value in an incomplete information system is a value/weight pair (a(x), p). When attribute values are transformed, the new value and weight are interpreted as follows:

1. If a(x) is specialized, it is replaced by a null value. This means that a parent node is treated as a null value for any of its child nodes.
2. If a(x) is generalized, it is replaced by the node a_i ∈ h_a at the given level on the path. The weight of the new value is the sum of the weights of its children nodes; intermediate nodes along the path, if any, are computed in the same way. That is,

    p_{a(x)} = Σ p_{a(x)_ik},  where 1 ≤ k ≤ m and p_{a(x)_ik} ≥ λ.

Clearly, the root node of each tree is the attribute name itself, and it is equivalent to a null value; a null value assigned to an object is interpreted as all possible values of the attribute, with equal confidence assigned to each of them. Now, let L_H be the set of attribute levels to be used, λ1 the support, and λ2 the confidence value. ERID for hierarchical attributes is then written ERID-H(S, H_S, L_H, λ1, λ2). The user interface of the implementation is shown in Figure 5.3.

Figure 5.3: User interface for ERID-H

5.2 Chase Applicability

Chase applicability is also different from that in single-level attribute systems. Suppose that a knowledge base KB for S contains a set of rules. In order for the Chase algorithm to be applicable to S, it has to satisfy the following conditions


More information

Random Multiplication based Data Perturbation for Privacy Preserving Distributed Data Mining - 1

Random Multiplication based Data Perturbation for Privacy Preserving Distributed Data Mining - 1 Random Multiplication based Data Perturbation for Privacy Preserving Distributed Data Mining - 1 Prof. Ja-Ling Wu Dept. CSIE & GINM National Taiwan University Data and User privacy calls for well designed

More information

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM SPATIAL DATA MINING Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM INTRODUCTION The main difference between data mining in relational DBS and in spatial DBS is that attributes of the neighbors

More information

Differential Privacy and its Application in Aggregation

Differential Privacy and its Application in Aggregation Differential Privacy and its Application in Aggregation Part 1 Differential Privacy presenter: Le Chen Nanyang Technological University lechen0213@gmail.com October 5, 2013 Introduction Outline Introduction

More information

Inductive Learning. Inductive hypothesis h Hypothesis space H size H. Example set X. h: hypothesis that. size m Training set D

Inductive Learning. Inductive hypothesis h Hypothesis space H size H. Example set X. h: hypothesis that. size m Training set D Inductive Learning size m Training set D Inductive hypothesis h - - + - + + - - - + + - + - + + - - + + - p(x): probability that example x is picked + from X Example set X + - L Hypothesis space H size

More information

CHAPTER-17. Decision Tree Induction

CHAPTER-17. Decision Tree Induction CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 3: Query Processing Query Processing Decomposition Localization Optimization CS 347 Notes 3 2 Decomposition Same as in centralized system

More information

An Unconditionally Secure Protocol for Multi-Party Set Intersection

An Unconditionally Secure Protocol for Multi-Party Set Intersection An Unconditionally Secure Protocol for Multi-Party Set Intersection Ronghua Li 1,2 and Chuankun Wu 1 1 State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences,

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Jun Zhang Department of Computer Science University of Kentucky

Jun Zhang Department of Computer Science University of Kentucky Application i of Wavelets in Privacy-preserving Data Mining Jun Zhang Department of Computer Science University of Kentucky Outline Privacy-preserving in Collaborative Data Analysis Advantages of Wavelets

More information

Quantifying Privacy for Privacy Preserving Data Mining

Quantifying Privacy for Privacy Preserving Data Mining Quantifying Privacy for Privacy Preserving Data Mining Justin Zhan Carnegie Mellon University justinzh@rew.cmu.edu Abstract Data privacy is an important issue in data mining. How to protect respondents

More information

REDUCTS AND ROUGH SET ANALYSIS

REDUCTS AND ROUGH SET ANALYSIS REDUCTS AND ROUGH SET ANALYSIS A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES AND RESEARCH IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE UNIVERSITY

More information

Lecture 9 and 10: Malicious Security - GMW Compiler and Cut and Choose, OT Extension

Lecture 9 and 10: Malicious Security - GMW Compiler and Cut and Choose, OT Extension CS 294 Secure Computation February 16 and 18, 2016 Lecture 9 and 10: Malicious Security - GMW Compiler and Cut and Choose, OT Extension Instructor: Sanjam Garg Scribe: Alex Irpan 1 Overview Garbled circuits

More information

A PRIMER ON ROUGH SETS:

A PRIMER ON ROUGH SETS: A PRIMER ON ROUGH SETS: A NEW APPROACH TO DRAWING CONCLUSIONS FROM DATA Zdzisław Pawlak ABSTRACT Rough set theory is a new mathematical approach to vague and uncertain data analysis. This Article explains

More information

Lecture 9 - Symmetric Encryption

Lecture 9 - Symmetric Encryption 0368.4162: Introduction to Cryptography Ran Canetti Lecture 9 - Symmetric Encryption 29 December 2008 Fall 2008 Scribes: R. Levi, M. Rosen 1 Introduction Encryption, or guaranteeing secrecy of information,

More information

Differential Privacy

Differential Privacy CS 380S Differential Privacy Vitaly Shmatikov most slides from Adam Smith (Penn State) slide 1 Reading Assignment Dwork. Differential Privacy (invited talk at ICALP 2006). slide 2 Basic Setting DB= x 1

More information

Lecture Notes 17. Randomness: The verifier can toss coins and is allowed to err with some (small) probability if it is unlucky in its coin tosses.

Lecture Notes 17. Randomness: The verifier can toss coins and is allowed to err with some (small) probability if it is unlucky in its coin tosses. CS 221: Computational Complexity Prof. Salil Vadhan Lecture Notes 17 March 31, 2010 Scribe: Jonathan Ullman 1 Interactive Proofs ecall the definition of NP: L NP there exists a polynomial-time V and polynomial

More information

Privacy-Preserving Data Mining

Privacy-Preserving Data Mining CS 380S Privacy-Preserving Data Mining Vitaly Shmatikov slide 1 Reading Assignment Evfimievski, Gehrke, Srikant. Limiting Privacy Breaches in Privacy-Preserving Data Mining (PODS 2003). Blum, Dwork, McSherry,

More information

: Cryptography and Game Theory Ran Canetti and Alon Rosen. Lecture 8

: Cryptography and Game Theory Ran Canetti and Alon Rosen. Lecture 8 0368.4170: Cryptography and Game Theory Ran Canetti and Alon Rosen Lecture 8 December 9, 2009 Scribe: Naama Ben-Aroya Last Week 2 player zero-sum games (min-max) Mixed NE (existence, complexity) ɛ-ne Correlated

More information

R E A D : E S S E N T I A L S C R U M : A P R A C T I C A L G U I D E T O T H E M O S T P O P U L A R A G I L E P R O C E S S. C H.

R E A D : E S S E N T I A L S C R U M : A P R A C T I C A L G U I D E T O T H E M O S T P O P U L A R A G I L E P R O C E S S. C H. R E A D : E S S E N T I A L S C R U M : A P R A C T I C A L G U I D E T O T H E M O S T P O P U L A R A G I L E P R O C E S S. C H. 5 S O F T W A R E E N G I N E E R I N G B Y S O M M E R V I L L E S E

More information

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel

Lecture 2. Judging the Performance of Classifiers. Nitin R. Patel Lecture 2 Judging the Performance of Classifiers Nitin R. Patel 1 In this note we will examine the question of how to udge the usefulness of a classifier and how to compare different classifiers. Not only

More information

Anonymous Credential Schemes with Encrypted Attributes

Anonymous Credential Schemes with Encrypted Attributes Anonymous Credential Schemes with Encrypted Attributes Bart Mennink (K.U.Leuven) joint work with Jorge Guajardo (Philips Research) Berry Schoenmakers (TU Eindhoven) Conference on Cryptology And Network

More information

Quantization of Rough Set Based Attribute Reduction

Quantization of Rough Set Based Attribute Reduction A Journal of Software Engineering and Applications, 0, 5, 7 doi:46/sea05b0 Published Online Decemer 0 (http://wwwscirporg/ournal/sea) Quantization of Rough Set Based Reduction Bing Li *, Peng Tang, Tommy

More information

Combining Memory and Landmarks with Predictive State Representations

Combining Memory and Landmarks with Predictive State Representations Combining Memory and Landmarks with Predictive State Representations Michael R. James and Britton Wolfe and Satinder Singh Computer Science and Engineering University of Michigan {mrjames, bdwolfe, baveja}@umich.edu

More information

Introduction to Cryptography Lecture 13

Introduction to Cryptography Lecture 13 Introduction to Cryptography Lecture 13 Benny Pinkas June 5, 2011 Introduction to Cryptography, Benny Pinkas page 1 Electronic cash June 5, 2011 Introduction to Cryptography, Benny Pinkas page 2 Simple

More information

Should Non-Sensitive Attributes be Masked? Data Quality Implications of Data Perturbation in Regression Analysis

Should Non-Sensitive Attributes be Masked? Data Quality Implications of Data Perturbation in Regression Analysis Should Non-Sensitive Attributes be Masked? Data Quality Implications of Data Perturbation in Regression Analysis Sumitra Mukherjee Nova Southeastern University sumitra@scis.nova.edu Abstract Ensuring the

More information

Patrol: Revealing Zero-day Attack Paths through Network-wide System Object Dependencies

Patrol: Revealing Zero-day Attack Paths through Network-wide System Object Dependencies Patrol: Revealing Zero-day Attack Paths through Network-wide System Object Dependencies Jun Dai, Xiaoyan Sun, and Peng Liu College of Information Sciences and Technology Pennsylvania State University,

More information

Lecture 18 - Secret Sharing, Visual Cryptography, Distributed Signatures

Lecture 18 - Secret Sharing, Visual Cryptography, Distributed Signatures Lecture 18 - Secret Sharing, Visual Cryptography, Distributed Signatures Boaz Barak November 27, 2007 Quick review of homework 7 Existence of a CPA-secure public key encryption scheme such that oracle

More information

[Title removed for anonymity]

[Title removed for anonymity] [Title removed for anonymity] Graham Cormode graham@research.att.com Magda Procopiuc(AT&T) Divesh Srivastava(AT&T) Thanh Tran (UMass Amherst) 1 Introduction Privacy is a common theme in public discourse

More information

Notes on BAN Logic CSG 399. March 7, 2006

Notes on BAN Logic CSG 399. March 7, 2006 Notes on BAN Logic CSG 399 March 7, 2006 The wide-mouthed frog protocol, in a slightly different form, with only the first two messages, and time stamps: A S : A, {T a, B, K ab } Kas S B : {T s, A, K ab

More information

Privacy and Fault-Tolerance in Distributed Optimization. Nitin Vaidya University of Illinois at Urbana-Champaign

Privacy and Fault-Tolerance in Distributed Optimization. Nitin Vaidya University of Illinois at Urbana-Champaign Privacy and Fault-Tolerance in Distributed Optimization Nitin Vaidya University of Illinois at Urbana-Champaign Acknowledgements Shripad Gade Lili Su argmin x2x SX i=1 i f i (x) Applications g f i (x)

More information

From Secure MPC to Efficient Zero-Knowledge

From Secure MPC to Efficient Zero-Knowledge From Secure MPC to Efficient Zero-Knowledge David Wu March, 2017 The Complexity Class NP NP the class of problems that are efficiently verifiable a language L is in NP if there exists a polynomial-time

More information

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05 Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Sample vs population Example task with red and black cards Statistical terminology Permutation test and hypergeometric test Histogram on a sample vs population

More information

Knowledge representation DATA INFORMATION KNOWLEDGE WISDOM. Figure Relation ship between data, information knowledge and wisdom.

Knowledge representation DATA INFORMATION KNOWLEDGE WISDOM. Figure Relation ship between data, information knowledge and wisdom. Knowledge representation Introduction Knowledge is the progression that starts with data which s limited utility. Data when processed become information, information when interpreted or evaluated becomes

More information

Lecture 14: Secure Multiparty Computation

Lecture 14: Secure Multiparty Computation 600.641 Special Topics in Theoretical Cryptography 3/20/2007 Lecture 14: Secure Multiparty Computation Instructor: Susan Hohenberger Scribe: Adam McKibben 1 Overview Suppose a group of people want to determine

More information

Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification

Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification Syracuse University SURFACE Electrical Engineering and Computer Science LC Smith College of Engineering and Computer Science 1-1-2004 Privacy-Preserving Multivariate Statistical Analysis: Linear Regression

More information

CPSC 467: Cryptography and Computer Security

CPSC 467: Cryptography and Computer Security CPSC 467: Cryptography and Computer Security Michael J. Fischer Lecture 22 November 27, 2017 CPSC 467, Lecture 22 1/43 BBS Pseudorandom Sequence Generator Secret Splitting Shamir s Secret Splitting Scheme

More information

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

Mining Molecular Fragments: Finding Relevant Substructures of Molecules Mining Molecular Fragments: Finding Relevant Substructures of Molecules Christian Borgelt, Michael R. Berthold Proc. IEEE International Conference on Data Mining, 2002. ICDM 2002. Lecturers: Carlo Cagli

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

PRIVACY PRESERVING INFORMATION SHARING

PRIVACY PRESERVING INFORMATION SHARING PRIVACY PRESERVING INFORMATION SHARING A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

More information

Data-Driven Logical Reasoning

Data-Driven Logical Reasoning Data-Driven Logical Reasoning Claudia d Amato Volha Bryl, Luciano Serafini November 11, 2012 8 th International Workshop on Uncertainty Reasoning for the Semantic Web 11 th ISWC, Boston (MA), USA. Heterogeneous

More information

Solutions to the Mathematics Masters Examination

Solutions to the Mathematics Masters Examination Solutions to the Mathematics Masters Examination OPTION 4 Spring 2007 COMPUTER SCIENCE 2 5 PM NOTE: Any student whose answers require clarification may be required to submit to an oral examination. Each

More information

Pseudonym and Anonymous Credential Systems. Kyle Soska 4/13/2016

Pseudonym and Anonymous Credential Systems. Kyle Soska 4/13/2016 Pseudonym and Anonymous Credential Systems Kyle Soska 4/13/2016 Moving Past Encryption Encryption Does: Hide the contents of messages that are being communicated Provide tools for authenticating messages

More information

Information Flow on Directed Acyclic Graphs

Information Flow on Directed Acyclic Graphs Information Flow on Directed Acyclic Graphs Michael Donders, Sara Miner More, and Pavel Naumov Department of Mathematics and Computer Science McDaniel College, Westminster, Maryland 21157, USA {msd002,smore,pnaumov}@mcdaniel.edu

More information

Static Program Analysis

Static Program Analysis Static Program Analysis Xiangyu Zhang The slides are compiled from Alex Aiken s Michael D. Ernst s Sorin Lerner s A Scary Outline Type-based analysis Data-flow analysis Abstract interpretation Theorem

More information

Revisiting Cryptographic Accumulators, Additional Properties and Relations to other Primitives

Revisiting Cryptographic Accumulators, Additional Properties and Relations to other Primitives S C I E N C E P A S S I O N T E C H N O L O G Y Revisiting Cryptographic Accumulators, Additional Properties and Relations to other Primitives David Derler, Christian Hanser, and Daniel Slamanig, IAIK,

More information

Interpreting Low and High Order Rules: A Granular Computing Approach

Interpreting Low and High Order Rules: A Granular Computing Approach Interpreting Low and High Order Rules: A Granular Computing Approach Yiyu Yao, Bing Zhou and Yaohua Chen Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 E-mail:

More information

A Privacy Preserving Markov Model for Sequence Classification

A Privacy Preserving Markov Model for Sequence Classification A Privacy Preserving Markov Model for Sequence Classification Suxin Guo Department of Computer Science and Engineering SUNY at Buffalo Buffalo 14260 U.S.A. suxinguo@buffalo.edu Sheng Zhong State Key Laboratory

More information

SOBER Cryptanalysis. Daniel Bleichenbacher and Sarvar Patel Bell Laboratories Lucent Technologies

SOBER Cryptanalysis. Daniel Bleichenbacher and Sarvar Patel Bell Laboratories Lucent Technologies SOBER Cryptanalysis Daniel Bleichenbacher and Sarvar Patel {bleichen,sarvar}@lucent.com Bell Laboratories Lucent Technologies Abstract. SOBER is a new stream cipher that has recently been developed by

More information

Short Note: Naive Bayes Classifiers and Permanence of Ratios

Short Note: Naive Bayes Classifiers and Permanence of Ratios Short Note: Naive Bayes Classifiers and Permanence of Ratios Julián M. Ortiz (jmo1@ualberta.ca) Department of Civil & Environmental Engineering University of Alberta Abstract The assumption of permanence

More information

Lecture Notes 20: Zero-Knowledge Proofs

Lecture Notes 20: Zero-Knowledge Proofs CS 127/CSCI E-127: Introduction to Cryptography Prof. Salil Vadhan Fall 2013 Lecture Notes 20: Zero-Knowledge Proofs Reading. Katz-Lindell Ÿ14.6.0-14.6.4,14.7 1 Interactive Proofs Motivation: how can parties

More information

ANALYSIS OF PRIVACY-PRESERVING ELEMENT REDUCTION OF A MULTISET

ANALYSIS OF PRIVACY-PRESERVING ELEMENT REDUCTION OF A MULTISET J. Korean Math. Soc. 46 (2009), No. 1, pp. 59 69 ANALYSIS OF PRIVACY-PRESERVING ELEMENT REDUCTION OF A MULTISET Jae Hong Seo, HyoJin Yoon, Seongan Lim, Jung Hee Cheon, and Dowon Hong Abstract. The element

More information

Parts 3-6 are EXAMPLES for cse634

Parts 3-6 are EXAMPLES for cse634 1 Parts 3-6 are EXAMPLES for cse634 FINAL TEST CSE 352 ARTIFICIAL INTELLIGENCE Fall 2008 There are 6 pages in this exam. Please make sure you have all of them INTRODUCTION Philosophical AI Questions Q1.

More information

Multi-Party Privacy-Preserving Decision Trees for Arbitrarily Partitioned Data

Multi-Party Privacy-Preserving Decision Trees for Arbitrarily Partitioned Data INTERNATIONAL JOURNAL OF INTELLIGENT CONTROL AND SYSTEMS VOL. 12, NO. 4, DECEMBER 2007, 351-358 Multi-Party Privacy-Preserving Decision Trees for Arbitrarily Partitioned Data Shuguo HAN, and Wee Keong

More information

Foundations of Classification

Foundations of Classification Foundations of Classification J. T. Yao Y. Y. Yao and Y. Zhao Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 {jtyao, yyao, yanzhao}@cs.uregina.ca Summary. Classification

More information

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

More information

A Logical Formulation of the Granular Data Model

A Logical Formulation of the Granular Data Model 2008 IEEE International Conference on Data Mining Workshops A Logical Formulation of the Granular Data Model Tuan-Fang Fan Department of Computer Science and Information Engineering National Penghu University

More information

Benny Pinkas Bar Ilan University

Benny Pinkas Bar Ilan University Winter School on Bar-Ilan University, Israel 30/1/2011-1/2/2011 Bar-Ilan University Benny Pinkas Bar Ilan University 1 Extending OT [IKNP] Is fully simulatable Depends on a non-standard security assumption

More information

Locally Differentially Private Protocols for Frequency Estimation. Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha

Locally Differentially Private Protocols for Frequency Estimation. Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha Locally Differentially Private Protocols for Frequency Estimation Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha Differential Privacy Differential Privacy Classical setting Differential Privacy

More information

Improving the Reliability of Causal Discovery from Small Data Sets using the Argumentation Framework

Improving the Reliability of Causal Discovery from Small Data Sets using the Argumentation Framework Computer Science Technical Reports Computer Science -27 Improving the Reliability of Causal Discovery from Small Data Sets using the Argumentation Framework Facundo Bromberg Iowa State University Dimitris

More information

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

More information

Answering Many Queries with Differential Privacy

Answering Many Queries with Differential Privacy 6.889 New Developments in Cryptography May 6, 2011 Answering Many Queries with Differential Privacy Instructors: Shafi Goldwasser, Yael Kalai, Leo Reyzin, Boaz Barak, and Salil Vadhan Lecturer: Jonathan

More information

Banacha Warszawa Poland s:

Banacha Warszawa Poland  s: Chapter 12 Rough Sets and Rough Logic: A KDD Perspective Zdzis law Pawlak 1, Lech Polkowski 2, and Andrzej Skowron 3 1 Institute of Theoretical and Applied Informatics Polish Academy of Sciences Ba ltycka

More information

Verification of the TLS Handshake protocol

Verification of the TLS Handshake protocol Verification of the TLS Handshake protocol Carst Tankink (0569954), Pim Vullers (0575766) 20th May 2008 1 Introduction In this text, we will analyse the Transport Layer Security (TLS) handshake protocol.

More information

Guest Speaker. CS 416 Artificial Intelligence. First-order logic. Diagnostic Rules. Causal Rules. Causal Rules. Page 1

Guest Speaker. CS 416 Artificial Intelligence. First-order logic. Diagnostic Rules. Causal Rules. Causal Rules. Page 1 Page 1 Guest Speaker CS 416 Artificial Intelligence Lecture 13 First-Order Logic Chapter 8 Topics in Optimal Control, Minimax Control, and Game Theory March 28 th, 2 p.m. OLS 005 Onesimo Hernandez-Lerma

More information

1 Secure two-party computation

1 Secure two-party computation CSCI 5440: Cryptography Lecture 7 The Chinese University of Hong Kong, Spring 2018 26 and 27 February 2018 In the first half of the course we covered the basic cryptographic primitives that enable secure

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

1. Courses are either tough or boring. 2. Not all courses are boring. 3. Therefore there are tough courses. (Cx, Tx, Bx, )

1. Courses are either tough or boring. 2. Not all courses are boring. 3. Therefore there are tough courses. (Cx, Tx, Bx, ) Logic FOL Syntax FOL Rules (Copi) 1. Courses are either tough or boring. 2. Not all courses are boring. 3. Therefore there are tough courses. (Cx, Tx, Bx, ) Dealing with Time Translate into first-order

More information

CTR mode of operation

CTR mode of operation CSA E0 235: Cryptography 13 March, 2015 Dr Arpita Patra CTR mode of operation Divya and Sabareesh 1 Overview In this lecture, we formally prove that the counter mode of operation is secure against chosen-plaintext

More information

Removing trivial associations in association rule discovery

Removing trivial associations in association rule discovery Removing trivial associations in association rule discovery Geoffrey I. Webb and Songmao Zhang School of Computing and Mathematics, Deakin University Geelong, Victoria 3217, Australia Abstract Association

More information

Zero-Knowledge Against Quantum Attacks

Zero-Knowledge Against Quantum Attacks Zero-Knowledge Against Quantum Attacks John Watrous Department of Computer Science University of Calgary January 16, 2006 John Watrous (University of Calgary) Zero-Knowledge Against Quantum Attacks QIP

More information

Exploring Spatial Relationships for Knowledge Discovery in Spatial Data

Exploring Spatial Relationships for Knowledge Discovery in Spatial Data 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Exploring Spatial Relationships for Knowledge Discovery in Spatial Norazwin Buang

More information

CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing

CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing Hector Garcia-Molina Zoltan Gyongyi CS 347 Notes 03 1 Query Processing! Decomposition! Localization! Optimization CS 347

More information

The Road to Improving your GIS Data. An ebook by Geo-Comm, Inc.

The Road to Improving your GIS Data. An ebook by Geo-Comm, Inc. The Road to Improving your GIS Data An ebook by Geo-Comm, Inc. An individual observes another person that appears to be in need of emergency assistance and makes the decision to place a call to 9-1-1.

More information

Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur

Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur March 19, 2018 Modular Exponentiation Public key Cryptography March 19, 2018 Branch Prediction Attacks 2 / 54 Modular Exponentiation

More information