Learning Rules from Very Large Databases Using Rough Multisets


Chien-Chung Chan
Department of Computer Science
University of Akron
Akron, OH 44325-4003
chan@cs.uakron.edu

Abstract. This paper presents a mechanism called LERS-M for learning production rules from very large databases. It can be implemented using object-relational database systems, it can be used for distributed data mining, and it has a structure that matches well with parallel processing. LERS-M is based on rough multisets and is formulated using relational operations, with the objective of being tightly coupled with database systems. The underlying representation used by LERS-M is multiset decision tables, which are derived from information multisystems. In addition, it is shown that multiset decision tables provide a simple way to compute Dempster-Shafer's basic probability assignment functions from data sets.

1 Introduction

The development of computer technologies has provided many useful and efficient tools to produce, disseminate, store, and retrieve data in electronic form. As a consequence, ever-increasing streams of data are recorded in all types of databases. For example, in automated business activities, even simple transactions such as telephone calls, credit card charges, and items in shopping carts are typically recorded in databases. These data are potentially beneficial to enterprises, because they may be used for designing effective marketing and sales plans based on consumers' shopping patterns and preferences collectively recorded in the databases. From databases of credit card charges, patterns of fraudulent charges may be detected and, hence, preventive actions may be taken.

The raw data stored in databases are potentially lodes of useful information. In order to extract the ore, effective mining tools must be developed. The task of extracting useful information from data is not a new one. It has been a common interest in research areas such as statistical data analysis, machine learning, and pattern recognition. Traditional techniques developed in these areas are fundamental to the task, but they have limitations. For example, these tools usually assume that the collection of data is small enough to fit into the memory of a computer system so that it can be processed; this assumption no longer holds for very large databases. Another limitation is that these tools are usually applicable only to static data sets. However, most databases are updated frequently by large streams

of data. It is also typical that the databases of an enterprise are distributed in different locations. Issues and techniques related to finding useful information from distributed data need to be studied and developed.

There are three classical data mining problems: market basket analysis, clustering, and classification. Traditional machine learning systems are usually developed independently of database technology. One of the recent trends is to develop learning systems that are tightly coupled with relational or object-relational database systems for mining association rules and for mining tree classifiers [1], [2], [3], [4]. Due to the maturity of database technology, these systems are more portable and scalable than traditional systems, and they are easier to integrate with OLAP (On Line Analytical Processing) and data warehousing systems. Another trend is that more and more data are stored in distributed databases. Some distributed data mining systems have been developed [5]; however, not many have been tightly coupled with database system technology.

In this paper, we introduce a mechanism called LERS-M for learning production rules from very large databases. It can be implemented using object-relational database systems, it can be used for distributed data mining, and it has a structure that matches well with parallel processing. LERS-M is similar to the LERS family of learning programs [6], which is based on rough set theory [7], [8], [9]. The main differences are that LERS-M is based on rough multisets [10] and that it is formulated using relational operations, with the objective of being tightly coupled with database systems. The underlying representation used by LERS-M is multiset decision tables [11], which are derived from information multisystems [10]. In addition to facilitating the learning of rules, multiset decision tables can also be used to compute Dempster-Shafer's belief functions from data [12], [14]. The methodology developed here can be used to design learning systems for knowledge discovery from distributed databases and to develop distributed rule-based expert systems and decision support systems.

The paper is organized as follows. The problem addressed by this paper is formulated in Section 2. In Section 3, we review some related concepts. The concept of multiset decision tables and its properties are presented in Section 4. In Section 5, we present the LERS-M learning algorithm with an example and discussion. Conclusions are given in Section 6.

2 Problem Statements

In this paper we consider the problem of learning production rules from very large databases. For simplicity, a very large database is considered as a very large data table U defined by a finite nonempty set A of attributes. We assume that a very large data table can be stored in one single database or distributed over databases. By distributed databases, we mean that the data table U is divided into N smaller tables with sizes manageable by a database management system. In this abstraction, we do not consider the communication mechanisms used by a distributed database system, nor do we consider the costs of transferring data from one site to another.

Briefly speaking, the problem of inductive learning of production rules from examples is to generate descriptions or rules to characterize the logical implication

C → D from a collection U of examples, where C and D are sets of attributes used to describe the examples. The set C is called the set of condition attributes, and the set D is called the set of decision attributes. Usually, D is a singleton set, and the sets C and D are disjoint. The objective of learning is to find rules that can be used to predict the logical implication as accurately as possible when applied to new examples. The objective of this paper is to develop a mechanism for generating production rules that takes into account the following issues: (1) the implication C → D may be uncertain; (2) if the set U of examples is divided into N smaller sets, how to determine the implication C → D; and (3) the result can be implemented using object-relational database technology.

3 Related Concepts

In the following, we review the concepts of rough sets, information systems, decision tables, rough multisets, information multisystems, and partitions of boundary sets.

3.1 Rough Sets, Information Systems, and Decision Tables

The fundamental assumption of rough set theory is that objects from the domain are perceived only through the accessible information about them, that is, the values of attributes that can be evaluated on these objects. Objects with the same information are indiscernible. Consequently, the classification of objects is based on the accessible information about them, not on the objects themselves. The notion of information systems was introduced by Pawlak [8] to represent knowledge about objects in a domain. In this paper, we use a special case of information systems called decision tables, or data tables, to represent data sets. In a decision table there is a designated attribute called the decision attribute; the other attributes are called condition attributes. The decision attribute can be interpreted as a classification of objects in the domain given by an expert. Given a decision table, the values of the decision attribute determine a partition on U. The problem of learning rules from examples is to find a set of classification rules, using condition attributes, that produces the partition generated by the decision attribute.

An example of a decision table adapted from [13] is shown in Table 1, where the universe U consists of 28 objects or examples. The set of condition attributes is {A, B, C, E, F}, and D is the decision attribute with values 1, 2, and 3. The partition on U determined by the decision attribute D is

X1 = {1, 2, 4, 8, 10, 15, 22, 25},
X2 = {3, 5, 11, 12, 16, 18, 19, 21, 23, 24, 27},
X3 = {6, 7, 9, 13, 14, 17, 20, 26, 28},

where Xi is the set of objects whose value of attribute D is i, for i = 1, 2, and 3.

Table 1. Example of a decision table.

  U   A  B  C  E  F  D
  1   0  0  1  0  0  1
  2   1  1  1  0  0  1
  3   0  1  0  0  0  2
  4   1  0  0  0  1  1
  5   0  1  0  0  0  2
  6   1  0  0  0  1  3
  7   0  0  0  1  1  3
  8   1  1  1  1  1  1
  9   0  0  0  1  1  3
  10  0  0  1  0  0  1
  11  1  1  1  0  0  2
  12  1  1  1  1  1  2
  13  1  1  0  1  1  3
  14  1  1  0  0  1  3
  15  0  0  1  1  1  1
  16  1  1  0  1  1  2
  17  0  0  0  1  1  3
  18  0  0  0  0  0  2
  19  0  0  0  0  0  2
  20  1  1  1  0  0  3
  21  1  1  0  0  1  2
  22  0  0  1  0  1  1
  23  1  1  1  0  0  2
  24  0  0  1  1  1  2
  25  1  0  1  0  1  1
  26  1  0  1  0  1  3
  27  1  0  1  0  1  2
  28  1  1  1  1  0  3

Note that Table 1 is an inconsistent decision table. Both objects 8 and 12 have the same condition values (1, 1, 1, 1, 1), but their decision values are different: object 8 has decision value 1, while object 12 has decision value 2. Inconsistent data sets are also called noisy data sets. Such data sets are quite common in real-world situations, and this is an issue that must be addressed by machine learning algorithms. In the rough set approach, inconsistency is represented by the concepts of lower and upper approximations.

Let A = (U, R) be an approximation space, where U is a nonempty set of objects and R is an equivalence relation defined on U. Let X be a nonempty subset of U. Then the lower approximation of X by R in A is defined as RX = { e ∈ U : [e] ⊆ X } and the upper approximation of X by R in A is defined as R̄X = { e ∈ U : [e] ∩ X ≠ ∅ }, where [e] denotes the equivalence class containing e. The difference R̄X − RX is called the boundary set of X in A. A subset X of U is said to be R-definable in A if and only if RX = R̄X. The pair (RX, R̄X) defines a rough set in A, which is the family of subsets of U having the same lower and upper approximations RX and R̄X.

In terms of decision tables, the set of objects U together with the indiscernibility relation determined by the condition attributes A defines an approximation space (U, A). When a decision class Xi ⊆ U is inconsistent, it means that Xi is not A-definable. In this case, we can find classification rules from AXi and ĀXi. These rules are called certain rules and possible rules, respectively [16]. Thus, the rough set approach can be used to learn rules from both consistent and inconsistent examples [17], [18].
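To make these definitions concrete, the following is a minimal Python sketch (illustrative only, not part of the paper; names are hypothetical) that computes the lower and upper approximations of the decision classes from the equivalence classes induced by the condition attributes.

```python
from collections import defaultdict

def approximations(rows, condition, decision):
    """rows: list of dicts; returns {decision value: (lower, upper)} as sets of object ids."""
    # Equivalence classes of the indiscernibility relation induced by the condition attributes.
    classes = defaultdict(set)
    for oid, row in enumerate(rows, start=1):
        classes[tuple(row[a] for a in condition)].add(oid)
    # Decision classes X_i.
    concepts = defaultdict(set)
    for oid, row in enumerate(rows, start=1):
        concepts[row[decision]].add(oid)
    result = {}
    for d, X in concepts.items():
        lower = set().union(*(E for E in classes.values() if E <= X))
        upper = set().union(*(E for E in classes.values() if E & X))
        result[d] = (lower, upper)
    return result

# Three rows in the style of Table 1; the first two are inconsistent with each other.
rows = [
    {"A": 1, "B": 1, "C": 1, "E": 1, "F": 1, "D": 1},   # like object 8
    {"A": 1, "B": 1, "C": 1, "E": 1, "F": 1, "D": 2},   # like object 12
    {"A": 0, "B": 0, "C": 1, "E": 0, "F": 0, "D": 1},   # like object 10
]
print(approximations(rows, ["A", "B", "C", "E", "F"], "D"))
# {1: ({3}, {1, 2, 3}), 2: (set(), {1, 2})}
```

Certain rules for a class would then be induced from the lower approximation and possible rules from the upper approximation, in line with [16].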

3.2 Rough Multisets and Information Multisystems

The concepts of rough multisets and information multisystems were introduced by Grzymala-Busse [10]. The basic idea is to represent an information system using multisets [15]. Object identifiers, which are represented explicitly in an information system, are not represented in an information multisystem; thus, the resulting data tables are more compact. More precisely, an information multisystem is a triple S = (Q, V, Q~), where Q is a set of attributes, V is the union of the domains of the attributes in Q, and Q~ is a multirelation on the Cartesian product of the domains Vq for q ∈ Q.

In addition, the concepts of lower and upper approximations in rough sets are extended to multisets. Let M be a multiset, and let e be an element of M whose number of occurrences in M is w. The sub-multiset consisting of the w occurrences of e is denoted by [e]_M. Thus M may be represented as the union of all the [e]_M for e in M. A multiset [e]_M is called an elementary multiset in M; the empty multiset is elementary. A finite union of elementary multisets is called a definable multiset in M. Let X be a sub-multiset of M. Then the lower approximation of X in M is the multiset { e ∈ M : [e]_M ⊆ X } and the upper approximation of X in M is the multiset { e ∈ M : [e]_M ∩ X ≠ ∅ }, where the set operations are interpreted as multiset operations. A rough multiset in M is therefore the family of all sub-multisets of M having the same lower and upper approximations in M.

Let P be a subset of Q. The projection of Q~ onto P is the multirelation P~ obtained by deleting the columns corresponding to the attributes in Q − P. Note that Q~ and P~ have the same cardinality. Let X be a sub-multiset of P~. The P-lower approximation of X in S is the lower approximation of X in P~, and the P-upper approximation of X in S is the upper approximation of X in P~. A multiset X in P~ is P-definable in S iff its P-lower and P-upper approximations are equal.

A multipartition χ on a multiset X is a multiset {X1, X2, ..., Xn} of sub-multisets of X such that X1 + X2 + ... + Xn = X, where the sum of two multisets X and Y, denoted X + Y, is the multiset of all elements that are members of X or Y, the number of occurrences of each element e in X + Y being the sum of its numbers of occurrences in X and in Y.

Following [9], classifications are multipartitions on information multisystems generated with respect to subsets of attributes. Specifically, let S = (Q, V, Q~) be an information multisystem, and let A and B be subsets of Q with card(A) = i and card(B) = j. Let A~ be the projection of Q~ onto A. The subset B generates a multipartition B^A on A~ defined as follows: two i-tuples determined by A are in the same multiset X of B^A if and only if their associated j-tuples, determined by B, are equal. The multipartition B^A is called the classification on A~ generated by B.
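As a small illustration of these constructions (a sketch, not the paper's implementation; all names are hypothetical), the following Python fragment collapses a decision table into a multirelation by counting duplicate tuples and projects it onto a subset of attributes, which is how the multisystems of Tables 2 and 3 are obtained from Table 1.

```python
from collections import Counter

def to_multirelation(rows, attrs):
    """Collapse rows into a multirelation over attrs: tuple -> number of occurrences."""
    return Counter(tuple(r[a] for a in attrs) for r in rows)

def project(multirel, attrs, subset):
    """Project a multirelation onto subset (a sub-list of attrs); total cardinality is preserved."""
    idx = [attrs.index(a) for a in subset]
    out = Counter()
    for tup, w in multirel.items():
        out[tuple(tup[i] for i in idx)] += w
    return out

# A few rows in the style of Table 1 (condition attributes plus decision attribute D).
rows = [
    {"A": 0, "B": 0, "C": 0, "E": 0, "F": 0, "D": 2},
    {"A": 0, "B": 0, "C": 0, "E": 0, "F": 0, "D": 2},
    {"A": 0, "B": 0, "C": 1, "E": 0, "F": 0, "D": 1},
]
Q = ["A", "B", "C", "E", "F", "D"]
q_tilde = to_multirelation(rows, Q)       # multirelation Q~, as in Table 2
p_tilde = project(q_tilde, Q, Q[:-1])     # projection onto the condition attributes, as in Table 3
print(q_tilde)   # Counter({(0, 0, 0, 0, 0, 2): 2, (0, 0, 1, 0, 0, 1): 1})
print(p_tilde)   # Counter({(0, 0, 0, 0, 0): 2, (0, 0, 1, 0, 0): 1})
```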

Table 2. An information multisystem S.

  A  B  C  E  F  D  W
  0  0  0  0  0  2  2
  0  0  0  1  1  3  3
  0  0  1  0  0  1  2
  0  0  1  0  1  1  1
  0  0  1  1  1  1  1
  0  0  1  1  1  2  1
  0  1  0  0  0  2  2
  1  0  0  0  1  1  1
  1  0  0  0  1  3  1
  1  0  1  0  1  1  1
  1  0  1  0  1  2  1
  1  0  1  0  1  3  1
  1  1  0  0  1  2  1
  1  1  0  0  1  3  1
  1  1  0  1  1  2  1
  1  1  0  1  1  3  1
  1  1  1  0  0  1  1
  1  1  1  0  0  2  2
  1  1  1  0  0  3  1
  1  1  1  1  0  3  1
  1  1  1  1  1  1  1
  1  1  1  1  1  2  1

Table 2 shows a multirelation representation of the data table given in Table 1, where the number of occurrences of each row is given by the integer in the W column. The projection of the multirelation onto the set P of attributes {A, B, C, E, F} is shown in Table 3.

Table 3. An information multisystem P~.

  A  B  C  E  F  W
  0  0  0  0  0  2
  0  0  0  1  1  3
  0  0  1  0  0  2
  0  0  1  0  1  1
  0  0  1  1  1  2
  0  1  0  0  0  2
  1  0  0  0  1  2
  1  0  1  0  1  3
  1  1  0  0  1  2
  1  1  0  1  1  2
  1  1  1  0  0  4
  1  1  1  1  0  1
  1  1  1  1  1  2

Let X be a sub-multiset of P~ with elements shown in Table 4.

Table 4. A sub-multiset X of P~.

  A  B  C  E  F  W
  0  0  1  0  0  2
  1  1  1  0  0  1
  1  0  0  0  1  1
  1  1  1  1  1  1
  0  0  1  1  1  1
  0  0  1  0  1  1
  1  0  1  0  1  1

The P-lower and P-upper approximations of X in P~ are shown in Tables 5 and 6.

Table 5. P-lower approximation of X.

  A  B  C  E  F  W
  0  0  1  0  0  2
  0  0  1  0  1  1

Table 6. P-upper approximation of X.

  A  B  C  E  F  W
  0  0  1  0  0  2
  1  1  1  0  0  4
  1  0  0  0  1  2
  1  1  1  1  1  2
  0  0  1  1  1  2
  0  0  1  0  1  1
  1  0  1  0  1  3

The classification of P~ generated by attribute D in S consists of three sub-multisets, which are given in Tables 7, 8, and 9, corresponding to the cases D = 1, D = 2, and D = 3, respectively.

Table 7. Sub-multiset of the multipartition D^P with D = 1.

  A  B  C  E  F  W
  0  0  1  0  0  2
  0  0  1  0  1  1
  0  0  1  1  1  1
  1  0  0  0  1  1
  1  0  1  0  1  1
  1  1  1  0  0  1
  1  1  1  1  1  1

Table 8. Sub-multiset of the multipartition D^P with D = 2.

  A  B  C  E  F  W
  0  0  0  0  0  2
  0  0  1  1  1  1
  0  1  0  0  0  2
  1  0  1  0  1  1
  1  1  0  0  1  1
  1  1  0  1  1  1
  1  1  1  0  0  2
  1  1  1  1  1  1

Table 9. Sub-multiset of the multipartition D^P with D = 3.

  A  B  C  E  F  W
  0  0  0  1  1  3
  1  0  0  0  1  1
  1  0  1  0  1  1
  1  1  0  0  1  1
  1  1  0  1  1  1
  1  1  1  0  0  1
  1  1  1  1  0  1

3.3 Partition of Boundary Sets

The relationship between rough set theory and Dempster-Shafer's theory of evidence was first shown in [14] and further developed in [13]. The concept of the partition of boundary sets was introduced in [13]. The basic idea is to represent an expert's classification on a set of objects in terms of lower approximations and a partition on the boundary set. In information multisystems, boundary sets are represented by boundary multisets, defined as the difference between the upper and lower approximations of a multiset. Thus, the partition of a boundary set can be extended to a multipartition on a boundary multiset. The computation of this multipartition is discussed in the next section.

4 Multiset Decision Tables

4.1 Basic Concepts

The idea of multiset decision tables (MDT) was first informally introduced in [11]. We formalize the concept in the following. Let S = (Q = C ∪ D, V, Q~) be an information multisystem, where C is the set of condition attributes and D is the decision attribute. A multiset decision table is an ordered pair A = (C~, C^D), where C~ is the projection of Q~ onto C and C^D is the multipartition on D~ generated by C in S. We will call C~ the LHS (Left Hand Side) and C^D the RHS (Right Hand Side). Each sub-multiset in C^D is represented by two vectors: a Boolean bit-vector and an integer vector. A similar representational scheme has been used in [19], [20], [21]. The size of each vector is the number of values in the domain V_D of the decision attribute D. The entry of the Boolean bit-vector labeled Di is 1 iff the decision value Di occurs in the corresponding sub-multiset of C^D, and its number of occurrences is given in the integer-vector entry labeled wi.

The information multisystem of Table 2 is represented as a multiset decision table in Table 10 with C = {A, B, C, E, F} and decision attribute D. The Boolean vector is denoted by [D1, D2, D3], and the integer vector is denoted by [w1, w2, w3]. Note that W = w1 + w2 + w3 on each row.

Table 10. Example of MDT.

  A  B  C  E  F  W  D1 D2 D3  w1 w2 w3
  0  0  0  0  0  2  0  1  0   0  2  0
  0  0  0  1  1  3  0  0  1   0  0  3
  0  0  1  0  0  2  1  0  0   2  0  0
  0  0  1  0  1  1  1  0  0   1  0  0
  0  0  1  1  1  2  1  1  0   1  1  0
  0  1  0  0  0  2  0  1  0   0  2  0
  1  0  0  0  1  2  1  0  1   1  0  1
  1  0  1  0  1  3  1  1  1   1  1  1
  1  1  0  0  1  2  0  1  1   0  1  1
  1  1  0  1  1  2  0  1  1   0  1  1
  1  1  1  0  0  4  1  1  1   1  2  1
  1  1  1  1  0  1  0  0  1   0  0  1
  1  1  1  1  1  2  1  1  0   1  1  0

4.2 Properties of Multiset Decision Tables

Based on the multiset decision table representation, we can use relational operations on the table to compute the rough set concepts reviewed in Section 3. Let A be a multiset decision table. We show how to determine the lower and upper approximations of the decision classes and the partitions of boundary multisets from A. The lower approximation of Di in terms of the LHS columns is defined as the multiset where Di = 1 and W = wi, and the upper approximation of Di is defined as the multiset where Di = 1 and W ≥ wi, or simply where Di = 1 (since W ≥ wi always holds). The boundary multiset of Di is defined as the multiset where Di = 1 and W > wi. The multipartition of boundary multisets can be identified by an equivalence multirelation defined over the Boolean vector given by the decision-value columns D1, D2, and D3. Clearly, a row of a multiset decision table is in some boundary multiset if and only if the sum of D1, D2, and D3 on that row is greater than 1. Therefore, to compute the multipartition of boundary multisets, we first identify the rows with D1 + D2 + D3 > 1; the rows of the multirelation over D1, D2, and D3 then define the blocks of the multipartition of the boundary multisets. These computations are shown in the following example.

Example: Consider the decision class D1 in Table 10. The C-lower approximation of D1 is the multiset that satisfies D1 = 1 and W = w1; in table form we have:

Table 11. C-lower approximation of D1.

  A  B  C  E  F  W
  0  0  1  0  0  2
  0  0  1  0  1  1

The C-upper approximation of D1 is the multiset that satisfies D1 = 1; in table form we have:

Table 12. C-upper approximation of D1.

  A  B  C  E  F  W
  0  0  1  0  0  2
  0  0  1  0  1  1
  0  0  1  1  1  2
  1  0  0  0  1  2
  1  0  1  0  1  3
  1  1  1  0  0  4
  1  1  1  1  1  2

To determine the partition of boundary multisets, we use the following two steps.

Step 1. Identify the rows with D1 + D2 + D3 > 1; this gives the following multiset in table form:

Table 13. Elements in the boundary sets.

  A  B  C  E  F  W  D1 D2 D3
  0  0  1  1  1  2  1  1  0
  1  0  0  0  1  2  1  0  1
  1  0  1  0  1  3  1  1  1
  1  1  0  0  1  2  0  1  1
  1  1  0  1  1  2  0  1  1
  1  1  1  0  0  4  1  1  1
  1  1  1  1  1  2  1  1  0

Step 2. Grouping the above table in terms of D1, D2, and D3, we obtain the following blocks of the partition. Table 14 shows the block where D1 = 1, D2 = 1, and D3 = 0, i.e., (1 1 0):

Table 14. The block denoting D = {1, 2}.

  A  B  C  E  F  W  D1 D2 D3
  0  0  1  1  1  2  1  1  0
  1  1  1  1  1  2  1  1  0

Table 15 shows the block where D1 = 1, D2 = 0, and D3 = 1, i.e., (1 0 1):

Table 15. The block denoting D = {1, 3}.

  A  B  C  E  F  W  D1 D2 D3
  1  0  0  0  1  2  1  0  1

Table 16 shows the block where D1 = 0, D2 = 1, and D3 = 1, i.e., (0 1 1):

Table 16. The block denoting D = {2, 3}.

  A  B  C  E  F  W  D1 D2 D3
  1  1  0  0  1  2  0  1  1
  1  1  0  1  1  2  0  1  1

Table 17 shows the block where D1 = 1, D2 = 1, and D3 = 1, i.e., (1 1 1):

Table 17. The block denoting D = {1, 2, 3}.

  A  B  C  E  F  W  D1 D2 D3
  1  0  1  0  1  3  1  1  1
  1  1  1  0  0  4  1  1  1

From the above example, it is clear that an expert's classification on the decision attribute D can be obtained by grouping equal values over the columns D1, D2, and D3 and taking the sum over the W column in a multiset decision table. Based on this grouping-and-summing operation, we can derive a basic probability assignment (bpa) function as required in Dempster-Shafer theory for computing belief functions. This is shown in Table 18.

Table 18. Grouping over D1, D2, D3 and sum over W.

  D1 D2 D3  W
  1  0  0   3
  0  1  0   4
  0  0  1   4
  0  1  1   4
  1  0  1   2
  1  1  0   4
  1  1  1   7

Let Θ = {1, 2, 3}. Table 19 shows the basic probability assignment function derived from the information multisystem shown in Table 2. The computation is based on the partition of boundary multisets shown in Table 18.

Table 19. The bpa derived from Table 2.

  X     {1}   {2}   {3}   {1, 2}  {1, 3}  {2, 3}  {1, 2, 3}
  m(X)  3/28  4/28  4/28  4/28    2/28    4/28    7/28

5 Learning Rules From MDT

5.1 LERS-M (Learning Rules from Examples using Rough MultiSets)

In this section, we present an algorithm, LERS-M, for learning production rules from a database table based on multiset decision tables. A multiset decision table can be computed directly using typical SQL commands from a database table once the condition and decision attributes are specified. For efficiency reasons, we associate entries in an MDT with a sequence of integer numbers. This can be accomplished by using extensions to relational database management systems such as the UDF (User Defined Functions) and UDT (User Defined Data Types) available on IBM's DB2 [22].
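Before turning to the algorithm itself, the grouping-and-summing constructions of Section 4 can be sketched in a few lines of Python. This is only an illustrative in-memory sketch, not the paper's SQL/DB2 implementation, and all function and variable names are hypothetical. It builds one MDT entry per distinct condition tuple, applies the W = wi and Di = 1 tests for the lower and upper approximations, and derives the bpa by grouping decision bit patterns and summing W, as in Tables 18 and 19.

```python
from collections import defaultdict
from fractions import Fraction

def build_mdt(rows, condition, decision, dom):
    """One MDT entry per distinct condition tuple: (W, D-bit tuple, w-count tuple)."""
    counts = defaultdict(lambda: defaultdict(int))
    for r in rows:
        counts[tuple(r[a] for a in condition)][r[decision]] += 1
    mdt = {}
    for lhs, by_d in counts.items():
        w = tuple(by_d.get(d, 0) for d in dom)
        bits = tuple(1 if x > 0 else 0 for x in w)
        mdt[lhs] = (sum(w), bits, w)
    return mdt

def lower_upper(mdt, i):
    """C-lower/C-upper approximation of the i-th decision value (0-based index into dom)."""
    lower = {lhs: W for lhs, (W, bits, w) in mdt.items() if bits[i] == 1 and W == w[i]}
    upper = {lhs: W for lhs, (W, bits, w) in mdt.items() if bits[i] == 1}
    return lower, upper

def bpa(mdt, dom):
    """Basic probability assignment: group rows by decision bit pattern, sum W, normalize."""
    total = sum(W for W, _, _ in mdt.values())
    m = defaultdict(int)
    for W, bits, _ in mdt.values():
        focal = frozenset(d for d, b in zip(dom, bits) if b)
        m[focal] += W
    return {X: Fraction(w, total) for X, w in m.items()}

# Tiny data set in the style of Table 1: two inconsistent rows and one consistent row.
rows = [
    {"A": 1, "B": 1, "C": 1, "E": 1, "F": 1, "D": 1},
    {"A": 1, "B": 1, "C": 1, "E": 1, "F": 1, "D": 2},
    {"A": 0, "B": 0, "C": 1, "E": 0, "F": 0, "D": 1},
]
mdt = build_mdt(rows, ["A", "B", "C", "E", "F"], "D", dom=(1, 2, 3))
print(lower_upper(mdt, 0))   # lower: only the consistent tuple; upper: both tuples
print(bpa(mdt, (1, 2, 3)))   # {frozenset({1, 2}): 2/3, frozenset({1}): 1/3}
```

Run on all 28 rows of Table 1, the bpa function above should reproduce the m(X) values of Table 19, since it follows the same grouping-and-summing construction.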

The emphasis of this paper is on algorithms; implementation details will be covered elsewhere.

The basic idea of LERS-M is to generate a multiset decision table with a sequence of integer numbers. Then, for each value di of the decision attribute D, the upper approximation of di, UPPER(di), is computed, and a set of rules is generated for each UPPER(di). The algorithm LERS-M is given below; the details of rule generation are presented in Section 5.2.

procedure LERS-M
Inputs: a table S with condition attributes C1, C2, ..., Cn and decision attribute D.
Outputs: a set of production rules represented as a multiset data table.
begin
  Create a Multiset Decision Table (MDT) from S with sequence numbers;
  for each decision value di of D do
  begin
    find the upper approximation UPPER(di) of di;
    Generate rules for UPPER(di);
  end;
end;

5.2 Rule Generation Strategy

The basic idea of rule generation is to create an AVT (Attribute-Value pairs Table) containing all a-v pairs that appear in the set UPPER(di). Then, we partition the a-v pairs into different groups based on a grouping criterion such as degree of relevance, which is also used to rank the groups. The left hand sides of rules are identified by taking conjunctions of a-v pairs within the same group (intra-group conjuncts) and by taking natural joins over different groups (inter-group conjuncts). Strategies for generating and validating candidate conjuncts are encapsulated in a module called GenerateAndTestConjuncts. Once a set of valid conjuncts is identified, minimal conjuncts can be generated using the method of dropping conditions.

The process of rule generation is iterative. It starts with the set UPPER(di) as the initial TargetSet. In each iteration, a set of rules is generated, and the instances covered by the rule set are removed from the TargetSet. The process stops when all instances in UPPER(di) are covered by the generated rules. In LERS-M, the stopping condition is guaranteed by the fact that upper approximations are always definable in rough set theory.

The above strategy is presented in the following procedures RULE_GEN, GroupAVT, and GenerateAndTestConjuncts. A working example is given in the next section. Specifically, we have adopted the following notions. The extension of an a-v pair (a, v), denoted by [(a, v)], i.e., the set of instances covered by the a-v pair, is a subset of the sequence numbers in the original MDT. The extension of an a-v pair is encoded by a Boolean bit-vector. A conjunct is a nonempty finite set of a-v pairs. The extension of a conjunct is the intersection of the extensions of all the a-v pairs in the conjunct. Note that the extension of a group of conjuncts is the union of the extensions of all the conjuncts in the group, and the extension of an empty group of conjuncts is the empty set.
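Before the procedures themselves, the extension machinery just described can be sketched as follows (Python; names are illustrative rather than the paper's implementation, and extensions are shown as sets of sequence numbers instead of packed bit-vectors). The data are the LHS rows of Table 20, introduced in the example below, and the fragment reproduces the extension of the conjunct {(C, 1), (B, 0)} and checks its minimality by dropping conditions.

```python
ATTRS = ("A", "B", "C", "E", "F")
# The 13 LHS rows of Table 20, keyed by sequence number.
MDT = {
    1: (0, 0, 0, 0, 0), 2: (0, 0, 0, 1, 1), 3: (0, 0, 1, 0, 0), 4: (0, 0, 1, 0, 1),
    5: (0, 0, 1, 1, 1), 6: (0, 1, 0, 0, 0), 7: (1, 0, 0, 0, 1), 8: (1, 0, 1, 0, 1),
    9: (1, 1, 0, 0, 1), 10: (1, 1, 0, 1, 1), 11: (1, 1, 1, 0, 0), 12: (1, 1, 1, 1, 0),
    13: (1, 1, 1, 1, 1),
}

def extension(avp):
    """Extension of an a-v pair: the sequence numbers of MDT rows where attribute = value."""
    a, v = avp
    i = ATTRS.index(a)
    return {seq for seq, row in MDT.items() if row[i] == v}

def conjunct_extension(conjunct):
    """Extension of a conjunct (nonempty set of a-v pairs): intersection of pair extensions."""
    return set.intersection(*(extension(avp) for avp in conjunct))

def minimal_conjunct(conjunct, target):
    """Dropping conditions: remove a-v pairs as long as the extension stays inside target."""
    result = list(conjunct)
    for avp in list(result):
        trial = [p for p in result if p != avp]
        if trial and conjunct_extension(trial) <= target:
            result = trial
    return result

target = {3, 4, 5, 7, 8, 11, 13}                       # UPPER(D1), as in Table 21
print(conjunct_extension([("C", 1), ("B", 0)]))        # {3, 4, 5, 8}, a subset of the target
print(minimal_conjunct([("C", 1), ("B", 0)], target))  # neither pair can be dropped
```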

procedure RULE_GEN
Inputs: an upper approximation UPPER(di) of a decision value di, and an MDT.
Outputs: a set of rules for UPPER(di) represented as a multiset decision table.
begin
  TargetSet := UPPER(di);
  RuleSet := empty set;
  Select a grouping criterion G := degree of relevance;
  Create an a-v pair table AVT containing all a-v pairs that appear in UPPER(di);
  while TargetSet is not empty do
  begin
    AVT := GroupAVT(G, TargetSet);
    NewRules := GenerateAndTestConjuncts(AVT, UPPER(di));
    RuleSet := RuleSet + NewRules;
    TargetSet := TargetSet − [NewRules];
  end;
  minimalCover(RuleSet);
    /* apply the dropping-condition technique to remove redundant rules from RuleSet,
       linearly from the first rule to the last rule in the set */
end; // RULE_GEN

procedure GroupAVT
Inputs: a grouping criterion such as degree of relevance, and a subset of the upper approximation of a decision value di.
Outputs: a list of groups of equivalent a-v pairs relevant to the target set.
begin
  Initialize the AVT table to be empty;
  Select a subtable T from the target set where decision value = di;
  Create a query to get a vector of condition attributes from the subtable T;
  for each condition attribute do   /* generate distinct values for each condition attribute */
  begin
    Create a query string to select distinct values;
    for each distinct value do
    begin
      Create a query string to select the count of occurrences;
      relevance := count of occurrences;
      if (relevance > 0) then Add the condition-value pair to the AVT table;
    end; // for each distinct value
  end; // for each condition attribute
  Select the list of distinct values of the relevance column;
  Sort the list of distinct values in descending order;
  Use the list of distinct values to generate a list of groups of a-v pairs;
end; // GroupAVT

procedure GenerateAndTestConjuncts
Inputs: a list AVT of groups of equivalent a-v pairs and the upper approximation of a decision value di.
Outputs: a set of rules.
begin
  RuleList := empty;
  CarryOverList := empty;   // a list of groups of a-v pairs
  CandidateList := empty;   // a list of candidate conjuncts
  TargetSet := UPPER(di);
  // Generate candidate list
  repeat
    L := getNext(AVT);   // L is a list of equivalent a-v pairs
    if (L is empty) then break;
    if ([conjunct(L)] ⊆ TargetSet) then Add conjunct(L) to CandidateList;
      /* conjunct(L) returns the conjunction of all a-v pairs in L */
    if (CarryOverList is empty) then
      Add all a-v pairs in L to CarryOverList
    else
    begin
      FilterList := empty;
      Add join(CarryOverList, L) to FilterList;
        /* join creates new lists of a-v pairs by taking and joining one element each
           from CarryOverList and L */
      CarryOverList := empty;
      for each list in FilterList do
        if ([list] ⊆ TargetSet) then Add list to CandidateList
        else Add list to CarryOverList;
    end;
  until (CandidateList is not empty);
  // Test candidate list
  for each list in CandidateList do
  begin
    list := minimalConjunct(list);   /* apply dropping conditions to get a minimal list of a-v pairs */
    Add list to RuleList;
  end;
  return RuleList;
end; // GenerateAndTestConjuncts

5.3 Example

Consider the information multisystem in Table 2 as input to LERS-M. The result of generating an MDT with sequence numbers is shown in Table 20.

Table 20. MDT with sequence numbers.

  Seq  A  B  C  E  F  W  D1 D2 D3  w1 w2 w3
  1    0  0  0  0  0  2  0  1  0   0  2  0
  2    0  0  0  1  1  3  0  0  1   0  0  3
  3    0  0  1  0  0  2  1  0  0   2  0  0
  4    0  0  1  0  1  1  1  0  0   1  0  0
  5    0  0  1  1  1  2  1  1  0   1  1  0
  6    0  1  0  0  0  2  0  1  0   0  2  0
  7    1  0  0  0  1  2  1  0  1   1  0  1
  8    1  0  1  0  1  3  1  1  1   1  1  1
  9    1  1  0  0  1  2  0  1  1   0  1  1
  10   1  1  0  1  1  2  0  1  1   0  1  1
  11   1  1  1  0  0  4  1  1  1   1  2  1
  12   1  1  1  1  0  1  0  0  1   0  0  1
  13   1  1  1  1  1  2  1  1  0   1  1  0

The C-upper approximation of the class D = 1 is the sub-MDT shown in Table 21.

Table 21. Table of UPPER(D1).

  Seq  A  B  C  E  F  W  D1 D2 D3  w1 w2 w3
  3    0  0  1  0  0  2  1  0  0   2  0  0
  4    0  0  1  0  1  1  1  0  0   1  0  0
  5    0  0  1  1  1  2  1  1  0   1  1  0
  7    1  0  0  0  1  2  1  0  1   1  0  1
  8    1  0  1  0  1  3  1  1  1   1  1  1
  11   1  1  1  0  0  4  1  1  1   1  2  1
  13   1  1  1  1  1  2  1  1  0   1  1  0

The following shows how RULE_GEN generates rules for UPPER(D1). Table 22 shows the AVT table created by procedure GroupAVT before sorting is applied to generate the final list of groups of equivalent a-v pairs. The grouping criterion used is based on the size of the intersection between the extension of an a-v pair and the set UPPER(D1). Each entry in the Relevance column denotes the number of rows in the UPPER(D1) table matched by the a-v pair. For example, the relevance of (A, 0) being 3 means that there are three rows in UPPER(D1) that satisfy A = 0. The ranking of a-v pairs is based on maximum degree of relevance, i.e., a larger relevance number has higher priority. The ranks are ordered in ascending order, i.e., a smaller rank number has higher priority. The encoding of the extensions of a-v pairs in the AVT is shown in Table 23, and the target set UPPER(D1) = {3, 4, 5, 7, 8, 11, 13} is considered with the encoding (0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1).

Table 22. AVT table created from UPPER(D1).

  Name  Value  Relevance  Rank
  A     0      3          4
  A     1      4          3
  B     0      5          2
  B     1      2          5
  C     0      1          6
  C     1      6          1
  E     0      5          2
  E     1      2          5
  F     0      2          5
  F     1      5          2

Table 23. Extensions of a-v pairs encoded as Boolean bit-vectors.

  N  V  1  2  3  4  5  6  7  8  9  10 11 12 13
  A  0  1  1  1  1  1  1  0  0  0  0  0  0  0
  A  1  0  0  0  0  0  0  1  1  1  1  1  1  1
  B  0  1  1  1  1  1  0  1  1  0  0  0  0  0
  B  1  0  0  0  0  0  1  0  0  1  1  1  1  1
  C  0  1  1  0  0  0  1  1  0  1  1  0  0  0
  C  1  0  0  1  1  1  0  0  1  0  0  1  1  1
  E  0  1  0  1  1  0  1  1  1  1  0  1  0  0
  E  1  0  1  0  0  1  0  0  0  0  1  0  1  1
  F  0  1  0  1  0  0  1  0  0  0  0  1  1  0
  F  1  0  1  0  1  1  0  1  1  1  1  0  0  1

Based on the Rank column of the AVT table shown in Table 22, the a-v pairs are grouped into the following six groups, listed from higher to lower rank:

{(C, 1)}
{(B, 0), (E, 0), (F, 1)}
{(A, 1)}
{(A, 0)}
{(B, 1), (E, 1), (F, 0)}
{(C, 0)}

Candidate conjuncts are generated and tested by the GenerateAndTestConjuncts procedure based on the above list. The basic strategy is to generate the intra-group conjuncts first, followed by the inter-group conjuncts. The procedure proceeds sequentially, starting from the highest ranked group downward, and stops when at least one rule is found. The heuristic employed here is to try to find rules with maximum coverage of instances in UPPER(di). In our example, the first group contains only one a-v pair, (C, 1); therefore, there is no need to generate intra-group conjuncts. From Tables 21 and 23, we can see that [{(C, 1)}] is not a subset of UPPER(D1) (sequence number 12 is covered by (C, 1) but is not in the target set). Thus, an inter-group join is needed. In addition, the intra-group conjunct of the second group, {(B, 0), (E, 0), (F, 1)}, is also included in the candidate list. This results in the following list of candidate conjuncts, listed with their corresponding extensions:

[{(C, 1), (B, 0)}] = {3, 4, 5, 8}
[{(C, 1), (E, 0)}] = {3, 4, 8, 11}
[{(C, 1), (F, 1)}] = {4, 5, 8, 13}

[{(B, 0), (E, 0), (F, 1)}] = {4, 7, 8}

Following the generating stage, a testing stage is performed to identify valid conjuncts. Because all the conjuncts are valid, i.e., their extensions are subsets of UPPER(di), four new rules are found in this iteration.

The next step is to find minimal conjuncts using the dropping-condition method. Consider the conjunction {(B, 0), (E, 0), (F, 1)}. Dropping the a-v pair (B, 0) from the group, we have [{(E, 0), (F, 1)}] = {4, 7, 8, 9}, which is not a subset of the TargetSet {3, 4, 5, 7, 8, 11, 13}. Next, trying to drop the a-v pair (E, 0) from the group, we have [{(B, 0), (F, 1)}] = {2, 4, 5, 7, 8}, which is not a subset of the TargetSet {3, 4, 5, 7, 8, 11, 13}. Finally, trying to drop the a-v pair (F, 1) from the group, we have [{(B, 0), (E, 0)}] = {1, 3, 4, 7, 8}, which is not a subset of the TargetSet {3, 4, 5, 7, 8, 11, 13}. We can conclude that the conjunction {(B, 0), (E, 0), (F, 1)} contains no redundant a-v pairs, and it is a minimal conjunct. Similarly, it can be verified that the conjuncts {(C, 1), (B, 0)}, {(C, 1), (E, 0)}, and {(C, 1), (F, 1)} are minimal.

All minimal conjuncts found are added to the new rule set R. Thus, the extension [R] of the new rules is

[R] = [{(C, 1), (B, 0)}] + [{(C, 1), (E, 0)}] + [{(C, 1), (F, 1)}] + [{(B, 0), (E, 0), (F, 1)}] = {3, 4, 5, 7, 8, 11, 13}.

The target set is updated as follows:

TargetSet = {3, 4, 5, 7, 8, 11, 13} − [R] = empty set.

Therefore, we have found the rule set. The last step in procedure RULE_GEN is to remove redundant rules from the rule set. The basic idea is similar to finding minimal conjuncts: we try to remove one rule at a time and test whether the remaining rules still cover all examples of the target set. More specifically, we first try to remove the conjunct {(C, 1), (B, 0)} from the collection. Then we have

[R] = [{(C, 1), (E, 0)}] + [{(C, 1), (F, 1)}] + [{(B, 0), (E, 0), (F, 1)}] = {3, 4, 5, 7, 8, 11, 13} = TargetSet.

Therefore, the conjunct {(C, 1), (B, 0)} is redundant and is removed from the rule set. Next, we try to remove the conjunct {(C, 1), (E, 0)} from the rule set, and we have

[R] = [{(C, 1), (F, 1)}] + [{(B, 0), (E, 0), (F, 1)}] = {4, 5, 7, 8, 13} ≠ TargetSet.

Therefore, the conjunct {(C, 1), (E, 0)} is not redundant, and it is kept in the rule set. Similarly, it can be verified that both conjuncts {(C, 1), (F, 1)} and {(B, 0), (E, 0), (F, 1)} are not redundant. The resulting rule set is shown in Table 24, where w1, w2, and w3 are column sums extracted from the UPPER(D1) table of Table 21.

Table 24. Rules generated for UPPER(D1).

  A     B     C     E     F     D  w1 w2 w3
  null  null  1     0     null  1  5  3  2
  null  null  1     null  1     1  4  3  1
  null  0     null  0     1     1  3  1  2

The LERS-M algorithm tries to find only one minimal set of rules; it does not try to find all minimal sets of rules.

5.4 Discussion

There are several advantages to developing LERS-M using relational database technology. Relational database systems are highly optimized and scalable in dealing with large amounts of data, they are very portable, and they provide smooth integration with OLAP or data warehousing systems. However, one typical disadvantage of an SQL implementation is the extra computational overhead. Experiments are needed to identify the impact of this overhead on the performance of LERS-M.

When a database is very large, we can divide it into n smaller databases and run LERS-M on each small database. Similarly, this scheme can be applied to homogeneous distributed databases. To integrate the distributed answers provided by multiple LERS-M programs, we can take the sum over the numbers of occurrences (i.e., w1, w2, and w3 in the previous example) provided by the local LERS-M programs. When a single answer is desired, the decision value Di with the maximum sum of wi can be returned, or the entire vector of numbers of occurrences can be returned as the answer. It is possible to develop other inference mechanisms that make use of the numbers of occurrences when performing the task of classification.

Based on our discussion, there are two major parameters of LERS-M, namely, the grouping criterion and the generation of conjuncts. New criteria and heuristics based on numerical measures such as the Gini index and the entropy function may be used. In this paper, we have used the minimal length criterion for the generation of candidate conjuncts. The search strategy is not exhaustive, and it stops when at least one candidate conjunct is identified. There is room for developing more extensive and efficient strategies for generating candidate conjuncts.

The proposed algorithm is under implementation on IBM's DB2 database system running on Redhat Linux, with a web-based interface implemented using Java servlets and JSP. Performance evaluation and comparison to systems based on classical rough set methods will need further work.
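The integration scheme described above can be sketched in a few lines. Assuming each local LERS-M instance returns, for a matching rule, its occurrence vector (w1, w2, w3) as in Table 24 (this is an illustrative sketch with hypothetical names, not the paper's implementation), the answers are merged by component-wise summation, and a single decision is taken as the value with the maximum sum.

```python
def merge_answers(local_vectors):
    """Component-wise sum of occurrence vectors returned by local LERS-M instances."""
    return [sum(ws) for ws in zip(*local_vectors)]

def classify(merged, decision_values):
    """Single-answer mode: return the decision value with the maximum summed occurrences."""
    best = max(range(len(merged)), key=lambda i: merged[i])
    return decision_values[best]

# Occurrence vectors [w1, w2, w3] reported by three hypothetical local sites for one new case.
local = [[5, 3, 2], [4, 3, 1], [3, 1, 2]]
merged = merge_answers(local)                 # [12, 7, 5]
print(merged, classify(merged, (1, 2, 3)))    # [12, 7, 5] 1
```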

6 Conclusions

In this paper we have formulated the concept of multiset decision tables based on the concept of information multisystems. The concept is then used to develop an algorithm, LERS-M, for learning rules from databases. Based on the concept of the partition of boundary sets, we have shown that it is straightforward to compute basic probability assignment functions of Dempster-Shafer theory from multiset decision tables. A nice feature of multiset decision tables is that the sum over the numbers of occurrences of decision values can be used as a simple mechanism to integrate distributed answers. Developing LERS-M on top of relational database technology will make the system scalable and portable. Our next step is to evaluate the time and space complexities of LERS-M over very large data sets. It would be interesting to compare the SQL-based implementation to classical rough set methods for learning rules from very large data sets. In addition, we have considered only homogeneous data tables, which may be very large or distributed; generalization to multiple heterogeneous tables needs further work.

References

1. Sarawagi, S., S. Thomas, and R. Agrawal, Integrating association rule mining with relational database systems: alternatives and implications, Data Mining and Knowledge Discovery, 4, 89-125, (2000).
2. Agrawal, R. and K. Shim, Developing tightly-coupled data mining applications on a relational database system, Proc. of the 2nd Int. Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, (1996).
3. Wang, M., B. Iyer, and J.S. Vitter, Scalable mining for classification rules in relational databases, IDEAS, 58-67, (1998).
4. Fernández-Baizán, M.C., E. Menasalvas Ruiz, J.M. Peña Sánchez, and B. Pardo Pastrana, Integrating KDD algorithms and RDBMS code, Rough Sets and Current Trends in Computing, 210-213, (1998).
5. Stolfo, S., A. Prodromidis, S. Tselepis, W. Lee, W. Fan, and P. Chan, JAM: Java agents for meta-learning over distributed databases, Proc. Third Intl. Conf. Knowledge Discovery and Data Mining, 74-81, (1997).
6. Grzymala-Busse, J.W., The LERS family of learning systems based on rough sets, Proc. of the 3rd Midwest Artificial Intelligence and Cognitive Science Society Conference, Carbondale, IL, April 12-14, 103-107, (1991).
7. Pawlak, Z., Rough sets: basic notions, Int. J. of Computer and Information Science 11, 341-356, (1982).
8. Pawlak, Z., Rough sets and decision tables, Lecture Notes in Computer Science 208, 186-196, Springer-Verlag, Berlin, Heidelberg, (1985).
9. Pawlak, Z., J. Grzymala-Busse, R. Slowinski, and W. Ziarko, Rough sets, Communications of the ACM, Vol. 38, No. 11, 89-95, (1995).
10. Grzymala-Busse, J.W., Learning from examples based on rough multisets, Proc. of the 2nd Int. Symposium on Methodologies for Intelligent Systems, Charlotte, North Carolina, October 14-17, 325-332, (1987).
11. Chan, C.-C., Distributed incremental data mining from very large databases: a rough multiset approach, Proc. of the 5th World Multi-Conference on Systemics, Cybernetics and Informatics, SCI 2001, Orlando, Florida, July 22-25, 517-522, (2001).

12. Shafer, G., A Mathematical Theory of Evidence, Princeton University Press, Princeton, NJ, (1976).
13. Skowron, A. and J. Grzymala-Busse, From rough set theory to evidence theory, in Advances in the Dempster-Shafer Theory of Evidence, R.R. Yager, J. Kacprzyk, and M. Fedrizzi (eds.), John Wiley & Sons, New York, 193-236, (1994).
14. Grzymala-Busse, J.W., Rough set and Dempster-Shafer approaches to knowledge acquisition under uncertainty - a comparison, manuscript, (1987).
15. Knuth, D.E., The Art of Computer Programming, Vol. III: Sorting and Searching, Addison-Wesley, (1973).
16. Grzymala-Busse, J.W., Knowledge acquisition under uncertainty: a rough set approach, J. of Intelligent and Robotic Systems, Vol. 1, 3-16, (1988).
17. Chan, C.-C., Incremental learning of production rules from examples under uncertainty: a rough set approach, Int. J. of Software Engineering and Knowledge Engineering, Vol. 1, No. 4, 439-461, (1991).
18. Grzymala-Busse, J.W., Managing Uncertainty in Expert Systems, Morgan Kaufmann, San Mateo, CA, (1991).
19. Hu, X., T.Y. Lin, and E. Louie, Bitmap techniques for optimizing decision support queries and association rule algorithms, IDEAS, 34-43, (2003).
20. Kryszkiewicz, M., Rough set approach to rules generation from incomplete information systems, in The Encyclopedia of Computer Science and Technology, Vol. 44, Marcel Dekker, New York, 319-346, (2001).
21. Ślęzak, D., Various approaches to reasoning with frequency based decision reducts: a survey, in Rough Set Methods and Applications, L. Polkowski, S. Tsumoto, and T.Y. Lin (eds.), Physica-Verlag, Heidelberg, New York, (2000).
22. Chamberlin, D., A Complete Guide to DB2 Universal Database, Morgan Kaufmann Publishers, (1998).