
J. Comput. Sci. & Technol. Vol.16 No.1, Jan. 2001

Investigation on AQ11, ID3 and the Principle of Discernibility Matrix

WANG Jue, CUI Jia and ZHAO Kai
Institute of Automation, The Chinese Academy of Sciences, Beijing, P.R. China
E-mail: wangj@sunserver.ia.ac.cn
Received March 2, 2000; revised August 3, 2000.

Abstract  The principle of discernibility matrix serves as a tool to discuss and analyze two algorithms of traditional inductive machine learning, AQ11 and ID3. The results are: (1) AQ11 and its family can be completely specified by the principle of discernibility matrix; (2) ID3 can be partly, but not naturally, specified by the principle of discernibility matrix; and (3) the principle of discernibility matrix is employed to analyze the Cendrowska sample set, which exposes the theoretical weaknesses of the decision tree style of knowledge representation.

Keywords  rough set theory, principle of discernibility matrix, inductive machine learning

1 Introduction

Inductive machine learning is a classical problem of artificial intelligence, and many algorithms have been proposed for it over the past 20 years. Among them, AQ11 [1] and ID3 [2], designed by Michalski and Quinlan respectively, are the two most popular; many later algorithms derive from them, forming two families of inductive machine learning. By investigating these algorithms, we aim at interpreting very large databases, i.e., data mining. According to who is to understand the database, there are two categories: "understanding the database by computer" and "understanding the database by human" [3]. For "understanding the database by human", the outputs of inductive machine learning are easier for people to understand than those of other learning algorithms (such as neural networks). This seems to be a different motivation from its original intention: in fact, we would rather consider inductive machine learning a tool for analyzing symbolized data, so that new methods can be developed for the above task.

Although the principle of discernibility matrix [4] inherits the reduct theory of rough set theory, which is based on equivalence classes, the two are completely different in both algorithm and theory. It is therefore reasonable to view the principle of discernibility matrix as an independent reduct theory; this is why we add the word "principle" in front of "discernibility matrix", although it is still regarded as a part of rough set theory. From the viewpoint of traditional machine learning, that is, extracting rules from an information system and then using them to solve new problems by computer automatically, the principle of discernibility matrix and rough sets can be regarded as a new family of inductive machine learning. Yet we prefer to consider them a tool for analyzing symbolized data.

In this paper, the principle of discernibility matrix serves as a tool to analyze AQ11 and ID3, so as to help develop new data mining algorithms. For this purpose, we first have to answer a question: is the principle able to specify AQ11 and ID3? If the answer is "no", the discernibility matrix is only one more kind of learning method; otherwise, we have to provide the specifications of AQ11 and ID3 based on the discernibility matrix. In this paper, we first check the possibility of specifying AQ11 by the principle of discernibility matrix, and the answer is positive. Then we check the possibility of specifying ID3, and the answer is partly positive. After that, the principle of discernibility matrix is employed to investigate the weakness of the decision tree representation in ID3. (This research is partly supported by the National '863' High-Tech Programme, NKPSF, and the National Natural Science Foundation of China.)

2 Logical and Equivalence Class Specification in Inductive Machine Learning

In the history of inductive machine learning, most algorithms were described by either a logical or an equivalence class specification; AQ11 and ID3 are examples. In this section, we employ the principle of discernibility matrix proposed by Skowron to demonstrate that the equivalence class specification can completely describe the logical specification in inductive machine learning.

2.1 Discernibility Function and Discernibility Matrix

In order to find a more efficient reduct algorithm, Skowron first used "difference" rather than "equivalence" to specify equivalence classes in 1991. The principle he proposed is called the discernibility matrix. Further, Skowron defined the discernibility function, which implies the possibility of replacing the logical specification by the equivalence class one. The major principle of the discernibility matrix is as follows.

Definition 2.1 (Discernibility matrix, Skowron). Let $\langle U, A \rangle$ be an information system and let $M = \{t_q\}$ be a set, called the discernibility matrix of $\langle U, A \rangle$, such that
$$t_q = \{a : a(x_i) \neq a(x_j),\ x_i, x_j \in U,\ a \in A\}$$
where $t_q$ is called a discernibility element: it includes all attributes that discriminate between two samples in $U$.

Skowron further explored this idea with a Boolean function in conjunctive normal form (CNF), in which each clause represents a discernibility element.

Definition 2.2 (Discernibility function, Skowron). For an information system $\langle U, A \rangle$, a discernibility function $f$ is a Boolean function of $m$ Boolean variables $\alpha_1, \dots, \alpha_m$ corresponding to the attributes $a_1, \dots, a_m$ respectively, defined as
$$f(\alpha_1, \dots, \alpha_m) = t_1 \wedge \dots \wedge t_k$$
where $t_q = \bigvee \alpha_i$ such that $a_i \in t_q$. In this paper we do not distinguish $\alpha_i$ from $a_i$.

Since Skowron has proved that the discernibility matrix has the same ability as equivalence classes for reducts, it can be considered another dialect of the equivalence class specification. To ensure the independence of reducts, the absorptive law is employed in Skowron's proof; we present it here as a proposition.

Proposition 2.1 (Absorptive law). Let $F$ and $G$ be two logical formulas. Then $F \wedge (F \vee G) = F \vee (F \wedge G) = F$.

The distributive law (multiplication) and the absorptive law transform the discernibility function into a disjunctive normal form (DNF), which is exactly the reduct space of the given information system.(1) In fact, every clause in this DNF is a reduct.

(1) According to the definition in rough set theory, if the discernibility function is built against the whole discernibility matrix, the reduct is called an 'attribute reduct'; if it is built against a single column of the discernibility matrix, the reduct is called a 'value reduct'.
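To make Definitions 2.1 and 2.2 concrete, the following minimal sketch (ours, not part of the paper; the toy information system is an assumption) builds the discernibility elements and reduces the discernibility function to DNF with the distributive and absorptive laws:

```python
from itertools import combinations

def discernibility_elements(U, attrs):
    """Definition 2.1: one element t_q = {a : a(x_i) != a(x_j)}
    for every pair of samples x_i, x_j in U."""
    return [frozenset(a for a in attrs if x[a] != y[a])
            for x, y in combinations(U, 2)]

def absorb(clauses):
    """Absorptive law (Proposition 2.1): drop every clause that
    strictly contains another clause."""
    return [c for c in clauses if not any(d < c for d in clauses)]

def cnf_to_dnf(cnf):
    """Expand a CNF (a list of attribute sets) into DNF by the
    distributive law, absorbing after each step; every clause of
    the result is a reduct."""
    dnf = [frozenset()]
    for clause in absorb(cnf):
        dnf = absorb(list({term | {a} for term in dnf for a in clause}))
    return dnf

# Assumed toy information system <U, A>: sample -> {attribute: value}.
U = [{'a': 0, 'b': 0, 'c': 1},
     {'a': 0, 'b': 1, 'c': 0},
     {'a': 1, 'b': 1, 'c': 1}]
cnf = [t for t in discernibility_elements(U, ['a', 'b', 'c']) if t]
print(cnf_to_dnf(cnf))   # each printed set is an attribute reduct
```

For this toy table the pairwise elements are {b, c}, {a, b} and {a, c}, and the DNF yields the attribute reducts {a, b}, {a, c} and {b, c}: each of them discerns every pair of samples, and no single attribute does.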

2.2 Logical Specification and Equivalence Class Specification

Generally, the specification of inductive machine learning starts from a logical formula over all samples. The set of samples is divided into two parts, positive and negative, which are combined into a conjunctive formula: the disjunction of all positive samples conjoined with the negation ("non") of the disjunction of all negative samples. In the formula, each sample is represented as a conjunction of attribute-value pairs. This formula, called the logical formula on positive samples, is exactly the basis of AQ11.

Consider an information system $\langle U, A \cup D \rangle$, where $U/D = \{Y_1, \dots, Y_n\}$ is the partition of $U$ by the decision attributes $D$. Assume $Y_j$ is the set of positive samples, denoted by $E_p$; all other samples, in the $Y_q$ ($q = 1, \dots, n$, $q \neq j$), are negative samples, denoted by $E_n$. Then $E_p \wedge \sim E_n$ is the logical formula of the positive samples $E_p$ on the background of $E_n$.

In order to discuss the relationship between the logical and equivalence class specifications, we first consider the case of one positive sample and one negative sample. Let $e_i^+ \in E_p$ and $e_j^- \in E_n$ be a positive and a negative sample respectively. As described above, they form the logical formula
$$e_i^+ \wedge \sim e_j^- = \Big(\bigwedge_k [a_k = v_k]\Big) \wedge \Big(\bigvee_k [a_k \neq u_k]\Big)$$
where $a_k \in A$, and $v_k$ and $u_k$ are values of $a_k$. In what follows, the formula $e_i^+ \wedge \sim e_j^-$ is transformed into DNF with the distributive law; this is called the logical specification of inductive machine learning.

Can the equivalence class specification describe the logical one? To answer this question, a proposition from mathematical logic is needed.

Proposition 2.2. Let $F$ be a logical formula and let $\Lambda$ be inconsistent. Then $F \wedge \sim F = \Lambda$, $F \wedge \Lambda = \Lambda$, $F \vee \Lambda = F$.

For inductive machine learning, this proposition indicates that a clause in the DNF transformed from $e_i^+ \wedge \sim e_j^-$ is inconsistent if it contains a complementary pair of literals; such clauses can therefore be deleted from the DNF without affecting the truth of the formula. The idea is expressed precisely by the following lemma.

Lemma 2.1. Let $e_i^+ = \bigwedge_q \{[a_q = v_q]\}$ for all $a_q \in A$, and let $[a_p \neq u]$ be a literal, $a_p \in A$. Consider the conjunctive formula $C = e_i^+ \wedge [a_p \neq u]$. If $u = v_p$, then $C = \Lambda$; otherwise $C = e_i^+ \wedge [a_p \neq u]$, and in particular, if $e_i^+$ is valid, $C = [a_p \neq u]$.

Proof. Consider attribute $a_p$: both $[a_p = v_p]$ and $[a_p \neq u]$ occur in the conjunctive formula $C$. If $u = v_p$, then $[a_p = v_p]$ and $[a_p \neq u]$ are a complementary pair, and by Propositions 2.1 and 2.2, $C = \Lambda$; otherwise $C = e_i^+ \wedge [a_p \neq u]$, and in this case, if $e_i^+$ is valid, $C = [a_p \neq u]$.

For inductive machine learning this lemma implies that: (1) if an attribute has the same value in the positive sample as in the negative sample, i.e., $a(e_i^+) = a(e_j^-)$, there must be a complementary pair in the clause, which is therefore inconsistent; (2) if $e_i^+$ is valid, the formula $e_i^+ \wedge [a_p \neq u]$ can be expressed by $[a_p \neq u]$ when $a(e_i^+) \neq a(e_j^-)$. This lemma ensures that the equivalence class specification can describe the logical specification in inductive machine learning.

To prove this, we only need to demonstrate that the disjunctive formula transformed from $e_i^+ \wedge \sim e_j^-$ by the distributive law is an element of the discernibility matrix. This is stated as the following proposition.

Proposition 2.3. For inductive machine learning, the disjunctive formula transformed from $e_i^+ \wedge \sim e_j^-$ is the discernibility element between sample $i$ and sample $j$.

Proof. Let $e_i^+ = \bigwedge_q \{[a_q = v_q]\}$ and $\sim e_j^- = \bigvee_q \{[a_q \neq u_q]\}$ for all $a_q \in A$. The distributive law transforms $e_i^+ \wedge \sim e_j^-$ into a DNF in which each clause has the form $C_p = e_i^+ \wedge [a_p \neq u_p]$.

Assume $e_i^+$ is valid. According to Lemma 2.1, each clause is either $C_p = \Lambda$ or $C_p = [a_p \neq u_p]$. If $C_p = \Lambda$, it can be deleted from the DNF (Proposition 2.2). Thus we finally get a set of literals
$$R = \{[a_p \neq u_p] : C_p \neq \Lambda,\ p = 1, \dots, m\}.$$
In fact, the values $u_p$ in $R$ contribute nothing to the learning and can be ignored, so the above set can be rewritten as $\{a_p\}$, containing all attributes that satisfy $a_p(e_i^+) \neq a_p(e_j^-)$. By the definition of the discernibility matrix (Definition 2.1), $\{a_p\}$ is clearly a discernibility element. □

The above proposition demonstrates that the discernibility matrix can describe those machine learning algorithms that originally take logical specifications, such as AQ11. It implies that the simple "comparing" operation can be substituted for the complex logical operations in the implementation of machine learning algorithms. In addition, comparing a positive sample with all negative samples yields the set of discernibility elements for that positive sample. These elements form a conjunction, which is then transformed into a disjunction by the distributive and absorptive laws; the resulting DNF is exactly the reduct space of the given positive sample.

3 AQ11 Inductive Machine Learning [1]

AQ11 is a typical inductive machine learning algorithm with a logical specification, proposed by Michalski and his colleagues [1]. Following the discussion in Section 2, the equivalence class specification of AQ11 is described in this section in detail.

3.1 Logical Specification of AQ11

The principle of AQ11 can be briefly described as follows. Consider the logical formula
$$E_p \wedge \sim E_n = (e_1^+ \vee \dots \vee e_k^+) \wedge \sim(e_1^- \vee \dots \vee e_m^-) = (e_1^+ \wedge \sim e_1^- \wedge \dots \wedge \sim e_m^-) \vee \dots \vee (e_k^+ \wedge \sim e_1^- \wedge \dots \wedge \sim e_m^-) \quad (3.1)$$
where $e_i^+ \in E_p$ is a positive sample and $e_j^- \in E_n$ is a negative sample. Consider the term for one positive sample,
$$e_j^+ \wedge \sim e_1^- \wedge \dots \wedge \sim e_m^- \quad (3.2)$$
The distributive and absorptive laws, applied to (3.2) in sequence, generate the learning solution for the positive sample $e_j^+$. (3.2) can be rewritten as
$$(e_j^+ \wedge \sim e_1^-) \wedge \dots \wedge (e_j^+ \wedge \sim e_m^-) \quad (3.3)$$
Obviously, (3.3) and (3.2) are mathematically equivalent. Applying the distributive law to each $(e_j^+ \wedge \sim e_i^-)$ in (3.3) yields, by Proposition 2.3, all $m$ discernibility elements of the positive sample $e_j^+$. Thus, although (3.2) and (3.3) are mathematically equivalent, they imply different algorithms of inductive machine learning.
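Proposition 2.3 and formula (3.3) reduce the learning step for one positive sample to plain value comparisons. A minimal sketch of this equivalence class reading (ours, not the paper's implementation; it reuses cnf_to_dnf from the sketch in Subsection 2.1, and the samples are assumptions):

```python
def discernibility_element(e_pos, e_neg, attrs):
    """Proposition 2.3: after deleting inconsistent clauses, the DNF of
    e+ ∧ ~e- collapses to the set of attributes on which the two
    samples differ, i.e., their discernibility element."""
    return frozenset(a for a in attrs if e_pos[a] != e_neg[a])

def value_reducts(e_pos, negatives, attrs):
    """Formula (3.3), equivalence class style: one discernibility
    element per negative sample, then the distributive and absorptive
    laws (cnf_to_dnf) give the reduct space of the positive sample."""
    cnf = [discernibility_element(e_pos, e_neg, attrs) for e_neg in negatives]
    return cnf_to_dnf(cnf)

e_pos = {'a': 0, 'b': 0, 'c': 0}            # assumed positive sample
negatives = [{'a': 0, 'b': 1, 'c': 1},      # assumed negative samples
             {'a': 1, 'b': 0, 'c': 1}]
print(value_reducts(e_pos, negatives, ['a', 'b', 'c']))
# -> the value reducts {c} and {a, b}: [c = 0] alone, or [a = 0][b = 0],
#    separates e_pos from both negatives
```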

3.2 Equivalence Class Specification of AQ11

Since the operators in traditional AQ11 are based on the logical distributive and absorptive laws, AQ11 can be called a logical specification algorithm. We now discuss its equivalence class specification.

By comparing one positive sample with each negative sample, a set of discernibility elements is generated, each expressed as a set of attributes. The Cartesian product of all these elements builds the reduct space of the given information system; that is, every member of the product is a learning solution for the given positive sample on the background of all the negative samples. Such a solution is called a value reduct in rough set theory. The absorptive law is necessary for independent reducts; it can be stated as the following rule.

Rule 3.1. Let $\alpha$ and $\beta$ be the sets of atoms of two clauses in a DNF (or CNF). If $\alpha \subseteq \beta$, the clause related to $\beta$ is deleted from the DNF (or CNF).

The algorithm based on (3.3) first generates all discernibility elements and then finds a reduct in the reduct space. The whole procedure divides into two independent parts, computing all discernibility elements and finding a reduct, so if the goal is to find a single acceptable reduct, this algorithm performs well. The algorithm based on (3.2), on the other hand, does not explicitly separate "finding a reduct" from learning; its output is the whole reduct space rather than a single reduct, so it is not suitable for obtaining a single reduct.

3.3 AQ11 Based on Discernibility Matrix

Applying Proposition 2.3 to (3.3), we get $m$ discernibility elements for a given positive sample. Then Rule 3.1 is used to delete the absorbable elements. After that, we can construct a conjunctive Boolean function, denoted by $F(M)$, where $M$ is the set of remaining discernibility elements.(2) This is just the discernibility function (Definition 2.2) corresponding to (3.2). Then $F(M)$ is transformed into a DNF by the distributive and absorptive laws, denoted by $G(M)$; clearly $F(M) = G(M)$. According to the analysis in Subsection 3.2, $G(M)$ is the reduct space of the given positive sample, in which every clause is a value reduct in rough set theory.

(2) Here $M$ is not called the discernibility matrix because these discernibility elements are only involved in one column of the discernibility matrix; that is, they include exactly those elements generated by one positive sample.

Generally, for inductive machine learning it is adequate to find one acceptable reduct rather than the whole reduct space; therefore, the AQ11 algorithm specified by the discernibility matrix may adopt (3.3) instead of (3.2). Finally, note that the AQ11 algorithm based on the discernibility matrix has the same computational complexity as the traditional one; however, if the goal is to find one reduct instead of the whole reduct space, the traditional AQ11 algorithm is more complex than the new one.

3.4 Extension Matrix Method AE11

As early as 1986, the Chinese computer scientist Professor HONG Jiarong used (3.3) and the "comparing" operation as an alternative to the distributive law to improve the AQ11 algorithm. He named the method the "extension matrix algorithm" (AE11) [5]. The extension matrix for a positive sample is defined as follows; a code sketch follows the definition.

Definition 3.1 (Extension matrix, HONG Jiarong). Let $E_n = \{e_1^-, \dots, e_m^-\}$ be a set of negative samples and $e^+ \in E_p$ a positive sample. $H$ denotes a matrix $(h_{ij})$, called the extension matrix of $e^+$ on the background of $E_n$, such that
$$h_{ij} = \begin{cases} a_j(e_i^-), & \text{if } a_j(e^+) \neq a_j(e_i^-) \\ \Lambda, & \text{if } a_j(e^+) = a_j(e_i^-) \end{cases}$$
where $\Lambda$ is called the "dead element".
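Definition 3.1 can be sketched in a few lines; the mapping to a column of the discernibility matrix (Proposition 3.1 below) is then just a renaming of the live elements. The samples are assumptions, not data from [5]:

```python
DEAD = None   # stands for the dead element Λ

def extension_matrix(e_pos, negatives, attrs):
    """Definition 3.1: row i holds a_j(e_i^-) where the negative sample
    differs from e+, and the dead element Λ where they agree."""
    return [[e_neg[a] if e_neg[a] != e_pos[a] else DEAD for a in attrs]
            for e_neg in negatives]

def rows_as_discernibility_elements(H, attrs):
    """Replace every live element by its attribute name: each row of
    the extension matrix becomes a discernibility element."""
    return [frozenset(a for a, h in zip(attrs, row) if h is not DEAD)
            for row in H]

e_pos = {'a': 0, 'b': 0, 'c': 0}            # assumed positive sample
negatives = [{'a': 0, 'b': 1, 'c': 1},      # assumed negative samples
             {'a': 1, 'b': 0, 'c': 1}]
H = extension_matrix(e_pos, negatives, ['a', 'b', 'c'])
print(H)                                    # [[None, 1, 1], [1, None, 1]]
print(rows_as_discernibility_elements(H, ['a', 'b', 'c']))
```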

Just like the discernibility matrix, the extension matrix takes "comparing" as its operation. The main difference between the two methods lies in the matrix elements: in the discernibility matrix the elements are composed of attributes, while in the extension matrix they are values of negative samples. However, as the examples in [5] imply, those values are insignificant for the learning process; it is only necessary to decide which attributes should be retained. Furthermore, this distinction between the discernibility matrix and the extension matrix, i.e., retaining attributes versus values, is essential for reduct algorithms. In the extension matrix, what matters is the presence of the "dead element": it shows that the values at dead positions are meaningless for the positive sample, which is consistent with the absence of those attributes from a discernibility element.

It is not difficult to describe AE11 in terms of the principle of discernibility matrix. First, all "live elements" in the extension matrix are replaced by their corresponding attributes; the elements in one row then make up a discernibility element. Second, according to Definition 2.1, the extension matrix of a positive sample is just one column of the discernibility matrix. Therefore the solution space of AE11 is the Cartesian product of all discernibility elements in that column, which is consistent with the original method. The above discussion can be summarized as the following proposition.

Proposition 3.1. An extension matrix on a positive sample is just a column of the discernibility matrix.

The extension matrix is quite similar to the discernibility matrix, especially in that both adopt the equivalence class specification. Considering that the extension matrix was proposed before the discernibility matrix, we should say that the extension matrix is Professor HONG's contribution to inductive machine learning. Unfortunately, since he did not propose the key concept of the discernibility element, and therefore could not use the "deleting" operation in the reduct procedure, the extension matrix has many more restrictions than the discernibility matrix in both theory and applications.

Finally, neither AQ11 nor AE11 pays any attention to attribute reducts; in rough set terms, both provide value reducts. This is one of the reasons for the above discussion.

4 ID3 Inductive Machine Learning [2]

ID3 is another important inductive machine learning algorithm, proposed by Quinlan in 1986 [2]. It inherited the tree representation of CLS (Concept Learning Systems), which was first developed for natural language processing by Hunt in 1966 [6]. Since then, extensive study has been carried out in two directions, attribute selection strategies and incremental learning strategies, resulting in the ID family of inductive machine learning. The ID family now plays an important role not only in artificial intelligence but also in many related fields. In this section, we discuss how to specify the ID3 algorithm by the principle of discernibility matrix.

The ID3 algorithm has three characteristics: tree representation, entropy reduction and incremental learning techniques. Among them, incremental learning with the discernibility matrix is not discussed in this paper; it belongs to the discussion of independent reducts of very large databases, which involves more complex theoretical problems.

4.1 Description of Decision Tree by Equivalence Class Specification

Let $M$ be the discernibility matrix of an information system $\langle U, A \cup D \rangle$, $a \in A$. The modified discernibility element is defined as follows:
$$c_{ij} = \{a : \text{if } a(x_i) \neq a(x_j);\ (a, v) : \text{if } a(x_i) = a(x_j)\},\quad a \in A,\ x_i, x_j \in U,\ i \neq j,\ v = a(x_i) \quad (4.1)$$
The definition differs from the original one (Definition 2.1) in that a discernibility element includes not only the condition attribute $a$ when the values of $a$ for the two samples differ, but also the attribute-value pair $(a, v)$ when the values are the same.
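A one-line sketch of the modified discernibility element (4.1); the two samples are assumptions:

```python
def modified_discernibility_element(x, y, attrs):
    """Formula (4.1): keep the attribute a where x and y differ on a,
    and the attribute-value pair (a, a(x)) where they agree."""
    return {a if x[a] != y[a] else (a, x[a]) for a in attrs}

x = {'a': 0, 'b': 1}                 # hypothetical samples
y = {'a': 0, 'b': 2}
print(modified_discernibility_element(x, y, ['a', 'b']))   # {('a', 0), 'b'}
```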

4.2 Description of Information Entropy in Discernibility Matrix

Partition the sample set $U$ by a condition attribute $a$: $U/a = \{X_1, \dots, X_m\}$. Each block then divides $U$ into two parts, $X_j \in U/a$ and $U - X_j$. Card$(X)$ denotes the cardinality of a set $X$. In ID3, the information entropy of attribute $a$ is defined as
$$E_a = -\sum_j \,[\,p_j \log p_j + (1 - p_j) \log(1 - p_j)\,] \quad (4.2)$$
where $p_j = \mathrm{Card}(X_j)/\mathrm{Card}(U)$ and $1 - p_j = \mathrm{Card}(U - X_j)/\mathrm{Card}(U)$.

According to definition (4.1), if a discernibility element includes attribute $a$, there exist two samples $x$ and $y$ satisfying $x \in X_j$ and $y \notin X_j$. The number of discernibility elements including attribute $a$ must therefore be
$$\mathrm{Card}(N(a, X_j)) = \mathrm{Card}(X_j) \cdot \mathrm{Card}(U - X_j) \quad (4.3)$$
where
$$N(a, X_j) = \{t : t \cap \{a\} \neq \emptyset,\ t \in M,\ a(x) \neq a(y),\ x \in X_j,\ y \in U - X_j\} \quad (4.4)$$
is the set of discernibility elements between all samples in $X_j$ and all samples in $U - X_j$. Let $n = \mathrm{Card}(U)$ and $p_j = \mathrm{Card}(X_j)/\mathrm{Card}(U)$, so that $\mathrm{Card}(U - X_j)/\mathrm{Card}(U) = 1 - p_j$, and set $q = \mathrm{Card}(N(a, X_j))/n^2$. Then $q$ is a constant that can be counted directly from the discernibility matrix, and (4.3) can be rewritten as
$$p_j (1 - p_j) = q \quad (4.5)$$
Solving this equation gives the values of all the $p_j$ ($j = 1, 2, \dots, m$), from which the information entropy of attribute $a$ is computed by (4.2). Obviously, this equals the entropy computed directly from the sample set. Thus the attribute selection strategy based on the discernibility matrix is the same as that of ID3, which enables the discernibility matrix to construct the same decision tree as ID3 does.
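A small numeric sketch of (4.3)-(4.5), with toy counts of our own choosing: the number of discernibility elements mentioning attribute a fixes q, and solving the quadratic p(1 − p) = q recovers p_j. Both roots of the quadratic give the same entropy term, which is why the ambiguity between them is harmless:

```python
import math

def pj_from_pair_count(pair_count, n):
    """Invert (4.5): with q = pair_count / n^2, solve p(1 - p) = q.
    Returns both roots; they are p_j and 1 - p_j."""
    root = math.sqrt(1 - 4 * pair_count / n**2)
    return (1 - root) / 2, (1 + root) / 2

def entropy_term(p):
    """One summand of (4.2), base-2 logs; 0*log 0 is taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Toy counts: n = 8 samples, one block X_j with Card(X_j) = 3, so
# Card(N(a, X_j)) = 3 * 5 = 15 discernibility elements mention a.
p_lo, p_hi = pj_from_pair_count(15, 8)
print(p_lo, p_hi)                                 # 0.375 and 0.625
print(entropy_term(p_lo) == entropy_term(p_hi))   # True: same entropy term
```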

4.3 Decision Tree Construction Based on Discernibility Matrix

If $a(x_i) = a(x_j)$, the modified discernibility matrix defined by (4.1) has to keep the information $(a, v)$, because the attribute value has to be assigned to the branches of the decision tree representation. Let $x$ and $y$ be two samples, $x, y \in X_j \in U/a$. Clearly, attribute $a$ is not included in the discernibility element between $x$ and $y$, but $(a, v_j)$ is. The set of discernibility elements for $X_j$ is
$$P(a, v, X_j) = \{t : t \cap \{(a, v)\} \neq \emptyset,\ t \in M,\ v = a(x),\ x \in X_j\} \quad (4.6)$$
where $t$ is a discernibility element defined by (4.1). Let $X_j/D = \{Y_1, \dots, Y_e\}$.(3) If $\mathrm{Card}(X_j/D) = 1$, i.e., all samples in $X_j$ have the same value of the decision attribute $D$, there is no need for further division. We can now recursively define a decision tree based on the discernibility matrix.

Definition 4.1 (Decision tree based on discernibility matrix). Let $T$ be a decision tree generated from a discernibility matrix $M$. An attribute $a$ is selected as the root node of subtrees according to (4.2), denoted $T_0$. The branches stemming from node $a$ are assigned the values $v_j$, for $j = 1, \dots, m$. If $\mathrm{Card}(X_j/D) = 1$, the subtree connected with branch $j$ is null; otherwise, let $M = P(a, v_j, X_j)$ and select an attribute $b_j$ as the root of the new subtree, connected with node $a$ by branch $v_j$. Thus $T_0$ consists of node $a$ and the subtrees connected with $a$ by branches $v_1, \dots, v_m$ respectively.

(3) In fact, $U/\{a, D\}$ is obtained by further partitioning each $X_j \in U/a$ into $X_j/D$.

This definition gives the procedure for constructing a decision tree from the discernibility matrix. Since the node selection strategy is the same as ID3's, both procedures generate the same decision tree.

4.4 Discussion

Although the discernibility matrix can specify ID3, it has to be modified to fit the decision tree representation, that is, to keep the attribute values of positive samples. In Subsection 3.4 we pointed out that it is unnecessary to save the attribute values of negative samples, and that retaining attributes instead of their values is one of Skowron's important contributions. It may therefore be unsuitable to describe the ID3 algorithm by the principle directly. There is no difficulty, however, in using the principle to specify the tree representation if that representation is necessary for a domain, although such a tree may not be identical with the one generated by ID3.

5 Discussions about Decision Tree Representation

Although the decision tree has been widely used, some of its weaknesses cannot be ignored. One is the irrelevant, redundant literals in its rule outputs; as a direct result, more conditions have to be satisfied when solving new problems.

5.1 An Example

Consider the following set of samples, where a, b, c are condition attributes and d is the decision attribute. [The sample table is not recoverable from the source.] The decision tree formed by the ID3 algorithm is expressed by the following rules:
(1) if [b = 0][c = 0], then +;
(2) if [b = 0][c = 1], then −;
(3) if [b = 1], then +;
(4) if [b = 2], then −.
Consider rule (2), where [b = 0] is a redundant literal for the above sample set. If [b = 0] is deleted from rule (2), it becomes
(2') if [c = 1], then −,
and the new rule set is still consistent with the sample set. Now assume [b = 3][c = 1] is a new sample. It cannot be solved by any of (1)-(4), but with rule (2') the solution is '−'. The key problem is whether this solution is correct. If rule (2') classifies the new sample correctly, this at least demonstrates that the solution of ID3 cannot cover the new sample; the reason is the restriction of the tree representation, which keeps the conjunct [b = 0] in rule (2). On the other hand, if rule (2') cannot classify the new sample correctly, that is, the conclusion for [b = 3][c = 1] is in fact '+' instead of '−', then neither the decision tree nor the rule set can classify the new sample correctly; in this case a new rule, 'if [b = 3], then +', is needed.
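To make the example reproducible, here is a minimal ID3-style sketch over a hypothetical dataset chosen to be consistent with rules (1)-(4); since the original table is not recoverable, the eight rows below are our assumption, not the paper's data:

```python
import math
from collections import Counter

# Hypothetical rows (a, b, c -> d), consistent with rules (1)-(4).
DATA = [(0, 0, 0, '+'), (1, 0, 0, '+'), (0, 0, 1, '-'), (1, 0, 1, '-'),
        (0, 1, 0, '+'), (1, 1, 1, '+'), (0, 2, 0, '-'), (1, 2, 1, '-')]
ATTRS = {'a': 0, 'b': 1, 'c': 2}   # attribute name -> column index

def entropy(rows):
    """Shannon entropy of the decision column of `rows`."""
    counts = Counter(r[-1] for r in rows)
    return -sum(n / len(rows) * math.log2(n / len(rows))
                for n in counts.values())

def best_attribute(rows, attrs):
    """ID3's choice: the attribute minimizing expected entropy after the split."""
    def expected(at):
        blocks = Counter(r[attrs[at]] for r in rows)
        return sum(n / len(rows) *
                   entropy([r for r in rows if r[attrs[at]] == v])
                   for v, n in blocks.items())
    return min(attrs, key=expected)

def id3_rules(rows, attrs, path=()):
    """Grow the tree recursively; print each root-to-leaf path as a rule."""
    if len({r[-1] for r in rows}) == 1:
        cond = ''.join(f'[{a} = {v}]' for a, v in path) or 'always'
        print(f'if {cond}, then {rows[0][-1]}')
        return
    at = best_attribute(rows, attrs)
    rest = {k: i for k, i in attrs.items() if k != at}
    for v in sorted({r[attrs[at]] for r in rows}):
        id3_rules([r for r in rows if r[attrs[at]] == v], rest, path + ((at, v),))

id3_rules(DATA, ATTRS)   # prints rules (1)-(4): splits on b, then on c
```

Dropping [b = 0] from the printed rule (2) and re-checking the eight rows confirms that the pruned rule set stays consistent, which is exactly the redundancy the text discusses.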

5.2 Cendrowska Sample Set [7]

In 1988, Cendrowska pointed out: "The decision tree output of Quinlan's ID3 algorithm is one of its major weaknesses, ..., its use in expert systems frequently demands irrelevant information to be supplied" [7]. Accordingly, she suggested that a rule set is a more reasonable representation. It is interesting to compare reducts based on the discernibility matrix with those based on ID3, in terms of the sample set in [7]. Consider Example 5.2, which includes 24 samples, where a, b, c, d are condition attributes and $\delta$ is the decision attribute.

Example 5.2 (Ophthalmic optics) [7]. [The 24-sample table is not recoverable from the source.]

In [7], Cendrowska gave the decision tree solution produced by ID3, which can be rewritten as a set of rules, denoted R in this paper:
$R_1: d_2 \wedge c_2 \wedge b_2 \wedge a_1 \Rightarrow \delta_1$
$R_2: d_2 \wedge c_1 \wedge b_1 \wedge a_1 \Rightarrow \delta_2$
$R_3: d_2 \wedge c_1 \wedge b_1 \wedge a_2 \Rightarrow \delta_2$
$R_4: d_2 \wedge c_1 \wedge b_2 \Rightarrow \delta_2$
$R_5: d_2 \wedge c_2 \wedge b_1 \Rightarrow \delta_1$
$R_6: d_1 \Rightarrow \delta_3$
$R_7: d_2 \wedge c_2 \wedge b_2 \wedge a_2 \Rightarrow \delta_3$
$R_8: d_2 \wedge c_1 \wedge b_1 \wedge a_3 \Rightarrow \delta_3$
$R_9: d_2 \wedge c_2 \wedge b_2 \wedge a_3 \Rightarrow \delta_3$

We do not discuss the modified diagram in Cendrowska's paper; interested readers may refer to [7, 8]. Here we give the solution by discernibility matrix directly. First, the principle of discernibility matrix is employed to compute the attribute reduct. Since all four attributes are core attributes, there is one and only one attribute reduct, $\{a, b, c, d\}$, for this information system. To investigate this example in detail, all absorbed elements are deleted from the discernibility matrix. Six kinds of discernibility elements then remain, shown below with their corresponding samples (here, e.g., $\{ab, c, d\}$ abbreviates the three discernibility elements $\{a, b\}$, $\{c\}$ and $\{d\}$):

$\{ab, c, d\}$: samples 1, 5, 8
$\{a, c, d\}$: samples 2, 6, 7
$\{b, c, d\}$: samples 3, 4, 9
$\{d\}$: samples 10-15, 16, 21, 22
$\{ad, bd, cd\}$: samples 17, 19, 23
$\{a, b, c\}$: samples 18, 20, 24

These sets of discernibility elements are just the value reducts of the samples in rough set theory, except $\{ab, c, d\}$ and $\{ad, bd, cd\}$.

Samples 2, 6 and 7 have value reduct $\{a, c, d\}$; their rules are:
$Q_1: a_1 \wedge c_2 \wedge d_2 \Rightarrow \delta_1$ (sample 2)
$Q_2: a_1 \wedge c_1 \wedge d_2 \Rightarrow \delta_2$ (sample 6)
$Q_3: a_2 \wedge c_1 \wedge d_2 \Rightarrow \delta_2$ (sample 7)
Samples 3, 4 and 9 have value reduct $\{b, c, d\}$; their rules are:
$Q_4: b_1 \wedge c_2 \wedge d_2 \Rightarrow \delta_1$ (samples 3, 4)
$Q_5: b_2 \wedge c_1 \wedge d_2 \Rightarrow \delta_2$ (sample 9)
Samples 10-15, 16, 21 and 22 have value reduct $\{d\}$; their rule is:
$Q_6: d_1 \Rightarrow \delta_3$
Samples 18, 20 and 24 have value reduct $\{a, b, c\}$; their rules are:
$Q_7: a_2 \wedge b_2 \wedge c_2 \Rightarrow \delta_3$ (sample 18)
$Q_8: a_3 \wedge b_1 \wedge c_1 \Rightarrow \delta_3$ (sample 20)
$Q_9: a_3 \wedge b_2 \wedge c_2 \Rightarrow \delta_3$ (sample 24)

We denote this rule set by Q. Among the samples not mentioned above, 17, 19 and 23 have value reducts $\{d\}$ or $\{a, b, c\}$: if $\{d\}$ is adopted, the rule is the same as $Q_6$; otherwise it is the same as $Q_7$, $Q_8$ and $Q_9$ respectively. Similarly, samples 1, 5 and 8 have value reducts $\{a, c, d\}$ or $\{b, c, d\}$. For sample 1, if $\{a, c, d\}$ is adopted, the rule is the same as $Q_1$, otherwise the same as $Q_4$; for samples 5 and 8, the rules are $Q_2$ and $Q_3$ in the case of $\{a, c, d\}$, and $Q_5$ in the case of $\{b, c, d\}$. Therefore the samples whose discernibility element set is $\{ab, c, d\}$ or $\{ad, bd, cd\}$ generate no new rules. Because the discernibility element sets of the other samples consist only of core attributes, the rule set Q is unique for Cendrowska's samples.

5.3 Analysis on Cendrowska's Samples

It is interesting to compare the principle of discernibility matrix with ID3 on Cendrowska's sample set. Note that the discernibility matrix produces a unique solution, whereas the results of ID3 depend on the attribute selection strategy (different strategies lead to different solutions). First, compare the rules based on ID3 with those based on the discernibility matrix (a minus sign marks the literal absent from the discernibility matrix rule):

$R_j$ (ID3): $R_1$ | $R_2$ | $R_3$ | $R_4$ | $R_5$
$Q_j$ (discernibility matrix): $Q_1 = R_1$ | $Q_2 = R_2 - b_2$ | $Q_3 = R_3 - b_1$ | $Q_4 = R_4 - b_1$ | $Q_5 = R_5$

$R_j$ (ID3): $R_6$ | $R_7$ | $R_8$ | $R_9$
$Q_j$ (discernibility matrix): $Q_6 = R_6$ | $Q_7 = R_7 - d_2$ | $Q_8 = R_8 - d_2$ | $Q_9 = R_9 - d_2$

Here $R_j$ is a rule of the ID3 decision tree. $R_1$, $R_5$ and $R_6$ of ID3 do not differ from $Q_1$, $Q_5$ and $Q_6$ of the discernibility matrix, but $R_2$, $R_3$, $R_4$, $R_7$, $R_8$ and $R_9$ each have one more literal than the corresponding discernibility matrix rule. Since ID3 selects attribute d as the root node, d appears not only in rules $R_7$, $R_8$ and $R_9$ but also in all the other rules. Similarly, since b is embedded in branches, the rules of ID3 may carry a redundant attribute b compared with the corresponding discernibility matrix rules.

Is it possible to generate more concise rules by improving the attribute selection strategy of ID3? The answer is negative: whichever strategy ID3 chooses, it cannot produce more concise rules than the discernibility matrix, because the rule set Q cannot be represented by a tree structure for this sample set. It is not difficult to prove the uniqueness of the rule set Q for these samples. Theoretically, a reduct in rough sets has to satisfy the independence condition, which is not required of a tree structure; this is why ID3 produces more than one solution for this sample set.
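The expansion from discernibility element sets to value reducts can be checked mechanically. A sketch reusing cnf_to_dnf from Subsection 2.1 (the six element sets are exactly those listed above):

```python
# Each CNF is one of the six discernibility element sets above, e.g.
# {ab, c, d} abbreviates the elements {a, b}, {c} and {d}.
element_sets = {
    '{ab, c, d}':   [{'a', 'b'}, {'c'}, {'d'}],
    '{a, c, d}':    [{'a'}, {'c'}, {'d'}],
    '{b, c, d}':    [{'b'}, {'c'}, {'d'}],
    '{d}':          [{'d'}],
    '{ad, bd, cd}': [{'a', 'd'}, {'b', 'd'}, {'c', 'd'}],
    '{a, b, c}':    [{'a'}, {'b'}, {'c'}],
}
for name, cnf in element_sets.items():
    reducts = cnf_to_dnf([frozenset(t) for t in cnf])
    print(name, '->', sorted(sorted(r) for r in reducts))
# {ab, c, d}   -> [['a', 'c', 'd'], ['b', 'c', 'd']]
# {ad, bd, cd} -> [['a', 'b', 'c'], ['d']]
```

The two printed cases reproduce the paper's observation that samples 1, 5 and 8 choose between {a, c, d} and {b, c, d}, while samples 17, 19 and 23 choose between {d} and {a, b, c}.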

6 Discussions and Conclusions

There are in fact three parts in the rough set theory proposed by Pawlak [9]: (1) the roughness of knowledge, described by lower and upper approximations; (2) reduct theory; and (3) reasoning based on roughness. Pawlak's original motivation for presenting roughness was uncertainty reasoning, but recent work shows that uncertainty reasoning might be the least significant part of rough set theory. Instead, reduct theory may be the most important part, followed by roughness. The importance of roughness lies not in reasoning but in describing knowledge granulation at different degrees of coarseness [10]; in other words, roughness may be taken as a measurement of knowledge granulation. It is not necessary to build rough set theory on lower and upper approximations: we may abandon the concept of lower and upper approximations altogether and set up reduct theory on Skowron's discernibility matrix. Moreover, roughness and the other principal concepts of rough sets can be redefined in this framework; in other words, rough set theory can be set up independently on the principle of discernibility matrix. In addition, because the original intention of rough set theory has drifted away, "rough sets" has become a misleading term for novices and may lead research in the wrong direction; we still use it at present, in honor of its history.

The conclusions of this paper are: (1) the principle of discernibility matrix can completely describe AQ11 and its family; (2) the principle can describe ID3 partly, but not naturally; and (3) the principle, employed to analyze the Cendrowska sample set, shows the theoretical weaknesses of the decision tree style of knowledge representation.

In the AQ family, there are several other techniques that help solve the problems of very large databases. They are not discussed in this paper, because doing so requires new techniques and theory based on the discernibility matrix. There are two focuses in the ID family. One is solving very large database problems, in which the idea of the "window" is quite important for the reduct theory of the discernibility matrix. The other is strategies on bias, which are less important, because the reduct theory defined by Pawlak must satisfy the independence condition, and we have proved that this sometimes conflicts with the information entropy strategy.

Finally, we want to point out that, as early as 1989, two Chinese computer scientists, Profs. ZHANG Bo and ZHANG Ling, proposed concepts similar to rough sets in their book [11]. They suggested that the theory of classification could be set up on quotient spaces. In fact, the indiscernibility relation presented by Pawlak is just the quotient set, a special case of the quotient space, and even the specification is the same. In addition, the granulation theory proposed by Polkowski in 1998 [10] was discussed in their book too. Their contributions imply that topology might be an important mathematical tool for artificial intelligence.

Acknowledgments  The authors thank Professors LU Ruqian, ZHANG Bo and ZHANG Ling for their useful suggestions on earlier drafts of this paper. The authors are particularly grateful to Professor LU Ruqian for kindly providing Cendrowska's original paper; in fact, the analysis of Cendrowska's samples was primarily illuminated by his book "Artificial Intelligence". The authors were not personally acquainted with Professor HONG Jiarong; the first time we carefully studied his book "Inductive Machine Learning: Algorithms, Theory and Applications" was after his death. The extension matrix should be regarded as one of his contributions to machine learning. We also take this paper as a memorial to Professor HONG Jiarong.

References
[1] Michalski R S, Chilausky R L. Learning by being told and learning from examples: An experimental comparison of two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. Policy Analysis and Information Systems, 1980, 4.
[2] Quinlan J R. Induction of decision trees. Machine Learning, 1986, 1.
[3] Guo M, Wang J. Data mining and database knowledge discovery: Summary. Pattern Recognition and Artificial Intelligence, 1998, 11(3).
[4] Skowron A, Rauszer C. The discernibility matrices and functions in information systems. In Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory, Slowinski R (ed.), 1991.
[5] Hong J. Inductive Machine Learning: Algorithms, Theory and Applications. Scientific Press.
[6] Hunt E B, Marin J, Stone P J. Experiments in Induction. New York: Academic Press, 1966.
[7] Cendrowska J. PRISM: An algorithm for inducing modular rules. In Knowledge Acquisition Tools for Expert Systems, Boose J, Gaines B (eds.), Academic Press, 1988.
[8] Lu R. Artificial Intelligence. Beijing: Scientific Press.
[9] Pawlak Z. Rough Sets: Theoretical Aspects of Reasoning about Data. Dordrecht, Boston, London: Kluwer Academic Publishers, 1991.
[10] Polkowski L, Skowron A. Rough sets: Perspectives. In Rough Sets in Knowledge Discovery 1, Polkowski L, Skowron A (eds.), Physica-Verlag, 1998.
[11] Zhang B, Zhang L. Problem Solving: Theory and Application. Beijing: Tsinghua Univ. Press.

WANG Jue is a professor at the Institute of Automation, the Chinese Academy of Sciences, and an IEEE Senior Member. His interests include knowledge representation, artificial neural networks, genetic algorithms, multi-agent systems, machine learning and data mining.

CUI Jia received her B.S. degree from the University of Science and Technology of China. She is currently an M.S. candidate at the Institute of Automation, the Chinese Academy of Sciences. Her research interests are rough sets and association rules.

ZHAO Kai received his B.S. degree from the Beijing Institute of Technology in 1993, and his Ph.D. degree from the Institute of Automation, the Chinese Academy of Sciences. His research interests are adaptation systems, genetic programming and data mining.


More information

Describing Data Table with Best Decision

Describing Data Table with Best Decision Describing Data Table with Best Decision ANTS TORIM, REIN KUUSIK Department of Informatics Tallinn University of Technology Raja 15, 12618 Tallinn ESTONIA torim@staff.ttu.ee kuusik@cc.ttu.ee http://staff.ttu.ee/~torim

More information

A theory of modular and dynamic knowledge representation

A theory of modular and dynamic knowledge representation A theory of modular and dynamic knowledge representation Ján Šefránek Institute of Informatics, Comenius University, Mlynská dolina, 842 15 Bratislava, Slovakia, phone: (421-7) 6029 5436, e-mail: sefranek@fmph.uniba.sk

More information

A brief introduction to Logic. (slides from

A brief introduction to Logic. (slides from A brief introduction to Logic (slides from http://www.decision-procedures.org/) 1 A Brief Introduction to Logic - Outline Propositional Logic :Syntax Propositional Logic :Semantics Satisfiability and validity

More information

CS156: The Calculus of Computation

CS156: The Calculus of Computation CS156: The Calculus of Computation Zohar Manna Winter 2010 It is reasonable to hope that the relationship between computation and mathematical logic will be as fruitful in the next century as that between

More information

Rough Set Approach for Generation of Classification Rules for Jaundice

Rough Set Approach for Generation of Classification Rules for Jaundice Rough Set Approach for Generation of Classification Rules for Jaundice Sujogya Mishra 1, Shakti Prasad Mohanty 2, Sateesh Kumar Pradhan 3 1 Research scholar, Utkal University Bhubaneswar-751004, India

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

Home Page. Title Page. Page 1 of 35. Go Back. Full Screen. Close. Quit

Home Page. Title Page. Page 1 of 35. Go Back. Full Screen. Close. Quit JJ II J I Page 1 of 35 General Attribute Reduction of Formal Contexts Tong-Jun Li Zhejiang Ocean University, China litj@zjou.edu.cn September, 2011,University of Milano-Bicocca Page 2 of 35 Objective of

More information

Harvard CS 121 and CSCI E-121 Lecture 22: The P vs. NP Question and NP-completeness

Harvard CS 121 and CSCI E-121 Lecture 22: The P vs. NP Question and NP-completeness Harvard CS 121 and CSCI E-121 Lecture 22: The P vs. NP Question and NP-completeness Harry Lewis November 19, 2013 Reading: Sipser 7.4, 7.5. For culture : Computers and Intractability: A Guide to the Theory

More information

Rough Sets and Conflict Analysis

Rough Sets and Conflict Analysis Rough Sets and Conflict Analysis Zdzis law Pawlak and Andrzej Skowron 1 Institute of Mathematics, Warsaw University Banacha 2, 02-097 Warsaw, Poland skowron@mimuw.edu.pl Commemorating the life and work

More information

Rough Set Approaches for Discovery of Rules and Attribute Dependencies

Rough Set Approaches for Discovery of Rules and Attribute Dependencies Rough Set Approaches for Discovery of Rules and Attribute Dependencies Wojciech Ziarko Department of Computer Science University of Regina Regina, SK, S4S 0A2 Canada Abstract The article presents an elementary

More information

Decision Procedures for Satisfiability and Validity in Propositional Logic

Decision Procedures for Satisfiability and Validity in Propositional Logic Decision Procedures for Satisfiability and Validity in Propositional Logic Meghdad Ghari Institute for Research in Fundamental Sciences (IPM) School of Mathematics-Isfahan Branch Logic Group http://math.ipm.ac.ir/isfahan/logic-group.htm

More information

Pei Wang( 王培 ) Temple University, Philadelphia, USA

Pei Wang( 王培 ) Temple University, Philadelphia, USA Pei Wang( 王培 ) Temple University, Philadelphia, USA Artificial General Intelligence (AGI): a small research community in AI that believes Intelligence is a general-purpose capability Intelligence should

More information

The Fourth International Conference on Innovative Computing, Information and Control

The Fourth International Conference on Innovative Computing, Information and Control The Fourth International Conference on Innovative Computing, Information and Control December 7-9, 2009, Kaohsiung, Taiwan http://bit.kuas.edu.tw/~icic09 Dear Prof. Yann-Chang Huang, Thank you for your

More information

M chi h n i e n L e L arni n n i g Decision Trees Mac a h c i h n i e n e L e L a e r a ni n ng

M chi h n i e n L e L arni n n i g Decision Trees Mac a h c i h n i e n e L e L a e r a ni n ng 1 Decision Trees 2 Instances Describable by Attribute-Value Pairs Target Function Is Discrete Valued Disjunctive Hypothesis May Be Required Possibly Noisy Training Data Examples Equipment or medical diagnosis

More information

Parts 3-6 are EXAMPLES for cse634

Parts 3-6 are EXAMPLES for cse634 1 Parts 3-6 are EXAMPLES for cse634 FINAL TEST CSE 352 ARTIFICIAL INTELLIGENCE Fall 2008 There are 6 pages in this exam. Please make sure you have all of them INTRODUCTION Philosophical AI Questions Q1.

More information

Computers and Mathematics with Applications

Computers and Mathematics with Applications Computers and Mathematics with Applications 59 (2010) 431 436 Contents lists available at ScienceDirect Computers and Mathematics with Applications journal homepage: www.elsevier.com/locate/camwa A short

More information

METRIC BASED ATTRIBUTE REDUCTION IN DYNAMIC DECISION TABLES

METRIC BASED ATTRIBUTE REDUCTION IN DYNAMIC DECISION TABLES Annales Univ. Sci. Budapest., Sect. Comp. 42 2014 157 172 METRIC BASED ATTRIBUTE REDUCTION IN DYNAMIC DECISION TABLES János Demetrovics Budapest, Hungary Vu Duc Thi Ha Noi, Viet Nam Nguyen Long Giang Ha

More information

02 Propositional Logic

02 Propositional Logic SE 2F03 Fall 2005 02 Propositional Logic Instructor: W. M. Farmer Revised: 25 September 2005 1 What is Propositional Logic? Propositional logic is the study of the truth or falsehood of propositions or

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

Propositional Calculus: Formula Simplification, Essential Laws, Normal Forms

Propositional Calculus: Formula Simplification, Essential Laws, Normal Forms P Formula Simplification, Essential Laws, Normal Forms Lila Kari University of Waterloo P Formula Simplification, Essential Laws, Normal CS245, Forms Logic and Computation 1 / 26 Propositional calculus

More information

Bhubaneswar , India 2 Department of Mathematics, College of Engineering and

Bhubaneswar , India 2 Department of Mathematics, College of Engineering and www.ijcsi.org 136 ROUGH SET APPROACH TO GENERATE CLASSIFICATION RULES FOR DIABETES Sujogya Mishra 1, Shakti Prasad Mohanty 2, Sateesh Kumar Pradhan 3 1 Research scholar, Utkal University Bhubaneswar-751004,

More information

Reification of Boolean Logic

Reification of Boolean Logic 526 U1180 neural networks 1 Chapter 1 Reification of Boolean Logic The modern era of neural networks began with the pioneer work of McCulloch and Pitts (1943). McCulloch was a psychiatrist and neuroanatomist;

More information

Some remarks on conflict analysis

Some remarks on conflict analysis European Journal of Operational Research 166 (2005) 649 654 www.elsevier.com/locate/dsw Some remarks on conflict analysis Zdzisław Pawlak Warsaw School of Information Technology, ul. Newelska 6, 01 447

More information

Classes of Boolean Functions

Classes of Boolean Functions Classes of Boolean Functions Nader H. Bshouty Eyal Kushilevitz Abstract Here we give classes of Boolean functions that considered in COLT. Classes of Functions Here we introduce the basic classes of functions

More information

Foundations of Artificial Intelligence

Foundations of Artificial Intelligence Foundations of Artificial Intelligence 31. Propositional Logic: DPLL Algorithm Malte Helmert and Gabriele Röger University of Basel April 24, 2017 Propositional Logic: Overview Chapter overview: propositional

More information

An Approach to Classification Based on Fuzzy Association Rules

An Approach to Classification Based on Fuzzy Association Rules An Approach to Classification Based on Fuzzy Association Rules Zuoliang Chen, Guoqing Chen School of Economics and Management, Tsinghua University, Beijing 100084, P. R. China Abstract Classification based

More information

Uncertain Entailment and Modus Ponens in the Framework of Uncertain Logic

Uncertain Entailment and Modus Ponens in the Framework of Uncertain Logic Journal of Uncertain Systems Vol.3, No.4, pp.243-251, 2009 Online at: www.jus.org.uk Uncertain Entailment and Modus Ponens in the Framework of Uncertain Logic Baoding Liu Uncertainty Theory Laboratory

More information

Critical Reading of Optimization Methods for Logical Inference [1]

Critical Reading of Optimization Methods for Logical Inference [1] Critical Reading of Optimization Methods for Logical Inference [1] Undergraduate Research Internship Department of Management Sciences Fall 2007 Supervisor: Dr. Miguel Anjos UNIVERSITY OF WATERLOO Rajesh

More information

UNIVERSITY OF SURREY

UNIVERSITY OF SURREY UNIVERSITY OF SURREY B.Sc. Undergraduate Programmes in Computing B.Sc. Undergraduate Programmes in Mathematical Studies Level HE3 Examination MODULE CS364 Artificial Intelligence Time allowed: 2 hours

More information

1. Courses are either tough or boring. 2. Not all courses are boring. 3. Therefore there are tough courses. (Cx, Tx, Bx, )

1. Courses are either tough or boring. 2. Not all courses are boring. 3. Therefore there are tough courses. (Cx, Tx, Bx, ) Logic FOL Syntax FOL Rules (Copi) 1. Courses are either tough or boring. 2. Not all courses are boring. 3. Therefore there are tough courses. (Cx, Tx, Bx, ) Dealing with Time Translate into first-order

More information

Clausal Presentation of Theories in Deduction Modulo

Clausal Presentation of Theories in Deduction Modulo Gao JH. Clausal presentation of theories in deduction modulo. JOURNAL OF COMPUTER SCIENCE AND TECHNOL- OGY 28(6): 1085 1096 Nov. 2013. DOI 10.1007/s11390-013-1399-0 Clausal Presentation of Theories in

More information

Abstract. In this paper we present a query answering system for solving non-standard

Abstract. In this paper we present a query answering system for solving non-standard Answering Non-Standard Queries in Distributed Knowledge-Based Systems Zbigniew W. Ras University of North Carolina Department of Comp. Science Charlotte, N.C. 28223, USA ras@uncc.edu Abstract In this paper

More information

Design of Logic-based Intelligent Systems Lecture 1: Introduction Part 1. K. Truemper University of Texas at Dallas

Design of Logic-based Intelligent Systems Lecture 1: Introduction Part 1. K. Truemper University of Texas at Dallas Design of Logic-based Intelligent Systems Lecture 1: Introduction Part 1 K. Truemper University of Texas at Dallas Acknowledgements 2 Acknowledgements (Names in Alphabetical Order) Organizations: Alexander

More information

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS 188 Fall 2015 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

More information