Minimal Attribute Space Bias for Attribute Reduction

Fan Min, Xianghui Du, Hang Qiu, and Qihe Liu

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
{minfan, xianghd, qiuhang,

Abstract. Attribute reduction is an important inductive learning issue addressed by the Rough Sets community. Most existing works on this issue use the minimal attribute bias, i.e., they search for reducts with the minimal number of attributes. But this bias does not work well for datasets where different attributes have domains of different sizes. In this paper, we propose a more reasonable strategy called the minimal attribute space bias, i.e., searching for reducts that minimize the product of the attribute domain sizes. In most cases, this bias helps to obtain reduced decision tables with the best space coverage, and is thus helpful for obtaining small rule sets with good predicting performance. An empirical study on several datasets validates our analysis.

Keywords. Attribute reduction, bias, space coverage, rule set.

1 Introduction

Practical machine learning algorithms are known to degrade in performance (prediction accuracy) when faced with many attributes that are not necessary for rule discovery [1]. It is therefore not surprising that much research has been carried out on attribute reduction [2], particularly by the Rough Sets community. A reduct is a subset of attributes that is jointly sufficient and individually necessary for preserving the same information (in terms of the positive region [3], class distribution [4], among others) as provided by the entire set of attributes [5].

A commonly used reduct selection/construction strategy, called the minimal attribute bias [1][3][6], is to search for a reduct with the minimal number of attributes. In some cases, especially when different attributes have approximately the same domain size, this bias may be helpful for obtaining small rule sets with good performance. However, for real-world data where attribute domain sizes vary, this strategy is unfair, since attributes with larger domains tend to have better discernibility or other significance measures [7], and it has severe implications when applied blindly, without regard for the resulting induced concept [1].

To cope with these problems, in this paper we propose a new bias called the minimal attribute space bias, which is intended to minimize the attribute space. We argue that this bias is more reasonable, and thus more helpful for obtaining small

rule sets, than the minimal attribute bias. An empirical study on some datasets from the UCI library [8] validates our analysis.

2 Preliminaries

In this section we review the basic concepts introduced by Pawlak [3] through an example. Formally, a decision table is a triple S = (U, C, {d}) where d ∉ C is the decision attribute and the elements of C are called conditional attributes, or simply conditions. Table 1 lists a decision table where U = {t1, ..., t8}, C = {Shape, Material, Weight, Color} and d = Size.

Table 1. An exemplary decision table

Toy  Shape     Material  Weight  Color   Size
t1   round     wood      light   red     small
t2   round     plastic   heavy   black   small
t3   round     wood      heavy   white   large
t4   round     wood      light   white   small
t5   triangle  wood      light   green   small
t6   triangle  plastic   heavy   blue    large
t7   triangle  plastic   light   pink    large
t8   triangle  plastic   heavy   yellow  large

Any B ⊆ C ∪ {d} determines a binary relation I(B) on U, called an indiscernibility relation, defined as follows:

I(B) = {(x_i, x_j) ∈ U × U | ∀a ∈ B, a(x_i) = a(x_j)},   (1)

where a(x) denotes the value of attribute a for element x. The partition determined by B is denoted by U/I(B), or simply by U/B. Let B(X) denote the B-lower approximation of X; the positive region of {d} with respect to B ⊆ C is defined as POS_B({d}) = ∪_{X ∈ U/{d}} B(X).

A reduct is a minimal subset of attributes that enables the same classification of the elements of the universe as the whole set of attributes. This can be formally defined as follows:

Definition 1. Any B ⊆ C is called a decision relative reduct of S = (U, C, {d}) iff:
1. POS_B({d}) = POS_C({d}), and
2. ∀a ∈ B, POS_{B−{a}}({d}) ≠ POS_C({d}).

A decision relative reduct can be simply called a relative reduct, or a reduct for brevity if the decision attribute is obvious from the context. According to Definition 1, the exemplary decision table has two reducts: R1 = {Shape, Material, Weight} and R2 = {Weight, Color}.
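To make these notions concrete, here is a minimal Python sketch (ours, not part of the paper; all names are illustrative) that encodes Table 1, computes partitions and positive regions, and enumerates the reducts of Definition 1 by brute force:

```python
# A brute-force sketch of Pawlak's notions on Table 1 (illustrative only).
from itertools import combinations

U = {
    "t1": {"Shape": "round",    "Material": "wood",    "Weight": "light", "Color": "red",    "Size": "small"},
    "t2": {"Shape": "round",    "Material": "plastic", "Weight": "heavy", "Color": "black",  "Size": "small"},
    "t3": {"Shape": "round",    "Material": "wood",    "Weight": "heavy", "Color": "white",  "Size": "large"},
    "t4": {"Shape": "round",    "Material": "wood",    "Weight": "light", "Color": "white",  "Size": "small"},
    "t5": {"Shape": "triangle", "Material": "wood",    "Weight": "light", "Color": "green",  "Size": "small"},
    "t6": {"Shape": "triangle", "Material": "plastic", "Weight": "heavy", "Color": "blue",   "Size": "large"},
    "t7": {"Shape": "triangle", "Material": "plastic", "Weight": "light", "Color": "pink",   "Size": "large"},
    "t8": {"Shape": "triangle", "Material": "plastic", "Weight": "heavy", "Color": "yellow", "Size": "large"},
}
C = ["Shape", "Material", "Weight", "Color"]

def partition(attrs):
    """U/B: blocks of objects agreeing on every attribute in attrs (cf. Eq. 1)."""
    blocks = {}
    for x, row in U.items():
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(x)
    return list(blocks.values())

def positive_region(attrs, d="Size"):
    """POS_B({d}): union of B-blocks lying entirely in one decision class."""
    return set().union(*(b for b in partition(attrs)
                         if len({U[x][d] for x in b}) == 1))

def is_reduct(B):
    """Definition 1: POS is preserved and every attribute of B is necessary."""
    full = positive_region(C)
    return (positive_region(B) == full and
            all(positive_region([a for a in B if a != r]) != full for r in B))

print([set(B) for k in range(1, len(C) + 1)
       for B in combinations(C, k) if is_reduct(list(B))])
# -> the two reducts {Weight, Color} and {Shape, Material, Weight}
```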

3 The Minimal Attribute Bias

This bias is described as follows: a reduct R is optimal iff |R| is minimal, where |·| denotes the cardinality of a set. According to this bias, R2 = {Weight, Color} is an optimal reduct of Table 1 because |R1| = 3 and |R2| = 2. For the sake of clarity, we use the term minimal reduct instead of optimal reduct in what follows.

The main advantage of this bias is that it is simple and tends to give short rules. For datasets where different attributes have approximately the same domain size, it may also be helpful for obtaining small rule sets with good predicting performance. However, it also has the following drawbacks:

1. It is unfair to different attributes. For example, in Table 1, attribute Color is the most important attribute from the viewpoint of discernibility. But this is due to its relatively large domain (7 values versus 2 for the others) rather than its intrinsic importance.
2. It admits too many optimal reducts. For example, the Mushroom dataset [8] (further discussed in Subsection 5.1) has 14 minimal reducts. Some of them perform well in terms of further rule set generation and/or decision tree construction, but others do not. The bias does not indicate a more detailed strategy for choosing among those reducts.

4 The Minimal Attribute Space Bias

We propose the minimal attribute space bias as follows: a reduct R is optimal iff Π_{a∈R} |V_a| is minimal, where V_a is the domain of attribute a. According to this bias, R1 = {Shape, Material, Weight} is an optimal reduct of Table 1 because Π_{a∈R1} |V_a| = 2 × 2 × 2 = 8 while Π_{a∈R2} |V_a| = 2 × 7 = 14. We also use the term minimal space reduct instead of optimal reduct.

Remark 1. If |V_{a_i}| = |V_{a_j}| for any a_i, a_j ∈ C, then the minimal attribute space bias coincides with the minimal attribute bias.

Now we explain why this bias is more reasonable than the minimal attribute bias, using the exemplary decision table. Each object in the table corresponds to a decision rule. For example, t1 corresponds to

Shape = round ∧ Material = wood ∧ Weight = light ∧ Color = red → Size = small.

Rules of this type will be called original rules, since no inductive learning algorithm has been applied. Because no object pairs are indiscernible, there are a total of 8 original rules. On the other hand, the attribute space of the decision table is |V_Shape| × |V_Material| × |V_Weight| × |V_Color| = 2 × 2 × 2 × 7 = 56. Therefore the objects in the decision table cover only 8/56 = 1/7 ≈ 0.143 of the attribute space, and the original rule set may have poor performance in terms of coverage.
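As a quick illustration (ours, not the paper's code; `domain_size` is simply read off Table 1, and `math.prod` needs Python 3.8+), the two biases amount to two different keys for choosing among reducts:

```python
# Contrasting the two biases on the reducts of Table 1 (illustrative sketch).
from math import prod

domain_size = {"Shape": 2, "Material": 2, "Weight": 2, "Color": 7}  # |V_a|
reducts = [{"Shape", "Material", "Weight"}, {"Weight", "Color"}]    # R1, R2

def attribute_space(R):
    """Pi_{a in R} |V_a|: the size of the reduced attribute space."""
    return prod(domain_size[a] for a in R)

minimal_reduct = min(reducts, key=len)                    # minimal attribute bias -> R2
minimal_space_reduct = min(reducts, key=attribute_space)  # minimal attribute space bias -> R1
print(minimal_reduct, attribute_space(minimal_reduct))              # {Weight, Color} 14
print(minimal_space_reduct, attribute_space(minimal_space_reduct))  # {Shape, Material, Weight} 8
```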

As an inductive approach, attribute reduction can reduce the number of attributes; more importantly, it can reduce the attribute space. The attribute space of S(R1) = ({t1, ..., t8}, R1, {Size}) is |V_Shape| × |V_Material| × |V_Weight| = 2 × 2 × 2 = 8, but it has two indiscernible object pairs (t1 and t4, as well as t6 and t8), hence only 8 − 2 = 6 original rules can be obtained, giving a space coverage of 6/8 = 0.75. The attribute space of S(R2) = ({t1, ..., t8}, R2, {Size}) is |V_Weight| × |V_Color| = 2 × 7 = 14, and no indiscernible object pairs exist, hence 8 original rules can be obtained, giving a space coverage of 8/14 ≈ 0.571. From this viewpoint, both reducts have notable generalization ability, and R1 performs better (with a space coverage of 0.75 versus 0.571 for R2).

Table 2. Rule sets generated from S(R1) and S(R2)

No.  Rule                                                   Support
r1   Material = wood ∧ Weight = light → Size = small        3
r2   Shape = triangle ∧ Material = plastic → Size = large   3
r3   Shape = round ∧ Weight = light → Size = small          2
r4   Shape = triangle ∧ Weight = heavy → Size = large       2
r5   Shape = round ∧ Material = plastic → Size = small      1
r6   Material = wood ∧ Weight = heavy → Size = large        1
r7   Shape = triangle ∧ Material = wood → Size = small      1
r8   Material = plastic ∧ Weight = light → Size = large     1
r9   Color = red → Size = small                             1
r10  Color = black → Size = small                           1
r11  Weight = heavy ∧ Color = white → Size = large          1
r12  Weight = light ∧ Color = white → Size = small          1
r13  Color = green → Size = small                           1
r14  Color = blue → Size = large                            1
r15  Color = pink → Size = large                            1
r16  Color = yellow → Size = large                          1

It should be noted that better generalization ability does not ensure smaller rule sets. In fact, using the exhaustive algorithm [9] we obtained 8 rules for each reduced decision table, as listed in Table 2, where the first 8 rules correspond to S(R1). But it should be noted further that each rule generated from S(R2) is supported by only 1 object, while among the rules generated from S(R1), 2 rules (r1 and r2) are supported by 3 objects, and another 2 rules (r3 and r4) are supported by 2 objects. In other words, R1 is more helpful for generating strong rules. One can observe that in this case both rule sets cover the whole attribute space, while for larger datasets, reducts with smaller attribute spaces often result in smaller rule sets with better space coverage.

Formally, the space coverage of S = (U, C, {d}) is defined as

SC(S) = |U/C| / Π_{a∈C} |V_a|,   (2)

which serves as an important factor for further rule generation and decision tree construction.
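Continuing the sketches above (this snippet reuses `partition` from the first one and `domain_size`/`prod` from the second), Eq. (2) can be checked against the three coverage figures just computed:

```python
# Space coverage of Eq. (2), evaluated on the full table and both reducts.
def space_coverage(attrs):
    """SC(S) = |U/B| / Pi_{a in B} |V_a|."""
    return len(partition(attrs)) / prod(domain_size[a] for a in attrs)

print(space_coverage(["Shape", "Material", "Weight", "Color"]))  # 8/56 ~ 0.143
print(space_coverage(["Shape", "Material", "Weight"]))           # 6/8  = 0.75
print(space_coverage(["Weight", "Color"]))                       # 8/14 ~ 0.571
```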

From the viewpoint of space coverage, the goal of attribute reduction should be to search for a reduct R such that SC((U, R, {d})) is maximal. One approach is to maximize |U/R|; but |U/R| has an upper bound of |U|, and |U/R| does not vary much across different reducts, so this approach achieves little. The other approach is to minimize Π_{a∈R} |V_a|, which coincides with the minimal attribute space bias. In most cases, a minimal attribute space results in maximal space coverage. One might construct a counterexample as follows: a decision table S = (U, C, {d}) and two reducts R1, R2 satisfying |U/R1| > |U/R2|, Π_{a∈R1} |V_a| > Π_{a∈R2} |V_a| and SC((U, R1, {d})) > SC((U, R2, {d})), so that the minimal space reduct R2 does not have the best coverage; but this situation is quite unlikely to happen in real data.

5 Experiments and Comparisons

There are very few datasets (e.g., Nursery) with complete coverage of the attribute space. For many datasets, attribute reduction is a very important approach to improving the space coverage. We tested the two reduct selection biases on some datasets from the UCI library [8] using RSES [9] and a software package developed by us called RDK (Rough Developer's Kit). For some datasets the set of all minimal space reducts coincides with the set of all minimal reducts. These datasets can be classified as follows:

1. One-reduct datasets, e.g., Zoo, Solar-Flare and Monks;
2. Datasets whose conditional attributes all have the same domain size, e.g., Letter Recognition and Tic-Tac-Toe; and
3. Other datasets, e.g., Bridges.

In what follows, experiments on two datasets are discussed in more detail.

5.1 Experiments on Mushroom

The Mushroom dataset [8] contains 8416 objects and 22 conditional attributes. The domain sizes of the attributes vary from 1 (veil-type; the attribute value UNIVERSAL announced in agaricus-lepiota.names never appears in the dataset) to 12 (gill-color). It has 292 reducts, and all minimal space reducts are also minimal reducts.

For each minimal reduct, LEM2 was employed (with the cover parameter set to 1.0) to generate rule sets on the reduced decision tables. Furthermore, we used 5-fold cross validation (the rule generation algorithm was also LEM2) to test the performance of those reducts. For all reducts tested, both the coverage and the accuracy of the respective rule sets were 1.0. Other results are listed in Table 3. For this dataset, the number of minimal space reducts is much smaller than the number of minimal reducts. Also, minimal space reducts are more helpful for obtaining smaller rule sets.

Table 3. Experimental results on Mushroom

                        minimal attribute bias   minimal attribute space bias
optimal reducts         14                       2
minimal rule set size   –                        –
maximal rule set size   –                        –
average rule set size   –                        –

5.2 Experiments on Soybean

The Large Soybean Database [8] contains two parts: the training set with 307 instances, and the testing set with 376 instances. There are 35 nominal conditional attributes, with domain sizes varying from 2 to 8, and some of them have missing (unknown) values. The domain size (i.e., 18) of the decision attribute is rather large.

Due to the relatively large number of attributes, when we tried to use the exhaustive algorithm [9] to obtain the set of all reducts, an out-of-memory error was reported. So we used the genetic algorithm on the training set to obtain reducts, with the speed set to low and the number of reducts set to 100. In this way, 100 reducts were obtained, 35 of which were minimal reducts, and 9 of which were minimal space reducts. 8 out of the 9 minimal space reducts contain 9 attributes, hence they were also minimal reducts. Rule sets were generated by employing the exhaustive algorithm on the reduced decision tables, and then tested on the testing set. Some results are listed in Table 4.

Table 4. Experimental results on Soybean (bolded values indicate the best results)

               minimal attribute reducts     minimal attribute space reducts
               minimal  maximal  average     minimal  maximal  average
rule set size  –        –        –           –        –        –
coverage       –        –        –           –        –        –
accuracy       –        –        –           –        –        –
F-measure      –        –        –           –        –        –

Since the minimal attribute reduct set is much larger than the minimal attribute space reduct set, it contains both the best and the worst reducts. In general, for this dataset the minimal attribute space bias outperforms the minimal attribute bias in terms of average rule set size (325 less), average accuracy (0.024 more) and average F-measure (0.024 more). It is quite interesting that the latter bias outperforms the former (by 0.004) in terms of average coverage. Two reducts drew our special attention:

1. The best reduct. It helped obtain a rule set with an F-measure of 0.740, and it was included in both reduct sets; and
2. The minimal attribute space reduct with 10 attributes. Although not a minimal reduct, it helped obtain a rule set containing 1789 rules, with a prediction coverage of 0.949; its accuracy and F-measure are fairly good compared with those of the minimal reducts.

6 Discussion

In this section we discuss the two biases from a broader viewpoint.

Fig. 1. A typical inductive learning scenario: discretization scheme → reduct → rule set / decision tree → predicting performance

As depicted in Fig. 1, the ultimate goal of inductive learning is to obtain good predicting performance, defined by the coverage, the accuracy, or a combination of both (e.g., F-measure) on the data. But the predicting performance can be obtained only after a rule set has been generated or a decision tree [10] has been constructed (for the sake of simplicity, other approaches such as neural networks or kNN are not included in Fig. 1). According to Occam's Razor, smaller rule sets, or simpler decision trees (with fewer nodes), are preferred. In order to obtain a small rule set or a simple decision tree, also according to Occam's Razor, the simplest reduct is desired. But the key issue is: what is the metric for evaluating the simplicity of a reduct? The aforementioned biases are two such metrics, among which the new bias seems to be closer to the essence.

Why, then, has the minimal attribute bias worked well in so many applications? In fact, many reduct construction algorithms use the following strategy: "[i]f two attributes have the same performance with respect to the criterion described above then the one having less values is selected" [1], and it is thus quite possible that a minimal space reduct is constructed when a minimal reduct is required. Moreover, even if the minimal reduct obtained is not a minimal space reduct, its attribute space is not too large compared with that of a minimal space reduct. In other words, the minimal attribute bias is often a good approximation of the minimal attribute space bias.

7 Conclusions and Further Works

Compared with the minimal attribute bias, the minimal attribute space bias is closer to the goal of constructing simple reducts from the viewpoints of attribute space and attribute space coverage. Moreover, it does not incur the fairness problem. Experiments on two datasets showed that the new bias can help to narrow the scope of optimal reducts and, more importantly, it can help to obtain better rule sets in terms of accuracy and F-measure. Since the definition of a bias is a quite fundamental issue, much research work on the new bias, e.g., on reduct construction algorithms, is expected in the near future.

Acknowledgement

Fan Min was supported by an information distribution project under grant No. 9140A DZ223 and the Youth Foundation of UESTC. The authors would like to thank Zichun Zhong and Yue Liu for their help with the experiments and paper proofing.

References

1. Zhong, N., Dong, J.: Using rough sets with heuristics for feature selection. Journal of Intelligent Information Systems 16 (2001)
2. Pawlak, Z.: Rough sets. International Journal of Computer and Information Sciences 11 (1982)
3. Pawlak, Z.: Some issues on rough sets. In Peters, J.F., Skowron, A., Grzymała-Busse, J.W., Kostek, B., Świniarski, R.W., Szczuka, M.S., eds.: Transactions on Rough Sets I. LNCS, Springer-Verlag, Berlin Heidelberg (2004)
4. Zhang, W., Mi, J., Wu, W.: Knowledge reductions in inconsistent information systems. Chinese Journal of Computers 26(1) (2003)
5. Yao, Y., Yan, Z., Wang, J.: On reduct construction algorithms. In Wang, G., Peters, J.F., Skowron, A., Yao, Y., eds.: RSKT 2006. LNCS 4062, Springer-Verlag, Berlin Heidelberg (2006)
6. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. In Słowiński, R., ed.: Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory. Kluwer Academic Publishers, Dordrecht (1992)
7. Xu, C., Min, F.: Weighted reduction for decision tables. In Wang, L., Jiao, L., Shi, G., Li, X., Liu, J., eds.: FSKD 2006. LNCS 4223, Springer-Verlag, Berlin Heidelberg (2006)
8. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases, mlearn/mlrepository.html (1998)
9. Bazan, J., Szczuka, M.: The RSES homepage, rses ( )
10. Quinlan, J.R.: Induction of decision trees. Machine Learning 1 (1986)
