On Granular Rough Computing: Factoring Classifiers through Granulated Decision Systems

Lech Polkowski 1,2, Piotr Artiemjew 2

2 Department of Mathematics and Computer Science, University of Warmia and Mazury, Olsztyn, Poland
1 Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland
email: polkow@pjwstk.edu.pl; artem@matman.uwm.edu.pl

Abstract. The paradigm of Granular Computing has quite recently emerged as an area of research in its own right; in particular, it is pursued within rough set theory as initiated by Zdzisław Pawlak. Granules of knowledge consist of entities whose information content is similar in a specified sense. The idea of a granular counterpart to a decision/information system has been put forth, along with its consequence in the form of the hypothesis that various operators aimed at dealing with information should factor sufficiently faithfully through granular structures [7], [8]. The most important such operators are algorithms for inducing classifiers. We report results of testing a few well-known algorithms for classifier induction on widely used data sets from the Irvine Repository in order to verify the hypothesis. The results confirm the hypothesis for the selected representative algorithms and open a new prospective area of research.

Keywords: rough inclusion, similarity, granulation of knowledge, granular systems and classifiers

1 Rough Computing

Knowledge is represented as a pair (U, A), called an information system [4], where U is a set of objects and A is a collection of attributes, each a ∈ A construed as a mapping a : U → V_a from U into the value set V_a. The collection IND = {ind(a) : a ∈ A} of indiscernibility relations, where ind(a) = {(u, v) : u, v ∈ U, a(u) = a(v)} for a ∈ A, can be restricted to any set B ⊆ A, yielding the B-indiscernibility relation ind(B) = ⋂_{a ∈ B} ind(a). A concept is any subset of the set U.
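As a small illustration (not from the paper; objects are represented here as Python dictionaries from attribute names to values, an assumed layout), the equivalence classes of ind(B) can be computed as follows:

```python
from collections import defaultdict

def ind_classes(U, B):
    """Group indices of objects in U into equivalence classes of ind(B):
    u and v fall into one class iff a(u) = a(v) for every attribute a in B."""
    classes = defaultdict(list)
    for i, u in enumerate(U):
        classes[tuple(u[a] for a in B)].append(i)
    return list(classes.values())

# toy information system with attributes a, b
U = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 1, "b": 0}]
print(ind_classes(U, ["a"]))       # [[0, 1, 2]]
print(ind_classes(U, ["a", "b"]))  # [[0, 2], [1]]
```

Restricting B from A to a subset coarsens nothing here, but in general ind(B) ⊇ ind(A), as the two calls above show.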
By a proper rough entity, we mean any entity e constructed from objects in U and relations in IND such that its action e(u) on each object u ∈ U satisfies the condition: if (u, v) ∈ r then e(u) = e(v), for each r ∈ IND; in particular, proper rough concepts are called exact, and improper rough concepts are called rough. A particular case of an information system is a decision system, i.e., a pair (U, A ∪ {d}) in which d is a singled-out attribute called the decision. The basic primitives in any reasoning based on rough set theory are descriptors, see, e.g., [4], of the form (a = v), with semantics of the form [(a = v)] = {u ∈ U : a(u) = v}, extended to the set of formulas by means of sentential connectives, with appropriately extended semantics. In order to relate the conditional

knowledge (U, IND) to the world knowledge (U, {ind(d)}), decision rules are in use; a decision rule is an implication of the form

  ⋀_{a ∈ A} (a = v_a) ⇒ (d = w).   (1)

A classifier is a set of decision rules.

2 Rough Mereology. Rough Inclusions

We outline it here as a basis for a discussion of granules in the wake of [7], [8]. Rough Mereology is concerned with the theory of the predicate of Rough Inclusion.

2.1 Rough Inclusions

A rough inclusion µ_π(x, y, r), where x, y are individual objects and r ∈ [0, 1], satisfies the following requirements, relative to a given part relation π on a set U of individual objects, see [6], [7], [8], [9]:

  1. µ_π(x, y, 1) ⇔ x ing_π y;
  2. µ_π(x, y, 1) ⇒ [µ_π(z, x, r) ⇒ µ_π(z, y, r)];
  3. µ_π(x, y, r) ∧ s < r ⇒ µ_π(x, y, s).   (2)

These requirements are intuitively clear: 1 demands that the predicate µ_π be an extension of the ingredient (part-or-whole) relation ing_π of the underlying system of Mereology; 2 expresses monotonicity of µ_π; and 3 assures the reading "to degree at least r". We use here only one rough inclusion, albeit a fundamental one, viz. (see [6], [7] for its derivation),

  µ_L(u, v, r) ⇔ |IND(u, v)| / |A| ≥ r,   (3)

where IND(u, v) = {a ∈ A : a(u) = a(v)}.

3 Granules

A granule g_µ(u, r) about u ∈ U of the radius r, relative to µ, is defined by letting

  g_µ(u, r) is Cls F(u, r),   (4)

where the property F(u, r) is satisfied by an object v if and only if µ(v, u, r) holds, and Cls is the class operator, see, e.g., [6]. In practice, in the case of µ_L, the granule g(u, r) collects all v ∈ U such that |IND(v, u)| ≥ r · |A|. For a given granulation radius r, we form the collection U^G_{r,µ} = {g_µ(u, r) : u ∈ U}.
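In the same toy representation (objects as attribute-to-value dictionaries; a sketch, not the authors' implementation), the rough inclusion of formula (3) and the granules it generates read:

```python
def mu_L(v, u, r, A):
    """mu_L(v, u, r) holds iff |IND(v, u)| / |A| >= r,
    where IND(v, u) is the set of attributes on which v and u agree."""
    ind = sum(1 for a in A if v[a] == u[a])
    return ind / len(A) >= r

def granule(u, U, r, A):
    """g(u, r): all v in U agreeing with u on at least r*|A| attributes."""
    return [v for v in U if mu_L(v, u, r, A)]

A = ["a", "b"]
U = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 2, "b": 1}]
print(len(granule(U[0], U, 0.5, A)))  # 2: objects 0 and 1 agree on attribute a
print(len(granule(U[0], U, 1.0, A)))  # 1: only object 0 itself
```

At r = 0 every granule is the whole universe; at r = 1 granules shrink to indiscernibility classes, matching the behavior of the radii in the tables below.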

3.1 Granular decision systems

The idea of a granular decision system was posed in [7]; for a given information system (U, A), a rough inclusion µ, and r ∈ [0, 1], the new universe U^G_{r,µ} is given. We apply a strategy G to choose a covering Cov^G_{r,µ} of the universe U by granules from U^G_{r,µ}. We apply a strategy S in order to assign the value a*(g) of each attribute a ∈ A to each granule g ∈ Cov^G_{r,µ}: a*(g) = S({a(u) : u ∈ g}). The granular counterpart to the information system (U, A) is a tuple (U^G_{r,µ}, G, S, {a* : a ∈ A}); analogously, we define granular counterparts to decision systems by adding the factored decision d*. The heuristic principle that objects similar with respect to conditional attributes in the set A should also reveal similar (i.e., close) decision values, and that therefore granular counterparts to decision systems should lead to classifiers satisfactorily close in quality to those induced from the original decision systems, was stated in [7] and borne out by simple hand examples. In this work we verify this hypothesis on real data sets.

4 Classifiers: Rough set methods

Classifiers are evaluated by total accuracy, the ratio of the number of correctly classified test objects to the number of recognized test objects, and by total coverage, rec/test, where rec is the number of recognized test cases and test is the number of test cases. We test the LEM2 algorithm due to Grzymala-Busse, see, e.g., [2], as well as the covering and exhaustive algorithms of the RSES package [12], see [1], [13], [16], [17].

4.1 On the approach in this work

For g(u, r) with r fixed and an attribute a ∈ A ∪ {d}, the factored value a*(g) is defined as S({a(u) : u ∈ g}) for a strategy S; each granule g thus produces a new object g*, with attribute values a(g*) = a*(g) for a ∈ A, possibly not in the data set universe U. From the set U^G_{r,µ}, see sect. 3.1, of all granules of the form g_µ(u, r), by means of a strategy G, we choose a covering Cov^G_{r,µ} of the universe U.
Thus, a decision system D* = ({g* : g ∈ Cov^G_{r,µ}}, A* ∪ {d*}) is formed, called the granular counterpart, relative to the strategies G, S, to the decision system D = (U, A ∪ {d}); this new system is substantially smaller in size for intermediate values of r, and hence classifiers induced from it have a correspondingly smaller number of rules. As stated above, the hypothesis is that the granular counterpart D* at sufficiently large granulation radii r preserves the knowledge encoded in the decision system D to a satisfactory degree, so that, given an algorithm A for rule induction, classifiers obtained from the training set D(trn) and from its granular counterpart D*(trn) should agree with a small error on the test set D(tst).
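The whole factoring step can be sketched compactly under the strategies used later in this paper (G = sequential choice of granules of not-yet-covered objects, S = majority voting); names and the data layout are illustrative assumptions, not the authors' code:

```python
from collections import Counter

def granular_counterpart(U, A, d, r):
    """Build the objects g* of the granular counterpart D* of (U, A + {d}):
    sweep U in order, take the granule of each not-yet-covered object,
    and factor every attribute through majority voting over the granule."""
    def granule(u):
        return [v for v in U if sum(1 for a in A if u[a] == v[a]) >= r * len(A)]

    covered, factored = set(), []
    for i, u in enumerate(U):
        if i in covered:
            continue  # u already lies in a chosen granule
        g = granule(u)
        covered.update(j for j, v in enumerate(U) if v in g)
        # g*: majority value of each attribute (incl. the decision) over g
        # (ties here go to the first-seen value; the paper resolves them at random)
        factored.append({a: Counter(v[a] for v in g).most_common(1)[0][0]
                         for a in A + [d]})
    return factored

U = [{"a": 1, "d": 0}, {"a": 1, "d": 0}, {"a": 2, "d": 1}]
print(granular_counterpart(U, ["a"], "d", 1.0))  # two factored objects
```

For intermediate r the sweep covers U with far fewer granules than there are objects, which is exactly the size reduction reported in the experiments.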

5 Experiments

In experiments with real data sets, we accept the total accuracy and total coverage coefficients as quality measures in the comparison of classifiers given in this work. We make use of some well-known real-life data sets often used in the testing of classifiers. Due to shortage of space, we include only a very few results. The following data sets have been used: the credit card application approval data set (Australian credit), see [14], and the Pima Indians diabetes data set [14]. As representative and well-established algorithms for rule induction in the public domain, we have selected the RSES exhaustive algorithm, see [12]; the covering algorithm of RSES with p = .1 [12]; and the LEM2 algorithm with p = .5, see [2], [12]. Table 1 shows a comparison of these algorithms on the data set Australian credit split into training and test sets with the ratio 1:1.

Table 1. Comparison of algorithms on Australian credit data; 345 training objects, 345 test objects

  algorithm          accuracy  coverage  rule number
  covering(p = .1)   0.670     0.783     589
  covering(p = .5)   0.670     0.783     589
  covering(p = 1.0)  0.670     0.783     589
  exhaustive         0.872     1.0       5597
  LEM2(p = .1)       0.810     0.061     6
  LEM2(p = .5)       0.906     0.368     39
  LEM2(p = 1.0)      0.869     0.643     126

In the rough set literature there are results of tests with other algorithms on the Australian credit data set; we recall some of the best of them in Table 2, where we also include the best granular cases from this work.

Table 2. Best results for Australian credit by some rough set based algorithms; in the first granular case, the reduction in object size is 40.6 percent and the reduction in rule number is 43.6 percent; in the second, resp., 10.5, 5.9; in the third, resp., 3.6, 1.9

  source            method                                    accuracy       coverage
  Bazan [1]         SNAPM(0.9)                                error = 0.130
  S.H.Nguyen [13]   simple.templates                          0.929          0.623
  S.H.Nguyen [13]   general.templates                         0.886          0.905
  S.H.Nguyen [13]   closest.simple.templates                  0.821          1.0
  S.H.Nguyen [13]   closest.gen.templates                     0.855          1.0
  S.H.Nguyen [13]   tolerance.simple.templ.                   0.842          1.0
  S.H.Nguyen [13]   tolerance.gen.templ.                      0.875          1.0
  J.Wroblewski [17] adaptive.classifier                       0.863
  this.work         granular, r = 0.642857                    0.867          1.0
  this.work         granular, r = 0.714286                    0.875          1.0
  this.work         granular.concept.dependent, r = 0.785714  0.9970         0.9985

For any granule g and any attribute b in the set A ∪ {d} of attributes, the reduced value b*(g) at the granule g has been estimated by means of the majority voting strategy, with ties resolved at random; majority voting is one of the most popular strategies and has frequently been applied within rough set theory, see, e.g., [13], [16]. We also use the simplest strategy for finding a covering: we select coverings by ordering the objects in the set U and choosing sequentially granules about them so as to obtain an irreducible covering; a random choice of granules is applied in the sections in which this is specifically mentioned. The only enhancement of this simple granulation is discussed in sect. 6, where concept dependent granules are considered; this approach yields even better classification results.
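The two quality measures from sect. 4 are straightforward to compute; here classify(u) is a stand-in (an assumed name) for whichever rule-based classifier is applied, returning None when no rule recognizes u:

```python
def evaluate(classify, test_set, d="d"):
    """Total accuracy = correctly classified / recognized test objects;
    total coverage = recognized test objects / all test objects."""
    recognized = [u for u in test_set if classify(u) is not None]
    correct = [u for u in recognized if classify(u) == u[d]]
    accuracy = len(correct) / len(recognized) if recognized else 0.0
    coverage = len(recognized) / len(test_set)
    return accuracy, coverage

# toy run: a "classifier" that predicts 1 whenever attribute a is present
tests = [{"a": 1, "d": 1}, {"a": 1, "d": 0}, {"d": 1}]
acc, cov = evaluate(lambda u: 1 if "a" in u else None, tests)
print(acc, cov)  # 0.5 and 2/3
```

Note that accuracy is conditional on recognition, which is why a classifier with very few rules (e.g., LEM2 at small p in Table 1) can show high accuracy together with low coverage.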

5.1 Train and test at 1:1 ratio for Australian credit

We include here results for Australian credit. Table 3 shows the sizes of the training and test sets in the non-granular and granular cases, as well as the results of classification versus radii of granulation. Table 4 shows the absolute differences between the non-granular case (r = nil) and the granular cases, as well as the sizes of the training and rule sets in the granular cases as fractions of those in the non-granular case.

Table 3. Australian credit dataset: r=granule radius, tst=test sample size, trn=training sample size, rulcov=number of rules with covering algorithm, rulex=number of rules with exhaustive algorithm, rullem=number of rules with LEM2, acov=total accuracy with covering algorithm, ccov=total coverage with covering algorithm, aex=total accuracy with exhaustive algorithm, cex=total coverage with exhaustive algorithm, alem=total accuracy with LEM2, clem=total coverage with LEM2

  r          tst  trn  rulcov  rulex  rullem  acov   ccov   aex    cex    alem   clem
  nil        345  345  571     5597   49      0.634  0.791  0.872  0.994  0.943  0.354
  0.0        345  1    14      0      0       1.0    0.557  0.0    0.0    0.0    0.0
  0.0714286  345  1    14      0      0       1.0    0.557  0.0    0.0    0.0    0.0
  0.142857   345  2    16      0      1       1.0    0.557  0.0    0.0    1.0    0.383
  0.214286   345  3    7       7      1       0.641  1.0    0.641  1.0    0.600  0.014
  0.285714   345  4    10      10     1       0.812  1.0    0.812  1.0    0.0    0.0
  0.357143   345  8    18      23     2       0.820  1.0    0.786  1.0    0.805  0.252
  0.428571   345  20   29      96     2       0.779  0.826  0.791  1.0    0.913  0.301
  0.5        345  51   88      293    2       0.825  0.843  0.838  1.0    0.719  0.093
  0.571429   345  105  230     933    2       0.835  0.930  0.855  1.0    0.918  0.777
  0.642857   345  205  427     3157   20      0.686  0.757  0.867  1.0    0.929  0.449
  0.714286   345  309  536     5271   45      0.629  0.774  0.875  1.0    0.938  0.328
  0.785714   345  340  569     5563   48      0.629  0.797  0.870  1.0    0.951  0.357
  0.857143   345  340  570     5574   48      0.626  0.791  0.864  1.0    0.951  0.357
  0.928571   345  342  570     5595   48      0.628  0.794  0.867  1.0    0.951  0.357
  1.0        345  345  571     5597   49      0.634  0.791  0.872  0.994  0.943  0.354

Table 4. Australian credit dataset: comparison; r=granule radius, acerr=abs. total accuracy error with covering algorithm, ccerr=abs. total coverage error with covering algorithm, aexerr=abs. total accuracy error with exhaustive algorithm, cexerr=abs. total coverage error with exhaustive algorithm, alemerr=abs. total accuracy error with LEM2, clemerr=abs. total coverage error with LEM2, sper=training sample size as fraction of the original size, rper=max rule set size as fraction of the original size

  r          acerr   ccerr   aexerr  cexerr  alemerr  clemerr  sper    rper
  nil        0.0     0.0     0.0     0.0     0.0      0.0      1.0     1.0
  0.0        0.366+  0.234   0.872   0.994   0.943    0.354    0.003   0.024
  0.0714286  0.366+  0.234   0.872   0.994   0.943    0.354    0.003   0.024
  0.142857   0.366+  0.234   0.872   0.994   0.057+   0.029+   0.0058  0.028
  0.214286   0.007+  0.209+  0.231   0.006+  0.343    0.340    0.009   0.02
  0.285714   0.178+  0.209+  0.06    0.006+  0.943    0.354    0.012   0.02
  0.357143   0.186+  0.209+  0.086   0.006+  0.138    0.102    0.023   0.04
  0.428571   0.145+  0.035+  0.081   0.006+  0.03     0.053    0.058   0.05
  0.5        0.191+  0.052+  0.034   0.006+  0.224    0.261    0.148   0.154
  0.571429   0.201+  0.139+  0.017   0.006+  0.025    0.423+   0.304   0.403
  0.642857   0.052+  0.034   0.005   0.006+  0.014    0.095+   0.594   0.748
  0.714286   0.005   0.017   0.003+  0.006+  0.005    0.026    0.896   0.942
  0.785714   0.005   0.006+  0.002   0.006+  0.008+   0.003+   0.985   0.994
  0.857143   0.008   0.0     0.008   0.006+  0.008+   0.003+   0.985   0.998
  0.928571   0.006   0.003+  0.005   0.006+  0.008+   0.003+   0.991   0.999
  1.0        0.0     0.0     0.0     0.0     0.0      0.0      1.0     1.0

With the covering algorithm, accuracy is better or within an error of 1 percent for all radii, and coverage is better or within an error of 4.5 percent from the radius of 0.214286 on, where the training set size reduction is 99 percent and the reduction in rule set size is 98 percent. With the exhaustive algorithm, accuracy is within an error of 10 percent from the radius of 0.285714 on, and it is better or within an error of 4 percent from the radius of 0.5 on, where the reduction in training set size is 85 percent and the reduction in rule set size is 95 percent.
The result of .875 at r = .714286 is among the best overall (see Table 2). Coverage is better from r = .214286 on in the granular case, where the reduction in objects is 99 percent and the reduction in rule set size is almost 100 percent. LEM2 gives accuracy better or within a 2.6 percent error from the radius of 0.5 on, where the training set size reduction is 85 percent and the rule set size reduction is 96 percent. Coverage is better or within an error of 7.3 percent from the radius of .571429 on, where the reduction in training set size is 69.6 percent and the rule set size is reduced by 96 percent.

5.2 CV-10 with Pima

We have experimented with the Pima Indians diabetes data set using 10-fold cross validation and a random choice of a covering for the exhaustive and LEM2 algorithms. The results are in Tables 5 and 6.

Table 5. 10-fold CV; Pima; exhaustive algorithm; r=radius, macc=mean accuracy, mcov=mean coverage, mrules=mean rule number, mtrn=mean size of training set

  r      macc    mcov    mrules  mtrn
  nil    0.6864  0.9987  7629    692
  0.125  0.0618  0.0895  5.9     22.5
  0.250  0.6627  0.9948  450.1   120.6
  0.375  0.6536  0.9987  3593.6  358.7
  0.500  0.6645  1.0     6517.6  579.4
  0.625  0.6877  0.9987  7583.6  683.1
  0.750  0.6864  0.9987  7629.2  692
  0.875  0.6864  0.9987  7629.2  692

Table 6. 10-fold CV; Pima; LEM2 algorithm; r=radius, macc=mean accuracy, mcov=mean coverage, mrules=mean rule number, mtrn=mean size of training set

  r      macc    mcov    mrules  mtrn
  nil    0.7054  0.1644  227.0   692
  0.125  0.900   0.2172  1.0     22.5
  0.250  0.7001  0.1250  12.0    120.6
  0.375  0.6884  0.2935  74.7    358.7
  0.500  0.7334  0.1856  176.1   579.4
  0.625  0.7093  0.1711  223.1   683.1
  0.750  0.7071  0.1671  225.9   692
  0.875  0.7213  0.1712  227.8   692

For the exhaustive algorithm, accuracy in the granular case is 95.4 percent of the accuracy in the non-granular case from the radius of .25 on, with a reduction in the size of the training set of 82.5 percent; from the radius of .5 on, the difference is less than 3 percent, with a reduction in the size of the training set of about 16.3 percent. The difference in coverage is less than .4 percent from r = .25 on, where the reduction in training set size is 82.5 percent. For LEM2, accuracy in the two cases differs by less than 1 percent from r = .25 on, and it is better in the granular case from r = .5 on, with a reduction in the size of the training set of 16.3 percent; coverage is better in the granular case from r = .375 on, with the training set size reduced by 48.2 percent.
5.3 A validation by a statistical test

We have also carried out a test with the Pima Indians diabetes data set [14] and a random choice of coverings, taking a sample of 30 granular classifiers at the radius of .5, with train-and-test at the ratio 1:1, against a matched sample of classification results without granulation, with the covering algorithm for p = .1. The Wilcoxon [15] signed rank test for matched pairs has in this case given a p-value of .14 for coverage, so the null hypothesis of identical means should not be rejected; for accuracy, the hypothesis that the mean in the granular case is equal to .99 of the mean in the non-granular case may be rejected (the p-value is .009), whereas the hypothesis that the mean in the granular case is greater than .98 of the mean in the non-granular case is accepted (the p-value is .035) at the confidence level of .03.

6 Concept dependent granulation

A modification of the approach presented in the results shown above is the concept dependent granulation; a concept in the narrow sense is a decision/classification class, cf., e.g.,

[2]. Granulation in this sense consists in computing granules for objects in the universe U and for all distinct granulation radii as previously, with the only restriction that, given any object u ∈ U and r ∈ [0, 1], the new concept dependent granule g_cd(u, r) is computed taking into account only objects v ∈ U with d(v) = d(u), i.e., g_cd(u, r) = g(u, r) ∩ {v ∈ U : d(v) = d(u)}. This method increases the number of granules in coverings, but it is also expected to increase the quality of classification, as expressed by accuracy and coverage. We show that this is indeed the case by including results of a test in which the exhaustive algorithm and a random choice of coverings were applied tenfold to the Australian credit data set, once with the by now standard granular approach and then with the concept dependent approach. The averaged results are shown in Table 7.

Table 7. Standard and concept dependent granular systems for the Australian credit data set; exhaustive RSES algorithm; r=granule radius, macc=mean accuracy, mcov=mean coverage, mrules=mean number of rules, mtrn=mean training sample size; in each column the first value is for the standard, the second for the concept dependent approach

  r          macc            mcov            mrules            mtrn
  nil        1.0; 1.0        1.0; 1.0        12025; 12025      690; 690
  0.0        0.0; 0.8068     0.0; 1.0        0; 8              1; 2
  0.0714286  0.0; 0.7959     0.0; 1.0        0; 8.2            1.2; 2.4
  0.142857   0.0; 0.8067     0.0; 1.0        0; 8.9            2.4; 3.6
  0.214286   0.1409; 0.8151  0.2; 1.0        1.3; 11.4         2.6; 5.8
  0.285714   0.7049; 0.8353  0.9; 1.0        8.1; 14.8         5.2; 9.6
  0.357143   0.7872; 0.8297  1.0; 0.9848     22.6; 32.9        10.1; 17
  0.428571   0.8099; 0.8512  1.0; 0.9986     79.6; 134         22.9; 35.4
  0.5        0.8319; 0.8466  1.0; 0.9984     407.6; 598.7      59.7; 77.1
  0.571429   0.8607; 0.8865  0.9999; 0.9997  1541.6; 2024.4    149.8; 175.5
  0.642857   0.8988; 0.9466  1.0; 0.9998     5462.5; 6255.2    345.7; 374.9
  0.714286   0.9641; 0.9880  1.0; 0.9988     9956.4; 10344.0   554.1; 572.5
  0.785714   0.9900; 0.9970  1.0; 0.9995     11755.5; 11802.7  662.7; 665.7
  0.857143   0.9940; 0.9970  1.0; 0.9985     11992.7; 11990.2  682; 683
  0.928571   0.9970; 1.0     1.0; 0.9993     12023.5; 12002.4  684; 685
  1.0        1.0; 1.0        1.0; 1.0        12025.0; 12025.0  690; 690

Conclusions for concept dependent granulation. Concept dependent granulation, as expected, involves a greater number of granules in a covering, and hence a greater number of rules, which is clearly perceptible up to the radius of .714286; for greater radii the difference is negligible. Accuracy in the case of concept dependent granulation is always better than in the standard case; the difference becomes negligible at the radius of .857143, when granules become almost single indiscernibility classes. Coverage in the concept dependent case is almost the same as in the standard case, with the difference between the two not greater than .15 percent from the radius of .428571 on, where the average number of granules in coverings is 5 percent of the number of objects. Accuracy at that radius is better by .04, i.e., by about 5 percent, in the concept dependent case. It follows that concept dependent granulation yields better accuracy, whereas coverage is the same as in the standard case.

7 Conclusions

The results shown in this work confirm the hypothesis put forth in [7], [8] that granular counterparts to data sets preserve the encoded information to a very high degree. The search for a theoretical explanation of this phenomenon, as well as work aimed at developing original algorithms for rule induction based on it, is in progress and will be reported.

References

1. J. G. Bazan, A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision tables, in: Rough Sets in Knowledge Discovery 1, L. Polkowski, A. Skowron, Eds., Physica Verlag, Heidelberg, 1998, 321-365.

2. J. W. Grzymala-Busse, Data with missing attribute values: Generalization of indiscernibility relation and rule induction, Transactions on Rough Sets I, Springer Verlag, Berlin, 2004, 78-95.
3. S. Leśniewski, On the foundations of mathematics, Topoi 2, 1982, 7-52.
4. Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer, Dordrecht, 1991.
5. L. Polkowski, Rough Sets. Mathematical Foundations, Physica Verlag, Heidelberg, 2002.
6. L. Polkowski, Toward rough set foundations. Mereological approach (a plenary lecture), in: Proceedings RSCTC'04, Uppsala, Sweden, 2004, LNAI vol. 3066, Springer Verlag, Berlin, 2004, 8-25.
7. L. Polkowski, Formal granular calculi based on rough inclusions (a feature talk), in: [10], 57-62.
8. L. Polkowski, Formal granular calculi based on rough inclusions (a feature talk), in: [11], 9-16.
9. L. Polkowski, A. Skowron, Rough mereology: a new paradigm for approximate reasoning, International Journal of Approximate Reasoning 15(4), 1997, 333-365.
10. Proceedings of IEEE 2005 Conference on Granular Computing, GrC05, Beijing, China, July 2005, IEEE Press, 2005.
11. Proceedings of IEEE 2006 Conference on Granular Computing, GrC06, Atlanta, USA, May 2006, IEEE Press, 2006.
12. A. Skowron et al., RSES: A system for data analysis; available at http://logic.mimuw.edu.pl/~rses/
13. Sinh Hoa Nguyen, Regularity analysis and its applications in Data Mining, in: Rough Set Methods and Applications, L. Polkowski, S. Tsumoto, T. Y. Lin, Eds., Physica Verlag, Heidelberg, 2000, 289-378.
14. UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/databases/
15. F. Wilcoxon, Individual comparisons by ranking methods, Biometrics 1, 1945, 80-83.
16. A. Wojna, Analogy-based reasoning in classifier construction, Transactions on Rough Sets IV, LNCS 3700, Springer Verlag, Berlin, 2005, 277-374.
17. J. Wróblewski, Adaptive aspects of combining approximation spaces, in: Rough Neural Computing, S. K. Pal, L. Polkowski, A. Skowron, Eds., Springer Verlag, 2004, 139-156.