Minimum Error Classification Clustering


Minimum Error Classification Clustering

Iwan Tri Riyadi Yanto
Department of Mathematics, University of Ahmad Dahlan

Abstract

Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. In this paper we study the problem of clustering categorical data, where data objects are made up of non-numerical attributes. We propose MECC (Minimum Error Classification Clustering), an alternative technique for categorical data clustering that uses the Variable Precision Rough Set (VPRS) model and takes the minimum error classification into account. The technique is implemented in MATLAB. Experimental results on benchmark UCI datasets show that the MECC technique outperforms the baseline categorical data clustering techniques with respect to selecting the clustering attribute.

Keywords: Clustering; Categorical data; Rough set; VPRS; Error classification

1. Introduction

The problem of clustering data arises in many disciplines and has a wide range of applications. Intuitively, clustering is the problem of partitioning a finite set of points in a multi-dimensional space into classes (called clusters) so that (i) the points belonging to the same class are similar and (ii) the points belonging to different classes are dissimilar. The clustering problem has been studied extensively in machine learning, databases and statistics from various perspectives and with various approaches and focuses [1]. The clustering operation is required in a number of data analysis tasks, such as unsupervised classification and data summarization, as well as segmentation of large homogeneous data sets into smaller homogeneous subsets that can be easily managed, separately modeled and analyzed [2].

In this paper we focus our attention on categorical datasets, where data objects are made up of non-numerical attributes. For categorical data clustering, a new trend has emerged of algorithms that can handle uncertainty in the clustering process. One of the well-known techniques is based on rough set theory [3-5]. Mazlack proposed a technique called TR (Total Roughness), based on the accuracy of approximation of a set [3], in which the highest value is the best selection of attribute [6]. One of the successful pioneering rough clustering techniques for categorical data is Minimum-Minimum Roughness (MMR), proposed by Parmar et al. [7]. Its algorithm for selecting a clustering attribute is based on the complement of the accuracy of approximation of a set [3]. To this extent, TR and MMR may provide the same result when selecting a clustering attribute. However, real data clustering problems often face noise, and the data are frequently corrupted, so it is not feasible to deal with noisy data using the classical definition of rough set; MMR accordingly fails to handle noisy data. The classical definition of rough set has drawbacks, in particular losing useful information by demanding absolutely precise inclusion. In order to overcome this drawback, an error parameter β, where 0 ≤ β < 0.5, is introduced. The Variable Precision Rough Set (VPRS) model proposed by Ziarko [8] is defined on a probabilistic space and gives us a new way to deal with noisy data.

VPRS is an effective mathematical tool with an error-tolerance capability for handling uncertainty problems. Basically, it is an extension of Pawlak's rough set theory [3-5] that allows for partial classification. By setting a confidence threshold value β, VPRS can not only solve classification problems with uncertain data and no functional relationship between attributes, but also relax the rigid boundary definition of Pawlak's rough set model to improve the model's suitability. Owing to the existence of β, the VPRS model can resist data noise and remove data errors [9]. A rational interval of variation for β can be determined [10], giving a new way to deal with noisy data [11]. Inspired by the ability of VPRS to handle noisy data, in this paper we propose an alternative technique for categorical data clustering that addresses the above issues. The selection of the clustering attribute is based on the minimum error classification, in order to obtain a better accuracy of approximation.

2. Rough Set Theory

2.1. Information System and Set Approximations

An information system is a 4-tuple (quadruple) S = (U, A, V, f), where U = {u_1, u_2, ..., u_|U|} is a non-empty finite set of objects, A = {a_1, a_2, ..., a_|A|} is a non-empty finite set of attributes, V = ∪_{a∈A} V_a, where V_a is the domain (value set) of attribute a, and f : U × A → V is an information (knowledge) function such that f(u, a) ∈ V_a for every (u, a) ∈ U × A.

Definition 1. Two elements x, y ∈ U are said to be B-indiscernible (indiscernible by the set of attributes B ⊆ A in S) if and only if f(x, a) = f(y, a) for every a ∈ B.

Obviously, every subset of A induces a unique indiscernibility relation. Notice that the indiscernibility relation induced by the set of attributes B, denoted by IND(B), is an equivalence relation. The partition of U induced by IND(B) is denoted by U/B, and the equivalence class in the partition U/B containing x ∈ U is denoted by [x]_B. The notions of lower and upper approximations of a set are defined as follows.

Definition 2. The B-lower approximation of X, denoted by $\underline{B}(X)$, and the B-upper approximation of X, denoted by $\overline{B}(X)$, are defined respectively by

$$\underline{B}(X) = \{ x \in U : [x]_B \subseteq X \}, \qquad \overline{B}(X) = \{ x \in U : [x]_B \cap X \neq \emptyset \}.$$

The accuracy of approximation (accuracy of roughness) of any subset X ⊆ U with respect to B ⊆ A is measured by

$$\alpha_B(X) = \frac{|\underline{B}(X)|}{|\overline{B}(X)|},$$

where |X| denotes the cardinality of X. Obviously, 0 ≤ α_B(X) ≤ 1. If α_B(X) = 1, X is crisp with respect to B (X is precise with respect to B); otherwise X is rough with respect to B (X is vague with respect to B).
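For illustration, the set approximations above can be sketched in a few lines of Python (a sketch of our own; the paper's implementation is in MATLAB, and all names here, such as equivalence_classes, are ours):

```python
def equivalence_classes(U, f, B):
    """Partition U by the B-indiscernibility relation:
    x ~ y iff f(x, a) == f(y, a) for every attribute a in B."""
    classes = {}
    for x in U:
        key = tuple(f[x][a] for a in B)
        classes.setdefault(key, set()).add(x)
    return list(classes.values())

def approximations(U, f, B, X):
    """B-lower and B-upper approximations of X (Definition 2)."""
    X = set(X)
    lower, upper = set(), set()
    for cls in equivalence_classes(U, f, B):
        if cls <= X:      # [x]_B entirely contained in X
            lower |= cls
        if cls & X:       # [x]_B intersects X
            upper |= cls
    return lower, upper

def accuracy(U, f, B, X):
    """Accuracy of roughness: |lower| / |upper|."""
    lower, upper = approximations(U, f, B, X)
    return len(lower) / len(upper) if upper else 1.0

# Toy usage: one attribute 'a' splitting four objects into {1,2} and {3,4}.
U = [1, 2, 3, 4]
f = {1: {'a': 'x'}, 2: {'a': 'x'}, 3: {'a': 'y'}, 4: {'a': 'y'}}
print(accuracy(U, f, ['a'], {1, 2, 3}))  # 0.5: lower = {1,2}, upper = {1,2,3,4}
```

Representing the information system as a dictionary mapping each object to its attribute-value map keeps the indiscernibility computation a single pass over U.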

2.2. Variable Precision Rough Set

The variable precision rough set (VPRS) model extends rough set theory by relaxing the subset operator [8]. It was proposed to analyze and identify data patterns that represent statistical trends rather than functional ones. The main idea of VPRS is to allow objects to be classified with an error rate smaller than a certain pre-defined level. The introduced threshold relaxes the rough set requirement of using no information outside the dataset itself.

Definition 4. Let U be a universe and X, Y ⊆ U. The error classification rate of X relative to Y, denoted e(X, Y), is defined by

$$e(X, Y) = \begin{cases} 1 - \dfrac{|X \cap Y|}{|X|}, & |X| > 0, \\ 0, & |X| = 0. \end{cases}$$

Definition 5. Let U be a finite set and X ⊆ U. Given a real number β within the range 0 ≤ β < 0.5, the β-lower approximation of X, denoted $\underline{B}_\beta(X)$, and the β-upper approximation of X, denoted $\overline{B}_\beta(X)$, are defined respectively by

$$\underline{B}_\beta(X) = \{ x \in U : e([x]_B, X) \le \beta \}, \qquad \overline{B}_\beta(X) = \{ x \in U : e([x]_B, X) < 1 - \beta \}.$$

The set $\underline{B}_\beta(X)$ is called the β-positive region of X: it is the set of objects of U that can be classified into X with an error classification rate not greater than β. We then have $\underline{B}_\beta(X) \subseteq \overline{B}_\beta(X)$ if and only if 0 ≤ β < 0.5, which means that β must be restricted to the interval [0, 0.5) in order to keep the meaning of the upper and lower approximations.

3. Minimum Error Classification Clustering (MECC) Technique

3.1. The MECC Technique for Selecting a Clustering Attribute

In this section we present the proposed technique, which we refer to as Minimum Error Classification Clustering (MECC). The technique is based on the accuracy of approximation of attributes in variable precision rough set theory, obtained by introducing a threshold with respect to the error classification. Proposition 8 proves that the accuracy of approximation obtained by introducing the threshold is higher, which makes it better suited for selecting a clustering attribute.

Definition 7. The accuracy of approximation of variable precision (accuracy of variable precision roughness) of any subset X ⊆ U with respect to B ⊆ A, denoted α_β(X), is

$$\alpha_\beta(X) = \frac{|\underline{B}_\beta(X)|}{|\overline{B}_\beta(X)|},$$

where |X| denotes the cardinality of X. If β = 0, it reduces to the traditional rough set model of Pawlak.
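The VPRS notions translate just as directly; a minimal sketch under the same representation (again our own Python, reusing equivalence_classes from the previous sketch):

```python
def error_rate(X, Y):
    """e(X, Y) = 1 - |X ∩ Y| / |X| if |X| > 0, else 0 (Definition 4)."""
    X, Y = set(X), set(Y)
    return 1.0 - len(X & Y) / len(X) if X else 0.0

def beta_approximations(U, f, B, X, beta):
    """Beta-lower and beta-upper approximations of X (Definition 5)."""
    assert 0.0 <= beta < 0.5, "beta must lie in [0, 0.5)"
    X = set(X)
    lower, upper = set(), set()
    for cls in equivalence_classes(U, f, B):
        e = error_rate(cls, X)
        if e <= beta:         # classified into X with error at most beta
            lower |= cls
        if e < 1.0 - beta:    # not confidently outside X
            upper |= cls
    return lower, upper

def vp_accuracy(U, f, B, X, beta):
    """Accuracy of variable precision roughness (Definition 7)."""
    lower, upper = beta_approximations(U, f, B, X, beta)
    return len(lower) / len(upper) if upper else 1.0
```

Setting beta = 0 recovers the classical approximations, which is the equality case of Proposition 8 below.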

Proposition 8. Let S = (U, A, V, f) be an information system, let α_B(X) be the accuracy of roughness, and let α_β(X) be the accuracy of variable precision roughness given the error factor β (0 ≤ β < 0.5). Then α_B(X) ≤ α_β(X).

Proof. Based on Definition 5, for 0 ≤ β < 0.5 we have $\underline{B}(X) \subseteq \underline{B}_\beta(X)$ and $\overline{B}_\beta(X) \subseteq \overline{B}(X)$. Consequently, $0 \le |\underline{B}(X)| \le |\underline{B}_\beta(X)|$ and $0 \le |\overline{B}_\beta(X)| \le |\overline{B}(X)|$. For β = 0, Definition 5 gives α_β(X) = α_B(X). For 0 < β < 0.5, since the numerator can only grow and the denominator can only shrink,

$$\frac{|\underline{B}(X)|}{|\overline{B}(X)|} \le \frac{|\underline{B}_\beta(X)|}{|\overline{B}_\beta(X)|}.$$

Therefore α_B(X) ≤ α_β(X). ∎

Definition 9. Let S = (U, A, V, f) be an information system and suppose attribute a_i ∈ A has n different values, say y_k, k = 1, 2, ..., n. Let X(a_i = y_k), k = 1, 2, ..., n, be the subsets of objects having the different values of attribute a_i. The error classification rate of X(a_i = y_k) relative to X(a_j = y_l), where i ≠ j, can be defined as follows:

$$e\big(X(a_i = y_k), X(a_j = y_l)\big) = 1 - \frac{|X(a_i = y_k) \cap X(a_j = y_l)|}{|X(a_i = y_k)|}.$$

The problem is the choice of the threshold β such that the accuracy of approximation is as high as possible while the error classification is as low as possible. Based on Proposition 8, there are three cases for β.

Case 1. If β ≥ 0.5, then the meaning of the upper and lower approximations breaks down, since $\underline{B}_\beta(X) \subseteq \overline{B}_\beta(X)$ no longer holds.

Case 2. If β = 0, then α_β(X) = α_B(X), so the accuracy does not increase.

Case 3. If 0 < β < 0.5, then α_B(X) ≤ α_β(X), so the accuracy will be better than with the traditional rough set.

From the three cases above, the threshold can be taken as a positive number less than 0.5. Then, from Definition 9, the threshold β > 0 can be chosen as the minimum of the error classification as follows:

$$\beta = \arg\min \big[ \operatorname{mean} \{ e(X(a_i = y_k), X(a_j = y_l)) \} \big],$$

where the mean is taken over the value pairs (y_k, y_l) and over all attributes a_j with j ≠ i. The attribute with minimum β > 0 is selected as the clustering decision.
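As a sketch of this selection rule (our own Python, reusing error_rate from the VPRS sketch; treating the mean as running over all value pairs of each candidate attribute against every other attribute is our reading of the formula):

```python
from statistics import mean

def value_classes(U, f, a):
    """The subsets X(a = y) of objects taking each value y of attribute a."""
    classes = {}
    for x in U:
        classes.setdefault(f[x][a], set()).add(x)
    return classes

def select_clustering_attribute(U, f, attributes):
    """MECC selection: for each candidate attribute a_i, average the error
    classification rates of every other attribute's value classes relative
    to a_i's value classes (as in the worked example below), then choose
    the attribute with the smallest positive mean."""
    betas = {}
    for ai in attributes:
        rates = [error_rate(Xj, Xi)          # e(X(a_j = y), X(a_i = z))
                 for aj in attributes if aj != ai
                 for Xj in value_classes(U, f, aj).values()
                 for Xi in value_classes(U, f, ai).values()]
        betas[ai] = mean(rates)
    positive = {a: b for a, b in betas.items() if b > 0}  # beta must be > 0
    return min(positive, key=positive.get), betas
```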

Algorithm: MECC
Input: Data set without clustering attribute
Output: Clustering attribute
Begin
Step 1. Compute the equivalence classes using the indiscernibility relation on each attribute.
Step 2. Determine the error classification of each attribute a_i with respect to all a_j, where i ≠ j.
Step 3. Compute the mean error classification β_i of each attribute a_i from Step 2.
Step 4. Select a clustering attribute based on the minimum β_i.
End

Figure 1. The Pseudo-code of MECC for Selecting a Clustering Attribute

3.2. Example

The following table is a student information system containing 15 students with 5 categorical-valued attributes: Programming, Mathematics, Statistics, English and French. There is no pre-defined clustering (decision) attribute, so we select a clustering attribute among all the candidates.

Table 1. A Student Information System

     Prog    Math          Stat  Eng     French
1    bad     low           no    fluent  poor
2    bad     intermediate  yes   poor    fluent
3    bad     intermediate  yes   fluent  fluent
4    bad     low           yes   fluent  fluent
5    bad     advance       no    fluent  poor
6    medium  low           yes   poor    poor
7    medium  intermediate  yes   fluent  poor
8    medium  advance       no    poor    poor
9    medium  intermediate  no    fluent  poor
10   good    low           yes   poor    fluent
11   good    advance       no    poor    fluent
12   good    low           yes   fluent  poor
13   good    advance       yes   fluent  poor
14   good    low           yes   fluent  poor
15   medium  advance       yes   fluent  fluent

The procedure to find the MECC value is described here. To obtain the values of MECC, we first obtain the equivalence classes induced by the indiscernibility relation of each singleton attribute:

X(Prog = bad) = {1, 2, 3, 4, 5},
X(Prog = medium) = {6, 7, 8, 9, 15},
X(Prog = good) = {10, 11, 12, 13, 14},
U/Prog = {{1, 2, 3, 4, 5}, {6, 7, 8, 9, 15}, {10, 11, 12, 13, 14}}.

X(Math = low) = {1, 4, 6, 10, 12, 14},
X(Math = intermediate) = {2, 3, 7, 9},
X(Math = advance) = {5, 8, 11, 13, 15},
U/Math = {{1, 4, 6, 10, 12, 14}, {2, 3, 7, 9}, {5, 8, 11, 13, 15}}.

X(Stat = no) = {1, 5, 8, 9, 11},
X(Stat = yes) = {2, 3, 4, 6, 7, 10, 12, 13, 14, 15},
U/Stat = {{1, 5, 8, 9, 11}, {2, 3, 4, 6, 7, 10, 12, 13, 14, 15}}.

X(Eng = fluent) = {1, 3, 4, 5, 7, 9, 12, 13, 14, 15},
X(Eng = poor) = {2, 6, 8, 10, 11},
U/Eng = {{1, 3, 4, 5, 7, 9, 12, 13, 14, 15}, {2, 6, 8, 10, 11}}.

X(French = poor) = {1, 5, 6, 7, 8, 9, 12, 13, 14},
X(French = fluent) = {2, 3, 4, 10, 11, 15},
U/French = {{1, 5, 6, 7, 8, 9, 12, 13, 14}, {2, 3, 4, 10, 11, 15}}.

Based on Definition 9, the error classification of attribute Statistics with respect to Math is calculated as follows:

e(low, no) = 1 − |{1, 4, 6, 10, 12, 14} ∩ {1, 5, 8, 9, 11}| / 6 = 1 − 1/6 = 5/6,
e(intermediate, no) = 1 − |{2, 3, 7, 9} ∩ {1, 5, 8, 9, 11}| / 4 = 1 − 1/4 = 3/4,
e(advance, no) = 1 − |{5, 8, 11, 13, 15} ∩ {1, 5, 8, 9, 11}| / 5 = 1 − 3/5 = 2/5,
e(low, yes) = 1 − 5/6 = 1/6,
e(intermediate, yes) = 1 − 3/4 = 1/4,
e(advance, yes) = 1 − 2/5 = 3/5.

Following the same procedure, the error classifications of all attributes with respect to each of the others are computed. These calculations are summarized in Table 2.
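Before turning to Table 2, these worked numbers can be checked mechanically with the sketches above; the encoding of Table 1 below is our own:

```python
rows = [  # (Prog, Math, Stat, Eng, French) for students 1..15, from Table 1
    ('bad', 'low', 'no', 'fluent', 'poor'),
    ('bad', 'intermediate', 'yes', 'poor', 'fluent'),
    ('bad', 'intermediate', 'yes', 'fluent', 'fluent'),
    ('bad', 'low', 'yes', 'fluent', 'fluent'),
    ('bad', 'advance', 'no', 'fluent', 'poor'),
    ('medium', 'low', 'yes', 'poor', 'poor'),
    ('medium', 'intermediate', 'yes', 'fluent', 'poor'),
    ('medium', 'advance', 'no', 'poor', 'poor'),
    ('medium', 'intermediate', 'no', 'fluent', 'poor'),
    ('good', 'low', 'yes', 'poor', 'fluent'),
    ('good', 'advance', 'no', 'poor', 'fluent'),
    ('good', 'low', 'yes', 'fluent', 'poor'),
    ('good', 'advance', 'yes', 'fluent', 'poor'),
    ('good', 'low', 'yes', 'fluent', 'poor'),
    ('medium', 'advance', 'yes', 'fluent', 'fluent'),
]
attrs = ['Prog', 'Math', 'Stat', 'Eng', 'French']
U = list(range(1, 16))
f = {i: dict(zip(attrs, row)) for i, row in zip(U, rows)}

# Error classification of Statistics with respect to Math (Definition 9):
for ym, Xm in value_classes(U, f, 'Math').items():
    for ys, Xs in value_classes(U, f, 'Stat').items():
        print(f"e({ym}, {ys}) = {error_rate(Xm, Xs):.4f}")
# e(low, no) = 0.8333, e(low, yes) = 0.1667, e(intermediate, no) = 0.7500,
# e(intermediate, yes) = 0.2500, e(advance, no) = 0.4000, e(advance, yes) = 0.6000
```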

Table 2. The Minimum Error Classification

Attribute   With respect to
Prog        Math, Stat, Eng, French
Math        Prog, Stat, Eng, French
Stat        Prog, Math, Eng, French
Eng         Prog, Math, Stat, French
French      Prog, Math, Stat, Eng

For each attribute, Table 2 lists the error classifications with respect to each of the other attributes, together with their mean. With the MECC technique, from Table 2 the minimum mean error classification is attained by attribute Statistics. Thus attribute Statistics is selected as the clustering attribute.

3.3. Objects Splitting

For objects splitting we use a divide-and-conquer method. For example, in Table 1 we can cluster (partition) the objects based on the selected decision attribute, i.e., Statistics. Notice that the partition of the set of students induced by attribute Statistics is

U/Stat = {{1, 5, 8, 9, 11}, {2, 3, 4, 6, 7, 10, 12, 13, 14, 15}}.

From this, we can split the objects using a hierarchical tree: the objects {1, 5, 8, 9, 11} and {2, 3, 4, 6, 7, 10, 12, 13, 14, 15} form the first possible clusters.

Figure 2. The Objects Splitting

The technique is applied recursively to obtain further clusters. At subsequent iterations, the leaf node having more objects is selected for further splitting. The algorithm terminates when it reaches a pre-defined number of clusters. This number is subjective and is decided in advance, based either on user requirements or on domain knowledge.

4. Experimental Results

4.1. Selecting the Clustering Attribute

We elaborate the proposed technique on three UCI benchmark datasets taken from the UCI Machine Learning Repository [12-14]. The Balloon dataset contains 16 instances and 4 categorical attributes: Color, Size, Act and Age. The Tic-Tac-Toe Endgame dataset contains 958 instances and 9 categorical attributes, plus a class attribute.

The nine attributes are top left square (TLS), top middle square (TMS), top right square (TRS), middle left square (MLS), middle middle square (MMS), middle right square (MRS), bottom left square (BLS), bottom middle square (BMS) and bottom right square (BRS). The Hayes-Roth dataset contains 132 training instances, 28 test instances and 4 attributes: hobby, age, educational level and marital status.

The algorithms of TR, MMR and MECC are implemented in MATLAB (R2008a). They are executed sequentially on an Intel Core 2 Duo processor. The total main memory is 2 GB and the operating system is Windows 7. The experimental results are summarized in Table 3.

Table 3. The Experimental Results

Technique                    Balloon      Tic-tac-toe  Hayes-Roth
TR    Attribute Selected     All          All          All
MMR   Attribute Selected     All          All          All
MECC  Attribute Selected     Act and Age  MMS          F3

TR, MMR and MECC use different criteria for selecting the clustering attribute. TR uses the total average of the mean roughness, MMR uses the minimum of the mean roughness, and MECC uses the error classification quality of the Variable Precision Rough Set to select a clustering attribute. Based on Table 3, no decision can be obtained using TR or MMR, because the TR and MMR values of the attributes are the same in all datasets (0 for TR and 1 for MMR, respectively). But a clustering attribute can be selected based on the minimum values using MECC. The accuracies on the three datasets are given in Figure 3.

Figure 3. The Accuracy of TR, MMR and MECC Techniques

4.2. Clustering Objects and Validity

In this sub-section we present the results of object partitioning. The purity of the clusters was used as a measure to test their quality.

The purity of a cluster and the overall purity are defined as

Purity(i) = (the number of data occurring in both the i-th cluster and its corresponding class) / (the number of data in the i-th cluster),

Overall Purity = ( Σ_{i=1}^{#clusters} Purity(i) ) / #clusters.

Balloon Dataset. Based on Table 3, the selected attributes are Act and Age, the MECC value of both attributes being the same. For attribute Act, the cluster purities give an overall purity of 0.83. For attribute Age, the cluster purities likewise give an overall purity of 0.83.

Tic-Tac-Toe Endgame Dataset. Based on Table 3, the selected attribute is MMS. For attribute MMS, the cluster purities give an overall purity of 0.69.

Hayes-Roth Dataset. Based on Table 3, the selected attribute is F3. For attribute F3, the cluster purities over the three classes give an overall purity of 0.63.
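A small sketch of this purity computation (our own Python; we read "its corresponding class" as the cluster's best-matching, i.e., majority, class):

```python
from collections import Counter

def overall_purity(clusters, classes):
    """clusters[i] / classes[i]: cluster label and true class of object i.
    A cluster's purity is the fraction of its members that fall in its
    best-matching (majority) class; overall purity is the mean over clusters."""
    purities = []
    for c in set(clusters):
        members = [cls for k, cls in zip(clusters, classes) if k == c]
        purities.append(Counter(members).most_common(1)[0][1] / len(members))
    return sum(purities) / len(purities)

# Example: two clusters over six objects.
print(overall_purity([0, 0, 0, 1, 1, 1],
                     ['a', 'a', 'b', 'b', 'b', 'b']))  # (2/3 + 3/3) / 2 = 0.8333
```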

4.3. Accuracy and Response Time

Benchmark data from the UCI Machine Learning Repository, i.e., Acute Inflammations, Balance Scale, Car Evaluation, Chess, Flag, Lenses, Lung Cancer, MONK's Problems, Mushroom, Soybean, Statlog (Landsat Satellite) and Zoo [15], are used to test MECC and compare it with MMR and TR in terms of response time and accuracy. The datasets are described in Table 4.

Table 4. The Benchmark Datasets

No  Data Set             Number of Instances  Number of Attributes
1   Lenses
2   Balance Scale
3   Car Evaluation
4   MONK's Problems
5   Acute Inflammations
6   Zoo
7   Soybean
8   Chess
9   Lung Cancer
10  Mushroom
11  Flag

The accuracy of selecting a clustering attribute refers to Definition 7, and the results are given in Figure 4. Meanwhile, the execution times on all datasets are given in Figures 5 and 6.

Figure 4. The Accuracy of TR, MMR and MECC Techniques

Figure 5. The Response Times of TR, MMR and MECC Techniques (Zoo, Balance Scale, Car, Chess, Lenses, Flag)

Figure 6. The Response Times of TR, MMR and MECC Techniques (Mushroom, MONK's, Soybean, Lung Cancer, Acute Inflammations)

Figure 4 illustrates the accuracy of selecting the clustering attribute. The accuracy of selecting the clustering attribute with the TR, MMR and MECC techniques is the same in almost all cases. However, the MECC technique has a lower execution time, due to the smaller amount of computation required, as shown in Figures 5 and 6. For example, on the Lung Cancer dataset, MECC improves on the execution times of TR and MMR.

5. Conclusion

In this paper we have proposed an alternative technique for categorical data clustering using the error classification in the VPRS model. We have shown that the proposed technique is able to handle noisy data, and we have presented an example of how it does so. Further, we have compared our technique on benchmark datasets taken from the UCI ML repository.

The results show that our technique provides better performance in selecting the clustering attribute. Since TR and MMR are based on the traditional definition of rough set theory, our technique is distinct from both.

Acknowledgements

This work was supported by Grant No. Vote PM-67/LPP-UAD/III/2012, Universitas Ahmad Dahlan, Indonesia.

References

[1] T. Li, "A General Model for Clustering Binary Data", Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD '05), Chicago, Illinois, USA, (2005) August.
[2] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery, vol. 2, no. 3, (1998), pp. 283-304.
[3] Z. Pawlak, "Rough sets", International Journal of Computer and Information Sciences, vol. 11, (1982), pp. 341-356.
[4] Z. Pawlak, "Rough Sets: A Theoretical Aspect of Reasoning About Data", Kluwer Academic Publishers, (1991).
[5] Z. Pawlak and A. Skowron, "Rudiments of rough sets", Information Sciences, vol. 177, no. 1, (2007), pp. 3-27.
[6] L. J. Mazlack, A. He, Y. Zhu and S. Coppock, "A rough set approach in choosing partitioning attributes", Proceedings of the ISCA 13th International Conference (CAINE-2000), Honolulu, Hawaii, USA, (2000) November 1-3.
[7] D. Parmar, T. Wu and J. Blackhurst, "MMR: An algorithm for clustering categorical data using rough set theory", Data and Knowledge Engineering, vol. 63, (2007), pp. 879-893.
[8] W. Ziarko, "Variable precision rough set model", Journal of Computer and System Sciences, vol. 46, (1993), pp. 39-59.
[9] D. Slezak and W. Ziarko, "The investigation of the Bayesian rough set model", International Journal of Approximate Reasoning, (2005).
[10] G. Xie, J. Zhang, K. K. Lai and L. Yu, "Variable precision rough set for group decision-making: An application", International Journal of Approximate Reasoning, (2008).
[11] I. T. R. Yanto, P. Vitasari, T. Herawan and M. Mat Deris, "Applying variable precision rough set model for clustering student suffering study's anxiety", Expert Systems with Applications, Elsevier, (2012).
[12] UCI Machine Learning Repository: Balloons Data Set.
[13] UCI Machine Learning Repository: Tic-Tac-Toe Endgame Data Set.
[14] UCI Machine Learning Repository: Hayes-Roth Data Set.
[15] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml.

Author

Iwan Tri Riyadi Yanto received his B.Sc degree in Mathematics from Universitas Ahmad Dahlan, Yogyakarta, Indonesia. He obtained his MIT from Universiti Tun Hussein Onn Malaysia. Currently he is a lecturer at the Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Ahmad Dahlan (UAD). He has published more than 15 research papers in journals and conferences. His research areas include numerical optimization, data mining and KDD.
