Rough Sets Used in the Measurement of Similarity of Mixed Mode Data
Sarah Coppock and Lawrence Mazlack
Applied Artificial Intelligence Laboratory, ECECS Department, University of Cincinnati, Cincinnati, Ohio

Abstract

Similarity is important in knowledge discovery. Cluster analysis, classification, and granulation each involve some notion or definition of similarity. The measurement of similarity is selected based on the domain and distribution of the data. Even within a specific domain, some similarity metrics may be considered more useful than others. There is an amount of uncertainty in quantitatively measuring the similarity between records of mixed data; the uncertainty develops from the lack of scale that both nominal and ordinal data have. Rough set theory is one tool developed for handling uncertainty, and rough sets can be used in dissimilarity analysis of qualitative data. It would seem that rough sets could be applied in measuring similarity between records containing both quantitative and qualitative data for the purpose of clustering the records.

1 Introduction

Similarity metrics are used in many fields. When determining the similarity between records in a data set that contains different kinds of data, a certain amount of uncertainty is introduced. While metrics such as the Euclidean metric and its generalized Minkowski metrics can be used when all of the data is quantitative (both discrete and continuous), it is not as easy to usefully combine them with scalar metrics representing qualitative data (both nominal and ordinal). Data can be categorized into qualitative and quantitative data; qualitative data can be further described as either ordinal or nominal. Ordinal data has order without scale, e.g., small, medium, large. Nominal data has no order and no scale, e.g., Cincinnati, Tampa, Atlanta. Data such as cities and colors can be argued as having some "order", e.g., latitudes or longitudes and frequencies; however, in the case of unsupervised learning, such knowledge is not explicitly known by the learning algorithms. For a more detailed discussion of data varieties, see [1] and [2]. Similarity is important in knowledge discovery.
Cluster analysis, classification, and granulation each involve some notion or definition of similarity. Measuring similarity between multidimensional, multi-modal data is difficult, but offers information. The information provided by clustering records based on similarity measurements includes an overall distribution of the data and the discovery of possible outliers. The measurement of similarity may be appropriately selected based on the domain and distribution of the data. Even within a domain, there may be some similarity metrics considered more useful than others. There is an amount of uncertainty in quantitatively measuring similarity (or dissimilarity) between records of mixed kinds of data. The uncertainty develops from the fact that both nominal and ordinal data lack a natural, fixed scale. For example, should we say that the similarity between red and orange is more or less than (or equal to) the similarity between blue and green? Some metrics assign a Boolean value for whether the values match; the similarities would then be considered equal. Rough set theory is one of the tools developed for handling uncertainty. Pawlak [4] demonstrates how rough sets can be used in dissimilarity analysis of qualitative data. It would seem that rough sets could be applied in measuring similarity between records containing both quantitative and qualitative data.

2 Rough sets in dissimilarity analysis

Rough sets are built on the notion of discernibility. There is an equivalence relation imposed on the items based on attribute values. For example, in Table 1, records that agree on every attribute in a set B are indiscernible with respect to B; the induced equivalence relation is denoted IND(B). A similar idea is used in similarity metrics for nominal and ordinal data: a simple matching technique, where a Boolean 0 or 1 is assigned based on whether two attribute values are the same for two records. Pawlak [4] describes applying rough sets to measure dissimilarity between records of Boolean values. A brief description of Pawlak's method to measure dissimilarity using rough sets follows, using his Middle East Situation example. This example is given in Table 1.
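The indiscernibility relation and the simple matching technique can be sketched in a few lines. This is an illustrative Python sketch, not code from the paper; the attribute names and records are made up for the example.

```python
from collections import defaultdict

def ind_partition(records, attrs):
    """Partition record indices into equivalence classes of IND(attrs):
    two records are indiscernible iff they agree on every attribute in attrs."""
    classes = defaultdict(list)
    for i, rec in enumerate(records):
        key = tuple(rec[a] for a in attrs)
        classes[key].append(i)
    return list(classes.values())

def simple_match(r1, r2, attrs):
    """Simple matching: a Boolean 1/0 per attribute, summed over the attributes."""
    return sum(1 if r1[a] == r2[a] else 0 for a in attrs)

# Hypothetical nominal/ordinal records:
records = [
    {"drink": "Coke", "grade": "B"},
    {"drink": "Coke", "grade": "C"},
    {"drink": "Pepsi", "grade": "B"},
]
print(ind_partition(records, ["drink"]))                       # → [[0, 1], [2]]
print(simple_match(records[0], records[2], ["drink", "grade"]))  # → 1
```

Records 0 and 1 fall into one equivalence class of IND({drink}) because they agree on that attribute; the matching count between records 0 and 2 is 1 because only the grade matches.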
Any attributes that have the same values as another attribute for all records, i.e., attributes that are equivalent to one another, are disregarded. For example, three attributes in Table 2 have the same value for each record; only one of the three would then be considered further in the process, but not all three. This is because once one of the attributes is taken into account, the other two do not offer any more information in computing the dissimilarity. In computing similarity, it would seem desirable to take into account that multiple attributes are equal; that is, the more values two records have in common, the greater the similarity between the records. Attributes that have the same value for all records are disregarded. For example, a5 in Table 2 would be disregarded, since there is only one value, 1, for all records; the attribute does not offer any information in discerning between any of the records. Attributes that are the negation of another, such as a4 with a8 in Table 2, are also disregarded; only one of a4 or a8 in Table 2 would be considered.

[Table 1. Middle East Situation example.]

[Table 2. Small example.]

[Table 4. Core values for Table 1.]

Table 1 is modified for measuring dissimilarity; one of the possible resulting modified sets is given in Table 3.

[Table 3. Possible modified set from Table 1.]

A graph is then constructed from the modified table. There is a node for each record, and a labeled edge between two nodes if removing the attribute in the label would put the records in the same equivalence class; that is, an edge labeled with an attribute connects two records that would be in the same equivalence class under the indiscernibility relation with that attribute removed. Figure 1 shows the graph for Table 3. The dissimilarity between two records is computed by determining the length of the shortest path between the corresponding nodes in the graph; for example, two records joined by a shortest path of two edges have a dissimilarity of 2.

[Figure 1. Graph for Table 3.]

Dissimilarity is the complement of similarity. Because of this relationship between dissimilarity and similarity, we could modify the above approach to quantify similarity between records. A consideration to make in modifying this approach is the generalization to multi-valued attributes, for example, an attribute with more than two or three values, such as the make of a car: {Ford, GM, Toyota, Nissan, BMW}.

3 Converting quantitative attributes to qualitative

A common approach to measuring similarity between records containing mixed data is to add the measurements of qualitative similarity and of quantitative similarity. Without knowledge of the domain, and specifically a description of the data set, finding an appropriate weighting that gives reasonable results would be computationally expensive. Methods to cluster quantitative data have been developed. One possibility for the discovery of similar records in multi-modal data would be to convert the quantitative attributes to one qualitative attribute according to the natural clusters in the quantitative attributes. The modified rough set dissimilarity analysis approach can then be applied.
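The graph-based dissimilarity of Section 2 can be sketched as follows. This is a hedged Python sketch of a straightforward implementation, not the paper's code: for Boolean tables, two records differ on exactly one attribute precisely when removing that attribute would merge them into one equivalence class, and that is the edge rule assumed below; dissimilarity is then a breadth-first shortest-path length.

```python
from collections import deque

def build_graph(records, attrs):
    """Add an edge between records i and j, labeled a, when they differ only on
    attribute a (so dropping a would put them in the same IND equivalence class)."""
    adj = {i: [] for i in range(len(records))}
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            diff = [a for a in attrs if records[i][a] != records[j][a]]
            if len(diff) == 1:
                adj[i].append((j, diff[0]))
                adj[j].append((i, diff[0]))
    return adj

def dissimilarity(adj, src, dst):
    """Shortest-path length between two record nodes via BFS.
    Returns None when the nodes are disconnected (the situation the paper
    observes once multi-valued attributes enter)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v, _label in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

# Tiny made-up Boolean table for illustration:
recs = [{"a1": 0, "a2": 0}, {"a1": 1, "a2": 0}, {"a1": 1, "a2": 1}]
adj = build_graph(recs, ["a1", "a2"])
print(dissimilarity(adj, 0, 2))  # → 2 (path r0 - r1 - r2)
```

Here records 0 and 1 differ only on a1 and records 1 and 2 only on a2, so the shortest path from record 0 to record 2 has length 2, matching the paper's example of a dissimilarity of 2.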
Table 5 gives an example mixed data set with one nominal attribute, one ordinal attribute (values {A, B, C, D, F}, in order), and one discrete quantitative attribute. Table 6 is the modified data set, with the quantitative attribute in Table 5 clustered into c1 and c2; the attribute a3 is the label of the cluster to which the value for the record belongs.

Table 5. Example mixed data set

  (nominal)   (ordinal)   (quantitative)
  Coke        B           4
  Coke        C           2
  Pepsi       B           1
  Pepsi       A           1
  Bud         F           2
  Heineken    B           3

Table 6. Modified data set from Table 5

  (nominal)   (ordinal)   a3
  Coke        B           c2
  Coke        C           c1
  Pepsi       B           c1
  Pepsi       A           c1
  Bud         F           c1
  Heineken    B           c2

A modified approach, as in Section 2, may now be applied to determine pair-wise record similarity. Figure 2 shows the graph associated with Table 6. From the graph it can be seen that some modification to handle multi-valued attributes needs to be made: the graph is not connected. Table 7, which provides the similarities, also demonstrates this need. The similarities are computed as (Dmax - Dij)/Dmax, where Dmax is the maximum dissimilarity over all pairs and Dij is the dissimilarity between ri and rj. After following the method in a straightforward manner to this point, it is unclear whether the difficulty of normalizing between attributes is handled. Regardless of the method used to cluster or granulate the data, it is difficult to evaluate whether the results are reasonable; that is, if we gave a small data set such as the one in our example to a number of students, how would they group the records?
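The label conversion of Section 3 and the similarity normalization used for Table 7 can be sketched together. This is an illustrative Python sketch: the threshold split into c1/c2 below is a stand-in for whatever clustering method actually supplies the labels, and the dissimilarity values fed to the normalizer are made up.

```python
from fractions import Fraction

def to_labels(values, threshold):
    """Illustrative discretization: values <= threshold -> 'c1', else 'c2'.
    (Any clustering of the quantitative attribute could supply these labels.)"""
    return ["c1" if v <= threshold else "c2" for v in values]

def normalized_similarity(dissim):
    """Convert pairwise dissimilarities {(i, j): Dij} to similarities
    (Dmax - Dij) / Dmax, as used for Table 7."""
    d_max = max(dissim.values())
    return {pair: Fraction(d_max - d, d_max) for pair, d in dissim.items()}

print(to_labels([4, 2, 1, 1, 2, 3], threshold=2))
# → ['c2', 'c1', 'c1', 'c1', 'c1', 'c2']  (the labeling shown in Table 6)

sims = normalized_similarity({(0, 1): 1, (0, 2): 2, (1, 2): 3})
print(sims[(0, 2)])  # → 1/3
```

With a maximum dissimilarity of 3, a pair at dissimilarity 2 gets similarity (3 - 2)/3 = 1/3, and the most dissimilar pair gets similarity 0.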
It seems that this situation is more suited to having a fuzzy measure associated with a single particular grouping or a clustering of records. The kind of measure, and how to define the measure function, is unclear at this point. In this case, heuristics such as those used in [3] and [6] can be used to restrict the search space to those groupings that would be more likely to have a higher measure of certainty.

[Figure 2. Graph associated with Table 6.]

[Table 7. Pairwise similarities for Table 5; a symmetric matrix with values in {0, 1/3, 2/3, 1}.]

4 Fusing quantitative and qualitative information

Metrics and methods have been developed to cluster data records that have only quantitative or only qualitative data. It is possible that information can be extracted by the fusion of the methods' results or of the different measures. Metrics are defined on, or can be normalized to, the interval [0, 1]. Quantitative measures lie on the whole continuous interval, while qualitative measures lie on a discrete linear subset of the interval.

4.1 Fusing quantitative and qualitative partitions

Metrics and methods have been developed to cluster records containing only one type of data. The results of these methods and metrics have different meanings; the characteristics that contribute to the similarity measures are different. It is possible that rough sets can be used in the fusion of the results of existing methods for the two sets of dimensions. Let Cq(X) denote a clustering method for the quantitative dimensions of the data set X and Cn(X) denote the clustering method for the qualitative dimensions of X; Cq clusters X based only on the quantitative attributes and Cn clusters X based only on the qualitative attributes. Let Cq(X) = {q1, q2, ..., qk} and Cn(X) = {n1, n2, ..., nm}, where the sets qi and ni are the clusters that result according to the quantitative and qualitative attributes, respectively. Table 8 shows one possibility for the results of Cq and Cn applied to the simple example data set in Table 5. Note that the qi are arbitrary and are not the result of any specific metric or method. There is one qi and one ni for every record. Let si = qi ∩ ni for a given ri. The set si contains all of the records considered similar to the record ri according to both a quantitative and a qualitative metric or method. There may be some order of the elements in qi according to their similarity to ri; that is, given qi = {qi(1), qi(2), ..., qi(k)}, where qi(j) is the j-th record in the set qi, it may be the case that s(ri, qi(j)) >= s(ri, qi(k)) >= s(ri, qi(m)) for j <= k <= m. The same may be true for the set ni.

[Table 8. Possible Cq and Cn for Table 5.]

[Table 9. The si for Table 8.]

Table 9 gives the si for the example from Table 8. We can infer from the si that one pair of records is more similar than another when the pair has the same membership in a greater number of the si (here, in all of the si) than the other pair, which differ in s1. We can also infer that two records belong together in the overall clustering of the data set when, for each si, they have the same membership. Thus far we have not addressed a weighting of attributes. For example, if there are 2 qualitative dimensions and 10 quantitative dimensions, it seems reasonable that the qi would have more weight in determining the overall clusters; the overall clustering would be more like the resulting quantitative clusters. The fact that there may exist an order to each of the sets leads to the idea that rough sets may be used in the development of a fuzzy measure. The measure may be of either a specific group identified as being similar or an overall clustering of mixed data.

4.2 Fusing qualitative and quantitative measures

The sets Cq and Cn provide less information than having pair-wise similarity measurements. Suppose we are given the following: Cn = {{x1}, {x2, x3}, {x5}} and Cq = {{x1, x2, x3, x5}, {x4, x6}}. Both {x1} and {x5} are different clusters. Suppose that the qualitative similarity involving {x1} is maximal, while the qualitative similarities between {x2, x3} and {x5} are less than maximal. Suppose also that the quantitative similarity involving {x1} is minimal, while the quantitative similarity involving {x5} is greater than minimal. We are not able to compare these similarities to determine if either pair should be kept together. It may be more useful to consider the pair-wise qualitative and quantitative similarities.

One can consider "rough sets" from the perspective of each record. In other words, there are those records which definitely belong in the same cluster as the record (the lower approximation), those that definitely do not belong in the same cluster, and those for which it is uncertain whether they belong in the same cluster (the boundary). Each of these can be determined by given similarity values. For example, we can say that for any two records, if the similarity measurement is less than some threshold, then they are not in each other's cluster approximation. One can define a similar threshold for those records that definitely belong in the same cluster. What these thresholds should be is subjective, both to a particular domain and to the metric that is used. Suppose we have the similarity matrices for the qualitative and quantitative dimensions given in Table 10 and Table 11, respectively. The qualitative measure is computed as:

  (number of matching attribute values) / (number of qualitative attributes)

The quantitative measure is computed as:

  1 - (1/|Q|) * sum over k in Q of |xik - xjk| / Rk

where Q is the set of quantitative attributes, xmk is the k-th attribute value for record m, and Rk is the range of attribute k.

Table 10. Qualitative similarities for Table 5

       r1    r2    r3    r4    r5    r6
  r1   1     1/2   1/2   0     0     1/2
  r2   1/2   1     0     0     0     0
  r3   1/2   0     1     1/2   0     1/2
  r4   0     0     1/2   1     0     0
  r5   0     0     0     0     1     0
  r6   1/2   0     1/2   0     0     1

Table 11. Quantitative similarities for Table 5

       r1    r2    r3    r4    r5    r6
  r1   1     1/3   0     0     1/3   2/3
  r2   1/3   1     2/3   2/3   1     2/3
  r3   0     2/3   1     1     2/3   1/3
  r4   0     2/3   1     1     2/3   1/3
  r5   1/3   1     2/3   2/3   1     2/3
  r6   2/3   2/3   1/3   1/3   2/3   1

Table 12 and Table 13 give the approximations, with the lower threshold of 1/2 and the upper approximation threshold of 9/10: 0 denotes that the record is not in the approximation, 1 denotes that the record is in the lower approximation, and '--' denotes that the record is in the boundary. For example, in both Table 12 and Table 13, r6 is in the boundary for r1. From Table 12 and Table 13 we can see that, for the cluster including r1, the most likely record in the same cluster would be r6, since it is in both approximations. One could use a similar idea to Section 4.1 and use the union of the upper approximations to determine likely clusters, for example {r1, r6}, based on the sets for both tables.

Table 12. Approximations for the qualitative attributes

       r1    r2    r3    r4    r5    r6
  r1   1     --    --    0     0     --
  r2   --    1     0     0     0     0
  r3   --    0     1     --    0     --
  r4   0     0     --    1     0     0
  r5   0     0     0     0     1     0
  r6   --    0     --    0     0     1

Table 13. Approximations for the quantitative attributes

       r1    r2    r3    r4    r5    r6
  r1   1     0     0     0     0     --
  r2   0     1     --    --    1     --
  r3   0     --    1     1     --    0
  r4   0     --    1     1     --    0
  r5   0     1     --    --    1     --
  r6   --    --    0     0     --    1

The difficulty in comparing different measures is still present, because there still exists the problem of which approximation a resulting cluster should be more like. For example, since the thresholds, and therefore the equivalence relations, are based on two different measures, we cannot infer which of the candidate clusterings would be a likely result. For this reason it would seem that a fuzzy measure is needed for the unsupervised discovery of similar records in mixed data.

Summary

This paper discussed two approaches for determining similarity between records of mixed data. From both ideas it can be seen that the uncertainty and vagueness of qualitative data, and the difficulty of trying to combine metrics, leave rough set theory as an optional tool to be used. As concluded in the discussion, an additional or different approach is needed for the discovery of similar groups of records within data sets of mixed data.

References

[1] Everitt, B. Cluster Analysis, 3rd ed. Hodder & Stoughton, London, 1993.
[2] Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2001.
[3] He, A. Unsupervised Data Mining by Recursive Partitioning. Master's Thesis, University of Cincinnati.
[4] Pawlak, Z. Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishers, Dordrecht, 1991.
[5] Sneath, P. and Sokal, R. Numerical Taxonomy. W. H. Freeman, San Francisco, 1973.
[6] Zhu, Y. Unsupervised Database Discovery Based on Artificial Intelligence Techniques. Master's Thesis, University of Cincinnati.
More informationResearch Article Special Approach to Near Set Theory
Mathematical Problems in Engineering Volume 2011, Article ID 168501, 10 pages doi:10.1155/2011/168501 Research Article Special Approach to Near Set Theory M. E. Abd El-Monsef, 1 H. M. Abu-Donia, 2 and
More informationAveraging of the inelastic cross-section measured by the CDF and the E811 experiments.
Averagg of the astic cross-section measured by the CDF and the E8 experiments. S. Klimenko, J. Konigsberg, T. Liss. Introduction In un II the Tevatron lumosity is measured usg the system of Cherenkov Lumosity
More informationCS570 Introduction to Data Mining
CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Data Exploration and Data Preprocessing Data and Attributes Data exploration Data pre-processing Data cleaning
More informationVirtual Control Policy for Binary Ordered Resources Petri Net Class
sensors Article Virtual Control Policy Bary Ordered Resources Petri Net Class Carlos A. Rovet, Tomás J. Concepción Elia Esr Cano Computer Systems Engeerg Department, Technological University Panama, 0819-07289,
More informationA Working Distance Formula for Night Vision Devices Quality Preliminary Information *
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 6 No 3 Sofia 2006 A Workg Distance Formula for Night Vision Devices Quality Prelimary Information * Daniela Borissova Ivan
More informationA Patent Document Retrieval System Addressing Both Semantic and Syntactic Properties
A Patent Document Retrieval System Addressg Both Semantic and Syntactic Properties Liang Chen Naoyuki Tokuda Hisahiro Adachi Computer Science Department University of Northern British Columbia Prce George,
More informationTURBULENT VORTEX SHEDDING FROM TRIANGLE CYLINDER USING THE TURBULENT BODY FORCE POTENTIAL MODEL
Proceedgs of ASME FEDSM ASME 2 Fluids Engeerg Division Summer Meetg June 11-15, 2 Boston, Massachusetts FEDSM2-11172 TURBULENT VORTEX SHEDDING FROM TRIANGLE CYLINDER USING THE TURBULENT BODY FORCE POTENTIAL
More information2.4 The Smith Chart. Reading Assignment: pp The Smith Chart. The Smith Chart provides: The most important fact about the Smith Chart is:
2/7/2005 2_4 The Smith Chart 1/2 2.4 The Smith Chart Readg Assignment: pp. 64-73 The Smith Chart The Smith Chart provides: 1) 2) The most important fact about the Smith Chart is: HO: The Complex Γ plane
More informationSpectral Clustering. Zitao Liu
Spectral Clustering Zitao Liu Agenda Brief Clustering Review Similarity Graph Graph Laplacian Spectral Clustering Algorithm Graph Cut Point of View Random Walk Point of View Perturbation Theory Point of
More informationThe Lefthanded Local Lemma characterizes chordal dependency graphs
The Lefthanded Local Lemma characterizes chordal dependency graphs Wesley Pegden March 30, 2012 Abstract Shearer gave a general theorem characterizing the family L of dependency graphs labeled with probabilities
More informationMATHEMATICS OF DATA FUSION
MATHEMATICS OF DATA FUSION by I. R. GOODMAN NCCOSC RDTE DTV, San Diego, California, U.S.A. RONALD P. S. MAHLER Lockheed Martin Tactical Defences Systems, Saint Paul, Minnesota, U.S.A. and HUNG T. NGUYEN
More informationInteraction Analysis of Spatial Point Patterns
Interaction Analysis of Spatial Point Patterns Geog 2C Introduction to Spatial Data Analysis Phaedon C Kyriakidis wwwgeogucsbedu/ phaedon Department of Geography University of California Santa Barbara
More informationText Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University
Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data
More information2.4 Parsing. Computer Science 332. Compiler Construction. Chapter 2: A Simple One-Pass Compiler : Parsing. Top-Down Parsing
Computer Science 332 Compiler Construction Chapter 2: A Simple One-Pass Compiler 2.4-2.5: Parsg 2.4 Parsg Parsg : the process of determg whether a strg S is generated by a grammar G Short answer is yes/no
More informationInderjit Dhillon The University of Texas at Austin
Inderjit Dhillon The University of Texas at Austin ( Universidad Carlos III de Madrid; 15 th June, 2012) (Based on joint work with J. Brickell, S. Sra, J. Tropp) Introduction 2 / 29 Notion of distance
More informationLecture 12 : Graph Laplacians and Cheeger s Inequality
CPS290: Algorithmic Foundations of Data Science March 7, 2017 Lecture 12 : Graph Laplacians and Cheeger s Inequality Lecturer: Kamesh Munagala Scribe: Kamesh Munagala Graph Laplacian Maybe the most beautiful
More informationComparison of Rough-set and Interval-set Models for Uncertain Reasoning
Yao, Y.Y. and Li, X. Comparison of rough-set and interval-set models for uncertain reasoning Fundamenta Informaticae, Vol. 27, No. 2-3, pp. 289-298, 1996. Comparison of Rough-set and Interval-set Models
More informationData Mining 4. Cluster Analysis
Data Mining 4. Cluster Analysis 4.2 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Data Structures Interval-Valued (Numeric) Variables Binary Variables Categorical Variables Ordinal Variables Variables
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction
More informationLecture notes 1: ECEN 489
Lecture notes : ECEN 489 Power Management Circuits and Systems Department of Electrical & Computer Engeerg Texas A&M University Jose Silva-Martez January 207 Copyright Texas A&M University. All rights
More informationSTATISTICS 407 METHODS OF MULTIVARIATE ANALYSIS TOPICS
STATISTICS 407 METHODS OF MULTIVARIATE ANALYSIS TOPICS Principal Component Analysis (PCA): Reduce the, summarize the sources of variation in the data, transform the data into a new data set where the variables
More informationDecision Tree Analysis for Classification Problems. Entscheidungsunterstützungssysteme SS 18
Decision Tree Analysis for Classification Problems Entscheidungsunterstützungssysteme SS 18 Supervised segmentation An intuitive way of thinking about extracting patterns from data in a supervised manner
More informationBranch-and-Cut for the Split Delivery Vehicle Routing Problem with Time Windows
Gutenberg School of Management and Economics & Research Unit Interdisciplary Public Policy Discussion Paper Series Branch-and-Cut for the Split Delivery Vehicle Routg Problem with Time Wdows Nicola Bianchessi
More informationChapter 6: Classification
Chapter 6: Classification 1) Introduction Classification problem, evaluation of classifiers, prediction 2) Bayesian Classifiers Bayes classifier, naive Bayes classifier, applications 3) Linear discriminant
More informationType of data Interval-scaled variables: Binary variables: Nominal, ordinal, and ratio variables: Variables of mixed types: Attribute Types Nominal: Pr
Foundation of Data Mining i Topic: Data CMSC 49D/69D CSEE Department, e t, UMBC Some of the slides used in this presentation are prepared by Jiawei Han and Micheline Kamber Data Data types Quality of data
More informationi jand Y U. Let a relation R U U be an
Dependency Through xiomatic pproach On Rough Set Theory Nilaratna Kalia Deptt. Of Mathematics and Computer Science Upendra Nath College, Nalagaja PIN: 757073, Mayurbhanj, Orissa India bstract: The idea
More informationA Logical Formulation of the Granular Data Model
2008 IEEE International Conference on Data Mining Workshops A Logical Formulation of the Granular Data Model Tuan-Fang Fan Department of Computer Science and Information Engineering National Penghu University
More informationarxiv: v1 [cs.ai] 7 Sep 2016
Equilibrium Graphs Pedro Cabalar, Carlos Pérez, and Gilberto Pérez Department of Computer Science University of Corunna, Spa {cabalar,c.pramil,gperez}@udc.es arxiv:1609.02010v1 [cs.ai] 7 Sep 2016 Abstract.
More informationExtended breadth-first search algorithm in practice
Proceedings of the 9 th International Conference on Applied Informatics Eger, Hungary, January 29 February 1, 2014. Vol. 1. pp. 59 66 doi: 10.14794/ICAI.9.2014.1.59 Extended breadth-first search algorithm
More informationHierarchical Clustering via Spreading Metrics
Journal of Machine Learning Research 18 2017) 1-35 Submitted 2/17; Revised 5/17; Published 8/17 Hierarchical Clustering via Spreading Metrics Aurko Roy College of Computing Georgia Institute of Technology
More informationRough operations on Boolean algebras
Rough operations on Boolean algebras Guilin Qi and Weiru Liu School of Computer Science, Queen s University Belfast Belfast, BT7 1NN, UK Abstract In this paper, we introduce two pairs of rough operations
More informationHigh Frequency Rough Set Model based on Database Systems
High Frequency Rough Set Model based on Database Systems Kartik Vaithyanathan kvaithya@gmail.com T.Y.Lin Department of Computer Science San Jose State University San Jose, CA 94403, USA tylin@cs.sjsu.edu
More information3) Aft bolted connection analysis: (See Figure 1.0)
Given: Both static and dynamic (fatigue) failure criteria will be used. A mimum factor of safety =2 will be adhered to. For fatigue analysis the ASME elliptic model with Von Mises equivalent stress will
More informationIndex. C, system, 8 Cech distance, 549
Index PF(A), 391 α-lower approximation, 340 α-lower bound, 339 α-reduct, 109 α-upper approximation, 340 α-upper bound, 339 δ-neighborhood consistent, 291 ε-approach nearness, 558 C, 443-2 system, 8 Cech
More informationThe three-dimensional matching problem in Kalmanson matrices
DOI 10.1007/s10878-011-9426-y The three-dimensional matching problem in Kalmanson matrices Sergey Polyakovskiy Frits C.R. Spieksma Gerhard J. Woeginger The Author(s) 2011. This article is published with
More informationClustering Lecture 1: Basics. Jing Gao SUNY Buffalo
Clustering Lecture 1: Basics Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics Clustering
More information