Similarity Measures for Categorical Data: A Comparative Evaluation


Shyam Boriah    Varun Chandola    Vipin Kumar
Department of Computer Science and Engineering, University of Minnesota
{sboriah,chandola,kumar}@cs.umn.edu

Redistribution subject to SIAM license or copyright.

Abstract

Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances, but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates the others for all types of problems, some measures achieve consistently high performance.

1 Introduction

Measuring similarity or distance between two data points is a core requirement for several data mining and knowledge discovery tasks that involve distance computation. Examples include clustering (k-means), distance-based outlier detection, classification (kNN, SVM), and several other data mining tasks. These algorithms typically treat the similarity computation as an orthogonal step and can make use of any measure. For continuous data sets, the Minkowski distance is a general method used to compute the distance between two multivariate points. In particular, the Minkowski distances of order 1 (Manhattan) and order 2 (Euclidean) are the two most widely used distance measures for continuous data. The key observation about the above measures is that they are independent of the underlying data set to which the two points belong. Several data-driven measures, such as the Mahalanobis distance, have also been explored for continuous data.
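The Minkowski family mentioned above is straightforward to write down; the sketch below (our illustrative code, not from the paper) shows the order-1 and order-2 special cases.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two continuous points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# Order 1 (Manhattan) and order 2 (Euclidean) on the same pair of points:
d1 = minkowski((0, 0), (3, 4), p=1)   # 3 + 4 = 7.0
d2 = minkowski((0, 0), (3, 4), p=2)   # sqrt(9 + 16) = 5.0
```

Note that neither distance consults the rest of the data set, which is exactly the property the data-driven measures discussed later give up.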
The notion of similarity or distance for categorical data is not as straightforward as for continuous data. The key characteristic of categorical data is that the different values that a categorical attribute takes are not inherently ordered. Thus, it is not possible to directly compare two different categorical values. The simplest way to find similarity between two categorical attributes is to assign a similarity of 1 if the values are identical and a similarity of 0 if the values are not identical. For two multivariate categorical data points, the similarity between them will be directly proportional to the number of attributes in which they match. This simple measure is also known as the overlap measure in the literature [33]. One obvious drawback of the overlap measure is that it does not distinguish between the different values taken by an attribute. All matches, as well as mismatches, are treated as equal. For example, consider a categorical data set D, defined over two attributes: color and shape. Let color take 3 possible values in D: {red, blue, green}, and let shape take 3 possible values in D: {square, circle, triangle}. Table 1 summarizes the frequency of occurrence for each possible combination in D.

Table 1: Frequency distribution of a simple 2-D categorical data set (rows: color in {red, blue, green}; columns: shape in {square, circle, triangle}; with row and column totals).

The overlap similarity between the two instances (green, square) and (green, circle) is 1/2. The overlap similarity between (blue, square) and (blue, circle) is also 1/2. But the frequency distribution in Table 1 shows that while (blue, square) and (blue, circle) are frequent combinations, (green, square) and (green, circle) are very rare

combinations in the data set. Thus, it would appear that the overlap measure is too simplistic in giving equal importance to all matches and mismatches. Although there is no inherent ordering in categorical data, the previous example shows that there is other information in categorical data sets that can be used to define what should be considered more similar and what should be considered less similar. This observation has motivated researchers to come up with data-driven similarity measures for categorical attributes. Such measures take into account the frequency distribution of different attribute values in a given data set to define similarity between two categorical attribute values. In this paper, we study a variety of similarity measures proposed in diverse research fields ranging from statistics to ecology, as well as many of their variations. Each measure uses the information present in the data uniquely to define similarity. Since we are evaluating data-driven similarity measures, it is obvious that their performance is highly related to the data set being analyzed. To understand this relationship, we first identify the key characteristics of a categorical data set. For each of the different similarity measures that we study, we analyze how it relates to the different characteristics of the data set.

1.1 Key Contributions. The key contributions of this paper are as follows:

We bring together fourteen different categorical measures from different fields and study them together in a single context. Many of these measures have not been investigated outside the domain they were introduced in, and have not been compared with other measures.

We classify the categorical measures in three different ways based on how they utilize information in the data.

We evaluate the various similarity measures for categorical data on a wide variety of benchmark data sets.
In particular, we show the utility of data-driven measures for the problem of determining similarity with categorical data.

We also propose a number of new measures that are either variants of other previously proposed measures, or derived from previously proposed similarity frameworks. The performance of some of the measures we propose is among the best of all the measures we study.

1.2 Organization of the Paper. The rest of the paper is organized as follows. We first discuss related efforts in the study of similarity measures in Section 2. In Section 3, we identify various characteristics of categorical data that are relevant to this study. We then introduce the 14 different similarity measures studied in this paper in Section 4. We describe our experimental setup, evaluation methodology and the results on public data sets in Section 5.

2 Related Work

Sneath and Sokal discuss categorical similarity measures in some detail in their book [3] on numerical taxonomy. They were among the first to put together and discuss many such measures. At the time, two major concerns were (1) biological relevance, since numerical taxonomy was mainly concerned with taxonomies from biology, ecology, etc., and (2) computational efficiency, since computational resources were limited and scarce. Nevertheless, many of the observations made by Sneath and Sokal are quite relevant today and offer key insights into many of the measures. There are several books [, 8, 6, ] on cluster analysis that discuss the problem of determining similarity between categorical attributes. However, most of these books do not offer solutions to the problem or discuss the measures in this paper; the usual recommendation is to binarize the data and then use binary similarity measures. Wilson and Martinez [36] performed a detailed study of heterogeneous distance functions (for data with categorical and continuous attributes) for instance-based learning.
The measures in their study are based upon a supervised approach, where each data instance has class information in addition to a set of categorical/continuous attributes. The measures discussed in this paper are orthogonal to the ones proposed in [36], since supervised measures determine similarity based on class information, while data-driven measures determine similarity based on the data distribution. In principle, both ideas can be combined. A number of new data mining techniques for categorical data have been proposed recently. Some of them use notions of similarity which are neighborhood-based [5, 4, 8, 6,, ], or incorporate the similarity computation into the learning algorithm [3, 7, ]. Neighborhood-based approaches use some notion of similarity (usually the overlap measure) to define the neighborhood of a data instance, while the measures we study in this paper are directly used to determine similarity between a pair of data instances; hence, we see the measures discussed in this paper as being useful to compute the neighborhood of a point and

neighborhood-based measures as meta-similarity measures. Since techniques which embed similarity measures into the learning algorithm do not explicitly define general categorical similarity measures, we do not discuss them in this paper. Jones and Furnas [0] studied several similarity measures in the field of information retrieval. In particular, they performed a geometric analysis of continuous measures in order to reveal important differences which would affect retrieval performance. Noreault et al. [5] also studied measures in information retrieval, with the goal of generalizing effectiveness based on empirically evaluating the performance of the measures. Another comparative empirical evaluation for determining similarity between fuzzy sets was performed by Zwick et al. [37], followed by several others [7, 35].

3 Categorical Data

Categorical data (also known as nominal or qualitative multi-state data) has been studied for a long time in various contexts. As mentioned earlier, computing similarity between categorical data instances is not straightforward, owing to the fact that there is no explicit notion of ordering between categorical values. To overcome this problem, several data-driven similarity measures have been proposed for categorical data. The behavior of such measures directly depends on the data. In this section, we identify the key characteristics of a categorical data set that can potentially affect the behavior of a data-driven similarity measure. For notation, consider a categorical data set D containing N objects, defined over a set of d categorical attributes, where A_k denotes the k-th attribute. Let the attribute A_k take n_k values in the given data set, denoted by the set A_k. We also use the following notation:

f_k(x): the number of times attribute A_k takes the value x in the data set D.
Note that if x ∉ A_k, then f_k(x) = 0.

p̂_k(x): the sample probability of attribute A_k taking the value x in the data set D, given by

    p̂_k(x) = f_k(x) / N

p²_k(x): another probability estimate of attribute A_k taking the value x in the data set, given by

    p²_k(x) = f_k(x)(f_k(x) − 1) / (N(N − 1))

3.1 Characteristics of Categorical Data. Since this paper discusses data-driven similarity measures for categorical data, a key task is to identify the characteristics of a categorical data set that affect the behavior of such a similarity measure. We enumerate these characteristics below:

Size of data, N. As we will see later, most measures are typically invariant to the size of the data, though there are some measures (e.g., Smirnov) that do make use of this information.

Number of attributes, d. Most measures are invariant to this characteristic, since they typically normalize the similarity over the number of attributes. But in our experimental results we observe that the number of attributes does affect the performance of the outlier detection algorithms.

Number of values taken by each attribute, n_k. A data set might contain attributes that take several values and attributes that take very few values. For example, one attribute might take several hundred possible values, while another attribute might take very few values. A similarity measure might give more importance to the second attribute, while ignoring the first one. In fact, one of the measures discussed in this paper (Eskin) behaves exactly like this.

Distribution of f_k(x). This refers to the distribution of the frequencies of the values taken by an attribute in the given data set. In certain data sets an attribute might be distributed uniformly over the set A_k, while in others the distribution might be skewed. A similarity measure might give more importance to attribute values that occur rarely, while another similarity measure might give more importance to frequently occurring attribute values.
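The three quantities defined above can be computed directly from a data set; the following sketch (the toy data and function names are ours, for illustration only) implements f_k(x), p̂_k(x) and p²_k(x).

```python
from collections import Counter

# Toy data set D: rows are instances, columns are categorical attributes.
D = [
    ("red", "square"), ("red", "square"), ("blue", "circle"),
    ("blue", "circle"), ("green", "triangle"),
]
N = len(D)

def f(k, x):
    """f_k(x): number of times attribute k takes the value x in D."""
    # Recomputing the Counter per call is wasteful but keeps the sketch short.
    return Counter(row[k] for row in D)[x]

def p_hat(k, x):
    """Sample probability: p̂_k(x) = f_k(x) / N."""
    return f(k, x) / N

def p2(k, x):
    """Second estimate: p²_k(x) = f_k(x)(f_k(x) - 1) / (N(N - 1))."""
    fx = f(k, x)
    return fx * (fx - 1) / (N * (N - 1))

# For attribute 0 (color): f_0("red") = 2, p̂_0("red") = 0.4,
# p²_0("red") = 2*1 / (5*4) = 0.1
```

Note that p²_k(x) is zero for values seen only once, which is why, as discussed later, measures built on it (the Goodall variants) react strongly to rare values.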
4 Similarity Measures for Categorical Data

The study of similarity between data objects with categorical variables has had a long history. Pearson proposed a chi-square statistic in the late 1800s, which is often used to test independence between categorical variables in a contingency table. Pearson's chi-square statistic was later modified and extended, leading to several other measures [8, 4, 7]. More recently, however, the overlap measure has become the most commonly used similarity measure for categorical data. Its popularity is perhaps related to its simplicity and ease of use. In this section, we discuss the overlap measure and several data-driven similarity measures for categorical data.

Note that we have converted measures that were originally proposed as distances into similarity measures, in order to make the measures comparable in this study. The measures discussed henceforth will all be in the context of similarity, with distance measures converted using the formula:

    sim = 1 / (1 + dist)

Almost all similarity measures assign a similarity value between two data instances X and Y belonging to the data set D (introduced in Section 3) as follows:

    (4.1)    S(X, Y) = Σ_{k=1}^{d} w_k S_k(X_k, Y_k)

where S_k(X_k, Y_k) is the per-attribute similarity between two values for the categorical attribute A_k. Note that X_k, Y_k ∈ A_k. The quantity w_k denotes the weight assigned to the attribute A_k.

To understand how different measures calculate the per-attribute similarity S_k(X_k, Y_k), consider a categorical attribute A which takes one of the values {a, b, c, d}. We have dropped the subscript k for simplicity. The per-attribute similarity computation is equivalent to constructing the (symmetric) matrix shown in Figure 1:

        a        b        c        d
    a   S(a, a)  S(a, b)  S(a, c)  S(a, d)
    b            S(b, b)  S(b, c)  S(b, d)
    c                     S(c, c)  S(c, d)
    d                              S(d, d)

Figure 1: Similarity matrix for a single categorical attribute.

Essentially, in determining the similarity between two values, any categorical measure is filling in the entries of this matrix. For example, the overlap measure sets the diagonal entries to 1 and the off-diagonal entries to 0, i.e., the similarity is 1 if the values match and 0 if the values mismatch.
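Equation 4.1 separates every measure into a per-attribute similarity S_k and an attribute weight w_k. A minimal sketch of this decomposition (our illustrative code, not from the paper), with the overlap measure as the simplest instance:

```python
from typing import Callable, Sequence

def similarity(x: Sequence, y: Sequence,
               per_attr: Callable[[int, object, object], float],
               weight: Callable[[int], float]) -> float:
    """S(X, Y) = sum over k of w_k * S_k(X_k, Y_k)   (Equation 4.1)."""
    return sum(weight(k) * per_attr(k, x[k], y[k]) for k in range(len(x)))

def overlap(x: Sequence, y: Sequence) -> float:
    """Overlap: S_k is the 0/1 match indicator and w_k = 1/d for every k."""
    return similarity(x, y,
                      per_attr=lambda k, a, b: 1.0 if a == b else 0.0,
                      weight=lambda k: 1.0 / len(x))

# The pairs from the introduction match on 1 of 2 attributes:
s = overlap(("green", "square"), ("green", "circle"))   # 0.5
```

Any of the data-driven measures described below can be plugged in by swapping `per_attr` and `weight`; only Anderberg resists this decomposition.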
Additionally, measures may use the following information in computing a similarity value (all the measures in this paper use only this information):

f(a), f(b), f(c), f(d), the frequencies of the values in the data set;

N, the size of the data set;

n, the number of values taken by the attribute (4 in the case above).

We can classify measures in several ways, based on: (i) the manner in which they fill the entries of the similarity matrix, (ii) whether the weight given to matches or mismatches is a function of the frequency of the attribute values, and (iii) the arguments used to propose the measure (probabilistic, information-theoretic, etc.). In this paper, we describe the measures by classifying them as follows:

Those that fill the diagonal entries only. These measures set the off-diagonal entries to 0 (mismatches are uniformly given the minimum value) and give possibly different weights to matches.

Those that fill the off-diagonal entries only. These measures set the diagonal entries to 1 (matches are uniformly given the maximum value) and give possibly different weights to mismatches.

Those that fill both diagonal and off-diagonal entries. These measures give different weights to both matches and mismatches.

Table 2 gives the mathematical formulas for the measures described in this paper; each measure computes the per-attribute similarity S_k(X_k, Y_k) as shown in the second column, and the attribute weight w_k as shown in the third column.

4.1 Measures that fill Diagonal Entries only.

1. Overlap. The overlap measure simply counts the number of attributes that match in the two data instances. The range of per-attribute similarity for the overlap measure is [0, 1], with a value of 0 occurring when there is no match, and a value of 1 occurring when the attribute values match.

2. Goodall. Goodall [4] proposed a measure that attempts to normalize the similarity between two objects by the probability that the observed similarity value could be observed in a random sample of two points.
This measure assigns higher similarity to a match if the value is infrequent than if the value is frequent. Goodall's original measure details a procedure for combining similarities in the multivariate setting which takes into account dependencies between attributes. Since this procedure is computationally expensive, we use a simpler version of the measure (described next as Goodall1). Goodall's original measure is not empirically evaluated in this paper. We also propose three other variants of Goodall's measure in this paper: Goodall2, Goodall3 and Goodall4.

Table 2: Similarity measures for categorical attributes. Each measure defines a per-attribute similarity S_k(X_k, Y_k) and an attribute weight w_k, k = 1, ..., d, with S(X, Y) = Σ_{k=1}^{d} w_k S_k(X_k, Y_k).

1. Overlap:
   S_k = 1 if X_k = Y_k; 0 otherwise.    w_k = 1/d.

2. Eskin:
   S_k = 1 if X_k = Y_k; n_k² / (n_k² + 2) otherwise.    w_k = 1/d.

3. IOF:
   S_k = 1 if X_k = Y_k; 1 / (1 + log f_k(X_k) · log f_k(Y_k)) otherwise.    w_k = 1/d.

4. OF:
   S_k = 1 if X_k = Y_k; 1 / (1 + log(N / f_k(X_k)) · log(N / f_k(Y_k))) otherwise.    w_k = 1/d.

5. Lin:
   S_k = 2 log p̂_k(X_k) if X_k = Y_k; 2 log(p̂_k(X_k) + p̂_k(Y_k)) otherwise.
   w_k = 1 / Σ_{i=1}^{d} (log p̂_i(X_i) + log p̂_i(Y_i)).

6. Lin1:
   S_k = Σ_{q ∈ Q} log p̂_k(q) if X_k = Y_k; 2 log Σ_{q ∈ Q} p̂_k(q) otherwise.
   w_k = 1 / Σ_{i=1}^{d} Σ_{q ∈ Q} log p̂_i(q).

7. Goodall1:
   S_k = 1 − Σ_{q ∈ Q} p²_k(q) if X_k = Y_k; 0 otherwise.    w_k = 1/d.

8. Goodall2:
   S_k = 1 − Σ_{q ∈ Q} p²_k(q) if X_k = Y_k; 0 otherwise.    w_k = 1/d.

9. Goodall3:
   S_k = 1 − p²_k(X_k) if X_k = Y_k; 0 otherwise.    w_k = 1/d.

10. Goodall4:
    S_k = p²_k(X_k) if X_k = Y_k; 0 otherwise.    w_k = 1/d.

11. Smirnov:
    S_k = 2 + (N − f_k(X_k)) / f_k(X_k) + Σ_{q ∈ A_k \ {X_k}} f_k(q) / (N − f_k(q)) if X_k = Y_k;
    S_k = Σ_{q ∈ A_k \ {X_k, Y_k}} f_k(q) / (N − f_k(q)) otherwise.
    w_k = 1 / Σ_{k=1}^{d} n_k.

12. Gambaryan:
    S_k = −[p̂_k(X_k) log₂ p̂_k(X_k) + (1 − p̂_k(X_k)) log₂(1 − p̂_k(X_k))] if X_k = Y_k; 0 otherwise.
    w_k = 1 / Σ_{k=1}^{d} n_k.

13. Burnaby:
    S_k = 1 if X_k = Y_k;
    S_k = [Σ_{q ∈ A_k} 2 log(1 − p̂_k(q))] / [log( p̂_k(X_k) p̂_k(Y_k) / ((1 − p̂_k(X_k))(1 − p̂_k(Y_k))) ) + Σ_{q ∈ A_k} 2 log(1 − p̂_k(q))] otherwise.
    w_k = 1/d.

14. Anderberg: cannot be written in the per-attribute form above; the full similarity is

    S(X, Y) = [ Σ_{k: X_k = Y_k} (1 / p̂_k(X_k))² · 2 / (n_k(n_k + 1)) ] /
              [ Σ_{k: X_k = Y_k} (1 / p̂_k(X_k))² · 2 / (n_k(n_k + 1)) + Σ_{k: X_k ≠ Y_k} (1 / (2 p̂_k(X_k) p̂_k(Y_k))) · 2 / (n_k(n_k + 1)) ].

For the Lin1 measure, Q ⊆ A_k is the set of values q with p̂_k(X_k) ≤ p̂_k(q) ≤ p̂_k(Y_k), assuming p̂_k(X_k) ≤ p̂_k(Y_k). For Goodall1, Q = {q ∈ A_k : p²_k(q) ≤ p²_k(X_k)}. For Goodall2, Q = {q ∈ A_k : p²_k(q) ≥ p²_k(X_k)}.

3. Goodall1. The Goodall1 measure is the same as Goodall's measure on a per-attribute basis. However, instead of combining the per-attribute similarities by taking into account dependencies between attributes, the Goodall1 measure takes the average of the per-attribute similarities. The range of S_k(X_k, Y_k) for matches in the Goodall1 measure is [0, 1 − 2/(N(N − 1))], with the minimum attained when attribute A_k takes only one value, and the maximum attained when the value X_k occurs twice while all other possible values of A_k occur more than twice.

4. Goodall2. The Goodall2 measure is a variant of Goodall's measure proposed by us. This measure assigns higher similarity if the matching values are infrequent, and at the same time there are other values that are even less frequent, i.e., the similarity is higher if there are many values with approximately equal frequencies, and lower if the frequency distribution is skewed. The range of S_k(X_k, Y_k) for matches in the Goodall2 measure is [0, 1 − 2/(N(N − 1))], with the minimum attained if attribute A_k takes only one value, and the maximum attained when the value X_k occurs twice while all other possible values of A_k occur only once each.

5. Goodall3. We also propose another variant of Goodall's measure, called Goodall3. The Goodall3 measure assigns a high similarity if the matching values are infrequent, regardless of the frequencies of the other values. The range of S_k(X_k, Y_k) for matches in the Goodall3 measure is [0, 1 − 2/(N(N − 1))], with the minimum attained if X_k is the only value for attribute A_k, and the maximum attained if X_k occurs only twice.

6. Goodall4. The Goodall4 measure assigns 1 minus the Goodall3 similarity for matches. The range of S_k(X_k, Y_k) for matches in the Goodall4 measure is [2/(N(N − 1)), 1], with the minimum attained if X_k occurs only twice, and the maximum attained if X_k is the only value for attribute A_k.

7. Gambaryan.
Gambaryan proposed a measure [] that gives more weight to matches where the matching value occurs in about half the data set, i.e., in between being frequent and rare. The Gambaryan measure for a single attribute match is closely related to the Shannon entropy from information theory, as can be seen from its formula in Table 2. The range of S_k(X_k, Y_k) for matches in the Gambaryan measure is [0, 1], with the minimum attained if X_k is the only value for attribute A_k, and the maximum attained when X_k has frequency N/2.

4.2 Measures that fill Off-diagonal Entries only.

1. Eskin. Eskin et al. [9] proposed a normalization kernel for record-based network intrusion detection data. The original measure is distance-based and assigns a weight of 2/n_k² to mismatches; when adapted to similarity, this becomes a weight of n_k²/(n_k² + 2). This measure gives more weight to mismatches that occur on attributes that take many values. The range of S_k(X_k, Y_k) for mismatches in the Eskin measure is [2/3, N²/(N² + 2)], with the minimum attained when the attribute A_k takes only two values, and the maximum attained when the attribute has all unique values.

2. Inverse Occurrence Frequency (IOF). The inverse occurrence frequency measure assigns lower similarity to mismatches on more frequent values. The IOF measure is related to the concept of inverse document frequency, which comes from information retrieval [9], where it is used to signify the relative number of documents that contain a specific word. A key difference is that inverse document frequency is computed on a term-document matrix, which is usually binary, while the IOF measure is defined for categorical data. The range of S_k(X_k, Y_k) for mismatches in the IOF measure is [1/(1 + (log(N/2))²), 1], with the minimum attained when X_k and Y_k each occur N/2 times (i.e., these are the only two values), and the maximum attained when X_k and Y_k occur only once in the data set.

3. Occurrence Frequency (OF).
The occurrence frequency measure gives the opposite weighting of the IOF measure for mismatches, i.e., mismatches on less frequent values are assigned lower similarity and mismatches on more frequent values are assigned higher similarity. The range of S_k(X_k, Y_k) for mismatches in the OF measure is [1/(1 + (log N)²), 1/(1 + (log 2)²)], with the minimum attained when X_k and Y_k occur only once in the data set, and the maximum attained when X_k and Y_k each occur N/2 times.

4. Burnaby. Burnaby [5] proposed a similarity measure using arguments from information theory. He argues that the set of observed values is like a group of signals conveying information and, as in information theory, attribute values that are rarely observed should be considered more informative. In

[5], Burnaby proposed information-weighted measures for binary, ordinal, categorical and continuous data. The measure we present in Table 2 is adapted from Burnaby's categorical measure. This measure assigns low similarity to mismatches on rare values and high similarity to mismatches on frequent values. The range of S_k(X_k, Y_k) for mismatches in the Burnaby measure is

    [ N log((N − 1)/N) / (N log((N − 1)/N) − log(N − 1)), 1 ]

with the minimum attained when all values for attribute A_k occur only once, and the maximum attained when X_k and Y_k each occur N/2 times.

4.3 Measures that fill both Diagonal and Off-diagonal Entries.

1. Lin. In [3], Lin describes an information-theoretic framework for similarity, where he argues that when similarity is thought of in terms of assumptions about the space, the similarity measure naturally follows from the assumptions. Lin [3] discusses the ordinal, string, word and semantic similarity settings; we applied his framework to the categorical setting to derive the Lin measure in Table 2. The Lin measure gives higher weight to matches on frequent values, and lower weight to mismatches on infrequent values. The range of S_k(X_k, Y_k) for a match in the Lin measure is [2 log(1/N), 0], with the minimum attained when X_k occurs only once and the maximum attained when X_k occurs N times. The range of S_k(X_k, Y_k) for a mismatch in the Lin measure is [2 log(2/N), 0], with the minimum attained when X_k and Y_k each occur only once, and the maximum attained when X_k and Y_k each occur N/2 times.

2. Lin1. The Lin1 measure is another measure we have derived using Lin's similarity framework.
This measure gives lower weight to mismatches if either of the mismatching values is very frequent, or if there are several values that have frequency in between those of the mismatching values; higher weight is given when there are mismatches on infrequent values and there are few other infrequent values. For matches, lower weight is given to matches on frequent values or matches on values that have many other values of the same frequency; higher weight is given to matches on rare values. The range of S_k(X_k, Y_k) for matches in the Lin1 measure is [log(2/N), 0], with the minimum attained when X_k occurs twice in the data set and no other value of attribute A_k occurs twice, and the maximum attained when X_k occurs N times. The range of S_k(X_k, Y_k) for mismatches in the Lin1 measure is [2 log(2/N), 0], with the minimum attained when X_k and Y_k both occur only once and all other values of attribute A_k occur more than once, and the maximum attained when X_k is the most frequent value and Y_k is the least frequent value, or vice versa.

3. Smirnov. Smirnov [3] proposed a measure rooted in probability theory that not only considers a given value's frequency, but also takes into account the distribution of the other values taken by the same attribute. The Smirnov measure is probabilistic for both matches and mismatches. For a match, the similarity is high when the frequency of the matching value is low and the other values occur frequently. The range of S_k(X_k, Y_k) for a match in the Smirnov measure is [2, 2N], with the minimum attained when X_k occurs N times; the maximum is attained when X_k occurs only once and there is only one other possible value for attribute A_k, which occurs N − 1 times.
The range of S_k(X_k, Y_k) for a mismatch in the Smirnov measure is [0, (N − 2)/2], with the minimum attained when the attribute A_k takes only two values, X_k and Y_k; the maximum is attained when A_k takes only one more value apart from X_k and Y_k, and that value occurs N − 2 times (X_k and Y_k occur once each).

4. Anderberg. In his book on cluster analysis [], Anderberg presents an approach for handling similarity between categorical attributes. He argues that rare matches indicate a strong association and should be given a very high weight, and that mismatches on rare values should be treated as distinctive and should also be given special importance. In accordance with these arguments, the Anderberg measure assigns higher similarity to rare matches, and lower similarity to rare mismatches. The Anderberg measure is unique in the sense that it cannot be written in the form of Equation 4.1. The range of the Anderberg measure is [0, 1]; the minimum is attained when there are no matches, and the maximum is attained when all attributes match.

4.4 Further classification of similarity measures. We can further classify categorical similarity measures based on the arguments used to propose them:

Probabilistic approaches take into account the probability of a given match taking place. The following measures are probabilistic: Goodall, Smirnov, Anderberg.

Information-theoretic approaches incorporate the information content of a particular value/variable with respect to the data set. The following measures are information-theoretic: Lin, Lin1, Burnaby.

Table 3 provides a characterization of each of the 14 similarity measures in terms of how they handle the various characteristics of categorical data. This table shows that the Eskin and Anderberg measures assign weight to every attribute using the quantity n_k, though in opposite ways. Another interesting observation from column 3 is that several measures (Lin, Lin1, Goodall1, Goodall3, Smirnov, Anderberg) assign higher similarity to a match when the attribute value is rare (f_k is low), while Goodall2 and Goodall4 assign higher similarity to a match when the attribute value is frequent (f_k is high). Only Gambaryan assigns the maximum similarity when the attribute value has a frequency close to N/2. Column 4 shows that IOF, Lin, Lin1, Smirnov and Burnaby assign greater similarity when the mismatch occurs between rare values, while OF and Anderberg assign greater similarity for a mismatch between frequent values.

5 Experimental Evaluation

In this section we present an experimental evaluation of the 14 measures (listed in Table 2) on 18 different data sets in the context of outlier detection. Of these data sets, 16 are based on data sets available at the UCI Machine Learning Repository [3], and two are based on network data generated by the SKAION Corporation for the ARDA information assurance program [30]. The details of the 18 data sets are summarized in Table 4. Eleven of these data sets were purely categorical, five (KD1, KD2, Sk1, Sk2, Cen) had a mix of continuous and categorical attributes, and two data sets, Irs and Sgm, were purely continuous. Continuous variables were discretized using the MDL method [0]. The KD1 and KD2 data sets were obtained from the KDDCup data set by discretizing the continuous attributes into 10 and 100 bins respectively.
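Discretization turns each continuous attribute into a categorical one. The paper uses the MDL method; the equal-width binning below is our simple stand-in, included only to illustrate the bins-to-categories idea behind the 10- and 100-bin variants.

```python
def equal_width_bins(values, n_bins):
    """Map continuous values to categorical bin labels via equal-width
    binning (a simple stand-in; the paper discretizes with the MDL method)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0      # guard against constant attributes
    labels = []
    for v in values:
        b = min(int((v - lo) / width), n_bins - 1)   # clamp the max into the last bin
        labels.append(f"bin{b}")
    return labels

# One continuous attribute discretized into 4 categories:
cats = equal_width_bins([0.0, 0.25, 0.5, 0.75, 1.0], n_bins=4)
# → ["bin0", "bin1", "bin2", "bin3", "bin3"]
```

Once every attribute is categorical, any of the measures in Table 2 applies unchanged.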
Another possible way to handle a mixture of attributes is to compute the similarity for continuous and categorical attributes separately, and then perform a weighted aggregation. In this study we converted the continuous attributes to categorical ones to simplify the comparative evaluation. Each data set contains labeled instances belonging to multiple classes. We identified one class as the outlier class, and the rest of the classes were grouped together and called normal. For most of the data sets, the smallest class was selected as the outlier class. The only exceptions were (Cr1, Cr2) and (KD1, KD2), where the original data sets had two similar-sized classes. For each data set in the pair, we sampled 100 points from one of the two classes as the outlier points. In Section 3.1 we discussed a number of characteristics of categorical data sets; in Table 4 we describe the various public data sets in terms of these characteristics. The first row gives the size of each data set. The second row shows the percentage of outlier points in the original data set. The third row indicates the number of attributes in each data set. Rows 4 and 5 show the distribution of the number of values taken by each attribute; the difference between the average and the median is a measure of how skewed this distribution is. For example, data set Sk1 has a few attributes that take many values while most other attributes take few values. The next three rows show the distribution of the frequencies of the values taken by an attribute in the given data set (i.e., f_k(x)). This is done by showing the number of attributes that have a uniform, Gaussian and skewed distribution in rows 6, 7 and 8 respectively. The last two rows of Table 4 give the cross-validation classification recall and precision reported by the C4.5 classifier on the outlier class. This quantity indicates the separability between the instances belonging to the normal class(es) and the instances belonging to the outlier class, using the given set of attributes.
A low accuracy implies that distinguishing between outliers and normal instances is difficult in that particular data set using a decision tree-based classifier.

5.1 Evaluation Methodology. The performance of the different similarity measures was evaluated in the context of outlier detection using nearest neighbors [9, 34]. All instances belonging to the normal class(es) form the training set. We construct the test set by adding the outlier points to the training set. For each test instance, we find its k nearest neighbors in the training set using the given similarity measure (we chose the parameter k = 10). The outlier score is the distance to the k-th nearest neighbor. The test instances are then sorted in decreasing order of outlier scores. To evaluate a measure, we count the number of true outliers in the top p portion of the sorted test instances, where p = δn, 0 ≤ δ ≤ 1, and n is the number of actual outliers. Let o be the number of actual outliers in the top p predicted outliers. The accuracy of the algorithm is measured as o/p. In this paper we present results for δ = 1. We have also experimented with other, lower values of δ, and the trends in relative performance are similar. We have presented these additional results in our extended work, available as a technical report [6].
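The scoring and evaluation procedure just described can be sketched as follows. This is a minimal illustration, not the paper's code: the function names are ours, and the conversion from similarity to distance (a reciprocal here) is an assumption, since the paper does not specify the transform.

```python
# Sketch of the kNN-based outlier detection evaluation described above.
# `sim` is any categorical similarity measure over full instances; distance
# is taken as the reciprocal of similarity purely for illustration.

def knn_outlier_scores(train, test, sim, k=10):
    """Outlier score of each test instance = distance to its k-th
    nearest neighbor in the training (normal) set. Requires len(train) >= k."""
    scores = []
    for x in test:
        dists = sorted(1.0 / (sim(x, y) + 1e-9) for y in train)
        scores.append(dists[k - 1])
    return scores

def accuracy_at_delta(scores, labels, delta=1.0):
    """Fraction of true outliers among the top p = delta * n ranked
    instances, where n is the number of actual outliers (labels are 0/1)."""
    n = sum(labels)
    p = max(1, int(delta * n))
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    o = sum(lbl for _, lbl in ranked[:p])
    return o / p
```

Any of the per-attribute measures discussed in this paper can be plugged in as `sim`; the evaluation itself is independent of the measure, which is what allows the head-to-head comparison in Table 5.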

Measure     n_k      X_k = Y_k                     X_k != Y_k
Overlap     -        1                             0
Eskin       n_k^2    1                             0
IOF         -        1                             1/(log f_k(X_k) * log f_k(Y_k))
OF          -        1                             log f_k(X_k) * log f_k(Y_k)
Lin         -        1/log f_k(X_k)                1/log(f_k(X_k) + f_k(Y_k))
Lin1        -        1/log f_k(X_k)                1/log(f_k(X_k) f_k(Y_k))
Goodall1    -        1 - f_k^2(X_k)                0
Goodall2    -        f_k^2(X_k)                    0
Goodall3    -        1 - f_k^2(X_k)                0
Goodall4    -        f_k^2(X_k)                    0
Smirnov     -        1/f_k(X_k)                    1/(f_k(X_k) + f_k(Y_k))
Gambaryan   -        maximum at f_k(X_k) = N/2     0
Burnaby     -        1                             1/log f_k(X_k), 1/log f_k(Y_k)
Anderberg   1/n_k    1/f_k^2(X_k)                  f_k(X_k) f_k(Y_k)

Table 3: Relation between the per-attribute similarity S(X_k, Y_k) and {n_k, f_k(X_k), f_k(Y_k)}. A dash indicates no dependence; a 0 indicates no dependence on the frequencies.

5.2 Experimental Results on Public Data Sets. Our experimental results verified our initial hypotheses about categorical similarity measures. As can be seen from Table 5, there are many situations where the Overlap measure does not give good performance. This is consistent with our intuition that the use of additional information would lead to better performance. In particular, we expected that since categorical data does not have an inherent ordering, data-driven measures would be able to take advantage of information present in the data set to make more accurate determinations of similarity between a pair of data instances. We make some key observations about the results in Table 5:

1. No single measure is always superior or inferior. This is to be expected since each data set has different characteristics.

2. The use of some measures gives consistently better performance on a large variety of data. The Lin1, OF, and Goodall3 measures give among the best performance overall in terms of outlier detection performance. This is noteworthy since Lin1 and Goodall3 have been introduced for the first time in this paper.

3. There are some pairs of measures that exhibit complementary performance, i.e., one performs well where the other performs poorly and vice versa. Example complementary pairs are (OF, IOF), (Lin, Lin1) and (Goodall3, Goodall4).
This observation means that it may be possible to construct measures that draw on the strengths of two measures in order to obtain superior performance. This is an aspect of this work that needs to be pursued in future work.

4. The performance of an outlier detection algorithm is significantly affected by the similarity measure used (we refer the reader to our extended work [6] for a similar evaluation using a different outlier detection algorithm, LOF, which provides similar conclusions). For example, for the Cn data set, which has a very low classification accuracy for the outlier class, using OF still achieves close to 50% accuracy. We also note that for many of the data sets there is a relationship between decision tree performance (separability) and the performance of the measures. Specifically, for some of the data sets (e.g., Sk1, Tmr, Aud) with low separability there was high variance in the performance of the various measures.

5. The Eskin similarity measure weights attributes proportional to the number of values taken by the attribute (n_k). For data sets in which the attributes take a large number of values (e.g., KD2, Sk1, Sk2), Eskin performs very poorly.

6. The Smirnov measure assigns similarity to both diagonal and off-diagonal entries in the per-attribute similarity matrix (Figure 1). But it still performs very poorly on most of the data sets. The other measures that operate similarly (Lin, Lin1 and Anderberg) perform better than Smirnov in almost every data set.
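To make the frequency-dependence summarized in Table 3 concrete, the sketch below computes two per-attribute similarities from a data set's value counts: Overlap, and a Goodall3-style measure that rewards matches on rare values. It is an illustration only: the probability here is the plain squared sample frequency, a simplification of the estimate used in the paper, and all function names are ours.

```python
from collections import Counter

def attribute_frequencies(data, k):
    """Counts f_k(x) of each value of attribute k in the data set."""
    return Counter(row[k] for row in data)

def overlap_sim(x, y):
    """Overlap: per-attribute similarity is 1 on a match and 0 otherwise,
    averaged over attributes (no use of the data set at all)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def goodall3_sim(x, y, data):
    """Goodall3-style: a match on attribute k contributes 1 - p^2, where p
    is the sample frequency of the matched value, so matches on rare values
    count more; mismatches contribute 0."""
    n = len(data)
    total = 0.0
    for k, (a, b) in enumerate(zip(x, y)):
        if a == b:
            p = attribute_frequencies(data, k)[a] / n
            total += 1.0 - p * p
    return total / len(x)
```

On a data set where one value dominates an attribute, a match on a rare value of that attribute scores higher than a match on the dominant value, which is exactly the behavior attributed to Goodall3 in column 3 of Table 3; Overlap, by contrast, cannot tell the two matches apart.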

[Table 4: Description of Public Data Sets. Columns (data sets): Cr1, Cr2, Irs, Cn, KD1, KD2, Sk1, Sk2, Msh, Sgm, Cen, Can, Hys, Lym, Nur, Tmr, TTT, Aud. Rows: Size, % Outliers, avg(n_k), med(n_k), f_k uniform, f_k Gaussian, f_k skewed, Recall, Precision.]

[Table 5: Experimental Results for the kNN Algorithm for δ = 1.0. Rows (measures): Overlap, Eskin, IOF, OF, Lin, Lin1, Goodall1, Goodall2, Goodall3, Goodall4, Smirnov, Gambaryan, Burnaby, Anderberg, Avg. Columns: the 18 data sets of Table 4, plus Avg.]

6 Concluding Remarks and Future Work

Computing similarity between categorical attributes has been discussed in a variety of contexts. In this paper we have brought together several such measures and evaluated them in the context of outlier detection. We have also proposed several variants (Lin, Lin1, Goodall2, Goodall3, Goodall4) of existing similarity measures, some of which perform very well as shown in our evaluation. Given this set of similarity measures, the first question that comes to mind is: which similarity measure is best suited for my data mining task? Our experimental results suggest that there is no one best-performing similarity measure. Hence, one needs to understand how a similarity measure handles the different characteristics of a categorical data set, and this needs to be explored in future research. We used outlier detection as the underlying data mining task for the comparative evaluation in this work. However, similar studies can be performed using classification or clustering as the underlying task. It will be useful to know if the relative performance of these similarity measures remains the same for the other data mining tasks. In our evaluation methodology we have used one similarity measure across all attributes. Since different attributes in a data set can be of a different nature, an alternative is to use different measures for different attributes. This appears to be especially promising given the complementary nature of several similarity measures.

7 Acknowledgements

We are grateful to the anonymous reviewers for their comments and suggestions, which improved this paper. We would also like to thank György Simon for his helpful comments on an early draft of this paper. This work was supported by NSF Grant CNS-05555, NSF ITR Grant ACI, NSF Grant IIS, and NSF Grant IIS. Access to computing facilities was provided by the University of Minnesota Digital Technology Center and Supercomputing Institute.
References

[1] A. Ahmad and L. Dey. A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recogn. Lett., 28(1):110-118, 2007.
[2] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, New York, 1973.
[3] A. Asuncion and D. J. Newman. UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science, 2007.
[4] Y. Biberman. A context similarity measure. In ECML'94: Proceedings of the European Conference on Machine Learning. Springer, 1994.
[5] T. Burnaby. On a method for character weighting a similarity coefficient, employing the concept of information. Mathematical Geology, 2(1):25-38, 1970.
[6] V. Chandola, S. Boriah, and V. Kumar. Similarity measures for categorical data: a comparative study. Technical Report 07-022, Department of Computer Science & Engineering, University of Minnesota, October 2007.
[7] H. Cramér. The Elements of Probability Theory and Some of its Applications. John Wiley & Sons, New York, NY, 1946.
[8] G. Das and H. Mannila. Context-based similarity measures for categorical databases. In PKDD 2000: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pages 201-210, London, UK, 2000. Springer-Verlag.
[9] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection. In D. Barbará and S. Jajodia, editors, Applications of Data Mining in Computer Security. Kluwer Academic Publishers, Norwell, MA, 2002.
[10] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022-1029, San Francisco, CA, 1993. Morgan Kaufmann.
[11] P. Gambaryan. A mathematical model of taxonomy. Izvest. Akad. Nauk Armen. SSR, 17(12):47-53, 1964.
[12] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: clustering categorical data using summaries.
In KDD'99: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 73-83, New York, NY, USA, 1999. ACM Press.

[13] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: an approach based on dynamical systems. The VLDB Journal, 8(3):222-236, 2000.
[14] D. W. Goodall. A new similarity index based on probability. Biometrics, 22(4):882-907, 1966.
[15] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5):345-366, 2000.
[16] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, New York, NY, 1975.
[17] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3):283-304, 1998.
[18] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ, 1988.
[19] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. In Document Retrieval Systems, volume 3 of Taylor Graham Series in Foundations of Information Science, pages 132-142. Taylor Graham Publishing, London, UK, 1988.
[20] W. P. Jones and G. W. Furnas. Pictures of relevance: a geometric analysis of similarity measures. J. Am. Soc. Inf. Sci., 38(6):420-442, 1987.
[21] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York, NY, 1990.
[22] S. Q. Le and T. B. Ho. An association-based dissimilarity measure for categorical data. Pattern Recogn. Lett., 26(16):2549-2557, 2005.
[23] D. Lin. An information-theoretic definition of similarity. In ICML'98: Proceedings of the 15th International Conference on Machine Learning, pages 296-304, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
[24] K. Maung. Measurement of association in a contingency table with special reference to the pigmentation of hair and eye colours of Scottish school children. Annals of Eugenics, 11:189-223, 1941.
[25] T. Noreault, M. McGill, and M. B. Koll.
A performance evaluation of similarity measures, document term weighting schemes and representations in a boolean environment. In SIGIR'80: Proceedings of the 3rd Annual ACM Conference on Research and Development in Information Retrieval, pages 57-76, Kent, UK, 1981. Butterworth & Co.
[26] C. R. Palmer and C. Faloutsos. Electricity based external similarity of categorical attributes. In PAKDD'03: Proceedings of the 7th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. Springer, 2003.
[27] C. P. Pappis and N. I. Karacapilidis. A comparative assessment of measures of similarity of fuzzy values. Fuzzy Sets and Systems, 56(2):171-174, 1993.
[28] K. Pearson. On the general theory of multiple contingency with special reference to partial contingency. Biometrika, 11(3):145-158, 1916.
[29] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In SIGMOD 2000: Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 427-438. ACM Press, 2000.
[30] SKAION Corporation. SKAION intrusion detection system evaluation data.
[31] E. S. Smirnov. On exact methods in systematics. Systematic Zoology, 17(1):1-13, 1968.
[32] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman and Company, San Francisco, 1973.
[33] C. Stanfill and D. Waltz. Toward memory-based reasoning. Commun. ACM, 29(12):1213-1228, 1986.
[34] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, Boston, MA, 2005.
[35] X. Wang, B. De Baets, and E. Kerre. A comparative study of similarity measures. Fuzzy Sets and Systems, 73(2):259-268, 1995.
[36] D. R. Wilson and T. R. Martinez. Improved heterogeneous distance functions. J. Artif. Intell. Res. (JAIR), 6:1-34, 1997.
[37] R. Zwick, E. Carlstein, and D. V. Budescu. Measures of similarity among fuzzy concepts: A comparative analysis. International Journal of Approximate Reasoning, 1(2):221-242, 1987.


More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

The total derivative. Chapter Lagrangian and Eulerian approaches

The total derivative. Chapter Lagrangian and Eulerian approaches Chapter 5 The total erivative 51 Lagrangian an Eulerian approaches The representation of a flui through scalar or vector fiels means that each physical quantity uner consieration is escribe as a function

More information

A New Family of Near-metrics for Universal Similarity

A New Family of Near-metrics for Universal Similarity arxiv:1707.06903v3 [stat.ml] 17 Oct 2017 A New Family of Near-metrics for Universal Similarity Chu Wang Iraj Saniee William S. Kenney Chris A. White October 18, 2017 Abstract We propose a family of near-metrics

More information

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,

More information

Robustness and Perturbations of Minimal Bases

Robustness and Perturbations of Minimal Bases Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important

More information

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control 19 Eigenvalues, Eigenvectors, Orinary Differential Equations, an Control This section introuces eigenvalues an eigenvectors of a matrix, an iscusses the role of the eigenvalues in etermining the behavior

More information

A Random Graph Model for Massive Graphs

A Random Graph Model for Massive Graphs A Ranom Graph Moel for Massive Graphs William Aiello AT&T Labs Florham Park, New Jersey aiello@research.att.com Fan Chung University of California, San Diego fan@ucs.eu Linyuan Lu University of Pennsylvania,

More information

arxiv: v1 [hep-lat] 19 Nov 2013

arxiv: v1 [hep-lat] 19 Nov 2013 HU-EP-13/69 SFB/CPP-13-98 DESY 13-225 Applicability of Quasi-Monte Carlo for lattice systems arxiv:1311.4726v1 [hep-lat] 19 ov 2013, a,b Tobias Hartung, c Karl Jansen, b Hernan Leovey, Anreas Griewank

More information

WESD - Weighted Spectral Distance for Measuring Shape Dissimilarity

WESD - Weighted Spectral Distance for Measuring Shape Dissimilarity 1 WESD - Weighte Spectral Distance for Measuring Shape Dissimilarity Ener Konukoglu, Ben Glocker, Antonio Criminisi an Kilian M. Pohl Abstract This article presents a new istance for measuring shape issimilarity

More information

The Press-Schechter mass function

The Press-Schechter mass function The Press-Schechter mass function To state the obvious: It is important to relate our theories to what we can observe. We have looke at linear perturbation theory, an we have consiere a simple moel for

More information

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1 Lecture 5 Some ifferentiation rules Trigonometric functions (Relevant section from Stewart, Seventh Eition: Section 3.3) You all know that sin = cos cos = sin. () But have you ever seen a erivation of

More information

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information

More information

Capacity Analysis of MIMO Systems with Unknown Channel State Information

Capacity Analysis of MIMO Systems with Unknown Channel State Information Capacity Analysis of MIMO Systems with Unknown Channel State Information Jun Zheng an Bhaskar D. Rao Dept. of Electrical an Computer Engineering University of California at San Diego e-mail: juzheng@ucs.eu,

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y Ph195a lecture notes, 1/3/01 Density operators for spin- 1 ensembles So far in our iscussion of spin- 1 systems, we have restricte our attention to the case of pure states an Hamiltonian evolution. Toay

More information

How to Minimize Maximum Regret in Repeated Decision-Making

How to Minimize Maximum Regret in Repeated Decision-Making How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:

More information

Computing the Longest Common Subsequence of Multiple RLE Strings

Computing the Longest Common Subsequence of Multiple RLE Strings The 29th Workshop on Combinatorial Mathematics an Computation Theory Computing the Longest Common Subsequence of Multiple RLE Strings Ling-Chih Yao an Kuan-Yu Chen Grauate Institute of Networking an Multimeia

More information

Linear and quadratic approximation

Linear and quadratic approximation Linear an quaratic approximation November 11, 2013 Definition: Suppose f is a function that is ifferentiable on an interval I containing the point a. The linear approximation to f at a is the linear function

More information

Non-deterministic Social Laws

Non-deterministic Social Laws Non-eterministic Social Laws Michael H. Coen MIT Artificial Intelligence Lab 55 Technology Square Cambrige, MA 09 mhcoen@ai.mit.eu Abstract The paper generalizes the notion of a social law, the founation

More information

A new proof of the sharpness of the phase transition for Bernoulli percolation on Z d

A new proof of the sharpness of the phase transition for Bernoulli percolation on Z d A new proof of the sharpness of the phase transition for Bernoulli percolation on Z Hugo Duminil-Copin an Vincent Tassion October 8, 205 Abstract We provie a new proof of the sharpness of the phase transition

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Delocalization of boundary states in disordered topological insulators

Delocalization of boundary states in disordered topological insulators Journal of Physics A: Mathematical an Theoretical J. Phys. A: Math. Theor. 48 (05) FT0 (pp) oi:0.088/75-83/48//ft0 Fast Track Communication Delocalization of bounary states in isorere topological insulators

More information

A New Minimum Description Length

A New Minimum Description Length A New Minimum Description Length Soosan Beheshti, Munther A. Dahleh Laboratory for Information an Decision Systems Massachusetts Institute of Technology soosan@mit.eu,ahleh@lis.mit.eu Abstract The minimum

More information

Inverse Theory Course: LTU Kiruna. Day 1

Inverse Theory Course: LTU Kiruna. Day 1 Inverse Theory Course: LTU Kiruna. Day Hugh Pumphrey March 6, 0 Preamble These are the notes for the course Inverse Theory to be taught at LuleåTekniska Universitet, Kiruna in February 00. They are not

More information

DEGREE DISTRIBUTION OF SHORTEST PATH TREES AND BIAS OF NETWORK SAMPLING ALGORITHMS

DEGREE DISTRIBUTION OF SHORTEST PATH TREES AND BIAS OF NETWORK SAMPLING ALGORITHMS DEGREE DISTRIBUTION OF SHORTEST PATH TREES AND BIAS OF NETWORK SAMPLING ALGORITHMS SHANKAR BHAMIDI 1, JESSE GOODMAN 2, REMCO VAN DER HOFSTAD 3, AND JÚLIA KOMJÁTHY3 Abstract. In this article, we explicitly

More information

A NONLINEAR SOURCE SEPARATION APPROACH FOR THE NICOLSKY-EISENMAN MODEL

A NONLINEAR SOURCE SEPARATION APPROACH FOR THE NICOLSKY-EISENMAN MODEL 6th European Signal Processing Conference EUSIPCO 28, Lausanne, Switzerlan, August 25-29, 28, copyright by EURASIP A NONLINEAR SOURCE SEPARATION APPROACH FOR THE NICOLSKY-EISENMAN MODEL Leonaro Tomazeli

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

Bohr Model of the Hydrogen Atom

Bohr Model of the Hydrogen Atom Class 2 page 1 Bohr Moel of the Hyrogen Atom The Bohr Moel of the hyrogen atom assumes that the atom consists of one electron orbiting a positively charge nucleus. Although it oes NOT o a goo job of escribing

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

An Analytical Expression of the Probability of Error for Relaying with Decode-and-forward

An Analytical Expression of the Probability of Error for Relaying with Decode-and-forward An Analytical Expression of the Probability of Error for Relaying with Decoe-an-forwar Alexanre Graell i Amat an Ingmar Lan Department of Electronics, Institut TELECOM-TELECOM Bretagne, Brest, France Email:

More information

New Statistical Test for Quality Control in High Dimension Data Set

New Statistical Test for Quality Control in High Dimension Data Set International Journal of Applie Engineering Research ISSN 973-456 Volume, Number 6 (7) pp. 64-649 New Statistical Test for Quality Control in High Dimension Data Set Shamshuritawati Sharif, Suzilah Ismail

More information

A simple model for the small-strain behaviour of soils

A simple model for the small-strain behaviour of soils A simple moel for the small-strain behaviour of soils José Jorge Naer Department of Structural an Geotechnical ngineering, Polytechnic School, University of São Paulo 05508-900, São Paulo, Brazil, e-mail:

More information

Lecture 2: Correlated Topic Model

Lecture 2: Correlated Topic Model Probabilistic Moels for Unsupervise Learning Spring 203 Lecture 2: Correlate Topic Moel Inference for Correlate Topic Moel Yuan Yuan First of all, let us make some claims about the parameters an variables

More information

Improving Estimation Accuracy in Nonrandomized Response Questioning Methods by Multiple Answers

Improving Estimation Accuracy in Nonrandomized Response Questioning Methods by Multiple Answers International Journal of Statistics an Probability; Vol 6, No 5; September 207 ISSN 927-7032 E-ISSN 927-7040 Publishe by Canaian Center of Science an Eucation Improving Estimation Accuracy in Nonranomize

More information

TCP throughput and timeout steady state and time-varying dynamics

TCP throughput and timeout steady state and time-varying dynamics TCP throughput an timeout steay state an time-varying ynamics Stephan Bohacek an Khushboo Shah Dept. of Electrical an Computer Engineering Dept. of Electrical Engineering University of Delaware University

More information

Math 342 Partial Differential Equations «Viktor Grigoryan

Math 342 Partial Differential Equations «Viktor Grigoryan Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite

More information

Cascaded redundancy reduction

Cascaded redundancy reduction Network: Comput. Neural Syst. 9 (1998) 73 84. Printe in the UK PII: S0954-898X(98)88342-5 Cascae reunancy reuction Virginia R e Sa an Geoffrey E Hinton Department of Computer Science, University of Toronto,

More information

Regular tree languages definable in FO and in FO mod

Regular tree languages definable in FO and in FO mod Regular tree languages efinable in FO an in FO mo Michael Beneikt Luc Segoufin Abstract We consier regular languages of labele trees. We give an effective characterization of the regular languages over

More information

TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS

TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS MISN-0-4 TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS f(x ± ) = f(x) ± f ' (x) + f '' (x) 2 ±... 1! 2! = 1.000 ± 0.100 + 0.005 ±... TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS by Peter Signell 1.

More information

Separation of Variables

Separation of Variables Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical

More information