Similarity Measures for Categorical Data: A Comparative Study. Technical Report


Similarity Measures for Categorical Data: A Comparative Study. Technical Report. Department of Computer Science and Engineering, University of Minnesota, 4-192 EECS Building, 200 Union Street SE, Minneapolis, MN, USA. TR. Similarity Measures for Categorical Data: A Comparative Study. Varun Chandola, Shyam Boriah, and Vipin Kumar. October 5, 2007


Similarity Measures for Categorical Data: A Comparative Study. Varun Chandola, Shyam Boriah, and Vipin Kumar. Department of Computer Science & Engineering, University of Minnesota.

Abstract. Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances, but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates the others for all types of problems, some measures deliver consistently high performance.

1 Introduction. Measuring similarity or distance between two data points is a core requirement for several data mining and knowledge discovery tasks that involve distance computation. Examples include clustering (k-means), distance-based outlier detection, classification (kNN, SVM), and several other data mining tasks. These algorithms typically treat the similarity computation as an orthogonal step and can make use of any measure. For continuous data sets, the Minkowski distance is a general method to compute the distance between two multivariate points. In particular, the Minkowski distances of order 1 (Manhattan) and order 2 (Euclidean) are the two most widely used distance measures for continuous data. The key observation about the above measures is that they are independent of the underlying data set to which the two points belong. Several data-driven measures, such as the Mahalanobis distance, have also been explored for continuous data. The notion of similarity or distance for categorical data is not as straightforward as for continuous data.
The key characteristic of categorical data is that the different values a categorical attribute takes are not inherently ordered. Thus it is not possible to directly compare two different categorical values. The simplest way to find the similarity between two categorical attributes is to assign a similarity of 1 if the values are identical and a similarity of 0 if the values are not identical. For two multivariate categorical data points, the similarity between them will be directly proportional to the number of attributes in which they match. This simple measure is also known as the overlap measure in the literature [29]. One obvious drawback of the overlap measure is that it does not distinguish between the different values taken by an attribute. All matches, as well as all mismatches, are treated as equal. For example, consider a categorical data set D defined over two attributes: color and shape. Let color take 3 possible values in D: {red, blue, green}, and shape take 3 possible values in D: {square, circle, triangle}. Table 1 summarizes the frequency of occurrence of each possible combination in D.

Table 1: Frequency Distribution of a Simple 2-D Categorical Data Set

The overlap similarity between the two instances (green, square) and (green, circle) is the same as that between (blue, square) and (blue, circle), since each pair matches on exactly one attribute. But the frequency distribution in Table 1 shows that while (blue, square) and (blue, circle) are frequent combinations, (green, square) and (green, circle) are very rare combinations in the data set. Thus, it would appear that the overlap measure is too simplistic in giving equal importance to all matches and mismatches. Although there is no inherent ordering in categorical data, the previous example shows that there is other information in categorical data sets that can be used to define what should be considered more similar and what should be considered less similar.
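To make this drawback concrete, here is a minimal sketch of the overlap measure in Python (the function name and example values are ours, not from the report):

```python
def overlap_similarity(x, y):
    """Overlap measure: the fraction of attributes on which the two
    instances take identical values (1 for a match, 0 for a mismatch)."""
    if len(x) != len(y):
        raise ValueError("instances must have the same number of attributes")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

# Both pairs below match on exactly one of two attributes (color), so
# the overlap measure cannot tell a frequent combination from a rare one.
s_rare = overlap_similarity(("green", "square"), ("green", "circle"))
s_frequent = overlap_similarity(("blue", "square"), ("blue", "circle"))
```

Both calls return the same value, which is precisely the insensitivity to value frequencies that motivates the data-driven measures discussed next.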

This observation has motivated researchers to come up with data-driven similarity measures for categorical attributes. Such measures take into account the frequency distribution of different attribute values in a given data set to define the similarity between two categorical attribute values. In this paper, we study a variety of similarity measures proposed in diverse research fields ranging from statistics to ecology, as well as many of their variations. Each measure uses the information present in the data in its own way to define similarity. Since we are evaluating data-driven similarity measures, their performance is closely tied to the data set being analyzed. To understand this relationship, we first identify the key characteristics of a categorical data set. For each of the different similarity measures that we study, we analyze how it relates to the different characteristics of the data set.

1.1 Key Contributions. The key contributions of this paper are as follows: We have brought together several categorical measures from different fields and studied them together in a single context. We evaluate 14 different data-driven similarity measures for categorical data on a wide variety of benchmark data sets. In particular, we show the utility of data-driven measures for the problem of determining similarity with categorical data. We have also proposed a number of new measures that are either variants of other previously proposed measures or derived from previously proposed similarity frameworks. The performance of some of the measures we propose is among the best of all the measures we studied. We identify the key characteristics of a categorical data set and analyze each similarity measure in relation to those characteristics.

1.2 Organization of the Paper. The rest of the paper is organized as follows. We first discuss related efforts in the study of similarity measures in Section 2.
In Section 3, we identify various characteristics of categorical data that are relevant to this study. We then introduce the 14 different similarity measures studied in this paper in Section 4. We describe our experimental setup, evaluation methodology, and the results on public data sets in Section 6.

2 Related Work. Sneath and Sokal discuss categorical similarity measures in some detail in their book [28] on numerical taxonomy. They were among the first to put together and discuss many of the measures covered in their book. At the time, the two major concerns were (1) biological relevance, since numerical taxonomy was mainly concerned with taxonomies from biology, ecology, etc., and (2) computational efficiency, since computational resources were limited and scarce. Nevertheless, many of the observations made by Sneath and Sokal are quite relevant today and offer key insights into many of the measures. There are several books [2, 9, 6, 20] on cluster analysis that discuss the problem of determining similarity between categorical attributes. However, most of these books do not offer solutions to the problem or discuss the measures in this paper, and the usual recommendation is to binarize the data and then use binary similarity measures. Wilson and Martinez [3] performed a detailed study of heterogeneous distance functions (for data with categorical and continuous attributes) for instance-based learning. The measures in that study are based upon a supervised approach where each data instance has class information in addition to a set of categorical/continuous attributes. The measures discussed in this paper are orthogonal to [3], since supervised measures determine similarity based on class information, while data-driven measures determine similarity based on the data distribution. In principle, both ideas can be combined. A number of new data mining techniques for categorical data have been proposed recently.
Some of them use notions of similarity which are neighborhood-based [5, 4, 8, 24, 1, 2], or incorporate the similarity computation into the learning algorithm [3, 8, 2]. Neighborhood-based approaches use some notion of similarity (usually the overlap measure) to define the neighborhood of a data instance, while the measures we study in this paper are used directly to determine the similarity between a pair of data instances; hence, we see the measures discussed in this paper as being useful for computing the neighborhood of a point, and neighborhood-based measures as meta-similarity measures. Since techniques which embed similarity measures into the learning algorithm do not explicitly define general categorical similarity measures, we do not discuss them in this paper.

3 Categorical Data. Categorical data (also known as nominal or qualitative multi-state data) has been studied for a long time in various contexts. As mentioned earlier, computing

similarity between categorical data instances is not straightforward, owing to the fact that there is no explicit notion of ordering between categorical values. To overcome this problem, several data-driven similarity measures have been proposed for categorical data. The behavior of such measures directly depends on the data. In this section we identify the key characteristics of a categorical data set that can potentially affect the behavior of a data-driven similarity measure. For notation, consider a categorical data set D containing N objects, defined over a set of d categorical attributes, where A_k denotes the k-th attribute. Let the attribute A_k take n_k values in the given data set, denoted by the set A_k. We also use the following notation:

f_k(x): the number of times attribute A_k takes the value x in the data set D. Note that if x is not in A_k, f_k(x) = 0.

ˆp_k(x): the sample probability of attribute A_k taking the value x in the data set D, given by ˆp_k(x) = f_k(x) / N.

p2_k(x): another probability estimate of attribute A_k taking the value x in the given data set, given by p2_k(x) = f_k(x)(f_k(x) − 1) / (N(N − 1)).

3.1 Characteristics of a Categorical Data Set. Since this paper discusses data-driven similarity measures for categorical data, a key task is to identify the characteristics of a categorical data set that affect the behavior of such a similarity measure. We enumerate the characteristics of a categorical data set below:

Size of data, N. As we will see later, most measures are typically invariant of the size of the data, but some (e.g., Smirnov) do incorporate it.

Number of attributes, d. Most measures are invariant of this characteristic, since they typically normalize the similarity over the number of attributes. But in our experimental results we observe that the number of attributes does affect the performance of the outlier detection algorithms.

Number of values taken by each attribute, n_k.
A data set might contain attributes that take several values and attributes that take very few values. For example, one attribute might take several hundred possible values, while another attribute might take very few values. A similarity measure might give more importance to the second attribute while ignoring the first. In fact, one of the measures discussed in this paper (Eskin) behaves exactly like this.

Distribution of f_k(x). This refers to the distribution of the frequencies of the values taken by an attribute in the given data set. In certain data sets an attribute might be distributed uniformly over the set A_k, while in others the distribution might be skewed. A similarity measure might give more importance to attribute values that occur rarely, while another similarity measure might give more importance to frequently occurring attribute values.

4 Similarity Measures for Categorical Data. The study of similarity between data objects with categorical variables has had a long history. Pearson proposed a chi-square statistic in the late 1800s, which is often used to test independence between categorical variables in a contingency table. Pearson's chi-square statistic was later modified and extended, leading to several other measures [25, 23, 7]. More recently, however, the overlap measure has become the most commonly used similarity measure for categorical data. Its popularity is perhaps related to its simplicity and ease of use. In this section, we will discuss the overlap measure and several data-driven similarity measures for categorical data. Note that we have converted measures that were originally proposed as distances into similarity measures in order to make the measures comparable in this study. The measures discussed henceforth will all be in the context of similarity, with distance measures being converted using the formula sim = 1 / (1 + dist). Any similarity measure assigns a similarity between two data instances X and Y belonging to the data set D (introduced in Section 3) as follows: (4.1)
S(X, Y) = Σ_{k=1}^{d} w_k S_k(X_k, Y_k), where S_k(X_k, Y_k) is the per-attribute similarity between two values for the categorical attribute A_k. Note that X_k, Y_k ∈ A_k. The quantity w_k denotes the weight assigned to the attribute A_k. To understand how different measures calculate the per-attribute similarity S_k(X_k, Y_k), consider a categorical attribute A which takes one of the values {a, b, c, d}. We have dropped the subscript k for simplicity. The per-attribute similarity computation is equivalent to constructing the (symmetric) matrix shown in Figure 1. Essentially, in determining the similarity between two values, any categorical measure is filling the entries

Figure 1: Similarity Matrix for a Single Categorical Attribute

      a        b        c        d
a  S(a, a)  S(a, b)  S(a, c)  S(a, d)
b           S(b, b)  S(b, c)  S(b, d)
c                    S(c, c)  S(c, d)
d                             S(d, d)

of this matrix. For example, the overlap measure sets the diagonal entries to 1 and the off-diagonal entries to 0, i.e., the similarity is 1 if the values match and 0 if the values mismatch. Additionally, measures may use the following information in computing a similarity value (all the measures in this paper use only this information): f(a), f(b), f(c), f(d), the frequencies of the values in the data set; N, the size of the data set; and n, the number of values taken by the attribute (4 in the case above). We can classify measures in several ways, based on: (i) the manner in which they fill the entries of the similarity matrix; (ii) whether the weights given to matches and mismatches are a function of the frequency of the attribute values; (iii) the arguments used to propose the measure (probabilistic, information-theoretic, etc.). In this paper, we will describe the measures by classifying them as follows:

Those that fill the diagonal entries only. These measures set the off-diagonal entries to 0 (mismatches are uniformly given the minimum value) and give possibly different weights to matches.

Those that fill the off-diagonal entries only. These measures set the diagonal entries to 1 (matches are uniformly given the maximum value) and give possibly different weights to mismatches.

Those that fill both diagonal and off-diagonal entries. These measures give different weights to both matches and mismatches.

Table 2 gives the mathematical formulas for the measures we will be describing in this paper. The various techniques described in Table 2 compute the per-attribute similarity S_k(X_k, Y_k) as shown in column 2, and compute the attribute weight w_k as shown in column 3.

4.1 Measures that fill Diagonal Entries only.

1. Overlap. The overlap measure simply counts the number of attributes that match in the two data instances.
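The decomposition in Equation (4.1) can be sketched as a small framework in which each measure supplies its own per-attribute function (a sketch with our own function names, not the authors' code):

```python
def similarity(x, y, per_attribute, weights=None):
    """Equation (4.1): S(X, Y) = sum over k of w_k * S_k(X_k, Y_k).
    `per_attribute` computes S_k from (k, X_k, Y_k); `weights` defaults
    to the uniform choice w_k = 1/d used by many of the measures."""
    d = len(x)
    w = weights if weights is not None else [1.0 / d] * d
    return sum(w[k] * per_attribute(k, x[k], y[k]) for k in range(d))

def overlap(k, a, b):
    """The overlap measure fills only the diagonal of the per-attribute
    similarity matrix: S_k(a, a) = 1 and S_k(a, b) = 0 for a != b."""
    return 1.0 if a == b else 0.0
```

Swapping in a different `per_attribute` function (and, for measures such as Lin, a data-dependent `weights` vector) yields the other measures of this section.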
The range of the per-attribute similarity for the overlap measure is [0, 1], with a value of 0 occurring when there is no match and a value of 1 occurring when the attribute values match.

2. Goodall. Goodall [4] proposed a measure that attempts to normalize the similarity between two objects by the probability that the observed similarity value could be observed in a random sample of two points. This measure assigns higher similarity to a match if the value is infrequent than if the value is frequent. Goodall's original measure details a procedure for combining the similarities in the multivariate setting which takes into account dependencies between attributes. Since this procedure is computationally expensive, we use a simpler version of the measure (described next as Goodall1). Goodall's original measure is not empirically evaluated in this paper. We also propose three variants of Goodall's measure in this paper: Goodall2, Goodall3 and Goodall4.

3. Goodall1. The Goodall1 measure is the same as Goodall's measure on a per-attribute basis. However, instead of combining the similarities by taking into account dependencies between attributes, the Goodall1 measure takes the average of the per-attribute similarities. The range of S_k(X_k, Y_k) for matches in the Goodall1 measure is [0, 1], with the minimum attained when X_k is the most frequent value for attribute A_k, and the maximum attained when attribute A_k takes N values in the data set (every value occurs only once).

4. Goodall2. The Goodall2 measure is a variant of Goodall's measure proposed by us. This measure assigns higher similarity if the matching values are infrequent and, at the same time, there are other values that are even less frequent; i.e., the similarity is higher if there are many values with approximately equal frequencies, and lower if the frequency distribution is skewed.
The range of S_k(X_k, Y_k) for matches in the Goodall2 measure is [0, 1 − 2/(N(N − 1))], with the minimum value attained if attribute A_k takes only one value, and the maximum value attained when X_k is the least frequent value for attribute A_k.

5. Goodall3. We also propose another variant of Goodall's measure called Goodall3. The Goodall3 measure assigns a high similarity if the matching

Table 2: Similarity Measures for Categorical Attributes. Each measure is given by its per-attribute similarity S_k(X_k, Y_k) and its attribute weight w_k (k = 1, ..., d), with S(X, Y) = Σ_{k=1}^{d} w_k S_k(X_k, Y_k).

1. Overlap. S_k = 1 if X_k = Y_k, 0 otherwise. w_k = 1/d.

2. Eskin. S_k = 1 if X_k = Y_k, n_k² / (n_k² + 2) otherwise. w_k = 1/d.

3. IOF. S_k = 1 if X_k = Y_k, 1 / (1 + log f_k(X_k) · log f_k(Y_k)) otherwise. w_k = 1/d.

4. OF. S_k = 1 if X_k = Y_k, 1 / (1 + log(N / f_k(X_k)) · log(N / f_k(Y_k))) otherwise. w_k = 1/d.

5. Lin. S_k = 2 log ˆp_k(X_k) if X_k = Y_k, 2 log(ˆp_k(X_k) + ˆp_k(Y_k)) otherwise. w_k = 1 / Σ_{i=1}^{d} (log ˆp_i(X_i) + log ˆp_i(Y_i)).

6. Lin1. S_k = Σ_{q ∈ Q} log ˆp_k(q) if X_k = Y_k, 2 log Σ_{q ∈ Q} ˆp_k(q) otherwise. w_k = 1 / Σ_{i=1}^{d} Σ_{q ∈ Q} log ˆp_i(q).

7. Goodall1. S_k = 1 − Σ_{q ∈ Q} p2_k(q) if X_k = Y_k, 0 otherwise. w_k = 1/d.

8. Goodall2. S_k = 1 − Σ_{q ∈ Q} p2_k(q) if X_k = Y_k, 0 otherwise. w_k = 1/d.

9. Goodall3. S_k = 1 − p2_k(X_k) if X_k = Y_k, 0 otherwise. w_k = 1/d.

10. Goodall4. S_k = p2_k(X_k) if X_k = Y_k, 0 otherwise. w_k = 1/d.

11. Smirnov. S_k = 2 + (N − f_k(X_k)) / f_k(X_k) + Σ_{q ∈ A_k \ {X_k}} f_k(q) / (N − f_k(q)) if X_k = Y_k; S_k = −2 + Σ_{q ∈ A_k \ {X_k, Y_k}} f_k(q) / (N − f_k(q)) otherwise. w_k = 1 / Σ_{k=1}^{d} n_k.

12. Gambaryan. S_k = −[ˆp_k(X_k) log₂ ˆp_k(X_k) + (1 − ˆp_k(X_k)) log₂(1 − ˆp_k(X_k))] if X_k = Y_k, 0 otherwise. w_k = 1 / Σ_{k=1}^{d} n_k.

13. Burnaby. S_k = 1 if X_k = Y_k; otherwise S_k = Σ_{q ∈ A_k} 2 log(1 − ˆp_k(q)) / [ log( ˆp_k(X_k) ˆp_k(Y_k) / ((1 − ˆp_k(X_k))(1 − ˆp_k(Y_k))) ) + Σ_{q ∈ A_k} 2 log(1 − ˆp_k(q)) ]. w_k = 1/d.

14. Anderberg. The Anderberg measure cannot be written as a per-attribute weighted sum; it is given directly by S(X, Y) = [ Σ_{k: X_k = Y_k} (1 / ˆp_k(X_k))² · 2 / (n_k(n_k + 1)) ] / [ Σ_{k: X_k = Y_k} (1 / ˆp_k(X_k))² · 2 / (n_k(n_k + 1)) + Σ_{k: X_k ≠ Y_k} (1 / (2 ˆp_k(X_k) ˆp_k(Y_k)))² · 2 / (n_k(n_k + 1)) ].

For measure Lin1, Q ⊆ A_k is the set of values q with ˆp_k(X_k) ≤ ˆp_k(q) ≤ ˆp_k(Y_k), assuming ˆp_k(X_k) ≤ ˆp_k(Y_k). For measure Goodall1, Q ⊆ A_k is the set of values q with p2_k(q) ≤ p2_k(X_k). For measure Goodall2, Q ⊆ A_k is the set of values q with p2_k(q) ≥ p2_k(X_k).

values are infrequent, regardless of the frequencies of the other values. The range of S_k(X_k, Y_k) for matches in the Goodall3 measure is [0, 1 − 2/(N(N − 1))], with the minimum value attained if X_k is the only value for attribute A_k, and the maximum value attained if X_k occurs only once.

6. Goodall4. The Goodall4 measure assigns similarity 1 − Goodall3 for matches. The range of S_k(X_k, Y_k) for matches in the Goodall4 measure is [2/(N(N − 1)), 1], with the minimum value attained if X_k occurs only once, and the maximum value attained if X_k is the only value for attribute A_k.

7. Gambaryan. Gambaryan proposed a measure [] that gives more weight to matches where the matching value occurs in about half the data set, i.e., in between being frequent and rare. The Gambaryan measure for a single attribute match is closely related to the Shannon entropy from information theory, as can be seen from its formula in Table 2. The range of S_k(X_k, Y_k) for matches in the Gambaryan measure is [0, 1], with the minimum value attained if X_k is the only value for attribute A_k, and the maximum value attained when X_k has frequency N/2.

4.2 Measures that fill Off-diagonal Entries only.

1. Eskin. Eskin et al. [9] proposed a normalization kernel for record-based network intrusion detection data. The original measure is distance-based and assigns a weight of 2/n_k² for mismatches; when adapted to similarity, this becomes a weight of n_k² / (n_k² + 2). This measure gives more weight to mismatches that occur on attributes that take many values. The range of S_k(X_k, Y_k) for mismatches in the Eskin measure is [2/3, N²/(N² + 2)], with the minimum value attained when the attribute takes only two values, and the maximum value attained when the attribute has all unique values.

2. Inverse Occurrence Frequency (IOF). The inverse occurrence frequency measure assigns lower similarity to mismatches on more frequent values.
The IOF measure is related to the concept of inverse document frequency, which comes from information retrieval, where it is used to signify the relative number of documents that contain a specific word. A key difference is that inverse document frequency is computed on a term-document matrix, which is usually binary, while the IOF measure is defined for categorical data. The range of S_k(X_k, Y_k) for mismatches in the IOF measure is [1/(1 + (log(N/2))²), 1], with the minimum value attained when X_k and Y_k each occur N/2 times (i.e., these are the only two values), and the maximum value attained when X_k and Y_k occur only once in the data set.

3. Occurrence Frequency (OF). The occurrence frequency measure gives the opposite weighting of the IOF measure for mismatches, i.e., mismatches on less frequent values are assigned lower similarity and mismatches on more frequent values are assigned higher similarity. The range of S_k(X_k, Y_k) for mismatches in the OF measure is [1/(1 + (log N)²), 1/(1 + (log 2)²)], with the minimum value attained when X_k and Y_k occur only once in the data set, and the maximum value attained when X_k and Y_k each occur N/2 times.

4. Burnaby. Burnaby [6] proposed a similarity measure using arguments from information theory. He argues that the set of observed values is like a group of signals conveying information and, as in information theory, attribute values that are rarely observed should be considered more informative. In [6], Burnaby proposed information-weighted measures for binary, ordinal, categorical and continuous data. The measure we present in Table 2 is adapted from Burnaby's categorical measure. This measure assigns low similarity to mismatches on rare values and high similarity to mismatches on frequent values. The range of S_k(X_k, Y_k) for mismatches in the Burnaby measure is [N log(1 − 1/N) / (N log(1 − 1/N) − log(N − 1)), 1], with the minimum value attained when all values for attribute A_k occur only once, and the maximum value attained when X_k and Y_k each occur N/2 times.
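The contrasting mismatch weightings of this section can be sketched as follows (a sketch assuming the Table 2 formulas; function names are ours, and matches receive similarity 1 under all three measures):

```python
import math

# Per-attribute similarity assigned to a MISMATCH (X_k != Y_k).
# f_x, f_y are the value frequencies f_k(X_k), f_k(Y_k); N is the data
# set size; n_k is the number of values taken by attribute A_k.

def eskin_mismatch(n_k):
    # Depends only on the number of values the attribute takes.
    return n_k ** 2 / (n_k ** 2 + 2)

def iof_mismatch(f_x, f_y):
    # Lower similarity for mismatches on more frequent values.
    return 1.0 / (1.0 + math.log(f_x) * math.log(f_y))

def of_mismatch(f_x, f_y, N):
    # The opposite weighting: higher similarity for frequent values.
    return 1.0 / (1.0 + math.log(N / f_x) * math.log(N / f_y))
```

For example, with N = 100, `iof_mismatch` scores a mismatch between two values of frequency 50 far below a mismatch between two singletons, while `of_mismatch` does the reverse.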
4.3 Measures that fill both Diagonal and Off-diagonal Entries.

1. Lin. In [22], Lin describes an information-theoretic framework for similarity, where he argues that when similarity is thought of in terms of assumptions about the space, the similarity measure naturally follows from the assumptions. Lin [22] discusses the ordinal, string, word and semantic similarity settings; we applied his framework to the categorical setting to derive the Lin measure in Table 2. The Lin measure gives higher weight to matches on frequent values, and lower weight to mismatches on infrequent values. The range of S_k(X_k, Y_k) for a match in the Lin measure is [−2 log N, 0], with the minimum value attained when X_k occurs only once and the maximum value attained when X_k occurs N times. The range of S_k(X_k, Y_k) for a

mismatch in the Lin measure is [−2 log(N/2), 0], with the minimum value attained when X_k and Y_k each occur only once, and the maximum value attained when X_k and Y_k each occur N/2 times.

2. Lin1. The Lin1 measure is another measure we have derived using Lin's similarity framework. This measure gives lower weight to mismatches if either of the mismatching values is very frequent, or if there are several values with frequencies in between those of the mismatching values; higher weight is given when there are mismatches on infrequent values and there are few other infrequent values. For matches, lower weight is given to matches on frequent values or matches on values that have many other values of the same frequency; higher weight is given to matches on rare values. The range of S_k(X_k, Y_k) for matches in the Lin1 measure is [−N log N, 0], with the minimum value attained when attribute A_k takes N possible values, and the maximum value attained when X_k occurs N times. The range of S_k(X_k, Y_k) for mismatches in the Lin1 measure is [−2 log(N/2), 0], with the minimum value attained when X_k and Y_k both occur only once, and the maximum value attained when X_k is the most frequent value and Y_k is the least frequent value, or vice versa.

3. Smirnov. Smirnov [27] proposed a measure rooted in probability theory that not only considers a given value's frequency, but also takes into account the distribution of the other values taken by the same attribute. The Smirnov measure is probabilistic for both matches and mismatches. For a match, the similarity is high when the frequency of the matching value is low and the other values occur frequently. The range of S_k(X_k, Y_k) for a match in the Smirnov measure is [2, 2N], with the minimum value attained when X_k occurs N times, and the maximum value attained when X_k occurs only once and the only other possible value for attribute A_k occurs N − 1 times.
The range of S_k(X_k, Y_k) for a mismatch in the Smirnov measure is [−2, N/2 − 3], with the minimum value attained when attribute A_k takes only the two values X_k and Y_k, and the maximum attained when A_k takes only one more value apart from X_k and Y_k and it occurs N − 2 times (X_k and Y_k occur once each).

4. Anderberg. In his book [2], Anderberg presents an approach to handle similarity between categorical attributes. He argues that rare matches indicate a strong association and should be given a very high weight, and that mismatches on rare values should be treated as distinctive and should also be given special importance. In accordance with these arguments, the Anderberg measure assigns higher similarity to rare matches, and lower similarity to rare mismatches. The Anderberg measure is unique in the sense that it cannot be written in the form of Equation 4.1. The range of the Anderberg measure is [0, 1]; the minimum value is attained when there are no matches, and the maximum value is attained when all attributes match.

4.4 Further Classification of Similarity Measures. We can further classify categorical similarity measures based on the arguments used to propose the measures:

1. Probabilistic approaches take into account the probability of a given match taking place. The following measures are probabilistic: Goodall1, Smirnov, Anderberg.

2. Information-theoretic approaches incorporate the information content of a particular value/variable with respect to the data set. The following measures are information-theoretic: Lin, Lin1, Burnaby.

Table 3 provides a characterization of each of the 14 similarity measures in terms of how they handle the various characteristics of a categorical data set. The table shows that the measures Eskin and Anderberg assign weight to every attribute using the quantity n_k, though in opposite ways.
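As a concrete instance of the information-theoretic family, the per-attribute Lin similarity can be sketched as follows (a sketch assuming the Table 2 formula; the dictionary of sample probabilities is a made-up example, not data from the report):

```python
import math

def lin_per_attribute(x, y, p_hat):
    """Per-attribute Lin similarity as given in Table 2: 2*log p(x) on a
    match, 2*log(p(x) + p(y)) on a mismatch, where p_hat maps each value
    to its sample probability f_k(value) / N."""
    if x == y:
        return 2.0 * math.log(p_hat[x])
    return 2.0 * math.log(p_hat[x] + p_hat[y])

# Hypothetical frequency distribution for one attribute.
p_hat = {"red": 0.7, "blue": 0.2, "green": 0.1}
# A match on the frequent value "red" scores higher (closer to 0) than a
# match on the rare value "green", reflecting Lin's match weighting.
```

Note that the per-attribute values are non-positive; the data-dependent weight w_k in Table 2 (itself a sum of log-probabilities) normalizes them into a usable similarity.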
Another interesting observation from column 3 is that several measures (Lin, Lin1, Goodall1, Goodall3, Smirnov, Anderberg) assign higher similarity to a match when the attribute value is rare (f_k is low), while Goodall2 and Goodall4 assign higher similarity to a match when the attribute value is frequent (f_k is high). Only Gambaryan assigns the maximum similarity when the attribute value has a frequency close to N/2. Column 4 shows that IOF, Lin, Lin1, Smirnov and Burnaby assign greater similarity when the mismatch occurs between rare values, while OF and Anderberg assign greater similarity to a mismatch between frequent values.

5 Outlier Detection in Categorical Data. Outlier detection refers to detecting instances that do not conform to a specific definition of normal behavior. For nearest neighbor techniques, a normal instance is one that has a very tight neighborhood. In the categorical domain, this corresponds to the frequency of occurrence of a combination of attribute values. Normal points are frequent combinations of categorical values

Table 3: Relation between the per-attribute similarity S_k(X_k, Y_k) and the quantities n_k, f_k(X_k), f_k(Y_k).

Measure      n_k      X_k = Y_k                          X_k ≠ Y_k
Overlap      —        1                                  0
Eskin        n_k²     1                                  ∝ n_k²
IOF          —        1                                  ∝ 1/(log f_k(X_k) · log f_k(Y_k))
OF           —        1                                  ∝ log f_k(X_k) · log f_k(Y_k)
Lin          —        ∝ log f_k(X_k)                     ∝ log(f_k(X_k) + f_k(Y_k))
Lin1         —        ∝ 1/log f_k(X_k)                   ∝ 1/log(f_k(X_k) f_k(Y_k))
Goodall1     —        ∝ (1 − f_k²(X_k))                  0
Goodall2     —        ∝ f_k²(X_k)                        0
Goodall3     —        ∝ (1 − f_k²(X_k))                  0
Goodall4     —        ∝ f_k²(X_k)                        0
Smirnov      —        ∝ 1/f_k(X_k)                       ∝ 1/(f_k(X_k) + f_k(Y_k))
Gambaryan    —        maximum at f_k(X_k) = N/2          0
Burnaby      —        1                                  ∝ 1/(log f_k(X_k) · log f_k(Y_k))
Anderberg    1/n_k    ∝ 1/f_k²(X_k)                      ∝ f_k(X_k) f_k(Y_k)

while outliers are the rarely occurring combinations. We will first provide an understanding of normal and outlier instances in categorical data from this perspective. Consider the example shown earlier in Table 1, and assume that a count of 20 or more is considered frequent while anything below is considered rare. Now consider the following 4 instances belonging to D:

1. (red, square): The combination occurs 30 times (frequent).

2. (green, circle): The combination occurs 1 time (rare); the value green for color occurs 5 times (rare) and the value circle for shape occurs 28 times (frequent).

3. (red, circle): The combination occurs 2 times (rare); the value red for color occurs 35 times (frequent) and the value circle for shape occurs 28 times (frequent).

4. (green, triangle): The combination occurs 2 times (rare); the value green for color occurs 5 times (rare) and the value triangle for shape occurs 5 times (rare).

Instance 1 seems to be an obvious normal instance, while instance 4 seems to be an obvious outlier. Instances 2 and 3 occur rarely, but one or both of their individual attribute values occur frequently. These might be considered outliers or normal depending on the data domain. Thus we observe that normal and outlier instances in a categorical data set might differ in their composition.

5.1
Outlier Detection Using Nearest Neighbors. Nearest-neighbor-based techniques for outlier detection assume that outliers lie far away from the normal points under a given similarity measure. The general methodology of such techniques is to estimate the density around each point. The density is measured either by counting the number of points within a certain radius of the point, or by estimating the sparsity of the neighborhood of the point.

knn Outlier Detection. The nearest neighbor technique used in this paper [26] takes a single parameter k. The outlier score of a point is equal to the distance of the point to its k-th nearest neighbor.

lof Outlier Detection. This technique [5] uses the notion of the k-distance of a given point p, defined as the distance of p to its k-th nearest neighbor. The k-distance neighborhood of a point p consists of all points that are at a distance less than or equal to its k-distance. Note that the size of the k-distance neighborhood of p need not be exactly k. The k-distance neighborhood of p is denoted by N_k(p). The reachability distance of a point p with respect to another point o is defined as

(5.2) r_k(p, o) = max{k-distance(o), d(p, o)}

where d(p, o) is the actual distance between p and o. For points that are far apart, the reachability distance and the actual distance are the same. For points that are close to p, the reachability distance is replaced by the k-distance of the other point. The local reachability density (lrd) of a point p is defined as

(5.3) lrd_k(p) = ( Σ_{o ∈ N_k(p)} r_k(p, o) / |N_k(p)| )^(−1)

If there are duplicates in the data, such that the k-neighborhood of a point consists only of duplicates, the lrd computation will run into the problem of division

by 0. For continuous data sets such a scenario is highly unlikely, but it might occur in categorical data sets. In such cases there are two possible solutions:

1. Assign a small distance (ε) between two identical points.

2. Require the k-neighborhood of any point to consist of k distinct points.

The local outlier factor, or outlier score (lof), of a point p is defined as

(5.4) lof_k(p) = ( Σ_{o ∈ N_k(p)} lrd_k(o) / lrd_k(p) ) / |N_k(p)|

6 Experimental Evaluation

In this section we present an experimental evaluation of the 14 measures given in Table 4 on 23 different data sets in the context of outlier detection. Of these data sets, 21 are based on data sets available at the UCI Machine Learning Repository [3], and two are based on network data generated by SKAION Corp. for the ARDA information assurance program [7]. The 23 data sets are summarized in Table 4. Eleven of these data sets were purely categorical, five (KD1, KD2, Sk1, Sk2, Cen) had a mix of continuous and categorical attributes, and two data sets, Irs and Sgm, were purely continuous. Continuous variables were discretized using the MDL method [10]. The KD1 and KD2 data sets were obtained from the KDDCup data set by discretizing the continuous attributes into 10 and 100 bins respectively. Another possible way to handle a mixture of attributes is to compute the similarity for continuous and categorical attributes separately, and then do a weighted aggregation; in this study we converted the continuous attributes to categorical ones to simplify the comparative evaluation. Each data set contains labeled instances belonging to multiple classes. We identified one class as the outlier class, and the rest of the classes were grouped together and called normal. The last two rows in Table 4 denote the cross-validation classification recall and precision reported by the C4.5 classifier on the outlier class. This quantity indicates the separability between instances belonging to the normal class(es) and instances belonging to the outlier class, using the given set of attributes.
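The two outlier scores defined above, the kNN score and formulas (5.2)–(5.4), can be sketched in a few lines of Python. This is a minimal illustration, not the report's implementation: the toy data, the `dist` function, the ε guard, and all names below are assumptions.

```python
def knn_score(p, data, dist, k):
    # kNN outlier score: distance from p to its k-th nearest neighbor.
    return sorted(dist(p, o) for o in data if o is not p)[k - 1]

def neighborhood(q, data, dist, k):
    # k-distance of q and its k-distance neighborhood N_k(q); the
    # neighborhood may contain more than k points in case of ties.
    others = sorted((o for o in data if o is not q), key=lambda o: dist(q, o))
    kdist = dist(q, others[k - 1])
    return kdist, [o for o in others if dist(q, o) <= kdist]

def reach_dist(p, o, data, dist, k):
    # (5.2): r_k(p, o) = max{k-distance(o), d(p, o)}.
    return max(neighborhood(o, data, dist, k)[0], dist(p, o))

def lrd(p, data, dist, k, eps=1e-9):
    # (5.3): inverse of the average reachability distance over N_k(p);
    # eps guards the division by 0 caused by duplicate-only neighborhoods.
    _, nbhd = neighborhood(p, data, dist, k)
    avg = sum(reach_dist(p, o, data, dist, k) for o in nbhd) / len(nbhd)
    return 1.0 / max(avg, eps)

def lof(p, data, dist, k):
    # (5.4): average ratio of the neighbors' lrd to p's own lrd;
    # values well above 1 flag p as an outlier.
    _, nbhd = neighborhood(p, data, dist, k)
    return sum(lrd(o, data, dist, k) for o in nbhd) / (len(nbhd) * lrd(p, data, dist, k))

# Usage on toy 1-D data (pass an element of `data` as the query point,
# since membership is checked by identity):
data = [0.0, 0.1, 0.2, 0.3, 5.0]
dist = lambda a, b: abs(a - b)
print(knn_score(data[-1], data, dist, k=2))  # 4.8
print(lof(data[-1], data, dist, k=2))        # well above 1
```

For categorical data, `dist` would be derived from one of the similarity measures of Table 4 (for example, 1 minus the normalized similarity).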
A low accuracy implies that distinguishing between outliers and normal instances is difficult in that particular data set using a decision tree-based classifier.

6.1 Evaluation Methodology. The performance of the different similarity measures was evaluated in the context of outlier detection using nearest neighbors [26, 30]. We construct a test data set by taking equal numbers of instances as random samples from the outlier class (n) and the normal class(es). In addition, a random sample (comparable in size to the outlier class) is taken from the normal class to serve as the training set. For each test instance we find its k nearest neighbors in the training set, using the given similarity measure (we chose the parameter k = 10). The outlier score is computed for the kNN algorithm and the lof algorithm as discussed earlier. The test instances are then sorted in decreasing order of outlier score. To evaluate a measure, we count the number of true outliers in the top p portion of the sorted test instances, where p = δn, 0 < δ ≤ 1. Let o be the number of actual outliers among the top p predicted outliers. The accuracy of the algorithm is measured as o/p. In this paper we present results for δ = 1. We have also experimented with other, lower values of δ, and the trends in relative performance are similar.

6.2 Experimental Results on Public Data Sets. Our experimental results verified our initial hypotheses about categorical similarity measures. As can be seen from Table 5, there are many situations where the Overlap measure does not give good performance. This is consistent with our intuition that the use of additional information would lead to better performance. In particular, we expected that since categorical data does not have an inherent ordering, data-driven measures would be able to take advantage of information present in the data set to make more accurate determinations of the similarity between a pair of data instances. We make some key observations about the results in Table 5:

1. No single measure is always superior or inferior.
This is to be expected since each data set has different characteristics.

2. Some measures give consistently better performance on a large variety of data. The Lin, OF, and Goodall3 measures give among the best performance overall in terms of outlier detection. This is noteworthy since Lin and Goodall3 have been introduced for the first time in this paper.

3. Some pairs of measures exhibit complementary performance, i.e., one performs well where the other performs poorly and vice versa. Example complementary pairs are (OF, IOF), (Lin, Lin1) and (Goodall3, Goodall4). This observation means that it may be possible to construct measures that draw on the strengths of two measures in order to obtain superior performance. This

Table 4: Description of Public Data Sets. Columns: the 23 data sets Cr1, Cr2, Irs, Cn1, Cn2, KD1, KD2, KD3, KD4, Sk1, Sk2, Ms1, Ms2, Sgmt, Cen, Bal, Can, Hys, Lym, Nur, Tmr, TTT, Au; rows: Size, % Outliers, avg(n_k), med(n_k), f_k Uniform, f_k Gaussian, f_k Skewed, Recall, Precision. (Numeric entries were not preserved in this transcription.)

Table 5: Experimental Results for the kNN Algorithm for 100%. Rows: the 14 measures ovrlp, eskn, iof, of, lin, lin1, goo1, goo2, goo3, goo4, smrnv, gmbrn, brnby, anbrg, plus a per-data-set average; columns: the 23 data sets plus a per-measure average. (Numeric entries were not preserved in this transcription.)

Table 6: Experimental Results for the LOF Algorithm for 100%. (Same layout as Table 5: the 14 measures against the 23 data sets, with row and column averages; numeric entries were not preserved in this transcription.)

Table 7: Experimental Results for the kNN Algorithm for 50%. (Same layout; numeric entries not preserved.)

Table 9: Experimental Results for the kNN Algorithm for 25%. (Same layout as Table 5; numeric entries were not preserved in this transcription.)

Table 8: Experimental Results for the LOF Algorithm for 50%. (Same layout; numeric entries not preserved.)

Table 10: Experimental Results for the LOF Algorithm for 25%. (Same layout as Table 5; numeric entries were not preserved in this transcription.)

is an aspect of this work that needs to be pursued in future work.

4. The performance of an outlier detection algorithm is significantly affected by the similarity measure used. For example, for the Cn1 data set, which has a very low classification accuracy for the outlier class, using OF still achieves close to 50% accuracy.

5. The Eskin similarity measure weights attributes proportionally to the number of values taken by the attribute (n_k). For data sets in which the attributes take a large number of values (e.g., KD2, Sk1, Sk2), eskn performs very poorly.

6. The Smirnov measure assigns similarity to both diagonal and off-diagonal entries in the per-attribute similarity matrix (Figure 1), but it still performs very poorly on most of the data sets. The other measures that operate similarly (Lin, Lin1 and Anderberg) perform better than Smirnov on almost every data set.

7. The performance of kNN does not vary significantly for different values of δ, as seen from Tables 5, 7, and 9.

8. Using lof as the outlier detection algorithm (refer to Tables 6, 8, and 10) improves the overall performance for almost every similarity measure; the drop in performance for the 14 measures at δ = 1.00 is marginal. This indicates that lof is a better outlier detection algorithm than kNN for categorical data sets. The relation between the algorithm and the similarity measure is also of significance and will be a part of our future research.

7 Concluding Remarks and Future Work

Computing similarity between categorical attributes has been discussed in a variety of contexts. In this paper we have brought together several such measures and evaluated them in the context of outlier detection.
We have also proposed several variants (Lin1, Goodall2, Goodall3, Goodall4) of existing similarity measures, some of which perform very well as shown in our evaluation. Given this set of similarity measures, the first question that comes to mind is: which similarity measure is best suited for my data mining task? Our experimental results suggest that there is no single best-performing similarity measure. Hence, one needs to understand how a similarity measure handles the different characteristics of a categorical data set, and this needs to be explored in future research.


A Randomized Approximate Nearest Neighbors Algorithm - a short version We present a ranomize algorithm for the approximate nearest neighbor problem in - imensional Eucliean space. Given N points {x } in R, the algorithm attempts to fin k nearest neighbors for each of x, where

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in

More information

Computing the Longest Common Subsequence of Multiple RLE Strings

Computing the Longest Common Subsequence of Multiple RLE Strings The 29th Workshop on Combinatorial Mathematics an Computation Theory Computing the Longest Common Subsequence of Multiple RLE Strings Ling-Chih Yao an Kuan-Yu Chen Grauate Institute of Networking an Multimeia

More information

THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE

THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE Journal of Soun an Vibration (1996) 191(3), 397 414 THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE E. M. WEINSTEIN Galaxy Scientific Corporation, 2500 English Creek

More information

An Analytical Expression of the Probability of Error for Relaying with Decode-and-forward

An Analytical Expression of the Probability of Error for Relaying with Decode-and-forward An Analytical Expression of the Probability of Error for Relaying with Decoe-an-forwar Alexanre Graell i Amat an Ingmar Lan Department of Electronics, Institut TELECOM-TELECOM Bretagne, Brest, France Email:

More information

Optimal Signal Detection for False Track Discrimination

Optimal Signal Detection for False Track Discrimination Optimal Signal Detection for False Track Discrimination Thomas Hanselmann Darko Mušicki Dept. of Electrical an Electronic Eng. Dept. of Electrical an Electronic Eng. The University of Melbourne The University

More information

Balancing Expected and Worst-Case Utility in Contracting Models with Asymmetric Information and Pooling

Balancing Expected and Worst-Case Utility in Contracting Models with Asymmetric Information and Pooling Balancing Expecte an Worst-Case Utility in Contracting Moels with Asymmetric Information an Pooling R.B.O. erkkamp & W. van en Heuvel & A.P.M. Wagelmans Econometric Institute Report EI2018-01 9th January

More information

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k A Proof of Lemma 2 B Proof of Lemma 3 Proof: Since the support of LL istributions is R, two such istributions are equivalent absolutely continuous with respect to each other an the ivergence is well-efine

More information

Estimation of the Maximum Domination Value in Multi-Dimensional Data Sets

Estimation of the Maximum Domination Value in Multi-Dimensional Data Sets Proceeings of the 4th East-European Conference on Avances in Databases an Information Systems ADBIS) 200 Estimation of the Maximum Domination Value in Multi-Dimensional Data Sets Eleftherios Tiakas, Apostolos.

More information

Inverse Theory Course: LTU Kiruna. Day 1

Inverse Theory Course: LTU Kiruna. Day 1 Inverse Theory Course: LTU Kiruna. Day Hugh Pumphrey March 6, 0 Preamble These are the notes for the course Inverse Theory to be taught at LuleåTekniska Universitet, Kiruna in February 00. They are not

More information

Space-time Linear Dispersion Using Coordinate Interleaving

Space-time Linear Dispersion Using Coordinate Interleaving Space-time Linear Dispersion Using Coorinate Interleaving Jinsong Wu an Steven D Blostein Department of Electrical an Computer Engineering Queen s University, Kingston, Ontario, Canaa, K7L3N6 Email: wujs@ieeeorg

More information

Bayesian Estimation of the Entropy of the Multivariate Gaussian

Bayesian Estimation of the Entropy of the Multivariate Gaussian Bayesian Estimation of the Entropy of the Multivariate Gaussian Santosh Srivastava Fre Hutchinson Cancer Research Center Seattle, WA 989, USA Email: ssrivast@fhcrc.org Maya R. Gupta Department of Electrical

More information

Bohr Model of the Hydrogen Atom

Bohr Model of the Hydrogen Atom Class 2 page 1 Bohr Moel of the Hyrogen Atom The Bohr Moel of the hyrogen atom assumes that the atom consists of one electron orbiting a positively charge nucleus. Although it oes NOT o a goo job of escribing

More information

Calculus in the AP Physics C Course The Derivative

Calculus in the AP Physics C Course The Derivative Limits an Derivatives Calculus in the AP Physics C Course The Derivative In physics, the ieas of the rate change of a quantity (along with the slope of a tangent line) an the area uner a curve are essential.

More information

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties Journal of Machine Learning Research 16 (2015) 1547-1572 Submitte 1/14; Revise 9/14; Publishe 8/15 Flexible High-Dimensional Classification Machines an Their Asymptotic Properties Xingye Qiao Department

More information

Calculus of Variations

Calculus of Variations Calculus of Variations Lagrangian formalism is the main tool of theoretical classical mechanics. Calculus of Variations is a part of Mathematics which Lagrangian formalism is base on. In this section,

More information

A New Family of Near-metrics for Universal Similarity

A New Family of Near-metrics for Universal Similarity arxiv:1707.06903v3 [stat.ml] 17 Oct 2017 A New Family of Near-metrics for Universal Similarity Chu Wang Iraj Saniee William S. Kenney Chris A. White October 18, 2017 Abstract We propose a family of near-metrics

More information

A simple model for the small-strain behaviour of soils

A simple model for the small-strain behaviour of soils A simple moel for the small-strain behaviour of soils José Jorge Naer Department of Structural an Geotechnical ngineering, Polytechnic School, University of São Paulo 05508-900, São Paulo, Brazil, e-mail:

More information

2Algebraic ONLINE PAGE PROOFS. foundations

2Algebraic ONLINE PAGE PROOFS. foundations Algebraic founations. Kick off with CAS. Algebraic skills.3 Pascal s triangle an binomial expansions.4 The binomial theorem.5 Sets of real numbers.6 Surs.7 Review . Kick off with CAS Playing lotto Using

More information

Non-deterministic Social Laws

Non-deterministic Social Laws Non-eterministic Social Laws Michael H. Coen MIT Artificial Intelligence Lab 55 Technology Square Cambrige, MA 09 mhcoen@ai.mit.eu Abstract The paper generalizes the notion of a social law, the founation

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

KNN Particle Filters for Dynamic Hybrid Bayesian Networks

KNN Particle Filters for Dynamic Hybrid Bayesian Networks KNN Particle Filters for Dynamic Hybri Bayesian Networs H. D. Chen an K. C. Chang Dept. of Systems Engineering an Operations Research George Mason University MS 4A6, 4400 University Dr. Fairfax, VA 22030

More information

State observers and recursive filters in classical feedback control theory

State observers and recursive filters in classical feedback control theory State observers an recursive filters in classical feeback control theory State-feeback control example: secon-orer system Consier the riven secon-orer system q q q u x q x q x x x x Here u coul represent

More information

Axiometrics: Axioms of Information Retrieval Effectiveness Metrics

Axiometrics: Axioms of Information Retrieval Effectiveness Metrics Axiometrics: Axioms of Information Retrieval Effectiveness Metrics ABSTRACT Ey Maalena Department of Maths Computer Science University of Uine Uine, Italy ey.maalena@uniu.it There are literally ozens most

More information

Lower bounds on Locality Sensitive Hashing

Lower bounds on Locality Sensitive Hashing Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an

More information