Similarity Measures for Categorical Data: A Comparative Evaluation


Shyam Boriah    Varun Chandola    Vipin Kumar
Department of Computer Science and Engineering, University of Minnesota
{sboriah,chandola,kumar}@cs.umn.edu

Redistribution subject to SIAM license or copyright.

Abstract

Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances, but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates the others for all types of problems, some measures achieve consistently high performance.

1 Introduction

Measuring similarity or distance between two data points is a core requirement for several data mining and knowledge discovery tasks that involve distance computation. Examples include clustering (k-means), distance-based outlier detection, classification (kNN, SVM), and several other data mining tasks. These algorithms typically treat the similarity computation as an orthogonal step and can make use of any measure. For continuous data sets, the Minkowski distance is a general method used to compute the distance between two multivariate points. In particular, the Minkowski distances of order 1 (Manhattan) and order 2 (Euclidean) are the two most widely used distance measures for continuous data. The key observation about the above measures is that they are independent of the underlying data set to which the two points belong. Several data-driven measures, such as the Mahalanobis distance, have also been explored for continuous data.
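The Minkowski family mentioned above is straightforward to write down; the sketch below (our illustrative code, not from the paper) shows the order-1 and order-2 special cases.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two continuous points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# Order 1 (Manhattan) and order 2 (Euclidean) on the same pair of points:
d1 = minkowski((0, 0), (3, 4), p=1)   # 3 + 4 = 7.0
d2 = minkowski((0, 0), (3, 4), p=2)   # sqrt(9 + 16) = 5.0
```

Note that neither distance consults the rest of the data set, which is exactly the property the data-driven measures discussed later give up.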
The notion of similarity or distance for categorical data is not as straightforward as for continuous data. The key characteristic of categorical data is that the different values that a categorical attribute takes are not inherently ordered. Thus, it is not possible to directly compare two different categorical values. The simplest way to find similarity between two categorical attributes is to assign a similarity of 1 if the values are identical and a similarity of 0 if the values are not identical. For two multivariate categorical data points, the similarity between them will be directly proportional to the number of attributes in which they match. This simple measure is also known as the overlap measure in the literature [33]. One obvious drawback of the overlap measure is that it does not distinguish between the different values taken by an attribute. All matches, as well as mismatches, are treated as equal. For example, consider a categorical data set D, defined over two attributes: color and shape. Let color take 3 possible values in D: {red, blue, green}, and let shape take 3 possible values in D: {square, circle, triangle}. Table 1 summarizes the frequency of occurrence for each possible combination in D.

Table 1: Frequency distribution of a simple 2-D categorical data set (rows: color in {red, blue, green}; columns: shape in {square, circle, triangle}; with row and column totals).

The overlap similarity between the two instances (green, square) and (green, circle) is 1/2. The overlap similarity between (blue, square) and (blue, circle) is also 1/2. But the frequency distribution in Table 1 shows that while (blue, square) and (blue, circle) are frequent combinations, (green, square) and (green, circle) are very rare

combinations in the data set. Thus, it would appear that the overlap measure is too simplistic in giving equal importance to all matches and mismatches. Although there is no inherent ordering in categorical data, the previous example shows that there is other information in categorical data sets that can be used to define what should be considered more similar and what should be considered less similar. This observation has motivated researchers to come up with data-driven similarity measures for categorical attributes. Such measures take into account the frequency distribution of different attribute values in a given data set to define similarity between two categorical attribute values. In this paper, we study a variety of similarity measures proposed in diverse research fields ranging from statistics to ecology, as well as many of their variations. Each measure uses the information present in the data uniquely to define similarity. Since we are evaluating data-driven similarity measures, it is obvious that their performance is highly related to the data set being analyzed. To understand this relationship, we first identify the key characteristics of a categorical data set. For each of the different similarity measures that we study, we analyze how it relates to the different characteristics of the data set.

1.1 Key Contributions. The key contributions of this paper are as follows:

We bring together fourteen different categorical measures from different fields and study them together in a single context. Many of these measures have not been investigated outside the domain they were introduced in, and have not been compared with other measures.

We classify the categorical measures in three different ways based on how they utilize information in the data.

We evaluate the various similarity measures for categorical data on a wide variety of benchmark data sets.
In particular, we show the utility of data-driven measures for the problem of determining similarity with categorical data.

We also propose a number of new measures that are either variants of other previously proposed measures, or derived from previously proposed similarity frameworks. The performance of some of the measures we propose is among the best of all the measures we study.

1.2 Organization of the Paper. The rest of the paper is organized as follows. We first discuss related efforts in the study of similarity measures in Section 2. In Section 3, we identify various characteristics of categorical data that are relevant to this study. We then introduce the 14 different similarity measures studied in this paper in Section 4. We describe our experimental setup, evaluation methodology and the results on public data sets in Section 5.

2 Related Work

Sneath and Sokal discuss categorical similarity measures in some detail in their book [3] on numerical taxonomy. They were among the first to put together and discuss many such measures. At the time, two major concerns were (1) biological relevance, since numerical taxonomy was mainly concerned with taxonomies from biology, ecology, etc., and (2) computational efficiency, since computational resources were limited and scarce. Nevertheless, many of the observations made by Sneath and Sokal are quite relevant today and offer key insights into many of the measures. There are several books [, 8, 6, ] on cluster analysis that discuss the problem of determining similarity between categorical attributes. However, most of these books do not offer solutions to the problem or discuss the measures in this paper; the usual recommendation is to binarize the data and then use binary similarity measures. Wilson and Martinez [36] performed a detailed study of heterogeneous distance functions (for data with categorical and continuous attributes) for instance-based learning.
The measures in their study are based upon a supervised approach, where each data instance has class information in addition to a set of categorical/continuous attributes. The measures discussed in this paper are orthogonal to the ones proposed in [36], since supervised measures determine similarity based on class information, while data-driven measures determine similarity based on the data distribution. In principle, both ideas can be combined. A number of new data mining techniques for categorical data have been proposed recently. Some of them use notions of similarity which are neighborhood-based [5, 4, 8, 6,, ], or incorporate the similarity computation into the learning algorithm [3, 7, ]. Neighborhood-based approaches use some notion of similarity (usually the overlap measure) to define the neighborhood of a data instance, while the measures we study in this paper are directly used to determine similarity between a pair of data instances; hence, we see the measures discussed in this paper as being useful to compute the neighborhood of a point and

neighborhood-based measures as meta-similarity measures. Since techniques which embed similarity measures into the learning algorithm do not explicitly define general categorical similarity measures, we do not discuss them in this paper. Jones and Furnas [0] studied several similarity measures in the field of information retrieval. In particular, they performed a geometric analysis of continuous measures in order to reveal important differences which would affect retrieval performance. Noreault et al. [5] also studied measures in information retrieval, with the goal of generalizing effectiveness based on empirically evaluating the performance of the measures. Another comparative empirical evaluation for determining similarity between fuzzy sets was performed by Zwick et al. [37], followed by several others [7, 35].

3 Categorical Data

Categorical data (also known as nominal or qualitative multi-state data) has been studied for a long time in various contexts. As mentioned earlier, computing similarity between categorical data instances is not straightforward, owing to the fact that there is no explicit notion of ordering between categorical values. To overcome this problem, several data-driven similarity measures have been proposed for categorical data. The behavior of such measures directly depends on the data. In this section, we identify the key characteristics of a categorical data set that can potentially affect the behavior of a data-driven similarity measure. For notation, consider a categorical data set D containing N objects, defined over a set of d categorical attributes, where A_k denotes the k-th attribute. Let the attribute A_k take n_k values in the given data set, denoted by the set A_k. We also use the following notation:

f_k(x): the number of times attribute A_k takes the value x in the data set D.
Note that if x ∉ A_k, then f_k(x) = 0.

p̂_k(x): the sample probability of attribute A_k taking the value x in the data set D, given by

    p̂_k(x) = f_k(x) / N

p²_k(x): another probability estimate of attribute A_k taking the value x in the data set, given by

    p²_k(x) = f_k(x)(f_k(x) − 1) / (N(N − 1))

3.1 Characteristics of Categorical Data. Since this paper discusses data-driven similarity measures for categorical data, a key task is to identify the characteristics of a categorical data set that affect the behavior of such a similarity measure. We enumerate these characteristics below:

Size of data, N. As we will see later, most measures are typically invariant to the size of the data, though there are some measures (e.g., Smirnov) that do make use of this information.

Number of attributes, d. Most measures are invariant to this characteristic, since they typically normalize the similarity over the number of attributes. But in our experimental results we observe that the number of attributes does affect the performance of the outlier detection algorithms.

Number of values taken by each attribute, n_k. A data set might contain attributes that take several values and attributes that take very few values. For example, one attribute might take several hundred possible values, while another attribute might take very few values. A similarity measure might give more importance to the second attribute, while ignoring the first one. In fact, one of the measures discussed in this paper (Eskin) behaves exactly like this.

Distribution of f_k(x). This refers to the distribution of the frequencies of the values taken by an attribute in the given data set. In certain data sets an attribute might be distributed uniformly over the set A_k, while in others the distribution might be skewed. A similarity measure might give more importance to attribute values that occur rarely, while another similarity measure might give more importance to frequently occurring attribute values.
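The three quantities defined above can be computed directly from a data set; the following sketch (the toy data and function names are ours, for illustration only) implements f_k(x), p̂_k(x) and p²_k(x).

```python
from collections import Counter

# Toy data set D: rows are instances, columns are categorical attributes.
D = [
    ("red", "square"), ("red", "square"), ("blue", "circle"),
    ("blue", "circle"), ("green", "triangle"),
]
N = len(D)

def f(k, x):
    """f_k(x): number of times attribute k takes the value x in D."""
    # Recomputing the Counter per call is wasteful but keeps the sketch short.
    return Counter(row[k] for row in D)[x]

def p_hat(k, x):
    """Sample probability: p̂_k(x) = f_k(x) / N."""
    return f(k, x) / N

def p2(k, x):
    """Second estimate: p²_k(x) = f_k(x)(f_k(x) - 1) / (N(N - 1))."""
    fx = f(k, x)
    return fx * (fx - 1) / (N * (N - 1))

# For attribute 0 (color): f_0("red") = 2, p̂_0("red") = 0.4,
# p²_0("red") = 2*1 / (5*4) = 0.1
```

Note that p²_k(x) is zero for values seen only once, which is why, as discussed later, measures built on it (the Goodall variants) react strongly to rare values.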
4 Similarity Measures for Categorical Data

The study of similarity between data objects with categorical variables has had a long history. Pearson proposed a chi-square statistic in the late 1800s, which is often used to test independence between categorical variables in a contingency table. Pearson's chi-square statistic was later modified and extended, leading to several other measures [8, 4, 7]. More recently, however, the overlap measure has become the most commonly used similarity measure for categorical data. Its popularity is perhaps related to its simplicity and ease of use. In this section, we discuss the overlap measure and several data-driven similarity measures for categorical data.

Note that we have converted measures that were originally proposed as distances into similarity measures, in order to make the measures comparable in this study. The measures discussed henceforth will all be in the context of similarity, with distance measures converted using the formula:

    sim = 1 / (1 + dist)

Almost all similarity measures assign a similarity value between two data instances X and Y belonging to the data set D (introduced in Section 3) as follows:

    (4.1)    S(X, Y) = Σ_{k=1}^{d} w_k S_k(X_k, Y_k)

where S_k(X_k, Y_k) is the per-attribute similarity between two values for the categorical attribute A_k. Note that X_k, Y_k ∈ A_k. The quantity w_k denotes the weight assigned to the attribute A_k.

To understand how different measures calculate the per-attribute similarity S_k(X_k, Y_k), consider a categorical attribute A which takes one of the values {a, b, c, d}. We have dropped the subscript k for simplicity. The per-attribute similarity computation is equivalent to constructing the (symmetric) matrix shown in Figure 1:

        a        b        c        d
    a   S(a, a)  S(a, b)  S(a, c)  S(a, d)
    b            S(b, b)  S(b, c)  S(b, d)
    c                     S(c, c)  S(c, d)
    d                              S(d, d)

Figure 1: Similarity matrix for a single categorical attribute.

Essentially, in determining the similarity between two values, any categorical measure is filling in the entries of this matrix. For example, the overlap measure sets the diagonal entries to 1 and the off-diagonal entries to 0, i.e., the similarity is 1 if the values match and 0 if the values mismatch.
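Equation 4.1 separates every measure into a per-attribute similarity S_k and an attribute weight w_k. A minimal sketch of this decomposition (our illustrative code, not from the paper), with the overlap measure as the simplest instance:

```python
from typing import Callable, Sequence

def similarity(x: Sequence, y: Sequence,
               per_attr: Callable[[int, object, object], float],
               weight: Callable[[int], float]) -> float:
    """S(X, Y) = sum over k of w_k * S_k(X_k, Y_k)   (Equation 4.1)."""
    return sum(weight(k) * per_attr(k, x[k], y[k]) for k in range(len(x)))

def overlap(x: Sequence, y: Sequence) -> float:
    """Overlap: S_k is the 0/1 match indicator and w_k = 1/d for every k."""
    return similarity(x, y,
                      per_attr=lambda k, a, b: 1.0 if a == b else 0.0,
                      weight=lambda k: 1.0 / len(x))

# The pairs from the introduction match on 1 of 2 attributes:
s = overlap(("green", "square"), ("green", "circle"))   # 0.5
```

Any of the data-driven measures described below can be plugged in by swapping `per_attr` and `weight`; only Anderberg resists this decomposition.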
Additionally, measures may use the following information in computing a similarity value (all the measures in this paper use only this information):

f(a), f(b), f(c), f(d), the frequencies of the values in the data set;

N, the size of the data set;

n, the number of values taken by the attribute (4 in the case above).

We can classify measures in several ways, based on: (i) the manner in which they fill the entries of the similarity matrix, (ii) whether the weight given to matches or mismatches is a function of the frequency of the attribute values, and (iii) the arguments used to propose the measure (probabilistic, information-theoretic, etc.). In this paper, we describe the measures by classifying them as follows:

Those that fill the diagonal entries only. These measures set the off-diagonal entries to 0 (mismatches are uniformly given the minimum value) and give possibly different weights to matches.

Those that fill the off-diagonal entries only. These measures set the diagonal entries to 1 (matches are uniformly given the maximum value) and give possibly different weights to mismatches.

Those that fill both diagonal and off-diagonal entries. These measures give different weights to both matches and mismatches.

Table 2 gives the mathematical formulas for the measures described in this paper; each measure computes the per-attribute similarity S_k(X_k, Y_k) as shown in the second column, and the attribute weight w_k as shown in the third column.

4.1 Measures that fill Diagonal Entries only.

1. Overlap. The overlap measure simply counts the number of attributes that match in the two data instances. The range of per-attribute similarity for the overlap measure is [0, 1], with a value of 0 occurring when there is no match, and a value of 1 occurring when the attribute values match.

2. Goodall. Goodall [4] proposed a measure that attempts to normalize the similarity between two objects by the probability that the observed similarity value could be observed in a random sample of two points.
This measure assigns higher similarity to a match if the value is infrequent than if the value is frequent. Goodall's original measure details a procedure for combining similarities in the multivariate setting which takes into account dependencies between attributes. Since this procedure is computationally expensive, we use a simpler version of the measure (described next as Goodall1). Goodall's original measure is not empirically evaluated in this paper. We also propose three other variants of Goodall's measure in this paper: Goodall2, Goodall3 and Goodall4.

Table 2: Similarity measures for categorical attributes. Each measure defines a per-attribute similarity S_k(X_k, Y_k) and an attribute weight w_k, k = 1, ..., d, with S(X, Y) = Σ_{k=1}^{d} w_k S_k(X_k, Y_k).

1. Overlap:
   S_k = 1 if X_k = Y_k; 0 otherwise.    w_k = 1/d.

2. Eskin:
   S_k = 1 if X_k = Y_k; n_k² / (n_k² + 2) otherwise.    w_k = 1/d.

3. IOF:
   S_k = 1 if X_k = Y_k; 1 / (1 + log f_k(X_k) · log f_k(Y_k)) otherwise.    w_k = 1/d.

4. OF:
   S_k = 1 if X_k = Y_k; 1 / (1 + log(N / f_k(X_k)) · log(N / f_k(Y_k))) otherwise.    w_k = 1/d.

5. Lin:
   S_k = 2 log p̂_k(X_k) if X_k = Y_k; 2 log(p̂_k(X_k) + p̂_k(Y_k)) otherwise.
   w_k = 1 / Σ_{i=1}^{d} (log p̂_i(X_i) + log p̂_i(Y_i)).

6. Lin1:
   S_k = Σ_{q ∈ Q} log p̂_k(q) if X_k = Y_k; 2 log Σ_{q ∈ Q} p̂_k(q) otherwise.
   w_k = 1 / Σ_{i=1}^{d} Σ_{q ∈ Q} log p̂_i(q).

7. Goodall1:
   S_k = 1 − Σ_{q ∈ Q} p²_k(q) if X_k = Y_k; 0 otherwise.    w_k = 1/d.

8. Goodall2:
   S_k = 1 − Σ_{q ∈ Q} p²_k(q) if X_k = Y_k; 0 otherwise.    w_k = 1/d.

9. Goodall3:
   S_k = 1 − p²_k(X_k) if X_k = Y_k; 0 otherwise.    w_k = 1/d.

10. Goodall4:
    S_k = p²_k(X_k) if X_k = Y_k; 0 otherwise.    w_k = 1/d.

11. Smirnov:
    S_k = 2 + (N − f_k(X_k)) / f_k(X_k) + Σ_{q ∈ A_k \ {X_k}} f_k(q) / (N − f_k(q)) if X_k = Y_k;
    S_k = Σ_{q ∈ A_k \ {X_k, Y_k}} f_k(q) / (N − f_k(q)) otherwise.
    w_k = 1 / Σ_{k=1}^{d} n_k.

12. Gambaryan:
    S_k = −[p̂_k(X_k) log₂ p̂_k(X_k) + (1 − p̂_k(X_k)) log₂(1 − p̂_k(X_k))] if X_k = Y_k; 0 otherwise.
    w_k = 1 / Σ_{k=1}^{d} n_k.

13. Burnaby:
    S_k = 1 if X_k = Y_k;
    S_k = [Σ_{q ∈ A_k} 2 log(1 − p̂_k(q))] / [log( p̂_k(X_k) p̂_k(Y_k) / ((1 − p̂_k(X_k))(1 − p̂_k(Y_k))) ) + Σ_{q ∈ A_k} 2 log(1 − p̂_k(q))] otherwise.
    w_k = 1/d.

14. Anderberg: cannot be written in the per-attribute form above; the full similarity is

    S(X, Y) = [ Σ_{k: X_k = Y_k} (1 / p̂_k(X_k))² · 2 / (n_k(n_k + 1)) ] /
              [ Σ_{k: X_k = Y_k} (1 / p̂_k(X_k))² · 2 / (n_k(n_k + 1)) + Σ_{k: X_k ≠ Y_k} (1 / (2 p̂_k(X_k) p̂_k(Y_k))) · 2 / (n_k(n_k + 1)) ].

For the Lin1 measure, Q ⊆ A_k is the set of values q with p̂_k(X_k) ≤ p̂_k(q) ≤ p̂_k(Y_k), assuming p̂_k(X_k) ≤ p̂_k(Y_k). For Goodall1, Q = {q ∈ A_k : p²_k(q) ≤ p²_k(X_k)}. For Goodall2, Q = {q ∈ A_k : p²_k(q) ≥ p²_k(X_k)}.

3. Goodall1. The Goodall1 measure is the same as Goodall's measure on a per-attribute basis. However, instead of combining the per-attribute similarities by taking into account dependencies between attributes, the Goodall1 measure takes the average of the per-attribute similarities. The range of S_k(X_k, Y_k) for matches in the Goodall1 measure is [0, 1 − 2/(N(N − 1))], with the minimum attained when attribute A_k takes only one value, and the maximum attained when the value X_k occurs twice while all other possible values of A_k occur more than twice.

4. Goodall2. The Goodall2 measure is a variant of Goodall's measure proposed by us. This measure assigns higher similarity if the matching values are infrequent, and at the same time there are other values that are even less frequent, i.e., the similarity is higher if there are many values with approximately equal frequencies, and lower if the frequency distribution is skewed. The range of S_k(X_k, Y_k) for matches in the Goodall2 measure is [0, 1 − 2/(N(N − 1))], with the minimum attained if attribute A_k takes only one value, and the maximum attained when the value X_k occurs twice while all other possible values of A_k occur only once each.

5. Goodall3. We also propose another variant of Goodall's measure, called Goodall3. The Goodall3 measure assigns a high similarity if the matching values are infrequent, regardless of the frequencies of the other values. The range of S_k(X_k, Y_k) for matches in the Goodall3 measure is [0, 1 − 2/(N(N − 1))], with the minimum attained if X_k is the only value for attribute A_k, and the maximum attained if X_k occurs only twice.

6. Goodall4. The Goodall4 measure assigns 1 minus the Goodall3 similarity for matches. The range of S_k(X_k, Y_k) for matches in the Goodall4 measure is [2/(N(N − 1)), 1], with the minimum attained if X_k occurs only twice, and the maximum attained if X_k is the only value for attribute A_k.

7. Gambaryan.
Gambaryan proposed a measure [] that gives more weight to matches where the matching value occurs in about half the data set, i.e., in between being frequent and rare. The Gambaryan measure for a single attribute match is closely related to the Shannon entropy from information theory, as can be seen from its formula in Table 2. The range of S_k(X_k, Y_k) for matches in the Gambaryan measure is [0, 1], with the minimum attained if X_k is the only value for attribute A_k, and the maximum attained when X_k has frequency N/2.

4.2 Measures that fill Off-diagonal Entries only.

1. Eskin. Eskin et al. [9] proposed a normalization kernel for record-based network intrusion detection data. The original measure is distance-based and assigns a weight of 2/n_k² to mismatches; when adapted to similarity, this becomes a weight of n_k²/(n_k² + 2). This measure gives more weight to mismatches that occur on attributes that take many values. The range of S_k(X_k, Y_k) for mismatches in the Eskin measure is [2/3, N²/(N² + 2)], with the minimum attained when the attribute A_k takes only two values, and the maximum attained when the attribute has all unique values.

2. Inverse Occurrence Frequency (IOF). The inverse occurrence frequency measure assigns lower similarity to mismatches on more frequent values. The IOF measure is related to the concept of inverse document frequency, which comes from information retrieval [9], where it is used to signify the relative number of documents that contain a specific word. A key difference is that inverse document frequency is computed on a term-document matrix, which is usually binary, while the IOF measure is defined for categorical data. The range of S_k(X_k, Y_k) for mismatches in the IOF measure is [1/(1 + (log(N/2))²), 1], with the minimum attained when X_k and Y_k each occur N/2 times (i.e., these are the only two values), and the maximum attained when X_k and Y_k occur only once in the data set.

3. Occurrence Frequency (OF).
The occurrence frequency measure gives the opposite weighting of the IOF measure for mismatches, i.e., mismatches on less frequent values are assigned lower similarity and mismatches on more frequent values are assigned higher similarity. The range of S_k(X_k, Y_k) for mismatches in the OF measure is [1/(1 + (log N)²), 1/(1 + (log 2)²)], with the minimum attained when X_k and Y_k occur only once in the data set, and the maximum attained when X_k and Y_k each occur N/2 times.

4. Burnaby. Burnaby [5] proposed a similarity measure using arguments from information theory. He argues that the set of observed values is like a group of signals conveying information and, as in information theory, attribute values that are rarely observed should be considered more informative. In

[5], Burnaby proposed information-weighted measures for binary, ordinal, categorical and continuous data. The measure we present in Table 2 is adapted from Burnaby's categorical measure. This measure assigns low similarity to mismatches on rare values and high similarity to mismatches on frequent values. The range of S_k(X_k, Y_k) for mismatches in the Burnaby measure is

    [ N log((N − 1)/N) / (N log((N − 1)/N) − log(N − 1)), 1 ]

with the minimum attained when all values for attribute A_k occur only once, and the maximum attained when X_k and Y_k each occur N/2 times.

4.3 Measures that fill both Diagonal and Off-diagonal Entries.

1. Lin. In [3], Lin describes an information-theoretic framework for similarity, where he argues that when similarity is thought of in terms of assumptions about the space, the similarity measure naturally follows from the assumptions. Lin [3] discusses the ordinal, string, word and semantic similarity settings; we applied his framework to the categorical setting to derive the Lin measure in Table 2. The Lin measure gives higher weight to matches on frequent values, and lower weight to mismatches on infrequent values. The range of S_k(X_k, Y_k) for a match in the Lin measure is [2 log(1/N), 0], with the minimum attained when X_k occurs only once and the maximum attained when X_k occurs N times. The range of S_k(X_k, Y_k) for a mismatch in the Lin measure is [2 log(2/N), 0], with the minimum attained when X_k and Y_k each occur only once, and the maximum attained when X_k and Y_k each occur N/2 times.

2. Lin1. The Lin1 measure is another measure we have derived using Lin's similarity framework.
This measure gives lower weight to mismatches if either of the mismatching values is very frequent, or if there are several values that have frequency in between those of the mismatching values; higher weight is given when there are mismatches on infrequent values and there are few other infrequent values. For matches, lower weight is given to matches on frequent values or matches on values that have many other values of the same frequency; higher weight is given to matches on rare values. The range of S_k(X_k, Y_k) for matches in the Lin1 measure is [log(2/N), 0], with the minimum attained when X_k occurs twice in the data set and no other value of attribute A_k occurs twice, and the maximum attained when X_k occurs N times. The range of S_k(X_k, Y_k) for mismatches in the Lin1 measure is [2 log(2/N), 0], with the minimum attained when X_k and Y_k both occur only once and all other values of attribute A_k occur more than once, and the maximum attained when X_k is the most frequent value and Y_k is the least frequent value, or vice versa.

3. Smirnov. Smirnov [3] proposed a measure rooted in probability theory that not only considers a given value's frequency, but also takes into account the distribution of the other values taken by the same attribute. The Smirnov measure is probabilistic for both matches and mismatches. For a match, the similarity is high when the frequency of the matching value is low and the other values occur frequently. The range of S_k(X_k, Y_k) for a match in the Smirnov measure is [2, 2N], with the minimum attained when X_k occurs N times; the maximum is attained when X_k occurs only once and there is only one other possible value for attribute A_k, which occurs N − 1 times.
The range of S_k(X_k, Y_k) for a mismatch in the Smirnov measure is [0, (N − 2)/2], with the minimum attained when the attribute A_k takes only two values, X_k and Y_k; the maximum is attained when A_k takes only one more value apart from X_k and Y_k, and that value occurs N − 2 times (X_k and Y_k occur once each).

4. Anderberg. In his book on cluster analysis [], Anderberg presents an approach for handling similarity between categorical attributes. He argues that rare matches indicate a strong association and should be given a very high weight, and that mismatches on rare values should be treated as distinctive and should also be given special importance. In accordance with these arguments, the Anderberg measure assigns higher similarity to rare matches, and lower similarity to rare mismatches. The Anderberg measure is unique in the sense that it cannot be written in the form of Equation 4.1. The range of the Anderberg measure is [0, 1]; the minimum is attained when there are no matches, and the maximum is attained when all attributes match.

4.4 Further classification of similarity measures. We can further classify categorical similarity measures based on the arguments used to propose them:

Probabilistic approaches take into account the probability of a given match taking place. The following measures are probabilistic: Goodall, Smirnov, Anderberg.

Information-theoretic approaches incorporate the information content of a particular value/variable with respect to the data set. The following measures are information-theoretic: Lin, Lin1, Burnaby.

Table 3 provides a characterization of each of the 14 similarity measures in terms of how they handle the various characteristics of categorical data. This table shows that the Eskin and Anderberg measures assign weight to every attribute using the quantity n_k, though in opposite ways. Another interesting observation from column 3 is that several measures (Lin, Lin1, Goodall1, Goodall3, Smirnov, Anderberg) assign higher similarity to a match when the attribute value is rare (f_k is low), while Goodall2 and Goodall4 assign higher similarity to a match when the attribute value is frequent (f_k is high). Only Gambaryan assigns the maximum similarity when the attribute value has a frequency close to N/2. Column 4 shows that IOF, Lin, Lin1, Smirnov and Burnaby assign greater similarity when the mismatch occurs between rare values, while OF and Anderberg assign greater similarity for a mismatch between frequent values.

5 Experimental Evaluation

In this section we present an experimental evaluation of the 14 measures (listed in Table 2) on 18 different data sets in the context of outlier detection. Of these data sets, 16 are based on data sets available at the UCI Machine Learning Repository [3], and two are based on network data generated by the SKAION Corporation for the ARDA information assurance program [30]. The details of the 18 data sets are summarized in Table 4. Eleven of these data sets were purely categorical, five (KD1, KD2, Sk1, Sk2, Cen) had a mix of continuous and categorical attributes, and two data sets, Irs and Sgm, were purely continuous. Continuous variables were discretized using the MDL method [0]. The KD1 and KD2 data sets were obtained from the KDDCup data set by discretizing the continuous attributes into 10 and 100 bins respectively.
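Discretization turns each continuous attribute into a categorical one. The paper uses the MDL method; the equal-width binning below is our simple stand-in, included only to illustrate the bins-to-categories idea behind the 10- and 100-bin variants.

```python
def equal_width_bins(values, n_bins):
    """Map continuous values to categorical bin labels via equal-width
    binning (a simple stand-in; the paper discretizes with the MDL method)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0      # guard against constant attributes
    labels = []
    for v in values:
        b = min(int((v - lo) / width), n_bins - 1)   # clamp the max into the last bin
        labels.append(f"bin{b}")
    return labels

# One continuous attribute discretized into 4 categories:
cats = equal_width_bins([0.0, 0.25, 0.5, 0.75, 1.0], n_bins=4)
# → ["bin0", "bin1", "bin2", "bin3", "bin3"]
```

Once every attribute is categorical, any of the measures in Table 2 applies unchanged.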
Another possible way to handle a mixture of attributes is to compute the similarity for continuous and categorical attributes separately, and then perform a weighted aggregation. In this study we converted the continuous attributes to categorical ones to simplify the comparative evaluation. Each data set contains labeled instances belonging to multiple classes. We identified one class as the outlier class, and the rest of the classes were grouped together and called normal. For most of the data sets, the smallest class was selected as the outlier class. The only exceptions were (Cr1, Cr2) and (KD1, KD2), where the original data sets had two similar-sized classes. For each data set in the pair, we sampled 100 points from one of the two classes as the outlier points. In Section 3.1 we discussed a number of characteristics of categorical data sets; in Table 4 we describe the various public data sets in terms of these characteristics. The first row gives the size of each data set. The second row shows the percentage of outlier points in the original data set. The third row indicates the number of attributes in each data set. Rows 4 and 5 show the distribution of the number of values taken by each attribute; the difference between the average and the median is a measure of how skewed this distribution is. For example, data set Sk1 has a few attributes that take many values while most other attributes take few values. The next three rows show the distribution of the frequencies of the values taken by an attribute in the given data set (i.e., f_k(x)). This is done by showing the number of attributes that have a uniform, Gaussian and skewed distribution in rows 6, 7 and 8 respectively. The last two rows of Table 4 give the cross-validation classification recall and precision reported by the C4.5 classifier on the outlier class. This quantity indicates the separability between the instances belonging to the normal class(es) and the instances belonging to the outlier class, using the given set of attributes.
A low accuracy implies that distinguishing between outliers and normal instances is difficult in that particular data set using a decision tree-based classifier.

5.1 Evaluation Methodology. The performance of the different similarity measures was evaluated in the context of outlier detection using nearest neighbors [9, 34]. All instances belonging to the normal class(es) form the training set. We construct the test set by adding the outlier points to the training set. For each test instance, we find its k nearest neighbors in the training set using the given similarity measure (we chose the parameter k = 10). The outlier score is the distance to the k-th nearest neighbor. The test instances are then sorted in decreasing order of outlier scores. To evaluate a measure, we count the number of true outliers in the top p portion of the sorted test instances, where p = δn, 0 ≤ δ ≤ 1, and n is the number of actual outliers. Let o be the number of actual outliers in the top p predicted outliers. The accuracy of the algorithm is measured as o/p. In this paper we present results for δ = 1. We have also experimented with other, lower values of δ, and the trends in relative performance are similar. We have presented these additional results in our extended work, available as a technical report [6].
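The scoring and evaluation procedure just described can be sketched as follows. This is a minimal illustration, not the paper's code: the function names are ours, and the conversion from similarity to distance (a reciprocal here) is an assumption, since the paper does not specify the transform.

```python
# Sketch of the kNN-based outlier detection evaluation described above.
# `sim` is any categorical similarity measure over full instances; distance
# is taken as the reciprocal of similarity purely for illustration.

def knn_outlier_scores(train, test, sim, k=10):
    """Outlier score of each test instance = distance to its k-th
    nearest neighbor in the training (normal) set. Requires len(train) >= k."""
    scores = []
    for x in test:
        dists = sorted(1.0 / (sim(x, y) + 1e-9) for y in train)
        scores.append(dists[k - 1])
    return scores

def accuracy_at_delta(scores, labels, delta=1.0):
    """Fraction of true outliers among the top p = delta * n ranked
    instances, where n is the number of actual outliers (labels are 0/1)."""
    n = sum(labels)
    p = max(1, int(delta * n))
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    o = sum(lbl for _, lbl in ranked[:p])
    return o / p
```

Any of the per-attribute measures discussed in this paper can be plugged in as `sim`; the evaluation itself is independent of the measure, which is what allows the head-to-head comparison in Table 5.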

Measure     n_k      X_k = Y_k                     X_k != Y_k
Overlap     -        1                             0
Eskin       n_k^2    1                             0
IOF         -        1                             1/(log f_k(X_k) * log f_k(Y_k))
OF          -        1                             log f_k(X_k) * log f_k(Y_k)
Lin         -        1/log f_k(X_k)                1/log(f_k(X_k) + f_k(Y_k))
Lin1        -        1/log f_k(X_k)                1/log(f_k(X_k) f_k(Y_k))
Goodall1    -        1 - f_k^2(X_k)                0
Goodall2    -        f_k^2(X_k)                    0
Goodall3    -        1 - f_k^2(X_k)                0
Goodall4    -        f_k^2(X_k)                    0
Smirnov     -        1/f_k(X_k)                    1/(f_k(X_k) + f_k(Y_k))
Gambaryan   -        maximum at f_k(X_k) = N/2     0
Burnaby     -        1                             1/log f_k(X_k), 1/log f_k(Y_k)
Anderberg   1/n_k    1/f_k^2(X_k)                  f_k(X_k) f_k(Y_k)

Table 3: Relation between the per-attribute similarity S(X_k, Y_k) and {n_k, f_k(X_k), f_k(Y_k)}. A dash indicates no dependence; a 0 indicates no dependence on the frequencies.

5.2 Experimental Results on Public Data Sets. Our experimental results verified our initial hypotheses about categorical similarity measures. As can be seen from Table 5, there are many situations where the Overlap measure does not give good performance. This is consistent with our intuition that the use of additional information would lead to better performance. In particular, we expected that since categorical data does not have an inherent ordering, data-driven measures would be able to take advantage of information present in the data set to make more accurate determinations of similarity between a pair of data instances. We make some key observations about the results in Table 5:

1. No single measure is always superior or inferior. This is to be expected since each data set has different characteristics.

2. The use of some measures gives consistently better performance on a large variety of data. The Lin1, OF, and Goodall3 measures give among the best performance overall in terms of outlier detection performance. This is noteworthy since Lin1 and Goodall3 have been introduced for the first time in this paper.

3. There are some pairs of measures that exhibit complementary performance, i.e., one performs well where the other performs poorly and vice versa. Example complementary pairs are (OF, IOF), (Lin, Lin1) and (Goodall3, Goodall4).
This observation means that it may be possible to construct measures that draw on the strengths of two measures in order to obtain superior performance. This is an aspect of this work that needs to be pursued in future work.

4. The performance of an outlier detection algorithm is significantly affected by the similarity measure used (we refer the reader to our extended work [6] for a similar evaluation using a different outlier detection algorithm, LOF, which provides similar conclusions). For example, for the Cn data set, which has a very low classification accuracy for the outlier class, using OF still achieves close to 50% accuracy. We also note that for many of the data sets there is a relationship between decision tree performance (separability) and the performance of the measures. Specifically, for some of the data sets (e.g., Sk1, Tmr, Aud) with low separability there was high variance in the performance of the various measures.

5. The Eskin similarity measure weights attributes proportional to the number of values taken by the attribute (n_k). For data sets in which the attributes take a large number of values (e.g., KD2, Sk1, Sk2), Eskin performs very poorly.

6. The Smirnov measure assigns similarity to both diagonal and off-diagonal entries in the per-attribute similarity matrix (Figure 1). But it still performs very poorly on most of the data sets. The other measures that operate similarly (Lin, Lin1 and Anderberg) perform better than Smirnov in almost every data set.
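To make the frequency-dependence summarized in Table 3 concrete, the sketch below computes two per-attribute similarities from a data set's value counts: Overlap, and a Goodall3-style measure that rewards matches on rare values. It is an illustration only: the probability here is the plain squared sample frequency, a simplification of the estimate used in the paper, and all function names are ours.

```python
from collections import Counter

def attribute_frequencies(data, k):
    """Counts f_k(x) of each value of attribute k in the data set."""
    return Counter(row[k] for row in data)

def overlap_sim(x, y):
    """Overlap: per-attribute similarity is 1 on a match and 0 otherwise,
    averaged over attributes (no use of the data set at all)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def goodall3_sim(x, y, data):
    """Goodall3-style: a match on attribute k contributes 1 - p^2, where p
    is the sample frequency of the matched value, so matches on rare values
    count more; mismatches contribute 0."""
    n = len(data)
    total = 0.0
    for k, (a, b) in enumerate(zip(x, y)):
        if a == b:
            p = attribute_frequencies(data, k)[a] / n
            total += 1.0 - p * p
    return total / len(x)
```

On a data set where one value dominates an attribute, a match on a rare value of that attribute scores higher than a match on the dominant value, which is exactly the behavior attributed to Goodall3 in column 3 of Table 3; Overlap, by contrast, cannot tell the two matches apart.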

[Table 4: Description of Public Data Sets. Columns (data sets): Cr1, Cr2, Irs, Cn, KD1, KD2, Sk1, Sk2, Msh, Sgm, Cen, Can, Hys, Lym, Nur, Tmr, TTT, Aud. Rows: Size, % Outliers, avg(n_k), med(n_k), f_k uniform, f_k Gaussian, f_k skewed, Recall, Precision.]

[Table 5: Experimental Results for the kNN Algorithm for δ = 1.0. Rows (measures): Overlap, Eskin, IOF, OF, Lin, Lin1, Goodall1, Goodall2, Goodall3, Goodall4, Smirnov, Gambaryan, Burnaby, Anderberg, Avg. Columns: the 18 data sets of Table 4, plus Avg.]

6 Concluding Remarks and Future Work

Computing similarity between categorical attributes has been discussed in a variety of contexts. In this paper we have brought together several such measures and evaluated them in the context of outlier detection. We have also proposed several variants (Lin, Lin1, Goodall2, Goodall3, Goodall4) of existing similarity measures, some of which perform very well as shown in our evaluation. Given this set of similarity measures, the first question that comes to mind is: which similarity measure is best suited for my data mining task? Our experimental results suggest that there is no one best-performing similarity measure. Hence, one needs to understand how a similarity measure handles the different characteristics of a categorical data set, and this needs to be explored in future research. We used outlier detection as the underlying data mining task for the comparative evaluation in this work. However, similar studies can be performed using classification or clustering as the underlying task. It will be useful to know if the relative performance of these similarity measures remains the same for the other data mining tasks. In our evaluation methodology we have used one similarity measure across all attributes. Since different attributes in a data set can be of a different nature, an alternative is to use different measures for different attributes. This appears to be especially promising given the complementary nature of several similarity measures.

7 Acknowledgements

We are grateful to the anonymous reviewers for their comments and suggestions, which improved this paper. We would also like to thank György Simon for his helpful comments on an early draft of this paper. This work was supported by NSF Grant CNS-05555, NSF ITR Grant ACI, NSF Grant IIS, and NSF Grant IIS. Access to computing facilities was provided by the University of Minnesota Digital Technology Center and Supercomputing Institute.
References

[1] A. Ahmad and L. Dey. A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recogn. Lett., 28(1):110-118, 2007.
[2] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, New York, 1973.
[3] A. Asuncion and D. J. Newman. UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science, 2007.
[4] Y. Biberman. A context similarity measure. In ECML'94: Proceedings of the European Conference on Machine Learning. Springer, 1994.
[5] T. Burnaby. On a method for character weighting a similarity coefficient, employing the concept of information. Mathematical Geology, 2(1):25-38, 1970.
[6] V. Chandola, S. Boriah, and V. Kumar. Similarity measures for categorical data: a comparative study. Technical Report 07-022, Department of Computer Science & Engineering, University of Minnesota, October 2007.
[7] H. Cramér. The Elements of Probability Theory and Some of its Applications. John Wiley & Sons, New York, NY, 1946.
[8] G. Das and H. Mannila. Context-based similarity measures for categorical databases. In PKDD 2000: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pages 201-210, London, UK, 2000. Springer-Verlag.
[9] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection. In D. Barbará and S. Jajodia, editors, Applications of Data Mining in Computer Security. Kluwer Academic Publishers, Norwell, MA, 2002.
[10] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022-1029, San Francisco, CA, 1993. Morgan Kaufmann.
[11] P. Gambaryan. A mathematical model of taxonomy. Izvest. Akad. Nauk Armen. SSR, 17(12):47-53, 1964.
[12] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: clustering categorical data using summaries.
In KDD'99: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 73-83, New York, NY, USA, 1999. ACM Press.

[13] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: an approach based on dynamical systems. The VLDB Journal, 8(3):222-236, 2000.
[14] D. W. Goodall. A new similarity index based on probability. Biometrics, 22(4):882-907, 1966.
[15] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5):345-366, 2000.
[16] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, New York, NY, 1975.
[17] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3):283-304, 1998.
[18] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ, 1988.
[19] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. In Document Retrieval Systems, volume 3 of Taylor Graham Series in Foundations of Information Science, pages 132-142. Taylor Graham Publishing, London, UK, 1988.
[20] W. P. Jones and G. W. Furnas. Pictures of relevance: a geometric analysis of similarity measures. J. Am. Soc. Inf. Sci., 38(6):420-442, 1987.
[21] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York, NY, 1990.
[22] S. Q. Le and T. B. Ho. An association-based dissimilarity measure for categorical data. Pattern Recogn. Lett., 26(16):2549-2557, 2005.
[23] D. Lin. An information-theoretic definition of similarity. In ICML'98: Proceedings of the 15th International Conference on Machine Learning, pages 296-304, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
[24] K. Maung. Measurement of association in a contingency table with special reference to the pigmentation of hair and eye colours of Scottish school children. Annals of Eugenics, 11:189-223, 1941.
[25] T. Noreault, M. McGill, and M. B. Koll.
A performance evaluation of similarity measures, document term weighting schemes and representations in a boolean environment. In SIGIR'80: Proceedings of the 3rd Annual ACM Conference on Research and Development in Information Retrieval, pages 57-76, Kent, UK, 1981. Butterworth & Co.
[26] C. R. Palmer and C. Faloutsos. Electricity based external similarity of categorical attributes. In PAKDD'03: Proceedings of the 7th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. Springer, 2003.
[27] C. P. Pappis and N. I. Karacapilidis. A comparative assessment of measures of similarity of fuzzy values. Fuzzy Sets and Systems, 56(2):171-174, 1993.
[28] K. Pearson. On the general theory of multiple contingency with special reference to partial contingency. Biometrika, 11(3):145-158, 1916.
[29] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In SIGMOD 2000: Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 427-438. ACM Press, 2000.
[30] SKAION Corporation. SKAION intrusion detection system evaluation data.
[31] E. S. Smirnov. On exact methods in systematics. Systematic Zoology, 17(1):1-13, 1968.
[32] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman and Company, San Francisco, 1973.
[33] C. Stanfill and D. Waltz. Toward memory-based reasoning. Commun. ACM, 29(12):1213-1228, 1986.
[34] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, Boston, MA, 2005.
[35] X. Wang, B. De Baets, and E. Kerre. A comparative study of similarity measures. Fuzzy Sets and Systems, 73(2):259-268, 1995.
[36] D. R. Wilson and T. R. Martinez. Improved heterogeneous distance functions. J. Artif. Intell. Res. (JAIR), 6:1-34, 1997.
[37] R. Zwick, E. Carlstein, and D. V. Budescu. Measures of similarity among fuzzy concepts: A comparative analysis. International Journal of Approximate Reasoning, 1(2):221-242, 1987.


More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

The total derivative. Chapter Lagrangian and Eulerian approaches

The total derivative. Chapter Lagrangian and Eulerian approaches Chapter 5 The total erivative 51 Lagrangian an Eulerian approaches The representation of a flui through scalar or vector fiels means that each physical quantity uner consieration is escribe as a function

More information

A New Family of Near-metrics for Universal Similarity

A New Family of Near-metrics for Universal Similarity arxiv:1707.06903v3 [stat.ml] 17 Oct 2017 A New Family of Near-metrics for Universal Similarity Chu Wang Iraj Saniee William S. Kenney Chris A. White October 18, 2017 Abstract We propose a family of near-metrics

More information

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,

More information

Robustness and Perturbations of Minimal Bases

Robustness and Perturbations of Minimal Bases Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important

More information

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control 19 Eigenvalues, Eigenvectors, Orinary Differential Equations, an Control This section introuces eigenvalues an eigenvectors of a matrix, an iscusses the role of the eigenvalues in etermining the behavior

More information

A Random Graph Model for Massive Graphs

A Random Graph Model for Massive Graphs A Ranom Graph Moel for Massive Graphs William Aiello AT&T Labs Florham Park, New Jersey aiello@research.att.com Fan Chung University of California, San Diego fan@ucs.eu Linyuan Lu University of Pennsylvania,

More information

arxiv: v1 [hep-lat] 19 Nov 2013

arxiv: v1 [hep-lat] 19 Nov 2013 HU-EP-13/69 SFB/CPP-13-98 DESY 13-225 Applicability of Quasi-Monte Carlo for lattice systems arxiv:1311.4726v1 [hep-lat] 19 ov 2013, a,b Tobias Hartung, c Karl Jansen, b Hernan Leovey, Anreas Griewank

More information

WESD - Weighted Spectral Distance for Measuring Shape Dissimilarity

WESD - Weighted Spectral Distance for Measuring Shape Dissimilarity 1 WESD - Weighte Spectral Distance for Measuring Shape Dissimilarity Ener Konukoglu, Ben Glocker, Antonio Criminisi an Kilian M. Pohl Abstract This article presents a new istance for measuring shape issimilarity

More information

The Press-Schechter mass function

The Press-Schechter mass function The Press-Schechter mass function To state the obvious: It is important to relate our theories to what we can observe. We have looke at linear perturbation theory, an we have consiere a simple moel for

More information

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1 Lecture 5 Some ifferentiation rules Trigonometric functions (Relevant section from Stewart, Seventh Eition: Section 3.3) You all know that sin = cos cos = sin. () But have you ever seen a erivation of

More information

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information

More information

Capacity Analysis of MIMO Systems with Unknown Channel State Information

Capacity Analysis of MIMO Systems with Unknown Channel State Information Capacity Analysis of MIMO Systems with Unknown Channel State Information Jun Zheng an Bhaskar D. Rao Dept. of Electrical an Computer Engineering University of California at San Diego e-mail: juzheng@ucs.eu,

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y Ph195a lecture notes, 1/3/01 Density operators for spin- 1 ensembles So far in our iscussion of spin- 1 systems, we have restricte our attention to the case of pure states an Hamiltonian evolution. Toay

More information

How to Minimize Maximum Regret in Repeated Decision-Making

How to Minimize Maximum Regret in Repeated Decision-Making How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:

More information

Computing the Longest Common Subsequence of Multiple RLE Strings

Computing the Longest Common Subsequence of Multiple RLE Strings The 29th Workshop on Combinatorial Mathematics an Computation Theory Computing the Longest Common Subsequence of Multiple RLE Strings Ling-Chih Yao an Kuan-Yu Chen Grauate Institute of Networking an Multimeia

More information

Linear and quadratic approximation

Linear and quadratic approximation Linear an quaratic approximation November 11, 2013 Definition: Suppose f is a function that is ifferentiable on an interval I containing the point a. The linear approximation to f at a is the linear function

More information

Non-deterministic Social Laws

Non-deterministic Social Laws Non-eterministic Social Laws Michael H. Coen MIT Artificial Intelligence Lab 55 Technology Square Cambrige, MA 09 mhcoen@ai.mit.eu Abstract The paper generalizes the notion of a social law, the founation

More information

A new proof of the sharpness of the phase transition for Bernoulli percolation on Z d

A new proof of the sharpness of the phase transition for Bernoulli percolation on Z d A new proof of the sharpness of the phase transition for Bernoulli percolation on Z Hugo Duminil-Copin an Vincent Tassion October 8, 205 Abstract We provie a new proof of the sharpness of the phase transition

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Delocalization of boundary states in disordered topological insulators

Delocalization of boundary states in disordered topological insulators Journal of Physics A: Mathematical an Theoretical J. Phys. A: Math. Theor. 48 (05) FT0 (pp) oi:0.088/75-83/48//ft0 Fast Track Communication Delocalization of bounary states in isorere topological insulators

More information

A New Minimum Description Length

A New Minimum Description Length A New Minimum Description Length Soosan Beheshti, Munther A. Dahleh Laboratory for Information an Decision Systems Massachusetts Institute of Technology soosan@mit.eu,ahleh@lis.mit.eu Abstract The minimum

More information

Inverse Theory Course: LTU Kiruna. Day 1

Inverse Theory Course: LTU Kiruna. Day 1 Inverse Theory Course: LTU Kiruna. Day Hugh Pumphrey March 6, 0 Preamble These are the notes for the course Inverse Theory to be taught at LuleåTekniska Universitet, Kiruna in February 00. They are not

More information

DEGREE DISTRIBUTION OF SHORTEST PATH TREES AND BIAS OF NETWORK SAMPLING ALGORITHMS

DEGREE DISTRIBUTION OF SHORTEST PATH TREES AND BIAS OF NETWORK SAMPLING ALGORITHMS DEGREE DISTRIBUTION OF SHORTEST PATH TREES AND BIAS OF NETWORK SAMPLING ALGORITHMS SHANKAR BHAMIDI 1, JESSE GOODMAN 2, REMCO VAN DER HOFSTAD 3, AND JÚLIA KOMJÁTHY3 Abstract. In this article, we explicitly

More information

A NONLINEAR SOURCE SEPARATION APPROACH FOR THE NICOLSKY-EISENMAN MODEL

A NONLINEAR SOURCE SEPARATION APPROACH FOR THE NICOLSKY-EISENMAN MODEL 6th European Signal Processing Conference EUSIPCO 28, Lausanne, Switzerlan, August 25-29, 28, copyright by EURASIP A NONLINEAR SOURCE SEPARATION APPROACH FOR THE NICOLSKY-EISENMAN MODEL Leonaro Tomazeli

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

Bohr Model of the Hydrogen Atom

Bohr Model of the Hydrogen Atom Class 2 page 1 Bohr Moel of the Hyrogen Atom The Bohr Moel of the hyrogen atom assumes that the atom consists of one electron orbiting a positively charge nucleus. Although it oes NOT o a goo job of escribing

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

An Analytical Expression of the Probability of Error for Relaying with Decode-and-forward

An Analytical Expression of the Probability of Error for Relaying with Decode-and-forward An Analytical Expression of the Probability of Error for Relaying with Decoe-an-forwar Alexanre Graell i Amat an Ingmar Lan Department of Electronics, Institut TELECOM-TELECOM Bretagne, Brest, France Email:

More information

New Statistical Test for Quality Control in High Dimension Data Set

New Statistical Test for Quality Control in High Dimension Data Set International Journal of Applie Engineering Research ISSN 973-456 Volume, Number 6 (7) pp. 64-649 New Statistical Test for Quality Control in High Dimension Data Set Shamshuritawati Sharif, Suzilah Ismail

More information

A simple model for the small-strain behaviour of soils

A simple model for the small-strain behaviour of soils A simple moel for the small-strain behaviour of soils José Jorge Naer Department of Structural an Geotechnical ngineering, Polytechnic School, University of São Paulo 05508-900, São Paulo, Brazil, e-mail:

More information

Lecture 2: Correlated Topic Model

Lecture 2: Correlated Topic Model Probabilistic Moels for Unsupervise Learning Spring 203 Lecture 2: Correlate Topic Moel Inference for Correlate Topic Moel Yuan Yuan First of all, let us make some claims about the parameters an variables

More information

Improving Estimation Accuracy in Nonrandomized Response Questioning Methods by Multiple Answers

Improving Estimation Accuracy in Nonrandomized Response Questioning Methods by Multiple Answers International Journal of Statistics an Probability; Vol 6, No 5; September 207 ISSN 927-7032 E-ISSN 927-7040 Publishe by Canaian Center of Science an Eucation Improving Estimation Accuracy in Nonranomize

More information

TCP throughput and timeout steady state and time-varying dynamics

TCP throughput and timeout steady state and time-varying dynamics TCP throughput an timeout steay state an time-varying ynamics Stephan Bohacek an Khushboo Shah Dept. of Electrical an Computer Engineering Dept. of Electrical Engineering University of Delaware University

More information

Math 342 Partial Differential Equations «Viktor Grigoryan

Math 342 Partial Differential Equations «Viktor Grigoryan Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite

More information

Cascaded redundancy reduction

Cascaded redundancy reduction Network: Comput. Neural Syst. 9 (1998) 73 84. Printe in the UK PII: S0954-898X(98)88342-5 Cascae reunancy reuction Virginia R e Sa an Geoffrey E Hinton Department of Computer Science, University of Toronto,

More information

Regular tree languages definable in FO and in FO mod

Regular tree languages definable in FO and in FO mod Regular tree languages efinable in FO an in FO mo Michael Beneikt Luc Segoufin Abstract We consier regular languages of labele trees. We give an effective characterization of the regular languages over

More information

TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS

TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS MISN-0-4 TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS f(x ± ) = f(x) ± f ' (x) + f '' (x) 2 ±... 1! 2! = 1.000 ± 0.100 + 0.005 ±... TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS by Peter Signell 1.

More information

Separation of Variables

Separation of Variables Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical

More information