Similarity Measures for Categorical Data: A Comparative Study. Technical Report


Similarity Measures for Categorical Data: A Comparative Study. Technical Report. Department of Computer Science and Engineering, University of Minnesota, 4-192 EECS Building, 200 Union Street SE, Minneapolis, MN, USA. TR. Similarity Measures for Categorical Data: A Comparative Study. Varun Chandola, Shyam Boriah, and Vipin Kumar. October 5, 2007


Similarity Measures for Categorical Data: A Comparative Study. Varun Chandola, Shyam Boriah, and Vipin Kumar. Department of Computer Science & Engineering, University of Minnesota.

Abstract. Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances, but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates the others for all types of problems, some measures deliver consistently high performance.

1 Introduction. Measuring similarity or distance between two data points is a core requirement for several data mining and knowledge discovery tasks that involve distance computation. Examples include clustering (k-means), distance-based outlier detection, classification (kNN, SVM), and several other data mining tasks. These algorithms typically treat the similarity computation as an orthogonal step and can make use of any measure. For continuous data sets, the Minkowski distance is a general method to compute the distance between two multivariate points. In particular, the Minkowski distances of order 1 (Manhattan) and order 2 (Euclidean) are the two most widely used distance measures for continuous data. The key observation about the above measures is that they are independent of the underlying data set to which the two points belong. Several data-driven measures, such as the Mahalanobis distance, have also been explored for continuous data. The notion of similarity or distance for categorical data is not as straightforward as for continuous data.
The key characteristic of categorical data is that the different values a categorical attribute takes are not inherently ordered. Thus it is not possible to directly compare two different categorical values. The simplest way to find the similarity between two categorical attributes is to assign a similarity of 1 if the values are identical and a similarity of 0 if the values are not identical. For two multivariate categorical data points, the similarity between them will be directly proportional to the number of attributes in which they match. This simple measure is also known as the overlap measure in the literature [29]. One obvious drawback of the overlap measure is that it does not distinguish between the different values taken by an attribute. All matches, as well as all mismatches, are treated as equal. For example, consider a categorical data set D defined over two attributes: color and shape. Let color take 3 possible values in D: {red, blue, green}, and shape take 3 possible values in D: {square, circle, triangle}. Table 1 summarizes the frequency of occurrence of each possible combination in D.

Table 1: Frequency Distribution of a Simple 2-D Categorical Data Set

The overlap similarity between the two instances (green, square) and (green, circle) is the same as that between (blue, square) and (blue, circle), since each pair matches on exactly one attribute. But the frequency distribution in Table 1 shows that while (blue, square) and (blue, circle) are frequent combinations, (green, square) and (green, circle) are very rare combinations in the data set. Thus, it would appear that the overlap measure is too simplistic in giving equal importance to all matches and mismatches. Although there is no inherent ordering in categorical data, the previous example shows that there is other information in categorical data sets that can be used to define what should be considered more similar and what should be considered less similar.
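To make this drawback concrete, here is a minimal sketch of the overlap measure in Python (the function name and example values are ours, not from the report):

```python
def overlap_similarity(x, y):
    """Overlap measure: the fraction of attributes on which the two
    instances take identical values (1 for a match, 0 for a mismatch)."""
    if len(x) != len(y):
        raise ValueError("instances must have the same number of attributes")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

# Both pairs below match on exactly one of two attributes (color), so
# the overlap measure cannot tell a frequent combination from a rare one.
s_rare = overlap_similarity(("green", "square"), ("green", "circle"))
s_frequent = overlap_similarity(("blue", "square"), ("blue", "circle"))
```

Both calls return the same value, which is precisely the insensitivity to value frequencies that motivates the data-driven measures discussed next.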

This observation has motivated researchers to come up with data-driven similarity measures for categorical attributes. Such measures take into account the frequency distribution of different attribute values in a given data set to define the similarity between two categorical attribute values. In this paper, we study a variety of similarity measures proposed in diverse research fields ranging from statistics to ecology, as well as many of their variations. Each measure uses the information present in the data in its own way to define similarity. Since we are evaluating data-driven similarity measures, their performance is closely tied to the data set being analyzed. To understand this relationship, we first identify the key characteristics of a categorical data set. For each of the different similarity measures that we study, we analyze how it relates to the different characteristics of the data set.

1.1 Key Contributions. The key contributions of this paper are as follows: We have brought together several categorical measures from different fields and studied them together in a single context. We evaluate 14 different data-driven similarity measures for categorical data on a wide variety of benchmark data sets. In particular, we show the utility of data-driven measures for the problem of determining similarity with categorical data. We have also proposed a number of new measures that are either variants of other previously proposed measures or derived from previously proposed similarity frameworks. The performance of some of the measures we propose is among the best of all the measures we studied. We identify the key characteristics of a categorical data set and analyze each similarity measure in relation to those characteristics.

1.2 Organization of the Paper. The rest of the paper is organized as follows. We first discuss related efforts in the study of similarity measures in Section 2.
In Section 3, we identify various characteristics of categorical data that are relevant to this study. We then introduce the 14 different similarity measures studied in this paper in Section 4. We describe our experimental setup, evaluation methodology, and the results on public data sets in Section 6.

2 Related Work. Sneath and Sokal discuss categorical similarity measures in some detail in their book [28] on numerical taxonomy. They were among the first to put together and discuss many of the measures covered in their book. At the time, the two major concerns were (1) biological relevance, since numerical taxonomy was mainly concerned with taxonomies from biology, ecology, etc., and (2) computational efficiency, since computational resources were limited and scarce. Nevertheless, many of the observations made by Sneath and Sokal are quite relevant today and offer key insights into many of the measures. There are several books [2, 9, 6, 20] on cluster analysis that discuss the problem of determining similarity between categorical attributes. However, most of these books do not offer solutions to the problem or discuss the measures in this paper, and the usual recommendation is to binarize the data and then use binary similarity measures. Wilson and Martinez [3] performed a detailed study of heterogeneous distance functions (for data with categorical and continuous attributes) for instance-based learning. The measures in that study are based upon a supervised approach where each data instance has class information in addition to a set of categorical/continuous attributes. The measures discussed in this paper are orthogonal to [3], since supervised measures determine similarity based on class information, while data-driven measures determine similarity based on the data distribution. In principle, both ideas can be combined. A number of new data mining techniques for categorical data have been proposed recently.
Some of them use notions of similarity which are neighborhood-based [5, 4, 8, 24, 1, 2], or incorporate the similarity computation into the learning algorithm [3, 8, 2]. Neighborhood-based approaches use some notion of similarity (usually the overlap measure) to define the neighborhood of a data instance, while the measures we study in this paper are used directly to determine the similarity between a pair of data instances; hence, we see the measures discussed in this paper as being useful for computing the neighborhood of a point, and neighborhood-based measures as meta-similarity measures. Since techniques which embed similarity measures into the learning algorithm do not explicitly define general categorical similarity measures, we do not discuss them in this paper.

3 Categorical Data. Categorical data (also known as nominal or qualitative multi-state data) has been studied for a long time in various contexts. As mentioned earlier, computing

similarity between categorical data instances is not straightforward, owing to the fact that there is no explicit notion of ordering between categorical values. To overcome this problem, several data-driven similarity measures have been proposed for categorical data. The behavior of such measures directly depends on the data. In this section we identify the key characteristics of a categorical data set that can potentially affect the behavior of a data-driven similarity measure. For notation, consider a categorical data set D containing N objects, defined over a set of d categorical attributes, where A_k denotes the k-th attribute. Let the attribute A_k take n_k values in the given data set, denoted by the set A_k. We also use the following notation:

f_k(x): the number of times attribute A_k takes the value x in the data set D. Note that if x is not in A_k, f_k(x) = 0.

ˆp_k(x): the sample probability of attribute A_k taking the value x in the data set D, given by ˆp_k(x) = f_k(x) / N.

p2_k(x): another probability estimate of attribute A_k taking the value x in the given data set, given by p2_k(x) = f_k(x)(f_k(x) − 1) / (N(N − 1)).

3.1 Characteristics of a Categorical Data Set. Since this paper discusses data-driven similarity measures for categorical data, a key task is to identify the characteristics of a categorical data set that affect the behavior of such a similarity measure. We enumerate the characteristics of a categorical data set below:

Size of data, N. As we will see later, most measures are typically invariant of the size of the data, but some (e.g., Smirnov) do incorporate it.

Number of attributes, d. Most measures are invariant of this characteristic, since they typically normalize the similarity over the number of attributes. But in our experimental results we observe that the number of attributes does affect the performance of the outlier detection algorithms.

Number of values taken by each attribute, n_k.
A data set might contain attributes that take several values and attributes that take very few values. For example, one attribute might take several hundred possible values, while another attribute might take very few values. A similarity measure might give more importance to the second attribute while ignoring the first. In fact, one of the measures discussed in this paper (Eskin) behaves exactly like this.

Distribution of f_k(x). This refers to the distribution of the frequencies of the values taken by an attribute in the given data set. In certain data sets an attribute might be distributed uniformly over the set A_k, while in others the distribution might be skewed. A similarity measure might give more importance to attribute values that occur rarely, while another similarity measure might give more importance to frequently occurring attribute values.

4 Similarity Measures for Categorical Data. The study of similarity between data objects with categorical variables has had a long history. Pearson proposed a chi-square statistic in the late 1800s, which is often used to test independence between categorical variables in a contingency table. Pearson's chi-square statistic was later modified and extended, leading to several other measures [25, 23, 7]. More recently, however, the overlap measure has become the most commonly used similarity measure for categorical data. Its popularity is perhaps related to its simplicity and ease of use. In this section, we will discuss the overlap measure and several data-driven similarity measures for categorical data. Note that we have converted measures that were originally proposed as distances into similarity measures in order to make the measures comparable in this study. The measures discussed henceforth will all be in the context of similarity, with distance measures being converted using the formula sim = 1 / (1 + dist). Any similarity measure assigns a similarity between two data instances X and Y belonging to the data set D (introduced in Section 3) as follows: (4.1)
S(X, Y) = Σ_{k=1}^{d} w_k S_k(X_k, Y_k), where S_k(X_k, Y_k) is the per-attribute similarity between two values for the categorical attribute A_k. Note that X_k, Y_k ∈ A_k. The quantity w_k denotes the weight assigned to the attribute A_k. To understand how different measures calculate the per-attribute similarity S_k(X_k, Y_k), consider a categorical attribute A which takes one of the values {a, b, c, d}. We have dropped the subscript k for simplicity. The per-attribute similarity computation is equivalent to constructing the (symmetric) matrix shown in Figure 1. Essentially, in determining the similarity between two values, any categorical measure is filling the entries

Figure 1: Similarity Matrix for a Single Categorical Attribute

      a        b        c        d
a  S(a, a)  S(a, b)  S(a, c)  S(a, d)
b           S(b, b)  S(b, c)  S(b, d)
c                    S(c, c)  S(c, d)
d                             S(d, d)

of this matrix. For example, the overlap measure sets the diagonal entries to 1 and the off-diagonal entries to 0, i.e., the similarity is 1 if the values match and 0 if the values mismatch. Additionally, measures may use the following information in computing a similarity value (all the measures in this paper use only this information): f(a), f(b), f(c), f(d), the frequencies of the values in the data set; N, the size of the data set; and n, the number of values taken by the attribute (4 in the case above). We can classify measures in several ways, based on: (i) the manner in which they fill the entries of the similarity matrix; (ii) whether the weights given to matches and mismatches are a function of the frequency of the attribute values; (iii) the arguments used to propose the measure (probabilistic, information-theoretic, etc.). In this paper, we will describe the measures by classifying them as follows:

Those that fill the diagonal entries only. These measures set the off-diagonal entries to 0 (mismatches are uniformly given the minimum value) and give possibly different weights to matches.

Those that fill the off-diagonal entries only. These measures set the diagonal entries to 1 (matches are uniformly given the maximum value) and give possibly different weights to mismatches.

Those that fill both diagonal and off-diagonal entries. These measures give different weights to both matches and mismatches.

Table 2 gives the mathematical formulas for the measures we will be describing in this paper. The various techniques described in Table 2 compute the per-attribute similarity S_k(X_k, Y_k) as shown in column 2, and compute the attribute weight w_k as shown in column 3.

4.1 Measures that fill Diagonal Entries only.

1. Overlap. The overlap measure simply counts the number of attributes that match in the two data instances.
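The decomposition in Equation (4.1) can be sketched as a small framework in which each measure supplies its own per-attribute function (a sketch with our own function names, not the authors' code):

```python
def similarity(x, y, per_attribute, weights=None):
    """Equation (4.1): S(X, Y) = sum over k of w_k * S_k(X_k, Y_k).
    `per_attribute` computes S_k from (k, X_k, Y_k); `weights` defaults
    to the uniform choice w_k = 1/d used by many of the measures."""
    d = len(x)
    w = weights if weights is not None else [1.0 / d] * d
    return sum(w[k] * per_attribute(k, x[k], y[k]) for k in range(d))

def overlap(k, a, b):
    """The overlap measure fills only the diagonal of the per-attribute
    similarity matrix: S_k(a, a) = 1 and S_k(a, b) = 0 for a != b."""
    return 1.0 if a == b else 0.0
```

Swapping in a different `per_attribute` function (and, for measures such as Lin, a data-dependent `weights` vector) yields the other measures of this section.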
The range of the per-attribute similarity for the overlap measure is [0, 1], with a value of 0 occurring when there is no match and a value of 1 occurring when the attribute values match.

2. Goodall. Goodall [4] proposed a measure that attempts to normalize the similarity between two objects by the probability that the observed similarity value could be observed in a random sample of two points. This measure assigns higher similarity to a match if the value is infrequent than if the value is frequent. Goodall's original measure details a procedure for combining the similarities in the multivariate setting which takes into account dependencies between attributes. Since this procedure is computationally expensive, we use a simpler version of the measure (described next as Goodall1). Goodall's original measure is not empirically evaluated in this paper. We also propose three variants of Goodall's measure in this paper: Goodall2, Goodall3 and Goodall4.

3. Goodall1. The Goodall1 measure is the same as Goodall's measure on a per-attribute basis. However, instead of combining the similarities by taking into account dependencies between attributes, the Goodall1 measure takes the average of the per-attribute similarities. The range of S_k(X_k, Y_k) for matches in the Goodall1 measure is [0, 1], with the minimum attained when X_k is the most frequent value for attribute A_k, and the maximum attained when attribute A_k takes N values in the data set (every value occurs only once).

4. Goodall2. The Goodall2 measure is a variant of Goodall's measure proposed by us. This measure assigns higher similarity if the matching values are infrequent and, at the same time, there are other values that are even less frequent; i.e., the similarity is higher if there are many values with approximately equal frequencies, and lower if the frequency distribution is skewed.
The range of S_k(X_k, Y_k) for matches in the Goodall2 measure is [0, 1 − 2/(N(N − 1))], with the minimum value attained if attribute A_k takes only one value, and the maximum value attained when X_k is the least frequent value for attribute A_k.

5. Goodall3. We also propose another variant of Goodall's measure called Goodall3. The Goodall3 measure assigns a high similarity if the matching

Table 2: Similarity Measures for Categorical Attributes. Each measure is given by its per-attribute similarity S_k(X_k, Y_k) and its attribute weight w_k (k = 1, ..., d), with S(X, Y) = Σ_{k=1}^{d} w_k S_k(X_k, Y_k).

1. Overlap. S_k = 1 if X_k = Y_k, 0 otherwise. w_k = 1/d.

2. Eskin. S_k = 1 if X_k = Y_k, n_k² / (n_k² + 2) otherwise. w_k = 1/d.

3. IOF. S_k = 1 if X_k = Y_k, 1 / (1 + log f_k(X_k) · log f_k(Y_k)) otherwise. w_k = 1/d.

4. OF. S_k = 1 if X_k = Y_k, 1 / (1 + log(N / f_k(X_k)) · log(N / f_k(Y_k))) otherwise. w_k = 1/d.

5. Lin. S_k = 2 log ˆp_k(X_k) if X_k = Y_k, 2 log(ˆp_k(X_k) + ˆp_k(Y_k)) otherwise. w_k = 1 / Σ_{i=1}^{d} (log ˆp_i(X_i) + log ˆp_i(Y_i)).

6. Lin1. S_k = Σ_{q ∈ Q} log ˆp_k(q) if X_k = Y_k, 2 log Σ_{q ∈ Q} ˆp_k(q) otherwise. w_k = 1 / Σ_{i=1}^{d} Σ_{q ∈ Q} log ˆp_i(q).

7. Goodall1. S_k = 1 − Σ_{q ∈ Q} p2_k(q) if X_k = Y_k, 0 otherwise. w_k = 1/d.

8. Goodall2. S_k = 1 − Σ_{q ∈ Q} p2_k(q) if X_k = Y_k, 0 otherwise. w_k = 1/d.

9. Goodall3. S_k = 1 − p2_k(X_k) if X_k = Y_k, 0 otherwise. w_k = 1/d.

10. Goodall4. S_k = p2_k(X_k) if X_k = Y_k, 0 otherwise. w_k = 1/d.

11. Smirnov. S_k = 2 + (N − f_k(X_k)) / f_k(X_k) + Σ_{q ∈ A_k \ {X_k}} f_k(q) / (N − f_k(q)) if X_k = Y_k; S_k = −2 + Σ_{q ∈ A_k \ {X_k, Y_k}} f_k(q) / (N − f_k(q)) otherwise. w_k = 1 / Σ_{k=1}^{d} n_k.

12. Gambaryan. S_k = −[ˆp_k(X_k) log₂ ˆp_k(X_k) + (1 − ˆp_k(X_k)) log₂(1 − ˆp_k(X_k))] if X_k = Y_k, 0 otherwise. w_k = 1 / Σ_{k=1}^{d} n_k.

13. Burnaby. S_k = 1 if X_k = Y_k; otherwise S_k = Σ_{q ∈ A_k} 2 log(1 − ˆp_k(q)) / [ log( ˆp_k(X_k) ˆp_k(Y_k) / ((1 − ˆp_k(X_k))(1 − ˆp_k(Y_k))) ) + Σ_{q ∈ A_k} 2 log(1 − ˆp_k(q)) ]. w_k = 1/d.

14. Anderberg. The Anderberg measure cannot be written as a per-attribute weighted sum; it is given directly by S(X, Y) = [ Σ_{k: X_k = Y_k} (1 / ˆp_k(X_k))² · 2 / (n_k(n_k + 1)) ] / [ Σ_{k: X_k = Y_k} (1 / ˆp_k(X_k))² · 2 / (n_k(n_k + 1)) + Σ_{k: X_k ≠ Y_k} (1 / (2 ˆp_k(X_k) ˆp_k(Y_k)))² · 2 / (n_k(n_k + 1)) ].

For measure Lin1, Q ⊆ A_k is the set of values q with ˆp_k(X_k) ≤ ˆp_k(q) ≤ ˆp_k(Y_k), assuming ˆp_k(X_k) ≤ ˆp_k(Y_k). For measure Goodall1, Q ⊆ A_k is the set of values q with p2_k(q) ≤ p2_k(X_k). For measure Goodall2, Q ⊆ A_k is the set of values q with p2_k(q) ≥ p2_k(X_k).

values are infrequent, regardless of the frequencies of the other values. The range of S_k(X_k, Y_k) for matches in the Goodall3 measure is [0, 1 − 2/(N(N − 1))], with the minimum value attained if X_k is the only value for attribute A_k, and the maximum value attained if X_k occurs only once.

6. Goodall4. The Goodall4 measure assigns similarity 1 − Goodall3 for matches. The range of S_k(X_k, Y_k) for matches in the Goodall4 measure is [2/(N(N − 1)), 1], with the minimum value attained if X_k occurs only once, and the maximum value attained if X_k is the only value for attribute A_k.

7. Gambaryan. Gambaryan proposed a measure [] that gives more weight to matches where the matching value occurs in about half the data set, i.e., in between being frequent and rare. The Gambaryan measure for a single attribute match is closely related to the Shannon entropy from information theory, as can be seen from its formula in Table 2. The range of S_k(X_k, Y_k) for matches in the Gambaryan measure is [0, 1], with the minimum value attained if X_k is the only value for attribute A_k, and the maximum value attained when X_k has frequency N/2.

4.2 Measures that fill Off-diagonal Entries only.

1. Eskin. Eskin et al. [9] proposed a normalization kernel for record-based network intrusion detection data. The original measure is distance-based and assigns a weight of 2/n_k² for mismatches; when adapted to similarity, this becomes a weight of n_k² / (n_k² + 2). This measure gives more weight to mismatches that occur on attributes that take many values. The range of S_k(X_k, Y_k) for mismatches in the Eskin measure is [2/3, N²/(N² + 2)], with the minimum value attained when the attribute takes only two values, and the maximum value attained when the attribute has all unique values.

2. Inverse Occurrence Frequency (IOF). The inverse occurrence frequency measure assigns lower similarity to mismatches on more frequent values.
The IOF measure is related to the concept of inverse document frequency, which comes from information retrieval, where it is used to signify the relative number of documents that contain a specific word. A key difference is that inverse document frequency is computed on a term-document matrix, which is usually binary, while the IOF measure is defined for categorical data. The range of S_k(X_k, Y_k) for mismatches in the IOF measure is [1/(1 + (log(N/2))²), 1], with the minimum value attained when X_k and Y_k each occur N/2 times (i.e., these are the only two values), and the maximum value attained when X_k and Y_k occur only once in the data set.

3. Occurrence Frequency (OF). The occurrence frequency measure gives the opposite weighting of the IOF measure for mismatches, i.e., mismatches on less frequent values are assigned lower similarity and mismatches on more frequent values are assigned higher similarity. The range of S_k(X_k, Y_k) for mismatches in the OF measure is [1/(1 + (log N)²), 1/(1 + (log 2)²)], with the minimum value attained when X_k and Y_k occur only once in the data set, and the maximum value attained when X_k and Y_k each occur N/2 times.

4. Burnaby. Burnaby [6] proposed a similarity measure using arguments from information theory. He argues that the set of observed values is like a group of signals conveying information and, as in information theory, attribute values that are rarely observed should be considered more informative. In [6], Burnaby proposed information-weighted measures for binary, ordinal, categorical and continuous data. The measure we present in Table 2 is adapted from Burnaby's categorical measure. This measure assigns low similarity to mismatches on rare values and high similarity to mismatches on frequent values. The range of S_k(X_k, Y_k) for mismatches in the Burnaby measure is [N log(1 − 1/N) / (N log(1 − 1/N) − log(N − 1)), 1], with the minimum value attained when all values for attribute A_k occur only once, and the maximum value attained when X_k and Y_k each occur N/2 times.
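The contrasting mismatch weightings of this section can be sketched as follows (a sketch assuming the Table 2 formulas; function names are ours, and matches receive similarity 1 under all three measures):

```python
import math

# Per-attribute similarity assigned to a MISMATCH (X_k != Y_k).
# f_x, f_y are the value frequencies f_k(X_k), f_k(Y_k); N is the data
# set size; n_k is the number of values taken by attribute A_k.

def eskin_mismatch(n_k):
    # Depends only on the number of values the attribute takes.
    return n_k ** 2 / (n_k ** 2 + 2)

def iof_mismatch(f_x, f_y):
    # Lower similarity for mismatches on more frequent values.
    return 1.0 / (1.0 + math.log(f_x) * math.log(f_y))

def of_mismatch(f_x, f_y, N):
    # The opposite weighting: higher similarity for frequent values.
    return 1.0 / (1.0 + math.log(N / f_x) * math.log(N / f_y))
```

For example, with N = 100, `iof_mismatch` scores a mismatch between two values of frequency 50 far below a mismatch between two singletons, while `of_mismatch` does the reverse.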
4.3 Measures that fill both Diagonal and Off-diagonal Entries.

1. Lin. In [22], Lin describes an information-theoretic framework for similarity, where he argues that when similarity is thought of in terms of assumptions about the space, the similarity measure naturally follows from the assumptions. Lin [22] discusses the ordinal, string, word and semantic similarity settings; we applied his framework to the categorical setting to derive the Lin measure in Table 2. The Lin measure gives higher weight to matches on frequent values, and lower weight to mismatches on infrequent values. The range of S_k(X_k, Y_k) for a match in the Lin measure is [−2 log N, 0], with the minimum value attained when X_k occurs only once and the maximum value attained when X_k occurs N times. The range of S_k(X_k, Y_k) for a

mismatch in the Lin measure is [−2 log(N/2), 0], with the minimum value attained when X_k and Y_k each occur only once, and the maximum value attained when X_k and Y_k each occur N/2 times.

2. Lin1. The Lin1 measure is another measure we have derived using Lin's similarity framework. This measure gives lower weight to mismatches if either of the mismatching values is very frequent, or if there are several values with frequencies in between those of the mismatching values; higher weight is given when there are mismatches on infrequent values and there are few other infrequent values. For matches, lower weight is given to matches on frequent values or matches on values that have many other values of the same frequency; higher weight is given to matches on rare values. The range of S_k(X_k, Y_k) for matches in the Lin1 measure is [−N log N, 0], with the minimum value attained when attribute A_k takes N possible values, and the maximum value attained when X_k occurs N times. The range of S_k(X_k, Y_k) for mismatches in the Lin1 measure is [−2 log(N/2), 0], with the minimum value attained when X_k and Y_k both occur only once, and the maximum value attained when X_k is the most frequent value and Y_k is the least frequent value, or vice versa.

3. Smirnov. Smirnov [27] proposed a measure rooted in probability theory that not only considers a given value's frequency, but also takes into account the distribution of the other values taken by the same attribute. The Smirnov measure is probabilistic for both matches and mismatches. For a match, the similarity is high when the frequency of the matching value is low and the other values occur frequently. The range of S_k(X_k, Y_k) for a match in the Smirnov measure is [2, 2N], with the minimum value attained when X_k occurs N times, and the maximum value attained when X_k occurs only once and the only other possible value for attribute A_k occurs N − 1 times.
The range of S_k(X_k, Y_k) for a mismatch in the Smirnov measure is [−2, N/2 − 3], with the minimum value attained when attribute A_k takes only the two values X_k and Y_k, and the maximum attained when A_k takes only one more value apart from X_k and Y_k and it occurs N − 2 times (X_k and Y_k occur once each).

4. Anderberg. In his book [2], Anderberg presents an approach to handle similarity between categorical attributes. He argues that rare matches indicate a strong association and should be given a very high weight, and that mismatches on rare values should be treated as distinctive and should also be given special importance. In accordance with these arguments, the Anderberg measure assigns higher similarity to rare matches, and lower similarity to rare mismatches. The Anderberg measure is unique in the sense that it cannot be written in the form of Equation 4.1. The range of the Anderberg measure is [0, 1]; the minimum value is attained when there are no matches, and the maximum value is attained when all attributes match.

4.4 Further Classification of Similarity Measures. We can further classify categorical similarity measures based on the arguments used to propose the measures:

1. Probabilistic approaches take into account the probability of a given match taking place. The following measures are probabilistic: Goodall1, Smirnov, Anderberg.

2. Information-theoretic approaches incorporate the information content of a particular value/variable with respect to the data set. The following measures are information-theoretic: Lin, Lin1, Burnaby.

Table 3 provides a characterization of each of the 14 similarity measures in terms of how they handle the various characteristics of a categorical data set. The table shows that the measures Eskin and Anderberg assign weight to every attribute using the quantity n_k, though in opposite ways.
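As a concrete instance of the information-theoretic family, the per-attribute Lin similarity can be sketched as follows (a sketch assuming the Table 2 formula; the dictionary of sample probabilities is a made-up example, not data from the report):

```python
import math

def lin_per_attribute(x, y, p_hat):
    """Per-attribute Lin similarity as given in Table 2: 2*log p(x) on a
    match, 2*log(p(x) + p(y)) on a mismatch, where p_hat maps each value
    to its sample probability f_k(value) / N."""
    if x == y:
        return 2.0 * math.log(p_hat[x])
    return 2.0 * math.log(p_hat[x] + p_hat[y])

# Hypothetical frequency distribution for one attribute.
p_hat = {"red": 0.7, "blue": 0.2, "green": 0.1}
# A match on the frequent value "red" scores higher (closer to 0) than a
# match on the rare value "green", reflecting Lin's match weighting.
```

Note that the per-attribute values are non-positive; the data-dependent weight w_k in Table 2 (itself a sum of log-probabilities) normalizes them into a usable similarity.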
Another interesting observation from column 3 is that several measures (Lin, Lin1, Goodall1, Goodall3, Smirnov, Anderberg) assign higher similarity to a match when the attribute value is rare (f_k is low), while Goodall2 and Goodall4 assign higher similarity to a match when the attribute value is frequent (f_k is high). Only Gambaryan assigns the maximum similarity when the attribute value has a frequency close to N/2. Column 4 shows that IOF, Lin, Lin1, Smirnov and Burnaby assign greater similarity when the mismatch occurs between rare values, while OF and Anderberg assign greater similarity to a mismatch between frequent values.

5 Outlier Detection in Categorical Data. Outlier detection refers to detecting instances that do not conform to a specific definition of normal behavior. For nearest neighbor techniques, a normal instance is one that has a very tight neighborhood. In the categorical domain, this corresponds to the frequency of occurrence of a combination of attribute values. Normal points are frequent combinations of categorical values

Table 3: Relation between the per-attribute similarity S_k(X_k, Y_k) and the quantities n_k, f_k(X_k), f_k(Y_k).

Measure      n_k      X_k = Y_k                          X_k ≠ Y_k
Overlap      —        1                                  0
Eskin        n_k²     1                                  ∝ n_k²
IOF          —        1                                  ∝ 1/(log f_k(X_k) · log f_k(Y_k))
OF           —        1                                  ∝ log f_k(X_k) · log f_k(Y_k)
Lin          —        ∝ log f_k(X_k)                     ∝ log(f_k(X_k) + f_k(Y_k))
Lin1         —        ∝ 1/log f_k(X_k)                   ∝ 1/log(f_k(X_k) f_k(Y_k))
Goodall1     —        ∝ (1 − f_k²(X_k))                  0
Goodall2     —        ∝ f_k²(X_k)                        0
Goodall3     —        ∝ (1 − f_k²(X_k))                  0
Goodall4     —        ∝ f_k²(X_k)                        0
Smirnov      —        ∝ 1/f_k(X_k)                       ∝ 1/(f_k(X_k) + f_k(Y_k))
Gambaryan    —        maximum at f_k(X_k) = N/2          0
Burnaby      —        1                                  ∝ 1/(log f_k(X_k) · log f_k(Y_k))
Anderberg    1/n_k    ∝ 1/f_k²(X_k)                      ∝ f_k(X_k) f_k(Y_k)

while outliers are the rarely occurring combinations. We will first provide an understanding of normal and outlier instances in categorical data from this perspective. Consider the example shown earlier in Table 1, and assume that a count of 20 or more is considered frequent while anything below is considered rare. Now consider the following 4 instances belonging to D:

1. (red, square): The combination occurs 30 times (frequent).

2. (green, circle): The combination occurs 1 time (rare); the value green for color occurs 5 times (rare) and the value circle for shape occurs 28 times (frequent).

3. (red, circle): The combination occurs 2 times (rare); the value red for color occurs 35 times (frequent) and the value circle for shape occurs 28 times (frequent).

4. (green, triangle): The combination occurs 2 times (rare); the value green for color occurs 5 times (rare) and the value triangle for shape occurs 5 times (rare).

Instance 1 seems to be an obvious normal instance, while instance 4 seems to be an obvious outlier. Instances 2 and 3 occur rarely, but one or both of their individual attribute values occur frequently. These might be considered outliers or normal depending on the data domain. Thus we observe that normal and outlier instances in a categorical data set might differ in their composition.

5.1
Outlier Detection Using Nearest Neighbors. Nearest-neighbor-based techniques for outlier detection assume that outliers lie far away from the normal points under a given similarity measure. The general methodology of such techniques is to estimate the density around each point. The density is measured either by counting the number of points within a certain radius of the point, or by estimating the sparsity of the neighborhood of the point.

knn Outlier Detection. The nearest neighbor technique used in this paper [26] takes a single parameter k. The outlier score of a point is equal to the distance of the point to its k-th nearest neighbor.

lof Outlier Detection. This technique [5] uses the notion of the k-distance of a given point p, defined as the distance of p to its k-th nearest neighbor. The k-distance neighborhood of a point p consists of all points that are at a distance less than or equal to its k-distance. Note that the size of the k-distance neighborhood of p need not be exactly k. The k-distance neighborhood of p is denoted by N_k(p). The reachability distance of a point p with respect to another point o is defined as

(5.2) r_k(p, o) = max{k-distance(o), d(p, o)}

where d(p, o) is the actual distance between p and o. For points that are far apart, the reachability distance and the actual distance are the same. For points that are close to p, the reachability distance is replaced by the k-distance of the other point. The local reachability density (lrd) of a point p is defined as

(5.3) lrd_k(p) = ( Σ_{o ∈ N_k(p)} r_k(p, o) / |N_k(p)| )^(−1)

If there are duplicates in the data, such that the k-neighborhood of a point consists only of duplicates, the lrd computation will run into the problem of division

by 0. For continuous data sets such a scenario is highly unlikely, but it might occur in categorical data sets. In such cases there are two possible solutions:

1. Assign a small distance (ε) between two identical points.

2. Require the k-neighborhood of any point to consist of k distinct points.

The local outlier factor, or outlier score (lof), of a point p is defined as

(5.4) lof_k(p) = ( Σ_{o ∈ N_k(p)} lrd_k(o) / lrd_k(p) ) / |N_k(p)|

6 Experimental Evaluation

In this section we present an experimental evaluation of the 14 measures given in Table 4 on 23 different data sets in the context of outlier detection. Of these data sets, 21 are based on data sets available at the UCI Machine Learning Repository [3], and two are based on network data generated by SKAION Corp. for the ARDA information assurance program [7]. The 23 data sets are summarized in Table 4. Eleven of these data sets were purely categorical, five (KD1, KD2, Sk1, Sk2, Cen) had a mix of continuous and categorical attributes, and two data sets, Irs and Sgm, were purely continuous. Continuous variables were discretized using the MDL method [10]. The KD1 and KD2 data sets were obtained from the KDDCup data set by discretizing the continuous attributes into 10 and 100 bins respectively. Another possible way to handle a mixture of attributes is to compute the similarity for continuous and categorical attributes separately, and then do a weighted aggregation; in this study we converted the continuous attributes to categorical ones to simplify the comparative evaluation. Each data set contains labeled instances belonging to multiple classes. We identified one class as the outlier class, and the rest of the classes were grouped together and called normal. The last two rows in Table 4 denote the cross-validation classification recall and precision reported by the C4.5 classifier on the outlier class. This quantity indicates the separability between instances belonging to the normal class(es) and instances belonging to the outlier class, using the given set of attributes.
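The two outlier scores defined above, the kNN score and formulas (5.2)–(5.4), can be sketched in a few lines of Python. This is a minimal illustration, not the report's implementation: the toy data, the `dist` function, the ε guard, and all names below are assumptions.

```python
def knn_score(p, data, dist, k):
    # kNN outlier score: distance from p to its k-th nearest neighbor.
    return sorted(dist(p, o) for o in data if o is not p)[k - 1]

def neighborhood(q, data, dist, k):
    # k-distance of q and its k-distance neighborhood N_k(q); the
    # neighborhood may contain more than k points in case of ties.
    others = sorted((o for o in data if o is not q), key=lambda o: dist(q, o))
    kdist = dist(q, others[k - 1])
    return kdist, [o for o in others if dist(q, o) <= kdist]

def reach_dist(p, o, data, dist, k):
    # (5.2): r_k(p, o) = max{k-distance(o), d(p, o)}.
    return max(neighborhood(o, data, dist, k)[0], dist(p, o))

def lrd(p, data, dist, k, eps=1e-9):
    # (5.3): inverse of the average reachability distance over N_k(p);
    # eps guards the division by 0 caused by duplicate-only neighborhoods.
    _, nbhd = neighborhood(p, data, dist, k)
    avg = sum(reach_dist(p, o, data, dist, k) for o in nbhd) / len(nbhd)
    return 1.0 / max(avg, eps)

def lof(p, data, dist, k):
    # (5.4): average ratio of the neighbors' lrd to p's own lrd;
    # values well above 1 flag p as an outlier.
    _, nbhd = neighborhood(p, data, dist, k)
    return sum(lrd(o, data, dist, k) for o in nbhd) / (len(nbhd) * lrd(p, data, dist, k))

# Usage on toy 1-D data (pass an element of `data` as the query point,
# since membership is checked by identity):
data = [0.0, 0.1, 0.2, 0.3, 5.0]
dist = lambda a, b: abs(a - b)
print(knn_score(data[-1], data, dist, k=2))  # 4.8
print(lof(data[-1], data, dist, k=2))        # well above 1
```

For categorical data, `dist` would be derived from one of the similarity measures of Table 4 (for example, 1 minus the normalized similarity).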
A low accuracy implies that distinguishing between outliers and normal instances is difficult in that particular data set using a decision tree-based classifier.

6.1 Evaluation Methodology. The performance of the different similarity measures was evaluated in the context of outlier detection using nearest neighbors [26, 30]. We construct a test data set by taking equal numbers of instances as random samples from the outlier class (n) and the normal class(es). In addition, a random sample (comparable in size to the outlier class) is taken from the normal class to serve as the training set. For each test instance we find its k nearest neighbors in the training set, using the given similarity measure (we chose the parameter k = 10). The outlier score is computed for the kNN algorithm and the lof algorithm as discussed earlier. The test instances are then sorted in decreasing order of outlier score. To evaluate a measure, we count the number of true outliers in the top p portion of the sorted test instances, where p = δn, 0 < δ ≤ 1. Let o be the number of actual outliers among the top p predicted outliers. The accuracy of the algorithm is measured as o/p. In this paper we present results for δ = 1. We have also experimented with other, lower values of δ, and the trends in relative performance are similar.

6.2 Experimental Results on Public Data Sets. Our experimental results verified our initial hypotheses about categorical similarity measures. As can be seen from Table 5, there are many situations where the Overlap measure does not give good performance. This is consistent with our intuition that the use of additional information would lead to better performance. In particular, we expected that since categorical data does not have an inherent ordering, data-driven measures would be able to take advantage of information present in the data set to make more accurate determinations of the similarity between a pair of data instances. We make some key observations about the results in Table 5:

1. No single measure is always superior or inferior.
This is to be expected since each data set has different characteristics.

2. Some measures give consistently better performance on a large variety of data. The Lin, OF, and Goodall3 measures give among the best performance overall in terms of outlier detection. This is noteworthy since Lin and Goodall3 have been introduced for the first time in this paper.

3. Some pairs of measures exhibit complementary performance, i.e., one performs well where the other performs poorly and vice versa. Example complementary pairs are (OF, IOF), (Lin, Lin1) and (Goodall3, Goodall4). This observation means that it may be possible to construct measures that draw on the strengths of two measures in order to obtain superior performance. This

Table 4: Description of Public Data Sets. Columns: the 23 data sets Cr1, Cr2, Irs, Cn1, Cn2, KD1, KD2, KD3, KD4, Sk1, Sk2, Ms1, Ms2, Sgmt, Cen, Bal, Can, Hys, Lym, Nur, Tmr, TTT, Au; rows: Size, % Outliers, avg(n_k), med(n_k), f_k Uniform, f_k Gaussian, f_k Skewed, Recall, Precision. (Numeric entries were not preserved in this transcription.)

Table 5: Experimental Results for the kNN Algorithm for 100%. Rows: the 14 measures ovrlp, eskn, iof, of, lin, lin1, goo1, goo2, goo3, goo4, smrnv, gmbrn, brnby, anbrg, plus a per-data-set average; columns: the 23 data sets plus a per-measure average. (Numeric entries were not preserved in this transcription.)

Table 6: Experimental Results for the LOF Algorithm for 100%. (Same layout as Table 5: the 14 measures against the 23 data sets, with row and column averages; numeric entries were not preserved in this transcription.)

Table 7: Experimental Results for the kNN Algorithm for 50%. (Same layout; numeric entries not preserved.)

Table 9: Experimental Results for the kNN Algorithm for 25%. (Same layout as Table 5; numeric entries were not preserved in this transcription.)

Table 8: Experimental Results for the LOF Algorithm for 50%. (Same layout; numeric entries not preserved.)

Table 10: Experimental Results for the LOF Algorithm for 25%. (Same layout as Table 5; numeric entries were not preserved in this transcription.)

is an aspect of this work that needs to be pursued in future work.

4. The performance of an outlier detection algorithm is significantly affected by the similarity measure used. For example, for the Cn1 data set, which has a very low classification accuracy for the outlier class, using OF still achieves close to 50% accuracy.

5. The Eskin similarity measure weights attributes proportionally to the number of values taken by the attribute (n_k). For data sets in which the attributes take a large number of values (e.g., KD2, Sk1, Sk2), eskn performs very poorly.

6. The Smirnov measure assigns similarity to both diagonal and off-diagonal entries in the per-attribute similarity matrix (Figure 1), but it still performs very poorly on most of the data sets. The other measures that operate similarly (Lin, Lin1 and Anderberg) perform better than Smirnov on almost every data set.

7. The performance of kNN does not vary significantly for different values of δ, as seen from Tables 5, 7, and 9.

8. Using lof as the outlier detection algorithm (refer to Tables 6, 8, and 10) improves the overall performance for almost every similarity measure; the drop in performance for the 14 measures at δ = 1.00 is marginal. This indicates that lof is a better outlier detection algorithm than kNN for categorical data sets. The relation between the algorithm and the similarity measure is also of significance and will be a part of our future research.

7 Concluding Remarks and Future Work

Computing similarity between categorical attributes has been discussed in a variety of contexts. In this paper we have brought together several such measures and evaluated them in the context of outlier detection.
We have also proposed several variants (Lin1, Goodall2, Goodall3, Goodall4) of existing similarity measures, some of which perform very well as shown in our evaluation. Given this set of similarity measures, the first question that comes to mind is: which similarity measure is best suited for my data mining task? Our experimental results suggest that there is no single best-performing similarity measure. Hence, one needs to understand how a similarity measure handles the different characteristics of a categorical data set, and this needs to be explored in future research.


A Randomized Approximate Nearest Neighbors Algorithm - a short version We present a ranomize algorithm for the approximate nearest neighbor problem in - imensional Eucliean space. Given N points {x } in R, the algorithm attempts to fin k nearest neighbors for each of x, where

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in

More information

Computing the Longest Common Subsequence of Multiple RLE Strings

Computing the Longest Common Subsequence of Multiple RLE Strings The 29th Workshop on Combinatorial Mathematics an Computation Theory Computing the Longest Common Subsequence of Multiple RLE Strings Ling-Chih Yao an Kuan-Yu Chen Grauate Institute of Networking an Multimeia

More information

THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE

THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE Journal of Soun an Vibration (1996) 191(3), 397 414 THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE E. M. WEINSTEIN Galaxy Scientific Corporation, 2500 English Creek

More information

An Analytical Expression of the Probability of Error for Relaying with Decode-and-forward

An Analytical Expression of the Probability of Error for Relaying with Decode-and-forward An Analytical Expression of the Probability of Error for Relaying with Decoe-an-forwar Alexanre Graell i Amat an Ingmar Lan Department of Electronics, Institut TELECOM-TELECOM Bretagne, Brest, France Email:

More information

Optimal Signal Detection for False Track Discrimination

Optimal Signal Detection for False Track Discrimination Optimal Signal Detection for False Track Discrimination Thomas Hanselmann Darko Mušicki Dept. of Electrical an Electronic Eng. Dept. of Electrical an Electronic Eng. The University of Melbourne The University

More information

Balancing Expected and Worst-Case Utility in Contracting Models with Asymmetric Information and Pooling

Balancing Expected and Worst-Case Utility in Contracting Models with Asymmetric Information and Pooling Balancing Expecte an Worst-Case Utility in Contracting Moels with Asymmetric Information an Pooling R.B.O. erkkamp & W. van en Heuvel & A.P.M. Wagelmans Econometric Institute Report EI2018-01 9th January

More information

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k A Proof of Lemma 2 B Proof of Lemma 3 Proof: Since the support of LL istributions is R, two such istributions are equivalent absolutely continuous with respect to each other an the ivergence is well-efine

More information

Estimation of the Maximum Domination Value in Multi-Dimensional Data Sets

Estimation of the Maximum Domination Value in Multi-Dimensional Data Sets Proceeings of the 4th East-European Conference on Avances in Databases an Information Systems ADBIS) 200 Estimation of the Maximum Domination Value in Multi-Dimensional Data Sets Eleftherios Tiakas, Apostolos.

More information

Inverse Theory Course: LTU Kiruna. Day 1

Inverse Theory Course: LTU Kiruna. Day 1 Inverse Theory Course: LTU Kiruna. Day Hugh Pumphrey March 6, 0 Preamble These are the notes for the course Inverse Theory to be taught at LuleåTekniska Universitet, Kiruna in February 00. They are not

More information

Space-time Linear Dispersion Using Coordinate Interleaving

Space-time Linear Dispersion Using Coordinate Interleaving Space-time Linear Dispersion Using Coorinate Interleaving Jinsong Wu an Steven D Blostein Department of Electrical an Computer Engineering Queen s University, Kingston, Ontario, Canaa, K7L3N6 Email: wujs@ieeeorg

More information

Bayesian Estimation of the Entropy of the Multivariate Gaussian

Bayesian Estimation of the Entropy of the Multivariate Gaussian Bayesian Estimation of the Entropy of the Multivariate Gaussian Santosh Srivastava Fre Hutchinson Cancer Research Center Seattle, WA 989, USA Email: ssrivast@fhcrc.org Maya R. Gupta Department of Electrical

More information

Bohr Model of the Hydrogen Atom

Bohr Model of the Hydrogen Atom Class 2 page 1 Bohr Moel of the Hyrogen Atom The Bohr Moel of the hyrogen atom assumes that the atom consists of one electron orbiting a positively charge nucleus. Although it oes NOT o a goo job of escribing

More information

Calculus in the AP Physics C Course The Derivative

Calculus in the AP Physics C Course The Derivative Limits an Derivatives Calculus in the AP Physics C Course The Derivative In physics, the ieas of the rate change of a quantity (along with the slope of a tangent line) an the area uner a curve are essential.

More information

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties Journal of Machine Learning Research 16 (2015) 1547-1572 Submitte 1/14; Revise 9/14; Publishe 8/15 Flexible High-Dimensional Classification Machines an Their Asymptotic Properties Xingye Qiao Department

More information

Calculus of Variations

Calculus of Variations Calculus of Variations Lagrangian formalism is the main tool of theoretical classical mechanics. Calculus of Variations is a part of Mathematics which Lagrangian formalism is base on. In this section,

More information

A New Family of Near-metrics for Universal Similarity

A New Family of Near-metrics for Universal Similarity arxiv:1707.06903v3 [stat.ml] 17 Oct 2017 A New Family of Near-metrics for Universal Similarity Chu Wang Iraj Saniee William S. Kenney Chris A. White October 18, 2017 Abstract We propose a family of near-metrics

More information

A simple model for the small-strain behaviour of soils

A simple model for the small-strain behaviour of soils A simple moel for the small-strain behaviour of soils José Jorge Naer Department of Structural an Geotechnical ngineering, Polytechnic School, University of São Paulo 05508-900, São Paulo, Brazil, e-mail:

More information

2Algebraic ONLINE PAGE PROOFS. foundations

2Algebraic ONLINE PAGE PROOFS. foundations Algebraic founations. Kick off with CAS. Algebraic skills.3 Pascal s triangle an binomial expansions.4 The binomial theorem.5 Sets of real numbers.6 Surs.7 Review . Kick off with CAS Playing lotto Using

More information

Non-deterministic Social Laws

Non-deterministic Social Laws Non-eterministic Social Laws Michael H. Coen MIT Artificial Intelligence Lab 55 Technology Square Cambrige, MA 09 mhcoen@ai.mit.eu Abstract The paper generalizes the notion of a social law, the founation

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

KNN Particle Filters for Dynamic Hybrid Bayesian Networks

KNN Particle Filters for Dynamic Hybrid Bayesian Networks KNN Particle Filters for Dynamic Hybri Bayesian Networs H. D. Chen an K. C. Chang Dept. of Systems Engineering an Operations Research George Mason University MS 4A6, 4400 University Dr. Fairfax, VA 22030

More information

State observers and recursive filters in classical feedback control theory

State observers and recursive filters in classical feedback control theory State observers an recursive filters in classical feeback control theory State-feeback control example: secon-orer system Consier the riven secon-orer system q q q u x q x q x x x x Here u coul represent

More information

Axiometrics: Axioms of Information Retrieval Effectiveness Metrics

Axiometrics: Axioms of Information Retrieval Effectiveness Metrics Axiometrics: Axioms of Information Retrieval Effectiveness Metrics ABSTRACT Ey Maalena Department of Maths Computer Science University of Uine Uine, Italy ey.maalena@uniu.it There are literally ozens most

More information

Lower bounds on Locality Sensitive Hashing

Lower bounds on Locality Sensitive Hashing Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an

More information