Data-Intensive Similarity Measures for Categorical Data


Data-Intensive Similarity Measures for Categorical Data

Thesis submitted in partial fulfillment of the requirements for the degree of MS by Research in Computer Science Engineering

by

Desai Aditya Makarand

Center For Data Engineering
International Institute of Information Technology
Hyderabad, INDIA
April, 2011

Copyright © Desai Aditya Makarand, 2011
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled Data-Intensive Similarity Measure for Categorical Data by Desai Aditya Makarand, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                        Advisor: Dr. Vikram Pudi

Dedicated to my father Makarand Desai and my mother Smita Desai

Acknowledgements

Firstly, I would like to thank my advisor Dr. Vikram Pudi for his constant motivation and guidance. He always encouraged us to come up with new ideas and methodologies and guided us back to the right path whenever the results were not as expected. In addition, a special thanks to Prof. Kamalakar Karlapalem for instilling in me the value of hard work. I would also like to thank Himanshu Singh, with whom I worked towards my MS thesis. He has been a great friend, and brainstorming sessions with him have always led to new ideas. Also, a special thanks to my batch-mates and my colleagues at CDE who have made my stay at IIIT-H a memorable one. Finally and most importantly, I would like to thank my parents Makarand Desai and Smita Desai for their unconditional love and support. They have been there through ups and downs and have motivated and encouraged me throughout. Also, I would like to thank my entire family for always encouraging me and for instilling positivity in me at all times.

Abstract

The concept of similarity is fundamentally important in almost every scientific field. Clustering, distance-based outlier detection, classification, regression and search are major data mining techniques which compute the similarities between instances, and hence the choice of a particular similarity measure can turn out to be a major cause of success or failure of the algorithm. The notion of similarity or distance for categorical data is not as straightforward as for continuous data and hence is a major challenge. This is due to the fact that different values taken by a categorical attribute are not inherently ordered, and hence a notion of direct comparison between two categorical values is not possible. In addition, the notion of similarity can differ depending on the particular domain, dataset, or task at hand. Ideally, the similarity notion is defined by a domain expert who understands the domain concepts well. However, in recent years, due to advances in computers and data capture technology, huge datasets containing gigabytes or even terabytes of data have been collected. In many of these applications domain expertise is not available and the users don't understand the interconnections between objects well enough to formulate exact definitions of similarity or distance. Thus, it becomes necessary to analyze and extract knowledge (sometimes hidden) from data using machine-oriented or automatic methods. In this thesis we present two new similarity measures for categorical data: DISC (Data-Intensive Similarity Measure for Categorical Data) and DISC-O (Data-Intensive Similarity Measure for Categorical Data using association rules). Both DISC and DISC-O capture the semantics of the data without any help from domain experts for defining similarity. In addition to this, they are generic and simple to implement. These desirable features make DISC and DISC-O attractive alternatives to existing approaches.

Our experimental study compares them with 15 other similarity measures on a number of standard real datasets, used for classification, clustering and regression, and shows that the proposed algorithms are significantly more accurate than all of their competitors.

Contents

1 Introduction
   1.1 Problem Formulation
   1.2 Key Contributions
   1.3 Background
   1.4 Data
   1.5 Supervised Learning Algorithms
      1.5.1 Classification
      1.5.2 Regression
   1.6 Unsupervised Learning Algorithms
      1.6.1 Clustering
   1.7 Frequent Pattern Mining
   1.8 Organization of Thesis
2 Related Work
   2.1 Related Work
   2.2 Notations
   2.3 Characteristics of Categorical Data
   2.4 Popular Similarity Measures
      Measures that fill Diagonal Entries only
      Measures that fill Off-Diagonal Entries only
      Measures that fill both Diagonal and Off-Diagonal Entries
      Other classifications for similarity measures
3 Similarity Measure
   Motivation and Design
   Data Structure Description
   Algorithm Overview
   DISC Computation
   Illustration
   Validity of Similarity Measure
   Computational Complexity

4 An Association Rule based Similarity Measure
   Motivation
   Interestingness Measures for Association Rules
      Subjective v/s Objective Measures
      Symmetric v/s Asymmetric Measures
   DISC-O Algorithm
   Computational Complexity of DISC-O
   Illustration
      Similarity between Red and Blue
      Similarity between Ferrari and Mercedes
      Similarity between Ferrari and Nano
   Inference
5 Experimental Results of DISC
   Pre-Processing and Experimental Settings
   Experimental Results
   Discussion of Results
   Performance on Cricinfo Dataset
      Approach
      Observation
   Extension to a 2nd level of clustering
      Observation
      Observation
6 Experimental Results of DISC-O
   Performance Model
   Experimental Results
   Discussion of Results
   Selection of Parameters φ, ψ
7 Conclusions and Future Work
   Contributions
   Future Work
Bibliography

List of Figures

2.1 Similarity Matrix for a Single Categorical Attribute
Relation of per-attribute similarity to data characteristics

List of Tables

2.1 Similarity Measures for Binary Vectors
Similarity Measures for categorical attributes
Illustration
Cosine Similarity computation between v_ij, v_ik
CI Table corresponding to Brand
Representative points for attribute values corresponding to Brand
Initialization for Brand
Initialization for Color
Similarity Matrix for Brand
Similarity Matrix for Color
A 2x2 Contingency Table for variables A and B
A 2x2 Contingency Table for variables A and B
Datasets for Classification
Datasets for Regression
Accuracy for k-NN with k =
Accuracy for k-NN with k =
Accuracy for k-NN with k =
ABME/RMSE for k-NN with k =
ABME/RMSE for k-NN with k =
ABME/RMSE for k-NN with k =
ABME/RMSE for k-NN with k =
Interestingness Measures for Association Patterns
Clustering accuracy of DISC-O (using Rand-Index) for k-means with number of clusters = number of classes
Clustering accuracy of DISC-O (using Rand-Index) for k-means with number of clusters = number of classes
Clustering accuracy of competitors (using Rand-Index) for k-means with number of clusters = number of classes

6.5 Difference between best and worst performance by varying ψ = {cosine, jaccard, dice}
Classification accuracy of competitors for k-NN with number of neighbours =
Classification accuracy of competitors for k-NN with number of neighbours =
Classification accuracy of DISC-O for k-NN with number of neighbours =
Classification accuracy of DISC-O for k-NN with number of neighbours =
Difference between best and worst performance by varying ψ = {cosine, jaccard, dice}

Chapter 1

Introduction

The concept of similarity is fundamentally important in almost every scientific field. This concept plays a major role in the field of data mining and in knowledge discovery tasks involving distance computation. Clustering, distance-based outlier detection, classification and regression are major data mining techniques which compute the similarities between instances, and hence the choice of a particular similarity measure can turn out to be a major cause of success or failure of the algorithm. For these tasks, the choice of a similarity measure can be as important as the choice of data representation or feature selection. The notion of a similarity measure for continuous and ordinal data is comparatively straightforward due to the inherent ordering information. There are a number of similarity measures for continuous data. The Minkowski distance and its special case, the Euclidean distance, are the two most widely used distance measures for continuous data. For ordinal variables, a general solution is to map ordinal values to numbers based on the ordering and then apply techniques associated with quantitative variables. The notion of similarity or distance for categorical data is not as straightforward as for continuous data and hence is a major challenge. This is due to the fact that the different values that a categorical attribute takes are not inherently ordered, and hence a notion of direct comparison between two categorical values is not possible. In addition, the notion of similarity can differ depending on the particular domain, dataset, or task at hand. Although there is no inherent ordering in categorical data, there are other factors, like co-occurrence statistics, that can be used to define what should be considered more similar and vice versa.

This observation has motivated researchers to come up with data-driven similarity measures for categorical attributes. Such measures take into account the frequency distribution of individual attribute values in a given data set, but most of these algorithms fail to capture any feature of the dataset apart from the frequency distribution of the different attribute values. One solution to the problem is to build a common repository of similarity measures for all commonly occurring concepts. As an example, suppose the similarity values for the concept colour are to be determined. Now, consider three colours: red, pink and black. Consider two domains as follows:

Domain 1: Suppose the domain is determining the response of the cones of the eye to colour. It is then obvious that the cones respond to red and pink far more similarly than to black. Hence the similarity between red and pink must be high compared to the similarity between red and black or pink and black.

Domain 2: Consider another domain, for example car sales data. In such data, it may be known that pink cars are extremely rare as compared to red and black cars, and hence the similarity between red and black must be larger than that between red and pink or black and pink in this case.

Ideally, the similarity notion is defined by a domain expert who understands the domain concepts well. However, in many applications domain expertise is not available and the users don't understand the interconnections between objects well enough to formulate exact definitions of similarity or distance. In the absence of domain expertise it is conceptually very hard to come up with a domain-independent solution for similarity. Thus, the notion of similarity varies from one domain to another, and hence the assignment of similarity must involve a thorough understanding of the domain. This makes it necessary to define a similarity measure based on latent knowledge available from the data instead of a fit-to-all measure, and this is the major motivation for this thesis. In this thesis we present two new similarity measures for categorical data:

1. DISC (Data-Intensive Similarity Measure for Categorical Data) and

2. DISC-O (Data-Intensive Similarity Measure for Categorical Data using AssOciation Rules).

They capture the semantics of the data without any help from domain experts for defining similarity. They achieve this by capturing the relationships that are inherent in the data itself. In addition, both DISC and DISC-O are generic and simple to implement.

The rest of this chapter proceeds as follows. Initially, we present a mathematical formulation of what constitutes a valid similarity measure (Section 1.1) and the key contributions of this thesis (Section 1.2). We then describe the background behind the problem (Section 1.3). Later, we give a high-level overview of the general nature of data in Section 1.4. This is followed by the description of supervised data mining algorithms like classification (Section 1.5.1) and regression (Section 1.5.2), and unsupervised data mining algorithms like clustering (Section 1.6.1). We also briefly introduce frequent pattern mining (Section 1.7), which forms the basis of the algorithm presented later in this thesis. We finally describe the organization of the thesis in Section 1.8.

1.1 Problem Formulation

In this section we discuss the conditions for a similarity measure to be valid. Later, in Chapter 3, we describe how DISC satisfies these requirements and prove the validity of our algorithm. The conditions for a valid similarity measure are now defined as extensions of the conditions on a distance measure, using the similarity-distance mapping in Equation 1.1:

Sim = 1 / (1 + dist)    (1.1)

The following conditions need to hold for a distance metric d to be valid, where d(x, y) is the distance between x and y:

1. d(x, y) >= 0
2. d(x, y) = 0 if and only if x = y
3. d(x, y) = d(y, x)
4. d(x, z) <= d(x, y) + d(y, z)

We come up with the following conditions for a valid similarity measure, based on the definitions of the distance space and using the distance-similarity mapping, where Sim(x, y) is the similarity between x and y:

1. 0 <= Sim(x, y) <= 1
2. Sim(x, y) = 1 if and only if x = y
3. Sim(x, y) = Sim(y, x)
4. 1/Sim(x, y) + 1/Sim(y, z) >= 1 + 1/Sim(x, z)
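As an informal illustration of the mapping in Equation 1.1, the following short sketch converts Euclidean distances into similarities and checks the four conditions numerically on a few points. The helper names and the toy points are purely illustrative and are not part of the thesis algorithms.

```python
import itertools
import math

def euclidean(x, y):
    """A valid distance metric on numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def to_similarity(dist):
    """The similarity-distance mapping of Equation 1.1: Sim = 1 / (1 + dist)."""
    return 1.0 / (1.0 + dist)

points = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
for x, y, z in itertools.permutations(points, 3):
    s_xy, s_yz, s_xz = (to_similarity(euclidean(u, v)) for u, v in ((x, y), (y, z), (x, z)))
    assert 0.0 <= s_xy <= 1.0                           # condition 1
    assert (s_xy == 1.0) == (x == y)                    # condition 2
    assert s_xy == to_similarity(euclidean(y, x))       # condition 3
    assert 1 / s_xy + 1 / s_yz >= 1 + 1 / s_xz - 1e-12  # condition 4, from the triangle inequality
```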

1.2 Key Contributions

The key contributions of this thesis can be summarized as follows:

- Introducing a notion of similarity between two values of a categorical attribute based on co-occurrence statistics.
- Defining two valid similarity measures capturing such a notion, which can be used out-of-the-box for any generic domain.
- Experimentally validating that such similarity measures provide a significant improvement in accuracy when applied to classification, clustering and regression on a wide array of dataset domains.

The experimental validation is especially significant since it demonstrates a reasonably large improvement in accuracy by changing only the similarity measure while keeping the algorithm and its parameters constant.

1.3 Background

Due to advances in computers and data capture technology, huge datasets containing gigabytes or even terabytes of data have been collected. The latest estimates put the size of the data stored at 1.2 million petabytes in 2010, up from 80,000 petabytes only a few years earlier. The storage of data is predominantly due to the belief that the more data we have, the better a representation we have of the problem at hand, and that the analysis of this data should then give some insightful results. However, due to the sheer size of the available data, using traditional database systems to support the decision support process becomes infeasible, and hence it becomes necessary to analyze and extract knowledge (sometimes hidden) from data using machine-oriented or automatic methods. The trick is to extract the valuable information from the surrounding mass of uninteresting numbers, so that data owners can effectively capitalize on the interesting aspects. The field of data mining has emerged as a result, to solve this problem. Different texts propose different definitions of data mining, prominent among them being the definition by Han and Kamber [1], which says: "Data mining, also popularly referred to as knowledge discovery in databases (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories." The other definition, by Witten and Frank [2], describes data mining as follows: "It is the extraction of implicit, previously unknown, and potentially useful information from data." Another definition of data mining is: "The application of computer technology and machine learning algorithms to discover patterns, anomalies, trends, and knowledge from data." Thus, in short, data mining can be defined as the field of developing automated systems which are responsible for processing large volumes of information and extracting previously unknown interesting information. However, as data mining is applied to new and previously unexplored fields, there exists a need to understand the underlying nature and structure of the data. In this thesis we propose two data-intensive similarity measures for categorical data. Once the structure and nature of the underlying data is known, suitable data mining algorithms like clustering, classification, regression or anomaly detection can be applied as the situation demands.

1.4 Data

As stated previously, the aim of data mining is to discover interesting relationships that exist in the real world. Data pertaining to these relationships is collected by mapping entities in the domain of interest to a symbolic representation by means of some mapping procedure, which associates the value of a variable with a given property of an entity. A dataset is a set of measurements taken from some environment or domain. A typical dataset consists of a collection of objects, and for each object there are d measurements. The entire collection of measurements on n objects is represented by an n x d matrix, which is the dataset to be mined. Each row may be referred to as an entity, case, object, record or instance depending on the context of use, and each column may be referred to as a variable, feature, attribute or field depending on the context. Consider the census scenario which makes the following measurements: ID, Sex, Marital Status, Education and Income. However, there are certain distinctions between the kinds of values that a feature can take. The two classes of attributes are quantitative and qualitative attributes. A quantitative variable is measured on a numerical scale and in principle can take any value. From the above example, ID and Income are examples of quantitative attributes. Qualitative attributes, on the other hand, can take only certain discrete values defined by the domain and hence differ from quantitative attributes. For example, the variable Sex can only take the values Male/Female. Qualitative variables can be further divided into nominal and ordinal. Nominal variables take a fixed discrete set of values such that there is no ordering among the values. For example, for the variable Sex, the values it can take are Male and Female, with no inherent ordering among them. On the other hand, if the variable Education corresponds to the highest degree taken, with the values being primary school, high school, bachelors, masters and doctorate, the values taken by the variable have an inherent ordering and the corresponding variable is called an ordinal variable. It may be noted that while mean, median and mode can be calculated for quantitative and ordinal data, there is no notion of mean/median for nominal data.
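The following small sketch (an illustrative example of our own; the records and values are made up) shows a census-style dataset as a collection of objects with quantitative, ordinal and nominal attributes, and why a mean or median is meaningful only for some of them:

```python
# Toy census-style objects: each row is an object, each key an attribute.
census = [
    {"ID": 1, "Sex": "Male",   "Marital Status": "Single",  "Education": "bachelors", "Income": 42000},
    {"ID": 2, "Sex": "Female", "Marital Status": "Married", "Education": "masters",   "Income": 58000},
    {"ID": 3, "Sex": "Female", "Marital Status": "Single",  "Education": "doctorate", "Income": 61000},
]

# Quantitative attribute: a mean is meaningful.
mean_income = sum(row["Income"] for row in census) / len(census)

# Ordinal attribute: values can be mapped to numbers because they carry an inherent ordering.
EDUCATION_ORDER = {"primary school": 0, "high school": 1, "bachelors": 2, "masters": 3, "doctorate": 4}
median_education_rank = sorted(EDUCATION_ORDER[row["Education"]] for row in census)[len(census) // 2]

# Nominal attribute: no ordering, so no mean/median; at most we can count frequencies (the mode).
sex_counts = {}
for row in census:
    sex_counts[row["Sex"]] = sex_counts.get(row["Sex"], 0) + 1
```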

1.5 Supervised Learning Algorithms

Supervised learning is the task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (independent variables) and a desired output value (dependent variable). A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier (if the output is discrete) or a regressor (if the output is continuous).

1.5.1 Classification

Classification is defined [1] as the process of finding a set of models (or functions) that describe and distinguish data classes and concepts, with the goal being to use the model to predict the classes of objects whose class labels are unknown. Thus, classification is a supervised learning problem where the task is to predict the value of a discrete output variable given a set of training examples and a test sample, where each training example is a pair consisting of the input object and the desired class. Formally, the input consists of:

- A tuple X represented by an n-dimensional attribute vector, X = (x_1, ..., x_n), depicting n measurements made on the object corresponding to n database attributes, A_1, A_2, ..., A_n. Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class-label attribute. The class label is discrete-valued and unordered. It is categorical in that each value serves as a category or class.
- A training set of m training tuples D = {(X_1, y_1), ..., (X_m, y_m)}, where the X_i are points in X-space and the y_i are the corresponding values of the class label (also called the response variable).

As the class label of each training tuple is provided, classification is thus a supervised learning methodology. Given this data, data classification is now a two-step process.

In the first step, a classifier is built describing a pre-determined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or learning from a training set made up of database tuples and their associated class labels as described above. This first step of the classification process can also be viewed as the learning of a mapping or function y = f(X) that can predict the associated class label y of a given tuple X. Thus, in this step we wish to learn a mapping or a function that separates the data classes. This mapping depends on the family of classification algorithms that is used and may be represented in the form of rules, decision trees or separating hyperplanes.

In the second step, the model is used for classification. Before the classifier is used in actual production settings, an attempt is made to determine its accuracy on unseen real-world data. Using the training data for computing the prediction accuracy would give optimistic results, as training may incorporate some of the anomalies of the training data that are not present in the general test set. Therefore, a test set is used, made up of test tuples and their associated class labels. These tuples are randomly selected from the general dataset. They are independent of the training tuples (i.e. they are not used to construct the classifier). The reported prediction accuracy of a classifier on a given test set is then the percentage of test tuples that are correctly classified by the classifier. The classifier's output for a particular test tuple is correct if the predicted class of the test tuple matches the class label for that tuple.

Classification has been widely used in:

- Credit scoring: Define the cap-limit for a credit card based on a user's past behaviour.
- Search engines: Categorize or classify the type of query or document.
- Handwriting recognition: Receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices.
- Document categorization: Assign an electronic document to one or more categories, based on its contents.
- Speech recognition and medical image analysis and diagnosis.
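As a minimal sketch of this two-step process (a toy example of our own; the car-themed tuples and function names are illustrative, and the similarity used is a simple attribute-match fraction rather than any measure from this thesis), a nearest-neighbour classifier is built from labelled tuples and then evaluated on an independent test set:

```python
def matches(x, y):
    """Per-tuple similarity: fraction of categorical attributes on which x and y agree."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

def predict(train, test_tuple):
    """Step 2: label a test tuple with the class of its most similar training tuple (1-NN)."""
    _, best_label = max(train, key=lambda pair: matches(pair[0], test_tuple))
    return best_label

# Step 1 uses the training tuples and their class labels; the test tuples are held out.
train = [(("red", "suv"), "family"), (("red", "hatchback"), "city"), (("black", "sedan"), "luxury")]
test  = [(("black", "suv"), "family"), (("red", "sedan"), "city")]

correct = sum(1 for x, y in test if predict(train, x) == y)
accuracy = correct / len(test)  # fraction of test tuples whose predicted class matches the label
print(accuracy)
```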

1.5.2 Regression

Regression is a supervised learning algorithm similar to classification, with the only difference being that the variable to be predicted is continuous; it is called the response variable. The data regression task goes through the same procedure of training, testing and evaluation. However, the reported accuracy of the regressor is measured in terms of the absolute mean error (ABME) or the root mean square error (RMSE). Given a regressor R and an input tuple X_i, let the value of the output variable be y_i and the value output by the regressor be R(X_i). Thus, the error made by the regressor for the given sample is |R(X_i) - y_i|. If the test set consists of m test samples (X_1, ..., X_m), then the absolute mean error (ABME) is given as:

ABME = \frac{\sum_{i=1}^{m} |R(X_i) - y_i|}{m}    (1.2)

and the root mean square error (RMSE) is given as:

RMSE = \sqrt{\frac{\sum_{i=1}^{m} (R(X_i) - y_i)^2}{m}}    (1.3)

Regression algorithms have been widely used for:

- Trend lines: A trend line represents a trend or the long-term movement in time-series data.
- Finance: Analyzing and quantifying the systematic risk of an investment.
- Economics: Predicting consumption spending, fixed investment spending, inventory investment, spending on imports, the demand to hold liquid assets, labor demand, and labor supply [3, 4].
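The two error measures are straightforward to compute. The short sketch below (illustrative values and function names of our own) evaluates a toy regressor's predictions against the true outputs using Equations (1.2) and (1.3):

```python
import math

def abme(predictions, targets):
    """Absolute mean error, Equation (1.2)."""
    m = len(targets)
    return sum(abs(p - y) for p, y in zip(predictions, targets)) / m

def rmse(predictions, targets):
    """Root mean square error, Equation (1.3)."""
    m = len(targets)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, targets)) / m)

# R(X_i) produced by a toy regressor, against the true output values y_i.
predicted = [3.1, 2.8, 5.0, 4.2]
actual    = [3.0, 3.0, 4.5, 4.0]
print(abme(predicted, actual), rmse(predicted, actual))
```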

1.6 Unsupervised Learning Algorithms

Unsupervised learning is a class of problems in which one seeks to determine how the data is organized. These systems learn to represent particular input patterns in a way that reflects the structure of the overall collection of input patterns. By contrast with supervised learning, there are no explicit target outputs (class labels) or environmental evaluations associated with each input.

1.6.1 Clustering

Clustering is defined in [1] as the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another and are dissimilar to objects in other clusters. A cluster of data objects can be treated collectively as one group, and so the cluster definition may be considered a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. Thus, in real-life scenarios it is more desirable to proceed in the reverse direction: first partition the set of data into groups based on similarity, and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups. Clustering algorithms may be applied to a wide variety of data like documents, images or the standard tabular data. For the purpose of this thesis, we consider the standard tabular data. Formally, the input consists of:

- A tuple X represented by an n-dimensional attribute vector, X = (x_1, ..., x_n), depicting n measurements made on the tuple from n database attributes, respectively A_1, A_2, ..., A_n.
- A set of m tuples D = {X_1, ..., X_m}, where the X_i are points in X-space.
- Depending on the algorithm, the number of clusters to be output (an optional parameter).

The output of the clustering algorithm is a set of k clusters where each point belongs to exactly one of the k clusters. Clustering algorithms can be applied in many fields, for example:

- Marketing: Finding groups of customers with similar behavior given a large database of past buying records of customers.
- Biology: Classification of plants and animals given their features.
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; fraud detection.
- City-planning: Identifying groups of houses according to their house type, value and geographical location.
- Earthquake studies: Clustering observed earthquake epicenters to identify dangerous zones.
- Web: Document classification, and clustering web-log data to discover groups of similar access patterns.
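As a small illustration of this input/output contract (a sketch of our own; the representative tuples and the attribute-match similarity stand in for whatever similarity measure is actually used), the step below assigns every tuple to exactly one of k clusters, namely the one whose representative it is most similar to:

```python
def matches(x, y):
    """Fraction of attributes on which two categorical tuples agree."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

def assign_to_clusters(tuples, representatives):
    """One assignment step: each tuple joins the cluster of its most similar representative."""
    clusters = {i: [] for i in range(len(representatives))}
    for x in tuples:
        best = max(range(len(representatives)),
                   key=lambda i: matches(x, representatives[i]))
        clusters[best].append(x)
    return clusters

data = [("red", "suv"), ("red", "hatchback"), ("black", "sedan"), ("black", "suv")]
print(assign_to_clusters(data, representatives=[("red", "suv"), ("black", "sedan")]))
```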

1.7 Frequent Pattern Mining

Frequent patterns are patterns like itemsets, subsequences or substructures that appear in a dataset frequently. Thus, frequent pattern mining is the discovery of associations and correlations among items in large transactional and relational data sets. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining such patterns from their databases. This is because the discovery of interesting relationships in huge amounts of business transaction records helps in many decision making processes, such as customer-shopping behavior analysis, cross marketing and catalog design, to name a few. Formally, frequent itemsets and association rules are defined below.

Let I = {I_1, I_2, ..., I_m} be a set of items. Let D be a set of database transactions where each transaction T is a set of items such that T ⊆ I. Each transaction is associated with an identifier, called the TID. Let A be a set of items. A transaction T is said to contain A iff A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the fraction of transactions in D that contain A ∪ B (i.e., the union of sets A and B). This is equivalent to the probability P(A ∪ B). The rule A ⇒ B has confidence c in the transaction set D, where c is the fraction of transactions in D containing A that also contain B. This is equivalent to the conditional probability P(B|A). Summarizing,

support(A ⇒ B) = P(A ∪ B)    (1.4)

confidence(A ⇒ B) = P(B|A) = \frac{support(A ∪ B)}{support(A)}    (1.5)
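A minimal sketch of these two definitions on a toy transaction database (the example items are made up; itemsets are represented as Python sets) follows. Equation (1.4) counts the fraction of transactions containing A ∪ B, and Equation (1.5) divides it by the support of A:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset` (Equation 1.4)."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    """P(consequent | antecedent) = support(A ∪ B) / support(A) (Equation 1.5)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"diapers", "beer"}, transactions))        # 0.6
print(confidence({"diapers"}, {"beer"}, transactions))   # 0.75
```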

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. Association rule algorithms have been widely used for:

- Web mining: Web usage mining, web content mining and web structure mining [5].
- Intrusion detection.
- Bio-informatics: Sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, genome-wide association studies and the modeling of evolution [6].
- Market basket analysis: Targeted advertising, product promotion, cross-sell promotion.

1.8 Organization of Thesis

Due to the wide range of related work relevant to the research problem, we provide an overview of these works in the chapter on related work (Chapter 2). Later, in Chapter 3, we propose a similarity learning algorithm which learns and assigns a similarity value between two categorical values belonging to the same attribute. For clarity, we provide an illustration of the learning algorithm on a dummy dataset. Then, in Chapter 4, we use the concept of co-occurrence statistics proposed in Chapter 3 and define another similarity measure, DISC-O, using association rules. Then, in Chapters 5 and 6, we compare the proposed approaches against the state-of-the-art approaches surveyed in Chapter 2. In addition to this, these chapters also provide an exhaustive evaluation of the behaviour of the proposed learning algorithm by varying various input parameters. Finally, in Chapter 7, we conclude our work and provide directions for future work.

Chapter 2

Related Work

2.1 Related Work

Determining similarity measures for categorical data is a much-studied field, as there is no explicit notion of ordering among the values of categorical attributes. The study of similarity between data objects with categorical variables has had a long history. Pearson proposed a chi-square statistic in the late 1800s, which is often used to test independence between categorical variables in a contingency table. Pearson's chi-square statistic was later modified and extended, leading to several other measures [7, 8, 9]. Sneath and Sokal were among the first to put together many of the categorical similarity measures, and they discuss them in detail in their book [10] on numerical taxonomy. At the time, two major concerns were (1) biological relevance, since numerical taxonomy was mainly concerned with taxonomies from biology, ecology, etc., and (2) computational efficiency, since computational resources were limited and scarce. Nevertheless, many of the observations made by Sneath and Sokal are quite relevant today and offer key insights into many of the measures. The specific problem of clustering categorical data has been actively studied. There are several books [12, 13, 14, 15] on cluster analysis that discuss the problem of determining similarity between categorical attributes. The problem has also been studied recently in [16, 17]. However, most of these approaches do not offer solutions to the problem discussed in this thesis, and the usual recommendation is to binarize the data and then use similarity measures designed for binary attributes. Most of these measures are based on the number of identical and non-identical value pairs.

The similarity of a non-identical value pair is simply 0 and the similarity of an identical value pair is 1. Let val_set = {v_1, ..., v_l} be the value set of all attributes. A data object x is transformed into a binary vector X = (x_1, ..., x_l), x_i ∈ {0, 1}, where x_i = 1 if the object x holds the value v_i, and x_i = 0 otherwise. Denote by X and Y two binary vectors, with XY = \sum_{i=1}^{l} x_i y_i, and by X̄ the complementary vector of X, X̄ = 1 - X = [1 - x_i]. Define the following counters: a = XY, b = XȲ, c = X̄Y and d = X̄Ȳ. Obviously, a + b + c + d = l and a + b + c = D, where D is the number of attributes. Using these notions, several measures on binary vectors have been defined. A brief summary is given in Table 2.1.

Table 2.1: Similarity Measures for Binary Vectors

Measure                            Definition                          Family
Jaccard (1900)                     a / (a + b + c)                     T_θ
Sokal and Michener index (1958)    (a + d) / (a + b + c + d)           S_θ
Sokal and Sneath                   a / (a + 2(b + c))                  T_θ
Rogers and Tanimoto (1960)         (a + d) / (a + 2(b + c) + d)        S_θ
Czekanowski (1913), Dice (1945)    a / (a + (1/2)(b + c))              T_θ
Sokal and Sneath (1963)            (a + d) / (a + d + (1/2)(b + c))    S_θ
Hamann (1961)                      (a + d - b - c) / l                 -

Most of the popular practical similarity measures in the literature (Jaccard, Dice, etc.) belong to one of these two families, introduced in [18]:

T_θ = \frac{a}{a + θ(b + c)}    (2.1)

S_θ = \frac{a + d}{a + d + θ(b + c)}    (2.2)
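A short sketch of the counters a, b, c, d and of the two families follows (helper names and example vectors are our own). Instantiating T_θ with θ = 1 gives Jaccard and with θ = 1/2 gives Czekanowski/Dice, while S_θ with θ = 1 gives the Sokal and Michener index:

```python
def match_counts(x, y):
    """Counts a, b, c, d over two binary vectors of equal length l."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

def t_family(x, y, theta):
    """T_theta family of Equation (2.1): a / (a + theta * (b + c))."""
    a, b, c, _ = match_counts(x, y)
    return a / (a + theta * (b + c))

def s_family(x, y, theta):
    """S_theta family of Equation (2.2): (a + d) / (a + d + theta * (b + c))."""
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + d + theta * (b + c))

x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 0, 1, 0, 0]
jaccard        = t_family(x, y, theta=1)    # a / (a + b + c)
dice           = t_family(x, y, theta=0.5)  # Czekanowski / Dice
sokal_michener = s_family(x, y, theta=1)    # (a + d) / (a + b + c + d)
```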

Most work has been carried out on the development of clustering algorithms and not on similarity functions. Hence these works are only marginally or peripherally related to our work. Wilson and Martinez [19] performed a detailed study of heterogeneous distance functions (for categorical and continuous attributes) for instance-based learning. The measures in their study are based upon a supervised approach where each data instance has class information in addition to a set of categorical/continuous attributes. There have been a number of new data mining techniques for categorical data that have been proposed recently. Some of them use notions of similarity which are neighborhood-based [20, 21, 22, 23], or incorporate the similarity computation into the learning algorithm [24, 25]. These measures are useful for computing the neighborhood of a point, but not for calculating the similarity between a pair of data instances. Jones and Furnas [26] studied several similarity measures in the field of information retrieval. In particular, they performed a geometric analysis of continuous measures in order to reveal important differences which would affect retrieval performance. Noreault et al. [27] also studied measures in information retrieval with the goal of generalizing effectiveness based on empirically evaluating the performance of the measures. Another comparative empirical evaluation for determining similarity between fuzzy sets was performed by Zwick et al. [28], followed by several others [29, 30]. Vipin Kumar et al. recently surveyed and compared a large number of data-intensive similarity measures in [31]. In our experiments we have compared our approach with the methods discussed in [31], which provides a recent exhaustive comparison of similarity measures for categorical data.

In the remainder of this chapter, we identify the key characteristics of a categorical data set that can potentially affect the behavior of a data-driven similarity measure (Section 2.3) and describe a large number of data-intensive similarity measures from [31]. The notations for describing the similarity measures are defined in Section 2.2 and the measures themselves are described in Section 2.4.

2.2 Notations

For the sake of notation, consider a categorical data set D containing N objects, defined over a set of d categorical attributes, where A_k denotes the k-th attribute. Let the attribute A_k take n_k values in the given data set, denoted by the set A_k. We also use the following notation:

- f_k(x): The number of times attribute A_k takes the value x in the data set D. Note that if x ∉ A_k, then f_k(x) = 0.
- p_k(x): The sample probability of attribute A_k taking the value x in the data set D. The sample probability is given by:

p_k(x) = \frac{f_k(x)}{N}    (2.3)

- p_k^2(x): Another probability estimate of attribute A_k taking the value x in the given data set, given by:

p_k^2(x) = \frac{f_k(x) (f_k(x) - 1)}{N (N - 1)}    (2.4)
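This notation is easy to compute directly from a data set. The sketch below (a toy two-attribute data set of our own) derives f_k(x), p_k(x) and p_k^2(x) as in Equations (2.3) and (2.4):

```python
from collections import Counter

# Toy data set D: N objects described by d = 2 categorical attributes (Colour, Brand).
D = [
    ("red", "ferrari"), ("red", "ferrari"), ("red", "nano"),
    ("blue", "nano"), ("blue", "mercedes"), ("black", "mercedes"),
]
N = len(D)

def f_k(k, x):
    """Number of times attribute A_k takes the value x in D (0 if x never occurs)."""
    return Counter(row[k] for row in D)[x]

def p_k(k, x):
    """Sample probability of Equation (2.3): f_k(x) / N."""
    return f_k(k, x) / N

def p2_k(k, x):
    """Estimate p_k^2(x) of Equation (2.4): f_k(x)(f_k(x) - 1) / (N(N - 1))."""
    f = f_k(k, x)
    return f * (f - 1) / (N * (N - 1))

print(f_k(0, "red"), p_k(0, "red"), p2_k(0, "red"))  # 3, 0.5, 3*2/(6*5) = 0.2
```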

2.3 Characteristics of Categorical Data

Since this thesis discusses data-driven similarity measures for categorical data, a key task is to identify the characteristics of a categorical data set that affect the behavior of such a similarity measure. We enumerate the characteristics of a categorical data set below:

- Size of data, N. As we will see later, most measures are typically invariant of the size of the data, though there are some measures (e.g. Smirnov) that do make use of this information.
- Number of attributes, d. Most measures are invariant of this characteristic, since they typically normalize the similarity over the number of attributes.
- Number of values taken by each attribute, n_k. A data set might contain attributes that take several values and attributes that take very few values. For example, one attribute might take several hundred possible values, while another attribute might take very few values. A similarity measure might give more importance to the second attribute, while ignoring the first one. In fact, one of the measures discussed in this section (Eskin) behaves exactly like this.
- Distribution of f_k(x). This refers to the distribution of the frequency of values taken by an attribute in the given data set. In certain data sets an attribute might be distributed uniformly over the set A_k, while in others the distribution might be skewed. One similarity measure might give more importance to attribute values that occur rarely, while another similarity measure might give more importance to frequently occurring attribute values.

2.4 Popular Similarity Measures

In this section, we give an overview of popularly used similarity measures. It may be noted that some of the similarity measures were originally proposed as distance measures and are converted to similarity measures in order to make the measures comparable in this study. The measures discussed henceforth will all be in the context of similarity, with distance measures being converted using the formula:

sim = 1 / (1 + dist)    (2.5)

Almost all similarity measures assign a similarity value between two data instances X and Y belonging to the data set D (introduced in Section 2.2) as follows:

S(X, Y) = \sum_{k=1}^{d} w_k S_k(X_k, Y_k)    (2.6)

where S_k(X_k, Y_k) is the per-attribute similarity for categorical attribute A_k. Note that X_k, Y_k ∈ A_k. The quantity w_k denotes the weight assigned to the attribute A_k. To understand how different measures calculate the per-attribute similarity S_k(X_k, Y_k), consider a categorical attribute A which takes one of the values {a, b, c, d}. The per-attribute similarity computation is equivalent to constructing the (symmetric) matrix shown in Figure 2.1. Essentially, in determining the similarity between two values, any categorical measure is filling the entries of this matrix. For example, the overlap measure sets the diagonal entries to 1 and the off-diagonal entries to 0, i.e., the similarity is 1 if the values match and 0 if the values mismatch.

Figure 2.1: Similarity Matrix for a Single Categorical Attribute

Additionally, measures may use the following information in computing a similarity value:

- f(a), f(b), f(c), f(d), the frequencies of the values in the data set;
- N, the size of the data set;
- n, the number of values taken by the attribute (4 in the case above);
- f(A_k = a, A_k' = a'), the number of instances for which attribute A_k takes the value a and attribute A_k' takes the value a', thus representing the co-occurrence statistics.

We can classify measures in several ways, based on: (i) the manner in which they fill the entries of the similarity matrix, and (ii) the arguments used to propose the measure (probabilistic, information-theoretic, etc.). In this section, we will describe the measures by classifying them as follows:

- Those that fill the diagonal entries only. These are measures that set the off-diagonal entries to 0 (mismatches are uniformly given the minimum value) and give possibly different weights to matches.
- Those that fill the off-diagonal entries only. These measures set the diagonal entries to 1 (matches are uniformly given the maximum value) and give possibly different weights to mismatches.
- Those that fill both diagonal and off-diagonal entries. These measures give different weights to both matches and mismatches.

Table 2.2 gives the mathematical formulas for the measures we will be describing in this section. The various measures described in Table 2.2 compute the per-attribute similarity S_k(X_k, Y_k) as shown in column 2 and compute the attribute weight w_k as shown in column 3.
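The sketch below (illustrative code of our own) fills the Figure 2.1 matrix for the overlap measure and evaluates Equation (2.6) for two toy instances with equal attribute weights w_k = 1/d:

```python
def overlap_entry(u, v):
    """Overlap measure: diagonal entries of the Figure 2.1 matrix are 1, off-diagonal are 0."""
    return 1.0 if u == v else 0.0

def similarity_matrix(values, per_attribute_sim):
    """Fill the (symmetric) per-attribute similarity matrix for one categorical attribute."""
    return {(u, v): per_attribute_sim(u, v) for u in values for v in values}

def S(X, Y, per_attribute_sims, weights):
    """Equation (2.6): S(X, Y) = sum_k w_k * S_k(X_k, Y_k)."""
    return sum(w * sim(x_k, y_k)
               for w, sim, x_k, y_k in zip(weights, per_attribute_sims, X, Y))

# Attribute A takes the values {a, b, c, d}; with overlap, only the diagonal is non-zero.
matrix = similarity_matrix(["a", "b", "c", "d"], overlap_entry)

# Two instances over d = 3 attributes, each attribute weighted equally (w_k = 1/d).
X, Y = ("a", "red", "suv"), ("a", "blue", "suv")
print(S(X, Y, [overlap_entry] * 3, [1 / 3] * 3))  # 2/3: two of the three attributes match
```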

2.4.1 Measures that fill Diagonal Entries only

Overlap: The overlap measure simply counts the number of attributes that match in the two data instances. The range of per-attribute similarity for the overlap measure is [0, 1], with a value of 0 occurring when there is no match, and a value of 1 occurring when the attribute values match.

Goodall1: Goodall [32] proposed a measure that attempts to normalize the similarity between two objects by the probability that the observed similarity value could be observed in a random sample of two points. This measure assigns higher similarity to a match if the value is infrequent than if the value is frequent. The range of S_k(X_k, Y_k) for matches in the Goodall1 measure is [0, 1 - 2/(N(N-1))], with the minimum being attained when attribute A_k takes only one value, and the maximum being attained when the value X_k occurs twice, while all other possible values of A_k occur more than twice. The authors of [31] also propose three other variants of Goodall's measure: Goodall2, Goodall3 and Goodall4.

Goodall2: The Goodall2 measure is a variant of Goodall's measure proposed by the authors of [31]. This measure assigns higher similarity if the matching values are infrequent, and at the same time there are other values that are even less frequent, i.e., the similarity is higher if there are many values with approximately equal frequencies, and lower if the frequency distribution is skewed. The range of S_k(X_k, Y_k) for matches in the Goodall2 measure is [0, 1 - 2/(N(N-1))], with the minimum value being attained if attribute A_k takes only one value, and the maximum being attained when the value X_k occurs twice, while all other possible values of A_k occur only once each.

Goodall3: The Goodall3 measure assigns a high similarity if the matching values are infrequent, regardless of the frequencies of the other values. The range of S_k(X_k, Y_k) for matches in the Goodall3 measure is [0, 1 - 2/(N(N-1))], with the minimum value being attained if X_k is the only value for attribute A_k and the maximum value being attained if X_k occurs only twice.

Goodall4: The Goodall4 measure assigns similarity 1 - Goodall3 for matches. The range of S_k(X_k, Y_k) for matches in the Goodall4 measure is [2/(N(N-1)), 1], with the minimum value being attained if X_k occurs only twice, and the maximum value being attained if X_k is the only value for attribute A_k.

Gambaryan: Gambaryan proposed a measure [33] that gives more weight to matches where the matching value occurs in about half the data set, i.e., in between being frequent and rare. The Gambaryan measure for a single attribute match is closely related to the Shannon entropy from information theory, as can be seen from its formula in Table 2.2. The range of S_k(X_k, Y_k) for matches in the Gambaryan measure is [0, 1], with the minimum value being attained if X_k is the only value for attribute A_k and the maximum value being attained when X_k has frequency N/2.

2.4.2 Measures that fill Off-Diagonal Entries only

Eskin: Eskin et al. [34] proposed a normalization kernel for record-based network intrusion detection data. The original measure is distance-based and assigns a weight of 2/n_k^2 for mismatches; when adapted to similarity, this becomes a weight of n_k^2 / (n_k^2 + 2). This measure gives more weight to mismatches that occur on attributes that take many values. The range of S_k(X_k, Y_k) for mismatches in the Eskin measure is [2/3, N^2/(N^2 + 2)], with the minimum value being attained when the attribute A_k takes only two values, and the maximum value being attained when the attribute has all unique values.
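As a concrete illustration of two of these per-attribute similarities (a sketch of our own, with the formulas written to match the descriptions and ranges above and the survey in [31]; the toy attribute column is made up), Eskin uses only n_k while Goodall3 uses the frequency-based estimate p_k^2 from Section 2.2:

```python
from collections import Counter

# Values of one categorical attribute A_k across the N objects of a toy data set.
column = ["red", "red", "red", "blue", "green", "green"]
N = len(column)
freq = Counter(column)     # f_k(x)
n_k = len(freq)            # number of distinct values taken by A_k

def eskin(x, y):
    """Eskin: 1 for a match, n_k^2 / (n_k^2 + 2) for a mismatch."""
    return 1.0 if x == y else n_k ** 2 / (n_k ** 2 + 2)

def goodall3(x, y):
    """Goodall3: 1 - p_k^2(x) for a match (higher when the value is infrequent), 0 for a mismatch."""
    if x != y:
        return 0.0
    f = freq[x]
    return 1.0 - f * (f - 1) / (N * (N - 1))

print(eskin("red", "blue"))        # 9/11: mismatch weight for n_k = 3
print(goodall3("green", "green"))  # 1 - 2/(N(N-1)): "green" occurs only twice, the maximum for matches
print(goodall3("red", "red"))      # 1 - 6/30 = 0.8: "red" is frequent, so the match is worth less
```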

Inverse Occurrence Frequency (IOF): The inverse occurrence frequency measure assigns lower similarity to mismatches on more frequent values. The IOF measure is related to the concept of inverse document frequency, which comes from information retrieval [35], where it is used to signify the relative number of documents that contain a specific word. A key difference is that inverse document frequency is computed on a term-document matrix which is usually binary, while the IOF measure is defined for categorical data. The range of S_k(X_k, Y_k) for mismatches in the IOF measure is [1/(1 + (log(N/2))^2), 1], with the minimum value being attained when X_k and Y_k each occur N/2 times (i.e., these are the only two values), and the maximum value being attained when X_k and Y_k occur only once in the data set.

Occurrence Frequency (OF): The occurrence frequency measure gives the opposite weighting of the IOF measure for mismatches, i.e., mismatches on less frequent values are assigned lower similarity and mismatches on more frequent values are assigned higher similarity. The range of S_k(X_k, Y_k) for mismatches in the OF measure is [1/(1 + (log N)^2), 1/(1 + (log 2)^2)], with the minimum value being attained when X_k and Y_k occur only once in the data set, and the maximum value being attained when X_k and Y_k occur N/2 times.

Burnaby: Burnaby [36] proposed a similarity measure using arguments from information theory. He argues that the set of observed values is like a group of signals conveying information and, as in information theory, attribute values that are rarely observed should be considered more informative. In [36], Burnaby proposed information-weighted measures for binary, ordinal, categorical and continuous data. The measure we present in Table 2.2 is adapted from Burnaby's categorical measure. This measure assigns low similarity to mismatches on rare values and high similarity to mismatches on frequent values. The range of S_k(X_k, Y_k) for mismatches in the Burnaby measure is [N log(1 - 1/N) / (N log(1 - 1/N) - log(N - 1)), 1], with the minimum value being attained when all values for attribute A_k occur only once, and the maximum value being attained when X_k and Y_k each occur N/2 times.

2.4.3 Measures that fill both Diagonal and Off-Diagonal Entries

Lin: In [37], Lin describes an information-theoretic framework for similarity, where he argues that when similarity is thought of in terms of assumptions about the space, the similarity measure naturally follows from the assumptions. Lin [37] discusses the


More information

More on Unsupervised Learning

More on Unsupervised Learning More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data

More information

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018 Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model

More information

Data Analytics Beyond OLAP. Prof. Yanlei Diao

Data Analytics Beyond OLAP. Prof. Yanlei Diao Data Analytics Beyond OLAP Prof. Yanlei Diao OPERATIONAL DBs DB 1 DB 2 DB 3 EXTRACT TRANSFORM LOAD (ETL) METADATA STORE DATA WAREHOUSE SUPPORTS OLAP DATA MINING INTERACTIVE DATA EXPLORATION Overview of

More information

Spatial Co-location Patterns Mining

Spatial Co-location Patterns Mining Spatial Co-location Patterns Mining Ruhi Nehri Dept. of Computer Science and Engineering. Government College of Engineering, Aurangabad, Maharashtra, India. Meghana Nagori Dept. of Computer Science and

More information

CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION

CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION 4.1 Overview This chapter contains the description about the data that is used in this research. In this research time series data is used. A time

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Stat 406: Algorithms for classification and prediction. Lecture 1: Introduction. Kevin Murphy. Mon 7 January,

Stat 406: Algorithms for classification and prediction. Lecture 1: Introduction. Kevin Murphy. Mon 7 January, 1 Stat 406: Algorithms for classification and prediction Lecture 1: Introduction Kevin Murphy Mon 7 January, 2008 1 1 Slides last updated on January 7, 2008 Outline 2 Administrivia Some basic definitions.

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Mining Frequent Patterns and Associations: Basic Concepts (Chapter 6) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan

More information

CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2014 Solutions Exam 2 - December 15, 2014

CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2014 Solutions Exam 2 - December 15, 2014 CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2014 Solutions Exam 2 - December 15, 2014 Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute NAME: Prof.

More information

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05 Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Sample vs population Example task with red and black cards Statistical terminology Permutation test and hypergeometric test Histogram on a sample vs population

More information

Foundations of Machine Learning

Foundations of Machine Learning Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about

More information

Spazi vettoriali e misure di similaritá

Spazi vettoriali e misure di similaritá Spazi vettoriali e misure di similaritá R. Basili Corso di Web Mining e Retrieval a.a. 2009-10 March 25, 2010 Outline Outline Spazi vettoriali a valori reali Operazioni tra vettori Indipendenza Lineare

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

A Large Deviation Bound for the Area Under the ROC Curve

A Large Deviation Bound for the Area Under the ROC Curve A Large Deviation Bound for the Area Under the ROC Curve Shivani Agarwal, Thore Graepel, Ralf Herbrich and Dan Roth Dept. of Computer Science University of Illinois Urbana, IL 680, USA {sagarwal,danr}@cs.uiuc.edu

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

ANÁLISE DOS DADOS. Daniela Barreiro Claro

ANÁLISE DOS DADOS. Daniela Barreiro Claro ANÁLISE DOS DADOS Daniela Barreiro Claro Outline Data types Graphical Analysis Proimity measures Prof. Daniela Barreiro Claro Types of Data Sets Record Ordered Relational records Video data: sequence of

More information

Mining Infrequent Patter ns

Mining Infrequent Patter ns Mining Infrequent Patter ns JOHAN BJARNLE (JOHBJ551) PETER ZHU (PETZH912) LINKÖPING UNIVERSITY, 2009 TNM033 DATA MINING Contents 1 Introduction... 2 2 Techniques... 3 2.1 Negative Patterns... 3 2.2 Negative

More information

Mining Positive and Negative Fuzzy Association Rules

Mining Positive and Negative Fuzzy Association Rules Mining Positive and Negative Fuzzy Association Rules Peng Yan 1, Guoqing Chen 1, Chris Cornelis 2, Martine De Cock 2, and Etienne Kerre 2 1 School of Economics and Management, Tsinghua University, Beijing

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

Classification and Pattern Recognition

Classification and Pattern Recognition Classification and Pattern Recognition Léon Bottou NEC Labs America COS 424 2/23/2010 The machine learning mix and match Goals Representation Capacity Control Operational Considerations Computational Considerations

More information

MASTER. Anomaly detection on event logs an unsupervised algorithm on ixr-messages. Severins, J.D. Award date: Link to publication

MASTER. Anomaly detection on event logs an unsupervised algorithm on ixr-messages. Severins, J.D. Award date: Link to publication MASTER Anomaly detection on event logs an unsupervised algorithm on ixr-messages Severins, J.D. Award date: 2016 Link to publication Disclaimer This document contains a student thesis (bachelor's or master's),

More information

Chap 2: Classical models for information retrieval

Chap 2: Classical models for information retrieval Chap 2: Classical models for information retrieval Jean-Pierre Chevallet & Philippe Mulhem LIG-MRIM Sept 2016 Jean-Pierre Chevallet & Philippe Mulhem Models of IR 1 / 81 Outline Basic IR Models 1 Basic

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Prediction of Citations for Academic Papers

Prediction of Citations for Academic Papers 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Machine Learning Linear Models

Machine Learning Linear Models Machine Learning Linear Models Outline II - Linear Models 1. Linear Regression (a) Linear regression: History (b) Linear regression with Least Squares (c) Matrix representation and Normal Equation Method

More information

Pattern Structures 1

Pattern Structures 1 Pattern Structures 1 Pattern Structures Models describe whole or a large part of the data Pattern characterizes some local aspect of the data Pattern is a predicate that returns true for those objects

More information

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu From statistics to data science BAE 815 (Fall 2017) Dr. Zifei Liu Zifeiliu@ksu.edu Why? How? What? How much? How many? Individual facts (quantities, characters, or symbols) The Data-Information-Knowledge-Wisdom

More information

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data Elena Baralis and Tania Cerquitelli Politecnico di Torino Data set types Record Tables Document Data Transaction Data Graph World Wide Web Molecular Structures Ordered Spatial Data Temporal Data Sequential

More information

Data mining, 4 cu Lecture 5:

Data mining, 4 cu Lecture 5: 582364 Data mining, 4 cu Lecture 5: Evaluation of Association Patterns Spring 2010 Lecturer: Juho Rousu Teaching assistant: Taru Itäpelto Evaluation of Association Patterns Association rule algorithms

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

8. Classification and Pattern Recognition

8. Classification and Pattern Recognition 8. Classification and Pattern Recognition 1 Introduction: Classification is arranging things by class or category. Pattern recognition involves identification of objects. Pattern recognition can also be

More information

Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context.

Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context. Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context. Murphy Choy Cally Claire Ong Michelle Cheong Abstract The rapid explosion in retail data calls for more effective

More information

Collaborative topic models: motivations cont

Collaborative topic models: motivations cont Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.

More information

Data Mining 4. Cluster Analysis

Data Mining 4. Cluster Analysis Data Mining 4. Cluster Analysis 4.2 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Data Structures Interval-Valued (Numeric) Variables Binary Variables Categorical Variables Ordinal Variables Variables

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 4: Probabilistic Retrieval Models April 29, 2010 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig

More information

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata Principles of Pattern Recognition C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata e-mail: murthy@isical.ac.in Pattern Recognition Measurement Space > Feature Space >Decision

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining Chapter 6. Frequent Pattern Mining: Concepts and Apriori Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining Pattern Discovery: Definition What are patterns? Patterns: A set of

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Encyclopedia of Machine Learning Chapter Number Book CopyRight - Year 2010 Frequent Pattern. Given Name Hannu Family Name Toivonen

Encyclopedia of Machine Learning Chapter Number Book CopyRight - Year 2010 Frequent Pattern. Given Name Hannu Family Name Toivonen Book Title Encyclopedia of Machine Learning Chapter Number 00403 Book CopyRight - Year 2010 Title Frequent Pattern Author Particle Given Name Hannu Family Name Toivonen Suffix Email hannu.toivonen@cs.helsinki.fi

More information

Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data

Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional -Series Data Xiaolei Li, Jiawei Han University of Illinois at Urbana-Champaign VLDB 2007 1 Series Data Many applications produce time series

More information

Mining Emerging Substrings

Mining Emerging Substrings Mining Emerging Substrings Sarah Chan Ben Kao C.L. Yip Michael Tang Department of Computer Science and Information Systems The University of Hong Kong {wyschan, kao, clyip, fmtang}@csis.hku.hk Abstract.

More information

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly

More information

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Chapter Fifteen Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Copyright 2010 Pearson Education, Inc. publishing as Prentice Hall 15-1 Internet Usage Data Table 15.1 Respondent Sex Familiarity

More information

An Introduction to Machine Learning

An Introduction to Machine Learning An Introduction to Machine Learning L2: Instance Based Estimation Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune, January

More information

Classification of Ordinal Data Using Neural Networks

Classification of Ordinal Data Using Neural Networks Classification of Ordinal Data Using Neural Networks Joaquim Pinto da Costa and Jaime S. Cardoso 2 Faculdade Ciências Universidade Porto, Porto, Portugal jpcosta@fc.up.pt 2 Faculdade Engenharia Universidade

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning 10-715 Advanced Introduction to Machine Learning Homework Due Oct 15, 10.30 am Rules Please follow these guidelines. Failure to do so, will result in loss of credit. 1. Homework is due on the due date

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

More information

Exploring Spatial Relationships for Knowledge Discovery in Spatial Data

Exploring Spatial Relationships for Knowledge Discovery in Spatial Data 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Exploring Spatial Relationships for Knowledge Discovery in Spatial Norazwin Buang

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

Correlation Preserving Unsupervised Discretization. Outline

Correlation Preserving Unsupervised Discretization. Outline Correlation Preserving Unsupervised Discretization Jee Vang Outline Paper References What is discretization? Motivation Principal Component Analysis (PCA) Association Mining Correlation Preserving Discretization

More information

Journal of Chemical and Pharmaceutical Research, 2014, 6(5): Research Article

Journal of Chemical and Pharmaceutical Research, 2014, 6(5): Research Article Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(5):266-270 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Anomaly detection of cigarette sales using ARIMA

More information

Leverage Sparse Information in Predictive Modeling

Leverage Sparse Information in Predictive Modeling Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from

More information

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1 Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger

More information

CS5112: Algorithms and Data Structures for Applications

CS5112: Algorithms and Data Structures for Applications CS5112: Algorithms and Data Structures for Applications Lecture 19: Association rules Ramin Zabih Some content from: Wikipedia/Google image search; Harrington; J. Leskovec, A. Rajaraman, J. Ullman: Mining

More information

COMP 5331: Knowledge Discovery and Data Mining

COMP 5331: Knowledge Discovery and Data Mining COMP 5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Tan, Steinbach, Kumar And Jiawei Han, Micheline Kamber, and Jian Pei 1 10

More information