Data-Intensive Similarity Measures for Categorical Data


Data-Intensive Similarity Measures for Categorical Data

Thesis submitted in partial fulfillment of the requirements for the degree of MS by Research in Computer Science Engineering

by

Desai Aditya Makarand

Center For Data Engineering
International Institute of Information Technology
Hyderabad, INDIA
April, 2011

Copyright © Desai Aditya Makarand, 2011
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled Data-Intensive Similarity Measure for Categorical Data by Desai Aditya Makarand, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                        Advisor: Dr. Vikram Pudi

Dedicated to my father Makarand Desai and my mother Smita Desai

Acknowledgements

Firstly, I would like to thank my advisor Dr. Vikram Pudi for his constant motivation and guidance. He always encouraged us to come up with new ideas and methodologies and guided us back to the right path whenever the results were not as expected. In addition, a special thanks to Prof. Kamalakar Karlapalem for instilling in me the value of hard work. I would also like to thank Himanshu Singh, with whom I worked towards my MS thesis. He has been a great friend, and brainstorming sessions with him have always led to new ideas. Also, a special thanks to my batch-mates and my colleagues at CDE who have made my stay at IIIT-H a memorable one. Finally and most importantly, I would like to thank my parents Makarand Desai and Smita Desai for their unconditional love and support. They have been there through ups and downs and have motivated and encouraged me throughout. Also, I would like to thank my entire family for always encouraging me and for instilling positivity in me at all times.

Abstract

The concept of similarity is fundamentally important in almost every scientific field. Clustering, distance-based outlier detection, classification, regression and search are major data mining techniques which compute the similarities between instances, and hence the choice of a particular similarity measure can turn out to be a major cause of success or failure of the algorithm. The notion of similarity or distance for categorical data is not as straightforward as for continuous data and hence is a major challenge. This is due to the fact that different values taken by a categorical attribute are not inherently ordered, and hence a notion of direct comparison between two categorical values is not possible. In addition, the notion of similarity can differ depending on the particular domain, dataset, or task at hand. Ideally, the similarity notion is defined by a domain expert who understands the domain concepts well. However, in recent years, due to advances in computers and data capture technology, huge datasets containing gigabytes or even terabytes of data have been collected. In many of these applications domain expertise is not available and the users don't understand the interconnections between objects well enough to formulate exact definitions of similarity or distance. Thus, it becomes necessary to analyze and extract knowledge (sometimes hidden) from data using machine-oriented or automatic methods. In this thesis we present two new similarity measures for categorical data: DISC (Data-Intensive Similarity Measure for Categorical Data) and DISC-O (Data-Intensive Similarity Measure for Categorical Data using association rules). Both DISC and DISC-O capture the semantics of the data without any help from domain experts for defining similarity. In addition to this, they are generic and simple to implement. These desirable features make DISC and DISC-O attractive alternatives to existing approaches.

Our experimental study compares them with 15 other similarity measures on a number of standard real datasets, used for classification, clustering and regression, and shows that the proposed algorithms are significantly more accurate than all of their competitors.

Contents

1 Introduction
   1.1 Problem Formulation
   1.2 Key Contributions
   1.3 Background
   1.4 Data
   1.5 Supervised Learning Algorithms
      1.5.1 Classification
      1.5.2 Regression
   1.6 Unsupervised Learning Algorithms
      1.6.1 Clustering
   1.7 Frequent Pattern Mining
   1.8 Organization of Thesis
2 Related Work
   2.1 Related Work
   2.2 Notations
   2.3 Characteristics of Categorical Data
   2.4 Popular Similarity Measures
      Measures that fill Diagonal Entries only
      Measures that fill Off-Diagonal Entries only
      Measures that fill both Diagonal and Off-Diagonal Entries
      Other classifications for similarity measures
3 Similarity Measure
   Motivation and Design
   Data Structure Description
   Algorithm Overview
   DISC Computation
   Illustration
   Validity of Similarity Measure
   Computational Complexity

4 An Association Rule based Similarity Measure
   Motivation
   Interestingness Measures for Association Rules
      Subjective v/s Objective Measures
      Symmetric v/s Asymmetric Measures
   DISC-O Algorithm
   Computational Complexity of DISC-O
   Illustration
      Similarity between Red and Blue
      Similarity between Ferrari and Mercedes
      Similarity between Ferrari and Nano
   Inference
5 Experimental Results of DISC
   Pre-Processing and Experimental Settings
   Experimental Results
   Discussion of Results
   Performance on Cricinfo Dataset
      Approach
      Observation
   Extension to a 2nd level of clustering
      Observation
      Observation
6 Experimental Results of DISC-O
   Performance Model
   Experimental Results
   Discussion of Results
   Selection of Parameters φ, ψ
7 Conclusions and Future Work
   Contributions
   Future Work
Bibliography

List of Figures

2.1 Similarity Matrix for a Single Categorical Attribute
Relation of per-attribute similarity to data characteristics

List of Tables

2.1 Similarity Measures for Binary Vectors
Similarity Measures for categorical attributes
Illustration
Cosine Similarity computation between v_ij, v_ik
CI Table corresponding to Brand
Representative points for attribute values corresponding to Brand
Initialization for Brand
Initialization for Color
Similarity Matrix for Brand
Similarity Matrix for Color
A 2x2 Contingency Table for variables A and B
A 2x2 Contingency Table for variables A and B
Datasets for Classification
Datasets for Regression
Accuracy for k-NN with k =
Accuracy for k-NN with k =
Accuracy for k-NN with k =
ABME/RMSE for k-NN with k =
ABME/RMSE for k-NN with k =
ABME/RMSE for k-NN with k =
ABME/RMSE for k-NN with k =
Interestingness Measures for Association Patterns
Clustering accuracy of DISC-O (using Rand-Index) for k-means with number of clusters = number of classes
Clustering accuracy of DISC-O (using Rand-Index) for k-means with number of clusters = number of classes
Clustering accuracy of competitors (using Rand-Index) for k-means with number of clusters = number of classes

6.5 Difference between best and worst performance by varying ψ = {cosine, jaccard, dice}
Classification accuracy of competitors for k-NN with number of neighbours =
Classification accuracy of competitors for k-NN with number of neighbours =
Classification accuracy of DISC-O for k-NN with number of neighbours =
Classification accuracy of DISC-O for k-NN with number of neighbours =
Difference between best and worst performance by varying ψ = {cosine, jaccard, dice}

Chapter 1

Introduction

The concept of similarity is fundamentally important in almost every scientific field. This concept plays a major role in the field of data mining and in knowledge discovery tasks involving distance computation. Clustering, distance-based outlier detection, classification and regression are major data mining techniques which compute the similarities between instances, and hence the choice of a particular similarity measure can turn out to be a major cause of success or failure of the algorithm. For these tasks, the choice of a similarity measure can be as important as the choice of data representation or feature selection. The notion of a similarity measure for continuous and ordinal data is comparatively straightforward due to the inherent ordering information. There are a number of similarity measures for continuous data. The Minkowski distance and its special case, the Euclidean distance, are the two most widely used distance measures for continuous data. For ordinal variables, a general solution is to map ordinal values to numbers based on the ordering and then apply techniques associated with quantitative variables. The notion of similarity or distance for categorical data is not as straightforward as for continuous data and hence is a major challenge. This is due to the fact that the different values that a categorical attribute takes are not inherently ordered, and hence a notion of direct comparison between two categorical values is not possible. In addition, the notion of similarity can differ depending on the particular domain, dataset, or task at hand. Although there is no inherent ordering in categorical data, there are other factors, like co-occurrence statistics, that can be used to define what should be considered more similar and vice versa.

This observation has motivated researchers to come up with data-driven similarity measures for categorical attributes. Such measures take into account the frequency distribution of individual attribute values in a given data set, but most of these algorithms fail to capture any feature of the dataset apart from the frequency distribution of the different attribute values. One solution to the problem is to build a common repository of similarity measures for all commonly occurring concepts. As an example, suppose the similarity values for the concept colour are to be determined. Now, consider three colours: red, pink and black. Consider two domains as follows:

Domain 1: Suppose the domain is determining the response of the cones of the eye to colour. It is then obvious that the cones respond to red and pink far more similarly than to black. Hence the similarity between red and pink must be high compared to the similarity between red and black or pink and black.

Domain 2: Consider another domain, for example car sales data. In such data, it may be known that pink cars are extremely rare as compared to red and black cars, and hence the similarity between red and black must be larger than that between red and pink or black and pink in this case.

Ideally, the similarity notion is defined by a domain expert who understands the domain concepts well. However, in many applications domain expertise is not available and the users don't understand the interconnections between objects well enough to formulate exact definitions of similarity or distance. In the absence of domain expertise it is conceptually very hard to come up with a domain-independent solution for similarity. Thus, the notion of similarity varies from one domain to another, and hence the assignment of similarity must involve a thorough understanding of the domain. This makes it necessary to define a similarity measure based on latent knowledge available from the data instead of a fit-to-all measure, and this is the major motivation for this thesis. In this thesis we present two new similarity measures for categorical data:

1. DISC (Data-Intensive Similarity Measure for Categorical Data) and

2. DISC-O (Data-Intensive Similarity Measure for Categorical Data using AssOciation Rules).

They capture the semantics of the data without any help from domain experts for defining similarity. They achieve this by capturing the relationships that are inherent in the data itself. In addition, both DISC and DISC-O are generic and simple to implement.

The rest of this chapter proceeds as follows. Initially, we present a mathematical formulation of what constitutes a valid similarity measure (Section 1.1) and the key contributions of this thesis (Section 1.2). We then describe the background behind the problem (Section 1.3). Later, we give a high-level overview of the general nature of data in Section 1.4. This is followed by the description of supervised data mining algorithms like classification (Section 1.5.1) and regression (Section 1.5.2), and unsupervised data mining algorithms like clustering (Section 1.6.1). We also briefly introduce frequent pattern mining (Section 1.7), which forms the basis of the algorithm presented later in this thesis. We finally describe the organization of the thesis in Section 1.8.

1.1 Problem Formulation

In this section we discuss the conditions for a similarity measure to be valid. Later, in Chapter 3, we describe how DISC satisfies these requirements and prove the validity of our algorithm. The conditions for a valid similarity measure are now defined as extensions of the conditions on a distance measure, using the similarity-distance mapping in Equation 1.1:

Sim = 1 / (1 + dist)    (1.1)

The following conditions need to hold for a distance metric d to be valid, where d(x, y) is the distance between x and y:

1. d(x, y) >= 0
2. d(x, y) = 0 if and only if x = y
3. d(x, y) = d(y, x)
4. d(x, z) <= d(x, y) + d(y, z)

We come up with the following conditions for a valid similarity measure, based on the definitions of the distance space and using the distance-similarity mapping, where Sim(x, y) is the similarity between x and y:

1. 0 <= Sim(x, y) <= 1
2. Sim(x, y) = 1 if and only if x = y
3. Sim(x, y) = Sim(y, x)
4. 1/Sim(x, y) + 1/Sim(y, z) >= 1 + 1/Sim(x, z)
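As an informal illustration of the mapping in Equation 1.1, the following short sketch converts Euclidean distances into similarities and checks the four conditions numerically on a few points. The helper names and the toy points are purely illustrative and are not part of the thesis algorithms.

```python
import itertools
import math

def euclidean(x, y):
    """A valid distance metric on numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def to_similarity(dist):
    """The similarity-distance mapping of Equation 1.1: Sim = 1 / (1 + dist)."""
    return 1.0 / (1.0 + dist)

points = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
for x, y, z in itertools.permutations(points, 3):
    s_xy, s_yz, s_xz = (to_similarity(euclidean(u, v)) for u, v in ((x, y), (y, z), (x, z)))
    assert 0.0 <= s_xy <= 1.0                           # condition 1
    assert (s_xy == 1.0) == (x == y)                    # condition 2
    assert s_xy == to_similarity(euclidean(y, x))       # condition 3
    assert 1 / s_xy + 1 / s_yz >= 1 + 1 / s_xz - 1e-12  # condition 4, from the triangle inequality
```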

1.2 Key Contributions

The key contributions of this thesis can be summarized as follows:

- Introducing a notion of similarity between two values of a categorical attribute based on co-occurrence statistics.
- Defining two valid similarity measures capturing such a notion, which can be used out-of-the-box for any generic domain.
- Experimentally validating that such similarity measures provide a significant improvement in accuracy when applied to classification, clustering and regression on a wide array of dataset domains.

The experimental validation is especially significant since it demonstrates a reasonably large improvement in accuracy by changing only the similarity measure while keeping the algorithm and its parameters constant.

1.3 Background

Due to advances in computers and data capture technology, huge datasets containing gigabytes or even terabytes of data have been collected. The latest estimates put the size of the data stored at 1.2 million petabytes in 2010, up from 80,000 petabytes only a few years earlier. The storage of data is predominantly due to the belief that the more data we have, the better a representation we have of the problem at hand, and that the analysis of this data should then give some insightful results. However, due to the sheer size of the available data, using traditional database systems to support the decision support process becomes infeasible, and hence it becomes necessary to analyze and extract knowledge (sometimes hidden) from data using machine-oriented or automatic methods. The trick is to extract the valuable information from the surrounding mass of uninteresting numbers, so that data owners can effectively capitalize on the interesting aspects. The field of data mining has emerged as a result, to solve this problem. Different texts propose different definitions of data mining, prominent among them being the definition by Han and Kamber [1], which says: "Data mining, also popularly referred to as knowledge discovery in databases (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories." The other definition, by Witten and Frank [2], describes data mining as follows: "It is the extraction of implicit, previously unknown, and potentially useful information from data." Another definition of data mining is: "The application of computer technology and machine learning algorithms to discover patterns, anomalies, trends, and knowledge from data." Thus, in short, data mining can be defined as the field of developing automated systems which are responsible for processing large volumes of information and extracting previously unknown interesting information. However, as data mining is applied to new and previously unexplored fields, there exists a need to understand the underlying nature and structure of the data. In this thesis we propose two data-intensive similarity measures for categorical data. Once the structure and nature of the underlying data is known, suitable data mining algorithms like clustering, classification, regression or anomaly detection can be applied as the situation demands.

1.4 Data

As stated previously, the aim of data mining is to discover interesting relationships that exist in the real world. Data pertaining to these relationships is collected by mapping entities in the domain of interest to a symbolic representation by means of some mapping procedure, which associates the value of a variable with a given property of an entity. A dataset is a set of measurements taken from some environment or domain. A typical dataset consists of a collection of objects, and for each object there are d measurements. The entire collection of measurements on n objects is represented by an n x d matrix, which is the dataset to be mined. Each row may be referred to as an entity, case, object, record or instance depending on the context of use, and each column may be referred to as a variable, feature, attribute or field depending on the context. Consider the census scenario which makes the following measurements: ID, Sex, Marital Status, Education and Income. However, there are certain distinctions between the kinds of values that a feature can take. The two classes of attributes are quantitative and qualitative attributes. A quantitative variable is measured on a numerical scale and in principle can take any value. From the above example, ID and Income are examples of quantitative attributes. Qualitative attributes, on the other hand, can take only certain discrete values defined by the domain and hence differ from quantitative attributes. For example, the variable Sex can only take the values Male/Female. Qualitative variables can be further divided into nominal and ordinal. Nominal variables take a fixed discrete set of values such that there is no ordering among the values. For example, for the variable Sex, the values it can take are Male and Female, with no inherent ordering among them. On the other hand, if the variable Education corresponds to the highest degree taken, with the values being primary school, high school, bachelors, masters and doctorate, the values taken by the variable have an inherent ordering and the corresponding variable is called an ordinal variable. It may be noted that while mean, median and mode can be calculated for quantitative and ordinal data, there is no notion of mean/median for nominal data.
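The following small sketch (an illustrative example of our own; the records and values are made up) shows a census-style dataset as a collection of objects with quantitative, ordinal and nominal attributes, and why a mean or median is meaningful only for some of them:

```python
# Toy census-style objects: each row is an object, each key an attribute.
census = [
    {"ID": 1, "Sex": "Male",   "Marital Status": "Single",  "Education": "bachelors", "Income": 42000},
    {"ID": 2, "Sex": "Female", "Marital Status": "Married", "Education": "masters",   "Income": 58000},
    {"ID": 3, "Sex": "Female", "Marital Status": "Single",  "Education": "doctorate", "Income": 61000},
]

# Quantitative attribute: a mean is meaningful.
mean_income = sum(row["Income"] for row in census) / len(census)

# Ordinal attribute: values can be mapped to numbers because they carry an inherent ordering.
EDUCATION_ORDER = {"primary school": 0, "high school": 1, "bachelors": 2, "masters": 3, "doctorate": 4}
median_education_rank = sorted(EDUCATION_ORDER[row["Education"]] for row in census)[len(census) // 2]

# Nominal attribute: no ordering, so no mean/median; at most we can count frequencies (the mode).
sex_counts = {}
for row in census:
    sex_counts[row["Sex"]] = sex_counts.get(row["Sex"], 0) + 1
```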

1.5 Supervised Learning Algorithms

Supervised learning is the task of inferring a function from supervised training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (independent variables) and a desired output value (dependent variable). A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier (if the output is discrete) or a regressor (if the output is continuous).

1.5.1 Classification

Classification is defined [1] as the process of finding a set of models (or functions) that describe and distinguish data classes and concepts, with the goal being to use the model to predict the classes of objects whose class labels are unknown. Thus, classification is a supervised learning problem where the task is to predict the value of a discrete output variable given a set of training examples and a test sample, where each training example is a pair consisting of the input object and the desired class. Formally, the input consists of:

- A tuple X represented by an n-dimensional attribute vector, X = (x_1, ..., x_n), depicting n measurements made on the object corresponding to n database attributes, A_1, A_2, ..., A_n. Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class-label attribute. The class label is discrete-valued and unordered. It is categorical in that each value serves as a category or class.
- A training set of m training tuples D = {(X_1, y_1), ..., (X_m, y_m)}, where the X_i are points in X-space and the y_i are the corresponding values of the class label (also called the response variable).

As the class label of each training tuple is provided, classification is thus a supervised learning methodology. Given this data, data classification is now a two-step process.

In the first step, a classifier is built describing a pre-determined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or learning from a training set made up of database tuples and their associated class labels as described above. This first step of the classification process can also be viewed as the learning of a mapping or function y = f(X) that can predict the associated class label y of a given tuple X. Thus, in this step we wish to learn a mapping or a function that separates the data classes. This mapping depends on the family of classification algorithms that is used and may be represented in the form of rules, decision trees or separating hyperplanes.

In the second step, the model is used for classification. Before the classifier is used in actual production settings, an attempt is made to determine its accuracy on unseen real-world data. Using the training data for computing the prediction accuracy would give optimistic results, as training may incorporate some of the anomalies of the training data that are not present in the general test set. Therefore, a test set is used, made up of test tuples and their associated class labels. These tuples are randomly selected from the general dataset. They are independent of the training tuples (i.e. they are not used to construct the classifier). The reported prediction accuracy of a classifier on a given test set is then the percentage of test tuples that are correctly classified by the classifier. The classifier's output for a particular test tuple is correct if the predicted class of the test tuple matches the class label for that tuple.

Classification has been widely used in:

- Credit scoring: Define the cap-limit for a credit card based on a user's past behaviour.
- Search engines: Categorize or classify the type of query or document.
- Handwriting recognition: Receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices.
- Document categorization: Assign an electronic document to one or more categories, based on its contents.
- Speech recognition and medical image analysis and diagnosis.
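As a minimal sketch of this two-step process (a toy example of our own; the car-themed tuples and function names are illustrative, and the similarity used is a simple attribute-match fraction rather than any measure from this thesis), a nearest-neighbour classifier is built from labelled tuples and then evaluated on an independent test set:

```python
def matches(x, y):
    """Per-tuple similarity: fraction of categorical attributes on which x and y agree."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

def predict(train, test_tuple):
    """Step 2: label a test tuple with the class of its most similar training tuple (1-NN)."""
    _, best_label = max(train, key=lambda pair: matches(pair[0], test_tuple))
    return best_label

# Step 1 uses the training tuples and their class labels; the test tuples are held out.
train = [(("red", "suv"), "family"), (("red", "hatchback"), "city"), (("black", "sedan"), "luxury")]
test  = [(("black", "suv"), "family"), (("red", "sedan"), "city")]

correct = sum(1 for x, y in test if predict(train, x) == y)
accuracy = correct / len(test)  # fraction of test tuples whose predicted class matches the label
print(accuracy)
```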

1.5.2 Regression

Regression is a supervised learning algorithm similar to classification, with the only difference being that the variable to be predicted is continuous; it is called the response variable. The data regression task goes through the same procedure of training, testing and evaluation. However, the reported accuracy of the regressor is measured in terms of the absolute mean error (ABME) or the root mean square error (RMSE). Given a regressor R and an input tuple X_i, let the value of the output variable be y_i and the value output by the regressor be R(X_i). Thus, the error made by the regressor for the given sample is |R(X_i) - y_i|. If the test set consists of m test samples (X_1, ..., X_m), then the absolute mean error (ABME) is given as:

ABME = \frac{\sum_{i=1}^{m} |R(X_i) - y_i|}{m}    (1.2)

and the root mean square error (RMSE) is given as:

RMSE = \sqrt{\frac{\sum_{i=1}^{m} (R(X_i) - y_i)^2}{m}}    (1.3)

Regression algorithms have been widely used for:

- Trend lines: A trend line represents a trend or the long-term movement in time-series data.
- Finance: Analyzing and quantifying the systematic risk of an investment.
- Economics: Predicting consumption spending, fixed investment spending, inventory investment, spending on imports, the demand to hold liquid assets, labor demand, and labor supply [3, 4].
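The two error measures are straightforward to compute. The short sketch below (illustrative values and function names of our own) evaluates a toy regressor's predictions against the true outputs using Equations (1.2) and (1.3):

```python
import math

def abme(predictions, targets):
    """Absolute mean error, Equation (1.2)."""
    m = len(targets)
    return sum(abs(p - y) for p, y in zip(predictions, targets)) / m

def rmse(predictions, targets):
    """Root mean square error, Equation (1.3)."""
    m = len(targets)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, targets)) / m)

# R(X_i) produced by a toy regressor, against the true output values y_i.
predicted = [3.1, 2.8, 5.0, 4.2]
actual    = [3.0, 3.0, 4.5, 4.0]
print(abme(predicted, actual), rmse(predicted, actual))
```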

1.6 Unsupervised Learning Algorithms

Unsupervised learning is a class of problems in which one seeks to determine how the data is organized. These systems learn to represent particular input patterns in a way that reflects the structure of the overall collection of input patterns. By contrast with supervised learning, there are no explicit target outputs (class labels) or environmental evaluations associated with each input.

1.6.1 Clustering

Clustering is defined in [1] as the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another and are dissimilar to objects in other clusters. A cluster of data objects can be treated collectively as one group, and so the cluster definition may be considered a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. Thus, in real-life scenarios it is more desirable to proceed in the reverse direction: first partition the set of data into groups based on similarity, and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups. Clustering algorithms may be applied to a wide variety of data like documents, images or the standard tabular data. For the purpose of this thesis, we consider the standard tabular data. Formally, the input consists of:

- A tuple X represented by an n-dimensional attribute vector, X = (x_1, ..., x_n), depicting n measurements made on the tuple from n database attributes, respectively A_1, A_2, ..., A_n.
- A set of m tuples D = {X_1, ..., X_m}, where the X_i are points in X-space.
- Depending on the algorithm, the number of clusters to be output (an optional parameter).

The output of the clustering algorithm is a set of k clusters where each point belongs to exactly one of the k clusters. Clustering algorithms can be applied in many fields, for example:

- Marketing: Finding groups of customers with similar behavior given a large database of past buying records of customers.
- Biology: Classification of plants and animals given their features.
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; fraud detection.
- City-planning: Identifying groups of houses according to their house type, value and geographical location.
- Earthquake studies: Clustering observed earthquake epicenters to identify dangerous zones.
- Web: Document classification, and clustering web-log data to discover groups of similar access patterns.
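As a small illustration of this input/output contract (a sketch of our own; the representative tuples and the attribute-match similarity stand in for whatever similarity measure is actually used), the step below assigns every tuple to exactly one of k clusters, namely the one whose representative it is most similar to:

```python
def matches(x, y):
    """Fraction of attributes on which two categorical tuples agree."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

def assign_to_clusters(tuples, representatives):
    """One assignment step: each tuple joins the cluster of its most similar representative."""
    clusters = {i: [] for i in range(len(representatives))}
    for x in tuples:
        best = max(range(len(representatives)),
                   key=lambda i: matches(x, representatives[i]))
        clusters[best].append(x)
    return clusters

data = [("red", "suv"), ("red", "hatchback"), ("black", "sedan"), ("black", "suv")]
print(assign_to_clusters(data, representatives=[("red", "suv"), ("black", "sedan")]))
```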

1.7 Frequent Pattern Mining

Frequent patterns are patterns like itemsets, subsequences or substructures that appear in a dataset frequently. Thus, frequent pattern mining is the discovery of associations and correlations among items in large transactional and relational data sets. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining such patterns from their databases. This is because the discovery of interesting relationships in huge amounts of business transaction records helps in many decision making processes, such as customer-shopping behavior analysis, cross marketing and catalog design, to name a few. Formally, frequent itemsets and association rules are defined below.

Let I = {I_1, I_2, ..., I_m} be a set of items. Let D be a set of database transactions where each transaction T is a set of items such that T ⊆ I. Each transaction is associated with an identifier, called the TID. Let A be a set of items. A transaction T is said to contain A iff A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the fraction of transactions in D that contain A ∪ B (i.e., the union of sets A and B). This is equivalent to the probability P(A ∪ B). The rule A ⇒ B has confidence c in the transaction set D, where c is the fraction of transactions in D containing A that also contain B. This is equivalent to the conditional probability P(B|A). Summarizing,

support(A ⇒ B) = P(A ∪ B)    (1.4)

confidence(A ⇒ B) = P(B|A) = \frac{support(A ∪ B)}{support(A)}    (1.5)
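A minimal sketch of these two definitions on a toy transaction database (the example items are made up; itemsets are represented as Python sets) follows. Equation (1.4) counts the fraction of transactions containing A ∪ B, and Equation (1.5) divides it by the support of A:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset` (Equation 1.4)."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    """P(consequent | antecedent) = support(A ∪ B) / support(A) (Equation 1.5)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"diapers", "beer"}, transactions))        # 0.6
print(confidence({"diapers"}, {"beer"}, transactions))   # 0.75
```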

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. Association rule algorithms have been widely used for:

- Web mining: Web usage mining, web content mining and web structure mining [5].
- Intrusion detection.
- Bio-informatics: Sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, genome-wide association studies and the modeling of evolution [6].
- Market basket analysis: Targeted advertising, product promotion, cross-sell promotion.

1.8 Organization of Thesis

Due to the wide range of related work relevant to the research problem, we provide an overview of these works in the chapter on related work (Chapter 2). Later, in Chapter 3, we propose a similarity learning algorithm which learns and assigns a similarity value between two categorical values belonging to the same attribute. For clarity, we provide an illustration of the learning algorithm on a dummy dataset. Then, in Chapter 4, we use the concept of co-occurrence statistics proposed in Chapter 3 and define another similarity measure, DISC-O, using association rules. Then, in Chapters 5 and 6, we compare the proposed approaches against the state-of-the-art approaches surveyed in Chapter 2. In addition to this, these chapters also provide an exhaustive evaluation of the behaviour of the proposed learning algorithm by varying various input parameters. Finally, in Chapter 7, we conclude our work and provide directions for future work.

Chapter 2

Related Work

2.1 Related Work

Determining similarity measures for categorical data is a much-studied field, as there is no explicit notion of ordering among the values of categorical attributes. The study of similarity between data objects with categorical variables has had a long history. Pearson proposed a chi-square statistic in the late 1800s, which is often used to test independence between categorical variables in a contingency table. Pearson's chi-square statistic was later modified and extended, leading to several other measures [7, 8, 9]. Sneath and Sokal were among the first to put together many of the categorical similarity measures, and they discuss them in detail in their book [10] on numerical taxonomy. At the time, two major concerns were (1) biological relevance, since numerical taxonomy was mainly concerned with taxonomies from biology, ecology, etc., and (2) computational efficiency, since computational resources were limited and scarce. Nevertheless, many of the observations made by Sneath and Sokal are quite relevant today and offer key insights into many of the measures. The specific problem of clustering categorical data has been actively studied. There are several books [12, 13, 14, 15] on cluster analysis that discuss the problem of determining similarity between categorical attributes. The problem has also been studied recently in [16, 17]. However, most of these approaches do not offer solutions to the problem discussed in this thesis, and the usual recommendation is to binarize the data and then use similarity measures designed for binary attributes. Most of these measures are based on the number of identical and non-identical value pairs.

The similarity of a non-identical value pair is simply 0 and the similarity of an identical value pair is 1. Let val_set = {v_1, ..., v_l} be the value set of all attributes. A data object x is transformed into a binary vector X = (x_1, ..., x_l), x_i ∈ {0, 1}, where x_i = 1 if the object x holds the value v_i, and x_i = 0 otherwise. Denote by X and Y two binary vectors, with XY = \sum_{i=1}^{l} x_i y_i, and by X̄ the complementary vector of X, X̄ = 1 - X = [1 - x_i]. Define the following counters: a = XY, b = XȲ, c = X̄Y and d = X̄Ȳ. Obviously, a + b + c + d = l and a + b + c = D, where D is the number of attributes. Using these notions, several measures on binary vectors have been defined. A brief summary is given in Table 2.1.

Table 2.1: Similarity Measures for Binary Vectors

Measure                            Definition                          Family
Jaccard (1900)                     a / (a + b + c)                     T_θ
Sokal and Michener index (1958)    (a + d) / (a + b + c + d)           S_θ
Sokal and Sneath                   a / (a + 2(b + c))                  T_θ
Rogers and Tanimoto (1960)         (a + d) / (a + 2(b + c) + d)        S_θ
Czekanowski (1913), Dice (1945)    a / (a + (1/2)(b + c))              T_θ
Sokal and Sneath (1963)            (a + d) / (a + d + (1/2)(b + c))    S_θ
Hamann (1961)                      (a + d - b - c) / l                 -

Most of the popular practical similarity measures in the literature (Jaccard, Dice, etc.) belong to one of these two families, introduced in [18]:

T_θ = \frac{a}{a + θ(b + c)}    (2.1)

S_θ = \frac{a + d}{a + d + θ(b + c)}    (2.2)
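A short sketch of the counters a, b, c, d and of the two families follows (helper names and example vectors are our own). Instantiating T_θ with θ = 1 gives Jaccard and with θ = 1/2 gives Czekanowski/Dice, while S_θ with θ = 1 gives the Sokal and Michener index:

```python
def match_counts(x, y):
    """Counts a, b, c, d over two binary vectors of equal length l."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

def t_family(x, y, theta):
    """T_theta family of Equation (2.1): a / (a + theta * (b + c))."""
    a, b, c, _ = match_counts(x, y)
    return a / (a + theta * (b + c))

def s_family(x, y, theta):
    """S_theta family of Equation (2.2): (a + d) / (a + d + theta * (b + c))."""
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + d + theta * (b + c))

x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 0, 1, 0, 0]
jaccard        = t_family(x, y, theta=1)    # a / (a + b + c)
dice           = t_family(x, y, theta=0.5)  # Czekanowski / Dice
sokal_michener = s_family(x, y, theta=1)    # (a + d) / (a + b + c + d)
```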

Most work has been carried out on the development of clustering algorithms and not on similarity functions. Hence these works are only marginally or peripherally related to our work. Wilson and Martinez [19] performed a detailed study of heterogeneous distance functions (for categorical and continuous attributes) for instance-based learning. The measures in their study are based upon a supervised approach where each data instance has class information in addition to a set of categorical/continuous attributes. There have been a number of new data mining techniques for categorical data that have been proposed recently. Some of them use notions of similarity which are neighborhood-based [20, 21, 22, 23], or incorporate the similarity computation into the learning algorithm [24, 25]. These measures are useful for computing the neighborhood of a point, but not for calculating the similarity between a pair of data instances. Jones and Furnas [26] studied several similarity measures in the field of information retrieval. In particular, they performed a geometric analysis of continuous measures in order to reveal important differences which would affect retrieval performance. Noreault et al. [27] also studied measures in information retrieval with the goal of generalizing effectiveness based on empirically evaluating the performance of the measures. Another comparative empirical evaluation for determining similarity between fuzzy sets was performed by Zwick et al. [28], followed by several others [29, 30]. Vipin Kumar et al. recently surveyed and compared a large number of data-intensive similarity measures in [31]. In our experiments we have compared our approach with the methods discussed in [31], which provides a recent exhaustive comparison of similarity measures for categorical data.

In the remainder of this chapter, we identify the key characteristics of a categorical data set that can potentially affect the behavior of a data-driven similarity measure (Section 2.3) and describe a large number of data-intensive similarity measures from [31]. The notations for describing the similarity measures are defined in Section 2.2 and the measures themselves are described in Section 2.4.

2.2 Notations

For the sake of notation, consider a categorical data set D containing N objects, defined over a set of d categorical attributes, where A_k denotes the k-th attribute. Let the attribute A_k take n_k values in the given data set, denoted by the set A_k. We also use the following notation:

- f_k(x): The number of times attribute A_k takes the value x in the data set D. Note that if x ∉ A_k, then f_k(x) = 0.
- p_k(x): The sample probability of attribute A_k taking the value x in the data set D. The sample probability is given by:

p_k(x) = \frac{f_k(x)}{N}    (2.3)

- p_k^2(x): Another probability estimate of attribute A_k taking the value x in the given data set, given by:

p_k^2(x) = \frac{f_k(x) (f_k(x) - 1)}{N (N - 1)}    (2.4)
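This notation is easy to compute directly from a data set. The sketch below (a toy two-attribute data set of our own) derives f_k(x), p_k(x) and p_k^2(x) as in Equations (2.3) and (2.4):

```python
from collections import Counter

# Toy data set D: N objects described by d = 2 categorical attributes (Colour, Brand).
D = [
    ("red", "ferrari"), ("red", "ferrari"), ("red", "nano"),
    ("blue", "nano"), ("blue", "mercedes"), ("black", "mercedes"),
]
N = len(D)

def f_k(k, x):
    """Number of times attribute A_k takes the value x in D (0 if x never occurs)."""
    return Counter(row[k] for row in D)[x]

def p_k(k, x):
    """Sample probability of Equation (2.3): f_k(x) / N."""
    return f_k(k, x) / N

def p2_k(k, x):
    """Estimate p_k^2(x) of Equation (2.4): f_k(x)(f_k(x) - 1) / (N(N - 1))."""
    f = f_k(k, x)
    return f * (f - 1) / (N * (N - 1))

print(f_k(0, "red"), p_k(0, "red"), p2_k(0, "red"))  # 3, 0.5, 3*2/(6*5) = 0.2
```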

2.3 Characteristics of Categorical Data

Since this thesis discusses data-driven similarity measures for categorical data, a key task is to identify the characteristics of a categorical data set that affect the behavior of such a similarity measure. We enumerate the characteristics of a categorical data set below:

- Size of data, N. As we will see later, most measures are typically invariant of the size of the data, though there are some measures (e.g. Smirnov) that do make use of this information.
- Number of attributes, d. Most measures are invariant of this characteristic, since they typically normalize the similarity over the number of attributes.
- Number of values taken by each attribute, n_k. A data set might contain attributes that take several values and attributes that take very few values. For example, one attribute might take several hundred possible values, while another attribute might take very few values. A similarity measure might give more importance to the second attribute, while ignoring the first one. In fact, one of the measures discussed in this section (Eskin) behaves exactly like this.
- Distribution of f_k(x). This refers to the distribution of the frequency of values taken by an attribute in the given data set. In certain data sets an attribute might be distributed uniformly over the set A_k, while in others the distribution might be skewed. One similarity measure might give more importance to attribute values that occur rarely, while another similarity measure might give more importance to frequently occurring attribute values.

2.4 Popular Similarity Measures

In this section, we give an overview of popularly used similarity measures. It may be noted that some of the similarity measures were originally proposed as distance measures and are converted to similarity measures in order to make the measures comparable in this study. The measures discussed henceforth will all be in the context of similarity, with distance measures being converted using the formula:

sim = 1 / (1 + dist)    (2.5)

Almost all similarity measures assign a similarity value between two data instances X and Y belonging to the data set D (introduced in Section 2.2) as follows:

S(X, Y) = \sum_{k=1}^{d} w_k S_k(X_k, Y_k)    (2.6)

where S_k(X_k, Y_k) is the per-attribute similarity for categorical attribute A_k. Note that X_k, Y_k ∈ A_k. The quantity w_k denotes the weight assigned to the attribute A_k. To understand how different measures calculate the per-attribute similarity S_k(X_k, Y_k), consider a categorical attribute A which takes one of the values {a, b, c, d}. The per-attribute similarity computation is equivalent to constructing the (symmetric) matrix shown in Figure 2.1. Essentially, in determining the similarity between two values, any categorical measure is filling the entries of this matrix. For example, the overlap measure sets the diagonal entries to 1 and the off-diagonal entries to 0, i.e., the similarity is 1 if the values match and 0 if the values mismatch.

Figure 2.1: Similarity Matrix for a Single Categorical Attribute

Additionally, measures may use the following information in computing a similarity value:

- f(a), f(b), f(c), f(d), the frequencies of the values in the data set;
- N, the size of the data set;
- n, the number of values taken by the attribute (4 in the case above);
- f(A_k = a, A_k' = a'), the number of instances for which attribute A_k takes the value a and attribute A_k' takes the value a', thus representing the co-occurrence statistics.

We can classify measures in several ways, based on: (i) the manner in which they fill the entries of the similarity matrix, and (ii) the arguments used to propose the measure (probabilistic, information-theoretic, etc.). In this section, we will describe the measures by classifying them as follows:

- Those that fill the diagonal entries only. These are measures that set the off-diagonal entries to 0 (mismatches are uniformly given the minimum value) and give possibly different weights to matches.
- Those that fill the off-diagonal entries only. These measures set the diagonal entries to 1 (matches are uniformly given the maximum value) and give possibly different weights to mismatches.
- Those that fill both diagonal and off-diagonal entries. These measures give different weights to both matches and mismatches.

Table 2.2 gives the mathematical formulas for the measures we will be describing in this section. The various measures described in Table 2.2 compute the per-attribute similarity S_k(X_k, Y_k) as shown in column 2 and compute the attribute weight w_k as shown in column 3.
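The sketch below (illustrative code of our own) fills the Figure 2.1 matrix for the overlap measure and evaluates Equation (2.6) for two toy instances with equal attribute weights w_k = 1/d:

```python
def overlap_entry(u, v):
    """Overlap measure: diagonal entries of the Figure 2.1 matrix are 1, off-diagonal are 0."""
    return 1.0 if u == v else 0.0

def similarity_matrix(values, per_attribute_sim):
    """Fill the (symmetric) per-attribute similarity matrix for one categorical attribute."""
    return {(u, v): per_attribute_sim(u, v) for u in values for v in values}

def S(X, Y, per_attribute_sims, weights):
    """Equation (2.6): S(X, Y) = sum_k w_k * S_k(X_k, Y_k)."""
    return sum(w * sim(x_k, y_k)
               for w, sim, x_k, y_k in zip(weights, per_attribute_sims, X, Y))

# Attribute A takes the values {a, b, c, d}; with overlap, only the diagonal is non-zero.
matrix = similarity_matrix(["a", "b", "c", "d"], overlap_entry)

# Two instances over d = 3 attributes, each attribute weighted equally (w_k = 1/d).
X, Y = ("a", "red", "suv"), ("a", "blue", "suv")
print(S(X, Y, [overlap_entry] * 3, [1 / 3] * 3))  # 2/3: two of the three attributes match
```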

2.4.1 Measures that fill Diagonal Entries only

Overlap: The overlap measure simply counts the number of attributes that match in the two data instances. The range of per-attribute similarity for the overlap measure is [0, 1], with a value of 0 occurring when there is no match, and a value of 1 occurring when the attribute values match.

Goodall1: Goodall [32] proposed a measure that attempts to normalize the similarity between two objects by the probability that the observed similarity value could be observed in a random sample of two points. This measure assigns higher similarity to a match if the value is infrequent than if the value is frequent. The range of S_k(X_k, Y_k) for matches in the Goodall1 measure is [0, 1 - 2/(N(N-1))], with the minimum being attained when attribute A_k takes only one value, and the maximum being attained when the value X_k occurs twice, while all other possible values of A_k occur more than twice. The authors of [31] also propose three other variants of Goodall's measure: Goodall2, Goodall3 and Goodall4.

Goodall2: The Goodall2 measure is a variant of Goodall's measure proposed by the authors of [31]. This measure assigns higher similarity if the matching values are infrequent, and at the same time there are other values that are even less frequent, i.e., the similarity is higher if there are many values with approximately equal frequencies, and lower if the frequency distribution is skewed. The range of S_k(X_k, Y_k) for matches in the Goodall2 measure is [0, 1 - 2/(N(N-1))], with the minimum value being attained if attribute A_k takes only one value, and the maximum being attained when the value X_k occurs twice, while all other possible values of A_k occur only once each.

Goodall3: The Goodall3 measure assigns a high similarity if the matching values are infrequent, regardless of the frequencies of the other values. The range of S_k(X_k, Y_k) for matches in the Goodall3 measure is [0, 1 - 2/(N(N-1))], with the minimum value being attained if X_k is the only value for attribute A_k and the maximum value being attained if X_k occurs only twice.

Goodall4: The Goodall4 measure assigns similarity 1 - Goodall3 for matches. The range of S_k(X_k, Y_k) for matches in the Goodall4 measure is [2/(N(N-1)), 1], with the minimum value being attained if X_k occurs only twice, and the maximum value being attained if X_k is the only value for attribute A_k.

Gambaryan: Gambaryan proposed a measure [33] that gives more weight to matches where the matching value occurs in about half the data set, i.e., in between being frequent and rare. The Gambaryan measure for a single attribute match is closely related to the Shannon entropy from information theory, as can be seen from its formula in Table 2.2. The range of S_k(X_k, Y_k) for matches in the Gambaryan measure is [0, 1], with the minimum value being attained if X_k is the only value for attribute A_k and the maximum value being attained when X_k has frequency N/2.

2.4.2 Measures that fill Off-Diagonal Entries only

Eskin: Eskin et al. [34] proposed a normalization kernel for record-based network intrusion detection data. The original measure is distance-based and assigns a weight of 2/n_k^2 for mismatches; when adapted to similarity, this becomes a weight of n_k^2 / (n_k^2 + 2). This measure gives more weight to mismatches that occur on attributes that take many values. The range of S_k(X_k, Y_k) for mismatches in the Eskin measure is [2/3, N^2/(N^2 + 2)], with the minimum value being attained when the attribute A_k takes only two values, and the maximum value being attained when the attribute has all unique values.
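As a concrete illustration of two of these per-attribute similarities (a sketch of our own, with the formulas written to match the descriptions and ranges above and the survey in [31]; the toy attribute column is made up), Eskin uses only n_k while Goodall3 uses the frequency-based estimate p_k^2 from Section 2.2:

```python
from collections import Counter

# Values of one categorical attribute A_k across the N objects of a toy data set.
column = ["red", "red", "red", "blue", "green", "green"]
N = len(column)
freq = Counter(column)     # f_k(x)
n_k = len(freq)            # number of distinct values taken by A_k

def eskin(x, y):
    """Eskin: 1 for a match, n_k^2 / (n_k^2 + 2) for a mismatch."""
    return 1.0 if x == y else n_k ** 2 / (n_k ** 2 + 2)

def goodall3(x, y):
    """Goodall3: 1 - p_k^2(x) for a match (higher when the value is infrequent), 0 for a mismatch."""
    if x != y:
        return 0.0
    f = freq[x]
    return 1.0 - f * (f - 1) / (N * (N - 1))

print(eskin("red", "blue"))        # 9/11: mismatch weight for n_k = 3
print(goodall3("green", "green"))  # 1 - 2/(N(N-1)): "green" occurs only twice, the maximum for matches
print(goodall3("red", "red"))      # 1 - 6/30 = 0.8: "red" is frequent, so the match is worth less
```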

Inverse Occurrence Frequency (IOF): The inverse occurrence frequency measure assigns lower similarity to mismatches on more frequent values. The IOF measure is related to the concept of inverse document frequency, which comes from information retrieval [35], where it is used to signify the relative number of documents that contain a specific word. A key difference is that inverse document frequency is computed on a term-document matrix which is usually binary, while the IOF measure is defined for categorical data. The range of S_k(X_k, Y_k) for mismatches in the IOF measure is [1/(1 + (log(N/2))^2), 1], with the minimum value being attained when X_k and Y_k each occur N/2 times (i.e., these are the only two values), and the maximum value being attained when X_k and Y_k occur only once in the data set.

Occurrence Frequency (OF): The occurrence frequency measure gives the opposite weighting of the IOF measure for mismatches, i.e., mismatches on less frequent values are assigned lower similarity and mismatches on more frequent values are assigned higher similarity. The range of S_k(X_k, Y_k) for mismatches in the OF measure is [1/(1 + (log N)^2), 1/(1 + (log 2)^2)], with the minimum value being attained when X_k and Y_k occur only once in the data set, and the maximum value being attained when X_k and Y_k occur N/2 times.

Burnaby: Burnaby [36] proposed a similarity measure using arguments from information theory. He argues that the set of observed values is like a group of signals conveying information and, as in information theory, attribute values that are rarely observed should be considered more informative. In [36], Burnaby proposed information-weighted measures for binary, ordinal, categorical and continuous data. The measure we present in Table 2.2 is adapted from Burnaby's categorical measure. This measure assigns low similarity to mismatches on rare values and high similarity to mismatches on frequent values. The range of S_k(X_k, Y_k) for mismatches in the Burnaby measure is [N log(1 - 1/N) / (N log(1 - 1/N) - log(N - 1)), 1], with the minimum value being attained when all values for attribute A_k occur only once, and the maximum value being attained when X_k and Y_k each occur N/2 times.

2.4.3 Measures that fill both Diagonal and Off-Diagonal Entries

Lin: In [37], Lin describes an information-theoretic framework for similarity, where he argues that when similarity is thought of in terms of assumptions about the space, the similarity measure naturally follows from the assumptions. Lin [37] discusses the


More information

More on Unsupervised Learning

More on Unsupervised Learning More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data

More information

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018 Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model

More information

Data Analytics Beyond OLAP. Prof. Yanlei Diao

Data Analytics Beyond OLAP. Prof. Yanlei Diao Data Analytics Beyond OLAP Prof. Yanlei Diao OPERATIONAL DBs DB 1 DB 2 DB 3 EXTRACT TRANSFORM LOAD (ETL) METADATA STORE DATA WAREHOUSE SUPPORTS OLAP DATA MINING INTERACTIVE DATA EXPLORATION Overview of

More information

Spatial Co-location Patterns Mining

Spatial Co-location Patterns Mining Spatial Co-location Patterns Mining Ruhi Nehri Dept. of Computer Science and Engineering. Government College of Engineering, Aurangabad, Maharashtra, India. Meghana Nagori Dept. of Computer Science and

More information

CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION

CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION 4.1 Overview This chapter contains the description about the data that is used in this research. In this research time series data is used. A time

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Stat 406: Algorithms for classification and prediction. Lecture 1: Introduction. Kevin Murphy. Mon 7 January,

Stat 406: Algorithms for classification and prediction. Lecture 1: Introduction. Kevin Murphy. Mon 7 January, 1 Stat 406: Algorithms for classification and prediction Lecture 1: Introduction Kevin Murphy Mon 7 January, 2008 1 1 Slides last updated on January 7, 2008 Outline 2 Administrivia Some basic definitions.

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Mining Frequent Patterns and Associations: Basic Concepts (Chapter 6) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan

More information

CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2014 Solutions Exam 2 - December 15, 2014

CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2014 Solutions Exam 2 - December 15, 2014 CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2014 Solutions Exam 2 - December 15, 2014 Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute NAME: Prof.

More information

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05 Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Sample vs population Example task with red and black cards Statistical terminology Permutation test and hypergeometric test Histogram on a sample vs population

More information

Foundations of Machine Learning

Foundations of Machine Learning Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about

More information

Spazi vettoriali e misure di similaritá

Spazi vettoriali e misure di similaritá Spazi vettoriali e misure di similaritá R. Basili Corso di Web Mining e Retrieval a.a. 2009-10 March 25, 2010 Outline Outline Spazi vettoriali a valori reali Operazioni tra vettori Indipendenza Lineare

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

A Large Deviation Bound for the Area Under the ROC Curve

A Large Deviation Bound for the Area Under the ROC Curve A Large Deviation Bound for the Area Under the ROC Curve Shivani Agarwal, Thore Graepel, Ralf Herbrich and Dan Roth Dept. of Computer Science University of Illinois Urbana, IL 680, USA {sagarwal,danr}@cs.uiuc.edu

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

ANÁLISE DOS DADOS. Daniela Barreiro Claro

ANÁLISE DOS DADOS. Daniela Barreiro Claro ANÁLISE DOS DADOS Daniela Barreiro Claro Outline Data types Graphical Analysis Proimity measures Prof. Daniela Barreiro Claro Types of Data Sets Record Ordered Relational records Video data: sequence of

More information

Mining Infrequent Patter ns

Mining Infrequent Patter ns Mining Infrequent Patter ns JOHAN BJARNLE (JOHBJ551) PETER ZHU (PETZH912) LINKÖPING UNIVERSITY, 2009 TNM033 DATA MINING Contents 1 Introduction... 2 2 Techniques... 3 2.1 Negative Patterns... 3 2.2 Negative

More information

Mining Positive and Negative Fuzzy Association Rules

Mining Positive and Negative Fuzzy Association Rules Mining Positive and Negative Fuzzy Association Rules Peng Yan 1, Guoqing Chen 1, Chris Cornelis 2, Martine De Cock 2, and Etienne Kerre 2 1 School of Economics and Management, Tsinghua University, Beijing

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

Classification and Pattern Recognition

Classification and Pattern Recognition Classification and Pattern Recognition Léon Bottou NEC Labs America COS 424 2/23/2010 The machine learning mix and match Goals Representation Capacity Control Operational Considerations Computational Considerations

More information

MASTER. Anomaly detection on event logs an unsupervised algorithm on ixr-messages. Severins, J.D. Award date: Link to publication

MASTER. Anomaly detection on event logs an unsupervised algorithm on ixr-messages. Severins, J.D. Award date: Link to publication MASTER Anomaly detection on event logs an unsupervised algorithm on ixr-messages Severins, J.D. Award date: 2016 Link to publication Disclaimer This document contains a student thesis (bachelor's or master's),

More information

Chap 2: Classical models for information retrieval

Chap 2: Classical models for information retrieval Chap 2: Classical models for information retrieval Jean-Pierre Chevallet & Philippe Mulhem LIG-MRIM Sept 2016 Jean-Pierre Chevallet & Philippe Mulhem Models of IR 1 / 81 Outline Basic IR Models 1 Basic

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Prediction of Citations for Academic Papers

Prediction of Citations for Academic Papers 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Machine Learning Linear Models

Machine Learning Linear Models Machine Learning Linear Models Outline II - Linear Models 1. Linear Regression (a) Linear regression: History (b) Linear regression with Least Squares (c) Matrix representation and Normal Equation Method

More information

Pattern Structures 1

Pattern Structures 1 Pattern Structures 1 Pattern Structures Models describe whole or a large part of the data Pattern characterizes some local aspect of the data Pattern is a predicate that returns true for those objects

More information

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu

From statistics to data science. BAE 815 (Fall 2017) Dr. Zifei Liu From statistics to data science BAE 815 (Fall 2017) Dr. Zifei Liu Zifeiliu@ksu.edu Why? How? What? How much? How many? Individual facts (quantities, characters, or symbols) The Data-Information-Knowledge-Wisdom

More information

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data Elena Baralis and Tania Cerquitelli Politecnico di Torino Data set types Record Tables Document Data Transaction Data Graph World Wide Web Molecular Structures Ordered Spatial Data Temporal Data Sequential

More information

Data mining, 4 cu Lecture 5:

Data mining, 4 cu Lecture 5: 582364 Data mining, 4 cu Lecture 5: Evaluation of Association Patterns Spring 2010 Lecturer: Juho Rousu Teaching assistant: Taru Itäpelto Evaluation of Association Patterns Association rule algorithms

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

8. Classification and Pattern Recognition

8. Classification and Pattern Recognition 8. Classification and Pattern Recognition 1 Introduction: Classification is arranging things by class or category. Pattern recognition involves identification of objects. Pattern recognition can also be

More information

Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context.

Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context. Modified Entropy Measure for Detection of Association Rules Under Simpson's Paradox Context. Murphy Choy Cally Claire Ong Michelle Cheong Abstract The rapid explosion in retail data calls for more effective

More information

Collaborative topic models: motivations cont

Collaborative topic models: motivations cont Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.

More information

Data Mining 4. Cluster Analysis

Data Mining 4. Cluster Analysis Data Mining 4. Cluster Analysis 4.2 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Data Structures Interval-Valued (Numeric) Variables Binary Variables Categorical Variables Ordinal Variables Variables

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 4: Probabilistic Retrieval Models April 29, 2010 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig

More information

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata Principles of Pattern Recognition C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata e-mail: murthy@isical.ac.in Pattern Recognition Measurement Space > Feature Space >Decision

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining Chapter 6. Frequent Pattern Mining: Concepts and Apriori Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining Pattern Discovery: Definition What are patterns? Patterns: A set of

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Encyclopedia of Machine Learning Chapter Number Book CopyRight - Year 2010 Frequent Pattern. Given Name Hannu Family Name Toivonen

Encyclopedia of Machine Learning Chapter Number Book CopyRight - Year 2010 Frequent Pattern. Given Name Hannu Family Name Toivonen Book Title Encyclopedia of Machine Learning Chapter Number 00403 Book CopyRight - Year 2010 Title Frequent Pattern Author Particle Given Name Hannu Family Name Toivonen Suffix Email hannu.toivonen@cs.helsinki.fi

More information

Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data

Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional -Series Data Xiaolei Li, Jiawei Han University of Illinois at Urbana-Champaign VLDB 2007 1 Series Data Many applications produce time series

More information

Mining Emerging Substrings

Mining Emerging Substrings Mining Emerging Substrings Sarah Chan Ben Kao C.L. Yip Michael Tang Department of Computer Science and Information Systems The University of Hong Kong {wyschan, kao, clyip, fmtang}@csis.hku.hk Abstract.

More information

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly

More information

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Chapter Fifteen Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Copyright 2010 Pearson Education, Inc. publishing as Prentice Hall 15-1 Internet Usage Data Table 15.1 Respondent Sex Familiarity

More information

An Introduction to Machine Learning

An Introduction to Machine Learning An Introduction to Machine Learning L2: Instance Based Estimation Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune, January

More information

Classification of Ordinal Data Using Neural Networks

Classification of Ordinal Data Using Neural Networks Classification of Ordinal Data Using Neural Networks Joaquim Pinto da Costa and Jaime S. Cardoso 2 Faculdade Ciências Universidade Porto, Porto, Portugal jpcosta@fc.up.pt 2 Faculdade Engenharia Universidade

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning 10-715 Advanced Introduction to Machine Learning Homework Due Oct 15, 10.30 am Rules Please follow these guidelines. Failure to do so, will result in loss of credit. 1. Homework is due on the due date

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

More information

Exploring Spatial Relationships for Knowledge Discovery in Spatial Data

Exploring Spatial Relationships for Knowledge Discovery in Spatial Data 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Exploring Spatial Relationships for Knowledge Discovery in Spatial Norazwin Buang

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

Correlation Preserving Unsupervised Discretization. Outline

Correlation Preserving Unsupervised Discretization. Outline Correlation Preserving Unsupervised Discretization Jee Vang Outline Paper References What is discretization? Motivation Principal Component Analysis (PCA) Association Mining Correlation Preserving Discretization

More information

Journal of Chemical and Pharmaceutical Research, 2014, 6(5): Research Article

Journal of Chemical and Pharmaceutical Research, 2014, 6(5): Research Article Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(5):266-270 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Anomaly detection of cigarette sales using ARIMA

More information

Leverage Sparse Information in Predictive Modeling

Leverage Sparse Information in Predictive Modeling Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from

More information

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1 Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger

More information

CS5112: Algorithms and Data Structures for Applications

CS5112: Algorithms and Data Structures for Applications CS5112: Algorithms and Data Structures for Applications Lecture 19: Association rules Ramin Zabih Some content from: Wikipedia/Google image search; Harrington; J. Leskovec, A. Rajaraman, J. Ullman: Mining

More information

COMP 5331: Knowledge Discovery and Data Mining

COMP 5331: Knowledge Discovery and Data Mining COMP 5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Tan, Steinbach, Kumar And Jiawei Han, Micheline Kamber, and Jian Pei 1 10

More information