Descriptive Data Summarization

Descriptive data summarization gives the general characteristics of the data and identifies the presence of noise or outliers, which is useful for successful data cleaning and data integration. For data preprocessing tasks, we want to learn about data characteristics regarding both the central tendency and the dispersion of the data. Measures of central tendency include the mean, median, mode, and midrange; measures of data dispersion include quartiles, interquartile range (IQR), and variance. These descriptive statistics are of great help in understanding the distribution of the data.
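
As a concrete illustration, the measures named above can be computed with Python's standard library; this is a minimal sketch, and the sample data is invented:

```python
# Minimal sketch using only the Python standard library; the sample
# data is invented for illustration.
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(data)                 # central tendency
median = statistics.median(data)
modes = statistics.multimode(data)           # [52, 70]: bimodal
midrange = (min(data) + max(data)) / 2

q1, _, q3 = statistics.quantiles(data, n=4)  # dispersion (Python 3.8+)
iqr = q3 - q1
variance = statistics.pvariance(data)        # population variance

print(mean, median, modes, midrange, iqr, variance)
```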

Measuring Central Tendency: Mean

The most common and most effective numerical measure of the center of a set of data is the arithmetic mean:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Sometimes, each value $x_i$ in a set may be associated with a weight $w_i$. The weights reflect the significance, importance, or occurrence frequency attached to their respective values.

Weighted arithmetic mean:

$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
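
A one-function sketch of the weighted mean; the values and weights below are invented:

```python
# Sketch of the weighted arithmetic mean; values and weights invented.
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# A score of 90 carrying weight 3 pulls the mean toward 90.
print(weighted_mean([70, 80, 90], [1, 2, 3]))  # 83.33...
```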

Measuring Central Tendency: Mean

Although the mean is the single most useful quantity for describing a data set, it is not always the best way of measuring the center of the data. A major problem with the mean is its sensitivity to extreme (outlier) values: even a small number of extreme values can corrupt the mean. To offset the effect caused by a small number of extreme values, we can instead use the trimmed mean, which is obtained after chopping off values at the high and low extremes.
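
A minimal sketch of a trimmed mean; the 20% trimming fraction and the data are arbitrary choices for illustration:

```python
# Trimmed mean sketch: chop off a fraction of values at each extreme.
def trimmed_mean(values, fraction=0.2):
    xs = sorted(values)
    k = int(len(xs) * fraction)          # values to drop at each end
    trimmed = xs[k:len(xs) - k]
    return sum(trimmed) / len(trimmed)

data = [1, 2, 2, 3, 3, 4, 1000]          # one extreme value
print(sum(data) / len(data))             # plain mean = 145.0, corrupted
print(trimmed_mean(data))                # trimmed mean = 2.8, robust
```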

Measuring Central Tendency: Median

Another measure of the center of data is the median. Suppose that a given data set of N distinct values is sorted in numerical order. If N is odd, the median is the middle value of the ordered set; if N is even, the median is the average of the middle two values. For grouped data, the median can be estimated by

$\text{median} = L_1 + \left(\frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$

where $L_1$ is the lower boundary of the median interval, $N$ is the number of values in the entire data set, $(\sum \text{freq})_l$ is the sum of the frequencies of all of the intervals that are lower than the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.
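
A sketch of this grouped-data estimate; the interval layout (lower boundary, width, frequency) and the example frequencies are invented for illustration:

```python
# Grouped-data median estimate, following the formula above.
def grouped_median(intervals):
    """intervals: list of (lower_boundary, width, frequency), sorted."""
    n = sum(freq for _, _, freq in intervals)
    cum = 0                                        # (sum freq)_l so far
    for lower, width, freq in intervals:
        if cum + freq >= n / 2:                    # median interval found
            return lower + (n / 2 - cum) / freq * width
        cum += freq

# e.g. values grouped as 0-20 (freq 6), 20-40 (10), 40-60 (4): N = 20
print(grouped_median([(0, 20, 6), (20, 20, 10), (40, 20, 4)]))  # 28.0
```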

Measuring Central Tendency: Mode

Another measure of central tendency is the mode. The mode for a set of data is the value that occurs most frequently in the set. It is possible for the greatest frequency to correspond to several different values, which results in more than one mode. Data sets with one, two, or three modes are called unimodal, bimodal, and trimodal, respectively. At the other extreme, if each data value occurs only once, then there is no mode. For unimodal frequency curves that are moderately skewed (asymmetrical), we have the empirical relation

$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$

Symmetric vs. Skewed Data

Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data.

Measuring the Dispersion of Data

The degree to which numerical data tend to spread is called the dispersion, or variance, of the data. The most common measures of data dispersion are the range, the five-number summary (based on quartiles), the interquartile range, and the standard deviation. Boxplots can be plotted based on the five-number summary and are a useful tool for identifying outliers.

Range, Quartiles, Outliers

Range: the difference between the largest and smallest values.
Quartiles: Q1 (25th percentile), Q3 (75th percentile).
Interquartile range: IQR = Q3 - Q1.
Five-number summary: Minimum, Q1, Median, Q3, Maximum.
Outliers: a common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5 × IQR above the third quartile or below the first quartile.
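
A short sketch of this rule of thumb (statistics.quantiles requires Python 3.8+); the sample data is invented:

```python
# Flag suspected outliers with the 1.5 x IQR rule of thumb.
import statistics

def iqr_outliers(data):
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

print(iqr_outliers([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]))
# [110]: the only value beyond Q3 + 1.5 * IQR
```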

Boxplot Analysis

Boxplots are a popular way of visualizing a distribution; a boxplot incorporates the five-number summary:
The ends of the box are at the quartiles, so that the box length is the interquartile range, IQR.
The median is marked by a line within the box.
Two lines outside the box extend to the smallest and largest observations.
Outliers: points beyond a specified outlier threshold, plotted individually.

Variance and Standard Deviation

Variance of N observations:

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$

where $\mu$ is the mean value of the observations. The standard deviation $\sigma$ is the square root of the variance $\sigma^2$.

The basic properties of the standard deviation are:
$\sigma$ measures spread about the mean and should be used only when the mean is chosen as the measure of center.
$\sigma = 0$ only when there is no spread, that is, when all observations have the same value; otherwise $\sigma > 0$.

Properties of Normal Distribution Curve

For the normal (distribution) curve ($\mu$: mean, $\sigma$: standard deviation):
From $\mu - \sigma$ to $\mu + \sigma$: contains about 68% of the measurements.
From $\mu - 2\sigma$ to $\mu + 2\sigma$: contains about 95% of the measurements.
From $\mu - 3\sigma$ to $\mu + 3\sigma$: contains about 99.7% of the measurements.

Graphic Displays of Basic Statistical Descriptions

Boxplot: graphic display of the five-number summary.
Histogram: the x-axis shows values, the y-axis represents frequencies.
Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are $\leq x_i$.
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane.

Histogram Analysis

Histogram: graphic display of tabulated frequencies, shown as bars. It shows what proportion of cases fall into each of several categories. A histogram differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts; this is a crucial distinction when the categories are not of uniform width. The categories are usually specified as non-overlapping intervals of some variable, and the categories (bars) must be adjacent.

Histogram Analysis

In an equal-width histogram, each bucket represents an equal-width range of a numerical attribute.

Histograms Often Tell More than Boxplots

Figure: two histograms that have the same boxplot representation (the same values for min, Q1, median, Q3, and max) but rather different data distributions.

Quantile Plot

A quantile plot displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences) and plots quantile information: for data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i\%$ of the data are below or equal to the value $x_i$.

Quantile-Quantile (Q-Q) Plot

A q-q plot graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view whether there is a shift in going from one distribution to another. Suppose that we have two sets of observations for the variable unit price, taken from two different branch locations. Let $x_1, \ldots, x_N$ be the data from the first branch and $y_1, \ldots, y_M$ be the data from the second, where each data set is sorted in increasing order. If M = N (i.e., the number of points in each set is the same), then we simply plot $y_i$ against $x_i$, where $y_i$ and $x_i$ are both quantiles of their respective data sets. If M < N (i.e., the second branch has fewer observations than the first), there can be only M points on the q-q plot, so the $y_i$ values are plotted against quantiles of the x data computed at the same fractions. A sketch of this construction follows.
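
To make the M < N case concrete, here is a hedged numpy sketch of the construction; the branch data is synthetic, drawn from normal distributions purely for illustration:

```python
# Q-Q plot construction for samples of different sizes (M < N):
# plot the M sorted y-values against quantiles of x taken at the
# same fractions. The data here is synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.normal(100, 10, size=200))      # branch 1, N = 200
y = np.sort(rng.normal(105, 10, size=150))      # branch 2, M = 150

fractions = (np.arange(len(y)) + 0.5) / len(y)  # f_i for each y value
x_quantiles = np.quantile(x, fractions)         # matching x quantiles

# Each pair (x_quantiles[i], y[i]) is one q-q point; points above the
# 45-degree line indicate that y values run higher than x values.
points = list(zip(x_quantiles, y))
```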

Quantile-Quantile (Q-Q) Plot

A quantile-quantile plot for unit price data of items sold at two different branches. Each point corresponds to the same quantile for each data set and shows the unit price of items sold at branch 1 versus branch 2 for that quantile. For example, the lowest point in the left corner corresponds to the 0.03 quantile. A straight line represents the case in which, for each given quantile, the unit price at each branch is the same. The darker points correspond to the data for Q1, the median, and Q3, respectively. Here the unit price of items sold at branch 1 was slightly less than that at branch 2.

Scatter Plot

A scatter plot is one of the most effective graphical methods for determining whether there appears to be a relationship, pattern, or trend between two numerical attributes. To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane.

Scatter Plot: Positively and Negatively Correlated Data

Figure: example scatter plots of negatively and positively correlated data; the left-half fragment is positively correlated, the right-half fragment is negatively correlated.

Scatter Plot: Uncorrelated Data

Figure: examples of scatter plots of uncorrelated data.

Loess Curve

A loess curve is another graphic aid that adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence. The word loess is short for local regression.

Similarity and Dissimilarity

Similarity: the similarity between two objects is a numerical measure of the degree to which the two objects are alike. Similarities are higher for pairs of objects that are more alike. Similarities are usually non-negative and are often between 0 (no similarity) and 1 (complete similarity).

Dissimilarity: the dissimilarity between two objects is a numerical measure of the degree to which the two objects are different. Dissimilarities are lower for more similar pairs of objects. The term distance is used as a synonym for dissimilarity, although distance is often used to refer to a special class of dissimilarities. Dissimilarities sometimes fall in the interval [0, 1], but it is also common for them to range from 0 to infinity.

Proximity refers to either a similarity or a dissimilarity.

Similarity/Dissimilarity for Simple Attributes

The proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes. Consider objects described by one nominal attribute, where p and q are the attribute values for the two data objects. What would it mean for two such objects to be similar? Since nominal values carry only information about distinctness, similarity is all or nothing: s = 1 if p = q and s = 0 otherwise, with dissimilarity d = 1 - s.

Dissimilarities between Data Objects: Euclidean Distance

$\text{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$

where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the k-th attributes (components) of data objects p and q. Normally the attributes are numeric. Standardization is necessary if scales differ.

Euclidean Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance matrix:

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
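
The distance matrix above can be checked with a few lines of Python (math.dist requires Python 3.8+); a minimal sketch:

```python
# Reproduce the Euclidean distance matrix for the four example points.
from math import dist  # Euclidean distance between two points

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
for name, pa in points.items():
    row = [round(dist(pa, pb), 3) for pb in points.values()]
    print(name, row)
# p1 [0.0, 2.828, 3.162, 5.099] ... matching the matrix above
```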

Minkowski Distance

The Minkowski distance is a generalization of the Euclidean distance:

$\text{dist}(p, q) = \left(\sum_{k=1}^{n} |p_k - q_k|^r\right)^{1/r}$

where r is a parameter, n is the number of dimensions (attributes), and $p_k$ and $q_k$ are the k-th attributes of data objects p and q. Note that the Minkowski distance is the Euclidean distance when r = 2.

Minkowski Distance: Examples

r = 1: city block (Manhattan, taxicab, L1 norm) distance. A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors.
r = 2: Euclidean distance.
r → ∞: supremum (Lmax norm, L∞ norm) distance. This is the maximum difference between any component of the vectors.
Do not confuse r with n; all these distances are defined for all numbers of dimensions.

Minkowski Distance: Distance Matrices

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Manhattan (L1):

        p1   p2   p3   p4
p1      0    4    4    6
p2      4    0    2    4
p3      4    2    0    2
p4      6    4    2    0

Euclidean (L2):

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

Supremum (L∞):

        p1   p2   p3   p4
p1      0    2    3    5
p2      2    0    1    3
p3      3    1    0    2
p4      5    3    2    0
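
Entries of all three matrices can be reproduced with a small, generic Minkowski function; this is an illustrative sketch rather than library code:

```python
# Minkowski distance for a given r; r=1 gives Manhattan, r=2 Euclidean.
def minkowski(p, q, r):
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

# Supremum (L-infinity) is the limit r -> infinity: the max difference.
def supremum(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

p1, p2 = (0, 2), (2, 0)
print(minkowski(p1, p2, r=1))   # 4.0       (L1 entry for p1, p2)
print(minkowski(p1, p2, r=2))   # 2.828...  (L2 entry)
print(supremum(p1, p2))         # 2         (L-infinity entry)
```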

(Metric) Properties of Distances

Distances, such as the Euclidean distance, have some well-known properties. If d(x, y) is the distance between x and y, the following properties hold:
1. Positivity: d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y.
2. Symmetry: d(x, y) = d(y, x) for all x and y.
3. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
Measures that satisfy all three properties are known as metrics. Some dissimilarities do not satisfy one or more of the metric properties; examples are set difference and time difference.

Common Properties of a Similarity

Similarities also have some well-known properties:
1. s(p, q) = 1 (or maximum similarity) only if p = q (0 ≤ s ≤ 1).
2. s(p, q) = s(q, p) for all p and q (symmetry).
where s(p, q) is the similarity between points (data objects) p and q.

Similarity Between Binary Vectors: Simple Matching and Jaccard Coefficients

Similarity measures between objects that contain only binary attributes are called similarity coefficients. A common situation is that objects p and q have only binary attributes. Similarities are computed using the following quantities:
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1

The Simple Matching Coefficient counts both presences and absences equally:
SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)

The Jaccard Coefficient is frequently used for asymmetric binary attributes:
J = number of 11 matches / number of not-both-zero attribute values = M11 / (M01 + M10 + M11)

SMC versus Jaccard Coefficient: Example

p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
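
A small sketch that reproduces the example above; the helper name smc_and_jaccard is invented for illustration:

```python
# SMC and Jaccard for binary vectors, following the M_xy counts above.
def smc_and_jaccard(p, q):
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
    mismatch = sum(a != b for a, b in zip(p, q))   # M01 + M10
    smc = (m11 + m00) / len(p)
    jaccard = m11 / (m11 + mismatch) if m11 + mismatch else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))  # (0.7, 0.0), matching the example above
```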

Cosine Similarity

Cosine similarity is a common measure for document similarity. A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. If $d_1$ and $d_2$ are two document vectors, then

$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$

where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d.

Cosine Similarity: Example

Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5·3 + 0·0 + 3·2 + 0·0 + 2·1 + 0·1 + 0·0 + 2·1 + 0·0 + 0·1 = 25
‖d1‖ = (5·5 + 0·0 + 3·3 + 0·0 + 2·2 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0)^0.5 = 42^0.5 ≈ 6.481
‖d2‖ = (3·3 + 0·0 + 2·2 + 0·0 + 1·1 + 1·1 + 0·0 + 1·1 + 0·0 + 1·1)^0.5 = 17^0.5 ≈ 4.123

cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
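
A minimal sketch reproducing the computation above:

```python
# Cosine similarity between two term-frequency vectors.
from math import sqrt

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = sqrt(sum(a * a for a in d1))
    norm2 = sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine(d1, d2), 2))  # 0.94, as in the worked example
```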

Cosine Similarity

Cosine similarity really is a measure of the (cosine of the) angle between x and y:
If the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for magnitude.
If the cosine similarity is 0, the angle between x and y is 90°, and they do not share any terms.
Cosine similarity can also be written as

$\cos(x, y) = \frac{x}{\|x\|} \cdot \frac{y}{\|y\|}$

Dividing x and y by their lengths normalizes them to have a length of 1. This means that cosine similarity does not take the magnitude of the two data objects into account when computing similarity. Euclidean distance might be a better choice when magnitude is important.

Extended Jaccard Coefficient (Tanimoto Coefficient)

The extended Jaccard coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes. It is also known as the Tanimoto coefficient:

$EJ(x, y) = \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}$

Correlation

Correlation measures the linear relationship between objects. Pearson's correlation coefficient between two data objects x and y is

$\text{corr}(x, y) = \frac{s_{xy}}{s_x \, s_y}$

where

$s_{xy} = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \bar{x})(y_k - \bar{y}), \quad s_x = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n}(x_k - \bar{x})^2}, \quad s_y = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n}(y_k - \bar{y})^2}$

Correlation: Perfect Correlation

Correlation is always in the range −1 to 1. A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship.

A perfect negative linear relationship (correlation −1):
x = (−3, 6, 0, 3, −6), y = (1, −2, 0, −1, 2)
s_xy = −7.5, s_x = 4.74341649, s_y = 1.58113883, corr(x, y) = −1

A perfect positive linear relationship (correlation +1):
x = (3, 6, 0, 3, 6), y = (1, 2, 0, 1, 2)
s_xy = 2.1, s_x = 2.50998008, s_y = 0.836660027, corr(x, y) = +1
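
A sketch of Pearson's correlation using the (n − 1)-denominator sample statistics defined above; it reproduces both perfect-correlation examples:

```python
# Pearson correlation via sample covariance and standard deviations.
from math import sqrt

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    s_y = sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)

print(corr([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]))  # -1.0
print(corr([3, 6, 0, 3, 6], [1, 2, 0, 1, 2]))      # 1.0 (up to rounding)
```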

Visually Evaluating Correlation

Figure: scatter plots showing correlations from −1 to 1.

Issues in Proximity Calculation

Important issues related to proximity measures:
(1) How to handle the case in which attributes have different scales and/or are correlated.
(2) How to calculate proximity between objects that are composed of different types of attributes, e.g., quantitative and qualitative.
(3) How to handle proximity calculation when attributes have different weights, i.e., when not all attributes contribute equally to the proximity of objects.

Standardization and Correlation for Distance Measures

An important issue with distance measures is how to handle the situation when attributes do not have the same range of values. This situation is often described by saying that "the variables have different scales." Example: Euclidean distance is used to measure the distance between people based on two attributes, age and income. Unless these two attributes are standardized, the distance between two people will be dominated by income; we have to standardize the attributes so that both have the same range (e.g., 0 to 1). A related issue is how to compute distance when there is correlation between some of the attributes. A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have different ranges of values, and the distribution of the data is approximately Gaussian.

Z-score: Standardizing Numeric Data

$z = \frac{x - \mu}{\sigma}$

where x is the raw score to be standardized, μ is the mean of the population, and σ is its standard deviation. The z-score is the distance between the raw score and the population mean in units of the standard deviation; it is negative when the raw score is below the mean and positive when above.

An alternative way: calculate the mean absolute deviation

$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$

where

$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$

and use the standardized measure (z-score)

$z_{if} = \frac{x_{if} - m_f}{s_f}$

Using the mean absolute deviation is more robust than using the standard deviation.
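
A sketch contrasting the standard z-score with the mean-absolute-deviation variant; the data is invented to include one outlier:

```python
# Standard z-score vs. the mean-absolute-deviation (MAD) variant.
def z_scores(xs):
    m = sum(xs) / len(xs)
    s = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5   # population sd
    return [(x - m) / s for x in xs]

def z_scores_mad(xs):
    m = sum(xs) / len(xs)
    s = sum(abs(x - m) for x in xs) / len(xs)              # mean abs. deviation
    return [(x - m) / s for x in xs]

data = [20, 30, 40, 50, 160]              # 160 is an outlier
print(round(z_scores(data)[-1], 2))       # 1.96: sd inflated by the outlier
print(round(z_scores_mad(data)[-1], 2))   # 2.5: MAD inflates less, so the
                                          # outlier's z-score stays larger
```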

Mahalanobis Distance

$\text{mahalanobis}(p, q) = (p - q)\, \Sigma^{-1} (p - q)^T$

where $\Sigma^{-1}$ is the inverse of the covariance matrix of the input data X. The covariance matrix $\Sigma$ is the matrix whose (j, k)-th entry is the covariance of the j-th and k-th attributes:

$\Sigma_{j,k} = \frac{1}{n-1}\sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$

Figure: for the red points shown, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.

Mahalanobis Distance: Example

Covariance matrix:

$\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}$

A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)

Mahalanobis(A, B) = 5
Mahalanobis(A, C) = 4
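
A numpy-based sketch reproducing this example, using the quadratic form consistent with the slide's values:

```python
# Mahalanobis distance (quadratic form) for the example covariance matrix.
import numpy as np

sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])           # covariance matrix from the example
sigma_inv = np.linalg.inv(sigma)

def mahalanobis(p, q):
    d = np.asarray(p) - np.asarray(q)
    return float(d @ sigma_inv @ d)      # (p - q) Sigma^{-1} (p - q)^T

A, B, C = (0.5, 0.5), (0, 1), (1.5, 1.5)
print(round(mahalanobis(A, B), 3))  # 5.0
print(round(mahalanobis(A, C), 3))  # 4.0
```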

General Approach for Combining Similarities

Sometimes attributes are of many different types, but an overall similarity is needed. The following algorithm is effective for computing an overall similarity between two objects, x and y, with different types of attributes:
1. For the k-th attribute, compute a similarity $s_k(x, y)$ in the range [0, 1].
2. Define an indicator variable $\delta_k$ for the k-th attribute: $\delta_k = 0$ if the k-th attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the k-th attribute; $\delta_k = 1$ otherwise.
3. Compute the overall similarity as

$\text{similarity}(x, y) = \frac{\sum_{k=1}^{n} \delta_k s_k(x, y)}{\sum_{k=1}^{n} \delta_k}$
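
A sketch of this algorithm; the simple exact-match similarity used for s_k and the asymmetric flags are illustrative assumptions, and missing values are modeled as None:

```python
# Combine per-attribute similarities, skipping attributes with
# delta_k = 0 (asymmetric joint absence, or a missing value).
def combined_similarity(x, y, asymmetric):
    num = den = 0.0
    for k, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            continue                      # delta_k = 0: missing value
        if asymmetric[k] and a == 0 and b == 0:
            continue                      # delta_k = 0: joint absence
        s_k = 1.0 if a == b else 0.0      # simple match, for illustration
        num += s_k
        den += 1
    return num / den if den else 0.0

x = (1, 0, "red", None)
y = (1, 0, "blue", 5)
print(combined_similarity(x, y, asymmetric=(True, True, False, False)))
# 0.5: attributes 2 and 4 are skipped, attribute 1 matches, 3 does not
```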

Using Weights to Combine Similarities

We may not want to treat all attributes the same. Use weights $w_k$ that are between 0 and 1 and sum to 1. The weighted overall similarity becomes

$\text{similarity}(x, y) = \frac{\sum_{k=1}^{n} w_k \delta_k s_k(x, y)}{\sum_{k=1}^{n} w_k \delta_k}$

and the modified Minkowski distance becomes

$\text{dist}(p, q) = \left(\sum_{k=1}^{n} w_k |p_k - q_k|^r\right)^{1/r}$
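
A short sketch of the weighted Minkowski variant; the weights below are invented for illustration and follow the convention of summing to 1:

```python
# Weighted Minkowski distance: attributes with larger w_k contribute
# more to the overall distance. Weights here are illustrative.
def weighted_minkowski(p, q, w, r=2):
    return sum(wk * abs(a - b) ** r
               for wk, a, b in zip(w, p, q)) ** (1 / r)

# Weight the first attribute three times as heavily as the second.
print(weighted_minkowski((0, 2), (2, 0), w=(0.75, 0.25)))  # 2.0
```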

Selecting the Right Proximity Measure

For many types of dense, continuous data, metric distance measures such as Euclidean distance are often used. The cosine, Jaccard, and extended Jaccard measures are appropriate for sparse, asymmetric data, where most objects have only a few of the characteristics described by the attributes and thus are highly similar in terms of the characteristics they do not have. In some cases, transformation or normalization of the data is important for obtaining a proper similarity measure, since such transformations are not always built into proximity measures. The proper choice of a proximity measure can be a time-consuming task that requires careful consideration of both domain knowledge and the purpose for which the measure is being used. A number of different similarity measures may need to be evaluated to see which ones produce results that make the most sense.