Descriptive Data Summarization

Beverly Conley
6 years ago

1 Descriptive Data Summarization Descriptive data summarization gives the general characteristics of the data and identifies the presence of noise or outliers, which is useful for successful data cleaning and data integration. For data preprocessing tasks, we want to learn about data characteristics regarding both the central tendency and the dispersion of the data. Measures of central tendency include the mean, median, mode, and midrange. Measures of data dispersion include quartiles, the interquartile range (IQR), and variance. These descriptive statistics are of great help in understanding the distribution of the data. Data Mining 1

2 Measuring Central Tendency: Mean The most common and most effective numerical measure of the center of a set of data is the arithmetic mean. Arithmetic Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. Sometimes each value $x_i$ in a set is associated with a weight $w_i$. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. Weighted Arithmetic Mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
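
The two means above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the salary values and counts are hypothetical, and the function names are chosen here for readability.

```python
def arithmetic_mean(values):
    """Plain arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical data: three distinct salaries (in thousands) and how often each occurs.
salaries = [30, 50, 70]
counts = [2, 1, 1]

plain = arithmetic_mean(salaries)           # every distinct value counts once
weighted = weighted_mean(salaries, counts)  # counts act as occurrence frequencies
print(plain, weighted)
```

Using occurrence counts as weights gives the same result as taking the plain mean of the expanded list [30, 30, 50, 70].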

3 Measuring Central Tendency: Mean Although the mean is the single most useful quantity for describing a data set, it is not always the best way of measuring the center of the data. A major problem with the mean is its sensitivity to extreme (outlier) values. Even a small number of extreme values can corrupt the mean. To offset the effect caused by a small number of extreme values, we can instead use the trimmed mean, which is the mean obtained after chopping off values at the high and low extremes.
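
A trimmed mean can be sketched as follows. The data set and the 10% trimming proportion are hypothetical, and the rule of cutting `int(n * proportion)` values from each end is one common convention rather than something the slides specify.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each extreme
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value (400).
data = [30, 31, 33, 35, 36, 38, 40, 41, 43, 400]

plain = sum(data) / len(data)       # pulled upward by the outlier
trimmed = trimmed_mean(data, 0.10)  # drops the lowest and the highest value
print(plain, trimmed)
```

The single extreme value roughly doubles the plain mean, while the trimmed mean stays close to the bulk of the data.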

4 Measuring Central Tendency: Median Another measure of the center of data is the median. Suppose that a given data set of N distinct values is sorted in numerical order. If N is odd, the median is the middle value of the ordered set; if N is even, the median is the average of the middle two values. For grouped data, the median can be estimated by interpolation: $\text{median} = L_1 + \left( \frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$, where $L_1$ is the lower boundary of the median interval, N is the number of values in the entire data set, $(\sum \text{freq})_l$ is the sum of the frequencies of all of the intervals that are lower than the median interval, $\text{freq}_{\text{median}}$ is the frequency of the median interval, and width is the width of the median interval.
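
The grouped-data estimate can be sketched as below. The interval boundaries and frequencies are hypothetical, and the code assumes the intervals are sorted and contiguous; the median interval is taken to be the first interval at which the cumulative frequency reaches N/2.

```python
def grouped_median(intervals):
    """Estimate the median from (lower_boundary, upper_boundary, frequency) triples.

    Assumes the intervals are sorted in increasing order and contiguous.
    """
    n = sum(freq for _, _, freq in intervals)
    cumulative = 0  # sum of the frequencies of all intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= n / 2:
            # L1 + ((N/2 - (sum freq)_l) / freq_median) * width
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("intervals contain no observations")


# Hypothetical grouped data: value ranges with their frequencies.
groups = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
estimate = grouped_median(groups)
print(estimate)
```

Here N = 20, the median interval is 10 to 20, five values lie below it, and the estimate is 10 + ((10 - 5) / 10) * 10 = 15.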

5 Measuring Central Tendency: Mode Another measure of central tendency is the mode. The mode for a set of data is the value that occurs most frequently in the set. It is possible for the greatest frequency to correspond to several different values, which results in more than one mode. Data sets with one, two, or three modes are called unimodal, bimodal, and trimodal, respectively. At the other extreme, if each data value occurs only once, then there is no mode. For unimodal frequency curves that are moderately skewed (asymmetrical), we have the following empirical relation: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$

6 Symmetric vs. Skewed Data [Figure: median, mean, and mode of symmetric, positively skewed, and negatively skewed data.]

7 Measuring the Dispersion of Data The degree to which numerical data tend to spread is called the dispersion, or variance, of the data. The most common measures of data dispersion are the range, the five-number summary (based on quartiles), the interquartile range, and the standard deviation. Boxplots can be plotted based on the five-number summary and are a useful tool for identifying outliers.

8 Range, Quartiles, Outliers Range: the difference between the largest and smallest values. Quartiles: $Q_1$ (25th percentile), $Q_3$ (75th percentile). Interquartile range: $IQR = Q_3 - Q_1$. Five-number summary: Minimum, $Q_1$, Median, $Q_3$, Maximum. Outliers: a common rule of thumb for identifying suspected outliers is to single out values falling at least $1.5 \times IQR$ above the third quartile or below the first quartile.
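
The five-number summary and the 1.5 × IQR rule can be sketched with Python's standard library. The data are hypothetical. Quartile conventions differ between textbooks and tools; this sketch assumes the "inclusive" interpolation method of `statistics.quantiles`, so other conventions may give slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def suspected_outliers(values):
    """Values at least 1.5 * IQR above Q3 or below Q1."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x <= low or x >= high]


# Hypothetical data with one unusually large value.
data = [2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 40]

summary = five_number_summary(data)
outliers = suspected_outliers(data)
print(summary, outliers)
```

For this data the quartiles are 5.5 and 10.5, so the IQR is 5 and only values at or beyond -2 or 18 are flagged.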

9 Boxplot Analysis Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary: The ends of the box are at the quartiles, so that the box length is the interquartile range, IQR. The median is marked by a line within the box. Two lines outside the box extend to the smallest and largest observations. Outliers: points beyond a specified outlier threshold, plotted individually.

10 Variance and Standard Deviation Variance of N observations: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$, where $\bar{x}$ is the mean value of the observations. The standard deviation σ is the square root of the variance $\sigma^2$. The basic properties of the standard deviation are: σ measures spread about the mean and should be used only when the mean is chosen as the measure of center. σ = 0 only when there is no spread, that is, when all observations have the same value. Otherwise σ > 0.
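
A minimal sketch of the variance and standard deviation as defined above (dividing by N, not N - 1). The data values are hypothetical.

```python
import math


def population_variance(values):
    """Variance of N observations: (1/N) * sum((x_i - mean)^2)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


# Hypothetical observations with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = population_variance(data)
sigma = math.sqrt(variance)  # standard deviation is the square root of the variance
print(variance, sigma)

# When all observations are equal there is no spread, so the variance is 0.
no_spread = population_variance([3, 3, 3])
```

The standard library's `statistics.pvariance` and `statistics.pstdev` compute the same population quantities.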

11 Properties of Normal Distribution Curve The normal (distribution) curve (μ: mean, σ: standard deviation). From μ − σ to μ + σ: contains about 68% of the measurements. From μ − 2σ to μ + 2σ: contains about 95% of them. From μ − 3σ to μ + 3σ: contains about 99.7% of them.

12 Graphic Displays of Basic Statistical Descriptions Boxplot: graphic display of the five-number summary. Histogram: the x-axis shows values and the y-axis shows frequencies. Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are $\le x_i$. Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another. Scatter plot: each pair of values is a pair of coordinates and is plotted as a point in the plane.

13 Histogram Analysis Histogram: a graphical display of tabulated frequencies, shown as bars. It shows what proportion of cases fall into each of several categories. It differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width. The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent.

14 Histogram Analysis In an equal-width histogram, each bucket represents an equal-width range of a numerical attribute.

15 Histograms Often Tell More than Boxplots The two histograms shown on the left may have the same boxplot representation, that is, the same values for min, Q1, median, Q3, and max, but they have rather different data distributions.

16 Quantile Plot Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences). Plots quantile information: for data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i\%$ of the data are below or equal to the value $x_i$.

17 Quantile-Quantile (Q-Q) Plot Graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view whether there is a shift in going from one distribution to another. Suppose that we have two sets of observations for the variable unit price, taken from two different branch locations. Let $x_1, \ldots, x_N$ be the data from the first branch, and $y_1, \ldots, y_M$ be the data from the second, where each data set is sorted in increasing order. If M = N (i.e., the number of points in each set is the same), then we simply plot $y_i$ against $x_i$, where $y_i$ and $x_i$ are both quantiles of their respective data sets. If M < N (i.e., the second branch has fewer observations than the first), there can be only M points on the q-q plot.

18 Quantile-Quantile (Q-Q) Plot [Figure: a quantile-quantile plot for unit price data of items sold at two different branches.] Each point corresponds to the same quantile for each data set and shows the unit price of items sold at branch 1 versus branch 2 for that quantile. For example, the lowest point in the left corner corresponds to the 0.03 quantile. The straight line represents the case in which, for each given quantile, the unit price at each branch is the same. (The darker points correspond to the data for Q1, the median, and Q3, respectively.) The unit price of items sold at branch 1 was slightly less than that at branch 2.

19 Scatter Plot A scatter plot is one of the most effective graphical methods for determining whether there appears to be a relationship, pattern, or trend between two numerical attributes. To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as a point in the plane.

20 Scatter Plot: Positively and Negatively Correlated Data [Figure: two scatter plot fragments. The left half is positively correlated; the right half is negatively correlated.]

21 Scatter Plot: Uncorrelated Data [Figure: examples of scatter plots of uncorrelated data.]

22 Loess Curve A loess curve is another graphic aid that adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence. The word loess is short for local regression.

23 Similarity and Dissimilarity Similarity: The similarity between two objects is a numerical measure of the degree to which the two objects are alike. Similarities are higher for pairs of objects that are more alike. Similarities are usually non-negative and are often between 0 (no similarity) and 1 (complete similarity). Dissimilarity: The dissimilarity between two objects is a numerical measure of the degree to which the two objects are different. Dissimilarities are lower for more similar pairs of objects. The term distance is used as a synonym for dissimilarity, although distance is often used to refer to a special class of dissimilarities. Dissimilarities sometimes fall in the interval [0, 1], but it is also common for them to range from 0 to $\infty$. Proximity refers to either a similarity or a dissimilarity.

24 Similarity/Dissimilarity for Simple Attributes The proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes. Consider objects described by one nominal attribute. What would it mean for two such objects to be similar? Here p and q are the attribute values for two data objects.

25 Euclidean Distance Dissimilarities between Data Objects. Euclidean Distance: $dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$, where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the k-th attributes (components) of data objects p and q. Normally attributes are numeric. Standardization is necessary if scales differ.

26 Euclidean Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
[Figure: the four points plotted in the plane. Distance matrix: pairwise Euclidean distances among p1, p2, p3, and p4.]

27 Minkowski Distance Minkowski Distance is a generalization of Euclidean Distance: $dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$, where r is a parameter, n is the number of dimensions (attributes), and $p_k$ and $q_k$ are the k-th attributes of data objects p and q. Note that Minkowski Distance is Euclidean Distance when r = 2.

28 Minkowski Distance: Examples r = 1: City block (Manhattan, taxicab, $L_1$ norm) distance. A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors. r = 2: Euclidean distance. r → ∞: supremum ($L_{max}$ norm, $L_\infty$ norm) distance. This is the maximum difference between any component of the vectors. Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.

29 Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
[Distance matrices: pairwise distances among p1, p2, p3, and p4 under Manhattan ($L_1$), Euclidean ($L_2$), and supremum ($L_\infty$) distance.]
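
The three distance matrices for these four points can be recomputed directly from the Minkowski definition. This is a small sketch; the function names are chosen here for illustration, and r = infinity is handled as a special case that returns the largest coordinate difference.

```python
def minkowski(p, q, r):
    """Minkowski distance; r = 1 is Manhattan, r = 2 is Euclidean, r = inf is supremum."""
    if r == float("inf"):
        return max(abs(a - b) for a, b in zip(p, q))
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)


def distance_matrix(points, r):
    """Pairwise distances between all points, in the order the points are listed."""
    names = list(points)
    return [[minkowski(points[a], points[b], r) for b in names] for a in names]


# The four points from the table above.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

l1 = distance_matrix(points, 1)
l2 = distance_matrix(points, 2)
linf = distance_matrix(points, float("inf"))

for label, matrix in (("L1", l1), ("L2", l2), ("Linf", linf)):
    print(label)
    for name, row in zip(points, matrix):
        print(name, [round(d, 3) for d in row])
```

For example, the distance between p1 and p2 is 4 under $L_1$, $\sqrt{8} \approx 2.828$ under $L_2$, and 2 under $L_\infty$.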

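30 (Metric) Properties of Distances Distances, such as the Euclidean distance, have some well-known properties. If d(x, y) is the distance between x and y, then the following properties hold. 1. $d(x, y) \ge 0$ for all x and y, and d(x, y) = 0 only if x = y (positivity). 2. d(x, y) = d(y, x) for all x and y (symmetry). 3. $d(x, z) \le d(x, y) + d(y, z)$ for all x, y, and z (triangle inequality). Measures that satisfy all three properties are known as metrics. Some dissimilarities do not satisfy one or more of the metric properties. Examples: set difference, time difference.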

31 Common Properties of a Similarity Similarities also have some well-known properties. 1. s(p, q) = 1 (or maximum similarity) only if p = q ($0 \le s \le 1$). 2. s(p, q) = s(q, p) for all p and q (symmetry). Here s(p, q) is the similarity between points (data objects) p and q.

32 Similarity Between Binary Vectors: Simple Matching and Jaccard Coefficients Similarity measures between objects that contain only binary attributes are called similarity coefficients. A common situation is that objects p and q have only binary attributes. Compute similarities using the following quantities: $M_{01}$ = the number of attributes where p was 0 and q was 1; $M_{10}$ = the number of attributes where p was 1 and q was 0; $M_{00}$ = the number of attributes where p was 0 and q was 0; $M_{11}$ = the number of attributes where p was 1 and q was 1. The Simple Matching Coefficient counts both presences and absences equally: SMC = number of matches / number of attributes = $(M_{11} + M_{00}) / (M_{01} + M_{10} + M_{11} + M_{00})$. The Jaccard Coefficient is frequently used for asymmetric binary attributes: J = number of 11 matches / number of not-both-zero attribute values = $M_{11} / (M_{01} + M_{10} + M_{11})$.

33 SMC versus Jaccard Coefficient: Example For two binary vectors p and q with ten attributes: $M_{01} = 2$ (the number of attributes where p was 0 and q was 1), $M_{10} = 1$ (the number of attributes where p was 1 and q was 0), $M_{00} = 7$ (the number of attributes where p was 0 and q was 0), $M_{11} = 0$ (the number of attributes where p was 1 and q was 1). SMC = $(M_{11} + M_{00}) / (M_{01} + M_{10} + M_{11} + M_{00})$ = (0 + 7) / (2 + 1 + 0 + 7) = 0.7. J = $M_{11} / (M_{01} + M_{10} + M_{11})$ = 0 / (2 + 1 + 0) = 0.
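
The example can be checked with a short sketch. The vectors `p` and `q` below are hypothetical: they are chosen only so that they reproduce the counts stated above (2, 1, 7, 0) and are not taken from the original slide.

```python
def binary_counts(p, q):
    """Return (M01, M10, M00, M11) for two equal-length binary vectors."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    return m01, m10, m00, m11


def simple_matching(p, q):
    """SMC = (M11 + M00) / (M01 + M10 + M11 + M00)."""
    m01, m10, m00, m11 = binary_counts(p, q)
    return (m11 + m00) / (m01 + m10 + m11 + m00)


def jaccard(p, q):
    """J = M11 / (M01 + M10 + M11); undefined when both vectors are all zeros."""
    m01, m10, _, m11 = binary_counts(p, q)
    if m01 + m10 + m11 == 0:
        raise ValueError("Jaccard coefficient is undefined for two all-zero vectors")
    return m11 / (m01 + m10 + m11)


# Hypothetical vectors that reproduce the counts in the example.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

print(binary_counts(p, q), simple_matching(p, q), jaccard(p, q))
```

The shared zeros make the two vectors look 70% similar under SMC, while Jaccard ignores them and reports no similarity at all.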

34 Cosine Similarity Cosine similarity is a common measure for document similarity. If $d_1$ and $d_2$ are two document vectors, then $\cos(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|)$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d. A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.

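35 Cosine Similarity: Example $\cos(d_1, d_2) = (d_1 \cdot d_2) / (\|d_1\| \, \|d_2\|)$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d. Find the similarity between documents 1 and 2. $d_1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)$, $d_2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)$. $d_1 \cdot d_2 = 5 \times 3 + 0 \times 0 + 3 \times 2 + 0 \times 0 + 2 \times 1 + 0 \times 1 + 0 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 1 = 25$. $\|d_1\| = (5 \times 5 + 0 \times 0 + 3 \times 3 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 0 \times 0)^{0.5} = (42)^{0.5} = 6.48$. $\|d_2\| = (3 \times 3 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 1 \times 1 + 1 \times 1 + 0 \times 0 + 1 \times 1 + 0 \times 0 + 1 \times 1)^{0.5} = (17)^{0.5} = 4.12$. $\cos(d_1, d_2) = 0.94$.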
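
The worked example can be reproduced with a few helper functions. This is a plain-Python sketch; the helper names are chosen here for illustration.

```python
import math


def dot(a, b):
    """Vector dot product."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)."""
    return dot(a, b) / (norm(a) * norm(b))


# The two document vectors from the example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(dot(d1, d2), round(norm(d1), 2), round(norm(d2), 2), round(similarity, 2))
```

The dot product is 25 and the lengths are the square roots of 42 and 17, which gives a similarity of about 0.94.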

36 Cosine Similarity Cosine similarity is a measure of the (cosine of the) angle between x and y. If the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for magnitude (length). If the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any terms. Cosine similarity can be written as $\cos(x, y) = \frac{x}{\|x\|} \cdot \frac{y}{\|y\|}$. Dividing x and y by their lengths normalizes them to have a length of 1. This means that cosine similarity does not take the magnitude of the two data objects into account when computing similarity. Euclidean distance might be a better choice when magnitude is important.

37 Extended Jaccard Coefficient (Tanimoto Coefficient) The Extended Jaccard Coefficient can be used for document data, and it reduces to the Jaccard coefficient in the case of binary attributes. The Extended Jaccard Coefficient is also known as the Tanimoto coefficient.
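
The slide text does not give the formula. A minimal sketch, assuming the usual definition $EJ(x, y) = \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}$, shows why it reduces to the Jaccard coefficient for binary vectors. The binary vectors below are hypothetical.

```python
def extended_jaccard(x, y):
    """Tanimoto coefficient: x.y / (||x||^2 + ||y||^2 - x.y).

    Undefined (division by zero) when both vectors are all zeros.
    """
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    return xy / (xx + yy - xy)


# Hypothetical binary vectors: M11 = 2, M10 = 1, M01 = 1, so Jaccard = 2 / 4.
p = [1, 1, 0, 1, 0]
q = [1, 0, 0, 1, 1]
binary_result = extended_jaccard(p, q)

# The same function also accepts non-binary (e.g. term-frequency) vectors.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
document_result = extended_jaccard(d1, d2)
print(binary_result, round(document_result, 3))
```

For binary vectors, $x \cdot y = M_{11}$, $\|x\|^2 = M_{11} + M_{10}$, and $\|y\|^2 = M_{11} + M_{01}$, so the denominator becomes $M_{01} + M_{10} + M_{11}$.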

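38 Correlation Correlation measures the linear relationship between objects. Pearson's correlation coefficient between two data objects, x and y: $corr(x, y) = \frac{s_{xy}}{s_x \, s_y}$, where $s_{xy} = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})$ is the covariance of x and y, $s_x = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})^2}$ and $s_y = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (y_k - \bar{y})^2}$ are their standard deviations, and $\bar{x}$ and $\bar{y}$ are their means.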

39 Correlation: Perfect Correlation Correlation is always in the range −1 to 1. A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship. A perfect negative linear relationship (correlation: −1): x = (−3, 6, 0, 3, −6), y = (1, −2, 0, −1, 2), $s_{xy} = -7.5$, $s_x = 4.74$, $s_y = 1.58$, corr(x, y) = −1. A perfect positive linear relationship (correlation: +1): x = (3, 6, 0, 3, 6), y = (1, 2, 0, 1, 2), $s_{xy} = 2.1$, $s_x = 2.51$, $s_y = 0.84$, corr(x, y) = +1.
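
Both examples can be verified with a short sketch that assumes the sample (n − 1) form of the covariance and standard deviation, which is the form that reproduces the covariance values −7.5 and 2.1 stated above.

```python
import math


def covariance(x, y):
    """Sample covariance: (1 / (n - 1)) * sum((x_k - mean_x) * (y_k - mean_y))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Pearson correlation: s_xy / (s_x * s_y)."""
    s_x = math.sqrt(covariance(x, x))
    s_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)


# Perfect negative linear relationship (x = -3 * y).
x_neg, y_neg = (-3, 6, 0, 3, -6), (1, -2, 0, -1, 2)
# Perfect positive linear relationship (x = 3 * y).
x_pos, y_pos = (3, 6, 0, 3, 6), (1, 2, 0, 1, 2)

print(covariance(x_neg, y_neg), round(correlation(x_neg, y_neg), 6))
print(round(covariance(x_pos, y_pos), 6), round(correlation(x_pos, y_pos), 6))
```

Because of floating-point rounding, the computed correlations may differ from exactly −1 and +1 in the last few digits, which is why the output is rounded.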

40 Visually Evaluating Correlation [Figure: scatter plots showing correlations from −1 to 1.]

41 Issues in Proximity Calculation Important issues related to proximity measures: (1) How to handle the case in which attributes have different scales and/or are correlated, (2) How to calculate proximity between objects that are composed of different types of attributes, e.g., quantitative and qualitative, (3) How to handle proximity calculation when attributes have different weights; i.e., when not all attributes contribute equally to the proximity of objects.

42 Standardization and Correlation for Distance Measures An important issue with distance measures is how to handle the situation when attributes do not have the same range of values. This situation is often described by saying that "the variables have different scales." Example: Euclidean distance is used to measure the distance between people based on two attributes: age and income. Unless these two attributes are standardized, the distance between two people will be dominated by income. We have to rescale both attributes to the same range (e.g., 0 to 1). A related issue is how to compute distance when there is correlation between some of the attributes. A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have different ranges of values, and the distribution of the data is approximately Gaussian.

43 Z-score: Standardizing Numeric Data $z = \frac{x - \mu}{\sigma}$, where x is the raw score to be standardized, μ is the mean of the population, and σ is the standard deviation. The z-score is the distance between the raw score and the population mean in units of the standard deviation; it is negative when the raw score is below the mean and positive when above. An alternative way: calculate the mean absolute deviation $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$, and use the standardized measure (z-score) $z_{if} = \frac{x_{if} - m_f}{s_f}$. Using the mean absolute deviation is more robust than using the standard deviation.
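
Both standardizations can be sketched as follows. The attribute values (ages) are hypothetical, and the first function assumes the population standard deviation, as computed by `statistics.pstdev`.

```python
import statistics


def z_scores(values):
    """Standardize with the mean and the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


def z_scores_mad(values):
    """Standardize with the mean m_f and the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    if s_f == 0:
        raise ValueError("all values are equal; the z-score is undefined")
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute values (ages).
ages = [20, 30, 40, 50, 60]

z_std = z_scores(ages)
z_mad = z_scores_mad(ages)
print([round(z, 3) for z in z_std])
print([round(z, 3) for z in z_mad])
```

Deviations are not squared in the mean absolute deviation, so a single extreme value inflates $s_f$ less than it inflates σ, and the z-scores of outliers stay more visible.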

44 Mahalanobis Distance $mahalanobis(p, q) = (p - q) \, \Sigma^{-1} \, (p - q)^T$, where Σ is the covariance matrix of the input data X, $\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$, and $\Sigma^{-1}$ is the inverse of the covariance matrix of the data. Note that the covariance matrix is the matrix whose ij-th entry is the covariance of the i-th and j-th attributes. [Figure: scatter of the data with two points marked in red.] For the red points, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.

45 Mahalanobis Distance [Figure: points A, B, and C plotted with the data; the covariance matrix Σ is shown in the figure.] A: (0.5, 0.5), B: (0, 1), C: (1.5, 1.5). Mahalanobis(A, B) = 5, Mahalanobis(A, C) = 4.
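
The two values can be reproduced with a small sketch for the two-dimensional case. The covariance matrix entries are not present in this text, so the matrix below is an assumption: [[0.3, 0.2], [0.2, 0.3]] is used because it yields the stated results 5 and 4. The function returns the quadratic form $(p - q) \Sigma^{-1} (p - q)^T$ as written in the slides; many references take the square root of this quantity.

```python
def mahalanobis_2d(p, q, cov):
    """Quadratic form (p - q) * inverse(cov) * (p - q)^T for 2-D points."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix: (1 / det) * [[d, -b], [-c, a]]
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = p[0] - q[0], p[1] - q[1]
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)


# Assumed covariance matrix (not recoverable from the text; see the note above).
cov = ((0.3, 0.2), (0.2, 0.3))

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

d_ab = mahalanobis_2d(A, B, cov)
d_ac = mahalanobis_2d(A, C, cov)
print(round(d_ab, 6), round(d_ac, 6))
```

C is farther from A than B is in Euclidean terms, yet its Mahalanobis value is smaller, because the displacement from A to C lies along the direction in which the (assumed) data varies most.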

46 General Approach for Combining Similarities Sometimes attributes are of many different types, but an overall similarity is needed. The following algorithm is effective for computing an overall similarity between two objects, x and y, with different types of attributes.

47 Using Weights to Combine Similarities We may not want to treat all attributes the same. Use weights $w_k$ that are between 0 and 1 and sum to 1. Modified Minkowski distance: $dist(p, q) = \left( \sum_{k=1}^{n} w_k \, |p_k - q_k|^r \right)^{1/r}$
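
A weighted Minkowski distance can be sketched as follows, assuming the usual form in which each term $|p_k - q_k|^r$ is multiplied by its weight $w_k$. The points and weights are hypothetical.

```python
def weighted_minkowski(p, q, weights, r):
    """(sum_k w_k * |p_k - q_k|^r)^(1/r), with weights in [0, 1] that sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * abs(a - b) ** r for a, b, w in zip(p, q, weights)) ** (1 / r)


# Hypothetical points and weights.
p, q = (0, 2), (2, 0)

equal = weighted_minkowski(p, q, (0.5, 0.5), 2)      # both attributes count equally
first_only = weighted_minkowski(p, q, (1.0, 0.0), 2)  # second attribute is ignored
print(equal, first_only)
```

With equal weights of 1/n the result is the unweighted Minkowski distance scaled by $(1/n)^{1/r}$, so the ranking of object pairs by distance is unchanged.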

48 Selecting the Right Proximity Measure For many types of dense, continuous data, metric distance measures such as Euclidean distance are often used. The cosine, Jaccard, and extended Jaccard measures are appropriate for sparse, asymmetric data, where most objects have only a few of the characteristics described by the attributes and thus are highly similar in terms of the characteristics they do not have. In some cases, transformation or normalization of the data is important for obtaining a proper similarity measure, since such transformations are not always present in proximity measures. The proper choice of a proximity measure can be a time-consuming task that requires careful consideration of both domain knowledge and the purpose for which the measure is being used. A number of different similarity measures may need to be evaluated to see which ones produce results that make the most sense.
