


1 Foundation of Data Mining
  Topic: Data
  CMSC 49D/69D, CSEE Department, UMBC
  Some of the slides used in this presentation are prepared by Jiawei Han and Micheline Kamber

Data
  Data types
  Quality of data
  Similarity of data objects
  Pre-processing

2 Type of data
  Interval-scaled variables
  Binary variables
  Nominal, ordinal, and ratio variables
  Variables of mixed types

Attribute Types
  Nominal: provides enough information to distinguish one object from another (e.g., name)
  Ordinal: provides enough information to order objects (e.g., letter grades in a course)
  Interval: differences between values are meaningful (e.g., calendar dates, temperature)
  Ratio: both differences and ratios are meaningful

3 Discrete and Continuous Attributes
  Discrete attributes: take a finite or countably infinite number of values. Example: binary variables
  Continuous attributes: usually implies real-valued attributes in the data mining literature

Asymmetric Data
  The value 0 is not important
  Market basket data
    Boolean representation: 1 stands for the purchase of an item in a transaction, 0 otherwise

4 Some Characteristics of Data
  Dimensionality
  Sparsity
  Resolution

Structure of the Data
  Graph data: Web data, molecular data, telephone call data
  Sequence data: DNA sequences
  Time series data: stock market
  Spatial data: Earth science data, traffic monitoring data

5 Data Quality
  Measurement and data collection errors
    Noise
    Precision: closeness of repeated measurements to one another
    Bias: systematic variation of the measurements from the true value
    Accuracy: closeness of the measurements to the true value
  Missing values
  Inconsistent values
  Duplicate data

Sampling
  Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
  Choose a representative subset of the data
    Simple random sampling may perform very poorly in the presence of skew
  Develop adaptive sampling methods
    Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
    Used in conjunction with skewed data
  Sampling may not reduce database I/Os (data is read a page at a time)

6 Sampling
  Simple random sampling
    With replacement
    Without replacement
  Sample size
    Progressive sampling
  Stratified sampling

[Figure: Sampling (raw data)]
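The sampling schemes listed above can be sketched in a few lines. This is only an illustrative sketch: the function names `simple_random_sample` and `stratified_sample`, the `label_of` callback, and the `fraction` parameter are assumptions made for the example, not part of the slides.

```python
import random
from collections import defaultdict

def simple_random_sample(data, k, replace=False, seed=None):
    """Draw k items uniformly at random, with or without replacement."""
    rng = random.Random(seed)
    if replace:
        return [rng.choice(data) for _ in range(k)]
    return rng.sample(data, k)

def stratified_sample(data, label_of, fraction, seed=None):
    """Sample about `fraction` of each stratum, keeping at least one item per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in data:
        strata[label_of(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Skewed toy data (hypothetical): 95 records of class "a", 5 of class "b".
records = [("a", i) for i in range(95)] + [("b", i) for i in range(5)]
strat = stratified_sample(records, label_of=lambda r: r[0], fraction=0.2, seed=0)
srs = simple_random_sample(records, 10, replace=False, seed=1)
```

With skewed data, a small simple random sample can miss the rare class entirely, whereas the stratified version keeps each class at roughly its original proportion.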

7 Sampling
[Figure: raw data and a cluster/stratified sample]

Discretization
  Three common types of attributes:
    Nominal: values from an unordered set
    Ordinal: values from an ordered set
    Continuous: real numbers
  Discretization: divide the range of a continuous attribute into intervals
    Some classification algorithms only accept categorical attributes
    Reduce data size by discretization
    Prepare for further analysis
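As a simple illustration of dividing a continuous range into intervals, the sketch below uses equal-width binning. Equal-width binning is an assumption chosen for the example (the slide does not name a specific method), and `equal_width_bins` and the `ages` data are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each value to the index (0 .. n_bins-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls into the first interval.
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [3, 17, 25, 42, 58, 60, 91]
bins = equal_width_bins(ages, 3)
```

The resulting bin indices can then be used as a categorical attribute by algorithms that do not accept continuous input.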

8 Histograms
  A popular data reduction technique
  Divide data into buckets and store the average (sum) for each bucket
  [Figure: histogram]

Entropy-Based Discretization
  Given a set of samples S, if S is partitioned into two intervals S_1 and S_2 using boundary T, the entropy after partitioning is

    \[ I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2) \]

  where

    \[ \mathrm{Ent}(S_1) = -\sum_{i=1}^{m} p_i \log_2 p_i \]

  and p_i is the probability of class i in S_1.
  The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
  The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g.,

    \[ \mathrm{Ent}(S) - I(S, T) > \delta \]
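One step of entropy-based discretization can be sketched as follows. The sketch assumes candidate boundaries are the midpoints between consecutive distinct sorted values, which is a common choice but not stated on the slide; `entropy` and `best_boundary` are hypothetical names.

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(S) = -sum_i p_i * log2(p_i) over the class labels in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    """Return (T, I(S, T)) for the boundary T with the lowest weighted entropy.

    Returns None when all values are identical (no boundary exists).
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no boundary between equal values
        t = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        cost = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["lo", "lo", "lo", "hi", "hi", "hi"]
t, cost = best_boundary(values, labels)
gain = entropy(labels) - cost  # Ent(S) - I(S, T)
```

A full discretizer would call `best_boundary` recursively on each side of T and compare `gain` against a threshold to decide when to stop.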

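9 Norm
  A normed vector space associates every vector x with a real number \|x\| such that
    \|x\| \ge 0, and \|x\| = 0 if and only if x = 0
    \|\lambda x\| = |\lambda| \, \|x\|
    \|x + y\| \le \|x\| + \|y\|
  The metric associated with the norm is d(x, y) = \|x - y\|
  l_p norms:

    \[ \|x\|_p = \Big( \sum_{i=1}^{n} |x_i|^p \Big)^{1/p} \]

Continued
  l_1 norm: p = 1
  l_2 norm: p = 2, related to Euclidean distance
  l_\infty norm: p \to \infty, also called the max norm, \|x\|_\infty = \max_i |x_i|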
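The l_p norms above are easy to check numerically. The following is a minimal sketch; `lp_norm` is a hypothetical helper, and passing `float("inf")` for the max norm is a convention chosen for this example.

```python
def lp_norm(x, p):
    """l_p norm of a vector x; p = float('inf') gives the max norm."""
    if p == float("inf"):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [3.0, -4.0]
l1 = lp_norm(x, 1)               # |3| + |-4|
l2 = lp_norm(x, 2)               # sqrt(9 + 16)
lmax = lp_norm(x, float("inf"))  # max(|3|, |-4|)
```

For this vector the three norms are 7, 5, and 4, which illustrates that the l_p norm does not increase as p grows.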

10 Basics
  Similarity matrix
  Dissimilarity matrix
  Normalizing the distance values
    Linear mapping: s' = (s - min_s) / (max_s - min_s)
    s and s' are the true and normalized similarity values

Continued
  What if max_s = \infty? Use a non-linear transformation
    Example: d' = d / (1 + d)
    d and d' are the original and modified distance values, respectively
  Transforming similarities to dissimilarities
    d = 1 - s if s falls in [0, 1]
    By negation: d = -s
    Other examples: s = 1 / (1 + d), s = e^{-d}
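The mappings above can be written directly as small functions. This is a sketch only; all function names are hypothetical, and returning zeros when every value is equal is an assumption made to avoid division by zero.

```python
import math

def min_max_normalize(values):
    """Linear mapping s' = (s - min_s) / (max_s - min_s) onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: constant input maps to 0
    return [(v - lo) / (hi - lo) for v in values]

def squash_distance(d):
    """Non-linear mapping d' = d / (1 + d), sending [0, inf) into [0, 1)."""
    return d / (1.0 + d)

def similarity_from_distance(d):
    """s = 1 / (1 + d): distance 0 gives similarity 1."""
    return 1.0 / (1.0 + d)

def exp_similarity(d):
    """s = e^{-d}: another decreasing map from distance to similarity."""
    return math.exp(-d)

normalized = min_max_normalize([2.0, 4.0, 6.0])
```

The non-linear forms are useful when distances are unbounded, since no finite max_s exists for the linear mapping.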

11 Similarity and Dissimilarity Between Objects
  Distances are normally used to measure the similarity or dissimilarity between two data objects
  Some popular ones include:
    Minkowski distance:

    \[ d(i, j) = \big( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \big)^{1/q} \]

    where i = (x_{i1}, x_{i2}, \ldots, x_{ip}) and j = (x_{j1}, x_{j2}, \ldots, x_{jp}) are two p-dimensional data objects, and q is a positive integer
    If q = 1, d is the Manhattan distance:

    \[ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]

Properties of Distance Metrics
    If q = 2, d is the Euclidean distance:

    \[ d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 } \]

  Metric properties
    d(i, j) \ge 0
    d(i, i) = 0
    d(i, j) = d(j, i)
    d(i, j) \le d(i, k) + d(k, j)
  One can also use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures.
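A minimal sketch of the Minkowski distance, with the order written as q and the objects as equal-length sequences. The function name `minkowski` and the sample points are hypothetical.

```python
def minkowski(i, j, q):
    """Minkowski distance of order q between two equal-length numeric objects."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

i = (1.0, 2.0)
j = (4.0, 6.0)
k = (0.0, 0.0)

manhattan = minkowski(i, j, 1)  # q = 1
euclidean = minkowski(i, j, 2)  # q = 2

# Spot-check the metric properties on these three points.
symmetric = minkowski(i, j, 2) == minkowski(j, i, 2)
triangle_ok = minkowski(i, j, 2) <= minkowski(i, k, 2) + minkowski(k, j, 2)
```

Here the Manhattan distance is 3 + 4 = 7 and the Euclidean distance is sqrt(9 + 16) = 5.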

12 Non-Metric Dissimilarities
  Given two sets A and B
    A - B = set of elements of A that are not in B
    Distance(A, B) = |A - B|
    Not a metric (it is not symmetric: in general |A - B| \ne |B - A|)

Binary Variables
  A contingency table for binary data

                     Object j
                     1        0        sum
    Object i   1     a        b        a + b
               0     c        d        c + d
               sum   a + c    b + d    p

  Simple matching coefficient (invariant, if the binary variable is symmetric):

    \[ d(i, j) = \frac{b + c}{a + b + c + d} \]

  Jaccard coefficient (noninvariant if the binary variable is asymmetric):

    \[ d(i, j) = \frac{b + c}{a + b + c} \]

13 Continued
  Symmetric binary variables: both states are equally valuable and carry equal weight
  Asymmetric binary variables: the states are not equally important

Example: Dissimilarity between Binary Variables

    Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
    Jack  M       Y      N      P       N       N       N
    Mary  F       Y      N      P       N       P       N
    Jim   M       Y      P      N       N       N       N

  Gender is a symmetric attribute
  The remaining attributes are asymmetric binary
  Let the values Y and P be set to 1, and the value N be set to 0
  Applying the Jaccard coefficient to the asymmetric attributes:

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
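The example can be reproduced from the table by counting the contingency cells. This sketch encodes Y and P as 1 and N as 0 and uses only the six asymmetric attributes (fever, cough, and the four tests); the function names are hypothetical.

```python
def binary_counts(x, y):
    """Contingency counts (a, b, c, d) for two 0/1 vectors of equal length."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching(x, y):
    """(b + c) / (a + b + c + d): suited to symmetric binary variables."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)

def jaccard_dissimilarity(x, y):
    """(b + c) / (a + b + c): ignores 0-0 matches (asymmetric variables)."""
    a, b, c, _ = binary_counts(x, y)
    if a + b + c == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = jaccard_dissimilarity(jack, mary)
d_jack_jim = jaccard_dissimilarity(jack, jim)
d_jim_mary = jaccard_dissimilarity(jim, mary)
```

Because the many shared negative results are ignored, Jim and Mary come out as the least similar pair, while Jack and Mary are the most similar.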

14 Cosine Similarity

    \[ \cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} \]

Tanimoto Coefficient

    \[ T(x, y) = \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y} \]

Nominal Variables
  A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
  Method 1: simple matching
    m: number of matches, p: total number of variables

    \[ d(i, j) = \frac{p - m}{p} \]

  Method 2: use a large number of binary variables
    Create a new binary variable for each of the M nominal states
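The three measures on this slide can be sketched with plain Python lists. The helper names and the sample vectors are hypothetical, and the sketch assumes non-zero vectors for the cosine and Tanimoto measures.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """cos(x, y) = x.y / (||x|| ||y||); assumes neither vector is all zeros."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def tanimoto(x, y):
    """T(x, y) = x.y / (||x||^2 + ||y||^2 - x.y)."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

def nominal_dissimilarity(i, j):
    """Simple matching for nominal variables: d = (p - m) / p."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

x = [1.0, 1.0, 0.0]
y = [1.0, 0.0, 1.0]
cos_xy = cosine(x, y)
tan_xy = tanimoto(x, y)
d_nominal = nominal_dissimilarity(["red", "small", "round"],
                                  ["red", "large", "round"])
```

For 0/1 vectors such as `x` and `y`, the Tanimoto coefficient reduces to the Jaccard similarity a / (a + b + c).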

15 Ordinal Variables
  An ordinal variable can be discrete or continuous
  Order is important, e.g., rank
  Can be treated like interval-scaled variables:
    Replace x_{if} by its rank r_{if} \in \{1, \ldots, M_f\}
    Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

    \[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]

    Compute the dissimilarity using methods for interval-scaled variables

Interval-Valued Variables
  Standardize data
    Calculate the mean absolute deviation:

    \[ s_f = \frac{1}{n} \big( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \big) \]

    where

    \[ m_f = \frac{1}{n} \big( x_{1f} + x_{2f} + \cdots + x_{nf} \big) \]

    Calculate the standardized measurement (z-score):

    \[ z_{if} = \frac{x_{if} - m_f}{s_f} \]
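Both transformations on this slide are short enough to sketch directly. The function names and sample values are hypothetical; the z-score here divides by the mean absolute deviation s_f, as on the slide, rather than by the standard deviation.

```python
def ordinal_to_unit_interval(ranks, m_f):
    """Map ranks r in {1, ..., M_f} onto [0, 1] via z = (r - 1) / (M_f - 1)."""
    return [(r - 1) / (m_f - 1) for r in ranks]

def standardize(values):
    """z-score using the mean absolute deviation s_f as the scale."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(v - m_f) for v in values) / n
    if s_f == 0:
        raise ValueError("all values are equal; the z-score is undefined")
    return [(v - m_f) / s_f for v in values]

# Ranks of a three-level ordinal variable (for example low=1, medium=2, high=3).
z_ordinal = ordinal_to_unit_interval([1, 2, 3], m_f=3)

# Mean is 5 and the mean absolute deviation is (3 + 1 + 1 + 3) / 4 = 2.
z_interval = standardize([2.0, 4.0, 6.0, 8.0])
```

The mean absolute deviation is less sensitive to outliers than the standard deviation, because deviations are not squared.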

16 Ratio-Scaled Variables
  Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{-Bt}
  Methods:
    Treat them like interval-scaled variables: not a good choice! (why?)
    Apply a logarithmic transformation: y_{if} = \log(x_{if})
    Treat them as continuous ordinal data and treat their rank as interval-scaled

Correlation

    \[ \mathrm{corr}(x, y) = \frac{\mathrm{covariance}(x, y)}{\mathrm{standard\_deviation}(x) \cdot \mathrm{standard\_deviation}(y)} \]

  The correlation coefficient takes a value between -1 and 1
  Its magnitude shows the degree of linear dependency
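The following sketch ties the two halves of this slide together: it computes the correlation coefficient and shows how a logarithmic transformation makes an exponential measurement linear in t. The function name `pearson` and the constants A = 2.0 and B = 0.5 are assumptions for the example; population (divide-by-n) statistics are used, which gives the same ratio as the sample versions.

```python
import math

def pearson(x, y):
    """corr(x, y) = covariance(x, y) / (std(x) * std(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

t = [0.0, 1.0, 2.0, 3.0, 4.0]
x = [2.0 * math.exp(0.5 * ti) for ti in t]  # ratio-scaled: A * e^{Bt}
y = [math.log(v) for v in x]                # log transform: log A + B * t

corr_raw = pearson(t, x)  # below 1: the relation is not linear
corr_log = pearson(t, y)  # essentially 1: linear after the transform
```

This also hints at why treating ratio-scaled values as interval-scaled is a poor choice: equal steps in t produce increasingly large differences in x.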

17 Mahalanobis Distance

    \[ \mathrm{Mahalanobis}(x, y) = (x - y) \, \Sigma^{-1} \, (x - y)^{T} \]

  Works well when features are correlated
  Data may be normally distributed

Variables of Mixed Types
  A data set may contain all six types of variables
  One may use a weighted formula to combine their effects:

    \[ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]

  f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
  f is interval-based: use the normalized distance
  f is ordinal or ratio-scaled: compute the ranks r_{if} and

    \[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]

    and treat z_{if} as interval-scaled
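The weighted formula for mixed types can be sketched as below. Several details are assumptions, since the slide does not define them: the indicator delta is taken to be 0 when either value is missing (`None`) and 1 otherwise, interval variables are normalized by a caller-supplied range, and ordinal variables are expected to be already mapped to z in [0, 1]. The names `mixed_dissimilarity`, `kinds`, and `ranges` are hypothetical.

```python
def mixed_dissimilarity(obj_i, obj_j, kinds, ranges):
    """Weighted average of per-variable dissimilarities d_ij^(f).

    kinds[f]  -- "nominal" (also used for binary) or "interval"
    ranges[f] -- max - min of variable f, used only for interval variables
    """
    num = 0.0
    den = 0.0
    for f, kind in enumerate(kinds):
        xi, xj = obj_i[f], obj_j[f]
        if xi is None or xj is None:
            continue  # delta = 0: a missing value contributes nothing
        if kind == "nominal":
            d = 0.0 if xi == xj else 1.0
        else:
            d = abs(xi - xj) / ranges[f]  # normalized distance in [0, 1]
        num += d
        den += 1.0  # delta = 1
    if den == 0:
        raise ValueError("no variable is present in both objects")
    return num / den

kinds = ("nominal", "interval", "interval")
ranges = (None, 40.0, 1.0)  # the third variable is an ordinal z value in [0, 1]

a = ("red", 30.0, 0.5)
b = ("blue", 50.0, 1.0)
b_missing = ("blue", None, 1.0)

d_ab = mixed_dissimilarity(a, b, kinds, ranges)
d_ab_missing = mixed_dissimilarity(a, b_missing, kinds, ranges)
```

In the first comparison the three contributions are 1, 0.5, and 0.5, giving 2/3; when the interval value is missing, only two variables are averaged.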

18 References
  Tan, Steinbach, Kumar.
  Han and Kamber.
