CS570: Introduction to Data Mining
Department of Mathematics and Computer Science
Li Xiong
Data Exploration and Data Preprocessing
- Data and attributes
- Data exploration
- Data pre-processing
  - Data cleaning
  - Data integration
  - Data transformation
  - Data reduction
Data Mining: Concepts and Techniques
Data Transformation
- Aggregation: summarization (data reduction), e.g. daily sales -> monthly sales
- Discretization and generalization, e.g. age -> youth, middle-aged, senior
- (Statistical) normalization: scale values to fall within a small, specified range, e.g. income vs. age
- Attribute construction: construct new attributes from given ones, e.g. birthday -> age
January 25, 2011
Data Aggregation
- Data cubes store multidimensional aggregated information
- Multiple levels of aggregation support analysis at multiple granularities
Normalization
Scale values to fall within a small, specified range.
- Min-max normalization: maps [min_A, max_A] to [new_min_A, new_max_A]:
  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
  Ex. Let income range [$12,000, $98,000] be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0) + 0 = 0.716
- Z-score normalization (µ_A: mean, σ_A: standard deviation):
  v' = (v - µ_A) / σ_A
  Ex. Let µ_A = 54,000 and σ_A = 16,000. Then 73,600 maps to (73,600 - 54,000) / 16,000 = 1.225
- Normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
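The three normalization schemes above can be sketched in a few lines of Python; this is a minimal illustration, not library code:

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map v from [old_min, old_max] onto [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Center on the mean and scale by the standard deviation."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide all values by 10^j, with j the smallest integer
    making every scaled magnitude less than 1."""
    m = max(abs(v) for v in values)
    j = 0
    while m / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

With the slide's numbers, `min_max(73600, 12000, 98000)` gives roughly 0.716 and `z_score(73600, 54000, 16000)` gives 1.225.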
Discretization and Generalization
- Discretization: transform a continuous attribute into discrete counterparts (intervals)
  - Supervised vs. unsupervised
  - Split (top-down) vs. merge (bottom-up)
- Generalization: replace low-level concepts (such as age ranges) with higher-level concepts (such as young, middle-aged, or senior)
Discretization Methods
- Binning or histogram analysis: unsupervised, top-down split
- Clustering analysis: unsupervised, either top-down split or bottom-up merge
- Entropy-based discretization: supervised, top-down split
Entropy-Based Discretization
- Entropy based on the class distribution of the samples in a set S1, with m classes and p_i the probability of class i in S1:
  Entropy(S1) = - Σ_{i=1..m} p_i log2(p_i)
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the class entropy after partitioning is
  I(S, T) = (|S1| / |S|) Entropy(S1) + (|S2| / |S|) Entropy(S2)
- The boundary T that minimizes the entropy function is selected for binary discretization
- The process is applied recursively to the resulting partitions
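One step of this procedure can be sketched as follows: scan candidate boundaries (midpoints between adjacent sorted values) and pick the one minimizing the weighted class entropy I(S, T). A minimal in-memory sketch, not an optimized implementation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a class-label multiset, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    """Return (T, I(S, T)) minimizing the post-split class entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_cost = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        cost = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```

For two well-separated classes, the chosen boundary falls between them and the post-split entropy is zero.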
Information Entropy
- Information entropy: a measure of the uncertainty associated with a random variable; it quantifies the information contained in a message as the minimum message length (# of bits) needed to communicate it
- Illustrative example: P(X=A) = 1/4, P(X=B) = 1/4, P(X=C) = 1/4, P(X=D) = 1/4
  - BAACBADCDADDDA needs a minimum of 2 bits per symbol (e.g. A = 00, B = 01, C = 10, D = 11): 0100001001001110110011111100
- What if P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8?
  - Minimum # of bits? E.g. A = 0, B = 10, C = 110, D = 111 (on average 1/2·1 + 1/4·2 + 1/8·3 + 1/8·3 = 1.75 bits per symbol)
- High entropy vs. low entropy
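The two distributions above can be checked directly; a minimal sketch of Shannon entropy in bits:

```python
from math import log2

def H(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)
```

`H([0.25] * 4)` gives 2 bits per symbol for the uniform case, and `H([0.5, 0.25, 0.125, 0.125])` gives 1.75, matching the variable-length code on the slide.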
Generalization for Categorical Attributes
- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts, e.g. street < city < state < country
- Specification of a hierarchy for a set of values by explicit data grouping, e.g. {Atlanta, Savannah, Columbus} < Georgia
- Automatic generation of hierarchies (or attribute levels) by analyzing the number of distinct values, e.g. for the attribute set {street, city, state, country}
Automatic Concept Hierarchy Generation
- Some hierarchies can be generated automatically by analyzing the number of distinct values per attribute in the data set: the attribute with the most distinct values is placed at the lowest level of the hierarchy
- Exceptions exist, e.g. weekday, month, quarter, year
- Example: country (15 distinct values) < province_or_state (365) < city (3,567) < street (674,339)
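The heuristic above amounts to sorting attributes by distinct-value count; a minimal sketch (the record layout and attribute names are illustrative):

```python
def hierarchy_order(records, attributes):
    """Order attributes from hierarchy top (fewest distinct values)
    to bottom (most distinct values)."""
    counts = {a: len({r[a] for r in records}) for a in attributes}
    return sorted(attributes, key=lambda a: counts[a])
```

Note the slide's caveat: for attributes like weekday vs. month this count-based ordering can be wrong, so the result should be treated as a suggestion, not a rule.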
Data Exploration and Data Preprocessing
- Data and attributes
- Data exploration
- Data pre-processing
  - Data cleaning
  - Data integration
  - Data transformation
  - Data reduction
Data Reduction
- Why data reduction?
  - A database/data warehouse may store terabytes of data (many data points, many dimensions)
  - Complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data Reduction
- Instance reduction
  - Sampling (instance selection)
  - Numerosity reduction
- Dimension reduction
  - Feature selection
  - Feature extraction
Instance Reduction: Sampling
- Sampling: obtain a small representative sample s to stand in for the whole data set N
- A sample is representative if it has approximately the same property (of interest) as the original data set
- Statisticians sample because obtaining the entire set of data is too expensive or time-consuming; data miners sample because processing the entire set of data is too expensive or time-consuming
- Two key questions: the sampling method and the sample size
Why Sampling?
A statistics professor was describing sampling theory.
Student: "I don't believe it. Why not study the whole population in the first place?"
The professor continued explaining sampling methods, the central limit theorem, etc.
Student: "Too much theory, too risky. I couldn't trust just a few numbers in place of ALL of them."
The professor explained the Nielsen television ratings.
Student: "You mean that just a sample of a few thousand can tell us exactly what over 250 MILLION people are doing?"
Professor: "Well, the next time you go to the campus clinic and they want to do a blood test, tell them that's not good enough: tell them to TAKE IT ALL!"
Sampling Methods
- Simple random sampling: there is an equal probability of selecting any particular item
- Stratified sampling: split the data into several partitions (strata), then draw random samples from each partition
- Cluster sampling: used when "natural" groupings are evident in a statistical population
- Sampling without replacement: as each item is selected, it is removed from the population
- Sampling with replacement: objects are not removed from the population as they are selected, so the same object can be picked more than once
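Simple random and stratified sampling can be sketched with the standard library; the `key` function used to define strata is an assumption of this illustration:

```python
import random

def simple_random_sample(population, n, replacement=False):
    """Simple random sample of size n, with or without replacement."""
    if replacement:
        return [random.choice(population) for _ in range(n)]
    return random.sample(population, n)  # without replacement

def stratified_sample(records, key, per_stratum):
    """Draw per_stratum items from each partition induced by key(record)."""
    strata = {}
    for r in records:
        strata.setdefault(key(r), []).append(r)
    sample = []
    for items in strata.values():
        sample.extend(random.sample(items, min(per_stratum, len(items))))
    return sample
```

Stratified sampling guarantees every group is represented, which simple random sampling does not, especially for small strata.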
Simple random sampling without or with replacement (illustration on raw data)
Stratified Sampling Illustration (raw data vs. stratified sample)
Sampling Size
Sampling Size (illustration: 8,000 points vs. 2,000 points vs. 500 points)
Sample Size
What sample size is necessary to get at least one object from each of 10 groups?
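The question above can be answered empirically; a minimal simulation sketch, assuming 10 equally likely groups (for equal groups the expected answer is the coupon-collector value, about 29 draws for 10 groups):

```python
import random

def draws_to_cover(num_groups):
    """Draw uniformly until every group has been seen at least once."""
    seen, draws = set(), 0
    while len(seen) < num_groups:
        seen.add(random.randrange(num_groups))
        draws += 1
    return draws

def average_draws(num_groups, trials=2000):
    """Monte Carlo estimate of the expected number of draws."""
    return sum(draws_to_cover(num_groups) for _ in range(trials)) / trials
```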
Data Reduction
- Instance reduction
  - Sampling (instance selection)
  - Numerosity reduction
- Dimension reduction
  - Feature selection
  - Feature extraction
Numerosity Reduction
- Reduce data volume by choosing alternative, smaller forms of data representation
- Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers), e.g. regression
- Non-parametric methods: do not assume models; major families are histograms and clustering
Regression Analysis
- Assume the data fits some model and estimate the model parameters
- Linear regression: Y = b0 + b1 X1 + b2 X2 + ... + bP XP
- Line fitting: Y = b1 X + b0
- Polynomial fitting: Y = b2 X^2 + b1 X + b0
- Regression techniques: least-squares fitting (vertical vs. perpendicular offsets), handling outliers, robust regression
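For line fitting with vertical offsets, the least-squares parameters have a closed form; a minimal sketch:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of Y = b1 * X + b0 (vertical offsets)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    b1 = num / den
    b0 = my - b1 * mx
    return b0, b1
```

After fitting, only `(b0, b1)` need to be stored in place of the data points, which is exactly the parametric data-reduction idea on the previous slide.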
Instance Reduction: Histograms
- Divide data into buckets and store the average (or sum) for each bucket
- Partitioning rules:
  - Equi-width: equal bucket range
  - Equi-depth: equal frequency
  - V-optimal: the histogram with the least frequency variance
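The two simplest partitioning rules can be sketched directly (the V-optimal rule needs dynamic programming and is omitted):

```python
def equi_width_bins(values, k):
    """k buckets of equal range over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp the max into last bucket
        bins[i].append(v)
    return bins

def equi_depth_bins(values, k):
    """k buckets of (approximately) equal frequency."""
    s = sorted(values)
    q, r = divmod(len(s), k)
    bins, start = [], 0
    for i in range(k):
        size = q + (1 if i < r else 0)  # spread the remainder over early bins
        bins.append(s[start:start + size])
        start += size
    return bins
```

On skewed data the two rules differ sharply: equi-width can leave some buckets nearly empty, while equi-depth adapts the bucket boundaries to the density.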
Instance Reduction: Clustering
- Partition the data set into clusters based on similarity, and store only a cluster representation (e.g., centroid and diameter)
- Can be very effective if the data is clustered, but not if the data is smeared
- Clusterings can be hierarchical and stored in multi-dimensional index tree structures
- Cluster analysis will be studied in depth later
Data Reduction
- Instance reduction
  - Sampling (instance selection)
  - Numerosity reduction
- Dimension reduction
  - Feature selection
  - Feature extraction
Feature Subset Selection
- Select a subset of features such that removing the rest does not affect the mining result
- Redundant features: duplicate much or all of the information contained in one or more other attributes, e.g. the purchase price of a product and the amount of sales tax paid
- Irrelevant features: contain no information that is useful for the data mining task at hand, e.g. students' IDs are often irrelevant to the task of predicting students' GPA
Correlation Between Attributes
- Correlation measures the linear relationship between attributes
Correlation Analysis (Numerical Data)
- Correlation coefficient (also called Pearson's product-moment coefficient):
  r_{A,B} = Σ(a_i - Ā)(b_i - B̄) / ((n - 1) σ_A σ_B) = (Σ(a_i b_i) - n Ā B̄) / ((n - 1) σ_A σ_B)
  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(a_i b_i) is the sum of the AB cross-products.
- r_{A,B} > 0: A and B are positively correlated (A's values increase as B's do)
- r_{A,B} = 0: uncorrelated (no linear relationship; independence implies r = 0, but r = 0 does not imply independence)
- r_{A,B} < 0: negatively correlated
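A minimal sketch of the formula above, using the sample standard deviation (n - 1 denominator) to match the slide:

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson product-moment correlation of two numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)
    sa = sqrt(sum((x - ma) ** 2 for x in a) / (n - 1))
    sb = sqrt(sum((y - mb) ** 2 for y in b) / (n - 1))
    return cov / (sa * sb)
```

Perfectly linear attributes give r = ±1; an r near ±1 is one signal that a feature is redundant and can be dropped.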
Visually Evaluating Correlation
Scatter plots showing Pearson correlation values from -1 to 1.
Correlation Analysis (Categorical Data)
- χ² (chi-square) test:
  χ² = Σ (Observed - Expected)² / Expected
- The larger the χ² value, the more likely the variables are related
- The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
Chi-Square Calculation: An Example

                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500

χ² calculation (numbers in parentheses are expected counts computed from the data distribution in the two categories, e.g. 450 × 300 / 1500 = 90):
χ² = (250 - 90)²/90 + (50 - 210)²/210 + (200 - 360)²/360 + (1000 - 840)²/840 = 507.93
This shows that like_science_fiction and play_chess are correlated in the group: the χ² value far exceeds 10.828, the threshold needed to reject the independence hypothesis (at the 0.001 significance level with 1 degree of freedom).
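The slide's computation generalizes to any contingency table; a minimal sketch where expected counts come from the row and column sums:

```python
def chi_square(table):
    """Chi-square statistic for a contingency table
    (list of rows of observed counts)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_sums[i] * col_sums[j] / total
            chi2 += (obs - exp) ** 2 / exp
    return chi2
```

With the slide's table, `chi_square([[250, 200], [50, 1000]])` reproduces the 507.93 value.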
Metrics of (In)dependence
- Mutual information: measures the mutual dependence between two attributes
- What is the mutual information between two completely independent attributes? (Zero.)
- Kullback-Leibler divergence: an asymmetric measure of how one distribution differs from another
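Mutual information can be computed directly from a joint distribution; a minimal sketch, in bits:

```python
from math import log2

def mutual_information(joint):
    """I(X; Y) in bits, where joint maps (x, y) -> probability."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p  # marginal of X
        py[y] = py.get(y, 0) + p  # marginal of Y
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)
```

For independent attributes the joint factorizes into the product of marginals, every log term is zero, and the mutual information is zero, which answers the slide's question.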
Feature Selection
- Brute-force approach: try all possible feature subsets
- Heuristic methods:
  - Step-wise forward selection
  - Step-wise backward elimination
  - Combining forward selection and backward elimination
Feature Selection Approaches
- Filter approaches: features are selected independently of the data mining algorithm, e.g. minimal pair-wise correlation/dependence, top-k information entropy
- Wrapper approaches: use the data mining algorithm as a black box to find the best subset, e.g. best classification accuracy
- Embedded approaches: feature selection occurs naturally as part of the data mining algorithm, e.g. decision tree classification
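Step-wise forward selection in wrapper style can be sketched as a greedy loop; the `score` callable stands in for whatever subset evaluation is used (e.g. cross-validated accuracy) and is an assumption of this illustration, not a fixed API:

```python
def forward_selection(features, score, k):
    """Greedily grow a feature subset of up to k features,
    adding whichever candidate most improves score(subset)."""
    selected = []
    while len(selected) < k:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break  # no candidate improves the score; stop early
        selected.append(best)
    return selected
```

Backward elimination is the mirror image: start from all features and greedily drop the one whose removal hurts the score least.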
Data Reduction
- Instance reduction
  - Sampling
  - Aggregation
- Dimension reduction
  - Feature selection
  - Feature extraction/creation