CS570 Introduction to Data Mining

1 CS570 Introduction to Data Mining
Department of Mathematics and Computer Science
Li Xiong

2 Data Exploration and Data Preprocessing
Data and Attributes
Data exploration
Data pre-processing
  Data cleaning
  Data integration
  Data transformation
  Data reduction
Data Mining: Concepts and Techniques

3 Data Transformation
Aggregation: summarization (data reduction)
  E.g. daily sales -> monthly sales
Discretization and generalization
  E.g. age -> youth, middle-aged, senior
(Statistical) normalization: scaled to fall within a small, specified range
  E.g. income vs. age
Attribute construction: construct new attributes from given ones
  E.g. birthday -> age
January 25,

4 Data Aggregation
Data cubes store multidimensional aggregated information
Multiple levels of aggregation for analysis at multiple granularities

5 Normalization
Scaled to fall within a small, specified range
Min-max normalization: [min_A, max_A] to [new_min_A, new_max_A]
  \[ v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A \]
  Ex. Let income range over [$12,000, $98,000], normalized to [0.0, 1.0]. Then $73,600 is mapped to
  \[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
Z-score normalization (µ_A: mean, σ_A: standard deviation of A):
  \[ v' = \frac{v - \mu_A}{\sigma_A} \]
  Ex. Let µ_A = 54,000, σ_A = 16,000. Then $73,600 is mapped to
  \[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
Normalization by decimal scaling:
  \[ v' = \frac{v}{10^j} \]
  where j is the smallest integer such that max(|v'|) < 1
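
The three normalizations above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function names are hypothetical, and the values passed to `decimal_scaling` are made-up sample data.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# The income example from the slide.
income_minmax = min_max(73_600, 12_000, 98_000)   # about 0.716
income_z = z_score(73_600, 54_000, 16_000)        # 1.225

# Illustrative values (an assumption, not from the slides).
scaled, j = decimal_scaling([-986, 917])          # j is 3
```

Min-max normalization depends on the observed extremes, so a single outlier can compress all other values; z-score normalization is the usual alternative when the minimum and maximum are unknown or unreliable.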

6 Discretization and Generalization
Discretization: transform continuous attributes into discrete counterparts (intervals)
  Supervised vs. unsupervised
  Split (top-down) vs. merge (bottom-up)
Generalization: replace low-level concepts (such as age ranges) with higher-level concepts (such as young, middle-aged, or senior)

7 Discretization Methods
Binning or histogram analysis
  Unsupervised, top-down split
Clustering analysis
  Unsupervised, either top-down split or bottom-up merge
Entropy-based discretization
  Supervised, top-down split

8 Entropy-Based Discretization
Entropy based on the class distribution of the samples in a set S_1: with m classes, where p_i is the probability of class i in S_1,
  \[ \mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i) \]
Given a set of samples S, if S is partitioned into two intervals S_1 and S_2 using boundary T, the class entropy after partitioning is
  \[ I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2) \]
The boundary that minimizes the entropy function is selected for binary discretization
The process is recursively applied to the resulting partitions
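
One step of the binary split described above might look like the following sketch. It is an illustration under stated assumptions: candidate boundaries are taken as midpoints between adjacent distinct values (a common choice, not stated on the slide), the function names are hypothetical, and the age/label data is invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy of a set of labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_boundary(values, labels):
    """Return (T, I(S, T)) for the boundary T that minimizes I(S, T)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_i = None, float("inf")
    for k in range(1, n):
        lo, hi = pairs[k - 1][0], pairs[k][0]
        if lo == hi:
            continue  # no boundary fits between equal values
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        i_st = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if i_st < best_i:
            best_t, best_i = (lo + hi) / 2, i_st
    return best_t, best_i


# Invented data: the classes separate cleanly between ages 30 and 35.
ages = [22, 25, 30, 35, 41, 48, 52, 60]
buys = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"]
t, i_st = best_boundary(ages, buys)  # t is 32.5
```

A full discretizer would call `best_boundary` again on each resulting interval until a stopping criterion is met; the slides do not specify that criterion, so it is left out here.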

9 Information Entropy
Information entropy: a measure of the uncertainty associated with a random variable
Quantifies the information contained in a message as the minimum message length (# bits) needed to communicate it
Illustrative example:
  P(X=A) = ¼, P(X=B) = ¼, P(X=C) = ¼, P(X=D) = ¼
  BAACBADCDADDDA
  Minimum 2 bits (e.g. A = 00, B = 01, C = 10, D = 11)
What if P(X=A) = ½, P(X=B) = ¼, P(X=C) = 1/8, P(X=D) = 1/8?
  Minimum # of bits?
  E.g. A = 0, B = 10, C = 110, D = 111
High entropy vs. low entropy
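
The two distributions in the example can be checked numerically. The sketch below (function and variable names are hypothetical) computes the entropy of each distribution and the expected code length of the variable-length code given on the slide.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


uniform = entropy_bits([0.25, 0.25, 0.25, 0.25])  # 2.0 bits per symbol
skewed = entropy_bits([0.5, 0.25, 0.125, 0.125])  # 1.75 bits per symbol

# Expected length of the code A = 0, B = 10, C = 110, D = 111.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
expected_len = sum(probs[s] * len(code[s]) for s in code)  # 1.75
```

For the skewed distribution the expected code length equals the entropy, so this particular code is optimal: on average 1.75 bits per symbol instead of 2.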

10 Generalization for Categorical Attributes
Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping
  {Atlanta, Savannah, Columbus} < Georgia
Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
  E.g., for a set of attributes: {street, city, state, country}

11 Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  The attribute with the most distinct values is placed at the lowest level of the hierarchy
  Exceptions, e.g., weekday, month, quarter, year
Example hierarchy (top to bottom):
  country: 15 distinct values
  province_or_state: 365 distinct values
  city: 3567 distinct values
  street: 674,339 distinct values
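
The distinct-value heuristic amounts to sorting attributes by their number of distinct values. A minimal sketch, using the counts from the slide (the function name is hypothetical):

```python
def hierarchy_by_distinct_counts(distinct_counts):
    """Order attributes from the top of the hierarchy (fewest distinct
    values) to the bottom (most distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get)


counts = {
    "street": 674_339,
    "country": 15,
    "city": 3_567,
    "province_or_state": 365,
}
levels = hierarchy_by_distinct_counts(counts)
# ['country', 'province_or_state', 'city', 'street']
```

The heuristic fails for the exceptions named on the slide: a data set covering many years has 7 weekdays, 12 months, and 4 quarters, so sorting by distinct counts would not recover the intended time hierarchy.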

12 Data Exploration and Data Preprocessing
Data and Attributes
Data exploration
Data pre-processing
  Data cleaning
  Data integration
  Data transformation
  Data reduction

13 Data Reduction
Why data reduction?
  A database/data warehouse may store terabytes of data
    Number of data points
    Number of dimensions
  Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
  Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

14 Data Reduction
Instance reduction
  Sampling (instance selection)
  Numerosity reduction
Dimension reduction
  Feature selection
  Feature extraction

15 Instance Reduction: Sampling
Sampling: obtaining a small representative sample s to represent the whole data set N
A sample is representative if it has approximately the same property (of interest) as the original set of data
Statisticians sample because obtaining the entire set of data is too expensive or time consuming
Data miners sample because processing the entire set of data is too expensive or time consuming
Sampling method
Sampling size

16 Why sampling
A statistics professor was describing sampling theory.
Student: I don't believe it, why not study the whole population in the first place?
The professor continued explaining sampling methods, the central limit theorem, etc.
Student: Too much theory, too risky, I couldn't trust just a few numbers in place of ALL of them.
The professor explained the Nielsen television ratings.
Student: You mean that just a sample of a few thousand can tell us exactly what over 250 MILLION people are doing?
Professor: Well, the next time you go to the campus clinic and they want to do a blood test, tell them that's not good enough, tell them to TAKE IT ALL!!

17 Sampling Methods
Simple random sampling
  There is an equal probability of selecting any particular item
Stratified sampling
  Split the data into several partitions (strata); then draw random samples from each partition
Cluster sampling
  When "natural" groupings are evident in a statistical population
Sampling without replacement
  As each item is selected, it is removed from the population
Sampling with replacement
  Objects are not removed from the population as they are selected for the sample; the same object can be picked more than once
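
The sampling methods above map directly onto Python's standard `random` module. The sketch below is illustrative only: the population, the strata labels, the 10% fraction, and the helper `stratified_sample` are all assumptions, and the stratified version draws proportionally from each stratum, which is one common variant.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(100))

# Simple random sampling without replacement: no item appears twice.
srswor = rng.sample(population, k=10)

# Simple random sampling with replacement: items may repeat.
srswr = rng.choices(population, k=10)


def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction (at least one record) from every stratum."""
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Invented data: 60 "youth" records and 40 "senior" records.
records = [("youth", i) for i in range(60)] + [("senior", i) for i in range(40)]
strat = stratified_sample(records, key=lambda r: r[0], fraction=0.1, rng=rng)
```

With these inputs the stratified sample contains 6 youth and 4 senior records, preserving the 60/40 split that a small simple random sample would only match approximately.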

18 Simple random sampling without or with replacement
[Figure: Raw Data]

19 Stratified Sampling Illustration
[Figure: Raw Data; Stratified Sample]

20 Sampling size

21 Sampling Size
[Figure: the same data shown with 8000 points, 2000 points, and 500 points]

22 Sample Size
What sample size is necessary to get at least one object from each of 10 groups?
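
One way to approach this question is to compute the probability that a sample of size n covers all groups. The sketch below assumes the groups are equally likely and that draws are independent (sampling with replacement, or groups large enough that removal does not matter); neither assumption is stated on the slide, and the function names are hypothetical. It uses inclusion-exclusion over the set of missed groups.

```python
from math import comb


def prob_all_groups(n, groups=10):
    """P(a sample of size n contains at least one object from every group),
    for equally likely groups and independent draws."""
    return sum(
        (-1) ** i * comb(groups, i) * (1 - i / groups) ** n
        for i in range(groups + 1)
    )


def min_sample_size(target, groups=10):
    """Smallest n whose coverage probability reaches the target."""
    n = groups  # fewer draws than groups can never cover all of them
    while prob_all_groups(n, groups) < target:
        n += 1
    return n


p10 = prob_all_groups(10)    # 10!/10^10, about 0.00036
n90 = min_sample_size(0.90)  # smallest n with at least 90% coverage
```

Even with exactly 10 draws the chance of hitting all 10 groups is well under 0.1%, which is why the required sample size is several times the number of groups.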

23 Data Reduction
Instance reduction
  Sampling (instance selection)
  Numerosity reduction
Dimension reduction
  Feature selection
  Feature extraction

24 Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation
Parametric methods
  Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
  Regression
Non-parametric methods
  Do not assume models
  Major families: histograms, clustering

25 Regression Analysis
Assume the data fits some model and estimate model parameters
Linear regression: \( Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_P X_P \)
  Line fitting: \( Y = b_1 X + b_0 \)
  Polynomial fitting: \( Y = b_2 x^2 + b_1 x + b_0 \)
Regression techniques
  Least squares fitting
  Vertical vs. perpendicular offsets
  Outliers
  Robust regression
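
For the line-fitting case, least squares with vertical offsets has a closed-form solution. The sketch below is an illustration with invented data and a hypothetical function name; it uses the standard formulas b_1 = S_xy / S_xx and b_0 = mean(y) - b_1 mean(x).

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of Y = b1 * X + b0 (vertical offsets)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Invented data lying exactly on Y = 2X + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = fit_line(xs, ys)  # b0 is 1.0, b1 is 2.0
```

As numerosity reduction, only `b0` and `b1` need to be stored in place of the five points. Squared offsets make this fit sensitive to outliers, which is the motivation for the robust regression item above.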

26 Instance Reduction: Histograms
Divide data into buckets and store the average (or sum) for each bucket
Partitioning rules:
  Equi-width: equal bucket range
  Equi-depth: equal frequency
  V-optimal: with the least frequency variance
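
The difference between the first two partitioning rules shows up clearly on skewed data. The sketch below is an illustration only: the function names and the data are invented, and V-optimal partitioning is omitted because it needs a more involved search.

```python
def equi_width(values, k):
    """Split the value range into k buckets of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against all values being equal
    buckets = [[] for _ in range(k)]
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # maximum goes in last bucket
        buckets[idx].append(v)
    return buckets


def equi_depth(values, k):
    """Split the sorted values into k buckets of (nearly) equal count."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def summarize(buckets):
    """Store only (count, average) per bucket; empty buckets get None."""
    return [(len(b), sum(b) / len(b) if b else None) for b in buckets]


# Invented data with one large outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 30]
width_summary = summarize(equi_width(values, 3))
# [(8, 4.5), (0, None), (1, 30.0)]
depth_summary = summarize(equi_depth(values, 3))
# [(3, 2.0), (3, 5.0), (3, 15.0)]
```

Equi-width buckets leave one bucket empty and crowd eight values into another, while equi-depth buckets adapt their ranges to the data density.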

27 Instance Reduction: Clustering
Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
Can be very effective if the data is clustered, but not if the data is smeared
Clustering can be hierarchical, with clusters stored in multi-dimensional index tree structures
Cluster analysis will be studied in depth later

28 Data Reduction
Instance reduction
  Sampling (instance selection)
  Numerosity reduction
Dimension reduction
  Feature selection
  Feature extraction

29 Feature Subset Selection
Select a subset of features such that using the reduced data does not affect the mining result
Redundant features
  Duplicate much or all of the information contained in one or more other attributes
  Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
  Contain no information that is useful for the data mining task at hand
  Example: a student's ID is often irrelevant to the task of predicting the student's GPA

30 Correlation between attributes
Correlation measures the linear relationship between objects

31 Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product moment coefficient)
  \[ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n \bar{A} \bar{B}}{(n - 1)\,\sigma_A \sigma_B} \]
  where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product
r_{A,B} > 0: A and B are positively correlated (A's values increase as B's do)
r_{A,B} = 0: uncorrelated (no linear relationship; this alone does not imply independence)
r_{A,B} < 0: negatively correlated
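
The coefficient can be computed directly from the first form of the formula. The sketch below is illustrative (hypothetical function name, invented data) and uses the sample standard deviation with the n - 1 denominator so that it matches the formula term by term.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * std_a * std_b)


# Invented data.
r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly linear, r near 1
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])    # reversed, r near -1
r_zero = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # B = A^2, r is 0
```

The last case shows why r = 0 does not establish independence: B is fully determined by A, yet the linear correlation vanishes.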

32 Visually Evaluating Correlation
[Figure: scatter plots showing the Pearson correlation from -1 to 1]

33 Correlation Analysis (Categorical Data)
χ² (chi-square) test
  \[ \chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}} \]
The larger the χ² value, the more likely the variables are related
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count

34 Chi-Square Calculation: An Example
                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500
χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)
  \[ \chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93 \]
It shows that like_science_fiction and play_chess are correlated in the group (the value far exceeds what is needed to reject the independence hypothesis)
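
The expected counts and the χ² statistic in this example can be reproduced from the observed counts alone. A minimal sketch (the function name is hypothetical); each expected count is row sum times column sum divided by the grand total:

```python
def chi_square(observed):
    """Return (chi2, expected) for a contingency table given as a list of rows."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        exp_row = []
        for j, o in enumerate(row):
            e = row_sums[i] * col_sums[j] / total  # expected under independence
            exp_row.append(e)
            chi2 += (o - e) ** 2 / e
        expected.append(exp_row)
    return chi2, expected


# Rows: like / not like science fiction. Columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
chi2, expected = chi_square(observed)
# expected is [[90.0, 360.0], [210.0, 840.0]]; chi2 is about 507.93
```

The cell (like science fiction, play chess) contributes the most, about 284 of the total, because its observed count of 250 is furthest from its expected count of 90 in relative terms.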

35 Metrics of (in)dependence
Mutual information: mutual dependence between two attributes
  What's the mutual information between 2 completely independent attributes?
Kullback-Leibler divergence: asymmetric
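
Both metrics have short definitions for discrete distributions. The sketch below is illustrative: function names and the example distributions are invented, logarithms are base 2 so results are in bits, and `kl_divergence` assumes q is nonzero wherever p is nonzero.

```python
import math


def mutual_information(joint):
    """Mutual information (bits) from a joint probability table (list of rows)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two independent fair binary attributes: P(x, y) = P(x) P(y).
mi_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])  # 0.0

# Two identical fair binary attributes: knowing one reveals the other.
mi_identical = mutual_information([[0.5, 0.0], [0.0, 0.5]])  # 1.0

p, q = [0.5, 0.5], [0.9, 0.1]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)  # differs from d_pq: the divergence is asymmetric
```

This answers the question on the slide: completely independent attributes have zero mutual information. Mutual information is itself the KL divergence between the joint distribution and the product of the marginals.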

36 Feature Selection
Brute-force approach:
  Try all possible feature subsets
Heuristic methods
  Step-wise forward selection
  Step-wise backward elimination
  Combining forward selection and backward elimination
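
Step-wise forward selection can be sketched as a greedy loop around a scoring function. Everything concrete below is an assumption for illustration: the function name, the feature names, and especially the toy `score`, which stands in for whatever evaluation (for example, classification accuracy) a real task would use.

```python
def forward_selection(features, score):
    """Greedily add the feature that most improves score(subset);
    stop when no remaining feature improves it."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        candidate, candidate_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        if candidate_score <= best_score:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


# Toy score (hypothetical): each feature has a fixed usefulness, and every
# selected feature carries a small cost.
usefulness = {"income": 0.5, "age": 0.3, "student_id": 0.0, "sales_tax": 0.0}


def score(subset):
    return sum(usefulness[f] for f in subset) - 0.01 * len(subset)


features = ["student_id", "income", "age", "sales_tax"]
selected, final_score = forward_selection(features, score)
# selected is ['income', 'age']
```

Forward selection evaluates on the order of d² subsets for d features instead of the 2^d subsets of the brute-force approach, at the price of possibly missing features that are only useful in combination. Backward elimination runs the same loop in reverse, starting from the full set and removing one feature at a time.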

37 Feature Selection
Filter approaches:
  Features are selected independently of the data mining algorithm
  E.g. minimal pair-wise correlation/dependence, top k information entropy
Wrapper approaches:
  Use the data mining algorithm as a black box to find the best subset
  E.g. best classification accuracy
Embedded approaches:
  Feature selection occurs naturally as part of the data mining algorithm
  E.g. decision tree classification

38 Data Reduction
Instance reduction
  Sampling
  Aggregation
Dimension reduction
  Feature selection
  Feature extraction/creation
