CS570 Introduction to Data Mining


CS570 Introduction to Data Mining
Department of Mathematics and Computer Science
Li Xiong

Data Exploration and Data Preprocessing
- Data and attributes
- Data exploration
- Data pre-processing: data cleaning, data integration, data transformation, data reduction

Data Transformation
- Aggregation: summarization (data reduction). E.g., daily sales -> monthly sales
- Discretization and generalization. E.g., age -> youth, middle-aged, senior
- (Statistical) normalization: scale values to fall within a small, specified range. E.g., income vs. age
- Attribute construction: construct new attributes from given ones. E.g., birthday -> age

Data Aggregation
- Data cubes store multidimensional aggregated information
- Multiple levels of aggregation support analysis at multiple granularities

Normalization: scale values to fall within a small, specified range.
- Min-max normalization: maps [min_A, max_A] to [new_min_A, new_max_A]:
  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
  Ex. Let income range over [$12,000, $98,000], normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0) + 0 = 0.716
- Z-score normalization (μ_A: mean, σ_A: standard deviation):
  v' = (v - μ_A) / σ_A
  Ex. Let μ_A = 54,000 and σ_A = 16,000. Then (73,600 - 54,000) / 16,000 = 1.225
- Normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
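A minimal NumPy sketch of the three normalizations; the `income` array and function names are illustrative, not from the slides:

```python
import numpy as np

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """Map values from [min(v), max(v)] onto [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score_normalize(v):
    """Center on the mean and scale by the standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def decimal_scaling_normalize(v):
    """Divide by 10^j, with j the smallest integer making max(|v'|) < 1."""
    v = np.asarray(v, dtype=float)
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / (10 ** j)

income = np.array([12_000, 54_000, 73_600, 98_000])
print(min_max_normalize(income))        # 73,600 maps to about 0.716
print(z_score_normalize(income))        # mean/std here are those of the toy array
print(decimal_scaling_normalize(income))  # divided by 10^5
```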

Discretization and Generalization
- Discretization: transform a continuous attribute into discrete counterparts (intervals). Supervised vs. unsupervised; split (top-down) vs. merge (bottom-up)
- Generalization: generalize/replace low-level concepts (such as age ranges) by higher-level concepts (such as young, middle-aged, or senior)

Discretization Methods
- Binning or histogram analysis: unsupervised, top-down split
- Clustering analysis: unsupervised, either top-down split or bottom-up merge
- Entropy-based discretization: supervised, top-down split

Entropy-Based Discretization
- Entropy of a set of samples S1, based on its class distribution (m classes, p_i the probability of class i in S1):
  Entropy(S1) = -Σ_{i=1..m} p_i log2(p_i)
- Given a set of samples S partitioned into two intervals S1 and S2 using boundary T, the class entropy after partitioning is
  I(S, T) = (|S1| / |S|) · Entropy(S1) + (|S2| / |S|) · Entropy(S2)
- The boundary T that minimizes the entropy function is selected for binary discretization
- The process is applied recursively to the resulting partitions
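A rough sketch of a single entropy-based binary split; the `ages`/`buys` toy data is invented for illustration, and a full discretizer would recurse on the two partitions with a stopping criterion:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Class entropy of a set of samples: -sum p_i log2(p_i)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    """Return the boundary T minimizing the weighted class entropy I(S, T)."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_i = None, float("inf")
    # Candidate boundaries: midpoints between consecutive distinct values
    for k in range(1, len(values)):
        if values[k] == values[k - 1]:
            continue
        t = (values[k] + values[k - 1]) / 2
        left, right = labels[:k], labels[k:]
        i_st = (len(left) * entropy(left) + len(right) * entropy(right)) / len(values)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t, best_i

ages = [23, 25, 31, 35, 42, 46, 51, 60]
buys = ['no', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'no']
print(best_split(ages, buys))
```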

Information Entropy
- Information entropy: a measure of the uncertainty associated with a random variable. It quantifies the information contained in a message as the minimum message length (number of bits) needed to communicate it
- Illustrative example: P(X=A) = 1/4, P(X=B) = 1/4, P(X=C) = 1/4, P(X=D) = 1/4
  Message: BAACBADCDADDDA
  Minimum 2 bits per symbol (e.g., A = 00, B = 01, C = 10, D = 11): 0100001001001110110011111100
- What if P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8?
  Minimum number of bits: 1/2·1 + 1/4·2 + 1/8·3 + 1/8·3 = 1.75 bits per symbol on average, e.g., A = 0, B = 10, C = 110, D = 111
- High entropy vs. low entropy
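The minimum average code lengths in the two examples match the Shannon entropy of the distributions; a quick check (a sketch, not from the slides):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits: -sum p_i log2(p_i)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

print(H([1/4, 1/4, 1/4, 1/4]))   # 2.0 bits per symbol
print(H([1/2, 1/4, 1/8, 1/8]))   # 1.75 bits per symbol
```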

Generalization for Categorical Attributes
- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts, e.g., street < city < state < country
- Specification of a hierarchy for a set of values by explicit data grouping, e.g., {Atlanta, Savannah, Columbus} < Georgia
- Automatic generation of hierarchies (or attribute levels) by analyzing the number of distinct values, e.g., for the set of attributes {street, city, state, country}

Automatic Concept Hierarchy Generation
- Some hierarchies can be generated automatically by analyzing the number of distinct values per attribute in the data set
- The attribute with the most distinct values is placed at the lowest level of the hierarchy
- Exceptions exist, e.g., weekday, month, quarter, year
- Example: country (15 distinct values) > province_or_state (365) > city (3,567) > street (674,339)
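A small pandas sketch of the distinct-value heuristic; the toy data frame is invented, with column names following the slide's example:

```python
import pandas as pd

# Hypothetical geographic data; the real heuristic would run on the full table.
df = pd.DataFrame({
    "country":           ["USA", "USA", "USA", "USA", "Canada", "Canada"],
    "province_or_state": ["Georgia", "Georgia", "Georgia", "Ohio", "Ontario", "Ontario"],
    "city":              ["Atlanta", "Atlanta", "Savannah", "Columbus", "Toronto", "Ottawa"],
    "street":            ["Peachtree St", "Ponce Ave", "Bull St", "High St", "Yonge St", "Bank St"],
})

# Order attributes by distinct-value count: fewest distinct values -> top of hierarchy.
hierarchy = df.nunique().sort_values().index.tolist()
print(" < ".join(reversed(hierarchy)))   # street < city < province_or_state < country
```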

Data Exploration and Data Preprocessing
- Data and attributes
- Data exploration
- Data pre-processing: data cleaning, data integration, data transformation, data reduction

Data Reduction
- Why data reduction? A database or data warehouse may store terabytes of data (many data points, many dimensions), and complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data Reduction
- Instance reduction: sampling (instance selection), numerosity reduction
- Dimension reduction: feature selection, feature extraction

Instance Reduction: Sampling
- Sampling: obtaining a small representative sample s to represent the whole data set N
- A sample is representative if it has approximately the same property (of interest) as the original set of data
- Statisticians sample because obtaining the entire set of data is too expensive or time consuming; data miners sample because processing the entire set of data is too expensive or time consuming
- Key decisions: sampling method and sampling size

Why Sampling?
A statistics professor was describing sampling theory.
Student: "I don't believe it. Why not study the whole population in the first place?"
The professor continued explaining sampling methods, the central limit theorem, etc.
Student: "Too much theory, too risky. I couldn't trust just a few numbers in place of ALL of them."
The professor explained the Nielsen television ratings.
Student: "You mean that just a sample of a few thousand can tell us exactly what over 250 MILLION people are doing?"
Professor: "Well, the next time you go to the campus clinic and they want to do a blood test, tell them that's not good enough; tell them to TAKE IT ALL!"

Sampling Methods
- Simple random sampling: there is an equal probability of selecting any particular item
- Stratified sampling: split the data into several partitions (strata), then draw random samples from each partition
- Cluster sampling: used when "natural" groupings are evident in a statistical population
- Sampling without replacement: as each item is selected, it is removed from the population
- Sampling with replacement: objects are not removed from the population as they are selected, so the same object can be picked more than once
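A sketch in pandas of simple random sampling (with and without replacement) and stratified sampling; the data frame and the `segment` column are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical customer table; 'segment' plays the role of the stratum variable.
df = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, 1_000),
    "segment": rng.choice(["youth", "middle_aged", "senior"], 1_000, p=[0.5, 0.3, 0.2]),
})

# Simple random sampling without replacement (SRSWOR) and with replacement (SRSWR)
srswor = df.sample(n=100, replace=False, random_state=0)
srswr  = df.sample(n=100, replace=True,  random_state=0)

# Stratified sampling: draw 10% from each segment so smaller strata stay represented
stratified = df.groupby("segment", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0))
print(stratified["segment"].value_counts())
```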

[Illustration: simple random sampling, without and with replacement, applied to raw data]

[Illustration: raw data vs. stratified sample]

Sampling Size
[Illustration: the same data set at 8,000 points, 2,000 points, and 500 points]

Sample Size
What sample size is necessary to get at least one object from each of 10 groups?
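A quick Monte Carlo sketch of this question, under the simplifying assumptions (not from the slides) that the 10 groups are equally likely and items are drawn with replacement:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_all_groups(sample_size, n_groups=10, trials=10_000):
    """Estimate P(a random sample of the given size contains all groups)."""
    hits = 0
    for _ in range(trials):
        sample = rng.integers(0, n_groups, size=sample_size)
        hits += len(np.unique(sample)) == n_groups
    return hits / trials

for size in (10, 20, 40, 60, 80):
    print(size, prob_all_groups(size))
```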

Data Reduction
- Instance reduction: sampling (instance selection), numerosity reduction
- Dimension reduction: feature selection, feature extraction

Numerosity Reduction
- Reduce data volume by choosing alternative, smaller forms of data representation
- Parametric methods: assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers). E.g., regression
- Non-parametric methods: do not assume models. Major families: histograms, clustering

Regression Analysis
- Assume the data fits some model and estimate the model parameters
- Linear regression: Y = b0 + b1·X1 + b2·X2 + ... + bP·XP
- Line fitting: Y = b1·X + b0
- Polynomial fitting: Y = b2·X² + b1·X + b0
- Regression techniques: least-squares fitting (vertical vs. perpendicular offsets), handling outliers, robust regression
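A minimal sketch of least-squares line and polynomial fitting with NumPy on invented synthetic data; the point for data reduction is that the few fitted coefficients replace the full set of points:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: a noisy line y = 2x + 1 (invented for illustration)
x = rng.uniform(0, 10, 200)
y = 2 * x + 1 + rng.normal(0, 1, 200)

# Least-squares line fit: store only (b1, b0) instead of the 200 points
b1, b0 = np.polyfit(x, y, deg=1)
print(f"y ~ {b1:.2f} x + {b0:.2f}")

# Polynomial fitting works the same way (coefficients returned highest degree first)
b2, b1_, b0_ = np.polyfit(x, y, deg=2)
```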

Instance Reduction: Histograms
- Divide data into buckets and store the average (or sum) for each bucket
- Partitioning rules:
  - Equi-width: equal bucket range
  - Equi-depth: equal frequency
  - V-optimal: the histogram with the least frequency variance
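A short NumPy sketch contrasting equi-width and equi-depth bucketing on an invented, skewed attribute:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.lognormal(mean=3, sigma=0.5, size=1_000)  # invented, right-skewed data

# Equi-width: 5 buckets of equal range
width_edges = np.linspace(prices.min(), prices.max(), num=6)
width_counts, _ = np.histogram(prices, bins=width_edges)

# Equi-depth: 5 buckets of (approximately) equal frequency, via quantiles
depth_edges = np.quantile(prices, np.linspace(0, 1, num=6))
depth_counts, _ = np.histogram(prices, bins=depth_edges)

print("equi-width counts:", width_counts)   # very unequal for skewed data
print("equi-depth counts:", depth_counts)   # roughly 200 per bucket
```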

Instance Reduction: Clustering
- Partition the data set into clusters based on similarity, and store only a cluster representation (e.g., centroid and diameter)
- Can be very effective if the data is clustered, but not if the data is smeared
- Clusterings can be hierarchical and stored in multi-dimensional index tree structures
- Cluster analysis will be studied in depth later

Data Reduction
- Instance reduction: sampling (instance selection), numerosity reduction
- Dimension reduction: feature selection, feature extraction

Feature Subset Selection
- Select a subset of features such that mining on the reduced data does not (materially) affect the mining result
- Redundant features: duplicate much or all of the information contained in one or more other attributes. Example: purchase price of a product and the amount of sales tax paid
- Irrelevant features: contain no information that is useful for the data mining task at hand. Example: students' ID is often irrelevant to the task of predicting students' GPA

Correlation Between Attributes
- Correlation measures the strength of the linear relationship between two attributes

Correlation Analysis (Numerical Data)
- Correlation coefficient (also called Pearson's product-moment coefficient):
  r_{A,B} = Σ(A - Ā)(B - B̄) / ((n - 1)·σ_A·σ_B) = (Σ(A·B) - n·Ā·B̄) / ((n - 1)·σ_A·σ_B)
  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(A·B) is the sum of the A·B cross-products
- r_{A,B} > 0: A and B are positively correlated (A's values increase as B's do)
- r_{A,B} = 0: uncorrelated (no linear relationship)
- r_{A,B} < 0: negatively correlated
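A small sketch computing the coefficient directly from the formula above and cross-checking against NumPy; the price/tax numbers are invented, echoing the redundant-feature example:

```python
import numpy as np

def pearson_r(a, b):
    """Pearson product-moment correlation between two attributes."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    return np.sum((a - a.mean()) * (b - b.mean())) / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

price = np.array([100, 200, 150, 300, 250])
tax = 0.07 * price                       # perfectly correlated with price
print(pearson_r(price, tax))             # 1.0
print(np.corrcoef(price, tax)[0, 1])     # same result from NumPy
```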

Visually Evaluating Correlation
[Illustration: scatter plots showing Pearson correlation values ranging from -1 to 1]

Correlation Analysis (Categorical Data)
- χ² (chi-square) test:
  χ² = Σ (Observed - Expected)² / Expected
- The larger the χ² value, the more likely the variables are related
- The cells that contribute most to the χ² value are those whose actual count is very different from the expected count

Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)       200 (360)        450
Not like science fiction     50 (210)    1,000 (840)      1,050
Sum (col.)                  300          1,200            1,500

Numbers in parentheses are expected counts, computed from the marginal distributions of the two categories (e.g., 450 * 300 / 1500 = 90).

χ² = (250 - 90)²/90 + (50 - 210)²/210 + (200 - 360)²/360 + (1000 - 840)²/840 = 507.93

Since 507.93 far exceeds the critical value of 10.828 needed to reject the independence hypothesis, like_science_fiction and play_chess are correlated in this group.
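The same calculation in NumPy (scipy.stats.chi2_contingency(observed, correction=False) reproduces the statistic as well):

```python
import numpy as np

# Observed contingency table from the example:
# rows: like / not like science fiction; cols: play / not play chess
observed = np.array([[250, 200],
                     [50, 1000]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

expected = row_totals @ col_totals / total        # e.g., 450 * 300 / 1500 = 90
chi2 = np.sum((observed - expected) ** 2 / expected)
print(expected)   # [[ 90. 360.] [210. 840.]]
print(chi2)       # ~507.93
```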

Metrics of (In)dependence
- Mutual information: measures the mutual dependence between two attributes. What is the mutual information between two completely independent attributes? Zero, since the joint distribution factorizes into the product of the marginals
- Kullback-Leibler divergence: measures how one distribution diverges from another; asymmetric
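A small sketch of mutual information computed from a joint probability table; the two toy joint distributions are invented:

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) = sum_xy p(x,y) log2( p(x,y) / (p(x) p(y)) ), with 0 log 0 = 0."""
    joint = np.asarray(joint, float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask]))

independent = np.outer([0.5, 0.5], [0.3, 0.7])   # joint = product of marginals
dependent = np.array([[0.45, 0.05],
                      [0.05, 0.45]])
print(mutual_information(independent))   # 0.0
print(mutual_information(dependent))     # > 0
```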

Feature Selection
- Brute-force approach: try all possible feature subsets
- Heuristic methods:
  - Step-wise forward selection
  - Step-wise backward elimination
  - Combining forward selection and backward elimination

Feature Selection Approaches
- Filter approaches: features are selected independently of the data mining algorithm. E.g., minimal pair-wise correlation/dependence, top-k information entropy
- Wrapper approaches: use the data mining algorithm as a black box to find the best subset. E.g., best classification accuracy
- Embedded approaches: feature selection occurs naturally as part of the data mining algorithm. E.g., decision tree classification
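A rough sketch of a wrapper-style step-wise forward selection; the classifier, scoring choice, and synthetic data set are assumptions for illustration, not from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=3):
    """Greedy wrapper: repeatedly add the feature that most improves CV accuracy."""
    selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
    while remaining and len(selected) < max_features:
        scores = []
        for f in remaining:
            cols = selected + [f]
            model = LogisticRegression(max_iter=1000)
            scores.append((cross_val_score(model, X[:, cols], y, cv=5).mean(), f))
        score, f = max(scores)
        if score <= best_score:        # stop when no feature improves accuracy
            break
        best_score, selected = score, selected + [f]
        remaining.remove(f)
    return selected, best_score

# Usage on invented synthetic data: 2 informative features among 10
X, y = make_classification(n_samples=300, n_features=10, n_informative=2,
                           n_redundant=0, random_state=0)
print(forward_selection(X, y))
```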

Data Reduction
- Instance reduction: sampling, aggregation
- Dimension reduction: feature selection, feature extraction/creation