CS249: ADVANCED DATA MINING Clustering Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu May 2, 2017

Announcements: Homework 2 due later today, May 3rd (11:59pm). Course project proposal due May 8th (11:59pm). Homework 3 out, due May 10th (11:59pm).

Learnt Clustering Methods (overview of techniques covered so far, by data type):
Vector data: Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; NN. Clustering: k-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means. Prediction: Linear Regression; GLM.
Text data: Clustering: PLSA; LDA. Feature representation: word embedding.
Recommender systems: Clustering: Matrix Factorization. Prediction: Collaborative Filtering.
Graph & network data: Classification: Label Propagation. Clustering: SCAN; Spectral Clustering. Ranking: PageRank. Feature representation: network embedding.

Evaluation and Other Practical Issues: Evaluation of Clustering; Similarity and Dissimilarity; Summary.

Measuring Clustering Quality. Two kinds of methods: extrinsic vs. intrinsic. Extrinsic: supervised, i.e., the ground truth is available; compare a clustering against the ground truth using a clustering quality measure, e.g., purity, precision and recall metrics, or normalized mutual information. Intrinsic: unsupervised, i.e., the ground truth is unavailable; evaluate the goodness of a clustering by considering how well the clusters are separated and how compact they are, e.g., the silhouette coefficient.

Purity. Let $C = \{c_1, \dots, c_K\}$ be the output clustering result and $\Omega = \{\omega_1, \dots, \omega_J\}$ be the ground-truth clustering result (ground-truth classes), where $c_k$ and $\omega_j$ are sets of data points. Then
$$purity(C, \Omega) = \frac{1}{N} \sum_k \max_j |c_k \cap \omega_j|$$
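
To make the definition concrete, here is a minimal Python sketch (my illustration, not from the slides); the function name and the toy labels in the usage line are assumptions.

def purity(cluster_labels, class_labels):
    # For each output cluster, take the size of its majority ground-truth
    # class, sum over clusters, and divide by the total number of points N.
    n = len(cluster_labels)
    total = 0
    for c in set(cluster_labels):
        members = [cls for cl, cls in zip(cluster_labels, class_labels) if cl == c]
        total += max(members.count(cls) for cls in set(members))
    return total / n

# Hypothetical usage: 5 points, clusters 1..3 vs. classes 'x', 'o', 'd'
print(purity([1, 1, 2, 2, 3], ['x', 'x', 'o', 'x', 'd']))  # (2 + 1 + 1) / 5 = 0.8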

Example. Clustering output: cluster 1, cluster 2, and cluster 3. Ground-truth clustering result (classes): crosses, circles, and diamonds. Compare cluster 1 against the crosses, cluster 2 against the circles, and cluster 3 against the diamonds. (The original slide shows the three clusters as a scatter plot of crosses, circles, and diamonds; the counts appear in the contingency table two slides below.) With majority classes of sizes 5, 4, and 3 out of N = 17 points, purity = (5 + 4 + 3) / 17 ≈ 0.71.

Normalized Mutual Information.
$$NMI(C, \Omega) = \frac{I(C, \Omega)}{\sqrt{H(C)\, H(\Omega)}}$$
$$I(\Omega, C) = \sum_k \sum_j P(c_k \cap \omega_j) \log \frac{P(c_k \cap \omega_j)}{P(c_k)\, P(\omega_j)} = \sum_k \sum_j \frac{|c_k \cap \omega_j|}{N} \log \frac{N\, |c_k \cap \omega_j|}{|c_k|\, |\omega_j|}$$
$$H(\Omega) = -\sum_j P(\omega_j) \log P(\omega_j) = -\sum_j \frac{|\omega_j|}{N} \log \frac{|\omega_j|}{N}$$

Example: NMI = 0.36, computed from the following contingency table of ground-truth classes ω (rows) against output clusters c (columns):

             Cluster 1   Cluster 2   Cluster 3   sum
  crosses        5           1           2        8
  circles        1           4           0        5
  diamonds       0           1           3        4
  sum            6           6           5      N = 17
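
A short Python sketch (my illustration, not part of the slides) that applies the formulas above to this contingency table; it reproduces the value of roughly 0.36.

import math

# Contingency table from the slide: rows = ground-truth classes
# (crosses, circles, diamonds), columns = output clusters 1..3.
table = [[5, 1, 2],
         [1, 4, 0],
         [0, 1, 3]]
N = sum(sum(row) for row in table)                 # 17
class_tot = [sum(row) for row in table]            # |omega_j|
clus_tot = [sum(col) for col in zip(*table)]       # |c_k|

def entropy(counts):
    # H = -sum_j (n_j / N) log(n_j / N)
    return -sum(c / N * math.log(c / N) for c in counts if c > 0)

mi = sum(n_jk / N * math.log(N * n_jk / (class_tot[j] * clus_tot[k]))
         for j, row in enumerate(table)
         for k, n_jk in enumerate(row) if n_jk > 0)

nmi = mi / math.sqrt(entropy(class_tot) * entropy(clus_tot))
print(round(nmi, 2))   # 0.36, matching the slide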

Precision and Recall (pair counting). Consider pairs of data points: ideally, two data points in the same ground-truth class are placed in the same output cluster (TP), and two data points in different classes are placed in different clusters (TN).

                      Same cluster   Different clusters
  Same class              TP               FN
  Different classes       FP               TN

Rand Index (RI) = (TP + TN) / (TP + FP + FN + TN)
Precision P = TP / (TP + FP), Recall R = TP / (TP + FN)
F-measure = 2PR / (P + R)

Example.

  Data point   Output cluster   Ground-truth class
  a                 1                 2
  b                 1                 2
  c                 2                 2
  d                 2                 1

Number of pairs of data points: 6.
(a, b): same class, same cluster (TP)
(a, c): same class, different clusters (FN)
(a, d): different classes, different clusters (TN)
(b, c): same class, different clusters (FN)
(b, d): different classes, different clusters (TN)
(c, d): different classes, same cluster (FP)
So TP = 1, FP = 1, FN = 2, TN = 2; RI = 0.5, P = 1/2, R = 1/3, F = 0.4.
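
The same pair counting can be scripted; below is a short Python sketch (illustrative, not from the slides) that reproduces TP, FP, FN, TN, RI, P, R, and F for this example.

from itertools import combinations

# (output cluster, ground-truth class) per data point, as in the table above
points = {'a': (1, 2), 'b': (1, 2), 'c': (2, 2), 'd': (2, 1)}

TP = FP = FN = TN = 0
for x, y in combinations(points.values(), 2):
    same_cluster = x[0] == y[0]
    same_class = x[1] == y[1]
    if same_cluster and same_class:
        TP += 1
    elif same_cluster:
        FP += 1
    elif same_class:
        FN += 1
    else:
        TN += 1

RI = (TP + TN) / (TP + FP + FN + TN)       # 0.5
P, R = TP / (TP + FP), TP / (TP + FN)      # 0.5 and 1/3
F = 2 * P * R / (P + R)                    # 0.4
print(TP, FP, FN, TN, RI, round(F, 2))     # 1 1 2 2 0.5 0.4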

Question: if we flip the ground-truth labels (2 -> 1 and 1 -> 2), will the evaluation results be the same?

  Data point   Output cluster   Ground-truth class
  a                 1                 2
  b                 1                 2
  c                 2                 2
  d                 2                 1

Evaluation and Other Practical Issues: Evaluation of Clustering; Similarity and Dissimilarity; Summary.

Similarity and Dissimilarity. Similarity: a numerical measure of how alike two data objects are; the value is higher when the objects are more alike, and often falls in the range [0, 1]. Dissimilarity (e.g., distance): a numerical measure of how different two data objects are; it is lower when the objects are more alike; the minimum dissimilarity is often 0, while the upper limit varies. Proximity refers to either a similarity or a dissimilarity.

Data Matrix and Dissimilarity Matrix. Data matrix: n data points with p dimensions; two modes.
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
Dissimilarity matrix: n data points, but it registers only the pairwise distances; a triangular matrix; single mode.
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

Example: Data Matrix and Dissimilarity Matrix. (The original slide also plots the four points in the plane.)

  Data matrix:
    point   attribute1   attribute2
    1           1            2
    2           3            5
    3           2            0
    4           4            5

  Dissimilarity matrix (with Euclidean distance):
           1      2      3      4
    1      0
    2    3.61     0
    3    2.24    5.1     0
    4    4.24     1     5.39    0
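
For reference, a few lines of Python (illustrative, not from the slides) that recompute the lower-triangular Euclidean dissimilarity matrix from the data matrix above.

import math

points = [(1, 2), (3, 5), (2, 0), (4, 5)]   # the four points from the slide

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Print the lower-triangular dissimilarity matrix, row by row.
for i, p in enumerate(points):
    print([round(euclidean(p, points[j]), 2) for j in range(i + 1)])
# [0.0]
# [3.61, 0.0]
# [2.24, 5.1, 0.0]
# [4.24, 1.0, 5.39, 0.0]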

Proximity Measure for Nominal Attributes. A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute). Method 1: simple matching, where m is the number of matches and p the total number of variables:
$$d(i, j) = \frac{p - m}{p}$$
Method 2: use a large number of binary attributes, creating a new binary attribute for each of the M nominal states.
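
A tiny Python sketch of Method 1 (simple matching), added here only as an illustration; the two objects and their attribute values are made up.

def nominal_dissim(x, y):
    # Simple matching: d(i, j) = (p - m) / p, where m is the number of
    # attributes on which the two objects agree and p the total number.
    p = len(x)
    m = sum(a == b for a, b in zip(x, y))
    return (p - m) / p

# Two objects described by three nominal attributes; they differ on one.
print(nominal_dissim(('red', 'small', 'round'), ('red', 'large', 'round')))  # 1/3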

Proximity Measure for Binary Attributes. A contingency table for binary data: for objects i and j, let q be the number of attributes equal to 1 for both objects, r the number equal to 1 for i but 0 for j, s the number equal to 0 for i but 1 for j, and t the number equal to 0 for both.

                     Object j
                     1      0
  Object i    1      q      r
              0      s      t

Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$
Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$
Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$

Dissimilarity between Binary Variables: Example.

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack     M        Y       N       P        N        N        N
  Mary     F        Y       N       P        N        P        N
  Jim      M        Y       P       N        N        N        N

Gender is a symmetric attribute; the remaining attributes are asymmetric binary. Let the values Y and P be 1 and the value N be 0. Using the asymmetric binary distance on the asymmetric attributes:
$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33, \quad d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67, \quad d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
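
The three distances can be checked with a short Python sketch (illustrative, not from the slides); the 0/1 encodings below follow the slide's Y/P -> 1, N -> 0 convention on the six asymmetric attributes.

def asym_binary_dissim(x, y):
    # Asymmetric binary dissimilarity d = (r + s) / (q + r + s): mismatches
    # over all attributes except the negative matches t (both zero).
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))   # positive matches
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim = (1, 1, 0, 0, 0, 0)
print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75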

Standardizing Numeric Data. Z-score: $z = \frac{x - \mu}{\sigma}$, where x is the raw score to be standardized, μ the mean of the population, and σ the standard deviation. It measures the distance between the raw score and the population mean in units of the standard deviation; it is negative when the raw score is below the mean and positive when above. An alternative way: calculate the mean absolute deviation
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right), \quad \text{where } m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right),$$
and use the standardized measure (z-score) $z_{if} = \frac{x_{if} - m_f}{s_f}$. Using the mean absolute deviation is more robust than using the standard deviation.
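
A minimal Python sketch (illustrative) of the mean-absolute-deviation standardization described above; the sample attribute column is made up.

def standardize(values):
    # z_if = (x_if - m_f) / s_f with s_f the mean absolute deviation,
    # which is more robust to outliers than the standard deviation.
    n = len(values)
    m = sum(values) / n                          # mean m_f
    s = sum(abs(v - m) for v in values) / n      # mean absolute deviation s_f
    return [(v - m) / s for v in values]

# Standardize one attribute column (made-up numbers).
print(standardize([2.0, 4.0, 6.0, 100.0]))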

Distance on Numeric Data: Minkowski Distance. The Minkowski distance is a popular distance measure:
$$d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h\right)^{1/h},$$
where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm). Properties: $d(i, j) > 0$ if $i \neq j$, and $d(i, i) = 0$ (positive definiteness); $d(i, j) = d(j, i)$ (symmetry); $d(i, j) \leq d(i, k) + d(k, j)$ (triangle inequality). A distance that satisfies these properties is a metric.

Special Cases of Minkowski Distance.
h = 1: Manhattan (city block, L1 norm) distance, e.g., the Hamming distance: the number of bits that are different between two binary vectors:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$
h = 2: Euclidean (L2 norm) distance:
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$
h → ∞: supremum (L-max norm, L∞ norm) distance, the maximum difference between any component (attribute) of the two vectors.

Example: Minkowski Distance.

  point   attribute 1   attribute 2
  1            1             2
  2            3             5
  3            2             0
  4            4             5

  Dissimilarity matrices:

  Manhattan (L1)        1     2     3     4
    1                   0
    2                   5     0
    3                   3     6     0
    4                   6     1     7     0

  Euclidean (L2)        1     2     3     4
    1                   0
    2                 3.61    0
    3                 2.24   5.1    0
    4                 4.24    1    5.39   0

  Supremum (L_max)      1     2     3     4
    1                   0
    2                   3     0
    3                   2     5     0
    4                   3     1     5     0
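
A few lines of Python (illustrative, not from the slides) showing the three special cases on points 1 and 2 from the example above.

def minkowski(p, q, h):
    # L_h norm distance; h = 1 gives Manhattan, h = 2 gives Euclidean.
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1 / h)

def supremum(p, q):
    # L_max (L_infinity) distance: the largest per-attribute difference.
    return max(abs(a - b) for a, b in zip(p, q))

x1, x2 = (1, 2), (3, 5)                    # points 1 and 2 from the slide
print(minkowski(x1, x2, 1))                # 5.0  (Manhattan)
print(round(minkowski(x1, x2, 2), 2))      # 3.61 (Euclidean)
print(supremum(x1, x2))                    # 3    (supremum)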

Ordinal Variables. Order is important, e.g., rank. They can be treated like interval-scaled variables: replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$; map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1};$$
then compute the dissimilarity using methods for interval-scaled variables.
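
A one-line rank-to-[0, 1] mapping in Python, added as an illustration; the three-level grade scale is a made-up example.

def ordinal_to_interval(rank, M):
    # Map the rank r_if in {1, ..., M_f} onto [0, 1].
    return (rank - 1) / (M - 1)

# e.g. grades {fair, good, excellent} -> ranks {1, 2, 3} -> {0.0, 0.5, 1.0}
print([ordinal_to_interval(r, 3) for r in (1, 2, 3)])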

Attributes of Mixed Type. A database may contain all attribute types: nominal, symmetric binary, asymmetric binary, numeric, ordinal. One may use a weighted formula to combine their effects:
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and $d_{ij}^{(f)} = 1$ otherwise. If f is numeric: use the normalized distance. If f is ordinal: compute the rank $r_{if}$, set $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled. Clustering algorithm: k-prototypes.
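
A minimal Python sketch (illustrative) of the weighted combination above; the per-attribute dissimilarities and indicator values are made up.

def mixed_dissim(dists, indicators):
    # d(i, j) = sum_f delta_ij^(f) d_ij^(f) / sum_f delta_ij^(f), where
    # delta_ij^(f) is 0 when attribute f does not contribute (e.g., a missing
    # value, or a double zero on an asymmetric binary attribute) and 1 otherwise.
    num = sum(delta * d for delta, d in zip(indicators, dists))
    den = sum(indicators)
    return num / den if den else 0.0

# Three per-attribute dissimilarities; the second attribute is not applicable.
print(mixed_dissim([0.2, 0.9, 0.5], [1, 0, 1]))  # (0.2 + 0.5) / 2 = 0.35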

Cosine Similarity. A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Other vector objects: gene features in micro-arrays, etc. Applications: information retrieval, biological taxonomy, gene feature mapping, etc. Cosine measure: if $d_1$ and $d_2$ are two vectors (e.g., term-frequency vectors), then
$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\|\, \|d_2\|},$$
where $\cdot$ indicates the vector dot product and $\|d\|$ is the length (norm) of vector d. Clustering algorithm: spherical k-means.

Example: Cosine Similarity. Find the similarity between documents 1 and 2, with term-frequency vectors
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) and d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1).
d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 ≈ 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 ≈ 4.12
cos(d1, d2) = 25 / (6.481 * 4.12) ≈ 0.94
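
The same computation in a few lines of Python (illustrative, not from the slides), reproducing cos(d1, d2) ≈ 0.94.

import math

def cosine(d1, d2):
    # cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine(d1, d2), 2))   # 0.94, as on the slide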

Evaluation and Other Practical Issues: Evaluation of Clustering; Similarity and Dissimilarity; Summary.

Summary. Evaluation of clustering: purity, NMI, RI, F-measure. Similarity and dissimilarity: nominal attributes, numerical attributes, combining attributes of mixed type, high-dimensional feature vectors.