Cluster Validation: Determining the Number of Clusters. Umut ORHAN, PhD.

Cluster Validation

The procedure of evaluating the results of a clustering algorithm is known under the term cluster validity. How do we evaluate the goodness of the resulting clusters? And why do we want to evaluate them at all?
- To avoid finding patterns in noise
- To compare clusterings, or clustering algorithms

Cluster Validation

Typical tasks of cluster validation:
- Determining the clustering tendency of a set of data.
- Comparing the results of a cluster analysis to externally known results.
- Evaluating how well the results of a cluster analysis fit the data without reference to external information.
- Comparing the results of two different sets of cluster analyses to determine which is better.
- Determining the correct number of clusters.

Measures that are applied to judge various aspects of cluster validity are classified into the following two types:
- External index: used to measure the extent to which cluster labels match externally supplied class labels, e.g., entropy, precision, recall.
- Internal index: used to measure the goodness of a clustering structure without reference to external information, e.g., the sum of squared error (SSE).

Cluster Validation

- External index: validate against ground truth; compare two clusterings (how similar are they?).
- Internal index: validate without external information; compare runs with different numbers of clusters; solve for the number of clusters.

External Index

Assume that the data are labeled with some class labels; this is called the ground truth. We want the clusters to be homogeneous with respect to the classes. External measures are based on a matrix that summarizes the numbers of correct and wrong predictions (a cluster-by-class contingency table).

External Index

Notation:
- $n$ = number of points
- $m_i$ = number of points in cluster $i$
- $c_j$ = number of points in class $j$
- $n_{ij}$ = number of points in cluster $i$ coming from class $j$
- $p_{ij} = n_{ij} / m_i$ = probability of an element from cluster $i$ being assigned to class $j$

Entropy of a cluster $i$: $e_i = -\sum_{j=1}^{L} p_{ij} \log p_{ij}$, where $L$ is the number of classes.

Precision of a cluster $i$ with respect to a class $j$: $\mathrm{Prec}(i, j) = p_{ij}$, the fraction of cluster $i$ that consists of objects of class $j$.

External Index

Recall of a cluster $i$ with respect to a class $j$: $\mathrm{Rec}(i, j) = n_{ij} / c_j$, the extent to which cluster $i$ contains all objects of class $j$.

F-measure: $F(i, j) = \dfrac{2\,\mathrm{Prec}(i, j)\,\mathrm{Rec}(i, j)}{\mathrm{Prec}(i, j) + \mathrm{Rec}(i, j)}$

Example (assign to cluster $i$ the class $k$ such that $k = \arg\max_j n_{ij}$): one clustering gives per-cluster precisions of (0.94, 0.81, 0.85), overall 0.86, and recalls of (0.85, 0.9, 0.85), overall 0.87; another gives precisions of (0.38, 0.38, 0.38), overall 0.38, and recalls of (0.35, 0.42, 0.38), overall 0.39, so the first is clearly the better clustering.
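A minimal sketch, not from the slides, of how all of these indices fall out of the cluster-by-class counts $n_{ij}$; the contingency matrix below is made up for illustration:

```python
import numpy as np

def external_indices(n_ij):
    """n_ij[i, j] = number of points in cluster i belonging to class j."""
    m_i = n_ij.sum(axis=1, keepdims=True)        # points per cluster
    c_j = n_ij.sum(axis=0)                       # points per class
    prec = n_ij / m_i                            # p_ij = Prec(i, j)
    safe = np.where(prec > 0, prec, 1.0)         # avoid log(0); those terms are 0 anyway
    entropy = -np.sum(prec * np.log2(safe), axis=1)
    rec = n_ij / c_j                             # Rec(i, j)
    denom = np.where(prec + rec > 0, prec + rec, 1.0)
    f = 2 * prec * rec / denom                   # F(i, j)
    return entropy, prec, rec, f

counts = np.array([[47, 2, 1],                   # cluster 0 against classes 0..2
                   [4, 45, 3],
                   [2, 5, 41]])
entropy, prec, rec, f = external_indices(counts)
print(entropy)                                   # per-cluster entropy (low = homogeneous)
print(prec.max(axis=1))                          # precision of each cluster's majority class
```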

Internal Index

Used to measure the goodness of a clustering structure without reference to external information. Common internal indices:
- Variances within clusters and between clusters
- Silhouette coefficient
- F-ratio
- Davies-Bouldin index (DBI)

Internal validation measures are often based on the following two criteria:
- Cluster cohesion: measures how closely related the objects in a cluster are.
- Cluster separation: measures how distinct or well-separated a cluster is from other clusters.
In other words, intra-cluster variance is minimized and inter-cluster variance is maximized.

Internal Index

Cohesion is measured by the within-cluster sum of squares (SSE):
$$\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - c_i)^2$$
Separation is measured by the between-cluster sum of squares:
$$\mathrm{BSS} = \sum_i m_i\,(c - c_i)^2$$
where $c_i$ is the centroid of cluster $C_i$, $m_i$ is its size, and $c$ is the overall centroid.

Example: for a fixed data set, $\mathrm{BSS} + \mathrm{WSS} = \text{constant}$ (the total sum of squares), no matter how the points are partitioned.
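A sketch on synthetic data (the data set and the choice of three clusters are assumptions for illustration) that computes WSS and BSS for a k-means partition and checks the decomposition:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))
k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
c = X.mean(axis=0)                                    # overall centroid

wss = sum(((X[km.labels_ == i] - km.cluster_centers_[i]) ** 2).sum()
          for i in range(k))
bss = sum((km.labels_ == i).sum() * ((km.cluster_centers_[i] - c) ** 2).sum()
          for i in range(k))
tss = ((X - c) ** 2).sum()
print(wss + bss, tss)                                 # equal up to floating-point error
```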

Silhouette Coefficient

The silhouette coefficient combines the ideas of both cohesion and separation. For an individual point $x$:
- Calculate $a(x)$ = the average distance of $x$ to the other points in its own cluster (cohesion).
- Calculate $b(x)$ = the minimum, over the other clusters, of the average distance of $x$ to the points in that cluster (separation).
The silhouette of $x$ is then $s(x) = \dfrac{b(x) - a(x)}{\max(a(x), b(x))}$, which lies in $[-1, 1]$; values near 1 indicate a well-placed point.
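A sketch of the per-point computation just described, on assumed synthetic data; scikit-learn's silhouette_score gives the same quantity averaged over the whole data set:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_point(X, labels, idx):
    """s(x) = (b - a) / max(a, b) for the single point X[idx]."""
    own = labels[idx]
    d = np.linalg.norm(X - X[idx], axis=1)
    a = d[labels == own].sum() / max((labels == own).sum() - 1, 1)   # cohesion
    b = min(d[labels == k].mean() for k in set(labels) if k != own)  # separation
    return (b - a) / max(a, b)

X = np.random.default_rng(1).normal(size=(200, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
print(silhouette_point(X, labels, 0), silhouette_score(X, labels))
```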

Davies-Bouldin Index

The Davies-Bouldin index can be calculated by the following formula:
$$\mathrm{DB} = \frac{1}{k} \sum_{x=1}^{k} \max_{y \neq x} \frac{\sigma_x + \sigma_y}{d(c_x, c_y)}$$
where $c_x$ is the centroid of cluster $x$ and $\sigma_x$ is the average distance of all elements in cluster $x$ to the centroid $c_x$. Lower values indicate better clustering.

Dunn Index

The Dunn index can be calculated by the following formula:
$$D = \frac{\min_{i \neq j} d(i, j)}{\max_k d'(k)}$$
where $d(i, j)$ represents the distance between clusters $i$ and $j$, and $d'(k)$ measures the intra-cluster distance of cluster $k$. Higher values indicate better clustering.
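A sketch of both indices: Davies-Bouldin via scikit-learn, and a straightforward Dunn index where $d(i, j)$ is taken as the minimum pairwise distance between clusters and $d'(k)$ as the cluster diameter; these are one common choice, since the slides leave both open. The data are synthetic:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    min_between = min(cdist(a, b).min()                  # closest pair across clusters
                      for i, a in enumerate(clusters)
                      for b in clusters[i + 1:])
    max_within = max(pdist(c).max() for c in clusters)   # largest cluster diameter
    return min_between / max_within

X = np.random.default_rng(2).normal(size=(200, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(X)
print(davies_bouldin_score(X, labels), dunn_index(X, labels))
```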

F-Ratio

Measures the ratio of the between-groups variance against the within-groups variance. A standard form, built from the within-cluster sum of squares $\mathrm{SSW}$ and the between-cluster sum of squares $\mathrm{BSS}$, is
$$F = \frac{\mathrm{BSS}/(k - 1)}{\mathrm{SSW}/(N - k)}$$
where $N$ is the number of points and $k$ the number of clusters. Larger values indicate more compact, better separated clusters (a code sketch follows the outline below).

Determining Number of Clusters

- By rule of thumb
- Elbow method
- Choosing k using the silhouette
- Cross validation
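A sketch of the ANOVA-style F-ratio given above, computed on assumed synthetic data from the same SSW and BSS quantities as before:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(3).normal(size=(300, 2))
k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
c = X.mean(axis=0)
ssw = sum(((X[km.labels_ == i] - km.cluster_centers_[i]) ** 2).sum()
          for i in range(k))
bss = sum((km.labels_ == i).sum() * ((km.cluster_centers_[i] - c) ** 2).sum()
          for i in range(k))
print((bss / (k - 1)) / (ssw / (len(X) - k)))   # larger F = better separated
```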

By Rule of Thumb

It is a simple method that can be applied to any type of data set: choose
$$k \approx \sqrt{n/2}$$
where $n$ is the number of data points. For example, with $n = 200$ points this suggests $k = 10$.

Elbow Method

The oldest method for determining the true number of clusters in a data set is inelegantly called the elbow method. The idea:
- Start with K = 2, and keep increasing it by 1 in each step.
- At each step, compute the clustering and the cost that comes with the training.
- At some value of K the cost drops dramatically, and after that it reaches a plateau as you increase K further. That K is the elbow.

Elbow Method

[Figure: the cost (within-cluster SSE) plotted against the number of clusters K; the "elbow" of the curve marks the chosen K.]
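A sketch of the elbow loop on assumed synthetic data; KMeans.inertia_ is scikit-learn's name for the within-cluster SSE used as the cost:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(4).normal(size=(300, 2))
for K in range(2, 11):
    cost = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).inertia_
    print(K, round(cost, 1))
# Pick the K after which the cost stops dropping dramatically.
```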

Cross Valdaton Ths method splts the data n two or more (K) parts. One part s used for clusterng and the other parts are used for valdaton. The value of the objectve functon calculated for each part. These K values are calculated and averaged for each alternatve number of clusters. The cluster number s selected that leads to only a small reducton n the objectve functon. 25 Herarchcal: Dendogram Cuttng a dendrogram at a certan level gves a set of clusters. 26 13

Density-Based: Determining ε

The average distance of each point to its MinPts nearest neighbors is calculated. These MinPts-distances are then plotted in ascending order; the "knee" of the resulting curve suggests a good value of ε. The ε value can also be determined by the user.
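A sketch of this k-distance heuristic on assumed synthetic data, with MinPts = 4 chosen arbitrarily; ε is read off at the knee of the sorted curve:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(6).normal(size=(300, 2))
min_pts = 4
dists, _ = NearestNeighbors(n_neighbors=min_pts + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, 1:].mean(axis=1))   # drop the zero self-distance, average, sort
print(k_dist[-10:])                           # the tail of the curve; the knee suggests eps
```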