Chapter 3: Cluster Analysis


Chapter 3: Cluster Analysis
3.1 Basic Concepts of Clustering
  3.1.1 Cluster Analysis
  3.1.2 Clustering Categories
3.2 Partitioning Methods
  3.2.1 The Principle
  3.2.2 K-Means Method
  3.2.3 K-Medoids Method
  3.2.4 CLARA
  3.2.5 CLARANS
3.3 Hierarchical Methods
3.4 Density-based Methods
3.5 Clustering High-Dimensional Data
3.6 Outlier Analysis

3.1.1 Cluster Analysis
- Unsupervised learning (i.e., the class label is unknown)
- Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
- Principle: maximize intra-class similarity and minimize inter-class similarity
- Typical applications: WWW, social networks, marketing, biology, libraries, etc.

3.1.2 Clustering Categories
- Partitioning methods: construct k partitions of the data
- Hierarchical methods: create a hierarchical decomposition of the data
- Density-based methods: grow a given cluster depending on its density (number of data objects)
- Grid-based methods: quantize the object space into a finite number of cells
- Model-based methods: hypothesize a model for each cluster and find the best fit of the data to the given model
- Clustering high-dimensional data: subspace clustering
- Constraint-based methods: used for user-specific applications


3.2.1 Partitioning Methods: The Principle
- Given:
  - A data set of n objects
  - K, the number of clusters to form
- Organize the objects into k partitions (k <= n), where each partition represents a cluster
- The clusters are formed to optimize an objective partitioning criterion:
  - Objects within a cluster are similar
  - Objects of different clusters are dissimilar

3.2.2 K-Means Method
Goal: create 3 clusters (partitions)
- Choose 3 objects as initial cluster centroids
- Assign each object to the closest centroid to form clusters
- Update the cluster centroids
[Figure: the objects and the three centroids, marked +]

K-Means Method (continued)
- Recompute the clusters
- If the centroids are stable, then stop
[Figure: the clusters after the centroids are updated]

K-Means Algorithm
- Input:
  - K: the number of clusters
  - D: a data set containing n objects
- Output: a set of k clusters
- Method:
  (1) Arbitrarily choose k objects from D as the initial cluster centers
  (2) Repeat
  (3)   Reassign each object to the most similar cluster, based on the mean value of the objects in the cluster
  (4)   Update the cluster means
  (5) Until no change
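A minimal sketch of this loop in Python/NumPy (the function name kmeans, the Euclidean distance, the max_iter cap, and the random seed are illustrative choices, not prescribed by the slides; D is assumed to be an (n, d) array):

```python
import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Minimal k-means: random initial centers, then reassign and update until stable."""
    rng = np.random.default_rng(seed)
    centers = D[rng.choice(len(D), size=k, replace=False)]           # step (1)
    for _ in range(max_iter):                                         # steps (2)-(5)
        # step (3): reassign each object to the closest current center
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step (4): update each cluster mean (keep the old center if a cluster is empty)
        new_centers = np.array([D[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):                         # step (5): no change
            break
        centers = new_centers
    return labels, centers
```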

K-Means Properties
- The algorithm attempts to determine k partitions that minimize the square-error function

  E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2

  - E: the sum of the squared error for all objects in the data set
  - p: the point in space representing a data object
  - m_i: the mean of cluster C_i
- It works well when the clusters are compact clouds that are rather well separated from one another
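For reference, the objective can be transcribed directly; this helper (the name sse is illustrative) assumes NumPy arrays for the data, the labels, and the cluster means:

```python
import numpy as np

def sse(D, labels, centers):
    """Square-error E: sum over clusters C_i of squared distances ||p - m_i||^2."""
    return sum(np.sum((D[labels == i] - m) ** 2) for i, m in enumerate(centers))
```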

K-Means Properties
Advantages:
- K-means is relatively scalable and efficient in processing large data sets
- The computational complexity of the algorithm is O(nkt)
  - n: the total number of objects
  - k: the number of clusters
  - t: the number of iterations
  - Normally k << n and t << n
Disadvantages:
- It can be applied only when the mean of a cluster is defined
- Users need to specify k
- K-means is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes
- It is sensitive to noise and outlier data points (they can distort the mean value)

Variations of the K-Means Method
- The variants of k-means differ in:
  - Selection of the initial k means
  - Dissimilarity calculations
  - Strategies to calculate cluster means
- Handling categorical data: k-modes (Huang '98)
  - Replaces the means of clusters with modes
  - Uses new dissimilarity measures to deal with categorical objects
  - Uses a frequency-based method to update the modes of clusters
  - Can handle a mixture of categorical and numerical data
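As a sketch of the k-modes idea (illustrative helpers, not the full algorithm): the dissimilarity between two categorical objects is the number of attributes on which they disagree, and a cluster's "center" is the mode, i.e., the most frequent value per attribute.

```python
from collections import Counter

def mismatch(x, y):
    """k-modes style dissimilarity: count of attributes on which two objects differ."""
    return sum(a != b for a, b in zip(x, y))

def mode_of(cluster):
    """Cluster center for categorical data: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))
```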

3.2.3 K-Medoids Method
- Minimizes the sensitivity of k-means to outliers
- Picks actual objects to represent clusters instead of mean values
- Each remaining object is clustered with the representative object (medoid) to which it is the most similar
- The algorithm minimizes the sum of the dissimilarities between each object and its corresponding reference point

  E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - o_i|

  - E: the sum of absolute error for all objects in the data set
  - p: the point in space representing a data object
  - o_i: the representative object (medoid) of cluster C_i

K-Medoids Method: The Idea
- Initial representatives are chosen randomly
- The iterative process of replacing representative objects by non-representative objects continues as long as the quality of the clustering improves
- For each representative object O:
  - For each non-representative object R, swap O and R
- Choose the configuration with the lowest cost
- The cost function is the difference in absolute error value when a current representative object is replaced by a non-representative object
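A minimal sketch of this swap evaluation, using the Manhattan (L1) distance that the worked example below uses (the names absolute_error and swap_cost are illustrative; medoids are held as row indices into the data array D):

```python
import numpy as np

def absolute_error(D, medoids):
    """E: each object contributes its L1 distance to its closest medoid."""
    dists = np.abs(D[:, None, :] - D[medoids][None, :, :]).sum(axis=2)
    return dists.min(axis=1).sum()

def swap_cost(D, medoids, out_idx, in_idx):
    """S = E(after swapping medoid out_idx for object in_idx) - E(before).
    S < 0 means the swap improves the clustering."""
    new_medoids = [in_idx if m == out_idx else m for m in medoids]
    return absolute_error(D, new_medoids) - absolute_error(D, medoids)
```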

K-Medoids Method: Example
Data objects:

  Object  A1  A2
  O1       2   6
  O2       3   4
  O3       3   8
  O4       4   7
  O5       6   2
  O6       6   4
  O7       7   3
  O8       7   4
  O9       8   5
  O10      7   6

Goal: create two clusters.
Choose randomly two medoids: O2 = (3, 4) and O8 = (7, 4).
[Figure: scatter plot of the ten objects with the two medoids marked]

K-Medoids Method: Example (continued)
- Assign each object to the closest representative object
- Using the L1 metric (Manhattan distance), we form the following clusters:
  Cluster1 = {O1, O2, O3, O4}
  Cluster2 = {O5, O6, O7, O8, O9, O10}

K-Medoids Method: Example (continued)
- Compute the absolute-error criterion for the set of medoids (O2, O8):

  E = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20

  (the first sum is over Cluster1 and the second over Cluster2; each term is an object's Manhattan distance to its medoid)

K-Medoids Method: Example (continued)
- Choose a random object, O7
- Swap O8 and O7
- Compute the absolute-error criterion for the set of medoids (O2, O7):

  E = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22

K-Medoids Method: Example (continued)
- Compute the cost function:

  S = Absolute error [O2, O7] - Absolute error [O2, O8] = 22 - 20 = 2

- Since S > 0, it is a bad idea to replace O8 by O7
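Assuming the absolute_error and swap_cost sketches above are in scope, the values on these slides can be reproduced on the example data (0-based indices, so O2 is index 1, O7 index 6, O8 index 7):

```python
import numpy as np

# the ten example objects (A1, A2)
D = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
              [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]])

medoids = [1, 7]                      # O2 = (3, 4), O8 = (7, 4)
print(absolute_error(D, medoids))     # 20
print(swap_cost(D, medoids, 7, 6))    # swap O8 for O7 = (7, 3): 22 - 20 = 2 > 0, reject
```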

K-Medoids Method
- In this example, changing the medoid of cluster 2 did not change the assignments of objects to clusters.
- What are the possible cases when we replace a medoid by another object?

K-Medoids Method: Replacement Cases
[Figure: cluster 1 with representative object A, cluster 2 with representative object B; B is swapped with a random object]
- First case: P is currently assigned to A; the assignment of P to A does not change.
- Second case: P is currently assigned to B; P is reassigned to A.

K-Medoids Method: Replacement Cases (continued)
- Third case: P is currently assigned to B; P is reassigned to the new representative (the random object that replaces B).
- Fourth case: P is currently assigned to A; P is reassigned to the new representative (the random object that replaces B).

K-Medoids Algorithm (PAM: Partitioning Around Medoids)
- Input:
  - K: the number of clusters
  - D: a data set containing n objects
- Output: a set of k clusters
- Method:
  (1) Arbitrarily choose k objects from D as representative objects (seeds)
  (2) Repeat
  (3)   Assign each remaining object to the cluster with the nearest representative object
  (4)   For each representative object O_j
  (5)     Randomly select a non-representative object O_random
  (6)     Compute the total cost S of swapping representative object O_j with O_random
  (7)     If S < 0 then replace O_j with O_random
  (8) Until no change
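A self-contained sketch of this loop under the steps above (the Manhattan distance, the max_iter cap, and the random seed are illustrative assumptions):

```python
import numpy as np

def pam(D, k, max_iter=100, seed=0):
    """PAM sketch: randomly seed k medoids, then accept random swaps with cost S < 0."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))              # step (1)

    def absolute_error(meds):
        # E: each object contributes its Manhattan distance to its closest medoid
        dists = np.abs(D[:, None, :] - D[meds][None, :, :]).sum(axis=2)
        return dists.min(axis=1).sum()

    for _ in range(max_iter):                                          # steps (2)-(8)
        improved = False
        for j in range(k):                                             # step (4)
            o_random = int(rng.integers(n))                            # step (5)
            if o_random in medoids:
                continue
            candidate = medoids.copy()
            candidate[j] = o_random
            S = absolute_error(candidate) - absolute_error(medoids)    # step (6)
            if S < 0:                                                  # step (7)
                medoids = candidate
                improved = True
        if not improved:                                               # step (8)
            break
    # step (3): final assignment of each object to its nearest medoid
    labels = np.abs(D[:, None, :] - D[medoids][None, :, :]).sum(axis=2).argmin(axis=1)
    return medoids, labels
```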

K-Medoids Properties (k-medoids vs. k-means)
- The complexity of each iteration is O(k(n - k)^2)
- For large values of n and k, such computation becomes very costly
- Advantages:
  - The k-medoids method is more robust than k-means in the presence of noise and outliers
- Disadvantages:
  - K-medoids is more costly than the k-means method
  - Like k-means, k-medoids requires the user to specify k
  - It does not scale well for large data sets

3.2.4 CLARA
- CLARA (Clustering LARge Applications) uses a sampling-based method to deal with large data sets
- A random sample should closely represent the original data; the sample is then clustered with PAM
- The chosen medoids will likely be similar to those that would have been chosen from the whole data set

CLARA
- Draw multiple samples of the data set
- Apply PAM to each sample
- Return the best clustering
[Figure: sample 1, sample 2, ..., sample m are each clustered with PAM; the best of the resulting clusterings is chosen]

CLARA Properties
- The complexity of each iteration is O(k s^2 + k(n - k))
  - s: the size of the sample
  - k: the number of clusters
  - n: the number of objects
- PAM finds the best k medoids among the given data; CLARA finds the best k medoids among the selected samples
- Problems:
  - The best k medoids may not be selected during the sampling process; in that case, CLARA will never find the best clustering
  - If the sampling is biased, we cannot obtain a good clustering
  - Clustering quality is traded off for efficiency
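A minimal sketch of CLARA on top of the pam sketch above (it assumes pam is in scope; the number of samples and the sample size are illustrative parameters, not values prescribed by the slides):

```python
import numpy as np

def clara(D, k, n_samples=5, sample_size=40, seed=0):
    """CLARA sketch: run PAM on several random samples and keep the medoids that
    give the lowest absolute error on the whole data set."""
    rng = np.random.default_rng(seed)
    best_medoids, best_error = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(D), size=min(sample_size, len(D)), replace=False)
        sample_medoids, _ = pam(D[idx], k)       # medoid indices within the sample
        medoids = idx[sample_medoids]            # map back to indices into D
        dists = np.abs(D[:, None, :] - D[medoids][None, :, :]).sum(axis=2)
        error = dists.min(axis=1).sum()          # absolute error on the full data set
        if error < best_error:
            best_medoids, best_error = medoids, error
    return best_medoids, best_error
```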

3.2.5 CLARANS
- CLARANS (Clustering Large Applications based upon RANdomized Search) was proposed to improve the quality and the scalability of CLARA
- It combines sampling techniques with PAM
- It does not confine itself to any one sample at a given time
- It draws a sample with some randomness in each step of the search

CLARANS: The Idea
[Figure: the clustering process viewed as a search over sets of medoids; each candidate swap of the current medoids has a cost, and when no neighboring solution has a negative cost, the current medoids are kept]

CLARANS: The Idea (CLARA, for comparison)
- CLARA draws a sample of nodes at the beginning of the search
- Neighbors are taken from the chosen sample
- This restricts the search to a specific area of the original data
[Figure: in both the first and second steps of the search, the neighbors of the current medoids come from the chosen sample]

CLARANS: The Idea (CLARANS)
- Does not confine the search to a localized area
- Stops the search when a local minimum is found
- Finds several local optima and outputs the clustering with the best local optimum
- The number of neighbors sampled from the original data is specified by the user
[Figure: in each step of the search, a random sample of neighbors of the current medoids is drawn from the original data]
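A minimal sketch of the CLARANS search (num_local and max_neighbors correspond to the user-specified number of local optima to find and of neighbors to sample; the default values and the Manhattan distance are illustrative assumptions):

```python
import numpy as np

def clarans(D, k, num_local=2, max_neighbors=20, seed=0):
    """CLARANS sketch: repeat a randomized local search num_local times and keep the
    best local optimum; at each node examine at most max_neighbors random swaps."""
    rng = np.random.default_rng(seed)
    n = len(D)

    def absolute_error(meds):
        dists = np.abs(D[:, None, :] - D[meds][None, :, :]).sum(axis=2)
        return dists.min(axis=1).sum()

    best_medoids, best_error = None, np.inf
    for _ in range(num_local):
        current = list(rng.choice(n, size=k, replace=False))   # random starting node
        current_error = absolute_error(current)
        tried = 0
        while tried < max_neighbors:
            j = int(rng.integers(k))                            # medoid to swap out
            o_random = int(rng.integers(n))                     # neighbor = one random swap
            if o_random in current:
                tried += 1
                continue
            neighbor = current.copy()
            neighbor[j] = o_random
            neighbor_error = absolute_error(neighbor)
            if neighbor_error < current_error:                  # move to a better neighbor
                current, current_error = neighbor, neighbor_error
                tried = 0                                       # restart the neighbor count
            else:
                tried += 1
        if current_error < best_error:                          # keep the best local optimum
            best_medoids, best_error = current, current_error
    return best_medoids, best_error
```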

CLARANS Properties
- Advantages:
  - Experiments show that CLARANS is more effective than both PAM and CLARA
  - It handles outliers
- Disadvantages:
  - The computational complexity of CLARANS is O(n^2), where n is the number of objects
  - The clustering quality depends on the sampling method

Summary of Section 3.2
- Partitioning methods find sphere-shaped clusters
- K-means is efficient for large data sets but sensitive to outliers
- PAM uses representative objects (medoids) as cluster centers instead of means
- CLARA and CLARANS are used for clustering large databases