
k-means clustering

km15 = kmeans(x[g==0,], 5)    # 5 clusters fitted within the "blue" class (g == 0)
km25 = kmeans(x[g==1,], 5)    # 5 clusters fitted within the "orange" class (g == 1)
for(i in 1:6831){             # classify each grid point by its nearest cluster center
  md = c(mydist(xnew[i,], km15$centers[1,]), mydist(xnew[i,], km15$centers[2,]),
         mydist(xnew[i,], km15$centers[3,]), mydist(xnew[i,], km15$centers[4,]),
         mydist(xnew[i,], km15$centers[5,]), mydist(xnew[i,], km25$centers[1,]),
         mydist(xnew[i,], km25$centers[2,]), mydist(xnew[i,], km25$centers[3,]),
         mydist(xnew[i,], km25$centers[4,]), mydist(xnew[i,], km25$centers[5,]))
  mark = which(md == min(md))
  nearest[i] = ifelse(mark <= 5, "blue", "orange")
}
plot(xnew, type="n", xlab="x1", ylab="x2", main="kmeans with 5 cluster centers")
points(xnew, col=nearest, pch=".")
points(km25$centers, col="orange", pch=19, cex=2)
points(km15$centers, col="blue", pch=19, cex=2)
points(x, col=ifelse(g==0, "blue", "orange"))
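The slide code assumes a distance helper mydist(), a preallocated nearest vector, and the mixture-example objects x, g and xnew. A minimal sketch of those pieces, assuming Euclidean distance and the ElemStatLearn copy of the data:

library(ElemStatLearn)                    # assumed source of mixture.example
data(mixture.example)
x    <- mixture.example$x                 # 200 x 2 training points
g    <- mixture.example$y                 # class labels: 0 = blue, 1 = orange
xnew <- mixture.example$xnew              # 6831 x 2 grid of points to colour
mydist  <- function(a, b) sqrt(sum((a - b)^2))    # plain Euclidean distance
nearest <- character(nrow(xnew))                  # predicted colour for each grid point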

[Figure: two panels, "kmeans with 2 cluster centers" and "kmeans with 5 cluster centers", each plotting x2 against x1.]

data(mixture.example)
p. 16: $m_{1k} \sim N_2\{(1, 0)^T, \mathbf{I}\}$, $k = 1, \ldots, 10$; $m_{2k} \sim N_2\{(0, 1)^T, \mathbf{I}\}$, $k = 1, \ldots, 10$
bluex <- mvrnorm(100, mu = m1[sample(10,1),], Sigma = matrix(c(1,0,0,1), ncol=2))
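Drawing the component means themselves is not shown on the slide; a minimal sketch under the $N_2$ specification above, assuming MASS::mvrnorm:

library(MASS)                                        # for mvrnorm
set.seed(1)
m1 <- mvrnorm(10, mu = c(1, 0), Sigma = diag(2))     # blue-class means m_{1k}
m2 <- mvrnorm(10, mu = c(0, 1), Sigma = diag(2))     # orange-class means m_{2k}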

The curse of dimensionality (§2.5)

"Local" in R^1 is quite different from "local" in R^p. Example: each feature variable is uniformly distributed on (0, 1).
To capture 10% of the sample in R^1 we need a window of length 0.1.
To capture 10% of the sample in R^10 we need a box with edge length $0.1^{1/10} \approx 0.80$ on each axis, i.e. a window of length 0.8 in every coordinate (Figure 2.6).
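A one-line check of that calculation in R (the helper name edge is just for illustration):

edge <- function(r, p) r^(1/p)   # edge length of a box capturing fraction r of uniform data in p dimensions
edge(0.10, 1)                    # 0.10  -- a genuinely local window in one dimension
edge(0.10, 10)                   # ~0.794 -- in ten dimensions the "local" box spans most of each axis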

... curse

Example: N data points uniformly distributed on a unit ball in R^p. How far from the origin is the nearest data point? The median of that distance is $(1 - 0.5^{1/N})^{1/p} \approx 0.52$ if p = 10, N = 500.

[Figure: median distance to the nearest data point against dimension p = 1, ..., 20, for N = 20, 100 and 500.]
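The median-distance formula is easy to evaluate and plot directly (med_dist is an illustrative name):

med_dist <- function(p, N) (1 - 0.5^(1/N))^(1/p)   # median distance from origin to nearest of N uniform points in the unit ball of R^p
med_dist(10, 500)                                   # ~0.52: the nearest point is more than half way to the boundary
curve(med_dist(x, 500), from = 1, to = 20, xlab = "dimension", ylab = "median distance")
curve(med_dist(x, 100), add = TRUE, lty = 2)
curve(med_dist(x, 20),  add = TRUE, lty = 3)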

Cluster Analysis (§14.3)

Goal: discover groupings among the cases; cases within clusters should be close, and clusters should be far apart (Figure 14.4).
Many (not all) clustering methods use as input an N × N matrix D of dissimilarities:
- require $D_{ii'} > 0$ for $i \ne i'$, $D_{ii'} = D_{i'i}$, and $D_{ii} = 0$
- sometimes the data are collected this way (see §14.3.1); more often D needs to be constructed from the N × p data matrix
- often (usually) $D_{ii'} = \sum_{j=1}^{p} d_j(x_{ij}, x_{i'j})$, where the $d_j(\cdot,\cdot)$ are to be chosen, e.g. $(x_{ij} - x_{i'j})^2$, $|x_{ij} - x_{i'j}|$, etc. See pp. 504-505 for more details on choosing a type of dissimilarity matrix
- this can be done using dist or daisy (the latter in the R library cluster)

... cluster analysis

Dissimilarities for categorical features:
- binary features: simple matching uses $D_{ii'} = \#\{(1,0) \text{ or } (0,1) \text{ pairs}\}/p$; the Jaccard coefficient uses $D_{ii'} = \#\{(1,0) \text{ or } (0,1) \text{ pairs}\}/\#\{(1,0), (0,1) \text{ or } (1,1) \text{ pairs}\}$
- ordered categories: use ranks as continuous data (see eq. (14.23))
- unordered categories: create binary dummy variables and use matching
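A toy comparison of the two binary dissimilarities for a pair of cases (values chosen for illustration); dist's "binary" method returns the Jaccard version:

xi <- c(1, 0, 1, 1, 0, 0)
xj <- c(1, 1, 0, 1, 0, 0)
p  <- length(xi)
mismatch <- sum(xi != xj)                     # number of (1,0) or (0,1) pairs
mismatch / p                                  # simple matching: 2/6
mismatch / sum(xi == 1 | xj == 1)             # Jaccard: 2/4, (0,0) pairs are ignored
dist(rbind(xi, xj), method = "binary")        # agrees with the Jaccard value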

... cluster analysis

dist(x, method = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"))
where maximum is $\max_{1 \le j \le p} |x_{ij} - x_{i'j}|$ and binary is the Jaccard coefficient.

daisy(x, metric = c("euclidean", "manhattan", "gower"), stand = FALSE, type = list(...))
where type indicates which columns are "ordratio", "logratio", "asymm" or "symm" (see the help files).

> x = matrix(rnorm(100), nrow=5)
> dim(x)
[1]  5 20
> dist(x)
         1        2        3        4
2 5.493679
3 6.360923 5.652732
4 7.439924 5.885949 7.960187
5 4.437444 3.679995 6.133873 5.936607
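For mixed-type data, daisy with the Gower metric is the usual choice; a small sketch with made-up data (requires the cluster package):

library(cluster)
df <- data.frame(height = c(1.62, 1.80, 1.75),
                 smoker = factor(c("yes", "no", "no")),
                 rating = ordered(c("low", "high", "medium"), levels = c("low", "medium", "high")))
daisy(df, metric = "gower")    # combines numeric, nominal and ordinal columns into one dissimilarity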

Combinatorial algorithms

Suppose the number of clusters K is fixed (K < N), and write C(i) = k if observation i is assigned to cluster k. The total dissimilarity decomposes as
$$
T = \frac{1}{2}\sum_{i=1}^{N}\sum_{i'=1}^{N} D_{ii'}
  = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\left(\sum_{C(i')=k} D_{ii'} + \sum_{C(i')\ne k} D_{ii'}\right)
  = W(C) + B(C),
$$
where W(C) is a measure of within-cluster dissimilarity and B(C) is a measure of between-cluster dissimilarity. T is fixed given the data, so minimizing W(C) is the same as maximizing B(C).
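For squared-Euclidean dissimilarities, an analogous decomposition of the total sum of squares into within- and between-cluster parts can be read directly off a kmeans fit; a quick illustration using the built-in iris data as a stand-in:

km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)
km$totss                                              # total sum of squares (plays the role of T)
km$tot.withinss                                       # within-cluster part, W(C)
km$betweenss                                          # between-cluster part, B(C)
all.equal(km$totss, km$tot.withinss + km$betweenss)   # total = within + between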

K-means clustering (§14.3.6)

Most algorithms use a greedy approach, modifying a given clustering to decrease the within-cluster distance: analogous to forward selection in regression. K-means clustering is (usually) based on squared Euclidean distance, $D_{ii'} = \|x_i - x_{i'}\|^2$, so the x's should be centered and scaled (and continuous). Use the result
$$
\frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(i')=k} \|x_i - x_{i'}\|^2
  = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2,
$$
where $N_k$ is the number of observations in cluster k and $\bar{x}_k = (\bar{x}_{1k}, \ldots, \bar{x}_{pk})$ is the mean of the kth cluster. The algorithm starts with a current set of clusters and computes the cluster means; then it assigns each observation to the cluster whose mean is closest, recomputes the cluster means, and continues.
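A minimal sketch of that iteration (Lloyd's algorithm) with illustrative names; in practice one calls kmeans() directly, and this sketch ignores empty clusters and convergence checks:

simple_kmeans <- function(x, K, iters = 20) {
  x <- scale(x)                                       # center and scale, as recommended above
  centers <- x[sample(nrow(x), K), , drop = FALSE]    # start from K randomly chosen observations
  for (it in 1:iters) {
    d  <- as.matrix(dist(rbind(centers, x)))[-(1:K), 1:K]         # distance from each point to each center
    cl <- max.col(-d)                                             # assign each point to its closest center
    centers <- apply(x, 2, function(col) tapply(col, cl, mean))   # recompute the cluster means
  }
  list(cluster = cl, centers = centers)
}
simple_kmeans(iris[, 1:4], K = 3)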

Constructing dissimilarity matrices

dist(x, method = c("euclidean", "maximum", "manhattan", "canberra", "binary"))
where maximum is $\max_{1 \le j \le p} |x_{ij} - x_{i'j}|$ and binary is the Jaccard coefficient.

daisy(x, metric = c("euclidean", "manhattan", "gower"), stand = FALSE, type = list(...))
where type indicates which columns are "ordratio", "logratio", "asymm" or "symm" (see the help files).

Hierarchical clustering (§14.3.12)

No specification of the number of clusters is needed. Top-down = divisive; bottom-up = agglomerative.
Bottom-up: each value starts as its own cluster; cluster the closest pair of points, then iterate: find the closest pair of clusters $C_1$ and $C_2$ and merge them. This needs a measure of distance between points and between clusters (the clusters needn't be vectors).
- single link clustering measures the distance between clusters by the minimum distance, $d(C_1, C_2) = \min_{i \in C_1, i' \in C_2} D_{ii'}$; susceptible to "chaining" (long strings of points assigned to the same cluster) and sensitive to outliers
- complete linkage: $d(C_1, C_2) = \max_{i \in C_1, i' \in C_2} D_{ii'}$
- group average: intermediate between complete and single linkage (the three linkages are compared in the sketch below)
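A minimal sketch of agglomerative clustering in R with the three linkages above, using the built-in iris measurements as stand-in data:

d <- dist(scale(iris[, 1:4]))
hc_single   <- hclust(d, method = "single")      # prone to chaining
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")     # group average
cutree(hc_complete, k = 3)                       # cut the tree to recover 3 clusters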

... hierarchical clustering

The result is easily pictured in a dendrogram (Figs 14.12 and 14.13); the dendrogram looks quite different for different linkages. Implemented in R in hclust and agnes.
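To see how much the linkage changes the picture, the three fits from the sketch above can be drawn side by side:

par(mfrow = c(1, 3))
plot(hc_single,   main = "single linkage")
plot(hc_complete, main = "complete linkage")
plot(hc_average,  main = "average linkage")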