Data Exploration and Unsupervised Learning with Clustering


Data Exploration and Unsupervised Learning with Clustering. Paul F. Rodriguez, PhD, San Diego Supercomputer Center, Predictive Analytics Center of Excellence.

Clustering Idea. Given a set of data, can we find a natural grouping?
Essential R commands:
d  = rnorm(12, 0, 1)                 # generate 12 random normal values
X1 = matrix(d, 6, 2)                 # put into a 6x2 matrix
X1[,1] = X1[,1] + 4                  # shift the center
X1[,2] = X1[,2] + 2
X2 = matrix(rnorm(12, 0, 1), 6, 2)   # repeat for another set of points
plot(rbind(X1, X2), xlim=c(-10,10), ylim=c(-10,10))   # bind the data points and plot

Why Clustering? A good grouping implies some structure. In other words, given a good grouping, we can then:
- Interpret and label clusters
- Identify important features
- Characterize new points by the closest cluster (or nearest neighbors)
- Use the cluster assignments as a compression or summary of the data

Clustering Objective. Objective: find subsets that are similar within a cluster and dissimilar between clusters. Similarity is defined by distance measures (a sketch of all three in R follows below):
- Euclidean distance
- Manhattan distance
- Mahalanobis distance (Euclidean with dimensions rescaled by variance)
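A minimal sketch of the three measures in R (my addition, not from the slide; X1 is the matrix generated on the first slide, and mahalanobis() returns squared distances):
dist(X1, method = "euclidean")                          # pairwise Euclidean distances
dist(X1, method = "manhattan")                          # pairwise Manhattan distances
mahalanobis(X1, center = colMeans(X1), cov = cov(X1))   # each row's (squared) distance from the center,
                                                        # with dimensions rescaled by the covariance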

Kmeans Clustering. A simple, effective, and standard method (a minimal sketch of the loop appears below):
- Start with K initial cluster centers
- Loop: assign each data point to the nearest cluster center; recalculate each cluster's mean as its new center
- Stop when assignments don't change
Issues: How to choose K? How to choose the initial centers? Will it always stop?
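A rough sketch of the loop just described (my own illustration with hypothetical names, not the slide's code; in practice use kmeans()):
simple_kmeans <- function(X, K, max_iter = 10) {
  centers <- X[sample(nrow(X), K), , drop = FALSE]        # K random data points as initial centers
  cl <- rep(0, nrow(X))                                   # current cluster assignments
  for (iter in 1:max_iter) {
    # Euclidean distance from every point to every center; assign each point to the nearest
    d <- as.matrix(dist(rbind(centers, X)))[-(1:K), 1:K, drop = FALSE]
    new_cl <- apply(d, 1, which.min)
    if (all(new_cl == cl)) break                          # stop when assignments don't change
    cl <- new_cl
    for (k in 1:K)                                        # recompute each center as its cluster mean
      if (any(cl == k)) centers[k, ] <- colMeans(X[cl == k, , drop = FALSE])
  }
  list(cluster = cl, centers = centers)
}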

Kmeans Example. For K=1, using Euclidean distance, where will the cluster center be? (Scatter plot of the two point sets.)

Kmeans Example. For K=1, the overall mean minimizes the Sum of Squared Errors (SSE), i.e., the total squared Euclidean distance.
Essential R commands:
Kresult = kmeans(X, 1, 10, 1)   # choose 1 data point as the initial K centers
                                # 10 is the max loop iterations
                                # final 1 is the number of initial sets to try
# Kresult is an R object with subfields:
Kresult$cluster                 # cluster assignments
Kresult$tot.withinss            # total within-cluster SSE

Kmeans Example (panels for K=1, K=2, K=3, K=4).
Essential R commands:
inds = which(Kresult$cluster == k)   # points assigned to cluster k
plot(X[inds,], col = "red")
As K increases, individual points get their own cluster.

Choosing K for Kmeans.
Essential R commands:
sse = rep(0, 10)
for (num_k in 1:10) {
  Kres = kmeans(X, num_k, 10, 1)
  sse[num_k] = Kres$tot.withinss   # save, then plot
}
plot(1:10, sse, type = "b")        # total within-cluster SSE for K = 1 to 10
Not much improvement after K=2 (the "elbow").

Kmeans Example with more points. How many clusters should there be?

Choosing K for Kmeans. (Plot: total within-cluster SSE for K = 1 to 10.)
- Smooth decrease for K >= 2, harder to choose
- In general, a smoother decrease => less structure

Kmeans Guidelines.
Choosing K:
- Elbow in total within-cluster SSE as K = 1...N
- Cross-validation: hold out points, compare fit as K = 1...N
Choosing initial starting points: take K random data points, do several Kmeans runs, take the best fit (see the nstart example below).
Stopping: may converge to sub-optimal clusters; may get stuck or have slow convergence (point assignments bounce around); 10 iterations is often good.
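For the "several Kmeans, take the best fit" guideline, kmeans() already supports this via nstart (a small illustrative call, assuming the data matrix X and K=3):
Kresult <- kmeans(X, centers = 3, iter.max = 10, nstart = 5)   # 5 random initial center sets, best kept
Kresult$tot.withinss                                           # total within-cluster SSE of the best run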

Kmeans Example: uniform data (panels for K=1, K=2, K=3, K=4).

Choosing K - uniform data. (Plot: total within-cluster SSE for K = 1 to 10.) Smooth decrease across K => less structure.

Kmeans Clustering Issues.
- Scale: dimensions with large numbers may dominate distance metrics
- Outliers: outliers can pull the cluster mean; K-medoids uses medoids (median-like representative points) instead of means
(A short sketch of both fixes follows below.)
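A short sketch of both fixes (my addition, assuming X is a numeric matrix; pam() is the K-medoids implementation in the 'cluster' package):
Xs <- scale(X)                # z-score each dimension so no single column dominates the distances
Kresult <- kmeans(Xs, 3, 10, 1)
library(cluster)
pam_fit <- pam(Xs, k = 3)     # K-medoids: centers are actual data points, less sensitive to outliers
pam_fit$medoids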

Soft Clustering Methods.
- Fuzzy clustering: use weighted assignments to all clusters; find the minimum weighted SSE (a small illustration follows below)
- Expectation-Maximization: mixture of multivariate Gaussian distributions; find the cluster means and variances that maximize the likelihood
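As a small illustration of fuzzy clustering (my addition, not the slide's), the 'cluster' package provides fanny():
library(cluster)
ff <- fanny(X, k = 2)    # assuming X is the numeric data matrix used above
head(ff$membership)      # weighted (soft) assignment of each point to each cluster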

Kmeans unequal cluster variance

Choosing K - unequal distributions. (Plot: total within-cluster SSE for K = 1 to 10.) Smooth decrease across K => less structure.

EM Clustering. Selects K=2 (by Bayesian Information Criterion); handles unequal variance.
R:
library(mclust)
em_fit = Mclust(X)
plot(em_fit)
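A possible follow-up, not shown on the slide, to inspect the fit (field names as provided by the mclust package):
summary(em_fit)          # chosen model and number of components (by BIC)
em_fit$classification    # hard cluster assignments
em_fit$BIC               # BIC values over the candidate models and K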

Kmeans Computations.
- Distance of each point to each cluster center: for N points and D dimensions, each loop iteration requires N*D*K operations
- Updating cluster centers: only track the points that change assignment, and compute the resulting change in each cluster center
- On HPC: distance calculations can be partitioned across data dimensions

R Kmeans Performance. 1 Gordon compute node, normal random matrices, timed with R's system.time(kmeans(...)). (Plot: wall time in seconds vs. number of dimensions, i.e. columns in the data matrix, from 1K to 32K, with separate curves for 1000, 2000, 4000, and 8000 points, i.e. rows; the largest runs take roughly 30 to 60 minutes.)

Kmeans vs. EM Performance. 1 Gordon compute node, normal random matrices, timed with R's system.time(mclust(...)). (Plot: wall time in seconds vs. number of dimensions from 1K to 8K, with curves for EM at 1000 points and Kmeans at 1000 points; times on the order of 10 to 15 minutes appear for EM.)

Kmeans Big Data Example. 45,000 NYTimes articles, 102,000 unique words (UCI Machine Learning Repository). Full data matrix: 45K x 102K ~ 40 GB. Cell (i,j) is the count of the i-th word in the j-th article. (Slide shows a sketch of the mostly-zero word-by-article count matrix, columns article 1 through article 45K.)
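A quick back-of-the-envelope check of the ~40 GB figure (my addition), assuming dense double-precision (8-byte) storage:
45000 * 102000 * 8 / 1e9   # rows x columns x 8 bytes ~ 37 GB, consistent with the ~40 GB above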

Matlab Original Script. Distance calculation to each cluster center, with the articles stored as X (N x P) and Cluster_Means (M x P):
1. Subtract: take the difference column by column
2. Square and sum across columns
Works better for large N and small P.

Matlab Script Altered.
1. Subtract: take the difference row by row (X is N x P, Cluster_Means is M x P)
2. Use the dot product
Works better for large P, and dot() will use threads.
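The same dot-product trick can be sketched in R (hypothetical helper name, my addition): the squared distance ||x - m||^2 expands to ||x||^2 - 2*x.m + ||m||^2, so all point-to-center distances reduce to one matrix product.
# Squared Euclidean distances from every row of X (N x P) to every row of M (K x P).
sq_dist_to_centers <- function(X, M) {
  xx <- rowSums(X^2)                            # ||x||^2 for each point
  mm <- rowSums(M^2)                            # ||m||^2 for each center
  outer(xx, mm, "+") - 2 * tcrossprod(X, M)     # tcrossprod(X, M) = X %*% t(M)
}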

Kmeans Matlab Runtime. Matlab Kmeans (original): ~50 hours. Matlab Kmeans (distributed): ~10 hours on 8 threads.

Kmeans Results. Cluster means shown as words, with the cluster-mean coordinates determining font size; 7 viable clusters found.

Incremental & Hierarchical Clustering.
- Start with 1 cluster (all instances) and do splits, OR start with N clusters (1 per instance) and do merges
- Can be greedy and expensive in its search; some algorithms might both merge and split
- Algorithms need to store and recalculate distances
- Needs a distance between groups, in contrast to K-means

Incremental & Hierarchical Clustering. The result is a hierarchy of clusters displayed as a dendrogram (tree). Useful for tree-like interpretations:
- syntax (e.g. word co-occurrences)
- concepts (e.g. classification of animals)
- topics (e.g. sorting Enron emails)
- spatial data (e.g. city distances)
- genetic expression (e.g. possible biological networks)
- exploratory analysis

Incremental & Hierarchical Clustering. Clusters are merged/split according to a distance or utility measure:
- Euclidean distance (squared differences)
- conditional probabilities (for nominal features)
Options for choosing which clusters to link:
- single linkage, mean, average (w.r.t. the points in the clusters); these may lead to different trees, depending on the spreads
- Ward's method (smallest increase in within-cluster variance)
- change in the probability of features for the given clusters

Linkage Options.
- e.g. single linkage (closest to any cluster instance)
- e.g. mean (closest to the mean of all cluster instances)
(Diagrams show two clusters, Cluster1 and Cluster2.)

Linkage Options (cont.).
- e.g. average (mean of the pairwise distances)
- e.g. Ward's method (find the new cluster with minimum variance)
(Diagrams show two clusters, Cluster1 and Cluster2. A small comparison in R follows below.)
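A small comparison of the linkage options in hclust (my sketch; d2use is the distance matrix computed in the demo that follows):
plot(hclust(d2use, method = "single"))     # single linkage: prone to chaining
plot(hclust(d2use, method = "average"))    # average of pairwise distances
plot(hclust(d2use, method = "ward.D2"))    # Ward: smallest increase in within-cluster variance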

Hierarchical Clustering Demo. 3888 interactions among 685 proteins, from the Hu et al. TAP dataset (http://www.compsysbio.org/bacteriome/dataset/). The file lists protein pairs and an interaction score, e.g.:
b0009 b0014 0.92
b0009 b2231 0.87
b0014 b0169 1.0
b0014 b0595 0.76
b0014 b2614 1.0
b0014 b3339 0.95
b0014 b3636 0.9
b0015 b0014 0.99
...

Hierarchical Clustering Demo.
Essential R commands:
> d = read.table("hu_tap_ppi.txt")   # note: strings are read as factors
> str(d)                             # show d's structure
'data.frame': 3888 obs. of 3 variables:
 $ V1: Factor w/ 685 levels "b0009","b0014",..: 1 1 2 2 2 2 2 3 3 3 ...
 $ V2: Factor w/ 536 levels "b0011","b0014",..: 2 248 28 66 297 396 ...
 $ V3: num 0.92 0.87 1 0.76 1 0.95 0.9 0.99 0.99 0.93 ...
> fs = c(d[,1], d[,2])               # combine factor levels
> str(fs)
 int [1:7776] 1 1 2 2 2 2 2 3 3 3 ...

Hierarchical Clustering Demo.
Essential R commands:
P = 685                               # number of proteins
N = nrow(d)                           # 3888 interactions
C = matrix(0, P, P)                   # connection matrix (aka adjacency matrix)
IJ = cbind(d[,1], d[,2])              # factor levels saved as an Nx2 list of i-th, j-th protein
for (i in 1:N) { C[IJ[i,1], IJ[i,2]] = 1 }   # populate C with 1 for connections
install.packages('igraph')
library('igraph')
gc = graph.adjacency(C, mode = "directed")
plot(gc, vertex.size = 3, edge.arrow.size = 0, vertex.label = NA)   # or just plot(gc)

Hierarchical Clustering Demo. hclust with single linkage: chaining.
d2use = dist(C, method = "binary")
fit <- hclust(d2use, method = "single")
plot(fit)
(In the dendrogram, the height is the cluster distance at which two clusters are combined; the items that cluster first sit at the bottom.)

Hierarchical Clustering Demo. hclust with Ward distance: spherical clusters.

Hierarchical Clustering Demo. Where the height change looks big, cut off the tree:
groups <- cutree(fit, k = 7)
rect.hclust(fit, k = 7, border = "red")

Hierarchical Clustering Demo. Kmeans vs. hierarchical: lots of overlap, even though kmeans() does not have a binary distance option.
groups <- cutree(fit, k = 7)
Kresult = kmeans(d2use, 7, 10, 1)
table(Kresult$cluster, groups)
Kmeans cluster assignment (rows) vs. hierarchical group assignment (columns 1-7):
1:  26   0   0  293   0  19  23
2:   2   1   0    9   0   0  46
3:  72   0   0   27   0   2   1

Dimensionality Reduction via Principal Components. Idea: given N points and P features (aka dimensions), can we represent the data with fewer features?
- Yes, if features are constant
- Yes, if features are redundant
- Yes, if features only contribute noise
(Conversely, we want to keep the features that contribute to the variation in the data.)

PCA: Dimensionality Reduction via Principal Components. Find a set of vectors (aka factors) that describe the data in an alternative way:
- The first component is the vector that maximizes the variance of the data projected onto that vector
- The k-th component is orthogonal to all k-1 previous components
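In symbols (a standard formulation, not verbatim from the slides), for mean-centered data X:
w_1 = \arg\max_{\|w\|=1} \operatorname{Var}(Xw), \qquad
w_k = \arg\max_{\|w\|=1,\; w \perp w_1,\ldots,w_{k-1}} \operatorname{Var}(Xw)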

PCA on 2012 Olympic Athletes. Height-by-weight scatter plot (height in cm, weight in kg, both mean-centered), with PC1 and PC2 drawn and the projection of the point (145, 5) onto the PCs. Total variance is conserved: Var(Weight) + Var(Height) = Var(PC1) + Var(PC2). In general, Var(PC1) > Var(PC2) > Var(PC3).

PCA on the Height-by-Weight Scatter Plot.
Essential R:
X = scale(X, center = TRUE, scale = FALSE)   # mean-center the data
S <- svd(X)                                  # returns matrices U, D (singular values), V
                                             # X = U %*% diag(S$d) %*% t(S$v); columns of V are the 2D PC directions
# plot each column of V as a line in X's space (use 7th-grade geometry): points(..., type = "l")
# coordinates of each point's projection onto the 1st PC, in X's space; repeat for the 2nd and plot
S$d[1] * S$u[,1] %*% t(S$v[,1])
# PCA can also be computed with eigen() on the covariance matrix

Principal Components.
- Can choose k heuristically as the approximation improves, or choose k so that ~95% of the data variance is accounted for (sketch below)
- Closely related to the Singular Value Decomposition: eigendecomposition-based PCA applies only to square (covariance) matrices, while SVD gives the same vectors and also works on rectangular matrices
- Works for numeric data only; in contrast, clustering reduces the data to categorical groups
- In some cases, k PCs correspond roughly to k clusters
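A hedged sketch of the 95%-variance rule using prcomp(), which is equivalent to the SVD approach above (X assumed numeric):
pr <- prcomp(X, center = TRUE, scale. = FALSE)
var_explained <- cumsum(pr$sdev^2) / sum(pr$sdev^2)
k <- which(var_explained >= 0.95)[1]   # smallest k accounting for 95% of the variance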

Summary. Having no labels doesn't stop you from finding structure in data. Unsupervised methods are all somewhat related to one another.

End