
Machine Learning - MT 2016
15. Clustering

Varun Kanade
University of Oxford
November 28, 2016

Announcements

- No new practical this week
- All practicals must be signed off in sessions this week
- Firm deadline: reports handed in at CS reception by Friday noon
- Revision class for M.Sc. + D.Phil. on Thursday of Week 9 (2pm & 3pm); work through the ML HT2016 exam (Problem 3 is optional)

Outline

This week, we will study some approaches to clustering:
- Defining an objective function for clustering
- k-means formulation for clustering
- Multidimensional scaling
- Hierarchical clustering
- Spectral clustering

England pushed towards Test defeat by India
France election: Socialists scramble to avoid split after Fillon win
Giants Add to the Winless Browns' Misery
Strictly Come Dancing: Ed Balls leaves programme
Trump Claims, With No Evidence, That Millions of People Voted Illegally
Vive La Binoche, the reigning queen of French cinema

The same six headlines can be clustered by topic (Sports, Politics, Sports, Film & TV, Politics, Film & TV) or by country (England, France, USA, England, USA, France).

Clustering

Often data can be grouped together into subsets that are coherent. However, this grouping may be subjective, and it is hard to define a general framework.

Two types of clustering algorithms:
1. Feature-based - points are represented as vectors in $\mathbb{R}^D$
2. (Dis)similarity-based - only pairwise (dis)similarities are known

Two types of clustering methods:
1. Flat - partition the data into k clusters
2. Hierarchical - organise data as clusters, clusters of clusters, and so on

Defining Dissimilarity

Weighted dissimilarity between (real-valued) attributes:

$$d(x, x') = f\left( \sum_{i=1}^{D} w_i \, d_i(x_i, x'_i) \right)$$

In the simplest setting, $w_i = 1$, $d_i(x_i, x'_i) = (x_i - x'_i)^2$ and $f(z) = z$, which corresponds to the squared Euclidean distance.

- Weights allow us to emphasise features differently
- If features are ordinal or categorical, define a suitable distance for them
- Standardisation (mean 0, variance 1) may or may not help
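A minimal sketch of this weighted dissimilarity for real-valued features; the function name and defaults are illustrative, not from the lecture:

```python
import numpy as np

def weighted_dissimilarity(x, x_prime, w=None, f=lambda z: z):
    """d(x, x') = f(sum_i w_i * d_i(x_i, x'_i)) with squared-difference d_i."""
    x, x_prime = np.asarray(x, float), np.asarray(x_prime, float)
    if w is None:
        w = np.ones_like(x)   # w_i = 1 recovers the squared Euclidean distance
    return f(np.sum(w * (x - x_prime) ** 2))

# With unit weights and f(z) = z this equals ||x - x'||^2
print(weighted_dissimilarity([1.0, 2.0], [0.0, 0.0]))  # 5.0
```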

Helpful Standardisation (figure)

Unhelpful Standardisation (figure)

Partition Based Clustering

We want to partition the data into subsets $C_1, \ldots, C_k$, where $k$ is fixed in advance. Define the quality of a partition by

$$W(C) = \frac{1}{2} \sum_{j=1}^{k} \frac{1}{|C_j|} \sum_{i, i' \in C_j} d(x_i, x_{i'})$$

If we use $d(x, x') = \|x - x'\|^2$, then

$$W(C) = \sum_{j=1}^{k} \sum_{i \in C_j} \|x_i - \mu_j\|^2, \quad \text{where } \mu_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i$$

The objective is to minimise the sum of squared distances to the mean within each cluster.
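The equivalence of the two forms of $W(C)$ can be checked numerically; the sketch below (names illustrative) assumes X is an (N, D) array and clusters is a list of index arrays:

```python
import numpy as np

def within_cluster_scatter_pairwise(X, clusters):
    """W(C) = 1/2 * sum_j (1/|C_j|) * sum_{i,i' in C_j} ||x_i - x_i'||^2."""
    total = 0.0
    for idx in clusters:
        P = X[idx]
        diffs = P[:, None, :] - P[None, :, :]
        total += 0.5 * np.sum(diffs ** 2) / len(idx)
    return total

def within_cluster_scatter_means(X, clusters):
    """Equivalent form: sum_j sum_{i in C_j} ||x_i - mu_j||^2."""
    return sum(np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 2))
clusters = [np.arange(5), np.arange(5, 10)]
print(np.isclose(within_cluster_scatter_pairwise(X, clusters),
                 within_cluster_scatter_means(X, clusters)))  # True
```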

Outline Clustering Objective k-means Formulation of Clustering Multidimensional Scaling Hierarchical Clustering Spectral Clustering

Partition Based Clustering: k-means Objective

Minimise jointly over partitions $C_1, \ldots, C_k$ and means $\mu_1, \ldots, \mu_k$:

$$W(C) = \sum_{j=1}^{k} \sum_{i \in C_j} \|x_i - \mu_j\|^2$$

This problem is NP-hard even for $k = 2$ for points in $\mathbb{R}^D$.

If we fix $\mu_1, \ldots, \mu_k$, finding a partition $(C_j)_{j=1}^{k}$ that minimises $W$ is easy:

$$C_j = \{ i \mid \|x_i - \mu_j\| = \min_{j'} \|x_i - \mu_{j'}\| \}$$

If we fix the clusters $C_1, \ldots, C_k$, minimising $W$ with respect to $(\mu_j)_{j=1}^{k}$ is easy:

$$\mu_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i$$

Iteratively run these two steps: assignment and update.

Ground Truth Clusters vs. k-means Clusters with k = 3 (figure)

The k-means Algorithm

1. Initialise means $\mu_1, \ldots, \mu_k$ randomly.
2. Repeat until convergence:
   a. Assignment: assign each datapoint to the cluster whose mean is closest, to obtain $C_1, \ldots, C_k$:
      $$C_j = \{ i \mid j = \arg\min_{j'} \|x_i - \mu_{j'}\|^2 \}$$
   b. Update: recompute the means using the current cluster assignments:
      $$\mu_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i$$

Note 1: Ties can be broken arbitrarily.
Note 2: Choosing k random datapoints as the initial means is a good idea.
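A minimal Python sketch of this algorithm, assuming the data are in a NumPy array of shape (N, D); an illustrative implementation, not the course code:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Sketch of the k-means algorithm above: alternate assignment and update."""
    rng = np.random.default_rng(seed)
    # Initialise with k random datapoints (Note 2 above)
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster with the closest mean
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each mean becomes the average of its assigned points
        new_mu = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                           else mu[j] for j in range(k)])
        if np.allclose(new_mu, mu):   # converged: the partition no longer changes
            break
        mu = new_mu
    return assign, mu
```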

The k-means Algorithm

Does the algorithm always converge? Yes: the objective

$$W(C) = \sum_{j=1}^{k} \sum_{i \in C_j} \|x_i - \mu_j\|^2$$

decreases every time a new partition is used, and there are only finitely many partitions.

- Convergence may be very slow in the worst case, but is typically fast on real-world instances
- Convergence is usually only to a local minimum, so run the algorithm multiple times with random initialisation
- Other criteria can be used: k-medoids, k-centres, etc.
- Selecting the right k is not easy: plot W against k and identify a "kink"

Ground Truth Clusters vs. k-means Clusters with k = 4 (figure)

Choosing the number of clusters k

(Figure: MSE on test set vs. K for k-means. Source: Kevin Murphy, Chap 11)

As in the case of PCA, larger k will give a better value of the objective. Choose a suitable k by identifying a kink or elbow in the curve.
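As a sketch of this elbow heuristic, one can compute the objective $W(C)$ over a range of k on synthetic data; this reuses the illustrative `kmeans` sketch above, and the data and numbers are made up for illustration:

```python
import numpy as np

# Three well-separated Gaussian blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)) + c for c in ([0, 0], [5, 5], [0, 5])])

for k in range(1, 8):
    assign, mu = kmeans(X, k)                                   # sketch from above
    W = sum(np.sum((X[assign == j] - mu[j]) ** 2) for j in range(k))
    print(k, round(W, 1))
# We expect W to drop sharply up to k = 3 (the true number of blobs here)
# and then flatten out: the "kink" or elbow.
```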

Outline Clustering Objective k-means Formulation of Clustering Multidimensional Scaling Hierarchical Clustering Spectral Clustering

Multidimensional Scaling (MDS)

In certain cases, it may be easier to define (dis)similarity between objects than to embed them in Euclidean space, yet algorithms such as k-means require points to be in Euclidean space.

Ideal setting: suppose that for some N points in $\mathbb{R}^D$ we are given all pairwise Euclidean distances in a matrix $D$. Can we reconstruct $x_1, \ldots, x_N$, i.e., all of $X$?

Multidimensional Scaling

Distances are preserved under translation, rotation, reflection, etc., so we cannot recover $X$ exactly; we can only aim to determine $X$ up to these transformations.

If $D_{ij}$ is the distance between points $x_i$ and $x_j$, then

$$D_{ij}^2 = \|x_i - x_j\|^2 = x_i^T x_i - 2 x_i^T x_j + x_j^T x_j = M_{ii} - 2 M_{ij} + M_{jj}$$

Here $M = X X^T$ is the $N \times N$ matrix of dot products.

Exercise: Show that, assuming $\sum_i x_i = 0$, $M$ can be recovered from $D$.

Multidimensional Scaling

Consider the (full) SVD $X = U \Sigma V^T$. We can write $M$ as

$$M = X X^T = U \Sigma \Sigma^T U^T$$

Starting from $M$, we can reconstruct $X$ using the eigendecomposition of $M$:

$$M = U \Lambda U^T$$

Because $M$ is symmetric and positive semi-definite, $U^T = U^{-1}$ and all entries of the (diagonal) matrix $\Lambda$ are non-negative. Let $X = U \Lambda^{1/2}$; then $X X^T = U \Lambda U^T = M$, as required.

If we are satisfied with an approximate reconstruction, we can use a truncated eigendecomposition.
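A short sketch of this classical MDS procedure, assuming the input is a matrix of exact Euclidean distances; the double-centring step is one way of carrying out the exercise above:

```python
import numpy as np

def classical_mds(D, dim=2):
    """Recover an embedding from a matrix D of pairwise Euclidean distances."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N          # centring matrix
    M = -0.5 * J @ (D ** 2) @ J                  # Gram matrix M = X X^T of centred points
    lam, U = np.linalg.eigh(M)                   # eigenvalues in ascending order
    lam, U = lam[::-1][:dim], U[:, ::-1][:, :dim]
    return U * np.sqrt(np.clip(lam, 0, None))    # truncated X = U Lambda^{1/2}

# Sanity check: distances of the recovered points match the originals
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
X_hat = classical_mds(D, dim=2)
D_hat = np.linalg.norm(X_hat[:, None] - X_hat[None, :], axis=2)
print(np.allclose(D, D_hat))  # True (recovery up to rotation/reflection/translation)
```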

Multidimensional Scaling: Additional Comments

In general, if you define (dis)similarities on objects such as text documents, genetic sequences, etc., we cannot be sure that the resulting similarity matrix $M$ will be positive semi-definite, or that the dissimilarity matrix $D$ is a valid squared Euclidean distance matrix.

In such cases, we cannot always find a Euclidean embedding that recovers the (dis)similarities exactly. Instead, minimise a stress function: find $z_1, \ldots, z_N$ that minimise

$$S(Z) = \sum_{i \neq j} \left( D_{ij} - \|z_i - z_j\| \right)^2$$

Several other types of stress functions can be used.

Multidimensional Scaling: Summary

- In certain applications, it may be easier to define pairwise similarities or distances than to construct a Euclidean embedding of discrete objects, e.g., genetic data, text data, etc.
- Many machine learning algorithms require (or are more naturally expressed with) data in some Euclidean space
- Multidimensional scaling gives a way to find an embedding of the data in Euclidean space that (approximately) respects the original distance/similarity values

Outline Clustering Objective k-means Formulation of Clustering Multidimensional Scaling Hierarchical Clustering Spectral Clustering

Hierarchical Clustering

Hierarchically structured data exists all around us:
- Measurements of different species, and of individuals within species
- Top-level and low-level categories in news articles
- Country, county and town level data

Two algorithmic strategies for hierarchical clustering:
- Agglomerative: bottom-up; clusters are formed by merging smaller clusters
- Divisive: top-down; clusters are formed by splitting larger clusters

The result can be visualised as a dendrogram (tree).

Measuring Dissimilarity at Cluster Level

To find hierarchical clusters, we need to define dissimilarity at the cluster level, not just between datapoints. Suppose we have a dissimilarity at the datapoint level, e.g., $d(x, x') = \|x - x'\|$. There are different ways to define the dissimilarity between clusters $C$ and $C'$:

Single linkage: $D(C, C') = \min_{x \in C,\, x' \in C'} d(x, x')$

Complete linkage: $D(C, C') = \max_{x \in C,\, x' \in C'} d(x, x')$

Average linkage: $D(C, C') = \frac{1}{|C|\,|C'|} \sum_{x \in C,\, x' \in C'} d(x, x')$

Linkage-based Clustering Algorithm

1. Initialise clusters as singletons: $C_i = \{i\}$
2. Initialise the set of clusters available for merging: $S = \{1, \ldots, N\}$
3. Repeat:
   a. Pick the two most similar clusters: $(j, k) = \arg\min_{j, k \in S} D(C_j, C_k)$
   b. Let $C_l = C_j \cup C_k$
   c. If $C_l = \{1, \ldots, N\}$, break
   d. Set $S = (S \setminus \{j, k\}) \cup \{l\}$
   e. Update $D(C_i, C_l)$ for all $i \in S$ (using the desired linkage property)
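A compact (and deliberately inefficient) sketch of this agglomerative procedure, with single linkage by default; the names are illustrative:

```python
import numpy as np

def agglomerative(X, linkage=min):
    """Sketch of the linkage-based algorithm above (single linkage by default).

    Returns the sequence of merges; this merge history is exactly what a
    dendrogram visualises, and cutting it at some level gives a partition.
    """
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)     # pairwise distances
    clusters = {i: [i] for i in range(len(X))}                    # 1. singletons
    merges = []
    while len(clusters) > 1:
        # a. pick the two most similar clusters under the chosen linkage
        j, k = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda p: linkage(d[i1, i2]
                                         for i1 in clusters[p[0]]
                                         for i2 in clusters[p[1]]))
        merges.append((clusters[j], clusters[k]))                 # record the merge
        clusters[j] = clusters[j] + clusters[k]                   # b. C_l = C_j ∪ C_k
        del clusters[k]                                           # d. shrink S
    return merges
```

Passing `max` gives complete linkage; for average linkage one could pass, e.g., `lambda ds: float(np.mean(list(ds)))`.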

Hierarchical Clustering: Dendrograms

The output of a hierarchical clustering algorithm is typically represented as a dendrogram: a binary tree in which each internal node represents the merging of two clusters, and the height of a node represents the dissimilarity between the clusters it merges. Cutting the dendrogram at some level gives a partition of the data.

Outline Clustering Objective k-means Formulation of Clustering Multidimensional Scaling Hierarchical Clustering Spectral Clustering

Spectral Clustering (figure)

Spectral Clustering: Limitations of k-means (figure)

Limitations of k-means

- k-means will typically form clusters that are spherical, elliptical, or convex
- Kernel PCA followed by k-means can result in better clusters
- Spectral clustering is a (related) alternative that often works better

Spectral Clustering

- Construct a graph from the data, with one node for every point in the dataset
- Use a similarity measure, e.g., $s_{ij} = \exp(-\|x_i - x_j\|^2 / \sigma)$
- Construct the mutual K-nearest-neighbour graph, i.e., (i, j) is an edge if either i is among the K nearest neighbours of j or vice versa
- The weight of edge (i, j), if it exists, is $s_{ij}$
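A sketch of this graph construction, following the either/or neighbour rule stated above; the function name and the defaults for K and sigma are illustrative:

```python
import numpy as np

def similarity_graph(X, K=10, sigma=1.0):
    """Gaussian similarities restricted to a K-nearest-neighbour graph.

    Assumes more than K points; (i, j) is kept if i is among the K nearest
    neighbours of j or vice versa, as on the slide.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)   # squared distances
    S = np.exp(-sq / sigma)                                     # s_ij
    np.fill_diagonal(sq, np.inf)                                # exclude self-neighbours
    knn = np.argsort(sq, axis=1)[:, :K]                         # K nearest neighbours of each node
    A = np.zeros_like(S, dtype=bool)
    A[np.repeat(np.arange(len(X)), K), knn.ravel()] = True      # i -> its neighbours
    A = A | A.T                                                 # symmetrise ("or vice versa")
    W = np.where(A, S, 0.0)
    np.fill_diagonal(W, 0.0)                                    # no self-loops
    return W
```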

Spectral Clustering 30

Spectral Clustering

- Use graph partitioning algorithms to cluster
- Mincut can give bad cuts (e.g., only one node on one side of the cut)
- Multi-way cuts and balanced cuts are typically NP-hard to compute
- Relaxations of these problems lead to eigenvectors of the graph Laplacian

Let $W$ be the weighted adjacency matrix and $D$ the (diagonal) degree matrix with $D_{ii} = \sum_j W_{ij}$. Then:

Laplacian: $L = D - W$
Normalised Laplacian: $L = I - D^{-1} W$
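Both Laplacians can be formed directly from $W$; a minimal sketch, assuming every node has positive degree:

```python
import numpy as np

def laplacians(W):
    """Unnormalised and normalised graph Laplacians as defined above."""
    deg = W.sum(axis=1)                            # D_ii = sum_j W_ij
    L = np.diag(deg) - W                           # L = D - W
    L_norm = np.eye(len(W)) - W / deg[:, None]     # I - D^{-1} W (degrees must be > 0)
    return L, L_norm
```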

Spectral Clustering: Simple Example

Consider a graph on six nodes consisting of two triangles, {1, 2, 3} and {4, 5, 6}, where all edge weights are 1 (and 0 for missing edges). The weighted adjacency matrix, the degree matrix and the Laplacian are

$$W = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 & 0 \end{bmatrix}, \quad D = \operatorname{diag}(2, 2, 2, 2, 2, 2), \quad L = D - W = \begin{bmatrix} 2 & -1 & -1 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 & 0 \\ -1 & -1 & 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 2 & -1 & -1 \\ 0 & 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & -1 & 2 \end{bmatrix}$$

Spectral Clustering: Simple Example

Let us consider some eigenvectors of the Laplacian $L$ of the two-triangle graph above (all edge weights 1):

- $v_1 = [1, 1, 1, 1, 1, 1]^T$ is an eigenvector with eigenvalue 0
- $v_2 = [1, 1, 1, -1, -1, -1]^T$ is also an eigenvector with eigenvalue 0
- $\alpha_1 v_1 + \alpha_2 v_2$ for any $\alpha_1, \alpha_2$ is also an eigenvector with eigenvalue 0

We can use the matrix $[v_1\ v_2]$ as an $N \times 2$ feature matrix and perform k-means on its rows.
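A quick numerical check of this example (a sketch; node indices are 0-based in the code):

```python
import numpy as np

# The two-triangle example above: nodes {1,2,3} and {4,5,6}, all edge weights 1
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

print(np.round(np.linalg.eigvalsh(L), 6))   # two zero eigenvalues, one per component

# Both eigenvectors from the slide lie in the eigenvalue-0 eigenspace
v1 = np.ones(6)
v2 = np.array([1, 1, 1, -1, -1, -1], dtype=float)
print(np.allclose(L @ v1, 0), np.allclose(L @ v2, 0))   # True True
```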

Spectral Clustering: Simple Example (figure of the two-triangle graph)

Spectral Clustering: Simple Example

Now perturb the example slightly: the edge weights within the triangles become 1.1, 1 and 0.9, and two weak edges are added between the triangles (between nodes 2 and 4 with weight 0.1, and between nodes 3 and 5 with weight 0.2). The Laplacian becomes

$$L = D - W = \begin{bmatrix} 2 & -1.1 & -0.9 & 0 & 0 & 0 \\ -1.1 & 2.2 & -1 & -0.1 & 0 & 0 \\ -0.9 & -1 & 2.1 & 0 & -0.2 & 0 \\ 0 & -0.1 & 0 & 2.1 & -1.1 & -0.9 \\ 0 & 0 & -0.2 & -1.1 & 2.3 & -1 \\ 0 & 0 & 0 & -0.9 & -1 & 1.9 \end{bmatrix}$$

- When the weights are slightly perturbed, $v_1 = [1, \ldots, 1]^T$ is still an eigenvector with eigenvalue 0
- We can't compute the second eigenvector $v_2$ by hand
- Nevertheless, we expect that the eigenspace corresponding to similar (small) eigenvalues is relatively stable
- We can still use the matrix $[v_1\ v_2]$ as the $N \times 2$ feature matrix and perform k-means

Spectral Clustering: Simple Example (figure of the perturbed graph)

Spectral Clustering Algorithm

Input: a weighted graph with weighted adjacency matrix $W$

1. Construct the Laplacian $L = D - W$
2. Find the eigenvectors $v_1 = \mathbf{1}, v_2, \ldots, v_{l+1}$ corresponding to the $l+1$ smallest eigenvalues of $L$
3. Construct the $N \times l$ feature matrix $V_l = [v_2, \ldots, v_{l+1}]$
4. Apply a clustering algorithm using the rows of $V_l$ as features, e.g., k-means

Note: If the degrees of the nodes are not balanced, using the normalised Laplacian $L = I - D^{-1} W$ may be a better idea.
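Putting the pieces together, a sketch of this algorithm; it reuses the illustrative `kmeans` sketch from earlier, and any k-means routine would do:

```python
import numpy as np

def spectral_clustering(W, k, l=None):
    """Sketch of the algorithm above: bottom Laplacian eigenvectors + k-means."""
    l = k if l is None else l
    deg = W.sum(axis=1)
    L = np.diag(deg) - W                       # step 1: unnormalised Laplacian
    eigenvalues, U = np.linalg.eigh(L)         # step 2: eigenvectors, smallest eigenvalues first
    V = U[:, 1:l + 1]                          # step 3: drop v_1 = 1, keep v_2 .. v_{l+1}
    assignments, _ = kmeans(V, k)              # step 4: cluster the rows of V_l
    return assignments
```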

Spectral Clustering (figure)

Summary: Clustering

- Clustering is grouping together similar data in a larger collection of heterogeneous data
- The definition of good clusters is often user-dependent
- Clustering algorithms in feature space, e.g., k-means
- Clustering algorithms that only use (dis)similarities: k-medoids, hierarchical clustering
- Spectral clustering when clusters may be non-convex