MLCC Clustering. Lorenzo Rosasco UNIGE-MIT-IIT

Size: px

Start display at page:

Download "MLCC Clustering. Lorenzo Rosasco UNIGE-MIT-IIT"

Bridget Gilbert
5 years ago
Views:

1 MLCC Clustering Lorenzo Rosasco UNIGE-MIT-IIT

2 About this class We will consider an unsupervised setting, and in particular the problem of clustering unlabeled data into coherent groups. MLCC

3 supervised learning Learning with a teacher Data set S = {(x 1, y 1 ),..., (x n, y n )} with x i R d and y i R ˆX = (x1,..., x n ) R n d and ŷ = (y 1,..., y n ). MLCC

4 supervised learning Learning with a teacher Data set S = {(x 1, y 1 ),..., (x n, y n )} with x i R d and y i R ˆX = (x1,..., x n ) R n d and ŷ = (y 1,..., y n ). MLCC

5 Unsupervised learning Learning without a teacher Data set S = {x 1,..., x n } with x i R d MLCC

6 Unsupervised learning Learning without a teacher Data set S = {x 1,..., x n } with x i R d MLCC

7 Unsupervised learning problems Dimensionality reduction Clustering Density estimation Learning association rules Learning adaptive data representations... MLCC

8 Supervised vs unsupervised methods In supervised learning we have a measure of success based on a loss function and on a model selection procedure e.g., cross validation In unsupervised learning we don t! hence many heuristics and the proliferation of many algorithms difficult to evaluate lack of theoretical grounds MLCC

9 Clustering Clustering is a widely used technique for data analysis, with applications ranging from statistics, computer science, biology, social sciences... Goal: Grouping/segmenting a collection of objects into subsets or clusters. (Possibly also) arrange clusters into a natural hierarchy MLCC

10 Clustering examples MLCC

11 Clustering algorithms Combinatorial algorithms - directly from data {x i } n i=1 + some notion of similarity or dissimilarity Mixture models - based on some assumption on the underlying probability distribution MLCC

12 Combinatorial clustering We assume some knowledge on the number of clusters K n. Goal: associate a cluster label k = {1,..., K} with each datum, by defining an encoder C s.t. k = C(x i ) We look for an encoder C that achieves the goal of clustering data, according to some specific requirement of the algorithm and based on data pairs dissimilarities MLCC

13 Combinatorial clustering Criterion: assign to the same cluster similar/close data We may start from the following loss or energy function (within class): W (C) = 1 K d(x i, x i ) 2 C = arg min W (C) Unfeasible in practice! S(N, K) = 1 K! k=1 C(i)=k C(i )=k K ( K ( 1) K k k k=1 ) k n and notice that S(10, 4) 34K while S(19, 4) MLCC

14 K-means algorithm It refers specifically to the Euclidean distance. initialize cluster centroids m k k = 1,..., K at random repeat until convergence 1. assign data to centroids C(x i) = arg min 1 k K x i m k 2 2. update centroids MLCC

15 K-means functional K-means corresponds to minimizing the following function J(C, m) = K k=1 C(i)=k x i m k 2 The algorithm is an alternating optimization procedure, with convergence guarantees in practice (no rates). The function J is not convex, thus K-means is not guaranteed to find a global minimum. Computational cost 1. data assignment O(Kn) 2. cluster centers updates O(n) MLCC

16 K-means Elements of Statistical Learning (2nd Ed.) c Hastie, Tibshirani & Friedman 2009 Chap Initial Centroids Initial Partition Iteration Number 2 Iteration Number 20 Figure from Hastie, Tibshirani, Friedman MLCC

Example Vector Quantization FIGURE 14.9. Sir Ronald A.

17 Example Vector Quantization FIGURE Sir Ronald A. Fisher ( ) was one of the founders of modern day statistics, to whom we owe maximum-likelihood, sufficiency, and many other fundamental concepts. The image on the left is a grayscale image at 8 bits per pixel. The center image is the result of 2 2 block VQ, using 200 code vectors, with a compression rate of 1.9 bits/pixel. The right image uses only four code vectors, with a compression rate of 0.50 bits/pixel Figure from Hastie, Tibshirani, Friedman MLCC

18 Spectral clustering - similarity graph A set of unlabeled data {x i } n i=1 and some notion of similarity between data pairs s ij We may represent them as a similarity graph G = (V, E) Clustering can be seen as a graph partitioning problem MLCC

19 Spectral clustering - graph notation G = (V, E) undirected graph V : data correspond to the vertices E : Weighted adjacency matrix W = (w ij ) n i,j=1 with w ij 0. W is symmetric w ij = w ji, as G is undirected. Degree of a vertex: d i = n j=1 w ij Degree matrix: D = diag(d i ) Sub-graphs: A, B V then W (A, B) = i A,j B w ij Subgraph size: A number of vertices vol(a) = i A di MLCC

20 Spectral clustering - how to build the graph We use the available pairwise similarities s ij ɛ-neighbourhood graph: connect vertices whose similarity is larger than ɛ KNN graph: connect vertex v i to its K neighbours. Not symmetric! fully connected graph: s ij = exp( d 2 ij /2σ2 ) d is the Euclidean distance, σ 0 controls the width of a neighborhood MLCC

21 Spectral clustering - how to build the graph n can be very large, it would be preferable if W was sparse In general it is better some notion of locality { sij if j is a KNN of i w ij = 0 otherwise MLCC

22 Spectral clustering - graph Laplacians Unnormalized graph Laplacian: L = D W Properties: For all f R n f Lf = 1 2 n w ij (f i f j ) 2 ij=1 f Lf = f Df f W f = i d i f 2 i i,j f i f j w ij = 1 2 i ( j w ij )f 2 i 2 ij f i f j w ij + j ( i w ij )fj 2 = = 1 w ij (f i f j ) 2 2 ij MLCC

23 Spectral clustering - graph Laplacians Unnormalized graph Laplacian: L = D W For each vector f R n f Lf = 1 2 n w ij (f i f j ) 2 ij=1 The graph Laplacian measures the variation of f on the graph (f Lf small if close points have close function values f i ) L is symmetric and positive semi-definite The smallest eigenvalue of L is 0 and its corresponding eigenvector is a vector of ones L has N non negative real-valued eigenvalues 0 = λ 1 λ 2... λ N MLCC

24 Spectral clustering - graph Laplacians Unnormalized graph Laplacian: L = D W For each vector f R n f Lf = 1 2 n w ij (f i f j ) 2 ij=1 The graph Laplacian measures the variation of f on the graph (f Lf small if close points have close function values f i ) L is symmetric and positive semi-definite The smallest eigenvalue of L is 0 and its corresponding eigenvector is a vector of ones L has N non negative real-valued eigenvalues 0 = λ 1 λ 2... λ N Laplacian and clustering: the multiplicity k of λ 0 = 0 equals the number of connected components in the graph MLCC

25 Spectral clustering - graph Laplacians Unnormalized graph Laplacian: L = D W Normalized graph Laplacians: L n1 = D 1/2 LD 1/2 = I D 1/2 W D 1/2 L n2 = D 1 L = I D 1 W MLCC

26 A spectral clustering algorithm Graph Laplacian compute the Unnormalized Graph Laplacian L (unnormalized algorithm) compute a Normalized Graph Laplacian L n1 or L n2 (normalized algorithm) compute the first k eigenvectors of the Laplacian (k number of clusters to compute) let U k R n k be the matrix containing the k eigenvectors as columns y j R k be the vector obtained by the j-th row of U k j = 1... n. Apply k-means to {y j } MLCC

27 A spectral clustering algorithm Computational cost Eigendecomposition O(n 3 ) It may be enough to compute the first k eigenvalues/eigenvectors. There are algorithms for this MLCC

28 Example x Eigenvalue x Number Eigenvectors Spectral Clustering 3rd Smallest 2nd Smallest Third Smallest Eigenvector Index Second Smallest Eigenvector FIGURE Toy example illustrating spectral Figure from Hastie, Tibshirani, Friedman MLCC

29 The number of clusters eigengap heuristic Figure from Von Luxburg tutorial MLCC

30 Semi-supervised learning Laplacian-based regularization algorithms (Belkin et al. 04) Set of labeled examples: {(x i, y i )} n i=1 Set of unlabeled examples: {(x j )} n+u j=n+1 f 1 = arg min f H n n i=1 l(f(x i ), y i ) + λ A f 2 + λ I u 2 f T Lf MLCC

31 Wrapping up In this class we introduced the concept of data clustering and sketched some of the best known algorithms Ulrike Von Luxburg - A tutorial on Spectral Clustering MLCC

MATH 829: Introduction to Data Mining and Analysis Clustering II

his lecture is based on U. von Luxburg, A Tutorial on Spectral Clustering, Statistics and Computing, 17 (4), 2007. MATH 829: Introduction to Data Mining and Analysis Clustering II Dominique Guillot Departments