MATH 829: Introduction to Data Mining and Analysis - Clustering II
This lecture is based on U. von Luxburg, A Tutorial on Spectral Clustering, Statistics and Computing, 17 (4), 2007.

MATH 829: Introduction to Data Mining and Analysis - Clustering II
Dominique Guillot
Department of Mathematical Sciences, University of Delaware
April 27, 2016
Spectral clustering: overview

In the previous lecture, we discussed how K-means can be used to cluster points in $\mathbb{R}^p$.

Spectral clustering:
- Very popular clustering method.
- Often outperforms other methods such as K-means.
- Can be used for various types of data (not only points in $\mathbb{R}^p$).
- Easy to implement. Only uses basic linear algebra.

Overview of spectral clustering:
1. Construct a similarity matrix measuring the similarity of pairs of objects.
2. Use the similarity matrix to construct a (weighted or unweighted) graph.
3. Compute eigenvectors of the graph Laplacian.
4. Cluster the data with the K-means algorithm applied to the eigenvectors of the graph Laplacian.
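To fix ideas, here is a minimal sketch of this pipeline (the unnormalized variant) in Python. It assumes NumPy and scikit-learn are available; the function name and the choice of the unnormalized Laplacian are ours for illustration, not part of the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Unnormalized spectral clustering on a symmetric weight matrix W:
    embed the vertices with the eigenvectors of L = D - W for the k
    smallest eigenvalues, then run K-means on the rows."""
    d = W.sum(axis=1)                      # degrees d_i
    L = np.diag(d) - W                     # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    U = eigvecs[:, :k]                     # n x k spectral embedding
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```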
Notation

We will use the following notation/conventions:
- $G = (V, E)$ is a graph with vertex set $V = \{v_1, \dots, v_n\}$ and edge set $E \subseteq V \times V$.
- Each edge carries a weight $w_{ij} \geq 0$.
- The adjacency matrix of $G$ is $W = W_G = (w_{ij})_{i,j=1}^n$. We will assume $W$ is symmetric (undirected graphs).
- The degree of $v_i$ is $d_i := \sum_{j=1}^n w_{ij}$.
- The degree matrix of $G$ is $D := \mathrm{diag}(d_1, \dots, d_n)$.
- We denote the complement of $A \subseteq V$ by $\bar{A}$.
- If $A \subseteq V$, then we let $1_A = (f_1, \dots, f_n)^T \in \mathbb{R}^n$, where $f_i = 1$ if $v_i \in A$ and $0$ otherwise.
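As a concrete illustration of this notation, here is a small hypothetical example in NumPy (the weights are made up):

```python
import numpy as np

# A weighted, undirected graph on 4 vertices; v4 is isolated.
W = np.array([[0., 2., 1., 0.],
              [2., 0., 3., 0.],
              [1., 3., 0., 0.],
              [0., 0., 0., 0.]])

d = W.sum(axis=1)    # degrees d_i = sum_j w_ij -> [3., 5., 4., 0.]
D = np.diag(d)       # degree matrix

A = [0, 1]           # A = {v1, v2}
one_A = np.isin(np.arange(4), A).astype(float)  # indicator vector 1_A
```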
Similarity graphs

We assume we are given a measure of similarity $s$ between data points $x_1, \dots, x_n \in X$:
$$s : X \times X \to [0, \infty).$$
We denote by $s_{ij} := s(x_i, x_j)$ the measure of similarity between $x_i$ and $x_j$.

Equivalently, we may assume we have a measure of distance between data points (e.g. $(X, d)$ is a metric space). Let $d_{ij} := d(x_i, x_j)$, the distance between $x_i$ and $x_j$.

From $d_{ij}$ (or $s_{ij}$), we naturally build a similarity graph. We will discuss 3 popular ways of building a similarity graph.
Similarity graphs (cont.)

Vertex set $= \{v_1, \dots, v_n\}$, where $n$ is the number of data points.

1. The ε-neighborhood graph: connect all points whose pairwise distances are smaller than some ε > 0. We usually don't weight the edges. The graph is thus a simple graph (an unweighted, undirected graph containing no loops or multiple edges).
2. The k-nearest neighbor graph: the goal is to connect $v_i$ to $v_j$ if $x_j$ is among the k nearest neighbors of $x_i$. However, this leads to a directed graph. We therefore define:
   - the k-nearest neighbor graph: $v_i$ is adjacent to $v_j$ iff $x_j$ is among the k nearest neighbors of $x_i$ OR $x_i$ is among the k nearest neighbors of $x_j$;
   - the mutual k-nearest neighbor graph: $v_i$ is adjacent to $v_j$ iff $x_j$ is among the k nearest neighbors of $x_i$ AND $x_i$ is among the k nearest neighbors of $x_j$.
   In both cases, we weight the edges by the similarity of their endpoints.
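Both constructions can be sketched with NumPy/SciPy as follows, assuming points in Euclidean space. The code returns 0/1 adjacency matrices; multiply entrywise by a similarity matrix to weight the edges as described above. The function names are ours for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def epsilon_graph(X, eps):
    """Unweighted epsilon-neighborhood graph: w_ij = 1 iff d_ij < eps."""
    D = cdist(X, X)                      # pairwise distances d_ij
    W = (D < eps).astype(float)
    np.fill_diagonal(W, 0.0)             # no loops
    return W

def knn_graph(X, k, mutual=False):
    """(Mutual) k-nearest neighbor graph, symmetrized via OR or AND."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)          # exclude a point from its own neighbors
    nn = np.argsort(D, axis=1)[:, :k]    # indices of the k nearest neighbors
    A = np.zeros_like(D)
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nn.ravel()] = 1.0            # directed kNN relation
    # AND (mutual) uses min, OR uses max, on the 0/1 entries.
    return np.minimum(A, A.T) if mutual else np.maximum(A, A.T)
```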
Similarity graphs (cont.)

3. The fully connected graph: connect all points with edge weights $s_{ij}$. For example, one could use the Gaussian similarity function to represent local neighborhood relationships:
$$s_{ij} = s(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2)) \qquad (\sigma^2 > 0).$$
Note: $\sigma^2$ controls the width of the neighborhoods.

All graphs mentioned above are regularly used in spectral clustering.
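A short sketch of the Gaussian construction (zeroing the diagonal is a common convention, not something the slide specifies):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_similarity(X, sigma):
    """Fully connected similarity graph with Gaussian weights."""
    sq = cdist(X, X, 'sqeuclidean')      # ||x_i - x_j||^2
    S = np.exp(-sq / (2.0 * sigma**2))
    np.fill_diagonal(S, 0.0)             # no self-loops (convention)
    return S
```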
Graph Laplacians

There are three commonly used definitions of the graph Laplacian:
1. The unnormalized Laplacian is $L := D - W$.
2. The normalized symmetric Laplacian is $L_{\mathrm{sym}} := D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$.
3. The normalized random walk Laplacian is $L_{\mathrm{rw}} := D^{-1} L = I - D^{-1} W$.

We begin by studying properties of the unnormalized Laplacian.
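All three are a few lines of NumPy. This sketch assumes every vertex has strictly positive degree, so that $D$ is invertible:

```python
import numpy as np

def laplacians(W):
    """Return (L, L_sym, L_rw) for a symmetric weight matrix W,
    assuming all degrees are strictly positive."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                                       # L = D - W
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]    # D^{-1/2} L D^{-1/2}
    L_rw = L / d[:, None]                                    # D^{-1} L
    return L, L_sym, L_rw
```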
The unnormalized Laplacian

Proposition: The matrix $L$ satisfies the following properties:
1. For any $f \in \mathbb{R}^n$: $f^T L f = \frac{1}{2} \sum_{i,j=1}^n w_{ij} (f_i - f_j)^2$.
2. $L$ is symmetric and positive semidefinite.
3. $0$ is an eigenvalue of $L$ with associated constant eigenvector $\mathbf{1}$.

Proof: To prove (1),
$$f^T L f = f^T D f - f^T W f = \sum_{i=1}^n d_i f_i^2 - \sum_{i,j=1}^n w_{ij} f_i f_j = \frac{1}{2}\left( \sum_{i=1}^n d_i f_i^2 - 2\sum_{i,j=1}^n w_{ij} f_i f_j + \sum_{j=1}^n d_j f_j^2 \right) = \frac{1}{2} \sum_{i,j=1}^n w_{ij} (f_i - f_j)^2.$$
(2) follows from (1). (3) is easy.
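A quick numerical sanity check of property (1), on a random symmetric weight matrix (purely illustrative data):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((5, 5)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W
f = rng.standard_normal(5)

quad = f @ L @ f
identity = 0.5 * sum(W[i, j] * (f[i] - f[j])**2
                     for i in range(5) for j in range(5))
assert np.isclose(quad, identity)  # f^T L f = (1/2) sum_ij w_ij (f_i - f_j)^2
```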
The unnormalized Laplacian (cont.)

Proposition: Let $G$ be an undirected graph with non-negative weights. Then:
1. The multiplicity $k$ of the eigenvalue $0$ of $L$ equals the number of connected components $A_1, \dots, A_k$ of the graph.
2. The eigenspace of the eigenvalue $0$ is spanned by the indicator vectors $1_{A_1}, \dots, 1_{A_k}$ of those components.

Proof: If $f$ is an eigenvector associated with $\lambda = 0$, then
$$0 = f^T L f = \frac{1}{2} \sum_{i,j=1}^n w_{ij} (f_i - f_j)^2.$$
It follows that $f_i = f_j$ whenever $w_{ij} > 0$. Thus $f$ is constant on the connected components of $G$. We conclude that the eigenspace of $0$ is contained in $\mathrm{span}(1_{A_1}, \dots, 1_{A_k})$. Conversely, it is not hard to see that each $1_{A_i}$ is an eigenvector associated with $0$ (write $L$ in block diagonal form).
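The proposition is easy to verify numerically on a small example with two components (a made-up graph for illustration):

```python
import numpy as np

# Two connected components: a triangle {v1, v2, v3} and an edge {v4, v5}.
W = np.zeros((5, 5))
W[0, 1] = W[1, 2] = W[0, 2] = 1.0
W[3, 4] = 1.0
W = W + W.T
L = np.diag(W.sum(axis=1)) - W

eigvals, eigvecs = np.linalg.eigh(L)
print(np.sum(np.isclose(eigvals, 0)))  # 2: one zero eigenvalue per component
```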
The normalized Laplacians

Proposition: The normalized Laplacians satisfy the following properties:
1. For every $f \in \mathbb{R}^n$, we have
$$f^T L_{\mathrm{sym}} f = \frac{1}{2} \sum_{i,j=1}^n w_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2.$$
2. $\lambda$ is an eigenvalue of $L_{\mathrm{rw}}$ with eigenvector $u$ if and only if $\lambda$ is an eigenvalue of $L_{\mathrm{sym}}$ with eigenvector $w = D^{1/2} u$.
3. $\lambda$ is an eigenvalue of $L_{\mathrm{rw}}$ with eigenvector $u$ if and only if $\lambda$ and $u$ solve the generalized eigenproblem $Lu = \lambda D u$.

Proof: The proof of (1) is similar to the proof of the analogous result for the unnormalized Laplacian. (2) and (3) follow easily by using appropriate rescalings.
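Properties (2) and (3) can also be checked numerically; the rescaling $u = D^{-1/2} w$ below is exactly the one used in the proof (random weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((6, 6)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
d = W.sum(axis=1)
L = np.diag(d) - W
L_sym = L / np.sqrt(np.outer(d, d))   # D^{-1/2} L D^{-1/2}
L_rw = L / d[:, None]                 # D^{-1} L

lam, U = np.linalg.eigh(L_sym)        # eigenpairs of L_sym
w = U[:, 1]                           # eigenvector of L_sym for lam[1]
u = w / np.sqrt(d)                    # u = D^{-1/2} w
assert np.allclose(L_rw @ u, lam[1] * u)    # eigenpair of L_rw
assert np.allclose(L @ u, lam[1] * d * u)   # solves Lu = lambda * Du
```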
The normalized Laplacians (cont.)

Proposition: Let $G$ be an undirected graph with non-negative weights. Then:
1. The multiplicity $k$ of the eigenvalue $0$ of both $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$ equals the number of connected components $A_1, \dots, A_k$ of the graph.
2. For $L_{\mathrm{rw}}$, the eigenspace of the eigenvalue $0$ is spanned by the indicator vectors $1_{A_i}$, $i = 1, \dots, k$.
3. For $L_{\mathrm{sym}}$, the eigenspace of the eigenvalue $0$ is spanned by the vectors $D^{1/2} 1_{A_i}$, $i = 1, \dots, k$.

Proof: Similar to the proof of the analogous result for the unnormalized Laplacian.