Dimension Reduction and Iterative Consensus Clustering


1 Dimension Reduction and Iterative Consensus Clustering. Southeastern Clustering and Ranking Workshop, August 24, 2009.

2 Outline: 1. Document Clustering (Geometry of the SVD, Centered SVD, Uncentered SVD, Principal Direction Divisive Partitioning); 2. Nonnegative Matrix Factorization; 3. Combination Algorithms and Iterating to Reach Consensus; 4. Results on the Medlars/Cranfield/CISI collection and the Benchmark Data Set by Sinka and Corne; 5. Conclusions.

3 Document Clustering. For document clustering we create an m × n term-document matrix A whose (i, j) entry f_{i,j} is the frequency of term i in document j: rows are indexed by the terms of the dictionary, columns by the documents. Various types of term weighting can be used in place of raw frequencies; for our experiments we simply normalized the columns. Each column of A then gives the coordinates of a document in the m-dimensional "term space", where each standard basis vector represents one term from the dictionary.
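As a concrete illustration, here is a minimal sketch (Python with NumPy, not part of the original slides) of building a normalized term-document matrix; the three toy documents are hypothetical stand-ins for the abstracts used in the experiments.

```python
import numpy as np

# Hypothetical toy corpus standing in for the abstracts used in the experiments.
docs = ["gold silver truck", "shipment of gold", "delivery of silver trucks"]

# Dictionary of terms and the raw frequency matrix f_ij (term i, document j).
terms = sorted({w for d in docs for w in d.split()})
A = np.zeros((len(terms), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[terms.index(w), j] += 1

# Normalize the columns, as done in the experiments (in place of term weighting).
A = A / np.linalg.norm(A, axis=0, keepdims=True)
```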

4-7 Singular Value Decomposition (SVD). The SVD decomposes A = UΣV^T, where U and V are orthogonal matrices and Σ is a diagonal matrix of singular values. The truncated SVD yields the closest rank-r approximation to A in the 2-norm: a_j ≈ ∑_{i=1}^{r} [V^T]_{i,j} σ_i u_i. Thus a column v_j of the truncated V^T gives the coordinates of a_j after projection into the lower-dimensional space spanned by the orthogonal basis (σ_1 u_1, σ_2 u_2, ..., σ_r u_r). We use the columns of the truncated V^T as a lower-dimensional representation of the columns of A for the purposes of clustering.
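A minimal sketch of this reduction, assuming A is the normalized term-document matrix from the previous sketch and r is chosen by the user:

```python
import numpy as np

U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 3                                    # user-chosen reduced dimension
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

# Column j of Vt_r holds the coordinates of document a_j in the space
# spanned by (sigma_1 u_1, ..., sigma_r u_r); these columns are what we cluster.
coords = Vt_r
A_r = U_r @ np.diag(s_r) @ Vt_r          # closest rank-r approximation to A
```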

8 Geometry of the Singular Vectors when A is Centered. The first left singular vector, u_1, of the centered matrix C = A − μe^T is the direction along which the variance of the data is maximal.

9 The second left singular vector of C, u_2, is the direction orthogonal to u_1 along which the variance is maximal.

10 Geometry of the SVD when A is Uncentered. The first left singular vector of A, u_1(A), is the direction of the least-squares line through the origin, i.e. the zero-intercept total least squares line T = α·u_1(A). (The figure contrasts u_1(A) with u_1(C), the first left singular vector of the centered matrix.)
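The two directions can be compared directly; a small sketch, assuming A from the earlier sketches:

```python
import numpy as np

mu = A.mean(axis=1, keepdims=True)
C = A - mu                                               # centered matrix C = A - mu*e^T

u1_A = np.linalg.svd(A, full_matrices=False)[0][:, 0]    # zero-intercept TLS direction
u1_C = np.linalg.svd(C, full_matrices=False)[0][:, 0]    # direction of maximal variance
```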

11-14 Principal Direction Divisive Partitioning (PDDP). An algorithm proposed by Daniel Boley at the University of Minnesota in 2002. The iterative process partitions the data into 2 clusters at each iteration, based upon their projection onto the direction of maximal variance. PDDP can be adapted to use more than just the principal singular vector. We will often use the results from PDDP to seed the k-means algorithm with an initial guess.

15 Illustration of One Iteration of PDDP. The signs of the entries in v_1 tell us whether the projected data fall to the left (v_1(i) < 0) or to the right (v_1(i) > 0) of the origin. Figure: data cloud projected onto the span of u_1(C).
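A minimal sketch of one PDDP split using the sign test just described (the helper name pddp_split is ours, not from the slides):

```python
import numpy as np

def pddp_split(A):
    """Split the columns of A into two clusters by the sign of their
    projection onto the principal direction of the centered data."""
    C = A - A.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    v1 = Vt[0, :]                        # projections of the documents onto u1(C)
    left = np.where(v1 < 0)[0]           # documents projected left of the origin
    right = np.where(v1 >= 0)[0]         # documents projected right of the origin
    return left, right
```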

16-20 Nonnegative Matrix Factorization (NMF). The nonnegative matrix factorization seeks to decompose a nonnegative matrix into the product of two nonnegative matrices, A_{m×n} = W_{m×r} H_{r×n}. The decomposition is created by solving the nonlinear optimization problem min ‖A − WH‖_F^2 subject to W ≥ 0 and H ≥ 0. The inner dimension of the factorization, r, must be supplied by the user. The result is an additive, parts-based approximation to each data column a_j as a linear combination of "feature" vectors w_i: a_j ≈ ∑_{i=1}^{r} h_{i,j} w_i.
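One standard way to attack this optimization problem is Lee and Seung's multiplicative update rules; the sketch below uses them purely for illustration, since the slides do not specify which NMF solver was used in the experiments.

```python
import numpy as np

def nmf(A, r, iters=200, eps=1e-9):
    """Approximate a nonnegative m x n matrix A by W (m x r) times H (r x n),
    reducing ||A - WH||_F^2 with multiplicative updates."""
    rng = np.random.default_rng(0)
    W = rng.random((A.shape[0], r))
    H = rng.random((r, A.shape[1]))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H
```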

21 NMF for Dimension Reduction. The columns of H represent the coordinates of each document after projection into the lower-dimensional "feature space" spanned by the columns of W. We use the columns of H as a lower-dimensional representation of the columns of A for the purposes of clustering.

22 New Algorithms out of Old. Flow diagram: the data matrix A is reduced either with the SVD (of the centered matrix C or of the uncentered matrix A, keeping the first r (or k) columns of V^T) or with the NMF (keeping H); the reduced representation is then clustered with k-means and/or PDDP to produce a cluster solution.

23 One possible clustering algorithm, un.SVDr-PDDP-kmeans, follows a single path through the diagram: reduce the uncentered matrix A with the SVD to the first r columns of V^T, cluster with PDDP, and refine with k-means.

24 A clustering algorithm with two dimension reductions, H-SVDk-PDDP-kmeans: reduce A with the NMF to H, reduce H with the SVD to the first k columns of V^T, cluster with PDDP, and refine with k-means.
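A sketch of one such pipeline, un.SVDr-PDDP-kmeans, assembled from the earlier sketches. The split-the-largest-cluster rule and the use of scikit-learn's KMeans are our simplifying assumptions, not details from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def un_svdr_pddp_kmeans(A, r, k):
    """Uncentered SVD to r dimensions, PDDP down to k clusters (splitting the
    largest cluster each time), then k-means seeded with the PDDP centroids."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    X = Vt[:r, :]                                # r x n reduced representation

    clusters = [np.arange(X.shape[1])]
    while len(clusters) < k:
        i = max(range(len(clusters)), key=lambda i: clusters[i].size)
        part = clusters.pop(i)
        left, right = pddp_split(X[:, part])     # helper from the PDDP sketch
        clusters += [part[left], part[right]]

    seeds = np.vstack([X[:, c].mean(axis=1) for c in clusters])
    return KMeans(n_clusters=k, init=seeds, n_init=1).fit_predict(X.T)
```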

25-29 Iterating to Reach Consensus. Since no single algorithm will perform better than all others on a given class of data, we propose using several algorithms to find agreement upon clusters. We create an adjacency matrix for each clustering, whose (i, j) entry is 1 if a_i and a_j were clustered together and 0 otherwise. We sum the adjacency matrices from the various algorithms to create a consensus matrix M, whose (i, j) entry gives the number of times a_i and a_j were clustered together. Entries in the consensus matrix that are below a certain tolerance may be changed to zero. This consensus matrix is then clustered using the same algorithms, to see whether the algorithms agree upon a solution.
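A minimal sketch of the adjacency/consensus construction just described (the function name and the default tolerance are ours):

```python
import numpy as np

def consensus_matrix(labelings, tol=0):
    """Sum the adjacency matrices of several clusterings: M[i, j] counts how
    often documents i and j were clustered together; small entries are zeroed."""
    n = len(labelings[0])
    M = np.zeros((n, n))
    for labels in labelings:
        labels = np.asarray(labels)
        M += (labels[:, None] == labels[None, :]).astype(float)  # adjacency matrix
    M[M < tol] = 0                        # drop entries below the chosen tolerance
    return M
```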

30 Iterating the Consensus Process. Flow diagram: the data matrix is clustered by algorithms 1 through n, giving cluster solutions 1 through n, which are combined into a consensus matrix; if no consensus is reached, the consensus matrix is itself clustered by the same algorithms and the process repeats.
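A sketch of the iteration loop in this diagram, assuming each algorithm is a function mapping a matrix to a label vector and reusing consensus_matrix from the previous sketch; checking agreement by comparing the induced partitions is our choice, a detail the slides leave open.

```python
import numpy as np

def iterate_consensus(A, algorithms, max_rounds=10, tol=0):
    """Repeatedly cluster, form the consensus matrix, and re-cluster it until
    all algorithms induce the same partition (or max_rounds is reached)."""
    data = A
    for _ in range(max_rounds):
        solutions = [alg(data) for alg in algorithms]
        # Compare partitions via their adjacency matrices (label names may differ).
        adj = [np.asarray(s)[:, None] == np.asarray(s)[None, :] for s in solutions]
        if all(np.array_equal(adj[0], a) for a in adj[1:]):
            return solutions[0]                   # consensus reached
        data = consensus_matrix(solutions, tol)   # cluster the consensus matrix next
    return solutions[0]
```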

31 Medlars/Cranfield/CISI: medical and scientific abstracts; 11000 terms; k = 3 clusters. Table: accuracies for Med/Cran/CISI after dimension reduction to r = 3 for the algorithms NMF Basic, PDDP, PDDP-kmeans, SVDr-PDDP-kmeans, un.SVDr-PDDP-kmeans, H-PDDP, H-PDDP-kmeans, H-SVDk-PDDP-kmeans, and H-un.SVDk-PDDP-kmeans, with columns for r = k = 3 and consensus rounds 1, 2, and 3 (the numerical accuracy values are not preserved in this transcription).

32 Medlars/Cranfield/CISI, same collection, k = 3 clusters. Table: accuracies for Med/Cran/CISI after dimension reduction to r = 15 for the same nine algorithms, with columns for r = 15 and consensus rounds 1, 2, and 3 (accuracy values not preserved in this transcription).

33-37 Benchmark Data Set Collected by Sinka and Corne. A collection of documents proposed as a benchmark for document clustering. The documents pertain to 4 broad topics (banking/finance, programming, science, and sport), and each topic contains 2 or 3 subtopics (commercial banks, insurance agencies, Java, astronomy, biology, etc.). The documents were extracted automatically from the web: some are long, detailed articles; others are just lists of words, addresses, or links. Noisy data!

38 Benchmark data, subset BCFG, k = 4 clusters. Table: cluster accuracies for Benchmark BCFG after dimension reduction to r = 4 for the same nine algorithms, with columns for r = 4 and consensus rounds 1, 2, and 3 (accuracy values not preserved in this transcription).

39 Benchmark data, subset BCFG, k = 4 clusters. Table (Experiment 3): cluster accuracies for Benchmark BCFG after dimension reduction to r = 10 for the same nine algorithms, with columns for r = 10 and consensus rounds 1, 2, 3, and 5 (accuracy values not preserved in this transcription).

40 Conclusions. The choices for the size of the dimension reduction, r, and the various combinations of algorithms produce hundreds of clusterings for the consensus approach. Iterative consensus clustering shows potential as a technique for determining a final clustering solution from many different algorithms. Although the final clustering solution determined through consensus is not guaranteed to be optimal, experiments suggest that the technique provides a solution that is well above the average of the algorithms used.
