
Spectral Clustering by HU Pili June 16, 2013

Outline
Clustering Problem
Spectral Clustering Demo
Preliminaries. Clustering: K-Means Algorithm; Dimensionality Reduction: PCA, KPCA
Spectral Clustering Framework
Spectral Clustering Justification
Other Spectral Embedding Techniques
Main reference: Hu 2012 [4]

Clustering Problem. Figure 1: Abstract your target using a feature vector (axes: Weight vs. Height).

Clustering Problem. Figure 2: Cluster the data points into K groups (K = 2 here).

Clustering Problem. Figure 3: Gain insights into your data (e.g. the "Thin" and "Fat" groups).

Clustering Problem. Figure 4: The cluster center is representative (knowledge).

Clustering Problem. Figure 5: Use the knowledge for prediction.

Review: Clustering. We learned the general steps of Data Mining / Knowledge Discovery using a clustering example: 1. Abstract your data in the form of vectors. 2. Run learning algorithms. 3. Gain insights / extract knowledge / make predictions. We focus on step 2.

Spectral Clustering Demo. Figure 6: Data scatter plot.

Spectral Clustering Demo. Figure 7: Standard K-Means.

Spectral Clustering Demo. Figure 8: Our sample spectral clustering.

Spectral Clustering Demo. The algorithm is simple:
K-Means:
    [idx, c] = kmeans(X, K);
Spectral Clustering:
    epsilon = 0.7;
    D = dist(X);
    A = double(D < epsilon);
    [V, Lambda] = eigs(A, K);
    [idx, c] = kmeans(V, K);
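For reference, here is a self-contained sketch of the whole demo on synthetic two-moons-style data. The data generator and the parameter values (epsilon, noise level) are illustrative assumptions, not the settings of the original demo; kmeans and pdist2 come from MATLAB's Statistics Toolbox.
    t = pi * rand(200, 1);                              % 200 points per moon
    X = [cos(t), sin(t); 1 - cos(t), 0.5 - sin(t)];     % two interleaving half-circles
    X = 2 * X + 0.05 * randn(size(X));                  % scale up and add a little noise
    K = 2; epsilon = 0.5;                               % epsilon tuned for this synthetic data
    A = double(pdist2(X, X) < epsilon);                 % epsilon-ball similarity graph
    [V, ~] = eigs(A, K);                                % top-K eigenvectors as the embedding
    idx_spectral = kmeans(V, K);                        % cluster in the embedded space
    idx_kmeans = kmeans(X, K);                          % baseline: K-means on the raw data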

Review: The Demo. The usual case in data mining: there are no weak algorithms, and preprocessing is as important as the algorithm; the problem looks easier in another space (the secret is coming soon). Transformation to another space: high to low, i.e. dimensionality reduction or low-dimensional embedding, e.g. spectral clustering; low to high, e.g. Support Vector Machine (SVM).

Secrets of Preprocessing. Figure 9: The similarity graph: connect points within an ε-ball. D = dist(X); A = double(D < epsilon);

Secrets of Preprocessing. Figure 10: 2-D embedding using the largest 2 eigenvectors. [V, Lambda] = eigs(A, K);

Secrets of Preprocessing. Figure 11: Even better after projecting onto the unit circle (not used in our sample, but more widely applicable; Brand 2003 [3]).

Secrets of Preprocessing. Figure 12: Angle histograms (all points, cluster 1, cluster 2): the two clusters are concentrated and nearly perpendicular to each other.

Notations. Data points: X_{n×N} = [x_1, x_2, ..., x_N]; N points, each of dimension n, organized in columns. Feature extraction: φ(x_i), of dimension n̂. Feature matrix: Φ_{n̂×N} = [φ(x_1), φ(x_2), ..., φ(x_N)]. Eigenvalue decomposition: A = UΛU^T; the first d columns of U are denoted U_d. Low-dimensional embedding: Y_{N×d}, embedding N points into d-dimensional space. Number of clusters: K.

K-Means. Initialize K centers m_1, ..., m_K. Iterate until convergence: Cluster assignment: c_i = arg min_j ||x_i - m_j|| for all data points, 1 ≤ i ≤ N. Update clusters: S_j = {i : c_i = j}, 1 ≤ j ≤ K. Update centers: m_j = (1/|S_j|) Σ_{i∈S_j} x_i.
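A minimal sketch of this loop in code (X is N-by-n with points in rows, to match MATLAB's conventions; empty clusters and a convergence test are omitted, and pdist2 assumes the Statistics Toolbox):
    m = X(randperm(size(X, 1), K), :);       % initialize: pick K random points as centers
    for t = 1:100                            % fixed iteration budget for simplicity
        [~, c] = min(pdist2(X, m), [], 2);   % assignment step: nearest center per point
        for j = 1:K
            m(j, :) = mean(X(c == j, :), 1); % update step: center = mean of its points
        end
    end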

Remarks: K-Means. 1. A chicken-and-egg problem. 2. How to initialize the centers? 3. How to determine the hyperparameter K? 4. The decision boundary is a straight line: c_i = arg min_j ||x_i - m_j||, and ||x - m_i|| = ||x - m_j|| is equivalent to m_i^T m_i - m_j^T m_j = 2 x^T (m_i - m_j). We address the 4th point by transforming the data into a space where a straight-line boundary is enough.

Principal Component Analysis. Figure 13: Error-minimization formulation (data, principal component, and projection error).

Principal Component Analysis. Assume the x_i are already centered (easy to preprocess). Project the points onto the space spanned by U_{n×d} with minimum error:
min_{U ∈ R^{n×d}} J(U) = Σ_{i=1}^{N} ||U U^T x_i - x_i||^2, s.t. U^T U = I.

Principal Component Analysis. Transform to trace maximization:
max_{U ∈ R^{n×d}} Tr[U^T (X X^T) U], s.t. U^T U = I.
This is a standard problem in matrix theory: the solution U is given by the top d eigenvectors of X X^T (those corresponding to the largest eigenvalues), i.e. X X^T U = U Λ. We usually write Σ_{n×n} = X X^T, because X X^T can be interpreted as the covariance of the n variables (again, note that X is centered).
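A sketch of this covariance route in code (the random X below is just a stand-in for the centered n-by-N data matrix; d is the target dimension):
    X = randn(10, 200); X = X - mean(X, 2);     % stand-in data, centered along each row
    d = 2;
    Sigma = X * X';                             % n-by-n (unnormalized) covariance matrix
    [U, Lambda] = eig(Sigma);                   % EVD of the symmetric covariance
    [~, order] = sort(diag(Lambda), 'descend'); % eig does not sort, so order explicitly
    Ud = U(:, order(1:d));                      % principal axes: top-d eigenvectors
    Y = X' * Ud;                                % N-by-d principal components (embedding)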

Principal Component Analysis. About U U^T x_i: x_i is a data point in the original n-dimensional space. U_{n×d} gives the d-dimensional space (its axes) expressed in the coordinates of the original n-dimensional space: the principal axes. U^T x_i is the coordinate of x_i in the d-dimensional space, i.e. the d-dimensional embedding, the projection of x_i onto the d-dimensional space expressed in d-dimensional coordinates: the principal components. U U^T x_i is the projection of x_i onto the d-dimensional space expressed in the original n-dimensional coordinates.

Principal Component Analysis. From principal axes to principal components: for one data point, U^T x_i; in matrix form, (Y^T)_{d×N} = U^T X. Relation between the covariance and similarity matrices:
(X^T X) Y = X^T X (U^T X)^T = X^T X X^T U = X^T U Λ = Y Λ.
Observation: the columns of Y are eigenvectors of X^T X.

Principal Component Analysis. Two operational approaches: decompose (X X^T)_{n×n} and let (Y^T)_{d×N} = U^T X; or decompose (X^T X)_{N×N} and get Y directly. Implications: choose the smaller matrix in practice; X^T X hints that we can do more with this structure. Remarks: the principal components, i.e. Y, the d-dimensional embedding, are what we want in most cases, e.g. we can cluster on the coordinates given by Y.
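A quick numerical check that the two routes agree (a sketch; the columns match up to sign, because the eigenvectors of X^T X are unit-norm while the columns of X^T U have norm sqrt(λ_i)):
    X = randn(10, 200); X = X - mean(X, 2);   % stand-in centered data, n = 10, N = 200
    d = 2;
    [U, L1] = eigs(X * X', d);                % route 1: decompose the n-by-n covariance matrix
    Y1 = X' * U;
    [V, L2] = eigs(X' * X, d);                % route 2: decompose the N-by-N similarity matrix
    Y2 = V * sqrt(L2);                        % rescale the unit eigenvectors by sqrt(lambda)
    norm(abs(Y1) - abs(Y2))                   % small, up to numerical error and column signs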

Kernel PCA. Settings: a feature extraction function φ: R^n → R^{n̂} maps the original n-dimensional features to n̂ dimensions. Matrix organization: Φ_{n̂×N} = [φ(x_1), φ(x_2), ..., φ(x_N)]. Now Φ plays the role of the data matrix in the PCA formulation, and we try to embed the points into d dimensions.

Kernel PCA. According to the analysis of PCA, we can operate on either: (Φ Φ^T)_{n̂×n̂}, the covariance matrix of the features; or (Φ^T Φ)_{N×N}, the similarity/affinity matrix (in spectral clustering language) or Gram/kernel matrix (in kernel PCA language). The observation: n̂ can be very large, even infinite (e.g. φ: R^n → R^∞), but (Φ^T Φ)_{i,j} = φ^T(x_i) φ(x_j), so we don't need an explicit φ; we only need k(x_i, x_j) = φ^T(x_i) φ(x_j).

Kernel PCA. k(x_i, x_j) = φ^T(x_i) φ(x_j) is the kernel. One important property: by definition, k(x_i, x_j) is a positive semidefinite function, so K is a positive semidefinite matrix. Some example kernels: Linear: k(x_i, x_j) = x_i^T x_j (degrades to plain PCA). Polynomial: k(x_i, x_j) = (1 + x_i^T x_j)^p. Gaussian: k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)).
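A sketch of building these kernel matrices in code (X holds one point per row; p and sigma are hyperparameters to choose; pdist2 assumes the Statistics Toolbox):
    X = randn(200, 10); p = 3; sigma = 1;               % stand-in data and parameters
    Klin = X * X';                                      % linear kernel: reduces to plain PCA
    Kpoly = (1 + X * X') .^ p;                          % polynomial kernel of degree p
    Kgauss = exp(-pdist2(X, X) .^ 2 / (2 * sigma^2));   % Gaussian (RBF) kernel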

Remarks: KPCA. It avoids explicit construction of high-dimensional (maybe infinite-dimensional) features and enables a whole research direction: kernel engineering. The above discussion assumes Φ is centered! See Bishop 2006 [2] for how to center this matrix using only the kernel function (or the double-centering technique in [4]). Out-of-sample embedding is the real difficulty; see Bengio 2004 [1].
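A sketch of the double-centering step mentioned above, using only the N-by-N kernel matrix K (the standard formula, as in Bishop 2006 [2]; K is assumed to be already computed):
    N = size(K, 1);
    oneN = ones(N) / N;                                 % N-by-N matrix with all entries 1/N
    Kc = K - oneN * K - K * oneN + oneN * K * oneN;     % centered kernel matrix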

Review: PCA and KPCA. The minimum-error formulation of PCA has two equivalent implementation approaches: the covariance matrix and the similarity matrix. The similarity matrix K = Φ^T Φ is more convenient to manipulate and leads to KPCA. The kernel is Positive Semi-Definite (PSD) by definition.

Spectral Clustering Framework. A bold guess: decomposing K = Φ^T Φ gives a good low-dimensional embedding. The inner product measures similarity, i.e. k(x_i, x_j) = φ^T(x_i) φ(x_j), so K is a similarity matrix. In practice we actually never look at Φ: we can specify K directly and perform EVD: X_{n×N} → K_{N×N}. What if we directly give a similarity measure K without the PSD constraint? That leads to general spectral clustering.

Spectral Clustering Framework. 1. Get a similarity matrix A_{N×N} from the data points X (A: affinity matrix; adjacency matrix of a graph; similarity graph). 2. EVD: A = UΛU^T; use U_d (or a post-processed version, see [4]) as the d-dimensional embedding. 3. Perform clustering on the d-dimensional embedding.

Spectral Clustering Framework. Review our naive spectral clustering demo:
1. epsilon = 0.7; D = dist(X); A = double(D < epsilon);
2. [V, Lambda] = eigs(A, K);
3. [idx, c] = kmeans(V, K);

Remarks: SC Framework. We start by relaxing A (i.e. K) in KPCA. Does losing PSD mean losing the KPCA justification? Not exactly: consider A' = A + σI (adding σI shifts the eigenvalues but keeps the eigenvectors). The real tricks (see [4] Section 2 for details): how to form A; whether to decompose A or a variant of A (e.g. L = D - A); whether to use the EVD result directly (e.g. U) or a variant (e.g. UΛ^{1/2}).

Similarity graph. Input is high-dimensional data (e.g. it comes in the form of X): k-nearest-neighbour graph; ε-ball graph; mutual k-NN graph; complete graph (with Gaussian kernel weights).

Similarity graph. Input is a distance matrix: (D^(2))_{i,j} is the squared distance between i and j (it may not come from raw x_i, x_j). With c = [x_1^T x_1, ..., x_N^T x_N]^T we have D^(2) = c 1^T + 1 c^T - 2 X^T X. Let J = I - (1/N) 1 1^T; then X^T X = -(1/2) J D^(2) J. Remarks: see MDS.
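A sketch of this recovery in code (the squared-distance matrix is built here from stand-in Euclidean data, so the top eigenvalues are guaranteed nonnegative; pdist2 assumes the Statistics Toolbox):
    X0 = randn(3, 200);                 % stand-in raw data: 200 points in 3-D, in columns
    D2 = pdist2(X0', X0') .^ 2;         % N-by-N squared distances
    d = 2; N = size(D2, 1);
    J = eye(N) - ones(N) / N;           % centering matrix
    G = -0.5 * J * D2 * J;              % recovered Gram matrix of the centered data
    [V, Lambda] = eigs(G, d);           % top-d eigenpairs of the Gram matrix
    Y = V * sqrt(Lambda);               % classical MDS embedding, N-by-d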

Similarity graph. Input is a graph: just use it, or apply some enhancements, e.g. geodesic distances; see [4] Section 2.1.3 for some possible methods. After getting the graph (in whichever way): adjacency matrix A, Laplacian matrix L = D - A. Normalized versions: A_left = D^{-1} A, A_sym = D^{-1/2} A D^{-1/2}; L_left = D^{-1} L, L_sym = D^{-1/2} L D^{-1/2}.
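A sketch computing these matrices from an adjacency matrix A (it assumes every node has at least one edge, so no degree is zero):
    deg = sum(A, 2);                                      % node degrees
    D = diag(deg);
    L = D - A;                                            % unnormalized Laplacian
    Aleft = diag(1 ./ deg) * A;                           % D^{-1} A
    Asym = diag(deg .^ -0.5) * A * diag(deg .^ -0.5);     % D^{-1/2} A D^{-1/2}
    Lleft = eye(size(A, 1)) - Aleft;                      % D^{-1} L
    Lsym = eye(size(A, 1)) - Asym;                        % D^{-1/2} L D^{-1/2}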

EVD of the graph matrix. Two families: adjacency-type matrices, where we use the eigenvectors of the largest eigenvalues; and Laplacian-type matrices, where we use the eigenvectors of the smallest eigenvalues.

Remarks: SC Framework. There are many possible combinations of similarity-matrix construction and post-processing of the EVD. Not all of these combinations have justifications, but once a combination is shown to work, it is usually not very hard to find one. Existing works actually start from very differently flavoured formulations. They share only one common property: they involve EVD, aka spectral analysis; hence the name.

Spectral Clustering Justification. Cut-based argument (the mainstream one; the origin). Random-walk escaping probability. Commute time: L^{-1} encodes the effective resistance (this is where UΛ^{1/2} comes from). Low-rank approximation. Density estimation. Matrix perturbation. Polarization (the demo). See [4] for pointers.

Cut Justification. Normalized Cut (Shi 2000 [5]):
NCut = Σ_{i=1}^{K} cut(C_i, V \ C_i) / vol(C_i).
With the characteristic vector χ_i ∈ {0,1}^N of C_i:
NCut = Σ_{i=1}^{K} (χ_i^T L χ_i) / (χ_i^T D χ_i).

Relax χ_i to real values:
min_{v_i ∈ R^N} Σ_{i=1}^{K} v_i^T L v_i, s.t. v_i^T D v_i = 1 and v_i^T v_j = 0 for i ≠ j.
This is the generalized eigenvalue problem L v = λ D v, equivalent to an EVD of L_left = D^{-1} L.
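A sketch of solving the relaxed problem with MATLAB's generalized eigensolver (L and D as above; kmeans assumes the Statistics Toolbox):
    [V, Lambda] = eig(full(L), full(D));       % generalized EVD: L * v = lambda * D * v
    [~, order] = sort(diag(Lambda), 'ascend');
    Vk = V(:, order(1:K));                     % the K smallest generalized eigenvectors
    idx = kmeans(Vk, K);                       % cluster the rows of the relaxed indicators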

Matrix Perturbation Justification. When the graph is ideally separable, i.e. it has multiple connected components, A and L have characteristic (piecewise-constant) eigenvectors. When it is not ideally separable but a sparse cut exists, A can be viewed as an ideally separable matrix plus a small perturbation. A small perturbation of the matrix entries leads to a small perturbation of the eigenvectors, so the eigenvectors are not too far from piecewise constant and are easy to separate with simple algorithms like K-Means.

Low-Rank Approximation. The similarity matrix A is generated by inner products in some unknown space that we want to recover. We want to minimize the reconstruction error:
min_{Y ∈ R^{N×d}} ||A - Y Y^T||_F^2.
This is the standard low-rank approximation problem, which leads to the EVD of A: Y = UΛ^{1/2}.

Spectral Embedding Techniques. See [4] for some pointers: MDS, ISOMAP, PCA, KPCA, LLE, LEmap, HEmap, SDE, MVE, SPE. The differences, as noted, lie mostly in the construction of A.

Bibliography
[1] Y. Bengio, J. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and M. Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. Advances in Neural Information Processing Systems, 16:177-184, 2004.
[2] C. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[3] M. Brand and K. Huang. A unifying theorem for spectral embedding and clustering. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.
[4] P. Hu. Spectral clustering survey. May 2012.
[5] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.

Thanks. Q/A. Some supplementary slides with details are attached.

SVD and EVD. Definition of the Singular Value Decomposition (SVD): X_{n×N} = U_{n×k} Σ_{k×k} V_{N×k}^T. Definition of the Eigenvalue Decomposition (EVD) of A = X^T X: A = UΛU^T. Relations: X^T X = V Σ^2 V^T and X X^T = U Σ^2 U^T.

Remarks: SVD requires U^T U = I, V^T V = I and σ_i ≥ 0 (Σ = diag(σ_1, ..., σ_k)); this guarantees the uniqueness of the solution. EVD has no such constraints: any U and Λ satisfying AU = UΛ are acceptable. The requirement U^T U = I is likewise imposed to guarantee a unique solution (e.g. in PCA); another benefit is the numerical stability of the subspace spanned by U, since an orthogonal layout is more error-resilient. The computation of the SVD is done via an EVD. Watch out for the terms and the objects they refer to.
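A quick numerical check of these relations on a small random matrix (a sketch):
    X = randn(5, 8);
    [U, S, V] = svd(X, 'econ');      % X = U * S * V'
    norm(X' * X - V * S^2 * V')      % ~0: the EVD of X' * X follows from the SVD
    norm(X * X' - U * S^2 * U')      % ~0: the EVD of X * X' follows from the SVD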

Out-of-Sample Embedding. Given a new data point x ∈ R^n that is not in X, how do we find its low-dimensional embedding y ∈ R^d? In PCA we have the principal axes U (X X^T = UΛU^T), so out-of-sample embedding is simple: y = U^T x. U_{n×d} is actually a compact representation of the knowledge. In KPCA and the different variants of SC, we operate on the similarity graph and have no such compact representation, so it is hard to compute the out-of-sample embedding explicitly. See [1] for some research on this.

Gaussian Kernel. The Gaussian kernel (let τ = 1/(2σ^2)):
k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)) = exp(-τ ||x_i - x_j||^2).
Use the Taylor expansion e^x = Σ_{k=0}^{∞} x^k / k!. Rewrite the kernel:
k(x_i, x_j) = exp(-τ (x_i - x_j)^T (x_i - x_j)) = exp(-τ x_i^T x_i) exp(-τ x_j^T x_j) exp(2τ x_i^T x_j).

Focus on the last factor: exp(2τ x_i^T x_j) = Σ_{k=0}^{∞} (1/k!) (2τ x_i^T x_j)^k. It is hard to write out the explicit form when x_i ∈ R^n with n > 1, so we demonstrate the case n = 1, where x_i and x_j are scalars: exp(2τ x_i x_j) = Σ_{k=0}^{∞} (1/k!) (2τ)^k x_i^k x_j^k = Σ_{k=0}^{∞} c(k) x_i^k x_j^k.

The feature vector is (infinite-dimensional):
φ(x) = exp(-τ x^2) [√c(0), √c(1) x, √c(2) x^2, ..., √c(k) x^k, ...].
Verify that k(x_i, x_j) = φ(x_i)^T φ(x_j). This shows that the Gaussian kernel implicitly maps 1-D data to an infinite-dimensional feature space.
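A numerical check of this identity for scalar inputs (a sketch; the infinite feature vector is truncated at kmax terms, which already matches the kernel to near machine precision for moderate inputs):
    tau = 0.5; xi = 0.7; xj = -0.3;
    kmax = 20; ks = 0:kmax;
    c = (2 * tau) .^ ks ./ factorial(ks);                 % series coefficients c(k)
    phi = @(x) exp(-tau * x^2) * (sqrt(c) .* x .^ ks);    % truncated feature vector
    k_exact = exp(-tau * (xi - xj)^2);
    k_approx = phi(xi) * phi(xj)';                        % inner product of feature vectors
    abs(k_exact - k_approx)                               % tiny truncation error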