Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA
Yoshua Bengio, Pascal Vincent, Jean-François Paiement
University of Montreal
Snowbird Learning Workshop, April 2, 2003
Learning Modal Structures of the Distribution
Manifold learning and clustering = learning where the main high-density zones are. Goal: learning a transformation that reveals clusters and manifolds. Cluster = zone of high density, separated from other clusters by regions of low density.
Spectral Embedding Algorithms
Many learning algorithms (e.g. spectral clustering, kernel PCA, Local Linear Embedding (LLE), Isomap, Multi-Dimensional Scaling (MDS), Laplacian eigenmaps) have at their core the following (or its equivalent):
1. Start from data points x_1, ..., x_n.
2. Construct a neighborhood or similarity matrix M (with corresponding, possibly data-dependent, kernel K).
3. Normalize it (and make it symmetric), yielding M̃ (with corresponding kernel K̃).
4. Compute the m largest (equivalently, smallest) eigenvalues/eigenvectors of M̃.
5. Embedding of x_i = i-th elements of each of the m eigenvectors (possibly scaled using the eigenvalues).
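The five steps above can be sketched in NumPy. This is a minimal illustrative implementation, not the talk's own code: the Gaussian kernel, the divisive normalization, the bandwidth `sigma` and the toy data are all assumed choices.

```python
import numpy as np

def spectral_embedding(X, m=2, sigma=1.0):
    """Generic spectral embedding: kernel -> normalize -> top eigenvectors."""
    # Steps 1-2: similarity (Gram) matrix from a Gaussian kernel.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    M = np.exp(-sq / (2 * sigma ** 2))
    # Step 3: divisive (symmetric) normalization, as in spectral clustering.
    d = M.sum(1)
    M_tilde = M / np.sqrt(np.outer(d, d))
    # Step 4: eigendecomposition; eigh returns eigenvalues in ascending order.
    w, V = np.linalg.eigh(M_tilde)
    # Step 5: embedding of x_i = i-th component of each of the m top eigenvectors.
    return V[:, ::-1][:, :m]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
Y = spectral_embedding(X, m=2)
print(Y.shape)  # (30, 2)
```

Swapping the kernel, the normalization in step 3, and the scaling in step 5 recovers the different algorithms listed above.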
Kernel PCA
Data is implicitly mapped to a feature space φ(x) such that K(x, y) = φ(x) · φ(y). PCA is performed in feature space. Projecting points into a high-dimensional space might allow us to find a straight line along which they are almost aligned (if the basis, i.e. the kernel, is right).
Kernel PCA
Eigenvectors of the (generally infinite) feature-space covariance matrix C = (1/n) Σ_i φ(x_i) φ(x_i)ᵀ are of the form w_k = Σ_i α_{k,i} φ(x_i), where α_k is derived from an eigenvector of the Gram matrix. Projection on the k-th p.c. = w_k · φ(x) = Σ_i α_{k,i} K(x_i, x). N.B. φ needs to be centered: subtractive normalization (Schölkopf 96).
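A sketch of this procedure: center the Gram matrix (the subtractive normalization), take its top eigenvectors, and project. The Gaussian kernel, `sigma`, and the toy data are illustrative assumptions, not from the talk.

```python
import numpy as np

def kernel_pca(X, m=2, sigma=1.0):
    """Kernel PCA: center the Gram matrix, project on the top components."""
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))
    # Subtractive normalization (double centering) = centering phi(x_i).
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    w, V = np.linalg.eigh(Kc)
    w, V = w[::-1][:m], V[:, ::-1][:, :m]   # top m eigenvalues/eigenvectors
    # alpha_k = V[:, k] / sqrt(w[k]); projection of x_i = sum_j alpha_{k,j} Kc[i, j].
    return Kc @ (V / np.sqrt(w))

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
P = kernel_pca(X, m=2)
print(P.shape)  # (40, 2)
```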
Laplacian Eigenmaps
Gram matrix from a Gaussian kernel, approximated by a k-nn adjacency matrix (neighborhood graph). Normalization: diagonal of row averages minus the Gram matrix, which gives the graph Laplacian; on finite data this approximates the Laplace-Beltrami operator on the manifold. Justified as a smoothness regularizer on the manifold: the smoothness functional equals the eigenvalue of the Laplace-Beltrami operator for its eigenfunctions. Successfully used for semi-supervised learning. (Belkin & Niyogi, 2002)
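The graph-Laplacian construction can be sketched as follows: build a symmetrized k-nn adjacency matrix, form L = D − W, and embed with the smallest nontrivial eigenvectors. The value of `k` and the data are illustrative assumptions.

```python
import numpy as np

def laplacian_eigenmaps(X, m=2, k=5):
    """Embed with the smallest nontrivial eigenvectors of the graph Laplacian."""
    n = len(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # k-nn adjacency matrix (neighborhood graph), symmetrized.
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sq[i])[1:k + 1]   # skip self (distance 0)
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)
    # Graph Laplacian: diagonal of row sums minus the adjacency matrix.
    L = np.diag(W.sum(1)) - W
    w, V = np.linalg.eigh(L)
    # Skip the constant eigenvector (eigenvalue 0).
    return V[:, 1:m + 1]

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
Y = laplacian_eigenmaps(X, m=2, k=5)
print(Y.shape)  # (50, 2)
```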
Spectral Clustering
Normalize the kernel or Gram matrix divisively: M̃_ij = M_ij / √(D_i D_j), with D_i = Σ_j M_ij. Embedding of x_i = (v_{1,i}, ..., v_{m,i}), where v_k is the k-th eigenvector of the normalized Gram matrix. Perform clustering on the embedded points (e.g. after normalizing them by their norm). (Weiss; Ng, Jordan & Weiss; ...)
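A sketch of this recipe (divisive normalization, top eigenvectors, unit-norm rows, then clustering), with a tiny k-means in place of a library clusterer. The Gaussian kernel, `sigma`, the farthest-first initialization, and the two-blob data are illustrative assumptions.

```python
import numpy as np

def spectral_cluster(X, n_clusters=2, sigma=0.5):
    """Divisive normalization + top eigenvectors + k-means on the unit sphere."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    M = np.exp(-sq / (2 * sigma ** 2))
    d = M.sum(1)
    M_tilde = M / np.sqrt(np.outer(d, d))             # divisive normalization
    _, V = np.linalg.eigh(M_tilde)
    Y = V[:, -n_clusters:]                            # top eigenvectors
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)  # normalize rows to unit norm
    # Tiny k-means with farthest-first initialization.
    centers = [Y[0]]
    for _ in range(1, n_clusters):
        dists = np.min([((Y - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(Y[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(20):
        labels = np.argmin(((Y[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if (labels == c).any():
                centers[c] = Y[labels == c].mean(0)
    return labels

# Two well-separated blobs: the two clusters should be recovered.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels = spectral_cluster(X)
```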
Spectral Clustering
The m principal eigenfunctions approximate the kernel (= dot product) in the MSE sense: K̃(x, y) large → embeddings almost collinear; K̃(x, y) ≈ 0 → embeddings almost orthogonal. Points in the same cluster are thus mapped (on the unit sphere) to points with a near-zero angle between them, even for a non-blob cluster (global constraint = transitivity of nearness).
Density-Dependent Hilbert Space
Define a Hilbert space with a density-dependent inner product ⟨f, g⟩_p = ∫ f(x) g(x) p(x) dx, with p a density. A kernel function K defines a linear operator K_p in that space: (K_p f)(x) = ∫ K(x, y) f(y) p(y) dy.
Eigenfunctions of a Kernel
Infinite-dimensional version of the eigenvectors of the Gram matrix: K_p f_k = λ_k f_k (some conditions on K are needed to obtain a discrete spectrum). Convergence of the eigenvectors/eigenvalues of the Gram matrix, for data sampled from p, to the eigenfunctions/eigenvalues of the linear operator with underlying density p was proven as n → ∞ (Williams & Seeger, 2000).
Link between Spectral Clustering and Eigenfunctions
Equivalence between eigenvectors and eigenfunctions (and corresponding eigenvalues) when p is the empirical distribution.
Proposition 1: If we choose for p the empirical distribution of the data, then the spectral embedding of x_i is equivalent to the values of the eigenfunctions of the normalized kernel K̃ at x_i: f_k(x_i) = √n v_{k,i}. Proof: come and see our poster!
Link between Kernel PCA and Eigenfunctions
Proposition 2: If we choose for p the empirical distribution of the data, then the kernel PCA projection is equivalent to scaled values of the eigenfunctions of K̃: projection on the k-th component = √λ_k f_k(x). Proof: come and see our poster!
Consequence: up to the choice of kernel and kernel normalization, and up to scaling, spectral clustering, Laplacian eigenmaps and kernel PCA give the same embedding. Isomap, MDS and LLE also give eigenfunctions, but from a different type of kernel.
From Embedding to General Mapping
Laplacian eigenmaps, spectral clustering, Isomap, LLE, and MDS only provide an embedding for the given data points. Natural generalization to new points: consider these algorithms as learning eigenfunctions of K̃. The eigenfunctions provide a mapping for new points, e.g. for the empirical p: f_k(x) = (√n / ℓ_k) Σ_i v_{k,i} K̃(x, x_i). Data-dependent kernels (Isomap, LLE): need to compute K̃(x, x_i) without changing K̃ on the training points. Reasonable for Isomap, less clear that it makes sense for LLE.
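The out-of-sample mapping for the empirical case can be sketched numerically: f_k(x) = (√n / ℓ_k) Σ_i v_{k,i} K̃(x, x_i) is defined for any x, and at a training point it reproduces √n v_{k,i}. For brevity this sketch uses a raw Gaussian kernel and omits the kernel normalization step; the data and `sigma` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 2))
sigma = 1.0
n = len(X)

def K(a, b):  # Gaussian kernel (kernel normalization omitted for brevity)
    return np.exp(-((a - b) ** 2).sum(-1) / (2 * sigma ** 2))

M = K(X[:, None, :], X[None, :, :])          # Gram matrix
ell, V = np.linalg.eigh(M)
ell, V = ell[::-1], V[:, ::-1]               # decreasing eigenvalues

def f(x, k):
    """Eigenfunction value at an arbitrary point x (out-of-sample formula)."""
    return np.sqrt(n) / ell[k] * (V[:, k] * K(x, X)).sum()

# At a training point, the formula recovers sqrt(n) * v_{k,i}.
print(np.isclose(f(X[3], 0), np.sqrt(n) * V[3, 0]))  # True
```

Since M v_k = ℓ_k v_k, evaluating f at x_i gives (√n / ℓ_k) · ℓ_k v_{k,i} = √n v_{k,i}, consistent with Proposition 1.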
Criterion to Learn Eigenfunctions
Proposition 3: Given the first m − 1 eigenfunctions of a symmetric kernel K̃, the m-th one can be obtained by minimizing w.r.t. f and λ the expected value over p of (K̃(x, y) − Σ_{k<m} λ_k f_k(x) f_k(y) − λ f(x) f(y))². Then we get f = f_m and λ = λ_m.
This helps us understand what the eigenfunctions are doing (approximating the dot product K̃) and provides a possible criterion for estimating the eigenfunctions when p is not an empirical distribution. Kernels such as the Gaussian kernel and nearest-neighbor-related kernels force the eigenfunctions to reconstruct K̃ correctly only for nearby objects: in high dimension, don't trust the Euclidean distance between far objects.
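The "approximating the kernel" view can be checked numerically: the rank-m expansion Σ_{k≤m} λ_k f_k(x) f_k(y) reconstructs the kernel with an MSE that shrinks as m grows. The toy data, Gaussian kernel and chosen values of m are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
M = np.exp(-sq / 2.0)                    # Gram matrix
w, V = np.linalg.eigh(M)
w, V = w[::-1], V[:, ::-1]               # decreasing eigenvalues

def mse(m):
    # Rank-m reconstruction from the top (eigenvalue, eigenvector) pairs.
    approx = (V[:, :m] * w[:m]) @ V[:, :m].T
    return ((M - approx) ** 2).mean()

errs = [mse(m) for m in (1, 5, 10, 30)]
print(errs[0] > errs[1] > errs[2] >= errs[3])  # True: error decreases with m
```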
Using a Smooth Density to Define Eigenfunctions?
Use your best estimator of the density of the data, instead of the empirical distribution, for defining the eigenfunctions. A constrained class of eigenfunctions, e.g. neural networks, can force the eigenfunctions to be smooth and not necessarily local.
Advantage? Better generalization away from training points?
Advantage? Better scaling with n? (no Gram matrix, no eigenvectors)
Disadvantage? Optimization of the eigenfunctions may be more difficult?
Recovering the Density from the Eigenfunctions?
Visually, the eigenfunctions appear to capture the main characteristics of the density. Can we obtain a better estimate of the density using the principal eigenfunctions? (Girolami 2001): truncating the expansion. Use ideas similar to (Teh & Roweis 2003) and other mixtures of factor analyzers: project back into input space, convolving with a model of reconstruction error as noise.
Role of Kernel Normalization?
Subtractive normalization yields kernel PCA; the corresponding kernel is
K̃(x, y) = K(x, y) − E_x'[K(x', y)] − E_y'[K(x, y')] + E_{x',y'}[K(x', y')].
Thus when the kernel is expanded: the constant function is an eigenfunction, and the other eigenfunctions have zero mean (and unit variance). The double-centering normalization (MDS, Isomap), based on the relation between dot product and distance, is the same as above. What can be said about the divisive normalization? It seems better at clustering.
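The dot-product/distance relation behind double centering can be sketched as follows: with H the centering matrix, −½ H D² H applied to the squared-distance matrix of centered data recovers the Gram matrix of dot products (the classical MDS identity). The toy data are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 3))
X = X - X.mean(0)                       # center the data
n = len(X)

D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
H = np.eye(n) - np.ones((n, n)) / n     # centering matrix (subtractive normalization)
G = -0.5 * H @ D2 @ H                   # double centering

print(np.allclose(G, X @ X.T))  # True: recovers the dot-product Gram matrix
```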
Multi-layer Learning of Similarity and Density?
The learned eigenfunctions capture salient features of the distribution: abstractions such as clusters and manifolds. Old AI (and connectionist) idea: build high-level abstractions on top of lower-level abstractions. Empirical density + local Euclidean similarity → improved density model + farther-reaching notion of similarity.
Density-Adjusted Similarity and Kernel
(Figure: three points A, B, C, with A and B lying on the same high-density structure.) Want A and B to be closer than A and C. Define a density-adjusted distance as a geodesic w.r.t. a Riemannian metric, with a metric tensor that penalizes low density. SEE OTHER POSTER (Vincent & Bengio)
Density-Adjusted Similarity and Kernel
(Figure: four panels on the two-spirals data: the original spirals; the spectral embedding with a Gaussian kernel; and two density-adjusted embeddings.)
Conclusions
Many unsupervised learning algorithms (kernel PCA, spectral clustering, Laplacian eigenmaps, MDS, LLE, Isomap) are linked: they compute eigenfunctions of a normalized kernel. The embedding can be generalized to a mapping applicable to new points. The eigenfunctions seem to capture salient features of the distribution by minimizing kernel reconstruction error. Many open questions: finding eigenfunctions with a smooth p? Recovering an explicit density function? Meaning of the various kernel normalizations? Multi-layer learning? Density-adjusted similarity (see Vincent & Bengio poster)?
Proposition 3
The principal eigenfunction of the linear operator corresponding to kernel K̃ is the (or a, in case of repeated eigenvalues) norm-1 function f that minimizes the reconstruction error E_{x,y}[(K̃(x, y) − λ f(x) f(y))²].
Proof of Proposition 1
Proposition 1: If we choose for p the empirical distribution of the data, then the spectral embedding of x_i is equivalent to the values of the eigenfunctions of the normalized kernel K̃: f_k(x_i) = √n v_{k,i}.
(Simplified) proof: As shown in Proposition 3, finding the function f and scalar λ minimizing E_{x,y}[(K̃(x, y) − λ f(x) f(y))²] s.t. ‖f‖ = 1 yields a solution that satisfies K_p f = λ f, with λ the (possibly repeated) maximum-norm eigenvalue.
Proof of Proposition 1 (cont.)
With the empirical p, the above becomes (1/n) Σ_j K̃(x_i, x_j) f(x_j) = λ f(x_i). Write v_i = f(x_i)/√n and ℓ = nλ; then M̃ v = ℓ v, and we obtain for the principal eigenvector f(x_i) = √n v_i. For the other eigenvalues, consider the residual kernel K̃(x, y) − λ f(x) f(y) and recursively apply the same reasoning to obtain the second eigenvector/eigenfunction pair, etc. Q.E.D.
Proof of Proposition 2
Proposition 2: If we choose for p the empirical distribution of the data, then the kernel PCA projection is equivalent to scaled values of the eigenfunctions of K̃.
(Simplified) proof: Write K̃(x, y) = φ(x) · φ(y) and apply the linear operator on both sides of the eigenfunction equation ∫ K̃(x, y) f_k(y) p(y) dy = λ_k f_k(x); changing the order of integrals on the left-hand side and plugging in the empirical p, the left-hand side becomes (1/n) Σ_i φ(x)·φ(x_i) f_k(x_i).
Proof of Proposition 2 (cont.)
This contains elements of the feature-space covariance matrix C = (1/n) Σ_i φ(x_i) φ(x_i)ᵀ, thus yielding C w_k = λ_k w_k, where w_k = (1/n) Σ_i f_k(x_i) φ(x_i). So w_k, where f_k takes its values √n v_{k,i} at the data points, is also the k-th eigenvector of C.
Proof of Proposition 2 (cont.)
The PCA projection of φ(x) on the unit vector w_k / ‖w_k‖ is φ(x) · w_k / ‖w_k‖ = λ_k f_k(x) / √λ_k = √λ_k f_k(x), i.e. scaled values of the eigenfunctions. Q.E.D.
Proof of Proposition 3
Proposition 3: Given the first m − 1 eigenfunctions of a symmetric kernel K̃, the m-th one can be obtained by minimizing the expected reconstruction error over p w.r.t. f and λ. Then we get f = f_m and λ = λ_m.
Proof: Reconstruction error using the approximation K̃(x, y) ≈ Σ_{k<m} λ_k f_k(x) f_k(y) + λ f(x) f(y):
J = E_{x,y}[(K̃(x, y) − Σ_{k<m} λ_k f_k(x) f_k(y) − λ f(x) f(y))²],
where (f_k, λ_k) are the first m − 1 (eigenfunction, eigenvalue) pairs, in order of decreasing absolute value of λ_k.
Proof of Proposition 3 (cont.)
Minimization of J w.r.t. λ gives, with R the residual kernel K̃(x, y) − Σ_{k<m} λ_k f_k(x) f_k(y):
λ = ⟨f, R_p f⟩ / ‖f‖⁴   (1)
Using eq. 1, J reduces to a constant minus ⟨f, R_p f⟩² / ‖f‖⁴, so ⟨f, R_p f⟩² / ‖f‖⁴ should be maximized.
Proof of Proposition 3 (cont.)
Take the derivative of this criterion w.r.t. f and set it equal to zero:
R_p f = (⟨f, R_p f⟩ / ‖f‖²) f   (2)
Using the recursive assumption that f is orthogonal to f_k for k < m, we have R_p f = K_p f. Write the application of K_p in terms of the eigenfunctions: K_p f = Σ_k λ_k ⟨f_k, f⟩ f_k.
Proof of Proposition 3 (cont.)
Applying Parseval's theorem to obtain the norm on both sides, and writing f = Σ_k ⟨f_k, f⟩ f_k, the criterion becomes (Σ_k λ_k ⟨f_k, f⟩²)² / (Σ_k ⟨f_k, f⟩²)². If the λ_k are distinct, and since ⟨f_k, f⟩ = 0 for k < m, the maximum is obtained when ⟨f_m, f⟩² = 1 and ⟨f_k, f⟩ = 0 for k > m. We then get f = f_m, ‖f‖ = 1, and λ = λ_m. Q.E.D.