Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA
Yoshua Bengio, Pascal Vincent, Jean-François Paiement
University of Montreal
Snowbird Learning Workshop, April 2003

Learning Modal Structures of the Distribution
Manifold learning and clustering = learning where the main high-density zones are.
Learning a transformation that reveals clusters and manifolds.
Cluster = zone of high density separated from other clusters by regions of low density.

Spectral Embedding Algorithms
Many learning algorithms, e.g. spectral clustering, kernel PCA, Local Linear Embedding (LLE), Isomap, Multi-Dimensional Scaling (MDS), Laplacian eigenmaps, have at their core the following (or its equivalent):
1. Start from data points x_1, ..., x_n.
2. Construct a neighborhood or similarity matrix M (with corresponding [possibly data-dependent] kernel K).
3. Normalize it (and make it symmetric), yielding M~ (with corresponding kernel K~).
4. Compute the m largest (equivalently, smallest) eigenvalues/eigenvectors of M~.
5. Embedding of x_i = the i-th elements of each of the m eigenvectors (possibly scaled using the eigenvalues).
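
A minimal numpy sketch of the generic recipe above (not code from the talk): a Gaussian affinity plays the role of K, divisive normalization gives M~, and the embedding is read off the principal eigenvectors. The function name spectral_embedding and the parameters sigma and m are illustrative choices.

    import numpy as np

    def spectral_embedding(X, sigma=1.0, m=2):
        """Generic spectral embedding sketch: Gaussian affinity, divisive
        normalization, then the m principal eigenvectors as coordinates."""
        # Steps 1-2: pairwise Gaussian affinities (Gram matrix M).
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        M = np.exp(-sq_dists / (2.0 * sigma ** 2))
        # Step 3: divisive (symmetric) normalization M_ij / sqrt(row_i * row_j).
        row_sums = M.sum(axis=1)
        M_tilde = M / np.sqrt(np.outer(row_sums, row_sums))
        # Step 4: eigendecomposition of the symmetric normalized Gram matrix.
        eigvals, eigvecs = np.linalg.eigh(M_tilde)     # ascending order
        order = np.argsort(-eigvals)[:m]               # m largest eigenvalues
        # Step 5: embedding of x_i = i-th entries of the m principal
        # eigenvectors, here scaled by the eigenvalues.
        return eigvecs[:, order] * eigvals[order]

    # Example: embed 100 random 3-D points into 2-D.
    X = np.random.RandomState(0).randn(100, 3)
    print(spectral_embedding(X, sigma=1.0, m=2).shape)   # (100, 2)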

Kernel PCA
Data is implicitly mapped to a feature space phi(x) such that K(x, y) = phi(x) . phi(y), and PCA is performed in that feature space.
Projecting points into a high-dimensional space may allow one to find a straight line along which they are almost aligned (if the basis, i.e. the kernel, is right).

Kernel PCA
Eigenvectors of the (generally infinite-dimensional) feature-space covariance matrix are v_k = Σ_i α_ki phi(x_i), where α_k is an eigenvector of the Gram matrix.
Projection of phi(x) on the k-th principal component = Σ_i α_ki K(x_i, x).
N.B. K needs to be centered: subtractive normalization (Schölkopf 96).
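
A short numpy sketch of kernel PCA as summarized above, assuming a Gaussian kernel for concreteness: double-center (subtractively normalize) the Gram matrix, take its eigenvectors alpha_k, and project the training points on the leading components. The names kernel_pca and gaussian_gram are illustrative.

    import numpy as np

    def gaussian_gram(X, Z, sigma=1.0):
        # K(x_i, z_j) = exp(-||x_i - z_j||^2 / (2 sigma^2))  (illustrative kernel)
        d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def kernel_pca(X, m=2, sigma=1.0):
        """Kernel PCA sketch: subtractive (centering) normalization of the
        Gram matrix, then projections on the m leading components."""
        n = X.shape[0]
        K = gaussian_gram(X, X, sigma)
        # Subtractive normalization = double centering of the Gram matrix.
        H = np.eye(n) - np.ones((n, n)) / n
        K_c = H @ K @ H
        ell, alpha = np.linalg.eigh(K_c)            # eigenvalues, ascending
        order = np.argsort(-ell)[:m]
        ell, alpha = ell[order], alpha[:, order]    # alpha_k: unit-norm eigenvectors
        # Projection of x_i on the unit-norm k-th principal component in
        # feature space is sqrt(ell_k) * alpha_{ki}.
        return alpha * np.sqrt(np.maximum(ell, 0.0))

    X = np.random.RandomState(1).randn(50, 5)
    print(kernel_pca(X, m=2).shape)   # (50, 2)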

Laplacian Eigenmaps (Belkin & Niyogi, 2002)
Gram matrix from the Laplace-Beltrami operator, which on finite data (neighborhood graph) gives the graph Laplacian.
Gaussian kernel, approximated by a k-nn adjacency matrix. Normalization: row average minus Gram matrix (the graph Laplacian, up to scaling).
The Laplace-Beltrami operator is justified as a smoothness regularizer on the manifold: the penalty ∫ ||∇f||² equals the eigenvalue for eigenfunctions of unit norm.
Successfully used for semi-supervised learning.
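
A rough sketch of the Laplacian eigenmaps recipe described above, under common assumptions (k-NN graph with Gaussian weights, unnormalized graph Laplacian, eigenvectors with the smallest nonzero eigenvalues); the parameters k, sigma and m are illustrative.

    import numpy as np

    def laplacian_eigenmaps(X, k=10, sigma=1.0, m=2):
        """Laplacian eigenmaps sketch: k-NN graph with Gaussian weights,
        graph Laplacian L = D - W, embedding from the eigenvectors with the
        smallest nonzero eigenvalues."""
        n = X.shape[0]
        d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        # k-NN adjacency (symmetrized) with Gaussian (heat-kernel) weights.
        W = np.zeros((n, n))
        nn = np.argsort(d2, axis=1)[:, 1:k + 1]          # skip self
        for i in range(n):
            W[i, nn[i]] = np.exp(-d2[i, nn[i]] / (2.0 * sigma ** 2))
        W = np.maximum(W, W.T)
        # Additive normalization: degree (row sum) minus affinity matrix.
        L = np.diag(W.sum(axis=1)) - W
        eigvals, eigvecs = np.linalg.eigh(L)
        # Skip the constant eigenvector (eigenvalue ~ 0); keep the next m.
        return eigvecs[:, 1:m + 1]

    X = np.random.RandomState(2).randn(80, 3)
    print(laplacian_eigenmaps(X).shape)   # (80, 2)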

Spectral Clustering
Normalize the kernel or Gram matrix divisively: M~_ij = M_ij / sqrt(Σ_k M_ik Σ_k M_jk).
Embedding of x_i = (v_1i, ..., v_mi), where v_k is the k-th principal eigenvector of the normalized Gram matrix.
Perform clustering on the embedded points (e.g. after normalizing them by their norm).
Weiss; Ng, Jordan, ...
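
A compact sketch of this recipe, with a plain k-means loop standing in for the clustering step; the kernel width, cluster count and toy data are illustrative choices.

    import numpy as np

    def spectral_clustering(X, n_clusters=2, sigma=1.0, n_iter=20, seed=0):
        """Sketch: divisive normalization, principal eigenvectors as the
        embedding, unit-norm projection, then a few k-means iterations."""
        d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        M = np.exp(-d2 / (2.0 * sigma ** 2))               # Gaussian affinities
        row = M.sum(axis=1)
        M_tilde = M / np.sqrt(np.outer(row, row))          # divisive normalization
        eigvals, eigvecs = np.linalg.eigh(M_tilde)
        E = eigvecs[:, np.argsort(-eigvals)[:n_clusters]]  # principal eigenvectors
        E = E / np.linalg.norm(E, axis=1, keepdims=True)   # normalize by the norm
        # Minimal k-means on the embedded points.
        rng = np.random.RandomState(seed)
        centers = E[rng.choice(len(E), n_clusters, replace=False)]
        for _ in range(n_iter):
            labels = np.argmin(((E[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
            centers = np.array([E[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(n_clusters)])
        return labels

    # Toy example: two well-separated blobs.
    rng = np.random.RandomState(6)
    X = np.vstack([rng.randn(40, 2) * 0.3, rng.randn(40, 2) * 0.3 + 3.0])
    print(spectral_clustering(X, n_clusters=2))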

Spectral Clustering (intuition)
The principal eigenfunctions approximate the kernel (= dot product) in the MSE sense.
On the unit sphere, the embeddings of points in the same cluster are almost colinear, and the embeddings of points in different clusters are almost orthogonal.
Points in the same cluster are mapped to points with a nearby angle, even for a non-blob cluster (global constraint = transitivity of "nearness").

Density-Dependent Hilbert Space
Define a Hilbert space with a density-dependent inner product <f, g>_p = ∫ f(x) g(x) p(x) dx, with density p.
A kernel function K defines a linear operator in that space: (K_p f)(x) = ∫ K(x, y) f(y) p(y) dy.

Eigenfunctions of a Kernel
Infinite-dimensional version of the eigenvectors of the Gram matrix: (K_p f_k)(x) = λ_k f_k(x) (some conditions on K are needed to obtain a discrete spectrum).
Convergence of the eigenvectors/eigenvalues of the Gram matrix built from data sampled from p to the eigenfunctions/eigenvalues of the linear operator with underlying density p was proven as n grows (Williams & Seeger, 2000).
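
A quick numerical illustration of the convergence statement above (a sketch, not a proof): the eigenvalues of the Gram matrix divided by n, for data sampled i.i.d. from a fixed density p (here a 2-D standard normal, with a Gaussian kernel), stabilize as n grows.

    import numpy as np

    def top_operator_eigenvalues(n, m=5, sigma=1.0, seed=0):
        """Top eigenvalues of the empirical kernel operator: eigenvalues of
        the Gram matrix divided by n, for samples from p = N(0, I)."""
        rng = np.random.RandomState(seed)
        X = rng.randn(n, 2)
        d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-d2 / (2.0 * sigma ** 2))
        return np.linalg.eigvalsh(K)[::-1][:m] / n

    print(top_operator_eigenvalues(200))
    print(top_operator_eigenvalues(800))   # roughly the same values as n grows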

Link between Spectral Clustering and Eigenfunctions
Equivalence between eigenvectors and eigenfunctions (and corresponding eigenvalues) when p is the empirical distribution.
Proposition 1: If we choose for p the empirical distribution of the data, then the spectral embedding obtained from M~ is equivalent to the values of the eigenfunctions of the normalized kernel K~ at the data points: v_ki = f_k(x_i) / √n.
Proof: come and see our poster!

Link between Kernel PCA and Eigenfunctions
Proposition 2: If we choose for p the empirical distribution of the data, then the kernel PCA projection on the k-th component is equivalent to scaled values of the eigenfunctions of the centered kernel K~: π_k(x_i) = √λ_k f_k(x_i).
Proof: come and see our poster!
Consequence: up to the choice of kernel, kernel normalization, and up to scaling by √λ_k, spectral clustering, Laplacian eigenmaps and kernel PCA give the same embedding. Isomap, MDS and LLE also give eigenfunctions, but from a different type of kernel.

From Embedding to General Mapping
Laplacian eigenmaps, spectral clustering, Isomap, LLE, and MDS only provide an embedding for the given data points.
Natural generalization to new points: consider these algorithms as learning eigenfunctions of K~, e.g. for p the empirical distribution. The eigenfunctions f_k provide a mapping for new points.
Data-dependent kernels (Isomap, LLE): need to compute K~(x, x_i) for a new point x without changing the embedding of the training points. Reasonable for Isomap, less clear it makes sense for LLE.
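
A sketch of such an out-of-sample mapping, assuming for simplicity that an un-normalized Gaussian kernel plays the role of K~: the empirical eigenfunction is evaluated at a new point from the training eigenvectors and kernel values, and it reproduces the eigenvector entries on the training points. The function names are illustrative.

    import numpy as np

    def gaussian_gram(X, Z, sigma=1.0):
        d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def fit_eigenfunctions(X, m=2, sigma=1.0):
        """Eigenvectors/eigenvalues of the (Gaussian) training Gram matrix."""
        K = gaussian_gram(X, X, sigma)
        ell, V = np.linalg.eigh(K)
        order = np.argsort(-ell)[:m]
        return X, sigma, ell[order], V[:, order]

    def embed_new_point(model, x_new):
        """Empirical eigenfunction at a new point: f_k(x) proportional to
        sum_i v_{ki} K(x, x_i); constants chosen so that on a training point
        this reproduces that point's eigenvector entries."""
        X, sigma, ell, V = model
        k_x = gaussian_gram(x_new[None, :], X, sigma)[0]   # K(x_new, x_i)
        return (k_x @ V) / ell

    X = np.random.RandomState(3).randn(60, 2)
    model = fit_eigenfunctions(X, m=2)
    print(embed_new_point(model, X[0]))   # equals row 0 of the eigenvectors...
    print(model[3][0])                    # ...up to numerical error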

Criterion to Learn Eigenfunctions
Proposition 3: Given the first m-1 eigenfunctions f_1, ..., f_{m-1} of a symmetric kernel K, the m-th one can be obtained by minimizing w.r.t. f and λ the expected value of (K(x, y) - Σ_{k<m} λ_k f_k(x) f_k(y) - λ f(x) f(y))² over p(x) p(y). Then we get f = f_m and λ = λ_m.
This helps understand what the eigenfunctions are doing (approximating the dot product K(x, y)) and provides a possible criterion for estimating the eigenfunctions when p is not an empirical distribution.
Kernels such as the Gaussian kernel and nearest-neighbor related kernels force the eigenfunctions to reconstruct K correctly only for nearby objects: in high dimension, don't trust the Euclidean distance between far objects.
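
A small numerical check of the reconstruction-error view above (illustrative, with p the empirical distribution): approximating the Gram matrix by m rank-one terms λ_k f_k(x) f_k(y), the top-m eigenpairs give a smaller mean squared reconstruction error than, e.g., dropping the leading pair.

    import numpy as np

    # Approximate K(x, y) by sum_k lambda_k f_k(x) f_k(y) with m terms; for
    # the empirical p this is a rank-m approximation of the Gram matrix.
    rng = np.random.RandomState(4)
    X = rng.randn(40, 2)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / 2.0)

    ell, V = np.linalg.eigh(K)
    order = np.argsort(-np.abs(ell))
    m = 3

    def reconstruction_mse(idx):
        K_hat = (V[:, idx] * ell[idx]) @ V[:, idx].T
        return np.mean((K - K_hat) ** 2)

    print(reconstruction_mse(order[:m]))        # top-m eigenpairs: smallest error
    print(reconstruction_mse(order[1:m + 1]))   # dropping the leading pair: larger error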

Using a Smooth Density to Define Eigenfunctions?
Use your best estimator of the density of the data, instead of the empirical distribution, for defining the eigenfunctions.
A constrained class of eigenfunctions, e.g. neural networks, can force the eigenfunctions to be smooth and not necessarily local.
Advantage? Better generalization away from the training points?
Advantage? Better scaling with n? (no Gram matrix, no eigenvectors)
Disadvantage? Optimization of the eigenfunctions may be more difficult?

Recovering the Density from the Eigenfunctions?
Visually, the eigenfunctions appear to capture the main characteristics of the density.
Can we obtain a better estimate of the density using the principal eigenfunctions? (Girolami 2001): truncating the expansion of the kernel in terms of its eigenfunctions.
Use ideas similar to (Teh & Roweis 2003) and other mixtures of factor analyzers: project back into input space, convolving with a model of the reconstruction error as noise.

Role of Kernel Normalization?
Subtractive normalization leads to kernel PCA: K~(x, y) = K(x, y) - E_x[K(x, y)] - E_y[K(x, y)] + E_x E_y[K(x, y)].
Thus when the corresponding kernel is expanded: the constant function is an eigenfunction, and the other eigenfunctions have zero mean and unit variance.
Double-centering normalization (MDS, Isomap): as above (based on the relation between dot product and distance).
What can be said about the divisive normalization? It seems better at clustering.
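
A tiny numerical check of the subtractive-normalization claims above (illustrative): after double centering a Gaussian Gram matrix, the constant vector falls in the null space, so the leading eigenvector (and hence the corresponding empirical eigenfunction) has zero mean.

    import numpy as np

    # After double centering, the constant vector is in the null space of the
    # Gram matrix, so eigenvectors with nonzero eigenvalue have zero mean.
    rng = np.random.RandomState(5)
    X = rng.randn(30, 2)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / 2.0)

    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K_c = H @ K @ H

    ell, V = np.linalg.eigh(K_c)
    print(np.allclose(K_c @ np.ones(n), 0.0))        # constant vector -> eigenvalue 0
    print(abs(V[:, np.argmax(ell)].mean()) < 1e-8)   # leading eigenvector: zero mean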

Multi-layer Learning of Similarity and Density?
The learned eigenfunctions capture salient features of the distribution: abstractions such as clusters and manifolds.
Old AI (and connectionist) idea: build high-level abstractions on top of lower-level abstractions.
Empirical density + local Euclidean similarity → improved density model + farther-reaching notion of similarity.

Density-Adjusted Similarity and Kernel
[Figure: three points A, B, C; we want A and B to be closer than A and C under the new distance.]
Define a density-adjusted distance as a geodesic w.r.t. a Riemannian metric, with a metric tensor that penalizes low density.
SEE OTHER POSTER (Vincent & Bengio)

Density-Adjusted Similarity and Kernel
[Figure: panels comparing the original spirals data, the Gaussian kernel spectral embedding, and the density-adjusted embeddings.]

Conclusions
Many unsupervised learning algorithms (kernel PCA, spectral clustering, Laplacian eigenmaps, MDS, LLE, Isomap) are linked: they compute eigenfunctions of a normalized kernel.
The embedding can be generalized to a mapping applicable to new points.
The eigenfunctions seem to capture salient features of the distribution by minimizing the kernel reconstruction error.
Many questions remain open: finding eigenfunctions with a smooth density? recovering an explicit density function from the eigenfunctions? meaning of the various kernel normalizations? multi-layer learning? density-adjusted similarity (see Vincent & Bengio poster)?

Proposition 3
The principal eigenfunction of the linear operator corresponding to kernel K is the (or a, if there are repeated eigenvalues) norm-1 function f that minimizes the reconstruction error E_{x,y}[(K(x, y) - λ f(x) f(y))²].

Proof of Proposition 1
Proposition 1: If we choose for p the empirical distribution of the data, then the spectral embedding from M~ is equivalent to the values of the eigenfunctions of the normalized kernel K~: v_ki = f_k(x_i) / √n.
(Simplified) proof: As shown in Proposition 3, finding the function f and scalar λ minimizing E_{x,y}[(K~(x, y) - λ f(x) f(y))²] s.t. ||f||_p = 1 yields a solution that satisfies (K~_p f)(x) = λ f(x), with λ the (possibly repeated) maximum-norm eigenvalue.

Proof of Proposition 1 (continued)
With p empirical, the above becomes (1/n) Σ_i K~(x, x_i) f(x_i) = λ f(x).
Write v_i = f(x_i)/√n and M~_ij = K~(x_i, x_j); then M~ v = n λ v, and we obtain for the principal eigenvector the first coordinate of the spectral embedding.
For the other eigenvalues, consider the residual kernel K~(x, y) - λ_1 f_1(x) f_1(y) and recursively apply the same reasoning to obtain f_2, λ_2, etc. Q.E.D.

Proof of Proposition 2
Proposition 2: If we choose for p the empirical distribution of the data, then the kernel PCA projection is equivalent to scaled values of the eigenfunctions of K~: π_k(x_i) = √λ_k f_k(x_i).
(Simplified) proof: Apply the linear operator ∫ phi(x) (.) p(x) dx on both sides of the eigenfunction equation ∫ K~(x, y) f_k(y) p(y) dy = λ_k f_k(x); changing the order of integrals on the left-hand side gives C w_k = λ_k w_k, with w_k = ∫ phi(y) f_k(y) p(y) dy and C the feature-space covariance matrix. Plug in the empirical p:

Proof of Proposition 2 (continued)
which contains the elements of the covariance matrix C = (1/n) Σ_i phi(x_i) phi(x_i)^T, thus yielding C w_k = λ_k w_k with w_k = (1/n) Σ_i f_k(x_i) phi(x_i) = Σ_i α_ki phi(x_i), where α_ki = f_k(x_i)/n. So w_k is a kernel PCA principal direction, where α_k (which takes its values from the eigenfunction f_k at the data points) is also, up to scaling, the k-th eigenvector of M~.

Proof of Proposition 2 (end)
The PCA projection of phi(x_i) on the normalized direction w_k / ||w_k|| is √λ_k f_k(x_i). Q.E.D.

Proof of Proposition 3
Proposition 3: Given the first m-1 eigenfunctions of a symmetric kernel K, the m-th one can be obtained by minimizing w.r.t. f and λ the expected value of (K(x, y) - Σ_{k<m} λ_k f_k(x) f_k(y) - λ f(x) f(y))² over p(x) p(y). Then we get f = f_m and λ = λ_m.
Proof: Reconstruction error using the approximation of K by the first m-1 terms plus λ f(x) f(y):
J = E_{x,y}[(K(x, y) - Σ_{k<m} λ_k f_k(x) f_k(y) - λ f(x) f(y))²],
where the (f_k, λ_k), k < m, are the first (eigenfunction, eigenvalue) pairs in order of decreasing absolute value of λ_k.

Proof of Proposition 3 (continued)
Minimization of J w.r.t. λ gives an expression for λ in terms of f (eq. 1). Substituting it back into J, using eq. 1, shows that λ² ||f||⁴ should be maximized.

Proof of Proposition 3 (continued)
Take the derivative of J w.r.t. f and set it equal to zero (eq. 2). Using eq. 1, and using the recursive assumption that f and the f_k are orthogonal for k < m, write the application of the operator to f in terms of the eigenfunctions: (K_p f)(x) = Σ_k λ_k <f_k, f>_p f_k(x).

Proof of Proposition 3 (end)
We obtain λ f(x) = Σ_{k ≥ m} λ_k <f_k, f>_p f_k(x). Applying Parseval's theorem to obtain the norm on both sides: λ² = Σ_{k ≥ m} λ_k² <f_k, f>_p² (for ||f||_p = 1). Since the λ_k are sorted by decreasing |λ_k| and the <f_k, f>_p² sum to at most 1, the maximum is obtained when <f_m, f>_p² = 1, i.e., for distinct eigenvalues, f = f_m and λ = λ_m. Q.E.D.