Nonlinear Dimensionality Reduction


Hong Chang, Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012).

Outline
1. Kernel PCA
2. Isomap
3. Locally Linear Embedding
4. Laplacian Eigenmap

Centering in Feature Space. Suppose we use a kernel function $\hat{k}(\cdot,\cdot)$ which induces a nonlinear feature map $\hat{\phi}$ from the input space $\mathcal{X}$ to some feature space $\mathcal{F}$. The images of the $N$ points in $\mathcal{F}$ are $\hat{\phi}(x^{(1)}), \ldots, \hat{\phi}(x^{(N)})$, which in general are not centered. The corresponding kernel matrix $\hat{K}$ is
$\hat{K} = [\hat{K}_{ij}]_{N \times N} = [\hat{k}(x^{(i)}, x^{(j)})]_{N \times N} = [\langle \hat{\phi}(x^{(i)}), \hat{\phi}(x^{(j)}) \rangle]_{N \times N}.$
We want to translate the coordinate system of $\mathcal{F}$ so that the new origin is at the sample mean of the $N$ points, i.e.,
$\phi(x^{(i)}) = \hat{\phi}(x^{(i)}) - \frac{1}{N} \sum_{j=1}^{N} \hat{\phi}(x^{(j)}).$

Centering in Feature Space (2). As a result, we also convert the kernel matrix $\hat{K}$ to $K$:
$K = [K_{ij}]_{N \times N} = [k(x^{(i)}, x^{(j)})]_{N \times N} = [\langle \phi(x^{(i)}), \phi(x^{(j)}) \rangle]_{N \times N}.$
Let
$Z = [\phi(x^{(1)}), \ldots, \phi(x^{(N)})]^T, \quad \hat{Z} = [\hat{\phi}(x^{(1)}), \ldots, \hat{\phi}(x^{(N)})]^T, \quad H = I - \frac{1}{N} \mathbf{1}\mathbf{1}^T,$
where $\mathbf{1}$ is a column vector of ones. We can write $Z = H\hat{Z}$. Hence,
$K = ZZ^T = H\hat{Z}\hat{Z}^T H = H\hat{K}H.$
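As a minimal numerical sketch of this centering step (assuming NumPy; the RBF kernel, the toy data, and all variable names here are illustrative choices, not part of the original derivation):

import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise squared Euclidean distances, then the Gaussian (RBF) kernel.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
N = 100
X = rng.normal(size=(N, 3))           # toy data in R^3
K_hat = rbf_kernel(X)                 # uncentered kernel matrix \hat{K}
H = np.eye(N) - np.ones((N, N)) / N   # centering matrix H = I - (1/N) 1 1^T
K = H @ K_hat @ H                     # centered kernel matrix K = H \hat{K} H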

Eigenvalue Equation Based on Covariance Matrix. The covariance matrix of the $N$ centered points in $\mathcal{F}$ is given by
$C = \frac{1}{N} \sum_{i=1}^{N} \phi(x^{(i)}) \phi(x^{(i)})^T. \quad (1)$
If $\mathcal{F}$ is infinite-dimensional (e.g., $\mathcal{F}$ is a Hilbert space), we can think of $\phi(x^{(i)}) \phi(x^{(i)})^T$ as a linear operator on $\mathcal{F}$, mapping $z \mapsto \phi(x^{(i)}) \langle \phi(x^{(i)}), z \rangle$. To perform PCA in $\mathcal{F}$, we solve the following eigenvalue equation for the eigenvalues $\lambda_k$ and eigenvectors $v_k$ ($k = 1, \ldots, N$) of $C$:
$Cv = \lambda v. \quad (2)$

Eigenvalue Equation Based on Covariance Matrix (2). Substituting (1) into (2) gives an equivalent form of (2):
$\lambda v = \frac{1}{N} \sum_{i=1}^{N} \phi(x^{(i)}) \phi(x^{(i)})^T v = \frac{1}{N} \sum_{i=1}^{N} \langle \phi(x^{(i)}), v \rangle \, \phi(x^{(i)}).$
If $\lambda \neq 0$, then we have the following dual eigenvector representation:
$v = \sum_{i=1}^{N} \frac{\langle \phi(x^{(i)}), v \rangle}{\lambda N} \, \phi(x^{(i)}) = \sum_{i=1}^{N} \alpha_i \phi(x^{(i)}) \quad (3)$
for some coefficients $\alpha_i$ ($i = 1, \ldots, N$). Thus, all eigenvector solutions $v$ with nonzero eigenvalues $\lambda \neq 0$ must lie in the span of $\phi(x^{(1)}), \ldots, \phi(x^{(N)})$. Consequently, (2) can be written as the following set of equations:
$\langle \phi(x^{(k)}), Cv \rangle = \lambda \langle \phi(x^{(k)}), v \rangle, \quad k = 1, \ldots, N. \quad (4)$

Eigenvalue Equation Based on Kernel Matrix. Substituting (1) and (3) into (4), we have
$\frac{1}{N} \sum_{j=1}^{N} K_{kj} \sum_{i=1}^{N} \alpha_i K_{ji} = \lambda \sum_{i=1}^{N} \alpha_i K_{ki}, \quad k = 1, \ldots, N,$
or in matrix form:
$K^2 \alpha = N \lambda K \alpha, \quad (5)$
where $K$ is the kernel matrix (or Gram matrix) and $\alpha = (\alpha_1, \ldots, \alpha_N)^T$. If $K$ is invertible, (5) can be expressed as the following (dual) eigenvalue equation:
$K\alpha = \xi \alpha, \quad (6)$
where $\xi = N\lambda$.

Normalization of Eigenvectors. Let $\xi_1 \geq \cdots \geq \xi_N \geq 0$ denote the $N$ eigenvalues of $K$ and $\alpha_1, \ldots, \alpha_N$ be the corresponding eigenvectors. Suppose $\xi_p$ is the smallest nonzero eigenvalue for some $1 \leq p \leq N$. We normalize $\alpha_1, \ldots, \alpha_p$ such that
$\langle v_k, v_k \rangle = 1, \quad k = 1, \ldots, p. \quad (7)$

Normalization of Eigenvectors (2). Substituting (3) into (7), we have
$\sum_{i,j=1}^{N} \alpha_{ik} \alpha_{jk} K_{ij} = 1, \quad \langle \alpha_k, K\alpha_k \rangle = 1, \quad \langle \alpha_k, \xi_k \alpha_k \rangle = 1, \quad \langle \alpha_k, \alpha_k \rangle = \frac{1}{\xi_k},$
for all $k = 1, \ldots, p$.

Normalization of Eigenvectors (3). Suppose the eigenvectors obtained for (6) are such that $\|\alpha_k\| = 1$, $k = 1, \ldots, p$. Then we should modify (3) to
$v_k = \frac{1}{\sqrt{\xi_k}} \sum_{i=1}^{N} \alpha_{ik} \phi(x^{(i)})$
in order to satisfy (7).

Embedding of New Data Points. For any input $x$, the $k$th principal component $y_k$ of $\phi(x)$ is given by
$y_k = \langle v_k, \phi(x) \rangle = \frac{1}{\sqrt{\xi_k}} \sum_{i=1}^{N} \alpha_{ik} \langle \phi(x^{(i)}), \phi(x) \rangle = \frac{1}{\sqrt{\xi_k}} \sum_{i=1}^{N} \alpha_{ik} k(x^{(i)}, x).$
If $x = x^{(j)}$ for some $1 \leq j \leq N$, i.e., $x$ is one of the $N$ original points, then the $k$th principal component $y_{jk}$ of $\phi(x^{(j)})$ becomes
$y_{jk} = \langle v_k, \phi(x^{(j)}) \rangle = \frac{1}{\sqrt{\xi_k}} \sum_{i=1}^{N} \alpha_{ik} K_{ij} = \frac{1}{\sqrt{\xi_k}} (K\alpha_k)_j = \frac{1}{\sqrt{\xi_k}} (\xi_k \alpha_k)_j = \sqrt{\xi_k} \, \alpha_{jk},$
which is proportional to the expansion coefficient $\alpha_{jk}$.

Embedding of New Data Points (2). Let $Y = [y_{jk}]_{N \times p}$. Then we can express $Y$ as
$Y = [\alpha_1, \ldots, \alpha_p] \, \mathrm{diag}(\sqrt{\xi_1}, \ldots, \sqrt{\xi_p}).$
Note that $K = YY^T$.
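Collecting the previous slides, a sketch of the whole kernel PCA computation, including the projection of a new point (NumPy only; the out-of-sample centering of the new point's kernel values follows from the same sample-mean translation introduced earlier, and $p$ is assumed to retain only components with strictly positive eigenvalues):

import numpy as np

def kernel_pca(K_hat, p):
    # K_hat: uncentered N x N kernel matrix; p: number of retained components.
    N = K_hat.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    K = H @ K_hat @ H                          # centered kernel matrix
    xi, A = np.linalg.eigh(K)                  # eigenvalues in ascending order
    xi, A = xi[::-1], A[:, ::-1]               # reorder: xi_1 >= ... >= xi_N
    xi_p, A_p = xi[:p], A[:, :p]               # unit-norm eigenvectors alpha_k
    Y = A_p * np.sqrt(np.maximum(xi_p, 0.0))   # Y = [alpha_1, ..., alpha_p] diag(sqrt(xi_k))
    return Y, A_p, xi_p

def project_new_point(k_new, K_hat, A_p, xi_p):
    # k_new: vector of \hat{k}(x^(i), x) for a new input x.
    # Center the new point's kernel values consistently with K = H \hat{K} H.
    k_c = k_new - K_hat.mean(axis=1) - k_new.mean() + K_hat.mean()
    return (A_p / np.sqrt(xi_p)).T @ k_c       # y_k = (1/sqrt(xi_k)) sum_i alpha_ik k(x^(i), x)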

Geodesic Distance Euclidean distance in the high-dimensional input space cannot reflect the true low-dimensional geometry of the manifold.

Geodesic Distance (2). The geodesic ("shortest path") distance should be used instead. For neighboring points, the Euclidean distance in the input space provides a good approximation of the geodesic distance. For faraway points, the geodesic distance can be approximated by adding up a sequence of short hops between neighboring points, each based on Euclidean distance.

Isomap Algorithm. Isomap is a nonlinear dimensionality reduction (NLDR) method that is based on metric MDS but seeks to preserve the intrinsic geometry of the data as captured in the geodesic distances between data points. The three steps of the Isomap algorithm:
1. Construct the neighborhood graph.
2. Compute the shortest paths.
3. Construct the low-dimensional embedding.

Isomap Algorithm (2). Given the distance $d(i, j)$ between all pairs of the $N$ points in $X$:
1. Construct the neighborhood graph: define a graph $G$ over all $N$ data points by connecting points $i$ and $j$ if their distance $d(i, j)$ is smaller than $\epsilon$ ($\epsilon$-Isomap) or if $i$ is one of the $K$ nearest neighbors of $j$ ($K$-Isomap). Set the edge lengths equal to $d(i, j)$.
2. Compute the shortest paths: initialize $d_G(i, j) = d(i, j)$ if $i$ and $j$ are linked by an edge and $d_G(i, j) = \infty$ otherwise. For each $k = 1, \ldots, N$, replace all entries $d_G(i, j)$ by $\min(d_G(i, j), d_G(i, k) + d_G(k, j))$. Then $D_G = [d_G(i, j)]$ contains the shortest-path distances between all pairs of points in $G$.
3. Construct the low-dimensional embedding (by MDS).
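A compact sketch of these three steps (SciPy's shortest_path routine stands in for the Floyd/Dijkstra step; the neighborhood size K and target dimension p are illustrative, and the neighborhood graph is assumed to be connected):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import shortest_path

def isomap(X, K=10, p=2):
    N = X.shape[0]
    D = squareform(pdist(X))                      # pairwise Euclidean distances d(i, j)
    # Step 1: K-nearest-neighbor graph with edge lengths d(i, j) (inf = no edge).
    G = np.full((N, N), np.inf)
    nn = np.argsort(D, axis=1)[:, 1:K + 1]
    for i in range(N):
        G[i, nn[i]] = D[i, nn[i]]
    G = np.minimum(G, G.T)                        # symmetrize the neighborhood graph
    # Step 2: geodesic distances as shortest paths in the graph (Dijkstra).
    D_G = shortest_path(G, method='D', directed=False)
    # Step 3: classical MDS on the geodesic distance matrix.
    H = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * H @ (D_G ** 2) @ H                 # double-centered squared distances
    w, V = np.linalg.eigh(B)
    w, V = w[::-1][:p], V[:, ::-1][:, :p]         # p leading eigenpairs
    return V * np.sqrt(np.maximum(w, 0.0)), D_G   # N x p embedding and geodesic distances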

Isomap Algorithm (3). There are two computational bottlenecks in the Isomap algorithm:
Shortest-path computation: Floyd's algorithm is $O(N^3)$; Dijkstra's algorithm (with Fibonacci heaps) is $O(KN^2 \log N)$, where $K$ is the neighborhood size.
Eigendecomposition: $O(N^3)$.

Example: Face Images (figures omitted).

Intrinsic Dimensionality of Data Manifolds In practice, some of the eigenvalues may be so close to zero that they can be ignored. As with PCA and MDS, the true (intrinsic) dimensionality of the data can be estimated from the decrease in error as the dimensionality of the low-dimensional space increases. For nonlinear manifolds, PCA and MDS tend to overestimate the intrinsic dimensionality. The intrinsic degrees of freedom provide a simple way to analyze and manipulate high-dimensional data.

Intrinsic Dimensionality of Data Manifolds (2). The residual variance of PCA, MDS, and Isomap on four data sets: (A) face images (MDS, Isomap), (B) Swiss roll data (MDS, Isomap), (C) hand images (MDS, Isomap), (D) handwritten "2"s (PCA, MDS, Isomap). (Figure omitted.)
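The residual variance used in the Isomap paper can be computed from the geodesic distances and the embedding distances; a hedged sketch (it reuses the D_G matrix returned by the isomap sketch above and takes residual variance to be 1 - R^2, with R the correlation coefficient between the two sets of pairwise distances):

import numpy as np
from scipy.spatial.distance import pdist

def residual_variance(D_G, Y):
    # D_G: N x N geodesic distance matrix; Y: N x p embedding.
    d_geo = D_G[np.triu_indices_from(D_G, k=1)]   # upper-triangular geodesic distances
    d_emb = pdist(Y)                              # Euclidean distances in the embedding
    r = np.corrcoef(d_geo, d_emb)[0, 1]
    return 1.0 - r ** 2

# Plotting residual_variance against increasing p and looking for the "elbow"
# gives a rough estimate of the intrinsic dimensionality.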

Global vs. Local Embedding Methods Metric MDS and Isomap compute embeddings that seek to preserve inter-point straight-line (Euclidean) distances or geodesic distances between all pairs of points. Hence they are global methods. Both locally linear embedding (LLE) and Laplacian eigenmap try to recover the global nonlinear structure from local geometric properties. They are local methods. Overlapping local neighborhoods, collectively analyzed, can provide information about the global geometry.

Computational Advantages of LLE Like PCA and MDS, LLE is simple to implement and its optimization problems do not have local minima. Although only linear algebraic methods are used, the constraint that points are only reconstructed from neighbors based on locally linear fits can result in highly nonlinear embeddings. Its main step involves a sparse eigenvalue problem that scales up better with large, high-dimensional data sets.

Problem Setting. Let $X = \{x^{(1)}, \ldots, x^{(N)}\}$ be a set of $N$ points in a high-dimensional input space $\mathbb{R}^D$. The $N$ data points are assumed to lie on or near a nonlinear manifold of intrinsic dimensionality $p < D$ (typically $p \ll D$). Provided that sufficient data are available by sampling well from the manifold, the goal of LLE is to find a low-dimensional embedding of $X$ by mapping the $D$-dimensional data into a single global coordinate system in $\mathbb{R}^p$. Let us denote the set of $N$ points in the embedding space $\mathbb{R}^p$ by $Y = \{y^{(1)}, \ldots, y^{(N)}\}$.

LLE Algorithm
1. For each data point $x^{(i)} \in X$: find the set $N_i$ of $K$ nearest neighbors of $x^{(i)}$, and compute the reconstruction weights of the neighbors that minimize the error of reconstructing $x^{(i)}$.
2. Compute the low-dimensional embedding $Y$ that best preserves the local geometry represented by the reconstruction weights.

Locally Linear Fitting. If the manifold is sufficiently densely sampled, then each point and its neighbors are expected to lie on or close to a locally linear patch of the manifold. The local geometry of a patch is characterized by the reconstruction weights with which a data point is reconstructed from its neighbors. Let $w_i$ denote the $K$-dimensional vector of local reconstruction weights for data point $x^{(i)}$. (One may also consider the full $N$-dimensional weight vector by constraining the terms $w_{ij}$ for $x^{(j)} \notin N_i$ to 0.)

Constrained Least Squares Problem. Optimality is achieved by minimizing the local reconstruction error function for each data point $x^{(i)}$:
$E_i(w_i) = \Big\| x^{(i)} - \sum_{x^{(j)} \in N_i} w_{ij} x^{(j)} \Big\|^2,$
which is the squared distance between $x^{(i)}$ and its reconstruction, subject to the constraints
$\sum_{x^{(j)} \in N_i} w_{ij} = \mathbf{1}^T w_i = 1$
and $w_{ij} = 0$ for any $x^{(j)} \notin N_i$. This is a constrained least squares problem that can be solved using the classical method of Lagrange multipliers.

Constrained Least Squares Problem (2). The error function $E_i(w_i)$ can be rewritten as follows:
$E_i(w_i) = \Big[ \sum_{x^{(j)} \in N_i} w_{ij} (x^{(i)} - x^{(j)}) \Big]^T \Big[ \sum_{x^{(k)} \in N_i} w_{ik} (x^{(i)} - x^{(k)}) \Big] = \sum_{x^{(j)}, x^{(k)} \in N_i} w_{ij} w_{ik} (x^{(i)} - x^{(j)})^T (x^{(i)} - x^{(k)}) = w_i^T G_i w_i,$
where $G_i = [(x^{(i)} - x^{(j)})^T (x^{(i)} - x^{(k)})]_{K \times K}$ is the local Gram matrix for $x^{(i)}$. To minimize $E_i(w_i)$ subject to the constraint $\mathbf{1}^T w_i = 1$, we define the Lagrangian function with multiplier $\lambda$:
$L(w_i, \lambda) = w_i^T G_i w_i + \lambda (1 - \mathbf{1}^T w_i).$

Constrained Least Squares Problem (3). The partial derivatives of $L(w_i, \lambda)$ w.r.t. $w_i$ and $\lambda$ are
$\frac{\partial L}{\partial w_i} = 2 G_i w_i - \lambda \mathbf{1}, \qquad \frac{\partial L}{\partial \lambda} = 1 - \mathbf{1}^T w_i.$
Setting the above equations to 0, we finally get (if $G_i^{-1}$ exists)
$w_i = \frac{G_i^{-1} \mathbf{1}}{\mathbf{1}^T G_i^{-1} \mathbf{1}}.$

A More Efficient Method. Instead of inverting $G_i$, a more efficient way is to solve the linear system of equations $G_i \hat{w}_i = \mathbf{1}$ for $\hat{w}_i$ and then compute $w_i$ as
$w_i = \frac{\hat{w}_i}{\mathbf{1}^T \hat{w}_i},$
so that the equality constraint $\mathbf{1}^T w_i = 1$ is satisfied. Based on the reconstruction weights computed for all $N$ data points, we form a weight matrix $W = [w_{ij}]_{N \times N}$ that will be used in the next step.
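A sketch of the per-point weight computation using this linear-system formulation (the small regularization term is a common practical safeguard for the case K > D, where G_i is singular; it is not part of the slide's derivation):

import numpy as np

def lle_weights(X, i, neighbors, reg=1e-3):
    # X: N x D data matrix; i: index of the point; neighbors: indices of its K nearest neighbors.
    Z = X[neighbors] - X[i]                              # rows are x^(j) - x^(i), j in N_i
    G = Z @ Z.T                                          # local Gram matrix G_i (the sign cancels in the products)
    G = G + reg * np.trace(G) * np.eye(len(neighbors))   # regularize in case G_i is singular
    w_hat = np.linalg.solve(G, np.ones(len(neighbors)))  # solve G_i w_hat = 1
    return w_hat / w_hat.sum()                           # rescale so that 1^T w_i = 1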

Low-Dimensional Embedding. Given the weight matrix $W$, the best low-dimensional embedding $Y$ can be computed by minimizing the following error function w.r.t. $Y = [y^{(1)}, \ldots, y^{(N)}]^T \in \mathbb{R}^{N \times p}$:
$J(Y) = \sum_{i=1}^{N} \Big\| y^{(i)} - \sum_{x^{(j)} \in N_i} w_{ij} y^{(j)} \Big\|^2.$
Let $b_i$ be the $i$th column of the identity matrix $I$ and $w_i$ be the $i$th column of $W^T$ (i.e., $w_i$ is the weight vector for $x^{(i)}$).

Optimization. We can rewrite $J(Y)$ as
$J(Y) = \sum_{i=1}^{N} \| Y^T b_i - Y^T w_i \|^2 = \sum_{i=1}^{N} \| Y^T (b_i - w_i) \|^2 = \| Y^T (I - W^T) \|_F^2 = \mathrm{Tr}[Y^T (I - W)^T (I - W) Y] = \mathrm{Tr}[Y^T M Y],$
where $M = (I - W)^T (I - W)$ is a symmetric positive semi-definite matrix (since $x^T M x = \|(I - W)x\|^2 \geq 0$ for all $x$). $M$ is sparse for reasonable choices of the neighborhood size $K$ (i.e., $K \ll N$).

Invariance to Translation, Rotation and Scaling. Note that the error function $J(Y)$ is invariant to translation, rotation and scaling of the vectors $y^{(i)}$ in the low-dimensional embedding $Y$. To remove the translational degree of freedom, we require the vectors $y^{(i)}$ to have zero mean, i.e.,
$\sum_{i=1}^{N} y^{(i)} = Y^T \mathbf{1} = 0.$
To remove the degrees of freedom due to rotation and scaling, we constrain the vectors $y^{(i)}$ to have covariance matrix equal to the identity matrix, i.e.,
$\frac{1}{N} \sum_{i=1}^{N} y^{(i)} (y^{(i)})^T = \frac{1}{N} Y^T Y = I.$

Eigenvalue Problem. The optimization problem can thus be stated as
$\min_{Y} \; \mathrm{Tr}(Y^T M Y) \quad \text{subject to} \quad Y^T \mathbf{1} = 0 \;\text{ and }\; Y^T Y = N I.$
If we express $Y$ as $[y_1, \ldots, y_p]$, then the optimization problem can also be expressed as
$\min_{Y} \; \sum_{k=1}^{p} y_k^T M y_k \quad \text{subject to} \quad y_k^T \mathbf{1} = 0 \;\text{ and }\; y_k^T y_k = N \;\text{ for } k = 1, \ldots, p.$

Eigenvalue Problem (2). Thus the solution to the optimization problem can be obtained by solving the following eigenvalue problem
$M y = \lambda y$
for the eigenvectors $y_k$ ($k = 1, \ldots, p$) that correspond to the $p$ smallest nonzero eigenvalues. The eigenvectors are normalized such that $y_k^T y_k = N$ for all $k = 1, \ldots, p$.
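A sketch of this embedding step, assuming the per-point weights have been assembled into a full N x N matrix W (row i holds the weights of x^(i)'s neighbors and zeros elsewhere); the bottom eigenvector of M is the constant vector with eigenvalue close to 0 and is discarded:

import numpy as np

def lle_embedding(W, p):
    N = W.shape[0]
    M = (np.eye(N) - W).T @ (np.eye(N) - W)   # M = (I - W)^T (I - W)
    evals, evecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    Y = evecs[:, 1:p + 1] * np.sqrt(N)        # skip the constant eigenvector; scale so y_k^T y_k = N
    return Y                                  # N x p embedding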

Algorithm Overview Let x (1),..., x (N) be N points in R D. Like Isomap, the Laplacian eigenmap algorithm first constructs a weighted graph with N nodes representing the neighborhood relationships. It then computes an eigenmap based on the graph.

Edge Creation. An edge is created between nodes $i$ and $j$ if $x^{(i)}$ and $x^{(j)}$ are close to each other. Two possible criteria for edge creation:
$\epsilon$-neighborhood: nodes $i$ and $j$ are connected by an edge if $\| x^{(i)} - x^{(j)} \| < \epsilon$ for some $\epsilon \in \mathbb{R}^+$.
$K$ nearest neighbors: nodes $i$ and $j$ are connected by an edge if $x^{(i)}$ is among the $K$ nearest neighbors of $x^{(j)}$ or $x^{(j)}$ is among the $K$ nearest neighbors of $x^{(i)}$.

Edge Weighting. Two common variations for edge weighting:
Heat kernel:
$w_{ij} = \begin{cases} \exp\!\big(-\| x^{(i)} - x^{(j)} \|^2 / \sigma^2\big) & \text{if nodes } i \text{ and } j \text{ are connected} \\ 0 & \text{otherwise} \end{cases}$
for some $\sigma^2 \in \mathbb{R}^+$.
Binary weights:
$w_{ij} = \begin{cases} 1 & \text{if nodes } i \text{ and } j \text{ are connected} \\ 0 & \text{otherwise.} \end{cases}$
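A sketch that combines the edge-creation and edge-weighting slides: a K-nearest-neighbor graph with heat-kernel weights (the choices of K and sigma^2 are illustrative tuning parameters):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def heat_kernel_weights(X, K=10, sigma2=1.0):
    N = X.shape[0]
    D2 = squareform(pdist(X, 'sqeuclidean'))       # squared Euclidean distances
    W = np.zeros((N, N))
    nn = np.argsort(D2, axis=1)[:, 1:K + 1]        # K nearest neighbors of each point
    for i in range(N):
        W[i, nn[i]] = np.exp(-D2[i, nn[i]] / sigma2)
    return np.maximum(W, W.T)                      # connect i and j if either is a K-NN of the other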

Construction of Eigenmap. If the graph constructed above is not connected, then the following procedure is applied to each connected component separately. We first consider the special case which finds a 1-dimensional embedding, and then generalize it to the general $p$-dimensional case for $p > 1$. Let $y = (y^{(1)}, \ldots, y^{(N)})^T$ denote the 1-dimensional embedding. The objective function for minimization is given by
$\sum_{i,j=1}^{N} (y^{(i)} - y^{(j)})^2 \, w_{ij}.$

Construction of Eigenmap (2). Up to a constant factor of $\frac{1}{2}$, which does not affect the minimizer, we can rewrite the objective function as
$\frac{1}{2} \sum_{i,j=1}^{N} (y^{(i)} - y^{(j)})^2 w_{ij} = \frac{1}{2} \sum_{i,j=1}^{N} (y^{(i)})^2 w_{ij} + \frac{1}{2} \sum_{i,j=1}^{N} (y^{(j)})^2 w_{ij} - \sum_{i,j=1}^{N} y^{(i)} y^{(j)} w_{ij} = \sum_{i=1}^{N} (y^{(i)})^2 d_{ii} - \sum_{i,j=1}^{N} y^{(i)} y^{(j)} w_{ij} = y^T (D - W) y = y^T L y,$
where $d_{ii} = \sum_j w_{ij}$, $D = \mathrm{diag}(d_{11}, \ldots, d_{NN})$, and $L = D - W$ is the graph Laplacian.
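A small numerical check of this identity, on illustrative random symmetric weights:

import numpy as np

rng = np.random.default_rng(0)
N = 6
W = rng.random((N, N))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)                       # symmetric weights, no self-loops
y = rng.normal(size=N)
D = np.diag(W.sum(axis=1))                     # d_ii = sum_j w_ij
L = D - W                                      # graph Laplacian
lhs = 0.5 * np.sum(W * (y[:, None] - y[None, :]) ** 2)
rhs = y @ L @ y
assert np.isclose(lhs, rhs)                    # (1/2) sum_ij (y_i - y_j)^2 w_ij = y^T L y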

Scale Invariance. To remove the arbitrary scaling factor in the embedding, we enforce the constraint $y^T D y = 1$ (the larger $d_{ii}$ is, the more important is the corresponding node $i$). The optimization problem can thus be restated as
$\min_{y} \; y^T L y \quad \text{subject to} \quad y^T D y = 1, \qquad \text{or equivalently} \qquad \min_{y} \; \frac{y^T L y}{y^T D y}.$

Generalized Eigenvalue Problem. This corresponds to solving the following eigenvalue problem
$(D^{-1} L) y = \lambda y,$
or the corresponding generalized eigenvalue problem
$L y = \lambda D y,$
for the smallest eigenvalue $\lambda$ and the corresponding eigenvector $y$. Note that $\lambda = 0$ and $y = c\mathbf{1}$ for any $c \neq 0$ form a solution, since
$c L \mathbf{1} = c (D - W) \mathbf{1} = \mathbf{0} = 0 \cdot D \mathbf{1}.$

Generalized Eigenvalue Problem (2). To eliminate such cases, we modify the minimization problem to
$\min_{y} \; y^T L y \quad \text{subject to} \quad y^T D y = 1, \; y^T D \mathbf{1} = 0.$
Note that if $D = I$, then $y^T D \mathbf{1} = 0$ is equivalent to centering. Finally, we can conclude that the solution is the eigenvector $y$ of the generalized eigenvalue problem
$L y = \lambda D y$
corresponding to the smallest nonzero eigenvalue $\lambda$. Normalization of $y$ is performed such that $y^T D y = 1$.

Construction of Eigenmap for $p > 1$. Let the $p$-dimensional embedding be denoted by the $N \times p$ matrix $Y = [y_1, \ldots, y_p] = [y^{(1)}, \ldots, y^{(N)}]^T$. Note that $y^{(i)}$ is the $p$-dimensional representation of $x^{(i)}$ in the embedding space. The objective function for minimization is given by
$\sum_{i,j=1}^{N} \| y^{(i)} - y^{(j)} \|^2 \, w_{ij}.$

Construction of Eigenmap for $p > 1$ (2). The minimization problem can be stated as
$\min_{Y} \; \mathrm{Tr}(Y^T L Y) \quad \text{subject to} \quad Y^T D Y = I, \; Y^T D \mathbf{1} = 0.$
Solution: the eigenvectors $y_k$ ($k = 1, \ldots, p$) of the generalized eigenvalue problem $L y = \lambda D y$ corresponding to the $p$ smallest nonzero eigenvalues give the solution to the optimization problem above. The eigenvectors are normalized such that $y_k^T D y_k = 1$ for all $k = 1, \ldots, p$.
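A sketch of this final step with SciPy's generalized symmetric eigensolver, which returns eigenvectors that are already D-orthonormal (so y_k^T D y_k = 1); it assumes the graph is connected and that W comes from a construction like the heat-kernel sketch above:

import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(W, p):
    D = np.diag(W.sum(axis=1))     # degree matrix
    L = D - W                      # graph Laplacian
    evals, evecs = eigh(L, D)      # generalized problem L y = lambda D y, eigenvalues ascending
    Y = evecs[:, 1:p + 1]          # skip the trivial constant eigenvector (eigenvalue 0)
    return Y                       # N x p embedding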

And Beyond
Robust versions of dimensionality reduction
Out-of-sample extensions for LLE, Isomap, MDS
A kernel view of embedding methods
Probabilistic views
Supervised and semi-supervised extensions
Applications: super-resolution, image recognition, and many, many others...

Main References
Kernel PCA: [SSM98]
Isomap: [Ten98] [TdL00] [BST+02]
Locally Linear Embedding: [RS00] [SR03]
Laplacian Eigenmap: [BN02] [BN03]

M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 585–591. MIT Press, Cambridge, MA, USA, 2002.
M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
M. Balasubramanian, E.L. Schwartz, J.B. Tenenbaum, V. de Silva, and J.C. Langford. The Isomap algorithm and topological stability. Science, 295(5552):7a, 2002.
S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

L.K. Saul and S.T. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119–155, 2003.
B. Schölkopf, A.J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
J.B. Tenenbaum. Mapping a manifold of perceptual observations. In M.I. Jordan, M.J. Kearns, and S.A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 682–688. MIT Press, 1998.