Nonlinear Dimensionality Reduction


Hong Chang, Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012).

Outline
1. Kernel PCA
2. Isomap
3. Locally Linear Embedding
4. Laplacian Eigenmap

Centering in Feature Space. Suppose we use a kernel function $\hat{k}(\cdot,\cdot)$ which induces a nonlinear feature map $\hat{\phi}$ from the input space $\mathcal{X}$ to some feature space $\mathcal{F}$. The images of the $N$ points in $\mathcal{F}$ are $\hat{\phi}(x^{(1)}), \ldots, \hat{\phi}(x^{(N)})$, which in general are not centered. The corresponding kernel matrix $\hat{K}$ is
$\hat{K} = [\hat{K}_{ij}]_{N \times N} = [\hat{k}(x^{(i)}, x^{(j)})]_{N \times N} = [\langle \hat{\phi}(x^{(i)}), \hat{\phi}(x^{(j)}) \rangle]_{N \times N}.$
We want to translate the coordinate system of $\mathcal{F}$ so that the new origin is at the sample mean of the $N$ points, i.e.,
$\phi(x^{(i)}) = \hat{\phi}(x^{(i)}) - \frac{1}{N} \sum_{j=1}^{N} \hat{\phi}(x^{(j)}).$

Centering in Feature Space (2). As a result, we also convert the kernel matrix $\hat{K}$ to $K$:
$K = [K_{ij}]_{N \times N} = [k(x^{(i)}, x^{(j)})]_{N \times N} = [\langle \phi(x^{(i)}), \phi(x^{(j)}) \rangle]_{N \times N}.$
Let
$Z = [\phi(x^{(1)}), \ldots, \phi(x^{(N)})]^T, \quad \hat{Z} = [\hat{\phi}(x^{(1)}), \ldots, \hat{\phi}(x^{(N)})]^T, \quad H = I - \frac{1}{N} \mathbf{1}\mathbf{1}^T,$
where $\mathbf{1}$ is a column vector of ones. We can write $Z = H\hat{Z}$. Hence,
$K = ZZ^T = H\hat{Z}\hat{Z}^T H = H\hat{K}H.$
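As a minimal numerical sketch of this centering step (assuming NumPy; the RBF kernel, the toy data, and all variable names here are illustrative choices, not part of the original derivation):

import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise squared Euclidean distances, then the Gaussian (RBF) kernel.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
N = 100
X = rng.normal(size=(N, 3))           # toy data in R^3
K_hat = rbf_kernel(X)                 # uncentered kernel matrix \hat{K}
H = np.eye(N) - np.ones((N, N)) / N   # centering matrix H = I - (1/N) 1 1^T
K = H @ K_hat @ H                     # centered kernel matrix K = H \hat{K} H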

Eigenvalue Equation Based on Covariance Matrix. The covariance matrix of the $N$ centered points in $\mathcal{F}$ is given by
$C = \frac{1}{N} \sum_{i=1}^{N} \phi(x^{(i)}) \phi(x^{(i)})^T. \quad (1)$
If $\mathcal{F}$ is infinite-dimensional (e.g., $\mathcal{F}$ is a Hilbert space), we can think of $\phi(x^{(i)}) \phi(x^{(i)})^T$ as a linear operator on $\mathcal{F}$, mapping $z \mapsto \phi(x^{(i)}) \langle \phi(x^{(i)}), z \rangle$. To perform PCA in $\mathcal{F}$, we solve the following eigenvalue equation for the eigenvalues $\lambda_k$ and eigenvectors $v_k$ ($k = 1, \ldots, N$) of $C$:
$Cv = \lambda v. \quad (2)$

Eigenvalue Equation Based on Covariance Matrix (2). Substituting (1) into (2) gives an equivalent form of (2):
$\lambda v = \frac{1}{N} \sum_{i=1}^{N} \phi(x^{(i)}) \phi(x^{(i)})^T v = \frac{1}{N} \sum_{i=1}^{N} \langle \phi(x^{(i)}), v \rangle \, \phi(x^{(i)}).$
If $\lambda \neq 0$, then we have the following dual eigenvector representation:
$v = \sum_{i=1}^{N} \frac{\langle \phi(x^{(i)}), v \rangle}{\lambda N} \, \phi(x^{(i)}) = \sum_{i=1}^{N} \alpha_i \phi(x^{(i)}) \quad (3)$
for some coefficients $\alpha_i$ ($i = 1, \ldots, N$). Thus, all eigenvector solutions $v$ with nonzero eigenvalues $\lambda \neq 0$ must lie in the span of $\phi(x^{(1)}), \ldots, \phi(x^{(N)})$. Consequently, (2) can be written as the following set of equations:
$\langle \phi(x^{(k)}), Cv \rangle = \lambda \langle \phi(x^{(k)}), v \rangle, \quad k = 1, \ldots, N. \quad (4)$

Eigenvalue Equation Based on Kernel Matrix. Substituting (1) and (3) into (4), we have
$\frac{1}{N} \sum_{j=1}^{N} K_{kj} \sum_{i=1}^{N} \alpha_i K_{ji} = \lambda \sum_{i=1}^{N} \alpha_i K_{ki}, \quad k = 1, \ldots, N,$
or in matrix form:
$K^2 \alpha = N \lambda K \alpha, \quad (5)$
where $K$ is the kernel matrix (or Gram matrix) and $\alpha = (\alpha_1, \ldots, \alpha_N)^T$. If $K$ is invertible, (5) can be expressed as the following (dual) eigenvalue equation:
$K\alpha = \xi \alpha, \quad (6)$
where $\xi = N\lambda$.

Normalization of Eigenvectors. Let $\xi_1 \geq \cdots \geq \xi_N \geq 0$ denote the $N$ eigenvalues of $K$ and $\alpha_1, \ldots, \alpha_N$ be the corresponding eigenvectors. Suppose $\xi_p$ is the smallest nonzero eigenvalue for some $1 \leq p \leq N$. We normalize $\alpha_1, \ldots, \alpha_p$ such that
$\langle v_k, v_k \rangle = 1, \quad k = 1, \ldots, p. \quad (7)$

Normalization of Eigenvectors (2). Substituting (3) into (7), we have
$\sum_{i,j=1}^{N} \alpha_{ik} \alpha_{jk} K_{ij} = 1, \quad \langle \alpha_k, K\alpha_k \rangle = 1, \quad \langle \alpha_k, \xi_k \alpha_k \rangle = 1, \quad \langle \alpha_k, \alpha_k \rangle = \frac{1}{\xi_k},$
for all $k = 1, \ldots, p$.

Normalization of Eigenvectors (3). Suppose the eigenvectors obtained for (6) are such that $\|\alpha_k\| = 1$, $k = 1, \ldots, p$. Then we should modify (3) to
$v_k = \frac{1}{\sqrt{\xi_k}} \sum_{i=1}^{N} \alpha_{ik} \phi(x^{(i)})$
in order to satisfy (7).

Embedding of New Data Points. For any input $x$, the $k$th principal component $y_k$ of $\phi(x)$ is given by
$y_k = \langle v_k, \phi(x) \rangle = \frac{1}{\sqrt{\xi_k}} \sum_{i=1}^{N} \alpha_{ik} \langle \phi(x^{(i)}), \phi(x) \rangle = \frac{1}{\sqrt{\xi_k}} \sum_{i=1}^{N} \alpha_{ik} k(x^{(i)}, x).$
If $x = x^{(j)}$ for some $1 \leq j \leq N$, i.e., $x$ is one of the $N$ original points, then the $k$th principal component $y_{jk}$ of $\phi(x^{(j)})$ becomes
$y_{jk} = \langle v_k, \phi(x^{(j)}) \rangle = \frac{1}{\sqrt{\xi_k}} \sum_{i=1}^{N} \alpha_{ik} K_{ij} = \frac{1}{\sqrt{\xi_k}} (K\alpha_k)_j = \frac{1}{\sqrt{\xi_k}} (\xi_k \alpha_k)_j = \sqrt{\xi_k} \, \alpha_{jk},$
which is proportional to the expansion coefficient $\alpha_{jk}$.

Embedding of New Data Points (2). Let $Y = [y_{jk}]_{N \times p}$. Then we can express $Y$ as
$Y = [\alpha_1, \ldots, \alpha_p] \, \mathrm{diag}(\sqrt{\xi_1}, \ldots, \sqrt{\xi_p}).$
Note that $K = YY^T$.
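Collecting the previous slides, a sketch of the whole kernel PCA computation, including the projection of a new point (NumPy only; the out-of-sample centering of the new point's kernel values follows from the same sample-mean translation introduced earlier, and $p$ is assumed to retain only components with strictly positive eigenvalues):

import numpy as np

def kernel_pca(K_hat, p):
    # K_hat: uncentered N x N kernel matrix; p: number of retained components.
    N = K_hat.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    K = H @ K_hat @ H                          # centered kernel matrix
    xi, A = np.linalg.eigh(K)                  # eigenvalues in ascending order
    xi, A = xi[::-1], A[:, ::-1]               # reorder: xi_1 >= ... >= xi_N
    xi_p, A_p = xi[:p], A[:, :p]               # unit-norm eigenvectors alpha_k
    Y = A_p * np.sqrt(np.maximum(xi_p, 0.0))   # Y = [alpha_1, ..., alpha_p] diag(sqrt(xi_k))
    return Y, A_p, xi_p

def project_new_point(k_new, K_hat, A_p, xi_p):
    # k_new: vector of \hat{k}(x^(i), x) for a new input x.
    # Center the new point's kernel values consistently with K = H \hat{K} H.
    k_c = k_new - K_hat.mean(axis=1) - k_new.mean() + K_hat.mean()
    return (A_p / np.sqrt(xi_p)).T @ k_c       # y_k = (1/sqrt(xi_k)) sum_i alpha_ik k(x^(i), x)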

Geodesic Distance Euclidean distance in the high-dimensional input space cannot reflect the true low-dimensional geometry of the manifold.

Geodesic Distance (2). The geodesic ("shortest path") distance should be used instead. For neighboring points, the Euclidean distance in the input space provides a good approximation of the geodesic distance. For faraway points, the geodesic distance can be approximated by adding up a sequence of short hops between neighboring points, each based on Euclidean distance.

Isomap Algorithm. Isomap is a nonlinear dimensionality reduction (NLDR) method that is based on metric MDS but seeks to preserve the intrinsic geometry of the data as captured in the geodesic distances between data points. The three steps of the Isomap algorithm:
1. Construct the neighborhood graph.
2. Compute the shortest paths.
3. Construct the low-dimensional embedding.

Isomap Algorithm (2). Given the distance $d(i, j)$ between all pairs of the $N$ points in $X$:
1. Construct the neighborhood graph: define a graph $G$ over all $N$ data points by connecting points $i$ and $j$ if their distance $d(i, j)$ is smaller than $\epsilon$ ($\epsilon$-Isomap) or if $i$ is one of the $K$ nearest neighbors of $j$ ($K$-Isomap). Set the edge lengths equal to $d(i, j)$.
2. Compute the shortest paths: initialize $d_G(i, j) = d(i, j)$ if $i$ and $j$ are linked by an edge and $d_G(i, j) = \infty$ otherwise. For each $k = 1, \ldots, N$, replace all entries $d_G(i, j)$ by $\min(d_G(i, j), d_G(i, k) + d_G(k, j))$. Then $D_G = [d_G(i, j)]$ contains the shortest-path distances between all pairs of points in $G$.
3. Construct the low-dimensional embedding (by MDS).
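A compact sketch of these three steps (SciPy's shortest_path routine stands in for the Floyd/Dijkstra step; the neighborhood size K and target dimension p are illustrative, and the neighborhood graph is assumed to be connected):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import shortest_path

def isomap(X, K=10, p=2):
    N = X.shape[0]
    D = squareform(pdist(X))                      # pairwise Euclidean distances d(i, j)
    # Step 1: K-nearest-neighbor graph with edge lengths d(i, j) (inf = no edge).
    G = np.full((N, N), np.inf)
    nn = np.argsort(D, axis=1)[:, 1:K + 1]
    for i in range(N):
        G[i, nn[i]] = D[i, nn[i]]
    G = np.minimum(G, G.T)                        # symmetrize the neighborhood graph
    # Step 2: geodesic distances as shortest paths in the graph (Dijkstra).
    D_G = shortest_path(G, method='D', directed=False)
    # Step 3: classical MDS on the geodesic distance matrix.
    H = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * H @ (D_G ** 2) @ H                 # double-centered squared distances
    w, V = np.linalg.eigh(B)
    w, V = w[::-1][:p], V[:, ::-1][:, :p]         # p leading eigenpairs
    return V * np.sqrt(np.maximum(w, 0.0)), D_G   # N x p embedding and geodesic distances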

Isomap Algorithm (3). There are two computational bottlenecks in the Isomap algorithm:
Shortest-path computation: Floyd's algorithm is $O(N^3)$; Dijkstra's algorithm (with Fibonacci heaps) is $O(KN^2 \log N)$, where $K$ is the neighborhood size.
Eigendecomposition: $O(N^3)$.

Example: Face Images (figures omitted).

Intrinsic Dimensionality of Data Manifolds In practice, some of the eigenvalues may be so close to zero that they can be ignored. As with PCA and MDS, the true (intrinsic) dimensionality of the data can be estimated from the decrease in error as the dimensionality of the low-dimensional space increases. For nonlinear manifolds, PCA and MDS tend to overestimate the intrinsic dimensionality. The intrinsic degrees of freedom provide a simple way to analyze and manipulate high-dimensional data.

Intrinsic Dimensionality of Data Manifolds (2). The residual variance of PCA, MDS, and Isomap on four data sets: (A) face images (MDS, Isomap), (B) Swiss roll data (MDS, Isomap), (C) hand images (MDS, Isomap), (D) handwritten "2"s (PCA, MDS, Isomap). (Figure omitted.)
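The residual variance used in the Isomap paper can be computed from the geodesic distances and the embedding distances; a hedged sketch (it reuses the D_G matrix returned by the isomap sketch above and takes residual variance to be 1 - R^2, with R the correlation coefficient between the two sets of pairwise distances):

import numpy as np
from scipy.spatial.distance import pdist

def residual_variance(D_G, Y):
    # D_G: N x N geodesic distance matrix; Y: N x p embedding.
    d_geo = D_G[np.triu_indices_from(D_G, k=1)]   # upper-triangular geodesic distances
    d_emb = pdist(Y)                              # Euclidean distances in the embedding
    r = np.corrcoef(d_geo, d_emb)[0, 1]
    return 1.0 - r ** 2

# Plotting residual_variance against increasing p and looking for the "elbow"
# gives a rough estimate of the intrinsic dimensionality.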

Global vs. Local Embedding Methods Metric MDS and Isomap compute embeddings that seek to preserve inter-point straight-line (Euclidean) distances or geodesic distances between all pairs of points. Hence they are global methods. Both locally linear embedding (LLE) and Laplacian eigenmap try to recover the global nonlinear structure from local geometric properties. They are local methods. Overlapping local neighborhoods, collectively analyzed, can provide information about the global geometry.

Computational Advantages of LLE Like PCA and MDS, LLE is simple to implement and its optimization problems do not have local minima. Although only linear algebraic methods are used, the constraint that points are only reconstructed from neighbors based on locally linear fits can result in highly nonlinear embeddings. Its main step involves a sparse eigenvalue problem that scales up better with large, high-dimensional data sets.

Problem Setting. Let $X = \{x^{(1)}, \ldots, x^{(N)}\}$ be a set of $N$ points in a high-dimensional input space $\mathbb{R}^D$. The $N$ data points are assumed to lie on or near a nonlinear manifold of intrinsic dimensionality $p < D$ (typically $p \ll D$). Provided that sufficient data are available by sampling well from the manifold, the goal of LLE is to find a low-dimensional embedding of $X$ by mapping the $D$-dimensional data into a single global coordinate system in $\mathbb{R}^p$. Let us denote the set of $N$ points in the embedding space $\mathbb{R}^p$ by $Y = \{y^{(1)}, \ldots, y^{(N)}\}$.

LLE Algorithm
1. For each data point $x^{(i)} \in X$: find the set $N_i$ of $K$ nearest neighbors of $x^{(i)}$, and compute the reconstruction weights of the neighbors that minimize the error of reconstructing $x^{(i)}$.
2. Compute the low-dimensional embedding $Y$ that best preserves the local geometry represented by the reconstruction weights.

Locally Linear Fitting. If the manifold is sufficiently densely sampled, then each point and its neighbors are expected to lie on or close to a locally linear patch of the manifold. The local geometry of a patch is characterized by the reconstruction weights with which a data point is reconstructed from its neighbors. Let $w_i$ denote the $K$-dimensional vector of local reconstruction weights for data point $x^{(i)}$. (One may also consider the full $N$-dimensional weight vector by constraining the terms $w_{ij}$ for $x^{(j)} \notin N_i$ to 0.)

Constrained Least Squares Problem. Optimality is achieved by minimizing the local reconstruction error function for each data point $x^{(i)}$:
$E_i(w_i) = \Big\| x^{(i)} - \sum_{x^{(j)} \in N_i} w_{ij} x^{(j)} \Big\|^2,$
which is the squared distance between $x^{(i)}$ and its reconstruction, subject to the constraints
$\sum_{x^{(j)} \in N_i} w_{ij} = \mathbf{1}^T w_i = 1$
and $w_{ij} = 0$ for any $x^{(j)} \notin N_i$. This is a constrained least squares problem that can be solved using the classical method of Lagrange multipliers.

Constrained Least Squares Problem (2). The error function $E_i(w_i)$ can be rewritten as follows:
$E_i(w_i) = \Big[ \sum_{x^{(j)} \in N_i} w_{ij} (x^{(i)} - x^{(j)}) \Big]^T \Big[ \sum_{x^{(k)} \in N_i} w_{ik} (x^{(i)} - x^{(k)}) \Big] = \sum_{x^{(j)}, x^{(k)} \in N_i} w_{ij} w_{ik} (x^{(i)} - x^{(j)})^T (x^{(i)} - x^{(k)}) = w_i^T G_i w_i,$
where $G_i = [(x^{(i)} - x^{(j)})^T (x^{(i)} - x^{(k)})]_{K \times K}$ is the local Gram matrix for $x^{(i)}$. To minimize $E_i(w_i)$ subject to the constraint $\mathbf{1}^T w_i = 1$, we define the Lagrangian function with multiplier $\lambda$:
$L(w_i, \lambda) = w_i^T G_i w_i + \lambda (1 - \mathbf{1}^T w_i).$

Constrained Least Squares Problem (3). The partial derivatives of $L(w_i, \lambda)$ w.r.t. $w_i$ and $\lambda$ are
$\frac{\partial L}{\partial w_i} = 2 G_i w_i - \lambda \mathbf{1}, \qquad \frac{\partial L}{\partial \lambda} = 1 - \mathbf{1}^T w_i.$
Setting the above equations to 0, we finally get (if $G_i^{-1}$ exists)
$w_i = \frac{G_i^{-1} \mathbf{1}}{\mathbf{1}^T G_i^{-1} \mathbf{1}}.$

A More Efficient Method. Instead of inverting $G_i$, a more efficient way is to solve the linear system of equations $G_i \hat{w}_i = \mathbf{1}$ for $\hat{w}_i$ and then compute $w_i$ as
$w_i = \frac{\hat{w}_i}{\mathbf{1}^T \hat{w}_i},$
so that the equality constraint $\mathbf{1}^T w_i = 1$ is satisfied. Based on the reconstruction weights computed for all $N$ data points, we form a weight matrix $W = [w_{ij}]_{N \times N}$ that will be used in the next step.
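A sketch of the per-point weight computation using this linear-system formulation (the small regularization term is a common practical safeguard for the case K > D, where G_i is singular; it is not part of the slide's derivation):

import numpy as np

def lle_weights(X, i, neighbors, reg=1e-3):
    # X: N x D data matrix; i: index of the point; neighbors: indices of its K nearest neighbors.
    Z = X[neighbors] - X[i]                              # rows are x^(j) - x^(i), j in N_i
    G = Z @ Z.T                                          # local Gram matrix G_i (the sign cancels in the products)
    G = G + reg * np.trace(G) * np.eye(len(neighbors))   # regularize in case G_i is singular
    w_hat = np.linalg.solve(G, np.ones(len(neighbors)))  # solve G_i w_hat = 1
    return w_hat / w_hat.sum()                           # rescale so that 1^T w_i = 1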

Low-Dimensional Embedding. Given the weight matrix $W$, the best low-dimensional embedding $Y$ can be computed by minimizing the following error function w.r.t. $Y = [y^{(1)}, \ldots, y^{(N)}]^T \in \mathbb{R}^{N \times p}$:
$J(Y) = \sum_{i=1}^{N} \Big\| y^{(i)} - \sum_{x^{(j)} \in N_i} w_{ij} y^{(j)} \Big\|^2.$
Let $b_i$ be the $i$th column of the identity matrix $I$ and $w_i$ be the $i$th column of $W^T$ (i.e., $w_i$ is the weight vector for $x^{(i)}$).

Optimization. We can rewrite $J(Y)$ as
$J(Y) = \sum_{i=1}^{N} \| Y^T b_i - Y^T w_i \|^2 = \sum_{i=1}^{N} \| Y^T (b_i - w_i) \|^2 = \| Y^T (I - W^T) \|_F^2 = \mathrm{Tr}[Y^T (I - W)^T (I - W) Y] = \mathrm{Tr}[Y^T M Y],$
where $M = (I - W)^T (I - W)$ is a symmetric positive semi-definite matrix (since $x^T M x = \|(I - W)x\|^2 \geq 0$ for all $x$). $M$ is sparse for reasonable choices of the neighborhood size $K$ (i.e., $K \ll N$).

Invariance to Translation, Rotation and Scaling. Note that the error function $J(Y)$ is invariant to translation, rotation and scaling of the vectors $y^{(i)}$ in the low-dimensional embedding $Y$. To remove the translational degree of freedom, we require the vectors $y^{(i)}$ to have zero mean, i.e.,
$\sum_{i=1}^{N} y^{(i)} = Y^T \mathbf{1} = 0.$
To remove the degrees of freedom due to rotation and scaling, we constrain the vectors $y^{(i)}$ to have covariance matrix equal to the identity matrix, i.e.,
$\frac{1}{N} \sum_{i=1}^{N} y^{(i)} (y^{(i)})^T = \frac{1}{N} Y^T Y = I.$

Eigenvalue Problem. The optimization problem can thus be stated as
$\min_{Y} \; \mathrm{Tr}(Y^T M Y) \quad \text{subject to} \quad Y^T \mathbf{1} = 0 \;\text{ and }\; Y^T Y = N I.$
If we express $Y$ as $[y_1, \ldots, y_p]$, then the optimization problem can also be expressed as
$\min_{Y} \; \sum_{k=1}^{p} y_k^T M y_k \quad \text{subject to} \quad y_k^T \mathbf{1} = 0 \;\text{ and }\; y_k^T y_k = N \;\text{ for } k = 1, \ldots, p.$

Eigenvalue Problem (2). Thus the solution to the optimization problem can be obtained by solving the following eigenvalue problem
$M y = \lambda y$
for the eigenvectors $y_k$ ($k = 1, \ldots, p$) that correspond to the $p$ smallest nonzero eigenvalues. The eigenvectors are normalized such that $y_k^T y_k = N$ for all $k = 1, \ldots, p$.
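A sketch of this embedding step, assuming the per-point weights have been assembled into a full N x N matrix W (row i holds the weights of x^(i)'s neighbors and zeros elsewhere); the bottom eigenvector of M is the constant vector with eigenvalue close to 0 and is discarded:

import numpy as np

def lle_embedding(W, p):
    N = W.shape[0]
    M = (np.eye(N) - W).T @ (np.eye(N) - W)   # M = (I - W)^T (I - W)
    evals, evecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    Y = evecs[:, 1:p + 1] * np.sqrt(N)        # skip the constant eigenvector; scale so y_k^T y_k = N
    return Y                                  # N x p embedding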

Algorithm Overview Let x (1),..., x (N) be N points in R D. Like Isomap, the Laplacian eigenmap algorithm first constructs a weighted graph with N nodes representing the neighborhood relationships. It then computes an eigenmap based on the graph.

Edge Creation. An edge is created between nodes $i$ and $j$ if $x^{(i)}$ and $x^{(j)}$ are close to each other. Two possible criteria for edge creation:
$\epsilon$-neighborhood: nodes $i$ and $j$ are connected by an edge if $\| x^{(i)} - x^{(j)} \| < \epsilon$ for some $\epsilon \in \mathbb{R}^+$.
$K$ nearest neighbors: nodes $i$ and $j$ are connected by an edge if $x^{(i)}$ is among the $K$ nearest neighbors of $x^{(j)}$ or $x^{(j)}$ is among the $K$ nearest neighbors of $x^{(i)}$.

Edge Weighting. Two common variations for edge weighting:
Heat kernel:
$w_{ij} = \begin{cases} \exp\!\big(-\| x^{(i)} - x^{(j)} \|^2 / \sigma^2\big) & \text{if nodes } i \text{ and } j \text{ are connected} \\ 0 & \text{otherwise} \end{cases}$
for some $\sigma^2 \in \mathbb{R}^+$.
Binary weights:
$w_{ij} = \begin{cases} 1 & \text{if nodes } i \text{ and } j \text{ are connected} \\ 0 & \text{otherwise.} \end{cases}$
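A sketch that combines the edge-creation and edge-weighting slides: a K-nearest-neighbor graph with heat-kernel weights (the choices of K and sigma^2 are illustrative tuning parameters):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def heat_kernel_weights(X, K=10, sigma2=1.0):
    N = X.shape[0]
    D2 = squareform(pdist(X, 'sqeuclidean'))       # squared Euclidean distances
    W = np.zeros((N, N))
    nn = np.argsort(D2, axis=1)[:, 1:K + 1]        # K nearest neighbors of each point
    for i in range(N):
        W[i, nn[i]] = np.exp(-D2[i, nn[i]] / sigma2)
    return np.maximum(W, W.T)                      # connect i and j if either is a K-NN of the other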

Construction of Eigenmap. If the graph constructed above is not connected, then the following procedure is applied to each connected component separately. We first consider the special case which finds a 1-dimensional embedding, and then generalize it to the general $p$-dimensional case for $p > 1$. Let $y = (y^{(1)}, \ldots, y^{(N)})^T$ denote the 1-dimensional embedding. The objective function for minimization is given by
$\sum_{i,j=1}^{N} (y^{(i)} - y^{(j)})^2 \, w_{ij}.$

Construction of Eigenmap (2). Up to a constant factor of $\frac{1}{2}$, which does not affect the minimizer, we can rewrite the objective function as
$\frac{1}{2} \sum_{i,j=1}^{N} (y^{(i)} - y^{(j)})^2 w_{ij} = \frac{1}{2} \sum_{i,j=1}^{N} (y^{(i)})^2 w_{ij} + \frac{1}{2} \sum_{i,j=1}^{N} (y^{(j)})^2 w_{ij} - \sum_{i,j=1}^{N} y^{(i)} y^{(j)} w_{ij} = \sum_{i=1}^{N} (y^{(i)})^2 d_{ii} - \sum_{i,j=1}^{N} y^{(i)} y^{(j)} w_{ij} = y^T (D - W) y = y^T L y,$
where $d_{ii} = \sum_j w_{ij}$, $D = \mathrm{diag}(d_{11}, \ldots, d_{NN})$, and $L = D - W$ is the graph Laplacian.
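A small numerical check of this identity, on illustrative random symmetric weights:

import numpy as np

rng = np.random.default_rng(0)
N = 6
W = rng.random((N, N))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)                       # symmetric weights, no self-loops
y = rng.normal(size=N)
D = np.diag(W.sum(axis=1))                     # d_ii = sum_j w_ij
L = D - W                                      # graph Laplacian
lhs = 0.5 * np.sum(W * (y[:, None] - y[None, :]) ** 2)
rhs = y @ L @ y
assert np.isclose(lhs, rhs)                    # (1/2) sum_ij (y_i - y_j)^2 w_ij = y^T L y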

Scale Invariance. To remove the arbitrary scaling factor in the embedding, we enforce the constraint $y^T D y = 1$ (the larger $d_{ii}$ is, the more important is the corresponding node $i$). The optimization problem can thus be restated as
$\min_{y} \; y^T L y \quad \text{subject to} \quad y^T D y = 1, \qquad \text{or equivalently} \qquad \min_{y} \; \frac{y^T L y}{y^T D y}.$

Generalized Eigenvalue Problem. This corresponds to solving the following eigenvalue problem
$(D^{-1} L) y = \lambda y,$
or the corresponding generalized eigenvalue problem
$L y = \lambda D y,$
for the smallest eigenvalue $\lambda$ and the corresponding eigenvector $y$. Note that $\lambda = 0$ and $y = c\mathbf{1}$ for any $c \neq 0$ form a solution, since
$c L \mathbf{1} = c (D - W) \mathbf{1} = \mathbf{0} = 0 \cdot D \mathbf{1}.$

Generalized Eigenvalue Problem (2). To eliminate such cases, we modify the minimization problem to
$\min_{y} \; y^T L y \quad \text{subject to} \quad y^T D y = 1, \; y^T D \mathbf{1} = 0.$
Note that if $D = I$, then $y^T D \mathbf{1} = 0$ is equivalent to centering. Finally, we can conclude that the solution is the eigenvector $y$ of the generalized eigenvalue problem
$L y = \lambda D y$
corresponding to the smallest nonzero eigenvalue $\lambda$. Normalization of $y$ is performed such that $y^T D y = 1$.

Construction of Eigenmap for $p > 1$. Let the $p$-dimensional embedding be denoted by the $N \times p$ matrix $Y = [y_1, \ldots, y_p] = [y^{(1)}, \ldots, y^{(N)}]^T$. Note that $y^{(i)}$ is the $p$-dimensional representation of $x^{(i)}$ in the embedding space. The objective function for minimization is given by
$\sum_{i,j=1}^{N} \| y^{(i)} - y^{(j)} \|^2 \, w_{ij}.$

Construction of Eigenmap for $p > 1$ (2). The minimization problem can be stated as
$\min_{Y} \; \mathrm{Tr}(Y^T L Y) \quad \text{subject to} \quad Y^T D Y = I, \; Y^T D \mathbf{1} = 0.$
Solution: the eigenvectors $y_k$ ($k = 1, \ldots, p$) of the generalized eigenvalue problem $L y = \lambda D y$ corresponding to the $p$ smallest nonzero eigenvalues give the solution to the optimization problem above. The eigenvectors are normalized such that $y_k^T D y_k = 1$ for all $k = 1, \ldots, p$.
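A sketch of this final step with SciPy's generalized symmetric eigensolver, which returns eigenvectors that are already D-orthonormal (so y_k^T D y_k = 1); it assumes the graph is connected and that W comes from a construction like the heat-kernel sketch above:

import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(W, p):
    D = np.diag(W.sum(axis=1))     # degree matrix
    L = D - W                      # graph Laplacian
    evals, evecs = eigh(L, D)      # generalized problem L y = lambda D y, eigenvalues ascending
    Y = evecs[:, 1:p + 1]          # skip the trivial constant eigenvector (eigenvalue 0)
    return Y                       # N x p embedding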

And Beyond
Robust versions of dimensionality reduction
Out-of-sample extensions for LLE, Isomap, MDS
A kernel view of embedding methods
Probabilistic views
Supervised and semi-supervised extensions
Applications: super-resolution, image recognition, and many, many others...

Main References
Kernel PCA: [SSM98]
Isomap: [Ten98] [TdL00] [BST+02]
Locally Linear Embedding: [RS00] [SR03]
Laplacian Eigenmap: [BN02] [BN03]

M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 585–591. MIT Press, Cambridge, MA, USA, 2002.
M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
M. Balasubramanian, E.L. Schwartz, J.B. Tenenbaum, V. de Silva, and J.C. Langford. The Isomap algorithm and topological stability. Science, 295(5552):7a, 2002.
S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

L.K. Saul and S.T. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119–155, 2003.
B. Schölkopf, A.J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
J.B. Tenenbaum. Mapping a manifold of perceptual observations. In M.I. Jordan, M.J. Kearns, and S.A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 682–688. MIT Press, 1998.