Global (ISOMAP) versus Local (LLE) Methods in Nonlinear Dimensionality Reduction

Global (ISOMAP) versus Local (LLE) Methods in Nonlinear Dimensionality Reduction. A presentation by Evan Ettinger on a paper by Vin de Silva and Joshua B. Tenenbaum, May 12, 2005.

Outline
Introduction: The NLDR Problem
Theoretical Claims: ISOMAP Asymptotic Convergence Proofs
Conformal ISOMAP: Relaxing the Assumptions
Landmark ISOMAP: Faster and Scalable

Section: The NLDR Problem

The NLDR Problem: Nonlinear Dimensionality Reduction (NLDR). Let Y ⊂ R^d be a convex domain and let f : Y → R^D be a smooth embedding whose image is a manifold M, for some D > d. Hidden data {y_i} are generated randomly in Y and then mapped by f to the observations {x_i = f(y_i)}. We would like to recover Y and f from the observed data {x_i} in R^D. To make the problem well posed, assume f is an isometric embedding (i.e., f preserves infinitesimal angles and lengths).
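
To make the setup concrete, here is a minimal NumPy sketch (not from the paper) that generates hidden low-dimensional coordinates {y_i} and maps them to observations {x_i = f(y_i)}; the rectangular domain and the Swiss-roll-style map f are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden 2-D coordinates {y_i}, drawn from a convex domain Y (a rectangle here; illustrative choice).
N = 1000
Y = rng.uniform(low=[3.0, 0.0], high=[9.0, 10.0], size=(N, 2))

def f(y):
    """A smooth embedding of Y into R^3 (Swiss-roll-style surface; illustrative choice)."""
    t, h = y[:, 0], y[:, 1]
    return np.column_stack([t * np.cos(t), h, t * np.sin(t)])

X = f(Y)   # Observed data {x_i = f(y_i)} in R^D; NLDR tries to recover Y (and f) from X alone.
```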

The NLDR Problem: Locally Linear Embedding (LLE). Step 1: for each x_i, compute a weight vector w_i that best reconstructs x_i as a linear combination of its k nearest neighbors. Step 2: compute the embedding {y_i} in R^d that minimizes the reconstruction error of each y_i from its corresponding k nearest neighbors, using the same weights w_i.
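
A minimal NumPy/SciPy sketch of these two LLE steps follows; it is an illustration rather than the authors' code, and the function name lle, the neighbor count, and the regularization constant reg are assumptions made here for concreteness.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.linalg import eigh

def lle(X, n_components=2, k=10, reg=1e-3):
    """Minimal LLE sketch: local reconstruction weights, then a global embedding."""
    N = X.shape[0]
    # k nearest neighbors of each point (excluding the point itself).
    _, idx = cKDTree(X).query(X, k=k + 1)
    idx = idx[:, 1:]

    # Step 1: weights w_i that best reconstruct x_i from its neighbors, with sum(w_i) = 1.
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[idx[i]] - X[i]                  # center the neighbors on x_i
        C = Z @ Z.T                           # local Gram matrix (k x k)
        C += reg * np.trace(C) * np.eye(k)    # regularize in case C is singular
        w = np.linalg.solve(C, np.ones(k))
        W[i, idx[i]] = w / w.sum()

    # Step 2: embedding minimizing sum_i ||y_i - sum_j W_ij y_j||^2.
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = eigh(M)                      # eigenvalues in ascending order
    return vecs[:, 1:n_components + 1]        # skip the constant bottom eigenvector
```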

The NLDR Problem: ISOMAP Idea. The ISOMAP algorithm attempts to recover the original embedding of the hidden data {y_i}. A geodesic is the shortest path in M between two points x and y. The idea: approximate the pairwise geodesic distances in M between the {x_i}; then, given these pairwise geodesic distances, use MDS (described next) to find a d-dimensional embedding that preserves them.

The NLDR Problem: Multidimensional Scaling (MDS). Input: a matrix of dissimilarities δ²_ij (pairwise squared distances; for ISOMAP, squared geodesic distances), where δ²_ij = (x_i − x_j)^T (x_i − x_j). Goal: find a set of points in R^d that well approximates the dissimilarities δ²_ij. Let A_ij = −(1/2) δ²_ij, let H = I_n − (1/n) 1 1^T be the symmetric centering matrix, and construct B = HAH. Then B_ij = (x_i − x̄)^T (x_j − x̄), and B = HAH = (HX)(HX)^T, so B is real, symmetric, and positive semi-definite. Its eigendecomposition gives B = V Λ V^T.

The NLDR Problem: MDS (cont'd). As in PCA, take the eigenvectors v_i of B = V Λ V^T with λ_i > 0, ordered from largest to smallest eigenvalue, to form V_d and Λ_d. The reconstructed points are then X̂ = V_d Λ_d^{1/2}, since B = X̂ X̂^T. To get a d-dimensional embedding we maximize the fraction of variance explained by keeping the d largest eigenvalues and their corresponding eigenvectors. NOTE: we are not restricted to a Euclidean distance metric; however, when Euclidean distances are used, MDS is equivalent to PCA. For ISOMAP, the dissimilarity input to MDS will be the pairwise squared geodesic distances.
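
Here is a minimal NumPy sketch of the classical MDS procedure just described (double centering, then eigendecomposition); the function name classical_mds is an illustrative choice, and the input is assumed to be a matrix of squared dissimilarities.

```python
import numpy as np

def classical_mds(sq_dists, d=2):
    """Classical MDS sketch: double-center the squared dissimilarities, then eigendecompose."""
    n = sq_dists.shape[0]
    A = -0.5 * sq_dists
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = H @ A @ H                              # B_ij = (x_i - xbar)^T (x_j - xbar) for Euclidean input
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:d]         # keep the d largest eigenvalues
    L = np.maximum(vals[order], 0.0)           # clip tiny negatives for non-Euclidean input
    return vecs[:, order] * np.sqrt(L)         # X_hat = V_d Lambda_d^{1/2}
```

For ISOMAP below, the same routine is applied to squared geodesic distances instead of squared Euclidean distances.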

The NLDR Problem: ISOMAP in Detail. 1. Create a graph G on {x_i} using either a k-NN rule or an ε-rule, where each edge e = (x_i, x_j) is weighted by the Euclidean distance between the two points: O(DN²). 2. Compute shortest paths in the graph between all pairs of data points to approximate geodesic distances, and store them in a pairwise squared-distance matrix: O(N³) using Floyd's algorithm, or O(kN² log N) using Dijkstra's algorithm with Fibonacci heaps. 3. Apply MDS to this matrix to find a d-dimensional Euclidean embedding of the reconstructed {y_i}: O(N³).
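
A compact sketch of these three steps, assuming the classical_mds helper from the MDS sketch above and a connected k-NN graph; SciPy's Dijkstra-based shortest_path stands in for step 2.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def isomap(X, d=2, k=10):
    """Minimal ISOMAP sketch: k-NN graph -> all-pairs graph distances -> classical MDS."""
    N = X.shape[0]
    dists, idx = cKDTree(X).query(X, k=k + 1)         # neighbor 0 is the point itself
    rows = np.repeat(np.arange(N), k)
    G = csr_matrix((dists[:, 1:].ravel(), (rows, idx[:, 1:].ravel())), shape=(N, N))
    # Assumes the k-NN graph is connected; otherwise some graph distances are infinite.
    geo = shortest_path(G, method='D', directed=False)   # Dijkstra from every node
    return classical_mds(geo ** 2, d)                    # reuses the MDS sketch above
```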

The NLDR Problem: An Example, Twos (figure-only slide).

The NLDR Problem: ISOMAP Theoretical Guarantees. ISOMAP has provable convergence guarantees: given that {x_i} is sampled sufficiently densely, ISOMAP will, with high probability, closely approximate the original distances as measured in M. In other words, the geodesic-distance approximations computed from the graph G can be made arbitrarily good. Let's examine these theoretical guarantees in more detail.

Section: ISOMAP Asymptotic Convergence Proofs

ISOMAP Asymptotic Convergence Proofs: Approximating geodesics with G. It is not immediately obvious that G should give a good approximation to geodesic distances. In degenerate cases, zig-zagging graph paths could add a significant amount of extra length.

ISOMAP Asymptotic Convergence Proofs: Theoretical Setup. The convergence proof hinges on the idea that we can approximate geodesic distance in M by short Euclidean-distance hops. For two points x, y ∈ M define:
d_M(x, y) = inf_γ {length(γ)}
d_G(x, y) = min_P ( ‖x_0 − x_1‖ + ... + ‖x_{p−1} − x_p‖ )
d_S(x, y) = min_P ( d_M(x_0, x_1) + ... + d_M(x_{p−1}, x_p) )
where γ varies over the set of smooth arcs connecting x to y in M, and P varies over all paths along the edges of G starting at data point x = x_0 and ending at y = x_p. We will show d_M ≈ d_S and d_S ≈ d_G, which together imply the desired result d_G ≈ d_M.

ISOMAP Asymptotic Convergence Proofs: Theorem 1 (Sampling Condition). Theorem 1: Let ε, δ > 0 with 4δ < ε. Suppose G contains all edges e = (x, y) for which d_M(x, y) < ε. Furthermore, assume that for every point m ∈ M there is a data point x_i with d_M(m, x_i) < δ (the δ-sampling condition). Then for all pairs of data points x, y:
d_M(x, y) ≤ d_S(x, y) ≤ (1 + 4δ/ε) d_M(x, y)

ISOMAP Asymptotic Convergence Proofs: Proof of Theorem 1. We show d_M(x, y) ≤ d_S(x, y) ≤ (1 + 4δ/ε) d_M(x, y). The left-hand inequality follows directly from the triangle inequality. For the right-hand inequality, let γ be any piecewise-smooth arc connecting x to y with l = length(γ). If l ≤ ε − 2δ, then x and y are connected by an edge in G which we can use as our path.

ISOMAP Asymptotic Convergence Proofs: Proof (cont'd). If l > ε − 2δ, write l = l_0 + (l_1 + ... + l_1) + l_0, where l_1 = ε − 2δ and (ε − 2δ)/2 ≤ l_0 ≤ ε − 2δ. This splits the arc γ into a sequence of points γ_0 = x, γ_1, ..., γ_p = y. Each point γ_i lies within distance δ of a sample data point x_i. Claim: the path x x_1 x_2 ... x_{p−1} y satisfies our requirements:
d_M(x_i, x_{i+1}) ≤ d_M(x_i, γ_i) + d_M(γ_i, γ_{i+1}) + d_M(γ_{i+1}, x_{i+1}) ≤ δ + l_1 + δ = ε = l_1 ε/(ε − 2δ)

ISOMAP Asymptotic Convergence Proofs: Proof (cont'd). Similarly, d_M(x, x_1) ≤ l_0 ε/(ε − 2δ) ≤ ε, and the same holds for d_M(x_{p−1}, y). Summing over the path:
d_M(x_0, x_1) + ... + d_M(x_{p−1}, x_p) ≤ l ε/(ε − 2δ) < l (1 + 4δ/ε)
The last inequality uses the fact that 1/(1 − t) < 1 + 2t for 0 < t < 1/2, applied with t = 2δ/ε. Finally, taking the infimum over all γ gives l = d_M(x, y). Thus d_S approximates d_M arbitrarily well, given both the graph-construction and δ-sampling conditions.

ISOMAP Asymptotic Convergence Proofs: d_S ≈ d_G. We would now like to show the other approximate equality, d_S ≈ d_G. First, some definitions:
1. The minimum radius of curvature r_0 = r_0(M) is defined by 1/r_0 = max_{γ,t} ‖γ̈(t)‖, where γ varies over all unit-speed geodesics in M and t ranges over the domain of γ. Intuitively, geodesics in M curl around less tightly than circles of radius less than r_0(M).
2. The minimum branch separation s_0 = s_0(M) is the largest positive number for which ‖x − y‖ < s_0 implies d_M(x, y) ≤ π r_0 for any x, y ∈ M.
Lemma: If γ is a geodesic in M connecting points x and y, and if l = length(γ) ≤ π r_0, then:
2 r_0 sin(l / 2r_0) ≤ ‖x − y‖ ≤ l

ISOMAP Asymptotic Convergence Proofs: Notes on the Lemma. We take this Lemma without proof, as it is somewhat technical and long. Using the fact that sin(t) ≥ t − t³/6 for t ≥ 0, we can write down a weakened form of the Lemma:
(1 − l²/(24 r_0²)) l ≤ ‖x − y‖ ≤ l
We can also write down an even weaker version, valid for l ≤ π r_0:
(2/π) l ≤ ‖x − y‖ ≤ l
With these in hand we can now show d_S ≈ d_G.
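
The weakened form follows from the Lemma by a single substitution; spelling it out (this step is implied, not shown, on the slide), with t = l/2r_0:

```latex
% Deriving the weakened form from sin(t) >= t - t^3/6 with t = l / (2 r_0):
\|x - y\| \;\ge\; 2 r_0 \sin\!\Big(\tfrac{l}{2 r_0}\Big)
        \;\ge\; 2 r_0 \Big( \tfrac{l}{2 r_0} - \tfrac{1}{6}\big(\tfrac{l}{2 r_0}\big)^{3} \Big)
        \;=\; l - \tfrac{l^{3}}{24\, r_0^{2}}
        \;=\; \Big(1 - \tfrac{l^{2}}{24\, r_0^{2}}\Big)\, l .
```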

ISOMAP Asymptotic Convergence Proofs: Theorem 2 (Euclidean Hops ≈ Geodesic Hops). Theorem 2: Let λ > 0 be given. Suppose data points x_i, x_{i+1} ∈ M satisfy:
‖x_i − x_{i+1}‖ < s_0
‖x_i − x_{i+1}‖ ≤ (2/π) r_0 √(24λ)
Suppose also there is a geodesic arc of length l = d_M(x_i, x_{i+1}) connecting x_i to x_{i+1}. Then:
(1 − λ) l ≤ ‖x_i − x_{i+1}‖ ≤ l

ISOMAP Asymptotic Convergence Proofs: Proof of Theorem 2. By the first assumption and the definition of s_0, we conclude l ≤ π r_0. This allows us to apply the Lemma; its weakest form combined with the second assumption gives:
l ≤ (π/2) ‖x_i − x_{i+1}‖ ≤ r_0 √(24λ)
Solving for λ gives 1 − λ ≤ 1 − l²/(24 r_0²). Applying the weakened form of the Lemma then gives the desired result. Combining Theorems 1 and 2 shows d_M ≈ d_G. This leads us to the main theorem.

ISOMAP Asymptotic Convergence Proofs: Main Theorem. Let M be a compact submanifold of R^n and let {x_i} be a finite set of data points in M. We are given a graph G on {x_i} and positive real numbers λ_1, λ_2 < 1 and δ, ε > 0. Suppose:
1. G contains all edges (x_i, x_j) of length ‖x_i − x_j‖ ≤ ε.
2. The data set {x_i} satisfies the δ-sampling condition: for every point m ∈ M there exists an x_i such that d_M(m, x_i) < δ.
3. M is geodesically convex: the shortest curve joining any two points on the surface is a geodesic curve.
4. ε < (2/π) r_0 √(24 λ_1), where r_0 is the minimum radius of curvature of M, 1/r_0 = max_{γ,t} ‖γ̈(t)‖ with γ varying over all unit-speed geodesics in M.
5. ε < s_0, where s_0 is the minimum branch separation of M: the largest positive number for which ‖x − y‖ < s_0 implies d_M(x, y) ≤ π r_0.
6. δ < λ_2 ε / 4.
Then the following holds for all pairs of data points x, y:
(1 − λ_1) d_M(x, y) ≤ d_G(x, y) ≤ (1 + λ_2) d_M(x, y)

ISOMAP Asymptotic Convergence Proofs: Recap. So, short Euclidean-distance hops along G approximate well the actual geodesic distances as measured in M. What were the main assumptions? The biggest was the δ-sampling density condition. A probabilistic version of the Main Theorem can be shown in which each point x_i is drawn from a density function; the approximation bounds then hold with high probability. Here's a truncated version of that theorem. Asymptotic Convergence Theorem: Given λ_1, λ_2, µ > 0, then for a sampling density function α sufficiently large,
1 − λ_1 ≤ d_G(x, y) / d_M(x, y) ≤ 1 + λ_2
holds with probability at least 1 − µ for any two data points x, y.

Section: Relaxing the Assumptions

Relaxing the Assumptions: ISOMAP Relaxation. Recall the NLDR problem formulation: we want to recover f and the hidden points {y_i}, where x_i = f(y_i). In the original ISOMAP problem statement we only considered isometric embeddings f (i.e., f preserves infinitesimal lengths and angles). If f is isometric, then the shortest path between two points in Y maps to a geodesic in M; ISOMAP relies on exactly this fact. A relaxation that allows a wider class of manifolds is to let f be a conformal mapping: f preserves only infinitesimal angles and makes no claims about preserving distances. An equivalent way to view a conformal mapping: at every point y ∈ Y there is a smooth scalar function s(y) > 0 such that infinitesimal vectors at y get magnified in length by a factor s(y).

Relaxing the Assumptions: Linear Isometry, Isometry, Conformal. If f is a linear isometry f : R^d → R^D, then we can simply use PCA or MDS to recover the d significant dimensions (example: plane). If f is an isometric embedding f : Y → R^D, then provided the data points are sufficiently dense and Y ⊂ R^d is a convex domain, we can use ISOMAP to recover the approximate original structure (example: Swiss roll). If f is a conformal embedding f : Y → R^D, then we must additionally assume the data is uniformly dense in Y and Y ⊂ R^d is a convex domain, and we can then successfully use C-ISOMAP (example: fishbowl).

Relaxing the Assumptions: C-ISOMAP. Idea behind C-ISOMAP: estimate not only geodesic distances but also the scalar function s(y). Let µ(i) be the mean distance from x_i to its k nearest neighbors. Each y_i and its k nearest neighbors occupy a d-dimensional disk of radius r, where r depends only on d and the sampling density. f maps this disk to approximately a d-dimensional disk on M of radius s(y_i) r, so µ(i) ∝ s(y_i). Hence µ(i) is a reasonable estimate of s(y_i): under the uniform-density assumption it is off only by a constant factor.

Relaxing the Assumptions: Altering ISOMAP to C-ISOMAP. We replace each edge weight in G by ‖x_i − x_j‖ / √(µ(i) µ(j)); everything else stays the same. Resulting effect: regions of high density are magnified and regions of low density are shrunk. A convergence theorem similar to the one above can be shown for C-ISOMAP, assuming Y is sampled uniformly from a bounded convex region.
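
A sketch of this modification, reusing the classical_mds helper from earlier; only the edge weights differ from the plain ISOMAP sketch, and the function name c_isomap is illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def c_isomap(X, d=2, k=10):
    """C-ISOMAP sketch: rescale each k-NN edge by sqrt(mu(i) * mu(j)) before shortest paths."""
    N = X.shape[0]
    dists, idx = cKDTree(X).query(X, k=k + 1)
    mu = dists[:, 1:].mean(axis=1)                           # mean distance to the k nearest neighbors
    rows = np.repeat(np.arange(N), k)
    cols = idx[:, 1:].ravel()
    w = dists[:, 1:].ravel() / np.sqrt(mu[rows] * mu[cols])  # conformal rescaling of edge weights
    G = csr_matrix((w, (rows, cols)), shape=(N, N))
    geo = shortest_path(G, method='D', directed=False)
    return classical_mds(geo ** 2, d)                        # reuses the earlier MDS sketch
```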

Relaxing the Assumptions: C-ISOMAP Example Setup. We compare LLE, ISOMAP, C-ISOMAP, and MDS on toy datasets. Conformal fishbowl: use stereographic projection to project points uniformly distributed in a disk in R² onto a sphere with the top removed. Uniform fishbowl: points distributed uniformly on the surface of the fishbowl. Offset fishbowl: same as the conformal fishbowl, but points are sampled in Y with a Gaussian offset from the center.
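
One possible way to generate the conformal fishbowl, assuming an inverse stereographic projection onto the unit sphere; the disk radius and scaling are illustrative choices, not the paper's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def conformal_fishbowl(n=2000, radius=3.0):
    """Points uniform in a disk in R^2, lifted to the unit sphere (top cap removed) by
    inverse stereographic projection; radius is an illustrative choice."""
    r = radius * np.sqrt(rng.uniform(size=n))       # sqrt gives a uniform density over the disk
    theta = rng.uniform(0, 2 * np.pi, size=n)
    u, v = r * np.cos(theta), r * np.sin(theta)
    s = u ** 2 + v ** 2
    return np.column_stack([2 * u, 2 * v, s - 1]) / (s + 1)[:, None]
```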

Relaxing the Assumptions (figure-only slide).

Relaxing the Assumptions: Example 2, Face Images. 2000 face images were randomly generated, varying in distance and left-right pose. Each image is a vector in 16384-dimensional space. (The slide shows the four extreme cases.) The mapping is conformal because a change in orientation at a large distance has a smaller effect on local pixel distances than the same change at a shorter distance.

Relaxing the Assumptions: Face Images Results. C-ISOMAP separates the two intrinsic dimensions cleanly. ISOMAP's embedding narrows as the faces get farther away. LLE's embedding is highly distorted.

Section: Faster and Scalable

Faster and Scalable: Motivation for L-ISOMAP. ISOMAP out of the box is not scalable. Two bottlenecks: the all-pairs shortest-path computation, O(kN² log N), and the MDS eigenvalue computation on a full N×N matrix, O(N³). For contrast, LLE is limited by a sparse eigenvalue computation, O(dN²). Landmark ISOMAP (L-ISOMAP) idea: choose n ≪ N landmark points from {x_i} and compute the n×N matrix D_n of geodesic distances from each data point to the landmark points only. Then use a new procedure, Landmark MDS (LMDS), to find a Euclidean embedding of all the data; it uses the idea of triangulation, similar to GPS. Savings: L-ISOMAP's shortest-path computation is O(knN log N) and the LMDS eigenvalue problem is O(n²N).

Faster and Scalable: LMDS Details. 1. Designate a set of n landmark points. 2. Apply classical MDS to the n×n matrix Δ_n of squared distances between the landmark points to find a d-dimensional embedding of these n points. Let L_k be the d×n matrix containing the embedded landmark points, built from the computed eigenvectors v_i and eigenvalues λ_i:
L_k = [ √λ_1 v_1^T ; √λ_2 v_2^T ; ... ; √λ_d v_d^T ]
(one row per retained eigenpair).

Faster and Scalable: LMDS Details (cont'd). 3. Apply distance-based triangulation to find a d-dimensional embedding of all N points. Let δ_1, ..., δ_n be the vectors of squared distances from each landmark to all the landmarks (the columns of Δ_n), and let δ_µ be the mean of these vectors. For a point x, let δ_x be the vector of squared distances between x and the landmark points. Then the i-th component of the embedding vector y_x is:
y_x(i) = −(1 / (2√λ_i)) · v_i^T (δ_x − δ_µ)
It can be shown that this embedding of y_x is equivalent to projecting onto the first d principal components of the landmarks. 4. Finally, we can optionally run PCA to reorient the axes.
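
A sketch of LMDS steps 2 and 3 under the sign convention reconstructed above; the function name and the (n×n, n×N) argument layout are assumptions made here for concreteness.

```python
import numpy as np

def landmark_mds(sq_dists_landmarks, sq_dists_all, d=2):
    """LMDS sketch. sq_dists_landmarks: (n x n) squared distances among landmarks;
    sq_dists_all: (n x N) squared distances from each landmark to every point."""
    n = sq_dists_landmarks.shape[0]
    # Step 2: classical MDS on the landmarks.
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ sq_dists_landmarks @ H
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:d]
    lam, V = vals[order], vecs[:, order]            # top d eigenpairs (assumed positive)
    # Step 3: distance-based triangulation of all N points.
    delta_mu = sq_dists_landmarks.mean(axis=1)      # mean squared-distance vector over landmarks
    # i-th coordinate: -(1 / (2*sqrt(lam_i))) * v_i^T (delta_x - delta_mu)
    Y = -0.5 * (V / np.sqrt(lam)).T @ (sq_dists_all - delta_mu[:, None])
    return Y.T                                      # (N x d) embedding
```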

Faster and Scalable: Landmark Choices. How many landmark points should we choose? d + 1 landmarks are enough for the triangulation to locate each point uniquely, but heuristics suggest a few more are better for stability. Poorly distributed landmarks can lead to foreshortening: the projection onto the d-dimensional subspace shortens distances. Good strategies are random selection, or the more expensive MinMax method, which adds each new landmark so as to maximize its minimum distance to the landmarks already chosen. Either way, running L-ISOMAP together with cross-validation techniques would be useful for finding a stable embedding.
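
A short sketch of the MinMax heuristic (farthest-point selection); the random choice of the first landmark is an assumption made here.

```python
import numpy as np

def minmax_landmarks(X, n_landmarks, seed=0):
    """MinMax landmark selection sketch: each new landmark maximizes its minimum
    distance to the landmarks chosen so far (first landmark picked at random)."""
    rng = np.random.default_rng(seed)
    landmarks = [rng.integers(len(X))]
    min_dist = np.linalg.norm(X - X[landmarks[0]], axis=1)
    for _ in range(n_landmarks - 1):
        nxt = int(np.argmax(min_dist))              # farthest point from the current landmark set
        landmarks.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(landmarks)
```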

Faster and Scalable: L-ISOMAP example (figure-only slide).

Recap and Questions.
Approach: LLE is local; ISOMAP is global.
Isometry: LLE handles it most of the time (up to a covariance distortion); ISOMAP yes.
Conformal maps: LLE offers no guarantees but sometimes works; ISOMAP handles them via C-ISOMAP.
Speed: LLE is quadratic in N; ISOMAP is cubic in N, though L-ISOMAP helps.
Open questions: How do LLE and L-ISOMAP compare in the quality of their output on real-world datasets? Can we develop a quantitative metric to evaluate them? How much improvement in classification tasks do NLDR techniques really give over traditional dimensionality-reduction techniques? Is there some heuristic for choosing k? Could we use hierarchical clustering information to construct a better graph G? Lots of research potential.

References
V. de Silva and J. B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. Neural Information Processing Systems 15 (NIPS 2002), pp. 705-712, 2003.
J. A. Lee and M. Verleysen. How to project circular manifolds using geodesic distances? ESANN 2004, European Symposium on Artificial Neural Networks, pp. 223-230, 2004.
M. Bernstein, V. de Silva, J. Langford, and J. Tenenbaum. Graph approximations to geodesics on embedded manifolds. Technical report, Department of Psychology, Stanford University, 2000.
V. de Silva and J. B. Tenenbaum. Unsupervised learning of curved manifolds. Nonlinear Estimation and Classification, 2002.
J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, vol. 290, pp. 2319-2323, 2000.
V. de Silva and J. B. Tenenbaum. Sparse multidimensional scaling using landmark points. Available at: http://math.stanford.edu/~silva/public/publications.html
C. Williams. On a connection between kernel PCA and metric multidimensional scaling. Machine Learning, 2002.