LECTURE NOTE #11 PROF. ALAN YUILLE

1. Nonlinear Dimension Reduction: Spectral Methods

The basic idea is to assume that the data lies on a manifold/surface in D-dimensional space, see figure (1), and then to perform multi-dimensional scaling, or another dimension reduction method, using distances calculated on the manifold, see figure (2).

Figure 1. We assume that the data lies on a manifold. This is a surface which is locally flat.

Figure 2. The distance is calculated on the manifold. I.e. the distance from x_3 to x_7 is the sum of the distances from x_3 to x_4, from x_4 to x_5, from x_5 to x_6, and from x_6 to x_7. It is not the direct distance between x_3 and x_7.

Strategy: (i) define a sparse graph using kNN (k nearest neighbors), (ii) derive a matrix from the graph weights, (iii) derive the embedding from this matrix.

2. Example 1: ISOMAP

Datapoints X = {x_i : i = 1, ..., N}.

Step 1. Construct an adjacency graph (V, E), where the datapoints are the vertices V (hence |V| = N). The edges E connect each node to its k nearest neighbors, see figure (3).

Figure 3. K-nearest neighbors (left) and dynamic programming (right). Dynamic programming is described in the lecture on probability on graphs.

Step 2. Estimate geodesics, i.e. the shortest distance between two nodes x_i, x_j of the graph, measured on the manifold, see figure (2). Use dynamic programming to calculate the shortest path from A to B, see figure (3). This gives a geodesic distance \Delta_{ij} between x_i and x_j measured on the manifold.

Step 3. Apply multi-dimensional scaling (MDS) to the geodesic distances \Delta_{ij}. MDS is described at the end of this lecture.

The complexity of this algorithm is as follows: (1) k nearest neighbors, complexity O(n^2 D); (2) shortest paths by dynamic programming, complexity O(n^2 k + n^2 log n); (3) MDS, complexity O(n^2 d). The overall complexity is high, which makes the algorithm impractical for large numbers of datapoints. A possible solution is Landmark ISOMAP: identify a subset of the datapoints as landmarks, run ISOMAP on the landmarks, and interpolate for the other datapoints.

Comment. This is a simple and intuitive algorithm. There is a problem if the algorithm finds short-cuts, see figure (4). It only projects the given datapoints and does not specify how to project new data. The results are intuitive. For example, take two images of a face and interpolate in ISOMAP feature space. ISOMAP was the first of several methods which used similar strategies.

Figure 4. Short-cuts can cause problems for ISOMAP. The short-cut is a connection between two different spirals of the surface. It prevents the geodesic from lying on the surface and instead allows it to take short-cuts.

Theoretical guarantee. If the manifold is isometric to a convex subset of Euclidean space, then ISOMAP recovers this subspace (up to rotation and translation). An isometry is a distance-preserving map.
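The three steps above map directly onto standard numerical routines. Below is a minimal sketch; the use of NumPy, SciPy, and scikit-learn is an assumption (not part of the notes), Dijkstra's algorithm stands in for the dynamic-programming shortest-path computation, and the k-NN graph is assumed to be connected.

    # Sketch of ISOMAP, assuming the data X is an (N, D) numpy array.
    import numpy as np
    from scipy.sparse.csgraph import shortest_path
    from sklearn.neighbors import kneighbors_graph

    def isomap(X, k=10, d=2):
        # Step 1: k-nearest-neighbor graph with Euclidean edge weights.
        G = kneighbors_graph(X, n_neighbors=k, mode='distance')
        # Step 2: geodesic distances = shortest paths on the graph (Dijkstra).
        Delta = shortest_path(G, method='D', directed=False)
        # Step 3: classical MDS on the squared geodesic distances.
        N = X.shape[0]
        J = np.eye(N) - np.ones((N, N)) / N        # centering matrix
        Gram = -0.5 * J @ (Delta ** 2) @ J         # double centering
        evals, evecs = np.linalg.eigh(Gram)
        idx = np.argsort(evals)[::-1][:d]          # keep the d biggest eigenvalues
        return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))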

3. Local Linear Embedding (LLE)

Step 1. As for ISOMAP, construct the adjacency graph using k-NN.

Step 2. Construct weights: characterize the local geometry by weights w_ij chosen to minimize

\Phi(w) = \sum_i \| x_i - \sum_j w_{ij} x_j \|^2.

This is the error of reconstructing/predicting x_i from its neighbors. Here w_ii = 0 and \sum_j w_{ij} = 1.

Local invariance. The optimal weights are invariant to rotation, translation, and scaling.

Step 3. Linearization. This maps the inputs to outputs y_i chosen to minimize the reconstruction error

\Psi(y) = \sum_i \| y_i - \sum_j w_{ij} y_j \|^2,

subject to the constraints \sum_j y_j = 0 and (1/N) \sum_i y_i y_i^T = I (the identity matrix). Reformulate this as

\Psi(y) = \sum_{ij} \Psi_{ij} \, y_i \cdot y_j, where \Psi = (I - W)^T (I - W).

Use the eigenvectors with the bottom d + 1 eigenvalues when projecting onto d dimensions (the bottom eigenvector is constant and is discarded). Both optimizations, of \Phi(w) and of \Psi(y), can be performed by linear algebra.

Properties. This has polynomial-time optimization (there is no need to compute geodesics). It does not estimate the dimension. It has no theoretical guarantees.
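Both steps reduce to small linear systems plus one eigenproblem. The sketch below assumes NumPy and scikit-learn, an (N, D) data array X, and adds a small regularizer on the local Gram matrices for numerical stability; these choices are assumptions, not part of the notes.

    # Sketch of Locally Linear Embedding.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def lle(X, k=10, d=2, reg=1e-3):
        N = X.shape[0]
        _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        idx = idx[:, 1:]                           # drop the point itself
        W = np.zeros((N, N))
        # Step 2: reconstruction weights, one small linear solve per point.
        for i in range(N):
            Z = X[idx[i]] - X[i]                   # neighbors centered at x_i
            C = Z @ Z.T                            # local k x k Gram matrix
            C += reg * np.trace(C) * np.eye(k)     # regularize for stability
            w = np.linalg.solve(C, np.ones(k))
            W[i, idx[i]] = w / w.sum()             # enforce sum_j w_ij = 1
        # Step 3: bottom eigenvectors of Psi = (I - W)^T (I - W).
        M = (np.eye(N) - W).T @ (np.eye(N) - W)
        evals, evecs = np.linalg.eigh(M)
        return evecs[:, 1:d + 1]                   # discard the constant eigenvector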

4. Laplacian Eigenmaps

This is closely related to Local Linear Embedding (LLE).

Step 1. Build the adjacency graph.

Step 2. Assign weights w_{ij} = \exp\{-\beta \| x_i - x_j \|^2\}.

Step 3. Compute the outputs by minimizing

\Psi(y) = \sum_{ij} w_{ij} \left\| \frac{y_i}{\sqrt{D_{ii}}} - \frac{y_j}{\sqrt{D_{jj}}} \right\|^2, where D_{ii} = \sum_j w_{ij}.

Questions. (1) How do we estimate the dimensionality? ISOMAP does this (by the eigenvalues) but LLE does not. (2) Should we preserve distances? Will other criteria give us better ways to reduce the dimension?

Two new algorithms. Algorithm 1 preserves local distances (isometry). Algorithm 2 preserves local angles (conformal mappings). An isometry is a smooth invertible mapping that preserves distances and looks locally like a rotation and translation. Intuition: bend a sheet of paper but do not tear it.

Algorithm 1: Maximum Variance Unfolding. This generalizes the PCA computation of the maximum variance subspace. Let K_{ij} = y_i \cdot y_j be the Gram matrix. Maximize \sum_i K_{ii} subject to: (i) K_{ii} + K_{jj} - 2 K_{ij} = \| x_i - x_j \|^2 for neighboring pairs (i, j) in the graph, (ii) \sum_{ij} K_{ij} = 0, (iii) K is positive semi-definite. This is a semi-definite program (SDP).

Summary: (1) nearest neighbors to compute the adjacency graph; (2) semi-definite programming (SDP) to compute the maximum variance unfolding; (3) diagonalize the Gram matrix and estimate the dimension from the rank. This can be sped up by using landmarks, but it is still limited to isometric transformations. (A solver-based sketch of this SDP is given after Section 5 below.)

Algorithm 2: Preserve local angles. A conformal mapping is composed of rotation, translation, and scaling. Construct the k nearest neighbor graph and compare edges: are the lengths the same up to a scaling constant? Define the cost

D = \sum_{ijk} \zeta_{ij} \zeta_{ik} \left\{ \| z_j - z_k \|^2 - s_i \| x_j - x_k \|^2 \right\}^2,

where \zeta_{ij} = 1 if j is a neighbor of i and 0 otherwise. Given inputs {x_i}, can we compute outputs {z_i} and scaling parameters s_i such that the angles are well preserved?

Problem. There is a trivial solution s = 0, z = 0, so we need to regularize the solution. Require the z's to be expressed in the form z_i = L y_i, where the y_i are built from the m smallest eigenvectors of the graph Laplacian D - W (with D_{ii} = \sum_j w_{ij}), and impose Tr(L^T L) = 1 to prevent the trivial solution. The strategy for minimizing D(L, s) (requiring z_i = L y_i) is to solve for the scaling factors by setting \partial D / \partial s_i = 0; the remaining problem reduces to a semi-definite program.

5. Relation to Kernel PCA

Kernel PCA (kPCA) is linear in feature space and non-linear in input space. But can kernel PCA perform isometric transformations? Answer: no, although kPCA with an RBF kernel is approximately a local isometry. However, spectral methods such as ISOMAP can be seen as ways to construct kernel matrices, i.e. not generic kernels, but data-driven kernels.
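The maximum variance unfolding step (Algorithm 1 above) translates almost line for line into a convex-optimization modeling language. The sketch below assumes the cvxpy package, a list of neighboring pairs edges, and a matrix d2 of squared Euclidean distances; these names and the choice of library are assumptions, not part of the notes.

    # Sketch of the maximum variance unfolding SDP.
    import cvxpy as cp

    def mvu_gram(d2, edges, N):
        # d2[i, j]: squared distance |x_i - x_j|^2; edges: neighboring pairs (i, j).
        K = cp.Variable((N, N), PSD=True)              # Gram matrix, constrained PSD
        constraints = [cp.sum(K) == 0]                 # centering: sum_ij K_ij = 0
        for i, j in edges:
            # preserve local distances: K_ii + K_jj - 2 K_ij = |x_i - x_j|^2
            constraints.append(K[i, i] + K[j, j] - 2 * K[i, j] == d2[i, j])
        prob = cp.Problem(cp.Maximize(cp.trace(K)), constraints)  # maximize sum_i K_ii
        prob.solve()
        return K.value

The embedding is then obtained by diagonalizing the learned Gram matrix (e.g. with numpy.linalg.eigh) and estimating the dimension from the number of significantly non-zero eigenvalues.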

6. Multidimensional Scaling (MDS)

MDS is a linear projection method similar to PCA. But it has the attractive property that it does not require knowing the feature vectors x of the datapoints. Instead it only needs to know the distances between pairs of datapoints. This is attractive for problems where it is hard to decide what features to use, e.g. for representing a picture, but easier to decide whether two pictures are similar. It also makes MDS suitable for nonlinear dimension reduction, because it can be run on distances measured on the manifold.

The basic idea of MDS is to project the data so as to preserve the distances |x_i - x_j| between the datapoints. In other words, we project x to a lower-dimensional vector y so that |y_i - y_j| is approximately |x_i - x_j|. This constraint is imposed on the dot products, x_i \cdot x_j \approx y_i \cdot y_j, which implies the result for the distances.

Key result: we only need to know \Delta_{ij} = |x_i - x_j| in order to calculate x_i \cdot x_j (i.e. we only need to know the distances between points and not their positions x).

Result.

x_i \cdot x_j = \frac{1}{2N} \sum_k \Delta^2_{ik} + \frac{1}{2N} \sum_l \Delta^2_{lj} - \frac{1}{2N^2} \sum_{lk} \Delta^2_{lk} - \frac{1}{2} \Delta^2_{ij},

provided \sum_i x_i = 0 (which we can always enforce).

Proof. \Delta^2_{ij} = |x_i|^2 + |x_j|^2 - 2 x_i \cdot x_j. Let T = \sum_i |x_i|^2. (Note that \sum_i x_i \cdot x_j = 0 = \sum_j x_i \cdot x_j.) Then \sum_k \Delta^2_{ik} = N |x_i|^2 + T, \sum_l \Delta^2_{lj} = N |x_j|^2 + T, and \sum_{lk} \Delta^2_{lk} = 2NT. The result follows by substituting these into the right-hand side.

Now we proceed by defining the Gram matrix

G_{ij} = x_i \cdot x_j = \frac{1}{2N} \sum_k \Delta^2_{ik} + \frac{1}{2N} \sum_l \Delta^2_{lj} - \frac{1}{2N^2} \sum_{lk} \Delta^2_{lk} - \frac{1}{2} \Delta^2_{ij},

and an error function for the projection:

Err(y) = \sum_{ij} (G_{ij} - y_i \cdot y_j)^2.

Here x lies in D-dimensional space, G is an N x N matrix (N is the number of datapoints), and y lies in a d-dimensional space, with d << D and d < N.
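The key result can be implemented directly. A minimal sketch, assuming NumPy and an (N, N) matrix Delta of pairwise distances (the function name is hypothetical, not from the notes):

    # Build the Gram matrix G_ij = x_i . x_j from the pairwise distances alone.
    import numpy as np

    def gram_from_distances(Delta):
        N = Delta.shape[0]
        D2 = Delta ** 2
        return (D2.sum(axis=1, keepdims=True) / (2 * N)    # (1/2N) sum_k Delta^2_ik
                + D2.sum(axis=0, keepdims=True) / (2 * N)  # (1/2N) sum_l Delta^2_lj
                - D2.sum() / (2 * N ** 2)                  # (1/2N^2) sum_lk Delta^2_lk
                - D2 / 2)                                  # (1/2) Delta^2_ij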

7. MDS Procedure

To minimize Err(y) we perform the spectral decomposition of the Gram matrix:

G = \sum_a \lambda_a v^a (v^a)^T, or in coordinates G_{ij} = \sum_a \lambda_a v^a_i v^a_j,

where \lambda_a is the a-th eigenvalue of G, v^a its eigenvector, and v^a_i the i-th component of the eigenvector v^a. We use the convention that \lambda_1 \geq ... \geq \lambda_N and, as always, the eigenvectors are orthonormal: v^a \cdot v^b = 0 if a \neq b, and |v^a| = 1.

Result. The optimal minimization of Err(y) is obtained by setting the vectors y_i to have N components y^a_i = \sqrt{\lambda_a} v^a_i, i.e. y_i = (\sqrt{\lambda_1} v^1_i, ..., \sqrt{\lambda_N} v^N_i).

Proof. y_i \cdot y_j = \sum_a \sqrt{\lambda_a} v^a_i \sqrt{\lambda_a} v^a_j = \sum_a \lambda_a v^a_i v^a_j = G_{ij}. I.e. we get Err(y) = 0 if we set y^a_i = \sqrt{\lambda_a} v^a_i.

But this does not reduce the dimension (even though it gives zero error). We can reduce the dimension, while raising the error, by keeping only the first d components, i.e. set y_i = (\sqrt{\lambda_1} v^1_i, ..., \sqrt{\lambda_d} v^d_i) (recall that we ordered the eigenvalues \lambda_1 \geq ... \geq \lambda_N, so we are picking the biggest d eigenvalues). In this case G_{ij} \approx y_i \cdot y_j. More precisely, we compute that

G_{ij} - y_i \cdot y_j = \sum_{a=d+1}^N \lambda_a v^a_i v^a_j.

It follows that the error is Err(y) = \sum_{a=d+1}^N \lambda_a^2. This justifies keeping the first d components (corresponding to the biggest d eigenvalues).

8. Relations between PCA and MDS

Both are linear projections. Both involve projections specified by the biggest d eigenvalues/eigenvectors of a matrix which depends on the data. PCA uses the eigenvalues/eigenvectors of the covariance \sum_{i=1}^N x_i x_i^T. MDS uses the eigenvalues/eigenvectors of the Gram matrix x_i \cdot x_j.

Result. The covariance and the Gram matrix have the same non-zero eigenvalues and their eigenvectors are related.

Proof. This result was discussed in an earlier lecture (in the PCA lecture, in the section which discussed SVD). To obtain it, we defined the matrix X with components x_{ia}. Then the covariance is expressed as the matrix X X^T and the Gram matrix as X^T X. If e is an eigenvector of X X^T with eigenvalue \lambda, i.e. X X^T e = \lambda e, then X^T e is an eigenvector of X^T X with eigenvalue \lambda, because X^T X (X^T e) = X^T (X X^T e) = X^T \lambda e = \lambda (X^T e). Similarly we can relate the eigenvectors/eigenvalues of X^T X to those of X X^T.

In summary, the error criteria of PCA and MDS depend on the same eigenvalues, and dimension reduction occurs by rejecting the coordinate directions corresponding to the smallest eigenvalues. The difference is that PCA and MDS project using the eigenvectors of different, but related, matrices (the covariance or the Gram matrix).
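To complete the MDS procedure of Section 7: diagonalize the Gram matrix and keep the components with the d biggest eigenvalues. The sketch below assumes NumPy and randomly generated test data; for centered data the Gram matrix X X^T equals the one computed from distances alone by gram_from_distances above. It also checks the Section 8 result that X X^T and X^T X share their non-zero eigenvalues. The names used here are assumptions, not part of the notes.

    # Completing MDS: embed from the Gram matrix, and check the PCA/MDS relation.
    import numpy as np

    def mds_embedding(G, d=2):
        evals, evecs = np.linalg.eigh(G)         # eigenvalues in ascending order
        order = np.argsort(evals)[::-1][:d]      # keep the d biggest eigenvalues
        lam = np.maximum(evals[order], 0.0)      # guard against round-off negatives
        return evecs[:, order] * np.sqrt(lam)    # y^a_i = sqrt(lambda_a) v^a_i

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))                 # N = 50 datapoints in D = 5 dimensions
    X -= X.mean(axis=0)                          # enforce sum_i x_i = 0
    G = X @ X.T                                  # Gram matrix of the centered data
    Y = mds_embedding(G, d=2)                    # the d-dimensional MDS embedding

    # Section 8: X X^T and X^T X have the same non-zero eigenvalues.
    print(np.allclose(np.linalg.eigvalsh(X @ X.T)[::-1][:5],
                      np.linalg.eigvalsh(X.T @ X)[::-1][:5]))   # prints True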