Nonlinear Dimensionality Reduction

Nonlinear Dimensionality Reduction Piyush Rai CS5350/6350: Machine Learning October 25, 2011

Recap: Linear Dimensionality Reduction. Linear dimensionality reduction is based on a linear projection of the data: it assumes that the data lives close to a lower-dimensional linear subspace and projects the data onto that subspace. The data X is N × D, the projection matrix U is D × K, and the projection Z is N × K: Z = XU. Using U^T U = I (orthonormality of the eigenvectors), we have X ≈ Z U^T. Linear dimensionality reduction thus does a matrix factorization of X.
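
To make the recap concrete, here is a minimal NumPy sketch of this projection view; the function name and the assumption that X is already centered are mine, not the lecture's.

# Minimal sketch of linear dimensionality reduction via the top principal directions.
import numpy as np
def linear_dim_reduction(X, n_components):
    """Project N x D (centered) data X onto its top-K principal directions."""
    cov = X.T @ X / X.shape[0]                  # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :n_components]      # projection matrix U (D x K)
    Z = X @ U                                   # projection Z = XU (N x K)
    X_recon = Z @ U.T                           # reconstruction X ≈ Z U^T (uses U^T U = I)
    return Z, U, X_recon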

Dimensionality Reduction as Matrix Factorization. The matrix factorization view helps reveal latent aspects of the data: in PCA, each principal component corresponds to a latent aspect.

Examples: Netflix Movie-Ratings Data. The K principal components correspond to K underlying genres; Z denotes the extent to which each user likes the different movie genres.

Examples: Amazon Book-Ratings Data. The K principal components correspond to K underlying genres; Z denotes the extent to which each user likes the different book genres.

Examples: Identifying Topics in Document Collections. The K principal components correspond to K underlying topics; Z denotes the extent to which each topic is represented in a document.

Examples: Image Dictionary (Template) Learning. The K principal components correspond to K image templates (the dictionary); Z denotes the extent to which each dictionary element is represented in an image.

Nonlinear Dimensionality Reduction. Given: a low-dimensional surface embedded nonlinearly in a high-dimensional space; such a structure is called a manifold. Goal: recover the low-dimensional surface.

Linear projection may not be good enough. Consider the swiss-roll dataset (points lying close to a manifold): linear projection methods (e.g., PCA) can't capture the intrinsic nonlinearities.

Nonlinear Dimensionality Reduction. We want to do nonlinear projections, and different criteria could be used for such projections. Most nonlinear methods try to preserve the neighborhood information: locally linear structures (locally linear, globally nonlinear) or pairwise distances (along the nonlinear manifold). This roughly translates to "unrolling" the manifold.

Nonlinear Dimensionality Reduction. Two ways of doing it: nonlinearize a linear dimensionality reduction method, e.g., Kernel PCA (nonlinear PCA); or use manifold-based methods, e.g., Locally Linear Embedding (LLE), Isomap, Maximum Variance Unfolding, Laplacian Eigenmaps, and several others (Hessian LLE, Hessian Eigenmaps, etc.).

Kernel PCA. Given N observations {x_1,...,x_N}, x_n ∈ R^D, define the D × D covariance matrix (assuming centered data, ∑_n x_n = 0): S = (1/N) ∑_{n=1}^N x_n x_n^T. Linear PCA: compute eigenvectors u_i satisfying S u_i = λ_i u_i, i = 1,...,D. Now consider a nonlinear transformation φ(x) of x into an M-dimensional space. The M × M covariance matrix in this space (assuming centered data, ∑_n φ(x_n) = 0) is C = (1/N) ∑_{n=1}^N φ(x_n) φ(x_n)^T. Kernel PCA: compute eigenvectors v_i satisfying C v_i = λ_i v_i, i = 1,...,M. Ideally, we would like to do this without having to compute the φ(x_n)'s.

Kernel PCA. Compute eigenvectors v_i satisfying C v_i = λ_i v_i. Plugging in the expression for C, we have the eigenvector equation: (1/N) ∑_{n=1}^N φ(x_n) {φ(x_n)^T v_i} = λ_i v_i. Using the above, we can write v_i as: v_i = ∑_{n=1}^N a_{in} φ(x_n). Plugging this back into the eigenvector equation: (1/N) ∑_{n=1}^N φ(x_n) φ(x_n)^T ∑_{m=1}^N a_{im} φ(x_m) = λ_i ∑_{n=1}^N a_{in} φ(x_n). Pre-multiplying both sides by φ(x_l)^T and re-arranging: (1/N) ∑_{n=1}^N φ(x_l)^T φ(x_n) ∑_{m=1}^N a_{im} φ(x_n)^T φ(x_m) = λ_i ∑_{n=1}^N a_{in} φ(x_l)^T φ(x_n).

Kernel PCA. Using φ(x_n)^T φ(x_m) = k(x_n, x_m), the eigenvector equation becomes: (1/N) ∑_{n=1}^N k(x_l, x_n) ∑_{m=1}^N a_{im} k(x_n, x_m) = λ_i ∑_{n=1}^N a_{in} k(x_l, x_n). Define K as the N × N kernel matrix with K_{nm} = k(x_n, x_m); K_{nm} is the similarity of the two examples x_n and x_m in the φ space, where φ is implicitly defined by the kernel function k (which can be, e.g., the RBF kernel). Define a_i as the N × 1 vector with elements a_{in}. Using K and a_i, the eigenvector equation becomes K^2 a_i = λ_i N K a_i, i.e., K a_i = λ_i N a_i. This corresponds to the original kernel PCA eigenvalue problem C v_i = λ_i v_i. For a projection to K < D dimensions, the top K eigenvectors of the kernel matrix are used.
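
As a concrete illustration (not from the lecture), the sketch below builds an RBF kernel matrix for a toy dataset and solves this eigenproblem directly; the kernel choice and the gamma parameter are assumptions.

# Sketch of the eigenproblem K a_i = lambda_i N a_i with an RBF kernel.
import numpy as np
def rbf_kernel_matrix(X, gamma=1.0):
    """K_nm = exp(-gamma * ||x_n - x_m||^2) for N x D data X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * sq_dists)
X = np.random.randn(100, 5)              # toy data, N = 100, D = 5
K = rbf_kernel_matrix(X, gamma=0.5)
mu, A = np.linalg.eigh(K)                # K a_i = mu_i a_i, with mu_i = N * lambda_i
mu, A = mu[::-1], A[:, ::-1]             # sort eigenpairs in decreasing order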

Kernel PCA: Centering the Data. In PCA, we centered the data before computing the covariance matrix; for kernel PCA, we need to do the same: φ̃(x_n) = φ(x_n) - (1/N) ∑_{l=1}^N φ(x_l). How does this affect the kernel matrix K which is eigen-decomposed? K̃_{nm} = φ̃(x_n)^T φ̃(x_m) = φ(x_n)^T φ(x_m) - (1/N) ∑_{l=1}^N φ(x_n)^T φ(x_l) - (1/N) ∑_{l=1}^N φ(x_l)^T φ(x_m) + (1/N^2) ∑_{j=1}^N ∑_{l=1}^N φ(x_j)^T φ(x_l) = k(x_n, x_m) - (1/N) ∑_{l=1}^N k(x_n, x_l) - (1/N) ∑_{l=1}^N k(x_l, x_m) + (1/N^2) ∑_{j=1}^N ∑_{l=1}^N k(x_j, x_l). In matrix notation, the centered kernel matrix is K̃ = K - 1_N K - K 1_N + 1_N K 1_N, where 1_N is the N × N matrix with every element equal to 1/N. Eigen-decomposition is then done on the centered kernel matrix K̃.
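
In code, the centering step is a one-liner in matrix form; this sketch continues the toy setup above and builds 1_N explicitly as a matrix of 1/N entries.

# Center the kernel matrix: K_tilde = K - 1_N K - K 1_N + 1_N K 1_N.
N = K.shape[0]
one_N = np.full((N, N), 1.0 / N)
K_tilde = K - one_N @ K - K @ one_N + one_N @ K @ one_N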

Kernel PCA: The Projection. Suppose {a_1,...,a_K} are the top K eigenvectors of the kernel matrix K. The K-dimensional KPCA projection z = [z_1,...,z_K] of a point x has components z_i = φ(x)^T v_i. Recall the definition of v_i: v_i = ∑_{n=1}^N a_{in} φ(x_n). Thus z_i = φ(x)^T v_i = ∑_{n=1}^N a_{in} k(x, x_n).
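
Putting the pieces together, here is a sketch of the projection step for the training points, continuing the sketches above; the rescaling of a_i (so that the feature-space eigenvectors v_i have unit norm) is a standard detail that the slide leaves implicit.

# Project onto the top K_dims kernel principal components.
K_dims = 2
mu_c, A_c = np.linalg.eigh(K_tilde)                    # eigendecompose the centered kernel matrix
mu_c, A_c = mu_c[::-1][:K_dims], A_c[:, ::-1][:, :K_dims]
A_c = A_c / np.sqrt(np.maximum(mu_c, 1e-12))           # normalize so that each v_i has unit norm
Z = K_tilde @ A_c                                      # z_i(x_m) = sum_n a_in k(x_m, x_n), an N x K_dims embedding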

Manifold-Based Methods: Locally Linear Embedding (LLE), Isomap, Maximum Variance Unfolding, Laplacian Eigenmaps, and several others (Hessian LLE, Hessian Eigenmaps, etc.).

Locally Linear Embedding. Based on a simple geometric intuition of local linearity: assume each example and its neighbors lie on or close to a locally linear patch of the manifold. The LLE assumption: the projection should preserve the neighborhood, i.e., a projected point should have the same neighborhood as the original point.

Locally Linear Embedding: The Algorithm. Given D-dimensional data {x_1,...,x_N}, compute K-dimensional projections {z_1,...,z_N}. For each example x_i, find its L nearest neighbors. Assume x_i to be a weighted linear combination of its L nearest neighbors: x_i ≈ ∑_{j ∈ N_i} W_{ij} x_j (so the data is assumed locally linear). Find the weights by solving the following least-squares problem: W = argmin_W ∑_{i=1}^N ||x_i - ∑_{j ∈ N_i} W_{ij} x_j||^2 s.t. ∑_j W_{ij} = 1 ∀i, where N_i are the L nearest neighbors of x_i (note: should choose L ≥ K + 1). Use W to compute the low-dimensional projections Z = {z_1,...,z_N} by solving: Z = argmin_Z ∑_{i=1}^N ||z_i - ∑_{j ∈ N_i} W_{ij} z_j||^2 s.t. ∑_{i=1}^N z_i = 0 and (1/N) Z^T Z = I. Refer to the LLE reading (Appendix A and B) for the details of these steps.
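
A compact NumPy sketch of the two steps follows; the regularization of the local Gram matrix and all variable names are mine, and the LLE reading remains the authoritative reference.

# Sketch of LLE: local reconstruction weights, then embedding from (I - W)^T (I - W).
import numpy as np
def lle(X, n_neighbors=10, n_components=2, eps=1e-3):
    N = X.shape[0]
    # L nearest neighbors of each point by Euclidean distance
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :n_neighbors]
    # Step 1: reconstruction weights with the constraint sum_j W_ij = 1
    W = np.zeros((N, N))
    for i in range(N):
        Ni = nbrs[i]
        G = (X[Ni] - X[i]) @ (X[Ni] - X[i]).T            # local Gram matrix
        G += eps * np.trace(G) * np.eye(n_neighbors)     # regularize if (near-)singular
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, Ni] = w / w.sum()                           # enforce the sum-to-one constraint
    # Step 2: embedding from the bottom eigenvectors of M = (I - W)^T (I - W)
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:n_components + 1]                   # skip the constant eigenvector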

LLE: Examples

Isometric Feature Mapping (Isomap). A graph-based algorithm based on constructing a matrix of geodesic distances. Identify the L nearest neighbors of each data point (just like LLE), connect each point to all its neighbors (an edge for each neighbor), and assign a weight to each edge based on the Euclidean distance. Estimate the geodesic distance d_{ij} between any two data points i and j, approximated by the sum of arc lengths along the shortest path between i and j in the graph (which can be computed using Dijkstra's algorithm). Construct the N × N distance matrix D = {d_{ij}^2}.
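
A sketch of this construction using SciPy's shortest_path routine for the shortest-path step (it implements Dijkstra's algorithm); the neighborhood size and the function name are illustrative.

# Sketch of the geodesic-distance estimate: kNN graph with Euclidean edge weights,
# then all-pairs shortest paths.
import numpy as np
from scipy.sparse.csgraph import shortest_path
def geodesic_distance_matrix(X, n_neighbors=10):
    N = X.shape[0]
    d = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    graph = np.full((N, N), np.inf)                      # inf marks "no edge"
    idx = np.argsort(d, axis=1)[:, 1:n_neighbors + 1]    # nearest neighbors, skipping self
    for i in range(N):
        graph[i, idx[i]] = d[i, idx[i]]                  # edge weight = Euclidean distance
    graph = np.minimum(graph, graph.T)                   # make the graph undirected
    d_geo = shortest_path(graph, method='D', directed=False)
    return d_geo ** 2                                    # the N x N matrix D = {d_ij^2}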

Isomap (Contd.). Use the distance matrix D to construct the Gram matrix G = -(1/2) H D H, where G is N × N and H = I - (1/N) 1 1^T; I is the N × N identity matrix and 1 is the N × 1 vector of ones. Do an eigen-decomposition of G: let the eigenvectors be {v_1,...,v_N} with eigenvalues {λ_1,...,λ_N}, where each eigenvector v_i is N-dimensional, v_i = [v_{1i}, v_{2i},...,v_{Ni}]. Take the top K eigenvalues/eigenvectors. The K-dimensional embedding z_i = [z_{i1}, z_{i2},...,z_{iK}] of a point x_i is given by z_{ik} = √λ_k v_{ki}.
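
And a matching sketch of this embedding step (classical MDS on the geodesic distances); it assumes the squared-distance matrix returned by the previous sketch.

# Sketch of the Isomap embedding: G = -(1/2) H D H, then top-K scaled eigenvectors.
import numpy as np
def isomap_embedding(D_sq, n_components=2):
    N = D_sq.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N                  # centering matrix H = I - (1/N) 1 1^T
    G = -0.5 * H @ D_sq @ H                              # Gram matrix of the embedding
    vals, vecs = np.linalg.eigh(G)
    vals, vecs = vals[::-1][:n_components], vecs[:, ::-1][:, :n_components]
    return vecs * np.sqrt(np.maximum(vals, 0.0))         # z_ik = sqrt(lambda_k) * v_ki

For example, isomap_embedding(geodesic_distance_matrix(X), n_components=2) would give 2-dimensional Isomap coordinates for the rows of X.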

Isomap: Example Digit images projected down to 2 dimensions

Isomap: Example Face images with varying poses