L26: Advanced dimensionality reduction


- The snapshot PCA approach
- Oriented Principal Components Analysis
- Non-linear dimensionality reduction (manifold learning)
  - ISOMAP
  - Locally Linear Embedding

CSCE 666 Pattern Analysis, Ricardo Gutierrez-Osuna, CSE@TAMU

Problem definition: the snapshot PCA approach

- Imagine that we have collected a small number N of samples x^(p) (p = 1, ..., N), each of which has a very high number of features D (e.g., high-resolution images or 3D scans)
- In the conventional PCA approach, we would compute the sample covariance matrix as

    C = \frac{1}{N-1} \sum_{p=1}^{N} (x^{(p)} - \mu)(x^{(p)} - \mu)^T

  and then try to diagonalize it
- This approach has two problems
  - First, the sample covariance matrix will not be full rank, in which case direct inversion is not possible (we would need to use the SVD)
  - Second, C will be very large (e.g., 10,000x10,000, several hundred MB in double precision, for 100x100 images)
- However, we know that at most N eigenvectors will have non-zero eigenvalues. Is there a better way to find these eigenvectors?

[This material is based on an unpublished manuscript by Sam Roweis, "Finding the first few eigenvectors in a large space"]
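As a quick sanity check of the rank claim, the following sketch (the sizes N = 10 and D = 200 are made up for illustration) builds a sample covariance from a handful of high-dimensional samples and confirms that its rank is at most N - 1:

```python
import numpy as np

# Illustrative check: with N samples in D >> N dimensions, the sample
# covariance has rank at most N - 1, so at most N - 1 eigenvalues are
# non-zero and direct inversion is impossible.
rng = np.random.default_rng(0)
N, D = 10, 200
X = rng.normal(size=(N, D))            # rows are samples x^(p)
Xc = X - X.mean(axis=0)                # center the data
C = Xc.T @ Xc / (N - 1)                # D x D sample covariance
print(C.shape)                         # (200, 200)
print(np.linalg.matrix_rank(C))        # at most N - 1 = 9
```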

The snapshot trick is based on the fact that the eigenvectors are linear combinations of the data samples
- Note that the eigenvectors capture the directions of variance, and there is no variance in directions normal to the subspace spanned by the data
- Thus, we will seek to express the PCA decomposition in a manner that depends only on the number of samples N

Derivation
- Assume that the data has been centered by subtracting its mean. Then, the covariance matrix can be expressed as

    C = \frac{1}{N-1} \sum_{p=1}^{N} x^{(p)} x^{(p)T}

- Since the eigenvectors are linear combinations of the data, we can express them as

    e_j = \sum_{p=1}^{N} \alpha_p^j x^{(p)}

  where e_j denotes the j-th eigenvector
- With this formulation, our goal becomes finding the constants \alpha_p^j

Since the e_j are eigenvectors, they satisfy the condition

    C e_j = \lambda_j e_j

which, after substituting the expressions for C and e_j, becomes

    \frac{1}{N-1} \sum_{p1=1}^{N} x^{(p1)} x^{(p1)T} \sum_{p2=1}^{N} \alpha_{p2}^j x^{(p2)} = \lambda_j \sum_{p3=1}^{N} \alpha_{p3}^j x^{(p3)}

    \frac{1}{N-1} \sum_{p1=1}^{N} \sum_{p2=1}^{N} \alpha_{p2}^j \, x^{(p1)} \left( x^{(p1)T} x^{(p2)} \right) = \lambda_j \sum_{p3=1}^{N} \alpha_{p3}^j x^{(p3)}

We now define the matrix R of sample inner products between pairs of samples (whereas the covariance matrix is the sample outer product)

    R_{p1 p2} = \frac{1}{N-1} x^{(p1)T} x^{(p2)}

Substituting this matrix into the previous expression yields

    \sum_{p1=1}^{N} \sum_{p2=1}^{N} \alpha_{p2}^j \, x^{(p1)} R_{p1 p2} = \lambda_j \sum_{p3=1}^{N} \alpha_{p3}^j x^{(p3)}

which, merging subindices p1 and p3 and moving all terms to the LHS, yields

    \sum_{p1=1}^{N} x^{(p1)} \left( \sum_{p2=1}^{N} \alpha_{p2}^j R_{p1 p2} - \lambda_j \alpha_{p1}^j \right) = 0

This condition can be met by finding \alpha^j such that*

    \sum_{p2=1}^{N} \alpha_{p2}^j R_{p1 p2} - \lambda_j \alpha_{p1}^j = 0 \quad \forall j, p1

which can be written as

    R \alpha^j = \lambda_j \alpha^j

Therefore, the N-dimensional vectors \alpha^j are the eigenvectors of matrix R, which can be found in the conventional manner since R has size N x N
Once the \alpha_p^j have been found, the actual eigenvectors of the data e_j are obtained as a weighted sum of the training samples

    e_j = \sum_{p=1}^{N} \alpha_p^j x^{(p)}

*This is one out of possibly many solutions (i.e., ways of expressing the eigenvectors as linear combinations of the samples), but one that makes the problem very easy to solve, as we see next
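A minimal sketch of the snapshot trick, assuming the data is stored row-wise in an (N, D) array that has already been centered; the function name and normalization choices below are mine, not from the slides:

```python
import numpy as np

def snapshot_pca(X, d):
    """Snapshot PCA sketch: X is (N, D) with N << D, rows already centered.

    Works with the N x N matrix R of sample inner products instead of the
    D x D covariance matrix, then maps the eigenvectors back to input space.
    """
    N = X.shape[0]
    R = X @ X.T / (N - 1)                 # R[p1, p2] = x^(p1)^T x^(p2) / (N - 1)
    evals, alphas = np.linalg.eigh(R)     # eigenvectors alpha^j of R
    order = np.argsort(evals)[::-1][:d]   # keep the d largest eigenvalues
    E = X.T @ alphas[:, order]            # e_j = sum_p alpha_p^j x^(p)
    E /= np.linalg.norm(E, axis=0)        # normalize eigenvectors to unit length
    return evals[order], E                # eigenvalues and the D x d eigenvectors

# Usage sketch: project the (centered) data onto the leading eigenvectors
# lam, E = snapshot_pca(X, d=5); Z = X @ E
```

The non-zero eigenvalues of R coincide with those of C, so the same principal directions are recovered while only an N x N eigendecomposition is ever performed.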

Oriented principal components analysis

- OPCA is a generalization of PCA that uses a generalized eigenvalue problem with two covariance matrices
- The cost function maximized by OPCA is the signal-to-noise ratio between a pair of high-dimensional signals u and v

    J_{OPCA}(w) = \frac{E[(w^T u)^2]}{E[(w^T v)^2]} = \frac{w^T S_u w}{w^T S_v w}

  where S_u = E[u u^T] and S_v = E[v v^T]
- Since S_u and S_v are symmetric, all the generalized eigenvectors are real and can be sorted by decreasing generalized eigenvalue
- Note that the generalized eigenvectors will NOT be orthogonal but instead satisfy the constraint

    e_i^T S_u e_j = e_i^T S_v e_j = 0 \quad (i \neq j)

- The term "oriented" is due to the fact that e_i is similar to the ordinary principal direction of u, except that it is oriented towards the least principal directions of v
  - In other words, S_v steers e_i away from the directions of high energy in v
  - If v is white noise, there is no steering, and OPCA is identical to the PCA solution
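A sketch of the OPCA computation under the assumption that sample estimates of S_u and S_v are available; scipy.linalg.eigh solves the symmetric-definite generalized eigenproblem, and the small ridge added to S_v is my own safeguard, not part of the OPCA formulation:

```python
import numpy as np
from scipy.linalg import eigh

def opca(Su, Sv, d, reg=1e-6):
    """OPCA sketch: solve the generalized eigenproblem S_u w = lambda S_v w.

    Su, Sv are symmetric signal/noise covariance estimates. A small ridge is
    added to Sv (an assumption, not in the slides) so it is positive definite.
    """
    Sv = Sv + reg * np.trace(Sv) / Sv.shape[0] * np.eye(Sv.shape[0])
    evals, W = eigh(Su, Sv)                  # generalized eigendecomposition
    order = np.argsort(evals)[::-1][:d]      # largest signal-to-noise ratios first
    return evals[order], W[:, order]         # columns w maximize w^T Su w / w^T Sv w
```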

OPCA: an example

- Consider the problem of speech/speaker recognition
  - Short-time speech signals are typically represented by a feature vector x that captures some of their stationary spectral properties (e.g., LPC, cepstrum)
- Assume that we have data from two speakers (s1, s2) uttering two different phonemes (p1, p2), as illustrated below

    [Figure: feature vectors x_p1, x_p2 for Phoneme 1 and Phoneme 2, and x_s1, x_s2 for Speaker 1 and Speaker 2]

- We are interested in finding projection vectors w of the feature vector that maximize the signal-to-noise ratio, the signal being linguistic information (LI) and the noise being speaker information (SI)

    SNR = \frac{LI}{SI}

[Malayath et al., 1997]

- We then extract feature vectors x_p1 and x_p2 from the same speaker but different phonemes; the difference between these two vectors will only contain linguistic information

    u = x_{p1} - x_{p2}

- We also extract feature vectors x_s1 and x_s2 from the same phoneme but different speakers; the difference between these two vectors will only contain speaker information

    v = x_{s1} - x_{s2}

- We define the signal and noise covariance matrices as

    S_u = E[(u - \bar{u})(u - \bar{u})^T] \qquad S_v = E[(v - \bar{v})(v - \bar{v})^T]

  and the SNR becomes

    SNR = \frac{LI}{SI} = \frac{E[(u^T w)^2]}{E[(v^T w)^2]} = \frac{w^T S_u w}{w^T S_v w}

- Thus, the projection vectors w that maximize the SNR are derived using an OPCA formulation
  - These vectors will span a speaker-independent subspace
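For this example, the two covariances could be estimated from collections of difference vectors; a hedged sketch where U and V are hypothetical arrays whose rows are the differences u = x_p1 - x_p2 and v = x_s1 - x_s2:

```python
import numpy as np
from scipy.linalg import eigh

def speaker_independent_subspace(U, V, d):
    """Sketch: OPCA projections from hypothetical difference-vector arrays.

    U: rows are x_p1 - x_p2 (same speaker, different phonemes) -> linguistic info
    V: rows are x_s1 - x_s2 (same phoneme, different speakers) -> speaker info
    """
    Su = U.T @ U / len(U)                     # signal covariance estimate E[u u^T]
    Sv = V.T @ V / len(V)                     # noise covariance estimate E[v v^T]
    Sv = Sv + 1e-6 * np.eye(Sv.shape[0])      # small ridge (my assumption) for stability
    evals, W = eigh(Su, Sv)                   # generalized eigenvectors
    return W[:, np.argsort(evals)[::-1][:d]]  # projections maximizing LI / SI
```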

Non-linear dimensionality reduction

- PCA, LDA and their variants perform a global transformation of the data (rotation/translation/rescaling)
  - These techniques assume that most of the information in the data is contained in a linear subspace
- What do we do when the data is actually embedded in a non-linear subspace (i.e., a low-dimensional manifold)?

    [Figure: PCA applied to data on a non-linear manifold; from http://www.cs.unc.edu/courses/comp290-090-s06]

One possible solution (ISOMAP)

- To find the embedding manifold, find a transformation that preserves the geodesic distances between points in the high-dimensional space
  - This approach is related to multidimensional scaling (e.g., Sammon's mapping; L10), except that MDS seeks to preserve the Euclidean distances in the high-dimensional space
- The issue is how to compute geodesic distances from sample data

    [Figure: two points on a manifold with small Euclidean distance but large geodesic distance]

ISOMAP

- ISOMAP [Tenenbaum et al., 2000] is based on two simple ideas
  - For neighboring samples, Euclidean distance provides a good approximation to the geodesic distance
  - For distant points, the geodesic distance can be approximated with a sequence of steps between clusters of neighboring points
- ISOMAP operates in three steps
  - Find the nearest neighbors of each sample
  - Find shortest paths (e.g., Dijkstra's algorithm)
  - Apply MDS

[Tenenbaum et al., 1997]

STEP 1

- Determine which points are neighbors on the manifold, based on the distances d_X(i, j) in the input space X
- This can be performed in two different ways
  - Connect each point to all points within a fixed radius \epsilon
  - Connect each point to all of its K nearest neighbors
- These neighborhood relations are represented as a weighted graph G, with an edge of weight d_X(i, j) between each pair of neighboring points

    [Figure: resulting neighborhood graph]

This discussion is borrowed from [Tenenbaum et al., 1997]
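Step 1 can be sketched with scikit-learn's neighbor-graph helpers; the library choice and parameter names here are mine, the slide only prescribes the epsilon and K variants:

```python
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

def neighborhood_graph(X, k=None, epsilon=None):
    """Sketch of Step 1: weighted neighborhood graph G for data X of shape (N, D).

    Edge weights are the input-space distances d_X(i, j); either the K-nearest-
    neighbor rule (K-Isomap) or the fixed-radius rule (epsilon-Isomap) is used.
    """
    if k is not None:                                                  # K-Isomap
        return kneighbors_graph(X, n_neighbors=k, mode='distance')
    return radius_neighbors_graph(X, radius=epsilon, mode='distance')  # epsilon-Isomap
```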

STEP 2

- Estimate the geodesic distances d_M(i, j) between all pairs of points on the manifold M by computing their shortest-path distances d_G(i, j) in the graph G
  - This can be performed, e.g., using Dijkstra's algorithm

STEP 3

- Find the d-dimensional embedding Y that best preserves the manifold's estimated distances
- In other words, apply classical MDS to the matrix of graph distances D_G = {d_G(i, j)}
- The coordinate vectors y_i are chosen to minimize the cost function

    E = \| \tau(D_G) - \tau(D_Y) \|_{L^2}

  where D_Y denotes the matrix of Euclidean distances d_Y(i, j) = \|y_i - y_j\|, and the operator \tau converts distances to inner products

    \tau(D) = -\frac{H S H}{2}

  where S is the matrix of squared distances, S_{ij} = D_{ij}^2, and H is the centering matrix

    H = I - \frac{1}{N} e e^T; \quad e = (1, 1, \ldots, 1)^T

- It can be shown that the global minimum of E is obtained by setting the coordinates y_i to the top d eigenvectors of the matrix \tau(D_G)
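Step 3 amounts to classical MDS on D_G; a minimal sketch of the tau operator and its eigendecomposition, with the coordinates scaled by sqrt(lambda_p) as in the algorithm summary that follows:

```python
import numpy as np

def classical_mds(DG, d):
    """Sketch of Step 3: classical MDS on the matrix of graph distances D_G."""
    N = DG.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N   # centering matrix H = I - (1/N) e e^T
    S = DG ** 2                           # squared distances S_ij = D_ij^2
    tau = -H @ S @ H / 2                  # tau(D_G): distances -> inner products
    evals, evecs = np.linalg.eigh(tau)    # symmetric eigendecomposition
    order = np.argsort(evals)[::-1][:d]   # top d eigenvalues/eigenvectors
    # y_i's p-th component is sqrt(lambda_p) * v_p(i)
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0))
```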

(1) Construct the neighborhood graph
    (a) Define graph G by connecting points i and j if they are closer than \epsilon [as measured by d_X(i, j)] (\epsilon-Isomap), or if i is one of the K nearest neighbors of j (K-Isomap)
    (b) Set edge lengths equal to d_X(i, j)
(2) Compute shortest paths
    (a) Initialize d_G(i, j) = d_X(i, j) if i, j are linked by an edge; d_G(i, j) = \infty otherwise
    (b) For k = 1, 2, \ldots, N, replace all entries d_G(i, j) by \min\{d_G(i, j),\; d_G(i, k) + d_G(k, j)\}
    (c) The matrix D_G = \{d_G(i, j)\} will then contain the shortest-path distances between all pairs of points in G
(3) Construct the d-dimensional embedding
    (a) Let \lambda_p be the p-th eigenvalue (in decreasing order) of the matrix \tau(D_G), and v_p^i be the i-th component of the p-th eigenvector
    (b) Set the p-th component of the d-dimensional coordinate vector y_i equal to \sqrt{\lambda_p}\, v_p^i
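In practice, a library implementation covers all three steps; a short usage sketch with scikit-learn's Isomap on a synthetic Swiss roll (the dataset and parameter values are arbitrary choices of mine):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# K-Isomap with K = n_neighbors: the estimator builds the kNN graph, runs
# shortest paths, and applies the MDS step internally.
X, color = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)
Y = Isomap(n_neighbors=10, n_components=2).fit_transform(X)   # 2-D embedding
print(Y.shape)   # (1000, 2)
```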

ISOMAP results

    [Figure: ISOMAP results; from Tenenbaum et al., 1997]

ISOMAP discussion

- ISOMAP is guaranteed asymptotically to recover the true dimensionality and geometric structure of a certain class of Euclidean manifolds
  - These are manifolds whose intrinsic geometry is that of a convex region of Euclidean space, but whose geometry in the high-dimensional space may be highly folded, twisted or curved
  - Intuitively, in 2D, these include any physical transformations one can perform on a sheet of paper without introducing tears, holes, or self-intersections
  - For non-Euclidean manifolds (e.g., hemispheres or tori), ISOMAP will still provide a globally optimal low-dimensional representation
- These guarantees are based on the fact that, as the number of samples increases, the graph distances d_G(i, j) provide increasingly better approximations of the intrinsic geodesic distances d_M(i, j)
- However, this proof has limited application [Carreira-Perpiñán, in press] because
  - data in high dimensions is scarce (!), and
  - computational complexity would preclude the use of large datasets anyway
- Mapping
  - Note that ISOMAP does not provide a mapping function Y = f(X)
  - One could, however, be learned from the pairs {X_i, Y_i} in a supervised fashion (see the sketch below)
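One way such a mapping could be learned (a possible choice, not prescribed by the slides) is to fit a regressor from the training samples to their ISOMAP coordinates; the array names below are hypothetical placeholders:

```python
from sklearn.neighbors import KNeighborsRegressor

def learn_isomap_mapping(X_train, Y_train, n_neighbors=5):
    """Sketch: fit an explicit map f: X -> Y from the training pairs {X_i, Y_i}.

    X_train are the original high-dimensional samples and Y_train their ISOMAP
    coordinates; the k-NN regressor is one possibility among many.
    """
    return KNeighborsRegressor(n_neighbors=n_neighbors).fit(X_train, Y_train)

# Usage sketch (hypothetical arrays): f = learn_isomap_mapping(X, Y)
#                                     Y_new = f.predict(X_new)
```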

Locally Linear Embedding

- LLE uses a different strategy than ISOMAP to recover the global non-linear structure
  - ISOMAP estimates pairwise geodesic distances for all points on the manifold
  - Instead, LLE uses only distances within locally linear neighborhoods
- Intuition
  - Assume that the dataset consists of N vectors x_i, each having D dimensions, sampled from an underlying manifold
  - Provided that there is sufficient data (i.e., the manifold is well sampled), one can expect that each point and its neighbors will lie on a linear patch of the manifold
- Approach
  - LLE solves the problem in two stages
    - First, compute a linear approximation of each sample in the original space
    - Second, find the coordinates in the manifold that are consistent with the previous linear approximation

This discussion is borrowed from [Roweis and Saul, 2000]

Algorithm (part 1)

- The local geometry of these patches is modeled by linear weights that reconstruct each data point as a linear combination of its neighbors
- Reconstruction errors are measured with the L2 cost function

    \epsilon(W) = \sum_{i=1}^{N} \left| X_i - \sum_j W_{ij} X_j \right|^2

  where the weights W_{ij} measure the contribution of the j-th example to the reconstruction of the i-th example
- The weights W_{ij} are found by minimizing this cost subject to two constraints
  - Each data point is reconstructed only from its neighbors
  - The rows of the weight matrix sum up to one: \sum_j W_{ij} = 1
- These constraints ensure that, for any particular sample, the weights are invariant to translation, rotation or scaling
- This problem can be solved using least squares (details next)

Algorithm (part 2)

- Given that these weights reflect intrinsic properties of the local geometry, we expect that they will also be valid for local patches on the manifold
- Therefore, we seek the d-dimensional coordinates Y_i that minimize the cost function

    \Phi(Y) = \sum_i \left| Y_i - \sum_j W_{ij} Y_j \right|^2

- Note that this function is similar to the previous one, except that here the W_{ij} are fixed and we solve for the Y_i
- This minimization problem can be solved as a sparse N x N eigenvalue problem, whose bottom d non-zero eigenvectors are the orthogonal coordinates of the data on the manifold (details follow)

LLE summary

    [Figure: summary of the LLE algorithm; from Roweis and Saul, 2000]

LLE: solving for W_ij

- The linear weights W_{ij} can be found by solving a constrained least-squares problem
- Consider a particular sample x with K nearest neighbors \eta_j and reconstruction weights w_j (that sum up to one)
- These weights can be found in three steps
  - STEP 1: Evaluate the inner products between neighbors to compute the neighborhood correlation matrix C_{jk} and its inverse C^{-1}

      C_{jk} = \eta_j^T \eta_k

  - STEP 2: Compute the Lagrange multiplier \lambda that enforces the constraint \sum_j w_j = 1

      \lambda = \frac{1 - \sum_{jk} C_{jk}^{-1} x^T \eta_k}{\sum_{jk} C_{jk}^{-1}}

  - STEP 3: Compute the reconstruction weights as

      w_j = \sum_k C_{jk}^{-1} \left( x^T \eta_k + \lambda \right)

  - NOTE: if C is singular, regularize it by adding a small multiple of the identity matrix
- Since this three-step process requires a matrix inversion, a more efficient solution [Saul and Roweis, 2001] is to
  - define the local covariance of the centered neighbors, \tilde{C}_{jk} = (x - \eta_j)^T (x - \eta_k),
  - first solve the linear system of equations \sum_k \tilde{C}_{jk} w_k = 1, and
  - then rescale the weights so they sum up to one (which yields the same result)

[Roweis and Saul, 2000; Saul and Roweis, 2001]
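A sketch of the efficient solve-and-rescale route for one sample, using the local covariance of the centered neighbors as in Saul and Roweis (2001); the regularization constant is an arbitrary choice:

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Sketch: reconstruction weights of one sample x from its K neighbors.

    `neighbors` is a (K, D) array of the K nearest neighbors eta_j. Solves
    C w = 1 and rescales so the weights sum to one; `reg` adds a small
    multiple of the identity, only needed when C is singular (e.g., K > D).
    """
    Z = neighbors - x                               # shift neighbors to the query point
    C = Z @ Z.T                                     # local K x K covariance matrix
    C = C + reg * np.trace(C) / len(C) * np.eye(len(C))   # regularization term
    w = np.linalg.solve(C, np.ones(len(C)))         # solve sum_k C_jk w_k = 1
    return w / w.sum()                              # enforce sum_j w_j = 1
```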

LLE: solving for Y_i

- The embedding vectors Y_i are found by minimizing the cost function

    \Phi(Y) = \sum_i \left| Y_i - \sum_j W_{ij} Y_j \right|^2

- To make the optimization problem well posed, we introduce two constraints
  - Since the coordinates Y_i can be translated without affecting the cost function, we remove this degree of freedom by requiring that they are centered

      \sum_j Y_j = 0

  - To avoid degenerate solutions, we constrain the embedding vectors to have unit covariance matrix

      \frac{1}{N} \sum_i Y_i Y_i^T = I

- This allows the cost function to be expressed as a quadratic form involving inner products of the embedding vectors

    \Phi(Y) = \sum_{ij} M_{ij} \, Y_i^T Y_j

  where

    M_{ij} = \delta_{ij} - W_{ij} - W_{ji} + \sum_k W_{ki} W_{kj}

- It can be shown that the optimal embedding is found by computing the bottom d + 1 eigenvectors of the matrix M
- We discard the bottom eigenvector, which is the (constant) unit vector
  - This eigenvector represents a free translation mode with eigenvalue zero
  - Discarding it enforces the constraint that the embedding coordinates have zero mean
- The remaining d eigenvectors form the d embedding coordinates found by LLE (see the sketch below)
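A sketch of this final step, assuming the weights have been assembled into a dense N x N matrix W with rows summing to one; it builds M as (I - W)^T (I - W), which matches the entry-wise definition above, and discards the constant bottom eigenvector:

```python
import numpy as np

def lle_embedding(W, d):
    """Sketch: d-dimensional embedding from the (dense, for clarity) weight matrix W.

    M = (I - W)^T (I - W) has entries delta_ij - W_ij - W_ji + sum_k W_ki W_kj;
    the bottom eigenvector (eigenvalue ~ 0, constant) is discarded.
    """
    N = W.shape[0]
    I = np.eye(N)
    M = (I - W).T @ (I - W)               # quadratic form of the cost Phi(Y)
    evals, evecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    Y = evecs[:, 1:d + 1]                 # drop the constant bottom eigenvector
    return Y * np.sqrt(N)                 # scale so (1/N) sum_i Y_i Y_i^T = I
```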

LLE examples

    [Figure: LLE example embeddings; from Roweis and Saul, 2000]

    [Figure: further LLE examples; from Roweis and Saul, 2000]

LLE discussion

- LLE is simple and attractive, but shares some of the limitations of MDS methods
  - Sensitivity to noise
  - Sensitivity to non-uniform sampling of the manifold
  - Does not provide a mapping (though one can be learned in a supervised fashion from the pairs {X_i, Y_i})
  - Quadratic complexity in the training-set size
  - Unlike ISOMAP, no robust method to compute the intrinsic dimensionality, and
  - no robust method to define the neighborhood size K