Dimension Reduction and Low-dimensional Embedding
Ying Wu
Electrical Engineering and Computer Science
Northwestern University, Evanston, IL 60208
http://www.eecs.northwestern.edu/~yingwu
1/26
Dimension Reduction
- High-dimensional raw data
  - Difficult to visualize
  - Difficult to find useful and meaningful information
  - Uncertainties vary for different features
  - Features may be correlated
- Low-dimensional structures
  - The structures can actually be simple and linear
  - They can also be complicated and nonlinear
- Can we project the data to a low-dimensional space?
2/26
What to Preserve?
- Information is lost in dimension reduction
- So, what do we want to preserve? This is critical
- How do we go from high-dim to low-dim? Linear vs. nonlinear
3/26
Outline
- Principal Component Analysis (PCA)
- Metric Multidimensional Scaling (MDS)
- Isometric Feature Mapping (ISOMAP)
- Locally Linear Embedding (LLE)
4/26
PCA Revisit
Learning linear principal components from $\{x_1, \ldots, x_N\}$:
1. calculate the mean $m = \frac{1}{N} \sum_{k=1}^{N} x_k$
2. center the data: $A = [x_1 - m, \ldots, x_N - m]$
3. calculate the scatter matrix $S = \sum_{k=1}^{N} (x_k - m)(x_k - m)^T = A A^T$
4. eigenvalue decomposition $S = U^T \Sigma U$
5. sort the eigenvalues $\lambda_i$ and eigenvectors $e_i$
6. find the bases $W = [e_1, e_2, \ldots, e_m]$
Note: the components for $x$ are $y = W^T (x - m)$, where $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$.
5/26
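[Editor's note] As a concrete illustration of these six steps, here is a minimal NumPy sketch; it is not part of the slides, and the function and variable names are my own:

```python
import numpy as np

def pca(X, m):
    """PCA via the six steps above. X is N x n (rows are samples);
    m is the target dimension. Returns projections Y, basis W, and the mean."""
    mean = X.mean(axis=0)                    # 1. mean m
    A = (X - mean).T                         # 2. centered data, n x N
    S = A @ A.T                              # 3. scatter matrix S = A A^T
    evals, evecs = np.linalg.eigh(S)         # 4. EVD (S is symmetric)
    order = np.argsort(evals)[::-1]          # 5. sort eigenvalues, descending
    W = evecs[:, order[:m]]                  # 6. basis W = [e_1, ..., e_m]
    Y = (X - mean) @ W                       # y = W^T (x - m) for each sample
    return Y, W, mean
```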
PCA: Preserve the Variance
We have a linear projection of $x$ to a 1-d subspace: $y = w^T x$.
The first principal component of $x$ is the direction $w$ such that the variance of the projection $y$ is maximized (we need to constrain $w$ to be a unit vector), so we have the following optimization problem:
$$\max_w J(w) = E\{y^2\} = E\{(w^T x)^2\} = w^T S w, \quad \text{s.t. } w^T w = 1$$
The sorted eigenvalues of $S$ are $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$, with eigenvectors $\{e_1, \ldots, e_n\}$. It is clear that the first PC is $y_1 = e_1^T x$.
This can be generalized to $m$ PCs (where $m < n$).
6/26
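[Editor's note] A quick numerical sanity check of this claim, on synthetic data of my own choosing: no random unit vector beats the top eigenvector on the projected variance $w^T S w$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) * np.array([3.0, 1.0, 0.3])  # anisotropic cloud
S = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(S)
e1 = evecs[:, -1]                      # eigenvector of the largest eigenvalue

w = rng.standard_normal((1000, 3))
w /= np.linalg.norm(w, axis=1, keepdims=True)   # random unit vectors
# Rayleigh-quotient bound: w^T S w <= lambda_max = e1^T S e1
assert ((w @ S) * w).sum(axis=1).max() <= e1 @ S @ e1 + 1e-9
```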
Outline
- Principal Component Analysis (PCA)
- Metric Multidimensional Scaling (MDS)
- Isometric Feature Mapping (ISOMAP)
- Locally Linear Embedding (LLE)
7/26
Formulation
- We have a set of samples $\{x_1, \ldots, x_n\}$ in a high-dim space, and we know their dissimilarity, i.e., the pair-wise distances $d_{ij} = \mathrm{dist}(x_i, x_j)$.
- We want to find their projections $\{y_1, \ldots, y_n\}$ in a low-dim linear subspace in which the dissimilarity is preserved: $\delta_{ij} = \mathrm{dist}(y_i, y_j)$.
- In other words, we want to reconstruct the configuration of this set of points in a low-dim space.
- If we use the Euclidean distance, we have $d_{ij}^2 = (x_i - x_j)^T (x_i - x_j)$.
8/26
Prerequisite: Centering Matrix
In $\mathbb{R}^n$, denote $\mathbf{1} = [1, \ldots, 1]^T \in \mathbb{R}^{n \times 1}$, and $H = I_n - \frac{1}{n} \mathbf{1}\mathbf{1}^T$, where $I_n$ is an identity matrix of size $n$.
Given the centering matrix $H$ and a vector $x \in \mathbb{R}^n$,
$$Hx = x - \left(\frac{1}{n} \mathbf{1}^T x\right) \mathbf{1}$$
It is easy to see that $\frac{1}{n} \mathbf{1}^T x$ is the mean of the vector. Its use is to make matrix manipulations easier.
Suppose we have $X = [x_1, \ldots, x_n]^T$. Discuss the effects of, and the difference between, $HX$ and $XH$. So, what is $(HX)(HX)^T$?
9/26
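[Editor's note] To make the $HX$-vs-$XH$ question concrete, a small sketch (names my own): $HX$ subtracts the column means (centering each feature across samples), while $XH$ subtracts the row means.

```python
import numpy as np

def centering(n):
    """H = I_n - (1/n) 1 1^T"""
    return np.eye(n) - np.ones((n, n)) / n

n, d = 5, 3
X = np.arange(n * d, dtype=float).reshape(n, d)   # rows are samples x_i^T

Hn, Hd = centering(n), centering(d)
print(np.allclose(Hn @ X, X - X.mean(axis=0)))                 # HX: centers columns
print(np.allclose(X @ Hd, X - X.mean(axis=1, keepdims=True)))  # XH: centers rows
# (HX)(HX)^T is the Gram matrix of the centered samples:
Xc = X - X.mean(axis=0)
print(np.allclose((Hn @ X) @ (Hn @ X).T, Xc @ Xc.T))
```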
Classical Scaling Algorithm
Suppose we have $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$.
- Given $\delta_{ij}$, construct a dissimilarity matrix $A_{n \times n} = \{-\frac{1}{2}\delta_{ij}^2\}$.
- Compute $B_{n \times n} = HAH$. What does $B$ mean?¹
- Perform EVD on $B$: $B = U \Sigma U^T$. If $d < n$, $B$ has $n - d$ zero eigenvalues. Ordering the eigenvalues as $\lambda_1 \geq \ldots \geq \lambda_n$, we have $B = U_d \Sigma_d U_d^T$.
- If we use the $k$ largest eigenvalues, we can obtain the $k$-dim reconstruction $Y = U_k \Sigma_k^{1/2}$.
¹ Prove $b_{ij} = (x_i - \bar{x})^T (x_j - \bar{x})$, where $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$, and $B = (HX)(HX)^T$.
10/26
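[Editor's note] The algorithm in code form, as a minimal sketch (function name my own; the `np.maximum` guard clips the small negative eigenvalues that can appear with non-Euclidean dissimilarities):

```python
import numpy as np

def classical_mds(D, k):
    """Classical scaling. D is the n x n matrix of dissimilarities delta_ij;
    k is the embedding dimension. Returns the n x k configuration Y."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    A = -0.5 * D ** 2                        # a_ij = -(1/2) delta_ij^2
    B = H @ A @ H                            # double centering
    evals, evecs = np.linalg.eigh(B)         # EVD of B (symmetric)
    order = np.argsort(evals)[::-1][:k]      # k largest eigenvalues
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0))  # Y = U_k Sigma_k^{1/2}
```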
Relation to PCA
- It is clear that $B$ is the scatter matrix.
- Let's see what PCA does. In PCA, we have the covariance matrix $S = X^T H X$.
- Actually, $B$ and $S$ are dual: suppose $Sv = \lambda v$; then (assuming $X$ is centered, so $HX = X$) $B X v = \lambda X v$, i.e., $u = Xv$ is an eigenvector of $B$.
- In PCA, the low-dimensional projection is $Y = X V_k$. This is $U_k$ (before normalization), which is exactly what we had in MDS!
11/26
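[Editor's note] This duality can be checked numerically; a sketch on synthetic data of my own, where the two embeddings coincide up to column signs:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 50, 4, 2
X = rng.standard_normal((n, d))
Xc = X - X.mean(axis=0)                  # center, so HX = Xc

# MDS side: eigendecompose B = (HX)(HX)^T
w, U = np.linalg.eigh(Xc @ Xc.T)
idx = np.argsort(w)[::-1][:k]
Y_mds = U[:, idx] * np.sqrt(w[idx])      # Y = U_k Sigma_k^{1/2}

# PCA side: project onto the top-k eigenvectors of S = X^T H X
s, V = np.linalg.eigh(Xc.T @ Xc)
Y_pca = Xc @ V[:, np.argsort(s)[::-1][:k]]

print(np.allclose(np.abs(Y_mds), np.abs(Y_pca)))   # equal up to sign
```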
Outline
- Principal Component Analysis (PCA)
- Metric Multidimensional Scaling (MDS)
- Isometric Feature Mapping (ISOMAP)
- Locally Linear Embedding (LLE)
12/26
Motivation: Nonlinear Intrinsic Structures
Both PCA and MDS are linear embeddings. What if the intrinsic structure is nonlinear?
13/26
From Euclidean Distance to Geodesic Distance
- Sometimes Euclidean distance does not make sense: if two points lie on a nonlinear surface, their Euclidean distance can be small even though they are far apart along the surface (e.g., in Figure A).
- We need to consider the geodesic distance instead (Figure B).
- We want to unfold the nonlinear surface while preserving the geodesic distances.
14/26
Computing Geodesic Distance
This is the most important step in ISOMAP (see the sketch after this list).
- Given $x_i$, find its close neighboring points (based on Euclidean distance). Euclidean distance approximates geodesic distance well for neighboring points.
- Construct a weighted graph based on these neighboring relationships.
- For any two faraway points, find the shortest path connecting them and sum the distances along the path. This can be done efficiently by any shortest-path algorithm.
- We end up with a matrix $D_G$, where $D_G(i,j)$ is the geodesic distance between $x_i$ and $x_j$.
15/26
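[Editor's note] A sketch of this step using SciPy's shortest-path routine; the k-NN construction and the names are my own choices. If the neighborhood graph is disconnected, some entries remain infinite.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def geodesic_distances(X, k=7):
    """Approximate geodesic distances: build a k-NN graph weighted by
    Euclidean distance, then run all-pairs shortest paths. X is n x d."""
    D = squareform(pdist(X))                  # Euclidean distances
    n = D.shape[0]
    G = np.full((n, n), np.inf)               # inf marks "no edge"
    nn = np.argsort(D, axis=1)[:, 1:k + 1]    # k nearest neighbors (skip self)
    rows = np.repeat(np.arange(n), k)
    G[rows, nn.ravel()] = D[rows, nn.ravel()]
    G = np.minimum(G, G.T)                    # symmetrize the graph
    return shortest_path(G, method="D", directed=False)   # Dijkstra
```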
Unfolding the Nonlinear Manifold
- Once $D_G$ is obtained, the rest is MDS: find a low-dim configuration $\{y_1, \ldots, y_n\}$ that preserves these pair-wise geodesic distances.
- This can be easily done with the classical scaling steps:
$$B = HAH, \quad \text{where } A = \{-\tfrac{1}{2} D_G(i,j)^2\} \quad \text{(centering)}$$
$$B = U_k \Sigma_k U_k^T \quad \text{(EVD)}$$
$$Y = U_k \Sigma_k^{1/2} \quad \text{(principal coordinates)}$$
16/26
Summary: ISOMAP
S1: construct the neighborhood graph
S2: approximate the geodesic distances and obtain the pair-wise dissimilarity matrix $D_G$
S3: apply MDS on $D_G$
17/26
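[Editor's note] Putting S1–S3 together is just a composition of the two earlier sketches (this assumes the `geodesic_distances` and `classical_mds` functions defined above have been run):

```python
def isomap(X, k_neighbors=7, k_dims=2):
    """ISOMAP sketch: S1+S2 build D_G, S3 applies classical MDS to it."""
    D_G = geodesic_distances(X, k=k_neighbors)   # S1: graph, S2: shortest paths
    return classical_mds(D_G, k_dims)            # S3: MDS on D_G
```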
Example: head pose and lighting 18/26
Outline
- Principal Component Analysis (PCA)
- Metric Multidimensional Scaling (MDS)
- Isometric Feature Mapping (ISOMAP)
- Locally Linear Embedding (LLE)
19/26
Motivation
- Nonlinear low-dimensional intrinsic structure
- The structure of a local neighborhood is linear!
20/26
What to Preserve
Local linear reconstruction:
$$\hat{x}_i \approx \sum_{j \in N(i)} W_{ij} x_j$$
Preserve the local relationship:
$$y_i = \sum_{j \in N(i)} W_{ij} y_j$$
21/26
Computing the Local Combination
Given a set of high-dim vectors $\{x_1, \ldots, x_n\}$: for each $x_i$, find its neighbors $N(i)$.
Our goal:
$$W = \arg\min_W \sum_{i=1}^n \Big\| x_i - \sum_{j \in N(i)} W_{ij} x_j \Big\|^2$$
$$\text{s.t. } W_{ij} = 0 \text{ if } x_j \notin N(i), \quad \sum_j W_{ij} = 1$$
Once we find $N(i)$, we can estimate the weights one by one for each $x_i$. This is a constrained least-squares fitting problem.
22/26
Weighted Least-squares Fitting
Let's consider $x$ and its $k$-NN: $A = [x_1, \ldots, x_k]$, where all $x_j \in N(x)$.
Introduce a local covariance matrix $C$:
$$C_{jk} = (x - x_j)^T (x - x_k)$$
Using $\sum_j w_j = 1$, we can rewrite the reconstruction error for $x$:
$$e(w) = \|x - Aw\|^2 = \|(x\mathbf{1}^T - A)w\|^2 = w^T C w$$
Construct the Lagrangian:
$$L(w, \lambda) = w^T C w + \lambda(\mathbf{1}^T w - 1)$$
It is easy to see that
$$w = \frac{C^{-1}\mathbf{1}}{\mathbf{1}^T C^{-1} \mathbf{1}}$$
or, to see it clearly, element-wise:
$$w_j = \frac{\sum_k C^{-1}_{jk}}{\sum_l \sum_m C^{-1}_{lm}}$$
23/26
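[Editor's note] A sketch of this closed-form solution for one point. The small regularizer on $C$ is a common practical addition when the neighborhood is larger than the ambient dimension; it is my addition, not from the slides.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Constrained least-squares weights for one point. x is (d,);
    neighbors is (k, d), with rows x_j in N(x). Returns w with sum(w) = 1."""
    G = x - neighbors                         # rows are x - x_j
    C = G @ G.T                               # C_jk = (x - x_j)^T (x - x_k)
    C += reg * np.trace(C) * np.eye(len(C))   # regularize C for stability
    w = np.linalg.solve(C, np.ones(len(C)))   # w proportional to C^{-1} 1
    return w / w.sum()                        # enforce 1^T w = 1
```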
Low-dimensional Embedding
Denote the low-dim vectors by $Y = [y_1, \ldots, y_n]$.
W.l.o.g., we assume they are centered at 0 and have unit covariance, i.e., $Y\mathbf{1} = 0$, $YY^T = I$.
Our reconstruction problem is
$$Y = \arg\min_Y \|Y - YW^T\|^2 = \arg\min_Y \mathrm{tr}(Y M Y^T), \quad \text{s.t. } YY^T = I$$
where $M = (I - W)^T (I - W)$.
24/26
Still an EVD Problem!
We have the Lagrangian
$$L(Y, \Lambda) = \mathrm{tr}(Y M Y^T) + \mathrm{tr}\big(\Lambda(I - YY^T)\big)$$
Taking the partial derivative w.r.t. $y_i$ and setting it to zero:
$$\frac{\partial L}{\partial y_i} = 2(M y_i - \lambda y_i) = 0 \quad \Rightarrow \quad M y_i = \lambda y_i$$
See what we have here! As we are minimizing, we need to use the smallest eigenvalues.
Suppose $d$ is the dimension of the low-dim space. We take the $d+1$ smallest eigenvalues, discard the bottom one (the trivial solution), and keep the remaining $d$ eigenvalues; their corresponding eigenvectors are our low-dim reconstruction!
25/26
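[Editor's note] The final step in code, a minimal sketch assuming the full $n \times n$ weight matrix $W$ has already been assembled from the per-point weights:

```python
import numpy as np

def lle_embedding(W, d):
    """Embed into d dims from the eigenvectors of M = (I - W)^T (I - W)
    with the smallest eigenvalues. Returns an n x d matrix whose row i
    is the embedding y_i."""
    n = W.shape[0]
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    evals, evecs = np.linalg.eigh(M)          # ascending eigenvalues
    # take the d+1 smallest; discard the bottom (trivial, constant) one
    return evecs[:, 1:d + 1]
```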
Example (head pose and facial expression) 26/26