Dimension Reduction Techniques Presented by Jie (Jerry) Yu
Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting
Background Advances in data collection and storage capacity have led to information overload in many fields. Traditional statistical methods often break down because of the increase in the number of variables in each observation, that is, the dimension of the data. One of the most challenging problems is to reduce the dimension of the original data.
Problem Modeling
Original high-dimensional data: $X = (x_1, \ldots, x_p)^T$, a p-dimensional multivariate random variable.
Underlying/intrinsic low-dimensional data: $Y = (y_1, \ldots, y_k)^T$, a k (<< p) dimensional multivariate random variable.
The mean and covariance: $E(X) = \mu = (\mu_1, \ldots, \mu_p)^T$, $\Sigma_x = E\{(X - \mu)(X - \mu)^T\}$.
Problems: 1) find the mapping that best captures the most important features of the data in low dimension, and 2) find the k that best describes the data in low dimension.
State-of-the-art Techniques Dimension reduction techniques can be categorized into two major classes: linear and non-linear. Non-Linear Methods: Multidimensional Scaling (MDS), Principal Curves, Self-Organizing Map (SOM), Neural Network, Isomap, Local Linear Embedding (LLE) and Charting. Linear Methods: Principal Component Analysis (PCA), Factor Analysis, Projection Pursuit and Independent Component Analysis (ICA)
Principal Component Analysis (PCA)
Denote a linear projection as $W = [w_1, \ldots, w_k]$; thus $y_i = w_i^T X$.
In essence, PCA reduces the data dimension by finding a few orthogonal linear combinations (Principal Components, PCs) of the original variables with the largest variance:
$W = \arg\max_W \sum_{i=1}^{k} \mathrm{var}\{y_i\} = \arg\max_W \sum_{i=1}^{k} \mathrm{var}\{w_i^T X\}$
which can be rewritten as $W = \arg\max_W \sum_{i=1}^{k} w_i^T \Sigma_x w_i$.
PCA
$\Sigma_x$ can be decomposed by eigendecomposition as $\Sigma_x = U \Lambda U^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ is the diagonal matrix of eigenvalues in descending order and $U$ is the orthogonal matrix containing the eigenvectors. It can be shown that the optimal projection matrix $W$ consists of the first k eigenvectors in $U$.
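As a concrete illustration of this eigendecomposition view, here is a minimal PCA sketch in Python (numpy assumed; the function and variable names are illustrative, not from the slides):

# Minimal PCA sketch via eigendecomposition of the sample covariance.
import numpy as np

def pca(X, k):
    """Project rows of X (n x p) onto the k leading principal components."""
    mu = X.mean(axis=0)                       # estimated mean vector
    Xc = X - mu                               # center the data
    cov = np.cov(Xc, rowvar=False)            # p x p sample covariance Sigma_x
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    W = eigvecs[:, order[:k]]                 # first k eigenvectors = projection W
    Y = Xc @ W                                # low-dimensional representation
    return Y, W, eigvals[order]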
PCA Property 1: The subspace spanned by the first k eigenvectors has the smallest mean squared deviation from X among all subspaces of dimension k. Property 2: The total variance is equal to the sum of the eigenvalues of the original covariance matrix.
Multidimensional Scaling (MDS)
Multidimensional Scaling (MDS) produces a low-dimensional representation of the data such that the distances in the new space reflect the proximities of the data in the original space.
Denote the symmetric proximity matrix as $\Delta = \{\delta_{ij},\ i, j = 1, \ldots, n\}$.
MDS tries to find the mapping such that the distances in the lower-dimensional space, $d_{ij} = d(y_i, y_j)$, are as close as possible to a function of the corresponding proximities, $f(\delta_{ij})$.
MDS
Mapping cost function: $\sum_{i,j} \big[f(\delta_{ij}) - d_{ij}\big]^2 \,/\, \text{scale\_factor}$
The scale_factor is often based on $\sum_{i,j} f(\delta_{ij})^2$ or $\sum_{i,j} d_{ij}^2$.
Problem: find the optimal mapping that minimizes the cost function.
If the proximity is a distance measure ($L_2$ or $L_1$), it is called metric MDS. If the proximity uses only the ordinal information of the data, it is called non-metric MDS.
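The slides describe the general stress-minimizing formulation; one common special case with a closed-form solution is classical (Torgerson) metric MDS, sketched below under the assumption that the proximities are Euclidean distances (numpy assumed):

# Classical (Torgerson) MDS sketch: a closed-form special case of metric MDS.
# D is an n x n matrix of pairwise distances (the proximities).
import numpy as np

def classical_mds(D, k):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]      # keep the k largest
    L = np.maximum(eigvals[order], 0)          # clip tiny negatives from noise
    Y = eigvecs[:, order] * np.sqrt(L)         # n x k embedding
    return Y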
Isomap
Disadvantages of PCA and MDS: 1) both methods often fail to discover complicated nonlinear structure, and 2) both have difficulty detecting the intrinsic dimension of the data.
Goal: combine the major algorithmic features of PCA and MDS (computational efficiency, global optimality and asymptotic convergence guarantees) with the flexibility to learn nonlinear manifolds.
Idea: introduce the geodesic distance, which better describes the relation between data points on a manifold.
Isomap Illustration: Points that are far apart on the underlying manifold, as measured by their geodesic distance, may appear deceptively close in the high-dimensional input space, as illustrated by the Swiss Roll data set.
Isomap
In this approach the intrinsic geometry of the data is preserved by capturing the manifold (geodesic) distances between all data points.
For neighboring points (within ε, or among the k nearest), the Euclidean distance provides a good approximation to the geodesic distance.
For faraway points, the geodesic distance can be approximated by adding up a sequence of short hops between neighboring points, computed with a shortest-path algorithm such as Floyd's algorithm.
Isomap Algorithm
Step 1: Determine which points are neighbors on the manifold, based on the input distance matrix.
Step 2: Estimate the geodesic distances between all pairs of points on the manifold M by computing their shortest-path distances $d_G(i, j)$ in the neighborhood graph.
Step 3: Apply MDS to the graph distance matrix $D_G = \{d_G(i, j)\}$.
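A rough sketch of these three steps, assuming numpy/scipy and reusing the classical_mds helper sketched in the MDS section (a k-nearest-neighbor rule is used for Step 1; the slides also allow an ε-ball rule):

# Isomap sketch following the three steps above.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors, k):
    n = X.shape[0]
    # Step 1: neighborhood graph weighted by Euclidean distance.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rows, cols, vals = [], [], []
    for i in range(n):
        for j in np.argsort(D[i])[1:n_neighbors + 1]:   # skip the point itself
            rows.append(i); cols.append(j); vals.append(D[i, j])
    graph = csr_matrix((vals, (rows, cols)), shape=(n, n))
    # Step 2: geodesic distances = shortest paths through the neighbor graph.
    D_G = shortest_path(graph, directed=False)
    # Step 3: embed the geodesic distance matrix with classical MDS.
    return classical_mds(D_G, k)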
The Swiss Roll Problem
Detect Intrinsic Dimension
The intrinsic dimensionality of the data can be estimated from the rate at which the residual variance decreases as the dimensionality of Y is increased.
Residual variance is defined as $1 - R^2(D_M, D_Y)$, where $R(\cdot)$ is the linear correlation coefficient, $D_M$ is the estimated (geodesic) distance matrix in the original space, and $D_Y$ is the distance matrix in the projected space.
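A small sketch of this residual-variance criterion (numpy assumed; D_M is the geodesic/graph distance matrix from Isomap and Y is a candidate k-dimensional embedding):

# Residual variance for picking the intrinsic dimension.
import numpy as np

def residual_variance(D_M, Y):
    """1 - R^2 between manifold distances and Euclidean distances in Y."""
    D_Y = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    r = np.corrcoef(D_M.ravel(), D_Y.ravel())[0, 1]    # linear correlation R
    return 1.0 - r ** 2

# Typical use: embed for k = 1, 2, 3, ... and look for the "elbow" where the
# residual variance stops decreasing noticeably.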
Theoretical Analysis The main contribution of Isomap is to substitute the geodesic distance for the Euclidean distance, which better captures the nonlinear structure of a manifold. Given sufficient data, Isomap is guaranteed asymptotically to recover the true dimensionality and geometric structure of a nonlinear manifold.
Experiments
Experiments
Experiment 1: Facial Images
Experiment 2: The handwritten '2's
Locally Linear Embedding (LLE) MDS and its variant Isomap try to preserve pairwise distances between data points. Locally Linear Embedding (LLE) is an unsupervised learning algorithm that recovers global nonlinear structure from locally linear fits. Assumption: each data point and its neighbors lie on or close to a locally linear patch of the manifold.
Local Linearity
LLE
Idea: the local geometry is characterized by linear coefficients that reconstruct each data point from its neighbors.
The reconstruction cost is defined as: $\varepsilon(W) = \sum_i \big| x_i - \sum_j w_{ij} x_j \big|^2$
Two constraints: 1) each data point is reconstructed only from its neighbors, not from faraway points (i.e., $w_{ij} = 0$ unless $x_j$ is a neighbor of $x_i$), and 2) the rows of the weight matrix sum to one ($\sum_j w_{ij} = 1$).
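A sketch of solving this constrained least-squares problem for the weights, one data point at a time (numpy assumed; the small regularization term is a common practical safeguard when the local Gram matrix is singular, not something stated in the slides):

# Solve for LLE reconstruction weights from k-nearest neighbors.
import numpy as np

def lle_weights(X, n_neighbors):
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]       # neighbors of x_i
        Z = X[nbrs] - X[i]                               # neighbors shifted to the origin
        C = Z @ Z.T                                      # local Gram matrix
        C += np.eye(n_neighbors) * 1e-3 * np.trace(C)    # regularize if singular
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, nbrs] = w / w.sum()                         # enforce sum-to-one constraint
    return W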
Linear reconstruction
LLE The reconstruction weights for any data point are invariant to rotations, rescalings and translations. Although the global manifold may be nonlinear, for each locally linear neighborhood there exists a linear mapping (consisting of a translation, rotation and rescaling) that projects the neighborhood to low dimension. The same weights that reconstruct the ith data point in D dimensions should also reconstruct its embedded coordinates in d dimensions.
LLE
W is solved by minimizing the reconstruction cost function in the original space.
To find the optimal global mapping to the lower-dimensional space, define an embedding cost function: $\Phi(Y) = \sum_i \big| y_i - \sum_j w_{ij} y_j \big|^2$
Because W is now fixed, the problem becomes finding the optimal embedding (X -> Y) that minimizes the embedding cost.
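With W fixed, the embedding cost is quadratic in Y, and (under the usual centering and unit-covariance constraints that rule out degenerate solutions) its minimizer is given by the bottom eigenvectors of $M = (I - W)^T(I - W)$. A sketch, assuming numpy and the lle_weights helper from above:

# LLE embedding step: bottom eigenvectors of M = (I - W)^T (I - W),
# discarding the constant eigenvector with eigenvalue ~0.
import numpy as np

def lle_embed(X, n_neighbors, d):
    n = X.shape[0]
    W = lle_weights(X, n_neighbors)
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigvals, eigvecs = np.linalg.eigh(M)      # ascending eigenvalues
    # Skip the smallest (all-ones) eigenvector; take the next d as coordinates.
    return eigvecs[:, 1:d + 1]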
Theoretical analysis: 1) There is only one free parameter, K, and the transformation is deterministic. 2) LLE is guaranteed to converge to the global optimum given sufficient data points. 3) LLE does not have to be rerun to compute higher-dimensional embeddings. 4) The intrinsic dimension d can be estimated by analyzing a reciprocal cost function that reconstructs X from Y.
Experiment 1: Facial Images
Experiment 2: Words in semantic space
Experiment 2: Arranging words in semantic space
Charting Charting is the problem of assigning a low-dimensional coordinate system to data points in a high-dimensional sample space. Assume that the data lie on or near a low-dimensional manifold in the sample space and that there exists a 1-to-1 smooth nonlinear transform between the manifold and a low-dimensional vector space. Goal: find a mapping, expressed as a kernel-based mixture of linear projections, that minimizes the information loss about the density and relative locations of the sample points.
Local Linear Scale and Intrinsic Dimensionality
Local linear scale (r): at some scale r, the mapping from a neighborhood on the manifold $M^d$ (in the original space) to the low-dimensional space is linear.
Consider a ball of radius r centered on a data point and containing n(r) data points. The count n(r) grows as $r^d$ only at the locally linear scale.
Local Linear Scale and Intrinsic Dimensionality
Two other factors affect the data distribution at different scales: isotropic noise (at smaller scales) and embedding curvature (at larger scales).
Define $c(r) = \log r / \log n(r)$. At the noise scale, $c(r) = 1/D < 1/d$ (where D is the dimension of the ambient sample space); at the locally linear scale, $c(r) = 1/d$; at the curvature scale, $c(r) < 1/d$.
Local Linear Scale and Intrinsic Dimensionality Gradually increase r; when c(r) first peaks (at 1/d), we obtain one observation of both r and d. Averaging over all data points gives estimates of r and d.
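A rough sketch of this estimate (numpy assumed; the peak detection and the handling of ties or duplicate points are deliberately simplified here):

# Per-point scale/dimension estimate from c(r) = log r / log n(r).
import numpy as np

def estimate_r_and_d(X):
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    r_obs, d_obs = [], []
    for i in range(n):
        radii = np.sort(D[i])[1:]                # distances to the other points
        counts = np.arange(2, len(radii) + 2)    # n(r) as the ball grows
        c = np.log(radii) / np.log(counts)
        peak = np.argmax(c)                      # simplified: take the global peak of c(r)
        r_obs.append(radii[peak])
        d_obs.append(1.0 / c[peak])              # c(r) = 1/d  =>  d = 1/c(r)
    return np.mean(r_obs), np.mean(d_obs)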
Charting the data Model: each chart is modeled as a Gaussian component of a mixture model (GMM). Goal: find a soft partition of the data into locally linear low-dimensional neighborhoods. Problem: one data point may belong to several neighboring charts, so the estimation of the local GMM should take into account information from neighboring charts.
Charting the data
Co-locality is defined to estimate how close two charts are: $m_i(\mu_j) = \mathcal{N}(\mu_j;\ \mu_i, \sigma^2)$
Each data point is associated with a Gaussian neighborhood with $\mu_i = x_i$. The covariance is estimated by:
$\Sigma_i = \sum_j m_i(\mu_j)\big[(x_j - \mu_i)(x_j - \mu_i)^T + (\mu_j - \mu_i)(\mu_j - \mu_i)^T\big] \,/\, \sum_j m_i(\mu_j)$
This step brings non-local information about the manifold's shape into the local description of each neighborhood, ensuring that adjoining neighborhoods have similar covariances and small angles between their respective subspaces.
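A sketch of this weighted covariance estimate under the stated assumption $\mu_i = x_i$ (numpy assumed; the Gaussian normalizing constant of $m_i(\cdot)$ is dropped because it cancels in the ratio, and sigma is the neighborhood width chosen at the locally linear scale r):

# Co-locality-weighted covariance estimate per neighborhood.
import numpy as np

def neighborhood_covariances(X, sigma):
    n, p = X.shape
    mu = X.copy()                                  # mu_i = x_i, as stated above
    covs = np.zeros((n, p, p))
    for i in range(n):
        # co-locality weights m_i(mu_j) from an isotropic Gaussian of width sigma
        m = np.exp(-0.5 * np.sum((mu - mu[i]) ** 2, axis=1) / sigma ** 2)
        dx = X - mu[i]                             # (x_j - mu_i)
        dm = mu - mu[i]                            # (mu_j - mu_i)
        covs[i] = (np.einsum('j,jp,jq->pq', m, dx, dx)
                   + np.einsum('j,jp,jq->pq', m, dm, dm)) / m.sum()
    return covs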
Connecting the charts To minimize the information loss in the connection, the data points projected into the local subspace associated with each neighborhood should have 1) minimal loss of local variance and 2) maximal agreement between the projections of nearby points into nearby neighborhoods. The first criterion is met by applying PCA to each chart to obtain a local low-dimensional coordinate system; each original data point then has a copy (a projected low-dimensional sample) in each local coordinate system. The second criterion is met by projecting each local coordinate system into a global coordinate system with minimal disagreement among the projected copies of each data point in the global space.
Connecting the charts
Each data point $x_i$ is projected into the local coordinates of each neighboring chart j, giving the (homogeneous) local coordinates $u_{ji}$.
Each copy of a data point in a local coordinate system is then projected to the global coordinate system: $y_i = \sum_j G_j\, u_{ji}\, p_j(x_i)$, where $G_j$ is the projection from the jth chart to the global space.
Minimizing the disagreement is modeled as a weighted least-squared-distance problem:
$[G_1, \ldots, G_K] = \arg\min_{G_j, G_k} \sum_{j,k} \sum_i p_j(x_i)\, p_k(x_i)\, \big\| G_j u_{ji} - G_k u_{ki} \big\|^2$
Experiment 1: The Twisted Curl Problem
Experiment 2: The Trefoil Problem
Experiment 3: The Facial Image Modeling