Dimension Reduction Techniques Presented by Jie (Jerry) Yu

Outline: Problem Modeling; Review of PCA and MDS; Isomap; Locally Linear Embedding (LLE); Charting

Background. Advances in data collection and storage capacities lead to information overload in many fields. Traditional statistical methods often break down because of the increase in the number of variables in each observation, that is, the dimension of the data. One of the most challenging problems is to reduce the dimension of the original data.

Problem Modeling. Original high-dimensional data: X = (x_1, ..., x_p)^T, a p-dimensional multivariate random vector. Underlying/intrinsic low-dimensional data: Y = (y_1, ..., y_k)^T, a k-dimensional multivariate random vector with k << p. Mean and covariance: E(X) = µ = (µ_1, ..., µ_p)^T and Σ_x = E{(X − µ)(X − µ)^T}. Problems: 1) find the mapping that best captures the most important features in low dimension, and 2) find the appropriate k that best describes the data in low dimension.
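As a small illustration of this notation, here is a hedged NumPy sketch (the data, sizes, and variable names are hypothetical, not from the slides):

```python
import numpy as np

# Hypothetical data: n observations of a p-dimensional random vector X.
rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.normal(size=(n, p))

mu = X.mean(axis=0)               # sample estimate of E(X) = (mu_1, ..., mu_p)^T
Xc = X - mu                       # centered data
Sigma_x = Xc.T @ Xc / (n - 1)     # sample covariance E{(X - mu)(X - mu)^T}
print(mu.shape, Sigma_x.shape)    # (10,) (10, 10)
```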

State-of-the-art Techniques. Dimension reduction techniques can be categorized into two major classes: linear and non-linear. Linear methods: Principal Component Analysis (PCA), Factor Analysis, Projection Pursuit, and Independent Component Analysis (ICA). Non-linear methods: Multidimensional Scaling (MDS), Principal Curves, Self-Organizing Maps (SOM), Neural Networks, Isomap, Locally Linear Embedding (LLE), and Charting.

Principal Component Analysis (PCA). In essence, PCA tries to reduce the data dimension by finding a few orthogonal linear combinations (Principal Components, PCs) of the original variables with the largest variance. Denote the linear projection as W = [w_1, ..., w_k], so that y_i = w_i^T X. Then W = arg max_W Σ_{i=1}^{k} var{y_i} = arg max_W Σ_{i=1}^{k} var{w_i^T X}, which can be further rewritten as W = arg max_W Σ_{i=1}^{k} w_i^T Σ_x w_i.

PCA. The covariance Σ_x can be decomposed by eigendecomposition as Σ_x = U Λ U^T, where Λ = diag(λ_1, ..., λ_p) is the diagonal matrix of eigenvalues sorted in descending order and U is the orthogonal matrix containing the corresponding eigenvectors. It can be proven that the optimal projection matrix W consists of the first k eigenvectors (columns) of U.
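A minimal sketch of PCA via this eigendecomposition (my own NumPy implementation; the function name and data layout are assumptions, not from the slides):

```python
import numpy as np

def pca(X, k):
    """Sketch: project n x p data X onto its first k principal components."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma_x = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(Sigma_x)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort into descending order
    W = eigvecs[:, order[:k]]                    # first k eigenvectors of Sigma_x
    Y = Xc @ W                                   # k-dimensional representation
    return Y, W, eigvals[order]

# Property 2 as a quick check: trace(Sigma_x) equals the sum of the eigenvalues.
```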

PCA. Property 1: the subspace spanned by the first k eigenvectors has the smallest mean-square deviation from X among all subspaces of dimension k. Property 2: the total variance is equal to the sum of the eigenvalues of the original covariance matrix.

Multidimensional Scaling (MDS). Multidimensional Scaling (MDS) produces a low-dimensional representation of the data such that the distances in the new space reflect the proximities of the data in the original space. Denote the symmetric proximity matrix as Δ = {δ_ij, i, j = 1, ..., n}. MDS tries to find a mapping such that the distances in the lower-dimensional space, d_ij = d(y_i, y_j), are as close as possible to a function of the corresponding proximities, f(δ_ij).

MDS Mapping. Cost function: Σ_{i,j} [f(δ_ij) − d_ij]^2 / scale_factor, where the scale_factor is often based on Σ_{i,j} f(δ_ij)^2 or Σ_{i,j} d_ij^2. Problem: find the optimal mapping that minimizes the cost function. If the proximity is a distance measure (L_2 or L_1), it is called metric MDS. If the proximity uses only ordinal information about the data, it is called non-metric MDS.
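For the metric case with Euclidean distances, classical MDS has a closed-form solution via double centering and eigendecomposition. A hedged sketch (my own implementation, not taken from the slides):

```python
import numpy as np

def classical_mds(D, k):
    """Sketch: embed points in k dimensions given an n x n distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]        # keep the k largest eigenvalues
    scale = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * scale             # n x k coordinates Y
```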

Isomap. Disadvantages of PCA and MDS: 1) both methods often fail to discover complicated nonlinear structure, and 2) both have difficulties in detecting the intrinsic dimension of the data. Goal: combine the major algorithmic features of PCA and MDS (computational efficiency, global optimality, and asymptotic convergence guarantees) with the flexibility to learn nonlinear manifolds. Idea: introduce a geodesic distance that better describes the relation between data points.

Isomap. Illustration: points far apart on the underlying manifold, as measured by their geodesic distance, may appear close in the high-dimensional input space. (The Swiss Roll data set.)

Isomap. In this approach the intrinsic geometry of the data is preserved by capturing the manifold distances between all data points. For neighboring points (within ε, or k-nearest), the Euclidean distance provides a good approximation to the geodesic distance. For faraway points, the geodesic distance can be approximated by adding up a sequence of short hops between neighboring points, computed with a shortest-path algorithm such as Floyd-Warshall.

Isomap Algorithm. Step 1: determine which points are neighbors on the manifold, based on the input-space distances. Step 2: estimate the geodesic distances d_G(i, j) between all pairs of points on the manifold M by computing shortest-path distances in the neighborhood graph, using the input-space distances d_X(i, j) as edge weights. Step 3: apply MDS (or PCA) to the graph distance matrix D_G = {d_G(i, j)}.
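A hedged sketch of these three steps (my own implementation; the neighborhood size and the SciPy/scikit-learn helpers are assumptions, and it assumes the neighborhood graph is connected). In practice, sklearn.manifold.Isomap packages the same pipeline.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, k=2):
    """Sketch of Isomap: neighborhood graph -> geodesic distances -> MDS."""
    # Step 1: k-nearest-neighbor graph weighted by input-space distances.
    G = kneighbors_graph(X, n_neighbors, mode="distance")
    # Step 2: approximate geodesic distances d_G(i, j) by shortest paths
    # (Floyd-Warshall, as mentioned above).
    D_G = shortest_path(G, method="FW", directed=False)
    # Step 3: classical MDS on the graph distance matrix D_G.
    n = D_G.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D_G ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
```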

The Swiss Roll Problem

Detect Intrinsic Dimension. The intrinsic dimensionality of the data can be estimated from the rate of decrease of the residual variance as the dimensionality of Y is increased. The residual variance is defined as 1 − R^2(D_M, D_y), where R(·) is the linear correlation coefficient, D_M is the estimated (geodesic) distance matrix in the original space, and D_y is the distance matrix in the projected space.
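A hedged sketch of this estimate (the function name and pairing of distance matrices are my own):

```python
import numpy as np

def residual_variance(D_M, D_y):
    """Sketch: 1 - R^2 between original-space and embedding-space distances."""
    iu = np.triu_indices_from(D_M, k=1)          # count each pair once
    r = np.corrcoef(D_M[iu], D_y[iu])[0, 1]      # linear correlation coefficient R
    return 1.0 - r ** 2

# Usage sketch: embed for d = 1, 2, 3, ... and look for the "elbow" where the
# residual variance stops decreasing appreciably; that d estimates the
# intrinsic dimensionality.
```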

Theoretical Analysis. The main contribution of Isomap is to substitute the geodesic distance for the Euclidean distance, which may better capture the nonlinear structure of a manifold. Given sufficient data, Isomap is guaranteed asymptotically to recover the true dimensionality and geometric structure of a nonlinear manifold.

Experiments

Experiments

Experiment 1: Facial Images

Experiment 2: The handwritten 2s

Locally Linear Embedding (LLE). MDS and its variant Isomap try to preserve pairwise distances between data points. Locally Linear Embedding (LLE) is an unsupervised learning algorithm that recovers global nonlinear structure from locally linear fits. Assumption: each data point and its neighbors lie on or close to a locally linear patch of the manifold.

Local Linearity

LLE. Idea: the local geometry is characterized by linear coefficients that reconstruct each data point from its neighbors. The reconstruction cost is defined as ε(W) = Σ_i | x_i − Σ_j w_ij x_j |^2. Two constraints: 1) each data point is reconstructed only from its neighbors, not from faraway points (w_ij = 0 if x_j is not a neighbor of x_i), and 2) the rows of the weight matrix sum to one (Σ_j w_ij = 1).
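These constrained weights have a closed-form least-squares solution per point; a hedged sketch (my own code; the small regularization term is a standard practical addition, not from the slides):

```python
import numpy as np

def lle_weights(X, neighbors, reg=1e-3):
    """Sketch: reconstruction weights W, where neighbors[i] holds the indices
    of the K nearest neighbors of X[i]."""
    n, K = X.shape[0], neighbors.shape[1]
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]              # shift the neighborhood to the origin
        C = Z @ Z.T                             # local K x K Gram matrix
        C += reg * np.trace(C) * np.eye(K)      # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(K))      # solve C w = 1
        W[i, neighbors[i]] = w / w.sum()        # enforce the sum-to-one constraint
    return W
```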

Linear reconstruction

LLE. The constrained weights obey an important symmetry: for any data point they are invariant to rotations, rescalings, and translations of that point and its neighbors. Although the global manifold may be nonlinear, for each locally linear neighborhood there exists a linear mapping (consisting of a translation, rotation, and rescaling) that projects the neighborhood to the low-dimensional space. The same weights that reconstruct the ith data point in D dimensions should also reconstruct its embedded coordinates on the manifold in d dimensions.

LLE. W is solved by minimizing the reconstruction cost function in the original space. To find the optimal global mapping to the lower-dimensional space, define an embedding cost function: φ(Y) = Σ_i | y_i − Σ_j w_ij y_j |^2. Because W is now fixed, the problem becomes finding the optimal embedding Y (the mapping X → Y) that minimizes the embedding cost function.
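The standard way to minimize φ(Y) for fixed W is to take the bottom eigenvectors of (I − W)^T (I − W); a hedged sketch (my own code):

```python
import numpy as np

def lle_embedding(W, d):
    """Sketch: d-dimensional embedding Y that minimizes phi(Y) for fixed W."""
    n = W.shape[0]
    M = (np.eye(n) - W).T @ (np.eye(n) - W)   # phi(Y) = trace(Y^T M Y)
    eigvals, eigvecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    # Discard the constant eigenvector (eigenvalue ~ 0) and keep the next d.
    return eigvecs[:, 1:d + 1]
```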

Theoretical analysis: 1) there is only one free parameter, K, and the transformation is deterministic; 2) the optimization is guaranteed to reach the global optimum given sufficient data points; 3) LLE does not have to be rerun to compute higher-dimensional embeddings; 4) the intrinsic dimension d can be estimated by analyzing a reciprocal cost function that reconstructs X from Y.

Experiment 1: Facial Images

Experiment 2: Words in semantic space

Experiment 2: Arranging words in semantic space

Charting. Charting is the problem of assigning a low-dimensional coordinate system to data points in a high-dimensional sample space. Assume that the data lies on or near a low-dimensional manifold in the sample space and that there exists a 1-to-1 smooth nonlinear transform between the manifold and a low-dimensional vector space. Goal: find a mapping, expressed as a kernel-based mixture of linear projections, that minimizes the information loss about the density and relative locations of the sample points.

Local Linear Scale and Intrinsic Dimensionality. Local linear scale (r): at some scale r, the mapping from a neighborhood on the manifold M^d (in the original space) to the low-dimensional space is linear. Consider a ball of radius r centered on a data point and containing n(r) data points. The count n(r) grows as r^d only at the locally linear scale.

Local Linear Scale and Intrinsic Dimensionality. There are two other factors that may affect the data distribution at different scales: isotropic noise (at smaller scales) and embedding curvature (at larger scales). Define c(r) = log r / log n(r). At the noise scale c(r) = 1/D < 1/d (D being the ambient dimension); at the locally linear scale c(r) = 1/d; at the curvature scale c(r) < 1/d.

Local Linear Scale and Intrinsic Dimensionality. Gradually increase r; when c(r) first peaks (at 1/d), we obtain one observation of both r and d. Averaging over all data points, we can estimate r and d.
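A simplified, hedged sketch of the counting idea (it fits a single global slope of log n(r) versus log r rather than locating the per-point peak of c(r) as the slides describe; all names and the fitting choice are my own):

```python
import numpy as np
from scipy.spatial.distance import cdist

def estimate_intrinsic_dim(X, radii):
    """Sketch: estimate d from the growth rate n(r) ~ r^d at the linear scale."""
    D = cdist(X, X)
    # Average neighborhood count n(r) over all data points, for each radius r.
    counts = np.array([[(row <= r).sum() - 1 for r in radii] for row in D])
    mean_counts = np.maximum(counts.mean(axis=0), 1.0)
    # Slope of log n(r) vs. log r approximates the intrinsic dimension d.
    slope, _ = np.polyfit(np.log(radii), np.log(mean_counts), 1)
    return slope
```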

Charting the data. Model: each chart is modeled as a Gaussian in a mixture model (GMM). Goal: find a soft partition of the data into locally linear low-dimensional neighborhoods. Problem: one data point may belong to several neighboring charts, so the estimation of each local Gaussian should take into account information from the neighboring charts.

Charting the data. Co-locality is defined to estimate how close two charts are: m_i(µ_j) = N(µ_j; µ_i, σ^2). Each data point is associated with a Gaussian neighborhood with µ_i = x_i. The covariance is estimated as Σ_i = Σ_j m_i(µ_j) [ (x_j − µ_i)(x_j − µ_i)^T + (µ_j − µ_i)(µ_j − µ_i)^T ] / Σ_j m_i(µ_j). This step brings non-local information about the manifold's shape into the local description of each neighborhood, ensuring that adjoining neighborhoods have similar covariances and small angles between their respective subspaces.
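A hedged sketch of the co-locality kernel only (σ would be the local linear scale estimated above; the function name is my own, and the covariance-blending step is not shown):

```python
import numpy as np
from scipy.spatial.distance import cdist

def colocality(X, sigma):
    """Sketch: m[i, j] = N(mu_j; mu_i, sigma^2 I), with mu_i = x_i."""
    sq = cdist(X, X, "sqeuclidean")
    d = X.shape[1]
    norm = (2.0 * np.pi * sigma ** 2) ** (-d / 2.0)   # Gaussian normalizer
    return norm * np.exp(-sq / (2.0 * sigma ** 2))
```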

Connecting the charts. To minimize the information loss in the connection, the projection of the data points into the local subspace associated with each neighborhood should have 1) minimal loss of local variance and 2) maximal agreement between the projections of nearby points into nearby neighborhoods. The first criterion is met by applying PCA to each chart to obtain a local low-dimensional coordinate system; each original data point then has a copy (a projected low-dimensional sample) in each local coordinate system. The second criterion is met by projecting each local coordinate system into a global coordinate system with minimal disagreement among the projections of each data point in the global space.

Connecting the charts. Each data point i is first projected into the neighboring local coordinate systems, giving local coordinates u_ji in chart j. Each copy is then projected into the global coordinate system: y_i = Σ_j p_j(x_i) G_j [u_ji; 1], where G_j is the affine projection from the jth chart to the global space and p_j(x_i) is the weight of chart j for point x_i. Minimizing the disagreement is modeled as a weighted least-squared-distance problem: [G_1, ..., G_K] = arg min Σ_{j,k} Σ_i p_k(x_i) p_j(x_i) || G_k [u_ki; 1] − G_j [u_ji; 1] ||_F^2.

Experiment 1: The Twisted Curl Problem

Experiment 2: The Trefoil Problem

Experiment 3: The Facial Image Modeling