Dimension reduction methods: Algorithms and Applications
Yousef Saad, Department of Computer Science and Engineering, University of Minnesota
Université du Littoral - Calais, July 11, 16

First... to the memory of Mohammed Bellalij

Introduction, background, and motivation
Common goal of data mining methods: to extract meaningful information or patterns from data. A very broad area that includes: data analysis, machine learning, pattern recognition, information retrieval, ...
Main tools used: linear algebra; graph theory; approximation theory; optimization; ...
In this talk: emphasis on dimension reduction techniques and the interrelations between techniques.

Introduction: a few factoids
Data is growing exponentially at an alarming rate: 90% of the data in the world today was created in the last two years. Every day, 2.3 million terabytes (2.3 × 10^18 bytes) are created.
Mixed blessing: opportunities & big challenges. The trend is re-shaping & energizing many research areas... including my own: numerical linear algebra.

Topics
Focus on two main problems:
- Information retrieval
- Face recognition
and two types of dimension reduction methods:
- Standard subspace methods [SVD, Lanczos]
- Graph-based methods

Major tool of Data Mining: Dimension reduction
The goal is not so much to reduce size (& cost) but to:
- Reduce noise and redundancy in the data before performing a task [e.g., classification as in digit/face recognition]
- Discover important features or parameters
The problem: Given $X = [x_1, \ldots, x_n] \in \mathbb{R}^{m \times n}$, find a low-dimensional representation $Y = [y_1, \ldots, y_n] \in \mathbb{R}^{d \times n}$ of $X$.
Achieved by a mapping $\Phi : x \in \mathbb{R}^m \longrightarrow y \in \mathbb{R}^d$ so that $\Phi(x_i) = y_i$, $i = 1, \ldots, n$.

[Diagram: data matrix $X$ ($m \times n$, columns $x_i$) mapped to $Y$ ($d \times n$, columns $y_i$)]
$\Phi$ may be linear: $y_i = W^\top x_i$, i.e., $Y = W^\top X$, ... or nonlinear (implicit).
The mapping $\Phi$ may be required to: Preserve proximity? Maximize variance? Preserve a certain graph?

Example: Principal Component Analysis (PCA)
In Principal Component Analysis, $W$ is computed to maximize the variance of the projected data:
$\max_{W \in \mathbb{R}^{m \times d},\ W^\top W = I} \ \sum_{i=1}^n \Big\| y_i - \frac{1}{n}\sum_{j=1}^n y_j \Big\|_2^2, \qquad y_i = W^\top x_i.$
Leads to maximizing $\mathrm{Tr}\left[ W^\top (X - \mu e^\top)(X - \mu e^\top)^\top W \right]$, with $\mu = \frac{1}{n}\sum_{i=1}^n x_i$.
Solution: $W$ = { $d$ dominant eigenvectors } of the covariance matrix = set of left singular vectors of $\bar{X} = X - \mu e^\top$.

SVD: $\bar{X} = U \Sigma V^\top$, with $U^\top U = I$, $V^\top V = I$, $\Sigma$ diagonal.
Optimal $W = U_d$ = matrix of the first $d$ columns of $U$.
The solution $W$ also minimizes the reconstruction error:
$\sum_i \| x_i - W W^\top x_i \|^2 = \sum_i \| x_i - W y_i \|^2$
In some methods the recentering to zero is not done, i.e., $\bar{X}$ is replaced by $X$.
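To make the computation concrete, here is a minimal numpy sketch of PCA via the SVD of the recentered data, along the lines above; the function name and the commented test data are illustrative, not from the talk.

    import numpy as np

    def pca(X, d):
        """PCA of the columns of X (m x n): W = U_d and projection Y = W^T (X - mu e^T)."""
        mu = X.mean(axis=1, keepdims=True)              # mu = (1/n) sum_i x_i
        Xbar = X - mu                                   # recentered data
        U, S, Vt = np.linalg.svd(Xbar, full_matrices=False)
        W = U[:, :d]                                    # first d left singular vectors
        Y = W.T @ Xbar                                  # d x n low-dimensional representation
        return W, Y, mu

    # reconstruction error: sum_i ||(x_i - mu) - W W^T (x_i - mu)||^2
    # X = np.random.rand(100, 500); W, Y, mu = pca(X, 10)
    # err = np.linalg.norm((X - mu) - W @ Y, 'fro')**2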

Unsupervised learning
Unsupervised learning: methods that do not exploit known labels.
Example of digits: perform a 2-D projection. Images of the same digit tend to cluster (more or less). Such 2-D representations are popular for visualization.
Can also try to find natural clusters in the data, e.g., in materials.
Basic clustering technique: K-means (see the sketch below).
[Figures: 2-D PCA projection of digit images 5-7; diagram of materials classes: superconductors, multiferroics, superhard, catalytic, photovoltaic, ferromagnetic, thermoelectric]
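A short, hedged sketch of this visualize-then-cluster workflow, assuming scikit-learn is available and reusing the pca() sketch from the previous slide; the data here is a random placeholder rather than actual digit images.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(256, 300)        # placeholder for vectorized images (one per column)
    W, Y, mu = pca(X, 2)                # 2-D PCA projection for visualization
    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Y.T)
    # scatter-plot Y[0] vs Y[1] colored by `labels` to look for natural clusters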

Example: The Swiss Roll (data points in 3-D)
[Figure: the original data in 3-D]

2-D reductions:
[Figure: four panels showing the PCA, LPP, Eigenmaps, and ONPP 2-D reductions of the Swiss Roll]

Example: Digit images (a random sample of 30)
[Figure: 30 randomly sampled digit images]

2-D reductions:
[Figure: four panels showing 2-D reductions of digits 0-4 by PCA, LLE, K-PCA, and ONPP]

APPLICATION: INFORMATION RETRIEVAL

Application: Information Retrieval
Given: a collection of documents (columns of a matrix $A$) and a query vector $q$.
Representation: an $m \times n$ term-by-document matrix $A$; a query $q$ is a (sparse) vector in $\mathbb{R}^m$ (a "pseudo-document").
Problem: find a column of $A$ that best matches $q$.
Vector space model: use $\cos(A(:,j), q)$, $j = 1 : n$. Requires the computation of $A^\top q$.
Literal matching is ineffective.
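A minimal sketch of the vector space model scoring $A^\top q$ followed by cosine normalization (numpy, dense matrices for simplicity; a sparse matrix would be used in practice):

    import numpy as np

    def vsm_scores(A, q):
        """Cosine similarity between the query q and every column of the term-document matrix A."""
        s = A.T @ q                                    # inner products A^T q
        col_norms = np.linalg.norm(A, axis=0)
        return s / (col_norms * np.linalg.norm(q) + 1e-12)

    # ranking = np.argsort(-vsm_scores(A, q))          # documents by decreasing similarity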

Common approach: Dimension reduction (SVD)
LSI: replace $A$ by a low-rank approximation [from the SVD]: $A = U \Sigma V^\top \longrightarrow A_k = U_k \Sigma_k V_k^\top$.
Replace the similarity vector $s = A^\top q$ by $s_k = A_k^\top q$.
Main issues: 1) computational cost, 2) updates.
Idea: Replace $A_k$ by $A\, \varphi(A^\top A)$, where $\varphi$ is a filter function. Consider the step function (Heaviside):
$\varphi(x) = \begin{cases} 0, & 0 \le x \le \sigma_k^2 \\ 1, & \sigma_k^2 \le x \le \sigma_1^2 \end{cases}$
This would yield the same result as the TSVD but is not practical.
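A hedged sketch of the LSI scoring $s_k = A_k^\top q$ using a dense truncated SVD (for a large sparse $A$ an iterative SVD would be used instead):

    import numpy as np

    def lsi_scores(A, q, k):
        """Similarity s_k = A_k^T q, with A_k the rank-k truncated SVD of A."""
        U, S, Vt = np.linalg.svd(A, full_matrices=False)
        Uk, Sk, Vtk = U[:, :k], S[:k], Vt[:k, :]
        return Vtk.T @ (Sk * (Uk.T @ q))               # = V_k Sigma_k U_k^T q, without forming A_k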

Use of polynomial filters
Solution: use a polynomial approximation to $\varphi$.
Note: $s^\top = q^\top A\, \varphi(A^\top A)$ requires only matrix-vector products.
Ideal for situations where the data must be explored only once or a small number of times.
Details skipped; see: E. Kokiopoulou and YS, Polynomial Filtering in Latent Semantic Indexing for Information Retrieval, ACM-SIGIR, 04.
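To illustrate the "mat-vec only" point, here is a sketch that evaluates $s = \varphi(A^\top A)\, A^\top q$ by Horner's rule for a given set of polynomial coefficients; how the coefficients approximating the step function are obtained (e.g., a Chebyshev or least-squares fit, as in the cited paper) is not shown and is assumed given.

    import numpy as np

    def filtered_scores(A, q, coeffs):
        """s = phi(A^T A) A^T q with phi(x) = coeffs[0] + coeffs[1] x + ... + coeffs[k] x^k.
        Uses only matrix-vector products with A and A^T (Horner's rule)."""
        v = A.T @ q
        s = coeffs[-1] * v
        for c in reversed(coeffs[:-1]):
            s = A.T @ (A @ s) + c * v                  # s <- (A^T A) s + c v
        return s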

IR: Use of the Lanczos algorithm (J. Chen, YS 09)
Lanczos algorithm = projection method on the Krylov subspace $\mathrm{Span}\{v, Av, \ldots, A^{m-1}v\}$.
One can get singular vectors with Lanczos and use them in LSI. Better: use the Lanczos vectors directly for the projection.
K. Blom and A. Ruhe [SIMAX, vol. 26, 05] perform a Lanczos run for each query [expensive].
Proposed: one Lanczos run with a random initial vector; then use the Lanczos vectors in place of the singular vectors.
In short: results comparable to those of the SVD at a much lower cost.
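The following is a rough sketch of this idea, not the exact algorithm of the Chen-Saad paper: one symmetric Lanczos run on $A A^\top$ with a random start (full reorthogonalization, for simplicity), after which the Lanczos basis $Q_k$ is used in place of the left singular vectors.

    import numpy as np

    def lanczos_basis(A, k, seed=0):
        """k steps of Lanczos on M = A A^T; returns an orthonormal basis Q (m x k) of
        span{v, Mv, ..., M^{k-1} v}. Full reorthogonalization keeps the sketch simple."""
        m = A.shape[0]
        rng = np.random.default_rng(seed)
        q = rng.standard_normal(m)
        q /= np.linalg.norm(q)
        Q = np.zeros((m, k))
        for j in range(k):
            Q[:, j] = q
            w = A @ (A.T @ q)                          # M q via two matvecs
            w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)   # orthogonalize against previous vectors
            beta = np.linalg.norm(w)
            if beta < 1e-12:
                return Q[:, :j + 1]                    # invariant subspace reached
            q = w / beta
        return Q

    def lanczos_scores(A, q, Q):
        """Approximate similarity vector s ~ A^T (Q Q^T q)."""
        return A.T @ (Q @ (Q.T @ q))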

Tests: IR
Information retrieval datasets:

  Dataset   # Terms   # Docs   # queries   sparsity
  MED       7,014     1,033    30          0.735
  CRAN      3,763     1,398    225         1.412

[Figure: preprocessing times on the Med and Cran datasets as a function of the number of iterations, lanczos vs. tsvd]

Average retrieval precision
[Figure: average retrieval precision vs. number of iterations on the Med and Cran datasets, comparing lanczos, tsvd, lanczos-rf, and tsvd-rf]
Retrieval precision comparisons.

Supervised learning: classification
Problem: Given labels (say "A" and "B") for each item of a given set, find a mechanism to classify an unlabelled item into either the "A" or the "B" class.
Many applications. Example: distinguish SPAM from non-SPAM messages.
Can be extended to more than two classes.

Supervised learning: classification
Best illustration: the handwritten digit recognition example.
Given: a set of labeled samples (the training set) and an (unlabeled) test image. Problem: find the label of the test image.
[Figure: training data (digits 0 through 9) and test data (unknown digit), both passed through dimension reduction]
Roughly speaking: we seek a dimension reduction so that recognition is more effective in the low-dimensional space.

Supervised learning: linear classification
Linear classifiers: find a hyperplane which best separates the data in classes A and B.
Example of application: distinguish between SPAM and non-SPAM e-mails.
[Figure: a linear classifier separating two classes]
Note: the world is non-linear. Often this is combined with kernels, which amounts to changing the inner product.

A harder case:
[Figure: "Spectral Bisection (PDDP)" on two intertwined clusters]
Use kernels to transform the data.

[Figure: "Projection with Kernels, $\sigma^2 = 2.7463$": the data transformed with a Gaussian kernel]
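A minimal sketch of a Gaussian-kernel projection (kernel PCA), which is one way to produce the kind of transformed data shown above; the bandwidth sigma is a free parameter.

    import numpy as np

    def kernel_pca(X, d, sigma):
        """Project the n columns of X to d dimensions with Gaussian-kernel PCA."""
        n = X.shape[1]
        sq = np.sum(X**2, axis=0)
        D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # pairwise squared distances
        K = np.exp(-D2 / (2.0 * sigma**2))                  # Gaussian kernel matrix
        J = np.eye(n) - np.ones((n, n)) / n
        Kc = J @ K @ J                                      # double-center the kernel matrix
        vals, vecs = np.linalg.eigh(Kc)                     # eigenvalues in increasing order
        vals, vecs = vals[::-1][:d], vecs[:, ::-1][:, :d]   # keep the top-d eigenpairs
        return (vecs * np.sqrt(np.maximum(vals, 0.0))).T    # d x n embedding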

GRAPH-BASED TECHNIQUES

Graph-based methods
Start with a graph of the data, e.g., the graph of k nearest neighbors (k-nn graph).
Want: perform a projection which preserves the graph in some sense.
Define a graph Laplacean: $L = D - W$, e.g.,
$w_{ij} = \begin{cases} 1 & \text{if } j \in \mathrm{Adj}(i) \\ 0 & \text{else} \end{cases}, \qquad D = \mathrm{diag}\Big( d_{ii} = \sum_{j \ne i} w_{ij} \Big)$
with $\mathrm{Adj}(i)$ = neighborhood of $i$ (excluding $i$).
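A small numpy sketch of the construction just described: a symmetrized k-nn graph with 0/1 weights and its Laplacean $L = D - W$ (dense, for illustration only).

    import numpy as np

    def knn_laplacian(X, k):
        """k-nn graph (0/1 weights, symmetrized) on the columns of X, and L = D - W."""
        n = X.shape[1]
        sq = np.sum(X**2, axis=0)
        D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # pairwise squared distances
        W = np.zeros((n, n))
        for i in range(n):
            nbrs = np.argsort(D2[i])[1:k + 1]              # k nearest neighbors, excluding i
            W[i, nbrs] = 1.0
        W = np.maximum(W, W.T)                             # make the graph undirected
        D = np.diag(W.sum(axis=1))
        return D - W, W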

A side note: graph partitioning
If $x$ is a vector of signs ($\pm 1$) then $x^\top L x = 4 \times$ (number of edge cuts), where an edge-cut is a pair $(i, j)$ with $x_i \ne x_j$.
Consequence: can be used for partitioning graphs, or for clustering [take $p = \mathrm{sign}(u_2)$, where $u_2$ = the 2nd smallest eigenvector].
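A corresponding sketch of spectral bisection: split the graph by the sign of the second smallest eigenvector of L (a dense solver is used here for brevity).

    import numpy as np

    def spectral_bisection(L):
        """Partition by the sign of the 2nd smallest eigenvector of the graph Laplacean L."""
        vals, vecs = np.linalg.eigh(L)        # eigenvalues in increasing order
        p = np.sign(vecs[:, 1])               # u_2, the Fiedler vector
        p[p == 0] = 1.0                       # arbitrary side for exact zeros
        return p                              # +/-1 labels; p @ L @ p = 4 * (number of edge cuts)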

Example: the Laplacean eigenmaps approach
Laplacean Eigenmaps [Belkin-Niyogi 01] minimizes
$F(Y) = \sum_{i,j=1}^n w_{ij} \| y_i - y_j \|^2 \quad \text{subject to} \quad Y D Y^\top = I.$
Motivation: if $\|x_i - x_j\|$ is small (original data), we want $\|y_i - y_j\|$ to also be small (low-dimensional data).
The original data is used only indirectly, through its graph. Leads to an $n \times n$ sparse eigenvalue problem [in sample space].

The problem translates to:
$\min_{Y \in \mathbb{R}^{d \times n},\ Y D Y^\top = I} \ \mathrm{Tr}\left[ Y (D - W) Y^\top \right].$
Solution (sort eigenvalues increasingly): $(D - W) u_i = \lambda_i D u_i$; $y_i = u_i$; $i = 1, \ldots, d$.
Note: one can assume $D = I$, which amounts to rescaling the data. The problem then becomes $(I - W) u_i = \lambda_i u_i$; $y_i = u_i$; $i = 1, \ldots, d$.
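A minimal sketch of the resulting computation, using a dense generalized eigensolver for clarity (a sparse solver such as scipy.sparse.linalg.eigsh would be used on real data); W can come from the k-nn construction above.

    import numpy as np
    from scipy.linalg import eigh

    def laplacian_eigenmaps(W, d):
        """Solve (D - W) u = lambda D u and return the d x n embedding Y."""
        D = np.diag(W.sum(axis=1))
        vals, vecs = eigh(D - W, D)           # generalized eigenproblem, increasing eigenvalues
        return vecs[:, 1:d + 1].T             # skip the trivial constant eigenvector (lambda = 0)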

Locally Linear Embedding (Roweis-Saul 00)
LLE is very similar to Eigenmaps. Main differences: 1) the graph Laplacean matrix is replaced by an affinity graph, and 2) the objective function is changed.
1. Graph: each $x_i$ is written as a convex combination of its $k$ nearest neighbors: $x_i \approx \sum_{j \in N_i} w_{ij} x_j$, with $\sum_{j \in N_i} w_{ij} = 1$.
The optimal weights are computed (a local calculation) by minimizing $\| x_i - \sum_j w_{ij} x_j \|$ for $i = 1, \ldots, n$.

2. Mapping: the $y_i$'s should obey the same affinity as the $x_i$'s. Minimize
$\sum_i \Big\| y_i - \sum_j w_{ij} y_j \Big\|^2 \quad \text{subject to} \quad Y \mathbf{1} = 0, \ Y Y^\top = I.$
Solution: $(I - W^\top)(I - W) u_i = \lambda_i u_i$; $y_i = u_i$.
$(I - W^\top)(I - W)$ replaces the graph Laplacean of eigenmaps.
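A compact, hedged sketch of both LLE steps (weights by local least squares, then the small eigenvectors of $(I - W)^\top (I - W)$); dense linear algebra and a simple regularization are used for illustration only.

    import numpy as np

    def lle(X, k, d, reg=1e-3):
        """Locally Linear Embedding of the n columns of X into d dimensions."""
        n = X.shape[1]
        sq = np.sum(X**2, axis=0)
        D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
        W = np.zeros((n, n))
        for i in range(n):
            nbrs = np.argsort(D2[i])[1:k + 1]             # k nearest neighbors of x_i
            Z = X[:, nbrs] - X[:, [i]]                    # neighbors shifted by x_i
            G = Z.T @ Z                                   # local Gram matrix (k x k)
            G += reg * np.trace(G) * np.eye(k)            # regularize for stability
            w = np.linalg.solve(G, np.ones(k))
            W[i, nbrs] = w / w.sum()                      # convex weights, sum to 1
        M = (np.eye(n) - W).T @ (np.eye(n) - W)
        vals, vecs = np.linalg.eigh(M)                    # increasing eigenvalues
        return vecs[:, 1:d + 1].T, W                      # skip the constant eigenvector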

ONPP (Kokiopoulou and YS 05)
Orthogonal Neighborhood Preserving Projections: a linear (orthogonal) version of LLE obtained by writing $Y$ in the form $Y = V^\top X$.
Same graph as LLE. Objective: preserve the affinity graph (as in LLE) but with the constraint $Y = V^\top X$.
Problem solved to obtain the mapping:
$\min_{V,\ V^\top V = I} \ \mathrm{Tr}\left[ V^\top X (I - W^\top)(I - W) X^\top V \right]$
In LLE, $V^\top X$ is replaced by $Y$.
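Continuing the sketch, the ONPP projector can be obtained from the smallest eigenvectors of $X M X^\top$ with $M = (I - W)^\top (I - W)$, reusing the LLE weight matrix W computed above:

    import numpy as np

    def onpp(X, W, d):
        """ONPP: V = the d eigenvectors of X M X^T with smallest eigenvalues, M = (I-W)^T (I-W)."""
        n = X.shape[1]
        M = (np.eye(n) - W).T @ (np.eye(n) - W)
        vals, vecs = np.linalg.eigh(X @ M @ X.T)          # increasing eigenvalues
        V = vecs[:, :d]                                   # orthogonal basis of the projection
        return V, V.T @ X                                 # explicit mapping: y = V^T x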

Implicit vs. explicit mappings
In PCA the mapping $\Phi$ from the high-dimensional space ($\mathbb{R}^m$) to the low-dimensional space ($\mathbb{R}^d$) is explicitly known: $y = \Phi(x) \equiv V^\top x$.
In Eigenmaps and LLE we only know the mapping at the sample points, $y_i = \phi(x_i)$, $i = 1, \ldots, n$; the mapping $\phi$ is complex (implicit).
It is difficult to get $\phi(x)$ for an arbitrary $x$ not in the sample, which is inconvenient for classification: the "out-of-sample extension" problem.

Face Recognition: background
Problem: We are given a database of images [arrays of pixel values] and a test (new) image.
Question: Does this new image correspond to one of those in the database?
Difficulty: positions, expressions, lighting, ...

Example: Eigenfaces [Turk-Pentland, 91]
The idea is identical to the one we saw for digits: consider each picture as a (1-D) column of all its pixels, and put the columns together into an array $A$ of size #pixels $\times$ #images.
Do an SVD of $A$ and perform the comparison with any test image in the low-dimensional space.
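A minimal sketch of this eigenfaces pipeline (SVD of the centered image array, then nearest-neighbor matching in the low-dimensional space); the data layout and names are illustrative.

    import numpy as np

    def eigenfaces_match(A, test, d):
        """A: #pixels x #images training array (one vectorized face per column).
        Returns the index of the training image closest to `test` in the d-dim eigenface space."""
        mu = A.mean(axis=1, keepdims=True)
        U, S, Vt = np.linalg.svd(A - mu, full_matrices=False)
        Ud = U[:, :d]                                     # the d leading eigenfaces
        Ytrain = Ud.T @ (A - mu)                          # training images in low-dim space
        ytest = Ud.T @ (test.reshape(-1, 1) - mu)         # test image in the same space
        return int(np.argmin(np.linalg.norm(Ytrain - ytest, axis=0)))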

Graph-based methods in a supervised setting
Graph-based methods can be adapted to a supervised mode. Idea: build $G$ so that nodes in the same class are neighbors. If $c$ = # classes, $G$ consists of $c$ cliques.
The weight matrix $W$ is block-diagonal, $W = \mathrm{diag}(W_1, W_2, \ldots, W_c)$. Note: $\mathrm{rank}(W) = n - c$.
As before, the graph Laplacean: $L_c = D - W$. Can be used for ONPP and other graph-based methods.
Improvement: add a repulsion Laplacean [Kokiopoulou, YS 09].
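A small sketch of the supervised graph construction described above: the block-diagonal weight matrix built from class labels and the class Laplacean $L_c = D - W$ (the repulsion Laplacean of the cited paper is not reproduced here).

    import numpy as np

    def class_laplacian(labels):
        """W_ij = 1 iff samples i and j share a label (i != j); returns L_c = D - W and W."""
        labels = np.asarray(labels)
        W = (labels[:, None] == labels[None, :]).astype(float)
        np.fill_diagonal(W, 0.0)
        D = np.diag(W.sum(axis=1))
        return D - W, W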

Leads to an eigenvalue problem with the matrix $L_c - \rho L_R$, where $L_c$ = class Laplacean, $L_R$ = repulsion Laplacean, and $\rho$ = a parameter.
[Figure: three classes, with attraction within each class and repulsion between classes]
Test: ORL database. 40 subjects, 10 sample images each; # of pixels: $112 \times 92$; total # of images: 400.

[Figure: recognition accuracy on ORL with 5 training images per class, as a function of subspace dimension, comparing onpp, pca, olpp, laplace, fisher, and their repulsion (R) variants]
Observation: some values of $\rho$ yield better results than using the optimum $\rho$ obtained from maximizing the trace ratio.

Conclusion
Interesting new matrix problems arise in areas that involve the effective mining of data.
Among the most pressing issues is that of reducing computational cost [SVD, SDP, ..., too costly].
Many online resources are available.
Huge potential in areas like materials science, though inertia has to be overcome.
To a researcher in computational linear algebra: a big tide of change in the types of problems, algorithms, frameworks, culture, ... But change should be welcome.

"When one door closes, another opens; but we often look so long and so regretfully upon the closed door that we do not see the one which has opened for us." Alexander Graham Bell (1847-1922)
In the words of "Who Moved My Cheese?" [Spencer Johnson, 02]: "If you do not change, you can become extinct!"