DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD


The curse of dimensionality. Real data usually have thousands or millions of dimensions. E.g., web documents, where the dimensionality is the vocabulary of words, or the Facebook graph, where the dimensionality is the number of users. A huge number of dimensions causes problems: the data become very sparse, and some algorithms become meaningless (e.g., density-based clustering). The complexity of several algorithms depends on the dimensionality, and they become infeasible.

Dimensionality Reduction. Usually the data can be described with fewer dimensions, without losing much of the meaning of the data; the data reside in a space of lower dimensionality. Essentially, we assume that some of the data is noise, and we can approximate the useful part with a lower-dimensional space. Dimensionality reduction does not just reduce the amount of data; it often brings out the useful part of the data.

Dimensionality Reduction. We have already seen a form of dimensionality reduction: LSH and random projections reduce the dimension while preserving the distances.

Data in the form of a matrix. We are given n objects and d attributes describing the objects; each object has d numeric values describing it. We will represent the data as an n×d real matrix A. We can now use tools from linear algebra to process the data matrix. Our goal is to produce a new n×k matrix B such that it preserves as much of the information in the original matrix A as possible, and it reveals something about the structure of the data in A.

Example: Document matrices. n documents, d terms (e.g., theorem, proof, etc.). A_ij = frequency of the j-th term in the i-th document. Find subsets of terms that bring documents together.

Example: Recommendation systems. n customers, d movies. A_ij = rating of the j-th movie by the i-th customer. Find subsets of movies that capture the behavior of the customers.

Linear algebra. We assume that vectors are column vectors; we use v^T for the transpose of vector v (a row vector). Dot product: u^T v (1×n times n×1 gives a 1×1 scalar). The dot product is the projection of vector v on u (and vice versa). Example: [1, 2, 3] [4, 1, 2]^T = 12. u^T v = |u| |v| cos(u, v). If |u| = 1 (unit vector) then u^T v is the projection length of v on u. Example: [1, 2, 3] [4, 1, -2]^T = 0, so the vectors are orthogonal. Orthonormal vectors: two unit vectors that are orthogonal.
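
A quick illustrative check of the two dot-product examples above, using NumPy (this snippet is an addition, not part of the original slides):

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 1, 2])
w = np.array([4, 1, -2])

print(u @ v)   # 12: dot product of u and v
print(u @ w)   # 0: u and w are orthogonal

# Projection length of v on the unit vector in the direction of u
u_unit = u / np.linalg.norm(u)
print(u_unit @ v)
```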

Matrices. An n×m matrix A is a collection of n row vectors and m column vectors: A = [a_1 a_2 a_3] (columns), or A = [α_1^T; α_2^T; α_3^T] (rows). Matrix-vector multiplication. Right multiplication Au: projection of u onto the row vectors of A, or projection of the row vectors of A onto u. Left multiplication u^T A: projection of u onto the column vectors of A, or projection of the column vectors of A onto u. Example: [1, 2, 3] times the 3×2 matrix with rows [1, 0], [0, 1], [0, 0] gives [1, 2].

Rank. Row space of A: the set of vectors that can be written as a linear combination of the rows of A, i.e., all vectors of the form v = u^T A. Column space of A: the set of vectors that can be written as a linear combination of the columns of A, i.e., all vectors of the form v = Au. Rank of A: the number of linearly independent row (or column) vectors. These vectors define a basis for the row (or column) space of A.

Rank-1 matrices. In a rank-1 matrix, all columns (or rows) are multiples of the same column (or row) vector. Example: A = [[1, 2, -1], [2, 4, -2], [3, 6, -3]]. All rows are multiples of r = [1, 2, -1]; all columns are multiples of c = [1, 2, 3]^T. Outer product: uv^T (n×1 times 1×m gives n×m). The resulting n×m matrix has rank 1: all rows (or columns) are linearly dependent. Here A = c r^T.
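
A small NumPy sketch (added for illustration) that builds this rank-1 matrix as an outer product and checks its rank:

```python
import numpy as np

c = np.array([1, 2, 3])    # column prototype
r = np.array([1, 2, -1])   # row prototype

A = np.outer(c, r)         # c r^T, a 3x3 matrix
print(A)
print(np.linalg.matrix_rank(A))  # 1
```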

Eigenvectors. (Right) eigenvector of matrix A: a vector v such that Av = λv, where λ is the eigenvalue of eigenvector v. A square symmetric matrix A of rank r has r orthonormal eigenvectors u_1, u_2, ..., u_r with eigenvalues λ_1, λ_2, ..., λ_r. These eigenvectors define an orthonormal basis for the column space of A.

Singular Value Decomposition. A = U Σ V^T, where U = [u_1, u_2, ..., u_r], Σ = diag(σ_1, σ_2, ..., σ_r), and V^T is the matrix with rows v_1^T, v_2^T, ..., v_r^T. The dimensions are [n×m] = [n×r] [r×r] [r×m], where r is the rank of matrix A. σ_1 ≥ σ_2 ≥ ... ≥ σ_r: singular values of matrix A (also, the square roots of the eigenvalues of AA^T and A^T A). u_1, u_2, ..., u_r: left singular vectors of A (also, eigenvectors of AA^T). v_1, v_2, ..., v_r: right singular vectors of A (also, eigenvectors of A^T A). A = σ_1 u_1 v_1^T + σ_2 u_2 v_2^T + ... + σ_r u_r v_r^T.
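
The following NumPy sketch (an illustration on random data, not from the lecture) verifies the decomposition and the relation between the singular values and the eigenvalues of A^T A:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # a small random data matrix

# Thin SVD: U is 5x3, s holds the singular values, Vt is 3x3
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A equals the sum of the rank-1 terms sigma_i * u_i v_i^T
A_rec = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_rec))       # True

# Singular values are the square roots of the eigenvalues of A^T A
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]
print(np.allclose(s**2, eigvals))  # True
```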

Symmetric matrices. Special case: A is a symmetric positive definite matrix. A = λ_1 u_1 u_1^T + λ_2 u_2 u_2^T + ... + λ_r u_r u_r^T, where λ_1 ≥ λ_2 ≥ ... ≥ λ_r ≥ 0 are the eigenvalues of A and u_1, u_2, ..., u_r are the eigenvectors of A.

Singular Value Decomposition. The left singular vectors are an orthonormal basis for the column space of A; the right singular vectors are an orthonormal basis for the row space of A. If A has rank r, then A can be written as the sum of r rank-1 matrices: there are r linear components (trends) in A. Linear trend: the tendency of the row vectors of A to align with vector v_i. Strength of the i-th linear trend: |A v_i| = σ_i.

An (extreme) example. Document-term matrix in which the blue and red rows (columns) are linearly dependent. There are two prototype documents (vectors of words): blue and red. To describe the data it is enough to describe the two prototypes and the projection weights for each row. A is a rank-2 matrix: A = [w_1, w_2] [d_1^T; d_2^T], where d_1, d_2 are the prototype documents and w_1, w_2 the columns of projection weights.

A (more realistic) example. Document-term matrix. There are still two prototype documents and term groups, but they are noisy. We now have more than two singular vectors, but the strongest ones still correspond to the two types. By keeping the two strongest singular vectors we obtain most of the information in the data: this is a rank-2 approximation of the matrix A.

Rank-k approximations (A_k). A_k = U_k Σ_k V_k^T, with dimensions [n×d] = [n×k] [k×k] [k×d]. U_k (V_k): orthogonal matrix containing the top k left (right) singular vectors of A. Σ_k: diagonal matrix containing the top k singular values of A. A_k is an approximation of A; in fact, it is the best rank-k approximation of A.

SVD as an optimization. The rank-k approximation matrix A_k produced by the top-k singular vectors of A minimizes the Frobenius norm of the difference with the matrix A: A_k = arg min_{B: rank(B) = k} ||A − B||_F^2, where ||A − B||_F^2 = Σ_{i,j} (A_ij − B_ij)^2.
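
As a sketch of this optimality property (illustrative random data, and a helper function of my own), the truncated SVD gives a rank-k matrix whose Frobenius error is no larger than that of any other rank-k matrix, e.g. a random one:

```python
import numpy as np

def rank_k_approx(A, k):
    """Rank-k approximation of A via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 10))
k = 3

A_k = rank_k_approx(A, k)
err_svd = np.linalg.norm(A - A_k, "fro")

# Any other rank-k matrix (here a random one) has at least this error
B = rng.standard_normal((20, k)) @ rng.standard_normal((k, 10))
err_rand = np.linalg.norm(A - B, "fro")

print(err_svd <= err_rand)  # True
```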

What does this mean? We can project the row (and column) vectors of the matrix A into a k-dimensional space and preserve most of the information. (Ideally) the k dimensions reveal latent features/aspects/topics of the term (document) space. (Ideally) the A_k approximation of matrix A contains all the useful information, and what is discarded is noise.

Latent factor model. Rows (columns) are linear combinations of k latent factors; e.g., in our extreme document example there are two factors. Some noise is added to this rank-k matrix, resulting in a matrix of higher rank. SVD retrieves the latent factors (hopefully).

SVD and rank-k approximations. [Figure: A = U Σ V^T, with the rows of A indexed by objects and the columns by features; the top (significant) singular vectors and values are separated from the remaining (noise) part.]

Application: Recommender systems. Data: users rating movies; sparse and often noisy. Assumption: there are k basic user profiles (e.g., action, comedy, drama, romance), and each user is a weighted combination of these profiles, so the true matrix has rank k. What we observe is a noisy and incomplete version of this matrix, A. The rank-k approximation A_k is provably close to the true rank-k matrix. Algorithm: compute A_k and predict the value A_k[u, m] for user u and movie m. This is model-based collaborative filtering.
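
A minimal sketch of this idea (hypothetical ratings; missing entries are naively filled with each user's mean before the SVD, which is only one of several possible choices and not necessarily what the lecture intends):

```python
import numpy as np

# Toy user x movie rating matrix; np.nan marks missing ratings (hypothetical data)
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, 5.0, np.nan],
    [np.nan, 1.0, 4.0, 5.0],
])

# Naively fill missing entries with each user's mean rating
filled = np.where(np.isnan(R), np.nanmean(R, axis=1, keepdims=True), R)

# Rank-k approximation of the filled matrix
k = 2
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Predicted rating of user 0 for movie 2 (an entry that was missing)
print(R_k[0, 2])
```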

SVD and PCA. PCA is a special case of SVD, applied to the centered data matrix (equivalently, the eigendecomposition of the covariance matrix).

Covariance matrix. Goal: reduce the dimensionality while preserving the information in the data. Information in the data: variability in the data. We measure variability using the covariance matrix. Sample covariance of variables X and Y: Cov(X, Y) = Σ_i (x_i − μ_X)(y_i − μ_Y). Given matrix A, remove the mean of each column from the column vectors to get the centered matrix C. The matrix V = C^T C is the covariance matrix of the row vectors of A.
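
A short NumPy illustration (my own toy data) of centering the columns and forming the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 3)) * np.array([1.0, 2.0, 0.5])  # 100 objects x 3 attributes

C = A - A.mean(axis=0)        # centered matrix: subtract each column's mean
V = C.T @ C / (len(A) - 1)    # sample covariance matrix of the rows of A

print(np.allclose(V, np.cov(A, rowvar=False)))  # True
```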

PCA: Principal Component Analysis. We will project the rows of matrix A onto a new set of attributes (dimensions) such that: the attributes have zero covariance with each other (they are orthogonal); each attribute captures the most remaining variance in the data, while being orthogonal to the existing attributes; the first attribute should capture the most variance in the data. For the centered matrix C, the variance of the rows of C when projected onto a unit vector x is given by σ^2 = ||Cx||^2. The first right singular vector of C maximizes σ^2!

PCA. Input: 2-dimensional points. [Figure: scatter plot of the points, with the 1st and 2nd (right) singular vectors drawn as directions.] Output: 1st (right) singular vector: direction of maximal variance. 2nd (right) singular vector: direction of maximal variance, after removing the projection of the data along the first singular vector.
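
An illustrative sketch (toy 2-d data of my own) of recovering these directions by applying the SVD to the centered data matrix:

```python
import numpy as np

# Toy 2-d points with most of the variance along one diagonal direction
rng = np.random.default_rng(3)
t = rng.standard_normal(200)
points = np.column_stack([5.0 + t, 3.5 + 0.5 * t + 0.1 * rng.standard_normal(200)])

C = points - points.mean(axis=0)   # center the data
U, s, Vt = np.linalg.svd(C, full_matrices=False)

print(Vt[0])  # 1st right singular vector: direction of maximal variance
print(Vt[1])  # 2nd right singular vector: orthogonal direction of the remaining variance
print(s)      # singular values: strength of each direction
```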

Singular values. σ_1: measures how much of the data variance is explained by the first singular vector. σ_2: measures how much of the data variance is explained by the second singular vector. [Figure: the same scatter plot, with σ_1 and σ_2 indicated along the two singular directions.]

Singular values tell us something about the variance. The variance in the direction of the k-th principal component is given by the corresponding squared singular value σ_k^2. Singular values can be used to estimate how many components to keep. Rule of thumb: keep enough components to explain 85% of the variation: (Σ_{j=1}^{k} σ_j^2) / (Σ_{j=1}^{n} σ_j^2) ≥ 0.85.
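
A small sketch of this rule of thumb (the helper function and example values are mine):

```python
import numpy as np

def choose_k(singular_values, threshold=0.85):
    """Smallest k whose top-k squared singular values explain `threshold` of the variance."""
    energy = np.cumsum(singular_values**2) / np.sum(singular_values**2)
    return int(np.searchsorted(energy, threshold) + 1)

s = np.array([10.0, 6.0, 2.0, 1.0, 0.5])
print(choose_k(s))  # smallest k whose explained-variance ratio reaches 0.85 (here 2)
```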

Example. Students-drugs matrix A, with a_ij = usage of student i of drug j; the drugs (columns) are split into legal and illegal. A = UΣV^T. [Figure: the data plotted along Drug 1 and Drug 2, with the two right singular vectors drawn.] First right singular vector v_1: more or less the same weight to all drugs; discriminates heavy from light users. Second right singular vector v_2: positive values for legal drugs, negative values for illegal drugs.

Another property of PCA/SVD. The chosen vectors minimize the sum of squared differences between the data vectors and their low-dimensional projections. [Figure: the same scatter plot, showing the points and their projections onto the 1st (right) singular vector.]

Application. Latent Semantic Indexing (LSI): apply PCA/SVD on the document-term matrix, and index the k-dimensional document vectors. When a query comes, project it onto the k-dimensional space and compute cosine similarities in this space. Principal components capture the main topics and enrich the document representation.
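
A minimal LSI-style sketch (toy term counts and query; folding the query in via the top-k right singular vectors is one common convention, used here as an assumption):

```python
import numpy as np

# Toy document-term count matrix (4 documents x 5 terms; illustrative values)
A = np.array([
    [2, 3, 0, 0, 1],
    [1, 2, 0, 1, 0],
    [0, 0, 3, 2, 0],
    [0, 1, 2, 3, 0],
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k, :].T            # top-k right singular vectors (term space)
docs_k = A @ Vk             # k-dimensional document representations

# Fold a query (a term vector) into the same k-dimensional space
q = np.array([1, 1, 0, 0, 0], dtype=float)
q_k = q @ Vk

# Cosine similarity between the query and each document in the latent space
sims = (docs_k @ q_k) / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(sims)
```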

"SVD is the Rolls-Royce and the Swiss Army Knife of Numerical Linear Algebra."* *Dianne O'Leary, MMDS '06