CS60021: Scalable Data Mining. Dimensionality Reduction

CS60021: Scalable Data Mining. Dimensionality Reduction. Sourangshu Bhattacharya. (Slides based on J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

Assumption: the data lies on or near a low d-dimensional subspace. The axes of this subspace are an effective representation of the data.

Compress / reduce dimensionality: 10^6 rows, 10^3 columns, no updates; random access to any cell(s) with small error is OK. The example matrix on the slide is really 2-dimensional: all of its rows can be reconstructed by scaling [1 1 1 0 0] or [0 0 0 1 1].

Q: What is the rank of a matrix A? A: The number of linearly independent columns of A. For example, the matrix A with rows [1 2 1], [-2 -3 1], [3 5 0] has rank r = 2. Why? The first two rows are linearly independent, so the rank is at least 2, but all three rows are linearly dependent (the first is equal to the sum of the second and third), so the rank must be less than 3. Why do we care about low rank? We can write A in terms of two basis vectors, [1 2 1] and [-2 -3 1], and the new coordinates of the three rows in that basis are [1 0], [0 1], [1 -1].

Cloud of points in 3D space: think of the point positions as a matrix, with one row per point: A, B, C, ... We can rewrite the coordinates more efficiently! Old basis vectors: [1 0 0], [0 1 0], [0 0 1]. New basis vectors: [1 2 1], [-2 -3 1]. Then A has new coordinates [1 0], B has [0 1], and C has [1 -1]. Notice: we reduced the number of coordinates per point!
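A minimal numpy sketch of this rank and change-of-basis example, using the matrix values as reconstructed above (so treat the exact numbers as illustrative):

import numpy as np

A = np.array([[ 1,  2, 1],
              [-2, -3, 1],
              [ 3,  5, 0]], dtype=float)
print(np.linalg.matrix_rank(A))          # 2: only two linearly independent rows

# Express every row of A in the basis {[1 2 1], [-2 -3 1]}:
B = np.array([[1, 2, 1], [-2, -3, 1]], dtype=float)
coords, *_ = np.linalg.lstsq(B.T, A.T, rcond=None)   # solve B^T x = A^T
print(coords.T)                          # rows: [1 0], [0 1], [1 -1]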

The goal of dimensionality reduction is to discover the true axes of the data! Rather than representing every point with 2 coordinates, we represent each point with 1 coordinate, namely its position along the line that best fits the points. By doing this we incur a small error, because the points do not lie exactly on the line.

Why reduce dimensions? To discover hidden correlations and topics (e.g., words that commonly occur together); to remove redundant and noisy features (not all words are useful); for easier interpretation and visualization; and for easier storage and processing of the data.

A [m x n] = U [m x r] S [r x r] (V [n x r])^T. A: input data matrix, m x n (e.g., m documents, n terms). U: left singular vectors, m x r (m documents, r concepts). S: singular values, an r x r diagonal matrix giving the strength of each "concept" (r is the rank of A). V: right singular vectors, n x r (n terms, r concepts).
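A quick numpy sketch (on a random stand-in matrix, not data from these slides) of computing a reduced SVD and checking the shapes and the properties described on the next slides:

import numpy as np

m, n = 6, 4
A = np.random.rand(m, n)

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # reduced SVD
print(U.shape, s.shape, Vt.shape)                   # (6, 4) (4,) (4, 4)

# U and V are column-orthonormal, singular values are non-negative and sorted:
assert np.allclose(U.T @ U, np.eye(U.shape[1]))
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))
assert np.all(s >= 0) and np.all(np.diff(s) <= 1e-12)

# A is recovered exactly from its factors:
assert np.allclose(A, U @ np.diag(s) @ Vt)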

Diagram: A (m x n) is approximated by the product U (m x r) · S (r x r) · V^T (r x n).

Equivalently, A is a weighted sum of rank-1 pieces: A ≈ σ1 u1 v1^T + σ2 u2 v2^T + ..., where each σi is a scalar and each ui and vi is a vector.

It is always possible to decompose a real matrix A into A = U S V^T, where: U, S, V are unique; U and V are column-orthonormal, i.e., U^T U = I and V^T V = I (their columns are orthogonal unit vectors); and S is diagonal, with non-negative singular values sorted in decreasing order (σ1 ≥ σ2 ≥ ... ≥ 0). A nice proof of uniqueness: http://www.mpi-inf.mpg.de/~bast/ir-seminar-ws04/lecture2.pdf

A = U S V^T example: users to movies. Rows are users and columns are the movies Matrix, Alien, Serenity, Casablanca, Amelie; the first three movies are SciFi, the last two Romance:

A =
1 1 1 0 0
3 3 3 0 0
4 4 4 0 0
5 5 5 0 0
0 2 0 4 4
0 0 0 5 5
0 1 0 2 2

Its SVD A = U S V^T is

U =
0.13 0.02 -0.01
0.41 0.07 -0.03
0.55 0.09 -0.04
0.68 0.11 -0.05
0.15 -0.59 0.65
0.07 -0.73 -0.67
0.07 -0.29 0.32

S =
12.4 0 0
0 9.5 0
0 0 1.3

V^T =
0.56 0.59 0.56 0.09 0.09
0.12 -0.02 0.12 -0.69 -0.69
0.40 -0.80 0.40 0.09 0.09

The columns of U and V correspond to "concepts", also known as latent dimensions or latent factors: here a SciFi concept (first column) and a Romance concept (second column).

U is the user-to-concept similarity matrix: its first column measures how strongly each user aligns with the SciFi concept, the second with the Romance concept.

The diagonal entries of S give the strength of each concept: 12.4 for the SciFi concept, 9.5 for the Romance concept, and only 1.3 for a weak third concept.

V is the movie-to-concept similarity matrix: the first row of V^T shows that Matrix, Alien, and Serenity load on the SciFi concept, while the second row shows that Casablanca and Amelie load on the Romance concept.
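A numpy sketch reproducing this example (the signs of the computed singular vectors may differ from the values printed on the slide, since signs in an SVD are not unique):

import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 1))                 # approximately 12.4, 9.5, 1.3 plus two near-zero values

# U S: coordinates of every user along the concept axes
# (first column = projection on the SciFi concept, second = Romance concept)
print(np.round(U[:, :2] * s[:2], 2))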

So SVD relates movies, users, and concepts: U is the user-to-concept similarity matrix, V is the movie-to-concept similarity matrix, and the diagonal elements of S give the strength of each concept.

Consider points in the plane of (Movie 1 rating, Movie 2 rating). Instead of using two coordinates (x, y) to describe each point's location, let's use only one coordinate z: a point's position is its location along the first right singular vector v1. How do we choose v1? By minimizing the reconstruction error.

Goal: minimize the sum of reconstruction errors Σ_i Σ_j (x_ij - z_ij)^2, where the x_ij are the old coordinates and the z_ij are the new (reconstructed) coordinates. SVD gives the best axis to project on, where "best" means minimizing this reconstruction error; in other words, v1 achieves the minimum reconstruction error.

In the example, V is the movie-to-concept matrix, U is the user-to-concept matrix, and v1, the first right singular vector, is the direction in rating space along which the data has the largest variance ("spread").

U S gives the coordinates of the points (users) along the projection axes. The projection of the users on the SciFi axis is the first column of

U S =
1.61 0.19 -0.01
5.08 0.66 -0.03
6.82 0.85 -0.05
8.43 1.04 -0.06
1.86 -5.60 0.84
0.86 -6.93 -0.87
0.86 -2.75 0.41

More details. Q: How exactly is the dimensionality reduction done? A: Set the smallest singular values to zero. In the example, we zero out σ3 = 1.3 and drop the corresponding third column of U and third row of V^T, keeping only the rank-2 factors:

U' =
0.13 0.02
0.41 0.07
0.55 0.09
0.68 0.11
0.15 -0.59
0.07 -0.73
0.07 -0.29

S' =
12.4 0
0 9.5

V'^T =
0.56 0.59 0.56 0.09 0.09
0.12 -0.02 0.12 -0.69 -0.69

The resulting approximation B = U' S' V'^T is

0.92 0.95 0.92 0.01 0.01
2.91 3.01 2.91 -0.01 -0.01
3.90 4.04 3.90 0.01 0.01
4.82 5.00 4.82 0.03 0.03
0.70 0.53 0.70 4.11 4.11
-0.69 1.34 -0.69 4.78 4.78
0.32 0.23 0.32 2.01 2.01

which is close to A: with the Frobenius norm ǁMǁ_F = sqrt(Σ_ij M_ij^2), the error ǁA - Bǁ_F = sqrt(Σ_ij (A_ij - B_ij)^2) is small.
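A short numpy sketch of this reduction step, reusing the ratings matrix from the example:

import numpy as np

A = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0], [0, 2, 0, 4, 4], [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
B = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-2 reconstruction (sigma_3 set to 0)
err = np.linalg.norm(A - B, 'fro')            # Frobenius-norm error ||A - B||_F
print(np.round(B, 2))
print(err, np.sqrt(np.sum(s[k:] ** 2)))       # the two numbers agree (see the theorem below)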

In other words: if A = U Σ V^T, then B = U Σ' V^T, with Σ' obtained from Σ by zeroing the smallest singular values, is the best approximation of A among matrices of that lower rank.

Theorem: Let A = U S V^T and B = U S' V^T, where S' is the r x r diagonal matrix with s_i = σ_i for i = 1..k and s_i = 0 otherwise. Then B is a best rank-k approximation to A. What do we mean by "best"? B is a solution to min_B ǁA - Bǁ_F subject to rank(B) = k, where ǁA - Bǁ_F = sqrt(Σ_ij (A_ij - B_ij)^2), and the minimum value achieved is sqrt(σ_{k+1}^2 + ... + σ_r^2).

Details! Theorem: Let A = U S V^T with σ1 ≥ σ2 ≥ ... and rank(A) = r. Then B = U S' V^T, where S' is the r x r diagonal matrix with s_i = σ_i for i = 1..k and s_i = 0 otherwise, is a best rank-k approximation to A: B is a solution to min_B ǁA - Bǁ_F where rank(B) = k.

We will need two facts: (1) ǁMǁ_F^2 = Σ_i q_ii^2, where M = P Q R is the SVD of M (P column-orthonormal, R row-orthonormal, Q diagonal); (2) U S V^T - U S' V^T = U (S - S') V^T.

Why? Applying both facts, min_{B: rank(B)=k} ǁA - Bǁ_F = min_{S'} ǁS - S'ǁ_F = min_{s_i} sqrt(Σ_i (σ_i - s_i)^2). We want to choose the s_i (at most k of them non-zero) to minimize Σ_i (σ_i - s_i)^2 = Σ_{i=1..k} (σ_i - s_i)^2 + Σ_{i=k+1..r} σ_i^2. The solution is to set s_i = σ_i for i = 1..k and the remaining s_i = 0, which leaves the error sqrt(Σ_{i=k+1..r} σ_i^2).

Equivalent view: the spectral decomposition of the matrix. A = σ1 u1 v1^T + σ2 u2 v2^T + ..., where each u_i is an m x 1 vector and each v_i^T is a 1 x n vector; keeping k terms gives a rank-k approximation. Assume σ1 ≥ σ2 ≥ σ3 ≥ ... ≥ 0. Why is setting the small σ_i to 0 the right thing to do? The vectors u_i and v_i have unit length, so it is σ_i that scales each term; zeroing the smallest σ_i therefore introduces the least error.

Q: How many σ's should we keep? A: Rule of thumb: keep enough singular values to retain 80-90% of the "energy" Σ_i σ_i^2.
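A small helper sketch for this rule of thumb: choose the smallest k whose leading singular values retain a target fraction (here 90%) of the energy:

import numpy as np

def choose_k(singular_values, energy=0.9):
    sq = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(sq) / np.sum(sq)
    return int(np.searchsorted(cumulative, energy) + 1)

print(choose_k([12.4, 9.5, 1.3]))   # 2: the first two concepts already hold >90% of the energy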

To compute the SVD: O(nm^2) or O(n^2 m), whichever is less. But it takes less work if we only want the singular values, or only the first k singular vectors, or if the matrix is sparse. SVD is implemented in linear algebra packages like LINPACK, Matlab, SPlus, Mathematica, ...

Summary so far. SVD: A = U S V^T is unique; U gives user-to-concept similarities, V gives movie-to-concept similarities, and S gives the strength of each concept. Dimensionality reduction: keep only the few largest singular values (80-90% of the "energy"). SVD picks up linear correlations.

SVD gives us A = U S V^T. Compare this with the eigen-decomposition A = X Λ X^T, which applies when A is symmetric; U, V, X are orthonormal (U^T U = I) and Λ, S are diagonal. Now let's calculate: A A^T = U S V^T (U S V^T)^T = U S V^T (V S^T U^T) = U S S^T U^T, and A^T A = V S^T U^T (U S V^T) = V S S^T V^T. These have exactly the eigen-decomposition form X Λ X^T: the eigenvectors of A A^T are the left singular vectors, the eigenvectors of A^T A are the right singular vectors, and the eigenvalues in both cases are the squared singular values. This shows how to compute the SVD using an eigenvalue decomposition!

So A A^T = U S^2 U^T and A^T A = V S^2 V^T, and therefore (A^T A)^k = V S^{2k} V^T. E.g.: (A^T A)^2 = V S^2 V^T V S^2 V^T = V S^4 V^T. For k >> 1, (A^T A)^k ≈ σ1^{2k} v1 v1^T.
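A numpy sketch of this connection (for a random matrix with distinct singular values): the eigenvalues of A^T A are the squared singular values, and its eigenvectors match the right singular vectors up to sign:

import numpy as np

A = np.random.rand(7, 5)

evals, evecs = np.linalg.eigh(A.T @ A)        # ascending eigenvalues of A^T A
evals, evecs = evals[::-1], evecs[:, ::-1]    # reorder to descending

U, s, Vt = np.linalg.svd(A)
print(np.allclose(np.sqrt(np.clip(evals, 0, None)), s))   # True: eigenvalues = sigma^2
print(np.allclose(np.abs(evecs.T), np.abs(Vt)))           # True: eigenvectors = +/- v_i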

Q: Find users that like "Matrix". A: Map the query into the "concept space". How? (We use the users-to-movies decomposition A = U S V^T from the example above.)

Take the query vector q = [5 0 0 0 0] over the movies (Matrix, Alien, Serenity, Casablanca, Amelie): a user who rated only "Matrix". Project q into concept space by taking its inner product with each concept vector v_i; for example, the coordinate along the first concept is q · v1.

Compactly, we have q_concept = q V, where V is the movie-to-concept similarity matrix. E.g., for q = [5 0 0 0 0] and the first two concept columns of V, [0.56 0.12; 0.59 -0.02; 0.56 0.12; 0.09 -0.69; 0.09 -0.69], we get q_concept = [2.8 0.6]: the query aligns strongly with the SciFi concept.

How would a user d that rated ("Alien", "Serenity") be handled? The same way: d_concept = d V. E.g., for d = [0 4 5 0 0], d V ≈ [5.2 0.4].

Observation: user d, who rated ("Alien", "Serenity"), turns out to be similar to user q, who rated ("Matrix"), although d and q have zero ratings in common! In concept space, d maps to [5.2 0.4] and q to [2.8 0.6], so their similarity is clearly non-zero despite the lack of common ratings.
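A numpy sketch of this query mapping (the computed concept coordinates match the slide's values up to sign, and the similarity is high either way):

import numpy as np

A = np.array([[1, 1, 1, 0, 0], [3, 3, 3, 0, 0], [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0], [0, 2, 0, 4, 4], [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)
_, _, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:2, :].T                        # movie-to-concept matrix, 5 x 2

q = np.array([5, 0, 0, 0, 0.])         # rated only 'Matrix'
d = np.array([0, 4, 5, 0, 0.])         # rated only 'Alien' and 'Serenity'

q_c, d_c = q @ V, d @ V                # concept coordinates, ~[2.8 0.6] and ~[5.2 0.4]
cos = q_c @ d_c / (np.linalg.norm(q_c) * np.linalg.norm(d_c))
print(q_c, d_c, cos)                   # high cosine similarity despite zero common ratings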

SVD: pros and cons. + Optimal low-rank approximation in terms of the Frobenius norm. - Interpretability problem: a singular vector specifies a linear combination of all input columns or rows. - Lack of sparsity: singular vectors are dense, even when A is sparse.

CUR decomposition. Frobenius norm: ǁXǁ_F = sqrt(Σ_ij X_ij^2). Goal: express A as a product of matrices C, U, R such that ǁA - C U Rǁ_F is small. Constraints on C and R: C consists of actual columns of A and R of actual rows of A, while U is the pseudo-inverse of the "intersection" of C and R.

Let A_k be the best rank-k approximation to A (that is, A_k comes from the SVD of A). Theorem [Drineas et al.]: CUR, computed in O(m n) time, achieves ǁA - C U Rǁ_F ≤ ǁA - A_kǁ_F + ε ǁAǁ_F with probability at least 1 - δ, by picking O(k log(1/δ)/ε^2) columns and O(k^2 log^3(1/δ)/ε^6) rows. In practice: pick 4k columns/rows.

Sampling columns (and similarly for rows). Note that this is a randomized algorithm: the same column can be sampled more than once.
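The slide's sampling procedure itself did not survive transcription; below is a hedged sketch of the standard scheme used in this line of work (Drineas et al.): sample columns with replacement, with probability proportional to their squared norm, and rescale each sampled column.

import numpy as np

def sample_columns(A, c, seed=0):
    # Sample c columns of A with probability proportional to squared column norm.
    rng = np.random.default_rng(seed)
    col_norms_sq = np.sum(A ** 2, axis=0)
    probs = col_norms_sq / col_norms_sq.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=probs)   # duplicates possible
    C = A[:, idx] / np.sqrt(c * probs[idx])                       # rescale sampled columns
    return C, idx

A = np.random.rand(100, 20)
C, idx = sample_columns(A, c=8)
print(idx, C.shape)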

The construction then proceeds through several steps: removing duplicate samples, approximate multiplication, the final algorithm, and an argument for why it works.

Computing U. Let W be the "intersection" of the sampled columns C and rows R, and let the SVD of W be W = X Z Y^T. Then U = W^+ = Y Z^+ X^T, where Z^+ holds the reciprocals of the non-zero singular values (Z^+_ii = 1/Z_ii); W^+ is the pseudoinverse of W. So A ≈ C U R with U = W^+. Why does the pseudoinverse work? If W = X Z Y^T, then W^{-1} = (X Z Y^T)^{-1} = Y Z^{-1} X^T, because X^{-1} = X^T and (Y^T)^{-1} = Y by orthonormality, and Z^{-1}_ii = 1/Z_ii since Z is diagonal. Thus, if W is nonsingular, the pseudoinverse is the true inverse.
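A numpy sketch of this step (the sampled row/column indices here are arbitrary placeholders, just to show the mechanics of W and its pseudoinverse):

import numpy as np

rng = np.random.default_rng(1)
A = rng.random((50, 30))
col_idx = rng.choice(30, size=8, replace=False)    # sampled columns (illustrative)
row_idx = rng.choice(50, size=8, replace=False)    # sampled rows (illustrative)

C = A[:, col_idx]
R = A[row_idx, :]
W = A[np.ix_(row_idx, col_idx)]                    # intersection of C and R

X, z, Yt = np.linalg.svd(W, full_matrices=False)
z_plus = np.where(z > 1e-10, 1.0 / z, 0.0)         # reciprocals of non-zero singular values
U = Yt.T @ np.diag(z_plus) @ X.T                   # W^+ (same as np.linalg.pinv(W))
print(np.allclose(U, np.linalg.pinv(W)))           # True
A_cur = C @ U @ R                                  # the CUR approximation of A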


Putting it together: the CUR error is at most the SVD error plus a small additive term, ǁA - C U Rǁ_F ≤ ǁA - A_kǁ_F + ε ǁAǁ_F (the theorem above). In practice: pick 4k columns/rows for a rank-k approximation.

Given the matrix A, compute leverage scores for its columns and rows; the algorithm then samples columns and rows according to these scores.
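The slide's definition did not come through in the transcription; as a hedged illustration, one common choice (used in relative-error CUR algorithms) defines the leverage score of column j from the top-k right singular vectors, normalized so the scores sum to 1:

import numpy as np

def column_leverage_scores(A, k):
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return np.sum(Vt[:k, :] ** 2, axis=0) / k      # one score per column; scores sum to 1

A = np.random.rand(40, 10)
scores = column_leverage_scores(A, k=3)
print(scores, scores.sum())                        # columns would be sampled with these probabilities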


CUR: pros and cons. + Easy interpretation, since the basis vectors are actual columns and rows of A. + Sparse basis: an actual column is sparse whenever A is, unlike a dense singular vector. - Duplicate columns and rows: columns with large norms will be sampled many times.

If we want to get rid of the duplicates: throw them away, and scale (multiply) the remaining columns/rows by the square root of the number of duplicates. This replaces the duplicated C_d and R_d with smaller matrices C_s and R_s, from which we construct a small U.

Storage comparison. SVD: A = U S V^T, where A is huge but sparse, U is big and dense, S is dense but small, and V^T is big and dense. CUR: A = C U R, where A is huge but sparse, C is big but sparse (actual columns), U is dense but small, and R is big but sparse (actual rows).

Case study: DBLP bibliographic data. An author-to-conference big sparse matrix: A_ij = number of papers published by author i at conference j; 428K authors (rows), 3659 conferences (columns); very sparse. We want to reduce dimensionality. How much time does it take? What is the reconstruction error? How much space do we need?

Results (plots comparing SVD, CUR, and CUR with no duplicates) report accuracy (1 - relative sum of squared errors), space ratio (number of output matrix entries / number of input matrix entries), and CPU time. Sun, Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM '07.

SVD is limited to linear projections: it finds a lower-dimensional linear projection that preserves Euclidean distances. Non-linear methods such as Isomap instead assume the data lies on a non-linear low-dimensional curve, a.k.a. a manifold, and use distances as measured along the manifold. How? Build an adjacency graph over the data points, approximate geodesic distance by graph distance, and then apply SVD/PCA to the matrix of pairwise graph distances.
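A hedged sketch of that recipe (assumes scipy and scikit-learn are available; the k-NN graph construction and the classical-MDS step stand in for whatever exact variant the slide had in mind):

import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap_embed(X, n_neighbors=10, d=2):
    G = kneighbors_graph(X, n_neighbors, mode='distance')   # adjacency graph
    D = shortest_path(G, directed=False)                     # geodesic (graph) distances
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                              # double centering (classical MDS)
    evals, evecs = np.linalg.eigh(B)
    top = np.argsort(evals)[::-1][:d]
    return evecs[:, top] * np.sqrt(np.maximum(evals[top], 0))

t = np.linspace(0, 3 * np.pi, 200)                           # a 1-D curve embedded in 3-D
X = np.column_stack([np.cos(t), np.sin(t), t])
print(isomap_embed(X).shape)                                 # (200, 2)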

References:
Drineas et al., Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition, SIAM Journal on Computing, 2006.
J. Sun, Y. Xie, H. Zhang, C. Faloutsos, Less is More: Compact Matrix Decomposition for Large Sparse Graphs, SDM 2007.
P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, S. Gu, K. K. Kidd, and P. Drineas, Intra- and interpopulation genotype reconstruction from tagging SNPs, Genome Research, 17(1), 96-107, 2007.
M. W. Mahoney, M. Maggioni, and P. Drineas, Tensor-CUR Decompositions for Tensor-Based Data, Proc. 12th Annual SIGKDD, 327-336, 2006.