
Data Mining and Analysis: Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA)
Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil)

Chapter 7: Dimensionality Reduction

Dimensionality Reduction
The goal of dimensionality reduction is to find a lower-dimensional representation of the data matrix D to avoid the curse of dimensionality. Given an n × d data matrix, each point x_i = (x_{i1}, x_{i2}, ..., x_{id})^T is a vector in the ambient d-dimensional vector space spanned by the d standard basis vectors e_1, e_2, ..., e_d. Given any other set of d orthonormal vectors u_1, u_2, ..., u_d, we can re-express each point x as x = a_1 u_1 + a_2 u_2 + ... + a_d u_d, where a = (a_1, a_2, ..., a_d)^T represents the coordinates of x in the new basis. More compactly, x = Ua, where U is the d × d orthogonal matrix whose ith column comprises the ith basis vector u_i. Thus U^{-1} = U^T, and we have a = U^T x.
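
As a quick illustration of this change of basis, the following numpy sketch (not from the book; the random orthonormal basis is purely hypothetical) verifies that a = U^T x recovers x via x = Ua.

import numpy as np

rng = np.random.default_rng(0)
d = 4
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # columns u_1,...,u_d form an orthonormal basis
x = rng.standard_normal(d)                         # a point expressed in the standard basis

a = U.T @ x                                        # coordinates of x in the new basis
assert np.allclose(U @ a, x)                       # x = Ua recovers the point
assert np.allclose(U.T @ U, np.eye(d))             # U^{-1} = U^T since U is orthogonal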

Optimal Basis: Projection in Lower-Dimensional Space
There are potentially infinite choices for the orthonormal basis vectors. Our goal is to choose an optimal basis that preserves essential information about D. We are interested in finding the optimal r-dimensional representation of D, with r ≪ d. The projection of x onto the first r basis vectors is given as x' = a_1 u_1 + a_2 u_2 + ... + a_r u_r = Σ_{i=1}^r a_i u_i = U_r a_r, where U_r and a_r comprise the r basis vectors and coordinates, respectively. Also, restricting a = U^T x to the first r terms, we have a_r = U_r^T x. The r-dimensional projection of x is thus given as x' = U_r U_r^T x = P_r x, where P_r = U_r U_r^T = Σ_{i=1}^r u_i u_i^T is the orthogonal projection matrix for the subspace spanned by the first r basis vectors.

Optimal Basis: Error Vector
Given the projected vector x' = P_r x, the corresponding error vector is the projection onto the remaining d − r basis vectors: ε = Σ_{i=r+1}^d a_i u_i = x − x'. The error vector ε is orthogonal to x'. The goal of dimensionality reduction is to seek an r-dimensional basis that gives the best possible approximation x_i' over all the points x_i ∈ D. Alternatively, we seek to minimize the error ||ε_i|| = ||x_i − x_i'|| over all the points.
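
A minimal numpy sketch (hypothetical basis and point, not from the book) of the projection x' = P_r x and of the orthogonality between the error vector and the projection:

import numpy as np

rng = np.random.default_rng(1)
d, r = 4, 2
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthonormal basis u_1,...,u_d
x = rng.standard_normal(d)

U_r = U[:, :r]                                     # first r basis vectors
P_r = U_r @ U_r.T                                  # orthogonal projection matrix
x_proj = P_r @ x                                   # projection x' = P_r x
eps = x - x_proj                                   # error vector
assert np.isclose(eps @ x_proj, 0.0)               # the error is orthogonal to the projection
assert np.allclose(P_r @ P_r, P_r)                 # P_r is idempotent, as a projection must be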

Iris Data: Optimal One-dimensional Basis
[Figure: the Iris data in three dimensions (X1, X2, X3), together with the optimal one-dimensional basis vector u1.]

Iris Data: Optimal 2D Basis
[Figure: the Iris data (3D) together with the optimal two-dimensional basis spanned by u1 and u2.]

Principal Component Analysis
Principal Component Analysis (PCA) is a technique that seeks an r-dimensional basis that best captures the variance in the data. The direction with the largest projected variance is called the first principal component. The orthogonal direction that captures the second largest projected variance is called the second principal component, and so on. The direction that maximizes the variance is also the one that minimizes the mean squared error.

Principal Component: Direction of Most Variance
We seek the unit vector u that maximizes the projected variance of the points. Let D be centered, and let Σ be its covariance matrix. The projection of x_i on u is given as x_i' = (u^T x_i / u^T u) u = (u^T x_i) u = a_i u. Across all the points, the projected variance along u is σ_u^2 = (1/n) Σ_{i=1}^n (a_i − μ_u)^2 = (1/n) Σ_{i=1}^n u^T x_i x_i^T u = u^T ((1/n) Σ_{i=1}^n x_i x_i^T) u = u^T Σ u, where μ_u = 0 because D is centered. We have to find the optimal basis vector u that maximizes the projected variance σ_u^2 = u^T Σ u, subject to the constraint that u^T u = 1. The maximization objective is given as max_u J(u) = u^T Σ u − α(u^T u − 1).

Principal Component: Direction of Most Variance
Given the objective max_u J(u) = u^T Σ u − α(u^T u − 1), we solve it by setting the derivative of J(u) with respect to u to the zero vector, to obtain ∂/∂u (u^T Σ u − α(u^T u − 1)) = 0, that is, 2Σu − 2αu = 0, which implies Σu = αu. Thus α is an eigenvalue of the covariance matrix Σ, with the associated eigenvector u. Taking the dot product with u on both sides, we have σ_u^2 = u^T Σ u = u^T αu = α u^T u = α. To maximize the projected variance σ_u^2, we thus choose the largest eigenvalue λ_1 of Σ; the dominant eigenvector u_1 specifies the direction of most variance, also called the first principal component.
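
The following sketch (hypothetical random data, numpy only) checks numerically that the dominant eigenvector of Σ attains the largest projected variance u^T Σ u among unit vectors, and that this variance equals λ_1.

import numpy as np

rng = np.random.default_rng(2)
Z = rng.standard_normal((200, 3)) @ np.diag([3.0, 1.0, 0.3])   # toy data
Z -= Z.mean(axis=0)                                            # center it
Sigma = Z.T @ Z / len(Z)                                       # covariance matrix

evals, evecs = np.linalg.eigh(Sigma)        # eigh returns ascending eigenvalues for symmetric Sigma
u1, lambda1 = evecs[:, -1], evals[-1]       # dominant eigenpair
assert np.isclose(u1 @ Sigma @ u1, lambda1) # projected variance along u1 equals lambda_1

for _ in range(1000):                       # no random unit direction beats u1
    u = rng.standard_normal(3)
    u /= np.linalg.norm(u)
    assert u @ Sigma @ u <= lambda1 + 1e-12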

Iris Data: First Principal Component
[Figure: the Iris data (X1, X2, X3) with the first principal component direction u1.]

Minimum Squared Error Approach
The direction that maximizes the projected variance is also the one that minimizes the average squared error. The mean squared error (MSE) optimization condition is MSE(u) = (1/n) Σ_{i=1}^n ||ε_i||^2 = (1/n) Σ_{i=1}^n ||x_i − x_i'||^2 = (1/n) Σ_{i=1}^n ||x_i||^2 − u^T Σ u. Since the first term is fixed for a dataset D, we see that the direction u_1 that maximizes the variance is also the one that minimizes the MSE. Further, we have (1/n) Σ_{i=1}^n ||x_i||^2 = var(D) = tr(Σ) = Σ_{i=1}^d σ_i^2. Thus, for the direction u_1 that minimizes the MSE, we have MSE(u_1) = var(D) − u_1^T Σ u_1 = var(D) − λ_1.
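
A short numerical check (toy centered data, not from the book) of the identity MSE(u) = var(D) − u^T Σ u, which is why maximizing the projected variance and minimizing the MSE select the same direction:

import numpy as np

rng = np.random.default_rng(3)
Z = rng.standard_normal((300, 3)) @ np.diag([2.0, 1.0, 0.5])
Z -= Z.mean(axis=0)                                   # center the data
Sigma = Z.T @ Z / len(Z)
var_D = np.trace(Sigma)                               # var(D) = tr(Sigma)

u = rng.standard_normal(3)
u /= np.linalg.norm(u)                                # any unit direction
proj = np.outer(u, u)                                 # projection onto span{u}
mse = np.mean(np.sum((Z - Z @ proj) ** 2, axis=1))    # mean squared error of the projection
assert np.isclose(mse, var_D - u @ Sigma @ u)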

Best 2-dimensional Approximation
The best 2D subspace that captures the most variance in D comprises the eigenvectors u_1 and u_2 corresponding to the largest and second largest eigenvalues λ_1 and λ_2, respectively. Let U_2 = (u_1 u_2) be the matrix whose columns correspond to the two principal components. Given a point x_i ∈ D, its projected coordinates are computed as a_i = U_2^T x_i. Let A denote the projected 2D dataset. The total projected variance for A is given as var(A) = u_1^T Σ u_1 + u_2^T Σ u_2 = u_1^T λ_1 u_1 + u_2^T λ_2 u_2 = λ_1 + λ_2. The first two principal components also minimize the mean squared error objective, since MSE = (1/n) Σ_{i=1}^n ||x_i − x_i'||^2 = var(D) − (1/n) Σ_{i=1}^n x_i^T P_2 x_i = var(D) − var(A).

Optimal and Non-optimal 2D Approximations
The optimal subspace maximizes the variance and minimizes the squared error, whereas the nonoptimal subspace captures less variance and has a high mean squared error value, as seen from the lengths of the error vectors (line segments).
[Figure: the Iris data (X1, X2, X3) shown with the optimal 2D subspace spanned by u1 and u2, and with a nonoptimal 2D subspace for comparison.]

Best r-dimensional Approximation
To find the best r-dimensional approximation to D, we compute the eigenvalues of Σ. Because Σ is positive semidefinite, its eigenvalues are non-negative: λ_1 ≥ λ_2 ≥ ... ≥ λ_r ≥ λ_{r+1} ≥ ... ≥ λ_d ≥ 0. We select the r largest eigenvalues, and their corresponding eigenvectors, to form the best r-dimensional approximation.
Total projected variance: Let U_r = (u_1 ... u_r) be the r-dimensional basis vector matrix, with the projection matrix given as P_r = U_r U_r^T = Σ_{i=1}^r u_i u_i^T. Let A denote the dataset formed by the coordinates of the projected points in the r-dimensional subspace. The projected variance is given as var(A) = (1/n) Σ_{i=1}^n x_i^T P_r x_i = Σ_{i=1}^r u_i^T Σ u_i = Σ_{i=1}^r λ_i.
Mean squared error: The mean squared error in r dimensions is MSE = (1/n) Σ_{i=1}^n ||x_i − x_i'||^2 = var(D) − Σ_{i=1}^r λ_i = Σ_{i=1}^d λ_i − Σ_{i=1}^r λ_i.

Choosing the Dimensionality
One criterion for choosing r is to compute the fraction of the total variance captured by the first r principal components: f(r) = (λ_1 + λ_2 + ... + λ_r) / (λ_1 + λ_2 + ... + λ_d) = Σ_{i=1}^r λ_i / Σ_{i=1}^d λ_i = Σ_{i=1}^r λ_i / var(D). Given a desired variance threshold, say α, we start from the first principal component and keep adding components, stopping at the smallest value r for which f(r) ≥ α. In other words, we select the fewest dimensions such that the subspace spanned by those r dimensions captures at least an α fraction (say 0.9) of the total variance.
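
A small sketch of this rule, using the Iris eigenvalues from the example that follows; the searchsorted call returns the smallest r with f(r) ≥ α.

import numpy as np

lambdas = np.array([3.662, 0.239, 0.059])     # eigenvalues, largest first (Iris example below)
f = np.cumsum(lambdas) / lambdas.sum()        # f(r) for r = 1, ..., d
alpha = 0.95
r = int(np.searchsorted(f, alpha) + 1)        # smallest r with f(r) >= alpha
print(np.round(f, 3), r)                      # [0.925 0.985 1.   ] 2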

Principal Component Analysis: Algorithm
PCA (D, α):
1. μ = (1/n) Σ_{i=1}^n x_i // compute mean
2. Z = D − 1 · μ^T // center the data
3. Σ = (1/n) Z^T Z // compute covariance matrix
4. (λ_1, λ_2, ..., λ_d) = eigenvalues(Σ) // compute eigenvalues
5. U = (u_1 u_2 ... u_d) = eigenvectors(Σ) // compute eigenvectors
6. f(r) = Σ_{i=1}^r λ_i / Σ_{i=1}^d λ_i, for all r = 1, 2, ..., d // fraction of total variance
7. Choose smallest r so that f(r) ≥ α // choose dimensionality
8. U_r = (u_1 u_2 ... u_r) // reduced basis
9. A = {a_i | a_i = U_r^T x_i, for i = 1, ..., n} // reduced dimensionality data
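
A runnable numpy sketch of the pseudocode above (variable names mirror the slide; the projection in the last step is applied to the centered points):

import numpy as np

def pca(D, alpha):
    n, d = D.shape
    mu = D.mean(axis=0)                        # 1: compute mean
    Z = D - mu                                 # 2: center the data
    Sigma = Z.T @ Z / n                        # 3: covariance matrix
    lam, U = np.linalg.eigh(Sigma)             # 4-5: eigenvalues/eigenvectors, ascending order
    lam, U = lam[::-1], U[:, ::-1]             # reorder so that lambda_1 >= ... >= lambda_d
    f = np.cumsum(lam) / lam.sum()             # 6: fraction of total variance
    r = int(np.searchsorted(f, alpha) + 1)     # 7: smallest r with f(r) >= alpha
    U_r = U[:, :r]                             # 8: reduced basis
    A = Z @ U_r                                # 9: reduced coordinates a_i = U_r^T x_i
    return A, U_r, lam, r

Calling pca(D, 0.95) on the 150 × 3 Iris matrix used in these slides should reproduce the r = 2 result of the next slide.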

Iris Principal Components
The covariance matrix is
Σ = ( 0.681  0.039  1.265
      0.039  0.187  0.320
      1.265  0.320  3.092 )
The eigenvalues and eigenvectors of Σ are
λ_1 = 3.662, λ_2 = 0.239, λ_3 = 0.059
u_1 = (0.390, 0.089, 0.916)^T, u_2 = (0.639, 0.742, 0.200)^T, u_3 = (0.663, 0.664, 0.346)^T
The total variance is therefore λ_1 + λ_2 + λ_3 = 3.662 + 0.239 + 0.059 = 3.96. The fraction of total variance for different values of r is given as
r:    1      2      3
f(r): 0.925  0.985  1.0
Thus r = 2 principal components are needed to capture the α = 0.95 fraction of the total variance.

Iris Data: Optimal 3D PC Basis
[Figure: the Iris data (3D) with the full principal component basis u1, u2, u3.]

Iris Principal Components: Projected Data (2D)
[Figure: the Iris data projected onto the first two principal components u1 and u2.]

Geometry of PCA
Geometrically, when r = d, PCA corresponds to an orthogonal change of basis, so that the total variance is captured by the sum of the variances along each of the principal directions u_1, u_2, ..., u_d, and further, all covariances are zero. Let U be the d × d orthogonal matrix U = (u_1 u_2 ... u_d), with U^{-1} = U^T, and let Λ = diag(λ_1, ..., λ_d) be the diagonal matrix of eigenvalues. Each principal component u_i corresponds to an eigenvector of the covariance matrix Σ, that is, Σ u_i = λ_i u_i for all 1 ≤ i ≤ d, which can be written compactly in matrix notation as ΣU = UΛ, which implies Σ = UΛU^T. Thus, Λ represents the covariance matrix in the new PC basis. In the new PC basis, the equation x^T Σ^{-1} x = 1 defines a d-dimensional ellipsoid (or hyper-ellipse). The eigenvectors u_i of Σ, that is, the principal components, are the directions of the principal axes of the ellipsoid. The square roots of the eigenvalues, that is, √λ_i, give the lengths of the semi-axes.
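
A quick numerical check (hypothetical data) that Σ = UΛU^T and that the covariance matrix becomes diagonal once the data are expressed in the PC basis:

import numpy as np

rng = np.random.default_rng(4)
Z = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 3))
Z -= Z.mean(axis=0)                                    # centered data
Sigma = Z.T @ Z / len(Z)

lam, U = np.linalg.eigh(Sigma)                         # eigen-decomposition of Sigma
assert np.allclose(U @ np.diag(lam) @ U.T, Sigma)      # Sigma = U Lambda U^T

A = Z @ U                                              # data expressed in the PC basis
Sigma_pc = A.T @ A / len(A)
assert np.allclose(Sigma_pc, np.diag(lam))             # all covariances vanish in the PC basis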

Iris: Elliptic Contours in Standard Basis
[Figure: elliptic contours of the Iris data in the standard basis, with principal axes along u1, u2, u3.]

Iris: Axis-Parallel Ellipsoid in PC Basis
[Figure: the same ellipsoid becomes axis-parallel when the data are expressed in the principal component basis u1, u2, u3.]

Kernel Principal Component Analysis
Principal component analysis can be extended to find nonlinear directions in the data using kernel methods. Kernel PCA finds the directions of most variance in the feature space instead of the input space. Using the kernel trick, all PCA operations can be carried out in terms of the kernel function in input space, without having to transform the data into feature space. Let φ be a function that maps a point x_i in input space to its image φ(x_i) in feature space. Let the points in feature space be centered, and let Σ_φ be the covariance matrix. The first PC in feature space corresponds to the dominant eigenvector, Σ_φ u_1 = λ_1 u_1, where Σ_φ = (1/n) Σ_{i=1}^n φ(x_i) φ(x_i)^T.

Kernel Principal Component Analysis
It can be shown that u_1 = Σ_{i=1}^n c_i φ(x_i). That is, the PC direction in feature space is a linear combination of the transformed points. The coefficients are captured in the weight vector c = (c_1, c_2, ..., c_n)^T. Substituting into the eigen-decomposition of Σ_φ and simplifying, we get Kc = nλ_1 c = η_1 c. Thus, the weight vector c is the eigenvector corresponding to the largest eigenvalue η_1 of the kernel matrix K.

Kernel Principal Component Analysis
The weight vector c can then be used to find u_1 via u_1 = Σ_{i=1}^n c_i φ(x_i). The only constraint we impose is that u_1 should be normalized to be a unit vector, which implies ||c||^2 = 1/η_1. We cannot compute the principal direction directly, but we can project any point φ(x) onto the principal direction u_1 as follows: u_1^T φ(x) = Σ_{i=1}^n c_i φ(x_i)^T φ(x) = Σ_{i=1}^n c_i K(x_i, x), which requires only kernel operations. We can obtain the additional principal components by solving for the other eigenvalues and eigenvectors of K: Kc_j = nλ_j c_j = η_j c_j. If we sort the eigenvalues of K in decreasing order, η_1 ≥ η_2 ≥ ... ≥ η_n ≥ 0, we can obtain the jth principal component from the corresponding eigenvector c_j. The variance along the jth principal component is given as λ_j = η_j / n.

Kernel PCA Algorithm
KERNELPCA (D, K, α):
1. K = {K(x_i, x_j)}_{i,j=1,...,n} // compute n × n kernel matrix
2. K = (I − (1/n) 1_{n×n}) K (I − (1/n) 1_{n×n}) // center the kernel matrix
3. (η_1, η_2, ..., η_d) = eigenvalues(K) // compute eigenvalues
4. (c_1 c_2 ... c_n) = eigenvectors(K) // compute eigenvectors
5. λ_i = η_i / n for all i = 1, ..., n // compute variance for each component
6. c_i = √(1/η_i) · c_i for all i = 1, ..., n // ensure that u_i^T u_i = 1
7. f(r) = Σ_{i=1}^r λ_i / Σ_{i=1}^d λ_i, for all r = 1, 2, ..., d // fraction of total variance
8. Choose smallest r so that f(r) ≥ α // choose dimensionality
9. C_r = (c_1 c_2 ... c_r) // reduced basis
10. A = {a_i | a_i = C_r^T K_i, for i = 1, ..., n} // reduced dimensionality data
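
A runnable numpy sketch of this algorithm, specialized (as an assumption, for concreteness) to the homogeneous quadratic kernel K(x_i, x_j) = (x_i^T x_j)^2 used in the later figures:

import numpy as np

def kernel_pca(D, alpha):
    n = D.shape[0]
    K = (D @ D.T) ** 2                          # 1: n x n kernel matrix (quadratic kernel)
    C_mat = np.eye(n) - np.ones((n, n)) / n
    K = C_mat @ K @ C_mat                       # 2: center the kernel matrix
    eta, C = np.linalg.eigh(K)                  # 3-4: eigenvalues/eigenvectors, ascending
    eta, C = eta[::-1], C[:, ::-1]              # sort so that eta_1 >= eta_2 >= ...
    eta = np.clip(eta, 0.0, None)               # guard against tiny negative round-off
    lam = eta / n                               # 5: variance along each kernel component
    pos = eta > 1e-12
    C[:, pos] /= np.sqrt(eta[pos])              # 6: rescale so that u_i^T u_i = 1
    f = np.cumsum(lam) / lam.sum()              # 7: fraction of total variance
    r = int(np.searchsorted(f, alpha) + 1)      # 8: smallest r with f(r) >= alpha
    C_r = C[:, :r]                              # 9: reduced basis of weight vectors
    A = K @ C_r                                 # 10: projected coordinates a_i = C_r^T K_i
    return A, r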

Nonlinear Iris Data: PCA in Input Space
[Figure: the nonlinear Iris data in input space (X1, X2), with the linear principal directions u1 and u2 found by standard PCA.]

Nonlinear Iris Data: Projection onto PCs
[Figure: the nonlinear Iris data projected onto the principal components u1 and u2 found in input space.]

Kernel PCA: 3 PCs (Contours of Constant Projection)
Homogeneous quadratic kernel: K(x_i, x_j) = (x_i^T x_j)^2.
[Figure: contours of constant projection onto each of the first three kernel principal components, shown in the input space (X1, X2).]

Kernel PCA: Projected Points onto 2 PCs
Homogeneous quadratic kernel: K(x_i, x_j) = (x_i^T x_j)^2.
[Figure: the points projected onto the first two kernel principal components u1 and u2.]

Singular Value Decomposition
Principal components analysis is a special case of a more general matrix decomposition method called Singular Value Decomposition (SVD). PCA yields the following decomposition of the covariance matrix: Σ = UΛU^T, where the covariance matrix has been factorized into the orthogonal matrix U containing its eigenvectors and a diagonal matrix Λ containing its eigenvalues (sorted in decreasing order). SVD generalizes this factorization to any matrix. In particular, for an n × d data matrix D with n points and d columns, SVD factorizes D as D = LΔR^T, where L is an orthogonal n × n matrix, R is an orthogonal d × d matrix, and Δ is an n × d diagonal matrix defined as Δ(i, i) = δ_i, and 0 otherwise. The columns of L are called the left singular vectors, and the columns of R (or rows of R^T) are called the right singular vectors. The entries δ_i are called the singular values of D, and they are all non-negative.
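
A minimal numpy sketch of this factorization on a toy matrix; numpy returns the singular values as a vector, which we place on the diagonal of an n × d matrix Δ:

import numpy as np

rng = np.random.default_rng(5)
n, d = 6, 4
D = rng.standard_normal((n, d))                        # toy data matrix

L, delta, Rt = np.linalg.svd(D, full_matrices=True)    # D = L Delta R^T
k = min(n, d)
Delta = np.zeros((n, d))
Delta[:k, :k] = np.diag(delta)                         # singular values: non-negative, sorted
assert np.allclose(L @ Delta @ Rt, D)
assert np.allclose(L.T @ L, np.eye(n))                 # L is orthogonal
assert np.allclose(Rt @ Rt.T, np.eye(d))               # R is orthogonal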

Reduced SVD
If the rank of D is r ≤ min(n, d), then there are only r nonzero singular values, ordered as δ_1 ≥ δ_2 ≥ ... ≥ δ_r > 0. We can discard the left and right singular vectors that correspond to zero singular values, to obtain the reduced SVD D = L_r Δ_r R_r^T, where L_r is the n × r matrix of the left singular vectors, R_r is the d × r matrix of the right singular vectors, and Δ_r is the r × r diagonal matrix containing the positive singular values. The reduced SVD leads directly to the spectral decomposition of D, given as D = Σ_{i=1}^r δ_i l_i r_i^T. The best rank-q approximation to the original data D is the matrix D_q = Σ_{i=1}^q δ_i l_i r_i^T that minimizes the expression ||D − D_q||_F, where ||A||_F = √(Σ_{i=1}^n Σ_{j=1}^d A(i, j)^2) is called the Frobenius norm of A.
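
A sketch of the rank-q truncation on a toy matrix; by the Eckart-Young theorem the Frobenius error equals the square root of the sum of the discarded squared singular values:

import numpy as np

rng = np.random.default_rng(6)
D = rng.standard_normal((8, 5))                        # toy data matrix
L, delta, Rt = np.linalg.svd(D, full_matrices=False)   # reduced SVD

q = 2
D_q = L[:, :q] @ np.diag(delta[:q]) @ Rt[:q, :]        # D_q = sum over i <= q of delta_i l_i r_i^T
err = np.linalg.norm(D - D_q, ord='fro')               # Frobenius-norm approximation error
assert np.isclose(err, np.sqrt(np.sum(delta[q:] ** 2)))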

Connection Between SVD and PCA
Assume that D has been centered, and let D = LΔR^T via SVD. Consider the scatter matrix for D, given as D^T D. We have D^T D = (LΔR^T)^T (LΔR^T) = RΔ^T L^T LΔR^T = R(Δ^T Δ)R^T = RΔ_d^2 R^T, where Δ_d^2 is the d × d diagonal matrix defined as Δ_d^2(i, i) = δ_i^2, for i = 1, ..., d. The covariance matrix of the centered D is given as Σ = (1/n) D^T D, so we get D^T D = nΣ = nUΛU^T = U(nΛ)U^T. The right singular vectors R are therefore the same as the eigenvectors of Σ. The singular values of D are related to the eigenvalues of Σ as nλ_i = δ_i^2, which implies λ_i = δ_i^2 / n, for i = 1, ..., d. Likewise, the left singular vectors in L are the eigenvectors of the n × n matrix DD^T, and the corresponding eigenvalues are given as δ_i^2.
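
A final numerical check (hypothetical centered data) of the relation λ_i = δ_i^2 / n between the eigenvalues of the covariance matrix and the singular values of the centered data matrix:

import numpy as np

rng = np.random.default_rng(7)
D = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 4))
D -= D.mean(axis=0)                                   # center the data

Sigma = D.T @ D / len(D)                              # covariance matrix
lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]        # its eigenvalues, largest first
delta = np.linalg.svd(D, compute_uv=False)            # singular values of the centered D
assert np.allclose(lam, delta ** 2 / len(D))          # n * lambda_i = delta_i^2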