Data Mining Lecture 4: Covariance, EVD, PCA & SVD


Data Mining Lecture 4: Covariance, EVD, PCA & SVD Jo Houghton ECS Southampton February 25, 2019 1 / 28

Variance and Covariance - Expectation A random variable takes on different values due to chance. The sample values from a single dimension of a feature space can be considered to be a random variable. The expected value E[X] is the average value the random variable takes in the long run. If we assume that the values an element of a feature can take are all equally likely, then the expected value is just the mean value. 2 / 28

Variance and Covariance - Variance Variance = the expected squared difference from the mean, E[(X − E[X])^2], i.e. the mean squared difference from the mean: $\sigma^2(x) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$ A measure of how spread out the data is. 3 / 28

Variance and Covariance - Covariance Covariance = the expected product of the differences of each variable from its mean, E[(X − E[X])(Y − E[Y])], i.e. it measures how two variables change together: $\sigma(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)$ When both variables are the same, covariance = variance, as σ(x, x) = σ²(x). If σ(x, y) = 0 then the variables are uncorrelated. 4 / 28

Variance and Covariance - Covariance A covariance matrix encodes how all features vary together. For two dimensions: $\begin{bmatrix} \sigma(x,x) & \sigma(x,y) \\ \sigma(y,x) & \sigma(y,y) \end{bmatrix}$ For n dimensions: $\begin{bmatrix} \sigma(x_1,x_1) & \sigma(x_1,x_2) & \cdots & \sigma(x_1,x_n) \\ \sigma(x_2,x_1) & \sigma(x_2,x_2) & \cdots & \sigma(x_2,x_n) \\ \vdots & \vdots & \ddots & \vdots \\ \sigma(x_n,x_1) & \sigma(x_n,x_2) & \cdots & \sigma(x_n,x_n) \end{bmatrix}$ This matrix must be square and symmetric (x cannot vary with y differently to how y varies with x!). 2d covariance demo 5 / 28
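Not part of the slides: a minimal numpy sketch that builds the 2-D covariance matrix from the formula above and checks it against numpy's built-in estimator (the data and the helper name cov are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + 0.3 * rng.normal(size=500)   # y is correlated with x

def cov(a, b):
    """sigma(a, b) = 1/n * sum((a_i - mu_a) * (b_i - mu_b))"""
    return np.mean((a - a.mean()) * (b - b.mean()))

C = np.array([[cov(x, x), cov(x, y)],
              [cov(y, x), cov(y, y)]])

# np.cov with bias=True uses the same 1/n normalisation as the slides
assert np.allclose(C, np.cov(x, y, bias=True))
print(C)
```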

Variance and Covariance - Covariance Mean centering = subtract the mean of all the vectors from each vector. This gives centred data, with the mean at the origin. ipynb mean centering demo 6 / 28

Variance and Covariance - Covariance If you have a set of mean centred data with d dimensions, where each row is a data point: $Z = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & & \vdots \end{bmatrix}$ then its inner product is proportional to the covariance matrix: $C \propto Z^T Z$ 7 / 28
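A quick sketch (not from the slides) of this relationship: mean centre a data matrix whose rows are data points and verify that Z^T Z, scaled by 1/n, matches numpy's covariance estimate. Shapes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))          # 200 points, d = 3 dimensions
Z = X - X.mean(axis=0)                 # mean centring

C = Z.T @ Z / len(Z)                   # proportional to (here equal to) the covariance matrix
assert np.allclose(C, np.cov(X, rowvar=False, bias=True))
```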

Variance and Covariance - Covariance Principal axes of variation: 1st principal axis: direction of greatest variance 8 / 28

Variance and Covariance - Covariance 2nd principal axis: the direction orthogonal to the 1st principal axis with the greatest variance 9 / 28

Variance and Covariance - Basis Set In linear algebra, a basis set is defined for a space with the properties that the basis vectors are all linearly independent, they span the whole space, and every vector in the space can be described as a combination of basis vectors. Using Cartesian coordinates, we describe every vector as a combination of the x and y directions. 10 / 28

Variance and Covariance - Basis Set Eigenvectors and eigenvalues An eigenvector is a vector that, when multiplied by a matrix A, gives a vector that is a scalar multiple of itself, i.e.: Ax = λx The eigenvalue λ is that scalar multiple. "Eigen" comes from German, meaning "characteristic" or "own". For an n × n real symmetric matrix A (such as a covariance matrix) there are n eigenvector-eigenvalue pairs. 11 / 28
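Not in the slides: a quick numerical check of Ax = λx using numpy's general eigensolver; the 2 × 2 matrix is an arbitrary symmetric example.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)    # columns of eigvecs are the eigenvectors

# each eigenvector x satisfies A x = lambda x
for lam, x in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ x, lam * x)
```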

Variance and Covariance - Basis Set So if A is a covariance matrix, then the eigenvectors are its principal axes. The eigenvalues are proportional to the variance of the data along each eigenvector. The eigenvector corresponding to the largest eigenvalue is the first principal axis. 12 / 28

Variance and Covariance - Basis Set To find eigenvectors and eigenvalues of smaller matrices there are algebraic solutions, and all values can be found. For larger matrices, numerical solutions are found using eigendecomposition. Eigenvalue Decomposition (EVD): A = QΛQ^{-1}, where Q is a matrix whose columns are the eigenvectors, and Λ is a diagonal matrix with the eigenvalues along the corresponding diagonal. Covariance matrices are real and symmetric, so Q^{-1} = Q^T. Therefore: A = QΛQ^T This diagonalisation of a covariance matrix gives the principal axes and their relative magnitudes. 13 / 28
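A minimal sketch (not from the slides) of the EVD of a covariance matrix with numpy's symmetric eigensolver, reconstructing A = QΛQ^T; the random data is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(size=(100, 3))
Z -= Z.mean(axis=0)
A = Z.T @ Z / len(Z)                   # covariance matrix (real symmetric)

eigvals, Q = np.linalg.eigh(A)         # eigh: solver for symmetric matrices, ascending eigenvalues
Lam = np.diag(eigvals)
assert np.allclose(A, Q @ Lam @ Q.T)   # A = Q Lambda Q^T, since Q is orthogonal
```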

Variance and Covariance - Basis Set A = QΛQ^T This diagonalisation of a covariance matrix gives the principal axes and their relative magnitudes. Usually the implementation will order the eigenvectors such that the eigenvalues are sorted in order of decreasing value. Some solvers only find the top k eigenvalues and corresponding eigenvectors, rather than all of them. Java demo: EVD and component analysis 14 / 28

Variance and Covariance - PCA Principal Component Analysis (PCA) projects the data into a lower dimensional space, while keeping as much of the information as possible. For example, the data set $X = \begin{bmatrix} 2 & 1 \\ 3 & 2 \\ 3 & 3 \end{bmatrix}$ can be transformed so that only information from the x dimension is retained, using a projection matrix P: X_p = XP 15 / 28
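Not from the slides: a tiny sketch of this projection. The projection matrix P = [[1], [0]] is my assumption for "keep only the x dimension"; the slide does not spell it out.

```python
import numpy as np

X = np.array([[2, 1],
              [3, 2],
              [3, 3]])
P = np.array([[1],
              [0]])          # hypothetical projection: keep the x coordinate only

Xp = X @ P                   # projected data: [[2], [3], [3]]
print(Xp)
```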

Variance and Covariance - PCA However, if a different line is chosen, more information can be retained. This process can be reversed (using X̂ = X_p P^{-1}), but it is lossy if the dimensionality has been changed. 16 / 28

Variance and Covariance - PCA In PCA, a line is chosen that minimises the orthogonal distances of the data points from the projected space. It does this by keeping the dimensions where the data has the most variation, i.e. using the directions provided by the eigenvectors corresponding to the largest eigenvalues of the estimated covariance matrix. It uses the mean centred data to give the matrix proportional to the covariance matrix. (ipynb projection demo) 17 / 28

Variance and Covariance - PCA Algorithm 1: PCA algorithm using EVD Data: N data points with feature vectors X_i, i = 1...N Z = meancentre(X); eigvals, eigvects = eigendecomposition(Z^T Z); take the k eigvects corresponding to the k largest eigvals; make projection matrix P; project data X_p = ZP into the lower dimensional space; (Java PCA demo) 18 / 28
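A minimal numpy sketch (not the lecture's Java demo) of Algorithm 1; the function name pca_evd and the use of np.linalg.eigh are my choices, k is the target dimensionality.

```python
import numpy as np

def pca_evd(X, k):
    Z = X - X.mean(axis=0)                      # mean centre
    eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)  # EVD of Z^T Z (symmetric)
    order = np.argsort(eigvals)[::-1]           # sort eigenvalues, largest first
    P = eigvecs[:, order[:k]]                   # projection matrix from the top-k eigenvectors
    return Z @ P                                # project into the lower dimensional space

Xp = pca_evd(np.random.default_rng(3).normal(size=(100, 5)), k=2)
print(Xp.shape)        # (100, 2)
```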

Variance and Covariance - SVD Eigenvalue Decomposition (EVD), A = QΛQ^T, only works for symmetric matrices. Singular Value Decomposition (SVD): A = UΣV^T, where U and V are two different orthogonal matrices and Σ is a diagonal matrix. Any matrix can be factorised this way. Orthogonal matrices are matrices in which each column is a unit vector orthogonal to the others, so U^T U = UU^T = I. 19 / 28

Variance and Covariance - SVD A (m × n) = U (m × p) Σ (p × p) V^T (p × n), where p is the rank of the matrix A. U, the left singular vectors, contains the eigenvectors of AA^T; V, the right singular vectors, contains the eigenvectors of A^T A; Σ contains the square roots of the eigenvalues of AA^T and A^T A. If A is a matrix of mean centred feature vectors, V contains the principal components of the covariance matrix. 20 / 28
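Not in the slides: a small numpy check of these relationships, verifying that each right singular vector is an eigenvector of A^T A (and each left singular vector of AA^T) with eigenvalue σ²; the matrix is a random example.

```python
import numpy as np

A = np.random.default_rng(7).normal(size=(6, 4))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# each right singular vector v is an eigenvector of A^T A with eigenvalue s^2
for s, v in zip(S, Vt):
    assert np.allclose(A.T @ A @ v, s**2 * v)
# each left singular vector u is an eigenvector of A A^T with eigenvalue s^2
for s, u in zip(S, U.T):
    assert np.allclose(A @ A.T @ u, s**2 * u)
```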

Variance and Covariance - SVD Algorithm 2: PCA algorithm using SVD Data: N data points with feature vectors X_i, i = 1...N Z = meancentre(X); U, Σ, V = SVD(Z); take the k columns of V corresponding to the k largest values in Σ; make projection matrix P; project data X_p = ZP into the lower dimensional space; This is better than using the EVD of Z^T Z as it has better numerical stability and can be faster. 21 / 28
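A minimal numpy sketch (not from the slides) of Algorithm 2; np.linalg.svd returns the singular values in descending order, so the first k rows of V^T give the top-k principal components.

```python
import numpy as np

def pca_svd(X, k):
    Z = X - X.mean(axis=0)                      # mean centre
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:k].T                                # projection matrix from the top-k right singular vectors
    return Z @ P                                # equivalently U[:, :k] * S[:k]

Xp = pca_svd(np.random.default_rng(4).normal(size=(100, 5)), k=2)
print(Xp.shape)        # (100, 2)
```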

Variance and Covariance - SVD SVD has better numerical stability. E.g. the Läuchli matrix: $X = \begin{bmatrix} 1 & 1 & 1 \\ \epsilon & 0 & 0 \\ 0 & \epsilon & 0 \\ 0 & 0 & \epsilon \end{bmatrix}$, $X^T X = \begin{bmatrix} 1+\epsilon^2 & 1 & 1 \\ 1 & 1+\epsilon^2 & 1 \\ 1 & 1 & 1+\epsilon^2 \end{bmatrix}$ If ε is very small, 1 + ε² will be rounded to 1 in floating point, so information is lost. 22 / 28
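Not from the slides: a sketch illustrating the Läuchli example numerically. For a very small ε, forming X^T X rounds 1 + ε² to 1, so the small singular values come out as zero, while the SVD applied to X directly recovers them.

```python
import numpy as np

eps = 1e-8
X = np.array([[1, 1, 1],
              [eps, 0, 0],
              [0, eps, 0],
              [0, 0, eps]])

# singular values via X^T X: square roots of its eigenvalues (clipped at 0)
evals = np.linalg.eigvalsh(X.T @ X)
print(np.sqrt(np.clip(evals, 0, None)))   # ~[0, 0, 1.732] -- eps is lost

# singular values via the SVD of X itself
print(np.linalg.svd(X, compute_uv=False)) # ~[1.732, 1e-8, 1e-8] -- eps preserved
```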

Variance and Covariance - Truncated SVD A (m × n) ≈ U_r (m × r) Σ_r (r × r) V_r^T (r × n). Uses only the largest r singular values (and the corresponding left and right singular vectors). This gives a low rank approximation of A, Ã = U_r Σ_r V_r^T, which has the effect of minimising the Frobenius norm of the difference between A and Ã. 23 / 28
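A sketch (not from the slides) of a rank-r approximation via truncated SVD with numpy; the matrix size and r are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(50, 30))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

r = 5
A_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r]        # U_r Sigma_r V_r^T
print(np.linalg.norm(A - A_r, 'fro'))           # Frobenius norm of the approximation error
```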

Variance and Covariance - SVD SVD can be used to give a pseudoinverse: A^+ = VΣ^{-1}U^T This is used to solve Ax = b for x such that ‖Ax − b‖₂ is minimised, i.e. in least squares regression: x = A^+ b. It is also useful in solving homogeneous linear equations Ax = 0. SVD has also found application in: model based CF recommender systems (latent factors), image compression, and much more. 24 / 28
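Not from the slides: a sketch of the least-squares solution of Ax = b via the SVD pseudoinverse A^+ = VΣ^{-1}U^T, compared against numpy's lstsq and pinv (random full-rank data, so 1/Σ is safe).

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(20, 3))
b = rng.normal(size=20)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / S) @ U.T          # V Sigma^-1 U^T
x = A_pinv @ b                                  # least-squares solution

assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
assert np.allclose(A_pinv, np.linalg.pinv(A))
```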

Variance and Covariance - EVD / SVD computation Eigenvalue algorithms are iterative, using power iteration: $b_{k+1} = \frac{A b_k}{\|A b_k\|}$ The vector b_0 is either an approximation of the dominant eigenvector of A or a random vector. At every iteration, the vector b_k is multiplied by the matrix A and normalised. If A has an eigenvalue that is greater in magnitude than its other eigenvalues, and the starting vector b_0 has a non-zero component in the direction of the associated eigenvector, then b_k converges to the dominant eigenvector. After the dominant eigenvector is found, the matrix can be rotated and truncated to remove the effect of the dominant eigenvector, and the process repeated to find the next dominant eigenvector, etc. 25 / 28
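A minimal sketch (not from the slides) of power iteration as described above; the matrix, iteration count, and random start are arbitrary choices.

```python
import numpy as np

def power_iteration(A, iters=100, seed=0):
    b = np.random.default_rng(seed).normal(size=A.shape[0])   # random starting vector b_0
    for _ in range(iters):
        b = A @ b                      # multiply by A
        b /= np.linalg.norm(b)         # normalise
    return b

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = power_iteration(A)
print(b, b @ A @ b)   # dominant eigenvector and its Rayleigh quotient (the dominant eigenvalue)
```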

Variance and Covariance - EVD / SVD computation More efficient (and more complex) algorithms exist, using the Rayleigh quotient, Arnoldi iteration, or the Lanczos algorithm. Gram-Schmidt can also be used to find an orthonormal basis for the top r eigenvectors. 26 / 28

Variance and Covariance - EVD / SVD computation In practice, we use a library implementation, usually from LAPACK (numpy matrix operations usually involve LAPACK underneath). These algorithms work very efficiently for small to medium sized matrices, as well as for large, sparse matrices, but not for really massive matrices (e.g. in PageRank). There are also variations that find the eigenvectors with the smallest non-zero eigenvalues. 27 / 28

Variance and Covariance - Summary Covariance measures how different dimensions change together, and is represented by a matrix. Eigenvalue decomposition gives eigenvalue-eigenvector pairs: the dominant eigenvector gives the principal axis, and the eigenvalue is proportional to the variance along that axis. The principal axes give a basis set describing the directions of greatest variance. PCA aligns data with its principal axes, allowing dimensionality reduction that loses the least information by discounting axes with low variance. SVD is a general matrix factorisation tool with many uses. 28 / 28