Unsupervised Learning: Dimensionality Reduction

1 Unsupervised Learning: Dimensionality Reduction CMPSCI 689 Fall 2015 Sridhar Mahadevan Lecture 3

2 Outline In this lecture, we set out to solve the problem posed in the previous lecture: given a dataset, how do we find a compressed representation that has minimum reconstruction error? How do we choose a good basis to represent the data? We will cover some popular dimensionality reduction methods: principal components analysis (PCA).

3 Covariance Given two random variables $X$ and $Y$, the covariance between them is defined as $\mathrm{cov}(X, Y) = E((X - \mu_X)(Y - \mu_Y))$. The covariance can also be written as (show this!) $\mathrm{cov}(X, Y) = E(XY) - \mu_X \mu_Y$. This makes it clear that variance is simply the covariance of a random variable with itself. When the random variable $X$ is a vector, the covariance of $X$ becomes a matrix.
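The "show this!" step can be worked out by expanding the product and using linearity of expectation. The derivation below is a sketch that uses only the definitions on this slide, with $\mu_X = E(X)$ and $\mu_Y = E(Y)$:

```latex
\begin{aligned}
\mathrm{cov}(X, Y)
  &= E\big((X - \mu_X)(Y - \mu_Y)\big) \\
  &= E\big(XY - \mu_Y X - \mu_X Y + \mu_X \mu_Y\big) \\
  &= E(XY) - \mu_Y E(X) - \mu_X E(Y) + \mu_X \mu_Y \\
  &= E(XY) - \mu_X \mu_Y - \mu_X \mu_Y + \mu_X \mu_Y \\
  &= E(XY) - \mu_X \mu_Y .
\end{aligned}
```

Setting $Y = X$ gives $\mathrm{cov}(X, X) = E(X^2) - \mu_X^2 = \mathrm{Var}(X)$.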

4 Correlation The correlation between two random variables $X$ and $Y$ is defined as $\rho_{XY} = \frac{E((X - \mu_X)(Y - \mu_Y))}{\sqrt{\sigma_{XX}} \sqrt{\sigma_{YY}}}$. Intuitively, the correlation measures whether one variable tends to increase (positive, closer to $1$) or decrease (negative, closer to $-1$) as the other increases. If the correlation is $0$, there is no linear relationship between the two variables.
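As a small illustration of the two extremes, the sketch below computes the sample correlation for a variable that rises linearly with another and for one that falls linearly. The function name `correlation` and the toy data are assumptions made for this example, not part of the lecture.

```python
import math

def correlation(xs, ys):
    """Sample correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
up = [2.0 * x + 1.0 for x in xs]      # increases linearly with x
down = [10.0 - 3.0 * x for x in xs]   # decreases linearly with x

r_up = correlation(xs, up)       # close to +1
r_down = correlation(xs, down)   # close to -1
print(r_up, r_down)
```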

5 Real-World Example Predict element composition from laser-induced breakdown spectroscopy (LIBS) data

6 LIBS Spectra Vector of dimension 5485

7 Mean of LIBS Spectra (Figure: mean over the 329 spectra in the LIBS dataset.) $\mathbf{1} = (1, 1, \ldots, 1)$, $\mu = \frac{\sum_{i=1}^{n} x_i}{n}$

8 Medical Dataset Analysis [Efron et al., Least Angle Regression, Annals of Statistics, 2004]

9 Medical Data Analysis (Figure: Means in Diabetes Data.)

10 Visualizing Data (Figure: plot with axes BP and BMI.)

11 Visualizing Data

12 Predictor Response Correlations in Diabetes Data

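13 Variance-Covariance Matrix The covariance of a $p$-dimensional random variable $X$ is defined as $\Sigma_X = E((X - \mu_X)(X - \mu_X)^T)$. The diagonal entries are the variances, and the off-diagonal entries are covariances.
$$\Sigma_X = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}$$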

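14 Estimating Sample Covariance Matrix The sample covariance is $s_{ik} = \frac{1}{n} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)(x_{kj} - \bar{x}_k)$. The diagonal entries are the sample variances, and the off-diagonal entries are sample covariances.
$$S_X = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix}$$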
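The sample covariance formula can be sketched in NumPy as follows. The random data matrix `X` is a stand-in (an assumption for illustration); rows are observations and columns are variables, so the observation index $j$ runs over rows here. The result is compared against `np.cov` with `bias=True`, which uses the same $1/n$ normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))       # hypothetical data: n = 50 observations, p = 3 variables

n = X.shape[0]
x_bar = X.mean(axis=0)             # sample mean of each variable
centered = X - x_bar               # subtract the mean from every observation
S = centered.T @ centered / n      # s_ik = (1/n) * sum_j (x_ij - mean_i)(x_kj - mean_k)

S_ref = np.cov(X, rowvar=False, bias=True)   # library version with 1/n normalization
print(np.allclose(S, S_ref))
```

Many texts divide by $n - 1$ instead of $n$ to obtain an unbiased estimate; `np.cov` does so by default when `bias` is left unset.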

15 Sample Covariances in Diabetes Study (Figure: Covariance Matrix for Diabetes Data.)

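16 Correlation Matrix The correlation matrix is $\rho_X = V^{-1/2} \Sigma_X V^{-1/2}$. Here, $V^{-1/2}$ is a diagonal matrix whose entries are the inverses of the standard deviations, $\frac{1}{\sqrt{\sigma_{ii}}}$. The $ik$-th entry is $\rho_{ik} = \frac{\sigma_{ik}}{\sqrt{\sigma_{ii}} \sqrt{\sigma_{kk}}}$.
$$\rho_X = \begin{pmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{pmatrix}$$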
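A minimal sketch of the scaling $V^{-1/2} \Sigma_X V^{-1/2}$, assuming a synthetic data matrix with correlated columns (the data and variable names are illustrative only). The result is checked against `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: mixing independent columns produces correlated variables.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

Sigma = np.cov(X, rowvar=False, bias=True)             # sample covariance matrix
V_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Sigma)))    # diagonal matrix of 1 / standard deviation
R = V_inv_sqrt @ Sigma @ V_inv_sqrt                    # correlation matrix

print(np.allclose(R, np.corrcoef(X, rowvar=False)))
```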

17 Diabetes Sample Correlation Matrix (Figure: Correlation Matrix of Diabetes Data.)

19 Eigenvalues of Matrices A real-valued symmetric matrix $M$ of size $N \times N$ can be diagonalized to find a more compact representation: $M = V \Lambda V^T$. Here, $V$ is a square $N \times N$ matrix of orthogonal eigenvectors, and $\Lambda$ is a diagonal matrix of scalar real-valued eigenvalues.

20 Compressing Matrices What is the meaning of the equation $M = V \Lambda V^T$? It tells us that if we use the eigenvectors $V$ as a basis, the matrix $M$ can be compressed into a diagonal matrix $\Lambda$. Why is this important? Consider computing $M^{1000}$. This is an expensive computation if $M$ is as originally represented. However, in the new basis, it is trivial to compute!

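21 Computing Matrix Powers Since $V^T V = I$, the inner factors cancel:
$M^2 = (V \Lambda V^T)(V \Lambda V^T) = V \Lambda^2 V^T$
$M^3 = (V \Lambda V^T)^2 (V \Lambda V^T) = V \Lambda^3 V^T$
$M^N = (V \Lambda V^T)^{N-1} (V \Lambda V^T) = V \Lambda^N V^T$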
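The identity $M^N = V \Lambda^N V^T$ can be checked numerically. The sketch below uses a small symmetric matrix chosen for illustration and a modest exponent to stay within floating-point range; `np.linalg.eigh` is NumPy's eigensolver for symmetric matrices.

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # hypothetical symmetric matrix

eigvals, V = np.linalg.eigh(M)      # M = V @ diag(eigvals) @ V.T
N = 10

# In the eigenbasis, raising M to a power only requires powers of scalars.
M_pow = V @ np.diag(eigvals ** N) @ V.T

M_ref = np.linalg.matrix_power(M, N)   # repeated multiplication, for comparison
print(np.allclose(M_pow, M_ref))
```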

22 Spectrum of Covariance Matrices (Figure: Eigenvalues of Covariance Matrix.) The eigenvalues reveal a great deal of information about the data.

23 Approximating Matrices: Structure of Covariance Matrices $\Sigma = V \Lambda V^T$. Sum of projection matrices: $\Sigma = \sum_{i=1}^{p} \lambda_i v_i v_i^T$ (derive this result!). Note that each "outer product" $v_i v_i^T$ is a $p$ by $p$ matrix of rank 1. By sorting the eigenvalues by their size, a "low-rank" approximation can be constructed. Inverse covariance matrix: $\Sigma^{-1} = \sum_{i=1}^{p} \frac{1}{\lambda_i} v_i v_i^T$. "Square root" representation: $\Sigma = (V \Lambda^{1/2} V^T)(V \Lambda^{1/2} V^T)$.
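The decompositions on this slide can be verified numerically. The sketch below builds a synthetic symmetric positive definite matrix (an assumption standing in for a real covariance matrix), then checks the rank-1 sum, the inverse, the square root, and a rank-$k$ approximation that keeps the $k$ largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
Sigma = A @ A.T + 5.0 * np.eye(5)    # hypothetical symmetric positive definite matrix
p = Sigma.shape[0]

lam, V = np.linalg.eigh(Sigma)       # eigenvalues in ascending order

# Sigma as a sum of rank-1 outer products lambda_i * v_i v_i^T
Sigma_sum = sum(lam[i] * np.outer(V[:, i], V[:, i]) for i in range(p))

# Inverse: same eigenvectors, reciprocal eigenvalues
Sigma_inv = sum(np.outer(V[:, i], V[:, i]) / lam[i] for i in range(p))

# Square root: V Lambda^(1/2) V^T, which multiplied by itself gives Sigma
root = V @ np.diag(np.sqrt(lam)) @ V.T

# Rank-k approximation: keep only the k largest eigenvalues
k = 2
top = np.argsort(lam)[::-1][:k]
Sigma_k = sum(lam[i] * np.outer(V[:, i], V[:, i]) for i in top)

print(np.allclose(Sigma_sum, Sigma), np.linalg.matrix_rank(Sigma_k))
```

The spectral-norm error of the rank-$k$ approximation equals the largest eigenvalue that was dropped, which is why a fast-decaying spectrum means a good low-rank fit.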

24 Geometry and Statistics: Statistical Distance (Figure: points labeled x, y, z, and u.)

25 Statistical Distance If each dimension $x_i$ of a $p$-dimensional random variable $x$ has a different variance, a new measure of distance is needed: $d^2(x, O) = \frac{x_1^2}{\sigma_{11}} + \frac{x_2^2}{\sigma_{22}} = c^2$. This defines an ellipse with center at $(0, 0)$, semi-major axis $c \sqrt{\sigma_{11}}$, and semi-minor axis $c \sqrt{\sigma_{22}}$ (taking $\sigma_{11} \ge \sigma_{22}$). Points $x$ at distance $c$ from a fixed point $y$ define a hyper-ellipsoid: $d^2(x, y) = \sum_{i=1}^{p} \frac{(x_i - y_i)^2}{\sigma_{ii}} = c^2$.
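To make the ellipse concrete, the sketch below evaluates the squared statistical distance for two points that lie on the axes of the ellipse. The function name and the variances $\sigma_{11} = 4$, $\sigma_{22} = 1$ are assumptions chosen for illustration.

```python
def statistical_distance_sq(x, y, variances):
    """Squared statistical distance: squared differences scaled by per-dimension variance."""
    return sum((xi - yi) ** 2 / s for xi, yi, s in zip(x, y, variances))

variances = [4.0, 1.0]     # hypothetical sigma_11 and sigma_22
origin = [0.0, 0.0]

# With c = 1, the ellipse crosses the axes at c * sqrt(sigma_11) = 2 and c * sqrt(sigma_22) = 1.
d_major = statistical_distance_sq([2.0, 0.0], origin, variances)
d_minor = statistical_distance_sq([0.0, 1.0], origin, variances)
print(d_major, d_minor)
```

Both points are at the same statistical distance from the origin even though their Euclidean distances differ by a factor of two.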

26 Positive Definite Matrices A symmetric matrix $A$ is positive definite if and only if $x^T A x > 0$ for all nonzero vectors $x$. Equivalently, all its eigenvalues are real and positive. Positive definite matrices define a distance metric: $d^2(x, 0) = x^T A x = x^T \left( \sum_{i=1}^{p} \lambda_i v_i v_i^T \right) x = \sum_{i=1}^{p} \lambda_i (x^T v_i)(v_i^T x) = c^2$. Hyper-ellipsoid: axes defined by $\pm \frac{c}{\sqrt{\lambda_i}} v_i$.
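A short sketch of the eigenvalue test for positive definiteness, together with a check that the points $\pm \frac{c}{\sqrt{\lambda_i}} v_i$ satisfy $x^T A x = c^2$. The helper name `is_positive_definite` and the example matrices are assumptions for illustration.

```python
import numpy as np

def is_positive_definite(A):
    """Symmetric with strictly positive eigenvalues."""
    return bool(np.allclose(A, A.T) and np.all(np.linalg.eigvalsh(A) > 0))

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])          # hypothetical positive definite matrix
B = np.array([[1.0, 2.0],
              [2.0, 1.0]])           # symmetric, but has a negative eigenvalue

lam, V = np.linalg.eigh(A)
c = 2.0

# Axis endpoints of the ellipse x^T A x = c^2 lie along the eigenvectors.
endpoints = [s * (c / np.sqrt(lam[i])) * V[:, i] for i in range(2) for s in (1, -1)]
quad = [float(x @ A @ x) for x in endpoints]

print(is_positive_definite(A), is_positive_definite(B), quad)
```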

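27 Multivariate Gaussian Distribution A general distance measure is defined as $d^2(x, \mu) = (x - \mu)^T \Sigma^{-1} (x - \mu) = c^2$. This defines a hyper-ellipsoid with center at $\mu$ and axes defined as $\pm c \sqrt{\lambda_i} v_i$, where $\lambda_i$ and $v_i$ are the eigenvalues and eigenvectors of $\Sigma$. Multivariate Gaussian: $p_\theta(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)}$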
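The distance and density on this slide can be sketched as two small functions. The function names and the example covariance matrix are assumptions for illustration; `np.linalg.solve` is used in place of an explicit matrix inverse.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density evaluated at x."""
    p = len(mu)
    norm = (2.0 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * mahalanobis_sq(x, mu, Sigma)) / norm)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])       # hypothetical covariance matrix

lam, V = np.linalg.eigh(Sigma)
c = 1.5

# Axis endpoints mu +/- c * sqrt(lambda_i) * v_i all lie at squared distance c^2 from mu.
axis_points = [mu + s * c * np.sqrt(lam[i]) * V[:, i] for i in range(2) for s in (1, -1)]
distances = [mahalanobis_sq(pt, mu, Sigma) for pt in axis_points]

print(distances, gaussian_pdf(mu, mu, Sigma))
```

Because the density depends on $x$ only through this distance, the contours of constant density are exactly these hyper-ellipsoids.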

28 Example Multivariate Gaussian PDF (Figure: probability density plotted over $x_1$ and $x_2$.) $\mu = [0, 0]^T$, $\Sigma$ = [ ]

29 Principal Components Analysis PCA is one of the most widely used methods to find lower-dimensional representations of data. It was invented in 1901 (!) by Karl Pearson.

30 Derivation of PCA Problem: find a linear combination $Y = \sum_{i=1}^{p} \alpha_i X_i$ of the predictor variables $X_i$ that maximizes the variance $\mathrm{Var}(\alpha^T X)$. Exercise: show that $\mathrm{Var}(Y) = \mathrm{Var}(\alpha^T X) = \alpha^T \Sigma \alpha$. Since $\mathrm{Var}(\alpha^T X)$ is unbounded for arbitrary $\alpha$, we restrict our attention to linear combinations such that $\alpha^T \alpha = 1$.
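One way to approach the exercise is to write the variance as an expectation of a squared deviation and pull the constant vector $\alpha$ outside. The sketch below assumes $\mu = E(X)$ and $\Sigma = E((X - \mu)(X - \mu)^T)$, as defined on the earlier covariance slides:

```latex
\begin{aligned}
\mathrm{Var}(\alpha^T X)
  &= E\big((\alpha^T X - \alpha^T \mu)^2\big) \\
  &= E\big(\alpha^T (X - \mu)(X - \mu)^T \alpha\big) \\
  &= \alpha^T E\big((X - \mu)(X - \mu)^T\big) \alpha \\
  &= \alpha^T \Sigma \alpha .
\end{aligned}
```

The same expression shows why a constraint is needed: for any scalar $t$, $\mathrm{Var}((t\alpha)^T X) = t^2 \alpha^T \Sigma \alpha$, which grows without bound as $t$ increases.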

31 Derivation of PCA Problem: find $\alpha \in \mathbb{R}^p$ such that $\mathrm{Var}(\alpha^T X)$ is maximized subject to $\alpha^T \alpha = 1$. Solution: using Lagrange multipliers, we can formulate this optimization problem as the unconstrained problem of maximizing $L(\lambda, \alpha) = \alpha^T \Sigma \alpha - \lambda (\alpha^T \alpha - 1)$. The gradient is $\frac{\partial L}{\partial \alpha} = 2 \Sigma \alpha - 2 \lambda \alpha$. Setting the gradient to $0$ gives us $\Sigma \alpha = \lambda \alpha$. In other words, $\alpha$ must be an eigenvector of the covariance matrix $\Sigma$.

32 Derivation of PCA In fact, the first principal component must be the eigenvector associated with the largest eigenvalue: $\mathrm{Var}(Y) = \mathrm{Var}(\alpha^T X) = \alpha^T \Sigma \alpha = \lambda \alpha^T \alpha = \lambda$. So, we see that the largest eigenvalue is exactly the variance $\mathrm{Var}(Y)$ of the new derived variable. Generalized formulation: find a set of $m$ new variables $Y_1, \ldots, Y_m$ such that $Y_i = \alpha_i^T X$ and $\mathrm{Var}(Y_i)$ is maximized, where $\alpha_i^T \alpha_i = 1$ and $\alpha_i^T \alpha_j = 0$ for all $1 \le j \le i - 1$. Solution: the principal components are the eigenvectors of $\Sigma$.
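The derivation translates into a few lines of NumPy: center the data, form the sample covariance, take its eigenvectors, and project. The sketch below uses synthetic data (an assumption for illustration) and checks that the variance of each derived variable $Y_i$ equals the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing    # hypothetical data with correlated columns

Xc = X - X.mean(axis=0)                   # center each variable
Sigma = Xc.T @ Xc / len(Xc)               # sample covariance (1/n normalization)

lam, V = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
order = np.argsort(lam)[::-1]             # re-sort so the largest eigenvalue comes first
lam, V = lam[order], V[:, order]

Y = Xc @ V                                # column i holds the scores Y_i = alpha_i^T X

print(np.allclose(Y.var(axis=0), lam))    # Var(Y_i) equals lambda_i
```

The columns of `V` are orthonormal, so the derived variables are also uncorrelated with one another.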

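33 PCA and Variance Theorem: if $X = (X_1, \ldots, X_p)$, then $\sum_{i=1}^{p} \mathrm{Var}(X_i) = \sum_{i=1}^{p} \sigma_{ii} = \sum_{i=1}^{p} \lambda_i = \sum_{i=1}^{p} \mathrm{Var}(Y_i)$. Proof: given the spectral decomposition $\Sigma = V \Lambda V^T$, we have $\sum_{i=1}^{p} \mathrm{Var}(X_i) = \mathrm{tr}(\Sigma) = \mathrm{tr}(V \Lambda V^T) = \mathrm{tr}(V^T V \Lambda) = \mathrm{tr}(\Lambda) = \sum_{i=1}^{p} \lambda_i$. The proportion of population variance due to the $k$-th principal component is given by $\frac{\lambda_k}{\sum_{i=1}^{p} \lambda_i}$.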
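The theorem and the proportion-of-variance formula can be illustrated numerically. The covariance matrix below is a made-up example, not taken from the lecture's datasets.

```python
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])          # hypothetical covariance matrix

lam = np.linalg.eigvalsh(Sigma)[::-1]        # eigenvalues, largest first

total_variance = np.trace(Sigma)             # sum of the variances sigma_ii
proportion = lam / lam.sum()                 # share of variance for each principal component
cumulative = np.cumsum(proportion)           # share explained by the first k components

print(total_variance, lam.sum())
print(proportion, cumulative)
```

The cumulative proportions are what a plot of explained variance against the number of components displays.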

34 PCA on Diabetes Data (Figure: PCA for Diabetes Data, showing Percentage of Variance against Principal Component.)

35 PCA on LIBS Spectra from Mars The decay of the eigenvalues suggests the data is highly compressible. The data lies on a much lower-dimensional space!

36 Summary PCA gives us a powerful tool to find structure in data. By analyzing the spectrum of the sample covariance matrix, we can discover hidden regularities in the data. Much of the variance in the data can usually be explained by a small number of dimensions. There are many other dimensionality reduction methods, such as manifold learning and non-negative matrix factorization.

2. Matrix Algebra and Random Vectors 2.1 Introduction Multivariate data can be conveniently displayed as an array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns