Lecture 7 Spectral methods

CSE 291: Unsupervised learning (Spring 2008)

7.1 Linear algebra review

7.1.1 Eigenvalues and eigenvectors

Definition 1. A d × d matrix M has eigenvalue λ if there is a d-dimensional vector u ≠ 0 for which Mu = λu. This u is an eigenvector corresponding to λ.

In other words, the linear transformation M maps u to a multiple of itself, keeping its direction. It is interesting that any linear transformation necessarily has directional fixed points of this kind. The following chain of equivalences helps in understanding why:

    λ is an eigenvalue of M
    ⟺ there exists u ≠ 0 with Mu = λu
    ⟺ there exists u ≠ 0 with (M − λI)u = 0
    ⟺ (M − λI) is singular (that is, not invertible)
    ⟺ det(M − λI) = 0.

Now, det(M − λI) is a polynomial of degree d in λ. As such it has d roots (although some of them might be complex). This explains the existence of eigenvalues. A case of great interest is when M is real-valued and symmetric, because then the eigenvalues are real.

Theorem 2. Let M be any real symmetric d × d matrix. Then:
1. M has d real eigenvalues λ_1, ..., λ_d (not necessarily distinct).
2. There is a set of d corresponding eigenvectors u_1, ..., u_d that constitute an orthonormal basis of R^d, that is, u_i · u_j = δ_ij for all i, j.

7.1.2 Spectral decomposition

The spectral decomposition recasts a matrix in terms of its eigenvalues and eigenvectors. This representation turns out to be enormously useful.

Theorem 3. Let M be a real symmetric d × d matrix with eigenvalues λ_1, ..., λ_d and corresponding orthonormal eigenvectors u_1, ..., u_d. Then:

1. M = Q Λ Q^T, where Q = [u_1 u_2 ··· u_d] is the d × d matrix whose columns are the eigenvectors (call this Q) and Λ = diag(λ_1, ..., λ_d) is the diagonal matrix of eigenvalues.

2. M = Σ_{i=1}^d λ_i u_i u_i^T.
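
To make Theorem 3 concrete, here is a small numpy sketch (illustrative only, not part of the original notes): it draws a random symmetric matrix, computes its eigendecomposition, and checks both forms of the spectral decomposition.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 4
    A = rng.standard_normal((d, d))
    M = (A + A.T) / 2                      # a random real symmetric matrix

    # eigh is specialized to symmetric matrices: real eigenvalues, orthonormal eigenvectors
    lam, Q = np.linalg.eigh(M)             # columns of Q are the eigenvectors u_1, ..., u_d
    Lam = np.diag(lam)

    # Form 1: M = Q Λ Q^T
    assert np.allclose(M, Q @ Lam @ Q.T)

    # Form 2: M = sum_i λ_i u_i u_i^T
    M_sum = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(d))
    assert np.allclose(M, M_sum)

    # The eigenvectors are orthonormal: Q^T Q = I
    assert np.allclose(Q.T @ Q, np.eye(d))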

Proof. A general proof strategy is to observe that M represents a linear transformation x ↦ Mx on R^d, and as such is completely determined by its behavior on any set of d linearly independent vectors. For instance, u_1, ..., u_d are linearly independent, so any d × d matrix N that satisfies N u_i = M u_i (for all i) is necessarily identical to M.

Let's start by verifying (1). For practice, we'll do this in two different ways.

Method One: For any i, we have

    Q Λ Q^T u_i = Q Λ e_i = Q λ_i e_i = λ_i Q e_i = λ_i u_i = M u_i

(using Q^T u_i = e_i and Q e_i = u_i). Thus Q Λ Q^T = M.

Method Two: Since the u_i are orthonormal, we have Q^T Q = I. Thus Q is invertible, with Q^{-1} = Q^T; whereupon Q Q^T = I as well. For any i,

    Q^T M Q e_i = Q^T M u_i = Q^T λ_i u_i = λ_i Q^T u_i = λ_i e_i = Λ e_i.

Thus Λ = Q^T M Q, which implies M = Q Λ Q^T.

Now for (2). Again we use the same proof strategy. For any j,

    ( Σ_i λ_i u_i u_i^T ) u_j = λ_j u_j = M u_j.

Hence M = Σ_i λ_i u_i u_i^T.

7.1.3 Positive semidefinite matrices

We now introduce an important subclass of real symmetric matrices.

Definition 4. A real symmetric d × d matrix M is positive semidefinite (denoted M ⪰ 0) if z^T M z ≥ 0 for all z ∈ R^d. It is positive definite (denoted M ≻ 0) if z^T M z > 0 for all nonzero z ∈ R^d.

Example 5. Consider any random vector X ∈ R^d, and let µ = EX and S = E[(X − µ)(X − µ)^T] denote its mean and covariance, respectively. Then S ⪰ 0, because for any z ∈ R^d,

    z^T S z = z^T E[(X − µ)(X − µ)^T] z = E[(z^T (X − µ))((X − µ)^T z)] = E[(z · (X − µ))^2] ≥ 0.

Positive (semi)definiteness is easily characterized in terms of eigenvalues.

Theorem 6. Let M be a real symmetric d × d matrix. Then:
1. M is positive semidefinite iff all its eigenvalues satisfy λ_i ≥ 0.
2. M is positive definite iff all its eigenvalues satisfy λ_i > 0.

Proof. Let's prove (1); the second part is similar. Let λ_1, ..., λ_d be the eigenvalues of M, with corresponding orthonormal eigenvectors u_1, ..., u_d. First, suppose M ⪰ 0. Then for all i, λ_i = u_i^T M u_i ≥ 0. Conversely, suppose that all λ_i ≥ 0. Then for any z ∈ R^d, we have

    z^T M z = z^T ( Σ_i λ_i u_i u_i^T ) z = Σ_i λ_i (z · u_i)^2 ≥ 0.
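
As a quick numerical illustration of Example 5 and Theorem 6 (not from the original notes; the data and variable names are made up): an empirical covariance matrix is positive semidefinite, which by Theorem 6 is equivalent to its eigenvalues being nonnegative.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 3))  # correlated data, rows are samples

    S = np.cov(X, rowvar=False)            # empirical covariance matrix (symmetric)
    eigvals = np.linalg.eigvalsh(S)        # eigenvalues of a symmetric matrix, in ascending order

    # Theorem 6: S is positive semidefinite iff all eigenvalues are >= 0
    print(eigvals)                         # all nonnegative (up to floating-point error)

    # Direct check of the definition on a few random directions z
    for _ in range(5):
        z = rng.standard_normal(3)
        assert z @ S @ z >= -1e-10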

7.1.4 The Rayleigh quotient

One reason eigenvalues are so useful is that they give the optimal value of a very basic quadratic optimization problem (and the eigenvectors give the optimizers).

Theorem 7. Let M be a real symmetric d × d matrix with eigenvalues λ_1 ≥ λ_2 ≥ ··· ≥ λ_d and corresponding orthonormal eigenvectors u_1, ..., u_d. Then:

    max_{‖z‖=1} z^T M z = max_{z ≠ 0} (z^T M z) / (z^T z) = λ_1
    min_{‖z‖=1} z^T M z = min_{z ≠ 0} (z^T M z) / (z^T z) = λ_d

and these are realized at z = u_1 and z = u_d, respectively. (The ratio z^T M z / z^T z is the Rayleigh quotient.)

Proof. Denote the spectral decomposition by M = Q Λ Q^T. Then:

    max_{z ≠ 0} (z^T M z)/(z^T z)
      = max_{z ≠ 0} (z^T Q Λ Q^T z)/(z^T Q Q^T z)                       (since Q Q^T = I)
      = max_{y ≠ 0} (y^T Λ y)/(y^T y)                                   (writing y = Q^T z)
      = max_{y ≠ 0} (λ_1 y_1^2 + ··· + λ_d y_d^2)/(y_1^2 + ··· + y_d^2)
      ≤ λ_1,

where equality is attained in the last step when y = e_1, that is, z = Q e_1 = u_1. The argument for the minimum is identical.

Example 8. Suppose the random vector X ∈ R^d has mean µ and covariance matrix M. Then for a unit vector z, the quantity z^T M z is the variance of X in direction z:

    var(z^T X) = E[(z^T (X − µ))^2] = E[z^T (X − µ)(X − µ)^T z] = z^T M z.

Theorem 7 tells us that the direction of maximum variance is u_1, and that of minimum variance is u_d.
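
The following sketch (illustrative, assuming numpy; not part of the original notes) checks Theorem 7 numerically: the top and bottom eigenvectors attain the extreme values of z^T M z over unit vectors, and random unit vectors never do better.

    import numpy as np

    rng = np.random.default_rng(2)
    d = 5
    A = rng.standard_normal((d, d))
    M = A @ A.T                            # symmetric PSD matrix, playing the role of a covariance

    lam, U = np.linalg.eigh(M)             # ascending order: lam[-1] is λ_1, lam[0] is λ_d
    u_top, u_bot = U[:, -1], U[:, 0]

    # The top/bottom eigenvectors attain the max/min of the Rayleigh quotient
    assert np.isclose(u_top @ M @ u_top, lam[-1])
    assert np.isclose(u_bot @ M @ u_bot, lam[0])

    # No unit vector does better in either direction
    for _ in range(1000):
        z = rng.standard_normal(d)
        z /= np.linalg.norm(z)
        val = z @ M @ z
        assert lam[0] - 1e-9 <= val <= lam[-1] + 1e-9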

Continuing with Example 8, suppose that we are interested in the k-dimensional subspace (of R^d) that has the most variance. How can this be formalized?

To start with, we will think of a linear projection from R^d to R^k as a function x ↦ P^T x, where P^T is a k × d matrix with P^T P = I_k. The last condition simply says that the rows of the projection matrix are orthonormal. When a random vector X ∈ R^d is subjected to such a projection, the resulting k-dimensional vector has covariance matrix

    cov(P^T X) = E[P^T (X − µ)(X − µ)^T P] = P^T M P.

Often we want to summarize the variance by just a single number rather than an entire matrix; in such cases, we typically use the trace of this matrix, and we write var(P^T X) = tr(P^T M P). This is also equal to E‖P^T X − P^T µ‖^2. With this terminology established, we can now determine the projection P^T that maximizes this variance.

Theorem 9. Let M be a real symmetric d × d matrix as in Theorem 7, and pick any k ≤ d. Then:

    max_{P ∈ R^{d×k}, P^T P = I_k} tr(P^T M P) = λ_1 + ··· + λ_k
    min_{P ∈ R^{d×k}, P^T P = I_k} tr(P^T M P) = λ_{d−k+1} + ··· + λ_d.

These are realized when the columns of P span the k-dimensional subspace spanned by {u_1, ..., u_k} and {u_{d−k+1}, ..., u_d}, respectively.

Proof. We will prove the result for the maximum; the other case is symmetric. Let p_1, ..., p_k denote the columns of P. Then

    tr(P^T M P) = Σ_{i=1}^k p_i^T M p_i = Σ_{i=1}^k p_i^T ( Σ_{j=1}^d λ_j u_j u_j^T ) p_i = Σ_{j=1}^d λ_j Σ_{i=1}^k (p_i · u_j)^2.

We will show that this quantity is at most λ_1 + ··· + λ_k. To this end, let z_j denote Σ_{i=1}^k (p_i · u_j)^2; clearly it is nonnegative. We will show that Σ_{j=1}^d z_j = k and that each z_j ≤ 1; the desired bound is then immediate. First,

    Σ_{j=1}^d z_j = Σ_{j=1}^d Σ_{i=1}^k (p_i · u_j)^2 = Σ_{i=1}^k Σ_{j=1}^d p_i^T u_j u_j^T p_i = Σ_{i=1}^k p_i^T Q Q^T p_i = Σ_{i=1}^k ‖p_i‖^2 = k.

To upper-bound an individual z_j, start by extending the k orthonormal vectors p_1, ..., p_k to a full orthonormal basis p_1, ..., p_d of R^d. Then

    z_j = Σ_{i=1}^k (p_i · u_j)^2 ≤ Σ_{i=1}^d (p_i · u_j)^2 = Σ_{i=1}^d u_j^T p_i p_i^T u_j = ‖u_j‖^2 = 1.

It then follows that

    tr(P^T M P) = Σ_{j=1}^d λ_j z_j ≤ λ_1 + ··· + λ_k,

and equality holds when p_1, ..., p_k span the same space as u_1, ..., u_k.

7.2 Principal component analysis

Let X ∈ R^d be a random vector. We wish to find the single direction that captures as much as possible of the variance of X. Formally: we want p ∈ R^d (the direction) with ‖p‖ = 1 that maximizes var(p^T X).

Theorem 10. The solution to this optimization problem is to take p to be the principal eigenvector of cov(X).

Proof. Denote µ = EX and S = cov(X) = E[(X − µ)(X − µ)^T]. For any p ∈ R^d, the projection p^T X has mean E[p^T X] = p^T µ and variance

    var(p^T X) = E[(p^T X − p^T µ)^2] = E[p^T (X − µ)(X − µ)^T p] = p^T S p.

By Theorem 7, this is maximized (over all unit-length p) when p is the principal eigenvector of S.

Likewise, the k-dimensional subspace that captures as much as possible of the variance of X is simply the subspace spanned by the top k eigenvectors of cov(X); call these u_1, ..., u_k. Projection onto these eigenvectors is called principal component analysis (PCA). It can be used to reduce the dimension of the data from d to k. Here are the steps:

1. Compute the mean µ and covariance matrix S of the data X.
2. Compute the top k eigenvectors u_1, ..., u_k of S.
3. Project x ↦ P^T x, where P^T is the k × d matrix whose rows are u_1, ..., u_k.
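
Here is a minimal sketch of those three steps in numpy (not part of the original notes; the function name and test data are illustrative). The notes write the projection as x ↦ P^T x; the sketch also subtracts the mean first, a common convention that matches the affine-subspace view in Section 7.2.1.

    import numpy as np

    def pca_project(X, k):
        """Project the rows of X (an n x d data matrix) onto the top-k
        eigenvectors of the empirical covariance matrix."""
        mu = X.mean(axis=0)
        S = np.cov(X, rowvar=False)              # d x d covariance matrix
        lam, U = np.linalg.eigh(S)               # eigenvalues in ascending order
        P = U[:, ::-1][:, :k]                    # d x k: columns are the top-k eigenvectors
        return (X - mu) @ P, P                   # n x k projected data, plus the basis

    # Example: 2-dimensional projection of correlated 5-dimensional data
    rng = np.random.default_rng(3)
    X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
    Z, P = pca_project(X, k=2)
    print(Z.shape, P.shape)                      # (200, 2) (5, 2)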

7.2.1 The best approximating affine subspace

We've seen one optimality property of PCA. Here's another: it gives the k-dimensional affine subspace that best approximates X, in the sense that the expected squared distance from X to the subspace is minimized.

Let's formalize the problem. A k-dimensional affine subspace is given by a displacement p_o ∈ R^d and a set of orthonormal basis vectors p_1, ..., p_k ∈ R^d. The subspace itself is then

    {p_o + α_1 p_1 + ··· + α_k p_k : α_i ∈ R}.

[Figure: an affine subspace spanned by p_1, p_2 and offset from the origin by the displacement p_o.]

The projection of X ∈ R^d onto this subspace is P^T X + p_o, where P^T is the k × d matrix whose rows are p_1, ..., p_k. (With slight abuse of notation, we identify P^T x with the point P P^T x ∈ R^d, the orthogonal projection of x onto span{p_1, ..., p_k}; its squared norm is the same either way, since the p_i are orthonormal.) Thus the expected squared distance from X to this subspace is E‖X − (P^T X + p_o)‖^2. We wish to find the subspace for which this is minimized.

Theorem 11. Let µ and S denote the mean and covariance of X, respectively. The solution of this optimization problem is to choose p_1, ..., p_k to be the top k eigenvectors of S and to set p_o = (I − P^T)µ.

Proof. Fix any matrix P; the choice of p_o that minimizes E‖X − (P^T X + p_o)‖^2 is (by calculus) p_o = E[X − P^T X] = (I − P^T)µ. Now let's optimize P. Our cost function is

    E‖X − (P^T X + p_o)‖^2 = E‖(I − P^T)(X − µ)‖^2 = E‖X − µ‖^2 − E‖P^T(X − µ)‖^2,

where the second step is simply an invocation of the Pythagorean theorem.

[Figure: the vector X − µ decomposes orthogonally into P^T(X − µ), lying in the subspace, and (I − P^T)(X − µ), perpendicular to it.]

Therefore, we need to maximize E‖P^T(X − µ)‖^2 = var(P^T X), and we've already seen how to do this in Theorem 10 and the ensuing discussion.
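
Theorem 11 can also be checked empirically. The sketch below (illustrative, not from the notes; it assumes numpy and synthetic data) compares the average squared distance from data points to the PCA subspace through the empirical mean against the distance to a random subspace of the same dimension.

    import numpy as np

    rng = np.random.default_rng(4)
    d, k, n = 6, 2, 1000
    X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))   # rows are samples
    mu = X.mean(axis=0)

    def avg_sq_dist_to_subspace(X, P, mu):
        """Average squared distance from the rows of X to the affine subspace
        through mu spanned by the orthonormal columns of P (d x k)."""
        R = (X - mu) - (X - mu) @ P @ P.T          # residual component orthogonal to the subspace
        return np.mean(np.sum(R**2, axis=1))

    # PCA subspace: top-k eigenvectors of the covariance
    _, U = np.linalg.eigh(np.cov(X, rowvar=False))
    P_pca = U[:, ::-1][:, :k]

    # A random k-dimensional subspace, for comparison
    P_rand, _ = np.linalg.qr(rng.standard_normal((d, k)))

    print(avg_sq_dist_to_subspace(X, P_pca, mu))   # minimal among k-dim subspaces through the mean (Theorem 11)
    print(avg_sq_dist_to_subspace(X, P_rand, mu))  # at least as large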

7.2.2 The projection that best preserves interpoint distances

Suppose we want to find the k-dimensional projection that minimizes the expected distortion of interpoint distances. More precisely, we want the k × d projection matrix P^T (with P^T P = I_k) such that, for i.i.d. random vectors X and Y, the expected squared distortion

    E[ ‖X − Y‖^2 − ‖P^T X − P^T Y‖^2 ]

is minimized (of course, the term in brackets is always nonnegative, since a projection can only shrink distances).

Theorem 12. The solution is to make the rows of P^T the top k eigenvectors of cov(X).

Proof. This time we want to maximize

    E‖P^T X − P^T Y‖^2 = 2 E‖P^T X − P^T µ‖^2 = 2 var(P^T X)

(using that X and Y are i.i.d.), and once again we're back to our original problem.

This is emphatically not the same as finding the linear transformation (that is, not necessarily a projection) P^T for which E[‖X − Y‖^2 − ‖P^T X − P^T Y‖^2] is minimized. The random projection method that we saw earlier falls in this latter camp, because it consists of a projection followed by a scaling by √(d/k).

7.2.3 A prelude to k-means clustering

Suppose that for a random vector X ∈ R^d, the optimal k-means centers are µ_1, ..., µ_k, with cost

    opt = E‖X − (nearest µ_i)‖^2.

If instead we project X into the k-dimensional PCA subspace and find the best k centers µ′_1, ..., µ′_k in that subspace, how bad can these centers be?

Theorem 13. cost(µ′_1, ..., µ′_k) ≤ 2 · opt.

Proof. Without loss of generality EX = 0, so that the PCA mapping is X ↦ P^T X. Since µ′_1, ..., µ′_k are the best centers for P^T X (and the projected optimal centers P^T µ_1, ..., P^T µ_k are one candidate set of centers in the subspace), it follows that

    E‖P^T X − (nearest µ′_i)‖^2 ≤ E‖P^T (X − (nearest µ_i))‖^2 ≤ E‖X − (nearest µ_i)‖^2 = opt.

Let x ↦ A^T x denote projection onto the subspace spanned by µ_1, ..., µ_k. From our earlier result on approximating subspaces (Theorem 11), we know that E‖X − P^T X‖^2 ≤ E‖X − A^T X‖^2. Thus

    cost(µ′_1, ..., µ′_k) = E‖X − (nearest µ′_i)‖^2
      = E‖P^T X − (nearest µ′_i)‖^2 + E‖X − P^T X‖^2        (Pythagorean theorem)
      ≤ opt + E‖X − A^T X‖^2
      ≤ opt + E‖X − (nearest µ_i)‖^2 = 2 · opt.
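
As a rough empirical companion to Theorem 13 (not part of the original notes), the sketch below uses a bare-bones Lloyd's algorithm as a stand-in for the optimal k-means centers, so the comparison is only approximate; all names and data are illustrative.

    import numpy as np

    rng = np.random.default_rng(6)

    def kmeans_cost(X, centers):
        """Mean squared distance from each row of X to its nearest center."""
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.min(axis=1).mean()

    def lloyd(X, k, iters=50):
        """Bare-bones Lloyd's algorithm: a heuristic stand-in for the optimal centers."""
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
            centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        return centers

    d, k, n = 8, 3, 1500
    X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
    X -= X.mean(axis=0)                                   # WLOG EX = 0, as in the proof

    # Centers found in the original space (a stand-in for the optimal centers mu_1, ..., mu_k)
    cost_full = kmeans_cost(X, lloyd(X, k))

    # Centers found in the k-dimensional PCA subspace, then viewed as points of R^d
    _, U = np.linalg.eigh(np.cov(X, rowvar=False))
    P = U[:, ::-1][:, :k]                                 # d x k, top-k eigenvectors
    centers_proj = lloyd(X @ P, k)                        # (approximately) best centers for P^T X
    cost_pca = kmeans_cost(X, centers_proj @ P.T)         # evaluate them in the original space

    print(cost_full, cost_pca)     # Theorem 13, approximately: cost_pca is at most about 2x the optimum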

7.3 Singular value decomposition

For any real symmetric d × d matrix M, we can find its eigenvalues λ_1 ≥ ··· ≥ λ_d and corresponding orthonormal eigenvectors u_1, ..., u_d, and write

    M = Σ_{i=1}^d λ_i u_i u_i^T.

The best rank-k approximation to M is

    M_k = Σ_{i=1}^k λ_i u_i u_i^T,

in the sense that this minimizes ‖M − M_k‖_F^2 over all rank-k matrices. (Here ‖·‖_F denotes the Frobenius norm; it is the same as the L_2 norm if you imagine the matrix rearranged into one very long vector.) In many applications, M_k is an adequate approximation of M even for fairly small values of k. And it is conveniently compact, of size O(kd).

But what if we are dealing with a matrix M that is not square? Say it is m × n with m ≤ n. To find a compact approximation in such cases, we look at M^T M or M M^T, which are square. Eigendecompositions of these matrices lead to a good representation of M.

7.3.1 The relationship between M M^T and M^T M

Lemma 14. M^T M and M M^T are symmetric positive semidefinite matrices.

Proof. We'll do M^T M; the other is similar. First off, it is symmetric:

    (M^T M)_{ij} = Σ_k (M^T)_{ik} M_{kj} = Σ_k M_{ki} M_{kj} = Σ_k (M^T)_{jk} M_{ki} = (M^T M)_{ji}.

Next, M^T M ⪰ 0 since for any z ∈ R^n we have z^T M^T M z = ‖Mz‖^2 ≥ 0.

Which one should we use, M^T M or M M^T? Well, they are of different sizes: n × n and m × m, respectively. Ideally we'd prefer to deal with the smaller of the two, M M^T, especially since eigenvalue computations are expensive. Fortunately, it turns out the two matrices have the same non-zero eigenvalues!
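
A quick numerical check of this claim (illustrative, assuming numpy; not from the original notes): the nonzero eigenvalues of M M^T and M^T M coincide, and the larger matrix simply has n − m extra zeros.

    import numpy as np

    rng = np.random.default_rng(7)
    m, n = 3, 6                                   # a short, wide matrix: m <= n
    M = rng.standard_normal((m, n))

    eig_small = np.linalg.eigvalsh(M @ M.T)       # m eigenvalues of the m x m matrix
    eig_big   = np.linalg.eigvalsh(M.T @ M)       # n eigenvalues of the n x n matrix

    # The n x n matrix has (n - m) extra eigenvalues, all zero; the rest coincide
    print(np.sort(eig_small)[::-1])
    print(np.sort(eig_big)[::-1])                 # same leading m values, then zeros (up to roundoff)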

Lemma 15. If λ is an eigenvalue of M^T M with eigenvector u, then either:
(i) λ is an eigenvalue of M M^T with eigenvector Mu, or
(ii) λ = 0 and Mu = 0.

Proof. Say λ ≠ 0; we'll prove that condition (i) holds. First of all, M^T M u = λu ≠ 0, so certainly Mu ≠ 0. It is an eigenvector of M M^T with eigenvalue λ, since

    M M^T (Mu) = M(M^T M u) = M(λu) = λ(Mu).

Next, suppose λ = 0; we'll establish condition (ii). Notice that

    ‖Mu‖^2 = u^T M^T M u = u^T (λu) = 0.

Thus it must be the case that Mu = 0.

7.3.2 A spectral decomposition for rectangular matrices

Let's summarize the consequences of Lemma 15. We have two square matrices: a large one (M^T M) of size n × n and a smaller one (M M^T) of size m × m. Let the eigenvalues of the large matrix be λ_1 ≥ λ_2 ≥ ··· ≥ λ_n, with corresponding orthonormal eigenvectors u_1, ..., u_n. From the lemma, we know that at most m of these eigenvalues are nonzero. The smaller matrix M M^T has eigenvalues λ_1, ..., λ_m, with corresponding orthonormal eigenvectors v_1, ..., v_m.

The lemma suggests taking v_i = M u_i; this is certainly a valid set of eigenvectors, but they are not necessarily normalized to unit length. So instead we set

    v_i = M u_i / ‖M u_i‖ = M u_i / √(u_i^T M^T M u_i) = M u_i / √λ_i.

This finally gives us the singular value decomposition, a spectral decomposition for general matrices.

Theorem 16. Let M be a rectangular m × n matrix with m ≤ n, and define λ_i, u_i, v_i as above. Then M = Q_1 Σ Q_2^T, where:
- Q_1 is the m × m matrix with columns v_1, ..., v_m;
- Σ is the m × n matrix whose diagonal entries are √λ_1, ..., √λ_m and whose remaining entries are zero;
- Q_2 is the n × n matrix with columns u_1, ..., u_n.

Proof. We will check that Σ = Q_1^T M Q_2. By our proof strategy from Theorem 3, it is enough to verify that both sides have the same effect on e_i for all 1 ≤ i ≤ n. For any such i,

    Q_1^T M Q_2 e_i = Q_1^T M u_i = √λ_i e_i   if i ≤ m   (since M u_i = √λ_i v_i and Q_1^T v_i = e_i)
                                  = 0          if i > m   (since M u_i = 0 by Lemma 15)

and in either case this equals Σ e_i.

The alternative form of the singular value decomposition is

    M = Σ_{i=1}^m √λ_i v_i u_i^T,

which immediately yields a rank-k approximation

    M_k = Σ_{i=1}^k √λ_i v_i u_i^T.

As in the square case, M_k is the rank-k matrix that minimizes ‖M − M_k‖_F^2.
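
Finally, here is a numpy sketch (not from the notes) connecting Theorem 16 to np.linalg.svd: the returned factors play the roles of Q_1, Σ, Q_2^T, the singular values are the √λ_i, and truncating the sum gives the rank-k approximation M_k.

    import numpy as np

    rng = np.random.default_rng(8)
    m, n, k = 4, 7, 2
    M = rng.standard_normal((m, n))

    # Full SVD: M = Q1 @ Sigma @ Q2.T with Q1 (m x m), Sigma (m x n), Q2 (n x n)
    Q1, s, Q2T = np.linalg.svd(M)                   # s holds the singular values sqrt(lambda_i)
    Sigma = np.zeros((m, n))
    Sigma[:m, :m] = np.diag(s)
    assert np.allclose(M, Q1 @ Sigma @ Q2T)

    # The squared singular values are the nonzero eigenvalues of M M^T (and of M^T M)
    assert np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(M @ M.T)))

    # Rank-k approximation M_k = sum_{i <= k} sqrt(lambda_i) v_i u_i^T
    M_k = sum(s[i] * np.outer(Q1[:, i], Q2T[i, :]) for i in range(k))
    print(np.linalg.norm(M - M_k, 'fro'),           # Frobenius error of the rank-k approximation
          np.sqrt(np.sum(s[k:]**2)))                # equals sqrt of the sum of discarded squared singular values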