Multi-dimensional Signal Analysis
Lecture 2E: Principal Component Analysis
Subspace representation

Note! Given
- a vector space V of dimension N,
- a scalar product defined by G_0,
- a subspace U of dimension M < N,
- an N × M basis matrix B of the subspace U,
- a vector v ∈ V,

we can determine the v_1 ∈ U that is closest to v. This v_1 is a subspace representation of v ∈ V. v_1 is independent of B; it depends only on U.

Lecture 2E, Klas Nordberg, LiU
Subspace representation

(Figure: a vector v ∈ V and its closest point v_1 in the subspace U.)
Subspace representation

More precisely: the coordinates of v_1 relative to the basis B are given by

    c = G^(-1) c̃,   where c̃ = B^T G_0 v and G = B^T G_0 B
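The formula above can be sketched numerically. This is an illustrative example (not from the lecture), using NumPy and an arbitrary made-up basis B and vector v; it checks that the residual v − v_1 is orthogonal, in the G_0 scalar product, to the subspace.

```python
import numpy as np

G0 = np.eye(3)                      # assumed scalar-product matrix
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])          # 3x2 basis of a 2D subspace U (not ON)
v = np.array([1.0, 2.0, 3.0])

G = B.T @ G0 @ B                    # Gram matrix  G = B^T G0 B
c_tilde = B.T @ G0 @ v              # c~ = B^T G0 v
c = np.linalg.solve(G, c_tilde)     # c = G^(-1) c~
v1 = B @ c                          # closest point in U

# The residual v - v1 is G0-orthogonal to every basis vector of U:
print(B.T @ G0 @ (v - v1))          # ~ [0, 0]
```

Note that B here is deliberately not orthonormal, so the Gram matrix G is needed; the ON simplification comes later in the lecture.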
Stochastic signals

In the previous applications (normalised convolution and filter optimisation) the basis was fixed. This means that U is fixed. An alternative approach is to let the signal vector v be a stochastic variable and to determine an M-dimensional subspace U such that v_1 is as close as possible to v on average.
Initial problem formulation

We want to minimise

    ε = E ‖v − v_1‖²

where E denotes the expectation value (mean), taken over all observations of the signal v. ε is minimised over all M-dimensional subspaces U.
Problem formulation

To simplify things, we assume that
- V is real: V = R^N,
- G_0 = I, which gives ⟨u | v⟩ = u^T v,
- B is an ON basis of the subspace U: B^T B = I.
Problem formulation

This simplification also means that

    v_1 = B c = B B^T v

We want to determine an M-dimensional subspace U (represented by an ON basis B) such that we minimise

    ε = E ‖v − B B^T v‖²
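The ON simplification can be checked with a small sketch (illustrative, not from the lecture; the basis and vector are random): with B^T B = I the projection is just v_1 = B B^T v, and the energies split by Pythagoras.

```python
import numpy as np

B, _ = np.linalg.qr(np.random.randn(5, 2))   # random ON basis of a 2D subspace
v = np.random.randn(5)
v1 = B @ (B.T @ v)                           # projection onto U

err = np.linalg.norm(v - v1)**2
# Pythagoras: ||v||^2 = ||v1||^2 + ||v - v1||^2
print(np.isclose(np.linalg.norm(v)**2, np.linalg.norm(v1)**2 + err))  # True
```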
Solving the problem

We assume that v has known statistical properties. We want to determine the B that minimises ε. This is the same as maximising

    ε_1 = E [v^T B B^T v]   (why?)

for an ON basis B. This is an optimisation problem with a set of constraints (B^T B = I).
Practical solution

As soon as M (= the dimension of U = the number of columns in B) is determined, we optimise ε_1 over the N·M elements in B, with the constraints B^T B = I. This can be solved by means of standard techniques for constrained optimisation, e.g. Lagrange's method.
A simple example

Simple example: M = 1, so B has only one single column b_1. We want to maximise

    ε_1 = E [v^T b_1 b_1^T v] = E [b_1^T v v^T b_1] = b_1^T E [v v^T] b_1

with the constraint c = b_1^T b_1 = 1.
A simple example

Use Lagrange's method:

    ∇ε_1 = λ ∇c

where the gradients are taken with respect to the elements of b_1. This leads to

    E [v v^T] b_1 = λ b_1
A simple example

Consequently: b_1 is an eigenvector of E [v v^T], corresponding to the eigenvalue λ. Remember: ‖b_1‖ = 1, since B is ON.
The correlation matrix

It is clear that the matrix

    C = E [v v^T]

is an important thing to know in order to solve the problem. C is called the correlation matrix of the signal. C is symmetric and N × N. C is positive semi-definite (why?).

Approximation of C from P samples:

    C ≈ (1/P) Σ_{k=1}^{P} v_k v_k^T
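The sample approximation of C can be sketched as follows (illustrative, with random stand-in data; observations are stored as rows here, so the sum becomes one matrix product):

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 1000, 4
V = rng.standard_normal((P, N))          # P observations of v (one per row)

C = (V.T @ V) / P                        # C ~ (1/P) sum_k v_k v_k^T

# C is symmetric, and positive semi-definite since x^T C x = E[(x^T v)^2] >= 0
print(np.allclose(C, C.T))                       # True
print(np.all(np.linalg.eigvalsh(C) >= -1e-12))   # True
```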
A simple example

With this choice of b_1, ε_1 becomes

    ε_1 = b_1^T C b_1 = λ b_1^T b_1 = λ

Since we want to maximise ε_1 we should choose λ as large as possible: b_1 is a normalised eigenvector corresponding to the largest eigenvalue of C.
A simple example

From ε_1 = λ_1 = the largest eigenvalue of C, it follows that ε = the sum of all eigenvalues of C except λ_1 (why?).
A slightly more complicated example

Let U be 2-dimensional: B has two columns, b_1 and b_2. We want to maximise

    ε_1 = E [v^T B B^T v]

with the constraints c_1 = b_1^T b_1 = 1, c_2 = b_1^T b_2 = 0, c_3 = b_2^T b_2 = 1.
A slightly more complicated example

In the general case, when M > 1, the subspace ON basis matrix B is not unique: B' = B Q with Q ∈ O(M) is also a solution (why?). We need additional constraints on B in order to make it unique. We choose: B^T C B is diagonal, i.e., the subspace basis vectors are uncorrelated.
A slightly more complicated example

This additional constraint leads to

    C b_1 = λ_1 b_1
    C b_2 = λ_2 b_2

Both b_1 and b_2 must be normalised and mutually orthogonal eigenvectors of C, with eigenvalues λ_1 and λ_2.
A slightly more complicated example

We want to maximise (why?)

    ε_1 = E [v^T B B^T v] = b_1^T C b_1 + b_2^T C b_2 = λ_1 + λ_2

and therefore we should choose b_1 and b_2 as normalised eigenvectors corresponding to the two largest eigenvalues of C. Remember the multiplicity of eigenvalues: since C is symmetric, b_1 and b_2 can always be chosen as orthogonal!
Generalisation

Based on these examples we present a general result. We want to determine an M-dimensional subspace U, described by an ON basis matrix B:

1. Form the correlation matrix C.
2. Compute the eigenvalues and eigenvectors of C.
3. The basis B consists of M eigenvectors corresponding to the M largest eigenvalues of C.
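The three steps above can be sketched in a few lines (an illustrative NumPy implementation, not code from the lecture; the test data is random with one clearly dominant direction):

```python
import numpy as np

def pca_basis(V, M):
    """V: (P, N) array of observations (rows); returns an (N, M) ON basis B."""
    P = V.shape[0]
    C = (V.T @ V) / P                     # step 1: correlation matrix
    lam, E = np.linalg.eigh(C)            # step 2: eigenvalues (ascending)
    order = np.argsort(lam)[::-1]         # sort descending
    return E[:, order[:M]]                # step 3: top-M eigenvectors

rng = np.random.default_rng(1)
# 2D data with much more variance along the first axis than the second
V = rng.standard_normal((500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
B = pca_basis(V, 1)
print(B.shape)        # (2, 1)
```

`np.linalg.eigh` is used because C is symmetric; its eigenvectors come out orthonormal, so B^T B = I holds by construction.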
Generalisation

This gives

    ε_1 = E [v^T B B^T v] = λ_1 + ... + λ_M

and

    ε = λ_{M+1} + ... + λ_N

where λ_1 ≥ λ_2 ≥ ... ≥ λ_N are the eigenvalues of C.
Principal components

The eigenvectors of C are in this context referred to as principal components. The magnitude of a principal component is given by the corresponding eigenvalue. The M-dimensional subspace U is spanned by the M largest principal components of C. Determining the basis B in this way is called principal component analysis, or PCA.
Analysis and reconstruction

In PCA we analyse the signal v to get its coordinates relative to the subspace basis B:

    c = B^T v

If necessary, we can then reconstruct v_1 as

    v_1 = B c
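The analysis and reconstruction steps, as a small sketch (illustrative; B is just a random ON basis here, not one computed by PCA):

```python
import numpy as np

B, _ = np.linalg.qr(np.random.randn(4, 2))   # stand-in 4x2 ON basis
v = np.random.randn(4)

c = B.T @ v          # analysis:       the M coordinates of v in U
v1 = B @ c           # reconstruction: the projection of v onto U

# Reconstructing v1 again changes nothing (projection is idempotent):
print(np.allclose(B @ (B.T @ v1), v1))   # True
```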
Analysis and reconstruction

(Diagram: stochastic process → v ∈ R^N → PCA (using C ∈ R^{N×N}) → c ∈ R^M → reconstruction → v_1 ∈ R^N.)

In some applications, for example clustering, the reconstruction step is not relevant.
Applications

In signal processing, PCA can be used for:
- General data analysis: for an unknown data/signal, determine if it can be reduced in dimension.
- Data/signal compression: an N-dimensional signal can be represented with an M-dimensional basis (M < N). This reduces the amount of data needed to store/transmit the signal.
- Data-dependent compression:
  (+) effective, since the compression is data dependent;
  (−) overhead, since B must be stored/transmitted as well.
Signal model

PCA can typically be applied to signals with a statistical model (mean and C known). PCA can also be applied to a finite data set in the form of high-dimensional vectors (dimensionality reduction); C is then estimated from the finite data set.
How large a subspace?

The idea behind PCA is to estimate C from a large set of observations of the data/signal v and then choose the basis B as the M largest principal components. How do we know a suitable value for M? There is no optimal strategy; it is application dependent.
How large a subspace?

In some applications M may be fixed, i.e., already given. In other applications it makes sense to analyse the eigenvalues λ_1, ..., λ_N in order to choose a suitable M, for example such that ε / ε_1 is small (why?). Usually each dimension of U corresponds to a cost: it represents a coordinate of v_1 that needs to be stored or transmitted. We want to keep M as small as possible, balancing cost against the decrease in ε.
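One possible way to pick M along these lines (an illustrative sketch, not a rule from the lecture): choose the smallest M for which the residual ratio ε/ε_1 = (λ_{M+1}+...+λ_N)/(λ_1+...+λ_M) drops below a threshold. The threshold value and the example eigenvalues are made up.

```python
import numpy as np

def choose_M(eigenvalues, threshold=0.05):
    """Smallest M with residual ratio eps/eps1 <= threshold."""
    lam = np.sort(eigenvalues)[::-1]          # descending
    for M in range(1, len(lam) + 1):
        eps1 = lam[:M].sum()                  # retained: lam_1 + ... + lam_M
        eps = lam[M:].sum()                   # residual: lam_{M+1} + ... + lam_N
        if eps / eps1 <= threshold:
            return M
    return len(lam)

lam = np.array([100.0, 5.0, 1.0, 0.5, 0.1])
print(choose_M(lam))   # 2
```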
An example

(Figure: a 512 × 512 pixel image.)
An example

Let us divide this image into 8 × 8 pixel blocks. This gives in total 64 × 64 = 4096 blocks. The pixels of each block constitute our signal v, a vector in R^64 (8 × 8 reshaped to a 64-dimensional column vector). From the 4096 observations of v we form the 64 × 64 correlation matrix C. We also compute the eigenvalues and eigenvectors of C.
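The blocking step can be sketched as follows (illustrative; a random array stands in for the actual 512 × 512 image from the lecture):

```python
import numpy as np

img = np.random.rand(512, 512)                    # stand-in for the image

blocks = (img.reshape(64, 8, 64, 8)               # split both axes into 8s
             .transpose(0, 2, 1, 3)               # gather the 8x8 blocks
             .reshape(4096, 64))                  # one 64-dim row per block

C = (blocks.T @ blocks) / blocks.shape[0]         # 64x64 correlation matrix
print(blocks.shape, C.shape)    # (4096, 64) (64, 64)
```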
The eigenvalues of C

Here are the 64 eigenvalues of C: one large eigenvalue, the rest relatively small.
The eigenvalues of C

If we remove the largest eigenvalue and plot the rest: (figure).
The eigenvectors of C

Each eigenvector of C is a 64-dimensional vector that can be reshaped into an 8 × 8 block. (Figure: the first six eigenvectors b_1, ..., b_6 shown as 8 × 8 blocks.)
The eigenvectors of C

We notice that:
- b_1 is approximately flat or constant (DC).
- b_2 and b_3 are approximately shaped like planes with a slope in two perpendicular directions (linear).
- b_4, b_5 and b_6 are approximately shaped like quadratic surfaces.
An example

The distribution of the eigenvalues implies that already by going from 0 to 1 dimensions of U the error ε is reduced quite a lot. By adding a few more dimensions it should be possible to represent the signal quite well. Let us try some different values of M and look at the result.
An example

Each block, as a vector v ∈ R^64, is projected onto v_1 ∈ U. v_1 can be represented with only M coordinates. v_1 is a reasonably good approximation of v; at least the mean error should be low. How good is determined by the distribution of the eigenvalues relative to M.
M = 1

Here v is projected onto the first principal component b_1. Since b_1 is flat or constant, each block becomes flat or constant.
M = 2

Here v is projected onto the first two principal components, b_1 and b_2. Since b_2 is a slope in one direction we can now represent that change in each block, but not a slope in the orthogonal direction!
M = 3

Here v is projected onto the first three principal components, b_1 ... b_3. Since b_3 is a slope in the other direction, we can now have slopes in any direction within each block.
M = 6

Here v is projected onto the first six principal components, b_1 ... b_6. With 3 more dimensions, each block can contain more details. Here, the data has been reduced by a factor 64/6 ≈ 11.
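The whole block-PCA experiment can be sketched end to end (illustrative; a random array stands in for the lecture's image, so the eigenvector shapes will not be the DC/linear/quadratic patterns seen above, but the mechanics are the same):

```python
import numpy as np

M = 6
img = np.random.rand(512, 512)                    # stand-in for the image
blocks = img.reshape(64, 8, 64, 8).transpose(0, 2, 1, 3).reshape(4096, 64)

C = (blocks.T @ blocks) / 4096                    # 64x64 correlation matrix
lam, E = np.linalg.eigh(C)
B = E[:, np.argsort(lam)[::-1][:M]]               # 64 x M top-M basis

coords = blocks @ B                               # analysis:       4096 x M
recon = coords @ B.T                              # reconstruction: 4096 x 64

mse = np.mean((blocks - recon) ** 2)              # mean squared block error
print(f"data reduced by 64/{M}, block MSE = {mse:.4f}")
```

For this sample-based C the mean squared error equals the mean of the residual eigenvalues, matching ε = λ_{M+1} + ... + λ_N from the earlier slides.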
General observation

A general observation:
- In areas of low spatial frequency the approximation is good.
- In areas of high spatial frequency the approximation is worse.
The more details (higher spatial frequencies), the more dimensions are needed for an accurate representation.
Karhunen-Loève transformation

Introducing the new basis B in this way and computing the coordinates of v relative to B is sometimes also referred to as the Karhunen-Loève transformation. B^T v gives the transformed signal (= the coordinates of v relative to the basis B); B B^T v reconstructs the projected signal v_1.
Covariance and mean

In some applications PCA is used to make a statistical analysis of a signal v. It is then very common to describe v in terms of its mean m_v and its covariance matrix:

    m_v = E [v]
    Cov_v = E [(v − m_v)(v − m_v)^T]
Covariance and mean

The PCA is then based on the covariance matrix Cov_v rather than the correlation matrix C. This corresponds to translating the origin of the subspace U to the mean m_v and doing correlation-based PCA there. Both approaches are called PCA: statistical data analysis typically uses Cov_v, data compression typically uses C.
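The relation between the two matrices can be verified numerically (illustrative sketch with random data that has a deliberately nonzero mean): covariance-based PCA amounts to correlation-based PCA on the mean-subtracted signal, and for the sample estimates Cov_v = C − m_v m_v^T.

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.standard_normal((300, 3)) + np.array([5.0, -2.0, 1.0])  # nonzero mean

m = V.mean(axis=0)                         # estimate of m_v
Cov = ((V - m).T @ (V - m)) / V.shape[0]   # covariance estimate
Corr = (V.T @ V) / V.shape[0]              # correlation estimate

# Cov = Corr - m m^T (exactly, for these sample-based estimates)
print(np.allclose(Cov, Corr - np.outer(m, m)))   # True
```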
The data matrix

In the case that we approximate the statistics of v from a finite number P of observations, these can be described by a data matrix A that holds all these v in its columns. An estimate of C = E [v v^T] is then given by

    C ≈ (1/P) A A^T
SVD vs EVD

The principal components are the eigenvectors of A A^T. These are also the left singular vectors of A. An alternative way of computing the PCA, which in some cases may have better numerical properties:

1. Form the data matrix A.
2. Compute its SVD.
3. Form B from the left singular vectors that correspond to the M largest singular values.
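The equivalence of the two routes can be checked directly (illustrative sketch with a random data matrix): the top-M left singular vectors of A agree, up to a sign per column, with the top-M eigenvectors of A A^T, and the squared singular values equal the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 40))          # N=5 signal dim, P=40 columns

U, s, _ = np.linalg.svd(A, full_matrices=False)   # s is sorted descending
lam, E = np.linalg.eigh(A @ A.T)                  # eigenvalues ascending

M = 2
B_svd = U[:, :M]                                  # SVD route
B_evd = E[:, np.argsort(lam)[::-1][:M]]           # EVD route

# Same subspace: the bases agree up to a sign flip per column
print(np.allclose(np.abs(B_svd.T @ B_evd), np.eye(M), atol=1e-8))  # True
```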
What you should know includes

Formulation of the problem that PCA solves: find a subspace U that minimises ε = E ‖v − B B^T v‖², where the subspace is represented by an ON basis B.

Solution: B consists of the M largest eigenvectors of C = E [v v^T], called the principal components; alternatively computed by means of SVD. ε = the sum of the residual eigenvalues of C.

Applications: dimensionality reduction, signal compression.