ECE 501b Homework #6 Due: 11/26


1. Principal Component Analysis: In this assignment, you will explore PCA as a technique for discerning whether low-dimensional structure exists in a set of data and for finding good representations of the data in that subspace. In the previous homework, I suggested that we would be looking at image compression applications. As several groups will potentially be employing PCA in their class projects, I decided to forgo that particular topic. For this assignment, please do the following (include MATLAB code and plots where appropriate):

(a) Download the paper "A Tutorial on Principal Component Analysis" from the course website and read it carefully. This paper really does an excellent job of introducing PCA. Note, however, that it is written very much from the same perspective as we will explore in this assignment: discovering low-dimensional structure if it exists. Much of the utility of PCA comes from applications that then make use of that discovered structure (for compression, denoising, etc.). In what follows, you may use the code snippets at the end for inspiration, but you are expected to design and implement your own functions where necessary. BE AWARE that the notation used in the paper does not always match my and MATLAB's default notation (vectors are stored in columns of a matrix). I want you to use MATLAB's approach, so you need to be careful before blindly applying something from the paper.

(b) Download the rawdata.mat datafile from the website. The rawdata matrix in this datafile represents 512 vectors in a 1024-dimensional vector space. I generated this data so that the vectors actually reside in a much lower-dimensional subspace. More specifically, I generated data that lives in a low-dimensional subspace and then added a certain amount of noise, so the data in rawdata is only approximately low-dimensional. Begin by writing a function to zero-mean the data. That is, write a function that shifts the vectors so that the mean of the data in each dimension is zero. Use it to zero-mean rawdata. Use a different name, as you'll still need the non-zero-mean version of rawdata in the future. Here's the function:

    function zm = zeromean(data)
    [rr cc] = size(data);
    zm = data - repmat(mean(data), [rr 1]);

and here's the call:

    load('rawdata.mat');
    [M N] = size(rawdata);
    rdz = zeromean(rawdata);
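(A quick sanity check, not part of the assignment: the column means of rdz should now be zero to within rounding, and on newer MATLAB releases implicit expansion gives an equivalent one-liner.)

    % Sanity check (aside): each column mean of the zero-meaned data should be ~0.
    fprintf('Largest column-mean magnitude: %g\n', max(abs(mean(rdz))));

    % Equivalent one-liner on MATLAB R2016b or later (implicit expansion):
    rdz2 = rawdata - mean(rawdata);
    fprintf('Max difference from the repmat version: %g\n', max(abs(rdz2(:) - rdz(:))));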

(c) Compute the covariance matrix for both rawdata and its zero-mean version. You may use the MATLAB function cov. Plot the matrices and their difference to demonstrate that zero-meaning the data doesn't affect the covariance. (Pro tips: Use imagesc to display the matrix and it will scale the colormap appropriately. Use colormap gray to get a colormap that gives a smooth variation in color with value. Use colorbar to place a scale next to the image.)

    rdcov = cov(rawdata);
    rdzcov = cov(rdz);
    subplot(1,3,1); imagesc(rdcov);          colormap gray; colorbar; title('Non zero meaned');
    subplot(1,3,2); imagesc(rdzcov);         colormap gray; colorbar; title('Zero meaned');
    subplot(1,3,3); imagesc(rdcov - rdzcov); colormap gray; colorbar; title('Difference');

[Figure: covariance of the raw data (left), covariance of the zero-meaned data (center), and their difference (right).]

The difference is on the order of 10^-14, which is just numerical rounding error.

(d) Compute the principal components by finding the eigendecomposition of your zero-mean covariance matrix. Use the MATLAB function eig. Sort the eigenvalues from largest to smallest (and sort the eigenvectors as well). (Pro tip: Use the [vals index] = sort(numbers,'descend') version of the sort command to get a sorted list of indices that you can use to sort the eigenvectors.) Make two plots of the eigenvalues (linear and semilog). Use this information to infer the dimension of the low-dimensional subspace that the data approximately resides in.

    [ev D] = eig(rdzcov);
    [vals index] = sort(diag(D), 'descend');
    ev = ev(:, index);
    subplot(1,2,1); plot(vals);
    subplot(1,2,2); semilogy(vals);
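(An optional, hedged way to locate that break automatically, assuming a single sharp signal/noise transition, is to look for the largest drop between consecutive sorted eigenvalues on a log scale.)

    % Hedged sketch: estimate the signal-subspace dimension as the index just
    % before the largest drop in log10 of the sorted eigenvalues. The round-off
    % eigenvalues (beyond the 512 data vectors) are discarded first so they do
    % not dominate the gap search.
    keep = vals > max(vals) * 1e-12;    % discard round-off eigenvalues
    logvals = log10(vals(keep));
    [~, kdim] = max(-diff(logvals));    % index just before the biggest drop
    fprintf('Estimated subspace dimension: %d\n', kdim);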

[Figure: eigenvalues of the zero-mean covariance matrix on linear (left) and semilog (right) axes; a data cursor marks the drop near index 65.]

The semilog plot shows it best. The first 64 eigenvalues decrease smoothly, but then there is a sharp transition in magnitude. This is the break between the signal subspace and the noise contributions. The even sharper drop at 513 is because we only have 512 data vectors, so the remainder of the eigenvalues are essentially just rounding errors. Thus, I conclude the signal lives in a 64-dimensional subspace.

(e) Use the principal components to diagonalize the covariance matrix. Plot the original and diagonalized covariance matrices to demonstrate the difference.

    cv2 = ev'*rdzcov*ev;
    subplot(1,3,1); imagesc(rdzcov);         colormap gray; colorbar; title('Original covariance matrix');
    subplot(1,3,2); imagesc(cv2);            colormap gray; colorbar; title('After diagonalizing with eigenvectors');
    subplot(1,3,3); imagesc(cv2(1:75,1:75)); colormap gray; colorbar; title('Zoomed view');

[Figure: (Left) Original covariance matrix. (Center) Diagonalized covariance matrix. (Right) Zoomed view of the upper-left region.]
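(As an extra, optional check that the rotation really diagonalizes the covariance, we can compare the largest off-diagonal entry of cv2 to its largest diagonal entry; it should be down at rounding-error levels.)

    % Optional check: the off-diagonal part of cv2 should be negligible compared
    % with the diagonal if ev truly diagonalizes the covariance matrix.
    offdiag = cv2 - diag(diag(cv2));
    fprintf('max|off-diagonal| / max|diagonal| = %g\n', ...
            max(abs(offdiag(:))) / max(abs(diag(cv2))));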

(f) Now we're going to compute the principal components via SVD. Use the MATLAB svd command to decompose the zero-mean data into U, S, and V matrices. Make two plots of the singular values (linear and semilog). Use this information to infer the dimension of the low-dimensional subspace that the data approximately resides in. Compare with the answer you found earlier.

    [U S sv] = svd(rdz);
    subplot(1,2,1); plot(diag(S));     title('Singular values (linear)');
    subplot(1,2,2); semilogy(diag(S)); title('Singular values (semilog)');

[Figure: singular values of the zero-mean data on linear (left) and semilog (right) axes.]

Again, we see a sharp drop in the magnitude of the singular values after the first 64. We again conclude a 64-dimensional subspace. This matches our earlier result.
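(A hedged aside, not required: because cov normalizes by M-1, the covariance eigenvalues and the singular values of the zero-mean data should be related by lambda_i = sigma_i^2/(M-1). A quick numerical comparison:)

    % Hedged aside: with rows as observations (M = 512 of them), cov normalizes
    % by (M-1), so the eigenvalues from part (d) should equal sigma.^2/(M-1).
    sigma = diag(S);    % the 512 singular values
    fprintf('Max mismatch: %g\n', ...
            max(abs(sigma.^2/(M-1) - vals(1:length(sigma)))));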

(g) Compare the principal components found by taking the eigenvectors of the covariance matrix with the ones found in the matrix V of the SVD. Plot the difference of the two matrices. Comment on the differences.

    imagesc(sv - ev); colormap gray; colorbar; title('Difference');

[Figure: difference between the SVD singular vectors and the covariance eigenvectors.]

So the plot above is the difference between the two sets of vectors (arranged in columns). Close examination reveals something odd: there are some columns that cancel out exactly (at least to within rounding errors), but others don't. After thinking about it for a while, we realize that there can be an overall sign ambiguity to a direction vector. So we use the following code instead:

    map = repmat(sign(sv(1,:)) ./ sign(ev(1,:)), [N 1]);
    ev2 = ev .* map;
    imagesc(sv - ev2); colormap gray; colorbar;
    title('Difference correcting for flips');

This takes the eigenvectors and multiplies them by -1 if they have a different leading sign than the singular vectors. The graph below is the difference of the result and the singular vectors:

[Figure: difference between the sign-corrected eigenvectors and the singular vectors.]

We see that now all of the first 512 columns zero out to within rounding (the remainder are meaningless as a result of the fact that we started with only 512 data vectors).

(h) Now we're going to use the MATLAB function princomp to compute the principal components. This time, use the non-zero-meaned data (princomp takes care of that detail for you). Plot the score matrix returned by princomp. This matrix gives the expansion coefficients for the data in the principal component basis. Comment on the structure you see.

    [pv score] = princomp(rawdata);
    % Look at the scores to infer the size of the low-dim subspace
    imagesc(score); colormap gray; colorbar; title('Scores');
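(Aside: on newer MATLAB releases princomp is deprecated in favor of pca, which also centers the data internally; a hedged equivalent of the call above would be the following.)

    % Hedged equivalent on recent MATLAB (Statistics Toolbox): pca centers the
    % data by default and returns the coefficients and scores directly. Note
    % that, by default, pca returns only the leading components, unlike princomp.
    [pv, score] = pca(rawdata);
    imagesc(score); colormap gray; colorbar; title('Scores');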

[Figure: the score matrix returned by princomp, displayed with imagesc.]

The rows correspond to the different data vectors; the columns are the projections of that data vector onto the 1024 different PC basis vectors. We see that there are significant weights in only a small number of basis directions. We'll explore this in the next part.

(i) Following up on the lead from the part above, compute the mean of the absolute value of the elements in each column of score. Make two plots of this information (linear and semilog). Use this information to infer the dimension of the low-dimensional subspace that the data approximately resides in.

    subplot(1,2,1); plot(mean(abs(score)));     title('mean(abs(scores)) (linear)');
    subplot(1,2,2); semilogy(mean(abs(score))); title('mean(abs(scores)) (semilog)');

[Figure: mean(abs(score)) for each principal component, on linear (left) and semilog (right) axes.]

So we see what we saw in all the other cases. The values drop off significantly in magnitude after the first 64 values. Thus we conclude a 64-dimensional subspace for the data.
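(One more hedged way to quantify this, using the eigenvalues already computed in part (d): essentially all of the variance is captured by the first 64 principal components.)

    % Hedged aside: fraction of the total variance carried by the first 64 PCs,
    % using the sorted covariance eigenvalues from part (d).
    fprintf('Fraction of variance in the first 64 PCs: %.6f\n', ...
            sum(vals(1:64)) / sum(vals));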

(j) Now compare the principal components found via svd and princomp. Plot the difference of the two matrices. Then compare the expansion coefficients: in the SVD version, these are given by the product U*S. Compute the difference of the two matrices. Comment on the results.

    subplot(1,2,1); imagesc(sv - pv);     colormap gray; colorbar; title('Difference in vectors');
    subplot(1,2,2); imagesc(U*S - score); colormap gray; colorbar; title('Difference in weights');

[Figure: (Left) Difference of the vectors (exactly zero). (Right) Difference of the weights (zero to within numerical precision).]

NOTE: You should have found roughly equivalent results from all of the methods, with the princomp and svd methods being identical; this is because princomp uses svd internally. The SVD method is generally viewed as being numerically superior to the eigendecomposition-of-the-covariance-matrix approach. In what follows, just use the built-in princomp command.

(k) Now you're going to turn your attention to real, rather than synthetic, data. Download the spectra.mat datafile from the course website. This data represents the optical spectra of 200 compounds measured in 1300 different spectral channels. Perform principal component analysis of the data to try to infer the underlying dimensionality of the data. As is common with real data, the data is only approximately low-dimensional (to a greater degree than my synthetic data), so there is no clear-cut answer. Use the tools at your disposal and justify your answer. (Pro tip: I often use the cumsum function in part of my analysis.)

We'll again plot the mean of the absolute value of the elements in each column of score (repeating our approach with princomp from above).
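Here is a minimal sketch of the corresponding call (hedged: the variable name spectra inside spectra.mat is an assumption, with one compound per row):

    % Hedged sketch -- the variable name 'spectra' is an assumption; rows are
    % taken to be the individual compound spectra (observations).
    load('spectra.mat');
    [pv score] = princomp(spectra);
    subplot(1,2,1); plot(mean(abs(score)));     title('mean(abs(scores)) (linear)');
    subplot(1,2,2); semilogy(mean(abs(score))); title('mean(abs(scores)) (semilog)');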

[Figure: mean(abs(score)) for the spectral data on linear (left) and semilog (right) axes.]

Wow! First of all, the values drop much faster. The linear plot is all but useless. Furthermore, in the semilog plot we don't see any clear jump in magnitude, just a continual decrease. This is what I meant about there not being a clear-cut answer. In these kinds of cases, it's sometimes useful to consider the cumulative sum of the values and see where that stops growing rapidly (it turns the value of the original function into the local slope of the cumulative sum, which sometimes makes it easier to see where things really change). Here's the code and plot:

    cs = cumsum(mean(abs(score)));
    subplot(1,2,1); plot(cs(1:200));     title('cumulative sum (linear)');
    subplot(1,2,2); semilogy(cs(1:200)); title('cumulative sum (semilog)');

[Figure: cumulative sum of mean(abs(score)) on linear (left) and semilog (right) axes; a data cursor marks the knee near index 21.]

I'm limiting the range in x to just the first 200 values (as we only provided that many data vectors in the first place). We see that there is a clear knee in both the linear and semilog plots around 21. That's the point where the growth in the cumulative sum really starts changing. So we'll conclude that the spectral data approximately resides in a 21-dimensional space.
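(Another common, hedged use of cumsum here is on the variance itself: pick the smallest number of components that explains a fixed fraction, say 95%, of the total variance, using the latent output of princomp.)

    % Hedged alternative: smallest k whose PCs explain 95% of the total variance.
    % 'latent' holds the PC variances (covariance eigenvalues); 'spectra' is the
    % assumed variable name from the sketch above.
    [pv score latent] = princomp(spectra);
    fracVar = cumsum(latent) / sum(latent);
    k95 = find(fracVar >= 0.95, 1);
    fprintf('Components needed for 95%% of the variance: %d\n', k95);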

2. Please estimate how much productive time you spent completing this assignment (watching television with the assignment in your lap does not count as productive time!).
