Low-Dimensional Signal Models in Compressive Sensing


University of Colorado, Boulder, CU Scholar
Electrical, Computer & Energy Engineering Graduate Theses & Dissertations, Spring 2013

Low-Dimensional Signal Models in Compressive Sensing
Hanchao Qi, University of Colorado at Boulder

Recommended Citation: Qi, Hanchao, "Low-Dimensional Signal Models in Compressive Sensing" (2013). Electrical, Computer & Energy Engineering Graduate Theses & Dissertations.

Low-Dimensional Signal Models in Compressive Sensing

by

Hanchao Qi

B.S., University of Science and Technology of China, 2008
M.S., University of Colorado at Boulder, 2010

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Electrical, Computer, and Energy Engineering, 2013

This thesis entitled "Low-Dimensional Signal Models in Compressive Sensing," written by Hanchao Qi, has been approved for the Department of Electrical, Computer, and Energy Engineering.

Prof. Shannon M. Hughes

Prof. Youjian Liu

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Qi, Hanchao (Ph.D., Electrical Engineering)

Low-Dimensional Signal Models in Compressive Sensing

Thesis directed by Prof. Shannon M. Hughes

In today's world, we often face an explosion of data that can be difficult to handle. Signal models help make this data tractable, and thus play an important role in designing efficient algorithms for acquiring, storing, and analyzing signals. However, choosing the right model is critical. Poorly chosen models may fail to capture the underlying structure of signals, making it hard to achieve satisfactory results in signal processing tasks. Thus, the most accurate and concise signal models must be used.

Many signals can be expressed as a linear combination of a few elements of some dictionary, and this is the motivation behind the emerging field of compressive sensing. Compressive sensing leverages this signal model to enable us to perform signal processing tasks without full knowledge of the data. However, this is only one possible model for signals, and many signals could in fact be more accurately and concisely described by other models. In particular, in this thesis, we will look at two such models, and show how these two models can be used to allow signal reconstruction and analysis from partial knowledge of the data.

First, we consider signals that belong to low-dimensional nonlinear manifolds, i.e. that can be represented as a continuous nonlinear function of a few parameters. We show how to apply the kernel trick, popular in machine learning, to adapt compressive sensing to this type of sparsity. Our approach provides computationally-efficient, improved signal reconstruction from partial measurements when the signal is accurately described by such a manifold model.

We then consider collections of signals that together have strong principal components, so that each individual signal may be modeled as a linear combination of these few shared principal components. We focus on the problem of finding the center and principal components of these high-dimensional signals using only their measurements. We show experimentally and theoretically that our approach will generally return the correct center and principal components for a large enough collection of signals. The recovered principal components also allow performance gains in other signal processing tasks.

Dedication

To my family.

Acknowledgements

I am deeply grateful to my advisor, Prof. Hughes, for her constant guidance and support during my Ph.D. study. This thesis would not have been possible without the countless hours she devoted to guiding me in research. I am very grateful to my committee members: Prof. Chen, Prof. Doostan, Prof. Hughes, Prof. Liu, and Prof. Meyer. I would especially like to thank Prof. Liu for serving as my thesis reader and for his valuable suggestions and comments. Many thanks to my fellow students in the DSP and Communications Lab. I have benefited greatly from discussions with them, and it is my honor and pleasure to work with them. Finally, I would like to thank my family. No words can express my deepest gratitude for their love and support.

Contents

Chapter 1  Introduction
    Low-Dimensional Signal Models
    Contributions
        Outline of the Thesis
        Potential Applications

Chapter 2  Background and Review of Related Work
    Compressive Sensing
        Sparsity
        Measurement Matrix
        Restricted Isometry Property and Incoherence
        Recovery Algorithms
        Structured Sparsity Models in Compressive Sensing
        Manifold Models in Compressive Sensing: Prior Work
        Compressive Principal Component Recovery: Prior Work
        Applications
    Background Material Needed for Later Chapters
        Principal Component Analysis
        Kernel Principal Component Analysis

Chapter 3  Kernel Trick Compressive Sensing
    Signal Recovery from Compressive Sensing Measurements Using the Kernel Trick
        Problem Set-up
        Signal Recovery in Feature Space
        Preimage Methods
        Algorithm
    Experimental Results
        Datasets
        Results
        Handling Large Scale Data
        Notes on the Choice of Parameters
    Error Analysis of Our Estimator
        A Theorem Bounding the Error in Feature Space
        Theoretical Verification
    Conclusions

Chapter 4  Compressive Principal Component Recovery via PCA on Random Projections
    Notations and Assumptions
        Projections and Measurements
    Recovery of Center via PCA on Random Projections
        Convergence of Center Estimator
        Iteration to Improve Results
        A Simpler Version of the Center Estimator
    Recovery of Principal Components via PCA on Random Projections
        Intuition behind Principal Component Recovery
        Convergence of Principal Component Estimator
        Using Magnitude of Eigenvalues to Determine Dimension
        Further Improving the Principal Components Estimation
        Algorithm
    Experimental Results
        Synthetic Example and Effects of Various Parameters
        Synthetic Example: Comparison with Previous Methods
        Real-World Data: Comparison with Previous Methods
        The Case of Random Bernoulli Measurements
        Application to Hyperspectral Images
            Image Reconstruction
            Image Source Separation
    Proofs of Theoretical Results
        Proofs of Lemmas for Theorem
        Proofs of Theorem 2 and 3 for Convergence of Center Estimator
        Proof of Lemma for Theorem
        Proof of Theorem 4 for Convergence of Principal Component Estimator
    Conclusions

Chapter 5  Bounds on the Convergence Rate for our PCA Estimators
    Convergence Rate of the Center Estimator
    Convergence Rate of the Covariance Matrix Estimator
    Matrix Perturbation and Convergence Rate of the Principal Component Estimator
        Review of Matrix Perturbation
        Convergence Rate of Principal Component Estimator
    Theoretical Verification

Chapter 6  Conclusions and Future Work
    Bernoulli Random Measurements
    Kernel Choices
    Compressive Kernel PCA
    Potential Applications of Our Work

Bibliography

Tables

Table 3.1  Time Consumption Comparison

Figures

1.1 The Information Pipeline, consisting of three stages: data acquisition, data storage, and data analysis.
1.2 A typical wavelet transform. In (b), each pixel shown represents the magnitude of one wavelet coefficient. Here, the value of each black pixel is zero or near zero, and thus there are very few significant wavelet coefficients.
1.3 (a) An original phantom MRI image. (b) The same phantom MRI image reconstructed using compressive sensing techniques from 30% as many measurements as there are pixels in the original image. We see that an almost perfect reconstruction is achieved from 30% as much data.
1.4 Video frame example from [158]. The middle row shows the common background of these video frames. The last row shows a few moving objects in each frame. Thus these video frames have a low-rank plus sparse structure that can be captured by PCA.
1.5 Hyperspectral image example from [13]. Hyperspectral images are images of the same objects at different wavelengths (0.4 µm to 2.5 µm in this example). PCA can be used to model these signals and find a shared low-dimensional model for them because they are very likely to share a common basis.
1.6 (a) A set of images that have 3 underlying degrees of freedom. (b) Each image is now plotted at the location in 2-dimensional space that a kernel PCA (keeping 2 principal components) would map it to. The two underlying degrees of freedom of pose angle are nicely captured.
1.7 A plot of MSE between an original sculpture faces image and its best d-term approximation in wavelets vs. MSE between the image and its best d-term kernel PCA approximation. The image is better approximated in kernel PCA components.
2.1 Unit spheres of the $\ell_q$ ball for $q = 2, 1, 1/2$ in $\mathbb{R}^2$. Small q, such as $q = 1$ or $1/2$, gives points along the axes where sparse solutions lie, and thus the result of the optimization problem using them tends to be sparse.
2.2 Single-pixel camera architecture [142, 156].
2.3 Reconstruction results from the Rice single-pixel camera. Left: Original image. Middle: Reconstructed image with 10% measurements. Right: Reconstructed image with 20% measurements [50, 156].
2.4 (a) Data representation with two PCs. (b) Projection of data onto the first PC. The coefficients of the data with respect to the first PC give a concise approximate representation of the original data.
2.5 Example of kernel PCA.
3.1 Image samples for all three datasets we used in the experimental verification.
3.2 Experimental results: a visual results comparison (left) and plots (in log-log scale) of MSE vs. number of measurements (right) for each of the sculpture face (top), Frey face (middle), and handwritten digits (bottom) datasets. Our method (KTCS) is compared with $\ell_1$-Minimization (L1-MIN), Total Variation Minimization (TVM), and Nonparametric Mixture of Factor Analyzers (NMFA).
3.3 The divide-and-conquer procedure for a large dataset.
3.4 MSE of our recovered image as a function of m and d for the sculpture face dataset for (a) fixed m, varying d, and (b) fixed d, varying m.
4.1 Projection of the center onto a random measurement vector, and then projection of this back onto the original center direction.
4.2 Plot of normalized distance between the estimated center and true center with increasing number of iterations. There are n = 2000 points in $\mathbb{R}^{100}$ with 5 significant PCs with $(\sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5) = (20, 15, 10, 8, 6)$, at a fixed measurement ratio m/p.
4.3 Randomly projecting the data preserves the principal component. In each of the three figures, there are n = 3000 points uniformly distributed on a line in $\mathbb{R}^3$, $\mathbb{R}^{10}$, and $\mathbb{R}^{50}$ respectively. We randomly project each point onto a two-dimensional random subspace, and view two dimensions of the result (the original principal component's and one other). Blue stars are the original points and red circles are the projected points. We observe that the original principal component remains intact even for a very small ratio m/p.
4.4 Plot of MSE between the estimated center and the true center for varying n and m/p.
4.5 Plots of normalized inner product magnitude between estimated PCs and the corresponding true PCs for (a) varying measurement ratios m/p for n = 2000, (b) varying number of data points n when m/p = 0.2, and (c) varying noise ratio $\epsilon/\sigma$.
4.6 Plots of normalized inner product magnitude between the estimated first PC and the corresponding true first PC for (a) fixed m, increasing p, and (b) fixed m/p, increasing p. We see that as the dimension of the space p increases, if the number of measurements m is fixed and the measurement ratio m/p is thereby decreasing with increasing p, then the performance deteriorates with increasing p. On the other hand, if the measurement ratio m/p is fixed, then the performance actually improves with increasing p.
4.7 Comparison of eigenvalues of the randomly projected data's covariance matrix with the true eigenvalues of the original data's covariance matrix when m/p = 0.3. The eigenvalues of the randomly projected data can be used to determine the dimensionality of the original data.
4.8 (a)-(e) Plot of normalized inner product between the true and estimated PCs for each of the first 5 PCs for our approach vs. CP-PCA on the synthetic dataset. Each graph is a function of the number of partitions, corresponding to different measurement matrices, that we have divided the data samples into. (f) Running time of the two algorithms in seconds as a function of the number of partitions.
4.9 Normalized MSE between the estimated center and the true center for different n and m/p for the Lankershim Boulevard data.
4.10 The normalized inner product magnitude between the first 5 estimated principal components and the true first 5 principal components for the Lankershim Boulevard data. (a) Normal PCA on the randomly projected data vs. (b) Compressive-Projection Principal Components Analysis (CP-PCA) [58].
4.11 Lankershim Boulevard video visual results: (a) The original background image. (b) The estimated background image when the measurement ratio m/p = 0.1. (c) The true first PC, which appears to be a traffic trend along the roadway. (d) The estimated first PC for our approach at the same measurement ratio.
4.12 (a) Plot of normalized error measure between the estimated center and the true center using Bernoulli measurements for varying n and m/p. (b,c,d) Plots of normalized inner product magnitude between estimated PCs and the corresponding true PCs using Bernoulli measurements for (b) varying measurement ratios m/p for n = 2000, (c) varying number of data points n when m/p = 0.2, and (d) varying noise ratio $\epsilon/\sigma$.
4.13 Plots of average SNR of reconstructed hyperspectral images for various measurement ratios.
4.14 Source separation using independent component analysis. (a) ICA on original images. (b) ICA on reconstructed images using our approach.
5.1 Comparison of our bound with empirical values in probability. The X-axis represents the empirical probability that the normalized center estimation error exceeds $\eta$, and the Y-axis represents the corresponding theoretical bound; the red line Y = X is the reference line for easy comparison. We use synthetic data with p = 100, n = 1000, and m/p = 0.1 in 1000 trials. Based on experiments, we do not see significant variations in the results when we change the parameters (m, n, p) slightly.
5.2 Comparison of our bound with empirical values in probability. The X-axis represents the empirical probability that the normalized covariance matrix estimation error (in Frobenius norm) exceeds $\eta$, and the Y-axis represents the corresponding theoretical bound; the red line Y = X is the reference line for easy comparison. We use synthetic data with p = 100, n = 1000, and m/p = 0.1 in 1000 trials. Based on experiments, we do not see significant variations in the results when we change the parameters (m, n, p) slightly.
Norm Squared Measurements for the Radial Basis Kernel.

Chapter 1

Introduction

Today we live in a world of rapid data explosion, with the amount of data generated each day increasing very rapidly. According to a recent report by IBM [42], 2.5 quintillion bytes of data are created each day, and this exceeds the additional storage capacity created each day. Moreover, more than 90% of the data in the world has been generated in the last two years. This amount of data is far too much for us to process with current methods, so more efficient ways to deal with data are becoming more desirable than ever.

We consider general data processing procedures to have three main stages: data acquisition, data storage, and data analysis. Together these three stages form the information pipeline shown in Figure 1.1. The data explosion places strain on each stage of this pipeline. For the data storage stage, there is a long history of research [36, 101] exploiting low-dimensional signal models for the data to develop efficient data compression algorithms that reduce the strain on storage resources. For example, many wavelet-based algorithms compress or store time series, images, or video using the fact that many wavelet coefficients of a typical real-world image or signal have very low energy. A typical wavelet transform of an image is shown in Figure 1.2, from which we can see that only a few wavelet coefficients are significant. Throwing away the remaining small coefficients stores the image compactly without introducing significant error. However, reducing strain on the other two stages of this information pipeline has been explored less well.
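Before turning to data acquisition, we note that the keep-the-largest-coefficients idea just described is easy to prototype. The sketch below computes a 2-D wavelet transform of an image, zeroes all but the k largest-magnitude coefficients, and inverts the transform. It is a minimal illustration only: it assumes NumPy and the PyWavelets package are available, and the placeholder image, wavelet choice, and retention fraction are arbitrary, not settings used elsewhere in this thesis.

```python
import numpy as np
import pywt  # PyWavelets

def k_term_wavelet_approx(image, k, wavelet="db4"):
    """Keep only the k largest-magnitude wavelet coefficients of an image."""
    coeffs = pywt.wavedec2(image, wavelet)             # multilevel 2-D wavelet transform
    arr, slices = pywt.coeffs_to_array(coeffs)         # flatten the coefficient pyramid
    thresh = np.sort(np.abs(arr).ravel())[-k]          # k-th largest magnitude
    arr_k = np.where(np.abs(arr) >= thresh, arr, 0.0)  # zero out all smaller coefficients
    coeffs_k = pywt.array_to_coeffs(arr_k, slices, output_format="wavedec2")
    return pywt.waverec2(coeffs_k, wavelet)            # invert the transform

# Keep 5% of the coefficients; real images incur little error under this truncation.
image = np.random.rand(256, 256)   # placeholder array standing in for a real image
approx = k_term_wavelet_approx(image, k=int(0.05 * image.size))
rel_err = np.linalg.norm(image - approx) / np.linalg.norm(image)
```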

Figure 1.1: The Information Pipeline, consisting of three stages: data acquisition, data storage, and data analysis.

In this thesis, we will focus our efforts on reducing strain on the data acquisition stage of the pipeline (although we will occasionally address data acquisition and processing jointly). The traditional way to acquire data is to follow the Shannon-Nyquist theorem, sampling the signal at a rate at least twice its maximum frequency, and then to compress the data by immediately discarding a large number of wavelet/Fourier coefficients. This approach is inefficient and wastes resources. We might thus wonder whether there is a smarter way of acquiring the compressed signal directly instead. The solution is compressive sensing [5, 25, 45, 22, 20, 24, 148, 37], a novel theory for more efficient data acquisition. It uses new ideas and insights to build new sensing schemes and to reduce the strain on the data acquisition stage. A simple example in Figure 1.3 shows the power of compressive sensing in data acquisition.

The sparsity property of signals lies at the core of compressive sensing theory. Stated simply, it says that many signals of interest can be expressed as a linear combination of just a few elements of some dictionary. In other words, many signals of interest, even though they are high-dimensional, tend to be described well by a low-dimensional model.

(a) Original cameraman image (b) Wavelet coefficients
Figure 1.2: A typical wavelet transform. In (b), each pixel shown represents the magnitude of one wavelet coefficient. Here, the value of each black pixel is zero or near zero, and thus there are very few significant wavelet coefficients.

Compressive sensing takes advantage of this low-dimensionality to allow data acquisition from fewer measurements. For example, it asserts that this sparsity property is sufficient to perfectly recover the signal from far fewer samples than Shannon-Nyquist traditionally requires.

In this thesis, we will examine whether other low-dimensional models besides sparsity can be used effectively for compressive sensing. First, we hypothesize that the more accurately and concisely a model can describe a given signal, the fewer measurements we will need for its acquisition. Our priority is thus to find the lowest-dimensional models we can for signals of interest. We will start by exploring some of these in the following section.

1.1 Low-Dimensional Signal Models

In traditional compressive sensing, we say a signal is sparse if it can be approximated with little distortion as a linear combination of very few elements of a certain basis; we say this type of signal is sparse in that basis. Typically, the signal is assumed to be sparse in the Fourier basis, wavelet basis, or curvelet tight frame. These are non-adaptive dictionaries that do not take the special characteristics and structures of the signals of interest into consideration.

(a) Original phantom image (b) Reconstructed phantom image
Figure 1.3: (a) An original phantom MRI image. (b) The same phantom MRI image reconstructed using compressive sensing techniques from 30% as many measurements as there are pixels in the original image. We see that an almost perfect reconstruction is achieved from 30% as much data.

Thus, these non-adaptive bases usually do not provide the sparsest representation for a collection of signals of interest. However, signal-adaptive bases for collections of signals can often provide much sparser representations by accounting for the different characteristics of different signals. Because Principal Component Analysis (PCA) [82] can find a signal-adaptive basis in which a collection of signals of interest can all be sparsely represented, we can use PCA to model these signals efficiently. For example, the video frames [158] in Figure 1.4 can be modeled well by PCA, treating the static background as the center and the few moving people as a linear combination of principal components. Another example is the hyperspectral images in Figure 1.5, which contain images of the same objects imaged at many different spectral wavelengths. Because the same objects appear in every image, these spectral images are likely to share a common basis that can be captured by PCA.

Recent results [113, 27, 47, 155] indicate that some other classes of signals typically lie close to a low-dimensional nonlinear manifold, as opposed to a collection of linear subspaces as the traditional sparsity model would suggest. These types of signals, which we sometimes call nonlinearly k-sparse signals, can be represented as a nonlinear function of up to k underlying parameters.

Figure 1.4: Video frame example from [158]. The middle row shows the common background of these video frames. The last row shows a few moving objects in each frame. Thus these video frames have a low-rank plus sparse structure that can be captured by PCA.

In this situation, kernel PCA can be used to learn a model for the low-dimensional manifold on which the data sits. For example, in the synthetic sculpture faces dataset of Figure 1.6, each face image is a highly nonlinear, but deterministic and continuous, function of three underlying variables: two pose angles and one lighting angle. Each image thus lies along an underlying 3-dimensional manifold in the high-dimensional pixel space. A kernel PCA, performed with a well-chosen kernel function, is able to pick out two of these degrees of freedom as the first two dimensions chosen in kernel PCA. Kernel PCA can thus learn this type of nonlinear sparsity in the dataset. We could represent each image fairly accurately knowing only its coordinates in the two-dimensional representation provided by kernel PCA.

We thus see that for a manifold-modeled class of signals, we may be able to build a better approximation of an image knowing its first d coordinates in a nonlinearly sparse representation such as kernel PCA than we can knowing its largest d Fourier, wavelet, or curvelet coefficients. Figure 1.7 shows a comparison of the mean-squared error (MSE) for an individual image of the sculpture faces dataset when approximated from d kernel PCA components vs. d wavelet coefficients.

The MSE decays much faster for the kernel PCA components, showing that the image is more efficiently represented using the manifold model than using the linear sparsity model.

We see that many types of signals can be approximated more efficiently using other low-dimensional signal models, such as PCA and kernel PCA, than using the traditional compressive sensing sparsity model. Thus, in this thesis, we will attempt to build versions of compressive sensing that use underlying PCA and kernel PCA low-dimensional signal models, instead of the sparsity model, to reduce the number of measurements needed for performing signal recovery and other signal processing tasks.

1.2 Contributions

The main contribution of this thesis is to create new methods to acquire images, videos, and other types of data from very few measurements based on these low-dimensional models that describe many important classes of signals. In this thesis, we will not only show promising experimental results for our approaches, but also provide theoretical analysis to bound the errors. In the end, this thesis will allow us to achieve similar signal reconstruction results from fewer random measurements compared to traditional methods and/or greatly reduce overall computation time in comparison to other recently developed methods. We also foresee that various signal processing tasks, such as signal classification and recognition, can be performed based on our approaches.

Outline of the Thesis

The organization of this thesis is as follows. In Chapter 2, we start with a brief overview of the field of compressive sensing. We then discuss some recently-developed techniques using low-dimensional signal models in compressive sensing that are related to our work. Finally, we will introduce two low-dimensional signal models, PCA and kernel PCA, that will be important in our work.

In Chapter 3, we consider the problem of recovering nonlinearly k-sparse signals from compressive sensing measurements. We show how to apply the kernel trick to adapt the usual compressive sensing paradigm of reconstructing a linearly sparse signal from a linear set of measurements to the case of reconstructing a nonlinearly k-sparse signal from either nonlinear or linear measurements. Experimentally, our algorithm can accurately recover these nonlinearly k-sparse signals from dramatically fewer measurements, sometimes an order of magnitude fewer, than needed by traditional compressive sensing techniques (e.g. $\ell_1$-minimization, total variation minimization) under the assumption of sparsity in an orthonormal basis (e.g. wavelets). We also show that our method compares favorably with other more recently-developed manifold-based compressive sensing methods, producing similar recovery results in 1-2 orders of magnitude less computation time. Finally, we provide a bound on the error of our recovered signals to theoretically explain the success of our approach in signal recovery.

In Chapter 4, we work with collections of signals that are each linearly sparse, but that have common shared structure. For such collections of signals, PCA can be used to find a common basis in which these signals can all be sparsely represented. We might, for example, use such a PCA-discovered basis for compression of the dataset, storing only the coefficients of each signal in this basis. However, if instead of acquiring the full dataset and then compressing using the PCA basis, we wish to employ a more efficient compressive sensing data acquisition scheme, then we must determine the center and principal components of the original data from only relatively few measurements of each sample. To achieve this aim, we propose an approach to learn the center and principal components from only compressive sensing measurements. We show that when the usual PCA algorithm is instead applied to low-dimensional random projections (e.g. from Gaussian random measurements) of each data sample, it will often return the same center (scaled) and principal components as it would for the original dataset. More precisely, we show that the center of the low-dimensional random projections of the data converges to the true center of the original data (up to a known

scaling factor) almost surely as the number of data samples increases. We then show that the top d eigenvectors of the randomly projected data's covariance matrix converge to the true d principal components of the original data as the number of data samples increases. Moreover, both of the above conclusions are true regardless of how few dimensions we use for our random projections (i.e. how few compressive sensing Gaussian random measurements we take of each data sample). Experimentally, we find that for both synthetic and real-world examples, including video and hyperspectral imaging data, normal PCA on low-dimensional random projections of the data recovers the center and the principal components of the original data very well. In fact, the principal components recovered using normal PCA on the randomly projected data are significantly more accurate than those returned by other algorithms previously designed for this task, such as Compressive-Projection Principal Component Analysis (CP-PCA) [58]. We further show that knowledge of the principal components gained from a collection of data can then be used to improve reconstruction of each individual data example and to aid in other signal analysis tasks such as source separation.

In addition to the theoretical proofs of the almost sure convergence of both the center and principal component estimators, we provide theorems in Chapter 5 showing how quickly the center estimator converges to the true center with respect to the number of points n and the measurement ratio m/p. We also show the convergence rate of the covariance matrix estimator and of the principal component estimator with respect to the same quantities.

To conclude this thesis, in Chapter 6, we summarize our contributions and discuss several open problems.

Potential Applications

We hope that our work will be useful in applications such as image and video processing, hyperspectral imaging, and medical imaging. For example, in Magnetic Resonance Imaging (MRI), it can cost thousands of dollars to acquire even a single image, and the acquisition requires the

patient to hold still for an uncomfortably long period of time. Our work could be applied to help reduce the number of measurements needed for MRI by a large amount.

Another application example is hyperspectral imaging [67]. Hyperspectral images provide rich information about the subject being imaged and have been widely used in application areas such as agriculture [56], mineralogy [14], and environmental studies [135]. They are usually taken by satellites or remote sensors, which may have very little power and limited computation capabilities. In this situation, the resources available for taking measurements are severely limited. Because of the common features shared across wavelengths, the different spectral images have a shared basis in which they may be sparsely and efficiently represented. Thus, our work could help reduce the initial samples the remote sensors need to take, shifting the computational burden to the more powerful base stations while keeping the same rich information as the original hyperspectral images. Our methods may also permit source separation (e.g. of water, roads, various minerals, etc.) in hyperspectral imaging from a much smaller number of measurements. Improved reconstruction of successive frames of real-world video, which also have similar shared structure between frames, is another potential application of our work that we examine. Our work might also be applied in areas such as neuroscience, biology, and optics.

Figure 1.5: Hyperspectral image example from [13]. Hyperspectral images are images of the same objects at different wavelengths (0.4 µm to 2.5 µm in this example). PCA can be used to model these signals and find a shared low-dimensional model for them because they are very likely to share a common basis.

(a) Image Samples (b) Arranged by KPCA
Figure 1.6: (a) A set of images that have 3 underlying degrees of freedom. (b) Each image is now plotted at the location in 2-dimensional space that a kernel PCA (keeping 2 principal components) would map it to. The two underlying degrees of freedom of pose angle are nicely captured.

Figure 1.7: A plot of MSE between an original sculpture faces image and its best d-term approximation in wavelets vs. MSE between the image and its best d-term kernel PCA approximation. The image is better approximated in kernel PCA components.

Chapter 2

Background and Review of Related Work

2.1 Compressive Sensing

As the amount of data to be acquired and stored in many applications increases rapidly, efficient signal acquisition methods are becoming increasingly important. While the traditional, well-known Shannon-Nyquist sampling theorem tells us that the sampling rate must be at least twice the maximum frequency of the signal for perfect reconstruction, this typically results in much more data being captured than is actually used.

In 2004, Candes, Tao, and Romberg [22, 20, 24] and Donoho [45] independently showed that we can achieve perfect signal reconstruction with far fewer samples than Shannon-Nyquist requires, if the signal is sparse in some basis and the measurements are taken appropriately. Their results provide a new way to sense/acquire signals in a compressed form, thus leading to the terms compressive sensing, compressed sensing, or compressive sampling for this type of data acquisition strategy. We will review the literature of compressive sensing in the next few subsections.

Sparsity

Compressive sensing relies on the important fact that many types of signals can be approximated with little distortion as a linear combination of a few elements from a certain basis [5]. This is sometimes referred to as the sparsity property of signals. To be precise, we say that a vector $c \in \mathbb{R}^p$ is k-sparse if it has at most k nonzero entries

(i.e. $\|c\|_0 \le k$). We then say that a signal $x \in \mathbb{R}^p$ is k-sparse with respect to some basis $\Psi \in \mathbb{R}^{p \times p}$ (with the columns of $\Psi$ as the basis elements) if there exists a k-sparse vector c such that

$$x = \Psi c \qquad (2.1)$$

Practically, most real-world signals are not exactly sparse but approximately sparse with respect to a basis $\Psi$, and this leads to the representation

$$\|x - \Psi c\| \le \epsilon \qquad (2.2)$$

for some sparse c and small $\epsilon > 0$. For example, real-world images are typically almost sparse in wavelets, and thus we can usually represent them well with a few large wavelet coefficients. The sparsity property of signals has been explored for a long time in different signal processing tasks, including signal compression, signal denoising, and image deblurring [101, 36, 43]. This property makes the recovery of signals from just a few measurements possible, and the sparsity level k usually determines how many measurements are needed to uniquely recover the signal.

Measurement Matrix

Compressive sensing also relies critically on the way the measurements are taken. The measurements in compressive sensing are not uniformly-spaced samples as in the traditional Shannon-Nyquist sampling theorem. Instead, we use linear inner products between the signal and specially-chosen measurement vectors as the measurements. We can then represent each measurement $y_i$ as the inner product between the data point x and a measurement vector $\phi_i$,

$$y_i = \langle \phi_i, x \rangle \qquad (2.3)$$
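To make the notation of Eqs. 2.1-2.3 concrete, the sketch below builds a k-sparse coefficient vector c, forms the signal $x = \Psi c$ using (for illustration only) an orthonormal inverse-DCT matrix as $\Psi$, and collects m inner-product measurements with random Gaussian vectors. The dimensions and the choice of basis are arbitrary assumptions for the example, not choices made in this thesis.

```python
import numpy as np
from scipy.fft import idct

rng = np.random.default_rng(0)
p, k, m = 256, 5, 60                          # ambient dimension, sparsity, measurements

# A k-sparse coefficient vector c and the signal x = Psi c, where the columns of
# Psi (here, an orthonormal inverse-DCT matrix) are the basis elements.
c = np.zeros(p)
c[rng.choice(p, size=k, replace=False)] = rng.standard_normal(k)
Psi = idct(np.eye(p), norm="ortho", axis=0)
x = Psi @ c

# m linear measurements y_i = <phi_i, x>, with the phi_i as rows of a Gaussian Phi.
Phi = rng.standard_normal((m, p))
y = Phi @ x                                   # equivalently, y = Phi Psi c
```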

If we take m (m < p) measurements, then from Eqs. 2.1 and 2.3, we have

$$y = \Phi \Psi c \qquad (2.4)$$

where y is an $m \times 1$ vector and $\Phi$ is an $m \times p$ matrix whose i-th row is the measurement vector $\phi_i^T$. Because the number of measurements m is less than the dimension p of each data sample, Eq. 2.4 is an underdetermined system and there are infinitely many solutions for c. We need further assumptions to ensure that an accurate estimate of c can be reliably recovered from this underdetermined system. We proceed by assuming that c is k-sparse and that the sensing matrix A, the product of the measurement matrix $\Phi$ and the basis matrix $\Psi$, satisfies the restricted isometry property [22, 24], which will be presented in the next section.

Restricted Isometry Property and Incoherence

The sensing matrix $A = \Phi\Psi$ defined above needs to satisfy the restricted isometry property (RIP) to ensure robust and stable signal recovery. This property was first presented by Candes, Tao, and Romberg [22, 24] and is useful in analyzing the general robustness of a compressive sensing measurement scheme. We say an $m \times p$ matrix A satisfies the RIP of order k if there exists a constant $\delta_k \in (0, 1)$ and a rescaling constant C such that, for all k-sparse signals x,

$$C(1 - \delta_k) \le \frac{\|Ax\|_2^2}{\|x\|_2^2} \le C(1 + \delta_k) \qquad (2.5)$$

Many types of random matrices have been shown to satisfy the RIP with small constants [24, 106, 8]. These include:

Random Gaussian: formed by drawing each entry i.i.d. from the Gaussian distribution N(0, 1).

Random Bernoulli: formed by drawing each entry i.i.d. from the Bernoulli distribution taking the value 1 or -1 with equal probability.

Random Fourier: formed by randomly choosing a set of m rows of the discrete Fourier matrix.
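Although, as discussed below, the RIP cannot be efficiently certified for a specific matrix, the norm-preservation behavior of Eq. 2.5 is easy to observe empirically for the Gaussian ensemble above. The following sketch (with arbitrary illustrative dimensions) scales the Gaussian entries by $1/\sqrt{m}$ so that the rescaling constant is $C \approx 1$, and records how far $\|Ax\|_2^2 / \|x\|_2^2$ strays from 1 over many random k-sparse vectors; the largest deviation serves as a rough empirical surrogate for $\delta_k$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, m, trials = 512, 10, 128, 2000

# Gaussian entries scaled by 1/sqrt(m) so that E[||Ax||_2^2] = ||x||_2^2 (C ~ 1).
A = rng.standard_normal((m, p)) / np.sqrt(m)

ratios = []
for _ in range(trials):
    x = np.zeros(p)
    x[rng.choice(p, size=k, replace=False)] = rng.standard_normal(k)   # random k-sparse x
    ratios.append(np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2)

# Largest observed deviation of ||Ax||^2 / ||x||^2 from 1: a crude stand-in for delta_k.
delta_hat = max(1.0 - min(ratios), max(ratios) - 1.0)
```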

For example, the random Gaussian matrix can be shown to have this property with high probability when the number of measurements m is on the order of $k \log(p/k)$. Moreover, the RIP also holds with high probability when $\Psi$ is an arbitrary fixed orthonormal basis and $\Phi$ is randomly generated from certain distributions such as Gaussian and Bernoulli [25, 8].

However, in practice, verifying the RIP for any given matrix is computationally intractable, because we would need to check every $m \times k$ submatrix of A to see whether it satisfies the RIP. This is a combinatorial problem which is NP-hard. In this situation, it is better to use other, easy-to-compute properties of the measurement matrix $\Phi$ and the basis matrix $\Psi$ to guarantee robust and stable signal recovery. Incoherence is one of the most important such properties. Incoherence considers the relationship between the rows of the measurement matrix $\Phi$ and the columns of the basis matrix $\Psi$. We define the coherence between $\Phi$ and $\Psi$ as

$$\mu(\Phi, \Psi) = \max_{i,j} \frac{|\langle \phi_i, \psi_j \rangle|}{\|\phi_i\|_2 \, \|\psi_j\|_2}. \qquad (2.6)$$

We then need $\mu(\Phi, \Psi)$ to be low in order to robustly recover the signal. It has been proved [45, 22, 24] that the coherence between the rows of the measurement matrix $\Phi$ and the columns of the basis matrix $\Psi$ is low with high probability for any fixed signal basis $\Psi$ and random measurement vectors such as random Gaussian or Bernoulli vectors. In [127], a relation between the RIP and the incoherence between the rows of $\Phi$ and the columns of $\Psi$ was established, showing that under certain conditions, low coherence (incoherence) implies the RIP with large probability.
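Unlike the RIP, the coherence in Eq. 2.6 is cheap to compute for any concrete pair $(\Phi, \Psi)$. The sketch below evaluates $\mu(\Phi, \Psi)$ for a random Gaussian $\Phi$ against an orthonormal DCT basis $\Psi$; the matrix sizes and the DCT choice are illustrative assumptions only.

```python
import numpy as np
from scipy.fft import idct

rng = np.random.default_rng(0)
m, p = 64, 256
Phi = rng.standard_normal((m, p))              # rows phi_i are the measurement vectors
Psi = idct(np.eye(p), norm="ortho", axis=0)    # columns psi_j form an orthonormal basis

def coherence(Phi, Psi):
    """mu(Phi, Psi) = max_{i,j} |<phi_i, psi_j>| / (||phi_i||_2 ||psi_j||_2)."""
    inner = Phi @ Psi                                        # (i, j) entry is <phi_i, psi_j>
    row_norms = np.linalg.norm(Phi, axis=1, keepdims=True)   # ||phi_i||_2, shape (m, 1)
    col_norms = np.linalg.norm(Psi, axis=0, keepdims=True)   # ||psi_j||_2, shape (1, p)
    return float(np.max(np.abs(inner) / (row_norms * col_norms)))

mu = coherence(Phi, Psi)   # small with high probability, as discussed above
```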

It is also important to note that some deterministic measurement matrices satisfying the RIP have been proposed [3, 17, 77, 12]. For example, the Chirp Sensing Code [3] uses chirp sequences as the measurement matrix columns to design deterministic measurement matrices. The expander graph method [77, 12] constructs the measurement matrix by first constructing a bipartite expander graph in which any set of k vertices in one partition has as neighbors at least $\alpha k$ vertices in the other partition, for a constant $\alpha$. This guarantees incoherence between the sparsity basis and the measurement basis. It has also been shown that measurement matrices built out of expander graphs satisfy a modified version of the RIP [12]. There are also other methods [44, 128, 17], which we will not detail here, to build deterministic measurement matrices.

Recovery Algorithms

If the signal x is k-sparse with respect to the basis $\Psi$ and we use a random measurement matrix $\Phi$ so that $\Phi\Psi$ satisfies the RIP of appropriate order (see below for details), we can develop efficient algorithms to recover the signal x. We note that, ideally, the recovery of an exactly k-sparse signal x with respect to the basis $\Psi$ can be considered as an $\ell_0$ minimization problem:

$$\min \|c\|_0, \quad \text{s.t. } y = \Phi\Psi c. \qquad (2.7)$$

Because, most of the time, signals are not exactly k-sparse with respect to $\Psi$, we can relax the constraint in Eq. 2.7:

$$\min \|c\|_0, \quad \text{s.t. } \|y - \Phi\Psi c\|_2 \le \epsilon. \qquad (2.8)$$

The $\ell_0$ minimization problem usually guarantees a sparse solution with high probability, and we need only very few measurements to recover a k-sparse signal via $\ell_0$ minimization. In [153], Wakin proved that, by solving the $\ell_0$ minimization problem in Eq. 2.7, 2k random measurements are sufficient to exactly recover x under some assumptions on $\Phi$ and $\Psi$. Unfortunately, however, this is a combinatorial optimization problem which is NP-hard. Instead of solving the NP-hard $\ell_0$ minimization problem, a much simpler $\ell_1$ minimization problem will also almost always give us the sparsest solution from very few measurements, as proved by Donoho [45].

As an illustrative example (see Figure 2.1(b)), we can recover the sparsest solution c by considering the $\ell_1$ minimization problem [45]:

$$\min \|c\|_1, \quad \text{s.t. } y = \Phi\Psi c. \qquad (2.9)$$

Similar to the $\ell_0$ minimization problem, we can relax the constraint in Eq. 2.9 for signals that are not exactly k-sparse with respect to $\Psi$:

$$\min \|c\|_1, \quad \text{s.t. } \|y - \Phi\Psi c\|_2 \le \epsilon. \qquad (2.10)$$

This $\ell_1$ minimization problem is a convex optimization problem with low computational complexity, and thus it is one of the most popular methods for solving compressive sensing problems. In [18], Candes showed that stable recovery is guaranteed for the solution to Eq. 2.10 under certain RIP assumptions. If $A = \Phi\Psi$ satisfies the RIP of order 2k with $\delta_{2k} < \sqrt{2} - 1$, then $c^*$, the solution to Eq. 2.10, satisfies

$$\|c^* - c\|_2 \le C_0 k^{-1/2} \|c - c_k\|_1 + C_1 \epsilon \qquad (2.11)$$

where $C_0$ and $C_1$ are small constants (see [18] for exact expressions), and $c_k$ is defined as the best sparse approximation one could possibly get if the exact locations and magnitudes of the k largest entries of c were known. In particular, if c is exactly k-sparse and $\epsilon = 0$, the recovery is exact.
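Eq. 2.9 can be solved with any linear programming solver via the standard reformulation $\min \mathbf{1}^T t$ subject to $-t \le c \le t$ and $\Phi\Psi c = y$. The sketch below is a minimal, self-contained illustration of this reformulation using SciPy's generic linprog routine on a small synthetic problem; it is not the solver used for the experiments in this thesis, and the dimensions and basis are arbitrary.

```python
import numpy as np
from scipy.fft import idct
from scipy.optimize import linprog

rng = np.random.default_rng(0)
p, k, m = 128, 4, 40

# Ground truth: k-sparse c, sensing matrix A = Phi Psi, measurements y = A c.
Psi = idct(np.eye(p), norm="ortho", axis=0)
c_true = np.zeros(p)
c_true[rng.choice(p, size=k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, p)) @ Psi
y = A @ c_true

# Basis pursuit (Eq. 2.9) as a linear program over z = (c, t):
#   minimize 1^T t   subject to   c - t <= 0,  -c - t <= 0,  A c = y.
I = np.eye(p)
A_ub = np.block([[I, -I], [-I, -I]])
b_ub = np.zeros(2 * p)
A_eq = np.hstack([A, np.zeros((m, p))])
cost = np.concatenate([np.zeros(p), np.ones(p)])
res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * (2 * p))

c_hat = res.x[:p]          # with these dimensions, c_hat matches c_true to high accuracy
x_hat = Psi @ c_hat        # recovered signal
```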

Many other algorithms have also been proposed to solve compressive sensing problems. For example, total variation minimization [29] is another popular method. As we know, most images have sharp changes of pixel intensity at edges, but smooth pixel intensity over most of the image. The total variation minimization method takes advantage of these characteristics of images to promote signal recovery from few measurements. We begin our discussion of this method by defining a quantity known as the total variation (for discrete-space images). To do this, we first introduce the magnitude of the discrete gradient of an image $X \in \mathbb{R}^{p \times p}$, given by

$$\|(\nabla X)_{ij}\|_2 = \begin{cases} \sqrt{(X_{i+1,j} - X_{i,j})^2 + (X_{i,j+1} - X_{i,j})^2} & 1 \le i \le p-1,\ 1 \le j \le p-1 \\ |X_{i,j+1} - X_{i,j}| & i = p,\ 1 \le j \le p-1 \\ |X_{i+1,j} - X_{i,j}| & j = p,\ 1 \le i \le p-1 \\ 0 & i = p,\ j = p \end{cases}$$

We can then define the total variation of a discrete image X as the $\ell_1$ norm of the matrix D with ij-th entry $D_{ij} = \|(\nabla X)_{ij}\|_2$,

$$\|X\|_{TV} = \|D\|_1$$

Finally, the total variation minimization method recovers the image X from measurements $y = \Phi x$, where x is the vector rearrangement of the image X, by solving the following optimization problem:

$$\min \|X\|_{TV}, \quad \text{s.t. } \|y - \Phi x\|_2 \le \epsilon. \qquad (2.12)$$

Because we can think of the total variation of a pixelized image as the $\ell_1$ norm of the magnitudes of the image's gradient vectors at each pixel, we can see that the total variation minimization method actually promotes sparsity in the discrete gradient of the image, so that the recovered image has few places where the gradient is significant. This promotes the sort of smooth-areas-with-sharp-edges structure we discussed above, and hence tends to provide very good signal recovery results for compressive sensing problems.

Another popular algorithm for signal recovery that we will mention is orthogonal matching pursuit [146, 112, 102]. This is a greedy iterative algorithm for finding a sparse c that approximates the measurements y, and by extension, the image x.
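Orthogonal matching pursuit is simple enough to state in a few lines: at each iteration, select the column of $A = \Phi\Psi$ most correlated with the current residual, then re-fit the coefficients on the selected support by least squares. The sketch below is a generic textbook-style implementation of this idea, not the specific variants of [146, 112, 102].

```python
import numpy as np

def omp(A, y, k):
    """Greedy recovery of a k-sparse c with y ~ A c (orthogonal matching pursuit)."""
    p = A.shape[1]
    residual = y.copy()
    support = []
    c = np.zeros(p)
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        correlations = np.abs(A.T @ residual) / np.linalg.norm(A, axis=0)
        if support:
            correlations[support] = 0.0        # never re-select a chosen column
        support.append(int(np.argmax(correlations)))
        # Least-squares fit on the chosen support, then update the residual.
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
    c[support] = coeffs
    return c
```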

Finally, we give a simple intuition for why the $\ell_1$ minimization method works, illustrated in Figure 2.1. Solving the minimization problem in Eq. 2.9 is equivalent to expanding the $\ell_1$ ball in Figure 2.1(b) until it hits the subspace $\{c : y = \Phi\Psi c\}$. Because the ball has points along the axes, it tends to hit a sparse solution. The $\ell_2$ ball in Figure 2.1(a), by contrast, does not hit a sparse solution as we expand it. This geometric intuition also allows us to see that other $\ell_q$ balls for $0 < q < 1$, e.g. the $\ell_{1/2}$ ball, are pointy in the same way the $\ell_1$ ball is, and can also be used in the recovery of sparse signals. Some results on signal reconstruction using the $\ell_q$ quasinorm with $0 < q < 1$ are discussed in [30].

(a) $\ell_2$-ball (b) $\ell_1$-ball (c) $\ell_{1/2}$-ball
Figure 2.1: Unit spheres of the $\ell_q$ ball for $q = 2, 1, 1/2$ in $\mathbb{R}^2$. Small q, such as $q = 1$ or $1/2$, gives points along the axes where sparse solutions lie, and thus the result of the optimization problem using them tends to be sparse.

Structured Sparsity Models in Compressive Sensing

Besides the standard sparsity model in compressive sensing, some structured sparsity models have also been proposed to improve signal recovery results. For example, the block sparsity model [60, 53] considers that sparsity patterns may be shared across multiple related signals. Suppose we have several successive frames of a video. Because the frames are very similar to each other, each frame will likely have a similar sparsity pattern, i.e. the same positions for its nonzero wavelet coefficients. The block sparsity model thus assumes that the support of the coefficient vector $c_i$ corresponding to each frame $x^{(i)}$ is the same for all i. The block sparsity model can lead to improved signal recovery results, since there are then fewer parameters overall to estimate from the measurements. It also leads to

a simple recovery algorithm in which all frames are recovered at the same time.

Another model is the structured sparsity model [6, 51, 72]. Because of the multiscale property of the wavelet transform, wavelet coefficients at the same location and orientation at different scales are likely to behave similarly; i.e., if some wavelet coefficient is large (or small), the wavelet coefficient corresponding to the same location and orientation at the next scale is also likely to be large (or small). Thus, we expect the nonzero wavelet coefficients to form a connected subset of a tree. The structured sparsity model takes advantage of this knowledge by imposing restrictions on the possible ways the nonzero elements of c can arrange themselves. This in turn improves the compressive sensing results. Generative probabilistic models that tend to produce sparse vectors, such as Bayesian models [80, 78], wavelet-based Bayesian models [69], and Markov models [52, 28], have also been explored. All these models use the additional structure of the data model to further reduce the number of measurements needed for signal recovery.

Manifold Models in Compressive Sensing: Prior Work

Recent results indicate that some classes of signals typically lie close to a nonlinear low-dimensional manifold, as opposed to lying on a collection of linear subspaces as the traditional sparsity model would suggest. This motivates the search for compressive sensing methods that use an underlying manifold model for the signal rather than a sparsity model. Inspired by this finding, a few researchers have tried to adapt compressive sensing to the case of data that lies close to a nonlinear low-dimensional manifold. G. Peyré [113] examined several specific classes of natural signals and images with explicitly-described underlying manifold models, and demonstrated that the results of compressive sensing for these signal/image classes could be improved by using their underlying manifold models to regularize the compressive sensing inverse problem. Meanwhile, M. Chen et al. [32] chose instead to use a mixture of Gaussians to model the underlying manifold, intuitively fitting a nonlinear manifold with a collection of relatively flat Gaussian pancakes. More specifically, their

model is a nonparametric variant of the mixture of factor analyzers (MFA) graphical model, created by imposing Dirichlet Process and Beta Process priors on the MFA model. Finally, some nice theoretical results about compressive sensing on manifolds have been given by M. Wakin [154], who gives bounds on the error for parameter estimation and signal recovery using compressive measurements of manifold-modeled signals. Wakin also created a multiscale Newton iteration algorithm [153] for estimating from measurements the coordinates of an unknown signal along an explicitly-described manifold.

However, these manifold-based methods present some significant challenges to their use. Explicit mathematical descriptions of an image patch manifold or image manifold, as required by Wakin and Peyré, are generally not available for most image classes of interest, and even when they do exist, they tend to be complex and unwieldy to work with. On the other hand, training a large nonparametric MFA model for the manifold, as in the case of Chen et al., is very computationally intensive, since it involves estimating both the number of Gaussians and all the parameters of each from training samples. Our work in Chapter 3 will address these shortcomings of prior work by creating a new method for manifold-based compressive sensing that is computationally efficient and does not require an explicit model for the manifold.

Compressive Principal Component Recovery: Prior Work

Meanwhile, a couple of prior works have been interested in learning the principal components of a collection of data samples from measurements only, as we will aim to do in Chapter 4. The work most obviously related to our own is J. Fowler's Compressive-Projection Principal Components Analysis (CP-PCA) [58], which aims to solve the same problem of recovering principal components from only random compressive sensing (CS) measurements of the data. Further work by Fowler and others addresses the issues of dimensionality determination [93] and improved hyperspectral data reconstruction [59] from CS measurements. We compare our method extensively to Fowler's in Section 4.5, showing that our proposed approach

actually performs better than CP-PCA on a variety of examples.

However, finding principal components from measurements only can also be considered as part of a larger class of attempts to learn, from only CS measurements, a signal dictionary [64, 55, 136, 138] in which the original data could be represented sparsely. In 2010, Gleichman and Eldar [64] proposed a dictionary learning method to address the signal recovery problem using compressive sensing measurements, and stated that, under certain conditions, this dictionary learning method will work for all sparse signals regardless of the sparsity basis. Later, Silva et al. [64] used a similar dictionary learning method to learn the subspaces in which collections of signals are sparse, and addressed the problem of simultaneous signal recovery using compressive sensing measurements. Recently, a paper from our group [116] presents another new algorithm, compressive K-SVD, for learning such a dictionary from compressive sensing measurements.

One can also consider our problem of recovering principal components from CS measurements as a special case of the low-rank matrix recovery problem considered in e.g. [57, 158]. To see this, we note that if we view each data sample as a column of a matrix X, then assuming that these data samples have shared principal components is equivalent to assuming that the matrix X is low-rank. In fact, finding the data's principal components is equivalent to finding the left singular vectors of the matrix X. Previous work such as Fazel et al. [57] has aimed to recover low-rank matrices from underdetermined linear measurements of them by minimizing the nuclear norm. Indeed, fast algorithms for low-rank matrix recovery such as ADMiRA [90], SpaRCS [158], and Compressive Principal Component Pursuit [160] have also been developed for this purpose, and have been shown to improve video and hyperspectral data recovery from underdetermined linear measurements. The difference between our work and this previous work in low-rank matrix recovery is that the underdetermined linear measurements in our case have a specific structure (a few random measurements per column of the matrix), which allows a simpler and more efficient recovery strategy based on normal PCA to be implemented in this specific case. We hope that this strategy

might be applicable to other problems in the low-rank matrix recovery literature with similar measurement structure.

One notable exception to the above is the similar m-measurements-per-column measurement structure used in recent work on sketched SVD [63]. In this case, the same measurement matrix is applied to every column of X to create a sketch matrix. The paper then seeks to bound the error between the right singular vectors (and singular values) of the sketch matrix and those of the original matrix. Recall that in our work we are interested instead in the left singular vectors of the original data matrix X, which cannot be readily obtained from the right singular vectors if the original data is not fully known. Hence, our work solves a different problem than sketched SVD. Several other differences between sketched SVD and our work are detailed in [63].

Finally, work in randomized algorithms has recently used random projections of data to help reduce the cost of computing PCA on very high-dimensional data. For example, Klenk and Heidemann [85] propose, without proof, that if the principal components of the high-dimensional data can be assumed to be sparse, then PCA can be applied to a single lower-dimensional random projection of the data instead, and typical CS recovery techniques can be used to recover the sparse high-dimensional PCs from each resulting low-dimensional PC. Obviously, the restrictive assumption of sparse PCs differentiates this work from our own.

While not closely related to our work, we note also that PCA has a history of informing the way in which sensing measurements are taken. For example, Masiero et al. [105, 123, 104] have used PCA on a set of already available data to obtain accurate estimates of the mean and principal components of the data they wish to sense, then used this information to drive the choice of CS measurements taken.

Applications

Since its inception circa 2004, compressive sensing has started to significantly reduce the number of measurements needed in a variety of application areas such as remote sensing [99], geosciences [71, 76, 94], medical imaging [139, 140, 110, 147], neuroscience [65, 31, 122, 95, 83, 84], biology [34, 134], and optics [16].

A very interesting example of a hardware implementation of compressive sensing principles is Rice University's single-pixel camera [142, 50, 156]. This camera can obtain an image with a single photon detector by taking a sequence of measurements. The camera architecture is shown in Figure 2.2. It uses a digital micromirror device (DMD) to mimic a random Bernoulli measurement matrix. The performance of this camera is quite good even with a small number of measurements, as shown in Figure 2.3.

Figure 2.2: Single-pixel camera architecture [142, 156].

2.2 Background Material Needed for Later Chapters

Before we move on to our results, we pause to review the PCA and kernel PCA algorithms, because they will be important in our work.

Figure 2.3: Reconstruction results from the Rice single-pixel camera. Left: Original image. Middle: Reconstructed image with 10% measurements. Right: Reconstructed image with 20% measurements [50, 156].

Principal Component Analysis

In this section, we will give a brief review of PCA, which will be needed in later chapters.

Introduction

Principal component analysis (PCA) [82] selects the best low-dimensional linear projection of a set of data points to minimize the mean-squared error between the original and projected data. It can also be thought of as finding the linear subspace that maximally preserves the variance of, or in some cases the information in, the data. PCA is frequently used for dimensionality reduction, or as a summary of interesting features of the data. For example, in the field of microarray data analysis, it is widely used as a first step for feature selection before proceeding to other analyses [124, 73]. It is also often used as a precursor to signal classification, as when eigenfaces are found before attempting to perform face recognition [149], or when hyperspectral data [67] undergoes PCA to reduce dimensionality before classification [125]. Finally, PCA can also be used to find a common basis in which a collection of data points can all be sparsely represented, and for this reason it is sometimes used in signal compression (see e.g. [75, 143]).

Algorithm

To obtain the principal components (PCs) of data, one typically centers the data first and then computes the eigenvectors of the data's covariance matrix. Consider data points $x^{(1)}, \ldots, x^{(n)} \in \mathbb{R}^p$, where n is the number of points. The algorithm is as follows:

(1) Calculate the center $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x^{(i)}$.

(2) Subtract the center from each data sample: $x^{(i)}_{\text{centered}} = x^{(i)} - \bar{x}$.

(3) Compute the eigenvectors of the covariance matrix $C = \frac{1}{n}\sum_{i=1}^{n} x^{(i)}_{\text{centered}} \left(x^{(i)}_{\text{centered}}\right)^T$.

We then name the top k eigenvectors of C, i.e. those corresponding to the k largest eigenvalues of C, the principal components of the data. If we project the data into the k-dimensional subspace spanned by the top k eigenvectors of C, we get a concise (approximate) representation of the data in this new orthonormal basis. A simple example of PCA is shown in Figure 2.4.

(a) (b)
Figure 2.4: (a) Data representation with two PCs. (b) Projection of data onto the first PC. The coefficients of the data with respect to the first PC give a concise approximate representation of the original data.
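The three steps above translate directly into a few lines of NumPy. The sketch below is a generic implementation of standard PCA with placeholder data; it is included only to make the algorithm concrete.

```python
import numpy as np

def pca(X, k):
    """Center and top-k principal components of the rows of the (n, p) array X."""
    center = X.mean(axis=0)                         # step (1): the center x_bar
    X_centered = X - center                         # step (2): subtract the center
    C = (X_centered.T @ X_centered) / X.shape[0]    # step (3): covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)            # eigh since C is symmetric
    order = np.argsort(eigvals)[::-1][:k]           # indices of the k largest eigenvalues
    return center, eigvecs[:, order]                # PCs returned as columns

# Concise representation: each point's coefficients in the k-dimensional PC basis.
X = np.random.default_rng(0).standard_normal((500, 20))
center, pcs = pca(X, k=3)
coeffs = (X - center) @ pcs                         # an (n, k) summary of the data
```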

Kernel Principal Component Analysis

In this section, we will give a brief review of kernel PCA, which is a nonlinear version of PCA. In contrast to PCA, which learns a best-fitting affine subspace for the data, kernel PCA can learn a concise representation of data that lies along a nonlinear low-dimensional manifold. This will make it useful in our later work on low-dimensional manifold models in compressive sensing. To begin our description of kernel PCA, we start by reviewing the kernel trick in machine learning.

Kernel Trick

The kernel trick in machine learning is a way to easily adapt linear algorithms to nonlinear situations. For example, by applying the kernel trick to the support vector machine (SVM) algorithm [33], which constructs the best linear hyperplane separating data points belonging to two different classes, we obtain the kernel SVM algorithm, which constructs the best curved boundary separating data points belonging to two different classes. Similarly, where principal component analysis (PCA) selects the best linear projection of the data to minimize error between the original and projected data, kernel PCA [132] finds the best polynomial mapping to represent the data.

The key idea of the kernel trick is that, conceptually, we map our data from the original data space $\mathbb{R}^p$ to a much higher-dimensional feature space $\mathcal{F}$ using a nonlinear mapping $\Phi : \mathbb{R}^p \to \mathcal{F}$ before applying the usual linear algorithm, such as SVM or PCA, in the feature space. As an example, we might map a point $x = (x_1, x_2) \in \mathbb{R}^2$ onto the higher-dimensional vector with components $x_1, x_2, x_1^2, x_2^2, x_1x_2, x_1^3$, etc. before applying SVM or PCA. A linear boundary in the higher-dimensional feature space ($\sum_j a_j \Phi(x)_j = C$) can then be expressed as a polynomial boundary in the original space ($a_0 x_1 + a_1 x_2 + a_2 x_1^2 + \cdots = C$). Similarly, a linear mapping of the data to a lower-dimensional space becomes a nonlinear polynomial mapping to a lower-dimensional space.

Let us see an example in Figure 2.5. Suppose we generate data samples along a spiral with $x_1 = \theta\cos\theta$ and $x_2 = \theta\sin\theta$, where $\theta \in (\frac{3}{2}\pi, \frac{9}{2}\pi)$. If we first map the data to a 3-dimensional feature space via $\Phi(x) = (x_1, x_2, 5\sqrt{x_1^2 + x_2^2})$, the projection onto the first principal component will reveal the right structure of the data samples. However, if we directly apply PCA to these data samples, it will not.

Figure 2.5: Example of kernel PCA.

However, this view of the kernel trick is purely conceptual. In reality, we avoid the complexity of mapping to and working in the high-dimensional feature space. When the original algorithm can be written in terms of only inner products between data points, not the points themselves, we can replace the original inner product $\langle x, y\rangle$ with the new inner product $k(x, y) = \langle\Phi(x), \Phi(y)\rangle$ and run the original algorithm without additional computation. For example, a popular choice of $k(x, y)$ is the polynomial kernel $(\langle x, y\rangle + c)^d$, which produces a $\Phi$ of monomials as described above. As an illustration, for $x, y \in \mathbb{R}^2$, $c = 0$, and $d = 2$, we have $k(x, y) = \langle x, y\rangle^2 = \langle(x_1^2, \sqrt{2}x_1x_2, x_2^2), (y_1^2, \sqrt{2}y_1y_2, y_2^2)\rangle$, so $\Phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)$. Thus, the kernel function $k(x, y)$ provides a means to shortcut the computation of inner products in feature space without ever actually mapping the data there.

So to summarize, the kernel trick employs a nonlinear mapping $\Phi$ to a higher-dimensional

feature space before applying the usual linear algorithm to create a nonlinear version of the algorithm. However, any additional computation that would be incurred by this process is avoided by writing the original algorithm entirely in terms of inner products and using a well-chosen nonlinear mapping so that the necessary inner products $\langle\Phi(x), \Phi(y)\rangle$ take the form of a simple kernel function $k(x, y)$ and can be easily computed via this function.

Algorithm

For the particular case of PCA, the kernel trick can be applied to allow the learning of a best-fitting polynomial surface to the manifold rather than a best-fitting linear subspace. To find the algorithm, we must rewrite the PCA algorithm in feature space entirely in terms of inner products of the form $\langle\Phi(w), \Phi(z)\rangle$. Then we will be able to choose an appropriate $\Phi$ so that the algorithm can be written in terms of a simple kernel function $k(w, z)$. Let us proceed.

Consider $n$ data points $x^{(1)}, \ldots, x^{(n)}$, and suppose these points are mapped from the original data space $\mathbb{R}^p$ to a high-dimensional feature space $\mathcal{F}$ using the nonlinear mapping $\Phi : \mathbb{R}^p \to \mathcal{F}$, producing the points $\Phi(x^{(1)}), \ldots, \Phi(x^{(n)})$ in $\mathcal{F}$. Let us first find the center of the mapped data in feature space $\mathcal{F}$: $\overline{\Phi(x)} = \frac{1}{n}\sum_{i=1}^{n}\Phi(x^{(i)})$. The covariance matrix of the data in $\mathcal{F}$ is then

$$C = \frac{1}{n}\sum_{i=1}^{n}\left(\Phi(x^{(i)}) - \overline{\Phi(x)}\right)\left(\Phi(x^{(i)}) - \overline{\Phi(x)}\right)^T \qquad (2.13)$$

It is easy to see that any eigenvector $v_k$ of $C$ must lie in the subspace spanned by $\{\Phi(x^{(i)}) - \overline{\Phi(x)}\}_{i=1}^{n}$, that is,

$$v_k = \sum_{i=1}^{n}\alpha_i^k\left(\Phi(x^{(i)}) - \overline{\Phi(x)}\right), \quad \text{for some } \alpha_1^k, \ldots, \alpha_n^k. \qquad (2.14)$$

For every eigenvalue $\lambda_k$ and its corresponding eigenvector $v_k$, we then have $Cv_k = \lambda_k v_k$

and if we take the inner product of both sides with $\Phi(x^{(i)}) - \overline{\Phi(x)}$ for each $i$, we get $\langle\Phi(x^{(i)}) - \overline{\Phi(x)}, Cv_k\rangle = \lambda_k\langle\Phi(x^{(i)}) - \overline{\Phi(x)}, v_k\rangle$. We can then use Eqs. 2.14 and 2.13 to get

$$\left\langle \Phi(x^{(i)}) - \overline{\Phi(x)},\ \frac{1}{n}\sum_{l=1}^{n}\left(\Phi(x^{(l)}) - \overline{\Phi(x)}\right)\left(\Phi(x^{(l)}) - \overline{\Phi(x)}\right)^T\sum_{j=1}^{n}\alpha_j^k\left(\Phi(x^{(j)}) - \overline{\Phi(x)}\right)\right\rangle = \lambda_k\left\langle\Phi(x^{(i)}) - \overline{\Phi(x)},\ \sum_{j=1}^{n}\alpha_j^k\left(\Phi(x^{(j)}) - \overline{\Phi(x)}\right)\right\rangle.$$

Defining the $n \times n$ Gram matrix $\widetilde{K}$, where $\widetilde{K}_{ij} = \langle\Phi(x^{(i)}) - \overline{\Phi(x)}, \Phi(x^{(j)}) - \overline{\Phi(x)}\rangle$, we then have that

$$\widetilde{K}^2\alpha^k = n\lambda_k\widetilde{K}\alpha^k \qquad (2.15)$$

Thus, the coefficients $\alpha^k$ of the kernel principal components $v_k$ for $k = 1, \ldots, d$ in terms of the training samples are found as the top $d$ eigenvectors (i.e. eigenvectors corresponding to maximum eigenvalues) of $\widetilde{K}$. We note that $\widetilde{K}$ can be found using the kernel function as $\widetilde{K} = (I - \frac{1}{n}\mathbf{1}\mathbf{1}^T)K(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T)$, where $K$ is the matrix with $(i, j)$th entry $K_{ij} = \langle\Phi(x^{(i)}), \Phi(x^{(j)})\rangle = k(x^{(i)}, x^{(j)})$ and $\mathbf{1}$ is the vector of all ones. Finally, each $\alpha^k$ is also scaled to ensure that $\|v_k\|_2 = 1$ for all $k$; this is equivalent to ensuring that $\alpha^{kT}\widetilde{K}\alpha^k = 1$. We then have the kernel principal components $v_k = \sum_{i=1}^{n}\alpha_i^k\left(\Phi(x^{(i)}) - \overline{\Phi(x)}\right)$. Each data point in feature space $\Phi(x^{(i)})$ can then be concisely and approximately represented as

$$\Phi(x^{(i)}) \approx \overline{\Phi(x)} + \sum_{k=1}^{d}\left\langle\Phi(x^{(i)}) - \overline{\Phi(x)}, v_k\right\rangle v_k = \overline{\Phi(x)} + \sum_{k=1}^{d}\left\langle\Phi(x^{(i)}) - \overline{\Phi(x)}, \sum_{j=1}^{n}\alpha_j^k\left(\Phi(x^{(j)}) - \overline{\Phi(x)}\right)\right\rangle v_k = \overline{\Phi(x)} + \sum_{k=1}^{d}\sum_{j=1}^{n}\alpha_j^k\widetilde{K}_{ij}\, v_k = \overline{\Phi(x)} + \sum_{k=1}^{d}n\lambda_k\alpha_i^k\, v_k$$

Finally, let us revisit the sculpture faces dataset example in Figures 1.6 and 1.7. A kernel PCA, performed with a well-chosen kernel function, is able to pick out two of

these degrees of freedom as the first two dimensions chosen in kernel PCA. We could then represent each image fairly accurately, knowing only its coordinates in this two-dimensional representation provided by kernel PCA. In other words, for some choice of kernel, some types of signals can be reconstructed with small error from their coordinates in a $d$-dimensional affine subspace of feature space.
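To make the kernel PCA computation described above concrete, here is a minimal Python/numpy sketch of the centered-Gram-matrix version of the algorithm, assuming a user-supplied kernel function such as the polynomial kernel; the function names and normalization details are ours and intended only as an illustration.

import numpy as np

def kernel_pca(X, kernel, d):
    # X: n x p array of training samples; kernel: function k(x, y); d: number of components
    n = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # K_ij = k(x^(i), x^(j))
    J = np.eye(n) - np.ones((n, n)) / n
    K_tilde = J @ K @ J                                        # centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    top = np.argsort(eigvals)[::-1][:d]                        # assumes the top-d eigenvalues are positive
    alphas = eigvecs[:, top] / np.sqrt(eigvals[top])           # scale so that alpha^T K_tilde alpha = 1
    return alphas, K_tilde

# example: a degree-5 polynomial kernel, as used later in this thesis
poly = lambda x, y, c=0.5: (np.dot(x, y) + c) ** 5

The coordinate of the $i$th training point along the $k$th kernel principal component is then the $(i, k)$ entry of K_tilde @ alphas, mirroring the expansion above.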

Chapter 3

Kernel Trick Compressive Sensing

In this chapter, we consider the problem of recovering signals that belong to low-dimensional nonlinear manifolds, i.e. that can be represented as a continuous nonlinear function of relatively few parameters, from compressive sensing measurements. We will show how the kernel trick can be used to easily adapt the usual paradigm of reconstructing a linearly sparse signal from a linear set of measurements to the case of reconstructing a nonlinearly sparse signal from either nonlinear or linear measurements. The key idea is that a signal that is nonlinearly sparse can, with a proper choice of kernel, become linearly sparse in feature space, being concisely represented in terms of very few coefficients, as our sculpture faces were in Chapter 2. We can thus reconstruct it from random measurements in feature space, which can be easily obtained from the usual random measurements for certain kernels. Experimentally, we find that when the signal to be reconstructed lies on a low-dimensional manifold, it can be reconstructed from far fewer compressive sensing measurements than required by an assumption of sparsity in the Fourier/wavelet basis. Compared with other compressive sensing recovery methods based on underlying manifold models, we find that our algorithm is similar in performance, but is much simpler to implement, requiring mainly an eigendecomposition for the PCA and a least-squares fit, and runs 1-2 orders of magnitude faster.

This chapter is organized as follows. Section 3.1 outlines our approach to signal recovery from compressive sensing random measurements using the kernel trick. Section 3.2

shows the overall algorithm step by step. Section 3.3 presents experimental results showing the power of our approach on some sample datasets and comparing it with other approaches in the literature. Finally, in Section 3.4, we present some theoretical error analysis for our recovered signal.

3.1 Signal Recovery from Compressive Sensing Measurements Using the Kernel Trick

Problem Set-up

Suppose we wish to recover an unknown signal $y$ that is approximately a continuous nonlinear function of $d$ underlying variables, or lies close to a $d$-dimensional manifold. We'll assume, as is common in manifold learning, that this $d$-dimensional manifold can be described through kernel PCA, with an appropriate choice of kernel. Describing signal manifolds in this way has previously been effective, e.g. for denoising images from an underlying manifold that are corrupted by noise [107].

Let $\{v_k\}_{k=1}^{d}$ be an orthonormal basis of feature space $\mathcal{F}$ describing this subspace (along with a possible offset $\overline{\Phi}$). We note that, unlike traditional compressive sensing, it does not make sense to expect that the $\{v_k\}_{k=1}^{d}$ will be some unknown subset of the standard canonical basis for $\mathcal{F}$, whose elements are typically monomials. Hence, $\{v_k\}_{k=1}^{d}$ and $\overline{\Phi}$ in $\mathcal{F}$ will typically need to be estimated via manifold learning, i.e. kernel PCA, from other data that is expected to be nonlinearly sparse in the same way, i.e. from other samples $\{x^{(i)}\}_{i=1}^{n}$ of the manifold of images that our image belongs to, or from other natural images in the case of natural image patches. If we denote the center of the data as $\overline{\Phi(x)} = \frac{1}{n}\sum_{i=1}^{n}\Phi(x^{(i)})$, then our unknown signal $y$ that we wish to recover can be modeled as

$$\Phi(y) \approx \overline{\Phi(x)} + \sum_{k=1}^{d}\beta_k v_k. \qquad (3.1)$$

Now suppose we have $m$ measurements of $y$ in the form of linear inner products $\langle y, e_i\rangle$,

or nonlinear inner products $k(y, e_i) = \langle\Phi(y), \Phi(e_i)\rangle$, where $\{e_i\}_{i=1}^{m}$ are random vectors drawn from a Gaussian or Bernoulli distribution. In the case that linear inner products are provided, we shall assume the kernel defining our feature space is of the form $f(\langle y, e_i\rangle)$, e.g. the polynomial kernel, the sigmoid kernel, etc., so that $k(y, e_i)$ is known as well. Using Eq. 3.1 for $y$, we thus aim to recover $\beta_1, \ldots, \beta_d$ from these measurements. As in typical kernel methods, our final method should utilize only inner products $\langle\Phi(x^{(i)}), \Phi(x^{(j)})\rangle = k(x^{(i)}, x^{(j)})$, not the elements $\Phi(x^{(i)})$ themselves, so that we may avoid working in the feature space $\mathcal{F}$.

Signal Recovery in Feature Space

To do this, let us first complete the orthonormal basis $v_1, \ldots, v_d$ for $\mathcal{F}$ with vectors $v_{d+1}, \ldots, v_q$, where $q$ is the dimension of $\mathcal{F}$, so that $v_1, \ldots, v_q$ now forms an orthonormal basis for $\mathcal{F}$. We can then write $\Phi(y)$ as

$$\Phi(y) = \overline{\Phi(x)} + \sum_{k=1}^{d}\beta_k v_k + \sum_{k=d+1}^{q}\gamma_{k-d} v_k = \overline{\Phi(x)} + P_{V_d}(\Phi(y)) + P_{V_d^{\perp}}(\Phi(y)) \qquad (3.2)$$

where $P_{V_d}(\cdot)$ is the projection of $\Phi(y) - \overline{\Phi(x)}$ onto the subspace $V_d$ spanned by $\{v_i\}_{i=1}^{d}$, and $P_{V_d^{\perp}}(\cdot)$ is the projection onto the orthogonal complement of $V_d$ in $\mathcal{F}$. Then each random measurement can be expressed as

$$k(y, e_i) = \langle\Phi(y), \Phi(e_i)\rangle = \langle\overline{\Phi(x)}, \Phi(e_i)\rangle + \langle P_{V_d}(\Phi(y)), \Phi(e_i)\rangle + \langle P_{V_d^{\perp}}(\Phi(y)), \Phi(e_i)\rangle = \frac{1}{n}\sum_{j=1}^{n}k(x^{(j)}, e_i) + \sum_{k=1}^{d}\beta_k\langle v_k, \Phi(e_i)\rangle + \epsilon_i$$

where $\epsilon_i$ is a small error term, since $P_{V_d^{\perp}}(\Phi(y))$ is small by assumption. In Section 3.4, we will analyze the error incurred by $P_{V_d^{\perp}}(\Phi(y))$.

Writing all $m$ measurements in matrix form, we then have

$$\begin{bmatrix} k(y, e_1) - \frac{1}{n}\sum_{j=1}^{n}k(x^{(j)}, e_1) \\ \vdots \\ k(y, e_m) - \frac{1}{n}\sum_{j=1}^{n}k(x^{(j)}, e_m) \end{bmatrix} = \begin{bmatrix} \langle v_1, \Phi(e_1)\rangle & \cdots & \langle v_d, \Phi(e_1)\rangle \\ \vdots & \ddots & \vdots \\ \langle v_1, \Phi(e_m)\rangle & \cdots & \langle v_d, \Phi(e_m)\rangle \end{bmatrix}\begin{bmatrix}\beta_1 \\ \vdots \\ \beta_d\end{bmatrix} + \begin{bmatrix}\epsilon_1 \\ \vdots \\ \epsilon_m\end{bmatrix}$$

We can then write the above equation as

$$M - M_x = G\beta + \epsilon \qquad (3.3)$$

where $M$ is the measurement vector with $i$th entry $k(y, e_i)$, $M_x$ is the vector with $i$th entry $\frac{1}{n}\sum_{j=1}^{n}k(x^{(j)}, e_i)$, and $G$ is the matrix with $(k, l)$th entry

$$G_{kl} = \langle v_l, \Phi(e_k)\rangle = \left\langle\sum_{j=1}^{n}\alpha_j^l\left(\Phi(x^{(j)}) - \overline{\Phi(x)}\right), \Phi(e_k)\right\rangle = \sum_{j=1}^{n}\alpha_j^l\left(k(x^{(j)}, e_k) - \frac{1}{n}\sum_{i=1}^{n}k(x^{(i)}, e_k)\right)$$

where we have assumed that $\{v_k\}_{k=1}^{d}$ have been learned from training data as above. Letting $\bar{k}_{Xe_i}$ denote $\bar{k}_{Xe_i} = \frac{1}{n}\sum_{j=1}^{n}k(x^{(j)}, e_i)$, we can thus represent $G$ as

$$G = \begin{bmatrix} k(x^{(1)}, e_1) - \bar{k}_{Xe_1} & \cdots & k(x^{(n)}, e_1) - \bar{k}_{Xe_1} \\ \vdots & \ddots & \vdots \\ k(x^{(1)}, e_m) - \bar{k}_{Xe_m} & \cdots & k(x^{(n)}, e_m) - \bar{k}_{Xe_m} \end{bmatrix}\begin{bmatrix}\alpha_1^1 & \cdots & \alpha_1^d \\ \vdots & \ddots & \vdots \\ \alpha_n^1 & \cdots & \alpha_n^d\end{bmatrix} \qquad (3.4)$$

Finally, we can use the least squares estimator to estimate $\beta$ from Eq. 3.3,

$$\hat{\beta} = G^{+}(M - M_x) \qquad (3.5)$$
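The construction of $G$ and the least-squares step can be sketched in a few lines of numpy. The interface below (training samples, measurement vectors, measurements $M_i = k(y, e_i)$, and the kernel PCA coefficients $\alpha$) is our own framing of Eqs. 3.3-3.5, not code from the thesis.

import numpy as np

def recover_beta(X_train, E, M, alphas, kernel):
    # X_train: n x p training samples; E: m x p matrix whose rows are the e_i
    # M: length-m vector with M_i = k(y, e_i); alphas: n x d kernel-PCA coefficients
    K_XE = np.array([[kernel(xj, ei) for ei in E] for xj in X_train])  # K_XE[j, i] = k(x^(j), e_i)
    M_x = K_XE.mean(axis=0)                                   # (M_x)_i = (1/n) sum_j k(x^(j), e_i)
    G = (K_XE - M_x).T @ alphas                               # Eq. 3.4: centered kernel values times alpha
    beta_hat, *_ = np.linalg.lstsq(G, M - M_x, rcond=None)    # Eq. 3.5
    return beta_hat, G, M_x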

where $G^{+}$ is the pseudoinverse of $G$. We shall later show that we must have $m > d$, i.e. the number of measurements should be larger than the dimension of the manifold, so we will always have the left inverse $G^{+} = (G^TG)^{-1}G^T$. After obtaining the coefficients $\hat{\beta}$, we can then estimate $\Phi(y)$ as

$$\widehat{\Phi(y)} = \overline{\Phi(x)} + \sum_{k=1}^{d}\hat{\beta}_k v_k = \overline{\Phi(x)} + \sum_{i=1}^{n}\sum_{k=1}^{d}\hat{\beta}_k\alpha_i^k\left(\Phi(x^{(i)}) - \overline{\Phi(x)}\right) = \sum_{i=1}^{n}\left(\frac{1}{n} + \gamma_i - \frac{1}{n}\sum_{j=1}^{n}\gamma_j\right)\Phi(x^{(i)}) \qquad (3.6)$$

where $\gamma_i = \sum_{k=1}^{d}\hat{\beta}_k\alpha_i^k$. We thus see that with the kernel PCA model for the manifold, the compressive sensing recovery problem turns into a simple least-squares problem in feature space.

Preimage Methods

Now that we have recovered $\widehat{\Phi(y)}$ in feature space, the remaining question is how we can invert the one-to-one mapping $\Phi$ to find our estimate $\hat{y}$ of $y$. This problem, of recovering $z$ from $\Phi(z)$, is called the preimage problem in the kernel methods literature. If we assume an exact preimage $z$ such that $\Phi(z) = \widehat{\Phi(y)}$ exists, then we can estimate our original signal $y$ via $\hat{y} = \Phi^{-1}\big(\widehat{\Phi(y)}\big)$. In Eq. 3.6, consider the expansion $\widehat{\Phi(y)} = \sum_{i=1}^{n}c_i\Phi(x^{(i)})$, where $c_i = \frac{1}{n} + \gamma_i - \frac{1}{n}\sum_{j=1}^{n}\gamma_j$. If there exists $z \in \mathbb{R}^p$ such that $\Phi(z) = \widehat{\Phi(y)}$, and an invertible function $f_k$ such that

$k(x, y) = f_k(\langle x, y\rangle)$, then

$$z = \sum_{j=1}^{p}\langle z, u_j\rangle u_j = \sum_{j=1}^{p} f_k^{-1}\left(\sum_{i=1}^{n}c_i\, k(x^{(i)}, u_j)\right)u_j = \sum_{j=1}^{p} f_k^{-1}\left(\sum_{i=1}^{n}\left(\frac{1}{n} + \gamma_i - \frac{1}{n}\sum_{l=1}^{n}\gamma_l\right)k(x^{(i)}, u_j)\right)u_j \qquad (3.7)$$

where $\{u_j\}_{j=1}^{p}$ is any orthonormal basis of $\mathbb{R}^p$, for example the standard one with 1 in one position and 0 elsewhere. As an example kernel, we could choose $k(x, y) = f_k(\langle x, y\rangle) = (\langle x, y\rangle + c)^d$, with $d$ odd to guarantee that $f_k$ is one-to-one invertible. We note that if our signals are images, even $d$ will also work because the values of image pixels are nonnegative. However, more often, due to estimation error in $\widehat{\Phi(y)}$ or for a kernel function not satisfying the requirements above, an exact preimage will not exist, so we will instead find the best approximation. Some approximate preimage methods (see [131, 86, 130, 163]) allow us to do this without venturing into feature space. However, we will not detail these methods here, since we found that the simple preimage recovery method in Eq. 3.7 was sufficient.
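For the polynomial kernel with odd degree, the inversion in Eq. 3.7 can be carried out coordinate by coordinate. The following is a minimal sketch under that assumption; the function name and default parameters are illustrative, not from the thesis.

import numpy as np

def preimage(X_train, gamma, c=0.5, deg=5):
    # Eq. 3.7 for k(x, y) = (<x, y> + c)^deg with odd deg, so f_k^{-1}(t) = t^(1/deg) - c
    n, p = X_train.shape
    coeffs = 1.0 / n + gamma - gamma.mean()              # c_i = 1/n + gamma_i - (1/n) sum_j gamma_j
    z = np.empty(p)
    for j in range(p):
        # k(x^(i), u_j) = (x^(i)_j + c)^deg, with u_j the jth standard basis vector
        s = np.sum(coeffs * (X_train[:, j] + c) ** deg)
        z[j] = np.sign(s) * np.abs(s) ** (1.0 / deg) - c   # sign-preserving root for odd deg
    return z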

Algorithm 1: Kernel Trick Compressive Sensing

Input: Training data $\{x^{(i)}\}_{i=1}^{n}$, kernel function $k(x, y)$, measurement vectors $\{e_i\}_{i=1}^{m}$, and measurements $M_i = k(y, e_i)$, possibly obtained as $f_k(\langle y, e_i\rangle)$ if the kernel takes the form $k(x, y) = f_k(\langle x, y\rangle)$.
Output: Recovered signal $\hat{y}$.

1. Perform kernel PCA [132] on $\{x^{(i)}\}_{i=1}^{n}$ to obtain $\{\alpha^k\}_{k=1}^{d}$.
2. Compute $\bar{k}_{Xe_i} = \frac{1}{n}\sum_{j=1}^{n}k(x^{(j)}, e_i)$ for $i = 1, \ldots, m$.
3. Compute $G$ as in Eq. 3.4, i.e. the $m \times n$ matrix of centered kernel evaluations $k(x^{(j)}, e_i) - \bar{k}_{Xe_i}$ multiplied by the $n \times d$ matrix of coefficients $\alpha_j^k$.
4. Compute $M_x = \big(\bar{k}_{Xe_1}, \ldots, \bar{k}_{Xe_m}\big)^T$.
5. Compute the coefficients $\hat{\beta} = (G^TG)^{-1}G^T(M - M_x)$.
6. Compute $\gamma_i = \sum_{k=1}^{d}\hat{\beta}_k\alpha_i^k$.
7. Recover the preimage
$$\hat{y} = \sum_{j=1}^{p} f_k^{-1}\left(\sum_{i=1}^{n}\left(\frac{1}{n} + \gamma_i - \frac{1}{n}\sum_{l=1}^{n}\gamma_l\right)k(x^{(i)}, u_j)\right)u_j$$
for a kernel of the form $k(x, y) = f_k(\langle x, y\rangle)$ with invertible $f_k$, or by other preimage methods [131, 86, 130, 163] if not.
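Putting the pieces together, a hypothetical end-to-end run of Algorithm 1 might look as follows, assuming the kernel_pca, recover_beta, and preimage sketches given earlier in this chapter are in scope; the data here are random placeholders, used only to show how the steps chain together.

import numpy as np

rng = np.random.default_rng(0)
n, p, m, d, c = 200, 64, 40, 20, 0.5
X_train = rng.random((n, p))                    # stand-in for training images
y_true = rng.random(p)                          # stand-in for the unknown signal
E = rng.normal(0.0, 1.0 / np.sqrt(m), (m, p))   # rows e_i drawn from N(0, I/m)

poly = lambda a, b: (np.dot(a, b) + c) ** 5     # k(x, y) = (<x, y> + c)^5
M = np.array([poly(y_true, e) for e in E])      # feature-space measurements from linear ones

alphas, _ = kernel_pca(X_train, poly, d)                        # step 1
beta_hat, G, M_x = recover_beta(X_train, E, M, alphas, poly)    # steps 2-5
gamma = alphas @ beta_hat                                       # step 6
y_hat = preimage(X_train, gamma, c=c, deg=5)                    # step 7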

We do not compare with the methods for manifold-model-based recovery presented by G. Peyré [113] or M. Wakin [153], since both of these methods require an explicit parameterization of the manifold, which is not generally available, and is not available for our examples. In fact, even if such a comparison were done for other data, it would not be a fair one, since knowledge of an explicit parameterization of the manifold is a significant advantage.

Datasets

For our analysis, we choose three datasets in which the images show the type of nonlinear sparsity we discussed at the beginning of this chapter. The Sculpture Face dataset [144] includes 698 images of a sculpture face rendered according to 3 different input parameters, two of pose angle and one of lighting angle (see Figure 3.1(a)). The Frey Face dataset [126] includes 1964 images of the same person's face as he shows different emotions (see Figure 3.1(b)). The MNIST Handwritten Digit dataset [89] includes 60000 training images of digits from 0 to 9; each digit has about 6000 training samples (see Figure 3.1(c)).

Results

Figure 3.2 compares our method with the compressive sensing recovery methods L1-MIN, TVM and NMFA, giving the reconstructed images for varying numbers of measurements and plotting MSE vs. number of measurements (in logarithmic scale) for each method. We notice that our method and NMFA, the two manifold-based compressive sensing techniques, outperform the traditional compressive sensing recovery methods (L1-MIN and TVM) by an order of magnitude for small numbers of measurements, producing either a greatly reduced MSE for the same number of measurements, or a greatly reduced number of measurements for the same MSE. While NMFA's performance is comparable in terms of MSE to our method for small numbers of measurements, we will see later in Table 3.1 that it is a complex and computationally intensive process to train the parameters of the NMFA model from training samples, while our method is much faster

Figure 3.1: Image samples for all three datasets used in the experimental verification. (a) Sculpture Face samples. (b) Frey Face samples. (c) Handwritten Digit samples.

Figure 3.2: Experimental results: a visual comparison (left) and plots (in log-log scale) of MSE vs. number of measurements (right) for each of the sculpture face (top), Frey face (middle), and handwritten digits (bottom) datasets. Our method (KTCS) is compared with $\ell_1$-Minimization (L1-MIN), Total Variation Minimization (TVM) and the Nonparametric Mixture of Factor Analyzers (NMFA).

and simpler to implement. This is important for practical use, especially on larger images than the very small examples presented here. L1-MIN and TVM do catch up and surpass our method for very large numbers of measurements and very small recovery error, most likely due to small inaccuracies in estimating the preimage from $\widehat{\Phi(y)}$. These inaccuracies should decrease with exact knowledge of $\{v_k\}_{k=1}^{d}$ or an increased set of data $\{x^{(i)}\}_{i=1}^{n}$ from which to estimate them.

A comparison of computation time for the four methods is shown in Table 3.1. The time is that required to recover a single image using a fixed number of measurements. For NMFA, the training of the graphical manifold model is the most computationally intensive part, so we have shown this separately to facilitate comparisons. From Table 3.1, we can see that estimating the parameters of the NMFA model from the training data is indeed very computationally intensive. For example, in the first line, it takes only 2.2 seconds for our method to train a manifold model from the data and recover the measured signal, while NMFA takes approximately 2 hours to train its model. This is a significant advantage for our method given its similar performance to NMFA.

Table 3.1: Time Consumption Comparison

                                         KTCS      L1-Min    TVM       NMFA (training)
Sculpture Face:      50 measurements     2.2s      1.4s      79.1s     --
                     300 measurements    2.5s      2.8s      92.1s     --
                     1000 measurements   3.9s      8.9s      95.6s     --
Frey Face:           5 measurements      10.2s     0.1s      0.5s      --
                     50 measurements     12.6s     0.4s      0.6s      --
                     100 measurements    13.8s     0.5s      0.7s      --
Handwritten digits:  10 measurements     557.5s    0.3s      7.5s      --
                     50 measurements     581.3s    0.5s      8.7s      --
                     500 measurements    612.9s    0.8s      9.1s      --

Handling Large Scale Data

We note that for the Handwritten Digits dataset, the number of training samples (60000) is much larger than for the other two datasets (about 500 for Sculpture Face, 2000 for Frey Face). This prevents using a single eigendecomposition to apply kernel PCA at once to all the training data. We have thus slightly modified our method here to divide the training data into several subsets of reasonable size (more specifically, 24 subsets with 2500 samples each). We then use our method to construct $\hat{\beta}$ for each subset and use this to locate the training data closest to $y$ in feature space within each subset (50 samples per subset). Finally, we obtain a total of $24 \times 50 = 1200$ similar training samples and use these to reconstruct the test sample by applying our technique. This divide-and-conquer procedure requires running our method multiple times, so our method takes longer for the Handwritten Digits dataset than for the other two. However, this procedure can be used in a general way to handle large training datasets. The overall procedure is shown in Figure 3.3.

Notes on the Choice of Parameters

Choice of Kernel: We shall assume that the kernel defining our feature space is of the form $f(\langle y, e_i\rangle)$, e.g. the polynomial kernel, so that the measurement in feature space $k(y, e_i)$ can be obtained from the regular linear measurement $\langle y, e_i\rangle$. All experiments were done using the polynomial kernel $k(x, y) = (\langle x, y\rangle + c)^5$, setting $c$ to 0.5 times the mean of all entries of the covariance matrix formed from the points $\{x^{(i)}\}_{i=1}^{n}$, so that it is of approximately the same scale as the inner products. Based on our experiments, we do not see significant variation in the results when we change these parameters slightly.

Choice of Measurement Vectors: Here we have chosen the measurement vectors $e_k$ to be drawn i.i.d. from the Gaussian distribution $\mathcal{N}(0, \frac{1}{m}I)$.

Figure 3.3: The divide-and-conquer procedure for a large dataset.

In practice, however, we have found that $e_k$ drawn from the random Bernoulli distribution as follows also works very well:

$$e_{ki} = \begin{cases} 1, & \text{with probability } \frac{1}{2} \\ -1, & \text{with probability } \frac{1}{2} \end{cases}$$

Choice of the Sparsity Level d: Since real-world signals are not exactly sparse, but rather show a rapidly decaying approximation error with increasing number of components kept, Figure 3.4 attempts to experimentally find the optimal level of nonlinear sparsity $d$ to assume for a given number of measurements $m$. It plots the mean squared error (MSE) between the original image and our reconstructed image for different combinations of assumed sparsity level $d$ and number of measurements $m$ for the sculpture faces dataset. In Figure 3.4(a), we see a sharp drop in MSE for each curve, occurring precisely when $m$ becomes larger than $d$. We shall understand this drop better using our error analysis in Section 3.4. Given $m > d$, we obtain a smaller MSE for larger $d$. In Figure 3.4(b), for each $m$, the MSE is smallest at about $d \approx \frac{m}{2}$. Based on this guideline, we chose $d = \frac{m}{2}$ in our experiments above.

Figure 3.4: MSE of our recovered image as a function of $m$ and $d$ for the sculpture face dataset for (a) fixed $m$, varying $d$ and (b) fixed $d$, varying $m$.

3.4 Error Analysis of Our Estimator

A Theorem Bounding the Error in Feature Space

As we mentioned in Section 3.1, error will be introduced by $P_{V_d^{\perp}}(\Phi(y))$, which is the projection of $\Phi(y)$ onto the orthogonal complement of $V_d$. In this section, we will give a bound on the error of our estimate.

Theorem 1. Suppose $v_1, \ldots, v_q$ is an orthonormal basis for the feature space $\mathcal{F}$ and let $\overline{\Phi(x)}$ be a fixed element of $\mathcal{F}$. Now consider $\Phi(y) \in \mathcal{F}$ and its representation in terms of these quantities

$$\Phi(y) = \overline{\Phi(x)} + \sum_{i=1}^{d}\beta_i v_i + \sum_{j=d+1}^{q}\gamma_{j-d} v_j \qquad (3.8)$$

Suppose we take $m$ measurements of $\Phi(y)$, with $m > d$, by taking inner products $M_k = \langle e_k, \Phi(y)\rangle$ for vectors $\{e_k\}_{k=1}^{m}$ already in feature space $\mathcal{F}$ that are drawn i.i.d. from an isotropic Gaussian distribution $e_k \sim \mathcal{N}(0, \frac{1}{m}I)$.¹ Now suppose our estimate of $\Phi(y)$ is

$$\widehat{\Phi(y)} = \overline{\Phi(x)} + \sum_{j=1}^{d}\hat{\beta}_j v_j \qquad (3.9)$$

where $\hat{\beta} = G^{+}(M - M_x)$ with $M$ as above, $G$ a matrix with $(k, l)$th entry $\langle v_l, e_k\rangle$, and $M_x$ the vector with $i$th entry $\langle e_i, \overline{\Phi(x)}\rangle$. Then for all constants $r \ge 0$ and $b \ge 0$, we have

$$\left\|\Phi(y) - \widehat{\Phi(y)}\right\|_2^2 \le \frac{\|\gamma\|_2^2 + b}{\left(1 - \sqrt{d/m} - r\right)^2} + \|\gamma\|_2^2 \qquad (3.10)$$

with probability at least $\frac{b^2}{\frac{2}{m}\|\gamma\|_2^4 + b^2}\left(1 - e^{-mr^2/2}\right)$.

¹ Note that the notation here differs slightly from that introduced earlier in Section 3.1. For simplicity, $e_i$ here is assumed to be already in the feature space.

Note that because of the rapid decay of error with the number of kernel PCA components for real signals, as displayed in Figure 1.7, $\|\gamma\|_2^2$ should decay rapidly with increasing $d$, becoming very small and ensuring a tight bound. Also, to ensure a small error, $d/m$ should be small, which means that we will need more measurements than the nonlinear sparsity of the signal, as we would expect.

Theoretical Verification

Proof. For each random measurement, we take the inner product between $\Phi(y)$ and $e_k$,

$$M_k = \langle\Phi(y), e_k\rangle = \left\langle\overline{\Phi(x)} + \sum_{i=1}^{d}\beta_i v_i + \sum_{j=d+1}^{q}\gamma_{j-d} v_j,\ e_k\right\rangle = (M_x)_k + \sum_{i=1}^{d}\beta_i\langle v_i, e_k\rangle + \sum_{j=d+1}^{q}\gamma_{j-d}\langle v_j, e_k\rangle$$

In matrix form, the measurements $M$ can be written as $M = M_x + G\beta + H\gamma$, where $G$ and $H$ are the matrices with $(i, j)$th entries $G_{ij} = \langle e_i, v_j\rangle$ and $H_{ij} = \langle e_i, v_{j+d}\rangle$, respectively. From Eqs. 3.8 and 3.9, the error can be represented as

$$\left\|\Phi(y) - \widehat{\Phi(y)}\right\|_2^2 = \left\|\sum_{i=1}^{d}(\beta_i - \hat{\beta}_i)v_i + \sum_{j=d+1}^{q}\gamma_{j-d}v_j\right\|_2^2 = \|\beta - \hat{\beta}\|_2^2 + \|\gamma\|_2^2 \qquad (3.11)$$

where

$$\|\beta - \hat{\beta}\|_2 = \|\beta - G^{+}(M - M_x)\|_2 = \|\beta - G^{+}(G\beta + H\gamma)\|_2 \le \|(I - G^{+}G)\beta\|_2 + \|G^{+}H\gamma\|_2 \qquad (3.12)$$

Here, $G$ and $H$ are both i.i.d. matrices of Gaussian random variables (their entries are linear projections of an isotropic Gaussian onto orthogonal vectors), each entry distributed according to $\mathcal{N}(0, \frac{1}{m})$. It is easy to show that $G$ and $H$ are uncorrelated, and since the entries of $G$ and $H$ are Gaussian distributed, $G$ and $H$ are also independent of each other.

For the first term $\|(I - G^{+}G)\beta\|_2$ in Eq. 3.12, since $m > d$, the columns of $G$ are linearly independent with probability 1. Thus, $G^TG$ is full rank and the null space of $G$ is trivial, which indicates that $(I - G^{+}G)$, the orthogonal projector onto the null space of $G$, is zero. Alternatively, we can show this by noting that when the columns of $G$ are linearly independent, then $G^{+} = (G^TG)^{-1}G^T$, so

$$\|(I - G^{+}G)\beta\|_2 = \|(I - (G^TG)^{-1}G^TG)\beta\|_2 = 0 \quad \text{with prob. } 1 \qquad (3.13)$$

For the second term $\|G^{+}H\gamma\|_2$ in Eq. 3.12, we first note that

$$\|G^{+}H\gamma\|_2 \le \|G^{+}\|_{op}\|H\gamma\|_2 \qquad (3.14)$$

where $\|A\|_{op} = \max_{x\in\mathbb{R}^n, x\ne 0}\frac{\|Ax\|}{\|x\|}$. Since $G$ is an $m \times d$ matrix with entries drawn i.i.d. from $\mathcal{N}(0, \frac{1}{m})$, we use a theorem of Szarek [141], which gives a bound on the smallest singular value of $G$. It states that the smallest singular value $\sigma_{\min}(G)$ obeys

$$Pr\left(\sigma_{\min}(G) > 1 - \sqrt{d/m} - r\right) \ge 1 - e^{-mr^2/2} \qquad (3.15)$$

and we also have

$$\|G^{+}\|_{op} = \|(G^TG)^{-1}G^T\|_2 \overset{(a)}{\le} \sqrt{\lambda_{\max}\left(G(G^TG)^{-2}G^T\right)} \overset{(b)}{=} \sqrt{\lambda_{\max}\left((G^TG)^{-2}G^TG\right)} = \sqrt{\lambda_{\max}\left((G^TG)^{-1}\right)} = \frac{1}{\sqrt{\lambda_{\min}(G^TG)}} = \frac{1}{\sigma_{\min}(G)} \qquad (3.16)$$

where (a) follows from $\|A\|_2 \le \sqrt{\lambda_{\max}(A^TA)}$, and (b) follows from $\lambda_{\max}(AB) = \lambda_{\max}(BA)$.

Combining Eqs. 3.15 and 3.16, a bound on $\|G^{+}\|_{op}$ is

$$Pr\left(\|G^{+}\|_{op} \le \frac{1}{1 - \sqrt{d/m} - r}\right) \ge 1 - e^{-mr^2/2}$$

Now considering the term $\|H\gamma\|_2$,

$$\|H\gamma\|_2^2 = \left\|\begin{bmatrix}\langle e_1, v_{d+1}\rangle & \cdots & \langle e_1, v_q\rangle \\ \vdots & \ddots & \vdots \\ \langle e_m, v_{d+1}\rangle & \cdots & \langle e_m, v_q\rangle\end{bmatrix}\begin{bmatrix}\gamma_1 \\ \vdots \\ \gamma_{q-d}\end{bmatrix}\right\|_2^2 = \left\|\begin{bmatrix}\left\langle e_1, \sum_{i=1}^{q-d}\gamma_i v_{d+i}\right\rangle \\ \vdots \\ \left\langle e_m, \sum_{i=1}^{q-d}\gamma_i v_{d+i}\right\rangle\end{bmatrix}\right\|_2^2 = \sum_{k=1}^{m}\left\langle e_k, \sum_{i=1}^{q-d}\gamma_i v_{d+i}\right\rangle^2 \qquad (3.17)$$

If we denote $a_k = \left\langle e_k, \sum_{i=1}^{q-d}\gamma_i v_{d+i}\right\rangle$, each $a_k$ is a linear projection of a Gaussian random variable, and thus $a_k$ is still Gaussian distributed. Moreover, it is easy to check that $\{a_k\}_{k=1}^{m}$ are independent, and the mean and variance of $a_k$ can be calculated as follows. The mean of $a_k$ is

$$E(a_k) = E\left\langle e_k, \sum_{i=1}^{q-d}\gamma_i v_{d+i}\right\rangle = \left\langle E(e_k), \sum_{i=1}^{q-d}\gamma_i v_{d+i}\right\rangle = 0. \qquad (3.18)$$

The variance of $a_k$ is

$$Var(a_k) = E(a_k^2) = E\left(\left\langle e_k, \sum_{i=1}^{q-d}\gamma_i v_{d+i}\right\rangle^2\right) = E\left[\left(\sum_{i=1}^{q-d}\gamma_i v_{d+i}\right)^T e_k e_k^T\left(\sum_{i=1}^{q-d}\gamma_i v_{d+i}\right)\right] = \left(\sum_{i=1}^{q-d}\gamma_i v_{d+i}\right)^T E(e_k e_k^T)\left(\sum_{i=1}^{q-d}\gamma_i v_{d+i}\right) = \left(\sum_{i=1}^{q-d}\gamma_i v_{d+i}\right)^T\left(\frac{1}{m}I\right)\left(\sum_{i=1}^{q-d}\gamma_i v_{d+i}\right) = \frac{1}{m}\sum_{i=1}^{q-d}\gamma_i^2 = \frac{1}{m}\|\gamma\|_2^2$$

Thus the $a_k$ are i.i.d. $\mathcal{N}(0, \frac{1}{m}\|\gamma\|_2^2)$, and $\|H\gamma\|_2^2 = \sum_{k=1}^{m}a_k^2$. We can thus obtain the expectation and variance of $\|H\gamma\|_2^2$ as

$$E\left(\|H\gamma\|_2^2\right) = E\left(\sum_{k=1}^{m}a_k^2\right) = \sum_{k=1}^{m}E(a_k^2) = \|\gamma\|_2^2$$

$$Var\left(\|H\gamma\|_2^2\right) = E\left(\left(\|H\gamma\|_2^2\right)^2\right) - \left(E\left(\|H\gamma\|_2^2\right)\right)^2 = E\left(\left(\sum_{k=1}^{m}a_k^2\right)^2\right) - \|\gamma\|_2^4 = \sum_{i\ne j}E(a_i^2)E(a_j^2) + \sum_{i=1}^{m}E(a_i^4) - \|\gamma\|_2^4 = (m^2 - m)\frac{\|\gamma\|_2^4}{m^2} + \frac{3}{m}\|\gamma\|_2^4 - \|\gamma\|_2^4 = \frac{2}{m}\|\gamma\|_2^4$$

Using the one-sided Chebyshev inequality to bound the tail decay of this random variable, for $b \ge 0$,

$$Pr\left(W - E(W) \ge b\right) \le \frac{Var(W)}{Var(W) + b^2} \qquad (3.19)$$

Hence, we have

$$Pr\left(\|H\gamma\|_2^2 \le \|\gamma\|_2^2 + b\right) \ge 1 - \frac{\frac{2}{m}\|\gamma\|_2^4}{\frac{2}{m}\|\gamma\|_2^4 + b^2} = \frac{b^2}{\frac{2}{m}\|\gamma\|_2^4 + b^2} \qquad (3.20)$$

Because $G$ and $H$ are independent, we will have that

$$Pr\left(\|G^{+}H\gamma\|_2 \le K_1K_2\right) \ge Pr\left(\|G^{+}\|_{op} \le K_1 \text{ and } \|H\gamma\|_2 \le K_2\right) = Pr\left(\|G^{+}\|_{op} \le K_1\right)\,Pr\left(\|H\gamma\|_2 \le K_2\right) \qquad (3.21)$$

From Eqs. 3.21, 3.17 and 3.20, we have

$$Pr\left(\|G^{+}H\gamma\|_2^2 \le \frac{\|\gamma\|_2^2 + b}{\left(1 - \sqrt{d/m} - r\right)^2}\right) \ge \frac{b^2}{\frac{2}{m}\|\gamma\|_2^4 + b^2}\left(1 - e^{-mr^2/2}\right) \qquad (3.22)$$

Combining Eqs. 3.11, 3.12, 3.13 and 3.22, we have that

$$\left\|\Phi(y) - \widehat{\Phi(y)}\right\|_2^2 \le \frac{\|\gamma\|_2^2 + b}{\left(1 - \sqrt{d/m} - r\right)^2} + \|\gamma\|_2^2 \qquad (3.23)$$

with probability at least $\frac{b^2}{\frac{2}{m}\|\gamma\|_2^4 + b^2}\left(1 - e^{-mr^2/2}\right)$.

From Theorem 1, we can easily state the following corollary.

Corollary 1. When $r = 0$ and $b = \|\gamma\|_2^2$, as $m \to \infty$, we have

$$\left\|\Phi(y) - \widehat{\Phi(y)}\right\|_2^2 \le 3\|\gamma\|_2^2 \quad \text{with prob. } 1 \qquad (3.24)$$

While our measurement vectors may not always be i.i.d. Gaussian in feature space, we note that the above proof primarily relied on lower bounding $\sigma_{\min}(G)$ and upper bounding $\|H\|_{op}$, which should also be possible for the $\Phi$ corresponding to many standard kernels.

3.5 Conclusions

We have demonstrated that the kernel trick from machine learning can be used to provide a computationally inexpensive way of using an underlying manifold model for a signal when reconstructing it from compressive sensing measurements. We have demonstrated that the nonlinear sparsity of such signals is often far lower than these signals' sparsity in a basis such as wavelets or Fourier. In this case, the result is a large improvement for our method in signal reconstruction quality for a given number of measurements compared to traditional compressive sensing. At the same time, we have shown that our strategy can

produce recovery results comparable to state-of-the-art manifold-based recovery methods such as NMFA, while reducing overall computation time by 1-2 orders of magnitude and providing a much simpler algorithm. Finally, we have proved a bound on the error of our reconstructed signal, showing that the reconstruction error of our method depends primarily on the extent to which the signal is well represented by the nonlinear sparsity model. We see this work as paving the way for the practical use of manifold models in compressive sensing.

Chapter 4

Compressive Principal Component Recovery via PCA on Random Projections

In this chapter, we consider the problem of learning the inherent low-dimensional structure of a collection of high-dimensional signals using only compressive sensing measurements, when the original high-dimensional data is not available. In particular, we focus on the problem of finding the center and principal components of a collection of data points from random measurements of them. We will show that when the usual PCA algorithm is applied to low-dimensional random projections (e.g. from Gaussian random measurements) of each data sample, it will often return the same center (up to a known scaling) and principal components as it would for the original dataset. We can then use the learned PCs as a basis for reconstruction of individual signals from their random measurements, improving the reconstruction results, as well as to improve the results of other signal processing tasks such as classification and recognition.

This chapter is organized as follows. Section 4.1 presents the notation and assumptions that we will be using throughout the chapter. Section 4.2 provides an analysis of the center of randomly projected data compared to that of the original data, while Section 4.3 analyzes principal components found using normal PCA on randomly projected data compared to those of the original data. In Section 4.5, we show experimental results that verify our theoretical conclusions, while benchmarking this approach against other approaches in the literature. In Section 4.6, we show more applications of our approach on hyperspectral images. Finally, in Section 4.7, we present the proofs of the lemmas and theorems in Sections 4.2 and 4.3.

4.1 Notations and Assumptions

We assume that our original data are centered at $\bar{x} \in \mathbb{R}^p$ with principal components $v_1, \ldots, v_d \in \mathbb{R}^p$. The vectors $v_1, \ldots, v_d$ are assumed orthonormal and thus span a $d$-dimensional subspace in $\mathbb{R}^p$. Each data sample can then be represented as

$$x^{(i)} = \bar{x} + \sum_{j=1}^{d}w_{ij}\sigma_j v_j + z_i \qquad (4.1)$$

where $\{w_i\}_{i=1}^{n}$ are drawn i.i.d. from $\mathcal{N}(0, I_d)$, $\{z_i\}_{i=1}^{n}$ are drawn i.i.d. from $\mathcal{N}(0, \epsilon^2 I_p)$ with small $\epsilon \ge 0$, and $\{\sigma_i\}_{i=1}^{d}$ are scalar constants with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_d$. Suppose we have $n$ such data samples $x^{(1)}, \ldots, x^{(n)} \in \mathbb{R}^p$.

Projections and Measurements

We then draw $m$ i.i.d. random vectors $e_j^i \in \mathbb{R}^p$, for $j = 1, \ldots, m$, from $\mathcal{N}(0, \frac{1}{p}I)$ for each $x^{(i)}$. We will project each $x^{(i)}$ onto the subspace spanned by its associated $e_j^i$. In general, we will assume independently generated random vectors for each $x^{(i)}$. This condition is chosen to ensure that the random vectors, taken as a whole, provide information about all of $\mathbb{R}^p$. (So long as $mn \ge p$, the $e_j^i$ will then span $\mathbb{R}^p$ almost surely.) Clearly, if instead the random measurement vectors $e_j^i$ spanned only a strict subspace of $\mathbb{R}^p$, then we would only be able to recover principal components within that subspace; all information about the data's behavior in the orthogonal complement space would have been lost in the measurement process.

However, some alternatives that also allow the $e_j^i$ to span $\mathbb{R}^p$ are possible. We could also use the same random projection across several data points to reduce the number of measurement matrices needed, so long as the projections taken together still provide a basis for the entire space in which the principal components might lie. Our proofs on convergence to the true center/PCs as the number of data samples grows will still hold for this second case; the convergence will merely be slower. Experimental results later in the chapter will also investigate the effect in practice of allowing data points to share the same random projection.

Denoting by $E_i$ the matrix with columns $\{e_j^i\}_{j=1}^{m}$, we see that $E_i \in \mathbb{R}^{p\times m}$. The projection matrix onto the subspace spanned by the columns of $E_i$ is

$$P_i = E_i(E_i^TE_i)^{-1}E_i^T. \qquad (4.2)$$

Hence, the random projection of each $x^{(i)}$ onto the subspace spanned by the columns of $E_i$ is $P_ix^{(i)}$. In the following sections, we will argue that the center and principal components of $\{P_ix^{(i)}\}_{i=1}^{n}$ are remarkably similar to those of $\{x^{(i)}\}_{i=1}^{n}$. We note that if we had typical CS measurements of each $x^{(i)}$ of the form $m_i = E_i^Tx^{(i)}$, then for each $i$,

$$P_ix^{(i)} = E_i(E_i^TE_i)^{-1}m_i. \qquad (4.3)$$

Hence, each random projection $P_ix^{(i)}$ can be recovered directly from the measurements $m_i$ and measurement matrix $E_i$, without knowledge of the original data $x^{(i)}$.

4.2 Recovery of Center via PCA on Random Projections

Prior to finding principal components, the first step in PCA is to estimate the center of the data. In this section, we will first show that a good estimator of the original data's center is the center found from the randomly projected data, i.e. from $\{P_ix^{(i)}\}_{i=1}^{n}$, scaled by the factor $\frac{p}{m}$:

$$\hat{\bar{x}} = \frac{p}{m}\frac{1}{n}\sum_{i=1}^{n}P_ix^{(i)} = \frac{p}{m}\frac{1}{n}\sum_{i=1}^{n}E_i(E_i^TE_i)^{-1}m_i \qquad (4.4)$$

The intuition for the scaling factor comes from the Johnson-Lindenstrauss lemma [81, 35], which shows that a random projection of $\bar{x}$ will have an expected length of $\sqrt{\frac{m}{p}}\|\bar{x}\|$. In turn, this projection then has a component of expected length $\frac{m}{p}\|\bar{x}\|$ when projected a second time back onto the original direction $\frac{\bar{x}}{\|\bar{x}\|}$. Figure 4.1 below illustrates this idea.
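As a minimal illustration of Eqs. 4.3 and 4.4, the following Python/numpy sketch recovers each random projection from its CS measurements and forms the scaled center estimate; the function name and interface are ours.

import numpy as np

def center_from_projections(E_list, m_list, p):
    # E_list[i]: p x m measurement matrix E_i; m_list[i]: length-m vector m_i = E_i^T x^(i)
    n = len(E_list)
    m = E_list[0].shape[1]
    total = np.zeros(p)
    for E, mi in zip(E_list, m_list):
        total += E @ np.linalg.solve(E.T @ E, mi)   # P_i x^(i) = E_i (E_i^T E_i)^{-1} m_i  (Eq. 4.3)
    return (p / m) * total / n                      # scaled center estimate (Eq. 4.4)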

Figure 4.1: Projection of the center onto a random measurement vector, and then projection of this back onto the original center direction.

Convergence of Center Estimator

A theorem showing that this estimator converges to the true center of the original data $\{x^{(i)}\}_{i=1}^{n}$ is presented below, with the proof deferred to Section 4.7.

Theorem 2. Suppose $\{P_i\}_{i=1}^{n}$, $\{x^{(i)}\}_{i=1}^{n}$, etc. are defined as in Section 4.1 with fixed $1 \le m < p$. Then as the number of data samples $n \to \infty$, the center of the randomly projected data converges to the true center of the original data $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x^{(i)}$ almost surely:

$$\lim_{n\to\infty}\frac{p}{m}\frac{1}{n}\sum_{i=1}^{n}P_ix^{(i)} = \bar{x} \qquad (4.5)$$

This result allows recovery of the original data's center. We note that it does not depend on the number of measurements per sample $m$. The intuition for the proof is that the distribution of a random projection of $\bar{x}$ is symmetric about the original direction of $\bar{x}$. Therefore, its expectation is a multiple of $\bar{x}$, and the mean of the randomly projected samples thus converges to a multiple of $\bar{x}$ by the law of large numbers.

Iteration to Improve Results

Although the estimator in Eq. 4.4 converges to the true center, in practice we found that an iterative procedure helped to estimate the center more accurately. The main idea of the iteration is to improve our center estimate gradually by shifting the data and re-estimating the center. More precisely, once we have found the estimated center of the original data $\hat{\bar{x}}$, we will adjust the random projections $P_ix^{(i)}$ to reflect centering the original data. This involves replacing $P_ix^{(i)}$ with $P_i(x^{(i)} - \hat{\bar{x}})$ or, if we are working with CS measurements, replacing $m_i$ with

$$m_i^{\mathrm{centered}} = m_i - E_i^T\hat{\bar{x}} \qquad (4.6)$$

Once we have $m_i^{\mathrm{centered}}$, we then re-estimate the center and re-center again. This iterative procedure works well to reduce center estimation error and converges very quickly. Figure 4.2 shows how fast the estimation error drops with increasing number of iterations. The error is measured as the distance between the estimated and true centers, normalized by the true center's magnitude. Based on our experiments, it drops exponentially and usually takes fewer than 5 iterations to converge.

Figure 4.2: Plot of normalized distance between the estimated center and true center with increasing number of iterations. There are $n = 2000$ points in $\mathbb{R}^{100}$ with 5 significant PCs with $(\sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5) = (20, 15, 10, 8, 6)$, and the measurement ratio is $\frac{m}{p} = 0.05$.
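A sketch of this iterative refinement, building on the center_from_projections helper above (our own naming); each pass re-centers the measurements via Eq. 4.6 and accumulates the correction.

import numpy as np

def iterative_center(E_list, m_list, p, n_iter=5):
    # assumes center_from_projections as sketched earlier in this section
    x_hat = np.zeros(p)
    m_centered = list(m_list)
    for _ in range(n_iter):
        x_hat = x_hat + center_from_projections(E_list, m_centered, p)   # estimate on re-centered data
        m_centered = [mi - E.T @ x_hat for E, mi in zip(E_list, m_list)]  # Eq. 4.6
    return x_hat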

A Simpler Version of the Center Estimator

In order to reduce computation by avoiding calculation of a large matrix inverse, we observe that $E_i^TE_i$ will be close to $I_m$, because each $e_j^i$ in $E_i$ is drawn i.i.d. from $\mathcal{N}(0, \frac{1}{p}I)$ with $p$ large. Substituting $(E_i^TE_i)^{-1} \approx I_m$ in Eq. 4.4 gives a simpler estimator that achieves similar performance,

$$\bar{x}_{\mathrm{simple}} = \frac{p}{m}\frac{1}{n}\sum_{i=1}^{n}E_iE_i^Tx^{(i)} = \frac{p}{m}\frac{1}{n}\sum_{i=1}^{n}E_im_i \qquad (4.7)$$

In fact, Eq. 4.7 achieves similar results in 1-2 orders of magnitude less computation time than our original estimator. This is proven in the following theorem, whose proof is also deferred to Section 4.7.

Theorem 3. Suppose $\{E_i\}_{i=1}^{n}$, $\{x^{(i)}\}_{i=1}^{n}$, etc. are defined as in Section 4.1 with fixed $1 \le m < p$. Then as the number of data samples $n \to \infty$, $\bar{x}_{\mathrm{simple}}$ defined in Eq. 4.7 converges to the true center of the original data $\bar{x}$ almost surely:

$$\lim_{n\to\infty}\frac{p}{m}\frac{1}{n}\sum_{i=1}^{n}E_iE_i^Tx^{(i)} = \bar{x} \qquad (4.8)$$

In order to convey the main idea and simplify notation, in the rest of this chapter we will assume that the original data samples are centered around the origin, i.e. that we are working with $x^{(i)} - \hat{\bar{x}}$ instead of $x^{(i)}$ and/or $m_i^{\mathrm{centered}}$ instead of $m_i$.

4.3 Recovery of Principal Components via PCA on Random Projections

In typical PCA, the principal components are found as the eigenvectors of the empirical covariance matrix of the data $C_{\mathrm{emp}} = \frac{1}{n}\sum_{i=1}^{n}x^{(i)}x^{(i)T}$, which often comes very close to the true underlying covariance $C_{\mathrm{true}} = \sum_{i=1}^{d}\sigma_i^2v_iv_i^T$.

Consider instead the covariance matrix $C_{\mathrm{proj}}$ of the projected data $\{P_ix^{(i)}\}_{i=1}^{n}$, defined as

$$C_{\mathrm{proj}} = \frac{1}{n}\sum_{i=1}^{n}P_ix^{(i)}(P_ix^{(i)})^T = \frac{1}{n}\sum_{i=1}^{n}E_i(E_i^TE_i)^{-1}m_im_i^T(E_i^TE_i)^{-1}E_i^T. \qquad (4.9)$$

We propose the following estimator for $C_{\mathrm{true}}$:

$$C_P = k_1C_{\mathrm{proj}} \qquad (4.10)$$

where $k_1$ is a needed scaling factor, $k_1 = \frac{p(p+2)}{m(m+2)}$, similar to the scaling factor $\frac{p}{m}$ we needed for the center estimator. This estimator $C_P$ of $C_{\mathrm{true}}$ will turn out to have (in the limit) the same principal components $v_1, \ldots, v_d$ as $C_{\mathrm{true}}$, and similar, albeit slightly different, eigenvalues. We will prove this later.

Intuition behind Principal Component Recovery

To gain some intuition for this, consider a simple example in which we take two-dimensional random projections of data points in $\mathbb{R}^p$ generated from one principal component, for varying $p$ (see Figure 4.3). We can see that in all cases the projected points have the same principal component as the original points, and that the projected points are nicely symmetrically distributed around the original principal component. Although random projection scatters the energy of a principal component into other directions, we observe that there is a limit to the amount of energy that can be scattered. Where the principal angle between the random subspace and the PC is small, the projection onto the random subspace is able to maintain most of the energy of the original data. However, where the random subspace is nearly orthogonal to the PC, very little energy in this direction is created. In the middle, where the principal angle between the PC and the random subspace is close to $\pi/4$, the energy in the PC and orthogonal directions is equal and maximum scattering occurs. Hence, we see that overall the random projection process produces energy in the

direction of the PC preferentially over energy in orthogonal directions, and the direction with the most energy remains the original PC direction, even for small $\frac{m}{p}$. Hence, the principal components are unchanged by the scattering process, although their corresponding eigenvalues are changed slightly (and scaled) by the scattering.

Convergence of Principal Component Estimator

Theorem 4. Suppose data samples $\{x^{(i)}\}_{i=1}^{n}$, centered at $\bar{x} = 0$, and $v_1, \ldots, v_d \in \mathbb{R}^p$, the orthonormal principal components, are as defined in Section 4.1. Let us select $v_{d+1}, \ldots, v_p$ so that $v_1, \ldots, v_p$ is an orthonormal basis for $\mathbb{R}^p$. Let $V \in \mathbb{R}^{p\times p}$ be the matrix with $i$th column $v_i$. Then

$$\lim_{n\to\infty}C_P = V\Sigma V^T + \frac{p+2}{m+2}\epsilon^2I \qquad (4.11)$$

where $\Sigma$ is a diagonal matrix,

$$\Sigma = \mathrm{diag}\left(\sigma_1^2 + \sum_{j=1, j\ne 1}^{d}\sigma_j^2k_2,\ \ldots,\ \sigma_d^2 + \sum_{j=1, j\ne d}^{d}\sigma_j^2k_2,\ \sum_{j=1}^{d}\sigma_j^2k_2,\ \ldots,\ \sum_{j=1}^{d}\sigma_j^2k_2\right), \qquad (4.12)$$

and $k_2 = \frac{p-m}{(m+2)(p-1)}$.

In other words, $C_P$ preserves the eigenvectors of $C_{\mathrm{true}}$, but the eigenvalues are changed somewhat to reflect the scattering of energy into other directions by the random projection process. In particular, each of the top $d$ eigenvalues $\lambda_i$ of $C_P$ contains two parts: its original value $\sigma_i^2$ and energy from all the other directions, $\sum_{j=1, j\ne i}^{d}\sigma_j^2k_2$. We note that $k_2 \approx \frac{1}{m}$ for large $p$, so for fixed measurement ratio $\frac{m}{p}$, the scattering effect on the eigenvalues will generally be very small in the case of large $p$. This preservation of the eigenvectors and eigenvalues leads to the following corollary.

Figure 4.3: Randomly projecting the data preserves the principal component. In each of the three panels ((a) $p = 3$, (b) $p = 10$, (c) $p = 50$), there are $n = 3000$ points uniformly distributed on a line in $\mathbb{R}^3$, $\mathbb{R}^{10}$ and $\mathbb{R}^{50}$ respectively. We randomly project each point onto a two-dimensional random subspace, and view two dimensions of the result (the original principal component's and one other). Blue stars are the original points and red circles are the projected points. We observe that the original principal component remains intact even for a very small ratio $\frac{m}{p}$.

Corollary 2. For $m > 0$, $p \ge 2$, as $n \to \infty$, taking the top $l \le d$ eigenvectors of $C_P$ will recover the true top $l$ principal components $v_1, \ldots, v_l$ of the original data, provided also $\sigma_l^2 > \sigma_{l+1}^2$.

Proof. Clearly, $v_1, \ldots, v_l$ are recovered from $C_P$ as $n \to \infty$ as long as $\Sigma_{rr} > \Sigma_{ss}$ for all $r, s$ such that $r \le l$ and $s > l$. Now, since

$$\Sigma_{rr} - \Sigma_{ss} = \begin{cases}\sigma_r^2(1 - k_2), & \text{if } s > d \\ (\sigma_r^2 - \sigma_s^2)(1 - k_2), & \text{if } s \le d\end{cases} \qquad (4.13)$$

we need only show that $1 - k_2 > 0$ to prove the conclusion. We note that

$$1 - k_2 = 1 - \frac{p-m}{(m+2)(p-1)} = \frac{mp + p - 2}{(m+2)(p-1)}. \qquad (4.14)$$

From this, we can easily verify that when $m > 0$ and $p \ge 2$, we indeed have $1 - k_2 > 0$, and the desired conclusion follows.

Indeed, we see again that if we fix the measurement ratio $\frac{m}{p}$, then as $p$ becomes large, $k_2 \approx \frac{1}{m} \to 0$ and $1 - k_2 \to 1$, and thus the differences between the original eigenvalues are preserved. This can be used for dimensionality estimation, and experimental verification of this will be shown in Section 4.5.

As another note, one might initially be concerned by the factor $\frac{p+2}{m+2}$ in the noise term, which is (approximately) inversely proportional to the measurement ratio $\frac{m}{p}$. However, this is due to the way we have defined $\epsilon$, so that $\epsilon^2$ is the power in each dimension of the noise. With this definition, the power in the noise does indeed scale with increasing $p$. If we had instead defined $\epsilon^2$ as the overall power of the noise, causing each dimension to have noise power $\frac{\epsilon^2}{p}$, then this noise term would change to $\frac{p+2}{p(m+2)}\epsilon^2$ and would not scale with $\frac{1}{m/p}$.
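The limit in Theorem 4 can be checked numerically. The short simulation below is purely illustrative and uses parameter values of our own choosing: it generates data according to Eq. 4.1 (with zero center), forms $C_P$, and compares its top eigenvalues with the values predicted by Eqs. 4.11-4.12; with a large enough $n$ the two should agree to within sampling error.

import numpy as np

rng = np.random.default_rng(0)
p, m, n, d = 50, 10, 20000, 3
sigmas = np.array([10.0, 6.0, 3.0])
eps = 0.5
V = np.linalg.qr(rng.standard_normal((p, d)))[0]       # orthonormal v_1, ..., v_d

k1 = p * (p + 2) / (m * (m + 2))
C_P = np.zeros((p, p))
for _ in range(n):
    x = V @ (sigmas * rng.standard_normal(d)) + eps * rng.standard_normal(p)   # Eq. 4.1 with zero center
    E = rng.standard_normal((p, m)) / np.sqrt(p)       # columns drawn from N(0, I/p)
    Px = E @ np.linalg.solve(E.T @ E, E.T @ x)         # P_i x^(i)
    C_P += np.outer(Px, Px)
C_P = k1 * C_P / n                                     # Eqs. 4.9-4.10

k2 = (p - m) / ((m + 2) * (p - 1))
predicted = sigmas**2 + k2 * ((sigmas**2).sum() - sigmas**2) + (p + 2) / (m + 2) * eps**2
print(np.sort(np.linalg.eigvalsh(C_P))[::-1][:d])      # empirical top-d eigenvalues of C_P
print(predicted)                                       # prediction from Theorem 4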

Using Magnitude of Eigenvalues to Determine Dimension

As we have seen in Theorem 4, the eigenvalues of our new $C_P$ are not exactly the same as the true eigenvalues, due to the scattering of energy into other directions by the random projection process. However, the difference between $\Sigma_{dd}$ and $\Sigma_{(d+1)(d+1)}$ can still be used as an indicator to determine the underlying dimension $d$. As noted in Theorem 4, we expect the eigenvalues of $C_P$ to be close to those of the original matrix $C_{\mathrm{true}}$, corrupted by a small effect of energy scattering into other directions. We will check how faithful the eigenvalues of the projected covariance matrix $C_P$ are to the eigenvalues of the original covariance matrix $C$ in Section 4.5.

Further Improving the Principal Components Estimation

Although Theorem 4 indicates that the principal components of the original data can be recovered directly from normal PCA on the randomly projected data, we found that the results were improved slightly in practice by estimating the principal components $v_1, \ldots, v_d$ one by one instead of all at once. To do this, we obtain $\hat{v}_1$ as the first eigenvector of the covariance matrix $C_P$ as expected, but then use least squares as in [117, 58] to estimate the coefficient $\hat{\beta}_{i,1} = \hat{w}_{i1}\sigma_1$ of $v_1$ for each original data sample $x^{(i)}$:

$$\hat{\beta}_{i,1} = (E_i^T\hat{v}_1)^{+}m_i = (\hat{v}_1^TE_iE_i^T\hat{v}_1)^{-1}\hat{v}_1^TE_im_i$$

We then subtract $\hat{\beta}_{i,1}E_i^T\hat{v}_1$ from each $m_i$ (equivalent to subtracting $\hat{\beta}_{i,1}\hat{v}_1$ from each $x^{(i)}$) to get the randomly projected data after removal of the first PC from the original data set. We then repeat the process $l - 1$ times to recover $\hat{v}_2, \ldots, \hat{v}_l$, each time estimating the next PC and then removing all PCs estimated so far, using least squares to jointly estimate their coefficients.
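A compact sketch of this sequential estimate-and-deflate procedure, again with our own function names; it assumes the measurements have already been centered as in Eq. 4.6.

import numpy as np

def estimate_pcs(E_list, m_centered, p, l, k1):
    # E_list[i]: p x m matrix E_i; m_centered[i]: centered measurements of x^(i); l: number of PCs
    n = len(E_list)
    V = np.zeros((p, 0))
    residual = [mc.copy() for mc in m_centered]
    for _ in range(l):
        C_P = np.zeros((p, p))
        for E, mc in zip(E_list, residual):
            Px = E @ np.linalg.solve(E.T @ E, mc)          # projection of the deflated sample
            C_P += np.outer(Px, Px)
        C_P *= k1 / n
        w, U = np.linalg.eigh(C_P)
        V = np.column_stack([V, U[:, -1]])                 # append the current top eigenvector
        # jointly re-estimate the coefficients of all PCs so far and deflate the measurements
        residual = [mc - E.T @ V @ np.linalg.lstsq(E.T @ V, mc, rcond=None)[0]
                    for E, mc in zip(E_list, m_centered)]
    return V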

Algorithm

The overall algorithm, including center estimation, principal component estimation, and signal recovery, can be seen in Algorithm 2.

4.5 Experimental Results

We have shown that regular PCA on random projections of our data permits recovery of the original data's center and principal components. In this section, we present experimental verification of this for both synthetic and real-world datasets, including video and hyperspectral data. We also compare our approach with other algorithms recently developed in the literature for this purpose, in particular CP-PCA [58], and find that the performance of (appropriately scaled) normal PCA is superior. Finally, we will look at several additional concerns, such as the success of PCA on random projections in extremely high-dimensional spaces, determination of dimensionality via this approach, and performance for random Bernoulli measurements.

4.5.1 Synthetic Example and Effects of Various Parameters

For our first experiment, we synthetically generate data samples $\{x^{(i)}\}_{i=1}^{n} \subset \mathbb{R}^{100}$ with 5 significant underlying principal components. Using the notation of Section 4.1, the five significant principal components have $(\sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5) = (20, 15, 10, 8, 6)$ and $\epsilon = 1$. Each component of the center is drawn from a uniform distribution on $[0, 10)$. Figure 4.4 shows how the center estimated by normal PCA on the randomly projected data (with iterative enhancement as noted above and scaled by $\frac{p}{m}$) compares to the true center for varying measurement ratio $\frac{m}{p}$. The error is measured as the distance between the estimated and true centers, normalized by the true center's magnitude. We verify that this error becomes very small for large $n$, even for a small number of measurements $m$.

Algorithm 2: Compressive Principal Component Analysis

Input: Measurements $\{m_i\}_{i=1}^{n}$, measurement matrices $\{E_i\}_{i=1}^{n}$, desired number of PCs $d$, maximum number of center iterations $T_{\mathrm{iteration}}$.
Output: Center $\hat{\bar{x}}$, principal components $\{\hat{v}_j\}_{j=1}^{d}$, and reconstructed signals $\{\hat{x}^{(i)}\}_{i=1}^{n}$.

STEP 1: Center Estimation
Initialize $\hat{\bar{x}} = 0$; $m_i^{\mathrm{centered}} = m_i$;
for $k = 1 : T_{\mathrm{iteration}}$ do
    $\hat{\bar{x}} = \frac{p}{m}\frac{1}{n}\sum_{i=1}^{n}E_i(E_i^TE_i)^{-1}m_i^{\mathrm{centered}} + \hat{\bar{x}}$;
    for $i = 1 : n$ do
        $m_i^{\mathrm{centered}} = m_i - E_i^T\hat{\bar{x}}$;
    end for
end for

STEP 2: PC Estimation
Initialize $V = 0_{p\times d}$; $\alpha_i = 0_{d\times 1}$;
for $j = 1 : d$ do
    for $i = 1 : n$ do
        $m_i^{\mathrm{remaining}} = m_i^{\mathrm{centered}} - E_i^TV\alpha_i$;
        $P_ix^{(i)} = E_i(E_i^TE_i)^{-1}m_i^{\mathrm{remaining}}$;
    end for
    $C_P = k_1\frac{1}{n}\sum_{i=1}^{n}P_ix^{(i)}(P_ix^{(i)})^T$;
    Set $\hat{v}_j$ equal to the eigenvector corresponding to the largest eigenvalue of $C_P$;
    Set $V(:, j) = \hat{v}_j$ and $\alpha_i = (E_i^TV)^{+}m_i^{\mathrm{centered}}$;
end for

STEP 3: Signal Reconstruction
for $i = 1 : n$ do
    $\alpha_i = (E_i^TV)^{+}m_i^{\mathrm{centered}}$;
    $\hat{x}^{(i)} = V\alpha_i + \hat{\bar{x}}$;
end for
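Step 3 is a per-sample least-squares fit followed by adding back the center; a minimal sketch (our own naming, with $V$ and the center assumed to come from Steps 1-2):

import numpy as np

def reconstruct_signals(E_list, m_centered, V, x_bar):
    # E_list[i]: p x m matrix E_i; m_centered[i]: centered measurements; V: p x d estimated PCs
    X_hat = []
    for E, mc in zip(E_list, m_centered):
        a, *_ = np.linalg.lstsq(E.T @ V, mc, rcond=None)   # alpha_i = (E_i^T V)^+ m_i^centered
        X_hat.append(V @ a + x_bar)                        # x_hat^(i) = V alpha_i + center
    return np.array(X_hat)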

Figure 4.4: Plot of MSE between the estimated center and the true center for varying $n$ and $\frac{m}{p}$.

To evaluate error for the principal components, we use the magnitude of the normalized inner product between each principal component obtained from random projections and the corresponding true principal component. Clearly, a value near one for this quantity reflects that the principal components are nearly identical, while a value near zero reflects nearly orthogonal recovered principal components.

It is informative to look at this success metric while varying the various factors involved in our model. Figures 4.5(a,b,c) show the results for the first 5 significant principal components for varying measurement ratio $m/p$ (number of measurements over dimension of the space), varying number of samples $n$, and varying noise level $\epsilon/\sigma_1$, respectively. We see that when the number of samples is reasonably large ($n = 2000$), the measurement ratio can be quite low ($\approx 0.1$) for successful PC recovery. At the same time, for a reasonable measurement ratio ($\approx 0.2$), even a relatively small number of samples ($n = 500$) can ensure successful recovery. Finally, the method stays robust to noise in the measurements until the power in the noise approaches the same order of magnitude as the largest principal component. Although our theoretical results were for the case of an asymptotically large number of samples $n$, we see that in practice $n$ does not have to be very large for successful recovery.

Based on our initial intuition of Figure 4.3, we might worry that these results fall apart

Figure 4.5: Plots of normalized inner product magnitude between estimated PCs and the corresponding true PCs for (a) varying measurement ratio $\frac{m}{p}$ with $n = 2000$, (b) varying number of data points $n$ with $\frac{m}{p} = 0.2$, and (c) varying noise ratio $\epsilon/\sigma_1$.

if the dimension of the space is sufficiently high. For this reason, we examine the results of our synthetic example for a fixed number of measurements $m$ as we let $p \to \infty$, and also for a fixed measurement ratio $m/p$ as $p \to \infty$. The results are shown in Figure 4.6(a) and (b), respectively. We see that for fixed $m$ and fixed number of samples $n$, the ability to successfully recover PCs does indeed deteriorate as $p \to \infty$, as would be expected since the measurement ratio $m/p$ is approaching 0. However, if we fix the measurement ratio $m/p$ and let $p \to \infty$, a fairer test of our approach's success in high dimensions, we see that PCA's performance actually improves as $p$ increases.

Figure 4.6: Plots of normalized inner product magnitude between the estimated first PC and the corresponding true first PC for (a) fixed $m$, increasing $p$, and (b) fixed $\frac{m}{p}$, increasing $p$. We see that as the dimension of the space $p$ increases, if the number of measurements $m$ is fixed and the measurement ratio $m/p$ is thereby decreasing with increasing $p$, then the performance deteriorates with increasing $p$. On the other hand, if the measurement ratio $m/p$ is fixed, then the performance actually improves with increasing $p$.

Finally, we can check how faithful the eigenvalues of the projected covariance matrix $C_P$ are to the eigenvalues of the original covariance matrix $C$. As noted in Section 4.3, we expect the eigenvalues of $C_P$ to be close to those of the original matrix $C$, corrupted by a small effect of energy scattering into other directions. Indeed, we see that this is the case in Figure 4.7, which shows that the eigenvalues of the projected data are indeed close to the true eigenvalues of the original data, although the values are not exactly the same. This

shows that the trend of the magnitude of the eigenvalues of the projected data can be used as a guideline for determining the underlying dimension $d$ of the original data.

Figure 4.7: Comparison of eigenvalues of the randomly projected data's covariance matrix with the true eigenvalues of the original data's covariance matrix when $\frac{m}{p} = 0.3$. The eigenvalues of the randomly projected data can be used to determine the dimensionality of the original data.

Synthetic Example: Comparison with Previous Methods

Finally, we may compare the approach to that of previously proposed methods for finding principal components from compressive sensing measurements of the data. In particular, we compare with Compressive Projection Principal Components Analysis [58]. Figure 4.8 shows the comparison of our approach with CP-PCA for the synthetic dataset above for measurement ratio $m/p = 0.1$ and $n = 2000$. The data's center is assumed known, since CP-PCA does not find it.

CP-PCA typically uses the same measurement matrix $E_i$ across multiple data samples, both to facilitate the algorithm's performance and to save on the storage of measurement matrices. Hence, we have run our algorithm for this situation as well, so that both algorithms might share the same set of measurements and be compared fairly. The x-axis in each graph

in Figure 4.8 is thus the number of approximately equal partitions (corresponding to different measurement matrices $E_i$) that the data samples have been divided into. Thus, where the number of partitions is one, it means that we have used the same measurement matrix for all 2000 data samples, while where the number of partitions is 2000, it means we have used a different measurement matrix for each data sample.

One further concern is that our strategy of estimating one principal component at a time, then removing its contribution to the projected data before continuing to estimate the others, aids our approach greatly in improving the accuracy of the second, third, fourth, etc. principal components. Meanwhile, CP-PCA does not do this, possibly leading to an unfair advantage for our approach in the results. For this reason, we ran a slightly modified version of CP-PCA that also estimated and removed one PC at a time, but was otherwise the same as typical CP-PCA. As we suspected, this slightly modified version of CP-PCA did perform better on this example. Hence, to promote the fairest possible comparison, we use this slightly enhanced version of the CP-PCA algorithm for the results in Figure 4.8(a)-(e). The plotted results are averaged over 10 random trials each, and did not differ significantly from plots obtained by averaging over 5 random trials each. The jagged appearance of the CP-PCA graphs thus seems to be an actual artifact of the algorithm, of unknown cause.

We see that CP-PCA is comparable for the first PC, but that normal PCA performs much better for the remaining PCs. We note that for higher measurement ratios (e.g. 0.3), CP-PCA and our approach were both able to achieve normalized inner product magnitudes around 1, and the results did not differ much. However, for lower measurement ratios as we have here, it is clear that our approach outperforms CP-PCA.

Finally, we may compare the running time of the two approaches in Figure 4.8(f). (For the sake of being maximally favorable to CP-PCA here, we used regular CP-PCA for this benchmark, since our slight enhancement slowed the algorithm down even while it improved its performance.) While CP-PCA has a superior running time for a small number of partitions, its running time grows linearly with the number of partitions, while that of

88 71 (a) First PC (b) Second PC (c) Third PC (d) Fourth PC (e) Fifth PC (f) Running Time Figure 4.8: (a)-(e) Plot of normalized inner product between the true and estimated PCs for each of the first 5 PCs for our approach vs. CP-PCA on the synthetic dataset. Each graph is a function of the number of partitions corresponding to different measurement matrices that we have divided the data samples into. (f) Running time of the two algorithms in seconds as a function of number of partitions.

89 72 normal PCA on the samples remains constant with number of partitions Real-World Data: Comparison with Previous Methods For our next dataset, we examine the Lankershim Boulevard Data. These data are traffic videos recorded by five cameras for 30 minutes at 10 frames per second at Lankershim Boulevard in Los Angeles. This video is an excellent example of a potential real-world application of our work. Considering individual frames of video as data points, we expect such video to have a low-rank structure resulting from the large amount of background that remains the same across successive frames, with small sparse anomalies due to small changes between adjacent frames. In all, this becomes a low-rank-plus-sparse structure as noted in [158]. One can thus imagine taking a small number of compressive sensing random measurements of each frame individually, then using our approach to find the principal components of the total data set. The obtained PCs would then provide a known basis for the low-rank structure shared across many frames, and hence improved recovery of each individual frame could be achieved using the PCs. Here, we use the normal PCA approach on this low-rank-plus-sparse example to extract the background image, which is the center of the data, as well as the first principal component. (Due to the large size of each frame of the raw video, we have resized it to pixels.) The MSE between the estimated center and the true center using different measurement ratios and numbers of points is shown in Figure 4.9, and we also show a sample visual comparison between the true center image and the center estimated with our method in Figure 4.11(a,b). We can see that they are almost identical. We also compare the first 5 principal components as estimated using normal PCA vs. CP-PCA [58] for known center. Here, we have attempted to pick the optimal number of partitions for CP-PCA, which was about 50 for CP-PCA. The results, in the form of the normalized inner product between the estimated and true principal components, are shown in Figure We see that using normal PCA to estimate the principal components for this

90 73 Figure 4.9: Normalized MSE between the estimated center and the true center for different n and m for the Lankershim Boulevard data. p

91 74 dataset results in more accurate estimated principal components than CP-PCA [58]. (a) (b) Figure 4.10: The normalized inner product magnitude between the first 5 estimated principal components and the true first 5 principal components for Lankershim Boulevard data. (a) Normal PCA on the Randomly Projected Data vs. (b) Compressive-Projection Principal Components Analysis (CP-PCA) [58]. We also show the visual results of our approach. Here, the center is roughly the background image, which is shown in Figure 4.11(a,b). The first PC meanwhile seems to represent a traffic trend. The visual comparison of the true first PC of the video frames and that estimated by our approach are shown in Figure 4.11(c,d) The Case of Random Bernoulli Measurements As one final note, although we have focused our analysis on projections onto Gaussian random vectors, it appears that normal PCA is able to recover the center and principal components of data from projections onto Bernoulli random vectors as well. We return to the synthetic example of Section 4.5.1, replacing the measurement matrices E i with new ones in which each element is i.i.d. and set to +1 with probability 1 2 and 1 otherwise. Repeating our experiments from before yields Figure 4.12, which shows normal PCA is capable of recovering the center and PCs from random Bernoulli measurements as well as random Gaussian. We hope to explore this phenomenon further in future work. In particular

92 75 (a) True Center (b) Estimated Center (c) True PC (d) Estimated PC Figure 4.11: Lankershim Boulevard video visual results: (a) The original background image. (b) The estimated background image when measurement ratio m = 0.1. (c) The true first p PC, which appears to be a traffic trend along the roadway. (d) The estimated first PC for our approach when m = 0.1. p

93 we wish to give some theoretical analysis why random Bernoulli measurements seem to work as well as random Gaussian Application to Hyperspectral Images Hyperspectral images [135] are collections of images taken in different electromagnetic spectrum bands, from visible to infrared. Because they cover such a wide range of spectral bands, hyperspectral images provide rich information about the objects being imaged. Thus, this type of imaging has become widely used in a variety of application areas such as agriculture [56], mineralogy [14] and environmental studies [135]. Hyperspectral images are usually taken by satellites or remote sensors, which may have very little power and limited computation capabilities. Hence, the resources available for taking measurements are severely limited. One wishes to reduce the initial samples the remote sensors need to take, shifting the computational burden to the more powerful base stations, and at the same time keeping the same rich information of the original hyperspectral images Image Reconstruction We first examine a potential application: reconstructing hyperspectral images from CS measurements. Here the spectral signatures acquired in various pixels have a shared low-rank structure due to a mix of relatively few materials across the hyperspectral image. This can be discovered via compressive PCA and exploited to produce improved recovery of individual spectra. As in [58], we use the Cuprite and Jasper Ridge image datasets with p = 224 spectral bands and n = samples (image pixels) for each. In each case, we use normal PCA on random projections of the data to estimate the original data s principal components, followed by least squares estimation of the coefficients of each spectra within the principal components basis to estimate the original data point (see [117, 58]). We compare the average SNR (see [58] for details) of the resulting reconstructed hyperspectral images with that ob-

94 77 (a) (b) (c) (d) Figure 4.12: (a) Plot of normalized error measure between the estimated center and the true center using Bernoulli measurements for varying n and m. (b,c,d) Plots of normalized inner p product magnitude between estimated PCs and the corresponding true PCs using Bernoulli measurements for (b) varying measurement ratios m for n = 2000 and (c) varying number p of data points n when m = 0.2, and (d) varying noise ratio ɛ/σ p 1.

95 78 tained reconstructing using CP-PCA [58] and MTBCS [79]. Again, we attempt to choose the best number of partitions 500 for CP-PCA. Average SNR of the reconstructed hyperspectral images as a function of different measurement ratios m/p is shown in Figure We see that the SNR using our approach is higher than that using the other two methods. (a) Cuprite Dataset (b) Jasper Ridge Dataset Figure 4.13: Plots of average SNR of reconstructed hyperspectral images for various measurement ratios Image Source Separation In this section, we will examine another potential application, that of source separation in hyperspectral imagery. In many cases, one would like to use hyperspectral imaging to separate out different types of materials (e.g. soil, water, vegetation) in a hyperspectral image for purposes such as environmental monitoring or defense surveillance. Typically, this task is treated as a source separation problem and solved via Independent Component Analysis (ICA) [74] on the full hyperspectral imaging data. We wish to examine whether these sources can be isolated instead by using ICA on a fraction of the data acquired via compressive sensing measurements. This would allow us to perform this important material identification and separation task in a more data-efficient way, putting less stress on available sensing resources.

96 We compare the results obtained by directly applying ICA on the original images vs. those obtained via ICA on the images reconstructed using Algorithm 2 in Section with only m p = 0.3 measurements. Here we use the Joint Approximate Diagonalization of Eigenmatrices algorithm [26] to implement Independent Component Analysis. We can see from Figure 4.14 that the sources classified from the reconstructed images are almost identical to the true sources. Moreover, if we measure the similarity of the sources classified in these two situations with correlation coefficients, all space correlation coefficients ρ, between the results obtained by directly applying ICA on the original images and those obtained via ICA on the images reconstructed using Algorithm 2 with only m p = 0.3 measurements, are larger than It indicates that we can achieve very good performance in hyperspectral image source separation using only m p = 0.3 measurements, which significantly reduces the stress placed on available sensing resources. If we compare our approach with CP-PCA, our approach with all space correlation coefficients, between the results obtained by directly applying ICA on the original images and those obtained via ICA on the images reconstructed using Algorithm 2 with only m = 0.3 measurements, ρ > 0.99 vs. CP-PCA with ρ = p (0.98, 0.91, 0.99, 0.94, 0.95) respectively. 4.7 Proofs of Theoretical Results In this section, we present proofs of Theorems 2, 3 which show that the center of the low-dimensional random projections of the data converges to the true center of the original data (up to a known scaling factor) almost surely as the number of data samples increases. We then show the proof of Theorem 4 that the top d eigenvectors of the randomly projected data s covariance matrix converge to the true d principal components of the original data as the number of data samples increases. Moreover, both of the above conclusions are true regardless of how few dimensions we use for our random projections (i.e. how few CS Gaussian random measurements we take of each data sample).

97 Figure 4.14: Source separation using independent component analysis. (a) ICA on original images. (b) ICA on reconstructed images using our approach with m p =

98 Proofs of Lemmas for Theorem 2 We start by introducing two lemmas that will in the proof of Theorem 2. The first lemma shows that the distribution of P i x is unchanged when reflected across x. Lemma 1 (Symmetry of the distribution of P i x under reflection across x). Suppose x is a fixed point in R p and let P i x be a random vector with P i defined as in Section 4.1. Define the reflection operator R x as R x (y) = y + 2 ( y, ˆx ˆx y) = 2 y, ˆx ˆx y (4.15) where ˆx = x x. Then the distribution of P ix is the same as the distribution of R x (P i x). Proof Suppose e 1,..., e m are random variables representing the columns of the matrix E i so that each e i is drawn i.i.d. from a Gaussian distribution as in Section 4.1. Then, for every realization e 0 1,..., e 0 m of the random variables e 1,..., e m, there is an equally likely realization R x (e 0 1),..., R x (e 0 m). This can be easily seen from the fact that the Gaussian distribution N (0, I p p ) is symmetric across any line through the origin of R p. Then, defining P (e 0 1,...,e 0 m)(x) as the projection of x onto the subspace spanned by e 0 1,..., e 0 m, we will show that P (Rx(e 0 1 ),...,Rx(e0 m )) (x) = R x (P (e 0 1,...,e 0 m ) (x)). (4.16) That is, the projection of x onto the reflected vectors R x (e 0 1),..., R x (e 0 m) is the reflection of that onto the original random vectors e 0 1,..., e 0 m. To show this, we first observe three properties of R x : Property 1: R x is a linear operator: R x (αa + βb) = 2 αa + βb, ˆx ˆx αa βb = α (2 a, ˆx ˆx a) + β (2 b, ˆx ˆx b) = αr x (a) + βr x (b)

99 Property 2: The operator R x preserves inner products (and hence norms as well): 82 R x (a), R x (b) = 2 a, ˆx ˆx a, 2 b, ˆx ˆx b = a, b Property 3: For any orthonormal u 1,..., u k and any b, the projection of R x (b) onto R x (u 1 ),..., R x (u k ) is the reflection of that of b onto u 1,..., u k : P (Rx(u1 ),...,R x(u k )) (R x (b)) k = R x (b), R x (u j ) R x (u j ) = j=1 k b, u j R x (u j ) j=1 ( k ) = R x b, u j u j j=1 ( = R x P(u1,...,u k )(b) ) Using the above three properties, we can easily see that if we perform Gram-Schmidt orthogonalization on e 0 1,..., e 0 m to obtain orthonormalized vectors u 1,..., u m, then performing Gram-Schmidt orthogonalization on R x (e 0 1),..., R x (e 0 m) must result in R x (u 1 ),..., R x (u m ). To see this, we note that Gram-Schmidt involves two alternating steps: (i) we subtract from the currently selected vector its orthogonal projection onto those orthonormal vectors already obtained and (ii) we scale the resulting vector by 1 over its norm. Suppose that we start with the two sets of vectors e 0 1,..., e 0 m and R x (e 0 1),..., R x (e 0 m). We note that the second set are initially the reflections of the first set. If we run the steps of Gram-Schmidt on the two sets of vectors simultaneously, then each step of Gram-Schmidt preserves the property that the second set of vectors are the reflections of the first set. In the case of step (i), the orthogonal projections that we subtract off from the second set are reflections by Property 3 above of those we subtract off from the corresponding vector in the first set.

100 83 Then, the linearity of R x (Property 1 above) guarantees that the resulting difference vector in the second set is a reflection of that obtained for the first set. In the case of step (ii), the norms we divide by are equal (Property 2 above). Hence, using Property 3 above and the fact that R x (x) = x, we have P (Rx(e 0 1 ),...,Rx(e0 m )) (x) = P (Rx(u 1 ),...,R x(u m))(x)) = R x (P (u1,...,u m)(x)) = R x (P (e 0 1,...,e 0 m)(x)) Finally, since for every realization e 0 1,..., e 0 m of the random variables e 1,..., e m, resulting in the projection P i x = P (e 0 1,...,e 0 m ) (x), there is an equally likely realization R x (e 0 1),..., R x (e 0 m), resulting in the projection P (Rx(e 0 1 ),...,Rx(e0 m )) (x) = R x (P (e 0 1,...,e 0 m ) (x)) = R x (P i x), the probability distribution f of P i x satisfies f(p i x) f (R x (P i x)). Similarly, since for every realization R x (e 0 1),..., R x (e 0 m), resulting in the projection R x (P i x), there is an equally likely realization R x (R x (e 0 1)),..., R x (R x (e 0 m)) = e 0 1,..., e 0 m resulting in the projection R x (R x (P i x)) = P i x, we have that f (R x (P i x)) f(p i x). These inequalities show that f(p i x) = f (R x (P i x)), which proves Lemma 1. We also make use of Theorem 1.1 from [61], which for convenience we restate here as Lemma 2. Lemma 2. Let e 1,..., e m R p be m points i.i.d. drawn from N (0, 1I p p). If m < p then the vectors {e i } m i=1 span a m-dimensional linear subspace of R p almost surely. This subspace is

101 then called a random m-space in R p. Let H be a random m-space in R p, L be a fixed 1-space in R p and θ be the principal angle between H and L. 84 The random variable cos 2 θ has the beta distribution β ( m, ) p m Proofs of Theorem 2 and 3 for Convergence of Center Estimator We combine the above two lemmas in the proof of Theorem 2, regarding convergence of the center estimator to the true center, from Section 4.2. Proof of Theorem 2. Because {P i x (i) } n i=1 are i.i.d., we will focus on evaluating E ( P i x (i)) so that we may use the law of large numbers to show Eq Because w ij, P i, and z i are independent with E(w ij ) = 0 and E(z i ) = 0, ( )) d E(P i x (i) ) = E (P i x + w ij σ j v j + z i j=1 d = E(P i x) + σ j E(w ij )E(P i v j ) + E(P i )E(z i ) j=1 = E(P i x) (4.17) From Lemma 1, the distribution of P i x is the same as the distribution of R x (P i x). Thus, E(P i x) = E (R x (P i x)). Hence, E(P i x) = 1 2 E (P i x + R x (P i x)) ( ) 1 = E x P i x, x x (4.18) 2 Now, P i x is the projection of x onto a random m-space. Suppose the principal angle between this space and the span of x is θ. Then P i x, x = P i x x cosθ = x 2 cos 2 θ (4.19)

102 85 Thus, from Lemma 2 and (4.19), we have ( ) 1 E x P i x, x = E(cos 2 θ) = m 2 p (4.20) Combining Eq. 4.17, 4.18 and 4.20, we have that E ( P i x (i)) = m p x (4.21) Theorem 2 follows from the law of large numbers. The proof of Theorem 3 from Section 4.2 (convergence of the simplified center estimator) can be achieved without using Lemma 1 and 2 as follows. Proof of Theorem 3. Because w ij, P i, and z i are independent with E(w ij ) = E(z ij ) = 0, then = E ) E (E i E it x (i) ( (E i E it x + ( ) = E E i E it x + ( ) = E E i E it x )) d w ij σ j v j + z i j=1 d σ j E(w ij )E(E i E it v j ) + E(E i E it )E(z i ) j=1 (4.22) Since each e i j N (0, Ip ), we have p

103 86 E(E i E it ) e i e i m1 e i e i 1p = E e i 1p... e i mp e i m1... e i mp m j=1 ei j12... m j=1 ei j1e i jp m j=1 = E ei j2e i.. j m j=1 ei jpe i j1... m j=1 ei jp2 ( ) = m p I p p (4.23) (*): Since each element e i jk is i.i.d. Gaussian distributed with mean 0 and variance 1/p, ( ) m ( all diagonal entries E j=1 ei jk2 = m 1 = m, and all off-diagonal entries E m ) p p j=1 ei jk ei jl = 0 for k l. From Eq and 4.23, we have shown that for all i = 1,..., n, Theorem 3 then follows from the law of large numbers. ) E (E i E it x (i) = m x (4.24) p Proof of Lemma for Theorem 4 To prove Theorem 4, we first introduce another lemma that shows the distribution of P i x as defined in Section 4.1 is unchanged when rotated about the axis of x. Lemma 3 (Symmetry of the distribution of P i x under rotation about x). Suppose x is a fixed point in R p and let P i be as defined in Section 4.1. Let V R p p be an orthogonal

104 87 matrix with first column ˆx = x x Q x = V and let (p 1) 0 (p 1) 1 Q V T where Q is in the special orthogonal group SO p 1, so that Q x represents an arbitrary rotation of R p about x. Then the distribution of P i x is the same as the distribution of Q x (P i x). Proof The proof follows the exact same structure as that of Lemma 1. Similarly, we note that for every realization e 0 1,..., e 0 m of the random variables e 1,..., e m, there is an equally likely realization Q x e 0 1,..., Q x e 0 m, since the Gaussian distribution is rotationally symmetric. Then we would like to show that if we define P (e 0 1,...,e 0 m ) (x) as the projection of x onto the subspace spanned by e 0 1,..., e 0 m, then P (Qxe 0 1,...,Qxe0 m ) (x) = Q x P (e 0 1,...,e 0 m ) (x). (4.25) That is, the projection of x onto the rotated vectors Q x e 0 1,..., Q x e 0 m is the rotation of that onto the original random vectors e 0 1,..., e 0 m. As before, to prove this, we note that Q x is a linear operator. Q x also preserves inner products and norms (i.e. Q x (a), Q x (b) = a, b for all a, b) since V is an orthogonal matrix and Q is in the special orthogonal group SO p 1. Using these two properties, we can show that for any vector b and any orthonormal set u 1,..., u k, we have that P (Qxu1,...,Q xu k ) (Q x b) = = k Q x b, Q x u j Q x u j j=1 k b, u j Q x u j j=1 ( k ) = Q x b, u j u j j=1 = Q x P (u1,...,u k )(b)

105 88 The same argument as before can be used with the above three properties to show if u 1,..., u m is the result of Gram-Schmidt orthogonalization on the vectors e 0 1,..., e 0 m, then Q x u 1,..., Q x u m must be the result of the Gram-Schmidt orthogonalization on Q x e 0 1,..., Q x e 0 m. Finally, using the above and the fact that Q x x = x, we have P (Qx(e 0 1 ),...,Qx(e0 m))(x) = P (Qx(u 1 ),...,Q x(u m))(x) = Q x P (u1,...,u m)(x) = Q x (P (e 0 1,...,e 0 m)(x)) Since for every realization e 0 1,..., e 0 m of the random variables, e 1,..., e m, resulting in the projection P (e 0 1,...,e 0 m ) (x) = P i x, there is an equally likely realization Q x (e 0 1),..., Q x (e 0 m), resulting in the projection P (Qx(e 0 1 ),...,Qx(e0 m))(x) = Q x P (e 0 1,...,e 0 m)(x) = Q x P i x (4.26) we see that the probability distribution f of P i x satisfies f(p i x) f(q x P i x) Moreover, noting that the matrix Q 1 x = V (p 1) 0 (p 1) 1 Q 1 V T has the same properties as Q x, we can see that for every realization Q x e 0 1,..., Q x e 0 m of e 1,..., e m, resulting in the projection Q x P i x, there is an equally likely realization Q 1 x Q x e 0 1,..., Q 1 x Q x e 0 m = e 0 1,..., e 0 m, resulting in the projection P i x. We therefore also have f(q x P i x) f(p i x). These inequalities show that f(p i x) = f(q x P i x)

106 89 which proves Lemma Proof of Theorem 4 for Convergence of Principal Component Estimator We will apply the Lemma 3 to the proof of Theorem 4 regarding convergence of the principal component estimator to the true principal component. Proof of Theorem 4. Since the data is assumed centered with x = 0, x (i) = d w ij σ j v j + z i. j=1 Thus, since all the w ij and z i are independent and zero-mean, we can show that ( ) E P i x (i) x (i)t Pi T = d j=1 σ 2 j E ( P i v j v T j P T i ) ( ) + E Pi z i z T i Pi T. (4.27) Let s first analyze a single term C 1 = E ( ) P i v 1 v1 T Pi T. Considering Pi v 1 as a random variable, we can define P v1 (P i v 1 ) = v 1, P i v 1 R 1, P v (P i v 1 ) = ( v 2, P i v 1,..., v p, P i v 1 ) T R p 1. We will abbreviate P v1 (P i v 1 ) by P v, and P v (P i v 1 ) by P v when no confusion will arise. Then, P i v 1 = V C 1 = V E P v 2 Pv T P v P v P v P v P T v P v P T v (4.28) V T (4.29) We now proceed to evaluate the four terms in the block matrix in Eq For the first term E( P v 2 ), we note that our earlier analysis in Eq gives P v = cos 2 θ, where

107 θ is the principal angle between v 1 and the random m-space P i is projecting onto. Thus, from Lemma 2, 90 where k 1 = p(p+2) m(m+2). E ( P v 2) = E ( (cos 2 θ) 2) = m(m + 2) p(p + 2) = 1 k 1 (4.30) To compute the remaining three terms in Eq. 4.29, we take advantage of Lemma 3. From Lemma 3, the distribution of P i v 1 is rotationally symmetric about v 1. This implies that E(P v P T v ) = 0 1 p and E(P v P T v ) = 0 p 1. Furthermore, since the distribution of P i v 1 is rotationally symmetric about v 1, the distribution of P v, the projection of P i v 1 onto the orthogonal complement of v 1, is rotationally symmetric about 0. This implies that E(P v P T v ) is a multiple of the identity. Now consider the trace of E(P v P T v ), trace ( E ( P v P T v )) = E(trace(C1 )) E( P v 2 ) = E( P i v 1 2 ) 1 k 1 = m p 1 k 1 (4.31) since the norm squared of a random m-dimensional projection of a unit vector in R p is well-known to be m p from Johnson-Lindenstrauss [81, 35]. Then, since E(P v P T v ) is a multiple of the identity, we must have E(P v P T v ) = m p 1 k 1 p 1 I p 1 We name this constant k 3 = m p 1 k 1 p 1. Then, from Eq. 4.29, where k 2 = k 3 k 1 = p m (m+2)(p 1). k 1 C 1 = V diag (1, k 2,..., k 2 ) V T (4.32) We may perform a similar analysis for each of the other terms k 1 E(P i v j v T j P T i ) resulting in the same answer, except with 1 occupying the j th entry of the diagonal instead.

108 For the term C ɛ = E ( ) P i z i z T i Pi T in Eq. 4.27, because both zi and {e i j} m j=1 are random with completely isotropic distributions, we know that the distribution of P i z i will also be isotropic. Thus, E ( ) P i z i z T i Pi T is also a multiple of identity. Now consider the trace of E ( ) P i z i z T i Pi T, 91 trace ( E ( )) P i z i z T i Pi T = E( Pi z i 2 ) = m p E z i 2 = m p pɛ2 = mɛ 2 (4.33) and thus Then we have From Eq. 4.27, 4.32, and 4.34, we have E ( ) P i z i z T i Pi T m = p ɛ2 I k 1 E ( ) P i z i z T i Pi T p + 2 = m + 2 ɛ2 I (4.34) ( ) k 1 E P i x (i) x (i)t Pi T = d j=1 σ 2 j k 1 E ( P i v j v T j P T i ) + k1 E ( ) P i z i z T i Pi T = V ΣV T + p + 2 m + 2 ɛ2 I (4.35) where Σ is defined in Eq Since the terms {P i x (i) x (i)t P T i } n i=1 are i.i.d., Theorem 3 then follows from the law of large numbers. 4.8 Conclusions We have demonstrated, both through theoretical analysis and experimentally, that PCA performed on low-dimensional random projections of the data recovers both the center and the principal components of the original data quite well, indeed better than previous approaches in the literature for recovering principal components from compressive sensing measurements. We have further showed that it can be used to estimate dimensionality of

109 92 the original data and to improve reconstruction results for a collection of data (e.g. video frames or hyperspectral images) with shared structure from compressive sensing measurements.

110 Chapter 5 Bounds on the Convergence Rate for our PCA Estimators In Section 4.7, we showed that the center of the low-dimensional random projections of the data converges to the true center of the original data (up to a known scaling factor) almost surely as the number of data samples increases, and we also showed that the top d eigenvectors of the randomly projected data s covariance matrix converge to the true d principal components of the original data almost surely as the number of data samples increases. However, it is also important to know the convergence rates of our estimators, i.e. how fast do our estimators converge to their true values with respect to the number of points n and the measurement ratio m. The convergence rates can provide more detailed information p about the number of points n we need to guarantee small error for a given measurement ratio m, and vice versa. Thus, in this chapter, we focus on providing several theorems about p the convergence rates of our center, covariance matrix, and PC estimators. 5.1 Convergence Rate of the Center Estimator The following theorem establishes the convergence rate of the center estimator, with the proof deferred to Section 5.4. Theorem 5. Consider the center estimator ˆ x = p 1 n m n i=1 P ix (i), using the problem set-up

111 and notation given in Section 4.1. Then, for any η > 0, we have that ( ) ( ˆ x x 2 P rob η p d ) i=1 σ2 i + pɛ 2 + p m. (5.1) x 2 mnη 2 x 2 2 p Or, we can bound the absolute error as P rob ( ˆ x x 2 η ) p mnη 2 ( d i=1 94 ) σi 2 + pɛ 2 + p m x 2 2. (5.2) p From Theorem 5, we see that for a fixed error bound η, as the number of points n and measurement ratio m p increase, the error probability decreases at speed 1 and 1 m. This n p matches our results from Chapter 4, in which we showed that as n, the probability will go to zero for any fixed η. We further note that the error bound depends on the product mn (total number of measurements taken across all points) more than m or n individually. As long as the product mn is large enough, the error bound will be small. This can be achieved either by taking a large number of points n with very few compressive sensing measurements m of each sample, or by taking a large number of compressive sensing measurements m of each sample with very few data points n. Moreover, the power of the signals d i=1 σ2 i and the power of the noise pɛ 2 also play a important role here. As the power of the noise pɛ 2 increases, we expect that the error probability will increase due to noise s adverse effect on estimating the center. However, we also see that the error probability also increases with increasing power of the signals, i.e. with d i=1 σ2 i. This initially surprising result occurs because it is preferable from the point of view of estimating the center that all the data points are close to the center. If the PCs have more variation and thus the data points are more scattered, this makes the center estimation harder. Thus, in center estimation, the power of the signals behaves much like the power of the noise.

112 95 Because we use the Chebychev inequality (see Section 5.4 for details), instead of full information about the probability distribution, to derive our bound, the bound is not tight in comparison to the empirical error probability values. We can compare our error probability bound with empirical error probability in Figure 5.1, from which we can see that our bound is not tight. Figure 5.1: Comparison ( ) of our bound with empirical values ( d in probability. ) The X-axis represents P ˆ x x 2 p x 2 η and the Y -axis represents mnη 2 i=1 σ2 i +pɛ2 + p m. Here the x 2 2 p red line Y = X is the reference line for easy comparison. Here we use the synthetic data with p = 100, n = 1000 and m = 0.1 in 1000 trials. Based on experiments, we do not see p significant variations in the results when we change the parameters (m,n,p) slightly. 5.2 Convergence Rate of the Covariance Matrix Estimator In this section, we will present a theorem about the convergence rate of the covariance matrix estimator, with the proof deferred to Section 5.4 as well. Theorem 6. Using the notation of Chapter 4, consider the covariance matrix estimator 1 C P = k n 1 n i=1 P ix (i) (P i x (i) ) T and let C = V ΣV T + p+2 m+2 ɛ2 I from Theorem 4. Then, for any η > 0, we can bound the error of C P as an estimator of C as ( CP C F P rob C F ) η 1 ( a ) nη 2 b 1. (5.3)

113 96 Or, we can bound the absolute error as P rob( C P C F η) 1 (a b) (5.4) nη2 where a = k 1 ( 2h + s 2 + (2p + 4)ɛ 2 s + (p 2 + 2p)ɛ 4 ) and b = (1 k 2 ) 2 h + Here k 1 = p(p+2) m(m+2), k 2 = ( ) ( ) 2 ( ) 2 p + 2 p + 2 2k 2 + (p 2)k2 2 s ɛ 2 s + p ɛ 4 m + 2 m + 2 p m, s = d (m+2)(p 1) j=1 σ2 j, and h = d j=1 σ4 j. We see that both a and b are very complex expressions. In order to clearly see the factors influencing the bound, we will make some simplifications for the noiseless case. We will set ɛ = 0 and show the following corollary. Corollary 3. Using the notation in Theorem 6, if 0 < m p, p 2 and ɛ = 0, we have ( CP C F P rob C F ) η < 1 p 2 nη 2 m 2 ) (2 + s2 h (5.5) Proof. We can bound a b 1 in Eq. 5.3 as a b 1 = k 1 (2h + s 2 ) ( 1 (1 k 2 ) 2 h + 2k 2 + (p 2)k2 )s 2 2 < (a) < < (b) p(p + 2) 2h + s 2 m(m + 2) h 2k 2 h + k2h 2 + 2k 2 s 2 + (p 2)k2s 2 2 p(p + 2) 2h + s 2 m(m + 2) h + k2h 2 + (p 2)k2s 2 2 p(p + 2) 2h + s 2 m(m + 2) h ) p (2 2 + s2 m 2 h where (a) is from the fact s 2 h, and (b) is from the fact p(p+2) m(m+2) p2 m 2 when 0 < m p.

114 97 Thus, we have ( CP C F P rob C F ) η < 1 p 2 nη 2 m 2 ) (2 + s2 h Let us analyze the right side of Eq When we only have one significant PC, i.e. σ 2 = σ 3 =... = σ d = 0, we will have s = σ 2 1 and h = σ 4 1. Thus, s2 h = 1 and the bound is tight. On the other hand, when the power of each PC is the same, i.e. σ 1 = σ 2 =... = σ d, we will have s = dσ 2 1 and h = dσ 4 1. Thus, s2 h = d and the bound gets loose. We thus see that when the powers of some PCs are more significant than others, the bound gets tighter; when they are all about the same, the bound gets looser. Similar to the center estimator, the bound we derived is not tight in comparison to the empirical error probability values. We compare our error probability bound with empirical error probability in Figure 5.2. Figure 5.2: Comparison of our bound with empirical values in probability. The X-axis represents P ( C P C F ( 1 C F η) and the Y -axis represents a 1). Here the red line Y = X nη 2 b is the reference line for easy comparison. Here we use the synthetic data with p = 100, n = 1000 and m = 0.1 in 1000 trials. Based on experiments, we do not see significant p variations in the results when we change the parameters (m,n,p) slightly.

115 5.3 Matrix Perturbation and Convergence Rate of the Principal Component Estimator 98 In this section, we consider the rate of convergence of the PCs using the results on convergence of the covariance matrix we have proved in the last subsection. To begin, we will first briefly review some matrix perturbation theory results we will need. Then, we will present a bound on the rate of convergence of the estimated PCs to the true PCs Review of Matrix Perturbation Suppose we have a symmetric matrix A R N N and we add a perturbation matrix S R N N to it so that we have a new matrix à = A + S. There is a long line of previous work [88, 41, 92] to analyze the change in the eigenvalues and eigenvectors of à vs. A due to the introduction of the perturbation matrix S. It is known that when the amount of perturbation S, e.g. S 2 or S F, is small, the eigenvalues and eigenvectors of A will be close to those of à in certain situations. and For instance, suppose that the matrix decompositions of A and à are A = V ΣV T = (V 1 V 2 ) Σ 1 0 V 1 T 0 Σ 2 à = Ṽ ΣṼ T = (Ṽ1 Σ 1 0 Ṽ 2 ) 0 Σ2 V T 2 Ṽ T 1 Ṽ T 2 (5.6) (5.7) where the eigenvalue matrix Σ 1 contains k eigenvalues with 2 k N 1, and Σ 2 contains the remaining N k eigenvalues. Similarly, Σ 1 contains k eigenvalues and Σ 2 contains the rest. We can then define the angle between the subspace spanned by columns of a matrix U 1 R N k and the subspace spanned by the columns of a matrix U 2 R N k as in [41, 63] ( ( ) ) Θ(U 1, U 2 ) = arccos (U1 T U 1 ) 1 2 U T 1 U 2 (U2 T U 2 ) 1 U2 T U 1 (U1 T U 1 ) (5.8)

116 It is stated in [41, 63] that the eigenvalues of Θ(V 1, V 2 ) will be the angles required to rotate the subspace spanned by columns of V 1 to the subspace spanned by the columns of V 2. Moreover, from [41], 99 sin Θ(V 1, Ṽ1) F = Ṽ T 2 V 1 F (5.9) Finally, if we denote κ = min i,j Σ ii Σ jj > 0, then we will have sin Θ(V 1, Ṽ1) F S F κ (5.10) We see that either having the norm of the perturbation matrix S be small or having a large absolute difference between eigenvalues κ can make sin Θ(V 1, Ṽ1) F small Convergence Rate of Principal Component Estimator From the absolute bound of C P C F in Theorem 6 and the results from matrix perturbation theory, we can easily prove the following theorem about the convergence rate of the principal component estimator. Theorem 7. Using the notation of Chapter 4, consider again the covariance matrix estimator C P = k 1 1 n n i=1 P ix (i) (P i x (i) ) T. Let Ṽd be the matrix containing the top d principal components of C P and let V d be the matrix containing the true principal components v 1,..., v d as columns. Then, for any η > 0, ( P rob sin Θ(V d, Ṽd) F η ) κ 1 (a b) (5.11) nη2 where a and b are defined as they were in Theorem 6, and the angle Θ and the absolute difference between eigenvalues κ are defined the same as in the previous section. To gain some intuition for the bound, we will analyze the bound by considering the

117 100 noiseless case. Setting ɛ = 0, we have Here k 1 = p(p+2) m(m+2) and k 2 = a b = 2k 1 h + k 1 s 2 (1 k 2 ) 2 h (2k 2 + (p 2)k 2 2)s 2 = 2k 1 h + k 1 s 2 h + 2k 2 h k 2 2h 2k 2 s 2 (p 2)k 2 2s 2 = 2k 1 h + k 1 s 2 h + 2k 2 (h s 2 ) k 2 2h (p 2)k 2 2s 2 (a) 2k 1 h + k 1 s 2 h k 2 2h (p 2)k 2 2s 2 (b) (2k 1 1)h + (k 1 (p 2)k 2 2)s 2 (c) k 1 (2h + s 2 ). p m, which we have defined in Section 5.2, and (a) follows (m+2)(p 1) from the fact h s 2, (b) follows from the fact that k 2 2h 0, and (c) follows from the fact that h 0 and (p 2)k 2 2s 2 0 when p 2. Moreover, we note that the terms we have eliminated are relatively small because k m 2 is a very small coefficient of h compared to 2k p2 m 2, and (p 2)k 2 2 p m 2 k 1 (2h + s 2 ) for a b is fairly tight. << p2 m 2 k 1 when p is large. Thus, we expect this bound Thus, we see that for the noiseless case, as the measurement ratio m p increases, the upper bound of a b will decrease and thus the bound gets tight. Moreover, from the right side of Eq. 5.11, we see that as the number of points n increases, we will get a tight bound as well. 5.4 Theoretical Verification In this section, we present proofs of Theorems 5 and 6 which show the convergence rates of the center and covariance matrix estimators with respect to the number of points n and the measurement ratio m p. We will use the Chebyshev inequality theorem as follows to prove Theorem 5 and 6. Theorem 8 ([87]). For a random vector x R p, and any η > 0, P rob ( x E(x) 2 η) E( x E(x) 2 2) η 2 (5.12)

118 101 follows. First, we will prove Theorem 5 about the convergence rate of the center estimator as Proof of Theorem 5. In the proof of Theorem 2, we showed that for each i, ( p ) E m P ix (i) = x. Since {P i x (i) } n i=1 are all i.i.d., ( p ) E(ˆ x) = E m P ix (i) = x. Thus, from Theorem 8, we have We can now compute E( ˆ x x 2 ), P rob ( ˆ x x 2 η ) E( ˆ x x 2 2) η 2. (5.13) E( ˆ x x 2 ) = tr ( E ( (ˆ x x) T (ˆ x x) )) = tr ( E ( (ˆ x x)(ˆ x x) )) T ( ( 1 n ( p = tr E n m P ix (i) x ) 1 n i=1 = 1 ( n (E n tr 2 i=1 n j=1 ( p m P ix (i) x ) n j=1 ( p m P jx (j) x ) )) T ( p m P jx (j) x ) T ) ) (5.14) ( n ( In order to compute E p i=1 P m ix (i) x ) n ( p j=1 P m jx (j) x ) ) T, we will derive a useful property. Consider i.i.d. vectors {t i } n i=1 with E(t i ) = t, then E ( ( n (t i t) i=1 n (t j t) ) T = j=1 n E ( (t i t)(t i t) ) T + i=1 ( ) = ne( (t i t)(t i t) T ) + n i=1 n i=1 n E ( (t i t)(t j t) ) T j=1 j i n E(t i t)e ( (t j t) ) T j=1 j i = ne ( (t i t)(t i t) T ) (5.15)

119 102 where (*) is from the independence of t i and t j when i j. From Eq and 5.15, we have E( ˆ x x 2 ) = 1 ( ( (E n tr p m P ix (i) x )( p m P ix (i) x ) )) T = 1 ( ( (E n tr p m P i(x (i) x) + p m P i x x )( p m P i(x (i) x) + p m P i x x ) )) T = 1 p 2 n m tr(c x) + 2 ( ( p 2 n tr E m P i(x (i) x) ( p m P i x x ) )) T + 1 ( ( (E n tr p m P i x x )( p m P i x x ) )) T (5.16) where C x = E ( (P i x (i) centered )(P ix (i) centered )T ) = 1 k 1 V ΣV T + 1 k 1 p+2 m+2 ɛ2 I. Now we will compute individual terms in Eq to get E( ˆ x x 2 ). p 2 First, let us consider 1 tr(c n m 2 x ). From Theorem 4 and 1 k 1 (1 + (p 1)k 2 ) = m, we have p tr(c x ) = 1 ( d σi 2 + (p 1)k 2 k 1 i=1 = 1 k 1 (1 + (p 1)k 2 ) = m p d σi 2 + mɛ 2 i=1 d i=1 σ 2 i d σi 2 + mɛ 2 i=1 ) + mɛ 2 ( Second, we will compute the cross term (E 2 tr p P n m i(x (i) x) ( p P m i x x ) )) T as follows, ( ( p tr E m P i(x (i) x) ( p m P i x x ) )) T = p ( (E m tr (x (i) x) ( p = p ( ) (E m tr x (i) x E = 0 m P i x x ) )) T Pi ( ( p m P i x x ) )) T Pi

120 103 Finally, we compute the last term in Eq. 5.16, ( ( ( p tr E m P i x x )( p m P i x x ) )) T ( ( p = E m P i x x ) T ( p m P i x x )) ( p 2 = E m 2 xt P T i P i x ) p ) 2E( m xt P i x + x 2 = p2 m 2 E( P i x 2) 2 p m E( P i x 2) + x 2 = p2 m 2 m p x 2 x 2 = p m m x 2 Thus, we will have, ( E( ˆ x x 2 ) = 1 p 2 ( m d ) σ n m 2 i 2 + mɛ 2 p i=1 ( = 1 ( p d i=1 σ2 i + pɛ 2) n m ( d = p mn i=1 ) + p m m x 2 ) + p m m x 2 ) σi 2 + pɛ 2 + p m x 2. (5.17) p From Eq and 5.17, the bound of the center error will then be P rob ( ˆ x x η ) ( d ) p σ 2 mnη 2 i + pɛ 2 + p m x 2. (5.18) p i=1 If we replace η with η x in Eq. 5.18, we will get the relative bound as ( ) ( ˆ x x P rob η p d ) i=1 σ2 i + pɛ 2 + p m. (5.19) x mnη 2 x 2 p Now we will use Chebyshev inequality theorem to prove Theorem 6 about the convergence rate of the covariance matrix estimator. Proof of Theorem 6. We have already showed, in the proof of Theorem 4, that for each C i = k 1 P i x (i) (P i x (i) ) T, we have E(C i ) = C. Thus, E(C P ) = E(C). (5.20)

121 104 Then from Theorem 8, we know P rob( (C P ) v (C) v 2 η) E ( (C P ) v (C) v 2 2) η 2 (5.21) where (C P ) v and (C) v are the p p matrices C P and C respectively rearranged into p 2 1 vectors. is Because the Frobenius norm of a matrix is equal to the l 2 norm of its vector form, that then we have C P C F = (C P ) v (C) v 2, (5.22) P rob( C P C F η) E ( C P C 2 F ) η 2. (5.23) We can now compute E ( C P C 2 F ). E ( ) ( ( C P C 2 F = E tr (CP C) T (C P C) )) ( ( )) = tr E ( 1 n C i C) T ( 1 n C i C) n n = i=1 i=1 1 n tr (E (C ic i )) 1 n trace ( C 2) = 1 n tr (E(C ic i )) 1 n C 2 F (5.24) where (*) is from the following property ( n ) n E (C i C) T (C i C) = i=1 i=1 n E ( (C i C) T (C i C) ) + i=1 = ne ( (C i C) T (C i C) ) n i=1 n E ( (C i C) T (C j C) ) j=1 j i = ne(c i C i ) nc 2 (5.25) Now we are going to compute the two individual terms in Eq. 5.24, and we will start

122 105 with 1 n tr (E(C ic i )), ( ) tr E(C i C i ) ( = tr = k 2 1E E ( ) ) k1p 2 i x (i) x (i)t Pi T P i x (i) x (i)t Pi T ( tr ( x (i)t P T i P i x (i) x (i)t P T i P i x (i))) = k 2 1E ( P i x (i) 4) (5.26) Using Lemma 2 in Section 4.7 and E(X) = E Y (E X (X Y )), we have k1e ( 2 P i x (i) 4) = ( k1e 2 x (i) (E Pi Pi x (i) 4 x (i))) = ( k1e 2 x (i) (E θ cos 4 θ x (i) 4 x (i))) = ( m(m + 2) ) k1e 2 p(p + 2) x(i) 4 = k 1 E( x (i) 4 ) where θ is the principal angle between x (i) and the subspace spanned by columns of P i. In the following, we will first work out the distribution of x (i) and then compute E( x (i) 4 ). If we assume the data is centered with x (i) = d j=1 w ijσ j v j + z i, then x (i) N (0, K) with K as K = E ( x (i) x (i)t ) ( = E ( d d ) w ij σ j v j + z i )( w ik σ k v k + z i ) T j=1 (1) = E ( d (2) j=1 k=1 k=1 d ) w ij w ik σ j σ k v j vk T + E(z i z T i ) ( d ) = E wijσ 2 j 2 v j vj T + ɛ 2 I = j=1 j=1 d σj 2 v j vj T + ɛ 2 I where (1) follows from the independence of w ij and z i and (2) follows from the independence of w ij and w ik when j k. We will also compute tr(k) and tr(kk) that will be used later in the computation of E( x (i) 4 ). If we let s = d j=1 σ2 j and h = d j=1 σ4 j, then

123 106 tr(k) = tr ( d σj 2 v j vj T + ɛ 2 I ) j=1 d = tr( σj 2 vj T v j ) + pɛ 2 j=1 d = σj 2 + pɛ 2 j=1 = s + pɛ 2 (5.27) and ( ( d tr(kk) = tr σj 2 v j vj T + ɛ 2 I )( d σkv 2 k vk T + ɛ 2 I )) j=1 ( d = tr j=1 k=1 k=1 d σj 2 σkv 2 j vj T v k vk T + 2ɛ 2 d σj 2 v j vj T j=1 ( d ) d = tr σj 4 v j vj T + 2ɛ 2 σj 2 + pɛ 4 = j=1 d σj 4 + 2ɛ 2 j=1 j=1 d σj 2 + pɛ 4 j=1 ) + ɛ 4 I = h + 2ɛ 2 s + pɛ 4 (5.28)

124 107 We can now use Eq and 5.28 to compute E( x (i) 4 ), E( x (i) 4 ) = E = E( = ( ) = = ( (x (i) 1 p p j=1 k=1 p j=1 k=1 p j=1 k=1 p j=1 2 (i) + x 2 p x (i) j E(x (i) j x (i) 2) ) 2 p 2 (i) 2 x k ) 2 (i) 2 x k ) p ( Kjj K kk + 2Kjk) 2 K jj p K kk + 2 k=1 = ( tr(k) ) 2 + 2tr(KK) p p j=1 k=1 K 2 jk = (s + pɛ 2 ) 2 + 2(h + 2sɛ 2 + pɛ 4 ) where (*) is from Isserlis theorem. Thus, we have ( ) tr E(C i C i ) = k 1 E( x (i) 4 ) ) = k 1 ((s + pɛ 2 ) 2 + 2(h + 2sɛ 2 + pɛ 4 ) ( ) = k 1 2h + s 2 + (2p + 4)ɛ 2 s + (p 2 + 2p)ɛ 4

125 108 Now we move to compute the second term 1 n C 2 F in Eq. 5.24, C 2 F = tr(cc) p ( = Σ ii + p + 2 m + 2 ɛ2 = = i=1 d i=1 d i=1 ) 2 ( σ 2 i (1 k 2 ) + sk 2 + p + 2 m + 2 ɛ2 ) 2 + p i=d+1 ( σ 2 i (1 k 2 ) ) 2 + 2s(1 k2 )(sk 2 + p + 2 m + 2 ɛ2 ) + p ( sk 2 + p + 2 m + 2 ɛ2 ) 2 ( sk 2 + p + 2 m + 2 ɛ2 ) 2 = (1 k 2 ) 2 h + 2(1 k 2 )k 2 s p + 2 m + 2 (1 k 2)ɛ 2 s + pk2s ( ) 2 p + 2 +p ɛ 4 m + 2 = (1 k 2 ) 2 h + = (1 k 2 ) 2 h + ( ) 2k 2 + (p 2)k2 2 ( ) 2k 2 + (p 2)k2 2 s s (p2 + p 2)k 2 + p + 2 m + 2 ( ) 2 p + 2 ɛ 2 s + p m + 2 p(p + 2) m + 2 sk 2ɛ 2 ( p + 2 ɛ 2 s + p m + 2 ( ) 2 p + 2 ɛ 4 m + 2 ) 2 ɛ 4 Finally, if we denote tr (E(C i C i )) = a and C 2 F = b, then P rob( C P C F η) 1 (a b). nη2 Or, we can replace η with η C F to get the relative bound, P rob( C P C F C F η) 1 nη 2 ( a b 1 ).

126 Chapter 6 Conclusions and Future Work Many types of signals have a very low-dimensional representation compared to their original high-dimensional size. Thus, they can often be approximated with little distortion as a sparse linear combination of elements from a certain basis or dictionary or as a nonlinear function of a few underlying variables. In this thesis, PCA and kernel PCA models are employed to capture the underlying low-dimensional structure of signals of interest. As in the usual compressive sensing paradigm, these low-dimensional models then allow us to design efficient algorithms for acquiring, representing, and processing signals of interest from random measurements of them. Our main contributions are that, first, we have proposed a new kernel-pca-based signal recovery method to accurately recover nonlinear-manifold-modeled signals. Experimental results showed that our approach requires dramatically fewer measurements, sometimes an order of magnitude fewer measurements, than traditional compressive sensing techniques. We have also shown that our approach compares favorably with other recently developed manifold-based compressive sensing methods, producing similar recovery results in 1-2 orders of magnitude less computation time. In addition, a theoretical bound on the error of our recovered signal was also proved. For collections of signals that have efficient linear representations in a shared common basis, a simple and efficient approach has been proposed to recover the center and principal components of high-dimensional data from compressive sensing measurements of it. Ex-

127 110 perimental results verified that this approach compares favorably against other algorithms developed in this field. We have further shown that it can be used to estimate dimensionality of the original data and to improve reconstruction results for a collection of data with shared structure from random measurements. We then theoretically show that our center and PC estimators converge almost surely to the true center and true PCs asymptotically even for very few random measurements. Moreover, a bound on the convergence rate of our estimators has also been given. In the following sections, we would like to discuss several open problems and future directions. 6.1 Bernoulli Random Measurements In compressive sensing, both random Gaussian and random Bernoulli measurement matrices satisfy the restricted isometry property with high probability when the number of measurements is m = O(klog(p/k)). In Chapters 3 and 4, we show experimentally that both random Gaussian and Bernoulli measurements provide good recovery of signals. However, proofs of asymptotic convergence and bounds on the rate of convergence have only been given for random Gaussian measurements. It would be nice to prove similar rates of convergence for random Bernoulli measurements. Perhaps this could be done by extending our proofs to any measurement matrix that satisfies the Johnson-Lindenstrauss property [81, 35, 1]. A matrix Φ is said to satisfy this property if for a given ɛ (0, 1) and any x R p, P ( Φx 2 x 2 > ɛ x 2) ( ) 2e m ɛ 2 4 ɛ3 6. In essence, the Johnson-Lindenstrauss Lemma states that the mutual distance between points is likely to be preserved under the action of Φ. If we could prove rates of convergence for all measurement matrices satisfying the Johnson-Lindenstrauss property, then proofs for the random Gaussian matrix and random Bernoulli matrix will be simple corollaries. Another advantage of this approach is that we could then use these results for

128 general Johnson-Lindenstrauss matrices to create new randomized algorithms for PCA since Johnson-Lindenstrauss matrices can be sparse and are thus more computationally efficient Kernel Choices In Chapter 3, when the measurements in original space (linear inner products) are provided, we assume the kernel defining our feature space is in the form f( y, e i ), e.g. the polynomial kernel or the sigmoid kernel, so that the measurements in kernel space k(y, e i ) is known as well. However, for some datasets, the polynomial kernel may not be the best choice to model the signals. If at the initial stage, we take the measurements not in terms of linear inner products, but for example in terms of e i y 2, then we can instead use radial basis function kernels such as the Gaussian kernel, which are the most popular class of kernel. Although most of the compressive sensing literature currently uses inner products as the measurements, we may also easily design hardware to take measurements of the form e i y 2. For example, we may design the mechanism as in Figure 6.1. Figure 6.1: Norm Squared Measurements for the Radial Basis Kernel. 6.3 Compressive Kernel PCA Finally, in Chapter 3, we considered the problem of recovering nonlinearly k-sparse signals by modeling the signals with kernel PCA. However, the low-dimensional signal structure {v k } d k=1 in F were estimated via kernel PCA on other data that is expected to share

Compressed sensing. Or: the equation Ax = b, revisited. Terence Tao. Mahler Lecture Series. University of California, Los Angeles

Compressed sensing. Or: the equation Ax = b, revisited. Terence Tao. Mahler Lecture Series. University of California, Los Angeles Or: the equation Ax = b, revisited University of California, Los Angeles Mahler Lecture Series Acquiring signals Many types of real-world signals (e.g. sound, images, video) can be viewed as an n-dimensional

More information

Introduction How it works Theory behind Compressed Sensing. Compressed Sensing. Huichao Xue. CS3750 Fall 2011

Introduction How it works Theory behind Compressed Sensing. Compressed Sensing. Huichao Xue. CS3750 Fall 2011 Compressed Sensing Huichao Xue CS3750 Fall 2011 Table of Contents Introduction From News Reports Abstract Definition How it works A review of L 1 norm The Algorithm Backgrounds for underdetermined linear

More information

Conditions for Robust Principal Component Analysis

Conditions for Robust Principal Component Analysis Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and

More information

PHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION. A Thesis MELTEM APAYDIN

PHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION. A Thesis MELTEM APAYDIN PHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION A Thesis by MELTEM APAYDIN Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the

More information

Lecture 22: More On Compressed Sensing

Lecture 22: More On Compressed Sensing Lecture 22: More On Compressed Sensing Scribed by Eric Lee, Chengrun Yang, and Sebastian Ament Nov. 2, 207 Recap and Introduction Basis pursuit was the method of recovering the sparsest solution to an

More information

Introduction to Compressed Sensing

Introduction to Compressed Sensing Introduction to Compressed Sensing Alejandro Parada, Gonzalo Arce University of Delaware August 25, 2016 Motivation: Classical Sampling 1 Motivation: Classical Sampling Issues Some applications Radar Spectral

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

Compressive Sensing and Beyond

Compressive Sensing and Beyond Compressive Sensing and Beyond Sohail Bahmani Gerorgia Tech. Signal Processing Compressed Sensing Signal Models Classics: bandlimited The Sampling Theorem Any signal with bandwidth B can be recovered

More information

A Survey of Compressive Sensing and Applications

A Survey of Compressive Sensing and Applications A Survey of Compressive Sensing and Applications Justin Romberg Georgia Tech, School of ECE ENS Winter School January 10, 2012 Lyon, France Signal processing trends DSP: sample first, ask questions later

More information

Constructing Explicit RIP Matrices and the Square-Root Bottleneck

Constructing Explicit RIP Matrices and the Square-Root Bottleneck Constructing Explicit RIP Matrices and the Square-Root Bottleneck Ryan Cinoman July 18, 2018 Ryan Cinoman Constructing Explicit RIP Matrices July 18, 2018 1 / 36 Outline 1 Introduction 2 Restricted Isometry

More information

Randomness-in-Structured Ensembles for Compressed Sensing of Images

Randomness-in-Structured Ensembles for Compressed Sensing of Images Randomness-in-Structured Ensembles for Compressed Sensing of Images Abdolreza Abdolhosseini Moghadam Dep. of Electrical and Computer Engineering Michigan State University Email: abdolhos@msu.edu Hayder

More information

An Introduction to Sparse Approximation

An Introduction to Sparse Approximation An Introduction to Sparse Approximation Anna C. Gilbert Department of Mathematics University of Michigan Basic image/signal/data compression: transform coding Approximate signals sparsely Compress images,

More information

Sensing systems limited by constraints: physical size, time, cost, energy

Sensing systems limited by constraints: physical size, time, cost, energy Rebecca Willett Sensing systems limited by constraints: physical size, time, cost, energy Reduce the number of measurements needed for reconstruction Higher accuracy data subject to constraints Original

More information

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit New Coherence and RIP Analysis for Wea 1 Orthogonal Matching Pursuit Mingrui Yang, Member, IEEE, and Fran de Hoog arxiv:1405.3354v1 [cs.it] 14 May 2014 Abstract In this paper we define a new coherence

More information

Structured matrix factorizations. Example: Eigenfaces

Structured matrix factorizations. Example: Eigenfaces Structured matrix factorizations Example: Eigenfaces An extremely large variety of interesting and important problems in machine learning can be formulated as: Given a matrix, find a matrix and a matrix

More information

Compressed Sensing: Extending CLEAN and NNLS
