GRAPH-BASED REGULARIZATION OF LARGE COVARIANCE MATRICES.


GRAPH-BASED REGULARIZATION OF LARGE COVARIANCE MATRICES

A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By Srikar Yekollu, B.E.

The Ohio State University, 2009

Master's Examination Committee: Dr. Mikhail Belkin, Adviser; Dr. Simon Dennis

Approved by: Adviser, Computer Science & Engineering Graduate Program

© Copyright by Srikar Yekollu, 2009

ABSTRACT

A range of successful techniques in computer vision, such as Eigenfaces and Fisherfaces, are based on the spectral decomposition of empirical covariance matrices constructed from given data. These matrices are typically constructed in a setting where the dimension of the data (number of pixels) exceeds the number of available samples, sometimes by a large margin. However, it has been established in statistics that under these conditions, and under some fairly general modeling assumptions, the eigenvectors and eigenvalues of covariance matrices cannot be estimated reliably. Several techniques to remedy this problem have been proposed. These techniques typically make specific assumptions about the structure of the covariance matrix and assume that this structure is known in advance. In this thesis we propose a new method for automatically learning non-local structure in the covariance matrix in a data-dependent way. This learned structure is then used to improve inference for methods like Eigenfaces. Unlike most existing methods in computer vision and statistics, we do not make any assumptions about the spatial (pixel) proximity structure of the data. We provide theoretical results indicating that our methods may overcome the problem of insufficient data. We evaluate the performance of our algorithms empirically and demonstrate significant and consistent improvements over traditional Eigenfaces as well as more recent techniques, such as 2DPCA, Euclidean banding and thresholding, for a wide range of parameter settings.

This is dedicated to my aunt and uncle, Sailaja & Raja, and my parents, Girija & Anand, who have strived to provide me the best quality of education and have always encouraged me in my endeavours.

ACKNOWLEDGEMENTS

First, I would like to thank my adviser, Dr. Mikhail Belkin, not only for his wisdom and guidance but also for his unending patience with me. I thank him for his insightful ideas and attention to detail that have led to the production of this thesis. Through him, I have learned to become a better writer and overall thinker. He has strengthened my understanding of how to research effectively and continues to serve as an important role model for me. I thank Dr. Simon Dennis for agreeing to take the time to be a member of my thesis committee.

VITA

October 3, Born - Nellore, Andhra Pradesh, India
June B.E., Computer Science & Engineering
July 2005 - June Technical Associate, Trilogy E-Business, Bangalore, India
present Graduate Research Assistant, The Ohio State University

PUBLICATIONS

Research Publications

Srikar Yekollu and Mikhail Belkin. Learning Banded Faces: Towards better Eigenfaces. IEEE International Conference on Computer Vision. Submitted.

FIELDS OF STUDY

Major Field: Computer Science & Engineering

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgements
Vita
List of Tables

Chapters:
1. Introduction
2. Idea and the Algorithm
   2.1 Idea and intuition
   2.2 Preliminaries
   2.3 Algorithm
3. Theoretical Analysis
4. Experimental Evaluation
   4.1 Datasets
   4.2 Face Recognition
       Fisherfaces
   4.3 Image Reconstruction/Denoising
5. Summary and Conclusions
Bibliography

LIST OF TABLES

4.1 Classification error rate (in %) on a training set of 40 subjects with 4 samples per subject and a test set of 240 samples drawn from the ATT/ORL face dataset with different amounts of added noise. The number of eigenvectors was chosen to be 100 for all methods other than 2DPCA. For 2DPCA the number of eigenvectors was 4, which produced the best performance.

4.2 Classification error rate (in %) on a training set of 15 subjects with 4 samples per subject and a test set of 165 samples drawn from the Yale face dataset. The number of eigenvectors was 50 and k was chosen to be 20. The results are averaged over 15 iterations.

4.3 Classification error rate (in %) as a function of the number of eigenvectors N_e. The training set contains 40 subjects with 4 samples per subject and the test set contains 240 samples drawn from the ATT/ORL face dataset.

4.4 Classification error rate (in %) and percentage of non-zero entries (sparsity, in %) of the regularized covariance matrices for Banded Faces with different numbers of nearest neighbors k used to construct the graph, versus the best parameter settings for Euclidean and thresholding banding. While sparsity is not applicable for 2DPCA, its classification error is provided for comparison. The bolded setting is used for all other computations.

4.5 Classification error rate for Fisherfaces (in %) on a training set of 40 subjects with 4 samples per subject and a test set of 240 samples drawn from the ATT/ORL face dataset with different amounts of added noise.

4.6 Reconstruction error ($L_2$ distance to the original) on a set of 10 subjects with 3 samples per subject drawn from the ATT face dataset, with different amounts of noise added.

4.7 Reconstruction error ($L_2$ distance to the original) on a set of 15 subjects with 3 samples per subject drawn from the Yale face dataset, with different amounts of noise added.

4.8 A few representative samples from the experiments in Table (4.6). The numbers below an image are the values of its $L_2$ distance from the corresponding original image.

CHAPTER 1

INTRODUCTION

A large class of popular techniques in computer vision, such as Eigenfaces and Fisherfaces, are based on computing the eigenvectors of certain covariance matrices generated from data, where a w × h image is represented by a vector of pixels of dimension p = wh. These pixel-pixel covariance matrices have $(wh)^2$ entries. For even a moderately sized image of dimensions 30 × 40, this translates to working in a 1200-dimensional space. For example, images ranging from up to are used in [26], [3] and [8]. It is clear that accurate estimation of these pixel-pixel covariance matrices plays a key role in the success of these techniques. Standard results from random matrix theory ([2]) establish that the sample covariance matrix (the maximum likelihood estimator) constructed from n samples converges to the population (true) covariance matrix at a rate no better than $\sqrt{p/n}$. This implies that in the above examples we need at least several thousand samples to establish a reasonable approximation of the covariance matrix. However, in many cases this number of samples is not available. The problem is even more significant in the case of Fisherfaces, where the requirement escalates to p samples per subject. For example, in [26] 16 images of 16 individuals are used as the training set, and in [3] a training set of size is used. Interestingly, despite this theoretical

disconnect, these techniques have proven successful for a wide range of problems in vision. We conjecture that this success is due to the fact that images reside near a lower-dimensional linear subspace in the space of all pixel configurations. However, images have a wealth of structure, for example related to spatial proximity. Moreover, specific classes of images, e.g. faces, have additional non-linear structure, which is not necessarily local in nature. One expects that statistically grounded methods for identifying and using such structures in the context of estimating covariance matrices could significantly improve the performance of algorithms like Eigenfaces. In this thesis we will demonstrate that a class of methods specifically designed to detect non-linear structures in covariance matrices significantly outperforms classical Eigenface-like algorithms.

Analyzing high dimensional data has been a topic of active research in the statistics community. A subject of particular interest has been understanding the so-called high p, low n setting, where the number of data points is smaller than the dimension of the space. Some of this research has direct relevance to high-dimensional inference problems such as face recognition. In this thesis we will be interested in the structure of covariance matrices for PCA-based and closely related methods, such as Eigenfaces and Fisherfaces, to name two of the most well-known examples. We start by providing a brief overview of the relevant statistical literature. The behavior of the empirical (sample) covariance matrix $\hat{\Sigma}$ (used in the classical PCA algorithm) is well understood in traditional multivariate statistics (see, e.g., [2]) when the dimensionality p is fixed and the number of samples n goes to infinity. The problem of high dimensional inference is usually expressed by treating the dimensionality p as a variable which increases with the number of samples n, as opposed to treating

it as a fixed value as in traditional multivariate statistics. So the high p, low n setting is sometimes written as $p(n)/n \to c$ as $n \to \infty$, where $c > 0$ is a constant. Note that p is written as p(n) to indicate its dependence on n in this setting. In this setting, when p is comparable to n, $\hat{\Sigma}$ provides a poor estimate of the true covariance matrix $\Sigma$. In their fundamental work on the subject [21], Marcenko and Pastur provide the first theoretical analysis of the properties of certain random matrices and show how their eigenspectrum behaves as a function of p/n. This analysis was extended by Silverstein [24] to include covariance matrices. A number of theoretical advances, including Johnstone [15] and El Karoui [16], followed. These works explore the behavior of the eigenvalues of covariance matrices in the high-dimensional setting. A general conclusion is that the sample covariance matrix constructed from n p-dimensional samples approaches the population covariance matrix at the rate of $\sqrt{p/n}$ in operator norm (which implies the same rate of convergence for eigenvalues and eigenvectors). Informally speaking, one needs several times as many samples as there are dimensions of the space. However, obtaining more samples is not a viable option in a range of practical situations, and the need for better estimators of the true covariance matrix was recognized. While some standard regularization techniques such as Ridge regression and Steinian shrinkage [7] can be applied, they are not expected to work well as they do not make use of any existing additional structure in the data to overcome the curse of dimensionality [13]. A technique utilizing such additional structure was proposed by Bickel and Levina [6], who propose banding of the covariance matrix based on a linear ordering of the coordinates. Effectively, the entries of the empirical covariance matrix corresponding to coordinates which are not close are set to zero. Other techniques are those proposed by Wu

and Pourahmadi [28], who suggest smoothing along the components of the Cholesky decomposition of the covariance matrix, Huang et al. [14], who propose to impose an $L_1$ penalty on the Cholesky factor to achieve covariance sparsity, and Furrer and Bengtsson [9], who propose to shrink (taper) the sample covariance matrix based on its Schur product with a positive definite function. Another set of techniques for regularization is based on the more general assumption that the covariance matrix is sparse; see Bickel and Levina [5] and El Karoui [17]. In this case, the authors suggest choosing a threshold and setting the entries of the empirical covariance matrix which fall below it to zero. The authors provide a variety of theoretical results, such as convergence rates for these estimates, showing the potential of these methods to work even when p is significantly larger than n.

There have also been several lines of work in computer vision aimed at improving covariance-based methods. Most of these approaches (e.g., [1, 19, 22, 29, 30]) attempt to exploit a certain assumed spatial structure to improve inference. For example, the approach in [22] is to divide the data into blocks defined by pixel proximity and to apply PCA to each block separately. A popular method, 2DPCA [29], treats each image row as a unit, which can be interpreted as a certain assumption about the pixel proximity structure on the coordinates; see [27]. Most of the existing methods impose a certain fixed structure on the space of images. While such structures, based, e.g., on spatial pixel proximity, are often natural and lead to elegant algorithms, one expects that many non-local dependencies would be ignored because of the rigidity of these models.

In this thesis we propose an algorithm for learning certain dependencies between the pixels of an image (or, more generally, coordinates of a vector representation) in

a more flexible, data-dependent way. More precisely, the data (images) are used to construct a metric space structure on the set of coordinates (pixels), which is then used to regularize the covariance matrix. Our algorithm is related to the class of manifold methods ([4, 23, 25]), but operates on the space of coordinates rather than data points. We provide theoretical results indicating its potential ability to overcome the curse of dimensionality. In the experimental section we also show that our algorithms consistently outperform 2DPCA as well as classical Eigenfaces, Fisherfaces and several recent statistical techniques, particularly when the data is noisy. Our method also generates very sparse matrices, allowing for efficient computation.

The outline for the rest of the thesis is as follows. In Chapter 2, we introduce the idea and intuition behind our approach, followed by a few preliminaries and the algorithm itself. In Chapter 3, we present a theoretical analysis of the ideas in Chapter 2, along with the relevant assumptions. The experimental evaluation of the algorithm is presented and discussed in Chapter 4.

CHAPTER 2

IDEA AND THE ALGORITHM

2.1 Idea and intuition

In recent years there has been a large amount of research on manifold-based algorithms for machine learning (e.g., [25], [23], [4]). These methods make the assumption that the data lies close to a low-dimensional, generally non-linear manifold embedded in a high-dimensional space. The manifold structure is then recovered and used for various inferential purposes, such as data representation, clustering, semi-supervised learning and many others. These ideas have also been applied in vision, e.g., [12]. In this thesis we take a point of view dual to that of traditional manifold learning. Instead of considering a non-linear structure on the space of data points (images), we try to discover non-linear structure on the space of coordinates (pixels). We believe that there are strong interactions between the coordinates involved which are not fully explained by spatial pixel proximity. For example, for face images, the pixels of the left and of the right eye are strongly correlated despite significant spatial separation between them. Dependencies of this sort are clearly structurally important for families of images, yet it is often hard to encode them explicitly. Therefore, we would like to find an automated way of detecting such dependencies from data in a

way useful for inferential tasks. We do this by constructing a metric space (graph) structure on the space of coordinates, reflecting correlations between the pixel values in images, such that strongly correlated coordinates are close together in the graph. While technically not a manifold, we think of this metric space as analogous to a manifold in manifold learning. We note that this structure is related to graphical models, where the relation between variables (coordinates) is also represented by a graph (e.g., [10]). However, most graphical models research (see [20] for an overview) either assumes that the graph structure is known in advance or attempts to learn the independence structure of variables from the data. Our thesis, on the other hand, is primarily concerned with highly correlated coordinates and does not require the much more subtle notion of conditional independence. We note that in the case of a multivariate Gaussian distribution, learning conditional independence requires inverting the correlation matrix ([20]), a procedure which is unstable in the high p, low n setting considered here. Intuitively, the underlying idea is to take advantage of highly correlated coordinates in the data, where the signal is particularly strong and inference can be done with higher confidence. This small set of coordinate pairs is then used to construct a graph structure on the set of coordinates.

2.2 Preliminaries

We start with a set of n images of size w × h, represented as column vectors $X_i$ of length $p = wh$. The sample covariance matrix $\hat{\Sigma}_p$ is defined as
$$\hat{\Sigma}_p = [\hat{\sigma}_{ij}]_{p \times p} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(X_i - \bar{X})^T \qquad (2.1)$$

where
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i. \qquad (2.2)$$
The correlations between pixels are
$$\hat{\rho}_{ij} = \frac{\hat{\sigma}_{ij}}{\sqrt{\hat{\sigma}_{ii}\,\hat{\sigma}_{jj}}}. \qquad (2.3)$$
Recall that $\rho_{ij} \in [-1, 1]$. As discussed above, we will use the correlation structure to construct a metric on the set of coordinates. A metric on a set of p elements is represented by a p × p symmetric matrix G, where G(i, j) gives the distance between elements i and j. We will obtain this metric from a graph representing the strongest correlations between coordinates and will use it to regularize the covariance matrix by removing correlations between coordinates which are not close in the graph. The banding operator (following the notation in Bickel and Levina [6]) used for banding a p × p matrix $M = [m_{ij}]$ using a metric G with a distance threshold d is defined as
$$[B_G^d(M)]_{ij} = \begin{cases} m_{ij}, & \text{if } G(i, j) \le d \\ 0, & \text{if } G(i, j) > d \end{cases}$$
Thus, given a matrix M, the operator $B_G^d(M)$ produces a new matrix where certain entries are replaced by zeros. $B_G^d(M)$ can be viewed as a projection operator on the space of matrices, preserving entries corresponding to pairs of coordinates that are close in the graph. In practice we expect the resulting matrices to be sparse. The underlying assumption is that there is a metric on the set of coordinates that reflects their correlation structure. Moreover, we hope that this structure can be recovered just from the pixel pairs with the strongest correlations, as we expect these pairs to give us the most information. In Chapter 3 we will prove that these pixel pairs can be recovered reliably from a number of image samples logarithmic in the dimension.
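For concreteness, the quantities in Equations (2.1)-(2.3) and the banding operator $B_G^d$ can be written down directly in NumPy. The sketch below is purely illustrative: it uses random data in place of image vectors and a toy metric based on the linear ordering of the coordinates in place of the graph metric constructed in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 300                                  # n samples of dimension p = w*h
X = rng.random((n, p))                          # stand-in for vectorized images X_i

X_bar = X.mean(axis=0)
Sigma_hat = (X - X_bar).T @ (X - X_bar) / n     # sample covariance, Eq. (2.1)-(2.2)
sd = np.sqrt(np.diag(Sigma_hat))
Rho_hat = Sigma_hat / np.outer(sd, sd)          # sample correlations, Eq. (2.3)

def band(M, G, d):
    """Banding operator B_G^d: keep m_ij when G(i, j) <= d, set it to zero otherwise."""
    return np.where(G <= d, M, 0.0)

# Toy metric: distance between coordinates i and j is |i - j|
G = np.abs(np.subtract.outer(np.arange(p), np.arange(p))).astype(float)
Sigma_banded = band(Sigma_hat, G, d=10)
print("fraction of retained entries:", np.count_nonzero(Sigma_banded) / p**2)
```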

Moreover, we will demonstrate that under a simple model for the underlying metric, this number of samples is sufficient for reconstructing the whole metric structure.

2.3 Algorithm

Input: A set of n labeled w × h images as column vectors $(X_1, X_2, \ldots, X_n)$ of length $p = wh$. Parameters: k, d.

1. [Constructing the dissimilarity matrix] Construct the sample correlation matrix $[\hat{\rho}_{ij}]_{p \times p}$ as in Equation (2.3). Construct the corresponding $p \times p$ dissimilarity matrix $\hat{D}$ with $\hat{d}_{ij} = 1 - \hat{\rho}_{ij}$.

2. [Constructing the graph] For each coordinate i, retain only the k smallest (least dissimilar) entries in the i-th row of $\hat{D}$. Symmetrize the resulting matrix to get a sparse matrix $\hat{T} = [\hat{t}_{ij}]_{p \times p}$ with $\hat{t}_{ij} = \max(\hat{d}_{ij}, \hat{d}_{ji})$. We interpret $\hat{T}$ as the adjacency matrix of a weighted graph G.

3. [Constructing the metric] Apply the Floyd-Warshall algorithm to the graph G given by $\hat{T}$ to construct the matrix of shortest-path distances between coordinates.

4. [Regularizing the covariance matrix] Band the covariance matrix $\hat{\Sigma}_p$ to get $\hat{\Sigma}_p^{reg} = B_G^d(\hat{\Sigma}_p)$:
$$\hat{\sigma}_{ij}^{reg} = \begin{cases} \hat{\sigma}_{ij}, & \text{if the shortest distance between } i \text{ and } j \text{ is at most } d, \\ 0, & \text{otherwise.} \end{cases}$$
Note: In the supervised setting, the value of d can be learned through the standard technique of cross-validation.

5. [Eigenfaces] Replace $\hat{\Sigma}_p$ by $\hat{\Sigma}_p^{reg}$ in the standard Eigenfaces algorithm or any other algorithm that uses covariance matrices. A code sketch of Steps 1-5 is given below.
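The following is a compact sketch of Steps 1-5 in NumPy/SciPy, included only as an illustration of the algorithm above. The graph distances are computed with scipy.sparse.csgraph.shortest_path (Floyd-Warshall), and the default values of k and d are placeholders rather than recommended settings; in practice d would be chosen by cross-validation as noted in Step 4.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def banded_faces_covariance(X, k=20, d=1.5):
    """Regularize the sample covariance of the n x p data matrix X (Steps 1-5)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n                       # sample covariance, Eq. (2.1)
    sd = np.sqrt(np.diag(Sigma))
    Rho = Sigma / np.outer(sd, sd)              # sample correlations, Eq. (2.3)

    # Step 1: dissimilarity matrix
    D = 1.0 - Rho

    # Step 2: for each row keep the k smallest dissimilarities (the zero diagonal counts
    # as one of them); an edge {i, j} is kept if either endpoint retained the other.
    keep = np.zeros((p, p), dtype=bool)
    nn = np.argsort(D, axis=1)[:, :k]
    keep[np.arange(p)[:, None], nn] = True
    keep |= keep.T
    T = np.where(keep, D, 0.0)                  # weighted adjacency; zero = no edge

    # Step 3: shortest-path (Floyd-Warshall) distances between all pairs of coordinates
    G = shortest_path(csr_matrix(T), method="FW", directed=False)

    # Step 4: band the covariance with respect to the graph metric
    Sigma_reg = np.where(G <= d, Sigma, 0.0)    # unreachable pairs (G = inf) are zeroed

    # Step 5: eigen-decomposition of the regularized covariance for Eigenfaces-type use
    eigvals, eigvecs = np.linalg.eigh(Sigma_reg)
    order = np.argsort(eigvals)[::-1]
    return Sigma_reg, eigvecs[:, order], eigvals[order]
```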

CHAPTER 3

THEORETICAL ANALYSIS

In this chapter, we formalize and provide theoretical justification for the ideas and the Algorithm presented in the previous chapter. We assume that, for the class of facial images, the dependencies between coordinates can be represented by a weighted graph whose nodes are the coordinates. This graph induces a metric space structure $d_M(\cdot, \cdot)$ such that strongly related coordinates are close together. We propose a model for this structure and show how, under these assumptions, it can be reconstructed from the data. We will assume that, if j is among the k coordinates most highly correlated with the coordinate i, they are connected by an edge whose weight is a linear function of the correlation $\rho_{ij}$ between them. For adjacent vertices i and j, we define the distance $d_M(i, j)$ to be the weight of the corresponding edge:
$$d_M(i, j) = m\,(1 - \rho_{ij}) + c \qquad (3.1)$$
If i and j are not adjacent, then $d_M(i, j)$ is the shortest weighted distance between them. This definition encodes our intuition that strongly correlated pixel pairs contain most of the information about the structure of images.

To make this intuition algorithmically useful, we will further assume that the correlation structure has a certain inherent sparsity, so that the correlations between pixels far away in our metric are close to zero. This notion is formalized in Definition (3). We will now discuss our key results providing the basis for the Algorithm described in the previous section.

1. In Theorem (3.0.1) we show that the adjacency structure of our graph can be recovered from data at the rate $k\sqrt{\frac{\log p}{n}}$. We note that this quantity is logarithmic in the dimension and depends linearly on the number of nearest neighbors, which we expect to be small in practice. This justifies Step 1 in our algorithm.

2. Given that the adjacency structure is recovered, and under the assumption of linearity (as above), Theorem (3.0.2) shows that the full metric space structure can be reconstructed in Steps 2 and 3 of the algorithm.

3. Assuming now that the correlation matrix is bandable with respect to the metric structure, we show in Theorem (3.0.3) that the banded version of the sample covariance matrix (constructed in Step 4 of the Algorithm) is close to the underlying population covariance matrix. Specifically, the spectral norm of the difference between these matrices is bounded by a function of $\frac{\log p}{n}$. On the other hand, if no banding is performed, the difference is of the order $\sqrt{\frac{p}{n}}$.

4. Finally, Corollary (3.0.4) shows that the actual computation of Eigenfaces or any other eigenvector-based algorithm can be done accurately using a number of samples depending on $\log p$. This corollary justifies the use of the spectral norm in Theorem (3.0.3) by making use of a Result (see Appendix) which says

that spectral norm convergence implies that both the sample eigenvalues and sample eigenvectors converge to their corresponding population equivalents (see [11] for a thorough discussion of matrix norms). This justifies Step 5 of the Algorithm.

Definition 1. Given a matrix $M = [m_{ij}]$, define
$$[T_k(M)]_{ij} = \begin{cases} m_{ij}, & \text{if } m_{ij} \text{ is not less than the } k\text{-th largest element in the } i\text{-th row of } M, \\ 0, & \text{otherwise.} \end{cases}$$

Definition 2. Given the metric $d_M(i, j)$ on a set of p variables, define
$$k_d = \max_i \sum_j I[d_M(i, j) < d].$$
That is, $k_d$ is the maximum, over all variables, of the number of neighboring variables that are less than distance d away from the given variable with respect to the metric $d_M(i, j)$.

Definition 3. Given the metric $d_M(i, j)$ on a set of p variables, we define the set of $(\alpha, C)$-bandable matrices with respect to this metric as
$$\left\{ \Sigma_p : \max_j \sum_i \{ |\sigma_{ij}| : d_M(i, j) > d \} \le C\,k_d^{-\alpha} \right\}.$$
This definition is a generalization of Definition (5) in [6].

Definition 4. (Matrix Norms) For a p-dimensional vector x, the following norms are defined:
$$\|x\|_k = \Big(\sum_{j=1}^{p} |x_j|^k\Big)^{1/k}, \qquad \|x\|_\infty = \max_j |x_j|, \qquad \|x\| = \|x\|_2.$$

For a symmetric matrix M, the following norms are defined:
$$\|M\| = \|M\|_2 = \sup\{\|Mx\| : \|x\| = 1\} = \max_i |\lambda_i(M)|,$$
$$\|M\|_{(\infty,\infty)} = \sup\{\|Mx\|_\infty : \|x\|_\infty = 1\} = \max_i \sum_j |m_{ij}|,$$
$$\|M\|_{(1,1)} = \sup\{\|Mx\|_1 : \|x\|_1 = 1\} = \max_j \sum_i |m_{ij}|,$$
$$\|M\|_\infty = \max_{ij} |m_{ij}| \le \min\left(\|M\|_{(1,1)}, \|M\|_{(\infty,\infty)}\right).$$
For symmetric matrices, the following are true from the above definitions:
$$\|M\|_{(\infty,\infty)} = \|M\|_{(1,1)}, \qquad \|M\| \le \|M\|_{(\infty,\infty)}.$$

Result 1. (Hoeffding's Inequality, 1963) Let $X_1, \ldots, X_n$ be independent random variables. Assume that the $X_i$ are almost surely bounded; that is, assume for $1 \le i \le n$ that $\Pr(X_i \in [a_i, b_i]) = 1$. Then, for the sum of these variables $S = X_1 + \cdots + X_n$, we have the inequality
$$\Pr(S - E[S] \ge nt) \le \exp\left(-\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right),$$
which is valid for positive values of t (where E[S] is the expected value of S).

Result 2. (Implication of spectral norm convergence; see [18]) If A and B are two symmetric matrices, and $\lambda_i(\cdot)$ denotes the i-th eigenvalue with the eigenvalues sorted in decreasing order, we have
$$|\lambda_i(A) - \lambda_i(B)| \le \|A - B\|$$

and
$$\|P_D(A) - P_D(B)\| \le \frac{\|A - B\|}{\delta},$$
where $D > 0$, $P_D(A)$ is the projection operator onto the first D eigenvectors of A and $\delta = (\lambda_D(A) - \lambda_{D+1}(A))/2$.

Theorem 3.0.1 (k-Nearest Neighbor convergence of $\hat{T}$)
$$\|T_k(\hat{\Sigma}_p) - T_k(\Sigma_p)\| = O_P\left(k\sqrt{\frac{\log p}{n}}\right)$$

Proof. The outline of the proof is as follows. We note from Definition (4) that, for symmetric matrices, the spectral norm $\|M\|$ is bounded by $\|M\|_{(\infty,\infty)}$, and use this to get
$$\|T_k(\hat{\Sigma}_p) - T_k(\Sigma_p)\| \le \|T_k(\hat{\Sigma}_p) - T_k(\Sigma_p)\|_{(\infty,\infty)} \le 2k\,\|T_k(\hat{\Sigma}_p) - T_k(\Sigma_p)\|_\infty.$$
The second inequality follows from bounding the number of non-zero elements in any row of the difference, which is at most 2k (attained when the operator $T_k$ applied to the two matrices completely disagrees on the locations of the non-zero elements in a given row). Thus,
$$\|T_k(\hat{\Sigma}_p) - T_k(\Sigma_p)\| = O_P\left(k\,\|T_k(\hat{\Sigma}_p) - T_k(\Sigma_p)\|_\infty\right). \qquad (3.2)$$
Since each element of both $\hat{\Sigma}_p$ and $\Sigma_p$ (and hence of $T_k(\hat{\Sigma}_p)$ and $T_k(\Sigma_p)$) can be represented as the sum of independent bounded random variables (the individual pixel values are bounded), an application of Hoeffding's Inequality (Result (1))

followed by a union bound over the whole matrix gives
$$\Pr\left[\|T_k(\hat{\Sigma}_p) - T_k(\Sigma_p)\|_\infty \ge t\right] \le kp\,e^{-\delta n t^2}.$$
Choosing $t = \sqrt{\frac{\log(pk)}{n}}$,
$$\Pr\left[\|T_k(\hat{\Sigma}_p) - T_k(\Sigma_p)\|_\infty \ge \sqrt{\frac{\log(pk)}{n}}\right] \le (pk)^{1-\delta}.$$
Thus we have
$$\|T_k(\hat{\Sigma}_p) - T_k(\Sigma_p)\|_\infty = O_P\left(\sqrt{\frac{\log(pk)}{n}}\right) = O_P\left(\sqrt{\frac{\log p}{n}}\right) \quad (\text{since } k < p). \qquad (3.3)$$
Thus from (3.2) and (3.3) we have
$$\|T_k(\hat{\Sigma}_p) - T_k(\Sigma_p)\| = O_P\left(k\sqrt{\frac{\log p}{n}}\right).$$

Theorem 3.0.2 (Reconstruction of the metric $d_M(\cdot,\cdot)$) Under the model in (3.1), given a set of sample images, we can recover a metric $d_G(i, j)$ such that if $d_M(i, j) \le d_M(i, k)$, then $d_G(i, j) \le d_G(i, k)$.

Proof. If the largest dissimilarity value retained by $T_k(\hat{D})$ is $\epsilon'$, then the smallest correlation retained in the corresponding correlation matrix is $\epsilon = 1 - \epsilon'$. If we choose k

such that $\epsilon > \epsilon_0$, then by Theorem (3.0.1) and the assumption in Equation (3.1), for $\rho_{ij} > \epsilon_0$ we have $d_G(i, j) = 1 - \rho_{ij}$ and $d_M(i, j) = m\,d_G(i, j) + c$. Consider a path with one intermediate coordinate in the original metric; say the path between coordinates X and Y passes through P. We have
$$d_M(X, Y) = d_M(X, P) + d_M(P, Y)$$
and
$$P = \arg\min_{v \in \Lambda}\left[d_M(X, v) + d_M(v, Y)\right] = \arg\min_{v \in \Lambda}\left[m\,d_G(X, v) + c + m\,d_G(v, Y) + c\right] = \arg\min_{v \in \Lambda}\left[d_G(X, v) + d_G(v, Y)\right],$$
where $\Lambda$ is the set of all coordinates. So the path in the metric that we construct also passes through P. This argument extends to paths passing through more than one intermediate coordinate, and is in fact the basis of the Floyd-Warshall algorithm (dynamic programming).

Theorem 3.0.3 (Convergence of the Banded Estimator $\hat{\Sigma}_p^{reg}$) Suppose the population covariance matrix $\Sigma_p$ belongs to the set of matrices bandable with respect to the metric $d_M(i, j)$ (Definition (3)). Then, if we choose d such that $k_d \asymp (n^{-1}\log p)^{-\frac{1}{2(\alpha+1)}}$,
$$\|\hat{\Sigma}_p^{reg} - \Sigma_p\| = O_P\left(\left(\frac{\log p}{n}\right)^{\frac{\alpha}{2(\alpha+1)}}\right),$$
where $\hat{\Sigma}_p^{reg} = B_G^d(\hat{\Sigma}_p)$.

Proof. The proof follows along the lines of the proof of Theorem 1 in [6]. From the triangle inequality we have
$$\|B_G^d(\hat{\Sigma}_p) - \Sigma_p\| \le \|B_G^d(\hat{\Sigma}_p) - B_G^d(\Sigma_p)\| + \|B_G^d(\Sigma_p) - \Sigma_p\|. \qquad (3.4)$$
From Definition (4), for symmetric matrices the spectral norm $\|M\|$ is bounded by $\|M\|_{(\infty,\infty)}$, and we use this to get
$$\|B_G^d(\hat{\Sigma}_p) - B_G^d(\Sigma_p)\| \le \|B_G^d(\hat{\Sigma}_p) - B_G^d(\Sigma_p)\|_{(\infty,\infty)} \le k_d\,\|B_G^d(\hat{\Sigma}_p) - B_G^d(\Sigma_p)\|_\infty.$$
The second inequality follows from bounding the number of non-zero elements in any row of the difference, which is at most $k_d$ by the definition of $k_d$. Thus,
$$\|B_G^d(\hat{\Sigma}_p) - B_G^d(\Sigma_p)\| = O_P\left(k_d\,\|B_G^d(\hat{\Sigma}_p) - B_G^d(\Sigma_p)\|_\infty\right). \qquad (3.5)$$
Since each element of both $\hat{\Sigma}_p$ and $\Sigma_p$ (and hence of $B_G^d(\hat{\Sigma}_p) - B_G^d(\Sigma_p)$) can be represented as the sum of independent bounded random variables (the individual pixel values are bounded), an application of Hoeffding's Inequality (Result (1)) followed by a union bound over the whole matrix gives
$$\Pr\left[\|B_G^d(\hat{\Sigma}_p) - B_G^d(\Sigma_p)\|_\infty \ge t\right] \le k_d\,p\,e^{-\delta n t^2}.$$
Choosing $t = \sqrt{\frac{\log(p k_d)}{n}}$,
$$\Pr\left[\|B_G^d(\hat{\Sigma}_p) - B_G^d(\Sigma_p)\|_\infty \ge \sqrt{\frac{\log(p k_d)}{n}}\right] \le (p k_d)^{1-\delta}.$$

Thus we have
$$\|B_G^d(\hat{\Sigma}_p) - B_G^d(\Sigma_p)\|_\infty = O_P\left(\sqrt{\frac{\log(p k_d)}{n}}\right) = O_P\left(\sqrt{\frac{\log p}{n}}\right) \quad (\text{since } k_d < p). \qquad (3.6)$$
Thus from (3.5) and (3.6) we have
$$\|B_G^d(\hat{\Sigma}_p) - B_G^d(\Sigma_p)\| = O_P\left(k_d\sqrt{\frac{\log p}{n}}\right) = O_P\left(\left(\frac{\log p}{n}\right)^{\frac{\alpha}{2(\alpha+1)}}\right). \qquad (3.7)$$
Moreover, from Definition (3) and the properties of the norms, we have
$$\|B_G^d(\Sigma_p) - \Sigma_p\| \le \|B_G^d(\Sigma_p) - \Sigma_p\|_{(\infty,\infty)} = O_P\left(k_d^{-\alpha}\right) = O_P\left(\left(\frac{\log p}{n}\right)^{\frac{\alpha}{2(\alpha+1)}}\right). \qquad (3.8)$$
From Equations (3.4), (3.7) and (3.8) we thus have
$$\|\hat{\Sigma}_p^{reg} - \Sigma_p\| = O_P\left(\left(\frac{\log p}{n}\right)^{\frac{\alpha}{2(\alpha+1)}}\right).$$

Corollary 3.0.4 (Convergence of Banded Faces) If the population covariance matrix $\Sigma_p$ belongs to the set of matrices bandable with respect to the metric $d_M(i, j)$ (Definition (3)) and we choose d such that $k_d \asymp (n^{-1}\log p)^{-\frac{1}{2(\alpha+1)}}$, then
$$\|P_D(\Sigma_p) - P_D(\hat{\Sigma}_p^{reg})\| = O_P\left(\frac{1}{\delta}\left(\frac{\log p}{n}\right)^{\frac{\alpha}{2(\alpha+1)}}\right),$$
where $\hat{\Sigma}_p^{reg} = B_G^d(\hat{\Sigma}_p)$, $D > 0$, $\delta = (\lambda_D(\Sigma_p) - \lambda_{D+1}(\Sigma_p))/2$, $\lambda_i(A)$ is the i-th largest eigenvalue of A and $P_D(A)$ is the projection operator onto the first D eigenvectors of A.

Proof. By a straightforward application of Result (2) to $P_D(\Sigma_p) - P_D(\hat{\Sigma}_p^{reg})$ we get
$$\|P_D(\Sigma_p) - P_D(\hat{\Sigma}_p^{reg})\| \le \frac{\|\Sigma_p - \hat{\Sigma}_p^{reg}\|}{\delta}.$$
Applying the result of Theorem (3.0.3),
$$\|P_D(\Sigma_p) - P_D(\hat{\Sigma}_p^{reg})\| \le \frac{1}{\delta}\,O_P\left(\left(\frac{\log p}{n}\right)^{\frac{\alpha}{2(\alpha+1)}}\right) = O_P\left(\frac{1}{\delta}\left(\frac{\log p}{n}\right)^{\frac{\alpha}{2(\alpha+1)}}\right).$$
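As an illustration of the flavour of Theorem (3.0.3) and Corollary (3.0.4), the following toy numerical check (not part of the thesis) compares the sample covariance with its banded version when the population covariance decays with a known coordinate metric. The metric |i - j|, the decay rate 0.7, the banding radius and the sample size are all arbitrary choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, D, d = 200, 40, 5, 6.0

# Population covariance that is "bandable" w.r.t. the metric d_M(i, j) = |i - j|
dist = np.abs(np.subtract.outer(np.arange(p), np.arange(p))).astype(float)
Sigma = 0.7 ** dist

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)        # n << p samples
Xc = X - X.mean(axis=0)
Sigma_hat = Xc.T @ Xc / n                                      # sample covariance
Sigma_band = np.where(dist <= d, Sigma_hat, 0.0)               # banded estimator B_G^d(Sigma_hat)

def op_norm(M):
    """Spectral norm of a symmetric matrix: largest absolute eigenvalue."""
    return np.abs(np.linalg.eigvalsh((M + M.T) / 2)).max()

def top_projector(M, D):
    """Projection operator P_D onto the span of the top D eigenvectors."""
    w, V = np.linalg.eigh(M)
    U = V[:, np.argsort(w)[::-1][:D]]
    return U @ U.T

print("operator-norm error, sample:", op_norm(Sigma_hat - Sigma))
print("operator-norm error, banded:", op_norm(Sigma_band - Sigma))
print("projector error, sample:", op_norm(top_projector(Sigma_hat, D) - top_projector(Sigma, D)))
print("projector error, banded:", op_norm(top_projector(Sigma_band, D) - top_projector(Sigma, D)))
```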

CHAPTER 4

EXPERIMENTAL EVALUATION

In this chapter we evaluate the performance of the proposed algorithm and compare it to 2DPCA, classical Eigenfaces and some of the techniques recently proposed in the statistical literature. We test our algorithm on two basic tasks of computer vision, face recognition and image restoration/denoising, using two popular face datasets, the Yale face dataset and the ATT/ORL face dataset.

4.1 Datasets

ATT/ORL Face dataset: The ATT/ORL dataset consists of images of 40 different subjects with 10 images per subject. To speed up the computations we scale the images down to pixels.

Yale Face dataset: The Yale face dataset contains grayscale images of 15 individuals. There are 11 images per subject, one per different facial expression or configuration. For our experiments, we scale the images down to resolution.

In some of the experiments, zero-mean Gaussian pixel noise of differing variance is added to the images. We note that in all our experiments the number of training images n is significantly smaller than the dimension of the space p.
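As an illustration of this preprocessing, a small sketch that downscales a face image and adds zero-mean Gaussian pixel noise is given below. The use of Pillow, the target resolution and the scaling of pixel values to [0, 1] are assumptions made only for the example; sigma_noise plays the role of the noise standard deviation used in the tables that follow.

```python
import numpy as np
from PIL import Image

def load_face(image_path, size_hw=(28, 23), sigma_noise=0.02, seed=0):
    """Load a face image, downscale it, vectorize it and add zero-mean Gaussian pixel noise."""
    rng = np.random.default_rng(seed)
    h, w = size_hw
    img = Image.open(image_path).convert("L").resize((w, h))    # grayscale, downscaled
    x = np.asarray(img, dtype=float).ravel() / 255.0            # pixel vector in [0, 1]
    x_noisy = x + rng.normal(0.0, sigma_noise, size=x.shape)    # zero-mean Gaussian noise
    return np.clip(x_noisy, 0.0, 1.0)
```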

4.2 Face Recognition

We compare the performance of our algorithm with 2DPCA, Eigenfaces and two recent statistical banding techniques in the classification setting. To do that, we reduce the dimensionality using each algorithm and run a very simple classification algorithm, the 1-Nearest Neighbor classifier, in the reduced space. The two statistical techniques are banding of the covariance matrix using Euclidean pixel proximity [6] and thresholding (see [5, 17]). In Euclidean pixel proximity based banding, the assumption is that each pixel is correlated only with those pixels which are less than a distance d away from it in the Euclidean pixel space. Hence the procedure is to zero the covariance between two pixels if the distance between them is greater than a certain distance (the banding parameter) in the Euclidean pixel space. The assumption for the thresholding approach is that all covariances below a certain threshold are due to noise. The algorithm works by setting all entries in the empirical covariance matrix that have an absolute value below a certain thresholding parameter t to zero. The values of the banding and thresholding parameters in these algorithms, as well as the parameter d for our method, were chosen using cross-validation, which is common practice in similar experiments. The number of nearest neighbors k for the Banded Faces algorithm was set to 20 for all the experiments other than Table 4.4, where we show the dependence of the error on the number of nearest neighbors. To test the performance of the estimators and their robustness to noise, we run them on a set of faces with additional Gaussian pixel noise added. The results of this experiment on the ATT/ORL dataset are shown in Table (4.1) and for the Yale face dataset in Table (4.2). The first 100 (50 for the Yale dataset) eigenvectors were chosen as they capture approximately 90% of the variance.
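A minimal sketch of this evaluation protocol, assuming X_train, y_train, X_test, y_test are NumPy arrays of vectorized images and integer labels, and reusing the banded_faces_covariance sketch from Section 2.3, might look as follows (the eigenvector counts are the ones quoted in the tables, not fixed requirements):

```python
import numpy as np

def one_nn_error(X_train, y_train, X_test, y_test, eigvecs, n_eig=100):
    """Project onto the top n_eig eigenvectors and classify with 1-nearest-neighbor."""
    U = eigvecs[:, :n_eig]                        # leading eigenvectors as columns
    mu = X_train.mean(axis=0)
    Z_train = (X_train - mu) @ U                  # reduced-dimension representations
    Z_test = (X_test - mu) @ U
    dists = np.linalg.norm(Z_test[:, None, :] - Z_train[None, :, :], axis=2)
    pred = y_train[np.argmin(dists, axis=1)]      # label of the nearest training sample
    return float(np.mean(pred != y_test))         # classification error rate

# Usage (hypothetical data):
# _, eigvecs, _ = banded_faces_covariance(X_train, k=20, d=1.5)
# err = one_nn_error(X_train, y_train, X_test, y_test, eigvecs, n_eig=100)
```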

To see the influence of the number of eigenvectors on performance, in Table (4.3) we compare Banded Faces with the different techniques for different numbers of eigenvectors (2DPCA is omitted as the number of eigenvectors is not directly comparable). All experiments were run for 15 iterations and the average values are reported.

Fisherfaces

We also compare Banded Faces with the technique of Euclidean proximity banding ([6]) in the context of classification, using Fisherfaces as the dimensionality reduction technique. Here the banding techniques operate on the full covariance matrix constructed in the dimensionality reduction step of Fisherfaces (see [3]) as well as on the between-class covariance matrix. The number of nearest neighbors k for the Banded Faces algorithm was set to 20. The performance of the algorithms is compared with different amounts of Gaussian pixel noise added to the datasets. The results for this experiment can be found in Table (4.5).

Table 4.1 (columns: σ_noise, Eigenfaces, Euc, 2DPCA, Thresh, B F): Classification error rate (in %) on a training set of 40 subjects with 4 samples per subject and a test set of 240 samples drawn from the ATT/ORL face dataset with different amounts of added noise. The number of eigenvectors was chosen to be 100 for all methods other than 2DPCA. For 2DPCA the number of eigenvectors was 4, which produced the best performance.

Discussion. Several observations are now in order:

Table 4.2 (columns: σ_noise, Eigenfaces, Euc, 2DPCA, Thresh, B F): Classification error rate (in %) on a training set of 15 subjects with 4 samples per subject and a test set of 165 samples drawn from the Yale face dataset. The number of eigenvectors was 50 and k was chosen to be 20. The results are averaged over 15 iterations.

Table 4.3 (columns: N_e, Eigenfaces, Euc, Thresh, B F): Classification error rate (in %) as a function of the number of eigenvectors N_e. The training set contains 40 subjects with 4 samples per subject and the test set contains 240 samples drawn from the ATT/ORL face dataset.

1. In our experiments we see that Banded Faces produces consistently better classification accuracy than 2DPCA and classical Eigenfaces, as well as the thresholding and Euclidean banding methods. These improvements are consistent throughout a range of parameters. We also see that the performance of Banded Faces is quite robust to changes in the number of nearest neighbors.

2. In line with the theoretical results in [15], we expect that the addition of noise to images makes estimating the covariance matrix and its eigenvectors less

stable, since the number of samples is much smaller than the dimension of the space. Our theoretical results suggest that Banded Faces should require fewer samples for accurate estimation, and thus its advantage should increase with the amount of noise. This is generally borne out in our experiments. For the ATT/ORL dataset in Table (4.1), the performance of 2DPCA and the other methods deteriorates significantly with added noise, while the classification accuracy of Banded Faces decreases much less. For the Yale dataset, Table (4.2), the performance of all methods decreases significantly with added noise; however, the comparative advantage of Banded Faces still increases.

Table 4.4 (columns: number of nearest neighbors k for the Banded Faces adjacency graph, plus Euc, Thr and 2DPCA; rows: error rate in %, sparsity in %): Classification error rate (in %) and percentage of non-zero entries (sparsity, in %) of the regularized covariance matrices for Banded Faces with different numbers of nearest neighbors k used to construct the graph, versus the best parameter settings for Euclidean and thresholding banding. While sparsity is not applicable for 2DPCA, its classification error is provided for comparison. The bolded setting is used for all other computations.

3. It is interesting to note that the 2DPCA, Euclidean pixel proximity and thresholding methods show similar performance (Table (4.1) and Table (4.2)). We see that these methods tend to outperform Eigenfaces, especially for noisy images, but do not match the performance of Banded Faces. We take these findings as validating our intuition that incorporating additional structure into the covariance matrix is generally helpful, especially when large amounts of noise are present, but that banding

using just the Euclidean spatial structure (as in 2DPCA and Euclidean banding) or simple thresholding may be too rigid.

Table 4.5 (columns: σ_noise, Fisherfaces, Euclidean, Banded Faces): Classification error rate for Fisherfaces (in %) on a training set of 40 subjects with 4 samples per subject and a test set of 240 samples drawn from the ATT/ORL face dataset with different amounts of added noise.

4. On a coarse level, the regularization by Banded Faces might seem equivalent to a thresholding procedure, which functions by setting all covariances whose absolute values fall below a particular threshold to 0. The experiments in Table (4.4) serve to dispel this possibility by comparing the respective regularized covariance matrices in terms of sparsity. It can be seen that even when the parameter k used in Banded Faces approaches its maximum value, the two procedures do not collapse to equivalence.

4.3 Image Reconstruction/Denoising

The goal of the next set of experiments is to compare the performance of the algorithms in the context of image reconstruction/denoising. Here, the procedure is to project the images onto a subspace spanned by the few top eigenvectors, in which the dimensions that mostly capture noise are presumably eliminated. In our experiments we chose to project onto the top 20 eigenvectors. We compare the results obtained by using Eigenfaces, Banded Faces and banding using Euclidean pixel

proximity. The $L_2$ distance between the reconstructed and the original image is used to quantify reconstruction performance, with 0 distance corresponding to perfect reconstruction. As before, the value of k for the Banded Faces algorithm was set to 20 for all experiments. The experiments were repeated over multiple iterations, with 30 samples of 10 subjects (3 per subject) drawn randomly from the dataset for each iteration. It should be noted that Eigenfaces (Principal Components Analysis) produces the best possible low-rank approximation to the empirical covariance matrix, thus minimizing the average $L_2$ distance to the training set. However (according to the results in [15]), we believe that this may not be a good estimator of the true population covariance matrix. We test this by adding noise to the images, applying the algorithm, and analyzing the distance of the reconstructed images to the original (noise-free) prototypes. The corresponding performance values averaged over 5 iterations are shown in Table (4.6) for the ATT/ORL dataset and in Table (4.7) for the Yale dataset. We also show some representative sample images of three of the subjects with varying levels of noise and their reconstructions in Table (4.8).

Table 4.6 (columns: σ_noise, Eigenfaces, Euc, Thresh, 2DPCA, B F): Reconstruction error ($L_2$ distance to the original) on a set of 10 subjects with 3 samples per subject drawn from the ATT face dataset, with different amounts of noise added.
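The reconstruction just described is an orthogonal projection onto the span of the top eigenvectors. A minimal sketch, assuming the eigenvectors are available as the columns of a matrix (for instance from the covariance sketch in Section 2.3), is:

```python
import numpy as np

def reconstruct(x_noisy, mean, eigvecs, n_eig=20):
    """Project a (noisy) image vector onto the subspace spanned by the top n_eig eigenvectors."""
    U = eigvecs[:, :n_eig]
    return mean + U @ (U.T @ (x_noisy - mean))    # orthogonal projection plus the mean image

def reconstruction_error(x_original, x_noisy, mean, eigvecs, n_eig=20):
    """L2 distance between the reconstruction of the noisy image and the clean original."""
    return float(np.linalg.norm(reconstruct(x_noisy, mean, eigvecs, n_eig) - x_original))
```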

Table 4.7 (columns: σ_noise, Eigenfaces, Euc, Thresh, 2DPCA, B F): Reconstruction error ($L_2$ distance to the original) on a set of 15 subjects with 3 samples per subject drawn from the Yale face dataset, with different amounts of noise added.

Discussion. We observe that images reconstructed using Banded Faces are consistently closer to the original prototypes than images reconstructed with Eigenfaces and the other methods. As expected, this gap increases with the amount of added noise. These improvements in reconstruction accuracy are easily observable in the images in Table (4.8), where some of the facial features are more clearly seen in the rightmost column but are less apparent in the reconstructions obtained with the other two methods. Interestingly, Euclidean proximity banding shows almost identical performance to Eigenfaces for smaller amounts of noise but demonstrates some improvement for the largest noise setting, while 2DPCA is better on one of the datasets. We see that the difference in performance between Eigenfaces and Banded Faces is significantly larger on the ATT/ORL dataset than on the Yale dataset. We conjecture that this may be a result of the fact that the original images in the ATT dataset are considerably more noisy.

Table 4.8 (image panel; columns: Original Image, Noisy Image, Eigenfaces, Euclidean, Banded Faces; rows: Low Noise σ_noise 0.01, Medium Noise σ_noise 0.02, High Noise σ_noise 0.03, each with its $L_2$ distance): A few representative samples from the experiments in Table (4.6). The numbers below an image are the values of its $L_2$ distance from the corresponding original image.

CHAPTER 5

SUMMARY AND CONCLUSIONS

In this thesis we explore the structure of covariance matrices used in many popular algorithms, such as Eigenfaces, in the setting where the dimension of the space is significantly larger than the number of samples available, a situation which is common in a range of computer vision problems. Theoretical analysis of Eigenfaces shows that they generally require a number of samples linear in the dimension. We postulate the existence of a data-dependent metric space structure on the coordinates, and show that under certain assumptions covariance matrices can be estimated accurately even when the number of samples is logarithmic in the dimension of the space. The resulting algorithms are simple and, as demonstrated by our experimental results, compare favorably to Eigenfaces in image reconstruction/denoising and image classification. We note that Eigenfaces is a very popular algorithm used for a variety of problems in computer vision, and it provides a nice basis for both experimental and theoretical comparison. However, our methods are not restricted to Eigenfaces and can be utilized wherever covariance or correlation matrices are used.

BIBLIOGRAPHY

[1] T. H. Ahonen and M. A. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE PAMI, 28(12).

[2] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, New York.

[3] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7).

[4] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6).

[5] P. J. Bickel and E. Levina. Covariance regularization by thresholding. To appear in the Annals of Statistics.

[6] P. J. Bickel and E. Levina. Regularized estimation of large covariance matrices. Annals of Statistics, 36(1).

[7] P. J. Bickel and B. Li. Regularization in statistics. Test, 15(2).

[8] R. Chellappa, C. L. Wilson, and S. Sirohey. Human and machine recognition of faces: A survey. Proceedings of the IEEE, 83(5).

[9] R. Furrer and T. Bengtsson. Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants. Journal of Multivariate Analysis, 98:227-255.

[10] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(2).

[11] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, Maryland, second edition.

[12] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang. Face recognition using Laplacianfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3).

[13] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55-67.

[14] J. Huang, N. Liu, M. Pourahmadi, and L. Liu. Covariance matrix selection and estimation via penalised normal likelihood. Biometrika, 93:85-98.

[15] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2).

[16] N. El Karoui. Spectrum estimation for large dimensional covariance matrices using random matrix theory. Annals of Statistics.

[17] N. El Karoui. Operator norm consistent estimation of large dimensional sparse covariance matrices. To appear in the Annals of Statistics.

[18] T. Kato. Perturbation Theory for Linear Operators. Springer.

[19] H. Kong, L. Wang, E. K. Teoh, X. Li, J. G. Wang, and R. Venkateswarlu. Generalized 2D principal component analysis for face image representation and recognition. Neural Networks, 18(5-6).

[20] S. L. Lauritzen. Graphical Models. Clarendon Press.

[21] V. A. Marcenko and L. A. Pastur. Distribution of the eigenvalues in certain sets of random matrices. Math. USSR-Sbornik, 1(4).

[22] K. Nishino, S. Nayar, and T. Jebara. Clustered blockwise PCA for representing visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1675.

[23] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290.

[24] J. W. Silverstein and Z. D. Bai. On the empirical distribution of eigenvalues of a class of large dimensional random matrices. Journal of Multivariate Analysis, 54.

[25] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290.

[26] M. Turk and A. Pentland. Face recognition using Eigenfaces. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[27] L. Wang, X. Wang, X. Zhang, and J. Feng. The equivalence of two-dimensional PCA to line-based PCA. Pattern Recognition Letters, 26(1):57-60.

[28] W. B. Wu and M. Pourahmadi. Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika, 90:831-844.

[29] J. Yang, D. Zhang, A. F. Frangi, and J. Yang. Two-dimensional PCA: A new approach to appearance-based face representation and recognition. IEEE PAMI, 26(1).

[30] J. Ye, R. Janardan, and Q. Li. Two-dimensional linear discriminant analysis. NIPS.


More information

CITS 4402 Computer Vision

CITS 4402 Computer Vision CITS 4402 Computer Vision A/Prof Ajmal Mian Adj/A/Prof Mehdi Ravanbakhsh Lecture 06 Object Recognition Objectives To understand the concept of image based object recognition To learn how to match images

More information

Graph-Laplacian PCA: Closed-form Solution and Robustness

Graph-Laplacian PCA: Closed-form Solution and Robustness 2013 IEEE Conference on Computer Vision and Pattern Recognition Graph-Laplacian PCA: Closed-form Solution and Robustness Bo Jiang a, Chris Ding b,a, Bin Luo a, Jin Tang a a School of Computer Science and

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Keywords Eigenface, face recognition, kernel principal component analysis, machine learning. II. LITERATURE REVIEW & OVERVIEW OF PROPOSED METHODOLOGY

Keywords Eigenface, face recognition, kernel principal component analysis, machine learning. II. LITERATURE REVIEW & OVERVIEW OF PROPOSED METHODOLOGY Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Eigenface and

More information

Regularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008

Regularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008 Regularized Estimation of High Dimensional Covariance Matrices Peter Bickel Cambridge January, 2008 With Thanks to E. Levina (Joint collaboration, slides) I. M. Johnstone (Slides) Choongsoon Bae (Slides)

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine

Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine Olga Kouropteva, Oleg Okun, Matti Pietikäinen Machine Vision Group, Infotech Oulu and

More information

Laplacian Eigenmaps for Dimensionality Reduction and Data Representation

Laplacian Eigenmaps for Dimensionality Reduction and Data Representation Introduction and Data Representation Mikhail Belkin & Partha Niyogi Department of Electrical Engieering University of Minnesota Mar 21, 2017 1/22 Outline Introduction 1 Introduction 2 3 4 Connections to

More information

Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices

Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices arxiv:1308.3416v1 [stat.me] 15 Aug 2013 Yixin Fang 1, Binhuan Wang 1, and Yang Feng 2 1 New York University and 2 Columbia

More information

Intrinsic Structure Study on Whale Vocalizations

Intrinsic Structure Study on Whale Vocalizations 1 2015 DCLDE Conference Intrinsic Structure Study on Whale Vocalizations Yin Xian 1, Xiaobai Sun 2, Yuan Zhang 3, Wenjing Liao 3 Doug Nowacek 1,4, Loren Nolte 1, Robert Calderbank 1,2,3 1 Department of

More information

Smart PCA. Yi Zhang Machine Learning Department Carnegie Mellon University

Smart PCA. Yi Zhang Machine Learning Department Carnegie Mellon University Smart PCA Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Abstract PCA can be smarter and makes more sensible projections. In this paper, we propose smart PCA, an extension

More information

Graph Metrics and Dimension Reduction

Graph Metrics and Dimension Reduction Graph Metrics and Dimension Reduction Minh Tang 1 Michael Trosset 2 1 Applied Mathematics and Statistics The Johns Hopkins University 2 Department of Statistics Indiana University, Bloomington November

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Reconnaissance d objetsd et vision artificielle

Reconnaissance d objetsd et vision artificielle Reconnaissance d objetsd et vision artificielle http://www.di.ens.fr/willow/teaching/recvis09 Lecture 6 Face recognition Face detection Neural nets Attention! Troisième exercice de programmation du le

More information

L26: Advanced dimensionality reduction

L26: Advanced dimensionality reduction L26: Advanced dimensionality reduction The snapshot CA approach Oriented rincipal Components Analysis Non-linear dimensionality reduction (manifold learning) ISOMA Locally Linear Embedding CSCE 666 attern

More information

A METHOD OF FINDING IMAGE SIMILAR PATCHES BASED ON GRADIENT-COVARIANCE SIMILARITY

A METHOD OF FINDING IMAGE SIMILAR PATCHES BASED ON GRADIENT-COVARIANCE SIMILARITY IJAMML 3:1 (015) 69-78 September 015 ISSN: 394-58 Available at http://scientificadvances.co.in DOI: http://dx.doi.org/10.1864/ijamml_710011547 A METHOD OF FINDING IMAGE SIMILAR PATCHES BASED ON GRADIENT-COVARIANCE

More information

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Yoshua Bengio Pascal Vincent Jean-François Paiement University of Montreal April 2, Snowbird Learning 2003 Learning Modal Structures

More information

Riemannian Metric Learning for Symmetric Positive Definite Matrices

Riemannian Metric Learning for Symmetric Positive Definite Matrices CMSC 88J: Linear Subspaces and Manifolds for Computer Vision and Machine Learning Riemannian Metric Learning for Symmetric Positive Definite Matrices Raviteja Vemulapalli Guide: Professor David W. Jacobs

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Modeling Classes of Shapes Suppose you have a class of shapes with a range of variations: System 2 Overview

Modeling Classes of Shapes Suppose you have a class of shapes with a range of variations: System 2 Overview 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 Modeling Classes of Shapes Suppose you have a class of shapes with a range of variations: System processes System Overview Previous Systems:

More information

Spectral Regression for Efficient Regularized Subspace Learning

Spectral Regression for Efficient Regularized Subspace Learning Spectral Regression for Efficient Regularized Subspace Learning Deng Cai UIUC dengcai2@cs.uiuc.edu Xiaofei He Yahoo! hex@yahoo-inc.com Jiawei Han UIUC hanj@cs.uiuc.edu Abstract Subspace learning based

More information

Iterative Laplacian Score for Feature Selection

Iterative Laplacian Score for Feature Selection Iterative Laplacian Score for Feature Selection Linling Zhu, Linsong Miao, and Daoqiang Zhang College of Computer Science and echnology, Nanjing University of Aeronautics and Astronautics, Nanjing 2006,

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Robust Laplacian Eigenmaps Using Global Information

Robust Laplacian Eigenmaps Using Global Information Manifold Learning and its Applications: Papers from the AAAI Fall Symposium (FS-9-) Robust Laplacian Eigenmaps Using Global Information Shounak Roychowdhury ECE University of Texas at Austin, Austin, TX

More information

Graphs, Geometry and Semi-supervised Learning

Graphs, Geometry and Semi-supervised Learning Graphs, Geometry and Semi-supervised Learning Mikhail Belkin The Ohio State University, Dept of Computer Science and Engineering and Dept of Statistics Collaborators: Partha Niyogi, Vikas Sindhwani In

More information

Distance Metric Learning in Data Mining (Part II) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Distance Metric Learning in Data Mining (Part II) Fei Wang and Jimeng Sun IBM TJ Watson Research Center Distance Metric Learning in Data Mining (Part II) Fei Wang and Jimeng Sun IBM TJ Watson Research Center 1 Outline Part I - Applications Motivation and Introduction Patient similarity application Part II

More information

Deep Learning: Approximation of Functions by Composition

Deep Learning: Approximation of Functions by Composition Deep Learning: Approximation of Functions by Composition Zuowei Shen Department of Mathematics National University of Singapore Outline 1 A brief introduction of approximation theory 2 Deep learning: approximation

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

Orthogonal Laplacianfaces for Face Recognition

Orthogonal Laplacianfaces for Face Recognition 3608 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 11, NOVEMBER 2006 [29] G. Deng and J. C. Pinoli, Differentiation-based edge detection using the logarithmic image processing model, J. Math. Imag.

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

COVARIANCE REGULARIZATION FOR SUPERVISED LEARNING IN HIGH DIMENSIONS

COVARIANCE REGULARIZATION FOR SUPERVISED LEARNING IN HIGH DIMENSIONS COVARIANCE REGULARIZATION FOR SUPERVISED LEARNING IN HIGH DIMENSIONS DANIEL L. ELLIOTT CHARLES W. ANDERSON Department of Computer Science Colorado State University Fort Collins, Colorado, USA MICHAEL KIRBY

More information

Small sample size in high dimensional space - minimum distance based classification.

Small sample size in high dimensional space - minimum distance based classification. Small sample size in high dimensional space - minimum distance based classification. Ewa Skubalska-Rafaj lowicz Institute of Computer Engineering, Automatics and Robotics, Department of Electronics, Wroc

More information

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization Tim Roughgarden & Gregory Valiant April 18, 2018 1 The Context and Intuition behind Regularization Given a dataset, and some class of models

More information

Recognition Using Class Specific Linear Projection. Magali Segal Stolrasky Nadav Ben Jakov April, 2015

Recognition Using Class Specific Linear Projection. Magali Segal Stolrasky Nadav Ben Jakov April, 2015 Recognition Using Class Specific Linear Projection Magali Segal Stolrasky Nadav Ben Jakov April, 2015 Articles Eigenfaces vs. Fisherfaces Recognition Using Class Specific Linear Projection, Peter N. Belhumeur,

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Group Sparse Non-negative Matrix Factorization for Multi-Manifold Learning

Group Sparse Non-negative Matrix Factorization for Multi-Manifold Learning LIU, LU, GU: GROUP SPARSE NMF FOR MULTI-MANIFOLD LEARNING 1 Group Sparse Non-negative Matrix Factorization for Multi-Manifold Learning Xiangyang Liu 1,2 liuxy@sjtu.edu.cn Hongtao Lu 1 htlu@sjtu.edu.cn

More information

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1 Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger

More information

Face Recognition Using Multi-viewpoint Patterns for Robot Vision

Face Recognition Using Multi-viewpoint Patterns for Robot Vision 11th International Symposium of Robotics Research (ISRR2003), pp.192-201, 2003 Face Recognition Using Multi-viewpoint Patterns for Robot Vision Kazuhiro Fukui and Osamu Yamaguchi Corporate Research and

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 26, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 55 High dimensional

More information

ECE 661: Homework 10 Fall 2014

ECE 661: Homework 10 Fall 2014 ECE 661: Homework 10 Fall 2014 This homework consists of the following two parts: (1) Face recognition with PCA and LDA for dimensionality reduction and the nearest-neighborhood rule for classification;

More information

L 2,1 Norm and its Applications

L 2,1 Norm and its Applications L 2, Norm and its Applications Yale Chang Introduction According to the structure of the constraints, the sparsity can be obtained from three types of regularizers for different purposes.. Flat Sparsity.

More information

Enhanced graph-based dimensionality reduction with repulsion Laplaceans

Enhanced graph-based dimensionality reduction with repulsion Laplaceans Enhanced graph-based dimensionality reduction with repulsion Laplaceans E. Kokiopoulou a, Y. Saad b a EPFL, LTS4 lab, Bat. ELE, Station 11; CH 1015 Lausanne; Switzerland. Email: effrosyni.kokiopoulou@epfl.ch

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

ABSTRACT INTRODUCTION

ABSTRACT INTRODUCTION ABSTRACT Presented in this paper is an approach to fault diagnosis based on a unifying review of linear Gaussian models. The unifying review draws together different algorithms such as PCA, factor analysis,

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Approximating the Covariance Matrix with Low-rank Perturbations

Approximating the Covariance Matrix with Low-rank Perturbations Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu

More information

LECTURE NOTE #11 PROF. ALAN YUILLE

LECTURE NOTE #11 PROF. ALAN YUILLE LECTURE NOTE #11 PROF. ALAN YUILLE 1. NonLinear Dimension Reduction Spectral Methods. The basic idea is to assume that the data lies on a manifold/surface in D-dimensional space, see figure (1) Perform

More information

Supervised locally linear embedding

Supervised locally linear embedding Supervised locally linear embedding Dick de Ridder 1, Olga Kouropteva 2, Oleg Okun 2, Matti Pietikäinen 2 and Robert P.W. Duin 1 1 Pattern Recognition Group, Department of Imaging Science and Technology,

More information

Lecture 13 Visual recognition

Lecture 13 Visual recognition Lecture 13 Visual recognition Announcements Silvio Savarese Lecture 13-20-Feb-14 Lecture 13 Visual recognition Object classification bag of words models Discriminative methods Generative methods Object

More information

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21 CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about

More information

Global (ISOMAP) versus Local (LLE) Methods in Nonlinear Dimensionality Reduction

Global (ISOMAP) versus Local (LLE) Methods in Nonlinear Dimensionality Reduction Global (ISOMAP) versus Local (LLE) Methods in Nonlinear Dimensionality Reduction A presentation by Evan Ettinger on a Paper by Vin de Silva and Joshua B. Tenenbaum May 12, 2005 Outline Introduction The

More information

Data dependent operators for the spatial-spectral fusion problem

Data dependent operators for the spatial-spectral fusion problem Data dependent operators for the spatial-spectral fusion problem Wien, December 3, 2012 Joint work with: University of Maryland: J. J. Benedetto, J. A. Dobrosotskaya, T. Doster, K. W. Duke, M. Ehler, A.

More information

Table of Contents. Multivariate methods. Introduction II. Introduction I

Table of Contents. Multivariate methods. Introduction II. Introduction I Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation

More information

Symmetric Two Dimensional Linear Discriminant Analysis (2DLDA)

Symmetric Two Dimensional Linear Discriminant Analysis (2DLDA) Symmetric Two Dimensional inear Discriminant Analysis (2DDA) Dijun uo, Chris Ding, Heng Huang University of Texas at Arlington 701 S. Nedderman Drive Arlington, TX 76013 dijun.luo@gmail.com, {chqding,

More information

Spectral Techniques for Clustering

Spectral Techniques for Clustering Nicola Rebagliati 1/54 Spectral Techniques for Clustering Nicola Rebagliati 29 April, 2010 Nicola Rebagliati 2/54 Thesis Outline 1 2 Data Representation for Clustering Setting Data Representation and Methods

More information

Machine Learning, Fall 2009: Midterm

Machine Learning, Fall 2009: Midterm 10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all

More information

Nonlinear Manifold Learning Summary

Nonlinear Manifold Learning Summary Nonlinear Manifold Learning 6.454 Summary Alexander Ihler ihler@mit.edu October 6, 2003 Abstract Manifold learning is the process of estimating a low-dimensional structure which underlies a collection

More information

CS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision)

CS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision) CS4495/6495 Introduction to Computer Vision 8B-L2 Principle Component Analysis (and its use in Computer Vision) Wavelength 2 Wavelength 2 Principal Components Principal components are all about the directions

More information

Nonlinear Methods. Data often lies on or near a nonlinear low-dimensional curve aka manifold.

Nonlinear Methods. Data often lies on or near a nonlinear low-dimensional curve aka manifold. Nonlinear Methods Data often lies on or near a nonlinear low-dimensional curve aka manifold. 27 Laplacian Eigenmaps Linear methods Lower-dimensional linear projection that preserves distances between all

More information

Machine Learning. Data visualization and dimensionality reduction. Eric Xing. Lecture 7, August 13, Eric Xing Eric CMU,

Machine Learning. Data visualization and dimensionality reduction. Eric Xing. Lecture 7, August 13, Eric Xing Eric CMU, Eric Xing Eric Xing @ CMU, 2006-2010 1 Machine Learning Data visualization and dimensionality reduction Eric Xing Lecture 7, August 13, 2010 Eric Xing Eric Xing @ CMU, 2006-2010 2 Text document retrieval/labelling

More information