SUBSPACE-BASED LEARNING WITH GRASSMANN KERNELS

Jihun Hamm

A DISSERTATION
in
Electrical and Systems Engineering

Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

2008

Supervisor of Dissertation

Graduate Group Chair

COPYRIGHT
Jihun Hamm
2008

Acknowledgements

I deeply thank my advisor Dr. Daniel D. Lee for so many things. Besides providing financial and mental support for my graduate study, Daniel initiated me into the field of machine learning, which I barely knew about before working with him. From him I learned how to tackle problems from the ground up and stay fresh-minded. The utmost influence Daniel had on me was his energy and passion toward the goal. Being near him kept me and my colleagues stimulated and energized, and helped us endure some tough periods during the Ph.D. process.

I thank Dr. Lawrence Saul for inspiring me to follow my interest in manifold learning. During my early years working with him, his knowledge of and intuition about the subject strongly affected my approach to learning problems. I am also grateful to the other professors who served on my thesis committee: Dr. Ali Jadbabaie, Dr. Jianbo Shi, Dr. Ben Taskar, and Dr. Ragini Verma. They provided valuable feedback to polish the thesis. Dr. Jean Gallier provided me with guidance on mathematical issues before and during the writing of the thesis.

I appreciate the support from my colleagues, especially from my lab members: Dan Huang, Yuanqing Lin, Yung-kyun Noh, and Paul Vernaza. Besides sharing the enthusiasm for research, we shared lots of fun and sometimes stressful moments of daily life as graduate students. Yung-kyun has always been a pleasure to discuss any problem with. He was kind enough to read through the draft of the thesis and give suggestions.

Lastly, I thank my parents and my family for being who they are, and for understanding my excuses for not talking to them more often. My wife Sophia has always been by my side, and I cannot thank her enough for that.

ABSTRACT

SUBSPACE-BASED LEARNING WITH GRASSMANN KERNELS

Jihun Hamm
Supervisor: Prof. Daniel D. Lee

In this thesis I propose a subspace-based learning paradigm for solving novel problems in machine learning. We often encounter subspace structures within data that lie inside a vector space. For example, the set of images of an object or a face under varying lighting conditions is known to lie on a low-dimensional (4- or 9-dimensional) subspace under mild assumptions. Many other types of variation, such as pose change or facial expression, can also be approximated quite well by low-dimensional subspaces. Treating such subspaces as basic units of learning gives rise to challenges that conventional algorithms cannot handle well. In this work, I tackle subspace-based learning problems with the unifying framework of the Grassmann manifold, which is the set of linear subspaces of a Euclidean space. I propose positive definite kernels on this space, which provide easy access to the repository of various kernel algorithms. Furthermore, I show that the Grassmann kernels can be extended to the set of affine and scaled subspaces. This extension allows us to handle larger classes of problems with little additional cost. The proposed kernels in this thesis can be used with any kernel method. In particular, I demonstrate the potential advantages of the proposed kernels with Discriminant Analysis techniques and Support Vector Machines for recognition and categorization tasks. Experiments with real image databases show not only the feasibility of the proposed framework but also the improved performance of the method compared with previously known methods.

Contents

1 INTRODUCTION
   Overview
   Contributions and related work
   Organization of the paper

2 BACKGROUND
   Introduction
   Kernel machines
      Motivation
      Mercer kernels
      Reproducing Kernel Hilbert Space
      Examples of kernels
      Generating new kernels from old kernels
      Distance and conditionally positive definite kernels
   Support Vector Machines
      Large margin linear classifier
      Dual problem and support vectors
      Extensions
      Generalization error and overfitting
   Discriminant Analysis
      Fisher Discriminant Analysis
      Nonparametric Discriminant Analysis
      Discriminant analysis in high-dimensional spaces
      Extension to nonlinear discriminant analysis

3 MOTIVATION: SUBSPACE STRUCTURE IN DATA
   Introduction
   Illumination subspaces in multi-lighting images
   Pose subspaces in multi-view images
   Video sequences of human motions
   Conclusion

4 GRASSMANN MANIFOLDS AND SUBSPACE DISTANCES
   Introduction
   Stiefel and Grassmann manifolds
      Stiefel manifold
      Grassmann manifold
      Principal angles and canonical correlations
   Grassmann distances for subspaces
      Projection distance
      Binet-Cauchy distance
      Max Correlation
      Min Correlation
      Procrustes distance
      Comparison of the distances
   Experiments
      Experimental setting
      Results and discussion
   Conclusion

5 GRASSMANN KERNELS AND DISCRIMINANT ANALYSIS
   Introduction
   Kernel functions for subspaces
      Projection kernel
      Binet-Cauchy kernel
      Indefinite kernels from other metrics
      Extension to nonlinear subspaces
   Experiments with synthetic data
      Synthetic data
      Algorithms
      Results and discussion
   Discriminant Analysis of subspaces
      Grassmann Discriminant Analysis
      Mutual Subspace Method (MSM)
      Constrained MSM (cMSM)
      Discriminant Analysis of Canonical Correlations (DCC)
   Experiments with real-world data
      Algorithms
      Results and discussion
   Conclusion

6 EXTENDED GRASSMANN KERNELS AND PROBABILISTIC DISTANCES
   Introduction
   Analysis of probabilistic distances and kernels
      Probabilistic distances and kernels
      Data as Mixture of Factor Analyzers
      Analysis of KL distance
      Analysis of Probability Product Kernel
   Extended Grassmann Kernel
      Motivation
      Extension to affine subspaces
      Extension to scaled subspaces
      Extension to nonlinear subspaces
   Experiments with synthetic data
      Synthetic data
      Algorithms
      Results and discussion
   Experiments with real-world data
      Algorithms
      Results and discussion
   Conclusion

7 CONCLUSION
   Summary
   Future work
      Theory
      Applications

Bibliography

List of Tables

4.1 Summary of the Grassmann distances. The distances can be defined as simple functions of both the basis $Y$ and the principal angles $\theta_i$, except for the arc-length which involves matrix exponentials.
Classification rates of the Euclidean SVMs and the Grassmann SVMs (synthetic data). The best rate for each dataset is highlighted by boldface.
Classification rates of the Euclidean SVMs and the Grassmann SVMs (extended kernels). The best rate for each dataset is highlighted by boldface.

List of Figures

2.1 Classification in the input space (left) vs. a feature space (right). A nonlinear classification in the input space is achieved by a linear classification in the feature space via the map $\phi: \mathbb{R}^2 \to \mathbb{R}^3$, $(x_1, x_2) \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, which maps the elliptical decision boundary to a hyperplane. This illustration was captured from the tutorial slides given by Schölkopf at the workshop "The Analysis of Patterns", Erice, Italy.
2.2 Example of classifying two-class data with a hyperplane $\langle w, x \rangle + b = 0$. In this case the data can be separated without error. This illustration was captured from the tutorial slides given by Schölkopf at the workshop "The Analysis of Patterns", Erice, Italy.
2.3 The most discriminant direction for two-class data. The projection in the direction of the largest variance (PCA direction) results in a large overlap of the two classes, whereas the projection in the Fisher direction yields the least overlapping, and therefore most discriminant, one-dimensional distributions. This illustration was captured from the paper of [58].
3.1 The first five principal components of a face, computed analytically from a 3D model (top) and a sphere (bottom). These images match well with the empirical principal components computed from a set of real images. The figure was captured from [61].
3.2 Yale Face Database: the first 10 (out of 38) subjects at all poses under a fixed illumination condition.
3.3 Yale Face Database: all illumination conditions of a person at a fixed pose used to compute the corresponding illumination subspace.
3.4 Yale Face Database: examples of basis images and (cumulative) singular values.
3.5 CMU-PIE Database: the first 10 (out of 68) subjects at all poses under a fixed illumination condition.
3.6 CMU-PIE Database: all illumination conditions of a person at a fixed pose used to compute the corresponding illumination subspace.
3.7 CMU-PIE Database: examples of basis images and (cumulative) singular values.
3.8 ETH-80 Database: all categories and objects at a fixed pose.
3.9 ETH-80 Database: all poses of an object from a category used to compute the corresponding pose subspace of the object.
3.10 ETH-80 Database: examples of basis images and (cumulative) singular values.
3.11 IXMAS Database: video sequences of an actor performing 11 different actions viewed from a fixed camera.
3.12 IXMAS Database: 3D occupancy volume of an actor at one time frame. The volume is initially computed in a Cartesian coordinate system and later represented in a cylindrical coordinate system to apply the FFT.
3.13 IXMAS Database: the kick action performed by 11 actors. Each sequence has a different kick style as well as a different body shape and height.
3.14 IXMAS Database: cylindrical coordinate representation of the volume $V(r, \theta, z)$, and the corresponding 1D FFT feature $\mathrm{abs}(FFT(V(r, \theta, z)))$, shown at a few values of $\theta$.
4.1 Principal angles and Grassmann distances. The distance between two subspaces $\mathrm{span}(Y_i)$ and $\mathrm{span}(Y_j)$ in $\mathbb{R}^D$ can be measured using the principal angles $\theta = [\theta_1, \ldots, \theta_m]$. In the Grassmann manifold viewpoint, the subspaces are considered as two points on the manifold $G(m, D)$, whose Riemannian distance is related to the principal angles by $d(Y_i, Y_j) = \|\theta\|_2$. Various distances can be defined based on the principal angles.
4.2 Yale Face Database: face recognition rates from the 1NN classifier with the Grassmann distances. The two highest rates including ties are highlighted with boldface for each subspace dimension $m$.
4.3 CMU-PIE Database: face recognition rates from the 1NN classifier with the Grassmann distances.
4.4 ETH-80 Database: object categorization rates from the 1NN classifier with the Grassmann distances.
4.5 IXMAS Database: action recognition rates from the 1NN classifier with the Grassmann distances. The two highest rates including ties are highlighted with boldface for each subspace dimension $m$.
5.1 Doubly kernel method. The first kernel implicitly maps the two nonlinear subspaces $\mathcal{X}_i$ and $\mathcal{X}_j$ to $\mathrm{span}(Y_i)$ and $\mathrm{span}(Y_j)$ via the map $\Phi: \mathcal{X} \to \mathcal{H}_1$, where the nonlinear subspaces are the preimages $\mathcal{X}_i = \Phi^{-1}(\mathrm{span}(Y_i))$ and $\mathcal{X}_j = \Phi^{-1}(\mathrm{span}(Y_j))$. The second (Grassmann) kernel maps the points $Y_i$ and $Y_j$ on the Grassmann manifold $G(m, D)$ to the corresponding points in $\mathcal{H}_2$ via the map $\Psi: G(m, D) \to \mathcal{H}_2$ such as (5.3) or (5.5).
5.2 A two-dimensional subspace is represented by a triangular patch swept by two basis vectors. The positive and negative classes are color-coded by blue and red respectively. A: the two class centers $Y_+$ and $Y_-$ around which other subspaces are randomly generated. B-D: examples of randomly selected subspaces for the easy, intermediate, and difficult datasets.
5.3 Yale Face Database: face recognition rates from various discriminant analysis methods. The two highest rates including ties are highlighted with boldface for each subspace dimension $m$.
5.4 CMU-PIE Database: face recognition rates from various discriminant analysis methods.
5.5 ETH-80 Database: object categorization rates from various discriminant analysis methods.
5.6 IXMAS Database: action recognition rates from various discriminant analysis methods.
6.1 Grassmann manifold as a Mixture of Factor Analyzers. The Grassmann manifold (left), the set of linear subspaces, can alternatively be modeled as the set of flat ($\sigma \to 0$) spheres ($Y_i^\top Y_i = I_m$) intersecting at the origin ($u_i = 0$). The right figure shows a general Mixture of Factor Analyzers which is not bound by these conditions.
6.2 The Mixture of Factor Analyzers model of the Grassmann manifold is the collection of linear homogeneous Factor Analyzers shown as flat spheres intersecting at the origin (A). This can be relaxed to allow nonzero offsets for each Factor Analyzer (B), and also to allow arbitrary eccentricity and scale for each Factor Analyzer, shown as flat ellipsoids (C).
6.3 The same affine span can be expressed with different offsets $u_1, u_2, \ldots$ However, one can use the unique standard offset $\hat{u}$, which has the shortest length from the origin.
6.4 Homogeneous vs. scaled subspaces. Two 2-dimensional Gaussians that span almost the same 2-dimensional space and have almost the same means are considered similar as two representations of linear subspaces (left). However, the probabilistic distance between two Gaussians also depends on scale and eccentricity: the distance can be quite large if the Gaussians are nonhomogeneous (right).
6.5 Yale Face Database: face recognition rates from various kernels. The two highest rates including ties are highlighted with boldface for each subspace dimension $m$.
6.6 CMU-PIE Database: face recognition rates from various kernels.
6.7 ETH-80 Database: object categorization rates from various kernels.
6.8 IXMAS Database: action recognition rates from various kernels.

Chapter 1

INTRODUCTION

1.1 Overview

In machine learning problems the data commonly lie in a vector space, especially in a Euclidean space. The Euclidean space is convenient for data representation, storage, and computation, and is geometrically intuitive to understand as well. There are, however, other kinds of non-Euclidean spaces more suitable for data outside the conventional Euclidean domain. The data domain I focus on in this thesis is one of those non-Euclidean spaces, where each data sample is a linear subspace of a Euclidean space. Researchers often encounter this non-conventional domain in computer vision problems. For example, a set of images of an object or a face with varying lighting conditions is known to lie on a low-dimensional (4- or 9-dimensional) subspace under mild assumptions. Many other types of variation, such as pose changes or facial expressions, can also be empirically approximated quite well by low-dimensional subspaces. If the data consist of multiple sets of images, they can consequently be modeled as a collection of low-dimensional subspaces. What are the potential advantages of having such structures? In the above example of

face images, we can model the illumination variation of the data, which is irrelevant to a recognition task, by subspaces, and focus on learning the appropriate variation between those subspaces, such as the variation due to subject identity. This idea applies not only to illumination-varying faces but also to many other types of data for which we can model out the undesired factors with subspaces. Furthermore, representing data as a collection of subspaces is much more economical than keeping all the data samples as unorganized points, since we only need to store and handle the basis vectors. I refer to this approach of handling data as the subspace-based learning approach.

Few researchers have clearly defined and fully utilized the properties of such a space in learning problems. Since a collection of subspaces is non-Euclidean, one cannot benefit from the conveniences of the Euclidean space anymore. For a learning algorithm to work with the subspace representation of the data, it requires a suitable framework which is also convenient for storage and computation of such data. This thesis provides the foundations of subspace-based learning problems using a novel framework and kernels. To show the reader the scope and the depth of this work, I raise the following questions regarding the subject:

Questions
1. What are the examples of the subspace structure in real data?
2. Which non-Euclidean domain suits the subspace-structured data?
3. What dissimilarity measures of subspaces are there, and what are their properties?
4. Can we define kernels for such a domain?
5. Are the kernels related to probabilistic distances?
6. Can we extend the framework to subspaces that are not exactly linear?

This thesis gives detailed and definitive answers to all of the questions above.

1.2 Contributions and related work

In this thesis I propose the Grassmann manifold framework for solving subspace-based problems. The Grassmann manifold is the set of fixed-dimensional linear subspaces and is an ideal model of the data under consideration. The Grassmann manifolds have previously been used in signal processing and control [74, 36, 6], numerical optimization [20] (and other references therein), and machine learning/computer vision [51, 50, 14, 33, 78]. In particular, there are many approaches that use the subspace concept for problem solving in computer vision [92, 64, 24, 43, 3]. However, these works do not explicitly or fully utilize the benefits of the Grassmann approach for subspace-based problems. In contrast, I make full use of the properties of the Grassmann manifold with a unifying framework that subsumes the previous approaches.

With the proposed framework, a dissimilarity between subspaces can be viewed as a distance function on the Grassmann manifold. I review several known distances including the Arc-length, Projection, Binet-Cauchy, Max Correlation, Min Correlation, and Procrustes distances [20, 16], and provide analytical and empirical comparisons. Furthermore, I propose the Projection kernel as a legitimate kernel function on the Grassmann manifold. The Projection kernel is also used in [85], where it serves mainly as a similarity measure of subspaces rather than as a full-fledged kernel function on the Grassmann manifold. Another kernel I use in the thesis is the Binet-Cauchy kernel [90, 83]. I show that, despite the attention the Binet-Cauchy kernel has received, it is less useful than the Projection kernel with noisy data.

Using the two kernels as the representative kernels on the Grassmann manifold, I demonstrate the advantages of using the Grassmann kernels over the Euclidean kernels by a

classification problem with Support Vector Machines on synthetic datasets. To demonstrate the potential benefits of the kernels further, I apply the kernels to a discriminant analysis on the Grassmann manifold and compare the approach with previously suggested algorithms for subspace-based discriminant analysis [92, 64, 24, 43]. In the previous methods, feature extraction is performed in the Euclidean space while non-Euclidean subspace distances are used in the objective. This inconsistency results in a difficult optimization and a weak guarantee of convergence, whereas the proposed approach with the Grassmann kernels is simpler and more effective, as evidenced by experiments with real image databases.

In this thesis I also investigate the relationship between probabilistic distances and the Grassmann kernels. If we assume the set of vectors are i.i.d. samples from an arbitrary probability distribution, then it is possible to compare two such distributions of vectors with probabilistic similarity measures, such as the KL distance [47], the Chernoff distance [15], or the Bhattacharyya/Hellinger distance [10]. Furthermore, the Bhattacharyya affinity is in fact a positive definite kernel function on the space of distributions and has nice closed-form expressions for the exponential family [40]. Probabilistic distances and kernels have been used for recognizing hand-written digits and faces [70, 46, 96]. I provide a link between the probabilistic and the Grassmann views by modeling the subspace data as a limit of the Mixture of Factor Analyzers [27] under the zero-mean and homogeneous conditions. The first result I show is that the KL distance reduces to the Projection kernel under the Factor Analyzer model, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on my analysis of the KL distance, I propose an extension of the Projection kernel, which is originally confined to the set of linear subspaces, to the set of affine as well as scaled subspaces. I demonstrate the potential benefits of the extended kernels with Support Vector Machines and Kernel Discriminant Analysis, using synthetic and real image databases. The experiments show the superiority of the extended kernels over the Bhattacharyya and the Binet-Cauchy

kernels, as well as over the Euclidean methods.

There is a related but independent problem of clustering unlabeled data points into multiple subspaces. Several approaches have been proposed in the literature. A traditional and inefficient technique is to use an EM algorithm [27] for a Mixture of Factor Analyzers (MFA), which models the data distribution as a superposition of Gaussian distributions. More recent work on clustering subspace data includes the K-subspaces method [37], which extends the K-means algorithm to the case of subspaces, and Generalized PCA [81], which represents subspaces with polynomials and solves algebraic equations to fit the data. These methods are different from the proposed method of this thesis in that they serve as a preprocessing step to generate subspace labels for the proposed subspace-based learning.

1.3 Organization of the paper

The rest of the paper is organized as follows:

Chapter 2 provides background materials for the thesis, including kernel theory, large margin classifiers, and discriminant analyses.

Chapter 3 discusses theoretical and empirical evidence of inherent subspace structures in image and video databases, and describes procedures for preprocessing the databases.

Chapter 4 introduces the Grassmann manifold as a common framework for subspace-based learning. Various distances on the Grassmann manifold are reviewed and analyzed in depth.

Chapter 5 defines the Grassmann kernels and proposes the application to discriminant analysis. Comparisons with previously used algorithms are given.

Chapter 6 examines the relationship between probabilistic distances and the Grassmann kernels. The chapter contains further discussions on the extension of the domain of subspace-based learning and presents the extended Grassmann kernels.

Chapter 7 summarizes the contributions of the thesis and discusses the future work related to the proposed methods.

The Bibliography contains all the referenced work in this thesis.

The main chapters of the thesis are also divided into two parts. Chapters 3 and 4 integrate known facts and set up the framework for the thesis. Chapters 5 and 6 provide the main proposals, analyses, and experimental results.

Chapter 2

BACKGROUND

2.1 Introduction

In this chapter I review three topics: 1) kernel machines, and their applications to 2) large margin classification and 3) discriminant analysis. The theory behind kernel machines is helpful and partially necessary for understanding the proposed kernels in the thesis. The large margin classification and discriminant analysis algorithms will be used to test the proposed kernels in Chapters 5 and 6. I provide a brief tutorial of the three topics based on well-known texts and papers such as [18, 69, 71]. Most of the proofs are omitted and can be found in the original texts.

2.2 Kernel machines

2.2.1 Motivation

Oftentimes it is neither very effective nor convenient to use the original data space to learn patterns of the data. For simplicity, let us assume the data $\mathcal{X}$ lie in a Euclidean space. When the patterns have a complex structure in the original data space, we can try to transform

the data space nonlinearly to another space so that the learning task becomes easier in the transformed space. The new space is called a feature space, and the map is called a feature map.

Suppose we are trying to classify two-dimensional, two-class data (Figure 2.1). If the true class boundary is an ellipse in the input space, a linear classifier cannot classify the data correctly. However, when the input space is mapped to the feature space by
$$\phi: \mathbb{R}^2 \to \mathbb{R}^3, \quad (x_1, x_2) \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2),$$
the decision boundary becomes a hyperplane in three-dimensional space, and therefore the two classes can be perfectly separated by a simple linear classifier.

Note that we mapped the data to the feature space of all (ordered) monomials of degree two, $(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and used a hyperplane in that space. We can use the same idea for a feature space of higher-degree monomials. However, if we map $\mathcal{X} \subset \mathbb{R}^D$ to the space of degree-$d$ monomials, the dimension of the feature space becomes $\binom{D+d-1}{d}$, which can be computationally infeasible even for moderate $D$ and $d$. This difficulty is easily circumvented by noting that we only need to compute inner products of points in the feature space to define a hyperplane. For the space of degree-2 monomials, the inner product can be computed from the original data by
$$\langle \phi(x), \phi(y) \rangle = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = \langle x, y \rangle^2,$$
which can be extended to degree-$d$ monomials by $\langle x, y \rangle^d$.

The inner product in the feature space, such as $k(x, y) = \langle x, y \rangle^d$, is called a kernel function. From a user's point of view, a kernel function is simply a nonlinear similarity measure of data that corresponds to a linear similarity measure in a feature space that the user need not know explicitly. A formal definition will follow shortly.

Figure 2.1: Classification in the input space (left) vs. a feature space (right). A nonlinear classification in the input space is achieved by a linear classification in the feature space via the map $\phi: \mathbb{R}^2 \to \mathbb{R}^3$, $(x_1, x_2) \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, which maps the elliptical decision boundary to a hyperplane. This illustration was captured from the tutorial slides given by Schölkopf at the workshop "The Analysis of Patterns", Erice, Italy.
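As a quick numerical illustration of the identity $\langle \phi(x), \phi(y) \rangle = \langle x, y \rangle^2$, the following minimal NumPy sketch (arbitrary sample points, not part of the derivation) compares the explicit degree-2 monomial map with the polynomial kernel:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 monomial feature map: (x1^2, sqrt(2) x1 x2, x2^2).
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, y):
    # Degree-2 homogeneous polynomial kernel <x, y>^2.
    return np.dot(x, y) ** 2

rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)

# The inner product in the feature space equals the kernel evaluated in the input space.
assert np.isclose(np.dot(phi(x), phi(y)), poly2_kernel(x, y))
```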

2.2.2 Mercer kernels

In this subsection I introduce Mercer's theorem, which characterizes the condition under which a kernel function $k$ induces a feature map and a feature space. Let $\mathcal{X}$ denote the data space.

The case of a finite $\mathcal{X}$

Definition 2.1 (Symmetric Positive Definite Matrix). A real $N \times N$ symmetric matrix $K$ is positive definite if
$$\sum_{i,j} c_i c_j K_{ij} \geq 0, \quad \text{for all } c_1, \ldots, c_N \ (c_i \in \mathbb{R}).$$

Consider a finite input space $\mathcal{X} = \{x_1, \ldots, x_N\}$ and a symmetric real-valued function $k(x, y)$. Let $K$ be the $N \times N$ matrix of the function $K_{ij} = k(x_i, x_j)$ evaluated on $\mathcal{X} \times \mathcal{X}$. Since $K$ is symmetric it can be diagonalized as $K = V \Lambda V^\top$, where $\Lambda$ is a diagonal matrix of eigenvalues $\lambda_1 \geq \ldots \geq \lambda_N$ and $V$ is an orthonormal matrix whose columns are the corresponding eigenvectors. Let $v_i$ denote the $i$-th row of $V$. If the matrix $K$ is positive definite, and therefore the eigenvalues are non-negative, then we can define the following feature map
$$\phi: \mathcal{X} \to \mathcal{H} = \mathbb{R}^N, \quad x_i \mapsto v_i \Lambda^{1/2}, \quad i = 1, \ldots, N,$$
where $\Lambda^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_N})$. We now observe that the inner product in the feature space $\mathcal{H}$ coincides with the kernel matrix of the data:
$$\langle \phi(x_i), \phi(x_j) \rangle = v_i \Lambda v_j^\top = (V \Lambda V^\top)_{ij} = K_{ij}.$$
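The finite-case construction is easy to reproduce numerically. The sketch below (an illustration only; it uses a Gaussian RBF kernel on arbitrary points) diagonalizes a kernel matrix and checks that the inner products of the resulting $N$-dimensional features reproduce $K$:

```python
import numpy as np

# A finite input space: N points in R^2, with a Gaussian RBF kernel as an example.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)

# Diagonalize K = V Lambda V^T and form the feature map x_i -> v_i Lambda^{1/2}.
lam, V = np.linalg.eigh(K)                  # eigenvalues are nonnegative since K is PSD
Phi = V * np.sqrt(np.clip(lam, 0.0, None))  # the i-th row is the feature vector of x_i

# The pairwise inner products of the features coincide with the kernel matrix.
assert np.allclose(Phi @ Phi.T, K)
```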

The case of a compact $\mathcal{X}$

Let us apply the intuition gained from the finite case to an infinite-dimensional case. Although further generalization to a finite measure space $(\mathcal{X}, \mu)$ is possible, we will deal with compact subsets of $\mathbb{R}^D$ as the domain.

Theorem 2.2 (Mercer). Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^D$. Suppose $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a continuous symmetric function such that the integral operator
$$T_k: L_2(\mathcal{X}) \to L_2(\mathcal{X}), \quad (T_k f)(x) = \int_{\mathcal{X}} k(x, y) f(y)\, dy$$
has the property
$$\int_{\mathcal{X}^2} k(x, y) f(x) f(y)\, dx\, dy \geq 0, \quad \text{for all } f \in L_2(\mathcal{X}).$$
Then we have a uniformly convergent series
$$k(x, y) = \sum_{i=1}^{\infty} \lambda_i \psi_i(x) \psi_i(y)$$
in terms of the normalized eigenfunctions $\psi_i \in L_2(\mathcal{X})$ of $T_k$ ("normalized" means $\|\psi_i\|_{L_2} = 1$).

The condition on $T_k$ is an extension of the positive definite condition for matrices. Let us define a sequence of feature maps from the operator eigenfunctions $\psi_i$:
$$\phi_d: \mathcal{X} \to \mathcal{H} = \ell_2^d, \quad x \mapsto \left(\sqrt{\lambda_1}\,\psi_1(x), \ldots, \sqrt{\lambda_d}\,\psi_d(x)\right), \quad d = 1, 2, \ldots$$
Theorem 2.2 tells us that the sequence of maps $\phi_1, \phi_2, \ldots$ converges to a map $\phi: \mathcal{X} \to \mathcal{H}$ such that $\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} = k(x, y)$. The theorem below is the formalization of this observation:

Theorem 2.3 (Mercer Kernel Map). If $\mathcal{X}$ is a compact subset of $\mathbb{R}^D$ and $k$ is a function satisfying the conditions of Theorem 2.2, then there is a feature map $\phi: \mathcal{X} \to \mathcal{H}$ into a feature space $\mathcal{H}$ where $k$ becomes an inner product
$$\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} = k(x, y), \quad \text{for almost all } x, y \in \mathcal{X}.$$
Moreover, given any $\epsilon > 0$, there exists a map $\phi_n$ into an $n$-dimensional Hilbert space such that
$$|k(x, y) - \langle \phi_n(x), \phi_n(y) \rangle| < \epsilon$$
for almost all $x, y \in \mathcal{X}$.

Mercer's theorem gives us one construction of a feature space. In the next subsection we will look at a more general construction via the Reproducing Kernel Hilbert Space.

2.2.3 Reproducing Kernel Hilbert Space

Extending the notion of positive definiteness of matrices and compact operators, we can define the positive definiteness of a function on an arbitrary set $\mathcal{X}$ as follows:

Definition 2.4 (Positive Definite Kernel). Let $\mathcal{X}$ be any set, and $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric real-valued function, $k(x_i, x_j) = k(x_j, x_i)$ for all $x_i, x_j \in \mathcal{X}$. Then $k$ is a positive definite kernel function if
$$\sum_{i,j} c_i c_j k(x_i, x_j) \geq 0$$
for all $x_1, \ldots, x_n$ ($x_i \in \mathcal{X}$) and $c_1, \ldots, c_n$ ($c_i \in \mathbb{R}$), for any $n \in \mathbb{N}$.

In fact, the necessary and sufficient condition for a kernel to have an associated feature

space and feature map is that the kernel be positive definite. Below are the three steps in [69] to construct the feature map $\phi$ and the feature space $\mathcal{H}$ from a given positive definite kernel $k$:

1. Define a vector space with $k$.
2. Endow it with an inner product with a reproducing property.
3. Complete the space to a Hilbert space.

First, we define $\mathcal{H}$ as the set of all linear combinations of the functions of the form
$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i),$$
for arbitrary $m \in \mathbb{N}$, $\alpha_1, \ldots, \alpha_m$ ($\alpha_i \in \mathbb{R}$), and $x_1, \ldots, x_m$ ($x_i \in \mathcal{X}$). It is not difficult to check that $\mathcal{H}$ is a vector space. Let $g(\cdot) = \sum_{j=1}^{n} \beta_j k(\cdot, y_j)$ be another function in the vector space, for some $n \in \mathbb{N}$, $\beta_1, \ldots, \beta_n$ ($\beta_j \in \mathbb{R}$), and $y_1, \ldots, y_n$ ($y_j \in \mathcal{X}$). Next, we define the following inner product between $f$ and $g$:
$$\langle f, g \rangle = \sum_{i,j}^{m,n} \alpha_i \beta_j k(x_i, y_j).$$
It is possible that the coefficients $\{\alpha_i\}$ and $\{\beta_j\}$ are not unique. That is, a function $f$ (or $g$) may be represented in multiple ways with different coefficients. To see that the inner product is still well-defined, note that
$$\langle f, g \rangle = \sum_{i,j}^{m,n} \alpha_i \beta_j k(x_i, y_j) = \sum_{j} \beta_j f(y_j)$$
by definition. This shows that $\langle \cdot, \cdot \rangle$ does not depend on the particular expansion coefficients $\{\alpha_i\}$. Similarly, $\langle f, g \rangle = \sum_i \alpha_i g(x_i)$ shows that the inner product does not depend on

$\{\beta_j\}$ either. The positivity $\langle f, f \rangle = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \geq 0$ follows from the positive definiteness of $k$. The other axioms are easily checked.

One notable property of the defined kernel is as follows. By choosing $g(\cdot) = k(\cdot, y)$ we have $\langle f, k(\cdot, y) \rangle = f(y)$ by definition. Furthermore, with $f(\cdot) = k(\cdot, x)$ we have $\langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y)$, which is called the reproducing property. Finally, the space can be completed to a Hilbert space, which is called the Reproducing Kernel Hilbert Space. Below is the formal definition of the space:

Definition 2.5 (Reproducing Kernel Hilbert Space). Let $\mathcal{X}$ be a nonempty set and $\mathcal{H}$ a Hilbert space of functions $f: \mathcal{X} \to \mathbb{R}$. Then $\mathcal{H}$ is called a Reproducing Kernel Hilbert Space (RKHS) endowed with the inner product $\langle \cdot, \cdot \rangle$ if there exists a function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the following properties:
1. $\langle f, k(x, \cdot) \rangle = f(x)$ for all $f \in \mathcal{H}$; in particular, $\langle k(x, \cdot), k(y, \cdot) \rangle = k(x, y)$.
2. $k$ spans $\mathcal{H}$, that is, $\mathcal{H} = \overline{\mathrm{span}\{k(x, \cdot) \mid x \in \mathcal{X}\}}$, where the bar denotes the completion of the set.

We have seen that an RKHS can be constructed from a positive definite kernel in three steps. The converse is also true: if an RKHS $\mathcal{H}$ is given, then a unique positive definite kernel can be defined as the inner product of the space $\mathcal{H}$. Finally, we show that Mercer kernels are positive definite in the generalized sense.

Theorem 2.6 (Equivalence of Positive Definiteness). Let $\mathcal{X} = [a, b]$ be a compact interval and let $k: [a, b] \times [a, b] \to \mathbb{C}$ be continuous. Then $k$ is a positive definite kernel if and only if
$$\int_{[a,b]} \int_{[a,b]} k(x, y) f(x) \overline{f(y)}\, dx\, dy \geq 0$$

for any continuous function $f: \mathcal{X} \to \mathbb{C}$.

In this regard, every Mercer kernel $k$ has an RKHS as a feature space for which $k$ is the reproducing kernel.

2.2.4 Examples of kernels

There is an ever-expanding number of kernels for various types of data and applications, and we can only glimpse a portion of those. Below is a list of the most often-used kernels for Euclidean data. Let $x, y \in \mathbb{R}^D$.

Homogeneous polynomial kernel: $k(x, y) = \langle x, y \rangle^d$.

Nonhomogeneous polynomial kernel: $k(x, y) = (\langle x, y \rangle + c)^d$, $c \geq 0$.

Gaussian RBF kernel: $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$, $\sigma > 0$. The Gaussian RBF kernel has the following characteristics: 1) the points in the feature space lie on a sphere, since $\|\phi(x)\|^2 = 1$; 2) the angle between two points $x, y$ is at most $\pi/2$; and 3) the feature space is infinite-dimensional.

These kernels were the first to be used with large margin classifiers [11]. They can be evaluated in closed form without having to construct the feature spaces explicitly. Further work has discovered other types of kernels that can be evaluated efficiently by a recursion. These include the following two kernels:

All-subsets kernel [75]: Let $I = \{1, 2, \ldots, D\}$ be the indices of the variables $x_i$, $i \in I$. For every subset $A$ of $I$, let us define $\phi_A(x) = \prod_{i \in A} x_i$. For $A = \emptyset$ we define

$\phi_{\emptyset}(x) = 1$. If $\phi(x)$ is the sequence $(\phi_A(x))_{A \subseteq I}$, then the all-subsets kernel is
$$k(x, y) = \langle \phi(x), \phi(y) \rangle = \sum_{A \subseteq I} \phi_A(x) \phi_A(y) = \sum_{A \subseteq I} \prod_{i \in A} x_i y_i = \prod_{i=1}^{D} (1 + x_i y_i).$$

ANOVA kernel [87]: it is defined similarly to the all-subsets kernel. Define $\phi(x)$ as the sequence $(\phi_A(x))_{A \subseteq I, |A| = d}$, where we restrict $A$ to the subsets of cardinality $d$. Then the kernel is
$$k(x, y) = \langle \phi(x), \phi(y) \rangle = \sum_{A \subseteq I, |A| = d} \phi_A(x) \phi_A(y) = \sum_{1 \leq i_1 < \ldots < i_d \leq D} (x_{i_1} y_{i_1}) \cdots (x_{i_d} y_{i_d}).$$

Kernels defined for non-Euclidean spaces are especially interesting and useful. These include kernels for graphs [45], bags-of-words [41], strings [88, 53, 49, 82], probability distributions [40, 46], and dynamical systems [84]. This thesis contributes to the field of kernel machines by introducing kernels for subspaces, which have not received full attention in the community so far. The subspace kernels, which I will call the Grassmann kernels later, are related in particular to the kernels for dynamical systems and probability distributions. These relationships will be studied in detail in Chapters 5 and 6.
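For concreteness, the kernels listed above can each be evaluated in a few lines. The sketch below (NumPy, arbitrary parameter values) also performs the simple sanity check that their Gram matrices on a random sample have no significantly negative eigenvalues, as positive definiteness requires:

```python
import numpy as np

def homogeneous_poly(x, y, d=3):
    return np.dot(x, y) ** d

def gaussian_rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def all_subsets(x, y):
    # k(x, y) = prod_i (1 + x_i y_i), the closed form of sum_A phi_A(x) phi_A(y).
    return np.prod(1.0 + x * y)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
for k in (homogeneous_poly, gaussian_rbf, all_subsets):
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    assert np.linalg.eigvalsh(K).min() > -1e-8   # positive semidefinite up to round-off
```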

2.2.5 Generating new kernels from old kernels

When we have a few kernels at hand, we can generate new kernels from them by the following theorem:

Theorem 2.7. If $k_1(x, y)$ and $k_2(x, y)$ are positive definite kernels, then the following kernels are also positive definite:
1. Conic combination: $\alpha_1 k_1(x, y) + \alpha_2 k_2(x, y)$, ($\alpha_1, \alpha_2 > 0$)
2. Pointwise product: $k_1(x, y) k_2(x, y)$
3. Integration: $\int k(x, z) k(y, z)\, dz$
4. Product with a rank-1 kernel: $k(x, y) f(x) f(y)$
5. Limit: if $k_1(x, y), k_2(x, y), \ldots$ are positive definite kernels, then so is $\lim_{i \to \infty} k_i(x, y)$ (when the limit exists).

Proofs can be found in [69, 71].

Corollary 2.8. If $k$ is a positive definite kernel, then so are $f(k(x, y))$ and $\exp(k(x, y))$, where $f: \mathbb{R} \to \mathbb{R}$ is any polynomial function with nonnegative coefficients.

2.2.6 Distance and conditionally positive definite kernels

In this subsection I review the relationship between distances and conditionally positive definite kernels.

Distance and metric

Throughout the thesis I will use the term distance interchangeably with similarity measure, to denote an intuitive notion of closeness between two patterns in the data. Therefore a distance $d(\cdot, \cdot)$ is any assignment of nonnegative values to a pair of points in a set $\mathcal{X}$. A metric is, however, a distance that satisfies the additional axioms:

Definition 2.9 (Metric). A real-valued function $d: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a metric if
1. $d(x_1, x_2) \geq 0$,

2. $d(x_1, x_2) = 0$ if and only if $x_1 = x_2$,
3. $d(x_1, x_2) = d(x_2, x_1)$,
4. $d(x_1, x_2) + d(x_2, x_3) \geq d(x_1, x_3)$,
for all $x_1, x_2, x_3 \in \mathcal{X}$.

Relationship between metric and kernel

The standard metric $d(\phi(x_1), \phi(x_2))$ in the feature space is the norm $\|\phi(x_1) - \phi(x_2)\|$ induced from the inner product. The metric can be written in terms of the kernel as
$$d^2(\phi(x_1), \phi(x_2)) = k(x_1, x_1) + k(x_2, x_2) - 2 k(x_1, x_2). \tag{2.1}$$
Therefore any RKHS is also a metric space $(\mathcal{H}, d)$ with the metric given above. Conversely, if a metric is given that is known to be induced from an inner product, then we can recover the inner product from the polarization of the metric:
$$k(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle = \frac{1}{2}\left( -\|\phi(x_1) - \phi(x_2)\|^2 + \|\phi(x_1)\|^2 + \|\phi(x_2)\|^2 \right).$$
This raises the following question: if we are given a set and a metric $(\mathcal{X}, d)$, can we determine if $d$ is induced from a positive definite kernel? To answer the question we need the following definition.

Definition 2.10 (Conditionally Positive Definite Kernel). Let $\mathcal{X}$ be any set, and $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a symmetric real-valued function, $k(x_i, x_j) = k(x_j, x_i)$ for all $x_i, x_j \in \mathcal{X}$. Then $k$ is a conditionally positive definite kernel function if
$$\sum_{i,j} c_i c_j k(x_i, x_j) \geq 0$$

for all $x_1, \ldots, x_n$ ($x_i \in \mathcal{X}$) and $c_1, \ldots, c_n$ ($c_i \in \mathbb{R}$) such that $\sum_{i=1}^{n} c_i = 0$, for any $n \in \mathbb{N}$.

The question above is answered by the following theorem [67]:

Theorem 2.11 (Schoenberg). A metric space $(\mathcal{X}, d)$ can be embedded isometrically into a Hilbert space if and only if $-d^2(\cdot, \cdot)$ is conditionally positive definite.

As a corollary, we have

Corollary 2.12 ([35]). A metric $d$ is induced from a positive definite kernel if and only if
$$k(x_1, x_2) = -d^2(x_1, x_2)/2, \quad x_1, x_2 \in \mathcal{X} \tag{2.2}$$
is conditionally positive definite.

It is known that one can use conditionally positive definite kernels just as positive definite kernels in learning problems that are invariant to the choice of origin [68].
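Corollary 2.12 can be illustrated numerically for the Euclidean metric: restricting $-d^2/2$ to coefficient vectors that sum to zero amounts to double-centering the matrix of squared distances, which recovers a Gram matrix (the classical multidimensional scaling identity). A minimal sketch, assuming nothing beyond NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 3))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # squared Euclidean distances

# Conditional positive definiteness of -D2/2: project onto {c : sum(c) = 0} using the
# centering matrix H = I - (1/N) 1 1^T and check that H (-D2/2) H is PSD.
N = len(X)
H = np.eye(N) - np.ones((N, N)) / N
M = H @ (-D2 / 2) @ H
assert np.linalg.eigvalsh(M).min() > -1e-10

# Indeed, H (-D2/2) H equals the Gram matrix of the centered points.
Xc = X - X.mean(axis=0)
assert np.allclose(M, Xc @ Xc.T)
```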

2.3 Support Vector Machines

A Support Vector Machine (SVM) is a supervised learning method used for classification. Due to its computational efficiency and theoretically well-understood generalization performance, the SVM has received a lot of attention in the last decade and is still one of the main topics in machine learning research. In this section I review the basics of the SVM. I use the notation $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ to denote $N$ pairs of a training sample $x_i \in \mathbb{R}^D$ and its class label $y_i \in \{-1, 1\}$, $i = 1, 2, \ldots, N$.

2.3.1 Large margin linear classifier

Consider the problem of separating the two-class training data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ with a hyperplane $\mathcal{P}: \langle w, x \rangle + b = 0$. Let us assume the data are linearly separable, that is, we can separate the data with a hyperplane without error (refer to Figure 2.2). Since the equation $c\langle w, x \rangle + cb = 0$ represents the same hyperplane for any nonzero $c \in \mathbb{R}$, we choose a canonical representation of the hyperplane by setting $\min_i |\langle w, x_i \rangle + b| = 1$. The linear separability can then be expressed as
$$y_i (\langle w, x_i \rangle + b) \geq 1, \quad i = 1, \ldots, N, \tag{2.3}$$
and the distance of a point $x$ to the hyperplane $\mathcal{P}$ is given by
$$d(x, \mathcal{P}) = \frac{|\langle w, x \rangle + b|}{\|w\|}.$$
We define the margin of the hyperplane as twice the minimum distance between the training samples and the hyperplane,
$$\rho = 2 \min_i d(x_i, \mathcal{P}),$$
which, under the canonical representation, can be shown to be equal to $\rho = \frac{2}{\|w\|}$. If the data are linearly separable, there are typically an infinite number of hyperplanes that separate the classes correctly. However, the main idea of the SVM is to choose the one that has the maximum margin. Therefore the maximum margin classifier is the solution to the optimization problem:
$$\min_{w, b} \frac{1}{2} \|w\|^2, \quad \text{subject to } y_i (\langle w, x_i \rangle + b) \geq 1, \ i = 1, \ldots, N. \tag{2.4}$$

Figure 2.2: Example of classifying two-class data with a hyperplane $\langle w, x \rangle + b = 0$. In this case the data can be separated without error. This illustration was captured from the tutorial slides given by Schölkopf at the workshop "The Analysis of Patterns", Erice, Italy.

2.3.2 Dual problem and support vectors

The primal problem (2.4) is a convex optimization problem with linear constraints. From Lagrangian duality, solving the primal problem is equivalent to solving the dual problem:
$$\min_{\alpha} \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_i \alpha_i, \quad \text{subject to } \alpha_i \geq 0, \ i = 1, \ldots, N, \ \text{and } \sum_i \alpha_i y_i = 0. \tag{2.5}$$
The advantages of the dual formulation are two-fold: 1) the dual problem is often easier to solve than the primal problem, and 2) it provides a geometrically meaningful interpretation of the solution. If $\alpha$ is the optimal solution of (2.5), then the optimal value of the primal variables is

given by $w = \sum_i \alpha_i y_i x_i$ and $b = -\frac{1}{2} \langle w, x_+ + x_- \rangle$, where $x_+$ and $x_-$ are positive- and negative-class samples such that $\langle w, x_+ \rangle + b = 1$ and $\langle w, x_- \rangle + b = -1$ respectively. The resultant classifier for test data is then
$$f(x) = \mathrm{sgn}(\langle w, x \rangle + b) = \mathrm{sgn}\left(\sum_i \alpha_i y_i \langle x_i, x \rangle + b\right),$$
where
$$\mathrm{sgn}(z) = \begin{cases} -1, & z < 0 \\ 0, & z = 0 \\ 1, & z > 0. \end{cases}$$
The Kuhn-Tucker condition of the optimization problem requires
$$\alpha_i \left[ y_i (\langle w, x_i \rangle + b) - 1 \right] = 0, \quad i = 1, \ldots, N.$$
This implies that only the points $x_i$ that satisfy $y_i (\langle w, x_i \rangle + b) = 1$ will have a nonzero dual variable $\alpha_i$. These points are called support vectors, since they are the only points needed to define the decision function in the linearly separable case.

2.3.3 Extensions

Non-separable case: soft-margin SVM

Suppose the data are not linearly separable and the constraints (2.3) need to be relaxed to
$$y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, N,$$

for the problem to be feasible. A soft-margin SVM is defined by the optimization
$$\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i, \quad \text{subject to } y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i, \ \xi_i \geq 0, \ i = 1, \ldots, N, \tag{2.6}$$
where $C$ is a fixed parameter that determines the weight between the margin and the classification error in the cost. The primal problem (2.6) also has an equivalent dual problem:
$$\min_{\alpha} \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_i \alpha_i, \quad \text{subject to } 0 \leq \alpha_i \leq C, \ i = 1, \ldots, N, \ \text{and } \sum_i \alpha_i y_i = 0. \tag{2.7}$$
The regularization parameter $C$ should reflect the prior knowledge of the amount of noise in the data.

Nonlinear separation: kernel SVM

We obtain a nonlinear version of the SVM by mapping the space $\mathcal{X}$ to an RKHS via a kernel function $k$. The kernel SVM is implemented simply by replacing the Euclidean inner product $\langle x_i, x_j \rangle$ with a given kernel function $k(x_i, x_j)$. After the replacement the soft-margin SVM problem (2.7) becomes
$$\min_{\alpha} \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) - \sum_i \alpha_i, \quad \text{subject to } 0 \leq \alpha_i \leq C, \ i = 1, \ldots, N, \ \text{and } \sum_i \alpha_i y_i = 0, \tag{2.8}$$
and the resultant decision function for test data is given by the kernel function:
$$f(x) = \mathrm{sgn}\left(\sum_i \alpha_i y_i k(x_i, x) + b\right).$$
Since $K_{ij} = k(x_i, x_j)$ is a fixed matrix, the optimization in the training phase is no more difficult than solving the linear SVM. The resultant decision function can classify highly nonlinear, complicated data distributions at the same cost as training the simple linear classifier.
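In practice, any kernel, including the Grassmann kernels proposed later in this thesis, can be plugged into an off-the-shelf SVM solver through its Gram matrix. The sketch below is one possible way to do this (it assumes scikit-learn is available and uses a Gaussian RBF kernel purely as a placeholder):

```python
import numpy as np
from sklearn.svm import SVC

def rbf_gram(A, B, sigma=1.0):
    # Gram matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 2))
y_train = np.sign(X_train[:, 0] ** 2 + X_train[:, 1] ** 2 - 1.0)   # nonlinear boundary
X_test = rng.standard_normal((20, 2))

# Solve the dual problem (2.8) for a fixed C; only Gram matrices are needed.
clf = SVC(kernel="precomputed", C=1.0)
clf.fit(rbf_gram(X_train, X_train), y_train)
y_pred = clf.predict(rbf_gram(X_test, X_train))   # rows: test points, columns: training points
```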

2.3.4 Generalization error and overfitting

The success of SVM algorithms in practice can be ascribed to their ability to bound generalization errors. I will not go into this vast topic but would like to point out the following fact: the maximization of the margin corresponds to the minimization of the capacity (or the complexity) of the hyperplane, which helps to avoid overfitting.

2.4 Discriminant Analysis

A discriminant analysis technique is a method to find a low-dimensional subspace of the input space which preserves the discriminant features of multiclass data. Figure 2.3 illustrates the idea for a two-class toy problem. I introduce two techniques: Fisher Discriminant Analysis (FDA) (or Linear Discriminant Analysis) [25] and Nonparametric Discriminant Analysis (NDA) [12]. Originally these algorithms were developed and used for low-dimensional Euclidean data. I will discuss the challenges and solutions when the techniques are applied to high-dimensional data, and describe their extensions to nonlinear discrimination problems with kernels.

Both FDA and NDA are discriminant analysis techniques which find a subspace that maximizes the ratio of the between-class scatter $S_b$ and the within-class scatter $S_w$ after the data are projected onto the subspace. The objective function for the one-dimensional case is the Rayleigh quotient
$$J(w) = \frac{w^\top S_b w}{w^\top S_w w}, \quad w \in \mathbb{R}^D,$$

where $D$ is the dimension of the data space. For multiclass data there are several options for the objective function [25]. The most widely used objective is the multiclass Rayleigh quotient
$$J(W) = \mathrm{tr}\left[ (W^\top S_w W)^{-1} W^\top S_b W \right], \tag{2.9}$$
where $W$ is a $D \times d$ matrix and $d < D$ is the low-dimensional feature dimension. The quotient measures the class separability in the subspace $\mathrm{span}(W)$ similarly to the one-dimensional case.

Figure 2.3: The most discriminant direction for two-class data. Suppose we have two classes of Gaussian-distributed data, and we want to project the data onto one-dimensional directions denoted by the arrows. The projection in the direction of the largest variance (PCA direction) results in a large overlap of the two classes, which is undesirable for classification, whereas the projection in the Fisher direction yields the least overlapping, and therefore most discriminant, one-dimensional distributions. This illustration was captured from the paper of [58].

2.4.1 Fisher Discriminant Analysis

Let $\{x_1, \ldots, x_N\}$ be the data vectors and $\{y_1, \ldots, y_N\}$ be the class labels, $y_i \in \{1, \ldots, C\}$. Without loss of generality we assume the data are ordered according to the class labels: $1 = y_1 \leq y_2 \leq \ldots \leq y_N = C$. Each class $c$ has $N_c$ samples. Let $\mu_c = \frac{1}{N_c} \sum_{\{i \mid y_i = c\}} x_i$ be the mean of class $c$, and $\mu = \frac{1}{N} \sum_i x_i$ be the global mean. The between-scatter and within-scatter matrices of FDA are defined as follows:
$$S_b = \frac{1}{N} \sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)^\top$$
$$S_w = \frac{1}{N} \sum_{c=1}^{C} \sum_{\{i \mid y_i = c\}} (x_i - \mu_c)(x_i - \mu_c)^\top.$$
When $S_w$ is nonsingular, which is typically the case for low-dimensional data ($D < N$), the optimal $W$ is found from the largest eigenvectors of $S_w^{-1} S_b$. Since $S_w^{-1} S_b$ has rank $C - 1$, there are $C - 1$ sequential optima $W = \{w_1, \ldots, w_{C-1}\}$. By projecting data onto $\mathrm{span}(W)$, we achieve dimensionality reduction and feature extraction of the data onto the most discriminant directions. To classify points with the simple k-NN classifier, one can use the distance of the data projected onto $\mathrm{span}(W)$, or use the Mahalanobis distance to the projected mean of each class:
$$\arg\min_j d_j(x) = [W^\top (x - \mu_j)]^\top (W^\top S_w W)^{-1} [W^\top (x - \mu_j)]. \tag{2.10}$$
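The FDA computation described above amounts to a small eigenvalue problem. The following sketch (NumPy only, with the regularization of (2.11) added for numerical safety; the function and variable names are arbitrary) returns the discriminant directions:

```python
import numpy as np

def fda(X, y, d, sigma2=1e-6):
    # X is N x D, y holds labels 1..C; returns a D x d basis of discriminant directions.
    N, D = X.shape
    mu = X.mean(axis=0)
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu) / N
        Sw += (Xc - mu_c).T @ (Xc - mu_c) / N
    # Leading eigenvectors of Sw^{-1} Sb, with Sw regularized as in (2.11).
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + sigma2 * np.eye(D), Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:d]].real

# Example: three 5-dimensional Gaussian classes projected onto d = 2 directions.
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 5)) + 3.0 * rng.standard_normal(5) for _ in range(3)])
y = np.repeat([1, 2, 3], 50)
Z = X @ fda(X, y, d=2)
```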

2.4.2 Nonparametric Discriminant Analysis

The FDA is motivated by the simple scenario in which the class-conditional distribution $p(x \mid c)$ is Gaussian or at least has a peak around its mean $\mu_c$. However, this assumption is easily violated, for example, by a distribution that has multiple peaks. The NDA tries to relax the parametric Gaussian assumption a little. The between-scatter and within-scatter matrices of NDA are defined as
$$S_b = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \in B_i} (x_i - x_j)(x_i - x_j)^\top$$
$$S_w = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \in W_i} (x_i - x_j)(x_i - x_j)^\top,$$
where $B_i$ is the set of indices of the $K$ nearest neighbors of $x_i$ which belong to classes different from that of $x_i$, and $W_i$ is the set of indices of the $K$ nearest neighbors of $x_i$ which belong to the same class as $x_i$. While FDA uses the global class mean $\mu_c$ as a representative of each class, NDA uses the local class mean around the point of interest. This results in a tolerance to the non-Gaussianity or multimodality of the classes. When the number of nearest neighbors $K$ increases, NDA behaves similarly to FDA. For classification tasks one can also use the simple k-NN rule restricted to $\mathrm{span}(W)$ or the Mahalanobis distance similarly to FDA, although k-NN is more consistent with the nonparametric assumption of NDA.

2.4.3 Discriminant analysis in high-dimensional spaces

In the previous explanation of FDA, we assumed $S_w$ is nonsingular. However, this is not the case for high-dimensional data. Note that the ranks of $S_w$ and $S_b$ of FDA can be at most $N - C$ and $C - 1$ respectively. The maxima are achieved when the data are not collinear, which is very likely for high-dimensional data [93]. Because the number of features $d$ cannot exceed the rank of $S_b$, FDA can extract at most $C - 1$ features. For NDA, when $K > 1$, the rank of $S_w$ can be up to $N - C$ and the rank of $S_b$ can be up to $N - 1$. However, for small $K$ the ranks of $S_w$ and $S_b$ will also be small. The number of features NDA can extract is also limited by the rank of $S_b$.

Because $S_w$ spans at most $N - C$ dimensions, there is always a nullspace of $S_w$ inside the span of the data, which is at least $N - 1$ dimensional. Without regularization, both FDA and NDA always yield the nullspace of $S_w$ in the span of the data as the maximizer of the Rayleigh quotients. This is not preferable because even a small change in the data can make a big change in the solution. One solution suggested in [7] is to use Principal Component Analysis to first reduce the dimensionality of the data by projecting them onto a subspace spanned by the $N - C$ largest eigenvectors. In this subspace $S_w$ is likely to be well-conditioned. Another solution is to regularize the ill-conditioned matrix $S_w$ by adding an isotropic noise matrix
$$S_w \leftarrow S_w + \sigma^2 I, \tag{2.11}$$
where $\sigma$ determines the amount of regularization. I use the regularization approach in this thesis.

2.4.4 Extension to nonlinear discriminant analysis

From the discussion of kernel machines in the previous section, we know that a linear algorithm can be extended to a nonlinear algorithm by using kernels. Kernel FDA, also known as Generalized Discriminant Analysis, has been fully studied by [5, 56, 57]. To summarize, Kernel FDA can be formulated as follows. Let $\phi: \mathcal{X} \to \mathcal{H}$ be the feature map, and $\Phi = [\phi_1 \ldots \phi_N]$ be the feature matrix of the training points. Assuming the FDA direction $w$ in the feature space is a linear combination of the feature vectors, $w = \Phi\alpha$, we can rewrite the Rayleigh quotient in terms of $\alpha$ as
$$J(\alpha) = \frac{\alpha^\top \Phi^\top S_B \Phi \alpha}{\alpha^\top \Phi^\top S_W \Phi \alpha} = \frac{\alpha^\top K \left(V - \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^\top\right) K \alpha}{\alpha^\top \left(K (I_N - V) K + \sigma^2 I_N\right) \alpha}, \tag{2.12}$$
where $K$ is the kernel matrix, $\mathbf{1}_N$ is the uniform vector $[1 \ldots 1]^\top$ of length $N$, and $V$ is a

block-diagonal matrix whose $c$-th block is the uniform matrix $\frac{1}{N_c} \mathbf{1}_{N_c} \mathbf{1}_{N_c}^\top$:
$$V = \begin{bmatrix} \frac{1}{N_1} \mathbf{1}_{N_1} \mathbf{1}_{N_1}^\top & & \\ & \ddots & \\ & & \frac{1}{N_C} \mathbf{1}_{N_C} \mathbf{1}_{N_C}^\top \end{bmatrix}.$$
The term $\sigma^2 I_N$ is a regularizer for making the computation stable. Similarly to FDA, the set of optimal $\alpha$'s is computed from the eigenvectors of $K_w^{-1} K_b$, where $K_b$ and $K_w$ are the quadratic matrices in the numerator and the denominator of (2.12):
$$K_b = K \left( V - \tfrac{1}{N} \mathbf{1}_N \mathbf{1}_N^\top \right) K, \qquad K_w = K (I_N - V) K + \sigma^2 I_N.$$
The NDA can be similarly kernelized under the assumption $w = \Phi\alpha$ and is omitted here.
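The Kernel FDA of (2.12) reduces to a generalized eigenvalue problem in the coefficients $\alpha$. A minimal sketch, assuming only NumPy and SciPy (function and variable names are arbitrary):

```python
import numpy as np
from scipy.linalg import eigh

def kernel_fda(K, y, d, sigma2=1e-3):
    # K is the N x N kernel matrix, y the class labels; returns N x d coefficient vectors.
    N = len(y)
    V = np.zeros((N, N))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        V[np.ix_(idx, idx)] = 1.0 / len(idx)    # c-th block is the uniform matrix
    ones = np.ones((N, N)) / N
    Kb = K @ (V - ones) @ K
    Kw = K @ (np.eye(N) - V) @ K + sigma2 * np.eye(N)
    # Generalized eigenproblem Kb a = lambda Kw a; keep the d leading eigenvectors.
    evals, alphas = eigh(Kb, Kw)
    return alphas[:, ::-1][:, :d]

# Features of the training data are then F = K @ kernel_fda(K, y, d).
```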

Chapter 3

MOTIVATION: SUBSPACE STRUCTURE IN DATA

3.1 Introduction

In this chapter I discuss theoretical and empirical evidence of subspace structures which naturally appear in real-world data. The most prominent examples of subspaces can be found in the image-based face recognition problem. Face images show large variability due to identity, pose change, illumination condition, facial expression, and so on. Principal Component Analysis (PCA) has been applied to construct low-dimensional models of faces by [73, 44] and used for recognition by [79], known as Eigenfaces. Although the Eigenfaces were originally applied to model image variations across different people, they also explain the illumination variation of a single person exceptionally well [32, 21, 94]. Theoretical explanations for the low dimensionality of the illumination variability have been proposed by [8, 62, 61, 4].

When the data consist of the illumination-varying images of multiple people, I can model the data as a collection of the illumination subspaces of each person. In this way,

I can absorb the undesired variability of illumination as variability within subspaces, and emphasize the variability of subject identity as variability between the subspaces. This idea not only applies to illumination change but also to other types of data that have known linear substructures. This is the main idea of the subspace-based approach advocated in this thesis.

More recent examples of subspace structure are found in the dynamical system models of video sequences of, for example, human actions or time-varying textures [19, 80, 78]. When each sequence is modeled by a dynamical system, I can compare those dynamical systems by comparing the linear spans of the observability matrices of the systems, which is similar to comparing the image subspaces.

In the rest of this chapter I explain the details of computing subspaces and estimating dynamical systems from image data. The procedures are demonstrated with the well-known image databases: the Yale Face, CMU-PIE, ETH-80, and IXMAS databases.

3.2 Illumination subspaces in multi-lighting images

Suppose we have a convex-shaped Lambertian object and a single distant light source illuminating the object. If we ignore the attached and cast shadows on the object for now, then the observed intensity $x$ (irradiance) of a surface patch is linearly related to the incident intensity of light $s$ (radiance) by the Lambertian reflectance model
$$x = \rho\, \langle b, s \rangle,$$
where $\rho$ is the albedo of the surface, $b$ is the surface normal, and $s$ is the light source vector. If the whole image $x$ is a collection of $D$ pixel values, that is, $x = [x_1, \ldots, x_D]^\top$, then
$$x = Bs,$$

where $B = [\alpha_1 b_1, \ldots, \alpha_D b_D]^\top$ is the $D \times 3$ matrix of albedos and surface normals. Thus, the set of images under all possible illuminations consists of all linear combinations of the column vectors of $B$,
$$\mathcal{X} = \{Bs, \ s \in \mathbb{R}^3\},$$
which is at most a three-dimensional subspace. However, this is an unrealistic model since it allows negative light intensity. We get a more realistic model by removing negative intensities and also by allowing attached shadows as follows:
$$x_i = \max(\alpha_i \langle b_i, s \rangle, 0), \quad \text{and therefore} \quad x = \max(Bs, 0),$$
where the max operation is performed for each row. An image from multiple light sources is the combination of single-distant-light cases
$$x = \sum_k \max(B s_k, 0).$$
As can be seen from the equation, the set of such images under all illuminations forms a convex cone [8]. However, the dimensionality of the subspace the cone lies in can be as large as the number of pixels $D$ in general, which is inconsistent with empirical observations.

Theoretical explanations of the low dimensionality have been offered by [8, 62, 61, 4] with spherical harmonics. Although the mathematics of the model is rather involved, the main idea can be summarized as follows. The interaction between a distant light source and a Lambertian surface is a convolution on the unit sphere. If we adopt a frequency-domain representation of the light distribution and the reflectance function, then the interaction of an arbitrary light distribution with the surface can be computed by multiplication of coefficients with respect to the spherical harmonics, analogous to Fourier analysis on the real line. Since the max operation can be well approximated by convolution with a low-pass filter, the resultant set of all possible illuminations can be expressed using only a few (4 to 9) harmonic basis images. Figure 3.1 shows the analytically computed PCA from this model.
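The rank argument above is easy to verify numerically. The sketch below (synthetic, randomly generated albedo-scaled normals, not a rendering of real faces) shows that without shadows the images span exactly three dimensions, while the $\max(\cdot, 0)$ nonlinearity concentrates most of the remaining energy in a small number of additional components:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1000                                    # number of pixels
B = rng.standard_normal((D, 3))             # rows: albedo-scaled surface normals (synthetic)
S = rng.standard_normal((3, 200))           # columns: 200 random light directions
S /= np.linalg.norm(S, axis=0)

X_linear = B @ S                            # Lambertian images without shadows
X_shadow = np.maximum(B @ S, 0.0)           # with attached shadows

print(np.linalg.matrix_rank(X_linear))      # 3: a three-dimensional subspace
s = np.linalg.svd(X_shadow, compute_uv=False)
print(np.cumsum(s ** 2) / np.sum(s ** 2))   # energy captured by the leading components
```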

Figure 3.1: The figure shows the first five principal components of a face, computed analytically from a 3D model (top) and a sphere (bottom). These images match well with the empirical principal components computed from a set of real images. The figure was captured from [61].

In the following two subsections I introduce two well-known face databases and show the PCA results from the data to demonstrate the low dimensionality of illumination-varying images.

Yale Face Database

It is often possible to acquire multi-view, multi-lighting images simultaneously with a special camera rig. The Yale face database and the Extended Yale face database [26] together consist of pictures of 38 subjects with 9 different poses and 45 different lighting conditions. The original images are gray-valued and include redundant background objects. I crop and align the face regions by manually choosing a few feature points (centers of the eyes and mouth, and the nose tip) for each image. The cropped images are resized ($D = 896$ pixels) and normalized to have unit variance. Figure 3.2 shows the first 10 subjects and all 9 poses under a fixed illumination condition.

Figure 3.2: Yale Face Database: the first 10 (out of 38) subjects at all poses under a fixed illumination condition.

To compute subspaces, I use all 45 illumination conditions of a person under a fixed pose, which is depicted in Figure 3.3. The $m$-dimensional orthonormal basis is computed from the Singular Value Decomposition (SVD) of this set of data, as follows. Let $X = [x_1, \ldots, x_N]$ be the $D \times N$ data matrix pertaining to all illuminations of a person at a fixed pose, and let $X = USV^\top$ be the SVD of the data, where $U^\top U = UU^\top = I_D$, $V^\top V = VV^\top = I_N$, and $S$ is a $D \times N$ matrix whose elements are zero except on the diagonal, $\mathrm{diag}(S) = [s_1, s_2, \ldots, s_{\min(D,N)}]$. If the singular values are ordered as $s_1 \geq s_2 \geq \ldots \geq s_{\min(D,N)}$, then the $m$-dimensional basis for this set is the first $m$ columns of $U$. In the coming chapters I will use a range of values for $m$ in the experiments. The SVD procedure above is the same as the PCA procedure except that the mean is not removed from the data. The role of the mean will be discussed further in Chapter 6. When the mean is ignored, the PCA eigenvalues are related to the singular values by $\lambda_1 = s_1^2$, $\lambda_2 = s_2^2$, and so on. A few of the orthonormal bases computed from this procedure are shown in Figure 3.4, along with the spectra of the singular values.
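The subspace computation described above takes only a few lines. The sketch below (NumPy, with a synthetic matrix standing in for the normalized face images of one person at one pose) extracts the $m$-dimensional orthonormal basis:

```python
import numpy as np

def subspace_basis(X, m):
    # X is the D x N matrix of vectorized images (one person, one pose, all illuminations);
    # returns the D x m orthonormal basis of the dominant m-dimensional subspace.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # singular values sorted descending
    return U[:, :m]

rng = np.random.default_rng(0)
X = rng.standard_normal((896, 45))          # D = 896 pixels, N = 45 illumination conditions
Y = subspace_basis(X, m=9)
assert np.allclose(Y.T @ Y, np.eye(9))      # orthonormal basis, i.e., a point on G(9, 896)
```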

a fixed illumination condition. To compute subspaces, I use all 21 illumination conditions of a person at a fixed pose (refer to Figure 3.6). The $m$-dimensional orthonormal basis is computed from the SVD of this set of data, similarly to the Yale Face database. A few of the orthonormal bases computed from the database are shown in Figure 3.7, along with the spectrum of the singular values.
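The subspace computation just described is a thin SVD of the stacked illumination images. A minimal NumPy sketch is given below; the function name and the cumulative-spectrum convention (energy fraction of the leading singular values) are illustrative choices rather than the thesis code.

```python
import numpy as np

def illumination_subspace(X, m):
    """Orthonormal basis of the m-dimensional illumination subspace.

    X : (D, N) array whose columns are the N illumination images of one
        subject at a fixed pose.  The mean is deliberately not removed,
        so this is the SVD procedure of the text rather than ordinary PCA.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U S V^T, s in descending order
    Y = U[:, :m]                                       # first m left singular vectors
    # One common way to summarize the spectrum: fraction of the total energy
    # captured by the leading singular values (a convention assumed here).
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    return Y, s, cumulative
```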

Figure 3.3: Yale Face Database: all illumination conditions of a person at a fixed pose, used to compute the corresponding illumination subspace.

Figure 3.4: Yale Face Database: examples of basis images and (cumulative) singular values.

Figure 3.5: CMU-PIE Database: the first 10 (out of 68) subjects at all poses under a fixed illumination condition.

Figure 3.6: CMU-PIE Database: all illumination conditions of a person at a fixed pose, used to compute the corresponding illumination subspace.

Figure 3.7: CMU-PIE Database: examples of basis images and (cumulative) singular values.

3.3 Pose subspaces in multi-view images

We have seen that illumination change can be approximated well by a linear subspace. The change in the pose of the object and/or the camera, however, is harder to analyze without knowing the 3D geometry of the object. Since we are often given 2D image data only and do not know the underlying geometry, it is useful to construct image-based models of an object or a face under pose changes. A popular multi-pose representation of images is the light-field representation, which models the radiance of light as a function of the 5D pose of the observer [29, 30, 95]. Theoretically, the light-field model provides pose-invariant recognition of images taken with an arbitrary camera and pose when the illumination condition is fixed. Zhou et al. extended the light-field model to a bilinear model which allows simultaneous change of pose and illumination [95]. An alternative method is proposed in [34], which uses a generative model of a warped illumination subspace. Image variations due to illumination change are accounted for by a low-dimensional linear subspace, whereas variations due to pose change are approximated by a geometric warping of images in the subspace.

The studies above indicate the nonlinearity of pose-varying images in general. However, the dimensionality of the images as a nonlinear manifold is rather small, since there are at most 6 degrees of freedom for the pose space ($= E(3)$). Therefore, when the range of the pose variation is limited, the nonlinear structure can be contained inside a low-dimensional subspace, and the nonlinear submanifolds can be distinguished by their enclosing subspaces. Although a general method of incorporating nonlinearity is possible and will be discussed in Section 5.2.4, here I use linear subspaces as the simplest model of the pose variations. The approximation by a subspace is demonstrated with the ETH-80 object database in the following subsection.

Figure 3.8: ETH-80 Database: all categories and objects at a fixed pose.

ETH-80 Database

The ETH-80 [48] is an object database designed for object categorization tests under varying poses. The database consists of pictures of 8 object categories: apple, pear, tomato, cow, dog, horse, cup, and car. Each category has 10 object instances, and each object is recorded under 41 different poses. There are several versions of the data; the one I use is color-valued. The images are resized ($D = 1024$ pixels) and normalized to have unit variance. Figure 3.8 shows the 8 categories and the 10 objects of each category at a fixed pose.

From the singular value spectrum I can determine how good the $m$-dimensional approximation is by looking at the value at $m$. For example, if the 5-th cumulative singular value is 0.92, it means that the 5-dimensional subspace captures 92 percent of the total variation of the data (including the bias of the mean).

To compute subspaces, I use all 41 different poses of an object from a category, as shown in Figure 3.9. The $m$-dimensional orthonormal basis is computed from the SVD of this set of data. A few of the orthonormal bases computed from the database are shown in Figure 3.10, along with the spectrum of the singular values.

Figure 3.9: ETH-80 Database: all poses of an object from a category, used to compute the corresponding pose subspace of the object.

Figure 3.10: ETH-80 Database: examples of basis images and (cumulative) singular values.

3.4 Video sequences of human motions

Suppose we have a video sequence of a person performing an action. The sequence is more than just a set of images because of the temporal information contained in it. To capture both the appearance and the temporal dynamics, we often use linear dynamical systems to model the sequence. In particular, the Auto-Regressive Moving Average (ARMA) model has been used to model moving human bodies or textures in computer vision [19, 80].

The ARMA model can be described as follows. Let $y(t)$ be the $D \times 1$ observation vector and $x(t)$ be the $d \times 1$ internal state vector, for $t = 1, \ldots, T$. Then the states evolve according to the linear time-invariant dynamics

$x(t + 1) = Ax(t) + v(t)$,  (3.1)
$y(t) = Cx(t) + w(t)$,  (3.2)

where $v(t)$ and $w(t)$ are additive noises. A probabilistic version of the model assumes that the observations, the states, and the noises are Gaussian distributed, with $v(t) \sim \mathcal{N}(0, Q)$ and $w(t) \sim \mathcal{N}(0, R)$. This allows us to use statistical techniques such as Maximum Likelihood Estimation to infer the states and to estimate the parameters only from the observed data $y(1), \ldots, y(T)$. The estimation problem is known as the system identification problem, and a good textbook on the topic is [52]. The estimation I use in the thesis is one of the simplest estimation methods and is described in the next section.

For now, let us go back to the original question of comparing image sequences using the ARMA model. If we have the parameters $A_i$, $C_i$ for each sequence $i = 1, \ldots, N$ in

the data, the simplest method of comparing two such sequences is to measure the sum of squared differences

$d^2_{i,j} = \|A_i - A_j\|_F^2 + \|C_i - C_j\|_F^2$,  (3.3)

ignoring the noise statistics, which are of less importance. However, it is well known that the parameters are not unique. If we change the basis for the state variables to define new state variables $\hat{x} = Gx$, where $G$ is any $d \times d$ nonsingular matrix, then the same system can be described with different coefficients such that

$\hat{x}(t + 1) = \hat{A}\hat{x}(t) + \hat{v}(t)$,
$\hat{y}(t) = \hat{C}\hat{x}(t) + \hat{w}(t)$,

where $\hat{A} = GAG^{-1}$, $\hat{C} = CG^{-1}$, $\hat{v} = Gv$, and $\hat{w} = w$. Unfortunately, the simple distance (3.3) is not invariant under the coordinate change. I will defer the discussion of other invariant distances for dynamical systems to the next two chapters, and proceed with the basic idea in this chapter.

One of the coordinate-independent representations of the system is given by the infinite observability matrix [17]

$O_{C,A} = \begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \end{bmatrix}$,  (3.4)

which is the concatenation of the matrices $CA^n$ for $n = 0, 1, 2, \ldots$ along the rows. Note that after

the coordinate change $\hat{x} = Gx$, the new observability matrix becomes

$O_{\hat{C},\hat{A}} = \begin{bmatrix} CG^{-1} \\ CAG^{-1} \\ CA^2G^{-1} \\ \vdots \end{bmatrix} = O_{C,A}\, G^{-1}$,  (3.5)

which is the original observability matrix multiplied by $G^{-1}$ on the right. This suggests that if we use the linear span of the column vectors of $O$, instead of the matrix $O$ itself, to represent the dynamical system, the representation is clearly invariant to the choice of $G$. This linear structure of a dynamical system is exactly what we are seeking: in the (infinite-dimensional) space of all possible ARMA models of the same size, each model of a sequence occupies the subspace spanned by the columns of $O$. In the next section I will introduce the IXMAS database and explain how I preprocess the data to compute this linear structure.

IXMAS Database

The INRIA Xmas Motion Acquisition Sequences (IXMAS) is a multiview video database for view-invariant human action recognition [89]. The database consists of 11 daily-life motions ("check watch", "cross arms", "scratch head", "sit down", "get up", "turn around", "walk", "wave hand", "punch", "kick", "pick up"), performed by 11 actors and repeated 3 times. The motions are recorded by 5 calibrated and synchronized cameras at 23 fps. Figure 3.11 shows sample sequences of an actor performing the 11 actions at a fixed view.

The authors of [89] propose further processing of the database. The appearances of the actors, such as clothes, are irrelevant to the actions, and therefore image silhouettes are

computed to extract shapes from each camera. These silhouettes are combined to carve out the 3D visual hull of the actor, represented by 3D occupancy volume data $V(x, y, z)$ as shown in Figure 3.12. However, the actions performed by different actors still have a lot of variability, as demonstrated in Figure 3.13. The variabilities irrelevant to action recognition include the following. Firstly, the actors have different heights and body shapes, and therefore the volumes have to be resized along each axis. Secondly, the actors freely choose their position and orientation, and therefore the volumes have to be centered and reoriented. The resizing and centering can be done by computing the center of mass and second-order moments of the volumes and then standardizing the volumes. However, the orientation variability requires further processing. The authors suggest changing the Cartesian coordinate system $V(x, y, z)$ to the cylindrical coordinate system $V(r, \theta, z)$ and then performing a 1D circular Fourier transform along the $\theta$ axis to get $FFT(V(r, \theta, z))$. By taking only the magnitude of the transform, the resultant feature $\mathrm{abs}\,FFT(V(r, \theta, z))$ becomes rotation-invariant around the $z$-axis. The resultant feature of the 3D volume is a $D = 32^3/2 = 16{,}384$-dimensional vector. Note that this FFT is computed per frame and is not to be confused with a temporal FFT along the frames. Figure 3.14 shows a sample snapshot of the processed features.

ARMA model of data

Once the features are computed for each action, actor, and frame, we can proceed to model the feature sequences using the ARMA model. I estimate the parameters using a fast approximate method based on the SVD of the observed data [19]. Let $USV^\top = [y(1), \ldots, y(T)]$ be the SVD of the data. Then the parameters

$C$, $A$, and the states $x(1), \ldots, x(T)$ are sequentially estimated by

$\tilde{C} = U$,   $\tilde{x}(t) = \tilde{C}^\top y(t)$,
$\tilde{A} = \arg\min_A \sum_{t=1}^{T-1} \|\tilde{x}(t+1) - A\,\tilde{x}(t)\|^2$.

I used $d = 5$ as the dimension of the state space. The estimated $A_i$ and $C_i$ matrices for each sequence $i = 1, \ldots, N$ are used to form a finite observability matrix of size $(Dd) \times d$:

$O_i = [\,C_i^\top \;\; (C_iA_i)^\top \;\; \cdots \;\; (C_iA_i^{d-1})^\top\,]^\top$.

A total of $363 = 11$ (actions) $\times\ 3$ (trials) $\times\ 11$ (actors) observability matrices are computed as the final subspace representation of the database.
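A compact sketch of this estimation and of the observability-matrix construction is given below (NumPy). The least-squares step and the final QR orthonormalization are straightforward implementations of the formulas above; the exact numerical details of the thesis code may differ.

```python
import numpy as np

def arma_observability_subspace(Y_seq, d=5):
    """Estimate (C, A) of the ARMA model and return an orthonormal basis
    of the finite observability matrix O = [C; CA; ...; CA^(d-1)].

    Y_seq : (D, T) array of the feature vectors y(1), ..., y(T).
    d     : state dimension (d = 5 in the experiments).
    """
    U, S, Vt = np.linalg.svd(Y_seq, full_matrices=False)
    C = U[:, :d]                              # C-tilde = U
    X = C.T @ Y_seq                           # x-tilde(t) = C^T y(t)
    # A-tilde = argmin_A sum_t ||x(t+1) - A x(t)||^2, solved by least squares.
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    blocks = [C]
    for _ in range(d - 1):
        blocks.append(blocks[-1] @ A)         # C, CA, CA^2, ...
    O = np.vstack(blocks)                     # size (D d) x d
    Q, _ = np.linalg.qr(O)                    # orthonormal basis of span(O)
    return Q
```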

3.5 Conclusion

In this chapter I aimed to provide motivation for the subspace representation with examples from image databases, ranging from illumination-varying faces to video sequences modeled as dynamical systems. The procedures for computing subspaces from these databases were described. The goal of the subspace-based learning approach is to use this inherent linear structure to emphasize the desired information and to de-emphasize the unwanted variations in the data. This approach translates to 1) illumination-invariant face recognition for the Yale Face and CMU-PIE databases, 2) pose-invariant object categorization with the ETH-80 database, and 3) video-based action recognition with the IXMAS database. However, I add the caveat that the invariant recognition problems above are different from the more general problem of recognizing a single test image, since at least a few test images are required to reliably compute the subspace. In the next three chapters, I will use the computed subspaces from the databases to test various algorithms for subspace-based learning.

Figure 3.11: IXMAS Database: video sequences of an actor performing the 11 different actions, viewed from a fixed camera.

Figure 3.12: IXMAS Database: 3D occupancy volume of an actor at one time frame. The volume is initially computed in the Cartesian coordinate system and later converted to the cylindrical coordinate system to apply the FFT.

Figure 3.13: IXMAS Database: the "kick" action performed by 11 actors. Each sequence has a different kick style as well as a different body shape and height.

Figure 3.14: IXMAS Database: cylindrical coordinate representation of the volume $V(r, \theta, z)$, and the corresponding 1D FFT feature $\mathrm{abs}(FFT(V(r, \theta, z)))$, shown at a few values of $\theta$.

Chapter 4

GRASSMANN MANIFOLDS AND SUBSPACE DISTANCES

4.1 Introduction

In the previous chapter, I discussed examples of linear subspace structures found in real-world data. In this chapter I introduce the Grassmann manifold as the common framework of subspace-based learning algorithms. While a subspace is certainly a linear space, the collection of linear subspaces is a totally different space of its own, which is known as the Grassmann manifold.

The Grassmann manifold, named after the renowned mathematician Hermann Günther Grassmann (1809-1877), has long been known for its intriguing mathematical properties, and as an example of a homogeneous space of Lie groups [86, 13]. However, its applications in computer science and engineering have appeared rather recently: in signal processing and control [74, 36, 6], numerical optimization [20] (and other references therein), and machine learning/computer vision [51, 50, 14, 33, 78]. Moreover, many works have used the subspace concept without explicitly relating it to this mathematical object [92, 64, 24, 90, 85, 43, 3].

One of the goals of this thesis is to provide a unified view of the subspace-based algorithms in the framework of the Grassmann manifold.

In this chapter I define the Grassmann distance, which provides a measure of (dis)similarity of subspaces, and review the known distances, including the Arc-length, Projection, Binet-Cauchy, Max Corr, Min Corr, and Procrustes distances. Some of these distances have been studied in [20, 16], and I provide a more thorough analysis and proofs in this chapter. Furthermore, these distances will be used in conjunction with a k-NN algorithm to demonstrate their potential in classification tasks using the databases prepared in the previous chapter.

4.2 Stiefel and Grassmann manifolds

In this section I introduce the Stiefel and the Grassmann manifolds by summarizing the necessary definitions and properties of these manifolds from [28, 20, 16]. Although these manifolds are not linear spaces, I introduce them as subsets of Euclidean spaces and use matrix representations. This helps to understand the nonlinear spaces intuitively and also facilitates computations on these spaces.

4.2.1 Stiefel manifold

Let $Y$ be a $D \times m$ matrix whose elements are real numbers. In optimization problems with the matrix variable $Y$, we often formulate the notion of normality by an orthonormality condition $Y^\top Y = I_m$.¹ This feasible set is neither linear nor convex, and in fact is the Stiefel manifold, defined as follows:

Definition 4.1. An $m$-frame is a set of $m$ orthonormal vectors in $\mathbb{R}^D$ ($m \leq D$). The Stiefel manifold $S(m, D)$ is the set of $m$-frames in $\mathbb{R}^D$.

¹ Although the term "orthogonal" is the more standard one for this condition, I use the term "orthonormal" to clarify that each column of $Y$ has a unit length.

The Stiefel manifold $S(m, D)$ is represented by the set of $D \times m$ matrices $Y$ such that $Y^\top Y = I_m$. Therefore we can rewrite it as

$S(m, D) = \{Y \in \mathbb{R}^{D \times m} \mid Y^\top Y = I_m\}$.

There are $Dm$ variables in $Y$ and $\frac{1}{2}m(m+1)$ independent conditions in the constraint $Y^\top Y = I_m$. Hence $S(m, D)$ is an analytical manifold of dimension $Dm - \frac{1}{2}m(m+1) = \frac{1}{2}m(2D - m - 1)$. For $m = 1$, $S(m, D)$ is the familiar unit sphere in $\mathbb{R}^D$, and for $m = D$, $S(m, D)$ is the orthogonal group $O(D)$ of $D \times D$ orthogonal matrices.

The Stiefel manifold can also be thought of as the quotient space $S(m, D) = O(D)/O(D - m)$ under right-multiplication by orthonormal matrices. To see this point, let $X = [Y\; Y_\perp] \in O(D)$ be a representer of $Y \in S(m, D)$, where the first $m$ columns form the $m$-frame we care about and $Y_\perp$ is any $D \times (D - m)$ matrix such that $YY^\top + Y_\perp Y_\perp^\top = I_D$. Then the only subgroup of $O(D)$ which leaves the $m$-frame unchanged is the set of block-diagonal matrices $\mathrm{diag}(I_m, R_{D-m})$, where $R_{D-m}$ is any matrix in $O(D - m)$. That is, the $m$-frame of $X$ after the right multiplication

$X \begin{bmatrix} I_m & 0 \\ 0 & R_{D-m} \end{bmatrix} = [Y\; Y_\perp] \begin{bmatrix} I_m & 0 \\ 0 & R_{D-m} \end{bmatrix} = [Y\;\; Y_\perp R_{D-m}]$

remains the same as the $m$-frame of $X$.

4.2.2 Grassmann manifold

The Grassmann manifold is a mathematical object with several similarities to the Stiefel manifold. In optimization problems with a matrix variable $Y$, we occasionally have a cost function which is affected only by $\mathrm{span}(Y)$, the linear subspace spanned by the column vectors of $Y$, and not by the specific values of $Y$. Such a condition leads to the concept of the Grassmann manifold, defined as follows:

Definition 4.2. The Grassmann manifold $G(m, D)$ is the set of $m$-dimensional linear subspaces of $\mathbb{R}^D$.

For a Euclidean representation of the manifold, consider the space $\mathbb{R}^{(0)}_{D,m}$ of all $D \times m$ matrices $Y \in \mathbb{R}^{D \times m}$ of full rank $m$, and consider the group of transformations $Y \mapsto YL$, where $L$ is any nonsingular $m \times m$ matrix. The group defines an equivalence relation in $\mathbb{R}^{(0)}_{D,m}$: two elements $Y_1, Y_2 \in \mathbb{R}^{(0)}_{D,m}$ are the same if $\mathrm{span}(Y_1) = \mathrm{span}(Y_2)$. Hence the equivalence classes of $\mathbb{R}^{(0)}_{D,m}$ are in one-to-one correspondence with the points of the Grassmann manifold $G(m, D)$, and $G(m, D)$ is thought of as the quotient space $G(m, D) = \mathbb{R}^{(0)}_{D,m} / \mathbb{R}^{(0)}_{m,m}$. The $G(m, D)$ is an analytical manifold of dimension $Dm - m^2 = m(D - m)$, since for each $Y$, regarded as a point in $\mathbb{R}^{Dm}$, the set of all elements $YL$ in the equivalence class is a surface in $\mathbb{R}^{Dm}$ of dimension $m^2$. The special case $m = 1$ is called the real projective space $\mathbb{RP}^{D-1}$, which consists of all lines through the origin.

The Grassmann manifold can also be thought of as the quotient space $G(m, D) = O(D) / O(m) \times O(D - m)$,

under the right-multiplication by orthonormal matrices. To see this, let $X = [Y\; Y_\perp] \in O(D)$ be a representer of $Y \in G(m, D)$, where we only care about the span of the first $m$ columns and $Y_\perp$ is any $D \times (D - m)$ matrix such that $YY^\top + Y_\perp Y_\perp^\top = I_D$. Then the only subgroup of $O(D)$ which leaves the span of the first $m$ columns unchanged is the set of block-diagonal matrices $\mathrm{diag}(R_m, R_{D-m})$, where $R_m$ and $R_{D-m}$ are any two matrices in $O(m)$ and $O(D - m)$ respectively. That is, the span of the first $m$ columns of $X$ after the right multiplication

$X \begin{bmatrix} R_m & 0 \\ 0 & R_{D-m} \end{bmatrix} = [Y\; Y_\perp] \begin{bmatrix} R_m & 0 \\ 0 & R_{D-m} \end{bmatrix} = [Y R_m\;\; Y_\perp R_{D-m}]$

is the same as the span of the first $m$ columns of $X$.

From the quotient space representations, we see that $G(m, D) = S(m, D)/O(m)$. This is the representation I use throughout the thesis. To summarize, an element of $G(m, D)$ is represented by an orthonormal matrix $Y \in \mathbb{R}^{D \times m}$ such that $Y^\top Y = I_m$, with the equivalence relation:

Definition 4.3. $Y_1 \equiv Y_2$ if and only if $\mathrm{span}(Y_1) = \mathrm{span}(Y_2)$.

We can also write the equivalence relation as

Corollary 4.4. $Y_1 \equiv Y_2$ if and only if $Y_1 = Y_2 R_m$ for some orthonormal matrix $R_m \in O(m)$.

In this thesis I use a more general geometry than the Riemannian geometry of the Grassmann manifold, and do not discuss this subject further; I refer the interested reader to [86, 13, 20, 16, 2] for further reading.
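Corollary 4.4 is easy to check numerically: two bases of the same subspace differ by a right factor in $O(m)$, and the projection matrix $YY^\top$ is unchanged. The snippet below is only a sanity check with random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
D, m = 10, 3

# A point of G(m, D), represented by an orthonormal D x m matrix Y.
Y, _ = np.linalg.qr(rng.standard_normal((D, m)))

# Another basis of the same subspace: Y R with R in O(m) (Corollary 4.4).
R, _ = np.linalg.qr(rng.standard_normal((m, m)))
Y2 = Y @ R

# The projection matrix Y Y^T depends only on span(Y), not on the chosen basis.
assert np.allclose(Y @ Y.T, Y2 @ Y2.T)
```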

4.2.3 Principal angles and canonical correlations

A canonical distance between two subspaces is the Riemannian distance, which is the length of the geodesic path connecting the two corresponding points on the Grassmann manifold. However, there is a more intuitive and computationally efficient way of defining distances using the principal angles [28]. I define the principal angles / canonical correlations as follows:

Definition 4.5. Let $Y_1$ and $Y_2$ be two orthonormal matrices of size $D \times m$. The principal angles $0 \leq \theta_1 \leq \cdots \leq \theta_m \leq \pi/2$ between the two subspaces $\mathrm{span}(Y_1)$ and $\mathrm{span}(Y_2)$ are defined recursively by

$\cos\theta_k = \max_{u_k \in \mathrm{span}(Y_1)} \max_{v_k \in \mathrm{span}(Y_2)} u_k^\top v_k$,  subject to
$u_k^\top u_k = 1$, $v_k^\top v_k = 1$, $u_k^\top u_i = 0$, $v_k^\top v_i = 0$ $(i = 1, \ldots, k-1)$.

The first principal angle $\theta_1$ is the smallest angle between a pair of unit vectors, one from each of the two subspaces. The cosine of the principal angle is the first canonical correlation [39]. The $k$-th principal angle and canonical correlation are defined recursively. It is known [91, 20] that the principal angles are related to the geodesic (arc-length) distance, as shown in Figure 4.1, by

$d^2_{Arc}(Y_1, Y_2) = \sum_i \theta_i^2$.  (4.1)

To compute the principal angles, we need not directly solve the maximization problem. Instead, the principal angles can be computed from the Singular Value Decomposition (SVD) of the product of the two matrices,

$Y_1^\top Y_2 = USV^\top$,  (4.2)

where $U = [u_1 \ldots u_m]$, $V = [v_1 \ldots v_m]$, and $S$ is the diagonal matrix $S = \mathrm{diag}(\cos\theta_1, \ldots, \cos\theta_m)$. The proof can be found on p. 604 of [28]. The principal angles form a non-decreasing sequence $0 \leq \theta_1 \leq \cdots \leq \theta_m \leq \pi/2$, and consequently the canonical correlations form a non-increasing sequence $1 \geq \cos\theta_1 \geq \cdots \geq \cos\theta_m \geq 0$. Although the definition of principal angles can be extended to the case where $Y_1$ and $Y_2$ have different numbers of columns, I assume $Y_1$ and $Y_2$ have the same size $D \times m$ throughout this thesis.

Figure 4.1: Principal angles and Grassmann distances. Let $\mathrm{span}(Y_i)$ and $\mathrm{span}(Y_j)$ be two subspaces in the Euclidean space $\mathbb{R}^D$. The distance between the two subspaces can be measured using the principal angles $\theta = [\theta_1, \ldots, \theta_m]$. In the Grassmann manifold viewpoint, the subspaces $\mathrm{span}(Y_i)$ and $\mathrm{span}(Y_j)$ are considered as two points on the manifold $G(m, D)$, whose Riemannian distance is related to the principal angles by $d(Y_i, Y_j) = \|\theta\|_2$. Various distances can be defined based on the principal angles.

4.3 Grassmann distances for subspaces

In this section I introduce a few subspace distances which have appeared in the literature, and give analyses of the distances in terms of the principal angles.
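Before turning to the individual distances, note that in practice the principal angles are obtained exactly as in (4.2): take the SVD of $Y_1^\top Y_2$ and apply the arccosine to the singular values. A short NumPy sketch (function names are illustrative):

```python
import numpy as np

def principal_angles(Y1, Y2):
    """Principal angles between span(Y1) and span(Y2), via the SVD of Y1^T Y2.

    Y1, Y2 : (D, m) orthonormal bases.  Returns theta_1 <= ... <= theta_m.
    """
    cosines = np.linalg.svd(Y1.T @ Y2, compute_uv=False)   # cos(theta_i), descending
    cosines = np.clip(cosines, -1.0, 1.0)                  # guard against round-off
    return np.arccos(cosines)                              # ascending angles

def arc_length_distance(Y1, Y2):
    """Geodesic (arc-length) distance of (4.1): the 2-norm of the angle vector."""
    return np.linalg.norm(principal_angles(Y1, Y2))
```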

I use the term distance for any assignment of nonnegative values to each pair of points in the data space. A valid metric is, however, a distance that satisfies the additional axioms in Definition 2.9. Furthermore, a distance (or a metric) between subspaces has to be invariant under different basis representations. A distance that satisfies this condition is referred to as a Grassmann distance (or metric):

Definition 4.6. Let $d : \mathbb{R}^{D \times m} \times \mathbb{R}^{D \times m} \to \mathbb{R}$ be a distance function. The function $d$ is a Grassmann distance if

$d(Y_1, Y_2) = d(Y_1 R_1, Y_2 R_2)$  for all $R_1, R_2 \in O(m)$.

4.3.1 Projection distance

The Projection distance is defined as

$d_{Proj}(Y_1, Y_2) = \left(\sum_{i=1}^m \sin^2\theta_i\right)^{1/2} = \left(m - \sum_{i=1}^m \cos^2\theta_i\right)^{1/2}$,  (4.3)

which is the 2-norm of the sines of the principal angles [20, 85]. An interesting property of this metric is that it can be computed from the product $Y_1^\top Y_2$ alone, whose importance will be revealed in the next chapter. From the relationship between the principal angles and the SVD of $Y_1^\top Y_2$ in (4.2) we get

$d^2_{Proj}(Y_1, Y_2) = m - \sum_{i=1}^m \cos^2\theta_i = m - \|Y_1^\top Y_2\|_F^2 = \frac{1}{2}\|Y_1Y_1^\top - Y_2Y_2^\top\|_F^2$,  (4.4)

where $\|\cdot\|_F$ is the matrix Frobenius norm, $\|A\|_F^2 = \sum_{i=1}^m \sum_{j=1}^n A_{ij}^2$ for $A \in \mathbb{R}^{m \times n}$.

The Projection distance is a Grassmann distance, since it is invariant to different representations, as can easily be seen from (4.4). Furthermore, the distance is a metric:

Lemma 4.7. The Projection distance $d_{Proj}$ is a Grassmann metric.

Proof. The nonnegativity, symmetry, and triangle inequality follow naturally from $\|\cdot\|_F$ being a matrix norm. The remaining condition to be shown is the necessary and sufficient condition $\|Y_1Y_1^\top - Y_2Y_2^\top\|_F = 0 \Leftrightarrow \mathrm{span}(Y_1) = \mathrm{span}(Y_2)$. From $\|\cdot\|_F$ being a matrix norm, we have $\|Y_1Y_1^\top - Y_2Y_2^\top\|_F = 0 \Leftrightarrow Y_1Y_1^\top = Y_2Y_2^\top$. The proof of the next step, $Y_1Y_1^\top = Y_2Y_2^\top \Leftrightarrow \mathrm{span}(Y_1) = \mathrm{span}(Y_2)$, is also simple and is given in the proof of Theorem 5.2.

4.3.2 Binet-Cauchy distance

The Binet-Cauchy distance is defined as

$d_{BC}(Y_1, Y_2) = \left(1 - \prod_i \cos^2\theta_i\right)^{1/2}$,  (4.5)

which involves the product of the canonical correlations [90, 83]. The distance can also be computed from the product $Y_1^\top Y_2$ alone. From the relationship between the principal angles and the SVD of $Y_1^\top Y_2$ in (4.2) we get

$d^2_{BC}(Y_1, Y_2) = 1 - \prod_i \cos^2\theta_i = 1 - \det(Y_1^\top Y_2)^2$.  (4.6)

The Binet-Cauchy distance is also invariant under different representations, and furthermore it is a metric:

Lemma 4.8. The Binet-Cauchy distance $d_{BC}$ is a Grassmann metric.

The proof of the lemma is trivial after I prove Theorem 5.4 later.
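Both distances can be evaluated directly from the $m \times m$ matrix $Y_1^\top Y_2$, without forming the principal angles explicitly, as (4.4) and (4.6) show. A minimal sketch:

```python
import numpy as np

def projection_distance(Y1, Y2):
    """d_Proj of (4.4): sqrt(m - ||Y1^T Y2||_F^2)."""
    m = Y1.shape[1]
    return np.sqrt(max(m - np.linalg.norm(Y1.T @ Y2, 'fro')**2, 0.0))

def binet_cauchy_distance(Y1, Y2):
    """d_BC of (4.6): sqrt(1 - det(Y1^T Y2)^2)."""
    return np.sqrt(max(1.0 - np.linalg.det(Y1.T @ Y2)**2, 0.0))
```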

There is an interesting relationship between this distance and the Martin distance in control theory [55]. Martin proposed a metric between two ARMA processes based on the cepstra of the models, which was later shown to be of the following form [17]:

$d_M(O_1, O_2)^2 = -\log \prod_{i=1}^m \cos^2\theta_i$,

where $O_1$ and $O_2$ are the infinite observability matrices explained in the previous chapter:

$O_1 = \begin{bmatrix} C_1 \\ C_1A_1 \\ C_1A_1^2 \\ \vdots \end{bmatrix}$, and $O_2 = \begin{bmatrix} C_2 \\ C_2A_2 \\ C_2A_2^2 \\ \vdots \end{bmatrix}$.

Consequently, the Binet-Cauchy distance is directly related to the Martin distance by the following:

$d_{BC}(\mathrm{span}(O_1), \mathrm{span}(O_2)) = \left(1 - \exp\left(-d_M(O_1, O_2)^2\right)\right)^{1/2}$.  (4.7)

4.3.3 Max Correlation

The Max Correlation distance is defined as

$d_{MaxCor}(Y_1, Y_2) = (1 - \cos^2\theta_1)^{1/2} = \sin\theta_1$,  (4.8)

which is based on the largest canonical correlation $\cos\theta_1$ (or the smallest principal angle $\theta_1$). The max correlation is an intuitive measure between two subspaces which was often used in previous works [92, 64, 24]. It is a Grassmann distance. However, it is not a metric and therefore has some limitations. For example, it is possible for two distinct subspaces $\mathrm{span}(Y_1)$ and $\mathrm{span}(Y_2)$ to have a zero distance $d_{MaxCor} = 0$ if they intersect anywhere other

than the origin.

4.3.4 Min Correlation

The Min Correlation distance is defined as

$d_{MinCor}(Y_1, Y_2) = (1 - \cos^2\theta_m)^{1/2} = \sin\theta_m$.  (4.9)

The min correlation is conceptually the opposite of the max correlation, in that it is based on the smallest canonical correlation (or the largest principal angle). This distance is also closely related to the definition of the Projection distance. Previously I rewrote the Projection distance as $d_{Proj} = 2^{-1/2}\|Y_1Y_1^\top - Y_2Y_2^\top\|_F$. The min correlation can be similarly written as [20]

$d_{MinCor} = \|Y_1Y_1^\top - Y_2Y_2^\top\|_2$,  (4.10)

where $\|\cdot\|_2$ is the matrix 2-norm,

$\|A\|_2 = \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2}$,   $A \in \mathbb{R}^{m \times n}$.

The proof can be found on p. 75 of [28]. This distance is also a metric:

Lemma 4.9. The Min Correlation distance $d_{MinCor}$ is a Grassmann metric.

The proof is almost the same as the proof for the Projection distance, with the Frobenius norm replaced by the 2-norm, and is omitted.
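The two single-angle distances are equally simple to compute. The sketch below evaluates $d_{MaxCor}$ from the largest singular value of $Y_1^\top Y_2$ and $d_{MinCor}$ from the spectral norm in (4.10), which equal $\sin\theta_1$ and $\sin\theta_m$ respectively.

```python
import numpy as np

def max_correlation_distance(Y1, Y2):
    """d_MaxCor of (4.8): sin(theta_1), from the largest singular value of Y1^T Y2."""
    cos_theta1 = np.linalg.svd(Y1.T @ Y2, compute_uv=False)[0]
    return np.sqrt(max(1.0 - cos_theta1**2, 0.0))

def min_correlation_distance(Y1, Y2):
    """d_MinCor of (4.9)/(4.10): sin(theta_m) = ||Y1 Y1^T - Y2 Y2^T||_2."""
    return np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, 2)
```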

4.3.5 Procrustes distance

The Procrustes distance is defined as

$d_{Proc1}(Y_1, Y_2) = 2\left(\sum_{i=1}^m \sin^2(\theta_i/2)\right)^{1/2}$,  (4.11)

which is twice the vector 2-norm of $[\sin(\theta_1/2), \ldots, \sin(\theta_m/2)]$. There is an alternative definition: the Procrustes distance is the minimum Euclidean distance between different representations of the two subspaces $\mathrm{span}(Y_1)$ and $\mathrm{span}(Y_2)$ [20, 16]:

$d_{Proc1}(Y_1, Y_2) = \min_{R_1, R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\|_F = \|Y_1U - Y_2V\|_F$,  (4.12)

where $U$ and $V$ are from (4.2). Let us first check that the equation above is true.

Proof. First note that

$\min_{R_1, R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\| = \min_{R_1, R_2 \in O(m)} \|Y_1 - Y_2R_2R_1^\top\| = \min_{Q \in O(m)} \|Y_1 - Y_2Q\|$  (4.13)

for $R_1, R_2, Q \in O(m)$. This also holds for $\|\cdot\|_2$. Using this equality, we have

$\min_{R_1, R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\|_F^2 = \min_{Q \in O(m)} \|Y_1 - Y_2Q\|_F^2 = \min_Q \mathrm{tr}\left(Y_1^\top Y_1 + Y_2^\top Y_2 - Y_1^\top Y_2Q - Q^\top Y_2^\top Y_1\right) = 2m - 2\max_Q \mathrm{tr}(Y_1^\top Y_2Q)$.

However, $\mathrm{tr}(Y_1^\top Y_2Q) = \mathrm{tr}(USV^\top Q) = \mathrm{tr}(ST)$, where $T = V^\top QU$ is another orthonormal matrix. Since $S$ is diagonal, $\mathrm{tr}(ST) = \sum_i S_{ii}T_{ii} \leq \sum_i S_{ii}$, and the maximum is achieved

for $T = I_m$, or equivalently $Q = VU^\top$. Hence

$\min_{R_1, R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\|_F = \|Y_1 - Y_2VU^\top\|_F = \|Y_1U - Y_2V\|_F$.

Finally, let us prove the equivalence of the two definitions (4.11) and (4.12).

Lemma 4.10. $\|Y_1U - Y_2V\|_F = 2\left(\sum_{i=1}^m \sin^2(\theta_i/2)\right)^{1/2}$.

Proof. Since $U^\top Y_1^\top Y_2V = S$, we have

$(Y_1U - Y_2V)^\top(Y_1U - Y_2V) = U^\top Y_1^\top Y_1U - U^\top Y_1^\top Y_2V - V^\top Y_2^\top Y_1U + V^\top Y_2^\top Y_2V = 2(I_m - S)$.

Therefore $\|Y_1U - Y_2V\|_F^2 = \mathrm{tr}\,2(I_m - S) = 2\sum_i(1 - \cos\theta_i) = 4\sum_i \sin^2(\theta_i/2)$, which gives the desired result.

The Procrustes distance is also called the chordal distance [20]. The author of [20] also suggests another version of the Procrustes distance using the matrix 2-norm:

$d_{Proc2}(Y_1, Y_2) = \|Y_1U - Y_2V\|_2 = 2\sin(\theta_m/2)$.  (4.14)

Let us check the validity of this definition:

Proof. From the same computation as above,

$\|Y_1U - Y_2V\|_2^2 = \lambda_{\max}\left((Y_1U - Y_2V)^\top(Y_1U - Y_2V)\right) = \lambda_{\max}\left(2(I_m - S)\right)$.

Since $0 \leq \cos\theta_m \leq \cdots \leq \cos\theta_1 \leq 1$, the largest eigenvalue of $2(I_m - S)$ is $2(1 - \cos\theta_m) = 4\sin^2(\theta_m/2)$, and therefore $\|Y_1U - Y_2V\|_2 = 2\sin(\theta_m/2)$.

Note that this version of the Procrustes distance has an immediate relationship with the min correlation distance:

$d^2_{MinCor}(Y_1, Y_2) = \sin^2\theta_m = 1 - \left(1 - 2\sin^2(\theta_m/2)\right)^2 = 1 - \left(1 - \tfrac{1}{2}d^2_{Proc2}(Y_1, Y_2)\right)^2$.  (4.15)

Since the function $f(x) = \left(1 - (1 - x^2/2)^2\right)^{1/2}$ is a non-decreasing transform of the distance for $0 \leq x \leq \sqrt{2}$, the two distances are expected to behave similarly, although not exactly in the same manner.

By definition, both versions of the Procrustes distance are invariant under different representations, and furthermore they are valid metrics:

Lemma 4.11. The Procrustes distances $d_{Proc1}$ and $d_{Proc2}$ are Grassmann metrics.

Proof. Nonnegativity and symmetry are immediate. For the triangle inequality, let us use the

equality (4.13) to show that

$d_{Proc}(Y_1, Y_2) + d_{Proc}(Y_2, Y_3) = \min_{R_1, R_2 \in O(m)} \|Y_1R_1 - Y_2R_2\| + \min_{R_2, R_3 \in O(m)} \|Y_2R_2 - Y_3R_3\|$
$= \min_{Q_1 \in O(m)} \|Y_1Q_1 - Y_2\| + \min_{Q_3 \in O(m)} \|Y_2 - Y_3Q_3\| = \min_{Q_1, Q_3 \in O(m)} \left\{\|Y_1Q_1 - Y_2\| + \|Y_2 - Y_3Q_3\|\right\}$
$\geq \min_{Q_1, Q_3 \in O(m)} \|Y_1Q_1 - Y_3Q_3\| = d_{Proc}(Y_1, Y_3)$.

The remaining condition to show is the necessary and sufficient condition $\|Y_1U - Y_2V\| = 0 \Leftrightarrow \mathrm{span}(Y_1) = \mathrm{span}(Y_2)$. From $\|\cdot\|$ being a matrix norm, we have $\|Y_1U - Y_2V\| = 0 \Leftrightarrow Y_1U = Y_2V$. The proof of $Y_1U = Y_2V \Leftrightarrow \mathrm{span}(Y_1) = \mathrm{span}(Y_2)$ is similar to the case of the Projection distance and is omitted.

4.3.6 Comparison of the distances

Table 4.1 summarizes the distances introduced so far.

Table 4.1: Summary of the Grassmann distances. The distances can be defined as simple functions of both the bases $Y$ and the principal angles $\theta_i$, except for the Arc length, which involves matrix exponentials.

  Distance       $d^2(Y_1, Y_2)$ in terms of $Y$                  In terms of $\theta$           Metric?
  Arc length     (matrix exponentials)                            $\sum_i \theta_i^2$            Yes
  Projection     $\frac{1}{2}\|Y_1Y_1^\top - Y_2Y_2^\top\|_F^2$   $\sum_i \sin^2\theta_i$        Yes
  Binet-Cauchy   $1 - \det(Y_1^\top Y_2)^2$                       $1 - \prod_i \cos^2\theta_i$   Yes
  Max Corr       $1 - \|Y_1^\top Y_2\|_2^2$                       $\sin^2\theta_1$               No
  Min Corr       $\|Y_1Y_1^\top - Y_2Y_2^\top\|_2^2$              $\sin^2\theta_m$               Yes
  Procrustes 1   $\|Y_1U - Y_2V\|_F^2$                            $4\sum_i \sin^2(\theta_i/2)$   Yes
  Procrustes 2   $\|Y_1U - Y_2V\|_2^2$                            $4\sin^2(\theta_m/2)$          Yes

When these distances are used for a learning task, the choice of the most appropriate distance for the task depends on several factors.

The first factor is the distribution of the data. Since the distances are defined from particular functions of the principal angles, the best distance depends highly on the probability distribution of the principal angles of the given data. For example, the max correlation $d_{MaxCor}$ uses only the smallest principal angle $\theta_1$, and therefore can serve as a robust distance when

the subspaces are highly scattered and noisy, whereas the min correlation $d_{MinCor}$ uses only the largest principal angle $\theta_m$ and is therefore not a sensible choice. On the other hand, when the subspaces are concentrated and have nonzero intersections, $d_{MaxCor}$ will be close to zero for most of the data, and $d_{MinCor}$ may be more discriminative in this case. The second Procrustes distance $d_{Proc2}$ is also expected to behave similarly to $d_{MinCor}$, since it also uses only the largest principal angle. Besides, $d_{MinCor}$ and $d_{Proc2}$ are directly related by (4.15). The Arc-length $d_{Arc}$, the Projection distance $d_{Proj}$, and the first Procrustes distance $d_{Proc1}$ use all the principal angles. Therefore they have intermediate characteristics between $d_{MaxCor}$ and $d_{MinCor}$, and will be useful for a wider range of data distributions. The Binet-Cauchy distance $d_{BC}$ also uses all the principal angles, but it behaves similarly to $d_{MinCor}$ for scattered subspaces, since the distance becomes the maximum value ($= 1$) if at least one of the principal angles is $\pi/2$, due to the product form of $d_{BC}$.

The second criterion for choosing the distance is the degree of structure in the distance. Without any structure, a distance can be used only with a simple K-Nearest Neighbor (K-NN) algorithm for classification. When a distance has extra structure such as the triangle inequality, for example, we can speed up nearest neighbor searches by estimating lower

and upper limits of unknown distances [23]. From this point of view, the max correlation $d_{MaxCor}$ is not a metric and may not be used with the more sophisticated algorithms, unlike the rest of the distances.

4.4 Experiments

In this section I make empirical comparisons of the Grassmann distances discussed so far by using the distances for classification tasks with real image databases.

4.4.1 Experimental setting

I use the subspaces computed from the four databases Yale Face, CMU-PIE, ETH-80, and IXMAS, and compare the performances of simple 1NN classifiers using the Grassmann distances. The training and test sets are prepared by N-fold cross validation as follows. For the Yale Face and the CMU-PIE databases, I keep the subspaces corresponding to a particular pose from all subjects for testing, and use the remaining subspaces corresponding to the other poses for training. This results in 9-fold and 7-fold cross validation tests for Yale Face and CMU-PIE respectively. For the ETH-80 database, I keep the subspaces of 8 objects, one from each category, for testing, and use the remaining subspaces for training, which is a 10-fold cross validation. For the IXMAS database, I keep all the subspaces corresponding to a particular person for testing, and use the subspaces of the other people for training, which is an 11-fold cross validation test.

As mentioned in the previous chapter, the subspace representation of the databases absorbs the variability due to illumination, pose, and the choice of the state space respectively. The cross validation setting of this thesis tests whether the remaining variability between subspaces is indeed useful to recognize subjects, objects, or actions, regardless of

different poses, object instances, and actors, respectively.

4.4.2 Results and discussion

Figures 4.2-4.5 show the classification rates. I can summarize the results as follows:

1. The best performing distances are different for each database: $d_{MaxCor}$ for Yale Face; $d_{Proj}$, $d_{Proc1}$ for CMU-PIE; $d_{Arc}$, $d_{Proj}$, $d_{Proc1}$ for ETH-80; and $d_{Proj}$, $d_{Proc1}$ for the IXMAS database. I interpret this as certain distances being better suited for discriminating the subspaces of a particular database.

2. With the exception of $d_{MaxCor}$ for Yale Face, the three distances $d_{Arc}$, $d_{Proj}$, $d_{Proc1}$ are consistently better than $d_{BC}$, $d_{MinCor}$, $d_{Proc2}$. This grouping of the distances is predicted by the analysis in Section 4.3.6.

3. The $d_{MinCor}$ and $d_{Proc2}$ show exactly the same rates, since the former is monotonically related to the latter by (4.15). However, the two distances will show different rates when they are used with more sophisticated algorithms than the K-NN.

4. With the exception of Yale Face, the three distances perform much better than the Euclidean distance does, which demonstrates the potential advantages of the subspace-based approach.

5. For CMU-PIE and IXMAS, the rates increase overall as the subspace dimension $m$ increases. For Yale Face, the rates of $d_{BC}$ and $d_{Proc2}$ drop as $m$ increases, whereas the rates of the other distances remain the same. For ETH-80, the rates seem to have different peaks for each distance. This means that the choice of the subspace dimensionality $m$ can have significant effects on the recognition rates when the simple K-NN algorithm is employed. However, it will be shown in later chapters that $m$ has less effect on more sophisticated algorithms that are able to adapt to the peculiarities of the data.

4.5 Conclusion

In this chapter I introduced the Grassmann manifold as the framework for subspace-based algorithms, and reviewed several well-known Grassmann distances for measuring the dissimilarity of subspaces. These Grassmann distances were analyzed and compared in terms of how they use the principal angles to define the dissimilarity of subspaces. In the classification tasks on real image databases with the 1NN algorithm, the best performing distance varied depending on the data used. This suggests that we need some prior knowledge of the data to choose the best distance a priori. However, most of the Grassmann distances performed better than the Euclidean distance in 1NN classification, and behaved in groups as predicted from the analysis. In the next chapter I will present a more important criterion for choosing a distance: whether or not the distance is associated with a positive definite kernel.

Figure 4.2: Yale Face Database: face recognition rates from the 1NN classifier with the Grassmann distances. The two highest rates (including ties) are highlighted in boldface for each subspace dimension $m$.

Figure 4.3: CMU-PIE Database: face recognition rates from the 1NN classifier with the Grassmann distances.

Figure 4.4: ETH-80 Database: object categorization rates from the 1NN classifier with the Grassmann distances.

Figure 4.5: IXMAS Database: action recognition rates from the 1NN classifier with the Grassmann distances. The two highest rates (including ties) are highlighted in boldface for each subspace dimension $m$.

Chapter 5

GRASSMANN KERNELS AND DISCRIMINANT ANALYSIS

5.1 Introduction

In the previous chapter I defined subspace distances on the Grassmann manifold. However, with a distance structure only, the possible operations with the data are severely restricted. In this chapter, I show that it is possible to define positive definite kernel functions on the manifold, and thereby to transform the space into the familiar Hilbert space by virtue of the RKHS theory in Section 2.2. In particular, the Projection and the Binet-Cauchy distances presented in the previous chapter will be shown to be compatible with the Projection and the Binet-Cauchy kernels, defined as follows:

$k_{Proj}(Y_1, Y_2) = \|Y_1^\top Y_2\|_F^2$,   $k_{BC}(Y_1, Y_2) = (\det Y_1^\top Y_2)^2$.

These kernels are discussed in detail in this chapter. The Binet-Cauchy kernel has been used as a similarity measure for sets [90] and dynamical systems [83].¹ The Projection distance has been used for face recognition [85], but the corresponding Projection kernel has not been explicitly used, and it is the main object of this chapter. I examine both kernels as the representative kernels on the Grassmann manifold. Advantages of the Grassmann kernels over the Euclidean kernels are demonstrated by a classification problem with Support Vector Machines (SVMs) on synthetic datasets.

¹ The authors of [83] use the term Binet-Cauchy kernel for a more abstract class of kernels for Fredholm operators. The Binet-Cauchy kernel $k_{BC}$ in this paper is a special case which is close to what those authors call the Martin kernel.

To demonstrate the potential benefits of the kernels further, I use the kernels in a discriminant analysis of subspaces. The proposed method will be contrasted with previously suggested subspace-based discriminant algorithms [92, 64, 24, 43]. Those previous methods adopt an inconsistent strategy: feature extraction is performed in the Euclidean space while non-Euclidean subspace distances are used. This inconsistency results in a difficult optimization and a weak guarantee of convergence. In the proposed approach of this chapter, the feature extraction and the distance measurement are integrated around the Grassmann kernel, resulting in a simpler and more familiar algorithm. Experiments with the image databases also show that the proposed method performs better than the previous methods.

5.2 Kernel functions for subspaces

Among the various distances presented in Chapter 4, only the Projection distance and the Binet-Cauchy distance are induced from positive definite kernels. This means that we can

define the corresponding kernels $k_{Proj}$ and $k_{BC}$ such that the following is true:

$d^2(Y_1, Y_2) = k(Y_1, Y_1) + k(Y_2, Y_2) - 2k(Y_1, Y_2)$.  (5.1)

To define a kernel on the Grassmann manifold, let us recall the definition of a positive definite kernel in Definition 2.4: a real symmetric function $k$ is a (resp. conditionally) positive definite kernel function if $\sum_{i,j} c_ic_jk(x_i, x_j) \geq 0$ for all $x_1, \ldots, x_n$ ($x_i \in \mathcal{X}$) and $c_1, \ldots, c_n$ ($c_i \in \mathbb{R}$) for any $n \in \mathbb{N}$ (resp. for all $c_1, \ldots, c_n$ ($c_i \in \mathbb{R}$) such that $\sum_{i=1}^n c_i = 0$). Based on the Euclidean coordinates of subspaces, the Grassmann kernel is defined as follows:

Definition 5.1. Let $k : \mathbb{R}^{D \times m} \times \mathbb{R}^{D \times m} \to \mathbb{R}$ be a real-valued symmetric function, $k(Y_1, Y_2) = k(Y_2, Y_1)$. The function $k$ is a Grassmann kernel if it is 1) positive definite and 2) invariant to different representations: $k(Y_1, Y_2) = k(Y_1R_1, Y_2R_2)$ for all $R_1, R_2 \in O(m)$.

In the following sections I explicitly construct isometries from $(\mathcal{G}, d_{Proj})$ and $(\mathcal{G}, d_{BC})$ to a Hilbert space $(\mathcal{H}, L_2)$, and use the isometries to show that the Projection and the Binet-Cauchy kernels are Grassmann kernels.

5.2.1 Projection kernel

The Projection distance $d_{Proj}$ can be understood by associating a subspace with a projection matrix via the following embedding [16]:

$\Psi : G(m, D) \to \mathbb{R}^{D \times D}$,   $\mathrm{span}(Y) \mapsto YY^\top$.  (5.2)

The image $\Psi(G(m, D))$ is the set of rank-$m$ orthogonal projection matrices, hence the name Projection distance.

Theorem 5.2. The map

$\Psi : G(m, D) \to \mathbb{R}^{D \times D}$,   $\mathrm{span}(Y) \mapsto YY^\top$  (5.3)

is an embedding. In particular, it is an isometry from $(\mathcal{G}, d_{Proj})$ to $(\mathbb{R}^{D \times D}, \|\cdot\|_F)$.

Proof. 1. Well-defined: if $\mathrm{span}(Y_1) = \mathrm{span}(Y_2)$, or equivalently $Y_1 = Y_2R$ for some $R \in O(m)$, then $\Psi(Y_1) = Y_1Y_1^\top = Y_2Y_2^\top = \Psi(Y_2)$.

2. Injective: suppose $\Psi(Y_1) = Y_1Y_1^\top = Y_2Y_2^\top = \Psi(Y_2)$. After multiplying by $Y_1$ and $Y_2$ on the right we get the equalities $Y_1 = Y_2(Y_2^\top Y_1)$ and $Y_2 = Y_1(Y_1^\top Y_2)$ respectively. Let $R = Y_2^\top Y_1$; then $Y_1 = Y_2R = Y_1(R^\top R)$, which shows $R \in O(m)$ and therefore $\mathrm{span}(Y_1) = \mathrm{span}(Y_2)$.

3. Isometry: $\|\Psi(Y_1) - \Psi(Y_2)\|_F = \|Y_1Y_1^\top - Y_2Y_2^\top\|_F = 2^{1/2}\,d_{Proj}(Y_1, Y_2)$.

Since we have a Euclidean embedding into $(\mathbb{R}^{D \times D}, \|\cdot\|_F)$, the natural inner product of this space is the trace $\mathrm{tr}[(Y_1Y_1^\top)(Y_2Y_2^\top)] = \|Y_1^\top Y_2\|_F^2$. This provides us with the definition of the Projection kernel:

Theorem 5.3. The Projection kernel

$k_{Proj}(Y_1, Y_2) = \|Y_1^\top Y_2\|_F^2$  (5.4)

is a Grassmann kernel.

Proof. The kernel is well-defined because $k_{Proj}(Y_1, Y_2) = k_{Proj}(Y_1R_1, Y_2R_2)$ for any $R_1, R_2 \in O(m)$. The positive definiteness follows from the properties of the Frobenius norm: for all $Y_1, \ldots, Y_n$ ($Y_i \in \mathcal{G}$) and $c_1, \ldots, c_n$ ($c_i \in \mathbb{R}$) for any $n \in \mathbb{N}$, we have

$\sum_{ij} c_ic_j\|Y_i^\top Y_j\|_F^2 = \sum_{ij} c_ic_j\,\mathrm{tr}(Y_iY_i^\top Y_jY_j^\top) = \mathrm{tr}\left(\sum_i c_iY_iY_i^\top\right)^2 = \left\|\sum_i c_iY_iY_i^\top\right\|_F^2 \geq 0$.

The Projection kernel has a very simple form and requires only $O(Dm)$ multiplications to evaluate. It is the main kernel I propose to use for subspace-based learning.

5.2.2 Binet-Cauchy kernel

The Binet-Cauchy distance can also be understood via an embedding. Let $s$ be a subset of $\{1, \ldots, D\}$ with $m$ elements, $s = \{r_1, \ldots, r_m\}$, and let $Y^{(s)}$ be the $m \times m$ matrix whose rows are the $r_1, \ldots, r_m$-th rows of $Y$. If $s_1, s_2, \ldots, s_n$ are all such choices of the subset, ordered lexicographically, then the Binet-Cauchy embedding is defined as

$\Psi : G(m, D) \to \mathbb{R}^n$,   $\mathrm{span}(Y) \mapsto \left(\det Y^{(s_1)}, \ldots, \det Y^{(s_n)}\right)$,  (5.5)

where $n = \binom{D}{m}$ is the number of ways of choosing $m$ rows out of $D$ rows. It is also an isometry from $(\mathcal{G}, d_{BC})$ to $(\mathbb{R}^n, \|\cdot\|_2)$. The natural inner product in $\mathbb{R}^n$ is the dot product of the two vectors, $\sum_{r=1}^n \det Y_1^{(s_r)} \det Y_2^{(s_r)}$, which provides us with the definition of the Binet-Cauchy kernel.

Theorem 5.4. The Binet-Cauchy kernel

$k_{BC}(Y_1, Y_2) = (\det Y_1^\top Y_2)^2 = \det(Y_1^\top Y_2Y_2^\top Y_1)$  (5.6)

is a Grassmann kernel.

Proof. First, the kernel is well-defined because $k_{BC}(Y_1, Y_2) = k_{BC}(Y_1R_1, Y_2R_2)$ for any $R_1, R_2 \in O(m)$. To show that $k_{BC}$ is positive definite, it suffices to show that $k(Y_1, Y_2) = \det Y_1^\top Y_2$ is positive definite. From the Binet-Cauchy identity [38, 90, 83], we have $\det Y_1^\top Y_2 = \sum_s \det Y_1^{(s)} \det Y_2^{(s)}$. Therefore, for all $Y_1, \ldots, Y_n$ ($Y_i \in \mathcal{G}$) and $c_1, \ldots, c_n$ ($c_i \in \mathbb{R}$) for any $n \in \mathbb{N}$,

$\sum_{ij} c_ic_j \det Y_i^\top Y_j = \sum_{ij} c_ic_j \sum_s \det Y_i^{(s)} \det Y_j^{(s)} = \sum_s \sum_{ij} c_ic_j \det Y_i^{(s)} \det Y_j^{(s)} = \sum_s \left(\sum_i c_i \det Y_i^{(s)}\right)^2 \geq 0$.

Some other forms of the Binet-Cauchy kernel have also appeared in the literature. Note that although $\det Y_1^\top Y_2$ is also a Grassmann kernel, we prefer $k_{BC}(Y_1, Y_2) = \det(Y_1^\top Y_2)^2$. The reason is that the latter is directly related to the principal angles by $\det(Y_1^\top Y_2)^2 = \prod_i \cos^2\theta_i$ and therefore admits geometric interpretations, whereas the former cannot be written directly in terms of the principal angles. That is, $\det Y_1^\top Y_2 \neq \prod_i \cos\theta_i$ in general.² Another variant, $\arcsin k_{BC}(Y_1, Y_2)$, is also a positive definite kernel,³ and its induced metric $d = \left(\arccos(\det Y_1^\top Y_2)\right)^{1/2}$ is a conditionally positive definite metric.

² For example, $\det Y_1^\top Y_2$ can be negative, whereas $\prod_i \cos\theta_i$, a product of singular values, is nonnegative by definition.

³ From Theorems 4.18 and 4.19 of [69].
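Both Grassmann kernels are one-liners given orthonormal bases, and their positive definiteness can be checked numerically on random subspaces by inspecting the eigenvalues of the resulting Gram matrix. The sketch below does exactly that; the helper names are illustrative.

```python
import numpy as np

def projection_kernel(Y1, Y2):
    """k_Proj of (5.4)."""
    return np.linalg.norm(Y1.T @ Y2, 'fro')**2

def binet_cauchy_kernel(Y1, Y2):
    """k_BC of (5.6)."""
    return np.linalg.det(Y1.T @ Y2)**2

def gram_matrix(bases, kernel):
    return np.array([[kernel(Yi, Yj) for Yj in bases] for Yi in bases])

# Sanity check: Gram matrices on random subspaces should be positive semidefinite.
rng = np.random.default_rng(0)
bases = [np.linalg.qr(rng.standard_normal((10, 3)))[0] for _ in range(20)]
for kernel in (projection_kernel, binet_cauchy_kernel):
    K = gram_matrix(bases, kernel)
    assert np.min(np.linalg.eigvalsh(K)) > -1e-8
```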

5.2.3 Indefinite kernels from other metrics

Since the Projection distance and the Binet-Cauchy distance are derived from positive definite kernels, we have all the kernel-based algorithms for Hilbert spaces at our disposal. In contrast, the other distances in the previous chapter are not associated with Grassmann kernels and can only be used with less powerful algorithms.

Showing that a distance is not associated with any kernel directly is not as easy as showing the opposite, that there is a kernel. However, Theorem 2.12 can be used to make the task easier: a metric $d$ is induced from a positive definite kernel if and only if

$\hat{k}(x_1, x_2) = -d^2(x_1, x_2)/2$,   $x_1, x_2 \in \mathcal{X}$  (5.7)

is conditionally positive definite. The theorem allows us to show a metric's non-positive definiteness by constructing an indefinite kernel matrix from (5.7) as a counterexample. There have been efforts to use indefinite kernels for learning [59, 31], and several heuristics have been proposed to modify an indefinite kernel matrix into a positive definite matrix [60]. However, I do not advocate the use of these heuristics, since they change the geometry of the original data.

5.2.4 Extension to nonlinear subspaces

Linear subspaces in the original space can be generalized to nonlinear subspaces by considering linear subspaces in an RKHS, a trick that has been used successfully in kernel PCA [68].⁴ In [90, 85] the trick is shown to be applicable to the computation of the principal angles, called the kernel principal angles. Wolf and Shashua, in particular, use the trick to compute the Binet-Cauchy kernel.

⁴ A nonlinear subspace is an oxymoron. Technically speaking, it is the preimage of a linear subspace in the RKHS.

Note that these two kernels have different

roles and need to be distinguished. An illustration of this doubly kernel method is given in Figure 5.1.

Figure 5.1: Doubly kernel method. The first kernel implicitly maps the two nonlinear subspaces $X_i$ and $X_j$ to $\mathrm{span}(Y_i)$ and $\mathrm{span}(Y_j)$ via the map $\Phi : \mathcal{X} \to \mathcal{H}_1$, where the nonlinear subspace means the preimage $X_i = \Phi^{-1}(\mathrm{span}(Y_i))$ and $X_j = \Phi^{-1}(\mathrm{span}(Y_j))$. The second (Grassmann) kernel maps the points $Y_i$ and $Y_j$ on the Grassmann manifold $G(m, D)$ to the corresponding points in $\mathcal{H}_2$ via a map $\Psi : G(m, D) \to \mathcal{H}_2$ such as (5.3) or (5.5).

The key point of the trick is that the principal angles between two subspaces in the RKHS can be derived from the inner products of vectors in the original space alone.⁵ Furthermore, the orthonormalization procedure in the feature space also requires only the inner products of vectors. Below is a summary of the procedures in [90].

⁵ A similar idea is also used to define probability distributions in the feature space [46, 96], and will be explained in the next chapter.

1. Let $X_i = \{x_1^i, \ldots, x_{N_i}^i\}$ be the $i$-th set of data and $\Phi_i = [\phi(x_1^i), \ldots, \phi(x_{N_i}^i)]$ be the image matrix of $X_i$ in the feature space implicitly defined by a kernel function $k$,

e.g., a Gaussian RBF kernel.

2. The orthonormal basis $Y_i$ of $\mathrm{span}(\Phi_i)$ is then computed from the Gram-Schmidt process in the RKHS: $\Phi_i = Y_iR_i$.

3. Finally, the product $Y_i^\top Y_j$ in the feature space, used to define the Binet-Cauchy kernel for example, is computed from the original data by

$Y_i^\top Y_j = (R_i^{-1})^\top \Phi_i^\top \Phi_j R_j^{-1} = (R_i^{-1})^\top [k(x_k^i, x_l^j)]_{kl}\, R_j^{-1}$.

Although this extension has been used to improve classification tasks on a few small databases [90], I will not use the extension in this thesis, for the following reasons. First, the databases I use already have theoretical grounds for being linear subspaces, and we want to verify the linear subspace models. Second, the advantage of kernel tricks in general is most pronounced when the ambient space $\mathbb{R}^D$ has a relatively small dimension $D$ compared to the number of data samples $N$. This is obviously not the case with the data used in the thesis. Further experiments with the nonlinear extension will be carried out in the future.
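A sketch of step 3 is given below. It uses the fact that the triangular factor $R_i$ of the Gram-Schmidt decomposition $\Phi_i = Y_iR_i$ satisfies $K_{ii} = \Phi_i^\top\Phi_i = R_i^\top R_i$, so $R_i$ can be obtained as a Cholesky factor of the Gram matrix. This is an equivalent reformulation rather than the implementation used in [90], and the small diagonal jitter is an added numerical safeguard.

```python
import numpy as np

def feature_space_product(Xi, Xj, k):
    """Compute Y_i^T Y_j for the subspaces spanned by phi(Xi) and phi(Xj) in the RKHS.

    Xi, Xj : lists/arrays of data points; k(a, b) is the scalar kernel function.
    Uses Phi_i = Y_i R_i with K_ii = R_i^T R_i (R_i is the upper-triangular Cholesky factor).
    """
    gram = lambda A, B: np.array([[k(a, b) for b in B] for a in A])
    Ri = np.linalg.cholesky(gram(Xi, Xi) + 1e-10 * np.eye(len(Xi))).T
    Rj = np.linalg.cholesky(gram(Xj, Xj) + 1e-10 * np.eye(len(Xj))).T
    Kij = gram(Xi, Xj)
    tmp = np.linalg.solve(Ri.T, Kij)          # R_i^{-T} K_ij
    return np.linalg.solve(Rj.T, tmp.T).T     # (R_i^{-T} K_ij) R_j^{-1}
```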

5.3 Experiments with synthetic data

In this section I demonstrate the application of the Grassmann kernels to a two-class classification problem with Support Vector Machines (SVMs). Using synthetic data, I will compare the classification performance of linear/nonlinear SVMs in the original space with the performance of SVMs in the Grassmann space. The advantages of the subspace approach over the conventional Euclidean approach for classification problems will be discussed.

Figure 5.2: A two-dimensional subspace is represented by a triangular patch swept by two basis vectors. The positive and negative classes are color-coded by blue and red respectively. A: the two class centers $Y_+$ and $Y_-$ around which the other subspaces are randomly generated. B-D: examples of randomly selected subspaces for the easy, intermediate, and difficult datasets.

5.3.1 Synthetic data

I generate three types of datasets: "easy", "intermediate", and "difficult"; these datasets differ in the amount of noise in the data. For each type of data, I generate $N = 100$ subspaces in a $D = 6$ dimensional Euclidean space, where each subspace is $m = 2$ dimensional. To generate two-class data, I

first define two exemplar subspaces spanned by two fixed orthonormal bases $Y_+$ and $Y_-$ (two $6 \times 2$ matrices). The $Y_+$ and $Y_-$ serve as the positive and the negative class centers respectively, and the corresponding subspaces $\mathrm{span}(Y_+)$ and $\mathrm{span}(Y_-)$ have the principal angles $\theta_1 = \theta_2 = \arccos(1/3)$. The other subspaces $Y_i$ are generated by adding a Gaussian random matrix $M_i$ to the basis $Y_+$ or $Y_-$, and then applying the SVD to compute the new perturbed bases:

$Y_i = U$, where $U\Sigma V^\top = Y_+ + M_i$ for $i = 1, \ldots, N/2$, and $U\Sigma V^\top = Y_- + M_i$ for $i = N/2 + 1, \ldots, N$,

where the elements of the matrix $M_i$ are independent Gaussian variables, $[M_i]_{jk} \sim \mathcal{N}(0, s^2)$. The standard deviation $s$ controls the amount of noise; $s$ is chosen to be 0.2, 0.3, and 0.4 for the easy, intermediate, and difficult datasets respectively. Figure 5.2 shows examples of the subspaces for the three datasets. Note that the subspaces become more cluttered and the class boundary becomes more irregular as $s$ increases.
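The generation procedure can be reproduced with a few lines of NumPy. The class-center bases below are random orthonormal placeholders standing in for the fixed $Y_+$ and $Y_-$ of the thesis (whose exact entries are not reproduced here); everything else follows the perturb-and-re-orthonormalize recipe above.

```python
import numpy as np

def perturbed_subspace(Y_center, s, rng):
    """Perturb a class-center basis and re-orthonormalize via the SVD (Section 5.3.1)."""
    M = s * rng.standard_normal(Y_center.shape)            # [M]_jk ~ N(0, s^2)
    U, _, _ = np.linalg.svd(Y_center + M, full_matrices=False)
    return U                                               # perturbed orthonormal basis Y_i

rng = np.random.default_rng(0)
D, m, N, s = 6, 2, 100, 0.3                                # s = 0.3: the "intermediate" dataset

# Placeholder class centers (random here; the thesis uses two fixed bases with
# principal angles arccos(1/3) between them).
Y_plus, _ = np.linalg.qr(rng.standard_normal((D, m)))
Y_minus, _ = np.linalg.qr(rng.standard_normal((D, m)))

subspaces = [perturbed_subspace(Y_plus if i < N // 2 else Y_minus, s, rng) for i in range(N)]
labels = np.array([+1] * (N // 2) + [-1] * (N // 2))
```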

5.3.2 Algorithms

I compare the performance of the Euclidean SVMs with linear/polynomial/RBF kernels and the performance of the SVMs with Grassmann kernels. To test the Euclidean SVMs, I randomly sample $n = 50$ points from each subspace from a Gaussian distribution.

There is an immediate handicap with a linear classifier in the original data space. Each subspace is symmetric with respect to the origin, that is, if $x$ is a point on a subspace, then $-x$ is also on the subspace. As a result, any hyperplane either 1) contains a subspace or 2) halves a subspace into two parts and yields a 50 percent classification rate, which is useless. Therefore, if data lie on subspaces without further restrictions, a linear classifier (with a zero bias) always fails to classify subspaces. To alleviate the problem with the Euclidean algorithms, I sample points from the intersection of the subspaces and the halfspace $\{(x_1, \ldots, x_6) \in \mathbb{R}^6 \mid x_1 > 0\}$. To test the Grassmann SVMs, I first estimate the basis $Y_i$ from the SVD of the same sampled points used for the Euclidean SVMs, and then evaluate the Grassmann kernel functions. The following five kernels are compared:

1. Euclidean SVM with the linear kernel: $k(x_1, x_2) = \langle x_1, x_2\rangle$

2. Euclidean SVM with the polynomial kernel: $k(x_1, x_2) = (\langle x_1, x_2\rangle + 1)^p$

3. Euclidean SVM with the Gaussian RBF kernel: $k(x_1, x_2) = \exp(-\frac{1}{2r^2}\|x_1 - x_2\|^2)$. The radius $r$ is chosen to be one-fifth of the diameter of the data: $r = 0.2 \max_{ij} \|x_i - x_j\|$

4. Grassmannian SVM with the Projection kernel: $k(Y_1, Y_2) = \|Y_1^\top Y_2\|_F^2$

5. Grassmannian SVM with the Binet-Cauchy kernel: $k(Y_1, Y_2) = (\det Y_1^\top Y_2)^2$

For the Euclidean SVMs, I use the public-domain software SVM-light [42] with default parameters. For the Grassmann SVMs, I use a Matlab code with a nonnegative QP solver. I evaluate the algorithms with the leave-one-out test, by holding out one subspace and training with the other $N - 1$ subspaces.

5.3.3 Results and discussion

Table 5.1 shows the classification rates of the Euclidean SVMs and the Grassmann SVMs, averaged over 10 independent trials.

Table 5.1: Classification rates of the Euclidean SVMs (linear, polynomial, RBF) and the Grassmannian SVMs (Projection and Binet-Cauchy kernels) for the easy, intermediate, and difficult datasets. The best rate for each dataset is highlighted in boldface.

The results show that the Grassmann SVM with the

Projection kernel outperforms the Euclidean SVMs. The Grassmann SVM with the Binet-Cauchy kernel is a close second. The polynomial and RBF kernels perform about equally, and better than the linear kernel, but not as well as the Grassmann kernels. The overall classification rates decrease as the data become more difficult to separate.

The Grassmann kernels achieve better results for two main reasons. First, when the data are highly cluttered as shown in Figure 5.2, the geometric prior of the subspace structure can disambiguate points close to each other that the Euclidean distance cannot distinguish well. Second, the Grassmann approach implicitly maps the data from the original $D$-dimensional space to a higher-dimensional ($m(D - m)$-dimensional) space where separating the subspaces becomes easier.

In addition to having a superior classification performance with subspace-structured data, the Grassmann kernel method has a smaller computational cost. In the experiment above, for example, the Euclidean approach uses a kernel matrix of size $5000 \times 5000$, whereas the Grassmann approach uses a kernel matrix of size $100 \times 100$, which is $n = 50$ times smaller than the Euclidean kernel matrix.
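Experiments of this kind are easy to reproduce with any SVM package that accepts precomputed kernel matrices. The sketch below uses scikit-learn's SVC with kernel='precomputed' instead of SVM-light or the Matlab QP solver used in the thesis, and assumes a Grassmann kernel function such as the projection_kernel sketched earlier.

```python
import numpy as np
from sklearn.svm import SVC

def grassmann_svm_predict(train_bases, y_train, test_bases, kernel):
    """Two-class SVM on subspace data with a precomputed Grassmann kernel."""
    K_train = np.array([[kernel(a, b) for b in train_bases] for a in train_bases])
    K_test = np.array([[kernel(a, b) for b in train_bases] for a in test_bases])
    clf = SVC(kernel='precomputed')
    clf.fit(K_train, y_train)                 # rows/columns indexed by training subspaces
    return clf.predict(K_test)                # K_test has shape (n_test, n_train)
```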

5.4 Discriminant Analysis of subspaces

In this section I introduce a discriminant analysis method on the Grassmann manifold, and compare this method with other previously known discriminant techniques for subspaces. Since the image databases in Chapter 3 are highly multiclass⁶ and lie in a high-dimensional space, I propose to use the discriminant analysis technique to reduce dimensionality and extract features of the subspace data.

⁶ $N_c = 38$, 68, 8, and 11 for the Yale Face, CMU-PIE, ETH-80, and IXMAS databases respectively.

5.4.1 Grassmann Discriminant Analysis

It is straightforward to use the Projection and the Binet-Cauchy kernels with the Kernel FDA method introduced in Chapter 2. Recall that the cost function of Kernel FDA is as follows:

$J(\alpha) = \frac{\alpha^\top \Phi^\top S_B \Phi\,\alpha}{\alpha^\top \Phi^\top S_W \Phi\,\alpha} = \frac{\alpha^\top K(V - \mathbf{1}_N\mathbf{1}_N^\top/N)K\,\alpha}{\alpha^\top \left(K(I_N - V)K + \sigma^2 I_N\right)\alpha}$,  (5.8)

where $K$ is the kernel matrix, $\sigma$ is a regularization parameter, and the other terms are fixed matrices. Since the method is already explained in detail there, I only present a summary of the procedure below.

Assume that the $D$ by $m$ orthonormal bases $\{Y_i\}$ are already computed and given.

Training:
1. Compute the matrix $[K_{train}]_{ij} = k_{Proj}(Y_i, Y_j)$ or $k_{BC}(Y_i, Y_j)$ for all $Y_i, Y_j$ in the training set.
2. Solve $\max_\alpha J(\alpha)$ in (5.8) by eigen-decomposition.
3. Compute the $(N_c - 1)$-dimensional coefficients $F_{train} = \alpha^\top K_{train}$.

Testing:
1. Compute the matrix $[K_{test}]_{ij} = k_{Proj}(Y_i, Y_j)$ or $k_{BC}(Y_i, Y_j)$ for all $Y_i$ in the training set and $Y_j$ in the test set.
2. Compute the $(N_c - 1)$-dimensional coefficients $F_{test} = \alpha^\top K_{test}$.
3. Perform 1-NN classification from the Euclidean distance between $F_{train}$ and $F_{test}$.

I call this method the Grassmann Discriminant Analysis to differentiate it from the other discriminant methods for subspaces, which I review in the following sections.

5.4.2 Mutual Subspace Method (MSM)

The original MSM [92] performs simple 1-NN classification with $d_{Max}$ (the max correlation distance) and no feature extraction. The method can be extended to any distance described in the thesis. Although there have been attempts to use kernels with MSM [64], the kernel is used only to represent the data in the original space, and the MSM algorithm is still a 1-NN classification.
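To make the 1-NN rule concrete, the sketch below scores a query subspace against stored bases by the largest canonical correlation, i.e. the cosine of the smallest principal angle, which is one common reading of the max-correlation criterion; the exact definition of $d_{Max}$ is the one given in Chapter 4, and the function names here are mine.

```python
import numpy as np

def max_canonical_correlation(Y1, Y2):
    """Largest canonical correlation (cosine of the smallest principal angle)
    between span(Y1) and span(Y2), for orthonormal D x m bases."""
    s = np.linalg.svd(Y1.T @ Y2, compute_uv=False)
    return s.max()

def msm_classify(Y_query, train_bases, train_labels):
    """1-NN classification by maximal canonical correlation (no feature extraction)."""
    sims = [max_canonical_correlation(Y_query, Y) for Y in train_bases]
    return train_labels[int(np.argmax(sims))]
```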

5.4.3 Constrained MSM (cMSM)

Constrained MSM [24] is a technique that applies dimensionality reduction to the bases of the subspaces in the original space. Let $G = \sum_i Y_i Y_i^\top$ be the sum of the projection matrices of the data, and let $\{v_1, \dots, v_D\}$ be the eigenvectors corresponding to the eigenvalues $\{\lambda_1, \dots, \lambda_D\}$ of $G$. The authors of [24] claim that the first few eigenvectors $v_1, \dots, v_d$ of $G$ are more discriminative than the later eigenvectors, and suggest projecting the basis vectors of each subspace $Y_i$ onto $\mathrm{span}(v_1, \dots, v_d)$, followed by normalization. However, these procedures lack justification, as well as a clear criterion for choosing the dimension $d$, on which the result crucially depends in our experience.

5.4.4 Discriminant Analysis of Canonical Correlations (DCC)

The Discriminant Analysis of Canonical Correlations [43] can be understood as a nonparametric version of linear discriminant analysis using the Procrustes distance (4.11). The algorithm finds the discriminating direction $w$ which maximizes the ratio $L(w) = w^\top S_b w / w^\top S_w w$, where $S_b$ and $S_w$ are the nonparametric between-class and within-class covariance matrices from Section 2.4.2:
$$S_b = \sum_i \sum_{j \in B_i} (Y_i U - Y_j V)(Y_i U - Y_j V)^\top, \qquad S_w = \sum_i \sum_{j \in W_i} (Y_i U - Y_j V)(Y_i U - Y_j V)^\top,$$
where $U$ and $V$ are from (4.2). Recall that $\mathrm{tr}\!\left[(Y_i U - Y_j V)(Y_i U - Y_j V)^\top\right] = \|Y_i U - Y_j V\|_F^2$ is the squared Procrustes distance. However, unlike my method, $S_b$ and $S_w$ do not admit a geometric interpretation as true covariance matrices, nor can they be kernelized directly. Another disadvantage of the DCC is the difficulty of its optimization. The algorithm iterates the two stages of 1) maximizing the ratio $L(w)$ and 2) computing $S_b$ and $S_w$, which results in a computational overhead and weak theoretical support for global convergence.

5.5 Experiments with real-world data

In this section I test the Grassmann Discriminant Analysis with the Yale Face, CMU-PIE, ETH-80, and IXMAS databases, and compare its performance with those of other algorithms.

Algorithms

The following is the list of algorithms used in the test.

1. Baseline: Euclidean FDA
2. Grassmann Discriminant Analysis:
   - GDA1 (Projection kernel + kernel FDA)
   - GDA2 (Binet-Cauchy kernel + kernel FDA)
   For GDA1 and GDA2, the optimal values of $\sigma$ are found by scanning through a range of values. The results do not seem to vary much as long as $\sigma$ is small enough.
3. Others:
   - MSM (max corr)
   - cMSM (PCA + max corr)
   - DCC (NDA + Procrustes dist)
   For cMSM and DCC, the optimal dimension $d$ is found by exhaustive search. For DCC, we have used two nearest neighbors for the sets $B_i$ and $W_i$. However, increasing the number of nearest neighbors does not change the results very much, as was observed in [43]. In DCC the two-stage optimization is iterated 5 times.

I evaluate the algorithms with cross validation as explained earlier.

Results and discussion

Figures 5.3 - 5.6 show the classification rates. I can summarize the results as follows:

1. GDA1 shows significantly better performance than all the other algorithms for all datasets. However, the difference is less pronounced in the Yale Face database, where the other discriminant algorithms also performed well.
2. The overall rates are roughly in the order (GDA1 > cMSM > DCC > others). These three algorithms consistently outperform the baseline method, whereas GDA2 and MSM occasionally lag behind the baseline.
3. With the exception of the IXMAS database, the rates of GDA1, MSM, cMSM, and DCC remain relatively constant as the subspace dimension $m$ increases. For IXMAS, the rates seem to increase gradually with $m$ over the given range.
4. GDA2 performs poorly in general and degrades fast as $m$ increases. This can be ascribed to the properties of the Binet-Cauchy distance explained in Chapter 4. Due to its product form, the kernel matrix tends to the identity matrix as the subspace dimension increases, which was also checked empirically on the data.

5.6 Conclusion

In this chapter I defined the Grassmann kernels for subspace-based learning, and showed constructions of the Projection kernel and the Binet-Cauchy kernel via isometric embeddings. Although the embeddings can be used explicitly to represent a subspace as a $D \times D$ projection matrix or a $\binom{D}{m} \times 1$ vector, as in [3], the equivalent kernel representations are preferred due to the storage and computation requirements of the explicit representations.

To demonstrate the potential advantages of the Grassmann kernels, I applied the kernel discriminant analysis algorithm to image databases represented as collections of subspaces. Despite its surprisingly simple form and usage, the proposed method with the Projection kernel outperformed the other state-of-the-art discriminant methods on the real data. However, the Binet-Cauchy kernel, when used in its naive form, is shown to be of limited value for subspace-based learning problems. There are possibly other Grassmann kernels which are not derived from the two representative kernels, and discovering them is left as future work.

[Figure 5.3: Yale Face Database: face recognition rates from various discriminant analysis methods (FDA (Eucl), GDA (Proj), GDA (BC), MSM, cMSM, DCC) plotted against the subspace dimension m = 1, ..., 9. The two highest rates, including ties, are highlighted with boldface for each subspace dimension m.]

[Figure 5.4: CMU-PIE Database: face recognition rates from various discriminant analysis methods (FDA (Eucl), GDA (Proj), GDA (BC), MSM, cMSM, DCC) plotted against the subspace dimension m = 1, ..., 9.]

[Figure 5.5: ETH-80 Database: object categorization rates from various discriminant analysis methods (FDA (Eucl), GDA (Proj), GDA (BC), MSM, cMSM, DCC) plotted against the subspace dimension m = 1, ..., 9.]

[Figure 5.6: IXMAS Database: action recognition rates from various discriminant analysis methods (FDA (Eucl), GDA (Proj), GDA (BC), MSM, cMSM, DCC) plotted against the subspace dimension m = 1, ..., 5.]

Chapter 6

EXTENDED GRASSMANN KERNELS AND PROBABILISTIC DISTANCES

6.1 Introduction

So far I have modeled the data as a set of linear subspaces. To relax this geometric assumption, let us take a step back and adopt a probabilistic view of the data. Suppose a set of vectors consists of i.i.d. samples from an arbitrary probability distribution. Then it is possible to compare two such distributions of vectors with probabilistic similarity measures, such as the KL distance$^1$ [47], the Chernoff distance [15], or the Bhattacharyya/Hellinger distance [10], to name a few [70, 40, 46, 96]. Furthermore, the Bhattacharyya affinity is in fact a positive definite kernel function on the space of distributions and has nice closed-form expressions for the exponential family [40].

In this chapter, I investigate the relationship between the Grassmann kernels and the probabilistic distances. The link is provided by the probabilistic generalization of subspaces with a Factor Analyzer [22], which is a Gaussian distribution that resembles a pancake.

$^1$ By distance I mean any nonnegative measure of similarity, not necessarily a metric.

The first result I show is that the KL distance reduces to the Projection kernel under the Factor Analyzer model, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on my analysis of the KL distance, I propose an extension of the Projection kernel, which is originally confined to the set of linear subspaces, to the set of affine as well as scaled subspaces. For this I introduce the affine Grassmann manifold and its kernels. I demonstrate the extended kernels with Support Vector Machines and Kernel Discriminant Analysis on synthetic and real image databases. The experiments show the advantages of the extended kernels over the Bhattacharyya and the Binet-Cauchy kernels.

6.2 Analysis of probabilistic distances and kernels

In this section I introduce several well-known probabilistic distances, and establish their relationships with the Grassmann distances and kernels.

Probabilistic distances and kernels

Various probabilistic distances between distributions have been proposed in the literature. Some of them yield closed-form expressions for the exponential family and are convenient for analysis. Below is a short list of those distances.

KL distance:
$$J(p_1, p_2) = \int p_1(x) \log \frac{p_1(x)}{p_2(x)}\, dx \qquad (6.1)$$
The KL distance is probably the most frequently used distance in learning problems. It is sometimes called the relative entropy and plays a fundamental role in information theory.

KL distance (symmetric):
$$J_{KL}(p_1, p_2) = \int \left[ p_1(x) - p_2(x) \right] \log \frac{p_1(x)}{p_2(x)}\, dx \qquad (6.2)$$
Since the original KL distance is asymmetric, this symmetrized version is often used instead. I exclusively use the symmetric version in this chapter. This distance is still not a valid metric.

Chernoff distance:
$$J_{Cher}(p_1, p_2) = -\log \int p_1^{\alpha_1}(x)\, p_2^{\alpha_2}(x)\, dx, \qquad (\alpha_1 + \alpha_2 = 1,\ \alpha_1, \alpha_2 > 0) \qquad (6.3)$$
The Chernoff distance is asymmetric. The symmetric version of the distance with $\alpha_1 = \alpha_2 = 1/2$ is known as the Bhattacharyya distance:
$$J_{Bhat}(p_1, p_2) = -\log \int \left[ p_1(x)\, p_2(x) \right]^{1/2} dx \qquad (6.4)$$

Hellinger distance:
$$J_{Hel}(p_1, p_2) = \int \left( \sqrt{p_1(x)} - \sqrt{p_2(x)} \right)^2 dx \qquad (6.5)$$
The Hellinger distance is directly related to the Bhattacharyya distance by $J_{Hel} = 2(1 - \exp(-J_{Bhat}))$.

One can also define similarity measures instead of the dissimilarity measures above. Jebara and Kondor [40] proposed the Probability Product kernel
$$k_{Prob}(p_1, p_2) = \int p_1^{\alpha}(x)\, p_2^{\alpha}(x)\, dx, \qquad (\alpha > 0). \qquad (6.6)$$

By construction, this kernel is positive definite on the space of normalized probability distributions [40]. It includes the Bhattacharyya and the Expected Likelihood kernels as special cases:

Bhattacharyya kernel ($\alpha = 1/2$):
$$k_{Bhat}(p_1, p_2) = \int \left[ p_1(x)\, p_2(x) \right]^{1/2} dx \qquad (6.7)$$

Expected Likelihood kernel ($\alpha = 1$):
$$k_{EL}(p_1, p_2) = \int p_1(x)\, p_2(x)\, dx \qquad (6.8)$$

The probabilistic distances are closely related to each other. For example, the Hellinger distance forms a bound on the KL distance [77], and the Bhattacharyya distance and the KL distance are both instances of the Rényi divergence [63]. However, the behaviors of the distances are quite different under my data model. I examine the KL distance and the Probability Product kernel in particular.

Data as Mixture of Factor Analyzers

The probabilistic distances in the previous section are not restricted to specific distributions. However, I will model the data distribution as a Mixture of Factor Analyzers (MFA) [27]. If we have $i = 1, \dots, N$ sets in the data, then each set is considered as i.i.d. samples from the $i$-th Factor Analyzer
$$x \sim p_i(x) = \mathcal{N}(u_i, C_i), \qquad C_i = Y_i Y_i^\top + \sigma^2 I_D, \qquad (6.9)$$
where $u_i \in \mathbb{R}^D$ is the mean, $Y_i$ is a full-rank $D \times m$ matrix ($D > m$), and $\sigma$ is the ambient noise level.
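In practice the parameters $(u_i, Y_i, \sigma)$ of each factor analyzer have to be estimated from the vectors in a set. A minimal sketch of one common choice, the closed-form probabilistic-PCA-style estimate, is shown below; the function name and the estimator itself are my assumptions and not necessarily what the thesis uses.

```python
import numpy as np

def fit_factor_analyzer(X, m):
    """Fit x ~ N(u, Y Y^T + sigma^2 I) to a D x n sample matrix X, assuming m <= min(D, n),
    using closed-form probabilistic-PCA estimates (an assumption of this sketch)."""
    u = X.mean(axis=1, keepdims=True)
    Xc = X - u
    n = X.shape[1]
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    lam = s**2 / n                                        # eigenvalues of the sample covariance
    sigma2 = lam[m:].mean() if lam.size > m else 0.0      # average residual variance
    Y = U[:, :m] * np.sqrt(np.maximum(lam[:m] - sigma2, 0.0))
    return u.ravel(), Y, sigma2
```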

[Figure 6.1: Grassmann manifold as a Mixture of Factor Analyzers. The Grassmann manifold (left), the set of linear subspaces, can alternatively be modeled as the set of flat ($\sigma \to 0$) spheres ($Y_i^\top Y_i = I_m$) intersecting at the origin ($u_i = 0$). The right figure shows a general Mixture of Factor Analyzers which is not bound by these conditions.]

The factor analyzer model is a practical substitute for a Gaussian distribution when the dimensionality $D$ of the images is greater than the number of samples $n$ in a set, in which case it is impossible to estimate the full covariance $C$, let alone invert it. More importantly, I use the factor analyzer distribution to provide the link between the Grassmann manifold and the space of probability distributions. In fact, a linear subspace can be considered as the flattened ($\sigma \to 0$) limit of a zero-mean ($u_i = 0$), homogeneous ($Y_i^\top Y_i = I_m$) factor analyzer distribution, as depicted in Figure 6.1.

Some linear algebra

Let us summarize some linear algebraic shortcuts used to analyze the distances. The matrix inversion lemma will be used several times. For $\sigma > 0$, we have the identity
$$C_i^{-1} = (Y_i Y_i^\top + \sigma^2 I)^{-1} = \sigma^{-2}\left( I - Y_i (\sigma^2 I + Y_i^\top Y_i)^{-1} Y_i^\top \right).$$

Let $M_1$ and $M_2$ be the $m \times m$ matrices
$$M_1 = (\sigma^2 I_m + Y_1^\top Y_1)^{-1}, \qquad M_2 = (\sigma^2 I_m + Y_2^\top Y_2)^{-1},$$
and let $\tilde{Y}_1$ and $\tilde{Y}_2$ be the matrices
$$\tilde{Y}_1 = Y_1 M_1^{1/2}, \qquad \tilde{Y}_2 = Y_2 M_2^{1/2}.$$
From the identity we can compute the following:
$$C_1^{-1} C_2 + C_2^{-1} C_1 = (\sigma^2 I_D + Y_1 Y_1^\top)^{-1}(\sigma^2 I_D + Y_2 Y_2^\top) + (\sigma^2 I_D + Y_2 Y_2^\top)^{-1}(\sigma^2 I_D + Y_1 Y_1^\top)$$
$$= \sigma^{-2}(I_D - \tilde{Y}_1 \tilde{Y}_1^\top)(\sigma^2 I_D + Y_2 Y_2^\top) + \sigma^{-2}(I_D - \tilde{Y}_2 \tilde{Y}_2^\top)(\sigma^2 I_D + Y_1 Y_1^\top)$$
$$= 2 I_D - \tilde{Y}_1 \tilde{Y}_1^\top - \tilde{Y}_2 \tilde{Y}_2^\top + \sigma^{-2}\left( Y_1 Y_1^\top + Y_2 Y_2^\top - \tilde{Y}_1 \tilde{Y}_1^\top Y_2 Y_2^\top - \tilde{Y}_2 \tilde{Y}_2^\top Y_1 Y_1^\top \right),$$
$$(C_1 + C_2)^{-1} = (2\sigma^2 I_D + Y_1 Y_1^\top + Y_2 Y_2^\top)^{-1} = (2\sigma^2)^{-1}(I_D + Z Z^\top)^{-1} = (2\sigma^2)^{-1}\left( I_D - Z (I_{2m} + Z^\top Z)^{-1} Z^\top \right),$$
where $Z = (2\sigma^2)^{-1/2}\,[Y_1 \; Y_2]$,
$$C_1^{-1} + C_2^{-1} = \sigma^{-2}\left( 2 I_D - \tilde{Y}_1 \tilde{Y}_1^\top - \tilde{Y}_2 \tilde{Y}_2^\top \right) = 2\sigma^{-2}\left( I_D - \tilde{Z} \tilde{Z}^\top \right),$$
where $\tilde{Z} = 2^{-1/2}\,[\tilde{Y}_1 \; \tilde{Y}_2]$, and
$$(C_1^{-1} + C_2^{-1})^{-1} = \frac{\sigma^2}{2}\left( I_D - \tilde{Z} \tilde{Z}^\top \right)^{-1} = \frac{\sigma^2}{2}\left( I_D + \tilde{Z} (I_{2m} - \tilde{Z}^\top \tilde{Z})^{-1} \tilde{Z}^\top \right).$$
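These shortcuts can be checked numerically. The following sketch (my own, with arbitrary test matrices) verifies the inversion lemma for $C_1^{-1}$ and the expression for $(C_1 + C_2)^{-1}$ using only $m \times m$ and $2m \times 2m$ inverses.

```python
import numpy as np

rng = np.random.default_rng(1)
D, m, sigma = 8, 2, 0.3
Y1, Y2 = rng.standard_normal((D, m)), rng.standard_normal((D, m))
C1 = Y1 @ Y1.T + sigma**2 * np.eye(D)
C2 = Y2 @ Y2.T + sigma**2 * np.eye(D)

# Small m x m inverse instead of a D x D one.
M1 = np.linalg.inv(sigma**2 * np.eye(m) + Y1.T @ Y1)
Yt1 = Y1 @ np.linalg.cholesky(M1)          # any square root of M1 works: only Yt1 Yt1^T is used
lhs = np.linalg.inv(C1)
rhs = sigma**-2 * (np.eye(D) - Yt1 @ Yt1.T)
print(np.allclose(lhs, rhs))               # matrix inversion lemma for C1^{-1}

Z = (2 * sigma**2)**-0.5 * np.hstack([Y1, Y2])
lhs2 = np.linalg.inv(C1 + C2)
rhs2 = (2 * sigma**2)**-1 * (np.eye(D) - Z @ np.linalg.inv(np.eye(2 * m) + Z.T @ Z) @ Z.T)
print(np.allclose(lhs2, rhs2))             # identity for (C1 + C2)^{-1}
```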

Analysis of KL distance

The KL distance for Factor Analyzers is as follows:
$$J_{KL}(p_1, p_2) = \frac{1}{2}\,\mathrm{tr}\!\left( C_2^{-1} C_1 + C_1^{-1} C_2 - 2 I_D \right) + \frac{1}{2} (u_1 - u_2)^\top (C_1^{-1} + C_2^{-1}) (u_1 - u_2)$$
$$= -\frac{1}{2}\,\mathrm{tr}\!\left( \tilde{Y}_1^\top \tilde{Y}_1 + \tilde{Y}_2^\top \tilde{Y}_2 \right) + \frac{\sigma^{-2}}{2}\,\mathrm{tr}\!\left( Y_1^\top Y_1 + Y_2^\top Y_2 - \tilde{Y}_1^\top Y_2 Y_2^\top \tilde{Y}_1 - \tilde{Y}_2^\top Y_1 Y_1^\top \tilde{Y}_2 \right)$$
$$+ \frac{\sigma^{-2}}{2} (u_1 - u_2)^\top \left( 2 I_D - \tilde{Y}_1 \tilde{Y}_1^\top - \tilde{Y}_2 \tilde{Y}_2^\top \right) (u_1 - u_2). \qquad (6.10)$$
Furthermore, we can write the distance as
$$J_{KL}(p_1, p_2) = -\frac{1}{2}\,\mathrm{tr}\!\left( \tilde{Y}_1^\top \tilde{Y}_1 + \tilde{Y}_2^\top \tilde{Y}_2 \right) + \frac{\sigma^{-2}}{2}\,\mathrm{tr}\!\left( Y_1^\top Y_1 + Y_2^\top Y_2 - \tilde{Y}_1^\top Y_2 Y_2^\top \tilde{Y}_1 - \tilde{Y}_2^\top Y_1 Y_1^\top \tilde{Y}_2 \right) + \frac{\sigma^{-2}}{2}\left( 2 u^\top u - 2 u^\top \tilde{Z} \tilde{Z}^\top u \right),$$
where $u = u_1 - u_2$. Note that the computation of the distance involves only products of the column vectors of $Y_i$ and $u_i$, so we need not handle any $D \times D$ matrix explicitly.

KL in the limit yields the projection kernel

For $u_i = 0$ and $Y_i^\top Y_i = I_m$, we have
$$J_{KL}(p_1, p_2) = -\frac{1}{2}\,\mathrm{tr}\!\left( \tilde{Y}_1^\top \tilde{Y}_1 + \tilde{Y}_2^\top \tilde{Y}_2 \right) + \frac{\sigma^{-2}}{2}\,\mathrm{tr}\!\left( Y_1^\top Y_1 + Y_2^\top Y_2 - \tilde{Y}_1^\top Y_2 Y_2^\top \tilde{Y}_1 - \tilde{Y}_2^\top Y_1 Y_1^\top \tilde{Y}_2 \right)$$
$$= -\frac{1}{2}\cdot\frac{2m}{\sigma^2 + 1} + \frac{\sigma^{-2}}{2}\left( 2m - \frac{2}{\sigma^2 + 1}\,\mathrm{tr}(Y_1^\top Y_2 Y_2^\top Y_1) \right) = \frac{1}{2\sigma^2(\sigma^2 + 1)}\left( 2m - 2\,\mathrm{tr}(Y_1^\top Y_2 Y_2^\top Y_1) \right).$$
We can ignore the multiplicative factor, which does not depend on $Y_1$ or $Y_2$, and rewrite the distance as
$$J_{KL}(p_1, p_2) \propto 2m - 2\,\mathrm{tr}(Y_1^\top Y_2 Y_2^\top Y_1).$$
One can immediately realize that this is indeed the definition of the squared Projection distance $d^2_{Proj}(Y_1, Y_2)$, up to multiplicative factors.
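A quick numerical sanity check of this limit is sketched below (my own illustration): for orthonormal bases and a small $\sigma$, the symmetric KL distance between the two factor analyzers should match the squared Projection distance scaled by $1/(2\sigma^2(\sigma^2+1))$.

```python
import numpy as np

def sym_kl_gaussian(C1, C2):
    """Symmetric KL distance between N(0, C1) and N(0, C2)."""
    iC1, iC2 = np.linalg.inv(C1), np.linalg.inv(C2)
    D = C1.shape[0]
    return 0.5 * np.trace(iC2 @ C1 + iC1 @ C2 - 2 * np.eye(D))

def orthonormal_basis(D, m, rng):
    Q, _ = np.linalg.qr(rng.standard_normal((D, m)))
    return Q

rng = np.random.default_rng(2)
D, m, sigma = 10, 3, 1e-3
Y1, Y2 = orthonormal_basis(D, m, rng), orthonormal_basis(D, m, rng)
C1 = Y1 @ Y1.T + sigma**2 * np.eye(D)
C2 = Y2 @ Y2.T + sigma**2 * np.eye(D)

d_proj_sq = 2 * m - 2 * np.trace(Y1.T @ Y2 @ Y2.T @ Y1)    # squared Projection distance
scale = 1.0 / (2 * sigma**2 * (sigma**2 + 1))
print(sym_kl_gaussian(C1, C2), scale * d_proj_sq)           # the two values should nearly coincide
```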

Analysis of Probability Product Kernel

The Probability Product kernel for Gaussian distributions is [40]
$$k_{Prob}(p_1, p_2) = (2\pi)^{(1-2\alpha)D/2} \det(C^*)^{1/2} \det(C_1)^{-\alpha/2} \det(C_2)^{-\alpha/2} \exp\!\left( -\frac{1}{2}\left( \alpha\, u_1^\top C_1^{-1} u_1 + \alpha\, u_2^\top C_2^{-1} u_2 - (u^*)^\top C^* u^* \right) \right), \qquad (6.11)$$
where $C^* = \alpha^{-1}(C_1^{-1} + C_2^{-1})^{-1}$ and $u^* = \alpha\,(C_1^{-1} u_1 + C_2^{-1} u_2)$.

To compute the determinant terms for Factor Analyzers, we use the following identity: if $A$ and $B$ are $D \times m$ matrices, then
$$\det(I_D + A B^\top) = \det(I_m + B^\top A) = \prod_{i=1}^{m} \left( 1 + \tau_i(B^\top A) \right), \qquad (6.12)$$
where $\tau_i$ is the $i$-th singular value of $B^\top A$. Using the identity we can write the following:
$$\det C_1^{-1} = \det\!\left( \sigma^{-2}(I_D - \tilde{Y}_1 \tilde{Y}_1^\top) \right) = \sigma^{-2D} \det(I_m - \tilde{Y}_1^\top \tilde{Y}_1) = \sigma^{-2D} \prod_{i=1}^{m} \left( 1 - \tau_i(\tilde{Y}_1^\top \tilde{Y}_1) \right),$$
$$\det C_2^{-1} = \det\!\left( \sigma^{-2}(I_D - \tilde{Y}_2 \tilde{Y}_2^\top) \right) = \sigma^{-2D} \det(I_m - \tilde{Y}_2^\top \tilde{Y}_2) = \sigma^{-2D} \prod_{i=1}^{m} \left( 1 - \tau_i(\tilde{Y}_2^\top \tilde{Y}_2) \right),$$
$$\det C^* = \det\!\left[ \sigma^2 (2\alpha)^{-1}\left( I_D + \tilde{Z}(I_{2m} - \tilde{Z}^\top \tilde{Z})^{-1} \tilde{Z}^\top \right) \right] = \sigma^{2D} (2\alpha)^{-D} \det\!\left( I_{2m} + (I_{2m} - \tilde{Z}^\top \tilde{Z})^{-1} \tilde{Z}^\top \tilde{Z} \right) = \sigma^{2D} (2\alpha)^{-D} \det(I_{2m} - \tilde{Z}^\top \tilde{Z})^{-1} = \sigma^{2D} (2\alpha)^{-D} \prod_{i=1}^{2m} \left( 1 - \tau_i(\tilde{Z}^\top \tilde{Z}) \right)^{-1}.$$
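The determinant identity (6.12), often attributed to Sylvester, can be checked numerically with a few lines (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
D, m = 9, 3
A, B = rng.standard_normal((D, m)), rng.standard_normal((D, m))
lhs = np.linalg.det(np.eye(D) + A @ B.T)   # D x D determinant
rhs = np.linalg.det(np.eye(m) + B.T @ A)   # equivalent m x m determinant, as in (6.12)
print(np.isclose(lhs, rhs))
```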

To compute the exponents in (6.11) we need the following:
$$C_1^{-1}\,(C_1^{-1} + C_2^{-1})^{-1}\,C_2^{-1} = C_2^{-1}\,(C_1^{-1} + C_2^{-1})^{-1}\,C_1^{-1} = (C_1 + C_2)^{-1} = (2\sigma^2)^{-1}\left( I_D - Z (I_{2m} + Z^\top Z)^{-1} Z^\top \right),$$
$$C_1^{-1}\,(C_1^{-1} + C_2^{-1})^{-1}\,C_1^{-1} = C_1^{-1} - (C_1 + C_2)^{-1} = \frac{\sigma^{-2}}{2}\left( 2 I_D - 2\tilde{Y}_1 \tilde{Y}_1^\top - I_D + Z (I_{2m} + Z^\top Z)^{-1} Z^\top \right) = \frac{\sigma^{-2}}{2}\left( I_D - 2\tilde{Y}_1 \tilde{Y}_1^\top + Z (I_{2m} + Z^\top Z)^{-1} Z^\top \right),$$
$$C_2^{-1}\,(C_1^{-1} + C_2^{-1})^{-1}\,C_2^{-1} = C_2^{-1} - (C_1 + C_2)^{-1} = \frac{\sigma^{-2}}{2}\left( 2 I_D - 2\tilde{Y}_2 \tilde{Y}_2^\top - I_D + Z (I_{2m} + Z^\top Z)^{-1} Z^\top \right) = \frac{\sigma^{-2}}{2}\left( I_D - 2\tilde{Y}_2 \tilde{Y}_2^\top + Z (I_{2m} + Z^\top Z)^{-1} Z^\top \right).$$
Plugging these results back into (6.11), we again can compute the kernel without handling any $D \times D$ matrix. For concreteness, I derive the Bhattacharyya kernel as the instance of the probability product kernel with $\alpha = 1/2$:
$$k_{Bhat}(p_1, p_2) = \det(C^*)^{1/2} \det(C_1)^{-1/4} \det(C_2)^{-1/4} \exp\!\left( -\frac{1}{4} (u_1 - u_2)^\top (C_1 + C_2)^{-1} (u_1 - u_2) \right)$$
$$= \det(I_{2m} - \tilde{Z}^\top \tilde{Z})^{-1/2}\, \det(I_m - \tilde{Y}_1^\top \tilde{Y}_1)^{1/4}\, \det(I_m - \tilde{Y}_2^\top \tilde{Y}_2)^{1/4}\, \exp\!\left( -\frac{\sigma^{-2}}{8} (u_1 - u_2)^\top \left( I_D - Z (I_{2m} + Z^\top Z)^{-1} Z^\top \right) (u_1 - u_2) \right). \qquad (6.13)$$
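As a consistency check of (6.13), the sketch below (my own; the helper names are assumptions) computes the Bhattacharyya affinity once directly from the $D \times D$ covariances and once from the factor-analyzer form, which only inverts $m \times m$ and $2m \times 2m$ matrices.

```python
import numpy as np

def bhattacharyya_direct(u1, C1, u2, C2):
    """Bhattacharyya affinity between N(u1, C1) and N(u2, C2), full D x D computation."""
    Cbar = 0.5 * (C1 + C2)
    expo = -0.125 * (u1 - u2) @ np.linalg.inv(Cbar) @ (u1 - u2)
    logdet = lambda M: np.linalg.slogdet(M)[1]
    return np.exp(expo + 0.25 * logdet(C1) + 0.25 * logdet(C2) - 0.5 * logdet(Cbar))

def bhattacharyya_fa(u1, Y1, u2, Y2, sigma):
    """Same quantity via (6.13): only m x m and 2m x 2m matrices are inverted."""
    m = Y1.shape[1]
    M1 = np.linalg.inv(sigma**2 * np.eye(m) + Y1.T @ Y1)
    M2 = np.linalg.inv(sigma**2 * np.eye(m) + Y2.T @ Y2)
    Yt1, Yt2 = Y1 @ np.linalg.cholesky(M1), Y2 @ np.linalg.cholesky(M2)
    Zt = 2**-0.5 * np.hstack([Yt1, Yt2])
    Z = (2 * sigma**2)**-0.5 * np.hstack([Y1, Y2])
    du = u1 - u2
    quad = du @ du - du @ Z @ np.linalg.inv(np.eye(2 * m) + Z.T @ Z) @ (Z.T @ du)
    dets = (-0.5 * np.linalg.slogdet(np.eye(2 * m) - Zt.T @ Zt)[1]
            + 0.25 * np.linalg.slogdet(np.eye(m) - Yt1.T @ Yt1)[1]
            + 0.25 * np.linalg.slogdet(np.eye(m) - Yt2.T @ Yt2)[1])
    return np.exp(dets - quad / (8 * sigma**2))

rng = np.random.default_rng(4)
D, m, sigma = 7, 2, 0.5
Y1, Y2 = rng.standard_normal((D, m)), rng.standard_normal((D, m))
u1, u2 = rng.standard_normal(D), rng.standard_normal(D)
C1, C2 = Y1 @ Y1.T + sigma**2 * np.eye(D), Y2 @ Y2.T + sigma**2 * np.eye(D)
print(bhattacharyya_direct(u1, C1, u2, C2), bhattacharyya_fa(u1, Y1, u2, Y2, sigma))
```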

Probability product kernel in the limit becomes trivial

For $u_i = 0$ and $Y_i^\top Y_i = I_m$, we have
$$k_{Prob}(p_1, p_2) = (2\pi)^{(1-2\alpha)D/2} \det(C^*)^{1/2} \det(C_1)^{-\alpha/2} \det(C_2)^{-\alpha/2}$$
$$= (2\pi)^{(1-2\alpha)D/2}\, \sigma^{D} (2\alpha)^{-D/2} \det(I_{2m} - \tilde{Z}^\top \tilde{Z})^{-1/2}\, \sigma^{-\alpha D} \det(I_m - \tilde{Y}_1^\top \tilde{Y}_1)^{\alpha/2}\, \sigma^{-\alpha D} \det(I_m - \tilde{Y}_2^\top \tilde{Y}_2)^{\alpha/2}$$
$$= (2\pi)^{(1-2\alpha)D/2}\, \sigma^{D} (2\alpha)^{-D/2}\, \sigma^{2\alpha(m - D)} (\sigma^2 + 1)^{-\alpha m} \det(I_{2m} - \tilde{Z}^\top \tilde{Z})^{-1/2}$$
$$= \pi^{(1-2\alpha)D/2}\, 2^{-\alpha D}\, \alpha^{-D/2}\, \sigma^{2\alpha(m - D) + D}\, (\sigma^2 + 1)^{-\alpha m} \det(I_{2m} - \tilde{Z}^\top \tilde{Z})^{-1/2},$$
and furthermore,
$$\det(I_{2m} - \tilde{Z}^\top \tilde{Z})^{-1/2} = \det\!\begin{pmatrix} I_m - \frac{1}{2}\tilde{Y}_1^\top \tilde{Y}_1 & -\frac{1}{2}\tilde{Y}_1^\top \tilde{Y}_2 \\ -\frac{1}{2}\tilde{Y}_2^\top \tilde{Y}_1 & I_m - \frac{1}{2}\tilde{Y}_2^\top \tilde{Y}_2 \end{pmatrix}^{-1/2} = \det\!\left( \frac{2\sigma^2 + 1}{2(\sigma^2 + 1)} \begin{pmatrix} I_m & -\frac{1}{2\sigma^2 + 1} Y_1^\top Y_2 \\ -\frac{1}{2\sigma^2 + 1} Y_2^\top Y_1 & I_m \end{pmatrix} \right)^{-1/2}$$
$$= \left( \frac{2(\sigma^2 + 1)}{2\sigma^2 + 1} \right)^{m} \det\!\left( I_m - \frac{1}{(2\sigma^2 + 1)^2} Y_1^\top Y_2 Y_2^\top Y_1 \right)^{-1/2}.$$
Ignoring the terms which are not functions of $Y_1$ or $Y_2$, we have
$$k_{Prob}(Y_1, Y_2) \propto \det\!\left( I_m - \frac{1}{(2\sigma^2 + 1)^2} Y_1^\top Y_2 Y_2^\top Y_1 \right)^{-1/2}.$$
Suppose the two subspaces $\mathrm{span}(Y_1)$ and $\mathrm{span}(Y_2)$ intersect only at the origin, that is, the singular values of $Y_1^\top Y_2$ are strictly less than 1. In this case $k_{Prob}$ has a finite value as $\sigma \to 0$ and the inversion is well-defined.

In contrast, the diagonal terms of $k_{Prob}$ become
$$k_{Prob}(Y_1, Y_1) = \det\!\left( \left( 1 - \frac{1}{(2\sigma^2 + 1)^2} \right) I_m \right)^{-1/2} = \left( \frac{(2\sigma^2 + 1)^2}{4\sigma^2(\sigma^2 + 1)} \right)^{m/2}, \qquad (6.14)$$
which diverges to infinity as $\sigma \to 0$. This implies that after the kernel is normalized by the diagonal terms, it becomes a trivial kernel:
$$k_{Prob}(Y_i, Y_j) \to \begin{cases} 1, & \mathrm{span}(Y_i) = \mathrm{span}(Y_j) \\ 0, & \text{otherwise}, \end{cases} \qquad \text{as } \sigma \to 0. \qquad (6.15)$$
As I claimed earlier, the Probability Product kernel, including the Bhattacharyya kernel, loses its discriminating power as the Gaussian distributions become flatter.
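The trivialization in (6.15) can also be observed numerically. The sketch below (my own, assuming the closed form derived above, in which $\alpha$ only affects $Y$-independent constants) evaluates the normalized kernel for two generic subspaces as $\sigma$ decreases.

```python
import numpy as np

def normalized_product_kernel(Y1, Y2, sigma):
    """Normalized probability product kernel between two zero-mean, homogeneous factor
    analyzers, using the closed form proportional to det(I - c^2 Y1'Y2 Y2'Y1)^(-1/2)."""
    m = Y1.shape[1]
    c = 1.0 / (2 * sigma**2 + 1)
    k12 = np.linalg.det(np.eye(m) - c**2 * (Y1.T @ Y2) @ (Y2.T @ Y1))**-0.5
    k11 = (1 - c**2)**(-0.5 * m)      # same formula with Y1'Y1 = I_m, so k11 == k22
    return k12 / k11                  # equals k12 / sqrt(k11 * k22)

rng = np.random.default_rng(5)
D, m = 10, 3
Y1, _ = np.linalg.qr(rng.standard_normal((D, m)))
Y2, _ = np.linalg.qr(rng.standard_normal((D, m)))   # two subspaces intersecting only at the origin
for sigma in [1.0, 0.3, 0.1, 0.03, 0.01]:
    print(sigma, normalized_product_kernel(Y1, Y2, sigma))   # tends to 0 as sigma -> 0
```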

6.3 Extended Grassmann Kernel

In the previous section I presented a probabilistic interpretation of the Projection kernel. Based on this analysis, I propose extensions of the Projection kernel that make the kernel applicable to more general data. In this section I examine the two directions of extension: from linear to affine subspaces, and from homogeneous to scaled subspaces.

Motivation

The motivations for considering affine and non-homogeneous subspaces arise from observing the subspaces computed from real data. Firstly, the sets of images, for example from the Yale Face database, have nonzero means. If the mean is significantly different from set to set, we want to use the mean image as well as the PCA basis images to represent a set. Secondly, the eigenvalues from PCA almost always have non-homogeneous values, and the eigenvector direction corresponding to a larger eigenvalue is likely to be more important than the eigenvector direction corresponding to a smaller eigenvalue. In that case we want to consider the eigenvalue scales as well as the eigenvectors when representing the set.

[Figure 6.2: The Mixture of Factor Analyzers model of the Grassmann manifold is the collection of linear, homogeneous Factor Analyzers shown as flat spheres intersecting at the origin (A. Linear). This can be relaxed to allow nonzero offsets for each Factor Analyzer (B. Affine), and also to allow arbitrary eccentricity and scale for each Factor Analyzer, shown as flat ellipsoids (C. Scaled).]

These two extensions are naturally derived from the probabilistic generalization of subspaces; Figure 6.2 illustrates the ideas. Considering the data as an MFA distribution, we can gradually relax the zero-mean ($u_i = 0$) condition in Figure 6.2A to the nonzero-mean ($u_i$ arbitrary) condition in Figure 6.2B, and furthermore relax the homogeneity ($Y^\top Y = I$) condition to the non-homogeneous ($Y^\top Y$ full rank) condition in Figure 6.2C. From this I expect to benefit from both worlds: probabilistic distributions and geometric manifolds. However, simply relaxing the conditions and taking the limit $\sigma \to 0$ of the KL distance does not guarantee a metric or a positive definite kernel, as we will shortly examine. Certain compromises have to be made to turn the KL distance in the limit into a well-defined and usable kernel function. In the following sections I propose new frameworks for the extensions and the technical details for making valid kernels.
