ESTIMATION OF SUBSPACE ARRANGEMENTS: ITS ALGEBRA AND STATISTICS ALLEN YANG YANG


ESTIMATION OF SUBSPACE ARRANGEMENTS: ITS ALGEBRA AND STATISTICS

BY ALLEN YANG YANG

B.Eng., University of Science and Technology of China, 2001
M.S., University of Illinois at Urbana-Champaign, 2003
M.S., University of Illinois at Urbana-Champaign, 2005

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering in the Graduate College of the University of Illinois at Urbana-Champaign, 2006

Urbana, Illinois

ABSTRACT

In the literature of computer vision and image processing, a fundamental difficulty in modeling visual data is that multivariate image or video data tend to be heterogeneous or multimodal. That is, subsets of the data may have significantly different geometric or statistical properties. For example, image features from multiple independently moving objects may be tracked in a motion sequence, or a video clip may capture scenes of different events over time. Therefore, it seems to be desirable to segment mixed data into unimodal subsets and then model each subset with a distinct model.

Recently, subspace arrangements have become an increasingly popular class of mathematical objects to be used for modeling a multivariate mixed data set that is (approximately) piecewise linear. A subspace arrangement is a union of multiple subspaces. Each subspace can be conveniently used to model a homogeneous subset of the data. Hence, all the subspaces together can capture the heterogeneous structure of the data set. Such hybrid subspace models have been successfully applied to modeling different types of image features for purposes such as motion segmentation, texture analysis, and 3-D reconstruction.

In this work, we study the problem of segmenting subspace arrangements. The problem is sometimes called the subspace segmentation problem. The work was inspired by generalized principal component analysis (GPCA), an algebraic solution that simultaneously estimates the segmentation of the data and the parameters of the multiple

subspaces. Built on past extensive study of subspace arrangements in algebraic geometry, we propose a principled framework that summarizes important algebraic properties and statistical facts that are crucial for making the inference of subspace arrangement models both efficient and robust, even when the given data are corrupted with noise and/or contaminated by outliers. Algebraically, we study the properties of polynomials vanishing on a union of subspaces; and statistically, we study how to estimate these polynomials robustly from real sample sets with noise and outliers. These new methods in many ways improve and generalize extant methods for modeling or clustering mixed data. Finally, we will show results of these methods applied to computer vision and image processing.

To facilitate verification and adaptation of our results, and to encourage future research on this topic, all algorithms proposed in this work are available to download as MATLAB toolboxes at:

To Yi and Robert

ACKNOWLEDGMENTS

It is a pleasure to thank those who have been a part of my graduate study, as friends, teachers, and colleagues. First of all, I am deeply indebted to my supervisor Dr. Yi Ma, who has shown me the beauty of rigorous mathematics and the principles of engineering study in the last five years. These invaluable treasures shall be carried on into my future research. Secondly, I want to thank my advisor in the Mathematics Department, Dr. Robert Fossum. He is a great advocate of interdisciplinary collaboration, who has artfully mingled many complex engineering problems and classical mathematical frameworks.

I need to thank Dr. Narendra Ahuja, Dr. Thomas Huang, and Dr. Stephen Levinson, who have served on my Ph.D. committee, and provided many valuable suggestions in the course of my Ph.D. research. I also thank my wonderful colleagues and friends, Dr. Wei Hong, Dr. Kun Huang, Shankar Rao, Andrew Wagner, and John Wright, who have helped me in many ways in my graduate study. In 2004, I was very fortunate to work at Honda Research USA, CA, as a summer intern. I want to thank Dr. James Davis, Dr. Hector Gonzalez-Banos, and Dr. Victor Ng-Thow-Hing, who gave me the chance to work on their exciting Honda Asimo humanoid robot.

Finally, my Ph.D. research was partially supported by the UIUC Computational Science and Engineering Fellowship, the Beckman Institute Cognitive Science/AI Fellowship,

the Joint Applied Mathematics Assistantship, and the following funding sources: UIUC ECE and CSL startup funds, NSF CAREER IIS , NSF CCF-TF , and ONR YIP N .

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1  INTRODUCTION
    Motivation
    Problem Statement
    Subspace Segmentation in Computer Vision
        Principal component analysis and its extensions
        Estimation of multiple subspaces
    Organization

CHAPTER 2  ALGEBRAIC INFERENCE OF SUBSPACE ARRANGEMENTS
    Basic Notation and Definitions
    Vanishing Ideals and Hilbert Functions of Subspace Arrangements
    Generalized Principal Component Analysis
        Retrieving the vanishing polynomials
        Retrieving the normal vectors and bases of the subspaces
        The algebraic GPCA algorithm and its variations

CHAPTER 3  ESTIMATION OF SUBSPACE ARRANGEMENTS FROM NOISY SAMPLES
    Estimation of Vanishing Polynomials
        Sampson distances
        Generalized eigenvector fit
        Simulations
    Estimation of Subspace Arrangements via a Voting Scheme
        Arrays of bases and counters
        Voting for the subspace bases
        Simulations and comparison

CHAPTER 4  ESTIMATION OF SUBSPACE ARRANGEMENTS WITH OUTLIERS
    Robust Techniques: A Literature Review
    Robust Generalized Principal Component Analysis
        Influence functions
        Multivariate trimming
        Estimating the outlier percentage
    Random Sample Consensus
        RANSAC on individual subspaces
        Simulations and Comparison

CHAPTER 5  APPLICATIONS
    Motion Segmentation with an Affine Camera Model
    Vanishing Point Detection
    Other Applications

CHAPTER 6  CONCLUSIONS AND DISCUSSION
    Estimation of Hybrid Quadratic Manifolds

REFERENCES
AUTHOR'S BIOGRAPHY

LIST OF TABLES

2.1  Hilbert functions of the four possible configurations of three subspaces in $\mathbb{R}^3$
     Average time for solving subspace arrangements of dimensions (2, 2, 1) in $\mathbb{R}^3$ with sample numbers (200, 200, 100) by the sample influence function, theoretical influence function, and MVT
     RANSAC-on-Union applied to three subspace arrangement models: (2, 2, 1) in $\mathbb{R}^3$, (4, 2, 2, 1) in $\mathbb{R}^5$, and (5, 5, 5) in $\mathbb{R}^6$ (with 6% Gaussian noise and 24% outliers)
     Average time for applying RANSAC-on-Subspaces, RGPCA-Influence, and RGPCA-MVT to the three subspace arrangement models with 24% outliers: (2, 2, 1) in $\mathbb{R}^3$, (4, 2, 2, 1) in $\mathbb{R}^5$, and (5, 5, 5) in $\mathbb{R}^6$
     Linear and quadratic constraints of two perspective camera models

LIST OF FIGURES

1.1  Inferring a hybrid linear model $Z$, consisting of a plane $V_1$ and two lines ($V_2$, $V_3$), from a set of mixed data, which can be: (a) noiseless samples, (b) noisy samples, (c) noisy samples with outliers
1.2  Two model fitting results for four noisy samples from a line in $\mathbb{R}^2$. The dashed line shows the result with the line model assumption, and the curve shows the result with the third-degree polynomial model assumption
     Inferring a subspace arrangement of three subspaces $Z = V_1 \cup V_2 \cup V_3$ from a set of sample points $\{z_i\}$
     Four configurations of three subspaces in $\mathbb{R}^3$. The numbers in the parentheses are the subspace dimensions
     Three key steps in the algebraic GPCA algorithm
     The effects of data noise in the process of GPCA
     (a) The eigenvalues of the matrix $\Sigma$ with 0% noise. (b) The eigenvalues of the matrix $\Sigma$ with 6% noise. (c) The improved generalized eigenvalues of $(\Sigma, \Gamma)$
     The segmentation result of GPCA-Voting on three sample sets from a subspace arrangement of dimensions (2, 1, 1) in $\mathbb{R}^3$
     Comparison of EM, K-Subspaces, PDA, and GPCA-Voting. GPCA-Voting+K-Subspaces means using the estimated model by GPCA-Voting to initialize K-Subspaces. The EM algorithm fails in the (2, 2, 1) and (4, 2, 2, 1) cases
     A segmentation result of GPCA-Voting for samples drawn from two planes and one line in $\mathbb{R}^3$, with 6% Gaussian noise as well as 6% outliers drawn from a uniform distribution (marked as black asterisks). Left: the ground truth. Right: estimated subspaces and the segmentation result
     The effects of outliers in the process of GPCA. The black dot indicates an add-in outlier with a large magnitude
     Average space angle error of GPCA with vanishing polynomials estimated by the three robust methods

4.4  Maximal sample residuals versus rejection rates. The data set consists of three subspaces of dimensions (2, 2, 1) in $\mathbb{R}^3$. The valid sample size is (200, 200, 100), and 16% uniformly distributed outliers are added in. The algorithm trims out samples with rejection rates from 0% to 54% using MVT. The maximal sample residual of the remaining sample points at each rejection rate is measured with respect to the model estimated by GPCA-Voting on the remaining sample points
     Subspace segmentation results at 7% and 38% rejection rates
     Segmentation results of RGPCA-MVT on three data sets from a subspace arrangement of dimensions (2, 1, 1) in $\mathbb{R}^3$
     Segmentation results of RGPCA-MVT on three data sets from a subspace arrangement of dimensions (2, 2, 1) in $\mathbb{R}^3$
     Possible segmentation results (in color) to fit the first plane model on samples drawn from 4 subspaces of dimensions (2, 2, 1, 1). The support of this 2-D model may come from samples on a true plane, or multiple degenerate line models. The degeneracy becomes more difficult to detect with outliers as shown in (c)
     Average space angle errors of RANSAC-on-Subspaces, RGPCA-MVT, and RGPCA-Influence (50 trials at each percentage)
     The first and last frames of five video sequences with tracked image features superimposed
     The segmentation results of affine camera motions in the five sequences. The black asterisks denote the outliers
     Vanishing-point detection results by RANSAC-on-Subspaces
     Vanishing-point detection results by RGPCA-Influence
     Vanishing-point detection results by RGPCA-MVT
     Results of vanishing-point detection on a natural scene that contains a bridge, a mountain, and trees. All three subspace-segmentation algorithms fail to find the true vanishing points that correspond to parallel line families in space

CHAPTER 1  INTRODUCTION

1.1 Motivation

In scientific and engineering studies, one of the most common tasks is to find a parametric model for a given set of data. Depending on the nature of the data, the model can be either a probabilistic distribution (e.g., a Gaussian distribution, a hidden Markov chain) or a geometric structure (e.g., a line, a curve, or a manifold). Nevertheless, among all the models, linear models such as a straight line or a subspace are the most popular choices, mainly because they are simple to understand and easy to represent and compute.

Very often in the practice of data modeling, however, a given data set is not homogeneous and cannot be described well by a single linear model. This is especially the case with imagery data. For instance, a natural image typically contains multiple regions, which differ significantly in the complexity of their texture. While it is generally true that each region can be modeled well by a simple linear model, the same model is unlikely to apply to the other regions. It is therefore reasonable to use multiple models to describe different regions of the image.

The above example about images reveals a challenging problem that permeates many

Figure 1.1: Inferring a hybrid linear model $Z$, consisting of a plane $V_1$ and two lines ($V_2$, $V_3$), from a set of mixed data, which can be: (a) noiseless samples, (b) noisy samples, (c) noisy samples with outliers.

research areas such as image processing, computer vision, pattern recognition, and system identification: How to segment a given set of data into multiple subsets and find the best model for each subset? In different contexts, such a data set as well as its associated model has been called mixed, multimodal, piecewise, heterogeneous, or hybrid. For simplicity, in this work, we refer to the data as mixed and the model as hybrid. We are particularly interested in the hybrid linear model, which can be characterized as one linear model for each homogeneous subset of the data. Figure 1.1 shows a simple example. The importance of hybrid linear models is multifold:

1. They are the natural generalizations of single linear models.
2. They are sufficiently expressive for representing or approximating arbitrarily complex data structures.
3. The understanding of hybrid linear models has been significantly advanced in recent years and efficient solutions have been developed.

A fundamental challenge in estimating such a hybrid model for mixed data is the chicken-and-egg problem. If the data were already segmented properly into homogeneous

subsets, estimating a model for each subset would be easy. Or, if the hybrid model were already known, segmenting the data into multiple subsets would be straightforward. For instance, in Figure 1.1, if a correct segmentation is given, finding an optimal linear subspace for each subset of sample points has a well-established solution known as principal component analysis (PCA) [1]. Or, given the three linear subspaces, one can easily segment the samples to their closest subspaces, respectively. The problem becomes much more involved if neither the model nor the segmentation is known a priori and we have only the unsegmented sample points, which sometimes are also corrupted with noise and outliers (see Figure 1.1 (b) and (c)). So at the heart of modeling such mixed data is the question of how to effectively resolve the coupling between data segmentation and model estimation.

1.2 Problem Statement

In this work, we address the following problem.

Problem. Given a set of sufficiently dense sample points drawn from a finite union of $n$ linear subspaces¹ $V_1, V_2, \ldots, V_n$ of dimensions $d_1, d_2, \ldots, d_n$, respectively, in a $D$-dimensional space $\mathbb{R}^D$, estimate a basis for each subspace, and segment all sample points into their respective subspaces.

We consider the problem under three assumptions with increasing practicality and difficulty:

Assumption 1: The samples are noiseless samples from the subspaces, see Figure 1.1 (a).

¹ Linear subspaces are subspaces that pass through the origin, which are distinguished from affine subspaces.

Assumption 2: The samples are corrupted by (typically Gaussian) noise, see Figure 1.1 (b).

Assumption 3: The samples are corrupted by both noise and outliers, see Figure 1.1 (c).

In this work we develop the solution under the above three assumptions. The technical conditions under which a set of sample points is considered to be sufficiently dense will become clear in the context. Although algebraically, one or multiple subspaces are fundamentally different from a finite set of samples, it has been shown that in theory one can always recover an algebraic set of subspaces from a finite number of sample points [2].

In mathematics, a union of multiple subspaces is called a subspace arrangement. Subspace arrangements constitute a very important class of objects that have been studied in mathematics for centuries. The importance as well as the difficulty of studying subspace arrangements can hardly be exaggerated. Different aspects of their properties have been and are still being investigated and exploited in many mathematical fields, including algebraic geometry, algebraic topology, combinatorics and complexity theory, and graph and lattice theory. See [3-5] for a general review.

In the context of modeling mixed data, subspace arrangements are of immediate interest because they are the natural generalizations of single subspaces, the linear models. As an underlying model for mixed data, a subspace arrangement is sufficiently flexible and expressive: It may contain subspaces of different dimensions, and can approximate any nonlinear geometric or topological structure with arbitrary accuracy. In addition, as we will see in this work, a subspace arrangement as an algebraic set can be effectively estimated and segmented from a set of samples in a noniterative way.

1.3 Subspace Segmentation in Computer Vision

Before we proceed to the estimation of subspace arrangement models, we emphasize, as a general fact in the pattern recognition literature, that any classification algorithm can be optimal for a given data set under certain criteria. The same is true in the model selection scenario. Given a data set, there are many different models that fit the data. For example, under the assumption that the samples of a data set obey a multivariate Gaussian distribution, the Karhunen-Loève (KL) transform is the optimal coding scheme at a given compression rate [6]. But there are many image compression methods that outperform the KL transform, which means that an optimal method becomes inferior if the model assumption does not hold for the data set. In this case, the distribution of pixel values in most images does not obey the single Gaussian model. In other words, in order to efficiently obtain the coefficients of a model $M$ from a data set, we have to first select a correct model class $\mathcal{M}$ that we are trying to fit.

The above example illustrates that an optimal solution by no means gives the best representation of a data set if the fundamental model assumption is inaccurate. The presence of data noise makes choosing a proper model even more difficult. Consider the set of four points from a line in $\mathbb{R}^2$ shown in Figure 1.2. The samples contain noise, and the number of samples is limited. Using the linear least-squares (LLS) method [7], we get two different fitting results under two model assumptions. The first one is a line model, which is consistent with the underlying structure, and the second one is a third-degree polynomial model. The results show that the third-degree polynomial model fits the sample data perfectly with respect to the least-squares error. This indeed is an optimal solution for this given data set. But the model significantly deviates from the true structure of the data.

For engineering purposes, the choice of a model class for an application should be

Figure 1.2: Two model fitting results for four noisy samples from a line in $\mathbb{R}^2$. The dashed line shows the result with the line model assumption, and the curve shows the result with the third-degree polynomial model assumption.

rich enough that it contains a model that can fit the data to a desired accuracy, yet be simple enough so as to make the identification of the best model for the data tractable. A common strategy is to try to get away with the simplest possible class of models that can solve the problems at hand.

1.3.1 Principal component analysis and its extensions

A data set can be modeled by its geometric shape. In most cases, samples are represented in real vector form with very high dimensions. The space that all samples sit in is called the sample space or ambient space. The fundamental assumption made by geometric modeling is that the underlying structure of the object from which the data are sampled is contained in a small portion of the whole ambient space, sometimes as only a zero-measure set, e.g., a line in $\mathbb{R}^2$. Under an assumption on the shape of the object, there are many well-studied algorithms. Probably the most widely recognized one is principal component analysis (PCA). In this case, principal components refer to the coordinates of a vector along the basis vectors with dominant energy. That is, any vector $z$ in an ambient space $\mathbb{R}^K$ can be represented

as a linear combination of an orthonormal basis:
$$ z = \sum_{i=1}^{K} a_i v_i, \qquad (1.1) $$
and its $k$-dimensional approximation $\hat z$ by the first $k$ principal components $a_1, \ldots, a_k$ minimizes the squared error for all $k = 1, \ldots, K$:
$$ \hat z = \sum_{i=1}^{k} a_i v_i = \operatorname*{argmin}_{\tilde z \in \operatorname{span}\{v_1, \ldots, v_k\}} \| z - \tilde z \|^2. \qquad (1.2) $$

In addition to its simple objective function, a numerically stable and fast solution is also credited for the popularity of PCA, namely the singular value decomposition (SVD). SVD is among the most stable algorithms for matrix orthogonal factorization [8]. A historical appraisal of SVD can be found in [9].

Due to its fast and stable performance, there are many extensions of the PCA (SVD) method. The higher-order singular value decomposition (HOSVD) processes the data samples in $n$th-order tensor form, i.e., $z \in \mathbb{R}^{K_1 \times \cdots \times K_n}$. HOSVD is a multilinear generalization of the SVD method, which decomposes an $n$th-order tensor into an outer product:
$$ z = S \times_1 U^{(1)} \times_2 U^{(2)} \times_3 \cdots \times_n U^{(n)}, \qquad (1.3) $$
in which $U^{(l)}$ is a unitary $K_l \times K_l$ matrix, and $S$ is a $K_1 \times K_2 \times \cdots \times K_n$ tensor. There is a clear analogy between the decomposition results of SVD and HOSVD, and the interested reader is referred to [10].

PCA is an orthogonal transformation in the data space. However, the input variables may be related nonlinearly. In this case, the solution using Equation (1.2) in the Euclidean coordinate system becomes nonoptimal. By introducing a kernel function in the data space, the input variables are embedded into a feature space where the vectors sit in a linear subspace that can be estimated by standard PCA. This approach is known as kernel PCA [11]. Of course, the choice of the kernel function is crucial for the method, which in theory has infinitely many candidates. It has to be made with respect to the application.
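As a concrete illustration of the single-subspace building block, the following is a minimal sketch of PCA in the sense of Equation (1.2), fitting one $k$-dimensional linear subspace by SVD. It is written in Python with NumPy purely for illustration (the toolboxes released with this work are in MATLAB), and the function name pca_basis is ours; since the subspaces in this work pass through the origin, no mean is subtracted.

```python
import numpy as np

def pca_basis(Z, k):
    """Orthonormal basis (D x k) of the best k-dimensional linear subspace
    for the columns of Z (D x N), via SVD; cf. Eq. (1.2)."""
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k]

# toy usage: noisy samples near a 2-D subspace of R^5
rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 300)) \
    + 0.01 * rng.standard_normal((5, 300))
U = pca_basis(Z, 2)
print(np.linalg.norm(Z - U @ (U.T @ Z)) / np.linalg.norm(Z))  # small relative residual
```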

1.3.2 Estimation of multiple subspaces

Although PCA is probably the most successful algorithm for fitting single low-dimensional subspace models, it is well known that extending PCA to estimate a mixture of subspace models as a union is difficult. Before the introduction of generalized principal component analysis (GPCA) [12, 13] as an algebraic solution to this problem, it was commonly believed that the only principled approach to solve for a mixture of models was to model the mixed data $\{z_1, z_2, \ldots, z_N\} \subset \mathbb{R}^D$ as a set of independent samples drawn from a mixture of probabilistic distributions $\{p(z, \theta_j)\}_{j=1}^{n}$, which is typically represented as a weighted sum $p(z, \Theta) = \sum_j \pi_j p(z, \theta_j)$ with $\sum_j \pi_j = 1$. The problem of segmenting mixed data is then often converted to a statistical model-estimation problem. In 1999, Tipping and Bishop wrote the following assertion in their classic paper [14]:

    However, conventional PCA does not correspond to a probability density, and so there is no unique way to combine PCA models. Therefore, previous attempts to formulate mixture models for PCA have been ad hoc to some extent.

Depending on the purpose, the estimated model parameters can take either the maximum-likelihood estimate, which maximizes the log-likelihood:
$$ \max_{\Theta, \pi} \sum_{i} \log \Big( \sum_{j} \pi_j \, p(z_i, \theta_j) \Big), \qquad (1.4) $$
or the minimax estimate, which optimizes the objective:
$$ \min_{\Theta} \sum_{i} \Big[ \min_{j} \big( -\log p(z_i, \theta_j) \big) \Big]. \qquad (1.5) $$
However, even for simple distributions such as the Gaussian distribution, there is no simple noniterative solution. One needs to resort to iterative schemes to find the optimal

estimate. For the maximum-likelihood estimate, one can view the event that a sample is drawn from the $j$th distribution as a hidden random variable with expectation $\pi_j$. Then the classical expectation-maximization (EM) algorithm [15, 16] can be called upon to maximize the likelihood in a gradient-descent fashion. The minimax estimate leads to an iterative algorithm known as the K-Means algorithm [17-20] (or its variation for subspaces, K-Subspaces [21]), which in many aspects resembles the EM algorithm.

To establish the E-step and M-step of the EM algorithm for a subspace arrangement model, we need to assume that a data set $F = \{z_1, z_2, \ldots, z_N\}$ consists of samples drawn from multiple component distributions, and each component distribution is centered around a subspace. To model from which component distribution a sample $z$ is actually drawn, we can introduce a hidden discrete random variable $\eta \in \mathbb{N}$, where $\eta = i$ if $z$ is drawn from the $i$th subspace, $i \in \{1, \ldots, n\}$. Thus, the random vector
$$ (z, \eta) \in \mathbb{R}^D \times \mathbb{N} \qquad (1.6) $$
completely describes the random event that the value $z$ is drawn from a subspace indicated by the value of $\eta$. We also assume the distribution of each subspace component is a multivariate Gaussian distribution:
$$ p(z \mid \eta = i) := \frac{1}{(2\pi)^{(D-d_i)/2} \, \sigma_i^{D-d_i}} \exp\Big( -\frac{z^T B_i B_i^T z}{2\sigma_i^2} \Big), \qquad (1.7) $$
where $B_i \in \mathbb{R}^{D \times (D-d_i)}$ is an orthogonal matrix. Assign to each subspace $i$ the model parameter $\theta_i = (B_i, \sigma_i, \pi_i)$, where
$$ \pi_i := p(\eta = i). \qquad (1.8) $$
Also define the probability function given the $k$th observation $z_k$:
$$ w_{ik} := p(\eta_k = i \mid z_k, \theta). \qquad (1.9) $$
Then we establish the following iterative steps:

1. Initialization: Set initial values $\hat\theta_i^{(0)} = \{\hat B_i^{(0)}, \hat\sigma_i^{(0)}, \hat\pi_i^{(0)}\}$ for $i = 1, \ldots, n$. Set the iteration variable $m = 0$.

2. Expectation: Compute the expected value of $w_{ik}$ as
$$ w_{ik}^{(m)} = p(\eta_k = i \mid z_k, \hat\theta^{(m)}) = \frac{\hat\pi_i^{(m)} \, p(z_k \mid \eta_k = i, \hat\theta^{(m)})}{\sum_{l=1}^{n} \hat\pi_l^{(m)} \, p(z_k \mid \eta_k = l, \hat\theta^{(m)})}, \qquad (1.10) $$
where $p(z \mid \eta = i, \theta)$ is given in (1.7).

3. Maximization: Using the expected values $w_{ik}^{(m)}$, compute $\hat\theta^{(m+1)}$:
   $\hat B_i^{(m+1)}$ = the eigenvectors associated with the smallest $D - d_i$ eigenvalues of the matrix $\sum_{k=1}^{N} w_{ik}^{(m)} z_k z_k^T$;
   $\hat\pi_i^{(m+1)} = \frac{1}{N} \sum_{k=1}^{N} w_{ik}^{(m)}$;
   $\big(\hat\sigma_i^{(m+1)}\big)^2 = \dfrac{\sum_{k=1}^{N} w_{ik}^{(m)} \, \| (\hat B_i^{(m+1)})^T z_k \|^2}{(D - d_i) \sum_{k=1}^{N} w_{ik}^{(m)}}$.

4. Let $m \leftarrow m + 1$, and repeat steps 2 and 3 until the update in the parameters is small enough.

The K-Subspaces algorithm does not assume any statistical model. Instead, it recursively updates the estimates of the subspace basis matrices by the following method:

1. Initialization: Set initial values of orthogonal matrices $\hat U_i^{(0)} \in \mathbb{R}^{D \times d_i}$ for $i = 1, \ldots, n$. Let $m = 0$.

2. Segmentation: For each sample $z_k$, assign it to group $\hat X_i^{(m)}$ if
$$ i = \operatorname*{argmin}_{l} \big\| z_k - \hat U_l^{(m)} (\hat U_l^{(m)})^T z_k \big\|^2. \qquad (1.11) $$
If the cost function is minimized by more than one subspace, randomly assign the point to one of them.

3. Estimation: Apply PCA to each subset $\hat X_i^{(m)}$ and obtain new estimates for the subspace bases $\hat U_i^{(m+1)}$.

4. Let $m \leftarrow m + 1$, and repeat steps 2 and 3 until the segmentation does not change.
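The following is a minimal sketch of the K-Subspaces iteration just described, alternating the segmentation step (1.11) with per-group PCA. It is written in Python with NumPy for illustration only; the function name k_subspaces and the random initialization are our choices, and no attempt is made here to address the initialization sensitivity discussed next.

```python
import numpy as np

def k_subspaces(Z, dims, max_iter=100, seed=0):
    """Alternate (1.11)-style segmentation with per-group PCA.
    Z is D x N; dims lists the subspace dimensions d_1, ..., d_n."""
    rng = np.random.default_rng(seed)
    D, N = Z.shape
    # random orthonormal initial bases U_i in R^{D x d_i}
    bases = [np.linalg.qr(rng.standard_normal((D, d)))[0] for d in dims]
    labels = np.full(N, -1)
    for _ in range(max_iter):
        # segmentation: assign each sample to the subspace with smallest residual
        res = np.stack([np.linalg.norm(Z - U @ (U.T @ Z), axis=0) for U in bases])
        new_labels = res.argmin(axis=0)
        if np.array_equal(new_labels, labels):
            break                      # segmentation no longer changes
        labels = new_labels
        # estimation: PCA (via SVD) on each group
        for i, d in enumerate(dims):
            Zi = Z[:, labels == i]
            if Zi.shape[1] >= d:       # keep the old basis if a group starves
                bases[i] = np.linalg.svd(Zi, full_matrices=False)[0][:, :d]
    return bases, labels
```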

In a sense, both the EM and K-Subspaces algorithms have reinforced the belief that the chicken-and-egg coupling between model estimation and data segmentation can only be dealt with through such an iteration between the two. However, both the EM and K-Subspaces algorithms share several drawbacks: (1) They are local methods, and thus sensitive to initialization. (2) For certain objective functions, they may converge to the boundary of the parameter space, leading to meaningless estimates. (3) Estimation of the number of primitive models is another difficult problem.

Although this work does not address the model selection problem of estimating the subspace number and dimensions when they are unknown, we summarize the most classical recipes in the following. Many general-purpose model selection criteria have been developed in the statistics community and the algorithmic complexity community for general classes of models. These criteria include:

- Akaike Information Criterion (AIC) [22] (also known as the $C_p$ statistic [23]).
- Bayesian Information Criterion (BIC) (also known as the Schwarz criterion, see [24] and references therein).
- Minimum Description Length (MDL) [25] and Minimum Message Length (MML) [26].

Although these criteria are originally motivated and derived from different points of view (or in different contexts), they all share a common characteristic: The optimal model should be the one that strikes a good balance between the model complexity (which typically depends on the dimension of the parameter space) and the data fidelity to the chosen model (e.g., measured as the sum of squared errors).

In computer vision, the AIC and MDL methods have been widely adopted to estimate the unknown number of simple models in a mixed data set. For example, Figueiredo and Jain [27] proposed to use MDL to estimate a mixture of Gaussian distributions; Kanatani [28] generalized both AIC and MDL to estimate geometric subspace structures, which were named Geometric-AIC and Geometric-MDL; and Schindler and Suter [29] further used Geometric-AIC to estimate an unknown number of rigid-body motions in a set of tracked image features. Naturally, we have extended these criteria to determine the number of subspaces and their dimensions for subspace arrangement models. We refer to [30, 31] for further reading.

1.4 Organization

Contrary to classical statistical approaches, in this work we first study a principled algebro-geometric framework, which summarizes important algebraic properties that are crucial for the inference of subspace arrangement models. In Chapter 2, we show that the set of polynomials that vanish on a subspace arrangement forms an ideal and that the subspace arrangement is uniquely determined by this ideal. We give a complete characterization of the dimension of each graded component of the ideal, also known as the Hilbert function. We show how the dimensions of the subspaces can be determined from the values of the Hilbert function. These algebraic properties serve as a global signature of a subspace arrangement model in the noiseless situation.

We then put statistics back into the mathematical framework, and study how to estimate the vanishing polynomials and the subspaces from sample points that are corrupted by noise and outliers. In Chapter 3, we introduce a two-stage process to improve the segmentation in the presence of data noise. We use a Fisher discriminant criterion to improve the estimation of vanishing polynomials, and a voting scheme to evaluate normal

vectors perpendicular to the subspaces. In Chapter 4, we study the same problem under the assumption that the given sample points are contaminated by outliers. We introduce certain robust statistical techniques that can detect or diminish the effects of outliers, especially for subspace arrangements. Finally, we demonstrate how these methods can be applied to several real-world applications in Chapter 5.

To facilitate verification and adaptation of our results, and to encourage future research on this topic, all algorithms proposed in this work are available to download as MATLAB toolboxes at:

CHAPTER 2  ALGEBRAIC INFERENCE OF SUBSPACE ARRANGEMENTS

Before we can introduce subspace arrangements as a useful class of models for data modeling and segmentation, we need to understand their properties as an important class of algebraic sets. In this chapter, we discuss the necessary mathematical facts that allow us to infer a subspace arrangement from a finite number of samples and subsequently to decompose the arrangement into separate subspaces, as shown in Figure 2.1. The algebraic facts presented in this chapter serve as the theoretical foundation for an effective method to model and segment mixed data known as generalized principal component analysis (GPCA).

Figure 2.1: Inferring a subspace arrangement of three subspaces $Z = V_1 \cup V_2 \cup V_3$ from a set of sample points $\{z_i\}$. (The figure depicts the pipeline: sample points $\{z_i\}$, the arrangement $Z = V_1 \cup V_2 \cup V_3$ with vanishing ideal $I(Z) = I(V_1) \cap I(V_2) \cap I(V_3)$, and the individual subspaces.)

2.1 Basic Notation and Definitions

In what follows, the ambient space is a $D$-dimensional vector space over the infinite field $\mathbb{R}$. We immediately identify our vector space with $\mathbb{R}^D$. If $V$ is a $d$-dimensional subspace, then its codimension is $c := D - d$.

Definition (Subspace Arrangement). A subspace arrangement in $\mathbb{R}^D$ is a finite union
$$ A := V_1 \cup V_2 \cup \cdots \cup V_n \qquad (2.1) $$
of $n$ linear subspaces $V_1, V_2, \ldots, V_n$ of $\mathbb{R}^D$.

For a nonempty subset $S$ of the index set $\{1, 2, \ldots, n\}$, we define the intersection $V_S := \bigcap_{s \in S} V_s$ with dimension $d_S := \dim V_S$ and codimension $c_S := D - d_S$.

Definition (Transversal Subspace Arrangement). A subspace arrangement $A = V_1 \cup V_2 \cup \cdots \cup V_n$ is called transversal if
$$ c_S = \min\Big( D, \; \sum_{i \in S} c_i \Big) $$
for all nonempty $S \subseteq \{1, 2, \ldots, n\}$. That is, the dimensions of all intersections are as small as possible.

Notice that transversality is a weaker condition than the typical notion of general position. For instance, three coplanar lines through the origin are transversal in $\mathbb{R}^3$, but usually they are not regarded as being in general position. Transversality is an appropriate assumption for most real applications. Moderate data noise and machine roundoff should guarantee that the subspace structures of the data are transversal.

The ring of polynomial functions on our ambient space $\mathbb{R}^D$ is denoted by $R^{[D]} := \mathbb{R}[X_1, X_2, \ldots, X_D]$. It is the ring of polynomials in the functions $\{X_1, X_2, \ldots, X_D\}$, where $X_j$ is the function that assigns the $j$th coordinate to a point in $\mathbb{R}^D$. Any polynomial $f \in R^{[D]}$ can be written as a unique sum $f = f_0 + f_1 + \cdots + f_{\deg(f)}$, where the $f_i$'s are homogeneous polynomials of degree $i$. Let $R^{[D]}_h$ denote the vector space of all homogeneous polynomials of degree $h$. Then there is a decomposition
$$ R^{[D]} = \mathbb{R} \oplus R^{[D]}_1 \oplus R^{[D]}_2 \oplus \cdots \qquad (2.2) $$
of $R^{[D]}$ into the direct sum of its homogeneous components. Clearly $R^{[D]}_h R^{[D]}_k \subseteq R^{[D]}_{h+k}$. Each homogeneous component $R^{[D]}_h$ is a finite-dimensional vector space over $\mathbb{R}$ of dimension
$$ M^{[D]}_h := \binom{h + D - 1}{D - 1}. \qquad (2.3) $$
One can verify this by observing that the monomials
$$ \{X_1^h, \; X_1^{h-1} X_2, \; X_1^{h-1} X_3, \; \ldots, \; X_D^h\} $$
form a basis of $R^{[D]}_h$. We end this subsection with one more definition.

Definition (Veronese Map). The Veronese map of order $h$ is the map $\nu_h : \mathbb{R}^D \to \mathbb{R}^{M^{[D]}_h}$

given by
$$ \nu_h : (x_1, x_2, \ldots, x_D)^T \mapsto (x_1^h, \; x_1^{h-1} x_2, \; \ldots, \; x_D^h)^T. \qquad (2.4) $$
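For illustration, a direct implementation of the Veronese map (2.4) simply enumerates all degree-$h$ monomials. The sketch below is in Python with NumPy (the helper name veronese is ours, and the monomial ordering is degree-lexicographic); it also checks the dimension count (2.3).

```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb

def veronese(x, h):
    """Degree-h Veronese embedding of Eq. (2.4): all monomials of degree h
    in the entries of x, in degree-lexicographic order."""
    x = np.asarray(x, dtype=float)
    return np.array([np.prod(x[list(idx)])
                     for idx in combinations_with_replacement(range(len(x)), h)])

# dimension check against Eq. (2.3): M_h^[D] = C(h + D - 1, D - 1)
D, h = 3, 3
assert veronese(np.ones(D), h).size == comb(h + D - 1, D - 1) == 10
```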

2.2 Vanishing Ideals and Hilbert Functions of Subspace Arrangements

We will discuss the correspondence between ideals in the polynomial ring $R^{[D]}$ and subsets of $\mathbb{R}^D$.

Definition (Vanishing Ideal). The vanishing ideal $I(W)$ of a subset $W \subseteq \mathbb{R}^D$ is defined by
$$ I(W) := \{ f \in R^{[D]} : f(z) = 0, \; \forall z \in W \}. $$
One easily checks that $I(W)$ is indeed an ideal of the polynomial ring $R^{[D]}$: for any $f \in I(W)$ and $g \in R^{[D]}$, $fg \in I(W)$.

Before dealing with a general subspace arrangement, consider first the situation of a single subspace $V$. The homogeneous component $R^{[D]}_1$ is the vector space of linear functions from $\mathbb{R}^D$ to $\mathbb{R}$. Denote by $V^\perp$ those linear functions on $\mathbb{R}^D$ that vanish on $V$. Any linear function that vanishes on $V$ can be written as $f(X) = b_1 X_1 + b_2 X_2 + \cdots + b_D X_D$, where $b := (b_1, b_2, \ldots, b_D)^T \in \mathbb{R}^D$ is a vector that satisfies
$$ b_1 x_1 + b_2 x_2 + \cdots + b_D x_D = 0 \quad \text{for all } (x_1, x_2, \ldots, x_D)^T \in V. \qquad (2.5) $$
One can show that if the dimension of $V$ is $d$, then $V^\perp$ has dimension $c = D - d$. That is, $V^\perp$ is spanned by $c$ linearly independent linear functions,
$$ V^\perp = \operatorname{span}\{g_1, g_2, \ldots, g_c\}, \qquad (2.6) $$
where each $g_i \in R^{[D]}_1$.

All the ideals that we work with turn out to be homogeneous.

Definition (Homogeneous Ideal). An ideal $I$ in $R^{[D]}$ is homogeneous if the homogeneous components of elements in $I$ are also in $I$.

It can be shown that an ideal is homogeneous if and only if it is generated by homogeneous elements. The vanishing ideal $I(V)$ of a subspace $V \subseteq \mathbb{R}^D$ is obviously generated by the linear functions in $V^\perp$, in fact by a basis of $V^\perp$, and hence is a homogeneous ideal generated by finitely many homogeneous elements.

It is easy to see that the vanishing ideal $I(A)$ of a subspace arrangement $A$ is the intersection of the vanishing ideals of the individual subspaces:
$$ I(A) = I(V_1 \cup V_2 \cup \cdots \cup V_n) = I(V_1) \cap I(V_2) \cap \cdots \cap I(V_n). \qquad (2.7) $$
Since each of the constituents is homogeneous, the ideal $I(A)$ itself is homogeneous and hence
$$ I(A) = I_0 \oplus I_1 \oplus I_2 \oplus \cdots, $$
where $I_h = I(A) \cap R^{[D]}_h$ is the homogeneous part of degree $h$ (for a small $h$ this may be the trivial vector space). Let $m$ be the smallest nonnegative integer such that $I_m \neq \{0\}$. Then $m \le n$ and we can write
$$ I(A) = I_m \oplus I_{m+1} \oplus \cdots \oplus I_n \oplus I_{n+1} \oplus \cdots. \qquad (2.8) $$
Notice that polynomials that vanish on $A$ may have degrees strictly lower than $n$, the number of subspaces in the arrangement. One example is a transversal arrangement of

two lines and one plane in $\mathbb{R}^3$. Since any two lines lie on a plane, this arrangement can be embedded in a hyperplane arrangement of two planes, and there exist homogeneous polynomials of degree two that vanish on the arrangement.

Let us introduce an ideal related to the vanishing ideal $I(A)$, called the product ideal
$$ J(A) = I(V_1) I(V_2) \cdots I(V_n). $$
That is, $J(A)$ is the ideal generated by the products $g_1 g_2 \cdots g_n$, where $g_j \in I(V_j)$ for each $j$. The ideal $J(A)$ is also homogeneous. So
$$ J(A) = J_n \oplus J_{n+1} \oplus \cdots. \qquad (2.9) $$
It is clear that the first nonzero graded component of $J(A)$ is $J_n$ and that
$$ J_n = V_1^\perp V_2^\perp \cdots V_n^\perp = I_1(V_1) I_1(V_2) \cdots I_1(V_n). \qquad (2.10) $$

Definition (Zero Set). Given a set of polynomials $I \subseteq R^{[D]}$, the zero set of $I$ is defined to be
$$ Z(I) := \{ z \in \mathbb{R}^D : g(z) = 0 \text{ for all } g \in I \} \subseteq \mathbb{R}^D. $$

Theorem. The subspace arrangement $A$ is the zero set of the homogeneous component $I_n$ and also the zero set of the homogeneous component $J_n$. That is,
$$ Z(I_n) = Z(J_n) = Z(I(A)) = Z(J(A)) = A. $$

Proof: Since $J_n \subseteq I_n \subseteq I(A)$ and elements of $I(A)$ vanish on the set $A$ by definition, we have
$$ A \subseteq Z(I(A)) \subseteq Z(I_n) \subseteq Z(J_n). \qquad (2.11) $$
For the other direction, suppose $z \notin A$. Then $z \notin V_i$ for all $i = 1, 2, \ldots, n$. Hence for each $i$, there exists a linear function $g_i \in V_i^\perp$ such that $g_i(z) \neq 0$. Let $g = g_1 g_2 \cdots g_n$.

Then $g(z) \neq 0$. Obviously $g \in J_n$. It then follows that $z \notin Z(J_n)$. Therefore $Z(J_n) \subseteq A$. Using (2.11), we obtain $A = Z(I(A)) = Z(I_n) = Z(J_n)$. Also $Z(J(A)) = Z(J_n) = A$ because $J(A)$ is generated by $J_n$.

A consequence of the theorem above is that in order to recover an arrangement $A$ of $n$ subspaces, one needs only to know the set of polynomials of degree $n$ that vanish on $A$. As we will see shortly, to estimate a subspace arrangement $A$, it is very useful to first know the number of linearly independent polynomials of degree $n$ that vanish on $A$. This is related to the Hilbert function of the vanishing ideal $I(A)$ and the product ideal $J(A)$.

Definition (Hilbert Function). The Hilbert function of a homogeneous ideal $K$ is the function $h_K : \mathbb{N} \to \mathbb{N}$ defined by
$$ h_K(i) := \dim(K_i), \qquad (2.12) $$
where $K_i$ is the $i$th homogeneous component of $K$.¹

    ¹ Be aware that in the literature, the Hilbert function is sometimes defined as the codimension of $K_i$ in $R^{[D]}_i$: $M^{[D]}_i - \dim(K_i)$.

Definition (Combinatorial Invariant). The Hilbert function for a subspace arrangement $A = V_1 \cup V_2 \cup \cdots \cup V_n$ is combinatorially invariant if it only depends on the dimensions of the intersections
$$ d_S = \dim V_S = \dim \bigcap_{s \in S} V_s, \quad S \subseteq \{1, 2, \ldots, n\}. \qquad (2.13) $$

Theorem (Derksen, 2005).
1. For a subspace arrangement $A = V_1 \cup V_2 \cup \cdots \cup V_n$, the Hilbert function of the ideal $J(A)$ is combinatorially invariant. Furthermore, for all $i \ge n$,
$$ h_J(i) = \sum_{S} (-1)^{|S|} \binom{i + D - 1 - c_S}{D - 1 - c_S}, \qquad (2.14) $$

where $c_S = \sum_{j \in S} c_j$ and the sum is over all $S \subseteq \{1, 2, \ldots, n\}$ (including the empty set) for which $c_S < D$.

2. If $A$ is transversal, then for all $i \ge n$,
$$ h_I(i) = h_J(i). \qquad (2.15) $$
Hence, $h_I(i)$ is only a function of $(d_1, \ldots, d_n)$ for all $i \ge n$, regardless of the geometry of the subspaces.

A detailed development of the theorem is given in [32]. The proof requires the understanding of Hilbert series and exact sequences, which is beyond the scope of this work. Nevertheless, this closed-form formula is very important for the development and improvement of the GPCA algorithm for estimating a subspace arrangement:

1. The equality $h_I(i) = h_J(i)$ for $i \ge n$ implies that $I_i = J_i$ for $i \ge n$ and in particular $I_n = J_n$. That is, the homogeneous component $I_n$ of the vanishing ideal of a transversal subspace arrangement is always generated by the products of linear forms, which is called pl-generated [3]. This fact was used (but not established at the time) in the early development of the GPCA algorithm because the algorithm would be much easier to explain by using products of linear forms.

2. Knowing these values may greatly facilitate the task of finding the correct subspace arrangement model for a given set of (noisy) data. On one hand, given a data set, if we know the number of subspaces and their dimensions, the value of the Hilbert function will tell us exactly how many linearly independent polynomials of a certain degree to use to fit the data set. This information becomes particularly important when the data are noisy and the number of fitting polynomials is difficult to determine from the data themselves. On the other hand, if the dimensions

(or number) of the subspaces are not given but we are able to obtain the set of vanishing polynomials (up to a certain degree), then the dimensions (or number) of the subspaces can be uniquely determined from the values of the Hilbert function (even without segmenting the data first).

Example (A Transversal Arrangement of Three Subspaces in $\mathbb{R}^3$). Consider $A$ to be an arrangement of three subspaces in $\mathbb{R}^3$. There are in total four possible configurations of $A$, shown in Figure 2.2, and the values of their corresponding Hilbert functions are listed in Table 2.1.

Figure 2.2: Four configurations of three subspaces in $\mathbb{R}^3$: (a) (2, 2, 2), (b) (2, 2, 1), (c) (2, 1, 1), (d) (1, 1, 1). The numbers in the parentheses are the subspace dimensions.

Table 2.1: Hilbert functions of the four possible configurations of three subspaces in $\mathbb{R}^3$.

    d_1  d_2  d_3  h_I(3)
     2    2    2      1
     2    2    1      2
     2    1    1      4
     1    1    1      7

Given a data set sampled from one of the above configurations, the Veronese map embeds data samples from $\mathbb{R}^3$ to $\mathbb{R}^{10}$ (as $M^{[3]}_3 = 10$). It is clear that the null space of the data matrix $L_3$ can only assume four possible dimensions, namely, 1, 2, 4, and 7.
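These values can be checked directly from the closed-form formula (2.14) together with (2.15). The short sketch below evaluates the formula in Python (the helper name hilbert_J is ours) and reproduces the four possible null-space dimensions 1, 2, 4, and 7.

```python
from itertools import combinations
from math import comb

def hilbert_J(i, D, codims):
    """Closed form (2.14) for h_J(i), i >= n, given the subspace codimensions c_j."""
    n = len(codims)
    total = 0
    for r in range(n + 1):                      # all subsets S, including the empty set
        for S in combinations(range(n), r):
            cS = sum(codims[j] for j in S)
            if cS < D:
                total += (-1) ** r * comb(i + D - 1 - cS, D - 1 - cS)
    return total

# the four configurations of Figure 2.2 / Table 2.1, given as subspace dimensions
for dims in [(2, 2, 2), (2, 2, 1), (2, 1, 1), (1, 1, 1)]:
    print(dims, hilbert_J(3, 3, [3 - d for d in dims]))   # prints 1, 2, 4, 7
```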

Furthermore, if the subspace dimensions are known, the value of the Hilbert function $h_I(3)$ is uniquely determined; and if $h_I(3)$ may be correctly estimated from the data set, we can also determine the dimensions of the subspaces.

2.3 Generalized Principal Component Analysis

In the previous sections, we considered the correspondence between a transversal subspace arrangement $A$ and its vanishing ideal. In this section, we review an efficient algebraic algorithm to retrieve a subspace arrangement and its individual subspaces from a given set of samples. This process is known as generalized principal component analysis (GPCA). In this section, we assume the samples to be noise-free. We will discuss samples corrupted by noise in Chapter 3 and samples contaminated by outliers in Chapter 4.

The first version of the algebraic GPCA algorithm was proposed in [12]. Since then, several different variations have been proposed [13, 33]. All variants consist of three steps: First, a set of polynomials that vanish on the given data samples is retrieved. Second, the vectors normal to the subspaces are estimated from the derivatives of these polynomials. Third, the samples are segmented into their respective subspaces based on the normals. Figure 2.3 illustrates the three key steps. In the following, we give a brief description of each step.

Figure 2.3: Three key steps in the algebraic GPCA algorithm.

2.3.1 Retrieving the vanishing polynomials

We are given a set of samples $F = \{z_1, z_2, \ldots, z_N\}$ that lie in a subspace arrangement $A = V_1 \cup V_2 \cup \cdots \cup V_n \subseteq \mathbb{R}^D$. Suppose that we know the number $n$ and the dimensions of the subspaces in the subspace arrangement. We then know that the number of linearly independent vanishing polynomials of degree $n$ is equal to the value of the Hilbert function of $I(A)$ at $n$. Suppose $m = h_I(n)$. We then embed the samples in $\mathbb{R}^{M^{[D]}_n}$ via the Veronese map $\nu_n$ (see the Veronese map definition in Section 2.1) and obtain the matrix
$$ L_n := \big( \nu_n(z_1), \nu_n(z_2), \ldots, \nu_n(z_N) \big) \in \mathbb{R}^{M^{[D]}_n \times N}. \qquad (2.16) $$
Obviously, if $q(X) = c^T \nu_n(X)$ is a polynomial that vanishes on $A$, then we have $q(z_i) = c^T \nu_n(z_i) = 0$ for all $i = 1, 2, \ldots, N$. Therefore the column of coefficients $c$ must be in the (left) null space of $L_n$: $c^T L_n = 0$. If the sample set is large enough, the dimension of the null space of $L_n$ is exactly $m = h_I(n)$. Thus, a basis $C = (c_1, c_2, \ldots, c_m)$ of the null space of $L_n$ gives a basis of $I_n(A)$:
$$ Q(X) := (q_1(X), q_2(X), \ldots, q_m(X))^T, \qquad (2.17) $$
where $q_i(X) = c_i^T \nu_n(X)$, $i = 1, 2, \ldots, m$. In the case of small noise or numerical roundoff errors, we can take the $m$ eigenvectors associated with the $m$ smallest eigenvalues. Numerically, this can be done via the singular value decomposition (SVD) of $L_n$. In the next chapter, we will see how the estimate of $C$ can be further improved when the samples are noisy.
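A minimal sketch of this step, assuming the veronese helper sketched in Section 2.1 and using NumPy's SVD, is given below; the function name fitting_polynomials is ours.

```python
import numpy as np

def fitting_polynomials(Z, n, m):
    """Coefficients C = (c_1, ..., c_m) of the degree-n vanishing polynomials:
    stack the Veronese embeddings of the samples (columns of Z, a D x N array)
    into L_n as in (2.16), and return the left singular vectors associated
    with the m smallest singular values."""
    L = np.stack([veronese(z, n) for z in Z.T], axis=1)   # M_n^[D] x N
    U, _, _ = np.linalg.svd(L)
    return U[:, -m:]
```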

2.3.2 Retrieving the normal vectors and bases of the subspaces

Having found the vanishing polynomials $Q(X)$, in principle we can obtain the subspace arrangement $A$ as their zero set. In practice, we are more interested in the individual subspaces of the arrangement than in their union; particularly, we want to segment the data into their respective subspaces. Thus, the problem that arises is how to retrieve the subspaces from the vanishing polynomials. Fortunately, in addition to the polynomials generating the vanishing ideal, we also have sample points from their zero set. This turns out to greatly simplify the identification of the individual constituent subspaces in the arrangement. First, we define the Jacobian of the polynomials generating the vanishing ideal.

Definition (Jacobian Matrix). The Jacobian matrix of the polynomials $Q(X) = (q_1(X), q_2(X), \ldots, q_m(X))^T$ is the matrix
$$ \mathcal{J}(Q)(X) = \begin{pmatrix} \frac{\partial q_1}{\partial X_1} & \cdots & \frac{\partial q_1}{\partial X_D} \\ \vdots & & \vdots \\ \frac{\partial q_m}{\partial X_1} & \cdots & \frac{\partial q_m}{\partial X_D} \end{pmatrix} \in \mathbb{R}^{m \times D}. \qquad (2.18) $$

Pick one sample $z_i$ per subspace $V_i$ (not in any of the other subspaces).² Evaluate the Jacobian matrix at $z_i$, and we obtain an $m \times D$ matrix $\mathcal{J}(Q)(z_i)$. It is easy to verify that the rows of $\mathcal{J}(Q)(z_i)$ span the orthogonal complement $V_i^\perp$ of $V_i$. Thus, a basis of $V_i$ can be computed from the (right) null space of $\mathcal{J}(Q)(z_i)$, say from the SVD of $\mathcal{J}(Q)(z_i)$, in a manner similar to the computation of $C$.

    ² There exist many proposals in the literature for picking such a point when the samples are noisy. In the next chapter, we will provide a scheme that does not rely on the choice of the point, which becomes more stable in the presence of data noise.
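The sketch below illustrates this step. For simplicity it evaluates the Jacobian (2.18) by central differences rather than by differentiating the Veronese map analytically; it again assumes the veronese helper from Section 2.1, and the names jacobian_Q and subspace_basis are ours.

```python
import numpy as np

def jacobian_Q(C, z, n, eps=1e-6):
    """Numerical m x D Jacobian of Q(X) = C^T nu_n(X) at z, by central
    differences (an analytic derivative of the Veronese map would also do)."""
    D = len(z)
    J = np.zeros((C.shape[1], D))
    for j in range(D):
        e = np.zeros(D); e[j] = eps
        J[:, j] = C.T @ (veronese(z + e, n) - veronese(z - e, n)) / (2 * eps)
    return J

def subspace_basis(C, z, n, d):
    """Basis (D x d) of the subspace containing z, from the right null space
    of the Jacobian at z (its trailing right singular vectors)."""
    _, _, Vt = np.linalg.svd(jacobian_Q(C, z, n))
    return Vt[-d:].T
```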

2.3.3 The algebraic GPCA algorithm and its variations

For future reference, we summarize the above algebraic process in Algorithm 1 below, which is also called the polynomial differentiation algorithm (PDA) [13]. Algorithm 1 applies to a very idealistic situation in which the samples have very low noise and the number and dimensions of the subspaces are all known. If any of those conditions is changed, the algorithm needs to be modified accordingly.

Algorithm 1 (Polynomial Differentiation Algorithm). Given a set of samples $F = \{z_1, z_2, \ldots, z_N\}$ from $n$ linear subspaces of dimensions $d_1, d_2, \ldots, d_n$ in $\mathbb{R}^D$:
1: Construct the matrix $L_n = \big( \nu_n(z_1), \nu_n(z_2), \ldots, \nu_n(z_N) \big)$.
2: Compute the singular value decomposition (SVD) of $L_n$ and let $C$ be the singular vectors associated with the $m = h_I(n)$ smallest singular values.
3: Construct the polynomials $Q(X) = C^T \nu_n(X)$.
4: for all $1 \le i \le n$ do
5:   Pick one point $z_i$ per subspace, and compute the Jacobian $\mathcal{J}(Q)(z_i)$.
6:   Compute a basis $B_i = (b_1, b_2, \ldots, b_{d_i})$ of $V_i$ from the right null space of $\mathcal{J}(Q)(z_i)$ via the singular value decomposition of $\mathcal{J}(Q)(z_i)$.
7:   Assign the samples $z_j$ that satisfy $(I - B_i B_i^T) z_j = 0$ to the subspace $V_i$.
8: end for

For instance, we know that the lowest degree of the polynomials that vanish on the given data set can be strictly lower than the number of subspaces. If the number of subspaces is not known, the derivatives of these polynomials of the lowest degree lead to a super subspace arrangement $A'$ that contains the original arrangement: $A \subseteq A'$. Thus, we can recursively apply GPCA to samples in each subspace of $A'$. In principle, the process will stop when all the subspaces in the original arrangement are found. In the literature, this is known as Recursive GPCA [30]. However, if the samples are noisy, the stopping criterion becomes quite elusive. There are several other variations to PDA when the samples are corrupted by noise or outliers. We will discuss some of the important variations in the next two chapters.
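As a toy sanity check of Algorithm 1 in the idealistic noise-free setting, the following Python snippet (assuming the fitting_polynomials, subspace_basis, and veronese sketches above) draws noiseless samples from two lines and a plane in $\mathbb{R}^3$ (configuration (2, 1, 1), so $n = 3$ and $m = h_I(3) = 4$), verifies that the fitted polynomials vanish on the samples, and recovers the plane from the derivatives at one of its sample points.

```python
import numpy as np

rng = np.random.default_rng(1)
plane = np.linalg.qr(rng.standard_normal((3, 2)))[0] @ rng.standard_normal((2, 200))
line1 = rng.standard_normal((3, 1)) @ rng.standard_normal((1, 100))
line2 = rng.standard_normal((3, 1)) @ rng.standard_normal((1, 100))
Z = np.hstack([plane, line1, line2])            # 3 x 400 noiseless samples

C = fitting_polynomials(Z, n=3, m=4)
# the fitted polynomials vanish on the samples (up to round-off)
print(np.abs(C.T @ np.stack([veronese(z, 3) for z in Z.T], axis=1)).max())
# recover the plane from the derivatives at one of its sample points
B1 = subspace_basis(C, Z[:, 0], n=3, d=2)
print(np.linalg.norm(plane - B1 @ (B1.T @ plane)))   # approximately zero
```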

CHAPTER 3  ESTIMATION OF SUBSPACE ARRANGEMENTS FROM NOISY SAMPLES

When the samples from a subspace arrangement are corrupted by noise, estimating the vanishing polynomials and subsequently retrieving the subspaces become a statistical problem. In this case, the embedded data matrix will be of full rank and the vanishing polynomials can no longer be retrieved directly from its null space. Likewise, the derivatives of the vanishing polynomials at a noisy sample point no longer span the orthogonal complement to the underlying subspace. Thus, neither the dimension nor the basis of the subspace can be obtained directly from the derivatives. The effects of common data noise in the process of GPCA are illustrated in Figure 3.1, in comparison with Figure 2.3 in the noise-free case.

Figure 3.1: The effects of data noise in the process of GPCA.

We discuss how to estimate the vanishing polynomials from noisy samples in Section 3.1, which is inspired by the work of [34] with special treatment given to homogeneous

polynomials. In Section 3.2, we show how to modify the algebraic GPCA algorithm with a multiple-hypotheses voting scheme to estimate the subspaces. This voting-based GPCA algorithm has been shown to outperform other extant variations.

3.1 Estimation of Vanishing Polynomials

From the previous chapter, we know that GPCA is based on the concept that we are able to correctly identify a set of (linearly independent) polynomials $Q(X) = (q_1(X), q_2(X), \ldots, q_m(X))^T$, say of degree $n$, whose zero set is exactly the subspace arrangement
$$ A = V_1 \cup V_2 \cup \cdots \cup V_n = \{ z \in \mathbb{R}^D : Q(z) = 0 \}. \qquad (3.1) $$
For noisy samples, the algebraic GPCA algorithm is modified by replacing the null space of the embedded data matrix $L_n$ by the eigenspace associated with the smallest eigenvalues. In order for such a least-squares fitting
$$ \hat c = \arg\min_{c} \sum_{i=1}^{N} \| \nu_n(z_i)^T c \|^2 \qquad (3.2) $$
to be statistically optimal, one needs to assume that the embedded data vector $\nu_n(z)$ has a Gaussian distribution. In practice, it is often more natural and meaningful to assume instead that the samples $z_i$ themselves are corrupted by Gaussian noise. That is, we assume that for each sample point $z_i$,
$$ z_i = \hat z_i(c) + n_i, \quad i = 1, 2, \ldots, N, \qquad (3.3) $$
where $\hat z_i(c)$ is a point on the subspace arrangement determined by $c$, and $n_i$ is an independent zero-mean Gaussian random noise added to $\hat z_i(c)$. If the arrangement is clearly indicated from the context, we also write $\hat z_i(c)$ as $\hat z$. It is easy to verify that with respect to this noise model, the embedded data vector $\nu_n(z_i)$ no longer has a Gaussian distribution¹

and subsequently the least-squares fitting no longer gives the optimal estimate of the vanishing polynomials. In fact, under the Gaussian noise model, the maximum-likelihood estimate minimizes the mean square distance:
$$ c^* = \arg\min_{c} \frac{1}{N} \sum_{i=1}^{N} \| z_i - \hat z_i(c) \|^2. \qquad (3.4) $$
However, it is difficult to minimize (3.4) because the closest point $\hat z_i(c)$ to $z_i$ is a complicated function of the polynomial coefficients $c$. To resolve this difficulty, in practice we often use the first-order approximation of $z_i - \hat z_i$ as a replacement for the mean square distance. This leads to the Sampson distance that we now introduce.

3.1.1 Sampson distances

We assume that the polynomials in $Q(X)$ are linearly independent. Given a point $z$ close to the zero set of $Q(X)$, i.e., the subspace arrangement $A$, we let $\hat z$ denote the point closest to $z$ on $A$. Using the Taylor series of $Q(X)$ expanded at $z$, the value of $Q(X)$ at $\hat z$ is then given by
$$ Q(\hat z) = Q(z) + \mathcal{J}(Q)(z)(\hat z - z) + O(\|\hat z - z\|^2). \qquad (3.5) $$
After ignoring the higher-order terms and noting that $Q(\hat z) = 0$, we have
$$ z - \hat z \approx \big( \mathcal{J}(Q)(z)^T \mathcal{J}(Q)(z) \big)^{\dagger} \mathcal{J}(Q)(z)^T Q(z) \in \mathbb{R}^D, \qquad (3.6) $$
where $\big( \mathcal{J}(Q)(z)^T \mathcal{J}(Q)(z) \big)^{\dagger}$ is the pseudoinverse of the matrix $\mathcal{J}(Q)(z)^T \mathcal{J}(Q)(z)$. Thus, the approximate square distance from $z$ to $A$ is given by
$$ \| z - \hat z \|^2 \approx Q(z)^T \big( \mathcal{J}(Q)(z) \mathcal{J}(Q)(z)^T \big)^{\dagger} Q(z) \in \mathbb{R}. \qquad (3.7) $$

    ¹ For example, if two random variables $x_1$ and $x_2$ are standard normal distributed, then $x_1^2$ has a $\chi^2$ distribution, and $z = x_1 x_2$ follows a distribution with density $p(z) = \frac{1}{\pi} K_0(|z|)$, where $K_0$ is a modified Bessel function of the second kind.

The expression on the right-hand side is known as the Sampson distance [35]. Thus, the average Sampson distance
$$ \frac{1}{N} \sum_{i=1}^{N} Q(z_i)^T \big( \mathcal{J}(Q)(z_i) \mathcal{J}(Q)(z_i)^T \big)^{\dagger} Q(z_i) \qquad (3.8) $$
is an approximation of the mean square distance (3.4). Minimizing the Sampson distance typically leads to a good approximation to the maximum-likelihood estimate that minimizes the mean square distance.

There is, however, a certain redundancy in the expression of the Sampson distance. If $A$ is the zero set of $Q(X)$, it is also the zero set of the polynomials $\tilde Q(X) = M Q(X)$ for any nonsingular matrix $M \in \mathbb{R}^{m \times m}$. It is easy to check that the Sampson distance (3.7) is invariant under the nonsingular linear transformation $M$. Thus, the estimate of the polynomials in $Q$ that minimizes the average Sampson distance (or the mean square error) is not unique, at least not in terms of the coefficients of the polynomials in $Q(X)$.

One way to reduce the redundancy is to impose some constraints on the coefficients of the polynomials in $Q(X)$. Notice that $\mathcal{J}(\tilde Q)(z_i) \mathcal{J}(\tilde Q)(z_i)^T = M \mathcal{J}(Q)(z_i) \mathcal{J}(Q)(z_i)^T M^T$, and if there is no polynomial of lower degree (than those in $Q(X)$) that vanishes on $A$, the matrix
$$ \frac{1}{N} \sum_{i=1}^{N} \mathcal{J}(Q)(z_i) \mathcal{J}(Q)(z_i)^T \in \mathbb{R}^{m \times m} $$
is a positive-definite symmetric matrix. Therefore, we can choose the matrix $M$ such that the matrix below is the identity:
$$ \frac{1}{N} \sum_{i=1}^{N} \mathcal{J}(Q)(z_i) \mathcal{J}(Q)(z_i)^T = I_{m \times m}. \qquad (3.9) $$
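For reference, the Sampson distance (3.7) and its average (3.8) are straightforward to evaluate once $Q(z)$ and the Jacobian are available. The sketch below assumes the veronese and jacobian_Q helpers sketched in Chapter 2 and uses the Moore-Penrose pseudoinverse; the function names are ours.

```python
import numpy as np

def sampson_distance_sq(C, z, n):
    """Approximate squared distance (3.7) from z to the zero set of
    Q(X) = C^T nu_n(X):  Q(z)^T (J J^T)^+ Q(z), with J = J(Q)(z)."""
    Qz = C.T @ veronese(z, n)
    J = jacobian_Q(C, z, n)
    return float(Qz @ np.linalg.pinv(J @ J.T) @ Qz)

def mean_sampson_distance(C, Z, n):
    """Average Sampson distance (3.8) over the columns of Z (D x N)."""
    return float(np.mean([sampson_distance_sq(C, z, n) for z in Z.T]))
```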

Thus, the problem of minimizing the average Sampson distance now becomes a constrained nonlinear optimization problem:
$$ Q^* = \arg\min_{Q} \frac{1}{N} \sum_{i=1}^{N} Q(z_i)^T \big( \mathcal{J}(Q)(z_i) \mathcal{J}(Q)(z_i)^T \big)^{\dagger} Q(z_i), \quad \text{subject to} \quad \frac{1}{N} \sum_{i=1}^{N} \mathcal{J}(Q)(z_i) \mathcal{J}(Q)(z_i)^T = I_{m \times m}. \qquad (3.10) $$
Many nonlinear optimization algorithms (such as Levenberg-Marquardt [7]) can be employed to minimize the above objective function to a local solution via iterative gradient-descent techniques.

3.1.2 Generalized eigenvector fit

Due to the nonlinearity of Equation (3.10), the convergence of its optimization algorithm will be relatively slow, and in general a good initialization is required for the algorithm to reach the global minimum. In the following, we will propose a simplified linear optimization problem, which is a good approximation of the nonlinear problem. Notice that the linear transformations that preserve the identity (3.9) are unitary transformations, the group of which is denoted by $O(m) = \{ R \in \mathbb{R}^{m \times m} : R^T R = I_{m \times m} \}$. Obviously, the least-squares fitting error is invariant under unitary transformations: $\| R Q(z) \|^2 = \| Q(z) \|^2$. In addition, as the identity matrix $I_{m \times m}$ is the average of the matrices $\mathcal{J}(Q)(z_i) \mathcal{J}(Q)(z_i)^T$, we can use the identity matrix to approximate each $\mathcal{J}(Q)(z_i) \mathcal{J}(Q)(z_i)^T$. With this approximation, the Sampson distance (3.7) becomes the least-squares fitting error:
$$ Q(z)^T \big( \mathcal{J}(Q)(z) \mathcal{J}(Q)(z)^T \big)^{\dagger} Q(z) \approx Q(z)^T Q(z) = \| Q(z) \|^2. \qquad (3.11) $$
This leads to the following constrained optimization problem:
$$ Q^* = \arg\min_{Q} \frac{1}{N} \sum_{i=1}^{N} \| Q(z_i) \|^2, \quad \text{subject to} \quad \frac{1}{N} \sum_{i=1}^{N} \mathcal{J}(Q)(z_i) \mathcal{J}(Q)(z_i)^T = I_{m \times m}. \qquad (3.12) $$
This problem has a simple linear algebraic solution. Without loss of generality, we assume that all the polynomials in $Q(X)$ are of degree $n$ and there is no polynomial

of degree strictly less than $n$ that vanishes on the subspace arrangement $A$ of interest. Homogeneous polynomials of degree $n$ have the form
$$ q_i(X) = \nu_n(X)^T c_i, \quad i = 1, 2, \ldots, m. \qquad (3.13) $$
Let $C := (c_1, c_2, \ldots, c_m)$. Then we have $Q(X) = C^T \nu_n(X)$ and $\mathcal{J}(Q)(X) = C^T \nabla \nu_n(X)$. Define two matrices
$$ \Sigma := \frac{1}{N} \sum_{i=1}^{N} \nu_n(z_i) \nu_n(z_i)^T, \qquad \Gamma := \frac{1}{N} \sum_{i=1}^{N} \nabla \nu_n(z_i) \, \nabla \nu_n(z_i)^T. \qquad (3.14) $$
Using these notations, we rewrite the above optimization problem (3.12) as
$$ C^* = \arg\min_{C} \operatorname{Trace}(C^T \Sigma C), \quad \text{subject to} \quad C^T \Gamma C = I_{m \times m}. \qquad (3.15) $$
In comparison, the naive least-squares fitting (3.2) minimizes the same objective function but subject to a different constraint: $C^T C = I_{m \times m}$. Using Lagrange multipliers and the necessary conditions for minima, one can show that the optimal solution $C^*$ is such that its $i$th column $c_i$ is the $i$th generalized eigenvector of the matrix pair $(\Sigma, \Gamma)$:
$$ \Sigma c_i = \lambda_i \Gamma c_i, \quad i = 1, 2, \ldots, m, \qquad (3.16) $$
where $0 \le \lambda_1 \le \lambda_2 \le \cdots \le \lambda_m$ are the $m$ smallest generalized eigenvalues of $(\Sigma, \Gamma)$.
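A minimal sketch of the generalized eigenvector fit (3.15)-(3.16) is given below. It assumes the veronese helper from Chapter 2, approximates the derivative of the Veronese map numerically, and relies on SciPy's symmetric-definite generalized eigensolver (which also enforces the normalization $C^T \Gamma C = I$, provided $\Gamma$ is positive definite); the function names are ours.

```python
import numpy as np
from scipy.linalg import eigh

def veronese_jacobian(z, n, eps=1e-6):
    """Numerical M x D Jacobian of the Veronese map nu_n at z."""
    D = len(z)
    return np.stack([(veronese(z + e, n) - veronese(z - e, n)) / (2 * eps)
                     for e in np.eye(D) * eps], axis=1)

def generalized_eigenvector_fit(Z, n, m):
    """Solve (3.15): return C = (c_1, ..., c_m), the generalized eigenvectors
    of (Sigma, Gamma) with the m smallest generalized eigenvalues."""
    N = Z.shape[1]
    V = np.stack([veronese(z, n) for z in Z.T], axis=1)       # M x N
    Sigma = V @ V.T / N
    Gamma = sum(J @ J.T for J in (veronese_jacobian(z, n) for z in Z.T)) / N
    evals, C = eigh(Sigma, Gamma)       # ascending generalized eigenvalues
    return C[:, :m]
```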

The generalized eigenvector fit has yet another statistical explanation, from the viewpoint of (Fisher) discriminant analysis. The matrix Σ can be viewed as a measure of intraclass distances: the closer a point is to one of the subspaces, the smaller the (absolute) value of a fitting polynomial. The matrix Γ can be viewed as a measure of interclass distances: the norm of the derivative at a point in a subspace is roughly proportional to its distance to the other subspaces. According to discriminant analysis, the optimal polynomial q(X) = ν_n(X)^T c for discriminating the subspaces minimizes the Rayleigh quotient:

c* = arg min_c \frac{c^T Σ c}{c^T Γ c}.    (3.17)

It is then easy to show that the optimal solution c* is exactly a generalized eigenvector of the matrix pair (Σ, Γ). Therefore, the fitting polynomials found via the generalized eigenvector fit are the ones that are, in a sense, optimal for segmenting the multiple subspaces.

Simulations

We demonstrate how the normalization by Γ may significantly improve the eigenvalue spectrum of Σ. That is, the generalized eigenvectors of (Σ, Γ) are less sensitive to the corruption of noise than the eigenvectors of Σ, which makes the estimation of the fitting polynomials a better-conditioned problem. To see this, let us consider a set of points drawn from two lines and one plane in R^3 (see Figure 1.1): 200 points from the plane and 100 points from each line, with 6% Gaussian noise added. (The percentage is defined as the standard deviation σ of the Gaussian distribution relative to the maximal magnitude of the data set. If not specified explicitly, the maximal magnitude of a simulated data set is one in this work; hence, 6% Gaussian noise also means the standard deviation of the noise is 0.06.) As Figure 3.2 illustrates, the generalized eigenvalues of (Σ, Γ) provide a much sharper knee point than the eigenvalues of Σ. With the new spectrum, one can more easily estimate the correct number of polynomials that fit the data (in this case, four polynomials).

Figure 3.2 (a) The eigenvalues of the matrix Σ with 0% noise. (b) The eigenvalues of the matrix Σ with 6% noise. (c) The improved generalized eigenvalues of (Σ, Γ).

3.2 Estimation of Subspace Arrangements via a Voting Scheme

In the algebraic GPCA algorithm, the basis of each subspace is computed as the orthogonal complement to the derivatives of the fitting polynomials at a representative sample point. However, if the chosen point is noisy, it may cause a large error in the estimated basis and subsequently a large error in the segmentation. From a statistical point of view, a more accurate estimate of the basis can be obtained if we are able to compute an average of the derivatives at many points in the same subspace. However, a fundamental difficulty is that we do not know which points belong to the same subspace.

There is yet another issue. In the algebraic GPCA algorithm, the rank of the derivatives at each point determines the codimension of the subspace to which it belongs. One can determine the rank from the singular values of the Jacobian matrix J(Q). However, this estimated rank can be erroneous if the chosen point is noisy.

In this section, we show how to improve the estimates of the basis of each subspace by using collectively the derivatives at all the sample points. The algorithm relies on a voting method in the feature space of subspace basis parameters, which was inspired by the classical Hough transform [36, 37]. An important difference here is that we do not quantize the feature space of basis vector parameters, since it is impossible to store the whole quantized space in a computer when the dimensions of the subspaces are high.

Arrays of bases and counters

Suppose the subspace arrangement is a union of n subspaces: A = V_1 ∪ V_2 ∪ ··· ∪ V_n. Let the dimensions of the subspaces be d_1, d_2, ..., d_n and their codimensions be c_1, c_2, ..., c_n. Pick a sample point z_1. The Jacobian of the fitting polynomials Q(X) at z_1 is J(Q)(z_1). If there is no noise, then rank(J(Q)(z_1)) will be exactly the codimension of the subspace to which z_1 belongs.

When the samples are noisy, it is very difficult to determine the codimension in this way. The idea is to calculate a basis under each codimension assumption, and to invoke a voting method to search for the high-consensus bases among all the candidates over all samples.

Without loss of generality, we assume that c_1, c_2, ..., c_n have l distinct values c_1 < c_2 < ··· < c_l. As we do not know the true codimension at the sample point z_1, we compute a set of candidate bases in column form:

B_i(z_1) ∈ R^{D×c_i},  i = 1, 2, ..., l,    (3.18)

where B_i(z_1) collects the first c_i principal components of J(Q)(z_1). Thus, each B_i(z_1) is a D × c_i orthogonal matrix.

To store the basis candidates B_1(z_j), B_2(z_j), ..., B_l(z_j) for all samples j = 1, 2, ..., N, we create l arrays of bases U_1, U_2, ..., U_l, where each U_i stores all the D × c_i candidate matrices. Correspondingly, we create l arrays of voting counters u_1, u_2, ..., u_l. Suppose U_i(j) ∈ R^{D×c_i} stores a candidate basis; then u_i(j) is an integer that counts the number of sample points z_k with B_i(z_k) = U_i(j). Notice that numerically any other B_i(z_k) cannot be exactly equal to U_i(j). We introduce a similarity measure to compare two sets of basis vectors in the next subsection.

Voting for the subspace bases

With the above definitions, we now outline an algorithm that will select a set of bases for the n subspaces that achieves the highest consensus on all the sample points. Suppose J_i is the size of both the arrays U_i and u_i (i.e., the number of candidates already generated) for all i = 1, 2, ..., l. Initially, all J_i's are equal to zero.

For every sample point z_k,

1. we compute a set of basis candidates B_i(z_k), i = 1, 2, ..., l, as in (3.18);

2. for each B_i(z_k), we compare it with the bases already in the array U_i:

   (a) if B_i(z_k) = U_i(j) for some j, then increase the value of u_i(j) by one;

   (b) if B_i(z_k) is different from any of the bases in U_i, then add U_i(J_i + 1) = B_i(z_k) as a new basis to U_i, and also add a new counter u_i(J_i + 1) to u_i with the initial value u_i(J_i + 1) = 1. Set J_i ← J_i + 1.

In the end, the bases of the n subspaces are chosen to be the n bases in the arrays {U_1, U_2, ..., U_l} that have the highest votes according to the corresponding counters in {u_1, u_2, ..., u_l}. For instance, suppose the codimensions of four subspaces are 1, 3, 3, and 4 in R^5, and the distinct codimensions are c_1 = 1, c_2 = 3, and c_3 = 4. Then, after the bases are evaluated at all the samples, we select one basis candidate from U_1 and one from U_3 with the largest numbers in u_1 and u_3, respectively, and two basis candidates from U_2 with the two largest numbers in u_2.

In the above scheme, in order to compare B_i(z_k) with the bases in U_i when the data are noisy, we need to set an error tolerance. This tolerance, denoted by τ, can be a small subspace angle chosen by the user. (Please refer to [38] for numerical implementations of computing subspace angles; in MATLAB, the built-in command is subspace.) Thus, if the subspace angle difference between B_i(z_k) and U_i(j),

⟨B_i(z_k), U_i(j)⟩,    (3.19)

is less than τ, we set the new value of U_i(j) to be the average of its votes with B_i(z_k) added in:

U_i(j) ← \frac{1}{u_i(j) + 1} ( u_i(j) U_i(j) + B_i(z_k) ),    (3.20)

and increase the value of the counter u_i(j) by one. Notice that the weighted sum may no longer be an orthogonal matrix. If so, we apply the Gram-Schmidt process to make U_i(j) an orthogonal matrix again.
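The comparison and update step just described can be condensed into a few lines of MATLAB. This is only a sketch under illustrative variable names (Jz for the stacked gradients at the current sample, U and u for the candidate array and counters of one codimension hypothesis, tau for the angle tolerance); it is not taken from the released toolbox.

% One voting step for a fixed codimension hypothesis c_i (cf. (3.18)-(3.20)).
% Assumed inputs: Jz  - D x m matrix whose columns are the gradients of q_1,...,q_m at z_k,
%                 ci  - hypothesized codimension, U - cell array of candidate bases,
%                 u   - vector of counters, tau - subspace-angle tolerance (radians).
[P, ~, ~] = svd(Jz, 'econ');
B = P(:, 1:ci);                               % first c_i principal directions of the gradients

matched = false;
for j = 1:numel(U)
    if subspace(B, U{j}) < tau                % principal angle between the two bases, as in (3.19)
        U{j} = orth(u(j) * U{j} + B);         % weighted average (3.20), re-orthonormalized
        u(j) = u(j) + 1;
        matched = true;
        break;
    end
end
if ~matched                                   % open a new candidate basis with a single vote
    U{end + 1} = B;
    u(end + 1) = 1;
end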

In the case when the subspaces have different dimensions, it is possible that, for the same sample point z, more than one candidate basis B_{i_1}(z), B_{i_2}(z), ..., B_{i_J}(z) has the highest votes under multiple codimension hypotheses. Our analysis shows that this ambiguity is mainly caused by the fact that a subset of samples on a high-dimensional subspace may form the support of a low-dimensional subspace model; e.g., sample points on a 2-D plane that lie along the same direction may result in a high consensus as a line. To resolve this ambiguity, we choose the highest votes starting with the smallest codimension c_1. When a sample is associated to a subspace model of smaller codimension, its votes in the counters of higher codimensions are removed and the associated bases U_i(j) are recalculated.

We summarize the overall process as Algorithm 2, which will be referred to as GPCA-Voting. Figure 3.3 shows some illustrations of the segmentation results on subspace arrangements in R^3.

Figure 3.3 The segmentation results of GPCA-Voting on three sample sets from a subspace arrangement of dimensions (2, 1, 1) in R^3: (a) 8% noise; (b) 12% noise; (c) 16% noise.

There are important features of the above voting scheme that are quite different from the well-known statistical learning methods K-Subspaces and EM for estimating subspace arrangements. The K-Subspaces and EM algorithms iteratively update one basis for each subspace, while the voting scheme essentially keeps multiple candidate bases per subspace throughout the process. Thus, the voting algorithm does not have the same difficulty with local minima as K-Subspaces and EM do.

Algorithm 2 (Generalized Principal Component Analysis with Voting). Given a set of samples {z_1, z_2, ..., z_N} in R^D and a parameter for the angle tolerance τ, fit n linear subspaces with codimensions c_1, c_2, ..., c_n:

1: Suppose there are l distinct codimensions, ordered as c_1 < c_2 < ··· < c_l. Allocate u_1, u_2, ..., u_l to be l arrays of counters and U_1, U_2, ..., U_l to be l arrays of candidate bases.
2: Estimate the set of fitting polynomials Q(X), and compute their derivatives J(Q)(z_k) for all z_k.
3: for all samples z_k do
4:   for all 1 ≤ i ≤ l do
5:     Assume z_k is drawn from a subspace of codimension c_i. Find the first c_i principal vectors of J(Q)(z_k) and stack them into the matrix B_i(z_k) ∈ R^{D×c_i}.
6:     If ⟨B_i(z_k), U_i(j)⟩ < τ for some j, increase u_i(j) by one and reset U_i(j) to be the weighted sum in (3.20). Otherwise, create a new candidate basis in U_i and a new counter in u_i with initial value one.
7:   end for
8: end for
9: for all 1 ≤ i ≤ l do
10:   Choose the highest vote(s) in u_i with their corresponding basis/bases in U_i.
11:   Assign the samples that achieve the highest vote(s) to their subspace(s), and remove their votes in the other counters and bases of higher codimensions.
12: end for
13: Segment the remaining samples that are not associated with the bases of the highest votes based on the estimated bases.

There are other voting or random sampling methods developed in statistics and machine learning, such as the least median estimate (LME) and random sample consensus (RANSAC). These methods are similar in nature, as they compute multiple candidate models from multiple small subsets of the data and then choose the one that achieves the highest consensus (for RANSAC) or the smallest median error (for LME). The data that do not conform to the model are regarded as outliers. We will discuss these methods in the context of dealing with outliers in Chapter 4.

Simulations and comparison

We provide a comparison of several algorithms for the estimation and segmentation of subspace arrangements that we have mentioned so far. They include EM, K-Subspaces, PDA, and GPCA-Voting, as well as GPCA-Voting+K-Subspaces, which means using the model estimated by GPCA-Voting to initialize K-Subspaces. We randomly generate subspace arrangements of some prechosen dimensions. For instance, (2, 2, 1) indicates an arrangement of three subspaces of dimensions 2, 2, and 1, respectively. We then randomly draw a set of samples from them. The samples are corrupted with Gaussian noise from 0% to 12%. The errors are measured in terms of the average subspace angle difference (in degrees) between the a priori model and the estimated one. (Notice that even with prior knowledge of the subspaces, due to the samples drawn at subspace intersections and the sample noise, the sample segmentation error that measures the sample misclassification cannot be zero.) All cases are averaged over 100 trials. The performance of all the algorithms is compared in Figure 3.4.

For the (2, 2, 1) and (4, 2, 2, 1) cases, because of the different dimensions in the subspace arrangement models, K-Subspaces incurs very large estimation errors. GPCA-Voting significantly improves the performance of the algebraic PDA algorithm. Furthermore, the EM algorithm fails in these two cases.

More specifically, for the (2, 2, 1) case, EM tends to classify all samples into the first two two-dimensional subspaces, and for the (4, 2, 2, 1) case, it classifies all samples into the first four-dimensional subspace. For the hyperplane (5, 5, 5) case, the results generated by EM, K-Subspaces, and GPCA-Voting are equally accurate.

Figure 3.4 Comparison of EM, K-Subspaces, PDA, and GPCA-Voting: (a) (2, 2, 1) in R^3, sample size (200, 200, 100); (b) (4, 2, 2, 1) in R^5, sample size (400, 200, 200, 100); (c) (5, 5, 5) in R^6, sample size (600, 600, 600). GPCA-Voting+K-Subspaces means using the model estimated by GPCA-Voting to initialize K-Subspaces. The EM algorithm fails in the (2, 2, 1) and (4, 2, 2, 1) cases.

In summary, for all the models, GPCA-Voting provides accurate estimates regardless of the subspace dimensions. The segmentation is further improved by running K-Subspaces as a postprocessing step, but not by much.

In the rest of this work, we include this postprocessing step in our GPCA algorithms by default.

CHAPTER 4

ESTIMATION OF SUBSPACE ARRANGEMENTS WITH OUTLIERS

In many situations, the sample points may be contaminated by atypical samples known as outliers, in addition to the noise that we have previously discussed. The application of EM, K-Subspaces, or GPCA to samples contaminated by such outliers can lead to disastrous results. Both the estimated model and the segmentation can be far from the ground truth, as illustrated by the example in Figure 4.1.

Figure 4.1 A segmentation result of GPCA-Voting for samples drawn from two planes and one line in R^3, with 6% Gaussian noise as well as 6% outliers drawn from a uniform distribution (marked as black asterisks). Left: the ground truth. Right: estimated subspaces and the segmentation result.

In this chapter, we study some relevant robust statistical techniques that can detect or diminish the effects of outliers in estimating subspace arrangements. After a literature review, we conduct a systematic study of the robust estimation of subspace arrangements and investigate three principled approaches to robustly estimating the subspace parameters, namely, influence functions, multivariate trimming (MVT), and random sample consensus (RANSAC).

4.1 Robust Techniques: A Literature Review

Despite centuries of study, there is no universally accepted definition of outliers. Most approaches are based on one of the following assumptions:

1. Probability-based: Outliers are a set of small-probability samples with respect to the probability distribution in question. A data set is atypical if such samples constitute a significant portion of the data. Methods in this approach include M-estimators [39, 40] and their variation, multivariate trimming (MVT) [41].

2. Influence-based: Outliers are samples that have a relatively large influence on the estimated model parameters [1, 42]. The influence of a sample is normally measured by the difference between the models estimated with and without the sample.

3. Consensus-based: Outliers are samples that are not consistent with the model inferred from the remainder of the data. A measure of inconsistency is normally the error residue of the sample in question with respect to the model. Methods in this approach include the Hough transform [36], RANSAC [43], and its many variations [29, 44-48].

In computer vision, various techniques have been derived based on these three assumptions. RANSAC was first used to estimate fundamental matrices [49], and was then extended to estimate multiple homography relations [50] and a mixture of epipolar and homography relations [29]. Xu and Yuille [51] and De la Torre and Black [52] used M-estimators and MVT to robustify PCA.

For work on estimating single subspace models, we refer to [1, 42, 53, 54]. However, to the best of our knowledge, only a few methods address the robustness issue for estimating multiple subspaces. These solutions either rely heavily on a good initialization [55], or assume that the subspaces have special properties that cannot be easily generalized to other situations, e.g., orthogonality, no intersections, or equal dimensions [28, 29, 45, 56].

One important index of robustness for a method is the breakdown point [39, 42], which is the minimal percentage of outliers in a data set that can cause arbitrarily large estimation error. It can be shown that the breakdown point for most probability-based and influence-based methods is 50% [42, 54]. This drawback motivates the investigation of consensus-based methods. These techniques treat outliers as samples drawn from a model that is very different from the model of the inliers. Therefore, although the outlier percentage may be greater than 50%, the outliers may not result in a model with a higher consensus than the inlier model.

The breakdown point also depends on the definition of the model. In the context of multiple subspaces, if one chooses to estimate and extract one subspace at a time, the inlying samples from all other subspaces together with the true outliers become the outliers for the subspace of interest. Therefore, one has to adopt a consensus-based approach (e.g., RANSAC) in searching for an estimator with a high breakdown point. On the other hand, if one chooses to treat the union of the subspaces as a single model (e.g., GPCA), probability-based and influence-based approaches may also be applied to achieve good estimation, as long as the true outliers do not exceed 50% of the total sample points.

4.2 Robust Generalized Principal Component Analysis

Recall that given a set of sample points F = {z_1, z_2, ..., z_N} drawn from a union of subspaces A = V_1 ∪ V_2 ∪ ··· ∪ V_n in R^D, GPCA seeks to simultaneously infer the subspaces and segment the data points to their closest subspaces by identifying a set of polynomials Q = {q_1(z), q_2(z), ..., q_m(z)} of degree n that vanish on (or fit) all the sample data. The coefficients of a vanishing polynomial q_i correspond to the coordinates of a vector in the null space of

L_n ≐ [ν_n(z_1), ν_n(z_2), ..., ν_n(z_N)] ∈ R^{M_n^{[D]} × N},    (4.1)

where M_n^{[D]} = \binom{n+D-1}{D-1} is the number of monomials of degree n in D variables, and m := dim(Null(L_n)) is uniquely determined by the Hilbert function of the subspaces (see Section 2.2).

The breakdown points of GPCA and its variations (including GPCA-Voting) are 0%; i.e., a single added outlier with a large magnitude can arbitrarily perturb the estimated subspaces. The reason is that the coefficients of the vanishing polynomials Q = {q_1(z), q_2(z), ..., q_m(z)}, which correspond to the smallest eigenvectors of the data matrix L_n, are estimated by PCA. A single outlier may arbitrarily change the eigenspace of L_n and result in erroneous coefficients of the polynomials. Figure 4.2 illustrates the effects of an outlier on the estimation of the null space. Compared to Figure 3.1, a single outlying sample with a large magnitude can arbitrarily perturb the eigenspace of L_n.

To eliminate the effects of outliers, we are essentially seeking a robust PCA method to estimate Null(L_n) such that it is insensitive to outliers, or to reject the outliers before estimating Null(L_n). In the robust statistics literature, there are two major approaches to robustifying PCA for high-dimensional multivariate data, namely, influence functions and robust covariance estimators [1]. In this section, we first discuss a simpler situation in which the percentage of outliers is given. We introduce two methods to robustly estimate the coefficients of a set of linearly independent vanishing polynomials, by influence functions and by multivariate trimming (MVT). When the percentage of outliers is not given, we propose a method to estimate the percentage in Section 4.2.3.
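As a baseline for the robust variants that follow, the non-robust step that a single outlier can corrupt can be written out in a few lines of MATLAB; the degree-2 embedding in R^3 is used again purely for illustration, and the inputs Z and m are assumed to be given.

% Non-robust estimation of the vanishing polynomials from Null(L_n) (cf. (4.1)).
% Assumed inputs: Z (3 x N data matrix), m (number of vanishing polynomials).
nu = @(x) [x(1)^2; x(1)*x(2); x(1)*x(3); x(2)^2; x(2)*x(3); x(3)^2];  % degree-2 Veronese map
N  = size(Z, 2);
Ln = zeros(6, N);
for i = 1:N
    Ln(:, i) = nu(Z(:, i));                 % L_n = [nu_n(z_1), ..., nu_n(z_N)]
end
[U0, ~, ~] = svd(Ln, 'econ');               % assumes N >= 6 so that U0 is 6 x 6
C = U0(:, end - m + 1:end);                 % coefficients of the vanishing polynomials
% A single embedded outlier of large magnitude can rotate these trailing singular
% vectors arbitrarily, which is the 0% breakdown point discussed above.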

Figure 4.2 The effects of outliers in the process of GPCA. The black dot indicates an added outlier with a large magnitude.

4.2.1 Influence functions

When we try to estimate the parameter θ of a distribution p(z, θ) from a set of samples {z_1, z_2, ..., z_N}, every sample z_i might have an uneven effect on the estimated parameter θ̂. The samples that have relatively large effects are called influential samples, and they can be regarded as outliers [42, 57, 58]. To measure the influence of a particular sample z_i, we may compare the difference between the parameter θ̂ estimated from all N samples and the parameter θ̂_(i) estimated from all but z_i. We consider the maximum-likelihood estimate as an example:

θ̂ = arg max_θ \sum_{j=1}^{N} log p(z_j, θ),    θ̂_(i) = arg max_θ \sum_{j ≠ i} log p(z_j, θ),    (4.2)

and the influence of z_i on the estimation of θ is measured by the difference

I(z_i; θ) ≐ θ̂ − θ̂_(i).    (4.3)

The function I(z_i; θ) is also called the sample influence function in the literature of robust statistics.

If a set of sample points {z_1, z_2, ..., z_N} is drawn from a subspace arrangement A = ∪_{i=1}^{n} V_i ⊂ R^D, then GPCA relies on obtaining the set of polynomials Q(X) = {q_1(X), q_2(X), ..., q_m(X)} of degree n that vanish on the subspace arrangement. As we have discussed in Section 3.1.2, the coefficients C = (c_1, c_2, ..., c_m) of the polynomials {q_i(X) = ν_n(X)^T c_i} are estimated from the eigenvectors associated with the smallest eigenvalues of the matrix Σ for least-square fitting, or from the generalized eigenvectors of the matrix Γ^{-1}Σ. In either case, we denote the estimate by Ĉ. Notice that for our problem we are not interested in the individual vectors in Ĉ, but rather in the eigensubspace spanned by the eigenvectors: Ŝ = span(Ĉ). Therefore, the influence of the sample z_i on the estimate of the eigensubspace can be measured by

I(z_i; S) = ⟨Ŝ, Ŝ_(i)⟩,    (4.4)

where ⟨·,·⟩ denotes the subspace angle between two subspaces [38], and Ŝ_(i) is the eigensubspace estimated with the ith sample omitted. All samples are then sorted by their influence values, and the ones with the highest values are rejected as outliers and are not used for the estimation of the vanishing polynomials.

Equation (4.4) is a precise expression describing the influence of a sample on the estimation of the vanishing polynomials Q(X). However, the complexity of the resulting algorithm is rather high. Suppose we have N samples; then we need to perform the SVD N + 1 times in order to evaluate the influence values of the N samples. In light of this drawback, some first-order approximations were developed at roughly the same period as the sample influence function was proposed [57, 58], when computational resources were scarcer than they are today. In robust statistics, formulae that approximate a sample influence function are referred to as theoretical influence functions.
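Before turning to these approximations, the leave-one-out computation of (4.4) itself takes only a few lines of MATLAB, here written for the plain least-square fit and with illustrative variable names (Ln is the embedded data matrix of (4.1), m the number of vanishing polynomials):

% Sample influence (4.4) of each point on the estimated eigensubspace, by leave-one-out SVD.
% Assumed inputs: Ln (M x N embedded data matrix, N >= M), m (number of vanishing polynomials).
N = size(Ln, 2);
[U0, ~, ~] = svd(Ln, 'econ');
S_all = U0(:, end - m + 1:end);              % eigensubspace estimated from all N samples

infl = zeros(N, 1);
for i = 1:N
    Li = Ln(:, [1:i-1, i+1:N]);              % leave the i-th sample out
    [Ui, ~, ~] = svd(Li, 'econ');
    S_i = Ui(:, end - m + 1:end);
    infl(i) = subspace(S_all, S_i);          % subspace angle between the two estimates
end
% The samples with the largest entries of infl are rejected as outliers.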

While it is rather difficult to approximate the influence of each sample on the estimated subspace S, it is relatively easy to approximate the sample's influence on the individual vectors {c_1, c_2, ..., c_m}, i.e., the eigenvectors of the sample covariance matrix

Σ ≐ \frac{1}{N-1} \sum_{i=1}^{N} ν_n(z_i) ν_n(z_i)^T.    (4.5)

The basic idea is to assume that each c_j is a random vector with a cumulative distribution function (c.d.f.) F. The distribution can be perturbed by a change of the weighting ε ∈ [0, 1] of the ith sample:

F_i(ε) = (1 − ε) F + ε δ_i,    (4.6)

where δ_i denotes the c.d.f. of a random variable that takes the value z_i with probability one. When F becomes F_i(ε), let c_j(ε) be the new estimate of c_j after the change. Now we can define a theoretical influence function I(z_i; c_j) of the ith sample on c_j as the first-order approximation of the above sample influence:

c_j(ε) − c_j = I(z_i; c_j) ε + h.o.t.(ε).    (4.7)

As derived in [58], the theoretical influence function I(z_i; c_j) = lim_{ε→0} (c_j(ε) − c_j)/ε is given by

I(z_i; c_j) = \tilde{z}_j \sum_{h ≠ j} \tilde{z}_h c_h (λ_h − λ_j)^{-1} ∈ R^{M_n^{[D]}},    (4.8)

where {λ_j, c_j} are the eigenvalues and eigenvectors of the sample covariance matrix Σ, and \tilde{z}_h denotes the hth principal component (PC) of the sample z_i, i.e., its coordinate with respect to the hth eigenvector c_h of the covariance matrix Σ. Finally, the total influence of z_i on all the vectors {c_1, c_2, ..., c_m} is given by

I(z_i; C) = \sum_{j=1}^{m} ||I(z_i; c_j)||^2.    (4.9)

A further discussion of this solution can be found in [1]. Notice that in order to compute the theoretical influence function (4.8), one only needs to compute the sample covariance matrix Σ and its eigenvalues and eigenvectors once. Thus, computationally, it is much more efficient than the sample influence function.
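The closed-form expressions (4.8)-(4.9) translate into the following MATLAB sketch. It is only an illustration under assumed conventions: the embedded samples are stored as the columns of a matrix Un, the covariance is taken without mean subtraction as in (4.5), and any overall sign in (4.8) does not affect the norms in (4.9).

% Theoretical influence (4.8)-(4.9) of each embedded sample u_i = nu_n(z_i).
% Assumed inputs: Un (M x N matrix of embedded samples), m (number of vanishing polynomials).
[M, N]   = size(Un);
Sig      = (Un * Un') / (N - 1);            % sample covariance (4.5)
[Vc, Dc] = eig(Sig);
[lam, o] = sort(diag(Dc), 'descend');       % lambda_1 >= ... >= lambda_M
Vc       = Vc(:, o);

tgt  = M - m + 1:M;                         % indices of the m trailing eigenvectors c_j
infl = zeros(N, 1);
for i = 1:N
    zt = Vc' * Un(:, i);                    % principal components of the i-th sample
    s  = 0;
    for j = tgt
        h   = [1:j-1, j+1:M];
        Iij = Vc(:, h) * (zt(j) * zt(h) ./ (lam(h) - lam(j)));   % I(z_i; c_j) in (4.8)
        s   = s + norm(Iij)^2;
    end
    infl(i) = s;                            % total influence (4.9)
end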

4.2.2 Multivariate trimming

As we noticed in the estimation of the vanishing polynomials, if we view the vectors ν_n(z_1), ν_n(z_2), ..., ν_n(z_N) as samples of a random vector u = ν_n(z), the problem becomes how to robustly estimate the covariance matrix of u_1, u_2, ..., u_N. It is shown in [59] that, if both the valid samples and the outliers have a Gaussian distribution and the covariance matrix of the outliers is a scaled version of that of the valid samples, then the Mahalanobis distance

d_i = (u_i − ū)^T Σ^{-1} (u_i − ū),    (4.10)

based on the empirical sample covariance Σ = \frac{1}{N-1} \sum_{i=1}^{N} (u_i − ū)(u_i − ū)^T, is a sufficient statistic for the optimal test that maximizes the probability of the correct decision about the outliers (in the class of tests that are invariant under linear transformations). Thus, one can use d_i as a measure to down-weight or discard outlying samples while trying to estimate the correct sample covariance Σ. Depending on the choice of the down-weighting scheme, many robust covariance estimators have been developed in the literature. Among them, two methods have been widely adopted, namely, M-estimators [39] and multivariate trimming (MVT) [41].

A major constraint for robust covariance estimators is the maximal percentage of outliers in a data set that an algorithm can effectively handle, i.e., the breakdown point. Roughly speaking, for M-estimators it is inversely proportional to the dimension of the samples, and it usually becomes prohibitively low when the data dimension is higher than 20. For MVT, it is equal to the percentage of samples trimmed from the data set, which can be very high. The convergence rate of MVT is also the fastest among all methods of this kind. In the case of subspace arrangements, the dimension of u = ν_n(z), i.e., M_n^{[D]}, is normally very high. Thus, M-estimators become impractical and MVT becomes the method of choice.

The MVT method proceeds as follows. As the random vector u = ν_n(z) is not necessarily zero-mean, we first obtain a robust estimate of the mean ū of the samples {u_1, u_2, ..., u_N}, which in our implementation is the vector of componentwise medians of {u_1, u_2, ..., u_N} [41]. (We refer to [60] for a more extensive study of robust mean-value estimators.) Then we need to specify a trimming parameter α%, which is essentially equivalent to the outlier percentage. To initialize the covariance matrix Σ_0, all samples are sorted by their Euclidean distance ||u_i − ū||, and Σ_0 is calculated as

Σ_0 = \frac{1}{|U| − 1} \sum_{h ∈ U} (u_h − ū)(u_h − ū)^T,    (4.11)

where U is an index set of the first (100 − α)% of the samples with the smallest distances. In the kth iteration, the Mahalanobis distance of each sample, (u_i − ū)^T Σ_{k-1}^{-1} (u_i − ū), is calculated, and Σ_k is again calculated from the first (100 − α)% of the samples with the smallest Mahalanobis distances. The iteration terminates when the difference between Σ_{k-1} and Σ_k is small enough. To proceed with the rest of the GPCA algorithm, we treat the samples trimmed in the final iteration as the outliers, and estimate Q(X) from the last m eigenvectors of the resulting covariance matrix.
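A compact MATLAB sketch of this trimming iteration is given below; the variable names are illustrative, a fixed iteration count stands in for the convergence test, and Un holds the embedded samples ν_n(z_i) as columns.

% Multivariate trimming (MVT) on the embedded samples u_i = nu_n(z_i).
% Assumed inputs: Un (M x N matrix of embedded samples), alpha (trimming percentage).
[M, N] = size(Un);
keepN  = round((100 - alpha) / 100 * N);    % number of samples kept at each iteration
ubar   = median(Un, 2);                     % robust (componentwise median) mean estimate
R      = bsxfun(@minus, Un, ubar);          % centered samples

d = sum(R.^2, 1);                           % Euclidean distances for the initialization
[~, o] = sort(d, 'ascend');  keep = o(1:keepN);
Sig = R(:, keep) * R(:, keep)' / (keepN - 1);          % Sigma_0 in (4.11)

for k = 1:20                                % fixed number of iterations for this sketch
    d = sum(R .* (Sig \ R), 1);             % Mahalanobis distances (4.10)
    [~, o] = sort(d, 'ascend');  keep = o(1:keepN);
    Sig = R(:, keep) * R(:, keep)' / (keepN - 1);
end
outliers = true(1, N);  outliers(keep) = false;        % trimmed samples treated as outliers
% The vanishing polynomials are then read off from the last m eigenvectors of Sig.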

Example. We test and compare the performance of the three robust methods discussed above. Two synthetic data sets are used for the test. Data set one: three subspaces in R^3 of dimensions 2, 2, 1, with 200, 200, and 100 sample points drawn from the subspaces, respectively. Data set two: four subspaces in R^5 of dimensions 4, 2, 2, 1, with 400, 200, 200, and 100 sample points. For each data set, we first add 6% Gaussian noise to all the valid samples, and then generate an additional set of outliers with percentages ranging from 0% to 48% of the total sample number; the rejection rate is set to be the same as the true outlier percentage. At each percentage level, the simulation is repeated over a number of independent trials.

Notice that, because of randomness, some outliers may be very close to the subspaces and are not rejected as such, while some valid samples might be rejected as outliers because of large noise. Thus, the segmentation error that counts the misclassified sample points becomes a less effective index of the performance of the algorithms. Nevertheless, the average subspace angle difference between the a priori model and the estimated one provides a reasonable measure of the performance. Figure 4.3 shows the average subspace angle differences (in degrees). Table 4.1 summarizes the average running time of the MATLAB codes for one trial on a dual 2.7-GHz G5 Apple workstation.

Figure 4.3 Average space angle error of GPCA with the vanishing polynomials estimated by the three robust methods: (a) (2, 2, 1) in R^3; (b) (4, 2, 2, 1) in R^5.

Table 4.1 Average time for solving subspace arrangements of dimensions (2, 2, 1) in R^3 with sample numbers (200, 200, 100) by the sample influence function, the theoretical influence function, and MVT.

Outlier Percentage       0%      4%       8%       16%      24%     32%      48%
Sample Influence         5.4 s   2.5 min  2.8 min  3.7 min  5 min   7.8 min  18 min
Theoretical Influence    5.4 s   9 s      9.2 s    9.3 s    9.3 s   9.6 s    10.8 s
MVT                      5.4 s   5.4 s    5.5 s    5.6 s    5.7 s   5.7 s    5.8 s

To summarize, given a sample rejection rate that is close to the true outlier percentage, the robust covariance estimator turns out to be the most accurate and fastest method for subspace arrangements.

In fact, the MVT algorithm usually takes fewer than 10 iterations to converge regardless of the outlier percentage, and its estimate does not deteriorate when the outlier percentage increases. On the other hand, the sample influence method and the theoretical influence method give reasonable results when the outliers are less than 15%. The performances of the two methods are very close, but the sample influence method is significantly slower than the theoretical influence method.

Finally, we emphasize that although the distribution of the data in the original data space is multimodal (a mixture of multiple subspaces), the embedded data in the Veronese space become unimodal, as a single subspace model. We have shown in Chapter 2 that any sample z with ν_n(z) perpendicular to Null(L_n) is a zero of every polynomial in Q = {q_1(z), q_2(z), ..., q_m(z)}, and therefore lies on one of the subspaces in the data space. In MVT, the Mahalanobis distance is a sufficient statistic under the assumption that u = ν_n(x) has a Gaussian distribution, which is not true under the Veronese embedding. Nevertheless, this approximation of the distribution gives consistently good performance in our simulations and real experiments.

4.2.3 Estimating the outlier percentage

The above algorithms do not completely solve the outlier issue, since usually we do not know the outlier percentage for a given data set. In this subsection, we propose a means to estimate the outlier percentage. The percentage will be determined such that the GPCA algorithm returns a good subspace arrangement model from the remaining sample points; the estimate may not necessarily be close to the a priori percentage. The main idea is to conduct the outlier rejection process multiple times under different rejection rates, and to verify the goodness of the resulting models. We first illustrate the basic idea with an example.

We randomly draw a set of sample points from three subspaces of dimensions (2, 2, 1) in R^3 with a sample size of (200, 200, 100) and add 6% Gaussian noise. Then, the data are contaminated by 16% uniformly distributed outliers. We use MVT to trim out various percentages of samples ranging from 0% to 54%, and compute the maximal residual of the remaining samples at each rejection rate with respect to the model estimated by GPCA-Voting. Figure 4.4 shows the plot of the maximal residuals versus the rejection rates. The maximal sample residual reaches a plateau right after the 7% rejection rate, and the residuals decrease as the rejection rate increases. Figure 4.5 shows the segmentation results at rejection rates of 7% and 38%, respectively.

Figure 4.4 Maximal sample residuals versus rejection rates. The data set consists of three subspaces of dimensions (2, 2, 1) in R^3. The valid sample size is (200, 200, 100), and 16% uniformly distributed outliers are added in. The algorithm trims out samples with rejection rates from 0% to 54% using MVT. The maximal sample residual of the remaining sample points at each rejection rate is measured with respect to the model estimated by GPCA-Voting on the remaining sample points.

Figure 4.5 Subspace segmentation results at 7% and 38% rejection rates: (a) a priori; (b) 7% rejected; (c) 38% rejected.

In the experiment, although the 7% rejection rate is far less than the a priori 16% outlier percentage, the outliers left in the data are nevertheless close to the subspaces (in terms of their residuals with respect to the estimated model), and the resulting subspaces are close to the ground truth. We also see that MVT is moderately stable when the rejection rate is higher than the actual percentage of outliers. In this case, when the rejection rate is 38%, MVT trims out inlying samples that have relatively large magnitudes, which results in an even smaller maximal residual, as shown in Figure 4.4. Therefore, one does not have to reject the exact a priori outlier percentage in order to obtain a good estimate of the model. In the presence of both noise and outliers, e.g., Figure 4.5(b), it is impossible and unnecessary to distinguish outliers that are close to the subspaces from valid samples that have large noise.

Principle (Outlier Percentage Test). A good estimate of the outlier percentage can be determined by the influence of the outlier candidates with respect to the estimated subspace arrangement model. That is, further rejection from the data set will only result in small changes in both the estimated model parameters and the fitting error.

This principle suggests two possible approaches to determining the rejection rate from the plot of the maximal sample residuals:

1. The rejection rate can be determined by finding the first knee point, or equivalently the first plateau, in the maximal residual plot (in the above example, at 7%).

2. The rejection rate can be determined by a prespecified maximal residual threshold.

One may choose either approach based on the nature of the application at hand. However, for the first approach, it is commonly agreed in the literature that a method that finds knee points and plateaus in a plot may not perform well if the data are noisy, since both are related to the first-order derivatives of the plot.

In addition, a well-shaped plateau may not exist in a residual plot at all if the a priori outlier percentage is small, in which case the maximal sample residuals rapidly drop to a small value close to zero. Therefore, in this work, we determine the outlier percentage as the smallest one such that the maximal sample residual is smaller than a given residual threshold for several consecutive rejection rates, i.e., such that the residuals stabilize. The residual threshold can also be seen as a measure of the variance of the noise of the inlying data, and it plays a role similar to the corresponding parameter in other robust statistical techniques (in particular RANSAC, which we discuss next). In practice, we find that three consecutive trials at 1% increments work well in both simulations and real experiments. (This implicitly requires that the samples from any single subspace constitute more than 3% of the total data.)

Algorithm 3 gives an outline of the complete algorithm. We will refer to the version that uses MVT as RGPCA-MVT, and to the one that uses the sample influence function as RGPCA-Influence. (The theoretical influence function could also replace the sample influence function in robust GPCA, but our experiments show that the decline in performance is not worth the gain in speed.) Figures 4.6 and 4.7 show several segmentation results in R^3 with various outlier percentages using the MVT method. More comparisons will be given in Section 4.4 on synthetic data and in Chapter 5 on some real applications.

Algorithm 3 (Robust GPCA). Given a set of samples F = {z_1, z_2, ..., z_N} in R^D, a threshold τ for the subspace angle, and a residual threshold σ, fit n linear subspaces of codimensions c_1, c_2, ..., c_n:

1: Set a maximal possible outlier percentage M%.
2: Normalize the data such that the maximal vector magnitude is 1.
3: for all rejection rates 0 ≤ r ≤ M do
4:   F' ← remove r% of the samples from F using MVT or the influence function.
5:   Estimate the subspace bases {B̂_1, B̂_2, ..., B̂_n} by applying GPCA to F' with parameters τ and c_1, c_2, ..., c_n.
6:   Maximal residual σ_max ← max_{z ∈ F'} min_k ||z − B̂_k B̂_k^T z||.
7:   if σ_max is consistently smaller than σ then
8:     B_k ← B̂_k for k = 1, 2, ..., n. Break.
9:   end if
10: end for
11: if σ_max > σ then
12:   ERROR: the given σ is too small.
13: else
14:   Label z ∈ F as an inlier if min_k ||z − B_k B_k^T z|| < σ.
15:   Segment the inlying samples to their respective subspaces.
16: end if

Figure 4.6 Segmentation results of RGPCA-MVT on three data sets from a subspace arrangement of dimensions (2, 1, 1) in R^3: (a) 12% outliers, 10% rejected; (b) 32% outliers, 31% rejected; (c) 48% outliers, 49% rejected.

Figure 4.7 Segmentation results of RGPCA-MVT on three data sets from a subspace arrangement of dimensions (2, 2, 1) in R^3: (a) 12% outliers, 6% rejected; (b) 32% outliers, 23% rejected; (c) 48% outliers, 34% rejected.

4.3 Random Sample Consensus

As we have mentioned before, there are two ways to apply robust techniques to estimate a subspace arrangement model. Since in our problem the number and the dimensions of the subspaces are given, one can either sample a larger subset to estimate the union of the subspaces (referred to as RANSAC-on-Union), or estimate and extract one subspace at a time (referred to as RANSAC-on-Subspaces). In the computer vision literature, the latter approach dominates most applications [29, 45, 47, 48, 50], because applying RANSAC to individual simple geometric models has been well studied and the algorithm complexity is much lower.

To illustrate the dramatic difference in complexity, suppose a set of 1800 valid samples is evenly drawn from three hyperplanes of dimension five in R^6, and the data set is further contaminated by 20% outliers. To estimate a single subspace model, we need five points (plus the origin), and with respect to a single subspace the outlier percentage is 73.3%.

To reach 95% confidence that one subset is outlier free, one needs to sample 2220 subsets (while the joint confidence of the whole model will be slightly less than 95%). On the other hand, to estimate the three subspaces as a union, we need to sample 15 points, evenly partition the set into three subsets, and estimate three subspace bases from the three subsets, respectively. To reach 95% confidence that one such sample is outlier free, one needs to sample about 7.27 billion subsets.

However, our experiments show that RANSAC-on-Union can achieve good accuracy as long as enough iterations are provided. We implemented this method in MATLAB on a dual 2.7-GHz G5 Apple workstation, and the results of three simulations are shown in Table 4.2. We use the average subspace angle error between the a priori model and the estimated one to measure the accuracy of the algorithm. The numbers of iterations shown in the table are in fact much smaller than the ones required for 95% confidence.

Table 4.2 RANSAC-on-Union applied to three subspace arrangement models: (2, 2, 1) in R^3, (4, 2, 2, 1) in R^5, and (5, 5, 5) in R^6 (with 6% Gaussian noise and 24% outliers).

Model                  (2, 2, 1) in R^3   (4, 2, 2, 1) in R^5    (5, 5, 5) in R^6
Inlier Size            (200, 200, 100)    (400, 200, 200, 100)   (600, 600, 600)
Number of Samplings    5,000              1,000,000              10,000,000
Time Lapse             27 s               5 h                    > 2 days
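The subset counts quoted above follow from the standard RANSAC trial-count formula k = log(1 − p) / log(1 − w^s), where p is the desired confidence, w the probability that a random sample is an inlier of the model being fit, and s the minimal sample size. The following MATLAB lines check the single-subspace figure (the counting convention for the union case is slightly different, so it is not reproduced here):

% RANSAC trial count for one 5-D hyperplane in R^6 (600 of 2250 points are its inliers).
p = 0.95;                % desired confidence of drawing at least one outlier-free subset
w = 600 / 2250;          % inlier probability for that subspace (73.3% outliers)
s = 5;                   % minimal number of points to fit the hyperplane (plus the origin)
k = ceil(log(1 - p) / log(1 - w^s))   % about 2.2e3, consistent with the 2220 quoted above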

RANSAC on individual subspaces

To sequentially estimate multiple subspaces via RANSAC, most solutions assume that either the subspaces have the same dimensions or the data do not contain samples at the intersections of the subspaces. Special care has to be taken when we deal with a more general situation. In fact, many complications arise when we try to apply RANSAC to a mixture of subspaces: (1) if one tries to find a higher-dimensional subspace first, the model may overfit one or multiple lower-dimensional subspaces, and such fits are more likely to rank high in the consensus test; (2) if one tries to estimate a lower-dimensional subspace first, a subset from a higher-dimensional subspace, or even from an intersection of subspaces, may well win out first in the consensus test.

These types of ambiguities have been well known in computer vision as a potential problem of RANSAC. In multiple-view geometry, if both a planar object and a general 3-D object are present, a RANSAC process that searches for a fundamental matrix may overfit points from the planar object, since a homography is a degenerate epipolar relation, which causes an erroneous estimation of the epipolar geometry [48]. However, in the general subspace segmentation problem the situation is much more delicate, as multiple subspaces or samples at intersections may give a high consensus to a subspace model in the presence of both noise and outliers. These seemingly rare situations are in fact very common in many applications. For instance, in hybrid linear systems, a single linear system may satisfy a subspace constraint of an arbitrary dimension, and the switching between multiple systems will generate input-output data that are close to subspace intersections [61]. Figure 4.8 illustrates some of these complications in a simple toy example.

Recently, two modifications have been proposed to address this problem:

1. Starting with the highest-dimensional model, after a minimal sample set of the model achieves a high consensus, the algorithm further verifies whether this minimal subset can be well fit by one or multiple lower-dimensional models. If so, the algorithm re-estimates the model from the remaining samples outside the support of the degenerate model [48].

2. Alternatively, Schindler and Suter [29] suggested estimating more model hypotheses than the true number of subspaces, and then using a model selection criterion to decide which models best represent the mixed data.

Figure 4.8 Possible segmentation results (in color) of fitting the first plane model to samples drawn from four subspaces of dimensions (2, 2, 1, 1): (a) the four subspaces in R^3 with and without 32% outliers; (b) three possible fits for the first plane model; (c) three possible fits with outliers. The support of this 2-D model may come from samples on a true plane, or from multiple degenerate line models. The degeneracy becomes more difficult to detect with outliers, as shown in (c).

For the second method, the model selection step, via either minimum description length [29] or maximum-likelihood estimation [50], inevitably introduces new heuristic parameters into the process. The algorithmic complexity of over-sampling more models is usually also high. Therefore, in this work, we implement RANSAC-on-Subspaces using the first method. The performance of RANSAC-on-Subspaces relies heavily on the degeneracy testing for subspaces of different dimensions. With more uniformly distributed outliers added, the degeneracy testing becomes less effective, which leads to a decline in accuracy.

4.4 Simulations and Comparison

We test RANSAC-on-Subspaces and RGPCA on the three subspace models in Table 4.2 with various outlier percentages. Each data set has a maximal magnitude of one. For RANSAC, the boundary threshold is fixed at 0.1. For RGPCA using either MVT or Influence, the residual threshold σ is fixed at 0.04, and the angle threshold τ is fixed at 0.3 rad. Figure 4.9 shows the average angle errors of the estimation. Table 4.3 shows the average time for the three algorithms with 24% outliers.

Figure 4.9 Average space angle errors of RANSAC-on-Subspaces, RGPCA-MVT, and RGPCA-Influence (50 trials at each percentage): (a) (2, 2, 1) in R^3; (b) (4, 2, 2, 1) in R^5; (c) (5, 5, 5) in R^6.

Table 4.3 Average time for applying RANSAC-on-Subspaces, RGPCA-Influence, and RGPCA-MVT to the three subspace arrangement models with 24% outliers: (2, 2, 1) in R^3, (4, 2, 2, 1) in R^5, and (5, 5, 5) in R^6.

Arrangement            (2, 2, 1) in R^3   (4, 2, 2, 1) in R^5   (5, 5, 5) in R^6
RANSAC-on-Subspaces    44 s               5.1 min               3.4 min
RGPCA-Influence        3 min              58 min                146 min
RGPCA-MVT              46 s               23 min                8 min

In terms of speed, RGPCA-MVT and RGPCA-Influence run slower than RANSAC-on-Subspaces, particularly when the number and dimensions of the subspaces are high. With respect to accuracy, RGPCA-MVT gives the best overall estimation of the models on all three synthetic data sets. The subspace angle errors of RGPCA-MVT in the (2, 2, 1) and (4, 2, 2, 1) cases are both within 2 degrees with up to 50% outliers.

For the (5, 5, 5) case, the worst angle error is 11 degrees. Finally, RANSAC-on-Subspaces outperforms RGPCA-Influence in most cases, but the performance of both algorithms declines rapidly at high outlier percentages.

CHAPTER 5

APPLICATIONS

There have been many successful applications of the PDA algorithm and its statistical variations in a wide range of research areas, including computer vision, image processing, pattern recognition, and system identification. In this chapter, we present a couple of representative examples that demonstrate the basic reasons why subspace arrangements may become the model of choice in real-world problems.

Roughly speaking, there are two categories of applications in which subspace arrangements have proven useful. In the first category, the given mixed data are known to have a piecewise-linear structure. That is, the data can be partitioned into different subsets such that each subset is drawn from a linear subspace model. In this case we can apply GPCA to extract such hybrid linear structures. In the second category, the exact structure of the given data is more complex, but known to be somewhat heterogeneous or even nonlinear. Then we can apply GPCA to find a hybrid linear model that approximates the data. The resulting model provides a compact (lossy) representation of the data as well as a partition of the data into homogeneous subsets. This is the case with sparse image and video representation in image processing.

In this work, our focus has been on the accuracy, stability, and robustness of a subspace segmentation algorithm with a valid underlying subspace structure.

Therefore, in the following, we demonstrate the performance of robust GPCA and RANSAC-on-Subspaces in two classical applications in computer vision. In Section 5.1, we study the motion segmentation problem with an affine camera model, and in Section 5.2, we test our algorithms on vanishing-point detection. To fairly evaluate the performance of the methods as generic subspace segmentation algorithms, no information other than the coordinates of the tracked features is used in the experiments. Finally, we briefly summarize other promising applications in Section 5.3.

5.1 Motion Segmentation with an Affine Camera Model

An observed scene in a video sequence typically consists of multiple objects moving independently against the background. Suppose that multiple feature points are detected on the objects and the background. These could be either corner points or other local texture patterns that are invariant to camera motions. An important problem in computer vision is how to group the feature points that belong to the different moving objects.

More precisely, denote by {X_1, X_2, ..., X_N} ⊂ R^3 a set of points in the 3-D scene that are attached either to the moving objects or to the background. Suppose the video sequence contains F frames of images. The image of X_i in the jth image frame is denoted by m_{ij} ∈ R^2, a point in the two-dimensional image plane. Then the problem is how to segment the images {m_{1j}, m_{2j}, ..., m_{Nj}} so that, for each subset, the corresponding X_i's belong to the same moving object or to the background in the 3-D scene. Of course, the answer depends on how the 3-D points X_1, X_2, ..., X_N are projected onto the image plane (i.e., the camera model) and on what class of motions we consider for X_i or for m_{ij} (i.e., the 3-D or 2-D motion). Nevertheless, it has been shown that the motion segmentation problem can be converted to a subspace-segmentation problem for most camera models that have been considered in computer vision [62].

We present below one such example that has some practical importance. For feature points on one object, the projection can be modeled by an affine camera model from R^3 to R^2 (a more precise model for conventional cameras is a perspective projection; however, when the objects have a small depth variation relative to their distances from the camera, an affine projection model is a good approximation):

m_{ij} = A_j X_i + b_j ∈ R^2  for all j = 1, 2, ..., F,    (5.1)

where A_j ∈ R^{2×3} and b_j ∈ R^2 are the affine camera parameters for the jth frame. If we stack all the image correspondences of a point in 3-D into a 2F-dimensional vector

z_i = [m_{i1}^T, m_{i2}^T, ..., m_{iF}^T]^T ∈ R^{2F},  i = 1, 2, ..., N,    (5.2)

we obtain a matrix W of dimension 2F × N:

W ≐ (z_1 ··· z_N)_{2F×N} = \begin{pmatrix} A_1 & b_1 \\ \vdots & \vdots \\ A_F & b_F \end{pmatrix}_{2F×4} \begin{pmatrix} X_1 & \cdots & X_N \\ 1 & \cdots & 1 \end{pmatrix}_{4×N}.    (5.3)

Notice that the product of the two matrices on the right-hand side of (5.3) has rank at most 4. It follows that rank(W) ≤ 4. Hence, the 2-D trajectories of the image points, i.e., z_1, z_2, ..., z_N, live in a subspace of R^{2F} of dimension less than 5. Furthermore, if the object is a planar structure, or a 3-D structure undergoing a planar motion, one can easily show that the z_i's lie in a three-dimensional subspace [28].

Five motion sequences, shown in Figure 5.1, are used for testing. The image feature tracker was implemented based on [63], which naturally produces mistracked outliers.

We first project the stacked vectors {z_1, z_2, ..., z_N} onto a five-dimensional space using PCA. Then the three robust algorithms, namely RGPCA-Influence, RGPCA-MVT, and RANSAC-on-Subspaces, are used to estimate multiple three-dimensional or four-dimensional subspaces, with the number of subspaces and their dimensions given.

Figure 5.1 The first and last frames of five video sequences with the tracked image features superimposed: (a) Sequence A, 140 points in 27 frames; (b) Sequence B, 52 points in 28 frames; (c) Sequence C, 107 points in 81 frames; (d) Sequence D, 126 points in 81 frames; (e) Sequence E, 224 points in 16 frames.
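This preprocessing step can be sketched in MATLAB as follows; the layout of the tracked coordinates (an F x N x 2 array named tracks) is a hypothetical convention chosen only for the illustration, and the projection is done without mean subtraction so that the linear-subspace structure through the origin is preserved.

% Stack the 2-D trajectories into the 2F x N data matrix W of (5.2)-(5.3)
% and project onto a five-dimensional space with PCA before segmentation.
% Assumed input: tracks, an F x N x 2 array with tracks(j, i, :) the image of point i in frame j.
[F, N, ~] = size(tracks);
W = zeros(2 * F, N);
for j = 1:F
    W(2*j - 1, :) = tracks(j, :, 1);     % x-coordinates in frame j
    W(2*j,     :) = tracks(j, :, 2);     % y-coordinates in frame j
end

[Uw, ~, ~] = svd(W, 'econ');
Z = Uw(:, 1:5)' * W;                     % 5 x N projected trajectories handed to the
                                         % subspace-segmentation algorithms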

We separate the sequences into two categories. For the first two sequences, the objects all undergo rigid-body motions and the cameras are far away from the scene; therefore, the affine camera model is valid. For the last three sequences, the affine model assumption is not strictly satisfied: in Sequence C, the camera is very close to the man, and the head motion is not a rigid-body motion; in Sequence D, the camera is also close to the scene, and partial occlusion of the objects causes falsely tracked features; and in Sequence E, the motion of the man on the segway is not a rigid-body motion, and the camera zooms in on the man's face. For such sequences, the stacked vectors {z_1, z_2, ..., z_N} satisfy more complex models than 3-D or 4-D subspaces. Nevertheless, we command RANSAC and RGPCA to robustly fit subspaces to the data. We also use Sequences C, D, and E to test the algorithms on subspaces of different dimensions. For each of these sequences, the planar background is modeled as a three-dimensional subspace, and the foreground objects are modeled as four-dimensional subspaces.

Figure 5.2 shows the segmentation results. All parameters are tuned to achieve the best results. We can see that the three algorithms perform reasonably well on the first four sequences, considering that no other imagery information is used to optimize the segmentation; the two RGPCA algorithms perform slightly better on Sequences B and D. For Sequence E, because the number of samples tracked from the wall significantly exceeds the number of samples from the man, RGPCA-MVT in fact trims out the samples tracked from the man and uses two subspace models to overfit the samples from the wall, which results in an even smaller sample residual. On the other hand, RGPCA-Influence and RANSAC still achieve good segmentation.

Figure 5.2 The segmentation results of affine camera motions in the five sequences; for each of Sequences A-E, the three panels show the results of RANSAC-on-Subspaces, RGPCA-Influence, and RGPCA-MVT, respectively. The black asterisks denote the outliers.

5.2 Vanishing Point Detection

Vanishing-point detection was formulated as a subspace-segmentation problem in [13]. Given two image points x_1 and x_2 in three-dimensional homogeneous coordinates, the line defined by the two points is the cross product l = x_2 × x_1 ∈ R^3. Suppose there are n vanishing points v_1, v_2, ..., v_n in the image; then the following condition holds for any line l that passes through one of the vanishing points:

(l^T v_1)(l^T v_2) ··· (l^T v_n) = 0.    (5.4)

Thus, each vanishing point defines a 2-D hyperplane in R^3.

In the experiment, we estimate n = 2 vanishing points in each image. The line segment detection was implemented based on [64, 65]. Those line segments that do not pass through any vanishing point become outliers. We use RGPCA-Influence, RGPCA-MVT, and RANSAC-on-Subspaces to estimate the two 2-D hyperplanes. For RANSAC, a fixed boundary threshold σ is used. For RGPCA-Influence, the angle tolerance is τ = 0.8 rad and the boundary threshold is σ = 0.4. For RGPCA-MVT, the angle tolerance is τ = 0.8 rad and the boundary threshold is σ = 0.2. We also set the default maximal inlier number for RGPCA-Influence and RGPCA-MVT to be 70; i.e., to start with, the algorithms first reject all but 70 samples based on the corresponding rejection criteria. The detection results are shown in Figures 5.3-5.5, in which the RGPCA-Influence method easily outperforms RANSAC-on-Subspaces and RGPCA-MVT.

Finally, we want to point out that in order to use RGPCA or RANSAC as a core model estimation algorithm in any computer vision application, domain-specific information has to be considered to achieve good performance. Using vanishing-point detection as an example, vanishing points in a perspective image are usually caused by parallel structures, which are abundant in man-made environments. However, even an image of an urban scene may capture both man-made objects and natural objects, e.g., trees, grass, and animals. In such cases, any generic model estimation algorithm will fail to extract the true vanishing points of parallel straight lines in space; an example is shown in Figure 5.6. A good algorithm that performs well in this situation has to be able to recognize and extract the man-made structures from images. We refer to [66] for more discussion of robust vanishing-point detection.
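Converting the detected segments into hyperplane samples takes one cross product per segment; a minimal MATLAB sketch, with P1 and P2 as hypothetical 2 x K arrays of segment endpoints in pixel coordinates:

% Turn each detected line segment into a homogeneous line vector l = x2 x x1 (cf. (5.4)).
% Assumed inputs: P1, P2 (2 x K endpoints of K detected segments, in pixel coordinates).
K = size(P1, 2);
L = zeros(3, K);
for k = 1:K
    x1 = [P1(:, k); 1];                  % homogeneous coordinates of the two endpoints
    x2 = [P2(:, k); 1];
    l  = cross(x2, x1);
    L(:, k) = l / norm(l);               % normalized line coordinates
end
% Up to noise, each column of L lies on one of the hyperplanes l' * v_j = 0, so the
% matrix L is what the subspace-segmentation algorithms are run on.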

Figure 5.3 Vanishing-point detection results by RANSAC-on-Subspaces: (a) original images; (b) line segments; (c) RANSAC-on-Subspaces segmentation; (d) vanishing points via RANSAC-on-Subspaces.

Figure 5.4 Vanishing-point detection results by RGPCA-Influence: (a) original images; (b) line segments; (c) RGPCA-Influence segmentation; (d) vanishing points via RGPCA-Influence.

Figure 5.5 Vanishing-point detection results by RGPCA-MVT: (a) original images; (b) line segments; (c) RGPCA-MVT segmentation; (d) vanishing points via RGPCA-MVT.

Figure 5.6 Results of vanishing-point detection on a natural scene that contains a bridge, a mountain, and trees: (a) original image; (b) line segments; (c)-(d) RANSAC-on-Subspaces result and its vanishing points; (e)-(f) RGPCA-Influence result and its vanishing points; (g)-(h) RGPCA-MVT result and its vanishing points. All three subspace-segmentation algorithms fail to find the true vanishing points that correspond to parallel line families in space.

5.3 Other Applications

Subspace arrangements have proven to be pertinent for many other problems that arise in image processing, computer vision, pattern recognition, and system identification. Besides the two applications discussed above, we list a few more examples and references in the following:

1. Hybrid Linear Representation of Images. An important problem in image processing is to find efficient and sparse representations of images (rather than the original bitmaps). A popular and still dominant approach is to transform images
