Robust methods for high-dimensional data, and a theoretical study of depth-related estimators.


Katholieke Universiteit Leuven, Faculteit Wetenschappen, Departement Wiskunde

Robust methods for high-dimensional data, and a theoretical study of depth-related estimators

Karlien Vanden Branden

Promotor: Prof. dr. M. Hubert

Thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Science (Proefschrift ingediend tot het behalen van de graad van Doctor in de Wetenschappen)

Leuven, 2005


Acknowledgements

Writing this word of thanks took me quite some effort. It is not so much that I did not know whom I wanted to thank and why, but rather how to express appropriately what a great many people have meant to me over the past four years. The people who motivated and supported me during this period therefore deserve a very sincere thank-you.

First of all I want to thank my supervisor, Prof. dr. Mia Hubert. She passed on her enthusiasm for research and kept offering fascinating topics. Through her close ties with the chemical world I was able to combine theoretical work with very practical results, which was certainly an important motivation for further research. I am also grateful to Prof. dr. Steve Portnoy for giving me the opportunity to learn from his expertise in regression quantiles, and to Tereza Neocleous for taking a stranger into her house. I will never forget the nice time in wintery Urbana-Champaign, and thank you both for this. Of course I am very proud of the results we have achieved, and I appreciate enormously the cooperation that, by means of some hard labour, led to the nice result of Chapter 5...

I certainly have not forgotten my colleagues and former colleagues, and I thank them for the pleasant working days both in the B-block and at the UCS. In particular I want to thank my office mates, who each in their own way kept my working days interesting. Thank you Jonathan, Sanne and Michiel! I especially want to thank Sanne for the great time and for the pleasant atmosphere she creates in our office and at the UCS.

Besides my colleagues at the K.U.Leuven, I have, thanks to Mia, had the chance to meet many interesting people at annual meetings all over the world. Although it is not exactly far from home, I am thinking here especially of dr. Sabine Verboven of the Universiteit Antwerpen. She was always ready to help and advise with all kinds of questions and problems. And thanks to her I have also come to know quite a few restaurants in Leuven much better.

Hopefully we can continue this tasty tradition.

Finally, I would also like to put my friends and family in the spotlight. In particular I thank my parents for their support and encouragement during the past 25 years.

Karlien
Leuven, 13 May 2005

Contents

Introduction
List of Publications

I  Robust methods for high-dimensional data

1  ROBPCA: a New Approach to Robust Principal Component Analysis
   Introduction
   The ROBPCA method
   Description
   Diagnostic plot
   Examples
   Car data
   Octane data
   Glass spectra
   Simulations
   Conclusion
   Appendix
   References

2  Robust Classification in High Dimensions based on the SIMCA Method
   Introduction
   The SIMCA method
   Construction of the classification rules
   Example: Forest soil data
   A robust SIMCA method
   Robust PCA in high dimensions
   Robust classification rules
   Example: Forest soil data
   Misclassification percentage
   Example: Forest soil data
   Simulation Study
   Examples
   Fruit data
   Wine recognition data
   Conclusions
   References

3  Robust Methods for Partial Least Squares Regression
   Introduction
   The SIMPLS algorithm
   Robustified versions of the SIMPLS algorithm
   Robust covariance estimation in high dimensions
   Robust PLS
   Comparison
   Model Calibration and Validation
   Selecting the number of components
   Estimation of Prediction Error
   Outlier Detection
   Regression diagnostic plot
   Score diagnostic plot
   Simulation Study
   Example: Biscuit dough data
   Conclusion
   References

4  Robustness Properties of a Robust PLS Regression Method
   Introduction
   The algorithms
   The SIMPLS algorithm
   The robust approach
   Robustness properties in low dimensions
   Functional representation
   The influence functions
   Example: Tobacco data
   Breakdown property
   Computation times
   Robustness properties in high dimensions
   Empirical influence function
   Empirical breakdown property
   Conclusions
   Appendix A
   Appendix B
   References

II  A theoretical study of depth-related estimators

5  Censored Regression Quantiles
   Introduction
   Censored regression quantiles
   Consistency proof
   Censored depth regression quantiles
   Conclusion
   Appendix
   References

6  Location-Scale Depth
   Introduction
   Another interpretation of Student depth
   Attaining the maximal Student depth
   Influence functions
   Breakdown value
   Conclusions
   References

Conclusion
Nederlandse Samenvatting (Dutch summary)


Introduction

Preprocessing and analyzing high-dimensional data are tasks that arise naturally in chemistry, bio-informatics, image analysis, and related fields. In these fields, large or "fat" data sets, in which the number of observations is largely exceeded by the number of variables, appear very naturally from spectral analysis, gene analysis through microarrays, or image segmentation. Principal Component Analysis (PCA), a classification method called SIMCA, and Partial Least Squares Regression (PLSR) are three calibration methods that were developed and studied to make such measurements in huge dimensions comprehensible, both in the multivariate setting and in the regression situation. Whereas PCA and SIMCA operate on one set of variables, say $X_{n,p}$, PLSR introduces a regression model between a set of predictors $X_{n,p}$ and a set of responses $Y_{n,q}$. Note that we denote the dimension of a matrix using a subscript, e.g. $X_{n,p}$ stands for an $(n \times p)$ dimensional matrix, and we print column vectors in bold. These measurements are typically spectra, with the number of observations $n \ll p$, the number of wavelengths. In microarray analysis $p$ represents the number of genes for $n$ arrays.

However, all three methods lack robustness. This means that they are incapable of detecting unusual measurements, so-called outliers, which often leads to unreliable outcomes. Therefore, we study the robustness of PCA, SIMCA and PLSR in Part I of this thesis and we propose robust alternatives.

First we focus in Chapter 1 on PCA. As this is a dimension reduction technique, it is very well suited to handle high-dimensional data. We assume that the original set of $p$ highly correlated variables can be summarized by a much smaller group of $k$ uncorrelated variables, obtained as linear combinations of the original variables with coefficients stored in the loading matrix $P_{p,k}$, where $k \ll p$. To formalize the PCA model, we let $\mathbf{x}_i$ be the $i$th measurement (row) of $X_{n,p}$, $\boldsymbol{\mu}_x$ the center of $X_{n,p}$ and $\mathbf{t}_i = P'_{k,p}(\mathbf{x}_i - \boldsymbol{\mu}_x)$ the $i$th score. PCA then looks for a $k$-dimensional subspace, represented by the orthonormal columns of $P_{p,k}$, by minimizing $\sum_{i=1}^{n} \|\mathbf{g}_i\|^2$ where

$\mathbf{x}_i = \boldsymbol{\mu}_x + P_{p,k}\mathbf{t}_i + \mathbf{g}_i$   (1)

and $\mathbf{g}_i$ is the error term. PCA thus results in a new low-dimensional data set $T_{n,k} = (\mathbf{t}_i)_{i=1}^{n}$ where the small dimension $k$ now allows for a better interpretation. In practical examples $k$ is often not larger than 5. The classical PCA (CPCA) loadings can be found by taking the eigenvectors of the empirical variance-covariance matrix of the data $X_{n,p}$. As this approach is badly influenced by the presence of outliers, we propose in Chapter 1 a robust PCA method called ROBPCA. It combines projection-pursuit ideas with robust covariance estimation. We show its superiority in robustness by means of a comparative simulation study and with several examples.
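As a minimal illustration of model (1), the following sketch computes classical PCA loadings, scores and error terms. It is only an illustration and not part of the thesis; it assumes Python with NumPy, whereas the software accompanying this work is written in Matlab.

```python
import numpy as np

def classical_pca(X, k):
    """Classical PCA: center X (n x p) and take the k dominant eigenvectors
    of the empirical covariance matrix as the loading matrix P (p x k)."""
    mu = X.mean(axis=0)                      # estimate of the center mu_x
    Xc = X - mu
    cov = np.cov(Xc, rowvar=False)           # empirical variance-covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:k]
    P = eigvec[:, order]                     # loading matrix P_{p,k}
    T = Xc @ P                               # scores t_i = P'(x_i - mu_x)
    G = Xc - T @ P.T                         # error terms g_i in model (1)
    return mu, P, T, G

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))   # toy correlated data
mu, P, T, G = classical_pca(X, k=2)
print(T.shape, np.sum(G**2))                 # n x k scores and residual sum of squares
```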

This robust PCA method ROBPCA now allows us to investigate other multivariate techniques for high-dimensional data. In Chapter 2 we will incorporate the ROBPCA method in a robust classification method called RSIMCA. This classifier, based on the SIMCA method [9], applies a principal component analysis to each available group separately and uses robust classification rules. This gives rise to lower misclassification percentages compared with SIMCA when outliers appear in one or in several groups.

ROBPCA can also be exploited in the regression context. Especially in the case of a large set of predictors or independent variables $X_{n,p}$ with $n \ll p$, it is important to extract only the most important predictors for the response(s) $Y_{n,q}$, or to obtain the best linear combinations of the $p$ original covariates for predicting $Y_{n,q}$. In such situations multivariate least squares regression is no longer an option, and therefore biased regression methods such as PLSR [3] have been proposed in the past. The basic idea behind PLSR is to construct a bilinear model

$\mathbf{x}_i = \boldsymbol{\mu}_x + P_{p,k}\mathbf{t}_i + \mathbf{g}_i$
$\mathbf{y}_i = \boldsymbol{\mu}_y + A_{q,k}\mathbf{t}_i + \mathbf{f}_i.$   (2)

The matrix of x-loadings is given by $P_{p,k}$, $A$ is the slope matrix in the regression of $\mathbf{y}_i$ on $\mathbf{t}_i = R'_{k,p}(\mathbf{x}_i - \boldsymbol{\mu}_x)$, $\boldsymbol{\mu}_x$ and $\boldsymbol{\mu}_y$ are the centers of $X_{n,p}$ and $Y_{n,q}$, and $\mathbf{g}_i$ and $\mathbf{f}_i$ are the error terms. Here the transpose of a vector $\mathbf{v}$ or a matrix $V$ is written as $\mathbf{v}'$ or $V'$. The estimates from this bilinear model provide regression estimates expressing the relationship between $\mathbf{x}_i$ and $\mathbf{y}_i$:

$\mathbf{y}_i = \boldsymbol{\beta}_0 + B'_{q,p}\mathbf{x}_i + \mathbf{e}_i$

with $B = R_{p,k}A'$, $\boldsymbol{\beta}_0 = \boldsymbol{\mu}_y - B'_{q,p}\boldsymbol{\mu}_x$, and $\mathbf{e}_i$ the regression error. Although (1) and (2) look similar, the PLS regression scores $\mathbf{t}_i$ do not have to correspond with the PCA scores, as another objective function is considered. This will be discussed in Chapter 3. Basically, the construction of the matrix $R_{p,k}$ leads to different regression methods. Besides PLSR, Principal Component Regression (PCR) [3] is another very popular biased regression technique in chemometrics. This method also proceeds according to the same bilinear model. The main difference with PLSR is that the PCR regression scores $\mathbf{t}_i$ now correspond with the PCA scores $P'_{k,p}(\mathbf{x}_i - \boldsymbol{\mu}_x)$. The scores from PCR are thus solely based on the set of regressors $X_{n,p}$. In contrast, PLSR already involves the responses in this first stage, and therefore the scores on which the final regression results are based are expected to inherit more useful predictive information for these responses.
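The contrast between PCR and PLSR can be made concrete with a small sketch that fits both on the same fat data set. It uses scikit-learn purely for illustration (an assumption of the sketch; the thesis works with the SIMPLS algorithm and its robust versions instead), and only shows that PCR scores depend on $X_{n,p}$ alone while the PLS scores are built with the response in mind.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, p, k = 60, 200, 3                         # a "fat" data set: p much larger than n
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=n)

# PCR: scores come from a PCA of X only, then y is regressed on those scores.
pca = PCA(n_components=k).fit(X)
T_pcr = pca.transform(X)
pcr = LinearRegression().fit(T_pcr, y)

# PLSR: the scores are constructed to carry predictive information about y.
pls = PLSRegression(n_components=k).fit(X, y)
print("PCR  R^2:", pcr.score(T_pcr, y))
print("PLSR R^2:", pls.score(X, y))
```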

As PLSR turns out not to be robust against outliers, Chapter 3 is dedicated to a robust PLSR method. It is called RSIMPLS because it is based on the ideas behind the SIMPLS method [1]. The ROBPCA method proves to be successful for robustifying both stages (i.e. the PCA step and the regression stage) of the bilinear model. A theoretical study of RSIMPLS is contained in Chapter 4, where we derive its influence function for low-dimensional data. We simultaneously obtain the corresponding influence functions for SIMPLS.

All algorithms developed in this first part are available as part of LIBRA, a Matlab Library for Robust Analysis [7]. This toolbox can be downloaded from our web site. S-Plus and R implementations of several algorithms can also be obtained from our web site. Moreover, our programs are becoming available in the PLS Toolbox [8], which is the standard Matlab library for chemometricians.

In Part II of this thesis we focus on regression quantiles for censored data and on depth-related estimators. In Chapter 5 we first discuss a recent method, based on the Kaplan-Meier estimator, for estimating regression quantiles in the case of censored data [5]. Censored data occur very frequently in clinical trials when persons drop out of a study. Because it is wasteful to simply omit those persons from the analysis, they are still taken into account as censored observations. One then knows that the true response for that person, say the life time, exceeds the observed value. Instead of observing a response $Y_i$ for a set of predictors $X_i$, we thus now have a triple $(Y_i, C_i, \Delta_i)$ with $C_i$ the censoring time and $\Delta_i = I(Y_i \leq C_i)$ the censoring indicator. This indicator tells whether observation $i$ is observed ($\Delta_i = 1$) or not ($\Delta_i = 0$). We denote the observed response by $\tilde{Y}_i = \min(Y_i, C_i)$. If $\Delta_i = 0$ we only know that the true response value $Y_i$ is some unknown value larger than $C_i$. In Chapter 5 we suppose that the $\tau$th conditional regression quantile of $Y$ given $X = x$, denoted by $Q_Y(\tau \mid x)$, is linear in $x$. So for $\tau \in (0, 1)$

$Q_Y(\tau \mid x) = x'\beta(\tau).$

We will discuss two methods to obtain these linear regression quantiles $\beta(\tau)$ for censored data. The first method is proposed by Portnoy [5] and it extends the Koenker-Bassett estimator [2] by means of the Kaplan-Meier estimator. The basic idea is to introduce a reweighting scheme where a weight $w(\tau)$ is given to a censored observation once it is crossed, and a weight $1 - w(\tau)$ is given to a pseudo-observation at infinity. The main result in this chapter is a consistency proof of this estimator. However, this censored regression quantile estimator is only robust towards outliers in the response variable $Y$. Therefore we also briefly introduce a second estimator that can deal with leverage points, which are outlying in x-space. This method combines concepts of regression depth [6] with the reweighting scheme for censored data.
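As a point of reference for Chapter 5, the following sketch estimates an ordinary, uncensored regression quantile $\beta(\tau)$ in the Koenker-Bassett sense [2], using the QuantReg routine from the statsmodels package. This is an assumption of the sketch: Portnoy's reweighting scheme for censored observations is not implemented here, only the building block that it extends.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.standard_t(df=3, size=n)   # heavy-tailed errors

X = sm.add_constant(x)                             # design matrix with intercept
tau = 0.75
beta_tau = sm.QuantReg(y, X).fit(q=tau).params     # estimate of beta(tau)
print("beta(0.75):", beta_tau)
```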

The idea of regression depth recently inspired Mizera and Müller [4] to define a location-scale depth for univariate data based on a likelihood principle. In Chapter 6 we look at the most tractable version of their proposed depth, i.e. the Student depth, which corresponds to the bivariate halfspace depth interpreted in the Poincaré plane of the Lobachevski geometry. The Student depth of some $(\mu, \sigma) \in \mathbb{R} \times [0, \infty)$ with respect to some probability measure $P$ on $\mathbb{R}$ is defined as

$d(\mu, \sigma, P) = \inf_{(u_1, u_2)} P\{\, y : u_1(y - \mu) + u_2((y - \mu)^2 - \sigma^2) \geq 0 \,\}.$   (3)

We derive the influence function of this Student depth and of the Student median, which is the maximum depth estimator of location and scale. This is the couple $(\mu_S, \sigma_S) \in \mathbb{R} \times [0, \infty)$ for which (3) attains its maximal value. We also obtain the breakdown value of the Student median.

In a final conclusion all results obtained in Part I and in Part II are summarized. We also discuss some possibilities for future research.
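Definition (3) is easy to approximate numerically for an empirical distribution: map every observation to the pair $(y_i - \mu, (y_i - \mu)^2 - \sigma^2)$ and take the smallest fraction of points lying in a closed halfplane through the origin. The sketch below (Python/NumPy) scans a finite grid of directions, so it only approximates the infimum; it is an illustration, not the estimator studied in Chapter 6.

```python
import numpy as np

def student_depth(mu, sigma, y, n_dir=3600):
    """Approximate the Student depth d(mu, sigma, P_n) of definition (3)
    for the empirical distribution of the sample y, by scanning a grid of
    directions (u1, u2) on the unit circle."""
    z = np.column_stack([y - mu, (y - mu) ** 2 - sigma ** 2])
    angles = np.linspace(0.0, 2 * np.pi, n_dir, endpoint=False)
    U = np.column_stack([np.cos(angles), np.sin(angles)])   # candidate directions
    # For each direction, the fraction of observations in the closed halfplane
    # {z : u'z >= 0}; the depth is the smallest such fraction over all directions.
    fractions = (z @ U.T >= 0).mean(axis=0)
    return fractions.min()

rng = np.random.default_rng(3)
y = rng.standard_normal(500)
print(student_depth(np.median(y), y.std(), y))   # a reasonable (mu, sigma): high depth
print(student_depth(5.0, 0.1, y))                # a poorly fitting (mu, sigma): low depth
```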

References

[1] S. de Jong. SIMPLS: an alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18:251–263, 1993.

[2] R. Koenker and G.W. Bassett. Regression quantiles. Econometrica, 46:33–50, 1978.

[3] H. Martens and T. Naes. Multivariate Calibration. Wiley, Chichester, UK, 1989.

[4] I. Mizera and C.H. Müller. Location-scale depth. Journal of the American Statistical Association, 99:949–966, 2004.

[5] S. Portnoy. Censored regression quantiles. Journal of the American Statistical Association, 98:1001–1012, 2003.

[6] P.J. Rousseeuw and M. Hubert. Regression depth. Journal of the American Statistical Association, 94:388–402, 1999.

[7] S. Verboven and M. Hubert. LIBRA: a Matlab library for robust analysis. Chemometrics and Intelligent Laboratory Systems, 75:127–136, 2005.

[8] B.M. Wise, N.B. Gallagher, R. Bro, J.M. Shaver, W. Windig, and R.S. Koch. PLS Toolbox 3.5 for use with MATLAB. Software, Eigenvector Research, Inc., August 2004.

[9] S. Wold. Pattern recognition by means of disjoint principal components models. Pattern Recognition, 8:127–139, 1976.


List of Publications

M. Hubert and K. Vanden Branden. Robust methods for partial least squares regression. Journal of Chemometrics, 17:537–549, 2003.

K. Vanden Branden and M. Hubert. Robustness properties of a robust PLS regression method. Analytica Chimica Acta, 515:229–241, 2004.

S. Engelen, M. Hubert, K. Vanden Branden, and S. Verboven. Robust PCR and Robust PLSR: a comparative study. In Theory and Applications of Recent Robust Methods, edited by M. Hubert, G. Pison, A. Struyf and S. Van Aelst, Series: Statistics for Industry and Technology, Birkhäuser, Basel, 2004.

K. Vanden Branden and M. Hubert. Robust classification of high-dimensional data. In Proceedings in Computational Statistics, edited by J. Antoch, Springer-Verlag, Heidelberg, 2004.

M. Hubert, P.J. Rousseeuw, and K. Vanden Branden. Comment on "Location-scale depth" by I. Mizera and C. Müller. Journal of the American Statistical Association, 99:966–970, 2004.

M. Hubert, P.J. Rousseeuw, and K. Vanden Branden. ROBPCA: a new approach to robust principal component analysis. Technometrics, 47:64–79, 2005.

K. Vanden Branden and M. Hubert. Robust classification in high dimensions based on the SIMCA method. To appear in Chemometrics and Intelligent Laboratory Systems, 2005.

S. Engelen, M. Hubert, and K. Vanden Branden. A comparison of three procedures for robust PCA in high dimensions. To appear in Austrian Journal of Statistics, 2005.


Part I

Robust methods for high-dimensional data


Chapter 1

ROBPCA: a New Approach to Robust Principal Component Analysis

Classical PCA is based on the empirical covariance matrix of the data and hence it is highly sensitive to outlying observations. In the past, two robust approaches have been developed. The first is based on the eigenvectors of a robust scatter matrix such as the MCD or an S-estimator, and is limited to relatively low-dimensional data. The second approach is based on projection pursuit and can handle high-dimensional data. In this chapter, we propose the ROBPCA approach, which combines projection pursuit ideas with robust scatter matrix estimation. It yields more accurate estimates at non-contaminated data sets and more robust estimates at contaminated data. ROBPCA can be computed fast, and is able to detect exact fit situations. As a byproduct, ROBPCA produces a diagnostic plot which displays and classifies the outliers. The algorithm is applied to several data sets from chemometrics and engineering.

1.1 Introduction

Principal component analysis is a popular statistical method which tries to explain the covariance structure of data by means of a small number of components. These components are linear combinations of the original variables, and often allow for an interpretation and a better understanding of the different sources of variation. Because PCA is concerned with data reduction, it is widely used for the analysis of high-dimensional data which are frequently encountered in chemometrics, computer vision, engineering, genetics, and other domains. PCA is then often the first step of the data analysis, followed by discriminant analysis, cluster analysis, or other multivariate techniques. It is thus important to find those principal components that contain most of the information.

In the classical approach, the first component corresponds to the direction in which the projected observations have the largest variance. The second component is then orthogonal to the first and again maximizes the variance of the data points projected on it. Continuing in this way produces all the principal components, which correspond to the eigenvectors of the empirical covariance matrix. Unfortunately, both the classical variance (which is being maximized) and the classical covariance matrix (which is being decomposed) are very sensitive to anomalous observations. Consequently, the first components are often attracted towards outlying points, and may not capture the variation of the regular observations. Therefore, data reduction based on classical PCA (CPCA) becomes unreliable if outliers are present in the data.

The goal of robust PCA methods is to obtain principal components that are not influenced much by outliers. A first group of methods is obtained by replacing the classical covariance matrix by a robust covariance estimator. Maronna [19] and Campbell [3] proposed to use affine equivariant M-estimators of scatter for this purpose, but these cannot resist many outliers. More recently, Croux and Haesbroeck [4] used positive-breakdown estimators such as the minimum covariance determinant (MCD) method [22] and S-estimators [6, 24]. The result is more robust, but unfortunately limited to small to moderate dimensions. To see why, let us e.g. consider the MCD estimator. It is defined as the mean and the covariance matrix of the h observations (out of the whole data set of size n) whose covariance matrix has the smallest determinant. If p denotes the number of variables in our data set, the MCD estimator can only be computed if p < h, otherwise the covariance matrix of any h-subset has zero determinant. By default h is about 0.75n, and it may be chosen as small as 0.5n; in any case p may never be larger than n.

A second problem is the computation of these robust estimators in high dimensions. Today's fastest algorithms [28, 25] can handle up to about 100 dimensions, whereas there are fields like chemometrics which need to analyze data with dimensions in the thousands.

A second approach to robust PCA uses Projection Pursuit (PP) techniques [17, 5, 11]. These methods maximize a robust measure of spread to obtain consecutive directions on which the data points are projected. This idea has also been generalized to common principal components [1]. It yields transparent algorithms that can be applied to data sets with many variables and/or many observations.

In Section 1.2 we propose the ROBPCA method, which attempts to combine the advantages of both approaches. We also describe an accompanying diagnostic plot which can be used to detect and classify possible outliers. Several real data sets from chemometrics and engineering are analyzed in Section 1.3, whereas Section 1.4 investigates the performance and robustness of ROBPCA through simulations. The concluding section outlines potential applications of ROBPCA in other types of multivariate data analysis.

1.2 The ROBPCA method

1.2.1 Description

The proposed ROBPCA method combines ideas of both projection pursuit and robust covariance estimation. The PP part is used for the initial dimension reduction. Some ideas based on the MCD estimator are then applied to this lower-dimensional data space. The combined approach yields more accurate estimates than the raw PP algorithm, as we will see in Section 1.4.

The complete description of the ROBPCA method is quite involved and relegated to Appendix A, but here is a rough sketch of how it works. We assume that the original data are stored in an $n \times p$ data matrix $X = X_{n,p}$ where $n$ stands for the number of objects and $p$ for the original number of variables. The ROBPCA method then proceeds in three major steps. First, the data are preprocessed such that the transformed data are lying in a subspace whose dimension is at most $n - 1$. Next, we construct a preliminary covariance matrix $S_0$ that is used for selecting the number of components $k$ that will be retained in the sequel, yielding a $k$-dimensional subspace that fits the data well. Then we project the data points on this subspace, where we robustly estimate their location and their scatter matrix, of which we compute the $k$ non-zero eigenvalues $l_1, \ldots, l_k$.

The corresponding eigenvectors are the $k$ robust principal components. In the original space of dimension $p$, these $k$ components span a $k$-dimensional subspace. Formally, writing the (column) eigenvectors next to each other yields the $p \times k$ matrix $P_{p,k}$ with orthogonal columns. The location estimate is denoted as the $p$-variate column vector $\hat{\boldsymbol{\mu}}$ and called the robust center. The scores are the entries of the $n \times k$ matrix

$T_{n,k} = (X_{n,p} - \mathbf{1}_n\hat{\boldsymbol{\mu}}')P_{p,k}$   (1.1)

where $\mathbf{1}_n$ is the column vector with all $n$ components equal to 1. Moreover, the $k$ robust principal components generate a $p \times p$ robust scatter matrix $S$ of rank $k$ given by

$S = P_{p,k} L_{k,k} P'_{k,p}$   (1.2)

where $L_{k,k}$ is the diagonal matrix with the eigenvalues $l_1, \ldots, l_k$.

Like classical PCA, the ROBPCA method is location and orthogonal equivariant. That is, when we apply a shift and/or an orthogonal transformation (e.g. a rotation or a reflection) to the data, the robust center is also shifted and the loadings are rotated accordingly. Hence the scores do not change under this type of transformations. Let $A_{p,p}$ define the orthogonal transformation, thus $A$ is of full rank and $A' = A^{-1}$, and let $\hat{\boldsymbol{\mu}}_x$ and $P_{p,k}$ be the ROBPCA center and loading matrix for the original $X_{n,p}$. Then the ROBPCA center and loadings for the transformed data $XA' + \mathbf{1}_n\mathbf{v}'$ are equal to $A\hat{\boldsymbol{\mu}}_x + \mathbf{v}$ and $AP$. Consequently the scores remain the same under these transformations:

$T(XA' + \mathbf{1}_n\mathbf{v}') = (XA' + \mathbf{1}_n\mathbf{v}' - \mathbf{1}_n(A\hat{\boldsymbol{\mu}}_x + \mathbf{v})')AP = (X - \mathbf{1}_n\hat{\boldsymbol{\mu}}'_x)P = T(X).$

Although these properties seem very natural for a PCA method, they are not shared by some other robust PCA estimators such as the resampling by half-means and the smallest half-volume methods of Egan and Morgan [8].
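The next sketch illustrates formulas (1.1) and (1.2) and verifies the equivariance argument numerically. For simplicity the center and loadings are taken from an ordinary SVD; in ROBPCA they would of course be the robust estimates, so this substitution is purely an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 50, 8, 3
X = rng.normal(size=(n, p))

# Stand-in estimates of the center and the loading matrix P_{p,k}
# (from an ordinary SVD; ROBPCA would supply robust versions instead).
mu = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
P = Vt[:k].T                                  # p x k matrix with orthonormal columns
l = (s[:k] ** 2) / (n - 1)                    # eigenvalue estimates l_1, ..., l_k

T = (X - mu) @ P                              # scores, equation (1.1)
S = P @ np.diag(l) @ P.T                      # rank-k scatter matrix, equation (1.2)

# Orthogonal equivariance: transform the data by x -> A x + v.
A = np.linalg.qr(rng.normal(size=(p, p)))[0]  # a random orthogonal matrix
v = rng.normal(size=p)
Xt = X @ A.T + v
mu_t, P_t = A @ mu + v, A @ P                 # transformed center and loadings
T_t = (Xt - mu_t) @ P_t                       # scores of the transformed data
print(np.allclose(T, T_t))                    # True: the scores are unchanged
```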

1.2.2 Diagnostic plot

As is the case for many robust methods, the goal of a robust PCA is twofold. First, it allows us to find those linear combinations of the original variables that contain most of the information, even if there are outliers. Secondly, these estimates are useful to flag outliers and to determine of which type they are.

To see that there can be different types of outliers, consider Figure 1.1 where p = 3 and k = 2. Here we can distinguish between four types of observations. The regular observations form one homogeneous group that is close to the PCA subspace. Next, we have good leverage points that lie close to the PCA space but far from the regular observations, such as the observations 1 and 4 in Figure 1.1. We can also have orthogonal outliers whose orthogonal distance to the PCA space is large but which we cannot see when we only look at their projection on the PCA space, like observation 5. The fourth type of data points are the bad leverage points, which have a large orthogonal distance and whose projection on the PCA subspace is remote from the typical projections, such as observations 2 and 3.

Figure 1.1: Different types of outliers when a three-dimensional data set is projected on a robust two-dimensional PCA-subspace.

To distinguish between regular observations and the three types of outliers for higher-dimensional data, we construct a diagnostic plot. On the horizontal axis it plots the robust score distance $SD_i$ of each observation, given by

$SD_i = \sqrt{\sum_{j=1}^{k} \frac{t_{ij}^2}{l_j}}$   (1.3)

where the scores $t_{ij}$ are obtained from (1.1). If k = 1, we prefer to plot the (signed) standardized score $t_i/\sqrt{l_1}$. On the vertical axis of the diagnostic plot we display the orthogonal distance $OD_i$ of each observation to the PCA subspace, defined as

$OD_i = \|\mathbf{x}_i - \hat{\boldsymbol{\mu}} - P_{p,k}\mathbf{t}_i\|$   (1.4)

where the $i$th observation is denoted as the $p$-variate column vector $\mathbf{x}_i$ and $\mathbf{t}_i$ is the $i$th row of $T_{n,k}$.

To classify the observations we draw two cut-off lines. The cut-off value on the horizontal axis is $\sqrt{\chi^2_{k,0.975}}$ when k > 1 and $\pm\sqrt{\chi^2_{1,0.975}}$ when k = 1 (because the squared Mahalanobis distances of normally distributed scores are approximately $\chi^2_k$-distributed). The cut-off value on the vertical axis is more difficult to determine because the distribution of the orthogonal distances is not known exactly. However, a scaled chi-squared distribution $g_1\chi^2_{g_2}$ gives a good approximation for the unknown distribution of the squared orthogonal distances [2]. In [2] the method of moments is used to estimate the two unknown parameters $g_1$ and $g_2$. We prefer to follow a robust approach. We use the Wilson-Hilferty approximation for a chi-squared distribution. This implies that the orthogonal distances to the power 2/3 are approximately normally distributed with mean $\mu = (g_1 g_2)^{1/3}(1 - \frac{2}{9g_2})$ and variance $\sigma^2 = \frac{2g_1^{2/3}}{9g_2^{1/3}}$. We obtain estimates $\hat{\mu}$ and $\hat{\sigma}^2$ by means of the univariate MCD. The cut-off value on the vertical axis then equals $(\hat{\mu} + \hat{\sigma}z_{0.975})^{3/2}$ with $z_{0.975} = \Phi^{-1}(0.975)$ the 97.5% quantile of the Gaussian distribution.

Note the analogy of this diagnostic plot with that of Rousseeuw and Van Zomeren [26] for robust regression. There the vertical axis gives the standardized residuals obtained with a robust regression method, with cut-off values at -2.5 and 2.5 (because for normally distributed data roughly 1% of the standardized residuals fall outside that interval).
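A sketch of how the two distances and their cut-off values can be computed is given below (Python with NumPy/SciPy). As an assumption of the sketch, the univariate MCD estimates of $\mu$ and $\sigma^2$ are replaced by the median and the MAD, purely to keep the example self-contained; this is not the procedure used in ROBPCA itself. Observations exceeding the vertical cut-off are orthogonal outliers or bad leverage points, and observations exceeding only the horizontal cut-off are good leverage points.

```python
import numpy as np
from scipy.stats import chi2, norm

def diagnostic_distances(X, mu, P, l):
    """Score distances (1.3) and orthogonal distances (1.4) with their cut-offs.
    X: n x p data, mu: robust center, P: p x k loadings, l: k eigenvalues."""
    k = P.shape[1]
    T = (X - mu) @ P
    SD = np.sqrt(np.sum(T ** 2 / l, axis=1))                 # score distances
    OD = np.linalg.norm(X - mu - T @ P.T, axis=1)            # orthogonal distances

    sd_cut = np.sqrt(chi2.ppf(0.975, df=k))                  # horizontal cut-off
    # Wilson-Hilferty: OD^(2/3) is roughly Gaussian; estimate its center and
    # spread robustly (median/MAD stand in for the univariate MCD of the text).
    od23 = OD ** (2.0 / 3.0)
    m_hat = np.median(od23)
    s_hat = 1.4826 * np.median(np.abs(od23 - m_hat))
    od_cut = (m_hat + s_hat * norm.ppf(0.975)) ** 1.5        # vertical cut-off
    return SD, OD, sd_cut, od_cut
```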

1.3 Examples

We will illustrate the ROBPCA method and the diagnostic plot on several real data sets. We also compare the results from ROBPCA with four other PCA methods: classical PCA (CPCA), RAPCA [11], and spherical (SPHER) and ellipsoidal (ELL) PCA [18]. The latter three methods are also robust and designed for high-dimensional data. Li and Chen [17] proposed the idea of projection pursuit for PCA, but their algorithm has a high computational cost. More attractive methods have been developed in [5], but in high dimensions these algorithms still have a numerical inaccuracy. Therefore, Hubert et al. [11] have developed RAPCA, which is a fast two-step algorithm. It searches for the direction on which the projected observations have the largest robust scale, and then removes this dimension and repeats.

The spherical and ellipsoidal PCA methods provide a very fast algorithm to perform robust PCA. After robustly centering the data, the observations are projected on a sphere (spherical PCA) or an ellipse (ellipsoidal PCA). The principal components are then derived as the eigenvectors of the covariance matrix of these projected data points. SPHER and ELL do not yield estimates of the eigenvalues, which makes it impossible to compute score distances. Therefore, in the examples we additionally applied the MCD estimator on the scores to compute robust distances in the PCA subspace.

1.3.1 Car data

Our first example is the low-dimensional car data set which is available in S-PLUS as the data frame cu.dimensions. For n = 111 cars, p = 11 characteristics were measured, such as the length, the width and the height of the car. We first looked at pairwise scatter plots of the variables, and we computed pairwise Spearman rank correlations $\rho_S(X_i, X_j)$. This preliminary analysis already indicated that there are high correlations among the variables, e.g. $\rho_S(X_1, X_2) = 0.83$ and $\rho_S(X_3, X_9) = 0.87$. Hence, PCA seems an appropriate method to find the most important sources of variation in this data set.

When applying ROBPCA to these data, an important choice we need to make is how many principal components to keep. This is done by means of the eigenvalues $l_1 \geq l_2 \geq \ldots \geq l_r$ of $S_0$ with $r = \mathrm{rank}(S_0)$, as obtained in the second stage of the algorithm (see also Appendix A). We can use these eigenvalues in various ways. For instance, we can look at the scree plot, which is a graph of the (monotone decreasing) eigenvalues [14]. We can also use a selection criterion, e.g. to choose k such that

$\sum_{j=1}^{k} l_j \Big/ \sum_{j=1}^{r} l_j \geq 90\%$   (1.5)

or for instance such that

$l_k / l_1 \geq 10^{-3}.$   (1.6)

Here we decided to retain k = 2 components based on criterion (1.5), because $(l_1 + l_2)/\sum_{j=1}^{11} l_j = 94\%$.
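A small sketch of this selection rule, assuming the eigenvalues are already available and using 90% and $10^{-3}$ merely as illustrative default thresholds, could look as follows.

```python
import numpy as np

def choose_k(eigvals, var_explained=0.90, ratio=1e-3):
    """Pick the number of components from the eigenvalues l_1 >= ... >= l_r,
    in the spirit of criteria (1.5) and (1.6): enough explained variability
    and no retained eigenvalue negligibly small relative to l_1."""
    l = np.asarray(eigvals, dtype=float)
    cumulative = np.cumsum(l) / l.sum()
    k_var = int(np.searchsorted(cumulative, var_explained) + 1)   # criterion (1.5)
    k_ratio = int(np.sum(l / l[0] >= ratio))                      # criterion (1.6)
    return min(k_var, k_ratio)

print(choose_k([8.1, 2.2, 0.4, 0.02, 0.001]))   # retains 2 components here
```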

Figure 1.2(a) shows the resulting diagnostic plot. We can distinguish a group of orthogonal outliers (labelled 13–14, 17, 19, 111) and two groups of bad leverage points (cases 12, 15–16, 18, 11 and observations 25, 30, 32, 34, 36). A few good leverage points are also visible (6 and 96). If we look at the measurements, we notice that the five most important bad leverage points (25, 30, 32, 34, and 36) have the value -2 on four of the 11 original variables, namely $X_6$ = Rear.Hd, $X_8$ = Rear.Seat, $X_{10}$ = Rear.Shld, $X_{11}$ = luggage. None of the other observations share this property. The observations have the value -2 for the last variable $X_{11}$ = luggage (and observation 19 the value -3).

Figure 1.2: Diagnostic plot of the car data set based on (a) two robust principal components; (b) two classical principal components.

Let us compare this robust result with a classical PCA analysis. The first two components account for 85% of the total variance. The diagnostic plot in Figure 1.2(b) looks completely different from the robust one, although the same set of outliers is detected. The most striking difference is that the group of bad leverage points from ROBPCA is converted into good leverage points. This shows how the subspace found by CPCA is attracted towards these bad leverage points.

Some of the differences between ROBPCA and CPCA are also visible in the plot of the scores $(t_{i1}, t_{i2})$ for all $i = 1, \ldots, n$. Figure 1.3(a) shows the score plot of ROBPCA, together with the 97.5% tolerance ellipse, which is defined as the set of vectors in $\mathbb{R}^2$ whose score distance is equal to $\sqrt{\chi^2_{2,0.975}}$. Data points which fall outside the tolerance ellipse are by definition the good and the bad leverage points. We clearly see how well the robust tolerance ellipse encloses the regular data points. Figure 1.3(b) is the score plot obtained with CPCA. The corresponding tolerance ellipse is highly inflated toward the outliers 25, 30, 32, 34 and 36. The resulting eigenvectors are not lying in the direction of the highest variability of the other points. We also see how the second eigenvalue of CPCA is blown up by the same set of outliers.

We also performed robust PCA on this low-dimensional data set using the eigenvectors and eigenvalues of the MCD covariance matrix. The resulting diagnostic plot was almost identical to the ROBPCA plot and is therefore not included. Also the other robust methods detected the same set of outliers.

Figure 1.3: The score plot with the 97.5% tolerance ellipse of the car data set for (a) ROBPCA; (b) CPCA.

1.3.2 Octane data

Our second example is the octane data set described in Esbensen et al. [9]. It contains near-infrared (NIR) absorbance spectra over p = 226 wavelengths of n = 39 gasoline samples with certain octane numbers. It is known that six of the samples (25, 26, 36–39) contain added alcohol. Both the classical scree plot, shown in Figure 1.4(a), and the ROBPCA scree plot in Figure 1.4(b) suggest to retain two principal components. The CPCA diagnostic plot is shown in Figure 1.5(a). We see that the classical analysis only detects the outlying spectrum 26, which does not even stick out much above the border line. In contrast, we immediately spot the six samples with added alcohol on the ROBPCA diagnostic plot in Figure 1.5(b). The first principal component from CPCA is clearly attracted by the six outliers, yielding a classical eigenvalue of 0.13. On the other hand the first robust eigenvalue $l_1$ is only 0.01.

Next, we wondered whether the robust loadings would be influenced by the outlying spectra if we would retain more than two components. To avoid the curse of dimensionality with n = 39 observations, it is generally advised that n > 5k (see [26]), so we considered $k_{\max} = 7$. From the robust diagnostic plot in Figure 1.5(c) we see that the outliers are still very far from the estimated robust subspace.

Figure 1.4: Scree plot of the octane data set with (a) CPCA; (b) ROBPCA.

The diagnostic plots of RAPCA, SPHER and ELL were similar to Figure 1.5(b) for k = 2. But when we selected k = 7 components with ELL, we see from Figure 1.5(d) that the outliers have a much lower orthogonal distance. This illustrates their leverage effect on the estimated principal components.

1.3.3 Glass spectra

Our third data set consists of EPXMA spectra over p = 750 wavelengths collected on 180 different glass samples [16]. The chemical analysis was performed using a Jeol JSM 6300 scanning electron microscope equipped with an energy-dispersive Si(Li) X-ray detection system (SEM-EDX). We performed ROBPCA with h = 0.7n = 126. Three components are retained for CPCA and ROBPCA, yielding a classical explanation percentage of 99% and a robust explanation percentage (1.5) of 96%. We then obtain the diagnostic plots in Figure 1.6. From the classical diagnostic plot in Figure 1.6(a), we see that CPCA does not find important outliers. On the other hand the ROBPCA plot of Figure 1.6(b) clearly distinguishes two major groups in the data, a smaller group of bad leverage points, a few orthogonal outliers and the isolated case 180 in between the two major groups. A high-breakdown method such as ROBPCA treats the smaller group of cases as one set of outliers. Later, it turned out that the window of the detector system had been cleaned before the last 38 spectra were measured.

Figure 1.5: Diagnostic plot of the octane data set based on (a) two CPCA principal components; (b) two ROBPCA principal components; (c) seven ROBPCA principal components; (d) seven ELL principal components.

As a result of this, less radiation (X-rays) is absorbed and more can be detected, resulting in higher X-ray intensities. If we look at the spectra, we can indeed observe these differences. The regular samples, shown in Figure 1.7(a), clearly have lower measurements at these channels than the samples of Figure 1.7(b). The spectrum of case 180 (not shown) was somewhat in between. Note that instead of plotting the raw data, we first robustly centered the spectra by subtracting the univariate MCD location estimator from each wavelength. Doing so we can observe more of the variability which is present in the data.

The other bad leverage points (57–63 and 74–76) are samples with a large concentration of calcium. In Figure 1.7(c) we see that their calcium alpha peak (around channels 340–370) and calcium beta peak (channels 375–400) are higher than for the other glass vessels. The orthogonal outliers (22, 23 and 30), whose spectra are shown in Figure 1.7(d), are rather boundary cases, although they have larger measurements at certain channels. This might indicate a larger concentration of phosphor.

Figure 1.6: Diagnostic plot of the glass data set based on three principal components computed with (a) CPCA; (b) ROBPCA; (c) SPHER and (d) ELL.

RAPCA yielded a diagnostic plot similar to the ROBPCA plot. SPHER and ELL are also able to detect the outliers, as can be seen from Figure 1.6(c) and (d), but they turn the bad leverage points into good leverage points and orthogonal outliers.

Figure 1.7: The glass data set: (a) regular observations; (b) bad leverage points; (c) bad leverage points 57–63 and 74–76; (d) orthogonal outliers 22, 23 and 30.

1.4 Simulations

We conducted a simulation study to compare the performance and the robustness of ROBPCA with the four other principal component methods introduced in Section 1.3: classical PCA (CPCA), RAPCA [11], spherical (SPHER) and elliptical (ELL) PCA [18]. We generated 100 samples of size n from the contamination model

$(1 - \varepsilon)N_p(\mathbf{0}, \Sigma) + \varepsilon N_p(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma})$  or  $(1 - \varepsilon)t_5(\mathbf{0}, \Sigma) + \varepsilon t_5(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma})$

for different values of $n$, $p$, $\varepsilon$, $\Sigma$, $\tilde{\boldsymbol{\mu}}$ and $\tilde{\Sigma}$. That is, $n(1-\varepsilon)$ of the observations were generated from the $p$-variate Gaussian distribution $N_p(\mathbf{0}, \Sigma)$ or the $p$-variate elliptical $t_5(\mathbf{0}, \Sigma)$ distribution, and $n\varepsilon$ of the observations were generated from $N_p(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma})$ or from $t_5(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma})$.

Note that the consistency factor in the FAST-MCD algorithm, which is used within ROBPCA, is constructed under the assumption that the regular observations are normally distributed. Then the denominator equals $\chi^2_{k,1-\alpha}$, the $(1-\alpha)$ quantile of the $\chi^2$ distribution with $k$ degrees of freedom. Hence, the best results of the simulations with $t_\nu$ (and here $\nu = 5$) are obtained by replacing the denominator with $k((\nu - 2)/\nu)F_{k,\nu,1-\alpha}$, with $F_{k,\nu,1-\alpha}$ the $(1-\alpha)$ quantile of the $F$ distribution with $k$ and $\nu$ degrees of freedom. However, in real examples any foreknowledge of the true underlying distribution is mostly unavailable. Therefore, and also to make a fair comparison with RAPCA, we did not adjust the consistency factor.

In the tables and figures we report some typical results, which were obtained in the following situations:

1. $n = 100$, $p = 4$, $\Sigma = \mathrm{diag}(8, 4, 2, 1)$ and $k = 3$ (because then $(\sum_{i=1}^{3}\lambda_i)/(\sum_{i=1}^{4}\lambda_i) = 93.3\%$).
   (4a) $\varepsilon = 0$ (no contamination).
   (4b) $\varepsilon = 10\%$ or $\varepsilon = 20\%$, $\tilde{\boldsymbol{\mu}} = f_1\mathbf{e}_4 = (0, 0, 0, f_1)'$, $\tilde{\Sigma} = \Sigma/f_2$ where $f_1 = 6, 8, 10, \ldots, 20$ and $f_2 = 1$ or $f_2 = 15$.

2. $n = 50$, $p = 100$, $\Sigma = \mathrm{diag}(17, 13.5, 8, 3, 1, 0.095, \ldots, 0.002, 0.001)$ and $k = 5$ (because here $(\sum_{i=1}^{5}\lambda_i)/(\sum_{i=1}^{100}\lambda_i) = 90.3\%$).
   (100a) $\varepsilon = 0$ (no contamination).
   (100b) $\varepsilon = 10\%$ or $\varepsilon = 20\%$, $\tilde{\boldsymbol{\mu}} = f_1\mathbf{e}_6$, $\tilde{\Sigma} = \Sigma/f_2$ where $f_1 = 6, 8, 10, \ldots, 20$ and $f_2 = 1$ or $f_2 = 15$.

Note that $\varepsilon = 0\%$ also corresponds to $f_1 = 0$ and $f_2 = 1$. The subspace spanned by the first $k$ eigenvectors of $\Sigma$ is denoted by $E_k = \mathrm{span}\{\mathbf{e}_1, \ldots, \mathbf{e}_k\}$ with $\mathbf{e}_j$ the $j$th column of $I_{p,k}$. The settings (4a) and (4b) consider low-dimensional data ($p = 4$) of not too small size $n = 100$, whereas in (100a) and (100b) we generate high-dimensional data with $n = 50$ being rather small, and even less than $p = 100$. In the settings (4b) and (100b) the contaminated data are shifted by a distance $f_1$ in the direction of the $(k+1)$th principal component. We started with $f_1 = 6$, otherwise the outliers could not be distinguished from the regular data points.

The factor $f_2$ determines how strongly the contaminated data are concentrated. In [21] it is shown that shifted outliers with the same covariance structure as the regular points are the most difficult ones to detect. This situation corresponds with $f_2 = 1$. Note that because of the orthogonal equivariance of the ROBPCA method, we only need to consider diagonal covariance matrices.

For each simulation setting, we summarized the result of the estimation procedure (CPCA, RAPCA, SPHER, ELL and ROBPCA) as follows:

- For each method we consider the maximal angle between $E_k$ and the estimated PCA subspace, which is spanned by the columns of $P_{p,k}$. Krzanowski [15] proposed a measure to calculate this angle, which we will denote by maxsub:

  $\mathrm{maxsub} = \arccos(\sqrt{\lambda_k})$

  where $\lambda_k$ is the smallest eigenvalue of $I'_{k,p} P_{p,k} P'_{k,p} I_{p,k}$. It represents the largest angle between a vector in $E_k$ and the vector most parallel to it in the estimated PCA subspace. To standardize this value, we have divided it by $\pi/2$.

- We compute the proportion of variability that is explained by the estimated eigenvalues. This is done by comparing the sum of the $k$ largest eigenvalues to the sum of all $p$ known eigenvalues. We report the mean proportion of explained variability

  $\frac{1}{100}\sum_{l=1}^{100} \frac{\hat{\lambda}_1^{(l)} + \hat{\lambda}_2^{(l)} + \cdots + \hat{\lambda}_k^{(l)}}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}$

  where $\hat{\lambda}_j^{(l)}$ is the estimated value of $\lambda_j$ at the $l$th replication. It would be more elegant if the denominator also contained the estimated eigenvalues, but ROBPCA and RAPCA only estimate the first $k$ eigenvalues. We report these results for the settings without contamination, and for a specific situation with 10% contamination ($f_1 = 10$, $f_2 = 1$). As SPHER and ELL only estimate the principal components and not their eigenvalues, they are not included in this comparison.

- For the $k$ largest eigenvalues we also compute their mean squared error (MSE), defined as

  $\mathrm{MSE}(\hat{\lambda}_j) = \frac{1}{100}\sum_{l=1}^{100} (\hat{\lambda}_j^{(l)} - \lambda_j)^2.$

  We only report these results for $f_2 = 1$, because they were very similar for $f_2 = 15$.
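The maxsub measure is easy to reproduce numerically. The following sketch (Python/NumPy, with a toy loading matrix instead of actual simulation output) computes it for an estimated loading matrix $\hat{P}_{p,k}$.

```python
import numpy as np

def maxsub(P_hat, k, p):
    """Krzanowski's maximal angle between E_k = span{e_1,...,e_k} and the
    subspace spanned by the columns of P_hat (p x k), divided by pi/2."""
    I_pk = np.eye(p)[:, :k]
    M = I_pk.T @ P_hat @ P_hat.T @ I_pk
    lam_min = np.linalg.eigvalsh(M).min()          # smallest eigenvalue
    return np.arccos(np.sqrt(np.clip(lam_min, 0.0, 1.0))) / (np.pi / 2)

# A loading matrix close to E_k gives a value near 0, a badly rotated one gives 1.
p, k = 4, 3
P_good = np.eye(p)[:, :k]
P_bad = np.eye(p)[:, [3, 1, 2]]                    # one component orthogonal to E_k
print(maxsub(P_good, k, p), maxsub(P_bad, k, p))   # 0.0 and 1.0
```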

The ideal value of maxsub and of the MSE in the tables and figures is thus 0. For the mean proportion of explained variability the optimal values are 93.3% for the low-dimensional data and 90.3% for the high-dimensional data.

Table 1.1: Simulation results of maxsub in settings (4a) and (100a) when there is no contamination (columns: Distribution, n, p, CPCA, RAPCA, SPHER, ELL, ROBPCA).

Table 1.1 reports the simulation results of maxsub for the settings (4a) and (100a). We see that elliptical PCA yields the best results for maxsub when there is no contamination. For low-dimensional data, the results for the other methods are more or less comparable, whereas for high-dimensional data, RAPCA is clearly the least efficient approach.

From Table 1.2 we see that CPCA provides the best mean proportion of explained variability when there is no contamination in the data. RAPCA attains higher values than ROBPCA for both distributions. When contamination is added to the data, the eigenvalues obtained with CPCA are overestimated, which results in estimated percentages which are even larger than 100%! The robust methods are much less sensitive to the outliers, but also RAPCA attains a value larger than 100% at the contaminated low-dimensional normal distribution. Note that when the consistency factor in ROBPCA is adapted to the $t_5$ distribution, the results improve substantially. For the low-dimensional data we obtain 80% without and 82.8% with contamination, whereas in high dimensions the mean percentage of explained variability is 69.3% and 69.4% respectively.

The results of the maxsub measure for simulations (4b) and (100b) are summarized in the corresponding figures. In every situation, CPCA clearly fails and it provides the worst possible result, because maxsub is always very close to 1.

Table 1.2: Simulation results of the mean proportion of explained variability when there is no contamination and with 10% contamination ($f_1 = 10$ and $f_2 = 1$).

                         Multivariate normal           Multivariate t_5
                      CPCA    RAPCA   ROBPCA       CPCA    RAPCA   ROBPCA
  n = 100, p = 4
    eps = 0%          93.4%   94.7%   83.9%        98.7%   72.1%   60.2%
    eps = 10%        147.8%  112.9%   88.5%       135.2%   86.8%   67.5%
  n = 50, p = 100
    eps = 0%          91.6%   83.6%   79.5%        99.2%   65.1%   57.3%
    eps = 10%         19.4%   86.1%   79.3%        11.9%   66.9%   56.7%

This implies that the estimated PCA subspace has been attracted by the outliers in such a way that at least one principal component is orthogonal to $E_k$. Also RAPCA, SPHER and ELL are clearly influenced by the outliers, most strongly when the data are high-dimensional or when there is a large percentage of contamination. In all situations, ROBPCA outperforms the other methods. ROBPCA only attains high values for maxsub at the long-tailed $t_5$ when $f_1$ is between 6 and 8. This is because in this case the outliers are not yet very well separated from the regular data group. Also the other methods fail in such a situation. As soon as the contamination lies somewhat further away, ROBPCA is capable of distinguishing the outliers and maxsub remains almost constant.

Finally, some results for the MSEs of the eigenvalues are summarized in Figure 1.12. Here we display the ratio of the MSEs of CPCA versus ROBPCA, and RAPCA versus ROBPCA, for the normally distributed data with $\varepsilon = 10\%$ contamination and $f_2 = 1$. Figures 1.12(a) and 1.12(b) show the results for the low-dimensional data, whereas Figures 1.12(c) and 1.12(d) present the results for the high-dimensional ones. When we compare CPCA and ROBPCA, we see that the MSE of the first CPCA eigenvalue increases strongly when the contamination is shifted further away from the regular points. Also the MSEs of the other CPCA eigenvalues are much larger than those of ROBPCA. Only $\mathrm{MSE}(\hat{\lambda}_2)$ and $\mathrm{MSE}(\hat{\lambda}_3)$ in Figure 1.12(c) are of the same order of magnitude. Figures 1.12(b) and 1.12(d) show the superiority of ROBPCA over RAPCA. For high-dimensional data the differences are most prominent in the fifth eigenvalue.


Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

ECE 661: Homework 10 Fall 2014

ECE 661: Homework 10 Fall 2014 ECE 661: Homework 10 Fall 2014 This homework consists of the following two parts: (1) Face recognition with PCA and LDA for dimensionality reduction and the nearest-neighborhood rule for classification;

More information

Identification of Multivariate Outliers: A Performance Study

Identification of Multivariate Outliers: A Performance Study AUSTRIAN JOURNAL OF STATISTICS Volume 34 (2005), Number 2, 127 138 Identification of Multivariate Outliers: A Performance Study Peter Filzmoser Vienna University of Technology, Austria Abstract: Three

More information

Robust Exponential Smoothing of Multivariate Time Series

Robust Exponential Smoothing of Multivariate Time Series Robust Exponential Smoothing of Multivariate Time Series Christophe Croux,a, Sarah Gelper b, Koen Mahieu a a Faculty of Business and Economics, K.U.Leuven, Naamsestraat 69, 3000 Leuven, Belgium b Erasmus

More information

TITLE : Robust Control Charts for Monitoring Process Mean of. Phase-I Multivariate Individual Observations AUTHORS : Asokan Mulayath Variyath.

TITLE : Robust Control Charts for Monitoring Process Mean of. Phase-I Multivariate Individual Observations AUTHORS : Asokan Mulayath Variyath. TITLE : Robust Control Charts for Monitoring Process Mean of Phase-I Multivariate Individual Observations AUTHORS : Asokan Mulayath Variyath Department of Mathematics and Statistics, Memorial University

More information

Introduction to Robust Statistics. Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy

Introduction to Robust Statistics. Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy Introduction to Robust Statistics Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy Multivariate analysis Multivariate location and scatter Data where the observations

More information

Small sample corrections for LTS and MCD

Small sample corrections for LTS and MCD Metrika (2002) 55: 111 123 > Springer-Verlag 2002 Small sample corrections for LTS and MCD G. Pison, S. Van Aelst*, and G. Willems Department of Mathematics and Computer Science, Universitaire Instelling

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Accurate and Powerful Multivariate Outlier Detection

Accurate and Powerful Multivariate Outlier Detection Int. Statistical Inst.: Proc. 58th World Statistical Congress, 11, Dublin (Session CPS66) p.568 Accurate and Powerful Multivariate Outlier Detection Cerioli, Andrea Università di Parma, Dipartimento di

More information

Short Answer Questions: Answer on your separate blank paper. Points are given in parentheses.

Short Answer Questions: Answer on your separate blank paper. Points are given in parentheses. ISQS 6348 Final exam solutions. Name: Open book and notes, but no electronic devices. Answer short answer questions on separate blank paper. Answer multiple choice on this exam sheet. Put your name on

More information

Expression Data Exploration: Association, Patterns, Factors & Regression Modelling

Expression Data Exploration: Association, Patterns, Factors & Regression Modelling Expression Data Exploration: Association, Patterns, Factors & Regression Modelling Exploring gene expression data Scale factors, median chip correlation on gene subsets for crude data quality investigation

More information

Explaining Correlations by Plotting Orthogonal Contrasts

Explaining Correlations by Plotting Orthogonal Contrasts Explaining Correlations by Plotting Orthogonal Contrasts Øyvind Langsrud MATFORSK, Norwegian Food Research Institute. www.matforsk.no/ola/ To appear in The American Statistician www.amstat.org/publications/tas/

More information

Robust methods for multivariate data analysis

Robust methods for multivariate data analysis JOURNAL OF CHEMOMETRICS J. Chemometrics 2005; 19: 549 563 Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cem.962 Robust methods for multivariate data analysis S. Frosch

More information

Chapter 4: Factor Analysis

Chapter 4: Factor Analysis Chapter 4: Factor Analysis In many studies, we may not be able to measure directly the variables of interest. We can merely collect data on other variables which may be related to the variables of interest.

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Anders Øland David Christiansen 1 Introduction Principal Component Analysis, or PCA, is a commonly used multi-purpose technique in data analysis. It can be used for feature

More information

Machine Learning 11. week

Machine Learning 11. week Machine Learning 11. week Feature Extraction-Selection Dimension reduction PCA LDA 1 Feature Extraction Any problem can be solved by machine learning methods in case of that the system must be appropriately

More information

Inferential Analysis with NIR and Chemometrics

Inferential Analysis with NIR and Chemometrics Inferential Analysis with NIR and Chemometrics Santanu Talukdar Manager, Engineering Services Part 2 NIR Spectroscopic Data with Chemometrics A Tutorial Presentation Part 2 Page.2 References This tutorial

More information

Outlier Detection via Feature Selection Algorithms in

Outlier Detection via Feature Selection Algorithms in Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session CPS032) p.4638 Outlier Detection via Feature Selection Algorithms in Covariance Estimation Menjoge, Rajiv S. M.I.T.,

More information

Robust control charts for time series data

Robust control charts for time series data Robust control charts for time series data Christophe Croux K.U. Leuven & Tilburg University Sarah Gelper Erasmus University Rotterdam Koen Mahieu K.U. Leuven Abstract This article presents a control chart

More information

Independent Component (IC) Models: New Extensions of the Multinormal Model

Independent Component (IC) Models: New Extensions of the Multinormal Model Independent Component (IC) Models: New Extensions of the Multinormal Model Davy Paindaveine (joint with Klaus Nordhausen, Hannu Oja, and Sara Taskinen) School of Public Health, ULB, April 2008 My research

More information

UPSET AND SENSOR FAILURE DETECTION IN MULTIVARIATE PROCESSES

UPSET AND SENSOR FAILURE DETECTION IN MULTIVARIATE PROCESSES UPSET AND SENSOR FAILURE DETECTION IN MULTIVARIATE PROCESSES Barry M. Wise, N. Lawrence Ricker and David J. Veltkamp Center for Process Analytical Chemistry and Department of Chemical Engineering University

More information

Fast and robust bootstrap for LTS

Fast and robust bootstrap for LTS Fast and robust bootstrap for LTS Gert Willems a,, Stefan Van Aelst b a Department of Mathematics and Computer Science, University of Antwerp, Middelheimlaan 1, B-2020 Antwerp, Belgium b Department of

More information

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION In this lab you will first learn how to display the relationship between two quantitative variables with a scatterplot and also how to measure the strength of

More information

arxiv: v1 [math.st] 11 Jun 2018

arxiv: v1 [math.st] 11 Jun 2018 Robust test statistics for the two-way MANOVA based on the minimum covariance determinant estimator Bernhard Spangl a, arxiv:1806.04106v1 [math.st] 11 Jun 2018 a Institute of Applied Statistics and Computing,

More information

Outlier detection for skewed data

Outlier detection for skewed data Outlier detection for skewed data Mia Hubert 1 and Stephan Van der Veeken December 7, 27 Abstract Most outlier detection rules for multivariate data are based on the assumption of elliptical symmetry of

More information

Modeling Classes of Shapes Suppose you have a class of shapes with a range of variations: System 2 Overview

Modeling Classes of Shapes Suppose you have a class of shapes with a range of variations: System 2 Overview 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 Modeling Classes of Shapes Suppose you have a class of shapes with a range of variations: System processes System Overview Previous Systems:

More information

Generalized Least Squares for Calibration Transfer. Barry M. Wise, Harald Martens and Martin Høy Eigenvector Research, Inc.

Generalized Least Squares for Calibration Transfer. Barry M. Wise, Harald Martens and Martin Høy Eigenvector Research, Inc. Generalized Least Squares for Calibration Transfer Barry M. Wise, Harald Martens and Martin Høy Eigenvector Research, Inc. Manson, WA 1 Outline The calibration transfer problem Instrument differences,

More information

International Journal of Pure and Applied Mathematics Volume 19 No , A NOTE ON BETWEEN-GROUP PCA

International Journal of Pure and Applied Mathematics Volume 19 No , A NOTE ON BETWEEN-GROUP PCA International Journal of Pure and Applied Mathematics Volume 19 No. 3 2005, 359-366 A NOTE ON BETWEEN-GROUP PCA Anne-Laure Boulesteix Department of Statistics University of Munich Akademiestrasse 1, Munich,

More information

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING Vishwanath Mantha Department for Electrical and Computer Engineering Mississippi State University, Mississippi State, MS 39762 mantha@isip.msstate.edu ABSTRACT

More information

1 A factor can be considered to be an underlying latent variable: (a) on which people differ. (b) that is explained by unknown variables

1 A factor can be considered to be an underlying latent variable: (a) on which people differ. (b) that is explained by unknown variables 1 A factor can be considered to be an underlying latent variable: (a) on which people differ (b) that is explained by unknown variables (c) that cannot be defined (d) that is influenced by observed variables

More information

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage

More information

Chemometrics. 1. Find an important subset of the original variables.

Chemometrics. 1. Find an important subset of the original variables. Chemistry 311 2003-01-13 1 Chemometrics Chemometrics: Mathematical, statistical, graphical or symbolic methods to improve the understanding of chemical information. or The science of relating measurements

More information

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

More information

7. Variable extraction and dimensionality reduction

7. Variable extraction and dimensionality reduction 7. Variable extraction and dimensionality reduction The goal of the variable selection in the preceding chapter was to find least useful variables so that it would be possible to reduce the dimensionality

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

Motivating the Covariance Matrix

Motivating the Covariance Matrix Motivating the Covariance Matrix Raúl Rojas Computer Science Department Freie Universität Berlin January 2009 Abstract This note reviews some interesting properties of the covariance matrix and its role

More information

MULTICOLLINEARITY DIAGNOSTIC MEASURES BASED ON MINIMUM COVARIANCE DETERMINATION APPROACH

MULTICOLLINEARITY DIAGNOSTIC MEASURES BASED ON MINIMUM COVARIANCE DETERMINATION APPROACH Professor Habshah MIDI, PhD Department of Mathematics, Faculty of Science / Laboratory of Computational Statistics and Operations Research, Institute for Mathematical Research University Putra, Malaysia

More information

Fast and Robust Classifiers Adjusted for Skewness

Fast and Robust Classifiers Adjusted for Skewness Fast and Robust Classifiers Adjusted for Skewness Mia Hubert 1 and Stephan Van der Veeken 2 1 Department of Mathematics - LStat, Katholieke Universiteit Leuven Celestijnenlaan 200B, Leuven, Belgium, Mia.Hubert@wis.kuleuven.be

More information

Vector Space Models. wine_spectral.r

Vector Space Models. wine_spectral.r Vector Space Models 137 wine_spectral.r Latent Semantic Analysis Problem with words Even a small vocabulary as in wine example is challenging LSA Reduce number of columns of DTM by principal components

More information

Improved Feasible Solution Algorithms for. High Breakdown Estimation. Douglas M. Hawkins. David J. Olive. Department of Applied Statistics

Improved Feasible Solution Algorithms for. High Breakdown Estimation. Douglas M. Hawkins. David J. Olive. Department of Applied Statistics Improved Feasible Solution Algorithms for High Breakdown Estimation Douglas M. Hawkins David J. Olive Department of Applied Statistics University of Minnesota St Paul, MN 55108 Abstract High breakdown

More information

Evaluation of robust PCA for supervised audio outlier detection

Evaluation of robust PCA for supervised audio outlier detection Institut f. Stochastik und Wirtschaftsmathematik 1040 Wien, Wiedner Hauptstr. 8-10/105 AUSTRIA http://www.isw.tuwien.ac.at Evaluation of robust PCA for supervised audio outlier detection S. Brodinova,

More information

Basics of Multivariate Modelling and Data Analysis

Basics of Multivariate Modelling and Data Analysis Basics of Multivariate Modelling and Data Analysis Kurt-Erik Häggblom 2. Overview of multivariate techniques 2.1 Different approaches to multivariate data analysis 2.2 Classification of multivariate techniques

More information

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline. Practitioner Course: Portfolio Optimization September 10, 2008 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y ) (x,

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Minimum Regularized Covariance Determinant Estimator

Minimum Regularized Covariance Determinant Estimator Minimum Regularized Covariance Determinant Estimator Honey, we shrunk the data and the covariance matrix Kris Boudt (joint with: P. Rousseeuw, S. Vanduffel and T. Verdonck) Vrije Universiteit Brussel/Amsterdam

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

arxiv: v3 [stat.me] 2 Feb 2018 Abstract

arxiv: v3 [stat.me] 2 Feb 2018 Abstract ICS for Multivariate Outlier Detection with Application to Quality Control Aurore Archimbaud a, Klaus Nordhausen b, Anne Ruiz-Gazen a, a Toulouse School of Economics, University of Toulouse 1 Capitole,

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides

Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Intelligent Data Analysis and Probabilistic Inference Lecture

More information

5. Discriminant analysis

5. Discriminant analysis 5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density

More information

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Introduction Edps/Psych/Stat/ 584 Applied Multivariate Statistics Carolyn J Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN c Board of Trustees,

More information

MULTIVARIATE ANALYSIS OF VARIANCE

MULTIVARIATE ANALYSIS OF VARIANCE MULTIVARIATE ANALYSIS OF VARIANCE RAJENDER PARSAD AND L.M. BHAR Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 0 0 lmb@iasri.res.in. Introduction In many agricultural experiments,

More information

Re-weighted Robust Control Charts for Individual Observations

Re-weighted Robust Control Charts for Individual Observations Universiti Tunku Abdul Rahman, Kuala Lumpur, Malaysia 426 Re-weighted Robust Control Charts for Individual Observations Mandana Mohammadi 1, Habshah Midi 1,2 and Jayanthi Arasan 1,2 1 Laboratory of Applied

More information

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ). .8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics

More information

DIMENSION REDUCTION AND CLUSTER ANALYSIS

DIMENSION REDUCTION AND CLUSTER ANALYSIS DIMENSION REDUCTION AND CLUSTER ANALYSIS EECS 833, 6 March 2006 Geoff Bohling Assistant Scientist Kansas Geological Survey geoff@kgs.ku.edu 864-2093 Overheads and resources available at http://people.ku.edu/~gbohling/eecs833

More information

Journal of Statistical Software

Journal of Statistical Software JSS Journal of Statistical Software April 2013, Volume 53, Issue 3. http://www.jstatsoft.org/ Fast and Robust Bootstrap for Multivariate Inference: The R Package FRB Stefan Van Aelst Ghent University Gert

More information

Assessing the relation between language comprehension and performance in general chemistry. Appendices

Assessing the relation between language comprehension and performance in general chemistry. Appendices Assessing the relation between language comprehension and performance in general chemistry Daniel T. Pyburn a, Samuel Pazicni* a, Victor A. Benassi b, and Elizabeth E. Tappin c a Department of Chemistry,

More information

ReducedPCR/PLSRmodelsbysubspaceprojections

ReducedPCR/PLSRmodelsbysubspaceprojections ReducedPCR/PLSRmodelsbysubspaceprojections Rolf Ergon Telemark University College P.O.Box 2, N-9 Porsgrunn, Norway e-mail: rolf.ergon@hit.no Published in Chemometrics and Intelligent Laboratory Systems

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information