Robust methods for high-dimensional data, and a theoretical study of depth-related estimators.


Katholieke Universiteit Leuven, Faculteit Wetenschappen, Departement Wiskunde

Robust methods for high-dimensional data, and a theoretical study of depth-related estimators

Karlien Vanden Branden

Promotor: Prof. dr. M. Hubert

Thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Science (Proefschrift ingediend tot het behalen van de graad van Doctor in de Wetenschappen)

Leuven, 2005


Acknowledgements

Writing this word of thanks took me quite some effort. It is not so much that I did not know whom I wanted to thank and why, but rather how to express appropriately what a great many people have meant to me over the past four years. The people who motivated and supported me during this period therefore deserve a very sincere thank-you.

First of all I want to thank my supervisor, Prof. dr. Mia Hubert. She passed on her enthusiasm for research and kept offering fascinating topics. Through her close ties with the chemical world I was able to combine theoretical work with very practical results, which was certainly an important motivation for further research. I am also grateful to Prof. dr. Steve Portnoy for giving me the opportunity to learn from his expertise in regression quantiles, and to Tereza Neocleous for taking a stranger into her house. I will never forget the nice time in wintery Urbana-Champaign, and thank you both for this. Of course I am very proud of the results we have achieved, and I appreciate enormously the cooperation that, by means of some hard labour, led to the nice result of Chapter 5...

I certainly have not forgotten my colleagues and former colleagues, and I thank them for the pleasant working days both in the B-block and at the UCS. In particular I want to thank my office mates, who each in their own way kept my working days interesting. Thank you Jonathan, Sanne and Michiel! I especially want to thank Sanne for the great time and for the pleasant atmosphere she creates in our office and at the UCS.

Besides my colleagues at the K.U.Leuven, I have, thanks to Mia, had the chance to meet many interesting people at annual meetings all over the world. Although it is not exactly far from home, I am thinking here especially of dr. Sabine Verboven of the Universiteit Antwerpen. She was always ready to help and advise with all kinds of questions and problems. And thanks to her I have also come to know quite a few restaurants in Leuven much better.

Hopefully we can continue this tasty tradition.

Finally, I would also like to put my friends and family in the spotlight. In particular I thank my parents for their support and encouragement during the past 25 years.

Karlien
Leuven, 13 May 2005

Contents

Introduction
List of Publications

I  Robust methods for high-dimensional data

1  ROBPCA: a New Approach to Robust Principal Component Analysis
   Introduction
   The ROBPCA method
   Description
   Diagnostic plot
   Examples
   Car data
   Octane data
   Glass spectra
   Simulations
   Conclusion
   Appendix
   References

2  Robust Classification in High Dimensions based on the SIMCA Method
   Introduction
   The SIMCA method
   Construction of the classification rules
   Example: Forest soil data
   A robust SIMCA method
   Robust PCA in high dimensions
   Robust classification rules
   Example: Forest soil data
   Misclassification percentage
   Example: Forest soil data
   Simulation Study
   Examples
   Fruit data
   Wine recognition data
   Conclusions
   References

3  Robust Methods for Partial Least Squares Regression
   Introduction
   The SIMPLS algorithm
   Robustified versions of the SIMPLS algorithm
   Robust covariance estimation in high dimensions
   Robust PLS
   Comparison
   Model Calibration and Validation
   Selecting the number of components
   Estimation of Prediction Error
   Outlier Detection
   Regression diagnostic plot
   Score diagnostic plot
   Simulation Study
   Example: Biscuit dough data
   Conclusion
   References

4  Robustness Properties of a Robust PLS Regression Method
   Introduction
   The algorithms
   The SIMPLS algorithm
   The robust approach
   Robustness properties in low dimensions
   Functional representation
   The influence functions
   Example: Tobacco data
   Breakdown property
   Computation times
   Robustness properties in high dimensions
   Empirical influence function
   Empirical breakdown property
   Conclusions
   Appendix A
   Appendix B
   References

II  A theoretical study of depth-related estimators

5  Censored Regression Quantiles
   Introduction
   Censored regression quantiles
   Consistency proof
   Censored depth regression quantiles
   Conclusion
   Appendix
   References

6  Location-Scale Depth
   Introduction
   Another interpretation of Student depth
   Attaining the maximal Student depth
   Influence functions
   Breakdown value
   Conclusions
   References

Conclusion
Nederlandse Samenvatting (Dutch summary)


Introduction

Preprocessing and analyzing high-dimensional data are tasks that arise naturally in chemistry, bio-informatics, image analysis, and related fields. In these fields, large or "fat" data sets, in which the number of observations is largely exceeded by the number of variables, appear very naturally from spectral analysis, gene analysis through microarrays, or image segmentation. Principal Component Analysis (PCA), a classification method called SIMCA, and Partial Least Squares Regression (PLSR) are three calibration methods that were developed and studied to make such measurements in huge dimensions comprehensible, both in the multivariate setting and in the regression situation. Whereas PCA and SIMCA operate on one set of variables, say $X_{n,p}$, PLSR introduces a regression model between a set of predictors $X_{n,p}$ and a set of responses $Y_{n,q}$. Note that we denote the dimension of a matrix using a subscript, e.g. $X_{n,p}$ stands for an $(n \times p)$ dimensional matrix, and we print column vectors in bold. These measurements are typically spectra, with the number of observations $n \ll p$, the number of wavelengths. In microarray analysis $p$ represents the number of genes for $n$ arrays.

However, all three methods lack robustness. This means that they are incapable of detecting unusual measurements, so-called outliers, which often leads to unreliable outcomes. Therefore, we study the robustness of PCA, SIMCA and PLSR in Part I of this thesis and we propose robust alternatives.

First we focus in Chapter 1 on PCA. As this is a dimension reduction technique, it is very well suited to handle high-dimensional data. We assume that the original set of $p$ highly correlated variables can be summarized by a much smaller group of $k$ uncorrelated variables, obtained as linear combinations of the original variables with coefficients stored in the loading matrix $P_{p,k}$, where $k \ll p$. To formalize the PCA model, we let $\mathbf{x}_i$ be the $i$th measurement (row) of $X_{n,p}$, $\boldsymbol{\mu}_x$ the center of $X_{n,p}$ and $\mathbf{t}_i = P'_{k,p}(\mathbf{x}_i - \boldsymbol{\mu}_x)$ the $i$th score. PCA then looks for a $k$-dimensional subspace, represented by the orthonormal columns of $P_{p,k}$, by minimizing $\sum_{i=1}^{n} \|\mathbf{g}_i\|^2$ where

$\mathbf{x}_i = \boldsymbol{\mu}_x + P_{p,k}\mathbf{t}_i + \mathbf{g}_i$   (1)

and $\mathbf{g}_i$ is the error term. PCA thus results in a new low-dimensional data set $T_{n,k} = (\mathbf{t}_i)_{i=1}^{n}$ where the small dimension $k$ now allows for a better interpretation. In practical examples $k$ is often not larger than 5. The classical PCA (CPCA) loadings can be found by taking the eigenvectors of the empirical variance-covariance matrix of the data $X_{n,p}$. As this approach is badly influenced by the presence of outliers, we propose in Chapter 1 a robust PCA method called ROBPCA. It combines projection-pursuit ideas with robust covariance estimation. We show its superiority in robustness by means of a comparative simulation study and with several examples.
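As a minimal illustration of model (1), the following sketch computes classical PCA loadings, scores and error terms. It is only an illustration and not part of the thesis; it assumes Python with NumPy, whereas the software accompanying this work is written in Matlab.

```python
import numpy as np

def classical_pca(X, k):
    """Classical PCA: center X (n x p) and take the k dominant eigenvectors
    of the empirical covariance matrix as the loading matrix P (p x k)."""
    mu = X.mean(axis=0)                      # estimate of the center mu_x
    Xc = X - mu
    cov = np.cov(Xc, rowvar=False)           # empirical variance-covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:k]
    P = eigvec[:, order]                     # loading matrix P_{p,k}
    T = Xc @ P                               # scores t_i = P'(x_i - mu_x)
    G = Xc - T @ P.T                         # error terms g_i in model (1)
    return mu, P, T, G

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))   # toy correlated data
mu, P, T, G = classical_pca(X, k=2)
print(T.shape, np.sum(G**2))                 # n x k scores and residual sum of squares
```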

This robust PCA method ROBPCA now allows us to investigate other multivariate techniques for high-dimensional data. In Chapter 2 we will incorporate the ROBPCA method in a robust classification method called RSIMCA. This classifier, based on the SIMCA method [9], applies a principal component analysis to each available group separately and uses robust classification rules. This gives rise to lower misclassification percentages compared with SIMCA when outliers appear in one or in several groups.

ROBPCA can also be exploited in the regression context. Especially in the case of a large set of predictors or independent variables $X_{n,p}$ with $n \ll p$, it is important to extract only the most important predictors for the response(s) $Y_{n,q}$, or to obtain the best linear combinations of the $p$ original covariates for predicting $Y_{n,q}$. In such situations multivariate least squares regression is no longer an option, and therefore biased regression methods such as PLSR [3] have been proposed in the past. The basic idea behind PLSR is to construct a bilinear model

$\mathbf{x}_i = \boldsymbol{\mu}_x + P_{p,k}\mathbf{t}_i + \mathbf{g}_i$
$\mathbf{y}_i = \boldsymbol{\mu}_y + A_{q,k}\mathbf{t}_i + \mathbf{f}_i.$   (2)

The matrix of x-loadings is given by $P_{p,k}$, $A$ is the slope matrix in the regression of $\mathbf{y}_i$ on $\mathbf{t}_i = R'_{k,p}(\mathbf{x}_i - \boldsymbol{\mu}_x)$, $\boldsymbol{\mu}_x$ and $\boldsymbol{\mu}_y$ are the centers of $X_{n,p}$ and $Y_{n,q}$, and $\mathbf{g}_i$ and $\mathbf{f}_i$ are the error terms. Here the transpose of a vector $\mathbf{v}$ or a matrix $V$ is written as $\mathbf{v}'$ or $V'$. The estimates from this bilinear model provide regression estimates expressing the relationship between $\mathbf{x}_i$ and $\mathbf{y}_i$:

$\mathbf{y}_i = \boldsymbol{\beta}_0 + B'_{q,p}\mathbf{x}_i + \mathbf{e}_i$

with $B = R_{p,k}A'$, $\boldsymbol{\beta}_0 = \boldsymbol{\mu}_y - B'_{q,p}\boldsymbol{\mu}_x$, and $\mathbf{e}_i$ the regression error. Although (1) and (2) look similar, the PLS regression scores $\mathbf{t}_i$ do not have to correspond with the PCA scores, as another objective function is considered. This will be discussed in Chapter 3. Basically, the construction of the matrix $R_{p,k}$ leads to different regression methods. Besides PLSR, Principal Component Regression (PCR) [3] is another very popular biased regression technique in chemometrics. This method also proceeds according to the same bilinear model. The main difference with PLSR is that the PCR regression scores $\mathbf{t}_i$ now correspond with the PCA scores $P'_{k,p}(\mathbf{x}_i - \boldsymbol{\mu}_x)$. The scores from PCR are thus solely based on the set of regressors $X_{n,p}$. In contrast, PLSR already involves the responses in this first stage, and therefore the scores on which the final regression results are based are expected to inherit more useful predictive information for these responses.
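The contrast between PCR and PLSR can be made concrete with a small sketch that fits both on the same fat data set. It uses scikit-learn purely for illustration (an assumption of the sketch; the thesis works with the SIMPLS algorithm and its robust versions instead), and only shows that PCR scores depend on $X_{n,p}$ alone while the PLS scores are built with the response in mind.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, p, k = 60, 200, 3                         # a "fat" data set: p much larger than n
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=n)

# PCR: scores come from a PCA of X only, then y is regressed on those scores.
pca = PCA(n_components=k).fit(X)
T_pcr = pca.transform(X)
pcr = LinearRegression().fit(T_pcr, y)

# PLSR: the scores are constructed to carry predictive information about y.
pls = PLSRegression(n_components=k).fit(X, y)
print("PCR  R^2:", pcr.score(T_pcr, y))
print("PLSR R^2:", pls.score(X, y))
```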

As PLSR turns out not to be robust against outliers, Chapter 3 is dedicated to a robust PLSR method. It is called RSIMPLS because it is based on the ideas behind the SIMPLS method [1]. The ROBPCA method proves to be successful for robustifying both stages (i.e. the PCA step and the regression stage) of the bilinear model. A theoretical study of RSIMPLS is contained in Chapter 4, where we derive its influence function for low-dimensional data. We simultaneously obtain the corresponding influence functions for SIMPLS.

All algorithms developed in this first part are available as part of LIBRA, a Matlab Library for Robust Analysis [7]. This toolbox can be downloaded from our web site. S-Plus and R implementations of several algorithms can also be obtained from our web site. Moreover, our programs are becoming available in the PLS Toolbox [8], which is the standard Matlab library for chemometricians.

In Part II of this thesis we focus on regression quantiles for censored data and on depth-related estimators. In Chapter 5 we first discuss a recent method, based on the Kaplan-Meier estimator, for estimating regression quantiles in the case of censored data [5]. Censored data occur very frequently in clinical trials when persons drop out of a study. Because it is wasteful to simply omit those persons from the analysis, they are still taken into account as censored observations. One then knows that the true response for that person, say the life time, exceeds the observed value. Instead of observing a response $Y_i$ for a set of predictors $X_i$, we thus now have a triple $(Y_i, C_i, \Delta_i)$ with $C_i$ the censoring time and $\Delta_i = I(Y_i \leq C_i)$ the censoring indicator. This indicator tells whether observation $i$ is observed ($\Delta_i = 1$) or not ($\Delta_i = 0$). We denote the observed response by $\tilde{Y}_i = \min(Y_i, C_i)$. If $\Delta_i = 0$ we only know that the true response value $Y_i$ is some unknown value larger than $C_i$. In Chapter 5 we suppose that the $\tau$th conditional regression quantile of $Y$ given $X = x$, denoted by $Q_Y(\tau \mid x)$, is linear in $x$. So for $\tau \in (0, 1)$

$Q_Y(\tau \mid x) = x'\beta(\tau).$

We will discuss two methods to obtain these linear regression quantiles $\beta(\tau)$ for censored data. The first method is proposed by Portnoy [5] and it extends the Koenker-Bassett estimator [2] by means of the Kaplan-Meier estimator. The basic idea is to introduce a reweighting scheme where a weight $w(\tau)$ is given to a censored observation once it is crossed, and a weight $1 - w(\tau)$ is given to a pseudo-observation at infinity. The main result in this chapter is a consistency proof of this estimator. However, this censored regression quantile estimator is only robust towards outliers in the response variable $Y$. Therefore we also briefly introduce a second estimator that can deal with leverage points, which are outlying in x-space. This method combines concepts of regression depth [6] with the reweighting scheme for censored data.
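As a point of reference for Chapter 5, the following sketch estimates an ordinary, uncensored regression quantile $\beta(\tau)$ in the Koenker-Bassett sense [2], using the QuantReg routine from the statsmodels package. This is an assumption of the sketch: Portnoy's reweighting scheme for censored observations is not implemented here, only the building block that it extends.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.standard_t(df=3, size=n)   # heavy-tailed errors

X = sm.add_constant(x)                             # design matrix with intercept
tau = 0.75
beta_tau = sm.QuantReg(y, X).fit(q=tau).params     # estimate of beta(tau)
print("beta(0.75):", beta_tau)
```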

The idea of regression depth recently inspired Mizera and Müller [4] to define a location-scale depth for univariate data based on a likelihood principle. In Chapter 6 we look at the most tractable version of their proposed depth, i.e. the Student depth, which corresponds to the bivariate halfspace depth interpreted in the Poincaré plane of the Lobachevski geometry. The Student depth of some $(\mu, \sigma) \in \mathbb{R} \times [0, \infty)$ with respect to some probability measure $P$ on $\mathbb{R}$ is defined as

$d(\mu, \sigma, P) = \inf_{(u_1, u_2)} P\{\, y : u_1(y - \mu) + u_2((y - \mu)^2 - \sigma^2) \geq 0 \,\}.$   (3)

We derive the influence function of this Student depth and of the Student median, which is the maximum depth estimator of location and scale. This is the couple $(\mu_S, \sigma_S) \in \mathbb{R} \times [0, \infty)$ for which (3) attains its maximal value. We also obtain the breakdown value of the Student median.

In a final conclusion all results obtained in Part I and in Part II are summarized. We also discuss some possibilities for future research.
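Definition (3) is easy to approximate numerically for an empirical distribution: map every observation to the pair $(y_i - \mu, (y_i - \mu)^2 - \sigma^2)$ and take the smallest fraction of points lying in a closed halfplane through the origin. The sketch below (Python/NumPy) scans a finite grid of directions, so it only approximates the infimum; it is an illustration, not the estimator studied in Chapter 6.

```python
import numpy as np

def student_depth(mu, sigma, y, n_dir=3600):
    """Approximate the Student depth d(mu, sigma, P_n) of definition (3)
    for the empirical distribution of the sample y, by scanning a grid of
    directions (u1, u2) on the unit circle."""
    z = np.column_stack([y - mu, (y - mu) ** 2 - sigma ** 2])
    angles = np.linspace(0.0, 2 * np.pi, n_dir, endpoint=False)
    U = np.column_stack([np.cos(angles), np.sin(angles)])   # candidate directions
    # For each direction, the fraction of observations in the closed halfplane
    # {z : u'z >= 0}; the depth is the smallest such fraction over all directions.
    fractions = (z @ U.T >= 0).mean(axis=0)
    return fractions.min()

rng = np.random.default_rng(3)
y = rng.standard_normal(500)
print(student_depth(np.median(y), y.std(), y))   # a reasonable (mu, sigma): high depth
print(student_depth(5.0, 0.1, y))                # a poorly fitting (mu, sigma): low depth
```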

References

[1] S. de Jong. SIMPLS: an alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18:251–263, 1993.

[2] R. Koenker and G.W. Bassett. Regression quantiles. Econometrica, 46:33–50, 1978.

[3] H. Martens and T. Naes. Multivariate Calibration. Wiley, Chichester, UK, 1989.

[4] I. Mizera and C.H. Müller. Location-scale depth. Journal of the American Statistical Association, 99:949–966, 2004.

[5] S. Portnoy. Censored regression quantiles. Journal of the American Statistical Association, 98:1001–1012, 2003.

[6] P.J. Rousseeuw and M. Hubert. Regression depth. Journal of the American Statistical Association, 94:388–402, 1999.

[7] S. Verboven and M. Hubert. LIBRA: a Matlab library for robust analysis. Chemometrics and Intelligent Laboratory Systems, 75:127–136, 2005.

[8] B.M. Wise, N.B. Gallagher, R. Bro, J.M. Shaver, W. Windig, and R.S. Koch. PLS Toolbox 3.5 for use with MATLAB. Software, Eigenvector Research, Inc., August 2004.

[9] S. Wold. Pattern recognition by means of disjoint principal components models. Pattern Recognition, 8:127–139, 1976.


List of Publications

M. Hubert and K. Vanden Branden. Robust methods for partial least squares regression. Journal of Chemometrics, 17:537–549, 2003.

K. Vanden Branden and M. Hubert. Robustness properties of a robust PLS regression method. Analytica Chimica Acta, 515:229–241, 2004.

S. Engelen, M. Hubert, K. Vanden Branden, and S. Verboven. Robust PCR and Robust PLSR: a comparative study. In Theory and Applications of Recent Robust Methods, edited by M. Hubert, G. Pison, A. Struyf and S. Van Aelst, Series: Statistics for Industry and Technology, Birkhäuser, Basel, 2004.

K. Vanden Branden and M. Hubert. Robust classification of high-dimensional data. In Proceedings in Computational Statistics, edited by J. Antoch, Springer-Verlag, Heidelberg, 2004.

M. Hubert, P.J. Rousseeuw, and K. Vanden Branden. Comment on "Location-scale depth" by I. Mizera and C. Müller. Journal of the American Statistical Association, 99:966–970, 2004.

M. Hubert, P.J. Rousseeuw, and K. Vanden Branden. ROBPCA: a new approach to robust principal component analysis. Technometrics, 47:64–79, 2005.

K. Vanden Branden and M. Hubert. Robust classification in high dimensions based on the SIMCA method. To appear in Chemometrics and Intelligent Laboratory Systems, 2005.

S. Engelen, M. Hubert, and K. Vanden Branden. A comparison of three procedures for robust PCA in high dimensions. To appear in Austrian Journal of Statistics, 2005.


Part I

Robust methods for high-dimensional data


Chapter 1

ROBPCA: a New Approach to Robust Principal Component Analysis

Classical PCA is based on the empirical covariance matrix of the data and hence it is highly sensitive to outlying observations. In the past, two robust approaches have been developed. The first is based on the eigenvectors of a robust scatter matrix such as the MCD or an S-estimator, and is limited to relatively low-dimensional data. The second approach is based on projection pursuit and can handle high-dimensional data. In this chapter, we propose the ROBPCA approach, which combines projection pursuit ideas with robust scatter matrix estimation. It yields more accurate estimates at non-contaminated data sets and more robust estimates at contaminated data. ROBPCA can be computed fast, and is able to detect exact fit situations. As a byproduct, ROBPCA produces a diagnostic plot which displays and classifies the outliers. The algorithm is applied to several data sets from chemometrics and engineering.

1.1 Introduction

Principal component analysis is a popular statistical method which tries to explain the covariance structure of data by means of a small number of components. These components are linear combinations of the original variables, and often allow for an interpretation and a better understanding of the different sources of variation. Because PCA is concerned with data reduction, it is widely used for the analysis of high-dimensional data which are frequently encountered in chemometrics, computer vision, engineering, genetics, and other domains. PCA is then often the first step of the data analysis, followed by discriminant analysis, cluster analysis, or other multivariate techniques. It is thus important to find those principal components that contain most of the information.

In the classical approach, the first component corresponds to the direction in which the projected observations have the largest variance. The second component is then orthogonal to the first and again maximizes the variance of the data points projected on it. Continuing in this way produces all the principal components, which correspond to the eigenvectors of the empirical covariance matrix. Unfortunately, both the classical variance (which is being maximized) and the classical covariance matrix (which is being decomposed) are very sensitive to anomalous observations. Consequently, the first components are often attracted towards outlying points, and may not capture the variation of the regular observations. Therefore, data reduction based on classical PCA (CPCA) becomes unreliable if outliers are present in the data.

The goal of robust PCA methods is to obtain principal components that are not influenced much by outliers. A first group of methods is obtained by replacing the classical covariance matrix by a robust covariance estimator. Maronna [19] and Campbell [3] proposed to use affine equivariant M-estimators of scatter for this purpose, but these cannot resist many outliers. More recently, Croux and Haesbroeck [4] used positive-breakdown estimators such as the minimum covariance determinant (MCD) method [22] and S-estimators [6, 24]. The result is more robust, but unfortunately limited to small to moderate dimensions. To see why, let us e.g. consider the MCD estimator. It is defined as the mean and the covariance matrix of the h observations (out of the whole data set of size n) whose covariance matrix has the smallest determinant. If p denotes the number of variables in our data set, the MCD estimator can only be computed if p < h, otherwise the covariance matrix of any h-subset has zero determinant. By default h is about 0.75n, and it may be chosen as small as 0.5n; in any case p may never be larger than n.

A second problem is the computation of these robust estimators in high dimensions. Today's fastest algorithms [28, 25] can handle up to about 100 dimensions, whereas there are fields like chemometrics which need to analyze data with dimensions in the thousands.

A second approach to robust PCA uses Projection Pursuit (PP) techniques [17, 5, 11]. These methods maximize a robust measure of spread to obtain consecutive directions on which the data points are projected. This idea has also been generalized to common principal components [1]. It yields transparent algorithms that can be applied to data sets with many variables and/or many observations.

In Section 1.2 we propose the ROBPCA method, which attempts to combine the advantages of both approaches. We also describe an accompanying diagnostic plot which can be used to detect and classify possible outliers. Several real data sets from chemometrics and engineering are analyzed in Section 1.3, whereas Section 1.4 investigates the performance and robustness of ROBPCA through simulations. The concluding section outlines potential applications of ROBPCA in other types of multivariate data analysis.

1.2 The ROBPCA method

1.2.1 Description

The proposed ROBPCA method combines ideas of both projection pursuit and robust covariance estimation. The PP part is used for the initial dimension reduction. Some ideas based on the MCD estimator are then applied to this lower-dimensional data space. The combined approach yields more accurate estimates than the raw PP algorithm, as we will see in Section 1.4.

The complete description of the ROBPCA method is quite involved and relegated to Appendix A, but here is a rough sketch of how it works. We assume that the original data are stored in an $n \times p$ data matrix $X = X_{n,p}$ where $n$ stands for the number of objects and $p$ for the original number of variables. The ROBPCA method then proceeds in three major steps. First, the data are preprocessed such that the transformed data are lying in a subspace whose dimension is at most $n - 1$. Next, we construct a preliminary covariance matrix $S_0$ that is used for selecting the number of components $k$ that will be retained in the sequel, yielding a $k$-dimensional subspace that fits the data well. Then we project the data points on this subspace, where we robustly estimate their location and their scatter matrix, of which we compute the $k$ non-zero eigenvalues $l_1, \ldots, l_k$.

The corresponding eigenvectors are the $k$ robust principal components. In the original space of dimension $p$, these $k$ components span a $k$-dimensional subspace. Formally, writing the (column) eigenvectors next to each other yields the $p \times k$ matrix $P_{p,k}$ with orthogonal columns. The location estimate is denoted as the $p$-variate column vector $\hat{\boldsymbol{\mu}}$ and called the robust center. The scores are the entries of the $n \times k$ matrix

$T_{n,k} = (X_{n,p} - \mathbf{1}_n\hat{\boldsymbol{\mu}}')P_{p,k}$   (1.1)

where $\mathbf{1}_n$ is the column vector with all $n$ components equal to 1. Moreover, the $k$ robust principal components generate a $p \times p$ robust scatter matrix $S$ of rank $k$ given by

$S = P_{p,k} L_{k,k} P'_{k,p}$   (1.2)

where $L_{k,k}$ is the diagonal matrix with the eigenvalues $l_1, \ldots, l_k$.

Like classical PCA, the ROBPCA method is location and orthogonal equivariant. That is, when we apply a shift and/or an orthogonal transformation (e.g. a rotation or a reflection) to the data, the robust center is also shifted and the loadings are rotated accordingly. Hence the scores do not change under this type of transformations. Let $A_{p,p}$ define the orthogonal transformation, thus $A$ is of full rank and $A' = A^{-1}$, and let $\hat{\boldsymbol{\mu}}_x$ and $P_{p,k}$ be the ROBPCA center and loading matrix for the original $X_{n,p}$. Then the ROBPCA center and loadings for the transformed data $XA' + \mathbf{1}_n\mathbf{v}'$ are equal to $A\hat{\boldsymbol{\mu}}_x + \mathbf{v}$ and $AP$. Consequently the scores remain the same under these transformations:

$T(XA' + \mathbf{1}_n\mathbf{v}') = (XA' + \mathbf{1}_n\mathbf{v}' - \mathbf{1}_n(A\hat{\boldsymbol{\mu}}_x + \mathbf{v})')AP = (X - \mathbf{1}_n\hat{\boldsymbol{\mu}}'_x)P = T(X).$

Although these properties seem very natural for a PCA method, they are not shared by some other robust PCA estimators such as the resampling by half-means and the smallest half-volume methods of Egan and Morgan [8].
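The next sketch illustrates formulas (1.1) and (1.2) and verifies the equivariance argument numerically. For simplicity the center and loadings are taken from an ordinary SVD; in ROBPCA they would of course be the robust estimates, so this substitution is purely an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 50, 8, 3
X = rng.normal(size=(n, p))

# Stand-in estimates of the center and the loading matrix P_{p,k}
# (from an ordinary SVD; ROBPCA would supply robust versions instead).
mu = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
P = Vt[:k].T                                  # p x k matrix with orthonormal columns
l = (s[:k] ** 2) / (n - 1)                    # eigenvalue estimates l_1, ..., l_k

T = (X - mu) @ P                              # scores, equation (1.1)
S = P @ np.diag(l) @ P.T                      # rank-k scatter matrix, equation (1.2)

# Orthogonal equivariance: transform the data by x -> A x + v.
A = np.linalg.qr(rng.normal(size=(p, p)))[0]  # a random orthogonal matrix
v = rng.normal(size=p)
Xt = X @ A.T + v
mu_t, P_t = A @ mu + v, A @ P                 # transformed center and loadings
T_t = (Xt - mu_t) @ P_t                       # scores of the transformed data
print(np.allclose(T, T_t))                    # True: the scores are unchanged
```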

1.2.2 Diagnostic plot

As is the case for many robust methods, the goal of a robust PCA is twofold. First, it allows us to find those linear combinations of the original variables that contain most of the information, even if there are outliers. Secondly, these estimates are useful to flag outliers and to determine of which type they are.

To see that there can be different types of outliers, consider Figure 1.1 where p = 3 and k = 2. Here we can distinguish between four types of observations. The regular observations form one homogeneous group that is close to the PCA subspace. Next, we have good leverage points that lie close to the PCA space but far from the regular observations, such as the observations 1 and 4 in Figure 1.1. We can also have orthogonal outliers whose orthogonal distance to the PCA space is large but which we cannot see when we only look at their projection on the PCA space, like observation 5. The fourth type of data points are the bad leverage points, which have a large orthogonal distance and whose projection on the PCA subspace is remote from the typical projections, such as observations 2 and 3.

Figure 1.1: Different types of outliers when a three-dimensional data set is projected on a robust two-dimensional PCA-subspace.

To distinguish between regular observations and the three types of outliers for higher-dimensional data, we construct a diagnostic plot. On the horizontal axis it plots the robust score distance $SD_i$ of each observation, given by

$SD_i = \sqrt{\sum_{j=1}^{k} \frac{t_{ij}^2}{l_j}}$   (1.3)

where the scores $t_{ij}$ are obtained from (1.1). If k = 1, we prefer to plot the (signed) standardized score $t_i/\sqrt{l_1}$. On the vertical axis of the diagnostic plot we display the orthogonal distance $OD_i$ of each observation to the PCA subspace, defined as

$OD_i = \|\mathbf{x}_i - \hat{\boldsymbol{\mu}} - P_{p,k}\mathbf{t}_i\|$   (1.4)

where the $i$th observation is denoted as the $p$-variate column vector $\mathbf{x}_i$ and $\mathbf{t}_i$ is the $i$th row of $T_{n,k}$.

To classify the observations we draw two cut-off lines. The cut-off value on the horizontal axis is $\sqrt{\chi^2_{k,0.975}}$ when k > 1 and $\pm\sqrt{\chi^2_{1,0.975}}$ when k = 1 (because the squared Mahalanobis distances of normally distributed scores are approximately $\chi^2_k$-distributed). The cut-off value on the vertical axis is more difficult to determine because the distribution of the orthogonal distances is not known exactly. However, a scaled chi-squared distribution $g_1\chi^2_{g_2}$ gives a good approximation for the unknown distribution of the squared orthogonal distances [2]. In [2] the method of moments is used to estimate the two unknown parameters $g_1$ and $g_2$. We prefer to follow a robust approach. We use the Wilson-Hilferty approximation for a chi-squared distribution. This implies that the orthogonal distances to the power 2/3 are approximately normally distributed with mean $\mu = (g_1 g_2)^{1/3}(1 - \frac{2}{9g_2})$ and variance $\sigma^2 = \frac{2g_1^{2/3}}{9g_2^{1/3}}$. We obtain estimates $\hat{\mu}$ and $\hat{\sigma}^2$ by means of the univariate MCD. The cut-off value on the vertical axis then equals $(\hat{\mu} + \hat{\sigma}z_{0.975})^{3/2}$ with $z_{0.975} = \Phi^{-1}(0.975)$ the 97.5% quantile of the Gaussian distribution.

Note the analogy of this diagnostic plot with that of Rousseeuw and Van Zomeren [26] for robust regression. There the vertical axis gives the standardized residuals obtained with a robust regression method, with cut-off values at -2.5 and 2.5 (because for normally distributed data roughly 1% of the standardized residuals fall outside that interval).
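A sketch of how the two distances and their cut-off values can be computed is given below (Python with NumPy/SciPy). As an assumption of the sketch, the univariate MCD estimates of $\mu$ and $\sigma^2$ are replaced by the median and the MAD, purely to keep the example self-contained; this is not the procedure used in ROBPCA itself. Observations exceeding the vertical cut-off are orthogonal outliers or bad leverage points, and observations exceeding only the horizontal cut-off are good leverage points.

```python
import numpy as np
from scipy.stats import chi2, norm

def diagnostic_distances(X, mu, P, l):
    """Score distances (1.3) and orthogonal distances (1.4) with their cut-offs.
    X: n x p data, mu: robust center, P: p x k loadings, l: k eigenvalues."""
    k = P.shape[1]
    T = (X - mu) @ P
    SD = np.sqrt(np.sum(T ** 2 / l, axis=1))                 # score distances
    OD = np.linalg.norm(X - mu - T @ P.T, axis=1)            # orthogonal distances

    sd_cut = np.sqrt(chi2.ppf(0.975, df=k))                  # horizontal cut-off
    # Wilson-Hilferty: OD^(2/3) is roughly Gaussian; estimate its center and
    # spread robustly (median/MAD stand in for the univariate MCD of the text).
    od23 = OD ** (2.0 / 3.0)
    m_hat = np.median(od23)
    s_hat = 1.4826 * np.median(np.abs(od23 - m_hat))
    od_cut = (m_hat + s_hat * norm.ppf(0.975)) ** 1.5        # vertical cut-off
    return SD, OD, sd_cut, od_cut
```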

1.3 Examples

We will illustrate the ROBPCA method and the diagnostic plot on several real data sets. We also compare the results from ROBPCA with four other PCA methods: classical PCA (CPCA), RAPCA [11], and spherical (SPHER) and ellipsoidal (ELL) PCA [18]. The latter three methods are also robust and designed for high-dimensional data. Li and Chen [17] proposed the idea of projection pursuit for PCA, but their algorithm has a high computational cost. More attractive methods have been developed in [5], but in high dimensions these algorithms still have a numerical inaccuracy. Therefore, Hubert et al. [11] have developed RAPCA, which is a fast two-step algorithm. It searches for the direction on which the projected observations have the largest robust scale, and then removes this dimension and repeats.

The spherical and ellipsoidal PCA methods provide a very fast algorithm to perform robust PCA. After robustly centering the data, the observations are projected on a sphere (spherical PCA) or an ellipse (ellipsoidal PCA). The principal components are then derived as the eigenvectors of the covariance matrix of these projected data points. SPHER and ELL do not yield estimates of the eigenvalues, which makes it impossible to compute score distances. Therefore, in the examples we additionally applied the MCD estimator on the scores to compute robust distances in the PCA subspace.

1.3.1 Car data

Our first example is the low-dimensional car data set which is available in S-PLUS as the data frame cu.dimensions. For n = 111 cars, p = 11 characteristics were measured, such as the length, the width and the height of the car. We first looked at pairwise scatter plots of the variables, and we computed pairwise Spearman rank correlations $\rho_S(X_i, X_j)$. This preliminary analysis already indicated that there are high correlations among the variables, e.g. $\rho_S(X_1, X_2) = 0.83$ and $\rho_S(X_3, X_9) = 0.87$. Hence, PCA seems an appropriate method to find the most important sources of variation in this data set.

When applying ROBPCA to these data, an important choice we need to make is how many principal components to keep. This is done by means of the eigenvalues $l_1 \geq l_2 \geq \ldots \geq l_r$ of $S_0$ with $r = \mathrm{rank}(S_0)$, as obtained in the second stage of the algorithm (see also Appendix A). We can use these eigenvalues in various ways. For instance, we can look at the scree plot, which is a graph of the (monotone decreasing) eigenvalues [14]. We can also use a selection criterion, e.g. to choose k such that

$\sum_{j=1}^{k} l_j \Big/ \sum_{j=1}^{r} l_j \geq 90\%$   (1.5)

or for instance such that

$l_k / l_1 \geq 10^{-3}.$   (1.6)

Here we decided to retain k = 2 components based on criterion (1.5), because $(l_1 + l_2)/\sum_{j=1}^{11} l_j = 94\%$.
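A small sketch of this selection rule, assuming the eigenvalues are already available and using 90% and $10^{-3}$ merely as illustrative default thresholds, could look as follows.

```python
import numpy as np

def choose_k(eigvals, var_explained=0.90, ratio=1e-3):
    """Pick the number of components from the eigenvalues l_1 >= ... >= l_r,
    in the spirit of criteria (1.5) and (1.6): enough explained variability
    and no retained eigenvalue negligibly small relative to l_1."""
    l = np.asarray(eigvals, dtype=float)
    cumulative = np.cumsum(l) / l.sum()
    k_var = int(np.searchsorted(cumulative, var_explained) + 1)   # criterion (1.5)
    k_ratio = int(np.sum(l / l[0] >= ratio))                      # criterion (1.6)
    return min(k_var, k_ratio)

print(choose_k([8.1, 2.2, 0.4, 0.02, 0.001]))   # retains 2 components here
```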

Figure 1.2(a) shows the resulting diagnostic plot. We can distinguish a group of orthogonal outliers (labelled 13–14, 17, 19, 111) and two groups of bad leverage points (cases 12, 15–16, 18, 11 and observations 25, 30, 32, 34, 36). A few good leverage points are also visible (6 and 96). If we look at the measurements, we notice that the five most important bad leverage points (25, 30, 32, 34, and 36) have the value -2 on four of the 11 original variables, namely $X_6$ = Rear.Hd, $X_8$ = Rear.Seat, $X_{10}$ = Rear.Shld, $X_{11}$ = luggage. None of the other observations share this property. The observations have the value -2 for the last variable $X_{11}$ = luggage (and observation 19 the value -3).

Figure 1.2: Diagnostic plot of the car data set based on (a) two robust principal components; (b) two classical principal components.

Let us compare this robust result with a classical PCA analysis. The first two components account for 85% of the total variance. The diagnostic plot in Figure 1.2(b) looks completely different from the robust one, although the same set of outliers is detected. The most striking difference is that the group of bad leverage points from ROBPCA is converted into good leverage points. This shows how the subspace found by CPCA is attracted towards these bad leverage points.

Some of the differences between ROBPCA and CPCA are also visible in the plot of the scores $(t_{i1}, t_{i2})$ for all $i = 1, \ldots, n$. Figure 1.3(a) shows the score plot of ROBPCA, together with the 97.5% tolerance ellipse, which is defined as the set of vectors in $\mathbb{R}^2$ whose score distance is equal to $\sqrt{\chi^2_{2,0.975}}$. Data points which fall outside the tolerance ellipse are by definition the good and the bad leverage points. We clearly see how well the robust tolerance ellipse encloses the regular data points. Figure 1.3(b) is the score plot obtained with CPCA. The corresponding tolerance ellipse is highly inflated toward the outliers 25, 30, 32, 34 and 36. The resulting eigenvectors are not lying in the direction of the highest variability of the other points. We also see how the second eigenvalue of CPCA is blown up by the same set of outliers.

We also performed robust PCA on this low-dimensional data set using the eigenvectors and eigenvalues of the MCD covariance matrix. The resulting diagnostic plot was almost identical to the ROBPCA plot and is therefore not included. Also the other robust methods detected the same set of outliers.

Figure 1.3: The score plot with the 97.5% tolerance ellipse of the car data set for (a) ROBPCA; (b) CPCA.

1.3.2 Octane data

Our second example is the octane data set described in Esbensen et al. [9]. It contains near-infrared (NIR) absorbance spectra over p = 226 wavelengths of n = 39 gasoline samples with certain octane numbers. It is known that six of the samples (25, 26, 36–39) contain added alcohol. Both the classical scree plot, shown in Figure 1.4(a), and the ROBPCA scree plot in Figure 1.4(b) suggest to retain two principal components. The CPCA diagnostic plot is shown in Figure 1.5(a). We see that the classical analysis only detects the outlying spectrum 26, which does not even stick out much above the border line. In contrast, we immediately spot the six samples with added alcohol on the ROBPCA diagnostic plot in Figure 1.5(b). The first principal component from CPCA is clearly attracted by the six outliers, yielding a classical eigenvalue of 0.13. On the other hand the first robust eigenvalue $l_1$ is only 0.01.

Next, we wondered whether the robust loadings would be influenced by the outlying spectra if we would retain more than two components. To avoid the curse of dimensionality with n = 39 observations, it is generally advised that n > 5k (see [26]), so we considered $k_{\max} = 7$. From the robust diagnostic plot in Figure 1.5(c) we see that the outliers are still very far from the estimated robust subspace.

Figure 1.4: Scree plot of the octane data set with (a) CPCA; (b) ROBPCA.

The diagnostic plots of RAPCA, SPHER and ELL were similar to Figure 1.5(b) for k = 2. But when we selected k = 7 components with ELL, we see from Figure 1.5(d) that the outliers have a much lower orthogonal distance. This illustrates their leverage effect on the estimated principal components.

1.3.3 Glass spectra

Our third data set consists of EPXMA spectra over p = 750 wavelengths collected on 180 different glass samples [16]. The chemical analysis was performed using a Jeol JSM 6300 scanning electron microscope equipped with an energy-dispersive Si(Li) X-ray detection system (SEM-EDX). We performed ROBPCA with h = 0.7n = 126. Three components are retained for CPCA and ROBPCA, yielding a classical explanation percentage of 99% and a robust explanation percentage (1.5) of 96%. We then obtain the diagnostic plots in Figure 1.6. From the classical diagnostic plot in Figure 1.6(a), we see that CPCA does not find important outliers. On the other hand the ROBPCA plot of Figure 1.6(b) clearly distinguishes two major groups in the data, a smaller group of bad leverage points, a few orthogonal outliers and the isolated case 180 in between the two major groups. A high-breakdown method such as ROBPCA treats the smaller group of cases as one set of outliers. Later, it turned out that the window of the detector system had been cleaned before the last 38 spectra were measured.

Figure 1.5: Diagnostic plot of the octane data set based on (a) two CPCA principal components; (b) two ROBPCA principal components; (c) seven ROBPCA principal components; (d) seven ELL principal components.

As a result of this, less radiation (X-rays) is absorbed and more can be detected, resulting in higher X-ray intensities. If we look at the spectra, we can indeed observe these differences. The regular samples, shown in Figure 1.7(a), clearly have lower measurements at these channels than the samples of Figure 1.7(b). The spectrum of case 180 (not shown) was somewhat in between. Note that instead of plotting the raw data, we first robustly centered the spectra by subtracting the univariate MCD location estimator from each wavelength. Doing so we can observe more of the variability which is present in the data.

The other bad leverage points (57–63 and 74–76) are samples with a large concentration of calcium. In Figure 1.7(c) we see that their calcium alpha peak (around channels 340–370) and calcium beta peak (channels 375–400) are higher than for the other glass vessels. The orthogonal outliers (22, 23 and 30), whose spectra are shown in Figure 1.7(d), are rather boundary cases, although they have larger measurements at certain channels. This might indicate a larger concentration of phosphor.

Figure 1.6: Diagnostic plot of the glass data set based on three principal components computed with (a) CPCA; (b) ROBPCA; (c) SPHER and (d) ELL.

RAPCA yielded a diagnostic plot similar to the ROBPCA plot. SPHER and ELL are also able to detect the outliers, as can be seen from Figure 1.6(c) and (d), but they turn the bad leverage points into good leverage points and orthogonal outliers.

Figure 1.7: The glass data set: (a) regular observations; (b) bad leverage points; (c) bad leverage points 57–63 and 74–76; (d) orthogonal outliers 22, 23 and 30.

1.4 Simulations

We conducted a simulation study to compare the performance and the robustness of ROBPCA with the four other principal component methods introduced in Section 1.3: classical PCA (CPCA), RAPCA [11], spherical (SPHER) and elliptical (ELL) PCA [18]. We generated 100 samples of size n from the contamination model

$(1 - \varepsilon)N_p(\mathbf{0}, \Sigma) + \varepsilon N_p(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma})$  or  $(1 - \varepsilon)t_5(\mathbf{0}, \Sigma) + \varepsilon t_5(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma})$

for different values of $n$, $p$, $\varepsilon$, $\Sigma$, $\tilde{\boldsymbol{\mu}}$ and $\tilde{\Sigma}$. That is, $n(1-\varepsilon)$ of the observations were generated from the $p$-variate Gaussian distribution $N_p(\mathbf{0}, \Sigma)$ or the $p$-variate elliptical $t_5(\mathbf{0}, \Sigma)$ distribution, and $n\varepsilon$ of the observations were generated from $N_p(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma})$ or from $t_5(\tilde{\boldsymbol{\mu}}, \tilde{\Sigma})$.

Note that the consistency factor in the FAST-MCD algorithm, which is used within ROBPCA, is constructed under the assumption that the regular observations are normally distributed. Then the denominator equals $\chi^2_{k,1-\alpha}$, the $(1-\alpha)$ quantile of the $\chi^2$ distribution with $k$ degrees of freedom. Hence, the best results of the simulations with $t_\nu$ (and here $\nu = 5$) are obtained by replacing the denominator with $k((\nu - 2)/\nu)F_{k,\nu,1-\alpha}$, with $F_{k,\nu,1-\alpha}$ the $(1-\alpha)$ quantile of the $F$ distribution with $k$ and $\nu$ degrees of freedom. However, in real examples any foreknowledge of the true underlying distribution is mostly unavailable. Therefore, and also to make a fair comparison with RAPCA, we did not adjust the consistency factor.

In the tables and figures we report some typical results, which were obtained in the following situations:

1. $n = 100$, $p = 4$, $\Sigma = \mathrm{diag}(8, 4, 2, 1)$ and $k = 3$ (because then $(\sum_{i=1}^{3}\lambda_i)/(\sum_{i=1}^{4}\lambda_i) = 93.3\%$).
   (4a) $\varepsilon = 0$ (no contamination).
   (4b) $\varepsilon = 10\%$ or $\varepsilon = 20\%$, $\tilde{\boldsymbol{\mu}} = f_1\mathbf{e}_4 = (0, 0, 0, f_1)'$, $\tilde{\Sigma} = \Sigma/f_2$ where $f_1 = 6, 8, 10, \ldots, 20$ and $f_2 = 1$ or $f_2 = 15$.

2. $n = 50$, $p = 100$, $\Sigma = \mathrm{diag}(17, 13.5, 8, 3, 1, 0.095, \ldots, 0.002, 0.001)$ and $k = 5$ (because here $(\sum_{i=1}^{5}\lambda_i)/(\sum_{i=1}^{100}\lambda_i) = 90.3\%$).
   (100a) $\varepsilon = 0$ (no contamination).
   (100b) $\varepsilon = 10\%$ or $\varepsilon = 20\%$, $\tilde{\boldsymbol{\mu}} = f_1\mathbf{e}_6$, $\tilde{\Sigma} = \Sigma/f_2$ where $f_1 = 6, 8, 10, \ldots, 20$ and $f_2 = 1$ or $f_2 = 15$.

Note that $\varepsilon = 0\%$ also corresponds to $f_1 = 0$ and $f_2 = 1$. The subspace spanned by the first $k$ eigenvectors of $\Sigma$ is denoted by $E_k = \mathrm{span}\{\mathbf{e}_1, \ldots, \mathbf{e}_k\}$ with $\mathbf{e}_j$ the $j$th column of $I_{p,k}$. The settings (4a) and (4b) consider low-dimensional data ($p = 4$) of not too small size $n = 100$, whereas in (100a) and (100b) we generate high-dimensional data with $n = 50$ being rather small, and even less than $p = 100$. In the settings (4b) and (100b) the contaminated data are shifted by a distance $f_1$ in the direction of the $(k+1)$th principal component. We started with $f_1 = 6$, otherwise the outliers could not be distinguished from the regular data points.

The factor $f_2$ determines how strongly the contaminated data are concentrated. In [21] it is shown that shifted outliers with the same covariance structure as the regular points are the most difficult ones to detect. This situation corresponds with $f_2 = 1$. Note that because of the orthogonal equivariance of the ROBPCA method, we only need to consider diagonal covariance matrices.

For each simulation setting, we summarized the result of the estimation procedure (CPCA, RAPCA, SPHER, ELL and ROBPCA) as follows:

- For each method we consider the maximal angle between $E_k$ and the estimated PCA subspace, which is spanned by the columns of $P_{p,k}$. Krzanowski [15] proposed a measure to calculate this angle, which we will denote by maxsub:

  $\mathrm{maxsub} = \arccos(\sqrt{\lambda_k})$

  where $\lambda_k$ is the smallest eigenvalue of $I'_{k,p} P_{p,k} P'_{k,p} I_{p,k}$. It represents the largest angle between a vector in $E_k$ and the vector most parallel to it in the estimated PCA subspace. To standardize this value, we have divided it by $\pi/2$.

- We compute the proportion of variability that is explained by the estimated eigenvalues. This is done by comparing the sum of the $k$ largest eigenvalues to the sum of all $p$ known eigenvalues. We report the mean proportion of explained variability

  $\frac{1}{100}\sum_{l=1}^{100} \frac{\hat{\lambda}_1^{(l)} + \hat{\lambda}_2^{(l)} + \cdots + \hat{\lambda}_k^{(l)}}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}$

  where $\hat{\lambda}_j^{(l)}$ is the estimated value of $\lambda_j$ at the $l$th replication. It would be more elegant if the denominator also contained the estimated eigenvalues, but ROBPCA and RAPCA only estimate the first $k$ eigenvalues. We report these results for the settings without contamination, and for a specific situation with 10% contamination ($f_1 = 10$, $f_2 = 1$). As SPHER and ELL only estimate the principal components and not their eigenvalues, they are not included in this comparison.

- For the $k$ largest eigenvalues we also compute their mean squared error (MSE), defined as

  $\mathrm{MSE}(\hat{\lambda}_j) = \frac{1}{100}\sum_{l=1}^{100} (\hat{\lambda}_j^{(l)} - \lambda_j)^2.$

  We only report these results for $f_2 = 1$, because they were very similar for $f_2 = 15$.
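The maxsub measure is easy to reproduce numerically. The following sketch (Python/NumPy, with a toy loading matrix instead of actual simulation output) computes it for an estimated loading matrix $\hat{P}_{p,k}$.

```python
import numpy as np

def maxsub(P_hat, k, p):
    """Krzanowski's maximal angle between E_k = span{e_1,...,e_k} and the
    subspace spanned by the columns of P_hat (p x k), divided by pi/2."""
    I_pk = np.eye(p)[:, :k]
    M = I_pk.T @ P_hat @ P_hat.T @ I_pk
    lam_min = np.linalg.eigvalsh(M).min()          # smallest eigenvalue
    return np.arccos(np.sqrt(np.clip(lam_min, 0.0, 1.0))) / (np.pi / 2)

# A loading matrix close to E_k gives a value near 0, a badly rotated one gives 1.
p, k = 4, 3
P_good = np.eye(p)[:, :k]
P_bad = np.eye(p)[:, [3, 1, 2]]                    # one component orthogonal to E_k
print(maxsub(P_good, k, p), maxsub(P_bad, k, p))   # 0.0 and 1.0
```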

The ideal value of maxsub and of the MSE in the tables and figures is thus 0. For the mean proportion of explained variability the optimal values are 93.3% for the low-dimensional data and 90.3% for the high-dimensional data.

Table 1.1: Simulation results of maxsub in settings (4a) and (100a) when there is no contamination (columns: Distribution, n, p, CPCA, RAPCA, SPHER, ELL, ROBPCA).

Table 1.1 reports the simulation results of maxsub for the settings (4a) and (100a). We see that elliptical PCA yields the best results for maxsub when there is no contamination. For low-dimensional data, the results for the other methods are more or less comparable, whereas for high-dimensional data, RAPCA is clearly the least efficient approach.

From Table 1.2 we see that CPCA provides the best mean proportion of explained variability when there is no contamination in the data. RAPCA attains higher values than ROBPCA for both distributions. When contamination is added to the data, the eigenvalues obtained with CPCA are overestimated, which results in estimated percentages which are even larger than 100%! The robust methods are much less sensitive to the outliers, but also RAPCA attains a value larger than 100% at the contaminated low-dimensional normal distribution. Note that when the consistency factor in ROBPCA is adapted to the $t_5$ distribution, the results improve substantially. For the low-dimensional data we obtain 80% without and 82.8% with contamination, whereas in high dimensions the mean percentage of explained variability is 69.3% and 69.4% respectively.

The results of the maxsub measure for simulations (4b) and (100b) are summarized in the corresponding figures. In every situation, CPCA clearly fails and it provides the worst possible result, because maxsub is always very close to 1.

Table 1.2: Simulation results of the mean proportion of explained variability when there is no contamination and with 10% contamination ($f_1 = 10$ and $f_2 = 1$).

                         Multivariate normal           Multivariate t_5
                      CPCA    RAPCA   ROBPCA       CPCA    RAPCA   ROBPCA
  n = 100, p = 4
    eps = 0%          93.4%   94.7%   83.9%        98.7%   72.1%   60.2%
    eps = 10%        147.8%  112.9%   88.5%       135.2%   86.8%   67.5%
  n = 50, p = 100
    eps = 0%          91.6%   83.6%   79.5%        99.2%   65.1%   57.3%
    eps = 10%         19.4%   86.1%   79.3%        11.9%   66.9%   56.7%

This implies that the estimated PCA subspace has been attracted by the outliers in such a way that at least one principal component is orthogonal to $E_k$. Also RAPCA, SPHER and ELL are clearly influenced by the outliers, most strongly when the data are high-dimensional or when there is a large percentage of contamination. In all situations, ROBPCA outperforms the other methods. ROBPCA only attains high values for maxsub at the long-tailed $t_5$ when $f_1$ is between 6 and 8. This is because in this case the outliers are not yet very well separated from the regular data group. Also the other methods fail in such a situation. As soon as the contamination lies somewhat further away, ROBPCA is capable of distinguishing the outliers and maxsub remains almost constant.

Finally, some results for the MSEs of the eigenvalues are summarized in Figure 1.12. Here we display the ratio of the MSEs of CPCA versus ROBPCA, and RAPCA versus ROBPCA, for the normally distributed data with $\varepsilon = 10\%$ contamination and $f_2 = 1$. Figures 1.12(a) and 1.12(b) show the results for the low-dimensional data, whereas Figures 1.12(c) and 1.12(d) present the results for the high-dimensional ones. When we compare CPCA and ROBPCA, we see that the MSE of the first CPCA eigenvalue increases strongly when the contamination is shifted further away from the regular points. Also the MSEs of the other CPCA eigenvalues are much larger than those of ROBPCA. Only $\mathrm{MSE}(\hat{\lambda}_2)$ and $\mathrm{MSE}(\hat{\lambda}_3)$ in Figure 1.12(c) are of the same order of magnitude. Figures 1.12(b) and 1.12(d) show the superiority of ROBPCA over RAPCA. For high-dimensional data the differences are most prominent in the fifth eigenvalue.


Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

ECE 661: Homework 10 Fall 2014

ECE 661: Homework 10 Fall 2014 ECE 661: Homework 10 Fall 2014 This homework consists of the following two parts: (1) Face recognition with PCA and LDA for dimensionality reduction and the nearest-neighborhood rule for classification;

More information

Identification of Multivariate Outliers: A Performance Study

Identification of Multivariate Outliers: A Performance Study AUSTRIAN JOURNAL OF STATISTICS Volume 34 (2005), Number 2, 127 138 Identification of Multivariate Outliers: A Performance Study Peter Filzmoser Vienna University of Technology, Austria Abstract: Three

More information

Robust Exponential Smoothing of Multivariate Time Series

Robust Exponential Smoothing of Multivariate Time Series Robust Exponential Smoothing of Multivariate Time Series Christophe Croux,a, Sarah Gelper b, Koen Mahieu a a Faculty of Business and Economics, K.U.Leuven, Naamsestraat 69, 3000 Leuven, Belgium b Erasmus

More information

TITLE : Robust Control Charts for Monitoring Process Mean of. Phase-I Multivariate Individual Observations AUTHORS : Asokan Mulayath Variyath.

TITLE : Robust Control Charts for Monitoring Process Mean of. Phase-I Multivariate Individual Observations AUTHORS : Asokan Mulayath Variyath. TITLE : Robust Control Charts for Monitoring Process Mean of Phase-I Multivariate Individual Observations AUTHORS : Asokan Mulayath Variyath Department of Mathematics and Statistics, Memorial University

More information

Introduction to Robust Statistics. Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy

Introduction to Robust Statistics. Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy Introduction to Robust Statistics Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy Multivariate analysis Multivariate location and scatter Data where the observations

More information

Small sample corrections for LTS and MCD

Small sample corrections for LTS and MCD Metrika (2002) 55: 111 123 > Springer-Verlag 2002 Small sample corrections for LTS and MCD G. Pison, S. Van Aelst*, and G. Willems Department of Mathematics and Computer Science, Universitaire Instelling

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Accurate and Powerful Multivariate Outlier Detection

Accurate and Powerful Multivariate Outlier Detection Int. Statistical Inst.: Proc. 58th World Statistical Congress, 11, Dublin (Session CPS66) p.568 Accurate and Powerful Multivariate Outlier Detection Cerioli, Andrea Università di Parma, Dipartimento di

More information

Short Answer Questions: Answer on your separate blank paper. Points are given in parentheses.

Short Answer Questions: Answer on your separate blank paper. Points are given in parentheses. ISQS 6348 Final exam solutions. Name: Open book and notes, but no electronic devices. Answer short answer questions on separate blank paper. Answer multiple choice on this exam sheet. Put your name on

More information

Expression Data Exploration: Association, Patterns, Factors & Regression Modelling

Expression Data Exploration: Association, Patterns, Factors & Regression Modelling Expression Data Exploration: Association, Patterns, Factors & Regression Modelling Exploring gene expression data Scale factors, median chip correlation on gene subsets for crude data quality investigation

More information

Explaining Correlations by Plotting Orthogonal Contrasts

Explaining Correlations by Plotting Orthogonal Contrasts Explaining Correlations by Plotting Orthogonal Contrasts Øyvind Langsrud MATFORSK, Norwegian Food Research Institute. www.matforsk.no/ola/ To appear in The American Statistician www.amstat.org/publications/tas/

More information

Robust methods for multivariate data analysis

Robust methods for multivariate data analysis JOURNAL OF CHEMOMETRICS J. Chemometrics 2005; 19: 549 563 Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cem.962 Robust methods for multivariate data analysis S. Frosch

More information

Chapter 4: Factor Analysis

Chapter 4: Factor Analysis Chapter 4: Factor Analysis In many studies, we may not be able to measure directly the variables of interest. We can merely collect data on other variables which may be related to the variables of interest.

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Anders Øland David Christiansen 1 Introduction Principal Component Analysis, or PCA, is a commonly used multi-purpose technique in data analysis. It can be used for feature

More information

Machine Learning 11. week

Machine Learning 11. week Machine Learning 11. week Feature Extraction-Selection Dimension reduction PCA LDA 1 Feature Extraction Any problem can be solved by machine learning methods in case of that the system must be appropriately

More information

Inferential Analysis with NIR and Chemometrics

Inferential Analysis with NIR and Chemometrics Inferential Analysis with NIR and Chemometrics Santanu Talukdar Manager, Engineering Services Part 2 NIR Spectroscopic Data with Chemometrics A Tutorial Presentation Part 2 Page.2 References This tutorial

More information

Outlier Detection via Feature Selection Algorithms in

Outlier Detection via Feature Selection Algorithms in Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session CPS032) p.4638 Outlier Detection via Feature Selection Algorithms in Covariance Estimation Menjoge, Rajiv S. M.I.T.,

More information

Robust control charts for time series data

Robust control charts for time series data Robust control charts for time series data Christophe Croux K.U. Leuven & Tilburg University Sarah Gelper Erasmus University Rotterdam Koen Mahieu K.U. Leuven Abstract This article presents a control chart

More information

Independent Component (IC) Models: New Extensions of the Multinormal Model

Independent Component (IC) Models: New Extensions of the Multinormal Model Independent Component (IC) Models: New Extensions of the Multinormal Model Davy Paindaveine (joint with Klaus Nordhausen, Hannu Oja, and Sara Taskinen) School of Public Health, ULB, April 2008 My research

More information

UPSET AND SENSOR FAILURE DETECTION IN MULTIVARIATE PROCESSES

UPSET AND SENSOR FAILURE DETECTION IN MULTIVARIATE PROCESSES UPSET AND SENSOR FAILURE DETECTION IN MULTIVARIATE PROCESSES Barry M. Wise, N. Lawrence Ricker and David J. Veltkamp Center for Process Analytical Chemistry and Department of Chemical Engineering University

More information

Fast and robust bootstrap for LTS

Fast and robust bootstrap for LTS Fast and robust bootstrap for LTS Gert Willems a,, Stefan Van Aelst b a Department of Mathematics and Computer Science, University of Antwerp, Middelheimlaan 1, B-2020 Antwerp, Belgium b Department of

More information

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION In this lab you will first learn how to display the relationship between two quantitative variables with a scatterplot and also how to measure the strength of

More information

arxiv: v1 [math.st] 11 Jun 2018

arxiv: v1 [math.st] 11 Jun 2018 Robust test statistics for the two-way MANOVA based on the minimum covariance determinant estimator Bernhard Spangl a, arxiv:1806.04106v1 [math.st] 11 Jun 2018 a Institute of Applied Statistics and Computing,

More information

Outlier detection for skewed data

Outlier detection for skewed data Outlier detection for skewed data Mia Hubert 1 and Stephan Van der Veeken December 7, 27 Abstract Most outlier detection rules for multivariate data are based on the assumption of elliptical symmetry of

More information

Modeling Classes of Shapes Suppose you have a class of shapes with a range of variations: System 2 Overview

Modeling Classes of Shapes Suppose you have a class of shapes with a range of variations: System 2 Overview 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 4 4 4 6 Modeling Classes of Shapes Suppose you have a class of shapes with a range of variations: System processes System Overview Previous Systems:

More information

Generalized Least Squares for Calibration Transfer. Barry M. Wise, Harald Martens and Martin Høy Eigenvector Research, Inc.

Generalized Least Squares for Calibration Transfer. Barry M. Wise, Harald Martens and Martin Høy Eigenvector Research, Inc. Generalized Least Squares for Calibration Transfer Barry M. Wise, Harald Martens and Martin Høy Eigenvector Research, Inc. Manson, WA 1 Outline The calibration transfer problem Instrument differences,

More information

International Journal of Pure and Applied Mathematics Volume 19 No , A NOTE ON BETWEEN-GROUP PCA

International Journal of Pure and Applied Mathematics Volume 19 No , A NOTE ON BETWEEN-GROUP PCA International Journal of Pure and Applied Mathematics Volume 19 No. 3 2005, 359-366 A NOTE ON BETWEEN-GROUP PCA Anne-Laure Boulesteix Department of Statistics University of Munich Akademiestrasse 1, Munich,

More information

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING Vishwanath Mantha Department for Electrical and Computer Engineering Mississippi State University, Mississippi State, MS 39762 mantha@isip.msstate.edu ABSTRACT

More information

1 A factor can be considered to be an underlying latent variable: (a) on which people differ. (b) that is explained by unknown variables

1 A factor can be considered to be an underlying latent variable: (a) on which people differ. (b) that is explained by unknown variables 1 A factor can be considered to be an underlying latent variable: (a) on which people differ (b) that is explained by unknown variables (c) that cannot be defined (d) that is influenced by observed variables

More information

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage

More information

Chemometrics. 1. Find an important subset of the original variables.

Chemometrics. 1. Find an important subset of the original variables. Chemistry 311 2003-01-13 1 Chemometrics Chemometrics: Mathematical, statistical, graphical or symbolic methods to improve the understanding of chemical information. or The science of relating measurements

More information

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

More information

7. Variable extraction and dimensionality reduction

7. Variable extraction and dimensionality reduction 7. Variable extraction and dimensionality reduction The goal of the variable selection in the preceding chapter was to find least useful variables so that it would be possible to reduce the dimensionality

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

Motivating the Covariance Matrix

Motivating the Covariance Matrix Motivating the Covariance Matrix Raúl Rojas Computer Science Department Freie Universität Berlin January 2009 Abstract This note reviews some interesting properties of the covariance matrix and its role

More information

MULTICOLLINEARITY DIAGNOSTIC MEASURES BASED ON MINIMUM COVARIANCE DETERMINATION APPROACH

MULTICOLLINEARITY DIAGNOSTIC MEASURES BASED ON MINIMUM COVARIANCE DETERMINATION APPROACH Professor Habshah MIDI, PhD Department of Mathematics, Faculty of Science / Laboratory of Computational Statistics and Operations Research, Institute for Mathematical Research University Putra, Malaysia

More information

Fast and Robust Classifiers Adjusted for Skewness

Fast and Robust Classifiers Adjusted for Skewness Fast and Robust Classifiers Adjusted for Skewness Mia Hubert 1 and Stephan Van der Veeken 2 1 Department of Mathematics - LStat, Katholieke Universiteit Leuven Celestijnenlaan 200B, Leuven, Belgium, Mia.Hubert@wis.kuleuven.be

More information

Vector Space Models. wine_spectral.r

Vector Space Models. wine_spectral.r Vector Space Models 137 wine_spectral.r Latent Semantic Analysis Problem with words Even a small vocabulary as in wine example is challenging LSA Reduce number of columns of DTM by principal components

More information

Improved Feasible Solution Algorithms for. High Breakdown Estimation. Douglas M. Hawkins. David J. Olive. Department of Applied Statistics

Improved Feasible Solution Algorithms for. High Breakdown Estimation. Douglas M. Hawkins. David J. Olive. Department of Applied Statistics Improved Feasible Solution Algorithms for High Breakdown Estimation Douglas M. Hawkins David J. Olive Department of Applied Statistics University of Minnesota St Paul, MN 55108 Abstract High breakdown

More information

Evaluation of robust PCA for supervised audio outlier detection

Evaluation of robust PCA for supervised audio outlier detection Institut f. Stochastik und Wirtschaftsmathematik 1040 Wien, Wiedner Hauptstr. 8-10/105 AUSTRIA http://www.isw.tuwien.ac.at Evaluation of robust PCA for supervised audio outlier detection S. Brodinova,

More information

Basics of Multivariate Modelling and Data Analysis

Basics of Multivariate Modelling and Data Analysis Basics of Multivariate Modelling and Data Analysis Kurt-Erik Häggblom 2. Overview of multivariate techniques 2.1 Different approaches to multivariate data analysis 2.2 Classification of multivariate techniques

More information

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline. Practitioner Course: Portfolio Optimization September 10, 2008 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y ) (x,

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Minimum Regularized Covariance Determinant Estimator

Minimum Regularized Covariance Determinant Estimator Minimum Regularized Covariance Determinant Estimator Honey, we shrunk the data and the covariance matrix Kris Boudt (joint with: P. Rousseeuw, S. Vanduffel and T. Verdonck) Vrije Universiteit Brussel/Amsterdam

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

arxiv: v3 [stat.me] 2 Feb 2018 Abstract

arxiv: v3 [stat.me] 2 Feb 2018 Abstract ICS for Multivariate Outlier Detection with Application to Quality Control Aurore Archimbaud a, Klaus Nordhausen b, Anne Ruiz-Gazen a, a Toulouse School of Economics, University of Toulouse 1 Capitole,

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides

Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Intelligent Data Analysis and Probabilistic Inference Lecture

More information

5. Discriminant analysis

5. Discriminant analysis 5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density

More information

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Introduction Edps/Psych/Stat/ 584 Applied Multivariate Statistics Carolyn J Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN c Board of Trustees,

More information

MULTIVARIATE ANALYSIS OF VARIANCE

MULTIVARIATE ANALYSIS OF VARIANCE MULTIVARIATE ANALYSIS OF VARIANCE RAJENDER PARSAD AND L.M. BHAR Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 0 0 lmb@iasri.res.in. Introduction In many agricultural experiments,

More information

Re-weighted Robust Control Charts for Individual Observations

Re-weighted Robust Control Charts for Individual Observations Universiti Tunku Abdul Rahman, Kuala Lumpur, Malaysia 426 Re-weighted Robust Control Charts for Individual Observations Mandana Mohammadi 1, Habshah Midi 1,2 and Jayanthi Arasan 1,2 1 Laboratory of Applied

More information

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ). .8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics

More information

DIMENSION REDUCTION AND CLUSTER ANALYSIS

DIMENSION REDUCTION AND CLUSTER ANALYSIS DIMENSION REDUCTION AND CLUSTER ANALYSIS EECS 833, 6 March 2006 Geoff Bohling Assistant Scientist Kansas Geological Survey geoff@kgs.ku.edu 864-2093 Overheads and resources available at http://people.ku.edu/~gbohling/eecs833

More information

Journal of Statistical Software

Journal of Statistical Software JSS Journal of Statistical Software April 2013, Volume 53, Issue 3. http://www.jstatsoft.org/ Fast and Robust Bootstrap for Multivariate Inference: The R Package FRB Stefan Van Aelst Ghent University Gert

More information

Assessing the relation between language comprehension and performance in general chemistry. Appendices

Assessing the relation between language comprehension and performance in general chemistry. Appendices Assessing the relation between language comprehension and performance in general chemistry Daniel T. Pyburn a, Samuel Pazicni* a, Victor A. Benassi b, and Elizabeth E. Tappin c a Department of Chemistry,

More information

ReducedPCR/PLSRmodelsbysubspaceprojections

ReducedPCR/PLSRmodelsbysubspaceprojections ReducedPCR/PLSRmodelsbysubspaceprojections Rolf Ergon Telemark University College P.O.Box 2, N-9 Porsgrunn, Norway e-mail: rolf.ergon@hit.no Published in Chemometrics and Intelligent Laboratory Systems

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information