Robust Classification for Skewed Data


Advances in Data Analysis and Classification manuscript No. (will be inserted by the editor)

Robust Classification for Skewed Data

Mia Hubert and Stephan Van der Veeken

Received: date / Accepted: date

Abstract In this paper we propose a robust classification rule for skewed unimodal distributions. For low-dimensional data, the classifier is based on minimizing the adjusted outlyingness to each group. For high-dimensional data, the robustified SIMCA method is adjusted for skewness. The robustness of the methods is investigated through several simulations and by applying them to some real data sets.

Keywords Robustness · Classification · Skewness

1 Introduction

Given a random sample from each of k populations, constructing a rule to classify a new observation into one of the k populations is a widely studied problem, known as classification, discriminant analysis or supervised learning. Many of the proposed classification methods rely on quite strict distributional assumptions such as multivariate normality, or at least elliptical symmetry. Examples include the most popular ones, such as Fisher's linear discriminant rule (LDA) and classical quadratic discriminant analysis (CQDA) (Johnson and Wichern, 1998). A second drawback is that these classical procedures are sensitive to outliers. Therefore, over the last years, several robustifications have been proposed. One group of methods relies on robust estimators of location and scatter, and consequently implicitly assumes elliptical symmetry (Hubert and Van Driessen, 2004; He and Fung, 2000; Croux and Dehon, 2001).

We acknowledge financial support by the GOA/07/04 project of the Research Fund K.U.Leuven and by the IAP research network no. P6/03 of the Federal Science Policy, Belgium.

M. Hubert (corresponding author)
Department of Mathematics - LStat, Katholieke Universiteit Leuven, Celestijnenlaan 200B, BE-3001 Heverlee, Belgium. E-mail: mia.hubert@wis.kuleuven.be

S. Van der Veeken
Department of Mathematics - LStat, Katholieke Universiteit Leuven, Celestijnenlaan 200B, BE-3001 Heverlee, Belgium

Proposals for skewed data are hardly available. An interesting approach is based on the concept of depth and has been proposed in Ghosh and Chaudhuri (2005). A new observation is classified into the population for which it attains maximal depth. The most common robust depth functions include Tukey's halfspace depth (Tukey, 1975) and Liu's simplicial depth (Liu, 1990). The resulting classifiers achieve a good classification performance (especially when outliers are present), but they have the drawback that data outside the convex hull of every group have zero depth, in which case the classification becomes ambiguous. Moreover, they are hard to compute. For example, the computation of the Tukey depth has $O(n^{p-1}\log n)$ time complexity (Rousseeuw and Struyf, 1998). For simplicial depth, an $O(n\log n)$ algorithm exists in $p = 2$ dimensions (Rousseeuw and Ruts, 1996), an $O(n^2)$ algorithm in $p = 3$ dimensions and an $O(n^4)$ algorithm in $p = 4$ dimensions (Cheng and Ouyang, 2001). For higher-dimensional spaces, there are no known algorithms faster than $O(n^{p+1})$.

In this paper we propose a classification rule based on the adjusted outlyingness, which is related to projection depth (Zuo and Serfling, 2000). Hence, it can be seen as an extension of the projection depth classifiers studied in Dutta and Ghosh (2009) for elliptical data. In low dimensions this leads to a robust classifier which can be computed more easily. Moreover, for high-dimensional data it also allows for a generalization of the Soft Independent Modelling by Class Analogy (SIMCA) method (Wold, 1976). To construct our new method, we combine robust PCA for skewed data (Hubert et al., 2009) with the RSIMCA method for elliptical data (Vanden Branden and Hubert, 2005).

In Section 2 we focus on the low-dimensional case. We introduce the new classification rule, report simulation results and apply our method to a real data set. Section 3 contains the methodology and the results for high-dimensional data, whereas Section 4 concludes.

2 Low dimensional data

2.1 Construction of the classification rule

Outlyingness

We assume we have sampled observations from $k$ different classes $X^j$, $j = 1, \dots, k$. The data belonging to group $X^j$ are denoted by $x_i^j$ for $i = 1, \dots, n_j$. The dimension of the data space is $p$ and is assumed to be much smaller than the sample sizes. More comments on the dimension are given at the end of this section.

Our new classification rule is defined as follows: for each new observation $y$ to be classified, we first compute its adjusted outlyingness with respect to each group $X^j$. Then we assign $y$ to the group for which its adjusted outlyingness is minimal.

The adjusted outlyingness was introduced in Brys et al. (2005) and studied in detail in Hubert and Van der Veeken (2008). It generalizes the Stahel-Donoho outlyingness towards skewed data. We first describe the concept of outlyingness for an observation $x_i^j$ coming from the class $X^j = (x_1^j, \dots, x_{n_j}^j)^t$. For univariate data $x_i^j$, the Stahel-Donoho outlyingness is defined as (Stahel, 1981; Donoho, 1982):

$$ \mathrm{SDO}^{(1)}(x_i^j, X^j) = \frac{|x_i^j - \mathrm{med}(X^j)|}{s(X^j)} $$

where $\mathrm{med}(X^j)$ is the median of the data and $s(X^j)$ is a robust scale estimate such as the MAD. For skewed distributions, a different scale is used on both sides of the median. First we measure the amount of skewness by means of the medcouple (MC), a robust measure of skewness (Brys et al., 2004). For the $j$th (univariate) group, it is defined as

$$ \mathrm{MC}(X^j) = \operatorname*{med}_{x_i^j < \mathrm{med}(X^j) < x_l^j} h(x_i^j, x_l^j) \qquad (1) $$

with $\mathrm{med}(X^j)$ the sample median, and

$$ h(x_i^j, x_l^j) = \frac{(x_l^j - \mathrm{med}(X^j)) - (\mathrm{med}(X^j) - x_i^j)}{x_l^j - x_i^j}. \qquad (2) $$

If $\mathrm{MC}(X^j) \geq 0$, the adjusted outlyingness is defined as:

$$ \mathrm{AO}^{(1)}(x_i^j, X^j) = \begin{cases} \dfrac{x_i^j - \mathrm{med}(X^j)}{c_2 - \mathrm{med}(X^j)} & \text{if } x_i^j > \mathrm{med}(X^j) \\[2ex] \dfrac{\mathrm{med}(X^j) - x_i^j}{\mathrm{med}(X^j) - c_1} & \text{if } x_i^j < \mathrm{med}(X^j) \end{cases} $$

where $c_1$ corresponds to the smallest observation greater than $Q_1 - 1.5\,e^{-4\,\mathrm{MC}}\,\mathrm{IQR}$ and $c_2$ to the largest observation smaller than $Q_3 + 1.5\,e^{3\,\mathrm{MC}}\,\mathrm{IQR}$. The notations $Q_1$ and $Q_3$ stand for the first and third quartile of the data, and $\mathrm{IQR} = Q_3 - Q_1$ is the interquartile range. If $\mathrm{MC}(X^j) < 0$, the adjusted outlyingness is computed on the inverted data $-X^j$.

In order to define the adjusted outlyingness of a multivariate data point $x_i^j$, the data are projected on all possible directions $a$ and the $\mathrm{AO}^{(1)}$ is computed. The overall $\mathrm{AO}_i^j$ is then defined as the supremum over all univariate AO's:

$$ \mathrm{AO}_i^j = \mathrm{AO}(x_i^j, X^j) = \sup_{a \in \mathbb{R}^p} \mathrm{AO}^{(1)}(a^t x_i^j, X^j a). \qquad (3) $$

As the AO is based on robust measures of location, scale and skewness, it is resistant to outliers. In theory, a resistance of up to 25% of outliers can be achieved, although we noticed in practice that the medcouple might become quite sensitive to contamination of more than 10%. Moreover, it can be shown that the influence function of the univariate $\mathrm{AO}^{(1)}$ is bounded (Hubert and Van der Veeken, 2008).

Since in practice it is impossible to consider all possible directions $a$, we compute the $\mathrm{AO}^{(1)}$ in $m = 250p$ directions. This yields an overall computation time of $O(mn\log n)$, using the fast $O(n\log n)$ algorithm to compute the MC (Brys et al., 2004). Random directions are generated as the direction perpendicular to the subspace spanned by $p$ observations, randomly drawn from the data set. As such, the AO is invariant to affine transformations of the data. Note that this procedure can only be applied in our classification setting when $p < n_j$, and when the dimension $p$ is not too large (say $p < 10$). Otherwise taking $250p$ directions is insufficient and more directions are required to achieve good estimates. We do not consider this an important drawback of our rule, as it is well known that skewness is only an issue in small dimensions. As the dimensionality increases, the data are more and more concentrated in an outer shell of the distribution; see for example Hastie et al. (2001). Of course, the algorithm can easily be adapted to search over more than $m$ directions, but this will come at the cost of more computation time.
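For concreteness, the computation of the AO along these lines can be sketched in a few lines of Python. This is only an illustration, not the authors' LIBRA implementation: the function names are ours, the medcouple is computed with a naive $O(n^2)$ double loop instead of the $O(n\log n)$ algorithm of Brys et al. (2004), and the random directions are taken orthogonal to hyperplanes through $p$ randomly drawn observations as described above.

```python
import numpy as np

def medcouple(x):
    """Naive O(n^2) medcouple of Brys et al. (2004); assumes no ties with the median."""
    x = np.sort(np.asarray(x, dtype=float))
    med = np.median(x)
    lower, upper = x[x < med], x[x > med]
    if lower.size == 0 or upper.size == 0:
        return 0.0
    xi, xl = lower[:, None], upper[None, :]
    h = ((xl - med) - (med - xi)) / (xl - xi)       # kernel h of equation (2)
    return float(np.median(h))

def adjusted_outlyingness_1d(y, x):
    """AO^(1) of the points y with respect to the univariate sample x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mc = medcouple(x)
    if mc < 0:                                      # mirror the data so that MC >= 0
        return adjusted_outlyingness_1d(-y, -x)
    med = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    c1 = x[x >= q1 - 1.5 * np.exp(-4 * mc) * iqr].min()   # smallest obs above lower fence
    c2 = x[x <= q3 + 1.5 * np.exp(3 * mc) * iqr].max()    # largest obs below upper fence
    return np.where(y > med, (y - med) / (c2 - med), (med - y) / (med - c1))

def random_directions(X, m, rng):
    """m directions, each orthogonal to the hyperplane through p randomly drawn points."""
    n, p = X.shape
    dirs = np.empty((m, p))
    for k in range(m):
        pts = X[rng.choice(n, size=p, replace=False)]
        _, _, vt = np.linalg.svd(pts[1:] - pts[0])  # null-space direction of their span
        dirs[k] = vt[-1]
    return dirs

def adjusted_outlyingness(Y, X, m=None, rng=None):
    """AO (equation (3)) of the rows of Y with respect to the data set X."""
    rng = np.random.default_rng(0) if rng is None else rng
    m = 250 * X.shape[1] if m is None else m        # 250p directions, as in the text
    ao = np.zeros(len(Y))
    for a in random_directions(X, m, rng):
        ao = np.maximum(ao, adjusted_outlyingness_1d(Y @ a, X @ a))
    return ao
```

In the classification rule described next, this function is evaluated with Y equal to the new observations and X equal to (a cleaned version of) each training group.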

New robust classification method

To apply our new classification rule, we now have to define the outlyingness of a new observation $y$ with respect to each group $X^j$. One approach would be to compute this outlyingness $\mathrm{AO}(y, \check{X}^j)$ according to (3), with $\check{X}^j$ the augmented data set $\check{X}^j = (x_1^j, \dots, x_{n_j}^j, y)^t$. This would of course become computationally very demanding when many new observations need to be classified, as the augmented data set is then modified for each new observation and the median, the IQR and the medcouple have to be recomputed each time on every projection. Hence, we compute the outlyingness of $y$ with respect to a fixed data set which does not include $y$. A natural candidate is of course $X^j$ itself. However, we obtain better results when we first remove the outliers from $X^j$.

As explained in Hubert and Van der Veeken (2008), this can easily be done by first computing the adjusted outlyingness $\mathrm{AO}_i^j$ of all observations in group $X^j$. Then the univariate outlyingness of every $\mathrm{AO}_i^j$ ($i = 1, \dots, n_j$) is computed. Formally, we define the outlier score $\mathrm{OS}_i = \mathrm{AO}^{(1)}(\mathrm{AO}_i^j, \{\mathrm{AO}_i^j\})$. Observations with a too large outlyingness are then defined as those $x_i^j$ for which $\mathrm{AO}_i^j > \mathrm{med}(\mathrm{AO}_i^j)$ and $\mathrm{OS}_i > 1$ (or, equivalently, with $\mathrm{AO}_i^j > c_2$). Those observations are removed from $X^j$, yielding the reduced data set $\tilde{X}^j$. To compute the outlyingness of a new case $y$, we then consider $\mathrm{AO}(y, \tilde{X}^j)$, so we fix the median, IQR and medcouple of the projected outlier-free data from group $j$. The observation $y$ is then assigned to the group for which its adjusted outlyingness is minimal. (In case of ties, which occur with probability zero, we randomly assign the observation to one of the groups involved.)
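A minimal sketch of the resulting rule, reusing the medcouple and adjusted_outlyingness helpers from the previous sketch (again illustrative, not the LIBRA code): each training group is first cleaned by removing the observations whose AO exceeds the cutoff $c_2$ of the group's own AO values, and a new observation is then assigned to the group for which its AO with respect to the cleaned data is smallest.

```python
import numpy as np
# assumes medcouple() and adjusted_outlyingness() from the previous sketch

def trim_outliers(X, rng=None):
    """Remove the observations of X whose AO lies above the adjusted-boxplot cutoff."""
    ao = adjusted_outlyingness(X, X, rng=rng)
    q1, q3 = np.percentile(ao, [25, 75])
    cutoff = q3 + 1.5 * np.exp(3 * medcouple(ao)) * (q3 - q1)
    return X[ao <= cutoff]          # equivalent to keeping the points with AO_i <= c_2

def classify_min_ao(Y, groups, rng=None):
    """Assign each row of Y to the group with minimal adjusted outlyingness."""
    cleaned = [trim_outliers(X, rng=rng) for X in groups]
    # for clarity the per-direction medians, IQRs and medcouples are recomputed here;
    # in practice they would be stored once per (trimmed) group and reused
    scores = np.column_stack([adjusted_outlyingness(Y, X, rng=rng) for X in cleaned])
    return np.argmin(scores, axis=1)
```

For a two-group problem, classify_min_ao(Y_test, [X1, X2]) returns labels 0 and 1.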

2.2 Simulation results

In order to illustrate the classification improvement obtained by accounting for skewness, we performed a simulation study. In all simulation settings, clean data were generated from a multivariate skew-normal distribution (Azzalini and Dalla Valle, 1996). A $p$-dimensional random variable $Z$ is said to be standard skew-normal distributed, $Z \sim SN_p(\bar{\Omega}, \alpha)$, if its density function is of the form

$$ f(z) = 2\,\phi_p(z; \bar{\Omega})\,\Phi(\alpha^t z) \qquad (4) $$

where $\phi_p(z; \bar{\Omega})$ is the $p$-dimensional normal density with zero mean and correlation matrix $\bar{\Omega}$, $\Phi$ is the c.d.f. of the standard normal distribution and $\alpha$ is a $p$-dimensional vector that regulates the skewness. The mean vector and the covariance matrix are given by

$$ \mu_z = \sqrt{\tfrac{2}{\pi}}\,\delta, \qquad \mathrm{cov}(Z) = \bar{\Omega} - \mu_z \mu_z^t \qquad (5) $$

where

$$ \delta = \frac{\bar{\Omega}\alpha}{\sqrt{1 + \alpha^t \bar{\Omega} \alpha}}. \qquad (6) $$

By adding a location vector $\mu$ and a scale matrix $W = \mathrm{diag}(\omega_1, \dots, \omega_p)$, with all $\omega_j > 0$, we obtain the skew-normal random variable $X = \mu + W Z \sim SN_p(\mu, \Omega, \alpha)$ with $\Omega = W \bar{\Omega} W$. Using the notation $0_p = (0, 0, \dots, 0)^t \in \mathbb{R}^p$, note that if $\alpha = 0_p$ the skew-normal density reduces to the normal density. We also consider contaminated training data, in which $\varepsilon = 5\%$ or $\varepsilon = 10\%$ of the observations come from a normal distribution.

First we consider the two-class problem ($k = 2$). With $1_p = (1, 1, \dots, 1)^t \in \mathbb{R}^p$, the different simulation settings can be described as follows:

1. (a) $p = 2$, $n_1 = n_2 = 25$, $X^1 \sim SN_2(0_2, I_2, \cdot)$, $X^2 \sim SN_2(2\cdot 1_2, I_2, \cdot)$
(b) Outliers: $\varepsilon = 10\%$ observations from $X^1 \sim N_2(\cdot, 0.2\, I_2)$
2. (a) $p = 3$, $n_1 = n_2 = 5$, $X^1 \sim SN_3(0_3, I_3, \cdot)$, $X^2 \sim SN_3(2\cdot 1_3, I_3, \cdot)$
(b) Outliers: $\varepsilon = 10\%$ observations from $X^1 \sim N_3(\cdot, 0.2\, I_3)$
3. (a) $p = 5$, $n_1 = n_2 = 5$, $X^1 \sim SN_5(0_5, I_5, \cdot)$, $X^2 \sim SN_5(1_5, I_5, \cdot)$
(b) Outliers: $\varepsilon = 10\%$ observations from $X^1 \sim N_5(1_5, 0.1\, I_5)$

From each population we randomly generate $n_j$ training observations and $n_j/5$ test data. On the training data we apply five classifiers:

- Classical Quadratic Discriminant Analysis (CQDA);
- Robust Linear Discriminant Analysis (RLDA) by means of the MCD (Hubert and Van Driessen, 2004);
- Robust Quadratic Discriminant Analysis (RQDA) by means of the MCD (Hubert and Van Driessen, 2004);
- our new classifier based on the Adjusted Outlyingness (AO);
- Least Squares Support Vector Machines (LS-SVM) classification (Suykens et al., 2002) with a radial basis function (RBF) kernel.

The first three programs are available in the Matlab library LIBRA (Verboven and Hubert, 2005), whereas the LS-SVM classifier is computed with LS-SVMlab (Suykens et al., 2002). We have added the LS-SVM classifier since it is a popular classifier for many complex nonlinear problems. Because of the computational complexity of halfspace depth and simplicial depth, and as they do not provide a classification in case of ties (equal depth to several groups), we did not include these maximum depth-based classifiers.

The results of the simulations are summarized in terms of average misclassification errors. The misclassification error is defined as the overall proportion of wrongly classified observations in the test sets. The results listed in Table 1 are average misclassification errors with their respective standard errors over the simulation runs. In all situations we see that our new method outperforms the classical and robust methods that are based on elliptical symmetry. The performance is also slightly better than that of the LS-SVM classifier. The main difference between these two approaches, however, lies in the computation time. Whereas our method based on the adjusted outlyingness only requires choosing the number of directions $a$ in (3) for computational reasons, the LS-SVM method requires the choice of two hyperparameters (the width of the RBF kernel and the regularization constant). If a 10-fold cross-validation approach is followed, as implemented in LS-SVMlab, performing classification problem (2a) with LS-SVM takes considerably longer, whereas the AO classifier only needs 7.4 seconds when $250p = 750$ directions are considered. Note that, to decrease the large computation time of LS-SVM in our simulations, we first estimated the hyperparameters on a number of training data sets for each simulation setting and computed their average value. We then performed the replications with these fixed values for the hyperparameters. We checked for setting (1a) that this procedure did not give a significantly different result (.9 (s.e. .1) with fixed values for the hyperparameters versus .89 (s.e. .1) with re-estimated hyperparameters).
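The skew-normal training data in the settings above can be generated from the stochastic representation underlying (4)-(6) (Azzalini and Dalla Valle, 1996): if $(X_0, Z)$ is $(p+1)$-variate normal with standardized margins, correlation matrix $\bar{\Omega}$ for $Z$ and correlations $\delta$ from (6) between $X_0$ and $Z$, then $Z$ conditional on $X_0 > 0$ follows $SN_p(\bar{\Omega}, \alpha)$, and the conditioning can be realized by a sign flip. A minimal sketch (illustrative function name, not the code used for the simulations):

```python
import numpy as np

def rskewnormal(n, mu, omega, Omega_bar, alpha, rng=None):
    """Draw n samples from SN_p(mu, Omega, alpha) with Omega = W Omega_bar W and
    W = diag(omega), via the representation of Azzalini and Dalla Valle (1996)."""
    rng = np.random.default_rng() if rng is None else rng
    mu, omega = np.asarray(mu, float), np.asarray(omega, float)
    alpha, Omega_bar = np.asarray(alpha, float), np.asarray(Omega_bar, float)
    p = mu.size
    delta = Omega_bar @ alpha / np.sqrt(1.0 + alpha @ Omega_bar @ alpha)   # equation (6)
    big = np.block([[np.ones((1, 1)), delta[None, :]],
                    [delta[:, None],  Omega_bar]])        # correlation matrix of (X0, Z)
    draws = rng.multivariate_normal(np.zeros(p + 1), big, size=n)
    x0, z = draws[:, 0], draws[:, 1:]
    z = np.where(x0[:, None] > 0, z, -z)                  # condition on X0 > 0 by sign flip
    return mu + z * omega                                  # X = mu + W Z
```

Setting 1(a), for example, corresponds to calls of the form rskewnormal(n, np.zeros(2), np.ones(2), np.eye(2), alpha) and rskewnormal(n, 2 * np.ones(2), np.ones(2), np.eye(2), alpha) with the skewness vectors of that setting, after which the $\varepsilon$-fraction of outliers is replaced by normal draws.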

Table 1 Simulation results for the two-class problem on asymmetric distributions: average misclassification errors (with standard errors in parentheses) of CQDA, RLDA, RQDA, AO and LS-SVM for the 2D, 3D and 5D settings, with $\varepsilon = 0\%$ and $\varepsilon = 10\%$ contamination.

We also simulated some symmetric distributions. For this, we used the following settings:

4. (a) $p = 2$, $n_1 = n_2 = 25$, $X^1 \sim N_2(0_2, I_2)$, $X^2 \sim N_2(2\cdot 1_2, I_2)$
(b) Outliers: $\varepsilon = 10\%$ observations from $X^1 \sim N_2(1_2, 0.2\, I_2)$
5. (a) $p = 3$, $n_1 = n_2 = 5$, $X^1 \sim N_3(0_3, I_3)$, $X^2 \sim N_3(2\cdot 1_3, I_3)$
(b) Outliers: $\varepsilon = 10\%$ observations from $X^1 \sim N_3(1_3, 0.2\, I_3)$
6. (a) $p = 5$, $n_1 = n_2 = 5$, $X^1 \sim N_5(0_5, I_5)$, $X^2 \sim N_5(1_5, I_5)$
(b) Outliers: $\varepsilon = 10\%$ observations from $X^1 \sim N_5(5\cdot 1_5, 0.1\, I_5)$

The results listed in Table 2 are again the average misclassification errors with their respective standard errors. Here, we notice that the average misclassification error of our new method is somewhat higher than for the other robust methods, which are designed for elliptical distributions. But contrary to CQDA, we see that our AO method is not highly affected by outliers. Note that the optimal error rates are 0.0787 for the 2D setting, 0.0416 for the 3D setting and 0.1318 for the 5D setting. The results obtained by the robust methods RLDA and RQDA lie quite close to these optimal values.

Table 2 Simulation results for the two-class problem on symmetric distributions: average misclassification errors (with standard errors in parentheses) of CQDA, RLDA, RQDA, AO and LS-SVM for the 2D, 3D and 5D settings, with $\varepsilon = 0\%$ and $\varepsilon = 10\%$ contamination.

Table 3 lists the results for the three-class problem ($k = 3$). As LS-SVM cannot directly cope with more than two groups, we have not included it in this comparison. Here, the data from the first two groups were generated as before, and a third group was added. Outliers are now generated in each group.

7. (a) $p = 2$, $n_3 = 25$, $X^3 \sim SN_2((3, 1)^t, I_2, (5, 5)^t)$
(b) Outliers: 5% observations from $X^1 \sim N_2(-2\cdot 1_2, 0.2\, I_2)$, 5% of $X^2 \sim N_2((8, 0)^t, 0.2\, I_2)$ and 5% of $X^3 \sim N_2((0, 8)^t, 0.2\, I_2)$
(c) Idem as (b) but with 10% outliers in each group.
8. (a) $p = 3$, $n_3 = 5$, $X^3 \sim SN_3((3, 1, 3)^t, I_3, (5, 5, 5)^t)$

(b) Outliers: 5% observations from $X^1 \sim N_3(-2\cdot 1_3, 0.1\, I_3)$, 5% of $X^2 \sim N_3((6, 0, 1)^t, 0.1\, I_3)$ and 5% of $X^3 \sim N_3((0, 6, 1)^t, 0.1\, I_3)$
(c) Idem as (b) but with 10% outliers in each group.
9. (a) $p = 5$, $n_3 = 5$, $X^3 \sim SN_5(0.5\cdot 1_5, I_5, \cdot)$
(b) Outliers: 5% observations from $X^1 \sim N_5(\cdot, 0.1\, I_5)$, 5% of $X^2 \sim N_5(3\cdot 1_5, 0.1\, I_5)$ and 5% of $X^3 \sim N_5((4, 4, 4, 0, 0)^t, 0.1\, I_5)$
(c) Idem as (b) but with 10% outliers in each group.

Also here we can conclude that we achieve lower misclassification errors by adjusting for skewness. In Figure 1 we illustrate the different classifiers by plotting the data generated in setting 7(b). The white region marks the set of test data that will be classified into class 3, the light-colored region indicates the data that will be classified into class 1, whereas the dark-colored region visualizes the classification region of class 2. We see in Figure 1(a) that CQDA does not separate the regular observations of the three groups well, as the classification rules are attracted by the outliers. The robust methods RLDA in Figure 1(b) and RQDA in Figure 1(c) are clearly less influenced by the outlying cases, but still misclassify some observations that lie in between the groups. The AO classifier shown in Figure 1(d) yields very precise classification boundaries.

2.3 Example

The data used in this example come from the Belgian Household Survey of 2005. The Household Survey is a multi-purpose continuous survey carried out by the Social Survey Division of the Belgian National Institute of Statistics, which collects information on people living in private households in Belgium. The main aim of the survey is to collect data on a range of topics such as education, welfare, family structure and health. One of the most important direct applications of the survey is the computation of the evolution of purchasing power. Therefore, quite some data on income and expenditure are collected. We selected a subset of 11 variables from the data set.

Table 3 Simulation results for the three-class problem on asymmetric distributions: average misclassification errors (with standard errors in parentheses) of CQDA, RLDA, RQDA and AO for the 2D, 3D and 5D settings, with $\varepsilon = 0\%$, 5% and 10% contamination in each group.

The most important variable is the yearly income of the respondent. The other variables are all expenditures on different types of goods, such as health, food and leisure. It is generally known that the percentage of income spent on a specific type of good is a function of the yearly revenue of an individual. A good is called a luxury good if the relative amount of income spent on it is an increasing function of the income. If the function is constant, one deals with normal goods. In the case of inferior goods the function decreases. This means that the spending pattern changes with the yearly income of a person. In order to avoid correcting factors for family size, only single persons are considered. This group of single persons consists of 174 unemployed and 76 (at least partially) employed persons. With our analysis we try to determine whether a person is employed or not by looking at his or her spending pattern and income.

In this section we only consider the variables Income and Expenditure on durable consumer goods (which are luxury goods). Figure 2(a) is a scatterplot of the data indicating both groups. As the group of employed people is highly spread out, we have also plotted the lower left part of the data (Figure 2(b)), in which both groups can be better distinguished. We notice the skewness in both classes, as well as some overlap between the groups. Both groups are repeatedly split at random into a training set and a test set which contains 20% of the data. CQDA yields an average misclassification error of 0.1914 with a standard error of .16. Due to the fact that there are quite some outliers, RQDA improves this result to 0.1617 with a standard error of .17. Our skewness-adjusted robust classifier yields the smallest misclassification error, 0.1394, with a standard deviation of .85.

Fig. 1 Class acceptance regions for the three-class simulation data (case 7(b)) for the four classifiers: (a) Classical Quadratic Discriminant Analysis (CQDA), (b) Robust Linear Discriminant Analysis (RLDA), (c) Robust Quadratic Discriminant Analysis (RQDA), (d) Adjusted Outlyingness (AO).

3 High dimensional data

3.1 Construction of the classification rule

To classify high-dimensional data, the analysis often starts with applying a dimension reduction technique such as Principal Component Analysis (PCA) to the whole data set. Next, discriminant analysis for low-dimensional data is applied to the resulting PCA scores; see e.g. Hubert and Engelen (2004), where a robust PCA is followed by RQDA to classify mice receiving different cancer therapies based on high-dimensional NMR spectra. However, it is not always useful to perform one single PCA on the entire data set, as the optimal subspaces and the dimensionality of the different classes $X^j$ can be different.

Fig. 2 Scatterplot of expenditure on durable consumer goods versus income: (a) complete data set; (b) observations with a yearly income smaller than 50 000 euro and a yearly expenditure on durable goods smaller than 6 000 euro.

A way to deal with such classes of different dimensionality is the so-called Soft Independent Modelling by Class Analogy (SIMCA) method. The idea behind SIMCA (Wold, 1976) is to apply a (classical) PCA in each group, thereby retaining within each group a sufficient number of principal components to account for most of the variation. For each new observation $y$, we can then compute its projection onto the PCA subspace of group $j$, denoted by $\hat{y}^j$. Roughly speaking, the SIMCA classification rule is then obtained by combining the orthogonal (Euclidean) distance (OD) of $y$ to the $j$th class,

$$ \mathrm{OD}^j(y) = \| y - \hat{y}^j \| \qquad (7) $$

and the score distance (SD). The score distance is the Mahalanobis distance measured within the PCA subspace. For a new observation $y$, the score distance with respect to the $j$th class is given by

$$ \mathrm{SD}^j(y) = \sqrt{(t^j)^t (L^j)^{-1} t^j}. $$

Here $t^j = (P^j)^t (y - \bar{x}^j)$ is the score of $y$ with respect to the $j$th class, $P^j$ the matrix of the loadings of group $j$, $L^j$ the diagonal matrix of the eigenvalues and $\bar{x}^j$ the empirical mean of the observations of $X^j$.

Since the SIMCA method is based on classical PCA, it is inherently sensitive to outliers. In order to robustify SIMCA, the RSIMCA classifier has been developed (Vanden Branden and Hubert, 2005), which is mostly based on the robust PCA method ROBPCA (Hubert et al., 2005). This ROBPCA method uses concepts from the Stahel-Donoho outlyingness and the MCD estimator, and consequently it is mostly suited for elliptically distributed data. As we want to account for skewness, we propose a classification method which is similar to RSIMCA, but which is based on a robust PCA approach for skewed data (Hubert et al., 2009) as well as on the adjusted outlyingness (3). The method, denoted RSIMCA-AO, consists of the following steps:

1. First we apply ROBPCA for skewed data within each group. To this end,
(a) we compute the AO of all observations in $X^j$. Then we take the set $I_h$ of the $h_j$ (with $[n_j/2] \leq h_j \leq n_j$) data points with smallest AO, and we compute their mean $\hat{\mu}_0(X^j)$ and covariance matrix $\hat{\Sigma}_0(X^j)$. Note that $h_j$ should be smaller than the number of non-outlying objects in $X^j$. In our examples we have always set $h_j = 0.9\,n_j$, but this can of course be modified if a smaller number of regular observations is to be expected.
(b) In each group we reduce the dimension by projecting all data onto the $l_j$-dimensional subspace $V_0^j$ spanned by the first $l_j$ eigenvectors of $\hat{\Sigma}_0(X^j)$.
(c) For each observation $x_i^j$, we compute its orthogonal distance $\mathrm{OD}_i^j = \| x_i^j - \hat{x}_i^j \|$, with $\hat{x}_i^j$ the projection of $x_i^j$ on the subspace $V_0^j$. We then obtain an improved robust subspace estimate $V_1^j$ as the subspace spanned by the $l_j$ dominant eigenvectors of $\hat{\Sigma}_1$, which is the covariance matrix of all observations $x_i^j$ for which $\mathrm{OD}_i^j \leq c_{\mathrm{OD}}^j$. The cutoff value $c_{\mathrm{OD}}^j$ is the largest $\mathrm{OD}_i^j$ smaller than $Q_3\{\mathrm{OD}\} + 1.5\,e^{3\,\mathrm{MC}\{\mathrm{OD}\}}\,\mathrm{IQR}\{\mathrm{OD}\}$, where $\{\mathrm{OD}\}$ denotes the set $\{\mathrm{OD}_i^j,\ i = 1, \dots, n_j\}$. Next, we project all data points of all groups onto their respective PCA subspace $V_1^j$, yielding the final $\hat{x}_i^j$.
(d) We compute the $\mathrm{AO}_i^j$ of the $\hat{x}_i^j$. As in the low-dimensional case, we retain the estimates of the median, MAD and IQR in each direction.
2. To classify a new observation $y$, we consider two distances. First, we look at the orthogonal Euclidean distance $\mathrm{OD}^j(y)$, as in (7), of $y$ to the subspace $V_1^j$. Next, we compute the adjusted outlyingness AO of $\hat{y}^j$ with respect to the $\hat{x}_i^j$. As explained in Section 2.1, we reduce the computation time by reusing the results obtained when deriving the $\mathrm{AO}_i^j$. To make sure that neither of the two distances dominates the other, both measures are rescaled. For the OD we use $c_{\mathrm{OD}}^j$, whereas for the AO we define $c_{\mathrm{AO}}^j$, with $\{\mathrm{AO}\} = \{\mathrm{AO}_i^j,\ i = 1, \dots, n_j\}$, as the largest $\mathrm{AO}_i^j$ smaller than $Q_3\{\mathrm{AO}\} + 1.5\,e^{3\,\mathrm{MC}\{\mathrm{AO}\}}\,\mathrm{IQR}\{\mathrm{AO}\}$.
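The classical ingredients entering this rule, namely the orthogonal distance (7), the score distance and the adjusted-boxplot cutoff used for the rescaling, are easy to write down explicitly. The sketch below is only an illustration with classical PCA (via the SVD) and our own function names; it is not the ROBPCA-for-skewed-data pipeline of step 1, and it reuses the medcouple helper from the sketch in Section 2.1.

```python
import numpy as np
# assumes medcouple() from the sketch in Section 2.1

def pca_model(X, l):
    """Mean, loading matrix P (p x l) and eigenvalues of a classical PCA of one group."""
    xbar = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - xbar, full_matrices=False)
    return xbar, vt[:l].T, s[:l] ** 2 / (len(X) - 1)

def od_sd(y, model):
    """Orthogonal distance (7) and score distance SD^j(y) of one new observation y."""
    xbar, P, eigvals = model
    t = P.T @ (y - xbar)                       # scores t^j = (P^j)^t (y - xbar^j)
    od = np.linalg.norm(y - (xbar + P @ t))    # OD^j(y) = ||y - y_hat^j||
    sd = np.sqrt(np.sum(t ** 2 / eigvals))     # SD^j(y) = sqrt((t^j)^t (L^j)^{-1} t^j)
    return od, sd

def adjusted_cutoff(values):
    """Largest value not exceeding Q3 + 1.5 exp(3 MC) IQR, used as c_OD or c_AO."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    fence = q3 + 1.5 * np.exp(3 * medcouple(v)) * (q3 - q1)
    return v[v <= fence].max()
```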

Finally, similarly as in Vanden Branden and Hubert (2005), we consider a linear combination of the two squared rescaled distances,

$$ \gamma \left( \frac{\mathrm{OD}^j(y)}{c_{\mathrm{OD}}^j} \right)^2 + (1 - \gamma) \left( \frac{\mathrm{AO}^j(y)}{c_{\mathrm{AO}}^j} \right)^2. \qquad (8) $$

Observation $y$ is now classified into the group $j$ for which (8) is minimal. The parameter $\gamma \in [0, 1]$ determines the relative weight that is given to both distance measures. If there is no prior information on the relative importance of both measures, $\gamma$ can be chosen in such a way that the misclassification error is minimized.

Remark: The computation of the AO in step 1(a) is carried out in the space spanned by the data points. When $p > n_j$, a preliminary SVD decomposition of $X^j$ leads to a data space of dimension $r \leq n_j - 1$. As $r$ will generally still be large, the computation of the AOs can no longer be done as in the low-dimensional setting. First note that in the PCA setting we further assume that the data can be well represented in a lower-dimensional space of dimension $k \ll r$, so the intrinsic problem is still low-dimensional. Secondly, we no longer need an affine equivariant AO measure, but only a procedure which is orthogonally equivariant. Therefore, we generate $m = \min(250p, 2500)$ directions $a$ through 2 data points and then take the maximum of the $\mathrm{AO}^{(1)}$'s over those projections.
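Given the per-group distances and cutoffs, the final assignment step of RSIMCA-AO reduces to a few lines. A minimal sketch of rule (8) with illustrative argument names (the arrays od and ao hold $\mathrm{OD}^j(y)$ and $\mathrm{AO}^j(y)$ for every observation and group, and c_od, c_ao the cutoffs $c_{\mathrm{OD}}^j$ and $c_{\mathrm{AO}}^j$):

```python
import numpy as np

def rsimca_ao_assign(od, ao, c_od, c_ao, gamma):
    """Assign each observation to the group minimizing criterion (8).

    od and ao have shape (n_obs, k); c_od and c_ao have length k; gamma lies in [0, 1]."""
    crit = gamma * (od / c_od) ** 2 + (1.0 - gamma) * (ao / c_ao) ** 2
    return np.argmin(crit, axis=1)
```

When no prior weighting is available, gamma can simply be chosen by evaluating the resulting misclassification error on a validation set over a grid of values, as is done for the 11 equidistant $\gamma$ values in the simulations below.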

3.2 Simulation results

To illustrate the effectiveness of the skewness adjustment, RSIMCA-AO is compared to the SIMCA and RSIMCA methods in a simulation study. First we consider an outlier-free situation. In $p = 500$ dimensions, three groups of observations are generated from a skew-normal distribution. The first group has its center at $0_p = (0, 0, \dots, 0)^t \in \mathbb{R}^p$. Its covariance matrix has the first three canonical vectors as dominant directions, and they explain 98% of the variability; the skewness vector $\alpha$ is set to $(2, 0_{p-1})^t$. The second group has its center at $(5, 0_{p-1})^t$ and has only two important directions (the first and the second canonical vector), yielding 94% of explained variance; the skewness parameter equals $(1, 0_{p-1})^t$. The center of the third class is located at $(-5, 0_{p-1})^t$. Its first four directions explain 95% of the variance, while its skewness parameter is $(-1, 0_{p-1})^t$. From each group we generate 40 training data and 80 validation cases.

Table 4 summarizes the results of applying the three different classifiers. Average misclassification errors (and corresponding standard errors) are reported over the simulation runs, for 11 equidistant $\gamma$ values in $[0, 1]$. Note that all methods require choosing the number of principal components within each group. Here, we have used $l_1 = 3$, $l_2 = 2$ and $l_3 = 4$, according to the true dimensionality of the data. We see that, in this outlier-free setting, the performance of SIMCA and RSIMCA is very comparable. RSIMCA-AO gives better results over a broad range of $\gamma$ values and attains the minimal classification error of 0.0484 at $\gamma = 0.3$. The result for $\gamma = 0.2$ is very close, with an error of 0.0488.

Table 4 Simulation results for uncontaminated high-dimensional data: average misclassification errors (with standard errors in parentheses) of SIMCA, RSIMCA and RSIMCA-AO for 11 equidistant values of $\gamma$ in $[0, 1]$.

In order to illustrate the robustness towards outliers, we next introduced 5% orthogonal outliers in the first and the second group and 5% bad leverage points in the third group (for a precise definition of the types of outliers, see Hubert et al. (2009)). The outliers follow the same distributions as the regular observations, but their center is changed to $(0, 0, 0, 5, 0_{496})^t$ for the first group, $(5, 0, 5, 0_{497})^t$ for the second group and $(0, 0, 0, 0, 1, 0_{495})^t$ for the last group. It is known that orthogonal contamination lifts the classical PCA subspace towards the outliers and that bad leverage points tilt the subspace to accommodate all the outliers. From Table 5 we see indeed that the contamination has a high impact on the performance of classical SIMCA. The robust methods are quite resistant towards outliers. Again RSIMCA-AO obtains the smallest misclassification error, with the minimal value of 0.0511 at $\gamma = 0.2$.

Table 5 Simulation results for contaminated high-dimensional data: average misclassification errors (with standard errors in parentheses) of SIMCA, RSIMCA and RSIMCA-AO for 11 equidistant values of $\gamma$ in $[0, 1]$.

3.3 Example

We reconsider the social data example from Section 2.3, but now we take all 11 variables into account: Income and expenditures on clothing, alcoholic drinks, durable consumer goods, energy, food, health, leisure, nonalcoholic drinks, transport and housing. On this data set we apply SIMCA, RSIMCA and RSIMCA-AO. For each method we retained three principal components in each group; for RSIMCA-AO, for example, this explained 87% of the variance of the first group and 89% of the second group. In Figure 3 the average misclassification errors are plotted as a function of $\gamma$ (based on repeated random splits into a training set and a test set). It can be clearly seen that the skewness adaptation gives smaller misclassification errors for all values of $\gamma$.

Fig. 3 Average misclassification errors for the social data as a function of $\gamma$, for the different classification methods.

4 Conclusion and outlook

In this paper we have shown how classical and robust classifiers for elliptical data can be adjusted for skewness. In the low-dimensional case we adapt the maximum depth

classification method by using the concept of adjusted outlyingness. This is a fast projection pursuit procedure which does not need any tuning parameter; only the number of projections used in its computation needs to be chosen. Note that when the sample sizes differ strongly among the classes, some modifications to this classification rule can be made, leading to even smaller misclassification errors. This has been studied in Hubert and Van der Veeken (2010). For high-dimensional data we modify the RSIMCA approach by using a robust PCA method for skewed data for the dimension reduction and the AO within the PCA subspaces. Simulation results and an application to a real data set show the benefits of these new procedures. The programs will become available as part of LIBRA (Verboven and Hubert, 2005) at wis.kuleuven.be/stat/robust.

References

Azzalini A, Dalla Valle A (1996) The multivariate skew-normal distribution. Biometrika 83:715-726
Brys G, Hubert M, Rousseeuw PJ (2005) A robustification of Independent Component Analysis. Journal of Chemometrics 19
Brys G, Hubert M, Struyf A (2004) A robust measure of skewness. Journal of Computational and Graphical Statistics 13:996-1017
Cheng AY, Ouyang M (2001) On algorithms for simplicial depth. In: Proceedings of the 13th Canadian Conference on Computational Geometry
Croux C, Dehon C (2001) Robust linear discriminant analysis using S-estimators. The Canadian Journal of Statistics 29
Donoho DL (1982) Breakdown properties of multivariate location estimators. PhD thesis, Harvard University
Dutta S, Ghosh AK (2009) On robust classification using projection depth. Indian Statistical Institute, Technical Report
Ghosh AK, Chaudhuri P (2005) On maximum depth and related classifiers. Scandinavian Journal of Statistics 32(2)
Hastie T, Tibshirani R, Friedman J (2001) The Elements of Statistical Learning. Springer, New York
He X, Fung WK (2000) High breakdown estimation for multiple populations with applications to discriminant analysis. Journal of Multivariate Analysis 72
Hubert M, Engelen S (2004) Robust PCA and classification in biosciences. Bioinformatics 20
Hubert M, Rousseeuw PJ, Vanden Branden K (2005) ROBPCA: a new approach to robust principal component analysis. Technometrics 47:64-79
Hubert M, Rousseeuw PJ, Verdonck T (2009) Robust PCA for skewed data. Computational Statistics and Data Analysis 53:2264-2274
Hubert M, Van der Veeken S (2008) Outlier detection for skewed data. Journal of Chemometrics 22
Hubert M, Van der Veeken S (2010) Fast and robust classifiers adjusted for skewness. In: Proceedings of COMPSTAT 2010. Springer
Hubert M, Van Driessen K (2004) Fast and robust discriminant analysis. Computational Statistics and Data Analysis 45:301-320
Johnson RA, Wichern DW (1998) Applied multivariate statistical analysis. Prentice Hall Inc., Englewood Cliffs, NJ

Liu RY (1990) On a notion of data depth based on random simplices. The Annals of Statistics 18(1)
Rousseeuw PJ, Ruts I (1996) Bivariate location depth. Applied Statistics 45
Rousseeuw PJ, Struyf A (1998) Computing location depth and regression depth in higher dimensions. Statistics and Computing 8
Stahel WA (1981) Robuste Schätzungen: infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen. PhD thesis, ETH Zürich
Suykens JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J (2002) Least Squares Support Vector Machines. World Scientific, Singapore
Tukey JW (1975) Mathematics and the picturing of data. In: Proceedings of the International Congress of Mathematicians, Volume 2
Vanden Branden K, Hubert M (2005) Robust classification in high dimensions based on the SIMCA method. Chemometrics and Intelligent Laboratory Systems 79:10-21
Verboven S, Hubert M (2005) LIBRA: a Matlab library for robust analysis. Chemometrics and Intelligent Laboratory Systems 75
Wold S (1976) Pattern recognition by means of disjoint principal component models. Pattern Recognition 8
Zuo Y, Serfling R (2000) General notions of statistical depth function. The Annals of Statistics 28


ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Compactly supported RBF kernels for sparsifying the Gram matrix in LS-SVM regression models

Compactly supported RBF kernels for sparsifying the Gram matrix in LS-SVM regression models Compactly supported RBF kernels for sparsifying the Gram matrix in LS-SVM regression models B. Hamers, J.A.K. Suykens, B. De Moor K.U.Leuven, ESAT-SCD/SISTA, Kasteelpark Arenberg, B-3 Leuven, Belgium {bart.hamers,johan.suykens}@esat.kuleuven.ac.be

More information

Identification of Multivariate Outliers: A Performance Study

Identification of Multivariate Outliers: A Performance Study AUSTRIAN JOURNAL OF STATISTICS Volume 34 (2005), Number 2, 127 138 Identification of Multivariate Outliers: A Performance Study Peter Filzmoser Vienna University of Technology, Austria Abstract: Three

More information

Table of Contents. Multivariate methods. Introduction II. Introduction I

Table of Contents. Multivariate methods. Introduction II. Introduction I Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation

More information

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis .. December 20, 2013 Todays lecture. (PCA) (PLS-R) (LDA) . (PCA) is a method often used to reduce the dimension of a large dataset to one of a more manageble size. The new dataset can then be used to make

More information

Cellwise robust regularized discriminant analysis

Cellwise robust regularized discriminant analysis Cellwise robust regularized discriminant analysis JSM 2017 Stéphanie Aerts University of Liège, Belgium Ines Wilms KU Leuven, Belgium Cellwise robust regularized discriminant analysis 1 Discriminant analysis

More information

Fast Bootstrap for Least-square Support Vector Machines

Fast Bootstrap for Least-square Support Vector Machines ESA'2004 proceedings - European Symposium on Artificial eural etworks Bruges (Belgium), 28-30 April 2004, d-side publi., ISB 2-930307-04-8, pp. 525-530 Fast Bootstrap for Least-square Support Vector Machines

More information

Independent Component (IC) Models: New Extensions of the Multinormal Model

Independent Component (IC) Models: New Extensions of the Multinormal Model Independent Component (IC) Models: New Extensions of the Multinormal Model Davy Paindaveine (joint with Klaus Nordhausen, Hannu Oja, and Sara Taskinen) School of Public Health, ULB, April 2008 My research

More information

6-1. Canonical Correlation Analysis

6-1. Canonical Correlation Analysis 6-1. Canonical Correlation Analysis Canonical Correlatin analysis focuses on the correlation between a linear combination of the variable in one set and a linear combination of the variables in another

More information

Minimum Regularized Covariance Determinant Estimator

Minimum Regularized Covariance Determinant Estimator Minimum Regularized Covariance Determinant Estimator Honey, we shrunk the data and the covariance matrix Kris Boudt (joint with: P. Rousseeuw, S. Vanduffel and T. Verdonck) Vrije Universiteit Brussel/Amsterdam

More information

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given

More information

Small sample size in high dimensional space - minimum distance based classification.

Small sample size in high dimensional space - minimum distance based classification. Small sample size in high dimensional space - minimum distance based classification. Ewa Skubalska-Rafaj lowicz Institute of Computer Engineering, Automatics and Robotics, Department of Electronics, Wroc

More information

A Peak to the World of Multivariate Statistical Analysis

A Peak to the World of Multivariate Statistical Analysis A Peak to the World of Multivariate Statistical Analysis Real Contents Real Real Real Why is it important to know a bit about the theory behind the methods? Real 5 10 15 20 Real 10 15 20 Figure: Multivariate

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

MULTIVARIATE PATTERN RECOGNITION FOR CHEMOMETRICS. Richard Brereton

MULTIVARIATE PATTERN RECOGNITION FOR CHEMOMETRICS. Richard Brereton MULTIVARIATE PATTERN RECOGNITION FOR CHEMOMETRICS Richard Brereton r.g.brereton@bris.ac.uk Pattern Recognition Book Chemometrics for Pattern Recognition, Wiley, 2009 Pattern Recognition Pattern Recognition

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

Face detection and recognition. Detection Recognition Sally

Face detection and recognition. Detection Recognition Sally Face detection and recognition Detection Recognition Sally Face detection & recognition Viola & Jones detector Available in open CV Face recognition Eigenfaces for face recognition Metric learning identification

More information

Discriminant analysis for compositional data and robust parameter estimation

Discriminant analysis for compositional data and robust parameter estimation Noname manuscript No. (will be inserted by the editor) Discriminant analysis for compositional data and robust parameter estimation Peter Filzmoser Karel Hron Matthias Templ Received: date / Accepted:

More information

Descriptive Data Summarization

Descriptive Data Summarization Descriptive Data Summarization Descriptive data summarization gives the general characteristics of the data and identify the presence of noise or outliers, which is useful for successful data cleaning

More information

INVARIANT COORDINATE SELECTION

INVARIANT COORDINATE SELECTION INVARIANT COORDINATE SELECTION By David E. Tyler 1, Frank Critchley, Lutz Dümbgen 2, and Hannu Oja Rutgers University, Open University, University of Berne and University of Tampere SUMMARY A general method

More information

Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report

Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report Hujia Yu, Jiafu Wu [hujiay, jiafuwu]@stanford.edu 1. Introduction Housing prices are an important

More information

CS 231A Section 1: Linear Algebra & Probability Review

CS 231A Section 1: Linear Algebra & Probability Review CS 231A Section 1: Linear Algebra & Probability Review 1 Topics Support Vector Machines Boosting Viola-Jones face detector Linear Algebra Review Notation Operations & Properties Matrix Calculus Probability

More information

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR

More information

Supervised Learning. Regression Example: Boston Housing. Regression Example: Boston Housing

Supervised Learning. Regression Example: Boston Housing. Regression Example: Boston Housing Supervised Learning Unsupervised learning: To extract structure and postulate hypotheses about data generating process from observations x 1,...,x n. Visualize, summarize and compress data. We have seen

More information

Chapter 3. Measuring data

Chapter 3. Measuring data Chapter 3 Measuring data 1 Measuring data versus presenting data We present data to help us draw meaning from it But pictures of data are subjective They re also not susceptible to rigorous inference Measuring

More information

Chapter 3 Examining Data

Chapter 3 Examining Data Chapter 3 Examining Data This chapter discusses methods of displaying quantitative data with the objective of understanding the distribution of the data. Example During childhood and adolescence, bone

More information

Dimensionality Reduction Using PCA/LDA. Hongyu Li School of Software Engineering TongJi University Fall, 2014

Dimensionality Reduction Using PCA/LDA. Hongyu Li School of Software Engineering TongJi University Fall, 2014 Dimensionality Reduction Using PCA/LDA Hongyu Li School of Software Engineering TongJi University Fall, 2014 Dimensionality Reduction One approach to deal with high dimensional data is by reducing their

More information

Lecture 12 Robust Estimation

Lecture 12 Robust Estimation Lecture 12 Robust Estimation Prof. Dr. Svetlozar Rachev Institute for Statistics and Mathematical Economics University of Karlsruhe Financial Econometrics, Summer Semester 2007 Copyright These lecture-notes

More information

CLASSIFICATION EFFICIENCIES FOR ROBUST LINEAR DISCRIMINANT ANALYSIS

CLASSIFICATION EFFICIENCIES FOR ROBUST LINEAR DISCRIMINANT ANALYSIS Statistica Sinica 8(008), 58-599 CLASSIFICATION EFFICIENCIES FOR ROBUST LINEAR DISCRIMINANT ANALYSIS Christophe Croux, Peter Filzmoser and Kristel Joossens K.U. Leuven and Vienna University of Technology

More information