
Computationally Easy Outlier Detection via Projection Pursuit with Finitely Many Directions

Robert Serfling 1 and Satyaki Mazumder 2

University of Texas at Dallas and Indian Institute of Science, Education and Research

December, 2012

1 Department of Mathematics, University of Texas at Dallas, Richardson, Texas 75080-3021, USA. Email: serfling@utdallas.edu. Website: www.utdallas.edu/~serfling. Support by NSF Grants DMS-0805786 and DMS-1106691 and NSA Grant H98230-08-1-0106 is gratefully acknowledged.

2 Indian Institute of Science, Education and Research, Kolkata, India.

Abstract

Outlier detection is fundamental to data analysis. Desirable properties are affine invariance, robustness, low computational burden, and nonimposition of elliptical contours. However, leading methods fail to possess all of these features. The Mahalanobis distance outlyingness (MD) imposes elliptical contours. The projection outlyingness (SUP), powerfully involving projections of the data onto all univariate directions, is highly intensive computationally. Computationally easy variants using projection pursuit with but finitely many directions have been introduced, but these fail to capture at once the other desired properties. Here we develop a robust Mahalanobis spatial outlyingness on projections (RMSP) function which indeed satisfies all four desired properties. Pre-transformation to a strong invariant coordinate system yields affine invariance, spatial trimming yields robustness, and spatial Mahalanobis outlyingness is used to obtain computational ease and smooth, unconstrained contours. From empirical study using artificial and actual data, our findings are that SUP is outclassed by MD and RMSP, that MD and RMSP are competitive, and that RMSP is especially advantageous in describing the intermediate outlyingness structure when elliptical contours are not assumed.

AMS 2000 Subject Classification: Primary 62H99. Secondary 62G99.

Key words and phrases: Outlier detection; Projection pursuit; Nonparametric; Multivariate; Affine invariance; Robustness.

1 Introduction

Outlier identification is fundamental to multivariate statistical analysis and data mining. Using a selected outlyingness function defined on the sample or input space, outliers are those points whose outlyingness value exceeds a specified threshold. Desirably, outlyingness functions have the properties (i) robustness against the presence of outliers, (ii) weak affine invariance (i.e., transformation to other coordinates should not affect relative outlyingness rankings and comparisons), (iii) computational efficiency in any practical dimension, and (iv) nonimposition of elliptical contours. We point out that, although elliptical contours are attractive and in some cases justified, an outlyingness function that does not arbitrarily impose them can more effectively identify which points belong to the layers of moderate outlyingness. The widely used Mahalanobis distance outlyingness (MD) satisfies (i)-(iii) but gives up (iv). On the other hand, the projection outlyingness (SUP) based on the projection pursuit approach satisfies (i), (ii), and (iv) but gives up (iii). Here we construct a modified projection pursuit outlyingness (RMSP) that satisfies all of (i)-(iv).

Projection pursuit techniques, originally proposed and experimented with by Kruskal (1969, 1972), are extremely powerful in multivariate data analysis. Related ideas occur in Switzer (1970) and Switzer and Wright (1971). A key implementation is due to Friedman and Tukey (1974). Recently, projection depth has received significant attention in the literature (Liu, 1992; Zuo and Serfling, 2000b; Zuo, 2003). The corresponding projection outlyingness (SUP) is a multivariate extension of the univariate scaled deviation outlyingness $O(x) = |x - \mu|/\sigma$, where $\mu$ and $\sigma$ are univariate measures of location and spread. Specifically, for a multivariate data point $x$, the projection outlyingness is given by the supremum of the univariate outlyingness values of the projections of $x$ onto all lines (Liu, 1992; Zuo and Serfling, 2000b; Zuo, 2003; Serfling, 2004; Dang and Serfling, 2010).

Let us make this precise. Let $X$ have distribution $F_X$ on $\mathbb{R}^d$ and, for any unit vector $u = (u_1, \ldots, u_d)'$ in $\mathbb{R}^d$, let $F_{u'X}$ denote the induced univariate distribution of $u'X$. With $\mu$ given by the Median, say $\nu$, and $\sigma$ by the MAD (median absolute deviation from the median), say $\eta$, the associated well-known projection outlyingness (SUP) is
$$O_P(x, F_X) = \sup_{\|u\|=1} \frac{|u'x - \nu(F_{u'X})|}{\eta(F_{u'X})}, \quad x \in \mathbb{R}^d, \qquad (1)$$
representing the worst case scaled deviation outlyingness of projections of $x$ onto lines. For a $d$-dimensional data set $\mathbf{X}_n = \{X_1, \ldots, X_n\}$, the sample version $O_P(x, \mathbf{X}_n)$ is affine invariant, highly masking robust (Dang and Serfling, 2010), and does not impose ellipsoidal contours. However, it is computationally very intensive.

To overcome the computational burden, we develop an alternative projection pursuit outlyingness entailing only finitely many projections. Implementation involves the following key steps, which are described in detail in the sequel.
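Before detailing the steps, the following minimal sketch (ours, not the authors' implementation; function names and the direction count are illustrative) indicates what a brute-force approximation of (1) over a large set of random unit directions involves, and why its cost motivates the finite, deterministic construction developed below.

```python
import numpy as np

def sup_outlyingness(x, X, n_dir=100_000, rng=None):
    """Approximate the projection outlyingness (1) by maximizing the scaled
    deviation |u'x - median| / MAD over n_dir random unit directions."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    U = rng.standard_normal((n_dir, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # random unit directions
    proj_X = X @ U.T                                # n x n_dir projections u'X_i
    proj_x = x @ U.T                                # projections of the query point
    med = np.median(proj_X, axis=0)
    mad = np.median(np.abs(proj_X - med), axis=0)
    return np.max(np.abs(proj_x - med) / mad)

# The cost grows with the number of directions (and with n and d),
# which is what the finite-direction method below is designed to avoid.
X = np.random.default_rng(0).standard_normal((100, 2))
print(sup_outlyingness(np.array([4.0, 4.0]), X, n_dir=5000))
```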

Step 1. With only finitely many projections, $O_P(x, \mathbf{X}_n)$ is no longer affine invariant, not even orthogonally invariant. However, if we first transform the data to $\widetilde{\mathbf{X}}_n = D(\mathbf{X}_n)\mathbf{X}_n$ using a strong invariant coordinate system (SICS) transformation $D(\mathbf{X}_n)$, then the quantity
$$\frac{u'\widetilde{x} - \nu(u'\widetilde{\mathbf{X}}_n)}{\eta(u'\widetilde{\mathbf{X}}_n)}$$
for any $u$ becomes invariant under affine transformation of the original data via $\mathbf{X}_n \to \mathbf{Y}_n = A\mathbf{X}_n + b$ for nonsingular $d \times d$ $A$ and $d$-vector $b$ (see Serfling, 2010). Consequently, to capture affine invariance for our method, a SICS transformation is applied to $\mathbf{X}_n$ before taking projections.

In order not to lose information on the outliers, we need the SICS transformation $D(\mathbf{X}_n)$ to be robust. This is accomplished by constructing $D(\mathbf{X}_n)$ from a subset $\mathbf{Z}_N$ of $N$ inner observations from $\mathbf{X}_n$, selected in an affine invariant way, with $N = 0.5n$ or $N = 0.75n$, for example. One possibility for $\mathbf{Z}_N$ is the $N$ inner observations used in computing the well-known Fast-MCD sample scatter matrix (Rousseeuw and Van Driessen, 1999, implemented in the R packages MASS, rrcov, and robustbase, for example). However, this becomes computationally prohibitive as the dimension $d$ increases. Instead, we use a computationally attractive spatial trimming approach (Mazumder and Serfling, 2013) that chooses the $N$ observations with lowest values of sample Mahalanobis spatial outlyingness.

Step 2. The computational burden is reduced by taking relatively few projections, but enough to capture sufficient information. From both intuitive and computational considerations, the use of deterministic directions strongly appeals. In particular, we choose $s = 5d$ directions $\mathcal{U} = \{u_i,\ 1 \le i \le s\}$ approximately uniformly distributed on the unit sphere in $\mathbb{R}^d$ but lying on distinct diameters. Fang and Wang (1994) provide convenient numerical algorithms for this purpose.

Step 3. Projection pursuit is carried out. For each $\widetilde{X}_i$ in $\widetilde{\mathbf{X}}_n$ and direction $u_j$ in $\mathcal{U}$, the scaled deviation outlyingness of the projected value $u_j'\widetilde{X}_i$ is computed:
$$a_n(i, j) = \frac{u_j'\widetilde{X}_i - \nu(u_j'\widetilde{\mathbf{X}}_n)}{\eta(u_j'\widetilde{\mathbf{X}}_n)}, \quad 1 \le j \le s,\ 1 \le i \le n,$$
with $\nu(u_j'\widetilde{\mathbf{X}}_n)$ and $\eta(u_j'\widetilde{\mathbf{X}}_n)$ the sample median and sample MAD, respectively, of the projected data set $u_j'\widetilde{\mathbf{X}}_n$. The original data set $\mathbf{X}_n$ of $n$ $d$-vectors thus becomes replaced by a new data set $\mathbf{A}_n = [a_n(i, j)]_{s \times n}$ of $n$ $s$-vectors of projected scaled deviations. These contain the relevant outlyingness information.

Step 4. Redundancy in $\mathbf{A}_n$ is eliminated by a PCA reduction. With $J$ indexing the $N$ inner observations of Step 1, the (robust) sample covariance matrix $S_n$ of the data set given by the columns of $\mathbf{A}_n$ indexed by $J$ is formed. Then PCA based on $S_n$ is used to transform and reduce the $s$-dimensional data vectors in $\mathbf{A}_n$ to $t$-dimensional data vectors $\mathbf{V}_n$, where $t$ denotes the number of positive eigenvalues of $S_n$. Typically, $t = d$.

Step 5. We construct our new outlyingness function by again employing the Mahalanobis spatial outlyingness function and the spatial trimming step of Step 1, but now applied to the data $\mathbf{V}_n$ instead of $\mathbf{X}_n$. This yields a Robust Mahalanobis Spatial outlyingness function based on Projections (RMSP), $O_{RMSP}(X_i, \mathbf{X}_n)$, $1 \le i \le n$, a robust outlyingness ordering of the original data points in $\mathbf{X}_n$ via projection pursuit with finite $\mathcal{U}$.

Summary:

$\mathbf{X}_n$ (original data: $n$ vectors in $\mathbb{R}^d$)
$\longrightarrow$ (via $D(\mathbf{X}_n)$) $\widetilde{\mathbf{X}}_n$ (SICS-transformed data: $n$ vectors in $\mathbb{R}^d$)
$\longrightarrow$ (via $O(x, F)$ and $\mathcal{U}$) $\mathbf{A}_n$ ($n$ projected scaled deviation vectors in $\mathbb{R}^s$, $s = 5d$)
$\longrightarrow$ (via PCA) $\mathbf{V}_n$ ($n$ vectors in $\mathbb{R}^t$, typically with $t = d$)
$\longrightarrow$ (via robust Mahalanobis spatial outlyingness) $O_{RMSP}$

Details for Step 1 are provided in Section 2. Following a method of Serfling (2010) for construction of sample SICS functionals in conjunction with spatial trimming of observations (Mazumder and Serfling, 2013), we introduce an easily computable and robust SICS functional to be used for the purposes of this paper. Details for Steps 3 and 4 are provided in Section 3, and those for Step 5 in Section 4, leading to our affine invariant, robust, and easily computable outlyingness function RMSP. In Section 5, we compare SUP, MD, and RMSP using artificial bivariate data sets for the sake of visual comparisons, artificial higher-dimensional data sets, and actual higher-dimensional data sets, the Stackloss Data (n = 21, d = 4) and the Air Pollution and Mortality Data (n = 59, d = 13), which have been studied extensively in the literature. Our findings and conclusions are provided in Section 6. Briefly, SUP is outclassed by MD and RMSP, MD and RMSP are competitive, and RMSP is especially advantageous in describing the intermediate outlyingness structure when elliptical contours are not assumed.

We conclude this introduction by mentioning some notable previous work on projection pursuit outlyingness using only finitely many projections. Despite their various strengths, however, none of these methods captures all of properties (i)-(iv).

Pan, Fung, and Fang (2000) use finitely many deterministic directions approximately uniformly scattered and develop a finite approach calculating a sample quadratic form based on the differences $O(u'x, u'\mathbf{X}_n) - O(u'x, F_{u'X})$ over the directions $u$ in their chosen set. This imposes elliptical outlyingness contours. Further, since these differences involve the unknown $F$, a bootstrap step is incorporated. The number of directions is data-driven and the method is not affine invariant.

Peña and Prieto (2001) introduce an affine invariant method using the supremum and $2d$ data-driven directions. These are selected using univariate measures of kurtosis over candidate directions, choosing the $d$ directions with local extremes of high kurtosis and the $d$ directions with local extremes of low kurtosis. In a very complex algorithm, the outliers are ultimately selected using Mahalanobis distance (thus entailing elliptical contours).

Filzmoser, Maronna, and Werner (2008) extend the Peña and Prieto (2001) approach into an even more elaborate one, adding a principal components step. This achieves certain improvements in performance for detection of location outliers, especially in high dimension. However, this gives up affine invariance. See also Maronna, Martin, and Yohai (2006).

2 Step 1: a Robust SICS Transformation

In general, by weak affine invariance of a functional $T(F)$ is meant that
$$T(Ax + b, F_{AX+b}) = c\, T(x, F_X), \quad x \in \mathbb{R}^d,$$
where $A_{d \times d}$ is nonsingular, $b$ is any vector in $\mathbb{R}^d$, and $c = c(A, b, F_X)$ is a constant. Here, corresponding to some given finite $\mathcal{U} = \{u_1, \ldots, u_s\}$, we are interested in the functional
$$\zeta(x, \mathcal{U}, F_X) = \left( \frac{u_1'x - \nu(F_{u_1'X})}{\eta(F_{u_1'X})}, \ldots, \frac{u_s'x - \nu(F_{u_s'X})}{\eta(F_{u_s'X})} \right), \quad x \in \mathbb{R}^d,$$
whose components give (signed) scaled deviation outlyingness values for the projections of a point $x$ onto the lines represented by $\mathcal{U}$. It is straightforward to verify by simple counterexamples (see the numerical sketch below) that $\zeta(x, \mathcal{U}, F_X)$ is not weakly affine invariant (nor even orthogonally invariant). However, we can make it become weakly affine invariant by first applying to the variable $x$ a strong invariant coordinate system (SICS) functional (Serfling, 2010). Also, as applied to the sample version $\zeta(x, \mathcal{U}, \mathbf{X}_n)$, we want the SICS transformation to be robust.
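The following small numerical check (ours; data, directions, and names are illustrative) shows the kind of counterexample meant above: rotating the data and the point changes the vector of scaled deviations, so the sample version of $\zeta$ is not even orthogonally invariant without a preliminary standardization.

```python
import numpy as np

def zeta(x, X, U):
    """Sample version of the vector of signed scaled deviations of the
    projections of the point x onto the directions in U (rows of U)."""
    P = X @ U.T                                # n x s projections u_j' X_i
    med = np.median(P, axis=0)                 # nu(u_j' X_n)
    mad = np.median(np.abs(P - med), axis=0)   # eta(u_j' X_n)
    return (x @ U.T - med) / mad

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 2))
x = np.array([2.0, -1.0])
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [np.sqrt(0.5), np.sqrt(0.5)]])   # three unit directions

# Rotate data and point by 45 degrees (an orthogonal transformation).
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(zeta(x, X, U))
print(zeta(R @ x, X @ R.T, U))   # generally differs from the line above,
                                 # so zeta alone is not orthogonally invariant
```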

2.1 Standardization of $\zeta(x, \mathcal{U}, F_X)$ using a SICS functional

Definition 1 (Serfling, 2010). A positive definite matrix-valued functional $D(F)$ is a strong invariant coordinate system (SICS) functional if, for $Y = AX + b$,
$$D(F_Y) = k_3\, D(F_X) A^{-1},$$
where $A_{d \times d}$ is nonsingular, $b$ is any vector in $\mathbb{R}^d$, and $k_3 = k_3(A, b, F_X)$ is a scalar.

Detailed treatment is found in Tyler, Critchley, Dümbgen and Oja (2009), Serfling (2010), and Ilmonen, Oja, and Serfling (2012). We now establish that the SICS-standardized vector $\zeta(D(F_X)x, \mathcal{U}, F_{D(F_X)X})$ is weakly affine invariant.

Theorem 2 Let $X$ have distribution $F_X$ on $\mathbb{R}^d$ and let $\widetilde{X} = D(F_X)X$, where $D(F)$ is a SICS functional. Then $\zeta(\widetilde{x}, \mathcal{U}, F_{\widetilde{X}})$ is weakly affine invariant.

Proof of Theorem 2. Suppose $X \to Y = AX + b$, where $A$ is a $d \times d$ nonsingular matrix and $b$ is any vector in $\mathbb{R}^d$. Now, transform $Y \to \widetilde{Y} = D(F_Y)Y$. Then, using Definition 1, we have
$$\widetilde{Y} = D(F_{AX+b})(AX + b) = k\, D(F_X)X + k\, D(F_X)A^{-1}b = k\widetilde{X} + c,$$
where $k = k(A, b, F_X)$ and $c = k\, D(F_X)A^{-1}b$. Hence, for $1 \le j \le s$,
$$\frac{u_j'\widetilde{y} - \mathrm{med}(u_j'\widetilde{Y})}{\mathrm{MAD}(u_j'\widetilde{Y})} = \frac{u_j'(k\widetilde{x} + c) - \mathrm{med}(u_j'(k\widetilde{X} + c))}{\mathrm{MAD}(u_j'(k\widetilde{X} + c))} = \mathrm{sgn}(k)\, \frac{u_j'\widetilde{x} - \mathrm{med}(u_j'\widetilde{X})}{\mathrm{MAD}(u_j'\widetilde{X})}.$$
Thus $\zeta(\widetilde{y}, \mathcal{U}, F_{\widetilde{Y}}) = \mathrm{sgn}(k)\, \zeta(\widetilde{x}, \mathcal{U}, F_{\widetilde{X}})$ and the result follows.

2.2 A robust SICS transformation via spatial trimming

In general, following Serfling (2010), we may construct a SICS functional as follows. Let $\mathbf{Z}_N$ be a subset of $\mathbf{X}_n$ of size $N$ obtained through some affine invariant and permutation invariant procedure. Then form $d + 1$ means $\bar{Z}_1, \ldots, \bar{Z}_{d+1}$ based, respectively, on consecutive blocks of size $m = N/(d + 1)$ from $\mathbf{Z}_N$, and define the matrix
$$W(\mathbf{X}_n) = [(\bar{Z}_2 - \bar{Z}_1), \ldots, (\bar{Z}_{d+1} - \bar{Z}_1)]_{d \times d}.$$
Then the matrix
$$D(\mathbf{X}_n) = W(\mathbf{X}_n)^{-1} \qquad (2)$$
is a sample SICS functional.
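A minimal sketch of the construction in (2), assuming the inner observations Z_N have already been selected (for example by the spatial trimming of Section 2.2); the placeholder choice of Z_N below and the function names are ours, not the authors'.

```python
import numpy as np

def sics_matrix(Z):
    """Sample SICS functional (2): split the N x d array Z_N of inner
    observations into d+1 consecutive blocks of size m = N // (d+1), take
    the block means Zbar_1,...,Zbar_{d+1}, form the d x d matrix
    W = [Zbar_2 - Zbar_1, ..., Zbar_{d+1} - Zbar_1], and return W^{-1}."""
    N, d = Z.shape
    m = N // (d + 1)
    means = np.array([Z[k*m:(k+1)*m].mean(axis=0) for k in range(d + 1)])
    W = (means[1:] - means[0]).T          # columns are Zbar_{k+1} - Zbar_1
    return np.linalg.inv(W)

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 3))
# Placeholder for Z_N (here simply the first 75 points); in the paper Z_N
# is chosen in an affine invariant way by spatial trimming.
Z = X[:75]
D = sics_matrix(Z)
X_tilde = X @ D.T                         # SICS-standardized data, rows D X_i
print(D.shape, X_tilde.shape)
```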

The robustness and computational burden rest on the method of choosing $\mathbf{Z}_N$. One implementation is to let $\mathbf{Z}_N$ be the set of the observations selected and used in computing the well-known Fast-MCD scatter matrix with $N = \alpha n$, where $\alpha = 0.5$ or $0.75$. This uses all the observations in a permutationally invariant way in selecting $\mathbf{Z}_N$ and all of the latter observations in defining $W(\mathbf{X}_n)$. However, due to its combinatorial character, this approach becomes computationally prohibitive as $d$ increases. Instead we shall develop a computationally easy robust sample SICS functional based on letting $\mathbf{Z}_N$ consist of the inner observations obtained in spatial trimming, as follows. We start with the following definition.

Definition 3 (Serfling, 2010). A positive definite matrix-valued functional $M(F)$ is a transformation-retransformation (TR) functional if
$$A'\, M(\mathbf{Y}_n)'\, M(\mathbf{Y}_n)\, A = k_2\, M(\mathbf{X}_n)'\, M(\mathbf{X}_n) \qquad (3)$$
for $Y = AX + b$, and with $k_2 = k_2(A, b, \mathbf{X}_n)$ a positive scalar function of $A$, $b$, and $\mathbf{X}_n$.

Any inverse square root of a scatter or shape matrix is a TR matrix. These are discussed in detail in Serfling (2010) and, in particular, it is shown there that any SICS functional is a TR functional. In particular, we will use a TR functional introduced by Tyler (1987), to achieve certain favorable theoretical properties in the elliptical model, and extended by Dümbgen (1998). It is easily computed in any dimension, making it a computationally attractive alternative to Fast-MCD, although not as robust. With respect to a specified location functional $\theta(\mathbf{X}_n)$, the Tyler matrix is defined as $C(\mathbf{X}_n) = (M_s(\mathbf{X}_n)'\, M_s(\mathbf{X}_n))^{-1}$, with $M_s(\mathbf{X}_n)$ the TR matrix defined as the unique symmetric square root of $C(\mathbf{X}_n)^{-1}$ obtained through the M-estimation equation
$$n^{-1} \sum_{i=1}^{n} \left\{ \frac{M_s(\mathbf{X}_n)(X_i - \theta(\mathbf{X}_n))}{\|M_s(\mathbf{X}_n)(X_i - \theta(\mathbf{X}_n))\|} \right\} \left\{ \frac{M_s(\mathbf{X}_n)(X_i - \theta(\mathbf{X}_n))}{\|M_s(\mathbf{X}_n)(X_i - \theta(\mathbf{X}_n))\|} \right\}' = d^{-1} I_d. \qquad (4)$$
An iterative algorithm using Cholesky factorizations to compute $M_s(\mathbf{X}_n)$ quickly in any practical dimension is given in Tyler (1987). Another solution of (4) is given by the upper triangular square root $M_t(\mathbf{X}_n)$ of $C(\mathbf{X}_n)^{-1}$ and is computed by a similar algorithm. In Tyler (1987), the quantity $\theta(\mathbf{X}_n)$ is specified as a known constant. For inference situations when $\theta(\mathbf{X}_n)$ is not known or specified, a symmetrized version of $M_s(\mathbf{X}_n)$ eliminating the need of a location measure is given by Dümbgen (1998): $M_{s1}(\mathbf{X}_n) = M_s(\mathbf{X}_n - \mathbf{X}_n)$, where $\mathbf{X}_n - \mathbf{X}_n$ denotes the set of differences $X_i - X_j$. Convenient R packages (e.g., ICSNP) are available for computation of these estimators. See Tyler (1987) and Serfling (2010) for detailed discussion. Here we will denote the Dümbgen-Tyler TR matrix $M_{s1}(F)$ by DT.

We will make use of the TR matrix DT via the Mahalanobis spatial outlyingness function (Serfling, 2010) and the notion of spatial trimming (Mazumder and Serfling, 2013).
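For orientation, here is a simplified fixed-point sketch of Tyler's scatter matrix and the associated TR matrix $M_s = C^{-1/2}$; it is not the Cholesky-based algorithm of Tyler (1987), nor the symmetrized Dümbgen version on pairwise differences (for which the R package ICSNP can be used), and it takes the location $\theta$ as given (here, the coordinatewise median, an assumption of this sketch).

```python
import numpy as np

def tyler_tr_matrix(X, theta, n_iter=200, tol=1e-8):
    """Simplified fixed-point iteration for Tyler's distribution-free scatter
    matrix C; returns the TR matrix M_s = C^{-1/2} (symmetric square root),
    which satisfies the M-estimation equation (4) at convergence."""
    n, d = X.shape
    Z = X - theta
    C = np.eye(d)
    for _ in range(n_iter):
        Cinv = np.linalg.inv(C)
        dist2 = np.einsum('ij,jk,ik->i', Z, Cinv, Z)    # z_i' C^{-1} z_i
        dist2 = np.maximum(dist2, 1e-12)
        C_new = (d / n) * (Z.T * (1.0 / dist2)) @ Z     # (d/n) sum z_i z_i' / dist2_i
        C_new /= np.trace(C_new) / d                    # fix the scale (Tyler's C
        if np.linalg.norm(C_new - C) < tol:             # is defined only up to scale)
            C = C_new
            break
        C = C_new
    w, V = np.linalg.eigh(C)                            # symmetric inverse square root
    return V @ np.diag(w ** -0.5) @ V.T

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 4))
M = tyler_tr_matrix(X, theta=np.median(X, axis=0))
print(M.shape)
```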

Based on any given TR matrix $M(\mathbf{X}_n)$, and based on the spatial sign function (or unit vector function)
$$S(x) = \begin{cases} \dfrac{x}{\|x\|}, & x \in \mathbb{R}^d,\ x \ne 0, \\[1ex] 0, & x = 0, \end{cases}$$
a corresponding affine invariant Mahalanobis spatial outlyingness function is given by
$$O_{MS}(x, \mathbf{X}_n) = \left\| n^{-1} \sum_{i=1}^{n} S\big(M(\mathbf{X}_n)(x - X_i)\big) \right\|, \quad x \in \mathbb{R}^d. \qquad (5)$$
Without standardization by any $M(\mathbf{X}_n)$, (5) gives the well-known spatial outlyingness function (Chaudhuri, 1996), which is only orthogonally invariant. Standardization by a TR matrix produces affine invariance.

Spatial trimming (possibly depending on $n$ and/or $d$) consists of trimming away those outer observations satisfying $O_{MS}(X_i, \mathbf{X}_n) > \lambda_0$, for some specified threshold $\lambda_0$. Then the covariance matrix of the remaining inner observations is robust against outliers. For $M(\mathbf{X}_n)$ given by DT, this is computationally faster than Fast-MCD. We denote by $J$ the set of indices of the inner observations selected by spatial trimming.

For choice of the threshold $\lambda_0$, we follow Mazumder and Serfling (2013) and recommend
$$\lambda_0 = \frac{d}{d + 2}, \qquad (6)$$
which is justified as follows. The outlyingness function $O_{MS}$ without standardization has masking breakdown point (MBP) approximately $(1 - \lambda_0)/2$ (Dang and Serfling, 2010). This is the minimal fraction of observations in the sample which, if corrupted, can cause arbitrarily extreme outliers to become masked (misclassified as nonoutliers). One may obtain an MBP as high as possible (near 1/2) by choosing $\lambda_0$ small enough. However, to avoid overtrimming and losing resolution, a moderate choice is prudent. With standardization, we also need to take account of the explosion replacement breakdown point RBP$_{\exp}$ of the chosen TR matrix $M(\mathbf{X}_n)$. We will adopt the easily computable DT, which has RBP$_{\exp}$ subject to the upper bound $(d + 1)^{-1}$, decreasing with $d$ (see Dümbgen and Tyler, 2005). We balance $(1 - \lambda_0)/2$ with $(d + 1)^{-1}$. However, to avoid overrobustification in low dimensions, we instead use $(d + 2)^{-1}$, thus solving $(1 - \lambda_0)/2 = (d + 2)^{-1}$ to obtain (6). The increase of $\lambda_0$ with $d$ is reasonable, since in higher dimension distributions place greater probability in the tails and observations are more dispersed, causing outlyingness values to tend to be larger overall.

In summary: our SICS functional $D(\mathbf{X}_n)$ is given by the sample version of (2) with $\mathbf{Z}_N$ and $J$ determined via spatial trimming using $O_{MS}$ based on the TR matrix DT and with trimming threshold given by (6).
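A minimal sketch of (5) and of the trimming rule (6), assuming a TR matrix M is supplied (the identity placeholder below stands in for DT; the function names are ours).

```python
import numpy as np

def spatial_sign(Y):
    """Row-wise spatial sign S(y) = y / ||y||, with S(0) = 0."""
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    out = np.zeros_like(Y)
    nz = norms[:, 0] > 0
    out[nz] = Y[nz] / norms[nz]
    return out

def mahalanobis_spatial_outlyingness(X, M):
    """O_MS(X_i, X_n) of (5): the norm of the average spatial sign of
    M (X_i - X_j) over the sample, computed for every observation X_i."""
    n = X.shape[0]
    O = np.empty(n)
    for i in range(n):
        S = spatial_sign((X[i] - X) @ M.T)     # rows M (X_i - X_j)
        O[i] = np.linalg.norm(S.mean(axis=0))
    return O

def spatial_trim(X, M):
    """Indices J of the inner observations: O_MS <= lambda_0 = d / (d + 2)."""
    d = X.shape[1]
    O = mahalanobis_spatial_outlyingness(X, M)
    return np.where(O <= d / (d + 2))[0]

rng = np.random.default_rng(5)
X = rng.standard_normal((150, 3))
M = np.eye(3)                                  # placeholder for the DT matrix
J = spatial_trim(X, M)
print(len(J), "inner observations retained")
```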

3 Steps 3 and 4: Application of Projection Pursuit and PCA

Let $\mathbf{X}_n = \{X_1, \ldots, X_n\}$ be a random sample from some $F$ on $\mathbb{R}^d$ and let $D(\mathbf{X}_n)$ be a sample SICS functional (as, for example, the one obtained above). We describe in Algorithm A below the steps for transforming the original data $\mathbf{X}_n$ to a set of $t$-dimensional vectors $\mathbf{V}_n = \{V_1(\mathbf{X}_n), \ldots, V_n(\mathbf{X}_n)\}$ by a projection pursuit approach using the directions $u_i$, $1 \le i \le s$, combined with a SICS preliminary transformation and a PCA dimension reduction step. We also provide a key property of $\mathbf{V}_n$.

Algorithm A, for $\mathbf{V}_n$

1. Standardize $X_i$, $1 \le i \le n$, with a given sample SICS functional $D(\mathbf{X}_n)_{d \times d}$ via $\widetilde{X}_i = D(\mathbf{X}_n)X_i$, $1 \le i \le n$, and put $\widetilde{\mathbf{X}}_n = [\widetilde{X}_1 \cdots \widetilde{X}_n]$.

2. Choose the number of directions $s$ and select $s$ unit vectors $\mathcal{U} = \{u_1, \ldots, u_s\}$ approximately uniformly distributed on the unit sphere in $\mathbb{R}^d$ but lying on distinct diameters, following the algorithm of Fang and Wang (1994). From trial and error with examples of various dimensions, choices such as $s = 4d$, $5d$, or $6d$ are effective. We recommend $s = 5d$.

3. Calculate the projections $u_j'\widetilde{X}_i$, $1 \le i \le n$, $1 \le j \le s$, and denote by $\nu(u_j'\widetilde{\mathbf{X}}_n)$ and $\eta(u_j'\widetilde{\mathbf{X}}_n)$ the median and MAD, respectively, of $\{u_j'\widetilde{X}_1, \ldots, u_j'\widetilde{X}_n\}$, $1 \le j \le s$.

4. Put $\mathbf{A}_n = [a_1 \cdots a_n]_{s \times n}$, where $a_i = (a_n(i, 1), \ldots, a_n(i, s))'_{s \times 1}$, $1 \le i \le n$, with
$$a_n(i, j) = \frac{u_j'\widetilde{X}_i - \nu(u_j'\widetilde{\mathbf{X}}_n)}{\eta(u_j'\widetilde{\mathbf{X}}_n)}, \quad 1 \le j \le s,\ 1 \le i \le n.$$
The initial $d \times n$ data matrix $\mathbf{X}_n$ of $n$ $d$-vectors now has been converted to a new data matrix $\mathbf{A}_n$ of dimension $s \times n$, i.e., consisting of $n$ $s$-vectors $a_1, \ldots, a_n$, with $a_i$ associated with the original data point $X_i$, $1 \le i \le n$.

5. Let $S_n$ denote the sample covariance matrix of those columns $a_i$ of $\mathbf{A}_n$ with indices $i$ in the set $J$. Calculate the eigenvalues $\lambda_1 \ge \cdots \ge \lambda_s$ of $S_n$ and let $P = [p_1 \cdots p_s]_{s \times s}$ denote the orthogonal matrix containing the corresponding eigenvectors $p_1, \ldots, p_s$ as column vectors.

6. PCA reduction. Define the $s \times 1$ vectors $\widetilde{V}_i = P'a_i$, $1 \le i \le n$. Let $t$ be the number of eigenvalues greater than $10^{-6}$ (typically, $t = d$). Let the $t$-vector $V_i$ contain the first $t$ components of the vector $\widetilde{V}_i$, $1 \le i \le n$, and put $\mathbf{V}_n = [V_1 \cdots V_n]_{t \times n}$.

This completes reduction of the original $d \times n$ data matrix $\mathbf{X}_n$ to a new $t \times n$ data matrix $\mathbf{V}_n$. It is readily checked that the covariance matrix of the vectors $V_i$, $1 \le i \le n$, is $\Lambda_{t \times t} = \mathrm{diag}(\lambda_1, \ldots, \lambda_t)$.
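An end-to-end sketch of Algorithm A under stated assumptions: the SICS matrix D, the inner indices J, and the direction set are supplied as placeholders (random unit directions stand in for the Fang and Wang (1994) construction), and the eigenvalue cutoff 1e-6 follows step 6.

```python
import numpy as np

def algorithm_A(X, D, U, J):
    """Sketch of Algorithm A: SICS-standardize with D, project onto the
    directions in U (rows), form the signed scaled deviations A_n, then
    reduce by PCA based on the covariance of the inner columns (indices J)."""
    X_tilde = X @ D.T                                   # step 1: rows D X_i
    P = X_tilde @ U.T                                   # step 3: projections u_j' X_tilde_i
    med = np.median(P, axis=0)
    mad = np.median(np.abs(P - med), axis=0)
    A = (P - med) / mad                                 # step 4: rows are the s-vectors a_i
    S = np.cov(A[J].T)                                  # step 5: covariance from inner a_i only
    eigval, eigvec = np.linalg.eigh(S)
    order = np.argsort(eigval)[::-1]                    # eigenvalues in decreasing order
    eigval, eigvec = eigval[order], eigvec[:, order]
    t = int(np.sum(eigval > 1e-6))                      # step 6: keep components with
    return A @ eigvec[:, :t]                            #         eigenvalues above 1e-6

rng = np.random.default_rng(6)
n, d = 200, 3
X = rng.standard_normal((n, d))
D = np.eye(d)                                           # placeholder SICS matrix
U = rng.standard_normal((5 * d, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)           # stand-in direction set, s = 5d
J = np.arange(150)                                      # placeholder inner indices
V = algorithm_A(X, D, U, J)                             # n x t matrix, rows V_i
print(V.shape)                                          # typically (n, d)
```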

We note that the above algorithm is affine invariant, in the sense that if we transform $X_i \to Y_i = AX_i + b$, $1 \le i \le n$, where $A$ is any $d \times d$ nonsingular matrix and $b$ is any vector in $\mathbb{R}^d$, then the matrix of vectors $V_i$, $1 \le i \le n$, will remain the same up to a global sign change. This is stated formally as follows (the proof is straightforward and omitted).

Lemma 4 Let $X_1, \ldots, X_n$ be a random sample from a $d$-dimensional distribution function $F$. Suppose $Y_i = AX_i + b$, where $A$ is any $d \times d$ nonsingular matrix and $b$ is any vector in $\mathbb{R}^d$. Put $\mathbf{Y}_n = [Y_1 \cdots Y_n]$. Further, assume that $\mathbf{V}_n(\mathbf{X}_n)$ is obtained using Algorithm A starting with the data matrix $\mathbf{X}_n$, and that $\mathbf{V}_n(\mathbf{Y}_n)$ is obtained using Algorithm A starting with the data matrix $\mathbf{Y}_n$. Then
$$\mathbf{V}_n(\mathbf{Y}_n) = \mathrm{sgn}(k)\, \mathbf{V}_n(\mathbf{X}_n), \qquad (7)$$
where $k = k(A, b, \mathbf{X}_n)$.

4 Robust Mahalanobis Spatial Outlyingness using $\mathbf{V}_n$

In Algorithm A, the data vector $X_i$ becomes replaced by $V_i$, $1 \le i \le n$, or more generally points $x \in \mathbb{R}^d$ are mapped onto points $v \in \mathbb{R}^t$. We formulate our new robust affine invariant outlyingness function $O_{RMSP}(x, \mathbf{X}_n)$ (denoted by RMSP) as the robust Mahalanobis spatial outlyingness function on the vectors $V_i$, $1 \le i \le n$. Thus $O_{RMSP}(x, \mathbf{X}_n) = O_{MS}(v(x), \mathbf{V}_n)$, where $v(x)$ is the vector associated with $x$ via Algorithm A. As before, we use DT (on $\mathbf{V}_n$) for the standardization defining $O_{MS}$. After transforming the given data $\mathbf{X}_n = [X_1 \cdots X_n]_{d \times n}$ to $\mathbf{V}_n = [V_1 \cdots V_n]_{t \times n}$, using inner observations indexed by $J$, we follow the steps below to form RMSP.

Algorithm B, for RMSP

1. Using the Mahalanobis spatial outlyingness function (5) on $\mathbf{V}_n$ (instead of $\mathbf{X}_n$), and with TR matrix $M(\mathbf{V}_n) = \mathrm{DT}(\mathbf{V}_n)$, i.e., using
$$O_{MS}(v, \mathbf{V}_n) = \left\| n^{-1} \sum_{i=1}^{n} S\big(M(\mathbf{V}_n)(v - V_i)\big) \right\|, \quad v \in \mathbb{R}^t, \qquad (8)$$
carry out the spatial trimming method of Section 2.2 based on trimming threshold $t/(t + 2)$. Let $J^*$ denote the set of indices of the inner observations so obtained, and let $\mathbf{V}_n^*$ denote the inner observations.

2. Now robustify $\mathrm{DT}(\mathbf{V}_n)$ by computing it just on $\mathbf{V}_n^*$: $\mathrm{DT}(\mathbf{V}_n^*)$.

3. Then RMSP is defined according to (8), but with the robustified DT for standardization and with the averaging taken over just the inner observations. That is, denoting by $K$ the cardinality of $J^*$, and with $M(\mathbf{V}_n^*) = \mathrm{DT}(\mathbf{V}_n^*)$, we form
$$O_{RMSP}(x, \mathbf{X}_n) = \left\| K^{-1} \sum_{j \in J^*} S\big(M(\mathbf{V}_n^*)(v(x) - V_j)\big) \right\|, \quad v(x) \in \mathbb{R}^t. \qquad (9)$$
In particular, for the data points $X_i$, $1 \le i \le n$, the corresponding RMSP outlyingness values are given by
$$O_{RMSP}(X_i, \mathbf{X}_n) = \left\| K^{-1} \sum_{j \in J^*} S\big(M(\mathbf{V}_n^*)(V_i - V_j)\big) \right\|, \quad 1 \le i \le n. \qquad (10)$$

Lemma 5 $O_{RMSP}(x, \mathbf{X}_n)$ is affine invariant, in the sense that if we transform $X_i \to Y_i = AX_i + b$, where $A_{d \times d}$ is nonsingular and $b$ is any vector in $\mathbb{R}^d$, then
$$O_{RMSP}(y, \mathbf{Y}_n) = O_{RMSP}(x, \mathbf{X}_n),$$
with $y = Ax + b$. (The proof is immediate and omitted.)
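The following self-contained sketch of Algorithm B operates on a matrix V of the vectors V_i (rows). It is a rough stand-in: the TR matrix here is the symmetric inverse square root of the plain sample covariance rather than the Dümbgen-Tyler matrix DT used in the paper, and the planted outlier cluster in the example is ours.

```python
import numpy as np

def spatial_sign_rows(Y, eps=1e-12):
    """Row-wise spatial sign y / ||y|| (zero rows map to zero)."""
    return Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), eps)

def ms_outlyingness(V, M, ref):
    """||average over rows of ref of S(M (v - V_j))|| for each row v of V."""
    return np.array([np.linalg.norm(
        spatial_sign_rows((v - ref) @ M.T).mean(axis=0)) for v in V])

def inv_sqrt_cov(V):
    """Placeholder TR matrix: symmetric inverse square root of the sample
    covariance (the paper uses the Dumbgen-Tyler matrix DT instead)."""
    w, P = np.linalg.eigh(np.cov(V.T))
    return P @ np.diag(w ** -0.5) @ P.T

def rmsp(V):
    """Sketch of Algorithm B: trim with threshold t/(t+2), recompute the TR
    matrix on the inner observations V*, and return the RMSP values (10)."""
    n, t = V.shape
    O = ms_outlyingness(V, inv_sqrt_cov(V), V)         # step 1: O_MS on all of V_n
    V_star = V[O <= t / (t + 2)]                       # inner observations (indices J*)
    M_star = inv_sqrt_cov(V_star)                      # step 2: robustified TR matrix
    return ms_outlyingness(V, M_star, V_star)          # step 3: average over V* only

rng = np.random.default_rng(7)
V = rng.standard_normal((200, 3))
V[:10] += 8.0                                          # plant a small outlier cluster
scores = rmsp(V)
print(np.argsort(scores)[-10:])                        # the planted points should rank highest
```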

5 Comparison of RMSP, SUP, and MD

5.1 Visual illustration with artificial bivariate data

Using artificially created bivariate data sets, we are able to provide visual illustrations of important differences among RMSP, SUP, and MD. Here MD denotes robust Mahalanobis distance outlyingness with the Fast-MCD location and scatter measures based on the inner 50% of observations. Besides identifying the more extreme outliers, an outlyingness function also has the role of providing a structural description of the data set. In effect, this provides a quantile-based description. Our plots exhibit 50%, 75%, and 90% outlyingness contours (based on the given outlyingness function and enclosing 50%, 75%, and 90% of the observations, respectively). Figure 1 shows a triangularly shaped data set with and without outliers, Figure 2 a bivariate Pareto data set with and without outliers, and Figure 3 a bivariate Normal data set with and without outliers. Based on these displays, we make the following comments.

1. SUP is dominated by MD and RMSP. All three of SUP, MD, and RMSP have similar contours: strictly ellipsoidal for MD and approximately so for the others. However, the contours of SUP are not as smooth as those of the others and its computational burden is considerably greater.

2. The robustness properties of SUP, MD, and RMSP are comparable for moderate contamination. For moderate contamination (somewhat less than 50%), all three methods perform equally well in constructing contours that account for the outliers without becoming distorted by them. For each of SUP, MD, and RMSP, the outlier cluster becomes placed on or beyond the 90% contour and well outside the 75% contour, rather than, for example, pulling the 75% contour toward itself. On the other hand, if protection against very high contamination is needed, SUP and MD can be adjusted to have nearly 50% BP, while RMSP cannot.

3. RMSP is overall superior for moderate contamination. With less computational burden and without imposing elliptical contours, RMSP is as robust as MD and SUP.

5.2 Numerical experiments with artificial data

A small experiment comparing SUP (with 100,000 directions), MD (with Fast-MCD), and RMSP was carried out on a typical PC for multivariate standard Normal data with sample size n = 100 and dimensions d = 2, 5, 10, and 20 (only d = 2 for SUP).

1. First, each of the 15 most extreme observations according to Euclidean distance was replaced by itself times the factor 5. All three methods correctly selected the 15 outliers as distinctly more outlying than the other 85 observations (although not always with the same ranks). Times in seconds:

   d     SUP       MD      RMSP
   2     330.69    0.48    2.55
   5     --        2.03    3.65
   10    --        3.97    4.60
   20    --        6.73    4.94

Clearly, SUP is prohibitively computational compared with MD and RMSP. The well-established MD with Fast-MCD exhibits high computational efficiency for very low dimensions. The above times are based on setting the tuning parameter nsamp in MD(MCD) using rrcov to make its breakdown point match that of RMSP, namely (d + 2)^{-1}. Thus we used nsamp = 6, 10, 11, and 12 for d = 2, 5, 10, and 20 (versus the default nsamp = 500, which results in slightly higher computation times). However, the advantage of MD is quickly lost as d increases, and the computational appeal of RMSP becomes evident.

2. As a more challenging variant carried out just for d = 2, each of the 15 most extreme observations according to Euclidean distance was replaced by itself plus the vector (10, 0). Both RMSP and MD again correctly ranked the 15 outliers as more extreme in outlyingness than the other 85 (although not always in the same order).

From all considerations, RMSP is the method of choice for moderate contamination. A competitor is MD when elliptical contours are acceptable, while SUP may be dropped from consideration. This study raises the question of just how much we do or do not like the ellipsoidal contours of MD. A detailed evaluation of this question is worthwhile but beyond the scope of the present paper.

5.3 Actual data

For two well-studied higher-dimensional data sets, we examine the performance of MD and RMSP. Here, for compatibility with Becker and Gather (1999), MD denotes the robust Mahalanobis distance with S-estimates for location and covariance based on Tukey's biweight (BW) function. (Separate investigation shows little difference between this MD and that based on MCD.)

Stackloss Data (d = 4). The stackloss data set of Brownlee (1965) has been much studied as a test of outlier detection methods. It consists of n = 21 observations in dimension d = 4. See Rousseeuw and Leroy (1987) and Becker and Gather (1999) for the data set, for references to many studies, and for general discussion. All robust methods cited in the literature, including MD, rank observations 1, 3, 4, and 21 as the top 4 in outlyingness, with little difference in order. Our RMSP agrees with these rankings and also with MD on observation 2 as 5th, 13 as 6th, and 17 as 7th. Thus RMSP corroborates existing approaches while being more computationally attractive. Figure 4 shows that the main difference between MD and RMSP with this data set is the ranking of moderately outlying observations.

Pollution and Mortality Data (d = 13). Becker and Gather (1999) study in detail a 13-dimensional data set available at the Data and Story Library (http://lib.stat.cmu.edu/dasl/) which provides information on social and economic conditions, climate, air pollution, and mortality for 60 Standard Metropolitan Statistical Areas (a standard U.S. Census Bureau designation of the region around a major city) in the United States. They omit a case with incomplete data and rank the remaining n = 59 cases in dimension d = 13 by MD. Here we compare RMSP and MD. Both agree on the extreme observations indexed 28, 47, 46, 48, ranked in this order. Also, they agree on 11 cases as among the next 12 cases, although with somewhat differing ranks. The exceptions are that MD ranks observation 36 as 14th and observation 38 as 22nd, while RMSP ranks these as 17th and 16th, respectively. The difference in ranks 16 versus 22 for observation 38 raises the question of whether observation 36 (New Orleans) in comparison with observation 38 (Philadelphia) should rank far apart, 14th versus 22nd as per MD, or closely, 17th versus 16th as per RMSP.

Coordinatewise dotplots of all 13 variables for these cases reveal that these points overall are not very outlying, except that 36 is moderate to extreme in outlyingness for mortality and moderate for SO2 pollution, and that 38 is moderate to strong in outlyingness for population size, moderate for population density, and nearly moderate for SO2 pollution. On this basis we regard 36 and 38 as comparable cases, both moderate in outlyingness, corroborating the opinion of RMSP over that of MD. This illustrates the advantage of not imposing elliptical contours: the levels of moderate outlyingness can be delineated better. Figure 4 shows that the main difference between MD and RMSP with this data set is the ranking of moderately outlying observations.

6 Conclusions and Recommendations

Based on our theoretical development and empirical analysis, we make the following basic conclusions about SUP, MD, and RMSP.

1. RMSP has all four desired properties: (i) affine invariance, (ii) robustness, (iii) easy computability in all dimensions, and (iv) no imposition of elliptical contours.

2. SUP is outclassed by RMSP, which is also a projection pursuit approach but one much more attractive computationally and, among other finite-direction approaches, easy to understand.

3. Robust versions of MD and RMSP are competitive, but, by not imposing elliptical contours, RMSP is of particular advantage in describing intermediate outlyingness structure.

4. The spatial trimming technique used here with $\mathbf{V}_n$ to develop a new projection pursuit approach is used just on $\mathbf{X}_n$ in Mazumder and Serfling (2013) to produce a new robust spatial outlyingness. It provides another alternative to MD and performs similarly to RMSP, although the latter seems slightly better due to acquiring projection pursuit information. A detailed comparison in some depth would be of interest but is beyond the scope of the present paper.

5. In practice, one might use both MD (when computationally feasible) and RMSP. When they agree, one can be confident. When they disagree, the relevant observations can be investigated.

Acknowledgements

The authors gratefully acknowledge very helpful, insightful comments from reviewers. Useful input from G. L. Thompson is also greatly appreciated. Support under National Science Foundation Grants DMS-0805786 and DMS-1106691 and National Security Agency Grant H98230-08-1-0106 is sincerely acknowledged.

References

[1] Becker, C. and Gather, U. (1999). The masking breakdown point of multivariate outlier identification rules. Journal of the American Statistical Association 94 947-955.
[2] Brownlee, K. A. (1965). Statistical Theory and Methodology in Science and Engineering, 2nd edition. John Wiley & Sons, New York.
[3] Chaudhuri, P. (1996). On a geometric notion of quantiles for multivariate data. Journal of the American Statistical Association 91 862-872.
[4] Dang, X. and Serfling, R. (2010). Nonparametric depth-based multivariate outlier identifiers, and masking robustness properties. Journal of Statistical Planning and Inference 140 198-213.
[5] Dümbgen, L. (1998). On Tyler's M-functional of scatter in high dimension. Annals of the Institute of Statistical Mathematics 50 471-491.
[6] Dümbgen, L. and Tyler, D. E. (2005). On the breakdown properties of some multivariate M-functionals. Scandinavian Journal of Statistics 32 247-264.
[7] Fang, K.-T. and Wang, Y. (1994). Number Theoretic Methods in Statistics. Chapman and Hall, London.
[8] Filzmoser, P., Maronna, R., and Werner, M. (2008). Outlier identification in high dimensions. Computational Statistics & Data Analysis 52 1694-1711.
[9] Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers C-23 881-889.
[10] Ilmonen, P., Oja, H., and Serfling, R. (2012). On invariant coordinate system (ICS) functionals. International Statistical Review 80 93-110.
[11] Kruskal, J. B. (1969). Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new "index of condensation". In Statistical Computation (R. C. Milton and J. A. Nelder, eds.). Academic Press, New York.
[12] Kruskal, J. B. (1972). Linear transformation of multivariate data to reveal clustering. In Multidimensional Scaling: Theory and Application in the Behavioral Sciences, I, Theory. Seminar Press, New York and London.
[13] Liu, R. Y. (1992). Data depth and multivariate rank tests. In L1-Statistics and Related Methods (Y. Dodge, ed.), pp. 279-294, North-Holland, Amsterdam.
[14] Maronna, R. A., Martin, R. D., and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, Chichester, England.

[15] Mazumder, S. and Serfling, R. (2013). A robust sample spatial outlyingness function. Journal of Statistical Planning and Inference 143 144-159.
[16] Pan, J.-X., Fung, W.-K., and Fang, K.-T. (2000). Multiple outlier detection in multivariate data using projection pursuit techniques. Journal of Statistical Planning and Inference 83 153-167.
[17] Peña, D. and Prieto, F. J. (2001). Robust covariance matrix estimation and multivariate outlier rejection. Technometrics 43 286-310.
[18] Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. John Wiley & Sons, New York.
[19] Rousseeuw, P. J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics 41 212-223.
[20] Serfling, R. (2004). Nonparametric multivariate descriptive measures based on spatial quantiles. Journal of Statistical Planning and Inference 123 259-278.
[21] Serfling, R. (2010). Equivariance and invariance properties of multivariate quantile and related functions, and the role of standardization. Journal of Nonparametric Statistics 22 915-936.
[22] Switzer, P. (1970). Numerical classification. In Geostatistics. Plenum, New York.
[23] Switzer, P. and Wright, R. M. (1971). Numerical classification applied to certain Jamaican Eocene nummulitids. Mathematical Geology 3 297-311.
[24] Tyler, D. E. (1987). A distribution-free M-estimator of multivariate scatter. Annals of Statistics 15 234-251.
[25] Tyler, D. E., Critchley, F., Dümbgen, L., and Oja, H. (2009). Invariant co-ordinate selection. Journal of the Royal Statistical Society, Series B 71 549-592.
[26] Zuo, Y. and Serfling, R. (2000b). General notions of statistical depth function. Annals of Statistics 28 461-482.
[27] Zuo, Y. (2003). Projection-based depth functions and associated medians. Annals of Statistics 31 1460-1490.

Figure 1: Triangular data set with 50%, 75%, and 90% outlyingness contours for SUP (upper), MD (middle), and RMSP (lower), without outliers (left) and with extreme outliers including a cluster (right).

Figure 2: Bivariate Pareto data set with 50%, 75%, and 90% outlyingness contours for SUP (upper), MD (middle), and RMSP (lower), without outliers (left) and with extreme outliers including a cluster (right).

Figure 3: Bivariate Normal data set with 50%, 75%, and 90% outlyingness contours for SUP (upper), MD (middle), and RMSP (lower), without outliers (left) and with extreme replacement outliers including a cluster (right).

Figure 4: RMSP versus MD: Stackloss data (left) and Pollution data (right).