Computationally Easy Outlier Detection via Projection Pursuit with Finitely Many Directions


Robert Serfling 1 and Satyaki Mazumder 2

University of Texas at Dallas and Indian Institute of Science, Education and Research

December,

1 Department of Mathematics, University of Texas at Dallas, Richardson, Texas, USA. Email: serfling@utdallas.edu. Website: serfling. Support by NSF Grants DMS and DMS and NSA Grant H is gratefully acknowledged.
2 Indian Institute of Science, Education and Research, Kolkata, India

Abstract

Outlier detection is fundamental to data analysis. Desirable properties are affine invariance, robustness, low computational burden, and nonimposition of elliptical contours. However, leading methods fail to possess all of these features. The Mahalanobis distance outlyingness (MD) imposes elliptical contours. The projection outlyingness (SUP), powerfully involving projections of the data onto all univariate directions, is highly intensive computationally. Computationally easy variants using projection pursuit with but finitely many directions have been introduced, but these fail to capture at once the other desired properties. Here we develop a robust Mahalanobis spatial outlyingness on projections (RMSP) function which indeed satisfies all four desired properties. Pre-transformation to a strong invariant coordinate system yields affine invariance, spatial trimming yields robustness, and spatial Mahalanobis outlyingness is used to obtain computational ease and smooth, unconstrained contours. From empirical study using artificial and actual data, our findings are that SUP is outclassed by MD and RMSP, that MD and RMSP are competitive, and that RMSP is especially advantageous in describing the intermediate outlyingness structure when elliptical contours are not assumed.

AMS 2000 Subject Classification: Primary 62H99; Secondary 62G99.

Key words and phrases: Outlier detection; Projection pursuit; Nonparametric; Multivariate; Affine invariance; Robustness.

1 Introduction

Outlier identification is fundamental to multivariate statistical analysis and data mining. Using a selected outlyingness function defined on the sample or input space, outliers are those points whose outlyingness value exceeds a specified threshold. Desirably, outlyingness functions have the properties (i) robustness against the presence of outliers, (ii) weak affine invariance (i.e., transformation to other coordinates should not affect relative outlyingness rankings and comparisons), (iii) computational efficiency in any practical dimension, and (iv) nonimposition of elliptical contours. We point out that, although elliptical contours are attractive and in some cases justified, an outlyingness function that does not arbitrarily impose them can more effectively identify which points belong to the layers of moderate outlyingness. The widely used Mahalanobis distance outlyingness (MD) satisfies (i)-(iii) but gives up (iv). On the other hand, the projection outlyingness (SUP) based on the projection pursuit approach satisfies (i), (ii), and (iv) but gives up (iii). Here we construct a modified projection pursuit outlyingness (RMSP) that satisfies all of (i)-(iv).

Projection pursuit techniques, originally proposed and experimented with by Kruskal (1969, 1972), are extremely powerful in multivariate data analysis. Related ideas occur in Switzer (1970) and Switzer and Wright (1971). A key implementation is due to Friedman and Tukey (1974). Recently, projection depth has received significant attention in the literature (Liu, 1992; Zuo and Serfling, 2000b; Zuo, 2003). The corresponding projection outlyingness (SUP) is a multivariate extension of the univariate scaled deviation outlyingness $O(x) = |x - \mu|/\sigma$, where $\mu$ and $\sigma$ are univariate measures of location and spread. Specifically, for a multivariate data point $x$, the projection outlyingness is given by the supremum of the univariate outlyingness values of the projections of $x$ onto all lines (Liu, 1992; Zuo and Serfling, 2000b; Zuo, 2003; Serfling, 2004; Dang and Serfling, 2010).

Let us make this precise. Let $X$ have distribution $F_X$ on $\mathbb{R}^d$ and, for any unit vector $u = (u_1, \ldots, u_d)'$ in $\mathbb{R}^d$, let $F_{u'X}$ denote the induced univariate distribution of $u'X$. With $\mu$ given by the Median, say $\nu$, and $\sigma$ by the MAD (median absolute deviation from the median), say $\eta$, the associated well-known projection outlyingness (SUP) is

$$O_P(x, F_X) = \sup_{\|u\| = 1} \frac{|u'x - \nu(F_{u'X})|}{\eta(F_{u'X})}, \quad x \in \mathbb{R}^d, \qquad (1)$$

representing the worst case scaled deviation outlyingness of projections of $x$ onto lines. For a $d$-dimensional data set $\mathbf{X}_n = \{X_1, \ldots, X_n\}$, the sample version $O_P(x, \mathbf{X}_n)$ is affine invariant, highly masking robust (Dang and Serfling, 2010), and does not impose ellipsoidal contours. However, it is computationally very demanding.
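For concreteness, the following is a minimal sketch of the sample version of (1) in Python with NumPy (our choice of language; the paper itself points to R packages for related computations). Since the supremum over all directions is not computable exactly, the sketch maximizes over a large number of random unit directions (the default matches the 100,000 directions used for SUP in Section 5.2); the function name, signature, and defaults are illustrative assumptions, not from the paper.

```python
import numpy as np

def sup_outlyingness(X, n_dirs=100_000, rng=None):
    """Approximate the sample SUP outlyingness (1) at each row of the n x d
    data matrix X by maximizing the scaled deviation of the projections over
    n_dirs random unit directions (a stand-in for the supremum over all u)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    U = rng.standard_normal((n_dirs, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit direction vectors
    P = X @ U.T                                     # n x n_dirs projections
    med = np.median(P, axis=0)                      # directionwise medians
    mad = np.median(np.abs(P - med), axis=0)        # directionwise MADs
    mad[mad == 0] = np.finfo(float).eps             # guard degenerate directions
    return np.max(np.abs(P - med) / mad, axis=1)
```

Even this crude approximation makes the computational burden visible: the projection matrix alone has $n \times$ n_dirs entries, which is what motivates the finite-direction construction developed below.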

To overcome the computational burden, we develop an alternative projection pursuit outlyingness entailing only finitely many projections. Implementation involves the following key steps, which are described in detail in the sequel.

Step 1. With only finitely many projections, $O_P(x, \mathbf{X}_n)$ is no longer affine invariant, not even orthogonally invariant. However, if we first transform the data to $\mathbf{X}_n^* = D(\mathbf{X}_n)\mathbf{X}_n$ using a strong invariant coordinate system (SICS) transformation $D(\mathbf{X}_n)$, then the quantity

$$\frac{u'x^* - \nu(u'\mathbf{X}_n^*)}{\eta(u'\mathbf{X}_n^*)}$$

for any $u$ becomes invariant under affine transformation of the original data via $\mathbf{X}_n \mapsto \mathbf{Y}_n = A\mathbf{X}_n + b$ for nonsingular $d \times d$ $A$ and $d$-vector $b$ (see Serfling, 2010). Consequently, to capture affine invariance for our method, a SICS transformation is applied to $\mathbf{X}_n$ before taking projections. In order not to lose information on the outliers, we need the SICS transformation $D(\mathbf{X}_n)$ to be robust. This is accomplished by constructing $D(\mathbf{X}_n)$ from a subset $Z_N$ of $N$ inner observations from $\mathbf{X}_n$, selected in an affine invariant way, with $N = 0.5n$ or $N = 0.75n$, for example. One possibility for $Z_N$ is the $N$ inner observations used in computing the well-known Fast-MCD sample scatter matrix (Rousseeuw and Van Driessen, 1999, implemented in the R packages MASS, rrcov, and robustbase, for example). However, this becomes computationally prohibitive as the dimension $d$ increases. Instead, we use a computationally attractive spatial trimming approach (Mazumder and Serfling, 2013) that chooses the $N$ observations with lowest values of sample Mahalanobis spatial outlyingness.

Step 2. The computational burden is reduced by taking relatively few projections, but enough to capture sufficient information. From both intuitive and computational considerations, the use of deterministic directions strongly appeals. In particular, we choose $s = 5d$ directions $\mathcal{U} = \{u_i, 1 \le i \le s\}$ approximately uniformly distributed on the unit sphere in $\mathbb{R}^d$ but lying on distinct diameters. Fang and Wang (1994) provide convenient numerical algorithms for this purpose.

Step 3. Projection pursuit is carried out. For each $X_i^*$ in $\mathbf{X}_n^*$ and direction $u_j$ in $\mathcal{U}$, the scaled deviation outlyingness of the projected value $u_j'X_i^*$ is computed:

$$a_n(i, j) = \frac{u_j'X_i^* - \nu(u_j'\mathbf{X}_n^*)}{\eta(u_j'\mathbf{X}_n^*)}, \quad 1 \le j \le s, \ 1 \le i \le n,$$

with $\nu(u_j'\mathbf{X}_n^*)$ and $\eta(u_j'\mathbf{X}_n^*)$ the sample median and sample MAD, respectively, of the projected data set $u_j'\mathbf{X}_n^*$. The original data set $\mathbf{X}_n$ of $n$ $d$-vectors thus becomes replaced by a new data set $\mathbf{A}_n = [a_n(i, j)]_{s \times n}$ of $n$ $s$-vectors of projected scaled deviations. These contain the relevant outlyingness information.
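Computationally, Steps 2-3 amount to one projection matrix with columnwise median/MAD standardization. Below is a minimal NumPy sketch (continuing the Python used above); the helper name is ours, and random directions on the sphere are used as a simple stand-in for the deterministic Fang and Wang (1994) construction, which we do not reproduce here.

```python
import numpy as np

def scaled_deviation_matrix(Xs, s=None, rng=None):
    """Steps 2-3 (sketch): project the SICS-standardized n x d data Xs onto
    s unit directions (default s = 5d) and return the n x s matrix whose
    (i, j) entry is a_n(i, j) = (u_j' X_i* - median_j) / MAD_j.  Rows are
    the vectors a_i; the paper's A_n is the transpose, with columns a_i."""
    rng = np.random.default_rng(rng)
    n, d = Xs.shape
    s = 5 * d if s is None else s
    U = rng.standard_normal((s, d))              # stand-in for Fang-Wang directions
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    P = Xs @ U.T                                 # n x s projected values
    med = np.median(P, axis=0)
    mad = np.median(np.abs(P - med), axis=0)
    mad[mad == 0] = np.finfo(float).eps          # guard degenerate directions
    return (P - med) / mad
```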

Step 4. Redundancy in $\mathbf{A}_n$ is eliminated by a PCA reduction. With $J$ indexing the $N$ inner observations of Step 1, the (robust) sample covariance matrix $S_n$ of the data set given by the columns of $\mathbf{A}_n$ indexed by $J$ is formed. Then PCA based on $S_n$ is used to transform and reduce the data vectors in $\mathbf{A}_n$ to $t$-dimensional data vectors $\mathbf{V}_n$, where $t$ denotes the number of positive eigenvalues of $S_n$. Typically, $t = d$.

Step 5. We construct our new outlyingness function by again employing the Mahalanobis spatial outlyingness function and the spatial trimming step of Step 1, but now applied to the data $\mathbf{V}_n$ instead of $\mathbf{X}_n$. This yields a Robust Mahalanobis Spatial outlyingness function based on Projections (RMSP), $O_{RMSP}(X_i, \mathbf{X}_n)$, $1 \le i \le n$, a robust outlyingness ordering of the original data points in $\mathbf{X}_n$ via projection pursuit with finite $\mathcal{U}$.

Summary:

$\mathbf{X}_n$ (original data: $n$ vectors in $\mathbb{R}^d$)
$\rightarrow$ (via $D(\mathbf{X}_n)$) $\mathbf{X}_n^*$ (SICS-transformed data: $n$ vectors in $\mathbb{R}^d$)
$\rightarrow$ (via $O(x, F)$ and $\mathcal{U}$) $\mathbf{A}_n$ ($n$ projected scaled deviation vectors in $\mathbb{R}^s$, $s = 5d$)
$\rightarrow$ (via PCA) $\mathbf{V}_n$ ($n$ vectors in $\mathbb{R}^t$, typically with $t = d$)
$\rightarrow$ (via robust Mahalanobis spatial outlyingness) $O_{RMSP}$

Details for Step 1 are provided in Section 2. Following a method of Serfling (2010) for construction of sample SICS functionals in conjunction with spatial trimming of observations (Mazumder and Serfling, 2013), we introduce an easily computable and robust SICS functional to be used for the purposes of this paper. Details for Steps 3 and 4 are provided in Section 3, and those for Step 5 in Section 4, leading to our affine invariant, robust, and easily computable outlyingness function RMSP.

In Section 5, we compare SUP, MD, and RMSP using artificial bivariate data sets for the sake of visual comparisons, artificial higher-dimensional data sets, and actual higher-dimensional data sets, the Stackloss Data ($n = 21$, $d = 4$) and the Air Pollution and Mortality Data ($n = 59$, $d = 13$), which have been studied extensively in the literature. Our findings and conclusions are provided in Section 6. Briefly, SUP is outclassed by MD and RMSP, MD and RMSP are competitive, and RMSP is especially advantageous in describing the intermediate outlyingness structure when elliptical contours are not assumed.

We conclude this introduction by mentioning some notable previous work on projection pursuit outlyingness using only finitely many projections. Despite their various strengths, however, none of these methods captures all of properties (i)-(iv).

Pan, Fung, and Fang (2000) use finitely many deterministic directions approximately uniformly scattered and develop a finite-direction approach calculating a sample quadratic form based on the differences $\{O(u'x, u'\mathbf{X}_n) - O(u'x, F_{u'X}),\ u \in \mathcal{U}\}$. This imposes elliptical outlyingness contours. Further, since these differences involve the unknown $F$, a bootstrap step is incorporated. The number of directions is data-driven, and the method is not affine invariant.

Peña and Prieto (2001) introduce an affine invariant method using the supremum and $2d$ data-driven directions. These are selected using univariate measures of kurtosis over candidate directions, choosing the $d$ directions with local extremes of high kurtosis and the $d$ directions with local extremes of low kurtosis. In a very complex algorithm, the outliers are ultimately selected using Mahalanobis distance (thus entailing elliptical contours).

Filzmoser, Maronna, and Werner (2008) extend the Peña and Prieto (2001) approach into an even more elaborate one, adding a principal components step. This achieves certain improvements in performance for detection of location outliers, especially in high dimension. However, this gives up affine invariance. See also Maronna, Martin, and Yohai (2006).

2 Step 1: A Robust SICS Transformation

In general, by weak affine invariance of a functional $T(F)$ is meant that

$$T(Ax + b, F_{AX+b}) = c\,T(x, F_X), \quad x \in \mathbb{R}^d,$$

where $A_{d \times d}$ is nonsingular, $b$ is any vector in $\mathbb{R}^d$, and $c = c(A, b, F_X)$ is a constant. Here, corresponding to some given finite $\mathcal{U} = \{u_1, \ldots, u_s\}$, we are interested in the functional

$$\zeta(x, \mathcal{U}, F_X) = \left( \frac{u_1'x - \nu(F_{u_1'X})}{\eta(F_{u_1'X})}, \ldots, \frac{u_s'x - \nu(F_{u_s'X})}{\eta(F_{u_s'X})} \right), \quad x \in \mathbb{R}^d,$$

whose components give (signed) scaled deviation outlyingness values for the projections of a point $x$ onto the lines represented by $\mathcal{U}$. It is straightforward to verify by simple counterexamples that $\zeta(x, \mathcal{U}, F_X)$ is not weakly affine invariant (nor even orthogonally invariant). However, we can make it become weakly affine invariant by first applying to the variable $x$ a strong invariant coordinate system (SICS) functional (Serfling, 2010). Also, as applied to the sample version $\zeta(x, \mathcal{U}, \mathbf{X}_n)$, we want the SICS transformation to be robust.

2.1 Standardization of $\zeta(x, \mathcal{U}, F_X)$ using a SICS functional

Definition 1 (Serfling, 2010). A positive definite matrix-valued functional $D(F)$ is a strong invariant coordinate system (SICS) functional if, for $Y = AX + b$,

$$D(F_Y) = k_3\,D(F_X)A^{-1},$$

where $A_{d \times d}$ is nonsingular, $b$ is any vector in $\mathbb{R}^d$, and $k_3 = k_3(A, b, F_X)$ is a scalar.

Detailed treatment is found in Tyler, Critchley, Dümbgen, and Oja (2009), Serfling (2010), and Ilmonen, Oja, and Serfling (2012). We now establish that the SICS-standardized vector $\zeta(D(F_X)x, \mathcal{U}, F_{D(F_X)X})$ is weakly affine invariant.

Theorem 2 Let $X$ have distribution $F_X$ on $\mathbb{R}^d$ and let $X^* = D(F_X)X$, where $D(F)$ is a SICS functional. Then $\zeta(x^*, \mathcal{U}, F_{X^*})$ is weakly affine invariant.

Proof of Theorem 2. Suppose $X \mapsto Y = AX + b$, where $A$ is a $d \times d$ nonsingular matrix and $b$ is any vector in $\mathbb{R}^d$. Now, transform $Y \mapsto Y^* = D(F_Y)Y$. Then, using Definition 1, we have

$$Y^* = D(F_{AX+b})(AX + b) = k\,D(F_X)X + k\,D(F_X)A^{-1}b = k\,X^* + c,$$

where $k = k(A, b, F_X)$ and $c = k\,D(F_X)A^{-1}b$. Hence, for $1 \le j \le s$,

$$\frac{u_j'y^* - \mathrm{med}(u_j'Y^*)}{\mathrm{MAD}(u_j'Y^*)} = \frac{u_j'(k x^* + c) - \mathrm{med}(u_j'(k X^* + c))}{\mathrm{MAD}(u_j'(k X^* + c))} = \mathrm{sgn}(k)\,\frac{u_j'x^* - \mathrm{med}(u_j'X^*)}{\mathrm{MAD}(u_j'X^*)}.$$

Thus $\zeta(y^*, \mathcal{U}, F_{Y^*}) = \mathrm{sgn}(k)\,\zeta(x^*, \mathcal{U}, F_{X^*})$ and the result follows.

2.2 A robust SICS transformation via spatial trimming

In general, following Serfling (2010), we may construct a SICS functional as follows. Let $Z_N$ be a subset of $\mathbf{X}_n$ of size $N$ obtained through some affine invariant and permutation invariant procedure. Then form $d + 1$ means $\bar Z_1, \ldots, \bar Z_{d+1}$ based, respectively, on consecutive blocks of size $m = N/(d+1)$ from $Z_N$, and define the matrix

$$W(\mathbf{X}_n) = [(\bar Z_2 - \bar Z_1), \ldots, (\bar Z_{d+1} - \bar Z_1)]_{d \times d}.$$

Then the matrix

$$D(\mathbf{X}_n) = W(\mathbf{X}_n)^{-1} \qquad (2)$$

is a sample SICS functional.
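A minimal NumPy sketch of the sample SICS functional (2), taking the affine invariantly selected inner indices as given (in this paper they come from the spatial trimming developed below); the function and argument names are ours, not the paper's.

```python
import numpy as np

def sics_from_inner(X, inner_idx):
    """Sample SICS functional (2) (sketch): from the N inner observations of
    the n x d data X, form d+1 block means over consecutive blocks of size
    m = N // (d+1), build W = [Zbar_2 - Zbar_1, ..., Zbar_{d+1} - Zbar_1],
    and return D = W^{-1}.  Assumes N >= d+1 and that W is nonsingular."""
    n, d = X.shape
    Z = X[inner_idx]
    m = len(inner_idx) // (d + 1)
    Zbar = np.array([Z[k * m:(k + 1) * m].mean(axis=0) for k in range(d + 1)])
    W = (Zbar[1:] - Zbar[0]).T          # columns are the d difference vectors
    return np.linalg.inv(W)
```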

The robustness and computational burden rest on the method of choosing $Z_N$. One implementation is to let $Z_N$ be the set of the observations selected and used in computing the well-known Fast-MCD scatter matrix with $N = \alpha n$, where $\alpha = 0.5$ or $0.75$. This uses all the observations in a permutationally invariant way in selecting $Z_N$ and all of the latter observations in defining $W(\mathbf{X}_n)$. However, due to its combinatorial character, this approach becomes computationally prohibitive as $d$ increases. Instead we shall develop a computationally easy robust sample SICS functional based on letting $Z_N$ consist of the inner observations obtained in spatial trimming, as follows.

We start with the following definition.

Definition 3 (Serfling, 2010). A positive definite matrix-valued functional $M(F)$ is a transformation-retransformation (TR) functional if

$$A'M(\mathbf{Y}_n)'M(\mathbf{Y}_n)A = k_2\,M(\mathbf{X}_n)'M(\mathbf{X}_n) \qquad (3)$$

for $Y = AX + b$, and with $k_2 = k_2(A, b, \mathbf{X}_n)$ a positive scalar function of $A$, $b$, and $\mathbf{X}_n$.

Any inverse square root of a scatter or shape matrix is a TR matrix. These are discussed in detail in Serfling (2010) and, in particular, it is shown there that any SICS functional is a TR functional. In particular, we will use a TR functional introduced by Tyler (1987), to achieve certain favorable theoretical properties in the elliptical model, and extended by Dümbgen (1998). It is easily computed in any dimension, making it a computationally attractive alternative to Fast-MCD, although not as robust. With respect to a specified location functional $\theta(\mathbf{X}_n)$, the Tyler matrix is defined as $C(\mathbf{X}_n) = (M_s(\mathbf{X}_n)'M_s(\mathbf{X}_n))^{-1}$, with $M_s(\mathbf{X}_n)$ the TR matrix defined as the unique symmetric square root of $C(\mathbf{X}_n)^{-1}$ obtained through the M-estimation equation

$$n^{-1}\sum_{i=1}^n \left\{\frac{M_s(\mathbf{X}_n)(X_i - \theta(\mathbf{X}_n))}{\|M_s(\mathbf{X}_n)(X_i - \theta(\mathbf{X}_n))\|}\right\} \left\{\frac{M_s(\mathbf{X}_n)(X_i - \theta(\mathbf{X}_n))}{\|M_s(\mathbf{X}_n)(X_i - \theta(\mathbf{X}_n))\|}\right\}' = d^{-1}\,I_d. \qquad (4)$$

An iterative algorithm using Cholesky factorizations to compute $M_s(\mathbf{X}_n)$ quickly in any practical dimension is given in Tyler (1987). Another solution of (4) is given by the upper triangular square root $M_t(\mathbf{X}_n)$ of $C(\mathbf{X}_n)^{-1}$ and is computed by a similar algorithm. In Tyler (1987), the quantity $\theta(\mathbf{X}_n)$ is specified as a known constant. For inference situations when $\theta(\mathbf{X}_n)$ is not known or specified, a symmetrized version of $M_s(\mathbf{X}_n)$, eliminating the need of a location measure, is given by Dümbgen (1998): $M_{s1}(\mathbf{X}_n) = M_s(\mathbf{X}_n - \mathbf{X}_n)$, where $\mathbf{X}_n - \mathbf{X}_n$ denotes the set of differences $X_i - X_j$. Convenient R packages (e.g., ICSNP) are available for computation of these estimators. See Tyler (1987) and Serfling (2010) for detailed discussion. Here we denote the Dümbgen-Tyler TR matrix $M_{s1}(F)$ by DT.
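The fixed-point form of (4) admits a short implementation. The NumPy sketch below uses the standard iteration $C \leftarrow (d/n)\sum_i y_i y_i' / (y_i' C^{-1} y_i)$ with a trace normalization, rather than the Cholesky-based algorithm of Tyler (1987) cited above; the symmetrized DT version can be obtained by applying it to the pairwise differences with $\theta = 0$. Names and tolerances are our assumptions.

```python
import numpy as np

def tyler_shape(X, loc, max_iter=200, tol=1e-9):
    """Tyler (1987) shape matrix C solving (4) (fixed-point sketch),
    normalized so trace(C) = d.  The TR matrix M_s is then the symmetric
    inverse square root C^{-1/2}.  Assumes no observation equals loc."""
    n, d = X.shape
    Y = X - loc
    C = np.eye(d)
    for _ in range(max_iter):
        Cinv = np.linalg.inv(C)
        w = d / np.einsum('ij,jk,ik->i', Y, Cinv, Y)   # d / (y_i' C^{-1} y_i)
        C_new = (Y * w[:, None]).T @ Y / n             # weighted outer products
        C_new *= d / np.trace(C_new)
        if np.linalg.norm(C_new - C, 'fro') < tol:
            return C_new
        C = C_new
    return C
```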

We will make use of the TR matrix DT via the Mahalanobis spatial outlyingness function (Serfling, 2010) and the notion of spatial trimming (Mazumder and Serfling, 2013). Based on any given TR matrix $M(\mathbf{X}_n)$, and based on the spatial sign function (or unit vector function)

$$S(x) = \begin{cases} \dfrac{x}{\|x\|}, & x \in \mathbb{R}^d,\ x \ne 0, \\ 0, & x = 0, \end{cases}$$

a corresponding affine invariant Mahalanobis spatial outlyingness function is given by

$$O_{MS}(x, \mathbf{X}_n) = \left\| n^{-1}\sum_{i=1}^n S\big(M(\mathbf{X}_n)(x - X_i)\big) \right\|, \quad x \in \mathbb{R}^d. \qquad (5)$$

Without standardization by any $M(\mathbf{X}_n)$, (5) gives the well-known spatial outlyingness function (Chaudhuri, 1996), which is only orthogonally invariant. Standardization by a TR matrix produces affine invariance.

Spatial trimming (possibly depending on $n$ and/or $d$) consists of trimming away those outer observations satisfying $O_{MS}(X_i, \mathbf{X}_n) > \lambda_0$, for some specified threshold $\lambda_0$. Then the covariance matrix of the remaining inner observations is robust against outliers. For $M(\mathbf{X}_n)$ given by DT, this is computationally faster than Fast-MCD. We denote by $J$ the set of indices of the inner observations selected by spatial trimming.

For choice of the threshold $\lambda_0$, we follow Mazumder and Serfling (2013) and recommend

$$\lambda_0 = \frac{d}{d + 2}, \qquad (6)$$

which is justified as follows. The outlyingness function $O_{MS}$ without standardization has masking breakdown point (MBP) approximately $(1 - \lambda_0)/2$ (Dang and Serfling, 2010). This is the minimal fraction of observations in the sample which, if corrupted, can cause arbitrarily extreme outliers to become masked (misclassified as nonoutliers). One may obtain an MBP as high as possible (approaching $1/2$) by choosing $\lambda_0$ small enough. However, to avoid overtrimming and losing resolution, a moderate choice is prudent. With standardization, we also need to take account of the explosion replacement breakdown point RBPexp of the chosen TR matrix $M(\mathbf{X}_n)$. We will adopt the easily computable DT, which has RBPexp subject to the upper bound $(d + 1)^{-1}$, decreasing with $d$ (see Dümbgen and Tyler, 2005). We balance $(1 - \lambda_0)/2$ with $(d + 1)^{-1}$. However, to avoid overrobustification in low dimensions, we instead use $(d + 2)^{-1}$, thus solving $(1 - \lambda_0)/2 = (d + 2)^{-1}$, to obtain (6). The increase of $\lambda_0$ with $d$ is reasonable, since in higher dimension distributions place greater probability in the tails and observations are more dispersed, causing outlyingness values to tend to be larger overall.

In summary: our SICS functional $D(\mathbf{X}_n)$ is given by the sample version of (2) with $Z_N$ and $J$ determined via spatial trimming using $O_{MS}$ based on the TR matrix DT and with trimming threshold given by (6).
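The outlyingness (5) and the trimming rule with threshold (6) translate directly into code. A minimal NumPy sketch (helper names ours; M is any TR matrix, e.g., the inverse symmetric square root of the Tyler shape sketched above):

```python
import numpy as np

def spatial_outlyingness(X, M):
    """Mahalanobis spatial outlyingness (5) (sketch): for each row x of X,
    the norm of the average spatial sign of M(x - X_i) over the sample;
    the term with x = X_i contributes S(0) = 0."""
    n, d = X.shape
    O = np.empty(n)
    for i in range(n):
        Z = (X[i] - X) @ M.T                       # rows are M(x - X_i)
        norms = np.linalg.norm(Z, axis=1)
        S = np.divide(Z, norms[:, None], out=np.zeros_like(Z),
                      where=norms[:, None] > 0)    # spatial signs
        O[i] = np.linalg.norm(S.mean(axis=0))
    return O

def spatial_trim_indices(X, M):
    """Spatial trimming (sketch): indices J of the inner observations, i.e.,
    those with outlyingness at most the threshold (6), lambda_0 = d/(d+2)."""
    d = X.shape[1]
    return np.flatnonzero(spatial_outlyingness(X, M) <= d / (d + 2))
```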

3 Steps 3 and 4: Application of Projection Pursuit and PCA

Let $\mathbf{X}_n = \{X_1, \ldots, X_n\}$ be a random sample from some $F$ on $\mathbb{R}^d$ and let $D(\mathbf{X}_n)$ be a sample SICS functional (as, for example, the one obtained above). We describe in Algorithm A below the steps for transforming the original data $\mathbf{X}_n$ to a set of $t$-dimensional vectors $\mathbf{V}_n = \{V_1(\mathbf{X}_n), \ldots, V_n(\mathbf{X}_n)\}$ by a projection pursuit approach using $u_i$, $1 \le i \le s$, combined with a SICS preliminary transformation and a PCA dimension reduction step. We also provide a key property of $\mathbf{V}_n$.

Algorithm A, for $\mathbf{V}_n$

1. Standardize $X_i$, $1 \le i \le n$, with a given sample SICS functional $D(\mathbf{X}_n)_{d \times d}$ via $X_i^* = D(\mathbf{X}_n)X_i$, $1 \le i \le n$, and put $\mathbf{X}_n^* = [X_1^* \cdots X_n^*]$.

2. Choose the number of directions $s$ and select $s$ unit vectors $\mathcal{U} = \{u_1, \ldots, u_s\}$ uniformly distributed on the unit sphere in $\mathbb{R}^d$ but lying on distinct diameters, following the algorithm of Fang and Wang (1994). From trial and error with examples of various dimensions, choices such as $s = 4d$, $5d$, or $6d$ are effective. We recommend: $s = 5d$.

3. Calculate the projections $u_j'X_i^*$, $1 \le i \le n$, $1 \le j \le s$, and denote by $\nu(u_j'\mathbf{X}_n^*)$ and $\eta(u_j'\mathbf{X}_n^*)$ the median and MAD, respectively, of $\{u_j'X_1^*, \ldots, u_j'X_n^*\}$, $1 \le j \le s$.

4. Put $\mathbf{A}_n = [a_1 \cdots a_n]_{s \times n}$, where $a_i = (a_n(i, 1), \ldots, a_n(i, s))'_{s \times 1}$, $1 \le i \le n$, with

$$a_n(i, j) = \frac{u_j'X_i^* - \nu(u_j'\mathbf{X}_n^*)}{\eta(u_j'\mathbf{X}_n^*)}, \quad 1 \le j \le s, \ 1 \le i \le n.$$

The initial $d \times n$ data matrix $\mathbf{X}_n$ of $n$ $d$-vectors now has been converted to a new data matrix $\mathbf{A}_n$ of dimension $s \times n$, i.e., consisting of $n$ $s$-vectors $a_1, \ldots, a_n$, with $a_i$ associated with the original data point $X_i$, $1 \le i \le n$.

5. Let $S_n$ denote the sample covariance matrix of those columns $a_i$ of $\mathbf{A}_n$ with indices $i$ in the set $J$. Calculate the eigenvalues $\lambda_1 \ge \cdots \ge \lambda_s$ of $S_n$ and let $P = [p_1 \cdots p_s]_{s \times s}$ denote the orthogonal matrix containing the corresponding eigenvectors $p_1, \ldots, p_s$ as column vectors.

6. PCA reduction. Define the $s \times 1$ dimensional vectors $\widetilde V_i = P'a_i$, $1 \le i \le n$. Let $t$ be the number of eigenvalues greater than $10^{-6}$ (typically, $t = d$). Let the $t$-vector $V_i$ contain the first $t$ components of the vector $\widetilde V_i$, $1 \le i \le n$, and put $\mathbf{V}_n = [V_1 \cdots V_n]_{t \times n}$.

This completes reduction of the original $d \times n$ data matrix $\mathbf{X}_n$ to a new $t \times n$ data matrix $\mathbf{V}_n$. It is readily checked that the covariance matrix of the vectors $V_i$, $1 \le i \le n$ (based on the inner observations indexed by $J$), is $\Lambda_{t \times t} = \mathrm{diag}(\lambda_1, \ldots, \lambda_t)$.
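Putting the pieces together, the following NumPy sketch of Algorithm A reuses the hypothetical helpers scaled_deviation_matrix and (for obtaining D and the inner indices) sics_from_inner and spatial_trim_indices from the sketches above; it returns the rows $V_i'$ rather than the paper's $t \times n$ matrix.

```python
import numpy as np

def algorithm_A(X, D, inner_idx, s=None, rng=None, eig_tol=1e-6):
    """Algorithm A (sketch): SICS-standardize X by D, form the projected
    scaled deviations (steps 2-4), then PCA-reduce using the covariance of
    the inner observations inner_idx (steps 5-6).  Returns the n x t matrix
    whose rows correspond to the paper's vectors V_i."""
    Xs = X @ D.T                                    # step 1: X_i* = D X_i
    A = scaled_deviation_matrix(Xs, s=s, rng=rng)   # steps 2-4 (sketch above)
    S = np.cov(A[inner_idx], rowvar=False)          # step 5: robust covariance
    lam, P = np.linalg.eigh(S)                      # eigenvalues, ascending
    lam, P = lam[::-1], P[:, ::-1]                  # reorder to descending
    t = int(np.sum(lam > eig_tol))                  # step 6: retained components
    return (A @ P)[:, :t]                           # rows are (P' a_i)' truncated to t
```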

We note that Algorithm A is affine invariant, in the sense that if we transform $X_i \mapsto Y_i = AX_i + b$, $1 \le i \le n$, where $A$ is any $d \times d$ nonsingular matrix and $b$ is any vector in $\mathbb{R}^d$, then the matrix of vectors $V_i$, $1 \le i \le n$, will remain the same up to a global sign change. This is stated formally as follows (the proof is straightforward and omitted).

Lemma 4 Let $X_1, \ldots, X_n$ be a random sample from a $d$-dimensional distribution function $F$. Suppose $Y_i = AX_i + b$, where $A$ is any $d \times d$ nonsingular matrix and $b$ is any vector in $\mathbb{R}^d$. Put $\mathbf{Y}_n = [Y_1 \cdots Y_n]$. Further, assume that $\mathbf{V}_n(\mathbf{X}_n)$ is obtained using Algorithm A starting with the data matrix $\mathbf{X}_n$, and that $\mathbf{V}_n(\mathbf{Y}_n)$ is obtained using Algorithm A starting with the data matrix $\mathbf{Y}_n$. Then

$$\mathbf{V}_n(\mathbf{Y}_n) = \mathrm{sgn}(k)\,\mathbf{V}_n(\mathbf{X}_n), \qquad (7)$$

where $k = k(A, b, \mathbf{X}_n)$.

4 Robust Mahalanobis Spatial Outlyingness using $\mathbf{V}_n$

In Algorithm A, the data vector $X_i$ becomes replaced by $V_i$, $1 \le i \le n$, or more generally points $x \in \mathbb{R}^d$ are mapped onto points $v \in \mathbb{R}^t$. We formulate our new robust affine invariant outlyingness function $O_{RMSP}(x, \mathbf{X}_n)$ (denoted by RMSP) as the robust Mahalanobis spatial outlyingness function on the vectors $V_i$, $1 \le i \le n$. Thus $O_{RMSP}(x, \mathbf{X}_n) = O_{MS}(v(x), \mathbf{V}_n)$, where $v(x)$ is the vector associated with $x$ via Algorithm A. As before, we use DT (on $\mathbf{V}_n$) for the standardization defining $O_{MS}$. After transforming the given data $\mathbf{X}_n = [X_1 \cdots X_n]_{d \times n}$ to $\mathbf{V}_n = [V_1 \cdots V_n]_{t \times n}$, using inner observations indexed by $J$, we follow the steps below to form RMSP.

Algorithm B, for RMSP

1. Using the Mahalanobis spatial outlyingness function (5) on $\mathbf{V}_n$ (instead of $\mathbf{X}_n$), and with TR matrix $M(\mathbf{V}_n) = DT(\mathbf{V}_n)$, i.e., using

$$O_{MS}(v, \mathbf{V}_n) = \left\| n^{-1}\sum_{i=1}^n S\big(M(\mathbf{V}_n)(v - V_i)\big) \right\|, \quad v \in \mathbb{R}^t, \qquad (8)$$

carry out the spatial trimming method of Section 2.2 based on trimming threshold $t/(t + 2)$. Let $J'$ denote the set of indices of the inner observations so obtained, and let $\mathbf{V}_n'$ denote the inner observations.

2. Now robustify $DT(\mathbf{V}_n)$ by computing it just on $\mathbf{V}_n'$: $DT(\mathbf{V}_n')$.

3. Then RMSP is defined according to (8), but with the robustified DT for standardization and with the averaging taken over just the inner observations. That is, denoting by $K$ the cardinality of $J'$, and with $M(\mathbf{V}_n') = DT(\mathbf{V}_n')$, we form

$$O_{RMSP}(x, \mathbf{X}_n) = \left\| K^{-1}\sum_{j \in J'} S\big(M(\mathbf{V}_n')(v(x) - V_j)\big) \right\|, \quad x \in \mathbb{R}^d. \qquad (9)$$

In particular, for the data points $X_i$, $1 \le i \le n$, the corresponding RMSP outlyingness values are given by

$$O_{RMSP}(X_i, \mathbf{X}_n) = \left\| K^{-1}\sum_{j \in J'} S\big(M(\mathbf{V}_n')(V_i - V_j)\big) \right\|, \quad 1 \le i \le n. \qquad (10)$$

Lemma 5 $O_{RMSP}(x, \mathbf{X}_n)$ is affine invariant, in the sense that if we transform $X_i \mapsto Y_i = AX_i + b$, where $A_{d \times d}$ is nonsingular and $b$ is any vector in $\mathbb{R}^d$, then

$$O_{RMSP}(y, \mathbf{Y}_n) = O_{RMSP}(x, \mathbf{X}_n),$$

with $y = Ax + b$. (The proof is immediate and omitted.)
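A minimal NumPy sketch of Algorithm B, reusing the hypothetical tyler_shape and spatial_trim_indices helpers above. DT is approximated here by Tyler's shape on the pairwise differences (the symmetrized, location-free Dümbgen version), converted to a TR matrix as the symmetric inverse square root; this is a sketch under those assumptions, not the Cholesky-based algorithm the paper cites, and the O(n^2) formation of differences is workable only for moderate n.

```python
import numpy as np
from itertools import combinations

def dt_tr_matrix(V):
    """Dumbgen-Tyler TR matrix (sketch): Tyler's shape on the pairwise
    differences V_i - V_j (no location needed), returned as the symmetric
    inverse square root M = C^{-1/2}.  Assumes distinct rows in V."""
    pairs = np.array(list(combinations(range(len(V)), 2)))
    diffs = V[pairs[:, 0]] - V[pairs[:, 1]]
    C = tyler_shape(diffs, loc=np.zeros(V.shape[1]))   # sketch defined earlier
    lam, P = np.linalg.eigh(C)
    return P @ np.diag(lam ** -0.5) @ P.T

def rmsp(V):
    """Algorithm B (sketch): spatial trimming of V at threshold t/(t+2),
    re-robustified DT on the inner points V', then the trimmed outlyingness
    (10) for every original observation."""
    J = spatial_trim_indices(V, dt_tr_matrix(V))       # step 1
    M = dt_tr_matrix(V[J])                             # step 2
    O = np.empty(len(V))
    for i in range(len(V)):                            # step 3, eq. (10)
        Z = (V[i] - V[J]) @ M.T
        norms = np.linalg.norm(Z, axis=1)
        S = np.divide(Z, norms[:, None], out=np.zeros_like(Z),
                      where=norms[:, None] > 0)
        O[i] = np.linalg.norm(S.mean(axis=0))
    return O
```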

5 Comparison of RMSP, SUP, and MD

5.1 Visual illustration with artificial bivariate data

Using artificially created bivariate data sets, we are able to provide visual illustrations of important differences among RMSP, SUP, and MD. Here MD denotes robust Mahalanobis distance outlyingness with the Fast-MCD location and scatter measures based on the inner 50% of observations.

Besides identifying the more extreme outliers, an outlyingness function also has the role of providing a structural description of the data set. In effect, this provides a quantile-based description. Our plots exhibit 50%, 75%, and 90% outlyingness contours (based on the given outlyingness function and enclosing 50%, 75%, and 90% of the observations, respectively). Figure 1 shows a triangularly shaped data set with and without outliers, Figure 2 a bivariate Pareto data set with and without outliers, and Figure 3 a bivariate Normal data set with and without outliers. Based on these displays, we make the following comments.

1. SUP is dominated by MD and RMSP. All three of SUP, MD, and RMSP have similar contours: strictly ellipsoidal for MD and approximately so for the others. However, the contours of SUP are not as smooth as those of the others, and its computational burden is considerably greater.

2. The robustness properties of SUP, MD, and RMSP are comparable for moderate contamination. For moderate contamination (somewhat less than 50%), all three methods perform equally well in constructing contours that account for the outliers without becoming distorted by them. For each of SUP, MD, and RMSP, the outlier cluster becomes placed on or beyond the 90% contour and well outside the 75% contour, rather than, for example, pulling the 75% contour toward itself. On the other hand, if protection against very high contamination is needed, SUP and MD can be adjusted to have nearly 50% BP, while RMSP cannot.

3. RMSP is overall superior for moderate contamination. With less computational burden and without imposing elliptical contours, RMSP is as robust as MD and SUP.

5.2 Numerical experiments with artificial data

A small experiment comparing SUP (with 100,000 directions), MD (with Fast-MCD), and RMSP was carried out on a typical PC for multivariate standard Normal data with sample size $n = 100$ and dimensions $d = 2, 5, 10$, and $20$ (only $d = 2$ for SUP).

1. First, each of the 15 most extreme observations according to Euclidean distance was replaced by itself multiplied by the factor 5. All three methods correctly selected the 15 outliers as distinctly more outlying than the other 85 observations (although not always with the same ranks). Times in seconds:

d    SUP    MD    RMSP

Clearly, SUP is prohibitively computational, compared with MD and RMSP. The well-established MD with Fast-MCD exhibits high computational efficiency for very low dimensions. The above times are based on setting the tuning parameter nsamp in MD (MCD) using rrcov to make its breakdown point match that of RMSP, namely $(d + 2)^{-1}$. Thus we used nsamp = 6, 10, 11, and 12 for $d = 2, 5, 10$, and 20 (versus the default nsamp = 500, which results in slightly higher computation times). However, the advantage of MD is quickly lost as $d$ increases, and the computational appeal of RMSP becomes evident.

2. As a more challenging variant carried out just for $d = 2$, each of the 15 most extreme observations according to Euclidean distance was replaced by itself plus the vector $(10, 0)'$. Both RMSP and MD again correctly ranked the 15 outliers as more extreme in outlyingness than the other 85 (although not always in the same order).

From all considerations, RMSP is the method of choice for moderate contamination. A competitor is MD when elliptical contours are acceptable, while SUP may be dropped from consideration. This study raises the question of just how much we do or do not like the ellipsoidal contours of MD. A detailed evaluation of this question is worthwhile but beyond the scope of the present paper.

5.3 Actual data

For two well-studied higher-dimensional data sets, we examine the performance of MD and RMSP. Here, for compatibility with Becker and Gather (1999), MD denotes the robust Mahalanobis distance with S-estimates for location and covariance based on Tukey's biweight (BW) function. (Separate investigation shows little difference between this MD and that based on MCD.)

Stackloss Data (d = 4)

The stackloss data set of Brownlee (1965) has been much studied as a test of outlier detection methods. It consists of $n = 21$ observations in dimension $d = 4$. See Rousseeuw and Leroy (1987) and Becker and Gather (1999) for the data set, for references to many studies, and for general discussion. All robust methods cited in the literature, including MD, rank observations 1, 3, 4, and 21 as the top 4 in outlyingness, with little difference in order. Our RMSP agrees with these rankings and also with MD on observation 2 as 5th, 13 as 6th, and 17 as 7th. Thus RMSP corroborates existing approaches while being more computationally attractive. Figure 4 shows that the main difference between MD and RMSP with this data set is the ranking of moderately outlying observations.

Pollution and Mortality Data (d = 13)

Becker and Gather (1999) study in detail a 13-dimensional data set available at the Data and Story Library, which provides information on social and economic conditions, climate, air pollution, and mortality for 60 Standard Metropolitan Statistical Areas (a standard U.S. Census Bureau designation of the region around a major city) in the United States. They omit a case with incomplete data and rank the remaining $n = 59$ cases in dimension $d = 13$ by MD. Here we compare RMSP and MD. Both agree on the extreme observations indexed 28, 47, 46, 48, ranked in this order. Also, they agree on 11 cases as among the next 12 cases, although with somewhat differing ranks. The exceptions are that MD ranks observation 36 as 14th and observation 38 as 22nd, while RMSP ranks these as 17th and 16th, respectively. The difference in ranks 16 versus 22 for observation 38 raises the question of whether observation 36 (New Orleans) in comparison with observation 38 (Philadelphia) should rank far apart, 14th versus 22nd as per MD, or closely, 17th versus 16th as per RMSP.

Coordinatewise dotplots of all 13 variables for these cases reveal that these points overall are not very outlying, except that 36 is moderate to extreme in outlyingness for mortality and moderate for SO2 pollution, and that 38 is moderate to strong in outlyingness for population size, moderate for population density, and nearly moderate for SO2 pollution. On this basis we regard 36 and 38 as comparable cases, both moderate in outlyingness, corroborating the opinion of RMSP over that of MD. This illustrates the advantage of not imposing elliptical contours: the levels of moderate outlyingness can be delineated better. Figure 4 shows that the main difference between MD and RMSP with this data set is the ranking of moderately outlying observations.

6 Conclusions and Recommendations

Based on our theoretical development and empirical analysis, we make the following basic conclusions about SUP, MD, and RMSP.

1. RMSP has all four desired properties: (i) affine invariance, (ii) robustness, (iii) easy computability in all dimensions, and (iv) no imposition of elliptical contours.

2. SUP is outclassed by RMSP, which is also a projection pursuit approach but one much more attractive computationally and, among other finite-direction approaches, easy to understand.

3. Robust versions of MD and RMSP are competitive, but, by not imposing elliptical contours, RMSP is of particular advantage in describing intermediate outlyingness structure.

4. The spatial trimming technique used here with $\mathbf{V}_n$ to develop a new projection pursuit approach is used just on $\mathbf{X}_n$ in Mazumder and Serfling (2013) to produce a new robust spatial outlyingness. It provides another alternative to MD and performs similarly to RMSP, although the latter seems slightly better due to acquiring projection pursuit information. A detailed comparison in some depth would be of interest but is beyond the scope of the present paper.

5. In practice, one might use both MD (when computationally feasible) and RMSP. When they agree, one can be confident. When they disagree, the relevant observations can be investigated.

Acknowledgements

The authors gratefully acknowledge very helpful, insightful comments from reviewers. Useful input from G. L. Thompson is also greatly appreciated. Support under National Science Foundation Grants DMS and DMS and National Security Agency Grant H is sincerely acknowledged.

References

[1] Becker, C. and Gather, U. (1999). The masking breakdown point of multivariate outlier identification rules. Journal of the American Statistical Association.
[2] Brownlee, K. A. (1965). Statistical Theory and Methodology in Science and Engineering, 2nd edition. John Wiley & Sons, New York.
[3] Chaudhuri, P. (1996). On a geometric notion of quantiles for multivariate data. Journal of the American Statistical Association.
[4] Dang, X. and Serfling, R. (2010). Nonparametric depth-based multivariate outlier identifiers, and masking robustness properties. Journal of Statistical Planning and Inference.
[5] Dümbgen, L. (1998). On Tyler's M-functional of scatter in high dimension. Annals of the Institute of Statistical Mathematics.
[6] Dümbgen, L. and Tyler, D. E. (2005). On the breakdown properties of some multivariate M-functionals. Scandinavian Journal of Statistics.
[7] Fang, K.-T. and Wang, Y. (1994). Number Theoretic Methods in Statistics. Chapman and Hall, London.
[8] Filzmoser, P., Maronna, R., and Werner, M. (2008). Outlier identification in high dimensions. Computational Statistics & Data Analysis.
[9] Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers C.
[10] Ilmonen, P., Oja, H., and Serfling, R. (2012). On invariant coordinate system (ICS) functionals. International Statistical Review.
[11] Kruskal, J. B. (1969). Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new "index of condensation". In Statistical Computation (R. C. Milton and J. A. Nelder, eds.). Academic Press, New York.
[12] Kruskal, J. B. (1972). Linear transformation of multivariate data to reveal clustering. In Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, I, Theory. Seminar Press, New York and London.
[13] Liu, R. Y. (1992). Data depth and multivariate rank tests. In L1-Statistics and Related Methods (Y. Dodge, ed.), North-Holland, Amsterdam.
[14] Maronna, R. A., Martin, R. D., and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, Chichester, England.

[15] Mazumder, S. and Serfling, R. (2013). A robust sample spatial outlyingness function. Journal of Statistical Planning and Inference.
[16] Pan, J.-X., Fung, W.-K., and Fang, K.-T. (2000). Multiple outlier detection in multivariate data using projection pursuit techniques. Journal of Statistical Planning and Inference.
[17] Peña, D. and Prieto, F. J. (2001). Robust covariance matrix estimation and multivariate outlier rejection. Technometrics.
[18] Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. John Wiley & Sons, New York.
[19] Rousseeuw, P. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics.
[20] Serfling, R. (2004). Nonparametric multivariate descriptive measures based on spatial quantiles. Journal of Statistical Planning and Inference.
[21] Serfling, R. (2010). Equivariance and invariance properties of multivariate quantile and related functions, and the role of standardization. Journal of Nonparametric Statistics.
[22] Switzer, P. (1970). Numerical classification. In Geostatistics. Plenum, New York.
[23] Switzer, P. and Wright, R. M. (1971). Numerical classification applied to certain Jamaican Eocene nummulitids. Mathematical Geology.
[24] Tyler, D. E. (1987). A distribution-free M-estimator of multivariate scatter. Annals of Statistics.
[25] Tyler, D. E., Critchley, F., Dümbgen, L., and Oja, H. (2009). Invariant co-ordinate selection. Journal of the Royal Statistical Society, Series B.
[26] Zuo, Y. and Serfling, R. (2000b). General notions of statistical depth function. Annals of Statistics.
[27] Zuo, Y. (2003). Projection-based depth functions and associated medians. Annals of Statistics.

Figure 1: Triangular data set with 50%, 75%, and 90% outlyingness contours for SUP (upper), MD (middle), and RMSP (lower), without outliers (left) and with extreme outliers including a cluster (right).

Figure 2: Bivariate Pareto data set with 50%, 75%, and 90% outlyingness contours for SUP (upper), MD (middle), and RMSP (lower), without outliers (left) and with extreme outliers including a cluster (right).

Figure 3: Bivariate Normal data set with 50%, 75%, and 90% outlyingness contours for SUP (upper), MD (middle), and RMSP (lower), without outliers (left) and with extreme replacement outliers including a cluster (right).

Figure 4: RMSP versus MD: Stackloss data (left) and Pollution data (right).


More information

368 XUMING HE AND GANG WANG of convergence for the MVE estimator is n ;1=3. We establish strong consistency and functional continuity of the MVE estim

368 XUMING HE AND GANG WANG of convergence for the MVE estimator is n ;1=3. We establish strong consistency and functional continuity of the MVE estim Statistica Sinica 6(1996), 367-374 CROSS-CHECKING USING THE MINIMUM VOLUME ELLIPSOID ESTIMATOR Xuming He and Gang Wang University of Illinois and Depaul University Abstract: We show that for a wide class

More information

A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL

A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL Discussiones Mathematicae Probability and Statistics 36 206 43 5 doi:0.75/dmps.80 A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL Tadeusz Bednarski Wroclaw University e-mail: t.bednarski@prawo.uni.wroc.pl

More information

Robust methods for multivariate data analysis

Robust methods for multivariate data analysis JOURNAL OF CHEMOMETRICS J. Chemometrics 2005; 19: 549 563 Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cem.962 Robust methods for multivariate data analysis S. Frosch

More information

THE BREAKDOWN POINT EXAMPLES AND COUNTEREXAMPLES

THE BREAKDOWN POINT EXAMPLES AND COUNTEREXAMPLES REVSTAT Statistical Journal Volume 5, Number 1, March 2007, 1 17 THE BREAKDOWN POINT EXAMPLES AND COUNTEREXAMPLES Authors: P.L. Davies University of Duisburg-Essen, Germany, and Technical University Eindhoven,

More information

arxiv: v3 [stat.me] 2 Feb 2018 Abstract

arxiv: v3 [stat.me] 2 Feb 2018 Abstract ICS for Multivariate Outlier Detection with Application to Quality Control Aurore Archimbaud a, Klaus Nordhausen b, Anne Ruiz-Gazen a, a Toulouse School of Economics, University of Toulouse 1 Capitole,

More information

A PRACTICAL APPLICATION OF A ROBUST MULTIVARIATE OUTLIER DETECTION METHOD

A PRACTICAL APPLICATION OF A ROBUST MULTIVARIATE OUTLIER DETECTION METHOD A PRACTICAL APPLICATION OF A ROBUST MULTIVARIATE OUTLIER DETECTION METHOD Sarah Franklin, Marie Brodeur, Statistics Canada Sarah Franklin, Statistics Canada, BSMD, R.H.Coats Bldg, 1 lth floor, Ottawa,

More information

Solving Corrupted Quadratic Equations, Provably

Solving Corrupted Quadratic Equations, Provably Solving Corrupted Quadratic Equations, Provably Yuejie Chi London Workshop on Sparse Signal Processing September 206 Acknowledgement Joint work with Yuanxin Li (OSU), Huishuai Zhuang (Syracuse) and Yingbin

More information

CHAPTER 5. Outlier Detection in Multivariate Data

CHAPTER 5. Outlier Detection in Multivariate Data CHAPTER 5 Outlier Detection in Multivariate Data 5.1 Introduction Multivariate outlier detection is the important task of statistical analysis of multivariate data. Many methods have been proposed for

More information

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support

More information

IDENTIFYING MULTIPLE OUTLIERS IN LINEAR REGRESSION : ROBUST FIT AND CLUSTERING APPROACH

IDENTIFYING MULTIPLE OUTLIERS IN LINEAR REGRESSION : ROBUST FIT AND CLUSTERING APPROACH SESSION X : THEORY OF DEFORMATION ANALYSIS II IDENTIFYING MULTIPLE OUTLIERS IN LINEAR REGRESSION : ROBUST FIT AND CLUSTERING APPROACH Robiah Adnan 2 Halim Setan 3 Mohd Nor Mohamad Faculty of Science, Universiti

More information

Re-weighted Robust Control Charts for Individual Observations

Re-weighted Robust Control Charts for Individual Observations Universiti Tunku Abdul Rahman, Kuala Lumpur, Malaysia 426 Re-weighted Robust Control Charts for Individual Observations Mandana Mohammadi 1, Habshah Midi 1,2 and Jayanthi Arasan 1,2 1 Laboratory of Applied

More information

Rare Event Discovery And Event Change Point In Biological Data Stream

Rare Event Discovery And Event Change Point In Biological Data Stream Rare Event Discovery And Event Change Point In Biological Data Stream T. Jagadeeswari 1 M.Tech(CSE) MISTE, B. Mahalakshmi 2 M.Tech(CSE)MISTE, N. Anusha 3 M.Tech(CSE) Department of Computer Science and

More information

DD-Classifier: Nonparametric Classification Procedure Based on DD-plot 1. Abstract

DD-Classifier: Nonparametric Classification Procedure Based on DD-plot 1. Abstract DD-Classifier: Nonparametric Classification Procedure Based on DD-plot 1 Jun Li 2, Juan A. Cuesta-Albertos 3, Regina Y. Liu 4 Abstract Using the DD-plot (depth-versus-depth plot), we introduce a new nonparametric

More information

Fast and Robust Classifiers Adjusted for Skewness

Fast and Robust Classifiers Adjusted for Skewness Fast and Robust Classifiers Adjusted for Skewness Mia Hubert 1 and Stephan Van der Veeken 2 1 Department of Mathematics - LStat, Katholieke Universiteit Leuven Celestijnenlaan 200B, Leuven, Belgium, Mia.Hubert@wis.kuleuven.be

More information

A Characterization of Principal Components. for Projection Pursuit. By Richard J. Bolton and Wojtek J. Krzanowski

A Characterization of Principal Components. for Projection Pursuit. By Richard J. Bolton and Wojtek J. Krzanowski A Characterization of Principal Components for Projection Pursuit By Richard J. Bolton and Wojtek J. Krzanowski Department of Mathematical Statistics and Operational Research, University of Exeter, Laver

More information

Small sample corrections for LTS and MCD

Small sample corrections for LTS and MCD Metrika (2002) 55: 111 123 > Springer-Verlag 2002 Small sample corrections for LTS and MCD G. Pison, S. Van Aelst*, and G. Willems Department of Mathematics and Computer Science, Universitaire Instelling

More information

Stahel-Donoho Estimation for High-Dimensional Data

Stahel-Donoho Estimation for High-Dimensional Data Stahel-Donoho Estimation for High-Dimensional Data Stefan Van Aelst KULeuven, Department of Mathematics, Section of Statistics Celestijnenlaan 200B, B-3001 Leuven, Belgium Email: Stefan.VanAelst@wis.kuleuven.be

More information

Elliptically Contoured Distributions

Elliptically Contoured Distributions Elliptically Contoured Distributions Recall: if X N p µ, Σ), then { 1 f X x) = exp 1 } det πσ x µ) Σ 1 x µ) So f X x) depends on x only through x µ) Σ 1 x µ), and is therefore constant on the ellipsoidal

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Robust scale estimation with extensions

Robust scale estimation with extensions Robust scale estimation with extensions Garth Tarr, Samuel Müller and Neville Weber School of Mathematics and Statistics THE UNIVERSITY OF SYDNEY Outline The robust scale estimator P n Robust covariance

More information

Extreme geometric quantiles

Extreme geometric quantiles 1/ 30 Extreme geometric quantiles Stéphane GIRARD (Inria Grenoble Rhône-Alpes) Joint work with Gilles STUPFLER (Aix Marseille Université) ERCIM, Pisa, Italy, December 2014 2/ 30 Outline Extreme multivariate

More information

7. Variable extraction and dimensionality reduction

7. Variable extraction and dimensionality reduction 7. Variable extraction and dimensionality reduction The goal of the variable selection in the preceding chapter was to find least useful variables so that it would be possible to reduce the dimensionality

More information

Robust Estimation of Cronbach s Alpha

Robust Estimation of Cronbach s Alpha Robust Estimation of Cronbach s Alpha A. Christmann University of Dortmund, Fachbereich Statistik, 44421 Dortmund, Germany. S. Van Aelst Ghent University (UGENT), Department of Applied Mathematics and

More information

Smooth simultaneous confidence bands for cumulative distribution functions

Smooth simultaneous confidence bands for cumulative distribution functions Journal of Nonparametric Statistics, 2013 Vol. 25, No. 2, 395 407, http://dx.doi.org/10.1080/10485252.2012.759219 Smooth simultaneous confidence bands for cumulative distribution functions Jiangyan Wang

More information

Characteristics of multivariate distributions and the invariant coordinate system

Characteristics of multivariate distributions and the invariant coordinate system Characteristics of multivariate distributions the invariant coordinate system Pauliina Ilmonen, Jaakko Nevalainen, Hannu Oja To cite this version: Pauliina Ilmonen, Jaakko Nevalainen, Hannu Oja. Characteristics

More information

Robust estimators based on generalization of trimmed mean

Robust estimators based on generalization of trimmed mean Communications in Statistics - Simulation and Computation ISSN: 0361-0918 (Print) 153-4141 (Online) Journal homepage: http://www.tandfonline.com/loi/lssp0 Robust estimators based on generalization of trimmed

More information

Minimum distance tests and estimates based on ranks

Minimum distance tests and estimates based on ranks Minimum distance tests and estimates based on ranks Authors: Radim Navrátil Department of Mathematics and Statistics, Masaryk University Brno, Czech Republic (navratil@math.muni.cz) Abstract: It is well

More information

Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators

Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators UC3M Working Papers Statistics and Econometrics 17-10 ISSN 2387-0303 Mayo 2017 Departamento de Estadística Universidad Carlos III de Madrid Calle Madrid, 126 28903 Getafe (Spain) Fax (34) 91 624-98-48

More information

Robust Maximum Association Between Data Sets: The R Package ccapp

Robust Maximum Association Between Data Sets: The R Package ccapp Robust Maximum Association Between Data Sets: The R Package ccapp Andreas Alfons Erasmus Universiteit Rotterdam Christophe Croux KU Leuven Peter Filzmoser Vienna University of Technology Abstract This

More information

Principal component analysis

Principal component analysis Principal component analysis Motivation i for PCA came from major-axis regression. Strong assumption: single homogeneous sample. Free of assumptions when used for exploration. Classical tests of significance

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Fast and Robust Discriminant Analysis

Fast and Robust Discriminant Analysis Fast and Robust Discriminant Analysis Mia Hubert a,1, Katrien Van Driessen b a Department of Mathematics, Katholieke Universiteit Leuven, W. De Croylaan 54, B-3001 Leuven. b UFSIA-RUCA Faculty of Applied

More information

Package riv. R topics documented: May 24, Title Robust Instrumental Variables Estimator. Version Date

Package riv. R topics documented: May 24, Title Robust Instrumental Variables Estimator. Version Date Title Robust Instrumental Variables Estimator Version 2.0-5 Date 2018-05-023 Package riv May 24, 2018 Author Gabriela Cohen-Freue and Davor Cubranic, with contributions from B. Kaufmann and R.H. Zamar

More information

Unsupervised machine learning

Unsupervised machine learning Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels

More information