General Foundations for Studying Masking and Swamping Robustness of Outlier Identifiers

Robert Serfling^1 and Shanshan Wang^2
University of Texas at Dallas

This paper is dedicated to the memory of Kesar Singh, an outstanding contributor to statistical science.

November, 2012

^1 Department of Mathematics, University of Texas at Dallas, Richardson, Texas 75080-3021, USA. Email: serfling@utdallas.edu. Website: www.utdallas.edu/~serfling.
^2 Department of Mathematics, University of Texas at Dallas, Richardson, Texas 75080-3021, USA.
Abstract

With greatly advanced computational resources, the scope of statistical data analysis and modeling has widened to accommodate pressing new arenas of application. In all such data settings, an important and challenging task is the identification of outliers. In particular, an outlier identification procedure must be robust against the possibilities of masking (an outlier is undetected as such) and swamping (a nonoutlier is classified as an outlier). Here we provide general foundations and criteria for quantifying the robustness of outlier detection procedures against masking and swamping. This unifies a scattering of existing results confined to univariate or multivariate data, and extends to a completely general framework allowing any type of data. For any space $\mathcal{X}$ of objects and probability model $F$ on $\mathcal{X}$, we consider a real-valued outlyingness function $O(x, F)$ defined over $x$ in $\mathcal{X}$ and a sample version $O(x, \mathbb{X}_n)$ based on a sample $\mathbb{X}_n$ from $\mathcal{X}$. In this setting, and within a coherent framework, we formulate general definitions of masking breakdown point and swamping breakdown point and develop lemmas for evaluating these robustness measures in practical applications. A brief illustration of the technique of application of the lemmas is provided for univariate scaled deviation outlyingness.

AMS 2000 Subject Classification: Primary 62G35; Secondary 62-07.

Key words and phrases: Nonparametric; Outlier detection; Masking robustness; Swamping robustness.
1 Introduction

With greatly advanced computational resources, the scope of statistical data analysis and modeling has widened to accommodate pressing new arenas of application. Now data is invariably multivariate, typically with very high dimension and/or heavy tails and/or huge sample sizes, or complex, involving curves, images, sets, and other types of object, often with stream or network structure. In all such data settings, an increasingly important and challenging task is the identification of outliers. New contexts involving outliers and anomaly detection include fraud detection, intrusion detection, and network robustness analysis, to name a few. The outliers themselves are sometimes the cases of primary interest.

Nonparametric notions and methods are especially important, since tractable parametric modeling is rather limited in the case of multivariate data (e.g., Olkin, 1994) and even more so with complex data. Further, since visualization is feasible only in the case of numerical or vector data in low dimension, outlier detection methods become of necessity algorithmic in nature and the determination of their performance properties is complicated.

The key concern about performance of an outlier detection procedure is its robustness. It must be resistant to adverse performance effects due to the very presence of the outliers to be identified, or even due to the presence of a concentration of inliers. In particular, one must assess the proclivities of the procedure toward either of two kinds of misclassification error: masking (an outlier is undetected) and swamping (a nonoutlier is classified as an outlier). Robustness against both masking and swamping is clearly essential.

In handling complex modern data structures, ingenious outlier detection approaches have been crafted ad hoc in diverse settings. Typically their robustness performance is explored only through limited simulation studies, with results that lack both generality and precise interpretation.
Highly needed are theoretical underpinnings. Here we develop general foundations and criteria for quantifying robustness of outlier detection procedures against masking and swamping. We employ a leading type of robustness measure, the (finite sample) breakdown point (BP) of Donoho and Huber (1983), i.e., the minimum fraction of replacements of the sample data (by outliers or inliers) sufficient to break down the statistical procedure, i.e., to render it drastically ineffective. This provides a distinctive quantitative approach toward measuring robustness.

In dealing with an outlier identification procedure, the BP approach to robustness involves two such measures: the masking breakdown point (MBP) and the swamping breakdown point (SBP). These are the minimum fractions of points in a data set which, if arbitrarily placed as outliers or inliers, suffice to cause the outlier detection procedure to mask arbitrarily extreme outliers, or swamp arbitrarily central nonoutliers, respectively. The higher the MBP and SBP values, the better the robustness performance of an outlier detection procedure.

Although the idea of BP for estimators such as the sample mean or variance is well established and quite simple and straightforward to define, the corresponding formulations of MBP and SBP are considerably more problematic and have received limited treatment. In the parametric setting of univariate data within the contaminated normal model, Davies and
Gather (1993) formulate versions of MBP and SBP using addition contamination. Becker and Gather (1999) extend that MBP to the multivariate contaminated normal model, but extension of the SBP is not considered, although it is treated in Becker (1996). Dang and Serfling (2010) introduce a version of MBP in the setting of fully nonparametric multivariate outlier identification based on the use of depth functions and apply this notion to compare several different depth-based outlier identifiers. Again, however, the SBP is left untreated.

Although the MBP and SBP are conceptually interrelated and are formulated in parallel ways, the SBP is technically more delicate to treat than the MBP. Further, it turns out that for each of MBP and SBP there are two relevant versions representing complementary perspectives, making four robustness measures in all. Here we introduce a general framework for study of these robustness measures, establish key lemmas for their application, and carry out their application in the setting of univariate data.

Section 2 develops a completely general formulation of nonparametric outlier identification in terms of a real-valued outlyingness function $O(x, F)$, defined over $x$ in any space $\mathcal{X}$ of objects and based on a probability distribution $F$ on $\mathcal{X}$, and a sample version $O(x, \mathbb{X}_n)$ based on a sample $\mathbb{X}_n$ from $\mathcal{X}$. General definitions of MBP and SBP are provided within a unified conceptual framework for studying these robustness measures. Section 3 provides key technical lemmas for evaluating MBP and SBP in practical applications. Section 4 provides a brief illustration of the technique of application of the lemmas, using a leading outlier identifier in the case of univariate (real-valued) data, scaled deviation outlyingness. Complete treatment of univariate scaled deviation outlyingness, as well as of centered rank outlyingness, is carried out in Wang and Serfling (2012), including application to show the excellent masking and swamping robustness of the boxplot.
Applications to other data settings (multivariate, functional, etc.) are beyond the scope of the present paper and deferred to future studies.

2 General foundations

In Section 2.1 we formulate outlyingness functions within a broad conceptual framework, and in Section 2.2 introduce the nonparametric outlier identification problem. General foundations on masking and swamping robustness are provided in Section 2.3.

2.1 Outlyingness functions on a space $\mathcal{X}$

The idea of outliers has a long tradition. The goal is to identify points or groups of points which lie apart from the main body of data, or which are unusual, anomalous, or suspicious in some sense, and then to take an appropriate action. With univariate and bivariate data, outlier visualization is easy. In higher dimensional spaces, however, algorithmic approaches become essential and entail formulating and relying upon outlyingness functions to explore data.

Here let $\mathcal{X}$ be any space equipped with a suitable $\sigma$-algebra of measurable sets (left implicit) and let $F$ be a corresponding probability measure on the measurable sets.
Outlyingness Function. Associated with a probability distribution $F$ on $\mathcal{X}$, an outlyingness function $O(x, F)$ provides a center-outward ordering of points $x$ in $\mathcal{X}$, with higher values representing greater outlyingness relative to a center measuring location.

In the typical case of $\mathcal{X} = \mathbb{R}^d$, where we might compare with the density function of $F$, we observe that density contours and outlyingness contours need not coincide. A density function quantifies local probability structure at a point, whereas an outlyingness function quantifies the location of a point from a global and tail-oriented perspective.

For $\mathbb{R}^d$, the outlyingness approach based on Mahalanobis distance (with robust location and dispersion estimates) is popular for its tractability and intuitive appeal. However, the corresponding outlyingness contours are necessarily elliptical, an unwanted restriction with many data sets. Thus other types of multivariate outlyingness functions are of interest. See Serfling (2010) for details and illustrations, and for connections with multivariate depth, quantile, and rank functions.

2.2 Outlier identification in $\mathcal{X}$

We will suppose that

$$\inf_x O(x, F) = 0, \qquad \sup_x O(x, F) = 1. \tag{1}$$

Accordingly, we define corresponding $\lambda$ outlier regions associated with $F$ and $O(x, F)$,

$$\mathrm{out}(\lambda, F) = \{x : O(x, F) > \lambda\}, \quad 0 < \lambda < 1,$$

and we have (for later reference)

$$\inf\{\lambda > 0 : \mathrm{out}(\lambda, F)^c \neq \emptyset\} = 0, \qquad \sup\{\lambda > 0 : \mathrm{out}(\lambda, F) \neq \emptyset\} = 1, \tag{2}$$

where $A^c$ denotes the complement of a set $A$. The goal is to classify, for given choice of $\lambda$, all points $x$ of $\mathcal{X}$ as belonging to $\mathrm{out}(\lambda, F)$ or not. To this purpose, using a data set $\mathbb{X}_n$ and a data-based outlyingness function $O(x, \mathbb{X}_n)$ that may be considered to estimate $O(x, F)$, we estimate the region $\mathrm{out}(\lambda, F)$ by a sample outlier region

$$\mathrm{OR}(\lambda, \mathbb{X}_n) = \{x : O(x, \mathbb{X}_n) > \lambda\}.$$

Regarding the sample outlyingness function $O(x, \mathbb{X}_n)$, we define $O_{n*}$ and $O_n^*$ by

$$\inf_x O(x, \mathbb{X}_n) = O_{n*} \ (\geq 0), \qquad \sup_x O(x, \mathbb{X}_n) = O_n^* \ (\leq 1). \tag{3}$$

In comparison with (1), this allows for the case that $O(x, \mathbb{X}_n)$ is a step function that possibly does not attain the values 0 or 1. Although typically we have $O_{n*} = 0$ and $O_n^* = 1$, for the centered rank function treated in Wang and Serfling (2012) we have $O_{n*} = n^{-1}$ if $n$ is odd. For later reference, the analogue of (2) is

$$\inf\{\lambda > 0 : \mathrm{OR}(\lambda, \mathbb{X}_n)^c \neq \emptyset\} = O_{n*}, \qquad \sup\{\lambda > 0 : \mathrm{OR}(\lambda, \mathbb{X}_n) \neq \emptyset\} = O_n^*. \tag{4}$$
It is to be understood that the data-based outlier region $\mathrm{OR}(\lambda, \mathbb{X}_n)$ should include, in principle, regular points from $F$ which happen to be outlying according to the selected threshold. This region also may include true outliers or contaminants originating from another source. In some cases, $\mathrm{OR}(\lambda, \mathbb{X}_n)$ is given by $\mathrm{out}(\lambda, F_n)$ with $F_n$ an empirical df.

The key issue is that if $\mathrm{OR}(\lambda, \mathbb{X}_n)$, or equivalently $O(x, \mathbb{X}_n)$, is itself sensitive to outliers (or inliers), then $\mathrm{OR}(\lambda, \mathbb{X}_n)$ cannot serve as a reliable outlier identifier. Robust choices of $\mathrm{OR}(\lambda, \mathbb{X}_n)$, i.e., of $O(x, \mathbb{X}_n)$, are needed. In particular, masking robustness and swamping robustness are essential.

2.3 Masking and swamping robustness

Here we introduce a foundational and conceptual framework for the study of masking and swamping robustness. It becomes clear that there are two variants of the notion of masking robustness and also two of the notion of swamping robustness. Very importantly, such a framework enables all four of these notions to be investigated and characterized in a unified and coherent fashion.

2.3.1 Masking robustness

Let a sample outlier identifier $\mathrm{OR}(\cdot, \mathbb{X}_n)$ be given. Key associated sets regarding masking are of the form

$$M(\lambda, \gamma, \mathbb{X}_n, F) = \mathrm{OR}(\lambda, \mathbb{X}_n)^c \cap \mathrm{out}(\gamma, F),$$

defined for any $\lambda$ and $\gamma$. Clearly, masking occurs if

$$M(\lambda, \gamma, \mathbb{X}_n, F) \neq \emptyset, \tag{5}$$

which in view of (4) requires $\lambda > O_{n*}$. In this case some $\gamma$ outliers of $F$ are included in the sample threshold $\lambda$ nonoutlier region. For fixed $\lambda$, masking becomes more severe as $\gamma \uparrow 1$. That is, increasingly extreme outliers of $F$ become masked as sample threshold $\lambda$ nonoutliers. For fixed $\gamma$, masking becomes more severe as $\lambda \downarrow O_{n*}$. That is, some threshold $\gamma$ outliers of $F$ are included within an increasingly central sample nonoutlier region.

Now consider all possible modified data sets $\mathbb{X}_{n,k}$ obtainable by replacing $k$ observations of $\mathbb{X}_n$ by arbitrarily positioned new values ("outliers" or "contaminants").
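Corresponding to the fixed $\lambda$ and fixed $\gamma$ cases, respectively, two indices that measure in different ways the size of the masking effect are

$$\gamma_M(\lambda, \mathbb{X}_n, k) = \text{largest } \gamma \text{ for which (5) with fixed } \lambda \text{ holds subject to } k \text{ replacements}$$
$$= \sup\{\gamma < 1 : \exists\, k \text{ replacements changing } \mathbb{X}_n \text{ to } \mathbb{X}_{n,k} \text{ such that } M(\lambda, \gamma, \mathbb{X}_{n,k}, F) \neq \emptyset\},$$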
and

$$\lambda_M(\gamma, \mathbb{X}_n, k) = \text{smallest } \lambda \text{ for which (5) with fixed } \gamma \text{ holds subject to } k \text{ replacements}$$
$$= \inf\{\lambda > O_{n*} : \exists\, k \text{ replacements changing } \mathbb{X}_n \text{ to } \mathbb{X}_{n,k} \text{ such that } M(\lambda, \gamma, \mathbb{X}_{n,k}, F) \neq \emptyset\}.$$

The quantity $\gamma_M(\lambda, \mathbb{X}_n, k)$ answers

Question I. What is the most extreme level $\gamma \leq 1$ at which an $F$ outlier can be masked at threshold $\lambda$ due to the presence of $k$ replacement outliers in the data $\mathbb{X}_n$?

It represents the largest degree of outlyingness relative to $F$ that is nonidentifiable at sample outlyingness threshold $\lambda$. The larger the value of $\gamma_M(\lambda, \mathbb{X}_n, k)$, the worse is the masking robustness performance of our given outlier identifier $\mathrm{OR}(\cdot, \mathbb{X}_n)$. The worst possible case, $\gamma_M(\lambda, \mathbb{X}_n, k) = 1$, denotes a version (let us say Type A) of masking breakdown due to $k$ replacements. Let

$$k_M^{(A)}(\lambda, \mathbb{X}_n) = \min\{k : \gamma_M(\lambda, \mathbb{X}_n, k) = 1\}.$$

Then the Type A masking breakdown point of $\mathrm{OR}(\cdot, \mathbb{X}_n)$ at sample outlyingness threshold $\lambda$ is given by

$$\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = \frac{k_M^{(A)}(\lambda, \mathbb{X}_n)}{n}.$$

On the other hand, the quantity $\lambda_M(\gamma, \mathbb{X}_n, k)$ answers

Question II. How centrally in terms of sample outlyingness threshold $\lambda \geq O_{n*}$ can a $\gamma$ outlier of $F$ be masked due to the presence of $k$ replacement outliers in the data $\mathbb{X}_n$?

The smaller the value of $\lambda_M(\gamma, \mathbb{X}_n, k)$, the worse the masking robustness of $\mathrm{OR}(\cdot, \mathbb{X}_n)$, and the worst possible case, $\lambda_M(\gamma, \mathbb{X}_n, k) = O_{n*}$, denotes Type B masking breakdown due to $k$ replacements. Let

$$k_M^{(B)}(\gamma, \mathbb{X}_n) = \min\{k : \lambda_M(\gamma, \mathbb{X}_n, k) = O_{n*}\}.$$

Then the Type B masking breakdown point of $\mathrm{OR}(\cdot, \mathbb{X}_n)$ at $F$ outlyingness threshold $\gamma$ is given by

$$\mathrm{MBP}^{(B)}(\gamma, \mathbb{X}_n) = \frac{k_M^{(B)}(\gamma, \mathbb{X}_n)}{n}.$$

The quantities $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n)$ for $O_{n*} < \lambda < 1$ and $\mathrm{MBP}^{(B)}(\gamma, \mathbb{X}_n)$ for $0 < \gamma < 1$ represent the minimum fractions of replacements in the data $\mathbb{X}_n$ sufficient for Type A or Type B masking breakdown relative to specified thresholds. The higher their values, the greater the masking robustness of the outlier identifier $\mathrm{OR}(\cdot, \mathbb{X}_n)$.

The above formulation of $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n)$ extends that of Dang and Serfling (2010) for nonparametric multivariate outlier identification.
It also corresponds (with a somewhat different formulation) to the notion introduced by Davies and Gather (1993) in the univariate contaminated normal model and extended to the multivariate contaminated normal model by Becker and Gather (1999). However, the version $\mathrm{MBP}^{(B)}(\gamma, \mathbb{X}_n)$ is completely new.
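As a concrete illustration of masking in the sense of (5), the following Python sketch uses univariate scaled deviation outlyingness, $O = \tilde{O}/(1 + \tilde{O})$ with $\tilde{O} = |x - \text{center}|/\text{spread}$, as the sample outlyingness function. The data values, the threshold $\lambda = 0.75$, the probe point 50, and all function names are illustrative assumptions rather than anything prescribed by the framework. It compares a mean/SD identifier with a median/MAD identifier after two of ten observations are replaced by a remote value.

```python
import statistics


def mad(sample):
    """Median absolute deviation about the median (unnormalized)."""
    med = statistics.median(sample)
    return statistics.median([abs(v - med) for v in sample])


def sample_outlyingness(x, sample, center, spread):
    """Scaled deviation outlyingness r / (1 + r), with r = |x - center| / spread."""
    r = abs(x - center(sample)) / spread(sample)
    return r / (1.0 + r)


def in_outlier_region(x, sample, lam, center, spread):
    """True if x lies in the sample outlier region OR(lam, X_n)."""
    return sample_outlyingness(x, sample, center, spread) > lam


def replace_first_k(sample, k, value):
    """One particular X_{n,k}: the first k observations are replaced by `value`."""
    return [value] * k + list(sample[k:])


# Hypothetical data and settings, chosen only for illustration.
clean = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
contaminated = replace_first_k(clean, 2, 1000.0)
lam, probe = 0.75, 50.0

flag_clean = in_outlier_region(probe, clean, lam, statistics.mean, statistics.stdev)
flag_mean_sd = in_outlier_region(probe, contaminated, lam, statistics.mean, statistics.stdev)
flag_med_mad = in_outlier_region(probe, contaminated, lam, statistics.median, mad)
print(flag_clean, flag_mean_sd, flag_med_mad)
```

With these particular values the probe point is flagged on the clean data, is no longer flagged by the mean/SD identifier after contamination (it is masked), and is still flagged by the median/MAD identifier. A single contaminated data set of this kind only exhibits masking; it does not by itself evaluate $\gamma_M$ or $\lambda_M$, which involve all possible replacements.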
2.3.2 Swamping robustness

Again, let a sample outlier identifier $\mathrm{OR}(\cdot, \mathbb{X}_n)$ be given. Key associated sets regarding swamping are of form

$$S(\lambda, \gamma, \mathbb{X}_n, F) = \mathrm{OR}(\lambda, \mathbb{X}_n) \cap \mathrm{out}(\gamma, F)^c,$$

defined for any $\lambda$ and $\gamma$, and swamping occurs if

$$S(\lambda, \gamma, \mathbb{X}_n, F) \neq \emptyset, \tag{6}$$

which in view of (4) requires $\lambda < O_n^*$. In this case some $\gamma$ nonoutliers of $F$ are included in the sample threshold $\lambda$ outlier region. For fixed $\lambda$, the swamping becomes more severe as $\gamma \downarrow 0$, with increasingly central nonoutliers of $F$ becoming included in the sample threshold $\lambda$ outlier region. For fixed $\gamma$, swamping becomes more severe as $\lambda \uparrow O_n^*$, with threshold $\gamma$ nonoutliers of $F$ included within an increasingly extreme sample outlier region.

Again consider the modified data sets $\mathbb{X}_{n,k}$ obtainable by replacing $k$ observations of $\mathbb{X}_n$ by contaminants. Corresponding to the fixed $\lambda$ and fixed $\gamma$ cases, respectively, two indices related to extreme instances of swamping are

$$\gamma_S(\lambda, \mathbb{X}_n, k) = \text{smallest } \gamma \text{ for which (6) with fixed } \lambda \text{ holds subject to } k \text{ replacements}$$
$$= \inf\{\gamma > 0 : \exists\, k \text{ replacements changing } \mathbb{X}_n \text{ to } \mathbb{X}_{n,k} \text{ such that } S(\lambda, \gamma, \mathbb{X}_{n,k}, F) \neq \emptyset\},$$

and

$$\lambda_S(\gamma, \mathbb{X}_n, k) = \text{largest } \lambda \text{ for which (6) with fixed } \gamma \text{ holds subject to } k \text{ replacements}$$
$$= \sup\{\lambda < O_n^* : \exists\, k \text{ replacements changing } \mathbb{X}_n \text{ to } \mathbb{X}_{n,k} \text{ such that } S(\lambda, \gamma, \mathbb{X}_{n,k}, F) \neq \emptyset\}.$$

The quantity $\gamma_S(\lambda, \mathbb{X}_n, k)$ answers

Question III. What is the most central level $\gamma \geq 0$ of nonoutlier of $F$ that can be swamped at sample outlier threshold $\lambda$ by the presence of $k$ replacement outliers in the data $\mathbb{X}_n$?

The smaller the value of $\gamma_S(\lambda, \mathbb{X}_n, k)$, the worse is the swamping robustness performance of our given $\mathrm{OR}(\cdot, \mathbb{X}_n)$. The worst possible case, $\gamma_S(\lambda, \mathbb{X}_n, k) = 0$, denotes Type A swamping breakdown due to $k$ replacements. Let

$$k_S^{(A)}(\lambda, \mathbb{X}_n) = \min\{k : \gamma_S(\lambda, \mathbb{X}_n, k) = 0\}.$$

Then the Type A swamping breakdown point of $\mathrm{OR}(\cdot, \mathbb{X}_n)$ at sample outlyingness threshold $\lambda$ is given by

$$\mathrm{SBP}^{(A)}(\lambda, \mathbb{X}_n) = \frac{k_S^{(A)}(\lambda, \mathbb{X}_n)}{n}.$$

On the other hand, the quantity $\lambda_S(\gamma, \mathbb{X}_n, k)$ answers
Question IV. How extremely at sample threshold $\lambda \leq O_n^*$ can a $\gamma$ nonoutlier of $F$ be swamped by the presence of $k$ replacement outliers in the data $\mathbb{X}_n$?

The larger the value of $\lambda_S(\gamma, \mathbb{X}_n, k)$, the worse the swamping robustness of $\mathrm{OR}(\cdot, \mathbb{X}_n)$, and the worst possible case, $\lambda_S(\gamma, \mathbb{X}_n, k) = O_n^*$, denotes Type B swamping breakdown due to $k$ replacements. Let

$$k_S^{(B)}(\gamma, \mathbb{X}_n) = \min\{k : \lambda_S(\gamma, \mathbb{X}_n, k) = O_n^*\}.$$

Then the Type B swamping breakdown point of $\mathrm{OR}(\cdot, \mathbb{X}_n)$ at $F$ outlyingness threshold $\gamma$ is given by

$$\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n) = \frac{k_S^{(B)}(\gamma, \mathbb{X}_n)}{n}.$$

The quantities $\mathrm{SBP}^{(A)}(\lambda, \mathbb{X}_n)$ for $0 < \lambda < O_n^*$ and $\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n)$ for $0 < \gamma < 1$ represent the minimum fractions of replacements in the data $\mathbb{X}_n$ sufficient for Type A or Type B swamping breakdown relative to specified thresholds. The higher their values, the greater the swamping robustness of the outlier identifier $\mathrm{OR}(\cdot, \mathbb{X}_n)$.

Both $\mathrm{SBP}^{(A)}(\lambda, \mathbb{X}_n)$ and $\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n)$ as given above for the context of nonparametric outlier identification are newly formulated in the present paper. A version of $\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n)$ was introduced by Davies and Gather (1993) in the parametric univariate contaminated normal model and extended with a somewhat different formulation in Becker (1996) to the parametric multivariate contaminated normal model.

2.3.3 The four masking and swamping robustness measures

The following figure illustrates the interaction between the sets $M = M(\lambda, \gamma, \mathbb{X}_n, F)$ and $S = S(\lambda, \gamma, \mathbb{X}_n, F)$ relevant to masking and swamping, respectively.

[Figure: the $F$ $\gamma$-outlyingness contour and the sample $\lambda$-outlyingness contour, with the regions $S$ and $M$ labeled.]

In analyzing a data set $\mathbb{X}_n$ using an outlier identifier $\mathrm{OR}(\cdot, \mathbb{X}_n)$, one approach is to adopt a specific outlyingness threshold $\lambda$ and consider $\mathrm{OR}(\lambda, \mathbb{X}_n)$ as an estimator of the target
outlier region $\mathrm{out}(\lambda, F)$. In this case, with focus on the sample region $\mathrm{OR}(\lambda, \mathbb{X}_n)$ for a specified $\lambda$, the Type A versions of masking and swamping breakdown points are relevant and quite naturally go together as companion robustness measures which address Questions I and III, respectively.

On the other hand, one might focus on $\mathrm{out}(\gamma, F)$ for some $\gamma$ and ask how centrally this outlier region can be masked using $\mathrm{OR}(\cdot, \mathbb{X}_n)$ (Question II). Also, one might focus on $\mathrm{out}(\gamma, F)^c$ and want to know how extremely this nonoutlier region can be swamped using $\mathrm{OR}(\cdot, \mathbb{X}_n)$ (Question IV). For these, the Type B versions of masking and swamping breakdown points go together as companion robustness measures and play roles complementary to the Type A versions.

The treatment of Davies and Gather (1993) for the univariate contaminated normal model chooses Type A masking breakdown and Type B swamping breakdown, something of a mismatch. The foundational framework introduced above provides four relevant masking and swamping robustness measures and clarifies how to use them coherently and comprehensively. The next section provides basic lemmas of use for evaluation of the measures $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n)$, $\mathrm{MBP}^{(B)}(\gamma, \mathbb{X}_n)$, $\mathrm{SBP}^{(A)}(\lambda, \mathbb{X}_n)$, and $\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n)$ in applications.

3 Basic lemmas

3.1 Connections with standard replacement breakdown points

For a real-valued statistic $T(\mathbb{X}_n)$ taking values in $(-\infty, +\infty)$ (or in $[0, 1]$), explosion breakdown of $T(\mathbb{X}_n)$ occurs with $k$ points of $\mathbb{X}_n$ replaced if

$$\sup_{\mathbb{X}_{n,k}} T(\mathbb{X}_{n,k}) = \sup_{\mathbb{X}_{n,n}} T(\mathbb{X}_{n,n}) =: T^*, \tag{7}$$

with $\mathbb{X}_{n,k}$ as previously. Typical values of $T^*$ are $\infty$ or 1 (for the two ranges, respectively), although not necessarily. With $k_{\exp}(T(\mathbb{X}_n))$ denoting the minimum $k$ such that (7) can occur, the explosion replacement breakdown point of $T(\mathbb{X}_n)$ is given by $\mathrm{RBP}_{\exp}(T(\mathbb{X}_n)) = k_{\exp}(T(\mathbb{X}_n))/n$. Likewise, implosion breakdown occurs with $k$ points of $\mathbb{X}_n$ replaced if

$$\inf_{\mathbb{X}_{n,k}} T(\mathbb{X}_{n,k}) = \inf_{\mathbb{X}_{n,n}} T(\mathbb{X}_{n,n}) =: T_*. \tag{8}$$

The typical value of $T_*$ is 0, although not necessarily.
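To make the explosion breakdown criterion (7) concrete, here is a minimal Python sketch that searches for the smallest number of replacements driving a location statistic beyond a large bound. The adversary used (all replacements placed at one remote value `far`), the numerical bound standing in for "arbitrarily large", the data values, and the function name are assumptions of this sketch, not a general algorithm for replacement breakdown points.

```python
import statistics


def explosion_rbp_estimate(stat, sample, far=1e12, bound=1e6):
    """Smallest k/n for which k replacements at `far` push |stat| past `bound`.

    Heuristic adversary (an assumption of this sketch): all k replacements
    sit at the same remote value, which is enough for the mean and median.
    """
    n = len(sample)
    for k in range(1, n + 1):
        contaminated = [far] * k + list(sample[k:])
        if abs(stat(contaminated)) > bound:
            return k / n
    return None  # no explosion detected with this particular adversary


# Hypothetical data, chosen only for illustration (n = 10).
sample = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
rbp_mean = explosion_rbp_estimate(statistics.mean, sample)
rbp_median = explosion_rbp_estimate(statistics.median, sample)
print(rbp_mean, rbp_median)
```

For this sample of size 10 the search returns 0.1 for the mean and 0.5 for the median, in line with the values $n^{-1}$ and $n^{-1}\lfloor (n+1)/2 \rfloor$ commonly quoted for these two statistics.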
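With obvious notation, the implosion replacement breakdown point of $T(\mathbb{X}_n)$ is given by $\mathrm{RBP}_{\mathrm{imp}}(T(\mathbb{X}_n)) = k_{\mathrm{imp}}(T(\mathbb{X}_n))/n$.

We now give representations for $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n)$, $\mathrm{MBP}^{(B)}(\gamma, \mathbb{X}_n)$, $\mathrm{SBP}^{(A)}(\lambda, \mathbb{X}_n)$, and $\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n)$ in terms of the above explosion and implosion RBPs.

Lemma 1 Type A masking breakdown with replacement of $k$ sample values ($\gamma_M(\lambda, \mathbb{X}_n, k) = 1$) holds if and only if

$$\sup_{\mathbb{X}_{n,k}} \ \sup_{y \in \mathrm{OR}(\lambda, \mathbb{X}_{n,k})^c} O(y, F) = 1, \tag{9}$$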
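and hence

$$\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = \mathrm{RBP}_{\exp}\Big( \sup_{y \in \mathrm{OR}(\lambda, \mathbb{X}_n)^c} O(y, F) \Big). \tag{10}$$

Proof. Suppose that $\gamma_M(\lambda, \mathbb{X}_n, k) = 1$. Then, for any $\gamma < 1$, there exists $\mathbb{X}_{n,k}(\gamma)$ such that (5) holds. Hence, for a sequence $\gamma_m \uparrow 1$, let $y_m$ belong to the intersection in (5) corresponding to $\gamma = \gamma_m$ and $\mathbb{X}_{n,k} = \mathbb{X}_{n,k}(\gamma_m)$. Then $O(y_m, F) > \gamma_m \uparrow 1$, $m \to \infty$, and hence

$$\sup_{\mathbb{X}_{n,k}} \ \sup_{y \in \mathrm{OR}(\lambda, \mathbb{X}_{n,k})^c} O(y, F) \geq \sup_m O(y_m, F) \geq \sup_m \gamma_m = 1,$$

and (9) follows. Now assume that (9) holds. Then, for a sequence $\gamma_m \uparrow 1$, there exists $\mathbb{X}_{n,k}(\gamma_m)$ and a point $y_m \in \mathrm{OR}(\lambda, \mathbb{X}_{n,k}(\gamma_m))^c$ such that $O(y_m, F) > \gamma_m$. Then (5) holds with $\mathbb{X}_{n,k} = \mathbb{X}_{n,k}(\gamma_m)$ and $\gamma = \gamma_m$, and it follows that $\gamma_M(\lambda, \mathbb{X}_n, k) = 1$. Finally, by definition,

$$\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = n^{-1} \min\{k : \gamma_M(\lambda, \mathbb{X}_n, k) = 1\}$$

and

$$\mathrm{RBP}_{\exp}\Big( \sup_{y \in \mathrm{OR}(\lambda, \mathbb{X}_n)^c} O(y, F) \Big) = n^{-1} \min\Big\{ k : \sup_{\mathbb{X}_{n,k}} \ \sup_{y \in \mathrm{OR}(\lambda, \mathbb{X}_{n,k})^c} O(y, F) = 1 \Big\}.$$

As shown above, $\gamma_M(\lambda, \mathbb{X}_n, k) = 1$ if and only if (9) holds, and it follows that

$$\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = n^{-1} \min\{k : \gamma_M(\lambda, \mathbb{X}_n, k) = 1\} = n^{-1} \min\Big\{ k : \sup_{\mathbb{X}_{n,k}} \ \sup_{y \in \mathrm{OR}(\lambda, \mathbb{X}_{n,k})^c} O(y, F) = 1 \Big\} = \mathrm{RBP}_{\exp}\Big( \sup_{y \in \mathrm{OR}(\lambda, \mathbb{X}_n)^c} O(y, F) \Big),$$

establishing (10).

By proofs along similar lines (omitted), we obtain the following three lemmas.

Lemma 2 Type B masking breakdown with replacement of $k$ sample values ($\lambda_M(\gamma, \mathbb{X}_n, k) = O_{n*}$) holds if and only if

$$\inf_{\mathbb{X}_{n,k}} \ \inf_{y \in \mathrm{out}(\gamma, F)} O(y, \mathbb{X}_{n,k}) = O_{n*}, \tag{11}$$

and hence

$$\mathrm{MBP}^{(B)}(\gamma, \mathbb{X}_n) = \mathrm{RBP}_{\mathrm{imp}}\Big( \inf_{y \in \mathrm{out}(\gamma, F)} O(y, \mathbb{X}_n) \Big). \tag{12}$$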
Lemma 3 Type A swamping breakdown with replacement of $k$ sample values ($\gamma_S(\lambda, \mathbb{X}_n, k) = 0$) holds if and only if

$$\inf_{\mathbb{X}_{n,k}} \ \inf_{y \in \mathrm{OR}(\lambda, \mathbb{X}_{n,k})} O(y, F) = 0, \tag{13}$$

and hence

$$\mathrm{SBP}^{(A)}(\lambda, \mathbb{X}_n) = \mathrm{RBP}_{\mathrm{imp}}\Big( \inf_{y \in \mathrm{OR}(\lambda, \mathbb{X}_n)} O(y, F) \Big). \tag{14}$$

Lemma 4 Type B swamping breakdown with replacement of $k$ sample values ($\lambda_S(\gamma, \mathbb{X}_n, k) = O_n^*$) holds if and only if

$$\sup_{\mathbb{X}_{n,k}} \ \sup_{y \in \mathrm{out}(\gamma, F)^c} O(y, \mathbb{X}_{n,k}) = O_n^*, \tag{15}$$

and hence

$$\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n) = \mathrm{RBP}_{\exp}\Big( \sup_{y \in \mathrm{out}(\gamma, F)^c} O(y, \mathbb{X}_n) \Big). \tag{16}$$

Lemmas 1-4 transform the problem of evaluating masking and swamping breakdown points to a problem of evaluating standard explosion and implosion breakdown points of certain inf and sup statistics. These latter statistics are complicated to treat, and in the next section some helpful lemmas are established.

3.2 Key lemmas for implementation of the RBP formulas

In some cases, the infimum or supremum in the foregoing RBP formulas reduces to a minimum or maximum. For such cases, the following result treats breakdown of a statistic $S(\mathbb{X}_n)$ when it is either the minimum or the maximum of certain other statistics $T_1(\mathbb{X}_n), \ldots, T_J(\mathbb{X}_n)$. Let $k_{\exp}^{(0)}, k_{\exp}^{(1)}, \ldots, k_{\exp}^{(J)}$ be the minimal numbers of data points which must be replaced in order to cause explosion breakdown of the respective statistics $S$ and $T_1, \ldots, T_J$, and let $k_{\mathrm{imp}}^{(0)}, k_{\mathrm{imp}}^{(1)}, \ldots, k_{\mathrm{imp}}^{(J)}$ be their counterparts for implosion breakdown. Also, let $T_* = \min\{T_{1*}, \ldots, T_{J*}\}$ and $T^* = \max\{T_1^*, \ldots, T_J^*\}$.

Lemma 5 (i) Let $S = \min\{T_1, \ldots, T_J\}$. Then

$$\min\{k_{\mathrm{imp}}^{(1)}, \ldots, k_{\mathrm{imp}}^{(J)}\} \leq k_{\mathrm{imp}}^{(0)} \leq \max\{k_{\mathrm{imp}}^{(1)}, \ldots, k_{\mathrm{imp}}^{(J)}\}. \tag{17}$$

(ii) Let $S = \max\{T_1, \ldots, T_J\}$. Then

$$\min\{k_{\exp}^{(1)}, \ldots, k_{\exp}^{(J)}\} \leq k_{\exp}^{(0)} \leq \max\{k_{\exp}^{(1)}, \ldots, k_{\exp}^{(J)}\}. \tag{18}$$

Proof. (i) If $S \to T_*$ for some choice of $k_{\mathrm{imp}}^{(0)}$ replacements, then at least one of the events $T_1 \to T_*, \ldots, T_J \to T_*$ occurs with $k_{\mathrm{imp}}^{(0)}$ replacements. For any such occurrence, say $T_j \to T_*$, we must have $k_{\mathrm{imp}}^{(j)} \leq k_{\mathrm{imp}}^{(0)}$. This yields the left hand inequality of (17). On the other hand, if
$k_{\mathrm{imp}}^{(m)} = \max\{k_{\mathrm{imp}}^{(1)}, \ldots, k_{\mathrm{imp}}^{(J)}\}$ for some $m \in \{1, \ldots, J\}$, then all the events $T_j \to T_{j*}$, $1 \leq j \leq J$, occur with $k_{\mathrm{imp}}^{(m)}$ replacements and hence $S \to T_*$, yielding $k_{\mathrm{imp}}^{(0)} \leq k_{\mathrm{imp}}^{(m)}$. This yields the right hand inequality of (17), and part (i) is now established.

(ii) If $S \to T^*$ for some choice of $k_{\exp}^{(0)}$ replacements, then at least one of the events $T_1 \to T^*, \ldots, T_J \to T^*$ occurs with $k_{\exp}^{(0)}$ replacements. For any such occurrence, say $T_j \to T^*$, we have $k_{\exp}^{(j)} \leq k_{\exp}^{(0)}$, and the left hand inequality of (18) thus follows. On the other hand, if $k_{\exp}^{(m)} = \max\{k_{\exp}^{(1)}, \ldots, k_{\exp}^{(J)}\}$ for some $m \in \{1, \ldots, J\}$, then all the events $T_j \to T_j^*$ occur with $k_{\exp}^{(m)}$ replacements and hence $S \to T^*$, yielding $k_{\exp}^{(0)} \leq k_{\exp}^{(m)}$. This yields the right hand inequality of (18), establishing part (ii) of the lemma.

The above result applies, for example, in the case $\mathcal{X} = \mathbb{R}$ and $J = 2$, where the outlier regions are complements of finite intervals and the inf and sup are attained at endpoints. The following result extends to the general case and is obtained by a proof along similar lines with $T_* = \inf_y T_{y*}$ and $T^* = \sup_y T_y^*$.

Lemma 6 (i) Let $S = \inf_y T_y$. Then

$$\inf_y k_{\mathrm{imp}}^{(y)} \leq k_{\mathrm{imp}}^{(0)} \leq \sup_y k_{\mathrm{imp}}^{(y)}. \tag{19}$$

(ii) Let $S = \sup_y T_y$. Then

$$\inf_y k_{\exp}^{(y)} \leq k_{\exp}^{(0)} \leq \sup_y k_{\exp}^{(y)}. \tag{20}$$

The next lemma treats breakdown of a statistic $S(\mathbb{X}_n)$ when the event of breakdown due to $k$ replacements is related to the possible occurrences of certain events $E_1, \ldots, E_J$ as a consequence of $k$ replacements. Let $k_S$ be the minimal number of data points which must be replaced in order to cause breakdown (either implosion or explosion) of $S$, and let $k_1, \ldots, k_J$ be the minimal numbers of data points which must be replaced in order to cause occurrence of the respective events $E_1, \ldots, E_J$. It is assumed that $k_S$ and $k_1, \ldots, k_J$ are well-defined and belong to $\{1, 2, \ldots, n\}$.

Lemma 7 (i) If breakdown of $S$ is implied by occurrence of each one of the events $E_1, \ldots, E_J$, then

$$k_S \leq \min\{k_1, \ldots, k_J\}. \tag{21}$$

(ii) If breakdown of $S$ implies occurrence of at least one of the events $E_1, \ldots, E_J$, then

$$k_S \geq \min\{k_1, \ldots, k_J\}. \tag{22}$$

(iii) If breakdown of $S$ is implied by occurrence of each one of the events $E_1, \ldots, E_J$ and also implies that at least one of $E_1, \ldots, E_J$ must occur, then

$$k_S = \min\{k_1, \ldots, k_J\}. \tag{23}$$

Proof. (i) For any event $E_j$ whose occurrence with $k_j$ replacements implies breakdown of $S$, we must have $k_S \leq k_j$. Thus (21) follows. (ii) For any event $E_j$ whose occurrence is implied by breakdown of $S$, we must have $k_S \geq k_j$, and thus (22) follows. (iii) This follows by combining (i) and (ii).
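4 Illustrative Application: $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n)$ for Univariate Scaled Deviation Outlyingness

For $\mathcal{X} = \mathbb{R}$, let $F$ be a distribution on $\mathbb{R}$ and $\mu(F)$ and $\sigma(F)$ any location and spread measures. The corresponding scaled deviation outlyingness function taking values in $[0, 1)$ is given by

$$O(x, F) = \tilde{O}(x, F) / (1 + \tilde{O}(x, F)), \quad \text{with} \quad \tilde{O}(x, F) = \frac{|x - \mu(F)|}{\sigma(F)},$$

and sample versions $O(x, \mathbb{X}_n)$ and $\tilde{O}(x, \mathbb{X}_n)$ are similarly defined using $\mu(\mathbb{X}_n)$ and $\sigma(\mathbb{X}_n)$. Such outlyingness functions have been popularized by Mosteller and Tukey (1977), for example.

A complete study of $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n)$, $\mathrm{MBP}^{(B)}(\gamma, \mathbb{X}_n)$, $\mathrm{SBP}^{(A)}(\lambda, \mathbb{X}_n)$, and $\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n)$ for scaled deviation outlyingness is carried out by Wang and Serfling (2012). Here we obtain the first of these, as an illustration of the use of our key lemmas.

Note that for scaled deviation outlyingness, we have $O_{n*} = 0$ and $O_n^* = 1$. In terms of the $\tilde{O}$ versions, and with $\eta = \gamma/(1 - \gamma)$ and $\beta = \lambda/(1 - \lambda)$, we have

$$\mathrm{out}(\gamma, F) = \{x : O(x, F) > \gamma\} = \{x : \tilde{O}(x, F) > \eta\}$$

and

$$\mathrm{OR}(\lambda, \mathbb{X}_n) = \{x : O(x, \mathbb{X}_n) > \lambda\} = \{x : \tilde{O}(x, \mathbb{X}_n) > \beta\}.$$

Note also that $\eta \to \infty$ as $\gamma \to 1$, $\beta \to \infty$ as $\lambda \to 1$, and

$$\mathrm{OR}(\lambda, \mathbb{X}_n)^c = [\mu(\mathbb{X}_n) - \beta \sigma(\mathbb{X}_n), \ \mu(\mathbb{X}_n) + \beta \sigma(\mathbb{X}_n)].$$

Expressing Lemma 1 in terms of $\tilde{O}(x, F)$ and $\tilde{O}(x, \mathbb{X}_n)$, we obtain

$$\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = \mathrm{RBP}_{\exp}\Big( \sup_{y \in \mathrm{OR}(\lambda, \mathbb{X}_n)^c} \tilde{O}(y, F) \Big). \tag{24}$$

For convenience, we put $\mu(\mathbb{X}_n) = \hat{\mu}$ and $\sigma(\mathbb{X}_n) = \hat{\sigma}$. Using (from above) $\mathrm{OR}(\lambda, \mathbb{X}_n)^c = [\hat{\mu} - \beta \hat{\sigma}, \ \hat{\mu} + \beta \hat{\sigma}]$, it follows that

$$\sup_{y \in \mathrm{OR}(\lambda, \mathbb{X}_n)^c} \tilde{O}(y, F) = \begin{cases} \tilde{O}(\hat{\mu} + \beta \hat{\sigma}, F) & \text{if } \mu(F) \leq \hat{\mu} - \beta \hat{\sigma}, \\ \tilde{O}(\hat{\mu} - \beta \hat{\sigma}, F) & \text{if } \mu(F) \geq \hat{\mu} + \beta \hat{\sigma}, \\ \max\{\tilde{O}(\hat{\mu} + \beta \hat{\sigma}, F), \ \tilde{O}(\hat{\mu} - \beta \hat{\sigma}, F)\} & \text{otherwise} \end{cases}$$
$$= \max\{\tilde{O}(\hat{\mu} + \beta \hat{\sigma}, F), \ \tilde{O}(\hat{\mu} - \beta \hat{\sigma}, F)\} \quad \text{(in all cases)}. \tag{25}$$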
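The endpoint formula (25) can be checked numerically. The Python sketch below evaluates the supremum of $\tilde{O}(y, F)$ over the sample nonoutlier interval both from the two endpoints and by brute force on a grid, for data in which a single observation has been replaced by a remote value. The choices $\mu(F) = 0$ and $\sigma(F) = 1$, the data values, the threshold $\lambda = 0.75$, and all function names are assumptions made for illustration only.

```python
import statistics


def mad(sample):
    """Median absolute deviation about the median (unnormalized)."""
    med = statistics.median(sample)
    return statistics.median([abs(v - med) for v in sample])


def sup_raw_outlyingness(sample, lam, center, spread, mu_F=0.0, sigma_F=1.0):
    """Endpoint formula (25): sup of |y - mu_F| / sigma_F over the nonoutlier interval."""
    beta = lam / (1.0 - lam)
    m, s = center(sample), spread(sample)
    lo, hi = m - beta * s, m + beta * s
    return max(abs(lo - mu_F), abs(hi - mu_F)) / sigma_F, (lo, hi)


def grid_sup(lo, hi, mu_F=0.0, sigma_F=1.0, steps=1000):
    """Brute-force version of the same supremum on a grid containing both endpoints."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps)] + [hi]
    return max(abs(y - mu_F) / sigma_F for y in grid)


# Hypothetical data; one observation is replaced by a remote value `far`.
clean = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
lam = 0.75
estimators = (("mean/SD", statistics.mean, statistics.stdev),
              ("median/MAD", statistics.median, mad))
results = {}
for far in (1e3, 1e6):
    contaminated = [far] + clean[1:]
    for name, center, spread in estimators:
        sup_val, (lo, hi) = sup_raw_outlyingness(contaminated, lam, center, spread)
        # The grid search should agree with the endpoint formula.
        assert abs(sup_val - grid_sup(lo, hi)) <= 1e-9 * max(1.0, sup_val)
        results[(name, far)] = sup_val
print(results)
```

In this sketch the mean/SD supremum grows with the position of the single replacement, so on the $[0, 1)$ scale it approaches 1, whereas the median/MAD supremum is unaffected by where the single replacement is placed.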
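It then follows from Lemma 5(ii) that

$$\min\{\mathrm{RBP}_{\exp}(\tilde{O}(\hat{\mu} + \beta \hat{\sigma}, F)), \ \mathrm{RBP}_{\exp}(\tilde{O}(\hat{\mu} - \beta \hat{\sigma}, F))\} \leq \mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) \leq \max\{\mathrm{RBP}_{\exp}(\tilde{O}(\hat{\mu} + \beta \hat{\sigma}, F)), \ \mathrm{RBP}_{\exp}(\tilde{O}(\hat{\mu} - \beta \hat{\sigma}, F))\}. \tag{26}$$

We next evaluate $\mathrm{RBP}_{\exp}(\tilde{O}(\hat{\mu} + \beta \hat{\sigma}, F))$ and $\mathrm{RBP}_{\exp}(\tilde{O}(\hat{\mu} - \beta \hat{\sigma}, F))$, which are equal, respectively, to $\mathrm{RBP}_{\exp}(\hat{\mu} + \beta \hat{\sigma})$ and $\mathrm{RBP}_{\exp}(\hat{\mu} - \beta \hat{\sigma})$. Now we adopt

Assumption A. $\mathrm{RBP}_{\exp}(\hat{\mu} + \beta \hat{\sigma}, \mathbb{X}_n)$ and $\mathrm{RBP}_{\exp}(\hat{\mu}, \mathbb{X}_n)$ are invariant if $\mathbb{X}_n$ is replaced by $-\mathbb{X}_n$, i.e., if each observation $X_i$ is replaced by $-X_i$, $1 \leq i \leq n$.

Under Assumption A, $\mathrm{RBP}_{\exp}(\hat{\mu} + \beta \hat{\sigma})$ and $\mathrm{RBP}_{\exp}(\hat{\mu} - \beta \hat{\sigma})$ are equal, and we have

$$\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = \mathrm{RBP}_{\exp}(\hat{\mu} + \beta \hat{\sigma}). \tag{27}$$

Here we are using $\mathrm{RBP}_{\exp}$ with $T^* = \infty$. We now evaluate $\mathrm{RBP}_{\exp}(\hat{\mu} + \beta \hat{\sigma})$, for which we apply Lemma 7 with some choice of $k$ and the events $S$, $E_1$, $E_2$, and $E_3$, where

$$S = \{\exists\, \{\mathbb{X}_{n,k}\} \text{ such that } \mu(\mathbb{X}_{n,k}) + \beta \sigma(\mathbb{X}_{n,k}) \to \infty\},$$
$$E_1 = \{\exists\, \{\mathbb{X}_{n,k}\} \text{ such that } \mu(\mathbb{X}_{n,k}) \to +\infty\},$$
$$E_2 = \{\exists\, \{\mathbb{X}_{n,k}\} \text{ such that } \mu(\mathbb{X}_{n,k}) \text{ is bounded and } \sigma(\mathbb{X}_{n,k}) \to \infty\},$$
$$E_3 = \{\exists\, \{\mathbb{X}_{n,k}\} \text{ such that } \mu(\mathbb{X}_{n,k}) \to -\infty \text{ and } \mu(\mathbb{X}_{n,k}) + \beta \sigma(\mathbb{X}_{n,k}) \to \infty\}.$$

Note that (with $k$ fixed) each of $E_1$, $E_2$, $E_3$ implies $S$ and $S$ implies $E_1 \cup E_2 \cup E_3$. Then (23) yields

$$\mathrm{RBP}_{\exp}(\tilde{O}(\hat{\mu} + \beta \hat{\sigma}, F)) = \mathrm{RBP}_{\exp}(\hat{\mu} + \beta \hat{\sigma}) = n^{-1} \min\{k_1, k_2, k_3\}, \tag{28}$$

where $k_1, k_2, k_3$ are the minimal values of $k$, respectively, for occurrence of $E_1, E_2, E_3$. Now let us note that, under Assumption A, $k_3 \geq k_{\exp}(\hat{\mu}) = k_1$. Thus we have established the following useful result.

Corollary 8 Under Assumption A,

$$\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = n^{-1} \min\{k_1, k_2\} = \min\{\mathrm{RBP}_{\exp}(\hat{\mu}), \ \mathrm{RBP}_{\exp}(\hat{\sigma} \mid \hat{\mu} \text{ bounded})\}. \tag{29}$$

Note that $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n)$ does not depend upon the threshold $\lambda$. In typical cases, we have

$$\mathrm{RBP}_{\exp}(\hat{\sigma} \mid \hat{\mu} \text{ bounded}) \geq \mathrm{RBP}_{\exp}(\hat{\sigma}) \geq \mathrm{RBP}_{\exp}(\hat{\mu}),$$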
in which case simply $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = \mathrm{RBP}_{\exp}(\hat{\mu})$.

Examples. (Assumption A is satisfied in each case.)

(i) Mean and Standard Deviation ($\hat{\mu} = \bar{X}$ and $\hat{\sigma} = S$). It is straightforward that $\mathrm{RBP}_{\exp}(\hat{\mu}) = n^{-1}$, the minimum possible, yielding $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = n^{-1} \to 0$. (In passing, we note that $\mathrm{RBP}_{\exp}(\hat{\sigma} \mid \hat{\mu} \text{ bounded}) = 2 n^{-1}$.)

(ii) Median and MAD ($\hat{\mu} = \mathrm{Med}(\mathbb{X}_n)$ and $\hat{\sigma} = \mathrm{MAD}(\mathbb{X}_n)$). We obtain $\mathrm{RBP}_{\exp}(\hat{\mu}) = \mathrm{RBP}_{\exp}(\hat{\sigma} \mid \hat{\mu} \text{ bounded}) = n^{-1} \lfloor \frac{n+1}{2} \rfloor$, yielding $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = n^{-1} \lfloor \frac{n+1}{2} \rfloor \to \frac{1}{2}$.

(iii) $\alpha$-trimmed Mean and SD. Let $\mathbb{X}^{(n - 2\lfloor n\alpha \rfloor)}$ denote the $n - 2\lfloor n\alpha \rfloor$ observations remaining after trimming away the upper $\lfloor n\alpha \rfloor$ observations and the lower $\lfloor n\alpha \rfloor$ observations. Then take $\hat{\mu}$ to be the mean and $\hat{\sigma}$ to be the standard deviation of the data set $\mathbb{X}^{(n - 2\lfloor n\alpha \rfloor)}$. It is readily checked that $\mathrm{RBP}_{\exp}(\hat{\mu}) = (\lfloor n\alpha \rfloor + 1)/n$ and that $\mathrm{RBP}_{\exp}(\hat{\sigma} \mid \hat{\mu} \text{ bounded}) = 2(\lfloor n\alpha \rfloor + 1)/n$, yielding $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = n^{-1}(\lfloor n\alpha \rfloor + 1) \to \alpha$.

Note that the result in (iii) approaches that in (i) as $\alpha \to 0$ and that in (ii) as $\alpha \to 1/2$. As seen in Wang and Serfling (2012), there can be some advantage in $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = \alpha < 1/2$, when it produces a trade-off allowing $\mathrm{SBP}^{(A)}(\lambda, \mathbb{X}_n) > 1/2$, should this be desired.

Acknowledgements

The authors gratefully acknowledge useful input from G. L. Thompson, Xin Dang, Satyaki Mazumder, Bo Hong, and Seoweon Jin. Also, support under National Science Foundation Grant DMS-1106691 is sincerely acknowledged.

References

[1] Becker, C. (1996). Bruchpunkt und Bias zur Beurteilung multivariater Ausreisseridentifizierung. Ph.D. Dissertation. Universität Dortmund.

[2] Becker, C. and Gather, U. (1999). The masking breakdown point of multivariate outlier identification rules. Journal of the American Statistical Association 94, 947-955.

[3] Dang, X. and Serfling, R. (2010). Nonparametric depth-based multivariate outlier identifiers, and masking robustness properties. Journal of Statistical Planning and Inference 140, 198-213.

[4] Davies, L. and Gather, U. (1993). The identification of multiple outliers. Journal of the American Statistical Association 88, 782-801.

[5] Donoho, D. L. and Huber, P. J. (1983). The notion of breakdown point. In A Festschrift for Erich L. Lehmann (P. J. Bickel, K. A. Doksum and J. L. Hodges, Jr., eds.), pp. 157-184. Wadsworth, Belmont, California.

[6] Mosteller, C. F. and Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley, Reading, Mass.

[7] Olkin, I. (1994). Multivariate non-normal distributions and models of dependency. In Multivariate Analysis and Its Applications (T. W. Anderson, K. T. Fang and I. Olkin, eds.), IMS Lecture Notes Monograph Series, Volume 24, pp. 37-53. Hayward, California.

[8] Wang, S. and Serfling, R. (2012). On masking and swamping robustness of outlier identifiers for univariate data. Submitted (available at www.utdallas.edu/~serfling).