General Foundations for Studying Masking and Swamping Robustness of Outlier Identifiers

Robert Serfling (1) and Shanshan Wang (2)
University of Texas at Dallas

This paper is dedicated to the memory of Kesar Singh, an outstanding contributor to statistical science.

November, 2012

(1) Department of Mathematics, University of Texas at Dallas, Richardson, Texas 75080-3021, USA. Email: serfling@utdallas.edu. Website: www.utdallas.edu/~serfling.
(2) Department of Mathematics, University of Texas at Dallas, Richardson, Texas 75080-3021, USA.

Abstract

With greatly advanced computational resources, the scope of statistical data analysis and modeling has widened to accommodate pressing new arenas of application. In all such data settings, an important and challenging task is the identification of outliers. In particular, an outlier identification procedure must be robust against the possibilities of masking (an outlier is undetected as such) and swamping (a nonoutlier is classified as an outlier). Here we provide general foundations and criteria for quantifying the robustness of outlier detection procedures against masking and swamping. This unifies a scattering of existing results confined to univariate or multivariate data, and extends to a completely general framework allowing any type of data. For any space $\mathcal{X}$ of objects and probability model $F$ on $\mathcal{X}$, we consider a real-valued outlyingness function $O(x, F)$ defined over $x$ in $\mathcal{X}$ and a sample version $O(x, \mathbb{X}_n)$ based on a sample $\mathbb{X}_n$ from $\mathcal{X}$. In this setting, and within a coherent framework, we formulate general definitions of masking breakdown point and swamping breakdown point and develop lemmas for evaluating these robustness measures in practical applications. A brief illustration of the technique of application of the lemmas is provided for univariate scaled deviation outlyingness.

AMS 2000 Subject Classification: Primary 62G35; Secondary 62-07.

Key words and phrases: Nonparametric; Outlier detection; Masking robustness; Swamping robustness.

1 Introduction

With greatly advanced computational resources, the scope of statistical data analysis and modeling has widened to accommodate pressing new arenas of application. Now data is invariably multivariate, typically with very high dimension and/or heavy tails and/or huge sample sizes, or complex, involving curves, images, sets, and other types of object, often with stream or network structure. In all such data settings, an increasingly important and challenging task is the identification of outliers. New contexts involving outliers and anomaly detection include fraud detection, intrusion detection, and network robustness analysis, to name a few. The outliers themselves are sometimes the cases of primary interest.

Nonparametric notions and methods are especially important, since tractable parametric modeling is rather limited in the case of multivariate data (e.g., Olkin, 1994) and even more so with complex data. Further, since visualization is feasible only in the case of numerical or vector data in low dimension, outlier detection methods become of necessity algorithmic in nature and the determination of their performance properties is complicated.

The key concern about performance of an outlier detection procedure is its robustness. It must be resistant to adverse performance effects due to the very presence of the outliers to be identified, or even due to the presence of a concentration of inliers. In particular, one must assess the proclivities of the procedure toward either of two kinds of misclassification error: masking (an outlier is undetected) and swamping (a nonoutlier is classified as an outlier). Robustness against both masking and swamping is clearly essential.

In handling complex modern data structures, ingenious outlier detection approaches have been crafted ad hoc in diverse settings. Typically their robustness performance is explored only through limited simulation studies, with results that lack both generality and precise interpretation. Highly needed are theoretical underpinnings.

Here we develop general foundations and criteria for quantifying robustness of outlier detection procedures against masking and swamping. We employ a leading type of robustness measure, the (finite sample) breakdown point (BP) of Donoho and Huber (1983), i.e., the minimum fraction of replacements of the sample data (by outliers or inliers) sufficient to break down the statistical procedure, i.e., to render it drastically ineffective. This provides a distinctive quantitative approach toward measuring robustness.

In dealing with an outlier identification procedure, the BP approach to robustness involves two such measures: the masking breakdown point (MBP) and the swamping breakdown point (SBP). These are the minimum fractions of points in a data set which, if arbitrarily placed as outliers or inliers, suffice to cause the outlier detection procedure to mask arbitrarily extreme outliers, or swamp arbitrarily central nonoutliers, respectively. The higher the MBP and SBP values, the better the robustness performance of an outlier detection procedure.

Although the idea of BP for estimators such as the sample mean or variance is well established and quite simple and straightforward to define, the corresponding formulations of MBP and SBP are considerably more problematic and have received limited treatment. In the parametric setting of univariate data within the contaminated normal model, Davies and Gather (1993) formulate versions of MBP and SBP using addition contamination. Becker and Gather (1999) extend that MBP to the multivariate contaminated normal model, but extension of the SBP is not considered, although it is treated in Becker (1996). Dang and Serfling (2010) introduce a version of MBP in the setting of fully nonparametric multivariate outlier identification based on the use of depth functions and apply this notion to compare several different depth-based outlier identifiers. Again, however, the SBP is left untreated.

Although the MBP and SBP are conceptually interrelated and are formulated in parallel ways, the SBP is technically more delicate to treat than the MBP. Further, it turns out that for each of MBP and SBP there are two relevant versions representing complementary perspectives, making four robustness measures in all. Here we introduce a general framework for study of these robustness measures, establish key lemmas for their application, and carry out their application in the setting of univariate data.

Section 2 develops a completely general formulation of nonparametric outlier identification in terms of a real-valued outlyingness function $O(x, F)$, defined over $x$ in any space $\mathcal{X}$ of objects and based on a probability distribution $F$ on $\mathcal{X}$, and a sample version $O(x, \mathbb{X}_n)$ based on a sample $\mathbb{X}_n$ from $\mathcal{X}$. General definitions of MBP and SBP are provided within a unified conceptual framework for studying these robustness measures. Section 3 provides key technical lemmas for evaluating MBP and SBP in practical applications. Section 4 provides a brief illustration of the technique of application of the lemmas, using a leading outlier identifier in the case of univariate (real-valued) data, scaled deviation outlyingness. Complete treatment of univariate scaled deviation outlyingness, as well as of centered rank outlyingness, is carried out in Wang and Serfling (2012), including application to show the excellent masking and swamping robustness of the boxplot. Applications to other data settings (multivariate, functional, etc.) are beyond the scope of the present paper and deferred to future studies.

2 General foundations

In Section 2.1 we formulate outlyingness functions within a broad conceptual framework, and in Section 2.2 introduce the nonparametric outlier identification problem. General foundations on masking and swamping robustness are provided in Section 2.3.

2.1 Outlyingness functions on a space $\mathcal{X}$

The idea of outliers has a long tradition. The goal is to identify points or groups of points which lie apart from the main body of data, or which are unusual, anomalous, or suspicious in some sense, and then to take an appropriate action. With univariate and bivariate data, outlier visualization is easy. In higher dimensional spaces, however, algorithmic approaches become essential and entail formulating and relying upon outlyingness functions to explore data. Here let $\mathcal{X}$ be any space equipped with a suitable $\sigma$-algebra of measurable sets (left implicit) and let $F$ be a corresponding probability measure on the measurable sets.

Outlyingness Function. Associated with a probability distribution $F$ on $\mathcal{X}$, an outlyingness function $O(x, F)$ provides a center-outward ordering of points $x$ in $\mathcal{X}$, with higher values representing greater outlyingness relative to a center measuring location.

In the typical case of $\mathcal{X} = \mathbb{R}^d$, where we might compare with the density function of $F$, we observe that density contours and outlyingness contours need not coincide. A density function quantifies local probability structure at a point, whereas an outlyingness function quantifies the location of a point from a global and tail-oriented perspective. For $\mathbb{R}^d$, the outlyingness approach based on Mahalanobis distance (with robust location and dispersion estimates) is popular for its tractability and intuitive appeal. However, the corresponding outlyingness contours are necessarily elliptical, an unwanted restriction with many data sets. Thus other types of multivariate outlyingness functions are of interest. See Serfling (2010) for details and illustrations, and for connections with multivariate depth, quantile, and rank functions.

2.2 Outlier identification in $\mathcal{X}$

We will suppose that

$$\inf_x O(x, F) = 0, \qquad \sup_x O(x, F) = 1. \tag{1}$$

Accordingly, we define corresponding $\lambda$ outlier regions associated with $F$ and $O(x, F)$,

$$\mathrm{out}(\lambda, F) = \{x : O(x, F) > \lambda\}, \qquad 0 < \lambda < 1,$$

and we have (for later reference)

$$\inf\{\lambda > 0 : \overline{\mathrm{out}(\lambda, F)} \neq \emptyset\} = 0, \qquad \sup\{\lambda > 0 : \mathrm{out}(\lambda, F) \neq \emptyset\} = 1, \tag{2}$$

where $\overline{A}$ denotes the complement of a set $A$. The goal is to classify, for given choice of $\lambda$, all points $x$ of $\mathcal{X}$ as belonging to $\mathrm{out}(\lambda, F)$ or not. To this purpose, using a data set $\mathbb{X}_n$ and a data-based outlyingness function $O(x, \mathbb{X}_n)$ that may be considered to estimate $O(x, F)$, we estimate the region $\mathrm{out}(\lambda, F)$ by a sample outlier region

$$\mathrm{OR}(\lambda, \mathbb{X}_n) = \{x : O(x, \mathbb{X}_n) > \lambda\}.$$

Regarding the sample outlyingness function $O(x, \mathbb{X}_n)$, we define $O_{n*}$ and $O_n^*$ by

$$\inf_x O(x, \mathbb{X}_n) = O_{n*} \ (\geq 0), \qquad \sup_x O(x, \mathbb{X}_n) = O_n^* \ (\leq 1). \tag{3}$$

In comparison with (1), this allows for the case that $O(x, \mathbb{X}_n)$ is a step function that possibly does not attain the values 0 or 1. Although typically we have $O_{n*} = 0$ and $O_n^* = 1$, for the centered rank function treated in Wang and Serfling (2012) we have $O_{n*} = n^{-1}$ if $n$ is odd. For later reference, the analogue of (2) is

$$\inf\{\lambda > 0 : \overline{\mathrm{OR}(\lambda, \mathbb{X}_n)} \neq \emptyset\} = O_{n*}, \qquad \sup\{\lambda > 0 : \mathrm{OR}(\lambda, \mathbb{X}_n) \neq \emptyset\} = O_n^*. \tag{4}$$
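As a small illustration of the sample outlier region, the sketch below classifies points by thresholding a sample outlyingness function at a chosen level. It borrows the univariate scaled deviation form discussed in Section 4; the choice of median and MAD as location and spread, the function names, and the toy data are assumptions made only for this example.

```python
import numpy as np

def scaled_deviation_outlyingness(x, sample):
    """Sample outlyingness O(x, X_n) = O~ / (1 + O~), with O~ = |x - center| / spread.

    Median and MAD are used as center and spread (an illustrative choice only).
    """
    sample = np.asarray(sample, dtype=float)
    center = np.median(sample)
    spread = np.median(np.abs(sample - center))
    o_tilde = np.abs(np.asarray(x, dtype=float) - center) / spread
    return o_tilde / (1.0 + o_tilde)

def in_sample_outlier_region(x, sample, lam):
    """True where x belongs to OR(lam, X_n) = {x : O(x, X_n) > lam}."""
    return scaled_deviation_outlyingness(x, sample) > lam

# Toy data (hypothetical): seven unremarkable values and one remote value.
sample = np.array([-1.2, -0.5, -0.1, 0.0, 0.3, 0.8, 1.1, 9.0])
flags = in_sample_outlier_region(sample, sample, lam=0.8)
print(flags)
```

With this threshold only the remote value 9.0 falls in the sample outlier region. A different location and spread pair, or a different threshold, would generally give a different region.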

It is to be understood that the data-based outlier region $\mathrm{OR}(\lambda, \mathbb{X}_n)$ should include, in principle, regular points from $F$ which happen to be outlying according to the selected threshold. This region also may include true outliers or contaminants originating from another source. In some cases, $\mathrm{OR}(\lambda, \mathbb{X}_n)$ is given by $\mathrm{out}(\lambda, F_n)$ with $F_n$ an empirical distribution function.

The key issue is that if $\mathrm{OR}(\lambda, \mathbb{X}_n)$, or equivalently $O(x, \mathbb{X}_n)$, is itself sensitive to outliers (or inliers), then $\mathrm{OR}(\lambda, \mathbb{X}_n)$ cannot serve as a reliable outlier identifier. Robust choices of $\mathrm{OR}(\lambda, \mathbb{X}_n)$, i.e., of $O(x, \mathbb{X}_n)$, are needed. In particular, masking robustness and swamping robustness are essential.

2.3 Masking and swamping robustness

Here we introduce a foundational and conceptual framework for the study of masking and swamping robustness. It becomes clear that there are two variants of the notion of masking robustness and also two of the notion of swamping robustness. Very importantly, such a framework enables all four of these notions to be investigated and characterized in a unified and coherent fashion.

2.3.1 Masking robustness

Let a sample outlier identifier $\mathrm{OR}(\cdot, \mathbb{X}_n)$ be given. Key associated sets regarding masking are of the form

$$M(\lambda, \gamma, \mathbb{X}_n, F) = \overline{\mathrm{OR}(\lambda, \mathbb{X}_n)} \cap \mathrm{out}(\gamma, F),$$

defined for any $\lambda$ and $\gamma$. Clearly, masking occurs if

$$M(\lambda, \gamma, \mathbb{X}_n, F) \neq \emptyset, \tag{5}$$

which in view of (4) requires $\lambda > O_{n*}$. In this case some $\gamma$ outliers of $F$ are included in the sample threshold $\lambda$ nonoutlier region. For fixed $\lambda$, masking becomes more severe as $\gamma \uparrow 1$. That is, increasingly extreme outliers of $F$ become masked as sample threshold $\lambda$ nonoutliers. For fixed $\gamma$, masking becomes more severe as $\lambda \downarrow O_{n*}$. That is, some threshold $\gamma$ outliers of $F$ are included within an increasingly central sample nonoutlier region.

Now consider all possible modified data sets $\mathbb{X}_{n,k}$ obtainable by replacing $k$ observations of $\mathbb{X}_n$ by arbitrarily positioned new values ("outliers" or "contaminants").
Corresponding to the fixed $\lambda$ and fixed $\gamma$ cases, respectively, two indices that measure in different ways the size of the masking effect are

$$\gamma_M(\lambda, \mathbb{X}_n, k) = \text{largest } \gamma \text{ for which (5) with fixed } \lambda \text{ holds subject to } k \text{ replacements}$$
$$= \sup\{\gamma < 1 : \exists\ k \text{ replacements changing } \mathbb{X}_n \text{ to } \mathbb{X}_{n,k} \text{ such that } M(\lambda, \gamma, \mathbb{X}_{n,k}, F) \neq \emptyset\},$$

and

$$\lambda_M(\gamma, \mathbb{X}_n, k) = \text{smallest } \lambda \text{ for which (5) with fixed } \gamma \text{ holds subject to } k \text{ replacements}$$
$$= \inf\{\lambda > O_{n*} : \exists\ k \text{ replacements changing } \mathbb{X}_n \text{ to } \mathbb{X}_{n,k} \text{ such that } M(\lambda, \gamma, \mathbb{X}_{n,k}, F) \neq \emptyset\}.$$

The quantity $\gamma_M(\lambda, \mathbb{X}_n, k)$ answers

Question I. What is the most extreme level $\gamma \leq 1$ at which an $F$ outlier can be masked at threshold $\lambda$ due to the presence of $k$ replacement outliers in the data $\mathbb{X}_n$?

It represents the largest degree of outlyingness relative to $F$ that is nonidentifiable at sample outlyingness threshold $\lambda$. The larger the value of $\gamma_M(\lambda, \mathbb{X}_n, k)$, the worse is the masking robustness performance of our given outlier identifier $\mathrm{OR}(\cdot, \mathbb{X}_n)$. The worst possible case, $\gamma_M(\lambda, \mathbb{X}_n, k) = 1$, denotes a version (let us say Type A) of masking breakdown due to $k$ replacements. Let

$$k_M^{(A)}(\lambda, \mathbb{X}_n) = \min\{k : \gamma_M(\lambda, \mathbb{X}_n, k) = 1\}.$$

Then the Type A masking breakdown point of $\mathrm{OR}(\cdot, \mathbb{X}_n)$ at sample outlyingness threshold $\lambda$ is given by

$$\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = \frac{k_M^{(A)}(\lambda, \mathbb{X}_n)}{n}.$$

On the other hand, the quantity $\lambda_M(\gamma, \mathbb{X}_n, k)$ answers

Question II. How centrally in terms of sample outlyingness threshold $\lambda \geq O_{n*}$ can a $\gamma$ outlier of $F$ be masked due to the presence of $k$ replacement outliers in the data $\mathbb{X}_n$?

The smaller the value of $\lambda_M(\gamma, \mathbb{X}_n, k)$, the worse the masking robustness of $\mathrm{OR}(\cdot, \mathbb{X}_n)$, and the worst possible case, $\lambda_M(\gamma, \mathbb{X}_n, k) = O_{n*}$, denotes Type B masking breakdown due to $k$ replacements. Let

$$k_M^{(B)}(\gamma, \mathbb{X}_n) = \min\{k : \lambda_M(\gamma, \mathbb{X}_n, k) = O_{n*}\}.$$

Then the Type B masking breakdown point of $\mathrm{OR}(\cdot, \mathbb{X}_n)$ at $F$ outlyingness threshold $\gamma$ is given by

$$\mathrm{MBP}^{(B)}(\gamma, \mathbb{X}_n) = \frac{k_M^{(B)}(\gamma, \mathbb{X}_n)}{n}.$$

The quantities $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n)$ for $O_{n*} < \lambda < 1$ and $\mathrm{MBP}^{(B)}(\gamma, \mathbb{X}_n)$ for $0 < \gamma < 1$ represent the minimum fractions of replacements in the data $\mathbb{X}_n$ sufficient for Type A or Type B masking breakdown relative to specified thresholds. The higher their values, the greater the masking robustness of the outlier identifier $\mathrm{OR}(\cdot, \mathbb{X}_n)$.

The above formulation of $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n)$ extends that of Dang and Serfling (2010) for nonparametric multivariate outlier identification. It also corresponds (with a somewhat different formulation) to the notion introduced by Davies and Gather (1993) in the univariate contaminated normal model and extended to the multivariate contaminated normal model by Becker and Gather (1999). However, the version $\mathrm{MBP}^{(B)}(\gamma, \mathbb{X}_n)$ is completely new.

2.3.2 Swamping robustness

Again, let a sample outlier identifier $\mathrm{OR}(\cdot, \mathbb{X}_n)$ be given. Key associated sets regarding swamping are of the form

$$S(\lambda, \gamma, \mathbb{X}_n, F) = \mathrm{OR}(\lambda, \mathbb{X}_n) \cap \overline{\mathrm{out}(\gamma, F)},$$

defined for any $\lambda$ and $\gamma$, and swamping occurs if

$$S(\lambda, \gamma, \mathbb{X}_n, F) \neq \emptyset, \tag{6}$$

which in view of (4) requires $\lambda < O_n^*$. In this case some $\gamma$ nonoutliers of $F$ are included in the sample threshold $\lambda$ outlier region. For fixed $\lambda$, swamping becomes more severe as $\gamma \downarrow 0$, with increasingly central nonoutliers of $F$ becoming included in the sample threshold $\lambda$ outlier region. For fixed $\gamma$, swamping becomes more severe as $\lambda \uparrow O_n^*$, with threshold $\gamma$ nonoutliers of $F$ included within an increasingly extreme sample outlier region.

Again consider the modified data sets $\mathbb{X}_{n,k}$ obtainable by replacing $k$ observations of $\mathbb{X}_n$ by contaminants. Corresponding to the fixed $\lambda$ and fixed $\gamma$ cases, respectively, two indices related to extreme instances of swamping are

$$\gamma_S(\lambda, \mathbb{X}_n, k) = \text{smallest } \gamma \text{ for which (6) with fixed } \lambda \text{ holds subject to } k \text{ replacements}$$
$$= \inf\{\gamma > 0 : \exists\ k \text{ replacements changing } \mathbb{X}_n \text{ to } \mathbb{X}_{n,k} \text{ such that } S(\lambda, \gamma, \mathbb{X}_{n,k}, F) \neq \emptyset\},$$

and

$$\lambda_S(\gamma, \mathbb{X}_n, k) = \text{largest } \lambda \text{ for which (6) with fixed } \gamma \text{ holds subject to } k \text{ replacements}$$
$$= \sup\{\lambda < O_n^* : \exists\ k \text{ replacements changing } \mathbb{X}_n \text{ to } \mathbb{X}_{n,k} \text{ such that } S(\lambda, \gamma, \mathbb{X}_{n,k}, F) \neq \emptyset\}.$$

The quantity $\gamma_S(\lambda, \mathbb{X}_n, k)$ answers

Question III. What is the most central level $\gamma \geq 0$ of nonoutlier of $F$ that can be swamped at sample outlier threshold $\lambda$ by the presence of $k$ replacement outliers in the data $\mathbb{X}_n$?

The smaller the value of $\gamma_S(\lambda, \mathbb{X}_n, k)$, the worse is the swamping robustness performance of our given $\mathrm{OR}(\cdot, \mathbb{X}_n)$. The worst possible case, $\gamma_S(\lambda, \mathbb{X}_n, k) = 0$, denotes Type A swamping breakdown due to $k$ replacements. Let

$$k_S^{(A)}(\lambda, \mathbb{X}_n) = \min\{k : \gamma_S(\lambda, \mathbb{X}_n, k) = 0\}.$$

Then the Type A swamping breakdown point of $\mathrm{OR}(\cdot, \mathbb{X}_n)$ at sample outlyingness threshold $\lambda$ is given by

$$\mathrm{SBP}^{(A)}(\lambda, \mathbb{X}_n) = \frac{k_S^{(A)}(\lambda, \mathbb{X}_n)}{n}.$$

On the other hand, the quantity $\lambda_S(\gamma, \mathbb{X}_n, k)$ answers

Question IV. How extremely at sample threshold $\lambda \leq O_n^*$ can a $\gamma$ nonoutlier of $F$ be swamped by the presence of $k$ replacement outliers in the data $\mathbb{X}_n$?

The larger the value of $\lambda_S(\gamma, \mathbb{X}_n, k)$, the worse the swamping robustness of $\mathrm{OR}(\cdot, \mathbb{X}_n)$, and the worst possible case, $\lambda_S(\gamma, \mathbb{X}_n, k) = O_n^*$, denotes Type B swamping breakdown due to $k$ replacements. Let

$$k_S^{(B)}(\gamma, \mathbb{X}_n) = \min\{k : \lambda_S(\gamma, \mathbb{X}_n, k) = O_n^*\}.$$

Then the Type B swamping breakdown point of $\mathrm{OR}(\cdot, \mathbb{X}_n)$ at $F$ outlyingness threshold $\gamma$ is given by

$$\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n) = \frac{k_S^{(B)}(\gamma, \mathbb{X}_n)}{n}.$$

The quantities $\mathrm{SBP}^{(A)}(\lambda, \mathbb{X}_n)$ for $0 < \lambda < O_n^*$ and $\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n)$ for $0 < \gamma < 1$ represent the minimum fractions of replacements in the data $\mathbb{X}_n$ sufficient for Type A or Type B swamping breakdown relative to specified thresholds. The higher their values, the greater the swamping robustness of the outlier identifier $\mathrm{OR}(\cdot, \mathbb{X}_n)$.

Both $\mathrm{SBP}^{(A)}(\lambda, \mathbb{X}_n)$ and $\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n)$ as given above for the context of nonparametric outlier identification are newly formulated in the present paper. A version of $\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n)$ was introduced by Davies and Gather (1993) in the parametric univariate contaminated normal model and extended with a somewhat different formulation in Becker (1996) to the parametric multivariate contaminated normal model.

2.3.3 The four masking and swamping robustness measures

The following figure illustrates the interaction between the sets $M = M(\lambda, \gamma, \mathbb{X}_n, F)$ and $S = S(\lambda, \gamma, \mathbb{X}_n, F)$ relevant to masking and swamping, respectively.

[Figure: the $F$ $\gamma$-outlyingness contour and the sample $\lambda$-outlyingness contour, with the sets $S$ and $M$ marked.]

In analyzing a data set $\mathbb{X}_n$ using an outlier identifier $\mathrm{OR}(\cdot, \mathbb{X}_n)$, one approach is to adopt a specific outlyingness threshold $\lambda$ and consider $\mathrm{OR}(\lambda, \mathbb{X}_n)$ as an estimator of the target outlier region $\mathrm{out}(\lambda, F)$. In this case, with focus on the sample region $\mathrm{OR}(\lambda, \mathbb{X}_n)$ for a specified $\lambda$, the Type A versions of masking and swamping breakdown points are relevant and quite naturally go together as companion robustness measures which address Questions I and III, respectively.

On the other hand, one might focus on $\mathrm{out}(\gamma, F)$ for some $\gamma$ and ask how centrally this outlier region can be masked using $\mathrm{OR}(\cdot, \mathbb{X}_n)$ (Question II). Also, one might focus on $\overline{\mathrm{out}(\gamma, F)}$ and want to know how extremely this nonoutlier region can be swamped using $\mathrm{OR}(\cdot, \mathbb{X}_n)$ (Question IV). For these, the Type B versions of masking and swamping breakdown points go together as companion robustness measures and play roles complementary to the Type A versions.

The treatment of Davies and Gather (1993) for the univariate contaminated normal model chooses Type A masking breakdown and Type B swamping breakdown, something of a mismatch. The foundational framework introduced above provides four relevant masking and swamping robustness measures and clarifies how to use them coherently and comprehensively. The next section provides basic lemmas of use for evaluation of the measures $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n)$, $\mathrm{MBP}^{(B)}(\gamma, \mathbb{X}_n)$, $\mathrm{SBP}^{(A)}(\lambda, \mathbb{X}_n)$, and $\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n)$ in applications.

3 Basic lemmas

3.1 Connections with standard replacement breakdown points

For a real-valued statistic $T(\mathbb{X}_n)$ taking values in $(-\infty, +\infty)$ (or in $[0, 1]$), explosion breakdown of $T(\mathbb{X}_n)$ occurs with $k$ points of $\mathbb{X}_n$ replaced if

$$\sup_{\mathbb{X}_{n,k}} T(\mathbb{X}_{n,k}) = \sup_{\mathbb{X}_{n,n}} T(\mathbb{X}_{n,n}) =: T^*, \tag{7}$$

with $\mathbb{X}_{n,k}$ as previously. Typical values of $T^*$ are 1 or $\infty$, although not necessarily. With $k_{\exp}(T(\mathbb{X}_n))$ denoting the minimum $k$ such that (7) can occur, the explosion replacement breakdown point of $T(\mathbb{X}_n)$ is given by

$$\mathrm{RBP}_{\exp}(T(\mathbb{X}_n)) = k_{\exp}(T(\mathbb{X}_n))/n.$$

Likewise, implosion breakdown occurs with $k$ points of $\mathbb{X}_n$ replaced if

$$\inf_{\mathbb{X}_{n,k}} T(\mathbb{X}_{n,k}) = \inf_{\mathbb{X}_{n,n}} T(\mathbb{X}_{n,n}) =: T_*. \tag{8}$$

The typical value of $T_*$ is 0, although not necessarily.
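With obvious notation, the implosion replacement breakdown point of $T(\mathbb{X}_n)$ is given by

$$\mathrm{RBP}_{\mathrm{imp}}(T(\mathbb{X}_n)) = k_{\mathrm{imp}}(T(\mathbb{X}_n))/n.$$

We now give representations for $\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n)$, $\mathrm{MBP}^{(B)}(\gamma, \mathbb{X}_n)$, $\mathrm{SBP}^{(A)}(\lambda, \mathbb{X}_n)$, and $\mathrm{SBP}^{(B)}(\gamma, \mathbb{X}_n)$ in terms of the above explosion and implosion RBPs.

Lemma 1. Type A masking breakdown with replacement of $k$ sample values ($\gamma_M(\lambda, \mathbb{X}_n, k) = 1$) holds if and only if

$$\sup_{\mathbb{X}_{n,k}} \ \sup_{y \in \overline{\mathrm{OR}(\lambda, \mathbb{X}_{n,k})}} O(y, F) = 1, \tag{9}$$

and hence

$$\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = \mathrm{RBP}_{\exp}\Big( \sup_{y \in \overline{\mathrm{OR}(\lambda, \mathbb{X}_n)}} O(y, F) \Big). \tag{10}$$

Proof. Suppose that $\gamma_M(\lambda, \mathbb{X}_n, k) = 1$. Then, for any $\gamma < 1$, there exists $\mathbb{X}_{n,k}(\gamma)$ such that (5) holds. Hence, for a sequence $\gamma_m \uparrow 1$, let $y_m$ belong to the intersection in (5) corresponding to $\gamma = \gamma_m$ and $\mathbb{X}_{n,k} = \mathbb{X}_{n,k}(\gamma_m)$. Then $O(y_m, F) > \gamma_m \uparrow 1$, $m \to \infty$, and hence

$$\sup_{\mathbb{X}_{n,k}} \ \sup_{y \in \overline{\mathrm{OR}(\lambda, \mathbb{X}_{n,k})}} O(y, F) \ \geq\ \sup_m O(y_m, F) \ \geq\ \sup_m \gamma_m = 1,$$

and, since $O(y, F) \leq 1$ by (1), (9) follows.

Now assume that (9) holds. Then, for a sequence $\gamma_m \uparrow 1$, there exist $\mathbb{X}_{n,k}(\gamma_m)$ and a point $y_m \in \overline{\mathrm{OR}(\lambda, \mathbb{X}_{n,k}(\gamma_m))}$ such that $O(y_m, F) > \gamma_m$. Then (5) holds with $\mathbb{X}_{n,k} = \mathbb{X}_{n,k}(\gamma_m)$ and $\gamma = \gamma_m$, and it follows that $\gamma_M(\lambda, \mathbb{X}_n, k) = 1$.

Finally, by definition,

$$\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = n^{-1} \min\{k : \gamma_M(\lambda, \mathbb{X}_n, k) = 1\}$$

and

$$\mathrm{RBP}_{\exp}\Big( \sup_{y \in \overline{\mathrm{OR}(\lambda, \mathbb{X}_n)}} O(y, F) \Big) = n^{-1} \min\Big\{ k : \sup_{\mathbb{X}_{n,k}} \ \sup_{y \in \overline{\mathrm{OR}(\lambda, \mathbb{X}_{n,k})}} O(y, F) = 1 \Big\}.$$

As shown above, $\gamma_M(\lambda, \mathbb{X}_n, k) = 1$ if and only if (9) holds, and it follows that

$$\mathrm{MBP}^{(A)}(\lambda, \mathbb{X}_n) = n^{-1} \min\{k : \gamma_M(\lambda, \mathbb{X}_n, k) = 1\} = n^{-1} \min\Big\{ k : \sup_{\mathbb{X}_{n,k}} \ \sup_{y \in \overline{\mathrm{OR}(\lambda, \mathbb{X}_{n,k})}} O(y, F) = 1 \Big\} = \mathrm{RBP}_{\exp}\Big( \sup_{y \in \overline{\mathrm{OR}(\lambda, \mathbb{X}_n)}} O(y, F) \Big),$$

establishing (10).

By proofs along similar lines (omitted), we obtain the following three lemmas.

Lemma 2. Type B masking breakdown with replacement of $k$ sample values ($\lambda_M(\gamma, \mathbb{X}_n, k) = O_{n*}$) holds if and only if

$$\inf_{\mathbb{X}_{n,k}} \ \inf_{y \in \mathrm{out}(\gamma, F)} O(y, \mathbb{X}_{n,k}) = O_{n*}, \tag{11}$$

and hence

$$\mathrm{MBP}^{(B)}(\gamma, \mathbb{X}_n) = \mathrm{RBP}_{\mathrm{imp}}\Big( \inf_{y \in \mathrm{out}(\gamma, F)} O(y, \mathbb{X}_n) \Big). \tag{12}$$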

Lemma 3 Type A swamping breakdown with replacement of $k$ sample values ($\gamma_S(\lambda, X_{n,k}) = O_*$) holds if and only if
\[ \inf_{X_{n,k}} \; \inf_{y \in \mathrm{OR}(\lambda, X_{n,k})} O(y, F) = O_*, \tag{13} \]
and hence
\[ \mathrm{SBP}^{(A)}(\lambda, X_n) = \mathrm{RBP}_{\mathrm{imp}}\Big( \inf_{y \in \mathrm{OR}(\lambda, X_n)} O(y, F) \Big). \tag{14} \]

Lemma 4 Type B swamping breakdown with replacement of $k$ sample values ($\lambda_S(\gamma, X_{n,k}) = O^*$) holds if and only if
\[ \sup_{X_{n,k}} \; \sup_{y \in \overline{\mathrm{out}}(\gamma, F)} O(y, X_{n,k}) = O^*, \tag{15} \]
and hence
\[ \mathrm{SBP}^{(B)}(\gamma, X_n) = \mathrm{RBP}_{\mathrm{exp}}\Big( \sup_{y \in \overline{\mathrm{out}}(\gamma, F)} O(y, X_n) \Big). \tag{16} \]

Lemmas 1-4 transform the problem of evaluating masking and swamping breakdown points to a problem of evaluating standard explosion and implosion breakdown points of certain inf and sup statistics. These latter statistics are complicated to treat, and in the next section some helpful lemmas are established.

3.2 Key lemmas for implementation of the RBP formulas

In some cases, the infimum or supremum in the foregoing RBP formulas reduces to a minimum or maximum. For such cases, the following result treats breakdown of a statistic $S(X_n)$ when it is either the minimum or the maximum of certain other statistics $T_1(X_n), \ldots, T_J(X_n)$. Let $k_{\mathrm{exp}}^{(0)}, k_{\mathrm{exp}}^{(1)}, \ldots, k_{\mathrm{exp}}^{(J)}$ be the minimal numbers of data points which must be replaced in order to cause explosion breakdown of the respective statistics $S$ and $T_1, \ldots, T_J$, and let $k_{\mathrm{imp}}^{(0)}, k_{\mathrm{imp}}^{(1)}, \ldots, k_{\mathrm{imp}}^{(J)}$ be their counterparts for implosion breakdown. Also, let $T_* = \min\{T_{1*}, \ldots, T_{J*}\}$ and $T^* = \max\{T_1^*, \ldots, T_J^*\}$.

Lemma 5 (i) Let $S = \min\{T_1, \ldots, T_J\}$. Then
\[ \min\{k_{\mathrm{imp}}^{(1)}, \ldots, k_{\mathrm{imp}}^{(J)}\} \le k_{\mathrm{imp}}^{(0)} \le \max\{k_{\mathrm{imp}}^{(1)}, \ldots, k_{\mathrm{imp}}^{(J)}\}. \tag{17} \]
(ii) Let $S = \max\{T_1, \ldots, T_J\}$. Then
\[ \min\{k_{\mathrm{exp}}^{(1)}, \ldots, k_{\mathrm{exp}}^{(J)}\} \le k_{\mathrm{exp}}^{(0)} \le \max\{k_{\mathrm{exp}}^{(1)}, \ldots, k_{\mathrm{exp}}^{(J)}\}. \tag{18} \]

Proof. (i) If $S \to T_*$ for some choice of $k_{\mathrm{imp}}^{(0)}$ replacements, then at least one of the events $T_1 \to T_*, \ldots, T_J \to T_*$ occurs with $k_{\mathrm{imp}}^{(0)}$ replacements. For any such occurrence, say $T_j \to T_*$, we must have $k_{\mathrm{imp}}^{(j)} \le k_{\mathrm{imp}}^{(0)}$. This yields the left-hand inequality of (17). On the other hand, if $k_{\mathrm{imp}}^{(m)} = \max\{k_{\mathrm{imp}}^{(1)}, \ldots, k_{\mathrm{imp}}^{(J)}\}$ for some $m \in \{1, \ldots, J\}$, then all the events $T_j \to T_{j*}$, $1 \le j \le J$, occur with $k_{\mathrm{imp}}^{(m)}$ replacements and hence $S \to T_*$, yielding $k_{\mathrm{imp}}^{(0)} \le k_{\mathrm{imp}}^{(m)}$. This yields the right-hand inequality of (17), and part (i) is now established.

(ii) If $S \to T^*$ for some choice of $k_{\mathrm{exp}}^{(0)}$ replacements, then at least one of the events $T_1 \to T^*, \ldots, T_J \to T^*$ occurs with $k_{\mathrm{exp}}^{(0)}$ replacements. For any such occurrence, say $T_j \to T^*$, we have $k_{\mathrm{exp}}^{(j)} \le k_{\mathrm{exp}}^{(0)}$, and the left-hand inequality of (18) thus follows. On the other hand, if $k_{\mathrm{exp}}^{(m)} = \max\{k_{\mathrm{exp}}^{(1)}, \ldots, k_{\mathrm{exp}}^{(J)}\}$ for some $m \in \{1, \ldots, J\}$, then all the events $T_j \to T_j^*$ occur with $k_{\mathrm{exp}}^{(m)}$ replacements and hence $S \to T^*$, yielding $k_{\mathrm{exp}}^{(0)} \le k_{\mathrm{exp}}^{(m)}$. This yields the right-hand inequality of (18), establishing part (ii) of the lemma.

The above result applies, for example, in the case $\mathcal{X} = \mathbb{R}$ and $J = 2$, where the outlier regions are complements of finite intervals and the inf and sup are attained at endpoints. The following result extends to the general case and is obtained by a proof along similar lines with $T_* = \inf_y T_{y*}$ and $T^* = \sup_y T_y^*$.

Lemma 6 (i) Let $S = \inf_y T_y$. Then
\[ \inf_y k_{\mathrm{imp}}^{(y)} \le k_{\mathrm{imp}}^{(0)} \le \sup_y k_{\mathrm{imp}}^{(y)}. \tag{19} \]
(ii) Let $S = \sup_y T_y$. Then
\[ \inf_y k_{\mathrm{exp}}^{(y)} \le k_{\mathrm{exp}}^{(0)} \le \sup_y k_{\mathrm{exp}}^{(y)}. \tag{20} \]

The next lemma treats breakdown of a statistic $S(X_n)$ when the event of breakdown due to $k$ replacements is related to the possible occurrences of certain events $E_1, \ldots, E_J$ as a consequence of $k$ replacements. Let $k_S$ be the minimal number of data points which must be replaced in order to cause breakdown (either implosion or explosion) of $S$, and let $k_1, \ldots, k_J$ be the minimal numbers of data points which must be replaced in order to cause occurrence of the respective events $E_1, \ldots, E_J$. It is assumed that $k_S$ and $k_1, \ldots, k_J$ are well-defined and belong to $\{1, 2, \ldots, n\}$.

Lemma 7 (i) If breakdown of $S$ is implied by occurrence of each one of the events $E_1, \ldots, E_J$, then
\[ k_S \le \min\{k_1, \ldots, k_J\}. \tag{21} \]
(ii) If breakdown of $S$ implies occurrence of at least one of the events $E_1, \ldots, E_J$, then
\[ k_S \ge \min\{k_1, \ldots, k_J\}. \tag{22} \]
(iii) If breakdown of $S$ is implied by occurrence of each one of the events $E_1, \ldots, E_J$ and also implies that at least one of $E_1, \ldots, E_J$ must occur, then
\[ k_S = \min\{k_1, \ldots, k_J\}. \tag{23} \]

Proof. (i) For any event $E_j$ whose occurrence with $k_j$ replacements implies breakdown of $S$, we must have $k_S \le k_j$. Thus (21) follows. (ii) For any event $E_j$ whose occurrence is implied by breakdown of $S$, we must have $k_S \ge k_j$, and thus (22) follows.
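As a rough numerical illustration of Lemma 5(ii), the sketch below estimates the minimal replacement counts for explosion of the sample mean, the sample median, and their maximum. It is not part of the paper: the helper `min_k_explode`, the data values, and the constants `big` and `threshold` are hypothetical, and only one contamination pattern (all replaced points set to the same large value) is tried, so in general the returned count is only an upper bound on the true minimal $k$.

```python
import statistics

def min_k_explode(stat, data, big=1e12, threshold=1e6):
    """Smallest k for which replacing k points by `big` pushes |stat| past `threshold`.

    Heuristic (an assumption of this sketch): only the pattern "replace the
    first k points by one huge value" is tried, so this is an upper bound on
    the true minimal replacement count in general.
    """
    n = len(data)
    for k in range(1, n + 1):
        contaminated = [big] * k + list(data[k:])
        if abs(stat(contaminated)) > threshold:
            return k
    return None

# Made-up sample of size n = 7.
data = [2.1, 3.4, 1.9, 2.8, 3.0, 2.5, 2.2]

t1 = statistics.mean                      # T_1
t2 = statistics.median                    # T_2
s = lambda x: max(t1(x), t2(x))           # S = max{T_1, T_2}

k1 = min_k_explode(t1, data)              # explosion count for the mean
k2 = min_k_explode(t2, data)              # explosion count for the median
k0 = min_k_explode(s, data)               # explosion count for S

print(k1, k2, k0)
print(min(k1, k2) <= k0 <= max(k1, k2))   # the bounds of (18)
```

For this sample the maximum of the two statistics explodes as soon as the mean does, so the lower bound in (18) is attained.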

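4 Illustrative Application: $\mathrm{MBP}^{(A)}(\lambda, X_n)$ for Univariate Scaled Deviation Outlyingness

For $\mathcal{X} = \mathbb{R}$, let $F$ be a distribution on $\mathbb{R}$ and $\mu(F)$ and $\sigma(F)$ any location and spread measures. The corresponding scaled deviation outlyingness function taking values in $[0, 1)$ is given by
\[ O(x, F) = \frac{\tilde{O}(x, F)}{1 + \tilde{O}(x, F)}, \quad \text{with} \quad \tilde{O}(x, F) = \frac{|x - \mu(F)|}{\sigma(F)}, \]
and sample versions $O(x, X_n)$ and $\tilde{O}(x, X_n)$ are similarly defined using $\mu(X_n)$ and $\sigma(X_n)$. Such outlyingness functions have been popularized by Mosteller and Tukey (1977), for example. A complete study of $\mathrm{MBP}^{(A)}(\lambda, X_n)$, $\mathrm{MBP}^{(B)}(\gamma, X_n)$, $\mathrm{SBP}^{(A)}(\lambda, X_n)$, and $\mathrm{SBP}^{(B)}(\gamma, X_n)$ for scaled deviation outlyingness is carried out by Wang and Serfling (2012). Here we obtain the first of these, as an illustration of the use of our key lemmas.

Note that for scaled deviation outlyingness, we have $O_* = 0$ and $O^* = 1$. In terms of the $\tilde{O}$ versions, and with $\eta = \gamma/(1 - \gamma)$ and $\beta = \lambda/(1 - \lambda)$, we have
\[ \mathrm{out}(\gamma, F) = \{x : O(x, F) > \gamma\} = \{x : \tilde{O}(x, F) > \eta\} \]
and
\[ \mathrm{OR}(\lambda, X_n) = \{x : O(x, X_n) > \lambda\} = \{x : \tilde{O}(x, X_n) > \beta\}. \]
Note also that $\eta \to \infty$ as $\gamma \to 1$, $\beta \to \infty$ as $\lambda \to 1$, and
\[ \overline{\mathrm{OR}}(\lambda, X_n) = [\mu(X_n) - \beta\sigma(X_n), \; \mu(X_n) + \beta\sigma(X_n)]. \]
Expressing Lemma 1 in terms of $\tilde{O}(x, F)$ and $\tilde{O}(x, X_n)$, we obtain
\[ \mathrm{MBP}^{(A)}(\lambda, X_n) = \mathrm{RBP}_{\mathrm{exp}}\Big( \sup_{y \in \overline{\mathrm{OR}}(\lambda, X_n)} \tilde{O}(y, F) \Big). \tag{24} \]
For convenience, we put $\mu(X_n) = \hat{\mu}$ and $\sigma(X_n) = \hat{\sigma}$. Using (from above) $\overline{\mathrm{OR}}(\lambda, X_n) = [\hat{\mu} - \beta\hat{\sigma}, \; \hat{\mu} + \beta\hat{\sigma}]$, it follows that
\[ \sup_{y \in \overline{\mathrm{OR}}(\lambda, X_n)} \tilde{O}(y, F) =
\begin{cases}
\tilde{O}(\hat{\mu} + \beta\hat{\sigma}, F) & \text{if } \mu(F) \le \hat{\mu} - \beta\hat{\sigma}, \\
\tilde{O}(\hat{\mu} - \beta\hat{\sigma}, F) & \text{if } \mu(F) \ge \hat{\mu} + \beta\hat{\sigma}, \\
\max\{\tilde{O}(\hat{\mu} + \beta\hat{\sigma}, F), \; \tilde{O}(\hat{\mu} - \beta\hat{\sigma}, F)\} & \text{otherwise}
\end{cases} \]
\[ = \max\{\tilde{O}(\hat{\mu} + \beta\hat{\sigma}, F), \; \tilde{O}(\hat{\mu} - \beta\hat{\sigma}, F)\} \quad \text{(in all cases)}. \tag{25} \]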
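The endpoint reduction in (25) can be checked numerically. The following is a minimal sketch, not taken from the paper: the function names `o_tilde` and `outlyingness` and all numerical values (the population pair `mu_F`, `sigma_F`, the sample pair `mu_hat`, `sigma_hat`, and the threshold `lam`) are hypothetical choices made only for illustration.

```python
import math

def o_tilde(x, mu, sigma):
    """Unbounded scaled deviation |x - mu| / sigma."""
    return abs(x - mu) / sigma

def outlyingness(x, mu, sigma):
    """Bounded version O = O~ / (1 + O~), taking values in [0, 1)."""
    t = o_tilde(x, mu, sigma)
    return t / (1.0 + t)

# Hypothetical population and sample location/spread values.
mu_F, sigma_F = 0.0, 1.0
mu_hat, sigma_hat = 0.7, 1.3

lam = 0.75                      # threshold on the O scale
beta = lam / (1.0 - lam)        # equivalent threshold on the O~ scale

# Sample nonoutlier region [mu_hat - beta*sigma_hat, mu_hat + beta*sigma_hat].
lo = mu_hat - beta * sigma_hat
hi = mu_hat + beta * sigma_hat

# Approximate the sup over the interval on a fine grid ...
m = 10_000
grid = [lo + (hi - lo) * i / m for i in range(m + 1)]
grid_sup = max(o_tilde(y, mu_F, sigma_F) for y in grid)

# ... and compare with the maximum over the two endpoints, as in (25).
endpoint_max = max(o_tilde(lo, mu_F, sigma_F), o_tilde(hi, mu_F, sigma_F))

print(grid_sup, endpoint_max)
```

Because $|y - \mu(F)|$ is convex in $y$, its supremum over a closed interval is attained at an endpoint, which is why the grid value and the endpoint value agree up to rounding.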

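It then follows from Lemma 5(ii) that
\[ \min\{\mathrm{RBP}_{\mathrm{exp}}(\tilde{O}(\hat{\mu} + \beta\hat{\sigma}, F)), \; \mathrm{RBP}_{\mathrm{exp}}(\tilde{O}(\hat{\mu} - \beta\hat{\sigma}, F))\} \le \mathrm{MBP}^{(A)}(\lambda, X_n) \le \max\{\mathrm{RBP}_{\mathrm{exp}}(\tilde{O}(\hat{\mu} + \beta\hat{\sigma}, F)), \; \mathrm{RBP}_{\mathrm{exp}}(\tilde{O}(\hat{\mu} - \beta\hat{\sigma}, F))\}. \tag{26} \]
We next evaluate $\mathrm{RBP}_{\mathrm{exp}}(\tilde{O}(\hat{\mu} + \beta\hat{\sigma}, F))$ and $\mathrm{RBP}_{\mathrm{exp}}(\tilde{O}(\hat{\mu} - \beta\hat{\sigma}, F))$, which are equal, respectively, to $\mathrm{RBP}_{\mathrm{exp}}(\hat{\mu} + \beta\hat{\sigma})$ and $\mathrm{RBP}_{\mathrm{exp}}(\hat{\mu} - \beta\hat{\sigma})$. Now we adopt

Assumption A. $\mathrm{RBP}_{\mathrm{exp}}(\hat{\mu} + \beta\hat{\sigma}, X_n)$ and $\mathrm{RBP}_{\mathrm{exp}}(\hat{\mu}, X_n)$ are invariant if $X_n$ is replaced by $-X_n$, i.e., if each observation $X_i$ is replaced by $-X_i$, $1 \le i \le n$.

Under Assumption A, $\mathrm{RBP}_{\mathrm{exp}}(\hat{\mu} + \beta\hat{\sigma})$ and $\mathrm{RBP}_{\mathrm{exp}}(\hat{\mu} - \beta\hat{\sigma})$ are equal, and we have
\[ \mathrm{MBP}^{(A)}(\lambda, X_n) = \mathrm{RBP}_{\mathrm{exp}}(\hat{\mu} + \beta\hat{\sigma}). \tag{27} \]
Here we are using $\mathrm{RBP}_{\mathrm{exp}}$ with $T^* = \infty$. We now evaluate $\mathrm{RBP}_{\mathrm{exp}}(\hat{\mu} + \beta\hat{\sigma})$, for which we apply Lemma 7 with some choice of $k$ and the events $S$, $E_1$, $E_2$, and $E_3$, where
\[ S = \{\{X_{n,k}\} \text{ such that } \mu(X_{n,k}) + \beta\sigma(X_{n,k}) \to \infty\}, \]
\[ E_1 = \{\{X_{n,k}\} \text{ such that } \mu(X_{n,k}) \to +\infty\}, \]
\[ E_2 = \{\{X_{n,k}\} \text{ such that } \mu(X_{n,k}) \text{ is bounded and } \sigma(X_{n,k}) \to \infty\}, \]
\[ E_3 = \{\{X_{n,k}\} \text{ such that } \mu(X_{n,k}) \to -\infty \text{ and } \mu(X_{n,k}) + \beta\sigma(X_{n,k}) \to \infty\}. \]
Note that (with $k$ fixed) each of $E_1$, $E_2$, $E_3$ implies $S$, and $S$ implies $E_1 \cup E_2 \cup E_3$. Then (23) yields
\[ \mathrm{RBP}_{\mathrm{exp}}(\tilde{O}(\hat{\mu} + \beta\hat{\sigma}, F)) = \mathrm{RBP}_{\mathrm{exp}}(\hat{\mu} + \beta\hat{\sigma}) = n^{-1} \min\{k_1, k_2, k_3\}, \tag{28} \]
where $k_1$, $k_2$, $k_3$ are the minimal values of $k$, respectively, for occurrence of $E_1$, $E_2$, $E_3$. Now let us note that, under Assumption A, $k_3 \ge k_{\mathrm{exp}}(\hat{\mu}) = k_1$. Thus we have established the following useful result.

Corollary 8 Under Assumption A,
\[ \mathrm{MBP}^{(A)}(\lambda, X_n) = n^{-1} \min\{k_1, k_2\} = \min\{\mathrm{RBP}_{\mathrm{exp}}(\hat{\mu}), \; \mathrm{RBP}_{\mathrm{exp}}(\hat{\sigma} \mid \hat{\mu} \text{ bounded})\}. \tag{29} \]

Note that $\mathrm{MBP}^{(A)}(\lambda, X_n)$ does not depend upon the threshold $\lambda$. In typical cases, we have
\[ \mathrm{RBP}_{\mathrm{exp}}(\hat{\sigma} \mid \hat{\mu} \text{ bounded}) \ge \mathrm{RBP}_{\mathrm{exp}}(\hat{\sigma}) \ge \mathrm{RBP}_{\mathrm{exp}}(\hat{\mu}), \]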

in which case simply $\mathrm{MBP}^{(A)}(\lambda, X_n) = \mathrm{RBP}_{\mathrm{exp}}(\hat{\mu})$.

Examples. (Assumption A is satisfied in each case.)

(i) Mean and Standard Deviation ($\hat{\mu} = \bar{X}$ and $\hat{\sigma} = S$). It is straightforward that $\mathrm{RBP}_{\mathrm{exp}}(\hat{\mu}) = n^{-1}$, the minimum possible, yielding $\mathrm{MBP}^{(A)}(\lambda, X_n) = n^{-1} \to 0$. (In passing, we note that $\mathrm{RBP}_{\mathrm{exp}}(\hat{\sigma} \mid \hat{\mu} \text{ bounded}) = 2n^{-1}$.)

(ii) Median and MAD ($\hat{\mu} = \mathrm{Med}(X_n)$ and $\hat{\sigma} = \mathrm{MAD}(X_n)$). We obtain $\mathrm{RBP}_{\mathrm{exp}}(\hat{\mu}) = \mathrm{RBP}_{\mathrm{exp}}(\hat{\sigma} \mid \hat{\mu} \text{ bounded}) = n^{-1} \lfloor (n+1)/2 \rfloor$, yielding $\mathrm{MBP}^{(A)}(\lambda, X_n) = n^{-1} \lfloor (n+1)/2 \rfloor \to 1/2$.

(iii) $\alpha$-trimmed Mean and SD. Let $X_{(n - 2\lfloor n\alpha \rfloor)}$ denote the $n - 2\lfloor n\alpha \rfloor$ observations remaining after trimming away the upper $\lfloor n\alpha \rfloor$ observations and the lower $\lfloor n\alpha \rfloor$ observations. Then take $\hat{\mu}$ to be the mean and $\hat{\sigma}$ to be the standard deviation of the data set $X_{(n - 2\lfloor n\alpha \rfloor)}$. It is readily checked that $\mathrm{RBP}_{\mathrm{exp}}(\hat{\mu}) = (\lfloor n\alpha \rfloor + 1)/n$ and that $\mathrm{RBP}_{\mathrm{exp}}(\hat{\sigma} \mid \hat{\mu} \text{ bounded}) = 2(\lfloor n\alpha \rfloor + 1)/n$, yielding $\mathrm{MBP}^{(A)}(\lambda, X_n) = n^{-1}(\lfloor n\alpha \rfloor + 1) \to \alpha$.

Note that the result in (iii) approaches that in (i) as $\alpha \to 0$ and that in (ii) as $\alpha \to 1/2$. As seen in Wang and Serfling (2012), there can be some advantage in $\mathrm{MBP}^{(A)}(\lambda, X_n) = \alpha < 1/2$, when it produces a trade-off allowing $\mathrm{SBP}^{(A)}(\lambda, X_n) > 1/2$, should this be desired.

Acknowledgements

The authors gratefully acknowledge useful input from G. L. Thompson, Xin Dang, Satyaki Mazumder, Bo Hong, and Seoweon Jin. Also, support under National Science Foundation Grant DMS-1106691 is sincerely acknowledged.

References

[1] Becker, C. (1996). Bruchpunkt und Bias zur Beurteilung multivariater Ausreisseridentifizierung. Ph.D. Dissertation. Universität Dortmund.
[2] Becker, C. and Gather, U. (1999). The masking breakdown point of multivariate outlier identification rules. Journal of the American Statistical Association 94, 947-955.
[3] Dang, X. and Serfling, R. (2010). Nonparametric depth-based multivariate outlier identifiers, and masking robustness properties. Journal of Statistical Planning and Inference 140, 198-213.
[4] Davies, L. and Gather, U. (1993). The identification of multiple outliers. Journal of the American Statistical Association 88, 782-801.
[5] Donoho, D. L. and Huber, P. J. (1983). The notion of breakdown point. In A Festschrift for Erich L. Lehmann (P. J. Bickel, K. A. Doksum and J. L. Hodges, Jr., eds.), pp. 157-184. Wadsworth, Belmont, California.

[6] Mosteller, C. F. and Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley, Reading, Mass.
[7] Olkin, I. (1994). Multivariate non-normal distributions and models of dependency. In Multivariate Analysis and Its Applications (T. W. Anderson, K. T. Fang and I. Olkin, eds.), IMS Lecture Notes Monograph Series, Volume 24, pp. 37-53. Hayward, California.
[8] Wang, S. and Serfling, R. (2012). On masking and swamping robustness of outlier identifiers for univariate data. Submitted (available at www.utdallas.edu/~serfling).