General Foundations for Studying Masking and Swamping Robustness of Outlier Identifiers


Robert Serfling (1) and Shanshan Wang (2)
University of Texas at Dallas

This paper is dedicated to the memory of Kesar Singh, an outstanding contributor to statistical science.

November,

(1) Department of Mathematics, University of Texas at Dallas, Richardson, Texas , USA. serfling@utdallas.edu. Website: serfling.
(2) Department of Mathematics, University of Texas at Dallas, Richardson, Texas , USA.

Abstract

With greatly advanced computational resources, the scope of statistical data analysis and modeling has widened to accommodate pressing new arenas of application. In all such data settings, an important and challenging task is the identification of outliers. In particular, an outlier identification procedure must be robust against the possibilities of masking (an outlier is undetected as such) and swamping (a nonoutlier is classified as an outlier). Here we provide general foundations and criteria for quantifying the robustness of outlier detection procedures against masking and swamping. This unifies a scattering of existing results confined to univariate or multivariate data, and extends to a completely general framework allowing any type of data. For any space $X$ of objects and probability model $F$ on $X$, we consider a real-valued outlyingness function $O(x, F)$ defined over $x$ in $X$ and a sample version $O(x, X_n)$ based on a sample $X_n$ from $X$. In this setting, and within a coherent framework, we formulate general definitions of masking breakdown point and swamping breakdown point and develop lemmas for evaluating these robustness measures in practical applications. A brief illustration of the technique of application of the lemmas is provided for univariate scaled deviation outlyingness.

AMS 2000 Subject Classification: Primary 62G35 Secondary

Key words and phrases: Nonparametric; Outlier detection; Masking robustness; Swamping robustness.

1 Introduction

With greatly advanced computational resources, the scope of statistical data analysis and modeling has widened to accommodate pressing new arenas of application. Now data is invariably multivariate, typically with very high dimension and/or heavy tails and/or huge sample sizes, or complex, involving curves, images, sets, and other types of object, often with stream or network structure. In all such data settings, an increasingly important and challenging task is the identification of outliers. New contexts involving outliers and anomaly detection include fraud detection, intrusion detection, and network robustness analysis, to name a few. The outliers themselves are sometimes the cases of primary interest.

Nonparametric notions and methods are especially important, since tractable parametric modeling is rather limited in the case of multivariate data (e.g., Olkin, 1994) and even more so with complex data. Further, since visualization is feasible only in the case of numerical or vector data in low dimension, outlier detection methods become of necessity algorithmic in nature, and the determination of their performance properties is complicated.

The key concern about performance of an outlier detection procedure is its robustness. It must be resistant to adverse performance effects due to the very presence of the outliers to be identified, or even due to the presence of a concentration of inliers. In particular, one must assess the proclivities of the procedure toward either of two kinds of misclassification error: masking (an outlier is undetected) and swamping (a nonoutlier is classified as an outlier). Robustness against both masking and swamping is clearly essential.

In handling complex modern data structures, ingenious outlier detection approaches have been crafted ad hoc in diverse settings. Typically their robustness performance is explored only through limited simulation studies, with results that lack both generality and precise interpretation.
Highly needed are theoretical underpinnings. Here we develop general foundations and criteria for quantifying robustness of outlier detection procedures against masking and swamping. We employ a leading type of robustness measure, the (finite sample) breakdown point (BP) of Donoho and Huber (1983), i.e., the minimum fraction of replacements of the sample data (by outliers or inliers) sufficient to break down the statistical procedure, i.e., to render it drastically ineffective. This provides a distinctive quantitative approach toward measuring robustness.

In dealing with an outlier identification procedure, the BP approach to robustness involves two such measures: the masking breakdown point (MBP) and the swamping breakdown point (SBP). These are the minimum fractions of points in a data set which, if arbitrarily placed as outliers or inliers, suffice to cause the outlier detection procedure to mask arbitrarily extreme outliers, or swamp arbitrarily central nonoutliers, respectively. The higher the MBP and SBP values, the better the robustness performance of an outlier detection procedure.

Although the idea of BP for estimators such as the sample mean or variance is well established and quite simple and straightforward to define, the corresponding formulations of MBP and SBP are considerably more problematic and have received limited treatment. In the parametric setting of univariate data within the contaminated normal model, Davies and Gather (1993) formulate versions of MBP and SBP using addition contamination. Becker and Gather (1999) extend that MBP to the multivariate contaminated normal model, but extension of the SBP is not considered, although it is treated in Becker (1996). Dang and Serfling (2010) introduce a version of MBP in the setting of fully nonparametric multivariate outlier identification based on the use of depth functions and apply this notion to compare several different depth-based outlier identifiers. Again, however, the SBP is left untreated.

Although the MBP and SBP are conceptually interrelated and are formulated in parallel ways, the SBP is technically more delicate to treat than the MBP. Further, it turns out that for each of MBP and SBP there are two relevant versions representing complementary perspectives, making four robustness measures in all. Here we introduce a general framework for study of these robustness measures, establish key lemmas for their application, and carry out their application in the setting of univariate data.

Section 2 develops a completely general formulation of nonparametric outlier identification in terms of a real-valued outlyingness function $O(x, F)$, defined over $x$ in any space $X$ of objects and based on a probability distribution $F$ on $X$, and a sample version $O(x, X_n)$ based on a sample $X_n$ from $X$. General definitions of MBP and SBP are provided within a unified conceptual framework for studying these robustness measures. Section 3 provides key technical lemmas for evaluating MBP and SBP in practical applications. Section 4 provides a brief illustration of the technique of application of the lemmas, using a leading outlier identifier in the case of univariate (real-valued) data, scaled deviation outlyingness. Complete treatment of univariate scaled deviation outlyingness, as well as of centered rank outlyingness, is carried out in Wang and Serfling (2012), including application to show the excellent masking and swamping robustness of the boxplot.
Applications to other data settings (multivariate, functional, etc.) are beyond the scope of the present paper and deferred to future studies.

2 General foundations

In Section 2.1 we formulate outlyingness functions within a broad conceptual framework, and in Section 2.2 introduce the nonparametric outlier identification problem. General foundations on masking and swamping robustness are provided in Section 2.3.

2.1 Outlyingness functions on a space X

The idea of outliers has a long tradition. The goal is to identify points or groups of points which lie apart from the main body of data, or which are unusual, anomalous, or suspicious in some sense, and then to take an appropriate action. With univariate and bivariate data, outlier visualization is easy. In higher dimensional spaces, however, algorithmic approaches become essential and entail formulating and relying upon outlyingness functions to explore data. Here let $X$ be any space equipped with a suitable σ-algebra of measurable sets (left implicit) and let $F$ be a corresponding probability measure on the measurable sets.

Outlyingness Function. Associated with a probability distribution $F$ on $X$, an outlyingness function $O(x, F)$ provides a center-outward ordering of points $x$ in $X$, with higher values representing greater outlyingness relative to a center measuring location.

In the typical case of $X = \mathbb{R}^d$, where we might compare with the density function of $F$, we observe that density contours and outlyingness contours need not coincide. A density function quantifies local probability structure at a point, whereas an outlyingness function quantifies the location of a point from a global and tail-oriented perspective. For $\mathbb{R}^d$, the outlyingness approach based on Mahalanobis distance (with robust location and dispersion estimates) is popular for its tractability and intuitive appeal. However, the corresponding outlyingness contours are necessarily elliptical, an unwanted restriction with many data sets. Thus other types of multivariate outlyingness functions are of interest. See Serfling (2010) for details and illustrations, and for connections with multivariate depth, quantile, and rank functions.

2.2 Outlier identification in X

We will suppose that

$$\inf_x O(x, F) = 0, \qquad \sup_x O(x, F) = 1. \tag{1}$$

Accordingly, we define corresponding $\lambda$ outlier regions associated with $F$ and $O(x, F)$,

$$\mathrm{out}(\lambda, F) = \{x : O(x, F) > \lambda\}, \quad 0 < \lambda < 1,$$

and we have (for later reference)

$$\inf\{\lambda > 0 : \overline{\mathrm{out}(\lambda, F)} \neq \emptyset\} = 0, \qquad \sup\{\lambda > 0 : \mathrm{out}(\lambda, F) \neq \emptyset\} = 1, \tag{2}$$

where $\overline{A}$ denotes the complement of a set $A$. The goal is to classify, for given choice of $\lambda$, all points $x$ of $X$ as belonging to $\mathrm{out}(\lambda, F)$ or not. To this purpose, using a data set $X_n$ and a data-based outlyingness function $O(x, X_n)$ that may be considered to estimate $O(x, F)$, we estimate the region $\mathrm{out}(\lambda, F)$ by a sample outlier region

$$\mathrm{OR}(\lambda, X_n) = \{x : O(x, X_n) > \lambda\}.$$

Regarding the sample outlyingness function $O(x, X_n)$, we define $O_{n*}$ and $O_n^*$ by

$$\inf_x O(x, X_n) = O_{n*} \ (\geq 0), \qquad \sup_x O(x, X_n) = O_n^* \ (\leq 1). \tag{3}$$

In comparison with (1), this allows for the case that $O(x, X_n)$ is a step function that possibly does not attain the values 0 or 1. Although typically we have $O_{n*} = 0$ and $O_n^* = 1$, for the centered rank function treated in Wang and Serfling (2012) we have $O_{n*} = n^{-1}$ if $n$ is odd. For later reference, the analogue of (2) is

$$\inf\{\lambda > 0 : \overline{\mathrm{OR}(\lambda, X_n)} \neq \emptyset\} = O_{n*}, \qquad \sup\{\lambda > 0 : \mathrm{OR}(\lambda, X_n) \neq \emptyset\} = O_n^*. \tag{4}$$

It is to be understood that the data-based outlier region $\mathrm{OR}(\lambda, X_n)$ should include, in principle, regular points from $F$ which happen to be outlying according to the selected threshold. This region also may include true outliers or contaminants originating from another source. In some cases, $\mathrm{OR}(\lambda, X_n)$ is given by $\mathrm{out}(\lambda, F_n)$ with $F_n$ an empirical df.

The key issue is that if $\mathrm{OR}(\lambda, X_n)$, or equivalently $O(x, X_n)$, is itself sensitive to outliers (or inliers), then $\mathrm{OR}(\lambda, X_n)$ cannot serve as a reliable outlier identifier. Robust choices of $\mathrm{OR}(\lambda, X_n)$, i.e., of $O(x, X_n)$, are needed. In particular, masking robustness and swamping robustness are essential.

2.3 Masking and swamping robustness

Here we introduce a foundational and conceptual framework for the study of masking and swamping robustness. It becomes clear that there are two variants of the notion of masking robustness and also two of the notion of swamping robustness. Very importantly, such a framework enables all four of these notions to be investigated and characterized in a unified and coherent fashion.

2.3.1 Masking robustness

Let a sample outlier identifier $\mathrm{OR}(\cdot, X_n)$ be given. Key associated sets regarding masking are of the form

$$M(\lambda, \gamma, X_n, F) = \overline{\mathrm{OR}(\lambda, X_n)} \cap \mathrm{out}(\gamma, F),$$

defined for any $\lambda$ and $\gamma$. Clearly, masking occurs if

$$M(\lambda, \gamma, X_n, F) \neq \emptyset, \tag{5}$$

which in view of (4) requires $\lambda > O_{n*}$ (the infimum of the sample outlyingness in (3)). In this case some $\gamma$ outliers of $F$ are included in the sample threshold $\lambda$ nonoutlier region.

For fixed $\lambda$, masking becomes more severe as $\gamma \uparrow 1$. That is, increasingly extreme outliers of $F$ become masked as sample threshold $\lambda$ nonoutliers. For fixed $\gamma$, masking becomes more severe as $\lambda \downarrow O_{n*}$. That is, some threshold $\gamma$ outliers of $F$ are included within an increasingly central sample nonoutlier region.

Now consider all possible modified data sets $X_{n,k}$ obtainable by replacing $k$ observations of $X_n$ by arbitrarily positioned new values ("outliers" or "contaminants").
Corresponding to the fixed $\lambda$ and fixed $\gamma$ cases, respectively, two indices that measure in different ways the size of the masking effect are

$$\gamma_M(\lambda, X_n, k) = \text{largest } \gamma \text{ for which (5) with fixed } \lambda \text{ holds subject to } k \text{ replacements}$$
$$= \sup\{\gamma < 1 : \exists\ k \text{ replacements changing } X_n \text{ to } X_{n,k} \text{ such that } M(\lambda, \gamma, X_{n,k}, F) \neq \emptyset\},$$

and

$$\lambda_M(\gamma, X_n, k) = \text{smallest } \lambda \text{ for which (5) with fixed } \gamma \text{ holds subject to } k \text{ replacements}$$
$$= \inf\{\lambda > O_{n*} : \exists\ k \text{ replacements changing } X_n \text{ to } X_{n,k} \text{ such that } M(\lambda, \gamma, X_{n,k}, F) \neq \emptyset\}.$$

The quantity $\gamma_M(\lambda, X_n, k)$ answers

Question I. What is the most extreme level $\gamma \uparrow 1$ at which an $F$ outlier can be masked at threshold $\lambda$ due to the presence of $k$ replacement outliers in the data $X_n$?

It represents the largest degree of outlyingness relative to $F$ that is nonidentifiable at sample outlyingness threshold $\lambda$. The larger the value of $\gamma_M(\lambda, X_n, k)$, the worse is the masking robustness performance of our given outlier identifier $\mathrm{OR}(\cdot, X_n)$. The worst possible case, $\gamma_M(\lambda, X_n, k) = 1$, denotes a version (let us say Type A) of masking breakdown due to $k$ replacements. Let

$$k_M^{(A)}(\lambda, X_n) = \min\{k : \gamma_M(\lambda, X_n, k) = 1\}.$$

Then the Type A masking breakdown point of $\mathrm{OR}(\cdot, X_n)$ at sample outlyingness threshold $\lambda$ is given by

$$\mathrm{MBP}^{(A)}(\lambda, X_n) = \frac{k_M^{(A)}(\lambda, X_n)}{n}.$$

On the other hand, the quantity $\lambda_M(\gamma, X_n, k)$ answers

Question II. How centrally in terms of sample outlyingness threshold $\lambda \downarrow O_{n*}$ can a $\gamma$ outlier of $F$ be masked due to the presence of $k$ replacement outliers in the data $X_n$?

The smaller the value of $\lambda_M(\gamma, X_n, k)$, the worse the masking robustness of $\mathrm{OR}(\cdot, X_n)$, and the worst possible case, $\lambda_M(\gamma, X_n, k) = O_{n*}$, denotes Type B masking breakdown due to $k$ replacements. Let

$$k_M^{(B)}(\gamma, X_n) = \min\{k : \lambda_M(\gamma, X_n, k) = O_{n*}\}.$$

Then the Type B masking breakdown point of $\mathrm{OR}(\cdot, X_n)$ at $F$ outlyingness threshold $\gamma$ is given by

$$\mathrm{MBP}^{(B)}(\gamma, X_n) = \frac{k_M^{(B)}(\gamma, X_n)}{n}.$$

The quantities $\mathrm{MBP}^{(A)}(\lambda, X_n)$ for $O_{n*} < \lambda < 1$ and $\mathrm{MBP}^{(B)}(\gamma, X_n)$ for $0 < \gamma < 1$ represent the minimum fractions of replacements in the data $X_n$ sufficient for Type A or Type B masking breakdown relative to specified thresholds. The higher their values, the greater the masking robustness of the outlier identifier $\mathrm{OR}(\cdot, X_n)$.

The above formulation of $\mathrm{MBP}^{(A)}(\lambda, X_n)$ extends that of Dang and Serfling (2010) for nonparametric multivariate outlier identification. It also corresponds (with a somewhat different formulation) to the notion introduced by Davies and Gather (1993) in the univariate contaminated normal model and extended to the multivariate contaminated normal model by Becker and Gather (1999). However, the version $\mathrm{MBP}^{(B)}(\gamma, X_n)$ is completely new.
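To make Question I concrete, the following minimal sketch evaluates, for one particular replacement pattern, the largest $F$-outlyingness level that lands inside the sample nonoutlier region. Everything specific here is an assumption for illustration rather than part of the paper's development: the scaled deviation outlyingness of Section 4 is borrowed, $F$ is taken to have location 0 and spread 1, the data values are made up, and the helper names are hypothetical. Because $\gamma_M(\lambda, X_n, k)$ is a supremum over all replacement patterns, a single pattern can only give a lower bound on it.

```python
import statistics

def outlyingness_F(y, mu_F=0.0, sigma_F=1.0):
    """Scaled deviation outlyingness O(y, F); mu_F and sigma_F are assumed values."""
    t = abs(y - mu_F) / sigma_F
    return t / (1.0 + t)

def mad(data):
    """Median absolute deviation about the median (no consistency factor)."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data])

def largest_masked_level(data, lam, location, spread):
    """Largest O(y, F) over the sample nonoutlier interval at threshold lam.

    For scaled deviation outlyingness the nonoutlier region is the interval
    [mu - beta*sigma, mu + beta*sigma] with beta = lam / (1 - lam), and
    O(y, F) increases with |y - mu_F|, so the maximum sits at an endpoint.
    """
    beta = lam / (1.0 - lam)
    mu, sigma = location(data), spread(data)
    return max(outlyingness_F(mu - beta * sigma),
               outlyingness_F(mu + beta * sigma))

def replace_first_k(data, k, value):
    """Replace the first k observations by one arbitrary value (one pattern only)."""
    return [value] * k + list(data[k:])

sample = [2.1, -0.3, 0.8, 1.5, -1.2, 0.1, 0.9, -0.7, 1.1]  # hypothetical data
contaminated = replace_first_k(sample, 1, 1e9)

level_mean_sd = largest_masked_level(contaminated, 0.75,
                                     statistics.mean, statistics.stdev)
level_med_mad = largest_masked_level(contaminated, 0.75,
                                     statistics.median, mad)
print(level_mean_sd, level_med_mad)
```

With a single replacement, the mean and standard deviation version already admits points of $F$-outlyingness close to 1 into the sample nonoutlier region, while the median and MAD version stays at a moderate level for this pattern.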

2.3.2 Swamping robustness

Again, let a sample outlier identifier $\mathrm{OR}(\cdot, X_n)$ be given. Key associated sets regarding swamping are of the form

$$S(\lambda, \gamma, X_n, F) = \mathrm{OR}(\lambda, X_n) \cap \overline{\mathrm{out}(\gamma, F)},$$

defined for any $\lambda$ and $\gamma$, and swamping occurs if

$$S(\lambda, \gamma, X_n, F) \neq \emptyset, \tag{6}$$

which in view of (4) requires $\lambda < O_n^*$ (the supremum of the sample outlyingness in (3)). In this case some $\gamma$ nonoutliers of $F$ are included in the sample threshold $\lambda$ outlier region.

For fixed $\lambda$, the swamping becomes more severe as $\gamma \downarrow 0$, with increasingly central nonoutliers of $F$ becoming included in the sample threshold $\lambda$ outlier region. For fixed $\gamma$, swamping becomes more severe as $\lambda \uparrow O_n^*$, with threshold $\gamma$ nonoutliers of $F$ included within an increasingly extreme sample outlier region.

Again consider the modified data sets $X_{n,k}$ obtainable by replacing $k$ observations of $X_n$ by contaminants. Corresponding to the fixed $\lambda$ and fixed $\gamma$ cases, respectively, two indices related to extreme instances of swamping are

$$\gamma_S(\lambda, X_n, k) = \text{smallest } \gamma \text{ for which (6) with fixed } \lambda \text{ holds subject to } k \text{ replacements}$$
$$= \inf\{\gamma > 0 : \exists\ k \text{ replacements changing } X_n \text{ to } X_{n,k} \text{ such that } S(\lambda, \gamma, X_{n,k}, F) \neq \emptyset\},$$

and

$$\lambda_S(\gamma, X_n, k) = \text{largest } \lambda \text{ for which (6) with fixed } \gamma \text{ holds subject to } k \text{ replacements}$$
$$= \sup\{\lambda < O_n^* : \exists\ k \text{ replacements changing } X_n \text{ to } X_{n,k} \text{ such that } S(\lambda, \gamma, X_{n,k}, F) \neq \emptyset\}.$$

The quantity $\gamma_S(\lambda, X_n, k)$ answers

Question III. What is the most central level $\gamma \downarrow 0$ of nonoutlier of $F$ that can be swamped at sample outlier threshold $\lambda$ by the presence of $k$ replacement outliers in the data $X_n$?

The smaller the value of $\gamma_S(\lambda, X_n, k)$, the worse is the swamping robustness performance of our given $\mathrm{OR}(\cdot, X_n)$. The worst possible case, $\gamma_S(\lambda, X_n, k) = 0$, denotes Type A swamping breakdown due to $k$ replacements. Let

$$k_S^{(A)}(\lambda, X_n) = \min\{k : \gamma_S(\lambda, X_n, k) = 0\}.$$

Then the Type A swamping breakdown point of $\mathrm{OR}(\cdot, X_n)$ at sample outlyingness threshold $\lambda$ is given by

$$\mathrm{SBP}^{(A)}(\lambda, X_n) = \frac{k_S^{(A)}(\lambda, X_n)}{n}.$$

On the other hand, the quantity $\lambda_S(\gamma, X_n, k)$ answers

Question IV. How extremely at sample threshold $\lambda \uparrow O_n^*$ can a $\gamma$ nonoutlier of $F$ be swamped by the presence of $k$ replacement outliers in the data $X_n$?

The larger the value of $\lambda_S(\gamma, X_n, k)$, the worse the swamping robustness of $\mathrm{OR}(\cdot, X_n)$, and the worst possible case, $\lambda_S(\gamma, X_n, k) = O_n^*$, denotes Type B swamping breakdown due to $k$ replacements. Let

$$k_S^{(B)}(\gamma, X_n) = \min\{k : \lambda_S(\gamma, X_n, k) = O_n^*\}.$$

Then the Type B swamping breakdown point of $\mathrm{OR}(\cdot, X_n)$ at $F$ outlyingness threshold $\gamma$ is given by

$$\mathrm{SBP}^{(B)}(\gamma, X_n) = \frac{k_S^{(B)}(\gamma, X_n)}{n}.$$

The quantities $\mathrm{SBP}^{(A)}(\lambda, X_n)$ for $0 < \lambda < O_n^*$ and $\mathrm{SBP}^{(B)}(\gamma, X_n)$ for $0 < \gamma < 1$ represent the minimum fractions of replacements in the data $X_n$ sufficient for Type A or Type B swamping breakdown relative to specified thresholds. The higher their values, the greater the swamping robustness of the outlier identifier $\mathrm{OR}(\cdot, X_n)$.

Both $\mathrm{SBP}^{(A)}(\lambda, X_n)$ and $\mathrm{SBP}^{(B)}(\gamma, X_n)$ as given above for the context of nonparametric outlier identification are newly formulated in the present paper. A version of $\mathrm{SBP}^{(B)}(\gamma, X_n)$ was introduced by Davies and Gather (1993) in the parametric univariate contaminated normal model and extended with a somewhat different formulation in Becker (1996) to the parametric multivariate contaminated normal model.

2.3.3 The four masking and swamping robustness measures

The following figure illustrates the interaction between the sets $M = M(\lambda, \gamma, X_n, F)$ and $S = S(\lambda, \gamma, X_n, F)$ relevant to masking and swamping, respectively.

[Figure: the $F$ $\gamma$-outlyingness contour and the sample $\lambda$-outlyingness contour, with the regions $S$ and $M$ marked.]

In analyzing a data set $X_n$ using an outlier identifier $\mathrm{OR}(\cdot, X_n)$, one approach is to adopt a specific outlyingness threshold $\lambda$ and consider $\mathrm{OR}(\lambda, X_n)$ as an estimator of the target outlier region $\mathrm{out}(\lambda, F)$. In this case, with focus on the sample region $\mathrm{OR}(\lambda, X_n)$ for a specified $\lambda$, the Type A versions of masking and swamping breakdown points are relevant and quite naturally go together as companion robustness measures which address Questions I and III, respectively.

On the other hand, one might focus on $\mathrm{out}(\gamma, F)$ for some $\gamma$ and ask how centrally this outlier region can be masked using $\mathrm{OR}(\cdot, X_n)$ (Question II). Also, one might focus on $\overline{\mathrm{out}(\gamma, F)}$ and want to know how extremely this nonoutlier region can be swamped using $\mathrm{OR}(\cdot, X_n)$ (Question IV). For these, the Type B versions of masking and swamping breakdown points go together as companion robustness measures and play roles complementary to the Type A versions.

The treatment of Davies and Gather (1993) for the univariate contaminated normal model chooses Type A masking breakdown and Type B swamping breakdown, something of a mismatch. The foundational framework introduced above provides four relevant masking and swamping robustness measures and clarifies how to use them coherently and comprehensively. The next section provides basic lemmas of use for evaluation of the measures $\mathrm{MBP}^{(A)}(\lambda, X_n)$, $\mathrm{MBP}^{(B)}(\gamma, X_n)$, $\mathrm{SBP}^{(A)}(\lambda, X_n)$, and $\mathrm{SBP}^{(B)}(\gamma, X_n)$ in applications.

3 Basic lemmas

3.1 Connections with standard replacement breakdown points

For a real-valued statistic $T(X_n)$ taking values in $(-\infty, +\infty)$ (or in $[0, 1]$), explosion breakdown of $T(X_n)$ occurs with $k$ points of $X_n$ replaced if

$$\sup_{X_{n,k}} T(X_{n,k}) = \sup_{X_{n,n}} T(X_{n,n}) =: T^*, \tag{7}$$

with $X_{n,k}$ as previously. Typical values of $T^*$ are 1 or $\infty$, although not necessarily. With $k_{\exp}(T(X_n))$ denoting the minimum $k$ such that (7) can occur, the explosion replacement breakdown point of $T(X_n)$ is given by

$$\mathrm{RBP}_{\exp}(T(X_n)) = k_{\exp}(T(X_n))/n.$$

Likewise, implosion breakdown occurs with $k$ points of $X_n$ replaced if

$$\inf_{X_{n,k}} T(X_{n,k}) = \inf_{X_{n,n}} T(X_{n,n}) =: T_*. \tag{8}$$

The typical value of $T_*$ is 0, although not necessarily.
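As a rough numerical companion to the definition of explosion breakdown in (7), the sketch below searches for the smallest $k$ at which a location statistic can be driven to an arbitrarily large value. It is only a heuristic under stated assumptions: the supremum over all replacement sets $X_{n,k}$ is approximated by a single pattern (the first $k$ points replaced by one huge value), the cutoff `big / (2 * n)` is an arbitrary stand-in for "unbounded", and `explosion_rbp` and the data are hypothetical. In general such a search yields an upper bound on the minimal $k$.

```python
import statistics

def explosion_rbp(stat, data, big=1e12):
    """Heuristic explosion replacement breakdown point k/n of `stat`.

    Tries k = 1, 2, ..., n; for each k the first k observations are replaced
    by `big`, and breakdown is declared once |stat| exceeds big / (2n).
    Returns None if no k triggers the cutoff.
    """
    n = len(data)
    for k in range(1, n + 1):
        replaced = [big] * k + list(data[k:])
        if abs(stat(replaced)) > big / (2 * n):
            return k / n
    return None

sample = [2.1, -0.3, 0.8, 1.5, -1.2, 0.1, 0.9, -0.7, 1.1]  # hypothetical, n = 9

rbp_mean = explosion_rbp(statistics.mean, sample)      # one replacement suffices
rbp_median = explosion_rbp(statistics.median, sample)  # needs 5 of the 9 points
print(rbp_mean, rbp_median)
```

For this sample of size 9 the search returns 1/9 for the mean and 5/9 for the median, in line with the familiar breakdown behavior of these two statistics.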
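With obvious notation, the implosion replacement breakdown point of $T(X_n)$ is given by $\mathrm{RBP}_{\mathrm{imp}}(T(X_n)) = k_{\mathrm{imp}}(T(X_n))/n$.

We now give representations for $\mathrm{MBP}^{(A)}(\lambda, X_n)$, $\mathrm{MBP}^{(B)}(\gamma, X_n)$, $\mathrm{SBP}^{(A)}(\lambda, X_n)$, and $\mathrm{SBP}^{(B)}(\gamma, X_n)$ in terms of the above explosion and implosion RBPs.

Lemma 1. Type A masking breakdown with replacement of $k$ sample values ($\gamma_M(\lambda, X_n, k) = 1$) holds if and only if

$$\sup_{X_{n,k}} \ \sup_{y \in \overline{\mathrm{OR}(\lambda, X_{n,k})}} O(y, F) = 1, \tag{9}$$

and hence

$$\mathrm{MBP}^{(A)}(\lambda, X_n) = \mathrm{RBP}_{\exp}\Big( \sup_{y \in \overline{\mathrm{OR}(\lambda, X_n)}} O(y, F) \Big). \tag{10}$$

Proof. Suppose that $\gamma_M(\lambda, X_n, k) = 1$. Then, for any $\gamma < 1$, there exists $X_{n,k}(\gamma)$ such that (5) holds. Hence, for a sequence $\gamma_m \uparrow 1$, let $y_m$ belong to the intersection in (5) corresponding to $\gamma = \gamma_m$ and $X_{n,k} = X_{n,k}(\gamma_m)$. Then $O(y_m, F) > \gamma_m \to 1$, $m \to \infty$, and hence

$$\sup_{X_{n,k}} \ \sup_{y \in \overline{\mathrm{OR}(\lambda, X_{n,k})}} O(y, F) \geq \sup_m O(y_m, F) \geq \sup_m \gamma_m = 1,$$

and (9) follows. Now assume that (9) holds. Then, for a sequence $\gamma_m \uparrow 1$, there exist $X_{n,k}(\gamma_m)$ and a point $y_m \in \overline{\mathrm{OR}(\lambda, X_{n,k}(\gamma_m))}$ such that $O(y_m, F) > \gamma_m$. Then (5) holds with $X_{n,k} = X_{n,k}(\gamma_m)$ and $\gamma = \gamma_m$, and it follows that $\gamma_M(\lambda, X_n, k) = 1$. Finally, by definition,

$$\mathrm{MBP}^{(A)}(\lambda, X_n) = n^{-1} \min\{k : \gamma_M(\lambda, X_n, k) = 1\}$$

and

$$\mathrm{RBP}_{\exp}\Big( \sup_{y \in \overline{\mathrm{OR}(\lambda, X_n)}} O(y, F) \Big) = n^{-1} \min\Big\{ k : \sup_{X_{n,k}} \ \sup_{y \in \overline{\mathrm{OR}(\lambda, X_{n,k})}} O(y, F) \geq 1 \Big\}.$$

As shown above, $\gamma_M(\lambda, X_n, k) = 1$ if and only if (9) holds, and it follows that

$$\mathrm{MBP}^{(A)}(\lambda, X_n) = n^{-1} \min\{k : \gamma_M(\lambda, X_n, k) = 1\}$$
$$= n^{-1} \min\Big\{ k : \sup_{X_{n,k}} \ \sup_{y \in \overline{\mathrm{OR}(\lambda, X_{n,k})}} O(y, F) = 1 \Big\}$$
$$= n^{-1} \min\Big\{ k : \sup_{X_{n,k}} \ \sup_{y \in \overline{\mathrm{OR}(\lambda, X_{n,k})}} O(y, F) \geq 1 \Big\}$$
$$= \mathrm{RBP}_{\exp}\Big( \sup_{y \in \overline{\mathrm{OR}(\lambda, X_n)}} O(y, F) \Big),$$

establishing (10).

By proofs along similar lines (omitted), we obtain the following three lemmas.

Lemma 2. Type B masking breakdown with replacement of $k$ sample values ($\lambda_M(\gamma, X_n, k) = O_{n*}$) holds if and only if

$$\inf_{X_{n,k}} \ \inf_{y \in \mathrm{out}(\gamma, F)} O(y, X_{n,k}) = O_{n*}, \tag{11}$$

and hence

$$\mathrm{MBP}^{(B)}(\gamma, X_n) = \mathrm{RBP}_{\mathrm{imp}}\Big( \inf_{y \in \mathrm{out}(\gamma, F)} O(y, X_n) \Big). \tag{12}$$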

Lemma 3. Type A swamping breakdown with replacement of $k$ sample values ($\gamma_S(\lambda, X_n, k) = 0$) holds if and only if

$$\inf_{X_{n,k}} \ \inf_{y \in \mathrm{OR}(\lambda, X_{n,k})} O(y, F) = 0, \tag{13}$$

and hence

$$\mathrm{SBP}^{(A)}(\lambda, X_n) = \mathrm{RBP}_{\mathrm{imp}}\Big( \inf_{y \in \mathrm{OR}(\lambda, X_n)} O(y, F) \Big). \tag{14}$$

Lemma 4. Type B swamping breakdown with replacement of $k$ sample values ($\lambda_S(\gamma, X_n, k) = O_n^*$) holds if and only if

$$\sup_{X_{n,k}} \ \sup_{y \in \overline{\mathrm{out}(\gamma, F)}} O(y, X_{n,k}) = O_n^*, \tag{15}$$

and hence

$$\mathrm{SBP}^{(B)}(\gamma, X_n) = \mathrm{RBP}_{\exp}\Big( \sup_{y \in \overline{\mathrm{out}(\gamma, F)}} O(y, X_n) \Big). \tag{16}$$

Lemmas 1-4 transform the problem of evaluating masking and swamping breakdown points to a problem of evaluating standard explosion and implosion breakdown points of certain inf and sup statistics. These latter statistics are complicated to treat, and in the next section some helpful lemmas are established.

3.2 Key lemmas for implementation of the RBP formulas

In some cases, the infimum or supremum in the foregoing RBP formulas reduces to a minimum or maximum. For such cases, the following result treats breakdown of a statistic $S(X_n)$ when it is either the minimum or the maximum of certain other statistics $T_1(X_n), \ldots, T_J(X_n)$. Let $k_{\exp}^{(0)}, k_{\exp}^{(1)}, \ldots, k_{\exp}^{(J)}$ be the minimal numbers of data points which must be replaced in order to cause explosion breakdown of the respective statistics $S$ and $T_1, \ldots, T_J$, and let $k_{\mathrm{imp}}^{(0)}, k_{\mathrm{imp}}^{(1)}, \ldots, k_{\mathrm{imp}}^{(J)}$ be their counterparts for implosion breakdown. Also, let $T_* = \min\{T_{1*}, \ldots, T_{J*}\}$ and $T^* = \max\{T_1^*, \ldots, T_J^*\}$.

Lemma 5. (i) Let $S = \min\{T_1, \ldots, T_J\}$. Then

$$\min\{k_{\mathrm{imp}}^{(1)}, \ldots, k_{\mathrm{imp}}^{(J)}\} \leq k_{\mathrm{imp}}^{(0)} \leq \max\{k_{\mathrm{imp}}^{(1)}, \ldots, k_{\mathrm{imp}}^{(J)}\}. \tag{17}$$

(ii) Let $S = \max\{T_1, \ldots, T_J\}$. Then

$$\min\{k_{\exp}^{(1)}, \ldots, k_{\exp}^{(J)}\} \leq k_{\exp}^{(0)} \leq \max\{k_{\exp}^{(1)}, \ldots, k_{\exp}^{(J)}\}. \tag{18}$$

Proof. (i) If $S \to T_*$ for some choice of $k_{\mathrm{imp}}^{(0)}$ replacements, then at least one of the events $T_1 \to T_*, \ldots, T_J \to T_*$ occurs with $k_{\mathrm{imp}}^{(0)}$ replacements. For any such occurrence, say $T_j \to T_*$, we must have $k_{\mathrm{imp}}^{(j)} \leq k_{\mathrm{imp}}^{(0)}$. This yields the left hand inequality of (17). On the other hand, if $k_{\mathrm{imp}}^{(m)} = \max\{k_{\mathrm{imp}}^{(1)}, \ldots, k_{\mathrm{imp}}^{(J)}\}$ for some $m \in \{1, \ldots, J\}$, then all the events $T_j \to T_{j*}$, $1 \leq j \leq J$, occur with $k_{\mathrm{imp}}^{(m)}$ replacements and hence $S \to T_*$, yielding $k_{\mathrm{imp}}^{(0)} \leq k_{\mathrm{imp}}^{(m)}$. This yields the right hand inequality of (17), and part (i) is now established.

(ii) If $S \to T^*$ for some choice of $k_{\exp}^{(0)}$ replacements, then at least one of the events $T_1 \to T^*, \ldots, T_J \to T^*$ occurs with $k_{\exp}^{(0)}$ replacements. For any such occurrence, say $T_j \to T^*$, we have $k_{\exp}^{(j)} \leq k_{\exp}^{(0)}$, and the left hand inequality of (18) thus follows. On the other hand, if $k_{\exp}^{(m)} = \max\{k_{\exp}^{(1)}, \ldots, k_{\exp}^{(J)}\}$ for some $m \in \{1, \ldots, J\}$, then all the events $T_j \to T_j^*$ occur with $k_{\exp}^{(m)}$ replacements and hence $S \to T^*$, yielding $k_{\exp}^{(0)} \leq k_{\exp}^{(m)}$. This yields the right hand inequality of (18), establishing part (ii) of the lemma.

The above result applies, for example, in the case $X = \mathbb{R}$ and $J = 2$, where the outlier regions are complements of finite intervals and the inf and sup are attained at endpoints. The following result extends to the general case and is obtained by a proof along similar lines with $T_* = \inf_y T_{y*}$ and $T^* = \sup_y T_y^*$.

Lemma 6. (i) Let $S = \inf_y T_y$. Then

$$\inf_y k_{\mathrm{imp}}^{(y)} \leq k_{\mathrm{imp}}^{(0)} \leq \sup_y k_{\mathrm{imp}}^{(y)}. \tag{19}$$

(ii) Let $S = \sup_y T_y$. Then

$$\inf_y k_{\exp}^{(y)} \leq k_{\exp}^{(0)} \leq \sup_y k_{\exp}^{(y)}. \tag{20}$$

The next lemma treats breakdown of a statistic $S(X_n)$ when the event of breakdown due to $k$ replacements is related to the possible occurrences of certain events $E_1, \ldots, E_J$ as a consequence of $k$ replacements. Let $k_S$ be the minimal number of data points which must be replaced in order to cause breakdown (either implosion or explosion) of $S$, and let $k_1, \ldots, k_J$ be the minimal numbers of data points which must be replaced in order to cause occurrence of the respective events $E_1, \ldots, E_J$. It is assumed that $k_S$ and $k_1, \ldots, k_J$ are well-defined and belong to $\{1, 2, \ldots, n\}$.

Lemma 7. (i) If breakdown of $S$ is implied by occurrence of each one of the events $E_1, \ldots, E_J$, then

$$k_S \leq \min\{k_1, \ldots, k_J\}. \tag{21}$$

(ii) If breakdown of $S$ implies occurrence of at least one of the events $E_1, \ldots, E_J$, then

$$k_S \geq \min\{k_1, \ldots, k_J\}. \tag{22}$$

(iii) If breakdown of $S$ is implied by occurrence of each one of the events $E_1, \ldots, E_J$ and also implies that at least one of $E_1, \ldots, E_J$ must occur, then

$$k_S = \min\{k_1, \ldots, k_J\}. \tag{23}$$

Proof. (i) For any event $E_j$ whose occurrence with $k_j$ replacements implies breakdown of $S$, we must have $k_S \leq k_j$. Thus (21) follows. (ii) For any event $E_j$ whose occurrence is implied by breakdown of $S$, we must have $k_S \geq k_j$, and thus (22) follows. (iii) This follows by combining (i) and (ii).

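4 Illustrative Application: $\mathrm{MBP}^{(A)}(\lambda, X_n)$ for Univariate Scaled Deviation Outlyingness

For $X = \mathbb{R}$, let $F$ be a distribution on $\mathbb{R}$ and $\mu(F)$ and $\sigma(F)$ any location and spread measures. The corresponding scaled deviation outlyingness function taking values in $[0, 1)$ is given by

$$O(x, F) = \frac{\tilde{O}(x, F)}{1 + \tilde{O}(x, F)}, \quad \text{with} \quad \tilde{O}(x, F) = \frac{|x - \mu(F)|}{\sigma(F)},$$

and sample versions $O(x, X_n)$ and $\tilde{O}(x, X_n)$ are similarly defined using $\mu(X_n)$ and $\sigma(X_n)$. Such outlyingness functions have been popularized by Mosteller and Tukey (1977), for example. A complete study of $\mathrm{MBP}^{(A)}(\lambda, X_n)$, $\mathrm{MBP}^{(B)}(\gamma, X_n)$, $\mathrm{SBP}^{(A)}(\lambda, X_n)$, and $\mathrm{SBP}^{(B)}(\gamma, X_n)$ for scaled deviation outlyingness is carried out by Wang and Serfling (2012). Here we obtain the first of these, as an illustration of the use of our key lemmas.

Note that for scaled deviation outlyingness, we have $O_{n*} = 0$ and $O_n^* = 1$. In terms of the $\tilde{O}$ versions, and with $\eta = \gamma/(1 - \gamma)$ and $\beta = \lambda/(1 - \lambda)$, we have

$$\mathrm{out}(\gamma, F) = \{x : O(x, F) > \gamma\} = \{x : \tilde{O}(x, F) > \eta\}$$

and

$$\mathrm{OR}(\lambda, X_n) = \{x : O(x, X_n) > \lambda\} = \{x : \tilde{O}(x, X_n) > \beta\}.$$

Note also that $\eta \to \infty$ as $\gamma \uparrow 1$, $\beta \to \infty$ as $\lambda \uparrow 1$, and

$$\overline{\mathrm{OR}(\lambda, X_n)} = [\mu(X_n) - \beta \sigma(X_n), \ \mu(X_n) + \beta \sigma(X_n)].$$

Expressing Lemma 1 in terms of $\tilde{O}(x, F)$ and $\tilde{O}(x, X_n)$, we obtain

$$\mathrm{MBP}^{(A)}(\lambda, X_n) = \mathrm{RBP}_{\exp}\Big( \sup_{y \in \overline{\mathrm{OR}(\lambda, X_n)}} \tilde{O}(y, F) \Big). \tag{24}$$

For convenience, we put $\mu(X_n) = \hat{\mu}$ and $\sigma(X_n) = \hat{\sigma}$. Using (from above) $\overline{\mathrm{OR}(\lambda, X_n)} = [\hat{\mu} - \beta\hat{\sigma}, \ \hat{\mu} + \beta\hat{\sigma}]$, it follows that

$$\sup_{y \in \overline{\mathrm{OR}(\lambda, X_n)}} \tilde{O}(y, F) = \begin{cases} \tilde{O}(\hat{\mu} + \beta\hat{\sigma}, F) & \text{if } \mu(F) \leq \hat{\mu} - \beta\hat{\sigma}, \\ \tilde{O}(\hat{\mu} - \beta\hat{\sigma}, F) & \text{if } \mu(F) \geq \hat{\mu} + \beta\hat{\sigma}, \\ \max\{\tilde{O}(\hat{\mu} + \beta\hat{\sigma}, F), \ \tilde{O}(\hat{\mu} - \beta\hat{\sigma}, F)\} & \text{otherwise} \end{cases}$$
$$= \max\{\tilde{O}(\hat{\mu} + \beta\hat{\sigma}, F), \ \tilde{O}(\hat{\mu} - \beta\hat{\sigma}, F)\} \quad \text{(in all cases)}. \tag{25}$$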
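The scaled deviation outlyingness and the interval form of the sample nonoutlier region used above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names and data are hypothetical, and the median and MAD (without a consistency factor) are just one possible choice for the location and spread measures.

```python
import statistics

def mad(data):
    """Median absolute deviation about the median (no consistency factor)."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data])

def scaled_deviation_outlyingness(x, mu, sigma):
    """O = O~ / (1 + O~) with O~ = |x - mu| / sigma; values lie in [0, 1)."""
    o_tilde = abs(x - mu) / sigma
    return o_tilde / (1.0 + o_tilde)

def nonoutlier_interval(lam, mu, sigma):
    """Complement of OR(lam, X_n): [mu - beta*sigma, mu + beta*sigma]."""
    beta = lam / (1.0 - lam)  # threshold on the O~ scale
    return (mu - beta * sigma, mu + beta * sigma)

sample = [2.1, -0.3, 0.8, 1.5, -1.2, 0.1, 0.9, -0.7, 1.1]  # hypothetical data
mu_hat = statistics.median(sample)
sigma_hat = mad(sample)

lam = 0.75
lo, hi = nonoutlier_interval(lam, mu_hat, sigma_hat)
flagged = [x for x in sample
           if scaled_deviation_outlyingness(x, mu_hat, sigma_hat) > lam]
print((lo, hi), flagged)
```

The endpoints of the interval are exactly the points whose sample outlyingness equals $\lambda$, which is why the supremum in (24) only needs to be examined at $\hat{\mu} \pm \beta\hat{\sigma}$.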

15 It then follows from Lemma 5(ii that } min {RBP exp (Õ( µ + β σ, F, RBP exp (Õ( µ β σ, F MBP (A (λ, X n } max {RBP exp (Õ( µ + β σ, F, RBP exp (Õ( µ β σ, F We next evaluate RBP exp (Õ( µ + β σ, F respectively, to RBP exp ( µ + β σ and RBP exp ( µ β σ. Now we adopt (26 and RBP exp (Õ( µ β σ, F, which are equal, Assumption A. RBP exp ( µ + β σ, X n and RBP exp ( µ, X n are invariant if X n is replaced by X n, i.e, if each observation X i is replaced by X i, 1 i n. Under Assumption A, RBP exp ( µ + β σ and RBP exp ( µ β σ are equal, and we have MBP (A (λ, X n = RBP exp ( µ + β σ. (27 Here we are using RBP exp with T =. We now evaluate RBP exp ( µ + β σ, for which we apply Lemma 7 with some choice of k and the events S, E 1, E 2, and E 3, where S = { {X n,k } such that µ(x n,k + β σ(x n,k } E 1 = { {X n,k } such that µ(x n,k + } E 2 = { {X n,k } such that µ(x n,k is bounded and σ(x n,k } E 3 = { {X n,k } such that µ(x n,k and µ(x n,k + β σ(x n,k }. Note that (with k fixed each of E 1, E 2, E 3 implies S and S implies E 1 E 2 E 3. Then (23 yields RBP exp (Õ( µ + β σ, F = RBP exp ( µ + β σ = n 1 min{k 1, k 2, k 3 }, (28 where k 1, k 2, k 3 are the minimal values of k, respectively, for occurrence of E 1, E 2, E 3. Now let us note that, under Assumption A, k 3 k exp ( µ = k 1. Thus we have established the following useful result. Corollary 8 Under Assumption A, MBP (A (λ, X n = n 1 min{k 1, k 2 } = min{rbp exp ( µ, RBP exp ( σ µ bounded}. (29 Note that MBP (A (λ, X n does not depend upon the threshold λ. In typical cases, we have RBP exp ( σ µ bounded RBP exp ( σ RBP exp ( µ, 13

in which case simply $\mathrm{MBP}^{(A)}(\lambda, X_n) = \mathrm{RBP}_{\exp}(\hat\mu)$.

Examples. (Assumption A is satisfied in each case.)

(i) Mean and Standard Deviation ($\hat\mu = \bar{X}$ and $\hat\sigma = S$). It is straightforward that $\mathrm{RBP}_{\exp}(\hat\mu) = n^{-1}$, the minimum possible, yielding $\mathrm{MBP}^{(A)}(\lambda, X_n) = n^{-1} \to 0$. (In passing, we note that $\mathrm{RBP}_{\exp}(\hat\sigma \mid \hat\mu \text{ bounded}) = 2n^{-1}$.)

(ii) Median and MAD ($\hat\mu = \mathrm{Med}(X_n)$ and $\hat\sigma = \mathrm{MAD}(X_n)$). We obtain $\mathrm{RBP}_{\exp}(\hat\mu) = \mathrm{RBP}_{\exp}(\hat\sigma \mid \hat\mu \text{ bounded}) = n^{-1}\lfloor (n+1)/2 \rfloor$, yielding $\mathrm{MBP}^{(A)}(\lambda, X_n) = n^{-1}\lfloor (n+1)/2 \rfloor \to 1/2$.

(iii) α-trimmed Mean and SD. Let $X_{(n - 2\lfloor n\alpha \rfloor)}$ denote the $n - 2\lfloor n\alpha \rfloor$ observations remaining after trimming away the upper $\lfloor n\alpha \rfloor$ observations and the lower $\lfloor n\alpha \rfloor$ observations. Then take $\hat\mu$ to be the mean and $\hat\sigma$ to be the standard deviation of the data set $X_{(n - 2\lfloor n\alpha \rfloor)}$. It is readily checked that $\mathrm{RBP}_{\exp}(\hat\mu) = (\lfloor n\alpha \rfloor + 1)/n$ and that $\mathrm{RBP}_{\exp}(\hat\sigma \mid \hat\mu \text{ bounded}) = 2(\lfloor n\alpha \rfloor + 1)/n$, yielding $\mathrm{MBP}^{(A)}(\lambda, X_n) = n^{-1}(\lfloor n\alpha \rfloor + 1) \to \alpha$.

Note that the result in (iii) approaches that in (i) as $\alpha \to 0$ and that in (ii) as $\alpha \to 1/2$. As seen in Wang and Serfling (2012), there can be some advantage in $\mathrm{MBP}^{(A)}(\lambda, X_n) = \alpha < 1/2$, when it produces a trade-off allowing $\mathrm{SBP}^{(A)}(\lambda, X_n) > 1/2$, should this be desired.

Acknowledgements

The authors gratefully acknowledge useful input from G. L. Thompson, Xin Dang, Satyaki Mazumder, Bo Hong, and Seoweon Jin. Also, support under National Science Foundation Grant DMS is sincerely acknowledged.

References

[1] Becker, C. (1996). Bruchpunkt und Bias zur Beurteilung multivariater Ausreisseridentifizierung. Ph.D. Dissertation. Universität Dortmund.
[2] Becker, C. and Gather, U. (1999). The masking breakdown point of multivariate outlier identification rules. Journal of the American Statistical Association.
[3] Dang, X. and Serfling, R. (2010). Nonparametric depth-based multivariate outlier identifiers, and masking robustness properties. Journal of Statistical Planning and Inference.
[4] Davies, L. and Gather, U. (1993). The identification of multiple outliers. Journal of the American Statistical Association.
[5] Donoho, D. L. and Huber, P. J. (1983). The notion of breakdown point. In A Festschrift for Erich L. Lehmann (P. J. Bickel, K. A. Doksum and J. L. Hodges, Jr., eds.), Wadsworth, Belmont, California.

[6] Mosteller, C. F. and Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley, Reading, Mass.
[7] Olkin, I. (1994). Multivariate non-normal distributions and models of dependency. In Multivariate Analysis and Its Applications (T. W. Anderson, K. T. Fang and I. Olkin, eds.), IMS Lecture Notes Monograph Series, Volume 24, Hayward, California.
[8] Wang, S. and Serfling, R. (2012). On masking and swamping robustness of outlier identifiers for univariate data. Submitted (available at serfling).
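
As a numerical companion to Examples (i)-(iii), the following minimal Python sketch evaluates the formula of Corollary 8, the minimum of RBP_exp(mu-hat) and RBP_exp(sigma-hat | mu-hat bounded), using the finite-sample values quoted in those examples. It is an illustration only, not code from the paper: the function names `rbp_pair` and `mbp` and the rule labels are hypothetical, and the trimming proportion is passed as an exact fraction so that the floor of n·alpha is not distorted by floating-point rounding.

```python
from fractions import Fraction
from math import floor


def rbp_pair(n, rule, alpha=Fraction(0)):
    """Return (RBP_exp(mu_hat), RBP_exp(sigma_hat | mu_hat bounded)).

    The values are those quoted in Examples (i)-(iii); the rule labels
    are hypothetical names chosen for this sketch.
    """
    if rule == "mean_sd":        # Example (i): 1/n and 2/n
        return Fraction(1, n), Fraction(2, n)
    if rule == "median_mad":     # Example (ii): floor((n+1)/2)/n for both
        k = (n + 1) // 2
        return Fraction(k, n), Fraction(k, n)
    if rule == "trimmed":        # Example (iii): (floor(n*alpha)+1)/n and twice that
        k = floor(n * Fraction(alpha)) + 1
        return Fraction(k, n), Fraction(2 * k, n)
    raise ValueError(f"unknown rule: {rule!r}")


def mbp(n, rule, alpha=Fraction(0)):
    """Masking breakdown point as the minimum of the two values (Corollary 8)."""
    location, scale = rbp_pair(n, rule, alpha)
    return min(location, scale)


if __name__ == "__main__":
    # As n grows, the three columns approach 0, 1/2, and alpha, respectively.
    for n in (10, 100, 1000):
        print(n,
              float(mbp(n, "mean_sd")),
              float(mbp(n, "median_mad")),
              float(mbp(n, "trimmed", Fraction(1, 10))))
```

In each of the three cases the minimum is attained by the location term, which matches the remark that the masking breakdown point then reduces to RBP_exp(mu-hat).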

Inequalities Relating Addition and Replacement Type Finite Sample Breakdown Points

Inequalities Relating Addition and Replacement Type Finite Sample Breakdown Points Inequalities Relating Addition and Replacement Type Finite Sample Breadown Points Robert Serfling Department of Mathematical Sciences University of Texas at Dallas Richardson, Texas 75083-0688, USA Email:

More information

Supplementary Material for Wang and Serfling paper

Supplementary Material for Wang and Serfling paper Supplementary Material for Wang and Serfling paper March 6, 2017 1 Simulation study Here we provide a simulation study to compare empirically the masking and swamping robustness of our selected outlyingness

More information

Asymptotic Relative Efficiency in Estimation

Asymptotic Relative Efficiency in Estimation Asymptotic Relative Efficiency in Estimation Robert Serfling University of Texas at Dallas October 2009 Prepared for forthcoming INTERNATIONAL ENCYCLOPEDIA OF STATISTICAL SCIENCES, to be published by Springer

More information

On Invariant Within Equivalence Coordinate System (IWECS) Transformations

On Invariant Within Equivalence Coordinate System (IWECS) Transformations On Invariant Within Equivalence Coordinate System (IWECS) Transformations Robert Serfling Abstract In exploratory data analysis and data mining in the very common setting of a data set X of vectors from

More information

Computationally Easy Outlier Detection via Projection Pursuit with Finitely Many Directions

Computationally Easy Outlier Detection via Projection Pursuit with Finitely Many Directions Computationally Easy Outlier Detection via Projection Pursuit with Finitely Many Directions Robert Serfling 1 and Satyaki Mazumder 2 University of Texas at Dallas and Indian Institute of Science, Education

More information

Generalized Multivariate Rank Type Test Statistics via Spatial U-Quantiles

Generalized Multivariate Rank Type Test Statistics via Spatial U-Quantiles Generalized Multivariate Rank Type Test Statistics via Spatial U-Quantiles Weihua Zhou 1 University of North Carolina at Charlotte and Robert Serfling 2 University of Texas at Dallas Final revision for

More information

Measuring robustness

Measuring robustness Measuring robustness 1 Introduction While in the classical approach to statistics one aims at estimates which have desirable properties at an exactly speci ed model, the aim of robust methods is loosely

More information

robustness, efficiency, breakdown point, outliers, rank-based procedures, least absolute regression

robustness, efficiency, breakdown point, outliers, rank-based procedures, least absolute regression Robust Statistics robustness, efficiency, breakdown point, outliers, rank-based procedures, least absolute regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html

More information

CONTRIBUTIONS TO THE THEORY AND APPLICATIONS OF STATISTICAL DEPTH FUNCTIONS

CONTRIBUTIONS TO THE THEORY AND APPLICATIONS OF STATISTICAL DEPTH FUNCTIONS CONTRIBUTIONS TO THE THEORY AND APPLICATIONS OF STATISTICAL DEPTH FUNCTIONS APPROVED BY SUPERVISORY COMMITTEE: Robert Serfling, Chair Larry Ammann John Van Ness Michael Baron Copyright 1998 Yijun Zuo All

More information

YIJUN ZUO. Education. PhD, Statistics, 05/98, University of Texas at Dallas, (GPA 4.0/4.0)

YIJUN ZUO. Education. PhD, Statistics, 05/98, University of Texas at Dallas, (GPA 4.0/4.0) YIJUN ZUO Department of Statistics and Probability Michigan State University East Lansing, MI 48824 Tel: (517) 432-5413 Fax: (517) 432-5413 Email: zuo@msu.edu URL: www.stt.msu.edu/users/zuo Education PhD,

More information

DD-Classifier: Nonparametric Classification Procedure Based on DD-plot 1. Abstract

DD-Classifier: Nonparametric Classification Procedure Based on DD-plot 1. Abstract DD-Classifier: Nonparametric Classification Procedure Based on DD-plot 1 Jun Li 2, Juan A. Cuesta-Albertos 3, Regina Y. Liu 4 Abstract Using the DD-plot (depth-versus-depth plot), we introduce a new nonparametric

More information

Influence Functions for a General Class of Depth-Based Generalized Quantile Functions

Influence Functions for a General Class of Depth-Based Generalized Quantile Functions Influence Functions for a General Class of Depth-Based Generalized Quantile Functions Jin Wang 1 Northern Arizona University and Robert Serfling 2 University of Texas at Dallas June 2005 Final preprint

More information

THE BREAKDOWN POINT EXAMPLES AND COUNTEREXAMPLES

THE BREAKDOWN POINT EXAMPLES AND COUNTEREXAMPLES REVSTAT Statistical Journal Volume 5, Number 1, March 2007, 1 17 THE BREAKDOWN POINT EXAMPLES AND COUNTEREXAMPLES Authors: P.L. Davies University of Duisburg-Essen, Germany, and Technical University Eindhoven,

More information

Detection of outliers in multivariate data:

Detection of outliers in multivariate data: 1 Detection of outliers in multivariate data: a method based on clustering and robust estimators Carla M. Santos-Pereira 1 and Ana M. Pires 2 1 Universidade Portucalense Infante D. Henrique, Oporto, Portugal

More information

Commentary on Basu (1956)

Commentary on Basu (1956) Commentary on Basu (1956) Robert Serfling University of Texas at Dallas March 2010 Prepared for forthcoming Selected Works of Debabrata Basu (Anirban DasGupta, Ed.), Springer Series on Selected Works in

More information

Jensen s inequality for multivariate medians

Jensen s inequality for multivariate medians Jensen s inequality for multivariate medians Milan Merkle University of Belgrade, Serbia emerkle@etf.rs Given a probability measure µ on Borel sigma-field of R d, and a function f : R d R, the main issue

More information

TITLE : Robust Control Charts for Monitoring Process Mean of. Phase-I Multivariate Individual Observations AUTHORS : Asokan Mulayath Variyath.

TITLE : Robust Control Charts for Monitoring Process Mean of. Phase-I Multivariate Individual Observations AUTHORS : Asokan Mulayath Variyath. TITLE : Robust Control Charts for Monitoring Process Mean of Phase-I Multivariate Individual Observations AUTHORS : Asokan Mulayath Variyath Department of Mathematics and Statistics, Memorial University

More information

ROBUST ESTIMATION OF A CORRELATION COEFFICIENT: AN ATTEMPT OF SURVEY

ROBUST ESTIMATION OF A CORRELATION COEFFICIENT: AN ATTEMPT OF SURVEY ROBUST ESTIMATION OF A CORRELATION COEFFICIENT: AN ATTEMPT OF SURVEY G.L. Shevlyakov, P.O. Smirnov St. Petersburg State Polytechnic University St.Petersburg, RUSSIA E-mail: Georgy.Shevlyakov@gmail.com

More information

CHAPTER 5. Outlier Detection in Multivariate Data

CHAPTER 5. Outlier Detection in Multivariate Data CHAPTER 5 Outlier Detection in Multivariate Data 5.1 Introduction Multivariate outlier detection is the important task of statistical analysis of multivariate data. Many methods have been proposed for

More information

MULTIVARIATE TECHNIQUES, ROBUSTNESS

MULTIVARIATE TECHNIQUES, ROBUSTNESS MULTIVARIATE TECHNIQUES, ROBUSTNESS Mia Hubert Associate Professor, Department of Mathematics and L-STAT Katholieke Universiteit Leuven, Belgium mia.hubert@wis.kuleuven.be Peter J. Rousseeuw 1 Senior Researcher,

More information

A Convex Hull Peeling Depth Approach to Nonparametric Massive Multivariate Data Analysis with Applications

A Convex Hull Peeling Depth Approach to Nonparametric Massive Multivariate Data Analysis with Applications A Convex Hull Peeling Depth Approach to Nonparametric Massive Multivariate Data Analysis with Applications Hyunsook Lee. hlee@stat.psu.edu Department of Statistics The Pennsylvania State University Hyunsook

More information

Stahel-Donoho Estimation for High-Dimensional Data

Stahel-Donoho Estimation for High-Dimensional Data Stahel-Donoho Estimation for High-Dimensional Data Stefan Van Aelst KULeuven, Department of Mathematics, Section of Statistics Celestijnenlaan 200B, B-3001 Leuven, Belgium Email: Stefan.VanAelst@wis.kuleuven.be

More information

Survey on (Some) Nonparametric and Robust Multivariate Methods

Survey on (Some) Nonparametric and Robust Multivariate Methods Survey on (Some) Nonparametric and Robust Multivariate Methods Robert Serfling University of Texas at Dallas June 2007 Abstract Rather than attempt an encyclopedic survey of nonparametric and robust multivariate

More information

Working Paper Convergence rates in multivariate robust outlier identification

Working Paper Convergence rates in multivariate robust outlier identification econstor www.econstor.eu Der Open-Access-Publikationsserver der ZBW Leibniz-Informationszentrum Wirtschaft The Open Access Publication Server of the ZBW Leibniz Information Centre for Economics Gather,

More information

Robust estimation of principal components from depth-based multivariate rank covariance matrix

Robust estimation of principal components from depth-based multivariate rank covariance matrix Robust estimation of principal components from depth-based multivariate rank covariance matrix Subho Majumdar Snigdhansu Chatterjee University of Minnesota, School of Statistics Table of contents Summary

More information

Fast and robust bootstrap for LTS

Fast and robust bootstrap for LTS Fast and robust bootstrap for LTS Gert Willems a,, Stefan Van Aelst b a Department of Mathematics and Computer Science, University of Antwerp, Middelheimlaan 1, B-2020 Antwerp, Belgium b Department of

More information

Introduction to Robust Statistics. Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy

Introduction to Robust Statistics. Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy Introduction to Robust Statistics Anthony Atkinson, London School of Economics, UK Marco Riani, Univ. of Parma, Italy Multivariate analysis Multivariate location and scatter Data where the observations

More information

On the limiting distributions of multivariate depth-based rank sum. statistics and related tests. By Yijun Zuo 2 and Xuming He 3

On the limiting distributions of multivariate depth-based rank sum. statistics and related tests. By Yijun Zuo 2 and Xuming He 3 1 On the limiting distributions of multivariate depth-based rank sum statistics and related tests By Yijun Zuo 2 and Xuming He 3 Michigan State University and University of Illinois A depth-based rank

More information

Detecting outliers in weighted univariate survey data

Detecting outliers in weighted univariate survey data Detecting outliers in weighted univariate survey data Anna Pauliina Sandqvist October 27, 21 Preliminary Version Abstract Outliers and influential observations are a frequent concern in all kind of statistics,

More information

On robust and efficient estimation of the center of. Symmetry.

On robust and efficient estimation of the center of. Symmetry. On robust and efficient estimation of the center of symmetry Howard D. Bondell Department of Statistics, North Carolina State University Raleigh, NC 27695-8203, U.S.A (email: bondell@stat.ncsu.edu) Abstract

More information

An Alternative Algorithm for Classification Based on Robust Mahalanobis Distance

An Alternative Algorithm for Classification Based on Robust Mahalanobis Distance Dhaka Univ. J. Sci. 61(1): 81-85, 2013 (January) An Alternative Algorithm for Classification Based on Robust Mahalanobis Distance A. H. Sajib, A. Z. M. Shafiullah 1 and A. H. Sumon Department of Statistics,

More information

Re-weighted Robust Control Charts for Individual Observations

Re-weighted Robust Control Charts for Individual Observations Universiti Tunku Abdul Rahman, Kuala Lumpur, Malaysia 426 Re-weighted Robust Control Charts for Individual Observations Mandana Mohammadi 1, Habshah Midi 1,2 and Jayanthi Arasan 1,2 1 Laboratory of Applied

More information

ON MULTIVARIATE MONOTONIC MEASURES OF LOCATION WITH HIGH BREAKDOWN POINT

ON MULTIVARIATE MONOTONIC MEASURES OF LOCATION WITH HIGH BREAKDOWN POINT Sankhyā : The Indian Journal of Statistics 999, Volume 6, Series A, Pt. 3, pp. 362-380 ON MULTIVARIATE MONOTONIC MEASURES OF LOCATION WITH HIGH BREAKDOWN POINT By SUJIT K. GHOSH North Carolina State University,

More information

Lehrstuhl für Statistik und Ökonometrie. Diskussionspapier 88 / 2012

Lehrstuhl für Statistik und Ökonometrie. Diskussionspapier 88 / 2012 Lehrstuhl für Statistik und Ökonometrie Diskussionspapier 88 / 2012 Robustness Properties of Quasi-Linear Means with Application to the Laspeyres and Paasche Indices Ingo Klein Vlad Ardelean Lange Gasse

More information

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 4: Measures of Robustness, Robust Principal Component Analysis

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 4: Measures of Robustness, Robust Principal Component Analysis MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 4:, Robust Principal Component Analysis Contents Empirical Robust Statistical Methods In statistics, robust methods are methods that perform well

More information

A ROBUST METHOD OF ESTIMATING COVARIANCE MATRIX IN MULTIVARIATE DATA ANALYSIS G.M. OYEYEMI *, R.A. IPINYOMI **

A ROBUST METHOD OF ESTIMATING COVARIANCE MATRIX IN MULTIVARIATE DATA ANALYSIS G.M. OYEYEMI *, R.A. IPINYOMI ** ANALELE ŞTIINłIFICE ALE UNIVERSITĂłII ALEXANDRU IOAN CUZA DIN IAŞI Tomul LVI ŞtiinŃe Economice 9 A ROBUST METHOD OF ESTIMATING COVARIANCE MATRIX IN MULTIVARIATE DATA ANALYSIS G.M. OYEYEMI, R.A. IPINYOMI

More information

Bahadur representations for bootstrap quantiles 1

Bahadur representations for bootstrap quantiles 1 Bahadur representations for bootstrap quantiles 1 Yijun Zuo Department of Statistics and Probability, Michigan State University East Lansing, MI 48824, USA zuo@msu.edu 1 Research partially supported by

More information

Introduction to Statistical Analysis

Introduction to Statistical Analysis Introduction to Statistical Analysis Changyu Shen Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology Beth Israel Deaconess Medical Center Harvard Medical School Objectives Descriptive

More information

Modulation of symmetric densities

Modulation of symmetric densities 1 Modulation of symmetric densities 1.1 Motivation This book deals with a formulation for the construction of continuous probability distributions and connected statistical aspects. Before we begin, a

More information

Statistical Depth Function

Statistical Depth Function Machine Learning Journal Club, Gatsby Unit June 27, 2016 Outline L-statistics, order statistics, ranking. Instead of moments: median, dispersion, scale, skewness,... New visualization tools: depth contours,

More information

Accurate and Powerful Multivariate Outlier Detection

Accurate and Powerful Multivariate Outlier Detection Int. Statistical Inst.: Proc. 58th World Statistical Congress, 11, Dublin (Session CPS66) p.568 Accurate and Powerful Multivariate Outlier Detection Cerioli, Andrea Università di Parma, Dipartimento di

More information

CORRELATION ESTIMATION SYSTEM MINIMIZATION COMPARED TO LEAST SQUARES MINIMIZATION IN SIMPLE LINEAR REGRESSION

CORRELATION ESTIMATION SYSTEM MINIMIZATION COMPARED TO LEAST SQUARES MINIMIZATION IN SIMPLE LINEAR REGRESSION CORRELATION ESTIMATION SYSTEM MINIMIZATION COMPARED TO LEAST SQUARES MINIMIZATION IN SIMPLE LINEAR REGRESSION RUDY A. GIDEON Abstract. A general method of minimization using correlation coefficients and

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Robust estimators based on generalization of trimmed mean

Robust estimators based on generalization of trimmed mean Communications in Statistics - Simulation and Computation ISSN: 0361-0918 (Print) 153-4141 (Online) Journal homepage: http://www.tandfonline.com/loi/lssp0 Robust estimators based on generalization of trimmed

More information

Descriptive Data Summarization

Descriptive Data Summarization Descriptive Data Summarization Descriptive data summarization gives the general characteristics of the data and identify the presence of noise or outliers, which is useful for successful data cleaning

More information

Identification of Multivariate Outliers: A Performance Study

Identification of Multivariate Outliers: A Performance Study AUSTRIAN JOURNAL OF STATISTICS Volume 34 (2005), Number 2, 127 138 Identification of Multivariate Outliers: A Performance Study Peter Filzmoser Vienna University of Technology, Austria Abstract: Three

More information

Exploratory data analysis: numerical summaries

Exploratory data analysis: numerical summaries 16 Exploratory data analysis: numerical summaries The classical way to describe important features of a dataset is to give several numerical summaries We discuss numerical summaries for the center of a

More information

High Breakdown Analogs of the Trimmed Mean

High Breakdown Analogs of the Trimmed Mean High Breakdown Analogs of the Trimmed Mean David J. Olive Southern Illinois University April 11, 2004 Abstract Two high breakdown estimators that are asymptotically equivalent to a sequence of trimmed

More information

Fast and Robust Classifiers Adjusted for Skewness

Fast and Robust Classifiers Adjusted for Skewness Fast and Robust Classifiers Adjusted for Skewness Mia Hubert 1 and Stephan Van der Veeken 2 1 Department of Mathematics - LStat, Katholieke Universiteit Leuven Celestijnenlaan 200B, Leuven, Belgium, Mia.Hubert@wis.kuleuven.be

More information

A Brief Overview of Robust Statistics

A Brief Overview of Robust Statistics A Brief Overview of Robust Statistics Olfa Nasraoui Department of Computer Engineering & Computer Science University of Louisville, olfa.nasraoui_at_louisville.edu Robust Statistical Estimators Robust

More information

Application and Use of Multivariate Control Charts In a BTA Deep Hole Drilling Process

Application and Use of Multivariate Control Charts In a BTA Deep Hole Drilling Process Application and Use of Multivariate Control Charts In a BTA Deep Hole Drilling Process Amor Messaoud, Winfied Theis, Claus Weihs, and Franz Hering Fachbereich Statistik, Universität Dortmund, 44221 Dortmund,

More information

Exploiting Multiple Mahalanobis Distance Metric to Screen Outliers from Analogue Product Manufacturing Test Responses

Exploiting Multiple Mahalanobis Distance Metric to Screen Outliers from Analogue Product Manufacturing Test Responses 1 Exploiting Multiple Mahalanobis Distance Metric to Screen Outliers from Analogue Product Manufacturing Test Responses Shaji Krishnan and Hans G. Kerkhoff Analytical Research Department, TNO, Zeist, The

More information

Robustness of the Affine Equivariant Scatter Estimator Based on the Spatial Rank Covariance Matrix

Robustness of the Affine Equivariant Scatter Estimator Based on the Spatial Rank Covariance Matrix Robustness of the Affine Equivariant Scatter Estimator Based on the Spatial Rank Covariance Matrix Kai Yu 1 Xin Dang 2 Department of Mathematics and Yixin Chen 3 Department of Computer and Information

More information

Statistical Data Analysis

Statistical Data Analysis DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the

More information

MSc / PhD Course Advanced Biostatistics. dr. P. Nazarov

MSc / PhD Course Advanced Biostatistics. dr. P. Nazarov MSc / PhD Course Advanced Biostatistics dr. P. Nazarov petr.nazarov@crp-sante.lu 2-12-2012 1. Descriptive Statistics edu.sablab.net/abs2013 1 Outline Lecture 0. Introduction to R - continuation Data import

More information

DESCRIPTIVE STATISTICS FOR NONPARAMETRIC MODELS I. INTRODUCTION

DESCRIPTIVE STATISTICS FOR NONPARAMETRIC MODELS I. INTRODUCTION The Annals of Statistics 1975, Vol. 3, No.5, 1038-1044 DESCRIPTIVE STATISTICS FOR NONPARAMETRIC MODELS I. INTRODUCTION BY P. J. BICKEL 1 AND E. L. LEHMANN 2 University of California, Berkeley An overview

More information

Methods of Nonparametric Multivariate Ranking and Selection

Methods of Nonparametric Multivariate Ranking and Selection Syracuse University SURFACE Mathematics - Dissertations Mathematics 8-2013 Methods of Nonparametric Multivariate Ranking and Selection Jeremy Entner Follow this and additional works at: http://surface.syr.edu/mat_etd

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Semi-parametric predictive inference for bivariate data using copulas

Semi-parametric predictive inference for bivariate data using copulas Semi-parametric predictive inference for bivariate data using copulas Tahani Coolen-Maturi a, Frank P.A. Coolen b,, Noryanti Muhammad b a Durham University Business School, Durham University, Durham, DH1

More information

Empirical likelihood-based methods for the difference of two trimmed means

Empirical likelihood-based methods for the difference of two trimmed means Empirical likelihood-based methods for the difference of two trimmed means 24.09.2012. Latvijas Universitate Contents 1 Introduction 2 Trimmed mean 3 Empirical likelihood 4 Empirical likelihood for the

More information

Robustness and Distribution Assumptions

Robustness and Distribution Assumptions Chapter 1 Robustness and Distribution Assumptions 1.1 Introduction In statistics, one often works with model assumptions, i.e., one assumes that data follow a certain model. Then one makes use of methodology

More information

Jensen s inequality for multivariate medians

Jensen s inequality for multivariate medians J. Math. Anal. Appl. [Journal of Mathematical Analysis and Applications], 370 (2010), 258-269 Jensen s inequality for multivariate medians Milan Merkle 1 Abstract. Given a probability measure µ on Borel

More information

Discriminant Analysis with High Dimensional. von Mises-Fisher distribution and

Discriminant Analysis with High Dimensional. von Mises-Fisher distribution and Athens Journal of Sciences December 2014 Discriminant Analysis with High Dimensional von Mises - Fisher Distributions By Mario Romanazzi This paper extends previous work in discriminant analysis with von

More information

A Bayesian perspective on GMM and IV

A Bayesian perspective on GMM and IV A Bayesian perspective on GMM and IV Christopher A. Sims Princeton University sims@princeton.edu November 26, 2013 What is a Bayesian perspective? A Bayesian perspective on scientific reporting views all

More information

The AAFCO Proficiency Testing Program Statistics and Reporting

The AAFCO Proficiency Testing Program Statistics and Reporting The AAFCO Proficiency Testing Program Statistics and Reporting Program Chair: Dr. Victoria Siegel Statistics and Reports: Dr. Andrew Crawford Contents Program Model Data Prescreening Calculating Robust

More information

Outlier detection for high-dimensional data

Outlier detection for high-dimensional data Biometrika (2015), 102,3,pp. 589 599 doi: 10.1093/biomet/asv021 Printed in Great Britain Advance Access publication 7 June 2015 Outlier detection for high-dimensional data BY KWANGIL RO, CHANGLIANG ZOU,

More information

7 Sensitivity Analysis

7 Sensitivity Analysis 7 Sensitivity Analysis A recurrent theme underlying methodology for analysis in the presence of missing data is the need to make assumptions that cannot be verified based on the observed data. If the assumption

More information

Stochastic dominance with imprecise information

Stochastic dominance with imprecise information Stochastic dominance with imprecise information Ignacio Montes, Enrique Miranda, Susana Montes University of Oviedo, Dep. of Statistics and Operations Research. Abstract Stochastic dominance, which is

More information

MAJORIZING MEASURES WITHOUT MEASURES. By Michel Talagrand URA 754 AU CNRS

MAJORIZING MEASURES WITHOUT MEASURES. By Michel Talagrand URA 754 AU CNRS The Annals of Probability 2001, Vol. 29, No. 1, 411 417 MAJORIZING MEASURES WITHOUT MEASURES By Michel Talagrand URA 754 AU CNRS We give a reformulation of majorizing measures that does not involve measures,

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

Multivariate Statistical Analysis

Multivariate Statistical Analysis Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 9 for Applied Multivariate Analysis Outline Addressing ourliers 1 Addressing ourliers 2 Outliers in Multivariate samples (1) For

More information

CHAPTER 7. Connectedness

CHAPTER 7. Connectedness CHAPTER 7 Connectedness 7.1. Connected topological spaces Definition 7.1. A topological space (X, T X ) is said to be connected if there is no continuous surjection f : X {0, 1} where the two point set

More information

Comprehensive Definitions of Breakdown-Points for Independent and Dependent Observations

Comprehensive Definitions of Breakdown-Points for Independent and Dependent Observations TI 2000-40/2 Tinbergen Institute Discussion Paper Comprehensive Definitions of Breakdown-Points for Independent and Dependent Observations Marc G. Genton André Lucas Tinbergen Institute The Tinbergen Institute

More information

Monitoring Random Start Forward Searches for Multivariate Data

Monitoring Random Start Forward Searches for Multivariate Data Monitoring Random Start Forward Searches for Multivariate Data Anthony C. Atkinson 1, Marco Riani 2, and Andrea Cerioli 2 1 Department of Statistics, London School of Economics London WC2A 2AE, UK, a.c.atkinson@lse.ac.uk

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

Comprehensive definitions of breakdown points for independent and dependent observations

Comprehensive definitions of breakdown points for independent and dependent observations J. R. Statist. Soc. B (2003) 65, Part 1, pp. 81 94 Comprehensive definitions of breakdown points for independent and dependent observations Marc G. Genton North Carolina State University, Raleigh, USA

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

A Gini Autocovariance Function for Heavy Tailed Time Series Modeling

A Gini Autocovariance Function for Heavy Tailed Time Series Modeling A Gini Autocovariance Function for Heavy Tailed Time Series Modeling Marcel Carcea 1 and Robert Serfling 2 University of Texas at Dallas December, 212 1 Department of Mathematics, University of Texas at

More information

A Comparison of Robust Estimators Based on Two Types of Trimming

A Comparison of Robust Estimators Based on Two Types of Trimming Submitted to the Bernoulli A Comparison of Robust Estimators Based on Two Types of Trimming SUBHRA SANKAR DHAR 1, and PROBAL CHAUDHURI 1, 1 Theoretical Statistics and Mathematics Unit, Indian Statistical

More information

P8130: Biostatistical Methods I

P8130: Biostatistical Methods I P8130: Biostatistical Methods I Lecture 2: Descriptive Statistics Cody Chiuzan, PhD Department of Biostatistics Mailman School of Public Health (MSPH) Lecture 1: Recap Intro to Biostatistics Types of Data

More information

Robust Preprocessing of Time Series with Trends

Robust Preprocessing of Time Series with Trends Robust Preprocessing of Time Series with Trends Roland Fried Ursula Gather Department of Statistics, Universität Dortmund ffried,gatherg@statistik.uni-dortmund.de Michael Imhoff Klinikum Dortmund ggmbh

More information

A new approach for stochastic ordering of risks

A new approach for stochastic ordering of risks A new approach for stochastic ordering of risks Liang Hong, PhD, FSA Department of Mathematics Robert Morris University Presented at 2014 Actuarial Research Conference UC Santa Barbara July 16, 2014 Liang

More information

3 Measurable Functions

3 Measurable Functions 3 Measurable Functions Notation A pair (X, F) where F is a σ-field of subsets of X is a measurable space. If µ is a measure on F then (X, F, µ) is a measure space. If µ(x) < then (X, F, µ) is a probability

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651 Data Analysis and Statistical Methods Statistics 651 http://www.stat.tamu.edu/~suhasini/teaching.html Boxplots and standard deviations Suhasini Subba Rao Review of previous lecture In the previous lecture

More information

Physics 509: Bootstrap and Robust Parameter Estimation

Physics 509: Bootstrap and Robust Parameter Estimation Physics 509: Bootstrap and Robust Parameter Estimation Scott Oser Lecture #20 Physics 509 1 Nonparametric parameter estimation Question: what error estimate should you assign to the slope and intercept

More information

INVARIANT COORDINATE SELECTION

INVARIANT COORDINATE SELECTION INVARIANT COORDINATE SELECTION By David E. Tyler 1, Frank Critchley, Lutz Dümbgen 2, and Hannu Oja Rutgers University, Open University, University of Berne and University of Tampere SUMMARY A general method

More information

Descriptive Univariate Statistics and Bivariate Correlation

Descriptive Univariate Statistics and Bivariate Correlation ESC 100 Exploring Engineering Descriptive Univariate Statistics and Bivariate Correlation Instructor: Sudhir Khetan, Ph.D. Wednesday/Friday, October 17/19, 2012 The Central Dogma of Statistics used to

More information

Unsupervised Anomaly Detection for High Dimensional Data

Unsupervised Anomaly Detection for High Dimensional Data Unsupervised Anomaly Detection for High Dimensional Data Department of Mathematics, Rowan University. July 19th, 2013 International Workshop in Sequential Methodologies (IWSM-2013) Outline of Talk Motivation

More information
