Introduction to robust statistics*

Xuming He
National University of Singapore

* Written version of a colloquium talk delivered to the Department of Mathematics, National University of Singapore on 20 February 1990. The talk is intended to be nontechnical.

To statisticians, the model, the data and the methodology are all essential. Their job is to propose statistical procedures and to evaluate how good they are. Take a sample of size n from a Normal distribution N(μ, σ²) with unknown mean μ and variance σ². It is well known that the "best" estimator of μ is the sample mean, the average of all n observations. Here, given the model (Normal distribution) and the data, the problem seems to be completely solved: the best estimator has been found.

However, unlike mathematicians, who can live in their own beautiful world of mathematics, statisticians have to dirty their hands and face reality. What is the reality? Firstly, no model describes a real problem exactly. Secondly, most classical (and optimal) statistical procedures can easily break down when some deviation from the model occurs or when the data set contains just a few outliers. A classical procedure can be shown to be optimal only under a series of assumptions such as Normality, linearity, symmetry, independence or finite moments, and violations of these distributional assumptions often seriously nullify the optimality. Even more dangerous is the occurrence of outliers, which may result from key-punch errors, misplaced decimal points, recording or transmission errors, exceptional phenomena such as earthquakes or strikes, or members of a different population slipping into the sample. For example, with just one bad point, the sample mean can be driven anywhere, yielding no relevant information at all. Another good example is linear regression, an important statistical tool that is routinely applied in most sciences.
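A minimal sketch with simulated data makes the first point concrete (the numbers here are illustrative, not from the talk): one corrupted observation drags the sample mean arbitrarily far, while the sample median barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=100)  # a clean sample from N(5, 1)
print(np.mean(x), np.median(x))               # both close to 5

x[0] = 1e6                                    # a single key-punch error
print(np.mean(x), np.median(x))               # the mean explodes; the median stays near 5
```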

As an illustration, consider the simple model

y_i = a + b x_i + e_i   (i = 1, 2, ..., n),

where y is the response variable (output), x is the explanatory variable (input), and e_1, e_2, ..., e_n are independent random errors. Given observations (x_i, y_i) for i = 1, 2, ..., n, the classical fitting of the line y = a + bx is the least squares (LS) method: we find the regression coefficients (a, b) by minimizing

∑_{i=1}^n (y_i − a − b x_i)²

over every pair (a, b). Dating back to 1809, Gauss introduced the Normal distribution as the error distribution for which LS is optimal, yielding a beautiful mathematical theory. Recently, people began to realize that real data do not completely satisfy the classical assumptions, and the violations often have dramatic effects on the quality of the fit. The LS fitting is easily fooled by a single outlier, as the following example shows. Figure (a) contains five points {(x_i, y_i), i = 1, 2, ..., 5} with a well-fitting LS line. If we make an error in recording x_1, we obtain Figure (b) with one outlier in the x-direction, and its effect on the fit is so large that it actually tilts the LS line.

[Figure: (a) five points with a well-fitting LS line; (b) the same points after a recording error in x_1, with the single x-outlier tilting the LS line.]
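A few lines of code reproduce the tilt with hypothetical numbers (five points on an exact line, then a recording error in x_1):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 0.5 * x                  # five points exactly on y = 1 + 0.5x

b, a = np.polyfit(x, y, 1)         # least squares slope and intercept
print(a, b)                        # recovers (1.0, 0.5)

x_bad = x.copy()
x_bad[0] = 50.0                    # recording error in x_1: an outlier in the x-direction
b, a = np.polyfit(x_bad, y, 1)
print(a, b)                        # slope collapses toward 0: the LS line is tilted
```

The single high-leverage point dominates the sum of squares, so the fitted line chases it at the expense of the four good points.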

There are two possible routes to better and safer results. The first approach is to clean the data (e.g. by removing outliers) before a classical treatment is applied. However, it is often impossible to detect outliers effectively without a robust method; what's more, the model is still not exact even after the data are cleaned. A better approach is to develop robust procedures. A good robust procedure should have the following features:

(1) It performs well at the idealized model in terms of consistency, efficiency, etc.
(2) Small changes in the model or the data should have a small effect on the result.
(3) A larger deviation should not nullify the analysis completely.
(4) It is computationally feasible.

Robust procedures were actually in use long before the formal theory of robust statistics was developed by Huber in 1964. For example, Tukey in 1960 considered the efficiency of trimmed means for a location model {F(x − θ), θ ∈ R} with

F(x) = (1 − ε)Φ(x) + εΦ(x/3),

where Φ(x) is the distribution function of the standard Normal. F(x) is a mixture of two Normal distributions as ε ranges from 0 to 10%; the resulting distribution is still symmetric, with slightly flattened tails. As ε increases from 0 to 0.1, the efficiency of the sample mean decreases from 100% to only 70%. However, the 6%-trimmed mean, obtained by trimming the 6% largest and 6% smallest sample points and then averaging the remaining observations, retains an efficiency above 95% for all ε between 0 and 0.1 (a simulation sketch below illustrates this). Moreover, the 6%-trimmed mean remains trustworthy as long as there are no more than 6% outliers in the sample. The most robust location estimator is the sample median (the midpoint of the ordered sample).

In order to formalize this aspect, we introduce the notion of breakdown for a statistical estimate T(x_1, x_2, ..., x_n). Let X^0 = (x_1, x_2, ..., x_n) be an initial sample. The breakdown point of T(x_1, ..., x_n) ∈ R^p is defined by

ε*_n(T; X^0) = (1/n) min { m : sup_{X^m} ||T(X^m)|| = ∞ },

where X^m is obtained by replacing m out of the n points of X^0 arbitrarily. In other words, the breakdown point is the smallest fraction of contamination that can carry the estimator T beyond all bounds.
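Tukey's comparison can be checked qualitatively by a small Monte Carlo experiment. The sketch below (sample size, replication count and seed are arbitrary) estimates the variance ratio of the sample mean to the 6%-trimmed mean under the contaminated model; it compares the two estimators directly rather than against the best estimator at each F, so the percentages differ from Tukey's, but the reversal as ε grows is the same.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 100, 20000

for eps in (0.0, 0.05, 0.10):
    # Each observation comes from N(0, 9) with probability eps, else from N(0, 1).
    contam = rng.random((reps, n)) < eps
    samples = np.where(contam,
                       rng.normal(0.0, 3.0, (reps, n)),
                       rng.normal(0.0, 1.0, (reps, n)))
    means = samples.mean(axis=1)
    tmeans = stats.trim_mean(samples, 0.06, axis=1)  # trim 6% in each tail
    # Ratio > 1 means the trimmed mean is the more efficient of the two.
    print(eps, means.var() / tmeans.var())
```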

Since one outlier can drive the sample mean beyond all bounds, the breakdown point of the sample mean is 1/n. (Typically, the breakdown point ε*_n(T; X^0) does not depend on X^0.) The sample median has breakdown point close to 1/2, since the median remains bounded as long as more than half of the sample points stay bounded.

For multivariate data, however, it is not at all easy to find a "median" with breakdown point as high as 1/2 if we require affine equivariance of the estimator, i.e.

T(Ax_1 + b, ..., Ax_n + b) = A T(x_1, ..., x_n) + b

for any nonsingular p × p matrix A and any vector b ∈ R^p. Almost all affine equivariant techniques known before 1982 face the ubiquitous upper bound 1/(p + 1) on their breakdown points. These techniques include convex peeling and iterative trimming. Convex peeling proceeds by removing the points on the boundary of the smallest convex hull containing all the data points, and repeating this several times until a sufficient number of points have been peeled away. Such a procedure may delete too many "good" points in the first few peelings, because each step removes at least p + 1 points from the sample, and there may be only one outlier among them. The well-known M-estimators, which are excellent robust estimators in the one-dimensional location problem, also have low breakdown points for higher p; see "Robust Statistics" by Huber (1981, Wiley) for more details.

Since 1982, several high breakdown point estimators have been proposed. Rousseeuw's minimum volume ellipsoid estimator (MVE) is the center of the ellipsoid of minimal volume covering at least half of the data points. For a typical p-dimensional sample X^0, the breakdown point of the MVE equals ([n/2] − p + 1)/n, which approaches 1/2 as n → ∞. The MVE is not locally stable and converges quite slowly to the true mean vector of an elliptically symmetric distribution (if such a model holds exactly). But as a highly reliable estimator of "center", it can often be used as a starting point for diagnostics. Locally stable and faster converging estimators of multivariate location and scatter are also available.
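Computing the exact MVE is itself a hard problem; in practice it is approximated by resampling. The sketch below follows the standard subsampling idea (the function name, trial count and test data are illustrative): draw random (p + 1)-point subsets, inflate each subset's ellipsoid just enough to cover half of the data, and keep the center of the smallest such ellipsoid.

```python
import numpy as np

def mve_center(X, n_trials=500, seed=None):
    """Crude subsampling approximation to the MVE center."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_vol, best_center = np.inf, None
    for _ in range(n_trials):
        idx = rng.choice(n, size=p + 1, replace=False)
        mu = X[idx].mean(axis=0)
        C = np.cov(X[idx], rowvar=False)
        if np.linalg.det(C) <= 1e-12:      # degenerate subset: skip it
            continue
        # Squared Mahalanobis distances of all points to this candidate ellipsoid.
        d2 = np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(C), X - mu)
        m2 = np.median(d2)                 # inflation factor covering half the data
        vol = m2 ** (p / 2) * np.sqrt(np.linalg.det(C))  # volume, up to a constant
        if vol < best_vol:
            best_vol, best_center = vol, mu
    return best_center

# 20% of the points form a distant cluster: the sample mean is dragged
# toward it, while the approximate MVE center stays with the majority.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (80, 2)), rng.normal(20.0, 1.0, (20, 2))])
print(mve_center(X, seed=3), X.mean(axis=0))
```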

Let K : R^+ → [0, 1] be a nonincreasing, left-continuous function such that

(1) K(0) = 1 and K(u) is continuous at u = 0;
(2) K(u) > 0 for 0 ≤ u < c, but K(u) = 0 for u > c, for some c > 0.

For an elliptically symmetric distribution F((x − μ)^T Σ^{-1} (x − μ)) with mean vector μ and scatter matrix Σ, an S-estimator (μ̂_n, Σ̂_n) is a solution (a, A) of the following minimization problem:

minimize det(A) subject to (1/n) ∑_{i=1}^n K((x_i − a)^T A^{-1} (x_i − a)) ≥ 1 − ε,

over all a ∈ R^p and all p × p positive definite matrices A, where 1 − ε = E[K((X − μ)^T Σ^{-1} (X − μ))] under the model. Such an S-estimator has been shown to have the following properties:

(i) For n(1 − ε) ≥ p + 1, there is at least one solution (a*, A*) with A* > 0.
(ii) Under the model, both μ̂_n and Σ̂_n converge to μ and Σ at the rate of n^{-1/2} (under some regularity conditions).
(iii) The breakdown point of the S-estimator is ε, which can be chosen between 0 and ([n/2] − p + 1)/n.
(iv) The smooth S-estimator is also locally stable.

However, the S-estimator is computationally difficult. Finally, we should point out that a robust location-scatter estimator is itself a key ingredient of various robust techniques for multivariate data. For a further understanding of robust statistics, see "Robust Statistics: The Approach Based on Influence Functions" by Hampel, Ronchetti, Rousseeuw and Stahel (1986, John Wiley & Sons).
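To make the definition concrete, here is a toy sketch (an illustrative biweight-type K and a one-parameter search over A = sI, chosen for brevity; this is nowhere near a practical S-estimation algorithm): it evaluates the constraint for candidate pairs (a, A) and shrinks det(A) as far as feasibility allows.

```python
import numpy as np

def K(u, c=9.0):
    # A nonincreasing K with K(0) = 1 and K(u) = 0 for u > c
    # (one convenient choice; any K meeting the stated conditions works).
    u = np.asarray(u)
    return np.where(u < c, (1.0 - u / c) ** 3, 0.0)

def feasible(X, a, A, eps):
    """Check (1/n) * sum_i K((x_i - a)^T A^{-1} (x_i - a)) >= 1 - eps."""
    d = np.einsum('ij,jk,ik->i', X - a, np.linalg.inv(A), X - a)
    return K(d).mean() >= 1.0 - eps

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, (200, 2))
a, eps = np.median(X, axis=0), 0.4       # fix a robust center; search over scale only

best = None
for s in np.linspace(5.0, 0.1, 50):      # shrink A = s*I until the constraint breaks
    if feasible(X, a, s * np.eye(2), eps):
        best = s                         # smallest feasible scale seen so far
print(best)                              # det(A) = best**2 is the toy minimum
```

Real algorithms must search over all positive definite matrices A and candidate centers a simultaneously, which is exactly why the S-estimator is computationally difficult.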