Robust Statistics. Frank Klawonn

Robust Statistics Frank Klawonn f.klawonn@fh-wolfenbuettel.de, frank.klawonn@helmholtz-hzi.de Data Analysis and Pattern Recognition Lab Department of Computer Science University of Applied Sciences Braunschweig/Wolfenbüttel, Germany Bioinformatics & Statistics Helmholtz Centre for Infection Research Braunschweig, Germany Robust Statistics p.1/98

Outline
- Motivation: Mean or median
- What is robust statistics?
- M-estimators
- Robust regression
- Median polish
- Summary and references
Robust Statistics p.2/98

Motivation: Mean or median
- Imagine a small town with 20 thousand inhabitants.
- On average, each inhabitant has a capital of 10 thousand $.
- Assume a very rich man named Bill G., owning a capital of 20 billion $, decides to move to this town.
- After Bill G. has settled there, the inhabitants own an average capital of roughly one million $.
- And all but one inhabitant might own less capital than the average.
Robust Statistics p.3-4/98
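The arithmetic of this example can be checked directly in R (the tool used later in these slides); the population figures below are the hypothetical ones from the example:

```r
# Hypothetical town: 19,999 inhabitants with 10,000 $ each,
# plus one newcomer with 20 billion $.
capital <- c(rep(10000, 19999), 2e10)
mean(capital)    # 1009999.5 -- roughly one million $
median(capital)  # 10000 -- unaffected by the newcomer
```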

(Empirical) median
Let x_(1), ..., x_(n) denote a sample in ascending order.
Definition. The (sample or empirical) median, denoted by ~x, is given by
~x = x_((n+1)/2) if n is odd,
~x = (x_(n/2) + x_(n/2+1)) / 2 if n is even.
Robust Statistics p.5/98
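R's median() implements exactly this case distinction; a quick check with two toy samples:

```r
x_odd  <- c(3, 1, 2)     # n = 3 (odd):  ~x = x_(2)
median(x_odd)            # 2
x_even <- c(4, 1, 3, 2)  # n = 4 (even): ~x = (x_(2) + x_(3)) / 2
median(x_even)           # 2.5
```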

(Empirical) median
[figure: illustration of the median of an ordered sample]
Robust Statistics p.6/98

Motivation: Mean or median
A less extreme example:
[figures: three plots contrasting mean and median, slides 7-9]
Robust Statistics p.7-9/98

What is a good estimator?
Assume we want to estimate the expected value of a normal distribution from which a sample was generated. For a symmetric distribution like the normal distribution, the expected value and the median are equal. The median q_0.5 of a (continuous) probability distribution, representing the random variable X, is the 50%-quantile, i.e.
P(X <= q_0.5) = 0.5 = P(X >= q_0.5).
Robust Statistics p.10/98

What is a good estimator?
Classical statistics:
(a) The estimator should be correct on average (unbiased), at least for large sample sizes (asymptotically unbiased).
(b) The estimator should have a small variance (efficiency).
(c) With increasing sample size, the variance of the estimator should tend to zero.
(a) and (c) together guarantee consistency: with increasing sample size, the estimator converges in probability to the true value of the parameter to be estimated.
Robust Statistics p.11/98

What is a good estimator? Should we choose the mean or the median to estimate the expected value μ of our normal distribution? Both estimators are consistent. Robust Statistics p.12/98

Mean or median
[figure: histograms of the estimated mean and of the estimated median over repeated samples, n = 20]
Robust Statistics p.13/98

Mean or median
[figure: histograms of the estimated mean and of the estimated median, n = 100]
Robust Statistics p.14/98

Mean or median
[figure: histograms of the estimated mean, n = 20, without noise and with 5% noise]
Robust Statistics p.15/98

Mean or median
[figure: histograms of the estimated mean, n = 100, without noise and with 5% noise]
Robust Statistics p.16/98

Mean or median
[figure: histograms of the estimated median, n = 20, without noise and with 5% noise]
Robust Statistics p.17/98

Mean or median
[figure: histograms of the estimated median, n = 100, without noise and with 5% noise]
Robust Statistics p.18/98

Mean or median
- Under the ideal assumption that the data were sampled from a normal distribution, the mean is a more efficient estimator than the median.
- If a small fraction of the data is for some reason erroneous or generated by another distribution, the mean can even become a biased estimator and lose consistency.
- The median is more or less unaffected if a small fraction of the data is corrupted.
Robust Statistics p.19/98
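The histogram experiments above can be reproduced with a small simulation. The sketch below uses assumed parameters (n = 20, one gross error out of 20, i.e. 5%, drawn around 50), not the exact settings of the slides:

```r
set.seed(1)
est <- replicate(10000, {
  x <- rnorm(20)               # ideal sample from N(0, 1)
  x[1] <- rnorm(1, mean = 50)  # replace one value (5%) by a gross error
  c(mean = mean(x), median = median(x))
})
rowMeans(est)  # the mean is biased by roughly 50/20 = 2.5; the median hardly moves
```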

Robust statistics Hampel et al. (1986): In a broad informal sense, robust statistics is a body of knowledge, partly formalized into theories of robustness, relating to deviations from idealized assumptions in statistics. Robust Statistics p.20/98

Robust statistics
idealized assumption: The data are sampled from the (possibly multivariate) random variable X with cumulative distribution function F_X.
modified assumption: The data are sampled from a random variable with ε-contaminated cumulative distribution function
F_ε = (1 − ε) F_X + ε F_outliers
- F_X: the assumed ideal model distribution
- ε: (small) probability for outliers
- F_outliers: unknown and unspecified distribution
Robust Statistics p.21/98

Estimators (Statistics)
Statistics is concerned with functionals t (or better t_n), called statistics, which are used for parameter estimation and other purposes. The mean
t_n(X_1, ..., X_n) = x̄ = (1/n) Σ_{i=1}^n X_i,
the median, or the (empirical) variance
t_n(X_1, ..., X_n) = s² = (1/(n−1)) Σ_{i=1}^n (X_i − x̄)²
are typical examples of estimators.
Robust Statistics p.22/98

Estimators (Statistics)
Two views of estimators:
- Applied to (finite) samples (x_1, ..., x_n), resulting in a concrete estimate (a realization of a random experiment consisting of the drawn sample).
- As random variables (applied to random variables). This enables us to investigate the (theoretical) properties of estimators. Samples are not needed for this purpose.
Robust Statistics p.23/98

Estimators (Statistics)
Assuming an infinite sample size, the limit in probability
t(F_X) = lim_{n→∞} t_n(X_1, ..., X_n)
can be considered (in case it exists). t(F_X) is then again a random variable. For typical estimators, t(F_X) is a constant random variable, i.e. the limit converges (with probability 1) to a unique value.
Robust Statistics p.24/98

Fisher consistency
An estimator t is called Fisher consistent for a parameter θ of the probability distribution of X if
t(F_X) = θ,
i.e. for large (infinite) sample sizes, the estimator converges with probability 1 to the true value of the parameter to be estimated.
Robust Statistics p.25/98

Empirical influence function
Given a sample (x_1, ..., x_n) and an estimator t_n(x_1, ..., x_n), what is the influence of a single observation on t?
Empirical influence function:
EIF(x) = t_{n+1}(x_1, ..., x_n, x)
Vary x between −∞ and ∞.
Robust Statistics p.26/98

Empirical influence function
Consider the (ordered) sample 0.4, 1.2, 1.4, 1.5, 1.7, 2.0, 2.9, 3.8, 3.8, 4.2:
x̄ = 2.29, med(x) = 1.85, x̄_10% = 2.2875
The (α-)trimmed mean is the mean of the sample from which the lowest and highest 100·α% of the values are removed. (For the mean: α = 0; for the median: α = 0.5.)
Robust Statistics p.27/98
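These three estimates can be reproduced in R; mean(x, trim = 0.1) removes the lowest and the highest 10% of the values before averaging:

```r
x <- c(0.4, 1.2, 1.4, 1.5, 1.7, 2.0, 2.9, 3.8, 3.8, 4.2)
mean(x)              # 2.29
median(x)            # (1.7 + 2.0) / 2 = 1.85
mean(x, trim = 0.1)  # 10%-trimmed mean: 2.2875
```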

Empirical influence function
[figure: EIF of mean(x), median(x) and trimmedmean(x) for the added point varying over [−10, 10]]
Robust Statistics p.28/98

Sensitivity curve
The (empirical) sensitivity curve is a normalized EIF (centred around 0 and scaled according to the sample size):
SC(x) = ( t_{n+1}(x_1, ..., x_n, x) − t_n(x_1, ..., x_n) ) / (1/(n+1))
[figure: sensitivity curves of mean(x), median(x) and trimmedmean(x) for x in [−10, 10]]
Robust Statistics p.29/98
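As a numerical sketch of the definition, evaluating SC for the sample from the EIF example shows the unbounded behaviour of the mean and the bounded behaviour of the median:

```r
x <- c(0.4, 1.2, 1.4, 1.5, 1.7, 2.0, 2.9, 3.8, 3.8, 4.2)
# SC(z) = (n + 1) * (t applied to the enlarged sample minus t on the original)
SC <- function(z, t) (length(x) + 1) * (t(c(x, z)) - t(x))
SC(10, mean)     # 7.71 -- grows linearly with the added point z
SC(10, median)   # 1.65 -- bounded
SC(1e6, median)  # still 1.65, however extreme z becomes
```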

Influence function
The influence function corresponds to the sensitivity curve for large (infinite) sample sizes:
IF(x; t, F) = lim_{n→∞} ( t((1 − 1/n) F + (1/n) δ_x) − t(F) ) / (1/n)
δ_x represents a (cumulative probability) distribution yielding the value x with probability 1. In this sense, the influence function measures what happens with the estimator for an infinitesimally small contamination for large sample sizes. Note that the influence function might not be defined if the limit does not exist.
Robust Statistics p.30/98

Gross-error sensitivity
The worst case (in terms of the outlier x) is called gross-error sensitivity:
γ*(t; F) = sup_x { |IF(x; t, F)| }
If γ*(t; F) is finite, t is called a B-robust estimator (B stands for bias) (at F). For the arithmetic mean, we have γ*(x̄; F) = ∞. For the median and the trimmed mean, the gross-error sensitivity depends on the distribution F.
Robust Statistics p.31/98

Breakdown point The influence curve and the gross-error sensitivity characterise the influence of single (or even infinitesimal) outliers. A minimum requirement for robustness is that the influence curve is bounded. What happens when the fraction of outliers increases? Robust Statistics p.32/98

Breakdown point
The breakdown point is the smallest fraction of (extreme) outliers that need to be included in a sample in order to let the estimator break down completely, i.e. yield (almost) infinity. Let
hd((x_1, ..., x_n), (y_1, ..., y_n)) = |{ i ∈ {1, ..., n} : x_i ≠ y_i }|
denote the Hamming distance between two samples (x_1, ..., x_n) and (y_1, ..., y_n).
Robust Statistics p.33/98

Breakdown point
The breakdown point of an estimator t is defined as
ε*_n(t; x_1, ..., x_n) = (1/n) · min{ m : sup{ |t(y_1, ..., y_n)| : hd((x_1, ..., x_n), (y_1, ..., y_n)) = m } = ∞ }
Normally, ε*_n is independent of the specific choice of the sample (x_1, ..., x_n).
Robust Statistics p.34/98

Breakdown point
If ε*_n is independent of the sample, for large (infinite) sample sizes the breakdown point is defined as
ε* = lim_{n→∞} ε*_n
Examples:
- Arithmetic mean: ε* = 0%
- Median: ε* = 50%
- α-trimmed mean: ε* = α
Robust Statistics p.35/98
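These breakdown points can be illustrated empirically. The sketch below corrupts m observations of the n = 10 sample from the EIF example with an extreme value:

```r
x <- c(0.4, 1.2, 1.4, 1.5, 1.7, 2.0, 2.9, 3.8, 3.8, 4.2)
corrupt <- function(x, m) { x[seq_len(m)] <- 1e12; x }
mean(corrupt(x, 1))    # destroyed by a single outlier (eps* = 0%)
median(corrupt(x, 4))  # 4.0 -- still within the range of the clean data
median(corrupt(x, 5))  # ~5e11 -- with n/2 outliers the median breaks down
```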

Criteria for robust estimators
- Bounded influence function: single extreme outliers cannot do too much harm to the estimator.
- Low gross-error sensitivity.
- Positive breakdown point (the higher, the better): even a number of outliers can be tolerated without leading to nonsense estimations.
- Fisher consistency: for very large sample sizes the estimator will yield the correct value.
- High efficiency: the variance of the estimator should be as low as possible.
Robust Statistics p.36/98

Criteria for robust estimators There is no way to satisfy all criteria in the best way at the same time. There is a trade-off between robustness issues like positive breakdown point and low gross-error sensitivity on the one hand and efficiency on the other hand. As an example, compare the mean (high efficiency, breakdown point 0) and the median (lower efficiency, but very good breakdown point). Robust Statistics p.37/98

Robust measures of spread
The (empirical) variance suffers from the same problems as the mean. (The estimation of the variance usually includes an estimation of the mean.) An example of a more robust estimator for spread is the interquartile range, the difference between the 75%- and the 25%-quantile. (The q%-quantile is the value x in the sample for which q% of the values are smaller than x and (100 − q)% are larger than x.)
Robust Statistics p.38/98

Error measures
The expected value μ minimizes the error function
E((X − μ)²).
Correspondingly, the arithmetic mean x̄ minimizes the error function
Σ_{i=1}^n (x_i − x̄)².
Robust Statistics p.39/98

Error measures
The median q_0.5 minimizes the error function
E(|X − q_0.5|).
Correspondingly, the (sample) median ~x minimizes the error function
Σ_{i=1}^n |x_i − ~x|.
Robust Statistics p.40/98

Error measures
This also explains why the median is less sensitive to outliers: the quadratic error for the mean punishes outliers much more strongly than the absolute error. Therefore, extreme outliers have a higher influence ("pull stronger") than other points.
Robust Statistics p.41/98
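A small grid search over candidate location estimates makes this concrete (the grid and the sample, taken from the EIF example, are illustrative choices):

```r
x <- c(0.4, 1.2, 1.4, 1.5, 1.7, 2.0, 2.9, 3.8, 3.8, 4.2)
sse <- function(m) sum((x - m)^2)   # quadratic error
sae <- function(m) sum(abs(x - m))  # absolute error
grid <- seq(0, 5, by = 0.01)
grid[which.min(sapply(grid, sse))]  # 2.29 = mean(x)
grid[which.min(sapply(grid, sae))]  # lies in [1.7, 2.0]: the absolute error is
                                    # flat there, and median(x) = 1.85 minimizes it
```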

Error measures
How to measure errors? The error for an estimate θ̂, including the sign, is e_i = x_i − θ̂. Minimizing Σ_{i=1}^n e_i does not make sense:
- Usually inf_{θ̂} Σ_{i=1}^n e_i = −∞.
- Even if we require Σ_{i=1}^n e_i ≥ 0, a small value for Σ_{i=1}^n e_i does not mean that the errors e_i are small. There might be large positive and large negative errors that balance each other.
Robust Statistics p.42/98

Error measures
Therefore, we need a modified error ρ(e). Which properties should the function ρ: R → R have?
- ρ(e) ≥ 0
- ρ(0) = 0
- ρ(e) = ρ(−e)
- ρ(e_i) ≤ ρ(e_j) if |e_i| ≤ |e_j|
Robust Statistics p.43/98

Error measures
Possible choices for ρ:
- ρ(e) = e²
- ρ(e) = |e|
- ...?
Advantage of ρ(e) = e²: in order to minimize Σ_{i=1}^n ρ(e_i), we can take derivatives. This does not work for ρ(e) = |e|, since the function f(x) = |x| is not differentiable (at 0).
Robust Statistics p.44/98

Error measures
Which other options do we have for ρ? The quadratic error is obviously not a good choice when we seek robustness. Consider the more general setting of linear models of the form
y_i = β_0 + β_1 x_i1 + ... + β_k x_ik + ε_i = x_i^T β + ε_i.
This also covers the special case of estimators for location: y_i = β_0 + ε_i.
Robust Statistics p.45/98

Linear regression
linear model: y_i = α + β_1 x_i1 + ... + β_k x_ik + ε_i = x_i^T β + ε_i
computed model: y_i = a + b_1 x_i1 + ... + b_k x_ik + e_i = x_i^T b + e_i
objective function: Σ_{i=1}^n ρ(e_i) = Σ_{i=1}^n ρ(y_i − x_i^T b)
Robust Statistics p.46/98

Least squares regression
Computing derivatives of
(1/2) Σ_{i=1}^n e_i² = (1/2) Σ_{i=1}^n (y_i − x_i^T b)²
(the constant factor 1/2 does not change the optimisation problem) leads to
Σ_{i=1}^n (y_i − x_i^T b) x_i^T = 0.
The solution of this system of linear equations is straightforward and can be found in any textbook.
Robust Statistics p.47/98
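As a sketch (with simulated data and hypothetical true coefficients 1 and 2, not from the slides), the resulting normal equations X'X b = X'y can be solved directly and agree with R's lm():

```r
set.seed(2)
x <- runif(20)
y <- 1 + 2 * x + rnorm(20, sd = 0.1)
X <- cbind(1, x)                    # design matrix with intercept column
b <- solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations X'X b = X'y
as.vector(b)
coef(lm(y ~ x))                     # same estimates
```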

Statistics tool R
Open source software: http://www.r-project.org
R uses a type-free command language. Assignments are written in the form
> x <- y
i.e. y is assigned to x. The object y must be defined (generated) before it can be assigned to x. A declaration of x is not required.
Robust Statistics p.48/98

R: Reading a file
> mydata <- read.table(file.choose(), header=T)
opens a file chooser. The chosen file is assigned to the object named mydata. header=T means that the chosen file contains a header: the first line of the file contains the names of the variables, the following lines contain the values (tab- or space-separated).
Robust Statistics p.49/98

R: Accessing a single variable
> vn <- mydata$varname
assigns the column named varname of the data set contained in the object mydata to the object vn. The command
> print(vn)
prints the corresponding column on the screen.
Robust Statistics p.50/98

R: Printing on the screen
  [1] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 ... 0.1 0.1 0.2 0.4 0.4 0.3
 [19] 0.3 0.3 0.2 0.4 0.2 0.5 0.2 0.2 ... 0.2 0.4 0.1 0.2 0.1 0.2
 [37] 0.2 0.1 0.2 0.2 0.3 0.3 0.2 0.6 ... 0.2 0.2 1.4 1.5 1.5 1.3
 [55] 1.5 1.3 1.6 1.0 1.3 1.4 1.0 1.5 ... 1.5 1.0 1.5 1.1 1.8 1.3
 [73] 1.5 1.2 1.3 1.4 1.4 1.7 1.5 1.0 ... 1.5 1.6 1.5 1.3 1.3 1.3
 [91] 1.2 1.4 1.2 1.0 1.3 1.2 1.3 1.3 ... 2.1 1.8 2.2 2.1 1.7 1.8
[109] 1.8 2.5 2.0 1.9 2.1 2.0 2.4 2.3 ... 2.3 2.0 2.0 1.8 2.1 1.8
[127] 1.8 1.8 2.1 1.6 1.9 2.0 2.2 1.5 ... 1.8 2.1 2.4 2.3 1.9 2.3
[145] 2.5 2.3 1.9 2.0 2.3 1.8
Robust Statistics p.51/98

R: Empirical mean & median
The mean and median can be computed using R by the functions mean() and median(), respectively.
> mean(vn)
[1] 1.198667
> median(vn)
[1] 1.3
The mean and median can also be applied to data objects consisting of more than one (numerical) column, yielding a vector of mean/median values.
Robust Statistics p.52/98

R: Empirical variance
The function var() yields the empirical variance in R.
> var(vn)
[1] 0.5824143
The function sd() yields the empirical standard deviation.
> sd(vn)
[1] 0.7631607
Robust Statistics p.53/98

R: min and max
The functions min() and max() compute the minimum and the maximum in a data set.
> min(vn)
[1] 0.1
> max(vn)
[1] 2.5
The function IQR() yields the interquartile range.
Robust Statistics p.54/98

Least squares regression
> reg.lsq <- lm(y ~ x)
> summary(reg.lsq)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-4.76528 -2.57376  0.06554  2.27587  4.33301

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.3149     1.2596  -2.632   0.0273 *
x             1.2085     0.9622   1.256   0.2408
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.372 on 9 degrees of freedom
Multiple R-Squared: 0.1491, Adjusted R-squared: 0.05458
F-statistic: 1.577 on 1 and 9 DF, p-value: 0.2408
Robust Statistics p.55/98

Least squares regression
> plot(x, y)
> abline(reg.lsq)
[figure: scatter plot of y against x with the fitted least squares line]
Robust Statistics p.56/98

Least squares regression
> plot(y - predict.lm(reg.lsq))
[figure: residuals y − predict.lm(reg.lsq) against the observation index]
Robust Statistics p.57/98

Least squares regression
> plot(x, y - predict.lm(reg.lsq))
[figure: residuals y − predict.lm(reg.lsq) against x]
Robust Statistics p.58/98

Least squares regression
> reg.lsq <- lm(y ~ x)
> summary(reg.lsq)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2437 -0.9049 -0.6414 -0.3554  6.6398

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.75680    0.33288   2.273   0.0248 *
x            0.09406    0.05666   1.660   0.0995 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.841 on 120 degrees of freedom
Multiple R-Squared: 0.02245, Adjusted R-squared: 0.0143
F-statistic: 2.756 on 1 and 120 DF, p-value: 0.0995
Robust Statistics p.59/98

Least squares regression
> plot(x, y)
> abline(reg.lsq)
[figure: scatter plot of y against x with the fitted least squares line]
Robust Statistics p.60/98

Least squares regression
> plot(y - predict.lm(reg.lsq))
[figure: residuals against the observation index]
Robust Statistics p.61/98

M-estimators
Define ψ = ρ′, w(e) = ψ(e)/e and w_i = w(e_i). Computing derivatives of
Σ_{i=1}^n ρ(e_i) = Σ_{i=1}^n ρ(y_i − x_i^T b)
leads to
Σ_{i=1}^n ψ(y_i − x_i^T b) x_i^T = Σ_{i=1}^n w_i (y_i − x_i^T b) x_i^T = 0.
The solution of this system of equations is the same as for the weighted least squares problem
Σ_{i=1}^n w_i e_i².
Robust Statistics p.62/98

M-estimators
Problem:
- The weights w_i depend on the errors e_i, and
- the errors e_i depend on the weights w_i.
Solution strategy: alternating optimisation:
1. Initialise with standard least squares regression.
2. Compute the weights.
3. Apply standard least squares regression with the computed weights.
4. Repeat 2. and 3. until convergence.
Robust Statistics p.63/98
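A minimal sketch of this alternating optimisation with Huber weights is given below. The tuning constant k = 1.345 and the MAD residual scale are common choices assumed here, not taken from the slides; MASS::rlm (used later) is the production implementation:

```r
# Huber weight function: 1 in the quadratic zone, k/|e| beyond it.
huber_w <- function(e, k = 1.345) ifelse(abs(e) <= k, 1, k / abs(e))

irls <- function(X, y, maxit = 50, tol = 1e-8) {
  b <- solve(t(X) %*% X, t(X) %*% y)  # 1. ordinary least squares start
  for (it in seq_len(maxit)) {
    e <- as.vector(y - X %*% b)
    s <- mad(e)                       # robust residual scale
    w <- huber_w(e / s)               # 2. weights from the residuals
    b_new <- solve(t(X) %*% (w * X), t(X) %*% (w * y))  # 3. weighted LS
    if (max(abs(b_new - b)) < tol) return(b_new)        # 4. until convergence
    b <- b_new
  }
  b
}

# Data with two gross outliers: IRLS recovers the slope, plain LS does not.
set.seed(3)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.2)
y[1:2] <- y[1:2] + 30
X <- cbind(1, x)
irls(X, y)[2]       # close to the true slope 0.5
coef(lm(y ~ x))[2]  # pulled away by the outliers
```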

Robust regression
Method         ρ(e)
least squares  e²
Huber          e²/2                       if |e| ≤ k
               k|e| − k²/2                if |e| > k
Tukey          (k²/6)(1 − (1 − (e/k)²)³)  if |e| ≤ k
               k²/6                       if |e| > k
Robust Statistics p.64/98
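The table translates directly into R (a sketch; k denotes the tuning constant):

```r
rho_ls    <- function(e)    e^2
rho_huber <- function(e, k) ifelse(abs(e) <= k, e^2 / 2, k * abs(e) - k^2 / 2)
rho_tukey <- function(e, k) ifelse(abs(e) <= k,
                                   (k^2 / 6) * (1 - (1 - (e / k)^2)^3),
                                   k^2 / 6)
rho_huber(0.5, 1.345)  # quadratic zone: 0.5^2 / 2 = 0.125
rho_tukey(10, 2)       # beyond k the loss is constant: 2^2 / 6
```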

M-estimators: Least squares
[figure: quadratic ρ(e) and its constant weight function w(e) for −6 ≤ e ≤ 6]
Robust Statistics p.65/98

M-estimators: Huber
[figure: Huber ρ(e) and weight function w(e) for −6 ≤ e ≤ 6]
Robust Statistics p.66/98

M-estimators: Tukey
[figure: Tukey ρ(e) and weight function w(e) for −6 ≤ e ≤ 6]
Robust Statistics p.67/98

Robust regression with R
At least the package MASS will be required. Packages can be downloaded and installed directly in R from the Internet via
> install.packages("packagename")
Once a package is installed, it can be loaded by
> library(packagename)
Robust Statistics p.68/98

Robust regression (Huber)
> reg.rob <- rlm(y ~ x)
> summary(reg.rob)

Call: rlm(formula = y ~ x)
Residuals:
     Min       1Q   Median       3Q      Max
-4.76528 -2.57376  0.06554  2.27587  4.33301

Coefficients:
            Value   Std. Error t value
(Intercept) -3.3149  1.2596    -2.6316
x            1.2085  0.9622     1.2559

Residual standard error: 3.678 on 9 degrees of freedom

Correlation of Coefficients:
  (Intercept)
x -0.5903
Robust Statistics p.69/98

Robust regression (Huber)
> plot(x, y)
> abline(reg.rob)
[figure: scatter plot of y against x with the robust regression line]
Robust Statistics p.70/98

Robust regression (Huber)
> plot(y - predict.lm(reg.rob))
[figure: residuals against the observation index]
Robust Statistics p.71/98

Robust regression (Huber)
> plot(reg.rob$w)
[figure: robust regression weights reg.rob$w against the observation index]
Robust Statistics p.72/98

Robust regression (Tukey)
> reg.rob <- rlm(y ~ x, method = "MM")
> summary(reg.rob)

Call: rlm(formula = y ~ x, method = "MM")
Residuals:
    Min      1Q  Median      3Q     Max
-0.7199 -0.2407  0.1070  0.3573 40.4858

Coefficients:
            Value   Std. Error t value
(Intercept)  1.2250  0.1819     6.7342
x           -9.4277  0.1390   -67.8441

Residual standard error: 0.5866 on 9 degrees of freedom

Correlation of Coefficients:
  (Intercept)
x -0.5903
Robust Statistics p.73/98

Robust regression (Tukey)
> plot(x, y)
> abline(reg.rob)
[figure: scatter plot of y against x with the robust regression line]
Robust Statistics p.74/98

Robust regression (Tukey)
> plot(y - predict.lm(reg.rob))
[figure: residuals against the observation index]
Robust Statistics p.75/98

Robust regression (Tukey)
> plot(reg.rob$w)
[figure: robust regression weights reg.rob$w against the observation index]
Robust Statistics p.76/98

Robust regression (Huber)
> reg.rob <- rlm(y ~ x)
> summary(reg.rob)

Call: rlm(formula = y ~ x)
Residuals:
     Min       1Q   Median       3Q      Max
-0.65231 -0.29731 -0.02757  0.26270  7.23700

Coefficients:
            Value  Std. Error t value
(Intercept) 0.1821 0.0797     2.2842
x           0.0876 0.0136     6.4581

Residual standard error: 0.4137 on 120 degrees of freedom

Correlation of Coefficients:
  (Intercept)
x -0.8657
Robust Statistics p.77/98

Robust regression (Huber)
> plot(x, y)
> abline(reg.rob)
[figure: scatter plot of y against x with the robust regression line]
Robust Statistics p.78/98

Robust regression (Huber)
> plot(y - predict.lm(reg.rob))
[figure: residuals against the observation index]
Robust Statistics p.79/98

Robust regression (Huber)
> plot(reg.rob$w)
[figure: robust regression weights reg.rob$w against the observation index]
Robust Statistics p.80/98

Robust regression (Tukey)
> reg.rob <- rlm(y ~ x, method = "MM")
> summary(reg.rob)

Call: rlm(formula = y ~ x, method = "MM")
Residuals:
     Min       1Q   Median       3Q      Max
-0.56126 -0.18056  0.08183  0.35255  7.33345

Coefficients:
            Value  Std. Error t value
(Intercept) 0.1066 0.0592     1.8005
x           0.0816 0.0101     8.0978

Residual standard error: 0.3781 on 120 degrees of freedom

Correlation of Coefficients:
  (Intercept)
x -0.8657
Robust Statistics p.81/98

Robust regression (Tukey)
> plot(x, y)
> abline(reg.rob)
[figure: scatter plot of y against x with the robust regression line]
Robust Statistics p.82/98

Robust regression (Tukey)
> plot(y - predict.lm(reg.rob))
[figure: residuals against the observation index]
Robust Statistics p.83/98

Robust regression (Tukey)
> plot(reg.rob$w)
[figure: robust regression weights reg.rob$w against the observation index]
Robust Statistics p.84/98

Robust regression with R
After plotting the weights by
> plot(reg.rob$w)
clicking single points can be enabled by
> identify(1:length(reg.rob$w), reg.rob$w)
in order to get the indices of interesting weights.
Robust Statistics p.85/98

Multivariate regression
For simple linear regression y ≈ ax + b, plotting the data often helps to identify problems and outliers. This is no longer possible for multivariate regression
y_i = α + β_1 x_i1 + ... + β_k x_ik + ε_i = x_i^T β + ε_i.
Here, methods like residual plots and residual analysis are possible ways to gain more insight into outliers and other problems. In R, simply write for instance rlm(y ~ x1 + x2 + x3).
Robust Statistics p.86/98

Two-way tables
Example. 1000 people were asked which political party they voted for, in order to find out whether the choice of the party and the sex of the voter are independent.

pol. party \ sex   female   male    sum
SPD                   200    170    370
CDU/CSU               200    200    400
Grüne                  45     35     80
FDP                    25     35     60
PDS                    20     30     50
Others                 22      5     27
No answer               8      5     13
sum                   520    480   1000

Robust Statistics p.87/98

Two-way tables
In such contexts, typically statistical tests are applied, like
- the χ²-test (for independence, homogeneity),
- Fisher's exact test (for 2×2 tables),
- the Kruskal-Wallis test,
- MANOVA,
- ...
Robust Statistics p.88/98

Two-way tables
These tests are not robust, some have very restrictive assumptions (MANOVA, Fisher's exact test), and the χ²-test is only an asymptotic test.
Alternative: median polish
Robust Statistics p.89/98

Median polish
Underlying (additive) model:
y_ij = μ + α_i + β_j + ε_ij
- μ: overall typical value (general level)
- α_i: row effect (here: the political party)
- β_j: column effect (here: the sex)
- ε_ij: noise or random fluctuation
Robust Statistics p.90/98

Median polish
Algorithm:
1. Subtract from each row its median.
2. For the updated table, subtract from each column its median.
3. Repeat 1. and 2. (with the corresponding updated tables) until convergence.
Robust Statistics p.91/98

Median polish
Iterative estimation of the parameters. Initialisation:
m^(0) = 0, a_i^(0) = 0, b_j^(0) = 0, e_ij^(0) = y_ij
Rows:
ã_i^(t) = med{ e_ij^(t−1) : j ∈ {1, ..., J} }
m_b^(t) = med{ b_j^(t−1) : j ∈ {1, ..., J} }
d_ij^(t) = e_ij^(t−1) − ã_i^(t)
Columns:
b̃_j^(t) = med{ d_ij^(t) : i ∈ {1, ..., I} }
m_a^(t) = med{ a_i^(t−1) + ã_i^(t) : i ∈ {1, ..., I} }
e_ij^(t) = d_ij^(t) − b̃_j^(t)
Common value and effects:
m^(t) = m^(t−1) + m_a^(t) + m_b^(t)
a_i^(t) = a_i^(t−1) + ã_i^(t) − m_a^(t)
b_j^(t) = b_j^(t−1) + b̃_j^(t) − m_b^(t)
Robust Statistics p.92-93/98

Median polish After convergence, the remaining entries in the table correspond to the " ij. Median polish in R is implemented by the function medpolish(). Robust Statistics p.94/98
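For the voting table from the example above, a sketch of the call (medpolish() is in the base stats package and prints the sum of absolute residuals per iteration):

```r
votes <- matrix(c(200, 170,
                  200, 200,
                   45,  35,
                   25,  35,
                   20,  30,
                   22,   5,
                    8,   5),
                ncol = 2, byrow = TRUE,
                dimnames = list(c("SPD", "CDU/CSU", "Gruene", "FDP",
                                  "PDS", "Others", "No answer"),
                                c("female", "male")))
fit <- medpolish(votes)
fit$overall    # common value mu
fit$row        # row effects alpha_i (parties)
fit$col        # column effects beta_j (sex)
fit$residuals  # epsilon_ij
```

By construction the decomposition is exact: votes equals fit$overall + outer(fit$row, fit$col, "+") + fit$residuals.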

Summary
- Robust statistics allows for deviations from the ideal model assumption that the sample is not contaminated.
- Robust methods rely on the majority of the data.
- Few outliers can be disregarded or their influence is reduced.
Robust Statistics p.95/98

Key references
- F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, W.A. Stahel: Robust Statistics. The Approach Based on Influence Functions. Wiley, New York (1986)
- S. Heritier, E. Cantoni, S. Copt, M.-P. Victoria-Feser: Robust Methods in Biostatistics. Wiley, New York (2009)
- D.C. Hoaglin, F. Mosteller, J.W. Tukey: Understanding Robust and Exploratory Data Analysis. Wiley, New York (2000)
- P.J. Huber: Robust Statistics. Wiley, New York (2004)
Robust Statistics p.96/98

Key references
- R. Maronna, D. Martin, V. Yohai: Robust Statistics: Theory and Methods. Wiley, Toronto (2006)
- P.J. Rousseeuw, A.M. Leroy: Robust Regression and Outlier Detection. Wiley, New York (1987)
Robust Statistics p.97/98

Software
- R: http://www.r-project.org
- Library: MASS
- Library: robustbase
- Library: rrcov
Robust Statistics p.98/98