Robust Statistics Frank Klawonn f.klawonn@fh-wolfenbuettel.de, frank.klawonn@helmholtz-hzi.de Data Analysis and Pattern Recognition Lab Department of Computer Science University of Applied Sciences Braunschweig/Wolfenbüttel, Germany Bioinformatics & Statistics Helmholtz Centre for Infection Research Braunschweig, Germany Robust Statistics p.1/98
Outline
- Motivation: Mean or median
- What is robust statistics?
- M-estimators
- Robust regression
- Median polish
- Summary and references
Robust Statistics p.2/98
Motivation: Mean or median
- Imagine a small town with 20 thousand inhabitants.
- On average, each inhabitant has a capital of 10 thousand $.
- Assume a very rich man named Bill G., owning a capital of 20 billion $, decides to move to this town.
- After Bill G. has settled there, the inhabitants own an average capital of roughly one million $.
- And all but one inhabitant might own less capital than average.
Robust Statistics p.4/98
(Empirical) median
Let x_(1) ≤ ... ≤ x_(n) denote a sample in ascending order.
Definition. The (sample or empirical) median, denoted by ~x, is given by
  ~x = x_((n+1)/2)                    if n is odd,
  ~x = ( x_(n/2) + x_(n/2+1) ) / 2    if n is even.
Robust Statistics p.5/98
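The definition can be transcribed directly. This is a Python sketch (the name sample_median is mine; R's median() behaves the same way):

```python
def sample_median(xs):
    """Median per the definition: middle value for odd n,
    average of the two middle values for even n."""
    s = sorted(xs)          # x_(1) <= ... <= x_(n)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]          # x_((n+1)/2), 1-based index
    return (s[n // 2 - 1] + s[n // 2]) / 2  # (x_(n/2) + x_(n/2+1)) / 2

print(sample_median([3, 1, 2]))     # 2
print(sample_median([4, 1, 3, 2]))  # 2.5
```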
Motivation: Mean or median
A less extreme example: [figure]
Robust Statistics p.7/98
What is a good estimator?
Assume we want to estimate the expected value of a normal distribution from which a sample was generated. For a symmetric distribution like the normal distribution, the expected value and the median coincide. The median q_0.5 of a (continuous) probability distribution, represented by the random variable X, is the 50%-quantile, i.e.
  P(X ≤ q_0.5) = 0.5 = P(X ≥ q_0.5).
Robust Statistics p.10/98
What is a good estimator?
Classical statistics:
(a) The estimator should be correct on average (unbiased), at least for large sample sizes (asymptotically unbiased).
(b) The estimator should have a small variance (efficiency).
(c) With increasing sample size, the variance of the estimator should tend to zero.
(a) and (c) together guarantee consistency: with increasing sample size, the estimator converges in probability to the true value of the parameter to be estimated.
Robust Statistics p.11/98
What is a good estimator? Should we choose the mean or the median to estimate the expected value μ of our normal distribution? Both estimators are consistent. Robust Statistics p.12/98
Mean or median [figure: histograms of the mean and the median over repeated samples, n = 20]
Robust Statistics p.13/98
Mean or median [figure: histograms of the mean and the median over repeated samples, n = 100]
Robust Statistics p.14/98
Mean or median [figure: histograms of the mean without and with 5% noise, n = 20]
Robust Statistics p.15/98
Mean or median [figure: histograms of the mean without and with 5% noise, n = 100]
Robust Statistics p.16/98
Mean or median [figure: histograms of the median without and with 5% noise, n = 20]
Robust Statistics p.17/98
Mean or median [figure: histograms of the median without and with 5% noise, n = 100]
Robust Statistics p.18/98
Mean or median
- Under the ideal assumption that the data were sampled from a normal distribution, the mean is a more efficient estimator than the median.
- If a small fraction of the data is for some reason erroneous or generated by another distribution, the mean can even become a biased estimator and lose consistency.
- The median is largely unaffected if a small fraction of the data is corrupted.
Robust Statistics p.19/98
Robust statistics Hampel et al. (1986): In a broad informal sense, robust statistics is a body of knowledge, partly formalized into theories of robustness, relating to deviations from idealized assumptions in statistics. Robust Statistics p.20/98
Robust statistics
idealized assumption: The data are sampled from the (possibly multivariate) random variable X with cumulative distribution function F_X.
modified assumption: The data are sampled from a random variable with ε-contaminated cumulative distribution function
  F_ε = (1 − ε) F_X + ε F_outliers
- F_X: the assumed ideal model distribution
- ε: (small) probability for outliers
- F_outliers: unknown and unspecified distribution
Robust Statistics p.21/98
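A small simulation sketch of an ε-contaminated sample. The concrete choice N(50, 1) for F_outliers is my own assumption, purely for illustration (F_outliers is unspecified in the model):

```python
import random

random.seed(1)

def contaminated_sample(n, eps=0.05):
    """Draw from F_eps = (1 - eps) * N(0, 1) + eps * F_outliers,
    with N(50, 1) playing the role of the unknown outlier distribution."""
    return [random.gauss(50, 1) if random.random() < eps else random.gauss(0, 1)
            for _ in range(n)]

xs = sorted(contaminated_sample(1000))
mean = sum(xs) / len(xs)
median = (xs[499] + xs[500]) / 2
# The mean is dragged far away from 0; the median stays close to it.
print(round(mean, 2), round(median, 2))
```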
Estimators (Statistics)
Statistics is concerned with functionals t (or better t_n), called statistics, which are used for parameter estimation and other purposes. The mean
  t_n(X_1, ..., X_n) = x̄ = (1/n) Σ_{i=1}^n X_i,
the median, or the (empirical) variance
  t_n(X_1, ..., X_n) = s² = (1/(n−1)) Σ_{i=1}^n (X_i − x̄)²
are typical examples of estimators.
Robust Statistics p.22/98
Estimators (Statistics)
Two views of estimators:
- Applied to (finite) samples (x_1, ..., x_n), resulting in a concrete estimate (a realization of a random experiment consisting of the drawn sample).
- As random variables (applied to random variables). This enables us to investigate the (theoretical) properties of estimators. Samples are not needed for this purpose.
Robust Statistics p.23/98
Estimators (Statistics)
Assuming an infinite sample size, the limit in probability
  t(F_X) = lim_{n→∞} t_n(X_1, ..., X_n)
can be considered (in case it exists). t(F_X) is then again a random variable. For typical estimators, t(F_X) is a constant random variable, i.e. the limit converges (with probability 1) to a unique value.
Robust Statistics p.24/98
Fisher consistency
An estimator t is called Fisher consistent for a parameter θ of the probability distribution of X if
  t(F_X) = θ,
i.e. for large (infinite) sample sizes, the estimator converges with probability 1 to the true value of the parameter to be estimated.
Robust Statistics p.25/98
Empirical influence function
Given a sample (x_1, ..., x_n) and an estimator t_n(x_1, ..., x_n), what is the influence of a single observation x on t?
Empirical influence function:
  EIF(x) = t_{n+1}(x_1, ..., x_n, x)
Vary x between −∞ and +∞.
Robust Statistics p.26/98
Empirical influence function
Consider the (ordered) sample 0.4, 1.2, 1.4, 1.5, 1.7, 2.0, 2.9, 3.8, 3.8, 4.2:
  x̄ = 2.29,  med(x) = 1.85,  x̄_10% = 2.2875
The (α-)trimmed mean is the mean of the sample from which the lowest and highest 100α% of the values are removed. (For the mean: α = 0, for the median: α = 0.5.)
Robust Statistics p.27/98
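The three estimates for this sample can be reproduced directly. A Python sketch (trimmed_mean is my own helper; R offers mean(x, trim=alpha)):

```python
x = [0.4, 1.2, 1.4, 1.5, 1.7, 2.0, 2.9, 3.8, 3.8, 4.2]

def trimmed_mean(xs, alpha):
    """alpha-trimmed mean: drop the lowest and highest 100*alpha%
    of the sorted sample, then average what is left."""
    s = sorted(xs)
    k = int(alpha * len(s))
    kept = s[k:len(s) - k] if k > 0 else s
    return sum(kept) / len(kept)

s = sorted(x)
print(round(sum(x) / len(x), 4))        # 2.29   (the mean)
print(round((s[4] + s[5]) / 2, 4))      # 1.85   (the median)
print(round(trimmed_mean(x, 0.10), 4))  # 2.2875 (the 10%-trimmed mean)
```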
Empirical influence function [figure: EIF of mean(x), median(x) and trimmedmean(x) for the sample above, as the added observation varies from −10 to 10]
Robust Statistics p.28/98
Sensitivity curve
The (empirical) sensitivity curve is a normalized EIF (centred around 0 and scaled according to the sample size):
  SC(x) = ( t_{n+1}(x_1, ..., x_n, x) − t_n(x_1, ..., x_n) ) / (1/(n+1))
[figure: sensitivity curves of mean(x), median(x) and trimmedmean(x)]
Robust Statistics p.29/98
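The sensitivity curve can be computed for any estimator by plugging it into the formula. A sketch; for the mean, the curve works out algebraically to the straight line SC(x) = x − x̄, which is unbounded in x:

```python
def mean(v):
    return sum(v) / len(v)

def sensitivity_curve(t, xs, x):
    """SC(x) = (t_{n+1}(x_1,...,x_n,x) - t_n(x_1,...,x_n)) / (1/(n+1))."""
    n = len(xs)
    return (t(xs + [x]) - t(xs)) * (n + 1)

xs = [0.4, 1.2, 1.4, 1.5, 1.7, 2.0, 2.9, 3.8, 3.8, 4.2]
# For the mean the curve is the line x - mean(xs), hence unbounded:
print(sensitivity_curve(mean, xs, 10.0))   # ~ 10 - 2.29 = 7.71
print(sensitivity_curve(mean, xs, 100.0))  # ~ 100 - 2.29 = 97.71
```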
Influence function
The influence function corresponds to the sensitivity curve for large (infinite) sample sizes:
  IF(x; t; F) = lim_{n→∞} ( t( (1 − 1/n) F + (1/n) δ_x ) − t(F) ) / (1/n)
δ_x represents the (cumulative probability) distribution yielding the value x with probability 1. In this sense, the influence function measures what happens to the estimator under an infinitesimally small contamination for large sample sizes. Note that the influence function might not be defined if the limit does not exist.
Robust Statistics p.30/98
Gross-error sensitivity
The worst case (in terms of the outlier x) is called gross-error sensitivity:
  γ*(t; F) = sup_x { |IF(x; t; F)| }
If γ*(t; F) is finite, t is called a B-robust estimator (B stands for bias) (at F). For the arithmetic mean, we have γ*(x̄; F) = ∞. For the median and the trimmed mean, the gross-error sensitivity depends on the distribution F.
Robust Statistics p.31/98
Breakdown point The influence curve and the gross-error sensitivity characterise the influence of single (or even infinitesimal) outliers. A minimum requirement for robustness is that the influence curve is bounded. What happens when the fraction of outliers increases? Robust Statistics p.32/98
Breakdown point
The breakdown point is the smallest fraction of (extreme) outliers that need to be included in a sample in order to make the estimator break down completely, i.e. yield (almost) infinity. Let
  hd( (x_1, ..., x_n), (y_1, ..., y_n) ) = |{ i ∈ {1, ..., n} | x_i ≠ y_i }|
denote the Hamming distance between two samples (x_1, ..., x_n) and (y_1, ..., y_n).
Robust Statistics p.33/98
Breakdown point
The breakdown point of an estimator t is defined as
  ε*_n(t; x_1, ..., x_n) = (1/n) min{ m | sup{ |t(y_1, ..., y_n)| : hd((x_1, ..., x_n), (y_1, ..., y_n)) = m } = ∞ }.
Normally, ε*_n is independent of the specific choice of the sample (x_1, ..., x_n).
Robust Statistics p.34/98
Breakdown point
If ε*_n is independent of the sample, for large (infinite) sample sizes the breakdown point is defined as
  ε* = lim_{n→∞} ε*_n.
Examples:
- Arithmetic mean: ε* = 0%
- Median: ε* = 50%
- α-trimmed mean: ε* = α
Robust Statistics p.35/98
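The breakdown behaviour is easy to demonstrate numerically (a sketch): replacing a single observation by an arbitrarily extreme value ruins the mean but barely moves the median.

```python
def mean(v):
    return sum(v) / len(v)

def median(v):
    s = sorted(v)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

clean = list(range(1, 21))          # 20 well-behaved observations 1..20
bad = clean[:-1] + [1e12]           # replace one value by an extreme outlier

print(mean(bad))                    # explodes: one outlier is enough
print(median(clean), median(bad))   # 10.5 10.5 -- the median does not move
```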
Criteria for robust estimators
- Bounded influence function: single extreme outliers cannot do too much harm to the estimator.
- Low gross-error sensitivity.
- Positive breakdown point (the higher, the better): even a number of outliers can be tolerated without leading to nonsense estimates.
- Fisher consistency: for very large sample sizes the estimator will yield the correct value.
- High efficiency: the variance of the estimator should be as low as possible.
Robust Statistics p.36/98
Criteria for robust estimators There is no way to satisfy all criteria in the best way at the same time. There is a trade-off between robustness issues like positive breakdown point and low gross-error sensitivity on the one hand and efficiency on the other hand. As an example, compare the mean (high efficiency, breakdown point 0) and the median (lower efficiency, but very good breakdown point). Robust Statistics p.37/98
Robust measures of spread
The (empirical) variance suffers from the same problems as the mean. (The estimation of the variance usually includes an estimation of the mean.) An example of a more robust estimator of spread is the interquartile range, the difference between the 75%- and the 25%-quantile. (The q%-quantile is the value x in the sample for which q% of the values are smaller than x and (100 − q)% are larger than x.)
Robust Statistics p.38/98
Error measures
The expected value μ minimizes the error function
  E( (X − μ)² ).
Correspondingly, the arithmetic mean x̄ minimizes the error function
  Σ_{i=1}^n (x_i − x̄)².
Robust Statistics p.39/98
Error measures
The median q_0.5 minimizes the error function
  E( |X − q_0.5| ).
Correspondingly, the (sample) median ~x minimizes the error function
  Σ_{i=1}^n |x_i − ~x|.
Robust Statistics p.40/98
Error measures
This also explains why the median is less sensitive to outliers: the quadratic error underlying the mean penalizes outliers much more strongly than the absolute error does. Therefore, extreme outliers have a higher influence ("pull more strongly") than other points.
Robust Statistics p.41/98
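A brute-force check of the two minimization claims (a grid-search sketch over the sample from the earlier slide). For an even sample size, every value between the two middle order statistics minimizes the absolute error, the median 1.85 among them:

```python
x = [0.4, 1.2, 1.4, 1.5, 1.7, 2.0, 2.9, 3.8, 3.8, 4.2]

def sq_err(c):
    return sum((xi - c) ** 2 for xi in x)

def abs_err(c):
    return sum(abs(xi - c) for xi in x)

candidates = [i / 1000 for i in range(0, 5001)]  # grid on [0, 5], step 0.001
best_sq = min(candidates, key=sq_err)
best_abs = min(candidates, key=abs_err)

print(best_sq)   # ~2.29: the mean minimizes the squared error
print(best_abs)  # some value in [1.7, 2.0]: minimizers of the absolute error
```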
Error measures
How to measure errors? The error of an estimate θ̂, including the sign, is
  e_i = x_i − θ̂.
Minimizing Σ_{i=1}^n e_i does not make sense:
- Usually inf_θ̂ Σ_{i=1}^n e_i = −∞.
- Even if we require Σ_{i=1}^n e_i ≥ 0, a small value of Σ_{i=1}^n e_i does not mean that the errors e_i are small. There might be large positive and large negative errors that balance each other.
Robust Statistics p.42/98
Error measures
Therefore, we need a modified error ρ(e). Which properties should the function ρ: R → R have?
- ρ(e) ≥ 0
- ρ(0) = 0
- ρ(e) = ρ(−e)
- ρ(e_i) ≤ ρ(e_j) if |e_i| ≤ |e_j|
Robust Statistics p.43/98
Error measures
Possible choices for ρ:
- ρ(e) = e²
- ρ(e) = |e|
- ...?
Advantage of ρ(e) = e²: in order to minimize Σ_{i=1}^n ρ(e_i), we can take derivatives. This does not work for ρ(e) = |e|, since the function f(x) = |x| is not differentiable (at 0).
Robust Statistics p.44/98
Error measures
Which other options do we have for ρ? The quadratic error is obviously not a good choice when we seek robustness. Consider the more general setting of linear models of the form
  y_i = β_0 + β_1 x_i1 + ... + β_k x_ik + ε_i = x_i^T β + ε_i.
This also covers the special case of estimators for location:
  y_i = β_0 + ε_i
Robust Statistics p.45/98
Linear regression
linear model:       y_i = α + β_1 x_i1 + ... + β_k x_ik + ε_i = x_i^T β + ε_i
computed model:     y_i = a + b_1 x_i1 + ... + b_k x_ik + e_i = x_i^T b + e_i
objective function: Σ_{i=1}^n ρ(e_i) = Σ_{i=1}^n ρ(y_i − x_i^T b)
Robust Statistics p.46/98
Least squares regression
Computing derivatives of
  (1/2) Σ_{i=1}^n e_i² = (1/2) Σ_{i=1}^n (y_i − x_i^T b)²
(the constant factor 1/2 does not change the optimisation problem) leads to
  Σ_{i=1}^n (y_i − x_i^T b) x_i^T = 0.
The solution of this system of linear equations is straightforward and can be found in any textbook.
Robust Statistics p.47/98
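As a minimal sketch (my own Python, not from the slides): for a single regressor, the system above reduces to the textbook closed form for slope and intercept.

```python
def lsq_fit(x, y):
    """Least squares for y ~ b0 + b1*x, i.e. the solution of the
    normal equations specialised to one regressor plus intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]   # exactly y = 1 + 2x
print(lsq_fit(x, y))            # (1.0, 2.0)
```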
Statistics tool R Open source software: http://www.r-project.org R uses a type-free command language. Assignments are written in the form > x <- y y is assigned to x. The object y must be defined (generated), before it can be assigned to x. Declaration of x is not required. Robust Statistics p.48/98
R: Reading a file
> mydata <- read.table(file.choose(), header=T)
opens a file chooser. The chosen file is assigned to the object named mydata. header=T means that the chosen file contains a header: the first line of the file contains the names of the variables. The following lines contain the values (tab- or space-separated).
Robust Statistics p.49/98
R: Accessing a single variable > vn <- mydata$varname assigns the column named varname of the data set contained in the object mydata to the object vn. The command > print(vn) prints the corresponding column on the screen. Robust Statistics p.50/98
R: Printing on the screen [1] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2... 0.1 0.1 0.2 0.4 0.4 0.3 [19] 0.3 0.3 0.2 0.4 0.2 0.5 0.2 0.2... 0.2 0.4 0.1 0.2 0.1 0.2 [37] 0.2 0.1 0.2 0.2 0.3 0.3 0.2 0.6... 0.2 0.2 1.4 1.5 1.5 1.3 [55] 1.5 1.3 1.6 1.0 1.3 1.4 1.0 1.5... 1.5 1.0 1.5 1.1 1.8 1.3 [73] 1.5 1.2 1.3 1.4 1.4 1.7 1.5 1.0... 1.5 1.6 1.5 1.3 1.3 1.3 [91] 1.2 1.4 1.2 1.0 1.3 1.2 1.3 1.3... 2.1 1.8 2.2 2.1 1.7 1.8 [109] 1.8 2.5 2.0 1.9 2.1 2.0 2.4 2.3... 2.3 2.0 2.0 1.8 2.1 1.8 [127] 1.8 1.8 2.1 1.6 1.9 2.0 2.2 1.5... 1.8 2.1 2.4 2.3 1.9 2.3 [145] 2.5 2.3 1.9 2.0 2.3 1.8 Robust Statistics p.51/98
R: Empirical mean & median The mean and median can be computed using R by the functions mean() and median(), respectively. > mean(vn) [1] 1.198667 > median(vn) [1] 1.3 The mean and median can also be applied to data objects consisting of more than one (numerical) column, yielding a vector of mean/median values. Robust Statistics p.52/98
R: Empirical variance The function var() yields the empirical variance in R. > var(vn) [1] 0.5824143 The function sd() yields the empirical standard deviation. > sd(vn) [1] 0.7631607 Robust Statistics p.53/98
R: min and max The functions min() and max() compute the minimum and the maximum in a data set. > min(vn) [1] 0.1 > max(vn) [1] 2.5 The function IQR() yields the interquartile range. Robust Statistics p.54/98
Least squares regression
> reg.lsq <- lm(y ~ x)
> summary(reg.lsq)
Call: lm(formula = y ~ x)
Residuals:
     Min       1Q   Median       3Q      Max
-4.76528 -2.57376  0.06554  2.27587  4.33301
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.3149     1.2596  -2.632   0.0273 *
x             1.2085     0.9622   1.256   0.2408
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.372 on 9 degrees of freedom
Multiple R-Squared: 0.1491, Adjusted R-squared: 0.05458
F-statistic: 1.577 on 1 and 9 DF, p-value: 0.2408
Robust Statistics p.55/98
Least squares regression
> plot(x,y)
> abline(reg.lsq)
[figure: scatter plot of (x, y) with the fitted regression line]
Robust Statistics p.56/98
Least squares regression
> plot(y-predict.lm(reg.lsq))
[figure: residuals plotted against the observation index]
Robust Statistics p.57/98
Least squares regression
> plot(x,y-predict.lm(reg.lsq))
[figure: residuals plotted against x]
Robust Statistics p.58/98
Least squares regression
> reg.lsq <- lm(y ~ x)
> summary(reg.lsq)
Call: lm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max
-1.2437 -0.9049 -0.6414 -0.3554  6.6398
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.75680    0.33288   2.273   0.0248 *
x            0.09406    0.05666   1.660   0.0995 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.841 on 120 degrees of freedom
Multiple R-Squared: 0.02245, Adjusted R-squared: 0.0143
F-statistic: 2.756 on 1 and 120 DF, p-value: 0.0995
Robust Statistics p.59/98
Least squares regression
> plot(x,y)
> abline(reg.lsq)
[figure: scatter plot of (x, y) with the fitted regression line]
Robust Statistics p.60/98
Least squares regression
> plot(y-predict.lm(reg.lsq))
[figure: residuals plotted against the observation index]
Robust Statistics p.61/98
M-estimators
Define ψ = ρ', w(e) = ψ(e)/e and w_i = w(e_i). Computing derivatives of
  Σ_{i=1}^n ρ(e_i) = Σ_{i=1}^n ρ(y_i − x_i^T b)
leads to
  Σ_{i=1}^n ψ(y_i − x_i^T b) x_i^T = Σ_{i=1}^n w_i (y_i − x_i^T b) x_i^T = 0.
The solution of this system of equations is the same as for the weighted least squares problem
  Σ_{i=1}^n w_i e_i².
Robust Statistics p.62/98
M-estimators
Problem:
- the weights w_i depend on the errors e_i, and
- the errors e_i depend on the weights w_i.
Solution strategy, alternating optimisation:
1. Initialise with standard least squares regression.
2. Compute the weights.
3. Apply standard least squares regression with the computed weights.
4. Repeat 2. and 3. until convergence.
Robust Statistics p.63/98
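The alternating scheme above can be sketched for a single regressor. This is a hedged illustration, not R's rlm() implementation: the tuning constant k is fixed and no robust scale estimate is used (rlm() additionally rescales the residuals), and the names wlsq_fit, huber_w and irls are mine.

```python
def wlsq_fit(x, y, w):
    """Weighted least squares for y ~ b0 + b1*x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = (sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)))
    return my - b1 * mx, b1                    # (b0, b1)

def huber_w(e, k=1.5):
    """Huber weight w(e) = psi(e)/e: 1 on [-k, k], decaying outside."""
    return 1.0 if abs(e) <= k else k / abs(e)

def irls(x, y, k=1.5, iters=30):
    b0, b1 = wlsq_fit(x, y, [1.0] * len(x))    # 1. start from least squares
    for _ in range(iters):
        w = [huber_w(yi - (b0 + b1 * xi), k)   # 2. weights from residuals
             for xi, yi in zip(x, y)]
        b0, b1 = wlsq_fit(x, y, w)             # 3. weighted least squares
    return b0, b1                              # 4. repeated until (near) convergence

x = [float(i) for i in range(20)]
y = [1.0 + 2.0 * xi for xi in x]
y[0] += 50.0                                   # one gross outlier
print(wlsq_fit(x, y, [1.0] * 20))              # plain least squares: visibly biased
print(irls(x, y))                              # much closer to the true (1, 2)
```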
Robust regression
Method          ρ(e)
least squares   e²
Huber           (1/2) e²                      if |e| ≤ k
                k|e| − (1/2) k²               if |e| > k
Tukey           (k²/6) (1 − (1 − (e/k)²)³)    if |e| ≤ k
                k²/6                          if |e| > k
Robust Statistics p.64/98
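The three ρ functions from the table, transcribed to Python (the tuning constants k = 1.5 and k = 4.0 are illustrative choices of mine):

```python
def rho_ls(e):
    return e ** 2

def rho_huber(e, k=1.5):
    """Quadratic near 0, linear beyond k: outliers are penalized less."""
    if abs(e) <= k:
        return 0.5 * e * e
    return k * abs(e) - 0.5 * k * k

def rho_tukey(e, k=4.0):
    """Tukey's biweight: constant beyond k, so extreme
    outliers stop contributing to the objective entirely."""
    if abs(e) <= k:
        return (k * k / 6) * (1 - (1 - (e / k) ** 2) ** 3)
    return k * k / 6

print(rho_ls(10))             # 100: quadratic growth
print(rho_huber(10, k=1.5))   # 13.875: only linear growth
print(rho_tukey(10, k=4.0))   # capped at k^2/6
```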
M-estimators: Least squares [figure: ρ(e) = e² and the constant weight function w(e) = 1]
Robust Statistics p.65/98
M-estimators: Huber [figure: ρ(e), quadratic near 0 and linear beyond k, and the weight function w(e), equal to 1 on [−k, k] and decaying outside]
Robust Statistics p.66/98
M-estimators: Tukey [figure: ρ(e), constant beyond k, and the weight function w(e), vanishing outside [−k, k]]
Robust Statistics p.67/98
Robust regression with R
At least the package MASS will be required. Packages can be downloaded and installed directly in R from the Internet. Once a package is installed, it can be loaded by
> library(packagename)
Robust Statistics p.68/98
Robust regression (Huber)
> reg.rob <- rlm(y ~ x)
> summary(reg.rob)
Call: rlm(formula = y ~ x)
Residuals:
     Min       1Q   Median       3Q      Max
-4.76528 -2.57376  0.06554  2.27587  4.33301
Coefficients:
            Value   Std. Error t value
(Intercept) -3.3149  1.2596    -2.6316
x            1.2085  0.9622     1.2559
Residual standard error: 3.678 on 9 degrees of freedom
Correlation of Coefficients:
  (Intercept)
x -0.5903
Robust Statistics p.69/98
Robust regression (Huber)
> plot(x,y)
> abline(reg.rob)
[figure: scatter plot of (x, y) with the robustly fitted line]
Robust Statistics p.70/98
Robust regression (Huber)
> plot(y-predict.lm(reg.rob))
[figure: residuals plotted against the observation index]
Robust Statistics p.71/98
Robust regression (Huber)
> plot(reg.rob$w)
[figure: the weights w_i plotted against the observation index]
Robust Statistics p.72/98
Robust regression (Tukey)
> reg.rob <- rlm(y ~ x, method="MM")
> summary(reg.rob)
Call: rlm(formula = y ~ x, method = "MM")
Residuals:
    Min      1Q  Median      3Q     Max
-0.7199 -0.2407  0.1070  0.3573 40.4858
Coefficients:
            Value   Std. Error t value
(Intercept)  1.2250  0.1819     6.7342
x           -9.4277  0.1390   -67.8441
Residual standard error: 0.5866 on 9 degrees of freedom
Correlation of Coefficients:
  (Intercept)
x -0.5903
Robust Statistics p.73/98
Robust regression (Tukey)
> plot(x,y)
> abline(reg.rob)
[figure: scatter plot of (x, y) with the robustly fitted line]
Robust Statistics p.74/98
Robust regression (Tukey)
> plot(y-predict.lm(reg.rob))
[figure: residuals plotted against the observation index]
Robust Statistics p.75/98
Robust regression (Tukey)
> plot(reg.rob$w)
[figure: the weights w_i plotted against the observation index]
Robust Statistics p.76/98
Robust regression (Huber)
> reg.rob <- rlm(y ~ x)
> summary(reg.rob)
Call: rlm(formula = y ~ x)
Residuals:
     Min       1Q   Median       3Q      Max
-0.65231 -0.29731 -0.02757  0.26270  7.23700
Coefficients:
            Value  Std. Error t value
(Intercept) 0.1821 0.0797     2.2842
x           0.0876 0.0136     6.4581
Residual standard error: 0.4137 on 120 degrees of freedom
Correlation of Coefficients:
  (Intercept)
x -0.8657
Robust Statistics p.77/98
Robust regression (Huber)
> plot(x,y)
> abline(reg.rob)
[figure: scatter plot of (x, y) with the robustly fitted line]
Robust Statistics p.78/98
Robust regression (Huber)
> plot(y-predict.lm(reg.rob))
[figure: residuals plotted against the observation index]
Robust Statistics p.79/98
Robust regression (Huber)
> plot(reg.rob$w)
[figure: the weights w_i plotted against the observation index]
Robust Statistics p.80/98
Robust regression (Tukey)
> reg.rob <- rlm(y ~ x, method="MM")
> summary(reg.rob)
Call: rlm(formula = y ~ x, method = "MM")
Residuals:
     Min       1Q   Median       3Q      Max
-0.56126 -0.18056  0.08183  0.35255  7.33345
Coefficients:
            Value  Std. Error t value
(Intercept) 0.1066 0.0592     1.8005
x           0.0816 0.0101     8.0978
Residual standard error: 0.3781 on 120 degrees of freedom
Correlation of Coefficients:
  (Intercept)
x -0.8657
Robust Statistics p.81/98
Robust regression (Tukey)
> plot(x,y)
> abline(reg.rob)
[figure: scatter plot of (x, y) with the robustly fitted line]
Robust Statistics p.82/98
Robust regression (Tukey)
> plot(y-predict.lm(reg.rob))
[figure: residuals plotted against the observation index]
Robust Statistics p.83/98
Robust regression (Tukey)
> plot(reg.rob$w)
[figure: the weights w_i plotted against the observation index]
Robust Statistics p.84/98
Robust regression with R After plotting the weights by > plot(reg.rob$w) clicking single points can be enabled by > identify(1:length(reg.rob$w), reg.rob$w) in order to get the indices of interesting weights. Robust Statistics p.85/98
Multivariate regression
For simple linear regression y ≈ ax + b, plotting the data often helps to identify problems and outliers. This is no longer possible for multivariate regression
  y_i = α + β_1 x_i1 + ... + β_k x_ik + ε_i = x_i^T β + ε_i.
Here, methods like residual plots and residual analysis are possible ways to gain more insight into outliers and other problems. In R, simply write for instance rlm(y ~ x1 + x2 + x3).
Robust Statistics p.86/98
Two-way tables
Example. 1000 people were asked which political party they voted for, in order to find out whether the choice of the party and the sex of the voter are independent.
pol. party \ sex   female   male    sum
SPD                   200    170    370
CDU/CSU               200    200    400
Grüne                  45     35     80
FDP                    25     35     60
PDS                    20     30     50
Others                 22      5     27
No answer               8      5     13
sum                   520    480   1000
Robust Statistics p.87/98
Two-way tables
In such contexts, typically statistical tests are applied, like
- the χ²-test (for independence, homogeneity),
- Fisher's exact test (for 2×2 tables),
- the Kruskal-Wallis test,
- MANOVA,
- ...
Robust Statistics p.88/98
Two-way tables
These tests are not robust, have very restrictive assumptions (MANOVA, Fisher's exact test), and the χ²-test is only an asymptotic test.
Alternative: median polish
Robust Statistics p.89/98
Median polish
Underlying (additive) model:
  y_ij = μ + α_i + β_j + ε_ij
- μ: overall typical value (general level)
- α_i: row effect (here: the political party)
- β_j: column effect (here: the sex)
- ε_ij: noise or random fluctuation
Robust Statistics p.90/98
Median polish Algorithm: 1. Subtract for each row its median. 2. For the updated table, subtract from each column its median. 3. Repeat 1. and 2. (with the corresponding updated tables) until convergence. Robust Statistics p.91/98
Median polish
Iterative estimation of the parameters. Initialisation:
  m^(0) = 0,  a_i^(0) = 0,  b_j^(0) = 0,  e_ij^(0) = y_ij
Rows (step t): sweep the row medians out of the residuals and into the row effects,
  r_i^(t) = med{ e_ij^(t−1) | j ∈ {1, ..., J} },   d_ij^(t) = e_ij^(t−1) − r_i^(t),   a_i^(t) = a_i^(t−1) + r_i^(t),
and move the median of the column effects into the common value:
  m_b^(t) = med{ b_j^(t−1) | j ∈ {1, ..., J} },   b_j^(t) = b_j^(t−1) − m_b^(t)
Robust Statistics p.92/98
Median polish
Columns (step t): sweep the column medians out of the residuals and into the column effects,
  c_j^(t) = med{ d_ij^(t) | i ∈ {1, ..., I} },   e_ij^(t) = d_ij^(t) − c_j^(t),   b_j^(t) ← b_j^(t) + c_j^(t),
and move the median of the row effects into the common value:
  m_a^(t) = med{ a_i^(t) | i ∈ {1, ..., I} },   a_i^(t) ← a_i^(t) − m_a^(t)
Common value:
  m^(t) = m^(t−1) + m_a^(t) + m_b^(t)
Robust Statistics p.93/98
Median polish
After convergence, the remaining entries in the table correspond to the ε_ij. Median polish in R is implemented by the function medpolish().
Robust Statistics p.94/98
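A minimal median polish sketch in Python, applied to the voting table (my own transcription of the alternating row/column sweeps; medpolish() in R is the reference implementation). By construction, m + a_i + b_j + e_ij reconstructs y_ij at every step:

```python
def median(v):
    s = sorted(v)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def median_polish(table, iters=10):
    """Fit y_ij = m + a_i + b_j + e_ij by alternately sweeping
    row and column medians out of the residuals."""
    I, J = len(table), len(table[0])
    e = [row[:] for row in table]          # residuals, initially the data
    m, a, b = 0.0, [0.0] * I, [0.0] * J
    for _ in range(iters):
        for i in range(I):                 # 1. subtract row medians
            d = median(e[i]); a[i] += d
            e[i] = [v - d for v in e[i]]
        db = median(b); m += db
        b = [v - db for v in b]
        for j in range(J):                 # 2. subtract column medians
            d = median([e[i][j] for i in range(I)]); b[j] += d
            for i in range(I):
                e[i][j] -= d
        da = median(a); m += da
        a = [v - da for v in a]
    return m, a, b, e

y = [[200, 170], [200, 200], [45, 35], [25, 35], [20, 30], [22, 5], [8, 5]]
m, a, b, e = median_polish(y)
print(round(m, 2), [round(v, 2) for v in b])  # overall level, female/male effects
```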
Summary
- Robust statistics allows for deviations from the idealized model assumption that the sample is uncontaminated.
- Robust methods rely on the majority of the data.
- Few outliers can be disregarded, or their influence is reduced.
Robust Statistics p.95/98
Key references
- F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, W.A. Stahel: Robust Statistics. The Approach Based on Influence Functions. Wiley, New York (1986)
- S. Heritier, E. Cantoni, S. Copt, M.-P. Victoria-Feser: Robust Methods in Biostatistics. Wiley, New York (2009)
- D.C. Hoaglin, F. Mosteller, J.W. Tukey: Understanding Robust and Exploratory Data Analysis. Wiley, New York (2000)
- P.J. Huber: Robust Statistics. Wiley, New York (2004)
Robust Statistics p.96/98
Key references
- R. Maronna, D. Martin, V. Yohai: Robust Statistics: Theory and Methods. Wiley, Toronto (2006)
- P.J. Rousseeuw, A.M. Leroy: Robust Regression and Outlier Detection. Wiley, New York (1987)
Robust Statistics p.97/98
Software
- R: http://www.r-project.org
- Library: MASS
- Library: robustbase
- Library: rrcov
Robust Statistics p.98/98