Robust regression. Jiří Franc
1 Robust regression Robust estimation of regression coefficients in the linear regression model. Jiří Franc, Czech Technical University, Faculty of Nuclear Sciences and Physical Engineering, Department of Mathematics
2 Outline 1 Introduction Regression analysis Classical regression estimators 2 Robust regression L-estimators M-estimators R-estimators S-estimators Least median of squares (LMS) Least trimmed squares (LTS) 3 Examples 4 Conclusion
3 Linear regression model
Definition. The multiple linear regression model is the model
Y_i = X_{i,1} β⁰_1 + X_{i,2} β⁰_2 + ... + X_{i,p} β⁰_p + e_i = X_i^T β⁰ + e_i,  i = 1, ..., n,
where
n is the sample size,
X_i = (X_{i,1}, X_{i,2}, ..., X_{i,p})^T are called the explanatory variables (regressors), a sequence of deterministic p-dimensional vectors or a sequence of random variables,
Y_i is called the response variable (dependent variable) and is the i-th element of the random sequence of observations,
β⁰ = (β⁰_1, β⁰_2, ..., β⁰_p)^T is the p-dimensional vector of true regression coefficients,
e_i is called the sequence of disturbances (error terms); it represents unexplained variation in the dependent variable and is a sequence of random variables.
4 Linear regression model
Let us rewrite the previous definition with n equations in matrix notation.
Definition.
( Y_1 )   ( X_{1,1} X_{1,2} ... X_{1,p} ) ( β⁰_1 )   ( e_1 )
( Y_2 ) = ( X_{2,1} X_{2,2} ... X_{2,p} ) ( β⁰_2 ) + ( e_2 )
( ... )   (  ...     ...   ...    ...   ) (  ... )   ( ... )
( Y_n )   ( X_{n,1} X_{n,2} ... X_{n,p} ) ( β⁰_p )   ( e_n )
Equivalently, Y = Xβ⁰ + e.
5 Classical regression estimators
Ordinary least squares method (OLS):
β̂_OLS = arg min_{β∈R^p} Σ_{i=1}^n (Y_i − X_i^T β)² = arg min_{β∈R^p} (Y − Xβ)^T (Y − Xβ),
β̂_OLS = (X^T X)^{−1} X^T Y.
Under certain conditions OLS is the best linear unbiased estimator of β⁰.
Under certain conditions OLS is the best estimator among all unbiased estimators (ordinary least squares is optimal for multiple regression when the iid errors are normally distributed).
OLS is not robust and consequently often gives false results for real data (even a single regression outlier can totally offset the OLS estimator).
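The normal-equation formula above and the sensitivity of OLS to a single outlier can be illustrated with a minimal numpy sketch (the slides' own examples use R; the data here are synthetic and the function names are ours):

```python
import numpy as np

# OLS via the normal equations: beta_hat = (X^T X)^{-1} X^T Y
def ols(X, Y):
    return np.linalg.solve(X.T @ X, X.T @ Y)

rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
X = np.column_stack([np.ones_like(x), x])        # intercept + one regressor
Y = 2.0 + 0.5 * x + rng.normal(0, 0.1, size=20)  # true model: beta = (2, 0.5)

beta_clean = ols(X, Y)

Y_out = Y.copy()
Y_out[-1] += 50.0                                # one outlier in y-direction
beta_out = ols(X, Y_out)

print(beta_clean, beta_out)                      # the slope shifts noticeably
```

A single contaminated response is enough to move the OLS slope far from the true value 0.5, which is exactly the non-robustness the slide describes.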
6 Motivation An example of outliers in y-direction and the OLS estimate. [Figure: Number of International Calls (in tens of millions) from Belgium vs. Year, with the OLS fit.]
7 Motivation An example of outliers in y-direction and the OLS estimate. [Figure: Y vs. X scatter with the OLS fit.]
8 Motivation An example of outliers in x-direction and the OLS estimate. [Figure: Y vs. X scatter with the OLS fit.]
9 Robust regression The main aims of robust statistics: description of the structure best fitting the bulk of the data; identification of deviating data points (outliers) or deviating substructures for further treatment; identification of highly influential data points (leverage points), or at least warning about them; dealing with unsuspected serial correlations.
10 Robust regression Two ways to deal with regression outliers:
Regression diagnostics: certain quantities are computed from the data with the purpose of pinpointing influential points, after which these outliers can be removed or corrected.
Robust regression: tries to devise estimators that are not so strongly affected by outliers.
11 L-estimators
α-trimmed estimator:
β̂_(Lte,α) = (1/(n − 2⌊nα⌋)) Σ_{i=⌊nα⌋+1}^{n−⌊nα⌋} Y_(i),
where (Y_(1), ..., Y_(n)) are the order statistics and α ∈ [0, 1/2) is a fixed trimming proportion.
α-regression quantile (Koenker, Bassett (1978)):
β̂_(Lrq,α) = arg min_{β∈R^p} Σ_{i=1}^n ρ_α(r_i),
where r_i(β) = Y_i − X_i^T β, i = 1, ..., n, and
ρ_α(r_i) = α·r_i if r_i ≥ 0,  (α − 1)·r_i if r_i < 0.
L-estimators are scale equivariant and regression equivariant. However, the breakdown point is still 0%, and for α = 0.5 the regression quantile estimator coincides with the least absolute values estimator.
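Both building blocks of this slide are easy to sketch in Python (the function names are ours; the slides themselves use R): the α-trimmed estimator drops ⌊nα⌋ order statistics at each end, and ρ_α is the asymmetric check function of Koenker and Bassett.

```python
import numpy as np

# alpha-trimmed (location) estimator: average of the middle order statistics
def trimmed_mean(Y, alpha):
    Y = np.sort(np.asarray(Y, dtype=float))   # Y_(1) <= ... <= Y_(n)
    n = len(Y)
    k = int(np.floor(n * alpha))              # trim k points at each end
    return Y[k:n - k].mean()

# Koenker-Bassett check function rho_alpha
def rho_alpha(r, alpha):
    r = np.asarray(r, dtype=float)
    return np.where(r >= 0, alpha * r, (alpha - 1) * r)

Y = [1.0, 2.0, 3.0, 4.0, 100.0]               # one gross outlier
print(trimmed_mean(Y, 0.2))                   # -> 3.0, the outlier is trimmed away
print(rho_alpha([-2.0, 2.0], 0.5))            # symmetric loss for alpha = 0.5
```

For α = 0.5 the check function penalizes positive and negative residuals equally, which is why the 0.5-regression quantile reduces to the least absolute values estimator.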
12 L-estimators
Least absolute values method:
β̂_L1 = arg min_{β∈R^p} Σ_{i=1}^n |Y_i − X_i^T β| = arg min_{β∈R^p} Σ_{i=1}^n |r_i(β)|.
Least maximum deviation method:
β̂_L∞ = arg min_{β∈R^p} max_i |r_i(β)|.
13 M-estimators
M-estimators are based on the idea of replacing the squared residuals used in OLS estimation by another function of the residuals:
β̂_(M) = arg min_{β∈R^p} Σ_{i=1}^n ρ(r_i(β)),
where ρ is a symmetric function with a unique minimum at zero. Differentiating this expression with respect to the regression coefficients yields
Σ_{i=1}^n ψ(r_i) X_i = 0,
where ψ is the derivative of ρ. The M-estimate is obtained by solving this system of p equations.
14 M-estimators
OLS and L1 are also M-estimators, with ψ(t) = t for OLS and ψ(t) = sgn(t) for the L1 estimate.
M-estimators are unfortunately not scale equivariant, even though they are regression equivariant. Hence one necessarily has to studentize the M-estimator by an estimate σ̂ of the scale of the disturbances:
β̂_(M) = arg min_{β∈R^p} Σ_{i=1}^n ρ(r_i(β)/σ̂).
One possibility is to use the median absolute deviation (MAD):
σ̂ = C · median_i |r_i − median_j(r_j)|,
where C is a constant (correction factor) that depends on the distribution. For normally distributed data C ≈ 1.4826.
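The MAD scale estimate from the slide is a one-liner; a sketch with an illustrative residual vector of our choosing shows why it is preferred to the classical standard deviation:

```python
import numpy as np

# MAD scale estimate; C = 1.4826 (approximately 1/Phi^{-1}(3/4)) makes it
# consistent for sigma under normally distributed data
def mad_scale(r, C=1.4826):
    r = np.asarray(r, dtype=float)
    return C * np.median(np.abs(r - np.median(r)))

r = np.array([-1.0, 0.0, 1.0, 2.0, 50.0])      # residuals with one outlier
print(mad_scale(r))                            # barely affected by the 50
print(np.std(r))                               # the classical estimate explodes
```

Because both medians are resistant to a minority of outliers, the MAD keeps its breakdown point of 50%, while the standard deviation is ruined by a single bad residual.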
15 M-estimators
Let the vectors (Y_i, X_i^T)^T be iid with distribution function F(Y, X). If the function ρ has an absolutely continuous derivative ψ, σ = 1 for simplicity, and T(F) is the functional corresponding to the M-estimator, then T(F) is the solution of
∫ ψ(Y − X^T T(F)) X dF(X, Y) = 0.
Define
M(ψ, F) := ∫ ψ'(Y − X^T T(F)) X X^T dF(X, Y).
Then the influence function of T at a distribution F (on R^p × R) is given by
IF(X_0, Y_0; T, F) = M^{−1}(ψ, F) X_0 ψ(Y_0 − X_0^T T(F)).
The influence function can be bounded with respect to Y_0 by the choice of ψ, but the influence function of M-estimators is unbounded with respect to X_0. The breakdown point of M-estimators is 0% due to the vulnerability to leverage points.
16 M-estimators
Maronna and Yohai (1981) showed, under certain conditions, that M-estimators are consistent and asymptotically normal with asymptotic covariance matrix
V(T, F) = E[IF(X, Y; T, F) IF^T(X, Y; T, F)] = ∫ IF(X, Y; T, F) IF^T(X, Y; T, F) dF(X, Y) = M(ψ, F)^{−1} Q(ψ, F) M(ψ, F)^{−1},
where
M(ψ, F) = ∫ ψ'(Y − X^T T(F)) X X^T dF(X, Y),
Q(ψ, F) = ∫ ψ²(Y − X^T T(F)) X X^T dF(X, Y).
If the disturbances are normally distributed (e_i ~ N(0, σ₀²)) and iid, the multivariate M-estimator has the asymptotic covariance matrix
V(ψ, Φ) = (E[X X^T])^{−1} σ² / e,
where e is the asymptotic efficiency defined by
e = (∫ ψ' dΦ)² / ∫ ψ² dΦ.
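The efficiency e can be evaluated numerically for any ψ. As a sketch (not from the slides), we take the Huber ψ with tuning constant b = 1.345, a common choice known to give roughly 95% efficiency at the normal model, and approximate both integrals by Riemann sums:

```python
import numpy as np

# asymptotic efficiency e = (int psi' dPhi)^2 / int psi^2 dPhi for Huber's psi
b = 1.345                                      # common tuning constant (our choice)
t = np.linspace(-10.0, 10.0, 200001)
dt = t[1] - t[0]
phi = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
psi = np.clip(t, -b, b)                        # Huber psi
dpsi = (np.abs(t) < b).astype(float)           # psi' (almost everywhere)

e = (np.sum(dpsi * phi) * dt) ** 2 / (np.sum(psi**2 * phi) * dt)
print(round(e, 3))                             # about 0.95
```

This is how tuning constants are chosen in practice: b trades robustness (smaller b) against efficiency at the normal model (larger b).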
17 M-estimators
Huber minimax M-estimator:
ψ(t) = t if |t| < b,  b·sgn(t) if |t| ≥ b,
where b is a constant.
Andrew M-estimator:
ψ(t) = sin(t) if −π ≤ t < π,  0 otherwise.
Tukey M-estimator:
ψ(t) = t·(1 − (t/c)²)² if |t| < c,  0 otherwise,
where c is a constant.
Descending minimax M-estimator:
ψ(t) = t if |t| < a,  b·sgn(t)·tanh[½ b (c − |t|)] if a ≤ |t| < c,  0 otherwise,
where a, b and c are constants.
Hampel M-estimator:
ψ(t) = t if |t| < a,  a·sgn(t) if a ≤ |t| < b,  a·sgn(t)·(c − |t|)/(c − b) if b ≤ |t| < c,  0 otherwise,
where a, b and c are constants.
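Three of these ψ functions are sketched below in vectorized Python (names and default tuning constants are ours); Huber is monotone, while Tukey and Hampel redescend to zero and so reject gross outliers completely:

```python
import numpy as np

def psi_huber(t, b=1.5):
    # linear in the middle, capped at +/- b
    return np.clip(t, -b, b)

def psi_tukey(t, c=3.0):
    # smooth redescending psi, identically 0 for |t| >= c
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) < c, t * (1 - (t / c) ** 2) ** 2, 0.0)

def psi_hampel(t, a=1.0, b=2.0, c=3.0):
    # piecewise linear: identity, flat, descending, then 0
    t = np.asarray(t, dtype=float)
    at = np.abs(t)
    return np.select(
        [at < a, at < b, at < c],
        [t, a * np.sign(t), a * np.sign(t) * (c - at) / (c - b)],
        default=0.0,
    )

print(psi_huber(np.array([-5.0, 0.5, 5.0])))       # [-1.5, 0.5, 1.5]
print(psi_tukey(np.array([0.0, 3.0, 10.0])))       # redescends to 0
print(psi_hampel(np.array([0.5, 1.5, 2.5, 4.0])))  # one value per branch
```

Evaluating these on a grid reproduces the shapes shown in the plots on the following slides.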
18–22 M-estimators [Figures: plots of ψ(t) for the Huber minimax (b = 1), Hampel (a = 1, b = 2, c = 3), descending minimax (a = 1, b = 1.2, c = 3), Andrew, and Tukey (c = 3) M-estimators.]
23 GM-estimators
Generalized M-estimators are introduced in order to bound the influence function of outlying X_i's by means of some weight function w:
β̂_(GM) = arg min_{β∈R^p} Σ_{i=1}^n w(X_i) ρ(r_i(β)/σ̂).
The definition can be rewritten as
Σ_{i=1}^n w(X_i) ψ(r_i/σ̂) X_i = 0.
Unfortunately, Maronna, Bustos and Yohai (1979) showed that the breakdown point of GM-estimators can be no better than a certain value that decreases as a function of p, where p is the number of regression coefficients.
24 GM-estimators
The algorithm of iteratively reweighted least squares with GM-estimates based on some ψ function is the following:
1. Compute the first elementary estimate β̂_(OLS) of β⁰.
2. Compute the residuals r_i(β̂) = Y_i − Ŷ_i = Y_i − X_i^T β̂, i = 1, ..., n.
3. Compute the estimate σ̂ of σ (e.g. MAD: σ̂ = median_i |r_i − median_j(r_j)|).
4. Compute the weights w_i (e.g. for Andrew's ψ function: w_i = ψ(r_i/σ̂) / (r_i/σ̂)).
5. Update the estimate β̂ by performing weighted least squares with the weights w_i: β̂_(WLS) = (X^T W X)^{−1} X^T W Y.
6. Go back to step 2 and iterate until convergence.
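The six steps above can be sketched directly in Python. This version (a sketch, not the slides' R code) uses Huber weights with b = 1.345 instead of Andrew's ψ, a fixed iteration count instead of a convergence test, and the MAD with its normal-consistency factor:

```python
import numpy as np

def irls_huber(X, Y, b=1.345, n_iter=50):
    beta = np.linalg.lstsq(X, Y, rcond=None)[0]               # step 1: OLS start
    for _ in range(n_iter):
        r = Y - X @ beta                                      # step 2: residuals
        sigma = 1.4826 * np.median(np.abs(r - np.median(r)))  # step 3: MAD scale
        u = r / sigma
        w = np.where(np.abs(u) < b, 1.0, b / np.abs(u))       # step 4: w = psi(u)/u
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)      # step 5: weighted LS
    return beta                                               # step 6: iterate

rng = np.random.default_rng(1)
x = np.arange(30, dtype=float)
X = np.column_stack([np.ones_like(x), x])
Y = 1.0 + 0.5 * x + rng.normal(0, 0.1, size=30)               # true beta = (1, 0.5)
Y[:3] += 30.0                                                 # three y-outliers

print(np.linalg.lstsq(X, Y, rcond=None)[0])                   # OLS: badly biased
print(irls_huber(X, Y))                                       # close to (1, 0.5)
```

The outliers receive weights far below 1 after the first reweighting, so the fit recovers the bulk of the data; against leverage points, however, this scheme inherits the limitations discussed on the previous slide.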
25 R-estimators
R-estimation is a procedure based on the ranks of the residuals. The idea of using rank statistics was extended to the domain of multiple regression by Jurečková (1971) and Jaeckel (1972).
Let R_i(β) be the rank of Y_i − X_i^T β, and further let a_n(i), i = 1, ..., n, be a nondecreasing scores function satisfying Σ_{i=1}^n a_n(i) = 0.
R-estimator:
β̂_(R,a_n) = arg min_{β∈R^p} Σ_{i=1}^n a_n(R_i(β)) (Y_i − X_i^T β) = arg min_{β∈R^p} Σ_{i=1}^n a_n(R_i) r_i.
26 R-estimators
Some possibilities for the scores function a_n(i) are:
Wilcoxon scores: a_n(i) = i/(n + 1) − 1/2.
Median scores: a_n(i) = sgn(i/(n + 1) − 1/2).
Van der Waerden scores: a_n(i) = Φ^{−1}(i/(n + 1)).
Bounded normal scores: a_n(i) = min{c, max{Φ^{−1}(i/(n + 1)), −c}}, where c is a parameter.
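These score functions are easily computed and checked against the centering condition Σ a_n(i) = 0; a small Python sketch for n = 10 (stdlib `NormalDist` supplies Φ^{−1}, and the bound c = 1 is our choice):

```python
import numpy as np
from statistics import NormalDist

n = 10
i = np.arange(1, n + 1)

wilcoxon = i / (n + 1) - 0.5                    # Wilcoxon scores
median_s = np.sign(i / (n + 1) - 0.5)           # median scores
vdw = np.array([NormalDist().inv_cdf(k / (n + 1)) for k in i])  # van der Waerden
bounded = np.clip(vdw, -1.0, 1.0)               # bounded normal scores, c = 1

# all four satisfy the centering condition sum_i a_n(i) = 0 (up to rounding)
print(wilcoxon.sum(), median_s.sum(), round(vdw.sum(), 12))
```

By symmetry of each score function around the middle rank, the sums vanish, as the definition of a_n requires.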
27 R-estimators
The most widely used R-estimator is the one with the Wilcoxon scores function, because the resulting R-estimator has a relatively simple structure.
R-estimators are automatically scale equivariant.
The breakdown point of R-estimators is still 0% due to the vulnerability to leverage points.
In the case of regression with an intercept, the constant term is estimated separately.
To compute an R-estimate, one has to find the minimum of the objective function by varying the unknown parameters.
28 S-estimators
S-estimators were introduced by Rousseeuw and Yohai (1984). They are derived from a scale statistic in an implicit way, and they are regression, scale and affine equivariant. S-estimators are defined by minimizing the dispersion of the residuals:
β̂_(S,c,K) = arg min_{β∈R^p} s(r_1(β), ..., r_n(β)) = arg min_{β∈R^p} ŝ(β),
where ŝ(β) is an estimator of scale defined as the solution of the equation
(1/n) Σ_{i=1}^n ρ(r_i(β)/ŝ(β)) = K.
29 S-estimators
The function ρ must satisfy the following conditions:
C1: ρ is a symmetric and continuously differentiable function with ρ(0) = 0.
C2: There exists a parameter c > 0 such that ρ is strictly increasing on [0, c] and constant on [c, ∞).
For an S-estimator with breakdown point of 50%, one extra condition has to be added:
C3: K/ρ(c) = 1/2.
An example: Tukey's biweight function
ρ(x) = x²/2 − x⁴/(2c²) + x⁶/(6c⁴) if |x| ≤ c,  c²/6 otherwise,
where c is the parameter from condition C2.
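For a fixed β, the implicit scale equation (1/n) Σ ρ(r_i/ŝ) = K can be solved by bisection, since the left side decreases in ŝ. A Python sketch with Tukey's biweight (function names are ours; c = 1.547 is the usual 50%-breakdown constant, an assumption here):

```python
import numpy as np

def rho_biweight(x, c):
    # Tukey's biweight rho from the slide
    x = np.asarray(x, dtype=float)
    body = x**2 / 2 - x**4 / (2 * c**2) + x**6 / (6 * c**4)
    return np.where(np.abs(x) <= c, body, c**2 / 6)

def m_scale(r, c=1.547, K=None):
    # solve (1/n) sum rho(r_i/s) = K for s by bisection
    r = np.asarray(r, dtype=float)
    K = c**2 / 12 if K is None else K            # K = rho(c)/2 -> 50% breakdown
    lo, hi = 1e-12, 10.0 * (np.max(np.abs(r)) + 1.0)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.mean(rho_biweight(r / mid, c)) > K:
            lo = mid                             # average rho too big -> s too small
        else:
            hi = mid
    return 0.5 * (lo + hi)

r = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 8.0])   # residuals with one outlier
print(m_scale(r))                                # a robust scale estimate near 1
```

Minimizing this ŝ(β) over β then gives the S-estimator itself; the outlier contributes only the bounded value ρ(c) = c²/6 to the defining equation.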
30 S-estimators
Under certain conditions, the behavior of β̂_(S) at the normal model, where (Y_i, x_i) are iid random variables satisfying Y_i = x_i^T β⁰ + e_i with e_i ~ N(0, σ²), is given by
√n (β̂_(S) − β⁰) → N(0, E[X^T X]^{−1} (∫ ψ² dΦ)/(∫ ψ' dΦ)²).
Consequently the asymptotic efficiency is defined by
e := (∫ ψ' dΦ)² / ∫ ψ² dΦ.
The same formulas were derived also for the M-estimator.
31 S-estimators
An example: the S-estimator corresponding to Tukey's function ρ with breakdown point of 50% is obtained from K = E_Φ[ρ] together with condition C3. Writing φ and Φ for the standard normal density and cdf,
E_Φ[ρ] = ∫ ρ(x) φ(x) dx = (c²/6)·2Φ(−c) + ∫_{−c}^{c} (x²/2 − x⁴/(2c²) + x⁶/(6c⁴)) φ(x) dx
= (c²/3) Φ(−c) + ½ [1 − 2Φ(−c) − 2c φ(c)] − (1/(2c²)) [3 − 6Φ(−c) − (2c³ + 6c) φ(c)] + (1/(6c⁴)) [15 − 30Φ(−c) − (2c⁵ + 10c³ + 30c) φ(c)],
and in general K = α·ρ(c) for breakdown point α ∈ (0, 0.5].
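The closed form above can be verified numerically (a sketch; the helper names are ours). With c = 1.547, the usual 50%-breakdown constant, the closed form, a direct numerical integration of ∫ρφ, and the target value ρ(c)/2 = c²/12 all agree:

```python
import numpy as np
from statistics import NormalDist

def e_rho_closed(c):
    # closed form of E_Phi[rho] via truncated normal moments
    Phi = NormalDist().cdf
    phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    I2 = 1 - 2 * Phi(-c) - 2 * c * phi(c)
    I4 = 3 * (1 - 2 * Phi(-c)) - 2 * (c**3 + 3 * c) * phi(c)
    I6 = 15 * (1 - 2 * Phi(-c)) - 2 * (c**5 + 5 * c**3 + 15 * c) * phi(c)
    return (c**2 / 3) * Phi(-c) + I2 / 2 - I4 / (2 * c**2) + I6 / (6 * c**4)

def e_rho_numeric(c):
    # brute-force Riemann sum of int rho(x) phi(x) dx
    x = np.linspace(-10.0, 10.0, 400001)
    rho = np.where(np.abs(x) <= c,
                   x**2 / 2 - x**4 / (2 * c**2) + x**6 / (6 * c**4),
                   c**2 / 6)
    phi = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    return np.sum(rho * phi) * (x[1] - x[0])

c = 1.547
print(e_rho_closed(c), e_rho_numeric(c), c**2 / 12)   # all approximately equal
```

The agreement E_Φ[ρ] ≈ ρ(c)/2 at c ≈ 1.547 is precisely condition C3, which is how this tuning constant is determined.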
32 S-estimators
The asymptotic efficiency e of the S-estimator corresponding to Tukey's function ρ for different values of the breakdown point ε (each ε determines the constants c and K):
ε: 50%   45%   40%   35%   30%   25%   20%   15%   10%
e: 28.7% 37.0% 42.6% 56.0% 66.1% 75.9% 84.7% 91.7% 96.6%
MM-estimators: high-breakdown and high-efficiency estimators, where the initial estimate is obtained with an S-estimator, and it is then improved with an M-estimator.
33 LMS-estimator
The LMS is probably the first really applicable 50% breakdown point estimator, introduced by Rousseeuw (1984). The idea of the LMS is to replace the sum operator by a median, which is very robust.
Least median of squares estimator:
β̂_(LMS) = arg min_{β∈R^p} med_i (r_i²(β)).
34 LMS-estimator
There always exists a solution for the LMS estimator.
The LMS estimator is regression equivariant, scale equivariant and affine equivariant.
If p > 1, then the breakdown point of the LMS method is ε := (⌊n/2⌋ − p + 2)/n.
The LMS is not asymptotically normal.
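The LMS objective is usually minimized by resampling: fit exact lines through random pairs of points and keep the one with the smallest median squared residual. A Python sketch for simple regression (data and names are ours; even with 40% contamination the LMS fit recovers the true line):

```python
import numpy as np

def lms_line(x, y, n_trials=500, seed=0):
    # evaluate med_i r_i^2 for exact fits through random pairs of points
    rng = np.random.default_rng(seed)
    best, best_obj = None, np.inf
    for _ in range(n_trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue
        b1 = (y[j] - y[i]) / (x[j] - x[i])        # line through the two points
        b0 = y[i] - b1 * x[i]
        obj = np.median((y - b0 - b1 * x) ** 2)   # the LMS objective
        if obj < best_obj:
            best, best_obj = (b0, b1), obj
    return best

x = np.arange(20, dtype=float)
y = 1.0 + 2.0 * x                                 # true line: intercept 1, slope 2
y[:8] = 40.0                                      # 40% of the data contaminated
print(lms_line(x, y))                             # recovers (1, 2)
```

Any line through two clean points leaves more than half of the squared residuals at zero here, so its median squared residual is zero and it beats every contaminated candidate.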
35 LTS-estimator
Least trimmed squares estimator:
β̂_(LTS) = arg min_{β∈R^p} Σ_{i=1}^h r²_(i)(β),
where r²_(1) ≤ ... ≤ r²_(n) are the ordered squared residuals.
36 LTS-estimator
There always exists a solution for the LTS-estimator.
The LTS estimator is regression equivariant, scale equivariant and affine equivariant.
If p > 1 and h = ⌊n/2⌋ + ⌊(p + 1)/2⌋, then the breakdown point of the LTS-estimator is ε := (⌊(n − p)/2⌋ + 1)/n.
The LTS has the same asymptotic efficiency at the normal distribution as the Huber-type M-estimator.
37 LTS-estimator
A drawback of the LTS is that the objective function requires sorting of the squared residuals, which takes O(n log n) operations.
Robust high breakdown point estimators like the LTS can be very sensitive to a very small change of the data or to the deletion of even one point from the data set (i.e. a small change of the data can cause a large change of the estimate).
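A crude LTS sketch in the spirit of the usual resampling algorithms (a simplification, not the FAST-LTS algorithm of the literature): start from exact fits through random pairs, refine each by "concentration" steps that refit least squares on the h points with smallest squared residuals, and keep the fit minimizing the trimmed sum:

```python
import numpy as np

def lts_line(x, y, h, n_trials=500, seed=0):
    X = np.column_stack([np.ones_like(x), x])
    rng = np.random.default_rng(seed)
    best, best_obj = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(len(x), size=2, replace=False)
        beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        for _ in range(2):                        # concentration steps
            r2 = (y - X @ beta) ** 2
            keep = np.argsort(r2)[:h]             # the h best-fitting points
            beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()   # sum of h smallest r^2
        if obj < best_obj:
            best, best_obj = beta, obj
    return best

rng = np.random.default_rng(2)
x = np.arange(30, dtype=float)
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=30)   # true beta = (1, 2)
y[:10] = 5.0                                      # a third of the data replaced
h = 30 // 2 + (2 + 1) // 2                        # h = [n/2] + [(p+1)/2] = 16
print(lts_line(x, y, h))                          # close to (1, 2)
```

Because the trimmed objective simply ignores the n − h largest squared residuals, the third of the data that was replaced has no influence on the final fit.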
38 Examples [Figure: Number of International Calls (in tens of millions) from Belgium vs. Year, with the OLS, M Hampel, M Huber, MM Tukey and LTS fits.]
39 Examples An example of outliers in y-direction. [Figure: Y vs. X scatter with the OLS, M Hampel, M Huber, MM Tukey and LTS fits.]
40 Examples An example of outliers in x-direction. [Figure: Y vs. X scatter with the OLS, M Hampel, M Huber, MM Tukey and LTS fits.]
41 Examples
Robust regression in R:
m.ols     <- lm(nocalls~year, data=datab)
library(MASS)
m.huber   <- rlm(nocalls~year, data=datab, psi = psi.huber)
m.hampel  <- rlm(nocalls~year, data=datab, psi = psi.hampel)
m.tukey   <- rlm(nocalls~year, data=datab, psi = psi.bisquare)
m.mm.bisq <- rlm(nocalls~year, data=datab, method = "MM")
library(robustbase)   # a package of basic robust statistics
m.lts     <- ltsReg(nocalls~year, data=datab)
library(robust)       # a package of robust methods
42 Conclusion and Discussion Thank you for your attention. The discussion is open.
43 References
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. J. Wiley, New York.
Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. J. Wiley, New York.
Jurečková, J. (2001). Robustní statistické metody. Karolinum, Prague.
Víšek, J. Á. (2000). Regression with high breakdown point. Robust 2000, 2001.
More informationRegression diagnostics
Regression diagnostics Kerby Shedden Department of Statistics, University of Michigan November 5, 018 1 / 6 Motivation When working with a linear model with design matrix X, the conventional linear model
More informationInference based on robust estimators Part 2
Inference based on robust estimators Part 2 Matias Salibian-Barrera 1 Department of Statistics University of British Columbia ECARES - Dec 2007 Matias Salibian-Barrera (UBC) Robust inference (2) ECARES
More informationRobust Variable Selection Through MAVE
Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse
More informationQuantitative Methods I: Regression diagnostics
Quantitative Methods I: Regression University College Dublin 10 December 2014 1 Assumptions and errors 2 3 4 Outline Assumptions and errors 1 Assumptions and errors 2 3 4 Assumptions: specification Linear
More informationEconometrics II - EXAM Answer each question in separate sheets in three hours
Econometrics II - EXAM Answer each question in separate sheets in three hours. Let u and u be jointly Gaussian and independent of z in all the equations. a Investigate the identification of the following
More informationL-momenty s rušivou regresí
L-momenty s rušivou regresí Jan Picek, Martin Schindler e-mail: jan.picek@tul.cz TECHNICKÁ UNIVERZITA V LIBERCI ROBUST 2016 J. Picek, M. Schindler, TUL L-momenty s rušivou regresí 1/26 Motivation 1 Development
More informationarxiv: v1 [math.st] 2 Aug 2007
The Annals of Statistics 2007, Vol. 35, No. 1, 13 40 DOI: 10.1214/009053606000000975 c Institute of Mathematical Statistics, 2007 arxiv:0708.0390v1 [math.st] 2 Aug 2007 ON THE MAXIMUM BIAS FUNCTIONS OF
More informationThe S-estimator of multivariate location and scatter in Stata
The Stata Journal (yyyy) vv, Number ii, pp. 1 9 The S-estimator of multivariate location and scatter in Stata Vincenzo Verardi University of Namur (FUNDP) Center for Research in the Economics of Development
More informationRobust Weighted LAD Regression
Robust Weighted LAD Regression Avi Giloni, Jeffrey S. Simonoff, and Bhaskar Sengupta February 25, 2005 Abstract The least squares linear regression estimator is well-known to be highly sensitive to unusual
More informationRobustness for dummies
Robustness for dummies CRED WP 2012/06 Vincenzo Verardi, Marjorie Gassner and Darwin Ugarte Center for Research in the Economics of Development University of Namur Robustness for dummies Vincenzo Verardi,
More informationTABLES AND FORMULAS FOR MOORE Basic Practice of Statistics
TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics Exploring Data: Distributions Look for overall pattern (shape, center, spread) and deviations (outliers). Mean (use a calculator): x = x 1 + x
More informationAvailable from Deakin Research Online:
This is the published version: Beliakov, Gleb and Yager, Ronald R. 2009, OWA operators in linear regression and detection of outliers, in AGOP 2009 : Proceedings of the Fifth International Summer School
More informationAlternative Biased Estimator Based on Least. Trimmed Squares for Handling Collinear. Leverage Data Points
International Journal of Contemporary Mathematical Sciences Vol. 13, 018, no. 4, 177-189 HIKARI Ltd, www.m-hikari.com https://doi.org/10.1988/ijcms.018.8616 Alternative Biased Estimator Based on Least
More informationRobust Statistics. Frank Klawonn
Robust Statistics Frank Klawonn f.klawonn@fh-wolfenbuettel.de, frank.klawonn@helmholtz-hzi.de Data Analysis and Pattern Recognition Lab Department of Computer Science University of Applied Sciences Braunschweig/Wolfenbüttel,
More informationMultivariate Autoregressive Time Series Using Schweppe Weighted Wilcoxon Estimates
Western Michigan University ScholarWorks at WMU Dissertations Graduate College 4-2014 Multivariate Autoregressive Time Series Using Schweppe Weighted Wilcoxon Estimates Jaime Burgos Western Michigan University,
More informationKANSAS STATE UNIVERSITY
ROBUST MIXTURES OF REGRESSION MODELS by XIUQIN BAI M.S., Kansas State University, USA, 2010 AN ABSTRACT OF A DISSERTATION submitted in partial fulfillment of the requirements for the degree DOCTOR OF PHILOSOPHY
More informationVARIABLE SELECTION IN ROBUST LINEAR MODELS
The Pennsylvania State University The Graduate School Eberly College of Science VARIABLE SELECTION IN ROBUST LINEAR MODELS A Thesis in Statistics by Bo Kai c 2008 Bo Kai Submitted in Partial Fulfillment
More informationOn consistency factors and efficiency of robust S-estimators
TEST DOI.7/s749-4-357-7 ORIGINAL PAPER On consistency factors and efficiency of robust S-estimators Marco Riani Andrea Cerioli Francesca Torti Received: 5 May 3 / Accepted: 6 January 4 Sociedad de Estadística
More informationRobust Estimation of Cronbach s Alpha
Robust Estimation of Cronbach s Alpha A. Christmann University of Dortmund, Fachbereich Statistik, 44421 Dortmund, Germany. S. Van Aelst Ghent University (UGENT), Department of Applied Mathematics and
More informationECON The Simple Regression Model
ECON 351 - The Simple Regression Model Maggie Jones 1 / 41 The Simple Regression Model Our starting point will be the simple regression model where we look at the relationship between two variables In
More informationCOMPARING ROBUST REGRESSION LINES ASSOCIATED WITH TWO DEPENDENT GROUPS WHEN THERE IS HETEROSCEDASTICITY
COMPARING ROBUST REGRESSION LINES ASSOCIATED WITH TWO DEPENDENT GROUPS WHEN THERE IS HETEROSCEDASTICITY Rand R. Wilcox Dept of Psychology University of Southern California Florence Clark Division of Occupational
More informationIDENTIFYING MULTIPLE OUTLIERS IN LINEAR REGRESSION : ROBUST FIT AND CLUSTERING APPROACH
SESSION X : THEORY OF DEFORMATION ANALYSIS II IDENTIFYING MULTIPLE OUTLIERS IN LINEAR REGRESSION : ROBUST FIT AND CLUSTERING APPROACH Robiah Adnan 2 Halim Setan 3 Mohd Nor Mohamad Faculty of Science, Universiti
More informationLinear Models 1. Isfahan University of Technology Fall Semester, 2014
Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationRegression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin
Regression Review Statistics 149 Spring 2006 Copyright c 2006 by Mark E. Irwin Matrix Approach to Regression Linear Model: Y i = β 0 + β 1 X i1 +... + β p X ip + ɛ i ; ɛ i iid N(0, σ 2 ), i = 1,..., n
More informationNonparametric Bootstrap Estimation for Implicitly Weighted Robust Regression
J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 78 85 CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, c 2017 J. Kalina, B. Peštová Nonparametric Bootstrap Estimation for Implicitly Weighted Robust
More informationEXTENDING PARTIAL LEAST SQUARES REGRESSION
EXTENDING PARTIAL LEAST SQUARES REGRESSION ATHANASSIOS KONDYLIS UNIVERSITY OF NEUCHÂTEL 1 Outline Multivariate Calibration in Chemometrics PLS regression (PLSR) and the PLS1 algorithm PLS1 from a statistical
More informationOPTIMAL B-ROBUST ESTIMATORS FOR THE PARAMETERS OF THE GENERALIZED HALF-NORMAL DISTRIBUTION
REVSTAT Statistical Journal Volume 15, Number 3, July 2017, 455 471 OPTIMAL B-ROBUST ESTIMATORS FOR THE PARAMETERS OF THE GENERALIZED HALF-NORMAL DISTRIBUTION Authors: Fatma Zehra Doğru Department of Econometrics,
More informationRobust estimators based on generalization of trimmed mean
Communications in Statistics - Simulation and Computation ISSN: 0361-0918 (Print) 153-4141 (Online) Journal homepage: http://www.tandfonline.com/loi/lssp0 Robust estimators based on generalization of trimmed
More informationReference: Davidson and MacKinnon Ch 2. In particular page
RNy, econ460 autumn 03 Lecture note Reference: Davidson and MacKinnon Ch. In particular page 57-8. Projection matrices The matrix M I X(X X) X () is often called the residual maker. That nickname is easy
More informationA NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL
Discussiones Mathematicae Probability and Statistics 36 206 43 5 doi:0.75/dmps.80 A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL Tadeusz Bednarski Wroclaw University e-mail: t.bednarski@prawo.uni.wroc.pl
More informationDetection of outliers in multivariate data:
1 Detection of outliers in multivariate data: a method based on clustering and robust estimators Carla M. Santos-Pereira 1 and Ana M. Pires 2 1 Universidade Portucalense Infante D. Henrique, Oporto, Portugal
More information