Robust regression in R. Eva Cantoni

Size: px

Start display at page:

Download "Robust regression in R. Eva Cantoni"

Rosa Wilkerson
5 years ago
Views:

1 Robust regression in R Eva Cantoni Research Center for Statistics and Geneva School of Economics and Management, University of Geneva, Switzerland April 4th, 2017

2 1 Robust statistics philosopy 2 Robust regression 3 R ressources 4 Examples 5 Bibliography

3 Against what is robust statistics robust? Robust Statistics aims at producing consistent and possibly efficient estimators and test statistics with stable level when the model is slightly misspecified. Model misspecification encompasses a relatively large set of possibilities, and robust statistics cannot deal with all types of model misspecifications. By slight model misspecification, we suppose that the data generating process lies in a neighborhood of the true (postulated) model, the one that is considered as useful for the problem under investigation.

4 Against what is robust statistics robust? This neighborhood is formalized as F ε = (1 ε)f θ + εg, (1) F θ is the postulated model, θ is a set of parameters of interest, G is an arbitrary distribution and 0 ε 1 captures the amount of model misspecification

5 Against what is robust statistics robust? Inference Classical G 0 << ε < 1 0 < ε << 1 ε = 0 arbitrary?? F θ G = z F ε F ε F θ G = F (θ,θ ) F ε F ε F θ G such that F ε = F (θ,θ ) F (θ,θ ) F (θ,θ ) F θ Robust arbitrary? F θ F θ G = z F ε F θ F θ G = F (θ,θ ) F ε F θ F θ G such that F ε = F (θ,θ ) F (θ,θ ) F θ F θ Table: Models at which inference can be at best done.

6 Robustness measures Robust estimators protect against: bias under contamination breakdown point They imply a trade-off between efficiency and robustness!

7 Linear model and classical estimation For i = 1,..., n consider with ɛ i N (0, σ 2 ). y i = x T i β + ɛ i, The maximum likelihood estimator ˆβ ML minimizes ( n y i xi T σ i=1 ˆβ ML ) 2 = n i=1 r 2 i, or, alternatively, solves ( n y i xi T σ i=1 ˆβ ML ) x i = n r i x i = 0. i=1

8 Outliers x y Classification des points aberrants Point vertical Bon point levier Mauvais point levier

9 Robustness vs diagnostic Masking... ML residuals Index resid(olsmultiple)

10 Robustness vs diagnostic Masking... Data and ML fit x y

11 M and GM-estimation The M-estimator ˆβ M solves ) n (y i xi T ˆβ M ψ c x i = σ i=1 n ψ c (r i )x i = i=1 n w c (r i ) r i x i = 0, i=1 where w c (r i ) = ψ c (r i )/r i, or minimizes ) n (y i xi T ˆβ M ρ c. σ i=1 The Mallows GM-estimator ˆβ GM is an alternative that solves ) n (y i xi T ˆβ GM n ψ c w(x i )x i = w(r i )w(x i )r i x i = 0. σ i=1 i=1

12 ψ c and ρ c functions Huber Tukey or bisquare psi_c(r) psi_c(r) r r ρ H (u) ρ B (u) u u

13 Comparison x y OLS M Huber GM Huber M Tukey

14 S-estimation The S-estimator ˆβ S is an alternative that minimizes ) n (y i xi T ˆβ S ρ c, s i=1 where s is a scale M-estimator that solves 1 n ( ρ (1) yi x T ) i β c = b. n s, i=1

15 MM-estimation The MM-estimator is a two-step estimator constructed as follow: 1. Let s n be the scale estimate from an initial S-estimator. 2. With ρ (2) c ( ) ρ (1) c ( ), the MM-estimator ˆβ MM minimizes ( ) n y i xi T ˆβ MM. i=1 ρ (2) c s n

16 Comparison x y OLS M Huber M Tukey MM S

17 Robust GLM (GM-estimator) For the GLM model (e.g. logistic, Poisson) g(µ i ) = x T i β where E(Y i ) = µ i, Var(Y i ) = v(µ i ) and r i = (y i µ i ), the robust φvµi estimator is defined by n [ 1 ] ψ c (r i )w(x i ) µ i a(β) = 0, (2) φvµi i=1 where µ i = µ i / β = µ i / η i x i and a(β) = 1 n n i=1 E[ψ(r i; c)]w(x i )/ φv µi µ i. The constant a(β) is a correction term to ensure Fisher consistency.

18 R functions for robust linear regression (G)M-estimation MASS: rlm() with method= M (Huber, Tukey, Hampel) Choice for the scale estimator: MAD, Huber Proposal 2 S-estimation robust: lmrob with estim= Initial robustbase: lmrob.s MM-estimation MASS: rlm() with method= MM robust: lmrob (with estim= Final, default) robustbase: lmrob()

19 R functions for other models robustbase: glmrob GM-estimation, Huber (include Gaussian) Negative binomial model: glmrob.nb from From our book webpage: eva-cantoni/books/ robust-methods-in-biostatistics/

20 Coleman Dataset coleman from package robustbase. A data frame with 20 observations on the following 6 variables. salaryp: staff salaries per pupil fatherwc: percent of white-collar fathers sstatus: socioeconomic status composite deviation: means for family size, family intactness, father s education, mother s education, and home items teachersc: mean teacher s verbal test score motherlev: mean mother s educational level, one unit is equal to two school years Y: verbal mean test score (y, all sixth graders)

21 Coleman > summary (m. lm ) C a l l : lm ( formula = Y., data = coleman ) Residuals : Min 1Q Median 3Q Max C o e f f i c i e n t s : Est imat e Std. E r r o r t v a l u e Pr(> t ) ( I n t e r c e p t ) s a l a r y P fatherwc s s t a t u s e 05 t e a c h e r S c motherlev S i g n i f. c o d e s : 0 Äò Äô Äò Äô Äò Äô Äò. Äô 0. 1 Äò Äô 1 Residual standard e r r o r : on 14 degrees of freedom M u l t i p l e R squared : , Adjusted R squared : F s t a t i s t i c : on 5 and 14 DF, p v a l u e : e 07

22 Coleman > r e q u i r e ( r o b u s t b a s e ) > summary (m. lmrob, s e t t i n g = KS2011 ) C a l l : lmrob ( formula = Y., data = coleman ) \ > method = MM Residuals : Min 1Q Median 3Q Max C o e f f i c i e n t s : Est imat e Std. E r r o r t v a l u e Pr(> t ) ( I n t e r c e p t ) s a l a r y P fatherwc e 05 s s t a t u s e 11 teachersc e 08 motherlev S i g n i f. c o d e s : 0 Äò Äô Äò Äô Äò Äô Äò. Äô 0. 1 Äò Äô 1 Robust r e s i d u a l s t a n d a r d e r r o r : M u l t i p l e R squared : , Adjusted R squared : Convergence i n 11 IRWLS i t e r a t i o n s Robustness weights : o b s e r v a t i o n 18 i s an o u t l i e r w i t h w e i g h t = 0 ( < ) ; The r e m a i n i n g 19 ones a r e summarized as Min. 1 st Qu. Median Mean 3 rd Qu. Max

23 Coleman Ctn d: A l g o r i t h m i c parameters : t u n i n g. c h i bb t u n i n g. p s i r e f i n e. t o l r e l. t o l e e e e e 0 s o l v e. t o l eps. o u t l i e r eps. x warn. l i m i t. r e j e c t warn. l i m i t. meanr e e e e e 0 nresample max. i t b e s t. r. s k. f a s t. s k. max maxit. s c a l t r a c e. l e v mts compute. rd f a s t. s. l a r g e. n p s i s u b s a m p l i n g cov compute. o u t l i e r. s t a t s b i s q u a r e n o n s i n g u l a r. vcov. a v a r 1 SM s e e d : i n t ( 0 )

24 Coleman Robustness weights m.lmrob$rweights Index

25 Coleman > summary (m. rlm ) C a l l : rlm ( f o r m u l a = Y., data = coleman ) Residuals : Min 1Q Median 3Q Max C o e f f i c i e n t s : Value Std. E r r o r t v a l u e ( I n t e r c e p t ) s a l a r y P fatherwc s s t a t u s t e a c h e r S c motherlev Residual standard e r r o r : on 14 degrees of freedom

26 Breastfeeding UK study on the decision of pregnant women to breastfeed. 135 expecting mothers asked on their feeding choice (breast= 1 if breastfeeding, try to breastfeed and mixed brest- and bottle-feeding, =0 if exclusive bottle-feeding). Covariates: advancement of the pregnancy (pregnancy, end or beginning), how mothers fed as babies (howfed, some breastfeeding or only bottle-feeding), how mother s friend fed their babies (howfedfriends, some breastfeeding or only bottle-feeding), if had a partner (partner, no or yes), age (age), age at which left full time education (educat), ethnic group (ethnic, white or non white) and if ever smoked (smokebf, no or yes) or if stopped smoking (smokenow, no or yes). The first listed level of each factor is used as the reference (coded 0).

27 Breastfeeding The sample characteristics are as follow: out of the 135 observations, 99 were from mothers that have decided at least to try to breastfeed, 54 mothers were at the beginning of their pregnancy, 77 were themselves breastfed as baby, 85 of the mother s friend had breastfed their babies, 114 mothers had a partner, median age was (with minimum equal 17 and maximum equal 40), median age at the end of education was 17 (minimum=14, maximum=38), 77 mothers were white and 32 mothers were smoking during the pregnancy, whereas 51 had smoked before.

28 Breastfeeding breast i Bernoulli(µ i ), so that E(breast i ) = µ i and Var(breast i ) = µ i (1 µ i ) (Binomial family). Use the logit link. logit(e(breast)) = logit(p(breast)) = = β 0 + β 1 pregnancy + β 2 howfed + β 3 howfedfr + β 4 partner + β 5 ethnic + β 6 smokebf + β 7 smokenow + β 8 age + β 9 educat, where logit(µ i ) = log( µ i 1 µ i ), with µ i /(1 µ i ) being the odds of a success, and µ i = P(breast) is the probability of at least try to breastfeed.

29 Breastfeeding > r e q u i r e ( r o b u s t b a s e ) > breast. glmrobwx=glmrob ( decwhat howfedfr+ethnic+educat+age+grp+howfed+partner+ smokenow+smokebf, family=binomial, weights. on. x= hat, tcc =1.5, data=breast ) > summary ( b r e a s t. glmrobwx ) C a l l : glmrob ( formula = decwhat howfedfr + ethnic + educat + age + grp + howfed + partner + smokenow + smokebf, family = binomial, data = breast, w e i g h t s. on. x = hat, t c c = 1. 5 ) C o e f f i c i e n t s : E s t i m a t e Std. E r r o r z v a l u e Pr(> z ) ( I n t e r c e p t ) h o w f e d f r B r e a s t ethnicnon w h i t e e d u c a t age g r p B e g i n n i n g h o w f e d B r e a s t p a r t n e r P a r t n e r smokenowyes smokebfyes S i g n i f. codes : R o b u s t n e s s w e i g h t s w. r w. x : Min. 1 st Qu. Median Mean 3 rd Qu. Max Number o f o b s e r v a t i o n s : 135 F i t t e d by method Mqle ( i n 12 i t e r a t i o n s ) ( Dispersion parameter f o r binomial family taken to be 1)

30 Breastfeeding Ctn d: No d e v i a n c e v a l u e s a v a i l a b l e A l g o r i t h m i c parameters : acc t c c maxit 50 t e s t. acc c o e f

31 Breastfeeding Robustness weights w(x) Index w(r) Index

32 Mailing list and conferences Dedicated mailing list: ICORS 2017 in Wollongong, Australia: ICORS 2016 in Geneva:

33 References [1] Jean-Jacques Droesbeke, Gilbert Saporta, and Christine Thomas-Agnan. Méthodes robustes en statistique. Editions TECHNIP, [2] Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, New York, [3] S. Heritier, E. Cantoni, S. Copt, and M.-P. Victoria-Feser. Robust Methods in Biostatistics. Wiley-Interscience, [4] Peter J. Huber. Robust Statistics. Wiley, New York, [5] Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics. Wiley, New York, Second Edition. [6] R.A. Maronna, R.D. Martin, and V.J. Yohai. Robust statistics. Wiley New York, 2006.

Introduction to Robust Statistics. Elvezio Ronchetti. Department of Econometrics University of Geneva Switzerland.

Introduction to Robust Statistics Elvezio Ronchetti Department of Econometrics University of Geneva Switzerland Elvezio.Ronchetti@metri.unige.ch http://www.unige.ch/ses/metri/ronchetti/ 1 Outline Introduction