Robust regression in R. Eva Cantoni

Similar documents
Introduction to Robust Statistics. Elvezio Ronchetti. Department of Econometrics University of Geneva Switzerland.

Definitions of ψ-functions Available in Robustbase

Introduction Robust regression Examples Conclusion. Robust regression. Jiří Franc

Small Sample Corrections for LTS and MCD

Robust Statistics with S (R) JWT. Robust Statistics Collaborative Package Development: robustbase. Robust Statistics with R the past II

A NOTE ON ROBUST ESTIMATION IN LOGISTIC REGRESSION MODEL

ROBUST TESTS BASED ON MINIMUM DENSITY POWER DIVERGENCE ESTIMATORS AND SADDLEPOINT APPROXIMATIONS

A Modified M-estimator for the Detection of Outliers

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples.

Robust negative binomial regression

robmixglm: An R package for robust analysis using mixtures

Robust Statistics. Frank Klawonn

Linear Regression Models P8111

OPTIMAL B-ROBUST ESTIMATORS FOR THE PARAMETERS OF THE GENERALIZED HALF-NORMAL DISTRIBUTION

Exam Applied Statistical Regression. Good Luck!

Classification. Chapter Introduction. 6.2 The Bayes classifier

Lecture 12 Robust Estimation

Computing Robust Regression Estimators: Developments since Dutter (1977)

6. INFLUENTIAL CASES IN GENERALIZED LINEAR MODELS. The Generalized Linear Model

WEIGHTED LIKELIHOOD NEGATIVE BINOMIAL REGRESSION

Linear Methods for Prediction

Robust model selection criteria for robust S and LT S estimators

Generalized linear models for binary data. A better graphical exploratory data analysis. The simple linear logistic regression model

Statistics 203: Introduction to Regression and Analysis of Variance Course review

ROBUST ESTIMATION OF A CORRELATION COEFFICIENT: AN ATTEMPT OF SURVEY

Using R in 200D Luke Sonnet

Small sample corrections for LTS and MCD

Generalized linear models

COMPARING ROBUST REGRESSION LINES ASSOCIATED WITH TWO DEPENDENT GROUPS WHEN THERE IS HETEROSCEDASTICITY

Logistic Regression - problem 6.14

A Generalized Linear Model for Binomial Response Data. Copyright c 2017 Dan Nettleton (Iowa State University) Statistics / 46

Midwest Big Data Summer School: Introduction to Statistics. Kris De Brabanter

Generalized Estimating Equations

9 Generalized Linear Models

9. Robust regression

Experimental Design and Statistical Methods. Workshop LOGISTIC REGRESSION. Jesús Piedrafita Arilla.

MULTIVARIATE TECHNIQUES, ROBUSTNESS

GLM I An Introduction to Generalized Linear Models

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

Logistic Regressions. Stat 430

Breakdown points of Cauchy regression-scale estimators

A new robust model selection method in GLM with application to ecological data

Generalized Linear Models. Kurt Hornik

Robust Variable Selection Through MAVE

MATH 644: Regression Analysis Methods

COMPARISON OF THE ESTIMATORS OF THE LOCATION AND SCALE PARAMETERS UNDER THE MIXTURE AND OUTLIER MODELS VIA SIMULATION

Leverage effects on Robust Regression Estimators

Spatial Variation in Infant Mortality with Geographically Weighted Poisson Regression (GWPR) Approach

Lecture 5: LDA and Logistic Regression

Stat 579: Generalized Linear Models and Extensions

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

Introduction and Single Predictor Regression. Correlation

Generalized Additive Models

PAijpam.eu M ESTIMATION, S ESTIMATION, AND MM ESTIMATION IN ROBUST REGRESSION

Logistic Regression 21/05

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key

Detecting and Assessing Data Outliers and Leverage Points

IMPROVING THE SMALL-SAMPLE EFFICIENCY OF A ROBUST CORRELATION MATRIX: A NOTE

Various Issues in Fitting Contingency Tables

Biostatistics for physicists fall Correlation Linear regression Analysis of variance

Regression Analysis for Data Containing Outliers and High Leverage Points

GARCH Models Estimation and Inference

Linear Methods for Prediction

Section 4.6 Simple Linear Regression

Indian Statistical Institute

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

Regression so far... Lecture 21 - Logistic Regression. Odds. Recap of what you should know how to do... At this point we have covered: Sta102 / BME102

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Ch 2: Simple Linear Regression

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

Outlier Robust Nonlinear Mixed Model Estimation

Maximum Likelihood Conjoint Measurement in R

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015

Generalized Linear Models. stat 557 Heike Hofmann

FAMA-FRENCH 1992 REDUX WITH ROBUST STATISTICS

Investigating Models with Two or Three Categories

A SHORT COURSE ON ROBUST STATISTICS. David E. Tyler Rutgers The State University of New Jersey. Web-Site dtyler/shortcourse.

Linear model A linear model assumes Y X N(µ(X),σ 2 I), And IE(Y X) = µ(x) = X β, 2/52

Linear Regression With Special Variables

Detecting outliers and/or leverage points: a robust two-stage procedure with bootstrap cut-off points

Identifying and accounting for outliers and extreme response patterns in latent variable modelling

General Regression Model

Regression Methods for Survey Data

A Robust Strategy for Joint Data Reconciliation and Parameter Estimation

Lecture 12: Effect modification, and confounding in logistic regression

arxiv: v4 [math.st] 16 Nov 2015

Robustness of location estimators under t- distributions: a literature review

Semiparametric Generalized Linear Models

STA102 Class Notes Chapter Logistic Regression

Lecture 18: Simple Linear Regression

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

Inference for Regression

Projection Estimators for Generalized Linear Models

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00

Mixed-effects Maximum Likelihood Difference Scaling

Regression Analysis By Example

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Selection endogenous dummy ordered probit, and selection endogenous dummy dynamic ordered probit models

STA6938-Logistic Regression Model

Transcription:

Robust regression in R Eva Cantoni Research Center for Statistics and Geneva School of Economics and Management, University of Geneva, Switzerland April 4th, 2017

1 Robust statistics philosopy 2 Robust regression 3 R ressources 4 Examples 5 Bibliography

Against what is robust statistics robust? Robust Statistics aims at producing consistent and possibly efficient estimators and test statistics with stable level when the model is slightly misspecified. Model misspecification encompasses a relatively large set of possibilities, and robust statistics cannot deal with all types of model misspecifications. By slight model misspecification, we suppose that the data generating process lies in a neighborhood of the true (postulated) model, the one that is considered as useful for the problem under investigation.

Against what is robust statistics robust? This neighborhood is formalized as F ε = (1 ε)f θ + εg, (1) F θ is the postulated model, θ is a set of parameters of interest, G is an arbitrary distribution and 0 ε 1 captures the amount of model misspecification

Against what is robust statistics robust? Inference Classical G 0 << ε < 1 0 < ε << 1 ε = 0 arbitrary?? F θ G = z F ε F ε F θ G = F (θ,θ ) F ε F ε F θ G such that F ε = F (θ,θ ) F (θ,θ ) F (θ,θ ) F θ Robust arbitrary? F θ F θ G = z F ε F θ F θ G = F (θ,θ ) F ε F θ F θ G such that F ε = F (θ,θ ) F (θ,θ ) F θ F θ Table: Models at which inference can be at best done.

Robustness measures Robust estimators protect against: bias under contamination breakdown point They imply a trade-off between efficiency and robustness!

Linear model and classical estimation For i = 1,..., n consider with ɛ i N (0, σ 2 ). y i = x T i β + ɛ i, The maximum likelihood estimator ˆβ ML minimizes ( n y i xi T σ i=1 ˆβ ML ) 2 = n i=1 r 2 i, or, alternatively, solves ( n y i xi T σ i=1 ˆβ ML ) x i = n r i x i = 0. i=1

Outliers... 4 2 0 2 4 6 0 1 2 3 4 5 6 x y Classification des points aberrants Point vertical Bon point levier Mauvais point levier

Robustness vs diagnostic Masking... ML residuals 0 20 40 60 80 100 1.5 1.0 0.5 0.0 0.5 1.0 1.5 Index resid(olsmultiple)

Robustness vs diagnostic Masking... Data and ML fit 4 2 0 2 4 6 1 0 1 2 3 4 5 6 x y

M and GM-estimation The M-estimator ˆβ M solves ) n (y i xi T ˆβ M ψ c x i = σ i=1 n ψ c (r i )x i = i=1 n w c (r i ) r i x i = 0, i=1 where w c (r i ) = ψ c (r i )/r i, or minimizes ) n (y i xi T ˆβ M ρ c. σ i=1 The Mallows GM-estimator ˆβ GM is an alternative that solves ) n (y i xi T ˆβ GM n ψ c w(x i )x i = w(r i )w(x i )r i x i = 0. σ i=1 i=1

ψ c and ρ c functions Huber Tukey or bisquare psi_c(r) 1.0 0.5 0.0 0.5 1.0 psi_c(r) 1.0 0.5 0.0 0.5 1.0 4 2 0 2 4 r 4 2 0 2 4 r ρ H (u) ρ B (u) 0 1 2 3 4 5 0 1 2 3 4 5 4 2 0 2 4 u 4 2 0 2 4 u

Comparison 4 2 0 2 4 6 1 0 1 2 3 4 5 6 x y OLS M Huber GM Huber M Tukey

S-estimation The S-estimator ˆβ S is an alternative that minimizes ) n (y i xi T ˆβ S ρ c, s i=1 where s is a scale M-estimator that solves 1 n ( ρ (1) yi x T ) i β c = b. n s, i=1

MM-estimation The MM-estimator is a two-step estimator constructed as follow: 1. Let s n be the scale estimate from an initial S-estimator. 2. With ρ (2) c ( ) ρ (1) c ( ), the MM-estimator ˆβ MM minimizes ( ) n y i xi T ˆβ MM. i=1 ρ (2) c s n

Comparison 4 2 0 2 4 6 1 0 1 2 3 4 5 6 x y OLS M Huber M Tukey MM S

Robust GLM (GM-estimator) For the GLM model (e.g. logistic, Poisson) g(µ i ) = x T i β where E(Y i ) = µ i, Var(Y i ) = v(µ i ) and r i = (y i µ i ), the robust φvµi estimator is defined by n [ 1 ] ψ c (r i )w(x i ) µ i a(β) = 0, (2) φvµi i=1 where µ i = µ i / β = µ i / η i x i and a(β) = 1 n n i=1 E[ψ(r i; c)]w(x i )/ φv µi µ i. The constant a(β) is a correction term to ensure Fisher consistency.

R functions for robust linear regression (G)M-estimation MASS: rlm() with method= M (Huber, Tukey, Hampel) Choice for the scale estimator: MAD, Huber Proposal 2 S-estimation robust: lmrob with estim= Initial robustbase: lmrob.s MM-estimation MASS: rlm() with method= MM robust: lmrob (with estim= Final, default) robustbase: lmrob()

R functions for other models robustbase: glmrob GM-estimation, Huber (include Gaussian) Negative binomial model: glmrob.nb from https://github.com/williamaeberhard/ From our book webpage: http://www.unige.ch/gsem/rcs/members2/profs/ eva-cantoni/books/ robust-methods-in-biostatistics/

Coleman Dataset coleman from package robustbase. A data frame with 20 observations on the following 6 variables. salaryp: staff salaries per pupil fatherwc: percent of white-collar fathers sstatus: socioeconomic status composite deviation: means for family size, family intactness, father s education, mother s education, and home items teachersc: mean teacher s verbal test score motherlev: mean mother s educational level, one unit is equal to two school years Y: verbal mean test score (y, all sixth graders)

Coleman > summary (m. lm ) C a l l : lm ( formula = Y., data = coleman ) Residuals : Min 1Q Median 3Q Max 3.9497 0.6174 0. 0 623 0.7343 5.0018 C o e f f i c i e n t s : Est imat e Std. E r r o r t v a l u e Pr(> t ) ( I n t e r c e p t ) 19.94857 13.62755 1. 4 6 4 0.1653 s a l a r y P 1.79333 1.23340 1.454 0.1680 fatherwc 0.04360 0.05326 0. 8 1 9 0.4267 s s t a t u s 0.55576 0.09296 5. 979 3. 38 e 05 t e a c h e r S c 1.11017 0.43377 2. 5 5 9 0.0227 motherlev 1.81092 2.02739 0.893 0.3868 S i g n i f. c o d e s : 0 Äò Äô 0. 0 0 1 Äò Äô 0. 0 1 Äò Äô 0. 0 5 Äò. Äô 0. 1 Äò Äô 1 Residual standard e r r o r : 2. 074 on 14 degrees of freedom M u l t i p l e R squared : 0. 9 0 6 3, Adjusted R squared : 0.8728 F s t a t i s t i c : 2 7. 0 8 on 5 and 14 DF, p v a l u e : 9. 9 2 7 e 07

Coleman > r e q u i r e ( r o b u s t b a s e ) > summary (m. lmrob, s e t t i n g = KS2011 ) C a l l : lmrob ( formula = Y., data = coleman ) \ > method = MM Residuals : Min 1Q Median 3Q Max 4.16181 0.39226 0.01611 0.55619 7.22766 C o e f f i c i e n t s : Est imat e Std. E r r o r t v a l u e Pr(> t ) ( I n t e r c e p t ) 30.50232 6.71260 4. 5 4 4 0.000459 s a l a r y P 1.66615 0.43129 3.863 0.001722 fatherwc 0.08425 0.01467 5. 741 5. 10 e 05 s s t a t u s 0.66774 0.03385 19.726 1. 30 e 11 teachersc 1.16778 0.10983 10.632 4. 35 e 08 motherlev 4.13657 0.92084 4.492 0.000507 S i g n i f. c o d e s : 0 Äò Äô 0. 0 0 1 Äò Äô 0. 0 1 Äò Äô 0. 0 5 Äò. Äô 0. 1 Äò Äô 1 Robust r e s i d u a l s t a n d a r d e r r o r : 1. 1 3 4 M u l t i p l e R squared : 0. 9 8 1 4, Adjusted R squared : 0.9747 Convergence i n 11 IRWLS i t e r a t i o n s Robustness weights : o b s e r v a t i o n 18 i s an o u t l i e r w i t h w e i g h t = 0 ( < 0. 0 0 5 ) ; The r e m a i n i n g 19 ones a r e summarized as Min. 1 st Qu. Median Mean 3 rd Qu. Max. 0.1491 0.9412 0. 9847 0.9279 0.9947 0.9982

Coleman Ctn d: A l g o r i t h m i c parameters : t u n i n g. c h i bb t u n i n g. p s i r e f i n e. t o l r e l. t o l 1. 548 e+00 5. 000 e 01 4. 685 e+00 1. 000 e 07 1. 000 e 0 s o l v e. t o l eps. o u t l i e r eps. x warn. l i m i t. r e j e c t warn. l i m i t. meanr 1. 000 e 07 5. 000 e 03 1. 569 e 10 5. 000 e 01 5. 000 e 0 nresample max. i t b e s t. r. s k. f a s t. s k. max maxit. s c a l 500 50 2 1 200 200 t r a c e. l e v mts compute. rd f a s t. s. l a r g e. n 0 1000 0 2000 p s i s u b s a m p l i n g cov compute. o u t l i e r. s t a t s b i s q u a r e n o n s i n g u l a r. vcov. a v a r 1 SM s e e d : i n t ( 0 )

Coleman Robustness weights m.lmrob$rweights 0.0 0.2 0.4 0.6 0.8 1.0 5 10 15 20 Index

Coleman > summary (m. rlm ) C a l l : rlm ( f o r m u l a = Y., data = coleman ) Residuals : Min 1Q Median 3Q Max 4.2059 0.3886 0.1092 0.4231 6.7054 C o e f f i c i e n t s : Value Std. E r r o r t v a l u e ( I n t e r c e p t ) 27.3497 7. 6808 3.5608 s a l a r y P 1.6207 0. 6952 2.3314 fatherwc 0.0752 0. 0 300 2.5045 s s t a t u s 0. 6 4 01 0. 0524 12.2182 t e a c h e r S c 1. 1 5 57 0. 2445 4.7271 motherlev 3.5195 1. 1 427 3.0801 Residual standard e r r o r : 0.7461 on 14 degrees of freedom

Breastfeeding UK study on the decision of pregnant women to breastfeed. 135 expecting mothers asked on their feeding choice (breast= 1 if breastfeeding, try to breastfeed and mixed brest- and bottle-feeding, =0 if exclusive bottle-feeding). Covariates: advancement of the pregnancy (pregnancy, end or beginning), how mothers fed as babies (howfed, some breastfeeding or only bottle-feeding), how mother s friend fed their babies (howfedfriends, some breastfeeding or only bottle-feeding), if had a partner (partner, no or yes), age (age), age at which left full time education (educat), ethnic group (ethnic, white or non white) and if ever smoked (smokebf, no or yes) or if stopped smoking (smokenow, no or yes). The first listed level of each factor is used as the reference (coded 0).

Breastfeeding The sample characteristics are as follow: out of the 135 observations, 99 were from mothers that have decided at least to try to breastfeed, 54 mothers were at the beginning of their pregnancy, 77 were themselves breastfed as baby, 85 of the mother s friend had breastfed their babies, 114 mothers had a partner, median age was 28.17 (with minimum equal 17 and maximum equal 40), median age at the end of education was 17 (minimum=14, maximum=38), 77 mothers were white and 32 mothers were smoking during the pregnancy, whereas 51 had smoked before.

Breastfeeding breast i Bernoulli(µ i ), so that E(breast i ) = µ i and Var(breast i ) = µ i (1 µ i ) (Binomial family). Use the logit link. logit(e(breast)) = logit(p(breast)) = = β 0 + β 1 pregnancy + β 2 howfed + β 3 howfedfr + β 4 partner + β 5 ethnic + β 6 smokebf + β 7 smokenow + β 8 age + β 9 educat, where logit(µ i ) = log( µ i 1 µ i ), with µ i /(1 µ i ) being the odds of a success, and µ i = P(breast) is the probability of at least try to breastfeed.

Breastfeeding > r e q u i r e ( r o b u s t b a s e ) > breast. glmrobwx=glmrob ( decwhat howfedfr+ethnic+educat+age+grp+howfed+partner+ smokenow+smokebf, family=binomial, weights. on. x= hat, tcc =1.5, data=breast ) > summary ( b r e a s t. glmrobwx ) C a l l : glmrob ( formula = decwhat howfedfr + ethnic + educat + age + grp + howfed + partner + smokenow + smokebf, family = binomial, data = breast, w e i g h t s. on. x = hat, t c c = 1. 5 ) C o e f f i c i e n t s : E s t i m a t e Std. E r r o r z v a l u e Pr(> z ) ( I n t e r c e p t ) 7.77423 3.35952 2.314 0.02066 h o w f e d f r B r e a s t 1.49177 0.68777 2. 1 6 9 0.03008 ethnicnon w h i t e 2.68758 1.11582 2. 4 0 9 0.01601 e d u c a t 0.37372 0.18486 2. 0 2 2 0.04321 age 0.03116 0.05955 0. 5 2 3 0.60082 g r p B e g i n n i n g 0.81318 0.69269 1.174 0.24042 h o w f e d B r e a s t 0.52823 0.70456 0. 7 5 0 0.45342 p a r t n e r P a r t n e r 0.78295 0.81448 0. 9 6 1 0.33641 smokenowyes 3.44560 1.12137 3.073 0.00212 smokebfyes 1.50733 1.09900 1. 3 7 2 0.17020 S i g n i f. codes : 0 0. 001 0. 01 0. 05. 0. 1 1 R o b u s t n e s s w e i g h t s w. r w. x : Min. 1 st Qu. Median Mean 3 rd Qu. Max. 0.04103 0.82460 0.86500 0.82890 0.89400 0.93790 Number o f o b s e r v a t i o n s : 135 F i t t e d by method Mqle ( i n 12 i t e r a t i o n s ) ( Dispersion parameter f o r binomial family taken to be 1)

Breastfeeding Ctn d: No d e v i a n c e v a l u e s a v a i l a b l e A l g o r i t h m i c parameters : acc t c c 0.0001 1.5000 maxit 50 t e s t. acc c o e f

Breastfeeding Robustness weights w(x) 0.85 0.95 18 0 20 40 60 80 100 120 140 Index w(r) 0.2 0.6 1.0 3 11 14 53 63 75 90 115 0 20 40 60 80 100 120 140 Index

Mailing list and conferences Dedicated mailing list: r-sig-robust@r-project.org ICORS 2017 in Wollongong, Australia: ICORS 2016 in Geneva:

References [1] Jean-Jacques Droesbeke, Gilbert Saporta, and Christine Thomas-Agnan. Méthodes robustes en statistique. Editions TECHNIP, 2015. [2] Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, New York, 1986. [3] S. Heritier, E. Cantoni, S. Copt, and M.-P. Victoria-Feser. Robust Methods in Biostatistics. Wiley-Interscience, 2009. [4] Peter J. Huber. Robust Statistics. Wiley, New York, 1981. [5] Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics. Wiley, New York, 2009. Second Edition. [6] R.A. Maronna, R.D. Martin, and V.J. Yohai. Robust statistics. Wiley New York, 2006.