ETH Zürich, October 25, 2010


1 Marcel Dettling, Institute for Data Analysis and Process Design, Zurich University of Applied Sciences, marcel.dettling@zhaw.ch, http://stat.ethz.ch/~dettling, ETH Zürich, October 25, 2010

2 Mortality Example
Columns: City, Mortality, JanTemp, JulyTemp, RelHum, Rain, Educ, Dens, NonWhite, WhiteCollar, Pop, House, Income, HC, NOx, SO2
Cities shown: Akron, OH; Albany, NY; Allentown, PA; Atlanta, GA; Baltimore, MD; Birmingham, AL

3 Model Diagnostics
Why do we need to do this?
a) To make sure that estimates and inference are valid:
- $E[\varepsilon_i] = 0$
- $Var(\varepsilon_i) = \sigma_\varepsilon^2$
- $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
- $\varepsilon \sim N(0, \sigma_\varepsilon^2 I)$, i.e. the errors are i.i.d. normal
b) To improve the model (better fit, reliable conclusions):
- variable transformations
- further predictors or interactions between them
- weighted regression or a more general model

4 Model Diagnostics: Structure and Variance
[Figure: Tukey-Anscombe plot (residuals vs. fitted values) and scale-location plot (standardized residuals vs. fitted values); labeled points: New Orleans, Albany, Lancaster]

5 Model Diagnostics: Example
It is all in the residuals. Important trickery:
- Plot the residuals against some predictors.
- These predictors can be in or out of the model.
No matter what the predictor is, we must not see any structure in these plots. If we find some, the model is inadequate.
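The residual-versus-predictor check can be sketched in R as follows. This is only an illustration under assumptions: the mortality data are not reproduced in these slides, so the base R dataset `LifeCycleSavings` is used as a stand-in, and the choice of `pop15` as a predictor outside the model is arbitrary.

```r
# Assumption: LifeCycleSavings (base R 'datasets') stands in for the lecture data.
fit <- lm(sr ~ ddpi, data = LifeCycleSavings)
res <- resid(fit)

op <- par(mfrow = c(1, 2))
# Predictor that is in the model
plot(LifeCycleSavings$ddpi, res, xlab = "ddpi (in model)", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(LifeCycleSavings$ddpi, res), col = "red")   # smoother to reveal structure
# Predictor that is not in the model
plot(LifeCycleSavings$pop15, res, xlab = "pop15 (not in model)", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(LifeCycleSavings$pop15, res), col = "red")
par(op)
```

A smoother that deviates systematically from the horizontal zero line in either panel would hint at a missing transformation or a missing predictor.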

6 Model Diagnostics: Normality/Correlation
[Figure: normal plot (standardized residuals vs. theoretical quantiles; labeled points: New Orleans, Albany, Lancaster) and serial correlation plot (residuals vs. index)]

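7 How To Identify Influential Data Points?
1) Poor man's approach: redo the analysis $n$ times, each time excluding one data point.
2) Leverage: if we change $y_i$ by $\Delta y_i$, then $h_{ii} \Delta y_i$ is the change in $\hat{y}_i$. High leverage for a data point ($h_{ii} > 2(p+1)/n$) means that it forces the regression line to fit well to it.
3) Cook's distance:
$$D_i = \frac{\sum_j (\hat{y}_j - \hat{y}_{j(i)})^2}{(p+1)\,\hat{\sigma}_\varepsilon^2} = \frac{h_{ii}}{1 - h_{ii}} \cdot \frac{r_i^{*2}}{p+1}$$
Here $\hat{y}_{j(i)}$ is the fitted value for observation $j$ when observation $i$ is left out, and $r_i^*$ is the standardized residual. Be careful if Cook's distance > 1.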
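The three approaches can be tried out with standard R functions. The sketch below is an assumption-laden illustration: it uses the base R dataset `LifeCycleSavings` rather than the mortality data, and the object names (`D.manual`, `coef.loo`) are made up for this example.

```r
# Assumption: LifeCycleSavings (base R 'datasets') stands in for the lecture data.
fit <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
n <- nrow(LifeCycleSavings)
p <- length(coef(fit)) - 1          # number of predictors

h <- hatvalues(fit)                 # leverages h_ii
r <- rstandard(fit)                 # standardized residuals r_i*
D <- cooks.distance(fit)            # Cook's distance

# Cook's distance from the closed-form expression h_ii / (1 - h_ii) * r_i*^2 / (p + 1)
D.manual <- h / (1 - h) * r^2 / (p + 1)
all.equal(unname(D), unname(D.manual))

names(h)[h > 2 * (p + 1) / n]       # data points with high leverage
names(D)[D > 1]                     # data points with Cook's distance > 1

# Poor man's approach: refit n times, leaving out one observation each time
coef.loo <- t(sapply(seq_len(n), function(i) {
  coef(lm(formula(fit), data = LifeCycleSavings[-i, ]))
}))
```

Each row of `coef.loo` holds the coefficients obtained without one observation, so large jumps between rows point to influential data points.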

8 Model Diagnostics: Example
[Figure: Cook's distance vs. observation number, and leverage plot (standardized residuals vs. leverage with Cook's distance contours); labeled points: York, New Orleans, Miami]

9 Model Diagnostics: Conclusions
Conclusions from the model diagnostics:
- there are 2 influential data points: York and New Orleans
- they do not seem to be very strongly influential, but still:
- better to re-run the analysis without these and check results
Results from that analysis:
- log(SO2) is significant again!!!
- Residual standard error smaller
- Coefficient of determination higher
- Thus: better fit!

10 Why Are They Influential?
[Figure: population density vs. index, and % white collar vs. education; labeled points include York, New York and Washington]

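11 Weighted Regression
We consider the model $Y = X\beta + \varepsilon$, where $\varepsilon \sim N(0, \sigma_\varepsilon^2 \Sigma)$:
- with $\Sigma \neq I$: generalized least squares
- with $\Sigma = \mathrm{diag}(1/w_1, 1/w_2, \ldots, 1/w_n)$: weighted regression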

12 Weighted Regression
When, why and how:
- If the $Y_i$ are means of $n_i$ observations, we choose $w_i = n_i$.
- If the variance is proportional to a predictor: $w_i = 1/x_i$.
- For observed non-constant variance: estimate it from OLS!
The regression coefficients in weighted regression are obtained by minimizing the weighted sum of squared residuals:
$$\sum_{i=1}^{n} w_i r_i^2$$
The solution is still explicit and unique for a full-rank design.
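A minimal sketch of weighted regression in R, assuming simulated data whose error variance is proportional to the predictor (so that $w_i = 1/x_i$). The data and all object names are invented for illustration. The explicit solution used for comparison is the standard weighted least squares estimator $(X^T W X)^{-1} X^T W y$ with $W = \mathrm{diag}(w_1, \ldots, w_n)$.

```r
set.seed(1)
n <- 40
x <- seq(1, 10, length.out = n)
# Assumption for illustration: error variance proportional to x
y <- 2 + 0.5 * x + rnorm(n, sd = sqrt(x))
w <- 1 / x                               # weights w_i = 1 / x_i

# Weighted regression via lm()
fit.wls <- lm(y ~ x, weights = w)

# Explicit solution: solve (X' W X) beta = X' W y
X <- cbind(1, x)
W <- diag(w)
beta.hat <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)

cbind(lm = coef(fit.wls), explicit = as.numeric(beta.hat))
```

Both columns of the final table should agree up to numerical precision, which illustrates that the weighted fit still has a closed-form solution.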

13 Robust Regression
How to deal with outliers?
- Check for typos first. Keep in contact with the original data source.
- Examine the physical context: why did it happen?
- Outliers are often the most interesting data points!
- Exclude the outliers from the analysis, and re-fit the model.
- Always report the existence of outliers that were removed.
If outliers are not mistakes but occur naturally due to a long-tailed error distribution: do not exclude them from the analysis, but run a robust regression.

14 Robust Regression
> library(MASS)
> fit.rlm <- rlm(Mortality ~ JanTemp + ... + log(SO2), data = ...)
- Relies on Huber's method
- Downweights the effect of outliers
> summary(fit.rlm)
The coefficient table reports Value, Std. Error and t value for (Intercept), JanTemp, ..., log(SO2). The residual standard error is on 46 degrees of freedom.
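How the downweighting works can be seen on a small artificial example. The sketch below assumes made-up data with one gross outlier (the mortality data are not reproduced here); `rlm()` from the MASS package uses Huber's psi function by default, and the fitted object stores the final weights in its `w` component.

```r
library(MASS)

# Hypothetical data: a straight line with small deterministic perturbations
x <- 1:20
y <- 2 + 3 * x + sin(x)
y[10] <- y[10] + 50                 # one gross outlier in the middle of the x-range

fit.ols <- lm(y ~ x)                # ordinary least squares
fit.rlm <- rlm(y ~ x)               # robust fit, Huber's method by default

rbind(ols = coef(fit.ols), rlm = coef(fit.rlm))
round(fit.rlm$w, 2)                 # final weights; the outlier gets a weight below 1
```

Observations with weight 1 are treated as in least squares, whereas the outlier contributes only a fraction of its residual to the fit.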

15 Polynomial Regression
Polynomial regression = multiple linear regression!!!
$$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_d x^d + \varepsilon$$
Goals:
- fit a curvilinear relation
- improve the fit between x and Y
- determine the polynomial order d
Example:
- Savings dataset: personal savings ~ income per capita

16 Polynomial Regression Fit
[Figure: Savings data with polynomial regression fit; sr vs. ddpi]

17 Polynomial Regression
Output from the model with the linear term only:
> summary(lm(sr ~ ddpi, data = savings))
Coefficients (columns Estimate, Std. Error, t value, Pr(>|t|)):
- (Intercept): p-value of order 1e-10, ***
- ddpi: *
Residual standard error on 48 degrees of freedom; F-statistic on 1 and 48 DF.

18 Diagnostic Plots 1
[Figure: residuals vs. fitted values, and normal Q-Q plot of standardized residuals; labeled points include Chile, Japan, Zambia and Libya]

19 Diagnostic Plots 2
[Figure: Cook's distance vs. observation number, and standardized residuals vs. leverage; labeled points: Libya, Japan, Jamaica]

20 Quadratic Regression
Add the quadratic term: $Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$
> summary(lm(sr ~ ddpi + I(ddpi^2), data = savings))
Coefficients (columns Estimate, Std. Error, t value, Pr(>|t|)):
- (Intercept): ***
- ddpi: **
- I(ddpi^2): *
Residual standard error on 47 degrees of freedom; Multiple R-squared: 0.205; F-statistic on 2 and 47 DF.

21 Diagnostic Plots: Quadratic Regression
[Figure: residuals vs. fitted values, and normal Q-Q plot of standardized residuals; labeled points: Japan, Chile, Korea]

22 Diagnostic Plots: Quadratic Regression
[Figure: Cook's distance vs. observation number, and standardized residuals vs. leverage; labeled points: Libya, Japan, Jamaica]

23 Cubic Regression
Add the cubic term: $Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \varepsilon$
> summary(lm(sr ~ ddpi + I(ddpi^2) + I(ddpi^3), data = savings))
Coefficients (columns Estimate, Std. Error, t value, Pr(>|t|)):
- (Intercept): *
- ddpi, I(ddpi^2), I(ddpi^3): no significance stars
Residual standard error on 46 degrees of freedom; Multiple R-squared: 0.205; F-statistic on 3 and 46 DF.
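Determining the polynomial order can be sketched by fitting the nested models and comparing them. The `savings` data frame itself is not part of these slides; the sketch assumes that the base R dataset `LifeCycleSavings`, which has the same `sr` and `ddpi` columns and 50 observations, is an adequate stand-in.

```r
# Assumption: LifeCycleSavings (base R 'datasets') stands in for 'savings'.
savings <- LifeCycleSavings

fit1 <- lm(sr ~ ddpi, data = savings)                          # linear
fit2 <- lm(sr ~ ddpi + I(ddpi^2), data = savings)              # quadratic
fit3 <- lm(sr ~ ddpi + I(ddpi^2) + I(ddpi^3), data = savings)  # cubic

# R-squared can only increase with the order, so it is not a selection criterion
r2 <- sapply(list(linear = fit1, quadratic = fit2, cubic = fit3),
             function(f) summary(f)$r.squared)
r2

# Sequential F-tests: is the highest-order term needed?
anova(fit1, fit2, fit3)
```

A common strategy is to stop at the highest order whose term is still significant in this sequence of tests.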

24 Powers Are Strongly Correlated Predictors!
The smaller the x-range, the bigger the problem!
> cor(cbind(ddpi, ddpi2 = ddpi^2, ddpi3 = ddpi^3))
Way out: use centered predictors!
$$z_i = x_i - \bar{x}, \quad z_i^2 = (x_i - \bar{x})^2, \quad z_i^3 = (x_i - \bar{x})^3$$

25 Powers Are Strongly Correlated Predictors!
> summary(lm(sr ~ z.ddpi + I(z.ddpi^2) + I(z.ddpi^3), data = z.savings))
Coefficients (columns Estimate, Std. Error, t value, Pr(>|t|)):
- (Intercept): p-value < 2e-16, ***
- z.ddpi: **
- I(z.ddpi^2), I(z.ddpi^3): no significance stars
Observations:
- Coefficients, standard errors and tests are different
- Fitted values and global inference remain the same
- Not overly beneficial on this dataset!
- Be careful: extrapolation with polynomials is dangerous!
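The effect of centering can be checked directly. As before, this sketch assumes the base R dataset `LifeCycleSavings` as a stand-in for `savings`; the names `z`, `fit.raw` and `fit.ctr` are chosen for illustration only.

```r
# Assumption: LifeCycleSavings (base R 'datasets') stands in for 'savings'.
savings <- LifeCycleSavings
x <- savings$ddpi
z <- x - mean(x)                       # centered predictor

# Correlation among the raw powers and among the centered powers
cor(cbind(x, x2 = x^2, x3 = x^3))
cor(cbind(z, z2 = z^2, z3 = z^3))

# Same cubic model, once with raw and once with centered powers
fit.raw <- lm(sr ~ ddpi + I(ddpi^2) + I(ddpi^3), data = savings)
fit.ctr <- lm(savings$sr ~ z + I(z^2) + I(z^3))

# Coefficients differ, but the fitted values are identical
all.equal(unname(fitted(fit.raw)), unname(fitted(fit.ctr)))
```

Both models span the same column space, which is why the fitted values and the global F-test do not change even though the individual coefficients and their tests do.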

26 Dummy Variables
So far, we only considered continuous predictors:
- temperature
- distance
- pressure
It is perfectly valid to have categorical predictors, too:
- sex (male or female)
- status variables (employed or unemployed)
- working shift (day, evening, night)
Implementation in the regression: with dummy variables.

27 Example: Binary Categorical Variable
The lathe dataset:
- $Y$: lifetime of a cutting tool in a lathe
- $x_1$: speed of the machine in rpm
- $x_2$: tool type A or B
Dummy variable encoding: $x_2 = 0$ for tool type A, $x_2 = 1$ for tool type B.

28 Interpretation of the Model
See blackboard.
> summary(lm(hours ~ rpm + tool, data = lathe))
Coefficients (columns Estimate, Std. Error, t value, Pr(>|t|)):
- (Intercept): p-value of order 1e-09, ***
- rpm: p-value of order 1e-05, ***
- toolB: p-value of order 1e-09, ***
Residual standard error on 17 degrees of freedom; F-statistic on 2 and 17 DF, p-value: 3.086e-09.
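What R does with a binary factor can be inspected through the design matrix. The lathe data are not reproduced in these slides, so the sketch below simulates a hypothetical stand-in (`lathe.sim`) with arbitrary coefficients; only the mechanics of the dummy encoding are meant to carry over.

```r
set.seed(2)
# Hypothetical stand-in for the lathe data; all numbers are arbitrary
lathe.sim <- data.frame(
  rpm  = rep(seq(500, 1000, length.out = 10), times = 2),
  tool = factor(rep(c("A", "B"), each = 10))
)
lathe.sim$hours <- 37 - 0.027 * lathe.sim$rpm +
  15 * (lathe.sim$tool == "B") + rnorm(20, sd = 3)

fit <- lm(hours ~ rpm + tool, data = lathe.sim)

# R creates the dummy variable automatically: toolB = 0 for type A, 1 for type B
X <- model.matrix(fit)
X[c(1, 2, 11, 12), ]

# The toolB coefficient is the vertical shift between two parallel lines
coef(fit)
```

The first factor level (here A) serves as the reference, so the intercept belongs to tool type A and `toolB` is the difference in expected lifetime at equal speed.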

29 The Dummy Variable Fit
[Figure: Durability of lathe cutting tools; hours vs. rpm]

30 Model with Interactions
Question: do the slopes need to be identical?
With the appropriate model, the answer is no!
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$
See blackboard for model interpretation.

31 Different Slope for the Regression Lines
[Figure: Durability of lathe cutting tools, with interaction; hours vs. rpm]

32 Summary Output
> summary(lm(hours ~ rpm * tool, data = lathe))
Coefficients (columns Estimate, Std. Error, t value, Pr(>|t|)):
- (Intercept): p-value of order 1e-06, ***
- rpm: **
- toolB: **
- rpm:toolB: no significance stars
Residual standard error on 16 degrees of freedom; F-statistic on 3 and 16 DF, p-value: 1.319e-08.

33 How Complex the Model Needs to Be?
Question 1: do we need different slopes for the two lines?
$H_0: \beta_3 = 0$ against $H_A: \beta_3 \neq 0$
This is an individual parameter test for the interaction term!
Question 2: is there any difference altogether?
$H_0: \beta_2 = \beta_3 = 0$ against $H_A: \beta_2 \neq 0$ and/or $\beta_3 \neq 0$
This is a partial F-test: we try to exclude the interaction and the dummy variable together.
R offers convenient functionality for these tests!

34 Anova Output
Summary output for the interaction model:
> fit1 <- lm(hours ~ rpm, data = lathe)
> fit2 <- lm(hours ~ rpm * tool, data = lathe)
> anova(fit1, fit2)
Model 1: hours ~ rpm
Model 2: hours ~ rpm * tool
The table (columns Res.Df, RSS, Df, Sum of Sq, F, Pr(>F)) shows a p-value of order 1e-08, ***.
Conclusion: no different slopes, but different intercept!
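The partial F-test behind `anova(fit1, fit2)` can be reproduced by hand. The sketch assumes a simulated stand-in for the lathe data (`lathe.sim`, with arbitrary numbers) and uses the usual formula $F = \frac{(RSS_1 - RSS_2)/q}{RSS_2/df_2}$, where $q$ is the number of parameters dropped in the smaller model.

```r
set.seed(2)
# Hypothetical stand-in for the lathe data; all numbers are arbitrary
lathe.sim <- data.frame(
  rpm  = rep(seq(500, 1000, length.out = 10), times = 2),
  tool = factor(rep(c("A", "B"), each = 10))
)
lathe.sim$hours <- 37 - 0.027 * lathe.sim$rpm +
  15 * (lathe.sim$tool == "B") + rnorm(20, sd = 3)

fit1 <- lm(hours ~ rpm, data = lathe.sim)          # no tool effect at all
fit2 <- lm(hours ~ rpm * tool, data = lathe.sim)   # dummy variable plus interaction
tab  <- anova(fit1, fit2)

# Partial F-test by hand
rss1 <- deviance(fit1)                             # residual sum of squares, small model
rss2 <- deviance(fit2)                             # residual sum of squares, big model
q    <- df.residual(fit1) - df.residual(fit2)      # number of parameters tested
F.manual <- ((rss1 - rss2) / q) / (rss2 / df.residual(fit2))
p.manual <- pf(F.manual, q, df.residual(fit2), lower.tail = FALSE)

c(F.anova = tab[2, "F"], F.manual = F.manual)
```

The individual test for Question 1 is read off the `rpm:toolB` row of `summary(fit2)`, while Question 2 needs the joint comparison shown here.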

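35 Categorical Input with More than 2 Levels
There are now 3 tool types A, B, C, encoded with two dummy variables:
- $x_2 = 0, x_3 = 0$ for observations of type A
- $x_2 = 1, x_3 = 0$ for observations of type B
- $x_2 = 0, x_3 = 1$ for observations of type C
Main effect model:
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$
With interactions:
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \varepsilon$$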

36 Three Types of Cutting Tools
[Figure: Durability of lathe cutting tools, 3 types; hours vs. rpm, points labeled by tool type A, B, C]

37 Summary Output
> summary(lm(hours ~ rpm * tool, data = abc.lathe))
Coefficients (columns Estimate, Std. Error, t value, Pr(>|t|)):
- (Intercept): p-value of order 1e-07, ***
- rpm: **
- toolB: **
- toolC, rpm:toolB, rpm:toolC: no significance stars
Residual standard error: 2.88 on 24 degrees of freedom; F-statistic on 5 and 24 DF, p-value: 9.064e-11.

38 Inference with Categorical Predictors
Do not perform individual hypothesis tests on factors!
Question 1: do we have different slopes?
$H_0: \beta_4 = 0$ and $\beta_5 = 0$ against $H_A: \beta_4 \neq 0$ and/or $\beta_5 \neq 0$
Question 2: is there any difference altogether?
$H_0: \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ against $H_A$: any of $\beta_2, \beta_3, \beta_4, \beta_5 \neq 0$
Again, R provides convenient functionality.

39 Anova Output
> anova(fit.abc)
Analysis of Variance Table (columns Df, Sum Sq, Mean Sq, F value, Pr(>F)):
- rpm: ***
- tool: p-value of order 1e-11, ***
- rpm:tool: *
- Residuals
Conclusions:
- strong evidence that we need to distinguish the tools!
- weak evidence for the necessity of different slopes
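Both questions about a three-level factor can be answered with model comparisons that test all dummy coefficients of the factor jointly. The sketch assumes simulated data (`abc.sim`, a hypothetical stand-in for `abc.lathe` with arbitrary numbers); the model names `fit.rpm` and `fit.main` are introduced here for illustration.

```r
set.seed(3)
# Hypothetical stand-in for abc.lathe; all numbers are arbitrary
abc.sim <- data.frame(
  rpm  = rep(seq(500, 1000, length.out = 10), times = 3),
  tool = factor(rep(c("A", "B", "C"), each = 10))
)
shift <- c(A = 0, B = 14, C = 6)
abc.sim$hours <- 35 - 0.025 * abc.sim$rpm +
  unname(shift[as.character(abc.sim$tool)]) + rnorm(30, sd = 3)

fit.rpm  <- lm(hours ~ rpm, data = abc.sim)          # no tool effect
fit.main <- lm(hours ~ rpm + tool, data = abc.sim)   # different intercepts only
fit.abc  <- lm(hours ~ rpm * tool, data = abc.sim)   # different intercepts and slopes

q1 <- anova(fit.main, fit.abc)   # Question 1: beta_4 = beta_5 = 0 ?
q2 <- anova(fit.rpm, fit.abc)    # Question 2: beta_2 = ... = beta_5 = 0 ?
q1
q2

# Sequential table as on the slide: rpm, then tool, then rpm:tool
anova(fit.abc)
```

In the sequential table each factor occupies a single row with 2 degrees of freedom, which is why it should be preferred over the individual t-tests on `toolB` and `toolC`.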
