Lecture 1: Linear Models and Applications

Claudia Czado, TU München
ZFS/IMS Göttingen 2004

Overview

- Introduction to linear models
- Exploratory data analysis (EDA)
- Estimation and testing in linear models
- Regression diagnostics

Introduction to linear models

The goal of regression models is to determine how a response variable depends on covariates. A special class of regression models are linear models. The general setup is given by

Data:       (Y_i, x_i1, ..., x_ik), i = 1, ..., n
Responses:  Y = (Y_1, ..., Y_n)^T
Covariates: x_i = (x_i1, ..., x_ik)^T (known)

Example: Life expectancies from 40 countries

Source: The Economist's Book of World Statistics, 1990, Time Books; The World Almanac of Facts, 1992, World Almanac Books.

LIFE.EXP   Life expectancy at birth
TEMP       Temperature in degrees Fahrenheit
URBAN      Percent of population living in urban areas
HOSP.POP   No. of hospitals per population
COUNTRY    The name of the country

Which is the response and which are the covariates?

Linear models (LM) under normality assumption

Y_i = β_0 + β_1 x_i1 + ... + β_k x_ik + ε_i,   ε_i ~ N(0, σ²) iid,   i = 1, ..., n,

where the unknown regression parameters β_0, ..., β_k and the unknown error variance σ² need to be estimated. Note that E(Y_i) is a linear function in β_0, ..., β_k:

E(Y_i) = β_0 + β_1 x_i1 + ... + β_k x_ik
Var(Y_i) = Var(ε_i) = σ²   (variance homogeneity)

LMs in matrix notation

Y = Xβ + ε,   X ∈ R^{n×p},   p = k + 1,   β = (β_0, ..., β_k)^T

E(Y) = Xβ,   Var(Y) = σ² I_n

Under the normality assumption we have Y ~ N_n(Xβ, σ² I_n), where N_n(μ, Σ) is the n-dimensional normal distribution with mean vector μ and covariance matrix Σ. The matrix X is also called the design matrix.
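
As a small illustration, the design matrix X can be built from a model formula. This sketch uses modern R syntax (rather than the S-Plus used later in these notes); the data frame d is made up for the example.

d <- data.frame(y  = c(3.1, 4.0, 5.2, 5.9, 7.1),
                x1 = 1:5,
                x2 = c(0.5, 0.1, 0.9, 0.3, 0.7))
X <- model.matrix(y ~ x1 + x2, data = d)  # n x p design matrix, p = k + 1 = 3
dim(X)   # 5 3; the first column is the intercept column of ones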

Exploratory data analysis (EDA)

- Consider the ranges of the responses and covariates. When covariates are discrete, group covariate levels if they are sparse.
- Plot the covariates against the responses. These scatter plots should look linear; otherwise consider transformations of the covariates.
- To check whether the constant variance assumption is reasonable, the scatter plot of covariates against the responses should be contained in a band.

Example: Life expectancies: Data summaries

> summary(health.data)

Variable   Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
LIFE.EXP   37      56       67      65      73       79
TEMP       36      52       61      62      73       85
URBAN      14      42       62      59      76       96
HOSP.POP   0.0057  0.0204   0.0295  0.0469  0.0614   0.2190

NA = not available (missing data). HOSP.POP has 6 NA's; one further NA occurs among the first three variables.

Example: Life expectancies: EDA

[Scatterplot matrix of LIFE.EXP, TEMP, URBAN, and HOSP.POP]

- LIFE.EXP decreases if TEMP increases
- LIFE.EXP increases if URBAN increases
- LIFE.EXP increases if HOSP.POP increases

Check for nonlinearities and nonconstant variances.

Example: Life expectancies: Transformations

[Scatterplots of LIFE.EXP against HOSP.POP and against log(HOSP.POP), with countries such as France, United States, Japan, Australia, New Zealand, Argentina, Norway, Morocco, Costa Rica, Egypt, China, Tonga, Papua New Guinea, and India labeled]

The logarithm makes the relationship more linear.

Parameter estimation of β

Under the normality assumption it is enough to minimize

Q(β) := ||Y − Xβ||²   for β ∈ R^p

to calculate the maximum likelihood estimate (MLE) of β. An estimator which minimizes Q(β) is also called a least squares estimator (LSE). If the matrix X ∈ R^{n×p} has full rank p, the minimum of Q(β) is attained at

β̂ = (X^T X)^{-1} X^T Y

SS_Res := Σ_{i=1}^n (Y_i − x_i^T β̂)² = Q(β̂)   residual sum of squares
Ŷ_i := x_i^T β̂   fitted value for μ_i := E(Y_i)
e_i := Y_i − Ŷ_i   raw residual

It follows that E(β̂) = β and Var(β̂) = σ² (X^T X)^{-1}; therefore one has under normality

β̂ ~ N_p(β, σ² (X^T X)^{-1}).
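
The closed form β̂ = (X^T X)^{-1} X^T Y can be checked against lm(). A minimal sketch in modern R, reusing the made-up data frame d from the sketch above:

fit  <- lm(y ~ x1 + x2, data = d)
X    <- model.matrix(fit)
beta <- solve(crossprod(X), crossprod(X, d$y))  # solves (X^T X) beta = X^T Y
cbind(beta, coef(fit))                          # both columns agree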

Parameter estimation of σ²

The MLE of σ² is given by

σ̂² := (1/n) Σ_{i=1}^n (Y_i − x_i^T β̂)² = (1/n) Σ_{i=1}^n (Y_i − Ŷ_i)²

One can show that this estimator is biased; in particular

E(σ̂²) = ((n − p)/n) σ².

An unbiased estimator for σ² is therefore given by

s² := (1/(n − p)) Σ_{i=1}^n (Y_i − Ŷ_i)² = (n/(n − p)) σ̂²

Under normality it follows that

(n − p) s² / σ² = Σ_{i=1}^n (Y_i − Ŷ_i)² / σ² ~ χ²_{n−p}

and s² is independent of β̂.
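
A quick numerical check of the two estimators, with simulated data made up for the example; the unbiased version reproduces the residual standard error reported by summary():

set.seed(1)
n <- 40; x <- runif(n); y <- 1 + 2*x + rnorm(n)   # simulated data
fit <- lm(y ~ x); p <- length(coef(fit))
rss <- sum(resid(fit)^2)
sigma2.hat <- rss / n               # biased MLE of sigma^2
s2 <- rss / (n - p)                 # unbiased estimator
c(sqrt(s2), summary(fit)$sigma)     # identical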

Goodness of fit in linear models

Consider

SS_Total = Σ_{i=1}^n (Y_i − Ȳ)²,   SS_Reg = Σ_{i=1}^n (Ŷ_i − Ȳ)²,   SS_Res = Σ_{i=1}^n (Y_i − Ŷ_i)²,

where Ȳ = (1/n) Σ_{i=1}^n Y_i. It follows that SS_Total = SS_Reg + SS_Res; therefore

R² := SS_Reg / SS_Total = 1 − SS_Res / SS_Total

gives the proportion of the total variability explained by the regression. R² is called the multiple coefficient of determination.
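
The decomposition and the two forms of R² can be verified numerically; a small sketch with simulated data:

set.seed(1)
n <- 40; x <- runif(n); y <- 1 + 2*x + rnorm(n)   # simulated data
fit <- lm(y ~ x)
ss.total <- sum((y - mean(y))^2)
ss.reg   <- sum((fitted(fit) - mean(y))^2)
ss.res   <- sum(resid(fit)^2)
c(ss.reg + ss.res, ss.total)                  # decomposition holds
c(ss.reg/ss.total, summary(fit)$r.squared)    # R^2 two ways, identical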

Statistical hypothesis tests in LMs: F-test

The restriction H_0: Cβ = d, for C ∈ R^{r×p} with rank(C) = r and d ∈ R^r, is called a general linear hypothesis, with alternative H_1: not H_0. Consider the restricted least squares problem

SS_{H_0} = min_{β: Cβ = d} ||Y − Xβ||²
         = SS_Res + (Cβ̂ − d)^T [C (X^T X)^{-1} C^T]^{-1} (Cβ̂ − d)

Under normality, and if H_0: Cβ = d is valid, we have

F = [(SS_{H_0} − SS_Res)/r] / [SS_Res/(n − p)] ~ F_{r, n−p};

therefore the F-test is: reject H_0 at level α if F > F_{1−α, r, n−p}. Here F_{n,m} denotes the F distribution with n numerator and m denominator degrees of freedom, and F_{1−α, n, m} is the corresponding 1 − α quantile.
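
A sketch of the F statistic computed directly from these formulas, for the hypothesis H_0: β_1 = β_2 = 0 (so d = 0); the matrix C and the simulated data are made up for the example:

set.seed(1)
n <- 40; x1 <- runif(n); x2 <- runif(n)
y <- 1 + 0.5*x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit); p <- ncol(X); b <- coef(fit)
C <- rbind(c(0, 1, 0),   # beta_1 = 0
           c(0, 0, 1))   # beta_2 = 0
r <- nrow(C); ss.res <- sum(resid(fit)^2)
Cb <- C %*% b            # C beta-hat - d, with d = 0
num <- t(Cb) %*% solve(C %*% solve(crossprod(X)) %*% t(C)) %*% Cb
Fstat <- drop(num / r) / (ss.res / (n - p))
Fstat > qf(0.95, r, n - p)   # reject H_0 at the 5% level?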

Statistical hypothesis tests in LMs: t-test

Consider for each regression parameter β_j the hypothesis H_0j: β_j = 0 against H_1j: not H_0j, and use the corresponding F statistic

F_j := [(SS_{H_0j} − SS_Res)/1] / [SS_Res/(n − p)] ~ F_{1, n−p} under H_0j.

It follows that F_j = T_j², where

T_j := β̂_j / ŝe(β̂_j) ~ t_{n−p} under H_0j,

and ŝe(β̂_j) is the estimated standard error of β̂_j, i.e. ŝe(β̂_j) = s √((X^T X)^{-1}_jj).

Weighted Least Squares

Y = Xβ + ε,   ε ~ N_n(0, σ² W),   X ∈ R^{n×p},   p = k + 1,   β = (β_0, ..., β_k)^T

The MLE of β (the weighted LSE) is given by

β̂ = (X^T W^{-1} X)^{-1} X^T W^{-1} Y.

It follows that E(β̂) = β and Var(β̂) = σ² (X^T W^{-1} X)^{-1}. The MLE of σ² is given by

σ̂² := (1/n) (Y − Xβ̂)^T W^{-1} (Y − Xβ̂) = (1/n) e^T W^{-1} e.
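
In R's lm(), the weights argument assumes Var(ε_i) = σ²/w_i, so for a diagonal W one passes w_i = 1/W_ii. A hedged sketch with simulated heteroscedastic data, made up for the example:

set.seed(1)
n <- 40; x <- runif(n)
Wii <- runif(n, 0.5, 2)                   # assumed known diagonal of W
y <- 1 + 2*x + rnorm(n, sd = sqrt(Wii))   # Var(eps_i) = sigma^2 * W_ii, sigma = 1
fitw <- lm(y ~ x, weights = 1 / Wii)      # weighted LSE for diagonal W
coef(fitw)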

Example: Life expectancies: First models

Using the original scale for HOSP.POP (in S-Plus, the underscore is an assignment operator):

> attach(health.data)
> f1_LIFE.EXP ~ TEMP + URBAN + HOSP.POP
> r1_lm(f1, na.action = na.omit)
> summary(r1)
Call: lm(formula = f1, na.action = na.omit)
Residuals:
   Min    1Q Median   3Q  Max
 -18.7 -4.72   1.63 4.89 14.7
Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) 60.666       9.088   6.676    0.000
TEMP        -0.167       0.113  -1.480    0.150
URBAN        0.232       0.063   3.710    0.001
HOSP.POP    33.756      30.373   1.111    0.276
Residual standard error: 7.3 on 29 degrees of freedom
Multiple R-Squared: 0.46
F-statistic: 8.3 on 3 and 29 degrees of freedom, the p-value is 0.00039
Correlation of Coefficients:
          (Intercept)  TEMP URBAN
TEMP            -0.90
URBAN           -0.59  0.24
HOSP.POP        -0.24  0.18 -0.14

TEMP and HOSP.POP are nonsignificant at the 10% level.

Example: Life expectancies: First models

Using the logarithmic scale for HOSP.POP:

> f2_LIFE.EXP ~ TEMP + URBAN + log(HOSP.POP)
> r2_lm(f2, na.action = na.omit)
> summary(r2)
Call: lm(formula = f2, na.action = na.omit)
Residuals:
   Min   1Q Median   3Q  Max
 -17.6 -4.2   1.71 5.07 14.4
Coefficients:
                Value Std. Error t value Pr(>|t|)
(Intercept)    72.964      9.917   7.357    0.000
TEMP           -0.151      0.109  -1.387    0.176
URBAN           0.213      0.061   3.476    0.002
log(HOSP.POP)   3.133      1.638   1.912    0.066
Residual standard error: 7 on 29 degrees of freedom
Multiple R-Squared: 0.5
F-statistic: 9.7 on 3 and 29 degrees of freedom, the p-value is 0.00013
Correlation of Coefficients:
               (Intercept)  TEMP URBAN
TEMP                 -0.65
URBAN                -0.67  0.22
log(HOSP.POP)         0.52  0.19 -0.24

TEMP is still nonsignificant, while log(HOSP.POP) is now significant at the 10% level. 50% of the total variability is explained by the regression.

ANOVA Tables in LMs (ANOVA = ANalysis Of VAriance)

Model    Formula              SS
null     Y_i = β_0 + ε_i      SSE_0 = Σ_{i=1}^n (Y_i − β̂_0)² = Σ_{i=1}^n (Y_i − Ȳ)²
full     Y = Xβ + ε           SSE(X) = ||Y − Xβ̂||²
reduced  Y = X_1 β_1 + ε      SSE(X_1) = ||Y − X_1 β̂_1||²

ANOVA table:

Source      df     Sum of Sq                MS                       F
regression  p − 1  SS_Reg = SSE_0 − SSE(X)  MS_Reg = SS_Reg/(p − 1)  MS_Reg / s²
residual    n − p  SSE(X)                   s² = SSE(X)/(n − p)
total       n − 1  SSE_0

Hierarchical ANOVA Tables in LMs

Y = Xβ + ε = X_1 β_1 + X_2 β_2 + ε,   X ∈ R^{n×p}, X_1 ∈ R^{n×p_1}, X_2 ∈ R^{n×p_2}, p_1 + p_2 = p

ANOVA table:

Source         df       Sum of Sq          MS                                      F
X_1            p_1 − 1  SSE_0 − SSE(X_1)   MS(X_1) = (SSE_0 − SSE(X_1))/(p_1 − 1)  F(X_1) = MS(X_1)/(SSE(X_1)/(n − p_1))
X_2 given X_1  p_2      SSE(X_1) − SSE(X)  MS(X_2|X_1) = (SSE(X_1) − SSE(X))/p_2   F(X_2|X_1) = MS(X_2|X_1)/s²
residual       n − p    SSE(X)             s² = SSE(X)/(n − p)
total          n − 1    SSE_0

F(X_1) tests H_0: β_1 = 0 in Y = X_1 β_1 + ε (overall F-test in the reduced model).
F(X_2|X_1) tests H_0: β_2 = 0 in Y = X_1 β_1 + X_2 β_2 + ε (partial F-test).

ANOVA Tables in Splus

> r <- lm(y ~ x1 + x2 + x3)
> anova(r)

Source    df     Sum of Sq                           MS             F
x1        1      SS_1 = SSE_0 − SSE(x1)              MS_1 = SS_1/1  MS_1/s²
x2        1      SS_2 = SSE(x1) − SSE(x1,x2)         MS_2 = SS_2/1  MS_2/s²
x3        1      SS_3 = SSE(x1,x2) − SSE(x1,x2,x3)   MS_3 = SS_3/1  MS_3/s²
residual  n − p  SSE(x1,x2,x3)                       s² = SSE(X)/(n − 4)

These F-values cannot be interpreted as partial F-values, since the denominator always assumes the full model. You need to replace it by s²_1 = SSE(x1)/(n − 2), s²_2 = SSE(x1,x2)/(n − 3), and s²_3 = SSE(x1,x2,x3)/(n − 4) = s², respectively.
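
The same caveat applies to anova() in modern R. A sketch, with simulated data, of recomputing F(x1) with the reduced-model denominator s²_1 = SSE(x1)/(n − 2):

set.seed(1)
n <- 40; x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y <- 1 + x1 + rnorm(n)
r <- lm(y ~ x1 + x2 + x3)
a <- anova(r)                        # sequential SS; F uses the full-model s^2
sse1 <- sum(resid(lm(y ~ x1))^2)     # SSE(x1)
a[["Sum Sq"]][1] / (sse1 / (n - 2))  # F(x1) with denominator s_1^2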

Example: Life Expectancies: ANOVA Tables

ANOVA Table (original scale for HOSP.POP):

> anova(r1)
Analysis of Variance Table
Response: LIFE.EXP
Terms added sequentially (first to last)
          Df Sum of Sq Mean Sq F Value Pr(F)
TEMP       1       455     455     8.5 0.007
URBAN      1       815     815    15.2 0.001
HOSP.POP   1        66      66     1.2 0.276
Residuals 29      1555      54

ANOVA Table (log scale for HOSP.POP):

> anova(r2)
Analysis of Variance Table
Response: LIFE.EXP
Terms added sequentially (first to last)
              Df Sum of Sq Mean Sq F Value  Pr(F)
TEMP           1       455     455     9.2 0.0051
URBAN          1       815     815    16.4 0.0003
log(HOSP.POP)  1       182     182     3.7 0.0658
Residuals     29      1440      50

Regression diagnostics for LMs

The goal is to determine:

- outliers with regard to the response (y-outliers),
- outliers with regard to the design space covered by the covariates (x-outliers or high leverage points),
- observations which change the results greatly when removed (influential observations).

A general tool for this is to consider the hat matrix and residuals.

Residuals in LMs: hat matrix

H := X (X^T X)^{-1} X^T                      hat matrix
Ŷ := Xβ̂ = X (X^T X)^{-1} X^T Y = HY          fitted values
r := Y − Ŷ = (I − H) Y                       residual vector

Properties:

- H is a projection matrix, i.e. H² = H, and symmetric.
- E(r) = (I − H) E(Y) = (I − H) Xβ = Xβ − HXβ = Xβ − Xβ = 0.
- Var(r) = σ² (I − H)(I − H)^T = σ² (I − H)
- Var(r_i) = σ² (1 − h_ii), where h_ii is the i-th diagonal element of H.
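
A sketch verifying the projection properties of H on simulated data made up for the example:

set.seed(1)
n <- 10; x <- runif(n); y <- 1 + x + rnorm(n)
X <- model.matrix(~ x)
H <- X %*% solve(crossprod(X)) %*% t(X)              # hat matrix
all.equal(H %*% H, H)                                # H^2 = H (projection)
all.equal(drop(H %*% y), unname(fitted(lm(y ~ x))))  # HY gives the fitted values
sum(diag(H))                                         # trace(H) = p = 2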

Standardized Residuals in LMs

s_i := r_i / (√(1 − h_ii) s),   where s² := ||Y − Xβ̂||² / (n − p),

are called standardized residuals or internally studentized residuals; h_ii is called the leverage of observation i.

Problem: the estimate s² is biased when outliers are present in the data; therefore one wants to estimate σ² without the i-th observation.

Jackknifed Residuals in LMs

Y_(i) = X_(i) β + ε_(i)        model without the i-th observation
X_(i)                          design matrix without the i-th observation
β̂_(i)                          corresponding LSE of β
Ŷ_{i,(i)} := x_i^T β̂_(i)       i-th fitted value without using the i-th observation
r_{i,(i)} := Y_i − Ŷ_{i,(i)}   predictive residual
s²_(i) := Σ_{j≠i} (Y_j − x_j^T β̂_(i))² / (n − p − 1)   estimate of σ² in the model without the i-th observation
t_i := r_i / (√(1 − h_ii) s_(i))   externally studentized or jackknifed residuals

Jackknifed Residuals in LMs

For a fast computation one can use linear algebra to show that

β̂_(i) = β̂ − (X^T X)^{-1} x_i r_i / (1 − h_ii)
r_{i,(i)} = r_i / (1 − h_ii)
s²_(i) = [(n − p) s² − r_i² / (1 − h_ii)] / (n − p − 1)

Residual plots are plots where the observation number i or the j-th covariate x_ij is plotted against r_i, s_i, or t_i. There is no problem with the model if these plots look like a band. Deviations from this structure might indicate nonlinear regression effects or a violation of the variance homogeneity.
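
These shortcut formulas can be checked against R's rstudent(), which computes the jackknifed residuals without refitting n models; a minimal sketch with simulated data:

set.seed(1)
n <- 20; x <- runif(n); y <- 1 + x + rnorm(n)
fit <- lm(y ~ x); p <- 2
r <- resid(fit); h <- hatvalues(fit)
s2  <- sum(r^2) / (n - p)
s2i <- ((n - p)*s2 - r^2/(1 - h)) / (n - p - 1)  # s_(i)^2 via the shortcut
ti  <- r / (sqrt(1 - h) * sqrt(s2i))
all.equal(ti, rstudent(fit))                     # TRUE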

Residuals in Splus

r_lm(y ~ x1 + x2 + ... + xk, x = T)

Raw residuals r_i = Y_i − Ŷ_i:                                   e_resid(summary(r))
Residual standard error s = √(||Y − Xβ̂||²/(n − p)):              sigma_summary(r)$sigma
External residual standard error s_(i):                          sigmai_lm.influence(r)$sigma
Hat diagonals h_ii:                                              hi_hat(r$x)  or  hi_lm.influence(r)$hat
Internally studentized residuals s_i = r_i/(√(1 − h_ii) s):      si_e/(sigma*((1-hi)^.5))
Externally studentized residuals t_i = r_i/(√(1 − h_ii) s_(i)):  ti_e/(sigmai*((1-hi)^.5))

High leverage in LMs

Since h_ii = x_i^T (X^T X)^{-1} x_i, one can interpret h_ii as a standardized measure of the distance between x_i and x̄. Further we have

Σ_{i=1}^n h_ii = p;

therefore we call points with

h_ii > 2p/n

high leverage points or x-outliers.
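
A sketch of the 2p/n rule in R, with a design point planted far from the rest:

set.seed(1)
n <- 20; x <- c(runif(n - 1), 5)   # last design point far from the others
y <- 1 + x + rnorm(n)
fit <- lm(y ~ x)
h <- hatvalues(fit)
sum(h)             # equals p = 2
which(h > 2*2/n)   # flags the planted high leverage point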

Outlier detection in LMs

To detect y-outliers we can use t_i, since it can be written as

t_i = (Y_i − Ŷ_{i,(i)}) / √(Vâr(Y_i − Ŷ_{i,(i)})) = δ̂_i / √(Vâr(δ̂_i))   for δ_i := Y_i − Ŷ_{i,(i)}.

This is the test statistic in the mean shift outlier model for observation l, given by

Y_i = x_i^T β + γ D_li + ε_i   with D_li = 1 if l = i, and 0 if l ≠ i.

Therefore reject H: γ = 0 versus K: γ ≠ 0 at level α iff |t_l| > t_{n−p−1, 1−α/2}, where t_{m,α} is the α quantile of a t_m distribution.

Problem: if one uses this outlier test for every observation l, one has the problem of multiple testing, since one needs to substitute α by α/n.
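
A sketch of the Bonferroni-corrected outlier test using jackknifed residuals, with a y-outlier planted at observation 7 of simulated data:

set.seed(1)
n <- 30; x <- runif(n)
y <- 1 + x + rnorm(n); y[7] <- y[7] + 6     # planted y-outlier
fit <- lm(y ~ x); p <- 2
t <- rstudent(fit)
crit <- qt(1 - (0.05/n)/2, df = n - p - 1)  # alpha replaced by alpha/n
which(abs(t) > crit)                        # should flag observation 7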

Leverage and Influence

[Figure A: scatterplot of y against x with fitted lines computed with and without obs 20]
[Figure B: scatterplot of y against x with fitted lines computed with and without obs 10]
[Figure C: scatterplot of y against x with fitted lines computed with and without obs 20]

Leverage and Influence

Figure A shows that obs 20 is a high leverage point which is also influential. Figure B shows that obs 10 is neither a high leverage point nor influential. Figure C shows that obs 20 is a high leverage point which is not influential.

A real-valued measure for influential observations is Cook's distance, defined as

D_i := (β̂ − β̂_(i))^T (X^T X) (β̂ − β̂_(i)) / (p s²),

which can be interpreted as the shift in the confidence region when the i-th observation is deleted. Therefore observations with D_i > 1 are considered influential. The cutpoint 1 corresponds to the i-th observation moving β̂ to the edge of a 50% confidence region.
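
Cook's distance is available directly in R; a sketch with an influential point planted in simulated data:

set.seed(1)
n <- 20; x <- c(runif(n - 1), 5)
y <- c(1 + x[-n] + rnorm(n - 1), 20)   # last obs: high leverage and far off the line
fit <- lm(y ~ x)
D <- cooks.distance(fit)               # D_i as defined above
which(D > 1)                           # flags the influential observation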

Example: Life expectancies: Residuals and hat values

[Index plot of the externally studentized residuals: United States and Chile stand out]
[Index plot of the hat diagonals: Australia, Papua New Guinea, and Norway have large values]

Example: Life expectancies: Residual plots

[Raw residuals plotted against TEMP, URBAN, and log(HOSP.POP); United States and Chile are labeled in each panel]

The bottom right panel plots r_i versus r_{i−1} to detect correlated errors.

Splus function plot() for LMs

> r_lm(y~x)
> plot(r)

Plot 1: Ŷ_i versus r_i, to check for linearity and variance homogeneity.
Plot 2: Ŷ_i versus √|r_i|, to check for large residuals.
Plot 3: Ŷ_i versus Y_i; should cluster around the line y = x if the fit is good.
Plot 4: qq-plot of r_i, to check normality of the errors.
Plot 5: Residual-fit (r-f) spread plot: f-values (:= ((1:n) − .5)/n) against the ordered fitted values (left panel) and the ordered residuals (right panel). When a good fit is present, the spread of the residuals should be much smaller than that of the fitted values.
Plot 6: i versus D_i, to check for influential observations.

Example: Life expectancies: Output from plot()

> plot(r2)

[Six diagnostic plots for the fit r2 = lm(LIFE.EXP ~ TEMP + URBAN + log(HOSP.POP)): residuals versus fitted values, √|residuals| versus fitted values, LIFE.EXP versus fitted values, normal qq-plot of the residuals, residual-fit spread plot, and Cook's distance by index (observations 22, 14, and 8 are largest)]