Lecture 1: Linear Models and Applications
Lecture 1: Linear Models and Applications
Claudia Czado, TU München
© (Claudia Czado, TU Munich) ZFS/IMS Göttingen
Overview
- Introduction to linear models
- Exploratory data analysis (EDA)
- Estimation and testing in linear models
- Regression diagnostics
Introduction to linear models

The goal of regression models is to determine how a response variable depends on covariates. A special class of regression models are linear models. The general setup is given by

Data:       (Y_i, x_{i1}, ..., x_{ik}), i = 1, ..., n
Responses:  Y = (Y_1, ..., Y_n)^T
Covariates: x_i = (x_{i1}, ..., x_{ik})^T (known)
Example: Life expectancies from 40 countries

Source: The Economist's Book of World Statistics, 1990, Time Books; The World Almanac of Facts, 1992, World Almanac Books.

LIFE.EXP   Life expectancy at birth
TEMP       Temperature in degrees Fahrenheit
URBAN      Percent of population living in urban areas
HOSP.POP   Number of hospitals per population
COUNTRY    The name of the country

Which is the response and which are the covariates?
Linear models (LM) under normality assumption

Y_i = β_0 + β_1 x_{i1} + ... + β_k x_{ik} + ε_i,  ε_i ~ N(0, σ²) iid, i = 1, ..., n,

where the unknown regression parameters β_0, ..., β_k and the unknown error variance σ² need to be estimated. Note that E(Y_i) is a linear function in β_0, ..., β_k:

E(Y_i) = β_0 + β_1 x_{i1} + ... + β_k x_{ik}
Var(Y_i) = Var(ε_i) = σ²   (variance homogeneity)
LMs in matrix notation

Y = Xβ + ε,  X ∈ R^{n×p}, p = k + 1, β = (β_0, ..., β_k)^T
E(Y) = Xβ,  Var(Y) = σ² I_n

Under the normality assumption we have Y ~ N_n(Xβ, σ² I_n), where N_n(μ, Σ) is the n-dimensional normal distribution with mean vector μ and covariance matrix Σ. The matrix X is also called the design matrix.
Exploratory data analysis (EDA)

- Consider the ranges of the responses and covariates. When covariates are discrete, group covariate levels if they are sparse.
- Plot the covariates against the responses. These scatter plots should look linear; otherwise consider transformations of the covariates.
- To check whether the constant variance assumption is reasonable, the scatter plots of covariates against the responses should be contained in a band.
Example: Life expectancies: Data summaries

> summary(health.data)
    LIFE.EXP        TEMP         URBAN
 Min.   :37    Min.   :36    Min.   :14
 1st Qu.:56    1st Qu.:52    1st Qu.:42
 Median :67    Median :61    Median :62
 Mean   :65    Mean   :62    Mean   :59
 3rd Qu.:73    3rd Qu.:73    3rd Qu.:76
 Max.   :79    Max.   :85    Max.   :96
               NA's   : 1
    HOSP.POP
 Min.   :
 1st Qu.:
 Median :
 Mean   :
 3rd Qu.:
 Max.   :
 NA's   :

NA = not available (missing data)
Example: Life expectancies: EDA

[Scatterplot matrix of LIFE.EXP, TEMP, URBAN, HOSP.POP]

- LIFE.EXP decreases if TEMP increases
- LIFE.EXP increases if URBAN increases
- LIFE.EXP increases if HOSP.POP increases

Check for nonlinearities and nonconstant variances.
Example: Life expectancies: Transformations

[Scatterplots of LIFE.EXP against HOSP.POP and against log(HOSP.POP), with countries labeled, e.g. France, United States, Japan, Australia, New Zealand, Norway, Egypt, China, Tonga, Papua New Guinea, India]

The logarithm makes the relationship more linear.
Parameter estimation of β

Under the normality assumption it is enough to minimize Q(β) := ||Y − Xβ||² over β ∈ R^p to calculate the maximum likelihood estimate (MLE) of β. An estimator which minimizes Q(β) is also called a least squares estimator (LSE). If the matrix X ∈ R^{n×p} has full rank p, the minimum of Q(β) is attained at

β̂ = (X^T X)^{-1} X^T Y

SS_Res := Σ_{i=1}^n (Y_i − x_i^T β̂)² = Q(β̂)   residual sum of squares
Ŷ_i := x_i^T β̂   fitted value for μ_i := E(Y_i)
e_i := Y_i − Ŷ_i   raw residual

It follows that E(β̂) = β and Var(β̂) = σ² (X^T X)^{-1}; therefore one has under normality

β̂ ~ N_p(β, σ² (X^T X)^{-1}).
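The least squares estimate and the quantities above can be sketched numerically; the following is a minimal illustration in Python/numpy (the lecture itself uses Splus) on a made-up toy dataset, not the lecture's health data.

```python
import numpy as np

# Made-up toy dataset: n = 8 observations, intercept plus k = 2 covariates.
X = np.column_stack([np.ones(8),                 # intercept column
                     [1., 2, 3, 4, 5, 6, 7, 8],  # covariate x1
                     [2., 1, 4, 3, 6, 5, 8, 7]]) # covariate x2
y = np.array([3., 4, 6, 5, 8, 9, 11, 10])

# LSE: beta_hat = (X^T X)^{-1} X^T Y, assuming X has full rank p
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_fit = X @ beta_hat   # fitted values Y_hat_i = x_i^T beta_hat
e = y - y_fit          # raw residuals e_i
ss_res = e @ e         # residual sum of squares Q(beta_hat)

# Cross-check against numpy's built-in least squares solver.
beta_check, *_ = np.linalg.lstsq(X, y, rcond=None)
```

By the normal equations the residual vector is orthogonal to every column of X, which is a quick sanity check on any implementation.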
Parameter estimation of σ²

The MLE of σ² is given by

σ̂² := (1/n) Σ_{i=1}^n (Y_i − x_i^T β̂)² = (1/n) Σ_{i=1}^n (Y_i − Ŷ_i)²

One can show that this estimator is biased; in particular E(σ̂²) = ((n − p)/n) σ². An unbiased estimator for σ² is therefore given by

s² := (1/(n − p)) Σ_{i=1}^n (Y_i − Ŷ_i)² = (n/(n − p)) σ̂²

Under normality it follows that

(n − p) s² / σ² = Σ_{i=1}^n (Y_i − Ŷ_i)² / σ² ~ χ²_{n−p}

and is independent of β̂.
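The relation s² = (n/(n − p)) σ̂² between the biased MLE and the unbiased estimator can be checked directly; a small sketch on the same kind of made-up toy data as above:

```python
import numpy as np

# Made-up toy dataset (assumed numbers, not from the lecture).
X = np.column_stack([np.ones(8),
                     [1., 2, 3, 4, 5, 6, 7, 8],
                     [2., 1, 4, 3, 6, 5, 8, 7]])
y = np.array([3., 4, 6, 5, 8, 9, 11, 10])
n, p = X.shape

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
ss_res = np.sum((y - X @ beta_hat) ** 2)

sigma2_mle = ss_res / n   # biased MLE of sigma^2, divides by n
s2 = ss_res / (n - p)     # unbiased estimator s^2, divides by n - p
```

Since n/(n − p) > 1, the MLE always underestimates relative to s² on the same data.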
Goodness of fit in linear models

Consider

SS_Total = Σ_{i=1}^n (Y_i − Ȳ)²,  SS_Reg = Σ_{i=1}^n (Ŷ_i − Ȳ)²,  SS_Res = Σ_{i=1}^n (Y_i − Ŷ_i)²,

where Ȳ = (1/n) Σ_{i=1}^n Y_i. It follows that SS_Total = SS_Reg + SS_Res; therefore

R² := SS_Reg / SS_Total = 1 − SS_Res / SS_Total

gives the proportion of variability explained by the regression. R² is called the multiple coefficient of determination.
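The decomposition SS_Total = SS_Reg + SS_Res (which relies on the model containing an intercept) and the two equivalent forms of R² can be verified numerically; a sketch on a made-up toy dataset:

```python
import numpy as np

# Made-up toy dataset; the model includes an intercept column.
X = np.column_stack([np.ones(8),
                     [1., 2, 3, 4, 5, 6, 7, 8],
                     [2., 1, 4, 3, 6, 5, 8, 7]])
y = np.array([3., 4, 6, 5, 8, 9, 11, 10])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_fit = X @ beta_hat
y_bar = y.mean()

ss_total = np.sum((y - y_bar) ** 2)   # total variability
ss_reg = np.sum((y_fit - y_bar) ** 2) # variability explained by the regression
ss_res = np.sum((y - y_fit) ** 2)     # residual variability

r2 = ss_reg / ss_total                # multiple coefficient of determination
```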
Statistical hypothesis tests in LMs: F-test

The restriction H_0: Cβ = d for C ∈ R^{r×p}, rank(C) = r, d ∈ R^r is called a general linear hypothesis, with alternative H_1: not H_0. Consider the restricted least squares problem

SS_{H_0} = min_{β: Cβ = d} ||Y − Xβ||²   (minimization under H_0)
         = SS_Res + (Cβ̂ − d)^T [C (X^T X)^{-1} C^T]^{-1} (Cβ̂ − d)

Under normality, and if H_0: Cβ = d is valid, we have

F = ((SS_{H_0} − SS_Res)/r) / (SS_Res/(n − p)) ~ F_{r, n−p},

therefore the F-test is given by: reject H_0 at level α if F > F_{1−α, r, n−p}. Here F_{n,m} denotes the F distribution with n numerator and m denominator degrees of freedom, and F_{1−α, n, m} is the corresponding 1 − α quantile.
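The closed form for SS_{H_0} can be checked against a direct restricted fit; the sketch below (made-up toy data, hypothesis H_0: β_1 = β_2 chosen for illustration) computes SS_{H_0} both ways and the resulting F statistic:

```python
import numpy as np

# Made-up toy dataset.
X = np.column_stack([np.ones(8),
                     [1., 2, 3, 4, 5, 6, 7, 8],
                     [2., 1, 4, 3, 6, 5, 8, 7]])
y = np.array([3., 4, 6, 5, 8, 9, 11, 10])
n, p = X.shape

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
ss_res = np.sum((y - X @ beta_hat) ** 2)

# General linear hypothesis H0: beta_1 = beta_2, i.e. C beta = d
C = np.array([[0., 1., -1.]])   # r = 1 restriction
d = np.array([0.])
r = C.shape[0]

XtX_inv = np.linalg.inv(X.T @ X)
dev = C @ beta_hat - d
# Closed form: SS_H0 = SS_Res + (C b - d)^T [C (X^T X)^{-1} C^T]^{-1} (C b - d)
ss_h0 = ss_res + dev @ np.linalg.solve(C @ XtX_inv @ C.T, dev)

# Same SS_H0 via a direct restricted fit: under beta_1 = beta_2 the model
# collapses to y = beta_0 + beta_1 * (x1 + x2) + eps.
X_restr = np.column_stack([np.ones(8), X[:, 1] + X[:, 2]])
b_restr = np.linalg.solve(X_restr.T @ X_restr, X_restr.T @ y)
ss_h0_direct = np.sum((y - X_restr @ b_restr) ** 2)

F = ((ss_h0 - ss_res) / r) / (ss_res / (n - p))
```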
Statistical hypothesis tests in LMs: t-test

Consider for each regression parameter β_j the hypothesis H_{0j}: β_j = 0 against H_{1j}: not H_{0j}, and use the corresponding F statistic

F_j := ((SS_{H_{0j}} − SS_Res)/1) / (SS_Res/(n − p)) ~ F_{1, n−p} under H_{0j}.

It follows that F_j = T_j², where

T_j := β̂_j / ŝe(β̂_j) ~ t_{n−p} under H_{0j},

and ŝe(β̂_j) is the estimated standard error of β̂_j, i.e. ŝe(β̂_j) = s √((X^T X)^{-1}_{jj}).
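The identity F_j = T_j² can be checked numerically: restricting β_j = 0 is the same as refitting without column j, which gives SS_{H_{0j}} directly. A sketch on made-up toy data:

```python
import numpy as np

# Made-up toy dataset.
X = np.column_stack([np.ones(8),
                     [1., 2, 3, 4, 5, 6, 7, 8],
                     [2., 1, 4, 3, 6, 5, 8, 7]])
y = np.array([3., 4, 6, 5, 8, 9, 11, 10])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
ss_res = np.sum((y - X @ beta_hat) ** 2)
s = np.sqrt(ss_res / (n - p))

j = 2                               # test H0j: beta_2 = 0
se_j = s * np.sqrt(XtX_inv[j, j])   # estimated standard error of beta_hat_j
T_j = beta_hat[j] / se_j

# SS_H0j via refitting without column j (equivalent to restricting beta_j = 0)
X0 = np.delete(X, j, axis=1)
b0 = np.linalg.solve(X0.T @ X0, X0.T @ y)
ss_h0j = np.sum((y - X0 @ b0) ** 2)
F_j = (ss_h0j - ss_res) / (ss_res / (n - p))
```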
Weighted Least Squares

Y = Xβ + ε,  ε ~ N_n(0, σ² W),  X ∈ R^{n×p}, p = k + 1, β = (β_0, ..., β_k)^T

The MLE of β (weighted LSE) is given by

β̂ = (X^T W^{-1} X)^{-1} X^T W^{-1} Y.

It follows that E(β̂) = β and Var(β̂) = σ² (X^T W^{-1} X)^{-1}. The MLE of σ² is given by

σ̂² := (1/n) (Y − Xβ̂)^T W^{-1} (Y − Xβ̂) = (1/n) e^T W^{-1} e.
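For a diagonal W, the weighted LSE is the same as ordinary least squares on data premultiplied by W^{-1/2}, which is how it is often computed in practice. A sketch with made-up toy data and an assumed weight vector:

```python
import numpy as np

# Made-up toy dataset with an assumed diagonal weight matrix W,
# i.e. Var(eps_i) = sigma^2 * w_i.
X = np.column_stack([np.ones(8),
                     [1., 2, 3, 4, 5, 6, 7, 8],
                     [2., 1, 4, 3, 6, 5, 8, 7]])
y = np.array([3., 4, 6, 5, 8, 9, 11, 10])
w = np.array([1., 2, 1, 4, 1, 2, 1, 4])
W_inv = np.diag(1.0 / w)

# Weighted LSE: beta_hat = (X^T W^{-1} X)^{-1} X^T W^{-1} Y
beta_w = np.linalg.solve(X.T @ W_inv @ X, X.T @ W_inv @ y)

# Equivalent view: ordinary LS after rescaling rows by 1/sqrt(w_i),
# i.e. premultiplying by W^{-1/2}.
X_t = X / np.sqrt(w)[:, None]
y_t = y / np.sqrt(w)
beta_t = np.linalg.solve(X_t.T @ X_t, X_t.T @ y_t)
```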
Example: Life expectancies: First models
Using the original scale for HOSP.POP

> attach(health.data)
> f1 <- LIFE.EXP ~ TEMP + URBAN + HOSP.POP
> r1 <- lm(f1, na.action = na.omit)
> summary(r1)

Call: lm(formula = f1, na.action = na.omit)
Residuals:
    Min     1Q Median     3Q    Max
Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)
TEMP
URBAN
HOSP.POP

Residual standard error: 7.3 on 29 degrees of freedom
Multiple R-Squared: 0.46
F-statistic: 8.3 on 3 and 29 degrees of freedom, the p-value is
Correlation of Coefficients:
          (Intercept) TEMP URBAN
TEMP
URBAN
HOSP.POP

TEMP and HOSP.POP are nonsignificant at the 10% level.
Example: Life expectancies: First models
Using the logarithmic scale for HOSP.POP

> f2 <- LIFE.EXP ~ TEMP + URBAN + log(HOSP.POP)
> r2 <- lm(f2, na.action = na.omit)
> summary(r2)

Call: lm(formula = f2, na.action = na.omit)
Residuals:
    Min     1Q Median     3Q    Max
Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)
TEMP
URBAN
log(HOSP.POP)

Residual standard error: 7 on 29 degrees of freedom
Multiple R-Squared: 0.5
F-statistic: 9.7 on 3 and 29 degrees of freedom, the p-value is
Correlation of Coefficients:
              (Intercept) TEMP URBAN
TEMP
URBAN
log(HOSP.POP)

TEMP is still nonsignificant, while log(HOSP.POP) is now significant at the 10% level. 50% of the total variability is explained by the regression.
ANOVA tables in LMs

ANOVA = ANalysis Of VAriance

Model     Formula               SS
null      Y_i = β_0 + ε_i       SSE_0 = Σ_{i=1}^n (Y_i − β̂_0)² = Σ_{i=1}^n (Y_i − Ȳ)²
full      Y = Xβ + ε            SSE(X) = ||Y − Xβ̂||²
reduced   Y = X_1 β_1 + ε       SSE(X_1) = ||Y − X_1 β̂_1||²

ANOVA table:
Source       df      Sum of Sq                 MS                       F
regression   p − 1   SS_Reg = SSE_0 − SSE(X)   MS_Reg = SS_Reg/(p − 1)  MS_Reg/s²
residual     n − p   SSE(X)                    s² = SSE(X)/(n − p)
total        n − 1   SSE_0
Hierarchical ANOVA tables in LMs

Y = Xβ + ε = X_1 β_1 + X_2 β_2 + ε,  X ∈ R^{n×p}, X_1 ∈ R^{n×p_1}, X_2 ∈ R^{n×p_2}, p_1 + p_2 = p

ANOVA table:
Source          df        Sum of Sq            MS                                         F
X_1             p_1 − 1   SSE_0 − SSE(X_1)     MS(X_1) = (SSE_0 − SSE(X_1))/(p_1 − 1)     F(X_1) = MS(X_1)/(SSE(X_1)/(n − p_1))
X_2 given X_1   p_2       SSE(X_1) − SSE(X)    MS(X_2|X_1) = (SSE(X_1) − SSE(X))/p_2      F(X_2|X_1) = MS(X_2|X_1)/s²
residual        n − p     SSE(X)               s² = SSE(X)/(n − p)
total           n − 1     SSE_0

F(X_1) tests H_0: β_1 = 0 in Y = X_1 β_1 + ε (overall F-test in the reduced model).
F(X_2|X_1) tests H_0: β_2 = 0 in Y = X_1 β_1 + X_2 β_2 + ε (partial F-test).
ANOVA tables in Splus

> r <- lm(y ~ x1 + x2 + x3)
> anova(r)

Source     df      Sum of Sq                            MS             F
x1         1       SS_1 = SSE_0 − SSE(x1)               MS_1 = SS_1/1  MS_1/s²
x2         1       SS_2 = SSE(x1) − SSE(x1, x2)         MS_2 = SS_2/1  MS_2/s²
x3         1       SS_3 = SSE(x1, x2) − SSE(x1, x2, x3) MS_3 = SS_3/1  MS_3/s²
residual   n − p   SSE(x1, x2, x3)                      s² = SSE(X)/(n − 4)

These F-values cannot be interpreted as partial F-values, since the denominator always assumes the full model. You need to replace it by s_1² = SSE(x1)/(n − 2), s_2² = SSE(x1, x2)/(n − 3) and s_3² = SSE(x1, x2, x3)/(n − 4) = s², respectively.
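The sequential sums of squares above always add up: SS_1 + SS_2 + SS_3 + SSE(x1, x2, x3) = SSE_0. This can be sketched numerically (Python/numpy rather than Splus, on made-up toy data with three covariates):

```python
import numpy as np

# Made-up toy dataset with three covariates.
x1 = np.array([1., 2, 3, 4, 5, 6, 7, 8])
x2 = np.array([2., 1, 4, 3, 6, 5, 8, 7])
x3 = np.array([1., 3, 2, 5, 4, 7, 6, 8])
y = np.array([3., 4, 6, 5, 8, 9, 11, 10])
n = len(y)

def sse(*cols):
    """Residual sum of squares of a fit with intercept plus the given columns."""
    X = np.column_stack([np.ones(n), *cols])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ b) ** 2)

sse0 = np.sum((y - y.mean()) ** 2)    # null model: intercept only
ss1 = sse0 - sse(x1)                  # SS for x1
ss2 = sse(x1) - sse(x1, x2)           # SS for x2 given x1
ss3 = sse(x1, x2) - sse(x1, x2, x3)   # SS for x3 given x1, x2
sse_full = sse(x1, x2, x3)
```

Note that the sequential SS for x2 depends on x1 already being in the model, so reordering the terms changes the individual SS (but not their total).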
Example: Life expectancies: ANOVA tables

ANOVA table (original scale for HOSP.POP):

> anova(r1)
Analysis of Variance Table
Response: LIFE.EXP
Terms added sequentially (first to last)
               Df  Sum of Sq  Mean Sq  F Value  Pr(F)
TEMP
URBAN
HOSP.POP
Residuals

ANOVA table (log scale for HOSP.POP):

> anova(r2)
Analysis of Variance Table
Response: LIFE.EXP
Terms added sequentially (first to last)
               Df  Sum of Sq  Mean Sq  F Value  Pr(F)
TEMP
URBAN
log(HOSP.POP)
Residuals
Regression diagnostics for LMs

The goal is to determine
- outliers with regard to the response (y-outliers)
- outliers with regard to the design space covered by the covariates (x-outliers or high leverage points)
- observations which change the results greatly when removed (influential observations).

A general tool for this is to consider the hat matrix and the residuals.
Residuals in LMs: hat matrix

H := X (X^T X)^{-1} X^T   hat matrix
Ŷ := X β̂ = X (X^T X)^{-1} X^T Y = H Y   fitted values
r := Y − Ŷ = (I − H) Y   residual vector

Properties:
- H is a projection matrix, i.e. H² = H, and symmetric.
- E(r) = (I − H) E(Y) = (I − H) Xβ = Xβ − HXβ = Xβ − Xβ = 0.
- Var(r) = σ² (I − H)(I − H)^T = σ² (I − H)
- Var(r_i) = σ² (1 − h_ii), where h_ii is the i-th diagonal element of H.
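These properties are easy to confirm numerically; the sketch below (made-up toy data) also checks trace(H) = p, which is used later for the leverage cutoff:

```python
import numpy as np

# Made-up toy dataset.
X = np.column_stack([np.ones(8),
                     [1., 2, 3, 4, 5, 6, 7, 8],
                     [2., 1, 4, 3, 6, 5, 8, 7]])
y = np.array([3., 4, 6, 5, 8, 9, 11, 10])
n, p = X.shape

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X^T X)^{-1} X^T
r = (np.eye(n) - H) @ y                 # residual vector r = (I - H) Y
```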
Standardized residuals in LMs

s_i := r_i / (s √(1 − h_ii)),  where s² := ||Y − Xβ̂||² / (n − p),

are called standardized residuals or internally studentized residuals; h_ii is called the leverage of observation i.

Problem: The estimate s² is biased when outliers are present in the data; therefore one wants to estimate σ² without the i-th observation.
Jackknifed residuals in LMs

Y_(i) = X_(i) β + ε_(i)   model without the i-th observation
X_(i) = design matrix without the i-th observation
β̂_(i) := corresponding LSE of β
Ŷ_{i,(i)} := x_i^T β̂_(i)   i-th fitted value without using the i-th observation
r_{i,(i)} := Y_i − Ŷ_{i,(i)}   predictive residual
s_(i)² := Σ_{j=1, j≠i}^n (Y_j − x_j^T β̂_(i))² / (n − p − 1)   estimate of σ² in the model without the i-th observation
t_i := r_i / (s_(i) √(1 − h_ii))   externally studentized or jackknifed residual
Jackknifed residuals in LMs (cont.)

For a fast computation one can use linear algebra to show that

β̂_(i) = β̂ − (X^T X)^{-1} x_i r_i / (1 − h_ii)
r_{i,(i)} = r_i / (1 − h_ii)
s_(i)² = ((n − p) s² − r_i² / (1 − h_ii)) / (n − p − 1)

Residual plots are plots where the observation number i or the j-th covariate x_ij is plotted against r_i, s_i or t_i. There is no problem with the model if these plots look like a band. Deviations from this structure might indicate nonlinear regression effects or a violation of the variance homogeneity.
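The shortcut formulas can be checked against an explicit leave-one-out refit; a sketch on made-up toy data, verifying all three identities for one observation:

```python
import numpy as np

# Made-up toy dataset.
X = np.column_stack([np.ones(8),
                     [1., 2, 3, 4, 5, 6, 7, 8],
                     [2., 1, 4, 3, 6, 5, 8, 7]])
y = np.array([3., 4, 6, 5, 8, 9, 11, 10])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
r = y - X @ beta_hat
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages h_ii
s2 = (r @ r) / (n - p)

i = 4   # check the formulas for one observation

# Shortcut formulas (no refit needed)
beta_i_short = beta_hat - XtX_inv @ X[i] * r[i] / (1 - h[i])
pred_res_short = r[i] / (1 - h[i])
s2_i_short = ((n - p) * s2 - r[i] ** 2 / (1 - h[i])) / (n - p - 1)

# Explicit refit without observation i
Xi = np.delete(X, i, axis=0)
yi = np.delete(y, i)
beta_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
pred_res = y[i] - X[i] @ beta_i
s2_i = np.sum((yi - Xi @ beta_i) ** 2) / (n - p - 1)

t_i = r[i] / (np.sqrt(s2_i) * np.sqrt(1 - h[i]))   # jackknifed residual
```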
Residuals in Splus

r <- lm(y ~ x1 + x2 + ... + xk, x = T)

Raw residuals: r_i = Y_i − Ŷ_i
  e <- resid(summary(r))
Residual standard error: s = √(||Y − Xβ̂||² / (n − p))
  sigma <- summary(r)$sigma
External residual standard error: s_(i)
  sigmai <- lm.influence(r)$sigma
Hat diagonals: h_ii
  hi <- hat(r$x)   or   hi <- lm.influence(r)$hat
Internally studentized residuals: s_i = r_i / (s √(1 − h_ii))
  si <- e/(sigma*((1-hi)^.5))
Externally studentized residuals: t_i = r_i / (s_(i) √(1 − h_ii))
  ti <- e/(sigmai*((1-hi)^.5))
High leverage in LMs

Since h_ii = x_i^T (X^T X)^{-1} x_i, one can interpret h_ii as a standardized distance measure between x_i and x̄. Further we have

Σ_{i=1}^n h_ii = p,

therefore we call points with

h_ii > 2p/n

high leverage points or x-outliers.
Outlier detection in LMs

To detect y-outliers we can use t_i, since it can be written as

t_i = (Y_i − Ŷ_{i,(i)}) / √(V̂ar(Y_i − Ŷ_{i,(i)})) = δ̂_i / √(V̂ar(δ_i))   for δ_i := Y_i − Ŷ_{i,(i)}.

Consider the mean shift outlier model for observation l, given by

Y_i = x_i^T β + γ D_li + ε_i   with   D_li = 1 if l = i, 0 if l ≠ i.

Therefore reject H: γ = 0 versus K: γ ≠ 0 at level α iff |t_l| > t_{n−p−1, 1−α/2}, where t_{m,α} is the α quantile of a t_m distribution.

Problem: If one uses this outlier test for every observation l, one has the problem of multiple testing, so one needs to substitute α by α/n.
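The link between the mean shift model and the jackknifed residual can be verified numerically: the t statistic for γ in the augmented model equals t_l. A sketch on made-up toy data (observation l = 4 chosen arbitrarily):

```python
import numpy as np

# Made-up toy dataset.
X = np.column_stack([np.ones(8),
                     [1., 2, 3, 4, 5, 6, 7, 8],
                     [2., 1, 4, 3, 6, 5, 8, 7]])
y = np.array([3., 4, 6, 5, 8, 9, 11, 10])
n, p = X.shape
l = 4

# Jackknifed residual t_l from the original fit (shortcut formulas)
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
r = y - X @ beta_hat
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
s2 = (r @ r) / (n - p)
s2_l = ((n - p) * s2 - r[l] ** 2 / (1 - h[l])) / (n - p - 1)
t_l = r[l] / (np.sqrt(s2_l) * np.sqrt(1 - h[l]))

# Mean shift outlier model: append the dummy column D_l and refit
D = np.zeros(n)
D[l] = 1.0
Xa = np.column_stack([X, D])
A_inv = np.linalg.inv(Xa.T @ Xa)
ba = A_inv @ Xa.T @ y
gamma_hat = ba[-1]
s_a = np.sqrt(np.sum((y - Xa @ ba) ** 2) / (n - p - 1))
t_gamma = gamma_hat / (s_a * np.sqrt(A_inv[-1, -1]))   # t statistic for gamma
```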
Leverage and influence

[Figure A: y versus x, fitted lines with and without obs 20]
[Figure B: y versus x, fitted lines with and without obs 10]
[Figure C: y versus x, fitted lines with and without obs 20]
Leverage and influence (cont.)

Figure A shows that obs 20 is a high leverage point which is also influential. Figure B shows that obs 10 is not a high leverage point and not influential. Figure C shows that obs 20 is a high leverage point which is not influential.

A real-valued measure for influential observations is Cook's distance, defined as

D_i := (β̂ − β̂_(i))^T (X^T X) (β̂ − β̂_(i)) / (p s²),

which can be interpreted as the shift in the confidence region when the i-th observation is deleted. Therefore observations with D_i > 1 are considered influential. The cutpoint 1 corresponds to the i-th observation moving β̂ to the edge of a 50% confidence region.
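Cook's distance can be computed either from its definition (explicit leave-one-out refits) or from the usual shortcut in terms of internally studentized residuals and leverages; the two agree. A sketch on made-up toy data:

```python
import numpy as np

# Made-up toy dataset.
X = np.column_stack([np.ones(8),
                     [1., 2, 3, 4, 5, 6, 7, 8],
                     [2., 1, 4, 3, 6, 5, 8, 7]])
y = np.array([3., 4, 6, 5, 8, 9, 11, 10])
n, p = X.shape

XtX = X.T @ X
XtX_inv = np.linalg.inv(XtX)
beta_hat = XtX_inv @ X.T @ y
r = y - X @ beta_hat
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages h_ii
s2 = (r @ r) / (n - p)

# Definition: D_i = (b - b_(i))^T (X^T X) (b - b_(i)) / (p s^2)
D = np.empty(n)
for i in range(n):
    Xi = np.delete(X, i, axis=0)
    yi = np.delete(y, i)
    beta_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
    diff = beta_hat - beta_i
    D[i] = diff @ XtX @ diff / (p * s2)

# Shortcut: D_i = s_i^2 * h_ii / (p (1 - h_ii)),
# with s_i = r_i / (s sqrt(1 - h_ii)) the internally studentized residual.
si = r / np.sqrt(s2 * (1 - h))
D_short = si ** 2 * h / (p * (1 - h))
```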
Example: Life expectancies: Residuals and hat values

[Index plot of externally studentized residuals; United States and Chile stand out]
[Index plot of hat diagonals; Australia, Papua New Guinea and Norway stand out]
Example: Life expectancies: Residual plots

[Raw residuals plotted against TEMP, URBAN and log(HOSP.POP); United States and Chile stand out]

The bottom right plot plots r_i versus r_{i−1} to detect correlated errors.
Splus function plot() for LMs

> r <- lm(y ~ x)
> plot(r)

Plot 1: Ŷ_i versus r_i to check for linearity and variance homogeneity.
Plot 2: Ŷ_i versus √|r_i| to check for large residuals.
Plot 3: Ŷ_i versus Y_i; should cluster around the x = y line if the fit is good.
Plot 4: qq-plot of r_i to check normality of the errors.
Plot 5: Residual-fit (r-f) spread plot: f-values (:= ((1:n) − .5)/n) against ordered fitted values (left panel) and ordered residuals (right panel). When a good fit is present, the spread of the residuals should be much smaller than that of the fitted values.
Plot 6: i versus D_i to check for influential observations.
Example: Life expectancies: Output from plot()

> plot(r2)

[Six diagnostic plots for the fit LIFE.EXP ~ TEMP + URBAN + log(HOSP.POP): residuals versus fitted values, √|residuals| versus fitted values, LIFE.EXP versus fitted values, normal qq-plot of the residuals, r-f spread plot, and Cook's distance versus index]
More informationLecture 10 Multiple Linear Regression
Lecture 10 Multiple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: 6.1-6.5 10-1 Topic Overview Multiple Linear Regression Model 10-2 Data for Multiple Regression Y i is the response variable
More informationSTAT 3A03 Applied Regression Analysis With SAS Fall 2017
STAT 3A03 Applied Regression Analysis With SAS Fall 2017 Assignment 5 Solution Set Q. 1 a The code that I used and the output is as follows PROC GLM DataS3A3.Wool plotsnone; Class Amp Len Load; Model CyclesAmp
More information36-707: Regression Analysis Homework Solutions. Homework 3
36-707: Regression Analysis Homework Solutions Homework 3 Fall 2012 Problem 1 Y i = βx i + ɛ i, i {1, 2,..., n}. (a) Find the LS estimator of β: RSS = Σ n i=1(y i βx i ) 2 RSS β = Σ n i=1( 2X i )(Y i βx
More informationSimple Linear Regression
Simple Linear Regression MATH 282A Introduction to Computational Statistics University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/ eariasca/math282a.html MATH 282A University
More informationDealing with Heteroskedasticity
Dealing with Heteroskedasticity James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Dealing with Heteroskedasticity 1 / 27 Dealing
More informationSTAT 215 Confidence and Prediction Intervals in Regression
STAT 215 Confidence and Prediction Intervals in Regression Colin Reimer Dawson Oberlin College 24 October 2016 Outline Regression Slope Inference Partitioning Variability Prediction Intervals Reminder:
More informationSTAT 540: Data Analysis and Regression
STAT 540: Data Analysis and Regression Wen Zhou http://www.stat.colostate.edu/~riczw/ Email: riczw@stat.colostate.edu Department of Statistics Colorado State University Fall 205 W. Zhou (Colorado State
More informationLinear models and their mathematical foundations: Simple linear regression
Linear models and their mathematical foundations: Simple linear regression Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/21 Introduction
More informationLecture 2. The Simple Linear Regression Model: Matrix Approach
Lecture 2 The Simple Linear Regression Model: Matrix Approach Matrix algebra Matrix representation of simple linear regression model 1 Vectors and Matrices Where it is necessary to consider a distribution
More informationAMS-207: Bayesian Statistics
Linear Regression How does a quantity y, vary as a function of another quantity, or vector of quantities x? We are interested in p(y θ, x) under a model in which n observations (x i, y i ) are exchangeable.
More informationLecture 2. Simple linear regression
Lecture 2. Simple linear regression Jesper Rydén Department of Mathematics, Uppsala University jesper@math.uu.se Regression and Analysis of Variance autumn 2014 Overview of lecture Introduction, short
More informationStatistics for Engineers Lecture 9 Linear Regression
Statistics for Engineers Lecture 9 Linear Regression Chong Ma Department of Statistics University of South Carolina chongm@email.sc.edu April 17, 2017 Chong Ma (Statistics, USC) STAT 509 Spring 2017 April
More informationHomework 2: Simple Linear Regression
STAT 4385 Applied Regression Analysis Homework : Simple Linear Regression (Simple Linear Regression) Thirty (n = 30) College graduates who have recently entered the job market. For each student, the CGPA
More informationSCHOOL OF MATHEMATICS AND STATISTICS
RESTRICTED OPEN BOOK EXAMINATION (Not to be removed from the examination hall) Data provided: Statistics Tables by H.R. Neave MAS5052 SCHOOL OF MATHEMATICS AND STATISTICS Basic Statistics Spring Semester
More information[y i α βx i ] 2 (2) Q = i=1
Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation
More informationMultivariate Linear Regression Models
Multivariate Linear Regression Models Regression analysis is used to predict the value of one or more responses from a set of predictors. It can also be used to estimate the linear association between
More informationContents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects
Contents 1 Review of Residuals 2 Detecting Outliers 3 Influential Observations 4 Multicollinearity and its Effects W. Zhou (Colorado State University) STAT 540 July 6th, 2015 1 / 32 Model Diagnostics:
More informationRemedial Measures, Brown-Forsythe test, F test
Remedial Measures, Brown-Forsythe test, F test Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 7, Slide 1 Remedial Measures How do we know that the regression function
More informationMultiple Predictor Variables: ANOVA
Multiple Predictor Variables: ANOVA 1/32 Linear Models with Many Predictors Multiple regression has many predictors BUT - so did 1-way ANOVA if treatments had 2 levels What if there are multiple treatment
More informationStat 5102 Final Exam May 14, 2015
Stat 5102 Final Exam May 14, 2015 Name Student ID The exam is closed book and closed notes. You may use three 8 1 11 2 sheets of paper with formulas, etc. You may also use the handouts on brand name distributions
More information1 The classical linear regression model
THE UNIVERSITY OF CHICAGO Booth School of Business Business 41912, Spring Quarter 2012, Mr Ruey S Tsay Lecture 4: Multivariate Linear Regression Linear regression analysis is one of the most widely used
More informationSTAT 4385 Topic 06: Model Diagnostics
STAT 4385 Topic 06: Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 1/ 40 Outline Several Types of Residuals Raw, Standardized, Studentized
More informationSCHOOL OF MATHEMATICS AND STATISTICS
SHOOL OF MATHEMATIS AND STATISTIS Linear Models Autumn Semester 2015 16 2 hours Marks will be awarded for your best three answers. RESTRITED OPEN BOOK EXAMINATION andidates may bring to the examination
More informationBiostatistics for physicists fall Correlation Linear regression Analysis of variance
Biostatistics for physicists fall 2015 Correlation Linear regression Analysis of variance Correlation Example: Antibody level on 38 newborns and their mothers There is a positive correlation in antibody
More information> modlyq <- lm(ly poly(x,2,raw=true)) > summary(modlyq) Call: lm(formula = ly poly(x, 2, raw = TRUE))
School of Mathematical Sciences MTH5120 Statistical Modelling I Tutorial 4 Solutions The first two models were looked at last week and both had flaws. The output for the third model with log y and a quadratic
More informationSTAT 4385 Topic 03: Simple Linear Regression
STAT 4385 Topic 03: Simple Linear Regression Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2017 Outline The Set-Up Exploratory Data Analysis
More information1 Use of indicator random variables. (Chapter 8)
1 Use of indicator random variables. (Chapter 8) let I(A) = 1 if the event A occurs, and I(A) = 0 otherwise. I(A) is referred to as the indicator of the event A. The notation I A is often used. 1 2 Fitting
More informationLinear Models and Estimation by Least Squares
Linear Models and Estimation by Least Squares Jin-Lung Lin 1 Introduction Causal relation investigation lies in the heart of economics. Effect (Dependent variable) cause (Independent variable) Example:
More information20. REML Estimation of Variance Components. Copyright c 2018 (Iowa State University) 20. Statistics / 36
20. REML Estimation of Variance Components Copyright c 2018 (Iowa State University) 20. Statistics 510 1 / 36 Consider the General Linear Model y = Xβ + ɛ, where ɛ N(0, Σ) and Σ is an n n positive definite
More informationEconometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018
Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate
More informationMultiple Linear Regression
Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there
More informationR 2 and F -Tests and ANOVA
R 2 and F -Tests and ANOVA December 6, 2018 1 Partition of Sums of Squares The distance from any point y i in a collection of data, to the mean of the data ȳ, is the deviation, written as y i ȳ. Definition.
More informationLecture 3: Inference in SLR
Lecture 3: Inference in SLR STAT 51 Spring 011 Background Reading KNNL:.1.6 3-1 Topic Overview This topic will cover: Review of hypothesis testing Inference about 1 Inference about 0 Confidence Intervals
More informationSimple linear regression
Simple linear regression Biometry 755 Spring 2008 Simple linear regression p. 1/40 Overview of regression analysis Evaluate relationship between one or more independent variables (X 1,...,X k ) and a single
More informationCOMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 26, 2005, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTION
COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 26, 2005, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTION Answer all parts. Closed book, calculators allowed. It is important to show all working,
More informationRegression. Marc H. Mehlman University of New Haven
Regression Marc H. Mehlman marcmehlman@yahoo.com University of New Haven the statistician knows that in nature there never was a normal distribution, there never was a straight line, yet with normal and
More information