Linear regression. Example 3.6.1: Relation between abrasion loss and hardness of rubber tires.


1 Linear regression

Example 3.6.1: Relation between abrasion loss and hardness of rubber tires.

$X_i$ is the abrasion loss, with 30 observations, $i = 1, \ldots, 30$. The sample space is $\mathbb{R}^{30}$.

$y_i$ is the hardness, $i = 1, \ldots, 30$.

Model:

$$X_i = \beta_0 + \beta_1 y_i + \sigma\varepsilon_i$$

with $\varepsilon_i$ iid $N(0, 1)$ for $i = 1, \ldots, 30$.

Parameters: $(\beta_0, \beta_1, \sigma) \in \mathbb{R}^2 \times (0, \infty)$.

Slide 1/11 Niels Richard Hansen Statistics BI/E December 14, 2009

2 General regression model

A model of real-valued variables $X_1, \ldots, X_n$:

$$X_i = g_\beta(y_i) + \sigma\varepsilon_i$$

where $\varepsilon_1, \ldots, \varepsilon_n$ are iid, $y_1, \ldots, y_n \in E$ and $g_\beta : E \to \mathbb{R}$.

Parameters: $\beta \in \mathbb{R}^d$ and $\sigma \in (0, \infty)$.

In this model the observables are assumed independent but not identically distributed. The value of $y_i$ dictates, through $g_\beta$, the mean value of $X_i$.

3 The likelihood function

Due to independence, the joint density for the distribution of $X_1, \ldots, X_n$ is

$$f_{\beta,\sigma}(x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\sigma} f\left(\frac{x_i - g_\beta(y_i)}{\sigma}\right) \tag{1}$$

with $f$ the density for the distribution of the $\varepsilon_i$'s.

The minus-log-likelihood function is

$$l_x(\beta, \sigma) = n \log \sigma - \sum_{i=1}^n \log f\left(\frac{x_i - g_\beta(y_i)}{\sigma}\right). \tag{2}$$

4 Normal distribution

If $\varepsilon_i \sim N(0, 1)$ then

$$l_x(\beta, \sigma) = n \log \sigma + \frac{1}{2\sigma^2} \underbrace{\sum_{i=1}^n (x_i - g_\beta(y_i))^2}_{RSS(\beta)} + n \log \sqrt{2\pi}.$$

The minimizer over $\beta$ is the minimizer of the residual sum of squares

$$RSS(\beta) = \sum_{i=1}^n (x_i - g_\beta(y_i))^2, \tag{3}$$

and if there is a unique minimizer $\hat{\beta}$ then

$$\tilde{\sigma}^2 = \frac{1}{n} RSS(\hat{\beta})$$

is the MLE of the variance.
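The step from the minus-log-likelihood to the variance estimate can be spelled out. This is a short derivation sketch; the notation $\tilde{\sigma}^2$ for the MLE is used here only to keep it apart from the true $\sigma^2$. For fixed $\beta$ with $RSS(\beta) > 0$, differentiating with respect to $\sigma$ gives

$$\frac{\partial}{\partial \sigma} l_x(\beta, \sigma) = \frac{n}{\sigma} - \frac{1}{\sigma^3} RSS(\beta) = 0 \quad\Longleftrightarrow\quad \sigma^2 = \frac{1}{n} RSS(\beta).$$

Since $l_x(\beta, \sigma) \to \infty$ both for $\sigma \to 0$ and for $\sigma \to \infty$, this stationary point is the minimum. Plugging in $\beta = \hat{\beta}$ gives $\tilde{\sigma}^2 = \frac{1}{n} RSS(\hat{\beta})$.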

5 Least squares estimation

The MLE of $\sigma^2$,

$$\tilde{\sigma}^2 = \frac{1}{n} RSS(\hat{\beta}),$$

generally underestimates $\sigma^2$ by a factor $\frac{n-d}{n}$. Therefore the variance is usually estimated by

$$\hat{\sigma}^2 = \frac{1}{n-d} RSS(\hat{\beta})$$

where $d$ is the dimension of the parameter space for $\beta$.

6 Linear regression

The linear regression model is obtained by

$$g_\beta(y) = \beta_0 + \beta_1 y$$

where $d = 2$.

There is a unique and explicit minimizer of $RSS(\beta)$, provided the $y_i$ are not all equal.

The estimators are computed in R using lm (for linear model). The result is nicely formatted by the summary function applied to the resulting object.
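To make the explicit minimizer concrete, here is a minimal Python sketch of the closed-form least squares solution together with the two variance estimates. It is not the R lm call mentioned in the slides; the function name fit_line and the data values are made up for illustration and are not the rubber tire data.

```python
# Closed-form least squares for the model x_i = b0 + b1 * y_i + error.
# All names and data here are hypothetical, chosen only for illustration.

def fit_line(y, x):
    """Return (b0_hat, b1_hat, rss) minimizing sum((x_i - b0 - b1*y_i)^2)."""
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    # s_yy must be positive, i.e. the y_i must not all be equal.
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_yx = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    b1 = s_yx / s_yy
    b0 = x_bar - b1 * y_bar
    rss = sum((xi - b0 - b1 * yi) ** 2 for yi, xi in zip(y, x))
    return b0, b1, rss

y = [1.0, 2.0, 3.0, 4.0, 5.0]      # covariate (made-up values)
x = [2.1, 3.9, 6.2, 7.8, 10.1]     # response (made-up values)

b0_hat, b1_hat, rss = fit_line(y, x)
n, d = len(y), 2
sigma2_mle = rss / n         # MLE of the variance, divides by n
sigma2_hat = rss / (n - d)   # usual estimate, divides by n - d
print(b0_hat, b1_hat, sigma2_mle, sigma2_hat)
```

The two variance estimates differ exactly by the factor (n - d)/n, which is noticeable for a sample as small as this one.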

7 Residuals

The fitted values are defined as

$$\hat{x}_i = \hat{\beta}_0 + \hat{\beta}_1 y_i$$

and the residuals as

$$e_i = x_i - \hat{x}_i = x_i - \hat{\beta}_0 - \hat{\beta}_1 y_i.$$

The residual $e_i$ is an approximation of the error variable $\sigma\varepsilon_i$.

8 Leverage and standardized residuals

The error variables $\sigma\varepsilon_i$ for $i = 1, \ldots, n$ are iid and have variance $\sigma^2$.

The residuals $e_i$ for $i = 1, \ldots, n$ are mildly dependent, and the variance of $e_i$ is

$$\sigma^2 (1 - h_{ii})$$

where $h_{ii}$ is known as the leverage of the $i$th observation.

The standardized residuals are

$$r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}.$$
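As an illustration of leverage and standardized residuals, the following Python sketch uses the standard closed form of the leverage for a straight-line fit, $h_{ii} = 1/n + (y_i - \bar{y})^2 / \sum_j (y_j - \bar{y})^2$. That formula is not stated in the slides, and the function name and data are hypothetical.

```python
import math

def standardized_residuals(y, x):
    """Residuals, leverages and standardized residuals for x = b0 + b1*y.

    Uses the closed-form leverage for a straight-line fit:
    h_ii = 1/n + (y_i - y_bar)^2 / s_yy.
    """
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x)) / s_yy
    b0 = x_bar - b1 * y_bar
    e = [xi - b0 - b1 * yi for yi, xi in zip(y, x)]
    h = [1.0 / n + (yi - y_bar) ** 2 / s_yy for yi in y]
    # Variance estimate with d = 2 parameters in the mean.
    sigma_hat = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))
    r = [ei / (sigma_hat * math.sqrt(1.0 - hi)) for ei, hi in zip(e, h)]
    return e, h, r

# Made-up data; the last covariate value lies far from the others.
y = [1.0, 2.0, 3.0, 4.0, 10.0]
x = [2.0, 4.1, 5.9, 8.2, 19.5]

e, h, r = standardized_residuals(y, x)
for yi, hi, ri in zip(y, h, r):
    print(yi, round(hi, 3), round(ri, 3))
```

The observation whose covariate is far from the mean gets the largest leverage, and the leverages sum to d = 2, the number of mean parameters.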

9 Model diagnostics

The model assumptions are investigated via the residuals or standardized residuals.

Is the model, $\beta_0 + \beta_1 y$, of the mean value adequate? Plot the residuals $e_i$ against $y_i$ or against the fitted values $\hat{x}_i$.

Is the assumption of a constant variance $\sigma^2$ reasonable? Plot the standardized residuals $r_i$ or, as done in R, $\sqrt{|r_i|}$ against the fitted values $\hat{x}_i$.

Is the error distribution normal? Do a QQ-plot of the residuals against the normal distribution.

Are there outliers and/or influential observations? Look in the residual plots. Plot the standardized residuals against the leverage.
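The QQ-plot check can be sketched without any plotting library by computing the point pairs that such a plot would show. This is a minimal Python sketch: the plotting positions (i - 0.5)/n are one common convention and an assumption here, and the residual values are made up.

```python
from statistics import NormalDist

def normal_qq_pairs(values):
    """Return (theoretical quantile, ordered value) pairs for a normal QQ-plot.

    Plotting positions (i - 0.5)/n for i = 1..n are assumed; other
    conventions exist and give slightly different theoretical quantiles.
    """
    n = len(values)
    ordered = sorted(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Hypothetical standardized residuals.
residuals = [0.3, -1.2, 0.8, -0.1, 1.5, -0.6]
pairs = normal_qq_pairs(residuals)
for q, v in pairs:
    print(round(q, 3), v)
```

If the normality assumption is reasonable, the pairs fall close to a straight line when plotted.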

10 ELISA and DNase

ELISA is an assay where we measure the concentration of a protein/antibody indirectly by a measure of optical density (OD-value).

For all experiments we are interested in estimating a standard curve relating concentration and observed OD-value.

Using a known dilution series we can try linear regression of the log-OD on the log-concentration.

The model is mildly misspecified. Consequence: the error variance is overestimated, as it has to capture the model misspecification too.

11 Beaver temperature

We consider the temperature of a beaver over time. We get the idea that the temperature has a 24-hour cycle with a minimum at 8:30 in the morning. If $t$ denotes time in minutes since 8:30, we suggest the following model

$$X_i = \beta_0 + \beta_1 \cos(2\pi t/1440) + \sigma\varepsilon_i$$

where $\varepsilon_1, \ldots, \varepsilon_n$ are iid.

The data have an additional variable indicating activity of the animal. An extended model reads

$$X_i = \beta_0 + \beta_{\text{activ}} + \beta_1 \cos(2\pi t/1440) + \sigma\varepsilon_i.$$

The main problem is that dependence over time is found in the residuals. Consequence: estimates of standard errors generally become too optimistic.
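The cosine model is still a linear regression once the covariate is transformed, which the following Python sketch illustrates on synthetic, noise-free data. The values 37.5 and -0.4 are made up, not estimates from the beaver data; the helper names are hypothetical, and the lag-1 autocorrelation is just one simple way to look for dependence over time in residuals.

```python
import math

PERIOD_MIN = 1440  # minutes in 24 hours

def fit_cosine(t, x):
    """Least squares fit of x = b0 + b1 * cos(2*pi*t/1440) + error."""
    c = [math.cos(2 * math.pi * ti / PERIOD_MIN) for ti in t]
    n = len(c)
    c_bar = sum(c) / n
    x_bar = sum(x) / n
    s_cc = sum((ci - c_bar) ** 2 for ci in c)
    b1 = sum((ci - c_bar) * (xi - x_bar) for ci, xi in zip(c, x)) / s_cc
    b0 = x_bar - b1 * c_bar
    return b0, b1

def lag1_autocorrelation(e):
    """Sample lag-1 autocorrelation of a residual series."""
    m = sum(e) / len(e)
    num = sum((e[i] - m) * (e[i + 1] - m) for i in range(len(e) - 1))
    den = sum((ei - m) ** 2 for ei in e)
    return num / den

# Synthetic, noise-free data: hourly observations over 24 hours.
# A negative cosine coefficient puts the minimum at t = 0 (8:30).
t = list(range(0, PERIOD_MIN, 60))
x = [37.5 - 0.4 * math.cos(2 * math.pi * ti / PERIOD_MIN) for ti in t]
b0_hat, b1_hat = fit_cosine(t, x)
print(b0_hat, b1_hat)

# Hypothetical residual series that drifts slowly over time.
e_demo = [0.2, 0.3, 0.25, 0.1, -0.1, -0.3, -0.2, -0.05]
rho_demo = lag1_autocorrelation(e_demo)
print(rho_demo)
```

A lag-1 autocorrelation clearly above zero, as for the slowly drifting series here, is the kind of dependence that makes the usual standard errors too optimistic.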
