Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model


Linear Regression

Linear Regression

In this lecture we will study a particular type of regression model: the linear regression model.
We will first consider the case of the model with one predictor variable: simple linear regression.
We will then consider the case of the model with more than one predictor variable: multiple linear regression.

Linear Regression Models

Recall that a regression model is defined by:
1. a random response variable
2. a list of predictor variables
3. a regression equation
4. a distribution for the value of the random response variable

Simple Linear Regression

The regression equation for simple linear regression is:

  EY_i = µ_i = α + β x_i

Note that the link function g is the identity function for linear regression.
The assumption here is that the relationship between x and EY_i is a straight line.
The slope of the line is β.

Example: Full Blood Count

A clinical full blood count takes a standard volume of blood and measures:
the number of blood cells (platelets, white, red)
the haemoglobin concentration
Empirically, there is a log-linear relationship between the number of red cells and the haemoglobin concentration.

Example: plot of log(haemoglobin) on log(rbc) (figure)

Interpretation of α

To interpret α, put x_i = 0 into the regression equation EY_i = α + β x_i; then:

  EY_i = α

α is the average value of the response variable amongst study subjects for which the predictor variable is zero.

Interpretation of β

To interpret β, put x_i = z and x_i′ = z + 1 for study subjects i and i′ into the regression equation to obtain:

  EY_i  = α + β z          (1)
  EY_i′ = α + β (z + 1)    (2)

then subtract equation (1) from equation (2):

  EY_i′ − EY_i = β         (3)

Interpretation: Blood Count Example

  EY_i = α + β x_i, here: E log(HGB) = α + β log(#RBC)

HGB = haemoglobin concentration
#RBC = red cell count
For the blood example, when log(#RBC) = 0 the average log(HGB) = 1.50.
Why is this interpretation silly?
If log(#RBC) increases by 1, log(HGB) increases on average by 0.74.

Linear Regression Response

For linear regression the response distribution is assumed to be normal (sometimes called Gaussian):

  Y_i ~ N(µ_i, σ²)

or equivalently

  Y_i ~ N(α + β x_i, σ²)

Linear Regression Errors

The quantity

  ϵ_i = Y_i − µ_i = Y_i − (α + β x_i)

is the error corresponding to study subject i.
The distributional assumption of linear regression is equivalent to the assumption that the errors are normally distributed with mean zero:

  ϵ_i ~ N(0, σ²)

The error variance σ² is the same for each study subject.
σ² can be estimated from the data using maximum likelihood.

Linear Regression Error Assumption

We can put the regression equation and the distributional assumption into a single statement:

  Y_i = α + β x_i + ϵ_i,  ϵ_i ~ N(0, σ²)

Linear Regression Residuals

We define the residual for study subject i by:

  r_i = Y_i − (α̂ + β̂ x_i)

Recall that the error for study subject i is defined by:

  ϵ_i = Y_i − (α + β x_i)

Note that residuals and errors are not the same:
Errors are unknown, because we don't know α and β.
Residuals can be computed from the data.
Residuals can be thought of as estimates of the errors.
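The distinction above can be checked directly in R. This is a minimal sketch with simulated data (the variable names and parameter values are illustrative, not the lecture's dataset): residuals computed by hand from the fitted coefficients agree with those stored in the model object, and their mean is numerically zero.

```r
# Simulate data from a known linear model (alpha = 2, beta = 0.5)
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.3)

fit.obj <- lm(y ~ x)

# Residuals by hand: r_i = Y_i - (alpha.hat + beta.hat * x_i)
a.hat <- unname(coef(fit.obj))[1]
b.hat <- unname(coef(fit.obj))[2]
r.by.hand <- y - (a.hat + b.hat * x)

# They match lm's residuals, and their mean is (numerically) zero
all.equal(r.by.hand, unname(residuals(fit.obj)))
mean(r.by.hand)
```

The errors ϵ_i, by contrast, would require the true α and β, which are unknown in a real analysis.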

Linear Regression Residuals (figure)

Properties of Linear Regression Residuals

Although residuals and errors are not the same, residuals have similar properties to errors:
1. The mean (and sum) of the residuals for a study sample is equal to zero.
2. The residuals are normally distributed.
3. The variance of the residuals does not depend on the value of the predictors.
The first property holds regardless of the validity of the modelling assumptions.
The second and third properties hold only if the model assumptions are valid, specifically only if:
1. the relationship between x and Y is linear
2. the errors are normally distributed
3. the variance of the errors is constant

Checking Modelling Assumptions

Before we rely on an inference made from a linear regression model, we should always verify that the modelling assumptions hold.
Specifically we should check that:
1. EY is a linear function of x
2. the properties of the residuals are consistent with the assumption about the distribution of Y

Check Linearity

Suppose the R variable y is a vector containing data from a response variable Y, and the R variable x is a vector containing data from a predictor variable x.
We can generate a plot of y against x with the command:

> plot(x, y)

We will deal with non-linearity in a subsequent lecture.

Fitting a Linear Model in R

We can fit a linear regression in R using the lm function:

> fit.obj = lm(y ~ x)

This fits the regression equation EY = α + β x.
The result of the model fit is stored in the R object fit.obj.

Extracting the Residuals

The residuals can be extracted from the linear regression object using fit.obj$residuals.
For example, to draw a histogram of the residuals you can type:

> hist(fit.obj$residuals)

Alternatively you can do the fitting and plotting in one statement, without storing a model object:

> hist(lm(y ~ x)$residuals)

Histogram of the Residuals

By examining a histogram of the residuals we can check that the normality assumption holds.

QQ-plot of the Residuals

A Q-Q plot (short for quantile-quantile plot) is a graphical method for comparing two probability distributions.
We can use it to compare the observed distribution of the residuals with the distribution of a N(0, 1) random variable.
If the residuals are normally distributed, the plot should follow an approximately straight line.

QQ-plot of the Residuals

Suppose we have n data points, and so n residuals.
1. Draw n lines on a normal density to divide it into n + 1 regions of equal probability.
2. Plot the x-axis values of these lines against the ordered residuals.
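The two steps above can be sketched in base R. This is an illustrative sketch with simulated data: the probabilities (1:n)/(n+1) cut the normal density into n + 1 equal-probability regions, and qnorm converts them to x-axis values. (R's built-in qqnorm produces essentially the same picture, using slightly different plotting positions.)

```r
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
res <- residuals(lm(y ~ x))
n <- length(res)

# Step 1: normal quantiles dividing the density into n + 1
# equal-probability regions
p <- (1:n) / (n + 1)
theo <- qnorm(p)

# Step 2: plot these quantiles against the ordered residuals
plot(theo, sort(res),
     xlab = "N(0,1) quantiles", ylab = "Ordered residuals")
```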

QQ-plot of the Residuals (figure)

R commands to generate this plot are given in the next session.

Plot the Residuals vs. the Predictor

Plot the residuals against the predictor variable to verify that the distribution of the residuals is independent of x:

> plot(x, lm(y ~ x)$residuals)

Maximum Likelihood Estimation

For linear regression there are closed-form formulae for the maximum likelihood estimates of the regression coefficients:

  β̂ = Σ_i (x_i − x̄)(Y_i − Ȳ) / Σ_i (x_i − x̄)²
  α̂ = Ȳ − β̂ x̄

However, we do not need to worry about these too much, as R will do the calculations for us.
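Although R does the work, the formulae can be verified directly. A sketch with simulated data (illustrative values, not the lecture's dataset): the hand-computed estimates agree with coef(lm(...)).

```r
set.seed(42)
x <- rnorm(50)
y <- 1 + 3 * x + rnorm(50)

# Closed-form MLEs for simple linear regression
beta.hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha.hat <- mean(y) - beta.hat * mean(x)

# These agree with lm()'s estimates
coef(lm(y ~ x))
c(alpha.hat, beta.hat)
```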

Viewing Model Fit Information in R

The simplest way to view model fit information in R is to type the name of a fitted model object and hit return:

> fit.obj = lm(y ~ x)
> fit.obj

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x

This prints the MLEs of the coefficients.

Printing Confidence Intervals Using R

95% confidence limits can be computed with the confint function:

> confint(fit.obj)
                2.5 %   97.5 %
(Intercept)
x

A different confidence level can be specified if desired, e.g. 99%:

> confint(fit.obj, level=0.99)
                0.5 %   99.5 %
(Intercept)
x

The display Command

The display function (from the arm package) prints a compact model summary:

> display(fit.obj)
lm(formula = y ~ x)
            coef.est coef.se
(Intercept)
x
n = 500, k = 2
residual sd = 0.51, R-Squared = 0.94

This prints: the MLEs of the coefficients, their standard errors, the residual standard deviation and R².

Standard Errors of the MLEs

Back to the idea of imaginary repeated experiments. Suppose, in an imaginary world, we:
1. repeat our experiment very many times
2. generate a new dataset on each occasion
3. estimate a new MLE β̂ using each dataset
The MLE is a random variable under this replication process.
The standard error of β̂, denoted SE(β̂), is defined as the standard deviation of the MLE under this process.
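The imaginary replication can be mimicked by simulation. This is an illustrative sketch (made-up parameter values; x is held fixed across replicates): the standard deviation of β̂ over many simulated datasets is close to the standard error that summary() reports for a single fit.

```r
set.seed(1)
x <- rnorm(100)

# One "imaginary experiment": new errors, new dataset, new beta.hat
one.beta.hat <- function() {
  y <- 2 + 0.5 * x + rnorm(100)
  unname(coef(lm(y ~ x))[2])
}
beta.hats <- replicate(2000, one.beta.hat())

# SD of beta.hat across replicates ...
sd(beta.hats)

# ... is close to the SE reported for a single dataset
y <- 2 + 0.5 * x + rnorm(100)
summary(lm(y ~ x))$coefficients["x", "Std. Error"]
```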

Proportion of Variance Explained

R² is the proportion of the variance in the response which is explained by the predictor.
R² is a number between 0 and 1.
In simple linear regression, R² is the square of the correlation between x and y.
When R² = 1, x is perfectly correlated with y and the residuals are all equal to 0.
When R² = 0, x contains no information about y.
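The link between R² and correlation can be checked numerically. A sketch with simulated data (illustrative values): in a simple linear regression, summary()'s R² equals cor(x, y)².

```r
set.seed(1)
x <- rnorm(200)
y <- 1 + x + rnorm(200)
fit <- lm(y ~ x)

# R^2 from the fit equals the squared correlation of x and y
summary(fit)$r.squared
cor(x, y)^2
```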

Residual Standard Deviation

The residual standard deviation is what it says on the tin:

  sd(ϵ̂) = √( (1/n) Σ_i (ϵ̂_i − mean(ϵ̂))² )

The R summary Command

> summary(fit.obj)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              <2e-16 ***
x                                        <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 498 degrees of freedom
Multiple R-squared:      Adjusted R-squared:
F-statistic: 7462 on 1 and 498 DF,  p-value: < 2.2e-16

p-values

The p-value in the Pr(>|t|) column of the summary output is a measure of the weight of evidence against the null hypothesis that the regression coefficient in that row is equal to zero.
The null hypothesis is so called because it refers to the assumed position that there is no association between the predictor and the response.
Usually the evidence must be strong before a null hypothesis is rejected.
A p-value is a number between 0 and 1: the smaller the number, the greater the evidence against the null hypothesis.
Typically a p-value at least as small as 0.05 is required to reject a null hypothesis.

Interpretation of p-values

The interpretation of p-values is based on the idea of imaginary repeated experiments. Suppose, in an imaginary world, we:
1. repeat our experiment very many times
2. generate a new dataset on each occasion
3. calculate a new p-value using each dataset
Then, assuming the null hypothesis is true, α × 100% of the calculated p-values should be less than α.
Small p-values are rare when the null hypothesis is true.
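This property can also be seen by simulation. A sketch with made-up data: when x and y are generated independently (so the null hypothesis is true), about 5% of the p-values fall below 0.05.

```r
set.seed(1)

# One "imaginary experiment" under the null: x and y are independent
one.p <- function() {
  x <- rnorm(50)
  y <- rnorm(50)   # no association with x
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
}
p.vals <- replicate(2000, one.p())

# Roughly alpha * 100% of p-values are below alpha = 0.05
mean(p.vals < 0.05)
```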

Computing Confidence Intervals Manually

Although R provides the confint function, confidence intervals can also be computed manually from standard errors.
Not all statistical software provides functions to compute confidence intervals, so this is a useful skill.
Standard errors are listed in the second column of the summary output. (They are also printed by the display command.)
Manual calculation of confidence intervals is based on the assumption that the MLE of the regression coefficient follows a normal distribution.

Computing Confidence Intervals Manually

We can compute a 95% confidence interval for a regression coefficient using a normal approximation:

  β̂ − 1.96 SE(β̂) < β < β̂ + 1.96 SE(β̂)
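The manual calculation can be compared with confint. A sketch with simulated data (illustrative values): the two intervals nearly agree; confint uses t rather than normal quantiles, so it is very slightly wider.

```r
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
fit.obj <- lm(y ~ x)

# Manual 95% CI: estimate +/- 1.96 standard errors
est <- summary(fit.obj)$coefficients["x", "Estimate"]
se  <- summary(fit.obj)$coefficients["x", "Std. Error"]
c(est - 1.96 * se, est + 1.96 * se)

# confint() uses t quantiles, so its interval is slightly wider
confint(fit.obj)["x", ]
```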

Multiple Linear Regression

Multiple linear regression is very similar to simple linear regression.
More than one predictor is now allowed on the right-hand side of the regression equation:

  EY_i = µ_i = α + β_1 x_i1 + β_2 x_i2 + ... + β_p x_ip

The assumptions about the distribution of Y_i (normal, homogeneous variance) are the same as those for simple linear regression.

Fitting a Multiple Linear Regression

A multiple linear regression can be fitted with the lm command:

> fit.obj = lm(y ~ x1 + x2)

Information can be extracted from the model object using the functions already seen: confint, display and summary.

When to Use Multiple Linear Regression

Multiple linear regression is useful when more than one predictor is thought to associate with the response simultaneously.
By fitting the predictors together in the same model we can get more precise estimates of the regression coefficients.

Fitting a Multiple Linear Regression

> summary(fit.obj)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
x1                                          e-07 ***
x2                                               *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 47 degrees of freedom
Multiple R-squared:      Adjusted R-squared:
F-statistic: on 2 and 47 DF,  p-value: 6.864e-07

Multiple Linear Regression: Interpretation of α

To interpret α, put x_ij = 0 into the regression equation for each predictor:

  EY_i = µ_i = α + β_1 x_i1 + β_2 x_i2 + ... + β_p x_ip

then

  EY_i = α

α is the average value of the response variable amongst study subjects for which every predictor variable is zero.

Interpretation of β_j

To interpret β_j, the regression coefficient for the jth predictor variable, put x_ij = z for study subject i and x_i′j = z + 1 for study subject i′, with all other predictors held equal, into the regression equation to obtain:

  EY_i  = α + β_1 x_i1 + ... + β_j z + ... + β_p x_ip        (4)
  EY_i′ = α + β_1 x_i1 + ... + β_j (z + 1) + ... + β_p x_ip  (5)

then subtract equation (4) from equation (5):

  EY_i′ − EY_i = β_j

β_j is the difference in the average value of the response variable between groups of study subjects for which the jth predictor variable differs by one unit (and whose other predictors take the same values).

Multiple Linear Regression with Interactions

Multiple linear regression assumes that the effect of a unit change in a predictor on the mean of the response is independent of the values of the other predictors.
e.g. increasing the predictor value x_ij by one unit increases EY_i by the amount β_j, whatever the values of the other predictor variables.
Interaction models allow us to relax this assumption.

Interactions in Linear Regression

An interaction model is one where the interpretation of the effect of one predictor depends on the value of another, and vice versa.
The simplest interaction model includes a predictor variable formed by multiplying two ordinary predictors:

  EY_i = α + β_1 x_i1 + β_2 x_i2 + β_3 x_i1 x_i2

The term β_3 x_i1 x_i2 is the interaction term.
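In R, this model can be fitted either by constructing the product term explicitly or with lm's * shorthand. A sketch with simulated data (illustrative values): the two formulations give identical coefficient estimates.

```r
set.seed(1)
x1 <- rnorm(60)
x2 <- rnorm(60)
y  <- 1 + x1 + 2 * x2 + 0.5 * x1 * x2 + rnorm(60)

# y ~ x1 * x2 is shorthand for main effects plus the product term
fit1 <- lm(y ~ x1 * x2)
fit2 <- lm(y ~ x1 + x2 + I(x1 * x2))

coef(fit1)
coef(fit2)   # same estimates, different term labels
```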

Interaction Between 2 Variables

Consider a linear model where the main predictors of Y (blood pressure in mmHg) are age (in years) and weight (in kg):

  EY = α + β_1 weight + β_2 age + β_3 (weight × age)

Interpreting β_1 and β_2

  EY = α + β_1 weight + β_2 age + β_3 (weight × age)

We would like to interpret β_1 as if the interaction term were not there, since in that case we would just have an ordinary multivariate linear model.
This happens when the age of a study subject is equal to 0; then:

  EY = α + β_1 weight + β_2 × 0 + β_3 (weight × 0) = α + β_1 weight

Interpreting β_1 and β_2

Amongst study subjects aged 0 years:

  EY = α + β_1 weight

We know how to interpret β_1 in this case, as it's a simple linear model:
β_1 is the difference in the expected BP between individuals whose weight differs by 1 kg and who are aged 0 years.
This interpretation is factually correct, but practically not very useful.
The data aren't likely to contain many 0 year olds. Is the model valid in this range?

Interpreting β_1 and β_2

  EY = α + β_1 weight + β_2 age + β_3 (weight × age)

To interpret β_2 we need to get rid of the interaction term without getting rid of the β_2 age term.
Using the same argument as before, but now setting weight = 0:

  EY = α + β_1 × 0 + β_2 age + β_3 (0 × age) = α + β_2 age

β_2 is the difference in the expected BP between individuals whose age differs by 1 year and who weigh 0 kg.

Interpreting β_3

  EY = α + β_1 weight + β_2 age + β_3 (weight × age)

To interpret β_3, rewrite the regression equation:

  EY = α + [β_1 + β_3 age] weight + β_2 age

This looks like a multivariate regression model with weight and age as predictors where:
β_1 + β_3 age is the regression coefficient for weight
β_2 is the regression coefficient for age
β_3 is the difference between the regression coefficients for weight for study subjects whose age differs by 1 year.

Interpreting β_3

  EY = α + β_1 weight + β_2 age + β_3 (weight × age)

We could just as well have rewritten the equation this way:

  EY = α + β_1 weight + [β_2 + β_3 weight] age

β_3 is the difference between the regression coefficients for age for study subjects whose weight differs by 1 kg.
So we have two ways of thinking about β_3:
1. as modification of the effect of weight by age
2. as modification of the effect of age by weight

Fitting an Interaction Term in R

An interaction model can be fitted with the lm command, using * between the predictors:

> summary(lm(y ~ x1 * x2))

Call:
lm(formula = y ~ x1 * x2)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
x1
x2
x1:x2                                            **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Red Blood Cell Count in UK Biobank (figure)

The distribution of the number of red blood cells in a unit of blood in middle-aged males and females in the UK.

Categorical Predictors in Regression

So far we have implicitly assumed that the predictor variable x is numerical, and that the data contain a range of values for x.
The previous example shows how we might wish to use a categorical variable such as sex as a predictor in a regression model.
How do we put a categorical variable into a regression equation?

  EY_i = µ_i = α + β "Female"

does not make sense: only numbers can be put into equations.

Dummy Variables

The solution to this problem is to use dummy variables.
A dummy variable is a 0/1 variable which acts as a proxy for the value of a categorical variable.
For example, if x is a categorical variable with possible categories "Male"/"Female", we can substitute a dummy variable δ with:

  δ_i = 0 if and only if x_i = "Male"
  δ_i = 1 if and only if x_i = "Female"

The regression equation is now:

  EY_i = µ_i = α + β δ_i
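The dummy coding can be done by hand in R. A sketch with simulated data (made-up group means): regressing on a hand-coded 0/1 variable recovers the group means exactly, with the intercept equal to the mean in the 0-coded group.

```r
set.seed(1)
sex <- factor(rep(c("Male", "Female"), each = 25))
y   <- ifelse(sex == "Female", 5, 4) + rnorm(50)

# Hand-coded dummy: 0 = "Male", 1 = "Female"
delta <- as.numeric(sex == "Female")
fit   <- lm(y ~ delta)

# Intercept = male mean; intercept + slope = female mean
coef(fit)
mean(y[sex == "Male"])
mean(y[sex == "Female"])
```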

Dummy Variables

The regression equation:

  EY_i = µ_i = α + β δ_i

  δ_i = 0 if and only if x_i = "Male"
  δ_i = 1 if and only if x_i = "Female"

The mean value of Y_i in females is α + β.
The mean value of Y_i in males is α.
The interpretation of the regression coefficients depends on the coding chosen: we could instead have coded females as 0 and males as 1.

Dummy Variables

When x has more than two possible categories we need more than one dummy variable to code the categories numerically.
For example, suppose x has possible categories "Male"/"Pre-Menopausal Female"/"Post-Menopausal Female".
We need to choose a baseline category, which corresponds to all the dummy variables being equal to zero.
This choice changes the interpretation of the coefficients but has no effect on statistical inferences.

Dummy Variables

  δ_i1 = 1 if and only if x_i = "Pre-Menopausal Female", and δ_i1 = 0 otherwise
  δ_i2 = 1 if and only if x_i = "Post-Menopausal Female", and δ_i2 = 0 otherwise

The regression equation becomes:

  EY_i = µ_i = α + β_1 δ_i1 + β_2 δ_i2

The mean value of Y_i in pre-menopausal females is α + β_1.
The mean value of Y_i in post-menopausal females is α + β_2.
The mean value of Y_i in males (the baseline category) is α.
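R's model.matrix function shows exactly this coding. A sketch (tiny made-up factor with the three categories above): the design matrix has an intercept column plus one 0/1 dummy column per non-baseline category.

```r
# Three categories need two dummy variables; the first level listed
# ("Male") is the baseline, coded as all dummies equal to zero
x <- factor(c("Male", "Pre-Menopausal Female", "Post-Menopausal Female"),
            levels = c("Male", "Pre-Menopausal Female",
                       "Post-Menopausal Female"))
model.matrix(~ x)
```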

Fitting Categorical Predictors in R

R calls categorical variables "factor" variables.
R automatically converts factor variables into dummy variables when they are put into a regression equation:

> lm(rbc ~ sex)

Call:
lm(formula = rbc ~ sex)

Coefficients:
(Intercept)      sexmale

Summary

This lecture has been about linear regression.
Linear regression is used to model the association between the mean of a random variable and one or more predictors.
We've covered:
univariate and multiple regression
interpretation of regression coefficients
interaction terms
dummy variables


More information

2.1 Linear regression with matrices

2.1 Linear regression with matrices 21 Linear regression with matrices The values of the independent variables are united into the matrix X (design matrix), the values of the outcome and the coefficient are represented by the vectors Y and

More information

Section 4.6 Simple Linear Regression

Section 4.6 Simple Linear Regression Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval

More information

Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS

Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS 1a) The model is cw i = β 0 + β 1 el i + ɛ i, where cw i is the weight of the ith chick, el i the length of the egg from which it hatched, and ɛ i

More information

Linear Regression Model. Badr Missaoui

Linear Regression Model. Badr Missaoui Linear Regression Model Badr Missaoui Introduction What is this course about? It is a course on applied statistics. It comprises 2 hours lectures each week and 1 hour lab sessions/tutorials. We will focus

More information

14 Multiple Linear Regression

14 Multiple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in

More information

General Linear Model (Chapter 4)

General Linear Model (Chapter 4) General Linear Model (Chapter 4) Outcome variable is considered continuous Simple linear regression Scatterplots OLS is BLUE under basic assumptions MSE estimates residual variance testing regression coefficients

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

MATH 644: Regression Analysis Methods

MATH 644: Regression Analysis Methods MATH 644: Regression Analysis Methods FINAL EXAM Fall, 2012 INSTRUCTIONS TO STUDENTS: 1. This test contains SIX questions. It comprises ELEVEN printed pages. 2. Answer ALL questions for a total of 100

More information

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Multilevel Models in Matrix Form Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Today s Lecture Linear models from a matrix perspective An example of how to do

More information

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B Simple Linear Regression 35 Problems 1 Consider a set of data (x i, y i ), i =1, 2,,n, and the following two regression models: y i = β 0 + β 1 x i + ε, (i =1, 2,,n), Model A y i = γ 0 + γ 1 x i + γ 2

More information

ST430 Exam 2 Solutions

ST430 Exam 2 Solutions ST430 Exam 2 Solutions Date: November 9, 2015 Name: Guideline: You may use one-page (front and back of a standard A4 paper) of notes. No laptop or textbook are permitted but you may use a calculator. Giving

More information

ACOVA and Interactions

ACOVA and Interactions Chapter 15 ACOVA and Interactions Analysis of covariance (ACOVA) incorporates one or more regression variables into an analysis of variance. As such, we can think of it as analogous to the two-way ANOVA

More information

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression

Acknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression INTRODUCTION TO CLINICAL RESEARCH Introduction to Linear Regression Karen Bandeen-Roche, Ph.D. July 17, 2012 Acknowledgements Marie Diener-West Rick Thompson ICTR Leadership / Team JHU Intro to Clinical

More information

Unit 6 - Simple linear regression

Unit 6 - Simple linear regression Sta 101: Data Analysis and Statistical Inference Dr. Çetinkaya-Rundel Unit 6 - Simple linear regression LO 1. Define the explanatory variable as the independent variable (predictor), and the response variable

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression Reading: Hoff Chapter 9 November 4, 2009 Problem Data: Observe pairs (Y i,x i ),i = 1,... n Response or dependent variable Y Predictor or independent variable X GOALS: Exploring

More information

Lecture 6: Linear Regression

Lecture 6: Linear Regression Lecture 6: Linear Regression Reading: Sections 3.1-3 STATS 202: Data mining and analysis Jonathan Taylor, 10/5 Slide credits: Sergio Bacallado 1 / 30 Simple linear regression Model: y i = β 0 + β 1 x i

More information

SLR output RLS. Refer to slr (code) on the Lecture Page of the class website.

SLR output RLS. Refer to slr (code) on the Lecture Page of the class website. SLR output RLS Refer to slr (code) on the Lecture Page of the class website. Old Faithful at Yellowstone National Park, WY: Simple Linear Regression (SLR) Analysis SLR analysis explores the linear association

More information

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key Statistical Methods III Statistics 212 Problem Set 2 - Answer Key 1. (Analysis to be turned in and discussed on Tuesday, April 24th) The data for this problem are taken from long-term followup of 1423

More information

Topics on Statistics 2

Topics on Statistics 2 Topics on Statistics 2 Pejman Mahboubi March 7, 2018 1 Regression vs Anova In Anova groups are the predictors. When plotting, we can put the groups on the x axis in any order we wish, say in increasing

More information

13 Simple Linear Regression

13 Simple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 3 Simple Linear Regression 3. An industrial example A study was undertaken to determine the effect of stirring rate on the amount of impurity

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 12: Frequentist properties of estimators (v4) Ramesh Johari ramesh.johari@stanford.edu 1 / 39 Frequentist inference 2 / 39 Thinking like a frequentist Suppose that for some

More information

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Today (Re-)Introduction to linear models and the model space What is linear regression Basic properties of linear regression Using

More information

Unit 6 - Introduction to linear regression

Unit 6 - Introduction to linear regression Unit 6 - Introduction to linear regression Suggested reading: OpenIntro Statistics, Chapter 7 Suggested exercises: Part 1 - Relationship between two numerical variables: 7.7, 7.9, 7.11, 7.13, 7.15, 7.25,

More information

Discrete Multivariate Statistics

Discrete Multivariate Statistics Discrete Multivariate Statistics Univariate Discrete Random variables Let X be a discrete random variable which, in this module, will be assumed to take a finite number of t different values which are

More information

R 2 and F -Tests and ANOVA

R 2 and F -Tests and ANOVA R 2 and F -Tests and ANOVA December 6, 2018 1 Partition of Sums of Squares The distance from any point y i in a collection of data, to the mean of the data ȳ, is the deviation, written as y i ȳ. Definition.

More information

De-mystifying random effects models

De-mystifying random effects models De-mystifying random effects models Peter J Diggle Lecture 4, Leahurst, October 2012 Linear regression input variable x factor, covariate, explanatory variable,... output variable y response, end-point,

More information

One-Way ANOVA. Some examples of when ANOVA would be appropriate include:

One-Way ANOVA. Some examples of when ANOVA would be appropriate include: One-Way ANOVA 1. Purpose Analysis of variance (ANOVA) is used when one wishes to determine whether two or more groups (e.g., classes A, B, and C) differ on some outcome of interest (e.g., an achievement

More information

Applied Regression Analysis

Applied Regression Analysis Applied Regression Analysis Chapter 3 Multiple Linear Regression Hongcheng Li April, 6, 2013 Recall simple linear regression 1 Recall simple linear regression 2 Parameter Estimation 3 Interpretations of

More information

ST505/S697R: Fall Homework 2 Solution.

ST505/S697R: Fall Homework 2 Solution. ST505/S69R: Fall 2012. Homework 2 Solution. 1. 1a; problem 1.22 Below is the summary information (edited) from the regression (using R output); code at end of solution as is code and output for SAS. a)

More information

Final Exam. Name: Solution:

Final Exam. Name: Solution: Final Exam. Name: Instructions. Answer all questions on the exam. Open books, open notes, but no electronic devices. The first 13 problems are worth 5 points each. The rest are worth 1 point each. HW1.

More information

Introduction to Linear Regression

Introduction to Linear Regression Introduction to Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Introduction to Linear Regression 1 / 46

More information

Lecture 6 Multiple Linear Regression, cont.

Lecture 6 Multiple Linear Regression, cont. Lecture 6 Multiple Linear Regression, cont. BIOST 515 January 22, 2004 BIOST 515, Lecture 6 Testing general linear hypotheses Suppose we are interested in testing linear combinations of the regression

More information

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University.

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University. Summer School in Statistics for Astronomers V June 1 - June 6, 2009 Regression Mosuk Chow Statistics Department Penn State University. Adapted from notes prepared by RL Karandikar Mean and variance Recall

More information

Simple Linear Regression: One Qualitative IV

Simple Linear Regression: One Qualitative IV Simple Linear Regression: One Qualitative IV 1. Purpose As noted before regression is used both to explain and predict variation in DVs, and adding to the equation categorical variables extends regression

More information

Review of the General Linear Model

Review of the General Linear Model Review of the General Linear Model EPSY 905: Multivariate Analysis Online Lecture #2 Learning Objectives Types of distributions: Ø Conditional distributions The General Linear Model Ø Regression Ø Analysis

More information

Simple, Marginal, and Interaction Effects in General Linear Models

Simple, Marginal, and Interaction Effects in General Linear Models Simple, Marginal, and Interaction Effects in General Linear Models PRE 905: Multivariate Analysis Lecture 3 Today s Class Centering and Coding Predictors Interpreting Parameters in the Model for the Means

More information

Applied Regression Modeling: A Business Approach Chapter 2: Simple Linear Regression Sections

Applied Regression Modeling: A Business Approach Chapter 2: Simple Linear Regression Sections Applied Regression Modeling: A Business Approach Chapter 2: Simple Linear Regression Sections 2.1 2.3 by Iain Pardoe 2.1 Probability model for and 2 Simple linear regression model for and....................................

More information

Regression and Models with Multiple Factors. Ch. 17, 18

Regression and Models with Multiple Factors. Ch. 17, 18 Regression and Models with Multiple Factors Ch. 17, 18 Mass 15 20 25 Scatter Plot 70 75 80 Snout-Vent Length Mass 15 20 25 Linear Regression 70 75 80 Snout-Vent Length Least-squares The method of least

More information

STA442/2101: Assignment 5

STA442/2101: Assignment 5 STA442/2101: Assignment 5 Craig Burkett Quiz on: Oct 23 rd, 2015 The questions are practice for the quiz next week, and are not to be handed in. I would like you to bring in all of the code you used to

More information

Inferences on Linear Combinations of Coefficients

Inferences on Linear Combinations of Coefficients Inferences on Linear Combinations of Coefficients Note on required packages: The following code required the package multcomp to test hypotheses on linear combinations of regression coefficients. If you

More information

Statistics for Engineers Lecture 9 Linear Regression

Statistics for Engineers Lecture 9 Linear Regression Statistics for Engineers Lecture 9 Linear Regression Chong Ma Department of Statistics University of South Carolina chongm@email.sc.edu April 17, 2017 Chong Ma (Statistics, USC) STAT 509 Spring 2017 April

More information

Introduction to the Analysis of Hierarchical and Longitudinal Data

Introduction to the Analysis of Hierarchical and Longitudinal Data Introduction to the Analysis of Hierarchical and Longitudinal Data Georges Monette, York University with Ye Sun SPIDA June 7, 2004 1 Graphical overview of selected concepts Nature of hierarchical models

More information

How to mathematically model a linear relationship and make predictions.

How to mathematically model a linear relationship and make predictions. Introductory Statistics Lectures Linear regression How to mathematically model a linear relationship and make predictions. Department of Mathematics Pima Community College (Compile date: Mon Apr 28 20:50:28

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression September 24, 2008 Reading HH 8, GIll 4 Simple Linear Regression p.1/20 Problem Data: Observe pairs (Y i,x i ),i = 1,...n Response or dependent variable Y Predictor or independent

More information

Chapter 27 Summary Inferences for Regression

Chapter 27 Summary Inferences for Regression Chapter 7 Summary Inferences for Regression What have we learned? We have now applied inference to regression models. Like in all inference situations, there are conditions that we must check. We can test

More information

How to mathematically model a linear relationship and make predictions.

How to mathematically model a linear relationship and make predictions. Introductory Statistics Lectures Linear regression How to mathematically model a linear relationship and make predictions. Department of Mathematics Pima Community College Redistribution of this material

More information

Extensions of One-Way ANOVA.

Extensions of One-Way ANOVA. Extensions of One-Way ANOVA http://www.pelagicos.net/classes_biometry_fa18.htm What do I want You to Know What are two main limitations of ANOVA? What two approaches can follow a significant ANOVA? How

More information

Lecture 15. Hypothesis testing in the linear model

Lecture 15. Hypothesis testing in the linear model 14. Lecture 15. Hypothesis testing in the linear model Lecture 15. Hypothesis testing in the linear model 1 (1 1) Preliminary lemma 15. Hypothesis testing in the linear model 15.1. Preliminary lemma Lemma

More information

Nonstationary time series models

Nonstationary time series models 13 November, 2009 Goals Trends in economic data. Alternative models of time series trends: deterministic trend, and stochastic trend. Comparison of deterministic and stochastic trend models The statistical

More information

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION

More information

Heteroskedasticity. Part VII. Heteroskedasticity

Heteroskedasticity. Part VII. Heteroskedasticity Part VII Heteroskedasticity As of Oct 15, 2015 1 Heteroskedasticity Consequences Heteroskedasticity-robust inference Testing for Heteroskedasticity Weighted Least Squares (WLS) Feasible generalized Least

More information

Regression With a Categorical Independent Variable

Regression With a Categorical Independent Variable Regression ith a Independent Variable ERSH 8320 Slide 1 of 34 Today s Lecture Regression with a single categorical independent variable. Today s Lecture Coding procedures for analysis. Dummy coding. Relationship

More information

Regression With a Categorical Independent Variable

Regression With a Categorical Independent Variable Regression With a Independent Variable Lecture 10 November 5, 2008 ERSH 8320 Lecture #10-11/5/2008 Slide 1 of 54 Today s Lecture Today s Lecture Chapter 11: Regression with a single categorical independent

More information