Chapter 9 Regression. 9.1 Simple linear regression Linear models Least squares Predictions and residuals.


9.1 Simple linear regression

Linear models

Response and explanatory variables
With bivariate data, it is often useful to predict the value of one variable (the response variable, Y) from the other (the explanatory variable, X). A curve or straight line that is drawn close to the crosses on a scatterplot can be used to predict the y-value corresponding to any x. Note that the response variable should always be drawn on the vertical axis.

Linear model
A linear model is an adequate description of many bivariate data sets:
ŷ = b0 + b1 x
The constant b0 is the intercept of the line and describes the y-value when x is zero. The constant b1 is the line's slope; it describes the change in y when x increases by one.

Predictions and residuals

Fitted values
To assess how well a particular linear model fits any one of our data points, (xi, yi), we might consider how well the model would predict the y-value of the point,
ŷi = b0 + b1 xi
These predictions are called fitted values.

Residuals
The difference between the i'th fitted value and its actual y-value is called its residual,
ei = yi − ŷi
The residuals describe the 'errors' that would have resulted from using the model to predict y from the x-values of our data points. Note that the residuals are the vertical distances of the crosses to the line.

Least squares

Aim of small residuals
The residuals from a linear model (vertical distances from the crosses to the line) indicate how closely the model's predictions match the actual responses in the data. Small residuals are good, so the parameters b0 and b1 should be set to make them as small as possible.

Least squares
The size of the residuals is summarised by the residual sum of squares,
Σ ei² = Σ (yi − ŷi)² = Σ (yi − b0 − b1 xi)²
'Good' values for b0 and b1 can be objectively chosen to be the values that minimise the residual sum of squares. This is the method of least squares and the values of b0 and b1 are called least squares estimates.
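As a concrete illustration of fitted values, residuals and the residual sum of squares, here is a short Python sketch. The data set and the trial values of b0 and b1 are invented for illustration; the chapter itself relies on spreadsheets or statistical software for such calculations.

```python
# Fitted values, residuals and residual sum of squares for a trial line.
# The data and the trial coefficients b0, b1 are invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = 0.0, 2.0                                   # trial intercept and slope

fitted = [b0 + b1 * x for x in xs]                  # y-hat_i = b0 + b1 * x_i
residuals = [y - f for y, f in zip(ys, fitted)]     # e_i = y_i - y-hat_i
rss = sum(e ** 2 for e in residuals)                # residual sum of squares

print(rss)
```

For a poorly chosen line the residual sum of squares would be larger; least squares chooses b0 and b1 to make it as small as possible.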

The diagram below represents the squared residuals as blue squares. The least squares estimates minimise the total blue area.

Formulae
The problem of minimising the residual sum of squares is not difficult mathematically, but you will rarely require or use the resulting formulae for b0 and b1 since spreadsheets, statistical programs and even scientific calculators will do the calculations for you. However, for completeness, the formulae are
b1 = Σ (x − x̄)(y − ȳ) / Σ (x − x̄)²
b0 = ȳ − b1 x̄

Normal linear model

Interest in generalising from data
In most bivariate data sets, we have no interest in the specific individuals from which the data are collected. The individuals are 'representative' of a larger population or process, and our main interest is in this underlying population.

Example
A newspaper compiled data from each of New Jersey's 21 counties about the number of people per bank branch in each county and its percentage of minority groups. Local residents might be interested in the specific counties, but most outsiders would want to generalise from the data to describe the relationship in a way that might describe other similar areas in the Eastern USA. How strong is the evidence that banks tend to have fewer branches in areas with large minority groups?

Model for data
In an experiment, several response measurements are often made at each distinct value of X. The diagram below shows one such data set using a histogram for the distribution of Y at each x-value. The response measurements at any x-value can be modelled as a random sample from a normal distribution. The collection of distributions of Y at different values of X is called a regression model.

Normal linear model for the response
The most commonly used regression model is a normal linear model. It involves:
Normality: At each value of X, Y has a normal distribution.
Constant variance: The standard deviation of Y is the same for all values of X.
Linearity: The mean of Y is linearly related to X.
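The least squares formulae above translate directly into code. A minimal sketch with invented data; points lying exactly on a straight line should return that line's intercept and slope.

```python
# Least squares estimates from the closed-form formulae:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1 * xbar
# The data are invented for illustration.
def slope_intercept(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Points lying exactly on y = 1 + 2x should be recovered exactly.
b0, b1 = slope_intercept([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```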
The last two properties of the normal linear model can be expressed as

σy = σ
µy = β0 + β1 x

Another way to describe the model
The normal linear model describes the distribution of Y for any value of X:
Y ~ normal (µy, σy)
where
µy = β0 + β1 x
σy = σ
An equivalent way to write the same model is
y = β0 + β1 x + ε
where ε is called the model error and has a distribution
ε ~ normal (0, σ)
The error, ε, for a data point is the vertical distance between the cross on a scatterplot and the regression line.

[Figure: a data point (x, y), the regression line µy = β0 + β1 x, and the error ε shown as the vertical distance between them; response Y on the vertical axis, explanatory X on the horizontal axis.]

Model parameters
A normal linear model, µy = β0 + β1 x, σy = σ, involves 3 parameters: β0, β1 and σ. The model's slope, β1, and intercept, β0, can be interpreted in a similar way to the slope and intercept of a least squares line.
Slope: Increase in the mean response per unit increase in X.
Intercept: Mean response when X = 0.

Examples of interpretation
Y = Sales of music CD ($), X = Money spent on advertising ($). β1: Increase in mean sales for each extra dollar spent on advertising. β0: Mean sales if there was no advertising.
Y = Exam mark, X = Hours of study by student before exam. β1: Increase in expected mark for each additional hour of study. β0: Expected mark if there is no study.
Y = Hospital stay (days), X = Age of patient. β1: Average extra days in hospital per extra year of age. β0: Average days in hospital at age 0. Not particularly meaningful here.

Band containing about 95% of values
Applying the rule of thumb to the errors, about 95% of them will be within 2 standard deviations of zero, i.e. between ±2σ. Since the errors are vertical distances of data points to the regression line, a band 2σ on each side of it should contain about 95% of the crosses on a scatterplot of the data.
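The ±2σ band can be checked by simulation. This sketch generates data from a normal linear model (the parameter values β0 = 5, β1 = 2, σ = 1 are invented) and counts the fraction of points within 2σ of the line.

```python
# Simulating the normal linear model y = beta0 + beta1*x + eps, eps ~ normal(0, sigma),
# and checking that roughly 95% of points fall within 2*sigma of the line.
# All parameter values are invented for illustration.
import random

random.seed(1)
beta0, beta1, sigma = 5.0, 2.0, 1.0

n = 10_000
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]

# Fraction of points inside the band mu_y +/- 2*sigma.
inside = sum(abs(y - (beta0 + beta1 * x)) <= 2 * sigma for x, y in zip(xs, ys))
frac = inside / n
print(frac)   # close to 0.95
```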
[Figure: the regression line µy = β0 + β1 x with a band 2σ on each side of it; approximately 95% of crosses lie in this band.]

Exercises (only available online):
Pick the explanatory variable and response
Draw a straight line
Find the slope and intercept
Interpret the slope and intercept
Find a residual

9.2 Linear model assumptions

Assumptions in a normal linear model
The normal linear model is:
y = β0 + β1 x + ε,  ε ~ normal (0, σ)
The following four requirements are implicit in the model but may be violated, as illustrated by the examples.
Linearity: The response may change nonlinearly with x.
Constant standard deviation: The response may be more variable at some x than others.
Normal distribution for errors: The errors may have skew distributions.
Independent errors: When the observations are ordered in time, successive errors may be correlated.

Residual plots
Problems may be immediately apparent in a scatterplot of the raw data, but a residual plot often highlights them.

Probability plot of residuals
The normal linear model assumes that the model errors are normally distributed, ε ~ normal (0, σ). A histogram of the residuals can be examined for normality, but a better way is with a normal probability plot of the residuals. If the residuals are normally distributed, the crosses in the normal probability plot should lie close to a straight line.
Warning: If the assumptions of linearity and constant variance are violated, or if there are outliers, the probability plot of residuals will often be curved, irrespective of the error distribution. Only draw a probability plot if you are sure that the data are linear, have constant variance and have no outliers.

Outliers
In a scatterplot, a cross that is unusually far above or below the regression line is an outlier. It would correspond to a large error, ε.
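The coordinates of a normal probability plot are easy to compute without plotting software: sort the residuals and pair them with standard normal quantiles. A sketch using Python's standard library; the residual values are invented.

```python
# Normal probability plot coordinates: sorted residuals against the
# corresponding standard normal quantiles.  If the residuals are roughly
# normal, these pairs lie close to a straight line.  Residuals are invented.
from statistics import NormalDist

residuals = [0.3, -1.2, 0.8, -0.4, 1.5, -0.9, 0.1, 0.6, -0.2, -0.6]

n = len(residuals)
sorted_res = sorted(residuals)
# Quantile for the i-th ordered value, using the common (i + 0.5)/n positions.
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

for q, e in zip(theoretical, sorted_res):
    print(round(q, 2), e)
```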

[Figure: a scatterplot with a regression line and an outlier far below it; the outlier's error is its large vertical distance from the line.]

Large residuals pull very strongly on the line since they are squared in the least squares criterion. As a result, outliers will strongly pull the least squares line towards themselves, making their residuals smaller than you might otherwise expect.

Residual plot
Outliers are usually clearer if the residuals are plotted against X rather than the original response.

Standardised residuals (opt)
To help assess the residuals, we usually standardise them, dividing each by an estimate of its standard deviation:
standardised residual = e / s_e
The standardised residuals are each approximately normal (0, 1) if the normal linear model fits, so only about 5% will be outside the range ±2, and hardly any outside the range ±3. Standardised residuals greater than 3 or less than −3 are often taken to indicate possible outliers. Note however that in a large data set of 1,000 values, we would expect 50 values outside ±2 and 3 values outside ±3. Values a little outside ±3 can occur by chance.

Outliers and leverage (opt)
Problems with residuals as indicators of outliers: All data points pull the least squares line towards themselves; the line is positioned to minimise the sum of squares of the residuals, Σ ei². If an outlier corresponds to an x-value near its mean, it usually will have a large residual. However, if the outlier occurs at an extreme x-value, it has a stronger influence on the position of the least squares line than the other data points. Such points are called high leverage points and pull the least squares line strongly towards them. Outliers that are high leverage points may therefore result in residuals that do not stand out from the other residuals.

Exercises (only available online):
Pick the correct residual plot
Identify regression problems (opt)
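Standardising residuals and flagging values beyond ±3 can be sketched as follows. This is a simplification: each residual is divided by a single overall estimate s_e rather than a leverage-adjusted standard error, and the residual values are invented.

```python
# Standardised residuals: divide each residual by an estimate of its
# standard deviation, then flag values beyond +/-3 as possible outliers.
# Simplification: one overall s_e is used instead of leverage-adjusted
# standard errors; the residual values are invented for illustration.
residuals = [0.5 if i % 2 == 0 else -0.5 for i in range(29)] + [-6.0]

n = len(residuals)
s_e = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5   # divisor n - 2

standardised = [e / s_e for e in residuals]
flagged = [i for i, z in enumerate(standardised) if abs(z) > 3]
print(flagged)   # the -6.0 residual stands out
```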

9.3 Inference for regression parameters

Estimating the slope and intercept
Least squares: In practical situations, we must estimate β0, β1 and σ from a data set that we believe satisfies the normal linear model. The best estimates of β0 and β1 are the slope and intercept of the least squares line, b0 and b1.

[Figure: the unknown regression line µy = β0 + β1 x with a data point's error ε, and the least squares line ŷ = b0 + b1 x with the same point's residual e.]

Since b0 and b1 are functions of a data set that we assume to be a random sample from the normal linear model, b0 and b1 are themselves random quantities and have distributions.

Simulated example
The diagram below represents a regression model with a grey band. A sample of 20 values has been generated from this model and the least squares line (shown in blue) has been fitted to the simulated data. The least squares line provides estimates of the slope and intercept but they are not exactly equal to the underlying model values. A different sample would give 20 different points and a different least squares line, so the least squares slope and intercept are random.

Estimating the error standard devn
Errors and residuals: The error, ε, for any data point is its vertical distance from the regression line. In practice, the slope and intercept of the regression line are unknown, so the errors are also unknown values, but the least squares residuals provide estimates.

The third unknown parameter of the normal linear model, σ, is the standard deviation of the errors, σ = st devn(ε). σ can be estimated from the least squares residuals, {ei},
σ̂ = √( Σ ei² / (n − 2) )
This is similar to the formula for the standard deviation of the residuals, but uses the divisor (n − 2) instead of (n − 1). It describes the size of a 'typical' residual.

9.3.3 Distn of least squares estimates
The least squares line varies from sample to sample; it is random. The least squares estimates b0 and b1 of the two linear model parameters β0 and β1 therefore also vary from sample to sample and have normal distributions that are centered on β0 and β1 respectively.

Standard error of least squares slope
When the least squares slope, b1, is used as an estimate of β1, it has standard error
σ_b1 = σ / √( Σ (x − x̄)² ) = σ / ( s_x √(n − 1) )
where σ is the standard deviation of the errors (i.e. the spread of points around the regression line), n is the number of data points, and s_x is the sample standard deviation of X.

Implications for data collection
The standard error of b1 is lowest when:
1. the response standard deviation, σ, is low
2. the sample size, n, is large
3. the spread of x-values is high
To get the most accurate estimate of the slope from experimental data:
Reduce σ: σ can be reduced by ensuring that the experimental units are as similar as possible.
Increase n: Collect as much data as possible.
Increase s_x: Choose to run the experiment with x-values that are widely spread.
However, don't just collect data at the ends of the 'acceptable' range of x-values, even though this maximises s_x; with no observations in between, you cannot tell whether the relationship is linear.

Testing whether slope is zero
Does the response depend on X? In a normal linear model, the response has a distribution whose mean, µy, depends linearly on the explanatory variable,
Y ~ normal (µy, σy)
If the slope parameter, β1, is zero, then the response has a normal distribution that does not depend on X,
Y ~ normal (β0, σ)
This can be tested formally with a hypothesis test for whether β1 is zero.

Hypothesis test
H0: β1 = 0
HA: β1 ≠ 0
The test is based on the 'statistical distance' of b1 from zero,
t = b1 / σ_b1
and this has a t distribution with (n − 2) degrees of freedom if there really is no relationship.
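The estimate of σ, the standard error of the slope and the t statistic can be computed together. A sketch with invented data; the resulting t value would be compared with a t distribution on n − 2 degrees of freedom.

```python
# Error s.d. estimate, standard error of the slope and the t statistic
# for testing H0: beta1 = 0.  The data are invented for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.2, 3.1, 3.4, 4.9, 5.1, 6.3, 6.8, 8.0]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sigma_hat = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5  # divisor n - 2
se_b1 = sigma_hat / sxx ** 0.5     # sigma-hat / sqrt(sum of (x - xbar)^2)

t = b1 / se_b1                     # compare with t distribution, n - 2 = 6 d.f.
print(round(b1, 3), round(se_b1, 4), round(t, 2))
```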

The usual testing framework applies:
Summary statistic: helps distinguish H0 and HA.
Test statistic: has a standard distribution with no unknown parameters under H0.
P-value: the probability of a test statistic more 'extreme' than the one recorded; the p-value is a sum of tail areas.

Using output from statistical software
Computer software will provide everything you need to perform the test in its regression output: the least squares estimates, the standard error of the slope, the test statistic and the p-value.

Strength of evidence and relationship
It is important to distinguish between the correlation coefficient, r, and the p-value for testing whether there is a relationship between X and Y.
Correlation coefficient, r: describes the strength of the relationship between X and Y.
The p-value for testing whether X and Y are related: describes the strength of evidence for whether X and Y are related at all.
It is important not to confuse these two values when interpreting the p-value for a test. A p-value close to zero does not imply that there must be a strong relationship; it just means that we are sure that there is some relationship, however weak. A large p-value does not imply that the relationship must be weak; the sample size might just be too small to be sure that the relationship exists.

This is partly explained by an alternative formula for the test statistic,
t = b1 / σ_b1 = r √(n − 2) / √(1 − r²)
The test statistic and the p-value therefore both depend on both r and the sample size, n. Increasing n and increasing r both result in a lower p-value.

[Figure: p-values compared for samples with n = 30 and r = 0.24, with n = 200 (more data), and with r = 0.63 (a stronger relationship).]

9.4 Predicting the response

Point estimates of the response
Our point estimate (best guess) for the response at a particular value of x is
ŷ = b0 + b1 x
Note that the least squares line should only be used for prediction when the linear model assumptions hold. In particular there should be:
1. No outliers or points with high leverage
2. No curvature
It is also dangerous to predict far outside the range of the x's we have used to fit the model (the training data) since we have no information about whether the relationship remains linear. This is called extrapolation.

Estimated response distn at X (opt)
A normal linear model provides a response distribution for all X. With estimates for all three model parameters, we can obtain the approximate response distribution at any x-value, even if we have no data at that x-value. The diagram below shows two theoretical distributions from the above model.

Variability of estimate at X (opt)
The predicted response at X is ŷ = b0 + b1 x and has a normal distribution with mean µy = β0 + β1 x. Its standard deviation depends on the value at which the prediction is being made. The further x is from its mean in the training data, x̄, the greater the variability in the prediction.

Simulation
The effect of the x-value on the variability of the predicted response can be shown using least squares lines fitted to simulated data. (The spread would be even greater for predicting at x = 10.)
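The equivalence of the two formulae for the test statistic can be verified numerically; the data below are invented.

```python
# Checking the identity t = b1/se(b1) = r * sqrt(n - 2) / sqrt(1 - r^2)
# numerically on an invented data set.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.5, 2.1, 2.7, 3.8, 4.1, 5.2]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b1 = sxy / sxx
b0 = ybar - b1 * xbar
rss = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
se_b1 = (rss / (n - 2) / sxx) ** 0.5

r = sxy / (sxx * syy) ** 0.5                      # correlation coefficient

t_direct = b1 / se_b1
t_from_r = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5
print(round(t_direct, 6), round(t_from_r, 6))     # the two agree
```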

9.4.4 Estimating the mean vs prediction (opt)

Estimating the mean response
In some situations, we are interested in estimating the mean response at some x-value,
µy = β0 + β1 x
The least squares estimate,
ŷ = b0 + b1 x
becomes increasingly accurate as the sample size increases (since b0 and b1 become more accurate estimates of β0 and β1).

Predicting a single item's response
To predict the response for a single new individual with a known x-value, the same prediction would be used,
ŷ = b0 + b1 x
However, no matter how accurately we estimate the mean response for such individuals, a single new individual's response will have a distribution with standard deviation σ around this mean, and we have no information to help us predict how far it will be from its mean. The prediction error cannot have a standard deviation that is less than σ.

Simulation
The error in predicting an individual's response is usually greater than the error in estimating the mean response. The diagram below contrasts estimation of the mean response and prediction of a new individual's response at x = 5.5. Least squares lines have been fitted to several simulated data sets, one of which is shown on the left. The two kinds of errors from the simulations are shown on the right, showing that the prediction errors are usually greater.

Confidence & prediction intervals (opt)
The same value, ŷ = b0 + b1 x, is used both to estimate the mean response at x and to predict a new individual's response at x, but the errors are different in the two situations; they tend to be larger for predicting a new value.
95% confidence interval for the mean response:
(b0 + b1 x) ± t_{n−2} × se(b0 + b1 x)
A formula for the standard error on the right exists, but you should rely on statistical software to find its value.
95% prediction interval for a new individual's response:
(b0 + b1 x) ± t_{n−2} × k
where k is greater than the corresponding standard error for the confidence interval.
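The two intervals can be sketched using the standard textbook formulae for the standard errors; the chapter defers these to software, so treat the formulae as an assumption here. The data, the value x0 and the t multiplier 2.776 (97.5th percentile of t with 4 degrees of freedom) are for illustration only.

```python
# 95% confidence interval for the mean response and 95% prediction interval
# for a new response at x0.  Standard textbook standard-error formulae are
# assumed; data, x0 and the t multiplier are invented for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 2.9, 4.1, 4.8, 6.2, 6.9]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar
sigma_hat = (sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)) ** 0.5

x0 = 3.5
t_mult = 2.776                     # 97.5th percentile of t with n - 2 = 4 d.f.
yhat = b0 + b1 * x0

se_mean = sigma_hat * (1 / n + (x0 - xbar) ** 2 / sxx) ** 0.5      # mean response
se_pred = sigma_hat * (1 + 1 / n + (x0 - xbar) ** 2 / sxx) ** 0.5  # new value

ci = (yhat - t_mult * se_mean, yhat + t_mult * se_mean)
pi = (yhat - t_mult * se_pred, yhat + t_mult * se_pred)
print(ci, pi)                      # the prediction interval is wider
```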
Statistical software should again be used to find its value.

Example
The diagram below shows 95% confidence intervals for the mean response at x and 95% prediction intervals for a new response at x, as bands, for a small data set with n = 7 values.

Extrapolation
These 95% confidence intervals and 95% prediction intervals are valid within the range of x-values about which we have collected data, but they should not be relied on for extrapolation. Both intervals assume that the normal linear model describes the process, but we have no information about linearity beyond the x-values that have been collected.

Exercise: Predict the response. Exercises are only available online.

9.5 Coefficient of determination

Sums of squares
Total variation: The total sum of squares reflects the total variability of the response. The overall variance of all response values is the total sum of squares divided by (n − 1).
Explained variation (signal): The explained sum of squares is the variation that is explained by the model.
Residual variation (noise): The residual sum of squares is the unexplained variation. Note that the pooled estimate of the error variance, σ², is the residual sum of squares divided by (n − 2).

Relationship between sums of squares
The following relationship requires some algebra to prove but is important:
Σ (yi − ȳ)² = Σ (ŷi − ȳ)² + Σ (yi − ŷi)²
i.e. total sum of squares = explained sum of squares + residual sum of squares.

Coefficient of determination
When the relationship is strong, the explained sum of squares is close to the total sum of squares (and the residual sum of squares is small). When the relationship is weak, the explained sum of squares is small relative to the total sum of squares.
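The sum-of-squares decomposition can be verified numerically on any least squares fit; the data below are invented.

```python
# Verifying the decomposition: total SS = explained SS + residual SS
# on an invented data set.
xs = [1, 2, 3, 4, 5]
ys = [1.8, 3.2, 3.9, 5.5, 6.1]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * x for x in xs]

total_ss = sum((y - ybar) ** 2 for y in ys)
explained_ss = sum((f - ybar) ** 2 for f in fitted)
residual_ss = sum((y - f) ** 2 for y, f in zip(ys, fitted))

r_squared = explained_ss / total_ss     # proportion of variation explained
print(round(total_ss, 4), round(explained_ss + residual_ss, 4))   # equal
print(round(r_squared, 3))
```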

A useful summary statistic is the proportion of the total variation that is explained, the coefficient of determination, R²:
R² = explained sum of squares / total sum of squares
A proportion (1 − R²) of the total variation remains unexplained by the model. Although it is derived with quite a different aim, R² equals the square of the correlation coefficient between X and Y.

9.6 Multiple regression

More than one explanatory variable
Response and explanatory variables: We are often interested in how a 'response' variable, Y, depends on other explanatory variables. If there is a single explanatory variable, X, we can predict Y from X with a simple linear model of the form
ŷ = b0 + b1 x
However, if other explanatory variables have been recorded from each individual, we should be able to use them to predict the response more accurately.

Multiple regression equation
Adding extra variables: A simple linear model for a single explanatory variable,
ŷ = b0 + b1 x
can be easily extended to describe the effect of a second explanatory variable, Z, with an extra linear term,
ŷ = b0 + b1 x + b2 z
and so on with more explanatory variables,
ŷ = b0 + b1 x + b2 z + b3 w + ...
This type of model is called a multiple regression model.

Coefficients: Despite our use of the same symbols (b0, b1, ...) for all three models above, their 'best' values are often different for the different models. An example will be given on the next page.

Interpreting coefficients
Marginal and conditional relationships: In a linear model that predicts a response from several explanatory variables, the least squares coefficient associated with any explanatory variable describes its effect on the response if all other variables are held constant. This is also called the variable's conditional effect on the response. This may be very different from the size, and even the sign, of the coefficient when a linear model is fitted with only that single explanatory variable. This simple linear model describes the marginal relationship between the response and that variable.

Example: In a model for predicting the percentage body fat of men, the best model (as determined by least squares) in a simple model with weight is
Predicted body fat = ... + 0.162 × Weight
For each 1 lb extra Weight, men have, on average, 0.162% more body fat. However, if we add Abdomen circumference to the model, the best values for the coefficients are
Predicted body fat = ... − 0.136 × Weight + ... × Abdomen
For each 1 lb extra Weight, men have, on average, 0.136% less body fat than others with the same Abdomen circumference.

Standard errors
General linear model: The general linear model is
y = β0 + β1 x1 + β2 x2 + ... + ε,  ε ~ normal (0, σ)
Parameter estimates and standard errors: The best estimates of β0, β1, ... are the least squares estimates, b0, b1, ... The best estimate of σ² is the residual sum of squares divided by its degrees of freedom,
σ̂² = residual sum of squares / (n − p)
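The contrast between marginal and conditional coefficients can be reproduced with a small multiple regression. This sketch solves the normal equations directly; the data are invented and constructed so that the marginal slope of y on x is positive while its conditional coefficient, holding z fixed, is negative, echoing the Weight/Abdomen example.

```python
# Marginal vs conditional coefficients via multiple regression.  The normal
# equations (X'X) b = X'y are solved with a small Gaussian elimination.
# All data are invented: y falls with x once z is held fixed, although x and
# z (and hence x and y) rise together.
def solve(a, rhs):
    """Solve the linear system a * x = rhs by Gaussian elimination."""
    n = len(a)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def least_squares(rows, y):
    """Coefficients [b0, b1, ...] for y = b0 + b1*rows[i][0] + ... ."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

data = [(1, 2), (2, 3), (3, 5), (4, 6), (5, 8), (6, 9)]   # (x, z) pairs
ys = [2 * z - 0.5 * x for x, z in data]                   # true conditional effects

b_mult = least_squares(data, ys)                          # [b0, b_x, b_z]
b_simple = least_squares([(x,) for x, _ in data], ys)     # marginal model in x

print([round(v, 3) for v in b_mult])      # x coefficient is negative
print([round(v, 3) for v in b_simple])    # x coefficient is positive
```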

where n is the number of observations and p is the number of β-parameters (i.e. the number of explanatory variables plus 1). The least squares estimates, b0, b1, ... are random quantities and have distributions. The formulae for their standard errors are complex, but statistical software will report their values.

Example: The equation below gives the least squares equation for predicting the percentage body fat of men, based on other body measurements. The table below shows the standard errors of these coefficients and the estimate of the error standard deviation, σ.

Inference for general linear models
Hypothesis tests for single parameters: A test of whether a single parameter is zero asks whether the corresponding explanatory variable can be dropped from the full model. The test statistic is the 'statistical distance' of the least squares estimate, bi, from zero, and its p-value is found from the tail area of the t distribution with (n − p) degrees of freedom.

Interpretation of p-values: The p-values are interpreted in the usual way as the strength of evidence against the null hypothesis. However, each p-value assesses whether you can drop a single explanatory variable from the full model. After dropping one variable from the full model, the p-values for the other variables will change, and variables that previously seemed unimportant may no longer be so. If several explanatory variables have high p-values, this does not give evidence that you can simultaneously drop all of them from the model.

Example: The table below shows the p-values for testing whether the individual parameters are zero in the body fat model. Several p-values are higher than 0.1, giving evidence that each of these variables could be dropped individually from the full model, but this does not mean that we could drop all such variables simultaneously.

9.7 What will be assessed?
What you need to know: You will not be examined about everything in this chapter.
Some of the material has been included to explain why the chapter's methods are used, in the hope that it will help you to understand these methods better. What you need to learn for the exam is more limited. We now describe what we expect you to be able to do in the assignment and exam after studying the regression chapter.

A. Simple linear regression
Simple linear regression models are used to predict the value of a "response" from a single "explanatory" variable.
1. Identify the response and explanatory variable: When a scenario is described in words, identify the response and explanatory variable.
2. Describe the relationship: When shown a scatterplot of Y against X, describe the relationship between the

variables: Positive or negative? Linear or curved? Strong or weak?
3. Given Excel regression output, you need to:
(a) Regression model: Find the values of the slope and intercept in the Excel output. Write down the regression model using these values. Use the regression model to predict the response from a value of the explanatory variable.
(b) Slope and intercept: Explain what the values of the slope and intercept describe, in a way that a non-statistician might understand. Use the p-value associated with the explanatory variable to test whether the variables are related. Explain your conclusion from this hypothesis test to a non-statistician.
(c) Coefficient of determination, R²: Identify the coefficient of determination, R², from the Excel output. Interpret the R² value (in terms of the percentage of response variation explained by the model).
4. From a scatterplot of residuals against the explanatory variable: List the assumptions required for the model (linearity, constant variance, independence). Use the scatterplot to discuss whether the model assumptions hold.

B. Multiple regression
Multiple regression models extend the idea of simple linear regression with two or more explanatory variables. Given Excel output from a multiple regression model, you should be able to:
1. Regression model: Write down the regression model that best predicts the response from all explanatory variables. Use the regression model to predict the response from particular values of all explanatory variables.
2. Slope parameters: Explain what the values of the slope parameters describe, in a way that a non-statistician might understand.
3. Important explanatory variables: Interpret the p-values associated with the different explanatory variables.
Explain which explanatory variables are most important and which might be considered for dropping from the model.


More information

Probability Distributions

Probability Distributions CONDENSED LESSON 13.1 Probability Distributions In this lesson, you Sketch the graph of the probability distribution for a continuous random variable Find probabilities by finding or approximating areas

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) Authored by: Sarah Burke, PhD Version 1: 31 July 2017 Version 1.1: 24 October 2017 The goal of the STAT T&E COE

More information

Inference with Simple Regression

Inference with Simple Regression 1 Introduction Inference with Simple Regression Alan B. Gelder 06E:071, The University of Iowa 1 Moving to infinite means: In this course we have seen one-mean problems, twomean problems, and problems

More information

ACCUPLACER MATH 0310

ACCUPLACER MATH 0310 The University of Teas at El Paso Tutoring and Learning Center ACCUPLACER MATH 00 http://www.academics.utep.edu/tlc MATH 00 Page Linear Equations Linear Equations Eercises 5 Linear Equations Answer to

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

SMAM 319 Exam1 Name. a B.The equation of a line is 3x + y =6. The slope is a. -3 b.3 c.6 d.1/3 e.-1/3

SMAM 319 Exam1 Name. a B.The equation of a line is 3x + y =6. The slope is a. -3 b.3 c.6 d.1/3 e.-1/3 SMAM 319 Exam1 Name 1. Pick the best choice. (10 points-2 each) _c A. A data set consisting of fifteen observations has the five number summary 4 11 12 13 15.5. For this data set it is definitely true

More information

Graphing and Optimization

Graphing and Optimization BARNMC_33886.QXD //7 :7 Page 74 Graphing and Optimization CHAPTER - First Derivative and Graphs - Second Derivative and Graphs -3 L Hôpital s Rule -4 Curve-Sketching Techniques - Absolute Maima and Minima

More information

Chapter 8. Linear Regression. Copyright 2010 Pearson Education, Inc.

Chapter 8. Linear Regression. Copyright 2010 Pearson Education, Inc. Chapter 8 Linear Regression Copyright 2010 Pearson Education, Inc. Fat Versus Protein: An Example The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu: Copyright

More information

Central Limit Theorem and the Law of Large Numbers Class 6, Jeremy Orloff and Jonathan Bloom

Central Limit Theorem and the Law of Large Numbers Class 6, Jeremy Orloff and Jonathan Bloom Central Limit Theorem and the Law of Large Numbers Class 6, 8.5 Jeremy Orloff and Jonathan Bloom Learning Goals. Understand the statement of the law of large numbers. 2. Understand the statement of the

More information

Harvard University. Rigorous Research in Engineering Education

Harvard University. Rigorous Research in Engineering Education Statistical Inference Kari Lock Harvard University Department of Statistics Rigorous Research in Engineering Education 12/3/09 Statistical Inference You have a sample and want to use the data collected

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

explicit expression, recursive, composition of functions, arithmetic sequence, geometric sequence, domain, range

explicit expression, recursive, composition of functions, arithmetic sequence, geometric sequence, domain, range Jordan-Granite-Canyons Consortium Secondary Math 1: Unit B (7 8 Weeks) Unit : Linear and Eponential Relationships In earlier grades, students define, evaluate, and compare functions, and use them to model

More information

Ordinary Least Squares Regression Explained: Vartanian

Ordinary Least Squares Regression Explained: Vartanian Ordinary Least Squares Regression Explained: Vartanian When to Use Ordinary Least Squares Regression Analysis A. Variable types. When you have an interval/ratio scale dependent variable.. When your independent

More information

Density Temp vs Ratio. temp

Density Temp vs Ratio. temp Temp Ratio Density 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Density 0.0 0.2 0.4 0.6 0.8 1.0 1. (a) 170 175 180 185 temp 1.0 1.5 2.0 2.5 3.0 ratio The histogram shows that the temperature measures have two peaks,

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

TEACHER NOTES MATH NSPIRED

TEACHER NOTES MATH NSPIRED Math Objectives Students will produce various graphs of Taylor polynomials. Students will discover how the accuracy of a Taylor polynomial is associated with the degree of the Taylor polynomial. Students

More information

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION

LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION LAB 3 INSTRUCTIONS SIMPLE LINEAR REGRESSION In this lab you will first learn how to display the relationship between two quantitative variables with a scatterplot and also how to measure the strength of

More information

Important note: Transcripts are not substitutes for textbook assignments. 1

Important note: Transcripts are not substitutes for textbook assignments. 1 In this lesson we will cover correlation and regression, two really common statistical analyses for quantitative (or continuous) data. Specially we will review how to organize the data, the importance

More information

ONLINE PAGE PROOFS. Relationships between two numerical variables

ONLINE PAGE PROOFS. Relationships between two numerical variables 14 Relationships between two numerical variables 14.1 Kick off with CAS 14.2 Scatterplots and basic correlation 14.3 Further correlation coefficients 14.4 Making predictions 14.5 Review 14.1 Kick off with

More information

Inference for Regression Simple Linear Regression

Inference for Regression Simple Linear Regression Inference for Regression Simple Linear Regression IPS Chapter 10.1 2009 W.H. Freeman and Company Objectives (IPS Chapter 10.1) Simple linear regression p Statistical model for linear regression p Estimating

More information

28. SIMPLE LINEAR REGRESSION III

28. SIMPLE LINEAR REGRESSION III 28. SIMPLE LINEAR REGRESSION III Fitted Values and Residuals To each observed x i, there corresponds a y-value on the fitted line, y = βˆ + βˆ x. The are called fitted values. ŷ i They are the values of

More information

AMS 7 Correlation and Regression Lecture 8

AMS 7 Correlation and Regression Lecture 8 AMS 7 Correlation and Regression Lecture 8 Department of Applied Mathematics and Statistics, University of California, Santa Cruz Suumer 2014 1 / 18 Correlation pairs of continuous observations. Correlation

More information

Stat 101 Exam 1 Important Formulas and Concepts 1

Stat 101 Exam 1 Important Formulas and Concepts 1 1 Chapter 1 1.1 Definitions Stat 101 Exam 1 Important Formulas and Concepts 1 1. Data Any collection of numbers, characters, images, or other items that provide information about something. 2. Categorical/Qualitative

More information

Business Statistics. Lecture 9: Simple Regression

Business Statistics. Lecture 9: Simple Regression Business Statistics Lecture 9: Simple Regression 1 On to Model Building! Up to now, class was about descriptive and inferential statistics Numerical and graphical summaries of data Confidence intervals

More information

ALGEBRA I SEMESTER EXAMS PRACTICE MATERIALS SEMESTER 2 27? 1. (7.2) What is the value of (A) 1 9 (B) 1 3 (C) 9 (D) 3

ALGEBRA I SEMESTER EXAMS PRACTICE MATERIALS SEMESTER 2 27? 1. (7.2) What is the value of (A) 1 9 (B) 1 3 (C) 9 (D) 3 014-015 SEMESTER EXAMS SEMESTER 1. (7.) What is the value of 1 3 7? (A) 1 9 (B) 1 3 (C) 9 (D) 3. (7.3) The graph shows an eponential function. What is the equation of the function? (A) y 3 (B) y 3 (C)

More information

Secondary 1 Vocabulary Cards and Word Walls Revised: June 27, 2012

Secondary 1 Vocabulary Cards and Word Walls Revised: June 27, 2012 Secondary 1 Vocabulary Cards and Word Walls Revised: June 27, 2012 The vocabulary cards in this file match the Common Core, the math curriculum adopted by the Utah State Board of Education, August 2010.

More information

Chapter 7. Scatterplots, Association, and Correlation

Chapter 7. Scatterplots, Association, and Correlation Chapter 7 Scatterplots, Association, and Correlation Bin Zou (bzou@ualberta.ca) STAT 141 University of Alberta Winter 2015 1 / 29 Objective In this chapter, we study relationships! Instead, we investigate

More information

A Practitioner s Guide to Generalized Linear Models

A Practitioner s Guide to Generalized Linear Models A Practitioners Guide to Generalized Linear Models Background The classical linear models and most of the minimum bias procedures are special cases of generalized linear models (GLMs). GLMs are more technically

More information

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics Exploring Data: Distributions Look for overall pattern (shape, center, spread) and deviations (outliers). Mean (use a calculator): x = x 1 + x

More information

Class 26: review for final exam 18.05, Spring 2014

Class 26: review for final exam 18.05, Spring 2014 Probability Class 26: review for final eam 8.05, Spring 204 Counting Sets Inclusion-eclusion principle Rule of product (multiplication rule) Permutation and combinations Basics Outcome, sample space, event

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

Simple Linear Regression for the Climate Data

Simple Linear Regression for the Climate Data Prediction Prediction Interval Temperature 0.2 0.0 0.2 0.4 0.6 0.8 320 340 360 380 CO 2 Simple Linear Regression for the Climate Data What do we do with the data? y i = Temperature of i th Year x i =CO

More information

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006 Chapter 17 Simple Linear Regression and Correlation 17.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

AP STATISTICS Name: Period: Review Unit IV Scatterplots & Regressions

AP STATISTICS Name: Period: Review Unit IV Scatterplots & Regressions AP STATISTICS Name: Period: Review Unit IV Scatterplots & Regressions Know the definitions of the following words: bivariate data, regression analysis, scatter diagram, correlation coefficient, independent

More information

STA Module 5 Regression and Correlation. Learning Objectives. Learning Objectives (Cont.) Upon completing this module, you should be able to:

STA Module 5 Regression and Correlation. Learning Objectives. Learning Objectives (Cont.) Upon completing this module, you should be able to: STA 2023 Module 5 Regression and Correlation Learning Objectives Upon completing this module, you should be able to: 1. Define and apply the concepts related to linear equations with one independent variable.

More information

Unit 6 - Introduction to linear regression

Unit 6 - Introduction to linear regression Unit 6 - Introduction to linear regression Suggested reading: OpenIntro Statistics, Chapter 7 Suggested exercises: Part 1 - Relationship between two numerical variables: 7.7, 7.9, 7.11, 7.13, 7.15, 7.25,

More information

Statistical Methods for Data Mining

Statistical Methods for Data Mining Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Linear regression Linear regression is a simple approach to supervised learning. It assumes that the dependence

More information

Chapter 16. Simple Linear Regression and Correlation

Chapter 16. Simple Linear Regression and Correlation Chapter 16 Simple Linear Regression and Correlation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

FAQ: Linear and Multiple Regression Analysis: Coefficients

FAQ: Linear and Multiple Regression Analysis: Coefficients Question 1: How do I calculate a least squares regression line? Answer 1: Regression analysis is a statistical tool that utilizes the relation between two or more quantitative variables so that one variable

More information

Confidence Intervals, Testing and ANOVA Summary

Confidence Intervals, Testing and ANOVA Summary Confidence Intervals, Testing and ANOVA Summary 1 One Sample Tests 1.1 One Sample z test: Mean (σ known) Let X 1,, X n a r.s. from N(µ, σ) or n > 30. Let The test statistic is H 0 : µ = µ 0. z = x µ 0

More information

Simple linear regression: linear relationship between two qunatitative variables. Linear Regression. The regression line

Simple linear regression: linear relationship between two qunatitative variables. Linear Regression. The regression line Linear Regression Simple linear regression: linear relationship etween two qunatitative variales The regression line Facts aout least-squares regression Residuals Influential oservations Cautions aout

More information

Tables Table A Table B Table C Table D Table E 675

Tables Table A Table B Table C Table D Table E 675 BMTables.indd Page 675 11/15/11 4:25:16 PM user-s163 Tables Table A Standard Normal Probabilities Table B Random Digits Table C t Distribution Critical Values Table D Chi-square Distribution Critical Values

More information

Vocabulary. Fitting a Line to Data. Lesson 2-2 Linear Models

Vocabulary. Fitting a Line to Data. Lesson 2-2 Linear Models Lesson 2-2 Linear Models BIG IDEA The sum of squared deviations is a statistic for determining which of two lines fi ts the data better. A linear function is a set of ordered pairs (, ) satisfing an equation

More information

appstats8.notebook October 11, 2016

appstats8.notebook October 11, 2016 Chapter 8 Linear Regression Objective: Students will construct and analyze a linear model for a given set of data. Fat Versus Protein: An Example pg 168 The following is a scatterplot of total fat versus

More information

Chapter 8. Linear Regression. The Linear Model. Fat Versus Protein: An Example. The Linear Model (cont.) Residuals

Chapter 8. Linear Regression. The Linear Model. Fat Versus Protein: An Example. The Linear Model (cont.) Residuals Chapter 8 Linear Regression Copyright 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Slide 8-1 Copyright 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Fat Versus

More information

Mathematical Notation Math Introduction to Applied Statistics

Mathematical Notation Math Introduction to Applied Statistics Mathematical Notation Math 113 - Introduction to Applied Statistics Name : Use Word or WordPerfect to recreate the following documents. Each article is worth 10 points and should be emailed to the instructor

More information

Polynomial Functions of Higher Degree

Polynomial Functions of Higher Degree SAMPLE CHAPTER. NOT FOR DISTRIBUTION. 4 Polynomial Functions of Higher Degree Polynomial functions of degree greater than 2 can be used to model data such as the annual temperature fluctuations in Daytona

More information

Chapter 16. Simple Linear Regression and dcorrelation

Chapter 16. Simple Linear Regression and dcorrelation Chapter 16 Simple Linear Regression and dcorrelation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

INFERENCE FOR REGRESSION

INFERENCE FOR REGRESSION CHAPTER 3 INFERENCE FOR REGRESSION OVERVIEW In Chapter 5 of the textbook, we first encountered regression. The assumptions that describe the regression model we use in this chapter are the following. We

More information

Chapter 5 Confidence Intervals

Chapter 5 Confidence Intervals Chapter 5 Confidence Intervals Confidence Intervals about a Population Mean, σ, Known Abbas Motamedi Tennessee Tech University A point estimate: a single number, calculated from a set of data, that is

More information

Mathematics for Economics MA course

Mathematics for Economics MA course Mathematics for Economics MA course Simple Linear Regression Dr. Seetha Bandara Simple Regression Simple linear regression is a statistical method that allows us to summarize and study relationships between

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Multiple Regression Methods

Multiple Regression Methods Chapter 1: Multiple Regression Methods Hildebrand, Ott and Gray Basic Statistical Ideas for Managers Second Edition 1 Learning Objectives for Ch. 1 The Multiple Linear Regression Model How to interpret

More information

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference. Understanding regression output from software Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals In 1966 Cyril Burt published a paper called The genetic determination of differences

More information

Definition of Statistics Statistics Branches of Statistics Descriptive statistics Inferential statistics

Definition of Statistics Statistics Branches of Statistics Descriptive statistics Inferential statistics What is Statistics? Definition of Statistics Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make a decision. Branches of Statistics The study of statistics

More information

TOPIC 9 SIMPLE REGRESSION & CORRELATION

TOPIC 9 SIMPLE REGRESSION & CORRELATION TOPIC 9 SIMPLE REGRESSION & CORRELATION Basic Linear Relationships Mathematical representation: Y = a + bx X is the independent variable [the variable whose value we can choose, or the input variable].

More information

SYSTEMS OF THREE EQUATIONS

SYSTEMS OF THREE EQUATIONS SYSTEMS OF THREE EQUATIONS 11.2.1 11.2.4 This section begins with students using technology to eplore graphing in three dimensions. By using strategies that they used for graphing in two dimensions, students

More information

Inference for Regression Inference about the Regression Model and Using the Regression Line

Inference for Regression Inference about the Regression Model and Using the Regression Line Inference for Regression Inference about the Regression Model and Using the Regression Line PBS Chapter 10.1 and 10.2 2009 W.H. Freeman and Company Objectives (PBS Chapter 10.1 and 10.2) Inference about

More information

Statistics and Quantitative Analysis U4320. Lecture 13: Explaining Variation Prof. Sharyn O Halloran

Statistics and Quantitative Analysis U4320. Lecture 13: Explaining Variation Prof. Sharyn O Halloran Statistics and Quantitative Analysis U4320 Lecture 13: Eplaining Variation Prof. Sharyn O Halloran I. Eplaining Variation: R 2 A. Breaking Down the Distances Let's go back to the basics of regression analysis.

More information

Equations and Inequalities

Equations and Inequalities Equations and Inequalities Figure 1 CHAPTER OUTLINE 1 The Rectangular Coordinate Systems and Graphs Linear Equations in One Variable Models and Applications Comple Numbers Quadratic Equations 6 Other Types

More information

Regression Models. Chapter 4. Introduction. Introduction. Introduction

Regression Models. Chapter 4. Introduction. Introduction. Introduction Chapter 4 Regression Models Quantitative Analysis for Management, Tenth Edition, by Render, Stair, and Hanna 008 Prentice-Hall, Inc. Introduction Regression analysis is a very valuable tool for a manager

More information

y = a + bx 12.1: Inference for Linear Regression Review: General Form of Linear Regression Equation Review: Interpreting Computer Regression Output

y = a + bx 12.1: Inference for Linear Regression Review: General Form of Linear Regression Equation Review: Interpreting Computer Regression Output 12.1: Inference for Linear Regression Review: General Form of Linear Regression Equation y = a + bx y = dependent variable a = intercept b = slope x = independent variable Section 12.1 Inference for Linear

More information

The sample mean and sample variance are given by: x sample standard deviation Excel: STDEV(values)

The sample mean and sample variance are given by: x sample standard deviation Excel: STDEV(values) Unless we have made a very large number of measurements, we don't have an accurate estimate of the mean or standard deviation of a data set. If we assume the values are normally distributed, we can estimate

More information

Chapter 12 Summarizing Bivariate Data Linear Regression and Correlation

Chapter 12 Summarizing Bivariate Data Linear Regression and Correlation Chapter 1 Summarizing Bivariate Data Linear Regression and Correlation This chapter introduces an important method for making inferences about a linear correlation (or relationship) between two variables,

More information

CHAPTER. Scatterplots

CHAPTER. Scatterplots CHAPTER 7 Two-Variable Data Analysis IN THIS CHAPTER Summary: In the previous chapter we used eploratory data analysis to help us understand what a one-variable data set was saying to us. In this chapter

More information

Which boxplot represents the same information as the histogram? Test Scores Test Scores

Which boxplot represents the same information as the histogram? Test Scores Test Scores Frequency of Test Scores ALGEBRA I 01 013 SEMESTER EXAMS SEMESTER 1. Mrs. Johnson created this histogram of her 3 rd period students test scores. 8 6 4 50 60 70 80 90 100 Test Scores Which boplot represents

More information

I used college textbooks because they were the only resource available to evaluate measurement uncertainty calculations.

I used college textbooks because they were the only resource available to evaluate measurement uncertainty calculations. Introduction to Statistics By Rick Hogan Estimating uncertainty in measurement requires a good understanding of Statistics and statistical analysis. While there are many free statistics resources online,

More information

Simple linear regression

Simple linear regression Simple linear regression Biometry 755 Spring 2008 Simple linear regression p. 1/40 Overview of regression analysis Evaluate relationship between one or more independent variables (X 1,...,X k ) and a single

More information

Sociology 6Z03 Review I

Sociology 6Z03 Review I Sociology 6Z03 Review I John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review I Fall 2016 1 / 19 Outline: Review I Introduction Displaying Distributions Describing

More information

EDEXCEL ANALYTICAL METHODS FOR ENGINEERS H1 UNIT 2 - NQF LEVEL 4 OUTCOME 4 - STATISTICS AND PROBABILITY TUTORIAL 3 LINEAR REGRESSION

EDEXCEL ANALYTICAL METHODS FOR ENGINEERS H1 UNIT 2 - NQF LEVEL 4 OUTCOME 4 - STATISTICS AND PROBABILITY TUTORIAL 3 LINEAR REGRESSION EDEXCEL AALYTICAL METHODS FOR EGIEERS H1 UIT - QF LEVEL 4 OUTCOME 4 - STATISTICS AD PROBABILITY TUTORIAL 3 LIEAR REGRESSIO Tabular and graphical form: data collection methods; histograms; bar charts; line

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Chapter 2: Looking at Data Relationships (Part 3)

Chapter 2: Looking at Data Relationships (Part 3) Chapter 2: Looking at Data Relationships (Part 3) Dr. Nahid Sultana Chapter 2: Looking at Data Relationships 2.1: Scatterplots 2.2: Correlation 2.3: Least-Squares Regression 2.5: Data Analysis for Two-Way

More information

Core Connections Algebra 2 Checkpoint Materials

Core Connections Algebra 2 Checkpoint Materials Core Connections Algebra 2 Note to Students (and their Teachers) Students master different skills at different speeds. No two students learn eactly the same way at the same time. At some point you will

More information

Test 1, / /130. MASSEY UNIVERSITY Institute of Information Sciences and Technology (Statistics)

Test 1, / /130. MASSEY UNIVERSITY Institute of Information Sciences and Technology (Statistics) MASSEY UNIVERSITY Institute of Information Sciences and Technology (Statistics) 161.120 INTRODUCTORY STATISTICS 161.130 BIOMETRICS Test 1, 2003 Duration: 1 hour Questions 1 and 2 are about the following

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

Fundamentals of Algebra, Geometry, and Trigonometry. (Self-Study Course)

Fundamentals of Algebra, Geometry, and Trigonometry. (Self-Study Course) Fundamentals of Algebra, Geometry, and Trigonometry (Self-Study Course) This training is offered eclusively through the Pennsylvania Department of Transportation, Business Leadership Office, Technical

More information

Performance of fourth-grade students on an agility test

Performance of fourth-grade students on an agility test Starter Ch. 5 2005 #1a CW Ch. 4: Regression L1 L2 87 88 84 86 83 73 81 67 78 83 65 80 50 78 78? 93? 86? Create a scatterplot Find the equation of the regression line Predict the scores Chapter 5: Understanding

More information

Chapter 7. Inference for Distributions. Introduction to the Practice of STATISTICS SEVENTH. Moore / McCabe / Craig. Lecture Presentation Slides

Chapter 7. Inference for Distributions. Introduction to the Practice of STATISTICS SEVENTH. Moore / McCabe / Craig. Lecture Presentation Slides Chapter 7 Inference for Distributions Introduction to the Practice of STATISTICS SEVENTH EDITION Moore / McCabe / Craig Lecture Presentation Slides Chapter 7 Inference for Distributions 7.1 Inference for

More information

3.2 Logarithmic Functions and Their Graphs

3.2 Logarithmic Functions and Their Graphs 96 Chapter 3 Eponential and Logarithmic Functions 3.2 Logarithmic Functions and Their Graphs Logarithmic Functions In Section.6, you studied the concept of an inverse function. There, you learned that

More information

regression analysis is a type of inferential statistics which tells us whether relationships between two or more variables exist

regression analysis is a type of inferential statistics which tells us whether relationships between two or more variables exist regression analysis is a type of inferential statistics which tells us whether relationships between two or more variables exist sales $ (y - dependent variable) advertising $ (x - independent variable)

More information

3.3.1 Linear functions yet again and dot product In 2D, a homogenous linear scalar function takes the general form:

3.3.1 Linear functions yet again and dot product In 2D, a homogenous linear scalar function takes the general form: 3.3 Gradient Vector and Jacobian Matri 3 3.3 Gradient Vector and Jacobian Matri Overview: Differentiable functions have a local linear approimation. Near a given point, local changes are determined by

More information
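As a worked illustration of the method of least squares described in this section, the estimates b0 and b1 that minimise the residual sum of squares have the well-known closed forms b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1 x̄. A minimal sketch in Python (the x and y values below are made-up illustrative data, not from the text):

```python
# Least squares fit of the simple linear model y = b0 + b1*x.
# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope: b1 = sum((xi - x_bar)*(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
# Intercept: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

# Fitted values and residuals e_i = y_i - (b0 + b1*x_i)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# The residual sum of squares that these estimates minimise
rss = sum(e ** 2 for e in residuals)
```

A useful check on any such fit: when the model includes an intercept, the residuals of the least squares line always sum to zero (up to rounding), which is one reason the *squared* residuals, rather than the raw residuals, must be used to summarise fit.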