13 Simple Linear Regression

13.1 An industrial example

A study was undertaken to determine the effect of stirring rate on the amount of impurity in paint produced by a chemical process. The study yielded the data shown in the following S+ output. The stirring rate is in revolutions per minute and the impurity is recorded as a percentage. It appears from the data that twelve stirring rates were chosen at intervals of 2 rpm and the resulting impurity levels recorded for each stirring rate. The subsequent plot shows that impurity increases approximately linearly with stirring rate.

> stirrate <- seq(20, 42, 2)
> impurity <- c(8.4, 9.5, 11.8, 10.4, 13.3, 14.8, 13.2, 14.7, 16.4, 16.5, 18.9, 18.5)
> paint.data <- data.frame(stirrate, impurity)
> paint.data
   stirrate impurity
1        20      8.4
2        22      9.5
3        24     11.8
4        26     10.4
5        28     13.3
6        30     14.8
7        32     13.2
8        34     14.7
9        36     16.4
10       38     16.5
11       40     18.9
12       42     18.5
> rm(stirrate, impurity)
> attach(paint.data)
> plot(stirrate, impurity)

[Figure 1: Plot of impurity versus stirrate]

13.2 The statistical model for simple linear regression

In general, suppose that we have observed n pairs of values, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where y is regarded as the observed value of the response variable Y and x as the regressor variable (or predictor variable or explanatory variable), so that Y is the dependent variable, and we wish to investigate how the values of Y depend upon the values of x. The simplest model is a linear one. Given the set of values x_i, i = 1, ..., n, regarded as fixed and observed without error, consider the linear regression model

    Y_i = β_0 + β_1 x_i + ε_i,   i = 1, ..., n,   (1)

where β_0 and β_1 are unknown parameters. The random errors ε_i are assumed to be NID(0, σ²), with σ² unknown.

We are now looking at the relationship of the (observed) response variable y to a quantitative factor, which takes numerical values x. Previously we dealt with a qualitative factor, in the form of a treatment, the different levels of which did not necessarily represent different numerical levels of some variable, and even if they did, this was not taken into account in the underlying statistical model.

The line with equation y = β_0 + β_1 x is known as the regression line. The regression coefficient β_1 is the slope of the regression line and the regression coefficient (the constant) β_0 is the intercept of the line on the y-axis.
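Model (1) is easy to simulate, which can help to fix ideas. The following is a minimal sketch in R (whose lm and plotting functions closely resemble the S+ ones used in these notes); the values of β_0, β_1 and σ below are made-up illustrative numbers, not estimates from the paint data.

# Simulate one realization of model (1): Y_i = beta0 + beta1*x_i + eps_i,
# with errors NID(0, sigma^2); parameter values are illustrative only.
set.seed(1)
x     <- seq(20, 42, 2)                      # fixed design points, as in the paint example
beta0 <- -0.3; beta1 <- 0.46; sigma <- 0.9   # hypothetical "true" parameter values
y     <- beta0 + beta1 * x + rnorm(length(x), mean = 0, sd = sigma)
plot(x, y)                 # scatter plot of one simulated sample
abline(beta0, beta1)       # the true regression line y = beta0 + beta1*x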

13.3 The least squares estimates of the parameters

We shall use hatted Greek letters, β̂, for parameter estimators, and lower case Roman letters, b, for parameter estimates. Thus

                            coefficients    regression line
    model parameters        β_0, β_1        y = β_0 + β_1 x
    parameter estimators    β̂_0, β̂_1        y = β̂_0 + β̂_1 x
    parameter estimates     b_0, b_1        y = b_0 + b_1 x

Given estimated parameter values, for each x_i the corresponding observed fitted value ŷ_i is given by ŷ_i = b_0 + b_1 x_i, and e_i ≡ y_i - ŷ_i is the corresponding observed residual. According to the method of least squares, given the observed values (x_i, y_i), i = 1, ..., n, we choose our parameter estimates, b_0 and b_1, to be those values of β_0 and β_1 that minimize

    L = Σ_{i=1}^{n} (y_i - β_0 - β_1 x_i)².   (2)

In geometrical terms, given a scatter plot of the points (x_i, y_i), i = 1, ..., n, we choose our fitted regression line in such a way as to minimize the sum of squares of the vertical distances of the points from the line.

It is worth introducing some more notation at this stage. In what follows, all summations are from i = 1 to n. Denote the corrected sums of squares by

    S_xx = Σ (x_i - x̄)²   and   S_yy = Σ (Y_i - Ȳ)².

Note that S_yy is the total (corrected) sum of squares in the ANOVA. The corrected sum of products, S_xy, is defined by

    S_xy = Σ (x_i - x̄)(Y_i - Ȳ),

or, equivalently, by S_xy = Σ (x_i - x̄)Y_i. Note that, whereas S_xx and S_yy are necessarily non-negative, S_xy can take negative values. The observed values of S_yy and S_xy are denoted by s_yy and s_xy respectively. It turns out that the least squares estimates b_1 and b_0 of β_1 and β_0, respectively, are given by

    b_1 = s_xy / s_xx   (3)

and

    b_0 = ȳ - b_1 x̄.
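Equation (3) and the expression for b_0 are easily checked numerically. A short sketch in R (assuming the data frame paint.data of Section 13.1 is available) computes the estimates from first principles and compares them with the output of lm:

# Least squares estimates computed directly from the definitions
x   <- paint.data$stirrate
y   <- paint.data$impurity
sxx <- sum((x - mean(x))^2)     # s_xx, corrected sum of squares of x
sxy <- sum((x - mean(x)) * y)   # s_xy (the centred-y form gives the same value)
b1  <- sxy / sxx                # slope estimate, Equation (3)
b0  <- mean(y) - b1 * mean(x)   # intercept estimate, b_0 = ybar - b_1 xbar
c(b0 = b0, b1 = b1)
coef(lm(impurity ~ stirrate, data = paint.data))   # should agree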

The equation of the fitted regression line,

    y = b_0 + b_1 x,

can thus be written as

    y = ȳ + b_1 (x - x̄).

This is the equation of the line with slope b_1 = s_xy/s_xx passing through the point (x̄, ȳ).

13.4 The partition of the total sum of squares

It turns out that the total sum of squares SS_T ≡ S_yy may be partitioned as

    SS_T = SS_Reg + SS_R,   (4)

where SS_Reg is the regression sum of squares,

    SS_Reg = β̂_1 S_xy,   (5)

and the residual sum of squares SS_R corresponds to the minimized value of L in Equation (2). The regression sum of squares SS_Reg may be interpreted as that part of the total sum of squares which is accounted for by the estimated regression. Given SS_T, the larger the value of SS_Reg and the smaller the value of SS_R, the better the fit of the estimated regression line.

We may test the null hypothesis H_0: β_1 = 0 against the alternative H_1: β_1 ≠ 0, which is a test for the absence of a linear relationship between the x and y variables. If β_1 = 0 then the regression model (1) reduces to

    Y_i = β_0 + ε_i,   i = 1, ..., n,

so that the Y_i are assumed to be NID(β_0, σ²). In this case, the joint distribution of the Y_i does not depend upon the values of the x_i, so that the x_i have no predictive power.

It may be shown that the two terms on the right hand side of Equation (4), SS_Reg and SS_R, are independently distributed. SS_R/σ² has the χ²_{n-2} distribution and, under H_0, SS_Reg/σ² has the χ²_1 distribution. Hence, under H_0, the ratio

    F = MS_Reg / MS_R

has the F_{1,n-2} distribution. This statistic is used for a one-tail test of H_0. The calculations may be laid out in the form of the following ANOVA table. As in previous ANOVAs, a mean square (MS) is obtained by dividing the corresponding sum of squares by its degrees of freedom, and Ŝ² ≡ MS_R is an unbiased estimator of the error variance σ².

    ANOVA TABLE
    Source       DF      SS               MS
    Regression   1       β̂_1 S_xy         SS_Reg
    Error        n - 2   by subtraction   Ŝ² = SS_R/(n - 2)
    Total        n - 1   S_yy

13.5 Example (continued)

The regression analysis is carried out using the S+ function lm, where impurity is regressed against a constant (which is included by default) and stirrate, the data being drawn from the data frame paint.data. The functions summary and anova are then applied to the fitted model object paint.lm in order to obtain the corresponding parameter estimates and analysis of variance table.

> paint.lm <- lm(impurity ~ stirrate, data = paint.data)
> summary(paint.lm)

Call: lm(formula = impurity ~ stirrate, data = paint.data)

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) -0.2893     1.2208 -0.2370   0.8174
   stirrate  0.4566     0.0384 11.8798   0.0000

Residual standard error: 0.9193 on 10 degrees of freedom
Multiple R-Squared: 0.9338
Adjusted R-squared: 0.9272
F-statistic: 141.1 on 1 and 10 degrees of freedom, the p-value is 3.2e-007

> anova(paint.lm)

Analysis of Variance Table
Response: impurity
Terms added sequentially (first to last)
          Df Sum of Sq Mean Sq F Value    Pr(F)
 stirrate  1   119.275 119.275   141.1 3.2e-007
Residuals 10     8.452   0.845

The output shows that the coefficients of the fitted regression line are b_0 = -0.289 and b_1 = 0.457. We shall discuss in a future section some of the details of the calculation of the associated standard errors and tests of significance.
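The partition (4) can likewise be verified numerically before turning to the S+ output. A sketch in R, again assuming paint.data is available:

# Verify SS_T = SS_Reg + SS_R and compute the F ratio by hand
x     <- paint.data$stirrate
y     <- paint.data$impurity
n     <- length(y)
sxx   <- sum((x - mean(x))^2)
sxy   <- sum((x - mean(x)) * y)
SST   <- sum((y - mean(y))^2)            # total corrected sum of squares, s_yy
SSReg <- sxy^2 / sxx                     # b_1 * s_xy, Equation (5)
SSR   <- SST - SSReg                     # residual sum of squares, by subtraction
Fstat <- (SSReg / 1) / (SSR / (n - 2))   # MS_Reg / MS_R
c(SST = SST, SSReg = SSReg, SSR = SSR, F = Fstat)   # compare with anova() below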

In the case of simple linear regression, the t-test for the coefficient β_1 (for stirrate) is equivalent to the F-test in the ANOVA. The p-value for both is 0.0000, so clearly there is a very significant linear relationship. The p-value for the constant β_0 is not significant, but we do keep the constant term in the regression equation. We may also verify that, correct to two decimal places, the observed value of Ŝ = √MS_R is ŝ = √0.845 = 0.92.

13.6 Correlation and the coefficient of determination

From Equations (3) and (5),

    SS_Reg = S²_xy / S_xx.

Hence

    SS_Reg / SS_T = S²_xy / (S_xx S_yy) = r²,   (6)

where r is the sample correlation coefficient,

    r = S_xy / √(S_xx S_yy).

r, which satisfies the inequalities -1 ≤ r ≤ 1, may be thought of as a measure of the strength of the linear relationship between the x_i and the y_i. The closer |r| is to the value 1, the stronger the relationship. But from Equation (6) it follows that r² may be characterized as the proportion of the total sum of squares accounted for by the regression. (r²_obs = 93.4% = 119.28/127.73 in our example.) It also follows from Equations (4) and (6) that

    1 - r² = SS_R / SS_T,   (7)

and it turns out that Equation (7) is the one that is the most appropriate for generalization to more general regression models and measures of fit. In general, define the coefficient of determination R² by

    R² = SS_Reg / SS_T = 1 - SS_R / SS_T.

This quantity, which like r² is the proportion of the total sum of squares accounted for by the regression, may be regarded as a measure of the goodness of fit of the regression model. An alternative measure, which is often preferred, is the adjusted coefficient of determination R̄² (adjusted for the number of regressor variables, one in the case of simple linear regression),

    R̄² = 1 - MS_R / MS_T,

where MS_T = SS_T/(n - 1). The significance of these quantities becomes apparent only when more complicated regression models are to be investigated. S+ outputs these two coefficients as Multiple R-Squared and Adjusted R-squared, respectively.
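These identities are easily confirmed numerically. A sketch in R, assuming the fitted object paint.lm from Section 13.5:

# r, r^2 and the adjusted coefficient of determination, by hand
x   <- paint.data$stirrate
y   <- paint.data$impurity
n   <- length(y)
r   <- cor(x, y)                 # sample correlation coefficient
SST <- sum((y - mean(y))^2)
SSR <- sum(resid(paint.lm)^2)    # residual sum of squares
c(r = r,
  R.squared = r^2,                                        # = SS_Reg/SS_T, Equation (6)
  adj.R.squared = 1 - (SSR / (n - 2)) / (SST / (n - 1)))  # 1 - MS_R/MS_T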

In comparing the use of the F-statistic and R², we may recall that the F-statistic is used to investigate whether there is evidence of a linear relationship between the variables x and y. The value of R² is an indicator of the strength of that relationship. It is readily checked that, in the case of simple linear regression, the values of F and R² are related by the formula

    R² = F / (n - 2 + F)

or, equivalently,

    F = (n - 2) R² / (1 - R²).

It is possible to have a highly significant value of F together with a relatively low value of R² (if n is large) or a relatively large value of R² with a non-significant value of F (if n is small).

13.7 A test and confidence interval for the slope parameter

Recall that the least squares estimator β̂_1 of β_1 is given by

    β̂_1 = S_xy / S_xx = Σ (x_i - x̄) Y_i / S_xx,   (8)

and that in the regression model the x_i are regarded as fixed. So, on the right hand side of Equation (8), only the Y_i are random variables, independently and normally distributed. Since β̂_1 is a linear combination of normally distributed r.v.s, it follows that β̂_1 is also normally distributed. It may be shown that β̂_1 is an unbiased estimator of β_1, that is,

    E[β̂_1] = β_1,

and that the variance of β̂_1 is given by

    var(β̂_1) = σ² / S_xx.

Hence β̂_1 has the N(β_1, σ²/S_xx) distribution. We estimate the unknown error variance σ² by using the estimator Ŝ² ≡ MS_R from the ANOVA table. (In the S+ output, the estimate ŝ of σ is given by Residual standard error.) Thus ŝ/√s_xx is the observed standard error of b_1, and the t-statistic for testing H_0: β_1 = 0 is

    T = β̂_1 √S_xx / Ŝ,

which under H_0 has the t_{n-2} distribution. We can verify from our S+ output that T_obs for β_1 is calculated as the ratio of the estimated coefficient to its standard error: 11.88 ≈ 0.4566/0.0384. The above t-statistic satisfies

    T² = β̂_1² S_xx / Ŝ² = MS_Reg / MS_R = F,

where F is the F-statistic calculated from the ANOVA.
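Both the t-statistic and the identity T² = F can be checked directly. A sketch in R, assuming paint.lm is available:

# t-statistic for H0: beta_1 = 0, computed from first principles
x    <- paint.data$stirrate
n    <- length(x)
sxx  <- sum((x - mean(x))^2)
b1   <- unname(coef(paint.lm)["stirrate"])
shat <- summary(paint.lm)$sigma            # residual standard error, s-hat
tstat <- b1 * sqrt(sxx) / shat             # T = b_1 sqrt(s_xx) / s-hat
c(t = tstat, t.squared = tstat^2)          # t^2 equals the ANOVA F statistic
2 * pt(-abs(tstat), df = n - 2)            # two-sided p-value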

The identity T² = F is a special feature of simple linear regression and does not hold for more general regression models. In our example, F_obs = 141.1 = (11.88)² = T²_obs. It follows from the definitions of the distributions that the square of a random variable with a t_ν distribution has the F_{1,ν} distribution. The p-values of the above t-statistic and F-statistic are identical.

Given the value of b_1, a 100(1 - α)% observed confidence interval for β_1 is given by

    b_1 ± t_{n-2,α/2} ŝ / √s_xx.

In our example we may calculate the 95% confidence interval for β_1 using S+.

## Direct calculation of the observed CI
# k  is the upper 2.5% percentage point of the t-distn with 10 d.o.f.
# k2 is the half-length of the interval
# k3 is the estimated value of the slope
> k <- qt(0.975, 10)
> k2 <- k * 0.0384
> k3 <- 0.4566
> CI <- c(k3 - k2, k3 + k2)
> CI
[1] 0.3710 0.5422

Thus the confidence interval for β_1 is (0.37, 0.54).

13.8 Fitted values and analysis of residuals

Previously, we found that the fitted equation was of the form

    y = -0.289 + 0.457x.

The observed fitted values may be obtained for each of the stir rates in the data set using the function fitted.

> fitted.values <- fitted(paint.lm)
> fitted.values
     1      2      3      4      5      6      7      8      9     10     11     12
 8.844  9.757 10.670 11.583 12.497 13.410 14.323 15.237 16.150 17.063 17.976 18.890

Recall that the residuals ε̂_i are defined by

    ε̂_i = Y_i - Ŷ_i = Y_i - β̂_0 - β̂_1 x_i,   i = 1, ..., n.

Given that β̂_1 is an unbiased estimator of β_1, it is easy to check that β̂_0 is an unbiased estimator of β_0. It follows that

    E[ε̂_i] = E[Y_i] - E[β̂_0] - E[β̂_1] x_i = (β_0 + β_1 x_i) - β_0 - β_1 x_i = 0.

A more detailed analysis shows that

    var(ε̂_i) = (1 - h_i) σ²,   i = 1, ..., n,

where h_i is the leverage of the i-th observation,

    h_i = 1/n + (x_i - x̄)² / s_xx,   i = 1, ..., n.   (9)

Hence the standardized residuals D_i are defined by

    D_i = ε̂_i / (Ŝ √(1 - h_i)),   i = 1, ..., n.

If the assumptions of the regression model are correct, the standardized residuals are approximately NID(0, 1). The leverage h_i of the i-th observation as defined in Equation (9) depends only on the value x_i of the predictor variable and not on the value y_i of the response variable. The leverage h_i may be regarded as a measure of the remoteness of the value x_i of the predictor variable for the i-th observation from the sample mean x̄ of all n observed values of the predictor variable. It is always the case for simple linear regression that

    1/n ≤ h_i ≤ 1,   i = 1, ..., n,   and   Σ h_i = 2,

so that h̄ = 2/n. If h_i is large then the corresponding observation may be highly influential in determining the estimated regression coefficients. There are situations in which removal of an observation with large leverage from the data set can result in drastic changes in the estimates of the regression coefficients. So observations with large leverage should be treated with caution.

We can obtain a list of the leverage values and the standardized residuals by using the commands lm.influence() and (upon invoking library(MASS) first) stdres(), respectively. As a benchmark, we might consider an h_i greater than, say, 3 times the average (or very close to 1), which equates to 0.5 in our example, as high (suggesting the corresponding predictor value is unusual), and a standardized residual d_i satisfying |d_i| > 2 as high (suggesting the corresponding response is unusual).

> leverages <- lm.influence(paint.lm)$hat
> library(MASS)
> std.residuals <- stdres(paint.lm)
> diagnostics <- data.frame(leverages, std.residuals)
> diagnostics
   leverages std.residuals
1     0.2949       -0.5746
2     0.2249       -0.3174
3     0.1690        1.3481
4     0.1270       -1.3778
5     0.0991        0.9206
6     0.0851        1.5807
7     0.0851       -1.2774
8     0.0991       -0.6149
9     0.1270        0.2912
10    0.1690       -0.6720
11    0.2249        1.1410
12    0.2949       -0.5049

Nothing untoward appears in the above output.
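Equation (9) and the definition of D_i can also be checked without MASS. A sketch in plain R (hatvalues and rstandard are the standard R counterparts of lm.influence()$hat and stdres):

# Leverages and standardized residuals computed from their definitions
x    <- paint.data$stirrate
n    <- length(x)
h    <- 1/n + (x - mean(x))^2 / sum((x - mean(x))^2)   # Equation (9)
shat <- summary(paint.lm)$sigma
d    <- resid(paint.lm) / (shat * sqrt(1 - h))         # standardized residuals
cbind(leverage = h, std.res = d)
# cross-checks: hatvalues(paint.lm) and rstandard(paint.lm)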

13.9 Prediction

One of the reasons for carrying out a linear regression analysis may be that, in future, given an x-value, we wish to be able to predict the corresponding y-value, using the fitted regression equation, so that

    Ŷ = β̂_0 + β̂_1 x.   (10)

Assuming the validity of the linear regression model, for the given x-value, the actual y-value will be given by

    Y = β_0 + β_1 x + ε,

where, as before, the error term ε is assumed to have the N(0, σ²) distribution. Hence

    E[Y] = β_0 + β_1 x

and

    Y = E[Y] + ε.   (11)

The Ŷ defined in Equation (10) may be regarded in two ways, either as an estimator of E[Y] (the long-term average of all y-values for the given x-value) or as a predictor of y (one particular y-value for the given x-value). In the latter case, there are two sources of error in accounting for the difference between an observed value of Y, i.e. y, and the predicted value ŷ: one due to using the estimators β̂_0 and β̂_1 instead of the actual parameter values β_0 and β_1, and the other due to the presence of the error term ε.

Since β̂_1 is an unbiased estimator of β_1 and β̂_0 is an unbiased estimator of β_0, from Equation (10),

    E[Ŷ] = E[β̂_0 + β̂_1 x] = β_0 + β_1 x = E[Y].

Thus Ŷ is an unbiased estimator of E[Y] and an unbiased predictor of Y. From Equation (10), var(Ŷ) = var(β̂_0 + β̂_1 x), which turns out to be given by

    var(Ŷ) = (1/n + (x - x̄)²/S_xx) σ².   (12)

Additionally, using Equation (11),

    var(Ŷ - Y) = var(Ŷ) + var(ε) = var(Ŷ) + σ²,   (13)

i.e.

    var(Ŷ - Y) = (1 + 1/n + (x - x̄)²/S_xx) σ².

As before, we estimate σ² by Ŝ² ≡ MS_R from the ANOVA table. A 100(1 - α)% observed confidence interval for E[Y] is given by

    b_0 + b_1 x ± t_{n-2,α/2} ŝ √(1/n + (x - x̄)²/s_xx).

A 100(1 - α)% observed prediction interval for the value of y is given by

    b_0 + b_1 x ± t_{n-2,α/2} ŝ √(1 + 1/n + (x - x̄)²/s_xx).

S+ refers to the quantity

    ŝ √(1/n + (x - x̄)²/s_xx)

as se.fit, the standard error of the fit. Note how the widths of the confidence and prediction intervals depend on the distance of x from x̄. The prediction interval is wider than the confidence interval. If the regression equation has been fitted using x-values in some interval A and appears to provide a good representation of the relationship between x and y in A, we should be wary of extrapolating this equation to make predictions for x-values outside A, as the linear relationship between x and y may not hold outside A.
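Both intervals can be computed directly from the formulas above. A sketch in R for a single new x-value x0 (we take x0 = 41, the value used in Section 13.10):

# Confidence interval for E[Y] and prediction interval for y at x = x0
x    <- paint.data$stirrate
n    <- length(x)
x0   <- 41
b    <- unname(coef(paint.lm))
shat <- summary(paint.lm)$sigma
sxx  <- sum((x - mean(x))^2)
yhat <- b[1] + b[2] * x0                                   # b_0 + b_1 x0
tm   <- qt(0.975, n - 2)
se.mean <- shat * sqrt(1/n + (x0 - mean(x))^2 / sxx)       # se.fit
se.pred <- shat * sqrt(1 + 1/n + (x0 - mean(x))^2 / sxx)   # for a single new y
rbind(confidence = yhat + c(-1, 1) * tm * se.mean,
      prediction = yhat + c(-1, 1) * tm * se.pred)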

13.10 Example (continued)

We use the function predict in S+ to obtain predicted values and their standard errors. We construct a data frame x whose variable name is that of the regressor variable, stirrate, and which contains the values of the regressor variable for which we wish to make predictions. In the present case, we shall use the single value of 41. The first argument of the predict function is the object paint.lm that corresponds to our model, and the second argument is the data frame x that contains the values of the regressor variable for which we wish to make predictions. The argument se.fit = TRUE is required so that we obtain standard errors for our predictions and so that, subsequently, we can use the function pointwise to produce confidence intervals. In the output, the term residual.scale refers to the value of ŝ. Given this and the value of the standard error of the fit, we may, if desired, calculate the prediction interval as defined above, in addition to the confidence interval produced by the function pointwise.

> x <- data.frame(stirrate = 41)
> predict.impurity <- predict(paint.lm, x, se.fit = TRUE)
> predict.impurity
$fit:
 18.433
$se.fit:
 0.4671
$residual.scale:
[1] 0.9193
$df:
[1] 10

> pointwise(predict.impurity, 0.95)
$upper:
 19.474
$fit:
 18.433
$lower:
 17.392
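The pointwise function is specific to S+. The prediction interval described above can be assembled from the pieces that predict returns; a sketch in R (where one could instead call predict(paint.lm, x, interval = "prediction")):

# Prediction interval built from se.fit and residual.scale
p  <- predict(paint.lm, data.frame(stirrate = 41), se.fit = TRUE)
tm <- qt(0.975, p$df)
se.pred <- sqrt(p$se.fit^2 + p$residual.scale^2)   # adds the error variance to var(Y-hat)
c(lower = p$fit - tm * se.pred, upper = p$fit + tm * se.pred)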
