14 Multiple Linear Regression


Kathryn Allison
5 years ago

B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice

14.1 The multiple linear regression model

In simple linear regression, the response variable $y$ is expressed in terms of a single regressor variable $x$, with a corresponding regression model of the form

$$Y = \beta_0 + \beta_1 x + \varepsilon.$$

In multiple linear regression, the response variable is expressed as a linear function of $k$ regressor (or predictor or explanatory) variables, $x_1, x_2, \ldots, x_k$, with corresponding model of the form

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon.$$

Suppose that $n$ observations have been made of the variables (where $n \ge k + 1$), with $x_{ij}$ the $i$-th observed value of the $j$-th regressor variable, corresponding to the $i$-th observed value $y_i$ of the response variable. Thus the data could be tabulated in the following form of a data matrix.

$$\begin{array}{c|ccccc}
\text{Observation number} & y & x_1 & x_2 & \cdots & x_k \\ \hline
1 & y_1 & x_{11} & x_{12} & \cdots & x_{1k} \\
2 & y_2 & x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
n & y_n & x_{n1} & x_{n2} & \cdots & x_{nk}
\end{array}$$

The multiple linear regression model is

$$Y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{1}$$

where the $x_{ij}$, $i = 1, \ldots, n$; $j = 1, \ldots, k$, are regarded as fixed, $\beta_0, \beta_1, \ldots, \beta_k$ are unknown parameters and the errors $\varepsilon_i$, $i = 1, \ldots, n$, are assumed to be $NID(0, \sigma^2)$, with $\sigma^2$ unknown.

The model (1) may be written in matrix notation as

$$Y = X\beta + \varepsilon, \tag{2}$$

where $Y$ is an $n \times 1$ vector of observations, $Y = (Y_1, Y_2, \ldots, Y_n)'$, $\beta$ is a $p \times 1$ vector of parameters, where $p = k + 1$, $\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$, $\varepsilon$ is an $n \times 1$ vector of errors, $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)'$, and $X$ is an $n \times p$ matrix, the design matrix,

$$X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1k} \\
1 & x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nk}
\end{pmatrix}.$$

Equation (2) expresses the regression model in the form of what is known as the general linear model. The general linear model encompasses a wide variety of cases, including the linear statistical model for the one-way completely randomized design. For these models the design matrix $X$ turns out to have a rather different form than for the regression model, in that each element of the matrix is either 0 or 1. The term design matrix is used because $X$ expresses the structure of the underlying experimental design.

According to the method of least squares, we choose as our estimates, $b = (b_0, b_1, \ldots, b_k)'$, the vector of parameters $\beta$ whose elements jointly minimize the error (or residual) sum of squares,

$$L = (y - X\beta)'(y - X\beta),$$

i.e., setting $x_{i0} = 1$,

$$L = \sum_{i=1}^{n} \Big( y_i - \sum_{j=0}^{k} x_{ij} \beta_j \Big)^2. \tag{3}$$

The expression (3) is minimized by setting the partial derivatives with respect to each of the $\beta_r$, $r = 0, \ldots, k$, equal to zero. This yields the normal equations, a set of $p = k + 1$ simultaneous linear equations for the $p$ unknowns, $b_0, b_1, \ldots, b_k$,

$$\sum_{i=1}^{n} \sum_{j=0}^{k} x_{ir} x_{ij} b_j = \sum_{i=1}^{n} x_{ir} y_i, \qquad r = 0, \ldots, k,$$

which may be written in matrix form as

$$X'Xb = X'y. \tag{4}$$

Note that $X'X$ is a symmetric $p \times p$ matrix.

14.2 Rank and invertibility

The rank, $\mathrm{rank}(A)$, of a matrix $A$ is the number of linearly independent columns of $A$. Recall that our design matrix $X$ is an $n \times p$ matrix with $n \ge p$. It follows that $\mathrm{rank}(X) \le p$. If $\mathrm{rank}(X) = p$ then $X$ is said to be of full rank. It may be shown that

$$\mathrm{rank}(X'X) = \mathrm{rank}(X). \tag{5}$$
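The normal equations and the rank result can be checked numerically. The following is a minimal sketch in R (assumed here to be close enough to S+ for this purpose), using made-up data; the names `X`, `XtX`, `Xty`, `b` and `X_bad` are illustrative and do not come from the notes.

```r
# Made-up data for illustration only: n = 8 observations, k = 2 regressors.
set.seed(1)
n  <- 8
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.5)

# Design matrix: a column of ones followed by the regressor columns.
X <- cbind(1, x1, x2)
p <- ncol(X)

XtX <- crossprod(X)       # X'X, a symmetric p x p matrix
Xty <- crossprod(X, y)    # X'y

# Numerical rank of X and of X'X (equal, as in result (5)).
rank_X   <- qr(X)$rank
rank_XtX <- qr(XtX)$rank

# Solve the normal equations X'X b = X'y for b (X is of full rank here).
b <- solve(XtX, Xty)

# Cross-check against the built-in least squares fit.
fit      <- lm(y ~ x1 + x2)
max_diff <- max(abs(as.vector(b) - coef(fit)))

# Adding a column that is a scalar multiple of x1 leaves the rank unchanged,
# so the enlarged matrix, with 4 columns, is not of full rank.
X_bad    <- cbind(X, 2.54 * x1)
rank_bad <- qr(X_bad)$rank
```

Here `rank_X` and `rank_XtX` both equal `p`, `max_diff` is zero up to rounding error, and `rank_bad` is smaller than the number of columns of `X_bad`.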

A square matrix is said to be non-singular if it has an inverse. A $p \times p$ square matrix is non-singular if and only if it is of full rank $p$. If $X'X$ is non-singular, which by the result (5) occurs if and only if $X$ is of full rank $p$, then the normal equations (4) have a unique solution,

$$\hat\beta = (X'X)^{-1} X'Y. \tag{6}$$

It will generally be the case for sensible regression models that the design matrix $X$ is of full rank, but this is not necessarily always the case. To take an extreme example, if the silly mistake is made of taking one of the regressor variables to be a scaled version of another, e.g., if one regressor variable is height measured in inches and another is the same height measured in cm, then the two corresponding columns of the matrix $X$ are scalar multiples of each other and hence $\mathrm{rank}(X) < p$. The normal equations do not then have a unique solution: the estimates of the parameters are not well-determined.

For a given set of data, assuming that $X$ is of full rank $p$, the formal mathematical solution (6) of the normal equations (4) is translated in a statistical package such as S+ into a numerical procedure for solving the normal equations.

14.3 The hat matrix and leverage

Assuming that $X$ is of full rank, the vector $\hat{Y}$ of fitted values is given by

$$\hat{Y} = X\hat\beta = HY,$$

where, using Equation (6), the hat matrix $H$ is defined by

$$H = X(X'X)^{-1}X'.$$

Note that $H$ is a symmetric $n \times n$ matrix of rank $p$. The vector $\hat\varepsilon$ of residuals is given by

$$\hat\varepsilon = Y - \hat{Y} = (I - H)Y,$$

where $I$ is the $n \times n$ identity matrix. It turns out that

$$\mathrm{var}(\hat\varepsilon_i) = (1 - h_i)\sigma^2, \qquad i = 1, \ldots, n,$$

where $h_i$ is the $i$-th diagonal element of the hat matrix $H$ and is known as the leverage of the $i$-th observation. The leverage $h_i$ may be regarded as a measure of the remoteness of the $i$-th observation from the remaining $n - 1$ observations in the space of the regressor variables. It is always the case that

$$1/n \le h_i \le 1, \qquad i = 1, \ldots, n, \qquad \text{and} \qquad \sum_{i=1}^{n} h_i = p,$$

so that $\bar{h} = p/n$.

If an individual $h_i$ is large then the corresponding observation may have a large influence in determining the estimated regression coefficients. Recall that we can obtain a list of the leverage values from within S+ by using the function lm.influence() applied to the appropriate model object. We may regard $h_i$ as being high if $h_i > \min(0.99, 3p/n)$.
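The properties of the hat matrix listed above can be verified directly. The sketch below is written in R rather than S+, on a small made-up design in which the last observation is deliberately placed far from the others; all object names (`H`, `h`, `threshold`, `high`) are illustrative assumptions.

```r
# Made-up design: 12 observations, the last one remote in regressor space.
set.seed(2)
n  <- 12
x1 <- c(1:11, 40)
x2 <- rep(c(0, 1), 6)
y  <- 3 + x1 + 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)
p <- ncol(X)

# Hat matrix H = X (X'X)^{-1} X' and its diagonal, the leverages h_i.
H <- X %*% solve(crossprod(X)) %*% t(X)
h <- diag(H)

# The same leverages as returned by lm.influence() in R.
fit  <- lm(y ~ x1 + x2)
h_lm <- unname(lm.influence(fit)$hat)

# Fitted values HY and residuals (I - H)Y computed from the hat matrix.
fitted_manual <- as.vector(H %*% y)
resid_manual  <- as.vector((diag(n) - H) %*% y)

# Rule of thumb from the text: flag h_i > min(0.99, 3p/n).
threshold <- min(0.99, 3 * p / n)
high      <- which(h > threshold)
```

In this design the leverages sum to `p`, each lies between `1/n` and 1, and the remote twelfth observation is among those flagged in `high`.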

14.4 The covariance matrix of the estimators of the parameters

It turns out that $\hat\beta_r$ is a normally distributed, unbiased estimator of $\beta_r$, $r = 0, \ldots, k$, with variance given by the $r$-th diagonal element of the matrix $\sigma^2 (X'X)^{-1}$, where the rows and columns of the matrix are indexed $0, 1, \ldots, k$ (rather than $1, 2, \ldots, k + 1$). In fact, it can be shown that

$$\hat\beta \sim N_p\big(\beta, \sigma^2 (X'X)^{-1}\big).$$

14.5 Example

An estimate is required of the percentage yield of petroleum spirit from crude oil, based upon certain rough laboratory determinations of properties of the crude oil. The following table shows actual percentage yields of petroleum spirit, $y$, and four properties, $x_1, x_2, x_3, x_4$, of the crude oil, for samples from 32 different crudes.

Data on yields of petroleum spirit (columns: $y$, $x_1$, $x_2$, $x_3$, $x_4$)

The variables recorded are as follows.

$y$: percentage yield of petroleum spirit
$x_1$: specific gravity of the crude
$x_2$: crude oil vapour pressure, measured in pounds per square inch
$x_3$: the ASTM 10% distillation point, in °F
$x_4$: the petroleum fraction end point, in °F

It is required to use these data to provide an equation for predicting $y$ from measurements of the four explanatory variables, $x_1, x_2, x_3, x_4$ (or some subset of them).

The data have been read into an S+ data frame oil. The function names is used to assign names to the five variables. The linear model function lm is then used to carry out a multiple linear regression of the response variable spirit upon the four regressor variables, gravity, pressure, distil and endpoint, the results of which are stored in the object oil.lm.

> y <- c(69, 144, 74, 85, 80, 28, 50, 122, 100, 152, 268, 140, 147, 64, 176, 223,
         248, 260, 349, 182, 232, 180, 131, 161, 321, 347, 317, 336, 304, 266, 278, 457)/10
> x1 <- c(384, 403, 400, 318, 408, 413, 381, 508, 322, 384, 403, 322, 318, 413, 381, 508,
          322, 384, 403, 400, 322, 318, 408, 413, 381, 508, 322, 384, 400, 408, 413, 508)/10
> x2 <- c(61, 48, 61, 2, 35, 18, 12, 86, 52, 61, 48, 24, 2, 18, 12, 86,
          52, 61, 48, 61, 24, 2, 35, 18, 12, 86, 52, 61, 61, 35, 18, 86)/10
> x3 <- c(220, 231, 217, 316, 210, 267, 274, 190, 236, 220, 231, 284, 316, 267, 274, 190,
          236, 220, 231, 217, 284, 316, 210, 267, 274, 190, 236, 220, 217, 210, 267, 190)
> x4 <- c(235, 307, 212, 365, 218, 235, 285, 205, 267, 300, 367, 351, 379, 275, 365, 275,
          360, 365, 395, 272, 424, 428, 273, 358, 444, 345, 402, 410, 340, 347, 416, 407)
> oil <- data.frame(y, x1, x2, x3, x4)
> names(oil) <- c("spirit", "gravity", "pressure", "distil", "endpoint")
> oil.lm <- lm(spirit ~ gravity + pressure + distil + endpoint, data = oil)
> summary(oil.lm)

Call: lm(formula = spirit ~ gravity + pressure + distil + endpoint, data = oil)
Residuals:
 Min 1Q Median 3Q Max
Coefficients:
             Value Std. Error t value Pr(>|t|)
(Intercept)
gravity
pressure
distil
endpoint
Residual standard error: on 27 degrees of freedom
Multiple R-Squared:
Adjusted R-squared:
F-statistic: on 4 and 27 degrees of freedom, the p-value is 0
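The standard errors reported in such a summary are the square roots of the diagonal elements of $\sigma^2 (X'X)^{-1}$ from Section 14.4, with $\sigma^2$ replaced by the residual mean square (the square of the residual standard error). The following minimal sketch, written in R on made-up data rather than the oil data, illustrates the calculation; the names `s2`, `V`, `se_manual` and `se_summary` are illustrative only.

```r
# Made-up data for illustration only (not the oil data).
set.seed(3)
n  <- 30
x1 <- runif(n, 30, 50)
x2 <- runif(n, 0, 9)
y  <- 5 + 0.3 * x1 + 1.5 * x2 + rnorm(n, sd = 2)

fit <- lm(y ~ x1 + x2)
X   <- model.matrix(fit)    # design matrix, including the column of ones
p   <- ncol(X)

# Residual mean square: estimate of sigma^2 on n - p degrees of freedom.
s2 <- sum(residuals(fit)^2) / (n - p)

# Estimated covariance matrix of the coefficient estimators.
V <- s2 * solve(crossprod(X))

# Standard errors: square roots of the diagonal elements of V.
se_manual  <- sqrt(diag(V))
se_summary <- coef(summary(fit))[, "Std. Error"]
```

In R the matrix `V` agrees with `vcov(fit)`, and `se_manual` agrees with the standard error column of the coefficient table.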

14.6 ANOVA for multiple linear regression

We now begin to outline the theory that will enable us to interpret the above analysis of variance and to carry out further analyses. It turns out that we may partition the total sum of squares $SS_T \equiv S_{yy}$ into the sum of the regression sum of squares $SS_{Reg}$ and the error (or residual) sum of squares $SS_R$, i.e.,

$$SS_T = SS_{Reg} + SS_R.$$

The regression sum of squares, which is that part of the total sum of squares that is accounted for by the fitted regression, is given by

$$SS_{Reg} = \sum_{r=1}^{k} \hat\beta_r S_{ry}, \qquad \text{where} \qquad S_{ry} = \sum_{i=1}^{n} (x_{ir} - \bar{x}_{\cdot r})(Y_i - \bar{Y}), \qquad r = 1, \ldots, k.$$

Corresponding to the above partition of the sum of squares we have the following ANOVA, where, as in earlier ANOVAs, $S^2 \equiv MS_R$ is an unbiased estimator of the error variance $\sigma^2$.

ANOVA TABLE

$$\begin{array}{llll}
\text{Source} & \text{DF} & \text{SS} & \text{MS} \\ \hline
\text{Regression} & k & \sum_{r=1}^{k} \hat\beta_r S_{ry} & SS_{Reg}/k \\
\text{Error} & n - k - 1 & \text{by subtraction} & S^2 \equiv SS_R/(n - k - 1) \\
\text{Total} & n - 1 & S_{yy} &
\end{array}$$

We may wish to test the hypothesis that there is no linear relationship between the response variable $y$ and the regressor variables $x_1, x_2, \ldots, x_k$. Formally, we test the null hypothesis

$$H_0 : \beta_1 = \beta_2 = \cdots = \beta_k = 0$$

against the alternative

$$H_1 : \beta_j \ne 0 \text{ for some } j, \quad j = 1, \ldots, k.$$

This hypothesis is tested using the test statistic

$$F = \frac{MS_{Reg}}{MS_R},$$

which under $H_0$ has the $F_{k,\,n-k-1}$ distribution.

Clearly, from the S+ output, there is overwhelming evidence ($F_{obs} = 171.7$, $p < 0.001$) to reject the null hypothesis of no linear relationship between the yield of petroleum spirit and the four regressor variables.
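The partition of the sum of squares and the resulting F statistic can be reproduced by hand. The sketch below is written in R on made-up data with $k = 3$ regressors (not the oil data); the object names (`Sry`, `SS_T`, `SS_Reg`, `SS_R`, `F_obs`, `p_val`) are illustrative assumptions.

```r
# Made-up data for illustration only: n = 25 observations, k = 3 regressors.
set.seed(4)
n  <- 25
k  <- 3
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)

fit  <- lm(y ~ x1 + x2 + x3)
beta <- coef(fit)[-1]                 # slope estimates, intercept dropped

# S_ry: corrected sum of cross-products of each regressor with the response.
Xr  <- cbind(x1, x2, x3)
Sry <- colSums(sweep(Xr, 2, colMeans(Xr)) * (y - mean(y)))

SS_T   <- sum((y - mean(y))^2)        # total sum of squares, S_yy
SS_Reg <- sum(beta * Sry)             # regression sum of squares
SS_R   <- SS_T - SS_Reg               # error sum of squares, by subtraction

MS_Reg <- SS_Reg / k
MS_R   <- SS_R / (n - k - 1)
F_obs  <- MS_Reg / MS_R

# Upper tail probability of the F distribution on k and n - k - 1 df.
p_val <- pf(F_obs, k, n - k - 1, lower.tail = FALSE)
```

The value of `SS_R` obtained by subtraction matches the sum of squared residuals of the fit, and `F_obs` matches the F statistic that R reports in `summary(fit)`.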
