2. Regression Review

Size: px

Start display at page:

Download "2. Regression Review"

John Floyd
5 years ago
Views:

1 2. Regression Review 2.1 The Regression Model The general form of the regression model y t = f(x t, β) + ε t where x t = (x t1,, x tp ), β = (β 1,..., β m ). ε t is a random variable, Eε t = 0, Var(ε t ) = σ 2, and the error ε t are uncorrelated. Normal distribution assumption for error ε t implies independence among error 1

2 Examples: 1. y t = β 0 + ε t (constant mean model) 2. y t = β 0 + β 1 x t + ε t (simple linear regression model) 3. y t = β 0 exp(β 1 x t ) + ε t (exponential growth model) 4. y t = β 0 + β 1 x t + β 2 x 2 t + ε t (quadratic model) 5. y t = β 0 + β 1 x t1 + β 2 x t2 + ε t (linear model with two independent variables) 6. y t = β 0 + β 1 x t1 + β 2 x t2 + β 11 x 2 t1 + β 22 x 2 t2 + β 12 x t1 x t2 + ε t (quadratic model with two independent variables) Linear Models: see Example 1, 2, 4, 5, 6 2

3 3

4 2.2 Prediction from regression models with known coefficients prediction y pred k forecast error e k = y pred k for y k = f(x k, β) + ε k y k The expected value of the squared forecast error E(e 2 k) = E(y pred k y k ) 2 = σ 2 + E[f(x k, β) y pred k ]2 Minimum mean square error(mmse) y pred k = f(x k, β) 100(1 α) percent prediction interval for the future value y k [f(x k, β) µ α/2 σ; f(x k, β) + µ α/2 σ] where µ α/2 is the 100(1 α/2) percentage point of the N (0, 1) distribution. 4

5 y t = β 0 + β 1 x t + ε t Least squares estimates of unknown coefficients The parameter estimates that minimize the sum of the squared deviations S(β) = n [y t f(x t ; β)] 2 t=1 are called the least squares estimates and are denoted by ˆβ. Examples 1. Simple linear regression model through the origin y t = βx t + ε t. 2. Simple linear regression model

6 Example: Running Performance For illustration, consider the data list in Table. In this example we are interested in predicting the racing performance of trained female distance runners. The measured variables include the running performance in 10-km road race; body composition such as height, weight, skinfold sum, and relative body fat; and maximal aerobic power. The subjects for study were 14 trained and competition-experienced female runners who had placed among the top 20 women in a 10-km road race with more than 300 female entrants. The laboratory data were collected during the week following the competitive run. For the moment we are interested only in the relationship between the racing performance(running time) and maximal aerobic power(volume of maximal oxygen uptake; V O2.) 6

7 7

8 8

9 Estimation in the General Linear Regression Model Linear Regression models can be written as y t = β 0 + β 1 x t1 + β 2 x t2 + + β p x tp + ε t or y t = x tβ + ε t where x t = (1, x t1,..., x tp ) and β = (β 0, β 1,..., β p ) To simplify the calculation of the least squares estimates for the general linear regression model, we introduce matrix notation, y = Xβ + ε where E(ε) = 0, Var((ε)) = σ 2 I. Then E(y) = Xβ, Var(y) = E[y E(y)][y E(y)] = σ 2 I 9

10 In matrix notation the least squares criterion can be expressed as minimizing S(β) = n (y t x tβ) 2 = (y Xβ) (y Xβ) t=1 Then the solution is then given by ˆβ = (X X) 1 X y Two special cases, 1. Simple linear regression through the origin: y t = βx t + ε t, for 1 t n. 2. Simple linear regression model y t = β 0 + β 1 x t + ε t, for 1 t n. 10

11 2.4 Properties of Least Squares Estimation 1. E(ˆβ) = β, Var(ˆβ) = σ 2 (X X) If it is assumed that errors in linear model are normally distribution, then ˆβ N p+1 [β, σ 2 (X X) 1 ] 3. The difference between the observed and fitted values are called residuals and are given by e = y ŷ = y X(X X) 1 X y = [I X(X X) 1 X ]y. Then we have X e = X [I X(X X) 1 X ]y = 0. 11

12 4. The variation of the observations y t around their mean ȳ, total sum of squares (SSTO) SSTO = (y t ŷ) 2 = y 2 t nȳ 2 = y y nȳ 2. The variation of fitted value ŷ t around the mean ȳ, sum of squares due to regression (SSR) SSR = (ŷ t ȳ) 2 = ŷ 2 t nȳ 2 = ˆβ X y nȳ 2. sum of squares due to error (or residual) (SSE) SSE = (y t ŷ t ) 2 = e e. Then SSTO = SSR + SSE and define the coefficient of determination R 2 as the ratio of the sum of squares due to regression and the total sum of squares: R 2 = SSR SSTO = 1 SSE SSTO 12

13 5. The variance σ 2 is usually unknown. However it can be estimated by the mean square error s 2 = SSE n p 1. Hence the estimated covariance matrix of the least squares estimator ˆβ is Var(ˆβ) = s 2 (X X) 1 The estimated standard error of ˆβ i is given by s ˆβi = s c ii, where c ii is the corresponding diagonal element in (X X) The least square estimate ˆβ and s 2 are independent. 13

14 2.5 Confidence Intervals and Hypothesis Testing Consider the quantity ˆβ i β i s c ii, i = 0, 1,..., p where ˆβ i is the least square estimate of β i, and s ˆβi = s c ii is its standard error. This quantity has a t-distribution with n p 1 degrees of freedom. Confidence intervals A 100(1 α) confidence interval for the unknown parameter β i is given by [ ˆβ i t α/2 (n p 1)s c ii, ˆβ i + t α/2 (n p 1)s c ii ] where t α/2 (n p 1) is the 100(1 α/2) percentage point of a t distribution with n p 1 degree of freedom 14

15 Hypothesis Tests for individual Coefficients H 0 : β i = β i0 vs H 1 : β i β i0 The test statistic is given by t = ˆβ i β i0 s c ii If t > t α/2(n p 1), we reject H 0 in favor of H 1 at significance level α. If t t α/2(n p 1), there is not enough evidence for rejecting the null hypothesis H 0 in favor of H 1. Thus loosely speaking, we accept H 0 at significance level α. Special case when β i0 = 0. 15

16 2.6 Prediction from Regression Models with Estimated Coefficients The MMSE prediction of a future y k from regression model y t = β 0 + β 1 x t1 + + β p x tp + ε t is given by y pred k = β 0 + β 1 x k1 + + β p x kp = x kβ Replace β by their least estimate ˆβ = (X X) 1 X y ŷ pred k = ˆβ 0 + ˆβ 1 x k1 + + ˆβ p x kp = x k ˆβ 16

17 These predictions have the following properties 1. Unbiased E(y k ŷ pred ) = 0 2. ŷ pred is the minimum mean square error forecast among all linear unbias forecast 3. The variance of the forecast error y k ŷ pred k is given by Var(y k ŷ pred k ) = Var{ε k + x k(β ˆβ)} = σ 2 [1 + x k(x X) 1 x k ] 4. The estimated variance of the forecast error is Var(y k ŷ pred k ) = s2 [1 + x k(x X) 1 x k ] A 100(1 α) percent prediction interval for the future y k is then given by its upper and lower limits ŷ pred k ± t α/2 (n p 1)s[1 + x k(x X) 1 x k ]

18 Example: Running Performance The simple linear regression model y t = β 0 + β 1 x t + ε t where y t is the time to run the 10-km race, and x t is the maximal aerobic capacity V O2. The least square estimate of ˆβ 0 = and β 1 = The fitting values ŷ t = ˆβ 0 + ˆβ 1 x t, the residuals e t = y t ŷ t, and the entries in ANOVA table are easily calculated. Source SS df MS F Regression Error Total

19 The coefficient of determination is given by R 2 = / =.435. Var( ˆβ) = s 2 (X X) 1 = s 2 [ 1 n + x 2 P (xt x) 2 = [ x P (xt x) 2 P x 1 (xt x) 2 P (xt x) 2 ] ] The standard error s ˆβi and t statistics t ˆβi = ˆβ i /s ˆβi are given in the following table ˆβ i s ˆβi t ˆβi Constant V O

20 H 0 : β 1 = 0 against H 1 : β 1 < 0 The critical value for this one-sided test with significance level α = 0.05 is t(12) = Since the t statistic is smaller than this critical value, we reject the null hypothesis. In the other words, there is evidence for a significance inverse relationship between running time and aerobic capacity. Prediction intervals For example, let us predict the running time for a female athlete with maximal aerobic capacity x k = 55. Then ŷ pred = ˆβ 0 + ˆβ 1 x k = and its standard error by s (1 + 1n + (x k x) 2 ) 1/2 (xt x) 2 = 2.40 Thus a 95 percent prediction interval for the running time is given by ± t (12)2.40 = ± 5.23 or (37.52, 47.98) 20

21 To investigate whether other variable (height X 2, weight X 3, skinfold sum X 4, relative body fat X 5 help explain the variation in the data, fit the model Then we have y t = β 0 + β 1 x t1 + + β 5 x t5 + ε t and the ANOVA table ˆβ i s ˆβi t ˆβi Constant V O Height Weight Skinfold sum Relative body fat Source SS df MS F Regression Error Total

22 Case Study I: Gas Mileage Data Consider predicting the gas mileage of an automobile as a function of its size and other engine characteristics. In table 2.2. we have list data on 38 automobile ( models). These data, which were originally taken from Consumer Reports. The variables include gas mileage in miles per gallon (MPG), number of cylinders, cubic engine displacement, horse power, weight, acceleration, and engine type [straight(1), V(0)]. The gas consumption (in gallons) should be proportional to the effort(=force distance) it takes to move the car. Furthermore, since force is proportional to weight, we except that gas consumption per unit distance is proportional to force and thus proportional to weight. Thus a model of the form y t = β 0 + β 1 x t + ε t where y = MPG 1 = GP M and x =weight. To check whether a quadratic term (weight) 2 is need, then fit the model and make a hypothesis test y t = β 0 + β 1 x t + β 2 x 2 t + ε t H 0 : β 2 = 0 againsit H 1 : β

23 23

24 24

25 2.7 Model Selection Techniques (I) Adjusted coefficient of fitness R 2 a = 1 SSE/(n p 1) SSTO/(n 1) = 1 s 2 SSTO/(n 1). Compared with R 2, we have extra penalty here for introducing more independent variables. Intention: maximize R 2 a (equivalently minimizing s 2 ). (II) Consider C p = SSE p s 2 (n 2p). Since E(SSE p ) (n p)σ 2, (large n), C p p. So choose smallest p such that C p p. (III) Backward Elimination, Forward Selection, and Stepwise Regression. 25

26 Example 2 (Case study I continued) Results from Fitting All possible Regression p Variables Included Ra 2 1 X X 2 X X 1 X 3 X X 1 X 3 X 5 X X 1 X 2 X 3 X 4 X X 1 X 2 X 3 X 4 X 5 X Correlations Among the Predictor Variables X 1 X 2 X 3 X 4 X 5 X 6 X X X X X X

27 t Statistics from Forward Selection and Backward Elimination Procedures Step Constant X 1 X 2 X 3 X 4 X 5 X 6 Ra 2 (a) Forward Selection (b) Backward Elimination Selected Models or the even simpler model GPM t = β 0 + β 1 (weight t ) + β 2 (displacement t ) + ε t GPM t = β 0 + β 1 (weight t ) + ε t 27

28 2.8 Multi-colinearity in Predictor Variables This is referring to the situation when the columns of X are almost linearly dependent. This sort of situation is common for observational data rather than artificially designed matrix. We say, in this case, that X X is ill conditioned. This could cause large variance of ˆβ i. So techniques like Ridge Regression will be applied, i.e. with restriction β β r 2. Example y t = β 0 + β 1 x t1 + β 2 x t2 + ε t, t = 1, 2,..., n. The least square estimate of βi = s i s y β i, i = 1, 2, where s i n 1 = [ (xit x i ) 2 ], s y n 1 = [ (yt ȳ) 2 ], has such property Var(β1) = Var(β2) = σ r12 2 and Corr(β 1, β 2) = r 12 where r 12 = [ (x t1 x 1 )(x t2 x 2 )]/[ (x t1 x 1 ) 2 (x t2 x 2 ) 2 ]

29 2.9 General Principles for Modelling Principle of Parsimony (As Simples as Possible) and ŷ = Xˆβ = X(X X) 1 X y = Var(ŷ) = σ 2 X(X X) 1 X. 1 n n i=1 Var(ŷ t ) = σ2 n Trace[X(X X) 1 X ] = pσ2 n. Hence the average forecast error has variance σ 2 (1 + p n ). So extra independent variables would increase p and the forecast error. Plot Residuals Ideally, the residuals would be in the form of white noise, otherwise some modifications might be necessary, such as extra linear/quadratic terms. Particularly for the funnel Shape (like a trumpet), the logarithm transformation might be proper. 29

30 30

Ch 3: Multiple Linear Regression

Ch 3: Multiple Linear Regression 1. Multiple Linear Regression Model Multiple regression model has more than one regressor. For example, we have one response variable and two regressor variables: 1. delivery