MATH 644: Regression Analysis Methods
FINAL EXAM, Fall 2012

INSTRUCTIONS TO STUDENTS:
1. This test contains SIX questions. It comprises ELEVEN printed pages.
2. Answer ALL questions for a total of 100 marks.
3. This is an open-book and open-note test; you may use any materials you have.
4. Write your name on the front of your answer booklet and on any additional sheets you write on.
1. True/False. Read each statement and write T (True) or F (False) at its beginning. Note: the standard least squares estimator is applied whenever needed in the statements. Define the standard multiple linear regression model as

   Y_i = β_0 + β_1 X_{i1} + ... + β_p X_{ip} + ε_i,   where ε_i are i.i.d. N(0, σ²).

(a) A coefficient of determination of zero indicates that X and Y are not related.

(b) A high coefficient of determination indicates that the estimated regression line is a good fit.

(c) If two multiple linear regression models have the same mean squared error (MSE), we prefer the model with fewer variables.

(d) For any F-test associated with the multiple linear regression model, we can find an equivalent t-test.

(e) In a standard multiple linear regression model, the variance of the prediction becomes larger as X_j deviates from the sample mean X̄_j.

(f) In a standard multiple linear regression model, define the residuals to be e_i = Y_i − Ŷ_i; then Σ_{i=1}^n e_i X_{ij} = 0 for all j = 1, ..., p − 1.

(g) In a standard multiple linear regression model, the prediction for a new observation with predictors X^(new) = (X̄_1, X̄_2, ..., X̄_p) is Ȳ = n^{−1} Σ_{i=1}^n Y_i, where X̄_j = n^{−1} Σ_{i=1}^n X_{ij}, j = 1, ..., p, is the sample mean.

2. Yes/No. Suppose you have four possible predictor variables X_1, X_2, X_3, and X_4 that could be used in a regression analysis. You run a forward selection procedure, and the variables are entered as follows:

   Step 1: X_2
   Step 2: X_4
   Step 3: X_1
   Step 4: X_3

In other words, after Step 1, the model is E{Y} = β_0 + β_1 X_2. After Step 2, the model is E{Y} = β_0 + β_1 X_2 + β_2 X_4.
And so on. You also run an all-subsets regression analysis using R² as the criterion for the best model for each possible number of predictors. Would the same models result from this analysis as from the forward selection procedure? In other words, would all-subsets regression definitely identify the following as the best models for 1, 2, 3, and 4 variables? Choose Yes or No in each case.

(a) For 1 variable, the best model would be E{Y} = β_0 + β_1 X_2.

(b) For 2 variables, the best model would be E{Y} = β_0 + β_1 X_2 + β_2 X_4.

(c) For 3 variables, the best model would be E{Y} = β_0 + β_1 X_2 + β_2 X_4 + β_3 X_1.

(d) For 4 variables, the best model would be E{Y} = β_0 + β_1 X_2 + β_2 X_4 + β_3 X_1 + β_4 X_3.

3. Given data pairs (X_i, Y_i), i = 1, ..., n, we fit the simple linear regression Y_i = β_0 + β_1 X_i + ε_i. Suppose, in addition, that the ε_i are independent and normally distributed with mean 0 and variance σ². For each of the following three scenarios, how are b_0, b_1, the estimate of σ², R², and the t-test of H_0: β_1 = 0 vs. H_a: β_1 ≠ 0 affected? Answer accordingly and give the necessary explanations.

(a) X_i is replaced by 2X_i and Y_i remains the same.

(b) Y_i is replaced by 2Y_i and X_i remains the same.

(c) X_i is replaced by 2X_i and Y_i is replaced by 2Y_i.
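[Editor's note, not part of the original exam.] The rescaling effects asked about in question 3 can be checked numerically. The sketch below uses Python rather than the exam's R, fits the simple regression in closed form, and compares the three scenarios on simulated data (the data themselves are made up purely for illustration).

```python
import numpy as np

def simple_ols(x, y):
    """Closed-form simple linear regression.
    Returns (b0, b1, MSE, R^2, t-statistic for H0: beta1 = 0)."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    sxx = ((x - xbar) ** 2).sum()
    sxy = ((x - xbar) * (y - ybar)).sum()
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    resid = y - (b0 + b1 * x)
    sse = (resid ** 2).sum()
    sst = ((y - ybar) ** 2).sum()
    mse = sse / (n - 2)
    r2 = 1 - sse / sst
    t = b1 / np.sqrt(mse / sxx)
    return b0, b1, mse, r2, t

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 1.0 + 2.0 * x + rng.normal(size=30)

base     = simple_ols(x, y)
scale_x  = simple_ols(2 * x, y)      # scenario (a): b1 halves; b0, MSE, R^2, t unchanged
scale_y  = simple_ols(x, 2 * y)      # scenario (b): b0, b1 double; MSE quadruples; R^2, t unchanged
scale_xy = simple_ols(2 * x, 2 * y)  # scenario (c): b0 doubles, b1 unchanged; MSE quadruples; R^2, t unchanged
```

Because the fitted values are unchanged (a) or scaled along with Y (b, c), R² and the t-statistic are invariant in all three scenarios; only the coefficients and the MSE rescale.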
4. Suppose we have the following two multiple linear regression models:

   Y_i = β_0 + β_1 X_{1i} + β_2 X_{2i} + ε_i                          (1)

and

   Y_i = β_0 + β_1 X_{1i} + β_2 X_{2i} + β_3 X_{3i} + ε_i,            (2)

where ε_i are i.i.d. N(0, σ²). We first perform the analysis for model (1) in R:

> fit12 = lm(Y ~ X1 + X2)
> summary(fit12)

Call:
lm(formula = Y ~ X1 + X2)

Residuals:
     Min       1Q   Median       3Q      Max
-1.72610 -0.71385  0.03204  0.62244  3.04545

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.002956   0.094429  -0.031    0.975
X1           2.171693   0.108222  20.067   <2e-16
X2           2.949736   0.098936  29.814   <2e-16

Residual standard error: 0.9428 on 97 degrees of freedom
Multiple R-squared: 0.938,  Adjusted R-squared: ???
F-statistic: 733.5 on 2 and 97 DF,  p-value: < 2.2e-16

(a) Calculate the adjusted R-squared value from the output.

(b) Calculate the SSR (regression sum of squares) from the output.

(c) Perform the hypothesis test H_0: β_1 = β_2 = 0 vs. H_1: not both β_1 and β_2 equal zero. Write down the test method and calculate the test statistic.
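[Editor's note, not part of the original exam.] Parts (a)-(c) only require arithmetic on numbers printed in the output above: the residual df (97), the number of estimated coefficients (3), R² = 0.938, and the residual standard error 0.9428. SSTO is not printed, but SSE follows from the residual standard error, and SSR then follows from R² = SSR/(SSR + SSE). A sketch in Python (the exam uses R):

```python
# Quantities read directly from summary(fit12)
n, p = 100, 3        # 97 residual df + 3 estimated coefficients
r2 = 0.938
s = 0.9428           # residual standard error

adj_r2 = 1 - (n - 1) / (n - p) * (1 - r2)       # part (a)
sse = s ** 2 * (n - p)                          # SSE = s^2 * df_error
ssr = sse * r2 / (1 - r2)                       # part (b)
f_stat = (ssr / (p - 1)) / (sse / (n - p))      # part (c): overall F, df = (2, 97)
```

Because R² and s are printed to only a few significant figures, the recomputed F agrees with the printed 733.5 only up to rounding.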
Now, we perform the analysis for model (2) in R:

> fit = lm(Y ~ X1 + X2 + X3)
> summary(fit)

Call:
lm(formula = Y ~ X1 + X2 + X3)

Residuals:
     Min       1Q   Median       3Q      Max
-1.72110 -0.71459  0.02617  0.62992  3.04839

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.002549   0.095098  -0.027    0.979
X1           2.184818   0.217837  10.030   <2e-16
X2           2.968544   0.288143  10.302   <2e-16
X3          -0.063097   0.907274  -0.070    0.945

Residual standard error: 0.9476 on 96 degrees of freedom
Multiple R-squared: 0.938,  Adjusted R-squared: 0.936
F-statistic: 484 on 3 and 96 DF,  p-value: < 2.2e-16

(d) Calculate the extra sum of squares SSR(X_3 | X_1, X_2) and the coefficient of partial determination R²_{Y3|12}.

(e) Compare model (1) and model (2): which one do you prefer? Explain your reasons.
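[Editor's note, not part of the original exam.] Part (d) needs only the two residual standard errors and their error degrees of freedom: each model's SSE is s² times its error df, the extra sum of squares is the drop in SSE, and the coefficient of partial determination is that drop as a fraction of the first model's SSE. A sketch in Python (the exam uses R); note that the 4-significant-figure rounding of the printed standard errors makes the tiny extra sum of squares imprecise:

```python
# SSE recovered from residual standard error and error df for each model
sse_12  = 0.9428 ** 2 * 97   # model (1): Y ~ X1 + X2
sse_123 = 0.9476 ** 2 * 96   # model (2): Y ~ X1 + X2 + X3

extra_ss = sse_12 - sse_123      # SSR(X3 | X1, X2)
partial_r2 = extra_ss / sse_12   # R^2_{Y3|12}: share of remaining variation explained by X3
```

The extra sum of squares is essentially zero, which is consistent with X3's t-statistic of -0.070 in the output and with the answer to part (e).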
5. An analyst decided to fit the multiple regression model

   Y_i = β_0 + β_1 X_{i1} + β_2 X_{i2} + β_3 X_{i3} + β_4 X_{i1}X_{i2} + β_5 X_{i1}X_{i3} + β_6 X_{i2}X_{i3} + ε_i,

where ε_i ~ N(0, σ²), i = 1, ..., 20. To reduce correlation between the covariates in this model, the centered variables

   x_{i1} = X_{i1} − X̄_1 = X_{i1} − 25.305,
   x_{i2} = X_{i2} − X̄_2 = X_{i2} − 51.170,
   x_{i3} = X_{i3} − X̄_3 = X_{i3} − 27.620

are used. The fitted regression equation is

   Ŷ = 20.53 + 3.43 x_1 − 2.095 x_2 − 1.616 x_3 + 0.00888 x_1 x_2 − 0.08479 x_1 x_3 + 0.09042 x_2 x_3,   MSE = 6.745,

where the true model is

   Y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i3} + β_4 x_{i1}x_{i2} + β_5 x_{i1}x_{i3} + β_6 x_{i2}x_{i3} + ε_i.

One would like to test whether the interaction terms between the three predictor variables should be included in the regression model. Use the above information and the following table to conduct an F-test at the 5% significance level. Clearly state the null and alternative hypotheses, the test statistic, the decision rule, and the conclusion.

   Variable   Extra Sum of Squares
   x_1        SSR(x_1) = 352.270
   x_2        SSR(x_2 | x_1) = 33.169
   x_3        SSR(x_3 | x_1, x_2) = 11.546
   x_1 x_2    SSR(x_1 x_2 | x_1, x_2, x_3) = 1.496
   x_1 x_3    SSR(x_1 x_3 | x_1, x_2, x_3, x_1 x_2) = 2.704
   x_2 x_3    SSR(x_2 x_3 | x_1, x_2, x_3, x_1 x_2, x_1 x_3) = 6.515

   F(0.975, 3, 13) = 4.3472,  F(0.95, 3, 13) = 3.4105,
   F(0.975, 7, 19) = 3.0509,  F(0.95, 7, 19) = 2.5435,
   F(0.95, 4, 13) = 3.1791,   F(0.975, 4, 13) = 3.9959,
   F(0.95, 4, 19) = 2.8951,   F(0.975, 4, 19) = 3.5587,
   F(0.975, 3, 19) = 3.9034,  F(0.95, 3, 19) = 3.1274.
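[Editor's note, not part of the original exam.] Once the hypotheses H_0: β_4 = β_5 = β_6 = 0 vs. H_a: not all three are zero are written down, the test statistic reduces to arithmetic on the table above: the extra sums of squares for the three interaction terms add up (by the decomposition ordering) to the numerator sum of squares, with 3 numerator df, while the denominator is the full-model MSE with 20 − 7 = 13 error df. A Python sketch of that arithmetic:

```python
# Extra sums of squares for the three interaction terms, from the table
ssr_interactions = 1.496 + 2.704 + 6.515  # SSR(x1x2, x1x3, x2x3 | x1, x2, x3)
df_drop = 3                               # three coefficients set to zero under H0
mse_full = 6.745                          # full-model MSE, df_error = 20 - 7 = 13

f_star = (ssr_interactions / df_drop) / mse_full
f_crit = 3.4105                           # F(0.95, 3, 13) from the quantiles provided

reject = f_star > f_crit                  # decision rule: reject H0 if F* > F(0.95, 3, 13)
```

Here F* is far below the critical value, so the test fails to reject H_0: there is no evidence at the 5% level that the interaction terms are needed.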
6. Suppose we have the following two multiple linear regression models:

   Y_i = β_0 + β_1 X_{i1} + ... + β_{p−1} X_{i,p−1} + ε_i                        (3)

and

   Y_i = β_0 + β_1 X_{i1} + ... + β_{p−1} X_{i,p−1} + β_p X_{i,p} + ε_i,         (4)

where ε_i are i.i.d. N(0, σ²).

(a) Denote the R² (the coefficient of multiple determination) of the two models (3) and (4) by R²(3) and R²(4). Is it true that R²(3) ≤ R²(4) always holds? If yes, prove it. If not, give a counterexample. (If you are providing a counterexample, write down the design matrix X and the response vector Y explicitly. The reasoning for R²(3) > R²(4) is required.)

(b) Denote the R²_a (the adjusted coefficient of multiple determination) of the two models by R²_a(3) and R²_a(4). Is it true that R²_a(3) ≤ R²_a(4) always holds? If yes, prove it. If not, give a counterexample. (If you are providing a counterexample, write down the design matrix X and the response vector Y explicitly. The reasoning for R²_a(3) > R²_a(4) is required.)
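[Editor's note, not part of the original exam.] A numerical illustration of the contrast question 6 is probing, not a proof: adding a predictor can never lower R², but it can lower R²_a. In the Python sketch below (the data are invented for illustration), x2 is chosen orthogonal to the residuals of the smaller fit, so adding it leaves SSE, and hence R², exactly unchanged while the degrees-of-freedom penalty in R²_a grows.

```python
import numpy as np

def r2_adj(cols, y):
    """OLS with intercept; return (R^2, adjusted R^2)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n)] + [np.asarray(c, float) for c in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    sse = ((y - Xd @ beta) ** 2).sum()
    sst = ((y - y.mean()) ** 2).sum()
    p = Xd.shape[1]
    return 1 - sse / sst, 1 - (n - 1) / (n - p) * sse / sst

y  = np.array([1.0, 2.0, 3.0, 5.0])
x1 = np.array([1.0, 2.0, 3.0, 4.0])
# Residuals of y on (1, x1) are proportional to (2, -1, -4, 3);
# x2 is orthogonal to them (and not in the span of 1 and x1),
# so it adds no explanatory power at all.
x2 = np.array([1.0, 2.0, 0.0, 0.0])

r2_small, adj_small = r2_adj([x1], y)
r2_big, adj_big = r2_adj([x1, x2], y)
# r2_big equals r2_small (up to floating point), yet adj_big < adj_small
```

R² never decreases because the larger column space can only shrink SSE; R²_a multiplies SSE/SSTO by (n − 1)/(n − p), which grows with every added column.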