STAT 4385 Applied Regression Analysis
Homework: Simple Linear Regression

(Simple Linear Regression) Thirty (n = 30) college graduates who recently entered the job market were sampled. For each student, the CGPA (cumulative grade-point average, X) and SAL (starting annual salary in thousands of dollars, Y) were recorded, as shown in the following table.

Reference: Kleinbaum, D., Kupper, L., Nizam, A., and Rosenberg, E. (2013). Applied Regression Analysis and Other Multivariable Methods. Cengage Learning.

ID    CGPA   SAL      x_i^2     y_i^2      x_i y_i
 1    2.58   10.455    6.656    109.307    26.974
 2    2.31    9.680    5.336     93.702    22.361
 3    2.47    7.300    6.101     53.290    18.031
 4    2.52    9.388    6.350     88.135    23.658
 5    3.22   12.496   10.368    156.150    40.237
 6    3.37   11.812   11.357    139.523    39.806
 7    2.43    9.224    5.905     85.082    22.414
 8    3.08   11.725   ____      ____       ____
 9    2.78   11.320    7.728    128.142    31.470
10    2.98   12.000    8.880    144.000    35.760
11    3.55   12.500   12.603    156.250    44.375
12    3.64   13.310   13.250    177.156    48.448
13    3.72   12.105   13.838    146.531    45.031
14    2.24    6.200    5.018     38.440    13.888
15    2.7    11.522    7.290    132.756    31.109
16    2.3     8.000    5.290     64.000    18.400
17    2.83   12.548    8.009    157.452    35.511
18    2.37    7.700    5.617     59.290    18.249
19    2.52   10.028    6.350    100.561    25.271
20    3.22   13.176   10.368    173.607    42.427
21    3.55   13.255   12.603    175.695    47.055
22    3.55   13.004   12.603    169.104    46.164
23    2.47    8.000    6.101     64.000    19.760
24    2.47    8.224    6.101     67.634    20.313
25    2.78   10.750    7.728    115.563    29.885
26    2.78   11.669    7.728    136.166    32.440
27    2.98   12.322    8.880    151.832    36.720
28    2.58   11.002    6.656    121.044    28.385
29    2.58   10.666    6.656    113.764    27.518
30    2.58   10.839    6.656    117.484    27.965
sum   85.15  322.22  247.515   3573.136   935.738

1. Complete the above table first by filling in the blanks, and do some preliminary calculations. Based on the above worksheet, compute $\bar{x}$, $\bar{y}$, $SS_{xx}$, $SS_{yy}$, and $SS_{xy}$.
2. The following graph provides a scatterplot of the data. Interpret the graph.

[Figure: Scatter Plot of the Data, with SAL on the vertical axis and CGPA on the horizontal axis.]

3. Compute the sample correlation coefficient $r$ and interpret it. Using Fisher's Z-transformation approach, construct a 95% confidence interval for the (population) correlation coefficient $\rho$ between CGPA and SAL.

4. Simple linear regression is used to study the relationship between CGPA and SAL. The model can be stated as $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, with $\varepsilon_i \sim N(0, \sigma^2)$. Compute the least squares estimator of $(\beta_0, \beta_1)$ and add the fitted LS line to the scatterplot.

5. First complete the ANOVA table by filling in blanks (1)-(9), then answer the questions that follow.

Analysis of Variance Table (ANOVA)
Source   df    SS    MS    F     P-Value
Model    (1)   (4)   (7)   (9)   <.0001
Error    (2)   (5)   (8)
Total    (3)   (6)

(a) Compute the coefficient of determination and interpret it.
(b) Test the overall usefulness of the model.
(c) Provide an estimator of the error variance $\sigma^2$.
6. Construct a 95% confidence interval for $\beta_0$ and $\beta_1$, respectively, and interpret.

7. Perform a test at significance level $\alpha = 0.05$ to see if the following statement is true: If Student A has a CGPA 0.5 point higher than Student B, then Student A is expected to make more than one thousand dollars per year more than Student B.

8. Compute the fitted value $\hat{y}_1$ and the corresponding residual $r_1$ for the first student, based on the LS fitted model.

9. The following figure provides diagnostic plots based on the residuals $r_i$: (a) histogram; (b) normal Q-Q plot; (c) $r_i$ vs. fitted value; and (d) $r_i$ vs. $x_i$. Comment on the plots in terms of model assumptions.

[Figure: residual diagnostic plots. (a) Histogram of the residuals; (b) Normal Q-Q Plot (sample quantiles vs. theoretical quantiles); (c) residual vs. fitted value $\hat{y}$; (d) residual vs. $x$.]

10. Construct a 95% confidence interval for the mean annual salary for students who have a CGPA of 3.5.
11. Suppose that Tom is graduating with a CGPA of 3.5. Construct a 95% prediction interval for his starting salary.

12. The following figure plots the fitted LS line as well as the 95% Working-Hotelling confidence band. Please comment on the model fit and potential outliers.

[Figure: Working-Hotelling Confidence Bands, with SAL on the vertical axis and CGPA on the horizontal axis, showing the LS line and the Hotelling confidence band.]
STAT 4385 Applied Regression Analysis
Solutions for Homework

1. Some preliminary calculations include

$$n = 30,\quad \sum x_i = 85.15,\quad \sum y_i = 322.22,\quad \sum x_i^2 = 247.515,\quad \sum y_i^2 = 3573.136,\quad \sum x_i y_i = 935.738$$

$$\Longrightarrow\quad \bar{x} = 2.838,\quad \bar{y} = 10.741,\quad SS_{xx} = 5.831,\quad SS_{yy} = 112.278,\quad SS_{xy} = 21.170.$$

2. It can be found that $r = 0.8274$, indicating a strong positive linear association between college GPA and starting salary among college graduates. A 95% CI for the Z-transformed $\rho$, namely $\frac{1}{2}\ln\{(1+\rho)/(1-\rho)\}$, is given by

$$\frac{1}{2}\ln\frac{1+r}{1-r} \pm z_{0.975}\,\frac{1}{\sqrt{n-3}} = \frac{1}{2}\ln\frac{1+0.8274}{1-0.8274} \pm 1.96\,\frac{1}{\sqrt{30-3}} = 1.1797 \pm 1.96/5.196 = (0.8025,\ 1.5569),$$

denoted as $(L, U)$. Now transform $(L, U)$ back to a 95% CI for $\rho$:

$$\left\{ \frac{\exp(2L)-1}{\exp(2L)+1},\ \frac{\exp(2U)-1}{\exp(2U)+1} \right\} = (0.6655,\ 0.9149).$$

3. The completed tables are given below.

4. From the ANOVA table, compute $R^2 = 68.45\%$ and $\hat{\sigma}^2 = MSE = 1.265$. The overall usefulness (or global utility) F test statistic is 60.758 with p-value $= \Pr\{F^{(1,28)} > 60.758\} < .0001$. Thus we reject $H_0$ and conclude that the model is useful. If you want to use the critical value approach, it can be found that $F^{(1,28)}_{0.95} = 4.196$.
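As a numerical check on Solutions 1, 2, and 4, the following minimal Python sketch recomputes the corrected sums of squares, the correlation with its Fisher-Z interval, and the ANOVA quantities from the worksheet totals. The function and variable names are illustrative assumptions, not part of the original solution, and the totals are hard-coded as read from the worksheet (n = 30, Σx = 85.15, Σy = 322.22, Σx² = 247.515, Σy² = 3573.136, Σxy = 935.738). Because those totals are rounded, the results should match the reported values only up to small rounding differences.

```python
import math
from statistics import NormalDist

# Worksheet totals, hard-coded as read from the table (assumed values).
n = 30
sum_x, sum_y = 85.15, 322.22
sum_x2, sum_y2, sum_xy = 247.515, 3573.136, 935.738


def corrected_sums(n, sx, sy, sx2, sy2, sxy):
    """Return (SS_xx, SS_yy, SS_xy) from raw totals."""
    ss_xx = sx2 - sx ** 2 / n
    ss_yy = sy2 - sy ** 2 / n
    ss_xy = sxy - sx * sy / n
    return ss_xx, ss_yy, ss_xy


def fisher_ci(r, n, level=0.95):
    """Confidence interval for rho via Fisher's Z-transformation."""
    z = math.atanh(r)  # equals 0.5 * ln((1 + r) / (1 - r))
    half = NormalDist().inv_cdf(0.5 + level / 2) / math.sqrt(n - 3)
    # tanh(a) equals (exp(2a) - 1) / (exp(2a) + 1), the back-transformation.
    return math.tanh(z - half), math.tanh(z + half)


ss_xx, ss_yy, ss_xy = corrected_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy)
r = ss_xy / math.sqrt(ss_xx * ss_yy)
lo, hi = fisher_ci(r, n)

# ANOVA decomposition for simple linear regression.
ssr = ss_xy ** 2 / ss_xx   # model sum of squares, 1 df
sse = ss_yy - ssr          # error sum of squares, n - 2 df
mse = sse / (n - 2)        # estimate of sigma^2
f_stat = ssr / mse         # overall F statistic
r2 = ssr / ss_yy           # coefficient of determination

print(f"SS_xx={ss_xx:.3f}  SS_yy={ss_yy:.3f}  SS_xy={ss_xy:.3f}")
print(f"r={r:.4f}  95% CI for rho=({lo:.4f}, {hi:.4f})")
print(f"SSR={ssr:.3f}  SSE={sse:.3f}  MSE={mse:.3f}  F={f_stat:.3f}  R^2={r2:.4f}")
```

Using `math.atanh` and `math.tanh` avoids writing the logarithm and exponential forms by hand; they are algebraically the same transformation as in Solution 2.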
Parameter Estimates
parameter   estimate   s.e.     t       P-Value
β0          0.4359     1.3379   0.326   0.747
β1          3.6306     0.4658   7.795   < 0.0001

ANOVA Table
Source   df   SS        MS       F        P-Value
Model     1   76.858    76.858   60.758   < 0.0001
Error    28   35.420    1.265
Total    29   112.278

5. A 95% CI for $\beta_0$ is

$$\hat{\beta}_0 \pm t^{(n-2)}_{1-\alpha/2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}}} = 0.4359 \pm 2.0484 \times 1.3379 = (-2.3046,\ 3.1764).$$

In interpreting the CI for $\beta_0$, note the extrapolation problem, since no student has a GPA of 0. A 95% CI for $\beta_1$ is

$$\hat{\beta}_1 \pm t^{(n-2)}_{1-\alpha/2}\,\frac{\hat{\sigma}}{\sqrt{SS_{xx}}} = 3.6306 \pm 2.0484 \times 0.4658 = (2.6765,\ 4.5846).$$

6. We want to see whether a 0.5-point increase in GPA leads to more than a 1K increase in SAL. Proportionally, a 1-point increase in GPA would lead to more than a 2K increase in SAL. Namely, the alternative hypothesis, and hence the null, can be given by

$$H_a: 0.5\,\beta_1 > 1 \Longrightarrow \beta_1 > 1/0.5 = 2, \qquad H_0: \beta_1 = 2.$$

The t test can be used:

$$t_{obs} = \frac{\hat{\beta}_1 - 2}{SE(\hat{\beta}_1)} = \frac{3.6306 - 2}{0.4658} = 3.501.$$

This is an upper-sided test. We reject $H_0$ at $\alpha = 0.05$ since $t_{obs} > t^{(28)}_{0.95} = 1.701$.
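The interval and test arithmetic in Solutions 5 and 6 can be reproduced with a short Python sketch. The estimates and standard errors are copied from the Parameter Estimates table, and the critical values 2.0484 and 1.701 are taken from the solution as quoted rather than computed; the helper names `t_interval` and `upper_sided_t_test` are hypothetical, chosen here only for illustration.

```python
# Estimates and standard errors as reported in the Parameter Estimates table.
b0, se_b0 = 0.4359, 1.3379
b1, se_b1 = 3.6306, 0.4658

# Critical values quoted in the solution for 28 degrees of freedom (not computed here).
T_975_DF28 = 2.0484   # two-sided 95% intervals
T_95_DF28 = 1.701     # upper-sided test at alpha = 0.05


def t_interval(estimate, std_err, t_crit):
    """Return estimate -/+ t_crit * std_err as a (lower, upper) tuple."""
    half_width = t_crit * std_err
    return estimate - half_width, estimate + half_width


def upper_sided_t_test(estimate, null_value, std_err, t_crit):
    """Return the t statistic and whether H0 is rejected in favor of '>'."""
    t_obs = (estimate - null_value) / std_err
    return t_obs, t_obs > t_crit


ci_b0 = t_interval(b0, se_b0, T_975_DF28)
ci_b1 = t_interval(b1, se_b1, T_975_DF28)

# H0: beta_1 = 2 versus Ha: beta_1 > 2 (equivalently 0.5 * beta_1 > 1).
t_obs, reject = upper_sided_t_test(b1, 2.0, se_b1, T_95_DF28)

print(f"95% CI for beta_0: ({ci_b0[0]:.4f}, {ci_b0[1]:.4f})")
print(f"95% CI for beta_1: ({ci_b1[0]:.4f}, {ci_b1[1]:.4f})")
print(f"t_obs = {t_obs:.3f}, reject H0: {reject}")
```

The null value 2 lies below the lower limit of the 95% interval for $\beta_1$, which is consistent with rejecting $H_0$ in the upper-sided test.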
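7. For the first observation, CGPA $(x_1) = 2.58$ and SAL $(y_1) = 10.455$. It can be found that $\hat{y}_1 = 9.803$ and the residual $r_1 = y_1 - \hat{y}_1 = 0.652$.

8. At GPA $x = 3.5$, a 95% CI for estimating the mean SAL $E(y \mid x)$ is given by

$$(\hat{\beta}_0 + \hat{\beta}_1 x) \pm t^{(n-2)}_{1-\alpha/2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{SS_{xx}}} = 13.1429 \pm 2.0484 \times 0.3703 = (12.384,\ 13.901).$$

A 95% PI for predicting the individual SAL is given by

$$(\hat{\beta}_0 + \hat{\beta}_1 x) \pm t^{(n-2)}_{1-\alpha/2}\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x-\bar{x})^2}{SS_{xx}}} = 13.1429 \pm 2.0484 \times 1.1841 = (10.717,\ 15.568).$$

9. In the Working-Hotelling confidence band for $E(y \mid x)$, the critical value is

$$W = \sqrt{2\,F^{(2,\,n-2)}_{1-\alpha}} = \sqrt{2\,F^{(2,28)}_{0.95}} = \sqrt{2 \times 3.3404} = 2.5847.$$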
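The calculations in Solutions 7 to 9 can be checked with the Python sketch below. It plugs in the rounded quantities reported earlier ($\hat{\beta}_0 = 0.4359$, $\hat{\beta}_1 = 3.6306$, $MSE = 1.265$, $SS_{xx} = 5.831$, $\sum x_i = 85.15$) and the critical values quoted in the solutions, so the output should agree with the reported intervals only up to rounding. The function names are illustrative assumptions, and the band half-width at $x = 3.5$ is an extra illustration that the original solution does not report.

```python
import math

# Rounded quantities as reported in the solutions (assumed inputs).
n = 30
x_bar = 85.15 / n
ss_xx = 5.831
mse = 1.265                # estimate of sigma^2
b0, b1 = 0.4359, 3.6306

# Critical values quoted in the solutions (not computed here).
T_975_DF28 = 2.0484        # t quantile for 95% two-sided intervals, 28 df
F_95_DF2_28 = 3.3404       # F quantile with (2, 28) df


def fitted(x):
    """Fitted mean response at x from the LS line."""
    return b0 + b1 * x


def se_mean(x):
    """Standard error of the estimated mean response at x."""
    return math.sqrt(mse * (1 / n + (x - x_bar) ** 2 / ss_xx))


def se_pred(x):
    """Standard error for predicting a new individual response at x."""
    return math.sqrt(mse * (1 + 1 / n + (x - x_bar) ** 2 / ss_xx))


# Solution 7: fitted value and residual for the first student.
x1, y1 = 2.58, 10.455
y1_hat = fitted(x1)
r1 = y1 - y1_hat

# Solution 8: confidence and prediction intervals at x = 3.5.
x0 = 3.5
y0_hat = fitted(x0)
ci_mean = (y0_hat - T_975_DF28 * se_mean(x0), y0_hat + T_975_DF28 * se_mean(x0))
pi_new = (y0_hat - T_975_DF28 * se_pred(x0), y0_hat + T_975_DF28 * se_pred(x0))

# Solution 9: Working-Hotelling critical value, plus the band half-width at x0.
w = math.sqrt(2 * F_95_DF2_28)
wh_half_width = w * se_mean(x0)

print(f"y1_hat = {y1_hat:.3f}, r1 = {r1:.3f}")
print(f"95% CI for mean at x0: ({ci_mean[0]:.3f}, {ci_mean[1]:.3f})")
print(f"95% PI at x0: ({pi_new[0]:.3f}, {pi_new[1]:.3f})")
print(f"W = {w:.4f}, band half-width at x0 = {wh_half_width:.3f}")
```

Since $W$ exceeds the t critical value 2.0484, the Working-Hotelling band at any single $x$ is wider than the pointwise confidence interval, reflecting that the band covers the whole regression line simultaneously.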