STAT 512 Midterm I (2/21/2013) Spring 2013
Name: Key

INSTRUCTIONS
1. This exam is open book/open notes. All papers (but no electronic devices except for calculators) are allowed.
2. There are 5 pages in addition to the cover sheet. If you need more room for a problem, use the back of the sheets; clearly indicate where the answer is located.
3. Only 3 decimal places are required for all answers, except for some answers in question 1. In question 1, if a number used is less than 0.01, please include the whole number in the work.
4. Work is required to receive credit. Partial credit will be given for work that is partially correct. Points will be deducted for incorrect work even if the final answer is correct.
5. If I cannot read your answer, it will be marked wrong.
6. Good Luck!

Question    Possible    Score
1           42
2           25
3           43
Total       110
(42 pts.) 1. How was Hubble's Constant calculated? In 1929, Hubble investigated the relationship between distance and recession velocity of extra-galactic nebulae. The following are the edited results from the 24 nebulae that Hubble used in his study. The distance is in Megaparsecs from Earth and the recession velocity (r_velocity) is in km/s. (http://lib.stat.cmu.edu/dasl/datafiles/hubble.html)

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1    5.97547           5.97547        36.44      <.0001
Error              22    3.60782           0.16399
Corrected Total    23    9.58329

Parameter Estimates
Variable      DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept      1    0.39910               0.11847           ???        0.0028
r_velocity     1    0.00137               0.00022744        ???        <.0001

Output Statistics
Obs    r_velocity    Dependent Variable    Predicted Value    Std Error Mean Predict    95% CL Mean    Residual
25     200           .                     0.6737             0.0916                    ??? ???        .

a) Write down the simple linear regression model and the assumed distribution of the errors.

Y (distance) = β₀ + β₁ X (r_velocity) + ε, where ε ~ iid N(0, σ²)

b) Write down the estimated regression line using the data above.

Ŷ = 0.399 + 0.00137 X

c) Explain the difference between parts a) and b).

The model (part a) describes each individual point in the population. The estimated regression line (part b) describes the best-fit line in the sample.

d) What is the fitted value of Y (distance) for X (r_velocity) = 300? If the actual value of Y is 0.80, what is the residual?

Ŷ = 0.399 + (0.00137)(300) = 0.810
e = Y − Ŷ = 0.80 − 0.81 = −0.01
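The arithmetic in parts b) and d) can be checked with a short sketch. The intercept and slope below are the rounded estimates read off the SAS output above; the function name is just for illustration.

```python
# Rounded parameter estimates from the SAS output above (assumed exact here).
b0 = 0.399      # intercept (Megaparsecs)
b1 = 0.00137    # slope for r_velocity (Megaparsecs per km/s)

def fitted_distance(r_velocity):
    """Estimated regression line: Y-hat = b0 + b1 * X."""
    return b0 + b1 * r_velocity

# Part d): fitted value at X = 300 km/s, and the residual for an observed Y = 0.80.
y_hat = fitted_distance(300)       # 0.399 + 0.00137 * 300 = 0.810
residual = 0.80 - round(y_hat, 2)  # observed minus fitted = -0.01
```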
e) Would it be reasonable to consider inference on the intercept β₀ for this data? Explain why or why not.

Without looking at the data, yes, you could consider inference on β₀ because it is possible for the recession velocity to be 0 (this means that the extra-galactic nebula is moving at the same relative speed as the Earth). In fact, the data included points that were negative and very close to 0.

f) Calculate R² using the information provided.

R² = SSM/SST = 5.975/9.583 = 0.624, OR R² = 1 − SSE/SST = 1 − 3.608/9.583 = 0.624

g) From the information provided, can this model be used for prediction of new data points? Why or why not?

Either answer was correct here, depending on the justification. My answer is maybe. The P-value is good, and with the units used, both SSM and SST are small, so if R² were high enough, it would be a good fit. However, since R² is not very high (there is a fair amount of noise in the data), I am not sure. Note: you had to mention the size of the values for SSM and SST in addition to R² and the p-value. We are assuming here that the data follow a straight line and are not curvilinear.

h) Calculate and interpret the 95% confidence interval for the estimate of the mean at X (r_velocity) = 200.

t_c = t₂₂(1 − α/2) = t₂₂(0.975) = 2.074
Ŷ₂₀₀ ± t_c s{Ŷ₂₀₀} = 0.674 ± (2.074)(0.0916) = 0.674 ± 0.190 ⟹ (0.484, 0.864)

We are 95% confident that the mean distance in the population at a recession velocity of 200 km/s is between 0.484 Megaparsecs and 0.864 Megaparsecs.

i) In this situation, when would the confidence band be more appropriate than what was calculated in part h)?

A confidence band would be more appropriate than a confidence interval if we were interested in looking at all of the possible recession velocities at once, versus only one of them as in part h).

j) If the optimal Box-Cox transformation suggests λ = 0, what is the optimal transformed response?
That is, what function of Y should be used to perform the linear regression?

Y′ = ln Y = log_e Y

I did give full credit for Y′ = log Y (this is assumed to be log₁₀).
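The numeric answers in parts f), h), and j) can be verified with a short sketch. The sums of squares, predicted value, and standard error are the rounded values from the SAS output; the critical value 2.074 is t₂₂(0.975) as used in the key, not computed here.

```python
import math

# f) R-squared from the ANOVA sums of squares in question 1.
SSM, SSE, SST = 5.97547, 3.60782, 9.58329
r2 = SSM / SST                          # same as 1 - SSE/SST, about 0.624

# h) 95% CI for the mean response at r_velocity = 200 km/s.
t_crit = 2.074                          # t_22(0.975), from a t table
y_hat, se_mean = 0.6737, 0.0916         # predicted value and its standard error
margin = t_crit * se_mean
ci = (y_hat - margin, y_hat + margin)   # roughly (0.484, 0.864)

# j) Box-Cox power family; the lambda = 0 case is defined as the natural log,
#    which is also the limit of (y**lam - 1)/lam as lam -> 0.
def box_cox(y, lam):
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam
```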
(25 pts.) 2. The following modified data are based on the number of cigarettes smoked (households per capita) and the deaths per 100K from kidney cancer for 40 states in 1960. (http://lib.stat.cmu.edu/dasl/datafiles/cigcancerdat.html)

a) State each of the assumptions that are required for the linear regression model and state whether they are or are not met in this context. Be sure to mention all 3 of the plots in your discussion.

Linearity: not met, because of graphs A and B.
Outliers: not met, because of graph A.
Constant variance: not met, because of graphs A and B (note: I did give full credit if you stated that it was met).
Normality: met, because of the QQ plot (note: I did give full credit if you stated that it was not met).
Independence: cannot tell from the graphs; this needs to be determined from the experimental conditions.
b) If any of the assumptions are not met, describe a possible method that could be used to remedy the situation. Please explain your choice, being as explicit as possible. That is, if you are going to use a transformation, state which possible transformations would be used; if you are going to use a procedure in SAS, describe which procedure to use; etc. If all of the assumptions are met, state that fact and then choose one assumption and finish the question assuming that your chosen assumption is violated.

The answer to this question depended on the answer to part a). If you stated that the only condition not met was linearity, then the correct answer here would be an X transformation. From the table in the book, the possible choices were X′ = log X or X′ = √X. If you stated that constant variance and/or normality were not met, then the correct answer here would be a Y transformation. To determine which Y transformation would be appropriate, you would use the Box-Cox procedure in SAS. After the transformation above is performed, you need to rerun the diagnostics to see if all of the assumptions are now met.

(43 pts.) 3. A General Psychology instructor wanted to know if she could predict the score on the final from the scores on the first three exams. There were 25 students in this class. The following is the edited SAS output. (http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/mlr/frames/frame.html - test scores for General Psychology)

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3    13732             4577.333       672.196    <.0001
Error              21    143               6.8095
Corrected Total    24    13875

a) Fill in the missing values in the SAS output above. Please show work for all of the empty spaces below.
p = 4, n = 25

df_M = p − 1 = 4 − 1 = 3
df_E = n − p = 25 − 4 = 21
df_T = n − 1 = 25 − 1 = 24 = df_M + df_E = 3 + 21

SSE = SST − SSM = 13875 − 13732 = 143
MSM = SSM/df_M = 13732/3 = 4577.333
MSE = SSE/df_E = 143/21 = 6.8095
F = MSM/MSE = 4577.333/6.8095 = 672.196
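The fill-in work above can be reproduced directly from the two totals given in the ANOVA table. A minimal sketch (variable names mirror the notation in part a)):

```python
# Given: 25 students, 4 parameters (intercept + 3 exam slopes),
# and the SST and SSM printed in the ANOVA table.
n, p = 25, 4
SST, SSM = 13875, 13732

df_M = p - 1          # 3
df_E = n - p          # 21
df_T = n - 1          # 24, which equals df_M + df_E

SSE = SST - SSM       # 143
MSM = SSM / df_M      # 4577.333...
MSE = SSE / df_E      # 6.8095...
F = MSM / MSE         # about 672.196
```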
b) Using the results in part a), perform the appropriate significance test. Please include the null and alternative hypotheses, the test statistic with the degrees of freedom, the p-value, the decision, and the conclusion in words (that is, the conclusion needs to be stated in the context of the problem).

H₀: β₁ = β₂ = β₃ = 0
Hₐ: at least one βₖ ≠ 0
F = 672.196 with df(numerator) = 3, df(denominator) = 21
p < 0.0001
Decision: reject H₀.
Conclusion: the data strongly support the claim (P < 0.0001) that at least one of the three exam scores is associated with the score on the final.

c) Write down the design matrix for this problem. You may use variables for the actual data points. Be sure to clearly indicate what the dimensions of the matrix are. Note: if you wrote down all of the matrices and did not indicate which one was the design matrix, you lost points.

X (25×4) =
[ 1    X₁,₁     X₁,₂     X₁,₃  ]
[ ⋮     ⋮        ⋮        ⋮    ]
[ 1    X₂₅,₁    X₂₅,₂    X₂₅,₃ ]

d) Given the data below, do you expect a problem with correlation between the explanatory variables? Why or why not?

Pearson Correlation Coefficients, N = 25
         Final      Exam1      Exam2      Exam3
Final    1.00000    0.94607    0.92947    0.97233
Exam1    0.94607    1.00000    0.90136    0.89274
Exam2    0.92947    0.90136    1.00000    0.84636
Exam3    0.97233    0.89274    0.84636    1.00000

Yes, I expect a problem with correlation between the explanatory variables because of the high correlations between them; they range from 0.846 to 0.901. There was no reason to include the p-values here because of the high numbers. Remember, in the body fat example, we had problems when the correlation coefficients were around 0.5.
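The shape of the design matrix in part c) can be illustrated with placeholder data. The exam scores below are hypothetical; only the dimensions and the leading column of 1s matter.

```python
n, p = 25, 4   # 25 students, intercept plus 3 exam scores

# Hypothetical exam scores for illustration only (X_i1, X_i2, X_i3 per student).
exam_scores = [[70 + i, 75 + i, 80 + i] for i in range(n)]

# Design matrix: a column of 1s for the intercept, then the three exam columns.
X = [[1] + row for row in exam_scores]

rows, cols = len(X), len(X[0])   # 25 x 4
```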