Advanced Statistical Regression Analysis: Mid-Term Exam, Chapters 1-5

Instructions: Read each question carefully before determining the best answer. Show all work; supporting computer code and output must be attached to the question for which it is used. Use computer output as a guide only; all questions must be answered in full on the exam paper. Report all final numerical answers to a precision of 4 units past the decimal point (e.g., 18.1234 or 1.1234 × 10⁻⁵). There are 100 total points on this exam. Do not discuss this exam or its components with anyone besides the course instructor or the course TA. This exam is due by 3:15 PM, October 31, 2017.

1. A recent study on financial activity among n = 9 health care companies recorded two variables: (i) current-year Price-to-Earnings ratio (PE, a measure of the firm's market value), and (ii) 5-year growth rate (%). Take Xi = PE and Yi = Growth rate. Download the data from http://math.arizona.edu/~piegorsch/571a/exam1prob1.csv.

a. (5 points) Plot the data as Y vs. X. What relationship is observed?

Sample R commands:
finance.df = read.csv( file.choose() )
attach( finance.df ); X = PE; Y = Growth.rte
plot( Y ~ X, pch=19 )

The plot indicates a clear increase in Y = Growth rate as X = PE increases.
b. (10 points) Assume a simple linear regression (SLR) model: Yi ~ indep. N(β0 + β1·Xi, σ²), i = 1, ..., n. Calculate the least squares (LS) estimators for β0 and β1.

Sample R commands:
finance.lm <- lm( Y ~ X )
summary( finance.lm )
round( coef( finance.lm ), dig=4 )

Call:
lm(formula = Y ~ X)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -8.4510     5.7367  -1.473   0.1842
X             0.9501     0.3565   2.665   0.0322

Residual standard error: 2.582 on 7 degrees of freedom
Multiple R-squared: 0.5037, Adjusted R-squared: 0.4328
F-statistic: 7.104 on 1 and 7 DF, p-value: 0.03221

(Intercept)           X
    -8.4510      0.9501

from which we find b0 = −8.4510 and b1 = 0.9501.

c. (5 points) What is the coefficient of determination with these data?

From the R output in part (b): Multiple R-squared: 0.5037. So we see R² = 0.5037, i.e., 50.37% of the variability in Y is explained by the differential values of X.
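The LS estimates that R's lm() reports can be reproduced from the usual closed-form expressions b1 = Sxy/Sxx and b0 = Ȳ − b1·X̄. As a cross-check of those formulas (not the exam's R workflow), here is a minimal stdlib-Python sketch on a small hypothetical data set, not the exam data:

```python
# Least-squares slope and intercept via the textbook formulas:
#   b1 = Sxy / Sxx,   b0 = Ybar - b1 * Xbar
# Illustrated on made-up numbers (the exam data live at the URL above).
def slr_fit(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Points lying exactly on y = 2x recover b0 = 0, b1 = 2.
b0, b1 = slr_fit([1, 2, 3, 4], [2, 4, 6, 8])
```

The same arithmetic underlies R's coef( lm(Y ~ X) ); only the bookkeeping differs.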
d. (5 points) Find the raw residuals from the SLR fit and plot them against the predictor variable. Do any untoward patterns appear in the plot?

Use
plot( resid(finance.lm) ~ X, pch=19 )
abline( h=0 )

The plot shows no untoward patterns, although it is always difficult to discern much from such a small data set.
e. (15 points) Under the SLR model, test to determine whether increases in PE lead to increases in the Growth rate, on average, in these data. Operate at a 5% false positive rate.

Test H0: β1 = 0 vs. Ha: β1 > 0 ("increases in Growth rate"). Set α = 0.05. The test statistic is t* = b1/se[b1] = 0.95013/0.35647 = 2.6654 from the R output in part 1(b). The P-value is given there as P = 0.0322, but this is two-sided! In R, the correct one-sided P-value is computed via

tstar = coef( finance.lm )[2]/sqrt( vcov(finance.lm)[2,2] )
pt( tstar, df=finance.lm$df.residual, lower.tail=FALSE )

producing
[1] 0.0161063

Since P = 0.0161 < 0.05 = α, we reject H0 and conclude that increases in PE lead to significant increases in Growth rate, on average, in these data.

Via a rejection region: reject H0 when t* > t(1−α; dfE) = t(0.95; 7) = 1.895 (see Table B.2). Since 2.6654 > 1.895, we again reject H0.

f. (15 points) If a health care company whose PE equals 19 were to be studied, what growth rate would you anticipate the company would report, based on your SLR fit? Also give a 95% interval estimate for this value.

As written, this is a prediction problem. Sample R code:
predict( finance.lm, newdata=data.frame(X=19), interval="pred", level=0.95 )
detach( finance.df )

       fit      lwr      upr
1 9.601517 2.660056 16.54298

Thus the predicted value at PE = 19 is Ŷh = 9.6015, with corresponding 95% prediction limits of 2.6601 < Yh < 16.5430.
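For reference, the limits that predict(..., interval="pred") returns follow the hand formula Ŷh ± t(1−α/2; n−2)·s·√(1 + 1/n + (Xh − X̄)²/Sxx). A stdlib-Python sketch of that formula follows; the function name and the inputs below are purely illustrative (they are not the exam data's X̄ or Sxx), and the critical value t(0.975; 7) = 2.365 is the Table B.2 entry:

```python
import math

# 95% prediction interval at X = xh from the SLR hand formula:
#   yhat_h +/- tcrit * s * sqrt(1 + 1/n + (xh - xbar)^2 / sxx)
def pred_interval(yhat_h, s, n, xh, xbar, sxx, tcrit):
    half = tcrit * s * math.sqrt(1 + 1 / n + (xh - xbar) ** 2 / sxx)
    return yhat_h - half, yhat_h + half

# Illustrative inputs only: xbar and sxx here are made up, while
# yhat_h, s, n, and tcrit echo the fitted values reported above.
lwr, upr = pred_interval(yhat_h=9.6, s=2.58, n=9, xh=19,
                         xbar=15.0, sxx=52.0, tcrit=2.365)
```

Note the leading 1 under the square root: it is what widens a prediction interval for a new observation relative to a confidence interval for the mean response.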
2. A scientist studied climate change among n = 35 island nation-states across the Earth. Two paired response variables were recorded: (i) population (in thousands) and (ii) annual CO2 emissions (megatons). Responses such as these tend to skew, so take the transformations Yi1 = log{Population} and Yi2 = log{CO2}. Download the data from http://math.arizona.edu/~piegorsch/571a/exam1prob2.csv. These data are from the article: Ebi, K.L., Lewis, N.D., and Corvalan, C. (2006). Climate variability and change and their potential health effects in small island states: Information for adaptation planning in the health sector. Environmental Health Perspectives 114, 1957-1963.

a. (5 points) To quantify the association between these two transformed variables, calculate the Pearson correlation, r, between them.

Sample R commands:
CO2.df = read.csv( file.choose() ); attach( CO2.df )
Y1 = log(Popln); Y2 = log(CO2)
round( cor(x=Y1, y=Y2), dig=4 )

[1] 0.7961

b. (10 points) Assume Y1 and Y2 possess a bivariate normal distribution with correlation ρ. The investigator suspects a priori that increasing population would associate with increases in CO2 emissions. Use the Pearson correlation to test this assertion with these paired data. Operate at a false positive rate of 0.5%.

Hypotheses are H0: ρ = 0 vs. Ha: ρ > 0 ("associate with increases"). Set α = 0.005. Sample R statements and (edited) output:

cor.test( x=Y1, y=Y2, alternative='greater' )

        Pearson's product-moment correlation
t = 7.5565, df = 33, p-value = 5.408e-09
alternative hypothesis: true correlation is greater than 0

Since P = 5.408 × 10⁻⁹ < 0.005 = α, we reject H0.

Or via a rejection region, find the test statistic as t* = r√(n−2)/√(1−r²) = (0.7961)√33/√0.3662 = 4.5732/0.6052 = 7.5570. (Using the higher precision available in R, a more accurate value is t* = 7.5565; see the output above.) Reject H0 if t* > t(0.995; 33) = 2.7333.
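The hand statistic t* = r√(n−2)/√(1−r²) is easy to cross-check outside of R. A short stdlib-Python sketch, plugging in the values reported above (r = 0.7961, n = 35):

```python
import math

# t statistic for testing rho = 0 from a sample Pearson correlation:
#   t* = r * sqrt(n - 2) / sqrt(1 - r^2),  df = n - 2
def cor_tstat(r, n):
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

tstar = cor_tstat(0.7961, 35)  # close to the 7.5570 computed by hand above
```

The small gap from R's 7.5565 reflects rounding r to four decimals before plugging in, exactly as noted in the solution.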
Find this t critical point in R via
qt( 1-0.005, df=33 )
where df = n − 2 = 33. Since t* = 7.5565 > 2.7333, we again reject H0. In either case, conclude that a significant positive correlation is evidenced between (log-)population and (log-)CO2 emissions.
c. (5 points) An alternative approach here that avoids the data transformations is to apply Spearman's rank correlation, rs, to the original data pairs. Do so and report the value.

Sample R command:
round( cor(x=Popln, y=CO2, method='spearman'), dig=4 )

[1] 0.7173

i.e., rs = 0.7173.

d. (10 points) Repeat the significance test in part (b), now with the original, untransformed data and Spearman's correlation. Operate at a false positive rate of 0.5%.

Hypotheses are H0: no association vs. Ha: positive association (one-sided, as per the problem statement). Set α = 0.005. Sample R statements and output (edited) are

cor.test( x=Popln, y=CO2, method='spearman', alternative='greater', exact=FALSE )
detach( CO2.df )

        Spearman's rank correlation rho
data: Popln and CO2
S = 2018.7, p-value = 6.225e-07
alternative hypothesis: true rho is greater than 0

Since P = 6.225 × 10⁻⁷ < 0.005 = α, we again reject H0.

Or via a rejection region, find the test statistic for this large-sample setting as t* = rs√(n−2)/√(1−rs²) = (0.7173)√33/√0.4855 = 4.1206/0.6968 = 5.9139. As above, reject H0 if t* > t(0.995; 33) = 2.7333. Since t* = 5.9139 > 2.7333, we again reject H0, and conclude that a significant positive association is evidenced between population and CO2 emissions.
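Spearman's rs is simply the Pearson correlation computed on the ranks of each variable, which is what R's cor(..., method='spearman') does internally. A stdlib-Python sketch on hypothetical, tie-free data (the helper names here are illustrative; handling ties with midranks is omitted for brevity):

```python
# Rank each value 1..n by sorting (no ties assumed in this sketch).
def ranks(v):
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

# Ordinary Pearson correlation of two equal-length lists.
def pearson(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Spearman's rank correlation = Pearson applied to the ranks.
def spearman(x, y):
    return pearson(ranks(x), ranks(y))

rs = spearman([10, 20, 30, 40, 50], [1, 3, 2, 5, 4])  # hypothetical pairs
```

Because only ranks enter the computation, rs is invariant to any monotone transformation of either variable, which is why the log transformation in part (a) becomes unnecessary here.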
3. (15 points) Consider the simple linear regression (SLR) model with Yi ~ indep. N(β0 + β1·Xi, σ²), i = 1, ..., n. Let Y be the vector of observations and X the concomitant design matrix (a leading column of ones and the column of predictor values), with β = [β0 β1]′. Also let b = [b0 b1]′ be the vector of LS estimates. Using matrix operations, establish algebraically that (Xβ − Xb)′(Y − Xb) = 0.

Start with
(Xβ − Xb)′(Y − Xb) = [X(β − b)]′(Y − Xb) = (β − b)′X′(Y − Xb).

Now, notice that since b = (X′X)⁻¹X′Y, we can write the last term as
Y − Xb = Y − X(X′X)⁻¹X′Y = (I − X(X′X)⁻¹X′)Y,
so that
(Xβ − Xb)′(Y − Xb) = (β − b)′X′(Y − Xb) = (β − b)′X′[I − X(X′X)⁻¹X′]Y.

But then clearly
X′(I − X(X′X)⁻¹X′)Y = (X′ − X′X(X′X)⁻¹X′)Y = (X′ − IX′)Y = (X′ − X′)Y = 0,
so
(Xβ − Xb)′(Y − Xb) = (β − b)′0,
which is a row vector, (β − b)′, times a column vector of zeroes. This obviously produces a sum of zeroes, which is itself zero, as desired.
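The identity can also be checked numerically: the LS residual vector e = Y − Xb satisfies X′e = 0, so (Xβ − Xb)′e = (β − b)′X′e = 0 for any β whatsoever. A stdlib-Python sketch on arbitrary made-up numbers (not exam data), solving the 2×2 normal equations (X′X)b = X′Y directly by Cramer's rule:

```python
# Fit SLR by solving the 2x2 normal equations (X'X) b = X'Y.
def ls_fit(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx          # determinant of X'X
    b1 = (n * sxy - sx * sy) / det   # Cramer's rule, second component
    b0 = (sy * sxx - sx * sxy) / det # Cramer's rule, first component
    return b0, b1

x = [1.0, 2.0, 4.0, 7.0]   # arbitrary illustrative data
y = [2.3, 1.9, 5.1, 6.4]
b0, b1 = ls_fit(x, y)
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # residual vector

# X'e has two components: sum of e (ones column) and sum of x*e.
col1 = sum(e)
col2 = sum(xi * ei for xi, ei in zip(x, e))

# (Xbeta - Xb)'e for an arbitrary beta = (beta0, beta1):
beta0, beta1 = -3.0, 10.0
inner = sum((beta0 + beta1 * xi - (b0 + b1 * xi)) * ei
            for xi, ei in zip(x, e))
```

Both components of X′e, and the inner product for the arbitrary β, come out to zero up to floating-point rounding, mirroring the algebraic argument above.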