Advanced Statistical Regression Analysis: Mid-Term Exam Chapters 1-5

Instructions: Read each question carefully before determining the best answer. Show all work; supporting computer code and output must be attached to the question for which it is used. Use computer output as a guide only; all questions must be answered in full on the exam paper. Report all final numerical answers to a precision of 4 units past the decimal point (e.g., 18.1234 or 1.1234 × 10⁻⁵). There are 100 total points on this exam. Do not discuss this exam or its components with anyone besides the course instructor or the course TA. This exam is due by 3:15 PM, October 31, 2017.

1. A recent study on financial activity among n = 9 health care companies recorded two variables: (i) current-year Price-to-Earnings ratio (PE, a measure of the firm's market value), and (ii) 5-year growth rate (%). Take Xᵢ = PE and Yᵢ = Growth rate. Download the data from http://math.arizona.edu/~piegorsch/571a/exam1prob1.csv.

a. (5 points) Plot the data as Y vs. X. What relationship is observed?

Sample R commands:

    finance.df = read.csv( file.choose() )
    attach( finance.df ); X = PE; Y = Growth.rte
    plot( Y ~ X, pch=19 )

The plot (omitted here) indicates a clear increase in Y = Growth rate as X = PE increases.
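
As a small convenience, read.csv() in R also accepts a URL directly, so the interactive file chooser can be skipped; assuming the course file is still hosted at the address above, the following is equivalent:

    finance.df = read.csv( "http://math.arizona.edu/~piegorsch/571a/exam1prob1.csv" )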

b. (10 points) Assume a simple linear regression (SLR) model: Yᵢ ~ indep. N(β₀ + β₁Xᵢ, σ²), i = 1, ..., n. Calculate the least squares (LS) estimators for β₀ and β₁.

Sample R commands:

    finance.lm <- lm( Y ~ X )
    summary( finance.lm )
    round( coef( finance.lm ), dig=4 )

    Call:
    lm(formula = Y ~ X)

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  -8.4510     5.7367  -1.473   0.1842
    X             0.9501     0.3565   2.665   0.0322

    Residual standard error: 2.582 on 7 degrees of freedom
    Multiple R-squared: 0.5037, Adjusted R-squared: 0.4328
    F-statistic: 7.104 on 1 and 7 DF, p-value: 0.03221

    (Intercept)        X
        -8.4510   0.9501

from which we find b₀ = −8.4510 and b₁ = 0.9501.

c. (5 points) What is the coefficient of determination with these data?

From the R output in part (b): Multiple R-squared: 0.5037. So we see R² = 0.5037; i.e., 50.37% of the variability in Y is explained by the differential values of X.
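
The LS estimates can also be verified by hand from the closed-form SLR formulas b₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ)/Σ(Xᵢ − X̄)² and b₀ = Ȳ − b₁X̄. A minimal sketch in R, assuming X and Y from part (a) are still in the workspace:

    b1 = sum( (X - mean(X))*(Y - mean(Y)) ) / sum( (X - mean(X))^2 )   # slope
    b0 = mean(Y) - b1*mean(X)                                          # intercept
    round( c(b0, b1), dig=4 )   # should match coef(finance.lm): -8.4510, 0.9501

Similarly, in SLR the coefficient of determination equals the squared Pearson correlation, so cor(X, Y)^2 returns the same R² = 0.5037.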

d. (5 points) Find the raw residuals from the SLR fit and plot them against the predictor variable. Do any untoward patterns appear in the plot?

Use:

    plot( resid(finance.lm) ~ X, pch=19 )
    abline( h=0 )

The resulting plot (omitted here) shows no untoward patterns, although it is always difficult to discern much from such a small data set.
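
As an optional extension beyond what the problem asks, with only n = 9 points it can also help to inspect standardized residuals, which put the raw residuals on a common scale; rstandard() is the stock base-R function for internally studentized residuals:

    plot( rstandard(finance.lm) ~ X, pch=19, ylab="Standardized residual" )
    abline( h=c(-2, 0, 2), lty=c(2, 1, 2) )   # values beyond +/-2 merit a second look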

e. (15 points) Under the SLR model, test to determine if increases in PE lead to increases in the Growth rate, on average, in these data. Operate at a 5% false positive rate.

Test H₀: β₁ = 0 vs. Hₐ: β₁ > 0 ("increases in Growth rate"). Set α = 0.05. The test statistic is t* = b₁/se[b₁] = 0.95013/0.35647 = 2.6654, from the R output in part 1(b). The P-value is given as P = 0.0322, but this is two-sided! In R, the correct one-sided P-value is computed via

    tstar = coef( finance.lm )[2]/sqrt( vcov(finance.lm)[2,2] )
    pt( tstar, df=finance.lm$df.residual, lower=FALSE )

producing

    [1] 0.0161063

Since P = 0.0161 < 0.05 = α, we reject H₀ and conclude that increases in PE lead to significant increases in Growth rate, on average, in these data. Via a rejection region: reject H₀ when t* > t(1 − α; df_E) = t(0.95; 7) = 1.895 (see Table B.2). Since 2.6654 > 1.895, we again reject H₀.

f. (15 points) If a health care company whose PE equals 19 were to be studied, what growth rate would you anticipate the company would report, based on your SLR fit? Also give a 95% interval estimate for this value.

As written, this is a prediction problem. Sample R code:

    predict( finance.lm, newdata=data.frame(X=19), interval="pred", level=0.95 )
    detach( finance.df )

           fit      lwr      upr
    1 9.601517 2.660056 16.54298

Thus the predicted value at PE = 19 is given by Ŷh = 9.6015, with corresponding 95% prediction limits of 2.6601 < Yh < 16.5430.
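
For reference, the interval in part (f) follows the standard SLR prediction formula Ŷh ± t(1 − α/2; n − 2)·s{pred}, where s²{pred} = MSE·[1 + 1/n + (Xh − X̄)²/Σ(Xᵢ − X̄)²]. A hand-computation sketch (X and Y were copied into the workspace in part (a), so they survive the detach above):

    Xh    = 19
    MSE   = sum( resid(finance.lm)^2 ) / finance.lm$df.residual
    Yhat  = sum( coef(finance.lm)*c(1, Xh) )   # point prediction: b0 + b1*Xh
    spred = sqrt( MSE*(1 + 1/length(X) + (Xh - mean(X))^2/sum((X - mean(X))^2)) )
    Yhat + c(-1, 1)*qt( 0.975, df=finance.lm$df.residual )*spred   # approx. (2.6601, 16.5430)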

2. A scientist studied climate change among n = 35 island nation-states across the Earth. Two paired response variables were recorded: (i) population (in thousands) and (ii) annual CO₂ emissions (megatons). Responses such as these tend to skew, so take the transformations Yᵢ₁ = log{Population} and Yᵢ₂ = log{CO₂}. Download the data from http://math.arizona.edu/~piegorsch/571a/exam1prob2.csv.

a. (5 points) To quantify the association between these two transformed variables, calculate the Pearson correlation, r, between them.

These data are from the article: Ebi, K.L., Lewis, N.D., and Corvalan, C. (2006). Climate variability and change and their potential health effects in small island states: Information for adaptation planning in the health sector. Environmental Health Perspectives 114, 1957-1963.

Sample R commands:

    CO2.df = read.csv( file.choose() ); attach( CO2.df )
    Y1 = log(Popln); Y2 = log(CO2)
    round( cor(x=Y1, y=Y2), dig=4 )

    [1] 0.7961

b. (10 points) Assume Y₁ and Y₂ possess a bivariate normal distribution with correlation ρ. The investigator suspects a priori that increasing population would associate with increases in CO₂ emissions. Use the Pearson correlation to test this assertion with these paired data. Operate at a false positive rate of 0.5%.

Hypotheses are H₀: ρ = 0 vs. Hₐ: ρ > 0 ("associate with increases"). Set α = 0.005. Sample R statements and (edited) output:

    ques2b.cor = cor.test( x=Y1, y=Y2, alternative='greater' )
    ques2b.cor

    Pearson's product-moment correlation
    t = 7.5565, df = 33, p-value = 5.408e-09
    alternative hypothesis: true correlation is greater than 0

Since P = 5.408 × 10⁻⁹ < 0.005 = α, we reject H₀. Or via a rejection region, find the test statistic as t* = r√(n − 2)/√(1 − r²) = (0.7961)√33/√0.3662 = 4.5732/0.6052 = 7.5570. (Using the higher precision available in R, a more accurate value is t* = 7.5565; see the output above.) Reject H₀ if t* > t(0.995; 33) = 2.7333. Find this t critical point in R via

    qt( 1-0.005, df=ques2b.cor$parameter, lower=TRUE )

Since t* = 7.5565 > 2.7333, we again reject H₀. In either case, conclude that a significant positive correlation is evidenced between (log-)population and (log-)CO₂ emissions.
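
The rejection-region arithmetic above can be reproduced directly in R; a short check, assuming Y1 and Y2 from part (a) are still defined:

    r = cor( Y1, Y2 )
    n = length( Y1 )
    tstar = r*sqrt(n - 2)/sqrt(1 - r^2)      # t* = r*sqrt(n-2)/sqrt(1-r^2)
    tstar                                    # 7.5565, matching cor.test()
    pt( tstar, df=n - 2, lower=FALSE )       # one-sided P = 5.408e-09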

c. (5 points) An alternative approach here that avoids the data transformations is to apply Spearman's rank correlation, rₛ, to the original data pairs. Do so and report the value.

Sample R command:

    round( cor(x=Popln, y=CO2, method='spearman'), dig=4 )

    [1] 0.7173

i.e., rₛ = 0.7173.

d. (10 points) Repeat the significance test in part (b), now with the original, untransformed data and Spearman's correlation. Operate at a false positive rate of 0.5%.

Hypotheses are H₀: No association vs. Hₐ: Positive association (one-sided, as per the problem statement). Set α = 0.005. Sample R statements and output (edited) are

    cor.test( x=Popln, y=CO2, method='spearman', alternative='greater', exact=FALSE )
    detach( CO2.df )

    Spearman's rank correlation rho
    data: Popln and CO2
    S = 2018.7, p-value = 6.225e-07
    alternative hypothesis: true rho is greater than 0

Since P = 6.2252 × 10⁻⁷ < 0.005 = α, we again reject H₀. Or via a rejection region, find the test statistic for this large-sample setting as t* = rₛ√(n − 2)/√(1 − rₛ²) = (0.7173)√33/√0.4855 = 4.1206/0.6968 = 5.9139. As above, reject H₀ if t* > t(0.995; 33) = 2.7333. Since t* = 5.9139 > 2.7333, we again reject H₀, and conclude that a significant positive association is evidenced between population and CO₂ emissions.
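
Spearman's coefficient is simply the Pearson correlation computed on the within-sample ranks, which is why the same large-sample t statistic applies. A quick illustration (run before the detach above, while Popln and CO2 are still attached):

    round( cor( rank(Popln), rank(CO2) ), dig=4 )   # 0.7173, identical to method='spearman'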

3. (15 points) Consider the simple linear regression (SLR) model with Yᵢ ~ indep. N(β₀ + β₁Xᵢ, σ²), i = 1, ..., n. Let Y be the vector of observations and X be the concomitant design matrix of predictor variables. Also let b = [b₀ b₁]′ be the vector of LS estimates. Using matrix operations, establish algebraically that (Xβ − Xb)′(Y − Xb) = 0.

Start with (Xβ − Xb)′(Y − Xb) = [X(β − b)]′(Y − Xb) = (β − b)′X′(Y − Xb). Now, notice that since b = (X′X)⁻¹X′Y, we can write the last term as Y − Xb = Y − X(X′X)⁻¹X′Y = (I − X(X′X)⁻¹X′)Y, so that

    (Xβ − Xb)′(Y − Xb) = (β − b)′X′(Y − Xb) = (β − b)′X′[I − X(X′X)⁻¹X′]Y.

But then clearly

    X′(I − X(X′X)⁻¹X′)Y = (X′ − X′X(X′X)⁻¹X′)Y = (X′ − IX′)Y = (X′ − X′)Y = 0,

so (Xβ − Xb)′(Y − Xb) = (β − b)′0, which is a row vector, (β − b)′, times a column vector of zeroes. This obviously produces a sum of zeroes, which is itself zero, as desired.
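
The identity can also be confirmed numerically with the fit from Problem 1 (a sketch only, assuming finance.lm is still in the workspace; beta below is an arbitrary trial value standing in for the unknown parameter vector, since the result holds for any β):

    Xmat = model.matrix( finance.lm )                   # n x 2 design matrix [1, X]
    Yvec = fitted( finance.lm ) + resid( finance.lm )   # recovers the response vector Y
    b    = coef( finance.lm )                           # LS estimates (b0, b1)
    beta = c(1, 1)                                      # arbitrary trial coefficient vector
    drop( t( Xmat %*% beta - Xmat %*% b ) %*% ( Yvec - Xmat %*% b ) )   # 0 up to rounding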