BIOSTATS 640 Spring 2018 Unit 2. Regression and Correlation (Part 1 of 2) R Users


Unit 2 Regression and Correlation (1 of 2) - Practice Problems SOLUTIONS - R Users

1. In this exercise, you will gain some practice doing a simple linear regression using an R data set called week0.rdata. This data set has n=31 observations of boiling points (Y=boiling) and temperature (X=temp).

Tip - In this exercise you will need to create a new variable newy = 100*log10(boiling).

Carry out an exploratory analysis to determine whether the relationship between temperature and boiling point is better represented using

(i)  Y = β0 + β1X, or
(ii) 100*log10(Y) = β0' + β1'X

In developing your answer, try your hand at producing
(a) Estimates of the regression line parameters
(b) Analysis of variance tables
(c) R²
(d) Scatter plot with overlay of fitted line.

Complete your answer with a one paragraph text that is an interpretation of your work. Take your time with this and have fun.

# Input data that you have downloaded and saved to your desktop (edit my desktop)
setwd("/Users/cbigelow/Desktop")
load(file="week0.rdata")

# Look at first 6 rows of data using command head( )
head(week0)
   temp boiling
1 210.8  29.211
2 210.2  28.559
3 208.4  27.972
4 202.5  24.697
5 200.6  23.726
6 200.1  23.369
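Before fitting anything, it can help to look at the two candidate relationships side by side. A quick exploratory sketch (assuming the week0 data frame loaded above; everything here is base R):

par(mfrow=c(1,2))    # arrange plots in 1 row, 2 columns
plot(week0$temp, week0$boiling, xlab="temp", ylab="boiling")
plot(week0$temp, 100*log10(week0$boiling), xlab="temp", ylab="100*log10(boiling)")
par(mfrow=c(1,1))    # reset the plot layout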

# I will create newvar=oldvar as follows: newvar <- dataframe$oldvar
x <- week0$temp
y <- week0$boiling

m1 <- lm(y ~ x)    # OR: m1 <- lm(boiling ~ temp, data=week0)  OR: m1 <- lm(week0$boiling ~ week0$temp)
summary(m1)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
 -4.806  0.0057    0.48  0.5573    1.00

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -65.34998    4.64382  -14.07 1.73e-14 ***
x             0.44379    0.02419   18.35  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.157 on 29 degrees of freedom
Multiple R-squared: 0.9207,  Adjusted R-squared: 0.9179
F-statistic: 336.6 on 1 and 29 DF,  p-value: < 2.2e-16

m1$coefficients    # output the intercept and slope
(Intercept)           x
 -65.349977    0.443794

# confint(m1)    # output the CI for the parameters in the fitted model

anova(m1)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)
x          1 450.56  450.56   336.6 < 2.2e-16 ***
Residuals 29  38.82    1.34
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(x, y, main="Simple Linear Regression of Y=boiling on X=temp",
     xlab="temp", ylab="boiling", pch=8)    # pch sets the "plot character" (symbol)
abline(m1, col="red", lty=2, lwd=2)         # lty: "line type"; lwd: "line width"
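As a side check (not shown in the original output), the commented-out confint() call gives 95% confidence intervals for the intercept and slope, and the squared Pearson correlation reproduces R-squared for a simple linear regression:

confint(m1, level=0.95)    # 95% CIs for beta0 and beta1
cor(x, y)^2                # squared correlation = Multiple R-squared = 0.9207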

From the above summary(m1) output, we know that:

(a): Parameter Estimates: -65.34 and 0.44, so regression line: Y = -65.34 + 0.44*X;
(b): The ANOVA (Analysis of Variance) table is obtained above using the anova(m1) code;
(c): R-square = 0.9207;
(d): The scatterplot with the fitted line is plotted above.

(2). Now, instead of Y=boiling, we want to use newy = 100*log10(boiling)

newy <- 100*log10(y)
m2 <- lm(newy ~ x)
summary(m2)

Call:
lm(formula = newy ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
 -4.548   0.386  0.6079   0.993   1.459

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -48.85885   11.88807   -4.11 0.000297 ***
x             0.92655    0.06194   14.96 3.63e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.962 on 29 degrees of freedom
Multiple R-squared: 0.8852,  Adjusted R-squared: 0.8813
F-statistic: 223.7 on 1 and 29 DF,  p-value: 3.63e-15

m2$coefficients    # output the intercept and slope
(Intercept)           x
 -48.858854    0.926547

# confint(m2)    # output the CI for the parameters in the fitted model

anova(m2)
Analysis of Variance Table

Response: newy
          Df Sum Sq Mean Sq F value    Pr(>F)
x          1 1962.2  1962.2  223.69 3.63e-15 ***
Residuals 29  254.4    8.77
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(x, newy, main="Simple Linear Regression of Y=100*log10(boiling) on X=temp",
     xlab="temp", ylab="100*log10(boiling)", pch=8)
abline(m2, col="red", lwd=2)    # lwd means "line width"
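The two R-squared values can also be pulled out programmatically, which makes the comparison in the next paragraph explicit. A minimal sketch using the list returned by summary():

summary(m1)$r.squared    # 0.9207 for Y = boiling
summary(m2)$r.squared    # 0.8852 for newy = 100*log10(boiling)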

From the above summary(m2) output, we know that:

(a): Parameter Estimates: -48.86 and 0.93, so regression line: 100*log10(Y) = -48.86 + 0.93*X;
(b): The ANOVA (Analysis of Variance) table is obtained above using the anova(m2) code;
(c): R-square = 0.8852;
(d): The scatterplot with the fitted line is plotted above.

Solution (one paragraph of text that is interpretation of analysis):

Did you notice that the scatter plot of these data reveals two outlying values? Their inclusion may or may not be appropriate. If all n=31 data points are included in the analysis, then the model that explains more of the variability in boiling point is Y=boiling point modeled linearly in X=temp. It has a greater R² (92.07% vs. 88.52%). Be careful - it would not make sense to compare the residual mean squares of the two models because the scales of measurement involved are different.
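The two outlying values can be flagged numerically as well as visually. A minimal sketch using standardized residuals; the cutoff of 2 is just a common rule of thumb, and flag and m1.trim are illustrative names, not part of the original solution:

flag <- which(abs(rstandard(m1)) > 2)    # rows with |standardized residual| > 2
week0[flag, ]                            # inspect the flagged observations
m1.trim <- lm(y ~ x, subset=-flag)       # hypothetical refit excluding them
summary(m1.trim)$r.squared               # see how R-squared changes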

2. A psychiatrist wants to know whether the level of pathology (Y) in psychotic patients 6 months after treatment could be predicted with reasonable accuracy from knowledge of pretreatment symptom ratings of thinking disturbance (X1) and hostile suspiciousness (X2).

(a) The least squares estimation equation involving both independent variables is given by

Y = -0.68 + 3.639(X1) - 7.47(X2)

Using this equation, determine the predicted level of pathology (Y) for a patient with pretreatment scores of 2.80 on thinking disturbance and 7.0 on hostile suspiciousness. How does the predicted value obtained compare with the actual value of 25 observed for this patient?

Ŷ = -0.68 + 3.639 X1 - 7.47 X2 with X1 = 2.80 and X2 = 7.0
Ŷ = 5.53

This value is lower than the observed value of 25.

(b) Using the analysis of variance tables below, carry out the overall regression F tests for models containing both X1 and X2, X1 alone, and X2 alone.

Source              DF   Sum of Squares
Regression on X1     1            1,546
Residual            51           12,246

Source              DF   Sum of Squares
Regression on X2     1              160
Residual            51           13,632

Source                 DF   Sum of Squares
Regression on X1, X2    2            2,784
Residual               50           11,008

Model Containing X1 and X2

F = [SSQ regression on X1 and X2 / Regression df] / [SSQ residual / Residual df]
  = [2,784 / 2] / [11,008 / 50]
  = 6.32 on DF = 2,50     p-value = 0.00356

Application of the null hypothesis model has led to an extremely unlikely result (p-value = .00356), prompting statistical rejection of the null hypothesis. The fitted linear model in X1 and X2 explains statistically significantly more of the variability in level of pathology (Y) than is explained by Ȳ (the intercept model) alone.

Model Containing X1 ALONE

F = [SSQ Regression on X1 / DF Regression] / [SSQ residual / DF Residual]
  = [1,546 / 1] / [12,246 / 51]
  = 6.4385 on DF = 1,51     p-value = 0.0147

Here, too, application of the null hypothesis model has led to an extremely unlikely result (p-value = .0147), prompting statistical rejection of the null hypothesis. The fitted linear model in X1 explains statistically significantly more of the variability in level of pathology (Y) than is explained by Ȳ (the intercept model) alone.

Model Containing X2 ALONE

F = [SSQ Regression on X2 / DF Regression] / [SSQ Residual / DF Residual]
  = [160 / 1] / [13,632 / 51]
  = 0.5986 on DF = 1,51     p-value = 0.4468

Here, application of the null hypothesis model has not led to an extremely unlikely result (p-value = .44). The null hypothesis is therefore not rejected. The fitted linear model in X2 does not explain statistically significantly more of the variability in level of pathology (Y) than is explained by Ȳ (the intercept model) alone.

(c) Based on your results in part (b), how would you rate the importance of the two variables in predicting Y?

X1 explains a significant proportion of the variability in Y when modelled as a linear predictor. X2 does not. (However, we don't know if a different functional form might have been important.)

(d) What are the R² values for the three regressions referred to in part (b)?

Total SSQ = (Regression SSQ) + (Residual SSQ) is constant. Therefore total SSQ can be calculated from just one ANOVA table:

Total SSQ = 1,546 + 12,246 = 13,792

R² = (Regression SSQ) / (Total SSQ)

R²(X2 only)   = (160)/(13,792)   = 0.0116
R²(X1 only)   = (1,546)/(13,792) = 0.1121
R²(X1 and X2) = (2,784)/(13,792) = 0.2019

(e) What is the best model involving either one or both of the two independent variables?

Eliminate from consideration the model with X2 only. Compare the model with X1 alone versus the model with X1 and X2 using the partial F test.

Partial F = [{(SSQ Regression on X1,X2) - (SSQ Regression on X1)} / 1] / [SSQ Residual for model w/ X1,X2 / Residual DF]
          = [(2,784 - 1,546) / 1] / [11,008 / 50]
          = 5.62 on DF = 1,50     P-value = 0.0216

Addition of X2 to the model containing X1 is statistically significant (p-value = .02). The more appropriate model includes both X1 and X2.
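These hand calculations are easy to check in R. A minimal sketch, assuming nothing but the sums of squares from the tables above (pf() is the F distribution function in base R; the object names are just illustrative):

ss.x1 <- 1546; ss.full <- 2784; res.x1 <- 12246; res.full <- 11008

F.x1 <- (ss.x1/1)/(res.x1/51)
1 - pf(F.x1, 1, 51)                       # F about 6.44, p about 0.015

F.full <- (ss.full/2)/(res.full/50)
1 - pf(F.full, 2, 50)                     # F about 6.32, p about 0.004

F.part <- ((ss.full - ss.x1)/1)/(res.full/50)
1 - pf(F.part, 1, 50)                     # F about 5.62, p about 0.02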

3. In an experiment to describe the toxic action of a certain chemical on silkworm larvae, the relationship of log10(dose) and log10(larva weight) to log10(survival) was sought. The data, obtained by feeding each larva a precisely measured dose of the chemical in an aqueous solution and then recording the survival time (i.e., time until death), are given in the table. Also given are relevant computer results and the analysis of variance tables.

Larva                        1      2      3      4      5      6      7      8
Y = log10(survival time)  2.836  2.966  2.687  2.679  2.87   2.44   2.4    2.60
X1 = log10(dose)          0.50   0.4    0.487  0.509  0.570  0.593  0.640  0.78
X2 = log10(weight)        0.45   0.439  0.30   0.35   0.37   0.093  0.40   0.406

Larva                        9     10     11     12     13     14     15
Y = log10(survival time)  2.556  2.44   2.40   2.439  2.385  2.45   2.35
X1 = log10(dose)          0.739  0.83   0.865  0.904  0.94   1.090  1.94
X2 = log10(weight)        0.364  0.56   0.47   0.78   0.4    0.89   0.93

Ŷ = 2.95 - 0.550(X1)
Ŷ = 2.187 + 1.370(X2)
Ŷ = 2.593 - 0.381(X1) + 0.871(X2)

Source              DF   Sum of Squares
Regression on X1     1           0.3633
Residual            13           0.1480

Source              DF   Sum of Squares
Regression on X2     1           0.3367
Residual            13           0.1746

Source                 DF   Sum of Squares
Regression on X1, X2    2           0.4642
Residual               12           0.0471

(a) Test for the significance of the overall regression involving both independent variables X1 and X2.

F = [(SSQ regression on X1 and X2) / 2] / [(SSQ Residual) / 12]
  = [0.4642 / 2] / [0.0471 / 12]
  = 59.18 on DF = 2,12
P-value < 0.0001

Application of the null hypothesis model has led to an extremely unlikely result (p-value < .0001), prompting statistical rejection of the null hypothesis. The fitted linear model in X1 and X2 explains statistically significantly more of the variability in log10(survival time) (Y) than is explained by Ȳ (the intercept model) alone.

(b) Test to see whether using X1 alone significantly helps in predicting survival time.

F = [(SSQ Regression on X1) / 1] / [(SSQ Residual) / 13]
  = [0.3633 / 1] / [0.1480 / 13]
  = 31.9 on DF = 1,13
P-value = 0.00008

Application of the null hypothesis model has led to an extremely unlikely result (p-value = .00008), prompting statistical rejection of the null hypothesis. The fitted linear model in X1 explains statistically significantly more of the variability in log10(survival time) (Y) than is explained by Ȳ (the intercept model) alone.

(c) Test to see whether using X2 alone significantly helps in predicting survival time.

F = [(SSQ Regression on X2) / 1] / [(SSQ Residual) / 13]
  = [0.3367 / 1] / [0.1746 / 13]
  = 25.07 on DF = 1,13
P-value = 0.00027

Application of the null hypothesis model has led to an extremely unlikely result (p-value = .00027), prompting statistical rejection of the null hypothesis. The fitted linear model in X2 explains statistically significantly more of the variability in log10(survival time) (Y) than is explained by Ȳ (the intercept model) alone.

(d) Compute R² for each of the three models.

Total SSQ = 0.5113

R²(X1 and X2) = 0.4642 / 0.5113 = 0.9079
R²(X1 alone)  = 0.3633 / 0.5113 = 0.7105
R²(X2 alone)  = 0.3367 / 0.5113 = 0.6585

(e) Which independent predictor do you consider to be the best single predictor of survival time?

Using just the criteria of the overall F test and comparison of R², the single predictor model containing X1 is better.

(f) Which model involving one or both of the independent predictors do you prefer and why?

Partial F for comparing the model with X1 alone versus the model with X1 and X2:

Partial F = [(Δ Regression SSQ) / (Δ Regression DF)] / [(Residual SSQ, model w/ X1,X2) / (Residual DF, model w/ X1,X2)]
          = [(0.4642 - 0.3633) / 1] / [0.0471 / 12]
          = 25.7 on DF = 1,12
P-value = 0.0003

Choose the model with both X1 and X2.
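The same pf() check from problem 2 works here. A minimal sketch using only the SSQs from the tables in problem 3:

F.part <- ((0.4642 - 0.3633)/1)/(0.0471/12)
F.part                    # about 25.7
1 - pf(F.part, 1, 12)     # about 0.0003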

4. Using R, try your hand at reproducing the analysis of variance tables you worked with in problem #3.

load(file="larvae.rdata")
dat <- larvae

# model 1: regression on x1 alone
m1 <- lm(y ~ x1, data=dat)
summary(m1)

Call:
lm(formula = y ~ x1, data = dat)

Residuals:
      Min        1Q    Median        3Q       Max
 -0.18435  -0.05477    0.0058   0.06784    0.1889

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.95210    0.07356  40.132  5.3e-15 ***
x1          -0.54986    0.09734  -5.649 7.94e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1067 on 13 degrees of freedom
Multiple R-squared: 0.7105,  Adjusted R-squared: 0.6883
F-statistic: 31.91 on 1 and 13 DF,  p-value: 7.944e-05

anova(m1)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x1         1 0.36327 0.36327  31.912 7.944e-05 ***
Residuals 13 0.14799 0.01138
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# model 2: regression on x2 alone
m2 <- lm(y ~ x2, data=dat)
summary(m2)

Call:
lm(formula = y ~ x2, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
 -0.2149  -0.1269   0.0470  0.07746   0.1774

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.1847     0.0820  26.644 9.92e-13 ***
x2            1.3756     0.2747   5.007  0.00024 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1159 on 13 degrees of freedom
Multiple R-squared: 0.6585,  Adjusted R-squared: 0.6322
F-statistic: 25.07 on 1 and 13 DF,  p-value: 0.00024

anova(m2)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value  Pr(>F)
x2         1 0.33667 0.33667  25.068 0.00024 ***
Residuals 13 0.17459 0.01343
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# model 3: regression on x1 and x2
m3 <- lm(y ~ x1 + x2, data=dat)
summary(m3)

Call:
lm(formula = y ~ x1 + x2, data = dat)

Residuals:
      Min        1Q    Median        3Q       Max
 -0.07789  -0.049678  -0.0076   0.09783    0.129

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.58900    0.08360  30.966 8.09e-13 ***
x1          -0.37848    0.06637  -5.702 9.88e-05 ***
x2           0.87497    0.17248   5.073 0.000274 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06263 on 12 degrees of freedom
Multiple R-squared: 0.9079,  Adjusted R-squared: 0.8926
F-statistic: 59.18 on 2 and 12 DF,  p-value: 6.085e-07

anova(m3)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x1         1 0.36327 0.36327  92.631 5.409e-07 ***
x2         1 0.10093 0.10093  25.733 0.0002739 ***
Residuals 12 0.04706 0.00392
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
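Finally, note that the partial F test from problem 3(f) does not have to be assembled by hand: R's anova() compares two nested fits directly. A minimal sketch with the m1 and m3 objects above:

anova(m1, m3)    # partial F for adding x2 to the model with x1: about 25.7 on 1 and 12 df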