BIOSTATS 640 Spring 2018 Unit 2. Regression and Correlation (Part of ) R Users

Unit 2 Regression and Correlation - Practice Problems Solutions, R Users

1. In this exercise, you will gain some practice doing a simple linear regression using an R data set called week0.rdata. This data set has n=31 observations of boiling points (Y=boiling) and temperature (X=temp).

Tip - In this exercise you will need to create a new variable newy = 100*log10(boiling)

Carry out an exploratory analysis to determine whether the relationship between temperature and boiling point is better represented using

(i)  Y = β0 + β1X, or
(ii) 100*log10(Y) = β0' + β1'X

In developing your answer, try your hand at producing
(a) Estimates of the regression line parameters
(b) Analysis of variance tables
(c) R-squared
(d) Scatter plot with overlay of fitted line.

Complete your answer with a one paragraph text that is an interpretation of your work. Take your time with this and have fun.

# Input data that you have downloaded and saved to your desktop (edit my desktop)
setwd("/Users/cbigelow/Desktop")
load(file="week0.rdata")

# Look at first 6 rows of data using command head( )
head(week0)
   temp boiling
1 210.8  29.211
2 210.2  28.559
3 208.4  27.972
4 202.5  24.697
5 200.6  23.726
6 200.1  23.369

sol_regression( of ) R USERS.docx  Page 1 of 14
# I will create newvar=oldvar as follows: newvar <- dataframe$oldvar
x <- week0$temp
y <- week0$boiling

m <- lm(y ~ x)
# OR: m <- lm(boiling ~ temp, data=week0)
# OR: m <- lm(week0$boiling ~ week0$temp)

summary(m)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
 -4.806  0.0057    0.48   0.557    3.00

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -65.34300    4.64380  -14.07 1.73e-14 ***
x             0.44379    0.02419   18.35  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.157 on 29 degrees of freedom
Multiple R-squared: 0.9207, Adjusted R-squared: 0.9179
F-statistic: 336.6 on 1 and 29 DF,  p-value: < 2.2e-16

m$coefficients   # output the intercept and slope
(Intercept)           x
 -65.349977    0.443794

# confint(m)   # output the CI for the parameters in the fitted model

anova(m)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)
x          1 450.56  450.56   336.6 < 2.2e-16 ***
Residuals 29  38.82    1.34
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(x, y, main="Simple Linear Regression of Y=boiling on X=temp", xlab="temp", ylab="boiling", pch=8)   # pch means "plot character"
abline(m, col="red", lty=2, lwd=2)   # lty: "line type"; lwd: "line width"
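As a quick consistency check on the output above (plain algebra, using the sums of squares from the anova(m) table), R-squared is the regression sum of squares as a fraction of the total, and the overall F is the ratio of the mean squares:

```latex
R^2 \;=\; \frac{SS_{\text{regression}}}{SS_{\text{regression}} + SS_{\text{residual}}}
\;=\; \frac{450.56}{450.56 + 38.8} \;\approx\; 0.92,
\qquad
F \;=\; \frac{MS_{\text{regression}}}{MS_{\text{residual}}}
```

which matches the Multiple R-squared (≈ 0.92) and F-statistic reported by summary(m).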
From the above summary(m) output, we know that:
(a) Parameter estimates: -65.34 and 0.44, so the fitted regression line is Y = -65.34 + 0.44*X;
(b) The ANOVA (Analysis of Variance) table is obtained above using the anova(m) code;
(c) R-squared = 0.9207;
(d) The scatterplot with overlay of the fitted line is shown above.

(2). Now, instead of Y=boiling, we want to use newy = 100*log10(boiling)

newy <- 100*log10(y)
m <- lm(newy ~ x)
summary(m)

Call:
lm(formula = newy ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
 -4.548   0.386  0.6079    0.99   3.459

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -48.85885   11.88807   -4.11 0.000297 ***
x             0.92655    0.06194   14.96 3.63e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.96 on 29 degrees of freedom
Multiple R-squared: 0.8852, Adjusted R-squared: 0.8813
F-statistic: 223.7 on 1 and 29 DF,  p-value: 3.63e-15

m$coefficients   # output the intercept and slope
(Intercept)           x
 -48.858854    0.926547

# confint(m)   # output the CI for the parameters in the fitted model

anova(m)
Analysis of Variance Table

Response: newy
          Df Sum Sq Mean Sq F value   Pr(>F)
x          1 1962.1  1962.1  223.69 3.63e-15 ***
Residuals 29  254.4    8.77
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(x, newy, main="Simple Linear Regression of Y=100*log10(boiling) on X=temp", xlab="temp", ylab="100*log10(boiling)", pch=8)   # pch means "plot character"
abline(m, col="red", lwd=2)   # lwd means "line width"
From the above summary(m) output, we know that:
(a) Parameter estimates: -48.86 and 0.93, so the fitted regression line is 100*log10(Y) = -48.86 + 0.93*X;
(b) The ANOVA (Analysis of Variance) table is obtained above using the anova(m) code;
(c) R-squared = 0.8852;
(d) The scatterplot with overlay of the fitted line is shown above.

Solution (one paragraph of text that is an interpretation of the analysis):

Did you notice that the scatter plot of these data reveals two outlying values? Their inclusion may or may not be appropriate. If all n=31 data points are included in the analysis, then the model that explains more of the variability in boiling point is Y=boiling point modeled linearly in X=temp, since it has the greater R-squared (92.07% vs. 88.52%). Be careful - it would not make sense to compare the residual mean squares of the two models, because the scales of measurement involved are different.
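A handy consistency check for the simple linear regressions above: with a single predictor, the overall F statistic is the square of the slope's t statistic,

```latex
F \;=\; t^2 \qquad\text{e.g.}\qquad (18.35)^2 \;=\; 336.7 \;\approx\; 336.6
```

matching (up to rounding) the F-statistic reported in the first summary(m) output.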
2. A psychiatrist wants to know whether the level of pathology (Y) in psychotic patients 6 months after treatment could be predicted with reasonable accuracy from knowledge of pretreatment symptom ratings of thinking disturbance (X1) and hostile suspiciousness (X2).

(a) The least squares estimation equation involving both independent variables is given by

Y = -0.68 + 3.639(X1) - 7.47(X2)

Using this equation, determine the predicted level of pathology (Y) for a patient with pretreatment scores of .80 on thinking disturbance and 7.0 on hostile suspiciousness. How does the predicted value obtained compare with the actual value of 5 observed for this patient?

Ŷ = -0.68 + 3.639 X1 - 7.47 X2, with X1 = .80 and X2 = 7.0:
Ŷ = 5.53

This value is lower than the observed value of 5.

(b) Using the analysis of variance tables below, carry out the overall regression F tests for models containing both X1 and X2, X1 alone, and X2 alone.

Source             DF   Sum of Squares
Regression on X1    1       1,546
Residual           51      12,246

Source             DF   Sum of Squares
Regression on X2    1         160
Residual           51      13,632
Source                DF   Sum of Squares
Regression on X1, X2   2       2,784
Residual              50      11,008

Model Containing X1 and X2

F = [SSQ regression on X1 and X2 / Regression df] / [SSQ residual / Residual df]
  = [2,784 / 2] / [11,008 / 50] = 6.32 on DF = 2,50
p-value = 0.00356

→ Application of the null hypothesis model has led to an extremely unlikely result (p-value = .00356), prompting statistical rejection of the null hypothesis. The fitted linear model in X1 and X2 explains statistically significantly more of the variability in level of pathology (Y) than is explained by Ȳ (the intercept model) alone.

Model Containing X1 ALONE

F = [SSQ Regression on X1 / Regression df] / [SSQ residual / Residual df]
  = [1,546 / 1] / [12,246 / 51] = 6.4385 on DF = 1,51
p-value = 0.0147

Here, too, application of the null hypothesis model has led to an unlikely result (p-value = .0147), prompting statistical rejection of the null hypothesis. The fitted linear model in X1 explains statistically significantly more of the variability in level of pathology (Y) than is explained by Ȳ (the intercept model) alone.

Model Containing X2 ALONE

F = [SSQ Regression on X2 / Regression df] / [SSQ Residual / Residual df]
  = [160 / 1] / [13,632 / 51] = 0.5986 on DF = 1,51
p-value = 0.4468

Here, application of the null hypothesis model has not led to an extremely unlikely result (p-value = .44). The null hypothesis is therefore not rejected. The fitted linear model in X2 does not explain statistically significantly more of the variability in level of pathology (Y) than is explained by Ȳ (the intercept model) alone.

(c) Based on your results in part (b), how would you rate the importance of the two variables in predicting Y?

X1 explains a significant proportion of the variability in Y when modelled as a linear predictor. X2 does not. (However, we don't know if a different functional form might have been important.)
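All three overall F tests in part (b) are instances of the same formula. With k predictors and residual degrees of freedom n − k − 1 (here the residual df are 51 for the one-predictor models and 50 for the two-predictor model),

```latex
F \;=\; \frac{SS_{\text{regression}}/k}{SS_{\text{residual}}/(n-k-1)}
\;\sim\; F_{k,\;n-k-1}
\quad\text{under } H_0:\ \beta_1 = \cdots = \beta_k = 0 .
```

Rejecting H0 says the fitted model explains more variability in Y than the intercept-only model.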
(d) What are the R-squared values for the three regressions referred to in part (b)?

Total SSQ = (Regression SSQ) + (Residual SSQ) is constant. Therefore the total SSQ can be calculated from just one ANOVA table:

Total SSQ = 1,546 + 12,246 = 13,792

R-squared = (Regression SSQ) / (Total SSQ):

R-squared (X2 only)   =   160 / 13,792 = 0.0116
R-squared (X1 only)   = 1,546 / 13,792 = 0.1121
R-squared (X1 and X2) = 2,784 / 13,792 = 0.2019

(e) What is the best model involving either one or both of the two independent variables?

Eliminate from consideration the model with X2 only. Compare the model with X1 alone versus the model with X1 and X2, using the partial F test.

Partial F = [(SSQ Regression on X1,X2) - (SSQ Regression on X1)] / Δ Regression DF
            ÷ [SSQ Residual for model w X1,X2 / Residual DF]
          = [(2,784 - 1,546) / 1] / [11,008 / 50]
          = 5.62 on DF = 1,50
P-value = 0.0216

Addition of X2 to the model containing X1 is statistically significant (p-value = .02). The more appropriate model includes both X1 and X2.
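The partial F statistic used in part (e) has the general nested-models form; under the null hypothesis of no improvement it is distributed F with (Δdf, residual df of the full model) degrees of freedom:

```latex
F_{\text{partial}} \;=\;
\frac{\bigl[\,SS_{\text{reg}}(\text{full}) - SS_{\text{reg}}(\text{reduced})\,\bigr] \,/\, \Delta df}
     {SS_{\text{residual}}(\text{full}) \,/\, df_{\text{residual}}(\text{full})}
```

Here Δdf = 1, since X2 is the single predictor added to the reduced model.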
3. In an experiment to describe the toxic action of a certain chemical on silkworm larvae, the relationship of log10(dose) and log10(larva weight) to log10(survival) was sought. The data, obtained by feeding each larva a precisely measured dose of the chemical in an aqueous solution and then recording the survival time (i.e., time until death), are given in the table. Also given are relevant computer results and the analysis of variance tables.

Larva                       1      2      3      4      5      6      7      8
Y = log10(survival time)   .836   .966   .687   .679   .87    .44    .4     .60
X1 = log10(dose)           0.50   0.4    0.487  0.509  0.570  0.593  0.640  0.78
X2 = log10(weight)         0.45   0.439  0.30   0.35   0.37   0.093  0.40   0.406

Larva                       9      10     11     12     13     14     15
Y = log10(survival time)   .556   .44    .40    .439   .385   .45    .35
X1 = log10(dose)           0.739  0.83   0.865  0.904  0.94   .090   .94
X2 = log10(weight)         0.364  0.56   0.47   0.78   0.4    0.89   0.93

Fitted models:

Ŷ = 2.95 - 0.550(X1)
Ŷ = 1.87 + 1.37(X2)
Ŷ = 2.59 - 0.38(X1) + 0.87(X2)

Source                DF   Sum of Squares
Regression on X1       1      0.3633
Residual              13      0.1480

Source                DF   Sum of Squares
Regression on X2       1      0.3367
Residual              13      0.1746

Source                DF   Sum of Squares
Regression on X1, X2   2      0.4642
Residual              12      0.0471
(a) Test for the significance of the overall regression involving both independent variables X1 and X2.

F = [SSQ regression on X1 and X2 / 2] / [SSQ Residual / 12]
  = [0.4642 / 2] / [0.0471 / 12] = 59.18 on DF = 2,12
P-value < 0.0001

Application of the null hypothesis model has led to an extremely unlikely result (p-value < .0001), prompting statistical rejection of the null hypothesis. The fitted linear model in X1 and X2 explains statistically significantly more of the variability in log10(survival time) (Y) than is explained by Ȳ (the intercept model) alone.

(b) Test to see whether using X1 alone significantly helps in predicting survival time.

F = [SSQ Regression on X1 / 1] / [SSQ Residual / 13]
  = [0.3633 / 1] / [0.1480 / 13] = 31.91 on DF = 1,13
P-value = 0.00008

Application of the null hypothesis model has led to an extremely unlikely result (p-value = .00008), prompting statistical rejection of the null hypothesis. The fitted linear model in X1 explains statistically significantly more of the variability in log10(survival time) (Y) than is explained by Ȳ (the intercept model) alone.

(c) Test to see whether using X2 alone significantly helps in predicting survival time.

F = [SSQ Regression on X2 / 1] / [SSQ Residual / 13]
  = [0.3367 / 1] / [0.1746 / 13] = 25.07 on DF = 1,13
P-value = 0.0002

Application of the null hypothesis model has led to an extremely unlikely result (p-value = .0002), prompting statistical rejection of the null hypothesis. The fitted linear model in X2 explains statistically significantly more of the variability in log10(survival time) (Y) than is explained by Ȳ (the intercept model) alone.
(d) Compute R-squared for each of the three models.

Total SSQ = 0.5113

R-squared (X1 and X2) = 0.4642 / 0.5113 = 0.9079
R-squared (X1 alone)  = 0.3633 / 0.5113 = 0.7105
R-squared (X2 alone)  = 0.3367 / 0.5113 = 0.6585

(e) Which independent predictor do you consider to be the best single predictor of survival time?

Using just the criteria of the overall F test and comparison of R-squared, the single predictor model containing X1 is better.

(f) Which model involving one or both of the independent predictors do you prefer and why?

Partial F for comparing the model with X1 alone versus the model with X1 and X2:

Partial F = (Δ Regression SSQ / Δ Regression DF) ÷ (Residual SSQ, model w X1,X2 / Residual DF, model w X1,X2)
          = [(0.4642 - 0.3633) / 1] / [0.0471 / 12]
          = 25.7 on DF = 1,12
P-value = 0.0003

Choose the model with both X1 and X2.
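Worth noting, as a cross-check you can make against the R output in problem 4: when a single predictor is added to a model, the partial F equals the square of that predictor's t statistic in the full model,

```latex
F_{\text{partial}} \;=\; t_{X_2}^{\,2} \;=\; (5.073)^2 \;\approx\; 25.7 \;\sim\; F_{1,12}
```

which agrees with the partial F computed in part (f).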
4. Using R, try your hand at reproducing the analysis of variance tables you worked with in problem #3.

load(file="larvae.rdata")
dat <- larvae

# model 1: regression on x1 alone
m1 <- lm(y ~ x1, data=dat)
summary(m1)

Call:
lm(formula = y ~ x1, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
 -0.8435 -0.05477   0.0058  0.06784    0.889

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.9520    0.07356  40.136  5.3e-15 ***
x1          -0.54986    0.09734  -5.649 7.94e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1067 on 13 degrees of freedom
Multiple R-squared: 0.7105, Adjusted R-squared: 0.6883
F-statistic: 31.91 on 1 and 13 DF,  p-value: 7.944e-05

anova(m1)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x1         1 0.36327 0.36327  31.911 7.944e-05 ***
Residuals 13 0.14799 0.01138
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# model 2: regression on x2 alone
m2 <- lm(y ~ x2, data=dat)
summary(m2)

Call:
lm(formula = y ~ x2, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max
  -0.49   -0.69  0.0470 0.07746   0.774

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.847      0.080   6.644  9.9e-13 ***
x2            1.3756     0.2747   5.007 0.000214 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1159 on 13 degrees of freedom
Multiple R-squared: 0.6585, Adjusted R-squared: 0.6322
F-statistic: 25.07 on 1 and 13 DF,  p-value: 0.000214

anova(m2)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value   Pr(>F)
x2         1 0.33667 0.33667  25.068 0.000214 ***
Residuals 13 0.17459 0.01343
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# model 3: regression on x1 and x2
m3 <- lm(y ~ x1 + x2, data=dat)
summary(m3)

Call:
lm(formula = y ~ x1 + x2, data = dat)

Residuals:
      Min        1Q    Median        3Q       Max
 -0.07789 -0.049678   -0.0076   0.09783       0.9

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.58900    0.08360  30.966 8.09e-13 ***
x1          -0.37848    0.06637  -5.702 9.88e-05 ***
x2           0.87497    0.17248   5.073 0.000274 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06263 on 12 degrees of freedom
Multiple R-squared: 0.9079, Adjusted R-squared: 0.8926
F-statistic: 59.18 on 2 and 12 DF,  p-value: 6.085e-07

anova(m3)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x1         1 0.36327 0.36327  92.613 5.409e-07 ***
x2         1 0.10093 0.10093  25.733  0.000274 ***
Residuals 12 0.04706 0.00392
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
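As a final arithmetic cross-check of the problem 3 full-model results (Python is used here purely as a calculator; the rounded sums of squares 0.464 and 0.047 are taken from the ANOVA tables in problem 3):

```python
# Overall F and R-squared for the model with x1 and x2 (k = 2 predictors, n = 15 larvae)
ss_reg, df_reg = 0.464, 2     # regression sum of squares and its df
ss_res, df_res = 0.047, 12    # residual sum of squares and its df = n - k - 1

f_stat = (ss_reg / df_reg) / (ss_res / df_res)  # overall F statistic
r_squared = ss_reg / (ss_reg + ss_res)          # proportion of variability explained

print(round(f_stat, 1), round(r_squared, 3))    # roughly 59.2 and 0.908
```

Both agree, to rounding, with the F-statistic and Multiple R-squared in the summary(m3) output above.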