BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression


Introduction to Correlation and Regression

The procedures discussed in the previous ANOVA labs are most useful in cases where we are interested in testing hypotheses about differences in the locations of several populations in terms of one or more "factors" which represent discrete levels. However, we may also be interested in examining the relationship between random variables Y and X, where both X and Y are continuous measures on each subject sampled from a single population. When there are just two such variables, the relationship is known as a bivariate relationship. If there is no implied relationship in which, say, variable Y "depends" on variable X, then we are just asking whether the two variables Y and X are associated, and we calculate correlation coefficients to determine the strength of this association. However, if the two variables are related in such a way that the value of one variable, X, is useful in predicting the value of the other variable, Y, then we can explore the "regression" of variable Y on variable X by fitting a linear model. Correlation and linear regression analyses are useful in evaluating the association between variables and expressing the nature of their relationship.

Correlation

Correlation measures the strength of the association between two variables, say Y and X. Correlation is related to regression, but correlation analyses make different assumptions about the data. First, in correlation there is no independent or dependent variable, so one is not predicting Y from X as in regression. The Pearson product-moment linear correlation coefficient assumes the data are independent and bivariate normal - that is, that the joint probability distribution of (Y, X) is bivariate normal. The formula for Pearson's r is:

$$ r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\left[ \sum_{i=1}^{n} (y_i - \bar{y})^2 \, \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^{1/2}} $$

Pearson's r takes on values between -1 and +1. Values near 0 indicate no correlation (no association); values near +1 indicate a strong positive association in which Y and X increase together; and values near -1 indicate a strong negative association in which the value of Y goes down as the value of X goes up, or vice versa. Significance tests are available to establish that the estimated correlation is unlikely to have arisen by chance at some α level.
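As a concrete check on the formula above, Pearson's r can be computed directly from the deviations and compared with R's built-in cor function. This is a minimal sketch using made-up vectors, not the lab data:

# Hypothetical example vectors (not the kenyabees data)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 1.8, 3.4, 3.0, 4.9, 5.2)

# Numerator: sum of cross-deviations; denominator: square root of the
# product of the two sums of squared deviations
num <- sum((y - mean(y)) * (x - mean(x)))
den <- sqrt(sum((y - mean(y))^2) * sum((x - mean(x))^2))
num / den

# Should agree with the built-in calculation
cor(x, y, method = "pearson")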

There are also non-parametric correlation coefficients. The most commonly used is Spearman's rho, ρ. To calculate Spearman's ρ, first rank the Y and X values separately, and then calculate the difference in the Y and X ranks for each subject (d_i = R_y - R_x). Spearman's ρ is then:

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

In R, one can apply the functions cor or cor.test to calculate Pearson's, Spearman's, or Kendall's correlation coefficient. The function cor does not provide a significance test, but cor.test does. The necessary arguments are cor.test(x, y, method="pearson"); alternatively, insert "spearman" or "kendall" for the method argument.

For example, using the data file kenyabees.csv we can perform a correlation analysis computing Pearson's r or Spearman's ρ. The data represent the coefficient of variation in bee abundance (CVN) at pan traps on farms spread across several regions in Kenya. The coefficient of variation is calculated as CV = (s / x̄) × 100 and is a measure of variability that attempts to remove the fact that data with high means tend to have higher variances, in order to compare variability among samples with different means. The variable CTYPE is the number of different crop species planted at each farm. Zero indicates that the farm had no planted crops, only pasture.
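To make the CV formula concrete, here is a small sketch computing a coefficient of variation for a hypothetical vector of counts (the values are made up, not taken from kenyabees.csv):

# Hypothetical bee counts from repeated pan-trap samples at one farm
counts <- c(12, 7, 22, 15, 9, 18)

# CV = (s / xbar) * 100: the standard deviation scaled by the mean,
# expressed as a percentage
(sd(counts) / mean(counts)) * 100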

Read in the data and obtain a scatter plot.

dat = read.csv("k:/biometry/biometry-fall-2015/lab9/kenyabees.csv", header=TRUE)
head(dat)
plot(dat$ctype, dat$cvn, xlab="Number of Crop Species", ylab="CV Bees",
     lwd=1, cex.lab=1.25, cex.axis=1.25, cex=1.5)

There is a trend toward lower values of the CV in bee abundance on farms with more crop species. However, it is difficult to tell how strong the trend is. Now calculate Pearson's r using cor.test.

cor.test(dat$ctype, dat$cvn, na.rm=TRUE, method="pearson")

Pearson's product-moment correlation

data: dat$ctype and dat$cvn
t = , df = 93, p-value =
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
sample estimates:
cor

Now calculate Spearman's ρ using cor.test.

cor.test(dat$ctype, dat$cvn, method="spearman")

Warning in cor.test.default(dat$ctype, dat$cvn, method = "spearman"):
Cannot compute exact p-value with ties

Spearman's rank correlation rho

data: dat$ctype and dat$cvn
S = , p-value =
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho

Neither correlation coefficient is significant at α = 0.05, but the size of the Pearson's r is suggestive of a trend.
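Since cor.test returns a list, the individual quantities printed above can also be extracted directly; a minimal sketch (assuming dat has been read in as above):

# Store the test object rather than just printing it
ct <- cor.test(dat$ctype, dat$cvn, method="pearson")

ct$estimate   # the correlation coefficient r
ct$statistic  # the t statistic
ct$p.value    # the p-value
ct$conf.int   # the 95 percent confidence interval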

If one just wanted the value of the correlation coefficient, the cor function could be used to calculate Pearson's, Spearman's, or Kendall's correlation coefficient. Note, however, that cor does not automatically discard cases with incomplete or missing data as cor.test does. One must specify the "use" argument to be "complete.obs".

cor(dat$ctype, dat$cvn, use="complete.obs", method="pearson")

Regression

In regression, Y is termed the dependent variable and X the independent variable, and one builds a model using X to predict values of Y. For example, consider the relationship between crop yield and precipitation. Yield (Y) is a function of precipitation (X), since we hypothesize that water availability affects plant growth, but on average plant growth does not affect precipitation. Using linear regression, we can quantify the observed relationship between the two variables. We might ask if there is a significant regression of yield on precipitation, indicating that yield can be predicted from knowledge of precipitation, or if no regression exists and yield cannot be predicted from knowledge of precipitation.

In bivariate regression, we assume that the relationship between the variables can be described by a straight line. The line relating two variables X and Y is described by the equation:

$$ Y = b_0 + b_1 X + \varepsilon $$

where b_0 is called the intercept, corresponding to the point where X = 0 and the line intercepts the Y axis, and b_1 is the slope, the change in Y per unit change in X. The independent variable, X, is used to predict Y, the dependent variable. b_0 and b_1 are the regression coefficients, the parameters of the line which fix and define the linear relationship between Y and X. ε is the "error", the component of the variation in the values of Y that cannot be predicted by the regression of Y on X. ε arises both because the fit of the regression model to the data may be inadequate and because there is inherent variability in the values of Y observed at each value of X. The differences (errors) between the actual values of Y and the values predicted by the regression equation - a line fitting the data - are called residuals. The parameters are estimated by finding the "best fit" regression line, the line that minimizes the sum of squares of the deviations of the observed values from the predicted values. This method is called the method of Ordinary Least Squares (OLS).
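The OLS estimates have a simple closed form: the slope is the sum of cross-deviations divided by the sum of squared X deviations, and the intercept then follows from the means. A minimal sketch with made-up vectors (not the lab data), checked against lm:

# Hypothetical data for illustration
x <- c(10, 20, 30, 40, 50)
y <- c(15, 22, 31, 35, 46)

# Closed-form OLS estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)

# Should match the coefficients from R's linear model fit
coef(lm(y ~ x))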

To fit a regression model to the data on the CV of bee abundance from Kenya, we use CV as the dependent variable and CTYPE as the independent variable. This is because it seems possible that the variability in bee abundance might depend on the number of crop species on a farm, but unlikely that the number of crop types on a farm depends on the variability in bee abundance; rather, it is up to the farmer how many crops to plant.

To fit a linear regression in R, we use the lm (linear model) function and then get a summary of the model fit. All we need to specify is a formula for the model, which is of the form y ~ x (y is a function of x).

m1 = lm(dat$cvn ~ dat$ctype)
summary(m1)

Call:
lm(formula = dat$cvn ~ dat$ctype)

Residuals:
Min 1Q Median 3Q Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              <2e-16 ***
dat$ctype

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 93 degrees of freedom
(9 observations deleted due to missingness)
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 93 DF, p-value:

Using the str(m1) function in R, you can see all the information contained in the model object m1. Several items can be extracted using the extractor symbol $, for example $coefficients, $residuals, and $fitted.values. The summary of the model object includes a summary of the values of the residuals. The regression coefficients and their standard errors are given in the "Estimate" and "Std. Error" columns, respectively. The row labeled "(Intercept)" is the y-intercept (b_0) in the model, and the row labeled dat$ctype is the regression coefficient for CTYPE (b_1). On the same rows there are t-tests of the null hypothesis that the respective coefficient equals 0. At the bottom of the summary you will see the residual standard error (the square root of the mean square residual), an F-test for the significance of the overall model, and an estimate of R², the proportion of variation in the Y variable that is accounted for by the X variable.
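Because the printed summary is only a display, the quantities it reports can also be pulled out of the model and summary objects directly; a minimal sketch (assuming m1 was fit as above):

# Individual pieces of the fitted model object
m1$coefficients         # b0 and b1
head(m1$residuals)      # observed minus fitted values
head(m1$fitted.values)  # predicted CV for each farm used in the fit

# The summary object can be queried as well
summary(m1)$r.squared   # proportion of variation in Y explained by X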

We can plot the fitted regression line on the scatter plot of points using the function abline in R. We see the downward trend in CV, but there is substantial scatter of the data about the fitted line.

plot(dat$ctype, dat$cvn, xlab="Number of Crop Species", ylab="CV Bees",
     lwd=1, cex.lab=1.25, cex.axis=1.25, cex=1.5)
abline(lm(dat$cvn ~ dat$ctype), lwd=2)

Assumptions of Linear Regression

In using linear regression, a number of assumptions must be made; these are discussed in detail in lecture. In summary, it is assumed that:

1. each value of X is measured without error;
2. the set of observed (X, Y) values consists of n independent measures;
3. ε is a normally distributed error with a mean of zero and some standard deviation which is constant for all values of X (homogeneity of variances);
4. Y is a linear function of X.

If we make these assumptions, fitting the regression model is very simple; however, to use the regression model to predict or estimate values of Y, we must test the assumptions we have made.

The residuals are the differences between the observed and predicted values of Y. These deviations can be used as a tool to see if the necessary assumptions for regression have been met, and to further investigate the adequacy or goodness of fit of the regression model. If the assumptions have been met, the residuals (errors) should be independent and, for each value of X, the set of possible residuals should be approximately normally distributed with a mean of zero and a variance σ_e² that is not a function of X.

If the residuals do not have these characteristics, then some of the assumptions made in fitting the model must be incorrect and the results of the regression cannot be deemed valid. Therefore, it is not reasonable to accept a regression without examining whether the assumptions are met.

There are many methods that can be used to evaluate the residuals from a regression equation. Some of these methods are objective hypothesis tests; the alternative is to graph the residuals versus the values of X and evaluate them subjectively. The properties to be evaluated are: 1) independence, 2) normality, and 3) constant variance.

The property of independence can be tested in a variety of ways. Basically, though, we can assume that if the errors are independent, they will not tend to have any pattern; they will be random. If they are not random, this fact will be evident because there will be a few long series of either positive or negative values, instead of numerous shorter series. A hypothesis test known as a "runs" test can be used to evaluate the randomness of the residuals (a sketch follows below); alternatively, a subjective evaluation can be performed. Other objective tests are also available.

Normality is another desirable property of the residuals. What is required is that for any particular value of X the set of possible errors should be normally distributed. Unless a large number of measurements of Y have been obtained for each of the values of X, there is no reasonable way of testing this assumption. Therefore, we examine the overall distribution of the residuals for normality using graphical and/or statistical approaches.

The requirement of a constant variance is very important in evaluating regression models. However, unless the number of residuals is relatively large, a subjective visual evaluation is usually all that can be performed. Visually, residuals that suggest the variances are constant look like a rectangular scatter that is evenly spread about the regression line all along its length. However, when sample sizes are small it is often difficult to make an accurate judgment about constancy of variances based on a visual inspection of the residuals.
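The runs test mentioned above counts unbroken stretches of same-sign residuals; too few runs suggests the residuals are not independent. This is a minimal hand-rolled sketch of the run count only, not the full hypothesis test, and it assumes the model object m1 fitted earlier (a formal version is available, for example, as runs.test in the tseries package):

# Signs of the residuals, ordered by the fitted values
s <- sign(m1$residuals[order(m1$fitted.values)])
s <- s[s != 0]  # drop any exactly-zero residuals

# A new run starts wherever the sign changes
n_runs <- 1 + sum(s[-1] != s[-length(s)])
n_runs

Very roughly, independent residuals with balanced signs should produce on the order of half as many runs as residuals; far fewer runs would hint at long same-sign stretches and hence non-independence.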

One way in R to get some quick diagnostic information to help determine whether the data meet the assumptions of regression is to plot the model object. This produces 4 diagnostic plots, so if you want to see them simultaneously you need to set the plotting parameter mfrow, which generates two rows of plots, each with two plots.

par(mfrow=c(2,2))
plot(m1)

The upper left plot shows the raw residuals plotted against the predicted y-values from the model. I prefer to plot the standardized residuals against either the values of X or the standardized values of X. In either case, the scatter of the residuals should be similar along the length of the plot. The lower left plot shows the square root of the standardized residuals, with a similar pattern. These plots are useful for determining whether the data meet the homogeneity of variances assumption; an equal range of scatter of the data about the horizontal line would suggest that the assumption is met. The upper right plot is a normal probability plot of the residuals. If the residuals are normally distributed, the points should fall on the dotted line. In this case they deviate substantially at the upper end of the plot.
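Before turning to the QuantPsyc package below, note that base R can address the same normality question; a minimal sketch:

# Shapiro-Wilk test of normality on the residuals (base R)
shapiro.test(m1$residuals)

# Visual checks: histogram and normal Q-Q plot
hist(m1$residuals, main="Residuals", xlab="Residual")
qqnorm(m1$residuals); qqline(m1$residuals)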

Using the norm function in the QuantPsyc package on the residuals shows that the residuals are not normally distributed.

library(QuantPsyc)

Loading required package: boot
Loading required package: MASS

Attaching package: 'QuantPsyc'

The following object is masked from 'package:base':

    norm

norm(m1$residuals)

         Statistic SE t-val    p
Skewness                     e-07
Kurtosis                     e-02

The final plot in the lower right is used to diagnose data points that are outliers or overly influential. The solid red line demarcates where points would have Cook's D values of 1 or more from those with Cook's D values of less than 1. Cook's D is a measure of the influence of a data point on the regression; small values (less than 1) are preferred.
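Cook's D values can also be computed directly rather than read off the diagnostic plot; a minimal sketch using base R's cooks.distance:

# Cook's distance for every observation used in the fit
cd <- cooks.distance(m1)

# Flag any points exceeding the rule-of-thumb cutoff of 1
which(cd > 1)

# Quick index plot of the values against the cutoff
plot(cd, type="h", ylab="Cook's D")
abline(h=1, lty=2)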

To get a plot of the standardized residuals versus the X values, we have to get the X values actually used in the model fit, since some data points were dropped because their CV values were missing (NA). If you use the str(m1) command, you will see that m1 consists of 13 lists. The second item in the 13th list contains those values of the X variable. If you just tried to plot CTYPE versus m1$residuals, an error indicating that the vectors are not the same length would occur, because the cases in CTYPE for which CV was NA are still present.

nctype = m1[[13]][[2]]
stdresid = scale(m1$residuals, center=TRUE, scale=TRUE)
plot(nctype, stdresid)
abline(h=0, lty=2)

This plot also does not have an equal scatter about the fitted line (represented by the horizontal dashed line). However, we cannot tell whether the low scatter at the right end of the plot is due to inherently unequal variances, or to there being few farms that plant a large number of crops in our sample of farms.

Further Instructions for Lab 9

Data files for regression and correlation require that each subject be represented by a line in the data file and each column represent a variable. So, for correlation or bivariate regression, an R data file need only have 2 columns of values. However, if you have more than two variables for a single set of subjects and you want to calculate their correlations, just enter all the variables in separate columns and R can calculate the correlations between the variables in each pair of columns - a correlation matrix. Instead of inserting the X and Y variable names when using the cor function, insert the name of the data frame and all pairwise correlations will be calculated, as sketched below.
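A minimal sketch of the correlation-matrix call, using a hypothetical three-variable data frame (the names and values are illustrative only):

# Hypothetical data frame: three measurements per subject
df <- data.frame(height = c(152, 160, 171, 168, 181),
                 mass   = c(51, 59, 68, 64, 78),
                 age    = c(21, 25, 30, 28, 35))

# All pairwise Pearson correlations at once
cor(df, use="complete.obs", method="pearson")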

LAB 9 Assignment

PART 1: Introduction to Correlation and Regression

The Bermuda Petrel is an oceanic bird spending most of its year on the open sea, returning to land only during the breeding season. Its nesting sites are on a small, uninhabited island of the Bermuda group, where careful hatching records have been kept over several years. The Bermuda Petrel feeds only upon fish caught in the open ocean waters far from land. Unfortunately, DDT is now so widespread, and is so concentrated by the biological amplification system known as the "food chain," that the Bermuda Petrel can no longer lay hard-shelled eggs. Since DDT breaks down so slowly, it would appear that this beautiful bird is doomed to extinction (along with how many others?). The data below represent hatching rates of clutches of eggs over a number of years. Use correlation and linear regression in R to see if there is a significant relationship between the percent of clutches hatching and time. Interpret the output. Also produce a scatter plot of the relationship between hatching rate and year.

Year    % of Clutches Hatching

PART 2: Assumptions of simple linear regression

A) Using the kenyabees.csv data, is it possible to transform the CV data to an alternative scale on which the residuals and the Y variable are normally distributed? For example, what if we log-transformed the CV data? (A sketch of one way to start appears after this list.)

B) Estimate the linear regression model for each of the three sample data sets (reg1, reg3, reg5) using the lm function in R. Use the data in column 1 as the X variate and the data in column 2 as the Y variate in each data file.

C) Write the regression equation for at least two of the data sets.

D) Reiterating from the lab, the null hypothesis to be tested in each instance states that Y is not a linear function of X, and thus X will not be a good predictor of Y. More specifically, under the null hypothesis we are testing that the slope, b_1, will be equal to zero, since this would be indicative of no relationship between the two variables. At the α = 0.05 level, based on the output of the regression alone (F-test), for which of the three data sets would you reject the null hypothesis?

E) Based on the R² values, which model reveals the best fit?

F) To see if the models are adequate, you must check whether the assumptions of regression have been met. Use graphical and/or statistical methods to assess the assumptions of normality, homogeneity of variances, and linearity for each data set. For which data sets is linear regression appropriate, and for which is it clear that a linear regression model should not be imposed on the data? Would some transformation of scale for the Y or X data make these data normal and homoscedastic? Would transformation of X improve linearity?
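For Part 2A, one way to start is to refit the model with log-transformed CV and rerun the same diagnostics used in the lab; a minimal sketch (assumes dat as read in earlier and that all CV values are positive):

# Refit with log-transformed CV as the response
m2 <- lm(log(dat$cvn) ~ dat$ctype)

# Re-examine the residuals on the new scale
par(mfrow=c(2,2))
plot(m2)
shapiro.test(m2$residuals)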
