STAT 458 Lab 4
Linear Regression Analysis

Scatter Plots: When investigating the relationship between two quantitative variables, one of the first steps is usually to construct a scatter plot of the response variable (the y variable, on the y axis) against the explanatory variable (the x variable, on the x axis). Usually the plot lives in the first quadrant of the x-y graph, and the points (x, y) represent (explanatory, response) pairs.

Do birds chirp more or less in cold climates? Let us investigate whether there is any relationship between bird chirping frequency and ambient temperature. I import chirpstemp.csv into R Studio, which shows the following.

As a general trend, we see that chirp frequency goes up as the temperature goes up. A straight-line model might therefore be a fair predictor of the response (chirps) from the ambient temperature. The code used to generate the plot is shown below.

    # scatter plot
    chirpstemp <- read.csv("c:/users/michael O'Lear/Desktop/chirpstemp.csv")
    plot(chirpstemp$temp, chirpstemp$chirps, main="Bird Chirps vs Temp (in deg F)")

Linear Model: If it appears that there might be a linear relationship between the variables, we can set up a model:

    $y = \beta_0 + \beta_1 x + \epsilon$

where y is the value of the response variable, x is the value of the explanatory variable, $\beta_0$ is the linear model intercept, $\beta_1$ is the linear model slope, and $\epsilon$ is the error or residual (either plus or minus): the difference between the actual value of y and the predicted model response (y-hat). We sometimes express the linear model equation (or the line of best fit) in the following way.
    $\hat{y} = \beta_0 + \beta_1 x$,

where $\hat{y}$ (y-hat) stands for the estimate of y from our model. With this notation, the error, called the residual, is $\epsilon = y - \hat{y}$. Always remember to take actual y minus estimated y, since the sign of $\epsilon$ indicates whether the model over-predicts (negative $\epsilon$) or under-predicts (positive $\epsilon$).

Since it looks like a linear model might suit our bird chirp data, we can use R to find the parameter values of that relationship (i.e., $\beta_0$ and $\beta_1$). The lm() command creates the model, along with useful statistics and values, as shown in the code below. Note that the response (chirps) goes on the left of the ~ and the explanatory variable (temp) on the right.

    # linear model
    model1 <- lm(chirps ~ temp, data=chirpstemp)
    model1

Output is shown below. From the output we see that the linear model is

    $\widehat{\text{chirps}} = -131.232 + 3.809 \cdot \text{temp}$ (temp in deg F)

We can predict chirps from ambient temperature, as long as we stay within the range of the temperature data and do not commit the mistake of extrapolation. We will talk more about extrapolation later. For example, for a temperature of 70 deg F, the predicted number of chirps is -131.232 + 3.809 * 70 = 135.4 chirps/min.

Now that we have a model to refer to, we can superimpose a line of best fit onto the original scatter plot using the abline() command. See below. The linear model seems to fit rather well across the range of temperatures.
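The plot, lm(), and abline() steps described above can be sketched end to end as follows. This is only an illustration under stated assumptions: the data frame `demo` holds simulated values standing in for chirpstemp.csv (they are not the lab data), and the names `demo`, `demo_model`, `by_hand`, and `by_predict` are made up for this example.

```r
# Simulated stand-in for chirpstemp.csv (hypothetical values, NOT the lab data)
set.seed(458)
demo <- data.frame(temp = seq(60, 90, by = 2))
demo$chirps <- -130 + 3.8 * demo$temp + rnorm(nrow(demo), sd = 5)

# lm() takes response ~ explanatory; plot() takes x first, then y
demo_model <- lm(chirps ~ temp, data = demo)
plot(demo$temp, demo$chirps, main = "Simulated Chirps vs Temp (in deg F)")
abline(demo_model)  # draws the fitted line using the model's intercept and slope

# Predicted chirps at 70 deg F, computed two ways
b <- coef(demo_model)                  # b[1] = intercept, b[2] = slope
by_hand <- unname(b[1] + b[2] * 70)
by_predict <- unname(predict(demo_model, newdata = data.frame(temp = 70)))
```

The two predictions should agree. Passing `newdata` to predict() relies on the model having been fit with column names and `data=`, as in `lm(chirps ~ temp, data = demo)`, rather than with `demo$chirps ~ demo$temp`.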
A final warning about using the plot command and the lm command close to each other. Note that the argument order for plot is plot(x, y), whereas lm takes a formula, lm(y ~ x). R users sometimes write both commands the same way, which results in bad information.

Using Formulas in lm() and in other commands: Notice that there is a sort of formula notation inside the lm() command. Sometimes we have to write formulas for linear models, as well as for other transformed models. A brief table showing how R recognizes formulas is shown below.

    Equation                                      R formula
    $\hat{y} = \beta_0 + \beta_1 x$               y ~ x
    $\hat{y} = \beta_0 + \beta_1 \ln(x)$          y ~ log(x)
    $\ln(\hat{y}) = \beta_0 + \beta_1 x$          log(y) ~ x
    $\ln(\hat{y}) = \beta_0 + \beta_1 \ln(x)$     log(y) ~ log(x)
    $\hat{y} = \beta_0 + \beta_1 x^2 + \beta_2 x$ y ~ I(x^2) + x
    $\hat{y} = \beta_0 + \beta_1 \sqrt{x}$        y ~ sqrt(x)

We will look at more R formulas later when we talk about multiple regression analyses.

What lm() generates: When you execute lm() and create a model, as we did when we created model1 above, many useful statistics are generated in the background of R. The following is a list of some of the more useful commands.
    Generated               What it does/is                                   Uses
    summary(model1)         Gives useful info such as $\beta_0$, $\beta_1$,
                            $r^2$, F-statistic, and p-values
    resid(model1)           Lists residuals for every point                   Used in residual plot and in other plots
    fitted(model1)          Lists model predictions for every point           Used in residual plot, plot() command, and others
    predict(model1)         Computes predictions of y from x, according
                            to the model
    anova(model1)                                                             Used later in course
    deviance(model1)        Computes RSS                                      Used later in course
    AIC(model1)                                                               Used later in course
    model1, coef(model1)    Returns coefficients of model

Residual plot: No determination of linearity is thorough until a residual plot is presented. This is a scatter plot of the residuals against the predicted values (or sometimes the x values), showing how appropriate a linear model is for the data points. Briefly, if we:

    - see no pattern in the dots,
    - see the points roughly uniformly distributed above and below the x axis, and
    - see no marked change in variance above and below the x axis as we go from left to right on the plot,

we can make the case that the linear fit is an appropriate model for our study. If any of those bulleted conditions is violated, we must proceed with our linear regression cautiously and treat the results with suspicion.

Below is the residual plot of our chirp data, where we used resid(model1) and fitted(model1) as our y and x axis values, respectively.

    # residual plot
    plot(fitted(model1), resid(model1), main="Residuals of Chirps Model")
    abline(h=0)

Notice that we plotted a horizontal line at y=0 to show the x axis of this plot. The residual plot seems to show that a linear fit is appropriate for this data.

Other Information: Finally, now that we believe a linear model fits acceptably, we want to know the degree of fit, that is, the value of r (the correlation coefficient). We can get that from the summary information, which shows an $r^2$ value of 0.9567, giving us
$r = \sqrt{0.9567} = 0.9781$ (positive, because the slope is positive), or we can use the cor() command shown below.

    # other information
    sqrt(.9567)
    cor(chirpstemp$temp, chirpstemp$chirps)

Homework [1]: Import the data file patients.csv. We want to see if there is a linear relationship between height and weight (units unknown) of the patients used in this study and, if so, how strong that relationship is, where height is the explanatory variable. Produce a scatter plot, a residual plot, and model statistics answering these questions, and include a 50-100 word statement of conclusions for this study. Find the estimated weight of a patient who has a height of 51.

Scatter Plot Matrix: Sometimes we have a large number of quantitative variables in a study, and we want to see if there are relationships between any two of them. R has a nice function, called pairs(), which makes multiple scatter plots of pairs of variables. The data set statgrades.csv is a standardized set of grades for a statistics class and includes the midterm, final, homework, and final class grade for each student. The command shown below was used to produce the following graphical display.

    pairs(statgrades)

Notice that all grades have been converted from their raw scores to standardized N(0,1) z-scores, so that differing difficulty levels do not weight or bias the cumulative scores,
but rather each score is expressed relative to the mean of its test. This is why all scores seem to range from about -3 to 3 or less.

Homework [2]: Pick one or two pairs of variables which appear to have a possible linear relationship. Make individual scatter plots, summary statistics, residual plots, etc., and come to a conclusion on the question: Is there a linear relationship and, if so, how strong is it? Pick what you think should be the explanatory and response variables. Include a 50 word justification of your results.

Homework [3]: Import the data set bloodpres.csv. We are interested in any linear relationship between age and systolic pressure. As before, investigate the linearity and strength of the relationship, and whether age can predict systolic pressure.
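As a recap of the diagnostic steps used in this lab (residual plot, then strength of fit), here is a minimal sketch. It runs on simulated data, not on any of the lab or homework files, and the names `demo`, `fit`, `r_squared`, `r`, and `r_from_model` are made up for illustration.

```r
# Simulated data for illustration only (hypothetical, not one of the lab files)
set.seed(1)
x <- runif(40, 50, 80)
demo <- data.frame(x = x, y = 10 + 2 * x + rnorm(40, sd = 6))

fit <- lm(y ~ x, data = demo)

# Residual plot: residuals (y axis) against fitted values (x axis)
plot(fitted(fit), resid(fit), main = "Residuals vs Fitted (simulated)")
abline(h = 0)  # horizontal reference line at zero

# Strength of fit: r-squared from the model summary, r from cor()
r_squared <- summary(fit)$r.squared
r <- cor(demo$x, demo$y)

# With one explanatory variable, r is the square root of r-squared,
# carrying the sign of the fitted slope
r_from_model <- sign(coef(fit)[["x"]]) * sqrt(r_squared)
```

Here `r` and `r_from_model` should match, which is the same relationship used above to get r from the summary output of model1.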