We'd like to know the equation of the line shown (the so-called best fit or regression line).


Willis Neal Smith
5 years ago

Linear Regression in R.

Example. Let's create a data frame.

> exam1 = c(100,90,90,85,80,75,60)
> exam2 = c(95,100,90,80,95,60,40)
> students = c("Asuka", "Rei", "Shinji", "Mari", "Hikari", "Toji", "Kensuke")
> class = data.frame(students,exam1,exam2)

As seen in the previous section, a scatterplot of the data is:

(Figure: scatterplot of exam2 against exam1, with a line drawn through the points.)

We'd like to know the equation of the line shown (the so-called best fit or regression line). The simplest way to do so is to use the lm (linear model) command.

> lm(exam2~exam1, data=class)

produces

Call:
lm(formula = exam2 ~ exam1, data = class)

Coefficients:
(Intercept)        exam1
        ...          ...

exam2 ~ exam1 indicates we want to think of exam 2 as being the dependent variable Y, and exam 1 as the independent variable X. data=class indicates we are looking in the data frame we've named class. The output indicates that the regression line is given by the formula

exam2 = (Intercept) + (exam1 coefficient) * exam1

That is, the number printed under (Intercept) is the Y-intercept of the line, and the number printed under exam1 is its slope.

More detailed information can be obtained by creating an object and calling for its summary statistics.

> linearmodel1 = lm(exam2~exam1, data=class)

creates an object named linearmodel1 containing the regression information. (R is case sensitive, so the name has to be typed the same way every time.)

> summary(linearmodel1)

provides additional information about the regression.

Call:
lm(formula = exam2 ~ exam1, data = class)

Residuals:
...

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)      ...        ...     ...      ...
exam1            ...        ...     ...      ... **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: ... on 5 degrees of freedom
Multiple R-squared: ..., Adjusted R-squared: ...
F-statistic: ... on 1 and 5 DF, p-value: ...

If you just want the correlation coefficient between exam 2 and exam 1, of course you could just take the square root of R-squared (the positive root here, since the slope is positive), or if you don't want to go to that trouble,

> cor(class$exam2, class$exam1)

does the trick.
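As a rough check on what lm and summary report, the same quantities can be computed directly from the data. The sketch below assumes the usual textbook formulas (slope = cov(x, y)/var(x), intercept = mean(y) - slope*mean(x)); the names slope_by_hand, intercept_by_hand, r_from_r_squared, and r_direct are made up for this illustration and do not appear in the handout.

```r
# Data from the example above
exam1 <- c(100, 90, 90, 85, 80, 75, 60)
exam2 <- c(95, 100, 90, 80, 95, 60, 40)
class <- data.frame(exam1, exam2)

linearmodel1 <- lm(exam2 ~ exam1, data = class)

# Least-squares slope and intercept from the textbook formulas
slope_by_hand <- cov(class$exam1, class$exam2) / var(class$exam1)
intercept_by_hand <- mean(class$exam2) - slope_by_hand * mean(class$exam1)

# Side-by-side comparison with the coefficients lm reports
rbind(lm = coef(linearmodel1),
      by_hand = c(intercept_by_hand, slope_by_hand))

# The correlation is the square root of R-squared, carrying the sign of the slope
r_squared <- summary(linearmodel1)$r.squared
r_from_r_squared <- sign(slope_by_hand) * sqrt(r_squared)
r_direct <- cor(class$exam2, class$exam1)
c(from_r_squared = r_from_r_squared, direct = r_direct)
```

The sign() factor only matters when the slope is negative; with a single predictor, R-squared alone cannot tell you the direction of the relationship.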

Multiple regression.

Example. Let's take our existing data frame from the previous section and add some additional variables, namely a third exam, and each student's final grade.

> exam3 = c(95, 90, 88, 75, 85, 60, 50)
> final = c(95, 91, 86, 77, 86, 55, 50)
> class = data.frame(students, exam1, exam2, exam3, final)
> class
  students exam1 exam2 exam3 final
1    Asuka   100    95    95    95
2      Rei    90   100    90    91
3   Shinji    90    90    88    86
4     Mari    85    80    75    77
5   Hikari    80    95    85    86
6     Toji    75    60    60    55
7  Kensuke    60    40    50    50

Now suppose we wanted to create a model which relates the student's final grade to their performance on the first three exams.

> grademodel = lm(final~exam1+exam2+exam3, data=class)

As before, lm denotes that we're making a linear model, the variable before the ~ is the dependent (response) variable, and the variables after ~ are the independent (explanatory) variables. Again, the summary command will provide more information. Here's the output.
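For readers following along at the console, the commands behind that output can be sketched as below. The data and the model formula are the ones given above; coef_table is just a name chosen here for the coefficient matrix that summary stores.

```r
# Data from the example above
students <- c("Asuka", "Rei", "Shinji", "Mari", "Hikari", "Toji", "Kensuke")
exam1 <- c(100, 90, 90, 85, 80, 75, 60)
exam2 <- c(95, 100, 90, 80, 95, 60, 40)
exam3 <- c(95, 90, 88, 75, 85, 60, 50)
final <- c(95, 91, 86, 77, 86, 55, 50)
class <- data.frame(students, exam1, exam2, exam3, final)

# Fit the three-predictor model and print the full summary
grademodel <- lm(final ~ exam1 + exam2 + exam3, data = class)
summary(grademodel)

# The coefficient table is also available as a matrix,
# which makes it easy to pull out one column, e.g. the p-values
coef_table <- coef(summary(grademodel))
coef_table[, "Pr(>|t|)"]
```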

The value of R-squared is good (98.65% of the variation in final grades is explained by the predictors in our model), but there are some things that are not so good. The entries in the column Pr(>|t|) are called p-values. At this point in the course, we're not going to go into all the details of what a p-value is, but for now all you need to know is that the lower the p-value, the stronger the evidence that the predictor belongs in the model. The residuals tell you how much our model over- or under-predicts the final grade in the known data. The residual standard error can be thought of as a sort of standard deviation of the residuals or errors (so generally, the lower the standard error, the better).

In the output above, note that exam 3 looks like a good predictor of final grade, and the high p-values on exam 1 and exam 2 suggest we might not really need them in our model. So, let's try again, throwing out exams 1 and 2.

Note that on its own, exam 3 looks like a good predictor of final grade. Tossing exam 1 and exam 2 actually reduces the standard error (which is good!), R-squared is barely changed, and the adjusted R-squared (which takes into account the number of predictors used in the model) actually goes up. So, a pretty good linear model to predict the final grade would be

final = (Intercept) + (exam3 coefficient) * exam3

with both numbers read from the Estimate column of the second summary.
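The comparison described above can be sketched by fitting both models and lining up the three statistics mentioned (residual standard error, R-squared, adjusted R-squared). The helper fit_stats and the object names full and reduced are assumptions made for this illustration, not names used in the handout.

```r
# Data from the example above
exam1 <- c(100, 90, 90, 85, 80, 75, 60)
exam2 <- c(95, 100, 90, 80, 95, 60, 40)
exam3 <- c(95, 90, 88, 75, 85, 60, 50)
final <- c(95, 91, 86, 77, 86, 55, 50)
class <- data.frame(exam1, exam2, exam3, final)

full <- lm(final ~ exam1 + exam2 + exam3, data = class)
reduced <- lm(final ~ exam3, data = class)

# Hypothetical helper: pull the three statistics out of a fitted model
fit_stats <- function(model) {
  s <- summary(model)
  c(sigma = s$sigma, r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
}

# One row per model, one column per statistic
rbind(full = fit_stats(full), reduced = fit_stats(reduced))

# Intercept and slope of the one-predictor model
coef(reduced)
```

Plain R-squared can never go up when predictors are removed, which is why the adjusted version is the more useful column to compare here.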

Example. At spots.gru.edu/nsmith12/rexamples/georgiacolleges.csv is a file containing data about colleges and universities in the great state of Georgia. This data set contains the name of each institution, the student count, the percentage of the students who are full time (ft.pct), the median SAT score (math+verbal) at each school, and the six- and four-year graduation rates (six.yr.grad, four.yr.grad).

Now, suppose we wanted to try to figure out why, say, UGA's graduation rate is high, and Augusta State's isn't so high. Let's create an object called gradmodel to do this. At the command line, we'll enter

> gradmodel = lm(six.yr.grad ~ ft.pct + med.sat, data=ga)

(here ga is the data frame holding the imported file). This model will look for a relationship between the six-year graduation rate, the percentage of full-time students, and the median SAT score.

This says that the best fit is given by:

Six-year graduation rate = 0.87374*ft.pct + (med.sat coefficient)*med.sat + (intercept)

where the med.sat coefficient and the intercept are read from the same output.

Let's now say you wanted to see how Augusta State should be doing according to this model. ASU's predicted graduation rate = 0.87374(74.5) + (med.sat coefficient)(965) + (intercept), about 31.28 percent. ASU only graduated 24.5 percent, a little worse (about 6.78 percentage points lower) than the model predicts. If you don't want to do all this by hand,

> predict(gradmodel, list(ft.pct = 74.5, med.sat = 965))

The syntax is predict(name of the model you want, list(values of the variables in question)).

In the output, a residual tells you how far each data point differs from what is predicted by the regression equation. If you want to see all the residuals,

> gradmodel$residuals

This tells you that for data point 1 (ASU), the (actual) percentage of students graduating in 6 years is 6.78 percentage points lower than the model predicts. For school 2 (Fort Valley State), the actual graduation rate is about 0.8 percentage points higher than the model predicts, and so on.

Maybe in the future, the undergraduate side of the former ASU tightens up its standards so that the median SAT score is 1100 and 90% of the students are full time. All things being equal, we'd predict a (six-year) graduation rate of

> predict(gradmodel, list(ft.pct = 90, med.sat = 1100))

about 56 percent.
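The Georgia file is not reproduced here, so the sketch below illustrates the same two ideas (predict versus plugging into the equation by hand, and residuals as actual minus predicted) on the exam data from earlier in this handout. The model final ~ exam3, the exam score of 80, and the names by_hand, by_predict, and manual_residuals are all assumptions chosen for illustration.

```r
# Exam data from earlier in the handout
exam3 <- c(95, 90, 88, 75, 85, 60, 50)
final <- c(95, 91, 86, 77, 86, 55, 50)
class <- data.frame(exam3, final)

model <- lm(final ~ exam3, data = class)

# Prediction for a hypothetical student who scored 80 on exam 3:
# once by plugging into the regression equation, once with predict()
by_hand <- coef(model)[["(Intercept)"]] + coef(model)[["exam3"]] * 80
by_predict <- predict(model, list(exam3 = 80))
c(by_hand = by_hand, by_predict = unname(by_predict))

# A residual is the actual value minus the value the model predicts
manual_residuals <- class$final - fitted(model)
cbind(class, fitted = fitted(model), residual = model$residuals)
```

A negative residual means the actual value is lower than the model predicts, which is how the ASU row is read above.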

Note! Whenever trying to do predictions from a regression line, there is some amount of uncertainty in the result. For instance, in the example above, we were trying to predict ASU's graduation rate, assuming a median SAT score of 965 and a full-time percentage of 74.5%. If you ask R

> predict(gradmodel, list(ft.pct = 74.5, med.sat = 965), interval="predict", level=.95)

the output has three columns: fit (the predicted value), lwr, and upr (the lower and upper ends of the interval). interval="predict" means that we want to construct a so-called prediction interval about the point in question. Setting level=.95 means that we are using a confidence level of 95%; more on that later in the course. For now, the output says: at a 95% confidence level, a range of plausible estimates for the 6-year graduation rate at an institution like ASU is 16.2% to 46.4%. Intuitively, this gives upper and lower estimates for the graduation rate based on the model we are using and based on the available data.

Note that increasing the confidence level to 99%,

> predict(gradmodel, list(ft.pct = 74.5, med.sat = 965), interval="predict", level=.99)

gives a somewhat wider prediction interval. For now, the idea is that to have greater confidence in a prediction, you need more latitude in the range of predicted values.

Note! As Spock would say, "All things being equal, I'd agree; however, all things are not equal," and one must be careful when messing around in this fashion with a regression model. To see why, try computing

> predict(gradmodel, list(ft.pct = 100, med.sat = 1600))

(the ideal scenario!) and make sure you see why the result doesn't make any damn sense.
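Since the Georgia data is not reproduced here, the following sketch shows the same two points on the exam data from earlier: a 99% prediction interval is wider than a 95% one, and predicting far outside the range of the data can give impossible values. The exam scores of 80 and 150 and the names pi95, pi99, and width are assumptions for illustration only.

```r
# Exam data from earlier in the handout
exam3 <- c(95, 90, 88, 75, 85, 60, 50)
final <- c(95, 91, 86, 77, 86, 55, 50)
class <- data.frame(exam3, final)
model <- lm(final ~ exam3, data = class)

# Prediction intervals at two confidence levels for a hypothetical exam 3 score of 80
new_student <- list(exam3 = 80)
pi95 <- predict(model, new_student, interval = "predict", level = 0.95)
pi99 <- predict(model, new_student, interval = "predict", level = 0.99)
rbind(level_95 = pi95[1, ], level_99 = pi99[1, ])

# Width of each interval: upper end minus lower end
width <- function(p) unname(p[1, "upr"] - p[1, "lwr"])
c(width_95 = width(pi95), width_99 = width(pi99))

# Extrapolation: an exam 3 score of 150 is far outside the observed 50 to 95 range,
# and the model will still happily return a number
predict(model, list(exam3 = 150))
```

The last call is the exam-data analogue of the "ideal scenario" above: the model knows nothing about what grades are actually possible.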

Dummy variables.

In many applications, there are qualitative factors that you might want to include in a regression.

Example. At a certain university, there is a college of Science that houses the departments of Biology, Chemistry, and Mathematics. Here is some salary data about a sample of professors from that college. Let's import the data. This data tells you the salaries of faculty members in this (fictional) college, based on the faculty member's rank (Assistant, Associate, or Full professor), their gender, and how long in years it has been since the faculty member obtained their Ph.D.

> salary = read.csv("...")
> head(salary)
  Salary      Rank Gender Yrs
1    ... Assistant      M ...
2    ... Assistant      M ...
3    ... Assistant      M ...
4    ... Assistant      M ...
5    ... Assistant      F ...
6    ... Assistant      F   3

Now, the difficulty is how to incorporate the predictors Rank and Gender into a regression model. The trick is to use what are called dummy variables. Gender is a binary variable, so let's add a dummy variable G to the mix. G will equal 1 if the gender is male, and 0 if female (or vice versa!). G simply encodes the gender of each faculty member in a way that can be handled numerically.

Important! Rank is a little trickier. We don't want to do something like let R = 0 if the faculty member is an assistant professor, R = 1 if associate, and R = 2 if full. We'd be assuming that the faculty member gets the same salary bump when they go from level 0 to 1 (assistant to associate) as they do when going from level 1 to 2 (associate to full professor). That seems like a really dicey assumption!

So, what we'll do is incorporate two dummy variables. Let Assoc = 1 if the faculty member holds the rank of Associate (and 0 otherwise), and let Full = 1 if they're a full professor (0 otherwise). Thus, Assistant Professors have Assoc=0 and Full=0. In general, if we want to model a qualitative factor, we need one fewer dummy variable than the number of levels of the qualitative variable.

Now, we just need to code the dummy variables and get the data ready to go.

> G = c(1,1,1,1,0,0,1,0,1,1,0,1,0,1,1,0,0,1,1,1,1,1,0,1,1,0,1,0,0,1)
> Assoc = c(0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0)
> Full = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1)
> salary = data.frame(salary, G, Assoc, Full)
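Typing the 0s and 1s by hand works, but it is easy to miscount. As a sketch of an alternative, the dummy variables can be derived from the Rank and Gender columns themselves. The six-row data frame toy below is made up for illustration (it is not the salary file), and RankF is a hypothetical column name; the last step shows that R builds the same Assoc and Full columns on its own when Rank is stored as a factor with Assistant as the first level.

```r
# Made-up miniature version of the Rank and Gender columns
toy <- data.frame(
  Rank = c("Assistant", "Assistant", "Associate", "Associate", "Full", "Full"),
  Gender = c("M", "F", "M", "F", "M", "F")
)

# Dummy variables computed from the columns instead of typed in
toy$G <- as.numeric(toy$Gender == "M")
toy$Assoc <- as.numeric(toy$Rank == "Associate")
toy$Full <- as.numeric(toy$Rank == "Full")
toy

# With Rank as a factor, model.matrix creates one column per non-baseline level
toy$RankF <- factor(toy$Rank, levels = c("Assistant", "Associate", "Full"))
auto <- model.matrix(~ RankF, data = toy)
auto
```

Three ranks produce two dummy columns plus the intercept, matching the "one fewer dummy variable than levels" rule stated above.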

> salary
   Salary      Rank Gender Yrs G Assoc Full
1     ... Assistant      M ... 1     0    0
2     ... Assistant      M ... 1     0    0
3     ... Assistant      M ... 1     0    0
4     ... Assistant      M ... 1     0    0
5     ... Assistant      F ... 0     0    0
6     ... Assistant      F ... 0     0    0
7     ... Assistant      M ... 1     0    0
8     ... Assistant      F ... 0     0    0
9     ... Assistant      M ... 1     0    0
10    ... Associate      M ... 1     1    0
11    ... Associate      F ... 0     1    0

Now, let's create a model that takes all of these variables into account. For the moment, let's conveniently ignore that Rank and Yrs strongly correlate (which is not ideal, since we risk overfitting the model!). It's time to create a linear model.

It looks like Yrs, Assoc, and Full are good predictors of salary. The middling p-value on the gender variable suggests that it is the weakest predictor of salary. Our tentative model is thus

salary = (intercept) + 469*Yrs + (G coefficient)*G + (Assoc coefficient)*Assoc + (Full coefficient)*Full

with the remaining values read from the Estimate column of the summary.

Now, one could argue that the variables Yrs, Assoc, and Full are measuring similar things (to get promoted to associate or full professor you have to be on the job for a while!), so it might be worth investigating what happens if we drop some of these variables from the model. If you look at the output below, note that dropping Yrs, or dropping Assoc and Full, from the model causes a fairly large dip in R-squared and a jump in the standard error, so I'd probably leave those variables in the model as well.
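The kind of comparison described above can be sketched as follows. The salary file is not reproduced here, so the data frame toy is simulated: every number in it (the seed, the salary formula, the noise level) is invented purely so that the code runs, and the resulting statistics say nothing about the real data. Only the layout (30 faculty, 9 assistant, 9 associate, 12 full) follows the dummy vectors given earlier.

```r
# Simulated stand-in for the salary data (all values are made up)
set.seed(1)
n <- 30
Assoc <- rep(c(0, 1, 0), times = c(9, 9, 12))
Full <- rep(c(0, 0, 1), times = c(9, 9, 12))
G <- rbinom(n, 1, 0.6)
Yrs <- round(runif(n, 1, 8) + 7 * Assoc + 15 * Full)
Salary <- 50000 + 500 * Yrs + 8000 * Assoc + 20000 * Full + rnorm(n, sd = 3000)
toy <- data.frame(Salary, Yrs, G, Assoc, Full)

# Full model, then the two reduced models discussed in the text
models <- list(
  full = lm(Salary ~ Yrs + G + Assoc + Full, data = toy),
  no_yrs = lm(Salary ~ G + Assoc + Full, data = toy),
  no_rank = lm(Salary ~ Yrs + G, data = toy)
)

# One column per model: R-squared, adjusted R-squared, residual standard error
compare <- sapply(models, function(m) {
  s <- summary(m)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared, sigma = s$sigma)
})
round(compare, 3)
```

Reading across a row shows at a glance how much each statistic moves when a variable is dropped, which is the judgment call being made in the paragraph above.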


STAT 3022 Spring 2007 Simple Linear Regression Example These commands reproduce what we did in class. You should enter these in R and see what they do. Start by typing > set.seed(42) to reset the random number generator so