Take-home Final. The questions you are expected to answer for this project are numbered and italicized. There is one bonus question. Good luck!



The data for our final project come from a study initiated by the Tasmanian Aquaculture and Fisheries Institute to investigate the growth patterns of abalone living along the Tasmanian coastline. The harvest of abalone is subject to quotas that restrict both the number of abalone that can be caught as well as their size. The population of abalone can be regulated effectively if there is a simple way to tell the age of abalone based solely on their appearance. Hence, researchers are interested in relating abalone age to variables like length, height and weight of the animal (measurements that a diver can take before they harvest the animal).

Determining the actual age of an abalone is a bit like estimating the age of a tree. Rings are formed in the shell of the abalone as it grows, usually at the rate of one ring per year. Getting access to the rings of an abalone involves cutting the shell. After polishing and staining, a lab technician examines a shell sample under a microscope and counts the rings. Because some rings are hard to make out using this method, these researchers believed adding 1.5 to the ring count is a reasonable estimate of the abalone's age.

The relationship between age and ring count, however, is somewhat controversial. Under certain conditions, abalone can grow more than one ring per year. These conditions relate to weather patterns and other environmental variables. These effects suggest that any relationship we find between ring count and size measurements taken from the animal is likely to involve a lot of error; there are plenty of important variables that have been left out of the study.

With these caveats, we will use data collected on 4,177 abalone by the team of researchers from Tasmania. I was not able to find an electronic version of their actual report on these data, but I am making a related report by the same organization available on the course Web site.
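The dating rule described above is simple arithmetic, and a small sketch may help fix the idea. The function name ring_age and the example ring counts are illustrative assumptions, not part of the study; only the offset of 1.5 comes from the researchers' rule.

```r
# Hypothetical helper: approximate age in years from a ring count,
# using the researchers' rule of adding 1.5 to the count.
ring_age <- function(rings) {
  rings + 1.5
}

# Illustrative ring counts (invented for this sketch)
example_rings <- c(4, 10, 22)
example_ages <- ring_age(example_rings)
print(example_ages)
```

Because the rule is a constant shift, any model that predicts ring count can be turned into a model for approximate age by adding 1.5 to its predictions.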

1. Getting started

We begin by downloading the dataset of abalone measurements.

    source(" cocteau/stat13/data/final.r")
    ls()

You should see (possibly among other things) the dataset ab.

    dim(ab)

The output of this command indicates that ab is a matrix with 4,177 rows and 9 columns. Each row refers to a different abalone caught off the coast of Tasmania. For each of the 4,177 abalone caught, the researchers recorded a number of measurements. We describe these measurements in detail in Table 1. The following command lists the values of the variables for the first three abalone in the dataset.

    ab[1:3,]

Recall from Lab 2 that R stores its datasets as matrices. With this command, we are specifying just the first three rows of ab. You should see something like the following.

    rings length diameter height whole shucked viscera shell infant

The first column in this output is just the row number of the specimen in our dataset; in this case, we are just looking at rows 1, 2 and 3. The number of rings for these three abalone ranged from 7 to 15. We can approximately determine the age of an abalone in this study by adding 1.5 to the number of rings. This means that our three specimens were 16.5, 8.5 and 10.5 years old, respectively. Looking at the output a bit more, we see that their lengths varied from 70mm to 106mm, and their whole weights ranged from 45.1g to 135.4g. Looking at the last column in the output, we see that these three specimens were also mature (that is, they were past their infant stage).

    Variable name   Units   Description
    rings           count   number of rings
    length          mm      longest shell measurement
    diameter        mm      measured perpendicular to the length
    height          mm      height of abalone with meat in the shell
    whole           grams   weight of the whole abalone
    shucked         grams   weight of just the meat
    viscera         grams   gut weight after bleeding
    shell           grams   weight of shell after drying
    infant          0/1     1 if the abalone is an infant, 0 otherwise

Table 1: Descriptions for the variables in the dataset ab. The first variable (column 1 in ab) is our response, the quantity we'd like to predict. It represents the number of rings in an abalone's shell. The remaining 8 variables (columns 2 through 9 in ab) are potential predictors; we want to form a model that will allow us to predict ring count from these 8 measurements.

The 4,177 abalone in this dataset all had between 1 and 29 rings. If we use the approximate dating rule, that means they are between 2.5 and 30.5 years old. Examine the data for the oldest and youngest abalone in the dataset with the following two commands.

    ab[ab$rings==1,]
    ab[ab$rings==29,]

In the first command, the == relation is asking for all the data in ab for abalone having just one ring; in the second command we are asking for the data for those abalone with 29 rings.

1. Do the sets of measurements make sense? That is, what aspects of the data agree with the fact that one of these abalone is very young and the other is very old?

2. A first look at the data

We begin by looking at simple summaries of the separate predictor variables. But first we need a graphics window.

    quartz()

In Lab 2 we introduced the commands boxplot, hist and summary. Apply these tools to the separate variables to get a sense of their distributions. Take, for example, the variable length, the widest point of the abalone.

    summary(ab$length)
    hist(ab$length)

2. Describe the distribution of length. Is it skewed? Bell-shaped? Repeat these summaries for a few of the variables to get a feel for the data. Report on at least one variable other than length.

Next, consider the four variables related to the weight of the abalone: whole, shucked, viscera, and shell. These are stored in columns 5 through 8 of ab. Look at a simple summary and a histogram for each variable. To compare the values for each weight measurement directly, we can look at a boxplot.

    boxplot(ab[,5:8])

Again, since this dataset is basically a matrix in R, the index 5:8 in the command above refers to just columns 5 through 8. What do you see in the plot? Do these relationships agree with what you would expect given their description in Table 1? To dig a bit deeper, we will make a simple scatterplot of shucked versus whole. (The second command adds a reference line with intercept 0 and slope 1.)

    plot(ab$shucked,ab$whole)
    abline(0,1)

3. What do you see from this distribution? Is there a trend? How does the spread vary as the shucked weight (the weight of just the meat of the abalone) increases? Since whole is meant to describe the weight of the entire animal, and shucked just one part of that weight, what can we say about the four points that are below our reference line? Do these make sense? Make the same plot for ab$viscera and ab$shucked (in this case, viscera is the weight of the meat after bleeding and should be less than shucked, the weight of the meat before bleeding). Do you see any questionable points in this plot?

In R, we can make scatterplots between several pairs of variables at one time. For example, we can consider all possible pairs of scatterplots from the 4 variables related to an abalone's weight with the following command.

    pairs(ab[,5:8])

Note that in this plot, each cell is a scatterplot between two variables. The graph in the second row and first column is a scatterplot of whole and shucked, while the graph in the fourth row and second column is a plot of shucked and shell. What do you see?

Finally, we examine rings, our response variable. You can look at the five-number summary as well as a histogram of this variable. Because this variable represents the number of rings, it is discrete. We can see a tabulation of all the values with the following command.

    table(ab$rings)

4. What ring count is most prevalent in our dataset? What approximate age does this represent?

3. Correlation

We now consider more formal descriptions of the relationships between two variables. In Section of your text, the authors introduce the correlation coefficient. This quantity measures the degree to which the data in a simple scatterplot lie on a straight line. See your text for details on how you would compute the correlation coefficient from a sample of data. In R, we can use the command cor.

    cor(ab$length,ab$height)

What value do you get? Use the description in your text on page 541 together with the graphs in Figure to guess what the scatterplot of length and height might look like. Now, let's make this plot and see if your guess was correct.

    plot(ab$length,ab$height)

What do you notice? There seems to be a strong trend, but there also appear to be two outliers. Technically, outliers are points that don't seem to agree with the rest of the data in some way. Depending on their size, they can have a big impact on some of our analyses. In this case, the two points have extremely large values of height. These points correspond to rows 1418 and 2052. We can remove them and create a new dataset ab2 with the following command.

    ab2 = ab[-c(1418,2052),]

Again, we are indexing rows in this command, but the minus sign tells R that we want to leave these rows out. Hence ab2 has 4175 rows and 9 columns.

    dim(ab2)

To see how things have changed, we can again compute the correlation coefficient and make a scatterplot of height and length, but this time using ab2.

    cor(ab2$length,ab2$height)
    plot(ab2$length,ab2$height)

5. What do you notice? What has happened to the correlation coefficient? Does this agree with what you see in the new scatterplot?

Just like the command pairs that we used above, the command cor can also return a matrix of correlation coefficients between all the pairs in a dataset. Try the following two commands.

    cor(ab2[,1:3])
    pairs(ab2[,1:3])

(Note that plots involving rings will have stripes because the ring count from abalone shells is a discrete variable. It has a large number of levels, but it's still discrete.)

6. Do the relationships in these plots agree with your intuition about the correlation coefficients? Explain using a pair with high correlation and one with relatively low correlation. The diagonal elements in cor(ab2[,1:3]) are all 1. Does that make sense?

From this point on, we will work with ab2 rather than ab. If at some point you exit from R and don't save your work, you will have to retype the command creating ab2.
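The effect of a couple of extreme points on the correlation coefficient can be reproduced on simulated data. The sketch below does not use ab: the data frame sim, the made-up relationship between length and height, and the planted outlier rows (50 and 120) are all assumptions chosen for illustration.

```r
set.seed(13)                      # arbitrary seed, for reproducibility
n <- 200
sim <- data.frame(length = runif(n, 20, 160))
# Height roughly proportional to length, plus noise (invented relationship)
sim$height <- 0.25 * sim$length + rnorm(n, sd = 3)

# Plant two implausibly tall specimens
sim$height[c(50, 120)] <- c(230, 100)

r_with <- cor(sim$length, sim$height)

# Negative indices drop rows, in the same way as ab[-c(1418,2052),]
sim2 <- sim[-c(50, 120), ]
r_without <- cor(sim2$length, sim2$height)

print(c(with_outliers = r_with, without_outliers = r_without))

# cor() applied to several columns returns a matrix; every variable has
# correlation 1 with itself, so the diagonal entries are all 1
cor_matrix <- cor(sim2)
print(cor_matrix)
```

In this simulated setting, dropping the two planted rows raises the correlation, because those points sit far from the straight-line trend that the rest of the data follow.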

4. Modeling ring count

At the end of the last section, we saw plots of rings against the first two predictor variables. There seemed to be a lot of spread in these data. As noted in the introduction, the investigators at the Marine Resources Division acknowledge that several environmental factors can influence the growth of rings in abalone. That means there are variables relating to the condition of the water off the coast of Tasmania, as well as the weather patterns for the last 30 years, that might help explain ring counts. None of these data are available to us at this point, and hence we expect a certain amount of error in any model we build with our 8 predictor variables.

To begin the modeling process, we will just look at a simple linear model with only one predictor. In mathematical terms we consider the following description of the data:

    $$\text{rings} = \beta_0 + \beta_1\,\text{length} + U, \tag{0.1}$$

where $U$ is a random error. Here, the error also accounts for all the variables that we can't measure (like weather patterns). We can compute least squares estimates for $\beta_0$ and $\beta_1$ in R with this command.

    fit = lm(rings ~ 1+length,data=ab2)

(The symbol between rings and 1 is a tilde; on my keyboard it is above the single quote, just next to the 1.) The formula specifies the same relationship given in (0.1), where 1 represents the intercept. The summary table we studied in class is obtained with the next command.

    summary(fit)

7. What are the least squares estimates $\hat\beta_0$ and $\hat\beta_1$? What are their standard errors? Give a 95% confidence interval for $\beta_1$ (since the number of samples here is large, use the approximation involving plus or minus two standard errors).

In Section of the text, we see that for the simple linear model

    $$Y = \beta_0 + \beta_1 X + U, \tag{0.2}$$

we can write down the least squares estimates explicitly. They are given by

    $$\hat\beta_1 = r\,\frac{s_Y}{s_X} \quad\text{and}\quad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, \tag{0.3}$$

where $\bar{y}$ and $\bar{x}$ are the sample means of the response and predictor variables, respectively; and $s_Y$ and $s_X$ are the sample standard deviations of the response and predictor variables, respectively. The sample correlation coefficient is denoted $r$. Taking rings as the response and length as the predictor variable, we can form the necessary ingredients in (0.3) with calls to mean(ab2$rings), mean(ab2$length), sd(ab2$rings), sd(ab2$length), and cor(ab2$length,ab2$rings).

8. Execute these commands and verify that the coefficients you get in summary(fit) agree with those in equation (0.3).

Next, we will consider a simple linear model involving height as the predictor rather than length.

    fit2 = lm(rings ~ 1+height,data=ab2)
    summary(fit2)

What expression of the form (0.2) does this represent? Now, let's fit this model using the dataset that includes the outliers in height. Notice that we are now using data=ab and not data=ab2 as before.

    fit3 = lm(rings ~ 1+height,data=ab)
    summary(fit3)

9. Do the least squares estimates change? By how much?

Bonus. Create a new dataset ab3 by removing two points from ab2. Use a command like the one in the previous section that we used to create ab2 from ab. Instead of 1418 and 2052, pick any two other rows. Refit the linear model using the option data=ab3 and look at the summary. How much did the coefficients change from fit2? Compare that to the change you observed by removing the two outliers in height.

Finally, we will consider a full model using all of our 8 predictor variables.

    bigfit = lm(rings ~ 1+length+diameter+height+whole+shucked+
                viscera+shell+infant,data=ab2)
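Before turning to the summary of the full model, the closed-form estimates in (0.3) can be checked numerically. The sketch below uses simulated data rather than ab2; the names x, y and fit_sim, the sample size, and the coefficients used to generate the data are assumptions chosen only for illustration. It also forms the rough 95% interval of plus or minus two standard errors mentioned in question 7.

```r
set.seed(7)
n <- 500
x <- runif(n, 20, 160)                  # stand-in predictor
y <- 2 + 0.06 * x + rnorm(n, sd = 2.5)  # stand-in response

fit_sim <- lm(y ~ 1 + x)

# Closed-form estimates from equation (0.3)
r  <- cor(x, y)
b1 <- r * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)

# The two rows should agree up to rounding error
print(rbind(lm = coef(fit_sim), formula = c(b0, b1)))

# Approximate 95% confidence interval for the slope: estimate +/- 2 SE
se_b1 <- summary(fit_sim)$coefficients["x", "Std. Error"]
ci_b1 <- c(lower = b1 - 2 * se_b1, upper = b1 + 2 * se_b1)
print(ci_b1)
```

The same check on the abalone data would replace x and y with ab2$length and ab2$rings.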

    summary(bigfit)

10. The P-values in the last column represent a series of hypothesis tests. What null hypothesis is being tested in each case? What is the alternative? Which would we reject at the 0.05 level? (That is, which have P-values less than 0.05?)

5. Evaluating the fit

Continuing with the summary table from the full model bigfit, we have established statistical significance of many of the variables. But what about the practical significance? That is, how much of an effect do these variables have on ring count? To study this, consider the variable whole. In Section 2 you probably observed the median value of this variable. Three abalone have whole weights of 159.9, and we can see them as follows.

    test = ab2[ab2$whole == 159.9,]
    test

With the data in each row of test, we can use our fitted model to make a prediction about these abalone's ages.

11. Using the coefficients in the summary table for bigfit and the values in the first row of test, compute the prediction from our model. Compare it to the true value for rings for this abalone.

In R we can compute the predictions using the following command.

    predict(bigfit,newdata=test)

Make sure the first value matches what you computed above. Now, let's change the value of whole for these abalone by 50g and see how our predictions change.

    test$whole = test$whole + 50
    predict(bigfit,newdata=test)

12. By how much have our predictions changed? Interpret this in terms of the coefficients in the summary table for bigfit. Comment on the change you might expect by varying a predictor other than whole weight.
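The arithmetic behind questions 11 and 12 can be sketched on simulated data. Nothing below comes from ab2: the data frame toy, its two predictors, and the generating coefficients are assumptions made for illustration. The point is only that a prediction from a linear model is the intercept plus each coefficient times the corresponding predictor value, so shifting one predictor by 50 changes the prediction by 50 times that predictor's coefficient.

```r
set.seed(21)
n <- 300
toy <- data.frame(whole = runif(n, 10, 400), height = runif(n, 5, 50))
# Invented relationship between ring count and the two predictors
toy$rings <- 3 + 0.01 * toy$whole + 0.15 * toy$height + rnorm(n, sd = 2)

toy_fit <- lm(rings ~ 1 + whole + height, data = toy)
b <- coef(toy_fit)

new_obs <- toy[1, ]

# Prediction by hand: intercept + coefficient * value for each predictor
by_hand <- b["(Intercept)"] + b["whole"] * new_obs$whole +
  b["height"] * new_obs$height
by_predict <- predict(toy_fit, newdata = new_obs)
print(c(by_hand = unname(by_hand), by_predict = unname(by_predict)))

# Shift whole by 50 and compare the change with 50 * coefficient of whole
shifted <- new_obs
shifted$whole <- shifted$whole + 50
change <- predict(toy_fit, newdata = shifted) - by_predict
print(c(change = unname(change), fifty_times_coef = unname(50 * b["whole"])))
```

The change does not depend on which row is used, since the other predictors are held fixed and the model is linear in whole.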

Finally, we consider an elaboration of the linear model represented by bigfit. Notice that the coefficient of length is not significant in bigfit. This means the data do not contain strong evidence to suggest that our linear model depends on length, or, rather, that the coefficient on length is not zero. We can explore this fact in more detail by making a scatterplot of the variable length and the residuals from the fit.

    plot(ab2$length,residuals(bigfit))

Here, the command residuals(bigfit) constructs the residuals from our fit involving all the variables. Give yourself a fairly big graphics window to see the relationship between the variables. What do you notice? To give you a hint, let us try a slightly larger model. This time, we will use a quadratic in length. In R, this is done by using a call to the function poly(length,2), where poly specifies that we want a polynomial in length and the degree of that polynomial is 2, that is, a quadratic.

    bigfit2 = lm(rings ~ 1+poly(length,2)+diameter+height+whole+shucked+
                 viscera+shell+infant,data=ab2)
    summary(bigfit2)

In the summary you will now see two entries involving length. The first row is labeled poly(length,2)1 and is just length, while the second, poly(length,2)2, involves $\text{length}^2$. [1] To see how we have changed the fit, look at the plot of length against the residuals of this new fit, and compare it to the same plot for the previous fit.

    plot(ab2$length,residuals(bigfit2))

13. What has changed?

Admittedly, the effects here are somewhat subtle. There is still a lot of error in the data that we have not accounted for with this simple model. If we were to continue with this exercise, we would be led to transform the response variable rings and consider log(rings + 1.5) instead. This reduces some of the deficiencies in the final fit. Even after all of this work, however, the relationship is still noisy.
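The idea of adding a quadratic term can also be tried on simulated data. The sketch below is not the abalone analysis: the curved relationship, the names x and y, and the noise level are assumptions. It compares a straight-line fit with a quadratic fit, and it shows that poly(x,2), which by default builds orthogonal polynomial terms, gives the same fitted values as the raw terms x and x^2 even though the individual coefficients differ.

```r
set.seed(3)
n <- 400
x <- runif(n, 20, 160)
y <- 2 + 0.2 * x - 0.001 * x^2 + rnorm(n, sd = 1)   # invented curved trend

fit_line <- lm(y ~ 1 + x)
fit_quad <- lm(y ~ 1 + poly(x, 2))     # orthogonal polynomial terms
fit_raw  <- lm(y ~ 1 + x + I(x^2))     # raw x and x^2 terms

# Residual standard error of the straight-line and quadratic fits
sigma_line <- summary(fit_line)$sigma
sigma_quad <- summary(fit_quad)$sigma
print(c(line = sigma_line, quadratic = sigma_quad))

# Both quadratic parameterizations describe the same fitted curve
same_fit <- all.equal(unname(fitted(fit_quad)), unname(fitted(fit_raw)))
print(same_fit)

# Residual plots like those in the text (for an interactive session):
# plot(x, residuals(fit_line))
# plot(x, residuals(fit_quad))
```

In the straight-line residual plot for data like these, the leftover curvature shows up as a bend in the cloud of points; the quadratic fit removes it.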
As noted earlier, in this case, what we are calling noise can probably be explained by variables we do not have at our disposal, like weather patterns and ocean conditions. The point of this project was not to come up with the best explanation of ring growth, but rather to give you some experience working with linear models. Have a good summer!

[1] Actually, a bit more is being done to transform length, but all you need to know is that the first term is basically a straight line, while the second is a bowl-shaped quadratic.
