Take-home Final

The data for our final project come from a study initiated by the Tasmanian Aquaculture and Fisheries Institute to investigate the growth patterns of abalone living along the Tasmanian coastline. The harvest of abalone is subject to quotas that restrict both the number of abalone that can be caught and their size. The population of abalone can be regulated effectively if there is a simple way to tell the age of an abalone based solely on its appearance. Hence, researchers are interested in relating abalone age to variables like the length, height and weight of the animal (measurements that a diver can take before harvesting it).

Determining the actual age of an abalone is a bit like estimating the age of a tree. Rings are formed in the shell of the abalone as it grows, usually at the rate of one ring per year. Getting access to the rings of an abalone involves cutting the shell. After polishing and staining, a lab technician examines a shell sample under a microscope and counts the rings. Because some rings are hard to make out using this method, these researchers believed that adding 1.5 to the ring count gives a reasonable estimate of the abalone's age.

The relationship between age and ring count, however, is somewhat controversial. Under certain conditions, abalone can grow more than one ring per year. These conditions relate to weather patterns and other environmental variables. These effects suggest that any relationship we find between ring count and size measurements taken from the animal is likely to involve a lot of error; there are plenty of important variables that have been left out of the study.

With these caveats, we will use data collected on 4,177 abalone by the team of researchers from Tasmania. I was not able to find an electronic version of their actual report on these data, but I am making a related report by the same organization available on the course Web site.
The questions you are expected to answer for this project are numbered and italicized. There is one bonus question. Good luck!
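Since every age in this project is derived from a ring count, it may help to see the researchers' dating rule as code before we start. This is just a sketch of the arithmetic described above; the ring counts here are made up for illustration and are not from the real dataset.

```r
# Hypothetical ring counts for three specimens (illustration only)
rings <- c(15, 7, 9)

# The researchers' rule: estimated age = ring count + 1.5 years
age <- rings + 1.5
age   # 16.5  8.5 10.5
```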

1. Getting started

We begin by downloading the dataset of abalone measurements.

    source("http://www.stat.ucla.edu/~cocteau/stat13/data/final.r")
    ls()

You should see (possibly among other things) the dataset ab.

    dim(ab)

The output of this command indicates that ab is a matrix with 4,177 rows and 9 columns. Each row refers to a different abalone caught off the coast of Tasmania. For each of the 4,177 abalone caught, the researchers recorded a number of measurements. We describe these measurements in detail in Table 1.

The following command lists the values of the variables for the first three abalone in the dataset.

    ab[1:3,]

Recall from Lab 2 that R stores its datasets as matrices. With this command, we are specifying just the first three rows of ab. You should see something like the following.

      rings length diameter height whole shucked viscera shell infant
    1    15     91       73     19 102.8    44.9    20.2    30      0
    2     7     70       53     18  45.1    19.9     9.7    14      0
    3     9    106       84     27 135.4    51.3    28.3    42      0

The first column in this output is just the row number of the specimen in our dataset; in this case, we are just looking at rows 1, 2 and 3. The number of rings for these three abalone ranged from 7 to 15. We can approximately determine the age of an abalone in this study by adding 1.5 to the number of rings. This means that our three specimens were 16.5, 8.5 and 10.5 years old, respectively. Looking at the output a bit more, we see that their lengths varied from 70mm to 106mm, and their whole weights ranged from 45.1g to 135.4g. Looking at the last column in the output, we see that these three specimens were also mature (that is, they were past their infant stage).

    Variable name   Units   Description
    rings           count   number of rings
    length          mm      longest shell measurement
    diameter        mm      measured perpendicular to the length
    height          mm      height of abalone with meat in the shell
    whole           grams   weight of the whole abalone
    shucked         grams   weight of just the meat
    viscera         grams   gut weight after bleeding
    shell           grams   weight of shell after drying
    infant          0/1     1 if the abalone is an infant, 0 otherwise

    Table 1: Descriptions for the variables in the dataset ab. The first variable
    (column 1 in ab) is our response, the quantity we'd like to predict. It
    represents the number of rings in an abalone's shell. The remaining 8
    variables (columns 2 through 9 in ab) are potential predictors; we want to
    form a model that will allow us to predict ring count from these 8
    measurements.

The 4,177 abalone in this dataset all had between 1 and 29 rings. If we use the approximate dating rule, that means they are between 2.5 and 30.5 years old. Examine the data for the oldest and youngest abalone in the dataset with the following two commands.

    ab[ab$rings==1,]
    ab[ab$rings==29,]

In the first command, the == relation is asking for all the data in ab for abalone having just one ring; in the second command we are asking for the data for those abalone with 29 rings.

1. Do the sets of measurements make sense? That is, what aspects of the data agree with the fact that one of these abalone is very young and the other is very old?

2. A first look at the data

We begin by looking at simple summaries of the separate predictor variables. But first we need a graphics window.

    quartz()

In Lab 2 we introduced the commands boxplot, hist and summary. Apply these tools to the separate variables to get a sense of their distributions. Take, for example, the variable length, the widest point of the abalone.

    summary(ab$length)
    hist(ab$length)

2. Describe the distribution of length. Is it skewed? Bell-shaped? Repeat these summaries for a few of the variables to get a feel for the data. Report on at least one variable other than length.

Next, consider the four variables related to the weight of the abalone: whole, shucked, viscera, and shell. These are stored in columns 5 through 8 of ab. Look at a simple summary and a histogram for each variable. To compare the values for each weight measurement directly, we can look at a boxplot.

    boxplot(ab[,5:8])

Again, since this dataset is basically a matrix in R, the index 5:8 in the command above refers to just columns 5 through 8. What do you see in the plot? Do these relationships agree with what you would expect given their description in Table 1?

To dig a bit deeper, we will make a simple scatterplot of shucked versus whole. (The second command adds a reference line with intercept 0 and slope 1.)

    plot(ab$shucked,ab$whole)
    abline(0,1)

3. What do you see from this distribution? Is there a trend? How does the spread vary as the shucked weight (the weight of just the meat of the abalone) increases? Since whole is meant to describe the weight of the entire animal, and shucked just one part of that weight, what can we say about the four points that are below our reference line? Do these make sense? Make the same plot for ab$viscera and ab$shucked (in this case, viscera is the gut weight after bleeding and should be less than shucked, the weight of the meat). Do you see any questionable points in this plot?

In R, we can make scatterplots between several pairs of variables at one time. For example, we can consider all possible pairs of scatterplots from the 4 variables related to an abalone's weight with the following command.

    pairs(ab[,5:8])

Note that in this plot, each cell is a scatterplot between two variables. The graph in the second row and first column is a scatterplot of whole and shucked, while the graph in the fourth row and second column is a plot of shucked and shell. What do you see?

Finally, we examine rings, our response variable. You can look at the five-number summary as well as a histogram of this variable. Because this variable represents the number of rings, it is discrete. We can see a tabulation of all the values with the following command.

    table(ab$rings)

4. What ring count is most prevalent in our dataset? What approximate age does this represent?

3. Correlation

We now consider more formal descriptions of the relationships between two variables. In Section 12.5.2 of your text, the authors introduce the correlation coefficient. This quantity measures the degree to which the data in a simple scatterplot lie on a straight line. See your text for details on how you would compute the correlation coefficient from a sample of data. In R, we can use the command cor.

    cor(ab$length,ab$height)

What value do you get? Use the description in your text on page 541 together with the graphs in Figure 12.52 to guess what the scatterplot of length and height might look like. Now, let's make this plot and see if your guess was correct.

    plot(ab$length,ab$height)

What do you notice? There seems to be a strong trend, but there also appear to be two outliers. Technically, outliers are points that don't seem to agree with the rest of the data in some way. Depending on their size, they can have a big impact on some of our analyses. In this case, the two points have extremely large values of height. These points correspond to rows 1418 and 2052. We can remove them and create a new dataset ab2 with the following command.

    ab2 = ab[-c(1418,2052),]

Again, we are indexing rows in this command, but the minus sign tells R that we want to leave these rows out. Hence ab2 has 4,175 rows and 9 columns.

    dim(ab2)

To see how things have changed, we can again compute the correlation coefficient and make a scatterplot of height and length, but this time using ab2.

    cor(ab2$length,ab2$height)
    plot(ab2$length,ab2$height)

5. What do you notice? What has happened to the correlation coefficient? Does this agree with what you see in the new scatterplot?

Just like the command pairs that we used above, the command cor can also return a matrix of correlation coefficients between all the pairs in a dataset. Try the following two commands.

    cor(ab2[,1:3])
    pairs(ab2[,1:3])

(Note that plots involving rings will have stripes because the ring count from abalone shells is a discrete variable. It has a large number of levels, but it's still discrete.)

6. Do the relationships in these plots agree with your intuition about the correlation coefficients? Explain using a pair with high correlation and one with relatively low correlation. The diagonal elements in cor(ab2[,1:3]) are all 1. Does that make sense?

From this point on, we will work with ab2 rather than ab. If at some point you exit from R and don't save your work, you will have to retype the command creating ab2.
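The effect that a couple of stray points can have on the correlation coefficient is easy to reproduce on your own. The sketch below uses simulated data rather than ab, so the numbers are illustrative only: a single extreme y-value appended to a tightly linear cloud noticeably drags the correlation down, much as the two oversized heights did above.

```r
set.seed(13)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.3)   # strongly linear relationship
cor(x, y)                       # close to 1

# Append one outlier with an extreme y-value
x2 <- c(x, 0)
y2 <- c(y, 20)
cor(x2, y2)                     # noticeably smaller
```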

4. Modeling ring count

At the end of the last section, we saw plots of rings against the first two predictor variables. There seemed to be a lot of spread in these data. As noted in the introduction, the investigators at the Marine Resources Division acknowledge that several environmental factors can influence the growth of rings in abalone. That means there are variables relating to the condition of the water off the coast of Tasmania, as well as the weather patterns for the last 30 years, that might help explain ring counts. None of these data are available to us at this point, and hence we expect a certain amount of error in any model we build with our 8 predictor variables.

To begin the modeling process, we will just look at a simple linear model with only one predictor. In mathematical terms, we consider the following description of the data:

    rings = β0 + β1·length + U,                                        (0.1)

where U is a random error. Here, the error also accounts for all the variables that we can't measure (like weather patterns). We can compute least squares estimates for β0 and β1 in R with this command.

    fit = lm(rings ~ 1 + length, data=ab2)

(The symbol between rings and 1 is a tilde; on my keyboard it is above the single quote, just next to the "1".) The formula specifies the same relationship given in (0.1), where 1 represents the intercept. The summary table we studied in class is obtained with the next command.

    summary(fit)

7. What are the least squares estimates β̂0 and β̂1? What are their standard errors? Give a 95% confidence interval for β1 (since the number of samples here is large, use the approximation involving plus or minus two standard errors).

In Section 12.3.2 of the text, we see that for the simple linear model

    Y = β0 + β1·X + U,                                                 (0.2)

we can write down the least squares estimates explicitly. They are given by

    β̂1 = r · (sY / sX)    and    β̂0 = ȳ − β̂1·x̄,                      (0.3)

where ȳ and x̄ are the sample means of the response and predictor variables, respectively, and sY and sX are the sample standard deviations of the response and predictor variables, respectively. The sample correlation coefficient is denoted r. Taking rings as the response and length as the predictor variable, we can form the necessary ingredients in (0.3) with calls to mean(ab2$rings), mean(ab2$length), sd(ab2$rings), sd(ab2$length), and cor(ab2$length,ab2$rings).

8. Execute these commands and verify that the coefficients you get in summary(fit) agree with those in equation (0.3).

Next, we will consider a simple linear model involving height as the predictor rather than length.

    fit2 = lm(rings ~ 1 + height, data=ab2)
    summary(fit2)

What expression of the form (0.2) does this represent? Now, let's fit this model using the dataset that includes the outliers in height. Notice that we are now using data=ab and not data=ab2 as before.

    fit3 = lm(rings ~ 1 + height, data=ab)
    summary(fit3)

9. Do the least squares estimates change? By how much?

Bonus. Create a new dataset ab3 by removing two points from ab2. Use a command like the one in the previous section that we used to create ab2 from ab. Instead of 1418 and 2052, pick any two other rows. Refit the linear model using the option data=ab3 and look at the summary. How much did the coefficients change from fit2? Compare that to the change you observed by removing the two outliers in height.

Finally, we will consider a full model using all of our 8 predictor variables.

    bigfit = lm(rings ~ 1 + length + diameter + height + whole + shucked +
                viscera + shell + infant, data=ab2)
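Before turning to the summary of the full model, it is worth convincing yourself that equation (0.3) really does reproduce what lm computes. The sketch below uses simulated x and y rather than the abalone data, so it runs on its own.

```r
set.seed(1)
x <- rnorm(50, mean = 100, sd = 10)
y <- 3 + 0.1 * x + rnorm(50)

# Closed-form estimates from equation (0.3)
b1 <- cor(x, y) * sd(y) / sd(x)   # beta1-hat = r * sY / sX
b0 <- mean(y) - b1 * mean(x)      # beta0-hat = ybar - beta1-hat * xbar

# They agree with the least squares fit from lm
fit <- lm(y ~ 1 + x)
coef(fit)
c(b0, b1)
```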

    summary(bigfit)

10. The P-values in the last column represent a series of hypothesis tests. What null hypothesis is being tested in each case? What is the alternative? Which would we reject at the 0.05 level? (That is, which have P-values less than 0.05?)

5. Evaluating the fit

Continuing with the summary table from the full model bigfit, we have established the statistical significance of many of the variables. But what about the practical significance? That is, how much of an effect do these variables have on ring count? To study this, consider the variable whole. In Section 2 you probably observed that the median value of this variable is 159.9. Three abalone have whole weights of 159.9, and we can see them as follows.

    test = ab2[ab2$whole == 159.9,]
    test

With the data in each row of test, we can use our fitted model to make a prediction about these abalones' ages.

11. Using the coefficients in the summary table for bigfit and the values in the first row of test, compute the prediction from our model. Compare it to the true value of rings for this abalone.

In R we can compute the predictions using the following command.

    predict(bigfit,newdata=test)

Make sure the first value matches what you computed above. Now, let's change the value of whole for these abalone by 50g and see how our predictions change.

    test$whole = test$whole + 50
    predict(bigfit,newdata=test)

12. By how much have our predictions changed? Interpret this in terms of the coefficients in the summary table for bigfit. Comment on the change you might expect by varying a predictor other than whole weight.
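The behavior you should see in question 12 is a general property of linear models: adding 50 to one predictor shifts every prediction by exactly 50 times that predictor's coefficient, regardless of the other variables. Here is a self-contained sketch with simulated data standing in for bigfit and test.

```r
set.seed(2)
d <- data.frame(x = rnorm(40), z = rnorm(40))
d$y <- 1 + 2 * d$x - 0.5 * d$z + rnorm(40)
fit <- lm(y ~ 1 + x + z, data = d)

new1 <- data.frame(x = 0.3, z = 1.0)
new2 <- new1
new2$x <- new2$x + 50            # bump one predictor, as with whole + 50

p1 <- predict(fit, newdata = new1)
p2 <- predict(fit, newdata = new2)
p2 - p1                          # equals 50 * coef(fit)["x"]
```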

Finally, we consider an elaboration of the linear model represented by bigfit. Notice that the coefficient of length is not significant in bigfit. This means the data do not contain strong evidence that, once the other predictors are included, the coefficient on length differs from zero. We can explore this fact in more detail by making a scatterplot of the variable length against the residuals from the fit.

    plot(ab2$length,residuals(bigfit))

Here, the command residuals(bigfit) constructs the residuals from our fit involving all the variables. Give yourself a fairly big graphics window to see the relationship between the variables. What do you notice?

To give you a hint, let us try a slightly larger model. This time, we will use a quadratic in length. In R, this is done with a call to the function poly(length,2), where poly specifies that we want a polynomial in length and the degree of that polynomial is 2, or a quadratic.

    bigfit2 = lm(rings ~ 1 + poly(length,2) + diameter + height + whole + shucked +
                 viscera + shell + infant, data=ab2)
    summary(bigfit2)

In the summary you will now see two entries involving length. The first row is labeled poly(length,2)1 and is just length, while the second, poly(length,2)2, involves length^2. [1] To see how we have changed the fit, look at the plot of length against the residuals of this new fit, and compare it to the same plot for the previous fit.

    plot(ab2$length,residuals(bigfit2))

13. What has changed?

Admittedly, the effects here are somewhat subtle. There is still a lot of error in the data that we have not accounted for with this simple model. If we were to continue with this exercise, we would be led to transform the response variable rings and consider log(rings + 1.5) instead. This reduces some of the deficiencies in the final fit. Even after all of this work, however, the relationship is still noisy. As noted earlier, in this case, what we are calling noise can probably be explained by variables we do not have at our disposal, like weather patterns and ocean conditions. The point of this project was not to come up with the best explanation of ring growth, but rather to give you some experience working with linear models. Have a good summer!

[1] Actually, a bit more is being done to transform length, but all you need to know is that the first term is basically a straight line, while the second is a bowl-shaped quadratic.