Take-home Final. The questions you are expected to answer for this project are numbered and italicized. There is one bonus question. Good luck!

Size: px
Start display at page:

Download "Take-home Final. The questions you are expected to answer for this project are numbered and italicized. There is one bonus question. Good luck!"

Transcription

1 Take-home Final The data for our final project come from a study initiated by the Tasmanian Aquaculture and Fisheries Institute to investigate the growth patterns of abalone living along the Tasmanian coastline. The harvest of abalone is subject to quotas that restrict both the number of abalone that can be caught as well as their size. The population of abalone can be regulated effectively if there is a simple way to tell the age of abalone based solely on their appearance. Hence, researchers are interested in relating abalone age to variables like length, height and weight of the animal (measurements that a diver can take before they harvest the animal). Determining the actual age of an abalone is a bit like estimating the age of a tree. Rings are formed in the shell of the abalone as it grows, usually at the rate of one ring per year. Getting access to the rings of an abalone involves cutting the shell. After polishing and staining, a lab technician examines a shell sample under a microscope and counts the rings. Because some rings are hard to make out using this method, these researchers believed adding 1.5 to the ring count is a reasonable estimate of the abalone s age. The relationship between age and ring count, however, is somewhat controversial. Under certain conditions, abalone can grow more than one ring per year. These conditions relate to weather patterns and other environmental variables. These effects suggest that any relationship we find between ring count and size measurements taken from the animal is likely to involve a lot of error; there are plenty of important variables that have been left out of the study. With these caveats, we will use data collected on 4,177 abalone by the team of researchers from Tasmania. I was not able to find an electronic version of their actual report on these data, but I am making a related report by the same organization available on the course Web site. The questions you are expected to answer for this project are numbered and italicized. There is one bonus question. Good luck!

2 1. Getting started We begin by downloading the dataset of abalone measurements. source(" cocteau/stat13/data/final.r") ls() You should see (possibly among other things) the dataset ab. dim(ab) The output of this command indicates that ab is a matrix with 4,177 rows and 9 columns. Each row refers to a different abalone caught off the coast of Tasmania. For each of the 4,177 abalone caught, the researchers recorded a number of measurements. We describe these measurements in detail in Table 1. The following command lists the values of the variables for the first three abalone in the dataset. ab[1:3,] Recall from Lab 2 that R stores its datasets as matrices. With this command, we are specifying just the first three rows of ab. You should see something like the following. rings length diameter height whole shucked viscera shell infant The first column in this output is just the row number of the specimen in our dataset; in this case, we are just looking at rows 1, 2 and 3. The number of rings for these three abalone ranged from 7 to 15. We can approximately determine the age of an abalone in this study by adding 1.5 to the number of rings. This means that our three specimens were 16.5, 8.5 and 10.5 years old, respectively. Looking at the output a bit more, we see that their lengths varied from 70mm to 106mm, and their whole weights ranged from 45.1g to 135.4g. Looking at the last column in the output, we see that these three specimens were also mature (that is, they were past their infant stage). The 4,177 abalone in this dataset all had between 1 and 29 rings. If we use the

3 Variable name Units Description rings count number of rings length mm longest shell measurement diameter mm measured perpendicular to the length height mm height of abalone with meat in the shell whole grams weight of the whole abalone shucked grams weight of just the meat viscera grams gut weight after bleeding shell grams weight of shell after drying infant 0/1 1 if the abalone is an infant, 0 otherwise Table 1: Descriptions for the variables in the dataset ab. The first variable (column 1 in ab) is our response, the quantity we d like to predict. It represents the number of rings in an abalone s shell. The second 8 variables (columns 2 through 9 in ab) are potential predictors; we want to form a model that will allow us to predict ring count from these 8 measurements. approximate dating rule, that means they are between 2.5 and 30.5 years old. Examine the data for the oldest and youngest abalone in the dataset with the following two commands. ab[ab$rings==1,] ab[ab$rings==29,] In the first command, the == relation is asking for all the data in ab for abalone having just one ring; in the second command we are asking for the data for those abalone with 29 rings. 1. Do the sets of measurements make sense? That is, what aspects of the data agree with the fact that one of these abalone is very young and the other is very old? 2. A first look at the data We begin by looking at simple summaries of the separate predictor variables. But first we need a graphics window.

4 quartz() In Lab 2 we introduced the commands boxplot, hist and summary. Apply these tools to the separate variables to get a sense of their distributions. Take for example, the variable length, the widest point of the abalone. summary(ab$length) hist(ab$length) 2. Describe the distribution of length. Is it skewed? Bell-shaped? Repeat these summaries for a few of the variables to get a feel for the data. Report on at least one variable other than length. Next, consider the four variables related to the weight of the abalones, whole, shucked, viscera, and shell. These are stored in columns 5 through 8 of ab. Look at a simple summary and a histogram for each variable. To compare the values for each weight measurement directly, we can look at a boxplot. boxplot(ab[,5:8]) Again, since this dataset is basically a matrix in R, the index 5:8 in the command above refers to just columns 5 through 8. What do you see in the plot? Do these relationships agree with what you would expect given their description in Table 1? To dig a bit deeper, we will make a simple scatterplot of shucked versus whole. (The second command adds a reference line with intercept 0 and slope 1.) plot(ab$shucked,ab$whole) abline(0,1) 3. What do you see from this distribution? Is there a trend? How does the spread vary as the shucked weight (the weight of just the meat of the abalone) increases? Since whole is meant to describe the weight of the entire animal, and shucked just one part of that weight, what can we say about the four points that are below our reference line? Do these make sense? Make the same plot for ab$viscera and ab$shucked (in this case, viscera is the weight of the meat after bleeding and should be less than shucked, the weight of the meat before bleeding). Do you see any questionable points in this plot? In R, we can make scatterplots between several pairs of variables at one time. For

5 example, we can consider all possible pairs of scatterplots from the 4 variables related to an abalone s weight with the following command. pairs(ab[,5:8]) Note that in this plot, each cell is a scatterplot between two variables. The graph in the second row and first column is a scatterplot of whole and shucked, while the graph in the fourth row and second column is a plot of shucked and shell. What do you see? Finally, we examine rings, our response variable. You can look at the five-number summary as well as a histogram of this variable. Because this variable represents the number of rings, it is discrete. We can see a tabulation of all the values with the following command. table(ab$rings) 4. What ring count is most prevalent in our dataset? What approximate age does this represent? 3. Correlation We now consider more formal descriptions of the relationships between two variables. In Section of your text, the authors introduce the correlation coefficient. This quantity measures the degree to which the data in a simple scatterplot lie on a straight line. See your text for details on how you would compute the correlation coefficient from a sample of data. In R, we can use the command cor. cor(ab$length,ab$height) What value do you get? Use the description in your text on page 541 together with the graphs in Figure to guess what the scatterplot of length and height might look like. Now, let s make this plot and see if your guess was correct plot(ab$length,ab$height) What do you notice? There seems to be a strong trend, but there also appears to be two outliers. Technically, outliers are points that don t seem to agree with the rest of the data in some way. Depending on their size, they can have a big impact

6 on some of our analyses. In this case, the two points have extremely large values of height. These points correspond to rows 1418 and We can remove them and create a new dataset ab2 with the following command. ab2 = ab[-c(1418,2052),] Again, we are indexing rows in this command, but the minus sign tells R that we want to leave these rows out. Hence ab2 has 4175 rows and 9 columns. dim(ab2) To see how things have changed, we can again compute the correlation coefficient and make a scatterplot of height and length, but this time using ab2. cor(ab2$length,ab2$height) plot(ab2$length,ab2$height) 5. What do you notice? What has happened to the correlation coefficient? Does this agree with what you see in the new scatterplot? Just like the command pairs that we used above, the command cor can also return a matrix of correlation coefficients between all the pairs in a dataset. Try the following two commands. cor(ab2[,1:3]) pairs(ab2[,1:3]) (Note that plots involving rings will have stripes because the ring count from abalone shells is a discrete variable. It has a large number of levels, but it s still discrete.) 6. Do the relationships in these plots agree with your intuition about the correlation coefficients? Explain using a pair with high correlation and one with relatively low correlation. The diagonal elements in cor(ab2[,1:3]) are all 1. Does that make sense? From this point on, we will work with ab2 rather than ab. If at some point you exit from R and don t save your work, you will have to retype the command creating ab2.

7 4. Modeling ring count At the end of the last section, we saw plots of rings against the first two predictor variables. There seemed to be a lot of spread in these data. As noted in the introduction, the investigators at the Marine Resources Division acknowledge that several environmental factors can influence the growth of rings in abalone. That means, there are variables relating to the condition of the water off the coast of Tasmania as well as the weather patterns for the last 30 years that might help explain ring counts. None of these data are available to us at this point, and hence we expect a certain amount of error in any model we build with our 8 predictor variables. To begin the modeling process, we will just look at a simple linear model with only one predictor. In mathematical terms we consider the following description of the data: rings = β 0 + β 1 length + U, (0.1) where U is a random error. Here, the error also accounts for all the variables that we can t measure (like weather patterns). We can compute least squares estimates for β 0 and β 1 in R with this command. fit = lm(rings 1+length,data=ab2) (The symbol between rings and 1 is a tilde; on my keyboard it is above the single quote, just next to the 1. ) The formula specifies the same relationship given in (0.1), where 1 represents the intercept. The summary table we studied in class is obtained with next command. summary(fit) 7. What are the least squares estimates ˆβ 0 and ˆβ 1? What are their standard errors? Give a 95% confidence interval for β 1 (since the number of samples here is large, use the approximation involving plus or minus two standard errors). In Section of the text, we see that for the simple linear model Y = β 0 + β 1 X + U, (0.2)

8 we can write down the least squares estimates explicitly. They are given by ˆβ 1 = r s Y s X and ˆβ0 = ȳ ˆβ 1 x (0.3) where ȳ and x are the sample means of the response and predictor variables, respectively; and s Y and s X are the sample standard deviations of the response and predictor variables, respectively. The sample correlation coefficient is denoted r. Taking rings as a response and length as the predictor variable, we can form the necessary ingredients in (0.3) with calls to mean(ab2$rings), mean(ab2$length), sd(ab2$rings), sd(ab2$length), and cor(ab2$length,ab2$rings). 8. Execute these commands and verify that the coefficients you get in summary(fit) agree with those in equation (0.3). Next, we will consider a simple linear model involving height as the predictor rather than length. fit2 = lm(rings 1+height,data=ab2) summary(fit2) What expression of the form (0.2) does this represent? Now, let s fit this model using the dataset that includes the outliers in height. Notice that we are now using data=ab and not data=ab2 as before. fit3 = lm(rings 1+height,data=ab) summary(fit3) 9. Do the least squares estimates change? By how much? Bonus. Create a new dataset ab3 by removing two points from ab2. Use a command like the one in the previous section that we used to create ab2 from ab. Instead of 1418 and 2052, pick any two other rows. Refit the linear model using the option data=ab3 and look at the summary. How much did the coefficients change from fit2? Compare that to the change you observed by removing the two outliers in height. Finally, we will consider a full model using all of our 8 predictor variables. bigfit = lm(rings 1+length+diameter+height+whole+shucked+ viscera+shell+infant,data=ab2)

9 summary(bigfit) 10. The P-values in the last column represent a series of hypothesis tests. What null hypothesis is being tested in each case? What is the alternative? Which would we reject at the 0.05 level? (That is, which have P-values less than 0.05?) 5. Evaluating the fit Continuing with the summary table from the full bigfit, we have established statistical significance of many of the variables. But what about the practical significance? That is, how much of an effect do these variables have on ring count? To study this, consider the variable whole. In Section 2 you probably observed that the median value of this variable is Three abalone have whole weights of and we can see them as follows. test = ab2[ab2$whole == 159.9,] test With the data in each row of test, we can use our fitted model to make a prediction about these abalone s ages. 11. Using the coefficients in the summary table for bigfit and the values in the first row of test, compute the prediction from our model. Compare it to the true value for rings for this abalone. In R we can compute the predictions using the following command. predict(bigfit,newdata=test) Make sure the first value matches what you computed above. Now, let s change the value of whole for these abalone by 50g and see how our predictions change. test$whole = test$whole + 50 predict(bigfit,newdata=test) 12. By how much have our predictions changed? Interpret this in terms of the coefficients in the summary table for bigfit. Comment on the change you might expect by varying a predictor other than whole weight.

10 Finally, we consider an elaboration of the linear model represented by bigfit. Notice that the coefficient of length is not significant in bigfit. This means the data do not contain strong evidence to suggest that our linear model depends on length, or, rather, that the coefficient on length is not zero. We can explore this fact in more detail by making a scatterplot of the variable length and the residuals from the fit. plot(ab2$length,residuals(bigfit)) Here, the command residuals(bigfit) constructs the residuals from our fit involving all the variables. Give yourself a fairly big graphics window to see the relationship between the variables. What do you notice? To give you a hint, let us try a slightly larger model. This time, we will use a quadratic in length. In R, this is done by using a call to the function poly(length,2) where poly specifies that we want a polynomial in length and the degree of that polynomial is 2, or a quadratic. bigfit2 = lm(rings 1+poly(length,2)+diameter+height+whole+shucked+ viscera+shell+infant,data=ab2) summary(bigfit2) In the summary you will now see two entries involving length. The first row is labeled poly(length,2)1 and is just length, while the second poly(length,2)2 involves length 2. 1 To see how we have changed the fit, look at the plot of length against the residuals of this new fit, and compare it to the same plot for the previous fit. plot(ab2$length,residuals(bigfit2)) 13. What has changed? Admittedly, the effects here are somewhat subtle. There is still a lot of error in the data that we have not accounted for with this simple model. If we were to continue with this exercise, we would be led to transform the response variable rings and consider log(rings + 1.5) instead. This reduces some of the deficiencies in the final fit. Even after all of this work, however, the relationship is still noisy. As noted 1 Actually, a bit more is being done to transform length, but all you need to know is that the first term is basically a straight line, while the second is a bowl-shaped quadratic.

11 earlier, in this case, what we are calling noise can probably be explained by variables we do not have at our disposal, like weather patterns and ocean conditions. The point of this project was not to come up with the best explanation of ring growth, but rather to give you some experience working with linear models. Have a good summer!

Business Statistics. Lecture 9: Simple Regression

Business Statistics. Lecture 9: Simple Regression Business Statistics Lecture 9: Simple Regression 1 On to Model Building! Up to now, class was about descriptive and inferential statistics Numerical and graphical summaries of data Confidence intervals

More information

1 Least Squares Estimation - multiple regression.

1 Least Squares Estimation - multiple regression. Introduction to multiple regression. Fall 2010 1 Least Squares Estimation - multiple regression. Let y = {y 1,, y n } be a n 1 vector of dependent variable observations. Let β = {β 0, β 1 } be the 2 1

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

MICROPIPETTE CALIBRATIONS

MICROPIPETTE CALIBRATIONS Physics 433/833, 214 MICROPIPETTE CALIBRATIONS I. ABSTRACT The micropipette set is a basic tool in a molecular biology-related lab. It is very important to ensure that the micropipettes are properly calibrated,

More information

Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS

Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS Stat 135, Fall 2006 A. Adhikari HOMEWORK 10 SOLUTIONS 1a) The model is cw i = β 0 + β 1 el i + ɛ i, where cw i is the weight of the ith chick, el i the length of the egg from which it hatched, and ɛ i

More information

INFERENCE FOR REGRESSION

INFERENCE FOR REGRESSION CHAPTER 3 INFERENCE FOR REGRESSION OVERVIEW In Chapter 5 of the textbook, we first encountered regression. The assumptions that describe the regression model we use in this chapter are the following. We

More information

AMS 7 Correlation and Regression Lecture 8

AMS 7 Correlation and Regression Lecture 8 AMS 7 Correlation and Regression Lecture 8 Department of Applied Mathematics and Statistics, University of California, Santa Cruz Suumer 2014 1 / 18 Correlation pairs of continuous observations. Correlation

More information

UNIT 12 ~ More About Regression

UNIT 12 ~ More About Regression ***SECTION 15.1*** The Regression Model When a scatterplot shows a relationship between a variable x and a y, we can use the fitted to the data to predict y for a given value of x. Now we want to do tests

More information

Warm-up Using the given data Create a scatterplot Find the regression line

Warm-up Using the given data Create a scatterplot Find the regression line Time at the lunch table Caloric intake 21.4 472 30.8 498 37.7 335 32.8 423 39.5 437 22.8 508 34.1 431 33.9 479 43.8 454 42.4 450 43.1 410 29.2 504 31.3 437 28.6 489 32.9 436 30.6 480 35.1 439 33.0 444

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math.

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math. Regression, part II I. What does it all mean? A) Notice that so far all we ve done is math. 1) One can calculate the Least Squares Regression Line for anything, regardless of any assumptions. 2) But, if

More information

STAT 350: Summer Semester Midterm 1: Solutions

STAT 350: Summer Semester Midterm 1: Solutions Name: Student Number: STAT 350: Summer Semester 2008 Midterm 1: Solutions 9 June 2008 Instructor: Richard Lockhart Instructions: This is an open book test. You may use notes, text, other books and a calculator.

More information

appstats27.notebook April 06, 2017

appstats27.notebook April 06, 2017 Chapter 27 Objective Students will conduct inference on regression and analyze data to write a conclusion. Inferences for Regression An Example: Body Fat and Waist Size pg 634 Our chapter example revolves

More information

Chapter 27 Summary Inferences for Regression

Chapter 27 Summary Inferences for Regression Chapter 7 Summary Inferences for Regression What have we learned? We have now applied inference to regression models. Like in all inference situations, there are conditions that we must check. We can test

More information

Density Temp vs Ratio. temp

Density Temp vs Ratio. temp Temp Ratio Density 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Density 0.0 0.2 0.4 0.6 0.8 1.0 1. (a) 170 175 180 185 temp 1.0 1.5 2.0 2.5 3.0 ratio The histogram shows that the temperature measures have two peaks,

More information

Multiple Regression Theory 2006 Samuel L. Baker

Multiple Regression Theory 2006 Samuel L. Baker MULTIPLE REGRESSION THEORY 1 Multiple Regression Theory 2006 Samuel L. Baker Multiple regression is regression with two or more independent variables on the right-hand side of the equation. Use multiple

More information

MATH 1150 Chapter 2 Notation and Terminology

MATH 1150 Chapter 2 Notation and Terminology MATH 1150 Chapter 2 Notation and Terminology Categorical Data The following is a dataset for 30 randomly selected adults in the U.S., showing the values of two categorical variables: whether or not the

More information

Matrices and vectors A matrix is a rectangular array of numbers. Here s an example: A =

Matrices and vectors A matrix is a rectangular array of numbers. Here s an example: A = Matrices and vectors A matrix is a rectangular array of numbers Here s an example: 23 14 17 A = 225 0 2 This matrix has dimensions 2 3 The number of rows is first, then the number of columns We can write

More information

Computer projects for Mathematical Statistics, MA 486. Some practical hints for doing computer projects with MATLAB:

Computer projects for Mathematical Statistics, MA 486. Some practical hints for doing computer projects with MATLAB: Computer projects for Mathematical Statistics, MA 486. Some practical hints for doing computer projects with MATLAB: You can save your project to a text file (on a floppy disk or CD or on your web page),

More information

Math Precalculus I University of Hawai i at Mānoa Spring

Math Precalculus I University of Hawai i at Mānoa Spring Math 135 - Precalculus I University of Hawai i at Mānoa Spring - 2013 Created for Math 135, Spring 2008 by Lukasz Grabarek and Michael Joyce Send comments and corrections to lukasz@math.hawaii.edu Contents

More information

Uncertainty, Error, and Precision in Quantitative Measurements an Introduction 4.4 cm Experimental error

Uncertainty, Error, and Precision in Quantitative Measurements an Introduction 4.4 cm Experimental error Uncertainty, Error, and Precision in Quantitative Measurements an Introduction Much of the work in any chemistry laboratory involves the measurement of numerical quantities. A quantitative measurement

More information

Chapter 3. Measuring data

Chapter 3. Measuring data Chapter 3 Measuring data 1 Measuring data versus presenting data We present data to help us draw meaning from it But pictures of data are subjective They re also not susceptible to rigorous inference Measuring

More information

1 Introduction to Minitab

1 Introduction to Minitab 1 Introduction to Minitab Minitab is a statistical analysis software package. The software is freely available to all students and is downloadable through the Technology Tab at my.calpoly.edu. When you

More information

Probability Distributions

Probability Distributions CONDENSED LESSON 13.1 Probability Distributions In this lesson, you Sketch the graph of the probability distribution for a continuous random variable Find probabilities by finding or approximating areas

More information

LAB 5 INSTRUCTIONS LINEAR REGRESSION AND CORRELATION

LAB 5 INSTRUCTIONS LINEAR REGRESSION AND CORRELATION LAB 5 INSTRUCTIONS LINEAR REGRESSION AND CORRELATION In this lab you will learn how to use Excel to display the relationship between two quantitative variables, measure the strength and direction of the

More information

Regression Analysis in R

Regression Analysis in R Regression Analysis in R 1 Purpose The purpose of this activity is to provide you with an understanding of regression analysis and to both develop and apply that knowledge to the use of the R statistical

More information

y = a + bx 12.1: Inference for Linear Regression Review: General Form of Linear Regression Equation Review: Interpreting Computer Regression Output

y = a + bx 12.1: Inference for Linear Regression Review: General Form of Linear Regression Equation Review: Interpreting Computer Regression Output 12.1: Inference for Linear Regression Review: General Form of Linear Regression Equation y = a + bx y = dependent variable a = intercept b = slope x = independent variable Section 12.1 Inference for Linear

More information

Regression. Marc H. Mehlman University of New Haven

Regression. Marc H. Mehlman University of New Haven Regression Marc H. Mehlman marcmehlman@yahoo.com University of New Haven the statistician knows that in nature there never was a normal distribution, there never was a straight line, yet with normal and

More information

Quadratic Equations Part I

Quadratic Equations Part I Quadratic Equations Part I Before proceeding with this section we should note that the topic of solving quadratic equations will be covered in two sections. This is done for the benefit of those viewing

More information

Chapter 8. Linear Regression. Copyright 2010 Pearson Education, Inc.

Chapter 8. Linear Regression. Copyright 2010 Pearson Education, Inc. Chapter 8 Linear Regression Copyright 2010 Pearson Education, Inc. Fat Versus Protein: An Example The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu: Copyright

More information

Regression Analysis and Forecasting Prof. Shalabh Department of Mathematics and Statistics Indian Institute of Technology-Kanpur

Regression Analysis and Forecasting Prof. Shalabh Department of Mathematics and Statistics Indian Institute of Technology-Kanpur Regression Analysis and Forecasting Prof. Shalabh Department of Mathematics and Statistics Indian Institute of Technology-Kanpur Lecture 10 Software Implementation in Simple Linear Regression Model using

More information

Inferences for Regression

Inferences for Regression Inferences for Regression An Example: Body Fat and Waist Size Looking at the relationship between % body fat and waist size (in inches). Here is a scatterplot of our data set: Remembering Regression In

More information

Section 3: Simple Linear Regression

Section 3: Simple Linear Regression Section 3: Simple Linear Regression Carlos M. Carvalho The University of Texas at Austin McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction

More information

WISE Regression/Correlation Interactive Lab. Introduction to the WISE Correlation/Regression Applet

WISE Regression/Correlation Interactive Lab. Introduction to the WISE Correlation/Regression Applet WISE Regression/Correlation Interactive Lab Introduction to the WISE Correlation/Regression Applet This tutorial focuses on the logic of regression analysis with special attention given to variance components.

More information

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij = K. Model Diagnostics We ve already seen how to check model assumptions prior to fitting a one-way ANOVA. Diagnostics carried out after model fitting by using residuals are more informative for assessing

More information

Math Precalculus I University of Hawai i at Mānoa Spring

Math Precalculus I University of Hawai i at Mānoa Spring Math 135 - Precalculus I University of Hawai i at Mānoa Spring - 2014 Created for Math 135, Spring 2008 by Lukasz Grabarek and Michael Joyce Send comments and corrections to lukasz@math.hawaii.edu Contents

More information

Sampling Distributions in Regression. Mini-Review: Inference for a Mean. For data (x 1, y 1 ),, (x n, y n ) generated with the SRM,

Sampling Distributions in Regression. Mini-Review: Inference for a Mean. For data (x 1, y 1 ),, (x n, y n ) generated with the SRM, Department of Statistics The Wharton School University of Pennsylvania Statistics 61 Fall 3 Module 3 Inference about the SRM Mini-Review: Inference for a Mean An ideal setup for inference about a mean

More information

22s:152 Applied Linear Regression. Chapter 6: Statistical Inference for Regression

22s:152 Applied Linear Regression. Chapter 6: Statistical Inference for Regression 22s:152 Applied Linear Regression Chapter 6: Statistical Inference for Regression Simple Linear Regression Assumptions for inference Key assumptions: linear relationship (between Y and x) *we say the relationship

More information

GRE Quantitative Reasoning Practice Questions

GRE Quantitative Reasoning Practice Questions GRE Quantitative Reasoning Practice Questions y O x 7. The figure above shows the graph of the function f in the xy-plane. What is the value of f (f( ))? A B C 0 D E Explanation Note that to find f (f(

More information

Section 29: What s an Inverse?

Section 29: What s an Inverse? Section 29: What s an Inverse? Our investigations in the last section showed that all of the matrix operations had an identity element. The identity element for addition is, for obvious reasons, called

More information

MAT Mathematics in Today's World

MAT Mathematics in Today's World MAT 1000 Mathematics in Today's World Last Time 1. Three keys to summarize a collection of data: shape, center, spread. 2. Can measure spread with the fivenumber summary. 3. The five-number summary can

More information

Results and Analysis 10/4/2012. EE145L Lab 1, Linear Regression

Results and Analysis 10/4/2012. EE145L Lab 1, Linear Regression EE145L Lab 1, Linear Regression 10/4/2012 Abstract We examined multiple sets of data to assess the relationship between the variables, linear or non-linear, in addition to studying ways of transforming

More information

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651 Data Analysis and Statistical Methods Statistics 651 http://www.stat.tamu.edu/~suhasini/teaching.html Lecture 31 (MWF) Review of test for independence and starting with linear regression Suhasini Subba

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Regression, Part I. - In correlation, it would be irrelevant if we changed the axes on our graph.

Regression, Part I. - In correlation, it would be irrelevant if we changed the axes on our graph. Regression, Part I I. Difference from correlation. II. Basic idea: A) Correlation describes the relationship between two variables, where neither is independent or a predictor. - In correlation, it would

More information

The scatterplot is the basic tool for graphically displaying bivariate quantitative data.

The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Example: Some investors think that the performance of the stock market in January

More information

Chapter 9: Roots and Irrational Numbers

Chapter 9: Roots and Irrational Numbers Chapter 9: Roots and Irrational Numbers Index: A: Square Roots B: Irrational Numbers C: Square Root Functions & Shifting D: Finding Zeros by Completing the Square E: The Quadratic Formula F: Quadratic

More information

This is particularly true if you see long tails in your data. What are you testing? That the two distributions are the same!

This is particularly true if you see long tails in your data. What are you testing? That the two distributions are the same! Two sample tests (part II): What to do if your data are not distributed normally: Option 1: if your sample size is large enough, don't worry - go ahead and use a t-test (the CLT will take care of non-normal

More information

Experiment 1: The Same or Not The Same?

Experiment 1: The Same or Not The Same? Experiment 1: The Same or Not The Same? Learning Goals After you finish this lab, you will be able to: 1. Use Logger Pro to collect data and calculate statistics (mean and standard deviation). 2. Explain

More information

Learning Packet. Lesson 5b Solving Quadratic Equations THIS BOX FOR INSTRUCTOR GRADING USE ONLY

Learning Packet. Lesson 5b Solving Quadratic Equations THIS BOX FOR INSTRUCTOR GRADING USE ONLY Learning Packet Student Name Due Date Class Time/Day Submission Date THIS BOX FOR INSTRUCTOR GRADING USE ONLY Mini-Lesson is complete and information presented is as found on media links (0 5 pts) Comments:

More information

Linear Regression. Linear Regression. Linear Regression. Did You Mean Association Or Correlation?

Linear Regression. Linear Regression. Linear Regression. Did You Mean Association Or Correlation? Did You Mean Association Or Correlation? AP Statistics Chapter 8 Be careful not to use the word correlation when you really mean association. Often times people will incorrectly use the word correlation

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Seismicity of Northern California

Seismicity of Northern California Homework 4 Seismicity of Northern California Purpose: Learn how we make predictions about seismicity from the relationship between the number of earthquakes that occur and their magnitudes. We know that

More information

Linear Regression and Correlation. February 11, 2009

Linear Regression and Correlation. February 11, 2009 Linear Regression and Correlation February 11, 2009 The Big Ideas To understand a set of data, start with a graph or graphs. The Big Ideas To understand a set of data, start with a graph or graphs. If

More information

Lab 3 A Quick Introduction to Multiple Linear Regression Psychology The Multiple Linear Regression Model

Lab 3 A Quick Introduction to Multiple Linear Regression Psychology The Multiple Linear Regression Model Lab 3 A Quick Introduction to Multiple Linear Regression Psychology 310 Instructions.Work through the lab, saving the output as you go. You will be submitting your assignment as an R Markdown document.

More information

1 A Review of Correlation and Regression

1 A Review of Correlation and Regression 1 A Review of Correlation and Regression SW, Chapter 12 Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then

More information

1 The Classic Bivariate Least Squares Model

1 The Classic Bivariate Least Squares Model Review of Bivariate Linear Regression Contents 1 The Classic Bivariate Least Squares Model 1 1.1 The Setup............................... 1 1.2 An Example Predicting Kids IQ................. 1 2 Evaluating

More information

Lab 2 Worksheet. Problems. Problem 1: Geometry and Linear Equations

Lab 2 Worksheet. Problems. Problem 1: Geometry and Linear Equations Lab 2 Worksheet Problems Problem : Geometry and Linear Equations Linear algebra is, first and foremost, the study of systems of linear equations. You are going to encounter linear systems frequently in

More information

Introduction to Algebra: The First Week

Introduction to Algebra: The First Week Introduction to Algebra: The First Week Background: According to the thermostat on the wall, the temperature in the classroom right now is 72 degrees Fahrenheit. I want to write to my friend in Europe,

More information

Inference with Simple Regression

Inference with Simple Regression 1 Introduction Inference with Simple Regression Alan B. Gelder 06E:071, The University of Iowa 1 Moving to infinite means: In this course we have seen one-mean problems, twomean problems, and problems

More information

Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee

Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Lecture - 04 Basic Statistics Part-1 (Refer Slide Time: 00:33)

More information

STEP Support Programme. Hints and Partial Solutions for Assignment 17

STEP Support Programme. Hints and Partial Solutions for Assignment 17 STEP Support Programme Hints and Partial Solutions for Assignment 7 Warm-up You need to be quite careful with these proofs to ensure that you are not assuming something that should not be assumed. For

More information

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651 y 1 2 3 4 5 6 7 x Data Analysis and Statistical Methods Statistics 651 http://www.stat.tamu.edu/~suhasini/teaching.html Lecture 32 Suhasini Subba Rao Previous lecture We are interested in whether a dependent

More information

Let the x-axis have the following intervals:

Let the x-axis have the following intervals: 1 & 2. For the following sets of data calculate the mean and standard deviation. Then graph the data as a frequency histogram on the corresponding set of axes. Set 1: Length of bass caught in Conesus Lake

More information

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.]

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.] Math 43 Review Notes [Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty Dot Product If v (v, v, v 3 and w (w, w, w 3, then the

More information

Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data.

Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Example: Some investors think that the performance of the stock market in January

More information

appstats8.notebook October 11, 2016

appstats8.notebook October 11, 2016 Chapter 8 Linear Regression Objective: Students will construct and analyze a linear model for a given set of data. Fat Versus Protein: An Example pg 168 The following is a scatterplot of total fat versus

More information

Stat 101 L: Laboratory 5

Stat 101 L: Laboratory 5 Stat 101 L: Laboratory 5 The first activity revisits the labeling of Fun Size bags of M&Ms by looking distributions of Total Weight of Fun Size bags and regular size bags (which have a label weight) of

More information

Using Microsoft Excel

Using Microsoft Excel Using Microsoft Excel Objective: Students will gain familiarity with using Excel to record data, display data properly, use built-in formulae to do calculations, and plot and fit data with linear functions.

More information

Analyzing the NYC Subway Dataset

Analyzing the NYC Subway Dataset PROJECT REPORT Analyzing the NYC Subway Dataset Short Questions Overview This project consists of two parts. In Part 1 of the project, you should have completed the questions in Problem Sets 2, 3, 4, and

More information

Chapter 8. Linear Regression. The Linear Model. Fat Versus Protein: An Example. The Linear Model (cont.) Residuals

Chapter 8. Linear Regression. The Linear Model. Fat Versus Protein: An Example. The Linear Model (cont.) Residuals Chapter 8 Linear Regression Copyright 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Slide 8-1 Copyright 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Fat Versus

More information

BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression

BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression Introduction to Correlation and Regression The procedures discussed in the previous ANOVA labs are most useful in cases where we are interested

More information

Extreme Values and Positive/ Negative Definite Matrix Conditions

Extreme Values and Positive/ Negative Definite Matrix Conditions Extreme Values and Positive/ Negative Definite Matrix Conditions James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University November 8, 016 Outline 1

More information

Regression Analysis IV... More MLR and Model Building

Regression Analysis IV... More MLR and Model Building Regression Analysis IV... More MLR and Model Building This session finishes up presenting the formal methods of inference based on the MLR model and then begins discussion of "model building" (use of regression

More information

STAT 3022 Spring 2007

STAT 3022 Spring 2007 Simple Linear Regression Example These commands reproduce what we did in class. You should enter these in R and see what they do. Start by typing > set.seed(42) to reset the random number generator so

More information

AP Statistics. The only statistics you can trust are those you falsified yourself. RE- E X P R E S S I N G D A T A ( P A R T 2 ) C H A P 9

AP Statistics. The only statistics you can trust are those you falsified yourself. RE- E X P R E S S I N G D A T A ( P A R T 2 ) C H A P 9 AP Statistics 1 RE- E X P R E S S I N G D A T A ( P A R T 2 ) C H A P 9 The only statistics you can trust are those you falsified yourself. Sir Winston Churchill (1874-1965) (Attribution to Churchill is

More information

STAT 350. Assignment 4

STAT 350. Assignment 4 STAT 350 Assignment 4 1. For the Mileage data in assignment 3 conduct a residual analysis and report your findings. I used the full model for this since my answers to assignment 3 suggested we needed the

More information

regression analysis is a type of inferential statistics which tells us whether relationships between two or more variables exist

regression analysis is a type of inferential statistics which tells us whether relationships between two or more variables exist regression analysis is a type of inferential statistics which tells us whether relationships between two or more variables exist sales $ (y - dependent variable) advertising $ (x - independent variable)

More information

L21: Chapter 12: Linear regression

L21: Chapter 12: Linear regression L21: Chapter 12: Linear regression Department of Statistics, University of South Carolina Stat 205: Elementary Statistics for the Biological and Life Sciences 1 / 37 So far... 12.1 Introduction One sample

More information

Math Lab 10: Differential Equations and Direction Fields Complete before class Wed. Feb. 28; Due noon Thu. Mar. 1 in class

Math Lab 10: Differential Equations and Direction Fields Complete before class Wed. Feb. 28; Due noon Thu. Mar. 1 in class Matter & Motion Winter 2017 18 Name: Math Lab 10: Differential Equations and Direction Fields Complete before class Wed. Feb. 28; Due noon Thu. Mar. 1 in class Goals: 1. Gain exposure to terminology and

More information

Modern Algebra Prof. Manindra Agrawal Department of Computer Science and Engineering Indian Institute of Technology, Kanpur

Modern Algebra Prof. Manindra Agrawal Department of Computer Science and Engineering Indian Institute of Technology, Kanpur Modern Algebra Prof. Manindra Agrawal Department of Computer Science and Engineering Indian Institute of Technology, Kanpur Lecture 02 Groups: Subgroups and homomorphism (Refer Slide Time: 00:13) We looked

More information

The SuperBall Lab. Objective. Instructions

The SuperBall Lab. Objective. Instructions 1 The SuperBall Lab Objective This goal of this tutorial lab is to introduce data analysis techniques by examining energy loss in super ball collisions. Instructions This laboratory does not have to be

More information

STA 114: Statistics. Notes 21. Linear Regression

STA 114: Statistics. Notes 21. Linear Regression STA 114: Statistics Notes 1. Linear Regression Introduction A most widely used statistical analysis is regression, where one tries to explain a response variable Y by an explanatory variable X based on

More information

At the start of the term, we saw the following formula for computing the sum of the first n integers:

At the start of the term, we saw the following formula for computing the sum of the first n integers: Chapter 11 Induction This chapter covers mathematical induction. 11.1 Introduction to induction At the start of the term, we saw the following formula for computing the sum of the first n integers: Claim

More information

Describing Distributions with Numbers

Describing Distributions with Numbers Describing Distributions with Numbers Using graphs, we could determine the center, spread, and shape of the distribution of a quantitative variable. We can also use numbers (called summary statistics)

More information

STATISTICS 110/201 PRACTICE FINAL EXAM

STATISTICS 110/201 PRACTICE FINAL EXAM STATISTICS 110/201 PRACTICE FINAL EXAM Questions 1 to 5: There is a downloadable Stata package that produces sequential sums of squares for regression. In other words, the SS is built up as each variable

More information

Experiment 1: Linear Regression

Experiment 1: Linear Regression Experiment 1: Linear Regression August 27, 2018 1 Description This first exercise will give you practice with linear regression. These exercises have been extensively tested with Matlab, but they should

More information

EXPERIMENT 2 Reaction Time Objectives Theory

EXPERIMENT 2 Reaction Time Objectives Theory EXPERIMENT Reaction Time Objectives to make a series of measurements of your reaction time to make a histogram, or distribution curve, of your measured reaction times to calculate the "average" or mean

More information

Lesson 26: Characterization of Parallel Lines

Lesson 26: Characterization of Parallel Lines Student Outcomes Students know that when a system of linear equations has no solution, i.e., no point of intersection of the lines, then the lines are parallel. Lesson Notes The discussion that begins

More information

Note that we are looking at the true mean, μ, not y. The problem for us is that we need to find the endpoints of our interval (a, b).

Note that we are looking at the true mean, μ, not y. The problem for us is that we need to find the endpoints of our interval (a, b). Confidence Intervals 1) What are confidence intervals? Simply, an interval for which we have a certain confidence. For example, we are 90% certain that an interval contains the true value of something

More information

Chapter 3 - Linear Regression

Chapter 3 - Linear Regression Chapter 3 - Linear Regression Lab Solution 1 Problem 9 First we will read the Auto" data. Note that most datasets referred to in the text are in the R package the authors developed. So we just need to

More information

1 Newton s 2nd and 3rd Laws

1 Newton s 2nd and 3rd Laws Physics 13 - Winter 2007 Lab 2 Instructions 1 Newton s 2nd and 3rd Laws 1. Work through the tutorial called Newton s Second and Third Laws on pages 31-34 in the UW Tutorials in Introductory Physics workbook.

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

Simple linear regression

Simple linear regression Simple linear regression Biometry 755 Spring 2008 Simple linear regression p. 1/40 Overview of regression analysis Evaluate relationship between one or more independent variables (X 1,...,X k ) and a single

More information

Survey on Population Mean

Survey on Population Mean MATH 203 Survey on Population Mean Dr. Neal, Spring 2009 The first part of this project is on the analysis of a population mean. You will obtain data on a specific measurement X by performing a random

More information

Describing distributions with numbers

Describing distributions with numbers Describing distributions with numbers A large number or numerical methods are available for describing quantitative data sets. Most of these methods measure one of two data characteristics: The central

More information

3 Non-linearities and Dummy Variables

3 Non-linearities and Dummy Variables 3 Non-linearities and Dummy Variables Reading: Kennedy (1998) A Guide to Econometrics, Chapters 3, 5 and 6 Aim: The aim of this section is to introduce students to ways of dealing with non-linearities

More information

Math 148. Polynomial Graphs

Math 148. Polynomial Graphs Math 148 Lab 1 Polynomial Graphs Due: Monday Wednesday, April April 10 5 Directions: Work out each problem on a separate sheet of paper, and write your answers on the answer sheet provided. Submit the

More information

Scatter plot of data from the study. Linear Regression

Scatter plot of data from the study. Linear Regression 1 2 Linear Regression Scatter plot of data from the study. Consider a study to relate birthweight to the estriol level of pregnant women. The data is below. i Weight (g / 100) i Weight (g / 100) 1 7 25

More information

Lesson 5b Solving Quadratic Equations

Lesson 5b Solving Quadratic Equations Lesson 5b Solving Quadratic Equations In this lesson, we will continue our work with Quadratics in this lesson and will learn several methods for solving quadratic equations. The first section will introduce

More information