HOMEWORK ANALYSIS #3 - WATER AVAILABILITY (DATA FROM WEISBERG 2014)

1. In your own words, summarize the overarching problem and any specific questions that need to be answered using the water data. Discuss how statistical modeling will be able to answer the posed questions.

To receive full credit for this question, students should:
(a) (2.5pts) State that we need to predict how much water runoff will result from snowfall.
(b) (2.5pts) State that statistical modeling allows us to use the linear relationship to get the prediction.
(c) (Example) California is experiencing a drought, so being able to predict how much water will be available from runoff from the Sierra Nevada Mountains would be very useful for determining what restrictions on water consumption need to be put in place to make sure there is enough water. If we can determine that a linear relationship between snowfall and runoff exists, we should be able to model that relationship and use snowfall to predict runoff.

2. Use the data to assess if a simple linear regression (SLR) model is suitable to analyze the water data. Justify your answer using any necessary graphics and relevant summary statistics that would suggest an SLR model would be successful at achieving the goals of the study. Provide discussion on why an SLR model is or is not appropriate.

To receive full credit for this question, students should:
(a) (3pts) Draw a scatterplot like the one below (make sure the x and y axes are labeled appropriately).
(b) (2pts) Comment that the scatterplot looks roughly linear with a correlation of r = 0.94, so a linear model is appropriate.
(c) (Example) If we plot the data on a scatterplot we can see that there seems to be a linear relationship between snowfall and runoff. In fact, with r = 0.94 we know that the linear relationship is quite strong. Therefore, an SLR model looks like an appropriate method to analyze the data.
Code for Problem 2:

    plot(water$precip, water$runoff, pch=19, xlab="snowfall", ylab="runoff")
    cor(water$precip, water$runoff)

3. Write out (in mathematical form with Greek letters) a justifiable SLR model that would help answer the questions in the problem. Clearly state any assumptions you are using in your model. Provide an interpretation of each mathematical term ($\beta$ parameter) included in your model. Using the mathematical form, discuss how your model, after fitting it to the data, will be able to answer the questions in this problem.

The model that I used is $\text{Runoff}_i = \beta_0 + \beta_1 \text{Snow}_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$. Students need all parts (including the $\epsilon$ and normal distribution piece). If they used a transformation or centered their x's, that is fine, but I didn't think it was necessary.

To receive full credit for this question, students should:
[Figure: scatterplot of Runoff (y-axis) versus Snowfall (x-axis)]

(a) (1pt) Write out the model $\text{Runoff}_i = \beta_0 + \beta_1 \text{Snow}_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$. Students need all parts (including the $\epsilon$ and normal distribution piece).
(b) (2pts) Interpret $\beta_0$ and $\beta_1$ appropriately.
(c) (2pts) Mention the assumptions of linearity, independence, normality, and equal variance.
(d) (Example) We will use the following model: $\text{Runoff}_i = \beta_0 + \beta_1 \text{Snow}_i + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$.
    i. $\beta_0$ - The y-intercept. For all years with no snow, runoff will be $\beta_0$ on average.
    ii. $\beta_1$ - The slope. As snow increases by one unit, runoff will increase by $\beta_1$, on average.
After fitting this model to the data we will be able to predict runoff for a given year by plugging the amount of snowfall for that year into the equation. To use this model we are assuming linearity, independence, normality, and equal variance.

4. Fit your model in #3 to the water data and summarize the results by displaying the fitted model in equation form (do NOT just provide a screen shot of the R or SAS output). Interpret each of the fitted parameters in the context of the problem. Provide a plot of the data with a fitted regression line.

My model was estimated to be $\widehat{\text{Runoff}} = 27014 + 3752.5\,\text{Snow}$, but if they used a transformation that is OK.

(a) (1pt) State their estimated equation in equation form (not a snapshot of R output).
(b) (2pts) Interpret each estimate in context.
(c) (2pts) Provide a plot of the fitted regression line like the one below.
(d) (Example) By fitting our data to the model we get the following equation: $\widehat{\text{Runoff}} = 27014 + 3752.5\,\text{Snow}$.
    i. 27014 is our estimate of $\beta_0$; for a year with no snow we would expect a runoff of 27014 acre-feet, on average.
    ii. 3752.5 is our estimate of $\beta_1$; for every one unit increase in snow, runoff increases by 3752.5 acre-feet, on average.

[Figure: scatterplot of Runoff (y-axis) versus Snowfall (x-axis) with the fitted regression line]

Code for Problem 4:

    my.lm <- lm(runoff ~ precip, data=water)
    plot(water$precip, water$runoff, pch=19, xlab="snowfall", ylab="runoff")
    abline(my.lm$coef[1], my.lm$coef[2], col="red", lwd=3)

5. Justify your model assumptions using appropriate graphics or summary statistics. Assess the fit and predictive capability of your model. Discuss on the level of your target audience (e.g. interpret your model $R^2$).

(a) (1pt) State their model $R^2$ (I got $R^2 = 0.88$).
(b) (2pts) Interpret $R^2$ in the context of the problem. E.g. the percent of variability in runoff explained by snowfall is $R^2$.
(c) (2pts) Look at plots like those below and appropriately justify the assumptions of linearity, independence, normality, and equal variance.
(d) (Example) We believe all four assumptions (linearity, independence, normality, and equal variance) are met. Looking at the scatterplot of the data we can see that there is a linear relationship. We can reasonably assume that runoff from any one year does not affect the runoff of any other year, so the observations are independent. The histogram of standardized residuals looks normal. Finally, based on the Residuals vs. Fitted Values plot we can see that the residuals vary relatively equally about the line. The $R^2$ for this data was 0.88. This means that 88% of the variability in runoff is explained by snowfall. Considering that such a high percentage of variability is explained by our model, we are happy with how well our model fits the data. After performing a cross-validation procedure using 6 randomly selected observations as our test data, we calculated a bias of -4212. This tells us that on average our predictions are too low. We also calculated the RPMSE to be 9218, telling us that our predictions miss the mark by about 9218 acre-feet on average.
Considering the scale of the data, 9218 acre-feet isn't a very large error, and the bias of -4212 isn't very large either. Therefore, we feel that our model predicts well enough to rely on it for predicting runoff for snowfall levels that are within the range of our data.
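For reference, the two cross-validation summaries discussed above can be written out as follows. This is a sketch of the usual definitions, matching how `bias` and `rpmse` are computed in the Problem 5 code; the symbols $n_{\text{test}}$ (number of held-out observations), $\hat{y}_i$ (predicted runoff), and $y_i$ (observed runoff) are notation introduced here and do not appear in the original key.

```latex
\text{bias} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left(\hat{y}_i - y_i\right),
\qquad
\text{RPMSE} = \sqrt{\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left(\hat{y}_i - y_i\right)^2}
```

With this sign convention a negative bias means the predictions fall below the observed runoff on average. Note also that in simple linear regression $R^2$ equals the squared correlation, which is consistent with the values reported in this key: $0.94^2 \approx 0.88$.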
[Figure: standardized residuals (Std. Resids) versus Fitted Values]
[Figure: Histogram of sres (x-axis: sres, y-axis: Density)]

Code for Problem 5:

    library(MASS)  # provides stdres()
    sres <- stdres(my.lm)
    # Standardized residuals vs. fitted values (linearity and equal variance)
    plot(my.lm$fitted.values, sres, pch=19, xlab="Fitted Values", ylab="Std. Resids")
    abline(h=0, lwd=3, col="red")
    # Histogram of standardized residuals with a N(0,1) density overlaid (normality)
    hist(sres, freq=FALSE)
    lines(seq(-3,3,length=100), dnorm(seq(-3,3,length=100)), col="red", lwd=3)
    # Cross validation: hold out 6 randomly chosen observations as test data
    n.test <- 6
    test.obs <- sample(1:nrow(water), n.test)
    test.data <- water[test.obs,]
    train.data <- water[-test.obs,]
    train.data  # print the training set for inspection
    train.lm <- lm(runoff ~ precip, data=train.data)
    summary(train.lm)
    # Predict the held-out observations and summarize the prediction errors
    preds <- predict(train.lm, newdata=test.data)
    bias <- mean(preds - test.data$runoff)
    pmse <- mean((preds - test.data$runoff)^2)
    rpmse <- sqrt(pmse)

6. Carry out a test that there is no relationship between snowfall and runoff (e.g., write out the hypotheses, report an appropriate p-value and conclude in context).

(a) (1pt) Null and alternative hypotheses written correctly.
(b) (2pts) Report a p-value of approximately zero.
(c) (2pts) State that they reject the null hypothesis and conclude that there is a relationship between snowfall and runoff.
(d) (Example) We ran a hypothesis test on the slope with $H_0: \beta_1 = 0$ and $H_a: \beta_1 \neq 0$. That resulted in a p-value that was approximately 0, so we reject the null hypothesis and conclude that there is a significant relationship between snow and runoff.

Code for Problem 6:

    summary(my.lm)
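To make the source of the Problem 6 p-value more concrete, the following is a minimal sketch of how the slope test reported by summary() can be reproduced by hand. Because the water data are not included here, it uses a synthetic data set: the data frame `sim`, its coefficients, and its noise level are made-up assumptions for illustration only, so the numbers it prints will not match the key.

```r
# Synthetic stand-in for the water data (NOT the real measurements).
set.seed(1)
sim <- data.frame(precip = runif(40, min = 5, max = 30))
sim$runoff <- 27000 + 3750 * sim$precip + rnorm(40, mean = 0, sd = 9000)

sim.lm <- lm(runoff ~ precip, data = sim)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef.table <- summary(sim.lm)$coefficients

# t statistic for H0: beta_1 = 0 is the slope estimate over its standard error
t.stat <- coef.table["precip", "Estimate"] / coef.table["precip", "Std. Error"]

# Two-sided p-value from a t distribution with n - 2 degrees of freedom
p.value <- 2 * pt(abs(t.stat), df = df.residual(sim.lm), lower.tail = FALSE)

print(c(t = t.stat, p = p.value))
```

The hand-computed values should agree with the "t value" and "Pr(>|t|)" entries in the precip row of the summary() table, which is the row students should read when reporting the p-value.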
7. Construct 95% confidence intervals for the slope and intercept parameters and interpret these intervals in the context of the problem.

(a) (3pts) Report 95% confidence intervals for $\beta_0$ as (20513, 33515) and for $\beta_1$ as (3316, 4188). NOTE: if the students did a transformation or centered their x's then their intervals will be slightly different.
(b) (2pts) Interpret these intervals in context. E.g. I am 95% confident that as snowfall increases by 1 unit, runoff will increase by between 3316 and 4188 acre-feet, on average.
(c) (Example) Because $\hat{\beta}_0$ and $\hat{\beta}_1$ are just point estimates of the parameters, we calculated 95% confidence intervals for each. The 95% confidence interval for $\beta_0$ is (20513, 33515). This means we are 95% confident that when snow is 0, the average runoff is between 20513 and 33515 acre-feet. The 95% confidence interval for $\beta_1$ is (3316, 4188). This means that we are 95% confident that as snowfall increases by one unit, runoff will increase by between 3316 and 4188 acre-feet, on average.

Code for Problem 7:

    confint(my.lm)

8. In the winter of 2013-2014, the site only received 4.5 inches of snowfall. What do you predict will be the associated runoff? Provide a 95% predictive interval and interpret the interval in the context of the problem. Do you have any hesitations performing this prediction (hint: you should)? Describe these hesitations and their potential impact on your prediction.

To receive full credit, students must:
(a) (2pts) Predict for 4.5 inches of snowfall and present a 95% prediction interval. My prediction for 4.5 inches of snowfall is 43900.77 with a 95% prediction interval of (25254.2, 62547.34).
(b) (2pts) Interpret the prediction interval in the context of the problem.
(c) (1pt) Mention that 4.5 inches of snowfall is outside of the range of the data so this is an extrapolation; therefore, this prediction might be off.
(d) (Example) If there is only 4.5 inches of snowfall we predict a runoff of 43900.77 acre-feet.
A 95% prediction interval for this estimate is (25254.2, 62547.34). Therefore, we are 95% confident that for a single year with 4.5 inches of snowfall, the resulting runoff will be between 25254.2 and 62547.34 acre-feet. However, this prediction may not be very accurate because we are extrapolating: 4.5 inches of snowfall is outside the range of our collected data.

Code for Problem 8:

    predict.lm(my.lm, newdata=data.frame(precip=c(4.5)), interval="prediction")
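The extrapolation concern in Problem 8 can also be checked programmatically. The sketch below is one possible way to do it, again on synthetic data, since the real water data are not reproduced here; `sim`, `new.obs`, and the simulation settings are hypothetical names and values chosen for illustration. It flags whether the new snowfall value lies outside the observed range and compares the prediction interval for a single year with the confidence interval for the mean runoff.

```r
# Synthetic stand-in for the water data (NOT the real measurements).
set.seed(2)
sim <- data.frame(precip = runif(40, min = 5, max = 30))
sim$runoff <- 27000 + 3750 * sim$precip + rnorm(40, mean = 0, sd = 9000)
sim.lm <- lm(runoff ~ precip, data = sim)

new.obs <- data.frame(precip = 4.5)

# Flag extrapolation: is the new snowfall outside the observed snowfall range?
x.range <- range(sim$precip)
is.extrapolation <- new.obs$precip < x.range[1] | new.obs$precip > x.range[2]

# Interval for one new year's runoff vs. interval for the mean runoff
pred <- predict(sim.lm, newdata = new.obs, interval = "prediction", level = 0.95)
conf <- predict(sim.lm, newdata = new.obs, interval = "confidence", level = 0.95)

pi.width <- pred[, "upr"] - pred[, "lwr"]
ci.width <- conf[, "upr"] - conf[, "lwr"]

print(is.extrapolation)
print(rbind(prediction = pred[1, ], confidence = conf[1, ]))
```

The prediction interval is always wider than the confidence interval at the same snowfall value because it also accounts for the year-to-year error term. Neither interval accounts for the possibility that the linear relationship itself breaks down outside the observed range, which is why the extrapolation caveat still applies.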