
1. (a) [Figure: histograms of the temperature measurements (temp, roughly 170 to 185) and of the efficiency ratio measurements (ratio, roughly 1.0 to 3.0), with density on the vertical axes.]

The histograms show that the temperature measurements have two peaks, one around 172 and another, higher peak around 182. The ratio measurements appear slightly right-skewed, with more observations around 1 or 1.5 and fewer out in the tail around 2.5 or 3. Neither set of measurements appears to have large outliers.

(b) No, the value of the efficiency ratio is not completely and uniquely determined by tank temperature. If it were, every tank temperature would correspond to a unique ratio. However, there are five temperature measurements equal to 180, and they correspond to five different ratio values (1.45, 1.60, 1.61, 2.13, 2.15). If the ratio were completely determined by temperature, the ratios for these five observations would all be the same.

(c) [Figure: scatterplot "Temp vs Ratio", with temp (170 to 185) on the horizontal axis and ratio (1.0 to 3.0) on the vertical axis.]

The scatterplot of temperature vs. ratio does appear to show an increasing, linear relationship between the two variables. Higher temperature values are generally associated with higher ratio values, so it seems reasonable that temperature might predict ratio.

(d) Using statistical software, we get the estimated regression line:

Ŷ_i = β̂_0 + β̂_1·temp_i = -15.25 + 0.094·temp_i.

The slope estimate shows that for every one-degree increase in temperature, the predicted efficiency ratio increases by 0.094, slightly less than 0.1. When the temperature is zero, the predicted efficiency ratio is -15.25. This is not directly interpretable, because the efficiency ratio cannot go below zero; moreover, the smallest temperature in the data is 170 degrees, so the intercept describes a point far outside the observed range. One way to fix this issue would be to recalibrate the temperature values, subtracting 170 from all of them so that 0 is a meaningful value.
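One way this recalibration could be carried out in R (a sketch, reusing the prob1data data frame from the code shown below; fit1c is just an illustrative name):

> fit1c = lm(Ratio ~ I(Temp - 170), data = prob1data)  # shift temp so that 0 means 170 degrees
> coef(fit1c)  # slope is unchanged; intercept is now the predicted ratio at 170 degrees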

We have three assumptions we would like to check for the model:

i. The relationship between temperature and efficiency ratio is linear.
ii. The ε_i are normally distributed.
iii. The ε_i have constant variance ("homoscedasticity").

[Figure: fitted values vs. standardized residuals (left) and histogram of the standardized residuals e_star (right).]

The first assumption seems valid based on the scatterplot we created earlier, which shows a reasonable, increasing linear relationship between temperature and the efficiency ratio. The histogram of the standardized residuals looks approximately normal, so the second assumption is reasonable. The plot of the fitted values vs. the standardized residuals shows no pattern (which is what we want), and the values lie between -2 and 2. Because there is no pattern, the homoscedasticity assumption also seems valid.

Code and output in R:

#### MAKE THE MODEL ####
> fit1 = lm(Ratio ~ Temp, data = prob1data)  # predict ratio with temp using the problem 1 data
> summary(fit1)

Call:
lm(formula = Ratio ~ Temp, data = prob1data)

Residuals:
     Min       1Q   Median       3Q      Max
-1.00601 -0.27580 -0.08906  0.37700  0.81128

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.24497    3.97705  -3.833 0.000905 ***
Temp          0.09424    0.02215   4.255 0.000324 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4972 on 22 degrees of freedom
Multiple R-squared: 0.4514, Adjusted R-squared: 0.4265
F-statistic: 18.1 on 1 and 22 DF, p-value: 0.0003239

#### MAKE THE PLOTS ####
> e_star = (fit1$resid - mean(fit1$resid))/sd(fit1$resid)  # standardized residuals
> par(mfrow = c(1, 2))
> plot(fit1$fitted, e_star, main = "Fitted vs Residuals", xlab = "Fitted", ylab = "Residuals")
> abline(h = 0)
> hist(e_star, main = "Standardized residuals")
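As an aside (a sketch, not part of the output above): base R can compute internally studentized residuals directly, which avoids the manual standardization:

> rstandard(fit1)  # studentized residuals; values should again lie roughly between -2 and 2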

(e) The regression line is used to predict the average efficiency ratio:

E(Y | x = 182) = -15.24497 + 0.09424·182 ≈ 1.906.

When the temperature equals 182 degrees, the estimated average efficiency ratio is 1.906.

(f) The residuals for the four observations with temperature equal to 182 are -1.006, -0.096, 0.034, and 0.774. These do not all have the same sign because the observed values do not all lie on the regression line: some are above it and some are below it. This is due to the random variation in the observed Y_i values.

(g) The output in part (d) shows that R² is equal to 0.4514, which means that about 45% of the variation in the efficiency ratio is explained by temperature.
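Parts (e) and (f) can be checked directly in R (a sketch, assuming the fit1 object and prob1data data frame defined above):

> predict(fit1, newdata = data.frame(Temp = 182))  # estimated mean ratio at 182 degrees, part (e)
> fit1$resid[prob1data$Temp == 182]                # residuals for the four observations at 182, part (f)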

2. (a) [Figure: scatterplot of SO2 deposition rate (20 to 100) vs. steel weight loss (400 to 1200).]

The scatterplot shows that the simple linear regression model appears to be reasonable, as the relationship between SO2 and steel weight loss seems linear.

(b) Ŷ = 137.9 + 9.31·SO2

The estimated regression equation shows that for a 1 mg/m²/d increase in SO2, the steel weight loss goes up by an average of 9.31 g/m². When the SO2 level is 0, the estimated average steel weight loss is 137.9 g/m².

> summary(fit2)

Call:
lm(formula = y ~ x, data = prob2data)

Residuals:
      1       2       3       4       5       6
 11.762  44.516 -40.338 -38.273   3.104  19.229

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 137.8756    26.3776   5.227   0.0064 **
x             9.3116     0.4745  19.622 3.98e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37.39 on 4 degrees of freedom
Multiple R-squared: 0.9897, Adjusted R-squared: 0.9871
F-statistic: 385 on 1 and 4 DF, p-value: 3.978e-05

It is difficult to tell from the residual plots whether our assumptions hold, since there are so few observations, but there do not appear to be any obvious violations.

[Figure: fitted values vs. residuals (left) and histogram of the standardized residuals estar2 (right).]

(c) The output above shows that R² equals 0.9897, which means that almost 99% of the variation in steel weight loss can be attributed to SO2. This value is extremely high, indicating that SO2 is a very good predictor.

(d) Before this model is even created, we can guess that the slope of the regression line will not change much. We can guess this by looking at the scatterplot: even though the sixth SO2 measurement is quite high, so is its steel weight loss value, so this point still lies along the regression line and is probably not very influential.

Call:
lm(formula = y ~ x, data = prob2data[-6, ])

Residuals:
     1      2      3      4      5
-16.07  23.72 -22.41 -15.07  29.83

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 190.3524    33.4017   5.699  0.01071 *
x             7.5515     0.9647   7.828  0.00434 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.52 on 3 degrees of freedom
Multiple R-squared: 0.9533, Adjusted R-squared: 0.9378
F-statistic: 61.27 on 1 and 3 DF, p-value: 0.004341

Results from the two models show that the estimates of β_0 are quite different, with a noticeably higher intercept once the sixth observation is removed (190.4 vs. 137.9). The slope estimates also differ (7.55 vs. 9.31), although the change is not as drastic. A plot of the fitted values from each model shows that they are fairly similar, although the points do not fall perfectly on a 45-degree line.

[Figure: fitted values from the original model (horizontal axis, roughly 300 to 550) vs. fitted values from the new model (vertical axis, same range).]
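A sketch of how this comparison plot might be produced (fit2b is an illustrative name for the refit without the sixth observation):

> fit2b = lm(y ~ x, data = prob2data[-6, ])  # same refit as shown above
> plot(fitted(fit2)[-6], fitted(fit2b), xlab = "Original model", ylab = "New model", main = "Fitted values")
> abline(0, 1)  # 45-degree reference line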

3. (a) Ŷ = -1.58 + 2.585·tannin

β̂_1 = S_xy/S_xx = 3.831/1.482 = 2.585
β̂_0 = ȳ - β̂_1·x̄ = -0.549/32 - 2.585·(19.404/32) = -1.58

Interpretation of the regression line: for every 1-unit increase in tannin concentration, the perceived astringency increases by 2.585 units. If the tannin concentration were zero, the predicted perceived astringency would be -1.58.

[Figure: three panels — tannin vs. astringency with the fitted line (left), fitted values vs. standardized residuals (middle), histogram of the standardized residuals (right).]

The figure does not show any violations of our assumptions, and the linear model seems to fit the data very nicely. On the left, we see that there is a strong linear relationship between the tannin level and the perceived astringency. In the middle, there does not appear to be any trend among the residuals, and they are centered around 0, with values between -2 and 2. On the right, we see that the residuals appear to be normally distributed.

(b) The confidence interval for β_1 can be calculated in two ways.

i. Calculate MSE and use the equation β̂_1 ± t_{α/2, n-2}·sqrt(MSE/S_xx):

> MSE = sum(fit$resid^2)/30  # n - 2 = 30
> 2.585 - qt(.975, 30)*sqrt(MSE/1.482)
[1] 2.160132
> 2.585 + qt(.975, 30)*sqrt(MSE/1.482)
[1] 3.009868

ii. Use the output, which provides the estimate of the slope as well as its standard error:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.5846     0.1339  -11.84 7.84e-13 ***
tannin        2.5849     0.2080   12.43 2.33e-13 ***
---

> 2.585 - qt(.975, 30)*.2080
[1] 2.160207
> 2.585 + qt(.975, 30)*.2080
[1] 3.009793

The 95% confidence interval for the slope is (2.16, 3.01), which means that we are 95% confident that the true slope lies in this range.
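Equivalently (a sketch, assuming the fitted object is named fit as in the code above), R computes this interval in one call:

> confint(fit, "tannin", level = 0.95)  # 95% CI for the slope; should match (2.16, 3.01)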

(c) We use the estimated regression line to estimate the average astringency when the tannin concentration is 0.6:

E(Y | x = 0.6) = -1.585 + 2.585·0.6 = -0.034.

However, we want to express how reliable this estimate is, so we calculate a 95% confidence interval for the average astringency when the tannin level is 0.6:

Ŷ(x*) ± t_{α/2, n-2}·sqrt( MSE·[1/n + (x* - x̄)²/S_xx] )
  = -0.034 ± 2.04·sqrt( 0.064·[1/32 + (0.6 - 0.606375)²/1.48] )
  = (-0.128, 0.060)

We are 95% confident that the true average perceived astringency level is between -0.128 and 0.060 when the tannin concentration is 0.6.

(d) The prediction interval is similar to the confidence interval, except that the interval is not for the average outcome but for a single new outcome, which has a larger variance:

Ŷ(x*) ± t_{α/2, n-2}·sqrt( MSE·[1 + 1/n + (x* - x̄)²/S_xx] )
  = -0.034 ± 2.04·sqrt( 0.064·[1 + 1/32 + (0.6 - 0.606375)²/1.48] )
  = (-0.560, 0.492)

This means that we predict that 95% of the time, the perceived astringency will be between -0.560 and 0.492 when the tannin level is 0.6.

(e) We can test the null hypothesis with a confidence interval for the average astringency at a tannin concentration of 0.7; if 0 is in the interval, then we fail to reject the null hypothesis.

H_0: E(Y | x = 0.7) = 0,  H_a: E(Y | x = 0.7) ≠ 0

Ŷ(x*) ± t_{α/2, n-2}·sqrt( MSE·[1/n + (x* - x̄)²/S_xx] )
  = -1.585 + 2.585·0.7 ± 2.04·sqrt( 0.064·[1/32 + (0.7 - 0.606375)²/1.48] )
  = (0.125, 0.324)

We are 95% confident that the true average perceived astringency level is between 0.125 and 0.324 when the tannin concentration equals 0.7. Because zero is not contained in the interval, we reject the null hypothesis: there is evidence that this average is significantly different from zero.
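The intervals in parts (c) and (d) can also be obtained directly from predict() (a sketch, again assuming the fit object from above):

> new = data.frame(tannin = 0.6)
> predict(fit, newdata = new, interval = "confidence", level = 0.95)  # CI for the mean, part (c)
> predict(fit, newdata = new, interval = "prediction", level = 0.95)  # PI for a new observation, part (d)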

4. (a) Based on the given calculations, the estimated regression line is

Ŷ = 6.45 + 10.60·x_cf.

This means that for every 1-SCCM increase in chlorine flow, the predicted etch rate increases by 10.60 (in units of 100 Å/min). With no chlorine flow, the predicted average etch rate is 6.45 (100 Å/min).

[Figure: three panels — chlorine flow vs. etch rate (left), fitted values vs. standardized residuals (middle), histogram of the standardized residuals (right).]

A check of our assumptions shows that the assumption of linearity has not been violated. The residual plots are harder to judge because the sample size is small; however, the standardized residuals lie between -2 and 2 and show no patterns. R² is equal to 0.94, which means that 94% of the variation in the etch rate is explained by the chlorine flow. It seems that the regression model specifies a useful relationship between chlorine flow and etch rate; a model-utility F test (sketched after this problem) would confirm this formally.

(b) The average change in etch rate associated with a 1-SCCM increase in flow rate is the slope, β_1. The estimate of β_1 is 10.60 (given). We can create a 95% confidence interval for this parameter using the equation:

β̂_1 ± t_{α/2, n-2}·sqrt(MSE/S_xx) = 10.603 ± 2.364·sqrt(6.48/6.5) = (8.24, 12.96).

We are 95% confident that the true average change in etch rate is between 8.24 and 12.96 (100 Å/min) for every 1-SCCM increase in chlorine flow rate.

(c) Ŷ(x=3) ± t_{α/2, n-2}·sqrt( MSE·[1/n + (3 - x̄)²/S_xx] )
  = 6.45 + 10.60·3 ± 2.364·sqrt( 6.48·[1/9 + (3 - 2.67)²/6.5] )
  = (36.098, 40.402)

We are 95% confident that the average etch rate is between 36.1 and 40.4 (100 Å/min) when the chlorine flow is 3 SCCM. Because 3 falls within the range of the x values in the data set, it seems reasonable to assume that this estimate of the average etch rate is accurate.

(d) Ŷ(x=3) ± t_{α/2, n-2}·sqrt( MSE·[1 + 1/n + (3 - x̄)²/S_xx] )
  = 6.45 + 10.60·3 ± 2.364·sqrt( 6.48·[1 + 1/9 + (3 - 2.67)²/6.5] )
  = (31.86, 44.64)

We predict that 95% of new etch rate measurements at a chlorine flow of 3 SCCM will fall between 31.86 and 44.64 (100 Å/min).

(e) The standard errors of the prediction and confidence intervals contain the term (x* - x̄)². A value of x* closer to the average produces a smaller standard error than a value farther away. The average chlorine flow value is 2.67 SCCM, and because 2.5 is closer to this average than 3.0, the confidence and prediction intervals at x = 2.5 will be narrower than those at x = 3.0.

(f) It would not be wise to recommend a 95% PI for a flow of 6.0, because this value is far from any of the recorded x values in the data set. The maximum observed value is 4.0 SCCM, and because 6.0 is much higher, the interval will be very wide and the extrapolation unreliable.
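The model-utility F test mentioned in part (a) can be carried out from the reported quantities alone (a sketch, using R² = 0.94 and n = 9 from above, and the identity F = [R²/(1 - R²)]·(n - 2) for simple linear regression):

> (0.94/(1 - 0.94)) * (9 - 2)  # observed F statistic, approximately 109.7
> qf(0.95, 1, 7)               # critical value, approximately 5.59; since 109.7 > 5.59, reject H0: beta1 = 0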

5. The estimated regression equation is:

Ŷ = 6.05 + 0.142·NaOH - 0.0169·TIME

When NaOH and treatment time are both zero, the predicted average specific surface area is 6.05 cm²/g. When treatment time is held fixed, a one-percent increase in NaOH is associated with a 0.142 cm²/g increase in surface area. When NaOH is held fixed, a one-minute increase in treatment time decreases the surface area by 0.0169 cm²/g.

(a) R² = 0.807, which means that time and NaOH account for 80.7% of the variation in surface area.

(b) The p-value for the overall F test of the model is 0.007, which means that there is a useful relationship between the dependent variable and the predictors.

(c) Provided that the percentage of NaOH remains in the model, the predictor treatment time does not need to be eliminated at significance level α = 0.05, since the p-value for its coefficient is 0.043 (so we reject H_0: β_time = 0). However, if the model were being validated against a stricter significance level, such as α = 0.01, then we would recommend possibly eliminating this variable from the model.

(d) Calculating a 95% CI for the expected change in specific surface area associated with a 1% increase in NaOH (treatment time held fixed) means calculating a 95% CI for β_NaOH. We can make this confidence interval using the standard error provided by the output:

0.14167 ± t_{0.975, 6}·0.03301 = 0.14167 ± 2.45·0.03301 = (0.061, 0.222)

Note that the confidence interval does not contain zero, which means that we can reject the null hypothesis that this parameter is equal to 0 at the 0.05 level.
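For completeness, a minimal sketch of how this model and the interval in (d) might be obtained in R, with hypothetical names prob5data, area, NaOH, and Time:

> fit5 = lm(area ~ NaOH + Time, data = prob5data)  # hypothetical data frame and variable names
> summary(fit5)                                    # R^2, overall F test, and partial t tests for each coefficient
> confint(fit5, "NaOH", level = 0.95)              # 95% CI for beta_NaOH, as in part (d)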