1. (a) [Figure: histograms of Temp and Ratio on the density scale; x-axes: temp, ratio]

The histograms show that the temperature measurements have two peaks, one around 172 and a higher one around 182. The ratio measurements seem slightly right-skewed, with more observations around 1 or 1.5 and fewer out in the tail around 2.5 or 3. Neither set of measurements appears to have large outliers.

(b) No, the value of the efficiency ratio is not completely and uniquely determined by tank temperature. If it were, every tank temperature would correspond to a single ratio value. However, there are five temperature measurements equal to 180, and they correspond to five different ratio values (1.45, 1.60, 1.61, 2.13, 2.15). If the ratio were completely determined by temperature, these five ratios would all be the same.

(c) [Figure: scatterplot "Temp vs Ratio"; x-axis: temp, y-axis: ratio]

The scatterplot of temperature vs ratio does appear to show an increasing, roughly linear relationship between the two variables. Higher temperature values are generally associated with higher ratio values. It seems reasonable that temperature might predict ratio values.

(d) Using statistical software, we get the estimated regression line:

$$\hat{Y}_i = \hat\beta_0 + \hat\beta_1\,\text{temp}_i = -15.25 + 0.094\,\text{temp}_i.$$

The regression line shows that for every one-degree increase in temperature, the efficiency ratio increases by about 0.094 on average. When the temperature is zero, the estimated efficiency ratio is equal to
-15.25. This is not directly interpretable, because the efficiency ratio cannot go below zero. The smallest temperature value is 170 degrees, so a temperature of zero lies far outside the observed data and the intercept is an extrapolation. One way to fix this issue would be to recenter the temperature values, subtracting 170 from all of them so that 0 is a meaningful value.

We have three assumptions we would like to check for the model:
i. The relationship between temperature and efficiency ratio is linear.
ii. The $\epsilon_i$ are normally distributed.
iii. The $\epsilon_i$ have the same variance (homoscedasticity).

[Figure: left panel "Fitted vs Residuals" (x-axis: Fitted, y-axis: Residuals); right panel "Standardized residuals" histogram (x-axis: e_star, y-axis: Frequency)]

The first assumption seems valid based on the scatterplot we created earlier, which shows a reasonable, increasing linear relationship between temperature and the efficiency ratio. The histogram of the standardized residuals $e_i^*$ looks roughly normal, so the second assumption is reasonable. The plot of the fitted values vs the standardized residuals shows no pattern (which is what we want), and the values lie between -2 and 2. Because there is no pattern, the homoscedasticity assumption seems valid.

Code and output in R:

    #### MAKE THE MODEL ####
    > fit1 = lm(Ratio ~ Temp, data = prob1data) # Predict ratio with temp using prob1 data
    > summary(fit1)

    Call:
    lm(formula = Ratio ~ Temp, data = prob1data)

    Residuals:
         Min       1Q   Median       3Q      Max
    -1.00601 -0.27580 -0.08906  0.37700  0.81128

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) -15.24497    3.97705  -3.833 0.000905 ***
    Temp          0.09424    0.02215   4.255 0.000324 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.4972 on 22 degrees of freedom
    Multiple R-squared:  0.4514, Adjusted R-squared:  0.4265
    F-statistic:  18.1 on 1 and 22 DF,  p-value: 0.0003239

    ### MAKE THE PLOTS ###
    > e_star = (fit1$resid - mean(fit1$resid)) / sd(fit1$resid)
    > par(mfrow = c(1, 2))
    > plot(fit1$fitted, e_star, main = "Fitted vs Residuals", xlab = "Fitted", ylab = "Residuals")
    > abline(h = 0)
    > hist(e_star, main = "Standardized residuals")

(e) The regression line is used to predict the average efficiency ratio:

$$\hat{E}(Y \mid x = 182) = -15.245 + 0.09424 \times 182 \approx 1.906.$$

The unrounded coefficients from the output are used here; the rounded values -15.25 and 0.094 give 1.858. When the temperature equals 182 degrees, the estimated average efficiency ratio is 1.906.

(f) The residuals for the four observations for which temperature equals 182 are -1.006, -0.096, 0.034, and 0.774. These do not all have the same sign because the observed values do not all lie on the regression line: some are above it and some are below it. This is due to the random variation in the observed $Y_i$ values.

(g) The output in part (d) shows that $R^2$ is equal to 0.4514, which means that about 45% of the variation in the efficiency ratio is explained by temperature.

2. (a) The scatterplot shows that the simple linear regression model appears to be reasonable, as the relationship between SO2 and steel weight loss seems linear.

[Figure: scatterplot; x-axis: SO2, y-axis: Steel weight loss]

(b) $\hat{Y} = 137.9 + 9.31\,\text{SO}_2$

The estimated regression equation shows that for a 1 mg/m²/d increase in SO2, the steel weight loss goes up by an average of 9.31 g/m². When the SO2 level is 0, the estimated average steel weight loss is 137.9 g/m².
    > summary(fit2)

    Call:
    lm(formula = y ~ x, data = prob2data)

    Residuals:
          1       2       3       4       5       6
     11.762  44.516 -40.338 -38.273   3.104  19.229

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 137.8756    26.3776   5.227   0.0064 **
    x             9.3116     0.4745  19.622 3.98e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 37.39 on 4 degrees of freedom
    Multiple R-squared:  0.9897, Adjusted R-squared:  0.9871
    F-statistic:   385 on 1 and 4 DF,  p-value: 3.978e-05

With only six residuals, it is difficult to tell from the residual plots whether our assumptions hold, but there do not appear to be any obvious violations.

[Figure: left panel "Fitted vs resid" (x-axis: Fitted, y-axis: Resid); right panel "Histogram of estar2" (x-axis: estar2, y-axis: Frequency)]

(c) The output above shows that $R^2$ equals 0.9897, which means that almost 99% of the variation in steel weight loss can be attributed to SO2. This value is extremely high, indicating that SO2 is a very strong linear predictor of steel weight loss in this sample.

(d) Before this model is even fitted, we can guess that the slope of the regression line will not change much. We can guess this by looking at the scatterplot: even though the SO2 value for this observation is quite high, so is its steel weight loss, so the point still falls along the regression line and is probably not that influential.

    Call:
    lm(formula = y ~ x, data = prob2data[-6, ])

    Residuals:
         1      2      3      4      5
    -16.07  23.72 -22.41 -15.07  29.83

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 190.3524    33.4017   5.699  0.01071 *
    x             7.5515     0.9647   7.828  0.00434 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 28.52 on 3 degrees of freedom
    Multiple R-squared:  0.9533, Adjusted R-squared:  0.9378
    F-statistic: 61.27 on 1 and 3 DF,  p-value: 0.004341

Results from the two models show that the estimates for $\beta_0$ are quite different, with a higher value for the new model (190.35 vs 137.88). The slope estimates also differ (7.55 vs 9.31), although the change is not as drastic. A plot of the fitted values from each model shows that they are fairly similar, although the points do not fall exactly on the 45-degree line.

[Figure: "Fitted values"; x-axis: Original model, y-axis: New model]

3. (a) $\hat{Y} = -1.58 + 2.585\,\text{tannin}$

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{3.831}{1.482} = 2.585$$

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = \frac{-0.549}{32} - 2.585 \times \frac{19.404}{32} = -1.58$$

Interpretation of the regression line: for every 1-unit increase in tannin concentration, the perceived astringency increases by 2.585 units on average. If the tannin concentration is zero, the estimated average perceived astringency is -1.58.

The figure does not show any violations of our assumptions, and the linear model seems to fit the data very nicely. On the left, we see that there is a strong linear relationship between the tannin level and the perceived astringency. In the middle, there does not appear to be any trend among the residuals, and they are centered around 0, with values between -2 and 2. On the right, we see that the residuals appear to be normally distributed.

(b) The confidence interval for $\beta_1$ can be calculated in two ways.

i. Calculate MSE and use the equation $\hat\beta_1 \pm t_{\alpha/2,\,n-2}\sqrt{MSE/S_{xx}}$:

    > MSE = sum(fit$resid^2)/30
    > 2.585 - qt(.975, 30)*sqrt(MSE/1.482)
    [1] 2.160132
    > 2.585 + qt(.975, 30)*sqrt(MSE/1.482)
    [1] 3.009868

[Figure for part (a): left panel "Tannin vs Astringency" (x-axis: Tannin, y-axis: Astringency); middle panel "Fitted vs Residuals" (x-axis: Fitted, y-axis: Residuals); right panel "Hist of residuals" (x-axis: estar, y-axis: Frequency)]

ii. Use the output, which provides estimates of the slope as well as standard errors:

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  -1.5846     0.1339  -11.84 7.84e-13 ***
    tannin        2.5849     0.2080   12.43 2.33e-13 ***
    ---
    > 2.585 - qt(.975, 30)*.2080
    [1] 2.160207
    > 2.585 + qt(.975, 30)*.2080
    [1] 3.009793

The 95% confidence interval for the slope is (2.16, 3.01), which means that we are 95% confident that the true slope is in this range.

(c) We use the estimated regression line to estimate the average astringency when the tannin concentration is 0.6:

$$\hat{E}(Y \mid x = 0.6) = -1.585 + 2.585 \times 0.6 = -0.034.$$

However, we want to express how reliable this estimate is, so we calculate a 95% confidence interval for the average astringency when the tannin level is 0.6:

$$\hat{Y}_{x=x^*} \pm t_{\alpha/2,\,n-2}\sqrt{MSE\left[\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right]} = -0.034 \pm 2.04\sqrt{0.064\left[\frac{1}{32} + \frac{(0.6 - 0.606375)^2}{1.48}\right]} = (-0.125, 0.057)$$

We are 95% confident that the true average perceived astringency level is between -0.125 and 0.057 when the tannin concentration is 0.6.

(d) The prediction interval is similar to the confidence interval, except that the interval is not for the average outcome but for a single new outcome,
which produces a larger variance:

$$\hat{Y}_{x=x^*} \pm t_{\alpha/2,\,n-2}\sqrt{MSE\left[1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right]} = -0.034 \pm 2.04\sqrt{0.064\left[1 + \frac{1}{32} + \frac{(0.6 - 0.606375)^2}{1.48}\right]} = (-0.558, 0.490)$$

This means that we predict, with 95% confidence, that a new perceived astringency measurement will fall between -0.558 and 0.490 when the tannin level is 0.6.

(e) We can test the null hypothesis with a confidence interval for the average astringency at a tannin concentration of 0.7. If 0 is in the interval, we fail to reject the null hypothesis.

$$H_0: E(Y \mid x = 0.7) = 0, \qquad H_a: E(Y \mid x = 0.7) \neq 0$$

$$\hat{Y}_{x=x^*} \pm t_{\alpha/2,\,n-2}\sqrt{MSE\left[\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right]} = -1.585 + 2.585 \times 0.7 \pm 2.04\sqrt{0.064\left[\frac{1}{32} + \frac{(0.7 - 0.606375)^2}{1.48}\right]} = (0.125, 0.324)$$

We are 95% confident that the true average perceived astringency level is between 0.125 and 0.324 when the tannin concentration is equal to 0.7. Because zero is not contained in the interval, we reject the null hypothesis at the 0.05 level. There is evidence that this average is significantly different from zero.

4. (a) Based on the given calculations, the estimated regression line is $\hat{Y} = 6.45 + 10.60\,x_{cf}$. This means that for every 1-SCCM increase in chlorine flow, the etch rate increases by 10.6 (in units of 100A/min) on average. With no chlorine flow, the estimated average etch rate is 6.45 (100A/min).

A check of our assumptions shows that the assumption of linearity has not been violated. The residuals are harder to assess because the sample size is small. However, they lie between -2 and 2 and seem to show no patterns.

[Figure: left panel "CF vs etch rate" (x-axis: CF, y-axis: Etch rate); middle panel "Fitted vs Residuals" (x-axis: Fitted, y-axis: Residuals); right panel "Histogram of estar" (x-axis: estar, y-axis: Frequency)]

$R^2$ is equal to 0.94, which means that 94% of the variation in the etch rate is explained by the chlorine flow. It seems that the regression model specifies a useful relationship between chlorine flow and etch rate.

[Author's note: possibly add an F-test here.]
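A model utility F-test for part (a) can be sketched from summary quantities alone. The block below is a minimal sketch, not the original solution's code: it assumes the slope estimate (10.603), MSE (6.48), Sxx (6.5), and n = 9 that appear in the interval calculations for this problem, and it relies on the fact that in simple linear regression the F statistic equals the squared t statistic for the slope. The variable names are illustrative.

```r
# Summary quantities taken from the interval calculations for problem 4
# (assumed here; the raw data are not reproduced in this document).
beta1_hat <- 10.603  # estimated slope
mse       <- 6.48    # mean squared error
sxx       <- 6.5     # sum of squared deviations of chlorine flow
n         <- 9       # sample size

# In simple linear regression, F = (beta1_hat / SE(beta1_hat))^2
# with 1 and n - 2 degrees of freedom.
se_beta1 <- sqrt(mse / sxx)
f_stat   <- (beta1_hat / se_beta1)^2

# 5% critical value and p-value for H0: beta1 = 0.
f_crit  <- qf(0.95, df1 = 1, df2 = n - 2)
p_value <- pf(f_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

cat("F =", round(f_stat, 2),
    " critical value =", round(f_crit, 2),
    " p-value =", signif(p_value, 3), "\n")
```

Under these assumed inputs, F is roughly 113, far above the 5% critical value of about 5.59, which is consistent with the conclusion that chlorine flow is a useful predictor of etch rate.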
(b) The average change in etch rate associated with a 1-SCCM increase in flow rate is the slope, $\beta_1$. The estimate for $\beta_1$ is equal to 10.6 (given). We can create a 95% confidence interval for this parameter using the equation:

$$\hat\beta_1 \pm t_{\alpha/2,\,n-2}\sqrt{MSE/S_{xx}} = 10.603 \pm 2.364\sqrt{6.48/6.5} = (8.24, 12.96).$$

We are 95% confident that the true average change in etch rate is between 8.24 and 12.96 for every 1-SCCM increase in chlorine flow rate.

(c)
$$\hat{Y}_{x=3} \pm t_{\alpha/2,\,n-2}\sqrt{MSE\left[\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right]} = 6.45 + 10.60 \times 3 \pm 2.364\sqrt{6.48\left[\frac{1}{9} + \frac{(3 - 2.67)^2}{6.5}\right]} = (36.098, 40.402)$$

We are 95% confident that the average etch rate is between 36.1 and 40.4 (100A/min) when the chlorine flow is 3 SCCM. Because 3 falls in the range of the x values in the data set, it seems reasonable to assume that our estimate of the average etch rate is likely to be accurate.

(d)
$$\hat{Y}_{x=3} \pm t_{\alpha/2,\,n-2}\sqrt{MSE\left[1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right]} = 6.45 + 10.60 \times 3 \pm 2.364\sqrt{6.48\left[1 + \frac{1}{9} + \frac{(3 - 2.67)^2}{6.5}\right]} = (31.86, 44.64)$$

We predict, with 95% confidence, that a single new etch rate measurement at a chlorine flow of 3 SCCM will fall between 31.86 and 44.64 (100A/min).

(e) The standard errors of both the prediction interval and the confidence interval contain the term $(x^* - \bar{x})^2$. A value of $x^*$ that is closer to the average produces a smaller standard error than a value that is further away. The average chlorine flow is 2.67 SCCM, and because 2.5 is closer to this average than 3.0, the confidence and prediction intervals at x = 2.5 will be narrower than those at x = 3.0.

(f) It would not be wise to recommend a 95% PI for a flow of 6.0, because this value is so far from any of the recorded x values in the data set. The maximum value is 4.0 SCCM, so a prediction at 6.0 is an extrapolation: the interval will be very wide, and we have no evidence that the linear relationship still holds at that flow.

5. The estimated regression equation is:

$$\hat{Y} = 6.05 + 0.142\,\text{NaOH} - 0.0169\,\text{TIME}$$

When the NaOH percentage and treatment time are both equal to 0, the estimated average specific surface area is 6.05 cm²/g. When treatment time is held fixed, a one-percent increase in NaOH is associated with an average increase of 0.142 cm²/g in surface area. When NaOH is held fixed, a one-minute increase in treatment time is associated with an average decrease of 0.0169 cm²/g in surface area.
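The "held fixed" interpretation of the two coefficients can be checked numerically. The following is a minimal sketch under stated assumptions: the function name `predict_area` and the baseline inputs (NaOH = 10, time = 60) are hypothetical and chosen only for illustration, and the treatment-time coefficient is taken to be negative, as the interpretation above states.

```r
# Fitted equation for problem 5, with coefficients as reported in the text.
# The minus sign on the time coefficient follows the stated interpretation
# (surface area decreases as treatment time increases).
predict_area <- function(naoh, time) {
  6.05 + 0.142 * naoh - 0.0169 * time
}

# Hypothetical baseline settings, used only to illustrate the comparison.
base      <- predict_area(naoh = 10, time = 60)
more_naoh <- predict_area(naoh = 11, time = 60)  # +1% NaOH, time held fixed
more_time <- predict_area(naoh = 10, time = 61)  # +1 minute, NaOH held fixed

effect_naoh <- more_naoh - base
effect_time <- more_time - base

cat("Change for +1% NaOH:   ", effect_naoh, "\n")
cat("Change for +1 min time:", effect_time, "\n")
```

Because the model is additive, these differences equal the coefficients themselves regardless of which baseline values are chosen.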
(a) $R^2 = 0.807$, which means that treatment time and NaOH together account for 80.7% of the variation in surface area.

(b) The p-value for the model utility F-test is 0.007. Since this is below 0.05, we conclude that there is a useful relationship between the dependent variable and at least one of the predictors.

(c) Provided that the percentage of NaOH remains in the model, it does not appear that the predictor treatment time needs to be eliminated at significance level $\alpha = 0.05$, since the p-value for that coefficient is 0.043 (which means that we reject $H_0: \beta_{time} = 0$). However, if the model were being evaluated against a stricter significance level, such as $\alpha = 0.01$, we would fail to reject $H_0$ and might recommend eliminating this variable from the model.
(d) Calculating a 95% CI for the expected change in specific surface area associated with a 1% increase in NaOH (with treatment time held fixed) means calculating a 95% CI for $\beta_{NaOH}$. We can build this interval from the estimate and standard error provided by the output:

$$0.14167 \pm t_{0.975,\,6} \times 0.03301 = 0.14167 \pm 2.45 \times 0.03301 = (0.061, 0.222)$$

Note that the confidence interval does not contain zero, which means that we can reject the null hypothesis that this parameter is equal to 0 at the 0.05 level.
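The interval in part (d) can be reproduced in R from the two numbers quoted from the output. This is a small sketch, assuming the estimate 0.14167, standard error 0.03301, and 6 residual degrees of freedom stated above; the variable names are illustrative.

```r
# Estimate and standard error for the NaOH coefficient, as quoted in part (d).
beta_naoh <- 0.14167
se_naoh   <- 0.03301
df_resid  <- 6        # residual degrees of freedom used in the text

# 95% two-sided interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df = df_resid)
ci     <- beta_naoh + c(-1, 1) * t_crit * se_naoh

cat("t critical value:", round(t_crit, 3), "\n")
cat("95% CI: (", round(ci[1], 3), ",", round(ci[2], 3), ")\n")
```

Using `qt` instead of the rounded table value 2.45 changes the endpoints only in the fourth decimal place, so the interval still rounds to (0.061, 0.222) and still excludes zero.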