Simple Linear Regression

Steps for Regression
1. Make a scatterplot. Does it make sense to fit a line?
2. Check the residual plot (residuals vs. x). Are there any patterns?
3. Check the histogram of the residuals. Is it approximately Normal?
4. Check model utility.
5. Make inferences.

Example
To design an efficient incinerator for municipal waste, we need information about the energy content of different types of waste.
Data

% plastics    Energy content    % plastics    Energy content
by weight     (kcal/kg)         by weight     (kcal/kg)
18.69          947              18.28         1334
19.43         1407              21.41         1155
19.24         1452              25.11         1453
22.64         1553              21.04         1278
16.54          989              17.99         1153
21.44         1162              18.73         1225
19.53         1466              18.49         1237
23.97         1656              22.08         1327
21.45         1254              14.28         1229
20.34         1336              17.74         1205
17.03         1097              20.54         1221
21.03         1266              18.25         1138
20.49         1401              19.09         1295
20.45         1223              21.25         1391
18.81         1216              21.62         1372

[Figure: Scatterplot]

[Figure: Residuals vs. x]
[Figure: Histogram of Residuals]

[Figure: Residual Normal Probability Plot]

Regression Analysis: Check Model Utility

The regression equation is
Energy Content (kcal/kg) = 469 + 40.8 % plastics by weight

Predictor     Coef    StDev     T       P
Constant      468.9   211.6     2.22    0.035
% plasti      40.82   10.57     3.86    0.001

S = 126.8    R-Sq = 34.8%    R-Sq(adj) = 32.4%
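The regression output above can be reproduced from the data table with the usual least squares formulas. The sketch below is a minimal pure-Python version, not Minitab's implementation; the helper name `fit_line` and the variable names are illustrative only.

```python
import math

# (% plastics by weight, energy content in kcal/kg), copied from the data table
data = [
    (18.69, 947), (18.28, 1334), (19.43, 1407), (21.41, 1155), (19.24, 1452),
    (25.11, 1453), (22.64, 1553), (21.04, 1278), (16.54, 989), (17.99, 1153),
    (21.44, 1162), (18.73, 1225), (19.53, 1466), (18.49, 1237), (23.97, 1656),
    (22.08, 1327), (21.45, 1254), (14.28, 1229), (20.34, 1336), (17.74, 1205),
    (17.03, 1097), (20.54, 1221), (21.03, 1266), (18.25, 1138), (20.49, 1401),
    (19.09, 1295), (20.45, 1223), (21.25, 1391), (18.81, 1216), (21.62, 1372),
]


def fit_line(xs, ys):
    """Return the least squares intercept b0, slope b1, and Sxx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1, sxx


xs = [x for x, _ in data]
ys = [y for _, y in data]
n = len(xs)
b0, b1, sxx = fit_line(xs, ys)

# Residuals are what the residual plot and histogram are drawn from
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)
sst = sum((y - sum(ys) / n) ** 2 for y in ys)

s = math.sqrt(sse / (n - 2))      # "S" in the output
r_sq = 1 - sse / sst              # "R-Sq" in the output
se_b1 = s / math.sqrt(sxx)        # "StDev" of the slope
t_b1 = b1 / se_b1                 # "T" for the slope

print(f"intercept = {b0:.1f}, slope = {b1:.2f}")
print(f"S = {s:.1f}, R-Sq = {100 * r_sq:.1f}%, T(slope) = {t_b1:.2f}")
```

The printed values should agree with the Minitab output up to rounding.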
Regression Analysis: Check Model Utility (cont.)

Analysis of Variance
Source          DF    SS        MS        F        P
Regression       1    239735    239735    14.92    0.001
Resid. Error    28    449975     16071
Total           29    689710

Inference
What is the proportion of variation in energy content explained by the percentage of plastic in the waste?
What is the correlation between energy content and percentage of plastic in the waste?
For every one-percent increase in plastic in municipal waste, how much of an increase or decrease do you expect in energy content?

Inference (cont.)
Can we make inferences about the energy content of waste that is 75% plastic by weight?
What is the average energy content expected at 22% plastic by weight? (Give an interval.)
What is the expected energy content of the next observation that contains 22% plastic waste? (Give an interval.)
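The inference questions above can be worked from the Minitab output alone. The sketch below is one way to do it and rests on assumptions that are not on the slides: a 95% confidence level, the table value t(0.025, 28 df) = 2.048, and the sample mean of x taken as 596.98 / 30 (the sum of the % plastics column in the data table). The observed range of x (14.28 to 25.11) is also read from the data table.

```python
import math

# Values copied from the regression output
n = 30
b0, b1 = 468.9, 40.82
s = 126.8
se_b1 = 10.57
ssr, sst = 239735, 689710

# Assumptions, not stated on the slides
t_crit = 2.048               # t(0.025, 28 df), for an assumed 95% level
x_bar = 596.98 / n           # sum of the % plastics column divided by n
x_min, x_max = 14.28, 25.11  # smallest and largest x in the data table

# Proportion of variation explained, and the correlation
r_sq = ssr / sst
r = math.copysign(math.sqrt(r_sq), b1)  # r takes the sign of the slope

# Expected change in energy content per one-percent increase in plastics
slope_ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# 75% plastic lies far outside the observed x range, so it is extrapolation
extrapolating_at_75 = not (x_min <= 75 <= x_max)

# Sxx recovered from the output, because SE(b1) = s / sqrt(Sxx)
sxx = (s / se_b1) ** 2

x_star = 22
y_hat = b0 + b1 * x_star
leverage = 1 / n + (x_star - x_bar) ** 2 / sxx
half_ci = t_crit * s * math.sqrt(leverage)      # mean response at x_star
half_pi = t_crit * s * math.sqrt(1 + leverage)  # next single observation
ci = (y_hat - half_ci, y_hat + half_ci)
pi = (y_hat - half_pi, y_hat + half_pi)

print(f"R-Sq = {r_sq:.3f}, r = {r:.3f}")
print(f"slope CI = ({slope_ci[0]:.1f}, {slope_ci[1]:.1f})")
print(f"fit at 22% = {y_hat:.1f}")
print(f"CI for the mean = ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"PI for the next observation = ({pi[0]:.1f}, {pi[1]:.1f})")
```

The prediction interval is always wider than the confidence interval at the same x, because it also has to cover the scatter of a single new observation around the line.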
Transformed x's and y's (Section 13.2)

What if it doesn't pass the residual test?
We can transform the x's or the y's and fit a linear regression line to the new x and new y.
A function relating y to x is intrinsically linear if, by means of a transformation on x and/or y, the function can be expressed as $y' = \beta_0 + \beta_1 x'$, where $x'$ is the transformed independent variable and $y'$ is the transformed dependent variable.

Types of Transformations

Function                                  Transformation(s)                 Linear form
$y = \alpha e^{\beta x}$                  $y' = \ln(y)$                     $y' = \ln(\alpha) + \beta x$
$y = \alpha x^{\beta}$                    $y' = \ln(y)$, $x' = \ln(x)$      $y' = \ln(\alpha) + \beta x'$
$y = \alpha + \beta \log(x)$              $x' = \log(x)$                    $y = \alpha + \beta x'$
$y = \alpha + \beta \cdot \frac{1}{x}$    $x' = \frac{1}{x}$                $y = \alpha + \beta x'$
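To see what "intrinsically linear" means in practice, the sketch below builds points that lie exactly on an exponential curve, applies the transformation from the first row of the table, and recovers the parameters with an ordinary straight-line fit. The parameter values and x values are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical values for y = alpha * exp(beta * x); not from the slides
alpha, beta = 2.0, 0.5
xs = [1, 2, 3, 4, 5, 6]
ys = [alpha * math.exp(beta * x) for x in xs]

# Transformation from the table: y' = ln(y), x unchanged
y_new = [math.log(y) for y in ys]


def fit_line(xs, ys):
    """Return the least squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Linear form: y' = ln(alpha) + beta * x
b0, b1 = fit_line(xs, y_new)
alpha_hat = math.exp(b0)  # undo the log on the intercept
beta_hat = b1

print(f"alpha_hat = {alpha_hat:.4f}, beta_hat = {beta_hat:.4f}")
```

With real data the points will not lie exactly on the curve, so the recovered values are estimates rather than the exact parameters.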
Notes
1. We can estimate the β-hat values by using the same least squares regression formulas.
2. The r² refers to the proportion of variation in the new y's that is explained by the new x's.
3. To make CIs and PIs, the transformed errors need to be approximately Normal.
Use trial and error or theory to find the appropriate transformation.

Example
A tortilla chip maker would like to make the optimal tortilla chip, the one with the most appealing texture.
x = frying time (sec)
y = % moisture content

x     5      10     15     20     25     30     45     60
y     16.3   9.7    8.1    4.2    3.4    2.9    1.9    1.3

[Figure: Scatter plot (original x's and y's)]
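Before choosing a transformation, it helps to see why a straight line is not adequate for the original x's and y's. The sketch below fits a line to the raw tortilla chip data and prints the residuals; the helper name `fit_line` is illustrative only.

```python
fry_time = [5, 10, 15, 20, 25, 30, 45, 60]            # x, seconds
moisture = [16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 1.9, 1.3]  # y, percent


def fit_line(xs, ys):
    """Return the least squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


b0, b1 = fit_line(fry_time, moisture)
residuals = [y - (b0 + b1 * x) for x, y in zip(fry_time, moisture)]

for x, e in zip(fry_time, residuals):
    print(f"x = {x:2d}  residual = {e:6.2f}")
```

The residuals are positive at both ends and negative in the middle. That curved pattern is what fails the residual check and motivates trying a transformation.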
[Figure: Residuals]

[Figure: Scatterplot (ln(x), y)]

[Figure: Scatterplot (1/x, y)]
[Figure: Scatterplot (x, ln(y))]

[Figure: Scatterplot (ln(x), ln(y))]

[Figure: Scatterplot (log(x), log(y))]
Decision Time
Two reasonable choices:
1. ln(x), ln(y)
2. 1/x, y
Look at the Normal probability plots.

Regression Output
Regression Analysis: ln(%) versus ln(fry)

The regression equation is
ln(%) = 4.64 - 1.05 ln(fry)

Predictor    Coef        SE Coef    T         P
Constant      4.6384     0.2110      21.98    0.000
ln(fry)      -1.04920    0.06786    -15.46    0.000

S = 0.1449    R-Sq = 97.6%    R-Sq(adj) = 97.1%

Output (cont.)
Analysis of Variance
Source      DF    SS        MS        F         P
Regres       1    5.0199    5.0199    239.06    0.000
Resi Err     6    0.1260    0.0210
Total        7    5.1458
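The ln(x), ln(y) output above is an ordinary least squares fit on the transformed data. The sketch below reproduces it in pure Python as a check; it is an illustration and not Minitab's algorithm, and the variable names are made up for the example.

```python
import math

fry_time = [5, 10, 15, 20, 25, 30, 45, 60]            # x, seconds
moisture = [16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 1.9, 1.3]  # y, percent

# Transform both variables: x' = ln(x), y' = ln(y)
lx = [math.log(x) for x in fry_time]
ly = [math.log(y) for y in moisture]

n = len(lx)
x_bar = sum(lx) / n
y_bar = sum(ly) / n
sxx = sum((x - x_bar) ** 2 for x in lx)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(lx, ly))

b1 = sxy / sxx            # coefficient of ln(fry)
b0 = y_bar - b1 * x_bar   # constant

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(lx, ly))
sst = sum((y - y_bar) ** 2 for y in ly)
s = math.sqrt(sse / (n - 2))
r_sq = 1 - sse / sst

# Back on the original scale the model is the power function y = alpha * x**beta
alpha_hat = math.exp(b0)
beta_hat = b1

print(f"ln(%) = {b0:.4f} + ({b1:.4f}) ln(fry)")
print(f"S = {s:.4f}, R-Sq = {100 * r_sq:.1f}%")
print(f"power form: y = {alpha_hat:.1f} * x^({beta_hat:.3f})")
```

Keep in mind that S and R-Sq describe the fit on the transformed scale, which is the point of note 2 earlier.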
Predictions
Predict the moisture content when the frying time is 61 sec and when it is 62 sec. You should be at least 94% confident in all of the statements.

Polynomial Regression (Section 13.3)

What do we do if the scatterplot is not linear?
In Section 13.2, we fixed this with intrinsically linear transformations.
If the plot has any peaks or valleys, the transformations will not work.
You can now try to fit a polynomial with x² and x³ terms.
You could fit higher-order terms, but it is strongly discouraged!
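Returning to the tortilla chip prediction question: one common reading of "at least 94% confident in all of the statements" is a Bonferroni adjustment, in which each of the two prediction intervals is built at the 97% level. The sketch below assumes that reading. It also assumes that prediction intervals (not confidence intervals for the mean) are wanted, and it computes the t critical value numerically instead of reading it from a table. All function names are illustrative.

```python
import math

fry_time = [5, 10, 15, 20, 25, 30, 45, 60]
moisture = [16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 1.9, 1.3]
lx = [math.log(x) for x in fry_time]
ly = [math.log(y) for y in moisture]

n = len(lx)
x_bar = sum(lx) / n
y_bar = sum(ly) / n
sxx = sum((x - x_bar) ** 2 for x in lx)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(lx, ly)) / sxx
b0 = y_bar - b1 * x_bar
s = math.sqrt(sum((y - b0 - b1 * x) ** 2 for x, y in zip(lx, ly)) / (n - 2))


def t_pdf(t, df):
    """Density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t) for t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(p, df):
    """Upper quantile (p > 0.5) found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Bonferroni (assumed): 2 intervals, joint level >= 94%, so 97% each
k = 2
tail = (1 - 0.94) / (2 * k)            # 0.015 in each tail
t_crit = t_quantile(1 - tail, n - 2)


def prediction_interval(x_new):
    """Return (lower, prediction, upper) for moisture at frying time x_new."""
    lx_new = math.log(x_new)
    centre = b0 + b1 * lx_new
    half = t_crit * s * math.sqrt(1 + 1 / n + (lx_new - x_bar) ** 2 / sxx)
    # The interval is built for ln(y), then both ends are exponentiated
    return math.exp(centre - half), math.exp(centre), math.exp(centre + half)


for x_new in (61, 62):
    lower, pred, upper = prediction_interval(x_new)
    print(f"x = {x_new}: prediction {pred:.2f}%, interval ({lower:.2f}, {upper:.2f})")
```

Both frying times are just beyond the largest observed x (60 sec), so the intervals should be read with some caution.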
Equation
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \cdots + e$
The errors must have a mean of zero and a constant variance.
The β terms are estimated using the method of least squares.
We will always use Minitab to fit the curve.

Example
A company wants to improve the fermentation process of its malt liquor. The data are below.
x = fermentation time (days)
y = glucose concentration

x     1     2     3     4     5     6     7     8
y     74    54    52    51    52    53    58    71

[Figure: Scatterplot]
[Figure: Residual Plot]

Fixes
1. Try fitting a squared term.
2. If that doesn't work, add a cubic term.
3. Keep going until you have a good fit.
You want to have the smallest number of terms possible.

[Figure: Fitted Line with Squared Term]
[Figure: Fitted Line with Cubed Term]

What is R² and R² (adjusted)?
R² measures how much of the variation in y is explained by all of the x terms.
R² never decreases when you add a term, and in practice it almost always gets bigger. So R² with a cubic term would automatically appear to be better than R² with only a squared term.
However, simplicity is better, so R² (adjusted) takes out the automatic inflation.
Moral: If you have multiple x's, use R² (adjusted). If you have a single x, use R².

Regression Output
Regression Analysis: y versus x, xsquared

The regression equation is
y = 84.5 - 15.9 x + 1.77 xsquared

Predictor    Coef       SE Coef    T        P
Constant      84.482    4.904      17.23    0.000
x            -15.875    2.500      -6.35    0.001
xsquared      1.7679    0.2712      6.52    0.001

S = 3.515    R-Sq = 89.5%    R-Sq(adj) = 85.3%

Model utility test?
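The quadratic output above, and the difference between R² and R² (adjusted), can be checked with a small polynomial least squares routine. The slides use Minitab for this; the sketch below is a stand-in that solves the normal equations directly in pure Python, and the function names `solve`, `polyfit`, and `r_squared` are illustrative only.

```python
x = [1, 2, 3, 4, 5, 6, 7, 8]           # fermentation time (days)
y = [74, 54, 52, 51, 52, 53, 58, 71]   # glucose concentration


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def polyfit(xs, ys, degree):
    """Least squares coefficients [b0, b1, ..., b_degree] from the normal equations."""
    k = degree + 1
    xtx = [[sum(v ** (i + j) for v in xs) for j in range(k)] for i in range(k)]
    xty = [sum((v ** i) * w for v, w in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(xs, ys, coefs):
    """Return (R-Sq, R-Sq adjusted) for a fitted polynomial."""
    n = len(ys)
    y_bar = sum(ys) / n
    fitted = [sum(c * v ** i for i, c in enumerate(coefs)) for v in xs]
    sse = sum((w - f) ** 2 for w, f in zip(ys, fitted))
    sst = sum((w - y_bar) ** 2 for w in ys)
    r2 = 1 - sse / sst
    adj = 1 - (sse / (n - len(coefs))) / (sst / (n - 1))
    return r2, adj


results = {}
for degree in (1, 2, 3):
    coefs = polyfit(x, y, degree)
    results[degree] = (coefs, *r_squared(x, y, coefs))
    r2, adj = results[degree][1], results[degree][2]
    print(f"degree {degree}: R-Sq = {100 * r2:.1f}%, R-Sq(adj) = {100 * adj:.1f}%")

print("quadratic coefficients:", [round(c, 4) for c in results[2][0]])
```

For these data R² goes up when the cubic term is added, but R² (adjusted) goes down slightly, which supports stopping at the squared term.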
ANOVA Table
Analysis of Variance
Source         DF    SS        MS        F        P
Regress         2    525.11    262.55    21.25    0.004
Resid Error     5     61.77     12.35
Total           7    586.87

Source      DF    Seq SS
x            1      0.05
xsquared     1    525.05

You have to use this table now to conduct the model utility test.

Inferences
Test Statistic Formula:
Test for the Beta Terms:

Confidence Intervals
Confidence Interval Formula for the Beta Terms:
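The formulas on these slides did not survive in this copy, so the sketch below assumes the standard forms: the model utility statistic F = MSR / MSE, the test statistic t = coefficient / SE for each β term, and the interval coefficient ± t × SE with n − (k + 1) degrees of freedom. The 95% level and the table value t(0.025, 5 df) = 2.571 are also assumptions. All other numbers are copied from the quadratic regression output and ANOVA table.

```python
# Values copied from the quadratic regression output and ANOVA table
n, k = 8, 2                      # observations, predictors (x and xsquared)
ssr, sse = 525.11, 61.77
coefficients = {
    "Constant": (84.482, 4.904),  # (Coef, SE Coef)
    "x": (-15.875, 2.500),
    "xsquared": (1.7679, 0.2712),
}

# Assumed: 95% level, t(0.025, 5 df) = 2.571 from a t table
t_crit = 2.571
df_error = n - (k + 1)

# Model utility test: are any of the beta terms (besides the constant) useful?
msr = ssr / k
mse = sse / df_error
f_stat = msr / mse
print(f"F = {f_stat:.2f} on ({k}, {df_error}) df")

# Test and interval for each beta term
t_stats = {}
intervals = {}
for name, (coef, se) in coefficients.items():
    t_stats[name] = coef / se
    intervals[name] = (coef - t_crit * se, coef + t_crit * se)
    lo, hi = intervals[name]
    print(f"{name:9s} t = {t_stats[name]:6.2f}   CI = ({lo:.3f}, {hi:.3f})")
```

The interval for the xsquared coefficient excludes zero, which matches its small P value in the output.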
Intervals
Confidence Interval Formula:
Prediction Interval Formula:

Additional Output
Obs    x       y        Fit      SE Fit    Residual    St Resid
1      1.00    74.00    70.38    2.96       3.62        1.91
2      2.00    54.00    59.80    1.86      -5.80       -1.95
3      3.00    52.00    52.77    1.69      -0.77       -0.25
4      4.00    51.00    49.27    1.86       1.73        0.58
5      5.00    52.00    49.30    1.86       2.70        0.90
6      6.00    53.00    52.88    1.69       0.12        0.04
7      7.00    58.00    59.98    1.86      -1.98       -0.66
8      8.00    71.00    70.62    2.96       0.38        0.20

One Additional Note
Many statisticians believe that it is good practice to center the x's before you fit the equation. This helps tremendously with round-off error. You then have to be careful to transform everything back to its original scale. We are not going to do an example of this.
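Coming back to the confidence and prediction intervals: the interval formulas themselves are missing from this copy, so the sketch below assumes the standard forms, Fit ± t × (SE Fit) for the mean response and Fit ± t × sqrt(S² + (SE Fit)²) for a new observation. It uses the Fit and SE Fit columns from the additional output, S = 3.515 from the regression output, and an assumed 95% level with the table value t(0.025, 5 df) = 2.571.

```python
import math

s = 3.515       # "S" from the quadratic regression output
t_crit = 2.571  # assumed: t(0.025, 5 df), 95% level

# (x, y, Fit, SE Fit) copied from the additional output
rows = [
    (1, 74, 70.38, 2.96), (2, 54, 59.80, 1.86), (3, 52, 52.77, 1.69),
    (4, 51, 49.27, 1.86), (5, 52, 49.30, 1.86), (6, 53, 52.88, 1.69),
    (7, 58, 59.98, 1.86), (8, 71, 70.62, 2.96),
]


def intervals(fit, se_fit):
    """Return (confidence interval, prediction interval) at one x value."""
    ci_half = t_crit * se_fit
    pi_half = t_crit * math.sqrt(s ** 2 + se_fit ** 2)
    return (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)


for x, y, fit, se_fit in rows:
    ci, pi = intervals(fit, se_fit)
    print(f"x = {x}: CI ({ci[0]:.2f}, {ci[1]:.2f})   PI ({pi[0]:.2f}, {pi[1]:.2f})")
```

SE Fit is largest at x = 1 and x = 8, so both kinds of interval are widest at the ends of the data.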