
Steps for Regression: Simple Linear Regression

1. Make a scatterplot. Does it make sense to fit a line?
2. Check the residual plot (residuals vs. X). Are there any patterns?
3. Check the histogram of residuals. Is it approximately Normal?
4. Check model utility.
5. Make inferences.

Example

In order to design an efficient incinerator of municipal waste, information about the energy content of types of waste is necessary.

Data

% plastics    Energy content    % plastics    Energy content
by weight     (kcal/kg)         by weight     (kcal/kg)
18.69         947               18.28         1334
19.43         1407              21.41         11
19.24         142               2.11          143
22.64         13                21.04         1278
16.4          989               17.99         113
21.44         1162              18.73         122
19.3          1466              18.49         1237
23.97         166               22.08         1327
21.4          124               14.28         1229
20.34         1336              17.74         120
17.03         1097              20.4          1221
21.03         1266              18.2          1138
20.49         1401              19.09         129
20.4          1223              21.2          1391
18.81         1216              21.62         1372

[Figure: Scatterplot]
[Figure: Residuals vs. X]

[Figure: Histogram of Residuals]
[Figure: Residual Normal Probability Plot]

Check Model Utility

Regression Analysis

The regression equation is
Energy Content (kcal/kg) = 469 + 40.8 % plastics by weight

Predictor    Coef     StDev    T       P
Constant     468.9    211.6    2.22    0.035
% plasti     40.82    10.57    3.86    0.001

Check Model Utility (con't)

Analysis of Variance

Source          DF    SS        MS        F        P
Regression       1    239735    239735    14.92    0.001
Resid. Error    28    449975     16071
Total           29    689710

S = 126.8    R-Sq = 34.8%    R-Sq(adj) = 32.4%

Inference

- What is the proportion of variation of energy content explained by the percentage of plastic in the waste?
- What is the correlation between energy content and percentage of plastic in the waste?
- For every percent increase in plastic in municipal waste, how much of an increase or decrease do you expect in energy content?

Inference (con't)

- Can we make inferences about the energy content of waste that is 7% plastic by weight?
- What is the average energy content expected at 22% plastic by weight? (Give interval.)
- What is the expected energy content of the next observation that contains 22% plastic waste? (Give interval.)
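The first three inference questions can be worked from the regression output alone. The sketch below is an illustration, not part of the original slides. It assumes sums of squares of 239735 (regression) and 449975 (error), a slope standard error of 10.57, and a t critical value of 2.048 (upper 0.025 point with 28 degrees of freedom, from a t table). These are the values that agree with the reported F = 14.92, T = 3.86, S = 126.8 and R-Sq = 34.8%. All variable names are this sketch's own.

```python
import math

# Summary values taken from the regression output (assumed as described above).
n = 30                      # observations, so error df = n - 2 = 28
ss_regression = 239735.0
ss_error = 449975.0
ss_total = ss_regression + ss_error   # 689710, the reported total
slope = 40.82               # kcal/kg per one-point increase in % plastics
se_slope = 10.57

# Proportion of variation in energy content explained by % plastics.
r_squared = ss_regression / ss_total

# Correlation: square root of R-Sq, carrying the sign of the slope.
r = math.copysign(math.sqrt(r_squared), slope)

# Residual standard deviation and the model utility F statistic.
mse = ss_error / (n - 2)
s = math.sqrt(mse)
f_stat = ss_regression / mse

# 95% confidence interval for the slope (t critical value from a table).
T_CRIT_28 = 2.048
ci_low = slope - T_CRIT_28 * se_slope
ci_high = slope + T_CRIT_28 * se_slope

print(f"R-Sq = {r_squared:.3f}, r = {r:.3f}, S = {s:.1f}, F = {f_stat:.2f}")
print(f"95% CI for slope: ({ci_low:.1f}, {ci_high:.1f}) kcal/kg per % plastics")
```

The intervals at 22% plastic are not computed here because they also need the mean and spread of the x values, which this summary output does not list.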

Transformed x's and y's (Section 13.2)

What if it doesn't pass the residual test?

We can transform the x's or the y's and fit a linear regression line to the new x and new y.

A function relating y to x is intrinsically linear if, by means of a transformation on x and/or y, the function can be expressed as

$y' = \beta_0 + \beta_1 x'$

where x' is a transformed independent variable and y' is a transformed dependent variable.

Types of Transformations

Function             Transformation(s)          Linear form
y = α·e^(βx)         y' = ln(y)                 y' = ln(α) + βx
y = α·x^β            y' = ln(y), x' = ln(x)     y' = ln(α) + βx'
y = α + β·ln(x)      x' = ln(x)                 y = α + βx'
y = α + β·(1/x)      x' = 1/x                   y = α + βx'

Notes

1. We can estimate the β-hat values by using the same least squares regression formulas.
2. The r^2 refers to the proportion of variation in the new y's that is explained by the new x's.
3. To make a CI or PI, the transformed errors need to be approximately Normal.

Example

Use trial and error or theory to find the appropriate transformation.

A tortilla chip maker would like to make the optimal tortilla chip: the chip that has the most appealing texture.

X = frying time (sec)
Y = % moisture content

X     5      10     15     20     25     30     45     60
Y     16.3   9.7    8.1    4.2    3.4    2.9    1.9    1.3

[Figure: Scatter Plot (original x's and y's)]
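As an illustration of Note 1, the sketch below fits the power-function row of the table, y = α·x^β, to the tortilla chip data by running ordinary least squares on ln(x) and ln(y) and then transforming back. It is a minimal sketch, not the course's Minitab procedure. The frying times 5, 10, 15, 20, 25, 30, 45, 60 are an assumption here: they are the values for which this fit reproduces the coefficients 4.6384 and -1.04920 reported in the regression output for this example. The helper names `least_squares` and `predict_moisture` are hypothetical.

```python
import math

def least_squares(xs, ys):
    """Return (intercept, slope) of the simple least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Assumed data: frying time (sec) and % moisture content.
fry_time = [5, 10, 15, 20, 25, 30, 45, 60]
moisture = [16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 1.9, 1.3]

# Transform both variables, then fit y' = ln(alpha) + beta * x'.
ln_x = [math.log(x) for x in fry_time]
ln_y = [math.log(y) for y in moisture]
intercept, beta = least_squares(ln_x, ln_y)

# Back-transform the intercept to get alpha in y = alpha * x**beta.
alpha = math.exp(intercept)

def predict_moisture(seconds):
    """Predicted % moisture on the original scale."""
    return alpha * seconds ** beta

print(f"ln(y) = {intercept:.4f} + ({beta:.4f}) ln(x)")
print(f"y = {alpha:.1f} * x^({beta:.3f})")
```

Predictions from `predict_moisture` are on the original scale, but r^2 and any intervals still refer to the transformed variables, as Notes 2 and 3 point out.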

[Figure: Residuals]
[Figure: Scatterplot (ln(x), y)]
[Figure: Scatterplot (1/x, y)]
[Figure: Scatterplot (x, ln(y))]
[Figure: Scatterplot (ln(x), ln(y))]
[Figure: Scatterplot (log(x), log(y))]
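The trial-and-error comparison behind these scatterplots can also be done numerically. The sketch below computes r^2 for each candidate pair of transformations. It is a rough screening aid under stated assumptions, not a replacement for looking at the plots: the frying times 5, 10, 15, 20, 25, 30, 45, 60 are assumed (they are consistent with the regression output reported for this example), and the function and dictionary names are hypothetical.

```python
import math

def r_squared(xs, ys):
    """Squared correlation between xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

def identity(value):
    return value

def reciprocal(value):
    return 1.0 / value

# Assumed data: frying time (sec) and % moisture content.
fry_time = [5, 10, 15, 20, 25, 30, 45, 60]
moisture = [16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 1.9, 1.3]

# Each candidate is (transformation of x, transformation of y).
candidates = {
    "x, y": (identity, identity),
    "ln(x), y": (math.log, identity),
    "1/x, y": (reciprocal, identity),
    "x, ln(y)": (identity, math.log),
    "ln(x), ln(y)": (math.log, math.log),
}

results = {}
for label, (fx, fy) in candidates.items():
    new_x = [fx(x) for x in fry_time]
    new_y = [fy(y) for y in moisture]
    results[label] = r_squared(new_x, new_y)

# List the candidates from most to least linear.
for label, value in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{label:14s} r^2 = {value:.3f}")
```

Base-10 logs are left out because they give the same r^2 as natural logs; changing the base only rescales the axes. A high r^2 does not settle the choice on its own, so the residual and Normal probability plots still need to be checked.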

Decision Time

Two reasonable choices:
- ln(x), ln(y)
- 1/x, y

Look at the Normal probability plots.

Reg Output

Regression Analysis: ln(%) versus ln(fry)

The regression equation is
ln(%) = 4.64 - 1.05 ln(fry)

Predictor    Coef        SE Coef    T         P
Constant     4.6384      0.2110     21.98     0.000
ln(fry)      -1.04920    0.06786    -15.46    0.000

S = 0.1449    R-Sq = 97.6%    R-Sq(adj) = 97.1%

Output (con't)

Analysis of Variance

Source      DF    SS        MS        F         P
Regres       1    5.0199    5.0199    239.06    0.000
Resi Err     6    0.1260    0.0210
Total        7    5.1458

Polynomial Regression (Section 13.3)

What do we do if the scatterplot is not linear?

In Section 13.2, we fixed this with intrinsically linear transformations. If the plot has any peaks or valleys, the transformations will not work. You can now try to fit a polynomial with X^2 and X^3 terms. You could fit higher-order terms, but it is strongly discouraged!

Equation

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \dots + e$

The errors have to have a mean of zero and a constant variance. The Beta terms are estimated using the method of least squares. We will always use Minitab to fit the curve.

Example

A company wants to improve the fermentation process of their malt liquor. Below is the data.

X = fermentation time (days)
Y = glucose concentration

X     1     2     3     4     5     6     7     8
Y     74    54    52    51    52    53    58    71

[Figure: Scatterplot]
[Figure: Residual Plot]

Fixes

- Try fitting a squared term.
- If that doesn't work, add a cubic term.
- Keep going until you have a good fit.
- You want to have the smallest number of terms possible.

[Figure: Fitted Line with Squared Term]
[Figure: Fitted Line with Cubed Term]
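The course fits these curves in Minitab. As a rough illustration of what "adding a squared term, then a cubic term" does, the sketch below fits both polynomials by least squares using the normal equations. It assumes the data are x = 1, ..., 8 and y = 74, 54, 52, 51, 52, 53, 58, 71, which are the values for which the quadratic fit reproduces the reported coefficients 84.482 and 1.7679. The functions `solve_linear_system`, `fit_polynomial` and `sum_squared_error` are hypothetical helpers written for this example.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * b = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution

def fit_polynomial(xs, ys, degree):
    """Least squares coefficients [b0, b1, ..., b_degree]."""
    m = degree + 1
    # Normal equations: (X'X) b = X'y, where column j of X holds x**j.
    xtx = [[float(sum(x ** (i + j) for x in xs)) for j in range(m)]
           for i in range(m)]
    xty = [float(sum((x ** i) * y for x, y in zip(xs, ys))) for i in range(m)]
    return solve_linear_system(xtx, xty)

def sum_squared_error(xs, ys, coefs):
    """Sum of squared residuals for the fitted polynomial."""
    total = 0.0
    for x, y in zip(xs, ys):
        fitted = sum(b * x ** power for power, b in enumerate(coefs))
        total += (y - fitted) ** 2
    return total

# Assumed data: fermentation time (days) and glucose concentration.
days = [1, 2, 3, 4, 5, 6, 7, 8]
glucose = [74, 54, 52, 51, 52, 53, 58, 71]

quadratic = fit_polynomial(days, glucose, 2)
cubic = fit_polynomial(days, glucose, 3)
sse_quadratic = sum_squared_error(days, glucose, quadratic)
sse_cubic = sum_squared_error(days, glucose, cubic)

print("quadratic coefficients:", [round(b, 4) for b in quadratic])
print(f"SSE with squared term: {sse_quadratic:.2f}")
print(f"SSE with cubed term:   {sse_cubic:.2f}")
```

The error sum of squares can only go down when the cubic term is added, so a smaller SSE by itself is not evidence that the extra term is needed.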

What is R^2 and R^2 (adjusted)?

R^2 measures how much of the variation in y is explained by all of the x terms. R^2 never gets smaller when you add a term, so R^2 with a cubic term would appear to be automatically better than with only a squared term. However, simplicity is better, so R^2 (adjusted) takes out the automatic inflation.

Moral: If you have multiple x's, use R^2 (adjusted). If you have a single x, use R^2.

Regression Output

Regression Analysis: y versus x, xsquared

The regression equation is
y = 84.5 - 15.9 x + 1.77 xsquared

Predictor    Coef       SE Coef    T        P
Constant     84.482     4.904      17.23    0.000
x            -15.875    2.500      -6.35    0.001
xsquared     1.7679     0.2712     6.52     0.001

S = 3.515    R-Sq = 89.5%    R-Sq(adj) = 85.3%

??? Model Utility Test ???

ANOVA Table

Analysis of Variance

Source         DF    SS        MS        F        P
Regress         2    525.11    262.55    21.25    0.004
Resid Error     5    61.77     12.35
Total           7    586.87

Source      DF    Seq SS
x            1    0.05
xsquared     1    525.05

You have to use this table now to conduct the model utility test.

Inferences

Test statistic formula:
Test for the Beta terms:

Confidence Intervals

Confidence interval formula:
For the Beta terms:

Intervals

Confidence interval formula:
Prediction interval formula:
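As a hedged illustration of how the ANOVA table feeds the model utility test, the sketch below recomputes R^2, R^2 (adjusted), the F statistic, and a t-based interval for the xsquared coefficient. It uses the standard textbook forms, which may differ in notation from the formulas on the slides. The inputs are assumptions chosen to be consistent with the output: SSR = 525.11 and SSE = 61.77 (summing to the total of about 586.87), n = 8 observations and k = 2 predictor terms. The critical values 5.79 (F, 0.05 level, 2 and 5 degrees of freedom) and 2.571 (t, upper 0.025 point, 5 degrees of freedom) come from standard tables.

```python
# Inputs taken from the ANOVA table and coefficient table (assumed as above).
n = 8                      # observations
k = 2                      # predictor terms: x and xsquared
ss_regression = 525.11
ss_error = 61.77
ss_total = ss_regression + ss_error
error_df = n - (k + 1)     # 5

# R-Sq and R-Sq(adj): the adjusted version penalizes each added term.
r_sq = ss_regression / ss_total
r_sq_adj = 1 - (ss_error / error_df) / (ss_total / (n - 1))

# Model utility test: H0 says every beta except the intercept is zero.
# F = MSR / MSE, compared with an F critical value on (k, error_df) df.
f_stat = (ss_regression / k) / (ss_error / error_df)
F_CRIT_2_5 = 5.79
model_is_useful = f_stat > F_CRIT_2_5

# Test and 95% interval for a single beta term (here, xsquared).
coef_xsq = 1.7679
se_xsq = 0.2712
t_stat = coef_xsq / se_xsq
T_CRIT_5 = 2.571
ci_low = coef_xsq - T_CRIT_5 * se_xsq
ci_high = coef_xsq + T_CRIT_5 * se_xsq

print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}")
print(f"F = {f_stat:.2f}, reject H0 at the 0.05 level: {model_is_useful}")
print(f"t for xsquared = {t_stat:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

Because the interval for the xsquared coefficient excludes zero, the squared term is contributing to the fit, which matches the small P-value reported for it.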