Chapter 26
Multiple Regression, Logistic Regression, and Indicator Variables

26.1 S4/IEE Application Examples: Multiple Regression

An S4/IEE project was created to improve the 30,000-foot-level metric DSO. Two inputs that surfaced from a cause-and-effect diagram were the size of the invoice and the number of line items included within the invoice. A multiple regression analysis was conducted for DSO versus size of invoice and number of line items included in the invoice.

An S4/IEE project was created to improve the 30,000-foot-level metric, the diameter of a manufactured part. Inputs that surfaced from a cause-and-effect diagram were the temperature, pressure, and speed of the manufacturing process. A multiple regression analysis of diameter versus temperature, pressure, and speed was conducted.
26.2 Description

A general model includes polynomial terms in one or more variables, such as

Y = β0 + β1x1 + β2x2 + β3x1^2 + β4x2^2 + β5x1x2 + ε

where the βs are unknown parameters and ε is random error. This full quadratic model of Y on x1 and x2 is of great use in DOE. For the situation without polynomial terms where there are k predictor variables, the general model reduces to the form

Y = β0 + β1x1 + ... + βkxk + ε

The objective is to determine from data the least squares estimates (b0, b1, ..., bk) of the unknown parameters (β0, β1, ..., βk) for the prediction equation

Ŷ = b0 + b1x1 + ... + bkxk

where Ŷ is the predicted value of Y for given values of (x1, ..., xk). Many statistical software packages can perform these calculations.
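In matrix form, the least squares estimates are b = (X'X)^-1 X'y, where X is the design matrix with a leading column of ones for the intercept. A minimal sketch of the calculation the software performs, using made-up illustrative data:

```python
import numpy as np

# Made-up illustrative data: k = 2 predictors, n = 5 observations
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([5.1, 6.9, 11.2, 12.8, 17.1])

# Design matrix: the leading column of ones gives the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares estimates b = (X'X)^-1 X'y, solved stably via lstsq
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b            # predicted values
residuals = y - y_hat
print(b)
```

A defining property of the least squares fit is that the residuals are orthogonal to every column of the design matrix, which is a useful sanity check on any implementation.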
26.3 Example 26.1: Multiple Regression

An investigator wants to determine the relationship of a key process output variable, product strength, to two key process input variables, hydraulic pressure during a forming process and acid concentration. The data are given as follows:

Strength  Pressure  Concentration
   665      110         116
   618      119         104
   620      138          94
   578      130          86
   682      143         110
   594      133          87
   722      147         114
   700      142         106
   681      125         107
   695      135         106
   664      152          98
   548      118          86
   620      155          87
   595      128          96
   740      146         120
   670      132         108
   640      130         104
   590      112          91
   570      113          92
   640      120         100

Regression Analysis: Strength versus Pressure, Concentration

The regression equation is
Strength = 16.3 + 1.57 Pressure + 4.16 Concentration

Predictor         Coef  SE Coef      T      P
Constant         16.28    44.30   0.37  0.718
Pressure        1.5718   0.2606   6.03  0.000
Concentration   4.1629   0.3340  12.47  0.000

S = 15.0996   R-Sq = 92.8%   R-Sq(adj) = 92.0%

Analysis of Variance

Source          DF     SS     MS       F      P
Regression       2  50101  25050  109.87  0.000
Residual Error  17   3876    228
Total           19  53977

Source         DF  Seq SS
Pressure        1   14673
Concentration   1   35428
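As a check, the printed coefficients and summary statistics can be reproduced (up to rounding) by fitting the tabled data directly; a numpy sketch:

```python
import numpy as np

# Data from the table above
strength = np.array([665, 618, 620, 578, 682, 594, 722, 700, 681, 695,
                     664, 548, 620, 595, 740, 670, 640, 590, 570, 640], float)
pressure = np.array([110, 119, 138, 130, 143, 133, 147, 142, 125, 135,
                     152, 118, 155, 128, 146, 132, 130, 112, 113, 120], float)
conc = np.array([116, 104, 94, 86, 110, 87, 114, 106, 107, 106,
                 98, 86, 87, 96, 120, 108, 104, 91, 92, 100], float)

X = np.column_stack([np.ones(20), pressure, conc])
b, *_ = np.linalg.lstsq(X, strength, rcond=None)

resid = strength - X @ b
sse = resid @ resid                           # residual sum of squares
sst = ((strength - strength.mean()) ** 2).sum()
r_sq = 1 - sse / sst                          # coefficient of determination
r_sq_adj = 1 - (sse / (20 - 3)) / (sst / (20 - 1))
s = np.sqrt(sse / (20 - 3))                   # standard error of the estimate

print(np.round(b, 4), round(r_sq, 3), round(r_sq_adj, 3), round(s, 4))
```

The adjusted R-squared line also shows concretely how the adjustment works: it penalizes the residual sum of squares by its degrees of freedom, n - (k + 1), rather than by n - 1.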
26.3 Example 26.1: Multiple Regression

The P columns give the significance level for each model term. Typically, if a P value is less than or equal to 0.05, the variable is considered statistically significant (i.e., the null hypothesis is rejected). If a P value is greater than 0.10, the term is removed from the model. A practitioner might leave the term in if the P value is in the gray region between these two probability levels.

The coefficient of determination (R²) is presented as R-Sq and R-Sq(adj) in the output. When a variable is added to an equation, the coefficient of determination will get larger, even if the added variable has no real value. R²(adj) is an approximately unbiased estimate that compensates for this.

In the analysis of variance portion of this output, the F value is used to determine an overall P value for the model fit. In this case the resulting P value of 0.000 indicates a very high level of significance. The regression and residual sums of squares (SS) and mean square (MS) values are interim steps toward determining the F value. The standard error S is the square root of the residual mean square. No unusual patterns were apparent in the residual analysis plots. Also, no correlation was shown between hydraulic pressure and acid concentration.
26.4 Other Considerations

Regressor variables should be independent within a model (i.e., completely uncorrelated). Multicollinearity occurs when the variables are dependent. A measure of the magnitude of multicollinearity that is often available in statistical software is the variance inflation factor (VIF). VIF quantifies how much the variance of an estimated regression coefficient increases if the predictors are correlated. Regression coefficients can be considered poorly estimated when VIF exceeds 5 or 10. Strategies for breaking up multicollinearity include collecting additional data or using different predictors.

Another approach to data analysis is the use of stepwise regression (Draper and Smith 1966) or of all possible regressions of the data when selecting the number of terms to include in a model. This approach can be most useful when the data derive from a study that lacks experimental structure. However, experimenters should be aware of the potential pitfalls of such happenstance data (Box et al. 1978). A multiple regression best subsets analysis is another analysis alternative.
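The VIF for predictor j is 1/(1 - Rj²), where Rj² comes from regressing xj on the other predictors. A minimal sketch with made-up data, in which one predictor is deliberately near-collinear with another:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of predictor matrix X.
    VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing column j on
    the remaining columns (with an intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        xj = X[:, j]
        b, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ b
        r2 = 1 - (resid @ resid) / ((xj - xj.mean()) @ (xj - xj.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 + 0.1 * rng.normal(size=50)   # nearly collinear with x1 -> large VIF
x3 = rng.normal(size=50)              # independent -> VIF near 1
v = vif(np.column_stack([x1, x2, x3]))
print(np.round(v, 2))
```

The collinear pair x1, x2 produces VIFs far above the 5-to-10 rule of thumb, while the independent x3 stays near 1, matching the guidance above.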
26.4 Other Considerations

Best Subsets Regression: Output timing versus mot_temp, algor, mot_adj, ext_adj, sup_volt
(Minitab: Stat > Regression > Best Subsets; X = predictor included in the model)

Vars   R-Sq   R-Sq(adj)   Mallows Cp        S
   1   57.7        54.7         43.3   1.2862   X
   1   33.4        28.6         75.2   1.6152   X
   2   91.1        89.7          1.7  0.61288   X X
   2   58.3        51.9         44.5   1.3251   X X
   3   91.7        89.6          2.9  0.61593   X X X
   3   91.5        89.4          3.1  0.62300   X X X
   4   92.1        89.2          4.3  0.62718   X X X X
   4   91.9        89.0          4.5  0.63331   X X X X
   5   92.4        88.5          6.0  0.64701   X X X X X

A multiple regression best subsets analysis first considers only one factor in the model, then two, and so forth. The R² value is then considered for each of the models; only the factor combinations with the two highest R² values for each model size are shown. The Mallows Cp statistic is useful for determining the minimum number of parameters that best fits the model. Technically, this statistic measures the sum of the squared biases plus the squared random errors in Y at all n data points (Daniel and Wood 1980).
26.4 Other Consideration The minimum number of factors needed in the model occurs when the Mallows C p statistic is a minimum. From this output the pertinent Mallows C p statistic values under consideration as a function of a number of factors in this model are Number in Model Mallows C p 1 43.3 2 1.7** 3 2.9 4 4.3 5 6.0 From this summary it is noted that the Mallows C p statistic is minimized whenever there are two parameters in the model. The corresponding factors are algor and mot adj. 26.5 Example 26.2: Multiple Regression Best Subset Analysis The results from a cause-and-effect matrix lead to a passive analysis of factors A, B, C, and D on Throughput. In a plastic molding process, for example, the throughput response might be shrinkage as a function of the input factors. A best subsets computer regression analysis of the collected data yielded: Best Subsets Regression: Thruput versus A, B, C, D Response is Thruput Vars R-Sq R-Sq(adj) Mallows Cp S A B C D 1 92.1 91.4 38.3 0.25631 X 1 49.2 44.6 294.2 0.64905 X 2 96.3 95.6 14.9 0.18282 X X 2 95.2 94.3 21.5 0.20867 X X 3 98.5 98.0 4.1 0.12454 X X X 3 97.9 97.1 7.8 0.14723 X X X 4 98.7 98.0 5.0 0.12363 X X X X 7
26.5 Example 26.2: Multiple Regression Best Subset Analysis

From this output we note:

R-Sq: Look for the highest value when comparing models with the same number of predictors (Vars).
R-Sq(adj): Look for the highest value when comparing models with different numbers of predictors.
Cp: Look for models where Cp is small and close to the number of parameters in the model; e.g., look for a Cp close to four for a three-predictor model that has an intercept constant (often we simply look for the lowest Cp value).
S: We want S, the estimate of the standard deviation about the regression, to be as small as possible.

The regression equation for a three-parameter model from a computer program is:

Regression Analysis: Thruput versus A, C, D

The regression equation is
Thruput = 3.87 + 0.393 A + 3.19 C + 0.0162 D

Predictor      Coef   SE Coef      T      P    VIF
Constant     3.8702    0.7127   5.43  0.000
A           0.39333   0.07734   5.09  0.001  1.368
C            3.1935    0.2523  12.66  0.000  1.929
D          0.016189  0.004570   3.54  0.006  1.541

S = 0.124543   R-Sq = 98.5%   R-Sq(adj) = 98.0%

The magnitude of the VIFs is satisfactory (i.e., not larger than 5-10). In addition, there were no observed problems with the residual analysis.
26.6 Indicator Variables (Dummy Variables) to Analyze Categorical Data

Categorical data such as location, operator, and color can also be modeled using simple and multiple linear regression. It is generally not correct to use a numerical coding when analyzing this type of data within regression, since the fitted values within the model would depend upon the assignment of the numerical values. The correct approach is the use of indicator variables, or dummy variables, which indicate whether a factor should or should not be included in the model.

Consider a categorical factor with three levels, where each observation belongs to exactly one level. If we are given the values of two of the indicator variables, we can calculate the third. Hence, only two indicator variables are needed for a model of a three-level factor, and it does not matter which variable is left out of the model. After indicator or dummy variables are created, they are analyzed using regression to create a cell means model. If the intercept is left out of the regression equation, a no intercept cell means model is created. For the case where there are three indicator variables, a no intercept model would then have three terms, where the coefficients are the cell means.
26.7 Example 26.3: Indicator Variables

Revenue for Arizona, Florida, and Texas is shown in Table 26.3 (Bower 2001). This table also contains indicator variables that were created to represent these states.

Regression Analysis: Revenue versus AZ, FL, TX

* TX is highly correlated with other X variables
* TX has been removed from the equation.

The regression equation is
Revenue = 48.7 - 23.8 AZ - 16.0 FL

Predictor      Coef  SE Coef       T      P
Constant    48.7329   0.4537  107.41  0.000
AZ         -23.8190   0.6416  -37.12  0.000
FL         -15.9927   0.6416  -24.93  0.000

S = 3.20811   R-Sq = 90.7%   R-Sq(adj) = 90.6%

Calculations for the various revenues would be:

Texas Revenue   = 48.7 - 23.8(0) - 16.0(0) = 48.7
Arizona Revenue = 48.7 - 23.8(1) - 16.0(0) = 24.9
Florida Revenue = 48.7 - 23.8(0) - 16.0(1) = 32.7

A no intercept cell means model from a computer analysis would be:

Regression Analysis: Revenue versus AZ, FL, TX

The regression equation is
Revenue = 24.9 AZ + 32.7 FL + 48.7 TX

Predictor     Coef  SE Coef       T      P
Noconstant
AZ         24.9139   0.4537   54.91  0.000
FL         32.7402   0.4537   72.16  0.000
TX         48.7329   0.4537  107.41  0.000

S = 3.20811
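The cell means property can be demonstrated directly: regressing on a full set of indicator columns with no intercept returns each group's mean. A sketch with hypothetical revenue numbers (these are not the actual Table 26.3 data):

```python
import numpy as np

# Hypothetical revenue observations by state (NOT the Table 26.3 values)
az = np.array([24.0, 25.5, 24.9, 25.2])
fl = np.array([32.1, 33.0, 32.8, 33.1])
tx = np.array([48.2, 49.1, 48.6, 49.0])
y = np.concatenate([az, fl, tx])

# Indicator (dummy) variables: one column per state, no intercept column
AZ = np.r_[np.ones(4), np.zeros(8)]
FL = np.r_[np.zeros(4), np.ones(4), np.zeros(4)]
TX = np.r_[np.zeros(8), np.ones(4)]
X = np.column_stack([AZ, FL, TX])

# No-intercept cell means model: coefficients are the per-state means
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)
print(az.mean(), fl.mean(), tx.mean())
```

Because the indicator columns are orthogonal and each observation belongs to exactly one state, the least squares solution is exactly the vector of group means, which is why the no-intercept coefficients above (24.9, 32.7, 48.7) match the intercept-model calculations.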
26.8 Example 26.4: Indicator Variables with Covariate

Consider the following data set, which contains created indicator variables and a covariate. This covariate might be a continuous variable such as process temperature or dollar amount for an invoice.

Response  Factor1  Factor2    A    B  High  Covariate
    1        A        1       1    0    1      11
    3        A        0       1    0   -1       7
    2        A        1       1    0    1       5
    2        A        0       1    0   -1       6
    4        B        1       0    1    1       6
    6        B        0       0    1   -1       3
    3        B        1       0    1    1      14
    5        B        0       0    1   -1      20
    8        C        1      -1   -1    1       2
    9        C        0      -1   -1   -1      17
    7        C        1      -1   -1    1      19
   10        C        0      -1   -1   -1      14

Regression Analysis: Response versus Factor2, A, B, High, Covariate

* High is highly correlated with other X variables
* High has been removed from the equation.

The regression equation is
Response = 6.50 - 1.77 Factor2 - 3.18 A - 0.475 B - 0.0598 Covariate

Predictor      Coef  SE Coef       T      P
Constant     6.5010   0.4140   15.70  0.000
Factor2     -1.7663   0.3391   -5.21  0.001
A           -3.1844   0.2550  -12.49  0.000
B           -0.4751   0.2374   -2.00  0.086
Covariate  -0.05979  0.03039   -1.97  0.090

S = 0.580794   R-Sq = 97.6%   R-Sq(adj) = 96.2%
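The output above can be reproduced (up to rounding) by fitting the tabled data directly, dropping the redundant High column (High = 2 x Factor2 - 1, so it is perfectly collinear with Factor2, mirroring the package's removal message). A numpy sketch:

```python
import numpy as np

# Data from the table above: effect-coded indicators A, B plus a covariate
A    = np.array([1, 1, 1, 1, 0, 0, 0, 0, -1, -1, -1, -1], float)
B    = np.array([0, 0, 0, 0, 1, 1, 1, 1, -1, -1, -1, -1], float)
f2   = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0], float)   # Factor2
cov  = np.array([11, 7, 5, 6, 6, 3, 14, 20, 2, 17, 19, 14], float)
resp = np.array([1, 3, 2, 2, 4, 6, 3, 5, 8, 9, 7, 10], float)

# High (= 2*f2 - 1) is omitted: it is perfectly correlated with Factor2
X = np.column_stack([np.ones(12), f2, A, B, cov])
b, *_ = np.linalg.lstsq(X, resp, rcond=None)
print(np.round(b, 4))
```

The fitted coefficients match the printed output (6.50, -1.77, -3.18, -0.475, -0.0598), confirming that the covariate enters the model exactly like any other regressor alongside the indicator columns.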
26.10 Example 26.5: Binary Logistic Regression

Ingots prepared with different heating and soaking times are tested for readiness to be rolled:

Sample  Heat  Soak  Ready  Not Ready
  1       7   1.0     10       0
  2       7   1.7     17       0
  3       7   2.2      7       0
  4       7   2.8     12       0
  5       7   4.0      9       0
  6      14   1.0     31       0
  7      14   1.7     43       0
  8      14   2.2     31       2
  9      14   2.8     31       0
 10      14   4.0     19       0
 11      27   1.0     55       1
 12      27   1.7     40       4
 13      27   2.2     21       0
 14      27   2.8     21       1
 15      27   4.0     15       1
 16      51   1.0     10       3
 17      51   1.7      1       0
 18      51   2.2      1       0
 19      51   4.0      1       0

Binary Logistic Regression: Ready, Trials versus Heat, Soak

Link Function: Normit

Response Information

Variable  Value      Count
Ready     Event        375
          Non-event     12
Trials    Total        387

Logistic Regression Table

Predictor        Coef    SE Coef      Z      P
Constant      2.89342   0.500601   5.78  0.000
Heat       -0.0399555  0.0118466  -3.37  0.001
Soak       -0.0362537   0.146743  -0.25  0.805

Log-Likelihood = -47.480
Test that all slopes are zero: G = 12.029, DF = 2, P-Value = 0.002
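The output above uses a normit (probit) link. As an illustrative sketch, the same grouped data can be fit with the more common logit link by Newton-Raphson (iteratively reweighted least squares). Here the event modeled is "not ready", so the heat coefficient comes out positive rather than negative, but the conclusion (heat significant, soak not) is the same.

```python
import numpy as np

# Grouped ingot data from the table above
heat = np.array([7] * 5 + [14] * 5 + [27] * 5 + [51] * 4, float)
soak = np.array([1.0, 1.7, 2.2, 2.8, 4.0] * 3 + [1.0, 1.7, 2.2, 4.0])
not_ready = np.array([0, 0, 0, 0, 0, 0, 0, 2, 0, 0,
                      1, 4, 0, 1, 1, 3, 0, 0, 0], float)
trials = np.array([10, 17, 7, 12, 9, 31, 43, 33, 31, 19,
                   56, 44, 21, 22, 16, 13, 1, 1, 1], float)

X = np.column_stack([np.ones(19), heat, soak])
b = np.zeros(3)
for _ in range(25):                        # Newton-Raphson (IRLS) iterations
    p = 1 / (1 + np.exp(-(X @ b)))         # P(not ready) under a logit link
    W = trials * p * (1 - p)               # binomial weights
    grad = X.T @ (not_ready - trials * p)  # score vector
    H = X.T @ (W[:, None] * X)             # Fisher information
    b = b + np.linalg.solve(H, grad)

p = 1 / (1 + np.exp(-(X @ b)))             # fitted probabilities at convergence
se = np.sqrt(np.diag(np.linalg.inv(H)))    # standard errors from the Hessian
z = b / se                                 # Wald z statistics
print(np.round(b, 4), np.round(z, 2))
```

At convergence, the fitted expected counts sum to the observed event count (a property of maximum likelihood with an intercept term), and the Wald z statistics lead to the same judgment as the printed table: heat matters, soak does not.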
26.10 Example 26.5: Binary Logistic Regression

Heat would be considered statistically significant. Let's now address the question of which levels are important. Rearranging the data by heat only yields the following summary. From a p chart of these data, it appears that heat at the 51 level causes a larger proportion of not-readys.

Heat  Not Ready  Sample Size
  7       0          55
 14       2         157
 27       7         159
 51       3          16