Chapter 26 Multiple Regression, Logistic Regression, and Indicator Variables

26.1 S4/IEE Application Examples: Multiple Regression

An S4/IEE project was created to improve the 30,000-foot-level metric DSO. Two inputs that surfaced from a cause-and-effect diagram were the size of the invoice and the number of line items included within the invoice. A multiple regression analysis was conducted for DSO versus size of invoice and number of line items included in the invoice.

An S4/IEE project was created to improve the 30,000-foot-level metric, the diameter of a manufactured part. Inputs that surfaced from a cause-and-effect diagram were the temperature, pressure, and speed of the manufacturing process. A multiple regression analysis of diameter versus temperature, pressure, and speed was conducted.

26.2 Description

A general model includes polynomial terms in one or more variables, such as

Y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε

where the βs are unknown parameters and ε is random error. This full quadratic model of Y on x1 and x2 is of great use in DOE. For the situation without polynomial terms, where there are k predictor variables, the general model reduces to the form

Y = β0 + β1x1 + ... + βkxk + ε

The object is to determine from data the least squares estimates (b0, b1, ..., bk) of the unknown parameters (β0, β1, ..., βk) for the prediction equation

Ŷ = b0 + b1x1 + ... + bkxk

where Ŷ is the predicted value of Y for given values of (x1, ..., xk). Many statistical software packages can perform these calculations.
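To make the fitting step concrete, the following is a minimal sketch (not from the original text) that builds the design matrix for the full quadratic model and computes the least squares estimates with NumPy; the small data set is hypothetical and only illustrates the mechanics.

import numpy as np

# Hypothetical observations of two inputs x1, x2 and a response y
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
x2 = np.array([2.0, 1.5, 3.5, 2.5, 4.0, 3.0, 5.0])
y  = np.array([5.1, 6.0, 9.8, 9.5, 13.2, 11.9, 17.3])

# Design matrix for Y = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + b5*x1*x2
X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

# Least squares estimates (b0, ..., b5) and fitted values
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("coefficients:", b)
print("fitted values:", X @ b)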

26.3 Example 26.1: Multiple Regression

An investigator wants to determine the relationship of a key process output variable, product strength, to two key process input variables, hydraulic pressure during a forming process and acid concentration. The data are as follows:

Strength  Pressure  Concentration
  665       110        116
  618       119        104
  620       138         94
  578       130         86
  682       143        110
  594       133         87
  722       147        114
  700       142        106
  681       125        107
  695       135        106
  664       152         98
  548       118         86
  620       155         87
  595       128         96
  740       146        120
  670       132        108
  640       130        104
  590       112         91
  570       113         92
  640       120        100

A multiple regression analysis yields:

Regression Analysis: Strength versus Pressure, Concentration

The regression equation is
Strength = 16.3 + 1.57 Pressure + 4.16 Concentration

Predictor        Coef   SE Coef      T      P
Constant        16.28     44.30   0.37  0.718
Pressure       1.5718    0.2606   6.03  0.000
Concentration  4.1629    0.3340  12.47  0.000

S = 15.0996   R-Sq = 92.8%   R-Sq(adj) = 92.0%

Analysis of Variance

Source          DF     SS     MS       F      P
Regression       2  50101  25050  109.87  0.000
Residual Error  17   3876    228
Total           19  53977

Source         DF  Seq SS
Pressure        1   14673
Concentration   1   35428
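These Minitab results can be checked with any regression package. Below is a minimal sketch, assuming Python with statsmodels, that fits the same model to the Example 26.1 data; it should reproduce the coefficients, t statistics, and R-Sq values above to within rounding.

import numpy as np
import statsmodels.api as sm

strength = np.array([665, 618, 620, 578, 682, 594, 722, 700, 681, 695,
                     664, 548, 620, 595, 740, 670, 640, 590, 570, 640])
pressure = np.array([110, 119, 138, 130, 143, 133, 147, 142, 125, 135,
                     152, 118, 155, 128, 146, 132, 130, 112, 113, 120])
concentration = np.array([116, 104, 94, 86, 110, 87, 114, 106, 107, 106,
                          98, 86, 87, 96, 120, 108, 104, 91, 92, 100])

# Design matrix with an intercept, then ordinary least squares
X = sm.add_constant(np.column_stack([pressure, concentration]))
fit = sm.OLS(strength, X).fit()
print(fit.summary())   # coefficients, SE, t, P, R-squared, ANOVA F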

The P columns give the significance level for each model term. Typically, if a P value is less than or equal to 0.05, the variable is considered statistically significant (i.e., the null hypothesis is rejected). If a P value is greater than 0.10, the term is typically removed from the model. A practitioner might leave the term in the model if its P value falls in the gray region between these two probability levels.

The coefficient of determination (R²) is presented as R-Sq and R-Sq(adj) in the output. When a variable is added to an equation, the coefficient of determination will get larger even if the added variable has no real value. R²(adj) is an approximately unbiased estimate that compensates for this.

In the analysis of variance portion of this output, the F value is used to determine an overall P value for the model fit. In this case the resulting P value of 0.000 indicates a very high level of significance. The regression and residual sums of squares (SS) and mean squares (MS) are interim steps toward determining the F value. The standard error S is the square root of the residual error mean square. No unusual patterns were apparent in the residual analysis plots. Also, no correlation was shown between hydraulic pressure and acid concentration.
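As a quick numerical check on the quantities just described, here is a minimal sketch (Python assumed; R²(adj) = 1 - (1 - R²)(n - 1)/(n - k - 1) is the standard adjustment) using the numbers reported in the Example 26.1 output.

# n observations, k predictors, and R-Sq from the Example 26.1 output
n, k = 20, 2
r2 = 0.928

# Adjusted R-squared compensates for the automatic increase in R-squared
# that occurs whenever a predictor is added
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(f"R-Sq(adj) = {r2_adj:.3f}")   # about 0.920, matching the output

# Overall F statistic from the ANOVA table: F = MS(regression) / MS(residual error)
print(f"F = {25050 / 228:.2f}")      # about 109.9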

26.4 Other Considerations

Regressor variables should be independent within a model (i.e., completely uncorrelated). Multicollinearity occurs when variables are dependent. A measure of the magnitude of multicollinearity that is often available in statistical software is the variance inflation factor (VIF). VIF quantifies how much the variance of an estimated regression coefficient increases when the predictors are correlated. Regression coefficients can be considered poorly estimated when VIF exceeds 5 or 10. Strategies for breaking up multicollinearity include collecting additional data or using different predictors; a computational sketch for VIF is given below.

Another approach to data analysis is the use of stepwise regression (Draper and Smith 1966) or of all possible regressions of the data when selecting the number of terms to include in a model. This approach can be most useful when the data were not collected under a structured experimental design. However, experimenters should be aware of the potential pitfalls of such happenstance data (Box et al. 1978). A multiple regression best subset analysis is another analysis alternative.
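Returning to the VIF measure mentioned above, here is a minimal sketch, assuming Python with statsmodels, of how VIFs might be computed; the VIF of predictor j equals 1/(1 - R²_j), where R²_j comes from regressing that predictor on all the others. The data are hypothetical, with x2 deliberately correlated with x1.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors, with x2 deliberately correlated with x1
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=50)
x3 = rng.normal(size=50)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each predictor column (column 0 is the intercept and is skipped)
for j, name in enumerate(["x1", "x2", "x3"], start=1):
    print(f"VIF({name}) = {variance_inflation_factor(X, j):.2f}")
# Values above roughly 5-10 indicate poorly estimated coefficients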

As an illustration of best subsets regression, consider the following Minitab output (Stat > Regression > Best Subsets) for a response called Output timing with five candidate predictors (including mot_temp, algor, and mot_adj). An X marks each predictor included in a given model.

Best Subsets Regression: Output timing versus mot_temp, algor, ...

Response is Output timing

Vars   R-Sq   R-Sq(adj)   Mallows Cp        S    Included predictors
  1    57.7        54.7         43.3   1.2862    X
  1    33.4        28.6         75.2   1.6152    X
  2    91.1        89.7          1.7   0.61288   X X
  2    58.3        51.9         44.5   1.3251    X X
  3    91.7        89.6          2.9   0.61593   X X X
  3    91.5        89.4          3.1   0.62300   X X X
  4    92.1        89.2          4.3   0.62718   X X X X
  4    91.9        89.0          4.5   0.63331   X X X X
  5    92.4        88.5          6.0   0.64701   X X X X X

A multiple regression best subset analysis first considers only one factor in a model, then two, and so forth. The R² values of the candidate models are then compared; for each number of factors, only the combinations with the two highest R² values are shown. The Mallows Cp statistic is useful for determining the minimum number of parameters that best fits the model. Technically, this statistic measures the sum of the squared biases plus the squared random errors in Y at all n data points (Daniel and Wood 1980).

The minimum number of factors needed in the model occurs when the Mallows Cp statistic is at its minimum. From this output, the pertinent Mallows Cp values as a function of the number of factors in the model are:

Number in Model   Mallows Cp
       1              43.3
       2               1.7 **
       3               2.9
       4               4.3
       5               6.0

From this summary it is noted that the Mallows Cp statistic is minimized when there are two parameters in the model. The corresponding factors are algor and mot_adj.

26.5 Example 26.2: Multiple Regression Best Subset Analysis

The results from a cause-and-effect matrix led to a passive analysis of factors A, B, C, and D on throughput. In a plastic molding process, for example, the throughput response might be shrinkage as a function of the input factors. A best subsets regression analysis of the collected data yielded:

Best Subsets Regression: Thruput versus A, B, C, D

Response is Thruput

Vars   R-Sq   R-Sq(adj)   Mallows Cp        S    Included predictors (of A, B, C, D)
  1    92.1        91.4         38.3   0.25631   X
  1    49.2        44.6        294.2   0.64905   X
  2    96.3        95.6         14.9   0.18282   X X
  2    95.2        94.3         21.5   0.20867   X X
  3    98.5        98.0          4.1   0.12454   X X X
  3    97.9        97.1          7.8   0.14723   X X X
  4    98.7        98.0          5.0   0.12363   X X X X

From this output we note:

R-Sq: Look for the highest value when comparing models with the same number of predictors (Vars).

R-Sq(adj): Look for the highest value when comparing models with different numbers of predictors.

Cp: Look for models where Cp is small and close to the number of parameters in the model; e.g., look for Cp close to four for a three-predictor model that has an intercept constant (often we simply look for the lowest Cp value).

S: We want S, the estimate of the standard deviation about the regression, to be as small as possible.

The regression equation for the model with the three predictors A, C, and D is shown below. The magnitude of the VIFs is satisfactory, i.e., not larger than 5 to 10. In addition, there were no observed problems with the residual analysis.

Regression Analysis: Thruput versus A, C, D

The regression equation is
Thruput = 3.87 + 0.393 A + 3.19 C + 0.0162 D

Predictor      Coef    SE Coef       T      P    VIF
Constant     3.8702     0.7127    5.43  0.000
A           0.39333    0.07734    5.09  0.001  1.368
C            3.1935     0.2523   12.66  0.000  1.929
D          0.016189   0.004570    3.54  0.006  1.541

S = 0.124543   R-Sq = 98.5%   R-Sq(adj) = 98.0%
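To show how such a best subsets search with Mallows Cp could be reproduced outside Minitab, here is a minimal sketch (Python assumed; the arrays X and y are placeholders for the actual measurements, which are not listed in the text). It uses Cp = SSE_p/MSE_full - (n - 2p), where p counts the parameters of the subset model including the intercept and MSE_full is the residual mean square of the full model.

from itertools import combinations
import numpy as np

def best_subsets(X, y, names):
    # Enumerate every predictor subset and report its R-squared and Mallows Cp
    n, k = X.shape
    ones = np.ones((n, 1))

    def sse(cols):
        design = np.hstack([ones, X[:, list(cols)]])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        return float(resid @ resid)

    sst = float(((y - y.mean()) ** 2).sum())
    mse_full = sse(range(k)) / (n - k - 1)          # residual MS of the full model

    results = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            p = size + 1                            # parameters incl. intercept
            cp = sse(cols) / mse_full - (n - 2 * p)
            r2 = 1 - sse(cols) / sst
            results.append(([names[c] for c in cols], r2, cp))
    return results

# Hypothetical use, with X an (n x 4) array of predictors A, B, C, D and y the Thruput values:
# for subset, r2, cp in best_subsets(X, y, ["A", "B", "C", "D"]):
#     print(subset, f"R-Sq = {100 * r2:.1f}%", f"Cp = {cp:.1f}")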

26.6 Indicator Variables (Dummy Variables) to Analyze Categorical Data

Categorical data such as location, operator, and color can also be modeled using simple and multiple linear regression. It is generally not correct to analyze this type of data by assigning an arbitrary numerical code to each category, since the fitted values within the model would then depend on the particular numerical values assigned. The correct approach is to use indicator variables, or dummy variables, which indicate whether a factor level should or should not be included in the model.

Consider a categorical factor with three levels and one 0/1 indicator variable for each level. Given the values of any two of these indicator variables, we can calculate the third; hence only two of them are needed in a model that contains an intercept, and it does not matter which variable is left out of the model. After indicator or dummy variables are created, they are analyzed using regression to create a cell means model. If the intercept is left out of the regression equation, a no-intercept cell means model is created. For the case where there are three indicator variables, a no-intercept model would then have three terms whose coefficients are the cell means. A sketch of both forms is shown below.
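Here is a minimal sketch of both forms, assuming Python with pandas and statsmodels and a small hypothetical data set with a three-level factor named town.

import pandas as pd
import statsmodels.api as sm

# Hypothetical data: a response measured at three towns
df = pd.DataFrame({
    "town": ["A", "A", "B", "B", "C", "C"],
    "y":    [10.0, 12.0, 20.0, 22.0, 30.0, 33.0],
})

# 0/1 indicator (dummy) variables, one column per town
dummies = pd.get_dummies(df["town"], prefix="town").astype(float)

# Model with an intercept: drop one dummy (here town_C) to avoid redundancy
fit1 = sm.OLS(df["y"], sm.add_constant(dummies[["town_A", "town_B"]])).fit()
print(fit1.params)   # intercept is the town C mean; dummies are offsets from it

# No-intercept cell means model: keep all three dummies, omit the constant
fit2 = sm.OLS(df["y"], dummies).fit()
print(fit2.params)   # each coefficient is that town's mean response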

26.7 Example 26.3: Indicator Variables

Revenue for Arizona, Florida, and Texas is shown in Table 26.3 (Bower 2001). This table also contains indicator variables that were created to represent these states.

Regression Analysis: Revenue versus AZ, FL, TX

* TX is highly correlated with other X variables
* TX has been removed from the equation.

The regression equation is
Revenue = 48.7 - 23.8 AZ - 16.0 FL

Predictor      Coef   SE Coef        T      P
Constant    48.7329    0.4537   107.41  0.000
AZ         -23.8190    0.6416   -37.12  0.000
FL         -15.9927    0.6416   -24.93  0.000

S = 3.20811   R-Sq = 90.7%   R-Sq(adj) = 90.6%

Calculations for the various state revenues would be:

Texas Revenue   = 48.7 - 23.8(0) - 16.0(0) = 48.7
Arizona Revenue = 48.7 - 23.8(1) - 16.0(0) = 24.9
Florida Revenue = 48.7 - 23.8(0) - 16.0(1) = 32.7

A no-intercept cell means model from a computer analysis would be:

Regression Analysis: Revenue versus AZ, FL, TX

The regression equation is
Revenue = 24.9 AZ + 32.7 FL + 48.7 TX

Predictor      Coef   SE Coef        T      P
Noconstant
AZ          24.9139    0.4537    54.91  0.000
FL          32.7402    0.4537    72.16  0.000
TX          48.7329    0.4537   107.41  0.000

S = 3.20811

26.8 Example 26.4: Indicator Variables with Covariate

Consider the following data set, which contains indicator variables and a covariate. The covariate might be a continuous variable such as process temperature or the dollar amount of an invoice.

Response  Factor1  Factor2    A    B  High  Covariate
    1        A        1       1    0    1       11
    3        A        0       1    0   -1        7
    2        A        1       1    0    1        5
    2        A        0       1    0   -1        6
    4        B        1       0    1    1        6
    6        B        0       0    1   -1        3
    3        B        1       0    1    1       14
    5        B        0       0    1   -1       20
    8        C        1      -1   -1    1        2
    9        C        0      -1   -1   -1       17
    7        C        1      -1   -1    1       19
   10        C        0      -1   -1   -1       14

Regression Analysis: Response versus Factor2, A, B, High, Covariate

* High is highly correlated with other X variables
* High has been removed from the equation.

The regression equation is
Response = 6.50 - 1.77 Factor2 - 3.18 A - 0.475 B - 0.0598 Covariate

Predictor      Coef    SE Coef       T      P
Constant     6.5010     0.4140   15.70  0.000
Factor2     -1.7663     0.3391   -5.21  0.001
A           -3.1844     0.2550  -12.49  0.000
B           -0.4751     0.2374   -2.00  0.086
Covariate   -0.05979    0.03039   -1.97  0.090

S = 0.580794   R-Sq = 97.6%   R-Sq(adj) = 96.2%
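A minimal sketch, assuming Python with pandas and statsmodels, of fitting the same model to the data above after dropping the collinear High column; it should reproduce the reported coefficients to within rounding.

import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "Response":  [1, 3, 2, 2, 4, 6, 3, 5, 8, 9, 7, 10],
    "Factor2":   [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "A":         [1, 1, 1, 1, 0, 0, 0, 0, -1, -1, -1, -1],
    "B":         [0, 0, 0, 0, 1, 1, 1, 1, -1, -1, -1, -1],
    "Covariate": [11, 7, 5, 6, 6, 3, 14, 20, 2, 17, 19, 14],
})

# High (the +/-1 coding of Factor2) is omitted because it is collinear with Factor2
X = sm.add_constant(data[["Factor2", "A", "B", "Covariate"]])
fit = sm.OLS(data["Response"], X).fit()
print(fit.summary())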

26.10 Example 26.5: Binary Logistic Regression

Ingots prepared with different heating and soaking times are tested for readiness to be rolled:

Sample   Heat   Soak   Ready   Not Ready
   1       7    1.0      10        0
   2       7    1.7      17        0
   3       7    2.2       7        0
   4       7    2.8      12        0
   5       7    4.0       9        0
   6      14    1.0      31        0
   7      14    1.7      43        0
   8      14    2.2      31        2
   9      14    2.8      31        0
  10      14    4.0      19        0
  11      27    1.0      55        1
  12      27    1.7      40        4
  13      27    2.2      21        0
  14      27    2.8      21        1
  15      27    4.0      15        1
  16      51    1.0      10        3
  17      51    1.7       1        0
  18      51    2.2       1        0
  19      51    4.0       1        0

Binary Logistic Regression: Ready, Trials versus Heat, Soak

Link Function: Normit

Response Information

Variable   Value       Count
Ready      Event         375
           Non-event      12
Trials     Total         387

Logistic Regression Table

Predictor        Coef     SE Coef       Z      P
Constant      2.89342    0.500601    5.78  0.000
Heat       -0.0399555   0.0118466   -3.37  0.001
Soak       -0.0362537    0.146743   -0.25  0.805

Log-Likelihood = -47.480
Test that all slopes are zero: G = 12.029, DF = 2, P-Value = 0.002
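A minimal sketch, assuming Python with statsmodels, of fitting the same binomial model with a normit (probit) link to the grouped ingot data; it should give coefficients close to those in the Minitab table above.

import numpy as np
import statsmodels.api as sm

# Grouped ingot data: heat, soak, number ready, number not ready
heat = np.array([7, 7, 7, 7, 7, 14, 14, 14, 14, 14,
                 27, 27, 27, 27, 27, 51, 51, 51, 51])
soak = np.array([1.0, 1.7, 2.2, 2.8, 4.0, 1.0, 1.7, 2.2, 2.8, 4.0,
                 1.0, 1.7, 2.2, 2.8, 4.0, 1.0, 1.7, 2.2, 4.0])
ready = np.array([10, 17, 7, 12, 9, 31, 43, 31, 31, 19,
                  55, 40, 21, 21, 15, 10, 1, 1, 1])
not_ready = np.array([0, 0, 0, 0, 0, 0, 0, 2, 0, 0,
                      1, 4, 0, 1, 1, 3, 0, 0, 0])

# Binomial GLM with a probit (normit) link; the response is (events, non-events)
X = sm.add_constant(np.column_stack([heat, soak]))
endog = np.column_stack([ready, not_ready])
fit = sm.GLM(endog, X,
             family=sm.families.Binomial(link=sm.families.links.Probit())).fit()
print(fit.summary())   # coefficients for const, heat, soak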

Heat would be considered statistically significant, while soak would not. Let's now address the question of which heat levels are important. Rearranging the data by heat only, we get:

Heat   Not Ready   Sample Size
  7        0            55
 14        2           157
 27        7           159
 51        3            16

From a p chart of these data it appears that heat at the 51 level causes a larger proportion of not-ready ingots.
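As a simple check on that reading, a minimal sketch (Python assumed) computing the not-ready proportion at each heat level from the table above:

# Not-ready counts and sample sizes by heat level, from the rearranged table
heats = [7, 14, 27, 51]
not_ready = [0, 2, 7, 3]
sample_size = [55, 157, 159, 16]

for h, nr, n in zip(heats, not_ready, sample_size):
    print(f"heat {h:>2}: {nr}/{n} = {nr / n:.1%} not ready")
# Heat 51 shows the largest proportion (about 19%), consistent with the p chart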