
Chapter 4: Regression Models
Quantitative Analysis for Management, Tenth Edition, by Render, Stair, and Hanna
2008 Prentice-Hall, Inc.

Introduction

Regression analysis is a very valuable tool for a manager. There are generally two purposes for regression analysis:
1. To understand the relationship between variables, e.g. the relationship between sales volume and advertising spending, or between the price of a house and its square footage.
2. To predict the value of one variable based on the value of another variable.

Three types of regression models will be studied:
- Simple linear regression models have only two variables. We will develop this model first.
- Multiple regression models have more than two variables.
- Nonlinear regression models are used when the relationships between the variables are not linear.

The variable to be predicted is called the dependent variable, sometimes called the response variable. The value of this variable depends on the value of the independent variable, sometimes called the explanatory or predictor variable:

    Dependent variable = Independent variable + Independent variable + ...

Scatter Diagram

One way to investigate the relationship between variables is by plotting the data on a graph. Such a graph is often called a scatter diagram or a scatter plot. The independent variable is normally plotted on the horizontal (X) axis, and the dependent variable on the vertical (Y) axis.

Triple A Construction Example

Triple A Construction renovates old homes. They have found that the dollar volume of renovation work each year is dependent on the area payroll. Triple A's revenues (the dependent variable) and the total wage earnings (the independent variable) for the past six years are listed below.

Table 4.1
TRIPLE A'S SALES ($100,000s)   LOCAL PAYROLL ($100,000,000s)
6                              3
8                              4
9                              6
5                              4
4.5                            2
9.5                            5

Figure 4.1 (Scatter Diagram for Triple A Construction Company Data in Table 4.1) plots sales ($100,000s) against payroll ($100 millions). The graph indicates that higher payroll seems to result in higher sales. A line has been drawn to show the relationship between payroll and sales, but there is not a perfect relationship because not all points lie on a straight line. Errors are involved if this line is used to predict sales based on payroll. Many lines could be drawn through these points, but which one best represents the true relationship?

Simple Linear Regression

Regression models are used to find the relationship between variables, i.e. to predict the value of one variable based on the other. However, there is some random error that cannot be predicted. Regression models can also be used to test whether a relationship exists between variables.

The underlying simple linear regression model is:

    Y = β0 + β1X + ε

where
    Y  = dependent variable (response)
    X  = independent variable (predictor or explanatory)
    β0 = intercept (value of Y when X = 0)
    β1 = slope of the regression line
    ε  = random error

The random error cannot be predicted, so an approximation of the model is used:

    Ŷ = b0 + b1X

where
    Ŷ  = predicted value of Y
    X  = independent variable (predictor or explanatory)
    b0 = estimate of β0
    b1 = estimate of β1

Triple A Construction is trying to predict sales based on area payroll:
    Y = Sales
    X = Area payroll

The line chosen in Figure 4.1 is the one that best fits the sample data by minimizing the sum of all errors:

    Error = (Actual value) - (Predicted value)
    e = Y - Ŷ

Triple A Construction

The errors may be positive or negative. Large positive and negative errors may cancel each other out, producing a misleadingly small average error, so the errors are squared:

    Error² = [(Actual value) - (Predicted value)]²
    e² = (Y - Ŷ)²

The best regression line is defined as the one that minimizes the sum of squared errors, i.e. the total distance between the actual data points and the line.

For the simple linear regression model, the values of the intercept and slope can be calculated from the n sample data points using the formulas below:

    Ŷ = b0 + b1X
    X̄ = ΣX / n   (average, or mean, of the X values)
    Ȳ = ΣY / n   (average, or mean, of the Y values)
    b1 = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)²
    b0 = Ȳ - b1X̄

Table 4.2: Regression calculations

Y      X    (X - X̄)²        (X - X̄)(Y - Ȳ)
6      3    (3 - 4)² = 1     (3 - 4)(6 - 7) = 1
8      4    (4 - 4)² = 0     (4 - 4)(8 - 7) = 0
9      6    (6 - 4)² = 4     (6 - 4)(9 - 7) = 4
5      4    (4 - 4)² = 0     (4 - 4)(5 - 7) = 0
4.5    2    (2 - 4)² = 4     (2 - 4)(4.5 - 7) = 5
9.5    5    (5 - 4)² = 1     (5 - 4)(9.5 - 7) = 2.5
ΣY = 42     ΣX = 24          Σ(X - X̄)² = 10    Σ(X - X̄)(Y - Ȳ) = 12.5

    X̄ = ΣX / 6 = 24/6 = 4
    Ȳ = ΣY / 6 = 42/6 = 7
    b1 = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)² = 12.5 / 10 = 1.25
    b0 = Ȳ - b1X̄ = 7 - (1.25)(4) = 2

Therefore Ŷ = 2 + 1.25X
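The least-squares calculations in Table 4.2 can be verified with a short script. This is an illustrative sketch, not part of the textbook; it uses the Triple A data from Table 4.1.

```python
# Least-squares slope and intercept for the Triple A Construction data
# (Table 4.1): X = payroll ($100 millions), Y = sales ($100,000s).
X = [3, 4, 6, 4, 2, 5]
Y = [6, 8, 9, 5, 4.5, 9.5]

n = len(X)
x_bar = sum(X) / n          # 4.0
y_bar = sum(Y) / n          # 7.0

# b1 = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2)
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))   # 12.5
den = sum((x - x_bar) ** 2 for x in X)                       # 10.0
b1 = num / den              # 1.25
b0 = y_bar - b1 * x_bar     # 2.0

print(f"Y-hat = {b0} + {b1}X")   # Y-hat = 2.0 + 1.25X
```

This reproduces the slide's result, Ŷ = 2 + 1.25X.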

Triple A Construction

The regression calculations give sales = 2 + 1.25(payroll). If the payroll next year is $600 million:

    Ŷ = 2 + 1.25(6) = 9.5, i.e. sales of $950,000

Measuring the Fit of the Regression Model

Regression models can be developed for any variables X and Y. How do we know the model is good enough (with small errors) at predicting Y based on X? The following three measures of variability are useful in describing the accuracy of the model:
- SST: total variability about the mean
- SSE: variability about the regression line
- SSR: total variability that is explained by the model

Sum of squares total:
    SST = Σ(Y - Ȳ)²
Sum of squared error:
    SSE = Σe² = Σ(Y - Ŷ)²
Sum of squares due to regression:
    SSR = Σ(Ŷ - Ȳ)²
An important relationship:
    SST = SSR + SSE

Table 4.3
Y      X    (Y - Ȳ)²             Ŷ = 2 + 1.25X         (Y - Ŷ)²    (Ŷ - Ȳ)²
6      3    (6 - 7)² = 1         2 + 1.25(3) = 5.75    0.0625      1.563
8      4    (8 - 7)² = 1         2 + 1.25(4) = 7.00    1           0
9      6    (9 - 7)² = 4         2 + 1.25(6) = 9.50    0.25        6.25
5      4    (5 - 7)² = 4         2 + 1.25(4) = 7.00    4           0
4.5    2    (4.5 - 7)² = 6.25    2 + 1.25(2) = 4.50    0           6.25
9.5    5    (9.5 - 7)² = 6.25    2 + 1.25(5) = 8.25    1.5625      1.563
Ȳ = 7       SST = 22.5                                 SSE = 6.875  SSR = 15.625
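The variability decomposition in Table 4.3 can be sketched in code. This is an illustrative script (not from the textbook) using the fitted line Ŷ = 2 + 1.25X.

```python
# Decomposing variability for the Triple A data (Tables 4.1 and 4.3)
# using the fitted line Y-hat = 2 + 1.25X.
X = [3, 4, 6, 4, 2, 5]
Y = [6, 8, 9, 5, 4.5, 9.5]
b0, b1 = 2.0, 1.25

y_bar = sum(Y) / len(Y)                    # 7.0
Y_hat = [b0 + b1 * x for x in X]           # predicted values

sst = sum((y - y_bar) ** 2 for y in Y)               # total variability
sse = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))  # unexplained
ssr = sum((yh - y_bar) ** 2 for yh in Y_hat)         # explained

r2 = ssr / sst                # coefficient of determination (next section)
print(sst, sse, ssr)          # 22.5 6.875 15.625
print(round(r2, 4))           # 0.6944
```

The identity SST = SSR + SSE holds exactly for these values.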

Measuring the Fit of the Regression Model

SST = 22.5 is the variability of the prediction using the mean value of Y. SSE = 6.875 is the variability of the prediction using the regression line. Prediction using the regression line has reduced the variability by 22.5 - 6.875 = 15.625. SSR = 15.625 indicates how much of the total variability in Y is explained by the regression model. Note that SST = SSR + SSE: SSR is the explained variability and SSE is the unexplained variability.

Figure 4.2 shows these deviations on the scatter diagram of sales ($100,000s) against payroll ($100 millions): for each point, the total deviation from Ȳ (SST) splits into the part explained by the line Ŷ = 2 + 1.25X (SSR) and the residual about the line (SSE).

Coefficient of Determination

The proportion of the variability in Y explained by the regression equation is called the coefficient of determination, r²:

    r² = SSR / SST = 1 - SSE / SST

For Triple A Construction:

    r² = 15.625 / 22.5 = 0.6944

About 69% of the variability in Y is explained by the equation based on payroll (X). If SSE were 0, r² would be 100%.

Correlation Coefficient

The correlation coefficient, r, is an expression of the strength of the linear relationship between the variables. It will always be between +1 and -1: a negative slope gives r < 0, a positive slope gives r > 0. The correlation coefficient is r = ±√r². For Triple A Construction:

    r = √0.6944 = 0.8333

Correlation Coefficient

Figure 4.3 illustrates the possibilities:
(a) Perfect positive correlation: r = +1
(b) Positive correlation: 0 < r < 1
(c) No correlation: r = 0
(d) Perfect negative correlation: r = -1

Using Computer Software for Regression

Programs 4.1A, 4.1B, and 4.1C show the Excel steps for setting up the Triple A regression (screenshots in the slides).

Using Computer Software for Regression

In the Excel output (Program 4.1D), the correlation coefficient r is labeled "Multiple R". The output reproduces the regression equation Ŷ = 2 + 1.25X.

Assumptions of the Regression Model

If we make certain assumptions about the errors in a regression model, we can perform statistical tests to determine whether the model is useful:
1. Errors are independent
2. Errors are normally distributed
3. Errors have a mean of zero
4. Errors have a constant variance
A plot of the residuals (errors) will often highlight any glaring violations of these assumptions.

Residual Plots

Figure 4.4A: a random plot of residuals scattered about error = 0 is the healthy pattern; no violations.
Figure 4.4B: nonconstant error variance is a violation; the errors increase as X increases, violating the constant variance assumption.

Residual Plots

Figure 4.4C: a nonlinear relationship is a violation; errors consistently increasing and then consistently decreasing indicate that the model is not linear (perhaps quadratic).

Estimating the Variance

Errors are assumed to have a constant variance (σ²), but we usually don't know it. It can be estimated using the mean squared error (MSE), s²:

    s² = MSE = SSE / (n - k - 1)

where
    n = number of observations in the sample
    k = number of independent variables

For Triple A Construction:

    s² = MSE = SSE / (n - k - 1) = 6.8750 / (6 - 1 - 1) = 1.7188

We can also estimate the standard deviation, s. This is called the standard error of the estimate or the standard deviation of the regression:

    s = √MSE = √1.7188 = 1.31

A small s² or s means the actual data deviate within a small range from the predicted results.

Testing the Model for Significance

Both r² and the MSE (s²) provide a measure of accuracy in a regression model. However, when the sample size is too small, you can get good values for MSE and r² even if there is no relationship between the variables. Testing the model for significance helps determine whether r² and MSE are meaningful and whether a linear relationship exists between the variables. We do this by performing a statistical hypothesis test.
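A minimal sketch of the variance estimate, using the SSE value computed earlier for Triple A Construction (n = 6 observations, k = 1 independent variable):

```python
import math

# Estimating the error variance s^2 = MSE = SSE / (n - k - 1)
# for the Triple A model. Illustrative sketch, not textbook code.
sse = 6.875
n, k = 6, 1

mse = sse / (n - k - 1)      # 6.875 / 4 = 1.71875
s = math.sqrt(mse)           # standard error of the estimate, about 1.31

print(round(mse, 4), round(s, 2))   # 1.7188 1.31
```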

Testing the Model for Significance

We start with the general linear model:

    Y = β0 + β1X + ε

If β1 = 0, the null hypothesis is that there is no relationship between X and Y. The alternate hypothesis is that there is a linear relationship (β1 ≠ 0). If the null hypothesis can be rejected, we have shown that a linear relationship exists. We use the F statistic for this test.

The F Distribution

The F distribution is a continuous probability distribution (Figure 2.5). The area underneath the curve represents the probability of the F statistic falling within a particular interval. The F statistic is the ratio of two sample variances. F distributions have two sets of degrees of freedom, based on sample size, used for the numerator and the denominator:

    df1 = degrees of freedom for the numerator
    df2 = degrees of freedom for the denominator

Consider the example df1 = 5, df2 = 6, α = 0.05. From Appendix D we get

    F(α, df1, df2) = F(0.05, 5, 6) = 4.39

This means P(F > 4.39) = 0.05: there is only a 5% probability that F will exceed 4.39 (see Figure 2.6).

The F Distribution

Figure 2.6 shows the F distribution with 5 and 6 degrees of freedom; the area to the right of F = 4.39 is 0.05.

Testing the Model for Significance

The F statistic for testing the model is based on the MSE (s²) and the mean squared regression (MSR):

    MSR = SSR / k

where k = number of independent variables in the model. The F statistic is

    F = MSR / MSE

This describes an F distribution with
    degrees of freedom for the numerator: df1 = k
    degrees of freedom for the denominator: df2 = n - k - 1

If there is very little error, the MSE will be small and the F statistic will be large, indicating the model is useful. If the F statistic is large, the significance level (p-value) will be low, indicating that it is unlikely this result would have occurred by chance. So when the F value is large, we can reject the null hypothesis, accept that there is a linear relationship between X and Y, and treat the values of the MSE and r² as meaningful.

Steps in a Hypothesis Test

1. Specify the null and alternative hypotheses:
       H0: β1 = 0
       H1: β1 ≠ 0
2. Select the level of significance (α). Common values are 0.01 and 0.05.
3. Calculate the value of the test statistic:
       F = MSR / MSE

Steps in a Hypothesis Test (continued)

4. Make a decision using one of the following methods:
   a) Reject the null hypothesis if the test statistic is greater than the F value from the table in Appendix D; otherwise, do not reject it:
          Reject if F(calculated) > F(α, df1, df2), with df1 = k and df2 = n - k - 1
   b) Reject the null hypothesis if the observed significance level, or p-value, is less than the level of significance (α); otherwise, do not reject it:
          p-value = P(F > calculated test statistic); reject if p-value < α

Triple A Construction

Step 1. H0: β1 = 0 (no linear relationship between X and Y)
        H1: β1 ≠ 0 (a linear relationship exists between X and Y)
Step 2. Select α = 0.05.
Step 3. Calculate the value of the test statistic:
        MSR = SSR / k = 15.6250 / 1 = 15.6250
        F = MSR / MSE = 15.6250 / 1.7188 = 9.09
Step 4. Reject the null hypothesis if the test statistic is greater than the F value in Appendix D:
        df1 = k = 1
        df2 = n - k - 1 = 6 - 1 - 1 = 4
        F(0.05, 1, 4) = 7.71
        Since F(calculated) = 9.09 > 7.71, reject H0. (Figure 4.5 shows 9.09 falling in the 0.05 rejection region to the right of F = 7.71.)

We can conclude there is a statistically significant relationship between X and Y. The r² value of 0.69 means that about 69% of the variability in sales (Y) is explained by local payroll (X).
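The hypothesis-test arithmetic above can be sketched as follows. The critical value 7.71 is the Appendix D table value F(0.05, 1, 4); the script only reproduces the chapter's calculation, it does not compute p-values.

```python
# F test for the Triple A regression: compare F = MSR/MSE with the
# critical value F(0.05, df1=1, df2=4) = 7.71 from the table.
ssr, sse = 15.625, 6.875
n, k = 6, 1

msr = ssr / k                      # 15.625
mse = sse / (n - k - 1)            # 1.71875
f_stat = msr / mse                 # about 9.09

F_CRITICAL = 7.71                  # Appendix D table value
print(round(f_stat, 2), f_stat > F_CRITICAL)   # 9.09 True
```

Because 9.09 > 7.71, the null hypothesis of no linear relationship is rejected.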

Triple A Construction

The F test determines whether or not there is a relationship between the variables. r², the coefficient of determination, is the best measure of the strength of the prediction relationship between the X and Y variables; values closer to 1 indicate a strong prediction relationship. Good regression models have a low significance level for the F test and a high r² value.

Analysis of Variance (ANOVA) Table

When software is used to develop a regression model, an ANOVA table is typically created that shows the observed significance level (p-value) for the calculated F value. This can be compared to the level of significance (α) to make a decision.

Table 4.4
            DF         SS    MS                     F        SIGNIFICANCE
Regression  k          SSR   MSR = SSR/k            MSR/MSE  P(F > MSR/MSE)
Residual    n - k - 1  SSE   MSE = SSE/(n - k - 1)
Total       n - 1      SST

ANOVA for Triple A Construction

In the Excel output (Program 4.1D, partial), P(F > 9.0909) = 0.0394. Because this probability is less than 0.05, we reject the null hypothesis of no linear relationship and conclude there is a linear relationship between X and Y.

Multiple Regression Analysis

Multiple regression models are extensions of the simple linear model that allow the creation of models with several independent variables:

    Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

where
    Y  = dependent variable (response variable)
    Xi = ith independent variable (predictor or explanatory variable)
    β0 = intercept (value of Y when all Xi = 0)
    βi = coefficient of the ith independent variable
    k  = number of independent variables
    ε  = random error

Multiple Regression Analysis

To estimate these values, samples are taken and the following equation is developed:

    Ŷ = b0 + b1X1 + b2X2 + ... + bkXk

where
    Ŷ  = predicted value of Y
    b0 = sample intercept (an estimate of β0)
    bi = sample coefficient of the ith variable (an estimate of βi)

Jenny Wilson Realty

Jenny Wilson wants to develop a model to determine the suggested listing price for houses based on the size and age of the house:

    Ŷ = b0 + b1X1 + b2X2

where
    Ŷ      = predicted value of the dependent variable (selling price)
    b0     = intercept
    X1, X2 = values of the two independent variables (square footage and age), respectively
    b1, b2 = slopes for X1 and X2, respectively

She selects a sample of houses sold recently and records the data shown in Table 4.5. She also saves information on house condition to be used later.

Table 4.5
SELLING PRICE ($)   SQUARE FOOTAGE   AGE   CONDITION
95,000              1,926            30    Good
119,000             2,069            40    Excellent
124,800             1,720            30    Excellent
135,000             1,396            15    Good
142,000             1,706            32    Mint
145,000             1,847            38    Mint
159,000             1,950            27    Mint
165,000             2,323            30    Excellent
182,000             2,285            26    Mint
183,000             3,752            35    Good
200,000             2,300            18    Good
211,000             2,525            17    Good
215,000             3,800            40    Excellent
219,000             1,740            12    Mint

From the Excel output (Program 4.1), the model's significance level is 0.00788 and the regression equation is

    Ŷ = 146,631 + 43.82X1 - 2,899X2
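As a sketch of how such coefficients are estimated, the least-squares problem can be solved through the normal equations (A'A)b = A'Y. The code below is illustrative, not the textbook's Excel procedure, and it uses a small synthetic data set generated from a known relationship (Y = 10 + 2X1 + 3X2) so that the recovered coefficients can be checked.

```python
# Fitting Y-hat = b0 + b1*X1 + b2*X2 by solving the normal equations
# (A'A)b = A'Y with Gaussian elimination. Synthetic data, for illustration.

def fit(rows, y):
    # rows: list of (x1, x2, ...); prepend 1.0 for the intercept column.
    A = [[1.0] + list(r) for r in rows]
    m = len(A[0])
    # Build M = A'A (m x m) and v = A'Y (length m).
    M = [[sum(A[i][p] * A[i][q] for i in range(len(A))) for q in range(m)]
         for p in range(m)]
    v = [sum(A[i][p] * y[i] for i in range(len(A))) for p in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m):
                M[r][c] -= f * M[col][c]
            v[r] -= f * v[col]
    # Back substitution.
    b = [0.0] * m
    for r in range(m - 1, -1, -1):
        b[r] = (v[r] - sum(M[r][c] * b[c] for c in range(r + 1, m))) / M[r][r]
    return b

rows = [(1, 1), (2, 1), (1, 2), (3, 2), (2, 3)]
y = [10 + 2 * x1 + 3 * x2 for x1, x2 in rows]      # known relationship
b0, b1, b2 = fit(rows, y)
print(round(b0), round(b1), round(b2))   # 10 2 3
```

Run on the Table 4.5 data, the same procedure would reproduce the Program 4.1 coefficients; in practice, software such as Excel does this step for you.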

Evaluating Multiple Regression Models

Evaluation is similar to simple linear regression models: the p-value for the F test and r² are interpreted the same way. The hypotheses are different because there is more than one independent variable; the F test investigates whether all the coefficients are equal to 0. If the F test is significant, it does not mean all the individual independent variables are significant.

To determine which independent variables are significant, a test is performed for each variable, e.g.:

    H0: β1 = 0
    H1: β1 ≠ 0

The test statistic is calculated, and if its p-value is lower than the level of significance (α), the null hypothesis is rejected.

Jenny Wilson Realty

The model is statistically significant: the p-value for the F test is 0.002. r² = 0.6719, so the model explains about 67% of the variation in selling price (Y). But the F test is for the entire model, and from it we can't tell whether one or both of the individual independent variables are significant. By calculating the p-value of each variable, we can assess the significance of the individual variables. Since the p-values for X1 (square footage) and X2 (age) are both less than the significance level of 0.05, both null hypotheses can be rejected.

Binary or Dummy Variables

Binary (or dummy, or indicator) variables are special variables created for qualitative data. A binary variable is assigned a value of 1 if a particular qualitative condition is met and a value of 0 otherwise. Adding binary variables may increase the accuracy of the regression model. The number of binary variables must be one less than the number of categories of the qualitative variable.

Jenny Wilson Realty

Jenny believes a better model can be developed if she includes information about the condition of the property:

    X3 = 1 if the house is in excellent condition, 0 otherwise
    X4 = 1 if the house is in mint (perfect) condition, 0 otherwise

Two binary variables are used to describe the three categories of condition. No variable is needed for "good" condition: if both X3 = 0 and X4 = 0, the house must be in good condition.

From the Excel output (Program 4.3), the new model is

    Ŷ = 121,658 + 56.43X1 - 3,962X2 + 33,162X3 + 47,369X4

The model explains about 89.8% of the variation in selling price, and the F value indicates the model is significant. The two additional dummy variables result in a higher r² and a smaller significance value, and the low p-values indicate that each variable is significant.
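The dummy-variable encoding can be sketched as follows. The coefficients are those reported for the Program 4.3 model above; the example house (1,900 square feet, 10 years old, mint condition) is a hypothetical input for illustration, not a row of Table 4.5.

```python
# Dummy-variable encoding for house condition (good / excellent / mint):
# X3 = 1 only for excellent, X4 = 1 only for mint; good is the base case.

def encode_condition(condition):
    x3 = 1 if condition == "excellent" else 0
    x4 = 1 if condition == "mint" else 0
    return x3, x4

def predict_price(sqft, age, condition):
    # Coefficients from the chapter's Program 4.3 output.
    x3, x4 = encode_condition(condition)
    return 121_658 + 56.43 * sqft - 3_962 * age + 33_162 * x3 + 47_369 * x4

print(round(predict_price(1900, 10, "mint")))   # 236624
```

With one fewer binary variable than categories, "good" is represented by X3 = X4 = 0, so its effect is absorbed into the intercept.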

Model Building

The best model is a statistically significant model with a high r² and few variables. As more variables are added to the model, the r² value usually increases, but more variables do not necessarily mean a better model. For this reason, the adjusted r² value is often used to determine whether an additional independent variable is beneficial. The adjusted r² takes into account the number of independent variables in the model.

The formula for r²:

    r² = SSR / SST = 1 - SSE / SST

The formula for adjusted r²:

    Adjusted r² = 1 - [SSE / (n - k - 1)] / [SST / (n - 1)]

As the number of independent variables (k) increases, n - k - 1 decreases. This causes SSE/(n - k - 1) to increase, and the adjusted r² to decrease, unless the extra variable causes a significant decrease in SSE to offset the change in k.

Note that when new variables are added to the model, the value of r² can never decrease; the adjusted r², however, may decrease. In general, if a new variable increases the adjusted r², it should probably be included in the model; a variable should not be added if it causes the adjusted r² to decrease. Compare the adjusted r² before and after adding the two binary variables in the Jenny Wilson Realty example (0.6122 vs. 0.8526).

In some cases, variables contain duplicate information. For example, the size of the lot, the number of bedrooms, and the number of bathrooms might all be correlated with the square footage of the house. When two independent variables are correlated, they are said to be collinear; when more than two independent variables are correlated, multicollinearity exists. The model is still good for prediction purposes when multicollinearity is present, but the hypothesis tests (p-values) for the individual variables and the interpretation of their coefficients are not valid.
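The adjusted r² can be computed directly from r², n, and k, since SSE/SST = 1 - r². A minimal sketch using the chapter's Jenny Wilson Realty figures (n = 14 houses); the printed values match the slides up to the rounding of the r² inputs.

```python
# Adjusted r^2 = 1 - [SSE/(n-k-1)] / [SST/(n-1)] = 1 - (1 - r2)(n - 1)/(n - k - 1)

def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Two variables (square footage, age): r2 about 0.6719
print(round(adjusted_r2(0.6719, 14, 2), 4))   # 0.6122
# Four variables (adding the two condition dummies): r2 about 0.898
print(round(adjusted_r2(0.898, 14, 4), 4))    # 0.8527
```

Adding the dummies raises the adjusted r² substantially, so by this criterion they belong in the model.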

Nonlinear Regression

In some situations the relationships between variables are not linear. Transformations may be used to turn a nonlinear model into a linear model, so that linear regression analysis programs such as Excel can still be used.

Colonel Motors

The engineers want to use regression analysis to improve fuel efficiency. They have been asked to study the impact of weight on miles per gallon (MPG).

Table 4.6
MPG   WEIGHT (1,000 LBS.)      MPG   WEIGHT (1,000 LBS.)
12    4.58                     20    3.18
13    4.66                     23    2.68
15    4.02                     24    2.65
18    2.53                     33    1.70
19    3.09                     36    1.95
19    3.11                     42    1.92

Figure 4.6A plots MPG against weight (1,000 lb.) with a straight line fitted through the points, i.e. the linear model

    Ŷ = b0 + b1X1

From the Excel output (Program 4.4), this is a useful model, with a small significance level for the F test and a good r² value:

    Ŷ = 47.6 - 8.2X1

Colonel Motors

Figure 4.6B plots the same MPG and weight data with a curve through the points; a nonlinear model seems better:

    MPG = b0 + b1(weight) + b2(weight)²

The nonlinear model is a quadratic model. The easiest way to work with it is to develop a new variable:

    X2 = (weight)²

This gives us a model that can be solved with linear regression software:

    Ŷ = b0 + b1X1 + b2X2

From the Excel output (Program 4.5):

    Ŷ = 79.8 - 30.2X1 + 3.4X2

This is a better model, with a smaller significance level for the F test and a larger adjusted r² value. Note, however, that the interpretation of the individual coefficients and their p-values is not valid for this transformed model.

Cautions and Pitfalls

If the assumptions about the errors are not met, the statistical tests may not be valid. Correlation does not necessarily mean causation (e.g. the price of automobiles and your annual salary). Multicollinearity makes interpreting coefficients problematic, but the model may still be good. Using a regression model beyond the range of X is questionable; the relationship may not hold outside the sample data (e.g. advertising amount and sales volume).
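The quadratic transformation described above can be sketched in code: the model stays linear in its coefficients once X2 = (weight)² is created. The coefficients are the chapter's Program 4.5 values; the 2,000-lb example car is a hypothetical input.

```python
# Quadratic "trick": create X2 = (weight)^2 so that
# Y-hat = b0 + b1*X1 + b2*X2 can be fitted with ordinary linear regression.

def predict_mpg(weight_1000lb):
    x1 = weight_1000lb
    x2 = weight_1000lb ** 2        # the transformed variable
    return 79.8 - 30.2 * x1 + 3.4 * x2

print(round(predict_mpg(2.0), 1))   # 33.0
```

For a 2,000-lb car the model predicts about 33 MPG, consistent with the curve in Figure 4.6B.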

Cautions and Pitfalls

t-tests for the intercept (b0) may be ignored, as this point (X = 0) is often outside the range of the model. A linear relationship may not be the best relationship, even if the F test returns an acceptable value; a nonlinear relationship can exist even if a linear relationship does not. Just because a relationship is statistically significant doesn't mean it has any practical value: r² must also be examined.

Homework Assignment

http://www.sci.brooklyn.cuny.edu/~dzhu/busn3430/