Ch14. Multiple Regression Analysis


Goals: multiple regression analysis

Model building and estimating
- More than one independent variable
- Quantitative independent variables
- Qualitative independent variables: dummy variables
- Regression coefficients
- Multiple standard error of estimate

Model evaluation
- Goodness of fit: global and individual linearity
- Multicollinearity
- Model assumption diagnostics: analysis of residuals, residual plots

Model-building workflow: fit the model, then check (1) linearity and (2) multicollinearity, and diagnose the model assumptions: (1) independence, (2) normality, (3) equal variance. If a check fails, apply an adequate remedy; if that is not enough, try a new model.

A multiple regression analysis: when there are k independent variables X1, X2, ..., Xk, the multiple regression equation is

mu_Y = alpha + beta1*X1 + beta2*X2 + ... + betak*Xk

where alpha = the Y-intercept = the mean of Y when X1 = X2 = ... = Xk = 0, and beta1, beta2, ..., betak = the net/partial regression coefficients. beta1 = the net change in the mean of Y for each unit change in X1 when the other variables X2, ..., Xk are held constant; it measures how much the mean of Y changes per unit of X1.

If there are k = 2 independent variables, see Chart 14-1.

For a qualitative X: recall that a regression model establishes a systematic relationship between two continuous variables, the independent and the dependent variable. What if some independent variables are nominal-scale/qualitative? Answer: use a dummy variable to replace the original variable.

Example. X1 and Y are continuous variables, while X2 (gender) is nominal.
If X2 = male, y = 0.6 + 0.9*X1
If X2 = female, y = 5.6 + 0.9*X1
How can such a model be expressed?

Dummy variable: a variable with only two possible outcomes, 0 or 1: I = 1 if success, I = 0 if failure.

Example. Let I2 = 1 if X2 = female and I2 = 0 if X2 = male. The previous model is then expressed as a single multiple regression model:

mu_Y = 0.6 + 0.9*X1 when X2 = male (I2 = 0)
mu_Y = 5.6 + 0.9*X1 = 0.6 + 0.9*X1 + 5 when X2 = female (I2 = 1)

so mu_Y = 0.6 + 0.9*X1 + 5*I2 = alpha + beta1*X1 + beta2*I2.

alpha = 0.6 is the intercept for males (I2 = 0); beta1 = 0.9 is the common slope; beta2 = 5 is the difference in the mean of Y between females (I2 = 1) and males (I2 = 0) at the same X1.
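As a minimal sketch in code of the dummy-coded model above, using the slide's coefficients (0.6, 0.9, 5); the function name mu_y is mine:

```python
# Dummy-variable model from the slide:
# mu_Y = 0.6 + 0.9*X1 + 5*I2, where I2 = 1 for female, 0 for male.

def mu_y(x1, gender):
    """Fitted mean of Y for a given X1 and gender ('male' or 'female')."""
    i2 = 1 if gender == "female" else 0   # dummy coding of the nominal X2
    return 0.6 + 0.9 * x1 + 5 * i2

# The two group-specific equations fall out of the single model:
print(mu_y(10, "male"))    # male:   0.6 + 0.9*10 = 9.6
print(mu_y(10, "female"))  # female: 5.6 + 0.9*10 = 14.6
```

At any fixed X1, the two fitted means differ by exactly beta2 = 5.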

[Chart: two parallel fitted lines with common slope 0.9, one for females and one for males.]

A dummy variable can also interact with a quantitative variable, giving different slopes. Example:

mu_Y = 0.6 + 0.9*X1 when X2 = male (I2 = 0)
mu_Y = 0.6 + 0.9*X1 + 0.3*X1 = 0.6 + 1.2*X1 when X2 = female (I2 = 1)

so mu_Y = 0.6 + 0.9*X1 + 0.3*X1*I2 = alpha + beta1*X1 + beta2*X1*I2.

[Chart: two fitted lines from the common intercept 0.6, slope 0.9 for males and slope 1.2 for females.]

Bonus 1 (1%): express as a single multiple regression model a model with different intercepts and different slopes:

mu_Y = 0.6 + 0.9*X1 when X2 = male (I2 = 0)
mu_Y = 5.6 + 1.2*X1 when X2 = female (I2 = 1)

Bonus 2: suppose X2 is qualitative with three categories, and

mu_Y = 0.6 + 0.9*X1 for the first category
mu_Y = 3.6 + 0.9*X1 for the second category
mu_Y = 2.6 + 0.9*X1 for the third category

Express the relationship between the X's and Y as a single model.

Estimating the regression equation: the multiple regression equation is estimated by

Y' = a + b1*X1 + b2*X2 + ... + bk*Xk

where a, b1, ..., bk are the least squares estimates (LSE). The calculations become tedious as k grows. Example: for k = 2 (two independent variables), solve the normal equations

sum(y)   = a*n       + b1*sum(x1)    + b2*sum(x2)
sum(x1y) = a*sum(x1) + b1*sum(x1^2)  + b2*sum(x1x2)
sum(x2y) = a*sum(x2) + b1*sum(x1x2)  + b2*sum(x2^2)

Many software packages provide the LSEs.
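A sketch of solving these k = 2 normal equations directly, on toy data generated from an assumed true equation y = 2 + 3*x1 - 1*x2 (data and helper names are mine, not from the slide):

```python
# Solve the k = 2 normal equations for a, b1, b2 on noise-free toy data.

def solve3(m, v):
    """Gauss-Jordan elimination for a 3x3 linear system."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for i in range(3):
        p = a[i][i]
        a[i] = [t / p for t in a[i]]
        for j in range(3):
            if j != i:
                f = a[j][i]
                a[j] = [t - f * s for t, s in zip(a[j], a[i])]
    return [row[3] for row in a]

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [2 + 3 * u - v for u, v in zip(x1, x2)]   # exact linear data
n = len(y)

sx1, sx2, sy = sum(x1), sum(x2), sum(y)
sx1x1 = sum(u * u for u in x1)
sx2x2 = sum(v * v for v in x2)
sx1x2 = sum(u * v for u, v in zip(x1, x2))
sx1y = sum(u * w for u, w in zip(x1, y))
sx2y = sum(v * w for v, w in zip(x2, y))

# The three normal equations, in matrix form.
m = [[n,   sx1,   sx2],
     [sx1, sx1x1, sx1x2],
     [sx2, sx1x2, sx2x2]]
rhs = [sy, sx1y, sx2y]
a, b1, b2 = solve3(m, rhs)
print(round(a, 6), round(b1, 6), round(b2, 6))  # recovers 2, 3, -1
```

With noise-free data the least squares solution reproduces the generating coefficients exactly, which is a handy sanity check on the arithmetic.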

Example (p. 477). Salsberry Realty sells homes along the east coast of the USA. "How much can one expect to pay to heat a home during the winter?" is a question frequently asked by customers.

Independent variables (X's):
1. The mean daily outside temperature
2. The number of inches of insulation in the attic
3. The age of the furnace

Dependent variable: Y = heating cost. A sample of n = 20 houses was investigated.

Answer the following questions:
1. Determine the multiple regression equation.
2. Discuss the regression coefficients.
3. What does it indicate that some are positive and some are negative? What is the intercept?
4. What is the estimated heating cost for a home if the mean outside temperature is 30 degrees, there are 5 inches of insulation in the attic, and the furnace is 10 years old?

[Table: the 20 sampled homes with heating cost (Y), mean outside temperature (X1), attic insulation (X2), and furnace age (X3).]

Fitted equation (regression output): Y' = 427.19 - 4.58*X1 - 14.83*X2 + 6.10*X3

Findings:
1. Y' = 427.19 - 4.58*X1 - 14.83*X2 + 6.10*X3.
2. The intercept is 427.19.
3. b1 and b2 are negative: X1 and X2 have an inverse relationship with Y. As the outside temperature X1 increases, the mean heating cost goes down, which is reasonable: for each degree the mean temperature rises, the mean heating cost decreases by 4.58 per month. Likewise, the more insulation in the attic, the lower the heating cost.
4. b3 = 6.10 > 0: X3 has a direct relationship. An older furnace means a higher heating cost.
5. If X1 = 30, X2 = 5, X3 = 10, the estimated heating cost is Y' = 427.19 - 4.58(30) - 14.83(5) + 6.10(10) = 276.60.
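The prediction in finding 5 is just arithmetic on the fitted equation. A small sketch (with the rounded coefficients the result is 276.64, essentially the slide's 276.60, which was computed from unrounded output):

```python
# Predicted heating cost from the fitted equation
# Y' = 427.19 - 4.58*X1 - 14.83*X2 + 6.10*X3 (rounded coefficients).
def heating_cost(temp, insulation, furnace_age):
    return 427.19 - 4.58 * temp - 14.83 * insulation + 6.10 * furnace_age

yhat = heating_cost(30, 5, 10)   # X1 = 30 degrees, X2 = 5 inches, X3 = 10 years
print(round(yhat, 2))            # 276.64 with rounded coefficients
```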

More on estimation. Multiple standard error of estimate: a measure of the error or variability in prediction. Formula:

s^2_{y.12...k} = sum (Y - Y')^2 / (n - (k + 1)) = SSE / (n - (k + 1)) = MSE

where residual = Y - Y'. The standard error of estimate helps to construct confidence intervals and prediction intervals. Why are the degrees of freedom n - (k + 1)? There are n responses Y1, ..., Yn, and Y' is determined by a predicted equation with (k + 1) estimated coefficients.

[Table: for each of the 20 homes, the residual (Y - Y') and its square; the squared residuals sum to 41695.28.]

s^2_{y.123} = sum (Y - Y')^2 / (n - (k + 1)) = 41695.28 / (20 - (3 + 1)), so s_{y.123} = 51.05

Alternatively, the estimate can be read from the EXCEL regression output:

s_{y.123} = sqrt(MSE) = sqrt(2606) = 51.05
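The arithmetic behind this, using the slide's SSE = 41695.28 with n = 20 and k = 3:

```python
import math

# Standard error of estimate for the heating-cost example.
SSE, n, k = 41695.28, 20, 3
MSE = SSE / (n - (k + 1))   # divide by n-(k+1), not n: k+1 coefficients were estimated
s = math.sqrt(MSE)
print(round(MSE, 2), round(s, 2))   # ~2605.96 and 51.05
```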

Model evaluation (model fit):
1. There is a linear relationship between each of X1, ..., Xk and Y.
   - Global test: (X1, ..., Xk) jointly vs. Y
   - Individual regression coefficients: each Xi vs. Y
2. There is no correlation among X1, ..., Xk. If there is, multicollinearity exists; it is diagnosed with the correlation matrix.

Model assumption diagnostics:
1. All observations (X1, ..., Xk, Y) are independent: residual plots.
2. The random error = Y - Y' ~ Normal(0, sigma^2) with equal variance: residual plot, normal plot. Homoscedasticity = equal variance.

Linearity: linear relationship between X1, ..., Xk and Y.
- Global linearity: jointly, (X1, ..., Xk) has a linear relationship with Y.
- Individual linearity: each of X1, X2, ..., Xk has a linear relationship with Y.
Methods:
- Subjective: eyeball, r^2
- Objective: statistical tests

Linearity: subjective methods. Individual linearity between Xi and Y.
- Scatter diagrams: plots of (X1, Y), (X2, Y), ..., (Xk, Y); a linear relationship appears as positively or negatively linear.
- Correlation matrix: a matrix showing the correlation coefficient r between all pairs of variables; the off-diagonal entries are the correlation coefficients. The correlations between (X1, Y), ..., (Xk, Y) should be near r = 1 or r = -1.

Example (p. 485): scatter plots.

Example (p. 486). EXCEL correlation matrix: X1 and X2 are negatively related to Y, while X3 is positively related. X1 has the strongest correlation with Y; X2 has the weakest.

Linearity: subjective methods. Global linear relationship between (X1, ..., Xk) and Y. Coefficient of multiple determination r^2, from the ANOVA table: the proportion of the total variation of Y explained by X1, ..., Xk.

r^2 = SSR / SStotal = 1 - SSE / SStotal

ANOVA table:
Source of variation | Sum of squares | Degrees of freedom | Mean square       | F
Regression          | SSR            | k                  | SSR/k = MSR       | MSR/MSE
Error               | SSE            | n - k - 1          | SSE/(n-k-1) = MSE |
Total               | SStotal        | n - 1              |                   |

Example. r^2 can be found in the EXCEL output: r^2 = 80.42%, or

r^2 = SSR / SStotal = 171220.5 / 212915.8 = 0.8042

X1, X2, X3 together explain about 80% of the variation in Y.
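The same r^2 arithmetic from the ANOVA sums of squares:

```python
# Coefficient of multiple determination from the slide's ANOVA output.
SSR, SStotal = 171220.5, 212915.8
SSE = SStotal - SSR
r2 = SSR / SStotal            # equivalently 1 - SSE/SStotal
print(round(r2, 4))           # 0.8042: X1, X2, X3 explain ~80% of Y's variation
```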

Linearity: objective methods. The hypotheses of linearity are tested objectively.
- Global linearity: does (X1, ..., Xk) jointly have a linear relationship with Y? That is, are the population coefficients beta1, ..., betak not all 0?
  H0: beta1 = ... = betak = 0 -- the F test in ANOVA!
- Individual linearity: does each of X1, X2, ..., Xk individually have a linear relationship with Y? That is, is a given population coefficient among beta1, ..., betak nonzero?
  E.g., for X1, test H0: beta1 = 0 vs. Ha: beta1 != 0 -- a t-test!

Individual test (p. 490)
Step 1. Hypotheses: H0: beta1 = 0 vs. Ha: beta1 != 0
Step 2. Significance level alpha
Step 3. Test statistic: t = (b1 - 0) / SE(b1) = b1 / s_b1

Step 4. Decision rule: a two-sided t-test. Under the null hypothesis, t ~ t distribution with d.f. n - (k + 1). H0 is rejected if |t| >= t(n - (k + 1), alpha/2), or if p-value <= alpha.
Step 5. Conclusion.

Example (p. 518). Since n - k - 1 = 16 and alpha = 0.05, the critical value is t(16, 0.025) = 2.12.

Conclusion at alpha = 0.05:
1. For the intercept, a = 427.19, SE(a) = 59.6, t = 7.17, p-value = 0: significant!
2. For X1, b1 = -4.58, SE(b1) = 0.77, t = -5.93, p-value = 0.00: significant!
3. For X2, b2 = -14.83, SE(b2) = 4.75, t = -3.12, p-value = 0.0066: significant!
4. For X3, b3 = 6.10, SE(b3) = 4.01, t = 1.52, p-value = 0.1479 > 0.05: not significant!

Recall from the correlation matrix: r(Y, X3) = 0.53 is quite large, so why is the linearity insignificant here? And r(Y, X2) = -0.25 is close to 0, so why is the linearity significant here?
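These t statistics can be reproduced from the reported coefficients and standard errors (critical value 2.12 from the slide; tiny differences from the slide's t values come from coefficient rounding):

```python
# Individual t-tests for the heating-cost model (d.f. = 16, t_crit = 2.12).
coeffs = {"intercept": (427.19, 59.6), "X1": (-4.58, 0.77),
          "X2": (-14.83, 4.75), "X3": (6.10, 4.01)}
t_crit = 2.12   # t(16, 0.025)
for name, (b, se) in coeffs.items():
    t = b / se
    verdict = "significant" if abs(t) >= t_crit else "not significant"
    print(f"{name}: t = {t:.2f}, {verdict}")
```

Only X3 fails the test, matching the slide's conclusion.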

Global test (pp. 487-488)
Step 1. Hypotheses: H0: beta1 = beta2 = ... = betak = 0
Step 2. Significance level alpha
Step 3. Test statistic: F = (SSR / k) / (SSE / (n - (k + 1))) = MSR / MSE

Step 4. Decision rule: a one-sided F-test, significant if F is large. Under the null hypothesis, F ~ F distribution with d.f. (k, n - (k + 1)). H0 is rejected if F >= F(k, n - (k + 1), alpha), or if p-value <= alpha.
Step 5. Conclusion.

Example (p. 488). Since k = 3, n - k - 1 = 16, alpha = 0.05, the critical value is F(3, 16, 0.05) = 3.24. Conclusion: at alpha = 0.05, H0 is rejected, since F = 21.9 > 3.24 or p-value = 0.000007 < 0.05; jointly, (X1, X2, X3) has a significant linear relationship with Y.
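The F statistic follows directly from the sums of squares reported earlier (SSR = 171220.5, SSE = 41695.28):

```python
# Global F test for the heating-cost model (k = 3 predictors, n = 20 homes).
SSR, SSE, k, n = 171220.5, 41695.28, 3, 20
MSR = SSR / k
MSE = SSE / (n - (k + 1))
F = MSR / MSE
print(round(F, 1))   # 21.9, well above the critical value 3.24
```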

Strategy for model selection (p. 490): how many independent variables should be in the model?
1. Develop a multiple regression equation based on all independent variables.
   1) Global test: significant? If not, stop and conclude that (X1, ..., Xk) are uncorrelated with Y. If yes, continue to 2).
   2) Individual tests: significant? If all are, go to step 3. If some are and some are not, go to step 2.
2. Remove the X with the largest p-value (delete the most insignificant independent variable) and go back to step 1.
3. Once the global and individual linearity are significant, check the model assumptions.

Is there any nonlinear relationship between X and Y? Residual = e = Y - Y' = unexplained error/variation. Is there any systematic pattern in the residual plot (X, e)?

If the model is right, Y ~ N(mu_Y = alpha + beta*x, sigma^2) and Y' = a + b*x, so

e = Y - Y' = Y - (a + b*x) ~ N(0, sigma^2) approximately:

the residuals lie around 0 and are independent of X.

If the model is not right, e.g. Y ~ N(mu_Y = alpha + beta1*x + beta2*x^2, sigma^2) while Y' = a + b*x, then

e = Y - Y' ~ N(beta2*x^2, sigma^2) approximately:

the residual is a quadratic function of X. If a nonlinear relationship exists, the model should be modified.


2. Check multicollinearity among the X's. Multicollinearity: correlation exists among the independent variables. It can distort SE(b) and lead to incorrect conclusions in hypothesis testing: SE(b) becomes large, so the test comes out insignificant. In the previous example, X1 and X3 are correlated.
Method: check the X part of the correlation matrix; multicollinearity exists if r > 0.7 or r < -0.7.
Strategy: if multicollinearity exists, drop one of the independent variables and rebuild the model.

Example (p. 486, EXCEL). Slight correlations between (X1, X2) and (X2, X3); a moderate negative correlation between (X1, X3). Recall that H0: beta3 = 0 is not rejected.
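A minimal multicollinearity screen can be hand-rolled: compute Pearson's r between pairs of predictors and flag |r| > 0.7. The toy data below are mine, chosen so x1 and x3 are strongly related:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x1 = [1, 2, 3, 4, 5]
x3 = [10, 8, 7, 4, 2]          # strongly (negatively) related to x1
r = pearson_r(x1, x3)
print(round(r, 3))             # close to -1
if abs(r) > 0.7:
    print("multicollinearity suspected: drop one of the pair and refit")
```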

Model assumptions. If the model is correct, Y ~ N(mu_Y = alpha + beta*x, sigma^2), Y' = a + b*x, and e = Y - Y' ~ N(0, sigma^2) approximately:
- Y1, ..., Yn are independent, so e1, ..., en are independent (approximately)
- Y1, ..., Yn come from a normal population distribution, so e1, ..., en ~ normal
- Y1, ..., Yn have constant variance at each level of x

3. Assumption of independence. Under independence:
1. The observed values should be independent of the sampling order i: residual plot (i, e).
2. Successive observations should be uncorrelated: residual plot (e(i), e(i+1)).

3. Assumption of independence (1): the residuals e_i should be independent of the order i. Plot (i, e_i).

3. Assumption of independence (2): there should be no systematic pattern between successive observations. Plot (e_i, e_{i+1}).

4. Assumption of normality and equal variance. Normal distribution: the residuals e ~ normal.
1. Histogram of the e's: bell-shaped, symmetric.
Example. Model: X1, X2, X3 (p. 505, residual histogram).

2. Normal probability plot (p-p plot): the plotted cumulative probabilities should be linear. Example. Model: X1, X2, X3 (p. 505). The plot is nearly a straight line, so we can conclude that normality holds.
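The idea behind the normal probability plot can be sketched by hand: pair the sorted residuals with standard normal quantiles; near-linearity of the pairs supports normality. The residuals below are hypothetical:

```python
from statistics import NormalDist

# Pair sorted residuals with theoretical standard normal quantiles.
resid = [-3.1, -1.4, -0.6, 0.2, 0.8, 1.5, 2.9]   # hypothetical residuals
n = len(resid)
nd = NormalDist()
pairs = [(nd.inv_cdf((i + 0.5) / n), e) for i, e in enumerate(sorted(resid))]
for q, e in pairs:
    print(f"theoretical {q:+.2f}  observed {e:+.2f}")
```

Plotting these pairs gives the normal probability plot; in practice a package draws it for you.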

Equal variance/homoscedasticity: the distributions of Y at different X-levels have equal variances. If the variances are not equal, SE(regression coeff.) is understated, the t-statistic is too large, and we may incorrectly conclude that X is significant. Remedies when variances are unequal: select other independent variables, or apply a transformation to X or Y.

The residuals should have equal variation at different X-levels. Check the residual plot (X, e) or (Y', e) (p. 526).
Example 1: unequal variance, the spread increases with X.
Example 2: another association, e.g. quadratic, may exist.


Example. An analyst is studying the effect of tire pressure on fuel economy (mpg) for a fleet of 24 sedans used by regional supervisors. Four different cars are driven at each tire pressure of 30, 31, 32, 33, 34, and 35 pounds per square inch. Develop an appropriate regression model relating tire pressure to fuel efficiency. What appears to be the best tire pressure?

The mileage seems to be curvilinear in the pressure.

Linear fit (regression output): Y' = 4.53 + 0.89*(Pressure). The R^2 is low.

According to the residual plot, there is a nonlinear relation between the residuals and the pressure.

Quadratic fit (regression output): Y' = -1208.43 + 75.74*(Pressure) - 1.15*(Pressure)^2
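Since the fitted quadratic opens downward (the Pressure^2 coefficient is negative), the "best" pressure is at the parabola's vertex, which can be read off from the fitted coefficients:

```python
# Vertex of the fitted quadratic Y' = -1208.43 + 75.74*P - 1.15*P^2.
b0, b1, b2 = -1208.43, 75.74, -1.15

def mpg(p):
    return b0 + b1 * p + b2 * p * p

p_best = -b1 / (2 * b2)          # vertex of the downward parabola
print(round(p_best, 1))          # about 32.9 psi maximizes predicted mpg
```

So within the 30-35 psi range studied, predicted mileage peaks near 33 psi.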


According to the residual plots, there is no severe departure from the model assumptions.

EXCEL (Data Analysis tool). Exercises: 9, 10, 11, 13, 14, 15. Excel: 17, 21, 23, 25.

Bonus (1%): Exercise 14.25. Using EXCEL, build the model and check:
1. Linear relationship between X and Y? Global linearity; individual linearity.
2. Multicollinearity?
3. Independent observations (X1, ..., Xk, Y)?
4. Normal distribution?
5. Equal variance?
Attach the EXCEL output.