Multiple Regression. Peerapat Wongchaiwat, Ph.D.

Peerapat Wongchaiwat, Ph.D. wongchaiwat@hotmail.com

The Multiple Regression Model
Examines the linear relationship between one dependent variable (Y) and two or more independent variables (Xi).
Multiple regression model with k independent variables:
Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
where β0 is the Y-intercept, β1, ..., βk are the population slopes, and ε is the random error.

Equation
The coefficients of the multiple regression model are estimated using sample data.
Multiple regression equation with k independent variables:
ŷi = b0 + b1x1i + b2x2i + ... + bkxki
where ŷi is the estimated (or predicted) value of y, b0 is the estimated intercept, and b1, ..., bk are the estimated slope coefficients.
We will always use a computer to obtain the regression slope coefficients and other regression summary measures.

Sales Example
Week | Pie Sales | Price ($) | Advertising ($100s)
   1 |       350 |      5.50 |       3.3
   2 |       460 |      7.50 |       3.3
   3 |       350 |      8.00 |       3.0
   4 |       430 |      8.00 |       4.5
   5 |       350 |      6.80 |       3.0
   6 |       380 |      7.50 |       4.0
   7 |       430 |      4.50 |       3.0
   8 |       470 |      6.40 |       3.7
   9 |       450 |      7.00 |       3.5
  10 |       490 |      5.00 |       4.0
  11 |       340 |      7.20 |       3.5
  12 |       300 |      7.90 |       3.2
  13 |       440 |      5.90 |       4.0
  14 |       450 |      5.00 |       3.5
  15 |       300 |      7.00 |       2.7
Multiple regression equation: Salest = b0 + b1(Price)t + b2(Advertising)t + et
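The slope coefficients come from computer output (shown next). As a minimal sketch of how the same model could be fit outside a spreadsheet, assuming Python with pandas and statsmodels is available:

```python
# Sketch: fitting the pie sales model; variable names here are illustrative.
import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "sales":       [350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300],
    "price":       [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00],
    "advertising": [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7],
})

X = sm.add_constant(data[["price", "advertising"]])  # adds the intercept column
model = sm.OLS(data["sales"], X).fit()
print(model.summary())  # coefficients, R Square, ANOVA F, t statistics, p-values
```

The estimates should come out close to the output on the next slide (intercept about 306.5, price about -25.0, advertising about 74.1).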

Output
Regression Statistics
Multiple R: 0.72213
R Square: 0.52148
Adjusted R Square: 0.44172
Standard Error: 47.46341
Observations: 15

Estimated equation: Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

ANOVA
Source     | df | SS        | MS        | F       | Significance F
Regression |  2 | 29460.027 | 14730.013 | 6.53861 | 0.01201
Residual   | 12 | 27033.306 | 2252.776  |         |
Total      | 14 | 56493.333 |           |         |

            | Coefficients | Standard Error | t Stat   | P-value | Lower 95% | Upper 95%
Intercept   | 306.52619    | 114.25389      | 2.68285  | 0.01993 | 57.58835  | 555.46404
Price       | -24.97509    | 10.83213       | -2.30565 | 0.03979 | -48.57626 | -1.37392
Advertising | 74.13096     | 25.96732       | 2.85478  | 0.01449 | 17.55303  | 130.70888

R² and Adjusted R²
R² never decreases when a new X variable is added to the model, even if the new variable is not an important predictor variable. Hence, models with different numbers of explanatory variables cannot be compared by R².
What is the net effect of adding a new variable? We lose a degree of freedom when a new X variable is added. Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?
Adjusted R² penalizes excessive use of unimportant independent variables. Adjusted R² is always smaller than R² (except when R² = 1).
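For reference, one standard way to write the adjustment, with n observations and k regressors; plugging in the pie-sales output above recovers the reported value:

```latex
\bar{R}^{2} = 1 - (1 - R^{2})\frac{n-1}{n-k-1} = 1 - (1 - 0.52148)\frac{15-1}{15-2-1} \approx 0.4417
```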

F-Test for Overall Significance of the Model
Shows whether there is a linear relationship between all of the X variables considered together and Y.
Use the F test statistic.
Hypotheses:
H0: β1 = β2 = ... = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)

F-Test for Overall Significance (continued)
From the output above: F = MSR / MSE = 14730.0 / 2252.8 = 6.5386, with 2 and 12 degrees of freedom.
The p-value for the F-test is Significance F = 0.01201 in the ANOVA table.
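As a quick check, the F statistic and its p-value can be recomputed from the ANOVA numbers above; this sketch assumes scipy is available:

```python
# Sketch: recomputing the overall F test from the ANOVA table above.
from scipy import stats

msr, mse = 14730.013, 2252.776               # mean squares from the output
f_stat = msr / mse                            # about 6.5386
p_value = stats.f.sf(f_stat, dfn=2, dfd=12)   # upper tail of F with 2 and 12 df
print(f_stat, p_value)                        # p is about 0.012, matching Significance F
```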

The ANOVA Table in Regression
Source of Variation | Sum of Squares | Degrees of Freedom      | Mean Square             | F Ratio
Regression          | SSR            | k                       | MSR = SSR / k           | F = MSR / MSE
Error               | SSE            | n - (k + 1) = n - k - 1 | MSE = SSE / (n - k - 1) |
Total               | SST            | n - 1                   | MST = SST / (n - 1)     |

R² = SSR / SST = 1 - SSE / SST
F = [R² / k] / [(1 - R²) / (n - (k + 1))]
Adjusted R² = 1 - [SSE / (n - (k + 1))] / [SST / (n - 1)] = 1 - MSE / MST

Tests of the Significance of Individual Regression Parameters
Hypothesis tests about the individual regression slope parameters:
(1) H0: β1 = 0 vs. H1: β1 ≠ 0
(2) H0: β2 = 0 vs. H1: β2 ≠ 0
...
(k) H0: βk = 0 vs. H1: βk ≠ 0
Test statistic for test i: t = (bi - 0) / s(bi), with n - (k + 1) degrees of freedom.
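Applied to the Price row of the pie-sales output, the same calculation looks like this (a sketch assuming scipy):

```python
# Sketch: t statistic and two-sided p-value for the Price coefficient.
from scipy import stats

b_price, se_price = -24.97509, 10.83213
t_stat = (b_price - 0) / se_price            # about -2.3057
df = 15 - (2 + 1)                            # n - (k + 1) = 12
p_value = 2 * stats.t.sf(abs(t_stat), df)    # about 0.0398, matching the output
print(t_stat, p_value)
```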

The Concept of Partial Regression Coefficients
In multiple regression, the interpretation of slope coefficients requires special attention:
ŷi = b0 + b1x1i + b2x2i
Here, b1 shows the relationship between X1 and Y holding X2 constant (i.e., controlling for the effect of X2).

Purifying X1 from X2 (i.e., removing the effect of X2 on X1):
Run a regression of X1 on X2: X1i = γ0 + γ1X2i + vi
vi = X1i - (γ0 + γ1X2i) is X1 purified from X2.
Then run a regression of Yi on vi: Yi = α0 + α1vi.
α1 is the b1 in the original multiple regression equation.

b1 shows the relationship between X1 purified from X2 and Y.
Whenever a new explanatory variable is added to the regression equation or removed from it, all b coefficients change (unless the covariance of the added or removed variable with all other variables is zero).
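A minimal synthetic-data sketch of this purification result, assuming numpy and statsmodels are available; the data-generating numbers are arbitrary:

```python
# Sketch: residualise X1 on X2, then regress Y on that residual;
# the slope equals b1 from the full multiple regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)                 # X1 correlated with X2
y = 2.0 + 1.5 * x1 - 1.0 * x2 + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

aux = sm.OLS(x1, sm.add_constant(x2)).fit()        # regress X1 on X2
v = aux.resid                                      # X1 purified from X2
partial = sm.OLS(y, sm.add_constant(v)).fit()

print(full.params[1], partial.params[1])           # the two slopes agree
```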

The Principle of Parsimony: any insignificant explanatory variable should be removed from the regression equation. The Principle of Generosity: any significant variable must be included in the regression equation.
Choosing the best model: choose the model with the highest adjusted R² or F, or the lowest AIC (Akaike Information Criterion) or SC (Schwarz Criterion). Apply the stepwise regression procedure.

For example: a researcher may be interested in how Education and Family Income relate to the Number of Children in a family.
Independent variables: Education, Family Income. Dependent variable: Number of Children.

For example:
Research Hypothesis: As education of respondents increases, the number of children in families will decline (negative relationship).
Research Hypothesis: As family income of respondents increases, the number of children in families will decline (negative relationship).
(Independent variables: Education, Family Income. Dependent variable: Number of Children.)

For example:
Null Hypothesis: There is no relationship between education of respondents and the number of children in families.
Null Hypothesis: There is no relationship between family income and the number of children in families.
(Independent variables: Education, Family Income. Dependent variable: Number of Children.)

Bivariate regression is based on fitting a line as close as possible to the plotted coordinates of your data on a two-dimensional graph. Trivariate regression is based on fitting a plane as close as possible to the plotted coordinates of your data on a three-dimensional graph.
Case:                   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25
Children (Y):           2   5   1   9   6   3   0   3   7   7   2   5   1   9   6   3   0   3   7  14   2   5   1   9   6
Education (X1):        12  16  20  12   9  18  16  14   9  12  12  10  20  11   9  18  16  14   9   8  12  10  20  11   9
Income, 1=$10K (X2):    3   4   9   5   4  12  10   1   4   3  10   4   9   4   4  12  10   6   4   1  10   3   9   2   4

Plotted coordinates (1-10) for Education, Income and Number of Children. [Three-dimensional scatter plot: Y = Children on the vertical axis, X1 = Education and X2 = Income on the horizontal axes; data as in the table above.]

What multiple regression does is fit a plane to these coordinates. [Three-dimensional scatter plot with a fitted plane: Y = Children, X1 = Education, X2 = Income, shown for cases 1-10.]

Mathematically, that plane is: Y = a + b1X1 + b2X2
a = the y-intercept (the value of Y where the X's equal zero)
b = the coefficient or slope for each variable
For our problem, SPSS says the equation is: Y = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income
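As a sketch, the same plane can be fit from the 25 cases listed earlier, here using Python's statsmodels as an illustrative alternative to SPSS:

```python
# Sketch: refitting the children-on-education-and-income plane.
import pandas as pd
import statsmodels.api as sm

children  = [2, 5, 1, 9, 6, 3, 0, 3, 7, 7, 2, 5, 1, 9, 6, 3, 0, 3, 7, 14, 2, 5, 1, 9, 6]
education = [12, 16, 20, 12, 9, 18, 16, 14, 9, 12, 12, 10, 20, 11, 9, 18, 16, 14, 9, 8, 12, 10, 20, 11, 9]
income    = [3, 4, 9, 5, 4, 12, 10, 1, 4, 3, 10, 4, 9, 4, 4, 12, 10, 6, 4, 1, 10, 3, 9, 2, 4]

df = pd.DataFrame({"children": children, "education": education, "income": income})
X = sm.add_constant(df[["education", "income"]])
fit = sm.OLS(df["children"], X).fit()
print(fit.params)  # should be close to the SPSS values: about 11.8, -0.36, -0.40
```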

Let's take a moment to reflect. Why do I write the equation
Y = a + b1X1 + b2X2
whereas KBM often write
Yi = a + b1X1i + b2X2i + ei?
One is the equation for a prediction; the other is the value of a data point for a person.

Model Summary (Model 1)
R = .757, R Square = .573, Adjusted R Square = .534, Std. Error of the Estimate = 2.33785
Predictors: (Constant), Income, Education
57% of the variation in number of children is explained by education and income!
Y = 11.8 - .36X1 - .40X2

ANOVA (Dependent Variable: Children)
Source     | Sum of Squares | df | Mean Square | F      | Sig.
Regression | 161.518        |  2 | 80.759      | 14.776 | .000
Residual   | 120.242        | 22 | 5.466       |        |
Total      | 281.760        | 24 |             |        |

Coefficients (Dependent Variable: Children)
            | B      | Std. Error | Beta  | t      | Sig.
(Constant)  | 11.770 | 1.734      |       | 6.787  | .000
Education   | -.364  | .173       | -.412 | -2.105 | .047
Income      | -.403  | .194       | -.408 | -2.084 | .049

From the same output (Model Summary, ANOVA, and Coefficients tables as above):
r² = [Σ(Y - Ȳ)² - Σ(Y - Ŷ)²] / Σ(Y - Ȳ)² = 161.518 / 281.76 = .573
Y = 11.8 - .36X1 - .40X2

So what does our equation tell us?
Y = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income
Try plugging in some values for your variables.

So what does our equation tell us?
Ŷ = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income
If Education equals: | If Income equals: | Then, Children equals:
0                    | 0                 | 11.8
10                   | 0                 | 8.2
10                   | 10                | 4.2
20                   | 10                | 0.6
20                   | 11                | 0.2

So what does our equation tell us?
Ŷ = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income
If Education equals: | If Income equals: | Then, Children equals:
1                    | 0                 | 11.44
1                    | 1                 | 11.04
1                    | 5                 | 9.44
1                    | 10                | 7.44
1                    | 15                | 5.44

So what does our equation tell us?
Ŷ = 11.8 - .36X1 - .40X2
Expected # of Children = 11.8 - .36*Educ - .40*Income
If Education equals: | If Income equals: | Then, Children equals:
0                    | 1                 | 11.40
1                    | 1                 | 11.04
5                    | 1                 | 9.60
10                   | 1                 | 7.80
15                   | 1                 | 6.00

If graphed, holding one variable constant produces a two-dimensional graph for the other variable. [Two line graphs: Y against X1 = Education over 0 to 15 with slope b = -.36, falling from 11.40 to 6.00; and Y against X2 = Income over 0 to 15 with slope b = -.4, falling from 11.44 to 5.44.]

Dummy Explanatory Variables
Di: qualitative binomial (0, 1) variables.
Yi = β0 + β1Xi + β2Di + ui
For Di = 0: Yi = β0 + β1Xi + ui
For Di = 1: Yi = β0 + β1Xi + β2 + ui = (β0 + β2) + β1Xi + ui
To measure the effect of Di on the relation between X and Y:
Yi = β0 + β1Xi + β2Xi*Di + ui
For Di = 0: Yi = β0 + β1Xi + ui
For Di = 1: Yi = β0 + β1Xi + β2Xi + ui = β0 + (β1 + β2)Xi + ui
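A short synthetic sketch of both specifications (intercept shift and slope shift), assuming numpy and statsmodels; the true coefficients used to generate the data are arbitrary:

```python
# Sketch: a 0/1 dummy as an intercept shift and as a slope shift (interaction).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
d = rng.integers(0, 2, size=n)                  # 0/1 dummy
y = 1.0 + 2.0 * x + 3.0 * d + rng.normal(size=n)

# Intercept shift: Y = b0 + b1*X + b2*D
X_shift = sm.add_constant(np.column_stack([x, d]))
print(sm.OLS(y, X_shift).fit().params)

# Slope shift: Y = b0 + b1*X + b2*(X*D)
X_slope = sm.add_constant(np.column_stack([x, x * d]))
print(sm.OLS(y, X_slope).fit().params)
```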

Warning: dummy variables can be used only as regressors. Should the dependent variable be binomial, you need to use Logit or Probit regression models, which employ a maximum likelihood (ML) estimator. This is because the binomial feature violates the normal distribution assumption, which renders t-statistics invalid. (You can learn these techniques in Econometrics II.)
Time-period dummies can be used for: 1) measuring the stability of a relationship over time, and 2) treating outliers.
Seasonal dummies can be used to treat seasonal variation in seasonally unadjusted data. Simply create n - 1 dummies for n seasonal sections and use them as regressors. You may include the seasonal dummies in the regression to control for seasonal variation.

The way you use nominal variables in regression is by converting them to a series of dummy variables.
Nominal variable: Race (1 = White, 2 = Black, 3 = Other)
Recode into dummy variables:
1. White: 0 = Not White; 1 = White
2. Black: 0 = Not Black; 1 = Black
3. Other: 0 = Not Other; 1 = Other

The way you use nominal variables in regression is by converting them to a series of dummy variables.
Nominal variable: Religion (1 = Catholic, 2 = Protestant, 3 = Jewish, 4 = Muslim, 5 = Other Religions)
Recode into dummy variables:
1. Catholic: 0 = Not Catholic; 1 = Catholic
2. Protestant: 0 = Not Protestant; 1 = Protestant
3. Jewish: 0 = Not Jewish; 1 = Jewish
4. Muslim: 0 = Not Muslim; 1 = Muslim
5. Other Religions: 0 = Not Other; 1 = Other Religion

When you need to use a nominal variable in regression (like race), just convert it to a series of dummy variables. When you enter the variables into your model, you MUST LEAVE OUT ONE OF THE DUMMIES.
Leave out one: White. Enter the rest into the regression: Black, Other.

The reason you MUST LEAVE OUT ONE OF THE DUMMIES is that regression is mathematically impossible without an excluded group. If all were in, holding one of them constant would prohibit variation in all the rest.
Leave out one: Catholic. Enter the rest into the regression: Protestant, Jewish, Muslim, Other Religion.
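In practice the recoding can be automated; a minimal sketch with pandas, using White as the excluded reference group to match the example:

```python
# Sketch: converting a nominal variable to dummies and leaving one out.
import pandas as pd

df = pd.DataFrame({"race": ["White", "Black", "Other", "White", "Black"]})
dummies = pd.get_dummies(df["race"])              # one 0/1 column per category
regressors = dummies.drop(columns=["White"])      # leave out the reference group
print(regressors)
```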

The regression equations for dummies will look the same. For Race, with 3 dummies, predicting self-esteem:
Y = a + b1X1 + b2X2
a = the y-intercept, which in this case is the predicted value of self-esteem for the excluded group, White.
b1 = the slope for variable X1, Black
b2 = the slope for variable X2, Other

If our equation were, for Race, with 3 dummies, predicting self-esteem:
Y = 28 + 5X1 - 2X2
Plugging in values for the dummies tells you each group's self-esteem average:
a = 28, the y-intercept, is the predicted self-esteem for the excluded group, White.
5 = the slope for variable X1, Black
-2 = the slope for variable X2, Other
White = 28, Black = 33, Other = 26
When a case's values are X1 = 0 and X2 = 0, it is White; when X1 = 1 and X2 = 0, Black; when X1 = 0 and X2 = 1, Other.

Dummy variables can be entered into multiple regression along with other dichotomous and continuous variables. For example, you could regress self-esteem on sex, race, and education:
Y = a + b1X1 + b2X2 + b3X3 + b4X4
How would you interpret this?
Y = 30 - 4X1 + 5X2 - 2X3 + 0.3X4
X1 = Female, X2 = Black, X3 = Other, X4 = Education

How would you interpret this?
Y = 30 - 4X1 + 5X2 - 2X3 + 0.3X4
X1 = Female, X2 = Black, X3 = Other, X4 = Education
1. Women's self-esteem is 4 points lower than men's.
2. Blacks' self-esteem is 5 points higher than whites'.
3. Others' self-esteem is 2 points lower than whites' and consequently 7 points lower than blacks'.
4. Each year of education improves self-esteem by 0.3 units.

How would you interpret this?
Y = 30 - 4X1 + 5X2 - 2X3 + 0.3X4
X1 = Female, X2 = Black, X3 = Other, X4 = Education
Plugging in some select values, we'd get self-esteem for select groups:
White males with 10 years of education = 33
Black males with 10 years of education = 38
Other females with 10 years of education = 27
Other females with 16 years of education = 28.8

How would you interpret this?
Y = 30 - 4X1 + 5X2 - 2X3 + 0.3X4
X1 = Female, X2 = Black, X3 = Other, X4 = Education
The same regression rules apply. The slopes represent the linear relationship of each independent variable to the dependent variable while holding all other variables constant. Make sure you get into the habit of saying the slope is the effect of an independent variable while holding everything else constant.

Seasonal adjustment using dummy variables
Example: Suppose a researcher is using seasonally unadjusted data at the quarterly frequency for the variable Yt. For 4 quarters, create 3 dummies:
D1 = 1 if t is Q1, 0 otherwise
D2 = 1 if t is Q2, 0 otherwise
D3 = 1 if t is Q3, 0 otherwise
The residuals of the regression Yt = β0 + β1D1,t + β2D2,t + β3D3,t + εt are the seasonally adjusted Yt.
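A minimal sketch of that procedure on a made-up quarterly series, assuming numpy, pandas, and statsmodels:

```python
# Sketch: seasonal adjustment with quarterly dummies; the residuals are the adjusted series.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
quarter = np.tile([1, 2, 3, 4], 10)                        # 40 quarterly observations
y = 10 + np.tile([3.0, 0.0, -2.0, 1.0], 10) + rng.normal(size=40)

# Dropping the first quarter is equivalent to the D1-D3 construction above.
d = pd.get_dummies(quarter, drop_first=True, dtype=float)  # dummies for Q2, Q3, Q4
fit = sm.OLS(y, sm.add_constant(d.to_numpy())).fit()
y_sa = fit.resid                                           # seasonally adjusted Y
print(y_sa[:8])
```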

Log Transformations
Yi = β0 + β1Xi + ui
The β1 in the above regression indicates the expected change in Yi resulting from a 1-unit increase in Xi, not the relationship in % terms.
If you need the expected % change in Yi resulting from a 1% increase in Xi, you need to run the following regression:
Ln(Yi) = β0 + β1Ln(Xi) + ui
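A quick synthetic illustration of the log-log form, assuming numpy and statsmodels; the elasticity of 0.8 is just an example value:

```python
# Sketch: in a log-log regression the slope is read as an elasticity.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1.0, 10.0, size=200)
y = np.exp(0.5 + 0.8 * np.log(x) + rng.normal(scale=0.1, size=200))

fit = sm.OLS(np.log(y), sm.add_constant(np.log(x))).fit()
print(fit.params[1])   # close to the elasticity 0.8 used to generate the data
```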

Assumptions of the OLS Estimator
1) E(ei) = 0 (zero-mean errors)
2) Var(ei) is constant (homoscedasticity)
3) Cov(ei, ej) = 0 (independent error terms)
4) Cov(ei, Xi) = 0 (error terms unrelated to the X's)
In short, ei ~ iid(0, σ²).
Gauss-Markov Theorem: if these conditions hold, OLS is the best linear unbiased estimator (BLUE).
Additional assumption: the ei's are normally distributed.

Time Series Regressions
Lagged variable: Yt = β0 + β1Xt + β2Xt-1 + ut
Autoregressive model: Xt = β1Xt-1 + β2Xt-2 + ut
Time trend: Yt = β0 + β1Xt + β2Tt + ut
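A sketch of how lag and trend regressors can be constructed, assuming pandas and statsmodels; the series itself is made up:

```python
# Sketch: building a lagged regressor and a time trend.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 1.0 + 0.6 * df["x"] + rng.normal(size=100)

df["x_lag1"] = df["x"].shift(1)          # X_{t-1}
df["trend"] = np.arange(len(df))         # time trend T_t
df = df.dropna()                         # drop the first row lost to the lag

fit = sm.OLS(df["y"], sm.add_constant(df[["x", "x_lag1", "trend"]])).fit()
print(fit.params)
```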

Spurious Regressions
As a general and very strict rule: all variables in a time-series regression must be stationary. Never run a regression with nonstationary variables! (The DW statistic will warn you.)
A nonstationary variable can be made stationary by taking its first difference. If X is nonstationary, ΔX = Xt - Xt-1 may be stationary.
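A sketch of checking stationarity and differencing, assuming numpy and statsmodels:

```python
# Sketch: ADF test on a random walk, then on its first difference.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(5)
x = np.cumsum(rng.normal(size=300))      # a random walk (nonstationary)

print(adfuller(x)[1])                    # large p-value: cannot reject a unit root

dx = np.diff(x)                          # first difference, dX_t = X_t - X_{t-1}
print(adfuller(dx)[1])                   # small p-value: the difference looks stationary
```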

Exercise: How do you build a regression?
Descriptive statistics: mean, median, etc.
Correlation: not over 0.5 among the xi (explanatory variables)
Stationarity: ADF test
Run the regression
Test for heteroscedasticity and normality
Check the VIF in case of multicollinearity
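A rough sketch of this checklist in Python; the file name data.csv and the column names y, x1, x2 are purely hypothetical:

```python
# Sketch of the workflow above on a hypothetical dataset.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tsa.stattools import adfuller
from scipy import stats

df = pd.read_csv("data.csv")                      # hypothetical columns: y, x1, x2
print(df.describe())                              # descriptive statistics
print(df[["x1", "x2"]].corr())                    # correlations among regressors

for col in ["y", "x1", "x2"]:                     # ADF stationarity check (time series)
    print(col, adfuller(df[col])[1])

X = sm.add_constant(df[["x1", "x2"]])
fit = sm.OLS(df["y"], X).fit()                    # run the regression

print(het_breuschpagan(fit.resid, X))             # heteroscedasticity test
print(stats.jarque_bera(fit.resid))               # normality of the residuals
for i in range(1, X.shape[1]):                    # VIFs for multicollinearity
    print(X.columns[i], variance_inflation_factor(X.to_numpy(), i))
```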