Multiple Regression and Model Building (11.220), Lecture 20, 1 May 2006, R. Ryznar

Building Models: Making Sure the Assumptions Hold

1. There is a linear relationship between the explanatory (independent) variable(s) and the response (dependent) variable. Check a scatterplot of your data.

2. The error terms are normally distributed with a mean of zero. In the regression model Y = α + βX + ε, we assume that the values of the error term ε (i.e., observed minus predicted values) are distributed normally with a mean of zero. We can check this by plotting the error values from our data. Why is this important? It can be shown that, if these errors are normally distributed, then our model's b is an unbiased estimator of the true slope in the population, β.

3. The errors have equal variance for all values of X. This property is called homoscedasticity. In plain language, the variance of the error term does not change systematically with changes in the value of X.

4. The error values are independent. The value of any given error term is independent of the value of any other error term. This is most frequently a problem with data collected over time.

5. The data are ratio or interval. All dependent and independent variables are continuous and measured on ratio or interval scales. This is technically correct; in practice, however, this rule is regularly compromised, for example by using dummy variables to incorporate categorical data into regression models.

Dummy variables: a way to use nominal (qualitative) data in the regression equation. We create one or more variables, each of which takes on the value 0 or 1 only. The number of dummy variables needed is k-1, where k is the number of categories in the original nominal variable. The regression coefficient for a dummy variable can be interpreted as the predicted change in Y when an observation is a member of that category, as compared to the reference category (explained shortly).

How NOT to use dummy variables:
Let RACE = 1 if African American
           2 if Asian American
           3 if Caucasian
           4 if Hispanic
           5 if Other

The correct way is to use a set of indicator ("dummy") variables and code them in this manner:
Let AFRAMER = 1 if African American and 0 otherwise
Let ASIAMER = 1 if Asian American and 0 otherwise
Let CAUCAS  = 1 if Caucasian and 0 otherwise
Let HISPAN  = 1 if Hispanic and 0 otherwise
Let OTHER   = 1 if Other and 0 otherwise
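The coding scheme above can be sketched in code. A minimal sketch (the function and variable names are illustrative, not from the lecture): build k-1 indicator columns from a nominal variable, leaving the reference category out.

```python
def make_dummies(values, categories, reference):
    """Return {category: [0/1, ...]} for every category except the reference."""
    cols = {}
    for cat in categories:
        if cat == reference:
            continue  # the omitted/reference group gets no column
        cols[cat] = [1 if v == cat else 0 for v in values]
    return cols

race = ["African American", "Asian American", "Caucasian", "Hispanic", "Other"]
sample = ["Caucasian", "African American", "Hispanic"]
dummies = make_dummies(sample, race, reference="African American")

# African American observations are coded 0 in all k-1 = 4 columns.
print(dummies["Caucasian"])  # [1, 0, 0]
```

An observation in the reference category is identified by zeros in every dummy column, which is why only k-1 columns are needed.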

Suppose our conceptual model is:
Y = α + β1X1 + β2X2 + e
Income = α + β1(Race) + β2(Education) + e
Income = a + b1(ASIAMER) + b2(CAUCAS) + b3(HISPAN) + b4(OTHER) + b5(EDUC)

Possible model results (income in thousands of dollars):
Income = 5.41 + 1.9(ASIAMER) + 2.5(CAUCAS) - 0.7(HISPAN) - 2.2(OTHER) + 0.95(EDUC)
Thus, to find the predicted income for individuals of different races, each with 12 years of schooling:
Asian American   = a + b1 + (12 × b5) = 5.41 + 1.9 + (12 × 0.95) = 18,710
Caucasian        = a + b2 + (12 × b5) = 5.41 + 2.5 + (12 × 0.95) = 19,310
Hispanic         = a + b3 + (12 × b5) = 5.41 - 0.7 + (12 × 0.95) = 16,110
Other            = a + b4 + (12 × b5) = 5.41 - 2.2 + (12 × 0.95) = 14,610
African American = a + (12 × b5)      = 5.41 + (12 × 0.95)       = 16,810
Do you know another way of determining the effect of race?
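The arithmetic above can be reproduced directly from the coefficients. A small sketch (names are illustrative; coefficients are in thousands of dollars, as on the slide):

```python
# Coefficients from the slide, in $1,000s; EDUC defaults to 12 years.
a, b_educ = 5.41, 0.95
b = {"ASIAMER": 1.9, "CAUCAS": 2.5, "HISPAN": -0.7, "OTHER": -2.2}

def predicted_income(race, educ=12):
    # The reference group (African American) has all dummies = 0,
    # so b.get() returns 0.0 for it.
    return round((a + b.get(race, 0.0) + b_educ * educ) * 1000)

print(predicted_income("ASIAMER"))  # 18710
print(predicted_income("AFRAMER"))  # 16810
```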

The category that is not coded is the category to which all others are compared; it is called the omitted or reference group. How do you interpret the intercept? The intercept is the mean of the omitted group (with any other predictors held at zero). How do you interpret the other coefficients? The b1 coefficient is the mean of the ASIAMER group minus the mean of the AFRAMER group; the b2 coefficient is the mean of the CAUCAS group minus the mean of the AFRAMER group, and so on.

Building Models: Making Sure the Assumptions Hold
1. There is a linear relationship between the explanatory (independent) variable(s) and the response (dependent) variable.
2. The error terms are normally distributed with a mean of zero.
3. The errors have equal variance for all values of X. This property is called homoscedasticity.
4. The error values are independent.
5. The data are ratio or interval.

Linearity (and how to get it)

Monthly Electrical Usage and Size of Home

Size of Home    Monthly Electrical Usage
(Square Feet)   (Kilowatt-Hours)
1,290           1,182
1,350           1,172
1,470           1,264
1,600           1,493
1,710           1,571
1,840           1,711
1,980           1,804
2,230           1,840
2,400           1,956
2,930           1,954

[Scatterplot: Energy Use (kilowatt-hours) vs. Home Size (square feet).] Electrical usage appears to increase in a curvilinear manner with the size of the home.

Transformations for Nonlinear Relation Only (transformations of X, keyed to prototype regression patterns):
Pattern A: X' = log10(X) or X' = sqrt(X)
Pattern B: X' = X^2 or X' = exp(X)
Pattern C: X' = 1/X or X' = exp(-X)
Figure by MIT OCW.

[Two scatterplots: EnergyUse vs. HomeSize, and EnergyUse vs. lnX (the natural log of home size).]

Prototype Regression Patterns with Unequal Error Variances and Simple Transformations of Y:
Pattern A: Y' = sqrt(Y)
Pattern B: Y' = log10(Y)
Pattern C: Y' = 1/Y
Note: a simultaneous transformation on X may also be helpful or necessary.
Figure by MIT OCW.

[Scatterplot of EnergyUse vs. HomeSize showing the observed points with fitted linear and quadratic curves.]
Linear:    y = β0 + β1x + ε
Quadratic: y = β0 + β1x + β2x² + ε
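Assuming the SPSS output later in the lecture comes from the ten observations in the table above, the quadratic fit can be sketched with ordinary least squares via the normal equations (pure standard library; `ols` is an illustrative helper, not from the lecture):

```python
size = [1290, 1350, 1470, 1600, 1710, 1840, 1980, 2230, 2400, 2930]
usage = [1182, 1172, 1264, 1493, 1571, 1711, 1804, 1840, 1956, 1954]

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    v = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for i in range(k):                      # forward elimination with pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        v[i], v[p] = v[p], v[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * c for a, c in zip(A[r], A[i])]
            v[r] -= f * v[i]
    b = [0.0] * k
    for i in reversed(range(k)):            # back substitution
        b[i] = (v[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b

# Design matrix with columns 1, x, x^2 for the quadratic model.
quad = ols([[1, x, x * x] for x in size], usage)
print([round(c, 5) for c in quad])  # approximately [-1216.14389, 2.39893, -0.00045]
```

The fitted coefficients match the SPSS Coefficients table shown later (B = -1216.144, 2.39893, -.00045).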

Building Models: Making Sure the Assumptions Hold
1. There is a linear relationship between the explanatory (independent) variable(s) and the response (dependent) variable.
2. The error terms are normally distributed with a mean of zero.
3. The errors have equal variance for all values of X. This property is called homoscedasticity.
4. The error values are independent.
5. The data are ratio or interval.

Residual Plots against EnergyUse
[Two scatterplots of regression standardized residuals against EnergyUse (dependent variable EnergyUse): left, before inclusion of X²; right, after inclusion of X².]

[Normal P-P plots of regression standardized residuals (dependent variable EnergyUse): left, the linear model y = β0 + β1x + ε; right, the quadratic model y = β0 + β1x + β2x² + ε.]

Homoscedasticity: the dispersion of residual errors around a regression line. The graph on the left shows a homoscedastic regression: the variance of the residuals given x is constant. The graph on the right shows a heteroscedastic regression: the variance of the residuals increases with x. In this case, higher x values predict y with less certainty.
[Two panels, A and B, plotting y against x.] Figure by MIT OCW.

Linear Regression Statistics: Model Fit
The Durbin-Watson statistic tests for serial correlation among residuals. The test value ranges from 0 to 4: values close to 0 indicate positive correlation, values close to 4 indicate negative correlation, and values between 1.5 and 2.5 indicate no correlation.

Model Summary (b)
Model 1: R = .991 (a), R Square = .982, Adjusted R Square = .977, Std. Error of the Estimate = 46.801, Durbin-Watson = 2.079
a. Predictors: (Constant), SizeSquared, HomeSize
b. Dependent Variable: EnergyUse
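The Durbin-Watson statistic itself is simple to compute from the residual series: d = sum((e_t - e_{t-1})^2) / sum(e_t^2). A minimal sketch (the residual values below are made up for illustration, not from the energy model):

```python
def durbin_watson(resid):
    """d = sum of squared successive differences over sum of squared residuals."""
    num = sum((a - b) ** 2 for a, b in zip(resid[1:], resid[:-1]))
    den = sum(e ** 2 for e in resid)
    return num / den

# Alternating residuals imply negative serial correlation, so d is near 4.
print(round(durbin_watson([1, -1, 1, -1, 1, -1]), 2))  # 3.33
# Long runs of same-sign residuals imply positive correlation, so d is near 0.
print(round(durbin_watson([1, 1, 1, -1, -1, -1]), 2))  # 0.67
```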

Model Utility
R² = SSR/SST

F test:
F = (R²/k) / [(1 - R²)/(n - (k+1))]
where n = the number of observations and k = the number of independent (predictor) variables.

The F test tests the global utility of the model, i.e., that at least one of the coefficients is nonzero. Find the critical value Fα in a table with k df in the numerator and n - (k+1) df in the denominator. The rejection region is F > Fα.
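Plugging the ANOVA numbers from the quadratic home-energy model (n = 10, k = 2) into the formula shows two equivalent ways of getting F; a small sketch:

```python
n, k = 10, 2
ssr, sse = 831069.5, 15332.554   # from the ANOVA table below
sst = ssr + sse
r2 = ssr / sst                   # R-squared = SSR/SST

# F from the ANOVA mean squares: MSR / MSE
f_from_anova = (ssr / k) / (sse / (n - (k + 1)))
# F from R-squared, as in the formula above; algebraically identical
f_from_r2 = (r2 / k) / ((1 - r2) / (n - (k + 1)))

print(round(f_from_anova, 2))  # 189.71, matching the SPSS output
```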

Model Summary (b)
Model 1: R = .991 (a), R Square = .982, Adjusted R Square = .977, Std. Error of the Estimate = 46.801
a. Predictors: (Constant), SizeSquared, HomeSize
b. Dependent Variable: EnergyUse
R² = SSR/SST, or 1 - (SSE/SST)
Adjusted R² = 1 - [(SSE/(n - (k+1))) / (SST/(n - 1))]

ANOVA (b)
Model 1       Sum of Squares   df   Mean Square   F        Sig.
Regression    831069.5          2   415534.773    189.710  .0001 (a)
Residual       15332.554        7     2190.365
Total         846402.1          9
a. Predictors: (Constant), SizeSquared, HomeSize
b. Dependent Variable: EnergyUse
S² = SSE/(n - (k+1)), sometimes called MSE. F = (R²/k) / [(1 - R²)/(n - (k+1))], where k = the number of X variables.

Coefficients (a)
Model 1       B               Std. Error     Beta     t        Sig.
(Constant)    -1216.1438870   242.80636850            -5.009   .00155
HomeSize          2.39893018     .24583560   4.049     9.758   .00003
SizeSquared       -.00045004     .00005908  -3.161    -7.618   .00012
a. Dependent Variable: EnergyUse
Model: y = β0 + β1x + β2x² + ε

For the home-energy example, the critical value of F was 4.74 with 2 df in the numerator and 7 df in the denominator. Since the computed F = 189.71 exceeds 4.74, we reject H0 and conclude that at least one of the model coefficients β1 and β2 is nonzero.

Simple model: y = β0 + β1x + ε

Model Summary (b)
Model 1: R = .912 (a), R Square = .832, Adjusted R Square = .811, Std. Error of the Estimate = 133.438
a. Predictors: (Constant), HomeSize
b. Dependent Variable: EnergyUse

ANOVA (b)
Model 1       Sum of Squares   df   Mean Square   F       Sig.
Regression    703957.2          1   703957.183    39.536  .000 (a)
Residual      142444.9          8    17805.615
Total         846402.1          9
a. Predictors: (Constant), HomeSize
b. Dependent Variable: EnergyUse

Coefficients (a)
Model 1       B         Std. Error   Beta    t       Sig.
(Constant)    578.928   166.968              3.467   .008
HomeSize         .540      .086      .912    6.288   .000
a. Dependent Variable: EnergyUse