Multiple Regression


Economics 130, Lecture 6. Agenda: Midterm Review; Next Steps for the Class; Multiple Regression Review & Issues; Model Specification Issues; Launching the Projects!

Midterm results: AVG = 26.5 (88%). A = 27+, B = 24-26, C = 21-23...

Course schedule:
- Week 8 (10/22): Multiple Regression, model specification review; Collinearity
- Week 9 (10/29): Nonlinear Relationships (logs, etc.); Dummy (Indicator) Variables
- Week 10 (11/5): Heteroskedasticity; Midterm Review
- Week 11 (11/12): 2nd MIDTERM

PROJECTS!
- Week 8 (10/22): Formalize the teams; review how to do a project
- Week 9 (10/29): TOPICS/DATA due 10/28 (give to SS); Gretl practice
- Week 10 (11/5): Research check-in 11/4 (how's it going?); more practice; work on projects
- Week 11 (11/12): 2nd MIDTERM
- Week 12 (11/19): Focus on projects; data & first regressions

Model specification: Omitted Variables; Irrelevant Variables

Model Specification. Choosing the independent and dependent variables in an econometric model is a product of: economic theory, knowledge of the underlying behavior, and simple experience.

Remember, we can NEVER know the true relationship among econometric variables. Therefore, we can expect some specification errors.

Sources of specification errors: choice of variables; functional forms (non-linear relationships); structure of the error terms (the e's).

Choice of variables: two potential problems. (1) Omitting important variables; (2) including irrelevant variables.

Omitting variables that matter is a serious problem.

Remember the assumptions that made OLS BLUE: the estimator is unbiased and consistent. These properties are lost with omitted variables. Your estimates may be neither unbiased nor consistent.

Consider unbiasedness. Suppose the true model is y_i = b1 + b2 x2i + b3 x3i + e_i, but you estimate the following model, which omits x3: y_i = b1 + b2 x2i + v_i.

Now v_i = b3 x3i + e_i, so E(v_i) = b3 x3i, which is not zero. The error in the short model no longer has zero mean, and the OLS estimate of b2 is biased.

What about consistency? Consistency is the property that estimates converge to the true values as the sample size increases indefinitely. As with unbiasedness, if our first four assumptions hold (especially #4, which implies the x's and e's are uncorrelated), then the OLS estimators are consistent.

If v_i = b3 x3i + e_i, then Cov(x2, v_i) = Cov(x2, b3 x3i + e_i) = b3 Cov(x2, x3). Unless the x's are completely uncorrelated, this covariance will NOT equal 0. That violates Assumption 4, so the OLS estimate of b2 is no longer consistent.
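To make the bias concrete, here is a minimal simulation sketch (Python with numpy; the coefficients and correlation are invented for illustration, not from the lecture). x3 is correlated with x2 but omitted from the fitted model, so the estimate of b2 absorbs part of b3's effect:

import numpy as np

rng = np.random.default_rng(0)
n = 10000
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)    # x3 correlated with x2
e = rng.normal(size=n)
y = 1.0 + 2.0 * x2 + 3.0 * x3 + e     # true model: b2 = 2, b3 = 3

# Full model recovers b2 = 2 (approximately)
X_full = np.column_stack([np.ones(n), x2, x3])
print(np.linalg.lstsq(X_full, y, rcond=None)[0][1])

# Omitting x3 biases b2 by roughly b3 * Cov(x2,x3)/Var(x2) = 3 * 0.8 = 2.4
X_short = np.column_stack([np.ones(n), x2])
print(np.linalg.lstsq(X_short, y, rcond=None)[0][1])   # about 4.4, not 2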

What does this mean in practical terms? Omitting a variable (that matters) corrupts: (1) the interpretation of causal effects; (2) the interpretation of magnitudes. Example of (1): regression of the percent of 10th graders passing a standardized math test on the percent enrolled in the school lunch program. Data: 408 observations on Michigan high schools in 1993 (Wooldridge, meap93). Gretl output:

Model 3: OLS, using observations 1-408
Dependent variable: math10

             coefficient   std. error   t-ratio   p-value
  --------------------------------------------------------
  const       32.1427      0.997582     32.22     6.27e-114 ***
  lnchprg     -0.318864    0.0348393    -9.152    2.75e-18  ***

Mean dependent var  24.10686    S.D. dependent var  10.49361
Sum squared resid   37151.91    S.E. of regression  9.565938
R-squared           0.171034    Adjusted R-squared  0.168992
F(1, 406)           83.76683    P-value(F)          2.75e-18
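For anyone working outside Gretl, here is a sketch of the same regression in Python. It assumes the third-party wooldridge package, which bundles the textbook datasets (an assumption about your setup, not part of the lecture):

import wooldridge                     # pip install wooldridge (assumed available)
import statsmodels.formula.api as smf

df = wooldridge.data('meap93')        # 408 Michigan high schools, 1993
fit = smf.ols('math10 ~ lnchprg', data=df).fit()
print(fit.summary())                  # coefficients should match the Gretl output above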

So the lunch program lowers math performance? The estimate says a 10 percentage point increase in lunch program participation lowers the pass rate by about 3.2 percentage points. Do you believe this conclusion? What about omitted variables? How do you expect them to affect the coefficient estimate on school lunch?

Problem: omitted variables. School lunch program participation is correlated with (proxies for) other RELEVANT variables, such as family income, parental educational achievement, and school quality. Lower family incomes and lower parental educational achievement may impair student performance and also promote school lunch participation. So the omitted effects produce a negative correlation (but NOT a causal effect) between school lunch and math scores.

Let's look at another example: housing starts (000s), GNP ($ billions), and interest rates (%).

HOUSING = 687.898 + 0.905 GNP - 169.658 INTRATE
          (1.80)    (3.64)      (-3.87)
Adjusted R² = .375    F(2, 20) = 7.609

HOUSING = 1442.898 + 0.058 GNP
          (3.89)     (0.38)
Adjusted R² = -.04    F(1, 21) = .144

(t-statistics in parentheses)

Upshot: If you think there might be an important omitted variable (i.e., one that has a non-zero coefficient in the true model) but don't have data on it, then you need to worry about its likely correlation with the variables of interest. (Of course, if you think, and can argue convincingly, that the omitted variable is uncorrelated with the included variables, then you are off the hook!)

What about including irrelevant variables?

Including irrelevant variables can inflate the variances of the coefficient estimates on the relevant variables, thereby reducing the precision of the estimated coefficients. The estimator is no longer best in the Gauss-Markov sense: the least squares estimator of the correctly specified model is the minimum variance linear unbiased estimator ("best").

Let's return to the housing model:

HOUSING = 687.898 + 0.905 GNP - 169.658 INTRATE
          (1.80)    (3.64)      (-3.87)

Now let's run a new regression using some additional variables:

HOUSING = 5087.43 + 1.756 GNP - 174.69 INTRATE - 33.43 POP + 79.72 UNEMPL
          (0.46)    (0.82)      (-2.86)         (-0.40)      (0.65)

Model 4: OLS, using observations 1963-1985 (T = 23)
Dependent variable: housing

            coefficient   std. error   t-ratio   p-value
  -------------------------------------------------------
  const      687.898      382.682       1.798    0.0874  *
  gnp          0.905395     0.248978    3.636    0.0016  ***
  intrate   -169.658       43.8383     -3.870    0.0010  ***

Model 5: OLS, using observations 1963-1985 (T = 23)
Dependent variable: housing

            coefficient   std. error    t-ratio   p-value
  --------------------------------------------------------
  const     5087.43       11045.0        0.4606   0.65062
  gnp          1.75635        2.13998    0.8207   0.42254
  intrate   -174.692        61.0007     -2.8638   0.01032 **
  pop        -33.4337       83.0756     -0.4024   0.69209
  unemp       79.7199      122.579       0.6504   0.52368

Let's do an F test. Unrestricted model: the HOUSING model with GNP, INTRATE, POP, and UNEMPL (k = 5). Restricted model: the HOUSING model with only GNP and INTRATE (m = 3).

H0: b3 = b4 = 0
H1: at least one of the two coefficients is not zero

Values: R² restricted = .4321; R² unrestricted = .4499; k = 5; m = 3; J (# of restrictions) = 5 - 3 = 2; n = 23; n - k = 18

Recall our equation for the F statistic:

F(J, n-k) = [(ESS_R - ESS_U)/J] / [ESS_U/(n-k)] = [(R²_U - R²_R)/J] / [(1 - R²_U)/(n-k)]

F(2, 18) = [(.4499 - .4321)/2] / [(1 - .4499)/18] = .0089/.0306 = .29

The critical value is F*(2, 18) = 3.55 > .29.
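The same arithmetic as a quick check in Python (scipy is used only to look up the critical value):

from scipy.stats import f

R2_U, R2_R = 0.4499, 0.4321    # unrestricted and restricted R-squared
J, n, k = 2, 23, 5             # restrictions, observations, parameters

F = ((R2_U - R2_R) / J) / ((1 - R2_U) / (n - k))
print(round(F, 2))                        # 0.29
print(round(f.ppf(0.95, J, n - k), 2))    # critical value F*(2, 18) = 3.55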

Therefore, we CANNOT REJECT the null hypothesis that the regression coefficients for POP and UNEMPL are zero. This is consistent with these being irrelevant variables.

1. Choose variables and a functional form on the basis of your theoretical and general understanding of the relationship. Think long and hard about what kinds of things may affect your dependent variable and try to include measures of these factors.

2. If an estimated equation has coefficients with unexpected signs or unrealistic magnitudes, this could be caused by a misspecification such as the omission of an important variable. Again, think about what's going on. Try to explain your results.

3. One method for assessing whether a variable or a group of variables should be included in an equation is to perform significance tests: t-tests for hypotheses such as H0: bj = 0, or F-tests for joint hypotheses such as H0: b3 = b4 = 0.

A related criterion: choose the model with the better fit (adjusted R²).

However, if a variable logically belongs in your model and has an insignificant coefficient, this does not mean it should be dropped. Your data may not be sufficiently rich (or precise) to measure the variable's effect. Including the variable controls for the logical effect. On the other hand, if the logic for inclusion is weak and the variable is insignificant, then you have a case for dropping it.

Remember, it's an art, not a science.

Collinearity (Multicollinearity)

Readings for This Week. Text: CH 6; CH 2, 2.8-2.9; CH 4, 4.4-4.6; CH 5, 5.6-5.8; CH 7.

We continue with our second issue and add in how we evaluate these relationships: Where do we get data to do this analysis? How do we create the model relating the data? How do we relate the data to one another? How do we evaluate these relationships?

Multicollinearity. Intuition: If explanatory variables are highly correlated with one another, then the regression model has trouble telling which individual variable is explaining Y. Symptom: Individual coefficients may look insignificant, but the regression as a whole may look significant (e.g., R² big, F-stat big).

Example: Y = exchange rate. Explanatory variable(s) = interest rate: X1 = bank prime rate, X2 = Treasury bill rate. Using both X1 and X2 will probably cause a multicollinearity problem. Solution: Include either X1 or X2, but not both. In some cases this solution will be unsatisfactory if it forces you to drop explanatory variables that economic theory says should be there.

Definitions of multicollinearity. Perfect multicollinearity: when one independent variable is an exact linear function of another, x_j = a1 + a2 x_m. Imperfect multicollinearity: when one variable is highly correlated (negatively or positively) with another variable. Remember: correlation is a measure of linear association; r = 1 means perfect positive collinearity, r = -1 means perfect negative collinearity.

If two included variables are highly correlated, then the coefficient estimates for both will be very imprecise. Why? Suppose two variables are perfectly collinear. Then there is only one independent linear effect for the two variables; i.e., you cannot estimate (identify) two effects, only one.
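A tiny numeric illustration of this identification failure (Python; the numbers are invented): when one regressor is an exact linear function of another, the design matrix loses rank and OLS has no unique solution.

import numpy as np

x1 = np.arange(5.0)
x2 = 2 * x1 + 1                          # x2 is an exact linear function of x1
X = np.column_stack([np.ones(5), x1, x2])
print(np.linalg.matrix_rank(X))          # 2, not 3: X'X is singular, so the
                                         # two slopes cannot be separately estimated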

The consequences of multicollinearity:

(A) High correlation in the x's does not violate the GM assumptions. Therefore, parameter estimates are unbiased.

(B) If your model is right and you have multicollinearity, you will have:
1. High variances of coefficient estimates
2. Low t-values
3. The (erroneous) conclusion that coefficients are not significantly different from zero
4. But a relatively high R² (the variables jointly explain a lot) and significant F-stats for joint tests of zero coefficients (again, the variables are jointly significant)
5. Because the overall model is not much affected, you can still use it for prediction
6. Because the estimates are imprecise, they are sensitive to changes in model specification, such as dropping or adding a variable or changing the functional form
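A small simulation sketch of symptoms 2-4 (Python with statsmodels; invented numbers): two nearly collinear regressors typically give small individual t-ratios even though R² is high and the joint F test rejects.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
y = 1 + x1 + x2 + rng.normal(size=n)  # both variables matter

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.tvalues[1:])   # individual t-ratios are typically small
print(fit.rsquared)      # yet R-squared is high
print(fit.f_pvalue)      # and the joint F test strongly rejects b1 = b2 = 0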

Dataset: POE cars. MPG = miles per gallon; CYL = number of cylinders; ENG = engine displacement in cubic inches; WGT = vehicle weight in pounds. Question: How is MPG related to vehicle design? Expect more powerful cars (more cylinders, greater engine displacement) and larger cars (more weight) to have lower fuel economy. Problem: CYL and ENG are highly (positively) correlated.

Estimate the model to obtain:

Model 1: OLS, using observations 1-392
Dependent variable: mpg

            coefficient    std. error    t-ratio   p-value
  ----------------------------------------------------------
  const     44.3710        1.48069       29.97     5.32e-103 ***
  cyl       -0.267797      0.413067      -0.6483   0.5172
  eng       -0.0126740     0.00825007    -1.536    0.1253
  wgt       -0.00570788    0.00071392    -7.995    1.50e-14  ***

Mean dependent var  23.44592    S.D. dependent var  7.805007
Sum squared resid   7162.549    S.E. of regression  4.296531
R-squared           0.699293    Adjusted R-squared  0.696967
F(3, 388)           300.7635    P-value(F)          7.6e-101

Questions: Is the coefficient on CYL (b2) significant? Is the coefficient on ENG (b3) significant?

Can you reject H0: β2 = 0 vs H1: β2 ≠ 0 (check the lowest level) at:
A. Yes (at α = .01)
B. Yes (at α = .05)
C. Yes (at α = .10)
D. No (cannot reject the null at any of these α)

Can you reject H0: β3 = 0 vs H1: β3 ≠ 0 (check the lowest level) at:
A. Yes (at α = .01)
B. Yes (at α = .05)
C. Yes (at α = .10)
D. No (cannot reject the null at any of these α)

Now suppose we exclude ENG. What do we get?

Model 1: OLS, using observations 1-392
Dependent variable: mpg

            coefficient    std. error     t-ratio   p-value
  -----------------------------------------------------------
  const     46.2923        0.793969       58.30     2.31e-194 ***
  cyl       -0.721378      0.289378       -2.493    0.0131    **
  wgt       -0.006347      0.000581133    -10.92    2.11e-24  ***

CYL is now significant! Its earlier insignificance was due to the correlation between the two measures of engine size.

Now let's test the restriction that both CYL and ENG have zero coefficients (b2 = b3 = 0 in the first model):

Restrictions:
  1: b[cyl] = 0
  2: b[eng] = 0
Test statistic: F(2, 388) = 4.29802, with p-value = 0.0142485

Do you reject the null? At what α?
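The same joint test in Python, as a sketch; the file name cars.csv is a placeholder for however you have stored the POE cars data (an assumption, not part of the lecture):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('cars.csv')           # assumed: columns mpg, cyl, eng, wgt
fit = smf.ols('mpg ~ cyl + eng + wgt', data=df).fit()

# Joint test of H0: b[cyl] = b[eng] = 0
print(fit.f_test('cyl = 0, eng = 0'))  # should give F(2, 388) = 4.30, p = 0.014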

Are both β2 and β3 equal to zero? (check the lowest level)
A. No (at α = .01)
B. No (at α = .05)
C. No (at α = .10)
D. Yes (cannot reject the null at any α)
Test statistic: F(2, 388) = 4.29802, with p-value = 0.0142485

Identifying Multicollinearity
1. Basic rule: Don't worry about it unless you have a problem! When do you have a problem? When you are surprised that a key variable is insignificant. In that case, you should investigate for symptoms of multicollinearity:
2. Look for a high R² with low t-statistics.
3. Examine pairwise correlations.
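Steps 2 and 3 are easy to automate. A sketch in Python, reusing the cars data frame from the previous snippet; the variance inflation factor VIF_j = 1/(1 - R_j²) is a standard companion diagnostic, though the lecture itself stops at pairwise correlations:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

print(df[['cyl', 'eng', 'wgt']].corr())   # pairwise correlations

X = sm.add_constant(df[['cyl', 'eng', 'wgt']])
for i, name in enumerate(X.columns[1:], start=1):
    print(name, variance_inflation_factor(X.values, i))   # above ~10 is a common red flag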

Mitigating Multicollinearity
1. Obtain more information (data) and include it in the analysis. This is often costly, and it doesn't help if the underlying variables are highly correlated, no matter how much data you have.
2. Drop variables.
   Pro: You can try to proxy for the single effect that matters. In the MPG example, use CYL OR ENG to capture engine size.
   Con: If the dropped variables are important (relevant), estimates will be biased (omitted variable bias).
3. Do nothing: the fix may be more costly than the problem.
4. Reformulate the model.