Economics 130, Lecture 6: Midterm Review; Next Steps for the Class; Multiple Regression Review & Issues; Model Specification Issues; Launching the Projects!!!!!
Midterm results: AVG = 26.5 (88%). A = 27+, B = 24-26, C = 21-23...
Week 8 (10/22): Multiple Regression, model specification review; Collinearity
Week 9 (10/29): Nonlinear Relationships (logs, etc.); Dummy (Indicator) Variables
Week 10 (11/5): Heteroskedasticity; Midterm Review
Week 11 (11/12): 2nd MIDTERM
PROJECTS!!!!!!!!!!
Week 8 (10/22): Formalize the teams; review how to do a project
Week 9 (10/29): TOPICS/DATA due 10/28 (give to SS); Gretl practice
Week 10 (11/5): Research update due 11/4 (how's it going?); more practice; work on projects
Week 11 (11/12): 2nd MIDTERM
Week 12 (11/19): Focus on projects; data & first regressions
Model specification Omitted Variables Irrelevant Variables
Model Specification
Choosing the independent and dependent variables in an econometric model is a product of: economic theory, knowledge of the underlying behavior, and simple experience.
Remember, we can NEVER know the true relationship among econometric variables Therefore, we can expect some specification errors
Sources of specification errors: Choice of variables Functional forms (non-linear relationships) Structure of the error terms (e's)
Choice of variables: Two potential problems Omitting important variables Including irrelevant variables
Omitting variables that matter is the more serious of the two problems.
Remember the assumptions that made OLS BLUE; they gave us estimators that are unbiased and consistent. These properties are lost with omitted variables: your estimates may be neither unbiased nor consistent.
Consider unbiasedness. Suppose the true model is:

y_i = b_1 + b_2 x_2i + b_3 x_3i + e_i

Suppose you estimate the following without x_3:

y_i = b_1 + b_2 x_2i + v_i

Now v_i = b_3 x_3i + e_i. Therefore E(v_i) = b_3 x_3i, which is not zero, so the error term violates the zero-mean assumption and the estimator is biased.
What about consistency? Consistency is the property that estimates converge to the true values as the sample size is increased indefinitely. As with unbiasedness, if our first four assumptions hold (especially #4, which implies the X's and e's are uncorrelated), then the OLS estimators are consistent.
If v_i = b_3 x_3 + e_i, then:

Cov(x_2, v_i) = Cov(x_2, b_3 x_3 + e_i) = b_3 Cov(x_2, x_3)

Unless the x's are completely uncorrelated, this covariance will NOT equal 0. That violates Assumption 4, so the estimator of b_2 is no longer consistent.
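The algebra above can be checked numerically. Below is a minimal Python/NumPy simulation, not from the lecture: all data and coefficient values are made up for illustration. We generate data from a true model in which x2 and x3 are correlated, omit x3 from the regression, and watch the slope on x2 absorb part of b3's effect, exactly as the covariance formula predicts.

```python
import numpy as np

# Made-up simulation of omitted-variable bias (not the lecture's data).
# True model: y = b1 + b2*x2 + b3*x3 + e, with x2 and x3 correlated.
rng = np.random.default_rng(0)
n = 100_000
x3 = rng.normal(size=n)
x2 = 0.8 * x3 + rng.normal(size=n)     # x2 is correlated with x3
e = rng.normal(size=n)
b1, b2, b3 = 1.0, 2.0, 3.0
y = b1 + b2 * x2 + b3 * x3 + e

# "Short" regression omitting x3: the OLS slope is Cov(x2, y) / Var(x2).
b2_short = np.cov(x2, y)[0, 1] / np.var(x2, ddof=1)

# Theoretical contamination: b3 * Cov(x2, x3) / Var(x2) gets added to b2.
bias = b3 * np.cov(x2, x3)[0, 1] / np.var(x2, ddof=1)
print(b2_short)       # roughly 3.46, far from the true b2 = 2.0
print(b2 + bias)      # matches the short-regression slope
```

The short-regression slope lands near b2 + b3·Cov(x2,x3)/Var(x2), not near the true b2, which is the bias the derivation above describes.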
What does this mean in practical terms? Omitting a variable (that matters) corrupts: (1) interpretation of causal effects; (2) interpretation of magnitudes. Example of (1): a regression of the percent of 10th graders passing a standardized math test on the percent in the school lunch program. Data: 408 observations on Michigan high schools in 1993 (Wooldridge, meap93). Gretl output:
Model 3: OLS, using observations 1-408
Dependent variable: math10

             coefficient   std. error   t-ratio   p-value
  ---------------------------------------------------------
  const       32.1427      0.997582     32.22     6.27e-114 ***
  lnchprg     -0.318864    0.0348393    -9.152    2.75e-018 ***

Mean dependent var   24.10686   S.D. dependent var   10.49361
Sum squared resid    37151.91   S.E. of regression   9.565938
R-squared            0.171034   Adjusted R-squared   0.168992
F(1, 406)            83.76683   P-value(F)           2.75e-18
So the lunch program lowers math performance? This says a 10 percentage point increase in lunch program participation lowers the pass rate by about 3.2 percentage points. Do you believe this conclusion? What about omitted variables? How do you expect these omitted variables to affect the coefficient estimate on school lunch?
Problem: omitted variables. School lunch program participation is correlated with (proxies for) other RELEVANT variables, such as family income, parental educational achievement, school quality, etc. How do you expect these omitted variables to affect the coefficient estimate on school lunch? Lower family incomes and lower parental educational achievement may impair student performance and also promote school lunch participation. So omitted effects lead to a negative correlation (but NOT a causal effect) between school lunch and math scores.
Let's look at another example: housing starts (thousands), GNP ($ billions) & interest rates (%).

HOUSING = 687.898 + 0.905 GNP - 169.658 INTRATE
          (1.80)    (3.64)     (-3.87)
Adjusted R² = .375    F(2, 20) = 7.609

HOUSING = 1442.898 + 0.058 GNP
          (3.89)     (0.38)
Adjusted R² = -.04    F(1, 21) = .144
Upshot: If you think there might be an important omitted variable (i.e., one that has a non-zero coefficient in the true model), but you don't have data on this variable, then you need to worry about its likely correlation with the variables of interest. (Of course, if you think, and can argue convincingly, that the omitted variable is uncorrelated with the included variables, then you are off the hook!)
What about including irrelevant variables?
Including irrelevant variables can inflate the variances of coefficient estimates on relevant variables, thereby reducing the precision of estimated coefficients. The resulting estimator is NOT the Gauss-Markov best: the least squares estimator of the correct model is the minimum variance linear unbiased estimator (best).
Let's return to the housing model:

HOUSING = 687.898 + 0.905 GNP - 169.658 INTRATE
          (1.80)    (3.64)     (-3.87)
Now let's run a new regression using some additional variables.

HOUSING = 5087.43 + 1.756 GNP - 174.69 INTRATE - 33.43 POP + 79.72 UNEMPL
          (.46)     (.82)      (-2.86)          (-.40)      (.65)
Model 4: OLS, using observations 1963-1985 (T = 23)
Dependent variable: housing

            coefficient   std. error   t-ratio   p-value
  --------------------------------------------------------
  const      687.898      382.682      1.798     0.0874  *
  gnp        0.905395     0.248978     3.636     0.0016  ***
  intrate    -169.658     43.8383      -3.870    0.0010  ***

Model 5: OLS, using observations 1963-1985 (T = 23)
Dependent variable: housing

            coefficient   std. error   t-ratio   p-value
  --------------------------------------------------------
  const      5087.43      11045.0      0.4606    0.65062
  gnp        1.75635      2.13998      0.8207    0.42254
  intrate    -174.692     61.0007      -2.8638   0.01032 **
  pop        -33.4337     83.0756      -0.4024   0.69209
  unemp      79.7199      122.579      0.6504    0.52368
Let's do an F test.
Unrestricted model: HOUSING model with POP and UNEMPL (k = 5)
Restricted model: HOUSING model with only GNP and INTRATE (m = 3)
H0: b3 = b4 = 0
H1: at least one of these coefficients does not equal zero
Values: R²(restricted) = .4321, R²(unrestricted) = .4499, k = 5, m = 3, J (# of restrictions) = 5 - 3 = 2, n = 23; n - k = 18
Recall our equation for the F statistic:

F(J, n-k) = [(ESS_R - ESS_U)/J] / [ESS_U/(n-k)] = [(R²_U - R²_R)/J] / [(1 - R²_U)/(n-k)]

F(2, 18) = [(.4499 - .4321)/2] / [(1 - .4499)/18] = .0089/.0306 = .29

F*(2, 18) = 3.55 > .29.
Therefore, we CANNOT REJECT the null hypothesis that the regression coefficients for POP and UNEMP are zero. This is consistent with these being irrelevant variables.
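For reference, this F statistic can be reproduced in a few lines of plain Python from the R² values on the slides (a quick check, not part of the lecture materials):

```python
# Reproducing the housing-model F test from the R-squared form of the statistic.
R2_U, R2_R = 0.4499, 0.4321   # unrestricted / restricted R-squared
n, k, J = 23, 5, 2            # obs, params in unrestricted model, restrictions

F = ((R2_U - R2_R) / J) / ((1 - R2_U) / (n - k))
print(round(F, 2))            # 0.29, well below the critical value F*(2,18) = 3.55
```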
1. Choose variables and a functional form on the basis of your theoretical and general understanding of the relationship. Think long and hard about what kinds of things may affect your dependent variable and try to include measures of these factors.
2. If an estimated equation has coefficients with unexpected signs or unrealistic magnitudes, these could be caused by a misspecification such as the omission of an important variable. Again, think about what's going on. Try to explain your results.
3. One method for assessing whether a variable or a group of variables should be included in an equation is to perform significance tests: t-tests for hypotheses such as H0: b_j = 0, or F-tests for joint hypotheses such as H0: b_3 = b_4 = 0.
A related criterion: Choose model with better fit (adjusted R 2 )
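The adjusted R² criterion uses the formula R²_adj = 1 - (1 - R²)(n - 1)/(n - k), which penalizes added regressors. A quick Python check against the housing models above (n = 23): the GNP + INTRATE model reproduces the .375 reported earlier, and adding POP and UNEMPL actually lowers the adjusted R² even though the raw R² rises.

```python
# Adjusted R-squared: R2_adj = 1 - (1 - R2) * (n - 1) / (n - k).
# Checked against the housing models (n = 23) reported on the slides.
def adj_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k)

print(round(adj_r2(0.4321, 23, 3), 3))  # 0.375: matches the GNP+INTRATE model
print(round(adj_r2(0.4499, 23, 5), 3))  # 0.328: adding POP and UNEMPL lowers it
```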
However, if a variable logically belongs in your model and has an insignificant coefficient, this does not mean it should be dropped. Your data may not be sufficiently rich (or precise) to measure the variable's effect, and including the variable controls for the logical effect. On the other hand, if the logic for inclusion is weak and the variable is insignificant, then you have a case for dropping it.
Remember, it's an art, not a science.
Collinearity (Multicollinearity)
Readings for This Week Text: CH 6 CH 2, 2.8 2.9 CH 4, 4.4 4.6 CH 5, 5.6 5.8, CH 7
We continue by addressing our second issue and add in how we evaluate these relationships: Where do we get data to do this analysis? How do we create the model relating the data? How do we relate data to one another? How do we evaluate these relationships?
Multicollinearity Intuition: If explanatory variables are highly correlated with one another, then the regression model has trouble telling which individual variable is explaining Y. Symptom: Individual coefficients may look insignificant, but the regression as a whole may look significant (e.g. R² big, F-stat big).

Example: Y = exchange rate; explanatory variable(s) = interest rates: X1 = bank prime rate, X2 = Treasury bill rate. Using both X1 and X2 will probably cause a multicollinearity problem. Solution: Include either X1 or X2, but not both. In some cases this solution will be unsatisfactory if it forces you to drop explanatory variables which economic theory says should be there.
Definitions of multicollinearity:
Perfect multicollinearity: when one independent variable is a linear function of another, x_j = α1 + α2 x_m
Imperfect multicollinearity: when one variable is highly correlated (negatively or positively) with another variable
Remember: correlation is a measure of linear association. r = 1 means perfect positive collinearity; r = -1 means perfect negative collinearity.
If two included variables are highly correlated, then the coefficient estimates for both will be very imprecise. Why? Suppose two variables are perfectly collinear. Then there is only one independent linear effect for the two variables; i.e., you cannot estimate (identify) two effects, only one.
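This identification failure shows up directly in the OLS algebra: with perfectly collinear regressors, X'X is singular and cannot be inverted, so there is no unique coefficient vector. A small Python/NumPy illustration with made-up data:

```python
import numpy as np

# Toy illustration (made-up data): with perfectly collinear regressors the
# X'X matrix is singular, so OLS cannot separate the two effects.
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = 3.0 + 2.0 * x1                        # exact linear function of x1
X = np.column_stack([np.ones(n), x1, x2])  # [const, x1, x2]

rank = np.linalg.matrix_rank(X.T @ X)
print(rank)   # 2, not 3: only one independent linear effect is identified
```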
The consequences of multicollinearity:
(A) High correlation in the x's does not violate the GM assumptions. Therefore, parameter estimates are unbiased.
(B) If your model is right and you have multicollinearity, you will have:
1. High variances of coefficient estimates
2. Low t values
3. The (erroneous) conclusion that coefficients are not significantly different from zero
4. But relatively high R² (variables jointly explain a lot) and significant F stats for joint tests of zero coefficients (again, variables are jointly significant)
5. Because the overall model is not largely affected, you can still use it for prediction
6. Because estimates are imprecise, they are sensitive to changes in model specification, such as dropping or adding a variable or changing functional form
Dataset: POE cars MPG = miles per gallon CYL = number of cylinders ENG = engine displacement in cubic inches WGT = vehicle weight in pounds Question: How is MPG related to vehicle design? Expect more powerful cars (more cylinders, greater engine displacement) and larger cars (more weight) to have lower fuel economy. Problem: CYL and ENG are highly (positively) correlated.
Estimate the model to obtain:

Model 1: OLS, using observations 1-392
Dependent variable: mpg

           coefficient    std. error    t-ratio   p-value
  ----------------------------------------------------------
  const     44.3710       1.48069       29.97     5.32e-103 ***
  cyl       -0.267797     0.413067      -0.6483   0.5172
  eng       -0.0126740    0.00825007    -1.536    0.1253
  wgt       -0.00570788   0.00071392    -7.995    1.50e-014 ***

Mean dependent var   23.44592   S.D. dependent var   7.805007
Sum squared resid    7162.549   S.E. of regression   4.296531
R-squared            0.699293   Adjusted R-squared   0.696967
F(3, 388)            300.7635   P-value(F)           7.6e-101
Questions: Is coefficient on CYL (b 2 ) significant? Is coefficient on ENG (b 3 ) significant?
Can you reject the H 0 : β 2 =0 vs H 1 :β 2 0 (check lowest level) at: A. Yes (at α=.01) B. Yes (at α=.05) C. Yes (at α=.10) D. No (cannot reject null at any of these α)
A. Yes (at α=.01)
B. Yes (at α=.05)
C. Yes (at α=.10)
D. No (cannot reject null at any of these α)

Answer: D. The p-value on CYL is 0.5172, so we cannot reject H0 at any of these levels.
Can you reject the H 0 : β 3 =0 vs H 1 :β 3 0 (check lowest level) at: A. Yes (at α=.01) B. Yes (at α=.05) C. Yes (at α=.10) D. No (cannot reject null at any of these α)
A. Yes (at α=.01)
B. Yes (at α=.05)
C. Yes (at α=.10)
D. No (cannot reject null at any of these α)

Answer: D. The p-value on ENG is 0.1253, larger than .10, so we cannot reject H0 at any of these levels.
Now suppose we exclude ENG. What do we get?

Model 1: OLS, using observations 1-392
Dependent variable: mpg

           coefficient   std. error    t-ratio   p-value
  ---------------------------------------------------------
  const     46.2923      0.793969      58.30     2.31e-194 ***
  cyl       -0.721378    0.289378      -2.493    0.0131    **
  wgt       -0.006347    0.000581133   -10.92    2.11e-024 ***

CYL is now significant! Its earlier insignificance was due to the correlation between two measures of engine size.
Now let's do a test of the restriction that both CYL and ENG have zero coefficients (b2 = b3 = 0 in the first model):

Restriction:
 1: b[cyl] = 0
 2: b[eng] = 0
Test statistic: F(2, 388) = 4.29802, with p-value = 0.0142485

Do you reject the null? At what α?
Are both β3 and β2 equal to zero? (check lowest level)
A. No (at α=.01)
B. No (at α=.05)
C. No (at α=.10)
D. Yes (cannot reject null at any α)

Test statistic: F(2, 388) = 4.29802, with p-value = 0.0142485

Answer: B. With p = .0142 we reject the null at α = .05 (but not at α = .01), so the variables are jointly significant.
Identifying Multicollinearity
1. Basic rule: don't worry about it unless you have a problem! When do you have a problem? When you are surprised that a key variable is insignificant. THEN you should probably investigate for symptoms of multicollinearity:
2. Look for a high R² with low t-statistics.
3. Examine pairwise correlations.
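The pairwise-correlation check can be sketched in Python. The series below are simulated stand-ins for CYL and ENG (the actual dataset is not reproduced here); the variance inflation factor 1/(1 - r²) shows how a high pairwise correlation blows up coefficient variances:

```python
import numpy as np

# Sketch of the pairwise-correlation check.  The series are simulated
# stand-ins for CYL and ENG, not the real data.
rng = np.random.default_rng(2)
n = 392
cyl = rng.normal(size=n)
eng = 0.95 * cyl + np.sqrt(1 - 0.95**2) * rng.normal(size=n)  # built to be
                                                              # highly correlated
r = np.corrcoef(cyl, eng)[0, 1]
vif = 1.0 / (1.0 - r**2)   # variance inflation factor in a two-regressor model
print(round(r, 2))         # close to 0.95
print(round(vif, 1))       # roughly 10: coefficient variances blow up tenfold
```

With r near 0.95, the VIF is around 10, which is consistent with the imprecise CYL and ENG estimates in the first MPG model.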
Mitigating Multicollinearity
1. Obtain more information (data) and include it in the analysis. This is often costly, and it doesn't help if the underlying variables are highly correlated, no matter how much data you have.
2. Drop variables.
   Pro: you can try to proxy for the single effect that matters. In the MPG example, use CYL OR ENG to capture engine size.
   Con: if dropped variables are important (relevant), estimates will be biased (omitted variable bias).
3. Do nothing; fixing may be more costly.
4. Reformulate the model.